English Pages 656 [657] Year 2020
Statistical Inference via Convex Optimization
Princeton Series in Applied Mathematics

Ingrid Daubechies (Duke University); Weinan E (Princeton University); Jan Karel Lenstra (Centrum Wiskunde & Informatica, Amsterdam); Endre Süli (University of Oxford), Series Editors

The Princeton Series in Applied Mathematics features high-quality advanced texts and monographs in all areas of applied mathematics. The series includes books of a theoretical and general nature as well as those that deal with the mathematics of specific applications and real-world scenarios. For a full list of titles in the series, go to https://press.princeton.edu/series/princeton-series-in-applied-mathematics

Statistical Inference via Convex Optimization, Anatoli Juditsky and Arkadi Nemirovski
A Dynamical Systems Theory of Thermodynamics, Wassim M. Haddad
Formal Verification of Control System Software, Pierre-Loic Garoche
Rays, Waves, and Scattering: Topics in Classical Mathematical Physics, John A. Adam
Mathematical Methods in Elasticity Imaging, Habib Ammari, Elie Bretin, Josselin Garnier, Hyeonbae Kang, Hyundae Lee, and Abdul Wahab
Hidden Markov Processes: Theory and Applications to Biology, M. Vidyasagar
Topics in Quaternion Linear Algebra, Leiba Rodman
Mathematical Analysis of Deterministic and Stochastic Problems in Complex Media Electromagnetics, G. F. Roach, I. G. Stratis, and A. N. Yannacopoulos
Stability and Control of Large-Scale Dynamical Systems: A Vector Dissipative Systems Approach, Wassim M. Haddad and Sergey G. Nersesov
Matrix Completions, Moments, and Sums of Hermitian Squares, Mihály Bakonyi and Hugo J. Woerdeman
Modern Anti-windup Synthesis: Control Augmentation for Actuator Saturation, Luca Zaccarian and Andrew R. Teel
Totally Nonnegative Matrices, Shaun M. Fallat and Charles R. Johnson
Graph Theoretic Methods in Multiagent Networks, Mehran Mesbahi and Magnus Egerstedt
Matrices, Moments and Quadrature with Applications, Gene H. Golub and Gérard Meurant
Control Theoretic Splines: Optimal Control, Statistics, and Path Planning, Magnus Egerstedt and Clyde Martin
Robust Optimization, Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski
Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, Francesco Bullo, Jorge Cortés, and Sonia Martinez
Algebraic Curves over a Finite Field, J.W.P. Hirschfeld, G. Korchmáros, and F. Torres
Wave Scattering by Time-Dependent Perturbations: An Introduction, G. F. Roach
Genomic Signal Processing, Ilya Shmulevich and Edward R. Dougherty
The Traveling Salesman Problem: A Computational Study, David L. Applegate, Robert E. Bixby, Vašek Chvátal, and William J. Cook
Positive Definite Matrices, Rajendra Bhatia
Impulsive and Hybrid Dynamical Systems: Stability, Dissipativity, and Control, Wassim M. Haddad, VijaySekhar Chellaboina, and Sergey G. Nersesov
Statistical Inference via Convex Optimization Anatoli Juditsky Arkadi Nemirovski
Princeton University Press Princeton and Oxford
Copyright © 2020 by Princeton University Press
Requests for permission to reproduce material from this work should be sent to [email protected]
Published by Princeton University Press
41 William Street, Princeton, New Jersey 08540
6 Oxford Street, Woodstock, Oxfordshire OX20 1TR
press.princeton.edu
All Rights Reserved
ISBN 978-0-691-19729-6
ISBN (e-book) 978-0-691-20031-6
British Library Cataloging-in-Publication Data is available
Editorial: Susannah Shoemaker and Lauren Bucca
Production Editorial: Nathan Carr
Production: Jacquie Poirier
Publicity: Matthew Taylor and Katie Lewis
Jacket/Cover Credit: Adapted from François de Kresz, “Excusez-moi, excusez-moi...,” 1974
Copyeditor: Bhisham Bherwani
The publisher would like to acknowledge the authors of this volume for providing the camera-ready copy from which this book was printed.
This book has been composed in LaTeX
Printed on acid-free paper.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents

List of Figures
Preface
Acknowledgements
Notational Conventions
About Proofs
On Computational Tractability
1 Sparse Recovery via $\ell_1$ Minimization
  1.1 Compressed Sensing: What is it about?
    1.1.1 Signal Recovery Problem
    1.1.2 Signal Recovery: Parametric and nonparametric cases
    1.1.3 Compressed Sensing via $\ell_1$ minimization: Motivation
  1.2 Validity of sparse signal recovery via $\ell_1$ minimization
    1.2.1 Validity of $\ell_1$ minimization in the noiseless case
    1.2.2 Imperfect $\ell_1$ minimization
    1.2.3 Regular $\ell_1$ recovery
    1.2.4 Penalized $\ell_1$ recovery
    1.2.5 Discussion
  1.3 Verifiability and tractability issues
    1.3.1 Restricted Isometry Property and s-goodness of random matrices
    1.3.2 Verifiable sufficient conditions for $Q_q(s, \kappa)$
    1.3.3 Tractability of $Q_\infty(s, \kappa)$
  1.4 Exercises for Chapter 1
  1.5 Proofs
    1.5.1 Proofs of Theorems 1.3 and 1.4
    1.5.2 Proof of Theorem 1.5
    1.5.3 Proof of Proposition 1.7
    1.5.4 Proof of Propositions 1.8 and 1.12
    1.5.5 Proof of Proposition 1.10
    1.5.6 Proof of Proposition 1.13

2 Hypothesis Testing
  2.1 Preliminaries from Statistics: Hypotheses, Tests, Risks
    2.1.1 Hypothesis Testing Problem
    2.1.2 Tests
    2.1.3 Testing from repeated observations
    2.1.4 Risk of a simple test
    2.1.5 Two-point lower risk bound
  2.2 Hypothesis Testing via Euclidean Separation
    2.2.1 Situation
    2.2.2 Pairwise Hypothesis Testing via Euclidean Separation
    2.2.3 Euclidean Separation, Repeated Observations, and Majority Tests
    2.2.4 From Pairwise to Multiple Hypotheses Testing
  2.3 Detectors and Detector-Based Tests
    2.3.1 Detectors and their risks
    2.3.2 Detector-based tests
  2.4 Simple observation schemes
    2.4.1 Simple observation schemes: Motivation
    2.4.2 Simple observation schemes: The definition
    2.4.3 Simple observation schemes: Examples
    2.4.4 Simple observation schemes: Main result
    2.4.5 Simple observation schemes: Examples of optimal detectors
  2.5 Testing multiple hypotheses
    2.5.1 Testing unions
    2.5.2 Testing multiple hypotheses “up to closeness”
    2.5.3 Illustration: Selecting the best among a family of estimates
  2.6 Sequential Hypothesis Testing
    2.6.1 Motivation: Election polls
    2.6.2 Sequential hypothesis testing
    2.6.3 Concluding remarks
  2.7 Measurement Design in simple observation schemes
    2.7.1 Motivation: Opinion polls revisited
    2.7.2 Measurement Design: Setup
    2.7.3 Formulating the MD problem
  2.8 Affine detectors beyond simple observation schemes
    2.8.1 Situation
    2.8.2 Main result
  2.9 Beyond the scope of affine detectors: lifting the observations
    2.9.1 Motivation
    2.9.2 Quadratic lifting: Gaussian case
    2.9.3 Quadratic lifting: Does it help?
    2.9.4 Quadratic lifting: Sub-Gaussian case
    2.9.5 Generic application: Quadratically constrained hypotheses
  2.10 Exercises for Chapter 2
    2.10.1 Two-point lower risk bound
    2.10.2 Around Euclidean Separation
    2.10.3 Hypothesis testing via $\ell_1$-separation
    2.10.4 Miscellaneous exercises
  2.11 Proofs
    2.11.1 Proof of the observation in Remark 2.8
    2.11.2 Proof of Proposition 2.6 in the case of quasi-stationary K-repeated observations
    2.11.3 Proof of Theorem 2.23
    2.11.4 Proof of Proposition 2.37
    2.11.5 Proof of Proposition 2.43
    2.11.6 Proof of Proposition 2.46

3 From Hypothesis Testing to Estimating Functionals
  3.1 Estimating linear forms on unions of convex sets
    3.1.1 The problem
    3.1.2 The estimate
    3.1.3 Main result
    3.1.4 Near-optimality
    3.1.5 Illustration
  3.2 Estimating N-convex functions on unions of convex sets
    3.2.1 Outline
    3.2.2 Estimating N-convex functions: Problem setting
    3.2.3 Bisection estimate: Construction
    3.2.4 Building Bisection estimate
    3.2.5 Bisection estimate: Main result
    3.2.6 Illustration
    3.2.7 Estimating N-convex functions: An alternative
  3.3 Estimating linear forms beyond simple observation schemes
    3.3.1 Situation and goal
    3.3.2 Construction and main results
    3.3.3 Estimation from repeated observations
    3.3.4 Application: Estimating linear forms of sub-Gaussianity parameters
  3.4 Estimating quadratic forms via quadratic lifting
    3.4.1 Estimating quadratic forms, Gaussian case
    3.4.2 Estimating quadratic form, sub-Gaussian case
  3.5 Exercises for Chapter 3
  3.6 Proofs
    3.6.1 Proof of Proposition 3.3
    3.6.2 Verifying 1-convexity of the conditional quantile
    3.6.3 Proof of Proposition 3.4
    3.6.4 Proof of Proposition 3.14

4 Signal Recovery by Linear Estimation
  Overview
  4.1 Preliminaries: Executive summary on Conic Programming
    4.1.1 Cones
    4.1.2 Conic problems and their duals
    4.1.3 Schur Complement Lemma
  4.2 Near-optimal linear estimation from Gaussian observations
    4.2.1 Situation and goal
    4.2.2 Building a linear estimate
    4.2.3 Byproduct on semidefinite relaxation
  4.3 From ellitopes to spectratopes
    4.3.1 Spectratopes: Definition and examples
    4.3.2 Semidefinite relaxation on spectratopes
    4.3.3 Linear estimates beyond ellitopic signal sets and $\|\cdot\|_2$-risk
  4.4 Linear estimates of stochastic signals
    4.4.1 Minimizing Euclidean risk
    4.4.2 Minimizing $\|\cdot\|$-risk
  4.5 Linear estimation under uncertain-but-bounded noise
    4.5.1 Uncertain-but-bounded noise
    4.5.2 Mixed noise
  4.6 Calculus of ellitopes/spectratopes
  4.7 Exercises for Chapter 4
    4.7.1 Linear estimates vs. Maximum Likelihood
    4.7.2 Measurement Design in Signal Recovery
    4.7.3 Around semidefinite relaxation
    4.7.4 Around Propositions 4.4 and 4.14
    4.7.5 Signal recovery in Discrete and Poisson observation schemes
    4.7.6 Numerical lower-bounding minimax risk
    4.7.7 Around S-Lemma
    4.7.8 Miscellaneous exercises
  4.8 Proofs
    4.8.1 Preliminaries
    4.8.2 Proof of Proposition 4.6
    4.8.3 Proof of Proposition 4.8
    4.8.4 Proof of Lemma 4.17
    4.8.5 Proofs of Propositions 4.5, 4.16 and 4.19
    4.8.6 Proofs of Propositions 4.18 and 4.19, and justification of Remark 4.20

5 Signal Recovery Beyond Linear Estimates
  Overview
  5.1 Polyhedral estimation
    5.1.1 Motivation
    5.1.2 Generic polyhedral estimate
    5.1.3 Specifying sets $H_\delta$ for basic observation schemes
    5.1.4 Efficient upper-bounding of R[H] and contrast design, I.
    5.1.5 Efficient upper-bounding of R[H] and contrast design, II.
    5.1.6 Assembling estimates: Contrast aggregation
    5.1.7 Numerical illustration
    5.1.8 Calculus of compatibility
  5.2 Recovering signals from nonlinear observations by Stochastic Optimization
    5.2.1 Problem setting
    5.2.2 Assumptions
    5.2.3 Estimating via Sample Average Approximation
    5.2.4 Stochastic Approximation estimate
    5.2.5 Numerical illustration
    5.2.6 “Single-observation” case
  5.3 Exercises for Chapter 5
    5.3.1 Estimation by Stochastic Optimization
  5.4 Proofs
    5.4.1 Proof of (5.8)
    5.4.2 Proof of Lemma 5.6
    5.4.3 Verification of (5.44)
    5.4.4 Proof of Proposition 5.10

Solutions to Selected Exercises
  6.1 Solutions for Chapter 1
  6.2 Solutions for Chapter 2
    6.2.1 Two-point lower risk bound
    6.2.2 Around Euclidean Separation
    6.2.3 Hypothesis testing via $\ell_1$ separation
    6.2.4 Miscellaneous exercises
  6.3 Solutions for Chapter 3
  6.4 Solutions for Chapter 4
    6.4.1 Linear Estimates vs. Maximum Likelihood
    6.4.2 Measurement Design in Signal Recovery
    6.4.3 Around semidefinite relaxation
    6.4.4 Around Propositions 4.4 and 4.14
    6.4.5 Numerical lower-bounding minimax risk
    6.4.6 Around S-Lemma
    6.4.7 Miscellaneous exercises
  6.5 Solutions for Chapter 5
    6.5.1 Estimation by Stochastic Optimization

Appendix: Executive Summary on Efficient Solvability of Convex Optimization Problems
Bibliography
Index

List of Figures

1.1 Top: true 256 × 256 image; bottom: sparse in the wavelet basis approximations of the image. Wavelet basis is orthonormal, and a natural way to quantify near-sparsity of a signal is to look at the fraction of total energy (sum of squares of wavelet coefficients) stored in the leading coefficients; these are the “energy data” presented in the figure.
1.2 Single-pixel camera.
1.3 Regular and penalized $\ell_1$ recovery of nearly s-sparse signals. o: true signals, +: recoveries (to make the plots readable, one per eight consecutive vector's entries is shown). Problem sizes are $m = 256$ and $n = 2m = 512$, noise level is $\sigma = 0.01$, deviation from s-sparsity is $\|x - x^s\|_1 = 1$, contrast pair is $(H = \sqrt{n/m}\,A, \|\cdot\|_\infty)$. In penalized recovery, $\lambda = 2s$; parameter $\rho$ of regular recovery is set to $\sigma \cdot \mathrm{ErfcInv}(0.005/n)$.
1.4 Erroneous $\ell_1$ recovery of a 25-sparse signal, no observation noise. Top: frequency domain, o: true signal, +: recovery. Bottom: time domain.
2.1 “Gaussian Separation” (Example 2.5): Optimal test deciding on whether the mean of Gaussian r.v. belongs to the domain A ($H_1$) or to the domain B ($H_2$). Hyperplane o-o separates the acceptance domains for $H_1$ (“left” half-space) and for $H_2$ (“right” half-space).
2.2 Drawing for Proposition 2.4.
2.3 Positron Emission Tomography (PET)
2.4 Nine hypotheses on the location of the mean $\mu$ of observation $\omega \sim \mathcal{N}(\mu, I_2)$, each stating that $\mu$ belongs to a specific polygon.
2.5 Signal (top, solid) and candidate estimates (top, dotted). Bottom: the primitive of the signal.
2.6 3-candidate hypotheses in probabilistic simplex $\Delta_3$
2.7 PET scanner
2.8 Frames from a “movie”
3.1 Boxplot of empirical distributions, over 20 random estimation problems, of the upper 0.01-risk bounds $\max_{1 \le i,j \le 100} \rho_{ij}$ (as in (3.15)) for different observation sample sizes K.
3.2 Bisection via Hypothesis Testing.
3.3 A circuit (nine nodes and 16 arcs). a: arc of interest; b: arcs with measured currents; c: input node where external current and voltage are measured.
3.4 Histograms of recovery errors in experiments, 1,000 simulations per experiment.
4.1 True distribution of temperature $U_* = B(x)$ at time $t_0 = 0.01$ (left) along with its recovery $\widehat{U}$ via the optimal linear estimate (center) and the “naive” recovery $\widetilde{U}$ (right).
5.1 Recovery errors for the near-optimal linear estimate (circles) and for polyhedral estimates yielded by Proposition 5.8 (PolyI, pentagrams) and by the construction from Section 5.1.4 (PolyII, triangles), 20 simulations per each value of $\sigma$.
5.2 Left: functions h; right: moduli of strong monotonicity of the operators $F(\cdot)$ in $\{z : \|z\|_2 \le R\}$ as functions of R. Dashed lines: case A (logistic sigmoid); solid lines: case B (linear regression); dash-dotted lines: case C (hinge function); dotted line: case D (ramp sigmoid).
5.3 Mean errors and CPU times for SA (solid lines) and SAA estimates (dashed lines) as functions of the number of observations K. o: case A (logistic link), x: case B (linear link), +: case C (hinge function), □: case D (ramp sigmoid).
5.4 Solid curve: $M_{\omega^K}(z)$, dashed curve: $H_{\omega^K}(z)$. True signal x (solid vertical line): +0.081; SAA estimate (unique minimizer of $H_{\omega^K}$, dashed vertical line): −0.252; ML estimate (global minimizer of $M_{\omega^K}$ on [−20, 20]): −20.00; the local minimizer of $M_{\omega^K}$ closest to x (dotted vertical line): −0.363.
5.5 Mean errors and CPU times for standard deviation $\lambda = 1$ (solid line) and $\lambda = 0.1$ (dashed line).
6.1 Recovery of a 1200×1600 image at different noise levels, ill-posed case $\Delta = 0$.
6.2 Recovery of a 1200×1600 image at different noise levels, well-posed case $\Delta = 0.25$.
6.3 Smooth curve: f; “bumpy” curve: recovery; gray cloud: observations. In all experiments, $n = 8192$, $\kappa = 2$, $p_0 = p_1 = p_2 = 2$, $\sigma = 0.5$, $L_\iota = (10\pi)^\iota$, $0 \le \iota \le 2$.

PREFACE

When speaking about links between Statistics and Optimization, what comes to mind first is the indispensable role played by optimization algorithms in the “computational toolbox” of Statistics (think about the numerical implementation of the fundamental Maximum Likelihood method). However, on second thought, we should conclude that no matter how significant this role could be, the fact that it comes to mind first primarily reflects the weaknesses of Optimization rather than its strengths; were the optimization algorithms used in Statistics as efficient and as reliable as, say, Linear Algebra techniques, nobody would think about special links between Statistics and Optimization, just as nobody usually thinks about special links between Statistics and Linear Algebra. When computational, rather than methodological, issues are concerned, we start to think about links of Statistics with Optimization, Linear Algebra, Numerical Analysis, etc. only when the computational tools offered to us by these disciplines do not work well and need the attention of experts in these disciplines.

The goal of this book is to present other types of links between Optimization and Statistics, those which have little in common with algorithms and number-crunching. What we are speaking about are the situations where Optimization theory (theory, not algorithms!) seems to be of methodological value in Statistics, acting as the source of statistical inferences with provably optimal, or nearly so, performance. In this context, we focus on utilizing Convex Programming theory, mainly due to its power, but also due to the desire to end up with inference routines reducing to solving convex optimization problems and thus implementable in a computationally efficient fashion.
Therefore, while we do not mention computational issues explicitly, we do remember that at the end of the day we need a number, and in this respect, intrinsically computationally friendly convex optimization models are the first choice.

The three topics we intend to consider are:

A. Sparsity-oriented Compressive Sensing. Here the role of Convex Optimization theory as a creative tool motivating the construction of inference procedures is relatively less important than in the two other topics. This being said, its role is far from negligible in the analysis of Compressive Sensing routines (it allows one, e.g., to derive from “first principles” the necessary and sufficient conditions for the validity of $\ell_1$ recovery). On account of this, and also due to its popularity and the fact that it is now one of the major “customers” of advanced convex optimization algorithms, we believe that Compressive Sensing is worthy of being considered.

B. Pairwise and Multiple Hypothesis Testing, including sequential tests, estimation of linear functionals, and some rudimentary design of experiments.

C. Recovery of signals from noisy observations of their linear images.

B and C are the topics where, as of now, the approaches we present in this book appear to be the most successful.

The exposition does not require prior knowledge of Statistics and Optimization; as far as these disciplines are concerned, all necessary facts and concepts are incorporated into the text. The actual prerequisites are basic Calculus, Probability, and Linear Algebra.

Selection and treatment of our topics are inspired by a kind of “philosophy” which can be explained to an expert as follows. Compare two well-known results of nonparametric statistics (“$\langle ... \rangle$” marks fragments irrelevant to the discussion to follow):

Theorem A [I. Ibragimov & R. Khasminskii [124], 1979] Given $\alpha$, $L$, $k$, let $\mathcal{X}$ be the set of all functions $f : [0, 1] \to \mathbf{R}$ with $(\alpha, L)$-Hölder continuous $k$-th derivative. For a given $t$, the minimax risk of estimating $f(t)$, $f \in \mathcal{X}$, from noisy observations $y = f|_{\Gamma_n} + \xi$, $\xi \sim \mathcal{N}(0, I_n)$, taken along the $n$-point equidistant grid $\Gamma_n$, up to a factor $C(\beta) = \langle ... \rangle$, $\beta := k + \alpha$, is $(L n^{-\beta})^{1/(2\beta+1)}$, and the upper risk bound is attained at the affine in $y$ estimate explicitly given by $\langle ... \rangle$.
Theorem B [D. Donoho [64], 1994] Let $\mathcal{X} \subset \mathbf{R}^N$ be a convex compact set, $A$ be an $n \times N$ matrix, and $g(\cdot)$ be a linear form on $\mathcal{X}$. The minimax, over $f \in \mathcal{X}$, risk of recovering $g(f)$ from the noisy observations $y = Af + \xi$, $\xi \sim \mathcal{N}(0, I_n)$, within factor 1.2 is attained at an affine in $y$ estimate which, along with its risk, can be built efficiently by solving convex optimization problem $\langle ... \rangle$.

In many respects, Theorems A and B are similar: both are theorems on minimax optimal estimation of a given linear form of an unknown “signal” $f$ known to belong to a given convex set $\mathcal{X}$ from observations, corrupted by Gaussian noise, of the image of $f$ under linear mapping,¹ and both are associated with efficiently computable near-optimal (in a minimax sense) estimators which happen to be affine in observations. There is, however, a significant structural difference: A gives an explicit “closed form” analytic description of the minimax risk as a function of $n$ and the smoothness parameters of $f$, along with an explicit description of the near-optimal estimator. Numerous results of this type (let us call them descriptive) form the backbone of the deep and rich theory of Nonparametric Statistics. This being said, the strong “explanation power” of descriptive results has its price: we need to impose assumptions, sometimes quite restrictive, on the entities involved.
For example, Theorem A says nothing about what happens with the minimax risk/estimate when, in addition to smoothness, other a priori information on $f$, like monotonicity or convexity, is available, and/or when “direct” observations of $f|_{\Gamma_n}$ are replaced with observations of a linear image of $f$ (say, convolution of $f$ with a given kernel; more often than not, this is what happens in applications), and descriptive answers to the questions just posed require a dedicated (and sometimes quite problematic) investigation more or less “from scratch.” In contrast, the explanation power of Theorem B is basically nonexistent: the statement presents “closed form” expressions neither for the near-optimal estimate, nor for its worst-case risk. In compensation, B makes only (relatively) mild general structural assumptions about the model (convexity and compactness of $\mathcal{X}$, linear dependence of $y$ on $f$), and all the rest (the near-optimal estimate and its risk) can be found by efficient computation. Moreover, we know in advance that the risk, whatever it happens to be, is within 20% of the actual minimax risk achievable under the circumstances. In this respect, B is an operational, rather than a descriptive, result: it explains how to act to achieve the (nearly) best possible performance, with no a priori prediction of what this performance will be. This is hardly a “big issue” in applications: with huge computational power readily available, efficient computability is, basically, as good as a “simple explicit formula.” We strongly believe that as far as applications of high-dimensional statistics are concerned, operational results, possessing much broader scope than their descriptive counterparts, are of significant importance and potential. Our main motivation when writing this book was to contribute to the body of operational results in Statistics, and this is what Chapters 2–5 to follow are about.

Anatoli Juditsky & Arkadi Nemirovski
March 6, 2019

¹ Infinite dimensionality of $\mathcal{X}$ in A is of no importance: nothing changes when replacing the original $\mathcal{X}$ with its $n$-dimensional image under the mapping $f \mapsto f|_{\Gamma_n}$.

ACKNOWLEDGEMENTS

We are greatly indebted to H. Edwin Romeijn who initiated creating the Ph.D. course “Topics in Data Science.” The Lecture Notes for this course form the seed of the book to follow. We gratefully acknowledge support from NSF Grant CCF-1523768 Statistical Inference via Convex Optimization; this research project is the source of basically all novel results presented in Chapters 2–5. Our deepest gratitude goes to Lucien Birgé, who encouraged us to write this monograph, and to Stephen Boyd, who many years ago taught one of the authors “operational philosophy,” motivating the research we are presenting.

Our separate thanks to those who decades ago guided our first steps along the road which led to this book: Rafail Khasminskii, Yakov Tsypkin, and Boris Polyak. We are deeply indebted to our colleagues Alekh Agarwal, Aharon Ben-Tal, Fabienne Comte, Arnak Dalalyan, David Donoho, Céline Duval, Valentine Genon-Catalot, Alexander Goldenshluger, Yuri Golubev, Zaid Harchaoui, Gérard Kerkyacharian, Vladimir Koltchinskii, Oleg Lepski, Pascal Massart, Eric Moulines, Axel Munk, Aleksander Nazin, Yuri Nesterov, Dominique Picard, Alexander Rakhlin, Philippe Rigollet, Alex Shapiro, Vladimir Spokoiny, Alexandre Tsybakov, and Frank Werner for their advice and remarks.

We would like to thank Elitsa Marielle, Andrey Kulunchakov and Hlib Tsyntseus for their assistance when preparing the manuscript. It was our pleasure to collaborate with Princeton University Press on this project. We highly appreciate the valuable comments of the anonymous referees, which helped to improve the initial text. We are greatly impressed by the professionalism of Princeton University Press editors, in particular Lauren Bucca, Nathan Carr, and Susannah Shoemaker, and also by their care and patience.

Needless to say, responsibility for all drawbacks of the book is ours.

A. J. & A. N.
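
NOTATIONAL CONVENTIONS

Vectors and matrices. By default, all vectors are column ones; to write them down, we use “Matlab notation”: $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ is written as $[1; 2; 3]$. More generally, for vectors/matrices $A, B, C, ..., Z$ of the same “width,” $[A; B; C; ...; Z]$ is the matrix obtained by vertical concatenation of $A, B, C$, etc.; for vectors/matrices $A, B, C, ..., Z$ of the same “height,” $[A, B, C, ..., Z]$ is the matrix obtained by horizontal concatenation.

Examples: for what in the “normal” notation is written down as
\[
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \end{bmatrix}, \quad C = \begin{bmatrix} 7 \\ 8 \end{bmatrix},
\]
we have
\[
[A; B] = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} = [1, 2; 3, 4; 5, 6], \quad [A, C] = \begin{bmatrix} 1 & 2 & 7 \\ 3 & 4 & 8 \end{bmatrix} = [1, 2, 7; 3, 4, 8].
\]

Blanks in matrices replace (blocks of) zero entries. For example,
\[
\begin{bmatrix} 1 & & \\ 2 & & \\ 3 & 4 & 5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 0 & 0 \\ 3 & 4 & 5 \end{bmatrix}.
\]

$\mathrm{Diag}\{A_1, A_2, ..., A_k\}$ stands for a block-diagonal matrix with diagonal blocks $A_1, A_2, ..., A_k$. For example,
\[
\mathrm{Diag}\{1, 2, 3\} = \begin{bmatrix} 1 & & \\ & 2 & \\ & & 3 \end{bmatrix}, \quad \mathrm{Diag}\{[1, 2], [3; 4]\} = \begin{bmatrix} 1 & 2 & \\ & & 3 \\ & & 4 \end{bmatrix}.
\]

For an $m \times n$ matrix $A$, $\mathrm{dg}(A)$ is the diagonal of $A$, a vector of dimension $\min[m, n]$ with entries $A_{ii}$, $1 \le i \le \min[m, n]$.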
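
The concatenation, Diag, and dg conventions above can be checked on the text's own examples. What follows is a minimal plain-Python sketch, not part of the book: matrices are represented as lists of rows, and the helper names `vcat`, `hcat`, `Diag`, and `dg` are chosen here purely for illustration.

```python
# Illustrative helpers (names are assumptions, not from the book).
# A matrix is a list of rows; a scalar passed to Diag is treated as a 1x1 block.

def vcat(*blocks):
    """[A; B; ...]: stack blocks of the same width on top of each other."""
    width = len(blocks[0][0])
    assert all(len(row) == width for blk in blocks for row in blk), "widths differ"
    return [list(row) for blk in blocks for row in blk]

def hcat(*blocks):
    """[A, B, ...]: place blocks of the same height side by side."""
    height = len(blocks[0])
    assert all(len(blk) == height for blk in blocks), "heights differ"
    return [sum((list(blk[i]) for blk in blocks), []) for i in range(height)]

def Diag(*blocks):
    """Diag{A1, ..., Ak}: block-diagonal matrix; blanks become zeros."""
    blocks = [blk if isinstance(blk, list) else [[blk]] for blk in blocks]
    total_cols = sum(len(blk[0]) for blk in blocks)
    out, offset = [], 0
    for blk in blocks:
        for row in blk:
            new_row = [0] * total_cols
            new_row[offset:offset + len(row)] = row
            out.append(new_row)
        offset += len(blk[0])
    return out

def dg(M):
    """dg(M): the diagonal of an m x n matrix, a vector of length min(m, n)."""
    return [M[i][i] for i in range(min(len(M), len(M[0])))]

A = [[1, 2], [3, 4]]
B = [[5, 6]]
C = [[7], [8]]

AB = vcat(A, B)               # [A; B]
AC = hcat(A, C)               # [A, C]
D1 = Diag(1, 2, 3)            # Diag{1, 2, 3}
D2 = Diag([[1, 2]], [[3], [4]])  # Diag{[1, 2], [3; 4]}
```

With these definitions, `AB` and `AC` reproduce the two concatenation examples, and `D2` is the 3 × 3 matrix whose blanks are filled with zeros.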
Standard linear spaces in our book are $\mathbf{R}^n$ (the space of n-dimensional column vectors), $\mathbf{R}^{m\times n}$ (the space of m × n real matrices), and $\mathbf{S}^n$ (the space of n × n real symmetric matrices). All these linear spaces are equipped with the standard inner product:
$$\langle A, B\rangle = \sum_{i,j} A_{ij}B_{ij} = \mathrm{Tr}(AB^T) = \mathrm{Tr}(BA^T) = \mathrm{Tr}(A^TB) = \mathrm{Tr}(B^TA);$$
in the case when A = a and B = b are column vectors, this simplifies to $\langle a, b\rangle = a^Tb = b^Ta$, and when A, B are symmetric, there is no need to write $B^T$ in $\mathrm{Tr}(AB^T)$.
Usually, we denote vectors by lowercase, and matrices by uppercase letters; sometimes, however, lowercase letters are used also for matrices.
Given a linear mapping $\mathcal{A}(x): E_x \to E_y$, where $E_x$, $E_y$ are standard linear spaces, one can define the conjugate mapping $\mathcal{A}^*(y): E_y \to E_x$ via the identity
$$\langle \mathcal{A}(x), y\rangle = \langle x, \mathcal{A}^*(y)\rangle \quad \forall (x \in E_x, y \in E_y).$$
One always has $(\mathcal{A}^*)^* = \mathcal{A}$. When $E_x = \mathbf{R}^n$, $E_y = \mathbf{R}^m$ and $\mathcal{A}(x) = Ax$, one has $\mathcal{A}^*(y) = A^Ty$; when $E_x = \mathbf{R}^n$, $E_y = \mathbf{S}^m$, so that $\mathcal{A}(x) = \sum_{i=1}^n x_iA_i$, $A_i \in \mathbf{S}^m$, we have
$$\mathcal{A}^*(Y) = [\mathrm{Tr}(A_1Y); ...; \mathrm{Tr}(A_nY)].$$
$\mathbf{Z}^n$ is the set of n-dimensional integer vectors.
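The defining identity of the conjugate mapping is easy to check numerically. The sketch below does so for the symmetric-matrix case, with randomly generated data; the names `A_map`, `A_adj`, and `random_symmetric` are ours and serve illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 4

def random_symmetric(k):
    """A random k x k real symmetric matrix."""
    M = rng.standard_normal((k, k))
    return (M + M.T) / 2

A_list = [random_symmetric(m) for _ in range(n)]   # A_1, ..., A_n in S^m

def A_map(x):
    """The mapping x -> sum_i x_i A_i from R^n to S^m."""
    return sum(xi * Ai for xi, Ai in zip(x, A_list))

def A_adj(Y):
    """The conjugate mapping Y -> [Tr(A_1 Y); ...; Tr(A_n Y)]."""
    return np.array([np.trace(Ai @ Y) for Ai in A_list])

x = rng.standard_normal(n)
Y = random_symmetric(m)
lhs = np.sum(A_map(x) * Y)   # <A(x), Y> = sum_{ij} [A(x)]_ij Y_ij
rhs = x @ A_adj(Y)           # <x, A*(Y)>
print(lhs, rhs)              # the two numbers agree up to rounding
```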
Norms. For $1 \le p \le \infty$ and for a vector $x = [x_1; ...; x_n] \in \mathbf{R}^n$, $\|x\|_p$ is the standard p-norm of x:
$$\|x\|_p = \begin{cases} \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}, & 1 \le p < \infty,\\ \max_i |x_i| = \lim_{p'\to\infty}\|x\|_{p'}, & p = \infty.\end{cases}$$
The spectral norm (the largest singular value) of a matrix A is denoted by $\|A\|_{2,2}$; notation for other norms of matrices is specified when used.
Standard cones. $\mathbf{R}_+$ is the nonnegative ray on the real axis; $\mathbf{R}^n_+$ stands for the n-dimensional nonnegative orthant, the cone comprised of all entrywise nonnegative vectors from $\mathbf{R}^n$; $\mathbf{S}^n_+$ stands for the positive semidefinite cone in $\mathbf{S}^n$, the cone comprised of all positive semidefinite matrices from $\mathbf{S}^n$.
Miscellaneous.
• For matrices A, B, relation $A \preceq B$, or, equivalently, $B \succeq A$, means that A, B are symmetric matrices of the same size such that B − A is positive semidefinite; we write $A \succeq 0$ to express the fact that A is a symmetric positive semidefinite matrix. Strict version $A \succ B$ (⇔ $B \prec A$) of $A \succeq B$ means that A − B is positive definite (and, as above, A and B are symmetric matrices of the same size).
• Linear Matrix Inequality (LMI, a.k.a. semidefinite constraint) in variables x is the constraint on x stating that a symmetric matrix affinely depending on x is positive semidefinite. When $x \in \mathbf{R}^n$, LMI reads
$$A_0 + \sum_i x_iA_i \succeq 0 \qquad [A_i \in \mathbf{S}^m,\ 0 \le i \le n].$$
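As a quick numerical companion to these conventions, the sketch below evaluates p-norms, the spectral norm, and the relation $A \preceq B$ for small hand-picked data. The function name `loewner_leq` and the tolerance are our assumptions, not notation from the book.

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])
norms = {p: np.linalg.norm(x, p) for p in (1, 2, 8, 64)}
norm_inf = np.linalg.norm(x, np.inf)   # max_i |x_i|
print(norms, norm_inf)                 # the p-norms decrease toward norm_inf

A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 1.0], [1.0, 2.0]])
spectral = np.linalg.norm(A, 2)        # ||A||_{2,2}: largest singular value

def loewner_leq(A, B, tol=1e-12):
    """True if B - A is positive semidefinite (A and B symmetric, same size)."""
    return np.linalg.eigvalsh(B - A).min() >= -tol

print(spectral, loewner_leq(A, B), loewner_leq(B, A))
```

Here B − A is the all-ones 2 × 2 matrix, which is positive semidefinite but not positive definite, so $A \preceq B$ holds while $A \prec B$ does not.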
• $\mathcal{N}(\mu, \Theta)$ stands for the Gaussian distribution with mean µ and covariance matrix Θ. Poisson(µ) denotes the Poisson distribution with parameter $\mu \in \mathbf{R}_+$, i.e., the distribution of a random variable taking values i = 0, 1, 2, ... with probabilities $\frac{\mu^i}{i!}e^{-\mu}$. Uniform([a, b]) is the uniform distribution on segment [a, b].
• For a probability distribution P ,
  • ξ ∼ P means that ξ is a random variable with distribution P . Sometimes we express the same fact by writing ξ ∼ p(·), where p is the density of P taken w.r.t. some reference measure (the latter always is fixed by the context);
  • $\mathbf{E}_{\xi\sim P}\{f(\xi)\}$ is the expectation of f (ξ), ξ ∼ P ; when P is clear from the context, this notation can be shortened to $\mathbf{E}_\xi\{f(\xi)\}$, or $\mathbf{E}_P\{f(\xi)\}$, or even $\mathbf{E}\{f(\xi)\}$. Similarly, $\mathrm{Prob}_{\xi\sim P}\{...\}$, $\mathrm{Prob}_\xi\{...\}$, $\mathrm{Prob}_P\{...\}$, and $\mathrm{Prob}\{...\}$ denote the P -probability of the event specified inside the braces.
• O(1)’s stand for positive absolute constants, that is, positive reals with numerical values (completely independent of the parameters of the situation at hand) which we do not want or are too lazy to write down explicitly, as in $\sin(x) \le O(1)|x|$.
• $\int_\Omega f(\xi)\Pi(d\xi)$ stands for the integral, taken w.r.t. measure Π over domain Ω, of function f .
ABOUT PROOFS The book is basically self-contained in terms of proofs of the statements to follow. Simple proofs usually are placed immediately after the corresponding statements; more technical proofs are transferred to dedicated sections titled “Proof of ...” at the end of each chapter, and this is where a reader should look for “missing” proofs.
ON COMPUTATIONAL TRACTABILITY In the main body of the book, one can frequently meet sentences like “Φ(·) is an efficiently computable convex function,” or “X is a computationally tractable convex set,” or “(P ) is an explicit, and therefore efficiently solvable, convex optimization problem.” For an “executive summary” on what these words actually mean, we refer the reader to the Appendix.
Statistical Inference via Convex Optimization
Chapter One
Sparse Recovery via ℓ1 Minimization
In this chapter, we overview basic results of Compressed Sensing, a relatively new and rapidly developing area in Statistics and Signal Processing dealing with recovering signals (vectors x from some $\mathbf{R}^n$) from their noisy observations Ax + η (A is a given m × n sensing matrix, η is observation noise) in the case when the number of observations m is much smaller than the signal’s dimension n, but is substantially larger than the “true” dimension of the signal, that is, the number of its nonzero entries. This setup leads to a deep, elegant and highly innovative theory and possesses quite significant application potential. It should be added that along with the plain sparsity (small number of nonzero entries), Compressed Sensing deals with other types of “low-dimensional structure” hidden in high-dimensional signals, most notably, with the case of low rank matrix recovery (when the signal is a matrix, and sparse signals are matrices with low ranks) and the case of block sparsity, where the signal is a block vector, and sparsity means that only a small number of blocks are nonzero. In our presentation, we do not consider these extensions, and restrict ourselves to the simplest sparsity paradigm.
1.1 COMPRESSED SENSING: WHAT IS IT ABOUT?
1.1.1 Signal Recovery Problem
One of the basic problems in Signal Processing is the problem of recovering a signal $x \in \mathbf{R}^n$ from noisy observations
$$y = Ax + \eta \qquad (1.1)$$
of a linear image of the signal under a given sensing mapping $x \mapsto Ax: \mathbf{R}^n \to \mathbf{R}^m$; in (1.1), η is the observation error. Matrix A in (1.1) is called the sensing matrix.
Recovery problems of the outlined type arise in many applications, including, but by no means limited to,
• communications, where x is the signal sent by the transmitter, y is the signal recorded by the receiver, and A represents the communication channel (reflecting, e.g., dependencies of decays in the signals’ amplitude on the transmitter-receiver distances); η here typically is modeled as the standard (zero mean, unit covariance matrix) m-dimensional Gaussian noise;¹
• image reconstruction, where the signal x is an image (a 2D array in the usual photography, or a 3D array in tomography) and y is data acquired by the imaging device. Here η in many cases (although not always) can again be modeled as the standard Gaussian noise;
• linear regression, arising in a wide range of applications. In linear regression, one is given m pairs, each consisting of an “input $a_i \in \mathbf{R}^n$” to a “black box” and the corresponding output $y_i \in \mathbf{R}$. Sometimes we have reason to believe that the output is a noise-corrupted version of the “existing in nature,” but unobservable, “ideal output” $y_i^* = x^Ta_i$, which is just a linear function of the input (this is called the “linear regression model,” with inputs $a_i$ called “regressors”). Our goal is to convert actual observations $(a_i, y_i)$, 1 ≤ i ≤ m, into estimates of the unknown “true” vector of parameters x. Denoting by A the matrix with the rows $[a_i]^T$ and assembling individual observations $y_i$ into a single observation $y = [y_1; ...; y_m] \in \mathbf{R}^m$, we arrive at the problem of recovering vector x from noisy observations of Ax. Here again the most popular model for η is the standard Gaussian noise.

¹ While the “physical” noise indeed is often Gaussian with zero mean, its covariance matrix is not necessarily the unit matrix. Note, however, that a zero mean Gaussian noise η always can be represented as Qξ with standard Gaussian ξ. Assuming that Q is known and is nonsingular (which indeed is so when the covariance matrix of η is positive definite), we can rewrite (1.1) equivalently as $Q^{-1}y = [Q^{-1}A]x + \xi$ and treat $Q^{-1}y$ and $Q^{-1}A$ as our new observation and new sensing matrix; the new observation noise ξ is indeed standard. Thus, in the case of Gaussian zero mean observation noise, to assume the noise standard Gaussian is the same as to assume that its covariance matrix is known.

1.1.2 Signal Recovery: Parametric and nonparametric cases
Recovering signal x from observation y would be easy if there were no observation noise (η = 0) and the rank of matrix A were equal to the dimension n of the signals. In this case, which arises only when m ≥ n (“more observations than unknown parameters”), and is typical in this range of m and n, the desired x would be the unique solution to the system of linear equations Ax = y, and to find x would be a simple problem of Linear Algebra. Aside from this trivial “enough observations, no noise” case, people over the years have looked at the following two versions of the recovery problem:
Parametric case: m ≫ n, η is nontrivial noise with zero mean, say, standard Gaussian. This is the classical statistical setup with the emphasis on how to use numerous available observations in order to suppress in the recovery, to the extent possible, the influence of observation noise.
Nonparametric case: m ≪ n.² If addressed literally, this case seems to be senseless: when the number of observations is less than the number of unknown parameters, even in the noiseless case we arrive at the necessity to solve an underdetermined (fewer equations than unknowns) system of linear equations. Linear Algebra says that if solvable, the system has infinitely many solutions. Moreover, the solution set (an affine subspace of positive dimension) is unbounded, meaning that the solutions are in no sense close to each other.

² Of course, this is a blatant simplification: the nonparametric case covers also a variety of important and by far nontrivial situations in which m is comparable to n or larger than n (or even ≫ n). However, this simplification is very convenient, and we will use it in this introduction.

A typical way to make the case of m ≪ n meaningful is to add to the observations (1.1) some a priori information about the signal. In traditional Nonparametric Statistics, this additional information is summarized in a bounded convex set $X \subset \mathbf{R}^n$, given to us in advance, known to contain the true signal x. This set usually is such that every signal x ∈ X can be approximated by a linear combination of s = 1, 2, ..., n vectors
from a properly selected basis known to us in advance (“dictionary” in the slang of signal processing) within accuracy δ(s), where δ(s) is a function, known in advance, approaching 0 as s → ∞. In this situation, with appropriate A (e.g., just the unit matrix, as in the denoising problem), we can select some s ≪ m and try to recover x as if it were a vector from the linear span $E_s$ of the first s vectors of the outlined basis [54, 86, 124, 112, 208]. In the “ideal case,” $x \in E_s$, recovering x in fact reduces to the case where the dimension of the signal is s ≪ m rather than n ≫ m, and we arrive at the well-studied situation of recovering a signal of low (compared to the number of observations) dimension. In the “realistic case” of x δ(s)-close to $E_s$, deviation of x from $E_s$ results in an additional component in the recovery error (“bias”); a typical result of traditional Nonparametric Statistics quantifies the resulting error and minimizes it in s [86, 124, 178, 222, 223, 230, 239].
Of course, this outline of the traditional approach to “nonparametric” (with n ≫ m) recovery problems is extremely sketchy, but it captures the most important fact in our context: with the traditional approach to nonparametric signal recovery, one assumes that after representing the signals by vectors of their coefficients in a properly selected basis, the n-dimensional signal to be recovered can be well approximated by an s-sparse (at most s nonzero entries) signal, with s ≪ n, and this sparse approximation can be obtained by zeroing out all but the first s entries in the signal vector. The assumption just formulated indeed takes place for signals obtained by discretization of smooth uni- and multivariate functions, and this class of signals for several decades was the main, if not the only, focus of Nonparametric Statistics.
Compressed Sensing. The situation changed dramatically around the year 2000 as a consequence of important theoretical breakthroughs due to D. Donoho, T. Tao, J. Romberg, E. Candès, and J.-J. Fuchs, among many other researchers [49, 44, 45, 46, 48, 67, 68, 69, 70, 93, 94]; as a result of these breakthroughs, a novel and rich area of research, called Compressed Sensing, emerged. In the Compressed Sensing (CS) setup of the Signal Recovery problem, as in the traditional Nonparametric Statistics approach to the m ≪ n case, it is assumed that after passing to an appropriate basis, the signal to be recovered is s-sparse (has ≤ s nonzero entries, with s ≪ m), or is well approximated by an s-sparse signal. The difference with the traditional approach is that now we assume nothing about the location of the nonzero entries. Thus, the a priori information about the signal x both in the traditional and in the CS settings is summarized in a set X known to contain the signal x we want to recover. The difference is that in the traditional setting, X is a bounded convex and “nice” (well approximated by its low-dimensional cross-sections) set, while in CS this set is, computationally speaking, a “monster”: already in the simplest case of recovering exactly s-sparse signals, X is the union of all s-dimensional coordinate planes, which is a heavily combinatorial entity.
Note that in many applications we indeed can assume that the true vector of parameters x is sparse. Consider, e.g., the following story about signal detection. There are n locations where signal transmitters could be placed, and m locations with the receivers. The contribution of a signal of unit magnitude originating in location j to the signal measured by receiver i is a known quantity $A_{ij}$, and signals originating in different locations merely sum up in the receivers. Thus, if x is the n-dimensional vector with entries $x_j$ representing the magnitudes of signals transmitted in locations j = 1, 2, ..., n, then the m-dimensional vector y of measurements of the m receivers is y = Ax + η, where η is the observation noise. Given y, we intend to recover x. Now, if the receivers are, say, hydrophones registering noises emitted by submarines in a certain part of the Atlantic, tentative positions of “submarines” being discretized with resolution 500 m, the dimension of the vector x (the number of points in the discretization grid) may be in the range of tens of thousands, if not tens of millions. At the same time, presumably, there are only a handful of “submarines” (i.e., nonzero entries in x) in the area.
To “see” sparsity in everyday life, look at the 256 × 256 image at the top of Figure 1.1. The image can be thought of as a $256^2 = 65{,}536$-dimensional vector comprised of the pixels’ intensities in gray scale, and there is not much sparsity in this vector. However, when representing the image in the wavelet basis, whatever it means, we get a “nearly sparse” vector of wavelet coefficients (this is true for typical “nonpathological” images). At the bottom of Figure 1.1 we see what happens when we zero out all but a small percentage of the wavelet coefficients largest in magnitude and replace the true image by its sparse (in the wavelet basis) approximations.
This simple visual illustration along with numerous similar examples shows the “everyday presence” of sparsity and the possibility to utilize it when compressing signals. The difficulty, however, is that simple compression (compute the coefficients of the signal in an appropriate basis and then keep, say, 10% of the largest in magnitude coefficients) requires us to start with digitalizing the signal, that is, representing it as an array of all its coefficients in some orthonormal basis. These coefficients are inner products of the signal with vectors of the basis; for a “physical” signal, like speech or image, these inner products are computed by analog devices, with subsequent discretization of the results.
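The “keep the largest coefficients” compression just described can be tried on a toy signal. The sketch below is an illustration under our own assumptions, not the experiment behind Figure 1.1: it uses a hypothetical one-dimensional piecewise-constant signal and a hand-written orthonormal Haar transform (the names `haar_transform` and `energy_kept` are ours) in place of a 2D wavelet basis.

```python
import numpy as np

def haar_transform(x):
    """Orthonormal Haar transform of a signal whose length is a power of 2."""
    approx = np.asarray(x, dtype=float)
    details = []
    while approx.size > 1:
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2.0))  # detail coefficients
        approx = (even + odd) / np.sqrt(2.0)         # coarser approximation
    return np.concatenate([approx] + details[::-1])

def energy_kept(coeffs, fraction):
    """Share of the total energy carried by the largest-magnitude coefficients."""
    k = max(1, int(fraction * coeffs.size))
    energy = np.sort(coeffs ** 2)[::-1]
    return energy[:k].sum() / energy.sum()

# Hypothetical test signal of length 256: piecewise constant with two jumps.
signal = np.concatenate([np.full(100, 1.0), np.full(90, 4.0), np.full(66, -2.0)])
coeffs = haar_transform(signal)
for fraction in (0.01, 0.05, 0.10):
    print(f"{fraction:.0%} kept -> {energy_kept(coeffs, fraction):.4f} of energy")
```

Since the transform is orthonormal, the sum of squared coefficients equals the sum of squared signal entries, which is what makes the “fraction of energy” a meaningful measure of near-sparsity.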
After the measurements are discretized, processing the signal (denoising, compression, storing, etc.) can be fully computerized. The major (to some extent, already actualized) advantage of Compressed Sensing is in the possibility to reduce the “analog effort” in the outlined process: instead of computing by analog means n linear forms of the n-dimensional signal x (its coefficients in a basis), we use an analog device to compute m ≪ n other linear forms of the signal and then use the signal’s sparsity in a basis known to us in order to recover the signal reasonably well from these m observations.
In our “picture illustration” this technology would work (in fact, it works: it is called the “single pixel camera” [83]; see Figure 1.2) as follows. In reality, the digital 256×256 image on the top of Figure 1.1 was obtained by an analog device, a digital camera which gets on input an analog signal (light of varying intensity along the field of view caught by the camera’s lens) and discretizes the light’s intensity in every pixel to get the digitalized image. We then can compute the wavelet coefficients of the digitalized image, compress its representation by keeping, say, just 10% of leading coefficients, etc., but “the damage is already done”: we have already spent our analog resources to get the entire digitalized image. The technology utilizing Compressed Sensing would work as follows: instead of measuring and discretizing light intensity in each of the 65,536 pixels, we compute (using an analog device) the integral, taken over the field of view, of the product of light intensity and an analog-generated “mask.” We repeat it for, say, 20,000 different masks, thus obtaining measurements of 20,000 linear forms of our 65,536-dimensional signal. Next we utilize, via the Compressed Sensing machinery, the signal’s sparsity in the wavelet basis in order to recover the signal from these 20,000 measurements. With this approach, we reduce the “analog component” of signal processing effort, at the price of increasing the “computerized component” of the effort (instead of a ready-to-use digitalized image directly given by 65,536 analog measurements, we need to recover the image by applying computationally nontrivial decoding algorithms to our 20,000 “indirect” measurements). When taking pictures with your camera or iPad, the game is not worth the candle: the analog component of taking usual pictures is cheap enough, and decreasing it at the cost of nontrivial decoding of the digitalized measurements would be counterproductive. There are, however, important applications where the advantages stemming from reduced “analog effort” outweigh significantly the drawbacks caused by the necessity to use nontrivial computerized decoding [96, 176].

Figure 1.1: Top: true 256 × 256 image; bottom: sparse in the wavelet basis approximations of the image, with 1% of leading wavelet coefficients (97.83% of energy) kept, 5% (99.51% of energy) kept, 10% (99.82% of energy) kept, and 25% (99.97% of energy) kept. Wavelet basis is orthonormal, and a natural way to quantify near-sparsity of a signal is to look at the fraction of total energy (sum of squares of wavelet coefficients) stored in the leading coefficients; these are the “energy data” presented in the figure.

Figure 1.2: Single-pixel camera.

1.1.3 Compressed Sensing via ℓ1 minimization: Motivation
1.1.3.1 Preliminaries
In principle there is nothing surprising in the fact that under reasonable assumptions on the m × n sensing matrix A we may hope to recover from noisy observations of Ax an s-sparse signal x, with s ≪ m. Indeed, assume for the sake of simplicity that there are no observation errors, and let $\mathrm{Col}_j[A]$ be the j-th column of A. If we knew the locations $j_1 < j_2 < ... < j_s$ of the nonzero entries in x, identifying x could be reduced to solving the system of linear equations $\sum_{\ell=1}^s x_{j_\ell}\mathrm{Col}_{j_\ell}[A] = y$ with m equations and s ≪ m unknowns; assuming every s columns in A to be linearly independent (a quite unrestrictive assumption on a matrix with m ≥ s rows), the solution to the above system is unique, and is exactly the signal we are looking for.
Of course, the assumption that we know the locations of nonzeros in x makes the recovery problem completely trivial. However, it suggests the following course of action: given noiseless observation y = Ax of an s-sparse signal x, let us solve the combinatorial optimization problem
$$\min_z \{\|z\|_0 : Az = y\}, \qquad (1.2)$$
where $\|z\|_0$ is the number of nonzero entries in z. Clearly, the problem has a solution with the value of the objective at most s. Moreover, it is immediately seen that if every 2s columns in A are linearly independent (which again is a very unrestrictive assumption on the matrix A provided that m ≥ 2s), then the true signal x is the unique optimal solution to (1.2): the difference of x and any other feasible solution with at most s nonzero entries would be a nonzero vector from Ker A with at most 2s nonzero entries.
What was said so far can be extended to the case of noisy observations and “nearly s-sparse” signals x. For example, assuming that the observation error is “uncertain-but-bounded,” specifically, that some known norm ‖ · ‖ of this error does not exceed a given ε > 0, and that the true signal is s-sparse, we could solve the combinatorial optimization problem
$$\min_z \{\|z\|_0 : \|Az - y\| \le \epsilon\}. \qquad (1.3)$$
Assuming that every m × 2s submatrix $\bar{A}$ of A is not just with linearly independent columns (i.e., with trivial kernel), but is reasonably well conditioned,
$$\|\bar{A}w\| \ge C^{-1}\|w\|_2$$
for all (2s)-dimensional vectors w, with some constant C, it is immediately seen that the true signal x underlying the observation and the optimal solution $\hat{x}$ of (1.3) are close to each other within accuracy of order of ε: $\|x - \hat{x}\|_2 \le 2C\epsilon$ (indeed, $x - \hat{x}$ has at most 2s nonzero entries, and $\|A(x - \hat{x})\| \le \|Ax - y\| + \|A\hat{x} - y\| \le 2\epsilon$). It is easily seen that the resulting error bound is basically as good as it could be.
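For toy sizes, the combinatorial problem (1.2) can be solved literally, by enumerating candidate supports of growing cardinality and testing each for solvability. The following is a minimal sketch under our own assumptions (a random Gaussian sensing matrix, a hypothetical function name `l0_brute_force`, and a numerical tolerance in place of exact solvability); it is meant to show the idea, not to be a practical algorithm.

```python
import itertools
import math
import numpy as np

def l0_brute_force(A, y, tol=1e-9):
    """Sparsest z with Az = y, found by enumerating supports of growing size."""
    m, n = A.shape
    if np.linalg.norm(y) <= tol:
        return np.zeros(n)
    for k in range(1, n + 1):
        for support in itertools.combinations(range(n), k):
            cols = list(support)
            # least-squares fit using only the columns in the candidate support
            coef, *_ = np.linalg.lstsq(A[:, cols], y, rcond=None)
            if np.linalg.norm(A[:, cols] @ coef - y) <= tol:
                z = np.zeros(n)
                z[cols] = coef
                return z
    return None  # Az = y has no solution

rng = np.random.default_rng(1)
m, n, s = 6, 12, 2
A = rng.standard_normal((m, n))
x = np.zeros(n)
x[[3, 8]] = [1.5, -2.0]
z = l0_brute_force(A, A @ x)
print(np.allclose(z, x))

# Number of s-element supports for two (s, n) pairs: it explodes quickly.
print(math.comb(100, 5), math.comb(200, 20))
```

The last line hints at why this approach does not scale: the number of s-element supports is the binomial coefficient of n and s.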
We see that the difficulties with recovering sparse signals stem not from the lack of information; they are of purely computational nature: (1.2) is a difficult combinatorial problem. As far as known theoretical complexity guarantees are concerned, they are not better than “brute force” search through all guesses on where the nonzeros in x are located: by inspecting first the only option that there are no nonzeros in x at all, then by inspecting n options that there is only one nonzero, for every one of n locations of this nonzero, then n(n − 1)/2 options that there are exactly two nonzeros, etc., until the current option results in a solvable system of linear equations Az = y in variables z with entries restricted to vanish outside the locations prescribed by the current option. The running time of this “brute force” search, beyond the range of small values of s and n (by far too small to be of any applied interest), is by many orders of magnitude larger than what we can afford in reality.³

³ When s = 5 and n = 100, a sharp upper bound on the number of linear systems we should process before termination in the “brute force” algorithm is ≈ 7.53e7, which is a lot, but perhaps doable. When n = 200 and s = 20, the number of systems to be processed jumps to ≈ 1.61e27, which is by many orders of magnitude beyond our “computational grasp”; we would be unable to carry out that many computations even if the fate of mankind were at stake. And from the perspective of Compressed Sensing, n = 200 still is a completely toy size, 3–4 orders of magnitude less than we would like to handle.

A partial remedy is as follows. Well, if we do not know how to minimize the “bad” objective $\|z\|_0$ under linear constraints, as in (1.2), let us “approximate” this objective with one which we do know how to minimize. The true objective is separable: $\|z\|_0 = \sum_{j=1}^n \xi(z_j)$, where ξ(s) is the function on the axis equal to 0 at the origin and equal to 1 otherwise. As a matter of fact, the separable functions which we do know how to minimize under linear constraints are sums of convex functions of $z_1, ..., z_n$. The most natural candidate to the role of convex approximation of ξ(s) is |s|; with this approximation, (1.2) converts into the ℓ1 minimization problem
$$\min_z \Big\{\|z\|_1 := \sum_{j=1}^n |z_j| : Az = y\Big\}, \qquad (1.4)$$
and (1.3) becomes the convex optimization problem
$$\min_z \{\|z\|_1 : \|Az - y\| \le \epsilon\}. \qquad (1.5)$$
Both problems are efficiently solvable, which is nice; the question, however, is how relevant these problems are in our context, that is, whether it is true that they do recover the “true” s-sparse signals in the noiseless case, or “nearly recover” these signals when the observation error is small. Since we want to be able to handle any s-sparse signal, the validity of ℓ1 recovery (its ability to recover well every s-sparse signal) depends solely on the sensing matrix A. Our current goal is to understand which sensing matrices are “good” in this respect.
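To make “efficiently solvable” concrete, here is a hedged sketch of problem (1.4) posed as a Linear Programming program and handed to SciPy’s `linprog`. The reformulation with auxiliary variables $t_j \ge |z_j|$ is standard; the function name `l1_minimize`, the Gaussian sensing matrix, and the sizes m = 30, n = 60, s = 3 are our assumptions, chosen so that exact recovery is to be expected for such a random instance.

```python
import numpy as np
from scipy.optimize import linprog

def l1_minimize(A, y):
    """Solve min ||z||_1 s.t. Az = y as an LP in variables (z, t) with |z_j| <= t_j."""
    m, n = A.shape
    c = np.concatenate([np.zeros(n), np.ones(n)])    # objective: sum_j t_j
    I = np.eye(n)
    A_ub = np.block([[I, -I], [-I, -I]])             # z - t <= 0 and -z - t <= 0
    b_ub = np.zeros(2 * n)
    A_eq = np.hstack([A, np.zeros((m, n))])          # Az = y
    bounds = [(None, None)] * n + [(0, None)] * n    # z free, t nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n]

rng = np.random.default_rng(0)
m, n, s = 30, 60, 3
A = rng.standard_normal((m, n))
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
x_hat = l1_minimize(A, A @ x)
print(np.max(np.abs(x_hat - x)))   # recovery error, expected to be tiny here
```

Whether such recovery is guaranteed for every s-sparse signal is a property of A alone, which is exactly the question the text turns to next.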
1.2 VALIDITY OF SPARSE SIGNAL RECOVERY VIA ℓ1 MINIMIZATION
What follows is based on the standard basic results of Compressed Sensing theory originating from [19, 49, 45, 44, 46, 47, 48, 67, 69, 70, 93, 94, 232] and augmented by the results of [129, 130, 132, 133].⁴

⁴ In fact, in the latter source, an extension of the sparsity, the so-called block sparsity, is considered; in what follows, we restrict the results of [130] to the case of plain sparsity.

1.2.1 Validity of ℓ1 minimization in the noiseless case
The minimal requirement on sensing matrix A which makes ℓ1 minimization valid is to guarantee the correct recovery of exactly s-sparse signals in the noiseless case, and we start with investigating this property.
1.2.1.1 Notational convention
From now on, for a vector $x \in \mathbf{R}^n$
• $I_x = \{j : x_j \ne 0\}$ stands for the support of x; we also set $I_x^+ = \{j : x_j > 0\}$, $I_x^- = \{j : x_j < 0\}$ [⇒ $I_x = I_x^+ \cup I_x^-$];
• for a subset I of the index set {1, ..., n}, $x_I$ stands for the vector obtained from x by zeroing out entries with indices not in I, and $I^o$ for the complement of I: $I^o = \{i \in \{1, ..., n\} : i \notin I\}$;
• for s ≤ n, $x^s$ stands for the vector obtained from x by zeroing out all but the s entries largest in magnitude.⁵ Note that $x^s$ is the best s-sparse approximation of x in all ℓp norms, 1 ≤ p ≤ ∞;
• for s ≤ n and p ∈ [1, ∞], we set $\|x\|_{s,p} = \|x^s\|_p$; note that $\|\cdot\|_{s,p}$ is a norm.

⁵ Note that in general $x^s$ is not uniquely defined by x and s, since the s-th largest among the magnitudes of entries in x can be achieved at several entries. In our context, it does not matter how ties of this type are resolved; for the sake of definiteness, we can assume that when ordering the entries in x according to their magnitudes, from the largest to the smallest, entries of equal magnitude are ordered in the order of their indices.

1.2.1.2 s-Goodness
Definition of s-goodness. Let us say that an m × n sensing matrix A is s-good if whenever the true signal x underlying noiseless observations is s-sparse, this signal will be recovered exactly by ℓ1 minimization. In other words, A is s-good if whenever y in (1.4) is of the form y = Ax with s-sparse x, x is the unique optimal solution to (1.4).
Nullspace property. There is a simple-looking necessary and sufficient condition for a sensing matrix A to be s-good: the nullspace property originating from [70]. After this property is guessed, it is easy to see that it indeed is necessary and sufficient for s-goodness; we, however, prefer to derive this condition from the “first principles,” which can be easily done via Convex Optimization. Thus, in the case in question, as in many other cases, there is no necessity to be smart to arrive at the truth via a “lucky guess”; it suffices to be knowledgeable and use the standard tools.
Let us start with a necessary condition for A to be such that whenever x is s-sparse, x is an optimal solution (perhaps not the unique one) of the optimization problem
$$\min_z \{\|z\|_1 : Az = Ax\}; \qquad (P[x])$$
we refer to the latter property of A as weak s-goodness. Our first observation is as follows:
Proposition 1.1. If A is weakly s-good, then the following condition holds true: whenever I is a subset of {1, ..., n} of cardinality ≤ s, we have
$$\forall w \in \mathrm{Ker}\,A:\ \|w_I\|_1 \le \|w_{I^o}\|_1. \qquad (1.6)$$
Proof is immediate. Assume A is weakly s-good, and let us verify (1.6). Let I be an s-element subset of {1, ..., n}, and x be an s-sparse vector with support I. Since A is weakly s-good, x is an optimal solution to (P[x]). Rewriting the latter problem in the form of LP, that is, as
$$\min_{z,t} \Big\{\sum_j t_j : t_j + z_j \ge 0,\ t_j - z_j \ge 0,\ Az = Ax\Big\},$$
and invoking LP optimality conditions, we see that the necessary and sufficient condition for z = x to be the z-component of an optimal solution is the existence of $\lambda_j^+$, $\lambda_j^-$, $\mu \in \mathbf{R}^m$ (Lagrange multipliers for the constraints $t_j - z_j \ge 0$, $t_j + z_j \ge 0$, and Az = Ax, respectively) such that
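$$\begin{array}{lrcl}
(a) & \lambda_j^+ + \lambda_j^- & = & 1\ \ \forall j,\\
(b) & \lambda^+ - \lambda^- + A^T\mu & = & 0,\\
(c) & \lambda_j^+(|x_j| - x_j) & = & 0\ \ \forall j,\\
(d) & \lambda_j^-(|x_j| + x_j) & = & 0\ \ \forall j,\\
(e) & \lambda_j^+ & \ge & 0\ \ \forall j,\\
(f) & \lambda_j^- & \ge & 0\ \ \forall j.
\end{array} \qquad (1.7)$$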
From (c, d), we have $\lambda_j^+ = 1$, $\lambda_j^- = 0$ for $j \in I_x^+$ and $\lambda_j^+ = 0$, $\lambda_j^- = 1$ for $j \in I_x^-$. From (a) and nonnegativity of $\lambda_j^\pm$ it follows that for $j \notin I_x$ we should have $-1 \le \lambda_j^+ - \lambda_j^- \le 1$. With this in mind, the above optimality conditions admit eliminating λ’s and reduce to the following conclusion:
(!) x is an optimal solution to (P[x]) if and only if there exists vector $\mu \in \mathbf{R}^m$ such that the j-th entry of $A^T\mu$ is −1 if $x_j > 0$, +1 if $x_j < 0$, and a real from [−1, 1] if $x_j = 0$.
Now let w ∈ Ker A be a vector with the same signs of entries $w_i$, i ∈ I, as these of the entries in x. Then
$$0 = \mu^TAw = [A^T\mu]^Tw = \sum_j [A^T\mu]_jw_j\ \Rightarrow\ \sum_{j\in I_x}|w_j| = -\sum_{j\in I_x}[A^T\mu]_jw_j = \sum_{j\notin I_x}[A^T\mu]_jw_j \le \sum_{j\notin I_x}|w_j|$$
(we have used the fact that $[A^T\mu]_j = -\mathrm{sign}\,x_j = -\mathrm{sign}\,w_j$ for $j \in I_x$ and $|[A^T\mu]_j| \le 1$ for all j). Since I can be an arbitrary s-element subset of {1, ..., n} and the pattern of signs of an s-sparse vector x supported on I can be arbitrary, (1.6) holds true. ✷
1.2.1.3 Nullspace property
In fact, it can be shown that (1.6) is not only a necessary, but also sufficient condition for weak s-goodness of A; we, however, skip this verification, since our goal so far was to guess the condition for s-goodness, and this goal has already been achieved: from what we already know it immediately follows that a necessary condition for s-goodness is for the inequality in (1.6) to be strict whenever w ∈ Ker A is nonzero. Indeed, we already know that if A is s-good, then for every I of cardinality s and every nonzero w ∈ Ker A it holds $\|w_I\|_1 \le \|w_{I^o}\|_1$. If the latter inequality for some I and w in question holds true as equality, then A clearly is not s-good, since the s-sparse signal $x = w_I$ is not the unique optimal solution to (P[x]): the vector $-w_{I^o}$ is a different feasible solution to the same problem and with the same value of the objective. We conclude that for A to be s-good, a necessary condition is
$$\forall (0 \ne w \in \mathrm{Ker}\,A,\ I,\ \mathrm{Card}(I) \le s):\ \|w_I\|_1 < \|w_{I^o}\|_1.$$
By the standard compactness argument, this is the same as the existence of γ ∈ (0, 1) such that
$$\forall (w \in \mathrm{Ker}\,A,\ I,\ \mathrm{Card}(I) \le s):\ \|w_I\|_1 \le \gamma\|w_{I^o}\|_1,$$
or, which is the same (take κ = γ/(1 + γ)), existence of κ ∈ (0, 1/2) such that
$$\forall (w \in \mathrm{Ker}\,A,\ I,\ \mathrm{Card}(I) \le s):\ \|w_I\|_1 \le \kappa\|w\|_1.$$
Finally, the maximum of $\|w_I\|_1$ over I of cardinality s is the norm $\|w\|_{s,1}$ (the sum of s largest magnitudes of entries) of w, so that the condition we are processing finally can be formulated as
$$\exists \kappa \in (0, 1/2):\ \|w\|_{s,1} \le \kappa\|w\|_1\ \ \forall w \in \mathrm{Ker}\,A. \qquad (1.8)$$
The resulting nullspace condition in fact is necessary and sufficient for A to be s-good:

Proposition 1.2. Condition (1.8) is necessary and sufficient for A to be s-good.

Proof. We have already seen that the nullspace condition is necessary for s-goodness. To verify sufficiency, let A satisfy the nullspace condition, and let us prove that A is s-good. Indeed, let x be an s-sparse vector, and y be an optimal solution to $(P[x])$; all we need is to prove that $y = x$. Let I be the support of x, and $w = y - x$, so that $w \in \mathrm{Ker}\,A$. By the nullspace property we have
$$\|w_I\|_1 \le \kappa\|w\|_1 = \kappa[\|w_I\|_1 + \|w_{I^o}\|_1] = \kappa[\|w_I\|_1 + \|y_{I^o}\|_1] \;\Rightarrow\; \|w_I\|_1 \le \frac{\kappa}{1-\kappa}\|y_{I^o}\|_1$$
$$\Rightarrow\; \|x\|_1 = \|x_I\|_1 = \|y_I - w_I\|_1 \le \|y_I\|_1 + \frac{\kappa}{1-\kappa}\|y_{I^o}\|_1 \le \|y_I\|_1 + \|y_{I^o}\|_1 = \|y\|_1,$$
where the concluding $\le$ is due to $\kappa \in [0, 1/2)$. Since x is a feasible, and y is an optimal solution to $(P[x])$, the resulting inequality $\|x\|_1 \le \|y\|_1$ must be equality, which, again due to $\kappa \in [0, 1/2)$, is possible only when $y_{I^o} = 0$. Thus, y has the same support I as x, and $w = y - x \in \mathrm{Ker}\,A$ is supported on the s-element set I; by the nullspace property, we should have $\|w_I\|_1 \le \kappa\|w\|_1 = \kappa\|w_I\|_1$, which is possible only when $w = 0$. ✷

1.2.2 Imperfect ℓ1 minimization
We have found a necessary and sufficient condition for ℓ1 minimization to recover exactly s-sparse signals in the noiseless case. More often than not, both these assumptions are violated: instead of s-sparse signals, we should speak about "nearly s-sparse" ones, quantifying the deviation from sparsity by the distance from the signal $x$ underlying the observations to its best s-sparse approximation $x^s$. Similarly, we should allow for nonzero observation noise. With noisy observations and/or imperfect sparsity, we cannot hope to recover the signal exactly. All we may hope for is to recover it with some error depending on the level of observation noise and "deviation from s-sparsity," and tending to zero as the level and deviation tend to 0. We are about to quantify the nullspace property to allow for instructive "error analysis."
1.2.2.1 Contrast matrices and quantifications of Nullspace property
By itself, the nullspace property says something about the signals from the kernel of the sensing matrix. We can reformulate it equivalently to say something important about all signals. Namely, observe that given sparsity s and $\kappa \in (0, 1/2)$, the nullspace property
$$\|w\|_{s,1} \le \kappa\|w\|_1 \quad \forall w \in \mathrm{Ker}\,A \tag{1.9}$$
is satisfied if and only if for a properly selected constant C one has⁶
$$\|w\|_{s,1} \le C\|Aw\|_2 + \kappa\|w\|_1 \quad \forall w. \tag{1.10}$$
Indeed, (1.10) clearly implies (1.9); to get the inverse implication, note that for every h orthogonal to $\mathrm{Ker}\,A$ it holds $\|Ah\|_2 \ge \sigma\|h\|_2$, where $\sigma > 0$ is the minimal positive singular value of A. Now, given $w \in \mathbf{R}^n$, we can decompose w into the sum of $\bar w \in \mathrm{Ker}\,A$ and $h \in (\mathrm{Ker}\,A)^\perp$, so that
$$\|w\|_{s,1} \le \|\bar w\|_{s,1} + \|h\|_{s,1} \le \kappa\|\bar w\|_1 + \sqrt{s}\,\|h\|_{s,2} \le \kappa[\|w\|_1 + \|h\|_1] + \sqrt{s}\,\|h\|_2$$
$$\le \kappa\|w\|_1 + [\kappa\sqrt{n} + \sqrt{s}]\,\|h\|_2 \le \underbrace{\sigma^{-1}[\kappa\sqrt{n} + \sqrt{s}]}_{C}\ \underbrace{\|Ah\|_2}_{=\|Aw\|_2} + \kappa\|w\|_1,$$
as required in (1.10).
Condition $Q_1(s, \kappa)$. For our purposes, it is convenient to present the condition (1.10) in the following flexible form:
$$\|w\|_{s,1} \le s\|H^T A w\| + \kappa\|w\|_1, \tag{1.11}$$
where H is an $m \times N$ contrast matrix and $\|\cdot\|$ is some norm on $\mathbf{R}^N$. Whenever a pair $(H, \|\cdot\|)$, called contrast pair, satisfies (1.11), we say that $(H, \|\cdot\|)$ satisfies condition $Q_1(s, \kappa)$. From what we have seen,

if A possesses the nullspace property with some sparsity level s and some $\kappa \in (0, 1/2)$, then there are many ways to select pairs $(H, \|\cdot\|)$ satisfying $Q_1(s, \kappa)$, e.g., to take $H = C I_m$ with appropriately large C and $\|\cdot\| = \|\cdot\|_2$.

Conditions $Q_q(s, \kappa)$. As we will see in a while, it makes sense to embed the condition $Q_1(s, \kappa)$ into a parametric family of conditions $Q_q(s, \kappa)$, where the parameter q runs through $[1, \infty]$. Specifically,

given an $m \times n$ sensing matrix A, sparsity level $s \le n$, and $\kappa \in (0, 1/2)$, we say that an $m \times N$ matrix H and a norm $\|\cdot\|$ on $\mathbf{R}^N$ satisfy condition $Q_q(s, \kappa)$ if
$$\|w\|_{s,q} \le s^{\frac{1}{q}}\|H^T A w\| + \kappa s^{\frac{1}{q}-1}\|w\|_1 \quad \forall w \in \mathbf{R}^n. \tag{1.12}$$

Let us make two immediate observations on relations between the conditions:

A. When a pair $(H, \|\cdot\|)$ satisfies condition $Q_q(s, \kappa)$, the pair satisfies also all conditions $Q_{q'}(s, \kappa)$ with $1 \le q' \le q$.

⁶ Note that (1.9) is exactly the $\phi^2(s, \kappa)$-Compatibility condition of [231] with $\phi(s, \kappa) = C/\sqrt{s}$; see also [232] for the analysis of relationships of this condition with other assumptions (e.g., a similar Restricted Eigenvalue assumption of [20]) used to analyse ℓ1-minimization procedures.
Indeed, in the situation in question for $1 \le q' \le q$ it holds
$$\|w\|_{s,q'} \le s^{\frac{1}{q'}-\frac{1}{q}}\|w\|_{s,q} \le s^{\frac{1}{q'}-\frac{1}{q}}\left[s^{\frac{1}{q}}\|H^T A w\| + \kappa s^{\frac{1}{q}-1}\|w\|_1\right] = s^{\frac{1}{q'}}\|H^T A w\| + \kappa s^{\frac{1}{q'}-1}\|w\|_1,$$
where the first inequality is the standard inequality between $\ell_p$-norms of the s-dimensional vector $w^s$.
B. When a pair $(H, \|\cdot\|)$ satisfies condition $Q_q(s, \kappa)$ and $1 \le s' \le s$, the pair $((s/s')^{\frac{1}{q}} H, \|\cdot\|)$ satisfies the condition $Q_q(s', \kappa)$. Indeed, in the situation in question we clearly have for $1 \le s' \le s$:
$$\|w\|_{s',q} \le \|w\|_{s,q} \le (s')^{\frac{1}{q}}\left\|\left[(s/s')^{\frac{1}{q}} H\right]^T A w\right\| + \kappa \underbrace{s^{\frac{1}{q}-1}}_{\le (s')^{\frac{1}{q}-1}}\|w\|_1.$$
1.2.3 Regular ℓ1 recovery

Given the observation scheme (1.1) with an $m \times n$ sensing matrix A, we define the regular ℓ1 recovery of x via observation y as
$$\hat x_{\mathrm{reg}}(y) \in \mathop{\mathrm{Argmin}}_u \left\{\|u\|_1 : \|H^T(Au - y)\| \le \rho\right\}, \tag{1.13}$$
where the contrast matrix $H \in \mathbf{R}^{m \times N}$, the norm $\|\cdot\|$ on $\mathbf{R}^N$ and $\rho > 0$ are parameters of the construction. The role of the Q-conditions we have introduced is clear from the following
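Theorem 1.3. Let s be a positive integer, $q \in [1, \infty]$ and $\kappa \in (0, 1/2)$. Assume that a pair $(H, \|\cdot\|)$ satisfies the condition $Q_q(s, \kappa)$ associated with A, and let
$$\Xi_\rho = \{\eta : \|H^T\eta\| \le \rho\}. \tag{1.14}$$
Then for all $x \in \mathbf{R}^n$ and $\eta \in \Xi_\rho$ one has
$$\|\hat x_{\mathrm{reg}}(Ax + \eta) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}}{1 - 2\kappa}\left[\rho + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.15}$$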
The above result can be slightly strengthened by replacing the assumption that $(H, \|\cdot\|)$ satisfies $Q_q(s, \kappa)$ with some $\kappa < 1/2$, with a weaker (by observation A from Section 1.2.2.1) assumption that $(H, \|\cdot\|)$ satisfies $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ and satisfies $Q_q(s, \kappa)$ with some (perhaps large) $\kappa$:

Theorem 1.4. Given A, integer $s > 0$, and $q \in [1, \infty]$, assume that $(H, \|\cdot\|)$ satisfies the condition $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ and the condition $Q_q(s, \kappa)$ with some $\kappa \ge \varkappa$, and let $\Xi_\rho$ be given by (1.14). Then for all $x \in \mathbf{R}^n$ and $\eta \in \Xi_\rho$ it holds:
$$\|\hat x_{\mathrm{reg}}(Ax + \eta) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}[1 + \kappa - \varkappa]^{\frac{q(p-1)}{p(q-1)}}}{1 - 2\varkappa}\left[\rho + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.16}$$
For proofs of Theorems 1.3 and 1.4, see Section 1.5.1. Before commenting on the above results, let us present their alternative versions.
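When $\|\cdot\| = \|\cdot\|_\infty$, the regular recovery (1.13) is a linear program. The following is a minimal sketch under stated assumptions: it uses SciPy's `linprog`, a random Gaussian sensing matrix, the illustrative contrast choice H = A, and an "oracle" value of ρ computed from the actual noise so that η lies in $\Xi_\rho$. None of these choices come from the text; the function name `regular_l1_recovery` is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def regular_l1_recovery(A, y, H, rho):
    """Solve min ||u||_1 s.t. ||H^T (A u - y)||_inf <= rho as an LP.

    Variables are (u, t) with |u_j| <= t_j; the objective is sum(t).
    """
    n = A.shape[1]
    G, g = H.T @ A, H.T @ y              # G u - g = H^T (A u - y)
    N = G.shape[0]
    I = np.eye(n)
    c = np.concatenate([np.zeros(n), np.ones(n)])
    A_ub = np.block([
        [I, -I],                         #  u - t <= 0
        [-I, -I],                        # -u - t <= 0
        [G, np.zeros((N, n))],           #  G u <= rho + g
        [-G, np.zeros((N, n))],          # -G u <= rho - g
    ])
    b_ub = np.concatenate([np.zeros(2 * n), rho + g, rho - g])
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n]

# Illustrative data (assumptions): Gaussian A, exactly s-sparse x, small noise.
rng = np.random.default_rng(0)
m, n, s, sigma = 40, 80, 3, 0.01
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = rng.standard_normal(s)
eta = sigma * rng.standard_normal(m)
y = A @ x + eta
H = A                                    # illustrative contrast matrix
rho = np.max(np.abs(H.T @ eta))          # oracle choice: eta belongs to Xi_rho
x_hat = regular_l1_recovery(A, y, H, rho)
print(np.linalg.norm(x_hat - x, 1))
```

Whether the error printed here obeys (1.15) depends on whether the chosen pair actually satisfies a Q-condition, which this sketch does not verify; what the LP does guarantee is that the recovery is feasible and has ℓ1 norm not exceeding that of x.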
1.2.4 Penalized ℓ1 recovery

Penalized ℓ1 recovery of signal x from its observation (1.1) is
$$\hat x_{\mathrm{pen}}(y) \in \mathop{\mathrm{Argmin}}_u \left\{\|u\|_1 + \lambda\|H^T(Au - y)\|\right\}, \tag{1.17}$$
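where $H \in \mathbf{R}^{m \times N}$, a norm $\|\cdot\|$ on $\mathbf{R}^N$, and a positive real $\lambda$ are parameters of the construction.

Theorem 1.5. Given A, positive integer s, and $q \in [1, \infty]$, assume that $(H, \|\cdot\|)$ satisfies the conditions $Q_q(s, \kappa)$ and $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ and $\kappa \ge \varkappa$. Then

(i) Let $\lambda \ge 2s$. Then for all $x \in \mathbf{R}^n$, $y \in \mathbf{R}^m$ it holds:
$$\|\hat x_{\mathrm{pen}}(y) - x\|_p \le \frac{4\lambda^{\frac{1}{p}}}{1 - 2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]^{\frac{q(p-1)}{p(q-1)}}\left[\|H^T(Ax - y)\| + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.18}$$
In particular, with $\lambda = 2s$ we have:
$$\|\hat x_{\mathrm{pen}}(y) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}}{1 - 2\varkappa}[1 + \kappa - \varkappa]^{\frac{q(p-1)}{p(q-1)}}\left[\|H^T(Ax - y)\| + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.19}$$

(ii) Let $\rho \ge 0$, and let $\Xi_\rho$ be given by (1.14). Then for all $x \in \mathbf{R}^n$ and all $\eta \in \Xi_\rho$ one has:
$$\begin{array}{rcl}
\lambda \ge 2s &\Rightarrow& \|\hat x_{\mathrm{pen}}(Ax + \eta) - x\|_p \le \dfrac{4\lambda^{\frac{1}{p}}}{1 - 2\varkappa}\left[1 + \dfrac{\kappa\lambda}{2s} - \varkappa\right]^{\frac{q(p-1)}{p(q-1)}}\left[\rho + \dfrac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q;\\[2ex]
\lambda = 2s &\Rightarrow& \|\hat x_{\mathrm{pen}}(Ax + \eta) - x\|_p \le \dfrac{4(2s)^{\frac{1}{p}}}{1 - 2\varkappa}[1 + \kappa - \varkappa]^{\frac{q(p-1)}{p(q-1)}}\left[\rho + \dfrac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q.
\end{array} \tag{1.20}$$

For proof, see Section 1.5.2.

1.2.5 Discussion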
Some remarks are in order.

A. Qualitatively speaking, Theorems 1.3, 1.4, and 1.5 say the same thing: when Q-conditions are satisfied, the regular or penalized recoveries reproduce the true signal exactly when there is no observation noise and the signal is s-sparse. In the presence of observation error $\eta$ and imperfect sparsity, the signal is recovered within the error which can be upper-bounded by the sum of two terms, one proportional to the magnitude of observation noise and one proportional to the deviation $\|x - x^s\|_1$ of the signal from s-sparse ones. In the penalized recovery, the observation error is measured in the scale given by the contrast matrix and the norm $\|\cdot\|$, that is, as $\|H^T\eta\|$, and in the regular recovery by an a priori upper bound $\rho$ on $\|H^T\eta\|$; when $\rho \ge \|H^T\eta\|$, $\eta$ belongs to $\Xi_\rho$ and thus the bounds (1.15) and (1.16) are applicable to the actual observation error $\eta$. Clearly, in qualitative terms, an error bound of this type is the best we may hope for.

Now let us look at the quantitative aspect. Assume that in the regular recovery we use $\rho \approx \|H^T\eta\|$, and in the penalized one $\lambda = 2s$. In this case, error bounds (1.15), (1.16), and (1.20), up to factors C depending solely on $\varkappa$ and $\kappa$, are the same, specifically,
$$\|\hat x - x\|_p \le C s^{1/p}\left[\|H^T\eta\| + \|x - x^s\|_1/s\right], \quad 1 \le p \le q. \tag{!}$$
Is this error bound bad or good? The answer depends on many factors, including on how well we select H and $\|\cdot\|$. To get a kind of orientation, consider the trivial case of direct observations, where matrix A is square and, moreover, is proportional to the unit matrix: $A = \alpha I$. Let us assume in addition that x is exactly s-sparse. In this case, the simplest way to ensure condition $Q_q(s, \kappa)$, even with $\kappa = 0$, is to take $\|\cdot\| = \|\cdot\|_{s,q}$ and $H = s^{-1/q}\alpha^{-1} I$, so that (!) becomes
$$\|\hat x - x\|_p \le C\alpha^{-1} s^{1/p - 1/q}\|\eta\|_{s,q}, \quad 1 \le p \le q. \tag{!!}$$
As far as the dependence of the bound on the magnitude $\|\eta\|_{s,q}$ of the observation noise is concerned, this dependence is as good as it can be: even if we knew in advance the positions of the s entries of x of largest magnitudes, we would be unable to recover x in q-norm with error $\le \alpha^{-1}\|\eta\|_{s,q}$. In addition, with the s largest magnitudes of entries in $\eta$ equal to each other, the $\|\cdot\|_p$-norm of the recovery error clearly cannot be guaranteed to be less than $\alpha^{-1}\|\eta\|_{s,p} = \alpha^{-1}s^{1/p - 1/q}\|\eta\|_{s,q}$. Thus, at least for s-sparse signals x, our error bound is, basically, the best one can get already in the "ideal" case of direct observations.

B. Given that $(H, \|\cdot\|)$ obeys $Q_1(s, \varkappa)$ with some $\varkappa < 1/2$, the larger the q such that the pair $(H, \|\cdot\|)$ obeys the condition $Q_q(s, \kappa)$ with a given $\kappa \ge \varkappa$ (recall that $\kappa$ can be $\ge 1/2$) and s, the larger the range $p \le q$ of values of p where the error bounds (1.16) and (1.20) are applicable. This is in full accordance with the fact that if a pair $(H, \|\cdot\|)$ obeys condition $Q_q(s, \kappa)$, it obeys also all conditions $Q_{q'}(s, \kappa)$ with $1 \le q' \le q$ (item A in Section 1.2.2.1).

C. The flexibility offered by contrast matrix H and norm $\|\cdot\|$ allows us to adjust, to some extent, the recovery to the "geometry of observation errors." For example, when $\eta$ is "uncertain but bounded," say, when all we know is that $\|\eta\|_2 \le \delta$ with some given $\delta$, all that matters (on top of the requirement for $(H, \|\cdot\|)$ to obey Q-conditions) is how large $\|H^T\eta\|$ could be when $\|\eta\|_2 \le \delta$. In particular, when $\|\cdot\| = \|\cdot\|_2$, the error bound "is governed" by the spectral norm of H. Consequently, if we have a technique allowing us to design H such that $(H, \|\cdot\|_2)$ obeys Q-condition(s) with given parameters, it makes sense to look for a design with as small a spectral norm of H as possible. In contrast to this, in the case of Gaussian noise, the one most interesting for applications,
$$y = Ax + \eta, \quad \eta \sim \mathcal{N}(0, \sigma^2 I_m), \tag{1.21}$$
looking at the spectral norm of H, with $\|\cdot\|_2$ in the role of $\|\cdot\|$, is counterproductive, since a typical realization of $\eta$ is of Euclidean norm of order of $\sqrt{m}\,\sigma$ and thus is quite large when m is large. In this case to quantify "the magnitude" of $H^T\eta$ by the product of the spectral norm of H and the Euclidean norm of $\eta$ is completely misleading: in typical cases, this product will grow rapidly with the number of observations m, completely ignoring the fact that $\eta$ is random with zero mean.⁷ What is much better suited for the case of Gaussian noise is the $\|\cdot\|_\infty$ norm in the role of $\|\cdot\|$ and the norm of H which is "the maximum of $\|\cdot\|_2$-norms of the columns in H," denoted by $\|H\|_{1,2}$. Indeed, with $\eta \sim \mathcal{N}(0, \sigma^2 I_m)$, the entries in $H^T\eta$ are Gaussian with zero mean and variance bounded by $\sigma^2\|H\|_{1,2}^2$, so that $\|H^T\eta\|_\infty$ is the maximum of magnitudes of N zero mean Gaussian random variables with standard deviations bounded by $\sigma\|H\|_{1,2}$. As a result,
$$\mathrm{Prob}\{\|H^T\eta\|_\infty \ge \rho\} \le 2N\,\mathrm{Erfc}\left(\frac{\rho}{\sigma\|H\|_{1,2}}\right) \le N e^{-\frac{\rho^2}{2\sigma^2\|H\|_{1,2}^2}}, \tag{1.22}$$
where
$$\mathrm{Erfc}(s) = \mathrm{Prob}_{\xi \sim \mathcal{N}(0,1)}\{\xi \ge s\} = \frac{1}{\sqrt{2\pi}}\int_s^\infty e^{-t^2/2}\,dt$$
is the (slightly rescaled) complementary error function.

It follows that the typical values of $\|H^T\eta\|_\infty$, $\eta \sim \mathcal{N}(0, \sigma^2 I_m)$, are of order of at most $\sigma\sqrt{\ln(N)}\,\|H\|_{1,2}$. In applications we consider in this chapter, we have $N = O(m)$, so that with $\sigma$ and $\|H\|_{1,2}$ given, typical values of $\|H^T\eta\|_\infty$ are nearly independent of m. The bottom line is that ℓ1 minimization is capable of handling large-scale Gaussian observation noise incomparably better than "uncertain-but-bounded" observation noise of similar magnitude (measured in Euclidean norm).

⁷ The simplest way to see the difference is to look at a particular entry $h^T\eta$ in $H^T\eta$. Operating with spectral norms, we upper-bound this entry by $\|h\|_2\|\eta\|_2$, and the second factor for $\eta \sim \mathcal{N}(0, \sigma^2 I_m)$ is typically as large as $\sigma\sqrt{m}$. This is in sharp contrast to the fact that typical values of $h^T\eta$ are of order of $\sigma\|h\|_2$, independently of what m is!
D. As far as comparison of regular and penalized ℓ1 recoveries with the same pair $(H, \|\cdot\|)$ is concerned, the situation is as follows. Assume for the sake of simplicity that $(H, \|\cdot\|)$ satisfies $Q_q(s, \kappa)$ with some s and $\kappa < 1/2$, and let the observation error be random. Given $\epsilon \in (0, 1)$, let
$$\rho_\epsilon[H, \|\cdot\|] = \min\left\{\rho : \mathrm{Prob}\{\eta : \|H^T\eta\| \le \rho\} \ge 1 - \epsilon\right\}; \tag{1.23}$$
this is nothing but the smallest $\rho$ such that
$$\mathrm{Prob}\{\eta \in \Xi_\rho\} \ge 1 - \epsilon \tag{1.24}$$
(see (1.14)), and thus the smallest $\rho$ for which the error bound (1.15) for the regular ℓ1 recovery holds true with probability $1 - \epsilon$ (or at least the smallest $\rho$ for which the latter claim is supported by Theorem 1.3). With $\rho = \rho_\epsilon[H, \|\cdot\|]$, the regular ℓ1 recovery guarantees (and that is the best guarantee one can extract from Theorem 1.3) that

(#) For some set $\Xi$, $\mathrm{Prob}\{\eta \in \Xi\} \ge 1 - \epsilon$, of "good" realizations of $\eta \sim \mathcal{N}(0, \sigma^2 I_m)$, one has
$$\|\hat x(Ax + \eta) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}}{1 - 2\kappa}\left[\rho_\epsilon[H, \|\cdot\|] + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q, \tag{1.25}$$
whenever $x \in \mathbf{R}^n$ and $\eta \in \Xi$.

The error bound (1.19) (where we set $\varkappa = \kappa$) says that (#) holds true for the penalized ℓ1 recovery with $\lambda = 2s$. The latter observation suggests that the penalized ℓ1 recovery associated with $(H, \|\cdot\|)$ and $\lambda = 2s$ is better than its regular counterpart, the reason being twofold. First, in order to ensure (#) with the regular recovery, the "built in" parameter $\rho$ of this recovery should be set to $\rho_\epsilon[H, \|\cdot\|]$, and the latter quantity is not always easy to identify. In contrast to this, the construction of penalized ℓ1 recovery is completely independent of a priori assumptions on the structure of observation errors, while automatically ensuring (#) for the error model we use. Second, and more importantly, for the penalized recovery the bound (1.25) is no more than the "worst, with confidence $1 - \epsilon$, case," while the typical values of the quantity $\|H^T\eta\|$ which indeed participates in the error bound (1.18) may be essentially smaller than $\rho_\epsilon[H, \|\cdot\|]$. Numerical experience fully supports the above claim: the difference in observed performance of the two routines in question, although not dramatic, is definitely in favor of the penalized recovery. The only potential disadvantage of the latter routine is that the penalty parameter $\lambda$ should be tuned to the level s of sparsity we aim at, while the regular recovery is free of any guess of this type. Of course, the "tuning" is rather loose: all we need (and experiments show that we indeed need this) is the relation $\lambda \ge 2s$, so that a rough upper bound on s will do. Note, however, that the bound (1.18) deteriorates as $\lambda$ grows.

Finally, we remark that when H is $m \times N$ and $\eta \sim \mathcal{N}(0, \sigma^2 I_m)$, we have
$$\rho_\epsilon[H, \|\cdot\|_\infty] \le \sigma\,\mathrm{ErfcInv}\left(\frac{\epsilon}{2N}\right)\|H\|_{1,2} \le \sigma\sqrt{2\ln(N/\epsilon)}\,\|H\|_{1,2}$$
(see (1.22)); here $\mathrm{ErfcInv}(\delta)$ is the inverse complementary error function:
$$\mathrm{Erfc}(\mathrm{ErfcInv}(\delta)) = \delta, \quad 0 < \delta < 1. \tag{1.26}$$
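The two upper bounds on $\rho_\epsilon[H, \|\cdot\|_\infty]$ above are easy to evaluate numerically. The sketch below is an illustration under assumptions: the contrast matrix is a random stand-in, the sizes and noise level are arbitrary, and `erfc_inv` and `col_norm_12` are hypothetical helper names. ErfcInv is computed from the standard normal quantile function, which matches the definition of Erfc in (1.22).

```python
import numpy as np
from statistics import NormalDist

def erfc_inv(delta):
    """Inverse of Erfc(s) = Prob{xi >= s}, xi ~ N(0,1), as in (1.26)."""
    return NormalDist().inv_cdf(1.0 - delta)

def col_norm_12(H):
    """||H||_{1,2}: the maximum Euclidean norm of a column of H."""
    return float(np.max(np.linalg.norm(H, axis=0)))

# Illustrative setup (assumptions): random m x N contrast matrix, Gaussian noise.
rng = np.random.default_rng(1)
m, N, sigma, eps = 200, 400, 0.1, 0.05
H = rng.standard_normal((m, N)) / np.sqrt(m)

rho_erfc = sigma * erfc_inv(eps / (2 * N)) * col_norm_12(H)
rho_log = sigma * np.sqrt(2 * np.log(N / eps)) * col_norm_12(H)

# Monte Carlo estimate of Prob{ ||H^T eta||_inf > rho_erfc }; by (1.22) the
# true probability is at most eps.
trials = 2000
eta = sigma * rng.standard_normal((trials, m))    # one noise vector per row
sup_norms = np.max(np.abs(eta @ H), axis=1)       # row i holds ||H^T eta_i||_inf
miss_rate = float(np.mean(sup_norms > rho_erfc))
print(rho_erfc, rho_log, miss_rate)
```

The observed miss rate is typically well below ε, since (1.22) is a union bound that also replaces every column norm by the largest one.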
How it works. Here we present a small numerical illustration. We observe in Gaussian noise $m = n/2$ randomly selected terms in an n-element "time series" $z = (z_1, ..., z_n)$ and want to recover this series under the assumption that the series is "nearly s-sparse in frequency domain," that is, that $z = Fx$ with $\|x - x^s\|_1 \le \delta$, where F is the matrix of the $n \times n$ Inverse Discrete Cosine Transform, $x^s$ is the vector obtained from x by zeroing out all but the s entries of largest magnitudes, and $\delta$ upper-bounds the distance from x to s-sparse signals. Denoting by A the $m \times n$ submatrix of F corresponding to the time instants t where $z_t$ is observed, our observation becomes $y = Ax + \sigma\xi$, where $\xi$ is the standard Gaussian noise. After the signal in the frequency domain, that is, x, is recovered by ℓ1 minimization (let the recovery be $\hat x$), we recover the signal in the time domain as $\hat z = F\hat x$. In Figure 1.3, we present four test signals, of different (near-)sparsity, along with their regular and penalized ℓ1 recoveries. The data in Figure 1.3 clearly show how the quality of ℓ1 recovery deteriorates as the number s of "essential nonzeros" of the signal in the frequency domain grows. It is seen also that the penalized recovery meaningfully outperforms the regular one in the range of sparsities up to 64.
[Figure 1.3 plots: four pairs of panels, for s = 16, s = 32, s = 64, and s = 128 (horizontal axis: entry index, 0 to 500). In each pair, the top plot shows the regular ℓ1 recovery and the bottom plot the penalized ℓ1 recovery.]

Recovery errors, regular ℓ1 recovery:

                         s = 16    s = 32    s = 64    s = 128
$\|z - \hat z\|_2$        0.2417    0.3871    0.8178    4.8256
$\|z - \hat z\|_\infty$   0.0343    0.0514    0.1744    0.8272

Recovery errors, penalized ℓ1 recovery:

                         s = 16    s = 32    s = 64    s = 128
$\|z - \hat z\|_2$        0.1399    0.2385    0.4216    5.3431
$\|z - \hat z\|_\infty$   0.0177    0.0362    0.1023    0.9141

Figure 1.3: Regular and penalized ℓ1 recovery of nearly s-sparse signals. o: true signals, +: recoveries (to make the plots readable, one per eight consecutive vector's entries is shown). Problem sizes are m = 256 and n = 2m = 512, noise level is $\sigma = 0.01$, deviation from s-sparsity is $\|x - x^s\|_1 = 1$, contrast pair is $(H = \sqrt{n/m}\,A, \|\cdot\|_\infty)$. In penalized recovery, $\lambda = 2s$; parameter $\rho$ of regular recovery is set to $\sigma \cdot \mathrm{ErfcInv}(0.005/n)$.
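An experiment of this type can be imitated at a much smaller scale. The sketch below is not the authors' code and does not reproduce the numbers in Figure 1.3: it assumes SciPy's `idct` and `linprog`, uses n = 64 and an exactly 2-sparse signal, and writes the penalized recovery (1.17) with $\|\cdot\| = \|\cdot\|_\infty$ as a linear program. The function name `penalized_l1_recovery` and all sizes are illustrative assumptions; the contrast pair and $\lambda = 2s$ follow the caption.

```python
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog

def penalized_l1_recovery(A, y, H, lam):
    """Solve min ||u||_1 + lam * ||H^T (A u - y)||_inf as an LP in (u, t, tau)."""
    n = A.shape[1]
    G, g = H.T @ A, H.T @ y
    N = G.shape[0]
    I, one = np.eye(n), np.ones((N, 1))
    c = np.concatenate([np.zeros(n), np.ones(n), [lam]])
    A_ub = np.block([
        [I, -I, np.zeros((n, 1))],       #  u - t <= 0
        [-I, -I, np.zeros((n, 1))],      # -u - t <= 0
        [G, np.zeros((N, n)), -one],     #  G u - g <= tau
        [-G, np.zeros((N, n)), -one],    # -(G u - g) <= tau
    ])
    b_ub = np.concatenate([np.zeros(2 * n), g, -g])
    bounds = [(None, None)] * n + [(0, None)] * (n + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n]

# Small-scale imitation of the setup (sizes are assumptions).
rng = np.random.default_rng(0)
n, m, s, sigma = 64, 32, 2, 0.01
F = idct(np.eye(n), norm="ortho", axis=0)   # column j is the inverse DCT of e_j
rows = np.sort(rng.choice(n, size=m, replace=False))
A = F[rows, :]                               # observed time instants
x = np.zeros(n)
x[rng.choice(n, size=s, replace=False)] = 1.0
y = A @ x + sigma * rng.standard_normal(m)
H = np.sqrt(n / m) * A                       # contrast matrix as in the caption
lam = 2 * s
x_hat = penalized_l1_recovery(A, y, H, lam)
z, z_hat = F @ x, F @ x_hat                  # back to the time domain

def objective(u):
    return np.linalg.norm(u, 1) + lam * np.max(np.abs(H.T @ (A @ u - y)))

print(np.linalg.norm(z - z_hat), np.max(np.abs(z - z_hat)))
```

The regular recovery can be obtained from the same LP template by dropping the variable tau and imposing the bound ρ on $\|H^T(Au - y)\|_\infty$ directly.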
1.3 VERIFIABILITY AND TRACTABILITY ISSUES
The good news about ℓ1 recovery stated in Theorems 1.3, 1.4, and 1.5 is "conditional": we assume that we are smart enough to point out a pair $(H, \|\cdot\|)$ satisfying condition $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ (and condition $Q_q(s, \kappa)$ with a "moderate" $\kappa$⁸). The related issues are twofold:

1. First, we do not know in which range of s, m, and n these conditions, or even the nullspace property (which is weaker than $Q_1(s, \varkappa)$, $\varkappa < 1/2$), can be satisfied; and without the nullspace property, ℓ1 minimization becomes useless, at least when we want to guarantee its validity whatever be the s-sparse signal we want to recover;

2. Second, it is unclear how to verify whether a given sensing matrix A satisfies the nullspace property for a given s, or a given pair $(H, \|\cdot\|)$ satisfies the condition $Q_q(s, \kappa)$ with given parameters.

What is known about these crucial issues can be outlined as follows.

1. It is known that for given m, n with $m \ll n$ (say, $m/n \le 1/2$), there exist $m \times n$ sensing matrices which are s-good for the values of s "nearly as large as m," specifically, for $s \le O(1)\frac{m}{\ln(n/m)}$.⁹ Moreover, there are natural families of matrices where this level of goodness "is a rule." E.g., when drawing an $m \times n$ matrix at random from Gaussian or Rademacher distributions (i.e., when filling the matrix with independent realizations of a random variable which is either a standard (zero mean, unit variance) Gaussian one, or takes values ±1 with probabilities 0.5), the result will be s-good, for the outlined value of s, with probability approaching 1 as m and n grow. All this remains true when instead of speaking about matrices A satisfying "plain" nullspace properties, we are speaking about matrices A for which it is easy to point out a pair $(H, \|\cdot\|)$ satisfying the condition $Q_2(s, \kappa)$ with, say, $\kappa = 1/4$.

The above results can be considered as good news. The bad news is that we do not know how to check efficiently, given an s and a sensing matrix A, that the matrix is s-good, just as we do not know how to check that A admits good (i.e., satisfying $Q_1(s, \varkappa)$ with $\varkappa < 1/2$) pairs $(H, \|\cdot\|)$. Even worse: we do not know an efficient recipe allowing us to build, given m, an $m \times 2m$ matrix A which is provably s-good for s larger than $O(1)\sqrt{m}$, which is a much smaller "level of goodness" than the one promised by theory for randomly generated matrices.¹⁰ The "common life" analogy of this situation would be as follows: you know that 90% of bricks in your wall are made of gold, and at the same time, you do not know how to tell a golden brick from a usual one.

2. There exist verifiable sufficient conditions for s-goodness of a sensing matrix, similarly to verifiable sufficient conditions for a pair $(H, \|\cdot\|)$ to satisfy condition $Q_q(s, \kappa)$. The bad news is that when $m \ll n$, these verifiable sufficient conditions can be satisfied only when $s \le O(1)\sqrt{m}$, that is, once again, in a much narrower range of values of s than the one in which typical randomly selected sensing matrices are s-good. In fact, $s = O(\sqrt{m})$ is so far the best known sparsity level for which we know individual s-good $m \times n$ sensing matrices with $m \le n/2$.

⁸ $Q_q(s, \kappa)$ is always satisfied with "large enough" $\kappa$, e.g., $\kappa = s$, but such values of $\kappa$ are of no interest: the associated bounds on p-norms of the recovery error are straightforward consequences of the bounds on the $\|\cdot\|_1$-norm of this error yielded by the condition $Q_1(s, \varkappa)$.

⁹ Recall that O(1)'s denote positive absolute constants: appropriately chosen numbers like 0.5, or 1, or perhaps 100,000. We could, in principle, replace all O(1)'s with specific numbers; following the standard mathematical practice, we do not do it, partly out of laziness, partly because particular values of these numbers in our context are irrelevant.

¹⁰ Note that the naive algorithm "generate $m \times 2m$ matrices at random until an s-good, with s promised by the theory, matrix is generated" is not an efficient recipe, since we still do not know how to check s-goodness efficiently.

1.3.1 Restricted Isometry Property and s-goodness of random matrices
There are several sufficient conditions for s-goodness, equally difficult to verify, but provably satisfied for typical random sensing matrices. The best known of them is the Restricted Isometry Property (RIP) defined as follows:

Definition 1.6. Let k be an integer and $\delta \in (0, 1)$. We say that an $m \times n$ sensing matrix A possesses the Restricted Isometry Property with parameters $\delta$ and k, RIP($\delta$, k), if for every k-sparse $x \in \mathbf{R}^n$ one has
$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2. \tag{1.27}$$
It turns out that for natural ensembles of random $m \times n$ matrices, a typical matrix from the ensemble satisfies RIP($\delta$, k) with small $\delta$ and k "nearly as large as m," and that RIP($\frac{1}{6}$, 2s) implies the nullspace condition, and more. The simplest versions of the corresponding results are as follows.

Proposition 1.7. Given $\delta \in (0, \frac{1}{5}]$, with properly selected positive $c = c(\delta)$, $d = d(\delta)$, $f = f(\delta)$, for all $m \le n$ and all positive integers k such that
$$k \le \frac{m}{c\ln(n/m) + d} \tag{1.28}$$
the probability for a random $m \times n$ matrix A with independent $\mathcal{N}(0, \frac{1}{m})$ entries to satisfy RIP($\delta$, k) is at least $1 - \exp\{-fm\}$.
For proof, see Section 1.5.3.

Proposition 1.8. Let $A \in \mathbf{R}^{m \times n}$ satisfy RIP($\delta$, 2s) for some $\delta < 1/3$ and positive integer s. Then

(i) The pair $\left(H = \frac{s^{-1/2}}{\sqrt{1-\delta}} I_m, \|\cdot\|_2\right)$ satisfies the condition $Q_2\left(s, \frac{\delta}{1-\delta}\right)$ associated with A;

(ii) The pair $\left(H = \frac{1}{1-\delta} A, \|\cdot\|_\infty\right)$ satisfies the condition $Q_2\left(s, \frac{\delta}{1-\delta}\right)$ associated with A.

For proof, see Section 1.5.4.

1.3.2 Verifiable sufficient conditions for $Q_q(s, \kappa)$
When speaking about verifiable sufficient conditions for a pair $(H, \|\cdot\|)$ to satisfy $Q_q(s, \kappa)$, it is convenient to restrict ourselves to the case where H, like A, is an $m \times n$ matrix, and $\|\cdot\| = \|\cdot\|_\infty$.

Proposition 1.9. Let A be an $m \times n$ sensing matrix, and $s \le n$ be a sparsity level. Given an $m \times n$ matrix H and $q \in [1, \infty]$, let us set
$$\nu_{s,q}[H] = \max_{j \le n}\|\mathrm{Col}_j[I - H^T A]\|_{s,q}, \tag{1.29}$$
where $\mathrm{Col}_j[C]$ is the j-th column of matrix C. Then
$$\|w\|_{s,q} \le s^{1/q}\|H^T A w\|_\infty + \nu_{s,q}[H]\,\|w\|_1 \quad \forall w \in \mathbf{R}^n, \tag{1.30}$$
implying that the pair $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_q(s, s^{1-\frac{1}{q}}\nu_{s,q}[H])$.

Proof is immediate. Setting $V = I - H^T A$, we have
$$\|w\|_{s,q} = \|[H^T A + V]w\|_{s,q} \le \|H^T A w\|_{s,q} + \|Vw\|_{s,q} \le s^{1/q}\|H^T A w\|_\infty + \sum_j |w_j|\,\|\mathrm{Col}_j[V]\|_{s,q} \le s^{1/q}\|H^T A w\|_\infty + \nu_{s,q}[H]\,\|w\|_1.$$
✷
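The quantity $\nu_{s,q}[H]$ from (1.29) is straightforward to compute, and inequality (1.30) can be checked numerically. The sketch below is illustrative only: the matrices A and H are random stand-ins (by Proposition 1.9, (1.30) holds for every H of the right size), and the helper names `norm_sq` and `nu` are assumptions.

```python
import numpy as np

def norm_sq(v, s, q):
    """||v||_{s,q}: the l_q norm of the s largest-magnitude entries of v."""
    top = np.sort(np.abs(np.asarray(v, dtype=float)))[::-1][:s]
    if np.isinf(q):
        return float(top.max())
    return float(np.sum(top ** q) ** (1.0 / q))

def nu(H, A, s, q):
    """nu_{s,q}[H] = max_j ||Col_j[I - H^T A]||_{s,q}, as in (1.29)."""
    V = np.eye(A.shape[1]) - H.T @ A
    return max(norm_sq(V[:, j], s, q) for j in range(V.shape[1]))

# Illustrative data (assumptions): Gaussian A with unit-norm columns, H = A / 2.
rng = np.random.default_rng(2)
m, n, s, q = 20, 40, 3, 2.0
A = rng.standard_normal((m, n))
A /= np.linalg.norm(A, axis=0)
H = 0.5 * A

nu_val = nu(H, A, s, q)
kappa = s ** (1.0 - 1.0 / q) * nu_val    # (H, ||.||_inf) satisfies Q_q(s, kappa)

# Check (1.30) on a few random vectors w.
gaps = []
for _ in range(100):
    w = rng.standard_normal(n)
    lhs = norm_sq(w, s, q)
    rhs = s ** (1.0 / q) * np.max(np.abs(H.T @ A @ w)) + nu_val * np.linalg.norm(w, 1)
    gaps.append(rhs - lhs)
print(nu_val, kappa, min(gaps))
```

For a generic H such as the one above, the resulting κ is usually far above 1/2 and therefore useless for recovery guarantees; the point of the construction is that H can be optimized.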
Observe that the function $\nu_{s,q}[H]$ is an efficiently computable convex function of H, so that the set
$$\mathcal{H}^\kappa_{s,q} = \{H \in \mathbf{R}^{m \times n} : \nu_{s,q}[H] \le s^{\frac{1}{q}-1}\kappa\} \tag{1.31}$$
is a computationally tractable convex set. When this set is nonempty for some $\kappa < 1/2$, every point H in this set is a contrast matrix such that $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_q(s, \kappa)$, that is, we can find contrast matrices making ℓ1 minimization valid. Moreover, we can design the contrast matrix, e.g., by minimizing over $\mathcal{H}^\kappa_{s,q}$ the function $\|H\|_{1,2}$, thus optimizing the sensitivity of the corresponding ℓ1 recoveries to Gaussian observation noise; see items C, D in Section 1.2.5.

Explanation. The sufficient condition for s-goodness of A stated in Proposition 1.9 looks as if coming out of thin air; in fact it is a particular case of a simple and general construction as follows. Let $f(x)$ be a real-valued convex function on $\mathbf{R}^n$, and $X \subset \mathbf{R}^n$ be a nonempty bounded polytope represented as
$$X = \{x \in \mathrm{Conv}\{g_1, ..., g_N\} : Ax = 0\},$$
where $\mathrm{Conv}\{g_1, ..., g_N\} = \{\sum_i \lambda_i g_i : \lambda \ge 0, \sum_i \lambda_i = 1\}$ is the convex hull of vectors $g_1, ..., g_N$. Our goal is to upper-bound the maximum $\mathrm{Opt} = \max_{x \in X} f(x)$; this is a meaningful problem, since precisely maximizing a convex function over a polyhedron typically is a computationally intractable task. Let us act as follows: clearly, for any matrix H of the same size as A we have $\max_{x \in X} f(x) = \max_{x \in X} f([I - H^T A]x)$, since on X we have $[I - H^T A]x = x$. As a result,
$$\mathrm{Opt} := \max_{x \in X} f(x) = \max_{x \in X} f([I - H^T A]x) \le \max_{x \in \mathrm{Conv}\{g_1, ..., g_N\}} f([I - H^T A]x) = \max_{j \le N} f([I - H^T A]g_j).$$
We get a parametric (the parameter being H) upper bound on Opt, namely, the bound $\max_{j \le N} f([I - H^T A]g_j)$. This parametric bound is convex in H, and thus is well suited for minimization over this parameter. The result of Proposition 1.9 is inspired by this construction as applied to the
nullspace property: given an $m \times n$ sensing matrix A and setting
$$X = \{x \in \mathbf{R}^n : \|x\|_1 \le 1, Ax = 0\} = \{x \in \mathrm{Conv}\{\pm e_1, ..., \pm e_n\} : Ax = 0\}$$
($e_i$ are the basic orths in $\mathbf{R}^n$), A is s-good if and only if
$$\mathrm{Opt}_s := \max_{x \in X}\{f(x) := \|x\|_{s,1}\} < 1/2.$$
A verifiable sufficient condition for this, as yielded by the above construction, is the existence of an $m \times n$ matrix H such that
$$\max_{j \le n}\max\left[f([I_n - H^T A]e_j),\ f(-[I_n - H^T A]e_j)\right] < 1/2,$$
or, which is the same,
$$\max_j \|\mathrm{Col}_j[I_n - H^T A]\|_{s,1} < 1/2.$$
This observation brings to our attention the matrix $I - H^T A$ with varying H and the idea of expressing sufficient conditions for s-goodness and related properties in terms of this matrix.

1.3.3 Tractability of $Q_\infty(s, \kappa)$
As we have already mentioned, the conditions $Q_q(s, \kappa)$ are intractable, in the sense that we do not know how to verify whether a given pair $(H, \|\cdot\|)$ satisfies the condition. Surprisingly, this is not the case with the strongest of these conditions, the one with $q = \infty$. Namely,

Proposition 1.10. Let A be an $m \times n$ sensing matrix, s be a sparsity level, and $\kappa \ge 0$. Then whenever a pair $(\bar H, \|\cdot\|)$ satisfies the condition $Q_\infty(s, \kappa)$, there exists an $m \times n$ matrix H such that
$$\|\mathrm{Col}_j[I_n - H^T A]\|_{s,\infty} = \|\mathrm{Col}_j[I_n - H^T A]\|_\infty \le s^{-1}\kappa, \quad 1 \le j \le n$$
(so that $(H, \|\cdot\|_\infty)$ satisfies $Q_\infty(s, \kappa)$ by Proposition 1.9), and also
$$\|H^T\eta\|_\infty \le \|\bar H^T\eta\| \quad \forall \eta \in \mathbf{R}^m. \tag{1.32}$$
In addition, the $m \times n$ contrast matrix H such that the pair $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa)$ with as small $\kappa$ as possible can be found as follows. Consider n LP programs
$$\mathrm{Opt}_i = \min_{\nu, h}\left\{\nu : \|A^T h - e_i\|_\infty \le \nu\right\}, \tag{$\#_i$}$$
where $e_i$ is the i-th basic orth of $\mathbf{R}^n$. Let $\mathrm{Opt}_i$ and $h_i$, $i = 1, ..., n$, be the optimal values and optimal solutions of these problems; we set $H = [h_1, ..., h_n]$; the corresponding value of $\kappa$ is
$$\kappa_* = s\max_i \mathrm{Opt}_i.$$
Besides this, there exists a transparent alternative description of the quantities $\mathrm{Opt}_i$ (and thus of $\kappa_*$); specifically,
$$\mathrm{Opt}_i = \max_x\{x_i : \|x\|_1 \le 1, Ax = 0\}. \tag{1.33}$$
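The LP programs $(\#_i)$ and the alternative description (1.33) can be tried out on a small instance. The following is a sketch under assumptions: it uses SciPy's `linprog`, a small random Gaussian A, and hypothetical function names `contrast_column` and `opt_via_nullspace`; it is meant to illustrate the construction, not to be an efficient implementation.

```python
import numpy as np
from scipy.optimize import linprog

def contrast_column(A, i):
    """Solve (#_i): min over (h, nu) of nu subject to ||A^T h - e_i||_inf <= nu."""
    m, n = A.shape
    e = np.zeros(n)
    e[i] = 1.0
    ones = np.ones((n, 1))
    c = np.concatenate([np.zeros(m), [1.0]])
    A_ub = np.block([[A.T, -ones], [-A.T, -ones]])   # +-(A^T h - e_i) <= nu
    b_ub = np.concatenate([e, -e])
    bounds = [(None, None)] * m + [(0, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m], res.fun

def opt_via_nullspace(A, i):
    """Right-hand side of (1.33): max { x_i : ||x||_1 <= 1, A x = 0 }."""
    m, n = A.shape
    I = np.eye(n)
    c = np.zeros(2 * n)
    c[i] = -1.0                                       # maximize x_i
    A_ub = np.block([[I, -I], [-I, -I],
                     [np.zeros((1, n)), np.ones((1, n))]])
    b_ub = np.concatenate([np.zeros(2 * n), [1.0]])
    A_eq = np.hstack([A, np.zeros((m, n))])
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.zeros(m),
                  bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return -res.fun

# Illustrative instance (assumption): small Gaussian sensing matrix.
rng = np.random.default_rng(3)
m, n, s = 10, 20, 2
A = rng.standard_normal((m, n))
cols, opts = zip(*[contrast_column(A, i) for i in range(n)])
H = np.column_stack(cols)                             # H = [h_1, ..., h_n]
kappa_star = s * max(opts)
print(kappa_star, opts[0], opt_via_nullspace(A, 0))
```

The i-th row of $I_n - H^TA$ is $e_i - A^Th_i$ transposed, so the largest magnitude of an entry of $I_n - H^TA$ equals $\max_i \mathrm{Opt}_i$, which is how $\kappa_*$ arises.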
For proof, see Section 1.5.5.

Taken along with (1.32) and error bounds of Theorems 1.3, 1.4, and 1.5, Proposition 1.10 says that

as far as the condition $Q_\infty(s, \kappa)$ is concerned, we lose nothing when restricting ourselves to pairs $(H \in \mathbf{R}^{m \times n}, \|\cdot\|_\infty)$ and contrast matrices H satisfying the condition
$$|[I_n - H^T A]_{ij}| \le s^{-1}\kappa, \tag{1.34}$$
implying that $(H, \|\cdot\|_\infty)$ satisfies $Q_\infty(s, \kappa)$.

The good news is that (1.34) is an explicit convex constraint on H (in fact, even on H and $\kappa$), so that we can solve the design problems, where we want to optimize a convex function of H under the requirement that $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa)$ (and, perhaps, additional convex constraints on H and $\kappa$).

1.3.3.1 Mutual Incoherence
The simplest (and up to some point in time, the only) verifiable sufficient condition for s-goodness of a sensing matrix A is expressed in terms of mutual incoherence of A, defined as
$$\mu(A) = \max_{i \ne j}\frac{|\mathrm{Col}_i^T[A]\,\mathrm{Col}_j[A]|}{\|\mathrm{Col}_i[A]\|_2^2}. \tag{1.35}$$
This quantity is well defined whenever A has no zero columns (otherwise A is not even 1-good). Note that when A is normalized to have all columns of equal $\|\cdot\|_2$ lengths,¹¹ $\mu(A)$ is small when the columns of A are nearly mutually orthogonal. The standard related result is that

whenever A and a positive integer s are such that $\frac{2\mu(A)}{1+\mu(A)} < \frac{1}{s}$, A is s-good.

It is immediately seen that the latter condition is weaker than what we can get with the aid of (1.34):

Proposition 1.11. Let A be an $m \times n$ matrix, and let the columns of the $m \times n$ matrix H be given by
$$\mathrm{Col}_j(H) = \frac{1}{(1 + \mu(A))\|\mathrm{Col}_j(A)\|_2^2}\,\mathrm{Col}_j(A), \quad 1 \le j \le n.$$
Then
$$|[I_n - H^T A]_{ij}| \le \frac{\mu(A)}{1 + \mu(A)} \quad \forall i, j. \tag{1.36}$$
In particular, when $\frac{2\mu(A)}{1+\mu(A)} < \frac{1}{s}$, A is s-good.

Proof. With H as above, the diagonal entries in $I - H^T A$ are equal to $1 - \frac{1}{1+\mu(A)} = \frac{\mu(A)}{1+\mu(A)}$, while by definition of mutual incoherence the magnitudes of the off-diagonal entries in $I - H^T A$ are $\le \frac{\mu(A)}{1+\mu(A)}$ as well, implying (1.36). The "in particular" claim is given by (1.36) combined with Proposition 1.9. ✷

¹¹ As far as ℓ1 minimization is concerned, this normalization is non-restrictive: we always can enforce it by diagonal scaling of the signal underlying observations (1.1), and ℓ1 minimization in scaled variables is the same as weighted ℓ1 minimization in original variables.
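Both the mutual incoherence (1.35) and the contrast matrix of Proposition 1.11 are cheap to compute. The sketch below illustrates them on a random matrix; the data, the helper name `mutual_incoherence`, and the variable `s_certified` are illustrative assumptions.

```python
import math
import numpy as np

def mutual_incoherence(A):
    """mu(A) = max over i != j of |Col_i^T Col_j| / ||Col_i||_2^2, as in (1.35)."""
    G = A.T @ A
    d = np.diag(G).copy()                 # squared column norms
    R = np.abs(G) / d[:, None]            # R[i, j] = |Col_i^T Col_j| / ||Col_i||^2
    np.fill_diagonal(R, 0.0)
    return float(R.max())

# Illustrative data (assumption): a random Gaussian matrix without zero columns.
rng = np.random.default_rng(4)
m, n = 64, 128
A = rng.standard_normal((m, n))

mu = mutual_incoherence(A)
# Contrast matrix from Proposition 1.11: Col_j(H) = Col_j(A) / ((1+mu) ||Col_j(A)||^2).
H = A / ((1.0 + mu) * np.sum(A ** 2, axis=0))
max_entry = float(np.max(np.abs(np.eye(n) - H.T @ A)))

# Largest s with 2 mu / (1 + mu) < 1/s, i.e. the largest integer s < (1 + mu) / (2 mu).
s_certified = math.ceil((1.0 + mu) / (2.0 * mu)) - 1
print(mu, max_entry, mu / (1.0 + mu), s_certified)
```

For random matrices of moderate size the certified sparsity level is typically very small, in line with the limits of verifiable conditions discussed in this section.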
1.3.3.2 From RIP to conditions $Q_q(\cdot,\kappa)$

It turns out that when A is RIP(δ, k) and q ≥ 2, it is easy to point out pairs $(H,\|\cdot\|)$ satisfying $Q_q(t,\kappa)$ with a desired κ > 0 and properly selected t:

Proposition 1.12. Let A be an m × n sensing matrix satisfying RIP(δ, 2s) with some s and some δ ∈ (0, 1), and let q ∈ [2, ∞] and κ > 0 be given. Then

(i) Whenever a positive integer t satisfies
$$t\le\min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac{q}{q-1}},\ s^{\frac{q-2}{2q-2}}\right]s^{\frac{q}{2q-2}}, \tag{1.37}$$
the pair $\left(H=\frac{t^{-1/q}}{\sqrt{1-\delta}}\,I_m,\ \|\cdot\|_2\right)$ satisfies $Q_q(t,\kappa)$;

(ii) Whenever a positive integer t satisfies (1.37), the pair $\left(H=\frac{s^{1/2}\,t^{-1/q}}{1-\delta}\,A,\ \|\cdot\|_\infty\right)$ satisfies $Q_q(t,\kappa)$.
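As a quick numerical reading of (1.37), the following sketch evaluates the largest integer t allowed by the bound for given s, δ, κ and q, treating q = ∞ by taking the limits of the exponents. The function name and the tiny rounding tolerance are assumptions made for this illustration.

```python
import math

def max_t_from_rip(s, delta, kappa, q):
    """Largest positive integer t allowed by (1.37), or 0 if there is none.

    q may be math.inf; the exponents are then replaced by their limits as q -> infinity.
    """
    if q == math.inf:
        e1, e2, e3 = 1.0, 0.5, 0.5      # limits of q/(q-1), (q-2)/(2q-2), q/(2q-2)
    else:
        e1 = q / (q - 1.0)
        e2 = (q - 2.0) / (2.0 * q - 2.0)
        e3 = q / (2.0 * q - 2.0)
    bound = min((kappa * (1.0 - delta) / delta) ** e1, s ** e2) * s ** e3
    # The small tolerance only guards against floating-point rounding at integer bounds.
    return int(math.floor(bound + 1e-12))

t_inf = max_t_from_rip(100, 0.2, 1.0 / 3.0, math.inf)   # bound is (4/3) * sqrt(100)
t_two = max_t_from_rip(9, 0.2, 1.0 / 3.0, 2)            # bound is min[(4/3)^2, 1] * 9
```

With δ = 0.2 and κ = 1/3 the bound for q = ∞ reads t ≤ (4/3)√s once s ≥ 2, which is the O(1)√s scaling discussed right after the proposition.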
For proof, see Section 1.5.4. The most important consequence of Proposition 1.12 deals with the case of q = ∞ and states that when s-goodness of a sensing matrix A can be ensured by the difficult to verify condition RIP(δ, 2s) with, say, δ = 0.2, a somewhat worse level of sparsity, $t=O(1)\sqrt{s}$ with a properly selected absolute constant O(1), can be certified via condition $Q_\infty(t,\frac13)$: there exists a pair $(H,\|\cdot\|_\infty)$ satisfying this condition. The point is that by Proposition 1.10, if the condition $Q_\infty(t,\frac13)$ can be satisfied at all, a pair $(H,\|\cdot\|_\infty)$ satisfying this condition can be found efficiently. Unfortunately, the significant “dropdown” in the level of sparsity when passing from unverifiable RIP to verifiable $Q_\infty$ is inevitable; this bad news is what is on our agenda now.

1.3.3.3 Limits of performance of verifiable sufficient conditions for goodness
Proposition 1.13. Let A be an m × n sensing matrix which is “essentially nonsquare,” specifically, such that 2m ≤ n, and let q ∈ [1, ∞]. Whenever a positive integer s and an m × n matrix H are linked by the relation
$$\|\mathrm{Col}_j[I_n-H^TA]\|_{s,q}<\tfrac12\,s^{\frac1q-1},\quad 1\le j\le n, \tag{1.38}$$
one has
$$s\le\sqrt{m}. \tag{1.39}$$
As a result, the sufficient condition for the validity of $Q_q(s,\kappa)$ with κ < 1/2 from Proposition 1.9 can never be satisfied when $s>\sqrt{m}$. Similarly, the verifiable sufficient condition $Q_\infty(s,\kappa)$, κ < 1/2, for s-goodness of A cannot be satisfied when $s>\sqrt{m}$.

Figure 1.4: Erroneous $\ell_1$ recovery of a 25-sparse signal, no observation noise. Top: frequency domain, o: true signal, +: recovery. Bottom: time domain.
For proof, see Section 1.5.6. We see that unless A is “nearly square,” our (same as all others known to us) verifiable sufficient conditions for s-goodness are unable to justify this property for “large” s. This unpleasant fact is in full accordance with the already mentioned fact that no individual provably s-good “essentially nonsquare” m × n matrices with $s\ge O(1)\sqrt{m}$ are known. Matrices for which our verifiable sufficient conditions do establish s-goodness with $s\le O(1)\sqrt{m}$ do exist.

How it works: Numerical illustration. Let us apply our machinery to the 256 × 512 randomly selected submatrix A of the matrix of the 512 × 512 Inverse Discrete Cosine Transform which we used in the experiments reported in Figure 1.3. These experiments exhibit nice performance of $\ell_1$ minimization when recovering sparse (even nearly sparse) signals with as many as 64 nonzeros. In fact, the level of goodness of A is at most 24, as is witnessed in Figure 1.4.

In order to upper-bound the level of goodness of a matrix A, one can try to maximize the convex function $\|w\|_{s,1}$ over the set $W=\{w: Aw=0,\ \|w\|_1\le1\}$: if, for a given s, the maximum of $\|\cdot\|_{s,1}$ over W is ≥ 1/2, the matrix is not s-good, since it does not possess the nullspace property. Now, while global maximization of the convex function $\|w\|_{s,1}$ over W is difficult, we can try to find suboptimal solutions as follows. Let us start with a vector $w_1\in W$ of $\|\cdot\|_1$-norm 1, and let $u_1$ be obtained from $w_1$ by replacing the s entries in $w_1$ of largest magnitudes by the signs of these entries and zeroing out all other entries, so that $w_1^Tu_1=\|w_1\|_{s,1}$. After $u_1$ is found, let us solve the LO program $\max_w\{u_1^Tw: w\in W\}$. $w_1$ is a feasible solution to this problem, so that for the optimal solution $w_2$ we have $u_1^Tw_2\ge u_1^Tw_1=\|w_1\|_{s,1}$; this inequality, by virtue of what $u_1$ is, implies that $\|w_2\|_{s,1}\ge\|w_1\|_{s,1}$, and, by construction, $w_2\in W$. We now can iterate the construction, with $w_2$ in the role of $w_1$, to get $w_3\in W$ with $\|w_3\|_{s,1}\ge\|w_2\|_{s,1}$, etc. Proceeding in this way, we generate a sequence of points from W with monotonically increasing value of the objective $\|\cdot\|_{s,1}$ we want to maximize.

We terminate this recurrence either when the achieved value of the objective becomes ≥ 1/2 (then we know for sure that A is not s-good, and can proceed to investigating s-goodness for a smaller value of s) or when the recurrence gets stuck, that is, when the observed progress in the objective falls below a given threshold, say, $10^{-6}$. When this happens, we can restart the process from a new starting point randomly selected in W, after getting stuck, restart again, etc., until we exhaust our time budget. The output of the process is the best of the points we have generated, that is, the one with the largest $\|\cdot\|_{s,1}$. Applying this approach to the matrix A in question, in a couple of minutes it turns out that the matrix is at most 24-good.
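The monotonicity of the recurrence above rests on one elementary fact: the vector u built from a point w satisfies uᵀw = ‖w‖ₛ,₁, while uᵀw′ ≤ ‖w′‖ₛ,₁ for every other w′. The sketch below illustrates only this linearization step; the LO solve over W is deliberately omitted because it needs an LP solver, and the sample vectors are hypothetical (they are not claimed to lie in the kernel of any particular A).

```python
def s_norm_1(w, s):
    """||w||_{s,1}: sum of the s largest magnitudes of entries of w."""
    return sum(sorted((abs(v) for v in w), reverse=True)[:s])

def linearization(w, s):
    """Vector u carrying the signs of the s largest-magnitude entries of w, zero elsewhere."""
    top = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[:s]
    u = [0.0] * len(w)
    for i in top:
        u[i] = float((w[i] > 0) - (w[i] < 0))   # sign of w[i]
    return u

def linear_value(u, w):
    """u^T w, the objective of the LO program solved at each step."""
    return sum(a * b for a, b in zip(u, w))

# Hypothetical points of l1-norm 1 (illustration only).
w1 = [0.4, -0.3, 0.1, -0.2]
w2 = [0.5, -0.5, 0.0, 0.0]
u1 = linearization(w1, 2)
tight = linear_value(u1, w1)    # equals ||w1||_{2,1}
lower = linear_value(u1, w2)    # never exceeds ||w2||_{2,1}
```

Any w₂ that improves the linear objective u₁ᵀw therefore also improves ‖·‖ₛ,₁, which is why each LO step cannot decrease the quantity being maximized.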
One can ask how it may happen that previous experiments with recovering 64-sparse signals went fine, when in fact some 25-sparse signals cannot be recovered by $\ell_1$ minimization even in the ideal noiseless case. The answer is simple: in our experiments, we dealt with randomly selected signals, and typical randomly selected data are much nicer, whatever the purpose of a numerical experiment, than the worst-case data.

It is also interesting to understand which level of goodness we can certify using our verifiable sufficient conditions. Computations show that the fully verifiable (and strongest in our scale of sufficient conditions for s-goodness) condition $Q_\infty(s,\kappa)$ can be satisfied with κ < 1/2 when s is as large as 7 and κ = 0.4887, and cannot be satisfied with κ < 1/2 when s = 8. As for Mutual Incoherence, it can only justify 3-goodness, no more. We can hardly be happy with the resulting bounds (goodness at least 7 and at most 24); however, it could be worse.
1.4 EXERCISES FOR CHAPTER 1

Exercise 1.1. The k-th Hadamard matrix, $H_k$ (here k is a nonnegative integer), is the $n_k\times n_k$ matrix, $n_k=2^k$, given by the recurrence
$$H_0=[1];\quad H_{k+1}=\begin{bmatrix}H_k&H_k\\ H_k&-H_k\end{bmatrix}.$$
In the sequel, we assume that k > 0. Now comes the exercise:

1. Check that $H_k$ is a symmetric matrix with entries ±1, and columns of the matrix are mutually orthogonal, so that $H_k/\sqrt{n_k}$ is an orthogonal matrix.
2. Check that when k > 0, $H_k$ has just two distinct eigenvalues, $\sqrt{n_k}$ and $-\sqrt{n_k}$, each of multiplicity $m_k:=2^{k-1}=n_k/2$.
3. Prove that whenever f is an eigenvector of $H_k$, one has $\|f\|_\infty\le\|f\|_1/\sqrt{n_k}$.

Derive from this observation the conclusion as follows:
Let $a_1,...,a_{m_k}\in\mathbf{R}^{n_k}$ be unit vectors orthogonal to each other which are eigenvectors of $H_k$ with eigenvalues $\sqrt{n_k}$ (by the above, the dimension of the eigenspace of $H_k$ associated with the eigenvalue $\sqrt{n_k}$ is $m_k$, so that the required $a_1,...,a_{m_k}$ do exist), and let A be the $m_k\times n_k$ matrix with the rows $a_1^T,...,a_{m_k}^T$. For every x ∈ Ker A it holds
$$\|x\|_\infty\le\frac{1}{\sqrt{n_k}}\|x\|_1,$$
whence A satisfies the nullspace property whenever the sparsity s satisfies $2s<\sqrt{n_k}=\sqrt{2m_k}$. Moreover, there exists (and can be found efficiently) an $m_k\times n_k$ contrast matrix H = Hk such that for every $s<\frac12\sqrt{n_k}$, the pair $(H_k,\|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s,\kappa_s=s/\sqrt{n_k})$ associated ... O(1)n/ mo , for properly selected absolute constant O(1).

Exercise 1.5. Utilize the results of Exercise 1.3 in a numerical experiment as follows.

• select n as an integer power $2^k$ of 2, say, $n=2^{10}=1024$;
• select a “representative” sequence M of values of m, 1 ≤ m < n, including values of m close to n and “much smaller” than n, say, M = {2, 5, 8, 16, 32, 64, 128, 256, 512, 7, 896, 960, 992, 1008, 1016, 1020, 1022, 1023};
• for every m ∈ M,
  – generate at random an m × n submatrix A of the n × n Hadamard matrix $H_k$ and utilize the result of item 4 of Exercise 1.3 in order to find the largest s such that the s-goodness of A can be certified via the condition $Q_\infty(\cdot,\cdot)$; call s(m) the resulting value of s;
  – generate a moderate sample of Gaussian m × n sensing matrices $A_i$ with independent N(0, 1/m) entries and use the construction from Exercise 1.2 to upper-bound the largest s for which a matrix from the sample satisfies RIP(1/3, 2s); call $\widehat{s}(m)$ the largest, over your $A_i$'s, of the resulting upper bounds.

The goal of the exercise is to compare the computed values of s(m) and $\widehat{s}(m)$; in other words, we again want to understand how “theoretically perfect” RIP compares to the “conservative restricted scope” condition $Q_\infty$.
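For readers who want to experiment with Exercises 1.1 and 1.5, the Hadamard matrices can be generated directly from the recurrence. The sketch below is one possible pure-Python construction (function names are assumptions), together with numerical checks of symmetry, of the relation HₖᵀHₖ = nₖI, and of the zero trace that is consistent with the two eigenvalues ±√nₖ having equal multiplicities.

```python
def hadamard(k):
    """Return H_k as a list of rows, using H_0 = [1], H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    H = [[1]]
    for _ in range(k):
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def is_symmetric(H):
    n = len(H)
    return all(H[i][j] == H[j][i] for i in range(n) for j in range(n))

def gram_is_scaled_identity(H):
    """Check H^T H = n I: columns are mutually orthogonal with squared norm n."""
    n = len(H)
    cols = list(zip(*H))
    return all(sum(a * b for a, b in zip(cols[i], cols[j])) == (n if i == j else 0)
               for i in range(n) for j in range(n))

def trace(H):
    return sum(H[i][i] for i in range(len(H)))

H3 = hadamard(3)   # the 8 x 8 Hadamard matrix
```

A random m × n submatrix for Exercise 1.5 can then be formed by sampling m distinct row indices, for instance with `random.sample(range(n), m)`.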
1.5 PROOFS

1.5.1 Proofs of Theorems 1.3, 1.4

All we need is to prove Theorem 1.4, since Theorem 1.3 is the particular case $\varkappa=\kappa<1/2$ of Theorem 1.4. (Below, $\varkappa$ is the parameter of the condition $Q_1$ and $\kappa$ is the parameter of the condition $Q_q$.)

Let us fix $x\in\mathbf{R}^n$ and $\eta\in\Xi_\rho$, and let us set $\widehat{x}=\widehat{x}_{\mathrm{reg}}(Ax+\eta)$. Let also I ⊂ {1, ..., n} be the set of indexes of the s entries in x of largest magnitudes, $I^o$ be the complement of I in {1, ..., n}, and, for $w\in\mathbf{R}^n$, $w_I$ and $w_{I^o}$ be the vectors obtained from w by zeroing entries with indexes j ∉ I and j ∉ $I^o$, respectively, and keeping the remaining entries intact. Finally, let $z=\widehat{x}-x$.

1°. By the definition of $\Xi_\rho$ and due to $\eta\in\Xi_\rho$, we have
$$\|H^T([Ax+\eta]-Ax)\|\le\rho, \tag{1.40}$$
so that x is a feasible solution to the optimization problem specifying $\widehat{x}$, whence $\|\widehat{x}\|_1\le\|x\|_1$. We therefore have
$$\begin{aligned}\|\widehat{x}_{I^o}\|_1&=\|\widehat{x}\|_1-\|\widehat{x}_I\|_1\le\|x\|_1-\|\widehat{x}_I\|_1=\|x_I\|_1+\|x_{I^o}\|_1-\|\widehat{x}_I\|_1\\&\le\|z_I\|_1+\|x_{I^o}\|_1,\end{aligned} \tag{1.41}$$
and therefore $\|z_{I^o}\|_1\le\|\widehat{x}_{I^o}\|_1+\|x_{I^o}\|_1\le\|z_I\|_1+2\|x_{I^o}\|_1$. It follows that
$$\|z\|_1=\|z_I\|_1+\|z_{I^o}\|_1\le2\|z_I\|_1+2\|x_{I^o}\|_1. \tag{1.42}$$
Further, by definition of $\widehat{x}$ we have $\|H^T([Ax+\eta]-A\widehat{x})\|\le\rho$, which combines with (1.40) to imply that
$$\|H^TA(\widehat{x}-x)\|\le2\rho. \tag{1.43}$$

2°. Since $(H,\|\cdot\|)$ satisfies $Q_1(s,\varkappa)$, we have
$$\|z\|_{s,1}\le s\|H^TAz\|+\varkappa\|z\|_1.$$
By (1.43), it follows that $\|z\|_{s,1}\le2s\rho+\varkappa\|z\|_1$, which combines with the evident inequality $\|z_I\|_1\le\|z\|_{s,1}$ (recall that Card(I) = s) and with (1.42) to imply that
$$\|z_I\|_1\le2s\rho+\varkappa\|z\|_1\le2s\rho+2\varkappa\|z_I\|_1+2\varkappa\|x_{I^o}\|_1,$$
whence
$$\|z_I\|_1\le\frac{2s\rho+2\varkappa\|x_{I^o}\|_1}{1-2\varkappa}.$$
Invoking (1.42), we conclude that
$$\|z\|_1\le\frac{4s}{1-2\varkappa}\left[\rho+\frac{\|x_{I^o}\|_1}{2s}\right]. \tag{1.44}$$

3°. Since $(H,\|\cdot\|)$ satisfies $Q_q(s,\kappa)$, we have
$$\|z\|_{s,q}\le s^{\frac1q}\|H^TAz\|+\kappa s^{\frac1q-1}\|z\|_1,$$
which combines with (1.44) and (1.43) to imply that
$$\|z\|_{s,q}\le s^{\frac1q}\,2\rho+\kappa s^{\frac1q}\,\frac{4\rho+2s^{-1}\|x_{I^o}\|_1}{1-2\varkappa}\le\frac{4s^{\frac1q}[1+\kappa-\varkappa]}{1-2\varkappa}\left[\rho+\frac{\|x_{I^o}\|_1}{2s}\right] \tag{1.45}$$
(we have taken into account that $\varkappa<1/2$ and $\kappa\ge\varkappa$). Let θ be the (s + 1)-st largest magnitude of entries in z, and let $w=z-z^s$. Now (1.45) implies that
$$\theta\le\|z\|_{s,q}\,s^{-\frac1q}\le\frac{4[1+\kappa-\varkappa]}{1-2\varkappa}\left[\rho+\frac{\|x_{I^o}\|_1}{2s}\right].$$
Hence invoking (1.44) we have
$$\begin{aligned}\|w\|_q&\le\|w\|_\infty^{\frac{q-1}{q}}\|w\|_1^{\frac1q}\le\theta^{\frac{q-1}{q}}\|z\|_1^{\frac1q}\\&\le\theta^{\frac{q-1}{q}}\,\frac{(4s)^{\frac1q}}{[1-2\varkappa]^{\frac1q}}\left[\rho+\frac{\|x_{I^o}\|_1}{2s}\right]^{\frac1q}\\&\le\frac{4s^{\frac1q}[1+\kappa-\varkappa]^{\frac{q-1}{q}}}{1-2\varkappa}\left[\rho+\frac{\|x_{I^o}\|_1}{2s}\right].\end{aligned}$$
Taking into account (1.45) and the fact that the supports of $z^s$ and w do not intersect, we get
$$\begin{aligned}\|z\|_q&\le2^{\frac1q}\max[\|z^s\|_q,\|w\|_q]=2^{\frac1q}\max[\|z\|_{s,q},\|w\|_q]\\&\le\frac{4(2s)^{\frac1q}[1+\kappa-\varkappa]}{1-2\varkappa}\left[\rho+\frac{\|x_{I^o}\|_1}{2s}\right].\end{aligned}$$
This bound combines with (1.44), the Moment inequality,¹² and with the relation $\|x_{I^o}\|_1=\|x-x^s\|_1$ to imply (1.16). ✷

¹² The Moment inequality states that if (Ω, µ) is a space with measure and f is a µ-measurable real-valued function on Ω, then $\phi(\rho)=\ln\left(\int_\Omega|f(\omega)|^{\frac1\rho}\mu(d\omega)\right)^{\rho}$ is a convex function of ρ on every segment Δ ⊂ [0, 1] such that φ(·) is well defined at the endpoints of Δ. As a corollary, when $x\in\mathbf{R}^n$ and 1 ≤ p ≤ q ≤ ∞, one has
$$\|x\|_p\le\|x\|_1^{\frac{q-p}{p(q-1)}}\,\|x\|_q^{\frac{q(p-1)}{p(q-1)}}.$$
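The norm interpolation inequality stated in the footnote is easy to sanity-check numerically. The sketch below evaluates both sides on random Gaussian vectors for one pair of finite exponents; the exponents, the sample size and the function names are arbitrary choices made for illustration, and a passing check is of course not a proof.

```python
import math
import random

def p_norm(x, p):
    """The l_p norm of x; p = math.inf gives the max-norm."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

def interpolation_exponents(p, q):
    """Exponents a, b with ||x||_p <= ||x||_1^a ||x||_q^b, for 1 <= p <= q, finite q > 1."""
    a = (q - p) / (p * (q - 1.0))
    b = q * (p - 1.0) / (p * (q - 1.0))
    return a, b

def interpolation_bound(x, p, q):
    a, b = interpolation_exponents(p, q)
    return p_norm(x, 1) ** a * p_norm(x, q) ** b

random.seed(0)
p, q = 1.5, 3.0
all_hold = all(
    p_norm(x, p) <= interpolation_bound(x, p, q) * (1 + 1e-12)
    for x in ([random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200))
)
a, b = interpolation_exponents(p, q)
```

Note that the two exponents always sum to 1, so the bound is a weighted geometric mean of the ℓ₁ and ℓ_q norms.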
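1.5.2 Proof of Theorem 1.5

Let us prove (i). As in the proof of Theorem 1.4, $\varkappa$ stands for the parameter of the condition $Q_1$ and $\kappa$ for the parameter of the condition $Q_q$. Let us fix $x\in\mathbf{R}^n$ and η, and let us set $\widehat{x}=\widehat{x}_{\mathrm{pen}}(Ax+\eta)$. Let also I ⊂ {1, ..., n} be the set of indexes of the s entries in x of largest magnitudes, $I^o$ be the complement of I in {1, ..., n}, and, for $w\in\mathbf{R}^n$, $w_I$ and $w_{I^o}$ be the vectors obtained from w by zeroing out all entries with indexes not in I and not in $I^o$, respectively. Finally, let $z=\widehat{x}-x$ and $\nu=\|H^T\eta\|$.

1°. We have
$$\|\widehat{x}\|_1+\lambda\|H^T(A\widehat{x}-Ax-\eta)\|\le\|x\|_1+\lambda\|H^T\eta\|$$
and
$$\|H^T(A\widehat{x}-Ax-\eta)\|=\|H^T(Az-\eta)\|\ge\|H^TAz\|-\|H^T\eta\|,$$
whence
$$\|\widehat{x}\|_1+\lambda\|H^TAz\|\le\|x\|_1+2\lambda\|H^T\eta\|=\|x\|_1+2\lambda\nu. \tag{1.46}$$
We have
$$\begin{aligned}\|\widehat{x}\|_1&=\|x+z\|_1=\|x_I+z_I\|_1+\|x_{I^o}+z_{I^o}\|_1\\&\ge\|x_I\|_1-\|z_I\|_1+\|z_{I^o}\|_1-\|x_{I^o}\|_1,\end{aligned}$$
which combines with (1.46) to imply that
$$\|x_I\|_1-\|z_I\|_1+\|z_{I^o}\|_1-\|x_{I^o}\|_1+\lambda\|H^TAz\|\le\|x\|_1+2\lambda\nu,$$
or, which is the same,
$$\|z_{I^o}\|_1-\|z_I\|_1+\lambda\|H^TAz\|\le2\|x_{I^o}\|_1+2\lambda\nu. \tag{1.47}$$
Since $(H,\|\cdot\|)$ satisfies $Q_1(s,\varkappa)$, we have
$$\|z_I\|_1\le\|z\|_{s,1}\le s\|H^TAz\|+\varkappa\|z\|_1,$$
so that
$$(1-\varkappa)\|z_I\|_1-\varkappa\|z_{I^o}\|_1-s\|H^TAz\|\le0. \tag{1.48}$$
Taking a weighted sum of (1.47) and (1.48), the weights being 1 and 2, respectively, we get
$$(1-2\varkappa)\left[\|z_I\|_1+\|z_{I^o}\|_1\right]+(\lambda-2s)\|H^TAz\|\le2\|x_{I^o}\|_1+2\lambda\nu,$$
whence, due to λ ≥ 2s,
$$\|z\|_1\le\frac{2\lambda\nu+2\|x_{I^o}\|_1}{1-2\varkappa}\le\frac{2\lambda}{1-2\varkappa}\left[\nu+\frac{\|x_{I^o}\|_1}{2s}\right]. \tag{1.49}$$
Further, by (1.46) we have
$$\lambda\|H^TAz\|\le\|x\|_1-\|\widehat{x}\|_1+2\lambda\nu\le\|z\|_1+2\lambda\nu,$$
which combines with (1.49) to imply that
$$\lambda\|H^TAz\|\le\frac{2\lambda\nu+2\|x_{I^o}\|_1}{1-2\varkappa}+2\lambda\nu=\frac{2\lambda\nu(2-2\varkappa)+2\|x_{I^o}\|_1}{1-2\varkappa}. \tag{1.50}$$
From $Q_q(s,\kappa)$ it follows that
$$\|z\|_{s,q}\le s^{\frac1q}\|H^TAz\|+\kappa s^{\frac1q-1}\|z\|_1,$$
which combines with (1.50) and (1.49) to imply that
$$\begin{aligned}\|z\|_{s,q}&\le s^{\frac1q-1}\left[s\|H^TAz\|+\kappa\|z\|_1\right]\le s^{\frac1q-1}\left[\frac{4s\nu(1-\varkappa)+\frac{2s}{\lambda}\|x_{I^o}\|_1}{1-2\varkappa}+\frac{\kappa\left[2\lambda\nu+\frac{\lambda}{s}\|x_{I^o}\|_1\right]}{1-2\varkappa}\right]\\&=s^{\frac1q}\,\frac{[4(1-\varkappa)+2s^{-1}\lambda\kappa]\nu+[2\lambda^{-1}+\kappa s^{-2}\lambda]\|x_{I^o}\|_1}{1-2\varkappa}\\&\le\frac{4s^{\frac1q}}{1-2\varkappa}\left[1+\frac{\kappa\lambda}{2s}-\varkappa\right]\left[\nu+\frac{\|x_{I^o}\|_1}{2s}\right]\end{aligned} \tag{1.51}$$
(recall that λ ≥ 2s, $\kappa\ge\varkappa$, and $\varkappa<1/2$). It remains to repeat the reasoning following (1.45) in item 3° of the proof of Theorem 1.4. Specifically, denoting by θ the (s + 1)-st largest magnitude of entries in z, (1.51) implies that
$$\theta\le s^{-1/q}\|z\|_{s,q}\le\frac{4}{1-2\varkappa}\left[1+\frac{\kappa\lambda}{2s}-\varkappa\right]\left[\nu+\frac{\|x_{I^o}\|_1}{2s}\right], \tag{1.52}$$
so that for the vector $w=z-z^s$ one has
$$\|w\|_q\le\theta^{1-\frac1q}\|w\|_1^{\frac1q}\le\frac{4(\lambda/2)^{\frac1q}}{1-2\varkappa}\left[1+\frac{\kappa\lambda}{2s}-\varkappa\right]^{\frac{q-1}{q}}\left[\nu+\frac{\|x_{I^o}\|_1}{2s}\right]$$
(we have used (1.52) and (1.49)). Hence, taking into account that $z^s$ and w have nonintersecting supports,
$$\begin{aligned}\|z\|_q&\le2^{\frac1q}\max[\|z^s\|_q,\|w\|_q]=2^{\frac1q}\max[\|z\|_{s,q},\|w\|_q]\\&\le\frac{4\lambda^{\frac1q}}{1-2\varkappa}\left[1+\frac{\kappa\lambda}{2s}-\varkappa\right]\left[\nu+\frac{\|x_{I^o}\|_1}{2s}\right]\end{aligned}$$
(we have used (1.51) along with λ ≥ 2s and $\kappa\ge\varkappa$). This combines with (1.49) and the Moment inequality to imply (1.18). All remaining claims of Theorem 1.5 are immediate corollaries of (1.18). ✷

1.5.3 Proof of Proposition 1.7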
1°. Assuming k ≤ m and selecting a set I of k indices from {1, ..., n} distinct from each other, consider an m × k submatrix $A_I$ of A comprised of columns with indexes from I, and let u be a unit vector in $\mathbf{R}^k$. The entries in the vector $m^{1/2}A_Iu$ are independent N(0, 1) random variables, so that for the random variable $\zeta_u=\sum_{i=1}^m(m^{1/2}A_Iu)_i^2$ and γ ∈ (−1/2, 1/2) it holds (in what follows, expectations and probabilities are taken w.r.t. our ensemble of random A's)
$$\ln\left(\mathbf{E}\{\exp\{\gamma\zeta_u\}\}\right)=m\ln\left(\frac{1}{\sqrt{2\pi}}\int e^{\gamma t^2-\frac12t^2}dt\right)=-\frac{m}{2}\ln(1-2\gamma).$$
Given α ∈ (0, 0.1] and selecting γ in such a way that $1-2\gamma=\frac{1}{1+\alpha}$, we get 0 < γ < 1/2 and therefore
$$\begin{aligned}\mathrm{Prob}\{\zeta_u>m(1+\alpha)\}&\le\mathbf{E}\{\exp\{\gamma\zeta_u\}\}\exp\{-m\gamma(1+\alpha)\}\\&=\exp\{-\tfrac{m}{2}\ln(1-2\gamma)-m\gamma(1+\alpha)\}\\&=\exp\{\tfrac{m}{2}[\ln(1+\alpha)-\alpha]\}\le\exp\{-\tfrac{m}{5}\alpha^2\},\end{aligned}$$
and similarly, selecting γ in such a way that $1-2\gamma=\frac{1}{1-\alpha}$, we get −1/2 < γ < 0 and therefore
$$\begin{aligned}\mathrm{Prob}\{\zeta_u<m(1-\alpha)\}&\le\mathbf{E}\{\exp\{\gamma\zeta_u\}\}\exp\{-m\gamma(1-\alpha)\}\\&=\exp\{-\tfrac{m}{2}\ln(1-2\gamma)-m\gamma(1-\alpha)\}\\&=\exp\{\tfrac{m}{2}[\ln(1-\alpha)+\alpha]\}\le\exp\{-\tfrac{m}{5}\alpha^2\},\end{aligned}$$
and we end up with
$$u\in\mathbf{R}^k,\ \|u\|_2=1\ \Rightarrow\ \begin{cases}\mathrm{Prob}\{A:\|A_Iu\|_2^2>1+\alpha\}\le\exp\{-\tfrac{m}{5}\alpha^2\},\\ \mathrm{Prob}\{A:\|A_Iu\|_2^2<1-\alpha\}\le\exp\{-\tfrac{m}{5}\alpha^2\}.\end{cases} \tag{1.53}$$
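The two concluding inequalities in item 1° reduce to the elementary bounds ln(1 + α) − α ≤ −(2/5)α² and ln(1 − α) + α ≤ −(2/5)α² on the range 0 < α ≤ 0.1. A minimal sketch that checks them on a grid (the grid and the function names are arbitrary choices for illustration):

```python
import math

def upper_tail_exponent(alpha):
    """(1/2)[ln(1 + alpha) - alpha]: exponent per unit of m in the bound on Prob{zeta_u > m(1 + alpha)}."""
    return 0.5 * (math.log1p(alpha) - alpha)

def lower_tail_exponent(alpha):
    """(1/2)[ln(1 - alpha) + alpha]: exponent per unit of m in the bound on Prob{zeta_u < m(1 - alpha)}."""
    return 0.5 * (math.log1p(-alpha) + alpha)

# Grid over (0, 0.1]; both exponents should not exceed -alpha^2 / 5.
grid = [0.1 * i / 1000 for i in range(1, 1001)]
holds = all(
    upper_tail_exponent(a) <= -a * a / 5 and lower_tail_exponent(a) <= -a * a / 5
    for a in grid
)
```

Both exponents behave like −α²/4 for small α, so the constant 1/5 leaves a visible margin on the whole range.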
2°. As above, let α ∈ (0, 0.1], let
$$M=1+2\alpha,\quad\epsilon=\frac{\alpha}{2(1+2\alpha)},$$
and let us build an ε-net on the unit sphere S in $\mathbf{R}^k$ as follows. We start with a point $u_1\in S$; after $\{u_1,...,u_t\}\subset S$ is already built, we check whether there is a point in S at $\|\cdot\|_2$-distance > ε from all points of the set. If this is the case, we add such a point to the net built so far and proceed with building the net; otherwise we terminate with the net $\{u_1,...,u_t\}$. By compactness of S and due to ε > 0, this process eventually terminates; upon termination, we have at our disposal the collection $\{u_1,...,u_N\}$ of unit vectors such that every two of them are at $\|\cdot\|_2$-distance > ε from each other, and every point from S is at distance at most ε from some point of the collection. We claim that the cardinality N of the resulting set can be bounded as
$$N\le\left[\frac{2+\epsilon}{\epsilon}\right]^k=\left[\frac{4+9\alpha}{\alpha}\right]^k\le\left[\frac{5}{\alpha}\right]^k. \tag{1.54}$$
Indeed, the interiors of the $\|\cdot\|_2$-balls of radius ε/2 centered at the points $u_1,...,u_N$ are mutually disjoint, and their union is contained in the $\|\cdot\|_2$-ball of radius 1 + ε/2 centered at the origin; comparing the volume of the union and that of the ball, we arrive at (1.54).

3°. Consider event E comprised of all realizations of A such that for all k-element subsets I of {1, ..., n} and all t ≤ N it holds
$$1-\alpha\le\|A_Iu_t\|_2^2\le1+\alpha. \tag{1.55}$$
By (1.53) and the union bound,
$$\mathrm{Prob}\{A\notin E\}\le2N\binom{n}{k}\exp\{-\tfrac{m}{5}\alpha^2\}. \tag{1.56}$$
We claim that
$$A\in E\ \Rightarrow\ 1-2\alpha\le\|A_Iu\|_2^2\le1+2\alpha\quad\forall\left(\begin{array}{l}I\subset\{1,...,n\}:\mathrm{Card}(I)=k\\ u\in\mathbf{R}^k:\|u\|_2=1\end{array}\right). \tag{1.57}$$
Indeed, let A ∈ E, let us fix I ⊂ {1, ..., n}, Card(I) = k, and let M be the maximal value of the quadratic form $f(u)=u^TA_I^TA_Iu$ on the unit $\|\cdot\|_2$-ball B, centered at the origin, in $\mathbf{R}^k$. In this ball, f is Lipschitz continuous with constant 2M w.r.t. $\|\cdot\|_2$; denoting by $\bar{u}$ a maximizer of the form on B, we lose nothing when assuming that $\bar{u}$ is a unit vector. Now let $u_s$ be the point of our net which is at $\|\cdot\|_2$-distance at most ε from $\bar{u}$. We have
$$M=f(\bar{u})\le f(u_s)+2M\epsilon\le1+\alpha+2M\epsilon,$$
whence
$$M\le\frac{1+\alpha}{1-2\epsilon}=1+2\alpha,$$
implying the right inequality in (1.57). Now let u be a unit vector in $\mathbf{R}^k$, and $u_s$ be a point in the net at $\|\cdot\|_2$-distance ≤ ε from u. We have
$$f(u)\ge f(u_s)-2M\epsilon\ge1-\alpha-2\,\frac{1+\alpha}{1-2\epsilon}\,\epsilon=1-2\alpha,$$
justifying the first inequality in (1.57).

The bottom line is:
$$\delta\in(0,0.2],\ 1\le k\le n\ \Rightarrow\ \mathrm{Prob}\{A: A\text{ does not satisfy RIP}(\delta,k)\}\le\underbrace{2\left[\frac{10}{\delta}\right]^k}_{\le\left(\frac{20}{\delta}\right)^k}\binom{n}{k}\exp\left\{-\frac{m\delta^2}{20}\right\}. \tag{1.58}$$
Indeed, setting α = δ/2, we have seen that whenever A ∈ E, we have $(1-\delta)\le\|Au\|_2^2\le(1+\delta)$ for all unit k-sparse u, which is nothing but RIP(δ, k); with this in mind, (1.58) follows from (1.56) and (1.54).

4°. It remains to verify that with properly selected positive quantities c, d, f depending solely on δ, for every k ≥ 1 satisfying (1.28) the right-hand side in (1.58) is at most exp{−f m}. Passing to logarithms, our goal is to ensure the relation
$$G:=a(\delta)m-b(\delta)k-\ln\binom{n}{k}\ge mf(\delta)>0\qquad\left[a(\delta)=\frac{\delta^2}{20},\ b(\delta)=\ln\left(\frac{20}{\delta}\right)\right] \tag{1.59}$$
provided that k ≥ 1 satisfies (1.28). Let k satisfy (1.28) with some c, d to be specified later, and let y = k/m. Assuming d ≥ 3, we have 0 ≤ y ≤ 1/3. Now, it is well known that
$$C:=\ln\binom{n}{k}\le n\left[\frac{k}{n}\ln\left(\frac{n}{k}\right)+\frac{n-k}{n}\ln\left(\frac{n}{n-k}\right)\right],$$
whence
$$\begin{aligned}C&\le n\Big[\frac{m}{n}\,y\ln\Big(\frac{n}{my}\Big)+\frac{n-k}{n}\underbrace{\ln\Big(1+\frac{k}{n-k}\Big)}_{\le\frac{k}{n-k}}\Big]\le n\Big[\frac{m}{n}\,y\ln\Big(\frac{n}{my}\Big)+\frac{k}{n}\Big]\\&=m\Big[y\ln\Big(\frac{n}{my}\Big)+y\Big]\le2my\ln\Big(\frac{n}{my}\Big)\end{aligned}$$
(recall that n ≥ m and y ≤ 1/3). It follows that
$$\begin{aligned}G&=a(\delta)m-b(\delta)k-C\ge a(\delta)m-b(\delta)ym-2my\ln\Big(\frac{n}{my}\Big)\\&=m\underbrace{\Big[a(\delta)-b(\delta)y-2y\ln\Big(\frac{n}{m}\Big)-2y\ln\Big(\frac1y\Big)\Big]}_{H},\end{aligned}$$
and all we need is to select c, d in such a way that (1.28) would imply that H ≥ f with some positive f = f(δ). This is immediate: we can find u(δ) > 0 such that when 0 ≤ y ≤ u(δ), we have $2y\ln(1/y)+b(\delta)y\le\frac13a(\delta)$; selecting d(δ) ≥ 3 large enough, (1.28) would imply y ≤ u(δ), and thus would imply
$$H\ge\frac23a(\delta)-2y\ln\Big(\frac{n}{m}\Big).$$
Now we can select c(δ) large enough for (1.28) to ensure that $2y\ln(\frac{n}{m})\le\frac13a(\delta)$. With the c, d just specified, (1.28) implies that $H\ge\frac13a(\delta)$, and we can take the latter quantity as f(δ). ✷
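To get a feeling for the constants in (1.58), one can evaluate the logarithm of its right-hand side directly. The sketch below does this with `math.lgamma` for the binomial coefficient and scans for the largest k at which the bound is still below 1; the function names and the sample sizes are assumptions made for illustration.

```python
import math

def log_binomial(n, k):
    """Natural logarithm of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_rip_failure_bound(m, n, k, delta):
    """Natural log of 2 (10/delta)^k C(n,k) exp(-m delta^2 / 20), the right-hand side of (1.58)."""
    return (math.log(2.0) + k * math.log(10.0 / delta)
            + log_binomial(n, k) - m * delta ** 2 / 20.0)

def largest_certified_k(m, n, delta, log_target=0.0):
    """Largest k <= m with the bound (1.58) below exp(log_target); 0 if there is none."""
    best = 0
    for k in range(1, m + 1):
        if log_rip_failure_bound(m, n, k, delta) < log_target:
            best = k
    return best

k_small = largest_certified_k(1000, 2000, 0.2)        # m delta^2 / 20 = 2 only
k_large = largest_certified_k(100000, 200000, 0.2)    # m delta^2 / 20 = 200
```

The small values of k obtained even for large m illustrate how conservative the explicit constants of this argument are; the proposition is a statement about scaling in m and n, not about sharp constants.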
1.5.4 Proof of Propositions 1.8 and 1.12

Let $x\in\mathbf{R}^n$, and let $x^1,...,x^q$ be obtained from x by the following construction: $x^1$ is obtained from x by zeroing all but the s entries of largest magnitudes; $x^2$ is obtained by the same procedure applied to $x-x^1$; $x^3$ by the same procedure applied to $x-x^1-x^2$; and so on; the process is terminated at the first step q when it happens that $x=x^1+...+x^q$. Note that for j ≥ 2 we have $\|x^j\|_\infty\le s^{-1}\|x^{j-1}\|_1$ and $\|x^j\|_1\le\|x^{j-1}\|_1$, whence also $\|x^j\|_2\le\sqrt{\|x^j\|_\infty\|x^j\|_1}\le s^{-1/2}\|x^{j-1}\|_1$. It is easily seen that if A is RIP(δ, 2s), then for every two s-sparse vectors u, v with nonoverlapping supports we have
$$|v^TA^TAu|\le\delta\|u\|_2\|v\|_2. \tag{$*$}$$
Indeed, for s-sparse u, v, let I be the index set of cardinality ≤ 2s containing the supports of u and v, so that, denoting by $A_I$ the submatrix of A comprised of columns with indexes from I, we have $v^TA^TAu=v_I^T[A_I^TA_I]u_I$. By RIP, the eigenvalues $\lambda_i=1+\mu_i$ of the symmetric matrix $Q=A_I^TA_I$ are in-between 1 − δ and 1 + δ; representing $u_I$ and $v_I$ by vectors w and z of their coordinates in the orthonormal eigenbasis of Q, we get $|v^TA^TAu|=|\sum_i\lambda_iw_iz_i|=|\sum_iw_iz_i+\sum_i\mu_iw_iz_i|\le|w^Tz|+\delta\|w\|_2\|z\|_2$. It remains to note that $w^Tz=u_I^Tv_I=0$ and $\|w\|_2=\|u\|_2$, $\|z\|_2=\|v\|_2$.
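The block construction used at the start of this proof can be sketched in a few lines. The version below sorts the indexes once by magnitude and cuts the order into consecutive groups of size s, which yields the same blocks as the iterative description when ties are broken consistently; the function names and the sample vector are assumptions made for illustration.

```python
def block_decomposition(x, s):
    """Split x into blocks x^1, x^2, ...: x^1 keeps the s largest-magnitude entries of x,
    x^2 the s largest-magnitude entries of x - x^1, and so on (zero entries are skipped,
    so the zero vector yields an empty list of blocks)."""
    order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
    order = [i for i in order if x[i] != 0]
    blocks = []
    for start in range(0, len(order), s):
        block = [0.0] * len(x)
        for i in order[start:start + s]:
            block[i] = x[i]
        blocks.append(block)
    return blocks

def norm1(v):
    return sum(abs(a) for a in v)

def norm2(v):
    return sum(a * a for a in v) ** 0.5

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]   # hypothetical test vector
s = 3
blocks = block_decomposition(x, s)
recombined = [sum(b[i] for b in blocks) for i in range(len(x))]
# The inequality ||x^j||_2 <= s^{-1/2} ||x^{j-1}||_1 for j >= 2, as used in the proof.
decay_ok = all(norm2(blocks[j]) <= norm1(blocks[j - 1]) / s ** 0.5 + 1e-12
               for j in range(1, len(blocks)))
```

The decay check is exactly what allows the cross terms in the chain of inequalities that follows to be summed into a multiple of the ℓ₁ norm of x.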
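We have
$$\begin{aligned}&\|Ax^1\|_2\|Ax\|_2\ge[x^1]^TA^TAx=\|Ax^1\|_2^2+\sum_{j=2}^q[x^1]^TA^TAx^j\\&\qquad\ge\|Ax^1\|_2^2-\delta\sum_{j=2}^q\|x^1\|_2\|x^j\|_2\quad\text{[by $(*)$]}\\&\qquad\ge\|Ax^1\|_2^2-\delta s^{-1/2}\|x^1\|_2\sum_{j=2}^q\|x^{j-1}\|_1\ge\|Ax^1\|_2^2-\delta s^{-1/2}\|x^1\|_2\|x\|_1\\&\Rightarrow\ \|Ax^1\|_2^2\le\|Ax^1\|_2\|Ax\|_2+\delta s^{-1/2}\|x^1\|_2\|x\|_1\\&\Rightarrow\ \|x^1\|_2=\frac{\|x^1\|_2}{\|Ax^1\|_2^2}\|Ax^1\|_2^2\le\frac{\|x^1\|_2}{\|Ax^1\|_2}\|Ax\|_2+\delta s^{-1/2}\left[\frac{\|x^1\|_2}{\|Ax^1\|_2}\right]^2\|x\|_1\\&\Rightarrow\ \|x\|_{s,2}=\|x^1\|_2\le\frac{1}{\sqrt{1-\delta}}\|Ax\|_2+\frac{\delta s^{-1/2}}{1-\delta}\|x\|_1\quad\text{[by RIP($\delta,2s$)]}\end{aligned} \tag{!}$$
and we see that the pair $\left(H=\frac{s^{-1/2}}{\sqrt{1-\delta}}I_m,\ \|\cdot\|_2\right)$ satisfies $Q_2(s,\frac{\delta}{1-\delta})$, as claimed in Proposition 1.8.i. Moreover, when q ≥ 2, κ > 0, and integer t ≥ 1 satisfy t ≤ s and $\kappa t^{1/q-1}\ge\frac{\delta s^{-1/2}}{1-\delta}$, by (!) we have
$$\|x\|_{t,q}\le\|x\|_{s,q}\le\|x\|_{s,2}\le\frac{1}{\sqrt{1-\delta}}\|Ax\|_2+\kappa t^{1/q-1}\|x\|_1,$$
or, equivalently,
$$1\le t\le\min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac{q}{q-1}},\ s^{\frac{q-2}{2q-2}}\right]s^{\frac{q}{2q-2}}\ \Rightarrow\ \left(H=\frac{t^{-1/q}}{\sqrt{1-\delta}}I_m,\ \|\cdot\|_2\right)\text{ satisfies }Q_q(t,\kappa),$$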
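as required in Proposition 1.12.i. Next, we have
$$\begin{aligned}&\|x^1\|_1\|A^TAx\|_\infty\ge[x^1]^TA^TAx=\|Ax^1\|_2^2+\sum_{j=2}^q[x^1]^TA^TAx^j\\&\qquad\ge\|Ax^1\|_2^2-\delta s^{-1/2}\|x^1\|_2\|x\|_1\quad\text{[exactly as above]}\\&\Rightarrow\ \|Ax^1\|_2^2\le\|x^1\|_1\|A^TAx\|_\infty+\delta s^{-1/2}\|x^1\|_2\|x\|_1\\&\Rightarrow\ (1-\delta)\|x^1\|_2^2\le\|x^1\|_1\|A^TAx\|_\infty+\delta s^{-1/2}\|x^1\|_2\|x\|_1\quad\text{[by RIP($\delta,2s$)]}\\&\qquad\le s^{1/2}\|x^1\|_2\|A^TAx\|_\infty+\delta s^{-1/2}\|x^1\|_2\|x\|_1\\&\Rightarrow\ \|x\|_{s,2}=\|x^1\|_2\le\frac{s^{1/2}}{1-\delta}\|A^TAx\|_\infty+\frac{\delta}{1-\delta}s^{-1/2}\|x\|_1\end{aligned} \tag{!!}$$
and we see that the pair $\left(H=\frac{1}{1-\delta}A,\ \|\cdot\|_\infty\right)$ satisfies the condition $Q_2\left(s,\frac{\delta}{1-\delta}\right)$, as required in Proposition 1.8.ii. Moreover, when q ≥ 2, κ > 0, and integer t ≥ 1 satisfy t ≤ s and $\kappa t^{1/q-1}\ge\frac{\delta}{1-\delta}s^{-1/2}$, we have by (!!)
$$\|x\|_{t,q}\le\|x\|_{s,q}\le\|x\|_{s,2}\le\frac{1}{1-\delta}s^{1/2}\|A^TAx\|_\infty+\kappa t^{1/q-1}\|x\|_1,$$
or, equivalently,
$$1\le t\le\min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac{q}{q-1}},\ s^{\frac{q-2}{2q-2}}\right]s^{\frac{q}{2q-2}}\ \Rightarrow\ \left(H=\frac{s^{\frac12}t^{-\frac1q}}{1-\delta}A,\ \|\cdot\|_\infty\right)\text{ satisfies }Q_q(t,\kappa),$$
as required in Proposition 1.12.ii. ✷

1.5.5 Proof of Proposition 1.10

(i): Let $\bar{H}\in\mathbf{R}^{m\times N}$ and $\|\cdot\|$ satisfy $Q_\infty(s,\kappa)$. Then for every k ≤ n we have
$$|x_k|\le\|\bar{H}^TAx\|+s^{-1}\kappa\|x\|_1,$$
or, which is the same by homogeneity,
$$\min_x\left\{\|\bar{H}^TAx\|-x_k:\ \|x\|_1\le1\right\}\ge-s^{-1}\kappa.$$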
In other words, the optimal value $\mathrm{Opt}_k$ of the conic optimization problem¹³
$$\mathrm{Opt}_k=\min_{x,t}\left\{t-[e^k]^Tx:\ \|\bar{H}^TAx\|\le t,\ \|x\|_1\le1\right\},$$
where $e^k\in\mathbf{R}^n$ is the k-th basic orth, is $\ge-s^{-1}\kappa$. Since the problem clearly is strictly feasible, this is the same as saying that the dual problem
$$\max_{\mu\in\mathbf{R},\,g\in\mathbf{R}^n,\,\eta\in\mathbf{R}^N}\left\{-\mu:\ A^T\bar{H}\eta+g=e^k,\ \|g\|_\infty\le\mu,\ \|\eta\|_*\le1\right\},$$
where $\|\cdot\|_*$ is the norm conjugate to $\|\cdot\|$,
$$\|u\|_*=\max_{\|h\|\le1}h^Tu,$$
has a feasible solution with the value of the objective $\ge-s^{-1}\kappa$. It follows that there exist $\eta=\eta^k$ and $g=g^k$ such that
$$\begin{array}{ll}(a):&e^k=A^Th^k+g^k,\\(b):&h^k:=\bar{H}\eta^k,\ \|\eta^k\|_*\le1,\\(c):&\|g^k\|_\infty\le s^{-1}\kappa.\end{array} \tag{1.60}$$
Denoting $H=[h^1,...,h^n]$, $V=I-H^TA$, we get
$$\mathrm{Col}_k[V^T]=e^k-A^Th^k=g^k,$$
implying that $\|\mathrm{Col}_k[V^T]\|_\infty\le s^{-1}\kappa$. Since the latter inequality is true for all k ≤ n, we conclude that
$$\|\mathrm{Col}_k[V]\|_{s,\infty}=\|\mathrm{Col}_k[V]\|_\infty\le s^{-1}\kappa,\quad1\le k\le n,$$
whence, by Proposition 1.9, $(H,\|\cdot\|_\infty)$ satisfies $Q_\infty(s,\kappa)$. Moreover, for every $\eta\in\mathbf{R}^m$ and every k ≤ n we have, in view of (b) and (c),
$$|[h^k]^T\eta|=|[\eta^k]^T\bar{H}^T\eta|\le\|\eta^k\|_*\|\bar{H}^T\eta\|,$$
whence $\|H^T\eta\|_\infty\le\|\bar{H}^T\eta\|$.

Now let us prove the “In addition” part of the proposition. Let $H=[h_1,...,h_n]$ be the contrast matrix specified in this part. We have
$$|[I_n-H^TA]_{ij}|=|[[e^i]^T-h_i^TA]_j|\le\|[e^i]^T-h_i^TA\|_\infty=\|e^i-A^Th_i\|_\infty\le\mathrm{Opt}_i,$$
implying by Proposition 1.9 that $(H,\|\cdot\|_\infty)$ does satisfy the condition $Q_\infty(s,\kappa_*)$ with $\kappa_*=s\max_i\mathrm{Opt}_i$. Now assume that there exists a matrix H′ which, taken along with some norm $\|\cdot\|$, satisfies the condition $Q_\infty(s,\kappa)$ with $\kappa<\kappa_*$, and let us lead this assumption to a contradiction. By the already proved first part of Proposition 1.10, our assumption implies that there exists an m × n matrix $\bar{H}=[\bar{h}_1,...,\bar{h}_n]$ such that $\|\mathrm{Col}_j[I_n-\bar{H}^TA]\|_\infty\le s^{-1}\kappa$ for all j ≤ n, implying that $|[[e^i]^T-\bar{h}_i^TA]_j|\le s^{-1}\kappa$ for all i and j, or, which is the same, $\|e^i-A^T\bar{h}_i\|_\infty\le s^{-1}\kappa$ for all i. Due to the origin of $\mathrm{Opt}_i$, we have $\mathrm{Opt}_i\le\|e^i-A^T\bar{h}_i\|_\infty$ for all i, and we arrive at $s^{-1}\kappa_*=\max_i\mathrm{Opt}_i\le s^{-1}\kappa$, that is, $\kappa_*\le\kappa$, which is a desired contradiction.

It remains to prove (1.33), which is just an exercise on LP duality: denoting by e an n-dimensional all-ones vector, we have
$$\begin{aligned}\mathrm{Opt}_i&:=\min_h\|e^i-A^Th\|_\infty=\min_{h,t}\left\{t:\ e^i-A^Th\le te,\ A^Th-e^i\le te\right\}\\&=\max_{\lambda,\mu}\Big\{\lambda_i-\mu_i:\ \lambda,\mu\ge0,\ A[\lambda-\mu]=0,\ \sum_i\lambda_i+\sum_i\mu_i=1\Big\}\quad\text{[LP duality]}\\&=\max_{x:=\lambda-\mu}\left\{x_i:\ Ax=0,\ \|x\|_1\le1\right\},\end{aligned}$$
where the concluding equality follows from the fact that vectors x representable as λ − µ with λ, µ ≥ 0 satisfying $\|\lambda\|_1+\|\mu\|_1=1$ are exactly vectors x with $\|x\|_1\le1$. ✷

¹³ For a summary on conic programming, see Section 4.1.

1.5.6 Proof of Proposition 1.13
Let H satisfy (1.38). Since $\|v\|_{s,1}\le s^{1-1/q}\|v\|_{s,q}$, it follows that H satisfies for some α < 1/2 the condition
$$\|\mathrm{Col}_j[I_n-H^TA]\|_{s,1}\le\alpha,\quad1\le j\le n, \tag{1.61}$$
whence, as we know from Proposition 1.9,
$$\|x\|_{s,1}\le s\|H^TAx\|_\infty+\alpha\|x\|_1\quad\forall x\in\mathbf{R}^n.$$
It follows that s ≤ m, since otherwise there exists a nonzero s-sparse vector x with Ax = 0; for this x, the inequality above cannot hold true. Let us set $\bar{n}=2m$, so that $\bar{n}\le n$, and let $\bar{H}$ and $\bar{A}$ be the $m\times\bar{n}$ matrices comprised of the first 2m columns of H and A, respectively. Relation (1.61) implies that the matrix $V=I_{\bar{n}}-\bar{H}^T\bar{A}$ satisfies
$$\|\mathrm{Col}_j[V]\|_{s,1}\le\alpha<1/2,\quad1\le j\le\bar{n}. \tag{1.62}$$
Now, since the rank of $\bar{H}^T\bar{A}$ is ≤ m, at least $\bar{n}-m$ singular values of V are ≥ 1, and therefore the squared Frobenius norm $\|V\|_F^2$ of V is at least $\bar{n}-m$. On the other hand, we can upper-bound this squared norm as follows. Observe that for every $\bar{n}$-dimensional vector f one has
$$\|f\|_2^2\le\max\left[\frac{\bar{n}}{s^2},1\right]\|f\|_{s,1}^2. \tag{1.63}$$
Indeed, by homogeneity it suffices to verify the inequality when $\|f\|_{s,1}=1$; besides, we can assume w.l.o.g. that the entries in f are nonnegative, and that $f_1\ge f_2\ge...\ge f_{\bar{n}}$. We have $f_s\le\|f\|_{s,1}/s=\frac1s$; in addition, $\sum_{j=s+1}^{\bar{n}}f_j^2\le(\bar{n}-s)f_s^2$. Now, due to $\|f\|_{s,1}=1$, for fixed $f_s\in[0,1/s]$ we have
$$\sum_{j=1}^sf_j^2\le f_s^2+\max_t\Big\{\sum_{j=1}^{s-1}t_j^2:\ t_j\ge f_s,\ j\le s-1,\ \sum_{j=1}^{s-1}t_j=1-f_s\Big\}.$$
The maximum on the right-hand side is the maximum of a convex function over a bounded polytope; it is achieved at an extreme point, that is, at a point where one of the $t_j$ is equal to $1-(s-1)f_s$, and all remaining $t_j$ are equal to $f_s$. As a result,
$$\sum_jf_j^2\le(1-(s-1)f_s)^2+(s-1)f_s^2+(\bar{n}-s)f_s^2\le(1-(s-1)f_s)^2+(\bar{n}-1)f_s^2.$$
The right-hand side in the latter inequality is convex in $f_s$ and thus achieves its maximum over the range [0, 1/s] of allowed values of $f_s$ at an endpoint, yielding $\sum_jf_j^2\le\max[1,\bar{n}/s^2]$, as claimed.
Applying (1.63) to the columns of V and recalling that $\bar{n}=2m$, we get
$$\|V\|_F^2=\sum_{j=1}^{2m}\|\mathrm{Col}_j[V]\|_2^2\le\max\left[1,\frac{2m}{s^2}\right]\sum_{j=1}^{2m}\|\mathrm{Col}_j[V]\|_{s,1}^2\le2\alpha^2m\max\left[1,\frac{2m}{s^2}\right].$$
The left-hand side in this inequality, as we remember, is $\ge\bar{n}-m=m$, and we arrive at
$$m\le2\alpha^2m\max[1,2m/s^2].$$
Since α < 1/2, this inequality implies $2m/s^2\ge2$, whence $s\le\sqrt{m}$.

It remains to prove that when m ≤ n/2, the condition $Q_\infty(s,\kappa)$ with κ < 1/2 can be satisfied only when $s\le\sqrt{m}$. This is immediate: by Proposition 1.10, assuming $Q_\infty(s,\kappa)$ satisfiable, there exists an m × n contrast matrix H such that $|[I_n-H^TA]_{ij}|\le\kappa/s$ for all i, j, which, by the already proved part of Proposition 1.13, is impossible when $s>\sqrt{m}$. ✷
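The key auxiliary inequality (1.63) can be probed numerically. The following sketch checks it on random vectors for all values of s and on the flat vector, where it holds with equality whenever the dimension is at least s²; the dimension, the sample sizes and the function names are arbitrary choices for illustration.

```python
import random

def s_norm_1(f, s):
    """||f||_{s,1}: sum of the s largest magnitudes of entries of f."""
    return sum(sorted((abs(v) for v in f), reverse=True)[:s])

def sides_of_163(f, s):
    """Return (||f||_2^2, max(n/s^2, 1) * ||f||_{s,1}^2) for a vector f of dimension n."""
    n = len(f)
    lhs = sum(v * v for v in f)
    rhs = max(n / s ** 2, 1.0) * s_norm_1(f, s) ** 2
    return lhs, rhs

def check_163(f, s):
    lhs, rhs = sides_of_163(f, s)
    return lhs <= rhs * (1 + 1e-12)

random.seed(1)
ok = all(check_163([random.uniform(-1, 1) for _ in range(20)], s)
         for s in range(1, 21) for _ in range(50))
flat_lhs, flat_rhs = sides_of_163([1.0] * 20, 2)   # equal entries, n / s^2 = 5
```

The flat vector shows that the factor n̄/s² in (1.63) cannot be improved in general, which is what ultimately forces s ≤ √m in the proposition.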
Chapter Two

Hypothesis Testing

Disclaimer for experts. In what follows, we allow for “general” probability and observation spaces, general probability distributions, etc., which, formally, would make it necessary to address the related measurability issues. In order to streamline our exposition, and taking into account that we do not expect our target audience to be experts in formal nuances of the measure theory, we decided to omit in the text comments (always self-evident for an expert) on measurability and replace them with a “disclaimer” as follows:

Below, unless the opposite is explicitly stated,
• all probability and observation spaces are Polish (complete separable metric) spaces equipped with σ-algebras of Borel sets;
• all random variables (i.e., functions from a probability space to some other space) take values in Polish spaces; these variables, like other functions we deal with, are Borel;
• all probability distributions we are dealing with are σ-additive Borel measures on the respective probability spaces; the same is true for all reference measures and probability densities taken w.r.t. these measures.

When an entity (a random variable, or a probability density, or a function, say, a test) is part of the data, the Borel property is a default assumption; e.g., the sentence “Let random variable η be a deterministic transformation of random variable ξ” should be read as “let η = f(ξ) for some Borel function f,” and the sentence “Consider a test T deciding on hypotheses $H_1,...,H_L$ via observation ω ∈ Ω” should be read as “Consider a Borel function T on Polish space Ω, the values of the function being subsets of the set {1, ..., L}.” When an entity is built by us rather than being part of the data, the Borel property is (an always straightforwardly verifiable) property of the construction. For example, the statement “The test T given by ... is such that ...” should be read as “The test T given by ... is a Borel function of observations and is such that ....”

On several occasions, we still use the word “Borel”; those not acquainted with the notion are welcome to just ignore this word.
2.1 PRELIMINARIES FROM STATISTICS: HYPOTHESES, TESTS, RISKS

2.1.1 Hypothesis Testing Problem

Hypothesis Testing is one of the fundamental problems of Statistics. Informally, this is the problem where one is given an observation (a realization of a random variable with unknown, at least partly, probability distribution) and wants to decide, based on this observation, on two or more hypotheses on the actual distribution of the observed variable. A formal setting convenient for us is as follows:

Given are:
• Observation space Ω, where the observed random variable (r.v.) takes its values;
• L families $\mathcal{P}_\ell$ of probability distributions on Ω. We associate with these families L hypotheses $H_1,...,H_L$, with $H_\ell$ stating that the probability distribution P of the observed r.v. belongs to the family $\mathcal{P}_\ell$ (shorthand: $H_\ell: P\in\mathcal{P}_\ell$). We shall say that the distributions from $\mathcal{P}_\ell$ obey hypothesis $H_\ell$. Hypothesis $H_\ell$ is called simple if $\mathcal{P}_\ell$ is a singleton, and is called composite otherwise.

Our goal is, given an observation (a realization ω of the r.v. in question), to decide which of the hypotheses is true.

2.1.2 Tests
Informally, a test is an inference procedure one can use in the above testing problem. Formally, a test for this testing problem is a function T(ω) of ω ∈ Ω; the value T(ω) of this function at a point ω is some subset of the set {1, ..., L}:
$$T(\omega)\subset\{1,...,L\}.$$
Given observation ω, the test accepts all hypotheses $H_\ell$ with ℓ ∈ T(ω) and rejects all hypotheses $H_\ell$ with ℓ ∉ T(ω). We call a test simple if T(ω) is a singleton for every ω, that is, whatever the observation, the test accepts exactly one of the hypotheses $H_1,...,H_L$ and rejects all other hypotheses.

Note: What we have defined is a deterministic test. Sometimes we shall consider also randomized tests, where the set of accepted hypotheses is a (deterministic) function of an observation ω and a realization θ of a random parameter (which w.l.o.g. can be assumed to be uniformly distributed on [0, 1]) independent of ω. Thus, in a randomized test, the inference depends both on the observation ω and the outcome θ of “flipping a coin,” while in a deterministic test the inference depends on the observation only. In fact, randomized testing can be reduced to deterministic testing. To this end it suffices to pass from our “actual” observation ω to the new observation $\omega_+=(\omega,\theta)$, where θ ∼ Uniform[0, 1] is independent of ω; the ω-component of our new observation $\omega_+$ is, as before, generated “by nature,” and the θ-component is generated by us. Now, given families $\mathcal{P}_\ell$, 1 ≤ ℓ ≤ L, of probability distributions on the original observation space Ω, we can associate with them families $\mathcal{P}_{\ell,+}=\{P\times\mathrm{Uniform}[0,1]: P\in\mathcal{P}_\ell\}$ of probability distributions on our new observation space $\Omega_+=\Omega\times[0,1]$. Clearly, to decide on the hypotheses associated with the families $\mathcal{P}_\ell$ via observation ω is the same as to decide on the hypotheses associated with the families $\mathcal{P}_{\ell,+}$ of our new observation $\omega_+$, and deterministic tests for the latter testing problem are exactly the randomized tests for the former one.

2.1.3 Testing from repeated observations
There are situations where an inference can be based on several observations $\omega_1, ..., \omega_K$ rather than on a single one. Our related setup is as follows: We are given $L$ families $\mathcal{P}_\ell$, $\ell = 1, ..., L$, of probability distributions on the observation space $\Omega$ and a collection
\[ \omega^K = (\omega_1, ..., \omega_K) \in \Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K} \]
and want to make conclusions on how the distribution of $\omega^K$ “is positioned” w.r.t. the families $\mathcal{P}_\ell$, $1 \le \ell \le L$. We will be interested in three situations of this type, specifically, as follows.

2.1.3.1 Stationary K-repeated observations
In the case of stationary K-repeated observations, $\omega_1, ..., \omega_K$ are drawn independently of each other from a distribution $P$. Our goal is to decide, given $\omega^K$, on the hypotheses $P \in \mathcal{P}_\ell$, $\ell = 1, ..., L$. Equivalently: Families $\mathcal{P}_\ell$ of probability distributions of $\omega \in \Omega$, $1 \le \ell \le L$, give rise to the families
\[ \mathcal{P}_\ell^{\odot,K} = \{P^K = \underbrace{P \times ... \times P}_{K} : P \in \mathcal{P}_\ell\} \]
of probability distributions on $\Omega^K$; we refer to the family $\mathcal{P}_\ell^{\odot,K}$ as the K-th diagonal power of the family $\mathcal{P}_\ell$. Given observation $\omega^K \in \Omega^K$, we want to decide on the hypotheses
\[ H_\ell^{\odot,K} : \omega^K \sim P^K \in \mathcal{P}_\ell^{\odot,K}, \quad 1 \le \ell \le L. \]

2.1.3.2 Semi-stationary K-repeated observations
In the case of semi-stationary K-repeated observations, “nature” selects somehow a sequence $P_1, ..., P_K$ of distributions on $\Omega$, and then draws, independently across $k$, observations $\omega_k$, $k = 1, ..., K$, from these distributions:
\[ \omega_k \sim P_k, \quad \omega_k \text{ are independent across } k \le K. \]
Our goal is to decide, given $\omega^K = (\omega_1, ..., \omega_K)$, on the hypotheses $\{P_k \in \mathcal{P}_\ell, 1 \le k \le K\}$, $\ell = 1, ..., L$. Equivalently: Families $\mathcal{P}_\ell$ of probability distributions of $\omega \in \Omega$, $1 \le \ell \le L$, give rise to the families
\[ \mathcal{P}_\ell^{\oplus,K} = \{P^K = P_1 \times ... \times P_K : P_k \in \mathcal{P}_\ell, 1 \le k \le K\} \]
of probability distributions on $\Omega^K$. Given observation $\omega^K \in \Omega^K$, we want to decide on the hypotheses
\[ H_\ell^{\oplus,K} : \omega^K \sim P^K \in \mathcal{P}_\ell^{\oplus,K}, \quad 1 \le \ell \le L. \]
In the sequel, we refer to families $\mathcal{P}_\ell^{\oplus,K}$ as the K-th direct powers of the families $\mathcal{P}_\ell$. A closely related notion is that of the direct product
\[ \mathcal{P}_\ell^{\oplus,K} = \bigoplus_{k=1}^{K} \mathcal{P}_{\ell,k} \]
of $K$ families $\mathcal{P}_{\ell,k}$ of probability distributions on $\Omega_k$, over $k = 1, ..., K$. By definition,
\[ \mathcal{P}_\ell^{\oplus,K} = \{P^K = P_1 \times ... \times P_K : P_k \in \mathcal{P}_{\ell,k}, 1 \le k \le K\}. \]

2.1.3.3 Quasi-stationary K-repeated observations
Quasi-stationary K-repeated observations $\omega_1 \in \Omega, ..., \omega_K \in \Omega$ stemming from a family $\mathcal{P}$ of probability distributions on an observation space $\Omega$ are generated as follows: “In nature” there exists a random sequence $\zeta^K = (\zeta_1, ..., \zeta_K)$ of “driving factors” (or states) such that for every $k$, $\omega_k$ is a deterministic function of $\zeta_1, ..., \zeta_k$,
\[ \omega_k = \theta_k(\zeta_1, ..., \zeta_k), \]
and the conditional distribution $P_{\omega_k | \zeta_1, ..., \zeta_{k-1}}$ of $\omega_k$ given $\zeta_1, ..., \zeta_{k-1}$ always (i.e., for all $\zeta_1, ..., \zeta_{k-1}$) belongs to $\mathcal{P}$. With the above mechanism, the collection $\omega^K = (\omega_1, ..., \omega_K)$ has some distribution $P^K$ which depends on the distribution of driving factors and functions $\theta_k(\cdot)$. We denote by $\mathcal{P}^{\otimes,K}$ the family of all distributions $P^K$ which can be obtained in this fashion, and we refer to random observations $\omega^K$ with distribution $P^K$ of the type just defined as the quasi-stationary K-repeated observations stemming from $\mathcal{P}$. The quasi-stationary version of our hypothesis testing problem reads: Given $L$ families $\mathcal{P}_\ell$, $\ell = 1, ..., L$, of probability distributions on $\Omega$ and an observation $\omega^K \in \Omega^K$, decide on the hypotheses
\[ H_\ell^{\otimes,K} : P^K \in \mathcal{P}_\ell^{\otimes,K}, \quad 1 \le \ell \le L, \]
on the distribution $P^K$ of the observation $\omega^K$. A related notion is that of the quasi-direct product
\[ \mathcal{P}_\ell^{\otimes,K} = \bigotimes_{k=1}^{K} \mathcal{P}_{\ell,k} \]
of $K$ families $\mathcal{P}_{\ell,k}$ of probability distributions on $\Omega_k$, over $k = 1, ..., K$. By definition, $\mathcal{P}_\ell^{\otimes,K}$ is comprised of all probability distributions of random sequences $\omega^K = (\omega_1, ..., \omega_K)$, $\omega_k \in \Omega_k$, which can be generated as follows: “in nature” there exists a random sequence $\zeta^K = (\zeta_1, ..., \zeta_K)$ of “driving factors” such that for every $k \le K$, $\omega_k$ is a deterministic function of $\zeta^k = (\zeta_1, ..., \zeta_k)$, and the conditional distribution of $\omega_k$ given $\zeta^{k-1}$ always belongs to $\mathcal{P}_{\ell,k}$.

The description of quasi-stationary K-repeated observations may seem too complicated. However, this is exactly what happens in some important applications, e.g., in hidden Markov chains. Suppose that $\Omega = \{1, ..., d\}$ is a finite set, and that $\omega_k \in \Omega$, $k = 1, 2, ...$, are generated as follows: “in nature” there exists a Markov chain with $D$-element state space $S$ split into $d$ nonoverlapping bins, and $\omega_k$ is the index $\beta(\eta_k)$ of the bin to which the state $\eta_k$ of the chain belongs. Every column $Q_j$ of the transition matrix $Q$ of the chain (this column is a probability distribution on $\{1, ..., D\}$) generates a probability distribution $P_j$ on $\Omega$, specifically, the distribution of $\beta(\eta)$, $\eta \sim Q_j$. Now, a family $\mathcal{P}$ of distributions on $\Omega$ induces a family $\mathcal{Q}[\mathcal{P}]$ of all $D \times D$ stochastic matrices $Q$ for which all $D$ distributions $P_j$, $j = 1, ..., D$, belong to $\mathcal{P}$. When $Q \in \mathcal{Q}[\mathcal{P}]$, observations $\omega_k$, $k = 1, 2, ...$, clearly are given by the above “quasi-stationary mechanism” with $\eta_k$ in the role of driving factors and $\mathcal{P}$ in the role of $\mathcal{P}_\ell$. Thus, in the situation in question, given $L$ families $\mathcal{P}_\ell$, $\ell = 1, ..., L$, of probability distributions on $\Omega$, deciding on hypotheses $Q \in \mathcal{Q}[\mathcal{P}_\ell]$, $\ell = 1, ..., L$, on the transition matrix $Q$ of the Markov chain underlying our observations reduces to hypothesis testing via quasi-stationary K-repeated observations.

2.1.4 Risk of a simple test
Let $\mathcal{P}_\ell$, $\ell = 1, ..., L$, be families of probability distributions on observation space $\Omega$; these families give rise to hypotheses
\[ H_\ell : P \in \mathcal{P}_\ell, \quad \ell = 1, ..., L, \]
on the distribution $P$ of a random observation $\omega \sim P$. We are about to define the risks of a simple test $T$ deciding on the hypotheses $H_\ell$, $\ell = 1, ..., L$, via observation $\omega$. Recall that simplicity means that as applied to an observation, our test accepts exactly one hypothesis and rejects all other hypotheses.

Partial risks $\mathrm{Risk}_\ell(T|H_1, ..., H_L)$ are the worst-case, over $P \in \mathcal{P}_\ell$, $P$-probabilities of $T$ rejecting the $\ell$-th hypothesis when it is true, that is, when $\omega \sim P$:
\[ \mathrm{Risk}_\ell(T|H_1, ..., H_L) = \sup_{P \in \mathcal{P}_\ell} \mathrm{Prob}_{\omega \sim P}\{\omega : T(\omega) \ne \{\ell\}\}, \quad \ell = 1, ..., L. \]
Obviously, for $\ell$ fixed, the $\ell$-th partial risk depends on how we order the hypotheses; when reordering them, we should reorder risks as well. In particular, for a test $T$ deciding on two hypotheses $H$, $H'$ we have
\[ \mathrm{Risk}_1(T|H, H') = \mathrm{Risk}_2(T|H', H). \]
Total risk $\mathrm{Risk}_{\mathrm{tot}}(T|H_1, ..., H_L)$ is the sum of all $L$ partial risks:
\[ \mathrm{Risk}_{\mathrm{tot}}(T|H_1, ..., H_L) = \sum_{\ell=1}^{L} \mathrm{Risk}_\ell(T|H_1, ..., H_L). \]
Risk $\mathrm{Risk}(T|H_1, ..., H_L)$ is the maximum of all $L$ partial risks:
\[ \mathrm{Risk}(T|H_1, ..., H_L) = \max_{1 \le \ell \le L} \mathrm{Risk}_\ell(T|H_1, ..., H_L). \]
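The three risk notions just defined can be evaluated directly when the observation space is finite and every family contains finitely many distributions. The sketch below is an illustration only: the function name `partial_risks`, the three-point observation space, and the numbers are hypothetical, and hypotheses are indexed from 0 rather than from 1.

```python
from typing import Callable, List, Sequence

# A distribution on the finite observation space {0, ..., n-1}.
Distribution = Sequence[float]


def partial_risks(test: Callable[[int], int],
                  families: List[List[Distribution]]) -> List[float]:
    """Worst case, over P in the l-th family, of P{omega : test(omega) != l}."""
    risks = []
    for ell, family in enumerate(families):
        worst = 0.0
        for p in family:
            # Probability, under p, that the simple test rejects hypothesis ell.
            reject = sum(p_w for w, p_w in enumerate(p) if test(w) != ell)
            worst = max(worst, reject)
        risks.append(worst)
    return risks


# Hypothetical data: two composite hypotheses on a three-point space.
families = [
    [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]],  # family of hypothesis 0
    [[0.1, 0.3, 0.6], [0.2, 0.2, 0.6]],  # family of hypothesis 1
]


def simple_test(w: int) -> int:
    """A simple test: exactly one hypothesis is accepted for every observation."""
    return 0 if w <= 1 else 1


risks = partial_risks(simple_test, families)
total_risk = sum(risks)   # Risk_tot: sum of the partial risks
risk = max(risks)         # Risk: maximum of the partial risks
```

Since the supremum is taken family by family, enlarging a family can only increase the corresponding entry of `risks`.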
Note that at first glance, we have defined risks for single-observation tests only; in fact, we have defined them for tests based on stationary, semi-stationary, and quasi-stationary K-repeated observations as well, since, as we remember from Section 2.1.3, the corresponding testing problems, after redefining observations and families of probability distributions ($\omega^K$ in the role of $\omega$ and, say, $\mathcal{P}_\ell^{\oplus,K} = \bigoplus_{k=1}^{K} \mathcal{P}_\ell$ in the role of $\mathcal{P}_\ell$), become single-observation testing problems. Pay attention to the following two important observations:
• Partial risks of a simple test are defined in the worst-case fashion: as the maximal, over the true distributions $P$ of observations compatible with the hypothesis in question, probability to reject this hypothesis.
• Risks of a simple test say what happens, statistically speaking, when the true distribution $P$ of observation obeys one of the hypotheses in question, and say nothing about what happens when $P$ does not obey any of the $L$ hypotheses.

Remark 2.1. “The smaller are the hypotheses, the less are the risks.” Specifically, given families of probability distributions $\mathcal{P}_\ell \subset \mathcal{P}'_\ell$, $\ell = 1, ..., L$, on observation space $\Omega$, along with hypotheses $H_\ell : P \in \mathcal{P}_\ell$, $H'_\ell : P \in \mathcal{P}'_\ell$ on the distribution $P$ of an observation $\omega \in \Omega$, every test $T$ deciding on the “larger” hypotheses $H'_1, ..., H'_L$ can be considered as a test deciding on the smaller hypotheses $H_1, ..., H_L$ as well, and the risks of the test when passing from larger hypotheses to smaller ones can only drop down:
\[ \mathcal{P}_\ell \subset \mathcal{P}'_\ell, \ 1 \le \ell \le L \ \Rightarrow \ \mathrm{Risk}(T|H_1, ..., H_L) \le \mathrm{Risk}(T|H'_1, ..., H'_L). \]
For example, families of probability distributions $\mathcal{P}_\ell$, $1 \le \ell \le L$, on $\Omega$ and a positive integer $K$ induce three families of hypotheses on a distribution $P^K$ of K-repeated observations:
\[ H_\ell^{\odot,K} : P^K \in \mathcal{P}_\ell^{\odot,K}, \quad H_\ell^{\oplus,K} : P^K \in \mathcal{P}_\ell^{\oplus,K} = \bigoplus_{k=1}^{K} \mathcal{P}_\ell, \quad H_\ell^{\otimes,K} : P^K \in \mathcal{P}_\ell^{\otimes,K} = \bigotimes_{k=1}^{K} \mathcal{P}_\ell, \quad 1 \le \ell \le L \]
(see Section 2.1.3), and clearly
\[ \mathcal{P}_\ell^{\odot,K} \subset \mathcal{P}_\ell^{\oplus,K} \subset \mathcal{P}_\ell^{\otimes,K}. \]
It follows that when passing from quasi-stationary K-repeated observations to semi-stationary K-repeated observations, and then to stationary K-repeated observations, the risks of a test can only go down.

2.1.5 Two-point lower risk bound
The following well-known [162, 164] observation is nearly evident:

Proposition 2.2. Consider two simple hypotheses $H_1 : P = P_1$ and $H_2 : P = P_2$ on the distribution $P$ of observation $\omega \in \Omega$, and assume that $P_1$, $P_2$ have densities $p_1$, $p_2$ w.r.t. some reference measure $\Pi$ on $\Omega$.¹ Then for any simple test $T$ deciding on $H_1$, $H_2$ it holds
\[ \mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) \ge \int_\Omega \min[p_1(\omega), p_2(\omega)] \, \Pi(d\omega). \tag{2.1} \]

¹ This assumption is w.l.o.g.: we can take, as $\Pi$, the sum of the measures $P_1$ and $P_2$.
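Note that the right-hand side in this relation is independent of how $\Pi$ is selected.

Proof. Consider a simple test $T$, perhaps a randomized one, and let $\pi(\omega)$ be the probability for this test to accept $H_1$ and reject $H_2$ when the observation is $\omega$. Since the test is simple, the probability for $T$ to accept $H_2$ and to reject $H_1$, the observation being $\omega$, is $1 - \pi(\omega)$. Consequently,
\[ \mathrm{Risk}_1(T|H_1, H_2) = \int_\Omega (1 - \pi(\omega)) p_1(\omega) \, \Pi(d\omega), \qquad \mathrm{Risk}_2(T|H_1, H_2) = \int_\Omega \pi(\omega) p_2(\omega) \, \Pi(d\omega), \]
whence, since for $0 \le \pi(\omega) \le 1$ the integrand below is a convex combination of $p_1(\omega)$ and $p_2(\omega)$,
\[ \mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) = \int_\Omega [(1 - \pi(\omega)) p_1(\omega) + \pi(\omega) p_2(\omega)] \, \Pi(d\omega) \ge \int_\Omega \min[p_1(\omega), p_2(\omega)] \, \Pi(d\omega). \qquad \Box \]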
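The bound (2.1) is easy to check numerically when $\Omega$ is finite and $\Pi$ is the counting measure, so that integrals become sums. The following sketch is illustrative only; the two densities, the helper `total_risk`, and the two tests are hypothetical choices. A test is encoded by the vector `pi` of probabilities of accepting $H_1$, exactly as in the proof above; `lr_test` is the likelihood ratio rule with ties split evenly.

```python
def total_risk(pi, p1, p2):
    """Total risk of the simple test that accepts H1 with probability pi[w]."""
    risk1 = sum((1.0 - a) * x for a, x in zip(pi, p1))  # H1 rejected under P1
    risk2 = sum(a * y for a, y in zip(pi, p2))          # H1 accepted under P2
    return risk1 + risk2


# Hypothetical densities w.r.t. the counting measure on a three-point space.
p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]

# Right-hand side of (2.1).
lower_bound = sum(min(x, y) for x, y in zip(p1, p2))

# Likelihood ratio rule with ties split evenly.
lr_test = [1.0 if x > y else 0.0 if x < y else 0.5 for x, y in zip(p1, p2)]

# An arbitrary (poor) deterministic test, for comparison.
poor_test = [0.0, 1.0, 1.0]

lr_risk = total_risk(lr_test, p1, p2)
poor_risk = total_risk(poor_test, p1, p2)
```

In this toy instance `lr_risk` coincides with `lower_bound` up to rounding, while `poor_risk` is strictly larger.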
Remark 2.3. Note that the lower risk bound (2.1) is achievable; given an observation $\omega$, the corresponding test $T$ accepts $H_1$ with probability 1 (i.e., $\pi(\omega) = 1$) when $p_1(\omega) > p_2(\omega)$, accepts $H_2$ with probability 1 (i.e., $\pi(\omega) = 0$) when $p_1(\omega) < p_2(\omega)$, and accepts $H_1$ and $H_2$ with probabilities 1/2 in the case of a tie (i.e., $\pi(\omega) = 1/2$ when $p_1(\omega) = p_2(\omega)$). This is nothing but the likelihood ratio test naturally adjusted to account for ties.

Example 2.1. Let $\Omega = \mathbf{R}^d$, let the reference measure $\Pi$ be the Lebesgue measure on $\mathbf{R}^d$, and let $p_\chi(\cdot) = \mathcal{N}(\mu_\chi, I_d)$ be the Gaussian densities on $\mathbf{R}^d$ with unit covariance and means $\mu_\chi$, $\chi = 1, 2$. In this case, assuming $\mu_1 \ne \mu_2$, the recipe from Remark 2.3 reduces to the following: Let
\[ \phi_{1,2}(\omega) = \tfrac{1}{2} [\mu_1 - \mu_2]^T [\omega - w], \quad w = \tfrac{1}{2} [\mu_1 + \mu_2]. \tag{2.2} \]
Consider the simple test $T$ which, given an observation $\omega$, accepts $H_1 : p = p_1$ (and rejects $H_2 : p = p_2$) when $\phi_{1,2}(\omega) \ge 0$, and accepts $H_2$ (and rejects $H_1$) otherwise. For this test,
\[ \mathrm{Risk}_1(T|H_1, H_2) = \mathrm{Risk}_2(T|H_1, H_2) = \mathrm{Risk}(T|H_1, H_2) = \tfrac{1}{2} \mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) = \mathrm{Erfc}(\tfrac{1}{2} \|\mu_1 - \mu_2\|_2) \tag{2.3} \]
(see (1.22) for the definition of Erfc), and the test is optimal in terms of its risk and its total risk. Note that optimality of $T$ in terms of total risk is given by Proposition 2.2 and Remark 2.3; optimality in terms of risk is ensured by optimality in terms of total risk combined with the first equality in (2.3).

Example 2.1 admits an immediate and useful extension [36, 37, 84, 128]:

Figure 2.1: “Gaussian Separation” (Example 2.2): Optimal test deciding on whether the mean of Gaussian r.v. belongs to the domain A ($H_1$) or to the domain B ($H_2$). Hyperplane o-o separates the acceptance domains for $H_1$ (“left” half-space) and for $H_2$ (“right” half-space).

Example 2.2. Let $\Omega = \mathbf{R}^d$, let the reference measure $\Pi$ be the Lebesgue measure on $\mathbf{R}^d$, and let $M_1$ and $M_2$ be two nonempty closed convex sets in $\mathbf{R}^d$ with empty intersection and such that the convex optimization program
\[ \min_{\mu_1, \mu_2} \{\|\mu_1 - \mu_2\|_2 : \mu_\chi \in M_\chi, \ \chi = 1, 2\} \tag{$*$} \]
has an optimal solution $\mu_1^*$, $\mu_2^*$ (this definitely is the case when at least one of the sets $M_1$, $M_2$ is bounded). Let
\[ \phi_{1,2}(\omega) = \tfrac{1}{2} [\mu_1^* - \mu_2^*]^T [\omega - w], \quad w = \tfrac{1}{2} [\mu_1^* + \mu_2^*], \tag{2.4} \]
and let the simple test $T$ deciding on the hypotheses
\[ H_1 : p = \mathcal{N}(\mu, I_d) \text{ with } \mu \in M_1, \qquad H_2 : p = \mathcal{N}(\mu, I_d) \text{ with } \mu \in M_2 \]
be as follows (see Figure 2.1): given an observation $\omega$, $T$ accepts $H_1$ (and rejects $H_2$) when $\phi_{1,2}(\omega) \ge 0$, and accepts $H_2$ (and rejects $H_1$) otherwise. Then
\[ \mathrm{Risk}_1(T|H_1, H_2) = \mathrm{Risk}_2(T|H_1, H_2) = \mathrm{Risk}(T|H_1, H_2) = \tfrac{1}{2} \mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) = \mathrm{Erfc}(\tfrac{1}{2} \|\mu_1^* - \mu_2^*\|_2), \tag{2.5} \]
and the test is optimal in terms of its risk and its total risk.

Justification of Example 2.2 is immediate. Let $e$ be the $\|\cdot\|_2$-unit vector in the direction of $\mu_1^* - \mu_2^*$, and let $\xi[\omega] = e^T(\omega - w)$. From optimality conditions for $(*)$ it follows that
\[ e^T \mu \ge e^T \mu_1^* \ \ \forall \mu \in M_1 \quad \& \quad e^T \mu \le e^T \mu_2^* \ \ \forall \mu \in M_2. \]
As a result, if $\mu \in M_1$ and the density of $\omega$ is $p_\mu = \mathcal{N}(\mu, I_d)$, the random variable $\xi[\omega]$ is a scalar Gaussian random variable with unit variance and expectation $\ge \delta := \tfrac{1}{2} \|\mu_1^* - \mu_2^*\|_2$, implying that the $p_\mu$-probability for $\xi[\omega]$ to be negative (which is exactly the same as the $p_\mu$-probability for $T$ to reject $H_1$ and accept $H_2$) is at most $\mathrm{Erfc}(\delta)$. Similarly, when $\mu \in M_2$ and the density of $\omega$ is $p_\mu = \mathcal{N}(\mu, I_d)$, $\xi[\omega]$ is a scalar Gaussian random variable with unit variance and expectation $\le -\delta$, implying that the $p_\mu$-probability for $\xi[\omega]$ to be nonnegative (which is exactly the same as the probability for $T$ to reject $H_2$ and accept $H_1$) is at most $\mathrm{Erfc}(\delta)$. These observations imply the validity of (2.5). The test optimality in terms of risks follows from the fact that the risks of a simple test deciding on our now composite hypotheses $H_1$, $H_2$ on the density $p$ of observation $\omega$ can be only larger than the risks of a simple test deciding on two simple hypotheses $p = p_{\mu_1^*}$ and $p = p_{\mu_2^*}$. In other words, the quantity $\mathrm{Erfc}(\tfrac{1}{2} \|\mu_1^* - \mu_2^*\|_2)$ (see Example 2.1) is a lower bound on the risk and half of the total risk of a test deciding on $H_1$, $H_2$. With this in mind, the announced optimalities of $T$ in terms of risks are immediate consequences of (2.5).

We remark that the (nearly self-evident) result stated in Example 2.2 seems to have first been noticed in [36].

Example 2.2 allows for substantial extensions in two directions: first, it turns out that the “Euclidean separation” underlying the test built in this example can be used to decide on hypotheses on the location of a “center” of a d-dimensional distribution far beyond the Gaussian observation model considered in this example. This extension will be our goal in the next section, based on the recent paper [110]. Less straightforward and, we believe, more instructive extensions, originating from [102], will be considered in Section 2.3.
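To make the construction of Example 2.2 concrete, here is a minimal sketch for the special case where $M_1$ and $M_2$ are disjoint Euclidean balls, for which the closest points $\mu_1^*$, $\mu_2^*$ are available in closed form and no optimization solver is needed. The function names and the numerical data are hypothetical. The code assumes, in agreement with the justification above, that $\mathrm{Erfc}(s)$ is the probability for a standard Gaussian random variable to be $\ge s$, which is expressed through the standard library as `0.5 * math.erfc(s / sqrt(2))`.

```python
import math


def gaussian_tail(s: float) -> float:
    """Probability that a standard Gaussian variable is >= s (assumed meaning of Erfc)."""
    return 0.5 * math.erfc(s / math.sqrt(2.0))


def separate_balls(c1, r1, c2, r2):
    """Euclidean separation test for balls M1 = B(c1, r1), M2 = B(c2, r2)."""
    diff = [a - b for a, b in zip(c1, c2)]
    dist = math.sqrt(sum(t * t for t in diff))
    if dist <= r1 + r2:
        raise ValueError("the balls must have empty intersection")
    e = [t / dist for t in diff]                 # unit vector from c2 towards c1
    mu1 = [a - r1 * t for a, t in zip(c1, e)]    # point of M1 closest to M2
    mu2 = [b + r2 * t for b, t in zip(c2, e)]    # point of M2 closest to M1
    w = [(a + b) / 2.0 for a, b in zip(mu1, mu2)]
    half_gap = [(a - b) / 2.0 for a, b in zip(mu1, mu2)]

    def phi(omega):
        """Affine detector (2.4): accept H1 when phi(omega) >= 0, H2 otherwise."""
        return sum(h * (x - m) for h, x, m in zip(half_gap, omega, w))

    delta = (dist - r1 - r2) / 2.0               # half the distance between M1 and M2
    return phi, gaussian_tail(delta)             # detector and the risk value from (2.5)


# Hypothetical instance in the plane: unit balls centered at (3, 0) and (-3, 0).
phi, risk = separate_balls([3.0, 0.0], 1.0, [-3.0, 0.0], 1.0)
accept_h1 = phi([1.0, 5.0]) >= 0
```

For general convex sets $M_1$, $M_2$ the pair $(\mu_1^*, \mu_2^*)$ would have to be computed by a convex optimization solver; only that step changes.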
2.2 HYPOTHESIS TESTING VIA EUCLIDEAN SEPARATION

2.2.1 Situation
In this section, we will be interested in testing hypotheses
\[ H_\ell : P \in \mathcal{P}_\ell, \quad \ell = 1, ..., L \tag{2.6} \]
on the probability distribution of a random observation $\omega$ in the situation where the families of distributions $\mathcal{P}_\ell$ are obtained from a given family $\mathcal{P}$ of probability distributions by shifts. Specifically, we are given
• a family $\mathcal{P}$ of probability distributions on $\Omega = \mathbf{R}^d$ such that all distributions from $\mathcal{P}$ possess densities with respect to the Lebesgue measure on $\mathbf{R}^d$, and these densities are even functions on $\mathbf{R}^d$;²
• a collection $X_1, ..., X_L$ of nonempty closed and convex subsets of $\mathbf{R}^d$, with at most one of the sets unbounded.

These data specify $L$ families $\mathcal{P}_\ell$ of distributions on $\mathbf{R}^d$; $\mathcal{P}_\ell$ is comprised of distributions of random vectors of the form $x + \xi$, where $x \in X_\ell$ is deterministic, and $\xi$ is random with distribution from $\mathcal{P}$. Note that with this setup, deciding upon hypotheses (2.6) via observation $\omega \sim P$ is exactly the same as deciding, given observation
\[ \omega = x + \xi, \tag{2.7} \]
where $x$ is a deterministic “signal” and $\xi$ is random noise with distribution $P$ known to belong to $\mathcal{P}$, on the “position” of $x$ w.r.t. $X_1, ..., X_L$, the $\ell$-th hypothesis $H_\ell$ saying that $x \in X_\ell$. The latter allows us to write down the $\ell$-th hypothesis as $H_\ell : x \in X_\ell$ (of course, this shorthand makes sense only within the scope of our current “signal plus noise” setup).

² Allowing for a slight abuse of notation, we write $P \in \mathcal{P}$, where $P$ is a probability distribution, to express the fact that $P$ belongs to $\mathcal{P}$ (no abuse of notation so far), and write $p(\cdot) \in \mathcal{P}$ (this is an abuse of notation), where $p(\cdot)$ is the density of the probability distribution $P$, to express the fact that $P \in \mathcal{P}$.

2.2.2 Pairwise Hypothesis Testing via Euclidean Separation

2.2.2.1 The simplest case
Consider nearly the simplest case of the situation from Section 2.2.1 where $L = 2$, $X_1 = \{x_1\}$ and $X_2 = \{x_2\}$, $x_1 \ne x_2$, are singletons, and $\mathcal{P}$ also is a singleton. Let the probability density of the only distribution from $\mathcal{P}$ be of the form
\[ p(u) = f(\|u\|_2), \quad f(\cdot) \text{ is a strictly monotonically decreasing function on the nonnegative ray.} \tag{2.8} \]
This situation is a generalization of the one considered in Example 2.1, where we dealt with the special case of $f$, namely, with
\[ p(u) = (2\pi)^{-d/2} e^{-u^T u/2}. \]
In the case in question our goal is to decide on two simple hypotheses $H_\chi : p(u) = f(\|u - x_\chi\|_2)$, $\chi = 1, 2$, on the density of observation (2.7). Let us set
\[ \delta = \tfrac{1}{2} \|x_1 - x_2\|_2, \quad e = \frac{x_1 - x_2}{\|x_1 - x_2\|_2}, \quad \phi(\omega) = e^T \omega - \underbrace{\tfrac{1}{2} e^T [x_1 + x_2]}_{c}, \tag{2.9} \]
and consider the test $T$ which, given observation $\omega = x + \xi$, accepts the hypothesis $H_1 : x = x_1$ when $\phi(\omega) \ge 0$, and accepts the hypothesis $H_2 : x = x_2$ otherwise.

[Figure: densities $p_1(\cdot)$ and $p_2(\cdot)$ centered at $x_1$ and $x_2$, with the regions $\phi(\omega) > 0$, $\phi(\omega) = 0$, and $\phi(\omega) < 0$ marked.]
We have (cf. Example 2.1) Risk1 (T |H1 , H2 )
= =
R
p1 (ω)dω =
ω:φ(ω) 0, ω∈Ω
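• $\mathcal{F}$ is the space of all real-valued functions on the finite set $\Omega$.

Clearly, Discrete o.s. is simple; the function
\[ f(\mu) := \ln\left(\int_\Omega e^{\phi(\omega)} p_\mu(\omega) \, \Pi(d\omega)\right) = \ln\left(\sum_{\omega \in \Omega} e^{\phi(\omega)} \mu_\omega\right) \]
indeed is concave in $\mu \in \mathcal{M}$.

2.4.3.4 Direct products of simple observation schemes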
⁶ Large Binocular Telescope [16, 17] is a cutting-edge instrument for high-resolution optical/infrared astronomical imaging; it is the subject of a huge ongoing international project; see http://www.lbto.org. Nanoscale Fluorescent Microscopy (a.k.a. Poisson Biophotonics) is a revolutionary tool for cell imaging triggered by the advent of techniques [18, 113, 117, 211] (2014 Nobel Prize in Chemistry) allowing us to break the diffraction barrier and to view biological molecules “at work” at a resolution of 10–20 nm, yielding entirely new insights into the signalling and transport processes within cells.

Given $K$ simple observation schemes
\[ \mathcal{O}_k = (\Omega_k, \Pi_k; \{p_{\mu,k}(\cdot) : \mu \in \mathcal{M}_k\}; \mathcal{F}_k), \quad 1 \le k \le K, \]
we can define their direct product
\[ \mathcal{O}^K = \prod_{k=1}^{K} \mathcal{O}_k = (\Omega^K, \Pi^K; \{p_\mu : \mu \in \mathcal{M}^K\}; \mathcal{F}^K) \]
by modeling the situation where our observation is a tuple $\omega^K = (\omega_1, ..., \omega_K)$ with components $\omega_k$ yielded, independently of each other, by observation schemes $\mathcal{O}_k$, namely, as follows:
• The observation space $\Omega^K$ is the direct product of observation spaces $\Omega_1, ..., \Omega_K$, and the reference measure $\Pi^K$ is the product of the measures $\Pi_1, ..., \Pi_K$;
• The parameter space $\mathcal{M}^K$ is the direct product of partial parameter spaces $\mathcal{M}_1, ..., \mathcal{M}_K$, and the distribution $p_\mu(\omega^K)$ associated with parameter $\mu = (\mu_1, \mu_2, ..., \mu_K) \in \mathcal{M}^K = \mathcal{M}_1 \times ... \times \mathcal{M}_K$ is the probability distribution on $\Omega^K$ with the density
\[ p_\mu(\omega^K) = \prod_{k=1}^{K} p_{\mu_k,k}(\omega_k) \]
w.r.t. $\Pi^K$. In other words, random observation $\omega^K \sim p_\mu$ is a sample of observations $\omega_1, ..., \omega_K$, drawn, independently of each other, from the distributions $p_{\mu_1,1}, p_{\mu_2,2}, ..., p_{\mu_K,K}$;
• The space $\mathcal{F}^K$ is comprised of all separable functions
\[ \phi(\omega^K) = \sum_{k=1}^{K} \phi_k(\omega_k) \]
with $\phi_k(\cdot) \in \mathcal{F}_k$, $1 \le k \le K$.

It is immediately seen that the direct product of simple o.s.’s is simple.

When all factors $\mathcal{O}_k$, $1 \le k \le K$, are the identical simple o.s. $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$, the direct product of the factors can be “truncated” to yield the K-th power (called also the stationary K-repeated version) of $\mathcal{O}$, denoted by
\[ [\mathcal{O}]^K = (\Omega^K, \Pi^K; \{p_\mu^{(K)} : \mu \in \mathcal{M}\}; \mathcal{F}^{(K)}) \]
and defined as follows.
• $\Omega^K$ and $\Pi^K$ are exactly the same as in the direct product:
\[ \Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K}, \quad \Pi^K = \underbrace{\Pi \times ... \times \Pi}_{K}; \]
• the parameter space is $\mathcal{M}$ rather than the direct product of $K$ copies of $\mathcal{M}$, and the densities are
\[ p_\mu^{(K)}(\omega^K = (\omega_1, ..., \omega_K)) = \prod_{k=1}^{K} p_\mu(\omega_k). \]
In other words, random observations $\omega^K \sim p_\mu^{(K)}$ are K-element samples with components drawn, independently of each other, from $p_\mu$;
• the space $\mathcal{F}^{(K)}$ is comprised of separable functions
\[ \phi^{(K)}(\omega^K) = \sum_{k=1}^{K} \phi(\omega_k) \]
with identical components belonging to $\mathcal{F}$ (i.e., $\phi \in \mathcal{F}$).

It is immediately seen that a power of a simple o.s. is simple.

Remark 2.19. Gaussian, Poisson, and Discrete o.s.’s clearly are nondegenerate. It is also clear that the direct product of nondegenerate o.s.’s is nondegenerate.

2.4.4 Simple observation schemes: Main result
We are about to demonstrate that when deciding on convex, in some precise sense to be specified below, hypotheses in simple observation schemes, optimal detectors can be found efficiently by solving convex-concave saddle point problems. We start with an “executive summary” of convex-concave saddle point problems.
2.4.4.1 Executive summary of convex-concave saddle point problems
The results to follow are absolutely standard, and their proofs can be found in all textbooks on the subject; see, e.g., [221] or [15, Section D.4]. Let $U$ and $V$ be nonempty sets, and let $\Phi : U \times V \to \mathbf{R}$ be a function. These data define an antagonistic game of two players, I and II, where player I selects a point $u \in U$, and player II selects a point $v \in V$; as an outcome of these selections, player I pays to player II the sum $\Phi(u, v)$. Clearly, player I is interested in minimizing this payment, and player II in maximizing it. The data $U$, $V$, $\Phi$ are known to the players in advance, and the question is, what should be their selections?

When player I makes his selection $u$ first, and player II makes his selection $v$ with $u$ already known, player I should be ready to pay for a selection $u \in U$ a toll as large as
\[ \overline{\Phi}(u) = \sup_{v \in V} \Phi(u, v). \]
In this situation, a risk-averse player I would select $u$ by minimizing the above worst-case payment, by solving the primal problem
\[ \mathrm{Opt}(P) = \inf_{u \in U} \overline{\Phi}(u) = \inf_{u \in U} \sup_{v \in V} \Phi(u, v) \tag{$P$} \]
associated with the data $U$, $V$, $\Phi$.

Similarly, if player II makes his selection $v$ first, and player I selects $u$ after $v$ becomes known, player II should be ready to get, as a result of selecting $v \in V$, the amount as small as
\[ \underline{\Phi}(v) = \inf_{u \in U} \Phi(u, v). \]
In this situation, a risk-averse player II would select $v$ by maximizing the above worst-case payment, by solving the dual problem
\[ \mathrm{Opt}(D) = \sup_{v \in V} \underline{\Phi}(v) = \sup_{v \in V} \inf_{u \in U} \Phi(u, v). \tag{$D$} \]
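When $U$ and $V$ are finite, the primal and dual problems above reduce to elementary operations on a payoff matrix. The sketch below is an illustration under that assumption; the function name `primal_dual` and the two matrices are hypothetical. Rows are the selections of player I (the minimizer) and columns are the selections of player II (the maximizer).

```python
def primal_dual(payoff):
    """Return (Opt(P), Opt(D)) for a finite game with payoff[u][v] = Phi(u, v)."""
    n_cols = len(payoff[0])
    # Worst-case payment of player I for each of his selections u: sup over v.
    upper = [max(row) for row in payoff]
    # Worst-case gain of player II for each of his selections v: inf over u.
    lower = [min(row[v] for row in payoff) for v in range(n_cols)]
    return min(upper), max(lower)


# "Matching pennies": the order of the selections matters.
opt_p, opt_d = primal_dual([[1, -1],
                            [-1, 1]])

# A game in which both orders of play give the same value.
opt_p2, opt_d2 = primal_dual([[3, 2],
                              [1, 0]])
```

In the first game the two optimal values differ, and in the second they coincide; in both, the primal value is at least the dual one.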
Intuitively, the first situation is less preferable for player I than the second one, so that his guaranteed payment in the first situation, that is, $\mathrm{Opt}(P)$, should be $\ge$ his guaranteed payment, $\mathrm{Opt}(D)$, in the second situation:
\[ \mathrm{Opt}(P) := \inf_{u \in U} \sup_{v \in V} \Phi(u, v) \ge \sup_{v \in V} \inf_{u \in U} \Phi(u, v) =: \mathrm{Opt}(D). \]
This fact, called Weak Duality, indeed is true.

The central question related to the game is what should the players do when making their selections simultaneously, with no knowledge of what is selected by the adversary. There is a case when this question has a completely satisfactory answer: this is the case where $\Phi$ has a saddle point on $U \times V$.

Definition 2.20. A point $(u_*, v_*) \in U \times V$ is called a saddle point⁷ of function $\Phi(u, v) : U \times V \to \mathbf{R}$ if $\Phi$ as a function of $u \in U$ attains at this point its minimum, and as a function of $v \in V$ its maximum, that is, if
\[ \Phi(u, v_*) \ge \Phi(u_*, v_*) \ge \Phi(u_*, v) \quad \forall (u \in U, v \in V). \]

⁷ More precisely, “saddle point (min in $u \in U$, max in $v \in V$)”; we will usually skip the clarification in parentheses, since it always will be clear from the context what are the minimization variables and what are the maximization ones.

From the viewpoint of our game, a saddle point $(u_*, v_*)$ is an equilibrium: when one of the players sticks to the selection stemming from this point, the other one has no incentive to deviate from his selection stemming from the point. Indeed, if player II selects $v_*$, there is no reason for player I to deviate from selecting $u_*$, since with another selection, his loss (the payment) can only increase; similarly, when player I selects $u_*$, there is no reason for player II to deviate from $v_*$, since with any other selection, his gain (the payment) can only decrease. As a result, if the cost function $\Phi$ has a saddle point on $U \times V$, this saddle point $(u_*, v_*)$ can be considered as a solution to the game, as the pair of preferred selections of rational players. It can be easily seen that while $\Phi$ can have many saddle points, the values of $\Phi$ at all these points are equal to each other; we denote this common value by SadVal. If $(u_*, v_*)$ is a saddle point and player I selects $u = u_*$, his worst-case loss, over selections $v \in V$ of player II, is SadVal, and if player I selects any $u \in U$, his worst-case loss, over the selections of player II, can be only $\ge$ SadVal. Similarly, when player II selects $v = v_*$, his worst-case gain, over the selections of player I, is SadVal, and if player II selects any $v \in V$, his worst-case gain, over the selections of player I, can be only $\le$ SadVal.

Existence of saddle points of $\Phi$ (min in $u \in U$, max in $v \in V$) can be expressed in terms of the primal problem $(P)$ and the dual problem $(D)$:

Proposition 2.21. $\Phi$ has a saddle point (min in $u \in U$, max in $v \in V$) if and only if problems $(P)$ and $(D)$ are solvable with equal optimal values:
\[ \mathrm{Opt}(P) := \inf_{u \in U} \sup_{v \in V} \Phi(u, v) = \sup_{v \in V} \inf_{u \in U} \Phi(u, v) =: \mathrm{Opt}(D). \tag{2.55} \]
Whenever this is the case, the saddle points of $\Phi$ are exactly the pairs $(u_*, v_*)$ comprised of optimal solutions to problems $(P)$ and $(D)$, and the value of $\Phi$ at every one of these points is the common value SadVal of $\mathrm{Opt}(P)$ and $\mathrm{Opt}(D)$.

Existence of a saddle point of a function is a “rare commodity,” and the standard sufficient condition for it is convexity-concavity of $\Phi$ coupled with convexity of $U$ and $V$. The precise statement is as follows:

Theorem 2.22. [Sion-Kakutani; see, e.g., [221] or [15, Theorems D.4.3, D.4.4]] Let $U \subset \mathbf{R}^m$, $V \subset \mathbf{R}^n$ be nonempty closed convex sets, with $V$ bounded, and let $\Phi : U \times V \to \mathbf{R}$ be a continuous function which is convex in $u \in U$ for every fixed $v \in V$, and is concave in $v \in V$ for every fixed $u \in U$. Then the equality (2.55) holds true (although it may happen that $\mathrm{Opt}(P) = \mathrm{Opt}(D) = -\infty$). If, in addition, $\Phi$ is coercive in $u$, meaning that the level sets $\{u \in U : \Phi(u, v) \le a\}$ are bounded for every $a \in \mathbf{R}$ and $v \in V$ (equivalently: for every $v \in V$, $\Phi(u_i, v) \to +\infty$ along every sequence $u_i \in U$ going to $\infty$: $\|u_i\| \to \infty$ as $i \to \infty$), then $\Phi$ admits saddle points (min in $u \in U$, max in $v \in V$).

Note that the “true” Sion-Kakutani Theorem is a bit stronger than Theorem 2.22; the latter, however, covers all our related needs.

2.4.4.2 Main result
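Theorem 2.23. Let $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$ be a simple observation scheme, and let $M_1$, $M_2$ be nonempty compact convex subsets of $\mathcal{M}$. Then

(i) The function
\[ \Phi(\phi, [\mu; \nu]) = \tfrac{1}{2} \left[ \ln\left(\int_\Omega e^{-\phi(\omega)} p_\mu(\omega) \, \Pi(d\omega)\right) + \ln\left(\int_\Omega e^{\phi(\omega)} p_\nu(\omega) \, \Pi(d\omega)\right) \right] : \mathcal{F} \times (M_1 \times M_2) \to \mathbf{R} \tag{2.56} \]
is continuous on its domain, convex in $\phi(\cdot) \in \mathcal{F}$, concave in $[\mu; \nu] \in M_1 \times M_2$, and possesses a saddle point (min in $\phi \in \mathcal{F}$, max in $[\mu; \nu] \in M_1 \times M_2$) $(\phi_*(\cdot), [\mu_*; \nu_*])$ on $\mathcal{F} \times (M_1 \times M_2)$. W.l.o.g. $\phi_*$ can be assumed to satisfy the relation⁸
\[ \int_\Omega \exp\{-\phi_*(\omega)\} p_{\mu_*}(\omega) \, \Pi(d\omega) = \int_\Omega \exp\{\phi_*(\omega)\} p_{\nu_*}(\omega) \, \Pi(d\omega). \tag{2.57} \]

⁸ Note that $\mathcal{F}$ contains constants, and shifting by a constant the $\phi$-component of a saddle point of $\Phi$ and keeping its $[\mu; \nu]$-component intact, we clearly get another saddle point of $\Phi$.

Denoting the common value of the two quantities in (2.57) by $\varepsilon_\star$, the saddle point value
\[ \min_{\phi \in \mathcal{F}} \max_{[\mu; \nu] \in M_1 \times M_2} \Phi(\phi, [\mu; \nu]) \]
is $\ln(\varepsilon_\star)$. Besides this, setting $\phi_*^a(\cdot) = \phi_*(\cdot) - a$, one has
\[ \begin{array}{ll} (a) & \int_\Omega \exp\{-\phi_*^a(\omega)\} p_\mu(\omega) \, \Pi(d\omega) \le \exp\{a\} \varepsilon_\star \quad \forall \mu \in M_1, \\ (b) & \int_\Omega \exp\{\phi_*^a(\omega)\} p_\nu(\omega) \, \Pi(d\omega) \le \exp\{-a\} \varepsilon_\star \quad \forall \nu \in M_2. \end{array} \tag{2.58} \]
In view of Proposition 2.14 this implies that when deciding via an observation $\omega \in \Omega$ on the hypotheses
\[ H_\chi : \omega \sim p_\mu \text{ with } \mu \in M_\chi, \quad \chi = 1, 2, \]
the risks of the simple test $T_{\phi_*^a}$ based on the detector $\phi_*^a$ can be upper-bounded as follows:
\[ \mathrm{Risk}_1(T_{\phi_*^a}|H_1, H_2) \le \exp\{a\} \varepsilon_\star, \quad \mathrm{Risk}_2(T_{\phi_*^a}|H_1, H_2) \le \exp\{-a\} \varepsilon_\star. \]
Moreover, $\phi_*$, $\varepsilon_\star$ form an optimal solution to the optimization problem
\[ \min_{\phi, \epsilon} \left\{ \epsilon : \begin{array}{l} \int_\Omega e^{-\phi(\omega)} p_\mu(\omega) \, \Pi(d\omega) \le \epsilon \ \ \forall \mu \in M_1 \\ \int_\Omega e^{\phi(\omega)} p_\mu(\omega) \, \Pi(d\omega) \le \epsilon \ \ \forall \mu \in M_2 \end{array} \right\} \tag{2.59} \]
(the minimum in (2.59) is taken over all $\epsilon > 0$ and all $\Pi$-measurable functions $\phi(\cdot)$, not just over $\phi \in \mathcal{F}$).

(ii) The dual problem associated with the saddle point data $\Phi$, $\mathcal{F}$, $M_1 \times M_2$ is
\[ \max_{\mu \in M_1, \nu \in M_2} \left\{ \underline{\Phi}(\mu, \nu) := \inf_{\phi \in \mathcal{F}} \Phi(\phi; [\mu; \nu]) \right\}. \tag{$D$} \]
The objective in this problem is in fact the logarithm of the Hellinger affinity of $p_\mu$ and $p_\nu$,
\[ \underline{\Phi}(\mu, \nu) = \ln\left(\int_\Omega \sqrt{p_\mu(\omega) p_\nu(\omega)} \, \Pi(d\omega)\right), \tag{2.60} \]
and this objective is concave and continuous on $M_1 \times M_2$. The $(\mu, \nu)$-components of saddle points of $\Phi$ are exactly the maximizers $(\mu_*, \nu_*)$ of the concave function $\underline{\Phi}$ on $M_1 \times M_2$. Given such a maximizer $[\mu_*; \nu_*]$ and setting
\[ \phi_*(\omega) = \tfrac{1}{2} \ln(p_{\mu_*}(\omega)/p_{\nu_*}(\omega)) \tag{2.61} \]
we get a saddle point $(\phi_*, [\mu_*; \nu_*])$ of $\Phi$ satisfying (2.57).

(iii) Let $[\mu_*; \nu_*]$ be a maximizer of $\underline{\Phi}$ over $M_1 \times M_2$. Let, further, $\epsilon \in [0, 1/2]$ be such that there exists any (perhaps randomized) test for deciding via observation $\omega \in \Omega$ on two simple hypotheses
\[ (A) : \omega \sim p(\cdot) := p_{\mu_*}(\cdot), \qquad (B) : \omega \sim q(\cdot) := p_{\nu_*}(\cdot) \]
with total risk $\le 2\epsilon$. Then
\[ \varepsilon_\star \le 2\sqrt{\epsilon(1 - \epsilon)}. \tag{2.62} \]
In other words, if the simple hypotheses (A), (B) can be decided, by any test, with total risk $2\epsilon$, the risks of the simple test with detector $\phi_*$ given by (2.61) on the composite hypotheses $H_1$, $H_2$ do not exceed $2\sqrt{\epsilon(1 - \epsilon)}$.

For proof, see Section 2.11.3.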
Remark 2.24. Assume that we are under the premise of Theorem 2.23 and that the simple o.s. in question is nondegenerate (see Section 2.4.2). Then $\varepsilon_\star < 1$ if and only if the sets $M_1$ and $M_2$ do not intersect.

Indeed, by Theorem 2.23.i, $\ln(\varepsilon_\star)$ is the saddle point value of $\Phi(\phi, [\mu; \nu])$ on $\mathcal{F} \times (M_1 \times M_2)$, or, which is the same by Theorem 2.23.ii, the maximum of the function (2.60) on $M_1 \times M_2$; since saddle points exist, this maximum is achieved at some pair $[\mu; \nu] \in M_1 \times M_2$. Since (2.60) is $\le 0$ (by the Cauchy inequality, the Hellinger affinity of two probability densities is at most 1), we conclude that $\varepsilon_\star \le 1$, and clearly the equality takes place if and only if $\int_\Omega \sqrt{p_\mu(\omega) p_\nu(\omega)} \, \Pi(d\omega) = 1$ for some $\mu \in M_1$ and $\nu \in M_2$, or, which is the same, $\int_\Omega (\sqrt{p_\mu(\omega)} - \sqrt{p_\nu(\omega)})^2 \, \Pi(d\omega) = 0$ for these $\mu$ and $\nu$. Since $p_\mu(\cdot)$ and $p_\nu(\cdot)$ are continuous and the support of $\Pi$ is the entire $\Omega$, the latter can happen if and only if $p_\mu = p_\nu$ for our $\mu$, $\nu$, or, by nondegeneracy of $\mathcal{O}$, if and only if $M_1 \cap M_2 \ne \emptyset$. □

2.4.5 Simple observation schemes: Examples of optimal detectors
Theorem 2.23.i states that when the observation scheme $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$ is simple and we are interested in deciding on a pair of hypotheses on the distribution of observation $\omega \in \Omega$,
\[ H_\chi : \omega \sim p_\mu \text{ with } \mu \in M_\chi, \quad \chi = 1, 2, \]
and the hypotheses are convex, meaning that the underlying parameter sets $M_\chi$ are convex and compact, building an optimal, in terms of its risk, detector $\phi_*$ (that is, solving the optimization problem (2.59), which in general is semi-infinite and infinite-dimensional) reduces to solving a finite-dimensional convex problem. Specifically, an optimal solution $(\phi_*, \varepsilon_\star)$ can be built as follows:

1. We solve the optimization problem
\[ \mathrm{Opt} = \max_{\mu \in M_1, \nu \in M_2} \left[ \underline{\Phi}(\mu, \nu) := \ln\left(\int_\Omega \sqrt{p_\mu(\omega) p_\nu(\omega)} \, \Pi(d\omega)\right) \right] \tag{2.63} \]
of maximizing Hellinger affinity (the quantity under the logarithm) of a pair of distributions obeying $H_1$ and $H_2$, respectively; for a simple o.s., the objective in this problem is concave and continuous, and optimal solutions do exist;

2. (Any) optimal solution $[\mu_*; \nu_*]$ to (2.63) gives rise to an optimal detector $\phi_*$ and its risk $\varepsilon_\star$, according to
\[ \phi_*(\omega) = \tfrac{1}{2} \ln\left(\frac{p_{\mu_*}(\omega)}{p_{\nu_*}(\omega)}\right), \quad \varepsilon_\star = \exp\{\mathrm{Opt}\}. \]
The risks of the simple test $T_{\phi_*}$ associated with the above detector and deciding on $H_1$, $H_2$ satisfy the bounds
\[ \max[\mathrm{Risk}_1(T_{\phi_*}|H_1, H_2), \mathrm{Risk}_2(T_{\phi_*}|H_1, H_2)] \le \varepsilon_\star, \]
and the test is near-optimal, meaning that whenever the hypotheses $H_1$, $H_2$ (and in fact even two simple hypotheses stating that $\omega \sim p_{\mu_*}$ and $\omega \sim p_{\nu_*}$, respectively) can be decided upon by a test with total risk $\le 2\epsilon \le 1$, $T_{\phi_*}$ exhibits a “comparable” risk:
\[ \varepsilon_\star \le 2\sqrt{\epsilon(1 - \epsilon)}. \tag{2.64} \]
The test $T_{\phi_*}$ is just the maximum likelihood test induced by the probability densities $p_{\mu_*}$ and $p_{\nu_*}$.
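The recipe above can be traced numerically in the Discrete o.s. when $M_1$ and $M_2$ are singletons, so that the maximization in (2.63) is trivial and only the formulas for $\phi_*$ and $\varepsilon_\star$ remain. The sketch below makes exactly this simplifying assumption; the function name `hellinger_detector` and the two probability vectors are hypothetical, and the test is taken to accept $H_1$ when $\phi_*(\omega) \ge 0$ and $H_2$ otherwise.

```python
import math


def hellinger_detector(p, q):
    """Detector phi = 0.5*ln(p/q) and Hellinger affinity of two positive pmfs."""
    affinity = sum(math.sqrt(a * b) for a, b in zip(p, q))   # equals exp{Opt}
    phi = [0.5 * math.log(a / b) for a, b in zip(p, q)]
    return phi, affinity


# Hypothetical simple hypotheses on a three-point observation space.
p = [0.5, 0.3, 0.2]   # distribution under H1
q = [0.2, 0.3, 0.5]   # distribution under H2

phi, eps_star = hellinger_detector(p, q)

# Both sides of (2.57); each should coincide with eps_star.
lhs = sum(math.exp(-f) * a for f, a in zip(phi, p))
rhs = sum(math.exp(f) * b for f, b in zip(phi, q))

# Actual risks of the test accepting H1 when phi >= 0.
risk1 = sum(a for f, a in zip(phi, p) if f < 0)
risk2 = sum(b for f, b in zip(phi, q) if f >= 0)

# Near-optimality (2.64): the best total risk on these two simple
# hypotheses is the sum of min(p, q); call it 2*eps.
eps = 0.5 * sum(min(a, b) for a, b in zip(p, q))
near_opt_bound = 2.0 * math.sqrt(eps * (1.0 - eps))
```

With nontrivial convex sets $M_1$, $M_2$, the only additional work is the concave maximization (2.63) over $[\mu; \nu]$, which would require a convex optimization solver.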
Note that after we know that (φ∗, ε⋆) form an optimal solution to (2.59), some kind of near-optimality of the test Tφ∗ is guaranteed already by Proposition 2.18. By this proposition, whenever in nature there exists a simple test T which decides on H1, H2 with risks Risk1, Risk2 bounded by some ϵ ≤ 1/2, the upper bound ε⋆ on the risks of Tφ∗ can be bounded according to (2.64). Our present near-optimality statement is slightly stronger: first, we allow T to have the total risk ≤ 2ϵ, which is weaker than to have both risks ≤ ϵ; second, and more important, now 2ϵ should upper-bound the total risk of T on a pair of simple hypotheses “embedded” into the hypotheses H1, H2; both these modifications extend the family of tests T to which we compare the test Tφ∗, and thus enrich the comparison.
Let us see how the above recipe works for our basic simple o.s.’s.
2.4.5.1
Gaussian o.s.
When O is a Gaussian o.s., that is, {pµ : µ ∈ M} are Gaussian densities with expectations µ ∈ M = R^d and common positive definite covariance matrix Θ, and F is the family of affine functions on Ω = R^d,
• M1, M2 can be arbitrary nonempty convex compact subsets of R^d,
• problem (2.63) becomes the convex optimization problem
$$\mathrm{Opt} = -\min_{\mu\in M_1,\,\nu\in M_2} \frac{1}{8}(\mu-\nu)^T\Theta^{-1}(\mu-\nu), \qquad (2.65)$$
• the optimal detector φ∗ and the upper bound ε⋆ on its risks given by an optimal solution (µ∗, ν∗) to (2.65) are
$$\begin{array}{rcl}
\phi_*(\omega) &=& \frac{1}{2}[\mu_*-\nu_*]^T\Theta^{-1}[\omega-w],\quad w=\frac{1}{2}[\mu_*+\nu_*],\\
\varepsilon_\star &=& \exp\left\{-\frac{1}{8}[\mu_*-\nu_*]^T\Theta^{-1}[\mu_*-\nu_*]\right\}.
\end{array} \qquad (2.66)$$
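The Gaussian recipe (2.65), (2.66) can be sketched numerically in a special case where the minimization is explicit. The sketch below assumes, purely for illustration, that M1 and M2 are axis-aligned boxes and that Θ is diagonal, so that the quadratic form in (2.65) separates across coordinates; the function names `closest_box_points` and `gaussian_detector` are hypothetical and not part of the text.

```python
import math

def closest_box_points(lo1, hi1, lo2, hi2):
    """Coordinatewise closest points of two axis-aligned boxes.

    Assumption: the covariance is diagonal, so the objective of (2.65)
    is a sum of per-coordinate terms and each coordinate can be
    minimized on its own.
    """
    mu, nu = [], []
    for a1, b1, a2, b2 in zip(lo1, hi1, lo2, hi2):
        if a2 > b1:        # second interval lies strictly to the right
            m, n = b1, a2
        elif a1 > b2:      # second interval lies strictly to the left
            m, n = a1, b2
        else:              # intervals overlap: a common point gives zero gap
            m = n = max(a1, a2)
        mu.append(m)
        nu.append(n)
    return mu, nu

def gaussian_detector(mu, nu, theta_diag):
    """Detector and risk bound of (2.66) for a diagonal covariance."""
    d = [(m - n) / t for m, n, t in zip(mu, nu, theta_diag)]  # Theta^{-1}(mu - nu)
    w = [(m + n) / 2.0 for m, n in zip(mu, nu)]               # midpoint
    quad = sum(di * (m - n) for di, m, n in zip(d, mu, nu))   # (mu-nu)^T Theta^{-1} (mu-nu)
    eps_star = math.exp(-quad / 8.0)

    def phi(omega):
        return 0.5 * sum(di * (o - wi) for di, o, wi in zip(d, omega, w))

    return phi, eps_star

# Illustrative data (not from the text): two boxes in the plane, Theta = I.
mu_star, nu_star = closest_box_points([0.0, 0.0], [1.0, 1.0], [3.0, 0.5], [4.0, 2.0])
phi_star, eps_star = gaussian_detector(mu_star, nu_star, [1.0, 1.0])
```

The detector is positive at µ∗ and negative at ν∗, and it vanishes at the midpoint w, in line with the affine form in (2.66). For general convex compact sets and a non-diagonal Θ, (2.65) would have to be handed to a convex solver instead.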
Note that when Θ = Id, the test Tφ∗ becomes exactly the optimal test from Example 2.1. The upper bound on the risks of this test established in Example 2.1 (in our present notation, this bound is Erfc(½‖µ∗ − ν∗‖₂)) is slightly better than the bound ε⋆ = exp{−‖µ∗ − ν∗‖₂²/8} given by (2.66) when Θ = Id. Note, however, that when speaking about the distance δ = ‖µ∗ − ν∗‖₂ between M1 and M2 allowing for a test with risks ≤ ϵ ≪ 1, the results of Example 2.1 and (2.66) say nearly the same thing: Example 2.1 says that δ should be ≥ 2ErfcInv(ϵ), with ErfcInv defined in (1.26), and (2.66) says that δ should be ≥ 2√(2 ln(1/ϵ)). When ϵ → +0, the ratio of these two lower bounds on δ tends to 1.
It should be noted that our general construction of optimal detectors as applied to Gaussian o.s. and a pair of convex hypotheses results in exactly an optimal test and can be analyzed directly, without any “science” (see Example 2.1).
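The comparison of the two lower bounds on δ can be checked numerically. The sketch below assumes that Erfc in (1.26) is the standard Gaussian upper tail, Erfc(s) = Prob{ξ ≥ s} for ξ ∼ N(0, 1), so that ErfcInv is the corresponding quantile; this reading is an assumption, since (1.26) is not reproduced here. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def erfc_inv(eps):
    """Assumed reading of ErfcInv: the s with Prob{N(0,1) >= s} = eps."""
    return -NormalDist().inv_cdf(eps)

def delta_example_2_1(eps):
    # Separation required by Example 2.1: delta >= 2 ErfcInv(eps).
    return 2.0 * erfc_inv(eps)

def delta_from_2_66(eps):
    # Separation required by (2.66): exp(-delta^2 / 8) <= eps.
    return 2.0 * math.sqrt(2.0 * math.log(1.0 / eps))

# Ratio of the two requirements for decreasing risk levels.
levels = (1e-2, 1e-4, 1e-8, 1e-16)
ratios = [delta_from_2_66(e) / delta_example_2_1(e) for e in levels]
```

Under this assumption the ratios stay above 1 and decrease as ϵ shrinks; the convergence to 1 is slow, since the gap between the two bounds is only logarithmic in 1/ϵ.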
2.4.5.2
Poisson o.s.
When O is a Poisson o.s., that is, M = R^d_{++} is the interior of the nonnegative orthant in R^d, and pµ, µ ∈ M, is the density
$$p_\mu(\omega) = \prod_i \frac{\mu_i^{\omega_i}}{\omega_i!}\,e^{-\mu_i},\qquad \omega=(\omega_1,...,\omega_d)\in \mathbf{Z}^d_+,$$
taken w.r.t. the counting measure Π on Ω = Z^d_+, and F is the family of affine functions on Ω, the recipe from the beginning of Section 2.4.5 reads as follows:
• M1, M2 can be arbitrary nonempty convex compact subsets of R^d_{++} = {x ∈ R^d : x > 0};
• problem (2.63) becomes the convex optimization problem
$$\mathrm{Opt} = -\min_{\mu\in M_1,\,\nu\in M_2} \frac{1}{2}\sum_{i=1}^d\left(\sqrt{\mu_i}-\sqrt{\nu_i}\right)^2; \qquad (2.67)$$
• the optimal detector φ∗ and the upper bound ε⋆ on its risks given by an optimal solution (µ∗, ν∗) to (2.67) are
$$\phi_*(\omega) = \frac{1}{2}\sum_{i=1}^d \ln\left(\frac{\mu_i^*}{\nu_i^*}\right)\omega_i + \frac{1}{2}\sum_{i=1}^d[\nu_i^*-\mu_i^*],\qquad \varepsilon_\star = e^{\mathrm{Opt}}.$$
2.4.5.3
Discrete o.s.
When O is a Discrete o.s., that is, Ω = {1, ..., d}, Π is the counting measure on Ω, M = {µ ∈ R^d : µ > 0, Σ_i µ_i = 1}, and
pµ(ω) = µ_ω, ω = 1, ..., d, µ ∈ M,
the recipe from the beginning of Section 2.4.5 reads as follows (see footnote 9):
• M1, M2 can be arbitrary nonempty convex compact subsets of the relative interior M of the probabilistic simplex,
• problem (2.63) is equivalent to the convex program
$$\varepsilon_\star = \max_{\mu\in M_1,\,\nu\in M_2} \sum_{i=1}^d \sqrt{\mu_i\nu_i}; \qquad (2.68)$$
• the optimal detector φ∗ given by an optimal solution (µ∗, ν∗) to (2.68) is
$$\phi_*(\omega) = \frac{1}{2}\ln\left(\frac{\mu^*_\omega}{\nu^*_\omega}\right), \qquad (2.69)$$
and the upper bound ε⋆ on the risks of this detector is given by (2.68).
9 It should be mentioned that the results of this section as applied to the Discrete observation scheme are a simple particular case (that of finite Ω) of the results of [21, 22, 25] on distinguishing convex sets of probability distributions.
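As a minimal sanity check of (2.68), (2.69), consider the simplest convex hypotheses: singletons M1 = {µ}, M2 = {ν}, for which the optimal pair is (µ, ν) itself. The sketch below, with illustrative probability vectors that are not from the text and a hypothetical helper name, evaluates the detector and verifies that both of its risks coincide with the Hellinger affinity.

```python
import math

def discrete_detector(mu, nu):
    """Detector (2.69) and Hellinger affinity for two points of the open simplex."""
    phi = [0.5 * math.log(m / n) for m, n in zip(mu, nu)]
    affinity = sum(math.sqrt(m * n) for m, n in zip(mu, nu))
    return phi, affinity

# Illustrative distributions on a 3-point observation space.
mu = [0.5, 0.3, 0.2]
nu = [0.2, 0.3, 0.5]
phi, eps_star = discrete_detector(mu, nu)

# Risks of the detector: E_mu[exp(-phi)] and E_nu[exp(phi)].
risk_under_mu = sum(math.exp(-f) * m for f, m in zip(phi, mu))
risk_under_nu = sum(math.exp(f) * n for f, n in zip(phi, nu))
```

Both sums reduce termwise to √(µ_i ν_i), which is why the two risks agree with ε⋆ exactly; for non-singleton M1, M2 the maximization in (2.68) would require a convex solver.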
2.4.5.4
K-th power of a simple o.s.
Recall that the K-th power of a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F) (see Section 2.4.3.4) is the o.s.
$$[O]^K = \left(\Omega^K, \Pi^K; \{p^{(K)}_\mu: \mu\in M\}; F^{(K)}\right),$$
where Ω^K is the direct product of K copies of Ω, Π^K is the product of K copies of Π, the densities p^(K)_µ are product densities induced by K copies of the density pµ, µ ∈ M,
$$p^{(K)}_\mu(\omega^K=(\omega_1,...,\omega_K)) = \prod_{k=1}^K p_\mu(\omega_k),$$
and F^(K) is comprised of functions
$$\phi^{(K)}(\omega^K=(\omega_1,...,\omega_K)) = \sum_{k=1}^K \phi(\omega_k)$$
stemming from functions φ ∈ F. Clearly, [O]^K is the observation scheme describing the stationary K-repeated observations ω^K = (ω1, ..., ωK) with ωk stemming from the o.s. O; see Section 2.3.2.3. As we remember, [O]^K is simple provided that O is so.
Assuming O simple, it is immediately seen that as applied to the o.s. [O]^K, the recipe from the beginning of Section 2.4.5 reads as follows:
• M1, M2 can be arbitrary nonempty convex compact subsets of M, and the corresponding hypotheses, H_χ^K, χ = 1, 2, state that the components ωk of observation ω^K = (ω1, ..., ωK) are drawn, independently of each other, from distribution pµ with µ ∈ M1 (hypothesis H_1^K) or µ ∈ M2 (hypothesis H_2^K);
• problem (2.63) is the convex program
$$\mathrm{Opt}(K) = \max_{\mu\in M_1,\,\nu\in M_2}\ \underbrace{\ln\left(\int_{\Omega^K}\sqrt{p^{(K)}_\mu(\omega^K)p^{(K)}_\nu(\omega^K)}\,\Pi^K(d\omega^K)\right)}_{\equiv\, K\ln\left(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right)} \qquad (D_K)$$
implying that any optimal solution to the “single-observation” problem (D1) associated with M1, M2 is optimal for the “K-observation” problem (DK) associated with M1, M2, and Opt(K) = K Opt(1);
• the optimal detector φ∗^(K) given by an optimal solution (µ∗, ν∗) to (D1) (this solution is optimal for (DK) as well) is
$$\phi^{(K)}_*(\omega^K) = \sum_{k=1}^K \phi_*(\omega_k),\qquad \phi_*(\omega) = \frac{1}{2}\ln\left(\frac{p_{\mu_*}(\omega)}{p_{\nu_*}(\omega)}\right), \qquad (2.70)$$
and the upper bound ε⋆(K) on the risks of the detector φ∗^(K) on the pair of families of distributions obeying hypotheses H_1^K or H_2^K is
$$\varepsilon_\star(K) = e^{\mathrm{Opt}(K)} = e^{K\,\mathrm{Opt}(1)} = [\varepsilon_\star(1)]^K. \qquad (2.71)$$
The just outlined results on powers of simple observation schemes allow us to express near-optimality of detector-based tests in simple o.s.’s in a nicer form.
Proposition 2.25. Let O = (Ω, Π; {pµ : µ ∈ M}; F) be a simple observation scheme, M1, M2 be two nonempty convex compact subsets of M, and (µ∗, ν∗) be an optimal solution to the convex optimization problem (cf. Theorem 2.23)
$$\mathrm{Opt} = \max_{\mu\in M_1,\,\nu\in M_2} \ln\left(\int_\Omega \sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right).$$
Let φ∗ and φ∗^(K) be single- and K-observation detectors induced by (µ∗, ν∗) via (2.70). Let ϵ ∈ (0, 1/2), and assume that for some positive integer K in nature there exists a simple test T^K deciding via K i.i.d. observations ω^K = (ω1, ..., ωK) with ωk ∼ pµ, for some unknown µ ∈ M, on the hypotheses
H_χ^(K) : µ ∈ Mχ, χ = 1, 2,
with risks Risk1, Risk2 not exceeding ϵ. Then setting
$$K_+ = \left\lceil \frac{2}{1-\ln(4(1-\epsilon))/\ln(1/\epsilon)}\,K \right\rceil,$$
the simple test T_{φ∗^(K+)} utilizing K+ i.i.d. observations decides on H_1^(K+), H_2^(K+) with risks ≤ ϵ. Note that K+ “is of the order of K”: K+/K → 2 as ϵ → +0.
Proof. Applying item (iii) of Theorem 2.23 to the simple o.s. [O]^K, we see that what above was called ε⋆(K) satisfies
$$\varepsilon_\star(K) \le 2\sqrt{\epsilon(1-\epsilon)}.$$
By (2.71), we conclude that ε⋆(1) ≤ [2√(ϵ(1 − ϵ))]^{1/K}, whence, by the same (2.71), ε⋆(T) ≤ [2√(ϵ(1 − ϵ))]^{T/K}, T = 1, 2, .... Plugging T = K+ into this bound, we get the inequality ε⋆(K+) ≤ ϵ. It remains to recall that ε⋆(K+) upper-bounds the risks of the test T_{φ∗^(K+)} when deciding on H_1^(K+) vs. H_2^(K+). ✷
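The sample size K+ of Proposition 2.25 and the bound used in its proof are easy to evaluate. The sketch below assumes that K+ is the displayed ratio rounded up to an integer (the rounding is an assumption, made because K+ counts observations); the function names are hypothetical.

```python
import math

def k_plus(K, eps):
    """Sample size K+ of Proposition 2.25 (ratio rounded up, by assumption)."""
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 1/2)")
    ratio = math.log(4.0 * (1.0 - eps)) / math.log(1.0 / eps)
    return math.ceil(2.0 * K / (1.0 - ratio))

def risk_bound(K, T, eps):
    """Bound [2 sqrt(eps (1 - eps))]^(T/K) on eps_star(T) from the proof."""
    return (2.0 * math.sqrt(eps * (1.0 - eps))) ** (T / K)

# Illustration: a hypothetical test with K = 10 observations and risks <= 0.01.
K, eps = 10, 0.01
Kp = k_plus(K, eps)
```

For these illustrative values, Kp observations bring the bound `risk_bound(K, Kp, eps)` below ϵ, while Kp − 1 observations do not, so the rounded-up value is the smallest T for which the argument of the proof goes through.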
2.5
TESTING MULTIPLE HYPOTHESES
So far, we have focused on detector-based tests deciding on pairs of hypotheses, and our “constructive” results were restricted to pairs of convex hypotheses dealing with a simple o.s.
O = (Ω, Π; {pµ : µ ∈ M}; F), (2.72)
convexity of a hypothesis meaning that the family of probability distributions obeying the hypothesis is {pµ : µ ∈ X}, associated with a convex (in fact, convex compact) set X ⊂ M.
In this section, we will be interested in pairwise testing unions of convex hypotheses and testing multiple (more than two) hypotheses.
2.5.1 Testing unions
2.5.1.1 Situation and goal
Let Ω be an observation space, and assume we are given two finite collections of families of probability distributions on Ω: families of red distributions ℛi, 1 ≤ i ≤ r, and families of blue distributions ℬj, 1 ≤ j ≤ b. These families give rise to r red and b blue hypotheses on the distribution P of an observation ω ∈ Ω, specifically,
Ri : P ∈ ℛi (red hypotheses) and Bj : P ∈ ℬj (blue hypotheses).
Assume that for every i ≤ r, j ≤ b we have at our disposal a simple detector-based test Tij capable of deciding on Ri vs. Bj. What we want is to assemble these tests into a test T deciding on the union R of red hypotheses vs. the union B of blue ones:
$$R: P\in\mathcal{R} := \bigcup_{i=1}^r \mathcal{R}_i,\qquad B: P\in\mathcal{B} := \bigcup_{j=1}^b \mathcal{B}_j.$$
Here P, as always, stands for the probability distribution of observation ω ∈ Ω.
Our motivation primarily stems from the case where Ri and Bj are convex hypotheses in a simple o.s. (2.72):
ℛi = {pµ : µ ∈ Mi}, ℬj = {pµ : µ ∈ Nj},
where Mi and Nj are convex compact subsets of M. In this case we indeed know how to build near-optimal tests deciding on Ri vs. Bj, and the question we have posed becomes, how do we assemble these tests into a test deciding on R vs. B, with
$$R: P\in\mathcal{R} = \{p_\mu: \mu\in X\},\ X=\bigcup_i M_i,\qquad B: P\in\mathcal{B} = \{p_\mu: \mu\in Y\},\ Y=\bigcup_j N_j\,?$$
While the structure of R, B is similar to that of Ri , Bj , there is a significant difference: the sets X, Y are, in general, nonconvex, and therefore the techniques we have developed fail to address testing R vs. B directly.
2.5.1.2
The construction
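In the situation just described, let φij be the detectors underlying the tests Tij; w.l.o.g., we can assume these detectors balanced (see Section 2.3.2.2) with some risks ϵij:
$$\left.\begin{array}{ll}
\int_\Omega e^{-\phi_{ij}(\omega)}P(d\omega) \le \epsilon_{ij} & \forall P\in\mathcal{R}_i\\
\int_\Omega e^{\phi_{ij}(\omega)}P(d\omega) \le \epsilon_{ij} & \forall P\in\mathcal{B}_j
\end{array}\right\},\quad 1\le i\le r,\ 1\le j\le b. \qquad (2.73)$$
Let us assemble the detectors φij into a detector for R, B as follows:
$$\phi(\omega) = \max_{1\le i\le r}\ \min_{1\le j\le b}\ [\phi_{ij}(\omega)-\alpha_{ij}], \qquad (2.74)$$
where the shifts αij are parameters of the construction.
Proposition 2.26. The risks of φ on R, B can be bounded as
$$\begin{array}{ll}
\forall P\in\mathcal{R}: & \int_\Omega e^{-\phi(\omega)}P(d\omega) \le \max_{i\le r}\left[\sum_{j=1}^b \epsilon_{ij}e^{\alpha_{ij}}\right],\\
\forall P\in\mathcal{B}: & \int_\Omega e^{\phi(\omega)}P(d\omega) \le \max_{j\le b}\left[\sum_{i=1}^r \epsilon_{ij}e^{-\alpha_{ij}}\right].
\end{array} \qquad (2.75)$$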
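Therefore, the risks of φ on R, B are upper-bounded by the quantity
$$\varepsilon_\star = \max\left[\max_{i\le r}\left[\sum_{j=1}^b \epsilon_{ij}e^{\alpha_{ij}}\right],\ \max_{j\le b}\left[\sum_{i=1}^r \epsilon_{ij}e^{-\alpha_{ij}}\right]\right], \qquad (2.76)$$
whence the risks of the simple test Tφ, based on the detector φ, deciding on R, B are upper-bounded by ε⋆.
Proof. Let P ∈ ℛ, so that P ∈ ℛi∗ for some i∗ ≤ r. Then
$$\begin{array}{rcl}
\int_\Omega e^{-\phi(\omega)}P(d\omega) &=& \int_\Omega e^{\min_{i\le r}\max_{j\le b}[-\phi_{ij}(\omega)+\alpha_{ij}]}P(d\omega)\\
&\le& \int_\Omega e^{\max_{j\le b}[-\phi_{i_*j}(\omega)+\alpha_{i_*j}]}P(d\omega) \le \sum_{j=1}^b \int_\Omega e^{-\phi_{i_*j}(\omega)+\alpha_{i_*j}}P(d\omega)\\
&=& \sum_{j=1}^b e^{\alpha_{i_*j}}\int_\Omega e^{-\phi_{i_*j}(\omega)}P(d\omega)\\
&\le& \sum_{j=1}^b \epsilon_{i_*j}e^{\alpha_{i_*j}}\quad [\text{by (2.73) due to } P\in\mathcal{R}_{i_*}]\\
&\le& \max_{i\le r}\left[\sum_{j=1}^b \epsilon_{ij}e^{\alpha_{ij}}\right].
\end{array}$$
Now let P ∈ ℬ, so that P ∈ ℬj∗ for some j∗. We have
$$\begin{array}{rcl}
\int_\Omega e^{\phi(\omega)}P(d\omega) &=& \int_\Omega e^{\max_{i\le r}\min_{j\le b}[\phi_{ij}(\omega)-\alpha_{ij}]}P(d\omega)\\
&\le& \int_\Omega e^{\max_{i\le r}[\phi_{ij_*}(\omega)-\alpha_{ij_*}]}P(d\omega) \le \sum_{i=1}^r \int_\Omega e^{\phi_{ij_*}(\omega)-\alpha_{ij_*}}P(d\omega)\\
&=& \sum_{i=1}^r e^{-\alpha_{ij_*}}\int_\Omega e^{\phi_{ij_*}(\omega)}P(d\omega)\\
&\le& \sum_{i=1}^r \epsilon_{ij_*}e^{-\alpha_{ij_*}}\quad [\text{by (2.73) due to } P\in\mathcal{B}_{j_*}]\\
&\le& \max_{j\le b}\left[\sum_{i=1}^r \epsilon_{ij}e^{-\alpha_{ij}}\right].
\end{array}$$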
(2.75) is proved. The remaining claims of the proposition are readily given by (2.75) combined with Proposition 2.14. ✷
Optimal choice of shift parameters. The detector and the test considered in Proposition 2.26, like the resulting risk bound ε⋆, depend on the shifts αij. Let us optimize the risk bound w.r.t. these shifts. To this end, consider the r × b matrix
$$E = [\epsilon_{ij}]_{i\le r,\ j\le b}$$
and the symmetric (r + b) × (r + b) matrix
$$\mathcal{E} = \begin{bmatrix} & E\\ E^T & \end{bmatrix}.$$
As is well known, the eigenvalues of the symmetric matrix ℰ are comprised of the pairs (σs, −σs), where σs are the singular values of E, and several zeros; in particular, the leading eigenvalue of ℰ is the spectral norm ‖E‖_{2,2} (the largest singular value) of matrix E. Further, E is a matrix with positive entries, so that ℰ is a symmetric entrywise nonnegative matrix. By the Perron-Frobenius Theorem, the leading eigenvector of this matrix can be selected to be nonnegative. Denoting this nonnegative eigenvector [g; h] with r-dimensional g and b-dimensional h, and setting ρ = ‖E‖_{2,2}, we have
$$\rho g = Eh,\qquad \rho h = E^Tg. \qquad (2.77)$$
Observe that ρ > 0 (evident), whence both g and h are nonzero (since otherwise (2.77) would imply g = h = 0, which is impossible because the eigenvector [g; h] is nonzero). Since h and g are nonzero nonnegative vectors, ρ > 0 and E is entrywise positive, (2.77) says that g and h are strictly positive vectors. The latter allows us to define shifts αij according to
$$\alpha_{ij} = \ln(h_j/g_i). \qquad (2.78)$$
With these shifts, we get
$$\max_{i\le r}\left[\sum_{j=1}^b \epsilon_{ij}e^{\alpha_{ij}}\right] = \max_{i\le r}\sum_{j=1}^b \epsilon_{ij}h_j/g_i = \max_{i\le r}\,(Eh)_i/g_i = \max_{i\le r}\rho = \rho$$
(we have used the first relation in (2.77)), and
$$\max_{j\le b}\left[\sum_{i=1}^r \epsilon_{ij}e^{-\alpha_{ij}}\right] = \max_{j\le b}\sum_{i=1}^r \epsilon_{ij}g_i/h_j = \max_{j\le b}\,[E^Tg]_j/h_j = \max_{j\le b}\rho = \rho$$
(we have used the second relation in (2.77)). The bottom line is as follows:
Proposition 2.27. In the situation and the notation of Section 2.5.1.1, the risks of the detector (2.74) with shifts (2.77), (2.78) on the families ℛ, ℬ do not exceed the quantity
‖E := [ϵij]_{i≤r, j≤b}‖_{2,2}.
As a result, the risks of the simple test Tφ deciding on the hypotheses R, B do not exceed ‖E‖_{2,2} as well.
In fact, the shifts in the above proposition are the best possible; this is an immediate consequence of the following simple fact:
Proposition 2.28. Let E = [eij] be a nonzero entrywise nonnegative n × n symmetric matrix. Then the optimal value in the optimization problem
$$\mathrm{Opt} = \min_{\alpha_{ij}}\left\{\max_{i\le n}\sum_{j=1}^n e_{ij}e^{\alpha_{ij}}:\ \alpha_{ij} = -\alpha_{ji}\right\} \qquad (*)$$
is equal to ‖E‖_{2,2}. When the Perron-Frobenius eigenvector f of E can be selected positive, the problem is solvable, and an optimal solution is given by
$$\alpha_{ij} = \ln(f_j/f_i),\quad 1\le i,j\le n. \qquad (2.79)$$
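The shift construction (2.77), (2.78) behind Proposition 2.27 can be sketched numerically as follows. The sketch assumes an entrywise positive r × b matrix of pairwise risks and finds the vectors g, h by power iteration on EᵀE; this particular numerical method, the function names, and the sample matrix are illustrative choices and are not taken from the text.

```python
import math

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def perron_pair(E, iters=500):
    """Return (rho, g, h) with rho*g = E h and, up to iteration error, rho*h = E^T g.

    Power iteration on E^T E; assumes E is entrywise positive, so the
    leading singular vectors can be taken strictly positive.
    """
    Et = [list(col) for col in zip(*E)]
    h = [1.0] * len(E[0])
    for _ in range(iters):
        h = matvec(Et, matvec(E, h))
        scale = math.sqrt(sum(v * v for v in h))
        h = [v / scale for v in h]
    Eh = matvec(E, h)
    rho = math.sqrt(sum(v * v for v in Eh))  # spectral norm, since ||h||_2 = 1
    g = [v / rho for v in Eh]
    return rho, g, h

def risk_bounds(E, alpha):
    """The two maxima entering (2.76) for given shifts alpha."""
    red = max(sum(e * math.exp(a) for e, a in zip(Er, ar)) for Er, ar in zip(E, alpha))
    blue = max(sum(E[i][j] * math.exp(-alpha[i][j]) for i in range(len(E)))
               for j in range(len(E[0])))
    return red, blue

# Illustrative 3 x 2 matrix of pairwise risks (not from the text).
E = [[0.10, 0.20], [0.30, 0.05], [0.20, 0.20]]
rho, g, h = perron_pair(E)
alpha = [[math.log(hj / gi) for hj in h] for gi in g]   # shifts (2.78)
red, blue = risk_bounds(E, alpha)
red0, blue0 = risk_bounds(E, [[0.0, 0.0] for _ in E])   # trivial shifts
```

With the shifts (2.78) both maxima collapse to the spectral norm ρ, whereas with zero shifts the bound is the larger of the maximal row sum and the maximal column sum of E, which is never smaller than ρ.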
Proof. Let us prove that Opt ≤ ρ := ‖E‖_{2,2}. Given ϵ > 0, we clearly can find an entrywise nonnegative symmetric matrix E′ with entries e′ij in between eij and eij + ϵ such that the Perron-Frobenius eigenvector f of E′ can be selected positive (it suffices, e.g., to set e′ij = eij + ϵ). Selecting αij according to (2.79), we get a feasible solution to (∗) such that
$$\forall i:\ \sum_j e_{ij}e^{\alpha_{ij}} \le \sum_j e'_{ij}f_j/f_i = \|E'\|_{2,2},$$
implying that Opt ≤ ‖E′‖_{2,2}. Passing to limit as ϵ → +0, we get Opt ≤ ‖E‖_{2,2}. As a byproduct of our reasoning, if E admits a positive Perron-Frobenius eigenvector f, then (2.79) yields a feasible solution to (∗) with the value of the objective equal to ‖E‖_{2,2}.
It remains to prove that Opt ≥ ‖E‖_{2,2}. Assume that this is not the case, so that (∗) admits a feasible solution α̂ij such that
$$\hat\rho := \max_i \sum_j e_{ij}e^{\hat\alpha_{ij}} < \rho := \|E\|_{2,2}.$$
By an arbitrarily small perturbation of E, we can make this matrix symmetric and entrywise positive, and still satisfying the above strict inequality; to save notation, assume that already the original E is entrywise positive. Let f be a positive Perron-Frobenius eigenvector of E, and let, as above, αij = ln(fj/fi), so that
$$\sum_j e_{ij}e^{\alpha_{ij}} = \sum_j e_{ij}f_j/f_i = \rho\quad \forall i.$$
Setting δij = α̂ij − αij, we conclude that the convex functions
$$\theta_i(t) = \sum_j e_{ij}e^{\alpha_{ij}+t\delta_{ij}}$$
all are equal to ρ at t = 0, and all are ≤ ρ̂ < ρ at t = 1, implying that θi(1) < θi(0) for every i. The latter, in view of convexity of θi(·), implies that
$$\theta_i'(0) = \sum_j e_{ij}e^{\alpha_{ij}}\delta_{ij} = \sum_j e_{ij}(f_j/f_i)\delta_{ij} < 0\quad \forall i.$$
Multiplying the resulting inequalities by fi² and summing up over i, we get
$$\sum_{i,j} e_{ij}f_if_j\delta_{ij} < 0,$$
which is impossible: we have eij = eji and δij = −δji, implying that the left-hand side in the latter inequality is 0. ✷
2.5.2
Testing multiple hypotheses “up to closeness”
So far, we have considered detector-based simple tests deciding on pairs of hypotheses, specifically, convex hypotheses in simple o.s.’s (Section 2.4.4) and unions of convex hypotheses (Section 2.5.1) (see footnote 10). Now we intend to consider testing of multiple (perhaps more than 2) hypotheses “up to closeness”; the latter notion was introduced in Section 2.2.4.2.
10 Strictly speaking, in Section 2.5.1 it was not explicitly stated that the unions under consideration involve convex hypotheses in simple o.s.’s; our emphasis was on how to decide on a pair of union-type hypotheses given pairwise detectors for “red” and “blue” components of the unions from the pair. However, for now, the only situation where we indeed have at our disposal good pairwise detectors for red and blue components is that in which these components are convex hypotheses in a good o.s.
2.5.2.1
Situation and goal
Let Ω be an observation space, and let a collection 𝒫1, ..., 𝒫L of families of probability distributions on Ω be given. As always, families 𝒫ℓ give rise to hypotheses
Hℓ : P ∈ 𝒫ℓ
on the distribution P of observation ω ∈ Ω. Assume also that we are given a closeness relation 𝒞 on {1, ..., L}. Recall that, formally, a closeness relation is some set of pairs of indices (ℓ, ℓ′) with ℓ, ℓ′ ∈ {1, ..., L}; we interpret the inclusion (ℓ, ℓ′) ∈ 𝒞 as the fact that hypothesis Hℓ “is close” to hypothesis Hℓ′. When (ℓ, ℓ′) ∈ 𝒞, we say that ℓ′ is close (or 𝒞-close) to ℓ. We always assume that
• 𝒞 contains the diagonal: (ℓ, ℓ) ∈ 𝒞 for every ℓ ≤ L (“each hypothesis is close to itself”), and
• 𝒞 is symmetric: whenever (ℓ, ℓ′) ∈ 𝒞, we have also (ℓ′, ℓ) ∈ 𝒞 (“if the ℓ-th hypothesis is close to the ℓ′-th one, then the ℓ′-th hypothesis is close to the ℓ-th one”).
Recall that a test T deciding on the hypotheses H1, ..., HL via observation ω ∈ Ω is a procedure which, given on input ω ∈ Ω, builds some set T(ω) ⊂ {1, ..., L}, accepts all hypotheses Hℓ with ℓ ∈ T(ω), and rejects all other hypotheses.
Risks of an “up to closeness” test. The notion of 𝒞-risk of a test was introduced in Section 2.2.4.2; we reproduce it here for the reader’s convenience. Given closeness 𝒞 and a test T, we define the 𝒞-risk Risk𝒞(T|H1, ..., HL) of T as the smallest ϵ ≥ 0 such that
Whenever an observation ω is drawn from a distribution P ∈ ∪ℓ 𝒫ℓ, and ℓ∗ is such that P ∈ 𝒫ℓ∗ (i.e., hypothesis Hℓ∗ is true), the P-probability of the event “ℓ∗ ∉ T(ω) (“true hypothesis Hℓ∗ is not accepted”) or there exists ℓ′ not close to ℓ∗ such that Hℓ′ is accepted” is at most ϵ.
Equivalently: Risk𝒞(T|H1, ..., HL) ≤ ϵ if and only if the following takes place:
Whenever an observation ω is drawn from a distribution P ∈ ∪ℓ 𝒫ℓ, and ℓ∗ is such that P ∈ 𝒫ℓ∗ (i.e., hypothesis Hℓ∗ is true), the P-probability of the event “ℓ∗ ∈ T(ω) (“the true hypothesis Hℓ∗ is accepted”) and ℓ′ ∈ T(ω) implies that (ℓ∗, ℓ′) ∈ 𝒞 (“all accepted hypotheses are 𝒞-close to the true hypothesis Hℓ∗”)” is at least 1 − ϵ.
For example, consider nine polygons presented on Figure 2.4 and associate with them nine hypotheses on a 2D “signal plus noise” observation ω = x + ξ, ξ ∼ N(0, I2), with the ℓ-th hypothesis stating that x belongs to the ℓ-th polygon. We define closeness 𝒞 on the collection of hypotheses presented on Figure 2.4 as “two hypotheses are close if and only if the corresponding polygons intersect,” like A and B, or A and E. Now the fact that a test T has 𝒞-risk ≤ 0.01 would imply, in particular, that if the probability distribution P underlying the observed ω obeys hypothesis A (i.e., the mean of P belongs to the polygon A), then with P-probability at least 0.99 the list of accepted hypotheses includes hypothesis A, and the only other hypotheses in this list are among hypotheses B, D, and E.
Figure 2.4: Nine hypotheses on the location of the mean µ of observation ω ∼ N(µ, I2), each stating that µ belongs to a specific polygon.
2.5.2.2
“Building blocks” and construction
The construction we are about to present is, essentially, that used in Section 2.2.4.3 as applied to detector-generated tests. This being said, the presentation to follow is self-contained.
The building blocks of our construction are pairwise detectors φℓℓ′(ω), 1 ≤ ℓ < ℓ′ ≤ L, for pairs 𝒫ℓ, 𝒫ℓ′ along with (upper bounds on) the risks ϵℓℓ′ of these detectors:
$$\left.\begin{array}{ll}
\forall(P\in\mathcal{P}_\ell): & \int_\Omega e^{-\phi_{\ell\ell'}(\omega)}P(d\omega) \le \epsilon_{\ell\ell'}\\
\forall(P\in\mathcal{P}_{\ell'}): & \int_\Omega e^{\phi_{\ell\ell'}(\omega)}P(d\omega) \le \epsilon_{\ell\ell'}
\end{array}\right\},\quad 1\le\ell<\ell'\le L.$$
Setting
φℓ′ℓ(ω) = −φℓℓ′(ω), ϵℓ′ℓ = ϵℓℓ′, 1 ≤ ℓ < ℓ′ ≤ L,
φℓℓ(ω) ≡ 0, ϵℓℓ = 1, 1 ≤ ℓ ≤ L,
we get what we refer to as a balanced system of detectors φℓℓ′ and risks ϵℓℓ′, 1 ≤ ℓ, ℓ′ ≤ L, for the collection 𝒫1, ..., 𝒫L, meaning that
$$\begin{array}{ll}
\phi_{\ell\ell'}(\omega)+\phi_{\ell'\ell}(\omega)\equiv 0,\ \epsilon_{\ell\ell'}=\epsilon_{\ell'\ell}, & 1\le\ell,\ell'\le L,\\
\forall P\in\mathcal{P}_\ell:\ \int_\Omega e^{-\phi_{\ell\ell'}(\omega)}P(d\omega)\le\epsilon_{\ell\ell'}, & 1\le\ell,\ell'\le L.
\end{array} \qquad (2.80)$$
Given closeness 𝒞, we associate with it the symmetric L × L matrix C given by
$$C_{\ell\ell'} = \begin{cases} 0, & (\ell,\ell')\in\mathcal{C},\\ 1, & (\ell,\ell')\notin\mathcal{C}.\end{cases} \qquad (2.81)$$
Test T_𝒞. Let a collection of shifts αℓℓ′ ∈ R satisfying the relation
$$\alpha_{\ell\ell'} = -\alpha_{\ell'\ell},\quad 1\le\ell,\ell'\le L \qquad (2.82)$$
be given. The detectors φℓℓ′ and the shifts αℓℓ′ specify a test T_𝒞 deciding on hypotheses H1, ..., HL. Precisely, given an observation ω, the test T_𝒞 accepts exactly those hypotheses Hℓ for which φℓℓ′(ω) − αℓℓ′ > 0 whenever ℓ′ is not 𝒞-close to ℓ:
$$T_{\mathcal{C}}(\omega) = \{\ell:\ \phi_{\ell\ell'}(\omega)-\alpha_{\ell\ell'} > 0\ \ \forall(\ell': (\ell,\ell')\notin\mathcal{C})\}.$$
Proposition 2.29. (i) The 𝒞-risk of the test T_𝒞 just defined is upper-bounded by the quantity
$$\varepsilon[\alpha] = \max_{\ell\le L}\sum_{\ell'=1}^L \epsilon_{\ell\ell'}C_{\ell\ell'}e^{\alpha_{\ell\ell'}}$$
with C given by (2.81).
(ii) The infimum, over shifts α satisfying (2.82), of the risk bound ε[α] is the quantity
ε⋆ = ‖E‖_{2,2},
where the L × L symmetric entrywise nonnegative matrix E is given by
E = [eℓℓ′ := ϵℓℓ′Cℓℓ′]_{ℓ,ℓ′≤L}.
Assuming E admits a strictly positive Perron-Frobenius vector f, an optimal choice of the shifts is
αℓℓ′ = ln(fℓ′/fℓ), 1 ≤ ℓ, ℓ′ ≤ L,
resulting in ε[α] = ε⋆ = ‖E‖_{2,2}.
Proof. (i): Setting
φ̄ℓℓ′(ω) = φℓℓ′(ω) − αℓℓ′, ϵ̄ℓℓ′ = ϵℓℓ′e^{αℓℓ′},
(2.80) and (2.82) imply that
$$\begin{array}{lll}
(a) & \bar\phi_{\ell\ell'}(\omega)+\bar\phi_{\ell'\ell}(\omega)\equiv 0, & 1\le\ell,\ell'\le L,\\
(b) & \forall P\in\mathcal{P}_\ell:\ \int_\Omega e^{-\bar\phi_{\ell\ell'}(\omega)}P(d\omega)\le\bar\epsilon_{\ell\ell'}, & 1\le\ell,\ell'\le L.
\end{array} \qquad (2.83)$$
Now let ℓ∗ be such that the distribution P of observation ω belongs to 𝒫ℓ∗. Then for every ℓ′ the P-probability of the event φ̄ℓ∗ℓ′(ω) ≤ 0 is ≤ ϵ̄ℓ∗ℓ′ by (2.83.b), whence the P-probability of the event
$$E_* = \{\omega:\ \exists\ell': (\ell_*,\ell')\notin\mathcal{C}\ \&\ \bar\phi_{\ell_*\ell'}(\omega)\le 0\}$$
is upper-bounded by
$$\sum_{\ell':(\ell_*,\ell')\notin\mathcal{C}} \bar\epsilon_{\ell_*\ell'} = \sum_{\ell'=1}^L C_{\ell_*\ell'}\epsilon_{\ell_*\ell'}e^{\alpha_{\ell_*\ell'}} \le \varepsilon[\alpha].$$
Assume that E∗ does not take place (as we have seen, this indeed is so with P-probability ≥ 1 − ε[α]). Then φ̄ℓ∗ℓ′(ω) > 0 for all ℓ′ such that (ℓ∗, ℓ′) ∉ 𝒞, implying, first, that Hℓ∗ is accepted by our test. Second, φ̄ℓ′ℓ∗(ω) = −φ̄ℓ∗ℓ′(ω) < 0 whenever (ℓ∗, ℓ′) ∉ 𝒞, or, due to the symmetry of closeness, whenever (ℓ′, ℓ∗) ∉ 𝒞, implying that the test T_𝒞 rejects the hypothesis Hℓ′ when ℓ′ is not 𝒞-close to ℓ∗. Thus, the P-probability of the event “Hℓ∗ is accepted, and all accepted hypotheses are 𝒞-close to Hℓ∗” is at least 1 − ε[α]. We conclude that the 𝒞-risk Risk𝒞(T_𝒞|H1, ..., HL) of the test T_𝒞 is at most ε[α]. (i) is proved. (ii) is readily given by Proposition 2.28. ✷
2.5.2.3
Testing multiple hypotheses via repeated observations
In the situation of Section 2.5.2.1, given a balanced system of detectors φℓℓ′ and risks ϵℓℓ′, 1 ≤ ℓ, ℓ′ ≤ L, for the collection 𝒫1, ..., 𝒫L (see (2.80)) and a positive integer K, we can
• pass from detectors φℓℓ′ and risks ϵℓℓ′ to the entities
$$\phi^{(K)}_{\ell\ell'}(\omega^K=(\omega_1,...,\omega_K)) = \sum_{k=1}^K \phi_{\ell\ell'}(\omega_k),\qquad \epsilon^{(K)}_{\ell\ell'} = \epsilon^K_{\ell\ell'},\qquad 1\le\ell,\ell'\le L;$$
• associate with the families 𝒫ℓ the families 𝒫ℓ^(K) of probability distributions underlying quasi-stationary K-repeated versions of observations ω ∼ P ∈ 𝒫ℓ (see Section 2.3.2.3), and thus arrive at hypotheses H_ℓ^K = H_ℓ^{⊗,K} stating that the distribution P^K of K-repeated observation ω^K = (ω1, ..., ωK), ωk ∈ Ω, belongs to the family $\mathcal{P}_\ell^{\otimes,K} = \bigotimes_{k=1}^K \mathcal{P}_\ell$ associated with 𝒫ℓ; see Section 2.1.3.3.
By Proposition 2.16 and (2.80), we arrive at the following analog of (2.80):
$$\begin{array}{ll}
\phi^{(K)}_{\ell\ell'}(\omega^K)+\phi^{(K)}_{\ell'\ell}(\omega^K)\equiv 0,\ \epsilon^{(K)}_{\ell\ell'}=\epsilon^{(K)}_{\ell'\ell}=\epsilon^K_{\ell\ell'}, & 1\le\ell,\ell'\le L,\\
\forall P^K\in\mathcal{P}^{(K)}_\ell:\ \int_{\Omega^K} e^{-\phi^{(K)}_{\ell\ell'}(\omega^K)}P^K(d\omega^K)\le\epsilon^{(K)}_{\ell\ell'}, & 1\le\ell,\ell'\le L.
\end{array}$$
Given shifts αℓℓ′ satisfying (2.82) and applying the construction from Section 2.5.2.2 to these shifts and our newly constructed detectors and risks, we arrive at the test T_𝒞^K deciding on hypotheses H_1^K, ..., H_L^K via K-repeated observation ω^K. Specifically, given an observation ω^K, the test T_𝒞^K accepts exactly those hypotheses H_ℓ^K for which φ^(K)_ℓℓ′(ω^K) − αℓℓ′ > 0 whenever ℓ′ is not 𝒞-close to ℓ:
$$T^K_{\mathcal{C}}(\omega^K) = \{\ell:\ \phi^{(K)}_{\ell\ell'}(\omega^K)-\alpha_{\ell\ell'} > 0\ \ \forall(\ell': (\ell,\ell')\notin\mathcal{C})\}.$$
Invoking Proposition 2.29, we arrive at
Proposition 2.30. (i) The 𝒞-risk of the test T_𝒞^K just defined is upper-bounded by the quantity
$$\varepsilon[\alpha,K] = \max_{\ell\le L}\sum_{\ell'=1}^L \epsilon^K_{\ell\ell'}C_{\ell\ell'}e^{\alpha_{\ell\ell'}}.$$
(ii) The infimum, over shifts α satisfying (2.82), of the risk bound ε[α, K] is the quantity
ε⋆(K) = ‖E^(K)‖_{2,2},
where the L × L symmetric entrywise nonnegative matrix E^(K) is given by
$$E^{(K)} = \left[e^{(K)}_{\ell\ell'} := \epsilon^K_{\ell\ell'}C_{\ell\ell'}\right]_{\ell,\ell'\le L}.$$
Assuming E^(K) admits a strictly positive Perron-Frobenius vector f, an optimal choice of the shifts is
αℓℓ′ = ln(fℓ′/fℓ), 1 ≤ ℓ, ℓ′ ≤ L,
resulting in ε[α, K] = ε⋆(K) = ‖E^(K)‖_{2,2}.
2.5.2.4
Consistency and near-optimality
Observe that when closeness 𝒞 is such that ϵℓℓ′ < 1 whenever ℓ, ℓ′ are not 𝒞-close to each other, the entries of the matrix E^(K) go to 0 as K → ∞ exponentially fast, whence the 𝒞-risk of test T_𝒞^K also goes to 0 as K → ∞, meaning that test T_𝒞^K is consistent. When, in addition, 𝒫ℓ correspond to convex hypotheses in a simple o.s., the test T_𝒞^K possesses the property of near-optimality similar to that stated in Proposition 2.25:
Proposition 2.31. Consider the special case of the situation from Section 2.5.2.1 where, given a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F), the families 𝒫ℓ of probability distributions are of the form 𝒫ℓ = {pµ : µ ∈ Nℓ}, where Nℓ, 1 ≤ ℓ ≤ L, are nonempty convex compact subsets of M. Let also the pairwise detectors φℓℓ′ and their risks ϵℓℓ′ underlying the construction from Section 2.5.2.2 be obtained by applying Theorem 2.23 to the pairs Nℓ, Nℓ′, so that for 1 ≤ ℓ < ℓ′ ≤ L one has
$$\phi_{\ell\ell'}(\omega) = \frac{1}{2}\ln\left(p_{\mu_{\ell\ell'}}(\omega)/p_{\nu_{\ell\ell'}}(\omega)\right),\qquad \epsilon_{\ell\ell'} = \exp\{\mathrm{Opt}_{\ell\ell'}\},$$
where
$$\mathrm{Opt}_{\ell\ell'} = \max_{\mu\in N_\ell,\,\nu\in N_{\ell'}} \ln\left(\int_\Omega \sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right),$$
and (µℓℓ′, νℓℓ′) form an optimal solution to the optimization problem on the right-hand side.
Assume that for some positive integer K∗ in nature there exists a test T^{K∗} which decides with 𝒞-risk ϵ ∈ (0, 1/2), via stationary K∗-repeated observation ω^{K∗}, on the hypotheses H_ℓ^{(K∗)}, stating that the components in ω^{K∗} are drawn, independently of each other, from a distribution P ∈ 𝒫ℓ, ℓ = 1, ..., L, and let
$$K = \left\lceil 2\,\frac{1+\ln(L-1)/\ln(1/\epsilon)}{1-\ln(4(1-\epsilon))/\ln(1/\epsilon)}\,K_* \right\rceil. \qquad (2.84)$$
Then the test T_𝒞^K yielded by the construction from Section 2.5.2.2 as applied to the above φℓℓ′, ϵℓℓ′, and trivial shifts αℓℓ′ ≡ 0, decides on the hypotheses H_ℓ^K (see Section 2.5.2.3) via quasi-stationary K-repeated observations ω^K, with 𝒞-risk ≤ ϵ.
Note that K/K∗ → 2 as ϵ → +0.
Proof. Let
$$\bar\epsilon = \max_{\ell,\ell'}\left\{\epsilon_{\ell\ell'}:\ \ell<\ell',\ \text{and } \ell,\ell' \text{ are not } \mathcal{C}\text{-close to each other}\right\}.$$
Denoting by (ℓ∗, ℓ′∗) the corresponding maximizer, note that T^{K∗} induces a simple test T able to decide via stationary K∗-repeated observations ω^{K∗} on the pair of hypotheses H^{(K∗)}_{ℓ∗}, H^{(K∗)}_{ℓ′∗} with risks ≤ ϵ (it suffices to make T accept the first of the hypotheses in the pair and reject the second one whenever T^{K∗} on the same observation accepts H^{(K∗)}_{ℓ∗}; otherwise T rejects the first hypothesis in the pair and accepts the second one). This observation, by the same argument as in the proof of Proposition 2.25, implies that ϵ̄^{K∗} ≤ 2√(ϵ(1 − ϵ)) < 1, whence all entries in the matrix E^(K) do not exceed ϵ̄^K = [ϵ̄^{K∗}]^{K/K∗} ≤ [2√(ϵ(1 − ϵ))]^{K/K∗}, implying by Proposition 2.29 that the 𝒞-risk of the test T_𝒞^K does not exceed
$$\epsilon(K) := (L-1)\left[2\sqrt{\epsilon(1-\epsilon)}\right]^{K/K_*}.$$
It remains to note that for K given by (2.84) one has ϵ(K) ≤ ϵ. ✷
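The acceptance rule of the “up to closeness” test, its risk bound ε[α, K], and the sample size (2.84) can be sketched as follows. The data structures (nested lists for the detector values, risks, shifts, and a Boolean closeness table) and all function names are illustrative assumptions; the rounding up in `sample_size_2_84` is also an assumption, made because K counts observations.

```python
import math

def accepted(phi_K, alpha, close):
    """Indices accepted by the test: phi_K[l][m] - alpha[l][m] > 0 for all m not close to l.

    phi_K[l][m] holds the value of the K-observation detector at the
    observed sample (a sum of single-observation detector values).
    """
    L = len(phi_K)
    return [l for l in range(L)
            if all(phi_K[l][m] - alpha[l][m] > 0 for m in range(L) if not close[l][m])]

def closeness_risk_bound(eps, close, alpha, K=1):
    """The bound eps[alpha, K]: max over l of the sum over non-close m of eps^K * exp(alpha)."""
    L = len(eps)
    return max(sum(eps[l][m] ** K * math.exp(alpha[l][m])
                   for m in range(L) if not close[l][m])
               for l in range(L))

def sample_size_2_84(K_star, L, eps):
    """Right-hand side of (2.84), rounded up (assumption)."""
    num = 1.0 + math.log(L - 1) / math.log(1.0 / eps)
    den = 1.0 - math.log(4.0 * (1.0 - eps)) / math.log(1.0 / eps)
    return math.ceil(2.0 * num / den * K_star)

# Illustrative data: three hypotheses, the first two close to each other.
eps = [[1.0, 0.9, 0.3], [0.9, 1.0, 0.4], [0.3, 0.4, 1.0]]
close = [[True, True, False], [True, True, False], [False, False, True]]
zero = [[0.0] * 3 for _ in range(3)]
phi_K = [[0.0, 0.2, 3.0], [-0.2, 0.0, 2.5], [-3.0, -2.5, 0.0]]
winners = accepted(phi_K, zero, close)
```

In this toy instance hypotheses 0 and 1 are both accepted, which is allowed since they are close to each other, while hypothesis 2 is rejected. The pair (0, 1) with risk 0.9 never enters the bound, because close pairs are masked out by the closeness table.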
Remark 2.32. Note that tests T_𝒞 and T_𝒞^K we have built may, depending on observations, accept no hypotheses at all, which sometimes is undesirable. Clearly, every test deciding on multiple hypotheses up to 𝒞-closeness always can be modified to ensure that a hypothesis always is accepted. To this end, it suffices, for instance, that the modified test accepts exactly those hypotheses, if any, which are accepted by our original test, and accepts, say, hypothesis #1 when the original test accepts no hypotheses. It is immediate to see that the 𝒞-risk of the modified test cannot be larger than the risk of the original test.
2.5.3
Illustration: Selecting the best among a family of estimates
Let us illustrate our machinery for multiple hypothesis testing by applying it to the situation as follows:
We are given:
• a simple nondegenerate observation scheme O = (Ω, Π; {pµ(·) : µ ∈ ℳ}; F),
• a seminorm ‖·‖ on R^n (see footnote 11),
• a convex compact set X ⊂ R^n along with a collection of M points xi ∈ R^n, 1 ≤ i ≤ M, and a positive D such that the ‖·‖-diameter of the set X⁺ = X ∪ {xi : 1 ≤ i ≤ M} is at most D:
‖x − x′‖ ≤ D ∀(x, x′ ∈ X⁺),
• an affine mapping x ↦ A(x) from R^n into the embedding space of ℳ such that A(x) ∈ ℳ for all x ∈ X,
• a tolerance ϵ ∈ (0, 1).
We observe a K-element sample ω^K = (ω1, ..., ωK) of observations
ωk ∼ p_{A(x∗)}, 1 ≤ k ≤ K, (2.85)
independent across k, where x∗ ∈ R^n is an unknown signal known to belong to X. Our “ideal goal” is to use ω^K in order to identify, with probability ≥ 1 − ϵ, the ‖·‖-closest to x∗ point among the points x1, ..., xM.
The goal just outlined may be too ambitious, and in the sequel we focus on the relaxed goal as follows:
11 A seminorm on R^n is defined by exactly the same requirements as a norm, except that now we allow zero seminorms for some nonzero vectors. Thus, a seminorm on R^n is a nonnegative function ‖·‖ which is even and homogeneous: ‖λx‖ = |λ|‖x‖, and satisfies the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖. A universal example is ‖x‖ = ‖Bx‖o, where ‖·‖o is a norm on some R^m and B is an m × n matrix; whenever this matrix has a nontrivial kernel, ‖·‖ is a seminorm rather than a norm.
Given a positive integer N and a “resolution” θ > 1, consider the grid
Γ = {rj = Dθ^{−j}, 0 ≤ j ≤ N}
and let
$$\rho(x) = \min\left\{r_j\in\Gamma:\ r_j \ge \min_{1\le i\le M}\|x-x_i\|\right\}.$$
Given the design parameters α ≥ 1 and β ≥ 0, we want to specify a volume of observations K and an inference routine ω^K ↦ iα,β(ω^K) ∈ {1, ..., M} such that
$$\forall(x_*\in X):\ \mathrm{Prob}\left\{\|x_*-x_{i_{\alpha,\beta}(\omega^K)}\| \le \alpha\rho(x_*)+\beta\right\} \ge 1-\epsilon. \qquad (2.86)$$
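The grid Γ and the discretized distance ρ(x) are straightforward to compute. The sketch below uses the Euclidean norm as a stand-in for the seminorm ‖·‖ and made-up candidate points; both are illustrative assumptions, as are the function names.

```python
import math

def euclidean(v):
    return math.sqrt(sum(t * t for t in v))

def rho(x, points, D, theta, N, norm=euclidean):
    """Smallest grid value D * theta**(-j), 0 <= j <= N, that is >= the distance from x to the points.

    Assumes D bounds the distance from x to every candidate point, so that
    the grid value with j = 0 is always admissible.
    """
    dist = min(norm([a - b for a, b in zip(x, p)]) for p in points)
    grid = [D * theta ** (-j) for j in range(N + 1)]
    return min(r for r in grid if r >= dist)

# Illustrative setup: two candidate points on a line segment of diameter 4.
points = [(0.0, 0.0), (4.0, 0.0)]
values = [rho(x, points, D=4.0, theta=2.0, N=5)
          for x in [(1.0, 0.0), (1.5, 0.0), (0.0, 0.0)]]
```

With θ = 2 the grid here is 4, 2, 1, 0.5, 0.25, 0.125, so a true distance of 1.5 is rounded up to 2, and a zero distance is reported as the finest grid value 0.125; a finer resolution θ close to 1 shrinks the rounding error at the cost of a larger N.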
Note that when passing from the “ideal” to the relaxed goal, the simplification is twofold: first, instead of the precise distance min_i ‖x∗ − xi‖ from x∗ to {x1, ..., xM} we look at the best upper bound ρ(x∗) on this distance from the grid Γ; second, we allow factor α and additive term β in mimicking the (discretized) distance ρ(x∗) by ‖x∗ − x_{iα,β(ω^K)}‖.
The problem we have posed is quite popular in Statistics and originates from the estimate aggregation problem [185, 229, 101] as follows: let xi be candidate estimates of x∗ yielded by a number of a priori “models” of x∗ and perhaps some preliminary noisy observations of x∗. Given xi and a matrix B, we want to select among the vectors Bxi the (nearly) best approximation of Bx∗ w.r.t. a given norm ‖·‖o, utilizing additional observations ω^K of the signal. To bring this problem into our framework, it suffices to specify the seminorm as ‖x‖ = ‖Bx‖o. We shall see shortly that in the context of this problem, the “discretization of distances” is, for all practical purposes, irrelevant: the dependence of the volume of observations on N is just logarithmic, so that we can easily handle a fine grid, like the one with θ = 1.001 and θ^{−N} = 10^{−10}. As for factor α and additive term β, they indeed could be “expensive in terms of applications,” but the “nearly ideal” goal of making α close to 1 and β close to 0 may be unattainable.
2.5.3.1
The construction
Let us associate with $i \le M$ and $j$, $0 \le j \le N$, the hypothesis $H_{ij}$ stating that the observations $\omega_k$, independent across $k$ (see (2.85)), stem from $x_* \in X_{ij} := \{x \in X : \|x - x_i\| \le r_j\}$. Note that the sets $X_{ij}$ are convex and compact. We denote by $\mathcal{J}$ the set of all pairs $(i, j)$ for which $i \in \{1, ..., M\}$, $j \in \{0, 1, ..., N\}$, and $X_{ij} \neq \emptyset$. Further, we define closeness $\mathcal{C}_{\alpha,\beta}$ on the set of hypotheses $H_{ij}$, $(i, j) \in \mathcal{J}$, as follows: $(ij, i'j') \in \mathcal{C}_{\alpha,\beta}$ if and only if
$$\|x_i - x_{i'}\| \le \bar{\alpha}(r_j + r_{j'}) + \beta, \quad \bar{\alpha} = \frac{\alpha - 1}{2} \tag{2.87}$$
(here and in what follows, $k\ell$ denotes the ordered pair $(k, \ell)$). Applying Theorem 2.23, we can build, in a computation-friendly fashion, the system $\phi_{ij,i'j'}(\omega)$, $ij, i'j' \in \mathcal{J}$, of optimal balanced detectors for the hypotheses $H_{ij}$ along with the risks of these detectors, so that
$$\begin{array}{ll}
\phi_{ij,i'j'}(\omega) \equiv -\phi_{i'j',ij}(\omega) & \forall (ij, i'j' \in \mathcal{J}),\\
\int_\Omega e^{-\phi_{ij,i'j'}(\omega)}\, p_{A(x)}(\omega)\,\Pi(d\omega) \le \epsilon_{ij,i'j'} & \forall (ij \in \mathcal{J},\ i'j' \in \mathcal{J},\ x \in X_{ij}).
\end{array}$$
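The closeness relation (2.87) is a plain inequality test on pairs of candidate points and radii. A minimal Python sketch follows; the function names and the default Euclidean norm are assumptions made for illustration only (the text allows any seminorm $\|\cdot\|$).

```python
import numpy as np

def alpha_bar(alpha):
    """alpha_bar = (alpha - 1) / 2, as in (2.87)."""
    return (alpha - 1.0) / 2.0

def are_close(x_i, x_ip, r_j, r_jp, alpha, beta, seminorm=None):
    """Check whether (ij, i'j') belongs to the closeness C_{alpha,beta} of (2.87).

    x_i, x_ip : candidate points x_i and x_{i'} (1-D arrays)
    r_j, r_jp : grid radii r_j and r_{j'}
    seminorm  : callable implementing ||.||; Euclidean norm is an assumed default
    """
    if seminorm is None:
        seminorm = lambda v: np.linalg.norm(v, 2)
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_ip, dtype=float)
    return seminorm(diff) <= alpha_bar(alpha) * (r_j + r_jp) + beta
```

With $\alpha = 3$ one has $\bar{\alpha} = 1$, so two points at distance 5 with radii $r_j = r_{j'} = 2$ are close when $\beta = 1.5$ and not close when $\beta = 0.5$.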
Let us say that a pair $(\alpha, \beta)$ is admissible if $\alpha \ge 1$, $\beta \ge 0$, and
$$\forall ((i, j) \in \mathcal{J},\ (i', j') \in \mathcal{J},\ (ij, i'j') \notin \mathcal{C}_{\alpha,\beta}):\ A(X_{ij}) \cap A(X_{i'j'}) = \emptyset.$$
Note that checking admissibility of a given pair $(\alpha, \beta)$ is a computationally tractable task. Given an admissible pair $(\alpha, \beta)$, we associate with it a positive integer $K = K(\alpha, \beta)$ and inference $\omega^K \mapsto i_{\alpha,\beta}(\omega^K)$ as follows:
1. $K = K(\alpha, \beta)$ is the smallest integer such that the detector-based test $T^K_{\mathcal{C}_{\alpha,\beta}}$ yielded by the machinery of Section 2.5.2.3 decides on the hypotheses $H_{ij}$, $ij \in \mathcal{J}$, with $\mathcal{C}_{\alpha,\beta}$-risk not exceeding $\epsilon$. Note that by admissibility, $\epsilon_{ij,i'j'} < 1$ whenever $(ij, i'j') \notin \mathcal{C}_{\alpha,\beta}$, so that $K(\alpha, \beta)$ is well defined.
2. Given observation $\omega^K$, $K = K(\alpha, \beta)$, we define $i_{\alpha,\beta}(\omega^K)$ as follows:
a) We apply to $\omega^K$ the test $T^K_{\mathcal{C}_{\alpha,\beta}}$. If the test accepts no hypothesis (case A), $i_{\alpha,\beta}(\omega^K)$ is undefined. The observations $\omega^K$ resulting in case A comprise some set, which we denote by $\mathcal{B}$; given $\omega^K$, we can recognize whether or not $\omega^K \in \mathcal{B}$.
b) When $\omega^K \notin \mathcal{B}$, the test $T^K_{\mathcal{C}_{\alpha,\beta}}$ accepts some of the hypotheses $H_{ij}$; let the set of their indices $ij$ be $\mathcal{J}(\omega^K)$. We select from the pairs $ij \in \mathcal{J}(\omega^K)$ the one with the largest $j$, and set $i_{\alpha,\beta}(\omega^K)$ to be equal to the first component, and $j_{\alpha,\beta}(\omega^K)$ to be equal to the second component of the selected pair.
We have the following:
Proposition 2.33. Assuming $(\alpha, \beta)$ admissible, for the inference $\omega^K \mapsto i_{\alpha,\beta}(\omega^K)$ just defined and for every $x_* \in X$, denoting by $P^K_{x_*}$ the distribution of stationary $K$-repeated observation $\omega^K$ stemming from $x_*$ one has
$$\|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le \alpha\rho(x_*) + \beta \tag{2.88}$$
with $P^K_{x_*}$-probability at least $1 - \epsilon$.
Proof. Let us fix $x_* \in X$, and let $j_* = j_*(x_*)$ be the largest $j \le N$ such that
$$r_j \ge \min_{i \le M} \|x_* - x_i\|;$$
note that $j_*$ is well defined due to $r_0 = D \ge \|x_* - x_1\|$, and that $r_{j_*} = \rho(x_*)$. We specify $i_* = i_*(x_*) \le M$ in such a way that
$$\|x_* - x_{i_*}\| \le r_{j_*}. \tag{2.89}$$
Note that $i_*$ is well defined and that observations (2.85) stemming from $x_*$ obey the hypothesis $H_{i_*j_*}$. Let $E$ be the set of those $\omega^K$ for which the predicate
P: As applied to observation $\omega^K$, the test $T^K_{\mathcal{C}_{\alpha,\beta}}$ accepts $H_{i_*j_*}$, and all hypotheses accepted by the test are $\mathcal{C}_{\alpha,\beta}$-close to $H_{i_*j_*}$
holds true. Taking into account that the $\mathcal{C}_{\alpha,\beta}$-risk of $T^K_{\mathcal{C}_{\alpha,\beta}}$ does not exceed $\epsilon$ and that the hypothesis $H_{i_*j_*}$ is true, the $P^K_{x_*}$-probability of the event $E$ is at least $1 - \epsilon$. Let observation $\omega^K$ satisfy
$$\omega^K \in E. \tag{2.90}$$
Then
1. The test $T^K_{\mathcal{C}_{\alpha,\beta}}$ accepts the hypothesis $H_{i_*j_*}$, that is, $\omega^K \notin \mathcal{B}$. By construction of $i_{\alpha,\beta}(\omega^K)$, $j_{\alpha,\beta}(\omega^K)$ (see the rule 2b above) and due to the fact that $T^K_{\mathcal{C}_{\alpha,\beta}}$ accepts $H_{i_*j_*}$, we have $j_{\alpha,\beta}(\omega^K) \ge j_*$.
2. The hypothesis $H_{i_{\alpha,\beta}(\omega^K)j_{\alpha,\beta}(\omega^K)}$ is $\mathcal{C}_{\alpha,\beta}$-close to $H_{i_*j_*}$, so that
$$\|x_{i_*} - x_{i_{\alpha,\beta}(\omega^K)}\| \le \bar{\alpha}(r_{j_*} + r_{j_{\alpha,\beta}(\omega^K)}) + \beta \le 2\bar{\alpha} r_{j_*} + \beta = 2\bar{\alpha}\rho(x_*) + \beta,$$
where the concluding inequality is due to the fact that, as we have already seen, $j_{\alpha,\beta}(\omega^K) \ge j_*$ when (2.90) takes place.
Invoking (2.89), we conclude that with $P^K_{x_*}$-probability at least $1 - \epsilon$ it holds
$$\|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le (2\bar{\alpha} + 1)\rho(x_*) + \beta = \alpha\rho(x_*) + \beta.$$
✷
2.5.3.2
A modification
From the computational viewpoint, an obvious shortcoming of the construction presented in the previous section is the necessity to operate with $M(N+1)$ hypotheses, which might require computing as many as $O(M^2N^2)$ detectors. We are about to present a modified construction, where we deal at most $N + 1$ times with just $M$ hypotheses at a time (i.e., with the total of at most $O(M^2N)$ detectors). The idea is to replace simultaneously processing all hypotheses $H_{ij}$, $ij \in \mathcal{J}$, with processing them in stages $j = 0, 1, ...$, with stage $j$ operating only with the hypotheses $H_{ij}$, $i = 1, ..., M$.
The implementation of this idea is as follows. In the situation of Section 2.5.3, given the same entities $\Gamma$, $(\alpha, \beta)$, $H_{ij}$, $X_{ij}$, $ij \in \mathcal{J}$, as at the beginning of Section 2.5.3.1 and specifying closeness $\mathcal{C}_{\alpha,\beta}$ according to (2.87), we now act as follows.
Preprocessing. For $j = 0, 1, ..., N$
1. we identify the set $I_j = \{i \le M : X_{ij} \neq \emptyset\}$ and stop if this set is empty. If this set is nonempty,
2. we specify the closeness $\mathcal{C}^j_{\alpha,\beta}$ on the set of hypotheses $H_{ij}$, $i \in I_j$, as a “slice” of the closeness $\mathcal{C}_{\alpha,\beta}$: $H_{ij}$ and $H_{i'j}$ (equivalently, $i$ and $i'$) are $\mathcal{C}^j_{\alpha,\beta}$-close to each other if $(ij, i'j)$ are $\mathcal{C}_{\alpha,\beta}$-close, that is,
$$\|x_i - x_{i'}\| \le 2\bar{\alpha} r_j + \beta, \quad \bar{\alpha} = \frac{\alpha - 1}{2}.$$
3. We build the optimal detectors $\phi_{ij,i'j}$, along with their risks $\epsilon_{ij,i'j}$, for all $i, i' \in I_j$ such that $(i, i') \notin \mathcal{C}^j_{\alpha,\beta}$. If $\epsilon_{ij,i'j} = 1$ for a pair $i, i'$ of the latter type, that is, $A(X_{ij}) \cap A(X_{i'j}) \neq \emptyset$, we claim that $(\alpha, \beta)$ is inadmissible and stop. Otherwise we find the smallest $K = K_j$ such that the spectral norm of the symmetric $M \times M$ matrix $E^{jK}$ with the entries
$$E^{jK}_{ii'} = \begin{cases} \epsilon^K_{ij,i'j}, & i \in I_j,\ i' \in I_j,\ (i, i') \notin \mathcal{C}^j_{\alpha,\beta},\\ 0, & \text{otherwise} \end{cases}$$
does not exceed $\bar{\epsilon} = \epsilon/(N + 1)$. We then use the machinery of Section 2.5.2.3 to build detector-based test $T^{K_j}_{\mathcal{C}^j_{\alpha,\beta}}$, which decides on the hypotheses $H_{ij}$, $i \in I_j$, with $\mathcal{C}^j_{\alpha,\beta}$-risk not exceeding $\bar{\epsilon}$.
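The search for $K_j$ in step 3 of the preprocessing can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name, the boolean mask encoding of "not close" pairs, and the cap `K_max` are assumptions, and a plain linear search is used for simplicity.

```python
import numpy as np

def smallest_K(eps_matrix, not_close_mask, eps_bar, K_max=100000):
    """Smallest K such that the spectral norm of E^{jK} does not exceed eps_bar.

    eps_matrix     : symmetric M x M array of single-observation detector risks
    not_close_mask : boolean M x M array, True where (i, i') is NOT C^j-close
                     (and both i, i' are in I_j); other entries of E^{jK} are 0
    eps_bar        : target risk, e.g. eps / (N + 1)
    """
    base = np.where(not_close_mask, eps_matrix, 0.0)
    if np.any(base >= 1.0):
        # A risk equal to 1 for a non-close pair means (alpha, beta) is inadmissible.
        raise ValueError("inadmissible: some non-close pair has risk 1")
    for K in range(1, K_max + 1):
        # Entries of E^{jK} are the K-th powers of the single-observation risks.
        if np.linalg.norm(base ** K, 2) <= eps_bar:
            return K
    raise RuntimeError("no K <= K_max meets the risk target")
```

For two hypotheses with single-observation risk 0.5, the matrix $E^{jK}$ has spectral norm $0.5^K$, so the target $\bar{\epsilon} = 0.01$ is first met at $K = 7$.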
It may happen that the outlined process stops when processing some value $\bar{j}$ of $j$; if this does not happen, we set $\bar{j} = N + 1$. Now, if the process does stop, and stops with the claim that $(\alpha, \beta)$ is inadmissible, we call $(\alpha, \beta)$ inadmissible and terminate; in this case we fail to produce a desired inference. Note that if this is the case, $(\alpha, \beta)$ is inadmissible in the sense of Section 2.5.3.1 as well. When we do not stop with the inadmissibility claim, we call $(\alpha, \beta)$ admissible, and in this case we do produce an inference, specifically, as follows.
Processing observations:
1. We set $\bar{J} = \{0, 1, ..., \hat{j} = \bar{j} - 1\}$, $K = K(\alpha, \beta) = \max_{0 \le j \le \hat{j}} K_j$. Note that $\bar{J}$ is nonempty due to $\bar{j} > 0$.$^{12}$
2. Let $\omega^K = (\omega_1, ..., \omega_K)$ with independent across $k$ components stemming from unknown signal $x_* \in X$ according to (2.85). We put $\hat{I}_{-1}(\omega^K) = \{1, ..., M\} = I_0$.
a) For $j = 0, 1, ..., \hat{j}$, we act as follows. When processing $j$, we have at our disposal subsets $\hat{I}_k(\omega^K) \subset \{1, ..., M\}$, $-1 \le k < j$. To build the set $\hat{I}_j(\omega^K)$
i. we apply the test $T^{K_j}_{\mathcal{C}^j_{\alpha,\beta}}$ to the initial $K_j$ components of the observation $\omega^K$. Let $I^+_j(\omega^K)$ be the set of hypotheses $H_{ij}$, $i \in I_j$, accepted by the test;
ii. it may happen that $I^+_j(\omega^K) = \emptyset$; if it is so, we terminate;
iii. if $I^+_j(\omega^K)$ is nonempty, we look, one by one, at indices $i \in I^+_j(\omega^K)$ and call the index $i$ good if for every $\ell \in \{-1, 0, ..., j - 1\}$, $i \in \hat{I}_\ell(\omega^K)$;
iv. we define $\hat{I}_j(\omega^K)$ as the set of good indices of $I^+_j(\omega^K)$ if this set is not empty and proceed to the next value of $j$ (if $j < \hat{j}$), or terminate (if $j = \hat{j}$). We terminate if there are no good indices in $I^+_j(\omega^K)$.
b) Upon termination, we have at our disposal a collection $\hat{I}_j(\omega^K)$, $0 \le j \le \tilde{j}(\omega^K)$, of all sets $\hat{I}_j(\omega^K)$ we have built (this collection can be empty, which we encode by setting $\tilde{j}(\omega^K) = -1$). When $\tilde{j}(\omega^K) = -1$, our inference remains undefined. Otherwise we select from the set $\hat{I}_{\tilde{j}(\omega^K)}(\omega^K)$ an index $i_{\alpha,\beta}(\omega^K)$, say, the smallest one, and claim that the point $x_{i_{\alpha,\beta}(\omega^K)}$ is the point among $x_1, ..., x_M$ “nearly closest” to $x_*$.
12 All the sets $X_{i0}$ contain $X$ and thus are nonempty, so that $I_0 = \{1, ..., M\} \neq \emptyset$, and thus we cannot stop at step $j = 0$ due to $I_0 = \emptyset$; the other possibility to stop at step $j = 0$ is ruled out by the fact that we are in the case where $(\alpha, \beta)$ is admissible.
We have the following analog of Proposition 2.33:
Proposition 2.34. Assuming $(\alpha, \beta)$ admissible, for the inference $\omega^K \mapsto i_{\alpha,\beta}(\omega^K)$ just defined and for every $x_* \in X$, denoting by $P^K_{x_*}$ the distribution of stationary $K$-repeated observation $\omega^K$ stemming from $x_*$ one has
$$P^K_{x_*}\left\{\omega^K : i_{\alpha,\beta}(\omega^K) \text{ is well defined and } \|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le \alpha\rho(x_*) + \beta\right\} \ge 1 - \epsilon.$$
Proof. Let us fix the signal $x_* \in X$ underlying observations $\omega^K$. As in the proof of Proposition 2.33, let $j_*$ be such that $\rho(x_*) = r_{j_*}$, and let $i_* \le M$ be such that $x_* \in X_{i_*j_*}$. Clearly, $i_*$ and $j_*$ are well defined, and the hypotheses $H_{i_*j}$, $0 \le j \le j_*$, are true. In particular, $X_{i_*j} \neq \emptyset$ when $j \le j_*$, implying that $i_* \in I_j$, $0 \le j \le j_*$, whence also $\hat{j} \ge j_*$. For $0 \le j \le j_*$, let $E_j$ be the set of all realizations of $\omega^K$ such that
$$i_* \in I^+_j(\omega^K)\ \&\ \{(i_*, i) \in \mathcal{C}^j_{\alpha,\beta}\ \ \forall i \in I^+_j(\omega^K)\}.$$
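Since the $\mathcal{C}^j_{\alpha,\beta}$-risk of the test $T^{K_j}_{\mathcal{C}^j_{\alpha,\beta}}$ is $\le \bar{\epsilon}$, we conclude that the $P^K_{x_*}$-probability of $E_j$ is at least $1 - \bar{\epsilon}$, whence the $P^K_{x_*}$-probability of the event
$$E = \bigcap_{j=0}^{j_*} E_j$$
is at least $1 - (N + 1)\bar{\epsilon} = 1 - \epsilon$. Now let $\omega^K \in E$. Then,
• By the definition of $E_j$, when $j \le j_*$, we have $i_* \in I^+_j(\omega^K)$, whence, by evident induction in $j$, $i_* \in \hat{I}_j(\omega^K)$ for all $j \le j_*$.
• We conclude from the above that $\tilde{j}(\omega^K) \ge j_*$. In particular, $i := i_{\alpha,\beta}(\omega^K)$ is well defined and turned out to be good at step $\tilde{j} \ge j_*$, implying that $i \in \hat{I}_{j_*}(\omega^K) \subset I^+_{j_*}(\omega^K)$.
Thus, $i \in I^+_{j_*}(\omega^K)$, which combines with the definition of $E_{j_*}$ to imply that $i$ and $i_*$ are $\mathcal{C}^{j_*}_{\alpha,\beta}$-close to each other, whence
$$\|x_{i_{\alpha,\beta}(\omega^K)} - x_{i_*}\| \le 2\bar{\alpha} r_{j_*} + \beta = 2\bar{\alpha}\rho(x_*) + \beta,$$
resulting in the desired relation
$$\|x_{i_{\alpha,\beta}(\omega^K)} - x_*\| \le 2\bar{\alpha}\rho(x_*) + \beta + \|x_{i_*} - x_*\| \le [2\bar{\alpha} + 1]\rho(x_*) + \beta = \alpha\rho(x_*) + \beta.$$
✷
2.5.3.3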
“Near-optimality”
We augment the above constructions with the following
Proposition 2.35. Let for some positive integer $\bar{K}$, $\epsilon \in (0, 1/2)$, and a pair $(a, b) \ge 0$ there exist an inference $\omega^{\bar{K}} \mapsto i(\omega^{\bar{K}}) \in \{1, ..., M\}$ such that whenever $x_* \in X$, we have
$$\mathrm{Prob}_{\omega^{\bar{K}} \sim P^{\bar{K}}_{x_*}}\{\|x_* - x_{i(\omega^{\bar{K}})}\| \le a\rho(x_*) + b\} \ge 1 - \epsilon.$$
Then the pair $(\alpha = 2a + 3, \beta = 2b)$ is admissible in the sense of Section 2.5.3.1 (and thus in the sense of Section 2.5.3.2), and for the constructions in Sections 2.5.3.1 and 2.5.3.2 one has
$$K(\alpha, \beta) \le \mathrm{Ceil}\left(2\,\frac{1 + \ln(M(N + 1))/\ln(1/\epsilon)}{1 - \frac{\ln(4(1 - \epsilon))}{\ln(1/\epsilon)}}\,\bar{K}\right). \tag{2.91}$$
Proof. Consider the situation of Section 2.5.3.1 (the situation of Section 2.5.3.2 can be processed in a completely similar way). Observe that with $\alpha, \beta$ as above, there exists a simple test deciding on a pair of hypotheses $H_{ij}$, $H_{i'j'}$ which are not $\mathcal{C}_{\alpha,\beta}$-close to each other via stationary $\bar{K}$-repeated observation $\omega^{\bar{K}}$ with risk $\le \epsilon$. Indeed, the desired test $T$ is as follows: given $ij \in \mathcal{J}$, $i'j' \in \mathcal{J}$, and observation $\omega^{\bar{K}}$, we compute $i(\omega^{\bar{K}})$ and accept $H_{ij}$ if and only if $\|x_{i(\omega^{\bar{K}})} - x_i\| \le (a + 1)r_j + b$, and accept $H_{i'j'}$ otherwise. Let us check that the risk of this test indeed is at most $\epsilon$.
Assume, first, that $H_{ij}$ takes place. The $P^{\bar{K}}_{x_*}$-probability of the event
$$E : \|x_{i(\omega^{\bar{K}})} - x_*\| \le a\rho(x_*) + b$$
is at least $1 - \epsilon$ due to the origin of $i(\cdot)$, and $\|x_i - x_*\| \le r_j$ since $H_{ij}$ takes place, implying that $\rho(x_*) \le r_j$ by the definition of $\rho(\cdot)$. Thus, in the case of $E$ it holds
$$\|x_{i(\omega^{\bar{K}})} - x_i\| \le \|x_{i(\omega^{\bar{K}})} - x_*\| + \|x_i - x_*\| \le a\rho(x_*) + b + r_j \le (a + 1)r_j + b.$$
We conclude that if $H_{ij}$ is true and $\omega^{\bar{K}} \in E$, then the test $T$ accepts $H_{ij}$, and thus the $P^{\bar{K}}_{x_*}$-probability for the simple test $T$ not to accept $H_{ij}$ when the hypothesis takes place is $\le \epsilon$.
Now let $H_{i'j'}$ take place, and let $E$ be the same event as above. When $\omega^{\bar{K}} \in E$, which happens with the $P^{\bar{K}}_{x_*}$-probability at least $1 - \epsilon$, for the same reasons as above, we have $\|x_{i(\omega^{\bar{K}})} - x_{i'}\| \le (a + 1)r_{j'} + b$. It follows that when $H_{i'j'}$ takes place and $\omega^{\bar{K}} \in E$, we have $\|x_{i(\omega^{\bar{K}})} - x_i\| > (a + 1)r_j + b$, since otherwise we would have
$$\begin{array}{rcl}
\|x_i - x_{i'}\| &\le& \|x_{i(\omega^{\bar{K}})} - x_i\| + \|x_{i(\omega^{\bar{K}})} - x_{i'}\| \le (a + 1)r_j + b + (a + 1)r_{j'} + b\\
&=& (a + 1)(r_j + r_{j'}) + 2b = \frac{\alpha - 1}{2}(r_j + r_{j'}) + \beta,
\end{array}$$
which contradicts the fact that $ij$ and $i'j'$ are not $\mathcal{C}_{\alpha,\beta}$-close. Thus, whenever $H_{i'j'}$ holds true and $E$ takes place, we have $\|x_{i(\omega^{\bar{K}})} - x_i\| > (a + 1)r_j + b$, implying by the definition of $T$ that $T$ accepts $H_{i'j'}$. Thus, the $P^{\bar{K}}_{x_*}$-probability not to accept $H_{i'j'}$ when the hypothesis is true is at most $\epsilon$.
From the fact that whenever $(ij, i'j') \notin \mathcal{C}_{\alpha,\beta}$, the hypotheses $H_{ij}$, $H_{i'j'}$ can be decided upon, via $\bar{K}$ observations, with risk $\le \epsilon < 0.5$ it follows that for the $ij, i'j'$ in question, the sets $A(X_{ij})$ and $A(X_{i'j'})$ do not intersect, so that $(\alpha, \beta)$ is an admissible pair.
As in the proof of Proposition 2.31, by basic properties of simple observation schemes, the fact that the hypotheses $H_{ij}$, $H_{i'j'}$ with $(ij, i'j') \notin \mathcal{C}_{\alpha,\beta}$ can be decided via $\bar{K}$-repeated observations (2.85) with risk $\le \epsilon < 1/2$ implies that $\epsilon_{ij,i'j'} \le [2\sqrt{\epsilon(1 - \epsilon)}]^{1/\bar{K}}$, whence, again by basic results on simple observation schemes (look once again at the proof of Proposition 2.31), the $\mathcal{C}_{\alpha,\beta}$-risk of $K$-observation detector-based test $T_K$ deciding on the hypotheses $H_{ij}$, $ij \in \mathcal{J}$, up to closeness $\mathcal{C}_{\alpha,\beta}$ does not exceed $\mathrm{Card}(\mathcal{J})[2\sqrt{\epsilon(1 - \epsilon)}]^{K/\bar{K}} \le M(N + 1)[2\sqrt{\epsilon(1 - \epsilon)}]^{K/\bar{K}}$, and (2.91) follows. ✷
Comment. Proposition 2.35 says that in our problem, the “statistical toll” for quite large values of $N$ and $M$ is quite moderate: with $\epsilon = 0.01$, resolution $\theta = 1.001$ (which for all practical purposes is the same as no discretization of distances at all), $D/r_N$ as large as $10^{10}$, and $M$ as large as 10,000, (2.91) reads $K = \mathrm{Ceil}(10.7\bar{K})$, which is not a disaster! The actual statistical toll of our construction is in replacing the “existing in nature” $a$ and $b$ with $\alpha = 2a + 3$ and $\beta = 2b$. And of course there is a huge computational toll for large $M$ and $N$: we need to operate with a large (albeit polynomial in $M, N$) number of hypotheses and detectors.
2.5.3.4
Numerical illustration
As an illustration of the approach presented in this section consider the following (toy) problem: A signal $x_* \in \mathbf{R}^n$ (one may think of $x_*$ as of the restriction on the equidistant $n$-point grid in $[0, 1]$ of a function of continuous argument $t \in [0, 1]$) is observed according to
$$\omega = Ax_* + \xi, \quad \xi \sim \mathcal{N}(0, \sigma^2 I_n), \tag{2.92}$$
where $A$ is a “discretized integration”:
$$(Ax)_s = \frac{1}{n}\sum_{j=1}^{s} x_j, \quad s = 1, ..., n.$$
We want to approximate $x_*$ in the discrete version of $L_1$-norm
$$\|y\| = \frac{1}{n}\sum_{s=1}^{n} |y_s|, \quad y \in \mathbf{R}^n,$$
by a low-order polynomial. In order to build the approximation, we use a single observation $\omega$ as in (2.92) to build five candidate estimates $x_i$, $i = 1, ..., 5$, of $x_*$. Specifically, $x_i$ is the Least Squares polynomial approximation, of degree $\le i - 1$, of $x_*$:
$$x_i = \mathop{\mathrm{argmin}}_{y \in \mathcal{P}_{i-1}} \|Ay - \omega\|_2^2,$$
where $\mathcal{P}_\kappa$ is the linear space of algebraic polynomials, of degree $\le \kappa$, of discrete argument $s$ varying in $\{1, 2, ..., n\}$. After the candidate estimates are built, we use additional $K$ observations (2.92) “to select the model,” that is, to select among our estimates the $\|\cdot\|$-closest to $x_*$.
In the experiment reported below we use $n = 128$ and $\sigma = 0.01$. The true signal $x_*$ is a discretization of a piecewise linear function of continuous argument $t \in [0, 1]$, with slope 2 to the left of $t = 0.5$, and with slope $-2$ to the right of $t = 0.5$; at $t = 0.5$, the function has a jump. The a priori information on the true signal is that it belongs to the box $\{x \in \mathbf{R}^n : \|x\|_\infty \le 1\}$. The signal and sample polynomial approximations $x_i$ of $x_*$, $1 \le i \le 5$, are presented on the top plot in Figure 2.5; their actual $\|\cdot\|$-distances to $x_*$ are as follows:

i                1      2      3      4      5
‖x_i − x_*‖      0.534  0.354  0.233  0.161  0.172

[Figure 2.5: Signal (top, solid) and candidate estimates (top, dotted). Bottom: the primitive of the signal.]

Setting $\epsilon = 0.01$, $N = 22$, and $\theta = 2^{1/4}$, $\alpha = 3$ and $\beta = 0.05$ resulted in $K = 3$. In a series of 1,000 simulations of the resulting inference, all 1,000 results correctly identified the candidate estimate $x_4$ $\|\cdot\|$-closest to $x_*$, in spite of the factor $\alpha = 3$ in (2.88). Surprisingly, the same holds true when we use the resulting inference with the reduced values of $K$, namely, $K = 1$ and $K = 2$, although the theoretical guarantees deteriorate: with $K = 1$ and $K = 2$, the theory guarantees the validity of (2.88) with probabilities 0.77 and 0.97, respectively.
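The setup of this toy experiment (the operator $A$, the Least Squares polynomial candidates, and their discrete $L_1$ distances to the signal) can be sketched in a few lines of Python. The exact signal used in the text is not fully specified, so the signal below is a hypothetical one matching only the stated slopes, the jump at $t = 0.5$, and the box constraint; the distances it produces are therefore not expected to reproduce the numbers in the table above. The model selection step itself (the test of Section 2.5.2.3) is not reproduced here.

```python
import numpy as np

def run_toy_experiment(n=128, sigma=0.01, n_candidates=5, seed=0):
    rng = np.random.default_rng(seed)
    # Discretized integration: (A x)_s = (1/n) * sum_{j <= s} x_j.
    A = np.tril(np.ones((n, n))) / n
    # Hypothetical signal (an assumption): slope 2 left of t = 0.5,
    # slope -2 to the right, jump at t = 0.5, values inside [-1, 1].
    t = np.arange(1, n + 1) / n
    x_true = np.where(t < 0.5, 2.0 * t, -2.0 * (t - 0.5))
    # Single observation omega = A x_* + xi, xi ~ N(0, sigma^2 I_n).
    omega = A @ x_true + sigma * rng.standard_normal(n)
    # Rescaled argument keeps the polynomial basis reasonably conditioned;
    # it spans the same space as polynomials in s.
    u = 2.0 * t - 1.0
    candidates = []
    for i in range(1, n_candidates + 1):
        V = np.vander(u, i, increasing=True)   # polynomials of degree <= i - 1
        coef, *_ = np.linalg.lstsq(A @ V, omega, rcond=None)
        candidates.append(V @ coef)            # x_i = argmin ||A y - omega||_2
    candidates = np.array(candidates)
    # Discrete L1 distances ||x_i - x_*|| = (1/n) * sum_s |.|
    dists = np.mean(np.abs(candidates - x_true), axis=1)
    return A, x_true, candidates, dists
```

Calling `run_toy_experiment()` returns the five candidates as rows of an array together with their distances to the hypothetical signal; in the text, these candidates are the points $x_1, ..., x_5$ among which the inference selects.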
2.6
SEQUENTIAL HYPOTHESIS TESTING
2.6.1
Motivation: Election polls
Let us consider the following “practical” question. One of L candidates for an office is about to be selected by a populationwide majority vote. Every member of the population votes for exactly one candidate. How do we predict the winner via an opinion poll? A (naive) model of the situation could be as follows. Let us represent the preference of a particular voter by his preference vector—a basic orth e in RL with unit entry in a position ℓ meaning that the voter is about to vote for the ℓ-th candidate. The
entries $\mu_\ell$ in the average $\mu$, over the population, of these vectors are the fractions of votes in favor of the $\ell$-th candidate, and the elected candidate is the one “indexing” the largest of the $\mu_\ell$’s. Now assume that we select at random, from the uniform distribution, a member of the population and observe his preference vector. Our observation $\omega$ is a realization of a discrete random variable taking values in the set $\Omega = \{e_1, ..., e_L\}$ of basic orths in $\mathbf{R}^L$, and $\mu$ is the distribution of $\omega$ (technically, the density of this distribution w.r.t. the counting measure $\Pi$ on $\Omega$). Selecting a small threshold $\delta$ and assuming that the true $\mu$, unknown to us, is such that the largest entry in $\mu$ is at least by $\delta$ larger than every other entry and that $\mu_\ell \ge \frac{1}{N}$ for all $\ell$, $N$ being the population size,$^{13}$ we can model the population preference for the $\ell$-th candidate with
$$\mu \in M_\ell = \Big\{\mu \in \mathbf{R}^L : \mu_i \ge \tfrac{1}{N},\ \sum_i \mu_i = 1,\ \mu_\ell \ge \mu_i + \delta\ \forall (i \neq \ell)\Big\} \subset M = \Big\{\mu \in \mathbf{R}^L : \mu > 0,\ \sum_i \mu_i = 1\Big\}.$$
In an (idealized) poll, we select at random a number $K$ of voters and observe their preferences, thus arriving at a sample $\omega^K = (\omega_1, ..., \omega_K)$ of observations drawn, independently of each other, from an unknown distribution $\mu$ on $\Omega$, with $\mu$ known to belong to $\bigcup_{\ell=1}^{L} M_\ell$. Therefore, to predict the winner is the same as to decide on $L$ convex hypotheses, $H_1, ..., H_L$, in the Discrete o.s., with $H_\ell$ stating that $\omega_1, ..., \omega_K$ are drawn, independently of each other, from a distribution $\mu \in M_\ell$. What we end up with is the problem of deciding on $L$ convex hypotheses in the Discrete o.s. with $L$-element $\Omega$ via stationary $K$-repeated observations.
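The observation model of the idealized poll is easy to simulate. The following Python sketch draws $K$ voters from a given preference distribution and encodes each answer as a basic orth; the function name and the sample distribution in the usage note are illustrative assumptions.

```python
import numpy as np

def simulate_poll(mu, K, rng):
    """Draw K i.i.d. observations from the distribution mu on {e_1, ..., e_L}.

    Returns a K x L array whose k-th row is the basic orth encoding the
    preference of the k-th respondent.
    """
    mu = np.asarray(mu, dtype=float)
    L = mu.size
    votes = rng.choice(L, size=K, p=mu)   # index of the preferred candidate
    return np.eye(L)[votes]
```

For instance, with `mu = [0.5, 0.3, 0.2]` and `rng = np.random.default_rng(0)`, the column means of `simulate_poll(mu, 1000, rng)` are the empirical vote fractions, which estimate $\mu$.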
Illustration. Consider two-candidate elections; now the goal of a poll is, given $K$ independent of each other realizations $\omega_1, ..., \omega_K$ of random variable $\omega$ taking value $\chi = 1, 2$ with probability $\mu_\chi$, $\mu_1 + \mu_2 = 1$, to decide what is larger, $\mu_1$ or $\mu_2$. As explained above, we select somehow a threshold $\delta$ and impose on the unknown $\mu$ an a priori assumption that the gap between the largest and the next largest (in our case, just the smallest) entry of $\mu$ is at least $\delta$, thus arriving at two hypotheses,
$$H_1 : \mu_1 \ge \mu_2 + \delta, \quad H_2 : \mu_2 \ge \mu_1 + \delta,$$
which is the same as
$$\begin{array}{l}
H_1 : \mu \in M_1 = \{\mu : \mu_1 \ge \frac{1+\delta}{2},\ \mu_2 \ge 0,\ \mu_1 + \mu_2 = 1\},\\
H_2 : \mu \in M_2 = \{\mu : \mu_2 \ge \frac{1+\delta}{2},\ \mu_1 \ge 0,\ \mu_1 + \mu_2 = 1\}.
\end{array}$$
We now want to decide on these two hypotheses via a stationary $K$-repeated observation. We are in the case of a simple (specifically, Discrete) o.s.; the optimal detector as given by Theorem 2.23 stems from the optimal solution $(\mu^*, \nu^*)$ to the convex optimization problem
$$\varepsilon_\star = \max_{\mu \in M_1, \nu \in M_2} \left[\sqrt{\mu_1\nu_1} + \sqrt{\mu_2\nu_2}\right]; \tag{2.93}$$
the optimal balanced single-observation detector is
$$\phi_*(\omega) = f_*^T\omega, \quad f_* = \tfrac{1}{2}[\ln(\mu_1^*/\nu_1^*); \ln(\mu_2^*/\nu_2^*)]$$
(recall that we encoded observations $\omega_k$ by basic orths from $\mathbf{R}^2$), the risk of this detector being $\varepsilon_\star$. In other words,
$$\begin{array}{c}
\mu^* = [\frac{1+\delta}{2}; \frac{1-\delta}{2}], \quad \nu^* = [\frac{1-\delta}{2}; \frac{1+\delta}{2}], \quad \varepsilon_\star = \sqrt{1 - \delta^2},\\
f_* = \frac{1}{2}[\ln((1 + \delta)/(1 - \delta)); \ln((1 - \delta)/(1 + \delta))].
\end{array}$$
The optimal balanced $K$-observation detector and its risk are
$$\phi_*^{(K)}(\omega_1, ..., \omega_K) = f_*^T\underbrace{(\omega_1 + ... + \omega_K)}_{\omega^K}, \quad \varepsilon_\star^{(K)} = (1 - \delta^2)^{K/2}.$$
13 With the size $N$ of population in the range of tens of thousands and $\delta$ as $1/N$, both these assumptions seem to be quite realistic.
The near-optimal $K$-observation test $T^K_{\phi_*}$ accepts $H_1$ and rejects $H_2$ if $\phi_*^{(K)}(\omega^K) \ge 0$; otherwise it accepts $H_2$ and rejects $H_1$. Both risks of this test do not exceed $\varepsilon_\star^{(K)}$.
Given risk level $\epsilon$, we can identify the minimal “poll size” $K$ for which the risks $\mathrm{Risk}_1$, $\mathrm{Risk}_2$ of the test $T^K_{\phi_*}$ do not exceed $\epsilon$. This poll size depends on $\epsilon$ and on our a priori “hypotheses separation” parameter $\delta$: $K = K_\epsilon(\delta)$. Some impression on this size can be obtained from Table 2.1, where, as in all subsequent “election illustrations,” $\epsilon$ is set to 0.01. We see that while poll sizes for “landslide” elections are surprisingly low, reliable prediction of the results of “close run” elections requires surprisingly high sizes of the polls. Note that this phenomenon reflects reality (to the extent to which the reality is captured by our model).$^{14}$ Indeed, from Proposition 2.25 we know that our poll size is within an explicit factor, depending solely on $\epsilon$, from the “ideal” poll sizes, that is, the smallest ones which allow to decide upon $H_1$, $H_2$ with risk $\le \epsilon$. For $\epsilon = 0.01$, this factor is about 2.85, meaning that when $\delta = 0.01$, the ideal poll size is larger than 32,000. In fact, we can easily construct more accurate “numerical” lower bounds on the sizes of ideal polls, specifically, as follows. When computing the optimal detector $\phi_*$, we get, as a byproduct, two distributions, $\mu^*$, $\nu^*$ obeying $H_1$, $H_2$, respectively. Denoting by $\mu^*_K$ and $\nu^*_K$ the distributions of $K$-element i.i.d. samples drawn from $\mu^*$ and $\nu^*$, the risk of deciding on two simple hypotheses on the distribution of $\omega^K$, stating that this distribution is $\mu^*_K$ and $\nu^*_K$, respectively, can be only smaller than the risk of deciding on $H_1$, $H_2$ via $K$-repeated stationary observations.
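For two candidates, the risk bound $\varepsilon_\star^{(K)} = (1 - \delta^2)^{K/2}$ given above yields the poll size in closed form: $K_\epsilon(\delta)$ is the smallest integer $K$ with $(1 - \delta^2)^{K/2} \le \epsilon$. A short Python sketch follows; the function names are illustrative, and the sketch covers only the case $L = 2$.

```python
import math

def risk_bound(delta, K):
    """Risk bound (1 - delta^2)^(K/2) of the K-observation detector, L = 2."""
    return (1.0 - delta ** 2) ** (K / 2.0)

def poll_size(delta, eps=0.01):
    """Smallest K such that (1 - delta^2)^(K/2) <= eps, for 0 < delta < 1."""
    # Solve (K/2) * ln(1 - delta^2) <= ln(eps) for K; log1p keeps accuracy
    # when delta is small.
    return math.ceil(2.0 * math.log(1.0 / eps) / -math.log1p(-delta ** 2))
```

With $\epsilon = 0.01$, this gives 25, 88, 287, and 917 for $\delta = 10^{-1/4}, 10^{-1/2}, 10^{-3/4}, 10^{-1}$, in agreement with the first four $L = 2$ entries of Table 2.1.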
On the other hand, the former risk can be lower-bounded by one half of the total risk of deciding on our two simple hypotheses, and the latter risk admits a sharp lower bound given by Proposition 2.2, namely,
$$\sum_{i_1, ..., i_K \in \{1,2\}} \min\left[\prod_\ell \mu^*_{i_\ell}, \prod_\ell \nu^*_{i_\ell}\right] = \mathbf{E}_{(i_1, ..., i_K)}\left\{\min\left[\prod_\ell (2\mu^*_{i_\ell}), \prod_\ell (2\nu^*_{i_\ell})\right]\right\},$$
with the expectation taken w.r.t. independent tuples of $K$ integers taking values 1 and 2 with probabilities 1/2. Of course, when $K$ is in the range of a few tens and more, we cannot compute the $2^K$-term sum above exactly; however, we can use Monte Carlo simulation in order to estimate the sum reliably with moderate accuracy, like 0.005, and use this estimate to lower-bound the value of $K$ for which an “ideal” $K$-observation test decides on $H_1$, $H_2$ with risks $\le 0.01$. Here are the resulting lower bounds (along with upper bounds from Table 2.1):

δ                0.5623  0.3162  0.1778  0.1000  0.0562  0.0316  0.0177  0.0100
K (lower bound)  14      51      166     534     1699    5379    17023   53820
K (upper bound)  25      88      287     917     2908    9206    29122   92064

Lower ($\underline{K}$) and upper ($\overline{K}$) bounds on the “ideal” poll sizes

We see that the poll sizes as yielded by our machinery are within factor 2 of the “ideal” poll sizes.

δ                    0.5623  0.3162  0.1778  0.1000  0.0562  0.0316  0.0177  0.0100
K_{0.01}(δ), L = 2   25      88      287     917     2908    9206    29118   92098
K_{0.01}(δ), L = 5   32      114     373     1193    3784    11977   37885   119745

Table 2.1: Sample of values of poll size $K_{0.01}(\delta)$ as a function of $\delta$ for 2-candidate ($L = 2$) and 5-candidate ($L = 5$) elections. Values of $\delta$ form a geometric progression with ratio $10^{-1/4}$.

14 In actual opinion polls, additional information is used. For instance, in reality voters can be split into groups according to their age, sex, education, income, etc., with variability of preferences within a group essentially lower than across the entire population. When planning a poll, respondents are selected at random within these groups, with a prearranged number of selections in every group, and their preferences are properly weighted, yielding more accurate predictions as compared to the case when the respondents are selected from the uniform distribution. In other words, in actual polls a nontrivial a priori information on the “true” distribution of preferences is used, something we do not have in our naive model.

Clearly, the outlined approach can be extended to $L$-candidate elections with $L \ge 2$. In our model of the corresponding problem we decide, via stationary $K$-repeated observations drawn from unknown probability distribution $\mu$ on $L$-element set, on $L$ hypotheses
$$H_\ell : \mu \in M_\ell = \left\{\mu \in \mathbf{R}^L : \mu_i \ge \frac{1}{N},\ i \le L,\ \sum_i \mu_i = 1,\ \mu_\ell \ge \mu_{\ell'} + \delta\ \forall (\ell' \neq \ell)\right\}, \quad \ell \le L. \tag{2.94}$$
Here $\delta > 0$ is a threshold selected in advance small enough to believe that the actual preferences of the voters correspond to $\mu \in \bigcup_\ell M_\ell$. Defining closeness $\mathcal{C}$ in the strongest possible way ($H_\ell$ is close to $H_{\ell'}$ if and only if $\ell = \ell'$), predicting the outcome of elections with risk $\epsilon$ becomes the problem of deciding upon our multiple hypotheses with $\mathcal{C}$-risk $\le \epsilon$. Thus, we can use pairwise detectors yielded by Theorem 2.23 to identify the smallest possible $K = K_\epsilon$ such that the test $T^K_{\mathcal{C}}$ from Section 2.5.2.3 is capable of deciding upon our $L$ hypotheses with $\mathcal{C}$-risk $\le \epsilon$. A numerical illustration of the performance of this approach in 5-candidate elections is presented in Table 2.1 (where $\epsilon$ is set to 0.01).
2.6.2
Sequential hypothesis testing
In view of the above analysis, when predicting outcomes of “close run” elections, huge poll sizes are necessary. It, however, does not mean that nothing can be done in order to build more reasonable opinion polls. The classical related statistical idea, going back to Wald [236], is to pass to sequential tests where the observations are processed one by one, and at every instant we either accept some of our hypotheses and terminate, or conclude that the observations obtained so far are insufficient to make a reliable inference and pass to the next observation. The idea is that a properly built sequential test, while still ensuring a desired risk, will be able to make “early decisions” in the case when the distribution underlying observations is “well inside” the true hypothesis and thus is far from the alternatives. Let us show how our machinery can be utilized to conceive a sequential test for the problem of predicting the outcome of $L$-candidate elections. Thus, our goal is, given a small threshold $\delta$, to decide upon $L$ hypotheses (2.94).

[Figure 2.6: 3-candidate hypotheses in probabilistic simplex $\Delta_3$.
[area A] $M_1$: dark tetragon + light border strip: candidate A wins with margin $\ge \delta_S$
[area A] $M_1^s$: dark tetragon: candidate A wins with margin $\ge \delta_s > \delta_S$
[area B] $M_2$: dark tetragon + light border strip: candidate B wins with margin $\ge \delta_S$
[area B] $M_2^s$: dark tetragon: candidate B wins with margin $\ge \delta_s > \delta_S$
[area C] $M_3$: dark tetragon + light border strip: candidate C wins with margin $\ge \delta_S$
[area C] $M_3^s$: dark tetragon: candidate C wins with margin $\ge \delta_s > \delta_S$
$\mathcal{C}_s$ closeness: hypotheses in the tuple $\{G^s_{2\ell-1} : \mu \in M_\ell,\ G^s_{2\ell} : \mu \in M^s_\ell,\ 1 \le \ell \le 3\}$ are not $\mathcal{C}_s$-close to each other if the corresponding $M$-sets belong to different areas and at least one of the sets is painted dark, like $M_1^s$ and $M_2$, but not $M_1$ and $M_2$.]

Let us act as follows.
1. We select a factor $\theta \in (0, 1)$, say, $\theta = 10^{-1/4}$, and consider thresholds $\delta_1 = \theta$, $\delta_2 = \theta\delta_1$, $\delta_3 = \theta\delta_2$, and so on, until for the first time we get a threshold $\le \delta$; to save notation, we assume that this threshold is exactly $\delta$, and let the number of the thresholds be $S$.
2. We split somehow (e.g., equally) the risk $\epsilon$ which we want to guarantee into $S$ portions $\epsilon_s$, $1 \le s \le S$, so that $\epsilon_s$ are positive and
$$\sum_{s=1}^{S} \epsilon_s = \epsilon.$$
3. For $s \in \{1, 2, ..., S\}$, we define, along with the hypotheses $H_\ell$, the hypotheses
$$H^s_\ell : \mu \in M^s_\ell = \{\mu \in M_\ell : \mu_\ell \ge \mu_{\ell'} + \delta_s\ \ \forall (\ell' \neq \ell)\}, \quad \ell = 1, ..., L$$
(see Figure 2.6), and introduce $2L$ hypotheses $G^s_{2\ell-1} = H_\ell$, and $G^s_{2\ell} = H^s_\ell$, $1 \le \ell \le L$. It is convenient to color these hypotheses in $L$ colors, with $G^s_{2\ell-1} = H_\ell$ and $G^s_{2\ell} = H^s_\ell$ assigned color $\ell$. We define also $s$-th closeness $\mathcal{C}_s$ as follows:
When $s < S$, hypotheses $G^s_i$ and $G^s_j$ are $\mathcal{C}_s$-close to each other if either they are of the same color, or they are of different colors and both of them have odd indices (that is, one of them is $H_\ell$, and another one is $H_{\ell'}$ with $\ell \neq \ell'$).
When $s = S$ (in this case $G^S_{2\ell-1} = H_\ell = G^S_{2\ell}$), hypotheses $G^S_\ell$ and $G^S_{\ell'}$ are $\mathcal{C}_S$-close to each other if and only if they are of the same color, i.e., both coincide with the same hypothesis $H_\ell$.
Observe that $G^s_i$ is a convex hypothesis:
$$G^s_i : \mu \in Y^s_i \quad [Y^s_{2\ell-1} = M_\ell,\ Y^s_{2\ell} = M^s_\ell].$$
The key observation is that when $G^s_i$ and $G^s_j$ are not $\mathcal{C}_s$-close, sets $Y^s_i$ and $Y^s_j$ are “separated” by at least $\delta_s$, meaning that for some vector $e \in \mathbf{R}^L$ with just two nonvanishing entries, equal to 1 and $-1$, we have
$$\min_{\mu \in Y^s_i} e^T\mu \ge \delta_s + \max_{\mu \in Y^s_j} e^T\mu. \tag{2.95}$$
Indeed, let $G^s_i$ and $G^s_j$ not be $\mathcal{C}_s$-close to each other. That means that the hypotheses are of different colors, say, $\ell$ and $\ell' \neq \ell$, and at least one of them has even index. W.l.o.g. we can assume that the even-indexed hypothesis is $G^s_i$, so that
$$Y^s_i \subset \{\mu : \mu_\ell - \mu_{\ell'} \ge \delta_s\},$$
while $Y^s_j$ is contained in the set $\{\mu : \mu_{\ell'} \ge \mu_\ell\}$. Specifying $e$ as the vector with just two nonzero entries, $\ell$-th equal to 1 and $\ell'$-th equal to $-1$, we ensure (2.95).
4. For $1 \le s \le S$, we apply the construction from Section 2.5.2.3 to identify the smallest $K = K(s)$ for which the test $T_s$ yielded by this construction as applied to a stationary $K$-repeated observation allows us to decide on the hypotheses $G^s_1, ..., G^s_{2L}$ with $\mathcal{C}_s$-risk $\le \epsilon_s$. The required $K$ exists due to the already mentioned separation of members in a pair of not $\mathcal{C}_s$-close hypotheses $G^s_i$, $G^s_j$. It is easily seen that $K(1) \le K(2) \le ... \le K(S - 1)$. However, it may happen that $K(S - 1) > K(S)$, the reason being that $\mathcal{C}_S$ is defined differently than $\mathcal{C}_s$ with $s < S$. We set
$$\mathcal{S} = \{s \le S : K(s) \le K(S)\}.$$
For example, here is what we get in $L$-candidate Opinion Poll problem when $S = 8$, $\delta = \delta_S = 0.01$, and for properly selected $\epsilon_s$ with $\sum_{s=1}^{8} \epsilon_s = 0.01$:

L   K(1)  K(2)  K(3)  K(4)  K(5)   K(6)   K(7)    K(8)
2   177   617   1829  5099  15704  49699  153299  160118
5   208   723   2175  6204  19205  60781  188203  187718

$S = 8$, $\delta_s = 10^{-s/4}$. $\mathcal{S} = \{1, 2, ..., 8\}$ when $L = 2$ and $\mathcal{S} = \{1, 2, ..., 6\} \cup \{8\}$ when $L = 5$.
5. Our sequential test $T_{\mathrm{seq}}$ works in attempts (stages) $s \in \mathcal{S}$: it tries to make conclusions after observing $K(s)$, $s \in \mathcal{S}$, realizations $\omega_k$ of $\omega$. At the $s$-th attempt, we apply the test $T_s$ to the collection $\omega^{K(s)}$ of observations obtained so far to decide on hypotheses $G^s_1, ..., G^s_{2L}$. If $T_s$ accepts some of these hypotheses and all accepted hypotheses are of the same color (let it be $\ell$), the sequential test accepts the hypothesis $H_\ell$ and terminates; otherwise we continue to observe the realizations of $\omega$ (if $s < S$) or terminate with no hypotheses accepted/rejected (if $s = S$).
It is easily seen that the risk of the outlined sequential test $T_{\mathrm{seq}}$ does not exceed $\epsilon$, meaning that whatever be the distribution $\mu \in \bigcup_{\ell=1}^{L} M_\ell$ underlying observations $\omega_1, \omega_2, ..., \omega_{K(S)}$ and $\ell_*$ such that $\mu \in M_{\ell_*}$, the $\mu$-probability of the event

$T_{\mathrm{seq}}$ accepts exactly one hypothesis, namely, $H_{\ell_*}$

is at least $1 - \epsilon$.
Indeed, observe, first, that the sequential test always accepts at most one of the hypotheses H1 , ..., HL . Second, let ωk ∼ µ with µ obeying Hℓ∗ . Consider events Es , s ∈ S, defined as follows:
• when s < S, Es is the event “the test Ts as applied to observation ω K(s) does not accept the true hypothesis Gs2ℓ∗ −1 = Hℓ∗ ”; • ES is the event “as applied to observation ω K(S) , the test TS does not accept the S true hypothesis GS 2ℓ∗ −1 = Hℓ∗ or accepts a hypothesis not CS -close to G2ℓ∗ −1 .”
Note that by our selection of K(s)’s, the µ-probability of Es does not exceed ǫs , so that the µ-probability of none of the events Es , s ∈ S, taking place is at least 1 − ǫ. To justify the above claim on the risk of the sequential test, all we need to verify is that when none of the events Es , s ∈ S, takes place, the sequential test accepts the true hypothesis Hℓ∗ . Verification is immediate: let the observations be such that none of the Es ’s takes place. We claim that in this case (a) The sequential test does accept a hypothesis—if this does not happen at the s-th attempt with some s < S, it definitely happens at the S-th attempt. Indeed, since ES does not take place, TS accepts GS 2ℓ∗ −1 and all other hypotheses, if any, accepted by TS are CS -close to GS 2ℓ∗ −1 , implying by construction of CS that TS does accept hypotheses, and all these hypotheses are of the same color. That is, the sequential test at the S-th attempt does accept a hypothesis.
(b) The sequential test does not accept a wrong hypothesis.
Indeed, assume that the sequential test accepts a wrong hypothesis, $H_{\ell'}$, $\ell' \ne \ell_*$, and it happens at the $s$-th attempt, and let us lead this assumption to a contradiction. Observe that under our assumption the test $T_s$ as applied to observation $\omega^{K(s)}$ does accept some hypothesis $G^s_i$, but does not accept the true hypothesis $G^s_{2\ell_*-1} = H_{\ell_*}$. Indeed, assuming $G^s_{2\ell_*-1}$ to be accepted, its color, which is $\ell_*$, should be the same as the color $\ell'$ of $G^s_i$ (we are in the case where the sequential test accepts $H_{\ell'}$ at the $s$-th attempt!). Since in fact $\ell' \ne \ell_*$, the above assumption leads to a contradiction. On the other hand, we are in the case where $E_s$ does not take place, that is, $T_s$ does accept the true hypothesis $G^s_{2\ell_*-1}$, and we arrive at the desired contradiction.
(a) and (b) provide us with the verification we were looking for.
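The control flow of the sequential test described in item 5 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the construction of the tests $T_s$ themselves: `stage_test` is a hypothetical stand-in for $T_s$ (it returns the set of indices $i \in \{1, ..., 2L\}$ of the accepted hypotheses $G^s_i$), `volumes` is a hypothetical list holding $K(1), ..., K(S)$, and the coloring rule assumes, as in the text, that $G^s_{2\ell-1}$ and $G^s_{2\ell}$ both carry color $\ell$.

```python
from typing import Callable, List, Optional, Sequence, Set, Tuple


def color(i: int) -> int:
    """Color of hypothesis G^s_i (1-based index): G^s_{2l-1} and G^s_{2l} have color l."""
    return (i + 1) // 2


def sequential_test(
    observations: Sequence,
    volumes: List[int],
    stage_test: Callable[[int, Sequence], Set[int]],
) -> Tuple[Optional[int], int]:
    """Run stages s = 1..S; volumes[s-1] plays the role of K(s).

    Returns (accepted color or None, stage at which the test terminated).
    """
    for s, k in enumerate(volumes, start=1):
        accepted = stage_test(s, observations[:k])  # hypothetical stand-in for T_s
        colors = {color(i) for i in accepted}
        if len(colors) == 1:  # something is accepted, and all of one color
            return colors.pop(), s
    return None, len(volumes)  # stage S reached with no conclusion


# Toy stand-in for the stage tests (purely illustrative; it ignores the data).
def toy_stage_test(s: int, prefix: Sequence) -> Set[int]:
    return {1, 4} if s == 1 else {3, 4}


result = sequential_test(list(range(10)), [2, 5, 10], toy_stage_test)
```

In the toy run, stage 1 accepts hypotheses of two different colors, so the test continues; stage 2 accepts only hypotheses of color 2, so the test stops there.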
Discussion and illustration. It can be easily seen that when $\epsilon_s = \epsilon/S$ for all $s$, the worst-case duration $K(S)$ of our sequential test is within a factor logarithmic in $SL$ of the duration of any other test capable of deciding on our $L$ hypotheses with risk $\epsilon$. At the same time it is easily seen that when the distribution µ of our observation is “deeply inside” some set $M_\ell$, specifically, $\mu \in M^s_\ell$ for some $s \in S$, $s < S$, then the µ-probability to terminate not later than just after $K(s)$ realizations $\omega_k$ of $\omega \sim \mu$ are observed and to infer correctly what the true hypothesis is, is at least $1 - \epsilon$. Informally speaking, in the case of “landslide” elections, a reliable prediction of the elections’ outcome will be made after a relatively small number of respondents are interviewed.
Indeed, let $s \in S$ and $\omega_k \sim \mu \in M^s_\ell$, so that µ obeys the hypothesis $G^s_{2\ell}$. Consider the $s$ events $E_t$, $1 \le t \le s$, defined as follows:
• For $t < s$, $E_t$ occurs when the sequential test terminates at attempt $t$ by accepting, instead of $H_\ell$, the wrong hypothesis $H_{\ell'}$, $\ell' \ne \ell$. Note that $E_t$ can take place only when $T_t$ does not accept the true hypothesis $G^t_{2\ell-1} = H_\ell$, and the µ-probability of this outcome is $\le \epsilon_t$.
• $E_s$ occurs when $T_s$ does not accept the true hypothesis $G^s_{2\ell}$ or accepts it along with some hypothesis $G^s_j$, $1 \le j \le 2L$, of color different from $\ell$. Note that we are in the situation where the hypothesis $G^s_{2\ell}$ is true, and, by construction of $C_s$, all hypotheses $C_s$-close to $G^s_{2\ell}$ are of the same color $\ell$ as $G^s_{2\ell}$. Recalling what $C_s$-risk is and that the $C_s$-risk of $T_s$ is $\le \epsilon_s$, we conclude that the µ-probability of $E_s$ is at most $\epsilon_s$.
The bottom line is that the µ-probability of the event $\bigcup_{t \le s} E_t$ is at most $\sum_{t=1}^{s} \epsilon_t \le \epsilon$; by construction of the sequential test, if the event $\bigcup_{t \le s} E_t$ does not take place, the test terminates in the course of the first $s$ attempts by accepting the correct hypothesis $H_\ell$. Our claim is justified.
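As a worked instance of the bound just obtained, one may take the equal split $\epsilon_t = \epsilon/S$ mentioned at the start of the discussion; the specific numbers below ($\epsilon = 0.01$, $S = 8$, $s = 3$) are chosen only for illustration.

```latex
\[
\mathrm{Prob}_\mu\Big\{\bigcup_{t\le s}E_t\Big\}
\;\le\;\sum_{t=1}^{s}\epsilon_t
\;=\;\frac{s\,\epsilon}{S}\;\le\;\epsilon,
\qquad\text{e.g. }\ \epsilon=0.01,\ S=8,\ s=3:\quad
\frac{3\cdot 0.01}{8}=0.00375 .
\]
```

Under this split, the earlier the stage $s$ at which µ is “deeply inside” $M_\ell$, the smaller the bound on the probability of failing to terminate correctly by that stage.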
Numerical illustration. To get an impression of the “power” of sequential hypothesis testing, here are the data on the durations of non-sequential and sequential tests with risk $\epsilon = 0.01$ for various values of $\delta$; in the sequential tests, $\theta = 10^{-1/4}$ is used. The worst-case data for 2-candidate and 5-candidate elections are as follows (below, “volume” stands for the number of observations used by the test):

δ        K, L = 2   S / K(S), L = 2   K, L = 5   S / K(S), L = 5
0.5623   25         1 / 25            32         1 / 32
0.3162   88         2 / 152           114        2 / 179
0.1778   287        3 / 499           373        3 / 585
0.1000   917        4 / 1594          1193       4 / 1870
0.0562   2908       5 / 5056          3784       5 / 5931
0.0316   9206       6 / 16005         11977      6 / 18776
0.0177   29118      7 / 50624         37885      7 / 59391
0.0100   92098      8 / 160118        119745     8 / 187720
Volume $K$ of non-sequential test, number $S$ of stages, and worst-case volume $K(S)$ of sequential test as functions of threshold $\delta = \delta_S$. Risk $\epsilon$ is set to 0.01.

As it should be, the worst-case volume of the sequential test is significantly larger than the volume of the non-sequential test.15 This being said, look at what happens in the “average,” rather than the worst, case; specifically, let us look at the empirical distribution of the volume when the distribution µ of observations is selected in the $L$-dimensional probabilistic simplex $\Delta_L = \{\mu \in \mathbf{R}^L : \mu \ge 0, \sum_\ell \mu_\ell = 1\}$ at random. Here are the empirical statistics of test volume obtained when drawing µ from the uniform distribution on $\bigcup_{\ell \le L} M_\ell$ and running the sequential test16 on observations drawn from the selected µ:

L   risk     median   mean    60%    65%    70%
2   0.0010   177      9182    177    397    617
5   0.0040   1449     18564   2175   4189   6204

L   75%     80%     85%     90%     95%      100%
2   617     1223    1829    8766    87911    160118
5   12704   19205   39993   60781   124249   187718

Parameters (columns “median, mean”) and quantiles (columns “60%,..., 100%”) of the sample distribution of the observation volume of the Sequential test for a given empirical risk (column “risk”). The data in the table are obtained from 1,000 experiments.

We see that with the Sequential test, “typical” numbers of observations before termination are much less than the worst-case values of these numbers. For example, in as much as 80% of experiments these numbers were below quite reasonable levels, at least in the case $L = 2$. Of course, what is “typical,” and what is not, depends on how we generate µ’s (this is called “prior Bayesian distribution”). Were our generation more likely to produce “close run” distributions, the advantages of sequential decision making would be reduced. This ambiguity is, however, unavoidable when attempting to go beyond worst-case-oriented analysis.

15 The reason is twofold: first, for $s < S$ we pass from deciding on $L$ hypotheses to deciding on $2L$ of them; second, the desired risk $\epsilon$ is now distributed among several tests, so that each of them should be more reliable than the non-sequential test with risk $\epsilon$.
16 Corresponding to $\delta = 0.01$, $\theta = 10^{-1/4}$ and $\epsilon = 0.01$.

2.6.3 Concluding remarks
Application of our machinery to sequential hypothesis testing is in no sense restricted to the simple election model considered so far. A natural general setup we can handle is as follows: We are given a simple observation scheme O and a number L of related convex hypotheses, colored in d colors, on the distribution of an observation, with distributions obeying hypotheses of different colors being distinct from each other. Given the risk level ǫ, we want to decide (1 − ǫ)-reliably on the color of the distribution underlying observations (i.e., the color of the hypothesis obeyed by this distribution) from stationary K-repeated observations, utilizing as small a number of observations as possible. For detailed description of related constructions and results, an interested reader is referred to [134].
2.7 MEASUREMENT DESIGN IN SIMPLE OBSERVATION SCHEMES
2.7.1 Motivation: Opinion polls revisited
Consider the same situation as in Section 2.6.1—we want to use an opinion poll to predict the winner in a population-wide election with L candidates. When addressing this situation earlier, no essential a priori information on the distribution of voters’ preferences was available. Now consider the case when the population is split into I groups (according to age, sex, income, etc., etc.), with the i-th group forming the fraction θi of the entire population, and we have at our disposal, at least for some i, nontrivial a priori information about the distribution pi of the preferences across group # i (the ℓ-th entry piℓ in pi is the fraction of voters of group i voting for candidate ℓ). For instance, we could know in advance that at least 90% of members of group #1 vote for candidate #1, and at least 85% of members of group #2 vote for candidate #2; no information of this type for group #3 is available. In this situation it would be wise to select respondents in the poll via a two-stage procedure, first selecting at random, with probabilities q1 , ..., qI , the group from which the next respondent will be picked, and second selecting the respondent from this group at random according to the uniform distribution on the group. When the qi are proportional to the sizes of the groups (i.e., qi = θi for all i), we come back to selecting respondents at random from the uniform distribution over the entire population. The point, however, is that in the presence of a priori information, it makes sense to use qi different from θi , specifically, to make the
ratios $q_i/\theta_i$ “large” or “small” depending on whether a priori information on group #i is poor or rich.
The story we have just told is an example of a situation in which we can “design measurements,” that is, draw observations from a distribution which partly is under our control. Indeed, what in fact happens in the story is the following. “In nature” there exist $I$ probabilistic vectors $p^1, ..., p^I$ of dimension $L$ representing distributions of voting preferences within the corresponding groups; the distribution of preferences across the entire population is $p = \sum_i \theta_i p^i$. With the two-stage selection of respondents, the outcome of a particular interview becomes a pair $(i, \ell)$, with $i$ identifying the group to which the respondent belongs, and $\ell$ identifying the candidate preferred by this respondent. In subsequent interviews, the pairs $(i, \ell)$ (these are our observations) are drawn, independently of each other, from the probability distribution on the pairs $(i, \ell)$, $i \le I$, $\ell \le L$, with the probability of an outcome $(i, \ell)$ equal to $p(i, \ell) = q_i p^i_\ell$. Thus, we find ourselves in the situation of stationary repeated observations stemming from the Discrete o.s. with observation space $\Omega$ of cardinality $IL$; the distribution from which the observations are drawn is a probabilistic vector µ of the form $\mu = Ax$, where
• $x = [p^1; ...; p^I]$ is the “signal” underlying our observations and representing the preferences of the population; this signal is selected by nature in the set $X$ known to us, defined in terms of our a priori information on $p^1, ..., p^I$:
$$X = \{x = [x^1; ...; x^I] : x^i \in \Pi_i,\ 1 \le i \le I\}, \tag{2.96}$$
where the $\Pi_i$ are the sets, given by our a priori information, of possible values of the preference vectors $p^i$ of the voters from the $i$-th group. In the sequel, we assume that the $\Pi_i$ are convex compact subsets of the positive part $\Delta^o_L = \{p \in \mathbf{R}^L : p > 0, \sum_\ell p_\ell = 1\}$ of the $L$-dimensional probabilistic simplex;
• $A$ is a “sensing matrix” which, to some extent, is under our control; specifically,
$$A[x^1; ...; x^I] = [q_1 x^1; q_2 x^2; ...; q_I x^I], \tag{2.97}$$
with $q = [q_1; ...; q_I]$ fully controlled by us (up to the fact that $q$ must be a probabilistic vector).
Note that in the situation under consideration the hypotheses we want to decide upon can be represented by convex sets in the space of signals, with a particular hypothesis stating that the observations stem from a distribution µ on $\Omega$, with µ belonging to the image of some convex compact set $X_\ell \subset X$ under the mapping $x \mapsto \mu = Ax$. For example, when $\nu = \sum_i \theta_i x^i$, the hypotheses
$$H_\ell : \nu \in M_\ell = \Big\{\nu \in \mathbf{R}^L : \sum_j \nu_j = 1,\ \nu_j \ge \tfrac{1}{N},\ \nu_\ell \ge \nu_{\ell'} + \delta,\ \ell' \ne \ell\Big\}, \quad 1 \le \ell \le L,$$
considered in Section 2.6.1 can be expressed in terms of the signal $x = [x^1; ...; x^I]$:
$$H_\ell : \mu = Ax,\ x \in X_\ell = \left\{ x = [x^1; ...; x^I] :
\begin{array}{l}
x^i \ge 0,\ \sum_\ell x^i_\ell = 1\ \ \forall i \le I \\
\sum_i \theta_i x^i_\ell \ge \sum_i \theta_i x^i_{\ell'} + \delta\ \ \forall (\ell' \ne \ell) \\
\sum_i \theta_i x^i_j \ge \tfrac{1}{N}\ \ \forall j
\end{array}
\right\}. \tag{2.98}$$
The challenge we intend to address is as follows: so far, we were interested in inferences from observations drawn from distributions selected “by nature.” Now our goal is to make inferences from observations drawn from a distribution selected partly by nature and partly by us: nature selects the signal $x$, we select, from some set, the matrix $A$, and the observations are drawn from the distribution $Ax$. As a result, we arrive at a question completely new for us: how do we utilize the freedom in selecting $A$ in order to improve our inferences (this is somewhat similar to what is called “design of experiments” in Statistics)?

2.7.2 Measurement Design: Setup
In what follows we address measurement design in simple observation schemes, and our setup is as follows (to make our intentions transparent, we illustrate our general setup by explaining how it should be specified to cover the outlined two-stage Opinion Poll Design (OPD) problem). Given are
• a simple observation scheme $O = (\Omega, \Pi; \{p_\mu : \mu \in M\}; F)$, specifically, Gaussian, Poisson, or Discrete, with $M \subset \mathbf{R}^d$.
In OPD, $O$ is the Discrete o.s. with $\Omega = \{(i, \ell) : 1 \le i \le I, 1 \le \ell \le L\}$, that is, points of $\Omega$ are the potential outcomes “reference group, preferred candidate” of individual interviews.
• a nonempty closed convex signal space $X \subset \mathbf{R}^n$, along with $L$ nonempty convex compact subsets $X_\ell$ of $X$, $\ell = 1, ..., L$.
In OPD, $X$ is the set (2.96) comprised of tuples of allowed distributions of voters’ preferences from various groups, and $X_\ell$ are the sets (2.98) of signals associated with the hypotheses $H_\ell$ we intend to decide upon.
• a nonempty convex compact set $Q$ in some $\mathbf{R}^N$ along with a continuous mapping $q \mapsto A_q$ acting from $Q$ into the space of $d \times n$ matrices such that
$$\forall (x \in X, q \in Q) : A_q x \in M. \tag{2.99}$$
In OPD, $Q$ is the set of probabilistic vectors $q = [q_1; ...; q_I]$ specifying our measurement design, and $A_q$ is the matrix of the mapping (2.97).
• a closeness $C$ on the set $\{1, ..., L\}$ (that is, a set $C$ of pairs $(i, j)$ with $1 \le i, j \le L$ such that $(i, i) \in C$ for all $i \le L$ and $(j, i) \in C$ whenever $(i, j) \in C$), and a positive integer $K$.
In OPD, the closeness $C$ is as strict as it could be ($i$ is close to $j$ if and only if $i = j$),17 and $K$ is the total number of interviews in the poll.
17 This closeness makes sense when the goal of the poll is to predict the winner; a less ambitious goal, e.g., to decide whether the winner will or will not belong to a particular set of candidates, would require weaker closeness.
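For the OPD specification, requirement (2.99) can be checked numerically on a small instance. The sketch below assumes NumPy; the function name `opd_sensing_matrix` and the numerical data ($I = 3$ groups, $L = 2$ candidates, the particular $q$ and $x$) are hypothetical and serve only to illustrate the mapping (2.97).

```python
import numpy as np


def opd_sensing_matrix(q, L):
    """Matrix of the mapping (2.97): [x^1; ...; x^I] -> [q_1 x^1; ...; q_I x^I].

    Built as the block-diagonal matrix Diag{q_1 I_L, ..., q_I I_L}.
    """
    q = np.asarray(q, dtype=float)
    return np.kron(np.diag(q), np.eye(L))


# Hypothetical instance: I = 3 groups, L = 2 candidates.
q = np.array([0.2, 0.3, 0.5])            # design: a probabilistic vector
x = np.concatenate([[0.90, 0.10],        # x^1: preferences in group 1
                    [0.15, 0.85],        # x^2: preferences in group 2
                    [0.50, 0.50]])       # x^3: preferences in group 3
A_q = opd_sensing_matrix(q, L=2)
mu = A_q @ x                             # distribution of the outcome (i, l)

# (2.99) for the Discrete o.s.: mu must be a positive probabilistic vector.
is_valid = bool(np.all(mu > 0) and np.isclose(mu.sum(), 1.0))
```

The entry of `mu` indexed by the pair $(i, \ell)$ equals $q_i x^i_\ell$, in agreement with $p(i, \ell) = q_i p^i_\ell$ from Section 2.7.1.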
We associate with $q \in Q$ and $X_\ell$, $\ell \le L$, the nonempty convex compact sets $M^q_\ell$ in the space $M$,
$$M^q_\ell = \{A_q x : x \in X_\ell\},$$
and hypotheses $H^q_\ell$ on $K$-repeated stationary observations $\omega^K = (\omega_1, ..., \omega_K)$, with $H^q_\ell$ stating that the $\omega_k$, $k = 1, ..., K$, are drawn, independently of each other, from a distribution $\mu \in M^q_\ell$, $\ell = 1, ..., L$. Closeness $C$ can be thought of as closeness on the collection of hypotheses $H^q_1, H^q_2, ..., H^q_L$. Given $q \in Q$, we can use the construction from Section 2.5.2 in order to build the test $T^K_{\phi_*}$ deciding on the hypotheses $H^q_\ell$ up to closeness $C$, the $C$-risk of the test being the smallest allowed by the construction. Note that this $C$-risk depends on $q$; the “Measurement Design” (MD for short) problem we are about to consider is to select $q \in Q$ which minimizes the $C$-risk of the associated test $T^K_{\phi_*}$.

2.7.3 Formulating the MD problem
By Proposition 2.30, the $C$-risk of the test $T^K_{\phi_*}$ is upper-bounded by the spectral norm of the symmetric entrywise nonnegative $L \times L$ matrix
$$E^{(K)}(q) = [\epsilon_{\ell\ell'}(q)]_{\ell,\ell'},$$
and this is what we intend to minimize in our MD problem. In the above formula, $\epsilon_{\ell\ell'}(q) = \epsilon_{\ell'\ell}(q)$ are zeros if $(\ell, \ell') \in C$. For $(\ell, \ell') \notin C$ and $1 \le \ell < \ell' \le L$, the quantities $\epsilon_{\ell\ell'}(q) = \epsilon_{\ell'\ell}(q)$ are defined depending on what the simple o.s. $O$ is. Specifically,
• In the case of the Gaussian observation scheme (see Section 2.4.5.1), restriction (2.99) does not restrain the dependence of $A_q$ on $q$ at all (modulo the default constraint that $A_q$ is a $d \times n$ matrix continuous in $q \in Q$), and
$$\epsilon_{\ell\ell'}(q) = \exp\{K\,\mathrm{Opt}_{\ell\ell'}(q)\},$$
where
$$\mathrm{Opt}_{\ell\ell'}(q) = \max_{x \in X_\ell,\, y \in X_{\ell'}} -\tfrac{1}{8}\,[A_q(x - y)]^T \Theta^{-1} [A_q(x - y)] \tag{$G_q$}$$
and $\Theta$ is the common covariance matrix of the Gaussian densities forming the family $\{p_\mu : \mu \in M\}$;
• In the case of Poisson o.s. (see Section 2.4.5.2), restriction (2.99) requires of $A_q x$ to be a positive vector whenever $q \in Q$ and $x \in X$, and
$$\epsilon_{\ell\ell'}(q) = \exp\{K\,\mathrm{Opt}_{\ell\ell'}(q)\},$$
where
$$\mathrm{Opt}_{\ell\ell'}(q) = \max_{x \in X_\ell,\, y \in X_{\ell'}} \sum_i \Big[\sqrt{[A_q x]_i [A_q y]_i} - \tfrac{1}{2}[A_q x]_i - \tfrac{1}{2}[A_q y]_i\Big]; \tag{$P_q$}$$
• In the case of Discrete o.s. (see Section 2.4.5.3), restriction (2.99) requires of $A_q x$ to be a positive probabilistic vector whenever $q \in Q$ and $x \in X$, and
$$\epsilon_{\ell\ell'}(q) = [\mathrm{Opt}_{\ell\ell'}(q)]^K,$$
where
$$\mathrm{Opt}_{\ell\ell'}(q) = \max_{x \in X_\ell,\, y \in X_{\ell'}} \sum_i \sqrt{[A_q x]_i [A_q y]_i}. \tag{$D_q$}$$
The summary of the above observations is as follows. The norm $\|E^{(K)}\|_{2,2}$, the quantity we are interested in minimizing in $q \in Q$, as a function of $q \in Q$ is of the form
$$\Psi(q) = \psi\big(\underbrace{\{\mathrm{Opt}_{\ell\ell'}(q) : (\ell, \ell') \notin C\}}_{\mathrm{Opt}(q)}\big) \tag{2.100}$$
where the outer function $\psi$ is an explicitly given real-valued function on $\mathbf{R}^N$ ($N$ is the cardinality of the set of pairs $(\ell, \ell')$, $1 \le \ell, \ell' \le L$, with $(\ell, \ell') \notin C$) which is convex and nondecreasing in each argument. Indeed, denoting by $\Gamma(S)$ the spectral norm of the $d \times d$ matrix $S$, note that $\Gamma$ is a convex function of $S$, and this function is nondecreasing in every one of the entries of $S$, provided that $S$ is restricted to be entrywise nonnegative.18 $\psi(\cdot)$ is obtained from $\Gamma(S)$ by substituting for the entries $S_{\ell\ell'}$ of $S$ certain (explicit everywhere) convex, nonnegative and nondecreasing functions of variables $z = \{z_{\ell\ell'} : (\ell, \ell') \notin C, 1 \le \ell, \ell' \le L\}$. Namely,
• when $(\ell, \ell') \in C$, we set $S_{\ell\ell'}$ to zero;
• when $(\ell, \ell') \notin C$, we set $S_{\ell\ell'} = \exp\{K z_{\ell\ell'}\}$ in the case of Gaussian and Poisson o.s.’s, and set $S_{\ell\ell'} = \max[0, z_{\ell\ell'}]^K$ in the case of Discrete o.s.
As a result, we indeed get a convex and nondecreasing, in every argument, function $\psi$ of $z \in \mathbf{R}^N$.
Now, the Measurement Design problem we want to solve reads
$$\mathrm{Opt} = \min_{q \in Q} \psi(\mathrm{Opt}(q)). \tag{2.101}$$
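The construction of the outer function $\psi$ described above can be sketched numerically as follows. This is a minimal illustration assuming NumPy; the function name `psi`, its argument layout (a full $L \times L$ array of values $\mathrm{Opt}_{\ell\ell'}$ together with a boolean mask for the closeness $C$), and the numerical data are assumptions made for the example, and the values $\mathrm{Opt}_{\ell\ell'}(q)$ themselves are taken as given rather than computed from $(G_q)$, $(P_q)$, or $(D_q)$.

```python
import numpy as np


def psi(opt: np.ndarray, close: np.ndarray, K: int, scheme: str) -> float:
    """Spectral norm of the L x L matrix S built from the values Opt_{ll'}.

    opt    -- symmetric L x L array; entries at C-close pairs are ignored
    close  -- boolean L x L mask, True where (l, l') belongs to C
    scheme -- "gaussian", "poisson" or "discrete"
    """
    if scheme in ("gaussian", "poisson"):
        S = np.exp(K * opt)
    elif scheme == "discrete":
        S = np.maximum(0.0, opt) ** K
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    S = np.where(close, 0.0, S)          # C-close pairs contribute zero entries
    return float(np.linalg.norm(S, 2))   # largest singular value = spectral norm


# Hypothetical data: L = 3, strictest closeness (l close to l' iff l = l'),
# Discrete o.s. with all off-diagonal Opt values equal to 0.99, K = 500.
L, K = 3, 500
opt = np.full((L, L), 0.99)
close = np.eye(L, dtype=bool)
risk_bound = psi(opt, close, K, "discrete")
```

In this toy instance $S = a(J - I)$ with $a = 0.99^{500}$ and $J$ the all-ones matrix, so the spectral norm is $(L - 1)a$; solving (2.101) would additionally require minimizing such values over $q \in Q$, which the sketch does not attempt.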
As we remember, the entries in the inner function $\mathrm{Opt}(q)$ are optimal values of solvable convex optimization problems and as such are efficiently computable. When these entries are also convex functions of $q \in Q$, the objective in (2.101), due to the already established convexity and monotonicity properties of $\psi$, is a convex function of $q$, meaning that (2.101) is a convex and thus efficiently solvable problem. On the other hand, when some of the entries in $\mathrm{Opt}(q)$ are nonconvex in $q$, we can hardly expect (2.101) to be easy to solve. Unfortunately, convexity of the entries in $\mathrm{Opt}(q)$ in $q$ turns out to be a “rare commodity.” For example, we can verify by inspection that the objectives in $(G_q)$, $(P_q)$, and $(D_q)$ as functions of $A_q$ (not of $q$!) are concave rather than convex. Thus, the optimal values in the problems, as functions of $q$, are maxima, over the parameters, of parametric families of concave functions of $A_q$ (the parameters in these parametric families are the optimization variables in $(G_q)$ to $(D_q)$) and as such can hardly be convex as functions of $A_q$. And indeed, as a matter of fact, the MD problem usually is nonconvex and difficult to solve.
We intend to consider a “Simple case” where this difficulty does not arise, i.e., the case where the objectives of the optimization problems specifying $\mathrm{Opt}_{\ell\ell'}(q)$ are affine in $q$. In this case, $\mathrm{Opt}_{\ell\ell'}(q)$ as a function of $q$ is the maximum, over the parameters (optimization variables in the corresponding problems), of parametric families of affine functions of $q$ and as such is convex.
Our current goal is to understand what our sufficient condition for tractability of the MD problem (affinity in $q$ of the objectives in the respective problems $(G_q)$, $(P_q)$, and $(D_q)$) actually means, and to show that this, by itself quite restrictive, assumption indeed takes place in some important applications.

18 The monotonicity follows from the fact that for an entrywise nonnegative $S$, we have
$$\|S\|_{2,2} = \max_{x,y}\{x^T S y : \|x\|_2 \le 1, \|y\|_2 \le 1\} = \max_{x,y}\{x^T S y : \|x\|_2 \le 1, \|y\|_2 \le 1, x \ge 0, y \ge 0\}.$$

2.7.3.1 Simple case, Discrete o.s.
Looking at the optimization problem $(D_q)$, we see that the simplest way to ensure that its objective is affine in $q$ is to assume that
$$A_q = \mathrm{Diag}\{Bq\}A, \tag{2.102}$$
where $A$ is some fixed $d \times n$ matrix, and $B$ is some fixed $d \times (\dim q)$ matrix such that $Bq$ is positive whenever $q \in Q$. On top of this, we should ensure that when $q \in Q$ and $x \in X$, $A_q x$ is a positive probabilistic vector; this amounts to some restrictions linking $Q$, $X$, $A$, and $B$.
Illustration. The Opinion Poll Design problem of Section 2.7.1 provides an instructive example of the Simple case of Measurement Design in Discrete o.s.: recall that in this problem the voting population is split into $I$ groups, with the $i$-th group constituting fraction $\theta_i$ of the entire population. The distribution of voters’ preferences in the $i$-th group is represented by an unknown $L$-dimensional probabilistic vector $x^i = [x^i_1; ...; x^i_L]$ ($L$ is the number of candidates, $x^i_\ell$ is the fraction of voters in the $i$-th group intending to vote for the $\ell$-th candidate), known to belong to a given convex compact subset $\Pi_i$ of the “positive part” $\Delta^o_L = \{x \in \mathbf{R}^L : x > 0, \sum_\ell x_\ell = 1\}$ of the $L$-dimensional probabilistic simplex. We are given threshold $\delta > 0$ and want to decide on $L$ hypotheses $H_1, ..., H_L$, with $H_\ell$ stating that the population-wide vector $y = \sum_{i=1}^{I} \theta_i x^i$ of voters’ preferences belongs to the closed convex set
$$Y_\ell = \Big\{ y = \sum_{i=1}^{I} \theta_i x^i : x^i \in \Pi_i,\ 1 \le i \le I,\ y_\ell \ge y_{\ell'} + \delta,\ \forall (\ell' \ne \ell) \Big\}.$$
Note that $Y_\ell$ is the image, under the linear mapping
$$[x^1; ...; x^I] \mapsto y(x) = \sum_i \theta_i x^i,$$
of the compact convex set
$$X_\ell = \{x = [x^1; ...; x^I] : x^i \in \Pi_i,\ 1 \le i \le I,\ y_\ell(x) \ge y_{\ell'}(x) + \delta,\ \forall (\ell' \ne \ell)\},$$
which is a subset of the convex compact set
$$X = \{x = [x^1; ...; x^I] : x^i \in \Pi_i,\ 1 \le i \le I\}.$$
The $k$-th poll interview is organized as follows: We draw at random a group among the $I$ groups of voters, with probability $q_i$ to draw the $i$-th group, and then draw at random, from the uniform distribution on the group, the respondent to be interviewed. The outcome of
the interview (our observation $\omega_k$) is the pair $(i, \ell)$, where $i$ is the group to which the respondent belongs, and $\ell$ is the candidate preferred by the respondent. This results in a sensing matrix $A_q$ (see (2.97)) which is in the form of (2.102), namely,
$$A_q = \mathrm{Diag}\{q_1 I_L, q_2 I_L, ..., q_I I_L\} \quad [q \in \Delta_I],$$
and the outcome of the $k$-th interview is drawn at random from the discrete probability distribution $A_q x$, where $x \in X$ is the “signal” summarizing voters’ preferences in the groups.
Given the total number of observations $K$, our goal is to decide with a given risk $\epsilon$ on our $L$ hypotheses. Whether this goal is or is not achievable depends on $K$ and on $A_q$. What we want is to find $q$ for which the above goal can be attained with as small a $K$ as possible; in the case in question, this reduces to solving, for various trial values of $K$, problem (2.101), which under the circumstances is an explicit convex optimization problem.
To get an impression of the potential of Measurement Design, we present a sample of numerical results. In all reported experiments, we use $\delta = 0.05$, $\epsilon = 0.01$ and equal fractions $\theta_i = I^{-1}$ for all groups. The sets $\Pi_i$, $1 \le i \le I$, are generated as follows: we pick at random a probabilistic vector $\bar{p}^i$ of dimension $L$, and define $\Pi_i$ as the intersection of the box $\{p : \bar{p}^i_\ell - u_i \le p_\ell \le \bar{p}^i_\ell + u_i\}$ centered at $\bar{p}^i$ with the probabilistic simplex $\Delta_L$, where the $u_i$, $i = 1, ..., I$, are prescribed “uncertainty levels.” Note that uncertainty level $u_i \ge 1$ is the same as absence of any a priori information on the preferences of voters from the $i$-th group.
The results of our numerical experiments are as follows:

L   I   Uncertainty levels u      Kini   qopt                        Kopt
2   2   [0.03;1.00]               1212   [0.437;0.563]               1194
2   2   [0.02;1.00]               2699   [0.000;1.000]               1948
3   3   [0.02;0.03;1.00]          3177   [0.000;0.455;0.545]         2726
5   4   [0.02;0.02;0.03;1.00]     2556   [0.000;0.131;0.322;0.547]   2086
5   4   [1.00;1.00;1.00;1.00]     4788   [0.250;0.250;0.250;0.250]   4788

Effect of measurement design: poll sizes required for 0.99-reliable winner prediction when $q = \theta$ (column Kini) and $q = q_{\mathrm{opt}}$ (column Kopt).

We see that measurement design allows us to reduce (for some data, quite significantly) the volume of observations as compared to the straightforward selecting of the respondents uniformly across the entire population. To compare our current model and results with those from Section 2.6.1, note that now we have more a priori information on the true distribution of voting preferences due to some a priori knowledge of preferences within groups, which allows us to reduce the poll sizes with both straightforward and optimal measurement designs.19 On the other hand, the difference between Kini and Kopt is fully due to the measurement design.

19 To illustrate this point, look at the last two lines in the table: utilizing a priori information allows us to reduce the poll size from 4,788 to 2,556 even with the straightforward measurement design.

Comparative drug study. A Simple case of the Measurement Design in Discrete o.s. related to OPD and perhaps more interesting is as follows. Suppose that now,
instead of $L$ competing candidates running for an office we have $L$ competing drugs, and the population of patients the drugs are aimed at rather than the population of voters. For the sake of simplicity, assume that when a particular drug is administered to a particular patient, the outcome is binary: (positive) “effect” or “no effect” (what follows can be easily extended to the case of non-binary categorical outcomes, like “strong positive effect,” “weak positive effect,” “negative effect,” and the like). Our goal is to organize a clinical study in order to decide on comparative drug efficiency, measured by the percentage of patients on which a particular drug has effect. The difference with organizing an opinion poll is that now we cannot just ask a respondent what his or her preferences are; we may only administer to a participant of the study a single drug of our choice and look at the result.
As in the OPD problem, we assume that the population of patients is split into $I$ groups, with the $i$-th group comprising a fraction $\theta_i$ of the entire population.
We model the situation as follows. We associate with a patient a Boolean vector of dimension $2L$, with the $\ell$-th entry in the vector equal to 1 or 0 depending on whether drug #$\ell$ has effect on the patient, and the $(L + \ell)$-th entry complementing the $\ell$-th one to 1 (that is, if the $\ell$-th entry is $\chi$, then the $(L + \ell)$-th entry is $1 - \chi$). Let $x^i$ be the average of these vectors over patients from group $i$. We define the “signal” $x$ underlying our measurements as the vector $[x^1; ...; x^I]$ and assume that our a priori information allows us to localize $x$ in a closed convex subset $X$ of the set
$$Y = \{x = [x^1; ...; x^I] : x^i \ge 0,\ x^i_\ell + x^i_{L+\ell} = 1,\ 1 \le i \le I,\ 1 \le \ell \le L\}$$
to which all our signals belong by construction. Note that the vector
$$y = Bx = \sum_i \theta_i x^i$$
can be treated as a “population-wise distribution of drug effects:” $y_\ell$, $\ell \le L$, is the fraction, in the entire population of patients, of those patients on whom drug $\ell$ has effect, and $y_{L+\ell} = 1 - y_\ell$. As a result, typical hypotheses related to comparison of the drugs, like “drug $\ell$ has effect on a larger fraction, at least by margin $\delta$, of patients than drug $\ell'$,” become convex hypotheses on the signal $x$. In order to test hypotheses of this type, we can use a two-stage procedure for observing drug effects, namely, as follows.
To get a particular observation, we select at random, with probability $q_{i\ell}$, a pair $(i, \ell)$ from the set $\{(i, \ell) : 1 \le i \le I, 1 \le \ell \le L\}$, select a patient from group $i$ according to the uniform distribution on the group, administer to the patient the drug $\ell$, and check whether the drug has effect. Thus, a single observation is a triple $(i, \ell, \chi)$, where $\chi = 0$ if the administered drug has no effect on the patient, and $\chi = 1$ otherwise. The probability of getting observation $(i, \ell, 1)$ is $q_{i\ell} x^i_\ell$, and the probability of getting observation $(i, \ell, 0)$ is $q_{i\ell} x^i_{L+\ell}$. Thus, we arrive at the Discrete o.s. where the distribution µ of observations is of the form $\mu = A_q x$, with the rows in $A_q$ indexed by triples $\omega = (i, \ell, \chi) \in \Omega := \{1, 2, ..., I\} \times \{1, 2, ..., L\} \times \{0, 1\}$ and given by
$$(A_q[x^1; ...; x^I])_{i,\ell,\chi} = \begin{cases} q_{i\ell}\, x^i_\ell, & \chi = 1, \\ q_{i\ell}\, x^i_{L+\ell}, & \chi = 0. \end{cases}$$
Specifying the set $Q$ of admissible measurement designs as a closed convex subset of the set of all nonvanishing discrete probability distributions on the set $\{1, 2, ..., I\} \times \{1, 2, ..., L\}$, we find ourselves in the Simple case of Discrete o.s., as defined by (2.102), and $A_q x$ is a probabilistic vector whenever $q \in Q$ and $x \in Y$.

Figure 2.7: PET scanner

2.7.3.2 Simple case, Poisson o.s.
Looking at the optimization problem (Pq ), we see that the simplest way to ensure its objective is, as in the case of Discrete o.s., to assume that Aq = Diag{Bq}A, where A is some fixed d × n matrix, and B is some fixed d × (dim q) matrix such that Bq is positive whenever q ∈ Q. On the top of this, we should ensure that when q ∈ Q and x ∈ X , Aq x is a positive vector; this amounts to some restrictions linking Q, X , A, and B. Application Example: PET with time control. Positron Emission Tomography was already mentioned, as an example of Poisson o.s., in Section 2.4.3.2. As explained in the section, in PET we observe a random vector ω ∈ Rd with independent entries [ω]i ∼ Poisson(µi ), 1 ≤ i ≤ d, where the vector of parameters µ = [µ1 ; ...µd ] of the Poisson distributions is the linear image µ = Aλ of an unknown “signal” λ (the tracer’s density in patient’s body) belonging to some known subset Λ of RD + , with entrywise nonnegative matrix A. Our goal is to make inferences about λ. Now, in an actual PET scan, the patient’s position w.r.t. the scanner is not the same during the entire study; the position is kept fixed for an i-th time period, 1 ≤ i ≤ I, and changes from period to period in order to expose to the scanner the entire “area of interest.” For example, with the scanner shown on Figure 2.7, during the PET study the imaging table with the patient will be shifted several times along the axis of the scanning ring. As a result, the observed vector ω can be split into blocks ω i , i = 1, ..., I, of data acquired during the i-th period, 1 ≤ i ≤ I. On closer inspection, the corresponding block µi in µ is µi = qi Ai λ, where Ai is an entrywise nonnegative matrix known in advance, and qi is the duration of the i-th period. In principle, the qi could PIbe treated as nonnegative design variables subject to the “budget constraint” i=1 qi = T , where T is the
122
CHAPTER 2
total duration of the study,20 and perhaps some other convex constraints, say, positive lower bounds on qi . It is immediately seen that the outlined situation is exactly as is required in the Simple case of Poisson o.s. 2.7.3.3
Simple case, Gaussian o.s.
Looking at the optimization problem (Gq ), we see that the simplest way to ensure that its objective is affine in q is to assume that the covariance matrix Θ is diagonal, and √ √ (2.103) Aq = Diag{ q1 , ..., qd }A where A is a fixed d × n matrix, and q runs through a convex compact subset of Rd+ . It turns out that there are situations where assumption (2.103) makes perfect sense. Let us start with a preamble. In Gaussian o.s.
ω = Ax + ξ A ∈ Rd×n , ξ ∼ N (0, Σ), Σ = Diag{σ12 , ..., σd2 }
(2.104)
the “physics” behind the observations in many cases is as follows. There are d sensors (receivers), the i-th registering the continuous time analogous input depending linearly on the underlying observations signal x. On the time horizon on which the measurements are taken, this input is constant in time and is registered by the i-th sensor on time interval ∆i . The deterministic component of the measurement registered by sensor i is the integral of the corresponding input taken over ∆i , and the stochastic component of the measurement is obtained by integrating white Gaussian noise over the same interval. As far as this noise is concerned, what matters is that when the white noise affecting the i-th sensor is integrated over a time interval ∆i , the result is a Gaussian random variable with zero mean and variance σi2 |∆i | (here |∆i | is the length of ∆i ), and the random variables obtained by integrating white noise over nonoverlapping segments are independent. Besides this, we assume that the noisy components of measurements are independent across the sensors. Now, there could be two basic versions of the situation just outlined, both leading to the same observation model (2.104). In the first, “parallel,” version, all d sensors work in parallel on the same time horizon of duration 1. In the second, “sequential,” version, the sensors are activated and scanned one by one, each working unit time; thus, here the full time horizon is d, and the sensors are registering their respective inputs on consecutive time intervals of duration 1 each. In this second “physical” version of Gaussian o.s., we can, in principle, allow for sensors to register their inputs on consecutive time segments of varying durations q1 ≥ 0, q2 ≥ 0, ..., qd ≥ 0, with thePadditional to nonnegativity restriction that our total time budget is respected: i qi = d (perhaps with some other convex constraints on qi ). Let us look what the observation scheme we end up with is. 
Assuming that (2.104) correctly represents our observations in the reference case where all the $|\Delta_i|$ are equal to 1, the deterministic component of the measurement registered by sensor $i$ in a time interval of duration $q_i$ will be $q_i\sum_j a_{ij}x_j$, and the standard deviation of the noisy component will be $\sigma_i\sqrt{q_i}$, so that the measurements become
$$z_i = \sigma_i\sqrt{q_i}\,\zeta_i + q_i\sum_j a_{ij}x_j,\quad i = 1, ..., d,$$
with standard (zero mean, unit variance) Gaussian noises $\zeta_i$ independent of each other. Now, since we know $q_i$, we can scale the latter observations by making the standard deviation of the noisy component the same $\sigma_i$ as in the reference case. Specifically, we lose nothing when assuming that our observations are
$$\omega_i = z_i/\sqrt{q_i} = \underbrace{\sigma_i\zeta_i}_{\xi_i} + \sqrt{q_i}\sum_j a_{ij}x_j,$$
or, equivalently,
$$\omega = \xi + \underbrace{\mathrm{Diag}\{\sqrt{q_1}, ..., \sqrt{q_d}\}A}_{A_q}\,x,\quad \xi \sim \mathcal{N}(0, \mathrm{Diag}\{\sigma_1^2, ..., \sigma_d^2\}) \qquad [A = [a_{ij}]],$$
where $q$ runs through a convex compact subset $Q$ of the simplex $\{q \in \mathbf{R}^d_+ : \sum_i q_i = d\}$. Thus, if the “physical nature” of a Gaussian o.s. is sequential, then, making the activity times of the sensors our design variables, as is natural under the circumstances, we arrive at (2.103), and, as a result, end up with an easy-to-solve Measurements Design problem.

20 $T$ cannot be too large; aside from other considerations, the tracer disintegrates, and its density can be considered as nearly constant only on a properly restricted time horizon.
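The passage from activity times $q$ to the sensing matrix $A_q$ can be illustrated by a minimal Python sketch. The function names, the use of NumPy, and the input check are assumptions made for illustration only; they are not part of the text.

```python
import numpy as np

def sequential_sensing_matrix(A, q):
    """Return A_q = Diag{sqrt(q_1), ..., sqrt(q_d)} A for activity times q."""
    A = np.asarray(A, dtype=float)
    q = np.asarray(q, dtype=float)
    d = A.shape[0]
    # q must be nonnegative and respect the total time budget sum(q) = d
    if q.shape != (d,) or np.any(q < 0) or not np.isclose(q.sum(), d):
        raise ValueError("q must be nonnegative with sum(q) == d")
    return np.sqrt(q)[:, None] * A  # scale row i by sqrt(q_i)

def simulate_observation(A, q, x, sigma, rng):
    """Draw omega = A_q x + xi with xi ~ N(0, Diag{sigma_i^2})."""
    sigma = np.asarray(sigma, dtype=float)
    xi = sigma * rng.standard_normal(sigma.shape[0])
    return sequential_sensing_matrix(A, q) @ np.asarray(x, dtype=float) + xi
```

With $q = [1; ...; 1]$ the sketch returns $A$ itself, which is the reference case (2.104).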
2.8
AFFINE DETECTORS BEYOND SIMPLE OBSERVATION SCHEMES
On a closer inspection, the “common denominator” of our basic simple o.s.’s— Gaussian, Poisson and Discrete ones—is that in all these cases the minimal risk detector for a pair of convex hypotheses is affine. At first glance, this indeed is so for Gaussian and Poisson o.s.’s, where F is comprised of affine functions on the corresponding observation space Ω (Rd for Gaussian o.s., and Zd+ ⊂ Rd for Poisson o.s.), but is not so for Discrete o.s.—in that case, Ω = {1, ..., d}, and F is comprised of all functions on Ω, while “affine functions on Ω = {1, ..., d}” make no sense. Note, however, that we can encode (and from now on this is what we do) the points i = 1, ..., d of a d-element set by basic orths ei = [0; ...; 0; 1; 0; ...; 0] ∈ Rd in Rd , thus making our observation space Ω a subset of Rd . With this encoding, every real valued function on {1, ..., d} becomes a restriction on Ω of an affine function. Note that when passing from our basic simple o.s.’s to their direct products, the minimum risk detectors for pairs of convex hypotheses remain affine. Now, in our context the following two properties of simple o.s.’s are essential: A) the best—with the smallest possible risk—affine detector, like its risk, can be efficiently computed; B) the smallest risk affine detector from A) is the best detector, in terms of risk, available under the circumstances, so that the associated test is near-optimal. Note that as far as practical applications of the detector-based hypothesis testing are concerned, one “can survive” without B) (near-optimality of the constructed detectors), while A) is a requisite.
In this section we focus on families of probability distributions obeying A). This class turns out to be incomparably larger than what was defined as simple o.s.’s in Section 2.4; in particular, it includes nonparametric families of distributions. Staying within this much broader class, we still are able to construct in a computationally efficient way the best affine detectors, in a certain precise sense, for a pair of “convex” hypotheses, along with valid upper bounds on the risks of the detectors. What we, in general, cannot claim anymore, is that the tests associated with such detectors are near-optimal. This being said, we believe that investigating possibilities for building tests and quantifying their performance in a computationally friendly manner is of value even when we cannot provably guarantee near-optimality of these tests. The results to follow originate from [135, 136].

2.8.1 Situation
In what follows, we fix an observation space $\Omega = \mathbf{R}^d$, and let $\mathcal{P}_j$, $1 \le j \le J$, be given families of probability distributions on $\Omega$. Put broadly, our goal still is, given a random observation $\omega \sim P$, where $P \in \bigcup_{j\le J}\mathcal{P}_j$, to decide upon the hypotheses $H_j : P \in \mathcal{P}_j$, $j = 1, ..., J$. We intend to address this goal in the case when the families $\mathcal{P}_j$ are simple: they are comprised of distributions for which moment-generating functions admit an explicit upper bound.

2.8.1.1 Preliminaries: Regular data and associated families of distributions
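Definition 2.36. A. Regular data is a triple $\mathcal{H}, \mathcal{M}, \Phi(\cdot,\cdot)$, where
– $\mathcal{H}$ is a nonempty closed convex set in $\Omega = \mathbf{R}^d$ symmetric w.r.t. the origin,
– $\mathcal{M}$ is a closed convex set in some $\mathbf{R}^n$,
– $\Phi(h;\mu) : \mathcal{H}\times\mathcal{M} \to \mathbf{R}$ is a continuous function convex in $h \in \mathcal{H}$ and concave in $\mu \in \mathcal{M}$.
B. Regular data $\mathcal{H}, \mathcal{M}, \Phi(\cdot,\cdot)$ define two families of probability distributions on $\Omega$:
– the family of regular distributions $\mathcal{R} = \mathcal{R}[\mathcal{H}, \mathcal{M}, \Phi]$ comprised of all probability distributions $P$ on $\Omega$ such that
$$\forall h \in \mathcal{H}\ \exists \mu \in \mathcal{M}:\ \ln\left(\int_\Omega \exp\{h^T\omega\}P(d\omega)\right) \le \Phi(h;\mu);$$
– the family of simple distributions $\mathcal{S} = \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ comprised of probability distributions $P$ on $\Omega$ such that
$$\exists \mu \in \mathcal{M}:\ \forall h \in \mathcal{H}:\ \ln\left(\int_\Omega \exp\{h^T\omega\}P(d\omega)\right) \le \Phi(h;\mu). \tag{2.105}$$
For a probability distribution $P \in \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$, every $\mu \in \mathcal{M}$ satisfying (2.105) is referred to as a parameter of $P$ w.r.t. $\mathcal{S}$. Note that a distribution may have many parameters different from each other.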
Recall that beginning with Section 2.3, the starting point in all our constructions is a “plausibly good” detector-based test which, given two families $\mathcal{P}_1$ and $\mathcal{P}_2$ of distributions with common observation space, and repeated observations $\omega_1, ..., \omega_t$ drawn from a distribution $P \in \mathcal{P}_1 \cup \mathcal{P}_2$, decides whether $P \in \mathcal{P}_1$ or $P \in \mathcal{P}_2$. Our interest in the families of regular/simple distributions stems from the fact that when the families $\mathcal{P}_1$ and $\mathcal{P}_2$ are of this type, building such a test reduces to solving a convex-concave saddle point problem and thus can be carried out in a computationally efficient manner. We postpone the related construction and analysis to Section 2.8.2, and continue with presenting some basic examples of families of simple and regular distributions along with a simple “calculus” of these families.

2.8.1.2 Basic examples of simple families of probability distributions
2.8.1.2.A. Sub-Gaussian distributions: Let $\mathcal{H} = \Omega = \mathbf{R}^d$, let $\mathcal{M}$ be a closed convex subset of the set $\mathcal{G}_d = \{\mu = (\theta, \Theta) : \theta \in \mathbf{R}^d, \Theta \in \mathbf{S}^d_+\}$, where $\mathbf{S}^d_+$ is the cone of positive semidefinite matrices in the space $\mathbf{S}^d$ of symmetric $d \times d$ matrices, and let $\Phi(h;\theta,\Theta) = \theta^Th + \frac{1}{2}h^T\Theta h$. Recall that a distribution $P$ on $\Omega = \mathbf{R}^d$ is called sub-Gaussian with sub-Gaussianity parameters $\theta \in \mathbf{R}^d$ and $\Theta \in \mathbf{S}^d_+$ if
$$\mathbf{E}_{\omega\sim P}\{\exp\{h^T\omega\}\} \le \exp\{\theta^Th + \tfrac{1}{2}h^T\Theta h\}\quad \forall h \in \mathbf{R}^d. \tag{2.106}$$
Whenever this is the case, $\theta$ is the expected value of $P$. We shall use the notation $\xi \sim \mathcal{SG}(\theta,\Theta)$ as a shortcut for the sentence “random vector $\xi$ is sub-Gaussian with parameters $\theta$, $\Theta$.” It is immediately seen that when $\xi \sim \mathcal{N}(\theta,\Theta)$, we also have $\xi \sim \mathcal{SG}(\theta,\Theta)$, and (2.106) in this case is an identity rather than an inequality. With $\Phi$ as above, $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ clearly contains every sub-Gaussian distribution $P$ on $\mathbf{R}^d$ with sub-Gaussianity parameters (forming a parameter of $P$ w.r.t. $\mathcal{S}$) from $\mathcal{M}$. In particular, $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ contains all Gaussian distributions $\mathcal{N}(\theta,\Theta)$ with $(\theta,\Theta) \in \mathcal{M}$.

2.8.1.2.B. Poisson distributions: Let $\mathcal{H} = \Omega = \mathbf{R}^d$, let $\mathcal{M}$ be a closed convex subset of the $d$-dimensional nonnegative orthant $\mathbf{R}^d_+$, and let
$$\Phi(h = [h_1; ...; h_d]; \mu = [\mu_1; ...; \mu_d]) = \sum_{i=1}^d \mu_i[\exp\{h_i\} - 1] :\ \mathcal{H}\times\mathcal{M} \to \mathbf{R}.$$
The family $\mathcal{S} = \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ contains all Poisson distributions $\mathrm{Poisson}[\mu]$ with vectors $\mu$ of parameters belonging to $\mathcal{M}$; here $\mathrm{Poisson}[\mu]$ is the distribution of a random $d$-dimensional vector with entries independent of each other, the $i$-th entry being a Poisson random variable with parameter $\mu_i$. $\mu$ is a parameter of $\mathrm{Poisson}[\mu]$ w.r.t. $\mathcal{S}$.

2.8.1.2.C. Discrete distributions. Consider a discrete random variable taking values in the $d$-element set $\{1, 2, ..., d\}$, and let us think of such a variable as of a random variable taking values $e_i \in \mathbf{R}^d$, $i = 1, ..., d$, where $e_i = [0; ...; 0; 1; 0; ...; 0]$ (1 in position $i$) are standard basic orths in $\mathbf{R}^d$. The probability distribution of such a variable can be identified with a point $\mu = [\mu_1; ...; \mu_d]$ from the $d$-dimensional probabilistic simplex
$$\Delta_d = \left\{\nu \in \mathbf{R}^d_+ : \sum_{i=1}^d \nu_i = 1\right\},$$
where $\mu_i$ is the probability for the variable to take value $e_i$. With these identifications, setting $\mathcal{H} = \mathbf{R}^d$, specifying $\mathcal{M}$ as a closed convex subset of $\Delta_d$, and setting
$$\Phi(h = [h_1; ...; h_d]; \mu = [\mu_1; ...; \mu_d]) = \ln\left(\sum_{i=1}^d \mu_i\exp\{h_i\}\right),$$
the family $\mathcal{S} = \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ contains distributions of all discrete random variables taking values in $\{1, ..., d\}$ with probabilities $\mu_1, ..., \mu_d$ comprising a vector from $\mathcal{M}$. This vector is a parameter of the corresponding distribution w.r.t. $\mathcal{S}$.
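The three functions $\Phi$ of examples 2.8.1.2.A–C are easy to evaluate numerically. The Python sketch below is illustrative only: the function names and the truncated-series check of the Poisson log-moment-generating function are assumptions introduced here, not constructions from the text.

```python
import math
import numpy as np

def phi_subgaussian(h, theta, Theta):
    """Phi(h; theta, Theta) = theta^T h + (1/2) h^T Theta h."""
    return float(theta @ h + 0.5 * h @ Theta @ h)

def phi_poisson(h, mu):
    """Phi(h; mu) = sum_i mu_i (exp(h_i) - 1)."""
    return float(np.sum(mu * (np.exp(h) - 1.0)))

def phi_discrete(h, mu):
    """Phi(h; mu) = ln sum_i mu_i exp(h_i), computed in a numerically stable way."""
    m = np.max(h)
    return float(m + np.log(np.sum(mu * np.exp(h - m))))

def poisson_log_mgf_1d(h, mu, kmax=200):
    """ln E exp{h k} for k ~ Poisson(mu), via a truncated series (log-sum-exp)."""
    terms = [h * k - mu + k * math.log(mu) - math.lgamma(k + 1)
             for k in range(kmax + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))
```

For a one-dimensional Poisson variable, `poisson_log_mgf_1d(h, mu)` should agree with `phi_poisson` up to truncation error, reflecting that the bound in (2.105) holds here as an equality.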
2.8.1.2.D. Distributions with bounded support. Consider the family $\mathcal{P}[X]$ of probability distributions supported on a closed and bounded convex set $X \subset \Omega = \mathbf{R}^d$, and let
$$\phi_X(h) = \max_{x\in X} h^Tx$$
be the support function of $X$. We have the following result (to be refined in Section 2.8.1.3):

Proposition 2.37. For every $P \in \mathcal{P}[X]$ it holds
$$\forall h \in \mathbf{R}^d:\ \ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega\}P(d\omega)\right) \le h^Te[P] + \tfrac{1}{8}[\phi_X(h) + \phi_X(-h)]^2, \tag{2.107}$$
where $e[P] = \int_{\mathbf{R}^d}\omega P(d\omega)$ is the expectation of $P$, and the function in the right-hand side of (2.107) is convex. As a result, setting
$$\mathcal{H} = \mathbf{R}^d,\quad \mathcal{M} = X,\quad \Phi(h;\mu) = h^T\mu + \tfrac{1}{8}[\phi_X(h) + \phi_X(-h)]^2,$$
we obtain regular data such that $\mathcal{P}[X] \subset \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$, $e[P]$ being a parameter of a distribution $P \in \mathcal{P}[X]$ w.r.t. $\mathcal{S}$.

For proof, see Section 2.11.4.

2.8.1.3 Calculus of regular and simple families of probability distributions
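Families of regular and simple distributions admit “fully algorithmic” calculus, with the main calculus rules as follows.

2.8.1.3.A. Direct summation. For $1 \le \ell \le L$, let regular data $\mathcal{H}_\ell \subset \Omega_\ell = \mathbf{R}^{d_\ell}$, $\mathcal{M}_\ell \subset \mathbf{R}^{n_\ell}$, $\Phi_\ell(h_\ell;\mu_\ell) : \mathcal{H}_\ell\times\mathcal{M}_\ell \to \mathbf{R}$ be given. Let us set
$$\begin{array}{rcl}
\Omega &=& \Omega_1\times...\times\Omega_L = \mathbf{R}^d,\ d = d_1 + ... + d_L,\\
\mathcal{H} &=& \mathcal{H}_1\times...\times\mathcal{H}_L = \{h = [h_1; ...; h_L] : h_\ell \in \mathcal{H}_\ell,\ \ell \le L\},\\
\mathcal{M} &=& \mathcal{M}_1\times...\times\mathcal{M}_L = \{\mu = [\mu_1; ...; \mu_L] : \mu_\ell \in \mathcal{M}_\ell,\ \ell \le L\} \subset \mathbf{R}^n,\ n = n_1 + ... + n_L,\\
\Phi(h = [h_1; ...; h_L]; \mu = [\mu_1; ...; \mu_L]) &=& \sum_{\ell=1}^L \Phi_\ell(h_\ell;\mu_\ell) :\ \mathcal{H}\times\mathcal{M} \to \mathbf{R}.
\end{array}$$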
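Then $\mathcal{H}$ is a closed convex set in $\Omega = \mathbf{R}^d$, symmetric w.r.t. the origin, $\mathcal{M}$ is a nonempty closed convex set in $\mathbf{R}^n$, $\Phi : \mathcal{H}\times\mathcal{M} \to \mathbf{R}$ is a continuous convex-concave function, and clearly
• the family $\mathcal{R}[\mathcal{H}, \mathcal{M}, \Phi]$ contains all product-type distributions $P = P_1\times...\times P_L$ on $\Omega = \Omega_1\times...\times\Omega_L$ with $P_\ell \in \mathcal{R}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]$, $1 \le \ell \le L$;
• the family $\mathcal{S} = \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ contains all product-type distributions $P = P_1\times...\times P_L$ on $\Omega = \Omega_1\times...\times\Omega_L$ with $P_\ell \in \mathcal{S}_\ell = \mathcal{S}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]$, $1 \le \ell \le L$, a parameter of $P$ w.r.t. $\mathcal{S}$ being the vector of parameters of $P_\ell$ w.r.t. $\mathcal{S}_\ell$.

2.8.1.3.B. Mixing. For $1 \le \ell \le L$, let regular data $\mathcal{H}_\ell \subset \Omega = \mathbf{R}^d$, $\mathcal{M}_\ell \subset \mathbf{R}^{n_\ell}$, $\Phi_\ell(h_\ell;\mu_\ell) : \mathcal{H}_\ell\times\mathcal{M}_\ell \to \mathbf{R}$ be given, with compact $\mathcal{M}_\ell$. Let also $\nu = [\nu_1; ...; \nu_L]$ be a probabilistic vector. For a tuple $P^L = \{P_\ell \in \mathcal{R}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]\}_{\ell=1}^L$, let $\Pi[P^L, \nu]$ be the $\nu$-mixture of distributions $P_1, ..., P_L$ defined as the distribution of a random vector $\omega \in \Omega$ generated as follows: we draw at random, from probability distribution $\nu$ on $\{1, ..., L\}$, index $\ell$, and then draw $\omega$ at random from the distribution $P_\ell$. Finally, let $\mathcal{P}$ be the set of all probability distributions on $\Omega$ which can be obtained as $\Pi[P^L, \nu]$ from the outlined tuples $P^L$ and vectors $\nu$ running through the probabilistic simplex $\Delta_L = \{\nu \in \mathbf{R}^L : \nu \ge 0, \sum_\ell \nu_\ell = 1\}$. Let us set
$$\begin{array}{rcl}
\mathcal{H} &=& \bigcap_{\ell=1}^L \mathcal{H}_\ell,\\
\Psi_\ell(h) &=& \max_{\mu_\ell\in\mathcal{M}_\ell}\Phi_\ell(h;\mu_\ell) :\ \mathcal{H}_\ell \to \mathbf{R},\\
\Phi(h;\nu) &=& \ln\left(\sum_{\ell=1}^L \nu_\ell\exp\{\Psi_\ell(h)\}\right) :\ \mathcal{H}\times\Delta_L \to \mathbf{R}.
\end{array}$$
Then $\mathcal{H}, \Delta_L, \Phi$ clearly is regular data (recall that all $\mathcal{M}_\ell$ are compact sets), and for every $\nu \in \Delta_L$ and tuple $P^L$ of the above type one has
$$P = \Pi[P^L, \nu]\ \Rightarrow\ \ln\left(\int_\Omega e^{h^T\omega}P(d\omega)\right) \le \Phi(h;\nu)\quad \forall h \in \mathcal{H}, \tag{2.108}$$
implying that $\mathcal{P} \subset \mathcal{S}[\mathcal{H}, \Delta_L, \Phi]$, $\nu$ being a parameter of $P = \Pi[P^L, \nu] \in \mathcal{P}$. Indeed, (2.108) is readily given by the fact that for $P = \Pi[P^L, \nu] \in \mathcal{P}$ and $h \in \mathcal{H}$ it holds
$$\ln\left(\mathbf{E}_{\omega\sim P}\left\{e^{h^T\omega}\right\}\right) = \ln\left(\sum_{\ell=1}^L \nu_\ell\mathbf{E}_{\omega\sim P_\ell}\{e^{h^T\omega}\}\right) \le \ln\left(\sum_{\ell=1}^L \nu_\ell\exp\{\Psi_\ell(h)\}\right) = \Phi(h;\nu),$$
with the concluding inequality given by $h \in \mathcal{H} \subset \mathcal{H}_\ell$ and $P_\ell \in \mathcal{R}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]$, $1 \le \ell \le L$.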
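We have built a simple family of distributions $\mathcal{S} := \mathcal{S}[\mathcal{H}, \Delta_L, \Phi]$ which contains all mixtures of distributions from given regular families $\mathcal{R}_\ell := \mathcal{R}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]$, $1 \le \ell \le L$, which makes $\mathcal{S}$ a simple outer approximation of mixtures of distributions from the simple families $\mathcal{S}_\ell := \mathcal{S}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]$, $1 \le \ell \le L$. In this latter capacity, $\mathcal{S}$ has a drawback: the only parameter of the mixture $P = \Pi[P^L, \nu]$ of distributions $P_\ell \in \mathcal{S}_\ell$ is $\nu$, while the parameters of the $P_\ell$’s disappear. In some situations, this makes the outer approximation $\mathcal{S}$ of $\mathcal{P}$ too conservative. We are about to get rid, to some extent, of this drawback.

A modification. In the situation described at the beginning of 2.8.1.3.B, let a vector $\bar\nu \in \Delta_L$ be given, and let
$$\bar\Phi(h;\mu_1, ..., \mu_L) = \sum_{\ell=1}^L \bar\nu_\ell\Phi_\ell(h;\mu_\ell) :\ \mathcal{H}\times(\mathcal{M}_1\times...\times\mathcal{M}_L) \to \mathbf{R}.$$
Let a $d \times d$ matrix $Q \succeq 0$ satisfy
$$\left(\Phi_\ell(h;\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)\right)^2 \le h^TQh\quad \forall(h \in \mathcal{H},\ \ell \le L,\ \mu \in \mathcal{M}_1\times...\times\mathcal{M}_L), \tag{2.109}$$
and let
$$\Phi(h;\mu_1, ..., \mu_L) = \tfrac{3}{5}h^TQh + \bar\Phi(h;\mu_1, ..., \mu_L) :\ \mathcal{H}\times(\mathcal{M}_1\times...\times\mathcal{M}_L) \to \mathbf{R}. \tag{2.110}$$
$\Phi$ clearly is convex-concave and continuous on its domain, whence $\mathcal{H} = \bigcap_\ell\mathcal{H}_\ell$, $\mathcal{M}_1\times...\times\mathcal{M}_L$, $\Phi$ is regular data.

Proposition 2.38. In the situation just defined, denoting by $\mathcal{P}_{\bar\nu}$ the family of all probability distributions $P = \Pi[P^L, \bar\nu]$, stemming from tuples
$$P^L = \{P_\ell \in \mathcal{S}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]\}_{\ell=1}^L, \tag{2.111}$$
one has
$$\mathcal{P}_{\bar\nu} \subset \mathcal{S}[\mathcal{H}, \mathcal{M}_1\times...\times\mathcal{M}_L, \Phi].$$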
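As a parameter of distribution $P = \Pi[P^L, \bar\nu] \in \mathcal{P}_{\bar\nu}$ with $P^L$ as in (2.111), one can take $\mu^L = [\mu_1; ...; \mu_L]$.

Proof. It is easily seen that
$$e^a \le a + e^{\frac{3}{5}a^2}\quad \forall a.$$
As a result, when $a_\ell$, $\ell = 1, ..., L$, satisfy $\sum_\ell\bar\nu_\ell a_\ell = 0$, we have
$$\sum_\ell\bar\nu_\ell e^{a_\ell} \le \sum_\ell\bar\nu_\ell a_\ell + \sum_\ell\bar\nu_\ell e^{\frac{3}{5}a_\ell^2} \le e^{\frac{3}{5}\max_\ell a_\ell^2}. \tag{2.112}$$
Now let $P^L$ be as in (2.111), with $\mu_\ell$ being a parameter of $P_\ell$, and let $h \in \mathcal{H} = \bigcap_\ell\mathcal{H}_\ell$. Setting $P = \Pi[P^L, \bar\nu]$, we have
$$\begin{array}{rcl}
\ln\left(\int_\Omega e^{h^T\omega}P(d\omega)\right) &=& \ln\left(\sum_\ell\bar\nu_\ell\int_\Omega e^{h^T\omega}P_\ell(d\omega)\right) \le \ln\left(\sum_\ell\bar\nu_\ell\exp\{\Phi_\ell(h,\mu_\ell)\}\right)\\
&=& \bar\Phi(h;\mu_1, ..., \mu_L) + \ln\left(\sum_\ell\bar\nu_\ell\exp\{\Phi_\ell(h,\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)\}\right)\\
&\underbrace{\le}_{a}& \bar\Phi(h;\mu_1, ..., \mu_L) + \tfrac{3}{5}\max_\ell\left[\Phi_\ell(h,\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)\right]^2 \underbrace{\le}_{b} \Phi(h;\mu_1, ..., \mu_L),
\end{array}$$
where the first inequality is due to $P_\ell \in \mathcal{S}[\mathcal{H}_\ell, \mathcal{M}_\ell, \Phi_\ell]$, $a$ is given by (2.112) as applied to $a_\ell = \Phi_\ell(h,\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)$, and $b$ is due to (2.109) and (2.110). The resulting inequality, which holds true for all $h \in \mathcal{H}$, is all we need. ✷

2.8.1.3.C. I.i.d. summation. Let $\Omega = \mathbf{R}^d$ be an observation space, $(\mathcal{H}, \mathcal{M}, \Phi)$ be regular data on this space, and $\lambda = \{\lambda_\ell\}_{\ell=1}^L$ be a collection of reals. We can associate with the outlined entities new data $(\mathcal{H}_\lambda, \mathcal{M}, \Phi_\lambda)$ on $\Omega$ by setting
$$\mathcal{H}_\lambda = \{h \in \Omega : \|\lambda\|_\infty h \in \mathcal{H}\},\quad \Phi_\lambda(h;\mu) = \sum_{\ell=1}^L \Phi(\lambda_\ell h;\mu) :\ \mathcal{H}_\lambda\times\mathcal{M} \to \mathbf{R}.$$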
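Now, given a probability distribution $P$ on $\Omega$, we can associate with it and with the above $\lambda$ a new probability distribution $P^\lambda$ on $\Omega$ as follows: $P^\lambda$ is the distribution of $\sum_\ell\lambda_\ell\omega_\ell$, where $\omega_1, \omega_2, ..., \omega_L$ are drawn, independently of each other, from $P$. An immediate observation is that the data $(\mathcal{H}_\lambda, \mathcal{M}, \Phi_\lambda)$ is regular, and whenever a probability distribution $P$ belongs to $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$, the distribution $P^\lambda$ belongs to $\mathcal{S}[\mathcal{H}_\lambda, \mathcal{M}, \Phi_\lambda]$, and every parameter of $P$ is a parameter of $P^\lambda$. In particular, when $\omega \sim P \in \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$, the distribution $P^L$ of the sum of $L$ independent copies of $\omega$ belongs to $\mathcal{S}[\mathcal{H}, \mathcal{M}, L\Phi]$.

2.8.1.3.D. Semi-direct summation. For $1 \le \ell \le L$, let regular data $\mathcal{H}_\ell \subset \Omega_\ell = \mathbf{R}^{d_\ell}$, $\mathcal{M}_\ell$, $\Phi_\ell$ be given. To avoid complications, we assume that for every $\ell$,
• $\mathcal{H}_\ell = \Omega_\ell$,
• $\mathcal{M}_\ell$ is bounded.
Let also an $\epsilon > 0$ be given. We assume that $\epsilon$ is small, namely, $L\epsilon < 1$. Let us aggregate the given regular data into a new one by setting
$$\mathcal{H} = \Omega := \Omega_1\times...\times\Omega_L = \mathbf{R}^d,\ d = d_1 + ... + d_L,\quad \mathcal{M} = \mathcal{M}_1\times...\times\mathcal{M}_L,$$
and let us define function $\Phi(h;\mu) : \Omega\times\mathcal{M} \to \mathbf{R}$ as follows:
$$\Phi(h = [h^1; ...; h^L]; \mu = [\mu_1; ...; \mu_L]) = \inf_{\lambda\in\Delta_\epsilon}\sum_{\ell=1}^L \lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\mu_\ell),\quad \Delta_\epsilon = \left\{\lambda \in \mathbf{R}^L : \lambda_\ell \ge \epsilon\ \forall\ell\ \&\ \sum_{\ell=1}^L\lambda_\ell = 1\right\}. \tag{2.113}$$
For evident reasons, the infimum in the description of $\Phi$ is achieved, and $\Phi$ is continuous. In addition, $\Phi$ is convex in $h \in \mathbf{R}^d$ and concave in $\mu \in \mathcal{M}$. Postponing for a moment verification, the consequences are that $\mathcal{H} = \Omega = \mathbf{R}^d$, $\mathcal{M}$, and $\Phi$ form regular data. We claim that

Whenever $\omega = [\omega^1; ...; \omega^L]$ is a random variable taking values in $\Omega = \mathbf{R}^{d_1}\times...\times\mathbf{R}^{d_L}$, and the marginal distributions $P_\ell$, $1 \le \ell \le L$, of $\omega$ belong to the families $\mathcal{S}_\ell = \mathcal{S}[\mathbf{R}^{d_\ell}, \mathcal{M}_\ell, \Phi_\ell]$ for all $1 \le \ell \le L$, the distribution $P$ of $\omega$ belongs to $\mathcal{S} = \mathcal{S}[\mathbf{R}^d, \mathcal{M}, \Phi]$, a parameter of $P$ w.r.t. $\mathcal{S}$ being the vector comprised of parameters of $P_\ell$ w.r.t. $\mathcal{S}_\ell$.

Indeed, since $P_\ell \in \mathcal{S}[\mathbf{R}^{d_\ell}, \mathcal{M}_\ell, \Phi_\ell]$, there exists $\hat\mu_\ell \in \mathcal{M}_\ell$ such that
$$\ln\left(\mathbf{E}_{\omega^\ell\sim P_\ell}\{\exp\{g^T\omega^\ell\}\}\right) \le \Phi_\ell(g;\hat\mu_\ell)\quad \forall g \in \mathbf{R}^{d_\ell}.$$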
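Let us set $\hat\mu = [\hat\mu_1; ...; \hat\mu_L]$, and let $h = [h^1; ...; h^L] \in \Omega$ be given. We can find $\lambda \in \Delta_\epsilon$ such that
$$\Phi(h;\hat\mu) = \sum_{\ell=1}^L \lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\hat\mu_\ell).$$
Applying the Hölder inequality, we get
$$\mathbf{E}_{[\omega^1;...;\omega^L]\sim P}\left\{\exp\Big\{\sum_\ell[h^\ell]^T\omega^\ell\Big\}\right\} \le \prod_{\ell=1}^L\left(\mathbf{E}_{\omega^\ell\sim P_\ell}\left\{\exp\{[h^\ell]^T\omega^\ell/\lambda_\ell\}\right\}\right)^{\lambda_\ell},$$
whence
$$\ln\left(\mathbf{E}_{[\omega^1;...;\omega^L]\sim P}\left\{\exp\Big\{\sum_\ell[h^\ell]^T\omega^\ell\Big\}\right\}\right) \le \sum_{\ell=1}^L\lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\hat\mu_\ell) = \Phi(h;\hat\mu).$$
We see that
$$\ln\left(\mathbf{E}_{[\omega^1;...;\omega^L]\sim P}\left\{\exp\Big\{\sum_\ell[h^\ell]^T\omega^\ell\Big\}\right\}\right) \le \Phi(h;\hat\mu)\quad \forall h \in \mathcal{H} = \mathbf{R}^d,$$
and thus $P \in \mathcal{S}[\mathbf{R}^d, \mathcal{M}, \Phi]$, as claimed.

It remains to verify that the function $\Phi$ defined by (2.113) indeed is convex in $h \in \mathbf{R}^d$ and concave in $\mu \in \mathcal{M}$. Concavity in $\mu$ is evident. Further, functions $\lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\mu)$ (as perspective transformations of convex functions $\Phi_\ell(\cdot;\mu)$) are jointly convex in $\lambda$ and $h^\ell$, and so is $\Psi(\lambda, h;\mu) = \sum_{\ell=1}^L\lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell,\mu)$. Thus $\Phi(\cdot;\mu)$, obtained by partial minimization of $\Psi$ in $\lambda$, indeed is convex.

2.8.1.3.E. Affine image. Let $\mathcal{H}, \mathcal{M}, \Phi$ be regular data, $\Omega$ be the embedding space of $\mathcal{H}$, and $x \mapsto Ax + a$ be an affine mapping from $\Omega$ to $\bar\Omega = \mathbf{R}^{\bar d}$, and let us set
$$\bar{\mathcal{H}} = \{\bar h \in \mathbf{R}^{\bar d} : A^T\bar h \in \mathcal{H}\},\quad \bar{\mathcal{M}} = \mathcal{M},\quad \bar\Phi(\bar h;\mu) = \Phi(A^T\bar h;\mu) + a^T\bar h :\ \bar{\mathcal{H}}\times\bar{\mathcal{M}} \to \mathbf{R}.$$
Note that $\bar{\mathcal{H}}, \bar{\mathcal{M}}, \bar\Phi$ is regular data. It is immediately seen that

Whenever the probability distribution $P$ of a random variable $\omega$ belongs to $\mathcal{R}[\mathcal{H}, \mathcal{M}, \Phi]$ (or belongs to $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$), the distribution $\bar P[P]$ of the random variable $\bar\omega = A\omega + a$ belongs to $\mathcal{R}[\bar{\mathcal{H}}, \bar{\mathcal{M}}, \bar\Phi]$ (respectively, belongs to $\mathcal{S}[\bar{\mathcal{H}}, \bar{\mathcal{M}}, \bar\Phi]$, and every parameter of $P$ is a parameter of $\bar P[P]$).

2.8.1.3.F. Incorporating support information. Consider the situation as follows. We are given regular data $\mathcal{H} \subset \Omega = \mathbf{R}^d$, $\mathcal{M}$, $\Phi$ and are interested in a family $\mathcal{P}$ of distributions known to belong to $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$. In addition, we know that all distributions $P$ from $\mathcal{P}$ are supported on a given closed convex set $X \subset \mathbf{R}^d$. How could we incorporate this domain information to pass from the family $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ containing $\mathcal{P}$ to a smaller family of the same type still containing $\mathcal{P}$? We are about to give an answer in the simplest case of $\mathcal{H} = \Omega$. When denoting by $\phi_X(\cdot)$ the support function of $X$ and selecting somehow a closed convex set $G \subset \mathbf{R}^d$ containing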
the origin, let us set
$$\hat\Phi(h;\mu) = \inf_{g\in G}\left[\Phi^+(h, g;\mu) := \Phi(h - g;\mu) + \phi_X(g)\right],$$
where $\Phi(h;\mu) : \mathbf{R}^d\times\mathcal{M} \to \mathbf{R}$ is the continuous convex-concave function participating in the original regular data. Assuming that $\hat\Phi$ is real-valued and continuous on the domain $\mathbf{R}^d\times\mathcal{M}$ (which definitely is the case when $G$ is a compact set such that $\phi_X$ is finite and continuous on $G$), note that $\hat\Phi$ is convex-concave on this domain, so that $\mathbf{R}^d, \mathcal{M}, \hat\Phi$ is regular data. We claim that

The family $\mathcal{S}[\mathbf{R}^d, \mathcal{M}, \hat\Phi]$ contains $\mathcal{P}$, provided the family $\mathcal{S}[\mathbf{R}^d, \mathcal{M}, \Phi]$ does so, and the first of these two families is smaller than the second one.

Verification of the claim is immediate. Let $P \in \mathcal{P}$, so that for properly selected $\mu = \mu_P \in \mathcal{M}$ and for all $e \in \mathbf{R}^d$ it holds
$$\ln\left(\int_{\mathbf{R}^d}\exp\{e^T\omega\}P(d\omega)\right) \le \Phi(e;\mu_P).$$
On the other hand, for every $g \in G$ we have $\phi_X(g) - g^T\omega \ge 0$ on the support of $P$, whence for every $h \in \mathbf{R}^d$ one has
$$\ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega\}P(d\omega)\right) \le \ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega + \phi_X(g) - g^T\omega\}P(d\omega)\right) \le \phi_X(g) + \Phi(h - g;\mu_P).$$
Since the resulting inequality holds true for all $g \in G$, we get
$$\ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega\}P(d\omega)\right) \le \hat\Phi(h;\mu_P)\quad \forall h \in \mathbf{R}^d,$$
implying that $P \in \mathcal{S}[\mathbf{R}^d, \mathcal{M}, \hat\Phi]$; because $P \in \mathcal{P}$ is arbitrary, the first part of the claim is justified. The inclusion $\mathcal{S}[\mathbf{R}^d, \mathcal{M}, \hat\Phi] \subset \mathcal{S}[\mathbf{R}^d, \mathcal{M}, \Phi]$ is readily given by the inequality $\hat\Phi \le \Phi$, and the latter is due to $\hat\Phi(h,\mu) \le \Phi(h - 0,\mu) + \phi_X(0)$.

Illustration: Distributions with bounded support revisited. In Section 2.8.1.2, given a convex compact set $X \subset \mathbf{R}^d$ with support function $\phi_X$, we checked that the data $\mathcal{H} = \mathbf{R}^d$, $\mathcal{M} = X$, $\Phi(h;\mu) = h^T\mu + \frac{1}{8}[\phi_X(h) + \phi_X(-h)]^2$ is regular and the family $\mathcal{S}[\mathbf{R}^d, \mathcal{M}, \Phi]$ contains the family $\mathcal{P}[X]$ of all probability distributions supported on $X$. Moreover, for every $\mu \in \mathcal{M} = X$, the family $\mathcal{S}[\mathbf{R}^d, \{\mu\}, \Phi|_{\mathbf{R}^d\times\{\mu\}}]$ contains all distributions supported on $X$ with the expectations $e[P] = \mu$. Note that $\Phi(h;e[P])$ describes well the behavior of the logarithm $F_P(h) = \ln\left(\int_{\mathbf{R}^d}e^{h^T\omega}P(d\omega)\right)$ of the moment-generating function of $P \in \mathcal{P}[X]$ when $h$ is small (indeed, $F_P(h) = h^Te[P] + O(\|h\|^2)$ as $h \to 0$), and by far overestimates $F_P(h)$ when $h$ is large. Utilizing the above construction, we replace $\Phi$ with the real-valued, convex-concave, and continuous on $\mathbf{R}^d\times\mathcal{M} = \mathbf{R}^d\times X$ (see Exercise 2.22) function
$$\hat\Phi(h;\mu) = \inf_g\left[\hat\Psi(h, g;\mu) := (h - g)^T\mu + \tfrac{1}{8}[\phi_X(h - g) + \phi_X(-h + g)]^2 + \phi_X(g)\right] \le \Phi(h;\mu). \tag{2.114}$$
It is easy to see that $\hat\Phi(\cdot;\cdot)$ still ensures the inclusion $P \in \mathcal{S}[\mathbf{R}^d, \{e[P]\}, \hat\Phi|_{\mathbf{R}^d\times\{e[P]\}}]$ for every distribution $P \in \mathcal{P}[X]$ and “reproduces $F_P(h)$ reasonably well” for both small and large $h$. Indeed, since $F_P(h) \le \hat\Phi(h;e[P]) \le \Phi(h;e[P])$, for small $h$ $\hat\Phi(h;e[P])$ reproduces $F_P(h)$ even better than $\Phi(h;e[P])$, and we clearly have
$$\hat\Phi(h;\mu) \le (h - h)^T\mu + \tfrac{1}{8}[\phi_X(h - h) + \phi_X(-h + h)]^2 + \phi_X(h) = \phi_X(h)\quad \forall\mu,$$
and $\phi_X(h)$ is a correct description of $F_P(h)$ for large $h$.

2.8.2 Main result

2.8.2.1 Situation & Construction
Assume we are given two collections of regular data with common $\Omega = \mathbf{R}^d$ and common $\mathcal{H}$, specifically, the collections $(\mathcal{H}, \mathcal{M}_\chi, \Phi_\chi)$, $\chi = 1, 2$. We start with constructing a specific detector for the associated families of regular probability distributions $\mathcal{P}_\chi = \mathcal{R}[\mathcal{H}, \mathcal{M}_\chi, \Phi_\chi]$, $\chi = 1, 2$. When building the detector, we impose on the regular data in question the following

Assumption I: The regular data $(\mathcal{H}, \mathcal{M}_\chi, \Phi_\chi)$, $\chi = 1, 2$, are such that the convex-concave function
$$\Psi(h;\mu_1,\mu_2) = \tfrac{1}{2}[\Phi_1(-h;\mu_1) + \Phi_2(h;\mu_2)] :\ \mathcal{H}\times(\mathcal{M}_1\times\mathcal{M}_2) \to \mathbf{R} \tag{2.115}$$
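has a saddle point (min in $h \in \mathcal{H}$, max in $(\mu_1,\mu_2) \in \mathcal{M}_1\times\mathcal{M}_2$).

A simple sufficient condition for existence of a saddle point of (2.115) is

Condition A: The sets $\mathcal{M}_1$ and $\mathcal{M}_2$ are compact, and the function
$$\overline\Phi(h) = \max_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\Phi(h;\mu_1,\mu_2)$$
is coercive on $\mathcal{H}$, meaning that $\overline\Phi(h_i) \to \infty$ along every sequence $h_i \in \mathcal{H}$ with $\|h_i\|_2 \to \infty$ as $i \to \infty$. (Here and in the argument below, $\Phi(h;\mu_1,\mu_2)$ stands for the function $\Psi(h;\mu_1,\mu_2)$ from (2.115).)

Indeed, under Condition A by the Sion-Kakutani Theorem (Theorem 2.22) it holds
$$\mathrm{SadVal}[\Phi] := \inf_{h\in\mathcal{H}}\underbrace{\max_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\Phi(h;\mu_1,\mu_2)}_{\overline\Phi(h)} = \sup_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\underbrace{\inf_{h\in\mathcal{H}}\Phi(h;\mu_1,\mu_2)}_{\underline\Phi(\mu_1,\mu_2)},$$
so that the optimization problems
$$\begin{array}{rcll}
\mathrm{Opt}(P) &=& \min_{h\in\mathcal{H}}\overline\Phi(h) & (P)\\
\mathrm{Opt}(D) &=& \max_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\underline\Phi(\mu_1,\mu_2) & (D)
\end{array}$$
have equal optimal values. Under Condition A, problem $(P)$ clearly is a problem of minimizing a continuous coercive function over a closed set and as such is solvable; thus, $\mathrm{Opt}(P) = \mathrm{Opt}(D)$ is a real. Problem $(D)$ clearly is the problem of maximizing over a compact set an upper semi-continuous (since $\Phi$ is continuous) function taking real values and, perhaps, value $-\infty$, and not identically equal to $-\infty$ (since $\mathrm{Opt}(D)$ is a real), and thus $(D)$ is solvable. As a result, $(P)$ and $(D)$ are solvable with common optimal values, and therefore $\Phi$ has a saddle point.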
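2.8.2.2 Main Result

An immediate (and essential) observation is as follows:

Proposition 2.39. In the situation of Section 2.8.2.1, let $h \in \mathcal{H}$ be such that the quantities
$$\Psi_1(h) = \sup_{\mu_1\in\mathcal{M}_1}\Phi_1(-h;\mu_1),\quad \Psi_2(h) = \sup_{\mu_2\in\mathcal{M}_2}\Phi_2(h;\mu_2)$$
are finite. Consider the affine detector
$$\phi_h(\omega) = h^T\omega + \underbrace{\tfrac{1}{2}[\Psi_1(h) - \Psi_2(h)]}_{\kappa}.$$
Then
$$\mathrm{Risk}[\phi_h|\mathcal{R}[\mathcal{H}, \mathcal{M}_1, \Phi_1], \mathcal{R}[\mathcal{H}, \mathcal{M}_2, \Phi_2]] \le \exp\{\tfrac{1}{2}[\Psi_1(h) + \Psi_2(h)]\}.$$

Proof. Let $h$ satisfy the premise of the proposition. For every $\mu_1 \in \mathcal{M}_1$, we have $\Phi_1(-h;\mu_1) \le \Psi_1(h)$, and for every $P \in \mathcal{R}[\mathcal{H}, \mathcal{M}_1, \Phi_1]$ we have
$$\int_\Omega\exp\{-h^T\omega\}P(d\omega) \le \exp\{\Phi_1(-h;\mu_1)\}$$
for properly selected $\mu_1 \in \mathcal{M}_1$. Thus,
$$\int_\Omega\exp\{-h^T\omega\}P(d\omega) \le \exp\{\Psi_1(h)\}\quad \forall P \in \mathcal{R}[\mathcal{H}, \mathcal{M}_1, \Phi_1],$$
whence also
$$\int_\Omega\exp\{-h^T\omega - \kappa\}P(d\omega) \le \exp\{\Psi_1(h) - \kappa\} = \exp\{\tfrac{1}{2}[\Psi_1(h) + \Psi_2(h)]\}\quad \forall P \in \mathcal{R}[\mathcal{H}, \mathcal{M}_1, \Phi_1].$$
Similarly, for every $\mu_2 \in \mathcal{M}_2$, we have $\Phi_2(h;\mu_2) \le \Psi_2(h)$, and for every $P \in \mathcal{R}[\mathcal{H}, \mathcal{M}_2, \Phi_2]$, we have
$$\int_\Omega\exp\{h^T\omega\}P(d\omega) \le \exp\{\Phi_2(h;\mu_2)\}$$
for properly selected $\mu_2 \in \mathcal{M}_2$. Thus,
$$\int_\Omega\exp\{h^T\omega\}P(d\omega) \le \exp\{\Psi_2(h)\}\quad \forall P \in \mathcal{R}[\mathcal{H}, \mathcal{M}_2, \Phi_2],$$
and
$$\int_\Omega\exp\{h^T\omega + \kappa\}P(d\omega) \le \exp\{\Psi_2(h) + \kappa\} = \exp\{\tfrac{1}{2}[\Psi_1(h) + \Psi_2(h)]\}\quad \forall P \in \mathcal{R}[\mathcal{H}, \mathcal{M}_2, \Phi_2].\ ✷$$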
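Proposition 2.39 can be checked numerically in the simplest setting. The Python sketch below assumes, for illustration only, that each $\mathcal{M}_\chi$ is a singleton describing one Gaussian distribution, so that $\Psi_1$, $\Psi_2$ are exact log-moment-generating functions and the bound is attained. The function names, the chosen parameters, and the Monte Carlo check are assumptions, not part of the text.

```python
import numpy as np

def affine_detector(h, psi1, psi2):
    """Return (kappa, risk_bound) for the detector phi_h(w) = h^T w + kappa."""
    kappa = 0.5 * (psi1 - psi2)
    return kappa, float(np.exp(0.5 * (psi1 + psi2)))

def gaussian_phi(h, theta, Theta):
    """ln E exp{h^T w} for w ~ N(theta, Theta), i.e. theta^T h + h^T Theta h / 2."""
    return float(theta @ h + 0.5 * h @ Theta @ h)

def empirical_risks(h, kappa, sample1, sample2):
    """Monte Carlo estimates of E_{P1} exp{-phi_h} and E_{P2} exp{phi_h}."""
    r1 = float(np.mean(np.exp(-(sample1 @ h + kappa))))
    r2 = float(np.mean(np.exp(sample2 @ h + kappa)))
    return r1, r2

rng = np.random.default_rng(0)
theta1, theta2 = np.array([1.0, 0.0]), np.array([-1.0, 0.0])
Theta = np.eye(2)
h = np.array([0.5, 0.0])                 # an arbitrary, not optimized, choice of h
psi1 = gaussian_phi(-h, theta1, Theta)   # Psi_1(h) = Phi_1(-h; mu_1)
psi2 = gaussian_phi(h, theta2, Theta)    # Psi_2(h) = Phi_2(h; mu_2)
kappa, bound = affine_detector(h, psi1, psi2)
s1 = rng.multivariate_normal(theta1, Theta, size=200_000)
s2 = rng.multivariate_normal(theta2, Theta, size=200_000)
r1, r2 = empirical_risks(h, kappa, s1, s2)
```

In this singleton Gaussian setting both empirical quantities `r1` and `r2` should be close to `bound`; for larger families the proposition guarantees only that they do not exceed it.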
An immediate corollary is as follows:
Proposition 2.40. In the situation of Section 2.8.2.1 and under Assumption I, let us associate with a saddle point $(h_*;\mu_1^*,\mu_2^*)$ of the convex-concave function (2.115) the following entities:
• the risk
$$\epsilon_\star := \exp\{\Psi(h_*;\mu_1^*,\mu_2^*)\}; \tag{2.116}$$
this quantity is uniquely defined by the saddle point value of $\Psi$ and thus is independent of how we select a saddle point;
• the detector $\phi_*(\omega)$, the affine function of $\omega \in \mathbf{R}^d$ given by
$$\phi_*(\omega) = h_*^T\omega + a,\quad a = \tfrac{1}{2}[\Phi_1(-h_*;\mu_1^*) - \Phi_2(h_*;\mu_2^*)]. \tag{2.117}$$
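Then
$$\mathrm{Risk}[\phi_*|\mathcal{R}[\mathcal{H}, \mathcal{M}_1, \Phi_1], \mathcal{R}[\mathcal{H}, \mathcal{M}_2, \Phi_2]] \le \epsilon_\star.$$

Consequences. Assume we are given $L$ collections $(\mathcal{H}, \mathcal{M}_\ell, \Phi_\ell)$ of regular data on a common observation space $\Omega = \mathbf{R}^d$ and with common $\mathcal{H}$, and let $\mathcal{P}_\ell = \mathcal{R}[\mathcal{H}, \mathcal{M}_\ell, \Phi_\ell]$ be the corresponding families of regular distributions. Assume also that for every pair $(\ell, \ell')$, $1 \le \ell < \ell' \le L$, the pair $(\mathcal{H}, \mathcal{M}_\ell, \Phi_\ell)$, $(\mathcal{H}, \mathcal{M}_{\ell'}, \Phi_{\ell'})$ of regular data satisfies Assumption I, so that the convex-concave functions
$$\Psi_{\ell\ell'}(h;\mu_\ell,\mu_{\ell'}) = \tfrac{1}{2}[\Phi_\ell(-h;\mu_\ell) + \Phi_{\ell'}(h;\mu_{\ell'})] :\ \mathcal{H}\times(\mathcal{M}_\ell\times\mathcal{M}_{\ell'}) \to \mathbf{R}\qquad [1 \le \ell < \ell' \le L]$$
have saddle points $(h^*_{\ell\ell'};(\mu^*_\ell,\mu^*_{\ell'}))$ (min in $h \in \mathcal{H}$, max in $(\mu_\ell,\mu_{\ell'}) \in \mathcal{M}_\ell\times\mathcal{M}_{\ell'}$). These saddle points give rise to the affine detectors
$$\phi_{\ell\ell'}(\omega) = [h^*_{\ell\ell'}]^T\omega + \tfrac{1}{2}[\Phi_\ell(-h^*_{\ell\ell'};\mu^*_\ell) - \Phi_{\ell'}(h^*_{\ell\ell'};\mu^*_{\ell'})]\qquad [1 \le \ell < \ell' \le L]$$
and the quantities
$$\epsilon_{\ell\ell'} = \exp\{\tfrac{1}{2}[\Phi_\ell(-h^*_{\ell\ell'};\mu^*_\ell) + \Phi_{\ell'}(h^*_{\ell\ell'};\mu^*_{\ell'})]\};\qquad [1 \le \ell < \ell' \le L]$$
by Proposition 2.40, $\epsilon_{\ell\ell'}$ are upper bounds on the risks, taken w.r.t. $\mathcal{P}_\ell$, $\mathcal{P}_{\ell'}$, of the detectors $\phi_{\ell\ell'}$:
$$\int_\Omega e^{-\phi_{\ell\ell'}(\omega)}P(d\omega) \le \epsilon_{\ell\ell'}\ \ \forall P \in \mathcal{P}_\ell\quad \&\quad \int_\Omega e^{\phi_{\ell\ell'}(\omega)}P(d\omega) \le \epsilon_{\ell\ell'}\ \ \forall P \in \mathcal{P}_{\ell'}.\qquad [1 \le \ell < \ell' \le L]$$
Setting $\phi_{\ell\ell'}(\cdot) = -\phi_{\ell'\ell}(\cdot)$ and $\epsilon_{\ell\ell'} = \epsilon_{\ell'\ell}$ when $L \ge \ell > \ell' \ge 1$ and $\phi_{\ell\ell}(\cdot) \equiv 0$, $\epsilon_{\ell\ell} = 1$, $1 \le \ell \le L$, we get a system of detectors and risks satisfying (2.80) and, consequently, can use these “building blocks” in the machinery developed so far for pairwise and multiple hypothesis testing from single and repeated observations (stationary, semi-stationary, and quasi-stationary).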
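When these building blocks are used with repeated observations, the quantity of practical interest is how many repetitions are needed to reach a target risk. The Python sketch below assumes that the risk bound of a detector under a $K$-repeated observation is the $K$-th power of its single-observation bound $\epsilon$; this assumption is consistent with the repetition counts quoted in the numerical example that follows. The function name is hypothetical.

```python
import math

def repetitions_needed(eps, target):
    """Smallest integer K with eps**K <= target, for 0 < eps < 1 and 0 < target < 1."""
    if not (0.0 < eps < 1.0 and 0.0 < target < 1.0):
        raise ValueError("eps and target must lie strictly between 0 and 1")
    return math.ceil(math.log(target) / math.log(eps))
```

For instance, a single-observation bound of 0.4346 and target risk 0.01 give $K = 6$, while 0.8825 gives $K = 37$.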
Numerical example. To get some impression of how Proposition 2.40 extends the grasp of our computation-friendly machinery of test design consider a toy problem as follows:
We are given an observation
$$\omega = Ax + \sigma A\,\mathrm{Diag}\{\sqrt{x_1}, ..., \sqrt{x_n}\}\xi, \tag{2.118}$$
where
• unknown signal $x$ is known to belong to a given convex compact subset $M$ of the interior of $\mathbf{R}^n_+$;
• $A$ is a given $n \times n$ matrix of rank $n$, $\sigma > 0$ is a given noise intensity, and $\xi \sim \mathcal{N}(0, I_n)$.
Our goal is to decide via a $K$-repeated version of observations (2.118) on the pair of hypotheses $x \in X_\chi$, $\chi = 1, 2$, where $X_1$, $X_2$ are given nonempty convex compact subsets of $M$.

Note that an essential novelty, as compared to the standard Gaussian o.s., is that now we deal with zero mean Gaussian noise with covariance matrix
$$\Theta(x) = \sigma^2A\,\mathrm{Diag}\{x\}A^T$$
depending on the true signal: the larger the signal, the greater the noise. We can easily process the situation in question utilizing the machinery developed in this section. Namely, let us set
$$\mathcal{H}_\chi = \mathbf{R}^n,\quad \mathcal{M}_\chi = \{(x, \mathrm{Diag}\{x\}) : x \in X_\chi\} \subset \mathbf{R}^n_+\times\mathbf{S}^n_+,\quad \Phi_\chi(h;x,\Xi) = h^TAx + \tfrac{\sigma^2}{2}h^T[A\Xi A^T]h :\ \mathcal{H}_\chi\times\mathcal{M}_\chi \to \mathbf{R}.\qquad [\chi = 1, 2]$$
It is immediately seen that for $\chi = 1, 2$, $\mathcal{H}, \mathcal{M}_\chi, \Phi_\chi$ is regular data, and that the distribution $P$ of observation (2.118) stemming from a signal $x \in X_\chi$ belongs to $\mathcal{S}[\mathcal{H}, \mathcal{M}_\chi, \Phi_\chi]$, so that we can use Proposition 2.40 to build an affine detector for the families $\mathcal{P}_\chi$, $\chi = 1, 2$, of distributions of observations (2.118) stemming from signals $x \in X_\chi$. The corresponding recipe boils down to the necessity to find a saddle point $(h_*;x_*,y_*)$ of the simple convex-concave function
$$\Psi(h;x,y) = \tfrac{1}{2}\left[h^TA(y - x) + \tfrac{\sigma^2}{2}h^TA\,\mathrm{Diag}\{x + y\}A^Th\right]$$
(min in $h \in \mathbf{R}^n$, max in $(x, y) \in X_1\times X_2$). Such a point clearly exists and is easily found, and gives rise to the affine detector
$$\phi_*(\omega) = h_*^T\omega + \underbrace{\tfrac{1}{4}\sigma^2h_*^TA\,\mathrm{Diag}\{x_* - y_*\}A^Th_* - \tfrac{1}{2}h_*^TA[x_* + y_*]}_{a}$$
such that
$$\mathrm{Risk}[\phi_*|\mathcal{P}_1,\mathcal{P}_2] \le \exp\left\{\tfrac{1}{2}\left[h_*^TA[y_* - x_*] + \tfrac{\sigma^2}{2}h_*^TA\,\mathrm{Diag}\{x_* + y_*\}A^Th_*\right]\right\}. \tag{2.119}$$
Note that we could also process the situation when defining the regular data as $\mathcal{H}$, $\mathcal{M}^+_\chi = X_\chi$, $\Phi^+_\chi$, $\chi = 1, 2$, where
$$\Phi^+_\chi(h;x) = h^TAx + \tfrac{\sigma^2\theta}{2}h^TAA^Th\qquad [\theta = \max_{x\in X_1\cup X_2}\|x\|_\infty],$$
which, basically, means passing from our actual observations (2.118) to the “more noisy” observations given by Gaussian o.s.
$$\omega = Ax + \eta,\quad \eta \sim \mathcal{N}(0, \sigma^2\theta AA^T). \tag{2.120}$$
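The sense in which (2.120) is “more noisy” than (2.118) is that the noise covariance $\sigma^2\theta AA^T$ dominates $\sigma^2A\,\mathrm{Diag}\{x\}A^T$ in the positive semidefinite order whenever $\|x\|_\infty \le \theta$ and $x \ge 0$. A small Python sketch of this check follows; the function names and the eigenvalue tolerance are assumptions made for illustration.

```python
import numpy as np

def signal_dependent_cov(A, x, sigma):
    """Covariance sigma^2 A Diag{x} A^T of the noise in (2.118)."""
    return sigma**2 * (A * x) @ A.T   # A * x scales column j of A by x_j

def dominating_cov(A, theta, sigma):
    """Covariance sigma^2 theta A A^T of the noise in (2.120)."""
    return sigma**2 * theta * (A @ A.T)

def is_psd(S, tol=1e-10):
    """Check positive semidefiniteness of a symmetric matrix up to a tolerance."""
    return bool(np.min(np.linalg.eigvalsh((S + S.T) / 2)) >= -tol)
```

Here `is_psd(dominating_cov(A, theta, sigma) - signal_dependent_cov(A, x, sigma))` is expected to hold for every nonnegative `x` with entries not exceeding `theta`.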
It is easily seen that the risk $\mathrm{Risk}[\phi_\#|\mathcal{P}_1,\mathcal{P}_2]$ of the detector $\phi_\#$ which is optimal for this Gaussian o.s. can be upper-bounded by the risk $\mathrm{Risk}[\phi_\#|\mathcal{P}_1^+,\mathcal{P}_2^+]$ known to us, where $\mathcal{P}_\chi^+$ is the family of distributions of observations (2.120) induced by signals $x \in X_\chi$. Note that $\mathrm{Risk}[\phi_\#|\mathcal{P}_1^+,\mathcal{P}_2^+]$ is seemingly the best risk bound available for us “within the realm of detector-based tests in simple o.s.’s.” The goal of the small numerical experiment we are about to report on is to understand how our new risk bound (2.119) compares to the “old” bound $\mathrm{Risk}[\phi_\#|\mathcal{P}_1^+,\mathcal{P}_2^+]$. We use $n = 16$,
$$X_1 = \left\{x \in \mathbf{R}^{16} : \begin{array}{l}0.001 \le x_1 \le \delta\\ 0.001 \le x_i \le 1,\ 2 \le i \le 16\end{array}\right\},\quad X_2 = \left\{x \in \mathbf{R}^{16} : \begin{array}{l}2\delta \le x_1 \le 1\\ 0.001 \le x_i \le 1,\ 2 \le i \le 16\end{array}\right\}$$
and $\sigma = 0.1$. The “separation parameter” $\delta$ is set to 0.1. Finally, the $16 \times 16$ matrix $A$ has condition number 100 (singular values $0.01^{(i-1)/15}$, $1 \le i \le 16$) and randomly oriented systems of left and right singular vectors. With this setup, a typical numerical result is as follows:
• the right-hand side in (2.119) is 0.4346, implying that with detector $\phi_*$, a 6-repeated observation is sufficient to decide on our two hypotheses with risk $\le 0.01$;
• the quantity $\mathrm{Risk}[\phi_\#|\mathcal{P}_1^+,\mathcal{P}_2^+]$ is 0.8825, meaning that with detector $\phi_\#$, we need at least a 37-repeated observation to guarantee risk $\le 0.01$.
When the separation parameter $\delta$ participating in the descriptions of $X_1$, $X_2$ is reduced to 0.01, the risks in question grow to 0.9201 and 0.9988, respectively (a 56-repeated observation to decide on the hypotheses with risk 0.01 when $\phi_*$ is used vs. a 3685-repeated observation needed when $\phi_\#$ is used). The bottom line is that the new developments can indeed improve significantly the performance of our inferences.

2.8.2.3 Sub-Gaussian and Gaussian cases
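For $\chi = 1, 2$, let $U_\chi$ be a nonempty closed convex set in $\mathbf{R}^d$, and $\mathcal{V}_\chi$ be a compact convex subset of the interior of the positive semidefinite cone $\mathbf{S}^d_+$. We assume that $U_1$ is compact. Setting
$$\mathcal{H}_\chi = \Omega = \mathbf{R}^d,\quad \mathcal{M}_\chi = U_\chi\times\mathcal{V}_\chi,\quad \Phi_\chi(h;\theta,\Theta) = \theta^Th + \tfrac{1}{2}h^T\Theta h :\ \mathcal{H}_\chi\times\mathcal{M}_\chi \to \mathbf{R},\ \chi = 1, 2, \tag{2.121}$$
we get two collections $(\mathcal{H}, \mathcal{M}_\chi, \Phi_\chi)$, $\chi = 1, 2$, of regular data. As we know from Section 2.8.1.2, for $\chi = 1, 2$, the families of distributions $\mathcal{S}[\mathbf{R}^d, \mathcal{M}_\chi, \Phi_\chi]$ contain the families $\mathcal{SG}[U_\chi, \mathcal{V}_\chi]$ of sub-Gaussian distributions on $\mathbf{R}^d$ with sub-Gaussianity parameters $(\theta,\Theta) \in U_\chi\times\mathcal{V}_\chi$ (see (2.106)), as well as families $\mathcal{G}[U_\chi, \mathcal{V}_\chi]$ of Gaussian distributions on $\mathbf{R}^d$ with parameters $(\theta,\Theta)$ (expectation and covariance matrix) running through $U_\chi\times\mathcal{V}_\chi$. Besides this, the pair of regular data in question clearly satisfies Condition A. Consequently, the test $\mathcal{T}_*^K$ given by the above construction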
as applied to the collections of regular data (2.121) is well defined and allows us to decide on hypotheses Hχ : P ∈ R[Rd , Uχ , Vχ ], χ = 1, 2, on the distribution P underlying the K-repeated observation $\omega^K$. The same test can also be used to decide on stricter hypotheses HχG , χ = 1, 2, stating that the observations ω1 , ..., ωK are i.i.d. and drawn from a Gaussian distribution P belonging to G[Uχ , Vχ ]. Our goal now is to analyze the situation in question in detail and to refine our conclusions on the risk of the test T∗1 when the Gaussian hypotheses HχG are considered and the situation is symmetric, that is, when V1 = V2 . Observe, first, that the convex-concave function Ψ from (2.115) in the current setting becomes
$$\Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2) = \tfrac12 h^T[\theta_2-\theta_1] + \tfrac14 h^T\Theta_1 h + \tfrac14 h^T\Theta_2 h. \qquad (2.122)$$
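We are interested in solutions to the saddle point problem
$$\min_{h\in\mathbf{R}^d}\ \max_{\substack{\theta_1\in U_1,\,\theta_2\in U_2\\ \Theta_1\in\mathcal{V}_1,\,\Theta_2\in\mathcal{V}_2}} \Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2) \qquad (2.123)$$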
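associated with the function (2.122). From the structure of Ψ and compactness of U1 , V1 , V2 , combined with the fact that Vχ , χ = 1, 2, are comprised of positive definite matrices, it immediately follows that saddle points do exist, and a saddle point $(h_*;\theta_1^*,\Theta_1^*,\theta_2^*,\Theta_2^*)$ satisfies the relations
$$\begin{array}{ll}
(a) & h_* = [\Theta_1^*+\Theta_2^*]^{-1}[\theta_1^*-\theta_2^*],\\
(b) & h_*^T(\theta_1-\theta_1^*) \ge 0\ \ \forall\theta_1\in U_1,\qquad h_*^T(\theta_2^*-\theta_2) \ge 0\ \ \forall\theta_2\in U_2,\\
(c) & h_*^T\Theta_1h_* \le h_*^T\Theta_1^*h_*\ \ \forall\Theta_1\in\mathcal{V}_1,\qquad h_*^T\Theta_2h_* \le h_*^T\Theta_2^*h_*\ \ \forall\Theta_2\in\mathcal{V}_2.
\end{array} \qquad (2.124)$$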
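From (2.124.a) it immediately follows that the affine detector φ∗ (·) and risk $\epsilon_\star$, as given by (2.116) and (2.117), are
$$\begin{array}{rcl}
\phi_*(\omega) &=& h_*^T[\omega-w_*] + \tfrac12 h_*^T[\Theta_1^*-\Theta_2^*]h_*,\qquad w_* = \tfrac12[\theta_1^*+\theta_2^*];\\
\epsilon_\star &=& \exp\{-\tfrac14[\theta_1^*-\theta_2^*]^T[\Theta_1^*+\Theta_2^*]^{-1}[\theta_1^*-\theta_2^*]\}\\
&=& \exp\{-\tfrac14 h_*^T[\Theta_1^*+\Theta_2^*]h_*\}.
\end{array} \qquad (2.125)$$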
Note that in the symmetric case (where V1 = V2 ), there always exists a saddle point of Ψ with $\Theta_1^* = \Theta_2^*$,²¹ and the test T∗1 associated with such a saddle point is quite transparent: it is the maximum likelihood test for two Gaussian distributions, $\mathcal{N}(\theta_1^*,\Theta_*)$ and $\mathcal{N}(\theta_2^*,\Theta_*)$, where $\Theta_*$ is the common value of $\Theta_1^*$ and $\Theta_2^*$. The bound $\epsilon_\star$ on the risk of the test is nothing but the Hellinger affinity of these two Gaussian distributions, or, equivalently,
$$\epsilon_\star = \exp\{-\tfrac18[\theta_1^*-\theta_2^*]^T\Theta_*^{-1}[\theta_1^*-\theta_2^*]\}.$$
21 Indeed, from (2.122) it follows that when V1 = V2 , the function Ψ(h; θ1 , Θ1 , θ2 , Θ2 ) is symmetric w.r.t. Θ1 , Θ2 , implying similar symmetry of the function $\underline{\Psi}(\theta_1,\Theta_1,\theta_2,\Theta_2) = \min_{h\in\mathcal{H}}\Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2)$. Since $\underline{\Psi}$ is concave, the set M of its maximizers over M1 × M2 (which, as we know, is nonempty) is symmetric w.r.t. the swap of Θ1 and Θ2 and is convex, implying that if (θ1 , Θ1 , θ2 , Θ2 ) ∈ M , then $(\theta_1, \tfrac12[\Theta_1+\Theta_2], \theta_2, \tfrac12[\Theta_1+\Theta_2]) \in M$ as well, and the latter point is the desired component of the saddle point of Ψ with Θ1 = Θ2 .
We arrive at the following result:
Proposition 2.41. In the symmetric sub-Gaussian case (i.e., in the case of (2.121) with V1 = V2 ), saddle point problem (2.122), (2.123) admits a saddle point of the form $(h_*;\theta_1^*,\Theta_*,\theta_2^*,\Theta_*)$, and the associated affine detector and its risk are given by
$$\begin{array}{rcl}
\phi_*(\omega) &=& h_*^T[\omega-w_*],\qquad w_* = \tfrac12[\theta_1^*+\theta_2^*];\\
\epsilon_\star &=& \exp\{-\tfrac18[\theta_1^*-\theta_2^*]^T\Theta_*^{-1}[\theta_1^*-\theta_2^*]\}.
\end{array}$$
As a result, when deciding, via $\omega^K$, on “sub-Gaussian hypotheses” Hχ , χ = 1, 2, the risk of the test T∗K associated with $\phi_*^{(K)}(\omega^K) := \sum_{t=1}^K \phi_*(\omega_t)$ is at most $\epsilon_\star^K$.
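The risk bound of Proposition 2.41 and the repetition counts it implies are easy to evaluate numerically. The following is a minimal sketch, not part of the original construction: it assumes, purely for simplicity, that the common matrix Θ∗ is diagonal (so that no linear solver is needed), and the function names are hypothetical. The second function returns the smallest K with ǫ⋆^K not exceeding a target risk; with the per-observation risks 0.4346, 0.8825, and 0.9201 reported in the numerical experiment of the previous subsection, it reproduces the counts 6, 37, and 56 quoted there for the target 0.01.

```python
import math

def hellinger_affinity_bound(theta1, theta2, theta_diag):
    """Risk bound exp(-1/8 * (theta1-theta2)^T Theta^{-1} (theta1-theta2)).

    Assumption for this sketch: Theta is diagonal with positive
    diagonal entries given in theta_diag.
    """
    quad = sum((a - b) ** 2 / v for a, b, v in zip(theta1, theta2, theta_diag))
    return math.exp(-quad / 8.0)

def repetitions_needed(eps, target_risk):
    """Smallest K such that eps**K <= target_risk, for 0 < eps < 1."""
    if not 0.0 < eps < 1.0:
        raise ValueError("the single-observation risk bound must lie in (0, 1)")
    return math.ceil(math.log(target_risk) / math.log(eps))

# Illustrative data (hypothetical): two means at distance 2, unit covariance.
eps_star = hellinger_affinity_bound([1.0, 0.0], [-1.0, 0.0], [1.0, 1.0])
k_needed = repetitions_needed(eps_star, 0.01)
```

The logarithmic formula is just the inequality ǫ⋆^K ≤ target solved for K; it makes explicit why a risk bound close to 1 translates into a very large number of repetitions.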
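In the symmetric single-observation Gaussian case, that is, when V1 = V2 and we apply the test T∗ = T∗1 to observation ω ≡ ω1 in order to decide on the hypotheses HχG , χ = 1, 2, the above risk bound can be improved:
Proposition 2.42. Consider the symmetric case V1 = V2 = V, let $(h_*;\theta_1^*,\Theta_1^*,\theta_2^*,\Theta_2^*)$ be the “symmetric” (that is, with $\Theta_1^*=\Theta_2^*=\Theta_*$) saddle point of function Ψ given by (2.122), and let φ∗ be the affine detector given by (2.124) and (2.125):
$$\phi_*(\omega) = h_*^T[\omega-w_*],\qquad h_* = \tfrac12\Theta_*^{-1}[\theta_1^*-\theta_2^*],\qquad w_* = \tfrac12[\theta_1^*+\theta_2^*].$$
Let also
$$\delta = \sqrt{h_*^T\Theta_*h_*} = \tfrac12\sqrt{[\theta_1^*-\theta_2^*]^T\Theta_*^{-1}[\theta_1^*-\theta_2^*]},$$
so that
$$\delta^2 = h_*^T[\theta_1^*-w_*] = h_*^T[w_*-\theta_2^*]\quad\text{and}\quad \epsilon_\star = \exp\{-\tfrac12\delta^2\}. \qquad (2.126)$$
Let, further, α ≤ δ², β ≤ δ². Then
$$\begin{array}{ll}
(a) & \forall(\theta\in U_1,\Theta\in\mathcal{V}):\ \mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{\phi_*(\omega)\le\alpha\} \le \mathrm{Erfc}(\delta-\alpha/\delta),\\
(b) & \forall(\theta\in U_2,\Theta\in\mathcal{V}):\ \mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{\phi_*(\omega)\ge-\beta\} \le \mathrm{Erfc}(\delta-\beta/\delta).
\end{array} \qquad (2.127)$$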
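In particular, when deciding, via a single observation ω, on Gaussian hypotheses HχG , χ = 1, 2, with HχG stating that ω ∼ N (θ, Θ) with (θ, Θ) ∈ Uχ × V, the risk of the test T∗1 associated with φ∗ is at most Erfc(δ).
Proof. Let us prove (a) (the proof of (b) is completely similar). For θ ∈ U1 , Θ ∈ V we have
$$\begin{array}{rcl}
\mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{\phi_*(\omega)\le\alpha\} &=& \mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{h_*^T[\omega-w_*]\le\alpha\}\\
&=& \mathrm{Prob}_{\xi\sim\mathcal{N}(0,I)}\{h_*^T[\theta+\Theta^{1/2}\xi-w_*]\le\alpha\}\\
&=& \mathrm{Prob}_{\xi\sim\mathcal{N}(0,I)}\{[\Theta^{1/2}h_*]^T\xi \le \alpha - h_*^T[\theta-w_*]\}\\
&& \quad[\text{where } h_*^T[\theta-w_*] \ge h_*^T[\theta_1^*-w_*] = \delta^2 \text{ by (2.124.b), (2.126)}]\\
&\le& \mathrm{Prob}_{\xi\sim\mathcal{N}(0,I)}\{[\Theta^{1/2}h_*]^T\xi \le \alpha-\delta^2\} = \mathrm{Erfc}([\delta^2-\alpha]/\|\Theta^{1/2}h_*\|_2)\\
&\le& \mathrm{Erfc}([\delta^2-\alpha]/\|\Theta_*^{1/2}h_*\|_2)\\
&& \quad[\text{due to } \delta^2-\alpha\ge0 \text{ and } h_*^T\Theta h_* \le h_*^T\Theta_*h_* \text{ by (2.124.c)}]\\
&=& \mathrm{Erfc}([\delta^2-\alpha]/\delta).
\end{array}$$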
The “in particular” part of the proposition is readily given by (2.127) as applied with α = β = 0. ✷ Note that the progress, as compared to our results on the minimum risk detectors for convex hypotheses in Gaussian o.s., is that we no longer assume that the covariance matrix is fixed once and for all. Now neither the mean nor the covariance matrix of the observed Gaussian random variable is known in advance. In this setting, the mean runs through a closed convex set (depending on the hypothesis), and the covariance runs, independently of the mean, through a given convex compact subset of the interior of the positive semidefinite cone, and this subset should be common to both hypotheses we are deciding upon.
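To see how much the Gaussian bound Erfc(δ) of Proposition 2.42 gains over the sub-Gaussian bound ǫ⋆ = exp{−δ²/2} of Proposition 2.41, one can tabulate both. The sketch below assumes that Erfc(s) in the text denotes the upper tail probability of the standard normal distribution, Prob{ξ ≥ s} for ξ ∼ N(0, 1), which is how it is used in the proof above; this differs from the complementary error function `math.erfc` of the Python standard library by a rescaling. The function name `gaussian_tail` is hypothetical.

```python
import math

def gaussian_tail(s):
    """Prob{xi >= s} for xi ~ N(0, 1).

    Assumption: this is what the text denotes by Erfc(s); it equals
    0.5 * math.erfc(s / sqrt(2)) in terms of the standard-library erfc.
    """
    return 0.5 * math.erfc(s / math.sqrt(2.0))

# Compare the two single-observation risk bounds for a few values of delta.
comparison = []
for delta in (0.5, 1.0, 2.0, 3.0):
    sub_gaussian_bound = math.exp(-0.5 * delta ** 2)   # Proposition 2.41
    gaussian_bound = gaussian_tail(delta)              # Proposition 2.42
    comparison.append((delta, sub_gaussian_bound, gaussian_bound))
    print(f"delta={delta:.1f}  exp(-delta^2/2)={sub_gaussian_bound:.4f}  "
          f"Erfc(delta)={gaussian_bound:.4f}")
```

For every δ ≥ 0 the normal tail is at most half of exp{−δ²/2}, so the refinement is never worse than a factor of 2 and becomes more pronounced as δ grows.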
2.9
BEYOND THE SCOPE OF AFFINE DETECTORS: LIFTING THE OBSERVATIONS
2.9.1
Motivation
The detectors considered in Section 2.8 were affine functions of observations. Note, however, that what an observation is depends, to some extent, on us. To give an instructive example, consider the Gaussian observation ζ = A[u; 1] + ξ ∈ Rd , where u is an unknown signal known to belong to a given set U ⊂ Rn , u ↦ A[u; 1] is a given affine mapping from Rn into the observation space Rd , and ξ is zero mean Gaussian observation noise with covariance matrix Θ known to belong to a given convex compact subset V of the interior of the positive semidefinite cone Sd+ . Treating the observation “as is,” a detector affine in the observation is affine in [u; ξ]. On the other hand, we can treat as our observation the image of the actual observation ζ under any deterministic mapping, e.g., the “quadratic lifting” ζ ↦ (ζ, ζζ T ). A detector affine in the new observation is quadratic in u and ξ, so we get access to a wider set of detectors as compared to those affine in ζ! At first glance, applying our “affine detectors” machinery to appropriate “nonlinear liftings” of actual observations we can handle quite complicated detectors, e.g., polynomial, of arbitrary degree, in ζ. The bottleneck here stems from the fact that in general it is difficult to “cover” the distribution of a “nonlinearly lifted” observation ζ (even as simple as the Gaussian observation above) by an explicitly defined family of regular distributions, and such a “covering” is what we need in order to apply our affine detector machinery to the lifted observation. It turns out, however, that in some important cases the desired covering is achievable. We are about to demonstrate that this takes place in the case of the quadratic lifting ζ ↦ (ζ, ζζ T ) of (sub)Gaussian observation ζ, and the resulting quadratic detectors allow us to handle some important inference problems which are far beyond the grasp of “genuinely affine” detectors.
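The claim that a detector affine in the lifted observation (ζ, ζζ T ) is a quadratic function of ζ can be checked directly. The sketch below is only an illustration with hypothetical function names; it represents vectors as Python lists and symmetric matrices as lists of rows, and uses the weighting hT ζ + ½ Tr(H ζζ T ) + κ, which matches the quadratic detectors considered later in this section.

```python
def quadratic_lift(z):
    """Return the lifted observation (z, z z^T) as (list, list of rows)."""
    return list(z), [[zi * zj for zj in z] for zi in z]

def affine_in_lift(h, H, kappa, z):
    """Evaluate h^T z + 1/2 Tr(H z z^T) + kappa on the lifted observation."""
    z_lin, z_quad = quadratic_lift(z)
    n = len(z_lin)
    linear = sum(hi * zi for hi, zi in zip(h, z_lin))
    trace = sum(H[i][j] * z_quad[j][i] for i in range(n) for j in range(n))
    return linear + 0.5 * trace + kappa

def quadratic_in_z(h, H, kappa, z):
    """Evaluate h^T z + 1/2 z^T H z + kappa directly, without lifting."""
    n = len(z)
    quad = sum(z[i] * H[i][j] * z[j] for i in range(n) for j in range(n))
    return sum(hi * zi for hi, zi in zip(h, z)) + 0.5 * quad + kappa

# Illustrative data (hypothetical values).
h = [1.0, -2.0]
H = [[2.0, 1.0], [1.0, 3.0]]
z = [3.0, -1.0]
value_lifted = affine_in_lift(h, H, 0.5, z)
value_direct = quadratic_in_z(h, H, 0.5, z)
```

Both evaluations agree because Tr(H ζζ T ) = ζ T Hζ; the lifting changes nothing about the data, only about which functions of it count as “affine.”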
2.9.2
Quadratic lifting: Gaussian case
Given a positive integer d, we define $E^d$ as the linear space $\mathbf{R}^d\times\mathbf{S}^d$ equipped with the inner product
$$\langle (z,S),(z',S')\rangle = z^Tz' + \tfrac12\mathrm{Tr}(SS').$$
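Note that the quadratic lifting $z\mapsto(z,zz^T)$ maps the space $\mathbf{R}^d$ into $E^d$. In the sequel, an instrumental role is played by the following result.
Proposition 2.43. (i) Assume we are given
• a nonempty and bounded subset U of Rn ;
• a convex compact set V contained in the interior of the cone Sd+ of positive semidefinite d × d matrices;
• a d × (n + 1) matrix A.
These data specify the family $\mathcal{G}_A[U,\mathcal{V}]$ of distributions of quadratic liftings $(\zeta,\zeta\zeta^T)$ of Gaussian random vectors $\zeta\sim\mathcal{N}(A[u;1],\Theta)$ stemming from u ∈ U and Θ ∈ V. Let us select some
1. γ ∈ (0, 1),
2. convex compact subset $\mathcal{Z}$ of the set $\mathcal{Z}^n = \{Z\in\mathbf{S}^{n+1}: Z\succeq0,\ Z_{n+1,n+1}=1\}$ such that
$$Z(u) := [u;1][u;1]^T \in \mathcal{Z}\quad\forall u\in U, \qquad (2.128)$$
3. positive definite d × d matrix $\Theta_*\in\mathbf{S}^d_+$ and δ ∈ [0, 2] such that
$$\Theta_*\succeq\Theta\ \ \forall\Theta\in\mathcal{V}\quad\&\quad \|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\ \ \forall\Theta\in\mathcal{V}, \qquad (2.129)$$
where ‖ · ‖ is the spectral norm,²² and set
$$\mathcal{H} = \mathcal{H}_\gamma := \{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},$$
$$\begin{array}{rcl}
\Phi_{A,\mathcal{Z}}(h,H;\Theta) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2}) + \tfrac12\mathrm{Tr}([\Theta-\Theta_*]H) + \dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2\\
&& +\ \tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H & h\\ h^T & \end{bmatrix} + [H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\right]B\right):\ \mathcal{H}\times\mathcal{V}\to\mathbf{R},
\end{array} \qquad (2.130)$$
where B is given by
$$B = \begin{bmatrix}A\\ [0,...,0,1]\end{bmatrix}, \qquad (2.131)$$
the function
$$\phi_{\mathcal{Z}}(Y) := \max_{Z\in\mathcal{Z}}\mathrm{Tr}(ZY) \qquad (2.132)$$
is the support function of $\mathcal{Z}$, and $\|\cdot\|_F$ is the Frobenius norm.
Function $\Phi_{A,\mathcal{Z}}$ is continuous on its domain, convex in $(h,H)\in\mathcal{H}$ and concave in $\Theta\in\mathcal{V}$, so that $(\mathcal{H},\mathcal{V},\Phi_{A,\mathcal{Z}})$ is regular data. Besides this,
(#) Whenever u ∈ Rn is such that $[u;1][u;1]^T\in\mathcal{Z}$ and Θ ∈ V, the Gaussian random vector $\zeta\sim\mathcal{N}(A[u;1],\Theta)$ satisfies the relation
$$\forall(h,H)\in\mathcal{H}:\ \ln\left(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],\Theta)}\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right) \le \Phi_{A,\mathcal{Z}}(h,H;\Theta). \qquad (2.133)$$
The latter relation combines with (2.128) to imply that
$$\mathcal{G}_A[U,\mathcal{V}]\subset\mathcal{S}[\mathcal{H},\mathcal{V},\Phi_{A,\mathcal{Z}}].$$
In addition, $\Phi_{A,\mathcal{Z}}$ is coercive in (h, H): $\Phi_{A,\mathcal{Z}}(h_i,H_i;\Theta)\to+\infty$ as i → ∞ whenever Θ ∈ V, $(h_i,H_i)\in\mathcal{H}$, and $\|(h_i,H_i)\|\to\infty$, i → ∞.
22 It is easily seen that with δ = 2, the second relation in (2.129) is satisfied for all Θ such that $0\preceq\Theta\preceq\Theta_*$, so that the restriction δ ≤ 2 is w.l.o.g.
(ii) Let two collections of entities from (i), $(\mathcal{V}_\chi,\Theta_*^{(\chi)},\delta_\chi,\gamma_\chi,A_\chi,\mathcal{Z}_\chi)$, χ = 1, 2, with common d be given, giving rise to the sets $\mathcal{H}_\chi$, matrices $B_\chi$, and functions $\Phi_{A_\chi,\mathcal{Z}_\chi}(h,H;\Theta)$, χ = 1, 2. These collections specify the families of normal distributions
$$\mathcal{G}_\chi = \{\mathcal{N}(v,\Theta):\ \Theta\in\mathcal{V}_\chi\ \&\ \exists u\in U: v = A_\chi[u;1]\},\quad\chi=1,2.$$
Consider the convex-concave saddle point problem
$$SV = \min_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\ \underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,-H;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h,H;\Theta_2)\right]}_{\Phi(h,H;\Theta_1,\Theta_2)}. \qquad (2.134)$$
A saddle point $(H_*,h_*;\Theta_1^*,\Theta_2^*)$ in this problem does exist, and the induced quadratic detector
$$\phi_*(\omega) = \tfrac12\omega^TH_*\omega + h_*^T\omega + \underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)-\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)\right]}_{a}, \qquad (2.135)$$
when applied to the families of Gaussian distributions $\mathcal{G}_\chi$, χ = 1, 2, has the risk $\mathrm{Risk}[\phi_*|\mathcal{G}_1,\mathcal{G}_2]\le\epsilon_\star := e^{SV}$, that is,
$$\begin{array}{ll}
(a) & \int_{\mathbf{R}^d}e^{-\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{G}_1,\\
(b) & \int_{\mathbf{R}^d}e^{\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{G}_2.
\end{array} \qquad (2.136)$$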
For proof, see Section 2.11.5.
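For intuition about the quantity bounded in (2.133), it may help to look at dimension d = 1, where the left-hand side has a classical closed form: for ζ ∼ N(μ, θ) and θH < 1, ln E exp{½Hζ² + hζ} = −½ ln(1 − θH) + hμ + ½Hμ² + ½(h + Hμ)²θ/(1 − θH). The sketch below only checks this scalar identity numerically by the trapezoidal rule; it does not evaluate ΦA,Z itself, and the names `mu`, `theta`, and both function names are illustrative assumptions rather than notation from the text.

```python
import math

def log_mgf_closed_form(mu, theta, H, h):
    """ln E exp(0.5*H*z^2 + h*z) for z ~ N(mu, theta); requires theta*H < 1."""
    if theta * H >= 1.0:
        raise ValueError("the expectation is finite only when theta*H < 1")
    g = h + H * mu
    return (-0.5 * math.log(1.0 - theta * H)
            + h * mu + 0.5 * H * mu ** 2
            + 0.5 * g ** 2 * theta / (1.0 - theta * H))

def log_mgf_quadrature(mu, theta, H, h, half_width=15.0, steps=20000):
    """Same quantity via the trapezoidal rule on [mu-half_width, mu+half_width]."""
    a = mu - half_width
    dz = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        z = a + k * dz
        weight = 0.5 if k in (0, steps) else 1.0
        log_density = (-0.5 * (z - mu) ** 2 / theta
                       - 0.5 * math.log(2.0 * math.pi * theta))
        total += weight * math.exp(0.5 * H * z * z + h * z + log_density)
    return math.log(total * dz)

# Illustrative parameters (hypothetical): theta*H = 0.4 < 1.
closed = log_mgf_closed_form(0.3, 0.5, 0.8, -0.2)
numeric = log_mgf_quadrature(0.3, 0.5, 0.8, -0.2)
```

The restriction θH < 1 is the scalar counterpart of the requirement that H stay below a multiple of Θ∗⁻¹, which is what the set Hγ enforces with the margin γ < 1.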
Remark 2.44. Note that the computational effort to solve (2.134) reduces dramatically in the “easy case” of the situation described in item (ii) of Proposition 2.43 where
• the observations are direct, meaning that $A_\chi[u;1]\equiv u$, $u\in\mathbf{R}^d$, χ = 1, 2;
• the sets $\mathcal{V}_\chi$ are comprised of positive definite diagonal matrices, and matrices $\Theta_*^{(\chi)}$ are diagonal as well, χ = 1, 2;
• the sets $\mathcal{Z}_\chi$, χ = 1, 2, are convex compact sets of the form
$$\mathcal{Z}_\chi = \{Z\in\mathbf{S}^{d+1}_+:\ Z\succeq0,\ \mathrm{Tr}(ZQ_j^\chi)\le q_j^\chi,\ 1\le j\le J_\chi\}$$
with diagonal matrices $Q_j^\chi$,²³ and these sets intersect the interior of the positive semidefinite cone $\mathbf{S}^{d+1}_+$.
In this case, the convex-concave saddle point problem (2.134) admits a saddle point $(h_*,H_*;\Theta_1^*,\Theta_2^*)$ where $h_*=0$ and $H_*$ is diagonal.
Justifying the remark. In the easy case, we have $B_\chi = I_{d+1}$ and therefore
$$\begin{array}{rcl}
M_\chi(h,H) &:=& B_\chi^T\left[\begin{bmatrix}H & h\\ h^T & \end{bmatrix} + [H,h]^T\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}[H,h]\right]B_\chi\\
&=& \begin{bmatrix} H + H\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}H & h + H\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}h\\ h^T + h^T\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}H & h^T\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}h\end{bmatrix}
\end{array}$$
and
$$\begin{array}{rcl}
\phi_{\mathcal{Z}_\chi}(Z) &=& \max_W\left\{\mathrm{Tr}(ZW):\ W\succeq0,\ \mathrm{Tr}(WQ_j^\chi)\le q_j^\chi,\ 1\le j\le J_\chi\right\}\\
&=& \min_\lambda\left\{\sum_jq_j^\chi\lambda_j:\ \lambda\ge0,\ Z\preceq\sum_j\lambda_jQ_j^\chi\right\},
\end{array}$$
where the last equality is due to semidefinite duality.²⁴ From the second representation of $\phi_{\mathcal{Z}_\chi}(\cdot)$ and the fact that all $Q_j^\chi$ are diagonal it follows that $\phi_{\mathcal{Z}_\chi}(M_\chi(-h,H)) = \phi_{\mathcal{Z}_\chi}(M_\chi(h,H))$ (indeed, with diagonal $Q_j^\chi$, if λ is feasible for the minimization problem participating in the representation when $Z = M_\chi(h,H)$, it clearly remains feasible when Z is replaced with $M_\chi(-h,H)$). This, in turn, combines straightforwardly with (2.130) to imply that when replacing $h_*$ with 0 in a saddle point $(h_*,H_*;\Theta_1^*,\Theta_2^*)$ of (2.134), we end up with another saddle point of (2.134). In other words, when solving (2.134), we can from the very beginning set h to 0, thus converting (2.134) into the convex-concave saddle point problem
$$SV = \min_{H:(0,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\Phi(0,H;\Theta_1,\Theta_2). \qquad (2.137)$$
Taking into account that we are in the case where all matrices from the sets $\mathcal{V}_\chi$, like the matrices $\Theta_*^{(\chi)}$ and all the matrices $Q_j^\chi$, χ = 1, 2, are diagonal, it is immediate to verify that $\Phi(0,H;\Theta_1,\Theta_2) = \Phi(0,EHE;\Theta_1,\Theta_2)$ for any d × d diagonal matrix E with diagonal entries ±1. Due to convexity-concavity of Φ this implies that (2.137) admits a saddle point $(0,H_*;\Theta_1^*,\Theta_2^*)$ with $H_*$ invariant w.r.t. transformations $H_*\mapsto EH_*E$ with the above E, that is, with diagonal $H_*$, as claimed. ✷
2.9.3
Quadratic lifting—Does it help?
Assume that for χ = 1, 2, we are given
• affine mappings $u\mapsto\mathcal{A}_\chi(u) = A_\chi[u;1]:\ \mathbf{R}^{n_\chi}\to\mathbf{R}^d$,
• nonempty convex compact sets $U_\chi\subset\mathbf{R}^{n_\chi}$,
• nonempty convex compact sets $\mathcal{V}_\chi\subset\mathrm{int}\,\mathbf{S}^d_+$.
These data define families $\mathcal{G}_\chi$ of Gaussian distributions on Rd : $\mathcal{G}_\chi$ is comprised of all distributions $\mathcal{N}(\mathcal{A}_\chi(u),\Theta)$ with $u\in U_\chi$ and $\Theta\in\mathcal{V}_\chi$. The data define also families $\mathcal{SG}_\chi$ of sub-Gaussian distributions on Rd : $\mathcal{SG}_\chi$ is comprised of all sub-Gaussian distributions with parameters $(\mathcal{A}_\chi(u),\Theta)$ with $(u,\Theta)\in U_\chi\times\mathcal{V}_\chi$. Assume we observe random variable ζ ∈ Rd drawn from a distribution P known to belong to $\mathcal{G}_1\cup\mathcal{G}_2$, and our goal is to decide from a stationary K-repeated version of our observation on the pair of hypotheses $H_\chi: P\in\mathcal{G}_\chi$, χ = 1, 2; we refer to this situation as the Gaussian case, and we assume from now on that we are in this case.²⁵ At present, we have developed two approaches to building detector-based tests for H1 , H2 :
A. Utilizing the affine in ζ detector $\phi_{\mathrm{aff}}$ given by a solution to the saddle point problem (see (2.122), (2.123) and set $\theta_\chi = \mathcal{A}_\chi(u_\chi)$ with $u_\chi$ running through $U_\chi$)
$$\mathrm{SadVal}_{\mathrm{aff}} = \min_{h\in\mathbf{R}^d}\ \max_{\substack{u_1\in U_1,\,u_2\in U_2\\ \Theta_1\in\mathcal{V}_1,\,\Theta_2\in\mathcal{V}_2}}\ \tfrac12\left[h^T[\mathcal{A}_2(u_2)-\mathcal{A}_1(u_1)] + \tfrac12h^T[\Theta_1+\Theta_2]h\right];$$
this detector satisfies the risk bound $\mathrm{Risk}[\phi_{\mathrm{aff}}|\mathcal{G}_1,\mathcal{G}_2]\le\exp\{\mathrm{SadVal}_{\mathrm{aff}}\}$.
Q. Utilizing the quadratic in ζ detector $\phi_{\mathrm{lift}}$ given by Proposition 2.43.ii, with the risk bound $\mathrm{Risk}[\phi_{\mathrm{lift}}|\mathcal{G}_1,\mathcal{G}_2]\le\exp\{\mathrm{SadVal}_{\mathrm{lift}}\}$, with $\mathrm{SadVal}_{\mathrm{lift}}$ given by (2.134).
A natural question is which of these options results in a better risk bound. Note that we cannot just say “clearly, the second option is better, since there are more quadratic detectors than affine ones”; the difficulty is that the key relation (2.133), in the context of Proposition 2.43, is an inequality rather than an equality.²⁶ We are about to show that under reasonable assumptions, the second option indeed is better:
Proposition 2.45. In the situation in question, assume that the sets $\mathcal{V}_\chi$, χ = 1, 2, contain the ⪰-largest elements, and that these elements are taken as the matrices $\Theta_*^{(\chi)}$ participating in Proposition 2.43.ii. Let, further, the convex compact sets $\mathcal{Z}_\chi$ participating in Proposition 2.43.ii satisfy
$$\mathcal{Z}_\chi\subset\bar{\mathcal{Z}}_\chi := \left\{Z = \begin{bmatrix}W & u\\ u^T & 1\end{bmatrix}\succeq0,\ u\in U_\chi\right\} \qquad (2.138)$$
(this assumption does not restrict generality, since $\bar{\mathcal{Z}}_\chi$ is, along with $U_\chi$, a closed convex set which clearly contains all matrices $[u;1][u;1]^T$ with $u\in U_\chi$). Then
$$\mathrm{SadVal}_{\mathrm{lift}}\le\mathrm{SadVal}_{\mathrm{aff}}, \qquad (2.139)$$
that is, option Q is at least as efficient as option A.
23 In terms of the sets $U_\chi$, this assumption means that the latter sets are given by linear inequalities on the squares of entries in u.
24 See Section 4.1 (or [187, Section 7.1]) for more details.
25 It is easily seen that what follows can be straightforwardly extended to the sub-Gaussian case, where the hypotheses we would decide upon state that $P\in\mathcal{SG}_\chi$.
26 One cannot make (2.133) an equality by redefining the right-hand side function: it would lose the convexity-concavity properties required in our context.
ρ      σ1   σ2   unrestricted H and h   H = 0   h = 0
0.5    2    2    0.31                   0.31    1.00
0.5    1    4    0.24                   0.39    0.62
0.01   1    4    0.41                   1.00    0.41
Table 2.2: Risk of quadratic detector $\phi(\zeta) = h^T\zeta + \tfrac12\zeta^TH\zeta + \kappa$.
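Proof. Let $A_\chi = [\bar A_\chi, a_\chi]$. Looking at (2.122) (where one should substitute $\theta_\chi = \mathcal{A}_\chi(u_\chi)$ with $u_\chi$ running through $U_\chi$) and taking into account that $\Theta_\chi\preceq\Theta_*^{(\chi)}\in\mathcal{V}_\chi$ when $\Theta_\chi\in\mathcal{V}_\chi$, we conclude that
$$\mathrm{SadVal}_{\mathrm{aff}} = \min_h\ \max_{u_1\in U_1,u_2\in U_2}\ \tfrac12\left[h^T[\bar A_2u_2-\bar A_1u_1+a_2-a_1] + \tfrac12h^T\left[\Theta_*^{(1)}+\Theta_*^{(2)}\right]h\right]. \qquad (2.140)$$
At the same time, we have by Proposition 2.43.ii:
$$\begin{array}{rcl}
\mathrm{SadVal}_{\mathrm{lift}} &=& \min\limits_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max\limits_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\ \tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,-H;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h,H;\Theta_2)\right]\\
&\le& \min\limits_{h\in\mathbf{R}^d}\ \max\limits_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\ \tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,0;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h,0;\Theta_2)\right]\\
&=& \min\limits_{h\in\mathbf{R}^d}\ \max\limits_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\ \tfrac12\left[\tfrac12\max\limits_{Z_1\in\mathcal{Z}_1}\mathrm{Tr}\left(Z_1\begin{bmatrix} & -\bar A_1^Th\\ -h^T\bar A_1 & -2h^Ta_1+h^T\Theta_*^{(1)}h\end{bmatrix}\right)\right.\\
&& \qquad\left. +\ \tfrac12\max\limits_{Z_2\in\mathcal{Z}_2}\mathrm{Tr}\left(Z_2\begin{bmatrix} & \bar A_2^Th\\ h^T\bar A_2 & 2h^Ta_2+h^T\Theta_*^{(2)}h\end{bmatrix}\right)\right]\\
&& \qquad[\text{by direct computation utilizing (2.130)}]\\
&\le& \min\limits_{h\in\mathbf{R}^d}\ \tfrac12\left[\tfrac12\max\limits_{u_1\in U_1}\left[-2u_1^T\bar A_1^Th-2a_1^Th+h^T\Theta_*^{(1)}h\right] + \tfrac12\max\limits_{u_2\in U_2}\left[2u_2^T\bar A_2^Th+2a_2^Th+h^T\Theta_*^{(2)}h\right]\right]\\
&& \qquad[\text{due to (2.138)}]\\
&=& \mathrm{SadVal}_{\mathrm{aff}},
\end{array}$$
where the concluding equality is due to (2.140). ✷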
Numerical illustration. To get an impression of the performance of quadratic detectors as compared to affine ones under the premise of Proposition 2.45, we present here the results of an experiment where $U_1 = U_1^\rho = \{u\in\mathbf{R}^{12}: u_i\ge\rho,\ 1\le i\le12\}$, $U_2 = U_2^\rho = -U_1^\rho$, $A_1 = A_2\in\mathbf{R}^{8\times13}$, and $\mathcal{V}_\chi = \{\Theta_*^{(\chi)} = \sigma_\chi^2I_8\}$ are singletons. The risks of affine, quadratic and “purely quadratic” (with h set to 0) detectors on the associated families $\mathcal{G}_1$, $\mathcal{G}_2$ are given in Table 2.2. We see that
• when deciding on families of Gaussian distributions with a common covariance matrix and expectations varying in the convex sets associated with the families, passing from affine detectors described by Proposition 2.41 to quadratic detectors does not affect the risk (first row in the table). This should be expected: we are in the scope of Gaussian o.s., where minimum risk affine detectors are optimal among all possible detectors.
• When deciding on families of Gaussian distributions in the case where distributions from different families can have close expectations (third row in the table), affine detectors are useless, while the quadratic ones are not, provided that $\Theta_*^{(1)}$ differs from $\Theta_*^{(2)}$. This is how it should be: we are in the case where the first moments of the distribution of observation bear no definitive information on the family to which this distribution belongs, making affine detectors useless. In contrast, quadratic detectors are able to utilize information (valuable when $\Theta_*^{(1)}\ne\Theta_*^{(2)}$) “stored” in the second moments of the observation.
• “In general” (second row in the table), both affine and purely quadratic components in a quadratic detector are useful; suppressing one of them may significantly increase the attainable risk.
2.9.4
Quadratic lifting: Sub-Gaussian case
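The sub-Gaussian version of Proposition 2.43 is as follows:
Proposition 2.46. (i) Assume we are given
• a nonempty and bounded subset U of Rn ;
• a convex compact set V contained in the interior of the cone Sd+ of positive semidefinite d × d matrices;
• a d × (n + 1) matrix A.
These data specify the family $\mathcal{SG}_A[U,\mathcal{V}]$ of distributions of quadratic liftings $(\zeta,\zeta\zeta^T)$ of sub-Gaussian random vectors ζ with sub-Gaussianity parameters A[u; 1], Θ stemming from u ∈ U and Θ ∈ V. Let us select some
1. reals γ, γ+ such that 0 < γ < γ+ < 1,
2. convex compact subset $\mathcal{Z}$ of the set $\mathcal{Z}^n = \{Z\in\mathbf{S}^{n+1}: Z\succeq0,\ Z_{n+1,n+1}=1\}$ such that relation (2.128) takes place,
3. positive definite d × d matrix $\Theta_*\in\mathbf{S}^d_+$ and δ ∈ [0, 2] such that (2.129) takes place.
These data specify the closed convex sets
$$\begin{array}{rcl}
\mathcal{H} &=& \mathcal{H}^\gamma := \{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
\widehat{\mathcal{H}} &=& \widehat{\mathcal{H}}^{\gamma,\gamma^+} = \left\{(h,H,G)\in\mathbf{R}^d\times\mathbf{S}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1},\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ H\preceq G\right\}
\end{array}$$
and the functions
$$\begin{array}{rcl}
\Psi_{A,\mathcal{Z}}(h,H,G) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})\\
&& +\ \tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H & h\\ h^T & \end{bmatrix} + [H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right):\ \widehat{\mathcal{H}}\times\mathcal{Z}\to\mathbf{R},\\
\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2}) + \tfrac12\mathrm{Tr}([\Theta-\Theta_*]G) + \dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\
&& +\ \tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H & h\\ h^T & \end{bmatrix} + [H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right):\ \widehat{\mathcal{H}}\times\{0\preceq\Theta\preceq\Theta_*\}\to\mathbf{R},
\end{array} \qquad (2.141)$$
where B is given by (2.131) and $\phi_{\mathcal{Z}}(\cdot)$ is the support function of $\mathcal{Z}$ given by (2.132), along with
$$\begin{array}{rcl}
\Phi_{A,\mathcal{Z}}(h,H) &=& \min\limits_G\left\{\Psi_{A,\mathcal{Z}}(h,H,G):\ (h,H,G)\in\widehat{\mathcal{H}}\right\}:\ \mathcal{H}\to\mathbf{R},\\
\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta) &=& \min\limits_G\left\{\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta):\ (h,H,G)\in\widehat{\mathcal{H}}\right\}:\ \mathcal{H}\times\{0\preceq\Theta\preceq\Theta_*\}\to\mathbf{R}.
\end{array}$$
$\Phi_{A,\mathcal{Z}}(h,H)$ is convex and continuous on its domain, and $\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta)$ is continuous on its domain, convex in $(h,H)\in\mathcal{H}$ and concave in $\Theta\in\{0\preceq\Theta\preceq\Theta_*\}$. Besides this,
(##) Whenever u ∈ Rn is such that $[u;1][u;1]^T\in\mathcal{Z}$ and Θ ∈ V, the sub-Gaussian random vector ζ, with parameters (A[u; 1], Θ), satisfies the relation
$$\forall(h,H)\in\mathcal{H}:\quad\begin{array}{ll}
(a) & \ln\left(\mathbf{E}_\zeta\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right)\le\Phi_{A,\mathcal{Z}}(h,H),\\
(b) & \ln\left(\mathbf{E}_\zeta\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right)\le\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta),
\end{array} \qquad (2.142)$$
which combines with (2.128) to imply that
$$\mathcal{SG}_A[U,\mathcal{V}]\subset\mathcal{S}[\mathcal{H},\mathcal{V},\Phi_{A,\mathcal{Z}}]\quad\&\quad\mathcal{SG}_A[U,\mathcal{V}]\subset\mathcal{S}[\mathcal{H},\mathcal{V},\Phi^\delta_{A,\mathcal{Z}}]. \qquad (2.143)$$
In addition, $\Phi_{A,\mathcal{Z}}$ and $\Phi^\delta_{A,\mathcal{Z}}$ are coercive in (h, H): $\Phi_{A,\mathcal{Z}}(h_i,H_i)\to+\infty$ and $\Phi^\delta_{A,\mathcal{Z}}(h_i,H_i;\Theta)\to+\infty$ as i → ∞ whenever Θ ∈ V, $(h_i,H_i)\in\mathcal{H}$, and $\|(h_i,H_i)\|\to\infty$, i → ∞.
(ii) Let two collections of data from (i): $(\mathcal{V}_\chi,\Theta_*^{(\chi)},\delta_\chi,\gamma_\chi,\gamma_\chi^+,A_\chi,\mathcal{Z}_\chi)$, χ = 1, 2, with common d be given, giving rise to the sets $\mathcal{H}_\chi$, matrices $B_\chi$, and functions $\Phi_{A_\chi,\mathcal{Z}_\chi}(h,H)$, $\Phi^{\delta_\chi}_{A_\chi,\mathcal{Z}_\chi}(h,H;\Theta)$, χ = 1, 2. These collections specify the families $\mathcal{SG}_\chi = \mathcal{SG}_{A_\chi}[U_\chi,\mathcal{V}_\chi]$ of sub-Gaussian distributions. Consider the convex-concave saddle point problem
$$SV = \min_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\ \underbrace{\tfrac12\left[\Phi^{\delta_1}_{A_1,\mathcal{Z}_1}(-h,-H;\Theta_1)+\Phi^{\delta_2}_{A_2,\mathcal{Z}_2}(h,H;\Theta_2)\right]}_{\Phi^{\delta_1,\delta_2}(h,H;\Theta_1,\Theta_2)}. \qquad (2.144)$$
A saddle point $(H_*,h_*;\Theta_1^*,\Theta_2^*)$ in this problem does exist, and the induced quadratic detector
$$\phi_*(\omega) = \tfrac12\omega^TH_*\omega + h_*^T\omega + \underbrace{\tfrac12\left[\Phi^{\delta_1}_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)-\Phi^{\delta_2}_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)\right]}_{a},$$
when applied to the families of sub-Gaussian distributions $\mathcal{SG}_\chi$, χ = 1, 2, has the risk $\mathrm{Risk}[\phi_*|\mathcal{SG}_1,\mathcal{SG}_2]\le\epsilon_\star := e^{SV}$. As a result,
$$\begin{array}{ll}
(a) & \int_{\mathbf{R}^d}e^{-\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{SG}_1,\\
(b) & \int_{\mathbf{R}^d}e^{\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{SG}_2.
\end{array}$$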
Similarly, the convex minimization problem
$$\mathrm{Opt} = \min_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,-H)+\Phi_{A_2,\mathcal{Z}_2}(h,H)\right]}_{\Phi(h,H)} \qquad (2.145)$$
is solvable, and the quadratic detector induced by its optimal solution $(h_*,H_*)$,
$$\phi_*(\omega) = \tfrac12\omega^TH_*\omega + h_*^T\omega + \underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*)-\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*)\right]}_{a}, \qquad (2.146)$$
when applied to the families of sub-Gaussian distributions $\mathcal{SG}_\chi$, χ = 1, 2, has the risk $\mathrm{Risk}[\phi_*|\mathcal{SG}_1,\mathcal{SG}_2]\le\epsilon_\star := e^{\mathrm{Opt}}$, so that relation (2.145) takes place for the φ∗ and $\epsilon_\star$ just defined.
For proof, see Section 2.11.6.
Remark 2.47. Proposition 2.46 offers two options for building quadratic detectors for the families $\mathcal{SG}_1$, $\mathcal{SG}_2$, those based on the saddle point of (2.144) and on the optimal solution to (2.145). Inspecting the proof, the number of options can be increased to 4: we can replace any of the functions $\Phi^{\delta_\chi}_{A_\chi,\mathcal{Z}_\chi}$, χ = 1, 2 (or both these functions simultaneously), with $\Phi_{A_\chi,\mathcal{Z}_\chi}$. The second of the original two options is exactly what we get when replacing both $\Phi^{\delta_\chi}_{A_\chi,\mathcal{Z}_\chi}$, χ = 1, 2, with $\Phi_{A_\chi,\mathcal{Z}_\chi}$. It is easily seen that depending on the data, each of these four options can be the best, that is, can result in the smallest risk bound. Thus, it makes sense to keep all these options in mind and to use the one which, under the circumstances, results in the best risk bound. Note that the risk bounds are efficiently computable, so that identifying the best option is easy.
2.9.5
Generic application: Quadratically constrained hypotheses
Propositions 2.43 and 2.46 operate with Gaussian/sub-Gaussian observations ζ with matrix parameters Θ running through convex compact subsets V of int Sd+ , and means of the form A[u; 1], with “signals” u running through given sets U ⊂ Rn . The constructions, however, involved additional entities: convex compact sets $\mathcal{Z}\subset\mathcal{Z}^n := \{Z\in\mathbf{S}^{n+1}_+: Z_{n+1,n+1}=1\}$ containing quadratic liftings $[u;1][u;1]^T$ of all signals u ∈ U . Other things being equal, the smaller the $\mathcal{Z}$, the smaller the associated function $\Phi_{A,\mathcal{Z}}$ (or $\Phi^\delta_{A,\mathcal{Z}}$), and consequently, the smaller the (upper bounds on the) risks of the quadratic in ζ detectors we end up with. In order to implement these constructions, we need to understand how to build the required sets $\mathcal{Z}$ in an “economical” way. There is a relatively simple case when it is easy to get reasonable candidates for the role of $\mathcal{Z}$, namely the case of a quadratically constrained signal set U :
$$U = \{u\in\mathbf{R}^n:\ f_k(u) := u^TQ_ku+2q_k^Tu\le b_k,\ 1\le k\le K\}. \qquad (2.147)$$
Indeed, the constraints $f_k(u)\le b_k$ are just linear constraints on the quadratic lifting $[u;1][u;1]^T$ of u:
$$u^TQ_ku+2q_k^Tu\le b_k\ \Leftrightarrow\ \mathrm{Tr}(F_k[u;1][u;1]^T)\le b_k,\qquad F_k = \begin{bmatrix}Q_k & q_k\\ q_k^T & \end{bmatrix}\in\mathbf{S}^{n+1}.$$
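The equivalence between the quadratic constraint on u and the linear constraint on the lifting can be verified numerically. The sketch below is an illustration only, with hypothetical function names; it assembles Fk from Qk and qk with a zero in the bottom-right corner (the entry left blank in the display above) and compares Tr(Fk [u; 1][u; 1]T ), computed as [u; 1]T Fk [u; 1], with uT Qk u + 2qkT u.

```python
def build_F(Q, q):
    """Assemble F = [[Q, q], [q^T, 0]] as a list of rows."""
    n = len(q)
    F = [list(Q[i]) + [q[i]] for i in range(n)]
    F.append(list(q) + [0.0])
    return F

def lifted_value(F, u):
    """Tr(F [u;1][u;1]^T), computed as [u;1]^T F [u;1]."""
    v = list(u) + [1.0]
    m = len(v)
    return sum(v[i] * F[i][j] * v[j] for i in range(m) for j in range(m))

def direct_value(Q, q, u):
    """u^T Q u + 2 q^T u."""
    n = len(u)
    quad = sum(u[i] * Q[i][j] * u[j] for i in range(n) for j in range(n))
    return quad + 2.0 * sum(qi * ui for qi, ui in zip(q, u))

# Illustrative data (hypothetical values).
Q = [[2.0, 0.5], [0.5, 1.0]]
q = [1.0, -1.0]
u = [0.5, -2.0]
lhs = lifted_value(build_F(Q, q), u)
rhs = direct_value(Q, q, u)
```

Since the lifted expression is linear in the matrix [u; 1][u; 1]T , it extends verbatim to any Z in place of the rank 1 lifting, which is what makes the relaxation (2.148) possible.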
Consequently, in the case of (2.147), the simplest candidate for the role of $\mathcal{Z}$ is the set
$$\mathcal{Z} = \{Z\in\mathbf{S}^{n+1}:\ Z\succeq0,\ Z_{n+1,n+1}=1,\ \mathrm{Tr}(F_kZ)\le b_k,\ 1\le k\le K\}. \qquad (2.148)$$
This set clearly is closed and convex (the latter even when U itself is not convex), and indeed contains the quadratic liftings $[u;1][u;1]^T$ of all points u ∈ U . We also need the compactness of $\mathcal{Z}$; the latter definitely takes place when the quadratic constraints describing U contain a constraint of the form $u^Tu\le R^2$, which, in turn, can be ensured, basically “for free,” when U is bounded. It should be stressed that the “ideal” choice of $\mathcal{Z}$ would be the convex hull $\mathcal{Z}[U]$ of all rank 1 matrices $[u;1][u;1]^T$ with u ∈ U ; this definitely is the smallest convex set which contains the quadratic liftings of all points from U . Moreover, $\mathcal{Z}[U]$ is closed and bounded, provided U is so. The difficulty is that $\mathcal{Z}[U]$ can be computationally intractable (and thus useless in our context) already for pretty simple sets U of the form (2.147). The set (2.148) is a simple outer approximation of $\mathcal{Z}[U]$, and this approximation can be very loose: for instance, when $U = \{u: -1\le u_k\le1,\ 1\le k\le n\}$ is just the unit box in Rn , the set (2.148) is
$$\{Z\in\mathbf{S}^{n+1}:\ Z\succeq0,\ Z_{n+1,n+1}=1,\ |Z_{k,n+1}|\le1,\ 1\le k\le n\};$$
this set is not even bounded, while $\mathcal{Z}[U]$ clearly is bounded. There is, essentially, just one generic case when the set (2.148) is exactly equal to $\mathcal{Z}[U]$: the case where $U = \{u: u^TQu\le c\}$, $Q\succ0$, is an ellipsoid centered at the origin; the fact that in this case the set given by (2.148) is exactly $\mathcal{Z}[U]$ is a consequence of what is called the S-Lemma. Though, in general, the set $\mathcal{Z}$ can be a very loose outer approximation of $\mathcal{Z}[U]$, this does not mean that this construction cannot be improved. As an instructive example, let $U = \{u\in\mathbf{R}^n: \|u\|_\infty\le1\}$. We get an approximation of $\mathcal{Z}[U]$ much better than the one above when applying (2.148) to an equivalent description of the box by quadratic constraints:
$$U := \{u\in\mathbf{R}^n:\ \|u\|_\infty\le1\} = \{u\in\mathbf{R}^n:\ u_k^2\le1,\ 1\le k\le n\}.$$
Applying the recipe of (2.148) to the latter description of U , we arrive at a significantly less conservative outer approximation of $\mathcal{Z}[U]$, specifically,
$$\mathcal{Z} = \{Z\in\mathbf{S}^{n+1}:\ Z\succeq0,\ Z_{n+1,n+1}=1,\ Z_{kk}\le1,\ 1\le k\le n\}.$$
Not only is the resulting set $\mathcal{Z}$ bounded; we can also get a reasonable “upper bound” on the discrepancy between $\mathcal{Z}$ and $\mathcal{Z}[U]$. Namely, denoting by $Z^o$ the matrix obtained from a symmetric (n + 1) × (n + 1) matrix Z by zeroing out the entry $Z_{n+1,n+1}$ and keeping the remaining entries intact, we have
$$\mathcal{Z}^o[U] := \{Z^o: Z\in\mathcal{Z}[U]\}\ \subset\ \mathcal{Z}^o := \{Z^o: Z\in\mathcal{Z}\}\ \subset\ O(1)\ln(n+1)\,\mathcal{Z}^o[U].$$
This is a particular case of a general result (which goes back to [191]; we shall get this result as a byproduct of our forthcoming considerations, specifically, Proposition 4.6) as follows:
Let U be a bounded set given by a system of convex quadratic constraints without linear terms:
$$U = \{u\in\mathbf{R}^n:\ u^TQ_ku\le c_k,\ 1\le k\le K\},\quad Q_k\succeq0,\ 1\le k\le K,$$
and let $\mathcal{Z}$ be the associated set (2.148):
$$\mathcal{Z} = \{Z\in\mathbf{S}^{n+1}:\ Z\succeq0,\ Z_{n+1,n+1}=1,\ \mathrm{Tr}(Z\,\mathrm{Diag}\{Q_k,0\})\le c_k,\ 1\le k\le K\}.$$
Then
$$\mathcal{Z}^o[U] := \{Z^o: Z\in\mathcal{Z}[U]\}\ \subset\ \mathcal{Z}^o := \{Z^o: Z\in\mathcal{Z}\}\ \subset\ 3\ln(\sqrt3(K+1))\,\mathcal{Z}^o[U].$$
Note that when K = 1 (i.e., U is an ellipsoid centered at the origin), the factor $3\ln(\sqrt3(K+1))$, as was already mentioned, can be replaced by 1. One may think that the factor $3\ln(\sqrt3(K+1))$ is too large to be of interest; well, this is nearly the best factor one can get under the circumstances, and a nice fact is that the factor is “nearly independent” of K. Finally, we remark that, as in the case of a box, we can try to reduce the conservatism of the outer approximation (2.148) of $\mathcal{Z}[U]$ by passing from the initial description of U to an equivalent one. The standard recipe here is to replace linear constraints in the description of U by their quadratic consequences; for example, we can augment a pair of linear constraints $q_i^Tu\le c_i$, $q_j^Tu\le c_j$, assuming there is such a pair, with the quadratic constraint $(c_i-q_i^Tu)(c_j-q_j^Tu)\ge0$. While this constraint is redundant as far as the description of U itself is concerned, adding it reduces, sometimes significantly, the set given by (2.148). Informally speaking, the transition from (2.147) to (2.148) is by itself “too stupid” to utilize the fact (known to every kid) that the product of two nonnegative quantities is nonnegative; when augmenting linear constraints in the description of U by their pairwise products, we somehow compensate for this stupidity. Unfortunately, while “computationally tractable” assistance of this type allows us to reduce the conservatism of (2.148), it usually does not allow us to eliminate it completely: a grave “fact of life” is that even in the case of the unit box U , the set $\mathcal{Z}[U]$ is computationally intractable. Scientifically speaking: maximizing quadratic forms over the unit box U is provably an NP-hard problem; were we able to get a computationally tractable description of $\mathcal{Z}[U]$, we would be able to solve this NP-hard problem efficiently, implying that P=NP.
While we do not know for sure that the latter is not the case, “informal odds” are strongly against this possibility. The bottom line is that while the approach we are discussing could, in some situations, result in quite conservative tests, “some” is by far not the same as “always”; on the positive side, this approach allows us to process some important problems. We are about to present a simple and instructive illustration.
2.9.5.1
Simple change detection
In Figure 2.8, you see a sample of frames from a “movie” in which a noisy picture of a dog gradually transforms into a noisy picture of a lady; several initial frames differ just by realizations of noise, and starting from some instant, the “signal” (the deterministic component of the image) starts to drift from the dog towards the lady. What, in your opinion, is the change point, that is, the first time instant where the signal component of the image differs from the signal component of the initial image?
[Figure 2.8: Frames from a “movie”; the frames shown are #1 to #8, #15, #20, #28, and #30.]
A simple model of the situation is as follows: we observe, one by one, vectors (in fact, 2D arrays, but we can “vectorize” them)
$$\omega_t = x_t+\xi_t,\quad t=1,2,...,K, \qquad (2.149)$$
where the xt are deterministic components of the observations and the ξt are random noises. It may happen that for some τ ∈ {2, 3, ..., K}, the vectors xt are independent of t when t < τ , and xτ differs from xτ −1 (“τ is a change point”); if this is the case, τ is uniquely defined by $x^K = (x_1,...,x_K)$. An alternative is that xt is independent of t, for all 1 ≤ t ≤ K (“no change”). The goal is to decide, based on observation $\omega^K = (\omega_1,...,\omega_K)$, whether there was a change point, and if yes, then, perhaps, to localize it. The model we have just described is the simplest case of “change detection,” where, given noisy observations on some time horizon, one is interested in detecting a “change” in some time series underlying the observations. In our simple model, this time series is comprised of deterministic components xt of observations, and “change at time τ ” is understood in the most straightforward way: as the fact that xτ differs from the preceding xt ’s, which are equal to each other. In more complicated situations, our observations are obtained from the underlying time series {xt } by a non-anticipative transformation, like
$$\omega_t = \sum_{s=1}^tA_{ts}x_s+\xi_t,\quad t=1,...,K,$$
and we still want to detect the change, if any, in the time series {xt }. As an instructive example, consider observations, taken along an equidistant time grid, of the positions of an aircraft which “normally” flies with constant velocity, but at some time instant can start to maneuver. In this situation, the underlying time series is comprised of the velocities of the aircraft at consecutive time instants, observations are obtained from this time series by integration, and to detect a maneuver means to detect that on the observation horizon, there was a change in the series of velocities. Change detection is the subject of a huge literature dealing with a wide range of models differing from each other in
• whether we deal with direct observations of the time series of interest, as in (2.149), or with indirect ones (in the latter case, there is a wide spectrum of options related to how the observations depend on the underlying time series),
• what the assumptions on the noise are,
• what happens with the xt ’s after the change: do they jump from their common value prior to time τ to a new common value starting with this time, or start to depend on time (and if yes, then how), etc.
A significant role in change detection is played by hypothesis testing; as far as affine/quadratic-detector-based techniques developed in this section are concerned, their applications in the context of change detection are discussed in [50]. In what follows, we focus on the simplest of these applications.
Situation and goal. We consider the situation as follows:
1. Our observations are given by (2.149) with noises $\xi_t \sim N(0, \sigma^2 I_d)$ independent across $t = 1, ..., K$. We do not know $\sigma$ a priori; what we know is that $\sigma$ is independent of $t$ and belongs to a given segment $[\underline{\sigma}, \overline{\sigma}]$, with $0 < \underline{\sigma} \le \overline{\sigma}$;
2. Observations (2.149) arrive one by one, so that at time $t$, $2 \le t \le K$, we have at our disposal observation $\omega^t = (\omega_1, ..., \omega_t)$. Our goal is to build a system of inferences $T_t$, $2 \le t \le K$, such that $T_t$ as applied to $\omega^t$ either infers that there was a change at time $t$ or earlier, in which case we terminate, or infers that so far there has been no change, in which case we either proceed to time $t + 1$ (if $t < K$), or terminate (if $t = K$) with a “no change” conclusion.

We are given $\epsilon \in (0, 1)$ and want our collection of inferences to satisfy the bound $\epsilon$ on the probability of false alarm (i.e., on the probability of terminating somewhere on time horizon $2, 3, ..., K$ with a “there was a change” conclusion in the situation where there was no change: $x_1 = ... = x_K$). Under this restriction, we want to make as small as possible the probability of a miss (of not detecting the change at all in the situation where there was a change).

The “small probability of a miss” desire should be clarified. When the noise is nontrivial, we have no chance to detect very small changes and still respect the bound on the probability of false alarm. A realistic goal is to make as small as possible the probability of missing a not too small change, which can be formalized as follows. Given $\rho > 0$ and tolerances $\epsilon, \varepsilon \in (0, 1)$, let us look for a system of inferences $\{T_t : 2 \le t \le K\}$ such that
• the probability of false alarm is at most $\epsilon$, and
• the probability of “$\rho$-miss” (the probability of detecting no change when there was a change of energy $\ge \rho^2$, i.e., when there was a change at time $\tau$ and, moreover, it holds $\|x_\tau - x_1\|_2^2 \ge \rho^2$) is at most $\varepsilon$.

What we are interested in is achieving the goal just formulated with as small a $\rho$ as possible.

Construction.
Let us select a large “safety parameter” $R$, like $R = 10^8$ or even $R = 10^{80}$, so that we can assume that for all time series we are interested in it holds $\|x_t - x_\tau\|_2^2 \le R^2$.²⁷ Let us associate with $\rho > 0$ “signal hypotheses” $H_t^\rho$, $t = 2, 3, ..., K$, on the distribution of observation $\omega^K$ given by (2.149), with $H_t^\rho$ stating that at time $t$ there is a change, of energy at least $\rho^2$, in the time series $\{x_t\}_{t=1}^K$ underlying the observation $\omega^K$:

$$x_1 = x_2 = ... = x_{t-1} \ \ \&\ \ \|x_t - x_{t-1}\|_2^2 = \|x_t - x_1\|_2^2 \ge \rho^2$$

(and on top of that, $\|x_t - x_\tau\|_2^2 \le R^2$ for all $t, \tau$). Let us augment these hypotheses by the null hypothesis $H_0$ stating that there is no change at all: the observation $\omega^K$ stems from a stationary time series $x_1 = x_2 = ... = x_K$. We are about to use our machinery of detector-based tests in order to build a system of tests deciding, with partial risks $\epsilon$ and $\varepsilon$, on the null hypothesis vs. the “signal alternative” $\bigcup_t H_t^\rho$ for as small a $\rho$ as possible.

The implementation is as follows. Given $\rho > 0$ such that $\rho^2 < R^2$, consider two hypotheses, $G_1$ and $G_2^\rho$, on the distribution of observation

$$\zeta = x + \xi \in \mathbf{R}^d. \tag{2.150}$$

²⁷ $R$ is needed only to make the domains we are working with bounded, thus allowing us to apply the theory we have developed so far. The actual value of $R$ does not enter our constructions and conclusions.
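Both hypotheses state that $\xi \sim N(0, \sigma^2 I_d)$ with unknown $\sigma$ known to belong to a given segment $\Delta := [\sqrt{2}\,\underline{\sigma}, \sqrt{2}\,\overline{\sigma}]$. In addition, $G_1$ states that $x = 0$, and $G_2^\rho$ that $\rho^2 \le \|x\|_2^2 \le R^2$. We can use the result of Proposition 2.43.ii to build a detector quadratic in $\zeta$ for the families of distributions $P_1$, $P_2^\rho$ obeying the hypotheses $G_1$, $G_2^\rho$, respectively. To this end it suffices to apply the proposition to the collections

$$V_\chi = \{\sigma^2 I_d : \sigma \in \Delta\},\ \ \Theta_*^{(\chi)} = 2\overline{\sigma}^2 I_d,\ \ \delta_\chi = 1 - \underline{\sigma}/\overline{\sigma},\ \ \gamma_\chi = 0.999,\ \ A_\chi = I_d,\ \ Z_\chi, \quad [\chi = 1, 2]$$

where

$$\begin{array}{rcl} Z_1 &=& \{[0; ...; 0; 1][0; ...; 0; 1]^T\} \subset \mathbf{S}_+^{d+1}, \\ Z_2 = Z_2^\rho &=& \{Z \in \mathbf{S}_+^{d+1} : Z_{d+1,d+1} = 1,\ 1 + R^2 \ge \mathrm{Tr}(Z) \ge 1 + \rho^2\}. \end{array}$$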
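The (upper bound on the) risk of the quadratic in $\zeta$ detector yielded by a saddle point of function (2.134), as given by Proposition 2.43.ii, is immediate: by the same argument as used when justifying Remark 2.44, in the situation in question one can look for a saddle point with $h = 0$, $H = \eta I_d$, and identifying the required $\eta$ reduces to solving the univariate convex problem

$$\mathrm{Opt}(\rho) = \min_\eta \left\{ \frac{1}{2}\left[ -\frac{d}{2}\ln(1 - \widehat{\sigma}^4\eta^2) - \frac{d}{2}\widehat{\sigma}^2(1 - \underline{\sigma}^2/\overline{\sigma}^2)\eta + \frac{d\delta(2+\delta)\widehat{\sigma}^4\eta^2}{1 + \widehat{\sigma}^2\eta} + \frac{\rho^2\eta}{2(1 - \widehat{\sigma}^2\eta)} \right] : -\gamma \le \widehat{\sigma}^2\eta \le 0 \right\}, \quad \left[\widehat{\sigma} = \sqrt{2}\,\overline{\sigma},\ \delta = 1 - \underline{\sigma}/\overline{\sigma}\right]$$

which can be done in no time by Bisection. The resulting detector and the upper bound on its risk are given by the optimal solution $\eta(\rho)$ to the latter problem according to

$$\phi^*_\rho(\zeta) = \frac{1}{2}\eta(\rho)\zeta^T\zeta + \underbrace{\frac{d}{4}\left[\ln\left(\frac{1 - \widehat{\sigma}^2\eta(\rho)}{1 + \widehat{\sigma}^2\eta(\rho)}\right) + \widehat{\sigma}^2(1 - \underline{\sigma}^2/\overline{\sigma}^2)\eta(\rho) - \frac{\rho^2\eta(\rho)}{d(1 - \widehat{\sigma}^2\eta(\rho))}\right]}_{a(\rho)}$$

with

$$\mathrm{Risk}[\phi^*_\rho | P_1, P_2^\rho] \le \mathrm{Risk}(\rho) := e^{\mathrm{Opt}(\rho)}$$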
(observe that $R$ appears neither in the definition of the optimal detector nor in the risk bound). It is immediately seen that $\mathrm{Opt}(\rho) \to 0$ as $\rho \to +0$ and $\mathrm{Opt}(\rho) \to -\infty$ as $\rho \to +\infty$, implying that given $\kappa \in (0, 1)$, we can easily find by bisection $\rho = \rho(\kappa)$ such that $\mathrm{Risk}(\rho) = \kappa$; in what follows, we assume w.l.o.g. that $R > \rho(\kappa)$ for the value of $\kappa$ we end with; see below. Next, let us pass from the detector $\phi^*_{\rho(\kappa)}(\cdot)$ to its shift

$$\phi^{*,\kappa}(\zeta) = \phi^*_{\rho(\kappa)}(\zeta) + \ln(\varepsilon/\kappa),$$

so that for the simple test $T^\kappa$ which, given observation $\zeta$, accepts $G_1$ and rejects $G_2^{\rho(\kappa)}$ whenever $\phi^{*,\kappa}(\zeta) \ge 0$, and accepts $G_2^{\rho(\kappa)}$ and rejects $G_1$ otherwise, it holds

$$\mathrm{Risk}_1(T^\kappa | G_1, G_2^{\rho(\kappa)}) \le \frac{\kappa^2}{\varepsilon}, \quad \mathrm{Risk}_2(T^\kappa | G_1, G_2^{\rho(\kappa)}) \le \varepsilon; \tag{2.151}$$
see Proposition 2.14 and (2.48).

We are nearly done. Given $\kappa \in (0, 1)$, consider the system of tests $T_t^\kappa$, $t = 2, 3, ..., K$, as follows. At time $t \in \{2, 3, ..., K\}$, given observations $\omega_1, ..., \omega_t$ stemming from (2.149), let us form the vector $\zeta_t = \omega_t - \omega_1$ and compute the quantity $\phi^{*,\kappa}(\zeta_t)$. If this quantity is negative, we claim that the change has already taken place and terminate; otherwise we claim that so far, there was no change, and proceed to time $t + 1$ (if $t < K$) or terminate (if $t = K$).

The risk analysis for the resulting system of inferences is immediate. Observe that

(!) For every $t = 2, 3, ..., K$:
• if there is no change on time horizon $1, ..., t$: $x_1 = x_2 = ... = x_t$ (case A), the probability for $T_t^\kappa$ to conclude that there was a change is at most $\kappa^2/\varepsilon$;
• if, on the other hand, $\|x_t - x_1\|_2^2 \ge \rho^2(\kappa)$ (case B), then the probability for $T_t^\kappa$ to conclude that so far there was no change is at most $\varepsilon$.

Indeed, we clearly have

$$\zeta_t = [x_t - x_1] + \xi^t,$$

where $\xi^t = \xi_t - \xi_1 \sim N(0, \sigma^2 I_d)$ with $\sigma \in [\sqrt{2}\,\underline{\sigma}, \sqrt{2}\,\overline{\sigma}]$. Our action at time $t$ is nothing but application of the test $T^\kappa$ to the observation $\zeta_t$. In case A the distribution of this observation obeys the hypothesis $G_1$, and the probability for $T_t^\kappa$ to claim that there was a change is at most $\kappa^2/\varepsilon$ by the first inequality in (2.151). In case B, the distribution of $\zeta_t$ obeys the hypothesis $G_2^{\rho(\kappa)}$, and thus the probability for $T_t^\kappa$ to claim that there was no change on time horizon $1, ..., t$ is $\le \varepsilon$ by the second inequality in (2.151).

In view of (!), the probability of false alarm for the system of inferences $\{T_t^\kappa\}_{t=2}^K$ is at most $(K - 1)\kappa^2/\varepsilon$, and specifying $\kappa$ as

$$\kappa = \sqrt{\epsilon\varepsilon/(K - 1)},$$

we make this probability $\le \epsilon$. The resulting procedure, by the same (!), detects a change at time $t \in \{2, 3, ..., K\}$ with probability at least $1 - \varepsilon$, provided that the energy of this change is at least $\rho_*^2$, with

$$\rho_* = \rho\left(\sqrt{\epsilon\varepsilon/(K - 1)}\right). \tag{2.152}$$
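The design steps above (choose $\kappa$, find $\rho(\kappa)$ by bisection, read off the shifted detector) can be sketched numerically. The following is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the univariate problem is solved in the variable $s = \widehat{\sigma}^2\eta$ by ternary search (legitimate for a convex objective), and the constant $a(\rho)$ is computed under the assumption that it is half the difference of the two summands whose half-sum is the objective. With the data of the experiment reported below under "How it works", the output should come out close to the values quoted there.

```python
import math

def objective(s, rho, d, sig_lo, sig_hi):
    """Objective of the univariate problem, written in s = sigma_hat^2 * eta."""
    sh2 = 2.0 * sig_hi ** 2                 # sigma_hat^2 = 2 * (upper noise level)^2
    r2 = (sig_lo / sig_hi) ** 2
    delta = 1.0 - sig_lo / sig_hi
    return 0.5 * (-(d / 2) * math.log(1.0 - s * s)
                  - (d / 2) * (1.0 - r2) * s
                  + d * delta * (2.0 + delta) * s * s / (1.0 + s)
                  + rho ** 2 * (s / sh2) / (2.0 * (1.0 - s)))

def solve_opt(rho, d, sig_lo, sig_hi, gamma=0.999):
    """Return (Opt(rho), eta(rho)) by ternary search over s in [-gamma, 0]."""
    lo, hi = -gamma, 0.0
    for _ in range(200):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1, rho, d, sig_lo, sig_hi) < objective(m2, rho, d, sig_lo, sig_hi):
            hi = m2
        else:
            lo = m1
    s = 0.5 * (lo + hi)
    return objective(s, rho, d, sig_lo, sig_hi), s / (2.0 * sig_hi ** 2)

def rho_of_kappa(kappa, d, sig_lo, sig_hi):
    """Bisection in rho for exp(Opt(rho)) = kappa (Opt is nonincreasing in rho)."""
    target = math.log(kappa)
    lo, hi = 0.0, 1.0
    while solve_opt(hi, d, sig_lo, sig_hi)[0] > target:
        hi *= 2.0                           # grow the bracket until the risk is small enough
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if solve_opt(mid, d, sig_lo, sig_hi)[0] > target:
            lo = mid
        else:
            hi = mid
    return hi

def design(d, sig_lo, sig_hi, eps_alarm, eps_miss, K):
    """Return (kappa, rho_*, eta, constant term of the shifted detector)."""
    kappa = math.sqrt(eps_alarm * eps_miss / (K - 1))
    rho = rho_of_kappa(kappa, d, sig_lo, sig_hi)
    _, eta = solve_opt(rho, d, sig_lo, sig_hi)
    s = 2.0 * sig_hi ** 2 * eta
    r2 = (sig_lo / sig_hi) ** 2
    # Assumed form of a(rho): half the difference of the two summands of the objective.
    a = (d / 4) * (math.log((1.0 - s) / (1.0 + s)) + (1.0 - r2) * s
                   - rho ** 2 * eta / (d * (1.0 - s)))
    return kappa, rho, eta, a + math.log(eps_miss / kappa)

if __name__ == "__main__":
    d = 256 * 256
    kappa, rho, eta, const = design(d, 10.0 / math.sqrt(2.0), 10.0, 0.01, 0.01, 9)
    print("kappa =", kappa, " rho_*^2 =", rho ** 2)
    print("detector: %.4e * zeta^T zeta + %.4f" % (0.5 * eta, const))
    print("per-pixel alarm threshold =", const / (-0.5 * eta * d))
```

The test claims a change when the shifted detector is negative, which is the same as the per-pixel energy of $\omega_t - \omega_1$ exceeding the printed threshold.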
In fact we can say a bit more:
Proposition 2.48. Let the deterministic sequence $x_1, ..., x_K$ underlying observations (2.149) be such that for some $t$ it holds $\|x_t - x_1\|_2^2 \ge \rho_*^2$, with $\rho_*$ given by (2.152). Then the probability for the system of inferences we have built to detect a change at time $t$ or earlier is at least $1 - \varepsilon$.

Indeed, under the premise of the proposition, the probability for $T_t^\kappa$ to claim that a change already took place is at least $1 - \varepsilon$, and this probability can be only smaller than the probability to detect change on time horizon $2, 3, ..., t$.

How it works. As applied to the “movie” story we started with, the outlined procedure works as follows. The images in question are of the size $256 \times 256$, so that we are in the case of $d = 256^2 = 65536$. The images are represented by 2D arrays in gray scale, that is, as $256 \times 256$ matrices with entries in the range $[0, 255]$. In the experiment to be reported (just as in the movie) we assumed the maximal noise intensity $\overline{\sigma}$ to be 10, and used $\underline{\sigma} = \overline{\sigma}/\sqrt{2}$. The reliability tolerances $\epsilon$, $\varepsilon$ were set to 0.01, and $K$ was set to 9, resulting in $\rho_*^2 = 7.38 \cdot 10^6$, which corresponds to the per pixel energy $\rho_*^2/65536 = 112.68$, just 12% above the allowed expected per pixel energy of noise (the latter is $\overline{\sigma}^2 = 100$). The resulting detector is

$$\phi_*(\zeta) = -2.7138\,\frac{\zeta^T\zeta}{10^5} + 366.9548.$$

In other words, test $T_t^\kappa$ claims that the change took place when the average, over pixels, per pixel energy in the difference $\omega_t - \omega_1$ was at least 206.33, which is pretty close to the expected per pixel energy (200.0) in the noise $\xi_t - \xi_1$ affecting the difference $\omega_t - \omega_1$.

Finally, this is how the system of inferences just described worked in simulations. The underlying sequence of images is obtained from the “basic sequence”²⁸

$$\bar{x}_t = D + 0.0357(t - 1)(L - D), \quad t = 1, 2, ... \tag{2.153}$$
where $D$ is the image of the dog and $L$ is the image of the lady (up to noise, these are the first and the last frames in Figure 2.8). To get the observations in a particular simulation, we augment this sequence from the left by a random number of images $D$ in such a way that with probability 1/2 there was no change of image on the time horizon $1, 2, ..., 9$, and with probability 1/2 there was a change at time instant $\tau$ chosen at random from the uniform distribution on $\{2, 3, ..., 9\}$. The observation is obtained by taking the first nine images in the resulting sequence, and adding to them observation noises independent across the images drawn at random from $N(0, 100 I_{65536})$.

In the series of 3,000 simulations of this type we have not observed a single false alarm, while the empirical probability of a miss was 0.0553. Besides, the change at time $t$, if detected, was never detected with a delay more than 1.

Finally, in the particular “movie” in Figure 2.8 the change takes place at time $t = 3$, and the system of inferences we have just developed discovered the change at time 4. How does this compare to the time when you managed to detect the change?

²⁸ The coefficient 0.0357 corresponds to a 28-frame linear transition from $D$ to $L$.

“Numerical near-optimality.” Recall that beyond the realm of simple o.s.’s we
have no theoretical guarantees of near-optimality for the inferences we are developing. This does not mean, however, that we cannot quantify the conservatism of our techniques numerically. To give an example, let us forget, for the sake of simplicity, about change detection per se and focus on the auxiliary problem we have introduced above, that of deciding upon hypotheses $G_1$ and $G_2^\rho$ via observation (2.150), and suppose that we want to decide on these two hypotheses from a single observation with risk $\le \epsilon$, for a given $\epsilon \in (0, 1)$. Whether this is possible or not depends on $\rho$; let us denote by $\rho_+$ the smallest $\rho$ for which we can meet the risk specification with our detector-based approach ($\rho_+$ is nothing but what was above called $\rho(\epsilon)$), and by $\underline{\rho}$ the smallest $\rho$ for which there exists “in nature” a simple test deciding on $G_1$ vs. $G_2^\rho$ with risk $\le \epsilon$. We can consider the ratio $\rho_+/\underline{\rho}$ as the “index of conservatism” of our approach. Now, $\rho_+$ is given by an efficient computation; what about $\underline{\rho}$? Well, there is a simple way to get a lower bound on $\underline{\rho}$, namely, as follows. Observe that if the composite hypotheses $G_1$, $G_2^\rho$ can be decided upon with risk $\le \epsilon$, the same holds true for two simple hypotheses stating that the distribution of observation (2.150) is $P_1$ or $P_2$, respectively, where $P_1$, $P_2$ correspond to the cases where
• ($P_1$): $\zeta$ is drawn from $N(0, 2\overline{\sigma}^2 I_d)$
• ($P_2$): $\zeta$ is obtained by adding $N(0, 2\overline{\sigma}^2 I_d)$-noise to a random signal $u$, independent of the noise, uniformly distributed on the sphere $\{\|u\|_2 = \rho\}$.

Indeed, $P_1$ obeys hypothesis $G_1$, and $P_2$ is a mixture of distributions obeying $G_2^\rho$; as a result, a simple test $T$ deciding $(1 - \epsilon)$-reliably on $G_1$ vs. $G_2^\rho$ would induce a test deciding equally reliably on $P_1$ vs. $P_2$, specifically, the test which, given observation $\zeta$, accepts $P_1$ if $T$ on the same observation accepts $G_1$, and accepts $P_2$ otherwise.

We can now use a two-point lower bound (Proposition 2.2) to lower-bound the risk of deciding on $P_1$ vs. $P_2$. Because both distributions are spherically symmetric, computing this bound reduces to computing a similar bound for the univariate distributions of $\zeta^T\zeta$ induced by $P_1$ and $P_2$, and these univariate distributions are easy to compute. The resulting lower risk bound depends on $\rho$, and we can find the smallest $\rho$ for which the bound is $\le 0.01$, and use this $\rho$ in the role of $\underline{\rho}$; the associated indexes of conservatism can be only larger than the true ones.

Let us look at what these indexes are for the data used in our change detection experiment, that is, $\epsilon = 0.01$, $d = 256^2 = 65536$, $\overline{\sigma} = 10$, $\underline{\sigma} = \overline{\sigma}/\sqrt{2}$. Computation shows that in this case we have

$$\rho_+ = 2702.4, \quad \rho_+/\underline{\rho} \le 1.04,$$

which is nearly no conservatism at all! When eliminating the uncertainty in the intensity of noise by increasing $\underline{\sigma}$ from $\overline{\sigma}/\sqrt{2}$ to $\overline{\sigma}$, we get

$$\rho_+ = 668.46, \quad \rho_+/\underline{\rho} \le 1.15,$$

which is still not much conservatism!
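Returning to the threshold form of the change detection test reported under “How it works”: the rule "claim a change when the per pixel energy of $\omega_t - \omega_1$ is at least 206.33" is easy to try out on synthetic data. The sketch below is only an illustration under stated assumptions: the shift $x_t - x_1$ is taken to have the same amplitude in every pixel (by rotational invariance of the noise only its energy matters), the three scenarios and all names are hypothetical, and the threshold and per pixel energy 112.68 are the figures quoted in the text.

```python
import math
import random

D = 256 * 256          # number of pixels, as in the experiment described in the text
THRESHOLD = 206.33     # per-pixel energy threshold quoted in the text

def per_pixel_energy(signal_energy, noise_var, rng, d=D):
    """Average of (x_i + noise_i)^2 over d pixels.

    signal_energy is the per-pixel energy of the shift x_t - x_1 (assumed to be
    spread evenly over the pixels); noise_var is the per-pixel variance of the
    difference noise xi_t - xi_1, i.e. twice the variance of a single noise.
    """
    amp = math.sqrt(signal_energy)
    sd = math.sqrt(noise_var)
    total = 0.0
    for _ in range(d):
        z = amp + rng.gauss(0.0, sd)
        total += z * z
    return total / d

def run_demo(seed=0):
    """Per-pixel energies in three illustrative scenarios."""
    rng = random.Random(seed)
    return {
        "no change, sigma = 10": per_pixel_energy(0.0, 200.0, rng),
        "change of energy rho_*^2, sigma = 10/sqrt(2)": per_pixel_energy(112.68, 100.0, rng),
        "change of energy rho_*^2, sigma = 10": per_pixel_energy(112.68, 200.0, rng),
    }

if __name__ == "__main__":
    for name, energy in run_demo().items():
        verdict = "change claimed" if energy >= THRESHOLD else "no change claimed"
        print("%-45s %8.2f  -> %s" % (name, energy, verdict))
```

With $d = 65536$ the statistic concentrates tightly around its mean (200 without a change, at least 212.68 with a change of energy $\rho_*^2$), which is why a threshold only about 3% above the pure-noise level can separate the two cases.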
2.10 EXERCISES FOR CHAPTER 2

2.10.1 Two-point lower risk bound
Exercise 2.1. Let $p$ and $q$ be two probability distributions distinct from each other on the $d$-element observation space $\Omega = \{1, ..., d\}$, and consider two simple hypotheses on the distribution of observation $\omega \in \Omega$, $H_1: \omega \sim p$, and $H_2: \omega \sim q$.
1. Is it true that there always exists a simple deterministic test deciding on $H_1$, $H_2$ with risk $< 1/2$?
2. Is it true that there always exists a simple randomized test deciding on $H_1$, $H_2$ with risk $< 1/2$?
3. Is it true that when quasi-stationary $K$-repeated observations are allowed, one can decide on $H_1$, $H_2$ with any small risk, provided $K$ is large enough?

2.10.2 Around Euclidean Separation
Exercise 2.2. Justify the “immediate observation” in Section 2.2.2.3.B.

Exercise 2.3.
1) Prove Proposition 2.9.
Hint: You may find useful the following simple observation (prove it, provided you indeed use it): Let $f(\omega)$, $g(\omega)$ be probability densities taken w.r.t. a reference measure $P$ on an observation space $\Omega$, and let $\epsilon \in (0, 1/2]$ be such that

$$2\bar{\epsilon} := \int_\Omega \min[f(\omega), g(\omega)]\,P(d\omega) \le 2\epsilon.$$

Then

$$\int_\Omega \sqrt{f(\omega)g(\omega)}\,P(d\omega) \le 2\sqrt{\epsilon(1 - \epsilon)}.$$

2) Justify the illustration in Section 2.2.3.2.C.

2.10.3 Hypothesis testing via $\ell_1$-separation
Let $d$ be a positive integer, and the observation space $\Omega$ be the finite set $\{1, ..., d\}$ equipped with the counting reference measure.²⁹ Probability distributions on $\Omega$ can be identified with points $p$ of the $d$-dimensional probabilistic simplex

$$\Delta_d = \{p \in \mathbf{R}^d : p \ge 0,\ \sum_i p_i = 1\};$$

the $i$-th entry $p_i$ in $p \in \Delta_d$ is the probability for the random variable distributed according to $p$ to take value $i \in \{1, ..., d\}$. With this interpretation, $p$ is the probability density taken w.r.t. the counting measure on $\Omega$.

²⁹ Counting measure is the measure on a discrete (finite or countable) set $\Omega$ which assigns every point of $\Omega$ with mass 1, so that the measure of a subset of $\Omega$ is the cardinality of the subset when it is finite and is $+\infty$ otherwise.

Assume $B$ and $W$ are two nonintersecting nonempty closed convex subsets of $\Delta_d$; we interpret $B$ and $W$ as the sets of black and white probability distributions on $\Omega$, and our goal is to find the optimal, in terms of its total risk, test deciding on the hypotheses

$$H_1: p \in B, \quad H_2: p \in W$$

via a single observation $\omega \sim p$.

Warning: Everywhere in this section, “test” means “simple test.”

Exercise 2.4. Our first goal is to find the optimal test, in terms of its total risk, deciding on the hypotheses $H_1$, $H_2$ via a single observation $\omega \sim p \in B \cup W$. To this end we consider the convex optimization problem

$$\mathrm{Opt} = \min_{p \in B, q \in W} \left[ f(p, q) := \sum_{i=1}^{d} |p_i - q_i| \right] \tag{2.154}$$
and let $(p^*, q^*)$ be an optimal solution to this problem (it clearly exists).
1. Extract from optimality conditions that there exist reals $\rho_i \in [-1, 1]$, $1 \le i \le d$, such that

$$\rho_i = \begin{cases} 1, & p_i^* > q_i^* \\ -1, & p_i^* < q_i^* \end{cases} \tag{2.155}$$

and

$$\rho^T(p - p^*) \ge 0\ \ \forall p \in B \quad \&\quad \rho^T(q - q^*) \le 0\ \ \forall q \in W. \tag{2.156}$$

2. Extract from the previous item that the test $T$ which, given an observation $\omega \in \{1, ..., d\}$, accepts $H_1$ with probability $\pi_\omega = (1 + \rho_\omega)/2$ and accepts $H_2$ with complementary probability, has its total risk equal to

$$\sum_{\omega \in \Omega} \min[p_\omega^*, q_\omega^*], \tag{2.157}$$
and thus is minimax optimal in terms of the total risk.

Comments. Exercise 2.4 describes an efficiently computable and, in terms of worst-case total risk, optimal simple test deciding on a pair of “convex” composite hypotheses on the distribution of a discrete random variable. While it seems an attractive result, we believe that by itself this result is useless, since typically in the testing problem in question a single observation is by far not enough for a reasonable inference; such an inference requires observing several independent realizations $\omega_1, ..., \omega_K$ of the random variable in question. And the construction presented in Exercise 2.4 says nothing on how to adjust the test to the case of repeated observation. Of course, when $\omega^K = (\omega_1, ..., \omega_K)$ is a $K$-element i.i.d. sample drawn from a probability distribution $p$ on $\Omega = \{1, ..., d\}$, $\omega^K$ can be thought of as a single observation of a discrete random variable taking value in the set

$$\Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K},$$

the probability distribution $p^K$ of $\omega^K$ being readily given by $p$. So, why not
apply the construction from Exercise 2.4 to $\omega^K$ in the role of $\omega$? On a close inspection, this idea fails. One of the reasons for this failure is that the cardinality of $\Omega^K$ (which, among other factors, is responsible for the computational complexity of implementing the test in Exercise 2.4) blows up exponentially as $K$ grows. Another, even more serious, complication is that $p^K$ depends on $p$ nonlinearly, so that the family of distributions $p^K$ of $\omega^K$ induced by a convex family of distributions $p$ of $\omega$ (convexity meaning that the $p$’s in question fill a convex subset of the probabilistic simplex) is not convex; and convexity of the sets $B$, $W$ in the context of Exercise 2.4 is crucial. Thus, passing from a single realization of a discrete random variable to the sample of $K > 1$ independent realizations of the variable results in severe structural and quantitative complications “killing,” at least at first glance, the approach undertaken in Exercise 2.4.³⁰

³⁰ Though directly extending the optimal single-observation test to the case of repeated observations encounters significant technical difficulties, it was carried out in some specific situations. For instance, in [122, 123] such an extension has been proposed for the case of sets $B$ and $W$ of distributions which are dominated by bi-alternating capacities (see, e.g., [8, 12, 35], and references therein); explicit constructions of the test were proposed for some special sets of distributions [121, 196, 209].

In spite of the above pessimistic conclusions, the single-observation test from Exercise 2.4 admits a meaningful multi-observation modification, which is the subject of our next exercise.

Exercise 2.5. There is a straightforward way to use the optimal (in terms of its total risk) single-observation test built in Exercise 2.4 in the “multi-observation” environment. Specifically, following the notation from Exercise 2.4, let $\rho \in \mathbf{R}^d$, $p^*$, $q^*$ be the entities built in this exercise, so that $p^* \in B$, $q^* \in W$, all entries in $\rho$ belong to $[-1, 1]$, and

$$\{\rho^T p \ge \alpha := \rho^T p^*\ \ \forall p \in B\}\ \ \&\ \ \{\rho^T q \le \beta := \rho^T q^*\ \ \forall q \in W\}\ \ \&\ \ \alpha - \beta = \rho^T[p^* - q^*] = \|p^* - q^*\|_1.$$

Given an i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_t \sim p$, where $p \in B \cup W$, we could try to decide on the hypotheses $H_1: p \in B$, $H_2: p \in W$ as follows. Let us set $\zeta_t = \rho_{\omega_t}$. For large $K$, given $\omega^K$, the observable quantity $\zeta^K := \frac{1}{K}\sum_{t=1}^{K}\zeta_t$, by the Law of Large Numbers, will be with overwhelming probability close to $\mathbf{E}_{\omega \sim p}\{\rho_\omega\} = \rho^T p$, and the latter quantity is $\ge \alpha$ when $p \in B$ and is $\le \beta < \alpha$ when $p \in W$. Consequently, selecting a “comparison level” $\ell \in (\beta, \alpha)$, we can decide on the hypotheses $p \in B$ vs. $p \in W$ by computing $\zeta^K$, comparing the result to $\ell$, accepting the hypothesis $p \in B$ when $\zeta^K \ge \ell$, and accepting the alternative $p \in W$ otherwise. The goal of this exercise is to quantify the above qualitative considerations. To this end let us fix $\ell \in (\beta, \alpha)$ and $K$ and ask ourselves the following questions:
A. For $p \in B$, how do we upper-bound the probability $\mathrm{Prob}_{p^K}\{\zeta^K \le \ell\}$?
B. For $p \in W$, how do we upper-bound the probability $\mathrm{Prob}_{p^K}\{\zeta^K \ge \ell\}$?
Here $p^K$ is the probability distribution of the i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_t \sim p$.

The simplest way to answer these questions is to use Bernstein’s bounding scheme. Specifically, to answer question A, let us select $\gamma \ge 0$ and observe that for
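every probability distribution $p$ on $\{1, 2, ..., d\}$ it holds

$$\underbrace{\mathrm{Prob}_{p^K}\{\zeta^K \le \ell\}}_{\pi_{K,-}[p]} \exp\{-\gamma\ell\} \le \mathbf{E}_{p^K}\left\{\exp\{-\gamma\zeta^K\}\right\} = \left[\sum_{i=1}^{d} p_i \exp\left\{-\frac{1}{K}\gamma\rho_i\right\}\right]^K,$$

whence

$$\ln(\pi_{K,-}[p]) \le K \ln\left(\sum_{i=1}^{d} p_i \exp\left\{-\frac{1}{K}\gamma\rho_i\right\}\right) + \gamma\ell,$$

implying, via substitution $\gamma = \mu K$, that

$$\forall \mu \ge 0: \ \ln(\pi_{K,-}[p]) \le K\psi_-(\mu, p), \quad \psi_-(\mu, p) = \ln\left(\sum_{i=1}^{d} p_i \exp\{-\mu\rho_i\}\right) + \mu\ell.$$

Similarly, setting $\pi_{K,+}[p] = \mathrm{Prob}_{p^K}\{\zeta^K \ge \ell\}$, we get

$$\forall \nu \ge 0: \ \ln(\pi_{K,+}[p]) \le K\psi_+(\nu, p), \quad \psi_+(\nu, p) = \ln\left(\sum_{i=1}^{d} p_i \exp\{\nu\rho_i\}\right) - \nu\ell.$$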
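The two bounds just derived are easy to evaluate numerically when $B$ and $W$ are singletons. The sketch below is an illustration only: the distributions `p`, `q`, the vector `rho`, and all function names are hypothetical choices, and the infimum over the nonnegative ray is approximated by ternary search on a bounded interval (any $\mu \ge 0$ gives a valid bound, so truncation can only make the bound more conservative).

```python
import math

def psi_minus(mu, p, rho, level):
    """psi_-(mu, p) = ln(sum_i p_i exp(-mu rho_i)) + mu * level."""
    return math.log(sum(pi * math.exp(-mu * ri) for pi, ri in zip(p, rho))) + mu * level

def psi_plus(nu, q, rho, level):
    """psi_+(nu, q) = ln(sum_i q_i exp(nu rho_i)) - nu * level."""
    return math.log(sum(qi * math.exp(nu * ri) for qi, ri in zip(q, rho))) - nu * level

def min_on_ray(f, upper=50.0, iters=200):
    """Approximate inf of a convex f over [0, upper] by ternary search."""
    lo, hi = 0.0, upper
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f(0.5 * (lo + hi))

def exponent(p, q, rho, level):
    """kappa = max of the two optimized exponents, for singleton B = {p}, W = {q}."""
    k_minus = min_on_ray(lambda mu: psi_minus(mu, p, rho, level))
    k_plus = min_on_ray(lambda nu: psi_plus(nu, q, rho, level))
    return max(k_minus, k_plus)

# Hypothetical example: rho_i = sign(p_i - q_i), with the free entry set to 0.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
rho = [1.0, 0.0, -1.0]
alpha = sum(r * x for r, x in zip(rho, p))    # rho^T p
beta = sum(r * x for r, x in zip(rho, q))     # rho^T q
level = 0.5 * (alpha + beta)                  # comparison level in the middle
kappa = exponent(p, q, rho, level)
# Smallest K for which the bound exp(K * kappa) drops below 0.01.
K_needed = math.ceil(math.log(100.0) / -kappa)

if __name__ == "__main__":
    print("alpha - beta =", alpha - beta)
    print("kappa =", kappa, " K for risk bound 0.01:", K_needed)
```

In this example $\alpha - \beta = \|p - q\|_1 = 0.6$, and both exponents can be minimized in closed form, which gives a convenient check of the numerical search.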
Now comes the exercise:
1. Extract from the above observations that

$$\mathrm{Risk}(T^{K,\ell} | H_1, H_2) \le \exp\{K\kappa\}, \quad \kappa = \max\left[\max_{p \in B}\inf_{\mu \ge 0}\psi_-(\mu, p),\ \max_{q \in W}\inf_{\nu \ge 0}\psi_+(\nu, q)\right],$$

where $T^{K,\ell}$ is the $K$-observation test which accepts the hypothesis $H_1: p \in B$ when $\zeta^K \ge \ell$ and accepts the hypothesis $H_2: p \in W$ otherwise.
2. Verify that $\psi_-(\mu, p)$ is convex in $\mu$ and concave in $p$, and similarly for $\psi_+(\nu, q)$, so that

$$\max_{p \in B}\inf_{\mu \ge 0}\psi_-(\mu, p) = \inf_{\mu \ge 0}\max_{p \in B}\psi_-(\mu, p), \quad \max_{q \in W}\inf_{\nu \ge 0}\psi_+(\nu, q) = \inf_{\nu \ge 0}\max_{q \in W}\psi_+(\nu, q).$$

Thus, computing $\kappa$ reduces to minimizing on the nonnegative ray the convex functions $\phi_-(\mu) = \max_{p \in B}\psi_-(\mu, p)$ and $\phi_+(\nu) = \max_{q \in W}\psi_+(\nu, q)$.
3. Prove that when $\ell = \frac{1}{2}[\alpha + \beta]$, one has

$$\kappa \le -\frac{1}{12}\Delta^2, \quad \Delta = \alpha - \beta = \|p^* - q^*\|_1.$$
Note that the above test and the quantity $\kappa$ responsible for the upper bound on its risk depend, as on a parameter, on the “acceptance level” $\ell \in (\beta, \alpha)$. The simplest way to select a reasonable value of $\ell$ is to minimize $\kappa$ over an equidistant grid $\Gamma \subset (\beta, \alpha)$, of small cardinality, of values of $\ell$.

Now, let us consider an alternative way to pass from a “good” single-observation test to its multi-observation version. Our “building block” now is the minimum risk randomized single-observation test³¹ and its multi-observation modification is just the majority version of this building block. Our first observation is that building the minimum risk single-observation test reduces to solving a convex optimization problem.

³¹ This test can differ from the test built in Exercise 2.4: the latter test is optimal in terms of the sum, rather than the maximum, of its partial risks.

Exercise 2.6. Let, as above, $B$ and $W$ be nonempty nonintersecting closed convex subsets of the probabilistic simplex $\Delta_d$. Show that the problem of finding the best (in terms of its risk) randomized single-observation test deciding on $H_1: p \in B$ vs. $H_2: p \in W$ via observation $\omega \sim p$ reduces to solving a convex optimization problem. Write down this problem as an explicit LO program when $B$ and $W$ are polyhedral sets given by polyhedral representations:

$$\begin{array}{rcl} B &=& \{p : \exists u : P_B p + Q_B u \le a_B\}, \\ W &=& \{p : \exists u : P_W p + Q_W u \le a_W\}. \end{array}$$
We see that the “ideal building block,” the minimum-risk single-observation test, can be built efficiently. What is at this point unclear is whether this block is of any use for majority modifications, that is, whether the risk of this test is $< 1/2$; this is what we need for the majority version of the minimum-risk single-observation test to be consistent.

Exercise 2.7. Extract from Exercise 2.4 that in the situation of this section, denoting by $\Delta$ the optimal value in the optimization problem (2.154), one has
1. The risk of any single-observation test, deterministic or randomized, is $\ge \frac{1}{2} - \frac{\Delta}{4}$
2. There exists a single-observation randomized test with risk $\le \frac{1}{2} - \frac{\Delta}{8}$, and thus the risk of the minimum risk single-observation test given by Exercise 2.6 does not exceed $\frac{1}{2} - \frac{\Delta}{8} < 1/2$ as well.

Pay attention to the fact that $\Delta > 0$ (since, by assumption, $B$ and $W$ do not intersect).

The bottom line is that in the situation of this section, given a target value $\epsilon$ of risk and assuming stationary repeated observations are allowed, we have (at least) three options to meet the risk specifications:
1. To start with the optimal (in terms of its total risk) single-observation detector as explained in Exercise 2.4, and then to pass to its multi-observation version built in Exercise 2.5;
2. To use the majority version of the minimum-risk randomized single-observation test built in Exercise 2.6;
3. To use the test based on the minimum risk detector for $B$, $W$, as explained in the main body of Chapter 2.

In all cases, we have to specify the number $K$ of observations which guarantees that the risk of the resulting multi-observation test is at most a given target $\epsilon$. A bound on $K$ can be easily obtained by utilizing the results on the risk of a detector-based test in a Discrete o.s. from the main body of Chapter 2 along with risk-related results of Exercises 2.5, 2.6, and 2.7.

Exercise 2.8.
Run numerical experiments to see if one of the three options above always dominates the others (that is, requires a smaller sample of observations to ensure the same risk).

Let us now focus on a theoretical comparison of the detector-based test and the majority version of the minimum-risk single-observation test (options 3 and 2 above) in the general situation described at the beginning of Section 2.10.3. Given $\epsilon \in (0, 1)$, the corresponding sample sizes $K_d$ and $K_m$ are completely determined by the relevant “measure of closeness” between $B$ and $W$. Specifically,
• For $K_d$, the closeness measure is

$$\rho_d(B, W) = 1 - \max_{p \in B, q \in W} \sum_\omega \sqrt{p_\omega q_\omega}; \tag{2.158}$$

$1 - \rho_d(B, W)$ is the minimal risk of a detector for $B$, $W$, and for $\rho_d(B, W)$ and $\epsilon$ small, we have $K_d \approx \ln(1/\epsilon)/\rho_d(B, W)$ (why?).
• Given $\epsilon$, $K_m$ is fully specified by the minimal risk $\rho$ of a simple randomized single-observation test $T$ deciding on the hypotheses associated with $B$, $W$. By Exercise 2.7, we have $\rho = \frac{1}{2} - \delta$, where $\delta$ is within an absolute constant factor of the optimal value $\Delta = \min_{p \in B, q \in W}\|p - q\|_1$ of (2.154). The risk bound for the $K$-observation majority version of $T$ is the probability to get at least $K/2$ heads in $K$ independent tosses of a coin with probability to get heads in a single toss equal to $\rho = 1/2 - \delta$. When $\rho$ is not close to 0 and $\epsilon$ is small, the $(1 - \epsilon)$-quantile of the number of heads in our $K$ coin tosses is $K\rho + O(1)\sqrt{K\ln(1/\epsilon)} = K/2 - \delta K + O(1)\sqrt{K\ln(1/\epsilon)}$ (why?). $K_m$ is the smallest $K$ for which this quantile is $< K/2$, so that $K_m$ is of the order of $\ln(1/\epsilon)/\delta^2$, or, which is the same, of the order of $\ln(1/\epsilon)/\Delta^2$. We see that the closeness between $B$ and $W$ “responsible for $K_m$” is

$$\rho_m(B, W) = \Delta^2 = \left[\min_{p \in B, q \in W}\|p - q\|_1\right]^2,$$

and $K_m$ is of the order of $\ln(1/\epsilon)/\rho_m(B, W)$.

The goal of the next exercise is to compare $\rho_d$ and $\rho_m$.

Exercise 2.9. Prove that in the situation of this section one has

$$\frac{1}{8}\rho_m(B, W) \le \rho_d(B, W) \le \frac{1}{2}\sqrt{\rho_m(B, W)}. \tag{2.159}$$

Relation (2.159) suggests that while $K_d$ never is “much larger” than $K_m$ (this we know in advance: in repeated versions of Discrete o.s., a properly built detector-based test provably is nearly optimal), $K_m$ might be much larger than $K_d$. This indeed is the case:

Exercise 2.10. Given $\delta \in (0, 1/2)$, let $B = \{[\delta; 0; 1 - \delta]\}$ and $W = \{[0; \delta; 1 - \delta]\}$. Verify that in this case the numbers of observations $K_d$ and $K_m$, resulting in a given risk $\epsilon \ll 1$ of multi-observation tests, as functions of $\delta$ are proportional to $1/\delta$ and $1/\delta^2$, respectively. Compare the numbers when $\epsilon = 0.01$ and $\delta \in \{0.01; 0.05; 0.1\}$.
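For the pair of singletons in Exercise 2.10, the two sample sizes can be tabulated directly. The sketch below is one possible way to set up the computation, not a reference solution, and it rests on assumptions that should be checked as part of the exercise: $K_d$ is taken as the smallest $K$ with $(1 - \rho_d)^K \le \epsilon$, where $1 - \rho_d = \sum_\omega\sqrt{p_\omega q_\omega} = 1 - \delta$; the single-observation test behind the majority rule is assumed to err with probability $\frac{1}{2} - \frac{\delta}{2}$ under both hypotheses; only odd $K$ are considered for the majority test; and the function names are hypothetical.

```python
import math

def k_detector(delta, eps):
    """Smallest K with (1 - rho_d)^K <= eps, where 1 - rho_d = 1 - delta here."""
    affinity = 1.0 - delta                   # sum_w sqrt(p_w q_w) for the two singletons
    return math.ceil(math.log(1.0 / eps) / (-math.log(affinity)))

def majority_risk(K, err):
    """P(Binomial(K, err) >= (K + 1) / 2) for odd K, summed in log scale."""
    log_e, log_c = math.log(err), math.log(1.0 - err)
    total = 0.0
    for j in range((K + 1) // 2, K + 1):
        log_term = (math.lgamma(K + 1) - math.lgamma(j + 1) - math.lgamma(K - j + 1)
                    + j * log_e + (K - j) * log_c)
        total += math.exp(log_term)
    return total

def k_majority(delta, eps):
    """Smallest odd K for which the majority test has risk <= eps."""
    err = 0.5 - delta / 2.0                  # assumed single-observation risk
    m_lo, m_hi = 0, 1                        # K = 2 m + 1; the risk at m_lo exceeds eps
    while majority_risk(2 * m_hi + 1, err) > eps:
        m_hi *= 2
    while m_hi - m_lo > 1:
        mid = (m_lo + m_hi) // 2
        if majority_risk(2 * mid + 1, err) > eps:
            m_lo = mid
        else:
            m_hi = mid
    return 2 * m_hi + 1

if __name__ == "__main__":
    for delta in (0.1, 0.05, 0.01):
        print("delta = %.2f:  K_d = %d,  K_m = %d"
              % (delta, k_detector(delta, 0.01), k_majority(delta, 0.01)))
```

Halving $\delta$ should roughly double $K_d$ and roughly quadruple $K_m$, in line with the $1/\delta$ and $1/\delta^2$ scalings stated in the exercise.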
2.10.4 Miscellaneous exercises
Exercise 2.11. Prove that the conclusion in Proposition 2.18 remains true when the test $T$ in the premise of the proposition is randomized.

Exercise 2.12. Let $p_1(\omega)$, $p_2(\omega)$ be two positive probability densities, taken w.r.t. a reference measure $\Pi$ on an observation space $\Omega$, and let $P_\chi = \{p_\chi\}$, $\chi = 1, 2$. Find the optimal (in terms of its risk) balanced detector for $P_\chi$, $\chi = 1, 2$.

Exercise 2.13. Recall that the exponential distribution on $\Omega = \mathbf{R}_+$, with parameter $\mu > 0$, is the distribution with the density $p_\mu(\omega) = \mu e^{-\mu\omega}$, $\omega \ge 0$. Given positive reals $\alpha < \beta$, consider two families of exponential distributions, $P_1 = \{p_\mu : 0 < \mu \le \alpha\}$, and $P_2 = \{p_\mu : \mu \ge \beta\}$. Build the optimal (in terms of its risk) balanced detector for $P_1$, $P_2$. What happens with the risk of the detector you have built when the families $P_\chi$, $\chi = 1, 2$, are replaced with their convex hulls?

Exercise 2.14. [Follow-up to Exercise 2.13] Assume that the “lifetime” $\zeta$ of a lightbulb is a realization of a random variable with exponential distribution (i.e., the density $p_\mu(\zeta) = \mu e^{-\mu\zeta}$, $\zeta \ge 0$; in particular, the expected lifespan of a lightbulb in this model is $1/\mu$).³² Given a lot of lightbulbs, you should decide whether they were produced under normal conditions (resulting in $\mu \le \alpha = 1$) or under abnormal ones (resulting in $\mu \ge \beta = 1.5$). To this end, you can select at random $K$ lightbulbs and test them. How many lightbulbs should you test in order to make a 0.99-reliable conclusion? Answer this question in the situations when the observation $\omega$ in a test is
1. the lifespan of a lightbulb (i.e., $\omega \sim p_\mu(\cdot)$);
2. the minimum $\omega = \min[\zeta, \delta]$ of the lifespan $\zeta \sim p_\mu(\cdot)$ and the allowed duration $\delta > 0$ of your test (i.e., if the lightbulb you are testing does not “die” on time horizon $\delta$, you terminate the test);
3. $\omega = \chi_{\zeta < \delta}$, the indicator of the event $\zeta < \delta$, where $\delta > 0$ is the allowed test duration (i.e., you observe whether or not a lightbulb “dies” on time horizon $\delta$, but do not register the lifespan when it is $< \delta$).

Consider the values 0.25, 0.5, 1, 2, 4 of $\delta$.

³² In Reliability, the probability distribution of the lifespan $\zeta$ of an organism or a technical device is characterized by the failure rate $\lambda(t) = \lim_{\delta \to +0}\frac{\mathrm{Prob}\{t \le \zeta \le t + \delta\}}{\delta \cdot \mathrm{Prob}\{\zeta \ge t\}}$ (so that for small $\delta$, $\lambda(t)\delta$ is the conditional probability to “die” in the time interval $[t, t + \delta]$ provided the organism or device is still “alive” at time $t$). The exponential distribution corresponds to the case of failure rate independent of $t$; in applications, this indeed is often the case except for “very small” and “very large” values of $t$.

Exercise 2.15.
[Follow-up to Exercise 2.14] In the situation of Exercise 2.14, build a sequential test for deciding on the null hypothesis “the lifespan of a lightbulb from a given lot is $\zeta \sim p_\mu(\cdot)$ with $\mu \le 1$” (recall that $p_\mu(z)$ is the exponential density $\mu e^{-\mu z}$ on the ray $\{z \ge 0\}$) vs. the alternative “the lifespan is $\zeta \sim p_\mu(\cdot)$ with $\mu > 1$.” In this test, you can select a number $K$ of lightbulbs from the lot, switch them on at time 0 and record the actual lifetimes of the lightbulbs you are testing. As a result, at the end of (any) observation interval $\Delta = [0, \delta]$, you observe $K$ independent realizations of the r.v. $\min[\zeta, \delta]$, where $\zeta \sim p_\mu(\cdot)$ with some unknown $\mu$. In your sequential test, you are welcome to make conclusions at the endpoints $\delta_1 < \delta_2 < ... < \delta_S$ of several observation intervals.
Note: We deliberately skip details of the problem’s setting; how you decide on these missing details is part of your solution to the exercise.

Exercise 2.16. In Section 2.6, we consider a model of elections where every member of the population was supposed to cast a vote. Enrich the model by incorporating the option for a voter not to participate in the elections at all. Implement the Sequential test for the resulting model and run simulations.

Exercise 2.17. Work out the following extension of the Opinion Poll Design problem. You are given two finite sets, $\Omega_1 = \{1, ..., I\}$ and $\Omega_2 = \{1, ..., M\}$, along with $L$ nonempty closed convex subsets $Y_\ell$ of the set

$$\Delta_{IM} = \left\{[y_{im} > 0]_{i,m} : \sum_{i=1}^{I}\sum_{m=1}^{M} y_{im} = 1\right\}$$

of all nonvanishing probability distributions on $\Omega = \Omega_1 \times \Omega_2 = \{(i, m) : 1 \le i \le I, 1 \le m \le M\}$. Sets $Y_\ell$ are such that all distributions from $Y_\ell$ have a common marginal distribution $\theta^\ell > 0$ of $i$:

$$\sum_{m=1}^{M} y_{im} = \theta_i^\ell, \quad 1 \le i \le I, \ \forall y \in Y_\ell, \ 1 \le \ell \le L.$$

Your observations $\omega_1, \omega_2, ...$ are sampled, independently of each other, from a distribution partly selected “by nature,” and partly by you. Specifically, nature selects $\ell \le L$ and a distribution $y \in Y_\ell$, and you select a positive $I$-dimensional probabilistic vector $q$ from a given convex compact subset $Q$ of the positive part of the $I$-dimensional probabilistic simplex. Let $y_{|i}$ be the conditional distribution of $m \in \Omega_2$ given $i$ induced by $y$, so that $y_{|i}$ is the $M$-dimensional probabilistic vector with entries

$$[y_{|i}]_m = \frac{y_{im}}{\sum_{\mu \le M} y_{i\mu}} = \frac{y_{im}}{\theta_i^\ell}.$$
In order to generate $\omega_t=(i_t,m_t)\in\Omega$, you draw $i_t$ at random from the distribution $q$, and then nature draws $m_t$ at random from the distribution $y_{|i_t}$. Given closeness relation $\mathcal{C}$, your goal is to decide, up to closeness $\mathcal{C}$, on the hypotheses $H_1,...,H_L$, with $H_\ell$ stating that the distribution $y$ selected by nature belongs to $Y_\ell$. Given an “observation budget” (a number $K$ of observations $\omega_k$ you can use), you want to find a probabilistic vector $q$ which results in the test with as small a $\mathcal{C}$-risk as possible. Pose this Measurement Design problem as an efficiently solvable convex optimization problem.

Exercise 2.18. [Probabilities of deviations from the mean] The goal of what follows is to present the most straightforward application of simple families of distributions: bounds on probabilities of deviations of random vectors from their means. Let $\mathcal{H}\subset\Omega=\mathbf{R}^d$, $\mathcal{M}$, $\Phi$ be regular data such that $0\in\mathrm{int}\,\mathcal{H}$, $\mathcal{M}$ is compact, $\Phi(0;\mu)=0$ $\forall\mu\in\mathcal{M}$, and $\Phi(h;\mu)$ is differentiable at $h=0$ for every $\mu\in\mathcal{M}$. Let, further, $\bar P\in\mathcal{S}[\mathcal{H},\mathcal{M},\Phi]$ and let $\bar\mu\in\mathcal{M}$ be a parameter of $\bar P$. Prove that
1. $\bar P$ possesses expectation $e[\bar P]$, and $e[\bar P]=\nabla_h\Phi(0;\bar\mu)$.
2. For every linear form $e^T\omega$ on $\Omega$ it holds
$$\pi:=\bar P\{\omega: e^T(\omega-e[\bar P])\ge 1\}\le\exp\Big\{\inf_{t\ge 0:\,te\in\mathcal{H}}\big[\Phi(te;\bar\mu)-te^T\nabla_h\Phi(0;\bar\mu)-t\big]\Big\}.\qquad(2.160)$$
What are the consequences of (2.160) for sub-Gaussian distributions?

Exercise 2.19. [testing convex hypotheses on mixtures] Consider the situation as follows. For given positive integers $K$ and $L$ and for $\chi=1,2$, given are
• nonempty convex compact signal sets $U_\chi\subset\mathbf{R}^{n_\chi}$,
• regular data $\mathcal{H}^\chi_{k\ell}\subset\mathbf{R}^{d_k}$, $\mathcal{M}^\chi_{k\ell}$, $\Phi^\chi_{k\ell}$, and affine mappings $u_\chi\mapsto A^\chi_{k\ell}[u_\chi;1]:\mathbf{R}^{n_\chi}\to\mathbf{R}^{d_k}$ such that
$$u_\chi\in U_\chi\Rightarrow A^\chi_{k\ell}[u_\chi;1]\in\mathcal{M}^\chi_{k\ell},\quad 1\le k\le K,\ 1\le\ell\le L,$$
• probabilistic vectors $\mu^k=[\mu^k_1;...;\mu^k_L]$, $1\le k\le K$.
We can associate with the outlined data families of probability distributions $\mathcal{P}_\chi$ on the observation space $\Omega=\mathbf{R}^{d_1}\times...\times\mathbf{R}^{d_K}$ as follows. For $\chi=1,2$, $\mathcal{P}_\chi$ is comprised of all probability distributions $P$ of random vectors $\omega^K=[\omega_1;...;\omega_K]\in\Omega$ generated as follows: We select
• a signal $u_\chi\in U_\chi$,
• a collection of probability distributions $P_{k\ell}\in\mathcal{S}[\mathcal{H}^\chi_{k\ell},\mathcal{M}^\chi_{k\ell},\Phi^\chi_{k\ell}]$, $1\le k\le K$, $1\le\ell\le L$, in such a way that $A^\chi_{k\ell}[u_\chi;1]$ is a parameter of $P_{k\ell}$:
$$\forall h_k\in\mathcal{H}^\chi_{k\ell}:\ \ln\Big(\mathbf{E}_{\omega_k\sim P_{k\ell}}\big\{e^{h_k^T\omega_k}\big\}\Big)\le\Phi^\chi_{k\ell}(h_k;A^\chi_{k\ell}[u_\chi;1]);$$
• we generate the components $\omega_k$, $k=1,...,K$, independently across $k$, from the $\mu^k$-mixture $\Pi[\{P_{k\ell}\}_{\ell=1}^L,\mu^k]$ of distributions $P_{k\ell}$, $\ell=1,...,L$, that is, draw at random,
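from distribution $\mu^k$ on $\{1,...,L\}$, index $\ell$, and then draw $\omega_k$ from the distribution $P_{k\ell}$.
Prove that when setting
$$\begin{array}{rcl}
\mathcal{H}^\chi&=&\big\{h=[h_1;...;h_K]\in\mathbf{R}^{d=d_1+...+d_K}:\ h_k\in\bigcap\limits_{\ell=1}^{L}\mathcal{H}^\chi_{k\ell},\ 1\le k\le K\big\},\\
\mathcal{M}^\chi&=&\{0\}\subset\mathbf{R},\\
\Phi^\chi(h;\mu)&=&\sum\limits_{k=1}^{K}\ln\Big(\sum\limits_{\ell=1}^{L}\mu^k_\ell\exp\Big\{\max\limits_{u_\chi\in U_\chi}\Phi^\chi_{k\ell}(h_k;A^\chi_{k\ell}[u_\chi;1])\Big\}\Big):\ \mathcal{H}^\chi\times\mathcal{M}^\chi\to\mathbf{R},
\end{array}$$
we obtain the regular data such that
$$\mathcal{P}_\chi\subset\mathcal{S}[\mathcal{H}^\chi,\mathcal{M}^\chi,\Phi^\chi].$$
Explain how to use this observation to compute via Convex Programming an affine detector and its risk for the families of distributions $\mathcal{P}_1$ and $\mathcal{P}_2$.

Exercise 2.20. [Mixture of sub-Gaussian distributions] Let $P_\ell$ be sub-Gaussian distributions on $\mathbf{R}^d$ with sub-Gaussianity parameters $\theta_\ell,\Theta$, $1\le\ell\le L$, with a common $\Theta$-parameter, and let $\nu=[\nu_1;...;\nu_L]$ be a probabilistic vector. Consider the $\nu$-mixture $P=\Pi[P^L,\nu]$ of distributions $P_\ell$, so that $\omega\sim P$ is generated as follows: we draw at random from distribution $\nu$ index $\ell$ and then draw $\omega$ at random from distribution $P_\ell$. Prove that $P$ is sub-Gaussian with sub-Gaussianity parameters $\bar\theta=\sum_\ell\nu_\ell\theta_\ell$ and $\bar\Theta$, with (any) $\bar\Theta$ chosen to satisfy
$$\bar\Theta\succeq\Theta+\tfrac{6}{5}[\theta_\ell-\bar\theta][\theta_\ell-\bar\theta]^T\quad\forall\ell,$$
in particular, according to any one of the following rules:
1. $\bar\Theta=\Theta+\tfrac{6}{5}\max_\ell\|\theta_\ell-\bar\theta\|_2^2\,I_d$,
2. $\bar\Theta=\Theta+\tfrac{6}{5}\sum_\ell(\theta_\ell-\bar\theta)(\theta_\ell-\bar\theta)^T$,
3. $\bar\Theta=\Theta+\tfrac{6}{5}\sum_\ell\theta_\ell\theta_\ell^T$, provided that $\nu_1=...=\nu_L=1/L$.

Exercise 2.21.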
The goal of this exercise is to give a simple sufficient condition for quadratic lifting “to work” in the Gaussian case. Namely, let $A_\chi$, $U_\chi$, $\mathcal{V}_\chi$, $\mathcal{G}_\chi$, $\chi=1,2$, be as in Section 2.9.3, with the only difference that now we do not assume the compact sets $U_\chi$ to be convex, and let $\mathcal{Z}_\chi$ be convex compact subsets of the sets $\mathcal{Z}^{n_\chi}$ (see item i.2 in Proposition 2.43) such that
$$[u_\chi;1][u_\chi;1]^T\in\mathcal{Z}_\chi\quad\forall u_\chi\in U_\chi,\ \chi=1,2.\qquad(*)$$
Augmenting the above data with $\Theta_*^{(\chi)}$, $\delta_\chi$ such that $\mathcal{V}=\mathcal{V}_\chi$, $\Theta_*=\Theta_*^{(\chi)}$, $\delta=\delta_\chi$ satisfy (2.129), $\chi=1,2$, and invoking Proposition 2.43.ii, we get at our disposal a quadratic detector $\phi_{\mathrm{lift}}$ such that
$$\mathrm{Risk}[\phi_{\mathrm{lift}}|\mathcal{G}_1,\mathcal{G}_2]\le\exp\{\mathrm{SadVal}_{\mathrm{lift}}\},$$
with $\mathrm{SadVal}_{\mathrm{lift}}$ given by (2.134). A natural question is, when is $\mathrm{SadVal}_{\mathrm{lift}}$ negative, meaning that our quadratic detector indeed “is working”: its risk is $<1$, implying that when repeated observations are allowed, tests based upon this detector are consistent, that is, able to decide on the hypotheses $H_\chi: P\in\mathcal{G}_\chi$, $\chi=1,2$, on the distribution of observation $\zeta\sim P$ with any small desired risk $\epsilon\in(0,1)$. With our computation-oriented ideology, this is not too important a question, since we can answer it via efficient computation. This being said, there is no harm in a “theoretical” answer which could provide us with an additional insight. The goal of the exercise is to justify a simple result on the subject. Here is the exercise:
In the situation in question, assume that $\mathcal{V}_1=\mathcal{V}_2=\{\Theta_*\}$, which allows us to set $\Theta_*^{(\chi)}=\Theta_*$, $\delta_\chi=0$, $\chi=1,2$. Prove that in this case a necessary and sufficient condition for $\mathrm{SadVal}_{\mathrm{lift}}$ to be negative is that the convex compact sets
$$\mathcal{U}_\chi=\{B_\chi ZB_\chi^T:\ Z\in\mathcal{Z}_\chi\}\subset\mathbf{S}^{d+1}_+,\quad\chi=1,2,$$
do not intersect with each other.

Exercise 2.22. Prove that if $X$ is a nonempty convex compact set in $\mathbf{R}^d$, then the function $\widehat\Phi(h;\mu)$ given by (2.114) is real-valued and continuous on $\mathbf{R}^d\times X$ and is convex in $h$ and concave in $\mu$.

Exercise 2.23.
The goal of what follows is to refine the change detection procedure (let us refer to it as the “basic” one) developed in Section 2.9.5.1. The idea is pretty simple. With the notation from Section 2.9.5.1, in the basic procedure, when testing the null hypothesis $H_0$ vs. signal hypothesis $H_t^\rho$, we look at the difference $\zeta_t=\omega_t-\omega_1$ and try to decide whether the energy of the deterministic component $x_t-x_1$ of $\zeta_t$ is 0, as is the case under $H_0$, or is $\ge\rho^2$, as is the case under $H_t^\rho$. Note that if $\sigma\in[\underline\sigma,\overline\sigma]$ is the actual intensity of the observation noise, then the noise component of $\zeta_t$ is $\mathcal{N}(0,2\sigma^2I_d)$; other things being equal, the larger the noise in $\zeta_t$, the larger $\rho$ should be to allow for a reliable (with a given reliability level) decision. Now note that under the hypothesis $H_t^\rho$, we have $x_1=...=x_{t-1}$, so that the deterministic component of the difference $\zeta_t=\omega_t-\omega_1$ is exactly the same as for the difference $\widetilde\zeta_t=\omega_t-\frac{1}{t-1}\sum_{s=1}^{t-1}\omega_s$, while the noise component in $\widetilde\zeta_t$ is $\mathcal{N}(0,\sigma_t^2I_d)$ with $\sigma_t^2=\sigma^2+\frac{1}{t-1}\sigma^2=\frac{t}{t-1}\sigma^2$. Thus, the intensity of noise in $\widetilde\zeta_t$ is at most the same as in $\zeta_t$, and this intensity, in contrast to that for $\zeta_t$, decreases as $t$ grows. Here comes the exercise:
Let reliability tolerances $\epsilon,\varepsilon\in(0,1)$ be given, and let our goal be to design a system of inferences $\mathcal{T}_t$, $t=2,3,...,K$, which, when used in the same fashion as tests $\mathcal{T}_t^\kappa$ were used in the basic procedure, results in false alarm probability at most $\epsilon$ and in probability to miss a change of energy $\ge\rho^2$ at most $\varepsilon$. Needless to say, we want to achieve this goal with as small a $\rho$ as possible. Think how to utilize the above observation to refine the basic procedure, eventually reducing (and provably not increasing) the required value of $\rho$. Implement the basic and the refined change detection procedures and compare their quality (the resulting values of $\rho$), e.g., on the data used in the experiment reported in Section 2.9.5.1.
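The variance bookkeeping behind the observation in Exercise 2.23 can be sketched in a few lines. This is only an illustration under the assumption that the observation noises are independent across time with a common variance $\sigma^2$ per coordinate; the function name `noise_variance_factor` is hypothetical and not part of the text.

```python
from fractions import Fraction

def noise_variance_factor(t):
    """Noise variance of omega_t - mean(omega_1, ..., omega_{t-1}), in units of sigma^2.

    Assumes independent noises with a common variance sigma^2, so the variance
    of a linear combination is sigma^2 times the sum of squared coefficients.
    """
    coeffs = [Fraction(-1, t - 1)] * (t - 1) + [Fraction(1)]
    return sum(c * c for c in coeffs)

# Factor t/(t-1) for the averaged difference, versus the constant factor 2
# for the basic difference omega_t - omega_1.
factors = {t: noise_variance_factor(t) for t in range(2, 8)}
for t, f in factors.items():
    print(t, f, float(f))
```

For $t=2$ the two differences coincide (factor 2); for larger $t$ the factor $t/(t-1)$ decreases toward 1, which is the source of the potential gain in $\rho$.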
2.11 PROOFS
2.11.1 Proof of the observation in Remark 2.8
We have to prove that if $p=[p_1;...;p_K]\in B=[0,1]^K$, then the probability $P_M(p)$ of the event

“the total number of heads in $K$ independent coin tosses, with probability $p_k$ to get heads in the $k$-th toss, is at least $M$”

is a nondecreasing function of $p$: if $p'\le p''$, $p',p''\in B$, then $P_M(p')\le P_M(p'')$. To see it, let us associate with $p\in B$ a subset of $B$, specifically, $B_p=\{x\in B: 0\le x_k\le p_k, 1\le k\le K\}$, and a function $\chi_p(x): B\to\{0,1\}$ which is equal to 0 at every point $x\in B$ where the number of entries $x_k$ satisfying $x_k\le p_k$ is less than $M$, and is equal to 1 otherwise. It is immediately seen that
$$P_M(p)\equiv\int_B\chi_p(x)\,dx\qquad(2.161)$$
(since with respect to the uniform distribution on $B$, the events $E_k=\{x\in B: x_k\le p_k\}$ are independent across $k$ and have probabilities $p_k$, and the right-hand side in (2.161) is exactly the probability, taken w.r.t. the uniform distribution on $B$, of the event “at least $M$ of the events $E_1,...,E_K$ take place”). But the right-hand side in (2.161) clearly is nondecreasing in $p\in B$, since $\chi_p$, by construction, is the characteristic function of the set
$$B[p]=\{x:\ \text{at least } M \text{ of the entries } x_k \text{ in } x \text{ satisfy } x_k\le p_k\},$$
and these sets clearly grow when $p$ increases entrywise. ✷
2.11.2 Proof of Proposition 2.6 in the case of quasi-stationary K-repeated observations
2.11.2.A Situation and goal. We are in the case QS (see Section 2.2.3.1) of the setting described at the beginning of Section 2.2.3. It suffices to verify that if $H_\ell$, $\ell\in\{1,2\}$, is true, then the probability for $\mathcal{T}_K^{\mathrm{maj}}$ to reject $H_\ell$ is at most the quantity $\epsilon_K$ defined in (2.23). Let us verify this statement in the case of $\ell=1$; the reasoning for $\ell=2$ “mirrors” the one to follow.
It is clear that our situation and goal can be formulated as follows:
• “In nature” there exists a random sequence $\zeta^K=(\zeta_1,...,\zeta_K)$ of driving factors and a collection of deterministic functions $\theta_k(\zeta^k=(\zeta_1,...,\zeta_k))$33 taking values in $\Omega=\mathbf{R}^d$ such that our $k$-th observation is $\omega_k=\theta_k(\zeta^k)$. Additionally, the conditional distribution $P_{\omega_k|\zeta^{k-1}}$ of $\omega_k$ given $\zeta^{k-1}$ always belongs to the family $\mathcal{P}_1$ comprised of distributions of random vectors of the form $x+\xi$, where deterministic $x$ belongs to $X_1$ and the distribution of $\xi$ belongs to $\mathcal{P}_\gamma^d$.
• There exist deterministic functions $\chi_k:\Omega\to\{0,1\}$ and integer $M$, $1\le M\le K$, such that the test $\mathcal{T}_K^{\mathrm{maj}}$, as applied to observation $\omega^K=(\omega_1,...,\omega_K)$, rejects $H_1$ if and only if the number of 1’s among the quantities $\chi_k(\omega_k)$, $1\le k\le K$, is at least $M$. In the situation of Proposition 2.6, $M=\rfloor K/2\lfloor$ and $\chi_k(\cdot)$ are in fact independent of $k$: $\chi_k(\omega)=1$ if and only if $\phi(\omega)\le 0$.34
• What we know is that the conditional probability of the event $\chi_k(\omega_k=\theta_k(\zeta^k))=1$, $\zeta^{k-1}$ being given, is at most $\epsilon_\star$:
$$P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}\le\epsilon_\star\quad\forall\zeta^{k-1}.$$
Indeed, $P_{\omega_k|\zeta^{k-1}}\in\mathcal{P}_1$. As a result,
$$P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}=P_{\omega_k|\zeta^{k-1}}\{\omega_k:\phi(\omega_k)\le 0\}=P_{\omega_k|\zeta^{k-1}}\{\omega_k:\phi(\omega_k)<0\}\le\epsilon_\star,$$
where the second equality is due to the fact that $\phi(\omega)$ is a nonconstant affine function and $P_{\omega_k|\zeta^{k-1}}$, along with all distributions from $\mathcal{P}_1$, has density, and the inequality is given by the origin of $\epsilon_\star$, which upper-bounds the risk of the single-observation test underlying $\mathcal{T}_K^{\mathrm{maj}}$.
33 As always, given a $K$-element sequence, say, $\zeta_1,...,\zeta_K$, we write $\zeta^t$, $t\le K$, as a shorthand for the fragment $\zeta_1,...,\zeta_t$ of this sequence.
What we want to prove is that under the circumstances we have just summarized, we have
$$P_{\omega^K}\{\omega^K=(\omega_1,...,\omega_K):\ \mathrm{Card}\{k\le K:\chi_k(\omega_k)=1\}\ge M\}\le\epsilon_M=\sum_{M\le k\le K}\binom{K}{k}\epsilon_\star^k(1-\epsilon_\star)^{K-k},\qquad(2.162)$$
where $P_{\omega^K}$ is the distribution of $\omega^K=\{\omega_k=\theta_k(\zeta^k)\}_{k=1}^K$ induced by the distribution of hidden factors. There is nothing to prove when $\epsilon_\star=1$, since in this case $\epsilon_M=1$. Thus, we assume from now on that $\epsilon_\star<1$.
2.11.2.B Achieving the goal, step 1. Our reasoning, inspired by that used to justify Remark 2.8, is as follows. Consider a sequence of random variables $\eta_k$, $1\le k\le K$, uniformly distributed on $[0,1]$ and independent of each other and of $\zeta^K$, and consider new driving factors $\lambda_k=[\zeta_k;\eta_k]$ and new observations35
$$\mu_k=[\omega_k=\theta_k(\zeta^k);\eta_k]=\Theta_k(\lambda^k=(\lambda_1,...,\lambda_k))\qquad(2.163)$$
driven by these new driving factors, and let
$$\psi_k(\mu_k=[\omega_k;\eta_k])=\chi_k(\omega_k).$$
34 In fact, we need to write $\phi(\omega)<0$ instead of $\phi(\omega)\le 0$; we replace the strict inequality with its nonstrict version in order to make our reasoning applicable to the case of $\ell=2$, where nonstrict inequalities do arise. Clearly, replacing in the definition of $\chi_k$ strict inequality with the nonstrict one, we only increase the “rejection domain” of $H_1$, so that the upper bound on the probability of this domain we are about to get automatically is valid for the true rejection domain.
35 In this display, as in what follows, whenever some of the variables $\lambda,\omega,\zeta,\eta,\mu$ appear in the same context, it should always be understood that $\zeta_t$ and $\eta_t$ are components of $\lambda_t=[\zeta_t;\eta_t]$, $\mu_t=[\omega_t;\eta_t]=\Theta_t(\lambda^t)$, and $\omega_t=\theta_t(\zeta^t)$. To remind us about these “hidden relations,” we sometimes write something like $\phi(\omega_k=\theta_k(\zeta^k))$ to stress that we are speaking about the value of function $\phi$ at the point $\omega_k=\theta_k(\zeta^k)$.
It is immediately seen that
• $\mu_k=[\omega_k=\theta_k(\zeta^k);\eta_k]$ is a deterministic function, $\Theta_k(\lambda^k)$, of $\lambda^k$, and the conditional distribution $P_{\mu_k|\lambda^{k-1}}$ of $\mu_k$ given $\lambda^{k-1}=[\zeta^{k-1};\eta^{k-1}]$ is the product distribution $P_{\omega_k|\zeta^{k-1}}\times U$ on $\Omega\times[0,1]$, where $U$ is the uniform distribution on $[0,1]$. In particular,
$$\pi_k(\lambda^{k-1}):=P_{\mu_k|\lambda^{k-1}}\{\mu_k=[\omega_k;\eta_k]:\chi_k(\omega_k)=1\}=P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}\le\epsilon_\star.\qquad(2.164)$$
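• We have
$$P_{\lambda^K}\{\lambda^K:\ \mathrm{Card}\{k\le K:\psi_k(\mu_k=\Theta_k(\lambda^k))=1\}\ge M\}=P_{\omega^K}\{\omega^K=(\omega_1,...,\omega_K):\ \mathrm{Card}\{k\le K:\chi_k(\omega_k)=1\}\ge M\},\qquad(2.165)$$
where $P_{\omega^K}$ is as in (2.162), and $\Theta_k(\cdot)$ is defined in (2.163).
Now let us define $\psi_k^+(\lambda^k)$ as follows:
• when $\psi_k(\Theta_k(\lambda^k))=1$, or, which is the same, $\chi_k(\omega_k=\theta_k(\zeta^k))=1$, we set $\psi_k^+(\lambda^k)=1$ as well;
• when $\psi_k(\Theta_k(\lambda^k))=0$, or, which is the same, $\chi_k(\omega_k=\theta_k(\zeta^k))=0$, we set $\psi_k^+(\lambda^k)=1$ whenever
$$\eta_k\le\gamma_k(\lambda^{k-1}):=\frac{\epsilon_\star-\pi_k(\lambda^{k-1})}{1-\pi_k(\lambda^{k-1})}$$
and $\psi_k^+(\lambda^k)=0$ otherwise.
Let us make the following immediate observations:
(A) Whenever $\lambda^k$ is such that $\psi_k(\mu_k=\Theta_k(\lambda^k))=1$, we also have $\psi_k^+(\lambda^k)=1$;
(B) The conditional probability of the event $\psi_k^+(\lambda^k)=1$, given $\lambda^{k-1}=[\zeta^{k-1};\eta^{k-1}]$, is exactly $\epsilon_\star$.
Indeed, let $P_{\lambda_k|\lambda^{k-1}}$ be the conditional distribution of $\lambda_k$ given $\lambda^{k-1}$. Let us fix $\lambda^{k-1}$. The event $E=\{\lambda_k:\psi_k^+(\lambda^k)=1\}$, by construction, is the union of two nonoverlapping events:
$$\begin{array}{rcl}
E_1&=&\{\lambda_k=[\zeta_k;\eta_k]:\ \chi_k(\theta_k(\zeta^k))=1\},\\
E_2&=&\{\lambda_k=[\zeta_k;\eta_k]:\ \chi_k(\theta_k(\zeta^k))=0,\ \eta_k\le\gamma_k(\lambda^{k-1})\}.
\end{array}$$
Taking into account that the conditional distribution of $\mu_k=[\omega_k=\theta_k(\zeta^k);\eta_k]$, $\lambda^{k-1}$ being fixed, is the product distribution $P_{\omega_k|\zeta^{k-1}}\times U$, we conclude in view of (2.164) that
$$\begin{array}{rcl}
P_{\lambda_k|\lambda^{k-1}}\{E_1\}&=&P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=1\}=\pi_k(\lambda^{k-1}),\\
P_{\lambda_k|\lambda^{k-1}}\{E_2\}&=&P_{\omega_k|\zeta^{k-1}}\{\omega_k:\chi_k(\omega_k)=0\}\,U\{\eta\le\gamma_k(\lambda^{k-1})\}\\
&=&(1-\pi_k(\lambda^{k-1}))\gamma_k(\lambda^{k-1}),
\end{array}$$
which combines with the definition of $\gamma_k(\cdot)$ to imply (B).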
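The arithmetic behind (B) is that "topping up" a rejection probability $\pi\le\epsilon_\star$ with an independent uniform draw yields exactly $\epsilon_\star$. A minimal sketch in exact rational arithmetic follows; the function names `gamma` and `prob_psi_plus` and the sample values are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction

def gamma(eps_star, pi):
    """Threshold gamma = (eps_star - pi) / (1 - pi); assumes pi <= eps_star < 1."""
    return (eps_star - pi) / (1 - pi)

def prob_psi_plus(eps_star, pi):
    """P(E1) + P(E2) = pi + (1 - pi) * gamma for the two disjoint events in (B)."""
    return pi + (1 - pi) * gamma(eps_star, pi)

eps_star = Fraction(3, 10)
pis = [Fraction(j, 20) for j in range(0, 7)]   # pi = 0, 1/20, ..., 6/20 = eps_star
results = [prob_psi_plus(eps_star, pi) for pi in pis]
print(results)
```

Whatever the conditional probability $\pi$ is, as long as it does not exceed $\epsilon_\star$, the threshold lies in $[0,1]$ and the augmented event has probability $\epsilon_\star$.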
2.11.2.C Achieving the goal, step 2. By (A) combined with (2.165) we have
$$\begin{array}{l}
P_{\omega^K}\{\omega^K:\ \mathrm{Card}\{k\le K:\chi_k(\omega_k)=1\}\ge M\}\\
\quad=P_{\lambda^K}\{\lambda^K:\ \mathrm{Card}\{k\le K:\psi_k(\mu_k=\Theta_k(\lambda^k))=1\}\ge M\}\\
\quad\le P_{\lambda^K}\{\lambda^K:\ \mathrm{Card}\{k\le K:\psi_k^+(\lambda^k)=1\}\ge M\},
\end{array}$$
and all we need to verify is that the first quantity in this chain is upper-bounded by the quantity $\epsilon_M$ given by (2.162). Invoking (B), it is enough to prove the following claim:
(!) Let $\lambda^K=(\lambda_1,...,\lambda_K)$ be a random sequence with probability distribution $P$, let $\psi_k^+(\lambda^k)$ take values 0 and 1 only, and let for every $k\le K$ the conditional probability for $\psi_k^+(\lambda^k)$ to take value 1, $\lambda^{k-1}$ being fixed, be equal to $\epsilon_\star$, for all $\lambda^{k-1}$. Then the $P$-probability of the event
$$\{\lambda^K:\ \mathrm{Card}\{k\le K:\psi_k^+(\lambda^k)=1\}\ge M\}$$
is equal to $\epsilon_M$ given by (2.162).
This is immediate. For integers $k,m$, $1\le k\le K$, $m\ge 0$, let $\chi^k_m(\lambda^k)$ be the characteristic function of the event
$$\{\lambda^k:\ \mathrm{Card}\{t\le k:\psi_t^+(\lambda^t)=1\}=m\},$$
and let
$$\pi^k_m=P\{\lambda^K:\chi^k_m(\lambda^k)=1\}.$$
We have the following evident recurrence:
$$\chi^k_m(\lambda^k)=\chi^{k-1}_m(\lambda^{k-1})(1-\psi_k^+(\lambda^k))+\chi^{k-1}_{m-1}(\lambda^{k-1})\psi_k^+(\lambda^k),\quad k=1,2,...$$
augmented by the “boundary conditions” $\chi^0_m=0$, $m>0$, $\chi^0_0=1$, $\chi^{k-1}_{-1}=0$ for all $k\ge 1$. Taking expectation w.r.t. $P$ and utilizing the fact that the conditional expectation of $\psi_k^+(\lambda^k)$ given $\lambda^{k-1}$ is, identically in $\lambda^{k-1}$, equal to $\epsilon_\star$, we get
$$\begin{array}{rcl}
\pi^k_m&=&\pi^{k-1}_m(1-\epsilon_\star)+\pi^{k-1}_{m-1}\epsilon_\star,\quad k=1,...,K,\\
\pi^0_m&=&\left\{\begin{array}{ll}1,&m=0,\\ 0,&m>0,\end{array}\right.\qquad \pi^{k-1}_{-1}=0,\ k=1,2,...
\end{array}$$
whence
$$\pi^k_m=\left\{\begin{array}{ll}\binom{k}{m}\epsilon_\star^m(1-\epsilon_\star)^{k-m},&m\le k,\\ 0,&m>k.\end{array}\right.$$
Therefore,
$$P\{\lambda^K:\ \mathrm{Card}\{k\le K:\psi_k^+(\lambda^k)=1\}\ge M\}=\sum_{M\le k\le K}\pi^K_k=\epsilon_M,$$
as required. ✷
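The recurrence for $\pi^k_m$ and its binomial solution can be checked numerically. The sketch below is illustrative only: the function names `count_distribution` and `eps_M` and the values of $K$, $M$, $\epsilon_\star$ are assumptions chosen for the example.

```python
from math import comb

def count_distribution(K, eps):
    """Run pi^k_m = pi^{k-1}_m (1 - eps) + pi^{k-1}_{m-1} eps for k = 1..K.

    Starts from pi^0_0 = 1, pi^0_m = 0 for m > 0, and returns [pi^K_0, ..., pi^K_K].
    """
    pi = [1.0] + [0.0] * K
    for k in range(1, K + 1):
        new = [0.0] * (K + 1)
        for m in range(0, k + 1):
            prev_lower = pi[m - 1] if m >= 1 else 0.0   # boundary condition pi^{k-1}_{-1} = 0
            new[m] = pi[m] * (1 - eps) + prev_lower * eps
        pi = new
    return pi

def eps_M(K, M, eps):
    """Binomial tail from (2.162)."""
    return sum(comb(K, k) * eps**k * (1 - eps) ** (K - k) for k in range(M, K + 1))

K, M, eps = 15, 8, 0.2
pi = count_distribution(K, eps)
tail = sum(pi[M:])
print(tail, eps_M(K, M, eps))
```

The two printed numbers agree up to rounding, as the closed form for $\pi^k_m$ predicts.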
2.11.3 Proof of Theorem 2.23
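1°. Since $\mathcal{O}$ is a simple o.s., the function $\Phi(\phi,[\mu;\nu])$ given by (2.56) is a well-defined real-valued function on $\mathcal{F}\times(\mathcal{M}\times\mathcal{M})$ which is concave in $[\mu;\nu]$; convexity of the function in $\phi\in\mathcal{F}$ is evident. Since both $\mathcal{F}$ and $\mathcal{M}$ are convex sets coinciding with their relative interiors, convexity-concavity and real-valuedness of $\Phi$ on $\mathcal{F}\times(\mathcal{M}\times\mathcal{M})$ imply the continuity of $\Phi$ on the indicated domain. As a consequence, $\Phi$ is a convex-concave continuous real-valued function on $\mathcal{F}\times(M_1\times M_2)$.
Now let
$$\underline\Phi(\mu,\nu)=\inf_{\phi\in\mathcal{F}}\Phi(\phi,[\mu;\nu]).\qquad(2.166)$$
Note that $\underline\Phi$, being the infimum of a family of concave functions of $[\mu;\nu]\in\mathcal{M}\times\mathcal{M}$, is concave on $\mathcal{M}\times\mathcal{M}$. We claim that for $\mu,\nu\in\mathcal{M}$ the function
$$\phi_{\mu,\nu}(\omega)=\tfrac12\ln(p_\mu(\omega)/p_\nu(\omega))$$
(which, by definition of a simple o.s., belongs to $\mathcal{F}$) is an optimal solution to the right-hand side minimization problem in (2.166), so that
$$\forall(\mu\in M_1,\nu\in M_2):\quad\underline\Phi([\mu;\nu]):=\inf_{\phi\in\mathcal{F}}\Phi(\phi,[\mu;\nu])=\Phi(\phi_{\mu,\nu},[\mu;\nu])=\ln\Big(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big).\qquad(2.167)$$
Indeed, we have
$$\exp\{-\phi_{\mu,\nu}(\omega)\}p_\mu(\omega)=\exp\{\phi_{\mu,\nu}(\omega)\}p_\nu(\omega)=g(\omega):=\sqrt{p_\mu(\omega)p_\nu(\omega)},\qquad(2.168)$$
whence $\Phi(\phi_{\mu,\nu},[\mu;\nu])=\ln\big(\int_\Omega g(\omega)\Pi(d\omega)\big)$. On the other hand, for $\phi(\cdot)=\phi_{\mu,\nu}(\cdot)+\delta(\cdot)\in\mathcal{F}$ we have
$$\begin{array}{rl}
(a)&\int_\Omega g(\omega)\Pi(d\omega)=\int_\Omega\big[\sqrt{g(\omega)}\exp\{-\delta(\omega)/2\}\big]\big[\sqrt{g(\omega)}\exp\{\delta(\omega)/2\}\big]\Pi(d\omega)\\
&\quad\le\Big(\int_\Omega g(\omega)\exp\{-\delta(\omega)\}\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega g(\omega)\exp\{\delta(\omega)\}\Pi(d\omega)\Big)^{1/2}\\
&\quad=\Big(\int_\Omega\exp\{-\phi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega\exp\{\phi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)^{1/2}\quad[\text{by (2.168)}]\\
(b)&\Rightarrow\ \ln\Big(\int_\Omega g(\omega)\Pi(d\omega)\Big)\le\Phi(\phi,[\mu;\nu]),
\end{array}$$
and thus $\Phi(\phi_{\mu,\nu},[\mu;\nu])\le\Phi(\phi,[\mu;\nu])$ for every $\phi\in\mathcal{F}$.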
Remark 2.49. Note that the above reasoning did not use the fact that the minimization on the right-hand side of (2.166) is over $\phi\in\mathcal{F}$; in fact, this reasoning shows that $\phi_{\mu,\nu}(\cdot)$ minimizes $\Phi(\phi,[\mu;\nu])$ over all functions $\phi$ for which the integrals $\int_\Omega\exp\{-\phi(\omega)\}p_\mu(\omega)\Pi(d\omega)$ and $\int_\Omega\exp\{\phi(\omega)\}p_\nu(\omega)\Pi(d\omega)$ exist.
Remark 2.50. Note that the inequality in (b) can be equality only when the inequality in (a) is so. In other words, if $\bar\phi$ is a minimizer of $\Phi(\phi,[\mu;\nu])$ over $\phi\in\mathcal{F}$, setting $\delta(\cdot)=\bar\phi(\cdot)-\phi_{\mu,\nu}(\cdot)$, the functions $\sqrt{g(\omega)}\exp\{-\delta(\omega)/2\}$ and $\sqrt{g(\omega)}\exp\{\delta(\omega)/2\}$, considered as elements of $L_2[\Omega,\Pi]$, are proportional to each other. Since $g$ is positive and $g$, $\delta$ are continuous, while the support of $\Pi$ is the entire $\Omega$, this “$L_2$-proportionality” means that the functions in question differ by a constant factor, or, which is the same, that $\delta(\cdot)$ is constant. Thus, the minimizers of $\Phi(\phi,[\mu;\nu])$ over $\phi\in\mathcal{F}$ are exactly the functions of the form $\phi(\omega)=\phi_{\mu,\nu}(\omega)+\mathrm{const}$.
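The content of item 1° and of Remarks 2.49 and 2.50 can be illustrated numerically in the simplest Gaussian setting. The sketch below assumes (for illustration only) that the densities are those of $\mathcal{N}(\mu,1)$ and $\mathcal{N}(\nu,1)$ on the real line with $\Pi$ the Lebesgue measure, in which case $\int\sqrt{p_\mu p_\nu}=\exp\{-(\mu-\nu)^2/8\}$; the helper names `gauss`, `integrate`, and `Phi` are hypothetical.

```python
import math

def gauss(x, m):
    """Density of N(m, 1)."""
    return math.exp(-0.5 * (x - m) ** 2) / math.sqrt(2.0 * math.pi)

def integrate(f, lo=-12.0, hi=12.0, n=4800):
    """Composite trapezoid rule on [lo, hi]."""
    h = (hi - lo) / n
    return h * (0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n)))

def Phi(phi, mu, nu):
    """Phi(phi, [mu; nu]) = 1/2 [ln int e^{-phi} p_mu + ln int e^{phi} p_nu]."""
    a = integrate(lambda x: math.exp(-phi(x)) * gauss(x, mu))
    b = integrate(lambda x: math.exp(phi(x)) * gauss(x, nu))
    return 0.5 * (math.log(a) + math.log(b))

mu, nu = 1.0, -0.5
# phi_{mu,nu} = 1/2 ln(p_mu / p_nu), written out for unit-variance Gaussians.
phi_opt = lambda x: 0.25 * ((x - nu) ** 2 - (x - mu) ** 2)

val_opt = Phi(phi_opt, mu, nu)                                # ln of the Hellinger affinity
val_shift = Phi(lambda x: phi_opt(x) + 2.0, mu, nu)           # constant shift: same value
val_tilt = Phi(lambda x: phi_opt(x) + 0.3 * x, mu, nu)        # nonconstant perturbation: larger
print(val_opt, val_shift, val_tilt)
```

In this example the constant shift leaves the value unchanged, while the affine perturbation $\delta(\omega)=0.3\,\omega$ increases it by $0.3^2/2$, in line with the remark that minimizers are determined up to an additive constant.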
2°. Let us verify that $\Phi(\phi,[\mu;\nu])$ has a saddle point (min in $\phi\in\mathcal{F}$, max in $[\mu;\nu]\in M_1\times M_2$). First, observe that on the domain of $\Phi$ it holds
$$\Phi(\phi(\cdot)+a,[\mu;\nu])=\Phi(\phi(\cdot),[\mu;\nu])\quad\forall(a\in\mathbf{R},\phi\in\mathcal{F}).\qquad(2.169)$$
Let us select some $\bar\mu\in\mathcal{M}$, and let $P$ be the measure on $\Omega$ with density $p_{\bar\mu}$ w.r.t. $\Pi$. For $\phi\in\mathcal{F}$, the integrals $\int_\Omega e^{\pm\phi(\omega)}P(d\omega)$ are finite (since $\mathcal{O}$ is simple), implying that $\phi\in L_1[\Omega,P]$; note also that $P$ is a probabilistic measure. Let now $\mathcal{F}_0=\{\phi\in\mathcal{F}:\int_\Omega\phi(\omega)P(d\omega)=0\}$, so that $\mathcal{F}_0$ is a linear subspace in $\mathcal{F}$, and all functions $\phi\in\mathcal{F}$ can be obtained by shifts of functions from $\mathcal{F}_0$ by constants. Now, by (2.169), to prove the existence of a saddle point of $\Phi$ on $\mathcal{F}\times(M_1\times M_2)$ is exactly the same as to prove the existence of a saddle point of $\Phi$ on $\mathcal{F}_0\times(M_1\times M_2)$. Let us verify that $\Phi(\phi,[\mu;\nu])$ indeed has a saddle point on $\mathcal{F}_0\times(M_1\times M_2)$.
Because $M_1\times M_2$ is a convex compact set, and $\Phi$ is continuous on $\mathcal{F}_0\times(M_1\times M_2)$ and convex-concave, invoking the Sion-Kakutani Theorem we see that all we need in order to prove the existence of a saddle point is to verify that $\Phi$ is coercive in the first argument. In other words, we have to show that for every fixed $[\mu;\nu]\in M_1\times M_2$ one has $\Phi(\phi,[\mu;\nu])\to+\infty$ as $\phi\in\mathcal{F}_0$ and $\|\phi\|\to\infty$ (whatever be the norm $\|\cdot\|$ on $\mathcal{F}_0$; recall that $\mathcal{F}_0$ is a finite-dimensional linear space). Setting
$$\Theta(\phi)=\Phi(\phi,[\mu;\nu])=\frac12\Big[\ln\Big(\int_\Omega e^{-\phi(\omega)}p_\mu(\omega)\Pi(d\omega)\Big)+\ln\Big(\int_\Omega e^{\phi(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)\Big]$$
and taking into account that $\Theta$ is convex and finite on $\mathcal{F}_0$, in order to prove that $\Theta$ is coercive, it suffices to verify that $\Theta(t\phi)\to\infty$, $t\to\infty$, for every nonzero $\phi\in\mathcal{F}_0$, which is evident: since $\int_\Omega\phi(\omega)P(d\omega)=0$ and $\phi$ is nonzero, we have $\int_\Omega\max[\phi(\omega),0]P(d\omega)=\int_\Omega\max[-\phi(\omega),0]P(d\omega)>0$, whence $\phi>0$ and $\phi<0$ on sets of $\Pi$-positive measure, so that $\Theta(t\phi)\to\infty$ as $t\to\infty$ due to the fact that both $p_\mu(\cdot)$ and $p_\nu(\cdot)$ are continuous and everywhere positive.
3°. Now let $(\phi_*(\cdot);[\mu_*;\nu_*])$ be a saddle point of $\Phi$ on $\mathcal{F}\times(M_1\times M_2)$. Shifting, if necessary, $\phi_*(\cdot)$ by a constant (by (2.169), this does not affect the fact that $(\phi_*,[\mu_*;\nu_*])$ is a saddle point of $\Phi$), we can assume that
$$\varepsilon_\star:=\int_\Omega\exp\{-\phi_*(\omega)\}p_{\mu_*}(\omega)\Pi(d\omega)=\int_\Omega\exp\{\phi_*(\omega)\}p_{\nu_*}(\omega)\Pi(d\omega),$$
so that the saddle point value of $\Phi$ is
$$\Phi_*:=\max_{[\mu;\nu]\in M_1\times M_2}\min_{\phi\in\mathcal{F}}\Phi(\phi,[\mu;\nu])=\Phi(\phi_*,[\mu_*;\nu_*])=\ln(\varepsilon_\star),\qquad(2.170)$$
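as claimed in item (i) of the theorem. Now let us prove (2.58). For $\mu\in M_1$, we have
$$\begin{array}{rcl}
\ln(\varepsilon_\star)=\Phi_*&\ge&\Phi(\phi_*,[\mu;\nu_*])\\
&=&\frac12\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)+\frac12\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_{\nu_*}(\omega)\Pi(d\omega)\Big)\\
&=&\frac12\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)+\frac12\ln(\varepsilon_\star).
\end{array}$$
Hence,
$$\ln\Big(\int_\Omega\exp\{-\phi_*^a(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)=\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)+a\le\ln(\varepsilon_\star)+a,$$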
and (2.58.a) follows. Similarly, when $\nu\in M_2$, we have
$$\begin{array}{rcl}
\ln(\varepsilon_\star)=\Phi_*&\ge&\Phi(\phi_*,[\mu_*;\nu])\\
&=&\frac12\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_{\mu_*}(\omega)\Pi(d\omega)\Big)+\frac12\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\\
&=&\frac12\ln(\varepsilon_\star)+\frac12\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big),
\end{array}$$
so that
$$\ln\Big(\int_\Omega\exp\{\phi_*^a(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)=\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)-a\le\ln(\varepsilon_\star)-a,$$
and (2.58.b) follows.
We have proved all statements of item (i), except for the claim that the $\phi_*,\varepsilon_\star$ just defined form an optimal solution to (2.59). Note that by (2.58) as applied with $a=0$, the pair in question is feasible for (2.59). Assuming that the problem admits a feasible solution $(\bar\phi,\epsilon)$ with $\epsilon<\varepsilon_\star$, let us lead this assumption to a contradiction. Note that $\bar\phi$ should be such that
$$\int_\Omega e^{-\bar\phi(\omega)}p_{\mu_*}(\omega)\Pi(d\omega)<\varepsilon_\star\ \ \&\ \ \int_\Omega e^{\bar\phi(\omega)}p_{\nu_*}(\omega)\Pi(d\omega)<\varepsilon_\star,$$
and consequently $\Phi(\bar\phi,[\mu_*;\nu_*])<\ln(\varepsilon_\star)$. On the other hand, Remark 2.49 says that $\Phi(\bar\phi,[\mu_*;\nu_*])$ cannot be less than $\min_{\phi\in\mathcal{F}}\Phi(\phi,[\mu_*;\nu_*])$, and the latter quantity is $\Phi(\phi_*,[\mu_*;\nu_*])$ because $(\phi_*,[\mu_*;\nu_*])$ is a saddle point of $\Phi$ on $\mathcal{F}\times(M_1\times M_2)$. Thus, assuming that the optimal value in (2.59) is $<\varepsilon_\star$, we conclude that $\Phi(\phi_*,[\mu_*;\nu_*])\le\Phi(\bar\phi,[\mu_*;\nu_*])<\ln(\varepsilon_\star)$, contradicting (2.170). Item (i) of Theorem 2.23 is proved.
4°. Let us prove item (ii) of Theorem 2.23. Relation (2.60) and concavity of the right-hand side of this relation in $[\mu;\nu]$ were already proved; moreover, these relations were proved in the range $\mathcal{M}\times\mathcal{M}$ of $[\mu;\nu]$. Since this range coincides with its relative interior, the real-valued concave function $\underline\Phi$ is continuous on $\mathcal{M}\times\mathcal{M}$ and thus is continuous on $M_1\times M_2$. Next, let $\phi_*$ be the $\phi$-component of a saddle point of $\Phi$ on $\mathcal{F}\times(M_1\times M_2)$ (we already know that such a saddle point exists). By Proposition 2.21, the $[\mu;\nu]$-components of saddle points of $\Phi$ on $\mathcal{F}\times(M_1\times M_2)$ are exactly the maximizers of $\underline\Phi$ on $M_1\times M_2$; let $[\mu_*;\nu_*]$ be such a maximizer. By the same proposition, $(\phi_*,[\mu_*;\nu_*])$ is a saddle point of $\Phi$, whence $\Phi(\phi,[\mu_*;\nu_*])$ attains its minimum over $\phi\in\mathcal{F}$ at $\phi=\phi_*$. We have also seen that $\Phi(\phi,[\mu_*;\nu_*])$ attains its minimum over $\phi\in\mathcal{F}$ at $\phi=\phi_{\mu_*,\nu_*}$. These observations combine with Remark 2.50 to imply that $\phi_*$ and $\phi_{\mu_*,\nu_*}$ differ by a constant, which, in view of (2.169), means that $(\phi_{\mu_*,\nu_*},[\mu_*;\nu_*])$ is a saddle point of $\Phi$ along with $(\phi_*,[\mu_*;\nu_*])$. (ii) is proved.
5°. It remains to prove item (iii) of Theorem 2.23. In the notation from (iii), simple hypotheses (A) and (B) can be decided with the total risk $\le 2\epsilon$, and therefore, by Proposition 2.2,
$$2\bar\epsilon:=\int_\Omega\min[p(\omega),q(\omega)]\Pi(d\omega)\le 2\epsilon.$$
On the other hand, we have seen that the saddle point value of $\Phi$ is $\ln(\varepsilon_\star)$; since $[\mu_*;\nu_*]$ is a component of a saddle point of $\Phi$, it follows that $\min_{\phi\in\mathcal{F}}\Phi(\phi,[\mu_*;\nu_*])=\ln(\varepsilon_\star)$. The left-hand side in this equality, by item 1°, is $\Phi(\phi_{\mu_*,\nu_*},[\mu_*;\nu_*])$, and we
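arrive at
$$\ln(\varepsilon_\star)=\Phi\big(\tfrac12\ln(p_{\mu_*}(\cdot)/p_{\nu_*}(\cdot)),[\mu_*;\nu_*]\big)=\ln\Big(\int_\Omega\sqrt{p_{\mu_*}(\omega)p_{\nu_*}(\omega)}\,\Pi(d\omega)\Big),$$
so that
$$\varepsilon_\star=\int_\Omega\sqrt{p_{\mu_*}(\omega)p_{\nu_*}(\omega)}\,\Pi(d\omega)=\int_\Omega\sqrt{p(\omega)q(\omega)}\,\Pi(d\omega).$$
We now have
$$\begin{array}{rcl}
\varepsilon_\star&=&\int_\Omega\sqrt{p(\omega)q(\omega)}\,\Pi(d\omega)=\int_\Omega\sqrt{\min[p(\omega),q(\omega)]}\sqrt{\max[p(\omega),q(\omega)]}\,\Pi(d\omega)\\
&\le&\Big(\int_\Omega\min[p(\omega),q(\omega)]\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega\max[p(\omega),q(\omega)]\Pi(d\omega)\Big)^{1/2}\\
&=&\Big(\int_\Omega\min[p(\omega),q(\omega)]\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega\big(p(\omega)+q(\omega)-\min[p(\omega),q(\omega)]\big)\Pi(d\omega)\Big)^{1/2}\\
&=&\sqrt{2\bar\epsilon(2-2\bar\epsilon)}\le 2\sqrt{(1-\epsilon)\epsilon},
\end{array}$$
where the concluding inequality is due to $\bar\epsilon\le\epsilon\le 1/2$. (iii) is proved, and the proof of Theorem 2.23 is complete. ✷
2.11.4 Proof of Proposition 2.37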
All we need is to verify (2.107) and to check that the right-hand side function in this relation is convex. The latter is evident, since $\phi_X(h)+\phi_X(-h)\ge 2\phi_X(0)=0$ and $\phi_X(h)+\phi_X(-h)$ is convex. To verify (2.107), let us fix $P\in\mathcal{P}[X]$ and $h\in\mathbf{R}^d$ and set $\nu=h^Te[P]$, so that $\nu$ is the expectation of $h^T\omega$ with $\omega\sim P$. Note that for $\omega\sim P$ we have $h^T\omega\in[-\phi_X(-h),\phi_X(h)]$ with $P$-probability 1, whence $-\phi_X(-h)\le\nu\le\phi_X(h)$. In particular, when $\phi_X(h)+\phi_X(-h)=0$, $h^T\omega=\nu$ with $P$-probability 1, so that (2.107) definitely holds true. Now let
$$\eta:=\tfrac12[\phi_X(h)+\phi_X(-h)]>0,$$
and let
$$a=\tfrac12[\phi_X(h)-\phi_X(-h)],\quad\beta=(\nu-a)/\eta.$$
Denoting by $P_h$ the distribution of $h^T\omega$ induced by the distribution $P$ of $\omega$ and noting that this distribution is supported on $[-\phi_X(-h),\phi_X(h)]=[a-\eta,a+\eta]$ and has expectation $\nu$, we get $\beta\in[-1,1]$ and
$$\gamma:=\int\exp\{h^T\omega\}P(d\omega)=\int_{a-\eta}^{a+\eta}[e^s-\lambda(s-\nu)]P_h(ds)$$
for all $\lambda\in\mathbf{R}$. Hence,
$$\begin{array}{rcl}
\ln(\gamma)&\le&\inf\limits_\lambda\ln\Big(\max\limits_{a-\eta\le s\le a+\eta}[e^s-\lambda(s-\nu)]\Big)\\
&=&a+\inf\limits_\rho\ln\Big(\max\limits_{-\eta\le t\le\eta}[e^t-\rho(t-[\nu-a])]\Big)\quad[\text{substituting }\lambda=e^a\rho,\ s=a+t]\\
&=&a+\inf\limits_\rho\ln\Big(\max\limits_{-\eta\le t\le\eta}[e^t-\rho(t-\eta\beta)]\Big)\le a+\ln\Big(\max\limits_{-\eta\le t\le\eta}[e^t-\bar\rho(t-\eta\beta)]\Big)
\end{array}$$
with $\bar\rho=(2\eta)^{-1}(e^\eta-e^{-\eta})$. The function $g(t)=e^t-\bar\rho(t-\eta\beta)$ is convex on $[-\eta,\eta]$, and
$$g(-\eta)=g(\eta)=\cosh(\eta)+\beta\sinh(\eta),$$
which combines with the above computation to yield the relation
$$\ln(\gamma)\le a+\ln(\cosh(\eta)+\beta\sinh(\eta)).\qquad(2.171)$$
Thus, all we need to verify is that
$$\forall(\eta>0,\beta\in[-1,1]):\ \beta\eta+\tfrac12\eta^2-\ln(\cosh(\eta)+\beta\sinh(\eta))\ge 0.\qquad(2.172)$$
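Before turning to the analytic verification, inequality (2.172) can be sanity-checked on a grid. The sketch below is a numerical illustration only, not a proof; the grid ranges and the names `lhs` and `worst` are arbitrary choices made for the example.

```python
import math

def lhs(eta, beta):
    """Left-hand side of (2.172): beta*eta + eta^2/2 - ln(cosh(eta) + beta*sinh(eta))."""
    return beta * eta + 0.5 * eta ** 2 - math.log(math.cosh(eta) + beta * math.sinh(eta))

etas = [0.05 * i for i in range(1, 201)]           # eta in (0, 10]
betas = [-1.0 + 0.02 * j for j in range(0, 101)]   # beta in [-1, 1]
worst = min(lhs(e, b) for e in etas for b in betas)
print(worst)
```

At the endpoints $\beta=\pm1$ the left-hand side equals $\tfrac12\eta^2$ exactly, which gives a convenient spot check of the implementation.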
Indeed, if (2.172) holds true, (2.171) implies that
$$\ln(\gamma)\le a+\beta\eta+\tfrac12\eta^2=\nu+\tfrac12\eta^2,$$
which, recalling what $\gamma$, $\nu$, and $\eta$ are, is exactly what we want to prove.
Verification of (2.172) is as follows. The left-hand side in (2.172) is convex in $\beta$ for $\beta>-\frac{\cosh(\eta)}{\sinh(\eta)}$, a range containing, due to $\eta>0$, the range of $\beta$ in (2.172). Furthermore, the minimum of the left-hand side of (2.172) over $\beta>-\coth(\eta)$ is attained at $\beta=\frac{\sinh(\eta)-\eta\cosh(\eta)}{\eta\sinh(\eta)}$ and is equal to
$$r(\eta)=\tfrac12\eta^2+1-\eta\coth(\eta)-\ln(\sinh(\eta)/\eta).$$
All we need to prove is that the latter quantity is nonnegative whenever $\eta>0$. We have
$$r'(\eta)=\eta-\coth(\eta)-\eta(1-\coth^2(\eta))-\coth(\eta)+\eta^{-1}=(\eta\coth(\eta)-1)^2\eta^{-1}\ge 0,$$
and since $r(+0)=0$, we get $r(\eta)\ge 0$ when $\eta>0$. ✷
2.11.5 Proof of Proposition 2.43
2.11.5.A Proof of Proposition 2.43.i
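1°. Let $b=[0;...;0;1]\in\mathbf{R}^{n+1}$, so that $B=\begin{bmatrix}A\\ b^T\end{bmatrix}$, and let $A(u)=A[u;1]$. For any $u\in\mathbf{R}^n$, $h\in\mathbf{R}^d$, $\Theta\in\mathbf{S}^d_+$, and $H\in\mathbf{S}^d$ such that $-I\prec\Theta^{1/2}H\Theta^{1/2}\prec I$ we have
$$\begin{array}{l}
\Psi(h,H;u,\Theta):=\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{N}(A(u),\Theta)}\big\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\big\}\Big)\\
\quad=\ln\Big(\mathbf{E}_{\xi\sim\mathcal{N}(0,I)}\big\{\exp\{h^T[A(u)+\Theta^{1/2}\xi]+\tfrac12[A(u)+\Theta^{1/2}\xi]^TH[A(u)+\Theta^{1/2}\xi]\}\big\}\Big)\\
\quad=-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+h^TA(u)+\tfrac12A(u)^THA(u)\\
\qquad+\tfrac12[HA(u)+h]^T\Theta^{1/2}[I-\Theta^{1/2}H\Theta^{1/2}]^{-1}\Theta^{1/2}[HA(u)+h]\\
\quad=-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+\tfrac12[u;1]^T\big[bh^TA+A^Thb^T+A^THA\big][u;1]\\
\qquad+\tfrac12[u;1]^T\big[B^T[H,h]^T\Theta^{1/2}[I-\Theta^{1/2}H\Theta^{1/2}]^{-1}\Theta^{1/2}[H,h]B\big][u;1]
\end{array}\qquad(2.173)$$
due to
$$h^TA(u)=[u;1]^Tbh^TA[u;1]=[u;1]^TA^Thb^T[u;1]$$
and $HA(u)+h=[H,h]B[u;1]$.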
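The third expression in (2.173) can be compared with a direct numerical integration in the scalar case $d=1$. The sketch below is an illustration under that assumption only; the names `closed_form` and `by_quadrature` and the parameter values are hypothetical, with `a` standing for $A(u)$ and `theta` for the variance $\Theta$.

```python
import math

def closed_form(h, H, a, theta):
    """Third expression in (2.173) for d = 1; requires -1 < theta * H < 1."""
    s = 1.0 - theta * H
    return (-0.5 * math.log(s) + h * a + 0.5 * H * a * a
            + 0.5 * (H * a + h) ** 2 * theta / s)

def by_quadrature(h, H, a, theta, n=40000, width=40.0):
    """ln E exp{h z + H z^2 / 2} for z ~ N(a, theta), via the composite trapezoid rule."""
    sd = math.sqrt(theta)
    lo, hi = a - width * sd, a + width * sd
    step = (hi - lo) / n
    norm = math.sqrt(2.0 * math.pi * theta)

    def f(z):
        return math.exp(h * z + 0.5 * H * z * z - 0.5 * (z - a) ** 2 / theta) / norm

    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + i * step) for i in range(1, n))
    return math.log(total * step)

cases = [(0.7, 0.4, -0.3, 1.5), (-1.2, -0.8, 0.5, 0.9), (0.0, 0.3, 2.0, 1.0)]
errors = [abs(closed_form(*c) - by_quadrature(*c)) for c in cases]
print(errors)
```

The two computations agree to high accuracy for each parameter set, including a case with negative $H$.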
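Observe that when $(h,H)\in\mathcal{H}^\gamma$, we have
$$\Theta^{1/2}[I-\Theta^{1/2}H\Theta^{1/2}]^{-1}\Theta^{1/2}=[\Theta^{-1}-H]^{-1}\preceq[\Theta_*^{-1}-H]^{-1},$$
so that (2.173) implies that for all $u\in\mathbf{R}^n$, $\Theta\in\mathcal{V}$, and $(h,H)\in\mathcal{H}^\gamma$,
$$\begin{array}{rcl}
\Psi(h,H;u,\Theta)&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})\\
&&+\tfrac12[u;1]^T\underbrace{\big[bh^TA+A^Thb^T+A^THA+B^T[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]B\big]}_{Q[H,h]}[u;1]\\
&=&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+\tfrac12\mathrm{Tr}(Q[H,h]Z(u))\\
&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+\Gamma_{\mathcal{Z}}(h,H),\\
\Gamma_{\mathcal{Z}}(h,H)&=&\tfrac12\phi_{\mathcal{Z}}(Q[H,h])
\end{array}\qquad(2.174)$$
(we have taken into account that $Z(u)\in\mathcal{Z}$ when $u\in U$, the premise of the proposition, and therefore $\mathrm{Tr}(Q[H,h]Z(u))\le\phi_{\mathcal{Z}}(Q[H,h])$). Note that the above function $Q[H,h]$ is nothing but
$$Q[H,h]=B^T\Big(\begin{bmatrix}H&h\\ h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\Big)B.\qquad(2.175)$$
2°. We need the following:
Lemma 2.51. Let $\Theta_*$ be a $d\times d$ symmetric positive definite matrix, let $\delta\in[0,2]$, and let $\mathcal{V}$ be a closed convex subset of $\mathbf{S}^d_+$ such that
$$\Theta\in\mathcal{V}\Rightarrow\{\Theta\preceq\Theta_*\}\ \&\ \{\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\}\qquad(2.176)$$
(cf. (2.129)). Let also $\mathcal{H}^o:=\{H\in\mathbf{S}^d:-\Theta_*^{-1}\prec H\prec\Theta_*^{-1}\}$. Then
$$\begin{array}{l}
\forall(H,\Theta)\in\mathcal{H}^o\times\mathcal{V}:\\
G(H;\Theta):=-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})\\
\quad\le G^+(H;\Theta):=-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta-\Theta_*]H)+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2,
\end{array}\qquad(2.177)$$
where $\|\cdot\|$ is the spectral, and $\|\cdot\|_F$ the Frobenius norm of a matrix. In addition, $G^+(H;\Theta)$ is a continuous function on $\mathcal{H}^o\times\mathcal{V}$ which is convex in $H\in\mathcal{H}^o$ and concave (in fact, affine) in $\Theta\in\mathcal{V}$.
Proof. Let us set
$$d(H)=\|\Theta_*^{1/2}H\Theta_*^{1/2}\|,$$
so that $d(H)<1$ for $H\in\mathcal{H}^o$. For $H\in\mathcal{H}^o$ and $\Theta\in\mathcal{V}$ fixed we have
$$\begin{array}{rcl}
\|\Theta^{1/2}H\Theta^{1/2}\|&=&\|[\Theta^{1/2}\Theta_*^{-1/2}][\Theta_*^{1/2}H\Theta_*^{1/2}][\Theta^{1/2}\Theta_*^{-1/2}]^T\|\\
&\le&\|\Theta^{1/2}\Theta_*^{-1/2}\|^2\|\Theta_*^{1/2}H\Theta_*^{1/2}\|\le\|\Theta_*^{1/2}H\Theta_*^{1/2}\|=d(H)
\end{array}\qquad(2.178)$$
(we have used the fact that $0\preceq\Theta\preceq\Theta_*$ implies $\|\Theta^{1/2}\Theta_*^{-1/2}\|\le 1$). Noting that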
$\|AB\|_F\le\|A\|\|B\|_F$, a computation completely similar to the one in (2.178) yields
$$\|\Theta^{1/2}H\Theta^{1/2}\|_F\le\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F=:D(H).\qquad(2.179)$$
Besides this, setting $F(X)=-\ln\mathrm{Det}(X):\mathrm{int}\,\mathbf{S}^d_+\to\mathbf{R}$ and equipping $\mathbf{S}^d$ with the Frobenius inner product, we have $\nabla F(X)=-X^{-1}$, so that with $R_0=\Theta_*^{1/2}H\Theta_*^{1/2}$, $R_1=\Theta^{1/2}H\Theta^{1/2}$, and $\Delta=R_1-R_0$, we have for properly selected $\lambda\in(0,1)$ and $R_\lambda=\lambda R_0+(1-\lambda)R_1$:
$$\begin{array}{rcl}
F(I-R_1)&=&F(I-R_0-\Delta)=F(I-R_0)+\langle\nabla F(I-R_\lambda),-\Delta\rangle\\
&=&F(I-R_0)+\langle(I-R_\lambda)^{-1},\Delta\rangle\\
&=&F(I-R_0)+\langle I,\Delta\rangle+\langle(I-R_\lambda)^{-1}-I,\Delta\rangle.
\end{array}$$
We conclude that
$$F(I-R_1)\le F(I-R_0)+\mathrm{Tr}(\Delta)+\|I-(I-R_\lambda)^{-1}\|_F\|\Delta\|_F.\qquad(2.180)$$
Denoting by $\mu_i$ the eigenvalues of $R_\lambda$ and noting that $\|R_\lambda\|\le\max[\|R_0\|,\|R_1\|]=d(H)$ (see (2.178)), we have $|\mu_i|\le d(H)$, and therefore eigenvalues $\nu_i=1-\frac{1}{1-\mu_i}=-\frac{\mu_i}{1-\mu_i}$ of $I-(I-R_\lambda)^{-1}$ satisfy $|\nu_i|\le|\mu_i|/(1-\mu_i)\le|\mu_i|/(1-d(H))$, whence
$$\|I-(I-R_\lambda)^{-1}\|_F\le\|R_\lambda\|_F/(1-d(H)).$$
Noting that $\|R_\lambda\|_F\le\max[\|R_0\|_F,\|R_1\|_F]\le D(H)$ (see (2.179)), we conclude that $\|I-(I-R_\lambda)^{-1}\|_F\le D(H)/(1-d(H))$, so that (2.180) yields
$$F(I-R_1)\le F(I-R_0)+\mathrm{Tr}(\Delta)+D(H)\|\Delta\|_F/(1-d(H)).\qquad(2.181)$$
Further, by (2.129) the matrix $D=\Theta^{1/2}\Theta_*^{-1/2}-I$ satisfies $\|D\|\le\delta$, whence
$$\Delta=\underbrace{\Theta^{1/2}H\Theta^{1/2}}_{R_1}-\underbrace{\Theta_*^{1/2}H\Theta_*^{1/2}}_{R_0}=(I+D)R_0(I+D^T)-R_0=DR_0+R_0D^T+DR_0D^T.$$
Consequently,
$$\begin{array}{rcl}
\|\Delta\|_F&\le&\|DR_0\|_F+\|R_0D^T\|_F+\|DR_0D^T\|_F\le[2\|D\|+\|D\|^2]\|R_0\|_F\\
&\le&\delta(2+\delta)\|R_0\|_F=\delta(2+\delta)D(H).
\end{array}$$
This combines with (2.181) and the relation
$$\mathrm{Tr}(\Delta)=\mathrm{Tr}(\Theta^{1/2}H\Theta^{1/2}-\Theta_*^{1/2}H\Theta_*^{1/2})=\mathrm{Tr}([\Theta-\Theta_*]H)$$
to yield
$$\begin{array}{rcl}
F(I-R_1)&\le&F(I-R_0)+\mathrm{Tr}([\Theta-\Theta_*]H)+\dfrac{\delta(2+\delta)}{1-d(H)}D^2(H)\\
&=&F(I-R_0)+\mathrm{Tr}([\Theta-\Theta_*]H)+\dfrac{\delta(2+\delta)}{1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2,
\end{array}$$
and we arrive at (2.177). It remains to prove that $G^+(H;\Theta)$ is convex-concave and continuous on $\mathcal{H}^o\times\mathcal{V}$. The only component of this claim which is not completely evident is convexity of the function in $H\in\mathcal{H}^o$. To see that it is the case, note that $\ln\mathrm{Det}(S)$ is concave on the interior of the semidefinite cone, the function
179
HYPOTHESIS TESTING
f (u, v) =
u2 1−v
is convex and nondecreasing in u, v in the convex domain Π =
{(u, v) : u ≥ 0, v < 1}, and the function convex substitution of variables H 7→ into Π. .
1/2 2 kΘ1/2 ∗ HΘ∗ kF 1/2
1/2
is obtained from f by
1−kΘ∗ HΘ∗ k 1/2 1/2 1/2 1/2 (kΘ∗ HΘ∗ kF , kΘ∗ HΘ∗ k)
mapping Ho ✷
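3°. Combining (2.177), (2.174), and (2.130) and the origin of $\Psi$ (see (2.173)), we arrive at
$$\forall\big((u,\Theta)\in U\times\mathcal{V},\ (h,H)\in\mathcal{H}^\gamma=\mathcal{H}\big):\quad\ln\left(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],\Theta)}\left\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\right\}\right)\le\Phi_{A,\mathcal{Z}}(h,H;\Theta),$$
as claimed in (2.133).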
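4°. Now let us check that $\Phi_{A,\mathcal{Z}}(h,H;\Theta):\mathcal{H}\times\mathcal{V}\to\mathbf{R}$ is continuous and convex-concave. Recalling that the function $G^+(H;\Theta)$ from (2.177) is convex-concave and continuous on $\mathcal{H}^o\times\mathcal{V}$, all we need to verify is that $\Gamma_{\mathcal{Z}}(h,H)$ is convex and continuous on $\mathcal{H}$. Recalling that $\mathcal{Z}$ is a nonempty compact set, the function $\phi_{\mathcal{Z}}(\cdot):\mathbf{S}^{d+1}\to\mathbf{R}$ is continuous, implying the continuity of $\Gamma_{\mathcal{Z}}(h,H)=\tfrac12\phi_{\mathcal{Z}}(Q[H,h])$ on $\mathcal{H}=\mathcal{H}^\gamma$ ($Q[H,h]$ is defined in (2.175)). To prove convexity of $\Gamma_{\mathcal{Z}}$, note that $\mathcal{Z}$ is contained in $\mathbf{S}^{n+1}_+$, implying that $\phi_{\mathcal{Z}}(\cdot)$ is convex and $\succeq$-monotone. On the other hand, by the Schur Complement Lemma, we have
$$\begin{aligned}S&:=\{(h,H,G):G\succeq Q[H,h],\ (h,H)\in\mathcal{H}^\gamma\}\\&=\left\{(h,H,G):\begin{bmatrix}G-[bh^TA+A^Thb^T+A^THA]&B^T[H,h]^T\\ [H,h]B&\Theta_*^{-1}-H\end{bmatrix}\succeq0,\ (h,H)\in\mathcal{H}^\gamma\right\},\end{aligned}$$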
implying that $S$ is convex. Since $\phi_{\mathcal{Z}}(\cdot)$ is $\succeq$-monotone, we have
$$\{(h,H,\tau):(h,H)\in\mathcal{H}^\gamma,\ \tau\ge\Gamma_{\mathcal{Z}}(h,H)\}=\{(h,H,\tau):\exists G:\ G\succeq Q[H,h],\ 2\tau\ge\phi_{\mathcal{Z}}(G),\ (h,H)\in\mathcal{H}^\gamma\},$$
and we see that the epigraph of $\Gamma_{\mathcal{Z}}$ is convex (since the set $S$ and the epigraph of $\phi_{\mathcal{Z}}$ are so), as claimed.

5°. It remains to prove that $\Phi_{A,\mathcal{Z}}$ is coercive in $(H,h)$. Let $\Theta\in\mathcal{V}$ and $(h_i,H_i)\in\mathcal{H}^\gamma$ with $\|(h_i,H_i)\|\to\infty$ as $i\to\infty$, and let us prove that $\Phi_{A,\mathcal{Z}}(h_i,H_i;\Theta)\to\infty$. Looking at the expression for $\Phi_{A,\mathcal{Z}}(h_i,H_i;\Theta)$, it is immediately seen that all terms in this expression, except for the terms coming from $\phi_{\mathcal{Z}}(\cdot)$, remain bounded as $i$ grows, so that all we need to verify is that the $\phi_{\mathcal{Z}}(\cdot)$-term goes to $\infty$ as $i\to\infty$. Observe that the $H_i$ are uniformly bounded due to $(h_i,H_i)\in\mathcal{H}^\gamma$, implying that $\|h_i\|_2\to\infty$ as $i\to\infty$. Denoting $e=[0;...;0;1]\in\mathbf{R}^{d+1}$ and, as before, $b=[0;...;0;1]\in\mathbf{R}^{n+1}$, note that, by construction, $B^Te=b$. Now let $W\in\mathcal{Z}$, so that $W_{n+1,n+1}=1$. Taking into account that the matrices $[\Theta_*^{-1}-H_i]^{-1}$ satisfy $\alpha I_d\preceq[\Theta_*^{-1}-H_i]^{-1}\preceq\beta I_d$ for some positive $\alpha,\beta$ due to $H_i\in\mathcal{H}^\gamma$, observe that
$$\underbrace{\begin{bmatrix}H_i&h_i\\h_i^T&0\end{bmatrix}+[H_i,h_i]^T[\Theta_*^{-1}-H_i]^{-1}[H_i,h_i]}_{Q_i=Q[H_i,h_i]}=\underbrace{h_i^T[\Theta_*^{-1}-H_i]^{-1}h_i}_{\alpha_i\|h_i\|_2^2}\,ee^T+R_i,$$
where $\alpha_i\ge\alpha>0$ and $\|R_i\|_F\le C(1+\|h_i\|_2)$. As a result,
$$\begin{aligned}\phi_{\mathcal{Z}}(B^TQ_iB)&\ge\mathrm{Tr}(WB^TQ_iB)=\mathrm{Tr}(WB^T[\alpha_i\|h_i\|_2^2ee^T+R_i]B)\\&\ge\alpha_i\|h_i\|_2^2\underbrace{\mathrm{Tr}(Wbb^T)}_{=W_{n+1,n+1}=1}-\|BWB^T\|_F\|R_i\|_F\\&\ge\alpha\|h_i\|_2^2-C(1+\|h_i\|_2)\|BWB^T\|_F,\end{aligned}$$
and the concluding quantity tends to $\infty$ as $i\to\infty$ due to $\|h_i\|_2\to\infty$, $i\to\infty$. Part (i) is proved.

2.11.5.B Proof of Proposition 2.43.ii

By (i) the function $\Phi(h,H;\Theta_1,\Theta_2)$, as defined in (2.134), is continuous and convex-concave on the domain
$$\underbrace{(\mathcal{H}_1\cap\mathcal{H}_2)}_{\mathcal{H}}\times\underbrace{(\mathcal{V}_1\times\mathcal{V}_2)}_{\mathcal{V}}$$
and is coercive in $(h,H)$; $\mathcal{H}$ and $\mathcal{V}$ are closed and convex, and $\mathcal{V}$ in addition is compact, so that saddle point problem (2.134) is solvable (Sion-Kakutani Theorem, a.k.a. Theorem 2.22). Now let $(h_*,H_*;\Theta_1^*,\Theta_2^*)$ be a saddle point. To prove (2.136), let $P\in\mathcal{G}_1$, that is, $P=\mathcal{N}(A_1[u;1],\Theta_1)$ for some $\Theta_1\in\mathcal{V}_1$ and some $u$ with $[u;1][u;1]^T\in\mathcal{Z}_1$. Applying (2.133) to the first collection of data, with $a$ given by (2.135), we get the first $\le$ in the following chain:
$$\ln\left(\int e^{-\frac12\omega^TH_*\omega-\omega^Th_*-a}P(d\omega)\right)\le\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1)-a\underbrace{\le}_{(a)}\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)-a\underbrace{=}_{(b)}\mathrm{SV},$$
where (a) is due to the fact that $\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2)$ attains its maximum over $(\Theta_1,\Theta_2)\in\mathcal{V}_1\times\mathcal{V}_2$ at the point $(\Theta_1^*,\Theta_2^*)$, and (b) is due to the origin of $a$ and the relation $\mathrm{SV}=\tfrac12[\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)+\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)]$. The bound in (2.136.a) is proved. Similarly, let $P\in\mathcal{G}_2$, that is, $P=\mathcal{N}(A_2[u;1],\Theta_2)$ for some $\Theta_2\in\mathcal{V}_2$ and some $u$ with $[u;1][u;1]^T\in\mathcal{Z}_2$. Applying (2.133) to the second collection of data, with the same $a$ as above, we get the first $\le$ in the following chain:
$$\ln\left(\int e^{\frac12\omega^TH_*\omega+\omega^Th_*+a}P(d\omega)\right)\le\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2)+a\underbrace{\le}_{(a)}\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)+a\underbrace{=}_{(b)}\mathrm{SV},$$
with exactly the same justification of (a) and (b) as above. The bound in (2.136.b) is proved. ✷
2.11.6
Proof of Proposition 2.46
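2.11.6.A Preliminaries

We start with the following result:

Lemma 2.52. Let $\bar\Theta$ be a positive definite $d\times d$ matrix, and let
$$u\mapsto C(u)=A[u;1]\qquad\left(B=\begin{bmatrix}A\\ [0,...,0,1]\end{bmatrix}\right)$$
be an affine mapping from $\mathbf{R}^n$ into $\mathbf{R}^d$. Finally, let $h\in\mathbf{R}^d$, $H\in\mathbf{S}^d$ and $P\in\mathbf{S}^d$ satisfy the relations
$$0\preceq P\prec I_d\ \ \&\ \ P\succeq\bar\Theta^{1/2}H\bar\Theta^{1/2}.\tag{2.182}$$
Then, for every $u\in\mathbf{R}^n$ and $\zeta\sim\mathcal{SG}(C(u),\bar\Theta)$ it holds
$$\begin{aligned}\ln\left(\mathbf{E}_\zeta\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&\le-\tfrac12\ln\mathrm{Det}(I-P)\\&\quad+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T\bar\Theta^{1/2}[I-P]^{-1}\bar\Theta^{1/2}[H,h]\right]B[u;1].\end{aligned}\tag{2.183}$$
Equivalently (set $G=\bar\Theta^{-1/2}P\bar\Theta^{-1/2}$): whenever $h\in\mathbf{R}^d$, $H\in\mathbf{S}^d$ and $G\in\mathbf{S}^d$ satisfy the relations
$$0\preceq G\prec\bar\Theta^{-1}\ \ \&\ \ G\succeq H,\tag{2.184}$$
one has for every $u\in\mathbf{R}^n$ and every $\zeta\sim\mathcal{SG}(C(u),\bar\Theta)$:
$$\begin{aligned}\ln\left(\mathbf{E}_\zeta\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&\le-\tfrac12\ln\mathrm{Det}(I-\bar\Theta^{1/2}G\bar\Theta^{1/2})\\&\quad+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\bar\Theta^{-1}-G]^{-1}[H,h]\right]B[u;1].\end{aligned}\tag{2.185}$$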
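Proof. 1°. Let us start with the following observation:

Lemma 2.53. Let $\Theta\in\mathbf{S}^d_+$ and $S\in\mathbf{R}^{d\times d}$ be such that $S\Theta S^T\prec I_d$. Then for every $\nu\in\mathbf{R}^d$ one has
$$\begin{aligned}\ln\left(\mathbf{E}_{\xi\sim\mathcal{SG}(0,\Theta)}\left\{e^{\nu^TS\xi+\frac12\xi^TS^TS\xi}\right\}\right)&\le\ln\left(\mathbf{E}_{\eta\sim\mathcal{N}(\nu,I_d)}\left\{e^{\frac12\eta^TS\Theta S^T\eta}\right\}\right)\\&=-\tfrac12\ln\mathrm{Det}(I_d-S\Theta S^T)+\tfrac12\nu^TS\Theta S^T(I_d-S\Theta S^T)^{-1}\nu.\end{aligned}\tag{2.186}$$
Indeed, let $\xi\sim\mathcal{SG}(0,\Theta)$ and $\eta\sim\mathcal{N}(\nu,I_d)$ be independent. We have
$$\mathbf{E}_\xi\left\{e^{\nu^TS\xi+\frac12\xi^TS^TS\xi}\right\}\underbrace{=}_{a}\mathbf{E}_\xi\left\{\mathbf{E}_\eta\left\{e^{[S\xi]^T\eta}\right\}\right\}=\mathbf{E}_\eta\left\{\mathbf{E}_\xi\left\{e^{[S^T\eta]^T\xi}\right\}\right\}\underbrace{\le}_{b}\mathbf{E}_\eta\left\{e^{\frac12\eta^TS\Theta S^T\eta}\right\},$$
where $a$ is due to $\eta\sim\mathcal{N}(\nu,I_d)$ and $b$ is due to $\xi\sim\mathcal{SG}(0,\Theta)$. We have verified the inequality in (2.186); the equality in (2.186) is given by direct computation. ✷

2°. Now, in the situation described in Lemma 2.52, by continuity it suffices to prove (2.183) in the case when $P\succeq0$ in (2.182) is replaced with $P\succ0$. Under the premise of the lemma, given $u\in\mathbf{R}^n$ and assuming $P\succ0$, let us set $\mu=C(u)=A[u;1]$, $\nu=P^{-1/2}\bar\Theta^{1/2}[H\mu+h]$, and $S=P^{1/2}\bar\Theta^{-1/2}$, so that $S\bar\Theta S^T=P\prec I_d$, and let $G=\bar\Theta^{-1/2}P\bar\Theta^{-1/2}$, so that $G\succeq H$. Let $\zeta\sim\mathcal{SG}(\mu,\bar\Theta)$. Representing $\zeta$ as $\zeta=\mu+\xi$ with $\xi\sim\mathcal{SG}(0,\bar\Theta)$, we have
$$\begin{aligned}\ln\left(\mathbf{E}_\zeta\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&=h^T\mu+\tfrac12\mu^TH\mu+\ln\left(\mathbf{E}_\xi\left\{e^{[h+H\mu]^T\xi+\frac12\xi^TH\xi}\right\}\right)\\&\le h^T\mu+\tfrac12\mu^TH\mu+\ln\left(\mathbf{E}_\xi\left\{e^{[h+H\mu]^T\xi+\frac12\xi^TG\xi}\right\}\right)\quad[\text{since }G\succeq H]\\&=h^T\mu+\tfrac12\mu^TH\mu+\ln\left(\mathbf{E}_\xi\left\{e^{\nu^TS\xi+\frac12\xi^TS^TS\xi}\right\}\right)\quad[\text{since }S^T\nu=h+H\mu\text{ and }G=S^TS]\\&\le h^T\mu+\tfrac12\mu^TH\mu-\tfrac12\ln\mathrm{Det}(I_d-S\bar\Theta S^T)+\tfrac12\nu^TS\bar\Theta S^T(I_d-S\bar\Theta S^T)^{-1}\nu\quad[\text{by Lemma 2.53 with }\Theta=\bar\Theta]\\&=h^T\mu+\tfrac12\mu^TH\mu-\tfrac12\ln\mathrm{Det}(I_d-P)+\tfrac12[H\mu+h]^T\bar\Theta^{1/2}(I_d-P)^{-1}\bar\Theta^{1/2}[H\mu+h]\quad[\text{plugging in }S\text{ and }\nu].\end{aligned}$$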
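It is immediately seen that the concluding quantity in this chain is nothing but the right-hand side quantity in (2.183). ✷

2.11.6.B Completing the proof of Proposition 2.46

1°. Let us prove (2.142.a). By Lemma 2.52 (see (2.185)) applied with $\bar\Theta=\Theta_*$, setting $C(u)=A[u;1]$, we have
$$\begin{aligned}&\forall\big((h,H)\in\mathcal{H},\ G:0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H,\ u\in\mathbf{R}^n:[u;1][u;1]^T\in\mathcal{Z}\big):\\&\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta_*)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})\\&\qquad\qquad+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B[u;1]\\&\quad\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right)\\&\quad=\Psi_{A,\mathcal{Z}}(h,H,G),\end{aligned}\tag{2.187}$$
implying, due to the origin of $\Phi_{A,\mathcal{Z}}$, that under the premise of (2.187) we have
$$\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta_*)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\le\Phi_{A,\mathcal{Z}}(h,H),\quad\forall(h,H)\in\mathcal{H}.$$
Taking into account that when $\zeta\sim\mathcal{SG}(C(u),\Theta)$ with $\Theta\in\mathcal{V}$, we have also $\zeta\sim\mathcal{SG}(C(u),\Theta_*)$, (2.142.a) follows.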
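2°. Now let us prove (2.142.b). All we need is to verify the relation
$$\begin{aligned}&\forall\big((h,H)\in\mathcal{H},\ G:0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H,\ u\in\mathbf{R}^n:[u;1][u;1]^T\in\mathcal{Z},\ \Theta\in\mathcal{V}\big):\\&\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\le\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta);\end{aligned}\tag{2.188}$$
with this relation at our disposal (2.142.b) can be obtained by the same argument as the one we used in item 1° to derive (2.142.a). To establish (2.188), let us fix $h,H,G,u,\Theta$ satisfying the premise of (2.188); recall that under the premise of Proposition 2.46.i, we have $0\preceq\Theta\preceq\Theta_*$. Now let $\lambda\in(0,1)$, and let $\Theta_\lambda=\Theta+\lambda(\Theta_*-\Theta)$, so that $0\prec\Theta_\lambda\preceq\Theta_*$, and let $\delta_\lambda=\|\Theta_\lambda^{1/2}\Theta_*^{-1/2}-I_d\|$, implying that $\delta_\lambda\in[0,2]$. We have $0\preceq G\preceq\gamma^+\Theta_*^{-1}\preceq\gamma^+\Theta_\lambda^{-1}$, that is, $H,G$ satisfy (2.184) w.r.t. $\bar\Theta=\Theta_\lambda$. As a result, for our $h,G,H,u$, the $\bar\Theta$ just defined and $\zeta\sim\mathcal{SG}(C(u),\Theta_\lambda)$, relation (2.185) holds true:
$$\begin{aligned}\ln\left(\mathbf{E}_\zeta\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&\le-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\\&\quad+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_\lambda^{-1}-G]^{-1}[H,h]\right]B[u;1]\\&\le-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\\&\quad+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_\lambda^{-1}-G]^{-1}[H,h]\right]B\right)\end{aligned}\tag{2.189}$$
(recall that $[u;1][u;1]^T\in\mathcal{Z}$). As a result,
$$\begin{aligned}\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&\le-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\\&\quad+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right).\end{aligned}\tag{2.190}$$
When deriving (2.190) from (2.189), we have used that
• $\Theta\preceq\Theta_\lambda$, so that when $\zeta\sim\mathcal{SG}(C(u),\Theta)$, we have also $\zeta\sim\mathcal{SG}(C(u),\Theta_\lambda)$,
• $0\prec\Theta_\lambda\preceq\Theta_*$ and $G\prec\Theta_*^{-1}$, whence $[\Theta_\lambda^{-1}-G]^{-1}\preceq[\Theta_*^{-1}-G]^{-1}$,
• $\mathcal{Z}\subset\mathbf{S}^{n+1}_+$, whence $\phi_{\mathcal{Z}}$ is $\succeq$-monotone: $\phi_{\mathcal{Z}}(M)\le\phi_{\mathcal{Z}}(N)$ whenever $M\preceq N$.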
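By Lemma 2.51 applied with $\Theta_\lambda$ in the role of $\Theta$ and $\delta_\lambda$ in the role of $\delta$, we have
$$-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta_\lambda-\Theta_*]G)+\frac{\delta_\lambda(2+\delta_\lambda)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2.$$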
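Consequently, (2.190) implies that
$$\begin{aligned}\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta_\lambda-\Theta_*]G)\\&\quad+\frac{\delta_\lambda(2+\delta_\lambda)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\&\quad+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right).\end{aligned}$$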
The resulting inequality holds true for all small positive $\lambda$; taking lim inf of the right-hand side as $\lambda\to+0$, and recalling that $\Theta_0=\Theta$, we get
$$\begin{aligned}\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)&\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta-\Theta_*]G)\\&\quad+\frac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\&\quad+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right)\end{aligned}$$
(note that under the premise of Proposition 2.46.i we clearly have $\liminf_{\lambda\to+0}\delta_\lambda\le\delta$). The right-hand side of the resulting inequality is nothing but $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ (see (2.141)), and we arrive at the inequality required in the conclusion of (2.188).

3°. To complete the proof of Proposition 2.46.i, it remains to show that the functions $\Phi_{A,\mathcal{Z}}$, $\Phi^\delta_{A,\mathcal{Z}}$, as announced in the proposition, possess continuity, convexity-concavity, and coerciveness properties. Let us verify that this indeed is so for $\Phi^\delta_{A,\mathcal{Z}}$; the reasoning which follows, with obvious simplifications, is applicable to $\Phi_{A,\mathcal{Z}}$ as well.
Observe, first, that for exactly the same reasons as in item 4° of the proof of Proposition 2.43, the function $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ is real-valued, continuous and convex-concave on the domain
$$\widehat{\mathcal{H}}\times\mathcal{V}=\{(h,H,G):-\gamma^+\Theta_*^{-1}\preceq H\preceq\gamma^+\Theta_*^{-1},\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ H\preceq G\}\times\mathcal{V}.$$
The function $\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta):\mathcal{H}\times\mathcal{V}\to\mathbf{R}$ is obtained from $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ by the following two operations: we first minimize $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ over $G$ linked to $(h,H)$ by the convex constraints $0\preceq G\preceq\gamma^+\Theta_*^{-1}$ and $G\succeq H$, thus obtaining a function
$$\bar\Phi(h,H;\Theta):\underbrace{\{(h,H):-\gamma^+\Theta_*^{-1}\preceq H\preceq\gamma^+\Theta_*^{-1}\}}_{\bar{\mathcal{H}}}\times\mathcal{V}\to\mathbf{R}\cup\{+\infty\}\cup\{-\infty\}.$$
Second, we restrict the function $\bar\Phi(h,H;\Theta)$ from $\bar{\mathcal{H}}\times\mathcal{V}$ onto $\mathcal{H}\times\mathcal{V}$. For $(h,H)\in\bar{\mathcal{H}}$, the set of $G$'s linked to $(h,H)$ by the above convex constraints clearly is a nonempty compact set; as a result, $\bar\Phi$ is a real-valued convex-concave function on $\bar{\mathcal{H}}\times\mathcal{V}$. From continuity of $\Psi^\delta_{A,\mathcal{Z}}$ on its domain it immediately follows that $\Psi^\delta_{A,\mathcal{Z}}$ is bounded and uniformly continuous on every bounded subset of this domain. This implies that $\bar\Phi(h,H;\Theta)$ is bounded in every domain of the form $\bar B\times\mathcal{V}$, where $\bar B$ is a bounded subset of $\bar{\mathcal{H}}$, and is continuous on $\bar B\times\mathcal{V}$ in $\Theta\in\mathcal{V}$ with a properly selected modulus of continuity independent of $(h,H)\in\bar B$. Furthermore, by construction, $\mathcal{H}\subset\mathrm{int}\,\bar{\mathcal{H}}$, implying that if $B$ is a convex compact subset of $\mathcal{H}$, it belongs to the interior of a properly selected convex compact subset $\bar B$ of $\bar{\mathcal{H}}$. Since $\bar\Phi$ is bounded on $\bar B\times\mathcal{V}$ and is convex in $(h,H)$, the function $\bar\Phi$ is Lipschitz continuous in $(h,H)\in B$ with a Lipschitz constant which can be selected to be independent of $\Theta\in\mathcal{V}$. Taking into account that $\mathcal{H}$ is convex and closed, the bottom line is that $\Phi^\delta_{A,\mathcal{Z}}$ is not just a real-valued convex-concave function on the domain $\mathcal{H}\times\mathcal{V}$, but is also continuous on this domain. Coerciveness of $\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta)$ in $(h,H)$ is proved in exactly the same way as the similar property of function (2.130); see item 5° in the proof of Proposition 2.43. The proof of item (i) of Proposition 2.46 is complete.

4°. Item (ii) of Proposition 2.46 can be derived from item (i) of the proposition following the steps of the proof of (ii) of Proposition 2.43. ✷
Chapter Three
From Hypothesis Testing to Estimating Functionals

In this chapter we extend the techniques developed in Chapter 2 beyond the hypothesis testing problem and apply them to estimating properly structured scalar functionals of the unknown signal, specifically:
• In simple observation schemes: linear (and more generally, N-convex; see Section 3.2) functionals on unions of convex sets (Sections 3.1 and 3.2);
• Beyond simple observation schemes: linear and quadratic functionals on convex sets (Sections 3.3 and 3.4).
3.1
ESTIMATING LINEAR FORMS ON UNIONS OF CONVEX SETS
The key to the subsequent developments in this section and in Sections 3.3 and 3.4 is the following simple observation. Let $\mathcal{P}=\{P_x:x\in\mathcal{X}\}$ be a parametric family of distributions on $\mathbf{R}^d$, $\mathcal{X}$ being a convex subset of some $\mathbf{R}^m$. Suppose that given a linear form $g^Tx$ on $\mathbf{R}^m$ and an observation $\omega\sim P_x$ stemming from unknown signal $x\in\mathcal{X}$, we want to recover $g^Tx$, and intend to use for this purpose an affine function $h^T\omega+\kappa$ of the observation. How do we ensure that the recovery, with a given probability $1-\epsilon$, deviates from $g^Tx$ by at most a given margin $\rho$, for all $x\in\mathcal{X}$?

Let us focus on one "half" of the answer: how to ensure that the probability of the event $h^T\omega+\kappa>g^Tx+\rho$ does not exceed $\epsilon/2$, for every $x\in\mathcal{X}$. The answer becomes easy when assuming that we have at our disposal an upper bound on the exponential moments of the distributions from the family, that is, a function $\Phi(h;x)$ such that
$$\ln\left(\int e^{h^T\omega}P_x(d\omega)\right)\le\Phi(h;x)\quad\forall(h\in\mathbf{R}^d,\ x\in\mathcal{X}).$$
Indeed, by Markov's inequality applied to $e^{h^T\omega}$, in this case the $P_x$-probability of the event $h^T\omega+\kappa-g^Tx>\rho$ is at most $\exp\{\Phi(h;x)-[g^Tx+\rho-\kappa]\}$. To add some flexibility, note that when $\alpha>0$, the event in question is the same as the event $(h/\alpha)^T\omega+\kappa/\alpha>[g^Tx+\rho]/\alpha$; thus we arrive at a parametric family of upper bounds
$$\exp\{\Phi(h/\alpha;x)-[g^Tx+\rho-\kappa]/\alpha\},\quad\alpha>0,$$
on the $P_x$-probability of our "bad" event. It follows that a sufficient condition for this probability to be $\le\epsilon/2$, for a given $x\in\mathcal{X}$, is the existence of $\alpha>0$ such that
$$\exp\{\Phi(h/\alpha;x)-[g^Tx+\rho-\kappa]/\alpha\}\le\epsilon/2,$$
or $\Phi(h/\alpha;x)-[g^Tx+\rho-\kappa]/\alpha\le\ln(\epsilon/2)$, or, which again is the same, the existence of $\alpha>0$ such that
$$\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\le\rho-\kappa.$$
In other words, a sufficient condition for the relation $\mathrm{Prob}_{\omega\sim P_x}\{h^T\omega+\kappa>g^Tx+\rho\}\le\epsilon/2$ is
$$\inf_{\alpha>0}\left[\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\right]\le\rho-\kappa.$$
If we want the bad event in question to take place with $P_x$-probability $\le\epsilon/2$ whatever be $x\in\mathcal{X}$, the sufficient condition for this is
$$\sup_{x\in\mathcal{X}}\inf_{\alpha>0}\left[\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\right]\le\rho-\kappa.\tag{3.1}$$
Now assume that $\mathcal{X}$ is convex and compact, and $\Phi(h;x)$ is continuous, convex in $h$, and concave in $x$. In this case the function $\alpha\Phi(h/\alpha;x)$ is convex in $(h,\alpha)$ in the domain $\alpha>0$¹ and is concave in $x$, so that we can switch sup and inf, thus arriving at the sufficient condition
$$\exists\alpha>0:\ \max_{x\in\mathcal{X}}\left[\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\right]\le\rho-\kappa\tag{3.2}$$
for the validity of the relation
$$\forall x\in\mathcal{X}:\ \mathrm{Prob}_{\omega\sim P_x}\left\{h^T\omega+\kappa-g^Tx\le\rho\right\}\ge1-\epsilon/2.$$
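As a sanity check of condition (3.2), here is a minimal Python sketch for one concrete setting that is an assumption of this example and not taken from the text: $\omega\sim\mathcal{N}(x,I_d)$, so that $\Phi(h;x)=h^Tx+\tfrac12\|h\|_2^2$ is the exact log-moment-generating function, and $\mathcal{X}=[-1,1]^d$. In this case the left-hand side of (3.2) equals $\|h-g\|_1+\|h\|_2^2/(2\alpha)+\alpha\ln(2/\epsilon)$, and its minimum over $\alpha>0$ is $\|h-g\|_1+\|h\|_2\sqrt{2\ln(2/\epsilon)}$. The function names are hypothetical.

```python
import math

def log_mgf(h, x):
    """Phi(h; x) = h^T x + 0.5*||h||^2: exact log-MGF of N(x, I) (assumed model)."""
    return sum(hi * xi for hi, xi in zip(h, x)) + 0.5 * sum(hi * hi for hi in h)

def lhs_32(h, g, alpha, eps):
    """Left-hand side of (3.2) for the assumed box X = [-1, 1]^d.

    alpha*Phi(h/alpha; x) - g^T x = (h - g)^T x + ||h||^2/(2*alpha) is linear in x,
    so the maximum over the box is attained at the sign vertex of h - g.
    """
    x_star = [1.0 if hi >= gi else -1.0 for hi, gi in zip(h, g)]
    h_scaled = [hi / alpha for hi in h]
    return (alpha * log_mgf(h_scaled, x_star)
            + alpha * math.log(2.0 / eps)
            - sum(gi * xi for gi, xi in zip(g, x_star)))

def best_margin(h, g, eps):
    """Closed-form minimum over alpha > 0 of lhs_32 in this Gaussian/box setting."""
    l1 = sum(abs(hi - gi) for hi, gi in zip(h, g))
    l2 = math.sqrt(sum(hi * hi for hi in h))
    return l1 + l2 * math.sqrt(2.0 * math.log(2.0 / eps))

if __name__ == "__main__":
    h, g, eps = [1.0, 0.5], [1.0, 0.0], 0.05
    grid_min = min(lhs_32(h, g, 0.01 * k, eps) for k in range(1, 2001))
    print("grid minimum over alpha:", grid_min)
    print("closed form            :", best_margin(h, g, eps))
```

In this sketch any pair $(\rho,\kappa)$ with $\rho-\kappa$ at least `best_margin(h, g, eps)` satisfies (3.2) for the chosen $h$; a single $\alpha$ serves all $x$ in the box, which is the point of switching sup and inf.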
Note that our sufficient condition is expressed in terms of a convex constraint on $h,\kappa,\rho,\alpha$. Consider also the dramatic simplification allowed by the convexity-concavity of $\Phi$: in (3.1), every $x\in\mathcal{X}$ should be "served" by its own $\alpha$, so that (3.1) is an infinite system of constraints on $h,\rho,\kappa$. In contrast, in (3.2) all $x\in\mathcal{X}$ are "served" by a single $\alpha$. The developments in this section and Sections 3.3 and 3.4 are no more than implementations, under various circumstances, of the simple idea we have just outlined.
3.1.1
The problem
Let $\mathcal{O}=(\Omega,\Pi,\{p_\mu(\cdot):\mu\in\mathcal{M}\},\mathcal{F})$ be a simple observation scheme (see Section 2.4.2). The problem we consider in this section is as follows:

We are given a positive integer $K$ and $I$ nonempty convex compact sets $X_j\subset\mathbf{R}^n$, along with affine mappings $A_j(\cdot):\mathbf{R}^n\to\mathbf{R}^M$ such that $A_j(x)\in\mathcal{M}$ whenever $x\in X_j$, $1\le j\le I$. In addition, we are given a linear function $g^Tx$ on $\mathbf{R}^n$. Given random observation $\omega^K=(\omega_1,...,\omega_K)$ with $\omega_k$ drawn, independently across $k$, from $p_{A_j(x)}$ with $j\le I$ and $x\in X_j$, we want to recover $g^Tx$. It should be stressed that we do not know $j$ and $x$ underlying our observation.

¹ This is due to the following standard fact: if $f(h)$ is a convex function, then the projective transformation $\alpha f(h/\alpha)$ of $f$ is convex in $(h,\alpha)$ in the domain $\alpha>0$.

Given reliability tolerance $\epsilon\in(0,1)$, we quantify the performance of a candidate estimate, a Borel function $\widehat g(\cdot):\Omega\to\mathbf{R}$, by the worst-case, over $j$ and $x$, width of a $(1-\epsilon)$-confidence interval. Precisely, we say that $\widehat g(\cdot)$ is $(\rho,\epsilon)$-reliable if
$$\forall(j\le I,\ x\in X_j):\ \mathrm{Prob}_{\omega\sim p_{A_j(x)}}\{|\widehat g(\omega)-g^Tx|>\rho\}\le\epsilon.\tag{3.3}$$
We define the $\epsilon$-risk of the estimate as
$$\mathrm{Risk}_\epsilon[\widehat g]=\inf\{\rho:\widehat g\text{ is }(\rho,\epsilon)\text{-reliable}\},$$
i.e., $\mathrm{Risk}_\epsilon[\widehat g]$ is the smallest $\rho$ such that $\widehat g$ is $(\rho,\epsilon)$-reliable. The technique we are about to develop originates from [131] where estimating a linear form on a convex compact set in a simple o.s. (i.e., the case $I=1$ of the problem at hand) was considered, and where it was proved that in this situation the estimate
$$\widehat g(\omega^K)=\sum_k\phi(\omega_k)+\kappa$$
with properly selected $\phi\in\mathcal{F}$ and $\kappa\in\mathbf{R}$ is near-optimal. The problem of estimating linear functionals of a signal in Gaussian o.s. has a long history; see, e.g., [38, 40, 124, 125, 125, 127, 126, 170, 179] and references therein. In particular, in the case of $I=1$, using different techniques, a similar fact was proved by D. Donoho [64] in 1991; related results in the case of $I>1$ are available in [41, 42].
3.1.2
The estimate
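In the sequel, we associate with the simple o.s. $\mathcal{O}=(\Omega,\Pi,\{p_\mu(\cdot):\mu\in\mathcal{M}\},\mathcal{F})$ in question the function
$$\Phi_{\mathcal{O}}(\phi;\mu)=\ln\left(\int e^{\phi(\omega)}p_\mu(\omega)\Pi(d\omega)\right),\quad(\phi,\mu)\in\mathcal{F}\times\mathcal{M}.$$
Recall that by definition of a simple o.s., this function is real-valued on $\mathcal{F}\times\mathcal{M}$, concave in $\mu\in\mathcal{M}$, convex in $\phi\in\mathcal{F}$, and continuous on $\mathcal{F}\times\mathcal{M}$ (the latter follows from convexity-concavity and relative openness of $\mathcal{M}$ and $\mathcal{F}$). Let us associate with a pair $(i,j)$, $1\le i,j\le I$, the functions
$$\begin{aligned}\Phi_{ij}(\alpha,\phi;x,y)&=\tfrac12\left[K\alpha\Phi_{\mathcal{O}}(\phi/\alpha;A_i(x))+K\alpha\Phi_{\mathcal{O}}(-\phi/\alpha;A_j(y))+g^T(y-x)+2\alpha\ln(2I/\epsilon)\right]:\\&\qquad\{\alpha>0,\ \phi\in\mathcal{F}\}\times[X_i\times X_j]\to\mathbf{R},\\\Psi_{ij}(\alpha,\phi)&=\max_{x\in X_i,\,y\in X_j}\Phi_{ij}(\alpha,\phi;x,y)\\&=\tfrac12\left[\Psi_{i,+}(\alpha,\phi)+\Psi_{j,-}(\alpha,\phi)\right]:\ \{\alpha>0\}\times\mathcal{F}\to\mathbf{R},\end{aligned}$$
where
$$\begin{aligned}\Psi_{\ell,+}(\beta,\psi)&=\max_{x\in X_\ell}\left[K\beta\Phi_{\mathcal{O}}(\psi/\beta;A_\ell(x))-g^Tx+\beta\ln(2I/\epsilon)\right]:\ \{\beta>0,\ \psi\in\mathcal{F}\}\to\mathbf{R},\\\Psi_{\ell,-}(\beta,\psi)&=\max_{x\in X_\ell}\left[K\beta\Phi_{\mathcal{O}}(-\psi/\beta;A_\ell(x))+g^Tx+\beta\ln(2I/\epsilon)\right]:\ \{\beta>0,\ \psi\in\mathcal{F}\}\to\mathbf{R}.\end{aligned}$$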
Note that the function $\alpha\Phi_{\mathcal{O}}(\phi/\alpha;A_i(x))$ is obtained from the continuous convex-concave function $\Phi_{\mathcal{O}}(\cdot,\cdot)$ by projective transformation in the convex argument, and affine substitution in the concave argument, so that the former function is convex-concave and continuous on the domain $\{\alpha>0,\ \phi\in\mathcal{F}\}\times X_i$. By a similar argument, the function $\alpha\Phi_{\mathcal{O}}(-\phi/\alpha;A_j(y))$ is convex-concave and continuous on the domain $\{\alpha>0,\ \phi\in\mathcal{F}\}\times X_j$. These observations combine with compactness of $X_i$ and $X_j$ to imply that $\Psi_{ij}(\alpha,\phi)$ is a real-valued continuous convex function on the domain
$$\mathcal{F}^+=\{\alpha>0\}\times\mathcal{F}.$$
Observe that functions $\Psi_{ii}(\alpha,\phi)$ are nonnegative on $\mathcal{F}^+$. Indeed, selecting some $\bar x\in X_i$, and setting $\mu=A_i(\bar x)$, we have
$$\begin{aligned}\Psi_{ii}(\alpha,\phi)\ge\Phi_{ii}(\alpha,\phi;\bar x,\bar x)&=\alpha\left[\tfrac12\left[\Phi_{\mathcal{O}}(\phi/\alpha;\mu)+\Phi_{\mathcal{O}}(-\phi/\alpha;\mu)\right]K+\ln(2I/\epsilon)\right]\\&\ge\alpha\Big[\underbrace{\Phi_{\mathcal{O}}(0;\mu)}_{=0}K+\ln(2I/\epsilon)\Big]=\alpha\ln(2I/\epsilon)\ge0\end{aligned}$$
(we have used convexity of $\Phi_{\mathcal{O}}$ in the first argument). Functions $\Psi_{ij}$ give rise to convex and feasible optimization problems
$$\mathrm{Opt}_{ij}=\mathrm{Opt}_{ij}(K)=\min_{(\alpha,\phi)\in\mathcal{F}^+}\Psi_{ij}(\alpha,\phi).\tag{3.4}$$
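By its origin, $\mathrm{Opt}_{ij}$ is either a real, or $-\infty$; by the observation above, $\mathrm{Opt}_{ii}$ are nonnegative. Our estimate is as follows.
1. For $1\le i,j\le I$, we select some feasible solutions $\alpha_{ij},\phi_{ij}$ to problems (3.4) (the less the values of the corresponding objectives, the better) and set
$$\begin{aligned}\rho_{ij}&=\Psi_{ij}(\alpha_{ij},\phi_{ij})=\tfrac12\left[\Psi_{i,+}(\alpha_{ij},\phi_{ij})+\Psi_{j,-}(\alpha_{ij},\phi_{ij})\right],\\\kappa_{ij}&=\tfrac12\left[\Psi_{j,-}(\alpha_{ij},\phi_{ij})-\Psi_{i,+}(\alpha_{ij},\phi_{ij})\right],\\g_{ij}(\omega^K)&=\sum_{k=1}^K\phi_{ij}(\omega_k)+\kappa_{ij},\\\rho&=\max_{1\le i,j\le I}\rho_{ij}.\end{aligned}\tag{3.5}$$
2. Given observation $\omega^K$, we specify the estimate $\widehat g(\omega^K)$ as follows:
$$\begin{aligned}r_i&=\max_{j\le I}g_{ij}(\omega^K),\\c_j&=\min_{i\le I}g_{ij}(\omega^K),\\\widehat g(\omega^K)&=\tfrac12\left[\min_{i\le I}r_i+\max_{j\le I}c_j\right].\end{aligned}\tag{3.6}$$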
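The aggregation step (3.6) can be sketched in a few lines of Python. This is only an illustration of the min-of-row-maxima and max-of-column-minima rule; the function name `aggregate` and the representation of the values $g_{ij}(\omega^K)$ as a nested list are assumptions of this example.

```python
def aggregate(g):
    """Aggregate pairwise estimates as in (3.6).

    g is an I x I nested list with g[i][j] = g_ij(omega^K).
    Returns 0.5 * (min_i r_i + max_j c_j), where r_i is the maximum of row i
    and c_j is the minimum of column j.
    """
    size = len(g)
    row_max = [max(g[i][j] for j in range(size)) for i in range(size)]  # r_i
    col_min = [min(g[i][j] for i in range(size)) for j in range(size)]  # c_j
    return 0.5 * (min(row_max) + max(col_min))

if __name__ == "__main__":
    # Hypothetical values: the true index is 1 (0-based), the true g^T u is 2.0,
    # and the entries of row 1 and column 1 are within 0.5 of it on the relevant side.
    table = [[10.0, 1.7, -5.0],
             [2.3, 2.1, 1.6],
             [-7.0, 2.4, 8.0]]
    print(aggregate(table))
```

Only row $\ell$ and column $\ell$ of the true index $\ell$ need to be well behaved; the remaining entries can be arbitrary, which is what the rule is designed to tolerate.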
3.1.3
Main result
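Proposition 3.1. The $\epsilon$-risk of the estimate $\widehat g(\omega^K)$ can be upper-bounded as follows:
$$\mathrm{Risk}_\epsilon[\widehat g]\le\rho.\tag{3.7}$$
Proof. Let the common distribution $p$ of components $\omega_k$ independent across $k$ in observation $\omega^K$ be $p_{A_\ell(u)}$ for some $\ell\le I$ and $u\in X_\ell$. Let us fix these $\ell$ and $u$; we denote $\mu=A_\ell(u)$ and let $p^K$ stand for the distribution of $\omega^K$.

1°. We have
$$\begin{aligned}\Psi_{\ell,+}(\alpha_{\ell j},\phi_{\ell j})&=\max_{x\in X_\ell}\left[K\alpha_{\ell j}\Phi_{\mathcal{O}}(\phi_{\ell j}/\alpha_{\ell j};A_\ell(x))-g^Tx+\alpha_{\ell j}\ln(2I/\epsilon)\right]\\&\ge K\alpha_{\ell j}\Phi_{\mathcal{O}}(\phi_{\ell j}/\alpha_{\ell j};\mu)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\quad[\text{since }u\in X_\ell\text{ and }\mu=A_\ell(u)]\\&=K\alpha_{\ell j}\ln\left(\int\exp\{\phi_{\ell j}(\omega)/\alpha_{\ell j}\}p_\mu(\omega)\Pi(d\omega)\right)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\quad[\text{by definition of }\Phi_{\mathcal{O}}]\\&=\alpha_{\ell j}\ln\left(\mathbf{E}_{\omega^K\sim p^K}\left\{\exp\{\alpha_{\ell j}^{-1}\textstyle\sum_k\phi_{\ell j}(\omega_k)\}\right\}\right)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\\&=\alpha_{\ell j}\ln\left(\mathbf{E}_{\omega^K\sim p^K}\left\{\exp\{\alpha_{\ell j}^{-1}[g_{\ell j}(\omega^K)-\kappa_{\ell j}]\}\right\}\right)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\\&=\alpha_{\ell j}\ln\left(\mathbf{E}_{\omega^K\sim p^K}\left\{\exp\{\alpha_{\ell j}^{-1}[g_{\ell j}(\omega^K)-g^Tu-\rho_{\ell j}]\}\right\}\right)+\rho_{\ell j}-\kappa_{\ell j}+\alpha_{\ell j}\ln(2I/\epsilon)\\&\ge\alpha_{\ell j}\ln\left(\mathrm{Prob}_{\omega^K\sim p^K}\left\{g_{\ell j}(\omega^K)>g^Tu+\rho_{\ell j}\right\}\right)+\rho_{\ell j}-\kappa_{\ell j}+\alpha_{\ell j}\ln(2I/\epsilon)\\\Rightarrow\quad&\alpha_{\ell j}\ln\left(\mathrm{Prob}_{\omega^K\sim p^K}\left\{g_{\ell j}(\omega^K)>g^Tu+\rho_{\ell j}\right\}\right)\le\Psi_{\ell,+}(\alpha_{\ell j},\phi_{\ell j})+\kappa_{\ell j}-\rho_{\ell j}+\alpha_{\ell j}\ln\left(\tfrac{\epsilon}{2I}\right)\\&\qquad=\alpha_{\ell j}\ln\left(\tfrac{\epsilon}{2I}\right)\quad[\text{by (3.5)}],\end{aligned}$$
and we arrive at
$$\mathrm{Prob}_{\omega^K\sim p^K}\left\{g_{\ell j}(\omega^K)>g^Tu+\rho_{\ell j}\right\}\le\frac{\epsilon}{2I}.\tag{3.8}$$
Similarly,
$$\begin{aligned}\Psi_{\ell,-}(\alpha_{i\ell},\phi_{i\ell})&=\max_{y\in X_\ell}\left[K\alpha_{i\ell}\Phi_{\mathcal{O}}(-\phi_{i\ell}/\alpha_{i\ell};A_\ell(y))+g^Ty+\alpha_{i\ell}\ln(2I/\epsilon)\right]\\&\ge K\alpha_{i\ell}\Phi_{\mathcal{O}}(-\phi_{i\ell}/\alpha_{i\ell};\mu)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\quad[\text{since }u\in X_\ell\text{ and }\mu=A_\ell(u)]\\&=K\alpha_{i\ell}\ln\left(\int\exp\{-\phi_{i\ell}(\omega)/\alpha_{i\ell}\}p_\mu(\omega)\Pi(d\omega)\right)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\quad[\text{by definition of }\Phi_{\mathcal{O}}]\\&=\alpha_{i\ell}\ln\left(\mathbf{E}_{\omega^K\sim p^K}\left\{\exp\{-\alpha_{i\ell}^{-1}\textstyle\sum_k\phi_{i\ell}(\omega_k)\}\right\}\right)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\\&=\alpha_{i\ell}\ln\left(\mathbf{E}_{\omega^K\sim p^K}\left\{\exp\{\alpha_{i\ell}^{-1}[-g_{i\ell}(\omega^K)+\kappa_{i\ell}]\}\right\}\right)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\\&=\alpha_{i\ell}\ln\left(\mathbf{E}_{\omega^K\sim p^K}\left\{\exp\{\alpha_{i\ell}^{-1}[-g_{i\ell}(\omega^K)+g^Tu-\rho_{i\ell}]\}\right\}\right)+\rho_{i\ell}+\kappa_{i\ell}+\alpha_{i\ell}\ln(2I/\epsilon)\\&\ge\alpha_{i\ell}\ln\left(\mathrm{Prob}_{\omega^K\sim p^K}\left\{g_{i\ell}(\omega^K)<g^Tu-\rho_{i\ell}\right\}\right)+\rho_{i\ell}+\kappa_{i\ell}+\alpha_{i\ell}\ln(2I/\epsilon)\\\Rightarrow\quad&\alpha_{i\ell}\ln\left(\mathrm{Prob}_{\omega^K\sim p^K}\left\{g_{i\ell}(\omega^K)<g^Tu-\rho_{i\ell}\right\}\right)\le\Psi_{\ell,-}(\alpha_{i\ell},\phi_{i\ell})-\kappa_{i\ell}-\rho_{i\ell}+\alpha_{i\ell}\ln\left(\tfrac{\epsilon}{2I}\right)\\&\qquad=\alpha_{i\ell}\ln\left(\tfrac{\epsilon}{2I}\right)\quad[\text{by (3.5)}],\end{aligned}$$
and we arrive at
$$\mathrm{Prob}_{\omega^K\sim p^K}\left\{g_{i\ell}(\omega^K)<g^Tu-\rho_{i\ell}\right\}\le\frac{\epsilon}{2I}.\tag{3.9}$$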
2°. Let
$$\mathcal{E}=\{\omega^K:\ g_{\ell j}(\omega^K)\le g^Tu+\rho_{\ell j},\ g_{i\ell}(\omega^K)\ge g^Tu-\rho_{i\ell},\ 1\le i,j\le I\}.$$
From (3.8) and (3.9) and the union bound (there are $2I$ inequalities, each violated with probability at most $\epsilon/(2I)$) it follows that the $p^K$-probability of the event $\mathcal{E}$ is $\ge1-\epsilon$. As a result, all we need to complete the proof of the proposition is to verify that
$$\omega^K\in\mathcal{E}\ \Rightarrow\ |\widehat g(\omega^K)-g^Tu|\le\rho_\ell:=\max\left[\max_i\rho_{i\ell},\ \max_j\rho_{\ell j}\right],\tag{3.10}$$
since clearly $\rho_\ell\le\rho:=\max_{i,j}\rho_{ij}$. To this end, let us fix $\omega^K\in\mathcal{E}$, and let $E$ be the $I\times I$ matrix with entries $E_{ij}=g_{ij}(\omega^K)$, $1\le i,j\le I$. The quantity $r_i$ (see (3.6)) is the maximum of the entries in the $i$-th row of $E$, while the quantity $c_j$ is the minimum of the entries in the $j$-th column of $E$. In particular, $r_i\ge E_{ij}\ge c_j$ for all $i,j$, implying that $r_i\ge c_j$ for all $i,j$. Now, let $\Delta=[g^Tu-\rho_\ell,\ g^Tu+\rho_\ell]$. Since $\omega^K\in\mathcal{E}$, we have $E_{\ell\ell}=g_{\ell\ell}(\omega^K)\ge g^Tu-\rho_{\ell\ell}\ge g^Tu-\rho_\ell$ and $E_{\ell j}=g_{\ell j}(\omega^K)\le g^Tu+\rho_{\ell j}\le g^Tu+\rho_\ell$ for all $j$, implying that $r_\ell=\max_jE_{\ell j}\in\Delta$. Similarly, $\omega^K\in\mathcal{E}$ implies that $E_{\ell\ell}=g_{\ell\ell}(\omega^K)\le g^Tu+\rho_\ell$ and $E_{i\ell}=g_{i\ell}(\omega^K)\ge g^Tu-\rho_{i\ell}\ge g^Tu-\rho_\ell$ for all $i$, implying that $c_\ell=\min_iE_{i\ell}\in\Delta$. We see that both $r_\ell$ and $c_\ell$ belong to $\Delta$; since $r_*:=\min_ir_i\le r_\ell$ and, as we have already seen, $r_i\ge c_\ell$ for all $i$, we conclude that $r_*\in\Delta$. By a similar argument, $c_*:=\max_jc_j\in\Delta$ as well. By construction, $\widehat g(\omega^K)=\tfrac12[r_*+c_*]$, that is, $\widehat g(\omega^K)\in\Delta$, and the conclusion in (3.10) indeed takes place. ✷
Remark 3.2. Let us consider a special case of $I=1$. In this case, given a $K$-repeated observation of the signal in a simple o.s., our construction yields an estimate of a linear form $g^Tx$ of unknown signal $x$, known to belong to a given convex compact set $X_1$. This estimate is
$$\widehat g(\omega^K)=\sum_{k=1}^K\phi(\omega_k)+\kappa,\tag{3.11}$$
and is associated with the optimization problem
$$\begin{aligned}&\min_{\alpha>0,\,\phi\in\mathcal{F}}\left\{\Psi(\alpha,\phi):=\tfrac12\left[\Psi_+(\alpha,\phi)+\Psi_-(\alpha,\phi)\right]\right\},\\&\Psi_+(\alpha,\phi)=\max_{x\in X_1}\left[K\alpha\Phi_{\mathcal{O}}(\phi/\alpha;A_1(x))-g^Tx+\alpha\ln(2/\epsilon)\right],\\&\Psi_-(\alpha,\phi)=\max_{x\in X_1}\left[K\alpha\Phi_{\mathcal{O}}(-\phi/\alpha;A_1(x))+g^Tx+\alpha\ln(2/\epsilon)\right].\end{aligned}$$
By Proposition 3.1, when $\alpha,\phi$ is a feasible solution to the problem and $\kappa=\tfrac12[\Psi_-(\alpha,\phi)-\Psi_+(\alpha,\phi)]$, the $\epsilon$-risk of estimate (3.11) does not exceed $\Psi(\alpha,\phi)$.
3.1.4
Near-optimality
Observe that by properly selecting $\phi_{ij}$ and $\alpha_{ij}$ we can make, in a computationally efficient manner, the upper bound $\rho$ on the $\epsilon$-risk of the above estimate arbitrarily close to
$$\mathrm{Opt}(K)=\max_{1\le i,j\le I}\mathrm{Opt}_{ij}(K).$$
We are about to demonstrate that the quantity $\mathrm{Opt}(K)$ "nearly lower-bounds" the minimax optimal $\epsilon$-risk
$$\mathrm{Risk}^*_\epsilon(K)=\inf_{\widehat g(\cdot)}\mathrm{Risk}_\epsilon[\widehat g],$$
the infimum being taken over all estimates (all Borel functions of $\omega^K$). The precise statement is as follows:

Proposition 3.3. In the situation of this section, let $\epsilon\in(0,1/2)$ and $\bar K$ be a positive integer. Then for every integer $K$ satisfying
$$K/\bar K>\frac{2\ln(2I/\epsilon)}{\ln\frac{1}{4\epsilon(1-\epsilon)}}\tag{3.12}$$
one has
$$\mathrm{Opt}(K)\le\mathrm{Risk}^*_\epsilon(\bar K).\tag{3.13}$$
In addition, in the special case where for every $i,j$ there exists $x_{ij}\in X_i\cap X_j$ such that $A_i(x_{ij})=A_j(x_{ij})$ one has
$$K\ge\bar K\ \Rightarrow\ \mathrm{Opt}(K)\le\frac{2\ln(2I/\epsilon)}{\ln\frac{1}{4\epsilon(1-\epsilon)}}\mathrm{Risk}^*_\epsilon(\bar K).\tag{3.14}$$
For proof, see Section 3.6.1.
3.1.5
Illustration
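We illustrate our construction with the simplest possible example in which $X_i=\{x_i\}$ are singletons in $\mathbf{R}^n$, $i=1,...,I$, and the observation scheme is Gaussian. Thus, setting $y_i=A_i(x_i)\in\mathbf{R}^m$, the observation's components $\omega_k$, $1\le k\le K$, stemming from the signal $x_i$, are drawn, independently of each other, from the normal distribution $\mathcal{N}(y_i,I_m)$. The family $\mathcal{F}$ of functions $\phi$ associated with Gaussian o.s. is the family of all affine functions $\phi(\omega)=\phi_0+\varphi^T\omega$ on the observation space (which at present is $\mathbf{R}^m$); we identify $\phi\in\mathcal{F}$ with the pair $(\phi_0,\varphi)$. The function $\Phi_{\mathcal{O}}$ associated with the Gaussian observation scheme with $m$-dimensional observations is
$$\Phi_{\mathcal{O}}(\phi;\mu)=\phi_0+\varphi^T\mu+\tfrac12\varphi^T\varphi:\ (\mathbf{R}\times\mathbf{R}^m)\times\mathbf{R}^m\to\mathbf{R};$$
a straightforward computation shows that in the case in question, setting
$$\theta=\ln(2I/\epsilon),$$
we have
$$\begin{aligned}\Psi_{i,+}(\alpha,\phi)&=K\alpha\left[\phi_0+\varphi^Ty_i/\alpha+\tfrac12\varphi^T\varphi/\alpha^2\right]+\alpha\theta-g^Tx_i\\&=K\alpha\phi_0+K\varphi^Ty_i-g^Tx_i+\frac{K}{2\alpha}\varphi^T\varphi+\alpha\theta,\\\Psi_{j,-}(\alpha,\phi)&=-K\alpha\phi_0-K\varphi^Ty_j+g^Tx_j+\frac{K}{2\alpha}\varphi^T\varphi+\alpha\theta,\\\mathrm{Opt}_{ij}&=\inf_{\alpha>0,\,\phi}\tfrac12\left[\Psi_{i,+}(\alpha,\phi)+\Psi_{j,-}(\alpha,\phi)\right]\\&=\tfrac12g^T[x_j-x_i]+\inf_\varphi\left[\frac K2\varphi^T[y_i-y_j]+\inf_{\alpha>0}\left[\frac{K}{2\alpha}\varphi^T\varphi+\alpha\theta\right]\right]\\&=\tfrac12g^T[x_j-x_i]+\inf_\varphi\left[\frac K2\varphi^T[y_i-y_j]+\sqrt{2K\theta}\,\|\varphi\|_2\right]\\&=\begin{cases}\tfrac12g^T[x_j-x_i],&\|y_i-y_j\|_2\le2\sqrt{2\theta/K},\\-\infty,&\|y_i-y_j\|_2>2\sqrt{2\theta/K}.\end{cases}\end{aligned}$$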
We see that we can put $\phi_0=0$, and that setting
$$\mathcal{I}=\{(i,j):\|y_i-y_j\|_2\le2\sqrt{2\theta/K}\},$$
$\mathrm{Opt}_{ij}(K)$ is finite if and only if $(i,j)\in\mathcal{I}$ and is $-\infty$ otherwise. In both cases, the optimization problem specifying $\mathrm{Opt}_{ij}$ has no optimal solution.² Indeed, this clearly is the case when $(i,j)\notin\mathcal{I}$; when $(i,j)\in\mathcal{I}$, a minimizing sequence is, e.g., $\phi_0\equiv0$, $\varphi\equiv0$, $\alpha\to0$, but its limit is not in the minimization domain (on this domain, $\alpha$ should be positive). In this particular case, the simplest way to overcome the difficulty is to restrict the optimization domain $\mathcal{F}^+$ in (3.4) to its compact subset $\{\alpha\ge1/R,\ \phi_0=0,\ \|\varphi\|_2\le R\}$ with large $R$, like $R=10^{10}$ or $10^{20}$. Then we specify the entities participating in (3.5) as
$$\phi_{ij}(\omega)=\varphi_{ij}^T\omega,\quad\varphi_{ij}=\begin{cases}0,&(i,j)\in\mathcal{I},\\-R[y_i-y_j]/\|y_i-y_j\|_2,&(i,j)\notin\mathcal{I},\end{cases}\qquad\alpha_{ij}=\begin{cases}1/R,&(i,j)\in\mathcal{I},\\\sqrt{\frac{K}{2\theta}}\,R,&(i,j)\notin\mathcal{I},\end{cases}$$
resulting in
$$\begin{aligned}\kappa_{ij}&=\tfrac12\left[\Psi_{j,-}(\alpha_{ij},\phi_{ij})-\Psi_{i,+}(\alpha_{ij},\phi_{ij})\right]\\&=\tfrac12\left[-K\varphi_{ij}^Ty_j+g^Tx_j+\frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij}+\alpha_{ij}\theta-K\varphi_{ij}^Ty_i+g^Tx_i-\frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij}-\alpha_{ij}\theta\right]\\&=\tfrac12g^T[x_i+x_j]-\frac K2\varphi_{ij}^T[y_i+y_j]\end{aligned}$$
and
$$\begin{aligned}\rho_{ij}&=\tfrac12\left[\Psi_{i,+}(\alpha_{ij},\phi_{ij})+\Psi_{j,-}(\alpha_{ij},\phi_{ij})\right]\\&=\tfrac12\left[K\varphi_{ij}^Ty_i-g^Tx_i+\frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij}+\alpha_{ij}\theta-K\varphi_{ij}^Ty_j+g^Tx_j+\frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij}+\alpha_{ij}\theta\right]\\&=\frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij}+\alpha_{ij}\theta+\tfrac12g^T[x_j-x_i]+\frac K2\varphi_{ij}^T[y_i-y_j]\\&=\begin{cases}\tfrac12g^T[x_j-x_i]+R^{-1}\theta,&(i,j)\in\mathcal{I},\\\tfrac12g^T[x_j-x_i]+\left[\sqrt{2K\theta}-\frac K2\|y_i-y_j\|_2\right]R,&(i,j)\notin\mathcal{I}.\end{cases}\end{aligned}\tag{3.15}$$

² Handling this case was exactly the reason why in our construction we required $\phi_{ij},\alpha_{ij}$ to be feasible, and not necessarily optimal, solutions to the optimization problems (3.4).
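The closed-form quantities of this Gaussian singleton example can be cross-checked numerically. The sketch below, with hypothetical function names and a deliberately moderate $R$ (an assumption made to keep floating-point arithmetic well behaved; the text uses much larger values), evaluates $\Psi_{i,+}$ and $\Psi_{j,-}$ with $\phi_0=0$ at the prescribed $(\alpha_{ij},\varphi_{ij})$ and returns $\rho_{ij}$ and $\kappa_{ij}$ as defined in (3.5); the inner products $g^Tx_i$, $g^Tx_j$ are passed in as plain numbers.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def psi_plus(alpha, phi, y_i, gx_i, K, theta):
    """Psi_{i,+}(alpha, phi) for the Gaussian singleton case, with phi_0 = 0."""
    return K * dot(phi, y_i) - gx_i + K * dot(phi, phi) / (2.0 * alpha) + alpha * theta

def psi_minus(alpha, phi, y_j, gx_j, K, theta):
    """Psi_{j,-}(alpha, phi) for the Gaussian singleton case, with phi_0 = 0."""
    return -K * dot(phi, y_j) + gx_j + K * dot(phi, phi) / (2.0 * alpha) + alpha * theta

def rho_kappa(y_i, y_j, gx_i, gx_j, K, theta, R):
    """Return (rho_ij, kappa_ij, phi_ij, alpha_ij) for the choices made in the text."""
    diff = [a - b for a, b in zip(y_i, y_j)]
    dist = math.sqrt(dot(diff, diff))
    if dist <= 2.0 * math.sqrt(2.0 * theta / K):   # pair (i, j) belongs to the index set
        phi = [0.0] * len(y_i)
        alpha = 1.0 / R
    else:                                          # pair (i, j) is outside the index set
        phi = [-R * t / dist for t in diff]
        alpha = math.sqrt(K / (2.0 * theta)) * R
    p = psi_plus(alpha, phi, y_i, gx_i, K, theta)
    m = psi_minus(alpha, phi, y_j, gx_j, K, theta)
    return 0.5 * (p + m), 0.5 * (m - p), phi, alpha

if __name__ == "__main__":
    K, theta, R = 50, math.log(600.0), 100.0       # theta = ln(2I/eps) with I = 3, eps = 0.01
    print(rho_kappa([0.0, 0.0], [0.5, 0.0], 1.0, 3.0, K, theta, R)[:2])
    print(rho_kappa([0.0, 0.0], [3.0, 0.0], 1.0, 3.0, K, theta, R)[:2])
```

For well-separated pairs the value of $\rho_{ij}$ is a large negative number proportional to $R$, so such pairs do not affect $\rho=\max_{i,j}\rho_{ij}$.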
In the numerical experiment we report on we use $n=20$, $m=10$, and $I=100$, with $x_i$, $i\le I$, drawn independently of each other from $\mathcal{N}(0,I_n)$, and $y_i=Ax_i$ with randomly generated matrix $A$ (specifically, a matrix with independent $\mathcal{N}(0,1)$ entries normalized to have unit spectral norm). The linear form to be recovered is the first coordinate of $x$, the confidence parameter is set to $\epsilon=0.01$, and $R=10^{20}$. Results of a typical experiment are presented in Figure 3.1.
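[Figure 3.1: boxplot; vertical axis ticks from 0 to 2.5, horizontal axis ticks 20, 30, 40, 50, 100, 200, 300 (presumably the sample sizes $K$).]
Figure 3.1: Boxplot of empirical distributions, over 20 random estimation problems, of the upper 0.01-risk bounds $\max_{1\le i,j\le100}\rho_{ij}$ (as in (3.15)) for different observation sample sizes $K$.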
3.2
ESTIMATING N -CONVEX FUNCTIONS ON UNIONS OF CONVEX SETS
In this section, we apply our testing machinery to the estimation problem as follows. Given are:
• a simple o.s. $\mathcal{O}=(\Omega,\Pi;\{p_\mu:\mu\in\mathcal{M}\};\mathcal{F})$,
• a signal space $X\subset\mathbf{R}^n$ along with the affine mapping $x\mapsto A(x):X\to\mathcal{M}$,
• a real-valued function $f$ on $X$.
Given observation ω ∼ pA(x∗ ) stemming from unknown signal x∗ known to belong to X, we want to recover f (x∗ ).
[Figure 3.2, panels (a), (b), (c): the current localizer is split into "left" and "right" parts (with a middle part in (b) and (c)); a test of "left vs. right" is run (two tests, I and II, in panel (c)), and the new localizer is shown for each outcome: left accepted, right accepted, and, in panel (c), tests I and II in disagreement.]
Figure 3.2: Bisection via Hypothesis Testing.
Our approach imposes severe restrictions on $f$ (satisfied, e.g., when $f$ is linear, or linear-fractional, or is the maximum of several linear functions); as a compensation, we allow for rather "complex" $X$: finite unions of convex sets.
3.2.1
Outline
Though the estimator we develop is, in a nutshell, quite simple, its formal description turns out to be rather involved.³ For this reason we start its presentation with an informal outline, which exposes some simple ideas underlying its construction. Consider the situation where the signal space $X$ is the 2D rectangle as presented on the top of Figure 3.2.(a), and let the function to be recovered be $f(x)=x_1$. Thus, "nature" has somehow selected $x=[x_1,x_2]$ in the rectangle, and we observe a Gaussian random vector with the mean $A(x)$ and known covariance matrix, where $A(\cdot)$ is a given affine mapping. Note that hypotheses $f(x)\ge b$ and $f(x)\le a$ translate into convex hypotheses on the expectation of the observed Gaussian r.v., so that we can use our hypothesis testing machinery to decide on hypotheses of this type and to localize $f(x)$ in a (hopefully, small) segment by a Bisection-type process.

Before describing the process, let us make a terminological agreement. In the sequel we shall use pairwise hypothesis testing in the situation where it may happen that neither of the hypotheses we are deciding upon is true. In this case, we will say that the outcome of a test is correct if the rejected hypothesis indeed is wrong (the accepted hypothesis can be wrong as well, but the latter can happen only in the case when both our hypotheses are wrong).

³ It should be mentioned that the proposed estimation procedure is a "close relative" of the binary search algorithm of [77].

This is what the Bisection might look like.
1. Were we able to decide reliably on the left and the right hypotheses in Figure 3.2.(a), that is, to understand via observations whether $x$ belongs to the left or to the right half of the original rectangle, our course of actions would be clear: depending on this decision, we would replace our original rectangle with a smaller rectangle localizing $x$, as shown in Figure 3.2.(a), and then iterate this process. The difficulty, of course, is that our left and right hypotheses intersect, so that it is impossible to decide on them reliably.
2. In order to make the left and right hypotheses distinguishable from each other, we could act as shown in Figure 3.2.(b), by shrinking the left and the right rectangles and inserting a rectangle in the middle ("no man's land"). Assuming that the width of the middle rectangle allows us to decide reliably on our new left and right hypotheses and utilizing the available observation, we can localize $x$ either in the left, or in the right rectangle as shown in Figure 3.2.(b). Specifically, assume that our "left vs. right" test rejected correctly the right hypothesis. Then $x$ can be located either in the left, or in the middle rectangle shown on the top, and thus $x$ is in the new left localizer which is the union of the left and the middle original rectangles. Similarly, if our test rejects correctly the left hypothesis, then we can take, as the new localizer of $x$, the union of the original right and middle rectangles. Note that our localization is as reliable as our test is, and that it reduces the width of the localizer by a factor close to 2, provided the width of the middle rectangle is small compared to the width of the original localizer of $x$.
We can iterate this process until we arrive at a localizer so narrow that the corresponding separator, the “no man’s land” (this part cannot be too narrow, since it should allow for a reliable decision on the current left and right hypotheses), becomes too large to allow reducing significantly the localizer’s width. Note that in this implementation of the binary search (same as in the implementation proposed in [77]), starting from the second step of the Bisection, the hypotheses to decide upon depend on the observations (e.g., when $x$ belongs to the middle part of the three-rectangle localizer in Figure 3.2, deciding on “left vs. right” can, depending on observation, result in accepting either the left or the right hypothesis, leading to different updated localizers). Analysing this situation usually brings about complications we would like to avoid.
3. A simple modification of the Bisection allows us to circumvent the difficulties related to testing random hypotheses. Indeed, let us consider the following construction: given the current localizer for $x$ (at the first step, the initial rectangle), we consider two “three-rectangle” partitions of it as presented in Figure 3.2.(c). In the first partition, the left rectangle is the left half of the original rectangle; in the second partition, the right rectangle is the right half of the original rectangle. We then run two “left vs. right” tests, the first on the pair of left and right hypotheses stemming from the first partition, and the second on the pair of left and right hypotheses stemming from the second partition. Assuming that in both tests the rejected hypotheses indeed were wrong, the results of these tests allow us to make the following conclusions:
• when both tests reject the right hypotheses from the corresponding pairs, $x$ is located in the left half of the initial rectangle (since otherwise in the second test the rejected hypothesis would in fact be true, contradicting the assumption that both tests make no wrong rejections);
• when both tests reject the left hypotheses from the corresponding pairs, $x$ is located in the right half of the original rectangle (for exactly the same reasons as in the previous case);
• when the tests “disagree,” rejecting hypotheses of different types (like left in the first, and right in the second test), $x$ is located in the union of the two middle rectangles we deal with. Indeed, otherwise $x$ should be either in the left rectangles of both our three-rectangle partitions, or in the right rectangles of both of them. Since we have assumed that in both tests no wrong rejections took place, in the first case both tests must reject the right hypotheses, and both should reject the left hypotheses in the second, while none of these events took place.
Now, in the first two cases we can safely say to which of the “halves” (left or right) of the initial rectangle $x$ belongs, and take this half as the new localizer. In the third case, we take as a new localizer for $x$ the middle rectangle shown at the bottom of Figure 3.2 and terminate our estimation process: the new localizer already is narrow! In the proposed algorithm, unless we terminate at the very first step, we carry out the second step exactly in the same way as the first one, with the localizer of $x$ yielded by the first step in the role of the initial localizer, then carry out, in the same way, the third step, etc., until termination either due to running into a disagreement, or due to reaching a prescribed number of steps. Upon termination, we return the last localizer for $x$ which we have built, and claim that $f(x) = x_1$ belongs to the projection of this localizer onto the $x_1$-axis. In all tests from the above process, we use the same observation.
Note that in the present situation, in contrast to that discussed earlier, reutilizing a single observation creates no difficulties, since with no wrong rejections in the pairwise tests we use, the pairs of hypotheses participating in the tests are not random at all: they are uniquely defined by $f(x) = x_1$. Indeed, with no wrong rejections, prior to termination everything is as if we were running deterministic Bisection, that is, were updating subsequent rectangles $\Delta_t$ containing $x$ according to the rules
• $\Delta_1$ is a rectangle containing $x$ given in advance,
• $\Delta_{t+1}$ is precisely the half of $\Delta_t$ containing $x$ (say, the left half in the case of a tie).
Thus, given $x$ and assuming that there are no wrong rejections, the situation is as if a single observation were used in $L$ tests running in “parallel” rather than sequentially. The only elaboration caused by the sequential nature of our process is the “risk accumulation”: we want the probability of error in one or more of our $L$ tests to be less than the desired risk $\epsilon$ of wrong “bracketing” of $f(x)$, implying, in the absence of something better, that the risks of the individual tests should be at most $\epsilon/L$. These risks, in turn, define the allowed width of separators and thus the accuracy to which $f(x)$ can be estimated. It should be noted that the number $L$ of steps of Bisection always is a moderate integer (since otherwise the width of “no man’s land,” which at the concluding Bisection steps is of order of $2^{-L}$, would be too small to allow for deciding on the concluding pairs of our hypotheses with risk $\epsilon/L$, at least when our observations possess non-negligible volatility). As a result, “the cost” of Bisection turns out to be significantly lower than in the case where every test uses its own observation.
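The two-test step described in the outline can be sketched in a toy setting. The sketch below is only an illustration under assumptions that are not part of the text: the observation is a scalar $\omega = x_1 + \sigma\xi$ with $\xi \sim N(0,1)$, each “left vs. right” test is a simple threshold test placed in the middle of the separator, and the names `separator_width` and `toy_bisection` are hypothetical. For such a threshold test the risk equals the Gaussian tail probability at $w/(2\sigma)$, which is what `separator_width` inverts.

```python
from statistics import NormalDist


def separator_width(sigma: float, delta: float) -> float:
    """Width w of the "no man's land" for which the threshold test between
    'mean <= a' and 'mean >= a + w' has risk delta under N(mean, sigma^2) noise
    (toy assumption: scalar Gaussian observation)."""
    return 2.0 * sigma * NormalDist().inv_cdf(1.0 - delta)


def toy_bisection(omega: float, a0: float, b0: float, w: float, L: int):
    """Return the final localizer (lo, hi) for x1, reusing the single observation omega."""
    a, b = a0, b0
    for _ in range(L):
        if (b - a) / 2.0 <= w:
            break  # localizer too narrow to host a separator of width w
        c = 0.5 * (a + b)
        # Test I: left = [a, c], right = [c + w, b]; threshold in the middle of the gap.
        first_says_right = omega >= c + 0.5 * w
        # Test II: left = [a, c - w], right = [c, b].
        second_says_right = omega >= c - 0.5 * w
        if first_says_right and second_says_right:
            a = c  # both tests reject "left": keep the right half
        elif not first_says_right and not second_says_right:
            b = c  # both tests reject "right": keep the left half
        else:
            return (c - w, c + w)  # disagreement: union of the two middle rectangles
    return (a, b)
```

With $L$ steps and two tests per step, one possible choice of the per-test risk is $\epsilon/(2L)$, e.g. `w = separator_width(sigma, eps / (2 * L))`; the midpoint of the returned segment then serves as the estimate of $x_1$.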
From the above sketch of our construction it is clear that all that matters is our ability to decide on the pairs of hypotheses $\{x \in X : f(x) \le a\}$ and $\{x \in X : f(x) \ge b\}$, with $a$ and $b$ given, via observation drawn from $p_{A(x)}$. In our outline, these were convex hypotheses in Gaussian o.s., and in this case we can use detector-based pairwise tests yielded by Theorem 2.23. Applying the machinery developed in Section 2.5.1, we could also handle the case when the sets $\{x \in X : f(x) \le a\}$ and $\{x \in X : f(x) \ge b\}$ are unions of a moderate number of convex sets (e.g., $f$ is affine, and $X$ is the union of a number of convex sets), the o.s. in question still being simple, and this is the situation we intend to consider.
3.2.2 Estimating N-convex functions: Problem setting
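In the rest of this section, we consider the situation as follows. We are given
1. simple o.s. $\mathcal{O} = (\Omega, P, \{p_\mu(\cdot) : \mu \in \mathcal{M}\}, \mathcal{F})$,
2. convex compact set $\mathcal{X} \subset \mathbf{R}^n$ along with a collection of $I$ convex compact sets $X_i \subset \mathcal{X}$,
3. affine mapping $x \mapsto A(x) : \mathcal{X} \to \mathcal{M}$,
4. a continuous function $f(x) : \mathcal{X} \to \mathbf{R}$ which is N-convex, meaning that for every $a \in \mathbf{R}$ the sets $\mathcal{X}^{a,\ge} = \{x \in \mathcal{X} : f(x) \ge a\}$ and $\mathcal{X}^{a,\le} = \{x \in \mathcal{X} : f(x) \le a\}$ can be represented as the unions of at most $N$ closed convex sets $\mathcal{X}_\nu^{a,\ge}$, $\mathcal{X}_\nu^{a,\le}$:
$$\mathcal{X}^{a,\ge} = \bigcup_{\nu=1}^{N} \mathcal{X}_\nu^{a,\ge}, \qquad \mathcal{X}^{a,\le} = \bigcup_{\nu=1}^{N} \mathcal{X}_\nu^{a,\le}.$$
For some unknown $x$ known to belong to $X = \bigcup_{i=1}^{I} X_i$, we have at our disposal observation $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. $\omega_t \sim p_{A(x)}(\cdot)$, and our goal is to estimate from this observation the quantity $f(x)$. Given tolerances $\rho > 0$, $\epsilon \in (0, 1)$, let us call a candidate estimate $\hat f(\omega^K)$ $(\rho, \epsilon)$-reliable (cf. (3.3)) if for every $x \in X$, with the $p_{A(x)}$-probability at least $1 - \epsilon$, it holds $|\hat f(\omega^K) - f(x)| \le \rho$ or, which is the same,
$$\forall (x \in X) : \ \mathrm{Prob}_{\omega^K \sim p_{A(x)} \times ... \times p_{A(x)}}\left\{ |\hat f(\omega^K) - f(x)| > \rho \right\} \le \epsilon.$$
3.2.2.1 Examples of N-convex functions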
Example 3.1. [Minima and maxima of linear-fractional functions] Every function which can be obtained from linear-fractional functions $\frac{g_\nu(x)}{h_\nu(x)}$ ($g_\nu$, $h_\nu$ are affine functions on $\mathcal{X}$ and $h_\nu$ are positive on $\mathcal{X}$) by taking maxima and minima is N-convex for appropriately selected $N$ due to the following immediate observations:
• linear-fractional function $\frac{g(x)}{h(x)}$ with denominator positive on $\mathcal{X}$ is 1-convex on $\mathcal{X}$;
• if $f(x)$ is $N$-convex, so is $-f(x)$;
• if $f_i(x)$ is $N_i$-convex, $i = 1, 2, ..., I$, then $f(x) = \max_i f_i(x)$ is $N$-convex with
$$N = \max\left[\prod_i N_i, \ \sum_i N_i\right],$$
due to
$$\{x \in \mathcal{X} : f(x) \le a\} = \bigcap_{i=1}^{I} \{x : f_i(x) \le a\}, \qquad \{x \in \mathcal{X} : f(x) \ge a\} = \bigcup_{i=1}^{I} \{x : f_i(x) \ge a\}.$$
Note that the first set is the intersection of $I$ unions of convex sets with $N_i$ components in the $i$-th union, and thus is the union of $\prod_i N_i$ convex sets. The second set is the union of $I$ unions of convex sets with $N_i$ elements in the $i$-th union, and thus is the union of $\sum_i N_i$ convex sets.
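As a small worked illustration of the counting rule in the last bullet, the sketch below evaluates the bound $N = \max[\prod_i N_i, \sum_i N_i]$ for a maximum of functions with given orders $N_i$. The function name `n_convexity_of_max` is a hypothetical helper introduced only for this example.

```python
from math import prod


def n_convexity_of_max(orders):
    """Bound on N for f = max_i f_i, where f_i is N_i-convex and orders = [N_1, ..., N_I].

    The sublevel set is an intersection of unions (at most prod N_i convex pieces),
    the superlevel set is a union of unions (at most sum N_i convex pieces).
    """
    return max(prod(orders), sum(orders))
```

For instance, the maximum of three linear-fractional (1-convex) functions gets the bound $\max[1, 3] = 3$, while the maximum of a 2-convex and a 3-convex function gets $\max[6, 5] = 6$.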
Example 3.2. [Conditional quantile] Let $S = \{s_1 < s_2 < ... < s_M\} \subset \mathbf{R}$. For a nonvanishing probability distribution $q$ on $S$ and $\alpha \in [0, 1]$, let $\chi_\alpha[q]$ be the regularized $\alpha$-quantile of $q$ defined as follows: we pass from $q$ to the distribution $\bar q$ on $[s_1, s_M]$ by spreading uniformly the mass $q_\nu$, $1 < \nu \le M$, over $[s_{\nu-1}, s_\nu]$, and assigning mass $q_1$ to the point $s_1$; $\chi_\alpha[q]$ is the usual $\alpha$-quantile of the resulting distribution $\bar q$:
$$\chi_\alpha[q] = \min\{s \in [s_1, s_M] : \bar q\{[s_1, s]\} \ge \alpha\}.$$
[Figure: Regularized quantile as function of $\alpha$, $M = 4$. Horizontal axis: $\alpha$, with ticks at $0$, $q_1$, $q_1+q_2$, $q_1+q_2+q_3$, $1$; vertical axis: $s$, with ticks at $s_1$, $s_2$, $s_3$, $s_4$.]
Given, along with $S$, a finite set $T$, let $\mathcal{X}$ be a convex compact set in the space of nonvanishing probability distributions on $S \times T$. For $\tau \in T$, consider the conditional, given $t = \tau$, distribution $p_\tau(\cdot)$ of $s \in S$ induced by a distribution $p(\cdot, \cdot) \in \mathcal{X}$:
$$p_\tau(\mu) = \frac{p(\mu, \tau)}{\sum_{\nu=1}^{M} p(\nu, \tau)}, \quad 1 \le \mu \le M,$$
where $p(\mu, \tau)$ is the $p$-probability for $(s, t)$ to take value $(s_\mu, \tau)$, and $p_\tau(\mu)$ is the $p_\tau$-probability for $s$ to take value $s_\mu$, $1 \le \mu \le M$. The function $\chi_\alpha[p_\tau] : \mathcal{X} \to \mathbf{R}$ turns out to be 1-convex; for verification see Section 3.6.2.
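The definition of the regularized quantile can be made concrete with a short sketch. The code below is an illustration rather than the book's implementation; the function name `regularized_quantile` and the list-based representation of $S$ and $q$ are assumptions made for this example. It uses the fact that the cumulative mass of $\bar q$ is piecewise linear between consecutive points of $S$, with a jump of size $q_1$ at $s_1$.

```python
from itertools import accumulate


def regularized_quantile(s, q, alpha):
    """Regularized alpha-quantile of a nonvanishing distribution q on s[0] < ... < s[M-1].

    Mass q[0] sits at s[0]; mass q[nu], nu >= 1, is spread uniformly over [s[nu-1], s[nu]].
    Returns the smallest point whose cumulative mass is at least alpha.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    cum = list(accumulate(q))  # cum[nu] = q[0] + ... + q[nu]
    if alpha <= cum[0]:
        return s[0]  # the atom at s[0] already covers alpha
    for nu in range(1, len(s)):
        if alpha <= cum[nu]:
            # linear interpolation inside [s[nu-1], s[nu]]
            frac = (alpha - cum[nu - 1]) / q[nu]
            return s[nu - 1] + frac * (s[nu] - s[nu - 1])
    return s[-1]  # guards against rounding when alpha is (numerically) 1
```

Applied to the conditional distribution $p_\tau$, one would first normalize the column $p(\cdot, \tau)$ and then call the function on the result.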
3.2.3 Bisection estimate: Construction
While the construction to be presented admits numerous refinements, we focus here on its simplest version.
3.2.3.1 Preliminaries
Upper and lower feasibility/infeasibility, sets $Z_i^{a,\ge}$ and $Z_i^{a,\le}$. Let $a$ be a real. We associate with $a$ a collection of upper $a$-sets defined as follows: we look at the sets $X_i \cap \mathcal{X}_\nu^{a,\ge}$, $1 \le i \le I$, $1 \le \nu \le N$, and arrange the nonempty sets from this family into a sequence $Z_i^{a,\ge}$, $1 \le i \le I_{a,\ge}$. Here $I_{a,\ge} = 0$ if all sets in the family are empty; in the latter case, we call $a$ upper-infeasible, and call it upper-feasible otherwise. Similarly, we associate with $a$ the collection of lower $a$-sets $Z_i^{a,\le}$, $1 \le i \le I_{a,\le}$, by arranging into a sequence all nonempty sets from the family $X_i \cap \mathcal{X}_\nu^{a,\le}$, and call $a$ lower-feasible or lower-infeasible depending on whether $I_{a,\le}$ is positive or zero. Note that upper and lower $a$-sets are nonempty convex compact sets, and
$$X^{a,\ge} := \{x \in X : f(x) \ge a\} = \bigcup_{1 \le i \le I_{a,\ge}} Z_i^{a,\ge}, \qquad X^{a,\le} := \{x \in X : f(x) \le a\} = \bigcup_{1 \le i \le I_{a,\le}} Z_i^{a,\le}.$$
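To make the bookkeeping concrete, here is a one-dimensional toy sketch. It rests on assumptions that are not in the text: the sets $X_i$ are closed intervals on the real line, $f(x) = x$ (so $N = 1$), and the helper names `upper_sets` and `lower_sets` are hypothetical.

```python
def upper_sets(intervals, a):
    """Nonempty sets X_i ∩ {x : x >= a} for closed intervals X_i = (lo, hi)."""
    return [(max(lo, a), hi) for lo, hi in intervals if hi >= a]


def lower_sets(intervals, a):
    """Nonempty sets X_i ∩ {x : x <= a} for closed intervals X_i = (lo, hi)."""
    return [(lo, min(hi, a)) for lo, hi in intervals if lo <= a]


def is_upper_feasible(intervals, a):
    """a is upper-feasible iff at least one upper a-set is nonempty."""
    return len(upper_sets(intervals, a)) > 0


def is_lower_feasible(intervals, a):
    """a is lower-feasible iff at least one lower a-set is nonempty."""
    return len(lower_sets(intervals, a)) > 0
```

With `intervals = [(0, 1), (2, 3)]`, the value `a = 1.5` is both upper- and lower-feasible, while `a = 4` is upper-infeasible.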
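Right tests. Given a segment $\Delta = [a, b]$ of positive length with lower-feasible $a$, we associate with this segment a right test, that is, a function $T^K_{\Delta,r}(\omega^K)$ taking values right and left, and risk $\sigma_{\Delta,r} \ge 0$, as follows:
1. if $b$ is upper-infeasible, $T^K_{\Delta,r}(\cdot) \equiv$ left and $\sigma_{\Delta,r} = 0$;
2. if $b$ is upper-feasible, the collections of “right sets” $\{A(Z_i^{b,\ge})\}_{i \le I_{b,\ge}}$ and of “left sets” $\{A(Z_j^{a,\le})\}_{j \le I_{a,\le}}$ are nonempty, and the test is given by the construction from Section 2.5.1 as applied to these sets and the stationary $K$-repeated version of $\mathcal{O}$, specifically,
• for $1 \le i \le I_{b,\ge}$, $1 \le j \le I_{a,\le}$, we build the detectors
$$\phi^K_{ij\Delta}(\omega^K) = \sum_{t=1}^{K} \phi_{ij\Delta}(\omega_t),$$
with $\phi_{ij\Delta}(\omega)$ given by
$$(r_{ij\Delta}, s_{ij\Delta}) \in \mathop{\mathrm{Argmax}}_{r \in Z_i^{b,\ge},\, s \in Z_j^{a,\le}} \ln\left(\int_\Omega \sqrt{p_{A(r)}(\omega) p_{A(s)}(\omega)}\, \Pi(d\omega)\right), \qquad \phi_{ij\Delta}(\omega) = \frac{1}{2} \ln\left(p_{A(r_{ij\Delta})}(\omega) / p_{A(s_{ij\Delta})}(\omega)\right).$$
We set
$$\epsilon_{ij\Delta} = \int_\Omega \sqrt{p_{A(r_{ij\Delta})}(\omega) p_{A(s_{ij\Delta})}(\omega)}\, \Pi(d\omega)$$
and build the $I_{b,\ge} \times I_{a,\le}$ matrix $E_{\Delta,r} = [\epsilon^K_{ij\Delta}]_{1 \le i \le I_{b,\ge},\, 1 \le j \le I_{a,\le}}$;
• we define $\sigma_{\Delta,r}$ as the spectral norm of $E_{\Delta,r}$. We compute the Perron-Frobenius eigenvector $[g^{\Delta,r}; h^{\Delta,r}]$ of the matrix $\begin{bmatrix} & E_{\Delta,r} \\ E_{\Delta,r}^T & \end{bmatrix}$, so that (see Section 2.5.1.2)
$$g^{\Delta,r} > 0, \quad h^{\Delta,r} > 0, \quad \sigma_{\Delta,r}\, g^{\Delta,r} = E_{\Delta,r} h^{\Delta,r}, \quad \sigma_{\Delta,r}\, h^{\Delta,r} = E_{\Delta,r}^T g^{\Delta,r}.$$
Finally, we define the matrix-valued function
$$D_{\Delta,r}(\omega^K) = \left[\phi^K_{ij\Delta}(\omega^K) - \ln(h_j^{\Delta,r}) + \ln(g_i^{\Delta,r})\right]_{1 \le i \le I_{b,\ge},\, 1 \le j \le I_{a,\le}}.$$
Test $T^K_{\Delta,r}(\omega^K)$ takes value right iff the matrix $D_{\Delta,r}(\omega^K)$ has a nonnegative row, and takes value left otherwise.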
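The last two bullets can be illustrated numerically. The sketch below is a pure-Python illustration under stated assumptions: the matrix `E` of pairwise risks has strictly positive entries (so that the Perron-Frobenius vectors are positive), the vectors are found by a plain power iteration rather than by whatever routine the authors used, and the names `perron_frobenius` and `right_test` are hypothetical. The matrix `Phi` stands for the already computed detector values $\phi^K_{ij\Delta}(\omega^K)$.

```python
import math


def perron_frobenius(E, iters=500):
    """Return (sigma, g, h) with sigma*g = E h and sigma*h = E^T g, g > 0, h > 0.

    E is a list of rows with strictly positive entries; sigma approximates
    the spectral norm of E. Plain power iteration, assumed good enough here.
    """
    m, n = len(E), len(E[0])
    g = [1.0] * m
    sigma, h = 0.0, [1.0] * n
    for _ in range(iters):
        h = [sum(E[i][j] * g[i] for i in range(m)) for j in range(n)]
        norm_h = math.sqrt(sum(v * v for v in h))
        h = [v / norm_h for v in h]
        g = [sum(E[i][j] * h[j] for j in range(n)) for i in range(m)]
        sigma = math.sqrt(sum(v * v for v in g))
        g = [v / sigma for v in g]
    return sigma, g, h


def right_test(Phi, g, h):
    """Decision rule: 'right' iff D = [Phi_ij - ln h_j + ln g_i] has a nonnegative row."""
    for i, row in enumerate(Phi):
        if all(row[j] - math.log(h[j]) + math.log(g[i]) >= 0 for j in range(len(row))):
            return "right"
    return "left"
```

In the special case of a single right set and a single left set, `E` is a $1 \times 1$ matrix, `g = h = [1.0]`, and the rule reduces to checking the sign of the single detector value.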
Given $\delta > 0$, $\kappa > 0$, we call segment $\Delta = [a, b]$ $\delta$-good (right) if $a$ is lower-feasible, $b > a$, and $\sigma_{\Delta,r} \le \delta$. We call a $\delta$-good (right) segment $\Delta = [a, b]$ $\kappa$-maximal if the segment $[a, b - \kappa]$ is not $\delta$-good (right).
Left tests. The “mirror” version of the above is as follows. Given a segment $\Delta = [a, b]$ of positive length with upper-feasible $b$, we associate with this segment a left test, that is, a function $T^K_{\Delta,l}(\omega^K)$ taking values right and left, and risk $\sigma_{\Delta,l} \ge 0$, as follows:
1. if $a$ is lower-infeasible, $T^K_{\Delta,l}(\cdot) \equiv$ right and $\sigma_{\Delta,l} = 0$;
2. if $a$ is lower-feasible, we set $T^K_{\Delta,l} \equiv T^K_{\Delta,r}$, $\sigma_{\Delta,l} = \sigma_{\Delta,r}$.
Given $\delta > 0$, $\kappa > 0$, we call segment $\Delta = [a, b]$ $\delta$-good (left) if $b$ is upper-feasible, $b > a$, and $\sigma_{\Delta,l} \le \delta$. We call a $\delta$-good (left) segment $\Delta = [a, b]$ $\kappa$-maximal if the segment $[a + \kappa, b]$ is not $\delta$-good (left).
Explanation: When $a < b$ and $a$ is lower-feasible, $b$ is upper-feasible, so that the sets
$$X^{a,\le} = \{x \in X : f(x) \le a\}, \qquad X^{b,\ge} = \{x \in X : f(x) \ge b\}$$
are nonempty, the right and the left tests $T^K_{\Delta,l}$, $T^K_{\Delta,r}$ are identical to each other and coincide with the minimal risk test, built as explained in Section 2.5.1, deciding, via stationary $K$-repeated observations, on the “location” of the distribution $p_{A(x)}$ underlying the observations: whether this location is left (left hypothesis stating that $x \in X$ and $f(x) \le a$, whence $A(x) \in \bigcup_{1 \le i \le I_{a,\le}} A(Z_i^{a,\le})$), or right (right hypothesis stating that $x \in X$ and $f(x) \ge b$, whence $A(x) \in \bigcup_{1 \le i \le I_{b,\ge}} A(Z_i^{b,\ge})$).
When $a$ is lower-feasible and $b$ is not upper-feasible, the right hypothesis is empty, and the right test associated with $[a, b]$, naturally, always accepts the left hypothesis; similarly, when $a$ is lower-infeasible and $b$ is upper-feasible, the left test associated with $[a, b]$ always accepts the right hypothesis. A segment $[a, b]$ with $a < b$ is $\delta$-good (left) if the right hypothesis corresponding to the segment is nonempty, and the left test $T^K_{\Delta,l}$ associated with $[a, b]$ decides on the right and the left hypotheses with risk $\le \delta$, and similarly for the $\delta$-good (right) segment $[a, b]$.
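One simple way to arrive at a $\kappa$-maximal $\delta$-good (right) segment is to shrink the right endpoint in steps of $\kappa$ for as long as the segment stays good. The sketch below illustrates this; the predicate `is_good_right(a, b)`, which is supposed to report whether $[a, b]$ is $\delta$-good (right), is a hypothetical callback standing in for the risk computation of the right test, and the function name is likewise an assumption.

```python
def kappa_maximal_right(c, b, kappa, is_good_right):
    """Return v in (c, b] such that [c, v] is delta-good (right) and kappa-maximal,
    or None when [c, b] itself is not delta-good (right)."""
    if not is_good_right(c, b):
        return None
    v = b
    # A segment must have positive length, so [c, v - kappa] with v - kappa <= c
    # is not good by definition and the scan stops there.
    while v - kappa > c and is_good_right(c, v - kappa):
        v -= kappa
    return v
```

The mirror routine for $\delta$-good (left) segments would move the left endpoint to the right in steps of $\kappa$ in the same manner.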
3.2.4 Building Bisection estimate
3.2.4.1 Control parameters
The control parameters of the Bisection estimate are
1. positive integer $L$, the maximum allowed number of bisection steps,
2. tolerances $\delta \in (0, 1)$ and $\kappa > 0$.
3.2.4.2 Bisection estimate: Construction
The estimate of $f(x)$ ($x$ is the signal underlying our observations: $\omega_t \sim p_{A(x)}$) is given by the following recurrence run on the observation $\bar\omega^K = (\bar\omega_1, ..., \bar\omega_K)$ at our disposal:
1. Initialization. We find a valid upper bound $b_0$ on $\max_{u \in X} f(u)$ and valid lower bound $a_0$ on $\min_{u \in X} f(u)$ and set $\Delta_0 = [a_0, b_0]$. We assume w.l.o.g. that $a_0 < b_0$; otherwise the estimation is trivial.
Note: $f(x) \in \Delta_0$.
2. Bisection Step $\ell$, $1 \le \ell \le L$. Given the localizer $\Delta_{\ell-1} = [a_{\ell-1}, b_{\ell-1}]$ with $a_{\ell-1} < b_{\ell-1}$, we act as follows:
a) We set $c_\ell = \frac{1}{2}[a_{\ell-1} + b_{\ell-1}]$. If $c_\ell$ is not upper-feasible, we set $\Delta_\ell = [a_{\ell-1}, c_\ell]$ and pass to 2e, and if $c_\ell$ is not lower-feasible, we set $\Delta_\ell = [c_\ell, b_{\ell-1}]$ and pass to 2e.
Note: When the rule requires us to pass to 2e, the set $\Delta_{\ell-1} \setminus \Delta_\ell$ does not intersect with $f(X)$; in particular, in such a case $f(x) \in \Delta_\ell$ provided that $f(x) \in \Delta_{\ell-1}$.
b) When $c_\ell$ is both upper- and lower-feasible, we check whether the segment $[c_\ell, b_{\ell-1}]$ is $\delta$-good (right). If it is not the case, we terminate and claim that $f(x) \in \bar\Delta := \Delta_{\ell-1}$; otherwise find $v_\ell$, $c_\ell < v_\ell \le b_{\ell-1}$, such that the segment $\Delta_{\ell,\mathrm{rg}} = [c_\ell, v_\ell]$ is $\delta$-good (right) $\kappa$-maximal.
Note: In terms of the outline of our strategy presented in Section 3.2.1, termination when the segment $[c_\ell, b_{\ell-1}]$ is not $\delta$-good (right) corresponds to the case when the current localizer is too small to allow for the “no-man’s land” wide enough to ensure low-risk decision on the left and the right hypotheses.
To find $v_\ell$, we check the candidates $v_\ell^k = b_{\ell-1} - k\kappa$, $k = 0, 1, ...$, until arriving for the first time at a segment $[c_\ell, v_\ell^k]$ which is not $\delta$-good (right), and take as $v_\ell$ the quantity $v_\ell^{k-1}$ (because $k \ge 1$, the resulting value of $v_\ell$ is well-defined and clearly meets the above requirements).
c) Similarly, we check whether the segment $[a_{\ell-1}, c_\ell]$ is $\delta$-good (left). If it is not the case, we terminate and claim that $f(x) \in \bar\Delta := \Delta_{\ell-1}$; otherwise find $u_\ell$, $a_{\ell-1} \le u_\ell < c_\ell$, such that the segment $\Delta_{\ell,\mathrm{lf}} = [u_\ell, c_\ell]$ is $\delta$-good (left) $\kappa$-maximal.
Note: The rules for building $u_\ell$ are completely similar to those for $v_\ell$.
d) We compute $T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K)$ and $T^K_{\Delta_{\ell,\mathrm{lf}},l}(\bar\omega^K)$. If $T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K) = T^K_{\Delta_{\ell,\mathrm{lf}},l}(\bar\omega^K)$ (“consensus”), we set
$$\Delta_\ell = [a_\ell, b_\ell] = \begin{cases} [c_\ell, b_{\ell-1}], & T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K) = \text{right}, \\ [a_{\ell-1}, c_\ell], & T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K) = \text{left} \end{cases} \tag{3.16}$$
and pass to 2e. Otherwise (“disagreement”) we terminate and claim that $f(x) \in \bar\Delta = [u_\ell, v_\ell]$.
e) We pass to step $\ell + 1$ when $\ell < L$; otherwise we terminate with the claim that $f(x) \in \bar\Delta := \Delta_L$.
3. Output of the estimation procedure is the segment $\bar\Delta$ built upon termination and claimed to contain $f(x)$ (see rules 2b–2e); the midpoint of this segment is the estimate of $f(x)$ yielded by our procedure.
3.2.5 Bisection estimate: Main result
Our main result on Bisection is as follows:
Proposition 3.4. Consider the situation described at the beginning of Section 3.2.2, and let $\epsilon \in (0, 1/2)$ be given. Then
(i) [reliability of Bisection] For every positive integer $L$ and every $\kappa > 0$, Bisection with control parameters
$$L, \quad \delta = \frac{\epsilon}{2L}, \quad \kappa \tag{3.17}$$
is $(1 - \epsilon)$-reliable: for every $x \in X$, the $p_{A(x)}$-probability of the event $f(x) \in \bar\Delta$ ($\bar\Delta$ is the Bisection output as defined above) is at least $1 - \epsilon$.
(ii) [near-optimality] Let $\rho > 0$ and positive integer $\bar K$ be such that “in nature” there exists a $(\rho, \epsilon)$-reliable estimate $\hat f(\cdot)$ of $f(x)$, $x \in X := \bigcup_{i \le I} X_i$, via stationary $\bar K$-repeated observation $\omega^{\bar K}$ with $\omega_k \sim p_{A(x)}$, $1 \le k \le \bar K$. Given $\hat\rho > 2\rho$, the Bisection estimate utilizing stationary $K$-repeated observations, with
$$K \ge \frac{2 \ln(2LNI/\epsilon)}{\ln\left(\frac{1}{4\epsilon(1-\epsilon)}\right)}\, \bar K, \tag{3.18}$$
the control parameters of the estimate being
$$L = \left\lceil \log_2\left(\frac{b_0 - a_0}{2\hat\rho}\right) \right\rceil, \quad \delta = \frac{\epsilon}{2L}, \quad \kappa = \hat\rho - 2\rho, \tag{3.19}$$
is $(\hat\rho, \epsilon)$-reliable. Note that $K$ is only “slightly larger” than $\bar K$.
For proof, see Section 3.6.3. Note that the running time $K$ of the Bisection estimate as given by (3.18) is larger than $\bar K$ by a factor which is (at most) logarithmic in $N$, $I$, $L$, and $1/\epsilon$; note also that $L$ is just logarithmic in $1/\hat\rho$. Assume, e.g., that for some $\gamma > 0$ “in nature” there exist $(\epsilon^\gamma, \epsilon)$-reliable estimates, parameterized by $\epsilon \in (0, 1/2)$, utilizing $\bar K = \bar K(\epsilon)$ observations. Then Bisection with the volume of observation and control parameters given by (3.18) and (3.19), where $\hat\rho = 3\rho = 3\epsilon^\gamma$ and $\bar K = \bar K(\epsilon)$, is $(3\epsilon^\gamma, \epsilon)$-reliable and requires $K = K(\epsilon)$-repeated observations with $\lim_{\epsilon \to +0} K(\epsilon)/\bar K(\epsilon) \le 2$.
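The control parameters and the observation count from the proposition are easy to tabulate. The sketch below evaluates (3.18) and (3.19) for given problem data; it is a minimal illustration in which the function name `bisection_parameters`, the rounding of $L$ up to at least 1, and the rounding of $K$ up to the nearest integer are assumptions made for this example.

```python
import math


def bisection_parameters(a0, b0, rho, rho_hat, eps, N, I, K_bar):
    """Return (L, delta, kappa, K) following (3.18)-(3.19).

    Assumes 0 < eps < 1/2 and rho_hat > 2 * rho > 0.
    """
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 1/2)")
    if not rho_hat > 2.0 * rho > 0.0:
        raise ValueError("need rho_hat > 2 * rho > 0")
    # Number of steps: smallest integer L >= log2((b0 - a0) / (2 * rho_hat)), at least 1.
    L = max(1, math.ceil(math.log2((b0 - a0) / (2.0 * rho_hat))))
    delta = eps / (2.0 * L)
    kappa = rho_hat - 2.0 * rho
    # Observation count: smallest integer satisfying (3.18).
    factor = 2.0 * math.log(2.0 * L * N * I / eps) / math.log(1.0 / (4.0 * eps * (1.0 - eps)))
    K = math.ceil(factor * K_bar)
    return L, delta, kappa, K
```

For example, with $\Delta_0 = [-1, 1]$, $\hat\rho = 1/32$, $\rho = 0.01$, $\epsilon = 0.01$, $N = 1$, $I = 2$ and $\bar K = 1$, the ratio in (3.18) is about 4.7, so five observations suffice for this choice of data.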
3.2.6 Illustration
To illustrate bisection-based estimation of an N-convex function, consider the following situation.4 There are $M$ devices (“receivers”) recording a signal $u$ known to belong to a given convex compact and nonempty set $U \subset \mathbf{R}^n$; the output of the $i$-th receiver is the vector
$$y_i = A_i u + \sigma\xi \in \mathbf{R}^m \qquad [\xi \sim N(0, I_m)]$$
where $A_i$ are given $m \times n$ matrices (you may think of $M$ allowed positions for a single receiver, and of $y_i$ as the output of the receiver when the latter is in position $i$). Our observation $\omega$ is one of the vectors $y_i$, $1 \le i \le M$, with index $i$ unknown to us (“we observe a noisy record of a signal, but do not know the position in which this record was taken”). Given $\omega$, we want to recover a given linear function $g(u) = e^T u$ of the signal.
4 Our goal is to illustrate a mathematical construction rather than to work out a particular application; the reader is welcome to invent a plausible “covering story.”
The problem can be modeled as follows. Consider $M$ sets
$$X_i = \{x = [x_1; ...; x_M] \in \mathbf{R}^{Mn} = \underbrace{\mathbf{R}^n \times ... \times \mathbf{R}^n}_{M} : x_j = 0,\ j \ne i;\ x_i \in U\}$$
along with the linear mapping
$$A[x_1; ...; x_M] = \sum_{i=1}^{M} A_i x_i : \mathbf{R}^{Mn} \to \mathbf{R}^m$$
and linear function
$$f([x_1; ...; x_M]) = e^T \sum_i x_i : \mathbf{R}^{Mn} \to \mathbf{R}.$$
Let $\mathcal{X}$ be a convex compact set in $\mathbf{R}^{Mn}$ containing all the sets $X_i$, $1 \le i \le M$. Observe that the problem we are interested in is nothing but the problem of recovering $f(x)$ via observation
$$\omega = Ax + \sigma\xi, \quad \xi \sim N(0, I_m), \tag{3.20}$$
where the unknown signal $x$ is known to belong to the union $\bigcup_{i=1}^{M} X_i$ of known convex compact sets $X_i$. As a result, our problem can be solved via the machinery developed in this section.
Numerical illustration. In the numerical experiments to be reported, we use $n = 128$, $m = 64$ and $M = 2$. The data is generated as follows:
• The set $U \subset \mathbf{R}^{128}$ of candidate signals is comprised of restrictions onto the equidistant ($n = 128$)-point grid in $[0, 1]$ of twice differentiable functions $h(t)$ of continuous argument $t \in [0, 1]$ satisfying the relations $|h(0)| \le 1$, $|h'(0)| \le 1$, $|h''(t)| \le 1$, $0 \le t \le 1$. For the discretized signal $u = [h(0); h(1/n); ...; h(1 - 1/n)]$ this translates into the system of convex constraints
$$|u_1| \le 1, \quad n|u_2 - u_1| \le 1, \quad n^2|u_{i+1} - 2u_i + u_{i-1}| \le 1, \ 2 \le i \le n - 1.$$
• We look to estimate the discretized counterpart of the integral $\int_0^1 h(t)dt$, specifically, the quantity $e^T u = \alpha \sum_{i=1}^{n} u_i$. The normalizing constant $\alpha$ is selected to ensure $\max_{u \in U} e^T u = 1$, $\min_{u \in U} e^T u = -1$, allowing us to run Bisection over $\Delta_0 = [-1; 1]$.
• We generate $A_1$ as an ($m = 64$)×($n = 128$) matrix with singular values $\sigma_i = \theta^{i-1}$, $1 \le i \le m$, with $\theta$ selected from the requirement $\sigma_m = 0.1$. The system of left singular vectors of $A_1$ is obtained from the system of basic orths in $\mathbf{R}^n$ by random rotation. Matrix $A_2$ was selected as $A_2 = A_1 S$, where $S$ is a symmetry w.r.t. the axis $e$, that is,
$$Se = e \ \ \&\ \ Sh = -h \text{ whenever } h \text{ is orthogonal to } e. \tag{3.21}$$
Signals $u$ underlying the observations are selected at random in $U$.
• The reliability $1 - \epsilon$ of the estimate is set to 0.99, while the maximal allowed number $L$ of Bisection steps is set to 8. We use single observation (3.20) (i.e., use $K = 1$ in our general scheme) with $\sigma = 0.01$.
The results of our experiments are presented in Table 3.1.

Characteristic          min     median   mean    max
error bound             0.008   0.015    0.014   0.015
actual error            0.001   0.002    0.002   0.005
# of Bisection steps    5       7.00     6.60    8

Table 3.1: Data of 10 Bisection experiments, $\sigma = 0.01$. In the table, “error bound” is the half-length of the final localizer, which is a 0.99-reliable upper bound on the estimation error; the “actual error” is the actual estimation error.

Observe that in the considered problem there exists an intrinsic obstacle for high accuracy estimation even in the case of noiseless observations and invertible matrices $A_i$, $i = 1, 2$ (recall that we are in the case of $M = 2$). Indeed, assume that there exist $u \in U$, $u' \in U$ such that $A_1 u = A_2 u'$ and $e^T u \ne e^T u'$. Since we do not know which of the matrices, $A_1$ or $A_2$, underlies the observation and $A_1 u = A_2 u'$, there is no way to distinguish between the two cases we have described, implying that the quantity
$$\rho = \max_{u, u' \in U} \left\{ \frac{1}{2}|e^T(u - u')| : A_1 u = A_2 u' \right\} \tag{3.22}$$
is a lower bound on the worst-case, over signals from $U$, error of a reliable recovery of $e^T u$, independently of how small the noise is. In the reported experiments, we used $A_2 = A_1 S$ with $S$ linked to $e$ (see (3.21)); with this selection of $S$, $e$, and $A_2$, and invertible $A_1$, the lower bound $\rho$ would be trivial: just zero. Note that the selected $A_1$ is not invertible, resulting in a positive $\rho$. However, computation shows that with our data, this positive $\rho$ is negligibly small (about 2.0e-5).
When we destroy the link between $e$ and $S$, the estimation problem can become intrinsically more difficult, and the performance of our estimation procedure can deteriorate. Let us look at what happens when we keep $A_1$ and $A_2 = A_1 S$ exactly as they are, but replace the linear form to be estimated with $e^T u$, $e$ being randomly selected.5 The corresponding results are presented in Table 3.2. The data in the top part of the table match “difficult” signals $u$, those participating in forming the lower bound (3.22) on the recovery error, while the data in the bottom part of the table correspond to randomly selected signals.6 Observe that when estimating a randomly selected linear form, the error bounds indeed deteriorate, as compared to those in Table 3.1. We see also that the resulting error bounds are in a reasonably good agreement with the lower bound $\rho$, illustrating the basic property of nearly optimal estimates: the guaranteed performance of an estimate can be bad or good, but it is always nearly as good as is possible under the circumstances. As for the actual estimation errors, in some experiments they are significantly smaller than the error bounds, especially when random signals are used.

Characteristic          min     median   mean    max
error bound             0.057   0.457    0.441   1.000
actual error            0.001   0.297    0.350   1.000
# of Bisection steps    1       1.00     2.20    5
“Difficult” signals, data over 10 experiments

ρ             0.022   0.028   0.154   0.170   0.213   0.248   0.250   0.500   0.605   0.924
error bound   0.057   0.063   0.219   0.239   0.406   0.508   0.516   0.625   0.773   1.000
Error bound vs. ρ, experiments sorted according to the values of ρ

Characteristic          min     median   mean    max
error bound             0.016   0.274    0.348   1.000
actual error            0.005   0.066    0.127   0.556
# of Bisection steps    1       2.00     2.80    7
Random signals, data over 10 experiments

ρ             0.010   0.085   0.177   0.243   0.294   0.334   0.337   0.554   0.630   0.762
error bound   0.016   0.182   0.376   0.438   0.602   0.029   0.031   0.688   0.125   1.000
Error bound vs. ρ, experiments sorted according to the values of ρ

Table 3.2: Results of experiments with randomly selected linear form, $\sigma = 0.01$.

5 In the experiments to be reported, $e$ is selected as follows: we start with a random unit vector drawn from the uniform distribution on the unit sphere in $\mathbf{R}^n$ and then normalize it to have $\max_{u \in U} e^T u - \min_{u \in U} e^T u = 2$.
6 Precisely, to generate a signal $u$, we draw a point $\bar u$ at random, from the uniform distribution on the sphere of radius $10\sqrt{n}$, and take as $u$ the point of $U$ $\|\cdot\|_2$-closest to $\bar u$.
3.2.7 Estimating N-convex functions: An alternative
Observe that the problem of estimating an N-convex function on the union of convex sets posed in Section 3.2.2 can be processed not only by Bisection. An alternative is as follows. In the notation of Section 3.2.2, we start with computing the range $\Delta$ of function $f$ on the set $X = \bigcup_{i \le I} X_i$, that is, we compute the quantities
$$\underline{f} = \min_{x \in X} f(x), \qquad \overline{f} = \max_{x \in X} f(x)$$
and set $\Delta = [\underline{f}, \overline{f}]$. We assume that this segment is not a singleton; otherwise estimating $f$ is trivial. Let $L \in \mathbf{Z}_+$ and let $\delta_L = (\overline{f} - \underline{f})/L$ be the desired estimation accuracy. We split $\Delta$ into $L$ segments $\Delta_\ell$ of equal length $\delta_L$ and consider the sets
$$X_{i\ell} = \{x \in X_i : f(x) \in \Delta_\ell\}, \quad 1 \le i \le I, \ 1 \le \ell \le L.$$
Since $f$ is N-convex, each set $X_{i\ell}$ is a union of $M_{i\ell} \le N^2$ convex compact sets $X_{i\ell j}$, $1 \le j \le M_{i\ell}$. Thus, we have at our disposal a collection of at most $ILN^2$ convex compact sets; let us eliminate from this collection empty sets and arrange the nonempty ones into a sequence $Y_1, ..., Y_M$, $M \le ILN^2$. Note that $\bigcup_{s \le M} Y_s = X$, so that the goal set in Section 3.2.2 can be reformulated as follows:
For some unknown $x$ known to belong to $X = \bigcup_{s=1}^{M} Y_s$, we have at our disposal observation $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. $\omega_t \sim p_{A(x)}(\cdot)$; we aim at estimating the quantity $f(x)$ from this observation.
The sets $Y_s$ give rise to $M$ hypotheses $H_1, ..., H_M$ on the distribution of the observations $\omega_t$, $1 \le t \le K$; according to $H_s$, $\omega_t \sim p_{A(x)}(\cdot)$ with some $x \in Y_s$. Let us define a closeness $\mathcal{C}$ on the set of our $M$ hypotheses as follows. Given $s \le M$, the set $Y_s$ is some $X_{i(s)\ell(s)j(s)}$; we say that two hypotheses, $H_s$ and $H_{s'}$, are $\mathcal{C}$-close if the segments $\Delta_{\ell(s)}$ and $\Delta_{\ell(s')}$ intersect. Observe that when $H_s$ and $H_{s'}$ are not $\mathcal{C}$-close, the convex compact sets $Y_s$ and $Y_{s'}$ do not intersect, since the values of $f$ on $Y_s$ belong to $\Delta_{\ell(s)}$, the values of $f$ on $Y_{s'}$ belong to $\Delta_{\ell(s')}$, and the segments $\Delta_{\ell(s)}$ and $\Delta_{\ell(s')}$ do not intersect.
Now let us apply to the hypotheses $H_1, ..., H_M$ our machinery for testing up to closeness $\mathcal{C}$; see Section 2.5.2. Assuming that whenever $H_s$ and $H_{s'}$ are not $\mathcal{C}$-close, the risks $\epsilon_{ss'}$ defined in Section 2.5.2.2 are $< 1$,7 we, given tolerance $\epsilon \in (0, 1)$, can find $K = K(\epsilon)$ such that stationary $K$-repeated observation $\omega^K$ allows us to decide $(1 - \epsilon)$-reliably on $H_1, ..., H_M$ up to closeness $\mathcal{C}$. As applied to $\omega^K$, the corresponding test $T^K$ will accept some (perhaps, none) of the hypotheses; let the indexes of the accepted hypotheses form set $S = S(\omega^K)$. We convert $S$ into an estimate $\hat f(\omega^K)$ of $f(x)$, $x \in X = \bigcup_{s \le M} Y_s$ being the signal underlying our observation, as follows:
• when $S = \emptyset$ the estimate is, say, $(\underline{f} + \overline{f})/2$;
• when $S$ is nonempty we take the union $\Delta(S)$ of the segments $\Delta_{\ell(s)}$, $s \in S$, and our estimate is the average of the largest and the smallest elements of $\Delta(S)$.
It is immediately seen that if the signal $x$ underlying our stationary $K$-repeated observation $\omega^K$ belongs to some $Y_{s_*}$, so that the hypothesis $H_{s_*}$ is true, and the outcome $S$ of $T^K$ contains $s_*$ and is such that for all $s \in S$ the hypotheses $H_s$ and $H_{s_*}$ are $\mathcal{C}$-close to each other, we have $|f(x) - \hat f(\omega^K)| \le \delta_L$. Note that since the $\mathcal{C}$-risk of $T^K$ is $\le \epsilon$, the $p_{A(x)}$-probability to get such a “good” outcome, and thus to get $|f(x) - \hat f(\omega^K)| \le \delta_L$, is at least $1 - \epsilon$.
7 In standard simple o.s.’s, this is the case whenever for $s$, $s'$ in question the images of $Y_s$ and $Y_{s'}$ under the mapping $x \mapsto A(x)$ do not intersect. Because for $s$, $s'$ in question $Y_s$ and $Y_{s'}$ do not intersect, this definitely is the case when $A(\cdot)$ is an embedding.
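The conversion of the accepted set $S$ into an estimate can be sketched as follows. This is an illustration under assumptions: segments are indexed $\ell = 1, ..., L$ from left to right, the mapping `ell_of` from hypothesis index $s$ to segment index $\ell(s)$ is supplied by the caller, and the function name `estimate_from_accepted` is hypothetical. The multiple-hypothesis test itself is not implemented here.

```python
def estimate_from_accepted(accepted, ell_of, f_lo, f_hi, L):
    """Estimate of f(x) built from the set of accepted hypotheses.

    accepted: iterable of accepted hypothesis indexes s
    ell_of:   dict mapping s to the index ell(s) in 1..L of its segment
    f_lo, f_hi: the range [min f, max f] of f over X
    """
    accepted = list(accepted)
    if not accepted:
        return 0.5 * (f_lo + f_hi)  # nothing accepted: midpoint of the range
    delta_L = (f_hi - f_lo) / L
    ells = [ell_of[s] for s in accepted]
    smallest = f_lo + (min(ells) - 1) * delta_L  # left end of the leftmost segment
    largest = f_lo + max(ells) * delta_L         # right end of the rightmost segment
    return 0.5 * (smallest + largest)
```

When $N = 1$ and $I = 1$, one has $M = L$ and may take `ell_of = {s: s for s in range(1, L + 1)}`.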
3.2.7.1 Numerical illustration
Our illustration deals with the situation when $I = 1$, $X = X_1$ is a convex compact set, and $f(x)$ is fractional-linear: $f(x) = a^T x / c^T x$ with denominator positive on $X$. Specifically, assume we are given noisy measurements of voltages $V_i$ at some nodes $i$ and currents $I_{ij}$ in some arcs $(i, j)$ of an electric circuit, and want to recover the resistance of a particular arc $(i_*, j_*)$:
$$r_{i_* j_*} = \frac{V_{j_*} - V_{i_*}}{I_{i_* j_*}}.$$
The observation noises are assumed to be $N(0, \sigma^2)$ and independent across the measurements. In our experiment, we work with the data as follows:
[Figure: electric circuit with marked input node and output node; arcs a, b and node c are referred to in the list below.]
x = [voltages at nodes; currents in arcs]
Ax = [observable voltages; observable currents]
• Currents are measured in all arcs except for a, b
• Voltages are measured at all nodes except for c
• We want to recover resistance of arc b
• X is given by:
  conservation of current, except for input/output nodes
  zero voltage at input node, nonnegative currents
  current in arc b at least 1, total of currents at most 33
  Ohm’s Law, resistances of arcs between 1 and 10
We are in the situation of $N = 1$ and $I = 1$, implying $M = L$. When using $L = 8$, the projections of the sets $Y_s$, $1 \le s \le L = 8$, onto the 2D plane of variables $(V_{j_*} - V_{i_*}, I_{i_* j_*})$ are the "stripes" shown below:
[Figure: stripes in the plane with axes $V_{j_*} - V_{i_*}$ and $I_{i_* j_*}$.]
The range of the unknown resistance turns out to be $\Delta = [1, 10]$. We set $\epsilon = 0.01$, and instead of looking for $K$ such that the $K$-repeated observation allows us to recover 0.99-reliably the resistance in the arc of interest within accuracy $|\Delta|/L$, we look for the largest observation noise $\sigma$ allowing us to achieve the desired recovery with a single observation. The results for $L = 8, 16, 32$ are as follows:

                        underlined data         slanted data
  L    δ_L              σ       σopt/σ ≤        σ       σopt/σ ≤
  8    9/8  ≈ 1.13      0.024   1.31            0.031   1.01
  16   9/16 ≈ 0.56      0.010   1.31            0.013   1.06
  32   9/32 ≈ 0.28      0.005   1.33            0.006   1.08
In the above table:
• $\sigma_{\rm opt}$ is the largest $\sigma$ for which "in nature" there exists a test deciding on $H_1, ..., H_L$ with $\mathcal{C}$-risk $\le 0.01$;
• Underlined data: Risks $\epsilon_{ss'}$ of pairwise tests are bounded via risks of optimal detectors; the $\mathcal{C}$-risk of $T$ is bounded by
\[
\left\| \left[ \epsilon_{ss'} \chi_{ss'} \right]_{s,s'=1}^{L} \right\|_{2,2}, \qquad
\chi_{ss'} = \begin{cases} 1, & (s, s') \notin \mathcal{C}, \\ 0, & (s, s') \in \mathcal{C}; \end{cases}
\]
see Proposition 2.29;
• "Slanted" data: Risks $\epsilon_{ss'}$ of pairwise tests are bounded via the error function; the $\mathcal{C}$-risk of $T$ is bounded by
\[
\max_{s} \sum_{s' : (s, s') \notin \mathcal{C}} \epsilon_{ss'}
\]
(it is immediately seen that in the case of Gaussian o.s., this indeed is a legitimate risk bound).
Figure 3.3: A circuit (nine nodes and 16 arcs). a: arc of interest; b: arcs with measured currents; c: input node where external current and voltage are measured.
3.2.7.2 Estimating dissipated power
The alternative approach to estimating $N$-convex functions proposed in Section 3.2.7 can be combined with the quadratic lifting described in Section 2.9 to yield, under favorable circumstances, estimates of quadratic and quadratic fractional functions. We are about to consider an instructive example of this type.
Figure 3.3 represents a DC circuit. We have access to repeated noisy measurements of currents in some arcs and voltages at some nodes, with the voltage of the ground node equal to 0. The arcs are oriented; this orientation, however, is of no relevance in our context and therefore is not displayed. Our goal is to use these observations to estimate the power dissipated in a given "arc of interest." The a priori information is as follows:
• the (unknown) arc resistances are known to belong to a given range $[r, R]$, with $0 < r < R < \infty$;
• the currents and the voltages are linked by Kirchhoff's laws:
– at every node, the sum of currents in the outgoing arcs is equal to the sum of currents in the incoming arcs plus the external current at the node. In our circuit, there are just two external currents, one at the ground node and one at the input node c;
– the voltages and the currents are linked by Ohm's law: for every (inner) arc $\gamma$, we have $I_\gamma r_\gamma = V_{j(\gamma)} - V_{i(\gamma)}$, where $I_\gamma$ is the current in the arc, $r_\gamma$ is the arc's resistance, $V_s$ is the voltage at node $s$, and $i(\gamma)$, $j(\gamma)$ are the initial and the terminal nodes linked by arc $\gamma$;
• magnitudes of all currents and voltages are bounded by 1.
We assume that the measurements of observable currents and voltages are affected by zero mean Gaussian noise with scalar covariance matrix $\theta^2 I$, with unknown $\theta$ from a given range $[\underline{\sigma}, \overline{\sigma}]$.
Processing the problem. We specify the "signal" underlying our observation as a collection $u$ of the voltages at nine nodes and currents $I_\gamma$ in 16 (inner) arcs $\gamma$ of the circuit, augmented by the external current $I_o$ at the input node (so that $-I_o$ is the external current at the ground node). Thus, our single-time observation is
\[
\zeta = Au + \theta\xi, \tag{3.23}
\]
where $A$ extracts from $u$ four entries (currents in two arcs b and external current and voltage at the input node c), $\xi \sim N(0, I_4)$, and $\theta \in [\underline{\sigma}, \overline{\sigma}]$. Our a priori information on $u$ states that $u$ belongs to the compact set $U$ given by the quadratic constraints, namely, as follows:
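\[
U = \left\{ u = \{I_\gamma, I_o, V_i\} :
\begin{array}{ll}
I_\gamma^2 \le 1,\ V_i^2 \le 1\ \ \forall \gamma, i; \quad u^T J^T J u \le 0 & \\[2pt]
\left.\begin{array}{l}
[V_{j(\gamma)} - V_{i(\gamma)}]^2 / R - I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] \le 0 \\
I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] - [V_{j(\gamma)} - V_{i(\gamma)}]^2 / r \le 0
\end{array}\right\}\ \forall \gamma & (a) \\[8pt]
\left.\begin{array}{l}
r I_\gamma^2 - I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] \le 0 \\
I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] - R I_\gamma^2 \le 0
\end{array}\right\}\ \forall \gamma & (b)
\end{array}
\right\} \tag{3.24}
\]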
where $Ju = 0$ expresses the first Kirchhoff's law, and quadratic constraints (a) and (b) account for Ohm's law in the situation when we do not know the exact resistances but only their range $[r, R]$. Note that groups (a) and (b) of constraints in (3.24) are "logical consequences" of each other, and thus one of the groups seems to be redundant. However, on closer inspection, quadratic inequalities valid on $U$ do not tighten the outer approximation $\mathcal{Z}$ of $Z[U]$, and thus are redundant in our context, only when these inequalities can be obtained from the inequalities we do include into the description of $\mathcal{Z}$ "in a linear fashion," that is, by taking weighted sums with nonnegative coefficients. This is not how (b) is obtained from (a). As a result, to get a smaller $\mathcal{Z}$, it makes sense to keep both (a) and (b).
The dissipated power we are interested in estimating is the quadratic function
\[
f(u) = I_{\gamma_*} [V_{j_*} - V_{i_*}] = [u; 1]^T G [u; 1],
\]
where $\gamma_* = (i_*, j_*)$ is the arc of interest, and $G \in \mathbf{S}^{n+1}$, $n = \dim u$, is a properly built matrix. In order to build an estimate, we "lift quadratically" the observations, $\zeta \mapsto \omega = (\zeta, \zeta\zeta^T)$, and pass from the domain $U$ of actual signals to the outer approximation $\mathcal{Z}$ of the quadratic lifting of $U$:
\[
\begin{array}{rcl}
\mathcal{Z} &:=& \{Z \in \mathbf{S}^{n+1} : Z \succeq 0,\ Z_{n+1,n+1} = 1,\ \mathrm{Tr}(Q_s Z) \le c_s,\ 1 \le s \le S\} \\
&\supset& \{[u; 1][u; 1]^T : u \in U\}.
\end{array}
\]
Here the matrix $Q_s \in \mathbf{S}^{n+1}$ represents the left-hand side $F_s(u)$ of the $s$-th quadratic constraint in the description (3.24) of $U$: $F_s(u) \equiv [u; 1]^T Q_s [u; 1]$, and $c_s$ is the right-hand side of the $s$-th constraint. We process the problem similarly to what was done in Section 3.2.7.1, where our goal was to estimate a fractional-linear function. Specifically,
1. We compute the range of $f$ on $U$; the smallest value $\underline{f}$ of $f$ on $U$ clearly is zero, and an upper bound on the maximum of $f(u)$ over $u \in U$ is the optimal value in the convex optimization problem
\[
\overline{f} = \max_{Z \in \mathcal{Z}} \mathrm{Tr}(GZ).
\]
2. Given a positive integer $L$, we split the range $[\underline{f}, \overline{f}]$ into $L$ segments $\Delta_\ell = [a_{\ell-1}, a_\ell]$ of equal length $\delta_L = (\overline{f} - \underline{f})/L$ and define convex compact sets
\[
\mathcal{Z}_\ell = \{Z \in \mathcal{Z} : a_{\ell-1} \le \mathrm{Tr}(GZ) \le a_\ell\}, \quad 1 \le \ell \le L,
\]
so that
\[
u \in U,\ f(u) \in \Delta_\ell \ \Rightarrow\ [u; 1][u; 1]^T \in \mathcal{Z}_\ell, \quad 1 \le \ell \le L.
\]
3. We specify $L$ quadratically constrained hypotheses $H_1, ..., H_L$ on the distribution of observation (3.23), with $H_\ell$ stating that $\zeta \sim N(Au, \theta^2 I_4)$ with some $u \in U$ satisfying $f(u) \in \Delta_\ell$ (so that $[u; 1][u; 1]^T \in \mathcal{Z}_\ell$), and $\theta$ belongs to the above segment $[\underline{\sigma}, \overline{\sigma}]$. We equip our hypotheses with a closeness relation $\mathcal{C}$; specifically, we consider $H_\ell$ and $H_{\ell'}$ $\mathcal{C}$-close if and only if the segments $\Delta_\ell$ and $\Delta_{\ell'}$ intersect.
4. We use Propositions 2.43.ii and 2.40 to build detectors $\phi_{\ell\ell'}$ quadratic in $\zeta$ for the families of distributions obeying $H_\ell$ and $H_{\ell'}$, respectively, along with upper bounds $\epsilon_{\ell\ell'}$ on the risks of these detectors. Then we use the machinery from Section 2.5.2 to find the smallest $K$ and a test $T_{\mathcal{C}}^K$, based on a stationary $K$-repeated version of observation (3.23), able to decide on $H_1, ..., H_L$ with $\mathcal{C}$-risk $\le \epsilon$, where $\epsilon \in (0, 1)$ is a given tolerance.
Finally, given stationary $K$-repeated observation (3.23), we apply to it test $T_{\mathcal{C}}^K$, look at the hypotheses, if any, accepted by the test, and build the union $\Delta$ of the corresponding segments $\Delta_\ell$. If $\Delta = \emptyset$, we estimate $f(u)$ as the midpoint of the power range $[\underline{f}, \overline{f}]$; otherwise the estimate is the mean of the largest and the smallest points in $\Delta$. It is easily seen that for this estimate, the probability for the estimation error to be $> \delta_L$ is $\le \epsilon$.
The numerical results we present here correspond to the circuit presented in Figure 3.3. We set $\overline{\sigma} = 0.01$, $\underline{\sigma} = \overline{\sigma}/\sqrt{2}$, $[r, R] = [1, 2]$, $\epsilon = 0.01$, and $L = 8$. The simulation setting is as follows: the computed range $[\underline{f}, \overline{f}]$ of the dissipated power is $[0, 0.821]$, so that the estimate built recovers the dissipated power within accuracy $0.821/8 \approx 0.103$ and reliability 0.99. The resulting value of $K$ is $K = 95$.
In all 500 simulation runs, the actual recovery error was less than the bound 0.103, and the average error was as small as 0.041.
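The bookkeeping in steps 2 and 3 (splitting the range into segments and declaring two hypotheses close when their segments intersect) can be sketched as follows. This is an illustrative sketch only: it reproduces the segment length for the reported range [0, 0.821] and L = 8, but does not build the detectors or the test; the helper names `split_range` and `closeness` are hypothetical.

```python
from typing import List, Tuple


def split_range(f_lo: float, f_hi: float, L: int) -> Tuple[float, List[Tuple[float, float]]]:
    """Split [f_lo, f_hi] into L segments of equal length delta_L."""
    delta = (f_hi - f_lo) / L
    pts = [f_lo + k * delta for k in range(L + 1)]
    return delta, [(pts[k], pts[k + 1]) for k in range(L)]


def closeness(segments: List[Tuple[float, float]]) -> List[List[bool]]:
    """close[i][j] is True iff the closed segments i and j intersect."""
    L = len(segments)
    return [
        [
            segments[i][0] <= segments[j][1] and segments[j][0] <= segments[i][1]
            for j in range(L)
        ]
        for i in range(L)
    ]


# Range of the dissipated power reported in the text, with L = 8.
delta_L, segs = split_range(0.0, 0.821, 8)
close = closeness(segs)
```

With equal-length segments sharing endpoints, each hypothesis is close exactly to itself and to its immediate neighbors, so the test only has to separate segments that are at least one segment apart.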
3.3 ESTIMATING LINEAR FORMS BEYOND SIMPLE OBSERVATION SCHEMES
We are about to show that the techniques developed in Section 2.8 can be applied to building estimates of linear and quadratic forms of the parameters of observed distributions. As compared to the machinery of Section 3.2, our new approach has somewhat restricted scope: we do not estimate general $N$-convex functions nor handle domains which are unions of convex sets; now we need the function to be linear (perhaps, after quadratic lifting of observations) and the domain to be convex (see footnote 8). As a compensation, we are not limited to simple observation schemes anymore; our approach is in fact a natural extension of the approach developed in Section 3.1 beyond simple o.s.'s.
In this section, we focus on estimating linear forms; estimating quadratic forms will be our subject in Section 3.4.
3.3.1 Situation and goal
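Consider the situation as follows: given are Euclidean spaces $\Omega = E_H$, $E_M$, $E_X$ along with
• regular data (see Section 2.8.1.1) $\mathcal{H} \subset E_H$, $\mathcal{M} \subset E_M$, $\Phi(\cdot; \cdot) : \mathcal{H} \times \mathcal{M} \to \mathbf{R}$, with $0 \in \mathrm{int}\,\mathcal{H}$,
• a nonempty convex compact set $\mathcal{X} \subset E_X$,
• an affine mapping $x \mapsto A(x) : E_X \to E_M$ such that $A(\mathcal{X}) \subset \mathcal{M}$,
• a continuous convex calibrating function $\upsilon(x) : \mathcal{X} \to \mathbf{R}$,
• a vector $g \in E_X$ and a constant $c$ specifying the linear form $G(x) = \langle g, x \rangle + c : E_X \to \mathbf{R}$ (see footnote 9),
• a tolerance $\epsilon \in (0, 1)$.
These data specify, in particular, the family $\mathcal{P} = \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ of probability distributions on $\Omega = E_H$; see Section 2.8.1.1. Given random observation
\[
\omega \sim P(\cdot) \tag{3.25}
\]
where $P \in \mathcal{P}$ is such that
\[
\forall h \in \mathcal{H} : \ \ln\left( \int_{E_H} e^{\langle h, \omega \rangle} P(d\omega) \right) \le \Phi(h; A(x)) \tag{3.26}
\]
for some $x \in \mathcal{X}$ (that is, $A(x)$ is a parameter, as defined in Section 2.8.1.1, of distribution $P$), we want to recover the quantity $G(x)$.
$\epsilon$-risk. Given $\rho > 0$, we call an estimate $\widehat{g}(\cdot) : E_H \to \mathbf{R}$ $(\rho, \epsilon, \upsilon(\cdot))$-accurate if for all pairs $x \in \mathcal{X}$, $P \in \mathcal{P}$ satisfying (3.26) it holds
\[
\mathrm{Prob}_{\omega \sim P}\{|\widehat{g}(\omega) - G(x)| > \rho + \upsilon(x)\} \le \epsilon.
\]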
If $\rho_*$ is the infimum of those $\rho$ for which estimate $\widehat{g}$ is $(\rho, \epsilon, \upsilon(\cdot))$-accurate, then clearly $\widehat{g}$ is $(\rho_*, \epsilon, \upsilon(\cdot))$-accurate; we shall call $\rho_*$ the $\epsilon$-risk of the estimate $\widehat{g}$ taken w.r.t. the data $G(\cdot)$, $\mathcal{X}$, $\upsilon(\cdot)$, and $(A, \mathcal{H}, \mathcal{M}, \Phi)$:
\[
\mathrm{Risk}_\epsilon(\widehat{g}(\cdot) | G, \mathcal{X}, \upsilon, A, \mathcal{H}, \mathcal{M}, \Phi)
= \min\left\{ \rho : \
\begin{array}{l}
\mathrm{Prob}_{\omega \sim P}\{\omega : |\widehat{g}(\omega) - G(x)| > \rho + \upsilon(x)\} \le \epsilon \\[2pt]
\forall (x, P) : \ P \in \mathcal{P},\ x \in \mathcal{X},\ \ln\left( \int e^{h^T \omega} P(d\omega) \right) \le \Phi(h; A(x))\ \ \forall h \in \mathcal{H}
\end{array}
\right\}. \tag{3.27}
\]
8 The latter is just for the sake of simplicity, to not overload the presentation to follow. An interested reader will certainly be able to reproduce the corresponding construction of Section 3.1 in the situation of this section.
9 From now on, $\langle u, v \rangle$ denotes the inner product of vectors $u, v$ belonging to a Euclidean space; what this space is will always be clear from the context.
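When $G$, $\mathcal{X}$, $\upsilon$, $A$, $\mathcal{H}$, $\mathcal{M}$, and $\Phi$ are clear from the context, we shorten $\mathrm{Risk}_\epsilon(\widehat{g}(\cdot) | G, \mathcal{X}, \upsilon, A, \mathcal{H}, \mathcal{M}, \Phi)$ to $\mathrm{Risk}_\epsilon(\widehat{g}(\cdot))$.
Given the data listed at the beginning of this section, we are about to build, in a computationally efficient fashion, an affine estimate $\widehat{g}(\omega) = \langle h_*, \omega \rangle + \kappa$ along with $\rho_*$ such that the estimate is $(\rho_*, \epsilon, \upsilon(\cdot))$-accurate.
3.3.2 Construction and main results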
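Let us set
\[
\mathcal{H}^+ = \{(h, \alpha) : h \in E_H,\ \alpha > 0,\ h/\alpha \in \mathcal{H}\},
\]
so that $\mathcal{H}^+$ is a nonempty convex set in $E_H \times \mathbf{R}_+$, and let
\[
\begin{array}{lrcl}
(a) & \Psi_+(h, \alpha) &=& \sup_{x \in \mathcal{X}} \left[ \alpha \Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) \right] : \mathcal{H}^+ \to \mathbf{R}, \\
(b) & \Psi_-(h, \beta) &=& \sup_{x \in \mathcal{X}} \left[ \beta \Phi(-h/\beta, A(x)) + G(x) - \upsilon(x) \right] : \mathcal{H}^+ \to \mathbf{R},
\end{array} \tag{3.28}
\]
so that $\Psi_\pm$ are convex real-valued functions on $\mathcal{H}^+$ (recall that $\Phi$ is convex-concave and continuous on $\mathcal{H} \times \mathcal{M}$, while $A(\mathcal{X})$ is a compact subset of $\mathcal{M}$).
Our starting point is quite simple:
Proposition 3.5. Given $\epsilon \in (0, 1)$, let $\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho}$ be a feasible solution to the system of convex constraints
\[
\begin{array}{lrcl}
(a_1) & (h, \alpha) &\in& \mathcal{H}^+ \\
(a_2) & (h, \beta) &\in& \mathcal{H}^+ \\
(b_1) & \alpha \ln(\epsilon/2) &\ge& \Psi_+(h, \alpha) - \rho + \kappa \\
(b_2) & \beta \ln(\epsilon/2) &\ge& \Psi_-(h, \beta) - \rho - \kappa
\end{array} \tag{3.29}
\]
in variables $h, \alpha, \beta, \rho, \kappa$. Setting
\[
\widehat{g}(\omega) = \langle \bar{h}, \omega \rangle + \bar{\kappa},
\]
we obtain an estimate with $\epsilon$-risk at most $\bar{\rho}$.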
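Proof. Let $\epsilon \in (0, 1)$, $\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho}$ satisfy the premise of the proposition, and let $x \in \mathcal{X}$, $P$ satisfy (3.26). We have
\[
\begin{array}{rcl}
\mathrm{Prob}_{\omega \sim P}\{\widehat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\}
&=& \mathrm{Prob}_{\omega \sim P}\left\{ \dfrac{\langle \bar{h}, \omega \rangle}{\bar{\alpha}} > \dfrac{G(x) + \bar{\rho} - \bar{\kappa} + \upsilon(x)}{\bar{\alpha}} \right\} \\[8pt]
&\le& \left[ \displaystyle\int e^{\langle \bar{h}, \omega \rangle / \bar{\alpha}} P(d\omega) \right] e^{-\frac{G(x) + \bar{\rho} - \bar{\kappa} + \upsilon(x)}{\bar{\alpha}}} \\[8pt]
&\le& e^{\Phi(\bar{h}/\bar{\alpha}, A(x))}\, e^{-\frac{G(x) + \bar{\rho} - \bar{\kappa} + \upsilon(x)}{\bar{\alpha}}}.
\end{array}
\]
As a result,
\[
\begin{array}{rcll}
\bar{\alpha} \ln\left( \mathrm{Prob}_{\omega \sim P}\{\widehat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\} \right)
&\le& \bar{\alpha} \Phi(\bar{h}/\bar{\alpha}, A(x)) - G(x) - \bar{\rho} + \bar{\kappa} - \upsilon(x) & \\
&\le& \Psi_+(\bar{h}, \bar{\alpha}) - \bar{\rho} + \bar{\kappa} & \mbox{[by definition of $\Psi_+$ and due to $x \in \mathcal{X}$]} \\
&\le& \bar{\alpha} \ln(\epsilon/2) & \mbox{[by (3.29.$b_1$)]}
\end{array}
\]
so that
\[
\mathrm{Prob}_{\omega \sim P}\{\widehat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\} \le \epsilon/2.
\]
Similarly,
\[
\begin{array}{rcl}
\mathrm{Prob}_{\omega \sim P}\{\widehat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\}
&=& \mathrm{Prob}_{\omega \sim P}\left\{ \dfrac{-\langle \bar{h}, \omega \rangle}{\bar{\beta}} > \dfrac{-G(x) + \bar{\rho} + \bar{\kappa} + \upsilon(x)}{\bar{\beta}} \right\} \\[8pt]
&\le& \left[ \displaystyle\int e^{-\langle \bar{h}, \omega \rangle / \bar{\beta}} P(d\omega) \right] e^{-\frac{-G(x) + \bar{\rho} + \bar{\kappa} + \upsilon(x)}{\bar{\beta}}} \\[8pt]
&\le& e^{\Phi(-\bar{h}/\bar{\beta}, A(x))}\, e^{\frac{G(x) - \bar{\rho} - \bar{\kappa} - \upsilon(x)}{\bar{\beta}}}.
\end{array}
\]
Thus
\[
\begin{array}{rcll}
\bar{\beta} \ln\left( \mathrm{Prob}_{\omega \sim P}\{\widehat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\} \right)
&\le& \bar{\beta} \Phi(-\bar{h}/\bar{\beta}, A(x)) + G(x) - \bar{\rho} - \bar{\kappa} - \upsilon(x) & \\
&\le& \Psi_-(\bar{h}, \bar{\beta}) - \bar{\rho} - \bar{\kappa} & \mbox{[by definition of $\Psi_-$ and due to $x \in \mathcal{X}$]} \\
&\le& \bar{\beta} \ln(\epsilon/2) & \mbox{[by (3.29.$b_2$)]}
\end{array}
\]
and
\[
\mathrm{Prob}_{\omega \sim P}\{\widehat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\} \le \epsilon/2.
\]
✷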
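The key step of the proof is the exponential Markov bound with a free scale parameter α. As a sanity check, the sketch below works it out in the simplest scalar case, which is an assumption made here for illustration only: ω ∼ N(μ, s²), for which ln E exp(h(ω − μ)) = h²s²/2. Minimizing the bound over α gives exp(−t²/(2s²)), which is compared with the exact Gaussian tail. All names and numerical values are hypothetical.

```python
import math


def gaussian_tail(t: float, s: float) -> float:
    """Exact Prob{omega - mu > t} for omega ~ N(mu, s^2)."""
    return 0.5 * math.erfc(t / (s * math.sqrt(2.0)))


def chernoff_bound(t: float, s: float, alpha: float) -> float:
    """Exponential Markov bound with scale alpha > 0:
    Prob{omega - mu > t} <= E exp((omega - mu)/alpha) * exp(-t/alpha)
                          = exp(s^2 / (2 alpha^2) - t / alpha)."""
    return math.exp(s * s / (2.0 * alpha * alpha) - t / alpha)


s, t = 1.0, 2.0
alphas = [0.05 * k for k in range(1, 201)]       # grid of scales in (0, 10]
best = min(chernoff_bound(t, s, a) for a in alphas)
closed_form = math.exp(-t * t / (2.0 * s * s))   # value at the minimizer alpha = s^2 / t
exact = gaussian_tail(t, s)
```

The bound is valid for every α, so the freedom to choose α (and β for the lower tail) is what the constraints (3.29) exploit.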
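Corollary 3.6. In the situation described in Section 3.3.1, let $\Phi$ satisfy the relation
\[
\Phi(0; \mu) \ge 0 \ \ \forall \mu \in \mathcal{M}. \tag{3.30}
\]
Then
\[
\begin{array}{lrcl}
(a) & \widehat{\Psi}_+(h) &:=& \inf_\alpha \left\{ \Psi_+(h, \alpha) + \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+ \right\} \\
& &=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \left[ \alpha \Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + \alpha \ln(2/\epsilon) \right], \\[4pt]
(b) & \widehat{\Psi}_-(h) &:=& \inf_\alpha \left\{ \Psi_-(h, \alpha) + \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+ \right\} \\
& &=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \left[ \alpha \Phi(-h/\alpha, A(x)) + G(x) - \upsilon(x) + \alpha \ln(2/\epsilon) \right],
\end{array} \tag{3.31}
\]
and functions $\widehat{\Psi}_\pm : E_H \to \mathbf{R}$ are convex. Furthermore, let $\bar{h}, \bar{\kappa}, \widetilde{\rho}$ be a feasible solution to the system of convex constraints
\[
\widehat{\Psi}_+(h) \le \rho - \kappa, \quad \widehat{\Psi}_-(h) \le \rho + \kappa \tag{3.32}
\]
in variables $h, \rho, \kappa$. Then the estimate
\[
\widehat{g}(\omega) = \langle \bar{h}, \omega \rangle + \bar{\kappa}
\]
of $G(x)$, $x \in \mathcal{X}$, has the $\epsilon$-risk at most $\widetilde{\rho}$:
\[
\mathrm{Risk}_\epsilon(\widehat{g}(\cdot) | G, \mathcal{X}, \upsilon, A, \mathcal{H}, \mathcal{M}, \Phi) \le \widetilde{\rho}. \tag{3.33}
\]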
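Relation (3.32) (and thus the risk bound (3.33)) clearly holds true when $\bar{h}$ is a candidate solution to the convex optimization problem
\[
\mathrm{Opt} = \min_h \left\{ \widehat{\Psi}(h) := \tfrac{1}{2} \left[ \widehat{\Psi}_+(h) + \widehat{\Psi}_-(h) \right] \right\} \tag{3.34}
\]
and
\[
\widetilde{\rho} = \widehat{\Psi}(\bar{h}), \quad \bar{\kappa} = \tfrac{1}{2} \left[ \widehat{\Psi}_-(\bar{h}) - \widehat{\Psi}_+(\bar{h}) \right].
\]
As a result, by properly selecting $\bar{h}$, we can make (an upper bound on) the $\epsilon$-risk of estimate $\widehat{g}(\cdot)$ arbitrarily close to $\mathrm{Opt}$, and equal to $\mathrm{Opt}$ when optimization problem (3.34) is solvable.
Proof. Let us first verify the identities in (3.31). The function
\[
\Theta_+(h, \alpha; x) = \alpha \Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + \alpha \ln(2/\epsilon) : \mathcal{H}^+ \times \mathcal{X} \to \mathbf{R}
\]
is convex-concave and continuous, and $\mathcal{X}$ is compact, whence by the Sion-Kakutani Theorem
\[
\begin{array}{rcl}
\widehat{\Psi}_+(h) &:=& \inf_\alpha \left\{ \Psi_+(h, \alpha) + \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+ \right\} \\
&=& \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \max_{x \in \mathcal{X}} \Theta_+(h, \alpha; x) \\
&=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \Theta_+(h, \alpha; x) \\
&=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \left[ \alpha \Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + \alpha \ln(2/\epsilon) \right],
\end{array}
\]
as required in (3.31.a). As we know, $\Psi_+(h, \alpha)$ is a real-valued continuous function on $\mathcal{H}^+$, so that $\widehat{\Psi}_+$ is convex on $E_H$, provided that the function is real-valued. Now, let $\bar{x} \in \mathcal{X}$, and let $e$ be a subgradient of $\phi(h) = \Phi(h; A(\bar{x}))$ taken at $h = 0$. For $h \in E_H$ and all $\alpha > 0$ such that $(h, \alpha) \in \mathcal{H}^+$ we have
\[
\begin{array}{rcl}
\Psi_+(h, \alpha) + \alpha \ln(2/\epsilon) &\ge& \alpha \Phi(h/\alpha; A(\bar{x})) - G(\bar{x}) - \upsilon(\bar{x}) + \alpha \ln(2/\epsilon) \\
&\ge& \alpha \left[ \Phi(0; A(\bar{x})) + \langle e, h/\alpha \rangle \right] - G(\bar{x}) - \upsilon(\bar{x}) + \alpha \ln(2/\epsilon) \\
&\ge& \langle e, h \rangle - G(\bar{x}) - \upsilon(\bar{x})
\end{array}
\]
(we have used (3.30)), and therefore $\Psi_+(h, \alpha) + \alpha \ln(2/\epsilon)$ as a function of $\alpha$ is bounded from below on the set $\{\alpha > 0 : h/\alpha \in \mathcal{H}\}$. In addition, this set is nonempty, since $\mathcal{H}$ contains a neighbourhood of the origin. Thus, $\widehat{\Psi}_+$ is real-valued and convex on $E_H$. Verification of (3.31.b) and of the fact that $\widehat{\Psi}_-(h)$ is a real-valued convex function on $E_H$ is completely similar.
Now, given a feasible solution $(\bar{h}, \bar{\kappa}, \widetilde{\rho})$ to (3.32), let us select some $\bar{\rho} > \widetilde{\rho}$. Taking into account the definition of $\widehat{\Psi}_\pm$, we can find $\bar{\alpha}$ and $\bar{\beta}$ such that
\[
\begin{array}{l}
(\bar{h}, \bar{\alpha}) \in \mathcal{H}^+ \ \& \ \Psi_+(\bar{h}, \bar{\alpha}) + \bar{\alpha} \ln(2/\epsilon) \le \bar{\rho} - \bar{\kappa}, \\
(\bar{h}, \bar{\beta}) \in \mathcal{H}^+ \ \& \ \Psi_-(\bar{h}, \bar{\beta}) + \bar{\beta} \ln(2/\epsilon) \le \bar{\rho} + \bar{\kappa},
\end{array}
\]
implying that the collection $(\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho})$ is a feasible solution to (3.29). Invoking Proposition 3.5, we get
\[
\mathrm{Prob}_{\omega \sim P}\{\omega : |\widehat{g}(\omega) - G(x)| > \bar{\rho} + \upsilon(x)\} \le \epsilon
\]
for all $(x \in \mathcal{X}, P \in \mathcal{P})$ satisfying (3.26). Since $\bar{\rho}$ can be selected arbitrarily close to $\widetilde{\rho}$, $\widehat{g}(\cdot)$ indeed is a $(\widetilde{\rho}, \epsilon, \upsilon(\cdot))$-accurate estimate. ✷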
3.3.3 Estimation from repeated observations
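Assume that in the situation described in Section 3.3.1 we have access to $K$ observations $\omega_1, ..., \omega_K$ sampled, independently of each other, from a probability distribution $P$, and aim to build the estimate based on these $K$ observations rather than on a single observation. We can immediately reduce this new situation to the previous one, just by redefining the data. Specifically, given initial data $\mathcal{H} \subset E_H$, $\mathcal{M} \subset E_M$, $\Phi(\cdot; \cdot) : \mathcal{H} \times \mathcal{M} \to \mathbf{R}$, $\mathcal{X} \subset E_X$, $\upsilon(\cdot)$, $A(\cdot)$, $G(x) = \langle g, x \rangle + c$ (see Section 3.3.1) and a positive integer $K$, let us update part of the data, namely, replace $\mathcal{H} \subset E_H$ with
\[
\mathcal{H}^K := \underbrace{\mathcal{H} \times ... \times \mathcal{H}}_{K} \subset E_H^K := \underbrace{E_H \times ... \times E_H}_{K}
\]
and replace $\Phi(\cdot, \cdot) : \mathcal{H} \times \mathcal{M} \to \mathbf{R}$ with
\[
\Phi_K(h^K = (h_1, ..., h_K); \mu) = \sum_{i=1}^K \Phi(h_i; \mu) : \mathcal{H}^K \times \mathcal{M} \to \mathbf{R}.
\]
It is immediately seen that the updated data satisfy all requirements imposed on the data in Section 3.3.1, and that whenever $x \in \mathcal{X}$ and a Borel probability distribution $P$ on $E_H$ are linked by (3.26), $x$ and the distribution $P^K$ of the $K$-element i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ drawn from $P$ are linked by the relation
\[
\begin{array}{rcl}
\forall h^K = (h_1, ..., h_K) \in \mathcal{H}^K : \quad
\ln\left( \displaystyle\int_{E_H^K} e^{\langle h^K, \omega^K \rangle} P^K(d\omega^K) \right)
&=& \displaystyle\sum_i \ln\left( \int_{E_H} e^{\langle h_i, \omega_i \rangle} P(d\omega_i) \right) \\
&\le& \Phi_K(h^K; A(x)).
\end{array}
\]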
Applying to our new data the construction from Section 3.3.2, we arrive at "repeated observation" versions of Proposition 3.5 and Corollary 3.6. Note that the resulting convex constraints/objectives are symmetric w.r.t. permutations of the components $h_1, ..., h_K$ of $h^K$, implying that we lose nothing when restricting ourselves to collections $h^K$ with components equal to each other; it is convenient to denote the common value of these components $h/K$. With this observation in mind, Proposition 3.5 and Corollary 3.6 translate into the following statements (we use the assumptions and the notation from the previous sections):
Proposition 3.7. Given $\epsilon \in (0, 1)$ and positive integer $K$, let
\[
\begin{array}{lrcl}
(a) & \Psi_+(h, \alpha) &=& \sup_{x \in \mathcal{X}} \left[ \alpha \Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) \right] : \mathcal{H}^+ \to \mathbf{R}, \\
(b) & \Psi_-(h, \beta) &=& \sup_{x \in \mathcal{X}} \left[ \beta \Phi(-h/\beta, A(x)) + G(x) - \upsilon(x) \right] : \mathcal{H}^+ \to \mathbf{R},
\end{array}
\]
and let $\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho}$ be a feasible solution to the system of convex constraints
\[
\begin{array}{lrcl}
(a_1) & (h, \alpha) &\in& \mathcal{H}^+ \\
(a_2) & (h, \beta) &\in& \mathcal{H}^+ \\
(b_1) & \alpha K^{-1} \ln(\epsilon/2) &\ge& \Psi_+(h, \alpha) - \rho + \kappa \\
(b_2) & \beta K^{-1} \ln(\epsilon/2) &\ge& \Psi_-(h, \beta) - \rho - \kappa
\end{array} \tag{3.35}
\]
in variables $h, \alpha, \beta, \rho, \kappa$. Setting
\[
\widehat{g}(\omega^K) = \left\langle \bar{h}, \frac{1}{K} \sum_{i=1}^K \omega_i \right\rangle + \bar{\kappa},
\]
we obtain an estimate of $G(x)$ via independent $K$-repeated observations $\omega_i \sim P$, $i = 1, ..., K$, with the $\epsilon$-risk on $\mathcal{X}$ not exceeding $\bar{\rho}$. In other words, whenever $x \in \mathcal{X}$ and a Borel probability distribution $P$ on $E_H$ are linked by (3.26), one has
\[
\mathrm{Prob}_{\omega^K \sim P^K}\left\{ \omega^K : |\widehat{g}(\omega^K) - G(x)| > \bar{\rho} + \upsilon(x) \right\} \le \epsilon. \tag{3.36}
\]
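Corollary 3.8. In the situation described at the beginning of Section 3.3.1, let $\Phi$ satisfy relation (3.30), and let a positive integer $K$ be given. Then
\[
\begin{array}{lrcl}
(a) & \widehat{\Psi}_{+,K}(h) &:=& \inf_\alpha \left\{ \Psi_+(h, \alpha) + K^{-1} \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+ \right\} \\
& &=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \left[ \alpha \Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + K^{-1} \alpha \ln(2/\epsilon) \right], \\[4pt]
(b) & \widehat{\Psi}_{-,K}(h) &:=& \inf_\alpha \left\{ \Psi_-(h, \alpha) + K^{-1} \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+ \right\} \\
& &=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0,\, (h, \alpha) \in \mathcal{H}^+} \left[ \alpha \Phi(-h/\alpha, A(x)) + G(x) - \upsilon(x) + K^{-1} \alpha \ln(2/\epsilon) \right],
\end{array}
\]
and functions $\widehat{\Psi}_{\pm,K} : E_H \to \mathbf{R}$ are convex. Furthermore, let $\bar{h}, \bar{\kappa}, \widetilde{\rho}$ be a feasible solution to the system of convex constraints
\[
\widehat{\Psi}_{+,K}(h) \le \rho - \kappa, \quad \widehat{\Psi}_{-,K}(h) \le \rho + \kappa \tag{3.37}
\]
in variables $h, \rho, \kappa$. Then the $\epsilon$-risk of the estimate
\[
\widehat{g}(\omega^K) = \left\langle \bar{h}, \frac{1}{K} \sum_{i=1}^K \omega_i \right\rangle + \bar{\kappa}
\]
of $G(x)$, $x \in \mathcal{X}$, is at most $\widetilde{\rho}$, implying that whenever $x \in \mathcal{X}$ and a Borel probability distribution $P$ on $E_H$ are linked by (3.26), relation (3.36) holds true with $\widetilde{\rho}$ in the role of $\bar{\rho}$.
Relation (3.37) clearly holds true when $\bar{h}$ is a candidate solution to the convex optimization problem
\[
\mathrm{Opt}_K = \min_h \left\{ \widehat{\Psi}_K(h) := \tfrac{1}{2} \left[ \widehat{\Psi}_{+,K}(h) + \widehat{\Psi}_{-,K}(h) \right] \right\} \tag{3.38}
\]
and
\[
\widetilde{\rho} = \widehat{\Psi}_K(\bar{h}), \quad \bar{\kappa} = \tfrac{1}{2} \left[ \widehat{\Psi}_{-,K}(\bar{h}) - \widehat{\Psi}_{+,K}(\bar{h}) \right].
\]
As a result, by properly selecting $\bar{h}$ we can make (an upper bound on) the $\epsilon$-risk of the estimate $\widehat{g}(\cdot)$ arbitrarily close to $\mathrm{Opt}_K$, and equal to $\mathrm{Opt}_K$ when optimization problem (3.38) is solvable.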
From now on, if not explicitly stated otherwise, we deal with K-repeated observations; to get back to single-observation case, it suffices to set K = 1.
3.3.4 Application: Estimating linear forms of sub-Gaussianity parameters
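Consider the simplest case of the situation from Sections 3.3.1 and 3.3.3, where
• $\mathcal{H} = E_H = \mathbf{R}^d$, $\mathcal{M} = E_M = \mathbf{R}^d \times \mathbf{S}^d_+$, $\Phi(h; \mu, M) = h^T \mu + \tfrac{1}{2} h^T M h : \mathbf{R}^d \times (\mathbf{R}^d \times \mathbf{S}^d_+) \to \mathbf{R}$, so that $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ is the family of all sub-Gaussian distributions on $\mathbf{R}^d$;
• $\mathcal{X} \subset E_X = \mathbf{R}^{n_x}$ is a nonempty convex compact set;
• $A(x) = (Ax + a, M(x))$, where $A$ is a $d \times n_x$ matrix, and $M(x)$ is a symmetric $d \times d$ matrix affinely depending on $x$ such that $M(x) \succeq 0$ when $x \in \mathcal{X}$;
• $\upsilon(x)$ is a convex continuous function on $\mathcal{X}$;
• $G(x)$ is an affine function on $E_X$.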
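In the case in question, (3.30) clearly takes place, and the left-hand sides in constraints (3.37) become
\[
\begin{array}{rcl}
\widehat{\Psi}_{+,K}(h) &=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0} \left\{ h^T [Ax + a] + \frac{1}{2\alpha} h^T M(x) h + K^{-1} \alpha \ln(2/\epsilon) - G(x) - \upsilon(x) \right\} \\[4pt]
&=& \max_{x \in \mathcal{X}} \left\{ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x) h]} + h^T [Ax + a] - G(x) - \upsilon(x) \right\}, \\[6pt]
\widehat{\Psi}_{-,K}(h) &=& \sup_{x \in \mathcal{X}} \inf_{\alpha > 0} \left\{ -h^T [Ax + a] + \frac{1}{2\alpha} h^T M(x) h + K^{-1} \alpha \ln(2/\epsilon) + G(x) - \upsilon(x) \right\} \\[4pt]
&=& \max_{x \in \mathcal{X}} \left\{ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x) h]} - h^T [Ax + a] + G(x) - \upsilon(x) \right\}.
\end{array}
\]
Thus, system (3.37) reads
\[
\begin{array}{rcl}
a^T h + \max_{x \in \mathcal{X}} \left[ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x) h]} + h^T A x - G(x) - \upsilon(x) \right] &\le& \rho - \kappa, \\[4pt]
-a^T h + \max_{x \in \mathcal{X}} \left[ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x) h]} - h^T A x + G(x) - \upsilon(x) \right] &\le& \rho + \kappa.
\end{array}
\]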
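The passage from the inner infimum over α to the square-root term is the scalar identity inf over α > 0 of [q/(2α) + α ln(2/ǫ)/K] = sqrt(2 ln(2/ǫ) q / K), with q standing for the quadratic form h^T M(x) h. The sketch below checks this numerically on a fine grid of α; the values of q, K, and ǫ and the function names are hypothetical, chosen for illustration only.

```python
import math


def inner_inf_numeric(q: float, K: int, eps: float,
                      alpha_max: float = 5.0, n_grid: int = 20000) -> float:
    """Grid approximation of inf_{alpha>0} [ q/(2 alpha) + alpha*ln(2/eps)/K ]."""
    log_term = math.log(2.0 / eps)
    best = float("inf")
    for k in range(1, n_grid + 1):
        alpha = alpha_max * k / n_grid
        best = min(best, q / (2.0 * alpha) + alpha * log_term / K)
    return best


def inner_inf_closed(q: float, K: int, eps: float) -> float:
    """Closed form sqrt(2 ln(2/eps) q / K), attained at alpha = sqrt(q K / (2 ln(2/eps)))."""
    return math.sqrt(2.0 * math.log(2.0 / eps) * q / K)


# Hypothetical values: q plays the role of h^T M(x) h.
q, K, eps = 0.37, 5, 0.01
numeric = inner_inf_numeric(q, K, eps)
closed = inner_inf_closed(q, K, eps)
```

Because the infimum has a closed form for every fixed x, the constraints depend on h only through convex functions that can be handed to a standard convex solver.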
We arrive at the following version of Corollary 3.8:
Proposition 3.9. In the situation described at the beginning of Section 3.3.4, given $\epsilon \in (0, 1)$, let $\bar{h}$ be a feasible solution to the convex optimization problem
\[
\mathrm{Opt}_K = \min_{h \in \mathbf{R}^d} \widehat{\Psi}_K(h) \tag{3.39}
\]
where
\[
\widehat{\Psi}_K(h) := \frac{1}{2} \left[
\underbrace{\max_{x \in \mathcal{X}} \left[ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x) h]} + h^T A x - G(x) - \upsilon(x) \right] + a^T h}_{\widehat{\Psi}_{+,K}(h)}
+ \underbrace{\max_{y \in \mathcal{X}} \left[ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(y) h]} - h^T A y + G(y) - \upsilon(y) \right] - a^T h}_{\widehat{\Psi}_{-,K}(h)}
\right].
\]
Then, setting
\[
\bar{\kappa} = \tfrac{1}{2} \left[ \widehat{\Psi}_{-,K}(\bar{h}) - \widehat{\Psi}_{+,K}(\bar{h}) \right], \quad \bar{\rho} = \widehat{\Psi}_K(\bar{h}),
\]
the affine estimate
\[
\widehat{g}(\omega^K) = \frac{1}{K} \sum_{i=1}^K \bar{h}^T \omega_i + \bar{\kappa}
\]
has $\epsilon$-risk, taken w.r.t. the data listed at the beginning of this section, at most $\bar{\rho}$.
It is immediately seen that optimization problem (3.39) is solvable, provided that
\[
\bigcap_{x \in \mathcal{X}} \mathrm{Ker}(M(x)) = \{0\},
\]
and an optimal solution $h_*$ to the problem, taken along with
\[
\kappa_* = \tfrac{1}{2} \left[ \widehat{\Psi}_{-,K}(h_*) - \widehat{\Psi}_{+,K}(h_*) \right], \tag{3.40}
\]
yields the affine estimate
\[
\widehat{g}_*(\omega^K) = \frac{1}{K} \sum_{i=1}^K h_*^T \omega_i + \kappa_*
\]
with $\epsilon$-risk, taken w.r.t. the data listed at the beginning of this section, at most $\mathrm{Opt}_K$.
3.3.4.1 Consistency
Assuming $\upsilon(x) \equiv 0$, we can easily answer the natural question "when is the proposed estimation scheme consistent?", meaning that for every $\epsilon \in (0, 1)$, it allows us to achieve arbitrarily small $\epsilon$-risk, provided that $K$ is large enough. Specifically, denoting by $g^T x$ the linear part of $G(x)$: $G(x) = g^T x + c$, from Proposition 3.9 it is immediately seen that a necessary and sufficient condition for consistency is the existence of $\bar{h} \in \mathbf{R}^d$ such that $\bar{h}^T A x = g^T x$ for all $x \in \mathcal{X} - \mathcal{X}$, or, equivalently, the condition that $g$ is orthogonal to the intersection of the kernel of $A$ with the linear span of $\mathcal{X} - \mathcal{X}$. Indeed, under this assumption, for every fixed $\epsilon \in (0, 1)$ we clearly have $\lim_{K \to \infty} \widehat{\Psi}_K(\bar{h}) = 0$, implying that $\lim_{K \to \infty} \mathrm{Opt}_K = 0$, with $\widehat{\Psi}_K$ and $\mathrm{Opt}_K$ given by (3.39). On the other hand, if the condition is violated, then there exist $x', x'' \in \mathcal{X}$ such that $Ax' = Ax''$ and $G(x') \ne G(x'')$; we lose nothing when assuming that $G(x'') > G(x')$. Looking at (3.39), we see that
\[
\begin{array}{rcl}
\widehat{\Psi}_K(h) &\ge& \frac{1}{2} \left[ \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x') h]} + h^T A x' - G(x') + a^T h \right. \\[4pt]
& & \left. \ {} + \sqrt{2 K^{-1} \ln(2/\epsilon) [h^T M(x'') h]} - h^T A x'' + G(x'') - a^T h \right] \\[4pt]
&\ge& \frac{1}{2} \left[ G(x'') - G(x') \right],
\end{array}
\]
whence $\mathrm{Opt}_K$, for all $K$, is lower-bounded by $\frac{1}{2}[G(x'') - G(x')] > 0$.
3.3.4.2 Direct product case
Further simplifications are possible in the direct product case, where, in addition to what was assumed at the beginning of Section 3.3.4,
• $E_X = E_U \times E_V$ and $\mathcal{X} = U \times V$, with convex compact sets $U \subset E_U = \mathbf{R}^{n_u}$ and $V \subset E_V = \mathbf{R}^{n_v}$,
• $A(x = (u, v)) = [Au + a, M(v)] : U \times V \to \mathbf{R}^d \times \mathbf{S}^d$, with $M(v) \succeq 0$ for $v \in V$,
• $G(x = (u, v)) = g^T u + c$ depends solely on $u$, and
• $\upsilon(x = (u, v)) = \varrho(u)$ depends solely on $u$.
It is immediately seen that in the direct product case problem (3.39) reads
\[
\mathrm{Opt}_K = \min_{h \in \mathbf{R}^d} \left\{ \frac{\phi_U(A^T h - g) + \phi_U(-A^T h + g)}{2} + \max_{v \in V} \sqrt{2 K^{-1} \ln(2/\epsilon)\, h^T M(v) h} \right\}, \tag{3.41}
\]
where
\[
\phi_U(f) = \max_{u \in U} \left[ u^T f - \varrho(u) \right]. \tag{3.42}
\]
Assuming $\bigcap_{v \in V} \mathrm{Ker}(M(v)) = \{0\}$, the problem is solvable, and its optimal solution $h_*$ gives rise to the affine estimate
\[
\widehat{g}_*(\omega^K) = \frac{1}{K} \sum_i h_*^T \omega_i + \kappa_*, \quad \kappa_* = \tfrac{1}{2} \left[ \phi_U(-A^T h_* + g) - \phi_U(A^T h_* - g) \right] - a^T h_* + c
\]
with $\epsilon$-risk $\le \mathrm{Opt}_K$.
Near-optimality. In addition to the assumption that we are in the direct product case, assume that $\upsilon(\cdot) \equiv 0$ and, for the sake of simplicity, that $M(v) \succ 0$ whenever $v \in V$. In this case (3.39) reads
\[
\mathrm{Opt}_K = \min_h \max_{v \in V} \left\{ \Theta(h, v) := \tfrac{1}{2} \left[ \phi_U(A^T h - g) + \phi_U(-A^T h + g) \right] + \sqrt{2 K^{-1} \ln(2/\epsilon)\, h^T M(v) h} \right\}.
\]
Hence, taking into account that $\Theta(h, v)$ clearly is convex in $h$ and concave in $v$, while $V$ is a convex compact set, by the Sion-Kakutani Theorem we get also
\[
\mathrm{Opt}_K = \max_{v \in V} \left[ \mathrm{Opt}(v) = \min_h \left\{ \tfrac{1}{2} \left[ \phi_U(A^T h - g) + \phi_U(-A^T h + g) \right] + \sqrt{2 K^{-1} \ln(2/\epsilon)\, h^T M(v) h} \right\} \right]. \tag{3.43}
\]
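To see what problem (3.41) looks like in the smallest possible instance, the sketch below solves a scalar toy version. All data are assumptions made for illustration and do not come from the text: d = 1, U = [−1, 1] with ̺ ≡ 0 (so that φ_U(f) = |f|), A = 1, a = 0, g = 1, c = 0, and M(v) ≡ s² for a single v. The objective then reduces to |h − 1| + c_K |h| with c_K = s·sqrt(2 ln(2/ǫ)/K), a convex piecewise linear function minimized here by ternary search.

```python
import math
from typing import Tuple


def opt_K_scalar(K: int, eps: float, s: float,
                 lo: float = -2.0, hi: float = 2.0, iters: int = 200) -> Tuple[float, float]:
    """Minimize |h - 1| + c_K |h| over h by ternary search (the objective is convex).

    Toy instance of (3.41): U = [-1, 1], rho = 0, A = 1, g = 1, M(v) = s^2.
    Returns (optimal value, minimizer).
    """
    c_K = s * math.sqrt(2.0 * math.log(2.0 / eps) / K)

    def objective(h: float) -> float:
        return abs(h - 1.0) + c_K * abs(h)

    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2          # a minimizer lies in [lo, m2]
        else:
            lo = m1          # a minimizer lies in [m1, hi]
    h_star = 0.5 * (lo + hi)
    return objective(h_star), h_star


def c_of(K: int, eps: float, s: float) -> float:
    return s * math.sqrt(2.0 * math.log(2.0 / eps) / K)


low_noise_value, low_noise_h = opt_K_scalar(K=1, eps=0.01, s=0.1)
many_obs_value, _ = opt_K_scalar(K=100, eps=0.01, s=0.1)
high_noise_value, high_noise_h = opt_K_scalar(K=1, eps=0.01, s=1.0)
```

In this toy instance the optimal value is min(1, c_K): when the noise term is small the estimate uses the observation with h = 1 and the risk bound decays like 1/sqrt(K), while for large noise the best affine estimate ignores the observation (h = 0) and pays the half-width of the range of g^T u.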
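Now consider the problem of estimating $g^T u$ from independent observations $\omega_i$, $i \le K$, sampled from $N(Au + a, M(v))$, where unknown $u$ is known to belong to $U$ and $v \in V$ is known. Let $\rho_\epsilon(v)$ be the minimax $\epsilon$-risk of recovery:
\[
\rho_\epsilon(v) = \inf_{\widehat{g}(\cdot)} \left\{ \rho : \ \mathrm{Prob}_{\omega^K \sim [N(Au + a, M(v))]^K}\{\omega^K : |\widehat{g}(\omega^K) - g^T u| > \rho\} \le \epsilon \ \ \forall u \in U \right\},
\]
where inf is taken over all Borel functions $\widehat{g}(\cdot) : \mathbf{R}^{Kd} \to \mathbf{R}$. Invoking [131, Theorem 3.1], it is immediately seen that whenever $\epsilon < 1/4$, one has
\[
\rho_\epsilon(v) \ge \left[ \frac{2 \ln(2/\epsilon)}{\ln\left( \frac{1}{4\epsilon} \right)} \right]^{-1} \mathrm{Opt}(v).
\]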
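To get a feel for how "moderate" the factor in this bound is, the following sketch simply evaluates 2 ln(2/ǫ) / ln(1/(4ǫ)) as displayed above for a few tolerances. The function name is hypothetical; the formula is taken from the inequality as it appears in the text.

```python
import math


def near_optimality_factor(eps: float) -> float:
    """Evaluate 2 ln(2/eps) / ln(1/(4 eps)); defined for 0 < eps < 1/4."""
    if not 0.0 < eps < 0.25:
        raise ValueError("the bound is stated for 0 < eps < 1/4")
    return 2.0 * math.log(2.0 / eps) / math.log(1.0 / (4.0 * eps))


factor_1pct = near_optimality_factor(0.01)
factors = {eps: near_optimality_factor(eps) for eps in (0.1, 0.05, 0.01, 0.001)}
```

The factor grows without bound as ǫ approaches 1/4 and decreases toward 2 as ǫ tends to 0, so it is most informative for small tolerances.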
Since the family $\mathcal{SG}(U, V)$ of all sub-Gaussian distributions on $\mathbf{R}^d$ with parameters $(Au + a, M(v))$, $u \in U$, $v \in V$, contains all Gaussian distributions $N(Au + a, M(v))$ induced by $(u, v) \in U \times V$, we arrive at the following conclusion:
Proposition 3.10. In the just described situation, the minimax optimal $\epsilon$-risk
\[
\mathrm{Risk}^{\rm opt}_\epsilon(K) = \inf_{\widehat{g}(\cdot)} \mathrm{Risk}_\epsilon(\widehat{g}(\cdot))
\]
of recovering $g^T u$ from a $K$-repeated i.i.d. sub-Gaussian observation with parameters $(Au + a, M(v))$, $(u, v) \in U \times V$, is within a moderate factor of the upper bound $\mathrm{Opt}_K$ on the $\epsilon$-risk, taken w.r.t. the same data, of the affine estimate $\widehat{g}_*(\cdot)$ yielded by an optimal solution to (3.41), namely,
\[
\mathrm{Opt}_K \le \frac{2 \ln(2/\epsilon)}{\ln\left( \frac{1}{4\epsilon} \right)} \, \mathrm{Risk}^{\rm opt}_\epsilon(K).
\]
3.3.4.3 Numerical illustration
The numerical illustration we are about to discuss models the situation in which we want to recover a linear form of a signal $x$ known to belong to a given convex compact subset $\mathcal{X}$ via indirect observations $Ax$ affected by sub-Gaussian "relative noise," meaning that the larger the signal, the larger the variance of the observation. Specifically, our observation is
\[
\omega \sim \mathcal{SG}(Ax, M(x)),
\]
where
\[
x \in \mathcal{X} = \left\{ x \in \mathbf{R}^n : 0 \le x_j \le j^{-\alpha},\ 1 \le j \le n \right\}, \quad M(x) = \sigma^2 \sum_{j=1}^n x_j \Theta_j \tag{3.44}
\]
where $A \in \mathbf{R}^{d \times n}$ and $\Theta_j \in \mathbf{S}^d_+$, $j = 1, ..., n$, are given matrices; the linear form to be estimated is $G(x) = g^T x$. The entities $g$, $A$, and $\{\Theta_j\}_{j=1}^n$ and reals $\alpha \ge 0$ ("degree of smoothness") and $\sigma > 0$ ("noise intensity") are parameters of the estimation problem we intend to process. The parameters $g$, $A$, $\Theta_j$ are as follows:
• $g \ge 0$ is selected at random and then normalized to have
\[
\max_{x \in \mathcal{X}} g^T x = \max_{x, y \in \mathcal{X}} g^T [x - y] = 2;
\]
• we deal with the case of $n > d$ ("deficient observations"); the $d$ nonzero singular values of $A$ were set to $\theta^{-\frac{i-1}{d-1}}$, $1 \le i \le d$, where "condition number" $\theta \ge 1$ is a parameter; the orthonormal systems $U$ and $V$ of the first $d$ left and, respectively, right singular vectors of $A$ were drawn at random from rotationally invariant distributions;
• the positive semidefinite $d \times d$ matrices $\Theta_j$ are orthogonal projectors on randomly selected subspaces in $\mathbf{R}^d$ of dimension $\lfloor d/2 \rfloor$;
• in all our experiments, we consider the single-observation case $K = 1$ and use $\upsilon(\cdot) \equiv 0$.
Note that $\mathcal{X}$ possesses the $\ge$-largest point $\bar{x}$, whence $M(x) \preceq M(\bar{x})$ whenever $x \in \mathcal{X}$; as a result, sub-Gaussian distributions with matrix parameter $M(x)$, $x \in \mathcal{X}$, can be thought also to have matrix parameter $M(\bar{x})$. One of the goals of the considered experiment is to understand how much we might lose were we replacing $M(\cdot)$ with $\widehat{M}(x) \equiv M(\bar{x})$, that is, were we ignoring the fact that small signals result in low-noise observations.
In our experiment we use $d = 32$, $n = 48$, $\alpha = 2$, $\theta = 2$, and $\sigma = 0.01$. With these parameters, we generated at random, as described above, 10 collections $\{g, A, \Theta_j, j \le n\}$, thus arriving at 10 estimation problems. For each problem, we apply the outlined machinery to build an estimate of $g^T x$ affine in $\omega$ as yielded by the optimal solution to (3.39), and compute the upper bound $\mathrm{Opt}$ on the ($\epsilon = 0.01$)-risk of this estimate. In fact, for each problem, we build two estimates and two risk bounds: the first for the problem "as is," and the second for the aforementioned "direct product envelope" of the problem, where the mapping $x \mapsto M(x)$ is replaced with the conservative $x \mapsto \widehat{M}(x) := M(\bar{x})$. The results are as follows:

                                  min     median   mean    max
  ω ∼ SG(Ax, M(x))                0.138   0.190    0.212   0.299
  ω ∼ SG(Ax, M(x̄))               0.150   0.210    0.227   0.320

Upper bounds on 0.01-risk, data over 10 estimation problems [d = 32, n = 48, α = 2, θ = 2, σ = 0.01]
Note the significant “noise amplification” in the estimate (about 20 times the observation noise level σ) and high risk variability across the experiments. Seemingly, both these phenomena stem from the fact that we have highly deficient observations (n/d = 1.5) combined with a random orientation of the 16-dimensional kernel of A.
3.4 ESTIMATING QUADRATIC FORMS VIA QUADRATIC LIFTING
In the situation of Section 3.3.1, passing from "original" observations (3.25) to their quadratic lifting, we can use the machinery just developed to estimate quadratic, rather than linear, forms of the underlying parameters. We investigate the related possibilities in the cases of Gaussian and sub-Gaussian observations. The results of this section form an essential extension of the results of [39, 81] where a similar approach to estimating quadratic functionals of the mean of a Gaussian vector was used.
3.4.1 Estimating quadratic forms, Gaussian case
3.4.1.1 Preliminaries
Consider the situation where we are given a nonempty bounded set U in Rm ; a nonempty convex compact subset V of the positive semidefinite cone Sd+ ; a matrix Θ∗ ≻ 0 such that Θ∗ Θ for all Θ ∈ V; an affine mapping u 7→ A[u; 1] : Rm → Ω = Rd , where A is a given d × (m + 1) matrix; • a convex continuous function ̺(·) on Sm+1 . + • • • •
A pair $(u\in U,\Theta\in\mathcal{V})$ specifies the Gaussian random vector $\zeta\sim\mathcal{N}(A[u;1],\Theta)$ and thus specifies the probability distribution $P[u,\Theta]$ of $(\zeta,\zeta\zeta^T)$. Let $\mathcal{Q}(U,\mathcal{V})$ be the family of probability distributions on $\Omega=\mathbf{R}^d\times\mathbf{S}^d$ stemming in this way from Gaussian distributions with parameters from $U\times\mathcal{V}$. Our goal is to cover the family $\mathcal{Q}(U,\mathcal{V})$ by a family of the type S[N, M, Φ]. It is convenient to represent a linear form on $\Omega=\mathbf{R}^d\times\mathbf{S}^d$ as
$$h^Tz+\tfrac12\mathrm{Tr}(HZ),$$
where $(h,H)\in\mathbf{R}^d\times\mathbf{S}^d$ is the “vector of coefficients” of the form, and $(z,Z)\in\mathbf{R}^d\times\mathbf{S}^d$ is the argument of the form. We assume that for some $\delta\in[0,2]$ it holds
$$\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\quad\forall\Theta\in\mathcal{V},\qquad(3.45)$$
where $\|\cdot\|$ is the spectral norm (cf. (2.129)). Finally, we set
$$b=[0;...;0;1]\in\mathbf{R}^{m+1},\quad B=\begin{bmatrix}A\\ b^T\end{bmatrix}$$
and
$$\mathcal{Z}^+=\{W\in\mathbf{S}^{m+1}_+:\ W_{m+1,m+1}=1\}.$$
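Condition (3.45) is easy to evaluate numerically for a given pair of matrices. The following is a minimal sketch, not part of the original text: the helper names `sym_sqrt` and `delta_for` and the example values are hypothetical. It computes the left-hand side of (3.45) for a single $\Theta$ and checks it on the isotropic case $\Theta=\theta\sigma^2I_d$, $\Theta_*=\sigma^2I_d$, where the value is $1-\sqrt{\theta}$.

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def delta_for(Theta, Theta_star):
    """Spectral norm of Theta^{1/2} Theta_*^{-1/2} - I (left side of (3.45))."""
    d = Theta.shape[0]
    M = sym_sqrt(Theta) @ np.linalg.inv(sym_sqrt(Theta_star)) - np.eye(d)
    return float(np.linalg.norm(M, 2))

# Hypothetical example values (not from the text).
d, sigma2, theta = 4, 2.0, 0.25
delta = delta_for(theta * sigma2 * np.eye(d), sigma2 * np.eye(d))
```

For a whole set $\mathcal{V}$ one would take the largest such value over $\Theta\in\mathcal{V}$; the sketch handles one matrix at a time.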
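The statement below is nothing but a straightforward reformulation of Proposition 2.43.i:
Proposition 3.11. In the just described situation, let us select $\gamma\in(0,1)$ and set
$$
\begin{array}{rcl}
\mathcal{H}&=&\mathcal{H}_\gamma:=\{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
\mathcal{M}^+&=&\mathcal{V}\times\mathcal{Z}^+,\\
\Phi(h,H;\Theta,Z)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2})+\frac12\mathrm{Tr}([\Theta-\Theta_*]H)\\
&&+\,\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2+\Gamma(h,H;Z):\ \mathcal{H}\times\mathcal{M}^+\to\mathbf{R},
\end{array}
$$
where $\|\cdot\|$ is the spectral norm, $\|\cdot\|_F$ is the Frobenius norm, and
$$
\begin{array}{rcl}
\Gamma(h,H;Z)&=&\frac12\mathrm{Tr}\left(Z\left[bh^TA+A^Thb^T+A^THA+B^T[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]B\right]\right)\\
&=&\frac12\mathrm{Tr}\left(ZB^T\left[\begin{bmatrix}H&h\\ h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\right]B\right).
\end{array}
$$
Then $\mathcal{H},\mathcal{M}^+,\Phi$ form regular data, and for every $(u,\Theta)\in\mathbf{R}^m\times\mathcal{V}$ it holds
$$\forall(h,H)\in\mathcal{H}:\quad \ln\left(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],\Theta)}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\le\Phi(h,H;\Theta,[u;1][u;1]^T).$$
Besides this, the function $\Phi(h,H;\Theta,Z)$ is coercive in the convex argument: whenever $(\Theta,Z)\in\mathcal{M}$ and $(h_i,H_i)\in\mathcal{H}$ and $\|(h_i,H_i)\|\to\infty$ as $i\to\infty$, we have $\Phi(h_i,H_i;\Theta,Z)\to\infty$, $i\to\infty$.
3.4.1.2 Estimating quadratic form: Situation and goal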
Let us assume that we are given a sample $\zeta^K=(\zeta_1,...,\zeta_K)$ of identically distributed observations
$$\zeta_i\sim\mathcal{N}(A[u;1],M(v)),\quad 1\le i\le K,\qquad(3.46)$$
independent across $i$, where
• $(u,v)$ is an unknown “signal” known to belong to a given set $U\times V$, where
– $U\subset\mathbf{R}^m$ is a compact set, and
– $V\subset\mathbf{R}^k$ is a compact convex set;
• $A$ is a given $d\times(m+1)$ matrix, and $v\mapsto M(v):\mathbf{R}^k\to\mathbf{S}^d$ is an affine mapping such that $M(v)\succeq0$ whenever $v\in V$.
We are also given a convex calibrating function $\varrho(Z):\mathbf{S}^{m+1}_+\to\mathbf{R}$ and a “functional of interest”
$$F(u,v)=[u;1]^TQ[u;1]+q^Tv,\qquad(3.47)$$
where $Q$ and $q$ are a known $(m+1)\times(m+1)$ symmetric matrix and a $k$-dimensional vector, respectively. Our goal is to estimate the value $F(u,v)$, for unknown $(u,v)$ known to belong to $U\times V$. Given a tolerance $\epsilon\in(0,1)$, we quantify the quality of a candidate estimate $\widehat g(\zeta^K)$ of $F(u,v)$ by the smallest $\rho$ such that for all $(u,v)\in U\times V$ it holds
$$\mathrm{Prob}_{\zeta^K\sim[\mathcal{N}(A[u;1],M(v))]^K}\left\{|\widehat g(\zeta^K)-F(u,v)|>\rho+\varrho([u;1][u;1]^T)\right\}\le\epsilon.$$
3.4.1.3 Construction and result
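Let
$$\mathcal{V}=\{M(v):\ v\in V\},$$
so that $\mathcal{V}$ is a convex compact subset of the positive semidefinite cone $\mathbf{S}^d_+$. Let us select some
1. matrix $\Theta_*\succ0$ such that $\Theta_*\succeq\Theta$ for all $\Theta\in\mathcal{V}$;
2. convex compact subset $\mathcal{Z}$ of the set $\mathcal{Z}^+=\{Z\in\mathbf{S}^{m+1}_+:\ Z_{m+1,m+1}=1\}$ such that $[u;1][u;1]^T\in\mathcal{Z}$ for all $u\in U$;
3. real $\gamma\in(0,1)$ and a nonnegative real $\delta$ such that (3.45) takes place.
We further set (cf. Proposition 3.11)
$$
\begin{array}{rcl}
B&=&\begin{bmatrix}A\\ [0,...,0,1]\end{bmatrix}\in\mathbf{R}^{(d+1)\times(m+1)},\\
\mathcal{H}&=&\mathcal{H}_\gamma:=\{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
\mathcal{M}&=&\mathcal{V}\times\mathcal{Z},\\
\Phi(h,H;\Theta,Z)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2})+\frac12\mathrm{Tr}([\Theta-\Theta_*]H)\\
&&+\,\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2+\Gamma(h,H;Z):\ \mathcal{H}\times\mathcal{M}\to\mathbf{R},
\end{array}\qquad(3.48)
$$
where
$$
\begin{array}{rcl}
\Gamma(h,H;Z)&=&\frac12\mathrm{Tr}\left(Z\left[bh^TA+A^Thb^T+A^THA+B^T[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]B\right]\right)\\
&=&\frac12\mathrm{Tr}\left(ZB^T\left[\begin{bmatrix}H&h\\ h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\right]B\right),
\end{array}
$$
and treat, as observation, the quadratic lifting of observation (3.46), that is, our observation is
$$\omega^K=\{\omega_i=(\zeta_i,\zeta_i\zeta_i^T)\}_{i=1}^K,\ \text{with independent}\ \zeta_i\sim\mathcal{N}(A[u;1],M(v)).\qquad(3.49)$$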
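The point of the lifting (3.49) is that a quadratic function of $\zeta$ becomes a linear form of $\omega=(\zeta,\zeta\zeta^T)$: with the representation $h^Tz+\frac12\mathrm{Tr}(HZ)$ of linear forms on $\mathbf{R}^d\times\mathbf{S}^d$, one has $h^T\zeta+\frac12\zeta^TH\zeta=h^T\zeta+\frac12\mathrm{Tr}(H\zeta\zeta^T)$. A small numerical sketch of this identity follows; the function names `lift` and `pairing` and the random test data are illustrative assumptions, not notation from the text.

```python
import numpy as np

def lift(zeta):
    """Quadratic lifting zeta -> (zeta, zeta zeta^T), as in (3.49)."""
    return zeta, np.outer(zeta, zeta)

def pairing(f, omega):
    """Linear form h^T z + 0.5 * Tr(H Z) with coefficients f = (h, H)."""
    (h, H), (z, Z) = f, omega
    return float(h @ z + 0.5 * np.trace(H @ Z))

# Hypothetical random data for the check.
rng = np.random.default_rng(0)
d = 3
zeta = rng.standard_normal(d)
h = rng.standard_normal(d)
S = rng.standard_normal((d, d))
H = 0.5 * (S + S.T)  # symmetric coefficient matrix

lhs = pairing((h, H), lift(zeta))            # linear in the lifted observation
rhs = float(h @ zeta + 0.5 * zeta @ H @ zeta)  # quadratic in zeta
```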
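Note that by Proposition 3.11 the function $\Phi(h,H;\Theta,Z):\mathcal{H}\times\mathcal{M}\to\mathbf{R}$ is a continuous convex-concave function which is coercive in the convex argument and is such that
$$\forall(u\in U,\ v\in V,\ (h,H)\in\mathcal{H}):\quad \ln\left(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],M(v))}\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right)\le\Phi(h,H;M(v),[u;1][u;1]^T).\qquad(3.50)$$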
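Bound (3.50) can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it implements $\Phi$ from (3.48) with NumPy and compares it with the standard closed-form log-moment-generating function of a quadratic form of a Gaussian vector, $\ln\mathbf{E}\,e^{h^T\zeta+\frac12\zeta^TH\zeta}=-\frac12\ln\mathrm{Det}(I-\Theta H)+h^T\mu+\frac12\mu^TH\mu+\frac12(h+H\mu)^T[\Theta^{-1}-H]^{-1}(h+H\mu)$ for $\zeta\sim\mathcal{N}(\mu,\Theta)$, $H\prec\Theta^{-1}$. When $\Theta=\Theta_*$ and $\delta=0$ the two expressions coincide. All function names and test data are hypothetical.

```python
import numpy as np

def sym_sqrt(S):
    w, V = np.linalg.eigh(S)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def Phi(h, H, Theta, Z, A, Theta_star, delta):
    """Phi(h, H; Theta, Z) as written in (3.48)."""
    d, m1 = A.shape
    b = np.zeros(m1); b[-1] = 1.0
    B = np.vstack([A, b])                      # (d+1) x (m+1)
    R = sym_sqrt(Theta_star)
    X = R @ H @ R                              # Theta_*^{1/2} H Theta_*^{1/2}
    _, logdet = np.linalg.slogdet(np.eye(d) - X)
    Hh = np.hstack([H, h[:, None]])            # the d x (d+1) matrix [H, h]
    top = np.block([[H, h[:, None]], [h[None, :], np.zeros((1, 1))]])
    inner = top + Hh.T @ np.linalg.solve(np.linalg.inv(Theta_star) - H, Hh)
    Gamma = 0.5 * np.trace(Z @ B.T @ inner @ B)
    spec = np.linalg.norm(X, 2)
    fro2 = np.linalg.norm(X, 'fro') ** 2
    return float(-0.5 * logdet + 0.5 * np.trace((Theta - Theta_star) @ H)
                 + delta * (2 + delta) * fro2 / (2 * (1 - spec)) + Gamma)

def exact_log_mgf(h, H, mu, Theta):
    """Closed-form ln E exp{h^T z + 0.5 z^T H z} for z ~ N(mu, Theta)."""
    d = len(mu)
    _, logdet = np.linalg.slogdet(np.eye(d) - Theta @ H)
    g = h + H @ mu
    return float(-0.5 * logdet + h @ mu + 0.5 * mu @ H @ mu
                 + 0.5 * g @ np.linalg.solve(np.linalg.inv(Theta) - H, g))

# Hypothetical test data.
rng = np.random.default_rng(1)
d, m = 3, 2
A = rng.standard_normal((d, m + 1))
G0 = rng.standard_normal((d, d))
Theta_star = G0 @ G0.T + np.eye(d)
Rinv = np.linalg.inv(sym_sqrt(Theta_star))
H = Rinv @ np.diag([0.4, -0.3, 0.1]) @ Rinv    # so that (h, H) lies in H_gamma, gamma = 0.5
H = 0.5 * (H + H.T)
h = rng.standard_normal(d)
z = np.append(rng.standard_normal(m), 1.0)     # [u; 1]
Z = np.outer(z, z)
mu = A @ z

phi_eq = Phi(h, H, Theta_star, Z, A, Theta_star, delta=0.0)
exact_eq = exact_log_mgf(h, H, mu, Theta_star)

Theta_small = 0.5 * Theta_star                 # satisfies (3.45) with delta = 1 - sqrt(0.5)
phi_ub = Phi(h, H, Theta_small, Z, A, Theta_star, delta=1 - np.sqrt(0.5))
exact_small = exact_log_mgf(h, H, mu, Theta_small)
```

In the second comparison $\Theta\ne\Theta_*$, so only the inequality of (3.50) is expected, not equality.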
We are about to demonstrate that when estimating the functional of interest (3.47) at a point $(u,v)\in U\times V$ via observation (3.49), we are in the situation considered in Section 3.3 and can utilize the corresponding machinery. Indeed, let us specify the following data introduced in Section 3.3.1:
• $\mathcal{H}=\{f=(h,H)\in\mathcal{H}\}\subset E_H=\mathbf{R}^d\times\mathbf{S}^d$, with $\mathcal{H}$ defined in (3.48), and the inner product on $E_H$ defined as
$$\langle(h,H),(h',H')\rangle=h^Th'+\tfrac12\mathrm{Tr}(HH'),$$
$E_M=\mathbf{S}^d\times\mathbf{S}^{m+1}$, and $\mathcal{M}$, $\Phi$ defined as in (3.48);
• $E_X=\mathbf{R}^k\times\mathbf{S}^{m+1}$, $\mathcal{X}=V\times\mathcal{Z}$;
• $\mathcal{A}(x=(v,Z))=(M(v),Z)$; note that $\mathcal{A}$ is an affine mapping from $E_X$ into $E_M$ which maps $\mathcal{X}$ into $\mathcal{M}$, as required in Section 3.3.1. Observe that when $u\in U$ and $v\in V$, the common distribution $P=P_{u,v}$ of i.i.d. observations $\omega_i$ defined by (3.49) satisfies the relation
$$
\begin{array}{rcl}
\forall(f=(h,H)\in\mathcal{H}):\ \ln\left(\mathbf{E}_{\omega\sim P}\left\{e^{\langle f,\omega\rangle}\right\}\right)&=&\ln\left(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],M(v))}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\\
&\le&\Phi(h,H;M(v),[u;1][u;1]^T);
\end{array}\qquad(3.51)
$$
see (3.50);
• $\upsilon(x=(v,Z))=\varrho(Z):\mathcal{X}\to\mathbf{R}$;
• we define the affine functional $G(x)$ on $E_X$ by the relation
$$G(x)=\langle g,x:=(v,Z)\rangle=q^Tv+\mathrm{Tr}(QZ);$$
see (3.47). As a result, for $x=(v,[u;1][u;1]^T)$ with $v\in V$ and $u\in U$ we have
$$F(u,v)=G(x).$$
Applying Corollary 3.8 to the data just specified (which is legitimate, because our $\Phi$ clearly satisfies (3.30)), we arrive at the result as follows:
Proposition 3.12. In the situation just described, let us set
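$$
\begin{array}{rcl}
\widehat\Psi_{+,K}(h,H)&:=&\inf\limits_{\alpha}\left\{\max\limits_{(v,Z)\in V\times\mathcal{Z}}\left[\alpha\Phi(h/\alpha,H/\alpha;M(v),Z)-G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\right]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\right\}\\
&=&\max\limits_{(v,Z)\in V\times\mathcal{Z}}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\left[\alpha\Phi(h/\alpha,H/\alpha;M(v),Z)-G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\right],\\
\widehat\Psi_{-,K}(h,H)&:=&\inf\limits_{\alpha}\left\{\max\limits_{(v,Z)\in V\times\mathcal{Z}}\left[\alpha\Phi(-h/\alpha,-H/\alpha;M(v),Z)+G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\right]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\right\}\\
&=&\max\limits_{(v,Z)\in V\times\mathcal{Z}}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\left[\alpha\Phi(-h/\alpha,-H/\alpha;M(v),Z)+G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\right],
\end{array}\qquad(3.52)
$$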
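so that the functions $\widehat\Psi_{\pm,K}(h,H):\mathbf{R}^d\times\mathbf{S}^d\to\mathbf{R}$ are convex. Furthermore, whenever $\bar h,\bar H,\bar\rho,\bar\kappa$ form a feasible solution to the system of convex constraints
$$\widehat\Psi_{+,K}(h,H)\le\rho-\kappa,\quad \widehat\Psi_{-,K}(h,H)\le\rho+\kappa\qquad(3.53)$$
in variables $(h,H)\in\mathbf{R}^d\times\mathbf{S}^d$, $\rho\in\mathbf{R}$, $\kappa\in\mathbf{R}$, setting
$$\widehat g(\zeta^K:=(\zeta_1,...,\zeta_K))=\frac1K\sum_{i=1}^K\left[\bar h^T\zeta_i+\frac12\zeta_i^T\bar H\zeta_i\right]+\bar\kappa,\qquad(3.54)$$
we get an estimate of the functional of interest $F(u,v)=[u;1]^TQ[u;1]+q^Tv$ via $K$ independent observations
$$\zeta_i\sim\mathcal{N}(A[u;1],M(v)),\quad i=1,...,K,$$
with the following property:
$$\forall(u,v)\in U\times V:\quad \mathrm{Prob}_{\zeta^K\sim[\mathcal{N}(A[u;1],M(v))]^K}\left\{|F(u,v)-\widehat g(\zeta^K)|>\bar\rho+\varrho([u;1][u;1]^T)\right\}\le\epsilon.\qquad(3.55)$$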
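Once $(\bar h,\bar H,\bar\kappa)$ are available, evaluating estimate (3.54) is a plain average over the sample. The sketch below is illustrative only: the function names are ours, and the inputs `psi_plus`, `psi_minus` stand for already computed values of $\widehat\Psi_{\pm,K}$ at a fixed $(h,H)$, which this snippet does not compute. The helper `rho_kappa` uses the elementary observation that, for fixed $(h,H)$, adding the two inequalities in (3.53) gives $\rho\ge\frac12[\widehat\Psi_{+,K}+\widehat\Psi_{-,K}]$, and that this value is attained with $\kappa=\frac12[\widehat\Psi_{-,K}-\widehat\Psi_{+,K}]$.

```python
import numpy as np

def quad_estimate(zetas, h, H, kappa):
    """Estimate (3.54): average of h^T z_i + 0.5 z_i^T H z_i over the sample, plus kappa."""
    zetas = np.asarray(zetas, dtype=float)             # K x d array, one observation per row
    lin = zetas @ h
    quad = 0.5 * np.einsum('ki,ij,kj->k', zetas, H, zetas)
    return float(np.mean(lin + quad) + kappa)

def rho_kappa(psi_plus, psi_minus):
    """Smallest rho (and the matching kappa) satisfying (3.53) for fixed (h, H)."""
    return 0.5 * (psi_plus + psi_minus), 0.5 * (psi_minus - psi_plus)

# Hypothetical toy data.
zetas = [[1.0, 0.0], [0.0, 2.0]]
h = np.array([1.0, 1.0])
H = np.eye(2)
value = quad_estimate(zetas, h, H, kappa=0.5)
rho, kappa = rho_kappa(1.0, 3.0)
```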
Proof. Under the premise of the proposition, let us fix $u\in U$, $v\in V$, so that $x:=(v,Z:=[u;1][u;1]^T)\in\mathcal{X}$. Denoting, as above, by $P=P_{u,v}$ the distribution of $\omega:=(\zeta,\zeta\zeta^T)$ with $\zeta\sim\mathcal{N}(A[u;1],M(v))$, and invoking (3.51), we see that for the $(x,P)$ just defined, relation (3.26) takes place. Applying Corollary 3.8, we conclude that
$$\mathrm{Prob}_{\zeta^K\sim[\mathcal{N}(A[u;1],M(v))]^K}\left\{|\widehat g(\zeta^K)-G(x)|>\bar\rho+\varrho([u;1][u;1]^T)\right\}\le\epsilon.$$
It remains to note that by construction for the $x=(v,Z)$ in question it holds
$$G(x)=q^Tv+\mathrm{Tr}(QZ)=q^Tv+\mathrm{Tr}(Q[u;1][u;1]^T)=q^Tv+[u;1]^TQ[u;1]=F(u,v).\qquad\Box$$
An immediate consequence of Proposition 3.12 is as follows:
Corollary 3.13. Under the premise and in the notation of Proposition 3.12, let $(h,H)\in\mathbf{R}^d\times\mathbf{S}^d$. Setting
$$
\begin{array}{rcl}
\rho&=&\frac12\left[\widehat\Psi_{+,K}(h,H)+\widehat\Psi_{-,K}(h,H)\right],\\
\kappa&=&\frac12\left[\widehat\Psi_{-,K}(h,H)-\widehat\Psi_{+,K}(h,H)\right],
\end{array}\qquad(3.56)
$$
the $\epsilon$-risk of estimate (3.54) does not exceed $\rho$.
Indeed, with $\rho$ and $\kappa$ given by (3.56), $h,H,\rho,\kappa$ satisfy (3.53).
3.4.1.4 Consistency
We are about to present a simple sufficient condition for the estimator defined in Proposition 3.12 to be consistent in the sense of Section 3.3.4.1. Specifically, in the situation and with the notation from Sections 3.4.1.1 and 3.4.1.3 assume that
A.1. $\varrho(\cdot)\equiv0$;
A.2. $V=\{\bar v\}$ is a singleton and $M(\bar v)\succ0$, which allows us to set $\Theta_*=M(\bar v)$, to satisfy (3.45) with $\delta=0$, and to assume w.l.o.g. that
$$F(u,v)=[u;1]^TQ[u;1],\quad G(Z)=\mathrm{Tr}(QZ);$$
A.3. the first $m$ columns of the $d\times(m+1)$ matrix $A$ are linearly independent.
By A.3, the columns of the $(d+1)\times(m+1)$ matrix $B$ (see (3.48)) are linearly independent, so that we can find an $(m+1)\times(d+1)$ matrix $C$ such that $CB=I_{m+1}$. Let us define $(\bar h,\bar H)\in\mathbf{R}^d\times\mathbf{S}^d$ from the relation
$$\begin{bmatrix}\bar H&\bar h\\ \bar h^T&0\end{bmatrix}=2(C^TQC)^o,\qquad(3.57)$$
where for a $(d+1)\times(d+1)$ matrix $S$, $S^o$ is the matrix obtained from $S$ by zeroing out the entry in the cell $(d+1,d+1)$. The consistency of our estimation machinery is given by the following simple statement:
Proposition 3.14. In the situation just described and under assumptions A.1–3, given $\epsilon\in(0,1)$, consider the estimate
$$\widehat g_K(\zeta^K)=\frac1K\sum_{k=1}^K\left[\bar h^T\zeta_k+\frac12\zeta_k^T\bar H\zeta_k\right]+\kappa_K,$$
where
$$\kappa_K=\frac12\left[\widehat\Psi_{-,K}(\bar h,\bar H)-\widehat\Psi_{+,K}(\bar h,\bar H)\right]$$
and $\widehat\Psi_{\pm,K}$ are given by (3.52). Then the $\epsilon$-risk of $\widehat g_K(\cdot)$ goes to 0 as $K\to\infty$.
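To see what (3.57) does, note that $[\mu;1]=B[u;1]$ for $\mu=A[u;1]$, so $CB=I_{m+1}$ gives $[\mu;1]^TC^TQC[\mu;1]=[u;1]^TQ[u;1]$; zeroing the corner entry therefore changes $\bar h^T\mu+\frac12\mu^T\bar H\mu$ by the constant $(C^TQC)_{d+1,d+1}$ only. A minimal numerical sketch of this computation follows. It assumes one particular left inverse, the Moore-Penrose pseudoinverse `np.linalg.pinv(B)` (the text allows any $C$ with $CB=I_{m+1}$); the function name and the random test data are hypothetical.

```python
import numpy as np

def consistency_coefficients(A, Q):
    """Return (h_bar, H_bar, const) from (3.57), with C taken as pinv(B) (an assumption)."""
    d, m1 = A.shape
    b = np.zeros(m1); b[-1] = 1.0
    B = np.vstack([A, b])              # (d+1) x (m+1), full column rank under A.3
    C = np.linalg.pinv(B)              # then C @ B = I_{m+1}
    S = 2.0 * C.T @ Q @ C
    S = 0.5 * (S + S.T)                # symmetrize against round-off
    const = 0.5 * S[d, d]              # (C^T Q C)_{d+1,d+1}, removed by the "o" operation
    S[d, d] = 0.0
    return S[:d, d].copy(), S[:d, :d].copy(), float(const)

# Hypothetical test data.
rng = np.random.default_rng(2)
d, m = 4, 2
A = rng.standard_normal((d, m + 1))
Q0 = rng.standard_normal((m + 1, m + 1))
Q = 0.5 * (Q0 + Q0.T)
h_bar, H_bar, const = consistency_coefficients(A, Q)

z = np.append(rng.standard_normal(m), 1.0)     # [u; 1]
mu = A @ z                                     # mean of the observation
quad_at_mean = float(h_bar @ mu + 0.5 * mu @ H_bar @ mu)
F_value = float(z @ Q @ z)
```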
For proof, see Section 3.6.4.
3.4.1.5 A modification
In the situation described at the beginning of Section 3.4.1.2, let a set $W\subset U\times V$ be given, and assume we are interested in estimating the value of $F(u,v)$, as defined in (3.47), at points $(u,v)\in W$ only. When reducing the “domain of interest” $U\times V$ to $W$, we can hope to reduce the attainable $\epsilon$-risk of recovery. Let us assume that we can point out a convex compact set $\mathcal{W}\subset V\times\mathcal{Z}$ such that
$$(u,v)\in W\ \Rightarrow\ (v,[u;1][u;1]^T)\in\mathcal{W}.$$
A straightforward inspection justifies the following:
Remark 3.15. In the situation just described, the conclusion of Proposition 3.12 remains valid when the set $U\times V$ participating in (3.55) is reduced to $W$, and the set $V\times\mathcal{Z}$ participating in relations (3.52) is reduced to $\mathcal{W}$. This modification enlarges the feasible set of (3.53) and thus reduces the risk bound $\bar\rho$.
3.4.2 Estimating quadratic form, sub-Gaussian case
3.4.2.1 Situation
In the rest of this section we are interested in the situation as follows: we are given $K$ i.i.d. observations
$$\zeta_i\sim\mathcal{SG}(A[u;1],M(v)),\quad i=1,...,K\qquad(3.58)$$
(i.e., $\zeta_i$ are sub-Gaussian random vectors with parameters $A[u;1]\in\mathbf{R}^d$ and $M(v)\in\mathbf{S}^d_+$), where
• $(u,v)$ is an unknown “signal” known to belong to a given set $U\times V$, where
– $U\subset\mathbf{R}^m$ is a compact set, and
– $V\subset\mathbf{R}^k$ is a compact convex set;
• $A$ is a given $d\times(m+1)$ matrix, and $v\mapsto M(v):\mathbf{R}^k\to\mathbf{S}^d$ is an affine mapping such that $M(v)\succeq0$ whenever $v\in V$.
We are also given a convex calibrating function $\varrho(Z):\mathbf{S}^{m+1}_+\to\mathbf{R}$ and a “functional of interest”
$$F(u,v)=[u;1]^TQ[u;1]+q^Tv,\qquad(3.59)$$
where $Q$ and $q$ are a known $(m+1)\times(m+1)$ symmetric matrix and a $k$-dimensional vector, respectively. Our goal is to recover $F(u,v)$, for unknown $(u,v)$ known to belong to $U\times V$, via observation (3.58). Note that the only difference between our present setting and that considered in Section 3.4.1.1 is that now we allow for sub-Gaussian, and not necessarily Gaussian, observations.
3.4.2.2 Construction and result
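Let
$$\mathcal{V}=\{M(v):\ v\in V\},$$
so that $\mathcal{V}$ is a convex compact subset of the positive semidefinite cone $\mathbf{S}^d_+$. Let us select some
1. matrix $\Theta_*\succ0$ such that $\Theta_*\succeq\Theta$ for all $\Theta\in\mathcal{V}$;
2. convex compact subset $\mathcal{Z}$ of the set $\mathcal{Z}^+=\{Z\in\mathbf{S}^{m+1}_+:\ Z_{m+1,m+1}=1\}$ such that $[u;1][u;1]^T\in\mathcal{Z}$ for all $u\in U$;
3. reals $\gamma,\gamma^+\in(0,1)$ with $\gamma<\gamma^+$ (say, $\gamma=0.99$, $\gamma^+=0.999$).
Preliminaries. Given the data of the above description and $\delta\in[0,2]$, we set (cf. Proposition 3.11)
$$
\begin{array}{rcl}
\mathcal{H}&=&\mathcal{H}_\gamma:=\{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
B&=&\begin{bmatrix}A\\ [0,...,0,1]\end{bmatrix}\in\mathbf{R}^{(d+1)\times(m+1)},\\
\mathcal{M}&=&\mathcal{V}\times\mathcal{Z},\\
\Psi(h,H,G;Z)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})\\
&&+\,\frac12\mathrm{Tr}\left(ZB^T\left[\begin{bmatrix}H&h\\ h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right):\\
&&\mathcal{H}\times\{G:\ 0\preceq G\preceq\gamma^+\Theta_*^{-1}\}\times\mathcal{Z}\to\mathbf{R},\\
\Psi_\delta(h,H,G;\Theta,Z)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\frac12\mathrm{Tr}([\Theta-\Theta_*]G)+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\
&&+\,\frac12\mathrm{Tr}\left(ZB^T\left[\begin{bmatrix}H&h\\ h^T&0\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right):\\
&&\mathcal{H}\times\{G:\ 0\preceq G\preceq\gamma^+\Theta_*^{-1}\}\times(\{0\preceq\Theta\preceq\Theta_*\}\times\mathcal{Z})\to\mathbf{R},\\
\Phi(h,H;Z)&=&\min\limits_G\left\{\Psi(h,H,G;Z):\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H\right\}:\ \mathcal{H}\times\mathcal{Z}\to\mathbf{R},\\
\Phi_\delta(h,H;\Theta,Z)&=&\min\limits_G\left\{\Psi_\delta(h,H,G;\Theta,Z):\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H\right\}:\\
&&\mathcal{H}\times(\{0\preceq\Theta\preceq\Theta_*\}\times\mathcal{Z})\to\mathbf{R}.
\end{array}\qquad(3.60)
$$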
The following statement is a straightforward reformulation of Proposition 2.46.i:
Proposition 3.16. In the situation described in Sections 3.4.2.1 and 3.4.2.2 we have
(i) $\Phi$ is a well-defined real-valued continuous function on the domain $\mathcal{H}\times\mathcal{Z}$; the function is convex in $(h,H)\in\mathcal{H}$, concave in $Z\in\mathcal{Z}$, and $\Phi(0;Z)\ge0$. Furthermore, let $(h,H)\in\mathcal{H}$, $u\in U$, $v\in V$, and let $\zeta\sim\mathcal{SG}(A[u;1],M(v))$. Then
$$\ln\left(\mathbf{E}_\zeta\left\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\right\}\right)\le\Phi(h,H;[u;1][u;1]^T).\qquad(3.61)$$
(ii) Assume that
$$\forall\Theta\in\mathcal{V}:\quad\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta.\qquad(3.62)$$
Then $\Phi_\delta(h,H;\Theta,Z)$ is a well-defined real-valued continuous function on the domain $\mathcal{H}\times(\mathcal{V}\times\mathcal{Z})$; it is convex in $(h,H)\in\mathcal{H}$, concave in $(\Theta,Z)\in\mathcal{V}\times\mathcal{Z}$, and $\Phi_\delta(0;\Theta,Z)\ge0$. Furthermore, let $(h,H)\in\mathcal{H}$, $u\in U$, $v\in V$, and let $\zeta\sim\mathcal{SG}(A[u;1],M(v))$. Then
$$\ln\left(\mathbf{E}_\zeta\left\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\right\}\right)\le\Phi_\delta(h,H;M(v),[u;1][u;1]^T).\qquad(3.63)$$
The estimate. Our construction of the estimate is completely similar to the case of Gaussian observations. Specifically, let us pass from observations (3.58) to their quadratic lifts, so that our observations become
$$\omega_i=(\zeta_i,\zeta_i\zeta_i^T),\ 1\le i\le K,\quad \zeta_i\sim\mathcal{SG}(A[u;1],M(v))\ \text{are i.i.d.}\qquad(3.64)$$
As in the Gaussian case, we find ourselves in the situation considered in Section 3.3.3 and can use the corresponding constructions. Indeed, let us specify the data introduced in Section 3.3.1 and participating in the constructions of Section 3.3 as follows:
• $\mathcal{H}=\{f=(h,H)\in\mathcal{H}\}\subset E_H=\mathbf{R}^d\times\mathbf{S}^d$, with $\mathcal{H}$ defined in (3.60), and the inner product on $E_H$ defined as
$$\langle(h,H),(h',H')\rangle=h^Th'+\tfrac12\mathrm{Tr}(HH'),$$
$E_M=\mathbf{S}^d\times\mathbf{S}^{m+1}$, and $\mathcal{M}$, $\Phi$ defined as in (3.60);
• $E_X=\mathbf{R}^k\times\mathbf{S}^{m+1}$, $\mathcal{X}=V\times\mathcal{Z}$;
• $\mathcal{A}(x=(v,Z))=(M(v),Z)$; note that $\mathcal{A}$ is an affine mapping from $E_X$ into $E_M$ mapping $\mathcal{X}$ into $\mathcal{M}$, as required in Section 3.3. Observe that when $u\in U$ and $v\in V$, the common distribution $P=P_{u,v}$ of i.i.d. observations $\omega_i$ defined by (3.64) satisfies the relation
$$
\begin{array}{rcl}
\forall(f=(h,H)\in\mathcal{H}):\ \ln\left(\mathbf{E}_{\omega\sim P}\left\{e^{\langle f,\omega\rangle}\right\}\right)&=&\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(A[u;1],M(v))}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\\
&\le&\Phi(h,H;[u;1][u;1]^T);
\end{array}\qquad(3.65)
$$
see (3.61). Moreover, in the case of (3.62), we have also
$$
\begin{array}{rcl}
\forall(f=(h,H)\in\mathcal{H}):\ \ln\left(\mathbf{E}_{\omega\sim P}\left\{e^{\langle f,\omega\rangle}\right\}\right)&=&\ln\left(\mathbf{E}_{\zeta\sim\mathcal{SG}(A[u;1],M(v))}\left\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\right\}\right)\\
&\le&\Phi_\delta(h,H;M(v),[u;1][u;1]^T);
\end{array}\qquad(3.66)
$$
see (3.63);
• we set $\upsilon(x=(v,Z))=\varrho(Z)$;
• we define the affine functional $G(x)$ on $E_X$ by the relation
$$G(x:=(v,Z))=q^Tv+\mathrm{Tr}(QZ);$$
see (3.59). As a result, for $x=(v,[u;1][u;1]^T)$ with $v\in V$ and $u\in U$ we have
$$F(u,v)=G(x).$$
The result. Applying to the data just specified Corollary 3.8 (which is legitimate, because our $\Phi$ clearly satisfies (3.30)), we arrive at the result as follows:
Proposition 3.17. In the situation described in Sections 3.4.2.1 and 3.4.2.2 let us
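set
$$
\begin{array}{rcl}
\widehat\Psi_{+,K}(h,H)&:=&\inf\limits_{\alpha}\left\{\max\limits_{(v,Z)\in V\times\mathcal{Z}}\left[\alpha\Phi(h/\alpha,H/\alpha;Z)-G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\right\}\\
&=&\max\limits_{(v,Z)\in V\times\mathcal{Z}}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\left[\alpha\Phi(h/\alpha,H/\alpha;Z)-G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right],\\
\widehat\Psi_{-,K}(h,H)&:=&\inf\limits_{\alpha}\left\{\max\limits_{(v,Z)\in V\times\mathcal{Z}}\left[\alpha\Phi(-h/\alpha,-H/\alpha;Z)+G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\right\}\\
&=&\max\limits_{(v,Z)\in V\times\mathcal{Z}}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\left[\alpha\Phi(-h/\alpha,-H/\alpha;Z)+G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right].
\end{array}\qquad(3.67)
$$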
Thus, the functions $\widehat\Psi_{\pm,K}(h,H):\mathbf{R}^d\times\mathbf{S}^d\to\mathbf{R}$ are convex. Furthermore, whenever $\bar h,\bar H,\bar\rho,\bar\kappa$ form a feasible solution to the system of convex constraints
$$\widehat\Psi_{+,K}(h,H)\le\rho-\kappa,\quad\widehat\Psi_{-,K}(h,H)\le\rho+\kappa\qquad(3.68)$$
in variables $(h,H)\in\mathbf{R}^d\times\mathbf{S}^d$, $\rho\in\mathbf{R}$, $\kappa\in\mathbf{R}$, the estimate
$$\widehat g(\zeta^K)=\frac1K\sum_{i=1}^K\left[\bar h^T\zeta_i+\frac12\zeta_i^T\bar H\zeta_i\right]+\bar\kappa$$
of $F(u,v)=[u;1]^TQ[u;1]+q^Tv$ via i.i.d. observations $\zeta_i\sim\mathcal{SG}(A[u;1],M(v))$, $1\le i\le K$, satisfies for all $(u,v)\in U\times V$:
$$\mathrm{Prob}_{\zeta^K\sim[\mathcal{SG}(A[u;1],M(v))]^K}\left\{|F(u,v)-\widehat g(\zeta^K)|>\bar\rho+\varrho([u;1][u;1]^T)\right\}\le\epsilon.$$
Proof. Under the premise of the proposition, let us fix $u\in U$, $v\in V$, and let $x=(v,Z:=[u;1][u;1]^T)$. Denoting by $P$ the distribution of $\omega:=(\zeta,\zeta\zeta^T)$ with $\zeta\sim\mathcal{SG}(A[u;1],M(v))$, and invoking (3.65), we see that for the $(x,P)$ just defined relation (3.26) takes place. Applying Corollary 3.8, we conclude that
$$\mathrm{Prob}_{\zeta^K\sim[\mathcal{SG}(A[u;1],M(v))]^K}\left\{|\widehat g(\zeta^K)-G(x)|>\bar\rho+\varrho([u;1][u;1]^T)\right\}\le\epsilon.$$
It remains to note that by construction for the $x=(v,Z)$ in question it holds
$$G(x)=q^Tv+\mathrm{Tr}(QZ)=q^Tv+[u;1]^TQ[u;1]=F(u,v).\qquad\Box$$
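Remark 3.18. In the situation described in Sections 3.4.2.1 and 3.4.2.2 let $\delta\in[0,2]$ be such that
$$\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\quad\forall\Theta\in\mathcal{V}.$$
Then the conclusion of Proposition 3.17 remains valid when the function $\Phi$ in (3.67) is replaced with the function $\Phi_\delta$, that is, when $\widehat\Psi_{\pm,K}$ are defined as
$$
\begin{array}{rcl}
\widehat\Psi_{+,K}(h,H)&:=&\inf\limits_{\alpha}\left\{\max\limits_{(v,Z)\in V\times\mathcal{Z}}\left[\alpha\Phi_\delta(h/\alpha,H/\alpha;M(v),Z)-G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\right\}\\
&=&\max\limits_{(v,Z)\in V\times\mathcal{Z}}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\left[\alpha\Phi_\delta(h/\alpha,H/\alpha;M(v),Z)-G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right],\\
\widehat\Psi_{-,K}(h,H)&:=&\inf\limits_{\alpha}\left\{\max\limits_{(v,Z)\in V\times\mathcal{Z}}\left[\alpha\Phi_\delta(-h/\alpha,-H/\alpha;M(v),Z)+G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right]:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\right\}\\
&=&\max\limits_{(v,Z)\in V\times\mathcal{Z}}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\left[\alpha\Phi_\delta(-h/\alpha,-H/\alpha;M(v),Z)+G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\right].
\end{array}
$$
To justify Remark 3.18, it suffices to replace relation (3.65) in the proof of Proposition 3.17 with (3.66). Note that which option is better in terms of the risk of the resulting estimate, Proposition 3.17 “as is” or its modification presented in Remark 3.18, depends on the situation, so that it makes sense to keep both options in mind.
3.4.2.3 Numerical illustration, direct observations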
The problem. Our initial illustration is deliberately selected to be extremely simple: given direct noisy observations
$$\zeta=u+\xi$$
of an unknown signal $u\in\mathbf{R}^m$ known to belong to a given set $U$, we want to recover the “energy” $u^Tu$ of $u$. We are interested in an estimate of $u^Tu$ quadratic in $\zeta$ with as small as possible an $\epsilon$-risk on $U$; here $\epsilon\in(0,1)$ is a given design parameter. The details of our setup are as follows:
• $U$ is the “spherical layer” $U=\{u\in\mathbf{R}^m:\ r^2\le u^Tu\le R^2\}$, where $r$ and $R$, $0\le r<R<\infty$, are given. As a result, the “main ingredient” of constructions from Sections 3.4.1.3 and 3.4.2.2, namely the convex compact subset $\mathcal{Z}$ of the set $\{Z\in\mathbf{S}^{m+1}_+:\ Z_{m+1,m+1}=1\}$ containing all matrices $[u;1][u;1]^T$, $u\in U$, can be specified as
$$\mathcal{Z}=\{Z\in\mathbf{S}^{m+1}_+:\ Z_{m+1,m+1}=1,\ 1+r^2\le\mathrm{Tr}(Z)\le1+R^2\};$$
• $\xi$ is either $\sim\mathcal{N}(0,\Theta)$ (Gaussian case), or $\sim\mathcal{SG}(0,\Theta)$ (sub-Gaussian case), with matrix $\Theta$ known to be diagonal with diagonal entries equal to each other satisfying $\theta\sigma^2\le\Theta_{ii}\le\sigma^2$, $1\le i\le d=m$, with known $\theta\in[0,1]$ and $\sigma^2>0$;
• the calibrating function $\varrho(Z)$ is $\varrho(Z)=\varsigma(\sum_{i=1}^mZ_{ii})$, where $\varsigma$ is a convex continuous real-valued function on $\mathbf{R}_+$. Note that with this selection, the claim that the $\epsilon$-risk of an estimate $\widehat g(\cdot)$ is $\le\rho$ means that whenever $u\in U$, one has
$$\mathrm{Prob}\{|\widehat g(u+\xi)-u^Tu|>\rho+\varsigma(u^Tu)\}\le\epsilon.\qquad(3.69)$$
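As a quick illustration of why the set $\mathcal{Z}$ above contains all lifted signals: $Z=[u;1][u;1]^T$ is positive semidefinite, has corner entry 1, and has trace $1+u^Tu$. The membership test below is a sketch with hypothetical names (`in_Z`) and example values, using a small numerical tolerance.

```python
import numpy as np

def in_Z(Z, r, R, tol=1e-9):
    """Check Z >= 0, Z[m+1, m+1] = 1 and 1 + r^2 <= Tr(Z) <= 1 + R^2, up to tol."""
    psd = np.linalg.eigvalsh(Z).min() >= -tol
    corner = abs(Z[-1, -1] - 1.0) <= tol
    tr = float(np.trace(Z))
    return bool(psd and corner and 1 + r**2 - tol <= tr <= 1 + R**2 + tol)

# Hypothetical signal with ||u||_2 = 5.
u = np.array([3.0, 4.0])
z = np.append(u, 1.0)
Z = np.outer(z, z)
inside = in_Z(Z, r=2.0, R=6.0)     # 2 <= ||u|| <= 6
outside = in_Z(Z, r=6.0, R=8.0)    # ||u|| < 6, so the trace constraint fails
```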
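Processing the problem. It is easily seen that in the situation in question the apparatus in Sections 3.4.1 and 3.4.2 translates into the following:
1. We lose nothing when restricting ourselves to estimates of the form
$$\widehat g(\zeta)=\tfrac12\eta\zeta^T\zeta+\kappa,\qquad(3.70)$$
with properly selected scalars $\eta$ and $\kappa$;
2. In the Gaussian case, $\eta$ and $\kappa$ are yielded by the convex optimization problem with only three variables $\alpha_+$, $\alpha_-$, and $\eta$, namely the problem
$$\min_{\alpha_\pm,\eta}\left\{\widehat\Psi(\alpha_+,\alpha_-,\eta)=\tfrac12\left[\widehat\Psi_+(\alpha_+,\eta)+\widehat\Psi_-(\alpha_-,\eta)\right]:\ \sigma^2|\eta|<\alpha_\pm\right\}\qquad(3.71)$$
where
$$
\begin{array}{rcl}
\widehat\Psi_+(\alpha_+,\eta)&=&-\dfrac{d\alpha_+}{2}\ln(1-\sigma^2\eta/\alpha_+)+\dfrac d2\sigma^2(1-\theta)\max[-\eta,0]+\dfrac{d\delta(2+\delta)\sigma^4\eta^2}{2(\alpha_+-\sigma^2|\eta|)}\\
&&+\max\limits_{r^2\le t\le R^2}\left[\left[\dfrac{\alpha_+\eta}{2(\alpha_+-\sigma^2\eta)}-1\right]t-\varsigma(t)\right]+\alpha_+\ln(2/\epsilon),\\
\widehat\Psi_-(\alpha_-,\eta)&=&-\dfrac{d\alpha_-}{2}\ln(1+\sigma^2\eta/\alpha_-)+\dfrac d2\sigma^2(1-\theta)\max[\eta,0]+\dfrac{d\delta(2+\delta)\sigma^4\eta^2}{2(\alpha_--\sigma^2|\eta|)}\\
&&+\max\limits_{r^2\le t\le R^2}\left[\left[-\dfrac{\alpha_-\eta}{2(\alpha_-+\sigma^2\eta)}+1\right]t-\varsigma(t)\right]+\alpha_-\ln(2/\epsilon),
\end{array}
$$
with $\delta=1-\sqrt{\theta}$. Now, the $\eta$-component of a feasible solution to (3.71) augmented by the quantity
$$\kappa=\tfrac12\left[\widehat\Psi_-(\alpha_-,\eta)-\widehat\Psi_+(\alpha_+,\eta)\right]$$
yields estimate (3.70) with $\epsilon$-risk on $U$ not exceeding $\widehat\Psi(\alpha_+,\alpha_-,\eta)$;
3. In the sub-Gaussian case, $\eta$ and $\kappa$ are yielded by the convex optimization problem with five variables, $\alpha_\pm$, $g_\pm$, and $\eta$, namely, the problem
$$
\begin{array}{l}
\min\limits_{\alpha_\pm,g_\pm,\eta}\Big\{\widehat\Psi(\alpha_\pm,g_\pm,\eta)=\tfrac12\left[\widehat\Psi_+(\alpha_+,g_+,\eta)+\widehat\Psi_-(\alpha_-,g_-,\eta)\right]:\\
\qquad 0\le\sigma^2g_\pm<\alpha_\pm,\ -\alpha_+<\sigma^2\eta<\alpha_-,\ \eta\le g_+,\ -\eta\le g_-\Big\},
\end{array}\qquad(3.72)
$$
where
$$
\begin{array}{rcl}
\widehat\Psi_+(\alpha_+,g_+,\eta)&=&-\dfrac{d\alpha_+}{2}\ln(1-\sigma^2g_+/\alpha_+)+\alpha_+\ln(2/\epsilon)+\max\limits_{r^2\le t\le R^2}\left[\left[\dfrac{\sigma^2\eta^2}{2(\alpha_+-\sigma^2g_+)}+\tfrac12\eta-1\right]t-\varsigma(t)\right],\\
\widehat\Psi_-(\alpha_-,g_-,\eta)&=&-\dfrac{d\alpha_-}{2}\ln(1-\sigma^2g_-/\alpha_-)+\alpha_-\ln(2/\epsilon)+\max\limits_{r^2\le t\le R^2}\left[\left[\dfrac{\sigma^2\eta^2}{2(\alpha_--\sigma^2g_-)}-\tfrac12\eta+1\right]t-\varsigma(t)\right].
\end{array}
$$
The $\eta$-component of a feasible solution to (3.72) augmented by the quantity
$$\kappa=\tfrac12\left[\widehat\Psi_-(\alpha_-,g_-,\eta)-\widehat\Psi_+(\alpha_+,g_+,\eta)\right]$$
yields estimate (3.70) with $\epsilon$-risk on $U$ not exceeding $\widehat\Psi(\alpha_\pm,g_\pm,\eta)$.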
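The Gaussian-case objective of (3.71) is cheap to evaluate, so its behavior can be explored even without a convex solver. The sketch below is an illustration only, not the procedure used in the experiments: it assumes the trivial calibrating function $\varsigma\equiv0$ (so the inner maximum over $t\in[r^2,R^2]$ of a linear function is attained at an endpoint), replaces proper convex minimization by a crude grid search, and uses hypothetical function names and grid ranges. At $\eta=0$ the objective reduces to $\frac12(R^2-r^2)+\frac12(\alpha_++\alpha_-)\ln(2/\epsilon)$, which corresponds to a constant estimate; the grid search can only improve on that.

```python
import math

def psi_plus(alpha, eta, d, sigma2, theta, r, R, eps):
    """Psi_+ of (3.71) with calibrating function identically 0; +inf if infeasible."""
    if alpha <= sigma2 * abs(eta):
        return math.inf
    delta = 1.0 - math.sqrt(theta)
    coef = alpha * eta / (2.0 * (alpha - sigma2 * eta)) - 1.0
    return (-0.5 * d * alpha * math.log(1.0 - sigma2 * eta / alpha)
            + 0.5 * d * sigma2 * (1.0 - theta) * max(-eta, 0.0)
            + d * delta * (2.0 + delta) * sigma2**2 * eta**2
            / (2.0 * (alpha - sigma2 * abs(eta)))
            + max(coef * r**2, coef * R**2)      # max of a linear function of t
            + alpha * math.log(2.0 / eps))

def psi_minus(alpha, eta, d, sigma2, theta, r, R, eps):
    """Psi_- of (3.71) with calibrating function identically 0; +inf if infeasible."""
    if alpha <= sigma2 * abs(eta):
        return math.inf
    delta = 1.0 - math.sqrt(theta)
    coef = -alpha * eta / (2.0 * (alpha + sigma2 * eta)) + 1.0
    return (-0.5 * d * alpha * math.log(1.0 + sigma2 * eta / alpha)
            + 0.5 * d * sigma2 * (1.0 - theta) * max(eta, 0.0)
            + d * delta * (2.0 + delta) * sigma2**2 * eta**2
            / (2.0 * (alpha - sigma2 * abs(eta)))
            + max(coef * r**2, coef * R**2)
            + alpha * math.log(2.0 / eps))

def risk_bound(alpha_p, alpha_m, eta, **p):
    """Objective of (3.71): upper bound on the eps-risk of estimate (3.70)."""
    return 0.5 * (psi_plus(alpha_p, eta, **p) + psi_minus(alpha_m, eta, **p))

# One setup in the spirit of the experiments (sigma = 1, eps = 0.01).
params = dict(d=64, sigma2=1.0, theta=1.0, r=0.0, R=16.0, eps=0.01)
alphas = [2.0**k for k in range(-2, 10)]        # hypothetical grid
etas = [k / 10 for k in range(-10, 31)]         # hypothetical grid, contains 0.0
best = min((risk_bound(ap, am, eta, **params), ap, am, eta)
           for ap in alphas for am in alphas for eta in etas)
baseline = risk_bound(1.0, 1.0, 0.0, **params)  # eta = 0: constant estimate
```

Here `best[0] / params["R"]**2` plays the role of a relative risk; being a grid value, it is only an upper estimate of the optimal value of (3.71).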
Note that the Gaussian case of our “energy estimation” problem is well studied in the literature (see, among others, [19, 43, 81, 87, 90, 97, 120, 124, 147, 160]), mainly in the case $\xi\sim\mathcal{N}(0,\sigma^2I_m)$ of white Gaussian noise with exactly known variance $\sigma^2$. Available results investigate analytically the interplay between the dimension $m$ of signal, noise intensity $\sigma^2$ and the parameters $R$, $r$ and offer estimates which are provably optimal, up to absolute constant factors. A nice property of the proposed approach is that (3.71) automatically takes care of the parameters and results in estimates with seemingly near-optimal performance, as witnessed by the numerical experiments we are about to present.

   d     r     R    θ    Relative 0.01-risk,   Relative 0.01-risk,   Optimality
                         Gaussian case         sub-Gaussian case     ratio
  64     0    16   1     0.34808               0.44469               1.22
  64     0    16   0.5   0.43313               0.44469               1.48
  64     0   128   1     0.04962               0.05181               1.28
  64     0   128   0.5   0.05064               0.05181               1.34
  64     8    80   1     0.07827               0.08376               1.28
  64     8    80   0.5   0.08095               0.08376               1.34
 256     0    32   1     0.19503               0.30457               1.28
 256     0    32   0.5   0.26813               0.30457               1.41
 256     0   512   1     0.01264               0.01314               1.28
 256     0   512   0.5   0.01289               0.01314               1.34
 256    16   160   1     0.03996               0.04501               1.28
 256    16   160   0.5   0.04255               0.04501               1.34
1024     0    64   1     0.10272               0.21923               1.28
1024     0    64   0.5   0.17032               0.21923               1.34
1024     0  2048   1     0.00317               0.00330               1.28
1024     0  2048   0.5   0.00324               0.00330               1.34
1024    32   320   1     0.02019               0.02516               1.28
1024    32   320   0.5   0.02273               0.02516               1.41

Table 3.3: Estimating the signal energy from direct observations.

Numerical results. In the first series of experiments we use the trivial calibrating function: $\varsigma(\cdot)\equiv0$. A typical sample of numerical results is presented in Table 3.3. To avoid large numbers, we display in the table the relative 0.01-risk of the estimates, that is, the plain risk as given by (3.71) divided by $R^2$; keeping this in mind, one will not be surprised that when extending the range $[r,R]$ of allowed norms of the observed signal, all other components of the setup being fixed, the relative risk can decrease (the actual risk, of course, can only increase). Note that in all our experiments $\sigma$ is set to 1. Along with the values of the relative 0.01-risk, we present also the values of “optimality ratios”: the ratios of the upper risk bounds given by (3.71) in the Gaussian case, to (lower bounds on) the best 0.01-risks $\mathrm{Risk}^*_{0.01}$ possible under the circumstances, defined as the infimum of the 0.01-risk over all estimates recovering $\|u\|_2^2$ via a single observation $\omega=u+\zeta$. These lower bounds are obtained as follows. Let us select some values $r_1<r_2$ in the allowed range $[r,R]$ of $\|u\|_2$, along with two values, $\sigma_1$, $\sigma_2$, in the allowed range $[\theta\sigma,\sigma]=[\theta,1]$ of values of diagonal entries in diagonal matrices $\Theta$, and consider two distributions of observations $P_1$ and $P_2$ as follows: $P_\chi$ is the distribution of the random vector $x+\zeta$, where $x$ and $\zeta$ are independent, $x$ is uniformly distributed on the sphere $\|x\|_2=r_\chi$, and $\zeta\sim\mathcal{N}(0,\sigma_\chi^2I_d)$.
It is immediately seen that whenever the two simple hypotheses $\omega\sim P_1$ and $\omega\sim P_2$ cannot be decided upon via a single observation by a test with total risk (the sum, over the two hypotheses in question, of probabilities for the test to reject the hypothesis when it is true) $\le2\epsilon$, the quantity $\delta=\frac12(r_2^2-r_1^2)$ is a lower bound on the optimal $\epsilon$-risk, $\mathrm{Risk}^*_\epsilon$. In other words, denoting by $p_\chi(\cdot)$ the density of $P_\chi$, we have 0.02
d (“deficient observations”),
• $u\in\mathbf{R}^m$ is a signal known to belong to a compact set $U$,
• $\xi\sim\mathcal{N}(0,\Theta)$ (Gaussian case) or $\xi\sim\mathcal{SG}(0,\Theta)$ (sub-Gaussian case) is the observation noise; $\Theta$ is a positive semidefinite $d\times d$ matrix known to belong to a given convex compact set $\mathcal{V}\subset\mathbf{S}^d_+$.
Our goal is to estimate the energy
$$F(u)=\frac1m\|u\|_2^2$$
of the signal given observation (3.74). In our experiment, the data is specified as follows:
1. We think of $u\in\mathbf{R}^m$ as of the discretization of a smooth function $x(t)$ of continuous argument $t\in[0;1]$: $u_i=x(\frac im)$, $1\le i\le m$. We set $U=\{u:\ \|Su\|_2\le1\}$, where $u\mapsto Su$ is the finite-difference approximation of the mapping $x(\cdot)\mapsto(x(0),x'(0),x''(\cdot))$, so that $U$ is a natural discrete-time analog of the Sobolev-type ball $\{x:\ [x(0)]^2+[x'(0)]^2+\int_0^1[x''(t)]^2dt\le1\}$.
2. The $d\times m$ matrix $B$ is of the form $UDV^T$, where $U$ and $V$ are randomly selected $d\times d$ and $m\times m$ orthogonal matrices, and the $d$ diagonal entries in the diagonal $d\times m$ matrix $D$ are of the form $\theta^{-\frac{i-1}{d-1}}$, $1\le i\le d$.
3. The set $\mathcal{V}$ of admissible matrices $\Theta$ is the set of all diagonal $d\times d$ matrices with diagonal entries varying in $[0,\sigma^2]$.
Both $\sigma$ and $\theta$ are components of the experiment setup.
Processing the problem. The described estimation problem clearly is covered by the setups considered in Sections 3.4.1 (Gaussian case) and 3.4.2 (sub-Gaussian case); in terms of these setups, it suffices to specify $\Theta_*$ as $\sigma^2I_d$, $M(v)$ as the identity mapping of $\mathcal{V}$ onto itself, the mapping $u\mapsto A[u;1]$ as the mapping $u\mapsto Bu$, and the set $\mathcal{Z}$ (which should be a convex compact subset of the set $\{Z\in\mathbf{S}^{m+1}_+:\ Z_{m+1,m+1}=1\}$ containing all matrices of the form $[u;1][u;1]^T$, $u\in U$) as the set
$$\mathcal{Z}=\{Z\in\mathbf{S}^{m+1}_+:\ Z_{m+1,m+1}=1,\ \mathrm{Tr}\left(Z\,\mathrm{Diag}\{S^TS,0\}\right)\le1\}.$$
As suggested by Propositions 3.12 (Gaussian case) and 3.17 (sub-Gaussian case), the linear in “lifted observation” $\omega=(\zeta,\zeta\zeta^T)$ estimates of $F(u)=\frac1m\|u\|_2^2$ stem from the optimal solution $(h_*,H_*)$ to the convex optimization problem
$$\mathrm{Opt}=\min_{h,H}\tfrac12\left[\widehat\Psi_+(h,H)+\widehat\Psi_-(h,H)\right],\qquad(3.75)$$
with $\widehat\Psi_\pm(\cdot)$ given by (3.52) in the Gaussian, and by (3.67) in the sub-Gaussian cases, with the number $K$ of observations in (3.52) and (3.67) set to 1. The resulting estimate is
$$\zeta\mapsto h_*^T\zeta+\tfrac12\zeta^TH_*\zeta+\kappa,\quad\kappa=\tfrac12\left[\widehat\Psi_-(h_*,H_*)-\widehat\Psi_+(h_*,H_*)\right]\qquad(3.76)$$
and the $\epsilon$-risk of the estimate is (upper-bounded by) Opt.

 d, m      Opt, Gaussian case   Opt, sub-Gaussian case   LwBnd
 8, 12     0.1362 (+65%)        0.1382 (+67%)            0.0825
 16, 24    0.1614 (+53%)        0.1640 (+55%)            0.1058
 32, 48    0.0687 (+46%)        0.0692 (+48%)            0.0469

Table 3.4: Upper bound (Opt) on the 0.01-risk of estimate (3.76), (3.75) vs. lower bound (LwBnd) on the 0.01-risk attainable under the circumstances. In the experiments, σ = 0.025 and θ = 10. Data in parentheses: excess of Opt over LwBnd.

Problem (3.75) is a well-structured convex-concave saddle point problem and as such is beyond the “immediate scope” of the standard Convex Programming software toolbox primarily aimed at solving well-structured convex minimization (or maximization) problems. However, applying conic duality, one can easily eliminate in (3.52) and (3.67) the inner maxima over $v$, $Z$ to end up with a reformulation which can be solved numerically by CVX [108], and this is how we process (3.75) in our experiments.
Numerical results. In the experiments to be reported, we use the trivial calibrating function: $\varrho(\cdot)\equiv0$. We present some typical numerical results in Table 3.4. To qualify the performance of our approach, we present, along with the upper risk bounds for the computed estimates, simple lower bounds on the $\epsilon$-risk. The origin of the lower bounds is as follows. Assume we have at our disposal a signal $w\in U$, and let
$$t(w)=\|Bw\|_2,\quad\rho=2\sigma\,\mathrm{ErfcInv}(\epsilon),$$
where ErfcInv is the inverse error function as defined in (1.26). Setting $\theta(w)=\max[1-\rho/t(w),0]$, observe that $w':=\theta(w)w\in U$ and $\|Bw-Bw'\|_2\le\rho$, which, due to the origin of $\rho$, implies that there is no way to decide via observation $Bu+\xi$, $\xi\sim\mathcal{N}(0,\sigma^2I_d)$, with risk $<\epsilon$ on the two simple hypotheses $u=w$ and $u=w'$. As an immediate consequence, the quantity
$$\phi(w):=\tfrac12\left[\|w\|_2^2-\|w'\|_2^2\right]=\|w\|_2^2[1-\theta^2(w)]/2$$
is a lower bound on the $\epsilon$-risk, on $U$, of any estimate of $\|u\|_2^2$. We can now try to maximize the resulting lower risk bound over $U$, thus arriving at the lower risk bound
$$\mathrm{LwBnd}=\max_{w\in U}\left\{\tfrac12\|w\|_2^2(1-\theta^2(w))\right\}.$$
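For a given candidate signal $w$, the quantity $\phi(w)$ is straightforward to compute. The sketch below is illustrative only and uses hypothetical names and toy data. The helper `rho_from_eps` rests on a labeled assumption: it reads $\mathrm{ErfcInv}(\epsilon)$ as the upper $\epsilon$-quantile of the standard normal distribution, which should be checked against the definition in (1.26) before use. It does not attempt the nonconvex maximization over $U$.

```python
from statistics import NormalDist
import numpy as np

def rho_from_eps(sigma, eps):
    # ASSUMPTION: ErfcInv(eps) is the point s with Prob{N(0,1) >= s} = eps;
    # verify against the definition referenced as (1.26).
    return 2.0 * sigma * NormalDist().inv_cdf(1.0 - eps)

def lower_bound_phi(w, B, rho):
    """Return phi(w) = ||w||^2 (1 - theta(w)^2) / 2 and the shrunken signal w' = theta(w) w."""
    t = float(np.linalg.norm(B @ w))
    theta_w = max(1.0 - rho / t, 0.0) if t > 0 else 0.0
    w_prime = theta_w * w
    phi = 0.5 * float(w @ w) * (1.0 - theta_w**2)
    return phi, w_prime

# Hypothetical toy data: B = I, rho = 1.
B = np.eye(2)
phi_far, w_far = lower_bound_phi(np.array([2.0, 0.0]), B, rho=1.0)    # theta(w) = 0.5
phi_near, w_near = lower_bound_phi(np.array([0.5, 0.0]), B, rho=1.0)  # theta(w) = 0
```

In the first case $\|Bw-Bw'\|_2=\rho$ exactly; in the second, $w'=0$ and $\phi(w)=\|w\|_2^2/2$.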
On closer inspection, the latter problem is not a convex one, which does not prevent building a suboptimal solution to this problem, and this is how the lower risk bounds in Table 3.4 are built (we omit the details). We see that the $\epsilon$-risks of our estimates are within a moderate factor of the optimal ones. Figure 3.4 shows empirical error distributions of the estimates built in the three experiments reported in Table 3.4. When simulating the observations and estimates, we used $\mathcal{N}(0,\sigma^2I_d)$ noise and selected signals in $U$ by maximizing over $U$ randomly selected linear forms. Finally, we note that already with fixed design parameters $d$, $m$, $\theta$ and $\sigma$ we deal with a family of estimation problems rather than with a single problem, the reason being that our $U$ is an ellipsoid with half-axes essentially different from each other. In this situation, attainable risks heavily depend on how the right singular vectors of A are oriented with respect to the directions of the half-axes of $U$, so that the risks of our estimates vary significantly from instance to instance. Note also that the “sub-Gaussian experiments” were conducted on exactly the same data as “Gaussian experiments” of the same sizes $d$ and $m$.

[Figure 3.4: Histograms of recovery errors in experiments, 1,000 simulations per experiment. Panels: Gaussian case and sub-Gaussian case, each for d = 8, m = 12; d = 16, m = 24; d = 32, m = 48.]
3.5 EXERCISES FOR CHAPTER 3
Exercise 3.1. In the situation of Section 3.3.4, design of a “good” estimate is reduced to solving convex optimization problem (3.39). Note that the objective in this problem is, in a sense, “implicit”: the design variable is $h$, and the objective is obtained from an explicit convex-concave function of $h$ and $(x,y)$ by maximization over $(x,y)$. There exist solvers able to process problems of this type efficiently. However, commonly used off-the-shelf solvers, like cvx, cannot handle problems of this type. The goal of the exercise to follow is to reformulate (3.39) as a semidefinite program, thus making it amenable for cvx. On an immediate inspection, the situation we are interested in is as follows. We are given
• a nonempty convex compact set $X\subset\mathbf{R}^n$ along with an affine function $M(x)$ taking values in $\mathbf{S}^d$ and such that $M(x)\succeq0$ when $x\in X$, and
• an affine function $F(h):\mathbf{R}^d\to\mathbf{R}^n$.
Given $\gamma>0$, this data gives rise to the convex function
$$\Psi(h)=\max_{x\in X}\left\{F^T(h)x+\gamma\sqrt{h^TM(x)h}\right\},$$
and we want to find a “nice” representation of this function, specifically, we want to represent the inequality $\tau\ge\Psi(h)$ by a bunch of LMIs in variables $\tau$, $h$, and perhaps additional variables. To achieve our goal, we assume in the sequel that the set
$$X^+=\{(x,M):\ x\in X,\ M=M(x)\}$$
can be described by a system of linear and semidefinite constraints in variables $x$, $M$, and additional variables $\xi$, namely,
$$
X^+=\left\{(x,M):\ \exists\xi:\ \begin{array}{ll}
s_i-a_i^Tx-b_i^T\xi-\mathrm{Tr}(C_iM)\ge0,\ i\le I&(a)\\
S-A(x)-B(\xi)-C(M)\succeq0&(b)\\
M\succeq0&(c)
\end{array}\right\}.
$$
Here $s_i\in\mathbf{R}$, $S\in\mathbf{S}^N$ are some constants, and $A(\cdot)$, $B(\cdot)$, $C(\cdot)$ are (homogeneous) linear functions taking values in $\mathbf{S}^N$. We assume that this system of constraints is essentially strictly feasible, meaning that there exists a feasible solution at which the semidefinite constraints (b) and (c) are satisfied strictly (i.e., the left-hand sides of the LMIs are positive definite). Here comes the exercise:
1) Check that $\Psi(h)$ is the optimal value in the semidefinite program
$$
\Psi(h)=\max_{x,M,\xi,t}\left\{F^T(h)x+\gamma t:\ \begin{array}{ll}
s_i-a_i^Tx-b_i^T\xi-\mathrm{Tr}(C_iM)\ge0,\ i\le I&(a)\\
S-A(x)-B(\xi)-C(M)\succeq0&(b)\\
M\succeq0&(c)\\
\begin{bmatrix}h^TMh&t\\ t&1\end{bmatrix}\succeq0&(d)
\end{array}\right\}.\qquad(P)
$$
2) Passing from $(P)$ to the semidefinite dual of $(P)$, build an explicit semidefinite representation of $\Psi$, that is, an explicit system $\mathcal{S}$ of LMIs in variables $h$, $\tau$, and additional variables $u$ such that
$$\{\tau\ge\Psi(h)\}\ \Leftrightarrow\ \{\exists u:\ (\tau,h,u)\ \text{satisfies}\ \mathcal{S}\}.$$
Exercise 3.2. Let us consider the situation as follows. Given an $m\times n$ “sensing matrix” $A$ which is stochastic with columns from the probabilistic simplex
$$\Delta_m=\left\{v\in\mathbf{R}^m:\ v\ge0,\ \sum_iv_i=1\right\}$$
and a nonempty closed subset $U$ of $\Delta_n$, we observe an $M$-element, $M>1$, i.i.d. sample $\zeta^M=(\zeta_1,...,\zeta_M)$ with $\zeta_k$ drawn from the discrete distribution $Au_*$, where $u_*$ is an unknown probabilistic vector (“signal”) known to belong to $U$. We handle the discrete distribution $Au$, $u\in\Delta_n$, as a distribution on the vertices $e_1,...,e_m$ of $\Delta_m$, so that possible values of $\zeta_k$ are basic orths $e_1,...,e_m$ in $\mathbf{R}^m$. Our goal is to recover the value $F(u_*)$ of a given quadratic form
$$F(u)=u^TQu+2q^Tu.$$
Observe that for $u\in\Delta_n$, we have $u=[uu^T]\mathbf{1}_n$, where $\mathbf{1}_k$ is the all-ones vector in $\mathbf{R}^k$. This observation allows us to rewrite $F(u)$ as a homogeneous quadratic form:
$$F(u)=u^T\bar Qu,\quad\bar Q=Q+[q\mathbf{1}_n^T+\mathbf{1}_nq^T].\qquad(3.77)$$
The goal of the exercise is to follow the approach developed in Section 3.4.1 for the Gaussian case in order to build an estimate $\hat{g}(\zeta^M)$ of $F(u)$. To this end, consider the following construction. Let
$$\mathcal{J}_M = \{(i, j) : 1 \leq i < j \leq M\},\quad J_M = \mathrm{Card}(\mathcal{J}_M).$$
For $\zeta^M = (\zeta_1, ..., \zeta_M)$ with $\zeta_k \in \{e_1, ..., e_m\}$, $1 \leq k \leq M$, let
$$\omega_{ij}[\zeta^M] = \tfrac{1}{2}[\zeta_i\zeta_j^T + \zeta_j\zeta_i^T],\quad (i, j) \in \mathcal{J}_M.$$
The estimates we are interested in are of the form
$$\hat{g}(\zeta^M) = \mathrm{Tr}\Big(h\underbrace{\Big[\frac{1}{J_M}\sum_{(i,j)\in\mathcal{J}_M}\omega_{ij}[\zeta^M]\Big]}_{\omega[\zeta^M]}\Big) + \kappa,$$
where $h \in \mathbf{S}^m$ and $\kappa \in \mathbf{R}$ are the parameters of the estimate. Now comes the exercise:
1) Verify that when the $\zeta_k$'s stem from signal $u \in U$, the expectation of $\omega[\zeta^M]$ is a linear image $Az[u]A^T$ of the matrix $z[u] = uu^T \in \mathbf{S}^n$: denoting by $P_u^M$ the distribution of $\zeta^M$, we have
$$\mathbf{E}_{\zeta^M\sim P_u^M}\{\omega[\zeta^M]\} = Az[u]A^T. \qquad (3.78)$$
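The construction of $\omega[\zeta^M]$ and of the estimate $\mathrm{Tr}(h\omega[\zeta^M]) + \kappa$ can be sketched in a few lines of Python. This is only an illustration under stated assumptions: observations are encoded by 0-based outcome indices (index `k` stands for the basic orth $e_{k+1}$), matrices are plain nested lists, and the function names `omega` and `estimate` are hypothetical, not taken from the text.

```python
from itertools import combinations


def omega(sample, m):
    """Average of the symmetrized outer products over all pairs i < j.

    `sample` is a list of outcome indices in range(m); index k encodes e_{k+1}.
    """
    M = len(sample)
    n_pairs = M * (M - 1) // 2  # J_M = Card(J_M)
    w = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(M), 2):
        a, b = sample[i], sample[j]
        # 0.5 * (e_a e_b^T + e_b e_a^T); if a == b both halves hit entry (a, a)
        w[a][b] += 0.5 / n_pairs
        w[b][a] += 0.5 / n_pairs
    return w


def estimate(h, kappa, sample):
    """Affine-in-omega estimate Tr(h * omega[sample]) + kappa."""
    m = len(h)
    w = omega(sample, m)
    return sum(h[i][j] * w[j][i] for i in range(m) for j in range(m)) + kappa
```

By construction the resulting matrix is symmetric, entrywise nonnegative, and its entries sum to one, in line with what item 1 asks to verify for the expectation.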
Check that when setting
$$\mathcal{Z}_k = \{\omega \in \mathbf{S}^k : \omega \succeq 0,\ \omega \geq 0,\ \mathbf{1}_k^T\omega\mathbf{1}_k = 1\},$$
where $x \geq 0$ for a matrix $x$ means that $x$ is entrywise nonnegative, the image of $\mathcal{Z}_n$ under the mapping $z \mapsto AzA^T$ is contained in $\mathcal{Z}_m$.
2) Let $\boldsymbol{\Delta}^k = \{z \in \mathbf{S}^k : z \geq 0,\ \mathbf{1}_k^Tz\mathbf{1}_k = 1\}$, so that $\mathcal{Z}_k$ is the set of all positive semidefinite matrices from $\boldsymbol{\Delta}^k$. For $\mu \in \boldsymbol{\Delta}^m$, let $P_\mu$ be the distribution of the random matrix $w$ taking values in $\mathbf{S}^m$ as follows: the possible values of $w$ are matrices of the form $e_{ij} = \tfrac{1}{2}[e_ie_j^T + e_je_i^T]$, $1 \leq i \leq j \leq m$; for every $i \leq m$, $w$ takes value $e_{ii}$ with probability $\mu_{ii}$, and for every $i, j$ with $i < j$, $w$ takes value $e_{ij}$ with probability $2\mu_{ij}$. Let us set
$$\Phi_1(h; \mu) = \ln\Big(\sum_{i,j=1}^m \mu_{ij}\exp\{h_{ij}\}\Big) : \mathbf{S}^m \times \boldsymbol{\Delta}^m \to \mathbf{R},$$
so that $\Phi_1$ is a continuous convex-concave function on $\mathbf{S}^m \times \boldsymbol{\Delta}^m$.
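2.1. Prove that
$$\forall(h \in \mathbf{S}^m, \mu \in \mathcal{Z}_m) : \ln\mathbf{E}_{w\sim P_\mu}\{\exp\{\mathrm{Tr}(hw)\}\} = \Phi_1(h; \mu).$$
2.2. Derive from 2.1 that setting
$$K = K(M) = \lfloor M/2\rfloor,\quad \Phi_M(h; \mu) = K\Phi_1(h/K; \mu) : \mathbf{S}^m \times \boldsymbol{\Delta}^m \to \mathbf{R},$$
$\Phi_M$ is a continuous convex-concave function on $\mathbf{S}^m \times \boldsymbol{\Delta}^m$ such that $\Phi_M(0; \mu) = 0$ for all $\mu \in \mathcal{Z}_m$, and whenever $u \in U$, the following holds true: Let $P_{u,M}$ be the distribution of $\omega = \omega[\zeta^M]$, $\zeta^M \sim P_u^M$. Then for all $u \in U$, $h \in \mathbf{S}^m$,
$$\ln\mathbf{E}_{\omega\sim P_{u,M}}\{\exp\{\mathrm{Tr}(h\omega)\}\} \leq \Phi_M(h; Az[u]A^T),\quad z[u] = uu^T. \qquad (3.79)$$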
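The identity of item 2.1 is easy to sanity-check numerically before proving it. The sketch below enumerates the values $e_{ij}$ of $w \sim P_\mu$ and compares $\ln\mathbf{E}\exp\{\mathrm{Tr}(hw)\}$ with $\Phi_1(h;\mu)$; it also evaluates $\Phi_M$ as defined in item 2.2. The function names and the nested-list representation of symmetric matrices are assumptions made for illustration only.

```python
import math


def phi1(h, mu):
    """Phi_1(h; mu) = ln sum_{i,j} mu_ij * exp(h_ij)."""
    m = len(h)
    return math.log(sum(mu[i][j] * math.exp(h[i][j])
                        for i in range(m) for j in range(m)))


def phi_M(h, mu, M):
    """Phi_M(h; mu) = K * Phi_1(h / K; mu) with K = floor(M / 2)."""
    K = M // 2
    scaled = [[x / K for x in row] for row in h]
    return K * phi1(scaled, mu)


def log_mgf(h, mu):
    """ln E exp{Tr(h w)}, w ~ P_mu, by direct enumeration of the values e_ij."""
    m = len(h)
    total = 0.0
    for i in range(m):
        for j in range(i, m):
            prob = mu[i][i] if i == j else 2.0 * mu[i][j]
            # Tr(h e_ij) = (h_ij + h_ji) / 2, which equals h_ij for symmetric h
            total += prob * math.exp(0.5 * (h[i][j] + h[j][i]))
    return math.log(total)
```

For symmetric `h` and symmetric `mu` with entries summing to one, the two quantities coincide up to rounding, which is exactly the content of 2.1.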
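3) Combine the above observations with Corollary 3.6 to arrive at the following result:

Proposition 3.19. In the situation in question, let $\mathcal{Z}$ be a convex compact subset of $\mathcal{Z}_n$ such that $uu^T \in \mathcal{Z}$ for all $u \in U$. Given $\epsilon \in (0, 1)$, let
$$\begin{array}{rcl}
\Psi_+(h, \alpha) &=& \max\limits_{z\in\mathcal{Z}}\left[\alpha\Phi_M(h/\alpha, AzA^T) - \mathrm{Tr}(\bar{Q}z)\right] : \mathbf{S}^m \times \{\alpha > 0\} \to \mathbf{R},\\
\Psi_-(h, \alpha) &=& \max\limits_{z\in\mathcal{Z}}\left[\alpha\Phi_M(-h/\alpha, AzA^T) + \mathrm{Tr}(\bar{Q}z)\right] : \mathbf{S}^m \times \{\alpha > 0\} \to \mathbf{R},\\
\hat{\Psi}_+(h) &:=& \inf\limits_{\alpha>0}\left[\Psi_+(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\alpha>0}\left[\alpha\Phi_M(h/\alpha, AzA^T) - \mathrm{Tr}(\bar{Q}z) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\beta>0}\left[\beta\Phi_1(h/\beta, AzA^T) - \mathrm{Tr}(\bar{Q}z) + \frac{\beta}{K}\ln(2/\epsilon)\right] \quad [\beta = K\alpha],\\
\hat{\Psi}_-(h) &:=& \inf\limits_{\alpha>0}\left[\Psi_-(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\alpha>0}\left[\alpha\Phi_M(-h/\alpha, AzA^T) + \mathrm{Tr}(\bar{Q}z) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\beta>0}\left[\beta\Phi_1(-h/\beta, AzA^T) + \mathrm{Tr}(\bar{Q}z) + \frac{\beta}{K}\ln(2/\epsilon)\right] \quad [\beta = K\alpha].
\end{array}$$
The functions $\hat{\Psi}_\pm$ are real-valued and convex on $\mathbf{S}^m$, and every candidate solution $h$ to the convex optimization problem
$$\mathrm{Opt} = \min_h\left\{\hat{\Psi}(h) := \tfrac{1}{2}\left[\hat{\Psi}_+(h) + \hat{\Psi}_-(h)\right]\right\} \qquad (3.80)$$
induces the estimate
$$\hat{g}_h(\zeta^M) = \mathrm{Tr}(h\omega[\zeta^M]) + \kappa(h),\quad \kappa(h) = \tfrac{1}{2}[\hat{\Psi}_-(h) - \hat{\Psi}_+(h)]$$
of the functional of interest (3.77) via observation $\zeta^M$ with $\epsilon$-risk on $U$ not exceeding $\rho = \hat{\Psi}(h)$:
$$\forall(u \in U) : \mathrm{Prob}_{\zeta^M\sim P_u^M}\{|F(u) - \hat{g}_h(\zeta^M)| > \rho\} \leq \epsilon.$$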
4) Consider an alternative way to estimate $F(u)$, namely, as follows. Let $u \in U$. Given a pair of independent observations $\zeta_1, \zeta_2$ drawn from distribution $Au$, let us convert them into the symmetric matrix $\omega_{1,2}[\zeta^2] = \tfrac{1}{2}[\zeta_1\zeta_2^T + \zeta_2\zeta_1^T]$. The distribution $P_{u,2}$ of this matrix is exactly the distribution $P_{\mu(z[u])}$ (see item B), where $\mu(z) = AzA^T : \boldsymbol{\Delta}^n \to \boldsymbol{\Delta}^m$. Now, given $M = 2K$ observations $\zeta^{2K} = (\zeta_1, ..., \zeta_{2K})$ stemming from signal $u$, we can split them into $K$ consecutive pairs giving rise to $K$ observations $\omega^K = (\omega_1, ..., \omega_K)$, $\omega_k = \omega[[\zeta_{2k-1}; \zeta_{2k}]]$, drawn independently of each other from probability distribution $P_{\mu(z[u])}$, and the functional of interest (3.77) is a linear function $\mathrm{Tr}(\bar{Q}z[u])$ of $z[u]$. Assume that we are given a set $\mathcal{Z}$ as in the premise of Proposition 3.19. Observe that we are in the situation as follows:

Given $K$ i.i.d. observations $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_k \sim P_{\mu(z)}$, where $z$ is an unknown signal known to belong to $\mathcal{Z}$, we want to recover the value at $z$ of linear function $G(v) = \mathrm{Tr}(\bar{Q}v)$ of $v \in \mathbf{S}^n$. Besides this, we know that $P_\mu$, for every $\mu \in \boldsymbol{\Delta}^m$, satisfies the relation
$$\forall(h \in \mathbf{S}^m) : \ln\mathbf{E}_{\omega\sim P_\mu}\{\exp\{\mathrm{Tr}(h\omega)\}\} \leq \Phi_1(h; \mu).$$

This situation fits the setting of Section 3.3.3, with the data specified as
$$\mathcal{H} = \mathcal{E}_H = \mathbf{S}^m,\ \mathcal{M} = \boldsymbol{\Delta}^m \subset \mathcal{E}_M = \mathbf{S}^m,\ \Phi = \Phi_1,\ \mathcal{X} := \mathcal{Z} \subset \mathcal{E}_X = \mathbf{S}^n,\ \mathcal{A}(z) = AzA^T.$$
Therefore, we can use the apparatus developed in that section to upper-bound the $\epsilon$-risk of the affine estimate
$$\mathrm{Tr}\Big(h\frac{1}{K}\sum_{k=1}^K\omega_k\Big) + \kappa$$
of $F(u) := G(z[u]) = u^T\bar{Q}u$ and to build the best, in terms of the upper risk bound, estimate; see Corollary 3.8. On closer inspection (carry it out!), the functions $\hat{\Psi}_\pm$ arising in (3.38) and associated with the above data are exactly the functions $\hat{\Psi}_\pm$ specified in Proposition 3.19 for $M = 2K$. Thus, the approach to estimating $F(u)$ via observations $\zeta^{2K}$ stemming from $u \in U$ results in a family of estimates
$$\tilde{g}_h(\zeta^{2K}) = \mathrm{Tr}\Big(h\frac{1}{K}\sum_{k=1}^K\omega[[\zeta_{2k-1}; \zeta_{2k}]]\Big) + \kappa(h),\quad h \in \mathbf{S}^m.$$
The resulting upper bound on the $\epsilon$-risk of estimate $\tilde{g}_h$ is $\hat{\Psi}(h)$, where $\hat{\Psi}(\cdot)$ is associated with $M = 2K$ according to Proposition 3.19. In other words, this is exactly the upper bound on the $\epsilon$-risk of the estimate $\hat{g}_h$ offered by the proposition. Note, however, that the estimates $\tilde{g}_h$ and $\hat{g}_h$ are not identical:
$$\begin{array}{rcl}
\tilde{g}_h(\zeta^{2K}) &=& \mathrm{Tr}\Big(h\frac{1}{K}\sum\limits_{k=1}^K\omega_{2k-1,2k}[\zeta^{2K}]\Big) + \kappa(h),\\
\hat{g}_h(\zeta^{2K}) &=& \mathrm{Tr}\Big(h\frac{1}{K(2K-1)}\sum\limits_{1\leq i<j\leq 2K}\omega_{ij}[\zeta^{2K}]\Big) + \kappa(h).
\end{array}$$

[…] ζ,” where $\eta$, $\zeta$ are discrete real-valued random variables independent of each other with distributions $u$, $v$, and $\pi$ is a linear function of the “joint distribution” $uv^T$ of $\eta$, $\zeta$. This story gives rise to the aforementioned estimation problem with the unit sensing matrices $P$ and $Q$. Assuming that there are “measurement errors” (instead of observing an action's outcome “as is,” we observe a realization of a random variable with distribution depending, in a prescribed fashion, on the outcome), we arrive at problems where $P$ and $Q$ can be general type stochastic matrices. As always, we encode the $p$ possible values of $\eta_k$ by the basic orths $e_1, ..., e_p$ in $\mathbf{R}^p$, and the $q$ possible values of $\zeta$ by the basic orths $f_1, ..., f_q$ in $\mathbf{R}^q$. We focus on estimates of the form
$$\hat{g}_{h,\kappa}(\eta^K, \zeta^L) = \Big[\frac{1}{K}\sum_k\eta_k\Big]^Th\Big[\frac{1}{L}\sum_\ell\zeta_\ell\Big] + \kappa \qquad [h \in \mathbf{R}^{p\times q}, \kappa \in \mathbf{R}].$$
This is what you are supposed to do:
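1) (cf. item 2 in Exercise 3.2) Denoting by $\boldsymbol{\Delta}^{mn}$ the set of nonnegative $m \times n$ matrices with unit sum of all entries (i.e., the set of all probability distributions on $\{1, ..., m\} \times \{1, ..., n\}$) and assuming $L \geq K$, let us set
$$\mathcal{A}(z) = PzQ^T : \mathbf{R}^{r\times s} \to \mathbf{R}^{p\times q}$$
and
$$\begin{array}{rcl}
\Phi(h; \mu) &=& \ln\Big(\sum\limits_{i=1}^p\sum\limits_{j=1}^q\mu_{ij}\exp\{h_{ij}\}\Big) : \mathbf{R}^{p\times q} \times \boldsymbol{\Delta}^{pq} \to \mathbf{R},\\
\Phi_K(h; \mu) &=& K\Phi(h/K; \mu) : \mathbf{R}^{p\times q} \times \boldsymbol{\Delta}^{pq} \to \mathbf{R}.
\end{array}$$
Verify that $\mathcal{A}$ maps $\boldsymbol{\Delta}^{rs}$ into $\boldsymbol{\Delta}^{pq}$, $\Phi$ and $\Phi_K$ are continuous convex-concave functions on their domains, and that for every $u \in \Delta_r$, $v \in \Delta_s$, the following holds true:

(!) When $\eta^K = (\eta_1, ..., \eta_K)$, $\zeta^L = (\zeta_1, ..., \zeta_L)$ with mutually independent $\eta_1, ..., \zeta_L$ such that $\eta_k \sim Pu$, $\zeta_\ell \sim Qv$ for all $k, \ell$, we have
$$\ln\mathbf{E}_{\eta,\zeta}\Big\{\exp\Big\{\Big[\frac{1}{K}\sum_k\eta_k\Big]^Th\Big[\frac{1}{L}\sum_\ell\zeta_\ell\Big]\Big\}\Big\} \leq \Phi_K(h; \mathcal{A}(uv^T)). \qquad (3.82)$$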
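2) Combine (!) with Corollary 3.6 to arrive at the following analog of Proposition 3.19:

Proposition 3.20. In the situation in question, let $\mathcal{Z}$ be a convex compact subset of $\boldsymbol{\Delta}^{rs}$ such that $uv^T \in \mathcal{Z}$ for all $u \in U$, $v \in V$. Given $\epsilon \in (0, 1)$, let
$$\begin{array}{rcl}
\Psi_+(h, \alpha) &=& \max\limits_{z\in\mathcal{Z}}\left[\alpha\Phi_K(h/\alpha, PzQ^T) - \mathrm{Tr}(Fz^T)\right] : \mathbf{R}^{p\times q} \times \{\alpha > 0\} \to \mathbf{R},\\
\Psi_-(h, \alpha) &=& \max\limits_{z\in\mathcal{Z}}\left[\alpha\Phi_K(-h/\alpha, PzQ^T) + \mathrm{Tr}(Fz^T)\right] : \mathbf{R}^{p\times q} \times \{\alpha > 0\} \to \mathbf{R},\\
\hat{\Psi}_+(h) &:=& \inf\limits_{\alpha>0}\left[\Psi_+(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\alpha>0}\left[\alpha\Phi_K(h/\alpha, PzQ^T) - \mathrm{Tr}(Fz^T) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\beta>0}\left[\beta\Phi(h/\beta, PzQ^T) - \mathrm{Tr}(Fz^T) + \frac{\beta}{K}\ln(2/\epsilon)\right] \quad [\beta = K\alpha],\\
\hat{\Psi}_-(h) &:=& \inf\limits_{\alpha>0}\left[\Psi_-(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\alpha>0}\left[\alpha\Phi_K(-h/\alpha, PzQ^T) + \mathrm{Tr}(Fz^T) + \alpha\ln(2/\epsilon)\right]\\
&=& \max\limits_{z\in\mathcal{Z}}\inf\limits_{\beta>0}\left[\beta\Phi(-h/\beta, PzQ^T) + \mathrm{Tr}(Fz^T) + \frac{\beta}{K}\ln(2/\epsilon)\right] \quad [\beta = K\alpha].
\end{array}$$
The functions $\hat{\Psi}_\pm$ are real-valued and convex on $\mathbf{R}^{p\times q}$, and every candidate solution $h$ to the convex optimization problem
$$\mathrm{Opt} = \min_h\left\{\hat{\Psi}(h) := \tfrac{1}{2}\left[\hat{\Psi}_+(h) + \hat{\Psi}_-(h)\right]\right\}$$
induces the estimate
$$\hat{g}_h(\eta^K, \zeta^L) = \mathrm{Tr}\Big(h\Big[\Big[\frac{1}{K}\sum_k\eta_k\Big]\Big[\frac{1}{L}\sum_\ell\zeta_\ell\Big]^T\Big]^T\Big) + \kappa(h),\quad \kappa(h) = \tfrac{1}{2}[\hat{\Psi}_-(h) - \hat{\Psi}_+(h)]$$
of the functional of interest (3.81) via observation $\eta^K, \zeta^L$ with $\epsilon$-risk on $U \times V$ not exceeding $\rho = \hat{\Psi}(h)$:
$$\forall(u \in U, v \in V) : \mathrm{Prob}\{|F(u, v) - \hat{g}_h(\eta^K, \zeta^L)| > \rho\} \leq \epsilon,$$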
the probability being taken w.r.t. the distribution of observations $\eta^K, \zeta^L$ stemming from signals $u, v$.

Exercise 3.4. [recovering mixture weights] The problem to be addressed in this exercise is as follows. We are given $K$ probability distributions $P_1, ..., P_K$ on observation space $\Omega$, and let these distributions have densities $p_k(\cdot)$ w.r.t. some reference measure $\Pi$ on $\Omega$; we assume that $\sum_k p_k(\cdot)$ is positive on $\Omega$. We are also given $N$ independent observations $\omega_t \sim P_\mu$, $t = 1, ..., N$, drawn from distribution
$$P_\mu = \sum_{k=1}^K\mu_kP_k,$$
where $\mu$ is an unknown “signal” known to belong to the probabilistic simplex $\Delta_K = \{\mu \in \mathbf{R}^K : \mu \geq 0, \sum_k\mu_k = 1\}$. Given $\omega^N = (\omega_1, ..., \omega_N)$, we want to recover the linear image $G\mu$ of $\mu$, where $G \in \mathbf{R}^{\nu\times K}$ is given. We intend to measure the risk of a candidate estimate $\hat{G}(\omega^N) : \Omega \times ... \times \Omega \to \mathbf{R}^\nu$ by the quantity
$$\mathrm{Risk}[\hat{G}(\cdot)] = \sup_{\mu\in\Delta_K}\left[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\left\{\|\hat{G}(\omega^N) - G\mu\|_2^2\right\}\right]^{1/2}.$$
3.4.A. Recovering linear form. Let us start with the case when $G = g^T$ is a $1 \times K$ matrix.

3.4.A.1. Preliminaries. To motivate the construction to follow, consider the case when $\Omega$ is a finite set (obtained, e.g., by “fine discretization” of the “true” observation space). In this situation our problem becomes an estimation problem in Discrete o.s.: given a stationary $N$-repeated observation stemming from a discrete probability distribution $P_\mu$ affinely parameterized by signal $\mu \in \Delta_K$, we want to recover a linear form of $\mu$. It is shown in Section 3.1 (see Remark 3.2) that in this case a nearly optimal, in terms of its $\epsilon$-risk, estimate is of the form
$$\hat{g}(\omega^N) = \frac{1}{N}\sum_{t=1}^N\phi(\omega_t) \qquad (3.83)$$
with properly selected $\phi$. The difficulty with this approach is that as far as computations are concerned, optimal design of $\phi$ requires solving a convex optimization problem of design dimension of order of the cardinality of $\Omega$, and this cardinality could be huge, as is the case when $\Omega$ is a discretization of a domain in $\mathbf{R}^d$ with $d$ in the range of tens. To circumvent this problem, we are to simplify the outlined approach: from the construction of Section 3.1 we inherit the simple structure (3.83) of the estimator; taking this structure for granted, we are to develop an alternative design of $\phi$. With this new design, we have no theoretical guarantees for the resulting estimates to be near-optimal; we sacrifice these guarantees in order to reduce dramatically the computational effort of building the estimates.
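3.4.A.2. Generic estimate. Let us select somehow $L$ functions $F_\ell(\cdot)$ on $\Omega$ such that
$$\int F_\ell^2(\omega)p_k(\omega)\Pi(d\omega) < \infty,\quad 1 \leq \ell \leq L,\ 1 \leq k \leq K. \qquad (3.84)$$
With $\lambda \in \mathbf{R}^L$, consider estimate of the form
$$\hat{g}_\lambda(\omega^N) = \frac{1}{N}\sum_{t=1}^N\Phi_\lambda(\omega_t),\quad \Phi_\lambda(\omega) = \sum_\ell\lambda_\ell F_\ell(\omega). \qquad (3.85)$$
1) Prove that
$$\begin{array}{rcl}
\mathrm{Risk}[\hat{g}_\lambda] \leq \mathrm{Risk}(\lambda) &:=& \max\limits_{k\leq K}\left[\frac{1}{N}\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2p_k(\omega)\Pi(d\omega) + \Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]p_k(\omega)\Pi(d\omega) - g^Te_k\Big]^2\right]^{1/2}\\
&=& \max\limits_{k\leq K}\left[\frac{1}{N}\lambda^TW_k\lambda + \left[e_k^T[M\lambda - g]\right]^2\right]^{1/2},
\end{array} \qquad (3.86)$$
where
$$\begin{array}{rcl}
M &=& \left[M_{k\ell} := \int F_\ell(\omega)p_k(\omega)\Pi(d\omega)\right]_{k\leq K,\ \ell\leq L},\\
W_k &=& \left[[W_k]_{\ell\ell'} := \int F_\ell(\omega)F_{\ell'}(\omega)p_k(\omega)\Pi(d\omega)\right]_{\ell\leq L,\ \ell'\leq L},\quad 1 \leq k \leq K,
\end{array}$$
and $e_1, ..., e_K$ are the standard basic orths in $\mathbf{R}^K$.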
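Once $M$ and the $W_k$ are available, evaluating the right-hand side of (3.86) is a few lines of code. The sketch below assumes a finite observation space with the counting measure as $\Pi$, so that the integrals become sums; the helper names `moments` and `risk`, and the table-based representation of $p_k$ and $F_\ell$, are hypothetical choices made for illustration.

```python
import math


def moments(p, F):
    """Build M and W_k for a finite Omega with counting reference measure.

    p[k][w] is the density p_k at point w; F[l][w] is F_l at point w.
    M[k][l] = sum_w F_l(w) p_k(w);  W[k][l][l2] = sum_w F_l(w) F_l2(w) p_k(w).
    """
    K, L, n = len(p), len(F), len(p[0])
    M = [[sum(F[l][w] * p[k][w] for w in range(n)) for l in range(L)]
         for k in range(K)]
    W = [[[sum(F[l][w] * F[l2][w] * p[k][w] for w in range(n))
           for l2 in range(L)] for l in range(L)] for k in range(K)]
    return M, W


def risk(lam, M, W, g, N):
    """Risk(lam) = max_k sqrt(lam^T W_k lam / N + (e_k^T (M lam - g))^2)."""
    L = len(lam)
    worst = 0.0
    for k in range(len(M)):
        quad = sum(lam[a] * W[k][a][b] * lam[b]
                   for a in range(L) for b in range(L))
        bias = sum(M[k][l] * lam[l] for l in range(L)) - g[k]
        worst = max(worst, math.sqrt(quad / N + bias * bias))
    return worst
```

For instance, with two distributions on a two-point space, $F_\ell$ the indicators of the points, and $\lambda$ chosen so that $M\lambda = g$, the bias term vanishes and only the variance term $\frac{1}{N}\lambda^TW_k\lambda$ remains.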
Note that $\mathrm{Risk}(\lambda)$ is a convex function of $\lambda$; this function is easy to compute, provided the matrices $M$ and $W_k$, $k \leq K$, are available. Assuming this is the case, we can solve the convex optimization problem
$$\mathrm{Opt} = \min_{\lambda\in\mathbf{R}^L}\mathrm{Risk}(\lambda) \qquad (3.87)$$
and use the estimate (3.85) associated with the optimal solution to this problem; the risk of this estimate will be upper-bounded by $\mathrm{Opt}$.

3.4.A.3. Implementation. When implementing the generic estimate we arrive at the “Measurement Design” question: how do we select the value of $L$ and functions $F_\ell$, $1 \leq \ell \leq L$, resulting in small (upper bound $\mathrm{Opt}$ on the) risk of the estimate (3.85) yielded by an optimal solution to (3.87)? We are about to consider three related options: Naive, Basic, and Maximum Likelihood (ML).

The Naive option is to take $F_\ell = p_\ell$, $1 \leq \ell \leq L = K$, assuming that this selection meets (3.84). For the sake of definiteness, consider the “Gaussian case,” where $\Omega = \mathbf{R}^d$, $\Pi$ is the Lebesgue measure, and $p_k$ is Gaussian distribution with parameters $\nu_k$, $\Sigma_k$:
$$p_k(\omega) = (2\pi)^{-d/2}\mathrm{Det}(\Sigma_k)^{-1/2}\exp\left\{-\tfrac{1}{2}(\omega - \nu_k)^T\Sigma_k^{-1}(\omega - \nu_k)\right\}.$$
In this case, the Naive option leads to easily computable matrices $M$ and $W_k$ appearing in (3.86).
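2) Check that in the Gaussian case, when setting
$$\begin{array}{c}
\Sigma_{k\ell} = [\Sigma_k^{-1} + \Sigma_\ell^{-1}]^{-1},\quad \Sigma_{k\ell m} = [\Sigma_k^{-1} + \Sigma_\ell^{-1} + \Sigma_m^{-1}]^{-1},\quad \chi_k = \Sigma_k^{-1}\nu_k,\\
\alpha_{k\ell} = \sqrt{\dfrac{\mathrm{Det}(\Sigma_{k\ell})}{(2\pi)^d\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)}},\quad \beta_{k\ell m} = (2\pi)^{-d}\sqrt{\dfrac{\mathrm{Det}(\Sigma_{k\ell m})}{\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)\mathrm{Det}(\Sigma_m)}},
\end{array}$$
we have
$$\begin{array}{rcl}
M_{k\ell} &:=& \int p_\ell(\omega)p_k(\omega)\Pi(d\omega)\\
&=& \alpha_{k\ell}\exp\left\{\tfrac{1}{2}\left[[\chi_k + \chi_\ell]^T\Sigma_{k\ell}[\chi_k + \chi_\ell] - \chi_k^T\Sigma_k\chi_k - \chi_\ell^T\Sigma_\ell\chi_\ell\right]\right\},\\
{[W_k]}_{\ell m} &:=& \int p_\ell(\omega)p_m(\omega)p_k(\omega)\Pi(d\omega)\\
&=& \beta_{k\ell m}\exp\left\{\tfrac{1}{2}\left[[\chi_k + \chi_\ell + \chi_m]^T\Sigma_{k\ell m}[\chi_k + \chi_\ell + \chi_m] - \chi_k^T\Sigma_k\chi_k - \chi_\ell^T\Sigma_\ell\chi_\ell - \chi_m^T\Sigma_m\chi_m\right]\right\}.
\end{array}$$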
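Before proving the closed forms of item 2, it may help to check them numerically in the scalar case $d = 1$, where $\Sigma_k$ is just a variance and the determinants are the variances themselves. The sketch below compares the closed-form expressions for $M_{k\ell}$ and $[W_k]_{\ell m}$ with a plain trapezoid-rule integration; the scalar specialization, the integration range, and all function names are assumptions made for this illustration.

```python
import math


def gauss(w, nu, var):
    """Scalar Gaussian density with mean nu and variance var."""
    return math.exp(-0.5 * (w - nu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def closed_form_M(nu_k, var_k, nu_l, var_l):
    """M_kl for d = 1: Sigma_kl = 1 / (1/var_k + 1/var_l), chi = nu / var."""
    s_kl = 1.0 / (1.0 / var_k + 1.0 / var_l)
    chi_k, chi_l = nu_k / var_k, nu_l / var_l
    alpha = math.sqrt(s_kl / (2 * math.pi * var_k * var_l))
    expo = 0.5 * ((chi_k + chi_l) ** 2 * s_kl
                  - chi_k ** 2 * var_k - chi_l ** 2 * var_l)
    return alpha * math.exp(expo)


def closed_form_W(nu_k, var_k, nu_l, var_l, nu_m, var_m):
    """[W_k]_lm for d = 1, with beta = (2 pi)^(-1) sqrt(s / (var_k var_l var_m))."""
    s = 1.0 / (1.0 / var_k + 1.0 / var_l + 1.0 / var_m)
    chi = (nu_k / var_k, nu_l / var_l, nu_m / var_m)
    beta = math.sqrt(s / (var_k * var_l * var_m)) / (2 * math.pi)
    expo = 0.5 * (sum(chi) ** 2 * s - chi[0] ** 2 * var_k
                  - chi[1] ** 2 * var_l - chi[2] ** 2 * var_m)
    return beta * math.exp(expo)


def trapezoid(f, lo=-20.0, hi=20.0, n=40000):
    """Plain trapezoid rule; adequate for smooth, rapidly decaying integrands."""
    step = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for i in range(1, n):
        total += f(lo + i * step)
    return total * step
```

In the scalar case the first formula also agrees with the standard identity that the integral of a product of two Gaussian densities equals a Gaussian density with variance $\Sigma_k + \Sigma_\ell$ evaluated at $\nu_k - \nu_\ell$.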
Basic option. Though simple, the Naive option does not make much sense: when replacing the reference measure $\Pi$ with another measure $\Pi'$ which has positive density $\theta(\cdot)$ w.r.t. $\Pi$, the densities $p_k$ are updated according to $p_k(\cdot) \mapsto p_k'(\cdot) = \theta^{-1}(\cdot)p_k(\cdot)$, so that selecting $F_\ell' = p_\ell'$, the matrices $M$ and $W_k$ become $M'$ and $W_k'$ with
$$\begin{array}{rcl}
M_{k\ell}' &=& \int\dfrac{p_k(\omega)p_\ell(\omega)}{\theta^2(\omega)}\Pi'(d\omega) = \int\dfrac{p_k(\omega)p_\ell(\omega)}{\theta(\omega)}\Pi(d\omega),\\
{[W_k']}_{\ell m} &=& \int\dfrac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^3(\omega)}\Pi'(d\omega) = \int\dfrac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^2(\omega)}\Pi(d\omega).
\end{array}$$
We see that in general $M \neq M'$ and $W_k \neq W_k'$, which makes the Naive option rather unnatural. In the alternative Basic option we set
$$L = K,\quad F_\ell(\omega) = \pi_\ell(\omega) := \frac{p_\ell(\omega)}{\sum_kp_k(\omega)}.$$
The motivation is that the functions $F_\ell$ are invariant when replacing $\Pi$ with $\Pi'$, so that here $M = M'$ and $W_k = W_k'$. Besides this, there are statistical arguments in favor of the Basic option, namely, as follows. Let $\Pi_*$ be the measure with the density $\sum_kp_k(\cdot)$ w.r.t. $\Pi$; taken w.r.t. $\Pi_*$, the densities of $P_k$ are exactly the above $\pi_k(\cdot)$, and $\sum_k\pi_k(\omega) \equiv 1$. Now, (3.86) says that the risk of estimate $\hat{g}_\lambda$ can be upper-bounded by the function $\mathrm{Risk}(\lambda)$ defined in (3.86), and this function, in turn, can be upper-bounded by the function
$$\begin{array}{rcl}
\mathrm{Risk}^+(\lambda) &:=& \left[\sum\limits_k\frac{1}{N}\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2p_k(\omega)\Pi(d\omega) + \max\limits_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]p_k(\omega)\Pi(d\omega) - g^Te_k\Big]^2\right]^{1/2}\\
&=& \left[\frac{1}{N}\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2\Pi_*(d\omega) + \max\limits_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]\pi_k(\omega)\Pi_*(d\omega) - g^Te_k\Big]^2\right]^{1/2}\\
&\leq& K\,\mathrm{Risk}(\lambda)
\end{array}$$
(we have said that the maximum of $K$ nonnegative quantities is at most their sum, and the latter is at most $K$ times the maximum of the quantities). Consequently, the risk of the estimate (3.85) stemming from an optimal solution to (3.87) can be upper-bounded by the quantity
$$\mathrm{Opt}^+ := \min_\lambda\mathrm{Risk}^+(\lambda) \qquad [\geq \mathrm{Opt} := \min_\lambda\mathrm{Risk}(\lambda)].$$
And here comes the punchline:

3.1) Prove that both the quantities $\mathrm{Opt}$ defined in (3.87) and the above $\mathrm{Opt}^+$ depend only on the linear span of the functions $F_\ell$, $\ell = 1, ..., L$, not on how the functions $F_\ell$ are selected in this span.

3.2) Prove that the selection $F_\ell = \pi_\ell$, $1 \leq \ell \leq L = K$, minimizes $\mathrm{Opt}^+$ among all possible selections $L$, $\{F_\ell\}_{\ell=1}^L$ satisfying (3.84). Conclude that the selection $F_\ell = \pi_\ell$, $1 \leq \ell \leq L = K$, while not necessarily optimal in terms of $\mathrm{Opt}$, definitely is meaningful: this selection optimizes the natural upper bound $\mathrm{Opt}^+$ on $\mathrm{Opt}$. Observe that $\mathrm{Opt}^+ \leq K\mathrm{Opt}$, so that optimizing instead of $\mathrm{Opt}$ the upper bound $\mathrm{Opt}^+$, although rough, is not completely meaningless.

A downside of the Basic option is that it seems problematic to get closed form expressions for the associated matrices $M$ and $W_k$; see (3.86). For example, in the Gaussian case, the Naive choice of $F_\ell$'s allows us to represent $M$ and $W_k$ in an explicit closed form; in contrast to this, when selecting $F_\ell = \pi_\ell$, $\ell \leq L = K$, seemingly the only way to get $M$ and $W_k$ is to use Monte-Carlo simulations. This being said, we indeed can use Monte-Carlo simulations to compute $M$ and $W_k$, provided we can sample from distributions $P_1, ..., P_K$. In this respect, it should be stressed that with $F_\ell \equiv \pi_\ell$, the entries in $M$ and $W_k$ are expectations, w.r.t. $P_1, ..., P_K$, of functions of $\omega$ bounded in magnitude by 1, and thus well-suited for Monte-Carlo simulation.

Maximum Likelihood option. This choice of $\{F_\ell\}_{\ell\leq L}$ follows straightforwardly the idea of discretization we started with in this exercise. Specifically, we split $\Omega$ into $L$ cells $\Omega_1, ..., \Omega_L$ in such a way that the intersection of any two different cells is of $\Pi$-measure zero, and treat as our observations not the actual observations $\omega_t$, but the indexes of the cells to which the $\omega_t$'s belong. With our estimation scheme, this is the same as selecting $F_\ell$ as the characteristic function of $\Omega_\ell$, $\ell \leq L$. Assuming that for distinct $k, k'$ the densities $p_k$, $p_{k'}$ differ from each other $\Pi$-almost surely, the simplest discretization independent of how the reference measure is selected is the Maximum Likelihood discretization
$$\Omega_\ell = \{\omega : \max_kp_k(\omega) = p_\ell(\omega)\},\quad 1 \leq \ell \leq L = K;$$
with the ML option, we take, as $F_\ell$'s, the characteristic functions of the sets $\Omega_\ell$, $1 \leq \ell \leq L = K$, just defined. As with the Basic option, the matrices $M$ and $W_k$ associated with the ML option can be found by Monte-Carlo simulation.

We have discussed three simple options for selecting $F_\ell$'s. In applications, one can compute the upper risk bounds $\mathrm{Opt}$ (see (3.87)) associated with each option, and use the option with the best, that is, the smallest, risk bound (“smart” choice of $F_\ell$'s). Alternatively, one can take as $\{F_\ell, \ell \leq L\}$ the union of the three collections yielded by the above options (and, perhaps, further extend this union). Note that the larger is the collection of the $F_\ell$'s, the smaller is the associated $\mathrm{Opt}$, so that the only price for combining different selections is in increasing the computational cost of solving (3.87).

3.4.A.4. Illustration. In the experimental part of this exercise you are expected to

4.1) Run numerical experiments to compare the estimates yielded by the above three options (Naive, Basic, ML). Recommended setup:
• $d = 8$, $K = 90$;
• Gaussian case with the covariance matrices $\Sigma_k$ of $P_k$ selected at random,
$$S_k = \mathrm{rand}(d, d),\quad \Sigma_k = \frac{S_kS_k^T}{\|S_k\|^2} \qquad [\|\cdot\|: \text{spectral norm}]$$
and the expectations $\nu_k$ of $P_k$ selected at random from $\mathcal{N}(0, \sigma^2I_d)$, with $\sigma = 0.1$;
• values of $N$: $\{10^s, s = 0, 1, ..., 5\}$;
• linear form to be recovered: $g^T\mu \equiv \mu_1$.
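The Monte-Carlo route mentioned for the Basic option can be sketched as follows. To stay within the standard library, the sketch uses a scalar toy mixture rather than the recommended $d = 8$, $K = 90$ setup; the function names, the sample size, and the use of `random.Random` with a fixed seed are assumptions for illustration, not part of the exercise.

```python
import math
import random


def density(w, nu, var):
    """Scalar Gaussian density with mean nu and variance var."""
    return math.exp(-0.5 * (w - nu) ** 2 / var) / math.sqrt(2 * math.pi * var)


def basic_option_moments(nus, variances, n_samples, rng):
    """Monte-Carlo estimates of M and W_k for F_l = pi_l = p_l / sum_k p_k.

    M[k][l] approximates E_{w ~ P_k} pi_l(w);
    W[k][l][l2] approximates E_{w ~ P_k} pi_l(w) * pi_l2(w).
    """
    K = len(nus)
    M = [[0.0] * K for _ in range(K)]
    W = [[[0.0] * K for _ in range(K)] for _ in range(K)]
    for k in range(K):
        for _ in range(n_samples):
            w = rng.gauss(nus[k], math.sqrt(variances[k]))
            p = [density(w, nus[l], variances[l]) for l in range(K)]
            total = sum(p)
            pi = [x / total for x in p]  # bounded by 1 and summing to 1
            for l in range(K):
                M[k][l] += pi[l] / n_samples
                for l2 in range(K):
                    W[k][l][l2] += pi[l] * pi[l2] / n_samples
    return M, W
```

Since $\sum_\ell\pi_\ell \equiv 1$, every row of the estimated $M$ sums to one regardless of the sample, which gives a cheap consistency check; for the ML option the same loop applies with `pi` replaced by the indicator of the index of the largest density.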
4.2†) Utilize the Cramer-Rao lower risk bound (see Proposition 4.37, Exercise 4.22) to upper-bound the level of conservatism $\frac{\mathrm{Opt}}{\mathrm{Risk}_*}$ of the estimates built in item 4.1. Here $\mathrm{Risk}_*$ is the minimax risk in our estimation problem:
$$\mathrm{Risk}_* = \inf_{\hat{g}(\cdot)}\mathrm{Risk}[\hat{g}(\omega^N)] = \inf_{\hat{g}(\cdot)}\sup_{\mu\in\Delta_K}\left[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\left\{|\hat{g}(\omega^N) - g^T\mu|^2\right\}\right]^{1/2},$$
where inf is taken over all estimates.
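3.4.B. Recovering linear images. Now consider the case when $G$ is a general $\nu \times K$ matrix. The analog of the estimate $\hat{g}_\lambda(\cdot)$ is now as follows: with somehow chosen $F_1, ..., F_L$ satisfying (3.84), we select a $\nu \times L$ matrix $\Lambda = [\lambda_{i\ell}]$, set
$$\Phi_\Lambda(\omega) = \Big[\sum_\ell\lambda_{1\ell}F_\ell(\omega); \sum_\ell\lambda_{2\ell}F_\ell(\omega); ...; \sum_\ell\lambda_{\nu\ell}F_\ell(\omega)\Big],$$
and estimate $G\mu$ by
$$\hat{G}_\Lambda(\omega^N) = \frac{1}{N}\sum_{t=1}^N\Phi_\Lambda(\omega_t).$$
5) Prove the following counterpart of the results of item 3.4.A:

Proposition 3.21. The risk of the proposed estimator can be upper-bounded as follows:
$$\begin{array}{rcl}
\mathrm{Risk}[\hat{G}_\Lambda] &:=& \max\limits_{\mu\in\Delta_K}\left[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\left\{\|\hat{G}_\Lambda(\omega^N) - G\mu\|_2^2\right\}\right]^{1/2}\\
&\leq& \mathrm{Risk}(\Lambda) := \max\limits_{k\leq K}\Psi(\Lambda, e_k),\\
\Psi(\Lambda, \mu) &=& \left[\frac{1}{N}\sum\limits_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\left\{\|\Phi_\Lambda(\omega)\|_2^2\right\} + \|[\psi_\Lambda - G]\mu\|_2^2\right]^{1/2}\\
&=& \left[\|[\psi_\Lambda - G]\mu\|_2^2 + \frac{1}{N}\sum\limits_{k=1}^K\mu_k\int\Big[\sum\limits_{i\leq\nu}\Big[\sum_\ell\lambda_{i\ell}F_\ell(\omega)\Big]^2\Big]P_k(d\omega)\right]^{1/2},
\end{array}$$
where
$$\mathrm{Col}_k[\psi_\Lambda] = \mathbf{E}_{\omega\sim P_k(\cdot)}\Phi_\Lambda(\omega) = \left[\begin{array}{c}\int[\sum_\ell\lambda_{1\ell}F_\ell(\omega)]P_k(d\omega)\\ \cdots\\ \int[\sum_\ell\lambda_{\nu\ell}F_\ell(\omega)]P_k(d\omega)\end{array}\right],\quad 1 \leq k \leq K,$$
and $e_1, ..., e_K$ are the standard basic orths in $\mathbf{R}^K$.

Note that exactly the same reasoning as in the case of the scalar $G\mu \equiv g^T\mu$ demonstrates that a reasonable way to select $L$ and $F_\ell$, $\ell = 1, ..., L$, is to set $L = K$ and $F_\ell(\cdot) = \pi_\ell(\cdot)$, $1 \leq \ell \leq L$.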
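3.6 PROOFS

3.6.1 Proof of Proposition 3.3

1°. Observe that $\mathrm{Opt}_{ij}(K)$ is the saddle point value in the convex-concave saddle point problem:
$$\mathrm{Opt}_{ij}(K) = \inf_{\alpha>0,\phi\in\mathcal{F}}\ \max_{x\in X_i,y\in X_j}\left[\tfrac{1}{2}K\alpha\left[\Phi_{\mathcal{O}}(\phi/\alpha; A_i(x)) + \Phi_{\mathcal{O}}(-\phi/\alpha; A_j(y))\right] + \tfrac{1}{2}g^T[y - x] + \alpha\ln(2I/\epsilon)\right].$$
The domain of the maximization variable is compact and the cost function is continuous on its domain, whence, by the Sion-Kakutani Theorem, we also have
$$\begin{array}{rcl}
\mathrm{Opt}_{ij}(K) &=& \max\limits_{x\in X_i,y\in X_j}\Theta_{ij}(x, y),\\
\Theta_{ij}(x, y) &=& \inf\limits_{\alpha>0,\phi\in\mathcal{F}}\left[\tfrac{1}{2}K\alpha\left[\Phi_{\mathcal{O}}(\phi/\alpha; A_i(x)) + \Phi_{\mathcal{O}}(-\phi/\alpha; A_j(y))\right] + \alpha\ln(2I/\epsilon)\right] + \tfrac{1}{2}g^T[y - x].
\end{array} \qquad (3.88)$$
Note that
$$\begin{array}{rcl}
\Theta_{ij}(x, y) &=& \inf\limits_{\alpha>0,\psi\in\mathcal{F}}\left[\tfrac{1}{2}K\alpha\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] + \alpha\ln(2I/\epsilon)\right] + \tfrac{1}{2}g^T[y - x]\\
&=& \inf\limits_{\alpha>0}\left[\tfrac{1}{2}\alpha K\inf\limits_{\psi\in\mathcal{F}}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] + \alpha\ln(2I/\epsilon)\right] + \tfrac{1}{2}g^T[y - x].
\end{array}$$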
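Given $x \in X_i$, $y \in X_j$ and setting $\mu = A_i(x)$, $\nu = A_j(y)$, we obtain
$$\inf_{\psi\in\mathcal{F}}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] = \inf_{\psi\in\mathcal{F}}\left[\ln\Big(\int\exp\{\psi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\psi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\right].$$
Since $\mathcal{O}$ is a good o.s., the function $\bar{\psi}(\omega) = \tfrac{1}{2}\ln(p_\nu(\omega)/p_\mu(\omega))$ belongs to $\mathcal{F}$, and
$$\begin{array}{l}
\inf\limits_{\psi\in\mathcal{F}}\left[\ln\Big(\int\exp\{\psi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\psi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\right]\\
\quad = \inf\limits_{\delta\in\mathcal{F}}\left[\ln\Big(\int\exp\{\bar{\psi}(\omega) + \delta(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\bar{\psi}(\omega) - \delta(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\right]\\
\quad = \inf\limits_{\delta\in\mathcal{F}}\underbrace{\left[\ln\Big(\int\exp\{\delta(\omega)\}\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\delta(\omega)\}\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big)\right]}_{f(\delta)}.
\end{array}$$
Observe that $f(\delta)$ clearly is a convex and even function of $\delta \in \mathcal{F}$; as such, it attains its minimum over $\delta \in \mathcal{F}$ when $\delta = 0$. The bottom line is that
$$\inf_{\psi\in\mathcal{F}}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] = 2\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\Pi(d\omega)\Big), \qquad (3.89)$$
and
$$\begin{array}{rcl}
\Theta_{ij}(x, y) &=& \inf\limits_{\alpha>0}\alpha\left[K\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\Pi(d\omega)\Big) + \ln(2I/\epsilon)\right] + \tfrac{1}{2}g^T[y - x]\\
&=& \left\{\begin{array}{ll}\tfrac{1}{2}g^T[y - x], & K\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\Pi(d\omega)\Big) + \ln(2I/\epsilon) \geq 0,\\ -\infty, & \text{otherwise}.\end{array}\right.
\end{array}$$
This combines with (3.88) to imply that
$$\mathrm{Opt}_{ij}(K) = \max_{x,y}\left\{\tfrac{1}{2}g^T[y - x] : x \in X_i,\ y \in X_j,\ \Big[\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\Pi(d\omega)\Big]^K \geq \frac{\epsilon}{2I}\right\}. \qquad (3.90)$$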
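2°. We claim that under the premise of the proposition, for all $i, j$, $1 \leq i, j \leq I$, one has $\mathrm{Opt}_{ij}(K) \leq \mathrm{Risk}^*_\epsilon(\overline{K})$, implying the validity of (3.13). Indeed, assume that for some pair $i, j$ the opposite inequality holds true, $\mathrm{Opt}_{ij}(K) > \mathrm{Risk}^*_\epsilon(\overline{K})$, and let us lead this assumption to a contradiction. Under our assumption the optimization problem in (3.90) has a feasible solution $(\bar{x}, \bar{y})$ such that
$$r := \tfrac{1}{2}g^T[\bar{y} - \bar{x}] > \mathrm{Risk}^*_\epsilon(\overline{K}), \qquad (3.91)$$
implying, due to the origin of $\mathrm{Risk}^*_\epsilon(\overline{K})$, that there exists an estimate $\tilde{g}(\omega^{\overline{K}})$ such that for $\mu = A_i(\bar{x})$, $\nu = A_j(\bar{y})$ it holds
$$\begin{array}{l}
\mathrm{Prob}_{\omega^{\overline{K}}\sim p_\nu^{\overline{K}}}\left\{\tilde{g}(\omega^{\overline{K}}) \leq \tfrac{1}{2}g^T[\bar{x} + \bar{y}]\right\} \leq \mathrm{Prob}_{\omega^{\overline{K}}\sim p_\nu^{\overline{K}}}\left\{|\tilde{g}(\omega^{\overline{K}}) - g^T\bar{y}| \geq r\right\} \leq \epsilon,\\
\mathrm{Prob}_{\omega^{\overline{K}}\sim p_\mu^{\overline{K}}}\left\{\tilde{g}(\omega^{\overline{K}}) \geq \tfrac{1}{2}g^T[\bar{x} + \bar{y}]\right\} \leq \mathrm{Prob}_{\omega^{\overline{K}}\sim p_\mu^{\overline{K}}}\left\{|\tilde{g}(\omega^{\overline{K}}) - g^T\bar{x}| \geq r\right\} \leq \epsilon.
\end{array}$$
In other words, we can decide on two simple hypotheses stating that observation $\omega^{\overline{K}}$ obeys distribution $p_\mu^{\overline{K}}$ or $p_\nu^{\overline{K}}$, with risk $\leq \epsilon$. Consequently, setting $\Pi^{\overline{K}} = \underbrace{\Pi \times ... \times \Pi}_{\overline{K}}$ and $p_\theta^{\overline{K}}(\omega^{\overline{K}}) = \prod_{k=1}^{\overline{K}}p_\theta(\omega_k)$, we have
$$\int\min\left[p_\mu^{\overline{K}}(\omega^{\overline{K}}), p_\nu^{\overline{K}}(\omega^{\overline{K}})\right]\Pi^{\overline{K}}(d\omega^{\overline{K}}) \leq 2\epsilon.$$
Hence, writing for brevity $p_\mu^{\overline{K}}$, $p_\nu^{\overline{K}}$ for $p_\mu^{\overline{K}}(\omega^{\overline{K}})$, $p_\nu^{\overline{K}}(\omega^{\overline{K}})$ and $d\Pi^{\overline{K}}$ for $\Pi^{\overline{K}}(d\omega^{\overline{K}})$,
$$\begin{array}{rcl}
\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big]^{\overline{K}} &=& \int\sqrt{p_\mu^{\overline{K}}p_\nu^{\overline{K}}}\,d\Pi^{\overline{K}}\\
&=& \int\sqrt{\min\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]}\sqrt{\max\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]}\,d\Pi^{\overline{K}}\\
&\leq& \Big[\int\min\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]d\Pi^{\overline{K}}\Big]^{1/2}\Big[\int\max\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]d\Pi^{\overline{K}}\Big]^{1/2}\\
&=& \Big[\int\min\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]d\Pi^{\overline{K}}\Big]^{1/2}\Big[\int\left[p_\mu^{\overline{K}} + p_\nu^{\overline{K}} - \min\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]\right]d\Pi^{\overline{K}}\Big]^{1/2}\\
&=& \Big[\int\min\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]d\Pi^{\overline{K}}\Big]^{1/2}\Big[2 - \int\min\left[p_\mu^{\overline{K}}, p_\nu^{\overline{K}}\right]d\Pi^{\overline{K}}\Big]^{1/2}\\
&\leq& 2\sqrt{\epsilon(1 - \epsilon)}.
\end{array}$$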
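The chain of inequalities above bounds the Hellinger affinity $\int\sqrt{p_\mu p_\nu}$ via the overlap $\int\min[p_\mu, p_\nu]$. For discrete distributions both sides are finite sums, so the bound can be checked directly. The sketch below is an illustration on probability vectors given as plain lists; the function names are hypothetical.

```python
import math


def affinity(p, q):
    """Hellinger affinity: sum of sqrt(p_i * q_i)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def overlap(p, q):
    """Total mass of the pointwise minimum of p and q."""
    return sum(min(a, b) for a, b in zip(p, q))


def affinity_bound(p, q):
    """Cauchy-Schwarz bound sqrt(m * (2 - m)), where m is the overlap of p and q."""
    m = overlap(p, q)
    return math.sqrt(m * (2.0 - m))
```

When the overlap is at most $2\epsilon$ with $\epsilon < 1/2$, monotonicity of $m \mapsto m(2 - m)$ on $[0, 1]$ gives the final bound $2\sqrt{\epsilon(1 - \epsilon)}$ used in the proof.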
Therefore, for $K$ satisfying (3.12) we have
$$\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big]^K \leq \left[2\sqrt{\epsilon(1 - \epsilon)}\right]^{K/\overline{K}} < \frac{\epsilon}{2I},$$
which is the desired contradiction (recall that $\mu = A_i(\bar{x})$, $\nu = A_j(\bar{y})$ and $(\bar{x}, \bar{y})$ is feasible for (3.90)).

3°. Now let us prove that under the premise of the proposition, (3.14) takes place. To this end, let us set
$$w_{ij}(s) = \max_{x\in X_i,y\in X_j}\Big\{\tfrac{1}{2}g^T[y - x] : \overline{K}\underbrace{\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\Pi(d\omega)\Big)}_{H(x,y)} + s \geq 0\Big\}. \qquad (3.92)$$
As we have seen in item 1° (see (3.89)), one has
$$H(x, y) = \inf_{\psi\in\mathcal{F}}\tfrac{1}{2}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right],$$
that is, $H(x, y)$ is the infimum of a parametric family of concave functions of $(x, y) \in X_i \times X_j$ and as such is concave. Besides this, the optimization problem in (3.92) is feasible whenever $s \geq 0$, a feasible solution being $y = x = x_{ij}$. At this feasible solution we have $g^T[y - x] = 0$, implying that $w_{ij}(s) \geq 0$ for $s \geq 0$. Observe also that from concavity of $H(x, y)$ it follows that $w_{ij}(s)$ is concave on the ray $\{s \geq 0\}$. Finally, we claim that
$$w_{ij}(\bar{s}) \leq \mathrm{Risk}^*_\epsilon(\overline{K}),\quad \bar{s} = -\ln(2\sqrt{\epsilon(1 - \epsilon)}). \qquad (3.93)$$
Indeed, $w_{ij}(s)$ is nonnegative, concave, and bounded (since $X_i$, $X_j$ are compact) on $\mathbf{R}_+$, implying that $w_{ij}(s)$ is continuous on $\{s > 0\}$. Assuming, on the contrary to what we need to prove, that $w_{ij}(\bar{s}) > \mathrm{Risk}^*_\epsilon(\overline{K})$, there exists $s' \in (0, \bar{s})$ such that $w_{ij}(s') > \mathrm{Risk}^*_\epsilon(\overline{K})$ and thus there exist $\bar{x} \in X_i$, $\bar{y} \in X_j$ such that $(\bar{x}, \bar{y})$ is feasible for the optimization problem specifying $w_{ij}(s')$ and (3.91) takes place. We have seen in item 2° that the latter relation implies that for $\mu = A_i(\bar{x})$, $\nu = A_j(\bar{y})$ it holds
$$\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big]^{\overline{K}} \leq 2\sqrt{\epsilon(1 - \epsilon)},$$
that is,
$$\overline{K}\ln\Big(\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big) + \bar{s} \leq 0.$$
Hence,
$$\overline{K}\ln\Big(\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\Pi(d\omega)\Big) + s' < 0,$$
contradicting the feasibility of $(\bar{x}, \bar{y})$ to the optimization problem specifying $w_{ij}(s')$. It remains to note that (3.93) combines with concavity of $w_{ij}(\cdot)$ and the relation $w_{ij}(0) \geq 0$ to imply that
$$w_{ij}(\ln(2I/\epsilon)) \leq \vartheta w_{ij}(\bar{s}) \leq \vartheta\mathrm{Risk}^*_\epsilon(\overline{K}),\quad \text{where } \vartheta = \ln(2I/\epsilon)/\bar{s} = \frac{2\ln(2I/\epsilon)}{\ln([4\epsilon(1 - \epsilon)]^{-1})}.$$
Invoking (3.90), we conclude that
$$\mathrm{Opt}_{ij}(\overline{K}) = w_{ij}(\ln(2I/\epsilon)) \leq \vartheta\mathrm{Risk}^*_\epsilon(\overline{K})\quad \forall i, j.$$
Finally, from (3.90) it immediately follows that $\mathrm{Opt}_{ij}(K)$ is nonincreasing in $K$ (as $K$ grows, the feasible set of the optimization problem in (3.90) shrinks), so that for $K \geq \overline{K}$ we have
$$\mathrm{Opt}(K) \leq \mathrm{Opt}(\overline{K}) = \max_{i,j}\mathrm{Opt}_{ij}(\overline{K}) \leq \vartheta\mathrm{Risk}^*_\epsilon(\overline{K}),$$
and (3.14) follows. ✷

3.6.2 Verifying 1-convexity of the conditional quantile
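Let $r$ be a nonvanishing probability distribution on $S$, and let
$$F_m(r) = \sum_{i=1}^mr_i,\quad 1 \leq m \leq M,$$
so that $0 < F_1(r) < F_2(r) < ... < F_M(r) = 1$. Denoting by $\mathcal{P}$ the set of all nonvanishing probability distributions on $S$, observe that for every $r \in \mathcal{P}$, $\chi_\alpha[r]$ is a piecewise linear function of $\alpha \in [0, 1]$ with breakpoints $0, F_1(r), F_2(r), F_3(r), ..., F_M(r)$, the values of the function at these breakpoints being $s_1, s_1, s_2, s_3, ..., s_M$. In particular, this function is equal to $s_1$ on $[0, F_1(r)]$ and is strictly increasing on $[F_1(r), 1]$. Now let $s \in \mathbf{R}$, and let
$$\mathcal{P}_\alpha^\leq[s] = \{r \in \mathcal{P} : \chi_\alpha[r] \leq s\},\quad \mathcal{P}_\alpha^\geq[s] = \{r \in \mathcal{P} : \chi_\alpha[r] \geq s\}.$$
Observe that the just introduced sets are cut off $\mathcal{P}$ by nonstrict linear inequalities, specifically,
• when $s < s_1$, we have $\mathcal{P}_\alpha^\leq[s] = \emptyset$, $\mathcal{P}_\alpha^\geq[s] = \mathcal{P}$;
• when $s = s_1$, we have $\mathcal{P}_\alpha^\leq[s] = \{r \in \mathcal{P} : F_1(r) \geq \alpha\}$, $\mathcal{P}_\alpha^\geq[s] = \mathcal{P}$;
• when $s > s_M$, we have $\mathcal{P}_\alpha^\leq[s] = \mathcal{P}$, $\mathcal{P}_\alpha^\geq[s] = \emptyset$;
• when $s_1 < s \leq s_M$, for every $r \in \mathcal{P}$ the equation $\chi_\gamma[r] = s$ in variable $\gamma \in [0, 1]$ has exactly one solution $\gamma(r)$ which can be found as follows: we specify $k = k_s \in \{1, ..., M - 1\}$ such that $s_k < s \leq s_{k+1}$ and set
$$\gamma(r) = \frac{(s_{k+1} - s)F_k(r) + (s - s_k)F_{k+1}(r)}{s_{k+1} - s_k}.$$
Since $\chi_\alpha[r]$ is strictly increasing in $\alpha$ when $\alpha \in [F_1(r), 1]$, for $s \in (s_1, s_M]$ we have
$$\begin{array}{rcl}
\mathcal{P}_\alpha^\leq[s] = \{r \in \mathcal{P} : \alpha \leq \gamma(r)\} &=& \left\{r \in \mathcal{P} : \dfrac{(s_{k+1} - s)F_k(r) + (s - s_k)F_{k+1}(r)}{s_{k+1} - s_k} \geq \alpha\right\},\\
\mathcal{P}_\alpha^\geq[s] = \{r \in \mathcal{P} : \alpha \geq \gamma(r)\} &=& \left\{r \in \mathcal{P} : \dfrac{(s_{k+1} - s)F_k(r) + (s - s_k)F_{k+1}(r)}{s_{k+1} - s_k} \leq \alpha\right\}.
\end{array}$$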
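The piecewise linear description of $\chi_\alpha[r]$ and the formula for $\gamma(r)$ can be illustrated with a short sketch. It takes the breakpoint description given above as the definition of $\chi_\alpha[r]$ (an assumption, since the original definition appears earlier in the chapter); the names `chi` and `gamma` and the 0-based indexing are choices made for this illustration.

```python
from itertools import accumulate


def chi(alpha, r, s):
    """Piecewise linear in alpha with breakpoints 0, F_1, ..., F_M
    and values s_1, s_1, s_2, ..., s_M at these breakpoints."""
    F = list(accumulate(r))
    if alpha <= F[0]:
        return s[0]
    for k in range(len(r) - 1):
        if alpha <= F[k + 1]:
            t = (alpha - F[k]) / (F[k + 1] - F[k])
            return (1 - t) * s[k] + t * s[k + 1]
    return s[-1]


def gamma(level, r, s):
    """Solution of chi_gamma[r] = level for s_1 < level <= s_M."""
    F = list(accumulate(r))
    # 0-based k with s[k] < level <= s[k + 1]
    k = next(i for i in range(len(s) - 1) if s[i] < level <= s[i + 1])
    return ((s[k + 1] - level) * F[k] + (level - s[k]) * F[k + 1]) / (s[k + 1] - s[k])
```

Since `gamma` is linear in the partial sums $F_k(r)$, and hence in $r$, the sets $\{r : \alpha \leq \gamma(r)\}$ and $\{r : \alpha \geq \gamma(r)\}$ are cut off by linear inequalities, which is the point of the argument.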
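As an immediate consequence of this description, given $\alpha \in [0, 1]$ and $\tau \in T$ and setting
$$G_{\tau,\mu}(p) = \sum_{\iota=1}^\mu p(\iota, \tau),\quad 1 \leq \mu \leq M,$$
and
$$\mathcal{X}^{s,\leq} = \{p(\cdot, \cdot) \in \mathcal{X} : \chi_\alpha[p_\tau] \leq s\},\quad \mathcal{X}^{s,\geq} = \{p(\cdot, \cdot) \in \mathcal{X} : \chi_\alpha[p_\tau] \geq s\},$$
we get
$$\begin{array}{rcl}
s < s_1 &\Rightarrow& \mathcal{X}^{s,\leq} = \emptyset,\ \mathcal{X}^{s,\geq} = \mathcal{X},\\
s = s_1 &\Rightarrow& \mathcal{X}^{s,\leq} = \{p \in \mathcal{X} : G_{\tau,1}(p) \geq \alpha G_{\tau,M}(p)\},\ \mathcal{X}^{s,\geq} = \mathcal{X},\\
s > s_M &\Rightarrow& \mathcal{X}^{s,\leq} = \mathcal{X},\ \mathcal{X}^{s,\geq} = \emptyset,\\
s_1 < s \leq s_M &\Rightarrow& \left\{\begin{array}{l}
\mathcal{X}^{s,\leq} = \left\{p \in \mathcal{X} : \dfrac{(s_{k+1} - s)G_{\tau,k}(p) + (s - s_k)G_{\tau,k+1}(p)}{s_{k+1} - s_k} \geq \alpha G_{\tau,M}(p)\right\},\\
\mathcal{X}^{s,\geq} = \left\{p \in \mathcal{X} : \dfrac{(s_{k+1} - s)G_{\tau,k}(p) + (s - s_k)G_{\tau,k+1}(p)}{s_{k+1} - s_k} \leq \alpha G_{\tau,M}(p)\right\},\\
k = k_s : s_k < s \leq s_{k+1},
\end{array}\right.
\end{array}$$
implying 1-convexity of the conditional quantile on $\mathcal{X}$ (recall that $G_{\tau,\mu}(p)$ are linear in $p$). ✷

3.6.3 Proof of Proposition 3.4

3.6.3.1 Proof of Proposition 3.4.i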
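We call step $\ell$ essential if at this step rule 2d is invoked.

1°. Let $x \in X$ be the true signal underlying the observation $\bar{\omega}^K$, so that $\bar{\omega}_1, ..., \bar{\omega}_K$ are drawn from the distribution $p_{A(x)}$ independently of each other. Consider the “ideal” estimate given by exactly the same rules as the Bisection procedure in Section 3.2.4.2 (in the sequel, we refer to the latter as the “true” one), with tests $T^K_{\Delta_{\ell,\mathrm{rg}},\mathrm{r}}(\cdot)$, $T^K_{\Delta_{\ell,\mathrm{lf}},\mathrm{l}}(\cdot)$ in rule 2d replaced with the “ideal tests”
$$\hat{T}_{\Delta_{\ell,\mathrm{rg}},\mathrm{r}} = \hat{T}_{\Delta_{\ell,\mathrm{lf}},\mathrm{l}} = \left\{\begin{array}{ll}\text{right}, & f(x) > c_\ell,\\ \text{left}, & f(x) \leq c_\ell.\end{array}\right.$$
Marking by $*$ the entities produced by the resulting fully deterministic procedure, we arrive at the sequence of nested segments $\Delta^*_\ell = [a^*_\ell, b^*_\ell]$, $0 \leq \ell \leq L^* \leq L$, along with subsegments $\Delta^*_{\ell,\mathrm{rg}} = [c^*_\ell, v^*_\ell]$, $\Delta^*_{\ell,\mathrm{lf}} = [u^*_\ell, c^*_\ell]$ of $\Delta^*_{\ell-1}$, defined for all $*$-essential values of $\ell$, and the output segment $\bar{\Delta}^*$ claimed to contain $f(x)$. Note that the ideal procedure cannot terminate due to arriving at a disagreement, and that $f(x)$, as is immediately seen, is contained in all segments $\Delta^*_\ell$, $0 \leq \ell \leq L^*$, just as $f(x) \in \bar{\Delta}^*$. Let $\mathcal{L}^*$ be the set of all $*$-essential values of $\ell$. For $\ell \in \mathcal{L}^*$, let the event $\mathcal{E}_\ell[x]$ parameterized by $x$ be defined as follows:
$$\mathcal{E}_\ell[x] = \left\{\begin{array}{ll}
\left\{\omega^K : T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\omega^K) = \text{right or } T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\omega^K) = \text{right}\right\}, & f(x) \leq u^*_\ell,\\
\left\{\omega^K : T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\omega^K) = \text{right}\right\}, & u^*_\ell < f(x) \leq c^*_\ell,\\
\left\{\omega^K : T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\omega^K) = \text{left}\right\}, & c^*_\ell < f(x) < v^*_\ell,\\
\left\{\omega^K : T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\omega^K) = \text{left or } T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\omega^K) = \text{left}\right\}, & f(x) \geq v^*_\ell.
\end{array}\right. \qquad (3.94)$$
2°. Observe that by construction and in view of Proposition 2.27 we have
$$\forall\ell \in \mathcal{L}^* : \mathrm{Prob}_{\omega^K\sim p_{A(x)}\times...\times p_{A(x)}}\{\mathcal{E}_\ell[x]\} \leq 2\delta. \qquad (3.95)$$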
Indeed, let $\ell\in L^*$.
• When $f(x)\le u^*_\ell$, we have $x\in X$ and $f(x)\le u^*_\ell\le c^*_\ell$, implying that $E_\ell[x]$ takes place only when either the left test $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}$ or the right test $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}$, or both, accept wrong (right) hypotheses from the pairs of right and left hypotheses. Since the corresponding intervals ($[u^*_\ell,c^*_\ell]$ for the left side test, $[c^*_\ell,v^*_\ell]$ for the right side one) are $\delta$-good left and right, respectively, the risks of the tests do not exceed $\delta$, and the $p_{A(x)}$-probability of the event $E_\ell[x]$ is at most $2\delta$;
• when $u^*_\ell<f(x)\le c^*_\ell$, the event $E_\ell[x]$ takes place only when the right side test $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}$ accepts the wrong (right) hypothesis from the pair; as above, this can happen with $p_{A(x)}$-probability at most $\delta$;
• when $c_\ell<f(x)\le v_\ell$, the event $E_\ell[x]$ takes place only if the left test $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}$ accepts the wrong (left) hypothesis from the pair to which it was applied, which again happens with $p_{A(x)}$-probability $\le\delta$;
• finally, when $f(x)>v_\ell$, the event $E_\ell[x]$ takes place only when either the left side test $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}$ or the right side test $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}$, or both, accept wrong (left) hypotheses from the pairs; as above, this can happen with $p_{A(x)}$-probability at most $2\delta$.

3°. Let $\bar L=\bar L(\bar\omega^K)$ be the last step of the true estimating procedure as run on the observation $\bar\omega^K$. We claim that the following holds true:

(!) Let $E:=\bigcup_{\ell\in L^*}E_\ell[x]$, so that the $p_{A(x)}$-probability of the event $E$, the observations stemming from $x$, is at most $2\delta L=\epsilon$ (see (3.17), (3.95)). Assume that $\bar\omega^K\notin E$. Then $\bar L(\bar\omega^K)\le L^*$, and only two cases are possible:
A. The true estimating procedure does not terminate due to arriving at a disagreement. In this case $L^*=\bar L(\bar\omega^K)$ and the trajectories of the ideal and the true procedures are identical (same localizers and essential steps, same output segments, etc.), and, in particular, $f(x)\in\bar\Delta$, or
B. The true estimating procedure terminates due to arriving at a disagreement. Then $\Delta_\ell=\Delta^*_\ell$ for $\ell<\bar L$, and $f(x)\in\bar\Delta$.

In view of A and B the $p_{A(x)}$-probability of the event $f(x)\in\bar\Delta$ is at least $1-\epsilon$, as claimed in Proposition 3.4.

To prove (!), note that the actions at step $\ell$ in ideal and true procedures depend solely on $\Delta_{\ell-1}$ and on the outcome of rule 2d. Taking into account that $\Delta_0=\Delta^*_0$, all we need to verify is the following claim:

(!!) Let $\bar\omega^K\notin E$, and let $\ell\le L^*$ be such that $\Delta_{\ell-1}=\Delta^*_{\ell-1}$, whence also $u_\ell=u^*_\ell$, $c_\ell=c^*_\ell$, and $v_\ell=v^*_\ell$. Assume that $\ell$ is essential (given that $\Delta_{\ell-1}=\Delta^*_{\ell-1}$, this may happen if and only if $\ell$ is $*$-essential as well). Then either
C. At step $\ell$ the true procedure terminates due to disagreement, in which case $f(x)\in\bar\Delta$, or
D. At step $\ell$ there was no disagreement, in which case $\Delta_\ell$ as given by (3.16) is identical to $\Delta^*_\ell$ as given by the ideal counterpart of (3.16) in the case of $\Delta^*_{\ell-1}=\Delta_{\ell-1}$, that is, by the rule
$$\Delta^*_\ell=\begin{cases}[c_\ell,b_{\ell-1}], & f(x)>c_\ell,\\ [a_{\ell-1},c_\ell], & f(x)\le c_\ell.\end{cases}\tag{3.96}$$
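To verify (!!), let $\bar\omega^K$ and $\ell$ satisfy the premise of (!!). Note that due to $\Delta_{\ell-1}=\Delta^*_{\ell-1}$ we have $u_\ell=u^*_\ell$, $c_\ell=c^*_\ell$, and $v_\ell=v^*_\ell$, and thus also $\Delta^*_{\ell,\mathrm{lf}}=\Delta_{\ell,\mathrm{lf}}$, $\Delta^*_{\ell,\mathrm{rg}}=\Delta_{\ell,\mathrm{rg}}$. Consider first the case when the true estimation procedure terminates by disagreement at step $\ell$, so that
$$T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)\ne T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K).$$
Assuming that $f(x)<u_\ell=u^*_\ell$, the relation $\bar\omega^K\notin E_\ell[x]$ combines with (3.94) to imply that $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\text{left}$, which under disagreement is impossible. Assuming $f(x)>v_\ell=v^*_\ell$, the same argument results in $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\text{right}$, which again is impossible. We conclude that in the case in question $u_\ell\le f(x)\le v_\ell$, i.e., $f(x)\in\bar\Delta$, as claimed in C. C is proved.

Now, suppose that there was a consensus at step $\ell$ in the true estimating procedure. Because $\bar\omega^K\notin E_\ell[x]$ this can happen in the following four cases:
1. $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=\text{left}$ and $f(x)\le u_\ell=u^*_\ell$,
2. $T^K_{\Delta^*_{\ell,\mathrm{rg}},\mathrm{r}}(\bar\omega^K)=\text{left}$ and $u_\ell<f(x)\le c_\ell=c^*_\ell$,
3. $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\text{right}$ and $c_\ell<f(x)<v_\ell=v^*_\ell$,
4. $T^K_{\Delta^*_{\ell,\mathrm{lf}},\mathrm{l}}(\bar\omega^K)=\text{right}$ and $v_\ell\le f(x)$.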
Due to consensus at step ℓ, in situations 1 and 2 (3.16) says that ∆ℓ = [aℓ−1 , cℓ ], which combines with (3.96) and vℓ = vℓ∗ to imply that ∆ℓ = ∆∗ℓ . Similarly, in situations 3 and 4, due to consensus at step ℓ, (3.16) implies that ∆ℓ = [cℓ , bℓ−1 ], which combines with uℓ = u∗ℓ and (3.96) to imply that ∆ℓ = ∆∗ℓ . D is proved. ✷
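The ideal update rule (3.96) can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the definition of $c_\ell$ lies outside this section, and the sketch takes $c_\ell$ to be the midpoint of the current localizer $[a_{\ell-1},b_{\ell-1}]$. The function name `ideal_bisection` is hypothetical.

```python
def ideal_bisection(f_x, a0, b0, steps):
    """Apply the ideal rule (3.96) `steps` times, using exact knowledge of f(x).

    Assumption (not taken from this section): c_l is the midpoint of the
    current localizer [a_{l-1}, b_{l-1}].
    """
    a, b = a0, b0
    for _ in range(steps):
        c = 0.5 * (a + b)      # hypothetical choice of c_l
        if f_x > c:            # first branch of (3.96): Delta_l = [c_l, b_{l-1}]
            a = c
        else:                  # second branch of (3.96): Delta_l = [a_{l-1}, c_l]
            b = c
    return a, b


a, b = ideal_bisection(f_x=0.3, a0=-1.0, b0=1.0, steps=10)
print(a <= 0.3 <= b, b - a)
```

Under the midpoint assumption every step halves the localizer while keeping $f(x)$ inside it, which is the behavior the true procedure reproduces outside of the event $E$.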
3.6.3.2 Proof of Proposition 3.4.ii
There is nothing to prove when $\frac{b_0-a_0}{2}\le\widehat\rho$, since in this case the estimate $\frac{a_0+b_0}{2}$, which does not use observations at all, is $(\widehat\rho,0)$-reliable. From now on we assume that $b_0-a_0>2\widehat\rho$, implying that $L$ is a positive integer.
1°. Observe, first, that if $a$ and $b$ are such that $a$ is lower-feasible, $b$ is upper-feasible, and $b-a>2\rho$, then for every $i\le I_{b,\ge}$ and $j\le I_{a,\le}$ there exists a test, based on $\bar K$ observations, which decides upon the hypotheses $H_1$, $H_2$, stating that the observations are drawn from $p_{A(x)}$ with $x\in Z_i^{b,\ge}$ ($H_1$) or with $x\in Z_j^{a,\le}$ ($H_2$), with risk at most $\epsilon$. Indeed, it suffices to consider the test which accepts $H_1$ and rejects $H_2$ when $\widehat f(\omega^{\bar K})\ge\frac{a+b}{2}$, and accepts $H_2$ and rejects $H_1$ otherwise.
2°. With parameters of Bisection chosen according to (3.19), by the already proved Proposition 3.4.i, we have

E. For every $x\in X$, the $p_{A(x)}$-probability of the event $f(x)\in\bar\Delta$, $\bar\Delta$ being the output segment of our Bisection, is at least $1-\epsilon$.
3°. We claim also that
F.1. Every segment $\Delta=[a,b]$ with $b-a>2\rho$ and lower-feasible $a$ is $\delta$-good (right),
F.2. Every segment $\Delta=[a,b]$ with $b-a>2\rho$ and upper-feasible $b$ is $\delta$-good (left),
F.3. Every $\kappa$-maximal $\delta$-good (left or right) segment has length at most $2\rho+\kappa=\widehat\rho$. As a result, for every essential step $\ell$, the lengths of the segments $\Delta_{\ell,\mathrm{rg}}$ and $\Delta_{\ell,\mathrm{lf}}$ do not exceed $\widehat\rho$.

Let us verify F.1 (verification of F.2 is completely similar, and F.3 is an immediate consequence of the definitions and F.1-2). Let $[a,b]$ satisfy the premise of F.1. It may happen that $b$ is upper-infeasible, whence $\Delta=[a,b]$ is 0-good (right), and we are done. Now let $b$ be upper-feasible. As we have already seen, whenever $i\le I_{b,\ge}$ and $j\le I_{a,\le}$, the hypotheses stating that $\omega_k$ are sampled from $p_{A(x)}$ for some $x\in Z_i^{b,\ge}$ and for some $x\in Z_j^{a,\le}$, respectively, can be decided upon with risk $\le\epsilon$, implying, as in the proof of Proposition 2.25, that
$$\epsilon_{ij\Delta}\le\left[2\sqrt{\epsilon(1-\epsilon)}\right]^{1/\bar K}.$$
Hence, taking into account that the column and the row sizes of $E_{\Delta,\mathrm{r}}$ do not exceed $NI$,
$$\sigma_{\Delta,\mathrm{r}}\le NI\max_{i,j}\epsilon^K_{ij\Delta}\le NI\left[2\sqrt{\epsilon(1-\epsilon)}\right]^{K/\bar K}\le\frac{\epsilon}{2L}=\delta$$
(we have used (3.19)), that is, $\Delta$ indeed is $\delta$-good (right).
4°. Let us fix $x\in X$ and consider a trajectory of Bisection, the observation being drawn from $p_{A(x)}$. The output $\bar\Delta$ of the procedure is given by one of the following options:
1. At some step $\ell$ of Bisection, the process terminated according to rules in 2b or 2c. In the first case, the segment $[c_\ell,b_{\ell-1}]$ has lower-feasible left endpoint and is not $\delta$-good (right), implying by F.1 that the length of this segment (which is half the length of $\bar\Delta=\Delta_{\ell-1}$) is $\le2\rho$, so that the length $|\bar\Delta|$ of $\bar\Delta$ is at most $4\rho\le2\widehat\rho$. The same conclusion, by a completely similar argument, holds true if the process terminated at step $\ell$ according to rule 2c.
2. At some step $\ell$ of Bisection, the process terminated due to disagreement. In this case, by F.3, we have $|\bar\Delta|\le2\widehat\rho$.
3. Bisection terminated at step $L$, and $\bar\Delta=\Delta_L$. In this case, termination clauses in rules 2b, 2c, and 2d were never invoked, clearly implying that $|\Delta_s|\le|\Delta_{s-1}|/2$, $1\le s\le L$, and thus $|\bar\Delta|=|\Delta_L|\le2^{-L}|\Delta_0|\le2\widehat\rho$ (see (3.19)).

Thus, we have $|\bar\Delta|\le2\widehat\rho$, implying that whenever the signal $x\in X$ underlying the observations and the output segment $\bar\Delta$ are such that $f(x)\in\bar\Delta$, the error of the Bisection estimate (which is the midpoint of $\bar\Delta$) is at most $\widehat\rho$. Invoking E, we conclude that the Bisection estimate is $(\widehat\rho,\epsilon)$-reliable. ✷

3.6.4 Proof of Proposition 3.14
Let us fix $\epsilon\in(0,1)$. Setting
$$\rho_K=\tfrac12\left[\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\right]$$
and invoking Corollary 3.13, all we need to prove is that in the case of A.1-3 one has
$$\limsup_{K\to\infty}\left[\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\right]\le0.\tag{3.97}$$
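To this end, note that in our current situation, (3.48) and (3.52) simplify to
$$\begin{array}{rcl}
\Phi(h,H;Z)&=&-\tfrac12\ln\mathrm{Det}\big(I-\Theta_*^{1/2}H\Theta_*^{1/2}\big)+\tfrac12\mathrm{Tr}\Big(Z\underbrace{B^T\Big[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\Big]B}_{Q(h,H)}\Big),\\[2mm]
\widehat\Psi_{+,K}(h,H)&=&\inf\limits_{\alpha}\max\limits_{Z\in\mathcal Z}\Big\{\alpha\Phi(h/\alpha,H/\alpha;Z)-\mathrm{Tr}(QZ)+K^{-1}\alpha\ln(2/\epsilon):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\},\\[2mm]
\widehat\Psi_{-,K}(h,H)&=&\inf\limits_{\alpha}\max\limits_{Z\in\mathcal Z}\Big\{\alpha\Phi(-h/\alpha,-H/\alpha;Z)+\mathrm{Tr}(QZ)+K^{-1}\alpha\ln(2/\epsilon):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}.
\end{array}$$
Hence
$$\begin{array}{l}
\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\\[1mm]
\quad\le\inf\limits_{\alpha}\max\limits_{Z_1,Z_2\in\mathcal Z}\Big\{\alpha\Phi(\bar h/\alpha,\bar H/\alpha;Z_1)-\mathrm{Tr}(QZ_1)+\alpha\Phi(-\bar h/\alpha,-\bar H/\alpha;Z_2)+\mathrm{Tr}(QZ_2)+2K^{-1}\alpha\ln(2/\epsilon):\\
\qquad\qquad\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\[2mm]
\quad=\inf\limits_{\alpha}\max\limits_{Z_1,Z_2\in\mathcal Z}\Big\{-\tfrac12\alpha\ln\mathrm{Det}\big(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big)+2K^{-1}\alpha\ln(2/\epsilon)+\mathrm{Tr}(Q[Z_2-Z_1])\\
\qquad\qquad+\tfrac12\alpha\mathrm{Tr}\big(Z_1Q(\bar h/\alpha,\bar H/\alpha)\big)+\tfrac12\alpha\mathrm{Tr}\big(Z_2Q(-\bar h/\alpha,-\bar H/\alpha)\big):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\[2mm]
\quad=\inf\limits_{\alpha}\max\limits_{Z_1,Z_2\in\mathcal Z}\Big\{-\tfrac12\alpha\ln\mathrm{Det}\big(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big)+2K^{-1}\alpha\ln(2/\epsilon)\\
\qquad\qquad+\tfrac12\mathrm{Tr}\big(Z_1B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}-\bar H]^{-1}[\bar H,\bar h]B\big)+\tfrac12\mathrm{Tr}\big(Z_2B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}+\bar H]^{-1}[\bar H,\bar h]B\big)\\
\qquad\qquad+\underbrace{\mathrm{Tr}(Q[Z_2-Z_1])+\tfrac12\mathrm{Tr}\Big([Z_1-Z_2]B^T\begin{bmatrix}\bar H&\bar h\\ \bar h^T&\end{bmatrix}B\Big)}_{T(Z_1,Z_2)}:\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Big\}.
\end{array}\tag{3.98}$$
By (3.57) we have
$$\tfrac12B^T\begin{bmatrix}\bar H&\bar h\\ \bar h^T&\end{bmatrix}B=B^T[C^TQC+J]B,$$
where the only nonzero entry, if any, in the $(d+1)\times(d+1)$ matrix $J$ is in the cell $(d+1,d+1)$. By definition of $B$ (see (3.48)) the only nonzero element, if any, in $\bar J=B^TJB$ is in the cell $(m+1,m+1)$, and we conclude that
$$\tfrac12B^T\begin{bmatrix}\bar H&\bar h\\ \bar h^T&\end{bmatrix}B=(CB)^TQ(CB)+\bar J=Q+\bar J$$
(recall that $CB=I_{m+1}$). Now, when $Z_1,Z_2\in\mathcal Z$, the entries of $Z_1$, $Z_2$ in the cell $(m+1,m+1)$ both are equal to 1, whence
$$\tfrac12\mathrm{Tr}\Big([Z_1-Z_2]B^T\begin{bmatrix}\bar H&\bar h\\ \bar h^T&\end{bmatrix}B\Big)=\mathrm{Tr}([Z_1-Z_2]Q)+\mathrm{Tr}([Z_1-Z_2]\bar J)=\mathrm{Tr}([Z_1-Z_2]Q),$$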
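implying that the quantity $T(Z_1,Z_2)$ in (3.98) is zero, provided $Z_1,Z_2\in\mathcal Z$. Consequently, (3.98) becomes
$$\begin{array}{l}
\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\\[1mm]
\quad\le\inf\limits_{\alpha}\max\limits_{Z_1,Z_2\in\mathcal Z}\Big\{-\tfrac12\alpha\ln\mathrm{Det}\big(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big)+2K^{-1}\alpha\ln(2/\epsilon)\\
\qquad\qquad+\tfrac12\mathrm{Tr}\big(Z_1B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}-\bar H]^{-1}[\bar H,\bar h]B\big)\\
\qquad\qquad+\tfrac12\mathrm{Tr}\big(Z_2B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}+\bar H]^{-1}[\bar H,\bar h]B\big):\ \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha\Theta_*^{-1}\Big\}.
\end{array}\tag{3.99}$$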
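Now, for an appropriately selected real $c$ independent of $K$, for $\alpha$ allowed by (3.99), and all $Z_1,Z_2\in\mathcal Z$ we have (recall that $\mathcal Z$ is bounded)
$$\tfrac12\mathrm{Tr}\big(Z_1B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}-\bar H]^{-1}[\bar H,\bar h]B\big)+\tfrac12\mathrm{Tr}\big(Z_2B^T[\bar H,\bar h]^T[\alpha\Theta_*^{-1}+\bar H]^{-1}[\bar H,\bar h]B\big)\le c/\alpha,$$
along with
$$-\tfrac12\alpha\ln\mathrm{Det}\big(I-[\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big)\le c/\alpha.$$
Therefore, given $\delta>0$, we can find $\alpha=\alpha_\delta>0$ large enough to ensure that
$$-\gamma\alpha_\delta\Theta_*^{-1}\preceq\bar H\preceq\gamma\alpha_\delta\Theta_*^{-1}\ \text{ and }\ 2c/\alpha_\delta\le\delta,$$
which combines with (3.99) to imply that
$$\widehat\Psi_{+,K}(\bar h,\bar H)+\widehat\Psi_{-,K}(\bar h,\bar H)\le\delta+2K^{-1}\alpha_\delta\ln(2/\epsilon).$$
Since $\alpha_\delta$ does not depend on $K$, the right-hand side tends to $\delta$ as $K\to\infty$; $\delta>0$ being arbitrary, (3.97) follows. ✷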
Chapter Four

Signal Recovery by Linear Estimation

OVERVIEW

In this chapter we consider several variations of one of the most basic problems of high-dimensional statistics: signal recovery. In its simplest form the problem is as follows: given a positive definite $m\times m$ matrix $\Gamma$, an $m\times n$ matrix $A$, a $\nu\times n$ matrix $B$, and an indirect noisy observation
$$\omega=Ax+\xi\qquad[\xi\sim\mathcal N(0,\Gamma)]\tag{4.1}$$
of an unknown “signal” $x$ known to belong to a given convex compact subset $\mathcal X$ of $\mathbf R^n$, we want to recover the linear image $Bx\in\mathbf R^\nu$ of $x$. We focus first on the case where the quality of a candidate recovery $\omega\mapsto\widehat x(\omega)$ is quantified by its worst-case, over $x\in\mathcal X$, expected $\|\cdot\|_2^2$-error, that is, by the risk
$$\mathrm{Risk}[\widehat x(\cdot)|\mathcal X]=\sup_{x\in\mathcal X}\sqrt{\mathbf E_{\xi\sim\mathcal N(0,\Gamma)}\left\{\|\widehat x(Ax+\xi)-Bx\|_2^2\right\}}.\tag{4.2}$$
The simplest and the most studied type of recovery is an affine one: $\widehat x(\omega)=H^T\omega+h$; assuming $\mathcal X$ to be symmetric w.r.t. the origin, we lose nothing when passing from affine estimates to linear ones, those of the form $\widehat x_H(\omega)=H^T\omega$. An advantage of linear estimates is that under favorable circumstances (e.g., when $\mathcal X$ is an ellipsoid), minimizing risk over linear estimates is an efficiently solvable problem, and there exists a huge body of literature on linear estimates that are optimal in terms of their risk (see, e.g., [6, 57, 82, 155, 156, 197, 206, 207] and references therein). Moreover, in the case of signal recovery from direct observations in white Gaussian noise (the case of $B=A=I_n$, $\Gamma=\sigma^2I_n$), there is a huge body of results on near-optimality of properly selected linear estimates among all possible recovery routines; see, e.g., [79, 88, 106, 124, 198, 230, 239] and references therein. A typical result of this type states that when recovering $x\in\mathcal X$ from direct observation $\omega=x+\sigma\xi$, $\xi\sim\mathcal N(0,I_n)$, where $\mathcal X$ is an ellipsoid of the form
$$\Big\{x\in\mathbf R^n:\ \sum_jj^{2\alpha}x_j^2\le L^2\Big\},$$
or the box
$$\{x\in\mathbf R^n:\ j^\alpha|x_j|\le L,\ j\le n\},$$
with fixed $L<\infty$ and $\alpha>0$, the ratio of the risk of a properly selected linear estimate to the minimax risk
$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]:=\inf_{\widehat x(\cdot)}\mathrm{Risk}[\widehat x|\mathcal X]\tag{4.3}$$
(the infimum is taken over all estimates, not necessarily linear) remains bounded, or even tends to 1, as $\sigma\to+0$, and this happens uniformly in $n$, $\alpha$ and $L$ being fixed.
Similar “near-optimality” results are known for the “diagonal” case, where $\mathcal X$ is an ellipsoid/box and $A$, $B$, $\Gamma$ are diagonal matrices. To the best of our knowledge, the only “general” (that is, not imposing severe restrictions on how the geometries of $\mathcal X$, $A$, $B$, $\Gamma$ are linked to each other) result on optimality of linear estimates is due to D. Donoho, who proved [64] that when recovering a linear form (i.e., in the case of one-dimensional $Bx$), the best risk over all linear estimates is within the factor 1.2 of the minimax risk.

The primary goal of this chapter is to establish rather general results on near-optimality of properly built linear estimates as compared to all possible estimates. Results of this type are bound to impose some restrictions on $\mathcal X$, since there are cases (e.g., the case of a high-dimensional $\|\cdot\|_1$-ball $\mathcal X$) where linear estimates are by far nonoptimal. Our restrictions on $\mathcal X$ reduce to the existence of a special type representation of $\mathcal X$ and are satisfied, e.g., when $\mathcal X$ is the intersection of $K<\infty$ ellipsoids/elliptic cylinders,
$$\mathcal X=\{x\in\mathbf R^n:\ x^TR_kx\le1,\ 1\le k\le K\}\qquad\Big[R_k\succeq0,\ \sum_kR_k\succ0\Big];\tag{4.4}$$
in particular, $\mathcal X$ can be a symmetric w.r.t. the origin compact polytope given by $2K$ linear inequalities $-1\le r_k^Tx\le1$, $1\le k\le K$, or, equivalently, $\mathcal X=\{x:\ x^T\underbrace{(r_kr_k^T)}_{R_k}x\le1,\ 1\le k\le K\}$. Another instructive example is a set of the form $\mathcal X=\{x:\ \|Sx\|_p\le L\}$, where $p\ge2$ and $S$ is a matrix with trivial kernel. It should be stressed that while imposing some restrictions on $\mathcal X$, we require nothing from $A$, $B$, and $\Gamma$, aside from positive definiteness of the latter matrix.

Our main result (Proposition 4.5) states, in particular, that with $\mathcal X$ given by (4.4) and with arbitrary $A$ and $B$, the risk of a properly selected linear estimate $\widehat x_{H_*}$, with both $H_*$ and the risk efficiently computable, satisfies the bound
$$\mathrm{Risk}[\widehat x_{H_*}|\mathcal X]\le O(1)\sqrt{\ln(K+1)}\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal X],\tag{$*$}$$
where $\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ is the minimax risk, and $O(1)$ is an absolute constant. Note that the outlined result is an “operational” one: the risk of the provably nearly optimal estimate and the estimate itself are given by efficient computation. This is in sharp contrast with traditional results of nonparametric statistics, where near-optimal estimates and their risks are given in a “closed analytical form,” at the price of severe restrictions on the structure of the “data” $\mathcal X$, $A$, $B$, $\Gamma$. This being said, it should be stressed that one of the crucial components in our construction is quite classical: this is the idea, going back to M.S. Pinsker [198], of bounding from below the minimax risk via the Bayesian risk associated with a properly selected Gaussian prior.$^1$

The main body of the chapter originates from [138, 137] and is organized as follows.
• Section 4.1 presents basic results on Conic Programming and Conic Duality, the principal optimization tools utilized in all subsequent constructions and proofs.
• Section 4.2 contains problem formulation (Section 4.2.1), construction of the linear estimate we deal with (Section 4.2.2) and the central result on near-optimality of this estimate (Section 4.2.2.2). We discuss also the “expressive abilities” of the family of sets (we call them ellitopes) to which our main result applies.
• In Section 4.3 we extend the results of the previous section from ellitopes to their “matrix analogs,” spectratopes in the role of signal sets, passing simultaneously from the norm $\|\cdot\|_2$ in which the recovery error is measured to arbitrary spectratopic norms, those for which the unit ball of the conjugate norm is a spectratope. In addition, we allow for observation noise to have nonzero mean and to be non-Gaussian.
• Section 4.4 adjusts our preceding results on linear estimation to the case where the signals to be recovered possess stochastic components.
• Finally, Section 4.5 deals with “uncertain-but-bounded” observation noise, that is, noise selected “by nature,” perhaps in an adversarial fashion, from a given bounded set.

$^1$ [88, 198] address the problem of $\|\cdot\|_2$-recovery of a signal $x$ from direct observations ($A=B=I$) in the case when $\mathcal X$ is a high-dimensional ellipsoid with “regularly decreasing half-axes,” like $\mathcal X=\{x\in\mathbf R^n:\ \sum_jj^{2\alpha}x_j^2\le L^2\}$ with $\alpha>0$. In this case Pinsker’s construction shows that as $\sigma\to+0$, the risk of a properly built linear estimate is, uniformly in $n$, $(1+o(1))$ times the minimax risk. This is much stronger than ($*$), and it seems to be unlikely that a similarly strong result holds true in the general case underlying ($*$).
4.1 PRELIMINARIES: EXECUTIVE SUMMARY ON CONIC PROGRAMMING

4.1.1 Cones
A cone in Euclidean space $E$ is a nonempty set $K$ which is closed w.r.t. taking conic combinations of its elements, that is, linear combinations with nonnegative coefficients. Equivalently: $K\subset E$ is a cone if $K$ is nonempty, and
• $x,y\in K\Rightarrow x+y\in K$;
• $x\in K,\ \lambda\ge0\Rightarrow\lambda x\in K$.
It is immediately seen that a cone is a convex set. We call a cone $K$ regular if it is closed, pointed (that is, does not contain lines passing through the origin, or, equivalently, $K\cap[-K]=\{0\}$) and possesses a nonempty interior.

Given a cone $K\subset E$, we can associate with it its dual cone $K^*$ defined as
$$K^*=\{y\in E:\ \langle y,x\rangle\ge0\ \ \forall x\in K\}\qquad[\langle\cdot,\cdot\rangle\text{ is the inner product on }E].$$
It is immediately seen that $K^*$ is a closed cone, and $K\subset(K^*)^*$. It is well known that
• if $K$ is a closed cone, it holds $K=(K^*)^*$;
• $K$ is a regular cone if and only if $K^*$ is so.

Examples of regular cones “useful in applications” are as follows:
1. Nonnegative orthants $\mathbf R^d_+=\{x\in\mathbf R^d:\ x\ge0\}$;
2. Lorentz cones $\mathbf L^d_+=\{x\in\mathbf R^d:\ x_d\ge\sqrt{\sum_{i=1}^{d-1}x_i^2}\}$;
3. Semidefinite cones $\mathbf S^d_+$ comprised of positive semidefinite symmetric $d\times d$ matrices. The semidefinite cone $\mathbf S^d_+$ lives in the space $\mathbf S^d$ of symmetric matrices equipped with the Frobenius inner product
$$\langle A,B\rangle=\mathrm{Tr}(AB^T)=\mathrm{Tr}(AB)=\sum_{i,j=1}^dA_{ij}B_{ij},\qquad A,B\in\mathbf S^d.$$
All cones listed so far are self-dual.
4. Let $\|\cdot\|$ be a norm on $\mathbf R^n$. The set $\{[x;t]\in\mathbf R^n\times\mathbf R:\ t\ge\|x\|\}$ is a regular cone, and the dual cone is $\{[y;\tau]:\ \|y\|_*\le\tau\}$, where
$$\|y\|_*=\max_x\{x^Ty:\ \|x\|\le1\}$$
is the norm on $\mathbf R^n$ conjugate to $\|\cdot\|$.

An additional example of a regular cone useful for the sequel is the conic hull of a convex compact set defined as follows. Let $\mathcal T$ be a convex compact set with a nonempty interior in Euclidean space $E$. We can associate with $\mathcal T$ its closed conic hull
$$\mathbf T=\mathrm{cl}\,\underbrace{\left\{[t;\tau]\in E^+=E\times\mathbf R:\ \tau>0,\ t/\tau\in\mathcal T\right\}}_{K^o(\mathcal T)}.$$
It is immediately seen that $\mathbf T$ is a regular cone, and that to get this cone, one should add to the convex set $K^o(\mathcal T)$ the origin of $E^+$. It is also clear that one can “see $\mathcal T$ in $\mathbf T$”: $\mathcal T$ is nothing but the cross-section of the cone $\mathbf T$ by the hyperplane $\tau=1$ in $E^+=\{[t;\tau]\}$:
$$\mathcal T=\{t\in E:\ [t;1]\in\mathbf T\}.$$
It is easily seen that the cone $\mathbf T_*$ dual to $\mathbf T$ is given by
$$\mathbf T_*=\{[g;s]\in E^+:\ s\ge\phi_{\mathcal T}(-g)\},$$
where
$$\phi_{\mathcal T}(g)=\max_{t\in\mathcal T}\langle g,t\rangle$$
is the support function of $\mathcal T$.

4.1.2 Conic problems and their duals
Given regular cones $K_i\subset E_i$, $1\le i\le m$, consider an optimization problem of the form
$$\mathrm{Opt}(P)=\min_x\left\{\langle c,x\rangle:\ \begin{array}{l}A_ix-b_i\in K_i,\ i=1,...,m\\ Rx=r\end{array}\right\},\tag{$P$}$$
where $x\mapsto A_ix-b_i$ are affine mappings acting from some Euclidean space $E$ to the spaces $E_i$ where the cones $K_i$ live. A problem in this form is called a conic problem on the cones $K_1,...,K_m$; the constraints $A_ix-b_i\in K_i$ on $x$ are called conic constraints. We call a conic problem ($P$) strictly feasible if it admits a strictly feasible solution $\bar x$, meaning that $\bar x$ satisfies the equality constraints and satisfies strictly the conic constraints, i.e., $A_i\bar x-b_i\in\mathrm{int}\,K_i$.

One can associate with conic problem ($P$) its dual, which also is a conic problem. The origin of the dual problem is the desire to obtain lower bounds on the optimal value $\mathrm{Opt}(P)$ of the primal problem ($P$) in a systematic way, by linear aggregation of constraints. Linear aggregation of constraints works as follows: let us equip every conic constraint $A_ix-b_i\in K_i$ with an aggregation weight, called a Lagrange multiplier, $y_i$ restricted to reside in the cone $K_i^*$ dual to $K_i$. Similarly, we equip the system $Rx=r$ of equality constraints in ($P$) with a Lagrange multiplier $z$, a vector of the same dimension as $r$. Now let $x$ be a feasible solution to the conic problem, and let $y_i\in K_i^*$, $i\le m$, and $z$ be Lagrange multipliers. By the definition of the dual cone and due to $A_ix-b_i\in K_i$, $y_i\in K_i^*$ we have
$$\langle y_i,A_ix\rangle\ge\langle y_i,b_i\rangle,\quad1\le i\le m,$$
and of course
$$z^TRx\ge r^Tz.$$
Summing up all resulting inequalities, we arrive at the scalar linear inequality
$$\Big\langle R^*z+\sum_iA_i^*y_i,\ x\Big\rangle\ge r^Tz+\sum_i\langle b_i,y_i\rangle\tag{!}$$
where $A_i^*$ are the conjugates to $A_i$: $\langle y,A_ix\rangle_{E_i}\equiv\langle A_i^*y,x\rangle_E$, and $R^*$ is the conjugate of $R$. By its origin, (!) is a consequence of the system of constraints in ($P$) and as such is satisfied everywhere on the feasible domain of the problem. If we are lucky to get the objective of ($P$) as the linear function of $x$ in the left hand side of (!), that is, if
$$R^*z+\sum_iA_i^*y_i=c,$$
(!) imposes a lower bound on the objective of the primal conic problem ($P$) everywhere on the feasible domain of the primal problem, and the conic dual of ($P$) is the problem
$$\mathrm{Opt}(D)=\max_{y_i,z}\left\{r^Tz+\sum_i\langle b_i,y_i\rangle:\ \begin{array}{l}y_i\in K_i^*,\ 1\le i\le m\\ R^*z+\sum_{i=1}^mA_i^*y_i=c\end{array}\right\}\tag{$D$}$$
of maximizing this lower bound on $\mathrm{Opt}(P)$.

The relations between the primal and the dual conic problems are the subject of the standard Conic Duality Theorem as follows:

Theorem 4.1. [Conic Duality Theorem] Consider conic problem ($P$) (where all $K_i$ are regular cones) along with its dual problem ($D$). Then
1. Duality is symmetric: the dual problem ($D$) is conic, and the conic dual of ($D$) is (equivalent to) ($P$);
2. Weak duality: It always holds $\mathrm{Opt}(D)\le\mathrm{Opt}(P)$;
3. Strong duality: If one of the problems ($P$), ($D$) is strictly feasible and bounded,$^2$ then the other problem in the pair is solvable, and the optimal values of the problems are equal to each other. In particular, if both ($P$) and ($D$) are strictly feasible, then both problems are solvable with equal optimal values.

Remark 4.2. While the Conic Duality Theorem in the form just presented meets all our subsequent needs, it makes sense to note that in fact the Strong Duality part of the theorem can be strengthened by replacing strict feasibility with “essential strict feasibility” defined as follows: a conic problem in the form of ($P$) (or, which is the same, form of ($D$)) is called essentially strictly feasible if it admits a feasible solution $\bar x$ which satisfies strictly the non-polyhedral conic constraints, that is, $A_i\bar x-b_i\in\mathrm{int}\,K_i$ for all $i$ for which the cone $K_i$ is not polyhedral, i.e., is not given by a finite list of homogeneous linear inequality constraints.

The proof of the Conic Duality Theorem can be found in numerous sources, e.g., in [187, Section 7.1.3].

$^2$ For a minimization problem, boundedness means that the objective is bounded from below on the feasible set; for a maximization problem, that it is bounded from above on the feasible set.

4.1.3 Schur Complement Lemma
The following simple fact is extremely useful:

Lemma 4.3. [Schur Complement Lemma] A symmetric block matrix
$$A=\begin{bmatrix}P&Q^T\\ Q&R\end{bmatrix}$$
with $R\succ0$ is positive (semi)definite if and only if the matrix $P-Q^TR^{-1}Q$ is so.

Proof. With $u$, $v$ of the same sizes as $P$, $R$, we have
$$\min_v\,[u;v]^TA\,[u;v]=u^T[P-Q^TR^{-1}Q]u$$
(direct computation utilizing the fact that $R\succ0$). It follows that the quadratic form associated with $A$ is nonnegative everywhere if and only if the quadratic form with the matrix $[P-Q^TR^{-1}Q]$ is nonnegative everywhere (since the latter quadratic form is obtained from the former one by partial minimization). ✷
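As a numerical sanity check of Lemma 4.3, here is a minimal Python sketch for the special case where $R$ is a positive scalar $r$ and $Q$ is a row vector $q$, so that $P-Q^TR^{-1}Q=P-q^Tq/r$. Positive definiteness is tested by Sylvester's criterion (all leading principal minors positive), which covers the "definite" version of the lemma only. The helper names `det`, `is_pos_def`, `schur_complement` and `assemble` are illustrative, not from the text.

```python
def det(M):
    """Determinant by Laplace expansion along the first row (small matrices only)."""
    n = len(M)
    if n == 1:
        return M[0][0]
    total = 0.0
    for j in range(n):
        minor = [row[:j] + row[j + 1:] for row in M[1:]]
        total += (-1) ** j * M[0][j] * det(minor)
    return total


def is_pos_def(M):
    """Sylvester's criterion: every leading principal minor is positive."""
    return all(det([row[:k] for row in M[:k]]) > 0 for k in range(1, len(M) + 1))


def schur_complement(P, q, r):
    """P - q^T r^{-1} q for a row vector q and a scalar block r > 0."""
    n = len(P)
    return [[P[i][j] - q[i] * q[j] / r for j in range(n)] for i in range(n)]


def assemble(P, q, r):
    """Block matrix [[P, q^T], [q, r]] as a list of rows."""
    return [P[i] + [q[i]] for i in range(len(P))] + [q + [r]]


P = [[2.0, 0.0], [0.0, 1.0]]
for q in ([1.0, 1.0], [3.0, 0.0]):
    A = assemble(P, q, 4.0)
    S = schur_complement(P, q, 4.0)
    print(q, is_pos_def(A), is_pos_def(S))
```

In both test cases the two flags agree, as the lemma predicts; the identity $\mathrm{Det}(A)=r\,\mathrm{Det}(P-q^Tq/r)$ gives a second consistency check.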
4.2 NEAR-OPTIMAL LINEAR ESTIMATION FROM GAUSSIAN OBSERVATIONS

4.2.1 Situation and goal
Given an $m\times n$ matrix $A$, a $\nu\times n$ matrix $B$, and an $m\times m$ matrix $\Gamma\succ0$, consider the problem of estimating the linear image $Bx$ of an unknown signal $x$ known to belong to a given set $\mathcal X\subset\mathbf R^n$ via noisy observation
$$\omega=Ax+\xi,\quad\xi\sim\mathcal N(0,\Gamma),\tag{4.5}$$
where $\xi$ is the observation noise. A candidate estimate in this case is a (Borel) function $\widehat x(\cdot):\mathbf R^m\to\mathbf R^\nu$, and the performance of such an estimate in what follows will be quantified by the Euclidean risk $\mathrm{Risk}[\widehat x|\mathcal X]$ defined by (4.2).

4.2.1.1 Ellitopes

From now on we assume that $\mathcal X\subset\mathbf R^n$ is a set given by
$$\mathcal X=\left\{x\in\mathbf R^n:\ \exists(y\in\mathbf R^{\bar n},\ t\in\mathcal T):\ x=Py,\ y^TR_ky\le t_k,\ 1\le k\le K\right\},\tag{4.6}$$
where
• $P$ is an $n\times\bar n$ matrix,
• $R_k\succeq0$ are $\bar n\times\bar n$ matrices with $\sum_kR_k\succ0$,
• $\mathcal T$ is a nonempty computationally tractable convex compact subset of $\mathbf R^K_+$ intersecting the interior of $\mathbf R^K_+$ and such that $\mathcal T$ is monotone, meaning that the relations $0\le\tau\le t$ and $t\in\mathcal T$ imply that $\tau\in\mathcal T$.$^3$ Note that under our assumptions $\mathrm{int}\,\mathcal T\ne\emptyset$.

In the sequel, we refer to a set of the form (4.6) with data $[P,\{R_k,1\le k\le K\},\mathcal T]$ satisfying the assumptions just formulated as an ellitope, and to (4.6) as an ellitopic representation of $\mathcal X$. Here are instructive examples of ellitopes (in all these examples, $P$ is the identity mapping; in the sequel, we call ellitopes of this type basic):
• when $K=1$, $\mathcal T=[0,1]$, and $R_1\succ0$, $\mathcal X$ is the ellipsoid $\{x:\ x^TR_1x\le1\}$;
• when $K\ge1$ and $\mathcal T=\{t\in\mathbf R^K:\ 0\le t_k\le1,\ k\le K\}$, $\mathcal X$ is the intersection
$$\bigcap_{1\le k\le K}\{x:\ x^TR_kx\le1\}$$
of ellipsoids/elliptic cylinders centered at the origin. In particular, when $U$ is a $K\times n$ matrix of rank $n$ with rows $u_k^T$, $1\le k\le K$, and $R_k=u_ku_k^T$, $\mathcal X$ is the symmetric w.r.t. the origin polytope $\{x:\ \|Ux\|_\infty\le1\}$;
• when $U$, $u_k$ and $R_k$ are as in the latter example and $\mathcal T=\{t\in\mathbf R^K_+:\ \sum_kt_k^{p/2}\le1\}$ for some $p\ge2$, we get $\mathcal X=\{x:\ \|Ux\|_p\le1\}$.

It should be added that the family of ellitope-representable sets is quite rich: this family admits a “calculus,” so that more ellitopes can be constructed by taking intersections, direct products, linear images (direct and inverse) or arithmetic sums of ellitopes given by the above examples. In fact, the property of being an ellitope is preserved by nearly all basic operations with sets preserving convexity and symmetry w.r.t. the origin (a regrettable exception is taking the convex hull of a finite union); see Section 4.6.

As another example of an ellitope instructive in the context of nonparametric statistics, consider the situation where our signals $x$ are discretizations of functions of continuous argument running through a compact $d$-dimensional domain $D$, and the functions $f$ we are interested in are those satisfying a Sobolev-type smoothness constraint, that is, an upper bound on the $L_p(D)$-norm of $\mathcal Lf$, where $\mathcal L$ is a linear differential operator with constant coefficients. After discretization, this restriction can be modeled as $\|Lx\|_p\le1$, with properly selected matrix $L$. As we already know from the above example, when $p\ge2$, the set $\mathcal X=\{x:\ \|Lx\|_p\le1\}$ is an ellitope, and as such is captured by our machinery. Note also that by the outlined calculus, imposing on the functions $f$ in question several Sobolev-type smoothness constraints with parameters $p\ge2$ still results in a set of signals which is an ellitope.

$^3$ The latter relation is “for free”: given a nonempty convex compact set $\mathcal T\subset\mathbf R^K_+$, the right-hand side of (4.6) remains intact when passing from $\mathcal T$ to its “monotone hull” $\{\tau\in\mathbf R^K_+:\ \exists t\in\mathcal T:\ \tau\le t\}$, which already is a monotone convex compact set.
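To make the third example concrete, the following Python sketch tests membership in $\{x:\ \|Ux\|_p\le1\}$ through its ellitopic representation and compares the answer with the direct norm computation. It relies on monotonicity of $\mathcal T$: $x$ belongs to the basic ellitope if and only if the vector $\bar t$ with entries $\bar t_k=x^TR_kx=(u_k^Tx)^2$ lies in $\mathcal T$. The function names and the test data are illustrative assumptions.

```python
def in_ellitope_lp(x, U, p):
    """Membership in {x : ||Ux||_p <= 1}, p >= 2, via the ellitopic data.

    R_k = u_k u_k^T, hence x^T R_k x = (u_k^T x)^2, and
    T = {t >= 0 : sum_k t_k^(p/2) <= 1}. By monotonicity of T, it suffices
    to test the single vector t_bar with t_bar_k = x^T R_k x.
    """
    if p < 2:
        raise ValueError("the ellitopic representation needs p >= 2")
    t_bar = [sum(uj * xj for uj, xj in zip(u, x)) ** 2 for u in U]
    return sum(t ** (p / 2.0) for t in t_bar) <= 1.0


def in_lp_ball_direct(x, U, p):
    """The same set, tested directly as ||Ux||_p <= 1."""
    Ux = [sum(uj * xj for uj, xj in zip(u, x)) for u in U]
    return sum(abs(v) ** p for v in Ux) ** (1.0 / p) <= 1.0


U = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # K = 3 rows, rank n = 2
for x in ([0.4, 0.4], [0.5, 0.5]):
    print(x, in_ellitope_lp(x, U, 4), in_lp_ball_direct(x, U, 4))
```

The two tests agree because $\sum_k\bar t_k^{p/2}=\sum_k|u_k^Tx|^p=\|Ux\|_p^p$.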
4.2.1.2 Estimates and their risks
In the outlined situation, a candidate estimate is a Borel function $\widehat x(\cdot):\mathbf R^m\to\mathbf R^\nu$; given observation (4.5), we recover $w=Bx$ as $\widehat x(\omega)$. In the sequel, we quantify the quality of an estimate by its worst-case, over $x\in\mathcal X$, expected $\|\cdot\|_2^2$ recovery error
$$\mathrm{Risk}[\widehat x|\mathcal X]=\sup_{x\in\mathcal X}\left[\mathbf E_{\xi\sim\mathcal N(0,\Gamma)}\left\{\|\widehat x(Ax+\xi)-Bx\|_2^2\right\}\right]^{1/2},$$
and define the optimal, or the minimax, risk as
$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]=\inf_{\widehat x(\cdot)}\mathrm{Risk}[\widehat x|\mathcal X],\tag{4.7}$$
where inf is taken over all Borel candidate estimates.

4.2.1.3 Main goal
The main goal of what follows is to demonstrate that an estimate linear in $\omega$,
$$\widehat x_H(\omega)=H^T\omega\tag{4.8}$$
with a properly selected efficiently computable matrix $H$, is near-optimal in terms of its risk.

Our first observation is that when $\mathcal X$ is the ellitope (4.6), replacing matrices $A$ and $B$ with $AP$ and $BP$, respectively, we pass from the initial estimation problem of interest to the transformed problem, where the signal set is
$$\bar{\mathcal X}=\{y\in\mathbf R^{\bar n}:\ \exists t\in\mathcal T:\ y^TR_ky\le t_k,\ 1\le k\le K\},$$
and we want to recover $[BP]y$, $y\in\bar{\mathcal X}$, via observation
$$\omega=[AP]y+\xi.$$
It is obvious that the considered families of estimates (the family of all linear estimates and the family of all estimates), like the risks of the estimates, remain intact under this transformation; in particular,
$$\mathrm{Risk}[\widehat x|\mathcal X]=\sup_{y\in\bar{\mathcal X}}\left[\mathbf E_\xi\left\{\|\widehat x([AP]y+\xi)-[BP]y\|_2^2\right\}\right]^{1/2}.$$
Therefore, to save notation, from now on, unless explicitly stated otherwise, we assume that matrix $P$ is identity, so that $\mathcal X$ is the basic ellitope
$$\mathcal X=\left\{x\in\mathbf R^n:\ \exists t\in\mathcal T,\ x^TR_kx\le t_k,\ 1\le k\le K\right\}.\tag{4.9}$$
We assume in the sequel that $B\ne0$, since otherwise one has $Bx=0$ for all $x\in\mathcal X$, and the estimation problem is trivial.

4.2.2 Building a linear estimate
We start with building a “presumably good” linear estimate. Restricting ourselves to linear estimates (4.8), we may be interested in the estimate with the smallest risk, that is, the estimate associated with an $m\times\nu$ matrix $H$ which is an optimal solution to the optimization problem
$$\min_H\left\{R(H):=\mathrm{Risk}^2[\widehat x_H|\mathcal X]\right\}.$$
We have (the cross term vanishes since $\xi$ has zero mean)
$$\begin{array}{rcl}
R(H)&=&\max\limits_{x\in\mathcal X}\mathbf E_\xi\{\|H^T\omega-Bx\|_2^2\}=\mathbf E_\xi\{\|H^T\xi\|_2^2\}+\max\limits_{x\in\mathcal X}\|H^TAx-Bx\|_2^2\\[1mm]
&=&\mathrm{Tr}(H^T\Gamma H)+\max\limits_{x\in\mathcal X}x^T(H^TA-B)^T(H^TA-B)x.
\end{array}$$
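The formula for $R(H)$ can be made tangible in the simplest scalar setting. The sketch below assumes $n=m=\nu=1$, with numbers $a$, $b$ in the roles of $A$, $B$, noise variance `gamma` in the role of $\Gamma$, and signal set $\mathcal X=\{x:\ |x|\le r\}$; in this case $R(h)=h^2\gamma+(ha-b)^2r^2$, minimized at $h_*=abr^2/(\gamma+a^2r^2)$. The function names are illustrative.

```python
def risk_sq(h, a, b, gamma, r):
    """R(h) = h^2 * gamma + max_{|x| <= r} ((h*a - b) * x)^2 in the scalar case."""
    return h * h * gamma + ((h * a - b) * r) ** 2


def best_linear(a, b, gamma, r):
    """Minimizer of R(h), from R'(h) = 2*h*gamma + 2*a*(h*a - b)*r^2 = 0."""
    h = a * b * r * r / (gamma + a * a * r * r)
    return h, risk_sq(h, a, b, gamma, r)


h_star, r_star = best_linear(a=2.0, b=1.0, gamma=1.0, r=1.0)
# Compare with a brute-force scan over a grid of candidate h values.
grid_min = min(risk_sq(k / 1000.0, 2.0, 1.0, 1.0, 1.0) for k in range(-2000, 2001))
print(h_star, r_star, grid_min)
```

The first term of $R(h)$ is the stochastic part of the error and grows with $|h|$, while the second is the worst-case bias over $\mathcal X$; the optimal $h_*$ balances the two.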
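This function, while convex, can be hard to compute. For this reason, we use a linear estimate yielded by minimizing an efficiently computable convex upper bound on $R(H)$ which is built as follows. Let $\phi_{\mathcal T}$ be the support function of $\mathcal T$:
$$\phi_{\mathcal T}(\lambda)=\max_{t\in\mathcal T}\lambda^Tt:\ \mathbf R^K\to\mathbf R.$$
Observe that whenever $\lambda\in\mathbf R^K_+$ and $H$ are such that
$$[B-H^TA]^T[B-H^TA]\preceq\sum_k\lambda_kR_k,\tag{4.10}$$
for $x\in\mathcal X$ it holds
$$\|Bx-H^TAx\|_2^2\le\phi_{\mathcal T}(\lambda).\tag{4.11}$$
Indeed, in the case of (4.10) and with $x\in\mathcal X$, there exists $t\in\mathcal T$ such that $x^TR_kx\le t_k$ for all $k$, and consequently (by monotonicity of $\mathcal T$) the vector $\bar t$ with the entries $\bar t_k=x^TR_kx$ also belongs to $\mathcal T$, whence
$$\|Bx-H^TAx\|_2^2=x^T[B-H^TA]^T[B-H^TA]x\le\sum_k\lambda_kx^TR_kx=\lambda^T\bar t\le\phi_{\mathcal T}(\lambda),$$
which combines with (4.9) to imply (4.11). From (4.11) it follows that if $H$ and $\lambda\ge0$ are linked by (4.10), then
$$\begin{array}{rcl}
\mathrm{Risk}^2[\widehat x_H|\mathcal X]&=&\max\limits_{x\in\mathcal X}\mathbf E\left\{\|Bx-H^T(Ax+\xi)\|_2^2\right\}\\[1mm]
&=&\mathrm{Tr}(H^T\Gamma H)+\max\limits_{x\in\mathcal X}\|[B-H^TA]x\|_2^2\\[1mm]
&\le&\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda).
\end{array}$$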
We see that the efficiently computable convex function
$$\widehat R(H)=\inf_\lambda\Big\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda):\ (B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k,\ \lambda\ge0\Big\}$$
(which clearly is well defined due to compactness of $\mathcal T$ combined with $\sum_kR_k\succ0$) is an upper bound on $R(H)$.$^4$ Note that by the Schur Complement Lemma the matrix inequality $(B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k$ is equivalent to the matrix inequality
$$\begin{bmatrix}\sum_k\lambda_kR_k&B^T-A^TH\\ B-H^TA&I_\nu\end{bmatrix}\succeq0$$
linear in $H$, $\lambda$. We have arrived at the following result:

Proposition 4.4. In the situation of this section, the risk of the “presumably good” linear estimate $\widehat x_{H_*}(\omega)=H_*^T\omega$ yielded by an optimal solution $(H_*,\lambda_*)$ to the (clearly solvable) convex optimization problem
$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{H,\lambda}\Big\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda):\ (B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k,\ \lambda\ge0\Big\}\\[2mm]
&=&\min\limits_{H,\lambda}\Big\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal T}(\lambda):\ \begin{bmatrix}\sum_k\lambda_kR_k&B^T-A^TH\\ B-H^TA&I_\nu\end{bmatrix}\succeq0,\ \lambda\ge0\Big\}
\end{array}\tag{4.12}$$
is upper-bounded by $\sqrt{\mathrm{Opt}}$.

$^4$ It is well known that when $K=1$ (i.e., $\mathcal X$ is an ellipsoid), the above bounding scheme is exact: $\widehat R(\cdot)\equiv R(\cdot)$. For more complicated $\mathcal X$’s, $\widehat R(\cdot)$ could be larger than $R(\cdot)$, although the ratio $\widehat R(\cdot)/R(\cdot)$ is bounded by $O(\log(K))$; see Section 4.2.3.

4.2.2.1 Illustration: Recovering temperature distribution
Situation: A square steel plate was somewhat heated at time 0 and left to cool, the temperature along the perimeter of the plate being kept at zero at all times. At time $t_1$, we measure the temperatures at $m$ points of the plate, and want to recover the distribution of the temperature along the plate at a given time $t_0$, $0<t_0<t_1$.

Physics, after suitable discretization of spatial variables, offers the following model of the situation. We represent the distribution of temperature at time $t$ as a $(2N-1)\times(2N-1)$ matrix $U(t)=[u_{ij}(t)]_{i,j=1}^{2N-1}$, where $u_{ij}(t)$ is the temperature, at time $t$, at the point
$$P_{ij}=(p_i,p_j),\quad p_k=k/N-1,\quad1\le i,j\le2N-1$$
of the plate (in our model, this plate occupies the square $S=\{(p,q):\ |p|\le1,\ |q|\le1\}$). Here the positive integer $N$ is responsible for spatial discretization.

For $1\le k\le2N-1$, let us specify functions $\phi_k(s)$ on the segment $-1\le s\le1$ as follows:
$$\begin{array}{ll}
\phi_{2\ell-1}(s)=c_{2\ell-1}\cos(\omega_{2\ell-1}s),&\omega_{2\ell-1}=(\ell-1/2)\pi,\\
\phi_{2\ell}(s)=c_{2\ell}\sin(\omega_{2\ell}s),&\omega_{2\ell}=\ell\pi,
\end{array}$$
where $c_k$ are readily given by the normalization condition $\sum_{i=1}^{2N-1}\phi_k^2(p_i)=1$; note that $\phi_k(\pm1)=0$. It is immediately seen that the matrices
$$\Phi^{k\ell}=[\phi_k(p_i)\phi_\ell(p_j)]_{i,j=1}^{2N-1},\quad1\le k,\ell\le2N-1$$
form an orthonormal basis in the space of $(2N-1)\times(2N-1)$ matrices, so that we can write
$$U(t)=\sum_{k,\ell\le2N-1}x_{k\ell}(t)\Phi^{k\ell}.$$
The advantage of representing temperature fields in the basis $\{\Phi^{k\ell}\}_{k,\ell\le2N-1}$ stems from the fact that in this basis the heat equation governing evolution of the temperature distribution in time becomes extremely simple, just
$$\frac{d}{dt}x_{k\ell}(t)=-(\omega_k^2+\omega_\ell^2)x_{k\ell}(t)\ \Rightarrow\ x_{k\ell}(t)=\exp\{-(\omega_k^2+\omega_\ell^2)t\}x_{k\ell}(0).^5$$
Now we can convert the situation into the one considered in our general estimation scheme, namely, as follows:
• We select some discretization parameter $N$ and treat $x=\{x_{k\ell}(0),\ 1\le k,\ell\le2N-1\}$ as the signal underlying our observations. In every potential application, we can safely upper-bound the magnitudes of the initial temperatures and thus the magnitude of $x$, say, by a constraint of the form
$$\sum_{k,\ell}x_{k\ell}^2(0)\le R^2$$
with properly selected $R$, which allows us to specify the domain $\mathcal X$ of the signal as the Euclidean ball:
$$\mathcal X=\{x\in\mathbf R^{(2N-1)\times(2N-1)}:\ \|x\|_2^2\le R^2\}.\tag{4.13}$$
• Let the measurements of the temperature at time $t_1$ be taken at the points $P_{i(\nu),j(\nu)}$, $1 \le \nu \le m$,⁶ and let them be affected by $\mathcal N(0, \sigma^2 I_m)$-noise, so that our observation is
$$\omega = A(x) + \xi,\quad \xi \sim \mathcal N(0, \sigma^2 I_m).$$
Here $x \mapsto A(x)$ is the linear mapping from $\mathbf R^{(2N-1)\times(2N-1)}$ into $\mathbf R^m$ given by
$$[A(x)]_\nu = \sum_{k,\ell=1}^{2N-1} e^{-(\omega_k^2+\omega_\ell^2)t_1}\,\phi_k(p_{i(\nu)})\,\phi_\ell(p_{j(\nu)})\,x_{k\ell}(0). \qquad (4.14)$$
• We want to recover the temperatures at time $t_0$ taken along some grid, say, the square $(2K-1)\times(2K-1)$ grid $\{Q_{ij} = (r_i, r_j), 1 \le i, j \le 2K-1\}$, where $r_i = i/K - 1$, $1 \le i \le 2K-1$. In other words, we want to recover $B(x)$, where the linear mapping $x \mapsto B(x)$ from $\mathbf R^{(2N-1)\times(2N-1)}$ into $\mathbf R^{(2K-1)\times(2K-1)}$ is given by
$$[B(x)]_{ij} = \sum_{k,\ell=1}^{2N-1} e^{-(\omega_k^2+\omega_\ell^2)t_0}\,\phi_k(r_i)\,\phi_\ell(r_j)\,x_{k\ell}(0).$$
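The construction in the three items above can be sketched numerically. The following is a minimal illustration, not the authors' code: the function names `basis_on_grid` and `sensing_matrix`, the small values of $N$ and $m$, and the random choice of measurement points are all assumptions made for the example. It samples the functions $\phi_k$ on the grid, normalizes them as prescribed, and assembles the matrix of the mapping (4.14) acting on the row-major vectorization of $\{x_{k\ell}(0)\}$.

```python
import numpy as np

def basis_on_grid(N):
    """Return F with F[k-1, i-1] = phi_k(p_i), and the frequencies omega_k."""
    n = 2 * N - 1
    k = np.arange(1, n + 1)
    p = k / N - 1.0                      # grid points p_i = i/N - 1, i = 1..2N-1
    omega = k * np.pi / 2.0              # omega_{2l-1} = (l - 1/2)*pi, omega_{2l} = l*pi
    arg = np.outer(omega, p)
    is_odd = (k % 2 == 1)[:, None]       # odd k -> cosine, even k -> sine
    F = np.where(is_odd, np.cos(arg), np.sin(arg))
    F /= np.linalg.norm(F, axis=1, keepdims=True)   # c_k from sum_i phi_k(p_i)^2 = 1
    return F, omega

def sensing_matrix(N, t, rows, cols):
    """Matrix of x -> [U(t)] at grid points (rows[v], cols[v]); x is row-major vec of x_{kl}(0)."""
    F, omega = basis_on_grid(N)
    decay = np.exp(-np.add.outer(omega**2, omega**2) * t)   # e^{-(w_k^2 + w_l^2) t}
    A = np.einsum('kl,kv,lv->vkl', decay, F[:, rows], F[:, cols])
    return A.reshape(len(rows), -1)

# Hypothetical small instance (the experiment in the text uses N = 32, m = 100)
N, m, t1 = 4, 5, 0.03
rng = np.random.default_rng(0)
rows = rng.integers(0, 2 * N - 1, size=m)   # 0-based indices i(nu)
cols = rng.integers(0, 2 * N - 1, size=m)   # 0-based indices j(nu)
F, omega = basis_on_grid(N)
A = sensing_matrix(N, t1, rows, cols)
```

The matrix of $B$ is obtained in the same way by evaluating the $\phi_k$ at the points $r_i$ instead of $p_i$ and using $t_0$ in place of $t_1$. The array `decay` makes the source of ill-posedness visible: its entries shrink exponentially in $\omega_k^2+\omega_\ell^2$.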
⁵ The explanation is simple: the functions $\phi_{k\ell}(p,q) = \phi_k(p)\phi_\ell(q)$, $k, \ell = 1, 2, \ldots$, form an orthogonal basis in $L_2(S)$ and vanish on the boundary of $S$, and the heat equation
$$\frac{\partial}{\partial t}u(t;p,q) = \left[\frac{\partial^2}{\partial p^2} + \frac{\partial^2}{\partial q^2}\right]u(t;p,q)$$
governing evolution of the temperature field $u(t;p,q)$, $(p,q)\in S$, with time $t$, in terms of the coefficients $x_{k\ell}(t)$ of the temperature field in the orthogonal basis $\{\phi_{k\ell}(p,q)\}_{k,\ell}$ becomes
$$\frac{d}{dt}x_{k\ell}(t) = -(\omega_k^2+\omega_\ell^2)\,x_{k\ell}(t).$$
In our discretization, we truncate the expansion of $u(t;p,q)$, keeping only the terms with $k, \ell \le 2N-1$, and restrict the spatial variables to reside in the grid $\{P_{ij}, 1 \le i, j \le 2N-1\}$.
⁶ The construction can be easily extended to allow for measurement points outside of the grid $\{P_{ij}\}$.
Ill-posedness. Our problem is a typical example of an ill-posed inverse problem, where one wants to recover a past state of a dynamical system converging exponentially fast to equilibrium and thus "forgetting rapidly" its past. More specifically, in our situation ill-posedness stems from the fact that, as is clearly seen from (4.14), contributions of "high frequency" (i.e., with large $\omega_k^2+\omega_\ell^2$) components $x_{k\ell}(0)$ of the signal to $A(x)$ decrease exponentially fast, with high decay rate, as $t_1$ grows. As a result, high frequency components $x_{k\ell}(0)$ are impossible to recover from noisy observations of $A(x)$, unless the corresponding time instant $t_1$ is very small. As a kind of compensation, contributions of high frequency components $x_{k\ell}(0)$ to $B(x)$ are also very small, provided that $t_0$ is not too small, implying that there is no need to recover high frequency components well, unless they are huge. Our linear estimate, roughly speaking, seeks the best trade-off between these two opposite phenomena, utilizing (4.13) as the source of upper bounds on the magnitudes of high frequency components of the signal.

Numerical results. In the experiment to be reported, we used $N = 32$, $m = 100$, $K = 6$, $t_0 = 0.01$, $t_1 = 0.03$ (i.e., temperature is measured at time 0.03 at 100 points selected at random on a $63\times 63$ square grid, and we want to recover the temperatures at time 0.01 along an $11\times 11$ square grid). We used $R = 15$, that is,
$$\mathcal X = \Big\{[x_{k\ell}]_{k,\ell=1}^{63} : \sum_{k,\ell} x_{k\ell}^2 \le 225\Big\},$$
and $\sigma = 0.001$. Under the circumstances, the risk of the best linear estimate turns out to be 0.3968. Figure 4.1 shows a sample temperature distribution $B(x) = U_*(t_0)$ at time $t_0$ resulting from a randomly selected signal $x \in \mathcal X$ along with the recovery $\widehat U(t_0)$ of $U_*$ by the optimal linear estimate and the naive "least squares" recovery $\widetilde U(t_0)$ of $U_*$. The latter is defined as $B(x_*)$, where $x_*$ is the least squares recovery of the signal underlying observation $\omega$:
$$x_* = x_*(\omega) := \mathop{\mathrm{argmin}}_x \|A(x) - \omega\|_2.$$
Notice the dramatic difference in performance between the "naive least squares" recovery and the optimal linear estimate.

4.2.2.2 Near-optimality of $\widehat x_{H_*}$

Proposition 4.5. The efficiently computable linear estimate $\widehat x_{H_*}(\omega) = H_*^T\omega$ yielded by an optimal solution to the optimization problem (4.12) is nearly optimal in terms of its risk:
$$\mathrm{Risk}[\widehat x_{H_*}|\mathcal X] \le \sqrt{\mathrm{Opt}} \le 64\sqrt{45\ln 2\,(\ln K + 5\ln 2)}\;\mathrm{Risk}_{\mathrm{opt}}[\mathcal X], \qquad (4.15)$$
where the minimax optimal risk $\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ is given by (4.7).

For proof, see Section 4.8.5. Note that the "nonoptimality factor" in (4.15) depends logarithmically on $K$ and is completely independent of what $A$, $B$, $\Gamma$ are and of the "details" $R_k$, $\mathcal T$ (see (4.9)) specifying the ellitope $\mathcal X$.
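To get a feeling for the size of the constant in (4.15), one can simply evaluate it. The helper below is an illustrative sketch (the function name is hypothetical); it computes the factor $64\sqrt{45\ln 2\,(\ln K + 5\ln 2)}$ for a few values of $K$.

```python
import math

def nonoptimality_factor(K):
    """Evaluate 64 * sqrt(45 ln 2 (ln K + 5 ln 2)), the factor in (4.15)."""
    ln2 = math.log(2.0)
    return 64.0 * math.sqrt(45.0 * ln2 * (math.log(K) + 5.0 * ln2))

for K in (1, 32, 10**6):
    print(K, round(nonoptimality_factor(K), 1))
```

Already for $K = 1$ the factor is in the hundreds, while its growth with $K$ is very slow.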
Figure 4.1: True distribution of temperature $U_* = B(x)$ at time $t_0 = 0.01$ (left) along with its recovery $\widehat U$ via the optimal linear estimate (center) and the "naive" recovery $\widetilde U$ (right). Panel data: $U_*$: $\|U_*\|_2 = 2.01$, $\|U_*\|_\infty = 0.347$; $\widehat U$: $\|\widehat U - U_*\|_2 = 0.318$, $\|\widehat U - U_*\|_\infty = 0.078$; $\widetilde U$: $\|\widetilde U - U_*\|_2 = 44.82$, $\|\widetilde U - U_*\|_\infty = 12.47$.

4.2.2.3 Relaxing the symmetry requirement
Sets $\mathcal X$ of the form (4.6) (we called them ellitopes) are symmetric w.r.t. the origin convex compact sets of special structure. This structure is rather flexible, but the symmetry is "built in." We are about to demonstrate that the symmetry requirement can be relaxed to some extent. Specifically, assume instead of (4.6) that the convex compact set $\mathcal X$ known to contain the signals $x$ underlying observations (4.5) can be "sandwiched" by two ellitopes known to us and similar to each other, with coefficient $\alpha \ge 1$:
$$\underbrace{\left\{x \in \mathbf R^n : \exists (y \in \mathbf R^{\bar n}, t \in \mathcal T) : x = Py \ \&\ y^T R_k y \le t_k,\ 1 \le k \le K\right\}}_{\underline{\mathcal X}} \subset \mathcal X \subset \alpha\underline{\mathcal X},$$
with $R_k$ and $\mathcal T$ possessing the properties postulated in Section 4.2.1.1. Let Opt and $H_*$ be the optimal value and optimal solution of the optimization problem (4.12) associated with the data $R_1, \ldots, R_K$, $\mathcal T$ and matrices $\bar A = AP$, $\bar B = BP$ in the role of $A$, $B$, respectively. It is immediately seen that the risk $\mathrm{Risk}[\widehat x_{H_*}|\mathcal X]$ of the linear estimate $\widehat x_{H_*}(\omega)$ is at most $\alpha\sqrt{\mathrm{Opt}}$. On the other hand, we have $\mathrm{Risk}_{\mathrm{opt}}[\underline{\mathcal X}] \le \mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$, and by Proposition 4.5 also $\sqrt{\mathrm{Opt}} \le O(1)\sqrt{\ln(2K)}\,\mathrm{Risk}_{\mathrm{opt}}[\underline{\mathcal X}]$. Taken together, these relations imply that
$$\mathrm{Risk}[\widehat x_{H_*}|\mathcal X] \le O(1)\,\alpha\sqrt{\ln(2K)}\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]. \qquad (4.16)$$
In other words, as far as the "level of nonoptimality" of efficiently computable linear estimates is concerned, signal sets $\mathcal X$ which can be approximated by ellitopes within a factor $\alpha$ of order of 1 are nearly as good as the ellitopes. To give an example: it is known that whenever the intersection $\mathcal X$ of $K$ elliptic cylinders $\{x : (x - c_k)^T R_k (x - c_k) \le 1\}$, $R_k \succeq 0$, concentric or not, is bounded and has a nonempty interior, $\mathcal X$ can be approximated by an ellipsoid within the factor $\alpha = K + 2\sqrt K$.⁷ Assuming w.l.o.g. that the approximating ellipsoid is centered at the origin, the level of nonoptimality of a linear estimate is bounded by (4.16) with $O(1)K$ in the role of $\alpha$.

4.2.2.4 Comments
Note that bound (4.16) rapidly deteriorates when $\alpha$ grows, and this phenomenon to some extent "reflects the reality." For example, a perfect simplex $\mathcal X$ inscribed into the unit sphere in $\mathbf R^n$ is in-between two Euclidean balls centered at the origin with the ratio of radii equal to $n$ (i.e., $\alpha = n$). It is immediately seen that with $A = B = I$, $\Gamma = \sigma^2 I$, in the range $\sigma \le n\sigma^2 \le 1$ of values of $n$ and $\sigma$, we have
$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal X] \approx \sqrt{\sigma},\quad \mathrm{Risk}[\widehat x_{H_*}|\mathcal X] = O(1)\sqrt{n}\,\sigma,$$
with $\approx$ meaning "up to a factor logarithmic in $n/\sigma$." In other words, for large $n\sigma$ linear estimates indeed are significantly (albeit not to the full extent of (4.16)) outperformed by nonlinear ones.

Another situation "bad for linear estimates" suggested by (4.15) is the one where the description (4.6) of $\mathcal X$, albeit possible, requires a very large value of $K$. Here again (4.15) reflects to some extent the reality: when $\mathcal X$ is the unit $\|\cdot\|_1$ ball in $\mathbf R^n$, (4.6) takes place with $K = 2^{n-1}$; consequently, the factor at $\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ in the right-hand side of (4.15) becomes at least $\sqrt n$. On the other hand, with $A = B = I$, $\Gamma = \sigma^2 I$, in the range $\sigma \le n\sigma^2 \le 1$ of values of $n$, $\sigma$, the risks $\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$, $\mathrm{Risk}[\widehat x_{H_*}|\mathcal X]$ are basically the same as in the case of $\mathcal X$ being the perfect simplex inscribed into the unit sphere in $\mathbf R^n$, and linear estimates indeed are "heavily nonoptimal" when $n\sigma$ is large.

4.2.2.5 How near is "near-optimal": Numerical illustration

The "nonoptimality factor" $\theta$ in the upper bound $\sqrt{\mathrm{Opt}} \le \theta\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ from Proposition 4.5, while logarithmic, seems to be unpleasantly large. On closer inspection, one can get numerically less conservative bounds on nonoptimality factors. Here are some illustrations. In the six experiments to be reported, we used $n = m = \nu = 32$ and $\Gamma = \sigma^2 I_m$. In the first triple of experiments, $\mathcal X$ was the ellipsoid
$$\mathcal X = \Big\{x \in \mathbf R^{32} : \sum_{j=1}^{32} j^2 x_j^2 \le 1\Big\},$$
that is, $P$ was the identity, $K = 1$, $R_1 = \sum_{j=1}^{32} j^2 e_j e_j^T$ ($e_j$ are basic orths), and $\mathcal T = [0, 1]$. In the second triple of experiments, $\mathcal X$ was the box circumscribed around the above ellipsoid:
$$\mathcal X = \{x \in \mathbf R^{32} : j|x_j| \le 1,\ 1 \le j \le 32\}\qquad [P = I,\ K = 32,\ R_k = k^2 e_k e_k^T,\ k \le K,\ \mathcal T = [0,1]^K].$$
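⁷ Namely, setting $F(x) = -\sum_{k=1}^K \ln(1 - (x - c_k)^T R_k (x - c_k)) : \mathrm{int}\,\mathcal X \to \mathbf R$ and denoting by $\bar x$ the analytic center $\mathop{\mathrm{argmin}}_{x\in\mathrm{int}\,\mathcal X} F(x)$, one has
$$\{x : (x - \bar x)^T F''(\bar x)(x - \bar x) \le 1\} \subset \mathcal X \subset \{x : (x - \bar x)^T F''(\bar x)(x - \bar x) \le [K + 2\sqrt K]^2\}.$$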
X           σ        √Opt    LwB     √Opt/LwB
ellipsoid   0.0100   0.288   0.153   1.88
ellipsoid   0.0010   0.103   0.060   1.71
ellipsoid   0.0001   0.019   0.018   1.06
box         0.0100   0.698   0.231   3.02
box         0.0010   0.163   0.082   2.00
box         0.0001   0.021   0.020   1.06

Table 4.1: Performance of linear estimates (4.8), (4.12), m = n = 32, B = I.
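The experiments behind Table 4.1 use a $32\times 32$ matrix $A$ with prescribed singular values and a random orientation. Here is a minimal sketch of one way to generate such a matrix. It assumes that "randomly rotated" means multiplying the diagonal matrix of singular values by independent random orthogonal factors on both sides; the function name, the seed, and the QR-based sampling of orthogonal matrices are illustrative choices, not taken from the text.

```python
import numpy as np

def random_matrix_with_spectrum(n, s_max=1.0, s_min=0.01, seed=0):
    """Return (A, sing): A = U diag(sing) V^T with geometric singular values."""
    rng = np.random.default_rng(seed)
    sing = np.geomspace(s_max, s_min, n)              # geometric progression s_max -> s_min
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal factor
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal factor
    return U @ np.diag(sing) @ V.T, sing

A, sing = random_matrix_with_spectrum(32)
```

With the defaults, the singular values of `A` run from 1 down to 0.01, so the condition number is 100 regardless of the random factors.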
In these experiments, $B$ was the identity matrix, and $A$ was a randomly rotated matrix common for all experiments, with singular values $\lambda_j$, $1 \le j \le 32$, forming a geometric progression, with $\lambda_1 = 1$ and $\lambda_{32} = 0.01$. Experiments in a triple differed by the values of $\sigma$ (0.01, 0.001, 0.0001).

The results of the experiments are presented in Table 4.1, where, as above, $\sqrt{\mathrm{Opt}}$ is the upper bound given by (4.12) on the risk $\mathrm{Risk}[\widehat x_{H_*}|\mathcal X]$ of recovering $Bx = x$, $x \in \mathcal X$, by the linear estimate yielded by (4.8) and (4.12), and LwB is the lower bound on $\mathrm{Risk}_{\mathrm{opt}}[\mathcal X]$ computed via the techniques outlined in Exercise 4.22 (we skip the details). Whatever might be your attitude to the "reality" as reflected by the data in Table 4.1, this reality is much better than the theoretical upper bound on $\theta$ appearing in (4.15).

4.2.3 Byproduct on semidefinite relaxation
We are about to present a byproduct, important in its own right, of the reasoning underlying Proposition 4.5. This byproduct is not directly related to Statistics; it relates to the quality of the standard semidefinite relaxation. Specifically, given a quadratic form $x^T C x$ and an ellitope $\mathcal X$ represented by (4.6), consider the problem
$$\mathrm{Opt}_* = \max_{x\in\mathcal X} x^T C x = \max_y\left\{y^T P^T C P y : \exists t \in \mathcal T : y^T R_k y \le t_k,\ k \le K\right\}. \qquad (4.17)$$
This problem can be NP-hard (this is already so when $\mathcal X$ is the unit box and $C$ is a general-type positive semidefinite matrix); however, $\mathrm{Opt}_*$ admits an efficiently computable upper bound given by semidefinite relaxation as follows: whenever $\lambda \ge 0$ is such that
$$P^T C P \preceq \sum_{k=1}^K \lambda_k R_k,$$
for $y \in \bar{\mathcal X} := \{y : \exists t \in \mathcal T : y^T R_k y \le t_k,\ k \le K\}$ we clearly have
$$[Py]^T C P y \le \sum_k \lambda_k y^T R_k y \le \phi_{\mathcal T}(\lambda),$$
where the last $\le$ is due to the fact that the vector with the entries $y^T R_k y$, $1 \le k \le K$, belongs to $\mathcal T$. As a result, the efficiently computable quantity
$$\mathrm{Opt} = \min_\lambda\Big\{\phi_{\mathcal T}(\lambda) : \lambda \ge 0,\ P^T C P \preceq \sum_k \lambda_k R_k\Big\} \qquad (4.18)$$
is an upper bound on $\mathrm{Opt}_*$. We have the following

Proposition 4.6. Let $C$ be a symmetric $n \times n$ matrix and $\mathcal X$ be given by ellitopic representation (4.6), and let $\mathrm{Opt}_*$ and Opt be given by (4.17) and (4.18). Then
$$\frac{\mathrm{Opt}}{3\ln(\sqrt 3 K)} \le \mathrm{Opt}_* \le \mathrm{Opt}. \qquad (4.19)$$

For proof, see Section 4.8.2.
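The mechanism behind the upper bound can be illustrated without an SDP solver. The sketch below is an illustration under stated assumptions, not the relaxation (4.18) itself: it takes $\mathcal X$ to be the unit box in $\mathbf R^4$ (so $P = I$, $R_k = e_k e_k^T$, $\mathcal T = [0,1]^n$ and $\phi_{\mathcal T}(\lambda) = \sum_k \lambda_k$ for $\lambda \ge 0$), picks one particular feasible $\lambda$ by diagonal dominance rather than the optimal one, and compares the resulting bound with $\mathrm{Opt}_*$ computed by brute force. The matrix $C$ and the seed are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 4
G = rng.standard_normal((n, n))
C = G @ G.T          # positive semidefinite, so the max over the box is attained at a vertex

# Exact value Opt_* by enumerating the 2^n vertices of the unit box
opt_star = max(v @ C @ v for v in map(np.array, itertools.product([-1.0, 1.0], repeat=n)))

# A feasible (generally not optimal) lambda: with lambda_k = sum_j |C_kj| the matrix
# Diag(lambda) - C is symmetric and diagonally dominant, hence positive semidefinite
lam = np.abs(C).sum(axis=1)
min_eig = np.linalg.eigvalsh(np.diag(lam) - C).min()

# phi_T(lambda) = max_{t in [0,1]^n} lambda^T t = sum_k lambda_k for lambda >= 0
upper_bound = lam.sum()
print(opt_star, upper_bound, min_eig)
```

Minimizing $\phi_{\mathcal T}(\lambda)$ over all feasible $\lambda$, which requires a semidefinite programming solver, gives the quantity Opt of (4.18); any single feasible $\lambda$, as here, only gives a valid but possibly looser bound.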
4.3 FROM ELLITOPES TO SPECTRATOPES
So far, the domains of signals we dealt with were ellitopes. In this section we demonstrate that basically all our constructions and results can be extended to a much wider family of signal domains, namely, spectratopes.

4.3.1 Spectratopes: Definition and examples

We call a set $\mathcal X \subset \mathbf R^n$ a basic spectratope if it admits a simple spectratopic representation, that is, a representation of the form
$$\mathcal X = \left\{x \in \mathbf R^n : \exists t \in \mathcal T : R_k^2[x] \preceq t_k I_{d_k},\ 1 \le k \le K\right\} \qquad (4.20)$$
where

S.1. $R_k[x] = \sum_{i=1}^n x_i R^{ki}$ are symmetric $d_k \times d_k$ matrices linearly depending on $x \in \mathbf R^n$ (i.e., "matrix coefficients" $R^{ki}$ belong to $\mathbf S^{d_k}$).
S.2. $\mathcal T \subset \mathbf R^K_+$ is a set with the same properties as in the definition of an ellitope, that is, $\mathcal T$ is a convex compact subset of $\mathbf R^K_+$ which contains a positive vector and is monotone: $0 \le t' \le t \in \mathcal T \Rightarrow t' \in \mathcal T$.
S.3. Whenever $x \ne 0$, it holds $R_k[x] \ne 0$ for at least one $k \le K$.

An immediate observation is as follows:

Remark 4.7. By the Schur Complement Lemma, the set (4.20) given by data satisfying S.1-2 can be represented as
$$\mathcal X = \left\{x \in \mathbf R^n : \exists t \in \mathcal T : \begin{bmatrix} t_k I_{d_k} & R_k[x] \\ R_k[x] & I_{d_k}\end{bmatrix} \succeq 0,\ k \le K\right\}.$$
By the latter representation, $\mathcal X$ is nonempty, closed, convex, symmetric w.r.t. the origin, and contains a neighbourhood of the origin. This set is bounded if and only if the data, in addition to S.1-2, satisfies S.3.

A spectratope $\mathcal X \subset \mathbf R^\nu$ is a set represented as a linear image of a basic spectratope:
$$\mathcal X = \{x \in \mathbf R^\nu : \exists (y \in \mathbf R^n, t \in \mathcal T) : x = Py,\ R_k^2[y] \preceq t_k I_{d_k},\ 1 \le k \le K\}, \qquad (4.21)$$
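where $P$ is a $\nu \times n$ matrix, and $R_k[\cdot]$, $\mathcal T$ are as in S.1-3. We associate with a basic spectratope (4.20), S.1-3, the following entities:

1. The size
$$D = \sum_{k=1}^K d_k;$$

2. Linear mappings
$$Q \mapsto \mathcal R_k[Q] = \sum_{i,j} Q_{ij} R^{ki} R^{kj} : \mathbf S^n \to \mathbf S^{d_k}.$$
As is immediately seen, we have
$$\mathcal R_k[xx^T] \equiv R_k^2[x], \qquad (4.22)$$
implying that $\mathcal R_k[Q] \succeq 0$ whenever $Q \succeq 0$, whence $\mathcal R_k[\cdot]$ is $\succeq$-monotone:
$$Q' \succeq Q \Rightarrow \mathcal R_k[Q'] \succeq \mathcal R_k[Q]. \qquad (4.23)$$
Besides this, we have
$$Q \succeq 0 \Rightarrow \mathbf E_{\xi\sim\mathcal N(0,Q)}\{R_k^2[\xi]\} = \mathbf E_{\xi\sim\mathcal N(0,Q)}\{\mathcal R_k[\xi\xi^T]\} = \mathcal R_k[Q], \qquad (4.24)$$
where the first equality is given by (4.22).

3. Linear mappings $\Lambda_k \mapsto \mathcal R_k^*[\Lambda_k] : \mathbf S^{d_k} \to \mathbf S^n$ given by
$$[\mathcal R_k^*[\Lambda_k]]_{ij} = \tfrac12\mathrm{Tr}(\Lambda_k[R^{ki}R^{kj} + R^{kj}R^{ki}]),\quad 1 \le i, j \le n. \qquad (4.25)$$
It is immediately seen that $\mathcal R_k^*[\cdot]$ is the conjugate of $\mathcal R_k[\cdot]$:
$$\langle\Lambda_k, \mathcal R_k[Q]\rangle_F = \mathrm{Tr}(\Lambda_k\mathcal R_k[Q]) = \mathrm{Tr}(\mathcal R_k^*[\Lambda_k]Q) = \langle\mathcal R_k^*[\Lambda_k], Q\rangle_F, \qquad (4.26)$$
where $\langle A, B\rangle_F = \mathrm{Tr}(AB)$ is the Frobenius inner product of symmetric matrices. Besides this, we have
$$\Lambda_k \succeq 0 \Rightarrow \mathcal R_k^*[\Lambda_k] \succeq 0. \qquad (4.27)$$
Indeed, $\mathcal R_k^*[\Lambda_k]$ is linear in $\Lambda_k$, so that it suffices to verify (4.27) for dyadic matrices $\Lambda_k = ff^T$; for such a $\Lambda_k$, (4.25) reads
$$(\mathcal R_k^*[ff^T])_{ij} = [R^{ki}f]^T[R^{kj}f],$$
that is, $\mathcal R_k^*[ff^T]$ is a Gram matrix and as such is $\succeq 0$. Another way to arrive at (4.27) is to note that when $\Lambda_k \succeq 0$ and $Q = xx^T$, the first quantity in (4.26) is nonnegative by (4.22), and therefore (4.26) states that $x^T\mathcal R_k^*[\Lambda_k]x \ge 0$ for every $x$, implying $\mathcal R_k^*[\Lambda_k] \succeq 0$.

4. The linear space $\Lambda^K = \mathbf S^{d_1}\times\cdots\times\mathbf S^{d_K}$ of all ordered collections $\Lambda = \{\Lambda_k \in \mathbf S^{d_k}\}_{k\le K}$ along with the linear mapping
$$\Lambda \mapsto \lambda[\Lambda] := [\mathrm{Tr}(\Lambda_1); \ldots; \mathrm{Tr}(\Lambda_K)] : \Lambda^K \to \mathbf R^K.$$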
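Relations (4.22), (4.26) and (4.27) are easy to check numerically. The sketch below uses hypothetical data: random symmetric coefficients $R^{ki}$ for a single $k$, with sizes `n` and `d` chosen arbitrarily; the function names are illustrative. It implements $R_k[x]$, $\mathcal R_k[Q]$ and $\mathcal R_k^*[\Lambda_k]$ directly from their definitions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4
# Hypothetical data: random symmetric d x d coefficients R^{ki}, i = 1..n (one fixed k)
R = rng.standard_normal((n, d, d))
R = (R + R.transpose(0, 2, 1)) / 2

def R_lin(x):
    """R_k[x] = sum_i x_i R^{ki}."""
    return np.einsum('i,iab->ab', x, R)

def R_cal(Q):
    """Calligraphic R_k[Q] = sum_{i,j} Q_ij R^{ki} R^{kj}."""
    return np.einsum('ij,iab,jbc->ac', Q, R, R)

def R_cal_adj(Lam):
    """[R_k^*[Lam]]_ij = 1/2 Tr(Lam [R^{ki} R^{kj} + R^{kj} R^{ki}]), as in (4.25)."""
    G = np.einsum('ab,ibc,jca->ij', Lam, R, R)   # G_ij = Tr(Lam R^{ki} R^{kj})
    return (G + G.T) / 2

x = rng.standard_normal(n)
Q = rng.standard_normal((n, n)); Q = Q + Q.T     # symmetric test matrix
f = rng.standard_normal((d, d)); Lam = f @ f.T   # positive semidefinite test matrix

gap_422 = np.abs(R_cal(np.outer(x, x)) - R_lin(x) @ R_lin(x)).max()
gap_426 = abs(np.trace(Lam @ R_cal(Q)) - np.trace(R_cal_adj(Lam) @ Q))
min_eig_427 = np.linalg.eigvalsh(R_cal_adj(Lam)).min()
print(gap_422, gap_426, min_eig_427)
```

The two gaps should be at the level of rounding errors, and the smallest eigenvalue of $\mathcal R_k^*[\Lambda_k]$ should be nonnegative up to rounding.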
4.3.1.1 Examples of spectratopes
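Example: Ellitopes. Every ellitope
$$\mathcal X = \{x \in \mathbf R^\nu : \exists (y \in \mathbf R^n, t \in \mathcal T) : x = Py,\ y^T R_k y \le t_k,\ k \le K\}\qquad [R_k \succeq 0,\ \textstyle\sum_k R_k \succ 0]$$
is a spectratope as well. Indeed, let $R_k = \sum_{j=1}^{p_k} r_{kj}r_{kj}^T$, $p_k = \mathrm{Rank}(R_k)$, be a dyadic representation of the positive semidefinite matrix $R_k$, so that
$$y^T R_k y = \sum_j (r_{kj}^T y)^2\quad \forall y,$$
and let
$$\widehat{\mathcal T} = \Big\{\{t_{kj} \ge 0,\ 1 \le j \le p_k,\ 1 \le k \le K\} : \exists t \in \mathcal T : \sum_j t_{kj} \le t_k\Big\},\qquad R_{kj}[y] = r_{kj}^T y \in \mathbf S^1 = \mathbf R.$$
We clearly have
$$\mathcal X = \{x \in \mathbf R^\nu : \exists (\{t_{kj}\} \in \widehat{\mathcal T}, y) : x = Py,\ R_{kj}^2[y] \preceq t_{kj}I_1\ \forall k, j\},$$
and the right-hand side is a legitimate spectratopic representation of $\mathcal X$.

Example: "Matrix box." Let $L$ be a positive definite $d \times d$ matrix. Then the "matrix box"
$$\begin{array}{rcl}\mathcal X &=& \{X \in \mathbf S^d : -L \preceq X \preceq L\} = \{X \in \mathbf S^d : -I_d \preceq L^{-1/2}XL^{-1/2} \preceq I_d\}\\ &=& \{X \in \mathbf S^d : R^2[X] := [L^{-1/2}XL^{-1/2}]^2 \preceq I_d\}\end{array}$$
is a basic spectratope (augment $R_1[\cdot] := R[\cdot]$ with $K = 1$, $\mathcal T = [0, 1]$). As a result, a bounded set $\mathcal X \subset \mathbf R^\nu$ given by a system of "two-sided" Linear Matrix Inequalities, specifically,
$$\mathcal X = \{x \in \mathbf R^\nu : \exists t \in \mathcal T : -\sqrt{t_k}L_k \preceq S_k[x] \preceq \sqrt{t_k}L_k,\ k \le K\}$$
where $S_k[x]$ are symmetric $d_k \times d_k$ matrices linearly depending on $x$, $L_k \succ 0$, and $\mathcal T$ satisfies S.2, is a basic spectratope:
$$\mathcal X = \{x \in \mathbf R^\nu : \exists t \in \mathcal T : R_k^2[x] \preceq t_k I_{d_k},\ k \le K\}\qquad [R_k[x] = L_k^{-1/2}S_k[x]L_k^{-1/2}].$$
Like ellitopes, spectratopes admit fully algorithmic calculus; see Section 4.6.

4.3.2 Semidefinite relaxation on spectratopes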
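Now let us extend Proposition 4.6 to our current situation. The extension reads as follows:

Proposition 4.8. Let $C$ be a symmetric $n \times n$ matrix and $\mathcal X$ be given by spectratopic representation
$$\mathcal X = \{x \in \mathbf R^n : \exists y \in \mathbf R^\mu, t \in \mathcal T : x = Py,\ R_k^2[y] \preceq t_k I_{d_k},\ k \le K\}, \qquad (4.28)$$
let
$$\mathrm{Opt}_* = \max_{x\in\mathcal X} x^T C x,$$
and let
$$\mathrm{Opt} = \min_{\Lambda = \{\Lambda_k\}_{k\le K}}\Big\{\phi_{\mathcal T}(\lambda[\Lambda]) : \Lambda_k \succeq 0,\ P^T C P \preceq \sum_k \mathcal R_k^*[\Lambda_k]\Big\}\qquad [\lambda[\Lambda] = [\mathrm{Tr}(\Lambda_1); \ldots; \mathrm{Tr}(\Lambda_K)]]. \qquad (4.29)$$
Then (4.29) is solvable, and
$$\mathrm{Opt}_* \le \mathrm{Opt} \le 2\max[\ln(2D), 1]\,\mathrm{Opt}_*,\quad D = \sum_k d_k. \qquad (4.30)$$

Let us verify the easy and instructive part of the proposition, namely, the left inequality in (4.30); the remaining claims will be proved in Section 4.8.3. The left inequality in (4.30) is readily given by the following

Lemma 4.9. Let $\mathcal X$ be spectratope (4.28) and $Q \in \mathbf S^n$. Whenever $\Lambda_k \in \mathbf S^{d_k}_+$ satisfy
$$P^T Q P \preceq \sum_k \mathcal R_k^*[\Lambda_k],$$
for all $x \in \mathcal X$ we have
$$x^T Q x \le \phi_{\mathcal T}(\lambda[\Lambda]),\quad \lambda[\Lambda] = [\mathrm{Tr}(\Lambda_1); \ldots; \mathrm{Tr}(\Lambda_K)].$$

Proof of the lemma: Let $x \in \mathcal X$, so that for some $t \in \mathcal T$ and $y$ it holds
$$x = Py,\quad R_k^2[y] \preceq t_k I_{d_k}\ \forall k \le K.$$
Consequently,
$$\begin{array}{rcll} x^T Q x &=& y^T P^T Q P y \le y^T \sum_k \mathcal R_k^*[\Lambda_k] y = \sum_k \mathrm{Tr}(\mathcal R_k^*[\Lambda_k][yy^T]) &\\ &=& \sum_k \mathrm{Tr}(\Lambda_k \mathcal R_k[yy^T]) & \text{[by (4.26)]}\\ &=& \sum_k \mathrm{Tr}(\Lambda_k R_k^2[y]) & \text{[by (4.22)]}\\ &\le& \sum_k t_k \mathrm{Tr}(\Lambda_k I_{d_k}) & \text{[since } \Lambda_k \succeq 0 \text{ and } R_k^2[y] \preceq t_k I_{d_k}]\\ &\le& \phi_{\mathcal T}(\lambda[\Lambda]). & \Box\end{array}$$

4.3.3 Linear estimates beyond ellitopic signal sets and $\|\cdot\|_2$-risk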
In Section 4.2, we have developed a computationally efficient scheme for building "presumably good" linear estimates of the linear image $Bx$ of an unknown signal $x$ known to belong to a given ellitope $\mathcal X$ in the case when the (squared) risk is defined as the worst-case, w.r.t. $x \in \mathcal X$, expected squared Euclidean norm $\|\cdot\|_2^2$ of the recovery error. We are about to extend these results to the case when $\mathcal X$ is a spectratope, and the norm used to measure the recovery error, while not being completely arbitrary, is not necessarily $\|\cdot\|_2$. Besides this, in what follows we also relax our assumptions on observation noise.
4.3.3.1 Situation and goal
We consider the problem of recovering the image $Bx \in \mathbf R^\nu$ of a signal $x \in \mathbf R^n$ known to belong to a given spectratope
$$\mathcal X = \{x \in \mathbf R^n : \exists t \in \mathcal T : R_k^2[x] \preceq t_k I_{d_k},\ 1 \le k \le K\}$$
from noisy observation
$$\omega = Ax + \xi, \qquad (4.31)$$
where $A$ is a known $m \times n$ matrix, and $\xi$ is random observation noise.

Observation noise. In typical signal processing applications, the distribution of noise is fixed and is a part of the data of the estimation problem. In order to cover some applications (e.g., the one in Section 4.3.3.7), we allow for "ambiguous" noise distributions; all we know is that this distribution belongs to a family $\mathcal P$ of Borel probability distributions on $\mathbf R^m$ associated with a given convex compact subset $\Pi$ of the interior of the cone $\mathbf S^m_+$ of positive semidefinite $m \times m$ matrices, "association" meaning that the matrix of second moments of every distribution $P \in \mathcal P$ is $\preceq$-dominated by a matrix from $\Pi$:
$$P \in \mathcal P \Rightarrow \exists Q \in \Pi : \mathrm{Var}[P] := \mathbf E_{\xi\sim P}\{\xi\xi^T\} \preceq Q. \qquad (4.32)$$
The actual distribution of noise in (4.31) is selected from $\mathcal P$ by nature (and may, e.g., depend on $x$). In the sequel, for a probability distribution $P$ on $\mathbf R^m$ we write $P \lhd \Pi$ to express the fact that the matrix of second moments of $P$ is $\preceq$-dominated by a matrix from $\Pi$:
$$\{P \lhd \Pi\} \Leftrightarrow \{\exists \Theta \in \Pi : \mathrm{Var}[P] \preceq \Theta\}.$$

Quantifying risk. Given $\Pi$ and a norm $\|\cdot\|$ on $\mathbf R^\nu$, we quantify the quality of a candidate estimate $\widehat x(\cdot) : \mathbf R^m \to \mathbf R^\nu$ by its $(\Pi, \|\cdot\|)$-risk on $\mathcal X$ defined as
$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat x|\mathcal X] = \sup_{x\in\mathcal X,\,P\lhd\Pi} \mathbf E_{\xi\sim P}\{\|\widehat x(Ax + \xi) - Bx\|\}.$$

Goal. As before, our focus is on linear estimates, that is, estimates of the form
$$\widehat x_H(\omega) = H^T\omega$$
given by $m \times \nu$ matrices $H$. Our goal is to demonstrate that under some restrictions on the signal domain $\mathcal X$, a "presumably good" linear estimate yielded by an optimal solution to an efficiently solvable convex optimization problem is near-optimal in terms of its risk among all estimates, linear and nonlinear alike.

4.3.3.2 Assumptions
Preliminaries: Conjugate norms. Recall that a norm $\|\cdot\|$ on a Euclidean space $E$, e.g., on $\mathbf R^k$, gives rise to its conjugate norm
$$\|y\|_* = \max_x\{\langle y, x\rangle : \|x\| \le 1\},$$
where $\langle\cdot,\cdot\rangle$ is the inner product in $E$. Equivalently, $\|\cdot\|_*$ is the smallest norm such that
$$\langle x, y\rangle \le \|x\|\|y\|_*\quad \forall x, y. \qquad (4.33)$$
It is well known that taken twice, norm conjugation recovers the initial norm: $(\|\cdot\|_*)_*$ is exactly $\|\cdot\|$; in other words,
$$\|x\| = \max_y\{\langle x, y\rangle : \|y\|_* \le 1\}.$$
The standard examples are the conjugates to the standard $\ell_p$-norms on $E = \mathbf R^k$, $p \in [1, \infty]$: it turns out that
$$(\|\cdot\|_p)_* = \|\cdot\|_{p_*},$$
where $p_* \in [1, \infty]$ is linked to $p \in [1, \infty]$ by the symmetric relation
$$\frac1p + \frac1{p_*} = 1,$$
so that $1_* = \infty$, $\infty_* = 1$, $2_* = 2$. The corresponding version of inequality (4.33) is called the Hölder inequality, an extension of the Cauchy-Schwarz inequality dealing with the case $\|\cdot\| = \|\cdot\|_* = \|\cdot\|_2$.

Assumptions. From now on we make the following assumptions:

Assumption A: The unit ball $\mathcal B_*$ of the norm $\|\cdot\|_*$ conjugate to the norm $\|\cdot\|$ in the formulation of our estimation problem is a spectratope:
$$\mathcal B_* = \{z \in \mathbf R^\nu : \exists y \in \mathcal Y : z = My\},\quad \mathcal Y := \{y \in \mathbf R^q : \exists r \in \mathcal R : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ 1 \le \ell \le L\}, \qquad (4.34)$$
where the right-hand side data are as required in a spectratopic representation.

Note that Assumption A is satisfied when $\|\cdot\| = \|\cdot\|_p$ with $p \in [1, 2]$: in this case,
$$\mathcal B_* = \{u \in \mathbf R^\nu : \|u\|_{p_*} \le 1\},\quad p_* = \frac{p}{p-1} \in [2, \infty],$$
so that $\mathcal B_*$ is an ellitope (see Section 4.2.1.1) and thus is a spectratope. Another potentially useful example of a norm $\|\cdot\|$ which obeys Assumption A is the nuclear norm $\|V\|_{\mathrm{Sh},1}$ on the space $\mathbf R^\nu = \mathbf R^{p\times q}$ of $p \times q$ matrices, that is, the sum of singular values of a matrix $V$. In this case the conjugate norm is the spectral norm $\|\cdot\|_{2,2}$ on $\mathbf R^\nu = \mathbf R^{p\times q}$, and the unit ball of the latter norm is a spectratope:
$$\{X \in \mathbf R^{p\times q} : \|X\|_{2,2} \le 1\} = \{X : \exists t \in \mathcal T = [0, 1] : R^2[X] \preceq tI_{p+q}\},\quad R[X] = \begin{bmatrix} & X^T\\ X & \end{bmatrix}.$$

Besides Assumption A, we make

Assumption B: The signal set $\mathcal X$ is a basic spectratope:
$$\mathcal X = \{x \in \mathbf R^n : \exists t \in \mathcal T : R_k^2[x] \preceq t_k I_{d_k},\ 1 \le k \le K\},$$
where the right-hand side data are as required in a spectratopic representation.

Note: Similarly to what we have observed in Section 4.2.1.3 in the case of ellitopes, the situation where the signal set is a general type spectratope can be straightforwardly reduced to the one where $\mathcal X$ is a basic spectratope.

In addition we make the following regularity assumption:

Assumption R: All matrices from $\Pi$ are positive definite.

4.3.3.3 Building linear estimate
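Let $H \in \mathbf R^{m\times\nu}$. We clearly have
$$\begin{array}{rcl}\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat x_H(\cdot)|\mathcal X] &=& \sup_{x\in\mathcal X,\,P\lhd\Pi} \mathbf E_{\xi\sim P}\{\|[B - H^TA]x - H^T\xi\|\}\\ &\le& \sup_{x\in\mathcal X}\|[B - H^TA]x\| + \sup_{P\lhd\Pi}\mathbf E_{\xi\sim P}\{\|H^T\xi\|\}\\ &=& \|B - H^TA\|_{\mathcal X,\|\cdot\|} + \Psi_\Pi(H),\end{array} \qquad (4.35)$$
where
$$\begin{array}{rcl}\|V\|_{\mathcal X,\|\cdot\|} &=& \max_x\{\|Vx\| : x \in \mathcal X\} : \mathbf R^{\nu\times n} \to \mathbf R,\\ \Psi_\Pi(H) &=& \sup_{P\lhd\Pi}\mathbf E_{\xi\sim P}\{\|H^T\xi\|\}.\end{array}$$
As in Section 4.2.2, we need to derive efficiently computable convex upper bounds on the norm $\|\cdot\|_{\mathcal X,\|\cdot\|}$ and the function $\Psi_\Pi$, which by themselves, while being convex, can be difficult to compute.

4.3.3.4 Upper-bounding $\|\cdot\|_{\mathcal X,\|\cdot\|}$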
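With Assumptions A, B in force, consider the spectratope
$$\begin{array}{rcl}\mathcal Z &:=& \mathcal X \times \mathcal Y = \{[x; y] \in \mathbf R^n \times \mathbf R^q : \exists s = [t; r] \in \mathcal T \times \mathcal R : R_k^2[x] \preceq t_k I_{d_k},\ 1 \le k \le K,\ S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ 1 \le \ell \le L\}\\ &=& \{w = [x; y] \in \mathbf R^n \times \mathbf R^q : \exists s = [t; r] \in \mathcal S = \mathcal T \times \mathcal R : U_i^2[w] \preceq s_i I_{g_i},\ 1 \le i \le I = K + L\}\end{array}$$
with $U_i[\cdot]$ readily given by $R_k[\cdot]$ and $S_\ell[\cdot]$. Given a $\nu \times n$ matrix $V$ and setting
$$W[V] = \frac12\begin{bmatrix} & V^TM\\ M^TV & \end{bmatrix}$$
we have
$$\|V\|_{\mathcal X,\|\cdot\|} = \max_{x\in\mathcal X}\|Vx\| = \max_{x\in\mathcal X,\,z\in\mathcal B_*} z^TVx = \max_{x\in\mathcal X,\,y\in\mathcal Y} y^TM^TVx = \max_{w\in\mathcal Z} w^TW[V]w.$$
Applying Proposition 4.8, we arrive at the following

Corollary 4.10. In the situation just defined, the efficiently computable convex function
$$\|V\|^+_{\mathcal X,\|\cdot\|} = \min_{\Lambda,\Upsilon}\left\{\phi_{\mathcal T}(\lambda[\Lambda]) + \phi_{\mathcal R}(\lambda[\Upsilon]) : \begin{array}{l}\Lambda = \{\Lambda_k \in \mathbf S^{d_k}_+\}_{k\le K},\ \Upsilon = \{\Upsilon_\ell \in \mathbf S^{f_\ell}_+\}_{\ell\le L},\\[2pt] \begin{bmatrix}\sum_k \mathcal R_k^*[\Lambda_k] & \frac12V^TM\\ \frac12M^TV & \sum_\ell \mathcal S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0\end{array}\right\} \qquad (4.36)$$
$$\left[\begin{array}{l}\phi_{\mathcal T}(\lambda) = \max_{t\in\mathcal T}\lambda^Tt,\ \phi_{\mathcal R}(\lambda) = \max_{r\in\mathcal R}\lambda^Tr,\ \lambda[\{\Xi_1, \ldots, \Xi_N\}] = [\mathrm{Tr}(\Xi_1); \ldots; \mathrm{Tr}(\Xi_N)],\\ {[\mathcal R_k^*[\Lambda_k]]}_{ij} = \frac12\mathrm{Tr}(\Lambda_k[R^{ki}R^{kj} + R^{kj}R^{ki}]),\ \text{where } R_k[x] = \sum_i x_iR^{ki},\\ {[\mathcal S_\ell^*[\Upsilon_\ell]]}_{ij} = \frac12\mathrm{Tr}(\Upsilon_\ell[S^{\ell i}S^{\ell j} + S^{\ell j}S^{\ell i}]),\ \text{where } S_\ell[y] = \sum_i y_iS^{\ell i}\end{array}\right]$$
is a norm on $\mathbf R^{\nu\times n}$, and this norm is a tight upper bound on $\|\cdot\|_{\mathcal X,\|\cdot\|}$, namely,
$$\forall V \in \mathbf R^{\nu\times n} : \|V\|_{\mathcal X,\|\cdot\|} \le \|V\|^+_{\mathcal X,\|\cdot\|} \le 2\max[\ln(2D), 1]\,\|V\|_{\mathcal X,\|\cdot\|},\quad D = \sum_k d_k + \sum_\ell f_\ell.$$

4.3.3.5 Upper-bounding $\Psi_\Pi(\cdot)$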
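The next step is to derive an efficiently computable convex upper bound on the function $\Psi_\Pi$ stemming from a norm obeying Assumption A. The underlying observation is as follows:

Lemma 4.11. Let $V$ be an $m \times \nu$ matrix, $Q \in \mathbf S^m_+$, and $P$ be a probability distribution on $\mathbf R^m$ with $\mathrm{Var}[P] \preceq Q$. Let, further, $\|\cdot\|$ be a norm on $\mathbf R^\nu$ with the unit ball $\mathcal B_*$ of the conjugate norm $\|\cdot\|_*$ given by (4.34). Finally, let $\Upsilon = \{\Upsilon_\ell \in \mathbf S^{f_\ell}_+\}_{\ell\le L}$ and a matrix $\Theta \in \mathbf S^m$ satisfy the constraint
$$\begin{bmatrix}\Theta & \frac12VM\\ \frac12M^TV^T & \sum_\ell \mathcal S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0 \qquad (4.37)$$
(for notation, see (4.34), (4.36)). Then
$$\mathbf E_{\eta\sim P}\{\|V^T\eta\|\} \le \mathrm{Tr}(Q\Theta) + \phi_{\mathcal R}(\lambda[\Upsilon]). \qquad (4.38)$$

Proof is immediate. In the case of (4.37), we have
$$\begin{array}{rcll}\|V^T\xi\| &=& \max_{z\in\mathcal B_*} z^TV^T\xi = \max_{y\in\mathcal Y} y^TM^TV^T\xi & \text{[by (4.34)]}\\ &\le& \max_{y\in\mathcal Y}\left[\xi^T\Theta\xi + \sum_\ell y^T\mathcal S_\ell^*[\Upsilon_\ell]y\right] & \text{[by (4.37)]}\\ &=& \max_{y\in\mathcal Y}\left[\xi^T\Theta\xi + \sum_\ell \mathrm{Tr}(\mathcal S_\ell^*[\Upsilon_\ell]yy^T)\right] &\\ &=& \max_{y\in\mathcal Y}\left[\xi^T\Theta\xi + \sum_\ell \mathrm{Tr}(\Upsilon_\ell S_\ell^2[y])\right] & \text{[by (4.22) and (4.26)]}\\ &=& \xi^T\Theta\xi + \max_{y,r}\left\{\sum_\ell \mathrm{Tr}(\Upsilon_\ell S_\ell^2[y]) : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell \le L,\ r \in \mathcal R\right\} &\\ &\le& \xi^T\Theta\xi + \max_{r\in\mathcal R}\sum_\ell \mathrm{Tr}(\Upsilon_\ell)r_\ell & \text{[by } \Upsilon_\ell \succeq 0]\\ &\le& \xi^T\Theta\xi + \phi_{\mathcal R}(\lambda[\Upsilon]). &\end{array}$$
Taking the expectation of both sides of the resulting inequality w.r.t. distribution $P$ of $\xi$ and taking into account that $\mathrm{Tr}(\mathrm{Var}[P]\Theta) \le \mathrm{Tr}(Q\Theta)$ due to $\Theta \succeq 0$ (by (4.37)) and $\mathrm{Var}[P] \preceq Q$, we get (4.38). $\Box$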
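Note that when $P = \mathcal N(0, Q)$, the smallest upper bound on $\mathbf E_{\eta\sim P}\{\|V^T\eta\|\}$ which can be extracted from Lemma 4.11 (this bound is efficiently computable) is tight; see Lemma 4.17 below. An immediate consequence of the bound in Lemma 4.11 is:

Corollary 4.12. Let
$$\Gamma(\Theta) = \max_{Q\in\Pi}\mathrm{Tr}(Q\Theta) \qquad (4.39)$$
and
$$\overline\Psi_\Pi(H) = \min_{\{\Upsilon_\ell\}_{\ell\le L},\,\Theta\in\mathbf S^m}\left\{\Gamma(\Theta) + \phi_{\mathcal R}(\lambda[\Upsilon]) : \Upsilon_\ell \succeq 0\ \forall\ell,\ \begin{bmatrix}\Theta & \frac12HM\\ \frac12M^TH^T & \sum_\ell \mathcal S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\}. \qquad (4.40)$$
Then $\overline\Psi_\Pi(\cdot) : \mathbf R^{m\times\nu} \to \mathbf R$ is an efficiently computable convex upper bound on $\Psi_\Pi(\cdot)$.

Indeed, given Lemma 4.11, the only non-evident part of the corollary is that $\overline\Psi_\Pi(\cdot)$ is a well-defined real-valued function, which is readily given by Lemma 4.44 stating, in particular, that the optimization problem in (4.40) is feasible, combined with the fact that the objective is coercive on the feasible set (i.e., is not bounded from above along every unbounded sequence of feasible solutions).

Remark 4.13. When $\Upsilon = \{\Upsilon_\ell\}_{\ell\le L}$, $\Theta$ is a feasible solution to the right-hand side problem in (4.40) and $s > 0$, the pair $\Upsilon' = \{s\Upsilon_\ell\}_{\ell\le L}$, $\Theta' = s^{-1}\Theta$ also is a feasible solution. Since $\phi_{\mathcal R}(\cdot)$ and $\Gamma(\cdot)$ are positive homogeneous of degree 1, we conclude that $\overline\Psi_\Pi$ is in fact the infimum of the function
$$2\sqrt{\Gamma(\Theta)\phi_{\mathcal R}(\lambda[\Upsilon])} = \inf_{s>0}\left[s^{-1}\Gamma(\Theta) + s\phi_{\mathcal R}(\lambda[\Upsilon])\right]$$
over $\Upsilon$, $\Theta$ satisfying the constraints of the problem (4.40). In addition, for every feasible solution $\Upsilon = \{\Upsilon_\ell\}_{\ell\le L}$, $\Theta$ to (4.40) with $\mathcal M[\Upsilon] := \sum_\ell \mathcal S_\ell^*[\Upsilon_\ell] \succ 0$, the pair $\Upsilon$, $\widehat\Theta = \frac14HM\mathcal M^{-1}[\Upsilon]M^TH^T$ is feasible for the problem as well, and $0 \preceq \widehat\Theta \preceq \Theta$ (Schur Complement Lemma), so that $\Gamma(\widehat\Theta) \le \Gamma(\Theta)$. As a result,
$$\overline\Psi_\Pi(H) = \inf_\Upsilon\left\{\tfrac14\Gamma(HM\mathcal M^{-1}[\Upsilon]M^TH^T) + \phi_{\mathcal R}(\lambda[\Upsilon]) : \Upsilon = \{\Upsilon_\ell \in \mathbf S^{f_\ell}_+\}_{\ell\le L},\ \mathcal M[\Upsilon] \succ 0\right\}. \qquad (4.41)$$

Illustration. Suppose that $\|u\| = \|u\|_p$ with $p \in [1, 2]$, and let us apply the just described scheme for upper-bounding $\Psi_\Pi$, assuming $\{Q\} \subset \Pi \subset \{S \in \mathbf S^m_+ : S \preceq Q\}$ for some given $Q \succ 0$, so that $\Gamma(\Theta) = \mathrm{Tr}(Q\Theta)$, $\Theta \succeq 0$. The unit ball of the norm conjugate to $\|\cdot\|$, that is, the norm $\|\cdot\|_q$, $q = \frac{p}{p-1} \in [2, \infty]$, is the basic spectratope (in fact, ellitope)
$$\mathcal B_* = \{y \in \mathbf R^\nu : \exists r \in \mathcal R := \{r \in \mathbf R^\nu_+ : \|r\|_{q/2} \le 1\} : S_\ell^2[y] \le r_\ell,\ 1 \le \ell \le L = \nu\},\quad S_\ell[y] = y_\ell.$$
As a result, $\Upsilon$'s from Remark 4.13 are collections of $\nu$ positive semidefinite $1 \times 1$ matrices, and we can identify them with $\nu$-dimensional nonnegative vectors $\upsilon$, resulting in $\lambda[\Upsilon] = \upsilon$ and $\mathcal M[\Upsilon] = \mathrm{Diag}\{\upsilon\}$. Furthermore, for nonnegative $\upsilon$ we clearly have $\phi_{\mathcal R}(\upsilon) = \|\upsilon\|_{p/(2-p)}$, so the optimization problem in (4.41) now reads
$$\overline\Psi_\Pi(H) = \inf_{\upsilon\in\mathbf R^\nu}\left\{\tfrac14\mathrm{Tr}(V\mathrm{Diag}^{-1}\{\upsilon\}V^T) + \|\upsilon\|_{p/(2-p)} : \upsilon > 0\right\}\qquad [V = Q^{1/2}H],$$
and when setting $a_\ell = \|\mathrm{Col}_\ell[V]\|_2$, (4.41) becomes
$$\overline\Psi_\Pi(H) = \inf_{\upsilon>0}\left\{\frac14\sum_\ell\frac{a_\ell^2}{\upsilon_\ell} + \|\upsilon\|_{p/(2-p)}\right\}.$$
This results in $\overline\Psi_\Pi(H) = \|[a_1; \ldots; a_\nu]\|_p$. Recalling what $a_\ell$ and $V$ are, we end up with
$$\forall P,\ \mathrm{Var}[P] \preceq Q : \quad \mathbf E_{\xi\sim P}\{\|H^T\xi\|\} \le \overline\Psi_\Pi(H) := \left\|\left[\|\mathrm{Row}_1[H^TQ^{1/2}]\|_2; \ldots; \|\mathrm{Row}_\nu[H^TQ^{1/2}]\|_2\right]\right\|_p.$$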
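As a numerical sanity check of the last step, the infimum over $\upsilon > 0$ can be compared with $\|a\|_p$ directly. The sketch below is not part of the original development: the function names are hypothetical, the vector `a` is random test data, and the closed-form candidate $\upsilon_\ell = \frac12\big(\sum_j a_j^p\big)^{(p-1)/p}a_\ell^{2-p}$ is an assumption obtained from first-order optimality conditions and checked here only numerically, for positive $a_\ell$.

```python
import numpy as np

def objective(upsilon, a, p):
    """0.25 * sum_l a_l^2 / upsilon_l + ||upsilon||_{p/(2-p)}, for upsilon > 0, p in [1, 2]."""
    r = np.inf if p == 2 else p / (2.0 - p)
    return 0.25 * np.sum(a**2 / upsilon) + np.linalg.norm(upsilon, ord=r)

def candidate_minimizer(a, p):
    """Assumed minimizer: upsilon_l = 0.5 * (sum_j a_j^p)^((p-1)/p) * a_l^(2-p), a > 0."""
    s = np.sum(a**p)
    return 0.5 * s ** ((p - 1.0) / p) * a ** (2.0 - p)

rng = np.random.default_rng(0)
a = rng.uniform(0.5, 2.0, size=6)          # hypothetical column norms a_l > 0
for p in (1.0, 1.5, 2.0):
    best = objective(candidate_minimizer(a, p), a, p)
    trial = min(objective(rng.uniform(0.05, 3.0, size=a.size), a, p) for _ in range(2000))
    print(p, best, np.linalg.norm(a, ord=p), trial)
```

For each $p$ the value at the candidate point should coincide with $\|a\|_p$ up to rounding, and no randomly drawn $\upsilon$ should do better, in agreement with $\overline\Psi_\Pi(H) = \|[a_1; \ldots; a_\nu]\|_p$.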
This result is quite transparent and could easily be obtained directly. Indeed, when $\mathrm{Var}[P] \preceq Q$ and $\xi \sim P$, the vector $\zeta = H^T\xi$ clearly satisfies $\mathbf E\{\zeta_i^2\} \le \sigma_i^2 := \|\mathrm{Row}_i[H^TQ^{1/2}]\|_2^2$, implying, due to $p \in [1, 2]$, that $\mathbf E\{\sum_i|\zeta_i|^p\} \le \sum_i\sigma_i^p$, whence $\mathbf E\{\|\zeta\|_p\} \le \|[\sigma_1; \ldots; \sigma_\nu]\|_p$.

4.3.3.6 Putting things together
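An immediate outcome of Corollaries 4.10 and 4.12 is the following recipe for building a "presumably good" linear estimate:

Proposition 4.14. In the situation of Section 4.3.3.1 and under Assumptions A, B, and R (see Section 4.3.3.2) consider the convex optimization problem (for notation, see (4.36) and (4.39))
$$\mathrm{Opt} = \min_{H,\Lambda,\Upsilon,\Upsilon',\Theta}\left\{\phi_{\mathcal T}(\lambda[\Lambda]) + \phi_{\mathcal R}(\lambda[\Upsilon]) + \phi_{\mathcal R}(\lambda[\Upsilon']) + \Gamma(\Theta) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\\[2pt] \begin{bmatrix}\sum_k \mathcal R_k^*[\Lambda_k] & \frac12[B^T - A^TH]M\\ \frac12M^T[B - H^TA] & \sum_\ell \mathcal S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\[2pt] \begin{bmatrix}\Theta & \frac12HM\\ \frac12M^TH^T & \sum_\ell \mathcal S_\ell^*[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}. \qquad (4.42)$$
The problem is solvable, and the $H$-component $H_*$ of its optimal solution yields linear estimate $\widehat x_{H_*}(\omega) = H_*^T\omega$ such that
$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat x_{H_*}(\cdot)|\mathcal X] \le \mathrm{Opt}. \qquad (4.43)$$

Note that the only claim in Proposition 4.14 which is not an immediate consequence of Corollaries 4.10 and 4.12 is that problem (4.42) is solvable; this fact is readily given by the feasibility of the problem (by Lemma 4.44) and the coerciveness of the objective on the feasible set (recall that $\Gamma(\Theta)$ is coercive on $\mathbf S^m_+$ due to $\Pi \subset \mathrm{int}\,\mathbf S^m_+$ and that $y \mapsto My$ is an onto mapping, since $\mathcal B_*$ is full-dimensional).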
4.3.3.7 Illustration: Covariance matrix estimation
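Suppose that we observe a sample
$$\eta^T = \{\eta_k = A\xi_k\}_{k\le T} \qquad (4.44)$$
where $A$ is a given $m \times n$ matrix, and $\xi_1, \ldots, \xi_T$ are sampled, independently of each other, from a zero mean Gaussian distribution with unknown covariance matrix $\vartheta$ known to satisfy
$$\gamma\vartheta_* \preceq \vartheta \preceq \vartheta_*, \qquad (4.45)$$
where $\gamma \ge 0$ and $\vartheta_* \succ 0$ are given. Our goal is to recover $\vartheta$, and the norm on $\mathbf S^n$ in which the recovery error is measured satisfies Assumption A.

Processing the problem. We can process the problem just outlined as follows.

1. We represent the set $\{\vartheta \in \mathbf S^n_+ : \gamma\vartheta_* \preceq \vartheta \preceq \vartheta_*\}$ as the image of the matrix box
$$\mathcal V = \{v \in \mathbf S^n : \|v\|_{2,2} \le 1\}\qquad [\|\cdot\|_{2,2}: \text{spectral norm}]$$
under affine mapping; specifically, we set
$$\vartheta_0 = \frac{1+\gamma}{2}\vartheta_*,\quad \sigma = \frac{1-\gamma}{2}$$
and treat the matrix
$$v = \sigma^{-1}\vartheta_*^{-1/2}(\vartheta - \vartheta_0)\vartheta_*^{-1/2}\qquad \left[\Leftrightarrow \vartheta = \vartheta_0 + \sigma\vartheta_*^{1/2}v\vartheta_*^{1/2}\right]$$
as the signal underlying our observations. Note that our a priori information on $\vartheta$ reduces to $v \in \mathcal V$.

2. We pass from observations $\eta_k$ to "lifted" observations $\eta_k\eta_k^T \in \mathbf S^m$, so that
$$\mathbf E\{\eta_k\eta_k^T\} = \mathbf E\{A\xi_k\xi_k^TA^T\} = A\vartheta A^T = A\underbrace{(\vartheta_0 + \sigma\vartheta_*^{1/2}v\vartheta_*^{1/2})}_{\vartheta[v]}A^T,$$
and treat as "actual" observations the matrices
$$\omega_k = \eta_k\eta_k^T - A\vartheta_0A^T.$$
We have⁸
$$\omega_k = \mathcal Av + \zeta_k\ \text{ with }\ \mathcal Av = \sigma A\vartheta_*^{1/2}v\vartheta_*^{1/2}A^T\ \text{ and }\ \zeta_k = \eta_k\eta_k^T - A\vartheta[v]A^T. \qquad (4.46)$$
Observe that random matrices $\zeta_1, \ldots, \zeta_T$ are i.i.d. with zero mean and covariance mapping $\mathcal Q[v]$ (that of the random matrix-valued variable $\zeta = \eta\eta^T - \mathbf E\{\eta\eta^T\}$, $\eta \sim \mathcal N(0, A\vartheta[v]A^T)$).

⁸ In our current considerations, we need to operate with linear mappings acting from $\mathbf S^p$ to $\mathbf S^q$. We treat $\mathbf S^k$ as a Euclidean space equipped with the Frobenius inner product $\langle u, v\rangle = \mathrm{Tr}(uv)$ and denote linear mappings from $\mathbf S^p$ into $\mathbf S^q$ by capital calligraphic letters, like $\mathcal A$, $\mathcal Q$, etc. Thus, $\mathcal A$ in (4.46) denotes the linear mapping which, on closer inspection, maps matrix $v \in \mathbf S^n$ into the matrix $\mathcal Av = A[\vartheta[v] - \vartheta[0]]A^T$.

3. Let us $\succeq$-upper-bound the covariance mapping of $\zeta$. Observe that $\mathcal Q[v]$ is a symmetric linear mapping of $\mathbf S^m$ into itself given by
$$\langle h, \mathcal Q[v]h\rangle = \mathbf E\{\langle h, \zeta\rangle^2\} = \mathbf E\{\langle h, \eta\eta^T\rangle^2\} - \langle h, \mathbf E\{\eta\eta^T\}\rangle^2,\quad h \in \mathbf S^m.$$
Given $v \in \mathcal V$, let us set $\theta = \vartheta[v]$, so that $0 \preceq \theta \preceq \vartheta_*$, and let $\mathcal H(h) = \theta^{1/2}A^ThA\theta^{1/2}$. We have
$$\begin{array}{rcl}\langle h, \mathcal Q[v]h\rangle &=& \mathbf E_{\xi\sim\mathcal N(0,\theta)}\{\mathrm{Tr}^2(hA\xi\xi^TA^T)\} - \mathrm{Tr}^2(h\,\mathbf E_{\xi\sim\mathcal N(0,\theta)}\{A\xi\xi^TA^T\})\\ &=& \mathbf E_{\chi\sim\mathcal N(0,I_n)}\{\mathrm{Tr}^2(hA\theta^{1/2}\chi\chi^T\theta^{1/2}A^T)\} - \mathrm{Tr}^2(hA\theta A^T)\\ &=& \mathbf E_{\chi\sim\mathcal N(0,I_n)}\{(\chi^T\mathcal H(h)\chi)^2\} - \mathrm{Tr}^2(\mathcal H(h)).\end{array}$$
We have $\mathcal H(h) = U\mathrm{Diag}\{\lambda\}U^T$ with orthogonal $U$, so that
$$\begin{array}{rcl}\mathbf E_{\chi\sim\mathcal N(0,I_n)}\{(\chi^T\mathcal H(h)\chi)^2\} - \mathrm{Tr}^2(\mathcal H(h)) &=& \mathbf E_{\bar\chi := U^T\chi\sim\mathcal N(0,I_n)}\{(\bar\chi^T\mathrm{Diag}\{\lambda\}\bar\chi)^2\} - (\sum_i\lambda_i)^2\\ &=& \mathbf E_{\bar\chi\sim\mathcal N(0,I_n)}\{(\sum_i\lambda_i\bar\chi_i^2)^2\} - (\sum_i\lambda_i)^2 = \sum_{i\ne j}\lambda_i\lambda_j + 3\sum_i\lambda_i^2 - (\sum_i\lambda_i)^2\\ &=& 2\sum_i\lambda_i^2 = 2\mathrm{Tr}([\mathcal H(h)]^2).\end{array}$$
Thus,
$$\begin{array}{rcl}\langle h, \mathcal Q[v]h\rangle &=& 2\mathrm{Tr}([\mathcal H(h)]^2) = 2\mathrm{Tr}(\theta^{1/2}A^ThA\theta A^ThA\theta^{1/2})\\ &\le& 2\mathrm{Tr}(\theta^{1/2}A^ThA\vartheta_*A^ThA\theta^{1/2})\quad [\text{since } 0 \preceq \theta \preceq \vartheta_*]\\ &=& 2\mathrm{Tr}(\vartheta_*^{1/2}A^ThA\theta A^ThA\vartheta_*^{1/2}) \le 2\mathrm{Tr}(\vartheta_*^{1/2}A^ThA\vartheta_*A^ThA\vartheta_*^{1/2})\\ &=& 2\mathrm{Tr}(\vartheta_*A^ThA\vartheta_*A^ThA).\end{array}$$
We conclude that
$$\forall v \in \mathcal V : \mathcal Q[v] \preceq \mathcal Q,\quad \langle e, \mathcal Qh\rangle = 2\mathrm{Tr}(\vartheta_*A^ThA\vartheta_*A^TeA),\ e, h \in \mathbf S^m. \qquad (4.47)$$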
4. To continue, we need to set some additional notation to be used when operating with Euclidean spaces $\mathbf{S}^p$, $p = 1, 2, ...$
• We denote $\bar p = \frac{p(p+1)}{2} = \dim \mathbf{S}^p$, $\mathcal{I}_p = \{(i, j) : 1 \le i \le j \le p\}$, and for $(i, j) \in \mathcal{I}_p$ set
\[
e_p^{ij} = \begin{cases} e_ie_i^T, & i = j, \\ \frac{1}{\sqrt{2}}[e_ie_j^T + e_je_i^T], & i < j, \end{cases}
\]
where the $e_i$ are standard basic orths in $\mathbf{R}^p$. Note that $\{e_p^{ij} : (i, j) \in \mathcal{I}_p\}$ is the standard orthonormal basis in $\mathbf{S}^p$. Given $v \in \mathbf{S}^p$, we denote by $X^p(v)$ the vector of coordinates of $v$ in this basis:
\[
X^p_{ij}(v) = \mathrm{Tr}(ve_p^{ij}) = \begin{cases} v_{ii}, & i = j, \\ \sqrt{2}\,v_{ij}, & i < j, \end{cases} \quad (i, j) \in \mathcal{I}_p.
\]
Similarly, for $x \in \mathbf{R}^{\bar p}$, we index the entries in $x$ by pairs $ij$, $(i, j) \in \mathcal{I}_p$, and set $V^p(x) = \sum_{(i,j)\in\mathcal{I}_p} x_{ij}e_p^{ij}$, so that $v \mapsto X^p(v)$ and $x \mapsto V^p(x)$ are linear norm-preserving maps inverse to each other identifying the Euclidean spaces $\mathbf{S}^p$ and $\mathbf{R}^{\bar p}$ (recall that the inner products on these spaces are, respectively, the Frobenius and the standard one).
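• Recall that $\mathcal{V}$ is the matrix box $\{v \in \mathbf{S}^n : v^2 \preceq I_n\} = \{v \in \mathbf{S}^n : \exists t \in \mathcal{T} := [0, 1] : v^2 \preceq tI_n\}$. We denote by $\mathcal{X}$ the image of $\mathcal{V}$ under the mapping $X^n$:
\[
\mathcal{X} = \{x \in \mathbf{R}^{\bar n} : \exists t \in \mathcal{T} : R^2[x] \preceq tI_n\}, \quad R[x] = \sum_{(i,j)\in\mathcal{I}_n} x_{ij}e_n^{ij}, \quad \bar n = \tfrac{1}{2}n(n+1).
\]
Note that $\mathcal{X}$ is a basic spectratope of size $n$. Now we can assume that the signal underlying our observations is $x \in \mathcal{X}$, and the observations themselves are
\[
w_k = X^m(\omega_k) = \underbrace{X^m(\mathcal{A}V^n(x))}_{=:Ax} + z_k, \quad z_k = X^m(\zeta_k).
\]
Note that $z_k \in \mathbf{R}^{\bar m}$, $1 \le k \le T$, are zero mean i.i.d. random vectors with covariance matrix $Q[x]$ satisfying, in view of (4.47), the relation
\[
Q[x] \preceq Q, \ \text{ where } \ Q_{ij,k\ell} = 2\mathrm{Tr}(\vartheta_*A^Te_m^{ij}A\vartheta_*A^Te_m^{k\ell}A), \quad (i, j) \in \mathcal{I}_m, \ (k, \ell) \in \mathcal{I}_m.
\]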
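The coordinate maps between $\mathbf{S}^p$ and $\mathbf{R}^{\bar p}$ used in this item can be sketched in a few lines. The function names `sym_to_vec` and `vec_to_sym` are hypothetical stand-ins for $X^p$ and $V^p$, and the ordering of the pairs $(i, j)$ (row by row over the upper triangle) is an assumption of this sketch; any fixed ordering would do.

```python
import numpy as np

def _scale(p):
    """Upper-triangle index pairs (i <= j) and the weights 1 (i == j) or sqrt(2) (i < j)."""
    iu = np.triu_indices(p)
    return iu, np.where(iu[0] == iu[1], 1.0, np.sqrt(2.0))

def sym_to_vec(v):
    """Coordinates of a symmetric matrix v in the orthonormal basis {e^{ij}}."""
    iu, w = _scale(v.shape[0])
    return w * v[iu]

def vec_to_sym(x, p):
    """Inverse map: rebuild the symmetric p x p matrix from its coordinate vector."""
    iu, w = _scale(p)
    upper = np.zeros((p, p))
    upper[iu] = x / w                     # fills the diagonal and the part above it
    return upper + np.triu(upper, 1).T    # mirror the strictly upper part

rng = np.random.default_rng(0)
p = 4
a = rng.standard_normal((p, p)); v = (a + a.T) / 2.0
b = rng.standard_normal((p, p)); u = (b + b.T) / 2.0
x, y = sym_to_vec(v), sym_to_vec(u)

print(x.size == p * (p + 1) // 2)                    # dimension p(p+1)/2
print(np.allclose(vec_to_sym(x, p), v))              # the maps are inverse to each other
print(np.isclose(x @ y, np.trace(v @ u)))            # Frobenius inner product is preserved
```

The factor $\sqrt{2}$ on off-diagonal entries is what makes the map an isometry: each off-diagonal entry of a symmetric matrix appears twice in the Frobenius inner product but only once in the coordinate vector.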
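Our goal is to estimate $\vartheta[v] - \vartheta[0]$, or, which is the same, to recover
\[
Bx := X^n(\vartheta[V^n(x)] - \vartheta[0]).
\]
We assume that the norm in which the estimation error is measured is “transferred” from $\mathbf{S}^n$ to $\mathbf{R}^{\bar n}$; we denote the resulting norm on $\mathbf{R}^{\bar n}$ by $\|\cdot\|$ and assume that the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ is given by spectratopic representation:
\[
\begin{aligned}
\{u \in \mathbf{R}^{\bar n} : \|u\|_* \le 1\} &= \{u \in \mathbf{R}^{\bar n} : \exists y \in \mathcal{Y} : u = My\}, \\
\mathcal{Y} &:= \{y \in \mathbf{R}^q : \exists r \in \mathcal{R} : S_\ell^2[y] \preceq r_\ell I_{f_\ell}, \ 1 \le \ell \le L\}.
\end{aligned} \tag{4.48}
\]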
The formulated description of the estimation problem fits the premises of Proposition 4.14, specifically:
• the signal $x$ underlying our observation $w^{(T)} = [w_1; ...; w_T]$ is known to belong to basic spectratope $\mathcal{X} \subset \mathbf{R}^{\bar n}$, and the observation itself is of the form
\[
w^{(T)} = A^{(T)}x + z^{(T)}, \quad A^{(T)} = \underbrace{[A; ...; A]}_{T}, \quad z^{(T)} = [z_1; ...; z_T];
\]
• the noise $z^{(T)}$ is zero mean, and its covariance matrix is $\preceq Q_T := \underbrace{\mathrm{Diag}\{Q, ..., Q\}}_{T}$, which allows us to set $\Pi = \{Q_T\}$;
• our goal is to recover $Bx$, and the norm $\|\cdot\|$ in which the recovery error is measured satisfies (4.48).
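Proposition 4.14 supplies the linear estimate
\[
\hat x(w^{(T)}) = \sum_{k=1}^{T} H_{*k}^Tw_k
\]
of $Bx$ with $H_* = [H_{*1}; ...; H_{*T}]$ stemming from the optimal solution to the convex optimization problem
\[
\begin{aligned}
\mathrm{Opt} = \min_{H=[H_1;...;H_T],\Lambda,\Upsilon}\Bigg\{ & \mathrm{Tr}(\Lambda) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi_{\{Q_T\}}(H_1, ..., H_T) : \ \Lambda \in \mathbf{S}^n_+, \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} \mathcal{R}^*[\Lambda] & \frac{1}{2}[B^T - A^T\sum_kH_k]M \\ \frac{1}{2}M^T[B - [\sum_kH_k]^TA] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned} \tag{4.49}
\]
where
\[
\mathcal{R}^*[\Lambda] \in \mathbf{S}^{\bar n} : \ (\mathcal{R}^*[\Lambda])_{ij,k\ell} = \mathrm{Tr}(\Lambda e_n^{ij}e_n^{k\ell}), \quad (i, j) \in \mathcal{I}_n, \ (k, \ell) \in \mathcal{I}_n,
\]
and (cf. (4.40))
\[
\begin{aligned}
\Psi_{\{Q_T\}}(H_1, ..., H_T) = \min_{\Upsilon',\Theta}\Bigg\{ & \mathrm{Tr}(Q_T\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) : \ \Theta \in \mathbf{S}^{mT}, \ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} \Theta & \frac{1}{2}[H_1M; ...; H_TM] \\ \frac{1}{2}[M^TH_1^T, ..., M^TH_T^T] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Bigg\}.
\end{aligned}
\]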
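5. Evidently, the function $\Psi_{\{Q_T\}}(H_1, ..., H_T)$ remains intact when permuting $H_1, ..., H_T$; with this in mind, it is clear that permuting $H_1, ..., H_T$ and keeping intact $\Lambda$ and $\Upsilon$ is a symmetry of (4.49): such a transformation maps the feasible set onto itself and preserves the value of the objective. Since (4.49) is convex and solvable, it follows that there exists an optimal solution to the problem with $H_1 = ... = H_T = H$. On the other hand,
\[
\begin{aligned}
\Psi_{\{Q_T\}}(H, ..., H)
&= \min_{\Upsilon',\Theta}\Bigg\{\mathrm{Tr}(Q_T\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) : \ \Theta \in \mathbf{S}^{mT}, \ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\}, \\
&\qquad\qquad \begin{bmatrix} \Theta & \frac{1}{2}[HM; ...; HM] \\ \frac{1}{2}[M^TH^T, ..., M^TH^T] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0\Bigg\} \\
&= \inf_{\Upsilon',\Theta}\Bigg\{\mathrm{Tr}(Q_T\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) : \ \Theta \in \mathbf{S}^{mT}, \ \Upsilon' = \{\Upsilon'_\ell \succ 0, \ell \le L\}, \\
&\qquad\qquad \begin{bmatrix} \Theta & \frac{1}{2}[HM; ...; HM] \\ \frac{1}{2}[M^TH^T, ..., M^TH^T] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0\Bigg\} \\
&= \inf_{\Upsilon',\Theta}\Bigg\{\mathrm{Tr}(Q_T\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) : \ \Theta \in \mathbf{S}^{mT}, \ \Upsilon' = \{\Upsilon'_\ell \succ 0, \ell \le L\}, \\
&\qquad\qquad \Theta \succeq \tfrac{1}{4}[HM; ...; HM]\Big[\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell]\Big]^{-1}[HM; ...; HM]^T\Bigg\} \\
&= \inf_{\Upsilon'}\Bigg\{\phi_{\mathcal{R}}(\lambda[\Upsilon']) + \tfrac{T}{4}\mathrm{Tr}\Big(QHM\Big[\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell]\Big]^{-1}M^TH^T\Big) : \ \Upsilon' = \{\Upsilon'_\ell \succ 0, \ell \le L\}\Bigg\}
\end{aligned}
\]
due to $Q_T = \mathrm{Diag}\{Q, ..., Q\}$, and we arrive at
\[
\begin{aligned}
\Psi_{\{Q_T\}}(H, ..., H) = \min_{\Upsilon',G}\Bigg\{ & T\,\mathrm{Tr}(QG) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) : \ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\}, \\
& G \in \mathbf{S}^m, \ \begin{bmatrix} G & \frac{1}{2}HM \\ \frac{1}{2}M^TH^T & \sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0\Bigg\}
\end{aligned} \tag{4.50}
\]
(we have used the Schur Complement Lemma combined with the fact that $\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \succ 0$ whenever $\Upsilon'_\ell \succ 0$ for all $\ell$; see Lemma 4.44).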
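The step from the $mT \times mT$ variable $\Theta$ to the $m \times m$ variable $G$ rests on one trace identity, which the following sketch checks numerically. All matrices here are random stand-ins: `C` plays the role of $HM$ and `W` the role of $\sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \succ 0$; the sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, q, T = 3, 2, 5                                   # hypothetical sizes

C = rng.standard_normal((m, q))                     # stand-in for H M
Lw = rng.standard_normal((q, q))
W = Lw @ Lw.T + np.eye(q)                           # stand-in for sum_l S_l^*[Upsilon'_l], positive definite
Lq = rng.standard_normal((m, m))
Q = Lq @ Lq.T                                       # positive semidefinite Q

stacked = np.vstack([C] * T)                        # [HM; ...; HM]
Q_T = np.kron(np.eye(T), Q)                         # Diag{Q, ..., Q}

# Smallest Theta allowed by the Schur Complement Lemma, and its m x m counterpart G.
Theta = 0.25 * stacked @ np.linalg.solve(W, stacked.T)
G = 0.25 * C @ np.linalg.solve(W, C.T)

lhs = np.trace(Q_T @ Theta)                         # Tr(Q_T Theta)
rhs = T * np.trace(Q @ G)                           # T Tr(Q G)

# The block matrix of the constraint is positive semidefinite at this Theta.
block = np.block([[Theta, 0.5 * stacked], [0.5 * stacked.T, W]])
min_eig = np.linalg.eigvalsh(block).min()
print(lhs, rhs, min_eig)
```

Only the $T$ diagonal $m \times m$ blocks of $\Theta$ enter $\mathrm{Tr}(Q_T\Theta)$, and each of them equals $G$, which is why the objective collapses to $T\,\mathrm{Tr}(QG)$.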
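In view of the above observations, when replacing variables $H$ and $G$ with $\bar H = TH$ and $\bar G = T^2G$, respectively, problem (4.49), (4.50) becomes
\[
\begin{aligned}
\mathrm{Opt} = \min_{\bar H,\bar G,\Lambda,\Upsilon,\Upsilon'}\Bigg\{ & \mathrm{Tr}(\Lambda) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \tfrac{1}{T}\mathrm{Tr}(Q\bar G) : \\
& \Lambda \in \mathbf{S}^n_+, \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} \mathcal{R}^*[\Lambda] & \frac{1}{2}[B^T - A^T\bar H]M \\ \frac{1}{2}M^T[B - \bar H^TA] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0, \\
& \begin{bmatrix} \bar G & \frac{1}{2}\bar HM \\ \frac{1}{2}M^T\bar H^T & \sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned} \tag{4.51}
\]
and the estimate
\[
\hat x(w^T) = \frac{1}{T}\bar H^T\sum_{k=1}^{T}w_k
\]
brought about by an optimal solution to (4.51) satisfies $\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat x|\mathcal{X}] \le \mathrm{Opt}$ where $\Pi = \{Q_T\}$.

4.3.3.8 Estimation from repeated observations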
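Consider the special case of the situation from Section 4.3.3.1 where observation $\omega$ in (4.31) is a $T$-element sample $\omega = [\bar\omega_1; ...; \bar\omega_T]$ with components
\[
\bar\omega_t = \bar Ax + \xi_t, \quad t = 1, ..., T,
\]
and $\xi_t$ are i.i.d. observation noises with zero mean distribution $\bar P$ satisfying $\bar P \lhd \bar\Pi$ for some convex compact set $\bar\Pi \subset \mathrm{int}\,\mathbf{S}^{\bar m}_+$. In other words, we are in the situation where
\[
A = \underbrace{[\bar A; ...; \bar A]}_{T} \in \mathbf{R}^{m\times n} \ \text{ for some } \bar A \in \mathbf{R}^{\bar m\times n} \text{ and } m = T\bar m, \qquad
\Pi = \{Q = \underbrace{\mathrm{Diag}\{\bar Q, ..., \bar Q\}}_{T}, \ \bar Q \in \bar\Pi\}.
\]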
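In this stacked setting, a linear estimate whose $T$ blocks coincide acts on the sample mean of the observations, and its noise covariance shrinks by the factor $1/T$. The sketch below illustrates both facts on random data; every matrix, size, and variable name in it is a hypothetical stand-in rather than something prescribed by the text.

```python
import numpy as np

rng = np.random.default_rng(1)
m_bar, n, nu, T = 3, 4, 2, 6                          # hypothetical sizes

A_bar = rng.standard_normal((m_bar, n))
A = np.vstack([A_bar] * T)                            # A = [A_bar; ...; A_bar], m = T * m_bar rows
Lq = rng.standard_normal((m_bar, m_bar))
Q_bar = Lq @ Lq.T + np.eye(m_bar)
Q = np.kron(np.eye(T), Q_bar)                         # Q = Diag{Q_bar, ..., Q_bar}

H_bar = rng.standard_normal((m_bar, nu))
H = np.vstack([H_bar / T] * T)                        # equal blocks H_k = H_bar / T

x = rng.standard_normal(n)
noise = rng.standard_normal((T, m_bar))               # one noise realization per repetition
omega_blocks = x @ A_bar.T + noise                    # row t holds omega_bar_t
omega = omega_blocks.reshape(-1)                      # stacked [omega_bar_1; ...; omega_bar_T]

est_stacked = H.T @ omega                             # H^T omega
est_mean = H_bar.T @ omega_blocks.mean(axis=0)        # (1/T) H_bar^T sum_t omega_bar_t

cov_stacked = H.T @ Q @ H                             # covariance of the noise part of the estimate
cov_mean = H_bar.T @ (Q_bar / T) @ H_bar
print(np.allclose(est_stacked, est_mean), np.allclose(cov_stacked, cov_mean))
```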
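The same argument as used in item 5 of Section 4.3.3.7 above justifies the following

Proposition 4.15. In the situation in question and under Assumptions A, B, and R the linear estimate of $Bx$ yielded by an optimal solution to problem (4.42) can be found as follows. Consider the convex optimization problem
\[
\begin{aligned}
\mathrm{Opt} = \min_{\bar H,\Lambda,\Upsilon,\Upsilon',\bar\Theta}\Bigg\{ & \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \tfrac{1}{T}\Gamma(\bar\Theta) : \\
& \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} \sum_k\mathcal{R}_k^*[\Lambda_k] & \frac{1}{2}[B^T - \bar A^T\bar H]M \\ \frac{1}{2}M^T[B - \bar H^T\bar A] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0, \\
& \begin{bmatrix} \bar\Theta & \frac{1}{2}\bar HM \\ \frac{1}{2}M^T\bar H^T & \sum_\ell\mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned} \tag{4.52}
\]
where
\[
\Gamma(\bar\Theta) = \max_{\bar Q\in\bar\Pi}\mathrm{Tr}(\bar Q\bar\Theta).
\]
The problem is solvable, and the estimate in question is yielded by the $\bar H$-component $\bar H_*$ of the optimal solution according to
\[
\hat x([\bar\omega_1; ...; \bar\omega_T]) = \frac{1}{T}\bar H_*^T\sum_{t=1}^{T}\bar\omega_t.
\]
The upper bound provided by Proposition 4.14 on the risk $\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat x(\cdot)|\mathcal{X}]$ of this estimate is equal to Opt.

The advantage of this result as compared to what is stated under the circumstances by Proposition 4.14 is that the sizes of optimization problem (4.52) are independent of $T$.

4.3.3.9 Near-optimality in the Gaussian case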
The risk of the linear estimate $\hat x_{H_*}(\cdot)$ constructed in (4.42) can be compared to the minimax optimal risk of recovering $Bx$, $x \in \mathcal{X}$, from observations corrupted by zero mean Gaussian noise with covariance matrix from $\Pi$. Formally, the minimax risk is defined as
\[
\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}] = \sup_{Q\in\Pi}\,\inf_{\hat x(\cdot)}\,\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\{\|Bx - \hat x(Ax + \xi)\|\} \tag{4.53}
\]
where the infimum is taken over all estimates.

Proposition 4.16. Under the premise and in the notation of Proposition 4.14, we have
\[
\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}] \ge \frac{\mathrm{Opt}}{64\sqrt{(2\ln F + 10\ln 2)(2\ln D + 10\ln 2)}}, \tag{4.54}
\]
where
\[
D = \sum_kd_k, \quad F = \sum_\ell f_\ell. \tag{4.55}
\]
Thus, the upper bound Opt on the risk $\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat x_{H_*}|\mathcal{X}]$ of the presumably good linear estimate $\hat x_{H_*}$ yielded by an optimal solution to optimization problem (4.42) is within a factor, logarithmic in the sizes of the spectratopes $\mathcal{X}$ and $\mathcal{B}_*$, of the Gaussian minimax risk $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]$.

For the proof, see Section 4.8.5. The key component of the proof is the following fact important in its own right (for proof, see Section 4.8.4):

Lemma 4.17. Let $Y$ be an $N \times \nu$ matrix, let $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal{B}_*$ of the conjugate norm is the spectratope (4.34), and let $\zeta \sim \mathcal{N}(0, Q)$ for some positive semidefinite $N \times N$ matrix $Q$. Then the best upper bound on $\psi_Q(Y) := \mathbf{E}\{\|Y^T\zeta\|\}$ yielded by Lemma 4.11, that is, the optimal value $\mathrm{Opt}[Q]$ in the convex optimization problem (cf. (4.40))
\[
\begin{aligned}
\mathrm{Opt}[Q] = \min_{\Theta,\Upsilon}\Bigg\{ & \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \mathrm{Tr}(Q\Theta) : \ \Upsilon = \{\Upsilon_\ell \succeq 0, 1 \le \ell \le L\}, \\
& \Theta \in \mathbf{S}^N, \ \begin{bmatrix} \Theta & \frac{1}{2}YM \\ \frac{1}{2}M^TY^T & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\}
\end{aligned} \tag{4.56}
\]
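(for notation, see Lemma 4.11 and (4.36)), satisfies the identity
\[
\begin{aligned}
\forall(Q \succeq 0) : \ \mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q] := \min_{G,\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\Bigg\{ & \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \mathrm{Tr}(G) : \ \Upsilon_\ell \succeq 0, \\
& \begin{bmatrix} G & \frac{1}{2}Q^{1/2}YM \\ \frac{1}{2}M^TY^TQ^{1/2} & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned} \tag{4.57}
\]
and is a tight bound on $\psi_Q(Y)$, namely,
\[
\psi_Q(Y) \le \mathrm{Opt}[Q] \le 22\sqrt{2\ln F + 10\ln 2}\;\psi_Q(Y), \tag{4.58}
\]
where $F = \sum_\ell f_\ell$ is the size of the spectratope (4.34). Besides this, for all $\kappa \ge 1$ one has
\[
\mathrm{Prob}_\zeta\Big\{\|Y^T\zeta\| \ge \frac{\mathrm{Opt}[Q]}{4\kappa}\Big\} \ge \beta_\kappa := 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2}. \tag{4.59}
\]
In particular, when selecting $\kappa = \sqrt{2\ln F + 10\ln 2}$, we obtain
\[
\mathrm{Prob}_\zeta\Big\{\|Y^T\zeta\| \ge \frac{\mathrm{Opt}[Q]}{4\sqrt{2\ln F + 10\ln 2}}\Big\} \ge 0.2100 > \frac{3}{16}. \tag{4.60}
\]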
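The constant 0.2100 in the last display can be recovered by direct arithmetic: with $\kappa^2 = 2\ln F + 10\ln 2$ the term $2Fe^{-\kappa^2/2}$ equals $2F \cdot \frac{1}{32F} = \frac{1}{16}$ for every $F$. A small sketch of this computation follows; the function names are introduced here for illustration only.

```python
import math

def beta(kappa, F):
    """Lower bound beta_kappa = 1 - e^{3/8}/2 - 2 F exp(-kappa^2 / 2)."""
    return 1.0 - math.exp(3.0 / 8.0) / 2.0 - 2.0 * F * math.exp(-kappa ** 2 / 2.0)

def kappa_star(F):
    """The particular choice kappa = sqrt(2 ln F + 10 ln 2)."""
    return math.sqrt(2.0 * math.log(F) + 10.0 * math.log(2.0))

# The value does not depend on F: the F-dependent term is always 1/16.
values = [beta(kappa_star(F), F) for F in (1, 10, 1000)]
print(values)
```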
4.4 LINEAR ESTIMATES OF STOCHASTIC SIGNALS
In the recovery problem considered so far in this chapter, the signal $x$ underlying observation $\omega = Ax + \xi$ was “deterministic uncertain but bounded”: all the a priori information on $x$ was that $x \in \mathcal{X}$ for a given signal set $\mathcal{X}$. There is a well-known alternative model, where the signal $x$ has a random component, specifically, $x = [\eta; u]$ where the “stochastic component” $\eta$ is random with (partly) known probability distribution $P_\eta$, and the “deterministic component” $u$ is known to belong to a given set $\mathcal{X}$. As a typical example, consider a linear dynamical system given by
\[
\begin{aligned}
y_{t+1} &= P_ty_t + \eta_t + u_t, \\
\omega_t &= C_ty_t + \xi_t, \quad 1 \le t \le T,
\end{aligned} \tag{4.61}
\]
where yt , ηt , and ut are, respectively, the state, the random “process noise,” and the deterministic “uncertain but bounded” disturbance affecting the system at time t, ωt is the output (it is what we observe at time t), and ξt is the observation noise. We assume that the matrices Pt , Ct are known in advance. Note that the trajectory y = [y1 ; ...; yT ] of the states depends not only on the trajectories of process noises ηt and disturbances ut , but also on the initial state y1 , which can be modeled as a realization of either the initial noise η0 , or the initial disturbance u0 . When ut ≡ 0, y1 = η0
and the random vectors {ηt , 0 ≤ t ≤ T, ξt , 1 ≤ t ≤ T } are zero mean Gaussian independent of each other, (4.61) is the model underlying the celebrated Kalman filter [143, 144, 171, 172]. Now, given model (4.61), we can use the equations of the model to represent the trajectory of the states as a linear image of the trajectory of noises η = {ηt } and the trajectory of disturbances u = {ut }, y = P η + Qu (recall that the initial state is either the component η0 of η, or the component u0 of u), and our “full observation” becomes ω = [ω1 ; ...; ωT ] = A[η; u] + ξ, ξ = [ξ1 , ..., ξT ]. A typical statistical problem associated with the outlined situation is to estimate the linear image B[η; u] of the “signal” x = [η; u] underlying the observation. For example, when speaking about (4.61), the goal could be to recover yT +1 (“forecast”). We arrive at the following estimation problem: Given noisy observation ω = Ax + ξ ∈ Rm
of signal $x = [\eta; u]$ with random component $\eta \in \mathbf{R}^p$ and deterministic component $u$ known to belong to a given set $\mathcal{X} \subset \mathbf{R}^q$, we want to recover the image $Bx \in \mathbf{R}^\nu$ of the signal. Here $A$ and $B$ are given matrices, $\eta$ is independent of $\xi$, and we have a priori (perhaps, incomplete) information on the probability distribution $P_\eta$ of $\eta$, specifically, we know that $P_\eta \in \mathcal{P}_\eta$ for a given family $\mathcal{P}_\eta$ of probability distributions. Similarly, we assume that what we know about the noise $\xi$ is that its distribution belongs to a given family $\mathcal{P}_\xi$ of distributions on the observation space. Given a norm $\|\cdot\|$ on the image space of $B$, it makes sense to specify the risk of a candidate estimate $\hat x(\omega)$ by taking the expectation of the norm $\|\hat x(A[\eta; u] + \xi) - B[\eta; u]\|$ of the error over both $\xi$ and $\eta$ and then taking the supremum of the result over the allowed distributions of $\eta$, $\xi$ and over $u \in \mathcal{X}$:
\[
\mathrm{Risk}_{\|\cdot\|}[\hat x] = \sup_{u\in\mathcal{X}}\ \sup_{P_\xi\in\mathcal{P}_\xi,\,P_\eta\in\mathcal{P}_\eta}\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\{\|\hat x(A[\eta; u] + \xi) - B[\eta; u]\|\}.
\]
When $\|\cdot\| = \|\cdot\|_2$ and all distributions from $\mathcal{P}_\xi$ and $\mathcal{P}_\eta$ are with zero means and finite covariance matrices, it is technically more convenient to operate with the Euclidean risk
\[
\mathrm{Risk}_{\mathrm{Eucl}}[\hat x] = \sup_{u\in\mathcal{X}}\ \sup_{P_\xi\in\mathcal{P}_\xi,\,P_\eta\in\mathcal{P}_\eta}\Big[\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\big\{\|\hat x(A[\eta; u] + \xi) - B[\eta; u]\|_2^2\big\}\Big]^{1/2}.
\]
Our next goal is to show that as far as the design of “presumably good” linear estimates $\hat x(\omega) = H^T\omega$ is concerned, the techniques developed so far can be straightforwardly extended to the case of signals with random component.
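Before that, the step of writing the state trajectory of system (4.61) as a linear image $y = P\eta + Qu$ of the noises and disturbances can be made concrete. The sketch below assumes, for illustration only, that the initial state is $y_1 = \eta_0$, that $\eta = [\eta_0; ...; \eta_{T-1}]$ and $u = [u_1; ...; u_{T-1}]$, and that all states have the same dimension `d`; the function names are hypothetical.

```python
import numpy as np

def trajectory_maps(P_list, d):
    """Given P_list = [P_1, ..., P_{T-1}], return matrices (Pmat, Qmat) such that
    y = [y_1; ...; y_T] equals Pmat @ eta + Qmat @ u for the recursion
    y_1 = eta_0, y_{t+1} = P_t y_t + eta_t + u_t."""
    T = len(P_list) + 1
    Pmat = np.zeros((T * d, T * d))          # columns indexed by eta_0, ..., eta_{T-1}
    Qmat = np.zeros((T * d, (T - 1) * d))    # columns indexed by u_1, ..., u_{T-1}
    Pmat[:d, :d] = np.eye(d)                 # y_1 = eta_0
    for t in range(1, T):                    # block row t holds y_{t+1}
        prev, cur = slice((t - 1) * d, t * d), slice(t * d, (t + 1) * d)
        Pmat[cur] = P_list[t - 1] @ Pmat[prev]
        Qmat[cur] = P_list[t - 1] @ Qmat[prev]
        Pmat[cur, cur] += np.eye(d)          # contribution of eta_t
        Qmat[cur, prev] += np.eye(d)         # contribution of u_t (column block t-1)
    return Pmat, Qmat

def simulate(P_list, eta, u, d):
    """Run the recursion directly; eta has T blocks, u has T-1 blocks of size d."""
    ys = [eta[:d]]
    for t, P_t in enumerate(P_list, start=1):
        ys.append(P_t @ ys[-1] + eta[t * d:(t + 1) * d] + u[(t - 1) * d:t * d])
    return np.concatenate(ys)

rng = np.random.default_rng(0)
T, d = 5, 2
P_list = [rng.standard_normal((d, d)) for _ in range(T - 1)]
eta = rng.standard_normal(T * d)
u = rng.standard_normal((T - 1) * d)

Pmat, Qmat = trajectory_maps(P_list, d)
y_linear = Pmat @ eta + Qmat @ u
y_direct = simulate(P_list, eta, u, d)
print(np.allclose(y_linear, y_direct))
```

Multiplying the stacked states by a block-diagonal matrix built from the $C_t$ would then give the matrix $A$ of the full observation $\omega = A[\eta; u] + \xi$.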
4.4.1 Minimizing Euclidean risk
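For the time being, assume that $\mathcal{P}_\xi$ is comprised of all probability distributions $P$ on $\mathbf{R}^m$ with zero mean and covariance matrices $\mathrm{Cov}[P] = \mathbf{E}_{\xi\sim P}\{\xi\xi^T\}$ running through a computationally tractable convex compact subset $\mathcal{Q}_\xi \subset \mathrm{int}\,\mathbf{S}^m_+$, and $\mathcal{P}_\eta$ is comprised of all probability distributions $P$ on $\mathbf{R}^p$ with zero mean and covariance matrices running through a computationally tractable convex compact subset $\mathcal{Q}_\eta \subset \mathrm{int}\,\mathbf{S}^p_+$. Let, in addition, $\mathcal{X}$ be a basic spectratope:
\[
\mathcal{X} = \{x \in \mathbf{R}^q : \exists t \in \mathcal{T} : R_k^2[x] \preceq t_kI_{d_k}, \ k \le K\}
\]
with our standard restrictions on $\mathcal{T}$ and $R_k[\cdot]$. Let us derive an efficiently solvable convex optimization problem “responsible” for a presumably good, in terms of its Euclidean risk, linear estimate. For a linear estimate $H^T\omega$, $u \in \mathcal{X}$, $P_\xi \in \mathcal{P}_\xi$, $P_\eta \in \mathcal{P}_\eta$, denoting by $Q_\xi$ and $Q_\eta$ the covariance matrices of $P_\xi$ and $P_\eta$, and partitioning $A$ as $A = [A_\eta, A_u]$ and $B = [B_\eta, B_u]$ according to the partition $x = [\eta; u]$, we have
\[
\begin{aligned}
&\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\big\{\|H^T(A[\eta; u] + \xi) - B[\eta; u]\|_2^2\big\} \\
&\quad= \mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\big\{\|[H^TA_\eta - B_\eta]\eta + H^T\xi + [H^TA_u - B_u]u\|_2^2\big\} \\
&\quad= u^T[B_u - H^TA_u]^T[B_u - H^TA_u]u + \mathbf{E}_{\xi\sim P_\xi}\big\{\mathrm{Tr}(H^T\xi\xi^TH)\big\} + \mathbf{E}_{\eta\sim P_\eta}\big\{\mathrm{Tr}([B_\eta - H^TA_\eta]\eta\eta^T[B_\eta - H^TA_\eta]^T)\big\} \\
&\quad= u^T[B_u - H^TA_u]^T[B_u - H^TA_u]u + \mathrm{Tr}(H^TQ_\xi H) + \mathrm{Tr}([B_\eta - H^TA_\eta]Q_\eta[B_\eta - H^TA_\eta]^T).
\end{aligned}
\]
Hence, the squared Euclidean risk of the linear estimate $\hat x_H(\omega) = H^T\omega$ is
\[
\begin{aligned}
\mathrm{Risk}^2_{\mathrm{Eucl}}[\hat x_H] &= \Phi(H) + \Psi_\xi(H) + \Psi_\eta(H), \\
\Phi(H) &= \max_{u\in\mathcal{X}}u^T[B_u - H^TA_u]^T[B_u - H^TA_u]u, \\
\Psi_\xi(H) &= \max_{Q\in\mathcal{Q}_\xi}\mathrm{Tr}(H^TQH), \\
\Psi_\eta(H) &= \max_{Q\in\mathcal{Q}_\eta}\mathrm{Tr}([B_\eta - H^TA_\eta]Q[B_\eta - H^TA_\eta]^T).
\end{aligned}
\]
Functions $\Psi_\xi$ and $\Psi_\eta$ are convex and efficiently computable, function $\Phi(H)$, by Proposition 4.8, admits an efficiently computable convex upper bound
\[
\overline\Phi(H) = \min_\Lambda\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda]) : \ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \ [B_u - H^TA_u]^T[B_u - H^TA_u] \preceq \sum_k\mathcal{R}_k^*[\Lambda_k]\Big\}
\]
which is tight within the factor $2\max[\ln(2\sum_kd_k), 1]$ (see Proposition 4.8). Thus, the efficiently solvable convex problem yielding a presumably good linear estimate is
\[
\mathrm{Opt} = \min_H\big[\overline\Phi(H) + \Psi_\xi(H) + \Psi_\eta(H)\big];
\]
the Euclidean risk of the linear estimate $H_*^T\omega$ yielded by the optimal solution to the problem is upper-bounded by $\sqrt{\mathrm{Opt}}$ and is within factor $\sqrt{2\max[\ln(2\sum_kd_k), 1]}$ of the minimal Euclidean risk achievable with linear estimates.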
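The three-term decomposition of the expected squared error for fixed $u$, $Q_\xi$, $Q_\eta$ can be checked by simulation. The sketch below uses Gaussian noises, which is just one admissible choice of zero mean distributions with the given covariances, and random stand-ins for $A$, $B$, $H$; all sizes and names are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, q, nu = 4, 3, 2, 2                                # hypothetical dimensions

A_eta, A_u = rng.standard_normal((m, p)), rng.standard_normal((m, q))
B_eta, B_u = rng.standard_normal((nu, p)), rng.standard_normal((nu, q))
H = rng.standard_normal((m, nu))
u = rng.standard_normal(q)

def random_covariance(k):
    L = rng.standard_normal((k, k))
    return L @ L.T / k + 0.1 * np.eye(k)

Q_xi, Q_eta = random_covariance(m), random_covariance(p)

E_u = B_u - H.T @ A_u                                   # B_u - H^T A_u
E_eta = B_eta - H.T @ A_eta                             # B_eta - H^T A_eta

# Closed form: u-term + Tr(H^T Q_xi H) + Tr(E_eta Q_eta E_eta^T).
closed_form = (u @ E_u.T @ E_u @ u
               + np.trace(H.T @ Q_xi @ H)
               + np.trace(E_eta @ Q_eta @ E_eta.T))

# Monte Carlo estimate of E || H^T (A[eta; u] + xi) - B[eta; u] ||_2^2.
N = 200_000
xi = rng.multivariate_normal(np.zeros(m), Q_xi, size=N)
eta = rng.multivariate_normal(np.zeros(p), Q_eta, size=N)
err = xi @ H - eta @ E_eta.T - E_u @ u                  # one error vector per row
monte_carlo = np.mean(np.sum(err ** 2, axis=1))
print(closed_form, monte_carlo)
```

The cross terms vanish in expectation because $\xi$ and $\eta$ are zero mean and independent, which is why the simulated value matches the sum of the three terms up to sampling error.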
4.4.2 Minimizing $\|\cdot\|$-risk
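Now let $\mathcal{P}_\xi$ be comprised of all probability distributions $P$ on $\mathbf{R}^m$ with matrices of second moments $\mathrm{Var}[P] = \mathbf{E}_{\xi\sim P}\{\xi\xi^T\}$ running through a computationally tractable convex compact subset $\mathcal{Q}_\xi \subset \mathrm{int}\,\mathbf{S}^m_+$, and $\mathcal{P}_\eta$ be comprised of all probability distributions $P$ on $\mathbf{R}^p$ with matrices of second moments $\mathrm{Var}[P]$ running through a computationally tractable convex compact subset $\mathcal{Q}_\eta \subset \mathrm{int}\,\mathbf{S}^p_+$. Let, as above, $\mathcal{X}$ be a basic spectratope,
\[
\mathcal{X} = \{u \in \mathbf{R}^q : \exists t \in \mathcal{T} : R_k^2[u] \preceq t_kI_{d_k}, \ k \le K\},
\]
and let $\|\cdot\|$ be such that the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ is a spectratope:
\[
\mathcal{B}_* = \{y : \|y\|_* \le 1\} = \big\{y \in \mathbf{R}^\nu : \exists(r \in \mathcal{R}, z \in \mathbf{R}^N) : y = Mz, \ S_\ell^2[z] \preceq r_\ell I_{f_\ell}, \ \ell \le L\big\},
\]
with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$, $R_k[\cdot]$ and $S_\ell[\cdot]$. Here the efficiently solvable convex optimization problem “responsible” for a presumably good, in terms of its risk $\mathrm{Risk}_{\|\cdot\|}$, linear estimate can be built as follows. For a linear estimate $H^T\omega$, $u \in \mathcal{X}$, $P_\xi \in \mathcal{P}_\xi$, $P_\eta \in \mathcal{P}_\eta$, denoting by $Q_\xi$ and $Q_\eta$ the matrices of second moments of $P_\xi$ and $P_\eta$, and partitioning $A$ as $A = [A_\eta, A_u]$ and $B = [B_\eta, B_u]$ according to the partition $x = [\eta; u]$, we have
\[
\begin{aligned}
&\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\big\{\|H^T(A[\eta; u] + \xi) - B[\eta; u]\|\big\} \\
&\quad= \mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\big\{\|[H^TA_\eta - B_\eta]\eta + H^T\xi + [H^TA_u - B_u]u\|\big\} \\
&\quad\le \|[B_u - H^TA_u]u\| + \mathbf{E}_{\xi\sim P_\xi}\big\{\|H^T\xi\|\big\} + \mathbf{E}_{\eta\sim P_\eta}\big\{\|[B_\eta - H^TA_\eta]\eta\|\big\}.
\end{aligned}
\]
It follows that for a linear estimate $\hat x_H(\omega) = H^T\omega$ one has
\[
\begin{aligned}
\mathrm{Risk}_{\|\cdot\|}[\hat x_H] &\le \Phi(H) + \Psi_\xi(H) + \Psi_\eta(H), \\
\Phi(H) &= \max_{u\in\mathcal{X}}\|[B_u - H^TA_u]u\|, \\
\Psi_\xi(H) &= \sup_{P_\xi\in\mathcal{P}_\xi}\mathbf{E}_{\xi\sim P_\xi}\{\|H^T\xi\|\}, \\
\Psi_\eta(H) &= \sup_{P_\eta\in\mathcal{P}_\eta}\mathbf{E}_{\eta\sim P_\eta}\{\|[B_\eta - H^TA_\eta]\eta\|\}.
\end{aligned}
\]
As was shown in Section 4.3.3.3, the functions $\Phi$, $\Psi_\xi$, $\Psi_\eta$ admit efficiently computable upper bounds as follows (for notation, see Section 4.3.3.3):
\[
\begin{aligned}
\Phi(H) \le \overline\Phi(H) := \min_{\Lambda,\Upsilon}\Bigg\{ & \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} \sum_k\mathcal{R}_k^*[\Lambda_k] & \frac{1}{2}[B_u^T - A_u^TH]M \\ \frac{1}{2}M^T[B_u - H^TA_u] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\}; \\
\Psi_\xi(H) \le \overline\Psi_\xi(H) := \min_{\Upsilon,G}\Bigg\{ & \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \max_{Q\in\mathcal{Q}_\xi}\mathrm{Tr}(GQ) : \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} G & \frac{1}{2}HM \\ \frac{1}{2}M^TH^T & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\}, \\
\Psi_\eta(H) \le \overline\Psi_\eta(H) := \min_{\Upsilon,G}\Bigg\{ & \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \max_{Q\in\mathcal{Q}_\eta}\mathrm{Tr}(GQ) : \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\
& \begin{bmatrix} G & \frac{1}{2}[B_\eta^T - A_\eta^TH]M \\ \frac{1}{2}M^T[B_\eta - H^TA_\eta] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned}
\]
and these bounds are reasonably tight (for details on tightness, see Proposition 4.8 and Lemma 4.17). As a result, to get a presumably good linear estimate, one needs to solve the efficiently solvable convex optimization problem
\[
\mathrm{Opt} = \min_H\big[\overline\Phi(H) + \overline\Psi_\xi(H) + \overline\Psi_\eta(H)\big].
\]
The linear estimate $\hat x_{H_*} = H_*^T\omega$ yielded by an optimal solution $H_*$ to this problem admits the risk bound
\[
\mathrm{Risk}_{\|\cdot\|}[\hat x_{H_*}] \le \mathrm{Opt}.
\]
Note that the above derivation did not use independence of $\xi$ and $\eta$.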
4.5 LINEAR ESTIMATION UNDER UNCERTAIN-BUT-BOUNDED NOISE
So far, the main subject of our interest was recovering (linear images of) signals via indirect observations of these signals corrupted by random noise. In this section, we focus on alternative observation schemes: those with “uncertain-but-bounded” and “mixed” noise.

4.5.1 Uncertain-but-bounded noise

Consider the estimation problem where, given observation
\[
\omega = Ax + \eta \tag{4.62}
\]
of unknown signal $x$ known to belong to a given signal set $\mathcal{X}$, one wants to recover linear image $Bx$ of $x$. Here $A$ and $B$ are given $m \times n$ and $\nu \times n$ matrices. The situation looks exactly as before; the difference with our previous considerations is that now we do not assume the observation noise to be random: all we assume about $\eta$ is that it belongs to a given compact set $\mathcal{H}$ (“uncertain-but-bounded observation noise”). In the situation in question, a natural definition of the risk on $\mathcal{X}$ of a candidate estimate $\omega \mapsto \hat x(\omega)$ is
\[
\mathrm{Risk}_{\mathcal{H},\|\cdot\|}[\hat x|\mathcal{X}] = \sup_{x\in\mathcal{X},\,\eta\in\mathcal{H}}\|Bx - \hat x(Ax + \eta)\|
\]
(“$\mathcal{H}$-risk”). We are about to prove that when $\mathcal{X}$, $\mathcal{H}$, and the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ are spectratopes, which we assume from now on, an efficiently computable linear estimate is near-optimal in terms of its $\mathcal{H}$-risk. Our initial observation is that in this case the model (4.62) reduces straightforwardly to the model without observation noise. Indeed, let $\mathcal{Y} = \mathcal{X} \times \mathcal{H}$; then $\mathcal{Y}$ is a spectratope, and we lose nothing when assuming that the signal underlying observation $\omega$ is $y = [x; \eta] \in \mathcal{Y}$:
\[
\omega = Ax + \eta = \bar Ay, \quad \bar A = [A, I_m],
\]
while the entity to be recovered is
\[
Bx = \bar By, \quad \bar B = [B, 0_{\nu\times m}].
\]
With these conventions, the $\mathcal{H}$-risk of a candidate estimate $\hat x(\cdot) : \mathbf{R}^m \to \mathbf{R}^\nu$ becomes the quantity
\[
\mathrm{Risk}_{\|\cdot\|}[\hat x|\mathcal{X} \times \mathcal{H}] = \sup_{y=[x;\eta]\in\mathcal{X}\times\mathcal{H}}\|\bar By - \hat x(\bar Ay)\|,
\]
and we indeed arrive at the situation where the observation noise is identically zero. To avoid messy notation, let us assume that the outlined reduction has been carried out in advance, so that

(!) The problem of interest is to recover the linear image $Bx \in \mathbf{R}^\nu$ of an unknown signal $x$ known to belong to a given spectratope $\mathcal{X}$ (which, as always, we can assume w.l.o.g. to be basic) from (noiseless) observation
\[
\omega = Ax \in \mathbf{R}^m.
\]
The risk of a candidate estimate is defined as
\[
\mathrm{Risk}_{\|\cdot\|}[\hat x|\mathcal{X}] = \sup_{x\in\mathcal{X}}\|Bx - \hat x(Ax)\|,
\]
where $\|\cdot\|$ is a given norm with a spectratope $\mathcal{B}_*$ (see (4.34)) as the unit ball of the conjugate norm:
\[
\begin{aligned}
\mathcal{X} &= \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : R_k^2[x] \preceq t_kI_{d_k}, \ k \le K\}, \\
\mathcal{B}_* &= \{z \in \mathbf{R}^\nu : \exists y \in \mathcal{Y} : z = My\}, \\
\mathcal{Y} &:= \{y \in \mathbf{R}^q : \exists r \in \mathcal{R} : S_\ell^2[y] \preceq r_\ell I_{f_\ell}, \ 1 \le \ell \le L\},
\end{aligned} \tag{4.63}
\]
with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$ and $R_k[\cdot]$, $S_\ell[\cdot]$.

4.5.1.1 Building a linear estimate
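Let us build a presumably good linear estimate. For a linear estimate $\hat x_H(\omega) = H^T\omega$, we have
\[
\begin{aligned}
\mathrm{Risk}_{\|\cdot\|}[\hat x_H|\mathcal{X}] &= \max_{x\in\mathcal{X}}\|(B - H^TA)x\| \\
&= \max_{[u;x]\in\mathcal{B}_*\times\mathcal{X}}[u; x]^T\begin{bmatrix} 0 & \frac{1}{2}(B - H^TA) \\ \frac{1}{2}(B - H^TA)^T & 0 \end{bmatrix}[u; x].
\end{aligned}
\]
Applying Proposition 4.8, we arrive at the following:

Proposition 4.18. In the situation of this section, consider the convex optimization problem
\[
\begin{aligned}
\mathrm{Opt}_\# = \min_{H,\Upsilon=\{\Upsilon_\ell\},\Lambda=\{\Lambda_k\}}\Bigg\{ & \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{T}}(\lambda[\Lambda]) : \ \Upsilon_\ell \succeq 0, \ \Lambda_k \succeq 0, \ \forall(\ell, k), \\
& \begin{bmatrix} \sum_k\mathcal{R}_k^*[\Lambda_k] & \frac{1}{2}[B - H^TA]^TM \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned} \tag{4.64}
\]
where $\mathcal{R}_k^*[\cdot]$, $\mathcal{S}_\ell^*[\cdot]$ are induced by $R_k[\cdot]$, $S_\ell[\cdot]$, respectively, as explained in Section 4.3.1. The problem is solvable, and the risk of the linear estimate $\hat x_{H_*}(\cdot)$ yielded by the $H$-component of an optimal solution does not exceed $\mathrm{Opt}_\#$.

For proof, see Section 4.8.6.1.

4.5.1.2 Near-optimality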
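Proposition 4.19. The linear estimate $\hat x_{H_*}$ yielded by Proposition 4.18 is near-optimal in terms of its risk:
\[
\mathrm{Risk}_{\|\cdot\|}[\hat x_{H_*}|\mathcal{X}] \le \mathrm{Opt}_\# \le O(1)\ln(D)\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}], \quad D = \sum_kd_k + \sum_\ell f_\ell, \tag{4.65}
\]
where $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ is the minimax optimal risk:
\[
\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}] = \inf_{\hat x}\mathrm{Risk}_{\|\cdot\|}[\hat x|\mathcal{X}]
\]
with inf taken w.r.t. all Borel estimates.

Remark 4.20. When $\mathcal{X}$ and $\mathcal{B}_*$ are ellitopes rather than spectratopes,
\[
\begin{aligned}
\mathcal{X} &= \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^TR_kx \le t_k, \ k \le K\}, \\
\mathcal{B}_* &:= \{u \in \mathbf{R}^\nu : \|u\|_* \le 1\} \\
&= \{u \in \mathbf{R}^\nu : \exists r \in \mathcal{R}, z : u = Mz, \ z^TS_\ell z \le r_\ell, \ \ell \le L\} \\
&\qquad \Big[R_k \succeq 0, \ \sum_kR_k \succ 0, \ S_\ell \succeq 0, \ \sum_\ell S_\ell \succ 0\Big],
\end{aligned}
\]
problem (4.64) becomes
\[
\mathrm{Opt}_\# = \min_{H,\lambda,\mu}\Bigg\{\phi_{\mathcal{R}}(\mu) + \phi_{\mathcal{T}}(\lambda) : \ \lambda \ge 0, \ \mu \ge 0, \ \begin{bmatrix} \sum_k\lambda_kR_k & \frac{1}{2}[B - H^TA]^TM \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mu_\ell S_\ell \end{bmatrix} \succeq 0\Bigg\},
\]
and (4.65) can be strengthened to
\[
\mathrm{Risk}_{\|\cdot\|}[\hat x_{H_*}|\mathcal{X}] \le \mathrm{Opt}_\# \le O(1)\ln(K + L)\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}].
\]
For proofs, see Section 4.8.6.

4.5.1.3 Nonlinear estimation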
The uncertain-but-bounded model of observation error makes it easy to point out an efficiently computable near-optimal nonlinear estimate. Indeed, in the situation described at the beginning of Section 4.5.1, let us assume that the range of observation error $\eta$ is
\[
\mathcal{H} = \{\eta \in \mathbf{R}^m : \|\eta\|_{(m)} \le \sigma\},
\]
where $\|\cdot\|_{(m)}$ and $\sigma > 0$ are a given norm on $\mathbf{R}^m$ and a given error bound, and let us measure the recovery error by a given norm $\|\cdot\|_{(\nu)}$ on $\mathbf{R}^\nu$. We can immediately point out a (nonlinear) estimate optimal within factor 2 in terms of its $\mathcal{H}$-risk, namely, estimate $\hat x_*$, as follows:

Given $\omega$, we solve the feasibility problem
\[
\text{find } x \in \mathcal{X} : \ \|Ax - \omega\|_{(m)} \le \sigma. \tag{$F[\omega]$}
\]
Let $x_\omega$ be a feasible solution; we set $\hat x_*(\omega) = Bx_\omega$.
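In one very special case the feasibility problem has a closed-form solution, which makes the construction easy to try out. The sketch below assumes, purely for illustration, direct observations ($A = I$), the Euclidean unit ball as $\mathcal{X}$, and Euclidean norms for both $\|\cdot\|_{(m)}$ and $\|\cdot\|_{(\nu)}$; then the Euclidean projection of $\omega$ onto $\mathcal{X}$ is feasible, because it is at least as close to $\omega$ as the true signal is. Function names are hypothetical.

```python
import numpy as np

def project_unit_ball(w):
    """Euclidean projection onto the unit ball {x : ||x||_2 <= 1}."""
    nrm = np.linalg.norm(w)
    return w if nrm <= 1.0 else w / nrm

def nonlinear_estimate(omega, B):
    """Return (B x_omega, x_omega), where x_omega solves the feasibility problem
    for A = I and X the Euclidean unit ball (an assumption of this sketch)."""
    x_omega = project_unit_ball(omega)
    return B @ x_omega, x_omega

def random_in_ball(rng, n, radius):
    """A random point of the Euclidean ball of the given radius."""
    d = rng.standard_normal(n)
    return radius * rng.uniform() ** (1.0 / n) * d / np.linalg.norm(d)

rng = np.random.default_rng(0)
n, nu, sigma = 5, 3, 0.2
B = rng.standard_normal((nu, n))

worst_residual, worst_error = 0.0, 0.0
for _ in range(1000):
    x = random_in_ball(rng, n, 1.0)              # true signal in X
    eta = random_in_ball(rng, n, sigma)          # observation error with ||eta||_2 <= sigma
    omega = x + eta
    estimate, x_omega = nonlinear_estimate(omega, B)
    worst_residual = max(worst_residual, np.linalg.norm(x_omega - omega))
    worst_error = max(worst_error, np.linalg.norm(estimate - B @ x))

# Both x and x_omega are within sigma of omega, so they differ by at most 2 sigma.
error_cap = 2.0 * sigma * np.linalg.norm(B, 2)
print(worst_residual <= sigma + 1e-12, worst_error <= error_cap + 1e-12)
```

For general $A$, $\mathcal{X}$, and norms, the feasibility problem would be handed to a convex solver; the projection shortcut is specific to the assumptions made here.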
Note that the estimate is well-defined, since $(F[\omega])$ clearly is solvable, with one of the feasible solutions being the true signal underlying observation $\omega$. When $\mathcal{X}$ is a computationally tractable convex compact set, and $\|\cdot\|_{(m)}$ is an efficiently computable norm, a feasible solution to $(F[\omega])$ can be found in a computationally efficient fashion. Let us make the following immediate observation:

Proposition 4.21. The estimate $\hat x_*$ is optimal within factor 2:
\[
\mathrm{Risk}_{\mathcal{H}}[\hat x_*|\mathcal{X}] \le \mathrm{Opt}_* := \sup_{x,y}\big\{\|Bx - By\|_{(\nu)} : \ x, y \in \mathcal{X}, \ \|A(x - y)\|_{(m)} \le 2\sigma\big\} \le 2\,\mathrm{Risk}_{\mathrm{opt},\mathcal{H}} \tag{4.66}
\]
where $\mathrm{Risk}_{\mathrm{opt},\mathcal{H}}$ is the infimum of $\mathcal{H}$-risk over all estimates.

The proof of the proposition is the subject of Exercise 4.28.

4.5.1.4 Quantifying risk
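Note that Proposition 4.21 does not impose restrictions on $\mathcal{X}$ and the norms $\|\cdot\|_{(m)}$, $\|\cdot\|_{(\nu)}$. The only, but essential, shortcoming of the estimate $\hat x_*$ is that we do not know, in general, what its $\mathcal{H}$-risk is. From (4.66) it follows that this risk is tightly (namely, within factor 2) upper-bounded by $\mathrm{Opt}_*$, but this quantity, being the maximum of a convex function over some domain, can be difficult to compute. Aside from a handful of special cases where this difficulty does not arise, there is a generic situation when $\mathrm{Opt}_*$ can be tightly upper-bounded by efficient computation. This is the situation where $\mathcal{X}$ is the spectratope defined in (4.63), $\|\cdot\|_{(m)}$ is such that the unit ball of this norm is a basic spectratope,
\[
\mathcal{B}_{(m)} := \{u : \|u\|_{(m)} \le 1\} = \{u \in \mathbf{R}^m : \exists p \in \mathcal{P} : Q_j^2[u] \preceq p_jI_{e_j}, \ 1 \le j \le J\},
\]
and the unit ball of the norm $\|\cdot\|_{(\nu),*}$ conjugate to the norm $\|\cdot\|_{(\nu)}$ is a spectratope,
\[
\begin{aligned}
\mathcal{B}^*_{(\nu)} &:= \{v \in \mathbf{R}^\nu : \|v\|_{(\nu),*} \le 1\} \\
&= \{v : \exists(w \in \mathbf{R}^N, r \in \mathcal{R}) : v = Mw, \ S_\ell^2[w] \preceq r_\ell I_{f_\ell}, \ 1 \le \ell \le L\},
\end{aligned}
\]
with the usual restrictions on $\mathcal{P}$, $\mathcal{R}$, $Q_j[\cdot]$, and $S_\ell[\cdot]$.

Proposition 4.22. In the situation in question, consider the convex optimization problem
\[
\begin{aligned}
\mathrm{Opt} = \min_{\Lambda=\{\Lambda_k,k\le K\},\,\Upsilon=\{\Upsilon_\ell,\ell\le L\},\,\Sigma=\{\Sigma_j,j\le J\}}\Bigg\{ & \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \sigma^2\phi_{\mathcal{P}}(\lambda[\Sigma]) + \phi_{\mathcal{R}}(\lambda[\Sigma]) : \\
& \Lambda_k \succeq 0, \ \Upsilon_\ell \succeq 0, \ \Sigma_j \succeq 0 \ \forall(k, \ell, j), \\
& \begin{bmatrix} \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell] & M^TB \\ B^TM & \sum_k\mathcal{R}_k^*[\Lambda_k] + A^T[\sum_j\mathcal{Q}_j^*[\Sigma_j]]A \end{bmatrix} \succeq 0 \Bigg\},
\end{aligned} \tag{4.67}
\]
where $\mathcal{R}_k^*[\cdot]$ is associated with mapping $x \mapsto R_k[x]$ according to (4.25), $\mathcal{S}_\ell^*[\cdot]$ and $\mathcal{Q}_j^*[\cdot]$ are associated in the same fashion with mappings $w \mapsto S_\ell[w]$ and $u \mapsto Q_j[u]$, respectively, and $\phi_{\mathcal{T}}$, $\phi_{\mathcal{R}}$, and $\phi_{\mathcal{P}}$ are the support functions of the corresponding sets $\mathcal{T}$, $\mathcal{R}$, and $\mathcal{P}$. The optimal value in (4.67) is an efficiently computable upper bound on the quantity $\mathrm{Opt}_*$ defined in (4.66), and this bound is tight within factor
\[
2\max[\ln(2D), 1], \quad D = \sum_kd_k + \sum_\ell f_\ell + \sum_je_j.
\]
Proof of the proposition is the subject of Exercise 4.29.

4.5.2 Mixed noise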
So far, we have considered separately the cases of random and uncertain-but-bounded observation noises in (4.31). Note that both these observation schemes are covered by the following “mixed” scheme:
\[
\omega = Ax + \xi + \eta,
\]
where, as above, $A$ is a given $m \times n$ matrix, $x$ is an unknown deterministic signal known to belong to a given signal set $\mathcal{X}$, $\xi$ is random noise with distribution known to belong to a family $\mathcal{P}$ of Borel probability distributions on $\mathbf{R}^m$ satisfying (4.32) for a given convex compact set $\Pi \subset \mathrm{int}\,\mathbf{S}^m_+$, and $\eta$ is an “uncertain-but-bounded” observation error known to belong to a given set $\mathcal{H}$. As before, our goal is to estimate $Bx \in \mathbf{R}^\nu$ via observation $\omega$. In our present setting, given a norm $\|\cdot\|$ on $\mathbf{R}^\nu$, we can quantify the performance of a candidate estimate $\omega \mapsto \hat x(\omega) : \mathbf{R}^m \to \mathbf{R}^\nu$ by its risk
\[
\mathrm{Risk}_{\Pi,\mathcal{H},\|\cdot\|}[\hat x|\mathcal{X}] = \sup_{x\in\mathcal{X},\,P\lhd\Pi,\,\eta\in\mathcal{H}}\mathbf{E}_{\xi\sim P}\{\|Bx - \hat x(Ax + \xi + \eta)\|\}.
\]
Observe that the estimation problem associated with the “mixed” observation scheme straightforwardly reduces to a similar problem for the random observation scheme, by the same trick we have used in Section 4.5 to eliminate the observation noise. Indeed, let us treat $x^+ = [x; \eta] \in \mathcal{X}^+ := \mathcal{X} \times \mathcal{H}$ and $\mathcal{X}^+$ as the new signal/signal set underlying our observation, and set
\[
\bar Ax^+ = Ax + \eta, \quad \bar Bx^+ = Bx.
\]
With these conventions, the “mixed” observation scheme reduces to
\[
\omega = \bar Ax^+ + \xi,
\]
and for every candidate estimate $\hat x(\cdot)$ it clearly holds
\[
\mathrm{Risk}_{\Pi,\mathcal{H},\|\cdot\|}[\hat x|\mathcal{X}] = \mathrm{Risk}_{\Pi,\|\cdot\|}[\hat x|\mathcal{X}^+],
\]
so that we find ourselves in the situation of Section 4.3.3.1. Assuming that $\mathcal{X}$ and $\mathcal{H}$ are spectratopes, so is $\mathcal{X}^+$, meaning that all results of Section 4.3.3 on building presumably good linear estimates and their near-optimality are applicable to our present setup.

4.6 CALCULUS OF ELLITOPES/SPECTRATOPES
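We present here the rules of the calculus of ellitopes/spectratopes. We formulate these rules for ellitopes; the “spectratopic versions” of the rules are straightforward modifications of their “ellitopic versions.”

• Intersection $\mathcal{X} = \bigcap_{i=1}^{I}\mathcal{X}_i$ of ellitopes
\[
\mathcal{X}_i = \{x \in \mathbf{R}^n : \exists(y^i \in \mathbf{R}^{n_i}, t^i \in \mathcal{T}_i) : x = P_iy^i \ \& \ [y^i]^TR_{ik}y^i \le t^i_k, \ 1 \le k \le K_i\}
\]
is an ellitope. Indeed, this is evident when $\mathcal{X} = \{0\}$. Assuming $\mathcal{X} \ne \{0\}$, we have
\[
\begin{aligned}
\mathcal{X} &= \big\{x \in \mathbf{R}^n : \exists(y = [y^1; ...; y^I] \in \mathcal{Y}, \ t = (t^1, ..., t^I) \in \mathcal{T} = \mathcal{T}_1 \times ... \times \mathcal{T}_I) : \\
&\qquad x = Py := P_1y^1 \ \& \ \underbrace{[y^i]^TR_{ik}y^i}_{y^TR^+_{ik}y} \le t^i_k, \ 1 \le k \le K_i, \ 1 \le i \le I\big\}, \\
\mathcal{Y} &= \{[y^1; ...; y^I] \in \mathbf{R}^{n_1+...+n_I} : P_iy^i = P_1y^1, \ 2 \le i \le I\}
\end{aligned}
\]
(note that $\mathcal{Y}$ can be identified with $\mathbf{R}^{\bar n}$ with a properly selected $\bar n > 0$).

• The direct product $\mathcal{X} = \prod_{i=1}^{I}\mathcal{X}_i$ of ellitopes
\[
\mathcal{X}_i = \{x^i \in \mathbf{R}^{n_i} : \exists(y^i \in \mathbf{R}^{\bar n_i}, t^i \in \mathcal{T}_i) : x^i = P_iy^i \ \& \ [y^i]^TR_{ik}y^i \le t^i_k, \ 1 \le k \le K_i\}, \quad 1 \le i \le I,
\]
is an ellitope:
\[
\begin{aligned}
\mathcal{X} = \Big\{ & [x^1; ...; x^I] \in \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_I} : \exists\Big(y = [y^1; ...; y^I] \in \mathbf{R}^{\bar n_1+...+\bar n_I}, \ t = (t^1, ..., t^I) \in \mathcal{T} = \mathcal{T}_1 \times ... \times \mathcal{T}_I\Big) : \\
& x = Py := [P_1y^1; ...; P_Iy^I], \ \underbrace{[y^i]^TR_{ik}y^i}_{y^TR^+_{ik}y} \le t^i_k, \ 1 \le k \le K_i, \ 1 \le i \le I\Big\}.
\end{aligned}
\]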
• The linear image $\mathcal{Z} = \{Rx : x \in \mathcal{X}\}$, $R \in \mathbf{R}^{p\times n}$, of an ellitope $\mathcal{X} = \{x \in \mathbf{R}^n : \exists(y \in \mathbf{R}^{\bar n}, t \in \mathcal{T}) : x = Py \ \& \ y^TR_ky \le t_k, \ 1 \le k \le K\}$ is an ellitope:
\[
\mathcal{Z} = \{z \in \mathbf{R}^p : \exists(y \in \mathbf{R}^{\bar n}, t \in \mathcal{T}) : z = [RP]y \ \& \ y^TR_ky \le t_k, \ 1 \le k \le K\}.
\]

• The inverse linear image $\mathcal{Z} = \{z \in \mathbf{R}^q : Rz \in \mathcal{X}\}$, $R \in \mathbf{R}^{n\times q}$, of an ellitope $\mathcal{X} = \{x \in \mathbf{R}^n : \exists(y \in \mathbf{R}^{\bar n}, t \in \mathcal{T}) : x = Py \ \& \ y^TR_ky \le t_k, \ 1 \le k \le K\}$ under linear mapping $z \mapsto Rz : \mathbf{R}^q \to \mathbf{R}^n$ is an ellitope, provided that the mapping is an embedding: $\mathrm{Ker}\,R = \{0\}$. Indeed, setting $E = \{y \in \mathbf{R}^{\bar n} : Py \in \mathrm{Im}\,R\}$, we get a linear subspace in $\mathbf{R}^{\bar n}$. If $E = \{0\}$, $\mathcal{Z} = \{0\}$ is an ellitope; if $E \ne \{0\}$, we have
\[
\begin{aligned}
\mathcal{Z} &= \{z \in \mathbf{R}^q : \exists(y \in E, t \in \mathcal{T}) : z = \bar Py \ \& \ y^TR_ky \le t_k, \ 1 \le k \le K\}, \\
\bar P &= \Pi P, \ \text{ where } \Pi : \mathrm{Im}\,R \to \mathbf{R}^q \text{ is the inverse of } z \mapsto Rz : \mathbf{R}^q \to \mathrm{Im}\,R
\end{aligned}
\]
($E$ can be identified with some $\mathbf{R}^k$, and $\Pi$ is well-defined since $R$ is an embedding).

• The arithmetic sum $\mathcal{X} = \big\{x = \sum_{i=1}^{I}x^i : x^i \in \mathcal{X}_i, \ 1 \le i \le I\big\}$ of ellitopes $\mathcal{X}_i$ is an ellitope, with representation readily given by those of $\mathcal{X}_1, ..., \mathcal{X}_I$. Indeed, $\mathcal{X}$ is the image of $\mathcal{X}_1 \times ... \times \mathcal{X}_I$ under the linear mapping $[x^1; ...; x^I] \mapsto x^1 + ... + x^I$, and taking direct products and images under linear mappings preserves ellitopes.

• “$\mathcal{S}$-product.” Let $\mathcal{X}_i = \{x^i \in \mathbf{R}^{n_i} : \exists(y^i \in \mathbf{R}^{\bar n_i}, t^i \in \mathcal{T}_i) : x^i = P_iy^i \ \& \ [y^i]^TR_{ik}y^i \le t^i_k, \ 1 \le k \le K_i\}$, $1 \le i \le I$, be ellitopes, and let $\mathcal{S}$ be a convex compact set in $\mathbf{R}^I_+$ which intersects the interior of $\mathbf{R}^I_+$ and is monotone: $0 \le s' \le s \in \mathcal{S}$ implies $s' \in \mathcal{S}$. We associate with $\mathcal{S}$ the set
\[
\mathcal{S}^{1/2} = \{s \in \mathbf{R}^I_+ : [s_1^2; ...; s_I^2] \in \mathcal{S}\}
\]
of entrywise square roots of points from $\mathcal{S}$; clearly, $\mathcal{S}^{1/2}$ is a convex compact set. $\mathcal{X}_i$ and $\mathcal{S}$ specify the $\mathcal{S}$-product of the sets $\mathcal{X}_i$, $i \le I$, defined as the set
\[
\mathcal{Z} = \{z = [z^1; ...; z^I] : \exists(s \in \mathcal{S}^{1/2}, x^i \in \mathcal{X}_i, i \le I) : z^i = s_ix^i, \ 1 \le i \le I\},
\]
or, equivalently,
\[
\begin{aligned}
\mathcal{Z} &= \big\{z = [z^1; ...; z^I] : \exists(r = [r^1; ...; r^I] \in \mathcal{R}, \ y^1, ..., y^I) : z^i = P_iy^i \ \forall i \le I, \ [y^i]^TR_{ik}y^i \le r^i_k \ \forall(i \le I, k \le K_i)\big\}, \\
\mathcal{R} &= \{[r^1; ...; r^I] \ge 0 : \exists(s \in \mathcal{S}^{1/2}, t^i \in \mathcal{T}_i) : r^i = s_i^2t^i \ \forall i \le I\}.
\end{aligned}
\]
We claim that $\mathcal{Z}$ is an ellitope. All we need to verify to this end is that the set $\mathcal{R}$ is as it should be in an ellitopic representation, that is, that $\mathcal{R}$ is a compact and monotone subset of $\mathbf{R}^{K_1+...+K_I}_+$ containing a strictly positive vector (all this is evident), and that $\mathcal{R}$ is convex. To verify convexity, let $\mathbf{T}_i = \mathrm{cl}\{[t^i; \tau_i] : \tau_i > 0, \ t^i/\tau_i \in \mathcal{T}_i\}$ be the conic hulls of $\mathcal{T}_i$'s. We clearly have
\[
\mathcal{R} = \{[r^1; ...; r^I] : \exists s \in \mathcal{S}^{1/2} : [r^i; s_i^2] \in \mathbf{T}_i, \ i \le I\} = \{[r^1; ...; r^I] : \exists\sigma \in \mathcal{S} : [r^i; \sigma_i] \in \mathbf{T}_i, \ i \le I\},
\]
where the concluding equality is due to the origin of $\mathcal{S}^{1/2}$. The concluding set in the above chain clearly is convex, and we are done.
As an example, consider the situation where the ellitopes $\mathcal{X}_i$ possess nonempty interiors and thus can be thought of as unit balls of norms $\|\cdot\|_{(i)}$ on the respective spaces $\mathbf{R}^{n_i}$, and let $\mathcal{S} = \{s \in \mathbf{R}^I_+ : \|s\|_{p/2} \le 1\}$, where $p \ge 2$. In this situation, $\mathcal{S}^{1/2} = \{s \in \mathbf{R}^I_+ : \|s\|_p \le 1\}$, whence $\mathcal{Z}$ is the unit ball of the “block $p$-norm”
\[
\|[z^1; ...; z^I]\| = \big\|\big[\|z^1\|_{(1)}; ...; \|z^I\|_{(I)}\big]\big\|_p.
\]
Note also that the usual direct product of $I$ ellitopes is their $\mathcal{S}$-product, with $\mathcal{S} = [0, 1]^I$.

• “$\mathcal{S}$-weighted sum.” Let $\mathcal{X}_i \subset \mathbf{R}^n$ be ellitopes, $1 \le i \le I$, and let $\mathcal{S} \subset \mathbf{R}^I_+$, $\mathcal{S}^{1/2}$ be the same as in the previous rule. Then the $\mathcal{S}$-weighted sum of the sets $\mathcal{X}_i$, defined as
\[
\mathcal{X} = \Big\{x : \exists(s \in \mathcal{S}^{1/2}, x^i \in \mathcal{X}_i, i \le I) : x = \sum_is_ix^i\Big\},
\]
is an ellitope. Indeed, the set in question is the image of the $\mathcal{S}$-product of $\mathcal{X}_i$ under the linear mapping $[z^1; ...; z^I] \mapsto z^1 + ... + z^I$, and taking $\mathcal{S}$-products and linear images preserves the property of being an ellitope.

It should be stressed that the outlined “calculus rules” are fully algorithmic: representation (4.6) of the result of an operation is readily given by the representations (4.6) of the operands.
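The block $p$-norm example from the $\mathcal{S}$-product rule can be illustrated numerically. The sketch below assumes that every $\mathcal{X}_i$ is a Euclidean unit ball (so $\|\cdot\|_{(i)} = \|\cdot\|_2$) and takes $p = 3$; block sizes and function names are hypothetical. It checks both inclusions: points $[s_1x^1; ...; s_Ix^I]$ with $\|s\|_p \le 1$ have block $p$-norm at most 1, and every point of the block-norm ball splits back into such a representation.

```python
import numpy as np

def block_p_norm(z_blocks, p):
    """|| [ ||z^1||_2 ; ... ; ||z^I||_2 ] ||_p for a list of blocks."""
    return np.linalg.norm([np.linalg.norm(z) for z in z_blocks], ord=p)

def decompose(z_blocks):
    """Split z into weights s_i = ||z^i||_2 and unit-ball points x^i with z^i = s_i x^i."""
    s = np.array([np.linalg.norm(z) for z in z_blocks])
    xs = [z / si if si > 0 else np.zeros_like(z) for z, si in zip(z_blocks, s)]
    return s, xs

rng = np.random.default_rng(0)
sizes, p = [2, 3, 4], 3.0

# Forward direction: build z^i = s_i x^i with ||s||_p <= 1 and ||x^i||_2 <= 1.
s = np.abs(rng.standard_normal(len(sizes)))
s = 0.95 * s / np.linalg.norm(s, ord=p)
xs = []
for n_i in sizes:
    d = rng.standard_normal(n_i)
    xs.append(rng.uniform() * d / np.linalg.norm(d))
z_forward = [si * xi for si, xi in zip(s, xs)]
forward_norm = block_p_norm(z_forward, p)

# Converse direction: a point with block p-norm 0.9 decomposes into admissible (s, x^i).
z_back = [rng.standard_normal(n_i) for n_i in sizes]
scale = 0.9 / block_p_norm(z_back, p)
z_back = [scale * z for z in z_back]
s_back, xs_back = decompose(z_back)
print(forward_norm, np.linalg.norm(s_back, ord=p))
```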
4.7 EXERCISES FOR CHAPTER 4
4.7.1 Linear estimates vs. Maximum Likelihood
Exercise 4.1. Consider the problem posed at the beginning of Chapter 4: Given observation
$$\omega = Ax + \sigma\xi, \quad \xi \sim \mathcal{N}(0, I),$$
of unknown signal $x$ known to belong to a given signal set $\mathcal{X} \subset \mathbf{R}^n$, we want to recover $Bx$. Let us consider the case where matrix $A$ is square and invertible, $B$ is the identity, and $\mathcal{X}$ is a computationally tractable convex compact set. As far as computational aspects are concerned, the situation is well suited for utilizing the “magic wand” of Statistics, the Maximum Likelihood (ML) estimate, where the recovery of $x$ is
$$\hat{x}_{\mathrm{ML}}(\omega) = \mathop{\mathrm{argmin}}_{y \in \mathcal{X}} \|\omega - Ay\|_2, \qquad \mathrm{(ML)}$$
the signal which maximizes, over $y \in \mathcal{X}$, the likelihood (the probability density) of getting the observation we actually got. Indeed, with computationally tractable $\mathcal{X}$, (ML) is an explicit convex, and therefore efficiently solvable, optimization problem. Given the exclusive role played by the ML estimate in Statistics, perhaps the first question about our estimation problem is: how good is the ML estimate?
The goal of this exercise is to show that in the situation we are interested in, the ML estimate can be “heavily nonoptimal,” and this may happen even when the techniques we develop in Chapter 4 do result in an efficiently computable near-optimal linear estimate.
To justify the claim, investigate the risk (4.2) of the ML estimate in the case where
$$\mathcal{X} = \Big\{x \in \mathbf{R}^n : x_1^2 + \epsilon^{-2} \sum_{i=2}^n x_i^2 \le 1\Big\} \ \ \&\ \ A = \mathrm{Diag}\{1, \epsilon^{-1}, ..., \epsilon^{-1}\},$$
$\epsilon$ and $\sigma$ are small, and $n$ is large, so that $\sigma^2 (n - 1) \ge 2$. Accompany your theoretical analysis by numerical experiments: compare the empirical risks of the ML estimate with theoretical and empirical risks of the linear estimate optimal under the circumstances.
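One possible starting point for the numerical part is sketched below. It relies on an observation specific to this instance: with u = Ay, the constraint y ∈ X reads ‖u‖_2 ≤ 1, so (ML) amounts to the Euclidean projection of ω onto the unit ball followed by multiplication by A^{-1}. The signal is drawn on the boundary of X as x = [cos φ; sin φ · ε · ζ] with φ uniform on [0, 2π] and ζ uniform on the unit sphere of R^{n-1}. The function names are hypothetical, and the averaged error returned by `empirical_ml_risk` is only a proxy for the worst-case risk (4.2), since it averages over random signals instead of maximizing over X.

```python
import math
import random


def sample_signal(n, eps, rng):
    """x = [cos(phi); sin(phi) * eps * zeta], zeta uniform on the unit sphere of R^(n-1)."""
    phi = rng.uniform(0.0, 2.0 * math.pi)
    g = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]
    norm_g = math.sqrt(sum(t * t for t in g))
    return [math.cos(phi)] + [math.sin(phi) * eps * t / norm_g for t in g]


def observe(x, eps, sigma, rng):
    """omega = A x + sigma * xi with A = Diag{1, 1/eps, ..., 1/eps}."""
    ax = [x[0]] + [t / eps for t in x[1:]]
    return [a + sigma * rng.gauss(0.0, 1.0) for a in ax]


def ml_estimate(omega, eps):
    """Project omega onto the unit Euclidean ball, then map back by A^{-1}."""
    norm_w = math.sqrt(sum(t * t for t in omega))
    u = [t / max(1.0, norm_w) for t in omega]
    return [u[0]] + [eps * t for t in u[1:]]


def empirical_ml_risk(n, eps, sigma, runs, seed=0):
    """Square root of the average squared Euclidean error over random signals."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        x = sample_signal(n, eps, rng)
        x_hat = ml_estimate(observe(x, eps, sigma, rng), eps)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return math.sqrt(total / runs)
```

Calling, say, `empirical_ml_risk(256, 0.05, 0.05, runs=100)` gives a number to set against the risk of the linear estimate, which has to be computed separately.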
Recommended setup: $n$ runs through $\{256, 1024, 2048\}$, $\epsilon = \sigma$ runs through $\{0.01, 0.05, 0.1\}$, and signal $x$ is generated as $x = [\cos(\phi); \sin(\phi)\epsilon\zeta]$, where $\phi \sim \mathrm{Uniform}[0, 2\pi]$ and random vector $\zeta$ is independent of $\phi$ and is distributed uniformly on the unit sphere in $\mathbf{R}^{n-1}$.
4.7.2 Measurement Design in Signal Recovery
Exercise 4.2. [Measurement Design in Gaussian o.s.] As a preamble to the exercise, please read the story about possible “physics” of Gaussian o.s. from Section 2.7.3.3. The summary of the story is as follows: We consider the Measurement Design version of signal recovery in Gaussian o.s., specifically, we are allowed to use observations
$$\omega = A_q x + \sigma\xi \qquad [\xi \sim \mathcal{N}(0, I_m)],$$
where
$$A_q = \mathrm{Diag}\{\sqrt{q_1}, \sqrt{q_2}, ..., \sqrt{q_m}\} A,$$
with a given $A \in \mathbf{R}^{m \times n}$ and vector $q$ which we can select in a given convex compact set $\mathcal{Q} \subset \mathbf{R}^m_+$. The signal $x$ underlying the observation is known to belong to a given ellitope $\mathcal{X}$. Your goal is to select $q \in \mathcal{Q}$ and a linear recovery $\omega \mapsto G^T \omega$ of the image $Bx$ of $x \in \mathcal{X}$, with given $B$, resulting in the smallest worst-case, over $x \in \mathcal{X}$, expected $\|\cdot\|_2^2$ recovery risk. Modify, according to this goal, problem (4.12). Is it possible to end up with a tractable problem? Work out in full detail the case when $\mathcal{Q} = \{q \in \mathbf{R}^m_+ : \sum_i q_i = m\}$.
Exercise 4.3. [follow-up to Exercise 4.2] A translucent bar of length $n = 32$ is comprised of 32 consecutive segments of length 1 each, with density $\rho_i$ of the $i$-th segment known to belong to the interval $[\mu - \delta_i, \mu + \delta_i]$.
[Figure: sample translucent bar]
The bar is lit from the left end; when light passes through a segment with density $\rho$, the light's intensity is reduced by factor $e^{-\alpha\rho}$. The light intensity at the left endpoint of the bar is 1. You can scan the segments one by one from left to right and measure light intensity $\ell_i$ at the right endpoint of the $i$-th segment during time $q_i$; the result $z_i$ of the measurement is $\ell_i e^{\sigma\xi_i/\sqrt{q_i}}$, where $\xi_i \sim \mathcal{N}(0, 1)$ are independent across $i$. The total time budget is $n$, and you are interested in recovering the $m = n/2$-dimensional vector of densities of the right $m$ segments. Build an optimization problem responsible for near-optimal linear recovery with and without Measurement Design (in the latter case, we assume that each segment is observed during unit time) and compare the resulting near-optimal risks.
Recommended data: $\alpha = 0.01$, $\delta_i = 1.2 + \cos(4\pi(i - 1)/n)$, $\mu = 1.1 \max_i \delta_i$, $\sigma = 0.001$.
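A natural way to bring the bar model into the form of Exercise 4.2 is to take logarithms: since ℓ_i = exp(−α(ρ_1 + ... + ρ_i)), the quantity ω_i = −√q_i · ln z_i equals √q_i · α(ρ_1 + ... + ρ_i) plus σ times a standard Gaussian, i.e. ω = A_q ρ + σξ with A equal to α times the lower-triangular matrix of ones. The sketch below simulates this under the recommended data; the function names are hypothetical, and centering the densities at μ (so that the signal set becomes a box around the origin) is left to the reader.

```python
import math
import random


def bar_data(n=32, alpha=0.01, sigma=0.001):
    """Recommended data: delta_i = 1.2 + cos(4*pi*(i-1)/n), mu = 1.1 * max_i delta_i."""
    delta = [1.2 + math.cos(4.0 * math.pi * i / n) for i in range(n)]  # i plays the role of i-1
    mu = 1.1 * max(delta)
    return delta, mu, alpha, sigma


def measure(rho, q, alpha, sigma, rng):
    """z_i = l_i * exp(sigma * xi_i / sqrt(q_i)) with l_i = exp(-alpha * (rho_1 + ... + rho_i))."""
    z, cum = [], 0.0
    for rho_i, q_i in zip(rho, q):
        cum += rho_i
        z.append(math.exp(-alpha * cum + sigma * rng.gauss(0.0, 1.0) / math.sqrt(q_i)))
    return z


def linearize(z, q):
    """omega_i = -sqrt(q_i) * ln z_i, which is linear in rho up to additive Gaussian noise."""
    return [-math.sqrt(q_i) * math.log(z_i) for z_i, q_i in zip(z, q)]
```

With `q = [1.0] * n` this reproduces the setting without Measurement Design; any other nonnegative `q` summing to n corresponds to a different allocation of the time budget.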
Exercise 4.4. Let $X$ be a basic ellitope in $\mathbf{R}^n$:
$$X = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^T S_k x \le t_k, 1 \le k \le K\}$$
with our usual restrictions on $S_k$ and $\mathcal{T}$. Let, further, $m$ be a given positive integer, and $x \mapsto Bx : \mathbf{R}^n \to \mathbf{R}^\nu$ be a given linear mapping. Consider the Measurement Design problem where you are looking for a linear recovery $\omega \mapsto \hat{x}_H(\omega) := H^T \omega$ of $Bx$, $x \in X$, from observation
$$\omega = Ax + \sigma\xi \qquad [\sigma > 0 \text{ is given and } \xi \sim \mathcal{N}(0, I_m)]$$
in which the $m \times n$ sensing matrix $A$ is under your control: it is allowed to be any $m \times n$ matrix of spectral norm not exceeding 1. You are interested in selecting $H$ and $A$ in order to minimize the worst-case, over $x \in X$, expected $\|\cdot\|_2^2$ recovery error. Similarly to (4.12), this problem can be posed as
$$\mathrm{Opt} = \min_{H,\lambda,A} \left\{ \sigma^2 \mathrm{Tr}(H^T H) + \phi_{\mathcal{T}}(\lambda) : \begin{bmatrix} \sum_k \lambda_k S_k & B^T - A^T H \\ B - H^T A & I_\nu \end{bmatrix} \succeq 0, \ \|A\| \le 1, \ \lambda \ge 0 \right\}, \qquad (4.68)$$
where $\|\cdot\|$ stands for the spectral norm. The objective in this problem is the (upper bound on the) squared risk $\mathrm{Risk}^2[\hat{x}_H | X]$, the sensing matrix being $A$. The problem is nonconvex, since the matrix participating in the semidefinite constraint is bilinear in $H$ and $A$.
A natural way to handle an optimization problem with objective and/or constraints bilinear in the decision variables $u$, $v$ is to use “alternating minimization,” where one alternates optimization in $v$ for $u$ fixed and optimization in $u$ for $v$ fixed, the value of the variable fixed in a round being the result of optimization w.r.t. this variable in the previous round. Alternating minimizations are carried out until the value of the objective (which in the outlined process cannot get worse from round to round) stops improving (or nearly so). Since the algorithm does not necessarily converge to the globally optimal solution to the problem of interest, it makes sense to run the algorithm several times from different, say, randomly selected, starting points.
Now comes the exercise.
1. Implement Alternating Minimization as applied to (4.68). You may restrict your experimentation to the case where the sizes $m$, $n$, $\nu$ are quite moderate, in the range of tens, and $X$ is either the box $\{x : j^{2\gamma} x_j^2 \le 1, 1 \le j \le n\}$, or the ellipsoid $\{x : \sum_{j=1}^n j^{2\gamma} x_j^2 \le 1\}$, where $\gamma$ is a nonnegative parameter (try $\gamma = 0, 1, 2, 3$). As for $B$, you can generate it at random, or enforce $B$ to have prescribed singular values, say, $\sigma_j = j^{-\theta}$, $1 \le j \le \nu$, and a randomly selected system of singular vectors.
2. Identify cases where a globally optimal solution to (4.68) is easy to find and use this information in order to understand how reliable Alternating Minimization is in the application in question, reliability meaning the ability to identify near-optimal, in terms of the objective, solutions. If you are not satisfied with Alternating Minimization “as is,” try to improve it.
3. Modify (4.68) and your experiment to cover the cases where the constraint $\|A\| \le 1$ on the sensing matrix is replaced with one of the following:
• $\|\mathrm{Row}_i[A]\|_2 \le 1$, $1 \le i \le m$,
• $|A_{ij}| \le 1$ for all $i, j$
(note that these two types of restrictions mimic what happens if you are interested in recovering (the linear image of) the vector of parameters in a linear regression model from noisy observations of the model's outputs at the $m$ points which you are allowed to select in the unit ball or unit box).
4. [Embedded Exercise] Recall that a $\nu \times n$ matrix $G$ admits singular value decomposition $G = UDV^T$ with orthogonal matrices $U \in \mathbf{R}^{\nu \times \nu}$ and $V \in \mathbf{R}^{n \times n}$ and diagonal $\nu \times n$ matrix $D$ with nonnegative and nonincreasing diagonal entries.⁹ These entries are uniquely defined by $G$ and are called singular values $\sigma_i(G)$, $1 \le i \le \min[\nu, n]$. Singular values admit characterization similar to variational characterization of eigenvalues of a symmetric matrix; see, e.g., [15, Section A.7.3]:
Theorem 4.23. [VCSV: Variational Characterization of Singular Values] For a $\nu \times n$ matrix $G$ it holds
$$\sigma_i(G) = \min_{E \in \mathcal{E}_i} \ \max_{e \in E, \|e\|_2 = 1} \|Ge\|_2, \quad 1 \le i \le \min[\nu, n], \qquad (4.69)$$
where $\mathcal{E}_i$ is the family of all subspaces in $\mathbf{R}^n$ of codimension $i - 1$.
Corollary 4.24. [SVI: Singular Value Interlacement] Let $G$ and $G'$ be $\nu \times n$ matrices, and let $k = \mathrm{Rank}(G - G')$. Then
$$\sigma_i(G) \ge \sigma_{i+k}(G'), \quad 1 \le i \le \min[\nu, n],$$
where, by definition, singular values of a $\nu \times n$ matrix with indexes $> \min[\nu, n]$ are zeros.
We denote by $\sigma(G)$ the vector of singular values of $G$ arranged in nonincreasing order. The function $\|G\|_{\mathrm{Sh},p} = \|\sigma(G)\|_p$ is called the Schatten $p$-norm of matrix $G$; this indeed is a norm on the space of $\nu \times n$ matrices, and the conjugate norm is $\|\cdot\|_{\mathrm{Sh},q}$, with $\frac{1}{p} + \frac{1}{q} = 1$. An easy and important consequence of Corollary 4.24 is the following fact:
Corollary 4.25. Given a $\nu \times n$ matrix $G$, an integer $k$, $0 \le k \le \min[\nu, n]$, and $p \in [1, \infty]$, (one of) the best approximation of $G$ in the Schatten $p$-norm among matrices of rank $\le k$ is obtained from $G$ by zeroing out all but $k$ largest singular values, that is, the matrix $G^k = \sum_{i=1}^k \sigma_i(G) \mathrm{Col}_i[U] \mathrm{Col}_i^T[V]$, where $G = UDV^T$ is the singular value decomposition of $G$.
Prove Theorem 4.23 and Corollaries 4.24 and 4.25.
5. Consider the Measurement Design problem (4.68) in the case when $X$ is an ellipsoid:
$$X = \Big\{x \in \mathbf{R}^n : \sum_{j=1}^n x_j^2/a_j^2 \le 1\Big\},$$
$A$ is an $m \times n$ matrix of spectral norm not exceeding 1, and there is no noise in observations: $\sigma = 0$. Find an optimal solution to this problem. Think how this result can be used to get a (hopefully) good starting point for Alternating Minimization in the case when $X$ is an ellipsoid and $\sigma$ is small.
⁹ We say that a rectangular matrix $D$ is diagonal if all entries $D_{ij}$ in $D$ with $i \ne j$ are zeros.
4.7.3 Around semidefinite relaxation
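Exercise 4.5. Let $\mathcal{X}$ be an ellitope:
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^N, t \in \mathcal{T}) : x = Py, \ y^T S_k y \le t_k, k \le K\}$$
with our standard restrictions on $\mathcal{T}$ and $S_k$. Representing $S_k = \sum_{j=1}^{r_k} s_{kj} s_{kj}^T$, we can pass from the initial ellitopic representation of $\mathcal{X}$ to the spectratopic representation of the same set:
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^N, t^+ \in \mathcal{T}^+) : x = Py, \ [s_{kj}^T y]^2 \preceq t^+_{kj} I_1, \ 1 \le k \le K, 1 \le j \le r_k\}$$
$$\Big[\mathcal{T}^+ = \Big\{t^+ = \{t^+_{kj} \ge 0\} : \exists t \in \mathcal{T} : \sum_{j=1}^{r_k} t^+_{kj} \le t_k, 1 \le k \le K\Big\}\Big].$$
If now $C$ is a symmetric $n \times n$ matrix and $\mathrm{Opt}_* = \max_{x \in \mathcal{X}} x^T C x$, we have
$$\mathrm{Opt}_* \le \mathrm{Opt}_e := \min_{\lambda = \{\lambda_k \in \mathbf{R}_+\}} \Big\{\phi_{\mathcal{T}}(\lambda) : P^T C P \preceq \sum_k \lambda_k S_k\Big\},$$
$$\mathrm{Opt}_* \le \mathrm{Opt}_s := \min_{\Lambda = \{\Lambda_{kj} \in \mathbf{R}_+\}} \Big\{\phi_{\mathcal{T}^+}(\Lambda) : P^T C P \preceq \sum_{k,j} \Lambda_{kj} s_{kj} s_{kj}^T\Big\},$$
where the first relation is yielded by the ellitopic representation of $\mathcal{X}$ and Proposition 4.6, and the second, on closer inspection (carry this inspection out!), by the spectratopic representation of $\mathcal{X}$ and Proposition 4.8. Prove that $\mathrm{Opt}_e = \mathrm{Opt}_s$.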
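The passage from the ellitopic to the spectratopic representation rests on the identity y^T S_k y = Σ_j (s_kj^T y)^2: whenever y^T S_k y ≤ t_k, the choice t^+_kj = (s_kj^T y)^2 is an admissible split of the budget t_k. The following small sketch checks this identity numerically for one matrix; the helper names are hypothetical, and matrices are plain lists of rows.

```python
def quad_form(S, y):
    """y^T S y for a square matrix S given as a list of rows."""
    n = len(y)
    return sum(y[i] * S[i][j] * y[j] for i in range(n) for j in range(n))


def rank_one_sum(vectors):
    """S = sum_j s_j s_j^T for a list of vectors s_j of equal length."""
    n = len(vectors[0])
    return [[sum(s[i] * s[j] for s in vectors) for j in range(n)] for i in range(n)]


def split_budget(vectors, y):
    """Witness t^+_j = (s_j^T y)^2; the entries sum to y^T S y."""
    return [sum(a * b for a, b in zip(s, y)) ** 2 for s in vectors]
```

For instance, with `vectors = [[1, 0, 2], [0, 1, -1]]` and `y = [1, 2, 3]`, both `quad_form(rank_one_sum(vectors), y)` and `sum(split_budget(vectors, y))` evaluate to 50.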
Exercise 4.6. Proposition 4.6 provides us with an upper bound on the quality of the semidefinite relaxation as applied to the problem of upper-bounding the maximum of a homogeneous quadratic form over an ellitope. Extend the construction to the case where an inhomogeneous quadratic form is maximized over a shifted ellitope, so that the quantity to upper-bound is
$$\mathrm{Opt} = \max_{x \in \mathcal{X}} \left[f(x) := x^T A x + 2b^T x + c\right],$$
$$\mathcal{X} = \{x : \exists (y, t \in \mathcal{T}) : x = Py + p, \ y^T S_k y \le t_k, 1 \le k \le K\}$$
with our standard assumptions on $S_k$ and $\mathcal{T}$.
Note: $\mathcal{X}$ is centered at $p$, and a natural upper bound on $\mathrm{Opt}$ is
$$\mathrm{Opt} \le f(p) + \widehat{\mathrm{Opt}},$$
where $\widehat{\mathrm{Opt}}$ is an upper bound on the quantity
$$\overline{\mathrm{Opt}} = \max_{x \in \mathcal{X}} \left[f(x) - f(p)\right].$$
What you are interested in upper-bounding is the ratio $\widehat{\mathrm{Opt}}/\overline{\mathrm{Opt}}$.
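A possible first step, offered as a sketch and not as the full solution: assuming $A$ is symmetric, substitute $x = Py + p$ to isolate the part of $f$ that varies over the set, and then homogenize with one extra scalar variable $s$ (the symbols $\beta$ and $s$ are introduced here for illustration only).

```latex
\begin{aligned}
f(Py+p) - f(p) &= y^T [P^T A P]\, y + 2\beta^T y, \qquad \beta := P^T (A p + b),\\
\max_{x\in\mathcal{X}} [f(x) - f(p)]
 &= \max_{y}\left\{ y^T [P^T A P]\, y + 2\beta^T y :
      \exists t \in \mathcal{T} : y^T S_k y \le t_k,\ k \le K \right\}\\
 &= \max_{y,\,s}\left\{
      \begin{bmatrix} y \\ s \end{bmatrix}^T
      \begin{bmatrix} P^T A P & \beta \\ \beta^T & 0 \end{bmatrix}
      \begin{bmatrix} y \\ s \end{bmatrix} :
      \exists t \in \mathcal{T} : y^T S_k y \le t_k,\ k \le K,\ s^2 \le 1 \right\}.
\end{aligned}
```

The last equality holds because the extended objective is linear in $s$, so its maximum over $|s| \le 1$ is attained at $s = \pm 1$, and the constraints on $y$ are symmetric under $y \mapsto -y$. The set of pairs $[y; s]$ is again an ellitope, which is what makes the homogeneous bound of Proposition 4.6 applicable to the extended form.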
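Exercise 4.7. [estimating Kolmogorov widths of spectratopes/ellitopes]
4.7.A. Preliminaries: Kolmogorov and Gelfand widths. Let $\mathcal{X}$ be a convex compact set in $\mathbf{R}^n$, and let $\|\cdot\|$ be a norm on $\mathbf{R}^n$. Given a linear subspace $E$ in $\mathbf{R}^n$, let
$$\mathrm{dist}_{\|\cdot\|}(x, E) = \min_{z \in E} \|x - z\| : \mathbf{R}^n \to \mathbf{R}_+$$
be the $\|\cdot\|$-distance from $x$ to $E$. The quantity
$$\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E) = \max_{x \in \mathcal{X}} \mathrm{dist}_{\|\cdot\|}(x, E)$$
can be viewed as the worst-case $\|\cdot\|$-accuracy to which vectors from $\mathcal{X}$ can be approximated by vectors from $E$. Given positive integer $m \le n$ and denoting by $\mathcal{E}_m$ the family of all linear subspaces in $\mathbf{R}^n$ of dimension $m$, the quantity
$$\delta_m(\mathcal{X}, \|\cdot\|) = \min_{E \in \mathcal{E}_m} \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E)$$
can be viewed as the best achievable quality of approximation, measured in $\|\cdot\|$, of vectors from $\mathcal{X}$ by vectors from an $m$-dimensional linear subspace of $\mathbf{R}^n$. This quantity is called the $m$-th Kolmogorov width of $\mathcal{X}$ w.r.t. $\|\cdot\|$.
Observe that one has
$$\mathrm{dist}_{\|\cdot\|}(x, E) = \max_{\xi} \{\xi^T x : \|\xi\|_* \le 1, \xi \in E^\perp\}, \qquad \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E) = \max_{x \in \mathcal{X}, \ \|\xi\|_* \le 1, \ \xi \in E^\perp} \xi^T x, \qquad (4.70)$$
where $E^\perp$ is the orthogonal complement to $E$.
1) Prove (4.70). Hint: Represent $\mathrm{dist}_{\|\cdot\|}(x, E)$ as the optimal value in a conic problem on the cone $\mathbf{K} = \{[x; t] : t \ge \|x\|\}$ and use the Conic Duality Theorem.
Now consider the case when $\mathcal{X}$ is the unit ball of some norm $\|\cdot\|_{\mathcal{X}}$. In this case (4.70) combines with the definition of Kolmogorov width to imply that
$$\begin{aligned}
\delta_m(\mathcal{X}, \|\cdot\|) &= \min_{E \in \mathcal{E}_m} \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E) = \min_{E \in \mathcal{E}_m} \ \max_{x \in \mathcal{X}} \ \max_{y \in E^\perp, \|y\|_* \le 1} y^T x \\
&= \min_{E \in \mathcal{E}_m} \ \max_{y \in E^\perp, \|y\|_* \le 1} \ \max_{x : \|x\|_{\mathcal{X}} \le 1} y^T x \\
&= \min_{F \in \mathcal{E}_{n-m}} \ \max_{y \in F, \|y\|_* \le 1} \|y\|_{\mathcal{X},*},
\end{aligned} \qquad (4.71)$$
where $\|\cdot\|_{\mathcal{X},*}$ is the norm conjugate to $\|\cdot\|_{\mathcal{X}}$. Note that when $\mathcal{Y}$ is a convex compact set in $\mathbf{R}^n$ and $|\cdot|$ is a norm on $\mathbf{R}^n$, the quantity
$$d_m(\mathcal{Y}, |\cdot|) = \min_{F \in \mathcal{E}_{n-m}} \ \max_{y \in \mathcal{Y} \cap F} |y|$$
has a name: it is called the $m$-th Gelfand width of $\mathcal{Y}$ taken w.r.t. $|\cdot|$. The “duality relation” (4.71) states that
When $\mathcal{X}$, $\mathcal{Y}$ are the unit balls of respective norms $\|\cdot\|_{\mathcal{X}}$, $\|\cdot\|_{\mathcal{Y}}$, for every $m < n$ the $m$-th Kolmogorov width of $\mathcal{X}$ taken w.r.t. $\|\cdot\|_{\mathcal{Y},*}$ is the same as the $m$-th Gelfand width of $\mathcal{Y}$ taken w.r.t. $\|\cdot\|_{\mathcal{X},*}$.
The goal of the remaining part of the exercise is to use our results on the quality of semidefinite relaxation on ellitopes/spectratopes to infer efficiently computable upper bounds on Kolmogorov widths of a given set $\mathcal{X} \subset \mathbf{R}^n$. In the sequel we assume that
• $\mathcal{X}$ is a spectratope:
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists (t \in \mathcal{T}, u) : x = Pu, \ R_k^2[u] \preceq t_k I_{d_k}, k \le K\};$$
• The unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$ is a spectratope:
$$\mathcal{B}_* = \{y : \|y\|_* \le 1\} = \{y \in \mathbf{R}^n : \exists (r \in \mathcal{R}, z) : y = Mz, \ S_\ell^2[z] \preceq r_\ell I_{f_\ell}, \ell \le L\},$$
with our usual restrictions on $\mathcal{T}$, $\mathcal{R}$ and $R_k[\cdot]$ and $S_\ell[\cdot]$.
4.7.B. Simple case: $\|\cdot\| = \|\cdot\|_2$. We start with the simple case where $\|\cdot\| = \|\cdot\|_2$, so that $\mathcal{B}_*$ is the ellitope $\{y : y^T y \le 1\}$. Let $D = \sum_k d_k$ be the size of the spectratope $\mathcal{X}$, and let $\kappa = 2\max[\ln(2D), 1]$.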
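For orientation, there is one classical special case where the Euclidean Kolmogorov widths are known in closed form and can serve as a sanity check for an implementation of the bounds that follow: for the axis-aligned ellipsoid {x : Σ_j x_j²/a_j² ≤ 1}, the m-th width w.r.t. ‖·‖_2 equals the (m+1)-th largest semi-axis and is attained on the span of the m coordinate axes with the largest a_j. The sketch below encodes this fact together with the constant κ; the function names are hypothetical.

```python
import math
import random


def kappa(D):
    """kappa = 2 * max[ln(2D), 1] for a spectratope of size D."""
    return 2.0 * max(math.log(2.0 * D), 1.0)


def ellipsoid_width(a, m):
    """m-th Kolmogorov width in l_2 of {x : sum_j x_j^2 / a_j^2 <= 1}."""
    axes = sorted(a, reverse=True)
    return axes[m] if m < len(axes) else 0.0


def dist_to_top_axes(x, a, m):
    """Euclidean distance from x to the span of the m coordinate axes with largest a_j."""
    order = sorted(range(len(a)), key=lambda j: -a[j])
    return math.sqrt(sum(x[j] ** 2 for j in order[m:]))
```

Any point of the ellipsoid should lie within `ellipsoid_width(a, m)` of that coordinate subspace, which is easy to confirm by sampling.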
Given integer $m < n$, consider the convex optimization problem
$$\mathrm{Opt}(m) = \min_{\Lambda = \{\Lambda_k, k \le K\}, Y} \Big\{\phi_{\mathcal{T}}(\lambda[\Lambda]) : \Lambda_k \succeq 0 \ \forall k, \ 0 \preceq Y \preceq I_n, \ \sum_k R_k^*[\Lambda_k] \succeq P^T Y P, \ \mathrm{Tr}(Y) = n - m\Big\}. \qquad (P_m)$$
2) Prove the following:
Proposition 4.26. Whenever $1 \le \mu \le m < n$, one has
$$\mathrm{Opt}(m) \le \kappa \delta_m^2(\mathcal{X}, \|\cdot\|_2) \ \ \&\ \ \delta_m^2(\mathcal{X}, \|\cdot\|_2) \le \frac{m+1}{m+1-\mu} \mathrm{Opt}(\mu). \qquad (4.72)$$
Moreover, the above upper bounds on $\delta_m(\mathcal{X}, \|\cdot\|_2)$ are “constructive,” meaning that an optimal solution to $(P_\mu)$, $\mu \le m$, can be straightforwardly converted into a linear subspace $E^{m,\mu}$ of dimension $m$ such that
$$\mathrm{dist}_{\|\cdot\|_2}(\mathcal{X}, E^{m,\mu}) \le \sqrt{\frac{m+1}{m+1-\mu} \mathrm{Opt}(\mu)}.$$
Finally, $\mathrm{Opt}(\mu)$ is nonincreasing in $\mu < n$.
4.7.C. General case. Now consider the case when both $\mathcal{X}$ and the unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$ are spectratopes. As we are about to see, this case is essentially more difficult than the case of $\|\cdot\| = \|\cdot\|_2$, but something still can be done.
3) Prove the following statement:
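(!) Given $m < n$, let $Y$ be an orthoprojector of $\mathbf{R}^n$ of rank $n - m$, and let collections $\Lambda = \{\Lambda_k \succeq 0, k \le K\}$ and $\Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}$ satisfy the relation
$$\begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2} P^T Y M \\ \frac{1}{2} M^T Y P & \sum_\ell S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0. \qquad (4.73)$$
Then
$$\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\, Y) \le \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]). \qquad (4.74)$$
As a result,
$$\begin{aligned}
\delta_m(\mathcal{X}, \|\cdot\|) &\le \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\, Y) \\
&\le \mathrm{Opt} := \min_{\Lambda = \{\Lambda_k, k \le K\}, \Upsilon = \{\Upsilon_\ell, \ell \le L\}} \left\{ \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l} \Lambda_k \succeq 0 \ \forall k, \ \Upsilon_\ell \succeq 0 \ \forall \ell, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2} P^T Y M \\ \frac{1}{2} M^T Y P & \sum_\ell S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \end{array} \right\}.
\end{aligned} \qquad (4.75)$$
4) Prove the following statement:
(!!) Let $m$, $n$, $Y$ be as in (!). Then
$$\begin{aligned}
\delta_m(\mathcal{X}, \|\cdot\|) &\le \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\, Y) \\
&\le \widehat{\mathrm{Opt}} := \min_{\nu, \Lambda = \{\Lambda_k, k \le K\}, \Upsilon = \{\Upsilon_\ell, \ell \le L\}} \left\{ \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l} \nu \ge 0, \ \Lambda_k \succeq 0 \ \forall k, \ \Upsilon_\ell \succeq 0 \ \forall \ell, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2} P^T M \\ \frac{1}{2} M^T P & \sum_\ell S_\ell^*[\Upsilon_\ell] + \nu M^T (I - Y) M \end{bmatrix} \succeq 0 \end{array} \right\},
\end{aligned} \qquad (4.76)$$
and $\widehat{\mathrm{Opt}} \le \mathrm{Opt}$, with $\mathrm{Opt}$ given by (4.75).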
Statements (!) and (!!) suggest the following policy for upper-bounding the Kolmogorov width $\delta_m(\mathcal{X}, \|\cdot\|)$:
A. First, we select an integer $\mu$, $1 \le \mu < n$, and solve the convex optimization problem
$$\min_{\Lambda, \Upsilon, Y} \left\{ \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l} 0 \preceq Y \preceq I, \ \mathrm{Tr}(Y) = n - \mu, \\ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2} P^T Y M \\ \frac{1}{2} M^T Y P & \sum_\ell S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \end{array} \right\}. \qquad (P^\mu)$$
B. Next, we take the $Y$-component $Y^\mu$ of the optimal solution to $(P^\mu)$ and “round” it to an orthoprojector $Y$ of rank $n - m$ in the same fashion as in the case of $\|\cdot\| = \|\cdot\|_2$, that is, keep the eigenvectors of $Y^\mu$ intact and replace the $m$ smallest eigenvalues with zeros, and all remaining eigenvalues with ones.
C. Finally, we solve the convex optimization problem
$$\mathrm{Opt}_{m,\mu} = \min_{\Lambda, \Upsilon, \nu} \left\{ \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l} \nu \ge 0, \ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2} P^T M \\ \frac{1}{2} M^T P & \sum_\ell S_\ell^*[\Upsilon_\ell] + \nu M^T (I - Y) M \end{bmatrix} \succeq 0 \end{array} \right\}. \qquad (P^{m,\mu})$$
By (!!), $\mathrm{Opt}_{m,\mu}$ is an upper bound on the Kolmogorov width $\delta_m(\mathcal{X}, \|\cdot\|)$ (and in fact also on $\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\, Y)$).
Observe all the complications we encounter when passing from the simple case $\|\cdot\| = \|\cdot\|_2$ to the case of general norm $\|\cdot\|$ with a spectratope as the unit ball of the conjugate norm. Note that Proposition 4.26 gives both a lower bound $\sqrt{\mathrm{Opt}(m)/\kappa}$ on the $m$-th Kolmogorov width of $\mathcal{X}$ w.r.t. $\|\cdot\|_2$, and a family of upper bounds $\sqrt{\frac{m+1}{m+1-\mu} \mathrm{Opt}(\mu)}$, $1 \le \mu \le m$, on this width. As a result, we can approximate $\mathcal{X}$ by $m$-dimensional subspaces in the Euclidean norm in a “nearly optimal” fashion. Indeed, if for some $\epsilon$ and $k$ it holds $\delta_k(\mathcal{X}, \|\cdot\|_2) \le \epsilon$, then $\mathrm{Opt}(k) \le \kappa\epsilon^2$ by Proposition 4.26 as applied with $m = k$. On the other hand, assuming $k < n/2$, the same proposition when applied with $m = 2k$ and $\mu = k$ says that
$$\mathrm{dist}_{\|\cdot\|_2}(\mathcal{X}, E^{m,k}) \le \sqrt{\frac{2k+1}{k+1} \mathrm{Opt}(k)} \le \sqrt{2\,\mathrm{Opt}(k)} \le \sqrt{2\kappa}\,\epsilon.$$
Thus, if $\mathcal{X}$ can be approximated by a $k$-dimensional subspace within $\|\cdot\|_2$-accuracy $\epsilon$, we can efficiently get approximation of “nearly the same quality” ($\sqrt{2\kappa}\,\epsilon$ instead of $\epsilon$; recall that $\kappa$ is just logarithmic in $D$) and “nearly the same dimension” ($2k$ instead of $k$).
Neither of these options is preserved when passing from the Euclidean norm to a general one: in the latter case, we do not have lower bounds on Kolmogorov widths, and have no understanding of how tight our upper bounds are.
Now, two concluding questions:
5) Why in step A of the above bounding scheme do we utilize statement (!) rather than the less conservative (since $\widehat{\mathrm{Opt}} \le \mathrm{Opt}$) statement (!!)?
6) Implement the scheme numerically and run experiments. Recommended setup:
• Given $\sigma > 0$ and positive integers $n$ and $\kappa$, let $f$ be a function of continuous argument $t \in [0, 1]$ satisfying the smoothness restriction
$$|f^{(k)}(t)| \le \sigma^k, \quad 0 \le t \le 1, \ k = 0, 1, 2, ..., \kappa.$$
Specify $\mathcal{X}$ as the set of $n$-dimensional vectors $x$ obtained by restricting $f$ onto the $n$-point equidistant grid $\{t_i = i/n\}_{i=1}^n$. To this end, translate the description on $f$ into a bunch of two-sided linear constraints on $x$:
$$|d_{(k)}^T [x_i; x_{i+1}; ...; x_{i+k}]| \le \sigma^k, \quad 1 \le i \le n - k, \ 0 \le k \le \kappa,$$
where $d_{(k)} \in \mathbf{R}^{k+1}$ is the vector of coefficients of finite-difference approximation, with resolution $1/n$, of the $k$-th derivative:
$$d_{(0)} = 1, \ d_{(1)} = n[-1; 1], \ d_{(2)} = n^2[1; -2; 1], \ d_{(3)} = n^3[-1; 3; -3; 1], \ d_{(4)} = n^4[1; -4; 6; -4; 1], ....$$
• Recommended parameters: $n = 32$, $m = 8$, $\kappa = 5$, $\sigma \in \{0.25, 0.5, 1, 2, 4\}$.
• Run experiments with $\|\cdot\| = \|\cdot\|_1$ and $\|\cdot\| = \|\cdot\|_2$.
Exercise 4.8. [more on semidefinite relaxation] The goal of this exercise is to extend SDP relaxation beyond ellitopes/spectratopes. SDP relaxation is aimed at upper-bounding the quantity
$$\mathrm{Opt}_{\mathcal{X}}(B) = \max_{x \in \mathcal{X}} x^T B x, \qquad [B \in \mathbf{S}^n]$$
where $\mathcal{X} \subset \mathbf{R}^n$ is a given set (which we from now on assume to be nonempty convex compact). To this end we look for a computationally tractable convex compact set $\mathcal{U} \subset \mathbf{S}^n$ such that for every $x \in \mathcal{X}$ it holds $xx^T \in \mathcal{U}$; in this case, we refer to $\mathcal{U}$ as to a set matching $\mathcal{X}$ (equivalent wording: “$\mathcal{U}$ matches $\mathcal{X}$”). Given such a set $\mathcal{U}$, the optimal value in the convex optimization problem
$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_{U \in \mathcal{U}} \mathrm{Tr}(BU) \qquad (4.77)$$
is an efficiently computable convex upper bound on $\mathrm{Opt}_{\mathcal{X}}(B)$.
Given $\mathcal{U}$ matching $\mathcal{X}$, we can pass from $\mathcal{U}$ to the conic hull of $\mathcal{U}$, that is, to the set
$$\mathbf{U}[\mathcal{U}] = \mathrm{cl}\{(U, \mu) \in \mathbf{S}^n \times \mathbf{R}_+ : \mu > 0, U/\mu \in \mathcal{U}\}$$
which, as is immediately seen, is a closed convex cone contained in $\mathbf{S}^n \times \mathbf{R}_+$. The only point $(U, \mu)$ in this cone with $\mu = 0$ has $U = 0$ (since $\mathcal{U}$ is compact), and
$$\mathcal{U} = \{U : (U, 1) \in \mathbf{U}\} = \{U : \exists \mu \le 1 : (U, \mu) \in \mathbf{U}\},$$
so that the definition of $\mathrm{Opt}_{\mathcal{U}}$ can be rewritten equivalently as
$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_{U, \mu} \{\mathrm{Tr}(BU) : (U, \mu) \in \mathbf{U}, \mu \le 1\}.$$
The question, of course, is where to take a set $\mathcal{U}$ matching $\mathcal{X}$, and the answer depends on what we know about $\mathcal{X}$. For example, when $\mathcal{X}$ is a basic ellitope
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^T S_k x \le t_k, k \le K\}$$
with our usual restrictions on $\mathcal{T}$ and $S_k$, it is immediately seen that
$$x \in \mathcal{X} \Rightarrow xx^T \in \mathcal{U} = \{U \in \mathbf{S}^n : U \succeq 0, \exists t \in \mathcal{T} : \mathrm{Tr}(U S_k) \le t_k, k \le K\}.$$
Similarly, when $\mathcal{X}$ is a basic spectratope
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : S_k^2[x] \preceq t_k I_{d_k}, k \le K\}$$
with our usual restrictions on $\mathcal{T}$ and $S_k[\cdot]$, it is immediately seen that
$$x \in \mathcal{X} \Rightarrow xx^T \in \mathcal{U} = \{U \in \mathbf{S}^n : U \succeq 0, \exists t \in \mathcal{T} : \mathcal{S}_k[U] \preceq t_k I_{d_k}, k \le K\}.$$
One can verify that the semidefinite relaxation bounds on the maximum of a quadratic form on an ellitope/spectratope $\mathcal{X}$ derived in Sections 4.2.3 (for ellitopes) and 4.3.2 (for spectratopes) are nothing but the bounds (4.77) associated with the $\mathcal{U}$ just defined.
4.8.A Matching via absolute norms. There are other ways to specify a set matching $\mathcal{X}$. The seemingly simplest of them is as follows. Let $p(\cdot)$ be an absolute norm on $\mathbf{R}^n$ (recall that this is a norm $p(x)$ which depends solely on $\mathrm{abs}[x]$, where $\mathrm{abs}[x]$ is the vector comprised of the magnitudes of entries in $x$). We can convert $p(\cdot)$ into the norm $p^+(\cdot)$ on the space $\mathbf{S}^n$ as follows:
$$p^+(U) = p([p(\mathrm{Col}_1[U]); ...; p(\mathrm{Col}_n[U])]) \qquad [U \in \mathbf{S}^n].$$
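To make the construction of p^+ concrete, here is a small sketch for the special case where p is an ℓ_r norm; this choice and the helper names are assumptions made for illustration. It applies p to every column of U and then to the vector of column norms, and it can be used to check numerically, on rank-one matrices xx^T, the identity p^+(xx^T) = p²(x).

```python
import math


def lr_norm(v, r):
    """l_r norm of a vector, 1 <= r <= inf."""
    if r == math.inf:
        return max(abs(t) for t in v)
    return sum(abs(t) ** r for t in v) ** (1.0 / r)


def p_plus(U, r):
    """p^+(U) = p([p(Col_1[U]); ...; p(Col_n[U])]) for p the l_r norm; U is a list of rows."""
    n = len(U)
    cols = [[U[i][j] for i in range(n)] for j in range(n)]
    return lr_norm([lr_norm(c, r) for c in cols], r)


def outer(x):
    """Rank-one matrix x x^T as a list of rows."""
    return [[a * b for b in x] for a in x]
```

For `x = [1, -2, 3]` and r = 1 this gives `p_plus(outer(x), 1) = 36 = lr_norm(x, 1) ** 2`; the same agreement can be observed for other values of r.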
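1.1) Prove that $p^+$ indeed is a norm on $\mathbf{S}^n$, and $p^+(xx^T) = p^2(x)$. Denoting by $q(\cdot)$ the norm conjugate to $p(\cdot)$, what is the relation between the norm $(p^+)_*(\cdot)$ conjugate to $p^+(\cdot)$ and the norm $q^+(\cdot)$?
1.2) Derive from 1.1 that whenever $p(\cdot)$ is an absolute norm such that $\mathcal{X}$ is contained in the unit ball $\mathcal{B}_{p(\cdot)} = \{x : p(x) \le 1\}$ of the norm $p$, the set
$$\mathcal{U}_{p(\cdot)} = \{U \in \mathbf{S}^n : U \succeq 0, p^+(U) \le 1\}$$
is matching $\mathcal{X}$. If, in addition,
$$\mathcal{X} \subset \{x : p(x) \le 1, Px = 0\}, \qquad (4.78)$$
then the set
$$\mathcal{U}_{p(\cdot),P} = \{U \in \mathbf{S}^n : U \succeq 0, p^+(U) \le 1, PU = 0\}$$
is matching $\mathcal{X}$.
Assume that in addition to $p(\cdot)$, we have at our disposal a computationally tractable closed convex set $\mathcal{D}$ such that whenever $p(x) \le 1$, the vector $[x]^2 := [x_1^2; ...; x_n^2]$ belongs to $\mathcal{D}$; in the sequel we call such a $\mathcal{D}$ square-dominating $p(\cdot)$. For example, when $p(\cdot) = \|\cdot\|_r$, we can take
$$\mathcal{D} = \begin{cases} \{y \in \mathbf{R}^n_+ : \sum_i y_i \le 1\}, & r \le 2, \\ \{y \in \mathbf{R}^n_+ : \|y\|_{r/2} \le 1\}, & r > 2. \end{cases}$$
Prove that in this situation the above construction can be refined: whenever $\mathcal{X}$ satisfies (4.78), the set
$$\mathcal{U}^{\mathcal{D}}_{p(\cdot),P} = \{U \in \mathbf{S}^n : U \succeq 0, p^+(U) \le 1, PU = 0, \mathrm{dg}(U) \in \mathcal{D}\} \qquad [\mathrm{dg}(U) = [U_{11}; U_{22}; ...; U_{nn}]]$$
matches $\mathcal{X}$.
Note: in the sequel, we suppress $P$ in the notation $\mathcal{U}_{p(\cdot),P}$ and $\mathcal{U}^{\mathcal{D}}_{p(\cdot),P}$ when $P = 0$; thus, $\mathcal{U}_{p(\cdot)}$ is the same as $\mathcal{U}_{p(\cdot),0}$.
1.3) Check that when $p(\cdot) = \|\cdot\|_r$ with $r \in [1, \infty]$, one has
$$p^+(U) = \|U\|_r := \begin{cases} \big(\sum_{i,j} |U_{ij}|^r\big)^{1/r}, & 1 \le r < \infty, \\ \max_{i,j} |U_{ij}|, & r = \infty. \end{cases}$$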
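1.4) Let $\mathcal{X} = \{x \in \mathbf{R}^n : \|x\|_1 \le 1\}$ and $p(x) = \|x\|_1$, so that $\mathcal{X} \subset \{x : p(x) \le 1\}$, and
$$\mathrm{Conv}\{[x]^2 : x \in \mathcal{X}\} \subset \mathcal{D} = \Big\{y \in \mathbf{R}^n_+ : \sum_i y_i \le 1\Big\}. \qquad (4.79)$$
What are the bounds $\mathrm{Opt}_{\mathcal{U}_{p(\cdot)}}(B)$ and $\mathrm{Opt}_{\mathcal{U}^{\mathcal{D}}_{p(\cdot)}}(B)$? Is it true that the former (the latter) of the bounds is precise? Is it true that the former (the latter) bound is precise when $B \succeq 0$?
1.5) Let $\mathcal{X} = \{x \in \mathbf{R}^n : \|x\|_2 \le 1\}$ and $p(x) = \|x\|_2$, so that $\mathcal{X} \subset \{x : p(x) \le 1\}$ and (4.79) holds true. What are the bounds $\mathrm{Opt}_{\mathcal{U}_{p(\cdot)}}(B)$ and $\mathrm{Opt}_{\mathcal{U}^{\mathcal{D}}_{p(\cdot)}}(B)$? Is the former (the latter) bound precise?
1.6) Let $\mathcal{X} \subset \mathbf{R}^n_+$ be closed, convex, bounded, and with a nonempty interior. Verify that the set
$$\mathcal{X}^+ = \{x \in \mathbf{R}^n : \exists y \in \mathcal{X} : \mathrm{abs}[x] \le y\}$$
is the unit ball of an absolute norm $p_{\mathcal{X}}$, and this is the largest absolute norm $p(\cdot)$ such that $\mathcal{X} \subset \{x : p(x) \le 1\}$. Derive from this observation that the norm $p_{\mathcal{X}}(\cdot)$ is the best (i.e., resulting in the least conservative bounding scheme) among absolute norms which allow us to upper-bound $\mathrm{Opt}_{\mathcal{X}}(B)$ via the construction from item 1.2.
4.8.B “Calculus of matchings.” Observe that the matching we have introduced admits a kind of “calculus.” Specifically, consider the situation as follows: for $1 \le \ell \le L$, we are given
• nonempty convex compact sets $\mathcal{X}_\ell \subset \mathbf{R}^{n_\ell}$, $0 \in \mathcal{X}_\ell$, along with matching $\mathcal{X}_\ell$ convex compact sets $\mathcal{U}_\ell \subset \mathbf{S}^{n_\ell}$ giving rise to the closed convex cones
$$\mathbf{U}_\ell = \mathrm{cl}\{(U_\ell, \mu_\ell) \in \mathbf{S}^{n_\ell} \times \mathbf{R}_+ : \mu_\ell > 0, \mu_\ell^{-1} U_\ell \in \mathcal{U}_\ell\}.$$
We denote by $\vartheta_\ell(\cdot)$ the Minkowski functions of $\mathcal{X}_\ell$:
$$\vartheta_\ell(y^\ell) = \inf\{t : t > 0, t^{-1} y^\ell \in \mathcal{X}_\ell\} : \mathbf{R}^{n_\ell} \to \mathbf{R} \cup \{+\infty\};$$
note that $\mathcal{X}_\ell = \{y^\ell : \vartheta_\ell(y^\ell) \le 1\}$;
• $n_\ell \times n$ matrices $A_\ell$ such that $\sum_\ell A_\ell^T A_\ell \succ 0$.
On top of that, we are given a monotone convex set $\mathcal{T} \subset \mathbf{R}^L_+$ intersecting the interior of $\mathbf{R}^L_+$. These data specify the convex set
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : \vartheta_\ell^2(A_\ell x) \le t_\ell, \ell \le L\}. \qquad (*)$$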
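2.1) Prove the following:
Lemma 4.27. In the situation in question, the set
$$\mathcal{U} = \{U \in \mathbf{S}^n : U \succeq 0 \ \&\ \exists t \in \mathcal{T} : (A_\ell U A_\ell^T, t_\ell) \in \mathbf{U}_\ell, \ell \le L\}$$
is a closed and bounded convex set which matches $\mathcal{X}$. As a result, the efficiently computable quantity
$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_U \{\mathrm{Tr}(BU) : U \in \mathcal{U}\}$$
is an upper bound on
$$\mathrm{Opt}_{\mathcal{X}}(B) = \max_{x \in \mathcal{X}} x^T B x.$$
2.2) Prove that if $\mathcal{X} \subset \mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $m \times n$ matrix, and $\mathcal{U}$ matches $\mathcal{X}$, then the set $\mathcal{V} = \{V = PUP^T : U \in \mathcal{U}\}$ matches $\mathcal{Y} = \{y : \exists x \in \mathcal{X} : y = Px\}$.
2.3) Prove that if $\mathcal{X} \subset \mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $n \times m$ matrix of rank $m$, and $\mathcal{U}$ matches $\mathcal{X}$, then the set $\mathcal{V} = \{V \succeq 0 : PVP^T \in \mathcal{U}\}$ matches $\mathcal{Y} = \{y : Py \in \mathcal{X}\}$.
2.4) Consider the “direct product” case where $\mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_L$. When specifying $A_\ell$ as the matrix which “cuts” the $\ell$-th block $A_\ell x = x^\ell$ of a block vector $x = [x^1; ...; x^L] \in \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_L}$ and setting $\mathcal{T} = [0, 1]^L$, we cover this situation by the setup under consideration. In the direct product case, the construction from item 2.1 is as follows: given the sets $\mathcal{U}_\ell$ matching $\mathcal{X}_\ell$, we build the set
$$\mathcal{U} = \{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1+...+n_L} : U \succeq 0, \ U^{\ell\ell} \in \mathcal{U}_\ell, \ell \le L\}$$
and claim that this set matches $\mathcal{X}$.
Could we be less conservative? While we do not know how to be less conservative in general, we do know how to be less conservative in the special case when the $\mathcal{U}_\ell$ are built via absolute norms. Namely, let $p_\ell(\cdot) : \mathbf{R}^{n_\ell} \to \mathbf{R}_+$, $\ell \le L$, be absolute norms, let sets $\mathcal{D}_\ell$ be square-dominating $p_\ell(\cdot)$, let
$$\mathcal{X}_\ell \subset \widehat{\mathcal{X}}_\ell = \{x^\ell \in \mathbf{R}^{n_\ell} : P_\ell x^\ell = 0, p_\ell(x^\ell) \le 1\},$$
and let
$$\mathcal{U}_\ell = \{U \in \mathbf{S}^{n_\ell} : U \succeq 0, P_\ell U = 0, p_\ell^+(U) \le 1, \mathrm{dg}(U) \in \mathcal{D}_\ell\}.$$
In this case the above construction results in
$$\mathcal{U} = \left\{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1+...+n_L}_+ : U \succeq 0, \ \begin{array}{l} P_\ell U^{\ell\ell} = 0, \\ p_\ell^+(U^{\ell\ell}) \le 1, \\ \mathrm{dg}(U^{\ell\ell}) \in \mathcal{D}_\ell \end{array}, \ \ell \le L\right\}.$$
Now let
$$p([x^1; ...; x^L]) = \max[p_1(x^1), ..., p_L(x^L)] : \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_L} \to \mathbf{R},$$
so that $p$ is an absolute norm and
$$\mathcal{X} \subset \{x = [x^1; ...; x^L] : p(x) \le 1, P_\ell x^\ell = 0, \ell \le L\}.$$
Prove that in fact the set
$$\overline{\mathcal{U}} = \left\{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1+...+n_L}_+ : U \succeq 0, \ \begin{array}{l} P_\ell U^{\ell\ell} = 0, \\ \mathrm{dg}(U^{\ell\ell}) \in \mathcal{D}_\ell \end{array}, \ \ell \le L, \ p^+(U) \le 1\right\}$$
matches $\mathcal{X}$, and that we always have $\overline{\mathcal{U}} \subset \mathcal{U}$. Verify that in general this inclusion is strict.
4.8.C Illustration: Nullspace property revisited. Recall the sparsity-oriented signal recovery via $\ell_1$ minimization from Chapter 1: Given an $m \times n$ sensing matrix $A$ and (noiseless) observation $y = Aw$ of unknown signal $w$ known to have at most $s$ nonzero entries, we recover $w$ as
$$\widehat{w} \in \mathop{\mathrm{Argmin}}_z \{\|z\|_1 : Az = y\}.$$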
We called matrix $A$ $s$-good if whenever $y = Aw$ with $s$-sparse $w$, the only optimal solution to the right-hand side optimization problem is $w$. The (difficult to verify!) necessary and sufficient condition for $s$-goodness is the Nullspace property:
$$\mathrm{Opt} := \max_z \{\|z\|_{(s)} : z \in \mathrm{Ker}\, A, \|z\|_1 \le 1\} < 1/2,$$
where $\|z\|_{(k)}$ is the sum of the $k$ largest entries in the vector $\mathrm{abs}[z]$. A verifiable sufficient condition for $s$-goodness is
$$\widehat{\mathrm{Opt}} := \min_H \max_j \|\mathrm{Col}_j[I - H^T A]\|_{(s)} < \frac{1}{2}, \qquad (4.80)$$
the reason being that, as is immediately seen, $\widehat{\mathrm{Opt}}$ is an upper bound on $\mathrm{Opt}$ (see Proposition 1.9 with $q = 1$).
An immediate observation is that $\mathrm{Opt}$ is nothing but the maximum of quadratic form over an appropriate convex compact set. Specifically, let
$$\mathcal{X} = \{[u; v] \in \mathbf{R}^n \times \mathbf{R}^n : Au = 0, \|u\|_1 \le 1, \sum_i |v_i| \le s, \|v\|_\infty \le 1\}, \qquad B = \begin{bmatrix} & \frac{1}{2} I_n \\ \frac{1}{2} I_n & \end{bmatrix}.$$
Then
$$\begin{aligned}
\mathrm{Opt}_{\mathcal{X}}(B) &= \max_{[u;v] \in \mathcal{X}} [u; v]^T B [u; v] \\
&= \max_{u,v} \Big\{u^T v : Au = 0, \|u\|_1 \le 1, \sum_i |v_i| \le s, \|v\|_\infty \le 1\Big\} \\
&\underbrace{=}_{(a)} \max_u \big\{\|u\|_{(s)} : Au = 0, \|u\|_1 \le 1\big\} \\
&= \mathrm{Opt},
\end{aligned}$$
where (a) is due to the well-known fact (prove it!) that whenever $s$ is a positive integer $\le n$, the extreme points of the set
$$V = \Big\{v \in \mathbf{R}^n : \sum_i |v_i| \le s, \|v\|_\infty \le 1\Big\}$$
are exactly the vectors with exactly $s$ nonzero entries, the nonzero entries being $\pm 1$; as a result
$$\forall (z \in \mathbf{R}^n) : \max_{v \in V} z^T v = \|z\|_{(s)}.$$
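The relation max over v in V of z^T v = ‖z‖_(s) is easy to probe numerically. The sketch below (with hypothetical helper names) computes ‖z‖_(s) by sorting, builds a maximizer with entries ±1 on the s largest-magnitude coordinates of z, and tests membership in V.

```python
def s_norm(z, s):
    """||z||_(s): sum of the s largest magnitudes among the entries of z."""
    return sum(sorted((abs(t) for t in z), reverse=True)[:s])


def maximizer_in_V(z, s):
    """A vector with s entries equal to +-1 (zeros elsewhere) attaining max of z^T v over V."""
    top = sorted(range(len(z)), key=lambda i: -abs(z[i]))[:s]
    v = [0.0] * len(z)
    for i in top:
        v[i] = 1.0 if z[i] >= 0 else -1.0
    return v


def in_V(v, s, tol=1e-12):
    """Membership in V = {v : sum_i |v_i| <= s, max_i |v_i| <= 1}."""
    return sum(abs(t) for t in v) <= s + tol and max(abs(t) for t in v) <= 1 + tol
```

For `z = [3, -1, 2, 0.5]` and `s = 2`, `s_norm(z, 2)` is 5, and the vector returned by `maximizer_in_V(z, 2)` lies in V and has inner product 5 with z.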
Now, $V$ is the unit ball of the absolute norm
$$r(v) = \min\{t : \|v\|_1 \le st, \|v\|_\infty \le t\},$$
so that $\mathcal{X}$ is contained in the unit ball $\mathcal{B}$ of the absolute norm on $\mathbf{R}^{2n}$ specified as
$$p([u; v]) = \max\{\|u\|_1, r(v)\} \qquad [u, v \in \mathbf{R}^n],$$
i.e.,
$$\mathcal{X} = \{[u; v] : p([u; v]) \le 1, Au = 0\}.$$
As a result, whenever $x = [u; v] \in \mathcal{X}$, the matrix
$$U = xx^T = \begin{bmatrix} U^{11} = uu^T & U^{12} = uv^T \\ U^{21} = vu^T & U^{22} = vv^T \end{bmatrix}$$
satisfies the condition $p^+(U) \le 1$ (see item 1.2 above). In addition, this matrix clearly satisfies the condition
$$A[U^{11}, U^{12}] = 0.$$
It follows that the set
$$\mathcal{U} = \left\{U = \begin{bmatrix} U^{11} & U^{12} \\ U^{21} & U^{22} \end{bmatrix} \in \mathbf{S}^{2n} : U \succeq 0, p^+(U) \le 1, AU^{11} = 0, AU^{12} = 0\right\}$$
(which clearly is a nonempty convex compact set) matches $\mathcal{X}$. As a result, the efficiently computable quantity
$$\overline{\mathrm{Opt}} = \max_{U \in \mathcal{U}} \mathrm{Tr}(BU) = \max_U \left\{\mathrm{Tr}(U^{12}) : U = \begin{bmatrix} U^{11} & U^{12} \\ U^{21} & U^{22} \end{bmatrix} \succeq 0, p^+(U) \le 1, AU^{11} = 0, AU^{12} = 0\right\} \qquad (4.81)$$
is an upper bound on $\mathrm{Opt}$. As a result, the verifiable condition
$$\overline{\mathrm{Opt}} < 1/2$$
is sufficient for $s$-goodness of $A$.
Now comes the concluding part of the exercise:
3.1) Prove that $\overline{\mathrm{Opt}} \le \widehat{\mathrm{Opt}}$, so that (4.81) is less conservative than (4.80).
Hint: Apply Conic Duality to verify that
$$\widehat{\mathrm{Opt}} = \max_V \Big\{\mathrm{Tr}(V) : V \in \mathbf{R}^{n \times n}, AV = 0, \sum_{i=1}^n r(\mathrm{Col}_i[V^T]) \le 1\Big\}. \qquad (4.82)$$
3.2) Run simulations with randomly generated Gaussian matrices $A$ and play with different values of $s$ to compare $\widehat{\mathrm{Opt}}$ and $\overline{\mathrm{Opt}}$. To save time, you can use toy sizes $m$, $n$, say, $m = 18$, $n = 24$.
4.7.4 4.7.4.1
Around Propositions 4.4 and 4.14 Optimizing linear estimates on convex hulls of unions of spectratopes
Exercise 4.9. Let • X1 , ..., XJ be spectratopes in Rn : 2 Xj = {x ∈ Rn : ∃(y ∈ RNj , th∈ Tj ) : x = Pj y, Rkj [y]i tk Idkj , k ≤ Kj }, 1 ≤ j ≤ J, P Nj kji Rkj [y] = i=1 yi R ,
• A ∈ Rm×n and B ∈ Rν×n be given matrices, • k · k be a norm on Rν such that the unit ball B∗ of the conjugate norm k · k∗ is a spectratope: B∗
:= =
{u : kuk∗ ≤ 1} {u ∈ Rν : ∃(z h∈ RN , r ∈ R) : u = iM z, Sℓ2 [z] rℓ Ifℓ , ℓ ≤ L} PN Sℓ [z] = i=1 zi S ℓi ,
• Π be a convex compact subset of the interior of the positive semidefinite cone Sm +, with our standard restrictions on Rkj [·], Sℓ [·], Tj and R. Let, further, [ X = Conv Xj j
be the convex hull of the union of spectratopes Xj . Consider the situation where, given observation ω = Ax + ξ of unknown signal x known to belong to X , we want to recover Bx. We assume that the matrix of second moments of noise is -dominated by a matrix from Π, and quantify the performance of a candidate estimate x b(·) by its k · k-risk RiskΠ,k·k [b x|X ] = sup sup Eξ∼P {kBx − x b(Ax + ξ)k} x∈X P :P ✁Π
where P ✁ Π means that the matrix Var[P ] = Eξ∼P {ξξ T } of second moments of distribution P is -dominated by a matrix from Π. Prove the following: Proposition 4.28. In the situation in question, consider the convex optimization problem max φTj (λ[Λj ]) + φR (λ[Υj ]) + φR (λ[Υ′ ]) + ΓΠ (Θ) : Opt = min j j ′ H,Θ,Λ ,Υ ,Υ
j
318
CHAPTER 4
Λj = {Λjk 0, j ≤ Kj }, j ≤ J, j Υj = {Υℓ 0, ℓ ≤ L}, j ≤ J, Υ′ = {Υ′ℓ 0, ℓ ≤ L} P
R∗kj [Λjk ] 1 M T [B − H T A]Pj 2
1 T P [B T 2 j P
− AT H]M 0, j ≤ J, j ∗ S ℓ ℓ [Υℓ ] 1 Θ HM 2 P ∗ ′ 0 , 1 M T HT ℓ Sℓ [Υℓ ] 2
k
where, as usual,
(4.83)
φTj (λ) = max tT λ, φR (λ) = max rT λ, t∈Tj
r∈R
ΓΠ (Θ) = max Tr(QΘ), λ[U1 , ..., Us ] = [Tr(U1 ); ...; Tr(US )], Q∈Π Sℓ∗ [·] : Sfℓ → SN : Sℓ∗ [U ] = Tr(S ℓp U S ℓq ) p,q≤N , R∗kj [·] : Sdkj → SNj : R∗kj [U ] = Tr(Rkjp U Rkjq ) p,q≤N . j
Problem (4.83) is solvable, and the H-component $H_*$ of its optimal solution gives rise to the linear estimate $\hat{x}_{H_*}(\omega) = H_*^T\omega$ such that
$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat{x}_{H_*}|\mathcal{X}] \le \mathrm{Opt}. \qquad (4.84)$$
Moreover, the estimate $\hat{x}_{H_*}$ is near-optimal among linear estimates:
$$\mathrm{Opt} \le O(1)\ln(D + F)\,\mathrm{RiskOpt}_{\mathrm{lin}} \qquad \Big[D = \max_j \sum_{k\le K_j} d_{kj},\ F = \sum_{\ell\le L} f_\ell\Big] \qquad (4.85)$$
where
$$\mathrm{RiskOpt}_{\mathrm{lin}} = \inf_{H}\ \sup_{x\in\mathcal{X},\,Q\in\Pi} \mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\big\{\|Bx - H^T(Ax+\xi)\|\big\}$$
is the best risk attainable by linear estimates in the current setting under zero mean Gaussian observation noise.

It should be stressed that the convex hull of a union of spectratopes is not necessarily a spectratope, and that Proposition 4.28 states that the linear estimate stemming from (4.83) is near-optimal only among linear, not among all estimates (the latter might indeed not be the case).

4.7.4.2 Recovering nonlinear vector-valued functions
Exercise 4.10. Consider the situation as follows: We are given a noisy observation
$$\omega = Ax + \xi_x \qquad [A \in \mathbf{R}^{m\times n}]$$
of the linear image $Ax$ of an unknown signal $x$ known to belong to a given spectratope $\mathcal{X} \subset \mathbf{R}^n$; here $\xi_x$ is the observation noise with distribution $P_x$ which can depend on $x$. As in Section 4.3.3, we assume that we are given a computationally tractable convex compact set $\Pi \subset \mathrm{int}\,\mathbf{S}^m_+$ such that for every $x \in \mathcal{X}$, $\mathrm{Var}[P_x] \preceq \Theta$ for some $\Theta \in \Pi$; cf. (4.32). We want to recover the value $f(x)$ of a given vector-valued function $f : \mathcal{X} \to \mathbf{R}^\nu$, and we measure the recovery error in a given norm $|\cdot|$ on $\mathbf{R}^\nu$.
4.10.A. Preliminaries and the Main observation. Let $\|\cdot\|$ be a norm on $\mathbf{R}^n$, and $g(\cdot) : \mathcal{X} \to \mathbf{R}^\nu$ be a function. Recall that the function is called Lipschitz continuous on $\mathcal{X}$ w.r.t. the pair of norms $\|\cdot\|$ on the argument and $|\cdot|$ on the image spaces, if there exists $L < \infty$ such that
$$|g(x) - g(y)| \le L\|x - y\| \quad \forall (x, y \in \mathcal{X});$$
every $L$ with this property is called a Lipschitz constant of $g$. It is well known that in our finite-dimensional situation, the property of $g$ to be Lipschitz continuous is independent of how the norms $\|\cdot\|$, $|\cdot|$ are selected; this selection affects only the value(s) of Lipschitz constant(s). Assume from now on that the function of interest $f$ is Lipschitz continuous on $\mathcal{X}$. Let us call a norm $\|\cdot\|$ on $\mathbf{R}^n$ appropriate for $f$ if $f$ is Lipschitz continuous with constant 1 on $\mathcal{X}$ w.r.t. $\|\cdot\|$, $|\cdot|$. Our immediate observation is as follows:
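Observation 4.29. In the situation in question, let $\|\cdot\|$ be appropriate for $f$. Then recovering $f(x)$ is not more difficult than recovering $x$ in the norm $\|\cdot\|$: every estimate $\hat{x}(\omega)$ of $x$ via $\omega$ such that $\hat{x}(\cdot) \in \mathcal{X}$ induces the “plug-in” estimate
$$\hat{f}(\omega) = f(\hat{x}(\omega))$$
of $f(x)$, and the $\|\cdot\|$-risk
$$\mathrm{Risk}_{\|\cdot\|}[\hat{x}|\mathcal{X}] = \sup_{x\in\mathcal{X}} \mathbf{E}_{\xi\sim P_x}\{\|\hat{x}(Ax+\xi) - x\|\}$$
of estimate $\hat{x}$ upper-bounds the $|\cdot|$-risk
$$\mathrm{Risk}_{|\cdot|}[\hat{f}|\mathcal{X}] = \sup_{x\in\mathcal{X}} \mathbf{E}_{\xi\sim P_x}\big\{|\hat{f}(Ax+\xi) - f(x)|\big\}$$
of the estimate $\hat{f}$ induced by $\hat{x}$:
$$\mathrm{Risk}_{|\cdot|}[\hat{f}|\mathcal{X}] \le \mathrm{Risk}_{\|\cdot\|}[\hat{x}|\mathcal{X}].$$
When $f$ is defined and Lipschitz continuous with constant 1 w.r.t. $\|\cdot\|$, $|\cdot|$ on the entire $\mathbf{R}^n$, this conclusion remains valid without the assumption that $\hat{x}$ is $\mathcal{X}$-valued.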
4.10.B. Consequences. Observation 4.29 suggests the following simple approach to solving the estimation problem we started with: assuming that we have at our disposal a norm $\|\cdot\|$ on $\mathbf{R}^n$ such that
• $\|\cdot\|$ is appropriate for $f$, and
• $\|\cdot\|$ is good, meaning that the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ is a spectratope given by explicit spectratopic representation,
we use the machinery of linear estimation developed in Section 4.3.3 to build a near-optimal, in terms of its $\|\cdot\|$-risk, linear estimate of $x$ via $\omega$, and convert this estimate into an estimate of $f(x)$. By the above observation, the $|\cdot|$-risk of the resulting estimate is upper-bounded by the $\|\cdot\|$-risk of the underlying linear estimate.

The construction just outlined needs a correction: in general, the linear estimate $\tilde{x}(\cdot)$ yielded by Proposition 4.14 (same as any nontrivial, that is, not identically zero, linear estimate) is not guaranteed to take values in $\mathcal{X}$, which is, in general, required for Observation 4.29 to be applicable. This correction is easy: it is enough to convert $\tilde{x}$ into the estimate $\hat{x}$ defined by
$$\hat{x}(\omega) \in \mathop{\mathrm{Argmin}}_{u\in\mathcal{X}} \|u - \tilde{x}(\omega)\|.$$
This transformation preserves efficient computability of the estimate, and ensures that the corrected estimate takes its values in $\mathcal{X}$; at the same time, “correction” $\tilde{x} \mapsto \hat{x}$ nearly preserves the $\|\cdot\|$-risk:
$$\mathrm{Risk}_{\|\cdot\|}[\hat{x}|\mathcal{X}] \le 2\,\mathrm{Risk}_{\|\cdot\|}[\tilde{x}|\mathcal{X}]. \qquad (*)$$
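The correction step and the plug-in estimate can be tried out numerically. The sketch below is only an illustration under assumptions that are not part of the exercise: the signal set is taken to be the unit Euclidean ball, the norm is the standard Euclidean one, the "linear estimate" is simply the noisy observation itself, and `f` is a hypothetical 1-Lipschitz function (distance to a fixed point `c`). All names in the code are made up for the illustration.

```python
import numpy as np

def project_onto_ball(u, radius=1.0):
    """Euclidean projection onto X = {x : ||x||_2 <= radius} (assumed signal set)."""
    norm = np.linalg.norm(u)
    return u if norm <= radius else u * (radius / norm)

rng = np.random.default_rng(0)
n, sigma = 5, 0.5
c = np.ones(n)  # hypothetical reference point

def f(x):
    """Distance to c: Lipschitz with constant 1 w.r.t. the Euclidean norm."""
    return np.linalg.norm(x - c)

x = project_onto_ball(rng.normal(size=n))  # a signal inside X

ratios, plug_in_ok = [], []
for _ in range(1000):
    x_tilde = x + sigma * rng.normal(size=n)  # stand-in for a linear estimate
    x_hat = project_onto_ball(x_tilde)        # corrected, X-valued estimate
    err_tilde = np.linalg.norm(x_tilde - x)
    err_hat = np.linalg.norm(x_hat - x)
    ratios.append(err_hat / err_tilde)
    # plug-in error never exceeds the error of the underlying estimate of x
    plug_in_ok.append(abs(f(x_hat) - f(x)) <= err_hat + 1e-12)

worst_ratio = max(ratios)
print("largest observed ||x_hat - x|| / ||x_tilde - x||:", worst_ratio)
```

In this Euclidean setting the observed ratio should stay below 1, not just below 2; a non-Euclidean norm would need its own projection routine and may use up more of the factor 2.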
Note that when $\|\cdot\|$ is a (general-type) Euclidean norm, $\|x\|^2 = x^TQx$ for some $Q \succ 0$, the factor 2 on the right-hand side can be discarded.

1) Justify $(*)$.

4.10.C. How to select $\|\cdot\|$. When implementing the outlined approach, the major question is how to select a norm $\|\cdot\|$ appropriate for $f$. The best choice would be to select the smallest among the norms appropriate for $f$ (such a norm does exist under mild assumptions), because the smaller the $\|\cdot\|$, the smaller the $\|\cdot\|$-risk of an estimate of $x$. This ideal can be achieved in rare cases only: first, it could be difficult to identify the smallest among the norms appropriate for $f$; second, our approach requires for $\|\cdot\|$ to have an explicitly given spectratope as the unit ball of the conjugate norm. Let us look at a couple of “favorable cases,” where the difficulties just outlined can be (partially) overcome.

Example: A norm-induced $f$. Let us start with the case, important in its own right, when $f$ is a scalar functional which itself is a norm, and this norm has a spectratope as the unit ball of the conjugate norm, as is the case when $f(\cdot) = \|\cdot\|_r$, $r \in [1, 2]$, or when $f(\cdot)$ is the nuclear norm. In this case the smallest of the norms appropriate for $f$ clearly is $f$ itself, and none of the outlined difficulties arises. An extension is the case when $f(x)$ is obtained from a good norm $\|\cdot\|$ by operations preserving Lipschitz continuity and constant, such as $f(x) = \|x - c\|$, or $f(x) = \sum_i a_i\|x - c_i\|$ with $\sum_i |a_i| \le 1$, or
$$f(x) = \sup_{c\in C}/\inf_{c\in C}\ \|x - c\|,$$
or even something like
$$f(x) = \sup_{\alpha\in\mathcal{A}}/\inf_{\alpha\in\mathcal{A}} \Big\{ \sup_{c\in C_\alpha}/\inf_{c\in C_\alpha}\ \|x - c\| \Big\}.$$
In such a case, it seems natural to use this norm in our construction, although now this, perhaps, is not the smallest of the norms appropriate for $f$.

Now let us consider the general case. Note that in principle the smallest of the norms appropriate for a given Lipschitz continuous $f$ admits a description. Specifically, assume that $\mathcal{X}$ has a nonempty interior (this is w.l.o.g.: we can always replace $\mathbf{R}^n$ with the linear span of $\mathcal{X}$). A well-known fact of Analysis (Rademacher Theorem) states that in this situation (more generally, when $\mathcal{X}$ is convex with a nonempty interior), a Lipschitz continuous $f$ is differentiable almost everywhere in $\mathcal{X}^o = \mathrm{int}\,\mathcal{X}$, and $f$ is Lipschitz continuous with constant 1 w.r.t. a norm $\|\cdot\|$ if and only if
$$\|f'(x)\|_{\|\cdot\|\to|\cdot|} \le 1$$
whenever $x \in \mathcal{X}^o$ is such that the derivative (a.k.a. Jacobian) of $f$ at $x$ exists; here $\|Q\|_{\|\cdot\|\to|\cdot|}$ is the matrix norm of a $\nu \times n$ matrix $Q$ induced by the norms $\|\cdot\|$ on $\mathbf{R}^n$ and $|\cdot|$ on $\mathbf{R}^\nu$:
$$\|Q\|_{\|\cdot\|\to|\cdot|} := \max_{\|x\|\le 1} |Qx| = \max_{\|x\|\le 1,\ |y|_*\le 1} y^TQx = \max_{|y|_*\le 1,\ [\|x\|_*]_*\le 1} x^TQ^Ty = \|Q^T\|_{|\cdot|_*\to\|\cdot\|_*},$$
where $\|\cdot\|_*$, $|\cdot|_*$ are the conjugates of $\|\cdot\|$, $|\cdot|$.

2) Prove that a norm $\|\cdot\|$ is appropriate for $f$ if and only if the unit ball of the conjugate to $\|\cdot\|$ norm contains the set
$$\mathcal{B}_{f,*} = \mathrm{cl\,Conv}\{z : \exists (x \in \mathcal{X}_o,\ y,\ |y|_* \le 1) : z = [f'(x)]^Ty\},$$
where $\mathcal{X}_o$ is the set of all $x \in \mathcal{X}^o$ where $f'(x)$ exists. Geometrically, $\mathcal{B}_{f,*}$ is the closed convex hull of the union of all images of the unit ball $\mathcal{B}_*$ of $|\cdot|_*$ under the linear mappings $y \mapsto [f'(x)]^Ty$ stemming from $x \in \mathcal{X}_o$. Equivalently: $\|\cdot\|$ is appropriate for $f$ if and only if
$$\|u\| \ge \|u\|_f := \max_{z\in\mathcal{B}_{f,*}} z^Tu. \qquad (!)$$
Check that $\|u\|_f$ is a norm, provided that $\mathcal{B}_{f,*}$ (this set by construction is a convex compact set symmetric w.r.t. the origin) possesses a nonempty interior; whenever this is the case, $\|u\|_f$ is the smallest of the norms appropriate for $f$. Derive from the above that the norms $\|\cdot\|$ we can use in our approach are the norms on $\mathbf{R}^n$ for which the unit ball of the conjugate norm is a spectratope containing $\mathcal{B}_{f,*}$.

Example. Consider the case of componentwise quadratic $f$:
$$f(x) = \big[\tfrac{1}{2}x^TQ_1x;\ \tfrac{1}{2}x^TQ_2x;\ ...;\ \tfrac{1}{2}x^TQ_\nu x\big] \qquad [Q_i \in \mathbf{S}^n]$$
and $|u| = \|u\|_q$ with $q \in [1, 2]$.¹⁰ In this case
$$\mathcal{B}_* = \{u \in \mathbf{R}^\nu : \|u\|_p \le 1\},\ p = \frac{q}{q-1} \in [2, \infty[,\quad\text{and}\quad f'(x) = \big[x^TQ_1;\ x^TQ_2;\ ...;\ x^TQ_\nu\big].$$
Setting $\mathcal{S} = \{s \in \mathbf{R}^\nu_+ : \|s\|_{p/2} \le 1\}$ and
$$\mathcal{S}^{1/2} = \{s \in \mathbf{R}^\nu_+ : [s_1^2; ...; s_\nu^2] \in \mathcal{S}\} = \{s \in \mathbf{R}^\nu_+ : \|s\|_p \le 1\},$$
the set
$$\mathcal{Z} = \{[f'(x)]^Tu : x \in \mathcal{X},\ u \in \mathcal{B}_*\}$$
is contained in the set
$$\mathcal{Y} = \Big\{y \in \mathbf{R}^n : \exists (s \in \mathcal{S}^{1/2},\ x^i \in \mathcal{X},\ i \le \nu) : y = \sum_i s_iQ_ix^i\Big\}.$$

¹⁰ To save notation, we assume that the linear parts in the components of $f$ are trivial, just zeros. In this respect, note that we always can subtract from $f$ any linear mapping and reduce our estimation problem to two distinct problems of estimating separately the values at the signal $x$ of the modified $f$ and the linear mapping we have subtracted (we know how to solve the latter problem reasonably well).
The set $\mathcal{Y}$ is a spectratope with spectratopic representation readily given by that of $\mathcal{X}$. Indeed, $\mathcal{Y}$ is nothing but the $\mathcal{S}$-sum of the spectratopes $Q_i\mathcal{X}$, $i = 1, ..., \nu$; see Section 4.10. As a result, we can use the spectratope $\mathcal{Y}$ (when $\mathrm{int}\,\mathcal{Y} \ne \emptyset$) or the arithmetic sum of $\mathcal{Y}$ with a small Euclidean ball (when $\mathrm{int}\,\mathcal{Y} = \emptyset$) as the unit ball of the norm conjugate to $\|\cdot\|$, thus ensuring that $\|\cdot\|$ is appropriate for $f$. We then can use $\|\cdot\|$ in order to build an estimate of $f(\cdot)$.

3.1) For illustration, work out the problem of recovering the value of a scalar quadratic form
$$f(x) = x^TMx,\ M = \mathrm{Diag}\{i^\alpha,\ i = 1, ..., n\} \qquad [\nu = 1,\ |\cdot| \text{ is the absolute value}]$$
from noisy observation
$$\omega = Ax + \sigma\eta,\ A = \mathrm{Diag}\{i^\beta,\ i = 1, ..., n\},\ \eta \sim \mathcal{N}(0, I_n) \qquad (4.86)$$
of a signal $x$ known to belong to the ellipsoid
$$\mathcal{X} = \{x \in \mathbf{R}^n : \|Px\|_2 \le 1\},\ P = \mathrm{Diag}\{i^\gamma,\ i = 1, ..., n\},$$
where $\alpha$, $\beta$, $\gamma$ are given reals satisfying $\alpha - \gamma - \beta < -1/2$. You could start with the simplest unbiased estimate
$$\tilde{x}(\omega) = [1^{-\beta}\omega_1;\ 2^{-\beta}\omega_2;\ ...;\ n^{-\beta}\omega_n]$$
of $x$.

3.2) Work out the problem of recovering the norm
$$f(x) = \|Mx\|_p,\ M = \mathrm{Diag}\{i^\alpha,\ i = 1, ..., n\},\ p \in [1, 2],$$
from observation (4.86) with
$$\mathcal{X} = \{x : \|Px\|_r \le 1\},\ P = \mathrm{Diag}\{i^\gamma,\ i = 1, ..., n\},\ r \in [2, \infty].$$

4.7.4.3 Suboptimal linear estimation
Exercise 4.11. [recovery of large-scale signals] Consider the problem of estimating the image $Bx \in \mathbf{R}^\nu$ of signal $x \in \mathcal{X}$ from observation
$$\omega = Ax + \sigma\xi \in \mathbf{R}^m$$
in the simplest case where $\mathcal{X} = \{x \in \mathbf{R}^n : x^TSx \le 1\}$ is an ellipsoid (so that $S \succ 0$), the recovery error is measured in $\|\cdot\|_2$, and $\xi \sim \mathcal{N}(0, I_m)$. In this case, problem (4.12), which is to be solved when building a “presumably good” linear estimate, reduces to
$$\mathrm{Opt} = \min_{H,\lambda}\left\{\lambda + \sigma^2\|H\|_F^2 : \begin{bmatrix} \lambda S & B^T - A^TH \\ B - H^TA & I_\nu \end{bmatrix} \succeq 0\right\}, \qquad (4.87)$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix. An optimal solution $H_*$ to this problem results in the linear estimate $\hat{x}_{H_*}(\omega) = H_*^T\omega$ satisfying the risk bound
$$\mathrm{Risk}[\hat{x}_{H_*}|\mathcal{X}] := \max_{x\in\mathcal{X}} \sqrt{\mathbf{E}\{\|Bx - H_*^T(Ax + \sigma\xi)\|_2^2\}} \le \sqrt{\mathrm{Opt}}.$$
Now, (4.87) is an efficiently solvable convex optimization problem. However, when the sizes m, n of the problem are large, solving the problem by standard optimization techniques could become prohibitively time-consuming. The goal of what follows is to develop a relatively cheap computational technique for finding a good enough suboptimal solution to (4.87). In the sequel, we assume that $A \ne 0$; otherwise (4.87) is trivial.

1) Prove that problem (4.87) can be reduced to a similar problem with $S = I_n$ and diagonal positive semidefinite matrix A, the reduction requiring several singular value decompositions and multiplications of matrices of the same sizes as those of A, B, and S.
2) By item 1, we can assume from the very beginning that $S = I$ and $A = \mathrm{Diag}\{\alpha_1, ..., \alpha_n\}$ with $0 \le \alpha_1 \le \alpha_2 \le ... \le \alpha_n$. Passing in (4.87) from variables $\lambda$, $H$ to variables $\tau = \sqrt{\lambda}$, $G = H^T$, the problem becomes
$$\mathrm{Opt} = \min_{G,\tau}\big\{\tau^2 + \sigma^2\|G\|_F^2 : \|B - GA\| \le \tau\big\}, \qquad (4.88)$$
where $\|\cdot\|$ is the spectral norm. Now consider the construction as follows:
• Consider a partition $\{1, ..., n\} = I_0 \cup I_1 \cup ... \cup I_K$ of the index set $\{1, ..., n\}$ into consecutive segments in such a way that (a) $I_0$ is the set of those $i$, if any, for which $\alpha_i = 0$, and $I_k \ne \emptyset$ when $k \ge 1$, (b) for $k \ge 1$ the ratios $\alpha_j/\alpha_i$, $i, j \in I_k$, do not exceed $\theta > 1$ ($\theta$ is the parameter of our construction), while (c) for $1 \le k < k' \le K$, the ratios $\alpha_j/\alpha_i$, $i \in I_k$, $j \in I_{k'}$, are $> \theta$. The recipe for building the partition is self-evident, and we clearly have
$$K \le \ln(\overline{\alpha}/\underline{\alpha})/\ln(\theta) + 1,$$
where $\overline{\alpha}$ is the largest of $\alpha_i$, and $\underline{\alpha}$ is the smallest of those $\alpha_i$ which are positive.
• For $1 \le k \le K$, we denote by $i_k$ the first index in $I_k$, set $\alpha^k = \alpha_{i_k}$, $n_k = \mathrm{Card}\, I_k$, and define $A_k$ as the $n_k \times n_k$ diagonal matrix with diagonal entries $\alpha_i$, $i \in I_k$.
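One possible reading of the "self-evident" recipe is a greedy scan over the sorted diagonal. The sketch below is an assumption about what is intended, not the authors' code: it opens a new segment as soon as an entry exceeds θ times the first entry of the current segment. This guarantees (a) and (b), and makes the first entries of consecutive segments differ by a factor larger than θ, which is what the bound on K uses. The function name and the 0-based indexing are choices made for the illustration.

```python
import numpy as np

def partition_by_ratio(alpha, theta):
    """Greedy split of a sorted nonnegative array into I_0 (zeros) and segments I_1..I_K.

    Within every segment, max/min <= theta; first entries of consecutive
    segments differ by a factor > theta. Indices are 0-based.
    """
    alpha = np.asarray(alpha, dtype=float)
    assert theta > 1.0
    assert np.all(alpha >= 0) and np.all(np.diff(alpha) >= 0), "alpha must be sorted, nonnegative"
    zero_idx = [i for i, a in enumerate(alpha) if a == 0.0]
    groups = []
    for i in range(len(zero_idx), len(alpha)):
        if groups and alpha[i] <= theta * alpha[groups[-1][0]]:
            groups[-1].append(i)   # still within factor theta of the segment's first entry
        else:
            groups.append([i])     # open a new segment
    return zero_idx, groups

# Example: geometric progression from 0.01 to 1 (as in the recommended setup), theta = 2
alpha = np.geomspace(0.01, 1.0, 20)
theta = 2.0
I0, segments = partition_by_ratio(alpha, theta)
K_bound = np.log(alpha[-1] / alpha[0]) / np.log(theta) + 1
print("K =", len(segments), " bound =", K_bound)
```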
Now, given a $\nu \times n$ matrix $C$, let us specify $C_k$, $0 \le k \le K$, as the $\nu \times n_k$ submatrix of $C$ comprised of columns with indexes from $I_k$, and consider the following parametric optimization problems:
$$\mathrm{Opt}^*_k(\tau) = \min_{G_k\in\mathbf{R}^{\nu\times n_k}}\big\{\|G_k\|_F^2 : \|B_k - G_kA_k\| \le \tau\big\} \qquad (P^*_k[\tau])$$
$$\mathrm{Opt}_k(\tau) = \min_{G_k\in\mathbf{R}^{\nu\times n_k}}\big\{\|G_k\|_F^2 : \|B_k - \alpha^kG_k\| \le \tau\big\} \qquad (P_k[\tau])$$
where $\tau \ge 0$ is the parameter, and $1 \le k \le K$. Justify the following simple observations:

2.1) $G_k$ is feasible for $(P_k[\tau])$ if and only if the matrix $G^*_k = \alpha^kG_kA_k^{-1}$ is feasible for $(P^*_k[\tau])$, and $\|G^*_k\|_F \le \|G_k\|_F \le \theta\|G^*_k\|_F$, implying that
$$\mathrm{Opt}^*_k(\tau) \le \mathrm{Opt}_k(\tau) \le \theta^2\,\mathrm{Opt}^*_k(\tau);$$

2.2) Problems $(P_k[\tau])$ are easy to solve: if $B_k = U_kD_kV_k^T$ is the singular value decomposition of $B_k$ and $\sigma_{k\ell}$, $1 \le \ell \le \nu_k := \min[\nu, n_k]$, are diagonal entries of $D_k$, then an optimal solution to $(P_k[\tau])$ is
$$\hat{G}_k[\tau] = [\alpha^k]^{-1}U_kD_k[\tau]V_k^T,$$
where $D_k[\tau]$ is the diagonal matrix obtained from $D_k$ by truncating the diagonal entries $\sigma_{k\ell} \mapsto [\sigma_{k\ell} - \tau]_+$ (from now on, $a_+ = \max[a, 0]$, $a \in \mathbf{R}$). The optimal value in $(P_k[\tau])$ is
$$\mathrm{Opt}_k(\tau) = [\alpha^k]^{-2}\sum_{\ell=1}^{\nu_k}[\sigma_{k\ell} - \tau]_+^2.$$
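The formula in item 2.2 translates directly into a few lines of NumPy. The following is a minimal sketch that evaluates the stated solution and checks feasibility numerically; it does not replace the justification the exercise asks for. The function name `solve_Pk` and the test data are hypothetical.

```python
import numpy as np

def solve_Pk(B_k, alpha_k, tau):
    """Candidate solution of (P_k[tau]) via singular value truncation, as stated in item 2.2."""
    U, s, Vt = np.linalg.svd(B_k, full_matrices=False)
    s_trunc = np.maximum(s - tau, 0.0)        # sigma -> [sigma - tau]_+
    G = (U * s_trunc) @ Vt / alpha_k          # U diag(s_trunc) V^T / alpha_k
    opt = float(np.sum(s_trunc ** 2)) / alpha_k ** 2
    return G, opt

rng = np.random.default_rng(1)
B_k = rng.normal(size=(3, 5))                 # nu = 3, n_k = 5 (arbitrary toy sizes)
alpha_k, tau = 0.5, 0.7
G_hat, opt_k = solve_Pk(B_k, alpha_k, tau)

residual = np.linalg.norm(B_k - alpha_k * G_hat, 2)   # spectral norm of B_k - alpha_k G
print("spectral residual:", residual, "<= tau =", tau)
print("||G||_F^2:", np.linalg.norm(G_hat, "fro") ** 2, " Opt_k(tau):", opt_k)
```

The residual equals $U_k\,\mathrm{Diag}\{\min[\sigma_{k\ell}, \tau]\}\,V_k^T$, which is why its spectral norm cannot exceed τ.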
2.3) If $(\tau, G)$ is a feasible solution to (4.88) then $\tau \ge \underline{\tau} := \|B_0\|$, and the matrices $G_k$, $1 \le k \le K$, are feasible solutions to problems $(P^*_k[\tau])$, implying that
$$\sum_k \mathrm{Opt}^*_k(\tau) \le \|G\|_F^2.$$
And vice versa: if $\tau \ge \underline{\tau}$, $G_k$, $1 \le k \le K$, are feasible solutions to problems $(P^*_k[\tau])$, and
$$K_+ = \begin{cases} K, & I_0 = \emptyset, \\ K + 1, & I_0 \ne \emptyset, \end{cases}$$
then the matrix $G = [0_{\nu\times n_0}, G_1, ..., G_K]$ and $\tau_+ = \sqrt{K_+}\,\tau$ form a feasible solution to (4.88).
Extract from these observations that if $\tau_*$ is an optimal solution to the convex optimization problem
$$\min_\tau\Big\{\theta^2\tau^2 + \sigma^2\sum_{k=1}^K \mathrm{Opt}_k(\tau) : \tau \ge \underline{\tau}\Big\} \qquad (4.89)$$
and $G_{k,*}$ are optimal solutions to the problems $(P_k[\tau_*])$, then the pair
$$\hat{\tau} = \sqrt{K_+}\,\tau_*,\quad \hat{G} = [0_{\nu\times n_0}, G^*_{1,*}, ..., G^*_{K,*}] \qquad [G^*_{k,*} = \alpha^kG_{k,*}A_k^{-1}]$$
is a feasible solution to (4.88), and the value of the objective of the latter problem at this feasible solution is within the factor $\max[K_+, \theta^2]$ of the true optimal value Opt of this problem. As a result, $\hat{G}$ gives rise to a linear estimate with risk on $\mathcal{X}$ which is within the factor $\max[\sqrt{K_+}, \theta]$ of the risk $\sqrt{\mathrm{Opt}}$ of the “presumably good” linear estimate yielded by an optimal solution to (4.87).

Notice that
• After carrying out singular value decompositions of matrices $B_k$, $1 \le k \le K$, specifying $\tau_*$ and $G_{k,*}$ requires solving a univariate convex minimization problem with an easy-to-compute objective, so that the problem can be easily solved, e.g., by bisection;
• The computationally cheap suboptimal solution we end up with is not that bad, since K is “moderate,” just logarithmic in the condition number $\overline{\alpha}/\underline{\alpha}$ of A.

Your next task is as follows:

3) To get an idea of the performance of the proposed synthesis of “suboptimal” linear estimation, run numerical experiments as follows:
• select some n and generate at random the $n \times n$ data matrices S, A, B;
• for “moderate” values of n compute both the linear estimate yielded by the optimal solution to (4.12)¹¹ and the suboptimal estimate as yielded by the above construction. Compare their risk bounds and the associated CPU times. For “large” n, where solving (4.12) becomes prohibitively time-consuming, compute only a suboptimal estimate in order to get an impression of how the corresponding CPU time grows with n.

Recommended setup:
• range of n: 50, 100 (“moderate” values), 1000, 2000 (“large” values)
• range of σ: {1.0, 0.01, 0.0001}
• generation of S, A, B: generate the matrices at random according to
$$S = U_S\,\mathrm{Diag}\{1, 2, ..., n\}\,U_S^T,\quad A = U_A\,\mathrm{Diag}\{\mu_1, ..., \mu_n\}\,V_A^T,\quad B = U_B\,\mathrm{Diag}\{\mu_1, ..., \mu_n\}\,V_B^T,$$
where $U_S$, $U_A$, $V_A$, $U_B$, $V_B$ are random orthogonal $n \times n$ matrices, and the $\mu_i$ form a geometric progression with $\mu_1 = 0.01$ and $\mu_n = 1$.

You could run the above construction for several values of θ and select the best, in terms of its risk bound, of the resulting suboptimal estimates.

4.11.A. Simple case. There is a trivial case where (4.88) is really easy; this is the case where the right orthogonal factors in the singular value decompositions of A and B are the same, that is, when
$$B = WFV^T,\quad A = UDV^T$$
with orthogonal $n \times n$ matrices W, U, V and diagonal F, D.
This very special case is in fact of some importance: it covers the denoising situation where B = A, so that our goal is to denoise our observation of Ax given a priori information $x \in \mathcal{X}$ on x. In this situation, setting $W^TH^TU = G$, problem (4.88) becomes
$$\mathrm{Opt} = \min_G\big\{\|F - GD\|^2 + \sigma^2\|G\|_F^2\big\}. \qquad (4.90)$$

¹¹ When $\mathcal{X}$ is an ellipsoid, semidefinite relaxation bound on the maximum of a quadratic form over $x \in \mathcal{X}$ is exact, so that we are in the case when an optimal solution to (4.12) yields the best, in terms of risk on $\mathcal{X}$, linear estimate.
Now goes the concluding part of the exercise:
4) Prove that in the situation in question an optimal solution $G_*$ to (4.90) can be selected to be diagonal, with diagonal entries $\gamma_i$, $1 \le i \le n$, yielded by the optimal solution to the optimization problem
$$\mathrm{Opt} = \min_\gamma\Big\{f(G) := \max_{i\le n}(\phi_i - \gamma_i\delta_i)^2 + \sigma^2\sum_{i=1}^n\gamma_i^2\Big\} \qquad [\phi_i = F_{ii},\ \delta_i = D_{ii}].$$
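Once the diagonal reduction of item 4 is taken for granted, the remaining problem in γ can be solved by a one-dimensional search. The sketch below rests on two assumptions that are not in the text: all $\delta_i$ are positive, and the problem is parameterized by the level $t = \max_i |\phi_i - \gamma_i\delta_i|$, for which the cheapest choice is $\gamma_i = \mathrm{sign}(\phi_i)[|\phi_i| - t]_+/\delta_i$. The resulting function of t is convex, so a ternary search suffices. The function name is hypothetical.

```python
import numpy as np

def solve_diagonal_case(phi, delta, sigma, iters=100):
    """Minimize max_i (phi_i - gamma_i*delta_i)^2 + sigma^2 * sum_i gamma_i^2 (assumes delta_i > 0)."""
    phi = np.asarray(phi, dtype=float)
    delta = np.asarray(delta, dtype=float)
    assert np.all(delta > 0), "sketch assumes positive diagonal entries of D"

    def gamma_for(t):
        # smallest-magnitude gamma_i with |phi_i - gamma_i * delta_i| <= t
        return np.sign(phi) * np.maximum(np.abs(phi) - t, 0.0) / delta

    def objective(t):
        return t ** 2 + sigma ** 2 * float(np.sum(gamma_for(t) ** 2))

    lo, hi = 0.0, float(np.max(np.abs(phi)))
    for _ in range(iters):                    # ternary search on a convex function of t
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    t_best = 0.5 * (lo + hi)
    return gamma_for(t_best), objective(t_best)

# One-dimensional sanity check: the minimizer is phi*delta/(delta^2 + sigma^2)
gamma_1d, value_1d = solve_diagonal_case([2.0], [1.0], 1.0)
print(gamma_1d, value_1d)
```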
Exercise 4.12. [image reconstruction, follow-up to Exercise 4.11] A grayscale image can be represented by an m × n matrix $x = [x_{pq}]$, 0 ≤ p […]

[…] > 0 be such that $Z \subset \Delta[\alpha]$. Prove that
$$X^+ = \big\{[x; z] : W[x; z] = 0,\ [x; z] \in \mathrm{Conv}\{v_{ij} = [g_i; h_j],\ 1 \le i \le n,\ 0 \le j \le p\}\big\}, \qquad (!)$$
where the $g_i$ are the standard basic orths in $\mathbf{R}^n$, $h_0 = 0 \in \mathbf{R}^p$, and $\alpha_jh_j$, $1 \le j \le p$, are the standard basic orths in $\mathbf{R}^p$.

6.2) Derive from 5.1 that the efficiently computable convex function
$$\Phi_{SA}(H) = \inf_C\Big\{\max_{i,j}\|(B - H^TA)g_i + C^TWv_{ij}\| : C \in \mathbf{R}^{(p+q)\times\nu}\Big\}$$
is an upper bound on $\Phi(H)$. In the sequel, we refer to $\Phi_{SA}(H)$ as to the Sherali-Adams bound [214].

4.17.G. Combined bound. We can combine the above bounds, specifically, as follows:

7) Prove that the efficiently computable convex function
$$\Phi_{LBS}(H) = \inf_{(\Lambda_\pm, C_\pm, \mu, \mu_+)\in\mathcal{R}}\ \max_{i,j}\ G_{ij}(H, \Lambda_\pm, C_\pm, \mu, \mu_+), \qquad (\#)$$
where
$$G_{ij}(H, \Lambda_\pm, C_\pm, \mu, \mu_+) := -\mu^TFg_i + \mu_+^TWv_{ij} + \min_t\Big\{\|t\| : t \ge \mathrm{Max}\big[[(B - H^TA - \Lambda_+F)g_i + C_+^TWv_{ij}]_+,\ [(-B + H^TA - \Lambda_-F)g_i + C_-^TWv_{ij}]_+\big]\Big\},$$
$$\mathcal{R} = \big\{(\Lambda_\pm, C_\pm, \mu, \mu_+) : \Lambda_\pm \in \mathbf{R}_+^{\nu\times(p+2q)},\ C_\pm \in \mathbf{R}^{(p+q)\times\nu},\ \mu \in \mathbf{R}_+^{p+2q},\ \mu_+ \in \mathbf{R}^{p+q}\big\},$$
is an upper bound on $\Phi(H)$, and that this Combined bound is at least as good as any of the Lagrange, Basic, or Sherali-Adams bounds.
4.17.H. How to select α? A shortcoming of the Sherali-Adams and the combined upper bounds on Φ is the presence of a “degree of freedom,” the positive vector α. Intuitively, we would like to select α to make the simplex $\Delta[\alpha] \supset Z$ to be “as small as possible.” It is unclear, however, what “as small as possible” is in our context, not to speak of how to select the required α after we agree on how we measure the “size” of $\Delta[\alpha]$. It turns out, however, that we can efficiently select α resulting in the smallest volume $\Delta[\alpha]$.

8) Prove that minimizing the volume of $\Delta[\alpha] \supset Z$ in α reduces to solving the following convex optimization problem:
$$\inf_{\alpha,u,v}\Big\{-\sum_{s=1}^p\ln(\alpha_s) : 0 \le \alpha \le -v,\ E^Tu + G^Tv \le \mathbf{1}_n\Big\}. \qquad (*)$$
9) Run numerical experiments to evaluate the quality of the above bounds. It makes sense to generate problems where we know in advance the actual value of Φ, e.g., to take
$$\mathcal{X} = \{x \in \Delta_n : x \ge a\} \qquad (a)$$
with $a \ge 0$ such that $\sum_i a_i \le 1$. In this case, we can easily list the extreme points of $\mathcal{X}$ (how?) and thus can easily compute $\Phi(H)$. In your experiments, you can use the matrices stemming from “presumably good” linear estimates yielded by the optimization problems
$$\mathrm{Opt} = \min_{H,\Upsilon,\Theta}\left\{\Phi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) : \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \begin{bmatrix}\Theta & \frac{1}{2}HM \\ \frac{1}{2}M^TH^T & \sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\}, \qquad (4.99)$$
where
$$\Gamma_{\mathcal{X}}(\Theta) = \frac{1}{K}\max_{x\in\mathcal{X}}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$$
(see Corollary 4.12), with the actual Φ (which is available for our $\mathcal{X}$), or the upper bounds on Φ (Lagrange, Basic, Sherali-Adams, and Combined) in the role of Φ. Note that it may make sense to test seven bounds rather than just four. Indeed, with additional constraints on the optimization variables in (#), we can get, besides “pure” Lagrange, Basic, and Sherali-Adams bounds and their “three-component combination” (Combined bound), pairwise combinations of the pure bounds as well. For example, to combine Lagrange and Sherali-Adams bounds, it suffices to add to (#) the constraints $\Lambda_\pm = 0$.

Exercise 4.18. The exercise to follow deals with recovering discrete probability distributions in the Wasserstein norm. The Wasserstein distance between probability distributions is extremely popular today in Statistics; it is defined as follows.¹⁷ Consider discrete random variables taking values in finite observation space $\Omega = \{1, 2, ..., n\}$ which is equipped with the metric $\{d_{ij} : 1 \le i, j \le n\}$ satisfying the standard axioms.¹⁸ As always, we identify probability distributions on Ω with n-dimensional probabilistic vectors $p = [p_1; ...; p_n]$, where $p_i$ is the probability mass assigned by p to $i \in \Omega$. The Wasserstein distance between probability distributions p and q is defined as
$$W(p, q) = \min_{x=[x_{ij}]}\Big\{\sum_{ij}d_{ij}x_{ij} : x_{ij} \ge 0,\ \sum_j x_{ij} = p_i,\ \sum_i x_{ij} = q_j\ \ \forall\, 1 \le i, j \le n\Big\}. \qquad (4.100)$$

¹⁷ The distance we consider stems from the Wasserstein 1-distance between discrete probability distributions. This is a particular case of the general Wasserstein p-distance between (not necessarily discrete) probability distributions.
In other words, one may think of p and q as two distributions of unit mass on the points of Ω, and consider the mass transport problem of redistributing the mass assigned to points by distribution p to get the distribution q. Denoting by $x_{ij}$ the mass moved from point i to point j, constraints $\sum_j x_{ij} = p_i$ say that the total mass taken from point i is exactly $p_i$, constraints $\sum_i x_{ij} = q_j$ say that as the result of transportation, the mass at point j will be exactly $q_j$, and the constraints $x_{ij} \ge 0$ reflect the fact that transport of a negative mass is forbidden. Assuming that the cost of transporting a mass μ from point i to point j is $d_{ij}\mu$, the Wasserstein distance $W(p, q)$ between p and q is the cost of the cheapest transportation plan which converts p into q. As compared to other natural distances between discrete probability distributions, like $\|p - q\|_1$, the advantage of the Wasserstein distance is that it allows us to model the situation (indeed arising in some applications) where the effect, measured in terms of intended application, of changing probability masses of points from Ω is small when the probability mass of a point is redistributed among close points.¹⁹

Now comes the first part of the exercise:

1) Let p, q be two probability distributions. Prove that
$$W(p, q) = \max_{f\in\mathbf{R}^n}\Big\{\sum_i f_i(p_i - q_i) : |f_i - f_j| \le d_{ij}\ \ \forall i, j\Big\}. \qquad (4.101)$$
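The two descriptions (4.100) and (4.101) can be explored numerically without an LP solver: any feasible transportation plan gives an upper bound on W(p, q), and any function f that is 1-Lipschitz w.r.t. d gives a lower bound. The sketch below illustrates only this bracketing, on a hypothetical four-point space with the metric $d_{ij} = |i - j|$; it does not compute the optimal plan, and all names and data are assumptions made for the illustration.

```python
import numpy as np

def transport_cost(d, plan):
    """Objective of (4.100) at a given transportation plan."""
    return float(np.sum(d * plan))

def is_feasible_plan(plan, p, q, tol=1e-12):
    """Checks the constraints of (4.100): nonnegativity and prescribed marginals."""
    return bool(plan.min() >= -tol
                and np.allclose(plan.sum(axis=1), p)
                and np.allclose(plan.sum(axis=0), q))

def is_lipschitz(f, d, tol=1e-12):
    """Checks the constraints of (4.101): |f_i - f_j| <= d_ij for all i, j."""
    return bool(np.all(np.abs(f[:, None] - f[None, :]) <= d + tol))

n = 4
points = np.arange(n, dtype=float)
d = np.abs(points[:, None] - points[None, :])   # assumed metric d_ij = |i - j|
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])

plan = np.outer(p, q)        # "independent" plan: feasible, usually not optimal
f = -points                  # f_i = -i is 1-Lipschitz w.r.t. d

upper = transport_cost(d, plan)   # W(p, q) <= upper
lower = float(f @ (p - q))        # lower <= W(p, q)
print("lower bound:", lower, " upper bound:", upper)
```

Closing the gap between the two numbers is exactly what solving either linear program does.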
Treating vector $f \in \mathbf{R}^n$ as a function on Ω, the value of the function at a point $i \in \Omega$ being $f_i$, (4.101) admits a very transparent interpretation: the Wasserstein distance $W(p, q)$ between probability distributions p and q is the maximum of inner products of $p - q$ and functions f on Ω which are Lipschitz continuous w.r.t. the metric d, with constant 1. When shifting f by a constant, the inner product remains intact (since $p - q$ is a vector with zero sum of entries). Therefore, denoting by
$$D = \max_{i,j}d_{ij}$$
the d-diameter of Ω, we have
$$W(p, q) = \max_f\big\{f^T(p - q) : |f_i - f_j| \le d_{ij},\ |f_i| \le D/2\ \ \forall i, j\big\}, \qquad (4.102)$$
the reason being that every function f on Ω which is Lipschitz continuous, with constant 1, w.r.t. metric d can be shifted by a constant to ensure $\|f\|_\infty \le D/2$ (look what happens when the shift ensures that $\min_i f_i = -D/2$). Representation (4.102) shows that the Wasserstein distance is generated by a norm on $\mathbf{R}^n$: for all probability distributions on Ω one has
$$W(p, q) = \|p - q\|_W,$$
where $\|\cdot\|_W$ is the Wasserstein norm on $\mathbf{R}^n$ given by
$$\|x\|_W = \max_{f\in\mathcal{B}_*}f^Tx,\quad \mathcal{B}_* = \big\{u \in \mathbf{R}^n : u^TS_{ij}u \le 1,\ 1 \le i \le j \le n\big\},\quad S_{ij} = \begin{cases} d_{ij}^{-2}[e_i - e_j][e_i - e_j]^T, & 1 \le i < j \le n, \\ 4D^{-2}e_ie_i^T, & 1 \le i = j \le n, \end{cases} \qquad (4.103)$$
where $e_1, ..., e_n$ are the standard basic orths in $\mathbf{R}^n$.

¹⁸ Namely, positivity: $d_{ij} = d_{ji} \ge 0$, with $d_{ij} = 0$ if and only if $i = j$; and the triangle inequality: $d_{ik} \le d_{ij} + d_{jk}$ for all triples i, j, k.
¹⁹ In fact, the Wasserstein distance shares this property with some other distances between distributions used in Probability Theory, such as Skorohod, or Prokhorov, or Ky Fan distances. What makes the Wasserstein distance so “special” is its representation (4.100) as the optimal value of a Linear Programming problem, responsible for efficient computational handling of this distance.

2) Let us equip the n-element set $\Omega = \{1, ..., n\}$ with the metric
$$d_{ij} = \begin{cases} 2, & i \ne j, \\ 0, & i = j. \end{cases}$$
What is the associated Wasserstein norm?
Note that the set $\mathcal{B}_*$ in (4.103) is the unit ball of the norm conjugate to $\|\cdot\|_W$, and as we see, this set is a basic ellitope. As a result, the estimation machinery developed in Chapter 4 is well suited for recovering discrete probability distributions in the Wasserstein norm. This observation motivates the concluding part of the exercise:

3) Consider the situation as follows: Given an m × n column-stochastic matrix A and a ν × n column-stochastic matrix B, we observe K samples $\omega_k$, $1 \le k \le K$, independent of each other, drawn from the discrete probability distribution $Ax \in \Delta_m$ (as always, $\Delta_\nu \subset \mathbf{R}^\nu$ is the probabilistic simplex in $\mathbf{R}^\nu$), $x \in \Delta_n$ being an unknown “signal” underlying the observations; realizations of $\omega_k$ are identified with respective vertices $f_1, ..., f_m$ of $\Delta_m$. Our goal is to use the observations to estimate the distribution $Bx \in \Delta_\nu$. We are given a metric d on the set $\Omega_\nu = \{1, 2, ..., \nu\}$ of indices of entries in Bx, and measure the recovery error in the Wasserstein norm $\|\cdot\|_W$ associated with d. Build an explicit convex optimization problem responsible for a “presumably good” linear recovery of the form
$$\hat{x}_H = \frac{1}{K}H^T\sum_{k=1}^K\omega_k.$$

Exercise 4.19. [follow-up to Exercise 4.17] In Exercise 4.17, we have built a “presumably good” linear estimate $\hat{x}_{H_*}(\cdot)$ (see (4.98)) yielded by the H-component $H_*$ of an optimal solution to problem (4.99). The optimal value Opt in this problem is an upper bound on the risk $\mathrm{Risk}_{\|\cdot\|}[\hat{x}_{H_*}|\mathcal{X}]$ (here and in what follows we use the same notation and impose the same assumptions as in Exercise 4.17). Recall that $\mathrm{Risk}_{\|\cdot\|}$ is the worst, w.r.t. signals $x \in \mathcal{X}$ underlying our observations, expected norm of the recovery error. It makes sense also to provide upper bounds on the probabilities of deviations of the error’s magnitude from its expected value, and this is the problem we consider here; cf. Exercise 4.14.

1) Prove the following

Lemma 4.33. Let $Q \in \mathbf{S}^m_+$, let K be a positive integer, and let $p \in \Delta_m$. Let, further, $\omega^K = (\omega_1, ..., \omega_K)$ be i.i.d. random vectors, with $\omega_k$ taking the value $e_j$ ($e_1, ..., e_m$ are the standard basic orths in $\mathbf{R}^m$) with probability $p_j$. Finally, let $\xi_k = \omega_k - \mathbf{E}\{\omega_k\} = \omega_k - p$, and $\hat{\xi} = \frac{1}{K}\sum_{k=1}^K\xi_k$. Then for every $\epsilon \in (0, 1)$ it holds
$$\mathrm{Prob}\Big\{\|\hat{\xi}\|_2^2 \le \frac{12\ln(2m/\epsilon)}{K}\Big\} \ge 1 - \epsilon.$$
Hint: use the classical Bernstein inequality: Let $X_1, ..., X_K$ be independent zero mean random variables taking values in $[-M, M]$, and let $\sigma_k^2 = \mathbf{E}\{X_k^2\}$. Then for every $t \ge 0$ one has
$$\mathrm{Prob}\Big\{\sum_{k=1}^KX_k \ge t\Big\} \le \exp\Big\{-\frac{t^2}{2[\sum_k\sigma_k^2 + \frac{1}{3}Mt]}\Big\}.$$
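To get a feeling for how tight the Bernstein inequality is in the setting of the lemma, one can compare it with an exact tail. The sketch below is an illustration with made-up parameters: it takes $X_k = Y_k - p$ with i.i.d. Bernoulli(p) variables $Y_k$ (so that $M = \max[p, 1-p]$ and $\sigma_k^2 = p(1-p)$) and computes the exact binomial tail with the standard library. The function names are hypothetical.

```python
import math

def bernstein_bound(t, variances, M):
    """Right-hand side of the Bernstein inequality from the hint."""
    return math.exp(-t ** 2 / (2.0 * (sum(variances) + M * t / 3.0)))

def centered_binomial_tail(K, p, t):
    """Exact Prob{ sum_k (Y_k - p) >= t } for i.i.d. Bernoulli(p) variables Y_1..Y_K."""
    threshold = K * p + t
    return sum(math.comb(K, s) * p ** s * (1 - p) ** (K - s)
               for s in range(K + 1) if s >= threshold - 1e-12)

K, p = 50, 0.3
M = max(p, 1 - p)
variances = [p * (1 - p)] * K
for t in (1.0, 3.0, 5.0, 10.0):
    exact = centered_binomial_tail(K, p, t)
    bound = bernstein_bound(t, variances, M)
    print(f"t = {t:5.1f}   exact tail = {exact:.4e}   Bernstein bound = {bound:.4e}")
```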
2) Consider the situation described in Exercise 4.17 with $\mathcal{X} = \Delta_n$, specifically,
• Our observation is a sample $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. components $\omega_k \sim Ax$, where $x \in \Delta_n$ is an unknown n-dimensional probabilistic vector, A is an m × n stochastic matrix (nonnegative matrix with unit column sums), and $\omega \sim Ax$ means that ω is a random vector taking value $e_i$ ($e_i$ are standard basic orths in $\mathbf{R}^m$) with probability $[Ax]_i$, $1 \le i \le m$.
• Our goal is to recover Bx in a given norm $\|\cdot\|$; here B is a given ν × n matrix.
• We assume that the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ is a spectratope:
$$\mathcal{B}_* = \{u = My,\ y \in \mathcal{Y}\},\quad \mathcal{Y} = \{y \in \mathbf{R}^N : \exists r \in \mathcal{R} : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell \le L\}.$$
Our goal is to build a presumably good linear estimate
$$\hat{x}_H(\omega^K) = H^T\hat{\omega}[\omega^K],\quad \hat{\omega}[\omega^K] = \frac{1}{K}\sum_k\omega_k.$$
Prove the following
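Proposition 4.34. Let H, Θ, Υ be a feasible solution to the convex optimization problem
$$\min_{H,\Theta,\Upsilon}\left\{\Phi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma(\Theta)/K : \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \begin{bmatrix}\Theta & \frac{1}{2}HM \\ \frac{1}{2}M^TH^T & \sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\} \qquad (4.104)$$
where
$$\Phi(H) = \max_{j\le n}\|\mathrm{Col}_j[B - H^TA]\|,\quad \Gamma(\Theta) = \max_{x\in\Delta_n}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta).$$
Then
(i) For every $x \in \Delta_n$ it holds
$$\mathbf{E}_{\omega^K}\big\{\|Bx - \hat{x}_H(\omega^K)\|\big\} \le \Phi(H) + 2K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma(\Theta)} \quad \big[\le \Phi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma(\Theta)/K\big]. \qquad (4.105)$$
(ii) Let $\epsilon \in (0, 1)$. For every $x \in \Delta_n$, with
$$\gamma = 2\sqrt{3\ln(2m/\epsilon)}$$
one has
$$\mathrm{Prob}_{\omega^K}\Big\{\|Bx - \hat{x}_H(\omega^K)\| \le \Phi(H) + 2\gamma K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\|\Theta\|_{\mathrm{Sh},\infty}}\Big\} \ge 1 - \epsilon. \qquad (4.106)$$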
3) Look what happens when ν = m = n, A and B are the unit matrices, and H = I, i.e., we want to understand how good is the recovery of a discrete probability distribution by empirical distribution of a K-element i.i.d. sample drawn from the original distribution. Take, as $\|\cdot\|$, the norm $\|\cdot\|_p$ with $p \in [1, 2]$, and show that for every $x \in \Delta_n$ and every $\epsilon \in (0, 1)$ one has
$$\mathbf{E}\big\{\|x - \hat{x}_I(\omega^K)\|_p\big\} \le n^{\frac{1}{p}-\frac{1}{2}}K^{-\frac{1}{2}},$$
$$\mathrm{Prob}\Big\{\|x - \hat{x}_I(\omega^K)\|_p \le 2\sqrt{3\ln(2n/\epsilon)}\,n^{\frac{1}{p}-\frac{1}{2}}K^{-\frac{1}{2}}\Big\} \ge 1 - \epsilon.$$
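The two bounds of item 3 are easy to probe by simulation. The sketch below is a Monte Carlo illustration with arbitrarily chosen n, K, p and ε (none of them from the text): it draws many K-element samples from a fixed distribution x, forms the empirical distributions, and compares the average and the quantiles of the $\|\cdot\|_p$-error with the stated bounds. It is a sanity check, not a proof.

```python
import numpy as np

def empirical_distribution_errors(x, K, p, trials, rng):
    """||x - x_hat||_p over `trials` independent K-element samples drawn from x."""
    counts = rng.multinomial(K, x, size=trials)   # shape (trials, n)
    x_hat = counts / K                            # empirical distributions
    return np.linalg.norm(x_hat - x, ord=p, axis=1)

rng = np.random.default_rng(0)
n, K, p, eps = 10, 200, 1.5, 0.05                 # illustrative parameters
x = rng.dirichlet(np.ones(n))                     # a random point of the simplex

errs = empirical_distribution_errors(x, K, p, trials=5000, rng=rng)
mean_bound = n ** (1.0 / p - 0.5) / np.sqrt(K)
dev_bound = 2.0 * np.sqrt(3.0 * np.log(2 * n / eps)) * mean_bound

print("average error:", errs.mean(), " bound on expectation:", mean_bound)
print("fraction of trials within the deviation bound:", np.mean(errs <= dev_bound))
```

In runs of this kind the deviation bound tends to be far from tight, which is expected from a bound obtained via a union of coordinate-wise Bernstein estimates.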
Exercise 4.20. [follow-up to Exercise 4.17] Consider the situation as follows. A retailer sells n items by offering customers, via internet, bundles of m < n items, so that an offer is an m-element subset B of the set S = {1, ..., n} of the items. A customer has personal preferences represented by a subset P of S, the customer’s preference set. We assume that if an offer B intersects with the preference set P of a customer, the latter buys an item drawn at random from the uniform distribution on B ∩ P, and if B ∩ P = ∅, the customer declines the offer. In the pilot stage we are interested in, the seller learns the market by making offers to K customers. Specifically, the seller draws the k-th customer, k ≤ K, at random from the uniform distribution on the population of customers, and makes the selected customer an offer drawn at random from the uniform distribution on the set $S_{m,n}$ of all m-item offers. What is observed in the k-th experiment is the item, if any, bought by the customer, and we want to make statistical inferences from these observations.

The outlined observation scheme can be formalized as follows. Let $\mathcal{S}$ be the set of all subsets of the n-element set, so that $\mathcal{S}$ is of cardinality $N = 2^n$. The population of customers induces a probability distribution p on $\mathcal{S}$: for $P \in \mathcal{S}$, $p_P$ is the fraction of customers with the preference set being P; we refer to p as to the preference distribution. An outcome of a single experiment can be represented by a pair (ι, B), where $B \in S_{m,n}$ is the offer used in the experiment, and ι is either 0 (“nothing is bought”, P ∩ B = ∅), or a point from P ∩ B, the item which was bought, when P ∩ B ≠ ∅. Note that for a customer with preference set P, the distribution $A_P$ of the outcome is a probability distribution on the $\big(M = (m + 1)\binom{n}{m}\big)$-element set Ω = {(ι, B)} of possible outcomes. As a result, our observation scheme is fully specified by an M × N column-stochastic matrix A known to us with the columns $A_P$ indexed by $P \in \mathcal{S}$. When a customer is drawn at random from the uniform distribution on the population of customers, the distribution of the outcome clearly is Ap, where p is the (unknown) preference distribution. Our inferences should be based on the K-element sample $\omega^K = (\omega_1, ..., \omega_K)$, with $\omega_1, ..., \omega_K$ drawn, independently of each other, from the distribution Ap.

Now we can pose various inference problems, e.g., that of estimating p. We, however, intend to focus on a simpler problem, that of recovering Ap. In terms of our story, this makes sense: when we know Ap, we know, e.g., what the probability is for every offer to be “successful” (something indeed is bought) and/or to result in a specific profit, etc. With this knowledge at hand, the seller can pass from a “blind” offering policy (drawing an offer at random from the uniform distribution on the set $S_{m,n}$) to something more rewarding.

Now comes the exercise:

1. Use the results of Exercise 4.17 to build a “presumably good” linear estimate
$$\hat{x}_H(\omega^K) = H^T\Big[\frac{1}{K}\sum_{k=1}^K\omega_k\Big]$$
of Ap (as always, we encode observations ω, which are elements of the M-element set Ω, by standard basic orths in $\mathbf{R}^M$). As the norm $\|\cdot\|$ quantifying the recovery error, use $\|\cdot\|_1$ and/or $\|\cdot\|_2$. In order to avoid computational difficulties, use small m and n (e.g., m = 3 and n = 5). Compare your results with those for the “straightforward” estimate $\frac{1}{K}\sum_{k=1}^K\omega_k$ (the empirical distribution of ω ∼ Ap).

2. Assuming that the “presumably good” linear estimate outperforms the straightforward one, how could this phenomenon be explained? Note that we have no nontrivial a priori information on p!

Exercise 4.21. [Poisson Imaging] The Poisson Imaging Problem is to recover an unknown signal observed via the Poisson observation scheme. More specifically, assume that our observation is a realization of random vector $\omega \in \mathbf{R}^m_+$ with Poisson entries $\omega_i \sim \mathrm{Poisson}([Ax]_i)$ independent of each other. Here A is a given entrywise nonnegative m × n matrix, and x is an unknown signal known to belong to a given compact convex subset $\mathcal{X}$ of $\mathbf{R}^n_+$. Our goal is to recover in a given norm $\|\cdot\|$ the linear image Bx of x, where B is a given ν × n matrix.

We assume in the sequel that $\mathcal{X}$ is a subset cut off the n-dimensional probabilistic simplex $\Delta_n$ by a collection of linear equality and inequality constraints. The assumption $\mathcal{X} \subset \Delta_n$ is not too restrictive. Indeed, assume that we know in advance a linear inequality $\sum_i\alpha_ix_i \le 1$ with positive coefficients which is valid on $\mathcal{X}$.²⁰ Introducing slack variable s given by $\sum_i\alpha_ix_i + s = 1$ and passing from signal x to the new signal $[\alpha_1x_1; ...; \alpha_nx_n; s]$, after a straightforward modification of matrices A and B, we arrive at the situation where $\mathcal{X}$ is a subset of the probabilistic simplex.

Our goal in the sequel is to build a presumably good linear estimate $\hat{x}_H(\omega) = H^T\omega$ of Bx.
20 For example, in PET (see Section 2.4.3.2), where $x$ is the density of a radioactive tracer injected into the patient undergoing the PET procedure, we know in advance the total amount $\sum_i v_i x_i$ of the tracer, $v_i$ being the volumes of the voxels.
As in Exercise 4.17, we start with upper-bounding the risk of a linear estimate. When representing $\omega = Ax + \xi_x$, we arrive at zero mean observation noise $\xi_x$ with entries $[\xi_x]_i = \omega_i - [Ax]_i$ independent of each other and covariance matrix $\mathrm{Diag}\{Ax\}$. We now can upper-bound the risk of a linear estimate $\widehat{x}_H(\cdot)$ in the same way as in Exercise 4.17. Specifically, denoting by $\Pi_{\mathcal{X}}$ the set of all diagonal matrices $\mathrm{Diag}\{Ax\}$, $x \in \mathcal{X}$, and by $P_{i,x}$ the Poisson distribution with parameter $[Ax]_i$, we have
$$\begin{array}{rcl}
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H|\mathcal{X}] &=& \sup_{x\in\mathcal{X}} \mathbf{E}_{\omega\sim P_{1,x}\times...\times P_{m,x}}\left\{\|Bx - H^T\omega\|\right\} = \sup_{x\in\mathcal{X}} \mathbf{E}_{\xi_x}\left\{\|[B - H^TA]x - H^T\xi_x\|\right\}\\
&\le& \underbrace{\sup_{x\in\mathcal{X}} \|[B - H^TA]x\|}_{\Phi(H)} + \underbrace{\sup_{\xi:\,\mathrm{Cov}[\xi]\in\Pi_{\mathcal{X}}} \mathbf{E}_{\xi}\left\{\|H^T\xi\|\right\}}_{\Psi^{\mathcal{X}}(H)}.
\end{array}$$
In order to build a presumably good linear estimate, it suffices to build efficiently computable upper bounds $\overline{\Phi}(H)$ on $\Phi(H)$ and $\overline{\Psi}^{\mathcal{X}}(H)$ on $\Psi^{\mathcal{X}}(H)$ convex in $H$, and then take as $H$ an optimal solution to the convex optimization problem
$$\mathrm{Opt} = \min_H\left[\overline{\Phi}(H) + \overline{\Psi}^{\mathcal{X}}(H)\right].$$
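The two ingredients of this risk decomposition can be illustrated numerically. The sketch below is not part of the exercise: it assumes the simplest signal set $\mathcal{X} = \Delta_n$ (the whole simplex), in which case the bias term $\Phi(H)$ is the largest column norm of $B - H^TA$ because a convex function attains its maximum over the simplex at a vertex, and it checks by simulation that the noise $\xi_x$ has covariance close to $\mathrm{Diag}\{Ax\}$. The function names are hypothetical.

```python
import numpy as np

def bias_term_on_simplex(H, A, B, ord=2):
    """Phi(H) = max_{x in simplex} ||(B - H^T A) x|| for the illustrative case X = Delta_n.

    The map x -> ||(B - H^T A) x|| is convex, so its maximum over the simplex is
    attained at a vertex e_j, i.e. it equals the largest column norm of B - H^T A.
    """
    D = B - H.T @ A
    return max(np.linalg.norm(D[:, j], ord=ord) for j in range(D.shape[1]))

def simulate_noise_covariance(A, x, n_samples=200_000, seed=0):
    """Empirical covariance of xi_x = omega - A x for omega_i ~ Poisson([A x]_i)."""
    rng = np.random.default_rng(seed)
    mu = A @ x
    omega = rng.poisson(mu, size=(n_samples, mu.size))
    xi = omega - mu                      # zero-mean noise
    return xi.T @ xi / n_samples, mu     # compare the first output with Diag{mu}
```

For a general $\mathcal{X}$ cut off the simplex by linear constraints, the vertex argument no longer applies directly, and one falls back on the upper bounds $\overline{\Phi}$ of Exercise 4.17.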
As in Exercise 4.17, assume from now on that $\|\cdot\|$ is an absolute norm, and the unit ball $\mathcal{B}_*$ of the conjugate norm is a spectratope:
$$\mathcal{B}_* := \{u : \|u\|_* \le 1\} = \{u : \exists r \in \mathcal{R}, y : u = My,\ S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell \le L\}.$$
Observe that
• In order to build $\overline{\Phi}$, we can use exactly the same techniques as those developed in Exercise 4.17. Indeed, as far as building $\overline{\Phi}$ is concerned, the only difference with the situation of Exercise 4.17 is that in the latter, $A$ was a column-stochastic matrix, while now $A$ is just an entrywise nonnegative matrix. Note, however, that when upper-bounding $\Phi$ in Exercise 4.17, we never used the fact that $A$ is column-stochastic.
• In order to upper-bound $\Psi^{\mathcal{X}}$, we can use the bound (4.40) of Exercise 4.17.
The bottom line is that in order to build a presumably good linear estimate, we need to solve the convex optimization problem
$$\mathrm{Opt} = \min_{H,\Upsilon,\Theta}\left\{\overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) :\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\} \qquad (P)$$
where
$$\Gamma_{\mathcal{X}}(\Theta) = \max_{x\in\mathcal{X}} \mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$$
(cf. problem (4.99)) with $\overline{\Phi}$ yielded by any construction from Exercise 4.17, e.g., the least conservative Combined upper bound on $\Phi$. What differs significantly in our present situation from the situation of Exercise 4.17 are the bounds on probabilities of large deviations (for Discrete o.s., established in Exercise 4.19). The goal of what follows is to establish these bounds for
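Poisson Imaging. Here is what you are supposed to do:
1. Let $\omega \in \mathbf{R}^m$ be a random vector with independent entries $\omega_i \sim \mathrm{Poisson}(\mu_i)$, and let $\mu = [\mu_1; ...; \mu_m]$. Prove that whenever $h \in \mathbf{R}^m$, $\gamma > 0$, and $\delta \ge 0$, one has
$$\ln \mathrm{Prob}\{h^T\omega > h^T\mu + \delta\} \le \sum_i [\exp\{\gamma h_i\} - 1]\mu_i - \gamma h^T\mu - \gamma\delta. \qquad (4.107)$$
2. Taking for granted (or see, e.g., [178]) that $e^x - x - 1 \le \frac{x^2}{2(1 - x/3)}$ when $|x| < 3$, prove that in the situation of item 1 one has for $t > 0$:
$$0 \le \gamma < \frac{3}{\|h\|_\infty} \ \Rightarrow\ \ln \mathrm{Prob}\{h^T\omega > h^T\mu + t\} \le \frac{\gamma^2\sum_i h_i^2\mu_i}{2(1 - \gamma\|h\|_\infty/3)} - \gamma t. \qquad (4.108)$$
Derive from the latter fact that
$$\mathrm{Prob}\left\{h^T\omega > h^T\mu + \delta\right\} \le \exp\left\{-\frac{\delta^2}{2[\sum_i h_i^2\mu_i + \|h\|_\infty\delta/3]}\right\}, \qquad (4.109)$$
and conclude that
$$\mathrm{Prob}\left\{|h^T\omega - h^T\mu| > \delta\right\} \le 2\exp\left\{-\frac{\delta^2}{2[\sum_i h_i^2\mu_i + \|h\|_\infty\delta/3]}\right\}. \qquad (4.110)$$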
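3. Extract from (4.110) the following
Proposition 4.35. In the situation and under the assumptions of Exercise 4.21, let Opt be the optimal value, and $H, \Upsilon, \Theta$ be a feasible solution to problem $(P)$. Whenever $x \in \mathcal{X}$ and $\epsilon \in (0, 1)$, denoting by $P_x$ the distribution of observations stemming from $x$ (i.e., the distribution of random vector $\omega$ with independent entries $\omega_i \sim \mathrm{Poisson}([Ax]_i)$), one has
$$\begin{array}{rcl}
\mathbf{E}_{\omega\sim P_x}\left\{\|Bx - \widehat{x}_H(\omega)\|\right\} &\le& \overline{\Phi}(H) + 2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\,\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)}\\
&\le& \overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta)
\end{array} \qquad (4.111)$$
and
$$\mathrm{Prob}_{\omega\sim P_x}\left\{\|Bx - \widehat{x}_H(\omega)\| \le \overline{\Phi}(H) + 4\sqrt{\tfrac{2}{9}\ln^2(2m/\epsilon)\mathrm{Tr}(\Theta) + \ln(2m/\epsilon)\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)}\,\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])}\right\} \ge 1 - \epsilon. \qquad (4.112)$$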
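The tail bound (4.110) behind this proposition is easy to probe numerically. The following sketch, which is only a sanity check and proves nothing, compares the right-hand side of (4.110) with a Monte Carlo estimate of the two-sided tail probability; the function names and the sample sizes are illustrative assumptions.

```python
import numpy as np

def poisson_bernstein_bound(h, mu, delta):
    """Right-hand side of (4.110) for weights h and Poisson parameters mu."""
    v = np.sum(h**2 * mu)                       # sum_i h_i^2 mu_i
    h_inf = np.max(np.abs(h))                   # ||h||_inf
    return 2.0 * np.exp(-delta**2 / (2.0 * (v + h_inf * delta / 3.0)))

def empirical_tail(h, mu, delta, n_samples=200_000, seed=0):
    """Monte Carlo estimate of Prob{|h^T omega - h^T mu| > delta}."""
    rng = np.random.default_rng(seed)
    omega = rng.poisson(mu, size=(n_samples, mu.size))
    deviation = np.abs(omega @ h - h @ mu)
    return np.mean(deviation > delta)
```

For small $\delta$ the bound exceeds 1 and is vacuous; it becomes informative once $\delta$ is a few multiples of $\sqrt{\sum_i h_i^2\mu_i}$.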
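Note that in the case of $[Ax]_i \ge 1$ for all $x \in \mathcal{X}$ and all $i$ we have $\mathrm{Tr}(\Theta) \le \mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$, so that in this case the $P_x$-probability of the event
$$\left\{\omega : \|Bx - \widehat{x}_H(\omega)\| \le \overline{\Phi}(H) + O(1)\ln(2m/\epsilon)\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma_{\mathcal{X}}(\Theta)}\right\}$$
is at least $1 - \epsilon$.
4.7.6 Numerical lower-bounding minimax risk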
Exercise 4.22. 4.22.A. Motivation. From the theoretical viewpoint, the results on near-optimality of presumably good linear estimates stated in Propositions 4.5 and 4.16 seem
to be quite strong and general. This being said, for a practically oriented user the "nonoptimality factors" arising in these propositions can be too large to make any practical sense. This drawback of our theoretical results is not too crucial: what matters in applications is whether the risk of a proposed estimate is appropriate for the application in question, and not by how much it could be improved were we smart enough to build the "ideal" estimate; results of the latter type from a practical viewpoint offer no more than some "moral support." Nevertheless, the "moral support" has its value, and it makes sense to strengthen it by improving the lower risk bounds as compared to those underlying Propositions 4.5 and 4.16. In this respect, an appealing idea is to pass from lower risk bounds yielded by theoretical considerations to computation-based ones. The goal of this exercise is to develop some methodology yielding computation-based lower risk bounds. We start with the main ingredient of this methodology, the classical Cramer-Rao bound.
4.22.B. Cramer-Rao bound. Consider the situation as follows: we are given
• an observation space $\Omega$ equipped with reference measure $\Pi$, basic examples being (A) $\Omega = \mathbf{R}^m$ with Lebesgue measure $\Pi$, and (B) (finite or countable) discrete set $\Omega$ with counting measure $\Pi$;
• a convex compact set $\Theta \subset \mathbf{R}^k$ and a family $\mathcal{P} = \{p(\omega, \theta) : \theta \in \Theta\}$ of probability densities, taken w.r.t. $\Pi$.
Our goal is, given an observation $\omega \sim p(\cdot, \theta)$ stemming from unknown $\theta$ known to belong to $\Theta$, to recover $\theta$. We quantify the risk of a candidate estimate $\widehat{\theta}$ as
$$\mathrm{Risk}[\widehat{\theta}|\Theta] = \sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega) - \theta\|_2^2\right\}\right]^{1/2}, \qquad (4.113)$$
and define the "ideal" minimax risk as
$$\mathrm{Risk}_{\mathrm{opt}} = \inf_{\widehat{\theta}}\mathrm{Risk}[\widehat{\theta}|\Theta],$$
the infimum being taken w.r.t. all estimates, or, which is the same, all bounded estimates (indeed, passing from a candidate estimate $\widehat{\theta}$ to the projected estimate $\widehat{\theta}_\Theta(\omega) = \mathrm{argmin}_{\theta\in\Theta}\|\widehat{\theta}(\omega) - \theta\|_2$ will only reduce the estimate risk). The Cramer-Rao inequality [58, 205], which we intend to use,$^{21}$ is a certain relation between the covariance matrix of a bounded estimate and its bias; this relation is valid under mild regularity assumptions on the family $\mathcal{P}$, specifically, as follows:
1) $p(\omega, \theta) > 0$ for all $\omega \in \Omega$, $\theta \in U$, and $p(\omega, \theta)$ is differentiable in $\theta$, with $\nabla_\theta p(\omega, \theta)$ continuous in $\theta \in \Theta$;
2) The Fisher Information matrix
$$\mathcal{I}(\theta) = \int_\Omega \frac{\nabla_\theta p(\omega, \theta)[\nabla_\theta p(\omega, \theta)]^T}{p(\omega, \theta)}\Pi(d\omega)$$
is well-defined for all $\theta \in \Theta$;
3) There exists a function $M(\omega) \ge 0$ such that $\int_\Omega M(\omega)\Pi(d\omega) < \infty$ and $\|\nabla_\theta p(\omega, \theta)\|_2 \le M(\omega)$ for all $\omega \in \Omega$, $\theta \in \Theta$.
21 As a matter of fact, the classical Cramer-Rao inequality dealing with unbiased estimates is not sufficient for our purposes "as is." What we need to build is a "bias enabled" version of this inequality. Such an inequality may be developed using Bayesian argument [99, 233].
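The derivation of the Cramer-Rao bound is as follows. Let $\widehat{\theta}(\omega)$ be a bounded estimate, and let
$$\phi(\theta) = [\phi_1(\theta); ...; \phi_k(\theta)] = \int_\Omega \widehat{\theta}(\omega)p(\omega, \theta)\Pi(d\omega)$$
be the expected value of the estimate. By item 3, $\phi(\theta)$ is differentiable on $\Theta$, with the Jacobian $\phi'(\theta) = \left[\frac{\partial\phi_i(\theta)}{\partial\theta_j}\right]_{i,j\le k}$ given by
$$\phi'(\theta)h = \int_\Omega \widehat{\theta}(\omega)h^T\nabla_\theta p(\omega, \theta)\Pi(d\omega),\quad h \in \mathbf{R}^k.$$
Besides this, recalling that $\int_\Omega p(\omega, \theta)\Pi(d\omega) \equiv 1$ and invoking item 3, we have $\int_\Omega h^T\nabla_\theta p(\omega, \theta)\Pi(d\omega) = 0$, whence, in view of the previous identity,
$$\phi'(\theta)h = \int_\Omega [\widehat{\theta}(\omega) - \phi(\theta)]h^T\nabla_\theta p(\omega, \theta)\Pi(d\omega),\quad h \in \mathbf{R}^k.$$
Therefore, for all $g, h \in \mathbf{R}^k$ we have
$$\begin{array}{rcl}
[g^T\phi'(\theta)h]^2 &=& \left[\int_\Omega [g^T(\widehat{\theta} - \phi(\theta))][h^T\nabla_\theta p(\omega, \theta)/p(\omega, \theta)]p(\omega, \theta)\Pi(d\omega)\right]^2\\
&\le& \left[\int_\Omega g^T[\widehat{\theta} - \phi(\theta)][\widehat{\theta} - \phi(\theta)]^Tg\,p(\omega, \theta)\Pi(d\omega)\right] \times \left[\int_\Omega [h^T\nabla_\theta p(\omega, \theta)/p(\omega, \theta)]^2p(\omega, \theta)\Pi(d\omega)\right]\quad\text{[by the Cauchy Inequality]}\\
&=& \left[g^T\mathrm{Cov}_{\widehat{\theta}}(\theta)g\right]\left[h^T\mathcal{I}(\theta)h\right],
\end{array}$$
where $\mathrm{Cov}_{\widehat{\theta}}(\theta)$ is the covariance matrix $\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{\theta}(\omega) - \phi(\theta)][\widehat{\theta}(\omega) - \phi(\theta)]^T\right\}$ of $\widehat{\theta}(\omega)$ induced by $\omega \sim p(\cdot, \theta)$. We have arrived at the inequality
$$\left[g^T\mathrm{Cov}_{\widehat{\theta}}(\theta)g\right]\left[h^T\mathcal{I}(\theta)h\right] \ge [g^T\phi'(\theta)h]^2\quad \forall (g, h \in \mathbf{R}^k, \theta \in \Theta). \qquad (*)$$
For $\theta \in \Theta$ fixed, let $\mathcal{J}$ be a positive definite matrix such that $\mathcal{J} \succeq \mathcal{I}(\theta)$, whence by $(*)$ it holds
$$\left[g^T\mathrm{Cov}_{\widehat{\theta}}(\theta)g\right]\left[h^T\mathcal{J}h\right] \ge [g^T\phi'(\theta)h]^2\quad \forall (g, h \in \mathbf{R}^k). \qquad (**)$$
For $g$ fixed, the maximum of the right-hand side quantity in $(**)$ over $h$ satisfying $h^T\mathcal{J}h \le 1$ is $g^T\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^Tg$, and we arrive at the Cramer-Rao inequality
$$\forall (\theta \in \Theta,\ \mathcal{J} \succeq \mathcal{I}(\theta),\ \mathcal{J} \succ 0):\ \mathrm{Cov}_{\widehat{\theta}}(\theta) \succeq \phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T \qquad (4.114)$$
$$\left[\mathrm{Cov}_{\widehat{\theta}}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{\theta} - \phi(\theta)][\widehat{\theta} - \phi(\theta)]^T\right\},\ \phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\widehat{\theta}(\omega)\right\}\right]$$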
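which holds true for every bounded estimate $\widehat{\theta}(\cdot)$. Note also that for every $\theta \in \Theta$ and every bounded estimate $\widehat{\theta}$ we have
$$\begin{array}{rcl}
\mathrm{Risk}^2[\widehat{\theta}] &\ge& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega) - \theta\|_2^2\right\} = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|[\widehat{\theta}(\omega) - \phi(\theta)] + [\phi(\theta) - \theta]\|_2^2\right\}\\
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega) - \phi(\theta)\|_2^2\right\} + \|\phi(\theta) - \theta\|_2^2 + 2\underbrace{\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{\theta}(\omega) - \phi(\theta)]^T[\phi(\theta) - \theta]\right\}}_{=0}\\
&=& \mathrm{Tr}(\mathrm{Cov}_{\widehat{\theta}}(\theta)) + \|\phi(\theta) - \theta\|_2^2.
\end{array}$$
Hence, in view of (4.114), for every bounded estimate $\widehat{\theta}$ it holds
$$\forall (\mathcal{J} \succ 0 : \mathcal{J} \succeq \mathcal{I}(\theta)\ \forall\theta \in \Theta):\ \mathrm{Risk}^2[\widehat{\theta}] \ge \sup_{\theta\in\Theta}\left[\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T) + \|\phi(\theta) - \theta\|_2^2\right] \qquad (4.115)$$
$$\left[\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\{\widehat{\theta}(\omega)\}\right].$$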
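As an illustration of (4.114), consider the Gaussian model $\omega \sim \mathcal{N}(A\theta, \sigma^2 I_m)$ and a linear estimate $\widehat{\theta}(\omega) = G\omega$. A linear estimate is not bounded, so it is formally outside the scope of the derivation above, but in this case everything is available in closed form ($\phi'(\theta) = GA$, $\mathrm{Cov}_{\widehat{\theta}}(\theta) = \sigma^2GG^T$, $\mathcal{I}(\theta) = \sigma^{-2}A^TA$) and the matrix inequality can be checked directly. The sketch below does so with $\mathcal{J} = \mathcal{I}(\theta)$; the function name is hypothetical, and $A$ is assumed to have full column rank.

```python
import numpy as np

def cramer_rao_gap_min_eig(A, G, sigma):
    """Smallest eigenvalue of Cov - phi' J^{-1} phi'^T for theta_hat = G @ omega,
    omega ~ N(A theta, sigma^2 I), with J taken equal to the Fisher information."""
    fisher = A.T @ A / sigma**2            # I(theta), independent of theta
    jac = G @ A                            # Jacobian phi'(theta) of the mean of the estimate
    cov = sigma**2 * (G @ G.T)             # covariance of the estimate
    gap = cov - jac @ np.linalg.solve(fisher, jac.T)
    gap = (gap + gap.T) / 2                # symmetrize against round-off
    return np.linalg.eigvalsh(gap).min()   # (4.114) predicts a nonnegative value
```

Here the gap equals $\sigma^2G[I_m - A(A^TA)^{-1}A^T]G^T$, which is positive semidefinite because the bracketed matrix is an orthogonal projector.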
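The fact that we considered the risk of estimating "the entire" $\theta$ rather than a given vector-valued function $f(\theta) : \Theta \to \mathbf{R}^\nu$ plays no special role, and in fact the Cramer-Rao inequality admits the following modification yielded by a completely similar reasoning:
Proposition 4.36. In the situation described in item 4.22.B and under assumptions 1)–3) of this item, let $f(\cdot) : \Theta \to \mathbf{R}^\nu$ be a bounded Borel function, and let $\widehat{f}(\omega)$ be a bounded estimate of $f(\theta)$ via observation $\omega \sim p(\cdot, \theta)$. Then, setting for $\theta \in \Theta$
$$\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\widehat{f}(\omega)\right\},\quad \mathrm{Cov}_{\widehat{f}}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\widehat{f}(\omega) - \phi(\theta)][\widehat{f}(\omega) - \phi(\theta)]^T\right\},$$
one has
$$\forall (\theta \in \Theta,\ \mathcal{J} \succeq \mathcal{I}(\theta),\ \mathcal{J} \succ 0):\ \mathrm{Cov}_{\widehat{f}}(\theta) \succeq \phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T.$$
As a result, for
$$\mathrm{Risk}[\widehat{f}] = \sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{f}(\omega) - f(\theta)\|_2^2\right\}\right]^{1/2}$$
it holds
$$\forall (\mathcal{J} \succ 0 : \mathcal{J} \succeq \mathcal{I}(\theta)\ \forall\theta \in \Theta):\ \mathrm{Risk}^2[\widehat{f}] \ge \sup_{\theta\in\Theta}\left[\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T) + \|\phi(\theta) - f(\theta)\|_2^2\right].$$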
Now comes the first part of the exercise:
1) Derive from (4.115) the following
Proposition 4.37. In the situation of item 4.22.B, let
• $\Theta \subset \mathbf{R}^k$ be a $\|\cdot\|_2$-ball of radius $r > 0$,
• the family $\mathcal{P}$ be such that $\mathcal{I}(\theta) \preceq \mathcal{J}$ for some $\mathcal{J} \succ 0$ and all $\theta \in \Theta$.
Then the minimax optimal risk satisfies the bound
$$\mathrm{Risk}_{\mathrm{opt}} \ge \frac{rk}{r\sqrt{\mathrm{Tr}(\mathcal{J})} + k}. \qquad (4.116)$$
In particular, when $\mathcal{J} = \alpha^{-1}I_k$, we have
$$\mathrm{Risk}_{\mathrm{opt}} \ge \frac{r\sqrt{\alpha k}}{r + \sqrt{\alpha k}}. \qquad (4.117)$$
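The passage from (4.116) to (4.117) is pure arithmetic: with $\mathcal{J} = \alpha^{-1}I_k$ one has $\mathrm{Tr}(\mathcal{J}) = k/\alpha$. A minimal sketch evaluating both expressions (function names are hypothetical) makes the agreement easy to check for concrete numbers:

```python
import math
import numpy as np

def ball_lower_bound(r, J):
    """Right-hand side of (4.116): r k / (r sqrt(Tr J) + k)."""
    k = J.shape[0]
    return r * k / (r * math.sqrt(np.trace(J)) + k)

def ball_lower_bound_scalar(r, alpha, k):
    """Right-hand side of (4.117), i.e. (4.116) with J = I_k / alpha."""
    root = math.sqrt(alpha * k)
    return r * root / (r + root)
```

Both expressions stay below $r$ and below $\sqrt{\alpha k}$, the two "trivial" scales of the problem (the radius of $\Theta$ and the noise level).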
Hint. Assuming w.l.o.g. that $\Theta$ is centered at the origin, and given a bounded estimate $\widehat{\theta}$ with risk $R$, let $\phi(\theta)$ be associated with the estimate via (4.115). Select $\gamma \in (0, 1)$ and consider two cases: (a): there exists $\theta \in \partial\Theta$ such that $\|\phi(\theta) - \theta\|_2 > \gamma r$, and (b): $\|\phi(\theta) - \theta\|_2 \le \gamma r$ for all $\theta \in \partial\Theta$. In the case of (a), lower-bound $R$ by $\max_{\theta\in\Theta}\|\phi(\theta) - \theta\|_2$; see (4.115). In the case of (b), lower-bound $R^2$ by $\max_{\theta\in\Theta}\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T)$ (see (4.115)) and use the Gauss Divergence theorem to lower-bound the latter quantity in terms of the flux of the vector field $\phi(\cdot)$ over $\partial\Theta$. When implementing the above program, you might find useful the following fact (prove it!):
Lemma 4.38. Let $\Phi$ be an $n \times n$ matrix, and $\mathcal{J}$ be a positive definite $n \times n$ matrix. Then
$$\mathrm{Tr}(\Phi\mathcal{J}^{-1}\Phi^T) \ge \frac{\mathrm{Tr}^2(\Phi)}{\mathrm{Tr}(\mathcal{J})}.$$
4.22.C. Application to signal recovery. Proposition 4.37 allows us to build computation-based lower risk bounds in the signal recovery problem considered in Section 4.2, in particular, the problem where one wants to recover the linear image $Bx$ of an unknown signal $x$ known to belong to a given ellitope
$$X = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^TS_\ell x \le t_\ell,\ \ell \le L\}$$
(with our usual restriction on $S_\ell$ and $\mathcal{T}$) via observation $\omega = Ax + \sigma\xi$, $\xi \sim \mathcal{N}(0, I_m)$, and the risk of a candidate estimate, as in Section 4.2, is defined according to (4.113).$^{22}$ It is convenient to assume that the matrix $B$ (which in our general setup can be an arbitrary $\nu \times n$ matrix) is a nonsingular $n \times n$ matrix.$^{23}$ Under this assumption, setting
$$Y = B^{-1}X = \{y \in \mathbf{R}^n : \exists t \in \mathcal{T} : y^T[B^{-1}]^TS_\ell B^{-1}y \le t_\ell,\ \ell \le L\}$$
and $\bar{A} = AB^{-1}$, we lose nothing when replacing the sensing matrix $A$ with $\bar{A}$ and treating as our signal $y \in Y$ rather than $x \in X$. Note that in our new situation $A$ is replaced with $\bar{A}$, $X$ with $Y$, and $B$ is the unit matrix $I_n$. For the sake of simplicity, we assume from now on that $A$ (and therefore $\bar{A}$) has trivial kernel. Finally, let $\tilde{S}_\ell \succeq S_\ell$ be positive definite matrices close to $S_\ell$, e.g., $\tilde{S}_\ell = S_\ell + 10^{-100}I_n$. Setting $\bar{S}_\ell = [B^{-1}]^T\tilde{S}_\ell B^{-1}$ and
$$\bar{Y} = \{y \in \mathbf{R}^n : \exists t \in \mathcal{T} : y^T\bar{S}_\ell y \le t_\ell,\ \ell \le L\},$$
we get $\bar{S}_\ell \succ 0$ and $\bar{Y} \subset Y$. Therefore, any lower bound on the $\|\cdot\|_2$-risk of recovering $y \in \bar{Y}$ via observation $\omega = AB^{-1}y + \sigma\xi$, $\xi \sim \mathcal{N}(0, I_m)$, automatically is a lower bound on the minimax risk $\mathrm{Risk}_{\mathrm{opt}}$ corresponding to our original problem of interest.
Now assume that we can point out a $k$-dimensional linear subspace $E$ in $\mathbf{R}^n$ and positive reals $r$, $\gamma$ such that
(i) the $\|\cdot\|_2$-ball $\Theta = \{\theta \in E : \|\theta\|_2 \le r\}$ is contained in $\bar{Y}$;
(ii) the restriction $\bar{A}_E$ of $\bar{A}$ onto $E$ satisfies the relation $\mathrm{Tr}(\bar{A}_E^*\bar{A}_E) \le \gamma$ ($\bar{A}_E^* : \mathbf{R}^m \to E$ is the conjugate of the linear map $\bar{A}_E : E \to \mathbf{R}^m$).
Consider the auxiliary estimation problem obtained from the (reformulated) problem of interest by replacing the signal set $\bar{Y}$ with $\Theta$. Since $\Theta \subset \bar{Y}$, the minimax risk in the auxiliary problem is a lower bound on the minimax risk $\mathrm{Risk}_{\mathrm{opt}}$ we are interested in. On the other hand, the auxiliary problem is nothing but the problem of recovering parameter $\theta \in \Theta$ from observation $\omega \sim \mathcal{N}(\bar{A}\theta, \sigma^2I)$, which is just a special case of the problem considered in item 4.22.B. As is immediately seen, the Fisher Information matrix in this problem is independent of $\theta$ and is $\sigma^{-2}\bar{A}_E^*\bar{A}_E$:
$$e^T\mathcal{I}(\theta)e = \sigma^{-2}e^T\bar{A}_E^*\bar{A}_Ee,\quad e \in E.$$
Invoking Proposition 4.37, we arrive at the lower bound on the minimax risk in the auxiliary problem (and thus in the problem of interest as well):
$$\mathrm{Risk}_{\mathrm{opt}} \ge \frac{r\sigma k}{r\sqrt{\gamma} + \sigma k}. \qquad (4.118)$$
22 In fact, the approach to be developed can be applied to signal recovery problems involving Discrete/Poisson observation schemes and norms different from $\|\cdot\|_2$ used to measure the recovery error, signal-dependent noises, etc.
23 This assumption is nonrestrictive. Indeed, when $B \in \mathbf{R}^{\nu\times n}$ with $\nu < n$, we can add to $B$ $n - \nu$ zero rows, which keeps our estimation problem intact. When $\nu \ge n$, we can add to $B$ a small perturbation to ensure $\mathrm{Ker}\,B = \{0\}$, which, for small enough perturbation, again keeps our estimation problem basically intact. It remains to note that when $\mathrm{Ker}\,B = \{0\}$ we can replace $\mathbf{R}^\nu$ with the image space of $B$, which again does not affect the estimation problem we are interested in.
The resulting risk bound depends on $r$, $k$, $\gamma$ and is larger the smaller $\gamma$ is and the larger $k$ and $r$ are.
Lower-bounding $\mathrm{Risk}_{\mathrm{opt}}$. In order to make the bounding scheme just outlined give its best, we need a mechanism which allows us to generate $k$-dimensional "disks" $\Theta \subset \bar{Y}$ along with associated quantities $r$, $\gamma$. In order to design such a mechanism, it is convenient to represent $k$-dimensional linear subspaces of $\mathbf{R}^n$ as the image spaces of orthogonal $n \times n$ projectors $P$ of rank $k$. Such a projector $P$ gives rise to the disk $\Theta_P$ of the radius $r = r_P$ contained in $\bar{Y}$, where $r_P$ is the largest $\rho$ such that the set $\{y \in \mathrm{Im}\,P : y^TPy \le \rho^2\}$ is contained in $\bar{Y}$ ("condition $C(r)$"), and we can equip the disk with $\gamma$ satisfying (ii) if and only if
$$\mathrm{Tr}(P\bar{A}^T\bar{A}P) \le \gamma,$$
or, which is the same (recall that $P$ is an orthogonal projector),
$$\mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma \qquad (4.119)$$
("condition $D(\gamma)$"). Now, when $P$ is a nonzero orthogonal projector, the simplest sufficient condition for the validity of $C(r)$ is the existence of $t \in \mathcal{T}$ such that
$$\forall (y \in \mathbf{R}^n, \ell \le L) : y^TP\bar{S}_\ell Py \le t_\ell r^{-2}y^TPy,$$
or, which is the same,
$$\exists s : r^2s \in \mathcal{T}\ \&\ P\bar{S}_\ell P \preceq s_\ell P,\ \ell \le L. \qquad (4.120)$$
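For a given projector $P$, conditions $C(r)$ and $D(\gamma)$ translate into a lower bound of the form (4.118) by elementary linear algebra. The sketch below is an illustration under a simplifying assumption that is not made in the text at this point: $\mathcal{T} = [0,1]^L$, so that $\bar{Y} = \{y : y^T\bar{S}_\ell y \le 1,\ \ell \le L\}$ and the largest admissible radius is $r = 1/\sqrt{\max_\ell \lambda_{\max}(U^T\bar{S}_\ell U)}$, where $U$ is an orthonormal basis of $\mathrm{Im}\,P$. The function name and argument layout are hypothetical.

```python
import numpy as np

def disk_lower_bound(A_bar, S_bars, P, sigma):
    """Lower bound (4.118) for the disk carried by an orthogonal projector P,
    assuming T = [0,1]^L, i.e. Y_bar = {y : y^T S_bar_l y <= 1 for all l}.

    Returns (bound, r, gamma).
    """
    # Orthonormal basis U of Im P: eigenvectors of P with eigenvalue 1.
    eigvals, eigvecs = np.linalg.eigh(P)
    U = eigvecs[:, eigvals > 0.5]
    k = U.shape[1]
    # Condition C(r): the ball {y in Im P, ||y||_2 <= r} must lie in Y_bar.
    lam = max(np.linalg.eigvalsh(U.T @ S @ U).max() for S in S_bars)
    r = 1.0 / np.sqrt(lam)
    # Condition D(gamma): gamma = Tr(A_bar P A_bar^T) = ||A_bar U||_F^2.
    AU = A_bar @ U
    gamma = np.trace(AU.T @ AU)
    bound = r * sigma * k / (r * np.sqrt(gamma) + sigma * k)
    return bound, r, gamma
```

Scanning such bounds over a family of candidate projectors (for instance, projectors onto spans of eigenvectors of $\bar{A}^T\bar{A}$ with small eigenvalues, a heuristic choice) and keeping the largest value gives a crude computation-based lower bound.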
Let us rewrite (4.119) and (4.120) as a system of linear matrix inequalities. This is what you are supposed to do:
2.1) Prove the following simple fact:
Observation 4.39. Let $Q$ be a positive definite, $R$ be a nonzero positive semidefinite matrix, and let $s$ be a real. Then $RQR \preceq sR$ if and only if $sQ^{-1} \succeq R$.
2.2) Extract from the above observation the conclusion as follows. Let $\mathbf{T}$ be the conic hull of $\mathcal{T}$:
$$\mathbf{T} = \mathrm{cl}\{[s; \tau] : \tau > 0, s/\tau \in \mathcal{T}\} = \{[s; \tau] : \tau > 0, s/\tau \in \mathcal{T}\} \cup \{0\}.$$
Consider the system of constraints
$$[s; \tau] \in \mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma,\quad P \text{ is an orthogonal projector of rank } k \ge 1 \qquad (\#)$$
in variables $[s; \tau]$, $k$, $\gamma$, and $P$. Every feasible solution to this system gives rise to a $k$-dimensional Euclidean subspace $E \subset \mathbf{R}^n$ (the image space of $P$) such that the Euclidean ball $\Theta$ in $E$ centered at the origin of radius $r = 1/\sqrt{\tau}$ taken along with $\gamma$ satisfies conditions (i)–(ii). Consequently, such a feasible solution yields the lower bound
$$\mathrm{Risk}_{\mathrm{opt}} \ge \psi_{\sigma,k}(\gamma, \tau) := \frac{\sigma k}{\sqrt{\gamma} + \sigma\sqrt{\tau}k}$$
on the minimax risk in the problem of interest.
Ideally, to utilize item 2.2 to lower-bound $\mathrm{Risk}_{\mathrm{opt}}$, we should look through $k = 1, ..., n$ and maximize for every $k$ the lower risk bound $\psi_{\sigma,k}(\gamma, \tau)$ under constraints $(\#)$, thus arriving at the problem
$$\min_{[s;\tau],\gamma,P}\left\{\frac{\sigma}{\psi_{\sigma,k}(\gamma, \tau)} = \sqrt{\gamma}/k + \sigma\sqrt{\tau} :\ [s; \tau] \in \mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma,\ P \text{ is an orthogonal projector of rank } k\right\}. \qquad (P_k)$$
This problem seems to be computationally intractable, since the constraints of $(P_k)$ include the nonconvex restriction on $P$ to be a projector of rank $k$. A natural convex relaxation of this constraint is
$$0 \preceq P \preceq I_n,\quad \mathrm{Tr}(P) = k.$$
The (minor) remaining difficulty is that the objective in $(P_k)$ is nonconvex. Note, however, that to minimize $\sqrt{\gamma}/k + \sigma\sqrt{\tau}$ is basically the same as to minimize the convex function $\gamma/k^2 + \sigma^2\tau$, which is a tight "proxy" of the squared objective of $(P_k)$. We arrive at a convex "proxy" of $(P_k)$, the problem
$$\min_{[s;\tau],\gamma,P}\left\{\gamma/k^2 + \sigma^2\tau :\ [s; \tau] \in \mathbf{T},\ 0 \preceq P \preceq I_n,\ \mathrm{Tr}(P) = k,\ s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L,\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma\right\}, \qquad (P[k])$$
$k = 1, ..., n$. Problem $(P[k])$ clearly is solvable, and the $P$-component $P^{(k)}$ of its optimal solution gives rise to a collection of orthogonal projectors $P_\kappa^{(k)}$, $\kappa = 1, ..., n$, obtained from $P^{(k)}$ by "rounding": to get $P_\kappa^{(k)}$, we replace the $\kappa$ leading eigenvalues of $P^{(k)}$ with ones, and the remaining eigenvalues with zeros, while keeping the eigenvectors intact. We can now for every $\kappa = 1, ..., n$ fix the $P$-variable in $(P_k)$ as $P_\kappa^{(k)}$ and solve the resulting problem in the remaining variables $[s; \tau]$ and $\gamma$, which is easy: with $P$ fixed, the problem clearly reduces to minimizing $\tau$ under the convex constraints
$$s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L,\quad [s; \tau] \in \mathbf{T}$$
on $[s; \tau]$. As a result, for every $k \in \{1, ..., n\}$, we get $n$ lower bounds on $\mathrm{Risk}_{\mathrm{opt}}$, that is, a total of $n^2$ lower risk bounds, of which we select the best, i.e., the largest.
Now comes the next part of the exercise:
3) Implement the outlined program numerically and compare the lower bound on the minimax risk with the upper risk bounds of presumably good linear estimates yielded by Proposition 4.4.
Recommended setup:
• Sizes: $m = n = \nu = 16$.
• $A$, $B$: $B = I_n$, $A = \mathrm{Diag}\{a_1, ..., a_n\}$ with $a_i = i^{-\alpha}$ and $\alpha$ running through $\{0, 1, 2\}$.
• $X = \{x \in \mathbf{R}^n : x^TS_\ell x \le 1, \ell \le L\}$ (i.e., $\mathcal{T} = [0, 1]^L$) with randomly generated $S_\ell$.
• Range of $L$: $\{1, 4, 16\}$. For $L$ in this range, you can generate $S_\ell$, $\ell \le L$, as $S_\ell = R_\ell R_\ell^T$ with $R_\ell = \mathrm{randn}(n, p)$, where $p = ⌋n/L⌊$.
• Range of $\sigma$: $\{1.0, 0.1, 0.01, 0.001, 0.0001\}$.
Exercise 4.23. [follow-up to Exercise 4.22]
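1) Prove the following version of Proposition 4.37:
Proposition 4.40. In the situation of item 4.22.B and under Assumptions 1)–3) from this item, let
• $\|\cdot\|$ be a norm on $\mathbf{R}^k$ such that $\|\theta\|_2 \le \kappa\|\theta\|$ for all $\theta \in \mathbf{R}^k$,
• $\Theta \subset \mathbf{R}^k$ be a $\|\cdot\|$-ball of radius $r > 0$,
• the family $\mathcal{P}$ be such that $\mathcal{I}(\theta) \preceq \mathcal{J}$ for some $\mathcal{J} \succ 0$ and all $\theta \in \Theta$.
Then the minimax optimal risk
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} = \inf_{\widehat{\theta}(\cdot)}\sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\theta - \widehat{\theta}(\omega)\|^2\right\}\right]^{1/2}$$
of recovering parameter $\theta \in \Theta$ from observation $\omega \sim p(\cdot, \theta)$ in the norm $\|\cdot\|$ satisfies the bound
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} \ge \frac{rk}{r\kappa\sqrt{\mathrm{Tr}(\mathcal{J})} + k}. \qquad (4.121)$$
In particular, when $\mathcal{J} = \alpha^{-1}I_k$, we get
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} \ge \frac{r\sqrt{\alpha k}}{r\kappa + \sqrt{\alpha k}}. \qquad (4.122)$$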
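2) Apply Proposition 4.40 to get lower bounds on the minimax $\|\cdot\|$-risk in the following estimation problems:
2.1) Given indirect observation $\omega = A\theta + \sigma\xi$, $\xi \sim \mathcal{N}(0, I_m)$, of unknown vector $\theta$ known to belong to $\Theta = \{\theta \in \mathbf{R}^k : \|\theta\|_p \le r\}$ with given $A$, $\mathrm{Ker}\,A = \{0\}$, $p \in [2, \infty]$, $r > 0$, we want to recover $\theta$ in $\|\cdot\|_p$.
2.2) Given indirect observation $\omega = L\theta R + \sigma\xi$, where $\theta$ is an unknown $\mu \times \nu$ matrix known to belong to the Schatten norm ball $\Theta = \{\theta \in \mathbf{R}^{\mu\times\nu} : \|\theta\|_{\mathrm{Sh},p} \le r\}$, we want to recover $\theta$ in $\|\cdot\|_{\mathrm{Sh},p}$. Here $L \in \mathbf{R}^{m\times\mu}$, $\mathrm{Ker}\,L = \{0\}$, and $R \in \mathbf{R}^{\nu\times n}$, $\mathrm{Ker}\,R^T = \{0\}$, are given matrices, $p \in [2, \infty]$, and $\xi$ is a random Gaussian $m \times n$ matrix (i.e., the entries in $\xi$ are $\mathcal{N}(0, 1)$ random variables independent of each other).
2.3) Given a $K$-repeated observation $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. components $\omega_t \sim \mathcal{N}(0, \theta)$, $1 \le t \le K$, with unknown $\theta \in \mathbf{S}^n$ known to belong to the matrix box $\Theta = \{\theta : \beta_-I_n \preceq \theta \preceq \beta_+I_n\}$ with given $0 < \beta_- < \beta_+ < \infty$, we want to recover $\theta$ in the spectral norm.
Exercise 4.24. [More on Cramer-Rao risk bound] Let us fix $\mu \in (1, \infty)$ and a norm $\|\cdot\|$ on $\mathbf{R}^k$, and let $\|\cdot\|_*$ be the norm conjugate to $\|\cdot\|$, and $\mu_* = \frac{\mu}{\mu - 1}$. Assume that we are in the situation of item 4.22.B and under assumptions 1) and 3) from this item; as for assumption 2), we now replace it with the assumption that the quantity
$$\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta) := \left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega, \theta))\|_*^{\mu_*}\right\}\right]^{1/\mu_*}$$
is well-defined and bounded on $\Theta$; in the sequel, we set
$$\mathcal{I}_{\|\cdot\|_*,\mu_*} = \sup_{\theta\in\Theta}\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta).$$
1) Prove the following variant of the Cramer-Rao risk bound:
Proposition 4.41. In the situation described at the beginning of item 4.22.D, let $\Theta \subset \mathbf{R}^k$ be a $\|\cdot\|$-ball of radius $r$. Then the minimax $\|\cdot\|$-risk of recovering $\theta \in \Theta$ via observation $\omega \sim p(\cdot, \theta)$ can be lower-bounded as
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] := \inf_{\widehat{\theta}(\cdot)}\sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega) - \theta\|^\mu\right\}\right]^{1/\mu} \ge \frac{rk}{r\mathcal{I}_{\|\cdot\|_*,\mu_*} + k}, \qquad (4.123)$$
$$\mathcal{I}_{\|\cdot\|_*,\mu_*} = \max_{\theta\in\Theta}\left[\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta) := \left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega, \theta))\|_*^{\mu_*}\right\}\right]^{1/\mu_*}\right].$$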
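Example I: Gaussian case, estimating shift. Let $\mu = 2$, and let $p(\cdot, \theta)$ be the density of $\mathcal{N}(A\theta, \sigma^2I_m)$ with $A \in \mathbf{R}^{m\times k}$. Then
$$\nabla_\theta\ln(p(\omega, \theta)) = \sigma^{-2}A^T(\omega - A\theta)$$
$$\begin{array}{rcl}
\Rightarrow \int\|\nabla_\theta\ln(p(\omega, \theta))\|_*^2\,p(\omega, \theta)d\omega &=& \sigma^{-4}\int\|A^T(\omega - A\theta)\|_*^2\,p(\omega, \theta)d\omega\\
&=& \sigma^{-4}\frac{1}{[\sqrt{2\pi}\sigma]^m}\int\|A^T\omega\|_*^2\exp\{-\frac{\omega^T\omega}{2\sigma^2}\}d\omega\\
&=& \sigma^{-4}\frac{1}{[2\pi]^{m/2}}\int\|A^T\sigma\xi\|_*^2\exp\{-\xi^T\xi/2\}d\xi\\
&=& \sigma^{-2}\frac{1}{[2\pi]^{m/2}}\int\|A^T\xi\|_*^2\exp\{-\xi^T\xi/2\}d\xi
\end{array}$$
whence
$$\mathcal{I}_{\|\cdot\|_*,2} = \sigma^{-1}\underbrace{\left[\mathbf{E}_{\xi\sim\mathcal{N}(0,I_m)}\left\{\|A^T\xi\|_*^2\right\}\right]^{1/2}}_{\gamma_{\|\cdot\|}(A)}.$$
Consequently, assuming $\Theta$ to be a $\|\cdot\|$-ball of radius $r$ in $\mathbf{R}^k$, lower bound (4.123) becomes
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] \ge \frac{rk}{r\mathcal{I}_{\|\cdot\|_*,2} + k} = \frac{rk}{r\sigma^{-1}\gamma_{\|\cdot\|}(A) + k} = \frac{r\sigma k}{r\gamma_{\|\cdot\|}(A) + \sigma k}. \qquad (4.124)$$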
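The case of direct observations. To see "how it works," consider the case $m = k$, $A = I_k$ of direct observations, and let $\Theta = \{\theta \in \mathbf{R}^k : \|\theta\| \le r\}$. Then
• We have $\gamma_{\|\cdot\|_1}(I_k) \le O(1)\sqrt{\ln(k)}$, whence the $\|\cdot\|_1$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_1}[\Theta] \ge O(1)\frac{r\sigma k}{r\sqrt{\ln(k)} + \sigma k}\quad [\Theta = \{\theta \in \mathbf{R}^k : \|\theta - a\|_1 \le r\}].$$
• We have $\gamma_{\|\cdot\|_2}(I_k) = \sqrt{k}$, whence the $\|\cdot\|_2$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_2}[\Theta] \ge \frac{r\sigma\sqrt{k}}{r + \sigma\sqrt{k}}\quad [\Theta = \{\theta \in \mathbf{R}^k : \|\theta - a\|_2 \le r\}].$$
• We have $\gamma_{\|\cdot\|_\infty}(I_k) \le O(1)k$, whence the $\|\cdot\|_\infty$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_\infty}[\Theta] \ge O(1)\frac{r\sigma}{r + \sigma}\quad [\Theta = \{\theta \in \mathbf{R}^k : \|\theta - a\|_\infty \le r\}].$$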
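The quantities $\gamma_{\|\cdot\|}(A)$ entering (4.124) are Gaussian expectations and can be estimated by simulation when no closed form is at hand. The sketch below, restricted to the three norms just listed, is a hedged illustration rather than part of the exercise: it estimates $\gamma_{\|\cdot\|_p}(A) = [\mathbf{E}\{\|A^T\xi\|_*^2\}]^{1/2}$ by Monte Carlo, using the fact that the norms conjugate to $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_\infty$ are $\|\cdot\|_\infty$, $\|\cdot\|_2$, $\|\cdot\|_1$, and plugs the estimate into (4.124). Function names and sample sizes are assumptions.

```python
import numpy as np

# Conjugate (dual) norm orders for p in {1, 2, inf}.
DUAL_ORD = {1: np.inf, 2: 2, np.inf: 1}

def gamma_norm(A, p, n_samples=20_000, seed=0):
    """Monte Carlo estimate of gamma_{||.||_p}(A) = (E ||A^T xi||_*^2)^{1/2}, xi ~ N(0, I_m)."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((n_samples, A.shape[0]))
    # Row t of xi @ A is (A^T xi_t)^T; take the dual norm of every row.
    dual_norms = np.linalg.norm(xi @ A, ord=DUAL_ORD[p], axis=1)
    return np.sqrt(np.mean(dual_norms**2))

def risk_lower_bound(r, sigma, A, p, **mc_kwargs):
    """Right-hand side of (4.124) with gamma replaced by its Monte Carlo estimate."""
    k = A.shape[1]
    return r * sigma * k / (r * gamma_norm(A, p, **mc_kwargs) + sigma * k)
```

Because the estimate of $\gamma$ is random, the resulting number is only an approximation of the bound (4.124); for $p = 2$ and $A = I_k$ it can be compared with the exact value $\sqrt{k}$.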
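In fact, the above examples are essentially covered by the following:
Observation 4.42. Let $\|\cdot\|$ be a norm on $\mathbf{R}^k$, and let $\Theta = \{\theta \in \mathbf{R}^k : \|\theta\| \le r\}$. Consider the problem of recovering signal $\theta \in \Theta$ via observation $\omega \sim \mathcal{N}(\theta, \sigma^2I_k)$. Let
$$\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta] = \sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim\mathcal{N}(\theta,\sigma^2I)}\left\{\|\widehat{\theta}(\omega) - \theta\|^2\right\}\right]^{1/2}$$
be the $\|\cdot\|$-risk of an estimate $\widehat{\theta}(\cdot)$, and let
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] = \inf_{\widehat{\theta}(\cdot)}\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta]$$
be the associated minimax risk. Assume that the norm $\|\cdot\|$ is absolute and symmetric w.r.t. permutations of the coordinates. Then
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] \ge \frac{r\sigma k}{\sqrt{2\ln(ek)}\,r\alpha_* + \sigma k},\quad \alpha_* = \|[1; ...; 1]\|_*. \qquad (4.125)$$
Here is the concluding part of the exercise:
2) Prove the observation and compare the lower risk bound it yields with the $\|\cdot\|$-risk of the "plug-in" estimate $\widehat{\chi}(\omega) \equiv \omega$.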
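Example II: Gaussian case, estimating covariance. Let $\mu = 2$, let $K$ be a positive integer, and let our observation $\omega$ be a collection of $K$ i.i.d. samples $\omega_t \sim \mathcal{N}(0, \theta)$, $1 \le t \le K$, with unknown $\theta$ known to belong to a given convex compact subset $\Theta$ of the interior of the positive semidefinite cone $\mathbf{S}^n_+$. Given $\omega_1, ..., \omega_K$, we want to recover $\theta$ in the Schatten norm $\|\cdot\|_{\mathrm{Sh},s}$ with $s \in [1, \infty]$. Our estimation problem is covered by the setup of Exercise 4.22 with $\mathcal{P}$ comprised of the product probability densities $p(\omega, \theta) = \prod_{t=1}^K g(\omega_t, \theta)$, $\theta \in \Theta$, where $g(\cdot, \theta)$ is the density of $\mathcal{N}(0, \theta)$. We have
$$\begin{array}{rcl}
\nabla_\theta\ln(p(\omega, \theta)) &=& \sum_t\nabla_\theta\ln(g(\omega_t, \theta)) = \frac{1}{2}\sum_t\left[\theta^{-1}\omega_t\omega_t^T\theta^{-1} - \theta^{-1}\right]\\
&=& \frac{1}{2}\theta^{-1/2}\left[\sum_t\left([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T - I_n\right)\right]\theta^{-1/2}.
\end{array} \qquad (4.126)$$
With some effort [149] it can be proved that when $K \ge n$, which we assume from now on, for random vectors $\xi_1, ..., \xi_K$ independent across $t$ sampled from the standard Gaussian distribution $\mathcal{N}(0, I_n)$, for every $u \in [1, \infty]$ one has
$$\left[\mathbf{E}\left\{\left\|\sum_{t=1}^K[\xi_t\xi_t^T - I_n]\right\|_{\mathrm{Sh},u}^2\right\}\right]^{1/2} \le Cn^{\frac{1}{2}+\frac{1}{u}}\sqrt{K} \qquad (4.127)$$
with appropriate absolute constant $C$. Consequently, for $\theta \in \Theta$ and all $u \in [1, \infty]$ we have
$$\begin{array}{rcll}
\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega, \theta))\|_{\mathrm{Sh},u}^2\right\} &=& \frac{1}{4}\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\left\|\theta^{-1/2}\left[\sum_t\left([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T - I_n\right)\right]\theta^{-1/2}\right\|_{\mathrm{Sh},u}^2\right\} & \text{[by (4.126)]}\\
&=& \frac{1}{4}\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\theta^{-1/2}\left[\sum_t\left(\xi_t\xi_t^T - I_n\right)\right]\theta^{-1/2}\right\|_{\mathrm{Sh},u}^2\right\} & \text{[setting } \theta^{-1/2}\omega_t = \xi_t]\\
&\le& \frac{1}{4}\|\theta^{-1/2}\|_{\mathrm{Sh},\infty}^4\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\sum_t\left[\xi_t\xi_t^T - I_n\right]\right\|_{\mathrm{Sh},u}^2\right\} & \text{[since } \|AB\|_{\mathrm{Sh},u} \le \|A\|_{\mathrm{Sh},\infty}\|B\|_{\mathrm{Sh},u}]\\
&\le& \frac{1}{4}\|\theta^{-1/2}\|_{\mathrm{Sh},\infty}^4\left[Cn^{\frac{1}{2}+\frac{1}{u}}\sqrt{K}\right]^2 & \text{[by (4.127)]}
\end{array}$$
and we arrive at
$$\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega, \theta))\|_{\mathrm{Sh},u}^2\right\}\right]^{1/2} \le \frac{C}{2}\|\theta^{-1}\|_{\mathrm{Sh},\infty}n^{\frac{1}{2}+\frac{1}{u}}\sqrt{K}. \qquad (4.128)$$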
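Now assume that $\Theta$ is the $\|\cdot\|_{\mathrm{Sh},s}$-ball of radius $r < 1$ centered at $I_n$:
$$\Theta = \{\theta \in \mathbf{S}^n : \|\theta - I_n\|_{\mathrm{Sh},s} \le r\}. \qquad (4.129)$$
In this case the estimation problem from Example II is within the scope of Proposition 4.41, and the quantity $\mathcal{I}_{\|\cdot\|_*,2}$ as defined in (4.123) can be upper-bounded as follows:
$$\begin{array}{rcll}
\mathcal{I}_{\|\cdot\|_*,2} &=& \max_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega, \theta))\|_{\mathrm{Sh},s_*}^2\right\}\right]^{1/2}\\
&\le& O(1)n^{\frac{1}{2}+\frac{1}{s_*}}\sqrt{K}\max_{\theta\in\Theta}\|\theta^{-1}\|_{\mathrm{Sh},\infty} & \text{[see (4.128)]}\\
&\le& O(1)\frac{n^{\frac{1}{2}+\frac{1}{s_*}}\sqrt{K}}{1 - r}.
\end{array}$$
We can now use Proposition 4.41 to lower-bound the minimax $\|\cdot\|_{\mathrm{Sh},s}$-risk, thus arriving at
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},s}}[\Theta] \ge O(1)\frac{n(1 - r)r}{\sqrt{K}n^{\frac{1}{2}-\frac{1}{s}}r + n(1 - r)} \qquad (4.130)$$
(note that we are in the case of $k = \dim\theta = \frac{n(n+1)}{2}$).
Let us compare this lower risk bound with the $\|\cdot\|_{\mathrm{Sh},s}$-risk of the "plug-in" estimate
$$\widehat{\theta}(\omega) = \frac{1}{K}\sum_{t=1}^K\omega_t\omega_t^T.$$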
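Assuming $\theta \in \Theta$, we have
$$\begin{array}{rcll}
\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|K[\widehat{\theta}(\omega) - \theta]\|_{\mathrm{Sh},s}^2\right\} &=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\left\|\sum_t[\omega_t\omega_t^T - \theta]\right\|_{\mathrm{Sh},s}^2\right\}\\
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\left\|\theta^{1/2}\left[\sum_t\left([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T - I_n\right)\right]\theta^{1/2}\right\|_{\mathrm{Sh},s}^2\right\}\\
&=& \mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\theta^{1/2}\left[\sum_t[\xi_t\xi_t^T - I_n]\right]\theta^{1/2}\right\|_{\mathrm{Sh},s}^2\right\}\\
&\le& \|\theta^{1/2}\|_{\mathrm{Sh},\infty}^4\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\left\|\sum_t[\xi_t\xi_t^T - I_n]\right\|_{\mathrm{Sh},s}^2\right\}\\
&\le& \|\theta^{1/2}\|_{\mathrm{Sh},\infty}^4\left[Cn^{\frac{1}{2}+\frac{1}{s}}\sqrt{K}\right]^2, & \text{[see (4.127)]}
\end{array}$$
and we arrive at
$$\mathrm{Risk}_{\|\cdot\|_{\mathrm{Sh},s}}[\widehat{\theta}|\Theta] \le O(1)\max_{\theta\in\Theta}\|\theta\|_{\mathrm{Sh},\infty}\frac{n^{\frac{1}{2}+\frac{1}{s}}}{\sqrt{K}}.$$
In the case of (4.129), where $\|\theta\|_{\mathrm{Sh},\infty} \le 1 + r < 2$ for all $\theta \in \Theta$, the latter bound becomes
$$\mathrm{Risk}_{\|\cdot\|_{\mathrm{Sh},s}}[\widehat{\theta}|\Theta] \le O(1)\frac{n^{\frac{1}{2}+\frac{1}{s}}}{\sqrt{K}}. \qquad (4.131)$$
For the sake of simplicity, assume that $r$ in (4.129) is $1/2$ (what actually matters below is that $r \in (0, 1)$ is bounded away from 0 and from 1). In this case the lower bound (4.130) on the minimax $\|\cdot\|_{\mathrm{Sh},s}$-risk reads
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},s}}[\Theta] \ge O(1)\min\left[\frac{n^{\frac{1}{2}+\frac{1}{s}}}{\sqrt{K}}, 1\right].$$
When $K$ is "large," $K \ge n^{1+\frac{2}{s}}$, this lower bound matches, within an absolute constant factor, the upper bound (4.131) on the risk of the plug-in estimate, so that the latter estimate is near-optimal. When $K < n^{1+\frac{2}{s}}$, the lower risk bound becomes $O(1)$, so that here a nearly optimal estimate is the trivial estimate $\widehat{\theta}(\omega) \equiv I_n$.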
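The $n^{\frac{1}{2}+\frac{1}{s}}/\sqrt{K}$ scaling of the plug-in estimate can be observed in a small simulation. The sketch below is illustrative only: it evaluates the error at the single point $\theta = I_n$ (the center of $\Theta$) rather than the supremum over $\Theta$, so it probes the rate and not the risk itself; the function names and trial counts are assumptions.

```python
import numpy as np

def schatten_norm(M, s):
    """Schatten s-norm: the l_s norm of the singular values (s = inf gives the spectral norm)."""
    sv = np.linalg.svd(M, compute_uv=False)
    return sv.max() if np.isinf(s) else np.sum(sv**s) ** (1.0 / s)

def plug_in_error(n, K, s, n_trials=200, seed=0):
    """Root-mean-square Schatten-norm error of (1/K) sum_t w_t w_t^T at theta = I_n."""
    rng = np.random.default_rng(seed)
    sq_err = []
    for _ in range(n_trials):
        omega = rng.standard_normal((K, n))        # rows are samples w_t ~ N(0, I_n)
        theta_hat = omega.T @ omega / K            # plug-in estimate
        sq_err.append(schatten_norm(theta_hat - np.eye(n), s) ** 2)
    return np.sqrt(np.mean(sq_err))

def rate(n, K, s):
    """The quantity min[n^(1/2 + 1/s) / sqrt(K), 1] appearing in the lower bound."""
    inv_s = 0.0 if np.isinf(s) else 1.0 / s
    return min(n ** (0.5 + inv_s) / np.sqrt(K), 1.0)
```

Plotting `plug_in_error(n, K, s)` against `rate(n, K, s)` for a few values of $K \ge n^{1+2/s}$ should show the two quantities moving together up to a constant factor.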
4.7.7 Around S-Lemma
S-Lemma is a classical result of extreme importance in Semidefinite Optimization. Basically, the lemma states that when the ellitope $X$ in Proposition 4.6 is an ellipsoid, (4.19) can be strengthened to $\mathrm{Opt} = \mathrm{Opt}_*$. In fact, S-Lemma is even stronger:
Lemma 4.43. [S-Lemma] Consider two quadratic forms $f(x) = x^TAx + 2a^Tx + \alpha$ and $g(x) = x^TBx + 2b^Tx + \beta$ such that $g(\bar{x}) < 0$ for some $\bar{x}$. Then the implication
$$g(x) \le 0 \Rightarrow f(x) \le 0$$
takes place if and only if for some $\lambda \ge 0$ it holds $f(x) \le \lambda g(x)$ for all $x$, or, which is the same, if and only if the Linear Matrix Inequality
$$\begin{bmatrix}\lambda B - A & \lambda b - a\\ \lambda b^T - a^T & \lambda\beta - \alpha\end{bmatrix} \succeq 0$$
in scalar variable $\lambda$ has a nonnegative solution.
Proof of S-Lemma can be found, e.g., in [15, Section 3.5.2]. The goal of subsequent exercises is to get "tight" tractable outer approximations of sets obtained from ellitopes by quadratic lifting. We fix an ellitope
$$X = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^TS_kx \le t_k,\ 1 \le k \le K\} \qquad (4.132)$$
where, as always, $S_k$ are positive semidefinite matrices with positive definite sum, and $\mathcal{T}$ is a computationally tractable convex compact subset of $\mathbf{R}^K_+$ such that $t \in \mathcal{T}$ implies $t' \in \mathcal{T}$ whenever $0 \le t' \le t$, and $\mathcal{T}$ contains a positive vector.
Exercise 4.25.
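Let us associate with ellitope $X$ given by (4.132) the sets
$$\begin{array}{rcl}
\mathcal{X} &=& \mathrm{Conv}\{xx^T : x \in X\},\\
\widehat{\mathcal{X}} &=& \{Y \in \mathbf{S}^n : Y \succeq 0,\ \exists t \in \mathcal{T} : \mathrm{Tr}(S_kY) \le t_k,\ 1 \le k \le K\},
\end{array}$$
so that $\mathcal{X}$, $\widehat{\mathcal{X}}$ are convex compact sets containing the origin, and $\widehat{\mathcal{X}}$ is computationally tractable along with $\mathcal{T}$. Prove that
1. When $K = 1$, we have $\mathcal{X} = \widehat{\mathcal{X}}$;
2. We always have $\mathcal{X} \subset \widehat{\mathcal{X}} \subset 3\ln(\sqrt{3}K)\mathcal{X}$.
Exercise 4.26.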
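For $x \in \mathbf{R}^n$ let
$$Z(x) = [x; 1][x; 1]^T,\quad Z^o[x] = \begin{bmatrix}xx^T & x\\ x^T & 0\end{bmatrix}.$$
Let
$$C = \begin{bmatrix}0 & 0\\ 0 & 1\end{bmatrix}$$
(so that $Z(x) = Z^o[x] + C$), and let us associate with ellitope $X$ given by (4.132) the sets
$$\begin{array}{rcl}
\mathcal{X}^+ &=& \mathrm{Conv}\{Z^o[x] : x \in X\},\\
\widehat{\mathcal{X}}^+ &=& \left\{Y = \begin{bmatrix}U & u\\ u^T & 0\end{bmatrix} \in \mathbf{S}^{n+1} : Y + C \succeq 0,\ \exists t \in \mathcal{T} : \mathrm{Tr}(S_kU) \le t_k,\ 1 \le k \le K\right\},
\end{array}$$
so that $\mathcal{X}^+$, $\widehat{\mathcal{X}}^+$ are convex compact sets containing the origin, and $\widehat{\mathcal{X}}^+$ is computationally tractable along with $\mathcal{T}$. Prove that
1. When $K = 1$, we have $\mathcal{X}^+ = \widehat{\mathcal{X}}^+$;
2. We always have $\mathcal{X}^+ \subset \widehat{\mathcal{X}}^+ \subset 3\ln(\sqrt{3}(K + 1))\mathcal{X}^+$.
4.7.8 Miscellaneous exercises
Exercise 4.27. Let $X \subset \mathbf{R}^n$ be a convex compact set, let $b \in \mathbf{R}^n$, and let $A$ be an $m \times n$ matrix. Consider the problem of affine recovery $\omega \mapsto h^T\omega + c$ of the linear function $Bx = b^Tx$ of $x \in X$ from indirect observation $\omega = Ax + \sigma\xi$, $\xi \sim \mathcal{N}(0, I_m)$.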
Given tolerance $\epsilon \in (0, 1)$, we are interested in minimizing the worst-case, over $x \in X$, width of the $(1 - \epsilon)$ confidence interval, that is, the smallest $\rho$ such that
$$\mathrm{Prob}\{\xi : b^Tx - [h^T(Ax + \sigma\xi) + c] > \rho\} \le \epsilon/2\ \ \&\ \ \mathrm{Prob}\{\xi : b^Tx - [h^T(Ax + \sigma\xi) + c] < -\rho\} \le \epsilon/2\quad \forall x \in X.$$
Pose the problem as a convex optimization problem and consider in detail the case where $X$ is the box $\{x \in \mathbf{R}^n : a_j|x_j| \le 1, 1 \le j \le n\}$, where $a_j > 0$ for all $j$.
Exercise 4.28. Prove Proposition 4.21.
Exercise 4.29. Prove Proposition 4.22.
4.8 PROOFS

4.8.1 Preliminaries

4.8.1.1 Technical lemma
Lemma 4.44. Given basic spectratope
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : R_k^2[x] \preceq t_k I_{d_k},\ 1 \le k \le K\} \tag{4.133}$$
and a positive definite $n \times n$ matrix $Q$ and setting $\Lambda_k = \mathcal{R}_k[Q]$ (for notation, see Section 4.3.1), we get a collection of positive semidefinite matrices, and $\sum_k \mathcal{R}^*_k[\Lambda_k]$ is positive definite. As corollaries,
(i) whenever $M_k$, $k \le K$, are positive definite matrices, the matrix $\sum_k \mathcal{R}^*_k[M_k]$ is positive definite;
(ii) the set $\mathcal{Q}_T = \{Q \succeq 0 : \mathcal{R}_k[Q] \preceq T I_{d_k},\ k \le K\}$ is bounded for every $T$.

Proof. Let us prove the first claim. Assuming the opposite, we would be able to find a nonzero vector $y$ such that $\sum_k y^T\mathcal{R}^*_k[\Lambda_k]y \le 0$, whence
$$0 \ge \sum_k y^T\mathcal{R}^*_k[\Lambda_k]y = \sum_k \mathrm{Tr}(\mathcal{R}^*_k[\Lambda_k][yy^T]) = \sum_k \mathrm{Tr}(\Lambda_k\mathcal{R}_k[yy^T])$$
(we have used (4.26), (4.22)). Since $\Lambda_k = \mathcal{R}_k[Q] \succeq 0$ due to $Q \succeq 0$ (see (4.23)), it follows that $\mathrm{Tr}(\Lambda_k\mathcal{R}_k[yy^T]) = 0$ for all $k$. Now, the linear mapping $\mathcal{R}_k[\cdot]$ is $\succeq$-monotone, and $Q$ is positive definite, implying that $Q \succeq r_k yy^T$ for some $r_k > 0$, whence $\Lambda_k \succeq r_k\mathcal{R}_k[yy^T]$, and therefore $\mathrm{Tr}(\Lambda_k\mathcal{R}_k[yy^T]) = 0$ implies that $\mathrm{Tr}(\mathcal{R}_k^2[yy^T]) = 0$, that is, $\mathcal{R}_k[yy^T] = R_k^2[y] = 0$. Since $R_k[\cdot]$ takes values in $\mathbf{S}^{d_k}$, we get $R_k[y] = 0$ for all $k$, which is impossible due to $y \ne 0$ and property S.3; see Section 4.3.1.

To verify (i), note that when $M_k$ are positive definite, we can find $\gamma > 0$ such that $\Lambda_k \preceq \gamma M_k$ for all $k \le K$; invoking (4.27), we conclude that $\mathcal{R}^*_k[\Lambda_k] \preceq \gamma\mathcal{R}^*_k[M_k]$, whence $\sum_k \mathcal{R}^*_k[M_k]$ is positive definite along with $\sum_k \mathcal{R}^*_k[\Lambda_k]$.

To verify (ii), assume, contrary to what should be proved, that $\mathcal{Q}_T$ is unbounded. Since $\mathcal{Q}_T$ is closed and convex, it must possess a nonzero recessive direction, that is, there should exist a nonzero positive semidefinite matrix $D$ such that $\mathcal{R}_k[D] \preceq 0$ for all $k$. Selecting positive definite matrices $M_k$, the matrices $\mathcal{R}^*_k[M_k]$ are positive semidefinite (see Section 4.3.1), and their sum $S$ is positive definite by (i). We have
$$0 \ge \sum_k \mathrm{Tr}(\mathcal{R}_k[D]M_k) = \sum_k \mathrm{Tr}(D\mathcal{R}^*_k[M_k]) = \mathrm{Tr}(DS),$$
where the first inequality is due to $M_k \succeq 0$, and the first equality is due to (4.26). The resulting inequality is impossible due to $0 \ne D \succeq 0$ and $S \succ 0$, which is the desired contradiction. ✷

4.8.1.2 Noncommutative Khintchine Inequality
We will use a deep result from Functional Analysis (“Noncommutative Khintchine Inequality”) due to Lust-Piquard [175], Pisier [199] and Buchholz [34]; see [228, Theorem 4.6.1]:

Theorem 4.45. Let $Q_i \in \mathbf{S}^n$, $1 \le i \le I$, and let $\xi_i$, $i = 1, ..., I$, be independent Rademacher ($\pm 1$ with probabilities 1/2) or $\mathcal{N}(0,1)$ random variables. Then for all $t \ge 0$ one has
$$\mathrm{Prob}\left\{\Big\|\sum_{i=1}^I \xi_iQ_i\Big\| \ge t\right\} \le 2n\exp\left\{-\frac{t^2}{2v_Q}\right\},$$
where $\|\cdot\|$ is the spectral norm, and $v_Q = \big\|\sum_{i=1}^I Q_i^2\big\|$.

We need the following immediate consequence of the theorem:
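Lemma 4.46. Given spectratope (4.20), let $Q \in \mathbf{S}^n_+$ be such that
$$\mathcal{R}_k[Q] \preceq \rho t_k I_{d_k},\ 1 \le k \le K, \tag{4.134}$$
for some $t \in \mathcal{T}$ and some $\rho \in (0,1]$. Then
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi \notin \mathcal{X}\} \le \min\left[2De^{-\frac{1}{2\rho}},\,1\right],\qquad D := \sum_{k=1}^K d_k.$$

Proof. When setting $\xi = Q^{1/2}\eta$, $\eta \sim \mathcal{N}(0, I_n)$, we have
$$R_k[\xi] = R_k[Q^{1/2}\eta] =: \sum_{i=1}^n \eta_i\bar{R}^{ki} = \bar{R}_k[\eta]$$
with
$$\sum_i [\bar{R}^{ki}]^2 = \mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\{\bar{R}_k^2[\eta]\} = \mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\{R_k^2[\xi]\} = \mathcal{R}_k[Q] \preceq \rho t_k I_{d_k}$$
due to (4.24). Hence, by Theorem 4.45,
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\|R_k[\xi]\|^2 \ge t_k\} = \mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_n)}\{\|\bar{R}_k[\eta]\|^2 \ge t_k\} \le 2d_ke^{-\frac{1}{2\rho}}.$$
We conclude that
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi \notin \mathcal{X}\} \le \mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\exists k : \|R_k[\xi]\|^2 > t_k\} \le 2De^{-\frac{1}{2\rho}}.$$
✷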
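The tail bound of Theorem 4.45, on which the proof above rests, can be checked exactly in a toy case. The following is a minimal sketch, assuming $2 \times 2$ symmetric matrices (so that the spectral norm has a closed form) and Rademacher signs enumerated exhaustively; the helper names `spec_norm_2x2` and `khintchine_check` and the random test matrices are illustrative assumptions, not part of the text.

```python
import itertools
import math
import random


def spec_norm_2x2(a, b, c):
    """Spectral norm of the symmetric matrix [[a, b], [b, c]]."""
    mid = 0.5 * (a + c)
    rad = math.hypot(0.5 * (a - c), b)
    return max(abs(mid + rad), abs(mid - rad))


def khintchine_check(mats, t):
    """Exact tail probability vs. the bound 2n*exp(-t^2/(2 v_Q)) with n = 2.

    mats: list of triples (a, b, c), each encoding Q_i = [[a, b], [b, c]].
    """
    # v_Q = || sum_i Q_i^2 ||, where Q^2 = [[a^2+b^2, b(a+c)], [b(a+c), b^2+c^2]].
    s11 = sum(a * a + b * b for a, b, c in mats)
    s12 = sum(b * (a + c) for a, b, c in mats)
    s22 = sum(b * b + c * c for a, b, c in mats)
    v_q = spec_norm_2x2(s11, s12, s22)
    patterns = list(itertools.product((-1, 1), repeat=len(mats)))
    hits = 0
    for signs in patterns:
        a = sum(s * m[0] for s, m in zip(signs, mats))
        b = sum(s * m[1] for s, m in zip(signs, mats))
        c = sum(s * m[2] for s, m in zip(signs, mats))
        if spec_norm_2x2(a, b, c) >= t:
            hits += 1
    prob = hits / len(patterns)
    bound = 2 * 2 * math.exp(-t * t / (2.0 * v_q))
    return prob, bound


rng = random.Random(0)
mats = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(10)]
results = [khintchine_check(mats, t) for t in (1.0, 3.0, 5.0, 7.0)]
for prob, bound in results:
    print(f"exact tail = {prob:.4f}, bound = {bound:.4f}")
```

For small $t$ the bound exceeds 1 and is vacuous; it becomes informative only once $t^2$ is large compared with $v_Q$, which is the regime used in Lemma 4.46.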
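The ellitopic version of Lemma 4.46 is as follows:

Lemma 4.47. Given ellitope (4.9), let $Q \in \mathbf{S}^n_+$ be such that
$$\mathrm{Tr}(R_kQ) \le \rho t_k,\ 1 \le k \le K, \tag{4.135}$$
for some $t \in \mathcal{T}$ and some $\rho \in (0,1]$. Then
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi \notin \mathcal{X}\} \le 2K\exp\left\{-\frac{1}{3\rho}\right\}.$$

Proof. Observe that if $P \in \mathbf{S}^n_+$ satisfies $\mathrm{Tr}(P) \le 1$, we have
$$\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\left\{\exp\{\tfrac{1}{3}\eta^TP\eta\}\right\} \le \sqrt{3}. \tag{4.136}$$
Indeed, we lose nothing when assuming that $P = \mathrm{Diag}\{\lambda_1, ..., \lambda_n\}$ with $\lambda_i \ge 0$, $\sum_i\lambda_i \le 1$. In this case
$$\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\left\{\exp\{\tfrac{1}{3}\eta^TP\eta\}\right\} = f(\lambda) := \mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\left\{\exp\Big\{\tfrac{1}{3}\sum_i\lambda_i\eta_i^2\Big\}\right\}.$$
Function $f$ is convex, so that its maximum on the simplex $\{\lambda \ge 0 : \sum_i\lambda_i \le 1\}$ is achieved at a vertex, that is,
$$f(\lambda) \le \mathbf{E}_{\eta\sim\mathcal{N}(0,1)}\exp\{\tfrac{1}{3}\eta^2\} = \sqrt{3};$$
(4.136) is proved. Note that (4.136) implies that
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_n)}\{\eta : \eta^TP\eta > s\} < \sqrt{3}\exp\{-s/3\},\ s \ge 0. \tag{4.137}$$
Now let $Q$ and $t$ satisfy the Lemma's premise. Setting $\xi = Q^{1/2}\eta$, $\eta \sim \mathcal{N}(0,I_n)$, for $k \le K$ such that $t_k > 0$ we have
$$\xi^TR_k\xi = \rho t_k\,\eta^TP_k\eta,\quad P_k := [\rho t_k]^{-1}Q^{1/2}R_kQ^{1/2} \succeq 0\ \ \&\ \ \mathrm{Tr}(P_k) = [\rho t_k]^{-1}\mathrm{Tr}(QR_k) \le 1,$$
so that
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi : \xi^TR_k\xi > s\rho t_k\} = \mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_n)}\{\eta^TP_k\eta > s\} < \sqrt{3}\exp\{-s/3\}, \tag{4.138}$$
where the inequality is due to (4.137). Relation (4.138) was established for $k$ with $t_k > 0$; it is trivially true when $t_k = 0$, since in this case $Q^{1/2}R_kQ^{1/2} = 0$ due to $\mathrm{Tr}(QR_k) \le 0$ and $R_k, Q \in \mathbf{S}^n_+$. Setting $s = 1/\rho$, we get from (4.138) that
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q)}\{\xi^TR_k\xi > t_k\} \le \sqrt{3}\exp\left\{-\frac{1}{3\rho}\right\},\ k \le K,$$
and the claim of the lemma follows due to the union bound (recall that $\sqrt{3} < 2$).
✷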
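Inequality (4.136) can be checked in closed form for diagonal $P$: for $\eta \sim \mathcal{N}(0,1)$ and $a < 1/2$ one has $\mathbf{E}\exp\{a\eta^2\} = (1-2a)^{-1/2}$, so the expectation is a finite product. The sketch below evaluates this product at random points of the simplex and at a vertex; the function names and the sampling scheme are illustrative assumptions.

```python
import math
import random


def mgf_quadratic(lams):
    """E exp{(1/3) * sum_i lam_i * eta_i^2} for independent eta_i ~ N(0, 1).

    Uses E exp{a * eta^2} = (1 - 2a)^(-1/2), valid for a < 1/2.
    """
    return math.prod((1.0 - 2.0 * lam / 3.0) ** -0.5 for lam in lams)


def random_simplex_point(n, rng):
    """A point of {lam >= 0, sum(lam) <= 1}, obtained by rescaling uniforms."""
    raw = [rng.random() for _ in range(n)]
    scale = rng.random() / sum(raw)
    return [scale * r for r in raw]


rng = random.Random(1)
values = [mgf_quadratic(random_simplex_point(6, rng)) for _ in range(1000)]
vertex_value = mgf_quadratic([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print(max(values), vertex_value, math.sqrt(3.0))
```

The vertex value equals $\sqrt{3}$, in line with the convexity argument in the proof: the maximum over the simplex is attained at a vertex.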
4.8.1.3 Anderson's Lemma
Below we use a simple-looking, but far from trivial, fact.

Anderson's Lemma [4]. Let $f$ be a nonnegative even ($f(x) \equiv f(-x)$) summable function on $\mathbf{R}^N$ such that the level sets $\{x : f(x) \ge t\}$ are convex for all $t$, and let $X \subset \mathbf{R}^N$ be a closed convex set symmetric w.r.t. the origin. Then for every $y \in \mathbf{R}^N$
$$\int_{X+ty} f(z)\,dz$$
is a nonincreasing function of $t \ge 0$. In particular, if $\zeta$ is a zero mean $N$-dimensional Gaussian random vector, then for every $y \in \mathbf{R}^N$
$$\mathrm{Prob}\{\zeta \notin y + X\} \ge \mathrm{Prob}\{\zeta \notin X\}.$$
Hence, for every norm $\|\cdot\|$ on $\mathbf{R}^N$ it holds
$$\mathrm{Prob}\{\zeta : \|\zeta - y\| > \rho\} \ge \mathrm{Prob}\{\zeta : \|\zeta\| > \rho\}\quad \forall (y \in \mathbf{R}^N,\ \rho \ge 0).$$
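In dimension one the last inequality can be evaluated exactly from the Gaussian CDF. The following minimal sketch (the function name `prob_outside` and the chosen values of $\rho$ and $y$ are illustrative assumptions) shows that shifting the interval $[-\rho, \rho]$ away from the origin can only increase the probability of missing it.

```python
from statistics import NormalDist


def prob_outside(y, rho, sigma=1.0):
    """Prob{|zeta - y| > rho} for zeta ~ N(0, sigma^2), computed from the CDF."""
    cdf = NormalDist(0.0, sigma).cdf
    return 1.0 - (cdf(y + rho) - cdf(y - rho))


rho = 1.5
shifts = [0.0, 0.5, 1.0, 2.0, 4.0]
probs = [prob_outside(y, rho) for y in shifts]
for y, p in zip(shifts, probs):
    print(f"y = {y:3.1f}: Prob(|zeta - y| > rho) = {p:.4f}")
```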
4.8.2 Proof of Proposition 4.6

1°. We need the following:

Lemma 4.48. Let $S$ be a positive semidefinite $N \times N$ matrix with trace $\le 1$ and $\xi$ be an $N$-dimensional Rademacher random vector (i.e., the entries in $\xi$ are independent and take values $\pm 1$ with probabilities 1/2). Then
$$\mathbf{E}\left\{\exp\{\tfrac{1}{3}\xi^TS\xi\}\right\} \le \sqrt{3},$$
implying that
$$\mathrm{Prob}\{\xi^TS\xi > s\} \le \sqrt{3}\exp\{-s/3\},\ s \ge 0.$$

Proof. Let $S = \sum_i \sigma_ih^i[h^i]^T$ be the eigenvalue decomposition of $S$, so that $[h^i]^Th^i = 1$, $\sigma_i \ge 0$, and $\sum_i\sigma_i \le 1$. The function
$$F(\sigma_1, ..., \sigma_N) = \mathbf{E}\left\{e^{\frac{1}{3}\sum_i\sigma_i\xi^Th^i[h^i]^T\xi}\right\}$$
is convex on the simplex $\{\sigma \ge 0,\ \sum_i\sigma_i \le 1\}$ and thus attains its maximum over the simplex at a vertex, implying that for some $f = h^i$, $f^Tf = 1$, it holds
$$\mathbf{E}\{e^{\frac{1}{3}\xi^TS\xi}\} \le \mathbf{E}\{e^{\frac{1}{3}(f^T\xi)^2}\}.$$
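Let $\zeta \sim \mathcal{N}(0,1)$ be independent of $\xi$. We have
$$\begin{array}{rcl}
\mathbf{E}_\xi\left\{\exp\{\tfrac{1}{3}(f^T\xi)^2\}\right\} &=& \mathbf{E}_\xi\left\{\mathbf{E}_\zeta\left\{\exp\{[\sqrt{2/3}f^T\xi]\zeta\}\right\}\right\}\\
&=& \mathbf{E}_\zeta\left\{\mathbf{E}_\xi\left\{\exp\{[\sqrt{2/3}f^T\xi]\zeta\}\right\}\right\} = \mathbf{E}_\zeta\left\{\prod_{j=1}^N\mathbf{E}_\xi\left\{\exp\{\sqrt{2/3}\zeta f_j\xi_j\}\right\}\right\}\\
&=& \mathbf{E}_\zeta\left\{\prod_{j=1}^N\cosh(\sqrt{2/3}\zeta f_j)\right\} \le \mathbf{E}_\zeta\left\{\prod_{j=1}^N\exp\{\tfrac{1}{3}\zeta^2f_j^2\}\right\}\\
&=& \mathbf{E}_\zeta\left\{\exp\{\tfrac{1}{3}\zeta^2\}\right\} = \sqrt{3}.
\end{array}$$
✷

2°. The right inequality in (4.19) has been justified in Section 4.2.3. To prove the left inequality in (4.19), let $\mathbf{T}$ be the closed conic hull of $\mathcal{T}$ (see Section 4.1.1), and let us consider the conic problem
$$\mathrm{Opt}_\# = \max_{Q,t}\left\{\mathrm{Tr}(P^TCPQ) : Q \succeq 0,\ \mathrm{Tr}(QR_k) \le t_k\ \forall k \le K,\ [t;1] \in \mathbf{T}\right\}. \tag{4.139}$$
We claim that
$$\mathrm{Opt} = \mathrm{Opt}_\#. \tag{4.140}$$
Indeed, (4.139) clearly is a strictly feasible and bounded conic problem, so that its optimal value is equal to the optimal value of its conic dual (Conic Duality Theorem). Taking into account that the cone $\mathbf{T}_*$ dual to $\mathbf{T}$ is $\{[g;s] : s \ge \phi_{\mathcal{T}}(-g)\}$ (see Section 4.1.1), we therefore get
$$\begin{array}{rcl}
\mathrm{Opt}_\# &=& \min\limits_{\lambda,[g;s],L}\left\{s : \begin{array}{l}\mathrm{Tr}([\sum_k\lambda_kR_k - L]Q) - \sum_k[\lambda_k + g_k]t_k = \mathrm{Tr}(P^TCPQ)\ \forall (Q,t),\\ \lambda \ge 0,\ L \succeq 0,\ s \ge \phi_{\mathcal{T}}(-g)\end{array}\right\}\\
&=& \min\limits_{\lambda,[g;s],L}\left\{s : \begin{array}{l}\sum_k\lambda_kR_k - L = P^TCP,\ g = -\lambda,\\ \lambda \ge 0,\ L \succeq 0,\ s \ge \phi_{\mathcal{T}}(-g)\end{array}\right\}\\
&=& \min\limits_{\lambda}\left\{\phi_{\mathcal{T}}(\lambda) : \sum_k\lambda_kR_k \succeq P^TCP,\ \lambda \ge 0\right\} = \mathrm{Opt},
\end{array}$$
as claimed.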
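Lemma 4.48, which drives the next step of the proof, can be verified exactly for small $N$ by enumerating all $2^N$ sign vectors. The sketch below does this for a few random positive semidefinite matrices of unit trace; the helper names and the way the test matrices are generated are illustrative assumptions.

```python
import itertools
import math
import random


def rademacher_mgf(S):
    """Exact E exp{(1/3) xi^T S xi} over all sign vectors xi in {-1, 1}^N."""
    n = len(S)
    total = 0.0
    for xi in itertools.product((-1, 1), repeat=n):
        quad = sum(S[i][j] * xi[i] * xi[j] for i in range(n) for j in range(n))
        total += math.exp(quad / 3.0)
    return total / 2 ** n


def random_psd_trace_one(n, rank, rng):
    """S = sum_j g_j g_j^T, rescaled to trace 1 (positive semidefinite by construction)."""
    S = [[0.0] * n for _ in range(n)]
    for _ in range(rank):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        for i in range(n):
            for j in range(n):
                S[i][j] += g[i] * g[j]
    tr = sum(S[i][i] for i in range(n))
    return [[s / tr for s in row] for row in S]


rng = random.Random(2)
mgfs = [rademacher_mgf(random_psd_trace_one(8, 3, rng)) for _ in range(20)]
print(max(mgfs), math.sqrt(3.0))
```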
3°. With Lemma 4.48 and (4.140) at our disposal, we can now complete the proof of Proposition 4.6 by adjusting the technique from [191]. Specifically, problem (4.139) clearly is solvable; let $Q_*, t^*$ be an optimal solution to the problem. Next, let us set $R_* = Q_*^{1/2}$, $\bar{C} = R_*P^TCPR_*$, let $\bar{C} = UDU^T$ be the eigenvalue decomposition of $\bar{C}$, and let $\bar{R}_k = U^TR_*R_kR_*U$. Observe that
$$\begin{array}{rcl}
\mathrm{Tr}(\bar{C}) &=& \mathrm{Tr}(R_*P^TCPR_*) = \mathrm{Tr}(Q_*P^TCP) = \mathrm{Opt}_\# = \mathrm{Opt},\\
\mathrm{Tr}(\bar{R}_k) &=& \mathrm{Tr}(R_*R_kR_*) = \mathrm{Tr}(Q_*R_k) \le t^*_k.
\end{array}$$
Now let $\xi$ be a Rademacher random vector. For $k$ with $t^*_k > 0$, applying Lemma 4.48 to matrices $\bar{R}_k/t^*_k$, we get for $s > 0$
$$\mathrm{Prob}\{\xi^T\bar{R}_k\xi > st^*_k\} \le \sqrt{3}\exp\{-s/3\}; \tag{4.141}$$
if $k$ is such that $t^*_k = 0$, we have $\mathrm{Tr}(\bar{R}_k) = 0$, that is, $\bar{R}_k = 0$ (since $\bar{R}_k \succeq 0$), and (4.141) holds true as well. Now let
$$s_* = 3\ln(\sqrt{3}K),$$
so that $\sqrt{3}\exp\{-s/3\} < 1/K$ when $s > s_*$. The latter relation combines with (4.141) to imply that for every $s > s_*$ there exists a realization $\bar{\xi}$ of $\xi$ such that
$$\bar{\xi}^T\bar{R}_k\bar{\xi} \le st^*_k\ \ \forall k.$$
Let us set $\bar{y} = \frac{1}{\sqrt{s}}R_*U\bar{\xi}$. Then
$$\bar{y}^TR_k\bar{y} = s^{-1}\bar{\xi}^TU^TR_*R_kR_*U\bar{\xi} = s^{-1}\bar{\xi}^T\bar{R}_k\bar{\xi} \le t^*_k\ \ \forall k,$$
implying that $\bar{x} := P\bar{y} \in \mathcal{X}$, and
$$\bar{x}^TC\bar{x} = s^{-1}\bar{\xi}^TU^T\underbrace{R_*P^TCPR_*}_{\bar{C}}U\bar{\xi} = s^{-1}\bar{\xi}^TD\bar{\xi} = s^{-1}\mathrm{Tr}(D) = s^{-1}\mathrm{Tr}(\bar{C}) = s^{-1}\mathrm{Opt}.$$
Thus, $\mathrm{Opt}_* := \max_{x\in\mathcal{X}}x^TCx \ge s^{-1}\mathrm{Opt}$ whenever $s > s_*$, which implies the left inequality in (4.19). ✷

4.8.3 Proof of Proposition 4.8
The proof follows the lines of the proof of Proposition 4.6. First, passing from $C$ to the matrix $\bar{C} = P^TCP$, the situation clearly reduces to the one where $P = I$. To save notation, in the rest of the proof we assume that $P$ is the identity. Second, from Lemma 4.44 and the fact that the level sets of $\phi_{\mathcal{T}}(\cdot)$ on the nonnegative orthant are bounded (since $\mathcal{T}$ contains a positive vector) it immediately follows that problem (4.29) is feasible with bounded level sets of the objective, so that the problem is solvable. The left inequality in (4.30) was proved in Section 4.3.2. Thus, all we need is to prove the right inequality in (4.30).

1°. Let $\mathbf{T}$ be the closed conic hull of $\mathcal{T}$ (see Section 4.1.1). Consider the conic problem
$$\mathrm{Opt}_\# = \max_{Q,t}\left\{\mathrm{Tr}(CQ) : Q \succeq 0,\ \mathcal{R}_k[Q] \preceq t_kI_{d_k}\ \forall k \le K,\ [t;1] \in \mathbf{T}\right\}. \tag{4.142}$$
This problem clearly is strictly feasible; by Lemma 4.44, the feasible set of the problem is bounded, so the problem is solvable. We claim that
$$\mathrm{Opt}_\# = \mathrm{Opt}.$$
Indeed, (4.142) is a strictly feasible and bounded conic problem, so that its optimal value is equal to the one in its conic dual, that is,
$$\begin{array}{rcl}
\mathrm{Opt}_\# &=& \min\limits_{\Lambda=\{\Lambda_k\}_{k\le K},[g;s],L}\left\{s : \begin{array}{l}\mathrm{Tr}([\sum_k\mathcal{R}^*_k[\Lambda_k] - L]Q) - \sum_k[\mathrm{Tr}(\Lambda_k) + g_k]t_k = \mathrm{Tr}(CQ)\ \forall (Q,t),\\ \Lambda_k \succeq 0\ \forall k,\ L \succeq 0,\ s \ge \phi_{\mathcal{T}}(-g)\end{array}\right\}\\
&=& \min\limits_{\Lambda,[g;s],L}\left\{s : \begin{array}{l}\sum_k\mathcal{R}^*_k[\Lambda_k] - L = C,\ g = -\lambda[\Lambda],\\ \Lambda_k \succeq 0\ \forall k,\ L \succeq 0,\ s \ge \phi_{\mathcal{T}}(-g)\end{array}\right\}\\
&=& \min\limits_{\Lambda}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) : \sum_k\mathcal{R}^*_k[\Lambda_k] \succeq C,\ \Lambda_k \succeq 0\ \forall k\right\} = \mathrm{Opt},
\end{array}$$
as claimed.
2°. Problem (4.142), as we already know, is solvable; let $Q_*, t^*$ be an optimal solution to the problem. Next, let us set $R_* = Q_*^{1/2}$, $\widehat{C} = R_*CR_*$, and let $\widehat{C} = UDU^T$ be the eigenvalue decomposition of $\widehat{C}$, so that the matrix $D = U^TR_*CR_*U$ is diagonal, and the trace of this matrix is $\mathrm{Tr}(R_*CR_*) = \mathrm{Tr}(CQ_*) = \mathrm{Opt}_\# = \mathrm{Opt}$. Now let $V = R_*U$, and let $\xi = V\eta$, where $\eta$ is an $n$-dimensional Rademacher random vector (independent entries taking values $\pm 1$ with probabilities 1/2). We have
$$\xi^TC\xi = \eta^T[V^TCV]\eta = \eta^T[U^TR_*CR_*U]\eta = \eta^TD\eta \equiv \mathrm{Tr}(D) = \mathrm{Opt} \tag{4.143}$$
(recall that $D$ is diagonal) and
$$\mathbf{E}_\xi\{\xi\xi^T\} = \mathbf{E}_\eta\{V\eta\eta^TV^T\} = VV^T = R_*UU^TR_* = R_*^2 = Q_*.$$
From the latter relation,
$$\begin{array}{rcl}
\mathbf{E}_\xi\{R_k^2[\xi]\} &=& \mathbf{E}_\xi\{\mathcal{R}_k[\xi\xi^T]\} = \mathcal{R}_k[\mathbf{E}_\xi\{\xi\xi^T\}]\\
&=& \mathcal{R}_k[Q_*] \preceq t^*_kI_{d_k},\ 1 \le k \le K.
\end{array} \tag{4.144}$$
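The two facts behind this construction, namely that $\xi^TC\xi$ is the same for every sign vector $\eta$ and that $\mathbf{E}\{\xi\xi^T\} = VV^T = Q_*$, hold for any positive semidefinite matrix in the role of $Q_*$. The following is a minimal numerical sketch, assuming NumPy is available and using randomly generated stand-ins `Q` and `C` (hypothetical data, not an actual optimal solution of (4.142)).

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 5

# Hypothetical data: a PSD matrix Q (standing in for Q_*) and a symmetric C.
G = rng.standard_normal((n, n))
Q = G @ G.T
C = rng.standard_normal((n, n))
C = 0.5 * (C + C.T)

# R = Q^{1/2} via the eigenvalue decomposition of Q.
w, P = np.linalg.eigh(Q)
R = (P * np.sqrt(np.clip(w, 0.0, None))) @ P.T

# Eigenvalue decomposition R C R = U D U^T, and V = R U.
d, U = np.linalg.eigh(R @ C @ R)
V = R @ U

# xi = V eta for every sign vector eta; the quadratic form xi^T C xi never changes.
quad_forms = np.array([
    (V @ np.array(eta)) @ C @ (V @ np.array(eta))
    for eta in itertools.product((-1.0, 1.0), repeat=n)
])
target = np.trace(C @ Q)
print(quad_forms.min(), quad_forms.max(), target)
print(np.allclose(V @ V.T, Q))
```

The rounding step therefore loses nothing in the objective; all the loss in Proposition 4.8 comes from the rescaling needed to make a realization of $\xi$ feasible.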
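On the other hand, with properly selected symmetric matrices $\bar{R}^{ki}$ we have
$$R_k[Vy] = \sum_i\bar{R}^{ki}y_i$$
identically in $y \in \mathbf{R}^n$, whence
$$\mathbf{E}_\xi\{R_k^2[\xi]\} = \mathbf{E}_\eta\{R_k^2[V\eta]\} = \mathbf{E}_\eta\left\{\Big[\sum_i\eta_i\bar{R}^{ki}\Big]^2\right\} = \sum_{i,j}\mathbf{E}_\eta\{\eta_i\eta_j\}\bar{R}^{ki}\bar{R}^{kj} = \sum_i[\bar{R}^{ki}]^2.$$
This combines with (4.144) to imply that
$$\sum_i[\bar{R}^{ki}]^2 \preceq t^*_kI_{d_k},\ 1 \le k \le K. \tag{4.145}$$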
3°. Let us fix $k \le K$. Assuming $t^*_k > 0$ and applying Theorem 4.45, we derive from (4.145) that
$$\mathrm{Prob}\{\eta : \|\bar{R}_k[\eta]\|^2 > t^*_k/\rho\} < 2d_ke^{-\frac{1}{2\rho}},$$
and recalling the relation between $\xi$ and $\eta$, we arrive at
$$\mathrm{Prob}\{\xi : \|R_k[\xi]\|^2 > t^*_k/\rho\} < 2d_ke^{-\frac{1}{2\rho}}\ \ \forall \rho \in (0,1]. \tag{4.146}$$
Note that when $t^*_k = 0$, (4.145) implies $\bar{R}^{ki} = 0$ for all $i$, so that $R_k[\xi] \equiv \bar{R}_k[\eta] \equiv 0$, and (4.146) holds for those $k$ as well.

Now let us set $\rho = \frac{1}{2\max[\ln(2D),1]}$. For this $\rho$, the sum over $k \le K$ of the right-hand sides in inequalities (4.146) is $\le 1$, implying that there exists a realization $\bar{\xi}$ of $\xi$ such that
$$\|R_k[\bar{\xi}]\|^2 \le t^*_k/\rho\ \ \forall k,$$
or, equivalently,
$$\bar{x} := \rho^{1/2}\bar{\xi} \in \mathcal{X}$$
(recall that $P = I$), implying that
$$\mathrm{Opt}_* := \max_{x\in\mathcal{X}}x^TCx \ge \bar{x}^TC\bar{x} = \rho\,\bar{\xi}^TC\bar{\xi} = \rho\,\mathrm{Opt}$$
(the concluding equality is due to (4.143)), and we arrive at the right inequality in (4.30). ✷

4.8.4 Proof of Lemma 4.17
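1°. Let us verify (4.57). When $Q \succ 0$, passing from variables $(\Theta, \Upsilon)$ in problem (4.56) to the variables $(G = Q^{1/2}\Theta Q^{1/2}, \Upsilon)$, the problem becomes exactly the optimization problem in (4.57), implying that $\mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q]$ when $Q \succ 0$. As is easily seen, both sides in this equality are continuous in $Q \succeq 0$, and (4.57) follows.

2°. Let us prove (4.59). Setting $\zeta = Q^{1/2}\eta$ with $\eta \sim \mathcal{N}(0, I_N)$ and $Z = Q^{1/2}Y$, to justify (4.59) we have to show that when $\kappa \ge 1$ one has
$$\bar{\delta} = \frac{\mathrm{Opt}[Q]}{4\kappa}\ \Rightarrow\ \mathrm{Prob}_\eta\{\|Z^T\eta\| \ge \bar{\delta}\} \ge \beta_\kappa := 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2}, \tag{4.147}$$
where (cf. (4.57))
$$[\mathrm{Opt}[Q] =]\ \overline{\mathrm{Opt}}[Q] := \min_{\Theta,\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon]) + \mathrm{Tr}(\Theta) : \Upsilon_\ell \succeq 0,\ \begin{bmatrix}\Theta & \frac{1}{2}ZM\\ \frac{1}{2}M^TZ^T & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\}. \tag{4.148}$$
Justification of (4.147) is as follows.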
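2.1°. Let us represent $\mathrm{Opt}[Q]$ as the optimal value of a conic problem. Setting
$$\mathbf{K} = \mathrm{cl}\{[r;s] : s > 0,\ r/s \in \mathcal{R}\},$$
we ensure that
$$\mathcal{R} = \{r : [r;1] \in \mathbf{K}\},\quad \mathbf{K}_* = \{[g;s] : s \ge \phi_{\mathcal{R}}(-g)\},$$
where $\mathbf{K}_*$ is the cone dual to $\mathbf{K}$. Consequently, (4.148) reads
$$\mathrm{Opt}[Q] = \min_{\Theta,\Upsilon,\theta}\left\{\theta + \mathrm{Tr}(\Theta) : \begin{array}{ll}\Upsilon_\ell \succeq 0,\ 1 \le \ell \le L & (a)\\ \begin{bmatrix}\Theta & \frac{1}{2}ZM\\ \frac{1}{2}M^TZ^T & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0 & (b)\\ {[-\lambda[\Upsilon];\theta]} \in \mathbf{K}_* & (c)\end{array}\right\}. \tag{P}$$

2.2°. Now let us prove that there exists a matrix $W \in \mathbf{S}^q_+$ and $r \in \mathcal{R}$ such that
$$\mathcal{S}_\ell[W] \preceq r_\ell I_{f_\ell},\ \ell \le L, \tag{4.149}$$
and
$$\mathrm{Opt}[Q] \le \sum_i\sigma_i(ZMW^{1/2}), \tag{4.150}$$
where $\sigma_1(\cdot) \ge \sigma_2(\cdot) \ge ...$ are singular values.

To get the announced result, let us pass from problem (P) to its conic dual. Applying Lemma 4.44 we conclude that (P) is strictly feasible; in addition, (P)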
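clearly is bounded, so that the dual to (P) problem (D) is solvable with optimal value $\mathrm{Opt}[Q]$. Let us build (D). Denoting by $\Lambda_\ell \succeq 0$, $\ell \le L$, $\begin{bmatrix}G & -R\\ -R^T & W\end{bmatrix} \succeq 0$, $[r;\tau] \in \mathbf{K}$ the Lagrange multipliers for the respective constraints in (P), and aggregating these constraints, the multipliers being the aggregation weights, we arrive at the following aggregated constraint:
$$\mathrm{Tr}(\Theta G) + \mathrm{Tr}\Big(W\sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\Big) + \sum_\ell\mathrm{Tr}(\Lambda_\ell\Upsilon_\ell) - \sum_\ell r_\ell\mathrm{Tr}(\Upsilon_\ell) + \theta\tau \ge \mathrm{Tr}(ZMR^T).$$
To get the dual problem, we impose on the Lagrange multipliers, in addition to the initial conic constraints like $\Lambda_\ell \succeq 0$, $1 \le \ell \le L$, the restriction that the left-hand side in the aggregated constraint, identically in $\Theta$, $\Upsilon_\ell$, and $\theta$, is equal to the objective of (P), that is,
$$G = I,\quad \mathcal{S}_\ell[W] + \Lambda_\ell - r_\ell I_{f_\ell} = 0,\ 1 \le \ell \le L,\quad \tau = 1,$$
and maximize, under the resulting restrictions, the right-hand side of the aggregated constraint. After immediate simplifications, we arrive at
$$\mathrm{Opt}[Q] = \max_{W,R,r}\left\{\mathrm{Tr}(ZMR^T) : W \succeq R^TR,\ r \in \mathcal{R},\ \mathcal{S}_\ell[W] \preceq r_\ell I_{f_\ell},\ 1 \le \ell \le L\right\}$$
(note that $r \in \mathcal{R}$ is equivalent to $[r;1] \in \mathbf{K}$, and $W \succeq R^TR$ is the same as $\begin{bmatrix}I & -R\\ -R^T & W\end{bmatrix} \succeq 0$). Now, to say that $R^TR \preceq W$ is exactly the same as to say that $R = SW^{1/2}$ with the spectral norm $\|S\|_{2,2}$ of $S$ not exceeding 1, so that
$$\mathrm{Opt}[Q] = \max_{W,S,r}\Big\{\underbrace{\mathrm{Tr}(ZM[SW^{1/2}]^T)}_{=\mathrm{Tr}([ZMW^{1/2}]S^T)} : W \succeq 0,\ \|S\|_{2,2} \le 1,\ r \in \mathcal{R},\ \mathcal{S}_\ell[W] \preceq r_\ell I_{f_\ell},\ \ell \le L\Big\}.$$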
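We can immediately eliminate the $S$-variable, using the well-known fact that for a $p \times q$ matrix $J$ it holds
$$\max_{S\in\mathbf{R}^{p\times q},\,\|S\|_{2,2}\le 1}\mathrm{Tr}(JS^T) = \|J\|_{\mathrm{Sh},1},$$
where $\|J\|_{\mathrm{Sh},1}$ is the nuclear norm (the sum of singular values) of $J$. We arrive at
$$\mathrm{Opt}[Q] = \max_{W,r}\left\{\|ZMW^{1/2}\|_{\mathrm{Sh},1} : r \in \mathcal{R},\ W \succeq 0,\ \mathcal{S}_\ell[W] \preceq r_\ell I_{f_\ell},\ \ell \le L\right\}.$$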
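The duality between the spectral and the nuclear norm used here is easy to illustrate numerically: if $J = U\,\mathrm{Diag}\{\sigma\}V^T$ is a thin singular value decomposition, then $S = UV^T$ has spectral norm 1 and attains $\mathrm{Tr}(JS^T) = \sum_i\sigma_i$. The following is a minimal sketch, assuming NumPy is available; the random matrix `J` and the random competitors are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 4, 6
J = rng.standard_normal((p, q))

# Thin SVD: J = U diag(sigma) Vt.
U, sigma, Vt = np.linalg.svd(J, full_matrices=False)
nuclear = sigma.sum()

# S = U Vt has all singular values equal to 1 and attains Tr(J S^T) = sum(sigma).
S_opt = U @ Vt
attained = np.trace(J @ S_opt.T)

# Random competitors, rescaled to spectral norm 1, never do better.
trials = []
for _ in range(200):
    S = rng.standard_normal((p, q))
    S /= np.linalg.norm(S, 2)
    trials.append(np.trace(J @ S.T))
best_random = max(trials)
print(nuclear, attained, best_random)
```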
The resulting problem clearly is solvable, and its optimal solution $W$ ensures the target relations (4.149) and (4.150).

2.3°. Given $W$ satisfying (4.149) and (4.150), let $UJV = W^{1/2}M^TZ^T$ be the singular value decomposition of $W^{1/2}M^TZ^T$, so that $U$ and $V$ are, respectively, $q \times q$ and $N \times N$ orthogonal matrices, $J$ is a $q \times N$ matrix with diagonal $\sigma = [\sigma_1; ...; \sigma_p]$, $p = \min[q, N]$, and zero off-diagonal entries; the diagonal entries $\sigma_i$, $1 \le i \le p$, are the singular values of $W^{1/2}M^TZ^T$, or, which is the same, of $ZMW^{1/2}$. Therefore, by (4.150) we have
$$\sum_i\sigma_i \ge \mathrm{Opt}[Q]. \tag{4.151}$$
Now consider the following construction. Let $\eta \sim \mathcal{N}(0, I_N)$; we denote by $\upsilon$ the vector comprised of the first $p$ entries in $V\eta$; note that $\upsilon \sim \mathcal{N}(0, I_p)$, since $V$ is orthogonal. We then augment, if necessary, $\upsilon$ by $q - p$ $\mathcal{N}(0,1)$ random variables independent of each other and of $\eta$ to obtain a $q$-dimensional random vector $\upsilon' \sim \mathcal{N}(0, I_q)$, and set $\chi = U\upsilon'$. Because $U$ is orthogonal we also have $\chi \sim \mathcal{N}(0, I_q)$. Observe that
$$\chi^TW^{1/2}M^TZ^T\eta = \chi^TUJV\eta = [\upsilon']^TJV\eta = \sum_{i=1}^p\sigma_i\upsilon_i^2. \tag{4.152}$$
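To continue we need two simple observations.

(i) One has
$$\alpha := \mathrm{Prob}\left\{\sum_{i=1}^p\sigma_i\upsilon_i^2 < \frac{1}{4}\sum_{i=1}^p\sigma_i\right\} \le \frac{e^{3/8}}{2}. \tag{4.153}$$
Indeed, let $\sigma := \sum_i\sigma_i$; the claim is evident when $\sigma = 0$, so assume $\sigma > 0$, and let us apply the Cramer bounding scheme. Namely, given $\gamma > 0$, consider the random variable
$$\omega = \exp\left\{\frac{1}{4}\gamma\sum_i\sigma_i - \gamma\sum_i\sigma_i\upsilon_i^2\right\}.$$
Note that $\omega > 0$ a.s., and is $> 1$ when $\sum_{i=1}^p\sigma_i\upsilon_i^2 < \frac{1}{4}\sum_{i=1}^p\sigma_i$, so that $\alpha \le \mathbf{E}\{\omega\}$, or, equivalently, thanks to $\upsilon \sim \mathcal{N}(0, I_p)$,
$$\ln(\alpha) \le \ln(\mathbf{E}\{\omega\}) = \frac{1}{4}\gamma\sum_i\sigma_i + \sum_i\ln\mathbf{E}\{\exp\{-\gamma\sigma_i\upsilon_i^2\}\} \le \frac{1}{4}\gamma\sigma - \frac{1}{2}\sum_i\ln(1 + 2\gamma\sigma_i).$$
Function $-\sum_i\ln(1 + 2\gamma\sigma_i)$ is convex in $[\sigma_1; ...; \sigma_p] \ge 0$; therefore, its maximum over the simplex $\{\sigma_i \ge 0,\ i \le p,\ \sum_i\sigma_i = \sigma\}$ is attained at a vertex, and we get
$$\ln(\alpha) \le \frac{1}{4}\gamma\sigma - \frac{1}{2}\ln(1 + 2\gamma\sigma).$$
Minimizing the right-hand side in $\gamma > 0$ (the minimum is attained at $\gamma\sigma = 3/2$ and equals $\frac{3}{8} - \ln 2$), we arrive at (4.153).

(ii) Whenever $\kappa \ge 1$, one has
$$\mathrm{Prob}\{\|MW^{1/2}\chi\|_* > \kappa\} \le 2F\exp\{-\kappa^2/2\}, \tag{4.154}$$
with $F$ given by (4.55).

Indeed, setting $\rho = 1/\kappa^2 \le 1$ and $\omega = \sqrt{\rho}W^{1/2}\chi$, we get $\omega \sim \mathcal{N}(0, \rho W)$. Let us apply Lemma 4.46 to $Q = \rho W$, $\mathcal{R}$ in the role of $\mathcal{T}$, $L$ in the role of $K$, and $\mathcal{S}_\ell[\cdot]$ in the role of $\mathcal{R}_k[\cdot]$. Denoting
$$\mathcal{Y} := \{y : \exists r \in \mathcal{R} : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell \le L\},$$
we have $\mathcal{S}_\ell[Q] = \rho\mathcal{S}_\ell[W] \preceq \rho r_\ell I_{f_\ell}$, $\ell \le L$, with $r \in \mathcal{R}$ (see (4.149)), so we are under the premise of Lemma 4.46 (with $\mathcal{Y}$ in the role of $\mathcal{X}$ and thus with $F$ in the role of $D$). Applying the lemma, we conclude that
$$\mathrm{Prob}\left\{\chi : \kappa^{-1}W^{1/2}\chi \notin \mathcal{Y}\right\} \le 2F\exp\{-1/(2\rho)\} = 2F\exp\{-\kappa^2/2\}.$$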
Recalling that $\mathcal{B}_* = M\mathcal{Y}$, we see that $\mathrm{Prob}\{\chi : \kappa^{-1}MW^{1/2}\chi \notin \mathcal{B}_*\}$ is indeed upper-bounded by the right-hand side of (4.154), and (4.154) follows.

2.4°. Now, for $\kappa \ge 1$, let
$$E_\kappa = \left\{(\chi, \eta) : \|MW^{1/2}\chi\|_* \le \kappa,\ \sum_i\sigma_i\upsilon_i^2 \ge \frac{1}{4}\sum_i\sigma_i\right\},$$
and let $E_\kappa^+ = \{\eta : \exists\chi : (\chi, \eta) \in E_\kappa\}$. For $\eta \in E_\kappa^+$ there exists $\chi$ such that $(\chi, \eta) \in E_\kappa$, leading to
$$\kappa\|Z^T\eta\| \ge \|MW^{1/2}\chi\|_*\|Z^T\eta\| \ge \chi^TW^{1/2}M^TZ^T\eta = \sum_i\sigma_i\upsilon_i^2 \ge \frac{1}{4}\sum_i\sigma_i \ge \frac{1}{4}\mathrm{Opt}[Q]$$
(we have used (4.152) and (4.151)). Thus,
$$\eta \in E_\kappa^+\ \Rightarrow\ \|Z^T\eta\| \ge \frac{\mathrm{Opt}[Q]}{4\kappa}.$$
On the other hand, due to (4.153) and (4.154), for our random $(\chi, \eta)$ it holds
$$\mathrm{Prob}\{E_\kappa\} \ge 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2} = \beta_\kappa,$$
and the marginal distribution of $\eta$ is $\mathcal{N}(0, I_N)$, implying that
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,I_N)}\{\eta \in E_\kappa^+\} \ge \beta_\kappa.$$
(4.147) is proved.

3°. As was explained in the beginning of item 2°, (4.147) is exactly the same as (4.59). The latter relation clearly implies (4.60) which, in turn, implies the right inequality in (4.58). ✷

4.8.5 Proofs of Propositions 4.5, 4.16 and 4.19
Below, we focus on the proof of Proposition 4.16; Propositions 4.5 and 4.19 will be derived from it in Sections 4.8.5.2, 4.8.6.2, respectively.

4.8.5.1 Proof of Proposition 4.16
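In what follows, we use the assumptions and the notation of Proposition 4.16.

1°. Let
$$\Phi(H, \Lambda, \Upsilon, \Upsilon', \Theta; Q) = \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(Q\Theta) : \mathcal{M} \times \Pi \to \mathbf{R},$$
where
$$\mathcal{M} = \left\{(H, \Lambda, \Upsilon, \Upsilon', \Theta) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0,\ \ell \le L\},\\ \begin{bmatrix}\sum_k\mathcal{R}^*_k[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell\mathcal{S}^*_\ell[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}.$$
Looking at (4.42), we see immediately that the optimal value Opt in (4.42) is nothing but
$$\mathrm{Opt} = \min_{(H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}}\left[\overline{\Phi}(H, \Lambda, \Upsilon, \Upsilon', \Theta) := \max_{Q\in\Pi}\Phi(H, \Lambda, \Upsilon, \Upsilon', \Theta; Q)\right]. \tag{4.155}$$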
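Note that sets $\mathcal{M}$ and $\Pi$ are closed and convex, $\Pi$ is compact, and $\Phi$ is a continuous convex-concave function on $\mathcal{M} \times \Pi$. In view of these observations, the fact that $\Pi \subset \mathrm{int}\,\mathbf{S}^m_+$ combines with the Sion-Kakutani Theorem to imply that $\Phi$ possesses a saddle point $(H_*, \Lambda_*, \Upsilon_*, \Upsilon'_*, \Theta_*; Q_*)$ (min in $(H, \Lambda, \Upsilon, \Upsilon', \Theta)$, max in $Q$) on $\mathcal{M} \times \Pi$, whence Opt is the saddle point value of $\Phi$ by (4.155). We conclude that for properly selected $Q_* \in \Pi$ it holds
$$\begin{array}{rcl}
\mathrm{Opt} &=& \min\limits_{(H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}}\Phi(H, \Lambda, \Upsilon, \Upsilon', \Theta; Q_*)\\
&=& \min\limits_{H,\Lambda,\Upsilon,\Upsilon',\Theta}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(Q_*\Theta) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0,\ \ell \le L\},\\ \begin{bmatrix}\sum_k\mathcal{R}^*_k[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell\mathcal{S}^*_\ell[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \min\limits_{H,\Lambda,\Upsilon,\Upsilon',G}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(G) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0,\ \ell \le L\},\\ \begin{bmatrix}\sum_k\mathcal{R}^*_k[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\ \begin{bmatrix}G & \frac{1}{2}Q_*^{1/2}HM\\ \frac{1}{2}M^TH^TQ_*^{1/2} & \sum_\ell\mathcal{S}^*_\ell[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \min\limits_{H,\Lambda,\Upsilon}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\\ \begin{bmatrix}\sum_k\mathcal{R}^*_k[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\end{array}\right\},
\end{array} \tag{4.156}$$
where
$$\Psi(H) := \min_{G,\Upsilon'}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(G) : \Upsilon' = \{\Upsilon'_\ell \succeq 0,\ \ell \le L\},\ \begin{bmatrix}G & \frac{1}{2}Q_*^{1/2}HM\\ \frac{1}{2}M^TH^TQ_*^{1/2} & \sum_\ell\mathcal{S}^*_\ell[\Upsilon'_\ell]\end{bmatrix} \succeq 0\right\},$$
and Opt is given by (4.42), and the equalities are due to (4.56) and (4.57).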
From now on we assume that the noise $\xi$ in observation (4.31) is $\xi \sim \mathcal{N}(0, Q_*)$. We also assume that $B \ne 0$, since otherwise the conclusion of Proposition 4.16 is evident.

2°. $\epsilon$-risk. In Proposition 4.16, we are speaking about the $\|\cdot\|$-risk of an estimate, that is, the maximal, over signals $x \in \mathcal{X}$, expected norm $\|\cdot\|$ of the error of recovering $Bx$; what we need to prove is that the minimax optimal risk $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]$ as given by (4.53) can be lower-bounded by a quantity “of order of” Opt. To this end, of course, it suffices to build such a lower bound for the quantity
$$\mathrm{RiskOpt}_{\|\cdot\|} := \inf_{\widehat{x}(\cdot)}\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,Q_*)}\{\|Bx - \widehat{x}(Ax + \xi)\|\},$$
since this quantity is a lower bound on $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}$. Technically, it is more convenient to work with the $\epsilon$-risk defined in terms of “$\|\cdot\|$-confidence intervals” rather than in terms of the expected norm of the error. Specifically, in the sequel we will heavily use the minimax $\epsilon$-risk defined as
$$\mathrm{RiskOpt}_\epsilon = \inf_{\widehat{x},\rho}\left\{\rho : \mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q_*)}\{\|Bx - \widehat{x}(Ax + \xi)\| > \rho\} \le \epsilon\ \ \forall x \in \mathcal{X}\right\},$$
where $\widehat{x}$ in the infimum runs through the set of all Borel estimates. When $\epsilon \in (0,1)$ is once and forever fixed (in the sequel, we use $\epsilon = \frac{1}{8}$) we can use the $\epsilon$-risk to lower-bound $\mathrm{RiskOpt}_{\|\cdot\|}$, since by evident reasons
$$\mathrm{RiskOpt}_{\|\cdot\|} \ge \epsilon\cdot\mathrm{RiskOpt}_\epsilon. \tag{4.157}$$
Consequently, all we need in order to prove Proposition 4.16 is to lower-bound $\mathrm{RiskOpt}_{\frac{1}{8}}$ by a “not too small” multiple of Opt, and this is our current objective.

3°. Let $W$ be a positive semidefinite $n \times n$ matrix, let $\eta \sim \mathcal{N}(0, W)$ be a random signal, and let $\xi \sim \mathcal{N}(0, Q_*)$ be independent of $\eta$; vectors $(\eta, \xi)$ induce the random vector
$$\omega = A\eta + \xi \sim \mathcal{N}(0, AWA^T + Q_*).$$
Consider the Bayesian version of the estimation problem where given $\omega$ we are interested in recovering $B\eta$. Recall that, because $[\omega; B\eta]$ is zero mean Gaussian, the conditional expectation $\mathbf{E}_{|\omega}\{B\eta\}$ of $B\eta$ given $\omega$ is linear in $\omega$: $\mathbf{E}_{|\omega}\{B\eta\} = \bar{H}^T\omega$ for some $\bar{H}$ depending on $W$ only.$^{24}$ Therefore, denoting by $P_{|\omega}$ the conditional probability distribution given $\omega$, for any $\rho > 0$ and estimate $\widehat{x}(\cdot)$ one has
$$\begin{array}{rcl}
\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \widehat{x}(A\eta + \xi)\| \ge \rho\} &=& \mathbf{E}_\omega\left\{\mathrm{Prob}_{|\omega}\{\|B\eta - \widehat{x}(\omega)\| \ge \rho\}\right\}\\
&\ge& \mathbf{E}_\omega\left\{\mathrm{Prob}_{|\omega}\{\|B\eta - \mathbf{E}_{|\omega}\{B\eta\}\| \ge \rho\}\right\} = \mathrm{Prob}_{\eta,\xi}\{\|B\eta - \bar{H}^T(A\eta + \xi)\| \ge \rho\},
\end{array}$$
with the inequality given by the Anderson Lemma as applied to the shift of the Gaussian distribution $P_{|\omega}$ by its mean. Applying the Anderson Lemma again we get
$$\begin{array}{rcl}
\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \bar{H}^T(A\eta + \xi)\| \ge \rho\} &=& \mathbf{E}_\xi\left\{\mathrm{Prob}_\eta\{\|(B - \bar{H}^TA)\eta - \bar{H}^T\xi\| \ge \rho\}\right\}\\
&\ge& \mathrm{Prob}_\eta\{\|(B - \bar{H}^TA)\eta\| \ge \rho\},
\end{array}$$
and, by “symmetric” reasoning,
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \bar{H}^T(A\eta + \xi)\| \ge \rho\} \ge \mathrm{Prob}_\xi\{\|\bar{H}^T\xi\| \ge \rho\}.$$
We conclude that for any $\widehat{x}(\cdot)$
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \widehat{x}(\omega)\| \ge \rho\} \ge \max\left[\mathrm{Prob}_\eta\{\|(B - \bar{H}^TA)\eta\| \ge \rho\},\ \mathrm{Prob}_\xi\{\|\bar{H}^T\xi\| \ge \rho\}\right]. \tag{4.158}$$

$^{24}$ We have used the following standard fact [172]: let $\zeta = [\omega; \eta] \sim \mathcal{N}(0, S)$, the covariance matrix of the marginal distribution of $\omega$ being nonsingular. Then the conditional distribution of $\eta$ given $\omega$ is Gaussian with the mean linearly depending on $\omega$ and covariance matrix independent of $\omega$.
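4°. Let $H$ be an $m \times \nu$ matrix. Applying Lemma 4.17 to $N = m$, $Y = \bar{H}$, $Q = Q_*$, we get from (4.59)
$$\mathrm{Prob}_{\xi\sim\mathcal{N}(0,Q_*)}\{\|\bar{H}^T\xi\| \ge [4\kappa]^{-1}\Psi(\bar{H})\} \ge \beta_\kappa\ \ \forall\kappa \ge 1, \tag{4.159}$$
where $\Psi(H)$ is defined by (4.156). Similarly, applying Lemma 4.17 to $N = n$, $Y = (B - \bar{H}^TA)^T$, $Q = W$, we obtain
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,W)}\{\|(B - \bar{H}^TA)\eta\| \ge [4\kappa]^{-1}\Phi(W, \bar{H})\} \ge \beta_\kappa\ \ \forall\kappa \ge 1, \tag{4.160}$$
where
$$\Phi(W, H) = \min_{\Upsilon=\{\Upsilon_\ell,\ell\le L\},\Theta}\left\{\mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \Upsilon_\ell \succeq 0\ \forall\ell,\ \begin{bmatrix}\Theta & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\}. \tag{4.161}$$
Let us put $\rho(W, \bar{H}) = [8\kappa]^{-1}[\Psi(\bar{H}) + \Phi(W, \bar{H})]$; when combining (4.160) with (4.159) we conclude that
$$\max\left[\mathrm{Prob}_\eta\{\|(B - \bar{H}^TA)\eta\| \ge \rho(W, \bar{H})\},\ \mathrm{Prob}_\xi\{\|\bar{H}^T\xi\| \ge \rho(W, \bar{H})\}\right] \ge \beta_\kappa,$$
and the same inequality holds if $\rho(W, \bar{H})$ is replaced with the smaller quantity
$$\bar{\rho}(W) = [8\kappa]^{-1}\inf_H[\Psi(H) + \Phi(W, H)].$$
Now, the latter bound combines with (4.158) to imply the following result:

Lemma 4.49. Let $W$ be a positive semidefinite $n \times n$ matrix, and $\kappa \ge 1$. Then for any estimate $\widehat{x}(\cdot)$ of $B\eta$ given observation $\omega = A\eta + \xi$, where $\eta \sim \mathcal{N}(0, W)$ is independent of $\xi \sim \mathcal{N}(0, Q_*)$, one has
$$\mathrm{Prob}_{\eta,\xi}\left\{\|B\eta - \widehat{x}(\omega)\| \ge [8\kappa]^{-1}\inf_H[\Psi(H) + \Phi(W, H)]\right\} \ge \beta_\kappa = 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2},$$
where $\Psi(H)$ and $\Phi(W, H)$ are defined, respectively, by (4.156) and (4.161).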
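In particular, for
$$\kappa = \bar{\kappa} := \sqrt{2\ln F + 10\ln 2} \tag{4.162}$$
it holds
$$\mathrm{Prob}_{\eta,\xi}\left\{\|B\eta - \widehat{x}(\omega)\| \ge [8\bar{\kappa}]^{-1}\inf_H[\Psi(H) + \Phi(W, H)]\right\} > \frac{3}{16}.$$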
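The constant $3/16$ follows from plain arithmetic: with $\kappa = \bar{\kappa}$ from (4.162) one has $2Fe^{-\bar{\kappa}^2/2} = 2F\cdot\frac{1}{32F} = \frac{1}{16}$, so that $\beta_{\bar{\kappa}} = 1 - \frac{e^{3/8}}{2} - \frac{1}{16} \approx 0.21$ regardless of $F$. A short numerical check (the function names and the sample values of $F$ are illustrative assumptions):

```python
import math


def beta(kappa, F):
    """beta_kappa = 1 - e^{3/8}/2 - 2 F exp(-kappa^2 / 2)."""
    return 1.0 - math.exp(3.0 / 8.0) / 2.0 - 2.0 * F * math.exp(-kappa ** 2 / 2.0)


def kappa_bar(F):
    """The choice (4.162): sqrt(2 ln F + 10 ln 2)."""
    return math.sqrt(2.0 * math.log(F) + 10.0 * math.log(2.0))


sizes = [1, 2, 10, 1000, 10 ** 6]
betas = [beta(kappa_bar(F), F) for F in sizes]
for F, b in zip(sizes, betas):
    print(f"F = {F:>7}: kappa_bar = {kappa_bar(F):.4f}, beta = {b:.6f}")
```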
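5°. For $0 < \kappa \le 1$, let us set
$$\begin{array}{ll}
(a) & \mathcal{W}_\kappa = \{W \in \mathbf{S}^n_+ : \exists t \in \mathcal{T} : \mathcal{R}_k[W] \preceq \kappa t_kI_{d_k},\ 1 \le k \le K\},\\
(b) & \mathcal{Z} = \left\{(\Upsilon = \{\Upsilon_\ell,\ \ell \le L\}, \Theta, H) : \Upsilon_\ell \succeq 0\ \forall\ell,\ \begin{bmatrix}\Theta & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\}.
\end{array}$$
Note that $\mathcal{W}_\kappa$ is a nonempty convex and compact (by Lemma 4.44) set such that $\mathcal{W}_\kappa = \kappa\mathcal{W}_1$, and $\mathcal{Z}$ is a nonempty closed convex set. Consider the parametric saddle point problem
$$\mathrm{Opt}(\kappa) = \max_{W\in\mathcal{W}_\kappa}\inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\left[E(W; \Upsilon, \Theta, H) := \mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H)\right]. \tag{4.163}$$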
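This problem is convex-concave; utilizing the fact that $\mathcal{W}_\kappa$ is compact and contains positive definite matrices, it is immediately seen that the Sion-Kakutani theorem ensures the existence of a saddle point whenever $\kappa \in (0,1]$. We claim that
$$0 < \kappa \le 1\ \Rightarrow\ \mathrm{Opt}(\kappa) \ge \sqrt{\kappa}\,\mathrm{Opt}(1). \tag{4.164}$$
Indeed, $\mathcal{Z}$ is invariant w.r.t. scalings
$$(\Upsilon = \{\Upsilon_\ell,\ \ell \le L\}, \Theta, H) \mapsto (\theta\Upsilon := \{\theta\Upsilon_\ell,\ \ell \le L\}, \theta^{-1}\Theta, H)\qquad [\theta > 0].$$
When taking into account that $\phi_{\mathcal{R}}(\lambda[\theta\Upsilon]) = \theta\phi_{\mathcal{R}}(\lambda[\Upsilon])$, we get
$$\begin{array}{rcl}
E(W) &:=& \inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}E(W; \Upsilon, \Theta, H) = \inf\limits_{\theta>0}\inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}E(W; \theta\Upsilon, \theta^{-1}\Theta, H)\\
&=& \inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\left[2\sqrt{\mathrm{Tr}(W\Theta)\phi_{\mathcal{R}}(\lambda[\Upsilon])} + \Psi(H)\right].
\end{array}$$
Because $\Psi$ is nonnegative we conclude that whenever $W \succeq 0$ and $\kappa \in (0,1]$, one has
$$E(\kappa W) \ge \sqrt{\kappa}E(W).$$
This combines with $\mathcal{W}_\kappa = \kappa\mathcal{W}_1$ to imply that
$$\mathrm{Opt}(\kappa) = \max_{W\in\mathcal{W}_\kappa}E(W) = \max_{W\in\mathcal{W}_1}E(\kappa W) \ge \sqrt{\kappa}\max_{W\in\mathcal{W}_1}E(W) = \sqrt{\kappa}\,\mathrm{Opt}(1),$$
and (4.164) follows.

6°. We claim that
$$\mathrm{Opt}(1) = \mathrm{Opt}, \tag{4.165}$$
where Opt is given by (4.42) (and, as we have seen, by (4.156) as well). Note that (4.165) combines with (4.164) to imply that
$$0 < \kappa \le 1\ \Rightarrow\ \mathrm{Opt}(\kappa) \ge \sqrt{\kappa}\,\mathrm{Opt}. \tag{4.166}$$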
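Verification of (4.165) is given by the following computation. By the Sion-Kakutani Theorem,
$$\begin{array}{rcl}
\mathrm{Opt}(1) &=& \max\limits_{W\in\mathcal{W}_1}\inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\left\{\mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H)\right\}\\
&=& \inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\max\limits_{W\in\mathcal{W}_1}\left\{\mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H)\right\}\\
&=& \inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\left\{\Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \max\limits_W\left\{\mathrm{Tr}(\Theta W) : W \succeq 0,\ \exists t \in \mathcal{T} : \mathcal{R}_k[W] \preceq t_kI_{d_k},\ k \le K\right\}\right\}\\
&=& \inf\limits_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\left\{\Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \max\limits_{W,t}\left\{\mathrm{Tr}(\Theta W) : W \succeq 0,\ [t;1] \in \mathbf{T},\ \mathcal{R}_k[W] \preceq t_kI_{d_k},\ k \le K\right\}\right\},
\end{array}$$
where $\mathbf{T}$ is the closed conic hull of $\mathcal{T}$. On the other hand, using Conic Duality combined with the fact that $\mathbf{T}_* = \{[g;s] : s \ge \phi_{\mathcal{T}}(-g)\}$ we obtain
$$\begin{array}{l}
\max\limits_{W,t}\left\{\mathrm{Tr}(\Theta W) : W \succeq 0,\ [t;1] \in \mathbf{T},\ \mathcal{R}_k[W] \preceq t_kI_{d_k},\ k \le K\right\}\\
\quad = \min\limits_{Z,[g;s],\Lambda=\{\Lambda_k\}}\left\{s : \begin{array}{l}Z \succeq 0,\ [g;s] \in \mathbf{T}_*,\ \Lambda_k \succeq 0,\ k \le K,\\ -\mathrm{Tr}(ZW) - g^Tt + \sum_k\mathrm{Tr}(\mathcal{R}^*_k[\Lambda_k]W) - \sum_kt_k\mathrm{Tr}(\Lambda_k) = \mathrm{Tr}(\Theta W)\\ \forall (W \in \mathbf{S}^n,\ t \in \mathbf{R}^K)\end{array}\right\}\\
\quad = \min\limits_{Z,[g;s],\Lambda=\{\Lambda_k\}}\left\{s : \begin{array}{l}Z \succeq 0,\ s \ge \phi_{\mathcal{T}}(-g),\ \Lambda_k \succeq 0,\ k \le K,\\ \Theta = \sum_k\mathcal{R}^*_k[\Lambda_k] - Z,\ g = -\lambda[\Lambda]\end{array}\right\}\\
\quad = \min\limits_{\Lambda}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) : \Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\ \Theta \preceq \sum_k\mathcal{R}^*_k[\Lambda_k]\right\},
\end{array}$$
and we arrive at
$$\begin{array}{rcl}
\mathrm{Opt}(1) &=& \inf\limits_{\Upsilon,\Theta,H,\Lambda}\left\{\Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{T}}(\lambda[\Lambda]) : \begin{array}{l}\Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\\ \Theta \preceq \sum_k\mathcal{R}^*_k[\Lambda_k],\\ \begin{bmatrix}\Theta & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \inf\limits_{\Upsilon,H,\Lambda}\left\{\Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{T}}(\lambda[\Lambda]) : \begin{array}{l}\Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \Lambda = \{\Lambda_k \succeq 0,\ k \le K\},\\ \begin{bmatrix}\sum_k\mathcal{R}^*_k[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \mathrm{Opt}\quad [\text{see (4.156)}].
\end{array}$$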
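7°. Now we can complete the proof. For $\kappa \in (0,1]$, let $W_\kappa$ be the $W$-component of a saddle point solution to the saddle point problem (4.163). Then, by (4.166),
$$\sqrt{\kappa}\,\mathrm{Opt} \le \mathrm{Opt}(\kappa) = \inf_{(\Upsilon,\Theta,H)\in\mathcal{Z}}\left\{\mathrm{Tr}(W_\kappa\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H)\right\} = \inf_H\left\{\Phi(W_\kappa, H) + \Psi(H)\right\}. \tag{4.167}$$
On the other hand, when applying Lemma 4.46 to $Q = W_\kappa$ and $\rho = \kappa$, we obtain, in view of relations $0 < \kappa \le 1$, $W_\kappa \in \mathcal{W}_\kappa$,
$$\delta(\kappa) := \mathrm{Prob}_{\zeta\sim\mathcal{N}(0,I_n)}\{W_\kappa^{1/2}\zeta \notin \mathcal{X}\} \le 2De^{-\frac{1}{2\kappa}}, \tag{4.168}$$
with $D$ given by (4.55). In particular, when setting
$$\bar{\kappa} = \frac{1}{2\ln D + 10\ln 2} \tag{4.169}$$
we obtain $\delta(\bar{\kappa}) \le 1/16$. Therefore,
$$\mathrm{Prob}_{\eta\sim\mathcal{N}(0,W_{\bar{\kappa}})}\{\eta \notin \mathcal{X}\} \le \frac{1}{16}. \tag{4.170}$$
Now let
$$\varrho_* := \frac{\mathrm{Opt}}{8\sqrt{(2\ln F + 10\ln 2)(2\ln D + 10\ln 2)}}. \tag{4.171}$$
All we need in order to achieve our goal of justifying (4.54) is to show that
$$\mathrm{RiskOpt}_{\frac{1}{8}} \ge \varrho_*, \tag{4.172}$$
since given the latter relation, (4.54) will be immediately given by (4.157) as applied with $\epsilon = \frac{1}{8}$.

To prove (4.172), assume, contrary to what should be proved, that the $\frac{1}{8}$-risk is $< \varrho_*$, and let $\bar{x}(\cdot)$ be an estimate with $\frac{1}{8}$-risk $\varrho' < \varrho_*$. We can utilize $\bar{x}$ to estimate $B\eta$, in the Bayesian problem of recovering $B\eta$ from observation $\omega = A\eta + \xi$, $(\eta, \xi) \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = \mathrm{Diag}\{W_{\bar{\kappa}}, Q_*\}$. From (4.170) we conclude that
$$\begin{array}{l}
\mathrm{Prob}_{(\eta,\xi)\sim\mathcal{N}(0,\Sigma)}\{\|B\eta - \bar{x}(A\eta + \xi)\| > \varrho'\}\\
\quad \le \mathrm{Prob}_{(\eta,\xi)\sim\mathcal{N}(0,\Sigma)}\{\|B\eta - \bar{x}(A\eta + \xi)\| > \varrho',\ \eta \in \mathcal{X}\} + \mathrm{Prob}_{\eta\sim\mathcal{N}(0,W_{\bar{\kappa}})}\{\eta \notin \mathcal{X}\} \le \frac{1}{8} + \frac{1}{16} = \frac{3}{16}.
\end{array} \tag{4.173}$$
On the other hand, by (4.167), applied with $\kappa = \bar{\kappa}$ given by (4.169), we have
$$\inf_H[\Phi(W_{\bar{\kappa}}, H) + \Psi(H)] = \mathrm{Opt}(\bar{\kappa}) \ge \sqrt{\bar{\kappa}}\,\mathrm{Opt} = [8\bar{\kappa}]\varrho_*,$$
where $\bar{\kappa}$ in the factor $[8\bar{\kappa}]$ is given by (4.162), while $\bar{\kappa}$ under the square root is given by (4.169). Thus, by Lemma 4.49, for any estimate $\widehat{x}(\cdot)$ of $B\eta$ via observation $\omega = A\eta + \xi$ it holds
$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \widehat{x}(A\eta + \xi)\| \ge \varrho_*\} \ge \beta_{\bar{\kappa}} > 3/16;$$
in particular, this relation should hold true for $\widehat{x}(\cdot) \equiv \bar{x}(\cdot)$, but the latter is impossible: the $\frac{3}{16}$-risk of $\bar{x}$ is $\le \varrho' < \varrho_*$; see (4.173). ✷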
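The constants in item 7° fit together by elementary arithmetic, which the following minimal sketch reproduces. The values of `D`, `F` and `opt` are hypothetical, and the function names are illustrative; the two different quantities denoted $\bar{\kappa}$ in the text appear here as `kappa_small` (from (4.169)) and `kappa_large` (from (4.162)).

```python
import math

LN2 = math.log(2.0)


def kappa_small(D):
    """The choice (4.169): 1 / (2 ln D + 10 ln 2), a number in (0, 1]."""
    return 1.0 / (2.0 * math.log(D) + 10.0 * LN2)


def kappa_large(F):
    """The choice (4.162): sqrt(2 ln F + 10 ln 2), a number >= 1."""
    return math.sqrt(2.0 * math.log(F) + 10.0 * LN2)


def rho_star(opt, D, F):
    """The threshold (4.171)."""
    return opt / (8.0 * math.sqrt((2.0 * math.log(F) + 10.0 * LN2)
                                  * (2.0 * math.log(D) + 10.0 * LN2)))


D, F, opt = 50, 20, 1.0  # hypothetical sizes and optimal value

# Right-hand side of (4.168) at kappa = kappa_small(D): equals 1/16.
delta = 2.0 * D * math.exp(-1.0 / (2.0 * kappa_small(D)))

# The identity sqrt(kappa_small) * Opt = 8 * kappa_large * rho_star used with Lemma 4.49.
lhs = math.sqrt(kappa_small(D)) * opt
rhs = 8.0 * kappa_large(F) * rho_star(opt, D, F)

# Combining (4.172) with (4.157) at eps = 1/8 gives Opt <= factor * RiskOpt.
factor = opt / (rho_star(opt, D, F) / 8.0)
print(delta, lhs, rhs, factor)
```

The resulting factor is $64\sqrt{(2\ln F + 10\ln 2)(2\ln D + 10\ln 2)}$, which grows only logarithmically in the spectratopic sizes $D$ and $F$.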
4.8.5.2 Proof of Proposition 4.5
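We shall extract Proposition 4.5 from the following result, meaningful in its own right (it can be considered as an “ellitopic refinement” of Proposition 4.16):

Proposition 4.50. Consider the recovery of the linear image $Bx \in \mathbf{R}^\nu$ of an unknown signal $x$ known to belong to a given signal set $\mathcal{X} \subset \mathbf{R}^n$ from noisy observation
$$\omega = Ax + \xi \in \mathbf{R}^m\qquad [\xi \sim \mathcal{N}(0, \Gamma),\ \Gamma \succ 0],$$
the recovery error being measured in norm $\|\cdot\|$ on $\mathbf{R}^\nu$. Assume that $\mathcal{X}$ and the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ are ellitopes:
$$\begin{array}{rcl}
\mathcal{X} &=& \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^TR_kx \le t_k,\ k \le K\},\\
\mathcal{B}_* &=& \{u \in \mathbf{R}^\nu : \exists (r \in \mathcal{R},\ y) : u = My,\ y^TS_\ell y \le r_\ell,\ \ell \le L\},
\end{array} \tag{4.174}$$
with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$, $R_k$ and $S_\ell$ (as always, we lose nothing when assuming that the ellitope $\mathcal{X}$ is basic). Consider the optimization problem
$$\mathrm{Opt}_\# = \min_{\Theta,H,\lambda,\mu,\mu'}\left\{\phi_{\mathcal{T}}(\lambda) + \phi_{\mathcal{R}}(\mu) + \phi_{\mathcal{R}}(\mu') + \mathrm{Tr}(\Gamma\Theta) : \begin{array}{l}\lambda \ge 0,\ \mu \ge 0,\ \mu' \ge 0,\\ \begin{bmatrix}\sum_k\lambda_kR_k & \frac{1}{2}[B - H^TA]^TM\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell\mu_\ell S_\ell\end{bmatrix} \succeq 0,\\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell\mu'_\ell S_\ell\end{bmatrix} \succeq 0\end{array}\right\}. \tag{4.175}$$
The problem is solvable, and the linear estimate $\widehat{x}_{H_*}(\omega) = H_*^T\omega$ yielded by the $H$-component of an optimal solution to the problem satisfies the risk bound
$$\mathrm{Risk}_{\Gamma,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}] := \max_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}_{H_*}(Ax + \xi)\|\} \le \mathrm{Opt}_\#.$$
Furthermore, the estimate $\widehat{x}_{H_*}(\cdot)$ is near-optimal:
$$\mathrm{Opt}_\# \le 64\sqrt{(3\ln K + 15\ln 2)(3\ln L + 15\ln 2)}\,\mathrm{RiskOpt}, \tag{4.176}$$
where RiskOpt is the minimax optimal risk
$$\mathrm{RiskOpt} = \inf_{\widehat{x}}\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}(Ax + \xi)\|\},$$
the infimum being taken w.r.t. all estimates.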
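Proposition 4.50 ⇒ Proposition 4.5: Clearly, the situation considered in Proposition 4.5 is a particular case of the setting of Proposition 4.50, namely, the case where $\mathcal{B}_*$ is the standard Euclidean ball,
$$\mathcal{B}_* = \{u \in \mathbf{R}^\nu : u^Tu \le 1\}.$$
In this case, problem (4.175) reads
$$\begin{array}{rcl}
\mathrm{Opt}_\# &=& \min\limits_{\Theta,H,\lambda,\mu,\mu'}\left\{\phi_{\mathcal{T}}(\lambda) + \mu + \mu' + \mathrm{Tr}(\Gamma\Theta) : \begin{array}{l}\lambda \ge 0,\ \mu \ge 0,\ \mu' \ge 0,\\ \begin{bmatrix}\sum_k\lambda_kR_k & \frac{1}{2}[B - H^TA]^T\\ \frac{1}{2}[B - H^TA] & \mu I_\nu\end{bmatrix} \succeq 0,\\ \begin{bmatrix}\Theta & \frac{1}{2}H\\ \frac{1}{2}H^T & \mu'I_\nu\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \min\limits_{\Theta,H,\lambda,\mu,\mu'}\left\{\phi_{\mathcal{T}}(\lambda) + \mu + \mu' + \mathrm{Tr}(\Gamma\Theta) : \begin{array}{l}\lambda \ge 0,\ \mu \ge 0,\ \mu' \ge 0,\\ \mu[\sum_k\lambda_kR_k] \succeq \frac{1}{4}[B - H^TA]^T[B - H^TA],\\ \mu'\Theta \succeq \frac{1}{4}HH^T\end{array}\right\}\quad [\text{Schur Complement Lemma}]\\
&=& \min\limits_{\chi,H}\left\{\sqrt{\phi_{\mathcal{T}}(\chi)} + \sqrt{\mathrm{Tr}(H^T\Gamma H)} : \chi \ge 0,\ \begin{bmatrix}\sum_k\chi_kR_k & [B - H^TA]^T\\ {[B - H^TA]} & I_\nu\end{bmatrix} \succeq 0\right\}\\
&& [\text{by eliminating } \mu, \mu'; \text{ note that } \phi_{\mathcal{T}}(\cdot) \text{ is positively homogeneous of degree 1}].
\end{array}$$
Comparing the resulting representation of $\mathrm{Opt}_\#$ with (4.12), we see that the upper bound $\sqrt{\mathrm{Opt}}$ on the risk of the linear estimate $\widehat{x}_{H_*}$ appearing in (4.15) is $\le \mathrm{Opt}_\#$. Combining this observation with (4.176) and the evident relation
$$\begin{array}{rcl}
\mathrm{RiskOpt} &=& \inf\limits_{\widehat{x}}\sup\limits_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}(Ax + \xi)\|_2\}\\
&\le& \inf\limits_{\widehat{x}}\sqrt{\sup\limits_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}(Ax + \xi)\|_2^2\}} = \mathrm{Riskopt}
\end{array}$$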
(recall that we are in the case of k · k = k · k2 ), we arrive at (4.15) and thus justify Proposition 4.5. ✷ Proof of Proposition 4.50. It is immediately seen that problem (4.175) is nothing but problem (4.42) in the case when the spectratopes X , B∗ and the set Π participating in Proposition 4.14 are, respectively, the ellitopes given by (4.174), and the singleton {Γ}. Thus, Proposition 4.50 is, essentially, a particular case of Proposition 4.16. The only refinement in Proposition 4.50 as compared to Proposition 4.16 is the form of the logarithmic “nonoptimality” factor in (4.176); a similar factor in Proposition 4.16 is expressed in terms of spectratopic sizes D, F of X and B∗ (the total ranks of matrices Rk , k ≤ K, and Sℓ , ℓ ≤ L, in the case of (4.174)), while in (4.176) the nonoptimality factor is expressed in terms of ellitopic sizes K, L of X and B∗ . Strictly speaking, to arrive at this (slight—the sizes in question are under logs) refinement, we were supposed to reproduce, with minimal modifications, the reasoning of items 2o –7o of Section 4.8.5.1, with Γ in the role of Q∗ , and slightly refine Lemma 4.17 underlying this reasoning. Instead of carrying out this plan literally, we detail “local modifications” to be made in the proof of Proposition 4.16 in order to prove Proposition 4.50. Here are these modifications: A. The collections of matrices Λ = {Λk 0, k ≤ K}, Υ = {Υℓ 0, ℓ ≤ L} should be L substituted by collections of nonnegative reals λ ∈ RK + or µ ∈ R+ , and vectors
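$\lambda[\Lambda]$, $\lambda[\Upsilon]$ with vectors $\lambda$ or $\mu$. Expressions like $\mathcal{R}_k[W]$, $\mathcal{R}^*_k[\Lambda_k]$, and $\mathcal{S}^*_\ell[\Upsilon_\ell]$ should be replaced, respectively, with $\mathrm{Tr}(R_kW)$, $\lambda_kR_k$, and $\mu_\ell S_\ell$. Finally, $Q_*$ should be replaced with $\Gamma$, and scalar matrices, like $t_kI_{d_k}$, should be replaced with the corresponding reals, like $t_k$.
B. The role of Lemma 4.17 is now played by
Lemma 4.51. Let $Y$ be an $N\times\nu$ matrix, let $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal{B}_*$ of the conjugate norm is the ellitope
\[
\mathcal{B}_*=\{u\in\mathbf{R}^\nu:\ \exists(r\in\mathcal{R},y):\ u=My,\ y^TS_\ell y\le r_\ell,\ \ell\le L\},
\tag{4.174}
\]
and let $\zeta\sim\mathcal{N}(0,Q)$ for some positive semidefinite $N\times N$ matrix $Q$. Then the best upper bound on $\psi_Q(Y):=\mathbf{E}\{\|Y^T\zeta\|\}$ yielded by Lemma 4.11, that is, the optimal value $\mathrm{Opt}[Q]$ in the convex optimization problem (cf. (4.40))
\[
\mathrm{Opt}[Q]=\min_{\Theta,\mu}\left\{\phi_{\mathcal{R}}(\mu)+\mathrm{Tr}(Q\Theta):\ \mu\ge0,\ \left[\begin{array}{cc}\Theta&\frac12YM\\ \frac12M^TY^T&\sum_\ell\mu_\ell S_\ell\end{array}\right]\succeq0\right\}
\]
satisfies for all $Q\succeq0$ the identity
\[
\mathrm{Opt}[Q]=\overline{\mathrm{Opt}}[Q]:=\min_{G,\mu}\left\{\phi_{\mathcal{R}}(\mu)+\mathrm{Tr}(G):\ \mu\ge0,\ \left[\begin{array}{cc}G&\frac12Q^{1/2}YM\\ \frac12M^TY^TQ^{1/2}&\sum_\ell\mu_\ell S_\ell\end{array}\right]\succeq0\right\},
\tag{4.177}
\]
and is a tight bound on $\psi_Q(Y)$. Namely,
\[
\psi_Q(Y)\le\mathrm{Opt}[Q]\le22\sqrt{3\ln L+15\ln2}\ \psi_Q(Y),
\]
where $L$ is the size of the ellitope $\mathcal{B}_*$; see (4.174). Furthermore, for all $\kappa\ge1$ one has
\[
\mathrm{Prob}_\zeta\left\{\|Y^T\zeta\|\ge\frac{\mathrm{Opt}[Q]}{4\kappa}\right\}\ge\beta_\kappa:=1-\frac{e^{3/8}}{2}-2Le^{-\kappa^2/3}.
\tag{4.178}
\]
In particular, when selecting $\kappa=\sqrt{3\ln L+15\ln2}$, we obtain
\[
\mathrm{Prob}_\zeta\left\{\|Y^T\zeta\|\ge\frac{\mathrm{Opt}[Q]}{4\sqrt{3\ln L+15\ln2}}\right\}\ge\beta_\kappa=0.2100>\frac{3}{16}.
\]
Proof of Lemma 4.51 follows the lines of the proof of Lemma 4.17, with Lemma 4.47 substituting Lemma 4.46.
1°. Relation (4.177) can be verified exactly in the same fashion as in the case of Lemma 4.17.
2°. Let us set $\zeta=Q^{1/2}\eta$ with $\eta\sim\mathcal{N}(0,I_N)$ and $Z=Q^{1/2}Y$. Observe that to prove (4.178) is the same as to show that when $\kappa\ge1$ one has
\[
\bar\delta=\frac{\overline{\mathrm{Opt}}[Q]}{4\kappa}\ \Rightarrow\ \mathrm{Prob}_\eta\{\|Z^T\eta\|\ge\bar\delta\}\ge\beta_\kappa:=1-\frac{e^{3/8}}{2}-2Le^{-\kappa^2/3},
\tag{4.179}
\]
where
\[
[\mathrm{Opt}[Q]=]\ \ \overline{\mathrm{Opt}}[Q]:=\min_{\Theta,\mu}\left\{\phi_{\mathcal{R}}(\mu)+\mathrm{Tr}(\Theta):\ \mu\ge0,\ \left[\begin{array}{cc}\Theta&\frac12ZM\\ \frac12M^TZ^T&\sum_\ell\mu_\ell S_\ell\end{array}\right]\succeq0\right\}.
\tag{4.180}
\]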
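Justification of (4.179) goes as follows.
2.1°. Let us represent $\overline{\mathrm{Opt}}[Q]$ as the optimal value of a conic problem. Setting
\[
\mathbf{K}=\mathrm{cl}\{[r;s]:\ s>0,\ r/s\in\mathcal{R}\},
\]
we ensure that
\[
\mathcal{R}=\{r:\ [r;1]\in\mathbf{K}\},\qquad \mathbf{K}_*=\{[g;s]:\ s\ge\phi_{\mathcal{R}}(-g)\},
\]
where $\mathbf{K}_*$ is the cone dual to $\mathbf{K}$. Consequently, (4.180) reads
\[
\overline{\mathrm{Opt}}[Q]=\min_{\Theta,\mu,\theta}\left\{\theta+\mathrm{Tr}(\Theta):\
\begin{array}{ll}
\mu\ge0&(a)\\
\left[\begin{array}{cc}\Theta&\frac12ZM\\ \frac12M^TZ^T&\sum_\ell\mu_\ell S_\ell\end{array}\right]\succeq0&(b)\\
{}[-\mu;\theta]\in\mathbf{K}_*&(c)
\end{array}\right\}.
\tag{$P_E$}
\]
2.2°. Now let us prove that there exist matrix $W\in\mathbf{S}^q_+$ and $r\in\mathcal{R}$ such that
\[
\mathrm{Tr}(WS_\ell)\le r_\ell,\ \ell\le L,
\tag{4.181}
\]
and
\[
\overline{\mathrm{Opt}}[Q]\le\sum_i\sigma_i(ZMW^{1/2}),
\tag{4.182}
\]
where $\sigma_1(\cdot)\ge\sigma_2(\cdot)\ge...$ are singular values. To get the announced result, let us pass from problem $(P_E)$ to its conic dual. $(P_E)$ clearly is strictly feasible and bounded, so that the dual to $(P_E)$ problem $(D_E)$ is solvable with optimal value $\overline{\mathrm{Opt}}[Q]$. Denoting by $\lambda_\ell\ge0$, $\ell\le L$, $\left[\begin{array}{cc}G&-R\\ -R^T&W\end{array}\right]\succeq0$, $[r;\tau]\in\mathbf{K}$ the Lagrange multipliers for the respective constraints in $(P_E)$, and aggregating these constraints, the multipliers being the aggregation weights, we arrive at the aggregated constraint:
\[
\mathrm{Tr}(\Theta G)+\mathrm{Tr}\Big(W\sum_\ell\mu_\ell S_\ell\Big)+\sum_\ell\lambda_\ell\mu_\ell-\sum_\ell r_\ell\mu_\ell+\theta\tau\ge\mathrm{Tr}(ZMR^T).
\]
To get the dual problem, we impose on the Lagrange multipliers, in addition to the initial constraints, the restriction that the left-hand side in the aggregated constraint is equal to the objective of $(P_E)$, identically in $\Theta$, $\mu_\ell$, and $\theta$, that is,
\[
G=I,\quad \mathrm{Tr}(WS_\ell)+\lambda_\ell-r_\ell=0,\ 1\le\ell\le L,\quad \tau=1,
\]
and maximize the right-hand side of the aggregated constraint. After immediate simplifications, we arrive at
\[
\overline{\mathrm{Opt}}[Q]=\max_{W,R,r}\left\{\mathrm{Tr}(ZMR^T):\ W\succeq R^TR,\ r\in\mathcal{R},\ \mathrm{Tr}(WS_\ell)\le r_\ell,\ 1\le\ell\le L\right\}
\]
(note that $r\in\mathcal{R}$ is equivalent to $[r;1]\in\mathbf{K}$, and $W\succeq R^TR$ is the same as $\left[\begin{array}{cc}I&-R\\ -R^T&W\end{array}\right]\succeq0$).
Exactly as in the proof of Lemma 4.17, the above representation of $\overline{\mathrm{Opt}}[Q]$ implies that
\[
\overline{\mathrm{Opt}}[Q]=\max_{W,r}\left\{\|ZMW^{1/2}\|_{\mathrm{Sh},1}:\ r\in\mathcal{R},\ W\succeq0,\ \mathrm{Tr}(WS_\ell)\le r_\ell,\ \ell\le L\right\}.
\]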
The resulting problem clearly is solvable, and its optimal solution $W$ ensures the target relations (4.181) and (4.182).
2.3°. Given $W$ satisfying (4.181) and (4.182), we proceed exactly as in item 2.3° of the proof of Lemma 4.17, thus arriving at three random vectors $(\chi,\upsilon,\eta)$ with marginal distributions $\mathcal{N}(0,I_q)$, $\mathcal{N}(0,I_q)$, and $\mathcal{N}(0,I_N)$, respectively, such that
\[
\chi^TW^{1/2}M^TZ^T\eta=\sum_{i=1}^p\sigma_i\upsilon_i^2,
\tag{4.183}
\]
where $p=\min[q,N]$ and $\sigma_i=\sigma_i(ZMW^{1/2})$. As in item 3°.i of the proof of Lemma 4.17, we have
(i)
\[
\alpha:=\mathrm{Prob}\left\{\sum_{i=1}^p\sigma_i\upsilon_i^2<\frac14\sum_{i=1}^p\sigma_i\right\}\le\frac{e^{3/8}}{2}\ [=0.7275...].
\tag{4.184}
\]
The role of item 3°.ii in the aforementioned proof is now played by
(ii) Whenever $\kappa\ge1$, one has
\[
\mathrm{Prob}\{\|MW^{1/2}\chi\|_*>\kappa\}\le2L\exp\{-\kappa^2/3\},
\tag{4.185}
\]
with $L$ as defined in (4.174).
Indeed, setting $\rho=1/\kappa^2\le1$ and $\omega=\sqrt{\rho}\,W^{1/2}\chi$, we get $\omega\sim\mathcal{N}(0,\rho W)$. Let us apply Lemma 4.47 to $Q=\rho W$, $\mathcal{R}$ in the role of $\mathcal{T}$, with $L$ in the role of $K$, and $S_\ell$'s in the role of $R_k$'s. Denoting
\[
\mathcal{Y}:=\{y:\ \exists r\in\mathcal{R}:\ y^TS_\ell y\le r_\ell,\ \ell\le L\},
\]
we have $\mathrm{Tr}(QS_\ell)=\rho\mathrm{Tr}(WS_\ell)\le\rho r_\ell$, $\ell\le L$, with $r\in\mathcal{R}$ (see (4.181)), so we are under the premise of Lemma 4.47 (with $\mathcal{Y}$ in the role of $\mathcal{X}$ and therefore with $L$ in the role of $K$). Applying the lemma, we conclude that
\[
\mathrm{Prob}\left\{\chi:\ \kappa^{-1}W^{1/2}\chi\not\in\mathcal{Y}\right\}\le2L\exp\{-1/(3\rho)\}=2L\exp\{-\kappa^2/3\}.
\]
Recalling that $\mathcal{B}_*=M\mathcal{Y}$, we see that $\mathrm{Prob}\{\chi:\ \kappa^{-1}MW^{1/2}\chi\not\in\mathcal{B}_*\}$ is indeed upper-bounded by the right-hand side of (4.185), and (4.185) follows. With (i) and (ii) at our disposal, we complete the proof of Lemma 4.51 in exactly the same way as in items 2.4° and 3° of the proof of Lemma 4.17. ✷
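The numerical constant quoted in Lemma 4.51 can be checked directly. The following minimal sketch evaluates $\beta_\kappa$ from (4.178) at $\kappa=\sqrt{3\ln L+15\ln 2}$; the function names are hypothetical and introduced only for this illustration.

```python
import math

def beta_kappa(L: int, kappa: float) -> float:
    """beta_kappa = 1 - e^{3/8}/2 - 2 L exp(-kappa^2/3), as in (4.178)."""
    return 1.0 - math.exp(3.0 / 8.0) / 2.0 - 2.0 * L * math.exp(-kappa ** 2 / 3.0)

def kappa_choice(L: int) -> float:
    """The particular kappa used in Lemma 4.51: sqrt(3 ln L + 15 ln 2)."""
    return math.sqrt(3.0 * math.log(L) + 15.0 * math.log(2.0))

if __name__ == "__main__":
    for L in (1, 10, 1000):
        kappa = kappa_choice(L)
        # With this kappa, 2 L exp(-kappa^2/3) = 2 L / (32 L) = 1/16 for every L.
        print(L, kappa, beta_kappa(L, kappa))
```

With this choice of $\kappa$ the last term equals $1/16$ regardless of $L$, so $\beta_\kappa$ does not depend on $L$ and stays above $3/16=0.1875$.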
C. As a result of substituting Lemma 4.17 with Lemma 4.51, the counterpart of Lemma 4.49 used in item 4° of the proof of Proposition 4.16 now reads as follows:
Lemma 4.52. Let $W$ be a positive semidefinite $n\times n$ matrix, and $\kappa\ge1$. Then for any estimate $\widehat{x}(\cdot)$ of $B\eta$ given observation $\omega=A\eta+\xi$ with $\eta\sim\mathcal{N}(0,W)$ and $\xi\sim\mathcal{N}(0,\Gamma)$ independent of each other, one has
\[
\mathrm{Prob}_{\eta,\xi}\left\{\|B\eta-\widehat{x}(\omega)\|\ge[8\kappa]^{-1}\inf_H[\Psi(H)+\Phi(W,H)]\right\}\ge\beta_\kappa=1-\frac{e^{3/8}}{2}-2Le^{-\kappa^2/3},
\]
where $\Psi(H)$ and $\Phi(W,H)$ are defined, respectively, by (4.156) (where $Q_*$ should be set to $\Gamma$) and (4.161). In particular, for
\[
\kappa=\bar\kappa:=\sqrt{3\ln L+15\ln2}
\]
the latter probability is $>3/16$.
D. We substitute the reference to Lemma 4.46 in item 7° of the proof with Lemma 4.47, resulting in replacing
• definition of $\delta(\kappa)$ in (4.168) with
\[
\delta(\kappa):=\mathrm{Prob}_{\zeta\sim\mathcal{N}(0,I_n)}\{W_\kappa^{1/2}\zeta\not\in\mathcal{X}\}\le3Ke^{-\frac{1}{3\kappa}},
\]
• definition (4.169) of $\bar\kappa$ with
\[
\bar\kappa=\frac{1}{3\ln K+15\ln2},
\]
• and, finally, definition (4.171) of $\varrho_*$ with
\[
\varrho_*:=\frac{\mathrm{Opt}}{8\sqrt{(3\ln L+15\ln2)(3\ln K+15\ln2)}}.
\]
4.8.6 Proofs of Propositions 4.18 and 4.19, and justification of Remark 4.20
4.8.6.1 Proof of Proposition 4.18
The only claim of the proposition which is not an immediate consequence of Proposition 4.8 is that problem (4.64) is solvable; let us justify this claim. Let $F=\mathrm{Im}A$. Clearly, feasibility of a candidate solution $(H,\Lambda,\Upsilon)$ to the problem depends solely on the restriction of the linear mapping $z\mapsto H^Tz$ onto $F$, so that adding to the constraints of the problem the requirement that the restriction of this linear mapping on the orthogonal complement of $F$ in $\mathbf{R}^m$ is identically zero, we get an equivalent problem. It is immediately seen that in the resulting problem, the feasible solutions with the value of the objective $\le a$ for every $a\in\mathbf{R}$ form a compact set, so that the latter problem (and thus the original one) indeed is solvable. ✷
4.8.6.2 Proof of Proposition 4.19
We are about to derive Proposition 4.19 from Proposition 4.16. Observe that in the situation of the latter Proposition, setting formally $\Pi=\{0\}$, problem (4.42) becomes problem (4.64), so that Proposition 4.19 looks like the special case $\Pi=\{0\}$ of Proposition 4.16. However, the premise of the latter proposition forbids specializing $\Pi$ as $\{0\}$: this would violate the regularity assumption R which is part of the premise. The difficulty, however, can be easily resolved. Assume w.l.o.g. that the image space of $A$ is the entire $\mathbf{R}^m$ (otherwise we could from the very beginning replace $\mathbf{R}^m$ with the image space of $A$), and let us pass from our current noiseless recovery problem of interest (!) (see Section 4.5.1) to its "noisy modification," the differences with (!) being
• noisy observation $\omega=Ax+\sigma\xi$, $\sigma>0$, $\xi\sim\mathcal{N}(0,I_m)$;
• risk quantification of a candidate estimate $\widehat{x}(\cdot)$ according to
\[
\mathrm{Risk}^\sigma_{\|\cdot\|}[\widehat{x}(Ax+\sigma\xi)|\mathcal{X}]=\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,I_m)}\left\{\|Bx-\widehat{x}(Ax+\sigma\xi)\|\right\},
\]
the corresponding minimax optimal risk being
\[
\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}]=\inf_{\widehat{x}(\cdot)}\mathrm{Risk}^\sigma_{\|\cdot\|}[\widehat{x}(Ax+\sigma\xi)|\mathcal{X}].
\]
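Proposition 4.16 does apply to the modified problem: it suffices to specify $\Pi$ as $\{\sigma^2I_m\}$. According to this proposition, the quantity
\[
\mathrm{Opt}[\sigma]=\min_{H,\Lambda,\Upsilon,\Upsilon',\Theta}\left\{
\begin{array}{l}
\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\sigma^2\mathrm{Tr}(\Theta):\\
\Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\},\\
\left[\begin{array}{cc}\sum_k\mathcal{R}^*_k[\Lambda_k]&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell\mathcal{S}^*_\ell[\Upsilon_\ell]\end{array}\right]\succeq0,\\
\left[\begin{array}{cc}\Theta&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mathcal{S}^*_\ell[\Upsilon'_\ell]\end{array}\right]\succeq0
\end{array}\right\}
\]
satisfies the relation
\[
\mathrm{Opt}[\sigma]\le O(1)\ln(D)\,\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}]
\tag{4.186}
\]
with $D$ defined in (4.65). Looking at problem (4.64) we immediately conclude that $\mathrm{Opt}_\#\le\mathrm{Opt}[\sigma]$. Thus, all we need in order to extract the target relation (4.65) from (4.186) is to prove that the minimax optimal risk $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ defined in Proposition 4.19 satisfies the relation
\[
\liminf_{\sigma\to+0}\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}]\le\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}].
\tag{4.187}
\]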
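To prove this relation, let us fix $r>\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$, so that for some Borel estimate $\widehat{x}(\cdot)$ it holds
\[
\sup_{x\in\mathcal{X}}\|Bx-\widehat{x}(Ax)\|<r.
\tag{4.188}
\]
Were we able to ensure that $\widehat{x}(\cdot)$ is bounded and continuous, we would be done, since in this case, due to compactness of $\mathcal{X}$, it clearly holds
\[
\begin{array}{rcl}
\liminf\limits_{\sigma\to+0}\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}]&\le&\liminf\limits_{\sigma\to+0}\sup\limits_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,I_m)}\left\{\|Bx-\widehat{x}(Ax+\sigma\xi)\|\right\}\\
&\le&\sup\limits_{x\in\mathcal{X}}\|Bx-\widehat{x}(Ax)\|<r,
\end{array}
\]
and since $r>\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ is arbitrary, (4.187) would follow. Thus, all we need to do is to verify that given Borel estimate $\widehat{x}(\cdot)$ satisfying (4.188), we can update it into a bounded and continuous estimate satisfying the same relation. Verification is as follows:
1. Setting $\beta=\max_{x\in\mathcal{X}}\|Bx\|$ and replacing estimate $\widehat{x}$ with its truncation
\[
\widetilde{x}(\omega)=\left\{\begin{array}{ll}\widehat{x}(\omega),&\|\widehat{x}(\omega)\|\le2\beta\\ 0,&\hbox{otherwise}\end{array}\right.
\]
for any $x\in\mathcal{X}$ we only reduce the norm of the recovery error. At the same time, $\widetilde{x}$ is Borel and bounded. Thus, we lose nothing when assuming in the rest of the proof that $\widehat{x}(\cdot)$ is Borel and bounded.
2. For $\epsilon>0$, let $\widehat{x}_\epsilon(\omega)=(1+\epsilon)\widehat{x}(\omega/(1+\epsilon))$ and let $\mathcal{X}_\epsilon=(1+\epsilon)\mathcal{X}$. Observe that
\[
\begin{array}{rcl}
\sup\limits_{x\in\mathcal{X}_\epsilon}\|Bx-\widehat{x}_\epsilon(Ax)\|&=&\sup\limits_{y\in\mathcal{X}}\|B[1+\epsilon]y-\widehat{x}_\epsilon(A[1+\epsilon]y)\|\\
&=&\sup\limits_{y\in\mathcal{X}}\|B[1+\epsilon]y-[1+\epsilon]\widehat{x}(Ay)\|\\
&=&[1+\epsilon]\sup\limits_{y\in\mathcal{X}}\|By-\widehat{x}(Ay)\|,
\end{array}
\]
implying, in view of (4.188), that for small enough positive $\epsilon$ we have
\[
\bar{r}:=\sup_{x\in\mathcal{X}_\epsilon}\|Bx-\widehat{x}_\epsilon(Ax)\|<r.
\tag{4.189}
\]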
3. Finally, let $A^\dagger$ be the pseudoinverse of $A$, so that $AA^\dagger z=z$ for every $z\in\mathbf{R}^m$ (recall that the image space of $A$ is the entire $\mathbf{R}^m$). Given $\rho>0$, let $\theta_\rho(\cdot)$ be a nonnegative smooth function on $\mathbf{R}^m$ with integral 1 such that $\theta_\rho$ vanishes outside of the ball of radius $\rho$ centered at the origin, and let
\[
\widehat{x}_{\epsilon,\rho}(\omega)=\int_{\mathbf{R}^m}\widehat{x}_\epsilon(\omega-z)\theta_\rho(z)dz
\]
be the convolution of $\widehat{x}_\epsilon$ and $\theta_\rho$. Since $\widehat{x}_\epsilon(\cdot)$ is Borel and bounded, this convolution is a well-defined smooth function on $\mathbf{R}^m$. Because $\mathcal{X}$ contains a neighbourhood of the origin, for all small enough $\rho>0$, all $z$ from the support of $\theta_\rho$ and all $x\in\mathcal{X}$ the point $x-A^\dagger z$ belongs to $\mathcal{X}_\epsilon$. For such $\rho$ and any $x\in\mathcal{X}$ we have
\[
\begin{array}{rcl}
\|Bx-\widehat{x}_\epsilon(Ax-z)\|&=&\|Bx-\widehat{x}_\epsilon(A[x-A^\dagger z])\|\\
&\le&\|BA^\dagger z\|+\|B[x-A^\dagger z]-\widehat{x}_\epsilon(A[x-A^\dagger z])\|\\
&\le&C\rho+\bar{r}
\end{array}
\]
with properly selected constant $C$ independent of $\rho$ (we have used (4.189); note that for our $\rho$ and $x$ we have $x-A^\dagger z\in\mathcal{X}_\epsilon$). We conclude that for properly selected $r'<r$, $\rho>0$ and all $x\in\mathcal{X}$ we have
\[
\|Bx-\widehat{x}_\epsilon(Ax-z)\|\le r'\quad\forall(z\in\mathrm{supp}\,\theta_\rho),
\]
implying, by construction of $\widehat{x}_{\epsilon,\rho}$, that
\[
\forall(x\in\mathcal{X}):\ \|Bx-\widehat{x}_{\epsilon,\rho}(Ax)\|\le r'<r.
\]
The resulting estimate $\widehat{x}_{\epsilon,\rho}$ is the continuous and bounded estimate satisfying (4.188) we were looking for. ✷
4.8.6.3 Justification of Remark 4.20
Justification of Remark 4.20 is given by repeating word by word the proof of Proposition 4.19, with Proposition 4.50 in the role of Proposition 4.16.
Chapter Five
Signal Recovery Beyond Linear Estimates
OVERVIEW
In this chapter, as in Chapter 4, we focus on signal recovery. In contrast to the previous chapter, on our agenda now are
• a special kind of nonlinear estimation, the polyhedral estimate (Section 5.1), an alternative to linear estimates which were our subject in Chapter 4. We demonstrate that as applied to the same estimation problem as in Chapter 4 (recovery of an unknown signal via noisy observation of a linear image of the signal), polyhedral estimation possesses the same attractive properties as linear estimation, that is, efficient computability and near-optimality, provided the signal set is an ellitope/spectratope. Besides this, we show that properly built polyhedral estimates are near-optimal in several special cases where linear estimates could be heavily suboptimal.
• recovering signals from noisy observations of nonlinear images of the signal. Specifically, we consider signal recovery in generalized linear models, where the expected value of an observation is a known nonlinear transformation of the signal we want to recover, in contrast to observation model (4.1) where this expectation is linear in the signal.
5.1 POLYHEDRAL ESTIMATION
5.1.1 Motivation
The estimation problem we were considering so far is as follows: We want to recover the image $Bx\in\mathbf{R}^\nu$ of unknown signal $x$ known to belong to signal set $\mathcal{X}\subset\mathbf{R}^n$ from a noisy observation $\omega=Ax+\xi_x\in\mathbf{R}^m$, where $\xi_x$ is observation noise (index $x$ in $\xi_x$ indicates that the distribution $P_x$ of the observation noise may depend on $x$). Here $\mathcal{X}$ is a given nonempty convex compact set, and $A$ and $B$ are given $m\times n$ and $\nu\times n$ matrices; in addition, we are given a norm $\|\cdot\|$ on $\mathbf{R}^\nu$ in which the recovery error is measured. We have seen that if $\mathcal{X}$ is an ellitope/spectratope then, under reasonable assumptions on observation noise and $\|\cdot\|$, an appropriate efficiently computable estimate linear in $\omega$ is near-optimal. Note that the ellitopic/spectratopic structure of $\mathcal{X}$ is crucial here. What follows is motivated by the desire to build an alternative estimation scheme which works beyond the ellitopic/spectratopic case, where linear estimates can become "heavily nonoptimal."
Motivating example. Consider the simple-looking problem of recovering $Bx=x$ in the $\|\cdot\|_2$-norm from direct observations ($Ax=x$) corrupted by the standard Gaussian noise $\xi\sim\mathcal{N}(0,\sigma^2I)$, and let $\mathcal{X}$ be the unit $\|\cdot\|_1$-ball:
\[
\mathcal{X}=\Big\{x\in\mathbf{R}^n:\ \sum_i|x_i|\le1\Big\}.
\]
In this situation, one can easily build the linear estimate $\widehat{x}_H(\omega)=H^T\omega$ which is optimal in terms of the worst-case, over $x\in\mathcal{X}$, expected squared risk:
\[
\begin{array}{rcl}
\mathrm{Risk}^2[\widehat{x}_H|\mathcal{X}]&:=&\max\limits_{x\in\mathcal{X}}\mathbf{E}\left\{\|\widehat{x}_H(\omega)-Bx\|_2^2\right\}\\
&=&\max\limits_{x\in\mathcal{X}}\left\{\|[I-H^T]x\|_2^2+\sigma^2\mathrm{Tr}(HH^T)\right\}\\
&=&\max\limits_{i\le n}\|\mathrm{Col}_i[I-H^T]\|_2^2+\sigma^2\mathrm{Tr}(HH^T).
\end{array}
\]
Clearly, the optimal $H$ is just a scalar matrix $hI$, the optimal $h$ is the minimizer of the univariate quadratic function $(1-h)^2+\sigma^2nh^2$, and the best squared risk attainable with linear estimates is
\[
R^2=\min_h\left[(1-h)^2+\sigma^2nh^2\right]=\frac{n\sigma^2}{1+n\sigma^2}.
\]
On the other hand, consider a nonlinear estimate $\widehat{x}(\omega)$ as follows. Given observation $\omega$, specify $\widehat{x}(\omega)$ as an optimal solution to the optimization problem
\[
\mathrm{Opt}(\omega)=\min_{y\in\mathcal{X}}\|y-\omega\|_\infty.
\]
Note that for every $\rho>0$ the probability that the true signal satisfies $\|x-\omega\|_\infty\le\rho\sigma$ ("event $\mathcal{E}$") is at least $1-2n\exp\{-\rho^2/2\}$, and if this event happens, then both $x$ and $\widehat{x}$ belong to the box $\{y:\ \|y-\omega\|_\infty\le\rho\sigma\}$, implying that $\|x-\widehat{x}\|_\infty\le2\rho\sigma$. In addition, we always have $\|x-\widehat{x}\|_2\le\|x-\widehat{x}\|_1\le2$, since $x\in\mathcal{X}$ and $\widehat{x}\in\mathcal{X}$. We therefore have
\[
\|x-\widehat{x}\|_2\le\sqrt{\|x-\widehat{x}\|_\infty\|x-\widehat{x}\|_1}\le\left\{\begin{array}{ll}2\sqrt{\rho\sigma},&\omega\in\mathcal{E},\\ 2,&\omega\not\in\mathcal{E},\end{array}\right.
\]
whence
\[
\mathbf{E}\left\{\|\widehat{x}-x\|_2^2\right\}\le4\rho\sigma+8n\exp\{-\rho^2/2\}.
\tag{$*$}
\]
Assuming $\sigma\le2n\exp\{-1/2\}$ and specifying $\rho$ as $\sqrt{2\ln(2n/\sigma)}$, we get $\rho\ge1$ and $2n\exp\{-\rho^2/2\}\le\sigma$, implying that the right hand side in $(*)$ is at most $8\rho\sigma$. In other words, for our nonlinear estimate $\widehat{x}(\omega)$ it holds
\[
\mathrm{Risk}^2[\widehat{x}|\mathcal{X}]\le8\sqrt{2\ln(2n/\sigma)}\,\sigma.
\]
When $n\sigma^2$ is of order of 1, the latter bound on the squared risk is of order of $\sigma\sqrt{\ln(1/\sigma)}$, while the best squared risk achievable with linear estimates under the circumstances is of order of 1. We conclude that when $\sigma$ is small and $n$ is large (specifically, is of order of $1/\sigma^2$), the best linear estimate is by far inferior compared to our nonlinear estimate: the ratio of the corresponding squared risks is as large as $\frac{O(1)}{\sigma\sqrt{\ln(1/\sigma)}}$, the factor which is "by far" worse than the nonoptimality factor in the case of ellitope/spectratope $\mathcal{X}$.
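To get a feeling for the gap in the motivating example, one can evaluate the two quantities involved: the best linear squared risk $n\sigma^2/(1+n\sigma^2)$ and the bound $8\rho\sigma$ with $\rho=\sqrt{2\ln(2n/\sigma)}$ for the nonlinear estimate. The sketch below does just that; the function names and the sample values of $n$ and $\sigma$ are illustrative assumptions, not taken from the text.

```python
import math

def best_linear_squared_risk(n: int, sigma: float) -> float:
    """Minimum over h of (1 - h)^2 + sigma^2 * n * h^2, attained at h = 1/(1 + n sigma^2)."""
    a = n * sigma ** 2
    h = 1.0 / (1.0 + a)
    return (1.0 - h) ** 2 + a * h ** 2  # algebraically equal to a / (1 + a)

def nonlinear_squared_risk_bound(n: int, sigma: float) -> float:
    """Upper bound 8 * rho * sigma with rho = sqrt(2 ln(2n/sigma)); needs sigma <= 2n e^{-1/2}."""
    if sigma > 2.0 * n * math.exp(-0.5):
        raise ValueError("the bound is derived under sigma <= 2 n exp(-1/2)")
    rho = math.sqrt(2.0 * math.log(2.0 * n / sigma))
    return 8.0 * rho * sigma

if __name__ == "__main__":
    n, sigma = 10 ** 6, 1e-3  # illustrative regime with n * sigma^2 = 1
    print("best linear squared risk :", best_linear_squared_risk(n, sigma))
    print("nonlinear risk bound     :", nonlinear_squared_risk_bound(n, sigma))
```

In the regime $n\sigma^2=1$ used above, the linear risk equals $1/2$, while the bound for the nonlinear estimate is smaller by roughly an order of magnitude, in line with the discussion of the ratio $O(1)/(\sigma\sqrt{\ln(1/\sigma)})$.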
The construction of the nonlinear estimate $\widehat{x}$ which we have built¹ admits a natural extension yielding what we shall call polyhedral estimate, and our present goal is to design and to analyse presumably good estimates of this type.
5.1.2 Generic polyhedral estimate
A generic polyhedral estimate is as follows: Given the data $A\in\mathbf{R}^{m\times n}$, $B\in\mathbf{R}^{\nu\times n}$, $\mathcal{X}\subset\mathbf{R}^n$ of our recovery problem (where $\mathcal{X}$ is a computationally tractable convex compact set) and a "reliability tolerance" $\epsilon\in(0,1)$, we specify somehow positive integer $N$ along with $N$ linear forms $h_\ell^Tz$ on the space $\mathbf{R}^m$ where observations live. These forms define linear forms $g_\ell^Tx:=h_\ell^TAx$ on the space of signals $\mathbf{R}^n$. Assuming that the observation noise $\xi_x$ is zero mean for every $x\in\mathcal{X}$, the "plug-in" estimates $h_\ell^T\omega$ are unbiased estimates of the forms $g_\ell^Tx$. Assume that vectors $h_\ell$ are selected in such a way that
\[
\forall(x\in\mathcal{X}):\ \mathrm{Prob}\{|h_\ell^T\xi_x|>1\}\le\epsilon/N\ \ \forall\ell.
\tag{5.1}
\]
In this situation, setting $H=[h_1,...,h_N]$ (in the sequel, $H$ is referred to as contrast matrix), we can ensure that whatever be the signal $x\in\mathcal{X}$ underlying our observation $\omega=Ax+\xi_x$, the observable vector $H^T\omega$ satisfies the relation
\[
\mathrm{Prob}\left\{\|H^T\omega-H^TAx\|_\infty>1\right\}\le\epsilon.
\tag{5.2}
\]
With the polyhedral estimation scheme, we act as if all information about $x$ contained in our observation $\omega$ were represented by $H^T\omega$, and we estimate $Bx$ by $B\bar{x}$, where $\bar{x}=\bar{x}(\omega)$ is any vector from $\mathcal{X}$ compatible with this information, specifically, such that $\bar{x}$ solves the feasibility problem
\[
\hbox{find } \bar{x}\in\mathcal{X} \hbox{ such that } \|H^T\omega-H^TA\bar{x}\|_\infty\le1.
\]
Note that this feasibility problem with positive probability can be unsolvable; all we know in this respect is that the latter probability is $\le\epsilon$, since by construction the true signal $x$ underlying observation $\omega$ is with probability at least $1-\epsilon$ a feasible solution. In other words, such $\bar{x}$ is not always well-defined. To circumvent this difficulty, let us define $\bar{x}$ as
\[
\bar{x}\in\mathop{\mathrm{Argmin}}_u\left\{\|H^T\omega-H^TAu\|_\infty:\ u\in\mathcal{X}\right\},
\tag{5.3}
\]
so that $\bar{x}$ always is well-defined and belongs to $\mathcal{X}$, and estimate $Bx$ by $B\bar{x}$. Thus, a polyhedral estimate is specified by an $m\times N$ contrast matrix $H=[h_1,...,h_N]$ with columns $h_\ell$ satisfying (5.1) and is as follows: given observation $\omega$, we build $\bar{x}=\bar{x}(\omega)\in\mathcal{X}$ according to (5.3) and estimate $Bx$ by $\widehat{x}_H(\omega)=B\bar{x}(\omega)$.
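In general, (5.3) is a convex program that calls for a solver. In the special setting of the motivating example, however, it can be solved in closed form, which makes for a compact illustration. The sketch below assumes $A=I$, a contrast matrix proportional to the unit matrix, and $\mathcal{X}$ the unit $\|\cdot\|_1$-ball, so that (5.3) reduces to minimizing $\|\omega-u\|_\infty$ over the ball. One optimal solution is obtained by soft-thresholding $\omega$ at the smallest level $t\ge0$ for which $\sum_i\max(|\omega_i|-t,0)\le1$; the function name is hypothetical.

```python
import math

def linf_closest_point_in_l1_ball(omega):
    """Return (x_bar, t): x_bar minimizes ||omega - u||_inf over the unit l1-ball,
    and t is the optimal value of that problem (the soft-threshold level)."""
    abs_vals = [abs(w) for w in omega]
    if sum(abs_vals) <= 1.0:
        return list(omega), 0.0  # omega itself is feasible, distance 0
    # Find t > 0 with sum_i max(|omega_i| - t, 0) = 1 by scanning sorted magnitudes.
    u = sorted(abs_vals, reverse=True)
    cumsum, t = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumsum += u_k
        candidate = (cumsum - 1.0) / k  # threshold if exactly the k largest entries survive
        if u_k - candidate > 0.0:
            t = candidate
        else:
            break
    # Soft-threshold every entry at level t, keeping signs.
    x_bar = [math.copysign(max(abs(w) - t, 0.0), w) for w in omega]
    return x_bar, t

if __name__ == "__main__":
    x_bar, t = linf_closest_point_in_l1_ball([0.9, 0.5, -0.1])
    print(x_bar, t)
```

The reasoning behind the threshold: for a given level $t$, the point of the box $\{y:\|y-\omega\|_\infty\le t\}$ with the smallest $\|\cdot\|_1$-norm is the soft-thresholded $\omega$, so the box meets the unit $\|\cdot\|_1$-ball exactly when $\sum_i\max(|\omega_i|-t,0)\le1$.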
¹ In fact, this estimate is nearly optimal under the circumstances in a meaningful range of values of $n$ and $\sigma$.
The rationale behind polyhedral estimation scheme is the desire to reduce complex estimating problems to those of estimating linear forms. To the best of our knowledge, this approach was first used in [192] (see also [185, Chapter 2]) in connection with recovering from direct observations (restrictions on regular grids of) multivariate functions from Sobolev balls. Recently, the ideas underlying the results of [192] have been taken up in the MIND estimator of [109], then applied to multiple testing in [203]. What follows is based on [139].
$(\epsilon,\|\cdot\|)$-risk. Given a desired "reliability tolerance" $\epsilon\in(0,1)$, it is convenient to quantify the performance of polyhedral estimate by its $(\epsilon,\|\cdot\|)$-risk
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}(\cdot)|\mathcal{X}]=\inf\left\{\rho:\ \mathrm{Prob}\left\{\|Bx-\widehat{x}(Ax+\xi_x)\|>\rho\right\}\le\epsilon\ \ \forall x\in\mathcal{X}\right\},
\tag{5.4}
\]
that is, the worst, over $x\in\mathcal{X}$, size of "$(1-\epsilon)$-reliable $\|\cdot\|$-confidence interval" associated with the estimate $\widehat{x}(\cdot)$. An immediate observation is as follows:
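Proposition 5.1. In the situation in question, denoting by $\mathcal{X}_{\mathrm{s}}=\frac12(\mathcal{X}-\mathcal{X})$ the symmetrization of $\mathcal{X}$, given a contrast matrix $H=[h_1,...,h_N]$ with columns satisfying (5.1), the quantity
\[
\mathfrak{R}[H]=\max_z\left\{\|Bz\|:\ \|H^TAz\|_\infty\le2,\ z\in2\mathcal{X}_{\mathrm{s}}\right\}
\tag{5.5}
\]
is an upper bound on the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate $\widehat{x}_H(\cdot)$:
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}_H|\mathcal{X}]\le\mathfrak{R}[H].
\]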
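To make (5.5) concrete, consider again the setting of the motivating example: $A=B=I$, $\mathcal{X}$ the unit $\|\cdot\|_1$-ball (so that $2\mathcal{X}_{\mathrm{s}}$ is the $\|\cdot\|_1$-ball of radius 2), $\|\cdot\|=\|\cdot\|_2$, and a contrast matrix of the form $H=\tau^{-1}I$ with a scale $\tau>0$. All of these choices, as well as the function name, are assumptions made for this sketch. The bound (5.5) then becomes the maximum of $\|z\|_2$ over $\{\|z\|_\infty\le2\tau,\ \|z\|_1\le2\}$, and, since a convex function is being maximized, the maximum is attained by concentrating the $\|\cdot\|_1$ budget on as few coordinates as the box allows.

```python
import math

def risk_bound_l1_ball(n: int, tau: float) -> float:
    """Value of (5.5) for A = B = I, X = unit l1-ball, H = I/tau, Euclidean norm:
    max ||z||_2 subject to ||z||_inf <= 2*tau and ||z||_1 <= 2."""
    cap, budget = 2.0 * tau, 2.0
    squared = 0.0
    for _ in range(n):
        take = min(cap, budget)  # fill coordinates greedily up to the box bound
        squared += take * take
        budget -= take
    return math.sqrt(squared)

if __name__ == "__main__":
    for tau in (0.05, 0.25, 0.3):
        # Compare with the cruder bound 2*sqrt(tau) coming from ||z||_2^2 <= ||z||_inf * ||z||_1.
        print(tau, risk_bound_l1_ball(50, tau), 2.0 * math.sqrt(tau))
```

The comparison column uses the inequality $\|z\|_2^2\le\|z\|_\infty\|z\|_1$, the same one that produced the factor $2\sqrt{\rho\sigma}$ in the motivating example.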
Proof is immediate. Let us fix $x\in\mathcal{X}$, and let $\mathcal{E}$ be the set of all realizations of $\xi_x$ such that $\|H^T\xi_x\|_\infty\le1$, so that $P_x(\mathcal{E})\ge1-\epsilon$ by (5.2). Let us fix a realization $\xi\in\mathcal{E}$ of the observation noise, and let $\omega=Ax+\xi$, $\bar{x}=\bar{x}(Ax+\xi)$. Then $u=x$ is a feasible solution to the optimization problem (5.3) with the value of the objective $\le1$, implying that the value of this objective at the optimal solution $\bar{x}$ to the problem is $\le1$ as well, so that $\|H^TA[x-\bar{x}]\|_\infty\le2$. Besides this, $z=x-\bar{x}\in2\mathcal{X}_{\mathrm{s}}$. We see that $z$ is a feasible solution to (5.5), whence $\|B[x-\bar{x}]\|=\|Bx-\widehat{x}_H(\omega)\|\le\mathfrak{R}[H]$. It remains to note that the latter relation holds true whenever $\omega=Ax+\xi$ with $\xi\in\mathcal{E}$, and the $P_x$-probability of the latter inclusion is at least $1-\epsilon$, whatever be $x\in\mathcal{X}$. ✷
What is ahead. In what follows our focus will be on the following questions pertinent to the design of polyhedral estimates:
1. Given the data of our estimation problem and a tolerance $\delta\in(0,1)$, how to find a set $\mathcal{H}_\delta$ of vectors $h\in\mathbf{R}^m$ satisfying the relation
\[
\forall(x\in\mathcal{X}):\ \mathrm{Prob}\left\{|h^T\xi_x|>1\right\}\le\delta.
\tag{5.6}
\]
With our approach, after the number $N$ of columns in a contrast matrix has been selected, we choose the columns of $H$ from $\mathcal{H}_\delta$, with $\delta=\epsilon/N$, $\epsilon$ being a given reliability tolerance of the estimate we are designing. Thus, the problem of constructing sets $\mathcal{H}_\delta$ arises, the larger $\mathcal{H}_\delta$, the better.
2. The upper bound $\mathfrak{R}[H]$ on the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate $\widehat{x}_H$ is, in general, difficult to compute: this is the maximum of a convex function over a computationally tractable convex set. Thus, similarly to the case of linear estimates, we need techniques for computationally efficient upper bounding of $\mathfrak{R}[\cdot]$.
3. With "raw materials" (sets $\mathcal{H}_\delta$) and efficiently computable upper bounds on the risk of candidate polyhedral estimates at our disposal, how do we design the best in terms of (the upper bound on) its risk polyhedral estimate?
We are about to consider these questions one by one.
5.1.3 Specifying sets $\mathcal{H}_\delta$ for basic observation schemes
To specify reasonable sets $\mathcal{H}_\delta$ we need to make some assumptions on the distributions of observation noises we want to handle. In the sequel we restrict ourselves to three special cases as follows:
• Sub-Gaussian case: For every $x\in\mathcal{X}$, the observation noise $\xi_x$ is sub-Gaussian with parameters $(0,\sigma^2I_m)$, where $\sigma>0$, i.e. $\xi_x\sim\mathcal{SG}(0,\sigma^2I_m)$.
• Discrete case: $\mathcal{X}$ is a convex compact subset of the probabilistic simplex $\Delta_n=\{x\in\mathbf{R}^n:\ x\ge0,\ \sum_ix_i=1\}$, $A$ is a column-stochastic matrix, and
\[
\omega=\frac1K\sum_{k=1}^K\zeta_k
\]
with random vectors $\zeta_k$ independent across $k\le K$, $\zeta_k$ taking value $e_i$ with probability $[Ax]_i$, $i=1,...,m$, $e_i$ being the basic orths in $\mathbf{R}^m$.
• Poisson case: $\mathcal{X}$ is a convex compact subset of the nonnegative orthant $\mathbf{R}^n_+$, $A$ is entrywise nonnegative, and the observation $\omega$ stemming from $x\in\mathcal{X}$ is a random vector with entries $\omega_i\sim\mathrm{Poisson}([Ax]_i)$ independent across $i$.
The associated sets $\mathcal{H}_\delta$ can be built as follows.
5.1.3.1 Sub-Gaussian case
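When $h\in\mathbf{R}^m$ is deterministic and $\xi$ is sub-Gaussian with parameters $0,\sigma^2I_m$, we have
\[
\mathrm{Prob}\{|h^T\xi|>1\}\le2\exp\left\{-\frac{1}{2\sigma^2\|h\|_2^2}\right\}.
\]
Indeed, when $h\neq0$ and $\gamma>0$, we have
\[
\mathrm{Prob}\{h^T\xi>1\}\le\exp\{-\gamma\}\mathbf{E}\left\{\exp\{\gamma h^T\xi\}\right\}\le\exp\left\{\tfrac12\sigma^2\gamma^2\|h\|_2^2-\gamma\right\}.
\]
Minimizing the resulting bound in $\gamma>0$, we get $\mathrm{Prob}\{h^T\xi>1\}\le\exp\left\{-\frac{1}{2\|h\|_2^2\sigma^2}\right\}$; the same reasoning as applied to $-h$ in the role of $h$ results in $\mathrm{Prob}\{h^T\xi<-1\}\le\exp\left\{-\frac{1}{2\|h\|_2^2\sigma^2}\right\}$.
Consequently
\[
\pi_G(h):=\underbrace{\sigma\sqrt{2\ln(2/\delta)}}_{\vartheta_G}\|h\|_2\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi|>1\}\le\delta,
\]
and we can set
\[
\mathcal{H}_\delta=\mathcal{H}^G_\delta:=\{h:\ \pi_G(h)\le1\}.
\]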
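As a sanity check of the sub-Gaussian recipe, one can look at genuinely Gaussian noise $\xi\sim\mathcal{N}(0,\sigma^2I_m)$ (a particular sub-Gaussian case), where $h^T\xi\sim\mathcal{N}(0,\sigma^2\|h\|_2^2)$ and the tail probability is available exactly through the complementary error function. The sketch below compares the exact tail with the Chernoff-type bound at the boundary $\pi_G(h)=1$; the function names and sample parameters are illustrative assumptions.

```python
import math

def theta_G(sigma: float, delta: float) -> float:
    """The factor vartheta_G = sigma * sqrt(2 ln(2/delta)) in pi_G(h) = vartheta_G * ||h||_2."""
    return sigma * math.sqrt(2.0 * math.log(2.0 / delta))

def chernoff_bound(h_norm: float, sigma: float) -> float:
    """The bound 2 exp(-1 / (2 sigma^2 ||h||_2^2)) on Prob{|h^T xi| > 1}."""
    return 2.0 * math.exp(-1.0 / (2.0 * sigma ** 2 * h_norm ** 2))

def gaussian_tail(h_norm: float, sigma: float) -> float:
    """Exact Prob{|h^T xi| > 1} when xi ~ N(0, sigma^2 I), i.e. h^T xi ~ N(0, sigma^2 ||h||_2^2)."""
    return math.erfc(1.0 / (math.sqrt(2.0) * sigma * h_norm))

if __name__ == "__main__":
    sigma, delta = 0.5, 0.01
    h_norm = 1.0 / theta_G(sigma, delta)  # largest Euclidean norm with pi_G(h) <= 1
    print("bound at pi_G(h) = 1:", chernoff_bound(h_norm, sigma))
    print("exact Gaussian tail :", gaussian_tail(h_norm, sigma))
```

At $\pi_G(h)=1$ the bound equals $\delta$ exactly, while the true Gaussian tail is noticeably smaller, which reflects the usual slack of exponential moment bounds.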
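5.1.3.2 Discrete case
Given $x\in\mathcal{X}$, setting $\mu=Ax$ and $\eta_k=\zeta_k-\mu$, we get
\[
\omega=Ax+\underbrace{\frac1K\sum_{k=1}^K\eta_k}_{\xi_x}.
\]
Given $h\in\mathbf{R}^m$,
\[
h^T\xi_x=\frac1K\sum_k\underbrace{h^T\eta_k}_{\chi_k}.
\]
Random variables $\chi_1,...,\chi_K$ are independent zero mean and clearly satisfy
\[
\mathbf{E}\left\{\chi_k^2\right\}\le\sum_i[Ax]_ih_i^2,\qquad |\chi_k|\le2\|h\|_\infty\ \ \forall k.
\]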
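The two properties of $\chi_k$ just stated are easy to confirm by direct enumeration, since $\zeta_k$ takes only $m$ values. The following minimal sketch computes the mean, the second moment and the largest absolute value of $\chi=h^T(\zeta-\mu)$ for a given probability vector $\mu$ (playing the role of $Ax$); the function name and the sample data are illustrative assumptions.

```python
def chi_moments(mu, h):
    """For zeta = e_i with probability mu_i and chi = h^T (zeta - mu),
    return (E chi, E chi^2, max |chi| over outcomes of positive probability)."""
    h_mu = sum(m * hi for m, hi in zip(mu, h))          # h^T mu
    values = [hi - h_mu for hi in h]                    # value of chi when zeta = e_i
    mean = sum(m * v for m, v in zip(mu, values))
    second = sum(m * v * v for m, v in zip(mu, values))
    max_abs = max(abs(v) for m, v in zip(mu, values) if m > 0)
    return mean, second, max_abs

if __name__ == "__main__":
    mu = [0.5, 0.3, 0.2]
    h = [1.0, -2.0, 3.0]
    mean, second, max_abs = chi_moments(mu, h)
    variance_bound = sum(m * hi * hi for m, hi in zip(mu, h))   # sum_i mu_i h_i^2
    range_bound = 2.0 * max(abs(hi) for hi in h)                # 2 ||h||_inf
    print(mean, second, variance_bound, max_abs, range_bound)
```

In fact $\mathbf{E}\{\chi^2\}=\sum_i\mu_ih_i^2-(h^T\mu)^2$, so the variance bound used in the text is attained only when $h^T\mu=0$.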
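When applying Bernstein's inequality² we get (cf. Exercise 4.19)
\[
\begin{array}{rcl}
\mathrm{Prob}\{|h^T\xi_x|>1\}&=&\mathrm{Prob}\left\{\left|\sum_k\chi_k\right|>K\right\}\\
&\le&2\exp\left\{-\dfrac{K}{2\sum_i[Ax]_ih_i^2+\frac43\|h\|_\infty}\right\}.
\end{array}
\tag{5.7}
\]
Setting
\[
\pi_D(h)=\sqrt{\vartheta_D^2\max_{x\in\mathcal{X}}\sum_i[Ax]_ih_i^2+\varrho_D^2\|h\|_\infty^2},\qquad
\vartheta_D=2\sqrt{\frac{\ln(2/\delta)}{K}},\quad \varrho_D=\frac{8\ln(2/\delta)}{3K},
\]
after a completely straightforward computation, we conclude from (5.7) that
\[
\pi_D(h)\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\delta,\ \ \forall x\in\mathcal{X}.
\]
Thus, in the Discrete case we can set
\[
\mathcal{H}_\delta=\mathcal{H}^D_\delta:=\{h:\ \pi_D(h)\le1\}.
\]
² The classical Bernstein inequality states that if $X_1,...,X_K$ are independent zero mean scalar random variables with finite variances $\sigma_k^2$ such that $|X_k|\le M$ a.s., then for every $t>0$ one has
\[
\mathrm{Prob}\{X_1+...+X_K>t\}\le\exp\left\{-\frac{t^2}{2\left[\sum_k\sigma_k^2+\frac13Mt\right]}\right\}.
\]
5.1.3.3 Poisson case
In the Poisson case, for $x\in\mathcal{X}$, setting $\mu=Ax$, we have
\[
\omega=Ax+\xi_x,\quad \xi_x=\omega-\mu.
\]
It turns out that for every $h\in\mathbf{R}^m$ one has
\[
\forall t\ge0:\ \mathrm{Prob}\left\{|h^T\xi_x|\ge t\right\}\le2\exp\left\{-\frac{t^2}{2\left[\sum_ih_i^2\mu_i+\frac13\|h\|_\infty t\right]}\right\}
\tag{5.8}
\]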
(for verification, see Exercise 4.21 or Section 5.4.1). As a result, we conclude via a straightforward computation that setting
\[
\pi_P(h)=\sqrt{\vartheta_P^2\max_{x\in\mathcal{X}}\sum_i[Ax]_ih_i^2+\varrho_P^2\|h\|_\infty^2},\qquad
\vartheta_P=2\sqrt{\ln(2/\delta)},\quad \varrho_P=\frac43\ln(2/\delta),
\]
we ensure that
\[
\pi_P(h)\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\delta,\ \ \forall x\in\mathcal{X}.
\]
Thus, in the Poisson case we can set
\[
\mathcal{H}_\delta=\mathcal{H}^P_\delta:=\{h:\ \pi_P(h)\le1\}.
\]
5.1.4 Efficient upper-bounding of $\mathfrak{R}[H]$ and contrast design, I.
The scheme for upper-bounding $\mathfrak{R}[H]$ to be presented in this section (an alternative, completely different, scheme will be presented in Section 5.1.5) is inspired by our motivating example. Note that there is a special case of (5.5) where $\mathfrak{R}[H]$ is easy to compute, namely the case where $\|\cdot\|$ is the uniform norm $\|\cdot\|_\infty$, whence
\[
\mathfrak{R}[H]=\widehat{\mathfrak{R}}[H]:=2\max_{i\le\nu}\max_x\left\{\mathrm{Row}_i^T[B]x:\ x\in\mathcal{X}_{\mathrm{s}},\ \|H^TAx\|_\infty\le1\right\}
\]
is the maximum of $\nu$ efficiently computable convex functions. It turns out that when $\|\cdot\|=\|\cdot\|_\infty$, it is not only easy to compute $\mathfrak{R}[H]$, but to optimize this risk bound in $H$ as well.³ These observations underlie the forthcoming developments in this section: under appropriate assumptions, we bound the risk of a polyhedral estimate with contrast matrix $H$ via the efficiently computable quantity $\widehat{\mathfrak{R}}[H]$ and then show that the resulting risk bounds can be efficiently optimized w.r.t. $H$. We shall also see that in some "simple for analytical analysis" situations, like that of the example, the resulting estimates are nearly minimax optimal.
5.1.4.1 Assumptions
We stay within the setup introduced in Section 5.1.1 which we augment with the following assumptions:
A.1. $\|\cdot\|=\|\cdot\|_r$ with $r\in[1,\infty]$.
A.2. We have at our disposal a sequence $\gamma=\{\gamma_i>0,\ i\le\nu\}$ and $\rho\in[1,\infty]$ such that the image of $\mathcal{X}_{\mathrm{s}}$ under the mapping $x\mapsto Bx$ is contained in the "scaled $\|\cdot\|_\rho$-ball"
\[
\mathcal{Y}=\{y\in\mathbf{R}^\nu:\ \|\mathrm{Diag}\{\gamma\}y\|_\rho\le1\}.
\tag{5.9}
\]
³ On closer inspection, in the situation considered in the motivating example the $\|\cdot\|_\infty$-optimal contrast matrix $H$ is proportional to the unit matrix, and the quantity $\widehat{\mathfrak{R}}[H]$ can be easily translated into an upper bound on, say, the $\|\cdot\|_2$-risk of the associated polyhedral estimate.
5.1.4.2 Simple observation
Let $B_\ell^T$ be the $\ell$-th row in $B$, $1\le\ell\le\nu$. Let us make the following observation:
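Proposition 5.2. In the situation described in Section 5.1.1, let us assume that Assumptions A.1-2 hold. Let $\epsilon\in(0,1)$ and let a positive integer $N\ge\nu$ be given; let also $\pi(\cdot)$ be a norm on $\mathbf{R}^m$ such that
\[
\forall(h:\ \pi(h)\le1,\ x\in\mathcal{X}):\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\epsilon/N.
\]
Next, let a matrix $H=[H_1,...,H_\nu]$ with $H_\ell\in\mathbf{R}^{m\times m_\ell}$, $m_\ell\ge1$, and positive reals $\varsigma_\ell$, $\ell\le\nu$, satisfy the relations
\[
\begin{array}{ll}
(a)&\pi(\mathrm{Col}_j[H])\le1,\ 1\le j\le N;\\
(b)&\max_x\left\{B_\ell^Tx:\ x\in\mathcal{X}_{\mathrm{s}},\ \|H_\ell^TAx\|_\infty\le1\right\}\le\varsigma_\ell,\ 1\le\ell\le\nu.
\end{array}
\tag{5.10}
\]
Then the quantity $\mathfrak{R}[H]$ as defined in (5.5) can be upper-bounded as follows:
\[
\mathfrak{R}[H]\le\Psi(\varsigma):=2\max_w\left\{\|[w_1/\gamma_1;...;w_\nu/\gamma_\nu]\|_r:\ \|w\|_\rho\le1,\ 0\le w_\ell\le\gamma_\ell\varsigma_\ell,\ \ell\le\nu\right\},
\tag{5.11}
\]
which combines with Proposition 5.1 to imply that
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}_H|\mathcal{X}]\le\Psi(\varsigma).
\tag{5.12}
\]
Function $\Psi$ is nondecreasing on the nonnegative orthant and is easy to compute.
Proof. Let $z=2\bar{z}$ be a feasible solution to (5.5), thus $\bar{z}\in\mathcal{X}_{\mathrm{s}}$ and $\|H^TA\bar{z}\|_\infty\le1$. Let $y=B\bar{z}$, so that $y\in\mathcal{Y}$ (see (5.9)) due to $\bar{z}\in\mathcal{X}_{\mathrm{s}}$ and A.2. Thus $\|\mathrm{Diag}\{\gamma\}y\|_\rho\le1$. Besides this, by (5.10.b) relations $\bar{z}\in\mathcal{X}_{\mathrm{s}}$ and $\|H^TA\bar{z}\|_\infty\le1$ combine with the symmetry of $\mathcal{X}_{\mathrm{s}}$ w.r.t. the origin to imply that
\[
|y_\ell|=|B_\ell^T\bar{z}|\le\varsigma_\ell,\ \ell\le\nu.
\]
Taking into account that $\|\cdot\|=\|\cdot\|_r$ by A.1, we see that
\[
\begin{array}{rcl}
\mathfrak{R}[H]&=&\max_z\left\{\|Bz\|_r:\ z\in2\mathcal{X}_{\mathrm{s}},\ \|H^TAz\|_\infty\le2\right\}\\
&\le&2\max_y\left\{\|y\|_r:\ |y_\ell|\le\varsigma_\ell,\ \ell\le\nu,\ \&\ \|\mathrm{Diag}\{\gamma\}y\|_\rho\le1\right\}\\
&=&2\max_w\left\{\|[w_1/\gamma_1;...;w_\nu/\gamma_\nu]\|_r:\ \|w\|_\rho\le1,\ 0\le w_\ell\le\gamma_\ell\varsigma_\ell,\ \ell\le\nu\right\},
\end{array}
\]
as stated in (5.11). It is evident that $\Psi$ is nondecreasing on the nonnegative orthant. Computing $\Psi$ can be carried out as follows:
1. When $r=\infty$, we need to compute $\max_{\ell\le\nu}\max_w\{w_\ell/\gamma_\ell:\ \|w\|_\rho\le1,\ 0\le w_j\le\gamma_j\varsigma_j,\ j\le\nu\}$, so that evaluating $\Psi$ reduces to solving $\nu$ simple convex optimization problems;
2. When $\rho=\infty$, we clearly have $\Psi(\varsigma)=2\|[\bar{w}_1/\gamma_1;...;\bar{w}_\nu/\gamma_\nu]\|_r$, $\bar{w}_\ell=\min[1,\gamma_\ell\varsigma_\ell]$;
3. When $1\le r,\rho<\infty$, passing from variables $w_\ell$ to variables $u_\ell=w_\ell^\rho$, we get
\[
\Psi^r(\varsigma)=2^r\max_u\left\{\sum_\ell\gamma_\ell^{-r}u_\ell^{r/\rho}:\ \sum_\ell u_\ell\le1,\ 0\le u_\ell\le(\gamma_\ell\varsigma_\ell)^\rho\right\}.
\]
When $r\le\rho$, the optimization problem on the right-hand side is the easily solvable problem of maximizing a simple concave function over a simple convex compact set. When $\infty>r>\rho$, this problem can be solved by Dynamic Programming. ✷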
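The two closed-form cases in the recipe for computing $\Psi$ can be sketched in a few lines. The code below covers $\rho=\infty$ (item 2, with the factor 2 from (5.11)) and $r=\infty$ (item 1), where each inner problem is solved by putting all the weight on a single coordinate, giving $w_\ell=\min[1,\gamma_\ell\varsigma_\ell]$. The function names and the sample data are illustrative assumptions; the remaining case $1\le r,\rho<\infty$ requires an optimization routine and is not covered here.

```python
import math

def psi_rho_inf(varsigma, gamma, r):
    """Psi(varsigma) from (5.11) when rho = infinity:
    2 * ||[w_1/gamma_1; ...]||_r with w_l = min(1, gamma_l * varsigma_l)."""
    ratios = [min(1.0, g * s) / g for g, s in zip(gamma, varsigma)]
    if r == math.inf:
        return 2.0 * max(ratios)
    return 2.0 * sum(v ** r for v in ratios) ** (1.0 / r)

def psi_r_inf(varsigma, gamma):
    """Psi(varsigma) from (5.11) when r = infinity (any rho in [1, infinity]):
    for each l the best w has w_l = min(1, gamma_l * varsigma_l) and zeros elsewhere."""
    return 2.0 * max(min(1.0 / g, s) for g, s in zip(gamma, varsigma))

if __name__ == "__main__":
    gamma = [1.0, 2.0]
    varsigma = [0.5, 1.0]
    print(psi_rho_inf(varsigma, gamma, 2))         # Euclidean recovery norm, rho = infinity
    print(psi_rho_inf(varsigma, gamma, math.inf))  # uniform recovery norm, rho = infinity
    print(psi_r_inf(varsigma, gamma))              # uniform recovery norm, arbitrary rho
```

For $r=\infty$ both functions agree, as they should, and neither exceeds $2\max_\ell\varsigma_\ell$.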
Comment. When we want to recover $Bx$ in $\|\cdot\|_\infty$ (i.e., we are in the case of $r=\infty$), under the premise of Proposition 5.2 we clearly have $\Psi(\varsigma)\le2\max_\ell\varsigma_\ell$, resulting in the bound
\[
\mathrm{Risk}_{\epsilon,\|\cdot\|_\infty}[\widehat{x}_H|\mathcal{X}]\le2\max_{\ell\le\nu}\varsigma_\ell.
\]
Note that this bound in fact does not require Assumption A.2 (since it is satisfied for any $\rho$ with large enough $\gamma_i$'s).
5.1.4.3 Specifying contrasts
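Risk bound (5.12) allows for an easy design of contrast matrices. Recalling that $\Psi$ is monotone on the nonnegative orthant, all we need is to select $h_\ell$'s satisfying (5.10) and resulting in the smallest possible $\varsigma_\ell$'s, which is what we are about to do now.
Preliminaries. Given a vector $b\in\mathbf{R}^n$ and a norm $s(\cdot)$ on $\mathbf{R}^m$, consider convex-concave saddle point problem
\[
\mathrm{Opt}=\inf_{g\in\mathbf{R}^m}\max_{x\in\mathcal{X}_{\mathrm{s}}}\left[\phi(g,x):=[b-A^Tg]^Tx+s(g)\right]
\tag{$SP$}
\]
along with the induced primal and dual problems
\[
\begin{array}{rcl}
\mathrm{Opt}(P)&=&\inf_{g\in\mathbf{R}^m}\left[\overline{\phi}(g):=\max_{x\in\mathcal{X}_{\mathrm{s}}}\phi(g,x)\right]\\
&=&\inf_{g\in\mathbf{R}^m}\left[s(g)+\max_{x\in\mathcal{X}_{\mathrm{s}}}[b-A^Tg]^Tx\right]
\end{array}
\tag{$P$}
\]
and
\[
\begin{array}{rcl}
\mathrm{Opt}(D)&=&\max_{x\in\mathcal{X}_{\mathrm{s}}}\left[\underline{\phi}(x):=\inf_{g\in\mathbf{R}^m}\phi(g,x)\right]\\
&=&\max_{x\in\mathcal{X}_{\mathrm{s}}}\left[\inf_{g\in\mathbf{R}^m}\left[b^Tx-[Ax]^Tg+s(g)\right]\right]\\
&=&\max_x\left\{b^Tx:\ x\in\mathcal{X}_{\mathrm{s}},\ q(Ax)\le1\right\}
\end{array}
\tag{$D$}
\]
where $q(\cdot)$ is the norm conjugate to $s(\cdot)$ (we have used the evident fact that $\inf_{g\in\mathbf{R}^m}[f^Tg+s(g)]$ is either $-\infty$ or 0 depending on whether $q(f)>1$ or $q(f)\le1$). Since $\mathcal{X}_{\mathrm{s}}$ is compact, we have $\mathrm{Opt}(P)=\mathrm{Opt}(D)=\mathrm{Opt}$ by the Sion-Kakutani Theorem. Besides this, $(D)$ is solvable (evident) and $(P)$ is solvable as well, since $\overline{\phi}(g)$ is continuous due to the compactness of $\mathcal{X}_{\mathrm{s}}$ and $\overline{\phi}(g)\ge s(g)$, so that $\overline{\phi}(\cdot)$ has bounded level sets. Let $\bar{g}$ be an optimal solution to $(P)$, let $\bar{x}$ be an optimal solution to $(D)$, and let $\bar{h}$ be the $s(\cdot)$-unit normalization of $\bar{g}$, so that $s(\bar{h})=1$ and $\bar{g}=s(\bar{g})\bar{h}$. Now let us make the following observation:
Observation 5.3. In the situation in question, we have
\[
\max_x\left\{|b^Tx|:\ x\in\mathcal{X}_{\mathrm{s}},\ |\bar{h}^TAx|\le1\right\}\le\mathrm{Opt}.
\tag{5.13}
\]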
In addition, for any matrix $G = [g^1, ..., g^M] \in \mathbf{R}^{m\times M}$ with $s(g^j) \le 1$, $j \le M$, one has
$$\max_x\left\{|b^Tx| : x \in \mathcal{X}_s,\ \|G^TAx\|_\infty \le 1\right\} = \max_x\left\{b^Tx : x \in \mathcal{X}_s,\ \|G^TAx\|_\infty \le 1\right\} \ge \mathrm{Opt}. \tag{5.14}$$
Proof. Let $x$ be a feasible solution to the problem in (5.13). Replacing, if necessary, $x$ with $-x$, we can assume that $|b^Tx| = b^Tx$. We now have
$$\begin{array}{rcl}|b^Tx| &=& b^Tx = [\bar g^TAx - s(\bar g)] + \underbrace{[b - A^T\bar g]^Tx + s(\bar g)}_{\le\,\overline{\phi}(\bar g) = \mathrm{Opt}(P)}\\ &\le& \mathrm{Opt}(P) + [s(\bar g)\bar h^TAx - s(\bar g)] \le \mathrm{Opt}(P) + s(\bar g)\underbrace{|\bar h^TAx|}_{\le 1} - s(\bar g)\\ &\le& \mathrm{Opt}(P) = \mathrm{Opt},\end{array}$$
as claimed in (5.13). Now, the equality in (5.14) is due to the symmetry of $\mathcal{X}_s$ w.r.t. the origin. To verify the inequality in (5.14), note that $\bar x$ satisfies the relations $\bar x \in \mathcal{X}_s$ and $q(A\bar x) \le 1$, implying, due to the fact that the columns of $G$ are of $s(\cdot)$-norm $\le 1$, that $\bar x$ is a feasible solution to the optimization problems in (5.14). As a result, the second quantity in (5.14) is at least $b^T\bar x = \mathrm{Opt}(D) = \mathrm{Opt}$, and (5.14) follows. ✷
Comment. Note that problem $(P)$ has a very transparent origin. In the situation of Section 5.1.1, assume that our goal is to estimate, given observation $\omega = Ax + \xi_x$, the value at $x \in \mathcal{X}$ of the linear function $b^Tx$, and we want to use for this purpose an estimate $\widehat{g}(\omega) = g^T\omega + \gamma$ affine in $\omega$. Given $\epsilon \in (0,1)$, how do we construct an estimate that is presumably good in terms of its $\epsilon$-risk? Let us show that a meaningful answer is yielded by the optimal solution to $(P)$. Indeed, we have
$$b^Tx - \widehat{g}(Ax + \xi_x) = [b - A^Tg]^Tx - \gamma - g^T\xi_x.$$
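Assume that we have at our disposal a norm $s(\cdot)$ on $\mathbf{R}^m$ such that
$$\forall (h \in \mathbf{R}^m,\ s(h) \le 1,\ x \in \mathcal{X}):\ \mathrm{Prob}\{\xi_x : |h^T\xi_x| > 1\} \le \epsilon,$$
or, which is the same,
$$\forall (g \in \mathbf{R}^m,\ x \in \mathcal{X}):\ \mathrm{Prob}\{\xi_x : |g^T\xi_x| > s(g)\} \le \epsilon.$$
Then we can safely upper-bound the $\epsilon$-risk of a candidate estimate $\widehat{g}(\cdot)$ by the quantity
$$\rho = \underbrace{\max_{x\in\mathcal{X}}\left|[b - A^Tg]^Tx - \gamma\right|}_{\text{bias } B(g,\gamma)} + s(g).$$
Observe that for $g$ fixed, the minimal, over $\gamma$, bias is
$$M(g) := \max_{x\in\mathcal{X}_s}[b - A^Tg]^Tx.$$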
Postponing verification of this claim, here is the conclusion: in the present setting, problem $(P)$ is nothing but the problem of building the affine estimate of the linear function $b^Tx$ which is the best in terms of the upper bound $\rho$ on the $\epsilon$-risk.
It remains to justify the above claim, which is immediate: on one hand, for all $u \in \mathcal{X}$, $v \in \mathcal{X}$ we have
$$B(g,\gamma) \ge [b - A^Tg]^Tu - \gamma,\qquad B(g,\gamma) \ge -[b - A^Tg]^Tv + \gamma,$$
implying that $B(g,\gamma) \ge \frac{1}{2}[b - A^Tg]^T[u - v]$ $\forall (u \in \mathcal{X}, v \in \mathcal{X})$, that is, $B(g,\gamma) \ge M(g)$. On the other hand, let
$$M_+(g) = \max_{x\in\mathcal{X}}[b - A^Tg]^Tx,\qquad M_-(g) = -\min_{x\in\mathcal{X}}[b - A^Tg]^Tx,$$
so that $M(g) = \frac{1}{2}[M_+(g) + M_-(g)]$. Setting $\bar\gamma = \frac{1}{2}[M_+(g) - M_-(g)]$, we have
$$\begin{array}{rcl}\max_{x\in\mathcal{X}}\left[[b - A^Tg]^Tx - \bar\gamma\right] &=& M_+(g) - \bar\gamma = \frac{1}{2}[M_+(g) + M_-(g)] = M(g),\\ \min_{x\in\mathcal{X}}\left[[b - A^Tg]^Tx - \bar\gamma\right] &=& -M_-(g) - \bar\gamma = -\frac{1}{2}[M_+(g) + M_-(g)] = -M(g).\end{array}$$
That is, $B(g,\bar\gamma) = M(g)$. Combining these observations, we arrive at $\min_\gamma B(g,\gamma) = M(g)$, as claimed.
✷
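The claim $\min_\gamma B(g,\gamma) = M(g)$ is easy to check numerically. The sketch below assumes, purely for illustration, that the signal set is the convex hull of finitely many points, so that only the finitely many values $[b - A^Tg]^Tx$ at these points matter; the names `bias` and `best_shift` and the sample values are hypothetical.

```python
def bias(values, gamma):
    """B(g, gamma) = max_x |v_x - gamma|, where v_x = [b - A^T g]^T x."""
    return max(abs(v - gamma) for v in values)


def best_shift(values):
    """Return (gamma_bar, M) with gamma_bar = (M_plus - M_minus)/2 and
    M = (M_plus + M_minus)/2, following the argument in the text."""
    m_plus = max(values)
    m_minus = -min(values)
    return 0.5 * (m_plus - m_minus), 0.5 * (m_plus + m_minus)


# Hypothetical values of the linear form at the extreme points of the signal set.
values = [-1.0, 0.5, 3.0]
gamma_bar, m_value = best_shift(values)

# Crude grid search over gamma, used only as a cross-check.
grid = [-5.0 + 0.01 * k for k in range(1001)]
grid_min = min(bias(values, g) for g in grid)
```

Here $\bar\gamma$ is the midpoint of the range of the linear form and $M(g)$ is half its width, which is exactly its maximum over the symmetrization $\mathcal{X}_s$.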
Contrast design. Proposition 5.2 and Observation 5.3 allow for a straightforward solution of the associated contrast design problem, at least in the case of sub-Gaussian, Discrete, and Poisson observation schemes. Indeed, in these cases, when designing a contrast matrix with $N$ columns, with our approach we are supposed to select its columns in the respective sets $\mathcal{H}_{\epsilon/N}$; see Section 5.1.3. Note that these sets, while shrinking as $N$ grows, are "nearly independent" of $N$, since the norms $\pi_G$, $\pi_D$, $\pi_P$ in the description of the respective sets $\mathcal{H}^G_\delta$, $\mathcal{H}^D_\delta$, $\mathcal{H}^P_\delta$ depend on $1/\delta$ via factors logarithmic in $1/\delta$. It follows that we lose nearly nothing when assuming that $N \ge \nu$. Let us act as follows:
We set $N = \nu$, specify $\bar\pi(\cdot)$ as the norm ($\pi_G$, or $\pi_D$, or $\pi_P$) associated with the observation scheme (sub-Gaussian, or Discrete, or Poisson) in question and $\delta = \epsilon/\nu$. We solve $\nu$ convex optimization problems
$$\mathrm{Opt}_\ell = \min_{g\in\mathbf{R}^m}\left[\overline{\phi}_\ell(g) := \max_{x\in\mathcal{X}_s}\phi_\ell(g,x)\right],\qquad \phi_\ell(g,x) = [B_\ell - A^Tg]^Tx + \bar\pi(g). \tag{$P_\ell$}$$
Next, we convert optimal solution $g_\ell$ to $(P_\ell)$ into vector $h_\ell \in \mathbf{R}^m$ by representing $g_\ell = \bar\pi(g_\ell)h_\ell$ with $\bar\pi(h_\ell) = 1$, and set $H_\ell = h_\ell$. As a result, we obtain an $m\times\nu$ contrast matrix $H = [h_1, ..., h_\nu]$ which, taken along with $N = \nu$, quantities
$$\varsigma_\ell = \mathrm{Opt}_\ell,\ 1 \le \ell \le \nu, \tag{5.15}$$
and with $\pi(\cdot) \equiv \bar\pi(\cdot)$, in view of the first claim in Observation 5.3 as applied with $s(\cdot) \equiv \bar\pi(\cdot)$, satisfies the premise of Proposition 5.2. Consequently, by Proposition 5.2 we have
$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}^H|\mathcal{X}] \le \Psi([\mathrm{Opt}_1; ...; \mathrm{Opt}_\nu]). \tag{5.16}$$
Comment. Optimality of the outlined contrast design for the sub-Gaussian, or Discrete, or Poisson observation scheme stems, within the framework set by Proposition 5.2, from the second claim of Observation 5.3, which states that when N ≥ ν and the columns of the m × N contrast matrix H = [H1 , ..., Hν ] belong to the set Hǫ/N associated with the observation scheme in question—i.e., the norm π(·) in the proposition is the norm πG , or πD , or πP associated with δ = ǫ/N —the quantities ςℓ participating in (5.10.b) cannot be less than Optℓ .
Indeed, the norm $\pi(\cdot)$ from Proposition 5.2 is $\ge$ the norm $\bar\pi(\cdot)$ participating in $(P_\ell)$ (because the value $\epsilon/N$ in the definition of $\pi(\cdot)$ is at most $\epsilon/\nu$), implying, by (5.10.a), that the columns of matrix $H$ obeying the premise of the proposition satisfy the relation $\bar\pi(\mathrm{Col}_j[H]) \le 1$. Invoking the second part of Observation 5.3 with $s(\cdot) \equiv \bar\pi(\cdot)$, $b = B_\ell$, and $G = H_\ell$, and taking (5.10.b) into account, we conclude that $\varsigma_\ell \ge \mathrm{Opt}_\ell$ for all $\ell$, as claimed.
Since the bound on the risk of a polyhedral estimate offered by Proposition 5.2 improves as the $\varsigma_\ell$'s decrease, we see that as far as this bound is concerned, the outlined design procedure is the best possible, provided $N \ge \nu$. An attractive feature of the contrast design we have just presented is that it is completely independent of the entities participating in assumptions A.1-2: these entities affect theoretical risk bounds of the resulting polyhedral estimate, but not the estimate itself.
5.1.4.4 Illustration: Diagonal case
Let us consider the diagonal case of our estimation problem, where
• $\mathcal{X} = \{x \in \mathbf{R}^n : \|Dx\|_\rho \le 1\}$, where $D$ is a diagonal matrix with positive diagonal entries $D_{\ell\ell} =: d_\ell$,
• $m = \nu = n$, and $A$ and $B$ are diagonal matrices with diagonal entries $0 < A_{\ell\ell} =: a_\ell$, $0 < B_{\ell\ell} =: b_\ell$,
• $\|\cdot\| = \|\cdot\|_r$,
• we are in the sub-Gaussian case, that is, observation noise $\xi_x$ is $(0, \sigma^2I_n)$-sub-Gaussian for every $x \in \mathcal{X}$.
Let us implement the approach developed in Sections 5.1.4.1–5.1.4.3.
1. Given reliability tolerance $\epsilon$, we set
$$\delta = \epsilon/n,\qquad \vartheta_G := \sigma\sqrt{2\ln(2/\delta)} = \sigma\sqrt{2\ln(2n/\epsilon)}, \tag{5.17}$$
and
$$\mathcal{H} = \mathcal{H}^G_\delta = \{h \in \mathbf{R}^n : \pi_G(h) := \vartheta_G\|h\|_2 \le 1\}.$$
2. We solve $\nu = n$ convex optimization problems $(P_\ell)$ associated with $\bar\pi(\cdot) \equiv \pi_G(\cdot)$, which is immediate: the resulting contrast matrix is $H = \vartheta_G^{-1}I_n$, and
$$\mathrm{Opt}_\ell = \varsigma_\ell := b_\ell\min[\vartheta_G/a_\ell,\ 1/d_\ell]. \tag{5.18}$$
Risk analysis. The $(\epsilon, \|\cdot\|)$-risk of the resulting polyhedral estimate $\widehat{x}(\cdot)$ can be bounded by Proposition 5.2. Note that setting $\gamma_\ell = d_\ell/b_\ell$, $1 \le \ell \le n$, we meet assumptions A.1-2, and the above choice of $H$, $N = n$, and $\varsigma_\ell$ satisfies the premise of Proposition 5.2. By this proposition,
$$\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat{x}^H|\mathcal{X}] \le \Psi := 2\max_w\left\{\|[w_1/\gamma_1; ...; w_n/\gamma_n]\|_r : \|w\|_\rho \le 1,\ 0 \le w_\ell \le \gamma_\ell\varsigma_\ell\right\}. \tag{5.19}$$
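As a quick numerical illustration of (5.17) and (5.18), the following sketch evaluates $\vartheta_G$ and the quantities $\varsigma_\ell$ for given diagonals. The function names and the sample diagonals $a_\ell = \ell^{-1}$, $b_\ell = 1$, $d_\ell = \ell$ are assumptions chosen for the example only.

```python
import math


def theta_g(sigma: float, n: int, eps: float) -> float:
    """vartheta_G = sigma * sqrt(2 ln(2 n / eps)), as in (5.17)."""
    return sigma * math.sqrt(2.0 * math.log(2.0 * n / eps))


def varsigma_diag(a, b, d, theta):
    """varsigma_l = b_l * min(theta / a_l, 1 / d_l), as in (5.18)."""
    return [bl * min(theta / al, 1.0 / dl) for al, bl, dl in zip(a, b, d)]


# Hypothetical problem data.
n, sigma, eps = 4, 0.1, 0.05
a = [1.0 / l for l in range(1, n + 1)]
b = [1.0] * n
d = [float(l) for l in range(1, n + 1)]

theta = theta_g(sigma, n, eps)
vs = varsigma_diag(a, b, d, theta)
```

For small $\ell$ the noise term $\vartheta_G/a_\ell$ is the active one, while for larger $\ell$ the a priori bound $1/d_\ell$ takes over; in this sample the switch happens between $\ell = 1$ and $\ell = 2$.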
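Let us work out what happens in the simple case where
$$\begin{array}{ll}(a) & 1 \le \rho \le r < \infty,\\ (b) & a_\ell/d_\ell \text{ and } b_\ell/a_\ell \text{ are nonincreasing in } \ell.\end{array} \tag{5.20}$$
Proposition 5.4. In the simple case just defined, let $\bar n = n$ when
$$\sum_{\ell=1}^n(\vartheta_Gd_\ell/a_\ell)^\rho \le 1;$$
otherwise let $\bar n$ be the smallest integer such that
$$\sum_{\ell=1}^{\bar n}(\vartheta_Gd_\ell/a_\ell)^\rho > 1,$$
with $\vartheta_G$ given by (5.17). Then for the contrast matrix $H = \vartheta_G^{-1}I_n$ one has
$$\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat{x}^H|\mathcal{X}] \le \Psi \le 2\left[\sum_{\ell=1}^{\bar n}(\vartheta_Gb_\ell/a_\ell)^r\right]^{1/r}.$$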
Proof. Consider the optimization problem specifying $\Psi$ in (5.19). Setting $\theta = r/\rho \ge 1$, let us pass in this problem from variables $w_\ell$ to variables $z_\ell = w_\ell^\rho$, so that
$$\Psi^r = 2^r\max_z\left\{\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r : \sum_\ell z_\ell \le 1,\ 0 \le z_\ell \le (d_\ell\varsigma_\ell/b_\ell)^\rho\right\} \le 2^r\Gamma,$$
where
$$\Gamma = \max_z\left\{\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r : \sum_\ell z_\ell \le 1,\ 0 \le z_\ell \le \chi_\ell := (\vartheta_Gd_\ell/a_\ell)^\rho\right\}$$
(we have used (5.18)). Note that $\Gamma$ is the optimal value in the problem of maximizing a convex (since $\theta \ge 1$) function $\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r$ over a bounded polyhedral set, so that the maximum is attained at an extreme point $\bar z$ of the feasible set. By the standard characterization of extreme points, the (clearly nonempty) set $I$ of positive entries in $\bar z$ is as follows. Let us denote by $I'$ the set of indexes $\ell \in I$ such that $\bar z_\ell$ is on its upper bound $\bar z_\ell = \chi_\ell$; note that the cardinality $|I'|$ of $I'$ is at least $|I| - 1$. Since $\sum_{\ell\in I'}\bar z_\ell = \sum_{\ell\in I'}\chi_\ell \le 1$ and $\chi_\ell$ are nondecreasing in $\ell$ by (5.20.b), we conclude that
$$\sum_{\ell=1}^{|I'|}\chi_\ell \le 1,$$
implying that $|I'| < \bar n$ provided that $\bar n < n$, so that in this case $|I| \le \bar n$; and of course $|I| \le \bar n$ when $\bar n = n$. Next, we have
$$\Gamma = \sum_{\ell\in I}\bar z_\ell^\theta(b_\ell/d_\ell)^r \le \sum_{\ell\in I}\chi_\ell^\theta(b_\ell/d_\ell)^r = \sum_{\ell\in I}(\vartheta_Gb_\ell/a_\ell)^r,$$
and since $b_\ell/a_\ell$ is nonincreasing in $\ell$ and $|I| \le \bar n$, the latter quantity is at most $\sum_{\ell=1}^{\bar n}(\vartheta_Gb_\ell/a_\ell)^r$. ✷
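The quantities in Proposition 5.4 are straightforward to evaluate. The sketch below computes the index $\bar n$ and the resulting bound $2[\sum_{\ell\le\bar n}(\vartheta_Gb_\ell/a_\ell)^r]^{1/r}$; the function names and the sample data are hypothetical, and the code does not verify the monotonicity assumptions (5.20), which the caller is expected to ensure.

```python
def n_bar(a, d, theta, rho):
    """Smallest index at which the partial sums of (theta * d_l / a_l)^rho
    exceed 1, or n if they never do."""
    total = 0.0
    for idx, (al, dl) in enumerate(zip(a, d), start=1):
        total += (theta * dl / al) ** rho
        if total > 1.0:
            return idx
    return len(a)


def risk_bound(a, b, d, theta, rho, r):
    """Upper bound of Proposition 5.4 on the risk of the polyhedral estimate."""
    nb = n_bar(a, d, theta, rho)
    return 2.0 * sum((theta * bl / al) ** r for al, bl in zip(a[:nb], b[:nb])) ** (1.0 / r)


# Hypothetical data with a_l = b_l = 1/l and d_l = l, so that a_l/d_l and
# b_l/a_l are nonincreasing, and rho = r = 2.
a = [1.0, 0.5, 1.0 / 3.0, 0.25]
b = list(a)
d = [1.0, 2.0, 3.0, 4.0]
theta = 0.3

nb = n_bar(a, d, theta, rho=2.0)
bound = risk_bound(a, b, d, theta, rho=2.0, r=2.0)
```

With these numbers the partial sums are $0.09$ and $0.09 + 1.44 > 1$, so $\bar n = 2$ and only the first two coordinates contribute to the bound.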
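Application. Consider the "standard case" [72, 74] where
$$0 < \sqrt{\ln(2n/\epsilon)}\,\sigma \le 1,\quad a_\ell = \ell^{-\alpha},\quad b_\ell = \ell^{-\beta},\quad d_\ell = \ell^\kappa$$
with $\beta \ge \alpha \ge 0$, $\kappa \ge 0$ and $(\beta - \alpha)r < 1$. In this case for large $n$, namely,
$$n \ge c\,\vartheta_G^{-\frac{1}{\alpha+\kappa+1/\rho}}\qquad [\vartheta_G = \sigma\sqrt{2\ln(2n/\epsilon)}] \tag{5.21}$$
(here and in what follows, the factors denoted by $c$ and $C$ depend solely on $\alpha, \beta, \kappa, r, \rho$) we get
$$\bar n \le C\vartheta_G^{-\frac{1}{\alpha+\kappa+1/\rho}},$$
resulting in
$$\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat{x}|\mathcal{X}] \le C\vartheta_G^{\frac{\beta+\kappa+1/\rho-1/r}{\alpha+\kappa+1/\rho}}. \tag{5.22}$$
Setting $x = D^{-1}y$, $\bar\alpha = \alpha + \kappa$, $\bar\beta = \beta + \kappa$ and treating $y$, rather than $x$, as the signal underlying the observation, we obtain the estimation problem which is similar to the original one in which $\alpha$, $\beta$, $\kappa$ and $\mathcal{X}$ are replaced, respectively, with $\bar\alpha$, $\bar\beta$, $\bar\kappa = 0$, and $\mathcal{Y} = \{y : \|y\|_\rho \le 1\}$, and $A$, $B$ replaced with $\bar A = \mathrm{Diag}\{\ell^{-\bar\alpha}, \ell \le n\}$, $\bar B = \mathrm{Diag}\{\ell^{-\bar\beta}, \ell \le n\}$. When $n$ is large enough, namely, $n \ge \sigma^{-\frac{1}{\bar\alpha+1/\rho}}$, $\mathcal{Y}$ contains the "coordinate box"
$$\overline{\mathcal{Y}} = \{x : |x_\ell| \le m^{-1/\rho},\ m/2 \le \ell \le m,\ x_\ell = 0 \text{ otherwise}\}$$
of dimension $\ge m/2$, where
$$m \ge c\,\sigma^{-\frac{1}{\bar\alpha+1/\rho}}.$$
Observe that for all $y \in \overline{\mathcal{Y}}$, $\|\bar Ay\|_2 \le Cm^{-\bar\alpha}\|y\|_2$, and $\|\bar By\|_r \ge cm^{-\bar\beta}\|y\|_r$. This observation, when combined with the Fano inequality, implies (cf. [79]) that for $\epsilon \ll 1$ the minimax optimal w.r.t. the family of all Borel estimates $(\epsilon, \|\cdot\|_r)$-risk on the signal set $\overline{\mathcal{X}} = D^{-1}\overline{\mathcal{Y}} \subset \mathcal{X}$ is at least
$$c\,\sigma^{\frac{\bar\beta+1/\rho-1/r}{\bar\alpha+1/\rho}}.$$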
In other words, in this situation, the upper bound (5.22) on the risk of the polyhedral estimate is within a factor logarithmic in $n/\epsilon$ from the minimax risk. In particular, without surprise, in the case of $\beta = 0$ the polyhedral estimates attain well-known optimal rates [72, 109].
5.1.5 Efficient upper-bounding of R[H] and contrast design, II.
5.1.5.1 Outline
In this section we develop an alternative approach to the design of polyhedral estimates which resembles in many aspects the approach to building linear estimates from Chapter 4. Recall that the principal technique underlying the design of a presumably good linear estimate $\widehat{x}_H(\omega) = H^T\omega$ was upper-bounding of the maximal risk of the estimate (the maximum of a quadratic form, depending on $H$ as a parameter, over the signal set $\mathcal{X}$), and we were looking for a bounding scheme allowing us to efficiently optimize the bound in $H$. The design of a presumably good polyhedral estimate also reduces to minimizing
the optimal value in a parametric maximization problem (5.5) over the contrast matrix $H$. However, while the design of a presumably good linear estimate reduces to unconstrained minimization, to conceive a polyhedral estimate we need to minimize bound $R[H]$ on the estimation risk under the restriction on the contrast matrix $H$: the columns $h_\ell$ of this matrix should satisfy condition (5.1). In other words, in the case of polyhedral estimate the "design parameter" affects the constraints of the optimization problem rather than the objective.
Our strategy can be outlined as follows. Let us denote by $\mathcal{B}_* = \{u \in \mathbf{R}^\nu : \|u\|_* \le 1\}$ the unit ball of the norm $\|\cdot\|_*$ conjugate to the norm $\|\cdot\|$ in the formulation of the estimation problem in Section 5.1.2. Assume that we have at our disposal a technique for bounding quadratic forms on the set $\mathcal{B}_*\times\mathcal{X}_s$, in other words, we have an efficiently computable convex function $\mathcal{M}(M)$ on $\mathbf{S}^{\nu+n}$ such that
$$\mathcal{M}(M) \ge \max_{[u;z]\in\mathcal{B}_*\times\mathcal{X}_s}[u;z]^TM[u;z]\quad \forall M \in \mathbf{S}^{\nu+n}. \tag{5.23}$$
Note that the upper bound $R[H]$, as defined in (5.5), on the risk of a candidate polyhedral estimate $\widehat{x}^H$ is nothing but
$$R[H] = 2\max_{[u;z]}\Big\{[u;z]^T\underbrace{\left[\begin{array}{cc} & \frac{1}{2}B\\ \frac{1}{2}B^T & \end{array}\right]}_{B_+}[u;z] :\ u \in \mathcal{B}_*,\ z \in \mathcal{X}_s,\ z^TA^Th_\ell h_\ell^TAz \le 1,\ \ell \le N\Big\}. \tag{5.24}$$
Given $\lambda \in \mathbf{R}^N_+$, the constraints $z^TA^Th_\ell h_\ell^TAz \le 1$ in (5.24) can be aggregated to yield the quadratic constraint
$$z^TA^T\Theta_\lambda Az \le \mu_\lambda,\qquad \Theta_\lambda = H\mathrm{Diag}\{\lambda\}H^T,\qquad \mu_\lambda = \sum_\ell\lambda_\ell.$$
Observe that for every $\lambda \ge 0$ we have
$$R[H] \le 2\mathcal{M}\Big(\underbrace{\left[\begin{array}{cc} & \frac{1}{2}B\\ \frac{1}{2}B^T & -A^T\Theta_\lambda A\end{array}\right]}_{B_+[\Theta_\lambda]}\Big) + 2\mu_\lambda. \tag{5.25}$$
Indeed, let $[u;z]$ be a feasible solution to the optimization problem (5.24) specifying $R[H]$. Then
$$[u;z]^TB_+[u;z] = [u;z]^TB_+[\Theta_\lambda][u;z] + z^TA^T\Theta_\lambda Az;$$
the first term on the right-hand side is $\le \mathcal{M}(B_+[\Theta_\lambda])$ since $[u;z] \in \mathcal{B}_*\times\mathcal{X}_s$, and the second term on the right-hand side, as we have already seen, is $\le \mu_\lambda$, and (5.25) follows.
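The aggregation step rests on the identity $v^T\Theta_\lambda v = \sum_\ell\lambda_\ell(h_\ell^Tv)^2$ with $v = Az$. A small sketch makes this concrete; the helper names and all numbers below are hypothetical, and the vector `v` stands for $Az$.

```python
def aggregated_form(h_cols, lam, v):
    """sum_l lam_l * (h_l^T v)^2, i.e. v^T Theta_lambda v computed column-wise."""
    return sum(l * sum(hi * vi for hi, vi in zip(h, v)) ** 2 for h, l in zip(h_cols, lam))


def theta_lambda(h_cols, lam):
    """Theta_lambda = H Diag{lam} H^T, assembled as sum_l lam_l * h_l h_l^T."""
    m = len(h_cols[0])
    return [[sum(l * h[i] * h[j] for h, l in zip(h_cols, lam)) for j in range(m)]
            for i in range(m)]


def quad_form(mat, v):
    return sum(mat[i][j] * v[i] * v[j] for i in range(len(v)) for j in range(len(v)))


# Hypothetical contrast columns h_1, h_2, weights lambda >= 0, and a vector v = A z
# satisfying |h_l^T v| <= 1 for both columns.
h_cols = [[1.0, 0.0], [0.5, 0.5]]
lam = [2.0, 3.0]
v = [0.8, -0.4]

lhs = quad_form(theta_lambda(h_cols, lam), v)
rhs = aggregated_form(h_cols, lam, v)
mu_lambda = sum(lam)
```

Since each squared term is at most 1, the weighted sum is at most $\mu_\lambda = \sum_\ell\lambda_\ell$, which is exactly the aggregated constraint used in (5.25).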
Now assume that we have at our disposal a computationally tractable cone $\mathbf{H} \subset \mathbf{S}^N_+\times\mathbf{R}_+$ satisfying the following assumption:
C. Whenever $(\Theta, \mu) \in \mathbf{H}$, we can efficiently find an $m\times N$ matrix $H = [h_1, ..., h_N]$ and a nonnegative vector $\lambda \in \mathbf{R}^N_+$ such that
$$\begin{array}{ll}(a) & h_\ell \text{ satisfies (5.1)},\ 1 \le \ell \le N,\\ (b) & \Theta = H\mathrm{Diag}\{\lambda\}H^T,\\ (c) & \sum_i\lambda_i \le \mu.\end{array} \tag{5.26}$$
The following simple observation is crucial to what follows:
Proposition 5.5. Consider the estimation problem posed in Section 5.1.1, and let efficiently computable convex function $\mathcal{M}$ and computationally tractable closed convex cone $\mathbf{H}$ satisfy (5.23) and Assumption C, respectively. Consider the convex optimization problem
$$\mathrm{Opt} = \min_{\tau,\Theta,\mu}\left\{2\tau + 2\mu : (\Theta,\mu) \in \mathbf{H},\ \mathcal{M}(B_+[\Theta]) \le \tau\right\},\qquad B_+[\Theta] = \left[\begin{array}{cc} & \frac{1}{2}B\\ \frac{1}{2}B^T & -A^T\Theta A\end{array}\right]. \tag{5.27}$$
Given a feasible solution $(\tau, \Theta, \mu)$ to this problem, by C we can efficiently convert it to $(H, \lambda)$ such that $H = [h_1, ..., h_N]$ with $h_\ell$ satisfying (5.1) and $\lambda \ge 0$ with $\sum_\ell\lambda_\ell \le \mu$. We have $R[H] \le 2\tau + 2\mu$, whence the $(\epsilon, \|\cdot\|)$-risk of the polyhedral estimate $\widehat{x}^H$ satisfies the bound
$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}^H|\mathcal{X}] \le 2\tau + 2\mu. \tag{5.28}$$
Consequently, we can efficiently construct polyhedral estimates with $(\epsilon, \|\cdot\|)$-risk arbitrarily close to $\mathrm{Opt}$ (and with risk exactly $\mathrm{Opt}$, provided problem (5.27) is solvable).
Proof is readily given by the reasoning preceding the proposition. Indeed, with $\tau, \Theta, \mu, H, \lambda$ as in the premise of the proposition, the columns $h_\ell$ of $H$ satisfy (5.1) by C, implying, by Proposition 5.1, that $\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}^H|\mathcal{X}] \le R[H]$. Besides this, C says that for our $H, \lambda$ it holds $\Theta = \Theta_\lambda$ and $\mu_\lambda \le \mu$, so that (5.25) combines with the constraints of (5.27) to imply that $R[H] \le 2\tau + 2\mu$, and (5.28) follows by Proposition 5.1. ✷
The approach to the design of polyhedral estimates we develop in this section amounts to reducing the construction of the estimate (i.e., construction of the contrast matrix $H$) to finding (nearly) optimal solutions to (5.27). Implementing this approach requires devising techniques for constructing cones $\mathbf{H}$ satisfying C along with efficiently computable functions $\mathcal{M}(\cdot)$ satisfying (5.23). These tasks are the subjects of the sections to follow.
5.1.5.2 Specifying cones H
We specify cones $\mathbf{H}$ in the case when the number $N$ of columns in the candidate contrast matrices is $m$ and under the following assumption on the given reliability tolerance $\epsilon$ and observation scheme in question:
D. There is a computationally tractable convex compact subset $\mathcal{Z} \subset \mathbf{R}^m_+$ intersecting $\mathrm{int}\,\mathbf{R}^m_+$ such that the norm $\pi(\cdot)$
$$\pi(h) = \sqrt{\max_{z\in\mathcal{Z}}\sum_i z_ih_i^2}$$
induced by $\mathcal{Z}$ satisfies the relation
$$\pi(h) \le 1 \Rightarrow \mathrm{Prob}\{|h^T\xi_x| > 1\} \le \epsilon/m\quad \forall x \in \mathcal{X}.$$
Note that condition D is satisfied for sub-Gaussian, Discrete, and Poisson observation schemes: according to the results of Section 5.1.3,
• in the sub-Gaussian case, it suffices to take $\mathcal{Z} = \{2\sigma^2\ln(2m/\epsilon)[1; ...; 1]\}$;
• in the Discrete case, it suffices to take
$$\mathcal{Z} = \frac{4\ln(2m/\epsilon)}{K}A\mathcal{X} + \frac{64\ln^2(2m/\epsilon)}{9K^2}\Delta_m,$$
where $A\mathcal{X} = \{Ax : x \in \mathcal{X}\}$, $\Delta_m = \{y \in \mathbf{R}^m : y \ge 0,\ \sum_i y_i = 1\}$;
• in the Poisson case, it suffices to take
$$\mathcal{Z} = 2\ln(2m/\epsilon)A\mathcal{X} + \frac{16}{9}\ln^2(2m/\epsilon)\Delta_m,$$
with $A\mathcal{X}$ and $\Delta_m$ as above.
Note that in all these cases $\mathcal{Z}$ only "marginally" (logarithmically) depends on $\epsilon$ and $m$. Under Assumption D, the cone $\mathbf{H}$ can be built as follows:
• When $\mathcal{Z}$ is a singleton, $\mathcal{Z} = \{\bar z\}$, so that $\pi(\cdot)$ is a scaled Euclidean norm, we set
$$\mathbf{H} = \left\{(\Theta, \mu) \in \mathbf{S}^m_+\times\mathbf{R}_+ : \mu \ge \sum_i\bar z_i\Theta_{ii}\right\}.$$
Given $(\Theta, \mu) \in \mathbf{H}$, the $m\times m$ matrix $H$ and $\lambda \in \mathbf{R}^m_+$ are built as follows: setting $S = \mathrm{Diag}\{\sqrt{\bar z_1}, ..., \sqrt{\bar z_m}\}$, we compute the eigenvalue decomposition of the matrix $S\Theta S$: $S\Theta S = U\mathrm{Diag}\{\lambda\}U^T$, where $U$ is orthonormal, and set $H = S^{-1}U$, thus ensuring $\Theta = H\mathrm{Diag}\{\lambda\}H^T$. Since $\mu \ge \sum_i\bar z_i\Theta_{ii}$, we have $\sum_i\lambda_i = \mathrm{Tr}(S\Theta S) \le \mu$. Finally, a column $h$ of $H$ is of the form $S^{-1}f$ with $\|\cdot\|_2$-unit vector $f$, implying that
$$\pi(h) = \sqrt{\sum_i\bar z_i[S^{-1}f]_i^2} = \sqrt{\sum_if_i^2} = 1,$$
so that $h$ satisfies (5.1) by D.
• When $\mathcal{Z}$ is not a singleton, we set
$$\begin{array}{rcl}\phi(r) &=& \max_{z\in\mathcal{Z}}z^Tr,\\ \kappa &=& 6\ln(2\sqrt{3}m^2),\\ \mathbf{H} &=& \{(\Theta, \mu) \in \mathbf{S}^m_+\times\mathbf{R}_+ : \mu \ge \kappa\phi(\mathrm{dg}(\Theta))\},\end{array} \tag{5.29}$$
where $\mathrm{dg}(Q)$ is the diagonal of a (square) matrix $Q$. Note that $\phi(r) > 0$ whenever $r \ge 0$, $r \neq 0$, since $\mathcal{Z}$ contains a positive vector.
The justification of this construction and the efficient (randomized) algorithm for converting a pair $(\Theta, \mu) \in \mathbf{H}$ into $(H, \lambda)$ satisfying, when taken along with $(\Theta, \mu)$, the requirements of C are given by the following:
Lemma 5.6. Let norm $\pi(\cdot)$ satisfy D.
(i) Whenever $H$ is an $m\times m$ matrix with columns $h_\ell$ satisfying $\pi(h_\ell) \le 1$ and $\lambda \in \mathbf{R}^m_+$, we have
$$\left(\Theta_\lambda = H\mathrm{Diag}\{\lambda\}H^T,\ \mu = \kappa\sum_i\lambda_i\right) \in \mathbf{H}.$$
(ii) Given $(\Theta, \mu) \in \mathbf{H}$ with $\Theta \neq 0$, we find decomposition $\Theta = QQ^T$ with $m\times m$ matrix $Q$, and fix an orthonormal $m\times m$ matrix $V$ with magnitudes of entries not exceeding $\sqrt{2/m}$ (e.g., the orthonormal scaling of the matrix of the cosine transform). When $\mu > 0$, we set $\lambda = \frac{\mu}{m}[1; ...; 1] \in \mathbf{R}^m$ and consider the random matrix
$$H_\chi = \sqrt{\frac{m}{\mu}}\,Q\mathrm{Diag}\{\chi\}V,$$
where $\chi$ is the $m$-dimensional Rademacher random vector. We have
$$H_\chi\mathrm{Diag}\{\lambda\}H_\chi^T \equiv \Theta,\quad \lambda \ge 0,\quad \sum_i\lambda_i = \mu. \tag{5.30}$$
Moreover, the probability of the event
$$\pi(\mathrm{Col}_\ell[H_\chi]) \le 1\quad \forall\ell \le m \tag{5.31}$$
is at least 1/2. Thus, generating independent samples of $\chi$ and terminating with $H = H_\chi$ when the latter matrix satisfies (5.31), we with probability 1 terminate with $(H, \lambda)$ satisfying C, and the probability for the outlined procedure to terminate in the course of the first $M = 1, 2, ...$ steps is at least $1 - 2^{-M}$.
When $\mu = 0$, we have $\Theta = 0$ (since $\mu = 0$ implies $\phi(\mathrm{dg}(\Theta)) = 0$, which with $\Theta \succeq 0$ is possible only when $\Theta = 0$); thus, when $\mu = 0$, we set $H = 0_{m\times m}$ and $\lambda = 0_{m\times 1}$.
Note that the lemma states, essentially, that the cone $\mathbf{H}$ is a tight, up to a factor logarithmic in $m$, inner approximation of the set
$$\left\{(\Theta, \mu) : \exists(\lambda \in \mathbf{R}^m_+, H \in \mathbf{R}^{m\times m}) :\ \Theta = H\mathrm{Diag}\{\lambda\}H^T,\ \pi(\mathrm{Col}_\ell[H]) \le 1,\ \ell \le m,\ \mu \ge \sum_\ell\lambda_\ell\right\}.$$
For proof, see Section 5.4.2.
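The randomized conversion of Lemma 5.6(ii) can be sketched with plain Python lists. This is a minimal illustration under explicit assumptions rather than the book's implementation: $\mathcal{Z}$ is taken to be the convex hull of finitely many hypothetical vertices (so the maximum defining $\pi$ is attained at a vertex), $V$ is the orthonormal DCT-II matrix, and the cap `max_tries` and all helper names are introduced here for the example.

```python
import math
import random


def dct_matrix(m):
    """Orthonormal DCT-II matrix; every entry has magnitude <= sqrt(2/m)."""
    rows = []
    for k in range(m):
        scale = math.sqrt(1.0 / m) if k == 0 else math.sqrt(2.0 / m)
        rows.append([scale * math.cos(math.pi * (2 * j + 1) * k / (2 * m)) for j in range(m)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def pi_norm(h, vertices):
    """pi(h) = sqrt(max_z sum_i z_i h_i^2), with Z = conv(vertices)."""
    return math.sqrt(max(sum(zi * hi * hi for zi, hi in zip(z, h)) for z in vertices))


def random_contrast(q, mu, vertices, rng, max_tries=50):
    """Sample Rademacher sign vectors chi until all columns of H_chi have pi-norm <= 1."""
    m = len(q)
    v = dct_matrix(m)
    scale = math.sqrt(m / mu)
    for _ in range(max_tries):
        chi = [rng.choice((-1.0, 1.0)) for _ in range(m)]
        q_chi = [[q[i][j] * chi[j] for j in range(m)] for i in range(m)]  # Q Diag{chi}
        h = [[scale * x for x in row] for row in matmul(q_chi, v)]
        if all(pi_norm(col, vertices) <= 1.0 for col in zip(*h)):
            return h, [mu / m] * m
    raise RuntimeError("no admissible sign pattern found within max_tries")


# Hypothetical data: Theta = Q Q^T and mu = kappa * phi(dg(Theta)), so (Theta, mu) is in the cone.
Q = [[1.0, 0.0, 0.0], [0.5, 1.0, 0.0], [0.0, 0.3, 0.8]]
vertices = [[1.0, 2.0, 1.0], [2.0, 1.0, 1.0]]
m = len(Q)
Theta = matmul(Q, [list(c) for c in zip(*Q)])
kappa = 6.0 * math.log(2.0 * math.sqrt(3.0) * m * m)
mu = kappa * max(sum(z[i] * Theta[i][i] for i in range(m)) for z in vertices)
H, lam = random_contrast(Q, mu, vertices, random.Random(0))
```

The identity (5.30) holds for every sign vector, because $V$ and $\mathrm{Diag}\{\chi\}$ are orthonormal; only the column-norm condition (5.31) is random, which is why the loop resamples $\chi$.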
5.1.5.3 Specifying functions M
In this section we focus on computationally efficient upper-bounding of maxima of quadratic forms over convex compact sets symmetric w.r.t. the origin by semidefinite relaxation, our goal being to specify a "presumably good" efficiently computable convex function $\mathcal{M}(\cdot)$ satisfying (5.23).
Cones compatible with convex sets. Given a nonempty convex compact set $\mathcal{Y} \subset \mathbf{R}^N$, we say that a cone $\mathbf{Y}$ is compatible with $\mathcal{Y}$ if
• $\mathbf{Y}$ is a closed convex computationally tractable cone contained in $\mathbf{S}^N_+\times\mathbf{R}_+$,
• one has
$$\forall(V, \tau) \in \mathbf{Y} :\ \max_{y\in\mathcal{Y}}y^TVy \le \tau, \tag{5.32}$$
• $\mathbf{Y}$ contains a pair $(V, \tau)$ with $V \succ 0$,
• relations $(V, \tau) \in \mathbf{Y}$ and $\tau' \ge \tau$ imply that $(V, \tau') \in \mathbf{Y}$ (see footnote 4).
We call a cone $\mathbf{Y}$ sharp if $\mathbf{Y}$ is a closed convex cone contained in $\mathbf{S}^N_+\times\mathbf{R}_+$ and such that the only pair $(V, \tau) \in \mathbf{Y}$ with $\tau = 0$ is the pair $(0, 0)$, or, equivalently, a sequence $\{(V_i, \tau_i) \in \mathbf{Y}, i \ge 1\}$ is bounded if and only if the sequence $\{\tau_i, i \ge 1\}$ is bounded. Note that whenever the linear span of $\mathcal{Y}$ is the entire $\mathbf{R}^N$, every cone compatible with $\mathcal{Y}$ is sharp.
Observe that if $\mathcal{Y} \subset \mathbf{R}^N$ is a nonempty convex compact set and $\mathbf{Y}$ is a cone compatible with a shift $\mathcal{Y} - a$ of $\mathcal{Y}$, then $\mathbf{Y}$ is compatible with $\mathcal{Y}_s$.
Indeed, when shifting a set $\mathcal{Y}$, its symmetrization $\frac{1}{2}[\mathcal{Y} - \mathcal{Y}]$ remains intact, so that we can assume that $\mathbf{Y}$ is compatible with $\mathcal{Y}$. Now let $(V, \tau) \in \mathbf{Y}$ and $y, y' \in \mathcal{Y}$. We have
$$[y - y']^TV[y - y'] + \underbrace{[y + y']^TV[y + y']}_{\ge 0} = 2[y^TVy + [y']^TVy'] \le 4\tau,$$
whence for $z = \frac{1}{2}[y - y']$ it holds $z^TVz \le \tau$. Since every $z \in \mathcal{Y}_s$ is of the form $\frac{1}{2}[y - y']$ with $y, y' \in \mathcal{Y}$, the claim follows.
Note that the claim can be "nearly inverted": if $0 \in \mathcal{Y}$ and $\mathbf{Y}$ is compatible with $\mathcal{Y}_s$, then the "widening" of $\mathbf{Y}$, that is, the cone
$$\mathbf{Y}^+ = \{(V, \tau) : (V, \tau/4) \in \mathbf{Y}\},$$
is compatible with $\mathcal{Y}$ (evident, since when $0 \in \mathcal{Y}$, every vector from $\mathcal{Y}$ is proportional, with coefficient 2, to a vector from $\mathcal{Y}_s$).
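The symmetrization argument above is a parallelogram identity followed by dropping a nonnegative term, and it can be checked on a toy instance. In the sketch below the set is assumed to be the convex hull of three hypothetical points and $V$ is a hypothetical positive definite matrix; since the quadratic form is convex, it suffices to look at vertices and at half-differences of vertices.

```python
def quad(v_mat, y):
    """y^T V y for a matrix given as a list of rows."""
    n = len(y)
    return sum(v_mat[i][j] * y[i] * y[j] for i in range(n) for j in range(n))


# Hypothetical data: vertices of a convex compact set and a positive definite V.
points = [[1.0, 0.0], [0.0, 2.0], [1.5, 1.0]]
V = [[2.0, 0.5], [0.5, 1.0]]

# tau bounds the quadratic form on the set itself ...
tau = max(quad(V, y) for y in points)

# ... and hence also on the symmetrization, whose extreme points are among (y - y')/2.
half_diffs = [[0.5 * (a - b) for a, b in zip(y, yp)] for y in points for yp in points]
worst_sym = max(quad(V, z) for z in half_diffs)

# Parallelogram identity used in the proof, checked for one pair of points.
y, yp = points[0], points[2]
diff = [a - b for a, b in zip(y, yp)]
summ = [a + b for a, b in zip(y, yp)]
identity_gap = quad(V, diff) + quad(V, summ) - 2.0 * (quad(V, y) + quad(V, yp))
```

In this instance the bound on the symmetrization is far from tight ($1$ versus $\tau = 7$), which is consistent with the remark that the claim can only be "nearly" inverted.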
Footnote 4: The latter requirement is "for free": passing from a computationally tractable closed convex cone $\mathbf{Y} \subset \mathbf{S}^N_+\times\mathbf{R}_+$ satisfying (5.32) to the cone $\mathbf{Y}^+ = \{(V, \tau) : \exists\bar\tau \le \tau : (V, \bar\tau) \in \mathbf{Y}\}$, we get a cone larger than $\mathbf{Y}$ and still compatible with $\mathcal{Y}$. It will be clear from the sequel that in our context, the larger is a cone compatible with $\mathcal{Y}$, the better.
Constructing functions $\mathcal{M}$. The role of compatibility in our context becomes clear from the following observation:
Proposition 5.7. In the situation described in Section 5.1.1, assume that we have at our disposal cones $\mathbf{X}$ and $\mathbf{U}$ compatible, respectively, with $\mathcal{X}_s$ and with the unit ball
$$\mathcal{B}_* = \{u \in \mathbf{R}^\nu : \|u\|_* \le 1\}$$
of the norm $\|\cdot\|_*$ conjugate to the norm $\|\cdot\|$. Given $M \in \mathbf{S}^{\nu+n}$, let us set
$$\mathcal{M}(M) = \inf_{X,t,U,s}\left\{t + s : (X, t) \in \mathbf{X},\ (U, s) \in \mathbf{U},\ \mathrm{Diag}\{U, X\} \succeq M\right\}. \tag{5.33}$$
Then $\mathcal{M}$ is a real-valued efficiently computable convex function on $\mathbf{S}^{\nu+n}$ such that (5.23) takes place: for every $M \in \mathbf{S}^{n+\nu}$ it holds
$$\mathcal{M}(M) \ge \max_{[u;z]\in\mathcal{B}_*\times\mathcal{X}_s}[u;z]^TM[u;z].$$
In addition, when $\mathbf{X}$ and $\mathbf{U}$ are sharp, the infimum in (5.33) is achieved.
Proof is immediate. Given that the objective of the optimization problem specifying $\mathcal{M}(M)$ is nonnegative on the feasible set, the fact that $\mathcal{M}$ is real-valued is equivalent to the problem's feasibility, and the latter is readily given by the fact that $\mathbf{X}$ is a cone containing a pair $(X, t)$ with $X \succ 0$ and similarly for $\mathbf{U}$. Convexity of $\mathcal{M}$ is evident. To verify (5.23), let $(X, t, U, s)$ form a feasible solution to the optimization problem in (5.33). When $[u;z] \in \mathcal{B}_*\times\mathcal{X}_s$ we have
$$[u;z]^TM[u;z] \le u^TUu + z^TXz \le s + t,$$
where the first inequality is due to the $\succeq$-constraint in (5.33), and the second is due to the fact that $\mathbf{U}$ is compatible with $\mathcal{B}_*$, and $\mathbf{X}$ is compatible with $\mathcal{X}_s$. Since the resulting inequality holds true for all feasible solutions to the optimization problem in (5.33), (5.23) follows. Finally, when $\mathbf{X}$ and $\mathbf{U}$ are sharp, (5.33) is a feasible conic problem with bounded level sets of the objective and as such is solvable. ✷
5.1.5.4 Putting things together
The following statement combining the results of Propositions 5.7 and 5.5 summarizes our second approach to the design of the polyhedral estimate.
Proposition 5.8. In the situation of Section 5.1.1, assume that we have at our disposal cones $\mathbf{X}$ and $\mathbf{U}$ compatible, respectively, with $\mathcal{X}_s$ and with the unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$. Given reliability tolerance $\epsilon \in (0,1)$ along with a positive integer $N$ and a computationally tractable cone $\mathbf{H}$ satisfying Assumption C, consider the (clearly feasible) convex optimization problem
$$\mathrm{Opt} = \min_{\Theta,\mu,X,t,U,s}\left\{f(t,s,\mu) := 2(t + s + \mu) :\ (\Theta, \mu) \in \mathbf{H},\ (X, t) \in \mathbf{X},\ (U, s) \in \mathbf{U},\ \left[\begin{array}{cc}U & \frac{1}{2}B\\ \frac{1}{2}B^T & A^T\Theta A + X\end{array}\right] \succeq 0\right\}. \tag{5.34}$$
Let $\Theta, \mu, X, t, U, s$ be a feasible solution to (5.34). Invoking C, we can convert, in a computationally efficient manner, $(\Theta, \mu)$ into $(H, \lambda)$ such that the columns of the $m\times N$ contrast matrix $H$ satisfy (5.1), $\Theta = H\mathrm{Diag}\{\lambda\}H^T$, and $\mu \ge \sum_\ell\lambda_\ell$. The $(\epsilon, \|\cdot\|)$-risk of the polyhedral estimate $\widehat{x}^H$ satisfies the bound
$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}^H|\mathcal{X}] \le f(t, s, \mu). \tag{5.35}$$
In particular, we can build, in a computationally efficient manner, polyhedral estimates with risks arbitrarily close to $\mathrm{Opt}$ (and with risk $\mathrm{Opt}$, provided that (5.34) is solvable).
Proof. Let $\Theta, \mu, X, t, U, s$ form a feasible solution to (5.34). By the semidefinite constraint in (5.34) we have
$$0 \preceq \left[\begin{array}{cc}U & \frac{1}{2}B\\ \frac{1}{2}B^T & A^T\Theta A + X\end{array}\right] = \mathrm{Diag}\{U, X\} - \underbrace{\left[\begin{array}{cc} & -\frac{1}{2}B\\ -\frac{1}{2}B^T & -A^T\Theta A\end{array}\right]}_{=:M},$$
whence for the function $\mathcal{M}$ defined in (5.33) one has
$$\mathcal{M}(M) \le t + s.$$
Since $\mathcal{M}$, by Proposition 5.7, satisfies (5.23), invoking Proposition 5.5 we arrive at
$$R[H] \le 2(\mu + \mathcal{M}(M)) \le f(t, s, \mu).$$
By Proposition 5.1 this implies the target relation (5.35). ✷
5.1.5.5 Compatibility: Basic examples and calculus
Our approach to the design of polyhedral estimates utilizing the recipe described in Proposition 5.8 relies upon our ability to equip convex "sets of interest" (in our context, these are the symmetrization $\mathcal{X}_s$ of the signal set and the unit ball $\mathcal{B}_*$ of the norm conjugate to the norm $\|\cdot\|$) with compatible cones (see footnote 5). Below, we discuss two principal sources of such cones, namely (a) spectratopes/ellitopes, and (b) absolute norms. More examples of compatible cones can be constructed using a "compatibility calculus." Namely, let us assume that we are given a finite collection of convex sets (operands) and apply to them some basic operation, such as taking the intersection, or arithmetic sum, direct or inverse linear image, or convex hull of the union. It turns out that cones compatible with the results of such operations can be easily (in a fully algorithmic fashion) obtained from the cones compatible with the operands; see Section 5.1.8 for principal calculus rules.
In view of Proposition 5.8, the larger are the cones $\mathbf{X}$ and $\mathbf{U}$ compatible with $\mathcal{X}_s$ and $\mathcal{B}_*$, the better: the wider is the optimization domain in (5.34) and, consequently, the less is (the best) risk bound achievable with the recipe presented in the proposition. Given convex compact set $\mathcal{Y} \subset \mathbf{R}^N$, the "ideal" (that is, the largest) candidate to the role of the cone compatible with $\mathcal{Y}$ would be
$$\mathbf{Y}_* = \{(V, \tau) \in \mathbf{S}^N_+\times\mathbf{R}_+ : \tau \ge \max_{y\in\mathcal{Y}}y^TVy\}.$$
However, this cone is typically intractable; therefore, we look for "as large as possible" tractable inner approximations of $\mathbf{Y}_*$.
Footnote 5: Recall that we already know how to specify the second element of the construction, the cone $\mathbf{H}$.
5.1.5.5.A. Cones compatible with ellitopes/spectratopes are readily given by semidefinite relaxation. Specifically, when
$$\mathcal{Y} = \{y \in \mathbf{R}^N : \exists(r \in \mathcal{R}, z \in \mathbf{R}^K) : y = Mz,\ R_\ell^2[z] \preceq r_\ell I_{d_\ell},\ \ell \le L\}\qquad \left[R_\ell[z] = \sum_jz_jR^{\ell j},\ R^{\ell j} \in \mathbf{S}^{d_\ell}\right]$$
with our standard restrictions on $\mathcal{R}$, invoking Proposition 4.8 it is immediately seen that the set
$$\mathbf{Y} = \left\{(V, \tau) \in \mathbf{S}^N_+\times\mathbf{R}_+ : \exists\Lambda = \{\Lambda_\ell \in \mathbf{S}^{d_\ell}_+, \ell \le L\} :\ M^TVM \preceq \sum_\ell R^*_\ell[\Lambda_\ell],\ \phi_{\mathcal{R}}(\lambda[\Lambda]) \le \tau\right\} \tag{5.36}$$
is a closed convex cone which is compatible with $\mathcal{Y}$; here, as usual,
$$[R^*_\ell[\Lambda_\ell]]_{ij} = \mathrm{Tr}(R^{\ell i}\Lambda_\ell R^{\ell j}),\quad \lambda[\Lambda] = [\mathrm{Tr}(\Lambda_1); ...; \mathrm{Tr}(\Lambda_L)],\quad \phi_{\mathcal{R}}(\lambda) = \max_{r\in\mathcal{R}}r^T\lambda.$$
Similarly, when $\mathcal{Y}$ is an ellitope:
$$\mathcal{Y} = \{y \in \mathbf{R}^N : \exists(r \in \mathcal{R}, z \in \mathbf{R}^K) : y = Mz,\ z^TR_\ell z \le r_\ell,\ \ell \le L\}$$
with our standard restrictions on $\mathcal{R}$ and $R_\ell$, invoking Proposition 4.6, the set
$$\mathbf{Y} = \left\{(V, \tau) \in \mathbf{S}^N_+\times\mathbf{R}_+ : \exists\lambda \in \mathbf{R}^L_+ :\ M^TVM \preceq \sum_\ell\lambda_\ell R_\ell,\ \phi_{\mathcal{R}}(\lambda) \le \tau\right\} \tag{5.37}$$
is a closed convex cone which is compatible with $\mathcal{Y}$. In both cases, $\mathbf{Y}$ is sharp, provided that the image space of $M$ is the entire $\mathbf{R}^N$.
Note that in both these cases $\mathbf{Y}$ is a reasonably tight inner approximation of $\mathbf{Y}_*$: whenever $(V, \tau) \in \mathbf{Y}_*$, we have $(V, \theta\tau) \in \mathbf{Y}$, with a moderate $\theta$ (specifically, $\theta = O(1)\ln\left(2\sum_\ell d_\ell\right)$ in the spectratopic, and $\theta = O(1)\ln(2L)$ in the ellitopic case; see Propositions 4.8, 4.6, respectively).
5.1.5.5.B. Compatibility via absolute norms.
Preliminaries. Recall that a norm $p(\cdot)$ on $\mathbf{R}^N$ is called absolute if $p(x)$ is a function of the vector $\mathrm{abs}[x] := [|x_1|; ...; |x_N|]$ of the magnitudes of entries in $x$. It is well known that an absolute norm $p$ is monotone on $\mathbf{R}^N_+$, so that $\mathrm{abs}[x] \le \mathrm{abs}[x']$ implies that $p(x) \le p(x')$, and that the norm
$$p_*(x) = \max_{y:p(y)\le 1}x^Ty$$
conjugate to $p(\cdot)$ is absolute along with $p$.
Let us say that an absolute norm $r(\cdot)$ fits an absolute norm $p(\cdot)$ on $\mathbf{R}^N$ if for every vector $x$ with $p(x) \le 1$ the entrywise square $[x]^2 = [x_1^2; ...; x_N^2]$ of $x$ satisfies $r([x]^2) \le 1$. For example, the largest norm $r(\cdot)$ which fits the absolute norm $p(\cdot) = \|\cdot\|_s$, $s \in [1, \infty]$, is
$$r(\cdot) = \left\{\begin{array}{ll}\|\cdot\|_1, & 1 \le s \le 2,\\ \|\cdot\|_{s/2}, & s \ge 2.\end{array}\right.$$
An immediate observation is that an absolute norm $p(\cdot)$ on $\mathbf{R}^N$ can be "lifted" to a norm on $\mathbf{S}^N$, specifically, the norm
$$p^+(Y) = p([p(\mathrm{Col}_1[Y]); ...; p(\mathrm{Col}_N[Y])]) : \mathbf{S}^N \to \mathbf{R}_+, \tag{5.38}$$
where $\mathrm{Col}_j[Y]$ is the $j$-th column in $Y$. It is immediately seen that when $p$ is an absolute norm, the right-hand side in (5.38) indeed is a norm on $\mathbf{S}^N$ satisfying the identity
$$p^+(xx^T) = p^2(x),\ x \in \mathbf{R}^N. \tag{5.39}$$
Absolute norms and compatibility. Our interest in absolute norms is motivated by the following immediate observation:
Observation 5.9. Let $p(\cdot)$ be an absolute norm on $\mathbf{R}^N$, and $r(\cdot)$ be another absolute norm which fits $p(\cdot)$, both norms being computationally tractable. These norms give rise to the computationally tractable and sharp closed convex cone
$$\mathbf{P} = \mathbf{P}_{p(\cdot),r(\cdot)} = \left\{(V, \tau) \in \mathbf{S}^N_+\times\mathbf{R}_+ : \exists(W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) :\ V \preceq W + \mathrm{Diag}\{w\},\ [p^+]_*(W) + r_*(w) \le \tau\right\} \tag{5.40}$$
where $[p^+]_*(\cdot)$ is the norm on $\mathbf{S}^N$ conjugate to the norm $p^+(\cdot)$, and $r_*(\cdot)$ is the norm on $\mathbf{R}^N$ conjugate to the norm $r(\cdot)$, and this cone is compatible with the unit ball of the norm $p(\cdot)$ (and thus with any convex compact subset of this ball).
Verification is immediate. The fact that $\mathbf{P}$ is a computationally tractable and closed convex cone is evident. Now let $(V, \tau) \in \mathbf{P}$, so that $V \succeq 0$ and $V \preceq W + \mathrm{Diag}\{w\}$ with $[p^+]_*(W) + r_*(w) \le \tau$. For $x$ with $p(x) \le 1$ we have
$$\begin{array}{rcl}x^TVx &\le& x^T[W + \mathrm{Diag}\{w\}]x = \mathrm{Tr}(W[xx^T]) + w^T[x]^2\\ &\le& p^+(xx^T)[p^+]_*(W) + r([x]^2)r_*(w) = p^2(x)[p^+]_*(W) + r([x]^2)r_*(w)\\ &\le& [p^+]_*(W) + r_*(w) \le \tau\end{array}$$
(we have used (5.39) and the fact that $r(\cdot)$ fits $p(\cdot)$), whence $x^TVx \le \tau$ for all $x$ with $p(x) \le 1$. ✷
Let us look at the proposed construction in the case where $p(\cdot) = \|\cdot\|_s$, $s \in [1, \infty]$, and let $r(\cdot) = \|\cdot\|_{\bar s}$, $\bar s = \max[s/2, 1]$. Setting $s_* = \frac{s}{s-1}$, $\bar s_* = \frac{\bar s}{\bar s-1}$, we clearly have
$$[p^+]_*(W) = \|W\|_{s_*} := \left\{\begin{array}{ll}\left(\sum_{i,j}|W_{ij}|^{s_*}\right)^{1/s_*}, & s_* < \infty,\\ \max_{i,j}|W_{ij}|, & s_* = \infty,\end{array}\right.\qquad r_*(w) = \|w\|_{\bar s_*}, \tag{5.41}$$
resulting in
$$\mathbf{P}^s := \mathbf{P}_{\|\cdot\|_s,\|\cdot\|_{\bar s}} = \left\{(V, \tau) : V \in \mathbf{S}^N_+,\ \exists(W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) :\ V \preceq W + \mathrm{Diag}\{w\},\ \|W\|_{s_*} + \|w\|_{\bar s_*} \le \tau\right\}. \tag{5.42}$$
By Observation 5.9, $\mathbf{P}^s$ is compatible with the unit ball of $\|\cdot\|_s$-norm on $\mathbf{R}^N$ (and therefore with every closed convex subset of this ball).
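For $s \ge 2$ a member of the cone (5.42) can be produced by hand, which gives a simple numerical check of compatibility. The sketch below is an illustration under stated assumptions: it takes $W = 0$ and the hypothetical choice $w_i = \sum_j|V_{ij}|$, for which $\mathrm{Diag}\{w\} - V$ is diagonally dominant with nonnegative diagonal and hence positive semidefinite, and it sets $\tau = \|w\|_{\bar s_*}$ with $\bar s_* = s/(s-2)$. The matrix $V$ and the function names are made up for the example.

```python
import random

INF = float("inf")


def p_norm(v, p):
    if p == INF:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def certificate(v_mat, s):
    """Return (w, tau) with V <= Diag{w} and tau = ||w||_{s/(s-2)}; requires s >= 2.
    W in (5.42) is taken to be zero."""
    w = [sum(abs(x) for x in row) for row in v_mat]
    if s == 2:
        dual = INF
    elif s == INF:
        dual = 1.0
    else:
        dual = s / (s - 2.0)
    return w, p_norm(w, dual)


def quad(v_mat, x):
    n = len(x)
    return sum(v_mat[i][j] * x[i] * x[j] for i in range(n) for j in range(n))


# Hypothetical positive definite V and the norm index s = 4 (so the dual index is 2).
V = [[2.0, 0.5], [0.5, 1.0]]
s = 4.0
w, tau = certificate(V, s)

# Sample points on the unit sphere of the s-norm and record the largest value of x^T V x.
rng = random.Random(1)
largest = 0.0
for _ in range(2000):
    x = [rng.uniform(-1.0, 1.0) for _ in range(len(V))]
    scale = p_norm(x, s)
    if scale > 0.0:
        x = [xi / scale for xi in x]
        largest = max(largest, quad(V, x))
```

The guarantee $x^TVx \le \tau$ follows from $x^TVx \le \sum_iw_ix_i^2$ and the Hölder inequality applied to $w$ and $[x]^2$; the certificate is valid but generally not the smallest possible $\tau$.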
When $s = 1$, that is, $s_* = \bar s_* = \infty$, (5.42) results in
$$\begin{array}{rcl}\mathbf{P}^1 &=& \left\{(V, \tau) : V \succeq 0,\ \exists(W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) :\ V \preceq W + \mathrm{Diag}\{w\},\ \|W\|_\infty + \|w\|_\infty \le \tau\right\}\\ &=& \{(V, \tau) : V \succeq 0,\ \|V\|_\infty \le \tau\},\end{array} \tag{5.43}$$
and it is easily seen that the situation is as good as it could be, namely,
$$\mathbf{P}^1 = \{(V, \tau) : V \succeq 0,\ \max_{\|x\|_1\le 1}x^TVx \le \tau\}.$$
It can be shown (see Section 5.4.3) that when $s \in [2, \infty]$, and so $\bar s_* = \frac{s}{s-2}$, (5.42) results in
$$\mathbf{P}^s = \{(V, \tau) : V \succeq 0,\ \exists(w \in \mathbf{R}^N_+) :\ V \preceq \mathrm{Diag}\{w\}\ \&\ \|w\|_{\frac{s}{s-2}} \le \tau\}. \tag{5.44}$$
Note that
$$\mathbf{P}^2 = \{(V, \tau) : V \succeq 0,\ \|V\|_{2,2} \le \tau\},$$
and this is exactly the largest cone compatible with the unit Euclidean ball.
When $s \ge 2$, the unit ball $\mathcal{Y}$ of the norm $\|\cdot\|_s$ is an ellitope:
$$\{y \in \mathbf{R}^N : \|y\|_s \le 1\} = \{y \in \mathbf{R}^N : \exists(t \ge 0, \|t\|_{\bar s} \le 1) :\ y^TR_\ell y := y_\ell^2 \le t_\ell,\ \ell \le L = N\},$$
so that one of the cones compatible with $\mathcal{Y}$ is given by (5.37) with the identity matrix in the role of $M$. As is immediately seen, the latter cone is nothing but the cone (5.44).
5.1.5.6 Near-optimality of polyhedral estimate in the spectratopic sub-Gaussian case
As an instructive application of the approach developed so far, consider the special case of the estimation problem stated in Section 5.1.1, where

1. The signal set $\mathcal{X}$ and the unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$ are spectratopes:
$$\begin{array}{rcl}
\mathcal{X}&=&\{x\in\mathbf{R}^n:\ \exists t\in\mathcal{T}:\ R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K\},\\
\mathcal{B}_*&=&\{z\in\mathbf{R}^\nu:\ \exists y\in\mathcal{Y}:\ z=My\},\quad \mathcal{Y}:=\{y\in\mathbf{R}^q:\ \exists r\in\mathcal{R}:\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\}
\end{array}$$
(cf. Assumptions A, B in Section 4.3.3.2; as always, we lose nothing assuming spectratope $\mathcal{X}$ to be basic).
2. For every $x\in\mathcal{X}$, observation noise $\xi_x$ is sub-Gaussian, i.e., $\xi_x\sim\mathcal{SG}(0,\sigma^2I_m)$.

We are about to show that in the present situation, the polyhedral estimate constructed in Sections 5.1.5.2–5.1.5.4, i.e., yielded by the efficiently computable (high accuracy near-) optimal solution to the optimization problem (5.34), is near-optimal in the minimax sense. Given reliability tolerance $\epsilon\in(0,1)$, the recipe for constructing the $m\times m$ contrast matrix $H$ as presented in Proposition 5.8 is as follows:

• Set
$$Z=\{\vartheta^2[1;...;1]\},\quad \vartheta=\sigma\kappa,\quad \kappa=\sqrt{2\ln(2m/\epsilon)},$$
CHAPTER 5
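and utilize the construction from Section 5.1.5.2, thus arriving at the cone
$$\mathbf{H}=\{(\Theta,\mu)\in\mathbf{S}^m_+\times\mathbf{R}_+:\ \sigma^2\kappa^2\mathrm{Tr}(\Theta)\le\mu\}$$
satisfying the requirements of Assumption C.
• Specify the cones $\mathbf{X}$ and $\mathbf{U}$ compatible with $\mathcal{X}_s=\mathcal{X}$ and $\mathcal{B}_*$, respectively, according to (5.36).

The resulting problem (5.34), after immediate straightforward simplifications, reads
$$\mathrm{Opt}=\min_{\Theta,U,\Lambda,\Upsilon}\left\{2\left[\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{T}}(\lambda[\Lambda])+\sigma^2\kappa^2\mathrm{Tr}(\Theta)\right]:\ \begin{array}{l}\Theta\succeq0,\ U\succeq0,\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\\ M^TUM\preceq\sum_\ell S_\ell^*[\Upsilon_\ell],\\ \left[\begin{array}{cc}U&\frac12B\\ \frac12B^T&A^T\Theta A+\sum_kR_k^*[\Lambda_k]\end{array}\right]\succeq0\end{array}\right\} \tag{5.45}$$
where, as always,
$$[R_k^*[\Lambda_k]]_{ij}=\mathrm{Tr}(R^{ki}\Lambda_kR^{kj})\ \Big[R_k[x]=\sum_ix_iR^{ki}\Big],\qquad [S_\ell^*[\Upsilon_\ell]]_{ij}=\mathrm{Tr}(S^{\ell i}\Upsilon_\ell S^{\ell j})\ \Big[S_\ell[u]=\sum_iu_iS^{\ell i}\Big],$$
and
$$\lambda[\Lambda]=[\mathrm{Tr}(\Lambda_1);...;\mathrm{Tr}(\Lambda_K)],\quad \lambda[\Upsilon]=[\mathrm{Tr}(\Upsilon_1);...;\mathrm{Tr}(\Upsilon_L)],\quad \phi_W(f)=\max_{w\in W}w^Tf.$$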
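Let now
$$\mathrm{RiskOpt}_\epsilon=\inf_{\widehat{x}(\cdot)}\ \sup_{x\in\mathcal{X}}\ \inf\left\{\rho:\ \mathrm{Prob}_{\xi\sim\mathcal{N}(0,\sigma^2I)}\{\|Bx-\widehat{x}(Ax+\xi)\|>\rho\}\le\epsilon\ \ \forall x\in\mathcal{X}\right\}$$
be the minimax optimal $(\epsilon,\|\cdot\|)$-risk of estimating $Bx$ in the Gaussian observation scheme where $\xi_x\sim\mathcal{N}(0,\sigma^2I_m)$ independently of $x\in\mathcal{X}$.

Proposition 5.10. When $\epsilon\le1/8$, the polyhedral estimate $\widehat{x}^H$ yielded by a feasible near-optimal, in terms of the objective, solution to problem (5.45) is minimax optimal within the logarithmic factor, namely
$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}^H|\mathcal{X}]\le O(1)\sqrt{\ln\Big(\sum_kd_k\Big)\ln\Big(\sum_\ell f_\ell\Big)\ln(2m/\epsilon)}\ \mathrm{RiskOpt}_{\frac18}\le O(1)\sqrt{\ln\Big(\sum_kd_k\Big)\ln\Big(\sum_\ell f_\ell\Big)\ln(2m/\epsilon)}\ \mathrm{RiskOpt}_\epsilon$$
where $O(1)$ is an absolute constant.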
See Section 5.4.4 for the proof. Discussion. It is worth mentioning that the approach described in Section 5.1.4 is complementary to the approach developed in this section. In fact, it is easily seen that the bound Opt for the risk of the polyhedral estimate stemming from (5.34) is suboptimal in the simple situation described in the motivating example from Section 5.1.1. Indeed, let X be the unit k · k1 -ball, k · k = k · k2 , and let us consider the problem of estimating x ∈ X from the direct observation ω = x + ξ with Gaussian observation noise ξ ∼ N (0, σ 2 I). We equip the ball B∗ = {u ∈ Rn : kuk2 ≤ 1}
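with the cone $\mathbf{U}=\mathbf{P}_2=\{(U,\tau):\ U\succeq0,\ \|U\|_{2,2}\le\tau\}$ and $\mathcal{X}$ with the cone $\mathbf{X}=\mathbf{P}_1=\{(X,t):\ X\succeq0,\ \|X\|_\infty\le t\}$ (note that both cones are the largest w.r.t. inclusion cones compatible with the respective sets). The corresponding problem (5.34) reads
$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{\Theta,X,U}\left\{2\left[\kappa^2\sigma^2\mathrm{Tr}(\Theta)+\max_iX_{ii}+\|U\|_{2,2}\right]:\ \Theta\succeq0,\ X\succeq0,\ U\succeq0,\ \left[\begin{array}{cc}U&\frac12I_n\\ \frac12I_n&\Theta+X\end{array}\right]\succeq0\right\}\\
&=&\min\limits_{\Theta,X,\tau}\left\{2\left[\kappa^2\sigma^2\mathrm{Tr}(\Theta)+\max_iX_{ii}+\tau\right]:\ \Theta\succeq0,\ X\succeq0,\ \left[\begin{array}{cc}\tau I_n&\frac12I_n\\ \frac12I_n&\Theta+X\end{array}\right]\succeq0\right\}.
\end{array} \tag{5.46}$$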
Observe that every $n\times n$ matrix of the form $Q=EP$, where $E$ is diagonal with diagonal entries $\pm1$ and $P$ is a permutation matrix, induces a symmetry $(\Theta,X,\tau)\mapsto(Q\Theta Q^T,QXQ^T,\tau)$ of the second optimization problem in (5.46), that is, a transformation which maps the feasible set onto itself and keeps the objective intact. Since the problem is convex and solvable, we conclude that it has an optimal solution which remains intact under the symmetries in question, i.e., a solution with scalar matrices $\Theta=\theta I_n$ and $X=uI_n$. As a result,
$$\mathrm{Opt}=\min_{\theta\ge0,\,u\ge0,\,\tau}\left\{2(\kappa^2\sigma^2n\theta+u+\tau):\ \tau(\theta+u)\ge\tfrac14\right\}=2\min\left[\kappa\sigma\sqrt n,\ 1\right]. \tag{5.47}$$
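The closed form in (5.47) is easy to sanity-check by brute force. The following is a rough sketch, not part of the original derivation: it assumes NumPy, uses a hypothetical log-spaced grid over $(\theta,u)$, and eliminates $\tau$ via $\tau=1/(4(\theta+u))$, which is optimal since the objective increases in $\tau$.

```python
import numpy as np

def opt_closed_form(kappa, sigma, n):
    """Right-hand side of (5.47)."""
    return 2.0 * min(kappa * sigma * np.sqrt(n), 1.0)

def opt_grid(kappa, sigma, n, points=400):
    """Brute-force value of the scalar problem in (5.47) on a grid.

    For given (theta, u) with theta + u > 0 the best tau is 1/(4(theta+u)).
    The grid (hypothetical choice) covers 0 and 1e-4 ... 1e2.
    """
    c = kappa**2 * sigma**2 * n
    vals = np.concatenate([[0.0], np.logspace(-4, 2, points)])
    TH, U = np.meshgrid(vals, vals, indexing="ij")
    S = TH + U
    with np.errstate(divide="ignore"):
        obj = 2.0 * (c * TH + U + 1.0 / (4.0 * S))   # +inf where S == 0
    return float(np.min(obj))

small_noise = (opt_closed_form(3.0, 0.01, 100), opt_grid(3.0, 0.01, 100))
large_noise = (opt_closed_form(3.0, 1.0, 100), opt_grid(3.0, 1.0, 100))
```

In the first regime the minimum is attained with $u=0$ (all weight on $\theta$), in the second with $\theta=0$, matching the two branches of the minimum in (5.47).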
A similar derivation shows that the value Opt remains intact if we replace the set $\mathcal{X}=\{x:\ \|x\|_1\le1\}$ with $\mathcal{X}=\{x:\ \|x\|_s\le1\}$, $s\in[1,2]$, and the cone $\mathbf{X}=\mathbf{P}_1$ with $\mathbf{X}=\mathbf{P}_s$; see (5.42). Since the $\Theta$-component of an optimal solution to (5.46) can be selected to be scalar, the contrast matrix $H$ we end up with can be selected to be the unit matrix. An unpleasant observation is that when $s<2$, the quantity Opt given by (5.47) “heavily overestimates” the actual risk of the polyhedral estimate with $H=I_n$. Indeed, the analysis of this estimate in Section 5.1.4 results in the risk bound (up to a factor logarithmic in $n$) $\min[\sigma^{1-s/2},\sigma\sqrt n]$, which can be much less than $\mathrm{Opt}=2\min[\kappa\sigma\sqrt n,1]$, e.g., in the case of large $n$ and $\sigma\sqrt n=O(1)$.

5.1.6 Assembling estimates: Contrast aggregation
The good news is that whenever the approaches to the design of polyhedral estimates presented in Sections 5.1.4 and 5.1.5 are applicable, they can be utilized simultaneously. The underlying observation is that (!) In the problem setting described in Section 5.1.2, a collection of K candidate polyhedral estimates can be assembled into a single polyhedral estimate with the (upper bound on the) risk, as given by Proposition 5.1, being nearly the minimum of the risks of estimates we aggregate. Indeed, given an observation scheme (that is, collection of probability distributions Px of noises ξx , x ∈ X ), assume we have at our disposal norms πδ (·) : Rm → R parameterized by δ ∈ (0, 1) such that πδ (h), for every h, is larger the lesser δ is,
and
$$\forall(x\in\mathcal{X},\ \delta\in(0,1),\ h\in\mathbf{R}^m):\ \pi_\delta(h)\le1\ \Rightarrow\ \mathrm{Prob}_{\xi\sim P_x}\{\xi:\ |h^T\xi|>1\}\le\delta.$$
Assume also (as is indeed the case in all our constructions) that we ensure (5.1) by imposing on the columns $h_\ell$ of an $m\times N$ contrast matrix $H$ the restrictions $\pi_{\epsilon/N}(h_\ell)\le1$.

Now suppose that given risk tolerance $\epsilon\in(0,1)$, we have generated $K$ candidate contrast matrices $H_k\in\mathbf{R}^{m\times N_k}$ such that
$$\pi_{\epsilon/N_k}(\mathrm{Col}_j[H_k])\le1,\ j\le N_k,$$
so that the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate yielded by the contrast matrix $H_k$ does not exceed
$$R_k=\max_x\left\{\|Bx\|:\ x\in2\mathcal{X}_s,\ \|H_k^TAx\|_\infty\le2\right\}.$$
Let us combine the contrast matrices $H_1,...,H_K$ into a single contrast matrix $H$ with $N=N_1+...+N_K$ columns by normalizing the columns of the concatenated matrix $[H_1,...,H_K]$ to have $\pi_{\epsilon/N}$-norms equal to 1, so that
$$H=[\bar H_1,...,\bar H_K],\quad \mathrm{Col}_j[\bar H_k]=\theta_{jk}\mathrm{Col}_j[H_k]\ \ \forall(k\le K,\ j\le N_k)$$
with
$$\theta_{jk}=\frac{1}{\pi_{\epsilon/N}(\mathrm{Col}_j[H_k])}\ \ge\ \vartheta_k:=\min_{h\ne0}\frac{\pi_{\epsilon/N_k}(h)}{\pi_{\epsilon/N}(h)},$$
where the concluding $\ge$ is due to $\pi_{\epsilon/N_k}(\mathrm{Col}_j[H_k])\le1$. We claim that in terms of $(\epsilon,\|\cdot\|)$-risk, the polyhedral estimate yielded by $H$ is “almost as good” as the best of the polyhedral estimates yielded by the contrast matrices $H_1,...,H_K$, specifically,⁶
$$R[H]:=\max_x\left\{\|Bx\|:\ x\in2\mathcal{X}_s,\ \|H^TAx\|_\infty\le2\right\}\le\min_k\vartheta_k^{-1}R_k.$$
The justification is readily given by the following observation: when $\vartheta\in(0,1)$, we have
$$R_{k,\vartheta}:=\max_x\left\{\|Bx\|:\ x\in2\mathcal{X}_s,\ \|H_k^TAx\|_\infty\le2/\vartheta\right\}\le R_k/\vartheta.$$
Indeed, when $x$ is a feasible solution to the maximization problem specifying $R_{k,\vartheta}$, $\vartheta x$ is a feasible solution to the problem specifying $R_k$, implying that $\vartheta\|Bx\|\le R_k$. It remains to note that we clearly have $R[H]\le\min_kR_{k,\vartheta_k}$.

The bottom line is that the aggregation just described of contrast matrices $H_1,...,H_K$ into a single contrast matrix $H$ results in a polyhedral estimate which in terms of upper bound $R[\cdot]$ on its $(\epsilon,\|\cdot\|)$-risk is, up to factor $\bar\vartheta=\max_k\vartheta_k^{-1}$, not worse than the best of the $K$ estimates yielded by the original contrast matrices. Consequently, if $\pi_\delta(\cdot)$ grows slowly as $\delta$ decreases, the “price” $\bar\vartheta$ of assembling the original estimates is quite moderate. For example, in our basic cases (sub-Gaussian, Discrete, and Poisson), $\bar\vartheta$ is logarithmic in $\max_kN_k^{-1}(N_1+...+N_K)$, and $\bar\vartheta=1+o(1)$ as $\epsilon\to+0$ for $K,N_1,...,N_K$ fixed.

⁶ This is the precise “quantitative expression” of the observation (!).
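To make the aggregation step concrete, here is a minimal sketch for the sub-Gaussian case. It assumes the norm $\pi_\delta(h)=\sigma\sqrt{2\ln(2/\delta)}\,\|h\|_2$, which is consistent with the choice $\kappa=\sqrt{2\ln(2m/\epsilon)}$ used earlier but is stated here as an assumption; the function names are hypothetical. For this norm, $\vartheta_k=\sqrt{\ln(2N_k/\epsilon)/\ln(2N/\epsilon)}$.

```python
import numpy as np

def pi_norm(h, delta, sigma):
    """Assumed sub-Gaussian norm: pi_delta(h) = sigma*sqrt(2 ln(2/delta))*||h||_2."""
    return sigma * np.sqrt(2.0 * np.log(2.0 / delta)) * np.linalg.norm(h)

def aggregate_contrasts(H_list, eps, sigma):
    """Concatenate candidate contrasts and rescale each column to pi_{eps/N}-norm 1.

    Returns the aggregated matrix H and the factors vartheta_k; columns of the
    candidates are assumed to be nonzero.
    """
    N = sum(H.shape[1] for H in H_list)
    blocks, varthetas = [], []
    for H in H_list:
        Nk = H.shape[1]
        norms = np.array([pi_norm(H[:, j], eps / N, sigma) for j in range(Nk)])
        blocks.append(H / norms)          # theta_jk = 1 / pi_{eps/N}(Col_j[H_k])
        varthetas.append(np.sqrt(np.log(2 * Nk / eps) / np.log(2 * N / eps)))
    return np.hstack(blocks), np.array(varthetas)

rng = np.random.default_rng(0)
m, eps, sigma = 8, 0.1, 0.5
H_list = [np.eye(m), rng.standard_normal((m, 3))]   # illustrative candidates
H, varthetas = aggregate_contrasts(H_list, eps, sigma)
N = H.shape[1]
col_norms = np.array([pi_norm(H[:, j], eps / N, sigma) for j in range(N)])
```

The price of aggregation is $\max_k\vartheta_k^{-1}$, which here depends on the column counts only through logarithms.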
5.1.7 Numerical illustration
We are about to illustrate the numerical performance of polyhedral estimates by comparing it to the performance of a “presumably good” linear estimate. Our setup is deliberately simple: the signal set $\mathcal{X}$ is just the unit box $\{x\in\mathbf{R}^n:\ \|x\|_\infty\le1\}$, and $B\in\mathbf{R}^{n\times n}$ is “numerical double integration”: for a $\delta>0$,
$$B_{ij}=\begin{cases}\delta^2(i-j+1),& j\le i\\ 0,& j>i\end{cases},$$
so that $x$, modulo boundary effects, is the second order finite difference derivative of $w=Bx$,
$$x_i=\frac{w_i-2w_{i-1}+w_{i-2}}{\delta^2},\quad 2<i\le n;$$
and $Ax$ is comprised of $m$ randomly selected entries of $Bx$. The observation is
$$\omega=Ax+\xi,\quad \xi\sim\mathcal{N}(0,\sigma^2I_m)$$
and the recovery norm is $\|\cdot\|_2$. In other words, we want to recover a restriction of a twice differentiable function of one variable on the $n$-point regular grid on the segment $\Delta=[0,n\delta]$ from noisy observations of this restriction taken along $m$ randomly selected points of the grid. A priori information on the function is that the magnitude of its second order derivative does not exceed 1. Note that in the considered situation both the linear estimate $\widehat{x}_H$ yielded by Proposition 4.14 and the polyhedral estimate $\widehat{x}^H$ yielded by Proposition 5.7 are near-optimal in the minimax sense in terms of their $\|\cdot\|_2$- or $(\epsilon,\|\cdot\|_2)$-risk.

In the experiments reported in Figure 5.1, we used $n=64$, $m=32$, and $\delta=4/n$ (i.e., $\Delta=[0,4]$); the reliability parameter for the polyhedral estimate was set to $\epsilon=0.1$. For each of the noise levels $\sigma\in\{0.1,0.01,0.001,0.0001\}$ we generate 20 random signals $x$ from $\mathcal{X}$ and record the $\|\cdot\|_2$-recovery errors of the linear and the polyhedral estimates. In addition to testing the nearly optimal polyhedral estimate PolyI yielded by Proposition 5.8 as applied in the framework of item 5.1.5.5.A, we also record the performance of the polyhedral estimate PolyII yielded by the construction from Section 5.1.4. The observed $\|\cdot\|_2$-recovery errors of the three estimates are plotted in Figure 5.1. All three estimates exhibit similar empirical performance in these simulations.
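The data of this experiment can be set up in a few lines. The sketch below only builds the problem data ($B$, $A$, one noisy observation) and checks the second-difference identity; it does not implement the estimates themselves. Drawing the signal uniformly from the box and the particular random seed are assumptions, since the text does not say how the 20 signals were generated.

```python
import numpy as np

def double_integration_matrix(n, delta):
    """B_ij = delta^2 * (i - j + 1) for j <= i, and 0 for j > i (1-based i, j)."""
    i = np.arange(1, n + 1)[:, None]
    j = np.arange(1, n + 1)[None, :]
    return np.where(j <= i, delta**2 * (i - j + 1), 0.0)

n, m = 64, 32
delta = 4.0 / n
sigma = 0.01
rng = np.random.default_rng(0)

B = double_integration_matrix(n, delta)
x = rng.uniform(-1.0, 1.0, size=n)          # assumed sampling of a signal from the unit box
w = B @ x

# Second-order finite difference of w recovers x_i for i = 3, ..., n.
second_diff = (w[2:] - 2.0 * w[1:-1] + w[:-2]) / delta**2

rows = rng.choice(n, size=m, replace=False)  # m randomly selected grid points
A = B[rows, :]
omega = A @ x + sigma * rng.standard_normal(m)
```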
However, when the noise level becomes small, polyhedral estimates seem to outperform the linear one. In addition, the estimate PolyII seems to “work” better than or, at the very worst, similarly to PolyI in spite of the fact that in the situation in question the estimate PolyI, in contrast to PolyII, is provably near-optimal.

5.1.8 Calculus of compatibility
The principal rules of the calculus of compatibility are as follows (verification of the rules is straightforward and is therefore skipped):

1. [passing to a subset] When $\mathcal{Y}'\subset\mathcal{Y}$ are convex compact subsets of $\mathbf{R}^N$ and a cone $\mathbf{Y}$ is compatible with $\mathcal{Y}$, the cone is compatible with $\mathcal{Y}'$ as well.
[Figure 5.1, four panels: σ = 0.1, σ = 0.01, σ = 0.001, σ = 0.0001.]
Figure 5.1: Recovery errors for the near-optimal linear estimate (circles) and for polyhedral estimates yielded by Proposition 5.8 (PolyI, pentagrams) and by the construction from Section 5.1.4 (PolyII, triangles), 20 simulations per value of σ.
2. [finite intersection] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j\subset\mathbf{R}^N$, $j=1,...,J$. Then the cone
$$\mathbf{Y}=\mathrm{cl}\Big\{(V,\tau)\in\mathbf{S}^N_+\times\mathbf{R}_+:\ \exists((V_j,\tau_j)\in\mathbf{Y}^j,\ j\le J):\ V\preceq\sum_jV_j,\ \sum_j\tau_j\le\tau\Big\}$$
is compatible with $\mathcal{Y}=\bigcap_j\mathcal{Y}_j$. The closure operation can be skipped when all cones $\mathbf{Y}^j$ are sharp, in which case $\mathbf{Y}$ is sharp as well.

3. [convex hulls of finite union] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j\subset\mathbf{R}^N$, $j=1,...,J$, and let there exist $(V,\tau)$ such that $V\succ0$ and
$$(V,\tau)\in\mathbf{Y}:=\bigcap_j\mathbf{Y}^j.$$
Then $\mathbf{Y}$ is compatible with $\mathcal{Y}=\mathrm{Conv}\{\bigcup_j\mathcal{Y}_j\}$ and, in addition, is sharp provided that at least one of the $\mathbf{Y}^j$ is sharp.

4. [direct product] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j\subset\mathbf{R}^{N_j}$, $j=1,...,J$. Then the cone
$$\mathbf{Y}=\Big\{(V,\tau)\in\mathbf{S}^{N_1+...+N_J}_+\times\mathbf{R}_+:\ \exists(V_j,\tau_j)\in\mathbf{Y}^j:\ V\preceq\mathrm{Diag}\{V_1,...,V_J\}\ \&\ \tau\ge\sum_j\tau_j\Big\}$$
is compatible with $\mathcal{Y}=\mathcal{Y}_1\times...\times\mathcal{Y}_J$. This cone is sharp, provided that all the $\mathbf{Y}^j$ are so.

5. [linear image] Let cone $\mathbf{Y}$ be compatible with convex compact set $\mathcal{Y}\subset\mathbf{R}^N$, let $A$ be a $K\times N$ matrix, and let $\mathcal{Z}=A\mathcal{Y}$. The cone
$$\mathbf{Z}=\mathrm{cl}\{(V,\tau)\in\mathbf{S}^K_+\times\mathbf{R}_+:\ \exists U\succeq A^TVA:\ (U,\tau)\in\mathbf{Y}\}$$
is compatible with $\mathcal{Z}$. The closure operation can be skipped whenever $\mathbf{Y}$ is either sharp or complete, completeness meaning that $(V,\tau)\in\mathbf{Y}$ and $0\preceq V'\preceq V$ imply that $(V',\tau)\in\mathbf{Y}$. The cone $\mathbf{Z}$ is sharp, provided $\mathbf{Y}$ is so and the rank of $A$ is $K$.

6. [inverse linear image] Let cone $\mathbf{Y}$ be compatible with convex compact set $\mathcal{Y}\subset\mathbf{R}^N$, let $A$ be an $N\times K$ matrix with trivial kernel, and let $\mathcal{Z}=A^{-1}\mathcal{Y}:=\{z\in\mathbf{R}^K:\ Az\in\mathcal{Y}\}$. The cone
$$\mathbf{Z}=\mathrm{cl}\{(V,\tau)\in\mathbf{S}^K_+\times\mathbf{R}_+:\ \exists U:\ A^TUA\succeq V\ \&\ (U,\tau)\in\mathbf{Y}\}$$
is compatible with $\mathcal{Z}$. The closure operation can be skipped whenever $\mathbf{Y}$ is sharp, in which case $\mathbf{Z}$ is sharp as well.

7. [arithmetic summation] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j\subset\mathbf{R}^N$, $j=1,...,J$. Then the arithmetic sum $\mathcal{Y}=\mathcal{Y}_1+...+\mathcal{Y}_J$ of the sets $\mathcal{Y}_j$ can be equipped with a compatible cone readily given by the cones $\mathbf{Y}^j$; this cone is sharp, provided all the $\mathbf{Y}^j$ are so. Indeed, the arithmetic sum of the $\mathcal{Y}_j$ is the linear image of the direct product of the $\mathcal{Y}_j$'s under the mapping $[y^1;...;y^J]\mapsto y^1+...+y^J$, and it remains to combine rules 4 and 5; note that the cone yielded by rule 4 is complete, so that when applying rule 5, the closure operation can be skipped.
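As a small numerical illustration of rule 5 (linear image), the sketch below takes the unit Euclidean ball as the set, equipped with the cone $\{(U,\tau): U\succeq0,\ \|U\|_{2,2}\le\tau\}$ mentioned earlier, and checks on random samples that $z^TVz\le\tau$ for $z$ in the image of the ball under $A$ whenever $U=A^TVA$ and $\tau=\|U\|_{2,2}$. The dimensions, the random data, and the sampling scheme are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 5, 3
A = rng.standard_normal((K, N))           # K x N matrix, Z = A * (unit ball)
G = rng.standard_normal((K, K))
V = G @ G.T                               # PSD matrix in S^K_+

U = A.T @ V @ A                           # satisfies U >= A^T V A (with equality)
tau = float(np.linalg.norm(U, 2))         # spectral norm, so ||U||_{2,2} <= tau

# Sample points of the unit ball, map them through A, evaluate z^T V z.
Y = rng.standard_normal((5000, N))
Y /= np.maximum(1.0, np.linalg.norm(Y, axis=1, keepdims=True))
Z = Y @ A.T
worst = float(np.max(np.einsum("ij,jk,ik->i", Z, V, Z)))
```

Sampling can only fail to find a violation, it cannot prove compatibility; here the bound holds exactly because $z^TVz=y^TUy\le\|U\|_{2,2}\|y\|_2^2$.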
5.2
RECOVERING SIGNALS FROM NONLINEAR OBSERVATIONS BY STOCHASTIC OPTIMIZATION
The “common denominator” of all estimation problems considered so far in this chapter is that what we observed was obtained by adding noise to the linear image of the unknown signal to be recovered. In this section we consider the problem of signal estimation in the case where the observation is obtained by adding noise to a nonlinear transformation of the signal.

5.2.1 Problem setting
A motivating example for what follows is provided by the logistic regression model, where

• the unknown signal to be recovered is a vector $x$ known to belong to a given signal set $\mathcal{X}\subset\mathbf{R}^n$, which we assume to be a nonempty convex compact set;
• our observation $\omega^K=\{\omega_k=(\eta_k,y_k),\ 1\le k\le K\}$ stemming from a signal $x$ is as follows:
  – the regressors $\eta_1,...,\eta_K$ are i.i.d. realizations of an $n$-dimensional random vector $\eta$ with distribution $Q$ independent of $x$ and such that $Q$ possesses a finite and positive definite matrix $\mathbf{E}_{\eta\sim Q}\{\eta\eta^T\}$ of second moments;
  – the labels $y_k$ are generated as follows: $y_k$ is the Bernoulli random variable independent of the “history” $\eta_1,...,\eta_{k-1},y_1,...,y_{k-1}$, and the conditional, given $\eta_k$, probability for $y_k$ to be 1 is $\phi(\eta_k^Tx)$, where
$$\phi(s)=\frac{\exp\{s\}}{1+\exp\{s\}}.$$
In this model, the standard (and very well-studied) approach to estimating the signal $x$ underlying the observations is to use the Maximum Likelihood (ML) estimate: the logarithm of the conditional, given $\eta_k$, $1\le k\le K$, probability of getting the observed labels as a function of a candidate signal $z$ is
$$\begin{array}{rcl}
\ell(z,\omega^K)&=&\sum\limits_{k=1}^K\left[y_k\ln\phi(\eta_k^Tz)+(1-y_k)\ln\left(1-\phi(\eta_k^Tz)\right)\right]\\
&=&\Big[\sum\limits_ky_k\eta_k\Big]^Tz-\sum\limits_k\ln\left(1+\exp\{\eta_k^Tz\}\right),
\end{array} \tag{5.48}$$
and the ML estimate of the “true” signal $x$ underlying our observation $\omega^K$ is obtained by maximizing the log-likelihood $\ell(z,\omega^K)$ over $z\in\mathcal{X}$,
$$\widehat{x}_{\mathrm{ML}}(\omega^K)\in\mathop{\mathrm{Argmax}}_{z\in\mathcal{X}}\ell(z,\omega^K), \tag{5.49}$$
which is a convex optimization problem.
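The two expressions for the log-likelihood in (5.48) can be compared numerically. The sketch below assumes NumPy, stores the regressors as rows of a $K\times n$ array, and uses synthetic data; the function names are hypothetical. The second form uses `np.logaddexp(0, s)` for $\ln(1+e^s)$, which is numerically safer for large $|s|$.

```python
import numpy as np

def loglik_bernoulli(z, eta, y):
    """First line of (5.48): sum_k y_k ln phi(s_k) + (1 - y_k) ln(1 - phi(s_k))."""
    s = eta @ z
    phi = 1.0 / (1.0 + np.exp(-s))
    return float(np.sum(y * np.log(phi) + (1.0 - y) * np.log(1.0 - phi)))

def loglik_compact(z, eta, y):
    """Second line of (5.48): [sum_k y_k eta_k]^T z - sum_k ln(1 + exp(eta_k^T z))."""
    s = eta @ z
    return float((y @ eta) @ z - np.sum(np.logaddexp(0.0, s)))

rng = np.random.default_rng(0)
K, n = 200, 4
eta = rng.standard_normal((K, n))                    # rows are eta_k^T
x_true = np.array([0.5, -0.3, 0.2, 0.0])             # illustrative signal
y = (rng.uniform(size=K) < 1.0 / (1.0 + np.exp(-(eta @ x_true)))).astype(float)

z = rng.standard_normal(n)
value_a = loglik_bernoulli(z, eta, y)
value_b = loglik_compact(z, eta, y)
```

The compact form makes concavity in $z$ visible: a linear term minus a sum of convex log-sum-exp terms.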
The problem we intend to consider (referred to as the generalized linear model (GLM) in Statistics) can be viewed as a natural generalization of the logistic regression just presented and is as follows:

Our observation depends on unknown signal $x$ known to belong to a given convex compact set $\mathcal{X}\subset\mathbf{R}^n$ and is
$$\omega^K=\{\omega_k=(\eta_k,y_k),\ 1\le k\le K\} \tag{5.50}$$
with $\omega_k$, $1\le k\le K$, which are i.i.d. realizations of a random pair $(\eta,y)$ with the distribution $P_x$ such that
• the regressor $\eta$ is a random $n\times m$ matrix with some probability distribution $Q$ independent of $x$;
• the label $y$ is an $m$-dimensional random vector such that the conditional distribution of $y$ given $\eta$ induced by $P_x$ has the expectation $f(\eta^Tx)$:
$$\mathbf{E}_{x|\eta}\{y\}=f(\eta^Tx), \tag{5.51}$$
where $\mathbf{E}_{x|\eta}\{y\}$ is the conditional expectation of $y$ given $\eta$ stemming from the distribution $P_x$ of $\omega=(\eta,y)$, and $f(\cdot):\mathbf{R}^m\to\mathbf{R}^m$ (“link function”) is a given mapping.

Note that the logistic regression model corresponds to the case where $m=1$,
$f(s)=\frac{\exp\{s\}}{1+\exp\{s\}}$, and $y$ takes values 0, 1, with the conditional probability of taking value 1 given $\eta$ equal to $f(\eta^Tx)$. Another example is provided by the model
$$y=f(\eta^Tx)+\xi,$$
where $\xi$ is a random vector with zero mean independent of $\eta$, say, $\xi\sim\mathcal{N}(0,\sigma^2I_m)$. Note that in the latter case the ML estimate of the signal $x$ underlying the observations is
$$\widehat{x}_{\mathrm{ML}}(\omega^K)\in\mathop{\mathrm{Argmin}}_{z\in\mathcal{X}}\sum_k\|y_k-f(\eta_k^Tz)\|_2^2. \tag{5.52}$$
In contrast to what happens with logistic regression, now the optimization problem (“Nonlinear Least Squares”) responsible for the ML estimate typically is nonconvex and can be computationally difficult. Following [140], we intend to impose on the data of the estimation problem we have just described (namely, on $\mathcal{X}$, $f(\cdot)$, and the distributions $P_x$, $x\in\mathcal{X}$, of the pair $(\eta,y)$) assumptions which allow us to reduce our estimation problem to a problem with convex structure: a strongly monotone variational inequality represented by a stochastic oracle. At the end of the day, this will lead to a consistent estimate of the signal, with explicit “finite sample” accuracy guarantees.

5.2.2 Assumptions
Preliminaries: Monotone vector fields. A monotone vector field on $\mathbf{R}^m$ is a single-valued everywhere defined mapping $g(\cdot):\mathbf{R}^m\to\mathbf{R}^m$ which possesses the monotonicity property
$$[g(z)-g(z')]^T[z-z']\ge0\quad\forall z,z'\in\mathbf{R}^m.$$
We say that such a field is monotone with modulus $\kappa\ge0$ on a closed convex set $Z\subset\mathbf{R}^m$ if
$$[g(z)-g(z')]^T[z-z']\ge\kappa\|z-z'\|_2^2\quad\forall z,z'\in Z,$$
and say that $g$ is strongly monotone on $Z$ if the modulus of monotonicity of $g$ on $Z$ is positive. It is immediately seen that for a monotone vector field $f$ which is continuously differentiable on a closed convex set $Z$ with a nonempty interior, the necessary and sufficient condition for being monotone with modulus $\kappa$ on the set is
$$d^Tf'(z)d\ge\kappa d^Td\quad\forall(d\in\mathbf{R}^m,\ z\in Z). \tag{5.53}$$
Basic examples of monotone vector fields are:
• gradient fields $\nabla\phi(x)$ of continuously differentiable convex functions of $m$ variables or, more generally, the vector fields $[\nabla_x\phi(x,y);-\nabla_y\phi(x,y)]$ stemming from continuously differentiable functions $\phi(x,y)$ which are convex in $x$ and concave in $y$;
• “diagonal” vector fields $f(x)=[f_1(x_1);f_2(x_2);...;f_m(x_m)]$ with monotonically nondecreasing univariate components $f_i(\cdot)$. If, in addition, the $f_i(\cdot)$ are continuously differentiable with positive first order derivatives, then the associated field $f$ is strongly monotone on every compact convex subset of $\mathbf{R}^m$, the monotonicity modulus depending on the subset.
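The second example can be illustrated numerically. The sketch below takes the diagonal field whose components are all the logistic sigmoid (an illustrative choice) on the box $[-R,R]^m$; there the derivative is at least $f'(R)=f(R)(1-f(R))$, so by (5.53) this value is a valid monotonicity modulus on the box. The random test pairs and the names used are assumptions.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def diagonal_field(x):
    """Diagonal field f(x) = [sigmoid(x_1); ...; sigmoid(x_m)]."""
    return sigmoid(x)

R, m = 2.0, 5
kappa = sigmoid(R) * (1.0 - sigmoid(R))   # smallest derivative of sigmoid on [-R, R]

rng = np.random.default_rng(0)
Z1 = rng.uniform(-R, R, size=(1000, m))
Z2 = rng.uniform(-R, R, size=(1000, m))

# [f(z) - f(z')]^T [z - z'] versus kappa * ||z - z'||_2^2 for each sampled pair
lhs = np.sum((diagonal_field(Z1) - diagonal_field(Z2)) * (Z1 - Z2), axis=1)
rhs = kappa * np.sum((Z1 - Z2) ** 2, axis=1)
```

On a larger box the modulus shrinks, in line with the remark that the modulus depends on the subset.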
Monotone vector fields on $\mathbf{R}^n$ admit simple calculus which includes, in particular, the following two rules:

I. [affine substitution of argument]: If $f(\cdot)$ is a monotone vector field on $\mathbf{R}^m$ and $A$ is an $n\times m$ matrix, the vector field
$$g(x)=Af(A^Tx+a)$$
is monotone on $\mathbf{R}^n$; if, in addition, $f$ is monotone with modulus $\kappa\ge0$ on a closed convex set $Z\subset\mathbf{R}^m$ and $X\subset\mathbf{R}^n$ is closed, convex, and such that $A^Tx+a\in Z$ whenever $x\in X$, $g$ is monotone with modulus $\sigma^2\kappa$ on $X$, where $\sigma$ is the $n$-th singular value of $A$ (i.e., the largest $\gamma$ such that $\|A^Tx\|_2\ge\gamma\|x\|_2$ for all $x$).

II. [summation]: If $S$ is a Polish space, $f(x,s):\mathbf{R}^m\times S\to\mathbf{R}^m$ is a Borel vector-valued function which is monotone in $x$ for every $s\in S$, and $\mu(ds)$ is a Borel probability measure on $S$ such that the vector field
$$F(x)=\int_Sf(x,s)\mu(ds)$$
is well-defined for all $x$, then $F(\cdot)$ is monotone. If, in addition, $X$ is a closed convex set in $\mathbf{R}^m$ and $f(\cdot,s)$ is monotone on $X$ with Borel in $s$ modulus $\kappa(s)$ for every $s\in S$, then $F$ is monotone on $X$ with modulus $\int_S\kappa(s)\mu(ds)$.

Assumptions. In what follows, we make the following assumptions on the ingredients of the estimation problem posed in Section 5.2.1:

• A.1. The vector field $f(\cdot)$ is continuous and monotone, and the vector field
$$F(z)=\mathbf{E}_{\eta\sim Q}\left\{\eta f(\eta^Tz)\right\}$$
is well-defined (and therefore is monotone along with $f$ by I, II);
• A.2. The signal set $\mathcal{X}$ is a nonempty convex compact set, and the vector field $F$ is monotone with positive modulus $\kappa$ on $\mathcal{X}$;
• A.3. For properly selected $M<\infty$ and every $x\in\mathcal{X}$ it holds
$$\mathbf{E}_{(\eta,y)\sim P_x}\left\{\|\eta y\|_2^2\right\}\le M^2. \tag{5.54}$$

A simple sufficient condition for the validity of Assumptions A.1–3 with properly selected $M<\infty$ and $\kappa>0$ is as follows:
• The distribution $Q$ of $\eta$ has finite moments of all orders, and $\mathbf{E}_{\eta\sim Q}\{\eta\eta^T\}\succ0$;
• $f$ is continuously differentiable, and $d^Tf'(z)d>0$ for all $d\ne0$ and all $z$. Besides this, $f$ is of polynomial growth: for some constants $C\ge0$ and $p\ge0$ and all $z$ one has $\|f(z)\|_2\le C(1+\|z\|_2^p)$.

Verification of sufficiency is straightforward.

The principal observation underlying the construction we are about to discuss is as follows.

Proposition 5.11. With Assumptions A.1–3 in force, let us associate with a pair
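$(\eta,y)\in\mathbf{R}^{n\times m}\times\mathbf{R}^m$ the vector field
$$G_{(\eta,y)}(z)=\eta f(\eta^Tz)-\eta y:\ \mathbf{R}^n\to\mathbf{R}^n. \tag{5.55}$$
Then for every $x\in\mathcal{X}$ we have
$$\begin{array}{lrcll}
(a)&\mathbf{E}_{(\eta,y)\sim P_x}\left\{G_{(\eta,y)}(z)\right\}&=&F(z)-F(x)&\forall z\in\mathbf{R}^n\\
(b)&\|F(z)\|_2&\le&M&\forall z\in\mathcal{X}\\
(c)&\mathbf{E}_{(\eta,y)\sim P_x}\left\{\|G_{(\eta,y)}(z)\|_2^2\right\}&\le&4M^2&\forall z\in\mathcal{X}.
\end{array} \tag{5.56}$$
Proof is immediate. Indeed, let $x\in\mathcal{X}$. Then
$$\mathbf{E}_{(\eta,y)\sim P_x}\{\eta y\}=\mathbf{E}_{\eta\sim Q}\left\{\mathbf{E}_{x|\eta}\{\eta y\}\right\}=\mathbf{E}_\eta\left\{\eta f(\eta^Tx)\right\}=F(x)$$
(we have used (5.51) and the definition of $F$), whence
$$\begin{array}{rcl}
\mathbf{E}_{(\eta,y)\sim P_x}\left\{G_{(\eta,y)}(z)\right\}&=&\mathbf{E}_{(\eta,y)\sim P_x}\left\{\eta f(\eta^Tz)-\eta y\right\}=\mathbf{E}_{(\eta,y)\sim P_x}\left\{\eta f(\eta^Tz)\right\}-F(x)\\
&=&\mathbf{E}_{\eta\sim Q}\left\{\eta f(\eta^Tz)\right\}-F(x)=F(z)-F(x),
\end{array}$$
as stated in (5.56.a). Besides this, for $x,z\in\mathcal{X}$, taking into account that the marginal distribution of $\eta$ induced by $P_z$ is $Q$, we have
$$\begin{array}{rcll}
\mathbf{E}_{(\eta,y)\sim P_x}\{\|\eta f(\eta^Tz)\|_2^2\}&=&\mathbf{E}_{\eta\sim Q}\left\{\|\eta f(\eta^Tz)\|_2^2\right\}&\\
&=&\mathbf{E}_{\eta\sim Q}\left\{\|\mathbf{E}_{y\sim P^z_{|\eta}}\{\eta y\}\|_2^2\right\}&[\text{since }\mathbf{E}_{y\sim P^z_{|\eta}}\{y\}=f(\eta^Tz)]\\
&\le&\mathbf{E}_{\eta\sim Q}\left\{\mathbf{E}_{y\sim P^z_{|\eta}}\left\{\|\eta y\|_2^2\right\}\right\}&[\text{by Jensen's inequality}]\\
&=&\mathbf{E}_{(\eta,y)\sim P_z}\left\{\|\eta y\|_2^2\right\}\le M^2&[\text{by A.3 due to }z\in\mathcal{X}].
\end{array}$$
This combines with the relation $\mathbf{E}_{(\eta,y)\sim P_x}\{\|\eta y\|_2^2\}\le M^2$ given by A.3 due to $x\in\mathcal{X}$ to imply (5.56.b) and (5.56.c). ✷

Consequences. Our goal is to recover the signal $x\in\mathcal{X}$ underlying observations (5.50), and under Assumptions A.1–3, $x$ is a root of the monotone vector field
$$G(z)=F(z)-F(x),\quad F(z)=\mathbf{E}_{\eta\sim Q}\left\{\eta f(\eta^Tz)\right\}; \tag{5.57}$$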
we know that this root belongs to X , and this root is unique because G(·) is strongly monotone on X along with F (·). Now, the problem of finding a root, known to belong to a given convex compact set X , of a vector field G which is strongly monotone on this set is known to be computationally tractable, provided we have access to an “oracle” which, given on input a point z ∈ X , returns the value G(z) of the field at the point. The latter is not exactly the case in the situation we are interested in: the field G is the expectation of a random field: G(z) = E(η,y)∼Px ηf (η T z) − ηy , and we do not know a priori what the distribution is over which the expectation is taken. However, we can sample from this distribution—the samples are exactly the observations (5.50), and we can use these samples to approximate G and use
this approximation to approximate the signal $x$.⁷ Two standard implementations of this idea are Sample Average Approximation (SAA) and Stochastic Approximation (SA). We are about to consider these two techniques as applied to the situation we are in.

5.2.3 Estimating via Sample Average Approximation
The idea underlying SAA is quite transparent: given observations (5.50), let us approximate the field of interest $G$ with its empirical counterpart
$$G_{\omega^K}(z)=\frac1K\sum_{k=1}^K\left[\eta_kf(\eta_k^Tz)-\eta_ky_k\right].$$
By the Law of Large Numbers, as $K\to\infty$, the empirical field $G_{\omega^K}$ converges to the field of interest $G$, so that under mild regularity assumptions, when $K$ is large, $G_{\omega^K}$, with overwhelming probability, will be close to $G$ uniformly on $\mathcal{X}$. Due to strong monotonicity of $G$, this would imply that a set of “near-zeros” of $G_{\omega^K}$ on $\mathcal{X}$ will be close to the zero $x$ of $G$, which is nothing but the signal we want to recover. The only question is how we can consistently define a “near-zero” of $G_{\omega^K}$ on $\mathcal{X}$.⁸ A convenient notion of a “near-zero” in our context is provided by the concept of a weak solution to a variational inequality with a monotone operator, defined as follows (we restrict the general definition to the situation of interest):

Let $\mathcal{X}\subset\mathbf{R}^n$ be a nonempty convex compact set, and $H(z):\mathcal{X}\to\mathbf{R}^n$ be a monotone (i.e., $[H(z)-H(z')]^T[z-z']\ge0$ for all $z,z'\in\mathcal{X}$) vector field. A vector $z_*\in\mathcal{X}$ is called a weak solution to the variational inequality (VI) associated with $H,\mathcal{X}$ when
$$H^T(z)[z-z_*]\ge0\quad\forall z\in\mathcal{X}.$$

Let $\mathcal{X}\subset\mathbf{R}^n$ be a nonempty convex compact set and $H$ be monotone on $\mathcal{X}$. It is well known that

• The VI associated with $H,\mathcal{X}$ (let us denote it by VI$(H,\mathcal{X})$) always has a weak solution. It is clear that if $\bar z\in\mathcal{X}$ is a root of $H$, then $\bar z$ is a weak solution to VI$(H,\mathcal{X})$.⁹
• When $H$ is continuous on $\mathcal{X}$, every weak solution $\bar z$ to VI$(H,\mathcal{X})$ is also a strong solution, meaning that
$$H^T(\bar z)(z-\bar z)\ge0\quad\forall z\in\mathcal{X}. \tag{5.58}$$
Indeed, (5.58) clearly holds true when $z=\bar z$. Assuming $z\ne\bar z$ and setting $z_t=\bar z+t(z-\bar z)$, $0<t\le1$, we have $H^T(z_t)(z_t-\bar z)\ge0$ (since $\bar z$ is a weak solution), whence $H^T(z_t)(z-\bar z)\ge0$ (since $z-\bar z$ is a positive multiple of $z_t-\bar z$). Passing to the limit as $t\to+0$ and invoking the continuity of $H$, we get $H^T(\bar z)(z-\bar z)\ge0$, as claimed.
• When $H$ is the gradient field of a continuously differentiable convex function on $\mathcal{X}$ (such a field indeed is monotone), weak (or strong, which in the case of continuous $H$ is the same) solutions to VI$(H,\mathcal{X})$ are exactly the minimizers of the function on $\mathcal{X}$.

Note also that a strong solution to VI$(H,\mathcal{X})$ with monotone $H$ always is a weak one: if $\bar z\in\mathcal{X}$ satisfies $H^T(\bar z)(z-\bar z)\ge0$ for all $z\in\mathcal{X}$, then $H(z)^T(z-\bar z)\ge0$ for all $z\in\mathcal{X}$, since by monotonicity $H^T(z)(z-\bar z)\ge H^T(\bar z)(z-\bar z)$.

⁷ The observation expressed by Proposition 5.11, however simple, and the resulting course of actions seem to be new. In retrospect, one can recognize unperceived ad hoc utilization of this approach in Perceptron and Isotron algorithms, see [1, 2, 29, 62, 116, 141, 142, 210] and references therein.
⁸ Note that we in general cannot define a “near-zero” of $G_{\omega^K}$ on $\mathcal{X}$ as a root of $G_{\omega^K}$ on this set: while $G$ does have a root belonging to $\mathcal{X}$, nobody told us that the same holds true for $G_{\omega^K}$.
⁹ Indeed, when $\bar z\in\mathcal{X}$ and $H(\bar z)=0$, monotonicity of $H$ implies that $H^T(z)[z-\bar z]=[H(z)-H(\bar z)]^T[z-\bar z]\ge0$ for all $z\in\mathcal{X}$, that is, $\bar z$ is a weak solution to the VI.

In the sequel, we utilize the following simple and well-known fact:

Lemma 5.12. Let $\mathcal{X}$ be a convex compact set, and $H$ be a monotone vector field on $\mathcal{X}$ with monotonicity modulus $\kappa>0$, i.e.,
$$\forall z,z'\in\mathcal{X}:\ [H(z)-H(z')]^T[z-z']\ge\kappa\|z-z'\|_2^2.$$
Further, let $\bar z$ be a weak solution to VI$(H,\mathcal{X})$. Then the weak solution to VI$(H,\mathcal{X})$ is unique. Besides this,
$$H^T(z)[z-\bar z]\ge\kappa\|z-\bar z\|_2^2\quad\forall z\in\mathcal{X}. \tag{5.59}$$
Proof: Under the premise of the lemma, let $z\in\mathcal{X}$ and let $\bar z$ be a weak solution to VI$(H,\mathcal{X})$ (recall that it does exist). Setting $z_t=\bar z+t(z-\bar z)$, for $t\in(0,1)$ we have
$$H^T(z)[z-z_t]\ge H^T(z_t)[z-z_t]+\kappa\|z-z_t\|_2^2\ge\kappa\|z-z_t\|_2^2,$$
where the first $\ge$ is due to strong monotonicity of $H$, and the second $\ge$ is due to the fact that $H^T(z_t)[z-z_t]$ is proportional, with positive coefficient, to $H^T(z_t)[z_t-\bar z]$, and the latter quantity is nonnegative since $\bar z$ is a weak solution to the VI in question. We end up with $H^T(z)(z-z_t)\ge\kappa\|z-z_t\|_2^2$; passing to the limit as $t\to+0$, we arrive at (5.59). To prove uniqueness of a weak solution, assume that besides the weak solution $\bar z$ there exists a weak solution $\widetilde z$ distinct from $\bar z$, and let us set $z'=\frac12[\bar z+\widetilde z]$. Since both $\bar z$ and $\widetilde z$ are weak solutions, both the quantities $H^T(z')[z'-\bar z]$ and $H^T(z')[z'-\widetilde z]$ should be nonnegative, and because the sum of these quantities is 0, both of them are zero. Thus, when applying (5.59) to $z=z'$, we get $z'=\bar z$, whence $\widetilde z=\bar z$ as well. ✷

Now let us come back to the estimation problem under consideration. Let Assumptions A.1–3 hold, so that vector fields $G_{(\eta_k,y_k)}(z)$ defined in (5.55), and therefore vector field $G_{\omega^K}(z)$, are continuous and monotone. When using the SAA, we compute a weak solution $\widehat{x}(\omega^K)$ to VI$(G_{\omega^K},\mathcal{X})$ and treat it as the SAA estimate of signal $x$ underlying observations (5.50). Since the vector field $G_{\omega^K}(\cdot)$ is monotone with efficiently computable values, provided that so is $f$, computing (a high accuracy approximation to) a weak solution to VI$(G_{\omega^K},\mathcal{X})$ is a computationally tractable problem (see, e.g., [189]).
Moreover, utilizing the techniques from [30, 204, 220, 212, 213], under mild regularity assumptions additional to A.1–3 one can get a non-asymptotical upper bound on, say, the expected k · k22 -error of the SAA estimate as a function of the sample size K and find out the rate at which this bound converges to 0 as K → ∞; this analysis, however, goes beyond our scope.
Let us specify the SAA estimate in the logistic regression model. In this case we have $f(u)=(1+e^{-u})^{-1}$, and
$$\begin{array}{rcl}
G_{(\eta_k,y_k)}(z)&=&\left[\dfrac{\exp\{\eta_k^Tz\}}{1+\exp\{\eta_k^Tz\}}-y_k\right]\eta_k,\\
G_{\omega^K}(z)&=&\dfrac1K\sum\limits_{k=1}^K\left[\dfrac{\exp\{\eta_k^Tz\}}{1+\exp\{\eta_k^Tz\}}-y_k\right]\eta_k=\dfrac1K\nabla_z\left[\sum\limits_k\left(\ln\left(1+\exp\{\eta_k^Tz\}\right)-y_k\eta_k^Tz\right)\right].
\end{array}$$
In other words, $G_{\omega^K}(z)$ is proportional, with negative coefficient $-1/K$, to the gradient field of the log-likelihood $\ell(z,\omega^K)$; see (5.48). As a result, in the case in question weak solutions to VI$(G_{\omega^K},\mathcal{X})$ are exactly the maximizers of the log-likelihood $\ell(z,\omega^K)$ over $z\in\mathcal{X}$, that is, for the logistic regression the SAA estimate is nothing but the Maximum Likelihood estimate $\widehat{x}_{\mathrm{ML}}(\omega^K)$ as defined in (5.49).¹⁰

On the other hand, in the “nonlinear least squares” example described in Section 5.2.1 with (for the sake of simplicity, scalar) monotone $f(\cdot)$ the vector field $G_{\omega^K}(\cdot)$ is given by
$$G_{\omega^K}(z)=\frac1K\sum_{k=1}^K\left[f(\eta_k^Tz)-y_k\right]\eta_k$$
which is different (provided that $f$ is nonlinear) from the gradient field
$$2\sum_{k=1}^Kf'(\eta_k^Tz)\left[f(\eta_k^Tz)-y_k\right]\eta_k$$
of the minus log-likelihood appearing in (5.52). As a result, in this case the ML estimate (5.52) is, in general, different from the SAA estimate (and, in contrast to the ML, the SAA estimate is easy to compute).

¹⁰ This phenomenon is specific to the logistic regression model. The equality of the SAA and the ML estimates in this case is due to the fact that the logistic sigmoid $f(s)=\exp\{s\}/(1+\exp\{s\})$ “happens” to satisfy the identity $f'(s)=f(s)(1-f(s))$. When replacing the logistic sigmoid with $f(s)=\phi(s)/(1+\phi(s))$ with differentiable monotonically nondecreasing positive $\phi(\cdot)$, the SAA estimate becomes the weak solution to VI$(\Phi,\mathcal{X})$ with
$$\Phi(z)=\sum_k\left[\frac{\phi(\eta_k^Tz)}{1+\phi(\eta_k^Tz)}-y_k\right]\eta_k.$$
On the other hand, the gradient field of the minus log-likelihood
$$-\sum_k\left[y_k\ln(f(\eta_k^Tz))+(1-y_k)\ln(1-f(\eta_k^Tz))\right]$$
we need to minimize when computing the ML estimate is
$$\Psi(z)=\sum_k\frac{\phi'(\eta_k^Tz)}{\phi(\eta_k^Tz)}\left[\frac{\phi(\eta_k^Tz)}{1+\phi(\eta_k^Tz)}-y_k\right]\eta_k.$$
When $K>1$ and $\phi$ is not an exponent, $\Phi$ and $\Psi$ are “essentially different,” so that the SAA estimate typically will differ from the ML one.
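The relation between the SAA field and the log-likelihood in the logistic case can be verified by finite differences. The sketch below is an illustration on synthetic data (an assumption), with regressors stored as rows of a $K\times n$ array and hypothetical function names; it compares $G_{\omega^K}(z)$ with $1/K$ times a numerical gradient of the minus log-likelihood.

```python
import numpy as np

def saa_field(z, eta, y):
    """G_{omega^K}(z) = (1/K) sum_k [f(eta_k^T z) - y_k] eta_k with logistic f."""
    s = eta @ z
    return eta.T @ (1.0 / (1.0 + np.exp(-s)) - y) / len(y)

def neg_loglik(z, eta, y):
    """Minus log-likelihood: sum_k [ln(1 + exp(eta_k^T z)) - y_k eta_k^T z]."""
    s = eta @ z
    return float(np.sum(np.logaddexp(0.0, s)) - y @ s)

def numerical_gradient(fun, z, h=1e-6):
    """Central finite-difference gradient (for checking only)."""
    g = np.zeros_like(z)
    for i in range(len(z)):
        e = np.zeros_like(z)
        e[i] = h
        g[i] = (fun(z + e) - fun(z - e)) / (2.0 * h)
    return g

rng = np.random.default_rng(0)
K, n = 50, 4
eta = rng.standard_normal((K, n))
y = (rng.uniform(size=K) < 0.5).astype(float)   # arbitrary labels suffice for the identity
z = rng.standard_normal(n)

field = saa_field(z, eta, y)
grad = numerical_gradient(lambda v: neg_loglik(v, eta, y), z) / K
```

The identity holds for any labels, which is why the check does not need data generated from the model.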
5.2.4 Stochastic Approximation estimate
The Stochastic Approximation (SA) estimate stems from a simple algorithm, Subgradient Descent, for solving the variational inequality VI$(G, X)$. Were the values of the vector field $G(\cdot)$ available, one could approximate a root $x \in X$ of this VI using the recurrence
$$z_k = \mathrm{Proj}_X[z_{k-1} - \gamma_k G(z_{k-1})],\quad k = 1, 2, ..., K,$$
where
• $\mathrm{Proj}_X[z]$ is the metric projection of $\mathbf{R}^n$ onto $X$: $\mathrm{Proj}_X[z] = \mathop{\mathrm{argmin}}_{u \in X} \|z - u\|_2$;
• $\gamma_k > 0$ are given stepsizes;
• the initial point $z_0$ is an arbitrary point of $X$.
It is well known that under Assumptions A.1–3 this recurrence, with properly selected stepsizes and started at a point from $X$, allows one to approximate the root of $G$ (in fact, the unique weak solution to VI$(G, X)$) to any desired accuracy, provided $K$ is large enough. However, we are in the situation where the actual values of $G$ are not available; the standard way to cope with this difficulty is to replace in the above recurrence the “unobservable” values $G(z_{k-1})$ of $G$ with their unbiased random estimates $G_{(\eta_k, y_k)}(z_{k-1})$. This modification gives rise to Stochastic Approximation (going back to [146]), that is, the recurrence
$$z_k = \mathrm{Proj}_X[z_{k-1} - \gamma_k G_{(\eta_k, y_k)}(z_{k-1})],\quad 1 \le k \le K, \tag{5.60}$$
where $z_0$ is a point from $X$ chosen once and for all, and $\gamma_k > 0$ are deterministic stepsizes.
The next item on our agenda is the (well-known) convergence analysis of SA under Assumptions A.1–3. To this end observe that the $z_k$ are deterministic functions of the initial fragments $\omega^k = \{\omega_t, 1 \le t \le k\} \sim \underbrace{P_x \times ... \times P_x}_{P_x^k}$ of our sequence of observations $\omega^K = \{\omega_k = (\eta_k, y_k), 1 \le k \le K\}$: $z_k = Z_k(\omega^k)$. Let us set
$$D_k(\omega^k) = \tfrac{1}{2}\|Z_k(\omega^k) - x\|_2^2 = \tfrac{1}{2}\|z_k - x\|_2^2,\qquad d_k = E_{\omega^k \sim P_x^k}\{D_k(\omega^k)\},$$
where $x \in X$ is the signal underlying observations (5.50). Note that, as is well known, the metric projection onto a closed convex set $X$ is contracting:
$$\forall (z \in \mathbf{R}^n, u \in X):\ \|\mathrm{Proj}_X[z] - u\|_2 \le \|z - u\|_2.$$
Consequently, for $1 \le k \le K$ it holds
$$\begin{aligned}
D_k(\omega^k) &= \tfrac{1}{2}\|\mathrm{Proj}_X[z_{k-1} - \gamma_k G_{\omega_k}(z_{k-1})] - x\|_2^2\\
&\le \tfrac{1}{2}\|z_{k-1} - \gamma_k G_{\omega_k}(z_{k-1}) - x\|_2^2\\
&= \tfrac{1}{2}\|z_{k-1} - x\|_2^2 - \gamma_k G_{\omega_k}(z_{k-1})^T(z_{k-1} - x) + \tfrac{1}{2}\gamma_k^2\|G_{\omega_k}(z_{k-1})\|_2^2.
\end{aligned}$$
Taking expectations w.r.t. $\omega^k \sim P_x^k$ on both sides of the resulting inequality and keeping in mind relations (5.56) along with the fact that $z_{k-1} \in X$, we get
$$d_k \le d_{k-1} - \gamma_k E_{\omega^{k-1} \sim P_x^{k-1}}\left\{G(z_{k-1})^T(z_{k-1} - x)\right\} + 2\gamma_k^2 M^2. \tag{5.61}$$
Recalling that we are in the case where $G$ is strongly monotone on $X$ with modulus $\kappa > 0$, $x$ is the weak solution to VI$(G, X)$, and $z_{k-1}$ takes values in $X$, and invoking (5.59), we see that the expectation in (5.61) is at least $2\kappa d_{k-1}$, and we arrive at the relation
$$d_k \le (1 - 2\kappa\gamma_k)d_{k-1} + 2\gamma_k^2 M^2. \tag{5.62}$$
We put
$$S = \frac{2M^2}{\kappa^2},\qquad \gamma_k = \frac{1}{\kappa(k+1)}. \tag{5.63}$$
Let us verify by induction in $k$ that for $k = 0, 1, ..., K$ it holds
$$d_k \le (k+1)^{-1} S. \tag{$*_k$}$$
Base $k = 0$. Let $D$ stand for the $\|\cdot\|_2$-diameter of $X$, and let $z_\pm \in X$ be such that $\|z_+ - z_-\|_2 = D$. By (5.56) we have $\|F(z)\|_2 \le M$ for all $z \in X$, and by strong monotonicity of $G(\cdot)$ on $X$ we have
$$[G(z_+) - G(z_-)]^T[z_+ - z_-] = [F(z_+) - F(z_-)]^T[z_+ - z_-] \ge \kappa\|z_+ - z_-\|_2^2 = \kappa D^2.$$
By the Cauchy inequality, the left-hand side in the concluding $\ge$ is at most $2MD$, and we get
$$D \le \frac{2M}{\kappa},$$
whence $S \ge D^2/2$. On the other hand, due to the origin of $d_0$ we have $d_0 \le D^2/2$. Thus, $(*_0)$ holds true.
Inductive step $(*_{k-1}) \Rightarrow (*_k)$. Now assume that $(*_{k-1})$ holds true for some $k$, $1 \le k \le K$, and let us prove that $(*_k)$ holds true as well. Observe that $\kappa\gamma_k = (k+1)^{-1} \le 1/2$, so that
$$\begin{aligned}
d_k &\le d_{k-1}(1 - 2\kappa\gamma_k) + 2\gamma_k^2 M^2 &&\text{[by (5.62)]}\\
&\le \frac{S}{k}(1 - 2\kappa\gamma_k) + 2\gamma_k^2 M^2 &&\text{[by $(*_{k-1})$ and due to $\kappa\gamma_k \le 1/2$]}\\
&= \frac{S}{k}\left(1 - \frac{2}{k+1}\right) + \frac{S}{(k+1)^2} = \frac{S}{k+1}\left(\frac{k-1}{k} + \frac{1}{k+1}\right) \le \frac{S}{k+1},
\end{aligned}$$
so that $(*_k)$ holds true. Induction is complete. Recalling that $d_k = \tfrac{1}{2}E\{\|z_k - x\|_2^2\}$, we arrive at the following:
Proposition 5.13. Under Assumptions A.1–3 and with the stepsizes
$$\gamma_k = \frac{1}{\kappa(k+1)},\quad k = 1, 2, ..., \tag{5.64}$$
for every signal $x \in X$ the sequence of estimates $\hat{x}_k(\omega^k) = z_k$ given by the SA recurrence (5.60) and $\omega_k = (\eta_k, y_k)$ defined in (5.50) obeys the error bound
$$E_{\omega^k \sim P_x^k}\left\{\|\hat{x}_k(\omega^k) - x\|_2^2\right\} \le \frac{4M^2}{\kappa^2(k+1)},\quad k = 0, 1, ..., \tag{5.65}$$
$P_x$ being the distribution of $(\eta, y)$ stemming from signal $x$.
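As an illustration of recurrence (5.60) with stepsizes (5.64), here is a minimal Python sketch. It is not the authors' implementation; it assumes the simplest setting of linear regression, $f(s) = s$, with $\eta \sim N(0, I_n)$, unit-variance Gaussian noise, and $X$ the unit ball, in which case $F(z) = z$ and one may take $\kappa = 1$. The function names `project_ball`, `sa_estimate`, and `simulate_linear` are hypothetical.

```python
import math
import random


def project_ball(z, radius=1.0):
    """Metric projection onto the Euclidean ball {u : ||u||_2 <= radius}."""
    norm = math.sqrt(sum(v * v for v in z))
    return list(z) if norm <= radius else [v * radius / norm for v in z]


def sa_estimate(observations, f, kappa, n):
    """SA recurrence z_k = Proj_X[z_{k-1} - gamma_k * G_{omega_k}(z_{k-1})].

    Here G_{(eta, y)}(z) = eta * (f(eta^T z) - y) and gamma_k = 1/(kappa*(k+1)).
    The starting point z_0 = 0 is an assumption of this sketch.
    """
    z = [0.0] * n
    for k, (eta, y) in enumerate(observations, start=1):
        gamma = 1.0 / (kappa * (k + 1))
        residual = f(sum(e * v for e, v in zip(eta, z))) - y
        z = project_ball([v - gamma * residual * e for v, e in zip(z, eta)])
    return z


def simulate_linear(n=5, K=4000, seed=0):
    """Draw a signal on the unit sphere, simulate K observations, run SA."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    x = [v / norm for v in g]
    observations = []
    for _ in range(K):
        eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = sum(e * v for e, v in zip(eta, x)) + rng.gauss(0.0, 1.0)
        observations.append((eta, y))
    z = sa_estimate(observations, lambda s: s, kappa=1.0, n=n)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, x)))


print("recovery error:", simulate_linear())
```

Each step touches a single observation, so the cost of the whole run is linear in $K$; this is the source of the CPU-time advantage of SA over SAA.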
5.2.5 Numerical illustration
To illustrate the above developments, we present here the results of some numerical experiments. Our deliberately simplistic setup is as follows:
• $X = \{x \in \mathbf{R}^n : \|x\|_2 \le 1\}$;
• the distribution $Q$ of $\eta$ is $N(0, I_n)$;
• $f$ is the monotone vector field on $\mathbf{R}$ given by one of the following four options:
  A. $f(s) = \exp\{s\}/(1 + \exp\{s\})$ (“logistic sigmoid”);
  B. $f(s) = s$ (“linear regression”);
  C. $f(s) = \max[s, 0]$ (“hinge function”);
  D. $f(s) = \min[1, \max[s, 0]]$ (“ramp sigmoid”);
• the conditional distribution of $y$ given $\eta$ induced by $P_x$ is
  – the Bernoulli distribution with probability $f(\eta^T x)$ of outcome 1 in the case of A (i.e., A corresponds to the logistic model),
  – the Gaussian distribution $N(f(\eta^T x), I_n)$ in cases B–D.
Note that when $m = 1$ and $\eta \sim N(0, I_n)$, one can easily compute the field $F(z)$. Indeed, we have for all $z \in \mathbf{R}^n \setminus \{0\}$:
$$\eta = \frac{zz^T}{\|z\|_2^2}\eta + \underbrace{\left[I - \frac{zz^T}{\|z\|_2^2}\right]\eta}_{\eta_\perp},$$
and due to the independence of $\eta^T z$ and $\eta_\perp$,
$$F(z) = E_{\eta \sim N(0, I)}\{\eta f(\eta^T z)\} = E_{\eta \sim N(0, I)}\left\{\frac{zz^T\eta}{\|z\|_2^2} f(\eta^T z)\right\} = \frac{z}{\|z\|_2} E_{\zeta \sim N(0,1)}\{\zeta f(\|z\|_2\zeta)\},$$
so that $F(z)$ is proportional to $z/\|z\|_2$ with proportionality coefficient
$$h(\|z\|_2) = E_{\zeta \sim N(0,1)}\{\zeta f(\|z\|_2\zeta)\}.$$
In Figure 5.2 we present the plots of the function $h(t)$ for the situations A–D and of the moduli of strong monotonicity of the corresponding mappings $F$ on the $\|\cdot\|_2$-ball of radius $R$ centered at the origin, as functions of $R$. The dimension $n$ in all experiments was set to 100, and the number of observations $K$ was 400, 1,000, 4,000, 10,000, and 40,000. For each combination of parameters we ran 10 simulations for signals $x$ underlying observations (5.50) drawn randomly from the uniform distribution on the unit sphere (the boundary of $X$).
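The coefficient $h(t) = E_{\zeta \sim N(0,1)}\{\zeta f(t\zeta)\}$ is a one-dimensional Gaussian integral and can be approximated numerically. The sketch below uses a plain trapezoidal rule on a truncated interval; the truncation width, the grid size, and the names `h` and `LINKS` are choices made for this illustration only. Two closed-form values serve as sanity checks: $h(t) = t$ in case B and $h(t) = t/2$ in case C.

```python
import math


def h(f, t, half_width=8.0, steps=4000):
    """Trapezoidal approximation of E_{zeta~N(0,1)}[zeta * f(t * zeta)].

    The integral is truncated to [-half_width, half_width]; the Gaussian
    weight makes the neglected tails negligible for moderate t.
    """
    step = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        s = -half_width + i * step
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * s * f(t * s) * math.exp(-0.5 * s * s)
    return total * step / math.sqrt(2.0 * math.pi)


# The four links A-D of the numerical illustration.
LINKS = {
    "A": lambda s: 1.0 / (1.0 + math.exp(-s)),   # logistic sigmoid
    "B": lambda s: s,                            # linear regression
    "C": lambda s: max(s, 0.0),                  # hinge function
    "D": lambda s: min(1.0, max(s, 0.0)),        # ramp sigmoid
}

for name, link in LINKS.items():
    print(name, [round(h(link, t), 4) for t in (0.5, 1.0, 2.0)])
```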
Figure 5.2: Left: functions $h$; right: moduli of strong monotonicity of the operators $F(\cdot)$ in $\{z : \|z\|_2 \le R\}$ as functions of $R$. Dashed lines: case A (logistic sigmoid); solid lines: case B (linear regression); dash-dotted lines: case C (hinge function); dotted line: case D (ramp sigmoid).
In each experiment, we computed the SAA and the SA estimates (note that in the cases A and B the SAA estimate is the Maximum Likelihood estimate as well). The SA stepsizes $\gamma_k$ were selected according to (5.64) with “empirically tuned” $\kappa$.¹¹ Namely, given observations $\omega_k = (\eta_k, y_k)$, $k \le K$ (see (5.50)), we used them to build the SA estimate in two stages:
– at the tuning stage, we generate a random “training signal” $x' \in X$ and then generate labels $y_k'$ as if $x'$ were the actual signal. For instance, in the case of A, $y_k'$ is assigned value 1 with probability $f(\eta_k^T x')$ and value 0 with complementary probability. After the “training signal” and associated labels are generated, we run SA on the resulting artificial observations with different values of $\kappa$, compute the accuracy of the resulting estimates, and select the value of $\kappa$ resulting in the best recovery;
– at the execution stage, we run SA on the actual data with stepsizes (5.64) specified by the $\kappa$ found at the tuning stage.
The results of some numerical experiments are presented in Figure 5.3. Note that the CPU time for SA includes both tuning and execution stages. The conclusion from these experiments is that as far as estimation quality is concerned, the SAA estimate marginally outperforms the SA estimate, while being significantly more time consuming. Note also that the dependence of recovery errors on $K$ observed in our experiments is consistent with the convergence rate $O(1/\sqrt{K})$ established by Proposition 5.13.
Comparison with Nonlinear Least Squares. Observe that in the case $m = 1$ of scalar monotone field $f$, the SAA estimate yielded by our approach as applied to observation $\omega^K$ is the minimizer of the convex function
$$H_{\omega^K}(z) = \frac{1}{K}\sum_{k=1}^K\left[v(\eta_k^T z) - y_k\eta_k^T z\right],\qquad v(r) = \int_0^r f(s)\,ds,$$
11 We could get (lower bounds on) the moduli of strong monotonicity of the vector fields F (·) we are interested in analytically, but this would be boring and conservative.
Figure 5.3: Mean errors and CPU times for SA (solid lines) and SAA estimates (dashed lines) as functions of the number of observations $K$. Panels: mean estimation error $\|\hat{x}_k(\omega^k) - x\|_2$; CPU time (sec). o: case A (logistic link); x: case B (linear link); +: case C (hinge function); □: case D (ramp sigmoid).
on the signal set $X$. When $f$ is the logistic sigmoid, $H_{\omega^K}(\cdot)$ is exactly the convex loss function leading to the ML estimate in the logistic regression model. As we have already mentioned, this is not the case for a general GLM. Consider, e.g., the situation where the regressors and the signals are reals, the distribution of regressor $\eta$ is $N(0, 1)$, and the conditional distribution of $y$ given $\eta$ is $N(f(\eta x), \sigma^2)$, with $f(s) = \arctan(s)$. In this situation the ML estimate stemming from observation $\omega^K$ is the minimizer on $X$ of the function
$$M_{\omega^K}(z) = \frac{1}{K}\sum_{k=1}^K\left[y_k - \arctan(\eta_k z)\right]^2. \tag{5.66}$$
The latter function is typically nonconvex and can be multi-extremal. For example, when running simulations¹² we from time to time observe a situation similar to that presented in Figure 5.4. Of course, in our toy situation of scalar $x$ the existence of several local minima of $M_{\omega^K}(\cdot)$ is not an issue: we can easily compute the ML estimate by a brute force search along a dense grid. What to do in the multidimensional case is another question. We could also add that in the simulations which led to Figure 5.4 both the SAA and the ML estimates exhibited nearly the same performance in terms of the estimation error: in 1,000 experiments, the median of the observed recovery errors was 0.969 for the ML estimate, and 0.932 for the SAA estimate. When increasing the number of observations to 1,000, the empirical median (taken over 1,000 simulations) of recovery errors became 0.079 for the ML estimate, and 0.085 for the SAA estimate.
12 In these simulations, the “true” signal $x$ underlying observations was drawn from $N(0, 1)$, the number $K$ of observations also was random with uniform distribution on $\{1, ..., 20\}$, and $X = [-20, 20]$, $\sigma = 3$ were used.
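The contrast between the two objectives can be explored with a short Python sketch. For $f = \arctan$ the primitive is available in closed form, $v(r) = r\arctan(r) - \tfrac{1}{2}\ln(1 + r^2)$, so both $H_{\omega^K}$ and $M_{\omega^K}$ from (5.66) can be evaluated on a grid over $X = [-20, 20]$. The function names, the grid size, and the sample parameters below are assumptions of this illustration, not the settings behind Figure 5.4.

```python
import math
import random


def v(r):
    """v(r) = integral_0^r arctan(s) ds, in closed form."""
    return r * math.atan(r) - 0.5 * math.log1p(r * r)


def saa_objective(z, data):
    """H(z) = (1/K) * sum_k [v(eta_k z) - y_k eta_k z]; convex in z."""
    return sum(v(eta * z) - y * eta * z for eta, y in data) / len(data)


def ml_objective(z, data):
    """M(z) = (1/K) * sum_k [y_k - arctan(eta_k z)]^2; may be multi-extremal."""
    return sum((y - math.atan(eta * z)) ** 2 for eta, y in data) / len(data)


def grid_argmin(objective, data, lo=-20.0, hi=20.0, points=4001):
    """Brute-force minimization over a dense grid on [lo, hi]."""
    grid = [lo + (hi - lo) * i / (points - 1) for i in range(points)]
    return min(grid, key=lambda z: objective(z, data))


def make_data(x, K, sigma, rng):
    """Simulate y_k = arctan(eta_k x) + sigma * noise with eta_k ~ N(0, 1)."""
    data = []
    for _ in range(K):
        eta = rng.gauss(0.0, 1.0)
        data.append((eta, math.atan(eta * x) + sigma * rng.gauss(0.0, 1.0)))
    return data


rng = random.Random(1)
sample = make_data(x=0.5, K=10, sigma=3.0, rng=rng)
print("SAA estimate:", grid_argmin(saa_objective, sample))
print("ML estimate: ", grid_argmin(ml_objective, sample))
```

Grid search is used for both objectives only to keep the comparison uniform; since $H_{\omega^K}$ is convex, any one-dimensional convex minimization routine would do for the SAA estimate.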
Figure 5.4: Solid curve: $M_{\omega^K}(z)$; dashed curve: $H_{\omega^K}(z)$. True signal $x$ (solid vertical line): +0.081; SAA estimate (unique minimizer of $H_{\omega^K}$, dashed vertical line): −0.252; ML estimate (global minimizer of $M_{\omega^K}$ on $[-20, 20]$): −20.00; closest to $x$ local minimizer of $M_{\omega^K}$ (dotted vertical line): −0.363.
5.2.6 “Single-observation” case
Let us look at the special case of our estimation problem where the sequence $\eta_1, ..., \eta_K$ of regressors in (5.50) is deterministic. At first glance, this situation goes beyond our setup, where the regressors should be i.i.d. drawn from some distribution $Q$. However, we can circumvent this “contradiction” by saying that we are now in the single-observation case with the regressor being the matrix $[\eta_1, ..., \eta_K]$ and $Q$ being a degenerate distribution supported at a singleton. Specifically, consider the case where our observation is
$$\omega = (\eta, y) \in \mathbf{R}^{n \times mK} \times \mathbf{R}^{mK} \tag{5.67}$$
($m, n, K$ are given positive integers), and the distribution $P_x$ of the observation stemming from a signal $x \in \mathbf{R}^n$ is as follows:
• $\eta$ is a given deterministic matrix independent of $x$;
• $y$ is random, and the distribution of $y$ induced by $P_x$ has mean $\phi(\eta^T x)$, where $\phi : \mathbf{R}^{mK} \to \mathbf{R}^{mK}$ is a given mapping.
As an instructive example connecting our current setup with the previous one, consider the case where $\eta = [\eta_1, ..., \eta_K]$ with $n \times m$ deterministic “individual regressors” $\eta_k$, and $y = [y_1; ...; y_K]$ with random “individual labels” $y_k \in \mathbf{R}^m$ conditionally independent, given $x$, across $k$, and such that the expectations of $y_k$ induced by $x$ are $f(\eta_k^T x)$ for some $f : \mathbf{R}^m \to \mathbf{R}^m$. We set $\phi([u_1; ...; u_K]) = [f(u_1); ...; f(u_K)]$. The resulting “single observation” model is a natural analog of the $K$-observation model considered so far, the only difference being that the individual regressors now form a fixed deterministic sequence rather than being a sample of realizations of some random matrix.
As before, our goal is to use observation (5.67) to recover the (unknown) signal $x$ underlying, as explained above, the distribution of the observation. Formally, we are now in the case $K = 1$ of our previous recovery problem where $Q$ is supported on a singleton $\{\eta\}$, and we can use the constructions developed so far. Specifically,
• The field $F(z)$ associated with our problem (it used to be $E_{\eta \sim Q}\{\eta f(\eta^T z)\}$) is $F(z) = \eta\phi(\eta^T z)$, and the vector field $G(z) = F(z) - F(x)$, $x$ being the signal underlying observation (5.67), is $G(z) = E_{(\eta, y) \sim P_x}\{F(z) - \eta y\}$ (cf. (5.57)). As before, the signal to recover is a zero of the latter field. Note that now the vector field $F(z)$ is observable, and the vector field $G$ still is the expectation, over $P_x$, of an observable vector field:
$$G(z) = E_{(\eta, y) \sim P_x}\{\underbrace{\eta\phi(\eta^T z) - \eta y}_{G_y(z)}\};$$
cf. Lemma 5.11.
• Assumptions A.1–2 now read
A.1′ The vector field $\phi(\cdot) : \mathbf{R}^{mK} \to \mathbf{R}^{mK}$ is continuous and monotone, so that $F(\cdot)$ is continuous and monotone as well,
A.2′ $X$ is a nonempty compact convex set, and $F$ is strongly monotone, with modulus $\kappa > 0$, on $X$.
A simple sufficient condition for the validity of the above monotonicity assumptions is positive definiteness of the matrix $\eta\eta^T$ plus strong monotonicity of $\phi$ on every bounded set.
• For our present purposes, it is convenient to reformulate Assumption A.3 in the following equivalent form:
A.3′ For properly selected $\sigma \ge 0$ and every $x \in X$ it holds
$$E_{(\eta, y) \sim P_x}\{\|\eta[y - \phi(\eta^T x)]\|_2^2\} \le \sigma^2.$$
In the present setting, the SAA estimate $\hat{x}(y)$ is the unique weak solution to VI$(G_y, X)$, and we can easily quantify the quality of this estimate:
Proposition 5.14. In the situation in question, let Assumptions A.1′–3′ hold. Then for every $x \in X$ and every realization $(\eta, y)$ of observation (5.67) induced by $x$ one has
$$\|\hat{x}(y) - x\|_2 \le \kappa^{-1}\|\underbrace{\eta[y - \phi(\eta^T x)]}_{\Delta(x, y)}\|_2, \tag{5.68}$$
whence also
$$E_{(\eta, y) \sim P_x}\{\|\hat{x}(y) - x\|_2^2\} \le \sigma^2/\kappa^2. \tag{5.69}$$
Proof. Let $x \in X$ be the signal underlying observation (5.67), and let $G(z) = F(z) - F(x)$ be the associated vector field $G$. We have
$$G_y(z) = F(z) - \eta y = F(z) - F(x) + [F(x) - \eta y] = G(z) - \eta[y - \phi(\eta^T x)] = G(z) - \Delta(x, y).$$
For $y$ fixed, $\bar{z} = \hat{x}(y)$ is the weak, and therefore the strong (since $G_y(\cdot)$ is continuous), solution to VI$(G_y, X)$, implying, due to $x \in X$, that
$$0 \le G_y^T(\bar{z})[x - \bar{z}] = G^T(\bar{z})[x - \bar{z}] - \Delta^T(x, y)[x - \bar{z}],$$
whence
$$-G^T(\bar{z})[x - \bar{z}] \le -\Delta^T(x, y)[x - \bar{z}].$$
Besides this, $G(x) = 0$, whence $G^T(x)[x - \bar{z}] = 0$, and we arrive at
$$[G(x) - G(\bar{z})]^T[x - \bar{z}] \le -\Delta^T(x, y)[x - \bar{z}],$$
whence also
$$\kappa\|x - \bar{z}\|_2^2 \le -\Delta^T(x, y)[x - \bar{z}]$$
(recall that $G$, along with $F$, is strongly monotone with modulus $\kappa$ on $X$ and $x, \bar{z} \in X$). Applying the Cauchy inequality, we arrive at (5.68). □
Example. Consider the case where $m = 1$, $\phi$ is strongly monotone, with modulus $\kappa_\phi > 0$, on the entire $\mathbf{R}^K$, and $\eta$ in (5.67) is drawn from a “Gaussian ensemble”: the columns $\eta_k$ of the $n \times K$ matrix $\eta$ are independent $N(0, I_n)$-random vectors. Assume also that the observation noise is Gaussian:
$$y = \phi(\eta^T x) + \lambda\xi,\quad \xi \sim N(0, I_K).$$
It is well known that as $K/n \to \infty$, the minimal singular value of the $n \times n$ matrix $\eta\eta^T$ is at least $O(1)K$ with overwhelming probability, implying that when $K/n \gg 1$, the typical modulus of strong monotonicity of $F(\cdot)$ is $\kappa \ge O(1)K\kappa_\phi$. Furthermore, in our situation, as $K/n \to \infty$, the Frobenius norm of $\eta$ with overwhelming probability is at most $O(1)\sqrt{nK}$. In other words, when $K/n$ is large, a “typical” recovery problem from the ensemble just described satisfies the premise of Proposition 5.14 with $\kappa = O(1)K\kappa_\phi$ and $\sigma^2 = O(\lambda^2 nK)$. As a result, (5.69) reads
$$E_{(\eta, y) \sim P_x}\{\|\hat{x}(y) - x\|_2^2\} \le O(1)\frac{\lambda^2 n}{\kappa_\phi^2 K}\qquad [K \gg n].$$
It is well known that in the standard case of linear regression, where $\phi(x) = \kappa_\phi x$, the resulting bound is near-optimal, provided $X$ is large enough.
Numerical illustration: In the situation described in the example above, we set $m = 1$, $n = 100$, and use
$$\phi(u) = \arctan[u] := [\arctan(u_1); ...; \arctan(u_K)] : \mathbf{R}^K \to \mathbf{R}^K;$$
the set $X$ is the unit ball $\{x \in \mathbf{R}^n : \|x\|_2 \le 1\}$. In a particular experiment, $\eta$ is chosen at random from the Gaussian ensemble as described above, and signal $x \in X$ underlying observation (5.67) is drawn at random; the observation noise $y - \phi(\eta^T x)$ is $N(0, \lambda^2 I_K)$. Some typical results (10 simulations for each combination of the sample size and noise variance $\lambda^2$) are presented in Figure 5.5.
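One possible way to compute the SAA estimate in this illustration (an assumption of the sketch below, not necessarily the method used for Figure 5.5) is projected iteration $z \leftarrow \mathrm{Proj}_X[z - \gamma G_y(z)]$ with $G_y(z) = \eta[\arctan(\eta^T z) - y]$. Since $\arctan' \le 1$, the field $G_y$ is Lipschitz with constant at most $\|\eta\|_F^2$, which motivates the stepsize $\gamma = 1/\|\eta\|_F^2$. The dimensions are kept small, and all names are hypothetical.

```python
import math
import random


def project_ball(z, radius=1.0):
    """Metric projection onto the Euclidean ball of the given radius."""
    norm = math.sqrt(sum(v * v for v in z))
    return list(z) if norm <= radius else [v * radius / norm for v in z]


def saa_single_observation(eta, y, iterations=2000):
    """Approximate the weak solution to VI(G_y, X), X being the unit ball.

    eta is a list of K columns eta_k (each of length n), y a list of K labels;
    G_y(z) = sum_k eta_k * (arctan(eta_k^T z) - y_k).
    """
    n = len(eta[0])
    gamma = 1.0 / sum(e * e for col in eta for e in col)  # 1 / ||eta||_F^2
    z = [0.0] * n
    for _ in range(iterations):
        g = [0.0] * n
        for col, yk in zip(eta, y):
            r = math.atan(sum(c * v for c, v in zip(col, z))) - yk
            for i in range(n):
                g[i] += r * col[i]
        z = project_ball([v - gamma * gi for v, gi in zip(z, g)])
    return z


def run_demo(n=3, K=200, noise_std=0.1, seed=0):
    """Simulate one problem from the Gaussian ensemble; return the error."""
    rng = random.Random(seed)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(v * v for v in g))
    x = [0.8 * v / norm for v in g]  # signal inside the unit ball
    eta = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(K)]
    y = [math.atan(sum(c * v for c, v in zip(col, x)))
         + noise_std * rng.gauss(0.0, 1.0) for col in eta]
    z = saa_single_observation(eta, y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z, x)))


print("error, noise_std=0.1:", run_demo(noise_std=0.1))
```

With zero noise the signal itself is a zero of $G_y$, so the iteration should return $x$ up to numerical accuracy; this is a convenient check of any implementation.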
Figure 5.5: Mean errors and CPU times for standard deviation $\lambda = 1$ (solid line) and $\lambda = 0.1$ (dashed line). Panels: mean estimation error $\|\hat{x}_k(\omega^k) - x\|_2$; CPU time (sec).
5.3 EXERCISES FOR CHAPTER 5
5.3.1 Estimation by Stochastic Optimization
Exercise 5.1. Consider the following “multinomial” version of the logistic regression problem from Section 5.2.1: For $k = 1, ..., K$, we observe pairs
$$(\zeta_k, \ell_k) \in \mathbf{R}^n \times \{0, 1, ..., m\} \tag{5.70}$$
drawn independently of each other from a probability distribution $S_x$ parameterized by an unknown signal $x = [x^1; ...; x^m] \in \mathbf{R}^n \times ... \times \mathbf{R}^n$ as follows:
• The probability distribution of regressor $\zeta$ induced by the distribution $S_x$ of $(\zeta, \ell)$ is a distribution $R$ on $\mathbf{R}^n$, fixed once and for all and independent of $x$, with finite second order moments and positive definite matrix $Z = E_{\zeta \sim R}\{\zeta\zeta^T\}$ of second order moments;
• The conditional distribution of label $\ell$ given $\zeta$ induced by the distribution $S_x$ of $(\zeta, \ell)$ is the distribution of the discrete random variable taking value $\iota \in \{0, 1, ..., m\}$ with probability
$$p_\iota = \begin{cases}\dfrac{\exp\{\zeta^T x^\iota\}}{1 + \sum_{i=1}^m \exp\{\zeta^T x^i\}}, & 1 \le \iota \le m,\\[2ex] \dfrac{1}{1 + \sum_{i=1}^m \exp\{\zeta^T x^i\}}, & \iota = 0\end{cases}\qquad [x = [x^1; ...; x^m]].$$
Given a nonempty convex compact set $X \subset \mathbf{R}^{mn}$ known to contain the (unknown) signal $x$ underlying observations (5.70), we want to recover $x$. Note that the recovery problem associated with the standard logistic regression model is the case $m = 1$ of the problem just defined.
Your task is to process the above recovery problem via the approach developed in Section 5.2 and to compare the resulting SAA estimate with the Maximum Likelihood estimate.
Exercise 5.2.
Let $H(x) : \mathbf{R}^n \to \mathbf{R}^n$ be a vector field strongly monotone and Lipschitz continuous on the entire space:
$$\forall (x, x' \in \mathbf{R}^n):\quad [H(x) - H(x')]^T[x - x'] \ge \kappa\|x - x'\|_2^2,\quad \|H(x) - H(x')\|_2 \le L\|x - x'\|_2 \tag{5.71}$$
for some $\kappa > 0$ and $L < \infty$.
1.1) Prove that for every $x \in \mathbf{R}^n$, the vector equation $H(z) = x$ in variable $z \in \mathbf{R}^n$ has a unique solution (which we denote by $H^{-1}(x)$), and that for every $x, y \in \mathbf{R}^n$ one has
$$\|H^{-1}(x) - y\|_2 \le \kappa^{-1}\|x - H(y)\|_2.$$
1.2) Prove that the vector field
$$x \mapsto H^{-1}(x) \tag{5.72}$$
is strongly monotone with modulus $\kappa_* = \kappa/L^2$ and Lipschitz continuous, with constant $1/\kappa$ w.r.t. $\|\cdot\|_2$, on the entire $\mathbf{R}^n$.
Let us interpret $-H(\cdot)$ as a field of “reaction forces” applied to a particle: when the particle is in a position $y \in \mathbf{R}^n$, the reaction force applied to it is $-H(y)$. Next, let us interpret $x \in \mathbf{R}^n$ as an external force applied to the particle. An equilibrium $y$ is a point in space where the reaction force $-H(y)$ compensates the external force, that is, $H(y) = x$, or, which for our $H$ is the same, $y = H^{-1}(x)$. Note that with this interpretation, strong monotonicity of $H$ makes perfect sense, implying that the equilibrium in question is stable: when the particle is moved from the equilibrium $y = H^{-1}(x)$ to a position $y + \Delta$, the total force acting on the particle becomes $f = x - H(y + \Delta)$, so that
$$f^T\Delta = [x - H(y + \Delta)]^T\Delta = [H(y) - H(y + \Delta)]^T\Delta \le -\kappa\|\Delta\|_2^2,$$
that is, the force is oriented “against” the displacement $\Delta$ and “wants” to return the particle to the equilibrium position.
Now imagine that we can observe, in noise, the equilibrium $H^{-1}(x)$ of the particle, the external force $x$ being unknown, and want to recover $x$ from our observation. For the sake of simplicity, let the observation noise be zero mean Gaussian, so that our observation is
$$\omega = H^{-1}(x) + \sigma\xi,\quad \xi \sim N(0, I_n).$$
2) Verify that the recovery problem we have posed is a special case of the “single observation” recovery problem from Section 5.2.6, with $\mathbf{R}^n$ in the role of $X$,¹³ and that the SAA estimate $\hat{x}(\omega)$ from that section under the circumstances is just the root of the equation $H^{-1}(\cdot) = \omega$, that is,
$$\hat{x}(\omega) = H(\omega).$$
Prove also that
$$E\{\|\hat{x}(\omega) - x\|_2^2\} \le n\sigma^2 L^2. \tag{5.73}$$
Note that in the situation in question the ML estimate should be the minimizer of the function
$$f(z) = \|\omega - H^{-1}(z)\|_2^2,$$
and this minimizer is nothing but $\hat{x}(\omega)$.
13 In Section 5.2.6, $X$ was assumed to be closed, convex, and bounded; a straightforward inspection shows that when the vector field $\phi$ is strongly monotone, with some positive modulus, on the entire space, and $\eta$ has trivial kernel, all constructions and results of Section 5.2.6 can be extended to the case of an arbitrary closed convex $X$.
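A quick numerical sanity check of (5.73) can be sketched as follows. The field $H(y) = [2y_i + \sin(y_i)]_{i \le n}$ is a hypothetical example chosen for this illustration: its componentwise derivative lies in $[1, 3]$, so (5.71) holds with $\kappa = 1$ and $L = 3$. To avoid inverting $H$, the sketch draws the equilibrium $y$ first and sets $x = H(y)$.

```python
import math
import random

KAPPA, LIP = 1.0, 3.0  # moduli of the hypothetical field H below


def H(y):
    """Componentwise field 2*y_i + sin(y_i); strongly monotone and Lipschitz."""
    return [2.0 * v + math.sin(v) for v in y]


def trial(n, sigma, rng):
    """One experiment: return (||x_hat - x||^2, ||xi||^2)."""
    y = [rng.uniform(-3.0, 3.0) for _ in range(n)]   # equilibrium H^{-1}(x)
    x = H(y)                                         # external force
    xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
    omega = [v + sigma * e for v, e in zip(y, xi)]   # noisy observation
    x_hat = H(omega)                                 # the SAA estimate
    err2 = sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return err2, sum(e * e for e in xi)


def mean_squared_error(n=10, sigma=0.5, trials=2000, seed=0):
    rng = random.Random(seed)
    return sum(trial(n, sigma, rng)[0] for _ in range(trials)) / trials


n, sigma = 10, 0.5
print("empirical E||x_hat - x||^2:", mean_squared_error(n, sigma))
print("bound n * sigma^2 * L^2:   ", n * sigma ** 2 * LIP ** 2)
```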
Exercise 5.3.
[identification of parameters of a linear dynamic system] Consider the problem as follows:
A deterministic sequence $x = \{x_t : t \ge -d + 1\}$ satisfies the linear finite-difference equation
$$\sum_{i=0}^d \alpha_i x_{t-i} = y_t,\quad t = 1, 2, ... \tag{5.74}$$
of given order $d$ and is bounded:
$$|x_t| \le M_x < \infty,\quad \forall t \ge -d + 1,$$
implying that the sequence $\{y_t\}$ also is bounded:
$$|y_t| \le M_y < \infty,\quad \forall t \ge 1.$$
The vector $\alpha = [\alpha_0; ...; \alpha_d]$ is unknown; all we know is that this vector belongs to a given closed convex set $X \subset \mathbf{R}^{d+1}$. We have at our disposal observations
$$\omega_t = x_t + \sigma_x\xi_t,\quad -d + 1 \le t \le K, \tag{5.75}$$
of the terms in the sequence, with $\xi_t \sim N(0, 1)$ independent across $t$, with some given $\sigma_x$, and observations
$$\zeta_t = y_t + \sigma_y\eta_t \tag{5.76}$$
with $\eta_t \sim N(0, 1)$ independent across $t$ and independent of $\{\xi_\tau\}_\tau$. Our goal is to recover from these observations the vector $\alpha$.
Strategy. To get the rationale underlying the construction to follow, let us start with the case when there is no observation noise at all: $\sigma_x = \sigma_y = 0$. In this case we could act as follows: let us denote
$$x^t = [x_t; x_{t-1}; x_{t-2}; ...; x_{t-d}],\quad 1 \le t \le K,$$
and rewrite (5.74) as
$$[x^t]^T\alpha = y_t,\quad 1 \le t \le K.$$
When setting
$$A_K = \frac{1}{K}\sum_{t=1}^K x^t[x^t]^T,\qquad a_K = \frac{1}{K}\sum_{t=1}^K y_t x^t,$$
we get
$$A_K\alpha = a_K. \tag{5.77}$$
Assuming that $K$ is large and trajectory $x$ is “rich enough” to ensure that $A_K$ is nonsingular, we could identify $\alpha$ by solving the linear system (5.77).
Now, when the observation noise is present, we could try to use the noisy observations of $x^t$ and $y_t$ we have at our disposal in order to build empirical approximations to $A_K$ and $a_K$ which are good for large $K$, and identify $\alpha$ by solving the “empirical counterpart” of (5.77). The straightforward way would be to define $\omega^t$ as an “observable version” of $x^t$,
$$\omega^t = [\omega_t; \omega_{t-1}; ...; \omega_{t-d}] = x^t + \sigma_x\underbrace{[\xi_t; \xi_{t-1}; ...; \xi_{t-d}]}_{\xi^t},$$
and to replace $A_K$, $a_K$ with
$$\widetilde{A}_K = \frac{1}{K}\sum_{t=1}^K \omega^t[\omega^t]^T,\qquad \widetilde{a}_K = \frac{1}{K}\sum_{t=1}^K \zeta_t\omega^t.$$
As far as empirical approximation of $a_K$ is concerned, this approach works: we have
$$\widetilde{a}_K = a_K + \delta a_K,\qquad \delta a_K = \frac{1}{K}\sum_{t=1}^K \underbrace{[\sigma_x y_t\xi^t + \sigma_y\eta_t x^t + \sigma_x\sigma_y\eta_t\xi^t]}_{\delta_t}.$$
Since the sequence $\{y_t\}$ is bounded, the random error $\delta a_K$ of approximation $\widetilde{a}_K$ of $a_K$ is small for large $K$ with overwhelming probability. Indeed, $\delta a_K$ is the average of $K$ zero mean random vectors $\delta_t$ (recall that $\xi^t$ and $\eta_t$ are independent and have zero mean) with¹⁴
$$E\{\|\delta_t\|_2^2\} \le 3(d+1)\left[\sigma_x^2 M_y^2 + \sigma_y^2 M_x^2 + \sigma_x^2\sigma_y^2\right],$$
and $\delta_t$ is independent of $\delta_s$ whenever $|t - s| > d + 1$, implying that
$$E\{\|\delta a_K\|_2^2\} \le \frac{3(d+1)(2d+1)\left[\sigma_x^2 M_y^2 + \sigma_y^2 M_x^2 + \sigma_x^2\sigma_y^2\right]}{K}. \tag{5.78}$$
14 We use the elementary inequality $\left\|\sum_{t=1}^p a_t\right\|_2^2 \le p\sum_{t=1}^p \|a_t\|_2^2$.
The quality of approximating $A_K$ with $\widetilde{A}_K$ is essentially worse: setting
$$\delta A_K = \widetilde{A}_K - A_K = \frac{1}{K}\sum_{t=1}^K \underbrace{[\sigma_x^2\xi^t[\xi^t]^T + \sigma_x\xi^t[x^t]^T + \sigma_x x^t[\xi^t]^T]}_{\Delta_t},$$
we see that $\delta A_K$ is the average of $K$ random matrices $\Delta_t$ with nonzero mean, namely, the mean $\sigma_x^2 I_{d+1}$, and as such $\delta A_K$ is “large” independently of how large $K$ is. There is, however, a simple way to overcome this difficulty: splitting observations $\omega_t$.¹⁵
Splitting observations. Let $\theta$ be a random $n$-dimensional vector with unknown mean $\mu$ and known covariance matrix, namely, $\sigma^2 I_n$, and let $\chi \sim N(0, I_n)$ be independent of $\theta$. Finally, let $\kappa > 0$ be a deterministic real.
1) Prove that setting
$$\theta' = \theta + \sigma\kappa\chi,\qquad \theta'' = \theta - \sigma\kappa^{-1}\chi,$$
we get two random vectors with mean $\mu$ and covariance matrices $\sigma^2(1 + \kappa^2)I_n$ and $\sigma^2(1 + 1/\kappa^2)I_n$, respectively, and these vectors are uncorrelated:
$$E\{[\theta' - \mu][\theta'' - \mu]^T\} = 0.$$
In view of item 1, let us do as follows: given observations $\{\omega_t\}$ and $\{\zeta_t\}$, let us generate an i.i.d. sequence $\{\chi_t \sim N(0, 1), t \ge -d + 1\}$, so that the sequences $\{\xi_t\}$, $\{\eta_t\}$, and $\{\chi_t\}$ are i.i.d. and independent of each other, and let us set
$$u_t = \omega_t + \sigma_x\chi_t,\qquad v_t = \omega_t - \sigma_x\chi_t.$$
Note that given the sequence $\{\omega_t\}$ of actual observations, sequences $\{u_t\}$ and $\{v_t\}$ are observable as well, and that the sequence $\{(u_t, v_t)\}$ is i.i.d. Moreover, for all $t$,
$$E\{u_t\} = E\{v_t\} = x_t,\quad E\{[u_t - x_t]^2\} = 2\sigma_x^2,\quad E\{[v_t - x_t]^2\} = 2\sigma_x^2,$$
and for all $t$ and $s$
$$E\{[u_t - x_t][v_s - x_s]\} = 0.$$
Now, let us put
$$u^t = [u_t; u_{t-1}; ...; u_{t-d}],\qquad v^t = [v_t; v_{t-1}; ...; v_{t-d}],$$
and let
$$\widehat{A}_K = \frac{1}{K}\sum_{t=1}^K u^t[v^t]^T.$$
2) Prove that $\widehat{A}_K$ is a good empirical approximation of $A_K$:
$$E\{\widehat{A}_K\} = A_K,\qquad E\{\|\widehat{A}_K - A_K\|_F^2\} \le \frac{12[d+1]^2[2d+3]\left[M_x^2 + \sigma_x^2\right]\sigma_x^2}{K}, \tag{5.79}$$
15 The model (5.74)–(5.76) is referred to as Errors in Variables model [85] in the statistical literature or Output Error model in the literature on System Identification [173, 218]. In general, statistical inference for such models is difficult—for instance, parameter estimation problem in such models is ill-posed. The estimate we develop in this exercise can be seen as a special application of the general Instrumental Variables methodology [7, 219, 241].
the expectation being taken over the distribution of observation noises $\{\xi_t\}$ and auxiliary random sequence $\{\chi_t\}$.
Conclusion. We see that as $K \to \infty$, typical realizations of the differences $\widehat{A}_K - A_K$ and $\widetilde{a}_K - a_K$ approach 0. It follows that if the sequence $\{x_t\}$ is “rich enough” to ensure that the minimal eigenvalue of $A_K$ for large $K$ stays bounded away from 0, the estimate
$$\widehat{\alpha}_K \in \mathop{\mathrm{Argmin}}_{\beta \in \mathbf{R}^{d+1}} \|\widehat{A}_K\beta - \widetilde{a}_K\|_2^2$$
will converge in probability to the desired vector $\alpha$, and we can even say something reasonable about the rate of convergence. To account for a priori information $\alpha \in X$, we can modify the estimate by setting
$$\widehat{\alpha}_K \in \mathop{\mathrm{Argmin}}_{\beta \in X} \|\widehat{A}_K\beta - \widetilde{a}_K\|_2^2.$$
Note that the assumption that noises affecting observations of $x_t$’s and $y_t$’s are zero mean Gaussian random variables independent of each other with known dispersions is not that important; we could survive the situation where samples $\{\omega_t - x_t, t > -d\}$, $\{\zeta_t - y_t, t \ge 1\}$ are zero mean i.i.d., and independent of each other, with a priori known variance of $\omega_t - x_t$. Under this and mild additional assumptions (like finiteness of the fourth moments of $\omega_t - x_t$ and $\zeta_t - y_t$), the obtained results would be similar to those for the Gaussian case.
Now comes the concluding part of the exercise:
3) To evaluate numerically the performance of the proposed identification scheme, run experiments as follows:
• Given an even value of $d$ and $\rho \in (0, 1]$, select $d/2$ complex numbers $\lambda_i$ at random on the circle $\{z \in \mathbf{C} : |z| = \rho\}$, and build a real polynomial of degree $d$ with roots $\lambda_i$, $\lambda_i^*$ ($*$ here stands for complex conjugation). Build a finite-difference equation (5.74) with this polynomial as the characteristic polynomial.
• Generate i.i.d. $N(0, 1)$ “inputs” $\{y_t, t = 1, 2, ...\}$, select at random initial conditions $x_{-d+1}, x_{-d+2}, ..., x_0$ for the trajectory $\{x_t\}$ of states (5.74), and simulate the trajectory along with observations $\omega_t$ of $x_t$ and $\zeta_t$ of $y_t$, with $\sigma_x$, $\sigma_y$ being the experiment’s parameters.
• Look at the performance of the estimate $\widehat{\alpha}_K$ on the simulated data.
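The effect of splitting observations in Exercise 5.3 can be seen already in a scalar toy version, sketched below under simplifying assumptions (a constant “trajectory” $x_t \equiv x$, and hypothetical function and parameter names). The “naive” product $\omega_t\omega_t$ has mean $x^2 + \sigma_x^2$, which mirrors the bias $\sigma_x^2 I_{d+1}$ of $\widetilde{A}_K$, while the product $u_t v_t$ of the two split observations has mean $x^2$, which mirrors the unbiasedness of $\widehat{A}_K$.

```python
import random


def split_demo(x=1.0, sigma=1.0, samples=200_000, seed=0):
    """Compare the empirical means of omega*omega and u*v as estimates of x^2."""
    rng = random.Random(seed)
    naive_sum = 0.0
    split_sum = 0.0
    for _ in range(samples):
        omega = x + sigma * rng.gauss(0.0, 1.0)  # noisy observation of x
        chi = rng.gauss(0.0, 1.0)                # auxiliary noise generated by us
        u = omega + sigma * chi
        v = omega - sigma * chi
        naive_sum += omega * omega               # mean x^2 + sigma^2
        split_sum += u * v                       # mean x^2
    return naive_sum / samples, split_sum / samples


naive, split = split_demo()
print("naive estimate of x^2:", naive)
print("split estimate of x^2:", split)
```

The price of splitting is a larger variance: each of $u_t$, $v_t$ carries noise of variance $2\sigma_x^2$ instead of $\sigma_x^2$.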
Exercise 5.4.
[more on generalized linear models] Consider a generalized linear model as follows: we observe i.i.d. random pairs
$$\omega_k = (y_k, \zeta_k) \in \mathbf{R} \times \mathbf{R}^{\nu \times \mu},\quad k = 1, ..., K,$$
where the conditional expectation of the scalar label $y_k$ given $\zeta_k$ is $\psi(\zeta_k^T z)$, $z$ being an unknown signal underlying the observations. What we know is that $z$ belongs to a given convex compact set $Z \subset \mathbf{R}^\nu$. Our goal is to recover $z$.
Note that while the estimation problem we have just posed looks similar to those treated in Section 5.2, it cannot be straightforwardly handled via techniques developed in that section unless $\mu = 1$. Indeed, these techniques in the case of $\mu > 1$ require $\psi$ to be a monotone vector field on $\mathbf{R}^\mu$, while our $\psi$ is just a scalar function on $\mathbf{R}^\mu$. The goal of the exercise is to show that when
$$\psi(w) = \sum_{q \in Q} c_q w^q \equiv \sum_{q \in Q} c_q w_1^{q_1}...w_\mu^{q_\mu}\qquad [c_q \ne 0,\ q \in Q \subset \mathbf{Z}_+^\mu]$$
is an algebraic polynomial (which we assume from now on), one can use lifting to reduce the situation to that considered in Section 5.2.
The construction is straightforward. Let us associate with algebraic monomial $z^p := z_1^{p_1}z_2^{p_2}...z_\nu^{p_\nu}$ with $\nu$ variables¹⁶ a real variable $x_p$. For example, monomial $z_1 z_2$ is associated with $x_{1,1,0,...,0}$, $z_1^2 z_\nu^3$ is associated with $x_{2,0,...,0,3}$, etc. For $q \in Q$, the contribution of the monomial $c_q w^q$ into $\psi(\zeta^T z)$ is
$$c_q[\mathrm{Col}_1^T[\zeta]z]^{q_1}[\mathrm{Col}_2^T[\zeta]z]^{q_2}...[\mathrm{Col}_\mu^T[\zeta]z]^{q_\mu} = \sum_{p \in P_q} h_{pq}(\zeta)z_1^{p_1}z_2^{p_2}...z_\nu^{p_\nu},$$
where $P_q$ is a properly built set of multi-indices $p = (p_1, ..., p_\nu)$, and $h_{pq}(\zeta)$ are easily computable functions of $\zeta$. Consequently,
$$\psi(\zeta^T z) = \sum_{q \in Q}\sum_{p \in P_q} h_{pq}(\zeta)z^p = \sum_{p \in P} H_p(\zeta)z^p,$$
with properly selected finite set $P$ and readily given functions $H_p(\zeta)$, $p \in P$. We can always take, as $P$, the set of all $\nu$-entry multi-indices with the sum of entries not exceeding $d$, where $d$ is the total degree of the polynomial $\psi$. This being said, the structure of $\psi$ and/or the common structure, if any, of regressors $\zeta_k$ can force some of the functions $H_p(\cdot)$ to be identically zero. When this is the case, it makes sense to eliminate the corresponding “redundant” multi-index $p$ from $P$.
Next, consider the mapping $x[z]$ which maps a vector $z \in \mathbf{R}^\nu$ into a vector with entries $x_p[z] = z^p$ indexed by $p \in P$, and let us associate with our estimation problem its “lifting” with observations
$$\omega_k = (y_k, \eta_k = \{H_p(\zeta_k), p \in P\}).$$
That is, the new observations are deterministic transformations of the actual observations; observe that the new observations still are i.i.d., and the conditional expectation of $y_k$ given $\eta_k$ is nothing but
$$\sum_{p \in P}[\eta_k]_p x_p[z].$$
In our new situation, the “signal” underlying observations is a vector from $\mathbf{R}^N$, $N = \mathrm{Card}(P)$, the regressors are vectors from the same $\mathbf{R}^N$, and regression is linear: the conditional expectation of the label $y_k$ given regressor $\eta_k$ is a linear function $\eta_k^T x$ of our new signal. Given a convex compact localizer $Z$ for the “true signal” $z$, we can in many ways find a convex compact localizer $X$ for $x = x[z]$.
16 Note that factors in the monomial are ordered according to the indices of the variables.
CHAPTER 5
Thus, we find ourselves in the simplest possible case of the situation considered in Section 5.2 (one with scalar φ(s) ≡ s), and can apply the estimation procedures developed in this section. Note that in the “lifted” problem the SAA estimate x b(·) of the lifted signal x = x[z] is nothing but the standard Least Squares: i hP hP iT K K T y η x − x η η x b(ω K ) ∈ Argminx∈X 21 xT k=1 k k k=1 k k (5.80) P T 2 (y − η x) . = Argminx∈X k k k Of course, there is no free lunch, and there are some potential complications:
• It may happen that the matrix $H = \mathbf{E}_{\eta\sim Q}\{\eta\eta^T\}$ ($Q$ is the common distribution of “artificial” regressors $\eta_k$ induced by the common distribution of the actual regressors $\zeta_k$) is not positive definite, which would make it impossible to recover well the signal $x[z]$ underlying our transformed observations, however large $K$ is.
• Even when $H$ is positive definite, so that $x[z]$ can be recovered well, provided $K$ is large, we still need to recover $z$ from $x[z]$, that is, to solve a system of polynomial equations, which can be difficult; besides, this system can have more than one solution.
• Even when the above difficulties can be somehow avoided, “lifting” $z \to x[z]$ typically increases significantly the number of parameters to be identified, which, in turn, deteriorates “finite time” accuracy bounds.
Note also that when $H$ is not positive definite, this still is not the end of the world. Indeed, $H$ is positive semidefinite; assuming that it has a nontrivial kernel $L$ which we can identify, a realization $\eta_k$ of our artificial regressor is orthogonal to $L$ with probability 1, implying that replacing artificial signal $x$ with its orthogonal projection onto $L^\perp$, we almost surely keep the value of the objective in (5.80) intact. Thus, we lose nothing when restricting the optimization domain in (5.80) to the orthogonal projection of $X$ onto $L^\perp$. Since the restriction of $H$ onto $L^\perp$ is positive definite, with this approach, for large enough values of $K$ we will still get a good approximation of the projection of $x[z]$ onto $L^\perp$. With luck, this approximation, taken together with the fact that the “artificial signal” we are looking for is not an arbitrary vector from $X$ (it is of the form $x[z]$ for some $z \in Z$), will allow us to get a good approximation of $z$.
Here is the first part of the exercise:
1) Carry out the outlined approach in the situation where
• The common distribution $\Pi$ of regressors $\zeta_k$ has density w.r.t. the Lebesgue measure on $\mathbf{R}^{\nu\times\mu}$ and possesses finite moments of all orders
• $\psi(w)$ is a quadratic form, either (case A) homogeneous, $\psi(w) = w^TSw$ [$S \neq 0$], or (case B) inhomogeneous, $\psi(w) = w^TSw + s^Tw$ [$S \neq 0$, $s \neq 0$].
• The labels are linked to the regressors and to the true signal $z$ by the relation
$$y_k = \psi(\eta_k^Tz) + \chi_k,$$
where the $\chi_k \sim N(0, 1)$ are mutually independent and independent from the regressors.
Now comes the concluding part of the exercise, where you are supposed to apply the approach we have developed to the situation as follows:
You are given a DC electric circuit comprised of resistors, that is, connected oriented graph with $m$ nodes and $n$ arcs $\gamma_j = (s_j, t_j)$, $1 \le j \le n$, where $1 \le s_j, t_j \le m$ and $s_j \neq t_j$ for all $j$; arcs $\gamma_j$ are assigned with resistances $R_j > 0$ known to us. At instant $k = 1, 2, ..., K$, “nature” specifies “external currents” (charge flows from the “environment” into the circuit) $s_1, ..., s_m$ at the nodes; these external currents specify currents in the arcs and voltages at the nodes, and consequently, the power dissipated by the circuit. Note that nature cannot be completely free in generating the external currents: their total should be zero. As a result, all that matters is the vector $s = [s_1; ...; s_{m-1}]$ of external currents at the first $m-1$ nodes, due to $s_m \equiv -[s_1 + ... + s_{m-1}]$.
We assume that the mechanism of generating the vector of external currents at instant $k$ (let this vector be denoted by $s^k \in \mathbf{R}^{m-1}$) is as follows. There are somewhere $m-1$ sources producing currents $z_1, ..., z_{m-1}$. At time $k$ nature selects a one-to-one correspondence $i \mapsto \pi_k(i)$, $i = 1, ..., m-1$, between these sources and the first $m-1$ nodes of the circuit, and “forwards” current $z_{\pi_k(i)}$ to node $i$:
$$s^k_i = z_{\pi_k(i)},\ 1 \le i \le m-1.$$
For the sake of definiteness, assume that the permutations $\pi_k$ of $1, ..., m-1$, $k = 1, ..., K$, are i.i.d. drawn from the uniform distribution on the set of $(m-1)!$ permutations of $m-1$ elements. Assume that at time instants $k = 1, ..., K$ we observe the permutations $\pi_k$ and noisy measurements of the power dissipated at this instant by the circuit; given those observations, we want to recover the vector $z$.
Here is your task:
2) Assuming the noises in the dissipated power measurements to be zero mean Gaussian with variance $\sigma^2$, independent of each other and of the $\pi_k$, apply to the estimation problem in question the approach developed in item 1 of the exercise and run numerical experiments.
Exercise 5.5. [shift estimation] Consider the situation as follows: given a continuous vector field $f(u): \mathbf{R}^m \to \mathbf{R}^m$ which is strongly monotone on bounded subsets of $\mathbf{R}^m$ and a convex compact set $S \subset \mathbf{R}^m$, we observe in noise vectors $f(p - s)$, where $p \in \mathbf{R}^m$ is an observation point known to us, and $s \in \mathbf{R}^m$ is a shift which is unknown to us but known to belong to $S$. Precisely, assume that our observations are
$$y_k = f(p_k - s) + \xi_k,\ k = 1, ..., K,$$
where $p_1, ..., p_K$ is a deterministic sequence known to us, and $\xi_1, ..., \xi_K$ are $N(0, \gamma^2I_m)$ observation noises independent of each other. Our goal is to recover from observations $y_1, ..., y_K$ the shift $s$.
1. Pose the problem as a single-observation version of the estimation problem from Section 5.2.
2. Assuming $f$ to be strongly monotone, with modulus $\kappa > 0$, on the entire space, what is the error bound for the SAA estimate?
3. Run simulations in the case of $m = 2$, $S = \{u \in \mathbf{R}^2 : \|u\|_2 \le 1\}$ and
$$f(u) = \begin{bmatrix} 2u_1 + \sin(u_1) + 5u_2 \\ 2u_2 - \sin(u_2) - 5u_1 \end{bmatrix}.$$
Note: Field $f(\cdot)$ is not potential; this is the monotone vector field associated with the strongly convex-concave function
$$\psi(u) = u_1^2 - \cos(u_1) - u_2^2 - \cos(u_2) + 5u_1u_2,$$
so that $f(u) = \left[\frac{\partial}{\partial u_1}\psi(u); -\frac{\partial}{\partial u_2}\psi(u)\right]$. Compare the actual recovery errors with their theoretical upper bounds.
4. Think what can be done when our observations are
$$y_k = f(Ap_k - s) + \xi_k,\ 1 \le k \le K,$$
with known $p_k$, noises $\xi_k \sim N(0, \gamma^2I_2)$ independent across $k$, and unknown $A$ and $s$ which we want to recover.
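Before running the simulations of item 3, it may help to check numerically that the specific field is strongly monotone on the whole plane. The sketch below is an illustration, not part of the exercise: the modulus value 1 is our own computation (the skew terms $\pm 5u_j$ cancel in $\langle f(u)-f(v), u-v\rangle$, and $|\sin a - \sin b| \le |a-b|$), and the function names are hypothetical.

```python
import math
import random


def field(u):
    """The vector field f from item 3 of Exercise 5.5."""
    u1, u2 = u
    return (2 * u1 + math.sin(u1) + 5 * u2,
            2 * u2 - math.sin(u2) - 5 * u1)


def monotonicity_gap(u, v):
    """<f(u) - f(v), u - v> - ||u - v||_2^2.

    A nonnegative value means the pair (u, v) is consistent with strong
    monotonicity with modulus 1 (an assumption we verify, not a quoted fact).
    """
    fu, fv = field(u), field(v)
    d = (u[0] - v[0], u[1] - v[1])
    inner = (fu[0] - fv[0]) * d[0] + (fu[1] - fv[1]) * d[1]
    return inner - (d[0] ** 2 + d[1] ** 2)


def worst_gap(trials=10000, radius=5.0, seed=0):
    """Smallest gap observed over random pairs drawn from a square."""
    rng = random.Random(seed)
    worst = float("inf")
    for _ in range(trials):
        u = (rng.uniform(-radius, radius), rng.uniform(-radius, radius))
        v = (rng.uniform(-radius, radius), rng.uniform(-radius, radius))
        worst = min(worst, monotonicity_gap(u, v))
    return worst


if __name__ == "__main__":
    print("smallest observed gap:", worst_gap())
```

A random search of this kind can only fail to find a counterexample; the analytic argument in the lead-in is what justifies the modulus.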
5.4 PROOFS
5.4.1 Proof of (5.8)
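Let $h \in \mathbf{R}^m$, and let $\omega$ be a random vector with entries $\omega_i \sim \mathrm{Poisson}(\mu_i)$ independent across $i$. Taking into account that the $\omega_i$ are independent across $i$, we have
$$\mathbf{E}\{\exp\{\gamma h^T\omega\}\} = \prod_i \mathbf{E}\{\exp\{\gamma h_i\omega_i\}\} = \prod_i \exp\{[\exp\{\gamma h_i\} - 1]\mu_i\} = \exp\Big\{\sum_i [\exp\{\gamma h_i\} - 1]\mu_i\Big\},$$
whence by the Chebyshev inequality for $\gamma \ge 0$ it holds
$$\begin{array}{rcl}
\mathrm{Prob}\{h^T\omega > h^T\mu + t\} &=& \mathrm{Prob}\{\gamma h^T\omega > \gamma h^T\mu + \gamma t\}\\
&\le& \mathbf{E}\{\exp\{\gamma h^T\omega\}\}\exp\{-\gamma h^T\mu - \gamma t\}\\
&\le& \exp\Big\{\sum_i [\exp\{\gamma h_i\} - 1]\mu_i - \gamma h^T\mu - \gamma t\Big\}.
\end{array} \qquad (5.81)$$
Now, for $|s| < 3$, one has $e^s \le 1 + s + \frac{s^2}{2(1 - s/3)}$, which combines with (5.81) to imply that
$$0 \le \gamma < \frac{3}{\|h\|_\infty} \ \Rightarrow\ \ln\big(\mathrm{Prob}\{h^T\omega > h^T\mu + t\}\big) \le \frac{\gamma^2\sum_i h_i^2\mu_i}{2(1 - \gamma\|h\|_\infty/3)} - \gamma t.$$
Minimizing the right hand side in this inequality in $\gamma \in [0, \frac{3}{\|h\|_\infty})$, we get
$$\mathrm{Prob}\{h^T\omega > h^T\mu + t\} \le \exp\left\{-\frac{t^2}{2[\sum_i h_i^2\mu_i + \|h\|_\infty t/3]}\right\}.$$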
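As a quick sanity check of the one-sided bound, one can compare it with an exact Poisson tail in the special case $h = [1; ...; 1]$, where $h^T\omega \sim \mathrm{Poisson}(\sum_i \mu_i)$. The sketch below is only an illustration under that simplifying assumption; the function names are hypothetical, and the exact tail is truncated after a fixed number of terms.

```python
import math


def poisson_upper_tail(mean, threshold, terms=500):
    """Approximate P{X > threshold} for X ~ Poisson(mean).

    The pmf is summed in log-space starting from the first integer above
    the threshold; the sum is truncated after `terms` terms.
    """
    k0 = math.floor(threshold) + 1
    return sum(math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))
               for k in range(k0, k0 + terms))


def bernstein_bound(h, mu, t):
    """Right hand side exp{-t^2 / (2 [sum_i h_i^2 mu_i + ||h||_inf t / 3])}."""
    variance_term = sum(hi * hi * mi for hi, mi in zip(h, mu))
    h_max = max(abs(hi) for hi in h)
    return math.exp(-t * t / (2.0 * (variance_term + h_max * t / 3.0)))


if __name__ == "__main__":
    mu = [1.0, 2.0, 2.0]          # illustrative Poisson intensities
    h = [1.0] * len(mu)           # with h = ones, h^T omega ~ Poisson(sum(mu))
    for t in (1.0, 2.5, 5.0, 10.0):
        exact = poisson_upper_tail(sum(mu), sum(mu) + t)
        print(t, exact, bernstein_bound(h, mu, t))
```

For general $h$ the distribution of $h^T\omega$ is no longer Poisson, and a Monte Carlo estimate of the tail would be needed instead of the exact sum.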
The one-sided inequality obtained above combines with the same inequality applied to $-h$ in the role of $h$ to imply (5.8). ✷
5.4.2 Proof of Lemma 5.6
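(i): When $\pi(\mathrm{Col}_\ell[H]) \le 1$ for all $\ell$ and $\lambda \ge 0$, denoting by $[h]^2$ the vector comprised of squares of the entries in $h$, we have
$$\phi(\mathrm{dg}(H\mathrm{Diag}\{\lambda\}H^T)) = \phi\Big(\sum_\ell \lambda_\ell[\mathrm{Col}_\ell[H]]^2\Big) \le \sum_\ell \lambda_\ell\phi([\mathrm{Col}_\ell[H]]^2) = \sum_\ell \lambda_\ell\pi^2(\mathrm{Col}_\ell[H]) \le \sum_\ell \lambda_\ell,$$
implying that $(H\mathrm{Diag}\{\lambda\}H^T, \kappa\sum_\ell \lambda_\ell)$ belongs to $\mathbf{H}$.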
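(ii): Let $\Theta$, $\mu$, $Q$, $V$ be as stated in (ii); there is nothing to prove when $\mu = 0$; thus assume that $\mu > 0$. Let $d = \mathrm{dg}(\Theta)$, so that
$$d_i = \sum_j Q_{ij}^2 \ \ \&\ \ \kappa\phi(d) \le \mu \qquad (5.82)$$
(the second relation is due to $(\Theta, \mu) \in \mathbf{H}$). (5.30) is evident. We have
$$[H_\chi]_{ij} = \sqrt{m/\mu}\,[G_\chi]_{ij},\quad G_\chi = Q\mathrm{Diag}\{\chi\}V = \Big[\sum_{k=1}^m Q_{ik}\chi_kV_{kj}\Big]_{i,j}.$$
We claim that for every $i$ it holds
$$\forall \gamma > 0:\ \mathrm{Prob}\left\{[G_\chi]_{ij}^2 > 3\gamma d_i/m\right\} \le \sqrt{3}\exp\{-\gamma/2\}. \qquad (5.83)$$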
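Indeed, let us fix $i$. There is nothing to prove when $d_i = 0$, since in this case $Q_{ij} = 0$ for all $j$ and therefore $[G_\chi]_{ij} \equiv 0$. When $d_i > 0$, by homogeneity in $Q$ it suffices to verify (5.83) when $d_i/m = 1/3$. Assuming that this is the case, let $\eta \sim N(0, 1)$ be independent of $\chi$. We have
$$\begin{array}{rcl}
\mathbf{E}_\eta\{\mathbf{E}_\chi\{\exp\{\eta[G_\chi]_{ij}\}\}\} &=& \mathbf{E}_\eta\Big\{\prod_k \cosh(\eta Q_{ik}V_{kj})\Big\} \le \mathbf{E}_\eta\Big\{\prod_k \exp\{\tfrac{1}{2}\eta^2Q_{ik}^2V_{kj}^2\}\Big\}\\
&=& \mathbf{E}_\eta\Big\{\exp\Big\{\tfrac{1}{2}\eta^2\underbrace{\sum_k Q_{ik}^2V_{kj}^2}_{\le 2d_i/m}\Big\}\Big\} \le \mathbf{E}_\eta\{\exp\{\eta^2d_i/m\}\} = \mathbf{E}_\eta\{\exp\{\eta^2/3\}\} = \sqrt{3},
\end{array}$$
and
$$\mathbf{E}_\chi\{\mathbf{E}_\eta\{\exp\{\eta[G_\chi]_{ij}\}\}\} = \mathbf{E}_\chi\{\exp\{\tfrac{1}{2}[G_\chi]_{ij}^2\}\},$$
implying that
$$\mathbf{E}_\chi\{\exp\{\tfrac{1}{2}[G_\chi]_{ij}^2\}\} \le \sqrt{3}.$$
Therefore in the case of $d_i/m = 1/3$ for all $s > 0$ it holds
$$\mathrm{Prob}\{\chi : [G_\chi]_{ij}^2 > s\} \le \sqrt{3}\exp\{-s/2\},$$
and (5.83) follows. Recalling the relation between $H$ and $G$, we get from (5.83) that
$$\forall \gamma > 0:\ \mathrm{Prob}\{\chi : [H_\chi]_{ij}^2 > 3\gamma d_i/\mu\} \le \sqrt{3}\exp\{-\gamma/2\}.$$
By the latter inequality, with $\kappa$ given by (5.29) the probability of the event
$$\forall i, j:\ [H_\chi]_{ij}^2 \le \kappa\frac{d_i}{\mu}$$
is at least 1/2. Let this event take place; in this case we have $[\mathrm{Col}_\ell[H_\chi]]^2 \le \kappa d/\mu$, whence, by definition of the norm $\pi(\cdot)$, $\pi^2(\mathrm{Col}_\ell[H_\chi]) \le \kappa\phi(d)/\mu \le 1$ (see the second relation in (5.82)). Thus, the probability of the event (5.31) is at least 1/2. ✷
5.4.3 Verification of (5.44)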
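Given $s \in [2, \infty]$ and setting $\bar{s} = s/2$, $s_* = \frac{s}{s-1}$, $\bar{s}_* = \frac{\bar{s}}{\bar{s}-1}$, we want to prove that
$$\begin{array}{l}
\{(V, \tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \exists (W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) : V \preceq W + \mathrm{Diag}\{w\}\ \&\ \|W\|_{s_*} + \|w\|_{\bar{s}_*} \le \tau\}\\
\quad = \{(V, \tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \exists w \in \mathbf{R}^N_+ : V \preceq \mathrm{Diag}\{w\},\ \|w\|_{\bar{s}_*} \le \tau\}.
\end{array}$$
To this end it clearly suffices to check that whenever $W \in \mathbf{S}^N$, there exists $w \in \mathbf{R}^N$ satisfying
$$W \preceq \mathrm{Diag}\{w\},\quad \|w\|_{\bar{s}_*} \le \|W\|_{s_*}.$$
The latter is equivalent to saying that for any $W \in \mathbf{S}^N$ such that $\|W\|_{s_*} \le 1$, the conic optimization problem
$$\mathrm{Opt} = \min_{t,w}\{t : t \ge \|w\|_{\bar{s}_*},\ \mathrm{Diag}\{w\} \succeq W\} \qquad (5.84)$$
is solvable (which is evident) with optimal value $\le 1$. To see that the latter indeed is the case, note that the problem clearly is strictly feasible, whence its optimal value is the same as the optimal value in the conic problem
$$\mathrm{Opt} = \max_P\left\{\mathrm{Tr}(PW) : P \succeq 0,\ \|\mathrm{dg}\{P\}\|_{\bar{s}_*/(\bar{s}_*-1)} \le 1\right\}\quad [\mathrm{dg}\{P\} = [P_{11}; P_{22}; ...; P_{NN}]]$$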
dual to (5.84). Since $\mathrm{Tr}(PW) \le \|P\|_{s_*/(s_*-1)}\|W\|_{s_*} \le \|P\|_{s_*/(s_*-1)}$, recalling what $s_*$ and $\bar{s}_*$ are, our task boils down to verifying that when a matrix $P \succeq 0$ satisfies $\|\mathrm{dg}\{P\}\|_{s/2} \le 1$, one also has $\|P\|_s \le 1$. This is immediate: since $P$ is positive semidefinite, we have $|P_{ij}| \le P_{ii}^{1/2}P_{jj}^{1/2}$, whence, assuming $s < \infty$,
$$\|P\|_s^s = \sum_{i,j}|P_{ij}|^s \le \sum_{i,j}P_{ii}^{s/2}P_{jj}^{s/2} = \Big(\sum_i P_{ii}^{s/2}\Big)^2 \le 1.$$
When $s = \infty$, the same argument leads to
$$\|P\|_\infty = \max_{i,j}|P_{ij}| = \max_i |P_{ii}| = \|\mathrm{dg}\{P\}\|_\infty.$$
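The key inequality of this verification (a positive semidefinite $P$ with $\|\mathrm{dg}\{P\}\|_{s/2} \le 1$ has entrywise norm $\|P\|_s \le 1$) is easy to probe numerically. The sketch below is an illustration only: it builds random Gram matrices, normalizes the diagonal norm to 1, and evaluates the entrywise $s$-norm; all function names are hypothetical.

```python
import random


def gram_matrix(n, r, rng):
    """Random n x n positive semidefinite matrix G G^T with G of size n x r."""
    g = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(n)]
    return [[sum(g[i][t] * g[j][t] for t in range(r)) for j in range(n)]
            for i in range(n)]


def diag_norm(p, q):
    """l_q norm of the diagonal of p."""
    return sum(abs(p[i][i]) ** q for i in range(len(p))) ** (1.0 / q)


def entry_norm(p, q):
    """Entrywise l_q norm of p, as used in the chain of inequalities above."""
    return sum(abs(x) ** q for row in p for x in row) ** (1.0 / q)


def normalized_psd(n, r, s, rng):
    """PSD matrix rescaled so that ||dg{P}||_{s/2} = 1."""
    p = gram_matrix(n, r, rng)
    scale = diag_norm(p, s / 2.0)
    return [[x / scale for x in row] for row in p]


if __name__ == "__main__":
    rng = random.Random(0)
    for s in (2.0, 3.0, 6.0):
        p = normalized_psd(6, 3, s, rng)
        print(s, diag_norm(p, s / 2.0), entry_norm(p, s))
```

The second printed number should never exceed 1 (up to rounding), in line with the displayed chain of inequalities.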
✷
5.4.4 Proof of Proposition 5.10
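1°. Let us consider the optimization problem (4.42) (where one should set $Q = \sigma^2I_m$), which under the circumstances is responsible for building a nearly optimal linear estimate of $Bx$ yielded by Proposition 4.14, namely,
$$\mathrm{Opt}_* = \min_{\Theta,H,\Lambda,\Upsilon',\Upsilon''}\left\{\phi_T(\lambda[\Lambda]) + \phi_R(\lambda[\Upsilon']) + \phi_R(\lambda[\Upsilon'']) + \sigma^2\mathrm{Tr}(\Theta) :
\begin{array}{l}
\Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\ \Upsilon'' = \{\Upsilon''_\ell \succeq 0, \ell \le L\},\\
\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon'_\ell] & \frac{1}{2}M^T[B - H^TA]\\ \frac{1}{2}[B - H^TA]^TM & \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0,\\
\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon''_\ell] & \frac{1}{2}M^TH^T\\ \frac{1}{2}HM & \Theta\end{bmatrix} \succeq 0
\end{array}\right\} \qquad (5.85)$$
Let us show that the optimal value Opt of (5.45) satisfies
$$\mathrm{Opt} \le 2\kappa\mathrm{Opt}_* = 2\sqrt{2\ln(2m/\epsilon)}\,\mathrm{Opt}_*. \qquad (5.86)$$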
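To this end, observe that the matrices
$$Q := \begin{bmatrix}U & \frac{1}{2}B\\ \frac{1}{2}B^T & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix}$$
and
$$\begin{bmatrix}M^TUM & \frac{1}{2}M^TB\\ \frac{1}{2}B^TM & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix} = \begin{bmatrix}M & \\ & I_n\end{bmatrix}^TQ\begin{bmatrix}M & \\ & I_n\end{bmatrix}$$
simultaneously are/are not positive semidefinite due to the fact that the image space of $M$ contains the full-dimensional set $\mathcal{B}_*$ and thus is the entire $\mathbf{R}^\nu$, so that the image space of $\begin{bmatrix}M & \\ & I_n\end{bmatrix}$ is the entire $\mathbf{R}^\nu \times \mathbf{R}^n$. Therefore,
$$\mathrm{Opt} = \min_{\Theta,U,\Lambda,\Upsilon}\left\{2\left[\phi_R(\lambda[\Upsilon]) + \phi_T(\lambda[\Lambda]) + \sigma^2\kappa^2\mathrm{Tr}(\Theta)\right] :
\begin{array}{l}
\Theta \succeq 0,\ U \succeq 0,\ \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\\
\begin{bmatrix}M^TUM & \frac{1}{2}M^TB\\ \frac{1}{2}B^TM & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0,\ M^TUM \preceq \sum_\ell S_\ell^*[\Upsilon_\ell]
\end{array}\right\}.$$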
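Further, note that if a collection $\Theta, U, \{\Lambda_k\}, \{\Upsilon_\ell\}$ is a feasible solution to the latter problem and $\theta > 0$, the scaled collection $\theta\Theta, \theta^{-1}U, \{\theta\Lambda_k\}, \{\theta^{-1}\Upsilon_\ell\}$ is also a feasible solution. When optimizing with respect to the scaling, we get
$$\begin{array}{rcl}
\mathrm{Opt} &=& \inf\limits_{\Theta,U,\Lambda,\Upsilon}\left\{4\sqrt{\phi_R(\lambda[\Upsilon])\left[\phi_T(\lambda[\Lambda]) + \sigma^2\kappa^2\mathrm{Tr}(\Theta)\right]} :
\begin{array}{l}
\Theta \succeq 0,\ U \succeq 0,\ \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\\
\begin{bmatrix}M^TUM & \frac{1}{2}M^TB\\ \frac{1}{2}B^TM & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0,\ M^TUM \preceq \sum_\ell S_\ell^*[\Upsilon_\ell]
\end{array}\right\}\\
&\le& 2\kappa\mathrm{Opt}_+,
\end{array} \qquad (5.87)$$
where (note that $\kappa > 1$)
$$\mathrm{Opt}_+ = \inf_{\Theta,U,\Lambda,\Upsilon}\left\{2\sqrt{\phi_R(\lambda[\Upsilon])\left[\phi_T(\lambda[\Lambda]) + \sigma^2\mathrm{Tr}(\Theta)\right]} :
\begin{array}{l}
\Theta \succeq 0,\ U \succeq 0,\ \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\\
M^TUM \preceq \sum_\ell S_\ell^*[\Upsilon_\ell],\ \begin{bmatrix}M^TUM & \frac{1}{2}M^TB\\ \frac{1}{2}B^TM & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0
\end{array}\right\}. \qquad (5.88)$$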
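On the other hand, when strengthening the constraint $\Lambda_k \succeq 0$ of (5.85) to $\Lambda_k \succ 0$, we still have
$$\mathrm{Opt}_* = \inf_{\Theta,H,\Lambda,\Upsilon',\Upsilon''}\left\{\phi_T(\lambda[\Lambda]) + \phi_R(\lambda[\Upsilon']) + \phi_R(\lambda[\Upsilon'']) + \sigma^2\mathrm{Tr}(\Theta) :
\begin{array}{l}
\Lambda = \{\Lambda_k \succ 0, k \le K\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\ \Upsilon'' = \{\Upsilon''_\ell \succeq 0, \ell \le L\},\\
\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon'_\ell] & \frac{1}{2}M^T[B - H^TA]\\ \frac{1}{2}[B - H^TA]^TM & \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0,\\
\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon''_\ell] & \frac{1}{2}M^TH^T\\ \frac{1}{2}HM & \Theta\end{bmatrix} \succeq 0
\end{array}\right\}. \qquad (5.89)$$
Now let $\Theta, H, \Lambda, \Upsilon', \Upsilon''$ be a feasible solution to the latter problem. By the second semidefinite constraint in (5.89) we have
$$\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon''_\ell] & \frac{1}{2}M^TH^TA\\ \frac{1}{2}A^THM & A^T\Theta A\end{bmatrix} = \begin{bmatrix}I & \\ & A\end{bmatrix}^T\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon''_\ell] & \frac{1}{2}M^TH^T\\ \frac{1}{2}HM & \Theta\end{bmatrix}\begin{bmatrix}I & \\ & A\end{bmatrix} \succeq 0,$$
which combines with the first semidefinite constraint in (5.89) to imply that
$$\begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon'_\ell + \Upsilon''_\ell] & \frac{1}{2}M^TB\\ \frac{1}{2}B^TM & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0.$$
Next, by the Schur Complement Lemma (which is applicable due to
$$A^T\Theta A + \sum_k R_k^*[\Lambda_k] \succeq \sum_k R_k^*[\Lambda_k] \succ 0,$$
where the concluding $\succ$ is due to Lemma 4.44 combined with $\Lambda_k \succ 0$), this relation implies that for $\Upsilon_\ell = \Upsilon'_\ell + \Upsilon''_\ell$,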
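we have
$$\sum_\ell S_\ell^*[\Upsilon_\ell] \succeq M^T\underbrace{\Big[\tfrac{1}{4}B\Big[A^T\Theta A + \sum_k R_k^*[\Lambda_k]\Big]^{-1}B^T\Big]}_{U}M.$$
Using the Schur Complement Lemma again, for the $U \succeq 0$ just defined we obtain
$$\begin{bmatrix}M^TUM & \frac{1}{2}M^TB\\ \frac{1}{2}B^TM & A^T\Theta A + \sum_k R_k^*[\Lambda_k]\end{bmatrix} \succeq 0,$$
and in addition, by the definition of $U$,
$$M^TUM \preceq \sum_\ell S_\ell^*[\Upsilon_\ell].$$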
We conclude that $(\Theta, U, \Lambda, \Upsilon := \{\Upsilon_\ell = \Upsilon'_\ell + \Upsilon''_\ell, \ell \le L\})$ is a feasible solution to optimization problem (5.88) specifying $\mathrm{Opt}_+$. The value of the objective of the latter problem at this feasible solution is
$$\begin{array}{rcl}
2\sqrt{\phi_R(\lambda[\Upsilon'] + \lambda[\Upsilon''])\left[\phi_T(\lambda[\Lambda]) + \sigma^2\mathrm{Tr}(\Theta)\right]} &\le& \phi_R(\lambda[\Upsilon'] + \lambda[\Upsilon'']) + \phi_T(\lambda[\Lambda]) + \sigma^2\mathrm{Tr}(\Theta)\\
&\le& \phi_R(\lambda[\Upsilon']) + \phi_R(\lambda[\Upsilon'']) + \phi_T(\lambda[\Lambda]) + \sigma^2\mathrm{Tr}(\Theta),
\end{array}$$
the concluding quantity in the chain being the value of the objective of problem (5.89) at the feasible solution $\Theta, H, \Lambda, \Upsilon', \Upsilon''$ to this problem. Since the resulting inequality holds true for every feasible solution to (5.89), we conclude that $\mathrm{Opt}_+ \le \mathrm{Opt}_*$, and we arrive at (5.86) due to (5.87).
2°. Now, from Proposition 4.16 we conclude that $\mathrm{Opt}_*$ is within a logarithmic factor of the minimax optimal $(\frac{1}{8}, \|\cdot\|)$-risk corresponding to the case of Gaussian noise $\xi_x \sim N(0, \sigma^2I_m)$ for all $x$:
$$\mathrm{Opt}_* \le \theta_*\mathrm{RiskOpt}_{1/8},$$
where
$$\theta_* = 8\sqrt{(2\ln F + 10\ln 2)(2\ln D + 10\ln 2)},\quad F = \sum_\ell f_\ell,\ D = \sum_k d_k.$$
Since the minimax optimal $(\epsilon, \|\cdot\|)$-risk clearly only grows when $\epsilon$ decreases, we conclude that for $\epsilon \le 1/8$ a feasible near-optimal solution to (5.45) is minimax optimal within the factor $2\theta_*\kappa$. ✷
Solutions to Selected Exercises
6.1 SOLUTIONS FOR CHAPTER 1
Exercise 1.1. The $k$-th Hadamard matrix, $H_k$ (here $k$ is a nonnegative integer) is the $n_k \times n_k$ matrix, $n_k = 2^k$, given by the recurrence
$$H_0 = [1];\quad H_{k+1} = \begin{bmatrix}H_k & H_k\\ H_k & -H_k\end{bmatrix}.$$
In the sequel, we assume that $k > 0$. Now comes the exercise:
1. Check that $H_k$ is a symmetric matrix with entries $\pm 1$, and columns of the matrix are mutually orthogonal, so that $H_k/\sqrt{n_k}$ is an orthogonal matrix.
2. Check that when $k > 0$, $H_k$ has just two distinct eigenvalues, $\sqrt{n_k}$ and $-\sqrt{n_k}$, each of multiplicity $m_k := 2^{k-1} = n_k/2$.
3. Prove that whenever $f$ is an eigenvector of $H_k$, one has
$$\|f\|_\infty \le \|f\|_1/\sqrt{n_k}.$$
Derive from this observation the conclusion as follows:
Let $a_1, ..., a_{m_k} \in \mathbf{R}^{n_k}$ be unit vectors orthogonal to each other which are eigenvectors of $H_k$ with eigenvalues $\sqrt{n_k}$ (by the above, the dimension of the eigenspace of $H_k$ associated with the eigenvalue $\sqrt{n_k}$ is $m_k$, so that the required $a_1, ..., a_{m_k}$ do exist), and let $A$ be the $m_k \times n_k$ matrix with the rows $a_1^T, ..., a_{m_k}^T$. For every $x \in \mathrm{Ker}\,A$ it holds
$$\|x\|_\infty \le \frac{1}{\sqrt{n_k}}\|x\|_1,$$
whence $A$ satisfies the nullspace property whenever the sparsity $s$ satisfies $2s < \sqrt{n_k} = \sqrt{2m_k}$. Moreover, there exists (and can be found efficiently) an $m_k \times n_k$ contrast matrix $H = H_k$ such that for every $s < \frac{1}{2}\sqrt{n_k}$, the pair $(H_k, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa_s = s/\sqrt{n_k})$ associated with $A$, and the $\|\cdot\|_2$-norms [...] 0), there are $m_k$ eigenvalues equal to $+1$ and $m_k$
eigenvalues equal to $-1$, which for $H_k$ means eigenvalues $\pm\sqrt{n_k}$ of multiplicity $m_k$ each.
Now let $f$ be an eigenvector of $H_k$, so that $H_kf = \lambda f$ with $|\lambda| = \sqrt{n_k}$. Since the entries in $H_k$ are $\pm 1$, we have $\|H_kf\|_\infty \le \|f\|_1$, whence $\|f\|_\infty = \|H_kf\|_\infty/|\lambda|$, and thus $\|f\|_\infty \le \|f\|_1/\sqrt{n_k}$, as claimed.
Now let $x \in \mathrm{Ker}\,A$. By construction, $\mathrm{Ker}\,A$ is the orthogonal complement to the eigenspace of $H_k$ corresponding to the eigenvalue $\sqrt{n_k}$. From what we know about the eigenstructure of $H_k$, this orthogonal complement is the eigenspace of $H_k$ corresponding to the eigenvalue $-\sqrt{n_k}$, whence $H_kx = -\sqrt{n_k}x$, and thus
$$\|x\|_\infty \le \|x\|_1/\sqrt{n_k}.$$
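The facts used so far (symmetry, orthogonality of columns, zero trace, and the bound $\|f\|_\infty \le \|f\|_1/\sqrt{n_k}$ for eigenvectors) can be checked on a small instance. The sketch below is an illustration with hypothetical function names; it produces an eigenvector for the eigenvalue $\sqrt{n_k}$ as $f = H_kv + \sqrt{n_k}v$, which works because $H_k^2 = n_kI$.

```python
import math
import random


def hadamard(k):
    """Hadamard matrix H_k built by the recurrence H_{k+1} = [[H, H], [H, -H]]."""
    h = [[1]]
    for _ in range(k):
        h = ([row + row for row in h]
             + [row + [-x for x in row] for row in h])
    return h


def matvec(h, v):
    return [sum(a * b for a, b in zip(row, v)) for row in h]


def eigenvector_plus(h, v):
    """f = H v + sqrt(n) v satisfies H f = sqrt(n) f, since H^2 = n I."""
    root = math.sqrt(len(h))
    return [hv + root * x for hv, x in zip(matvec(h, v), v)]


if __name__ == "__main__":
    k = 4
    h = hadamard(k)
    n = len(h)
    rng = random.Random(0)
    f = eigenvector_plus(h, [rng.gauss(0.0, 1.0) for _ in range(n)])
    ratio = max(abs(x) for x in f) / sum(abs(x) for x in f)
    print("||f||_inf / ||f||_1 =", ratio, " 1/sqrt(n) =", 1.0 / math.sqrt(n))
```

The same helper with $-\sqrt{n_k}$ in place of $\sqrt{n_k}$ gives vectors of $\mathrm{Ker}\,A$, to which the nullspace bound applies.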
Given $i \le n_k$, consider the LP program
$$\mathrm{Opt}_i = \max_x\{x_i : x \in \mathrm{Ker}\,A,\ \|x\|_1 \le 1\}; \qquad (*)$$
as we have seen, $\mathrm{Opt}_i \le 1/\sqrt{n_k}$. On the other hand, $\mathrm{Opt}_i$ is the optimal value in the dual to $(*)$ LP program, which is
$$\min_{t,h,g}\left\{t : e_i = A^Th + g,\ \|g\|_\infty \le t\right\},$$
where $e_i$ is the $i$-th basic orth in $\mathbf{R}^{n_k}$. Denoting by $h_i$ the $h$-component of an optimal solution to the dual problem, we get
$$\|A^Th_i - e_i\|_\infty \le 1/\sqrt{n_k},$$
whence $A^Th_i = e_i + g_i$, $\|g_i\|_\infty \le 1/\sqrt{n_k}$, and thus
$$\|A^Th_i\|_2^2 = \|e_i + g_i\|_2^2 = \|e_i\|_2^2 + \|g_i\|_2^2 + 2e_i^Tg_i \le 2 + 2/\sqrt{n_k}.$$
By construction, $AA^T = I_{m_k}$, whence $\|A^Th_i\|_2^2 = \|h_i\|_2^2$; thus, $\|h_i\|_2 \le \sqrt{2\frac{\sqrt{n_k}+1}{\sqrt{n_k}}}$. Now let $H_k$ (the contrast matrix) be the $m_k \times n_k$ matrix with the columns $h_1, ..., h_{n_k}$. The columns of $H_k$ do satisfy the norm bound from the claim we are proving, and we have
$$|[I - H_k^TA]_{ij}| = |[e_i^T - h_i^TA]_j| \le \|e_i - A^Th_i\|_\infty \le \mathrm{Opt}_i \le 1/\sqrt{n_k}$$
for all $i, j$, implying that $H_k$ satisfies the condition (1.34) with $\kappa = s/\sqrt{n_k} = \kappa_s$, and thus, by Proposition 1.10, $(H_k, \|\cdot\|_\infty)$ indeed satisfies $Q_\infty(s, \kappa_s)$. Similar “fully constructive” results for other size ratios can be extracted from [61]. ✷
Exercise 1.2.
[Follow-up to Exercise 1.1] Exercise 1.1 provides us with an explicitly given $(m = 512) \times (n = 1024)$ sensing matrix $\bar{A}$ such that the efficiently verifiable condition $Q_\infty(15, \frac{15}{32})$ is satisfiable; in particular, $\bar{A}$ is 15-good. With all we know about the limits of performance of verifiable sufficient conditions for goodness, how should we evaluate this specific sensing matrix? Could we point out a sensing matrix of the same size which is provably $s$-good for a value of $s$ larger (or “much larger”) than 15?
We do not know the answer, and you are requested to explore some possibilities, including (but not limited to; you are welcome to investigate more options!) the following ones.
1. Generate at random a sample of $m \times n$ sensing matrices $A$, compute their mutual incoherences and look at how large the goodness levels certified by these incoherences are. What happens when the matrices are Gaussian (independent $N(0, 1)$ entries) and Rademacher ones (independent entries taking values $\pm 1$ with probabilities 1/2)?
2. Generate at random a sample of $m \times n$ matrices with independent $N(0, 1/m)$ entries. Proposition 1.7 suggests that a sampled matrix $A$ has good chances to satisfy RIP($\delta$, $k$) with some $\delta < 1/3$ and some $k$, and thus to be $s$-good (and even more than this; see Proposition 1.8) for every $s \le k/2$. Of course, given $A$ we cannot check whether the matrix indeed satisfies RIP($\delta$, $k$) with given $\delta$, $k$; what we can try to do is to certify that RIP($\delta$, $k$) does not take place. To this end, it suffices to select at random, say, 200 $m \times k$ submatrices $\tilde{A}$ of $A$ and compute the eigenvalues of $\tilde{A}^T\tilde{A}$; if $A$ possesses RIP($\delta$, $k$), all these eigenvalues should belong to the segment $[1 - \delta, 1 + \delta]$, and if in reality this does not happen, $A$ definitely is not RIP($\delta$, $k$).
Solution: 1. Here are the levels of goodness as justified by mutual incoherence for randomly generated 512 × 1024 matrices:

    Generation    Mutual incoherence    Justified goodness level
    Rademacher    0.191406              3
    Gaussian      0.199985              3

The mutual incoherences and associated goodness levels in the table are the best, over 768 randomly generated matrices.
2. Here are the largest values of $s$ for which in a series of 128 randomly generated 512 × 1024 matrices there could be one satisfying RIP($\delta$, $2s$) with $\delta < 1/3$:

    Generation    s
    Rademacher    6
    Gaussian      5

We would say that the above numerical results are rather discouraging. We do know a 15-good individual 512 × 1024 matrix, and this fact is given by our “heavily conservative and severely restricted in its scope” verifiable sufficient condition for goodness. Moreover, it seems to be immediate to build an $m \times n$ sensing matrix which is $s$-good for $s$ “nearly” as large as $\sqrt{m}$. For example, it is immediately seen (check it!) that if $\xi$, $\eta$ are two independent random $m$-dimensional Rademacher vectors, then
$$\forall(\alpha > 0):\ \mathrm{Prob}\{|\xi^T\eta| > \alpha\sqrt{m}\} \le 2\exp\{-\alpha^2/2\},$$
implying that the mutual incoherence for an $m \times n$ Rademacher matrix is with probability $\ge 1/2$ upper-bounded by the quantity
$$\widehat{\mu} = \sqrt{2\ln(2n^2)/m}.$$
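To see how loose the bound $\widehat{\mu}$ is in practice, one can compute the mutual incoherence of a random Rademacher matrix directly. The sketch below is an illustration under two labeled assumptions: mutual incoherence is taken as $\max_{i\ne j}|\mathrm{Col}_i^T\mathrm{Col}_j|/(\mathrm{Col}_j^T\mathrm{Col}_j)$, and the certified goodness level is taken as the largest integer $s$ with $s < (1+\mu)/(2\mu)$; the latter rule reproduces the levels in the first table above, but neither formula is quoted from the text here. Function names are hypothetical and the sizes are kept small for speed.

```python
import math
import random


def rademacher_matrix(m, n, rng):
    """m x n matrix with independent +-1 entries."""
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(m)]


def mutual_incoherence(a):
    """max over i != j of |Col_i^T Col_j| / (Col_j^T Col_j) (assumed definition)."""
    m, n = len(a), len(a[0])
    cols = [[a[r][c] for r in range(m)] for c in range(n)]
    sq = [sum(x * x for x in col) for col in cols]
    mu = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                dot = sum(x * y for x, y in zip(cols[i], cols[j]))
                mu = max(mu, abs(dot) / sq[j])
    return mu


def mu_hat(m, n):
    """The bound sqrt(2 ln(2 n^2) / m) from the text."""
    return math.sqrt(2.0 * math.log(2.0 * n * n) / m)


def certified_sparsity(mu):
    """Largest integer s with s < (1 + mu) / (2 mu) (assumed certification rule)."""
    return math.ceil((1.0 + mu) / (2.0 * mu)) - 1


if __name__ == "__main__":
    rng = random.Random(0)
    m, n = 64, 32
    mu = mutual_incoherence(rademacher_matrix(m, n, rng))
    print("mu =", mu, " mu_hat =", mu_hat(m, n), " certified s =", certified_sparsity(mu))
```

Running this over many random draws and keeping the smallest incoherence mimics the "best over a sample" protocol used for the tables.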
Thus, mutual incoherence of a random, say, Rademacher matrix justifies its “nearly” (up to a factor logarithmic in $n$) $\sqrt{m}$-goodness. This being said, numerical data above demonstrate that the “logarithmic toll” in question in reality may be quite heavy.
Similarly, beautiful theoretical results on the RIP property of random Rademacher and Gaussian matrices seem to be poorly suited for “real life” purposes, even when we are ready to skip full-scale verification of RIP. As our numerical results show, in the situation where our “conservative and severely restricted in scope” sufficient condition for goodness allows us to justify goodness 15, the “theoretically perfect” RIP-based approach cannot justify something like 7-goodness.
To conclude, we remark that our pessimism applies to what we can provably say about the level of goodness, and not to this level per se. For example, RIP($\delta$, $2s$) with $\delta < 1/3$ is just a sufficient condition for $s$-goodness, and the fact that for the sizes $m$, $n$ we were considering this sufficient condition is typically not satisfied with “meaningful” values of $s$ does not prevent the random matrices we were testing from being $s$-good with “nice” values of $s$. It could also happen that the sizes $m$, $n$ we are speaking about are too small for meaningful Compressed Sensing applications, and what happens in this range of sizes is of no actual interest.
Exercise 1.3.
Let us start with a preamble. Consider a finite Abelian group; the only thing which matters for us is that such a group $G$ is specified by a collection of $k \ge 1$ positive integers $\nu_1, ..., \nu_k$ and is comprised of all collections $\omega = (\omega_1, ..., \omega_k)$ where every $\omega_i$ is an integer from the range $\{0, 1, ..., \nu_i - 1\}$; the group operation, denoted by $\oplus$, is
$$(\omega_1, ..., \omega_k) \oplus (\omega_1', ..., \omega_k') = ((\omega_1 + \omega_1') \bmod \nu_1, ..., (\omega_k + \omega_k') \bmod \nu_k),$$
where $a \bmod b$ is the remainder, taking values in $\{0, 1, ..., b - 1\}$, in the division of an integer $a$ by positive integer $b$; say, $5 \bmod 3 = 2$, and $6 \bmod 3 = 0$. Clearly, the cardinality of the above group $G$ is $n_k = \nu_1\nu_2...\nu_k$. A character of group $G$ is a homomorphism acting from $G$ into the multiplicative group of complex numbers of modulus 1, or, in simple words, a complex-valued function $\chi(\omega)$ on $G$ such that $|\chi(\omega)| = 1$ for all $\omega \in G$ and $\chi(\omega \oplus \omega') = \chi(\omega)\chi(\omega')$ for all $\omega, \omega' \in G$. Note that characters themselves form a group w.r.t. pointwise multiplication; clearly, all characters of our $G$ are functions of the form
$$\chi((\omega_1, ..., \omega_k)) = \mu_1^{\omega_1}...\mu_k^{\omega_k},$$
where $\mu_i$ are restricted to be roots of degree $\nu_i$ from 1: $\mu_i^{\nu_i} = 1$. It is immediately seen that the group $G_*$ of characters of $G$ is of the same cardinality $n_k = \nu_1...\nu_k$ as $G$. We can associate with $G$ the matrix $F$ of size $n_k \times n_k$; the columns in the matrix are indexed by the elements $\omega$ of $G$, the rows by the characters $\chi \in G_*$ of $G$, and the element in cell $(\chi, \omega)$ is $\chi(\omega)$. The standard example here corresponds to $k = 1$, in which case $F$ clearly is the $\nu_1 \times \nu_1$ matrix of the Discrete Fourier Transform.
Now comes the exercise:
1. Verify that the above $F$ is, up to factor $\sqrt{n_k}$, a unitary matrix: denoting by $\bar{a}$ the complex conjugate of a complex number $a$, $\sum_{\omega\in G}\chi(\omega)\overline{\chi'(\omega)}$ is $n_k$ or 0 depending on whether $\chi = \chi'$ or $\chi \neq \chi'$.
2. Let $\bar\omega$, $\bar\omega'$ be two elements of $G$. Prove that there exists a permutation $\Pi$ of elements of $G$ which maps $\bar\omega$ into $\bar\omega'$ and is such that
$$\mathrm{Col}_{\Pi(\omega)}[F] = D\,\mathrm{Col}_\omega[F]\ \ \forall\omega \in G,$$
where $D$ is a diagonal matrix with diagonal entries $\chi(\bar\omega')/\chi(\bar\omega)$, $\chi \in G_*$.
3. Consider the special case of the above construction where $\nu_1 = \nu_2 = ... = \nu_k = 2$. Verify that in this case $F$, up to permutation of rows and permutation of columns (these permutations depend on how we assign the elements of $G$ and of $G_*$ their serial numbers), is exactly the Hadamard matrix $H_k$.
4. Extract from the above the following fact: let $m$, $k$ be positive integers such that $m \le n_k := 2^k$, and let sensing matrix $A$ be obtained from $H_k$ by selecting $m$ distinct rows. Assume we want to find an $m \times n_k$ contrast matrix $H$ such that the pair $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa)$ with as small $\kappa$ as possible; by Proposition 1.10, to this end we should solve $n$ LP programs
$$\mathrm{Opt}_i = \min_h\|e_i - A^Th\|_\infty,$$
where $e_i$ is the $i$-th basic orth in $\mathbf{R}^n$. Prove that with $A$ coming from $H_k$, all these problems have the same optimal value, and optimal solutions to all of the problems are readily given by the optimal solution to just one of them.
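The objects of the preamble are easy to instantiate, which gives a numerical check of items 1 and 3 on small groups. The sketch below is an illustration with hypothetical function names; it indexes both group elements and characters by tuples $(\zeta_1, ..., \zeta_k)$ with $\mu_i = \exp\{2\pi\mathrm{i}\,\zeta_i/\nu_i\}$, and orders them lexicographically, which is one possible assignment of serial numbers.

```python
import cmath
import itertools


def group_elements(nus):
    """All tuples (w_1, ..., w_k) with 0 <= w_i < nu_i, in lexicographic order."""
    return list(itertools.product(*[range(nu) for nu in nus]))


def character_matrix(nus):
    """F[chi][omega] = chi(omega), characters indexed by tuples zeta."""
    elems = group_elements(nus)
    return [[cmath.exp(2j * cmath.pi * sum(z * w / nu for z, w, nu in zip(zeta, omega, nus)))
             for omega in elems]
            for zeta in elems]


def hadamard(k):
    """Hadamard matrix via the recurrence H_{k+1} = [[H, H], [H, -H]]."""
    h = [[1]]
    for _ in range(k):
        h = ([row + row for row in h]
             + [row + [-x for x in row] for row in h])
    return h


def row_gram(f):
    """Matrix of sums over omega of chi(omega) * conj(chi'(omega))."""
    return [[sum(a * b.conjugate() for a, b in zip(r1, r2)) for r2 in f] for r1 in f]


if __name__ == "__main__":
    f = character_matrix((2, 3))
    g = row_gram(f)
    print("diagonal of F F^H:", [round(g[i][i].real, 6) for i in range(len(g))])
    f2 = character_matrix((2, 2, 2))
    print("F equals H_3:", [[round(x.real) for x in row] for row in f2] == hadamard(3))
```

With the lexicographic ordering used here no extra row or column permutation is needed to match the Hadamard recurrence; other orderings would differ from $H_k$ by permutations, as item 3 allows.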
Solution: Item 1: Let $\chi(\omega_1, ..., \omega_k) = \prod_i \alpha_i^{\omega_i}$, $\chi'(\omega_1, ..., \omega_k) = \prod_i \beta_i^{\omega_i}$, where $\alpha_i$, $\beta_i$ are some roots of degree $\nu_i$ of 1. Let us set $\gamma_i = \alpha_i/\beta_i$, so that $\gamma_i$ also are roots of degree $\nu_i$ of 1. We have
$$\sum_{\omega\in G}\chi(\omega)\overline{\chi'(\omega)} = \prod_i\Big[\sum_{\omega_i=0}^{\nu_i-1}(\alpha_i/\beta_i)^{\omega_i}\Big],$$
and if $\gamma$ is a root of degree $\nu$ of 1, then clearly
$$\sum_{s=0}^{\nu-1}\gamma^s = \left\{\begin{array}{ll}\nu, & \gamma = 1\\ \frac{1-\gamma^\nu}{1-\gamma} = 0, & \gamma \neq 1\end{array}\right.,$$
so that $\sum_{\omega\in G}\chi(\omega)\overline{\chi'(\omega)}$ is $n_k = \prod_i \nu_i$ when all $\gamma_i$ are equal to 1 (i.e., when $\chi = \chi'$) and is 0 otherwise. ✷
Item 2: Since $G$ is a group, there exists $\delta \in G$ such that $\bar\omega' = \delta \oplus \bar\omega$, and the mapping $\omega \mapsto \Pi(\omega) := \delta \oplus \omega$ is a permutation of elements of $G$ which maps $\bar\omega$ onto $\bar\omega'$. We clearly have
$$[F]_{\chi,\delta\oplus\omega} = \chi(\delta \oplus \omega) = \chi(\delta)\chi(\omega),$$
implying that $\mathrm{Col}_{\delta\oplus\omega}[F] = \mathrm{Diag}\{\chi(\delta) : \chi \in G_*\}\mathrm{Col}_\omega[F]$ $\forall\omega \in G$, as required. ✷
Item 3: When $\nu_1 = \nu_2 = ... = \nu_k = 2$, the elements of $G$ are just binary words $\omega_1|\omega_2|...|\omega_k$ of length $k$ ($\omega_i$ is the $i$-th bit in the word), and $\oplus$ is the letterwise summation of these words modulo 2. Characters of $G$ also can be encoded by binary words $\zeta_1|\zeta_2|...|\zeta_k$ of length $k$: such a character is identified by ordered collection $\mu_1, ..., \mu_k$ of roots $\pm 1$ of degree 2 of 1, and we can represent such a collection by binary word $\zeta_1|...|\zeta_k$ according to the rule $\mu_i = (-1)^{\zeta_i}$. With these conventions, the value of a character $\zeta = \zeta_1|...|\zeta_k$ at a point $\omega = \omega_1|...|\omega_k \in G$ is $\zeta(\omega) = \prod_{i=1}^k(-1)^{\zeta_i\omega_i}$; for example, $0|0(0|0) = 1$, $0|1(1|0) = 1$, $0|1(0|1) = -1$. Now, we can treat a $k$-bit binary word $s = s_1|...|s_k$ as the binary representation of the integer $b(s) = s_1 + 2s_2 + 4s_3 + ... + 2^{k-1}s_k$, thus arriving at a one-to-one correspondence between the set $B_k$ of all binary words of length $k$ and the set $\{0, 1, ..., 2^k - 1\}$; we treat $b(s)$ as the serial number of $s \in B_k$. Now consider the matrix $F = F_k$ corresponding to $\nu_1 = ... = \nu_k = 2$, and let us order its rows and columns, originally indexed by elements of $B_k$, in the order given by serial numbers of these elements (so that the rows and columns are indexed starting with 0, and not with 1). When $k = 1$, we get
$$F_1 = \begin{bmatrix}0(0) & 0(1)\\ 1(0) & 1(1)\end{bmatrix} = \begin{bmatrix}1 & 1\\ 1 & -1\end{bmatrix} = H_1.$$
Besides this, our ordering of words from $B_{k+1}$ is as follows: we first write down the words $s|0$ with $s \in B_k$, in the order in which the $s$'s appear in $B_k$, and then all words $s|1$ with $s \in B_k$, again in the order in which the $s$'s appear in $B_k$. It follows that with our ordering of rows and columns in $F_{k+1}$, we have
$$F_{k+1} = \begin{bmatrix}[\zeta|0(\omega|0)]_{\zeta,\omega\in B_k} & [\zeta|0(\omega|1)]_{\zeta,\omega\in B_k}\\ [\zeta|1(\omega|0)]_{\zeta,\omega\in B_k} & [\zeta|1(\omega|1)]_{\zeta,\omega\in B_k}\end{bmatrix} = \begin{bmatrix}F_k & F_k\\ F_k & -F_k\end{bmatrix}.$$
We see that $F_1 = H_1$, and the recurrence linking matrices $F_1, F_2, ...$ is exactly the recurrence linking $H_1, H_2, ...$, whence $F_k = H_k$ for all $k = 1, 2, ...$. ✷
Item 4: Indexing rows and columns of $H_k$ starting with 0, and denoting by $s(i) = s_1(i)|s_2(i)|...|s_k(i)$ the binary representation of an integer $i \in \{0, 1, ..., 2^k - 1\}$ (this representation is augmented by zeros from the right to get a binary word of length $k$), we get from item 2 that for every $i \in I = \{0, 1, ..., 2^k - 1\}$ there exists a permutation $j \mapsto \Pi_i(j)$ of elements of $I$ which maps 0 onto $i$ and is such that
$$\mathrm{Col}_{\Pi_i(j)}[H_k] = D_i\mathrm{Col}_j[H_k]$$
for all $j \in I$, where $D_i$ is a diagonal matrix. Since $i = \Pi_i(0)$ and $\mathrm{Col}_0[H_k]$ is the all-ones vector, we have $D_i = \mathrm{Diag}\{\mathrm{Col}_i[H_k]\}$. As a result,
$$\mathrm{Col}_{\Pi_i(j)}[A] = E_i\mathrm{Col}_j[A],\ j \in I,$$
where $E_i$ is the diagonal $m \times m$ matrix cut from $D_i$ by the rows of $H_k$ participating in $A$ and the columns with the same indexes; note that all diagonal entries in $E_i$ are $\pm 1$. It follows that for every $h \in \mathbf{R}^m$ and for every $j$ we have
$$[A^Th]_{\Pi_i(j)} = \mathrm{Col}^T_{\Pi_i(j)}[A]h = (E_i\mathrm{Col}_j[A])^Th = \mathrm{Col}_j^T[A](E_ih).$$
Equivalently, denoting by $\Pi_i^{-1}$ the permutation inverse to $\Pi_i$,
$$[A^Th]_j = \mathrm{Col}^T_{\Pi_i^{-1}(j)}[A](E_ih).$$
It follows that
$$[e_i]_j - [A^Th]_j = [e_i]_j - \mathrm{Col}^T_{\Pi_i^{-1}(j)}[A](E_ih);$$
noting that $\Pi_i(0) = i$, that is, $\Pi_i^{-1}(i) = 0$, so that $[e_i]_j = [e_0]_{\Pi_i^{-1}(j)}$, we conclude that for all $j \in I$ it holds
$$[e_i]_j - [A^Th]_j = [e_0]_{\Pi_i^{-1}(j)} - \mathrm{Col}^T_{\Pi_i^{-1}(j)}[A](E_ih).$$
Hence
$$\|e_i - A^Th\|_\infty = \|e_0 - A^T(E_ih)\|_\infty$$
for all $i \in I$ and all $h \in \mathbf{R}^m$. We see that all problems $\min_h\|e_i - A^Th\|_\infty$ are equivalent to the problem $\min_h\|e_0 - A^Th\|_\infty$, and the optimal solution $h_0$ to the latter problem induces optimal solutions $h_i = E_i^{-1}h_0 = E_ih_0$ to the former problems. ✷
Exercise 1.4.
Proposition 1.13 states that the verifiable condition $Q_\infty(s, \kappa)$ can certify $s$-goodness of an “essentially nonsquare” (with $m \le n/2$) $m \times n$ sensing matrix $A$ only when $s$ is small as compared to $m$, namely, $s \le \sqrt{2m}$. The exercise to follow is aimed at investigating what happens when an $m \times n$ “low” (with $m < n$) sensing matrix $A$ is “nearly square,” meaning that $m^o = n - m$ is small compared to $n$. Specifically, you should prove that for properly selected individual $(n - m^o) \times n$ matrices $A$ the condition $Q_\infty(s, \kappa)$ with $\kappa < 1/2$ is satisfiable when $s$ is as large as $O(1)n/\sqrt{m^o}$.
1. Let $n = 2^kp$ with positive integer $p$ and integer $k \ge 1$, and let $m^o = 2^{k-1}$. Given a $2m^o$-dimensional vector $u$, let $u^+$ be the $n$-dimensional vector built as follows: we split indexes from $\{1, ..., n = 2^kp\}$ into $2^k$ consecutive groups $I_1, ..., I_{2^k}$, $p$ elements per group, and all entries of $u^+$ with indexes from $I_i$ are equal to the $i$-th entry, $u_i$, of vector $u$. Now let $U$ be the linear subspace in $\mathbf{R}^{2^k}$ comprised of all eigenvectors, with eigenvalue $\sqrt{2^k}$, of the Hadamard matrix $H_k$ (see Exercise 1.1), so that the dimension of $U$ is $2^{k-1} = m^o$, and let $L$ be given by
$$L = \{u^+ : u \in U\} \subset \mathbf{R}^n.$$
Clearly, $L$ is a linear subspace in $\mathbf{R}^n$ of dimension $m^o$. Prove that
$$\forall x \in L:\ \|x\|_\infty \le \frac{\sqrt{2m^o}}{n}\|x\|_1.$$
Conclude that if $A$ is an $(n - m^o) \times n$ sensing matrix with $\mathrm{Ker}\,A = L$, then the verifiable sufficient condition $Q_\infty(s, \kappa)$ does certify $s$-goodness of $A$ whenever
$$1 \le s < \frac{n}{2\sqrt{2m^o}}.$$
Solution: Let $x \in L$, so that $x = u^+$ for some $u \in U$. By the result of Exercise 1.1, we have $\|u\|_\infty \le \|u\|_1/\sqrt{2^k}$, whence, due to the construction of $L$,
$$\|x\|_\infty = \|u\|_\infty \le \|u\|_1/\sqrt{2^k} = (\|x\|_1/p)/\sqrt{2^k} = \frac{\sqrt{2^k}}{n}\|x\|_1 = \frac{\sqrt{2m^o}}{n}\|x\|_1.$$
It remains to use Proposition 1.10; see (1.33). ✷
2. Let $L$ be an $m^o$-dimensional subspace in $\mathbf{R}^n$. Prove that $L$ contains a nonzero vector $x$ with
$$\|x\|_\infty \ge \frac{\sqrt{m^o}}{n}\|x\|_1,$$
so that the condition $Q_\infty(s, \kappa)$ cannot certify $s$-goodness of an $(n - m^o) \times n$ sensing matrix $A$ whenever $s > O(1)n/\sqrt{m^o}$, for properly selected absolute constant $O(1)$.
Solution: Let the columns in $n \times m^o$ matrix $F$ form an orthonormal basis in $L$. The squared Frobenius norm of $F$ is $m^o$, implying that there exists a row in $F$ of Euclidean norm $\ge \sqrt{m^o/n}$, whence there exists a unit $m^o$-dimensional vector $d$ such that $z = Fd$ has an entry of magnitude $\ge \sqrt{m^o/n}$: $\|z\|_\infty \ge \sqrt{m^o/n}$. Now, by construction $z \in L$ and $\|z\|_2 = \|d\|_2 = 1$, whence $\|z\|_1 \le \sqrt{n}$, and we get
$$\|z\|_\infty \ge \sqrt{m^o/n} = \sqrt{m^o/n}\,\|z\|_2 = \frac{\sqrt{m^o}}{n}[\sqrt{n}\|z\|_2] \ge \frac{\sqrt{m^o}}{n}\|z\|_1,$$
as claimed. ✷
Exercise 1.5. Utilize the results of Exercise 1.3 in a numerical experiment as follows.
• select $n$ as an integer power $2^k$ of 2, say, set $n = 2^{10} = 1024$
• select a “representative” sequence $M$ of values of $m$, $1 \le m < n$, including values of $m$ close to $n$ and “much smaller” than $n$, say, $M = \{2, 5, 8, 16, 32, 64, 128, 256, 512, 768, 896, 960, 992, 1008, 1016, 1020, 1022, 1023\}$
• for every $m \in M$,
– generate at random an $m \times n$ submatrix $A$ of the $n \times n$ Hadamard matrix $H_k$ and utilize the result of item 4 of Exercise 1.3 in order to find the largest $s$ such that $s$-goodness of $A$ can be certified via the condition $Q_\infty(\cdot, \cdot)$; call $s(m)$ the resulting value of $s$.
– generate a moderate sample of Gaussian $m \times n$ sensing matrices $A_i$ with independent $N(0, 1/m)$ entries and use the construction from Exercise 1.2 to upper-bound the largest $s$ for which a matrix from the sample satisfies RIP(1/3, $2s$); call $\widehat{s}(m)$ the largest, over your $A_i$'s, of the resulting upper bounds.
The goal of the exercise is to compare the computed values of $s(m)$ and $\widehat{s}(m)$; in other words, we again want to understand how “theoretically perfect” RIP compares to “conservative restricted scope” condition $Q_\infty$.
Solution: Here are the results of our experiment, n = 1024:

    m       ŝ(m)    s(m)
    1023    11      511
    1022    10      256
    1020    10      191
    1016    10      139
    1008    10      100
    992     10      70
    960     10      48
    896     9       32
    768     7       20
    512     5       11
    256     2       5
    128     0       3
    64      0       2
    32      0       1
    16      0       1
    8       0       0

And here are the results for n = 2048:

    m       ŝ(m)    s(m)
    2047    22      1023
    2046    21      511
    2044    22      384
    2040    22      279
    2032    22      199
    2016    21      143
    1984    21      100
    1920    20      69
    1792    19      46
    1536    16      29
    1024    10      15
    512     5       8
    256     2       4
    128     1       3
    64      0       2
    32      0       1
    16      0       1
    8       0       0

We believe the results speak for themselves.
6.2 SOLUTIONS FOR CHAPTER 2
6.2.1 Two-point lower risk bound
Exercise 2.1. Let p and q be two probability distributions distinct from each other on d-element observation space Ω = {1, ..., d}, and consider two simple hypotheses on the distribution of observation ω ∈ Ω, H1 : ω ∼ p, and H2 : ω ∼ q. 1. Is it true that there always exists a simple deterministic test deciding on H1 , H2 with risk < 1/2? 2. Is it true that there always exists a simple randomized test deciding on H1 , H2 with risk < 1/2? 3. Is it true that when quasi-stationary K-repeated observations are allowed, one can decide on H1 , H2 with any small risk, provided K is large enough?
Solution: The answer to the first question is negative. E.g., when $d = 2$, $p = [0.8; 0.2]$, and $q = [0.9; 0.1]$, no deterministic test can decide on the hypotheses with risk $< 0.8$ (look at what happens when our observation is $\omega = 1$).
The answer to the second question is positive. Indeed, assuming w.l.o.g. that $p + q > 0$, consider the randomized test which, given observation $\omega \in \Omega$, accepts $H_1$ with probability $p_\omega/(p_\omega + q_\omega)$, and accepts $H_2$ whenever it does not accept $H_1$. As is immediately seen, both partial risks of this randomized test are equal to each other and equal to
$$\sum_\omega \frac{p_\omega q_\omega}{p_\omega + q_\omega} = \sum_\omega \sqrt{p_\omega q_\omega}\,\underbrace{\frac{\sqrt{p_\omega q_\omega}}{p_\omega + q_\omega}}_{\le 1/2} \le \frac{1}{2}\sum_\omega \sqrt{p_\omega q_\omega}.$$
Thus, there always exists a randomized test with risk upper-bounded by half of the Hellinger affinity of p and q, and since $p \ne q$, this affinity is $< 1$ due to
$$2\sum_\omega \sqrt{p_\omega q_\omega} = 2 - \sum_\omega \left(\sqrt{p_\omega} - \sqrt{q_\omega}\right)^2.$$
A positive answer to the second question implies a positive answer to the third one: it suffices to pass to the majority version of the single-observation randomized test just defined.
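The answer to the second question can be checked numerically on the example used for the first one. The snippet below is a minimal sketch, not part of the original solution; the function names are ours, and it assumes, as in the solution, that $p + q > 0$ entrywise.

```python
import math

def randomized_test_risks(p, q):
    """Partial risks of the test accepting H1 with probability p_w / (p_w + q_w)."""
    accept_h1 = [pw / (pw + qw) for pw, qw in zip(p, q)]
    risk1 = sum(pw * (1.0 - a) for pw, a in zip(p, accept_h1))  # H1 rejected under p
    risk2 = sum(qw * a for qw, a in zip(q, accept_h1))          # H1 accepted under q
    return risk1, risk2

def hellinger_affinity(p, q):
    """Sum over the observation space of sqrt(p_w * q_w)."""
    return sum(math.sqrt(pw * qw) for pw, qw in zip(p, q))

# The pair of distributions from the answer to the first question.
p, q = [0.8, 0.2], [0.9, 0.1]
risk1, risk2 = randomized_test_risks(p, q)
half_affinity = 0.5 * hellinger_affinity(p, q)
print(risk1, risk2, half_affinity)
```

Both partial risks coincide and stay below half of the Hellinger affinity, which in turn is below 1/2, although no deterministic test achieves risk below 0.8 for this pair.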
6.2.2 Around Euclidean Separation
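Exercise 2.2. Justify the “immediate observation” in Section 2.2.2.3.B.

Solution: Function (2.18) clearly is the marginal univariate density of the random variable $\sqrt{Z}\eta$ with Z underlying the Gaussian mixture in question and random vector $\eta \sim N(0, I_d)$ independent of Z. All we need to prove is that if e is a unit vector in $\mathbf{R}^d$ and $\xi \sim N(0, \Theta)$ is independent of Z, with $\Theta \preceq I_d$, then for every $\delta \ge 0$ we have
$$\mathrm{Prob}\{e^T\sqrt{Z}\xi \ge \delta\} \le \mathrm{Prob}\{e^T\sqrt{Z}\eta \ge \delta\} \quad \left[= \int_\delta^\infty \gamma_Z(s)ds\right].$$
Indeed, we lose nothing when assuming that $\xi = \Theta^{1/2}\eta$, so that, denoting by $\Phi(s)$ the probability for the N(0, 1) random variable to be $\ge s$, we have
$$\begin{array}{rcl}
\mathrm{Prob}\{e^T\sqrt{Z}\xi \ge \delta\} &=& \mathrm{Prob}\{e^T\sqrt{Z}\Theta^{1/2}\eta \ge \delta\} = \int_{t>0} \mathrm{Prob}\{e^T\Theta^{1/2}\eta \ge \delta t^{-1/2}\}P_Z(dt)\\
&=& \int_{t>0} \Phi\left(\delta t^{-1/2}/\sqrt{e^T\Theta e}\right)P_Z(dt) \le \int_{t>0} \Phi(\delta t^{-1/2})P_Z(dt) \quad [\hbox{since } \Theta \preceq I_d]\\
&=& \int_{t>0} \mathrm{Prob}\{e^T\eta \ge \delta t^{-1/2}\}P_Z(dt) = \mathrm{Prob}\{e^T\sqrt{Z}\eta \ge \delta\} = \int_\delta^\infty \gamma_Z(s)ds,
\end{array}$$
as claimed.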
Exercise 2.3. 1) Prove Proposition 2.9.
Hint: You can find useful the following simple observation (prove it if you indeed use it): Let $f(\omega)$, $g(\omega)$ be probability densities taken w.r.t. a reference measure P on an observation space $\Omega$, and let $\epsilon \in (0, 1/2]$ be such that
$$2\bar{\epsilon} := \int_\Omega \min[f(\omega), g(\omega)]P(d\omega) \le 2\epsilon.$$
Then
$$\int_\Omega \sqrt{f(\omega)g(\omega)}P(d\omega) \le 2\sqrt{\epsilon(1 - \epsilon)}.$$
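Solution: Let us prove the claim in the hint. Setting $h(\omega) = \min[f(\omega), g(\omega)]$, $\bar{h}(\omega) = \max[f(\omega), g(\omega)]$, we have
$$\begin{array}{rcl}
\int_\Omega \sqrt{f(\omega)g(\omega)}P(d\omega) &=& \int_\Omega \sqrt{h(\omega)\bar{h}(\omega)}P(d\omega) \le \left(\int_\Omega h(\omega)P(d\omega)\right)^{1/2}\left(\int_\Omega \bar{h}(\omega)P(d\omega)\right)^{1/2}\\
&=& \left(\int_\Omega h(\omega)P(d\omega)\right)^{1/2}\left(\int_\Omega [f(\omega) + g(\omega) - h(\omega)]P(d\omega)\right)^{1/2}\\
&=& \sqrt{2\bar{\epsilon}}\sqrt{2 - 2\bar{\epsilon}} = 2\sqrt{\bar{\epsilon}(1 - \bar{\epsilon})} \le 2\sqrt{\epsilon(1 - \epsilon)},
\end{array}$$
where the first ≤ is due to the Cauchy inequality, and the last ≤ is because $0 \le \bar{\epsilon} \le \epsilon \le 1/2$. ✷
We are ready to prove Proposition 2.9. Let $T_K^{\mathrm{maj}}$ be the K-observation majority test in question. Observe that (2.25) combines with (2.27) and (2.24) to imply that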
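the quantity $\epsilon_*$ as given by (2.23) is $\le \frac{1}{2} - \beta\mathrm{Opt}$, so that by Proposition 2.6 we have
$$\begin{array}{rcl}
\mathrm{Risk}(T_K^{\mathrm{maj}}|H_1, H_2) &\le& \sum_{K \ge k \ge K/2} {K \choose k}(1/2 - \beta\mathrm{Opt})^k(1/2 + \beta\mathrm{Opt})^{K-k}\\
&\le& \sum_{K \ge k \ge K/2} {K \choose k} 2^{-K}(1 - 4\beta^2\mathrm{Opt}^2)^{K/2}\\
&\le& (1 - 4\beta^2\mathrm{Opt}^2)^{K/2} \le \exp\{-2K\beta^2\mathrm{Opt}^2\}.
\end{array}$$
In particular, when $\epsilon \in (0, 1/4)$, we have
$$K \ge \frac{\ln(1/\epsilon)}{2\beta^2\mathrm{Opt}^2} \Rightarrow \mathrm{Risk}(T_K^{\mathrm{maj}}|H_1, H_2) \le \epsilon,$$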
as claimed in (2.28). To prove (2.29), let $T_K$ be a test using a K-repeated stationary observation with risk $\le \epsilon$. Let $x_*^1$, $x_*^2$ form an optimal solution to (2.11), so that $\|x_*^1 - x_*^2\|_2 = 2\mathrm{Opt}$. Consider two simple hypotheses stating that observations $\omega_1, ..., \omega_K$ are i.i.d. drawn from the distribution of $x_*^1 + \xi$ or of $x_*^2 + \xi$, with $\xi \sim q(\cdot)$. Test $T_K$ decides on these hypotheses with risk $\le \epsilon$; consequently, assuming w.l.o.g. that $x_*^1 + x_*^2 = 0$, so that $x_*^1 = e$, $x_*^2 = -e$ with $\|e\|_2 = \mathrm{Opt}$, we have by the two-point lower risk bound (see Proposition 2.2)
$$\int \min\Big[\underbrace{\prod_{k=1}^K q(\xi_k + e)}_{q_+(\xi^K)}, \underbrace{\prod_{k=1}^K q(\xi_k - e)}_{q_-(\xi^K)}\Big]\underbrace{d\xi_1...d\xi_K}_{d\xi^K} \le 2\epsilon.$$
Applying the claim in the hint, we conclude that
$$\begin{array}{rcl}
\left[\int_{\mathbf{R}^n} \sqrt{q(\xi + e)q(\xi - e)}d\xi\right]^K &=& \int \sqrt{q_+(\xi^K)q_-(\xi^K)}d\xi^K\\
&=& \int \sqrt{\min[q_+(\xi^K), q_-(\xi^K)]\max[q_+(\xi^K), q_-(\xi^K)]}d\xi^K \le 2\sqrt{\epsilon(1 - \epsilon)}.
\end{array} \qquad (6.1)$$
Since $\|e\|_2 = \mathrm{Opt} \le D$, (2.26) combines with (6.1) to imply that
$$\exp\{-K\alpha\mathrm{Opt}^2\} \le 2\sqrt{\epsilon(1 - \epsilon)} \le \sqrt{4\epsilon},$$
and (2.29) follows. ✷
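The chain of inequalities bounding the risk of the majority test is easy to sanity-check numerically. The following is a minimal sketch under our own naming: `b` stands for the product $\beta\mathrm{Opt}$ (assumed to lie in (0, 1/2)), and the value b = 0.1 is an arbitrary illustrative choice, not a quantity from the text.

```python
import math

def majority_risk_bound(K, b):
    """Sum over k >= K/2 of C(K, k) (1/2 - b)^k (1/2 + b)^(K - k)."""
    lo, hi = 0.5 - b, 0.5 + b
    k_min = (K + 1) // 2  # smallest integer k with k >= K/2
    return sum(math.comb(K, k) * lo**k * hi**(K - k) for k in range(k_min, K + 1))

def sufficient_K(eps, b):
    """Smallest integer K with K >= ln(1/eps) / (2 b^2), cf. (2.28)."""
    return math.ceil(math.log(1.0 / eps) / (2.0 * b * b))

b, eps = 0.1, 0.01          # illustrative values only
K = sufficient_K(eps, b)
tail = majority_risk_bound(K, b)
middle = (1.0 - 4.0 * b * b) ** (K / 2)
upper = math.exp(-2.0 * K * b * b)
print(K, tail, middle, upper)
```

For these values the printed quantities are ordered as in the proof: the binomial tail is at most $(1 - 4b^2)^{K/2}$, which is at most $\exp\{-2Kb^2\}$, which is at most ǫ.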
2) Justify the illustration in Section 2.2.3.2.C.
Solution: All we need to verify is that in the situation in question, denoting by $q(\cdot)$ the density of $N(0, \frac{1}{2}I_n)$, we have $q \in P_\gamma$. Indeed, taking this fact for granted we ensure the validity of (2.26) with $\alpha = 1$ and $D = \infty$ (check it!). Next, it is immediately seen that the function $\gamma$, see (2.30), satisfies the relation $\gamma(s) \ge \frac{1}{7}$, $0 \le s \le 1$, implying that (2.25) holds true with $D = 1$ and $\beta = \frac{1}{7}$, as claimed. It remains to verify that $q \in P_\gamma$, which reduces to verifying that the marginal univariate density $\bar{\gamma}(s) = \frac{1}{\sqrt{\pi}}\exp\{-s^2\}$ of $q(\cdot)$ satisfies the relation
$$\int_\delta^\infty \bar{\gamma}(s)ds \le \int_\delta^\infty \gamma(s)ds, \quad \delta \ge 0,$$
or, which is the same since both $\gamma$ and $\bar{\gamma}$ are even probability densities on the axis, that
$$\forall(\delta \ge 0): \int_0^\delta (\bar{\gamma}(s) - \gamma(s))ds \ge 0.$$
The latter relation is an immediate consequence of the fact that the ratio $\bar{\gamma}(s)/\gamma(s)$ is a strictly decreasing function of $s \ge 0$ (check it!) combined with $\int_0^\infty (\bar{\gamma}(s) - \gamma(s))ds = 0$. Indeed, since $\bar{\gamma} \not\equiv \gamma$, the function $\Delta(s) = \bar{\gamma}(s) - \gamma(s)$ of $s \ge 0$ is not identically zero; since $\int_0^\infty \Delta(s)ds = 0$, $\Delta$ takes on $\mathbf{R}_+$ both positive and negative values and therefore, being continuous, has zeros on $\mathbf{R}_+$. Since $\bar{\gamma}$ and $\gamma$ are positive and $\bar{\gamma}(s)/\gamma(s)$ is strictly decreasing on $\mathbf{R}_+$, $\Delta(s)$ has exactly one zero $\bar{s}$ on $\mathbf{R}_+$, this zero is positive, and $\Delta$ is $\ge 0$ on $[0, \bar{s}]$ and is $\le 0$ on $[\bar{s}, \infty)$. As a result, when $0 \le \delta \le \bar{s}$, we have $\int_0^\delta \Delta(s)ds \ge 0$, and when $\delta > \bar{s}$, we have $\int_0^\delta \Delta(s)ds \ge \int_0^\infty \Delta(s)ds = 0$. ✷

6.2.3 Hypothesis testing via $\ell_1$ separation
Let d be a positive integer, and the observation space Ω be the finite set {1, ..., d} equipped with the counting reference measure.¹ Probability distributions on Ω can be identified with points p of the d-dimensional probabilistic simplex
$$\Delta_d = \Big\{p \in \mathbf{R}^d : p \ge 0, \sum_i p_i = 1\Big\};$$
the i-th entry pi in p ∈ ∆d is the probability for the random variable distributed according to p to take value i ∈ {1, ..., d}. With this interpretation, p is the probability density taken w.r.t. the counting measure on Ω. Assume B and W are two nonintersecting nonempty closed convex subsets of ∆d ; we interpret B and W as the sets of black and white probability distributions on Ω, and our goal is to find an optimal, in terms of its total risk, test deciding on the hypotheses H1 : p ∈ B, H2 : p ∈ W via a single observation ω ∼ p. Warning: Everywhere in this section, “test” means “simple test.”
Exercise 2.4. Our first goal is to find an optimal, in terms of its total risk, test deciding on the hypotheses $H_1$, $H_2$ via a single observation $\omega \sim p \in B \cup W$. To this end we consider the convex optimization problem
$$\mathrm{Opt} = \min_{p \in B, q \in W}\Big[f(p, q) := \sum_{i=1}^d |p_i - q_i|\Big] \qquad (2.154)$$
and let $(p^*, q^*)$ be an optimal solution to this problem (it clearly exists).
1. Extract from optimality conditions that there exist reals $\rho_i \in [-1, 1]$, $1 \le i \le d$, such that
$$\rho_i = \left\{\begin{array}{rl} 1, & p_i^* > q_i^*\\ -1, & p_i^* < q_i^* \end{array}\right. \qquad (2.155)$$
and
$$\rho^T(p - p^*) \ge 0 \ \forall p \in B \ \ \& \ \ \rho^T(q - q^*) \le 0 \ \forall q \in W. \qquad (2.156)$$

¹ Counting measure is the measure on a discrete (finite or countable) set Ω which assigns every point of Ω with mass 1, so that the measure of a subset of Ω is the cardinality of the subset when it is finite and is +∞ otherwise.
Solution: We are minimizing the convex real-valued function $f(p, q) = \|p - q\|_1$ of $z = [p; q]$ over a nonempty convex compact set, so that there exists a subgradient $g = [f_p'; f_q']$ of this function at the minimizer $[p^*; q^*]$ such that
$$g^T([p; q] - [p^*; q^*]) \ge 0 \ \forall [p; q] \in B \times W,$$
or, equivalently,
$$[f_p']^T(p - p^*) \ge 0 \ \forall p \in B, \quad [f_q']^T(q - q^*) \ge 0 \ \forall q \in W.$$
Taking into account the expression for f, we see that $\rho := f_p' = -f_q'$ is a vector with entries from [−1, 1], and these entries satisfy (2.155); the two inequalities above are then exactly (2.156).

2. Extract from the previous item that the test T which, given an observation $\omega \in \{1, ..., d\}$, accepts $H_1$ with probability $\pi_\omega = (1 + \rho_\omega)/2$ and accepts $H_2$ with complementary probability, has its total risk equal to
$$\sum_{\omega \in \Omega} \min[p_\omega^*, q_\omega^*], \qquad (2.157)$$
and thus is minimax optimal in terms of the total risk.
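Solution: Let $\omega \sim p \in B$. Then the p-probability to accept $H_1$ is
$$\sum_{\omega=1}^d \frac{1}{2}(1 + \rho_\omega)p_\omega = \frac{1}{2}[1 + \rho^Tp] \ge \frac{1}{2}[1 + \rho^Tp^*]$$
(we have used (2.156)), and consequently
$$\mathrm{Risk}_1(T|H_1, H_2) \le 1 - \frac{1}{2}[1 + \rho^Tp^*] = \frac{1}{2}[1 - \rho^Tp^*].$$
Similarly, when $p \in W$, the probability to accept $H_2$ is
$$\sum_{\omega=1}^d \frac{1}{2}(1 - \rho_\omega)p_\omega = \frac{1}{2}[1 - \rho^Tp] \ge \frac{1}{2}[1 - \rho^Tq^*]$$
(we have used (2.156)), and thus
$$\mathrm{Risk}_2(T|H_1, H_2) \le 1 - \frac{1}{2}[1 - \rho^Tq^*] = \frac{1}{2}[1 + \rho^Tq^*].$$
As a result,
$$\mathrm{Risk}_1(T|H_1, H_2) + \mathrm{Risk}_2(T|H_1, H_2) \le 1 - \frac{1}{2}\rho^T[p^* - q^*] = 1 - \frac{1}{2}\sum_{\omega=1}^d |p_\omega^* - q_\omega^*|,$$
where the concluding equality follows from (2.155), and we arrive at
$$\mathrm{Risk}_1(T|H_1, H_2) + \mathrm{Risk}_2(T|H_1, H_2) \le \frac{1}{2}\sum_{\omega \in \Omega}[p_\omega^* + q_\omega^* - |p_\omega^* - q_\omega^*|] = \sum_{\omega \in \Omega} \min[p_\omega^*, q_\omega^*].$$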
Strict inequality in the resulting relation is impossible by Proposition 2.2, so that the total risk of our test indeed is as stated in (2.157).

Comments. Exercise 2.4 describes an efficiently computable and optimal in terms of the worst-case total risk simple test deciding on a pair of “convex” composite hypotheses on the distribution of a discrete random variable. While it seems an attractive result, we believe by itself this result is useless, since typically in the testing problem in question a single observation is by far not enough for a reasonable inference; such an inference requires observing several independent realizations $\omega_1, ..., \omega_K$ of the random variable in question. And the construction presented in Exercise 2.4 says nothing on how to adjust the test to the case of repeated observation. Of course, when $\omega^K = (\omega_1, ..., \omega_K)$ is a K-element i.i.d. sample drawn from a probability distribution p on $\Omega = \{1, ..., d\}$, $\omega^K$ can be thought of as a single observation of a discrete random variable taking values in the set $\Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K}$, the probability distribution $p^K$ of $\omega^K$ being readily given by p. So why not apply the construction from Exercise 2.4 to $\omega^K$ in the role of $\omega$? On close inspection, this idea fails. One of the reasons for this failure is that the cardinality of $\Omega^K$ (which, among other factors, is responsible for the computational complexity of implementing the test in Exercise 2.4) blows up exponentially as K grows. Another, even more serious, complication is that $p^K$ depends on p nonlinearly, so that the family of distributions $p^K$ of $\omega^K$ induced by a convex family of distributions p of $\omega$ (convexity meaning that the p's in question fill a convex subset of the probabilistic simplex) is not convex; and convexity of the sets B, W in the context of Exercise 2.4 is crucial. Thus, passing from a single realization of a discrete random variable to the sample of K > 1 independent realizations of the variable results in severe structural and quantitative complications “killing,” at least at first glance, the approach undertaken in Exercise 2.4.

In spite of the above pessimistic conclusions, the single-observation test from Exercise 2.4 admits a meaningful multi-observation modification, which is the subject of our next exercise.
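The loss of convexity when passing from p to $p^K$ can be seen on a toy example. The sketch below is ours, not the authors'; it takes d = 2, K = 2 and two arbitrarily chosen distributions, and checks that the midpoint of the two product distributions is not itself the distribution of an i.i.d. pair.

```python
from itertools import product

def iid_joint(p, K):
    """Distribution of a K-element i.i.d. sample from p, as a dict over Omega^K."""
    joint = {}
    for outcome in product(range(len(p)), repeat=K):
        prob = 1.0
        for w in outcome:
            prob *= p[w]
        joint[outcome] = prob
    return joint

def is_iid_product(joint, d, K, tol=1e-12):
    """Check whether `joint` equals r^K, where r is the marginal of the first coordinate.

    If joint = r^K for some r, then r must be this marginal, so the check is exact
    up to the tolerance.
    """
    r = [sum(pr for out, pr in joint.items() if out[0] == i) for i in range(d)]
    target = iid_joint(r, K)
    return all(abs(joint[out] - target[out]) <= tol for out in joint)

# Illustrative choice of two distributions on {0, 1}.
p, q = [0.9, 0.1], [0.1, 0.9]
pK, qK = iid_joint(p, 2), iid_joint(q, 2)
mixture = {out: 0.5 * pK[out] + 0.5 * qK[out] for out in pK}
print(is_iid_product(pK, 2, 2), is_iid_product(mixture, 2, 2))
```

The mixture puts mass 0.41 on each of the outcomes (0, 0) and (1, 1), while an i.i.d. pair with the same marginal [0.5; 0.5] would put mass 0.25 on every outcome.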
Exercise 2.5. There is a straightforward way to use the optimal, in terms of its total risk, single-observation test built in Exercise 2.4 in the “multi-observation” environment. Specifically, following the notation from Exercise 2.4, let $\rho \in \mathbf{R}^d$, $p^*$, $q^*$ be the entities built in this exercise, so that $p^* \in B$, $q^* \in W$, all entries in $\rho$ belong to [−1, 1], and
$$\{\rho^Tp \ge \alpha := \rho^Tp^* \ \forall p \in B\} \ \& \ \{\rho^Tq \le \beta := \rho^Tq^* \ \forall q \in W\} \ \& \ \alpha - \beta = \rho^T[p^* - q^*] = \|p^* - q^*\|_1.$$
Given an i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_t \sim p$, where $p \in B \cup W$, we could try to decide on the hypotheses $H_1 : p \in B$, $H_2 : p \in W$ as follows. Let us set $\zeta_t = \rho_{\omega_t}$. For large K, the observable, given $\omega^K$, quantity $\zeta^K := \frac{1}{K}\sum_{t=1}^K \zeta_t$, by the Law of Large Numbers, will be with overwhelming probability close to $E_{\omega \sim p}\{\rho_\omega\} = \rho^Tp$, and the latter quantity is $\ge \alpha$ when $p \in B$ and is $\le \beta < \alpha$ when $p \in W$. Consequently, selecting a “comparison level” $\ell \in (\beta, \alpha)$, we can decide on the hypotheses $p \in B$ vs. $p \in W$ by computing $\zeta^K$, comparing the result to $\ell$, accepting the hypothesis $p \in B$ when $\zeta^K \ge \ell$, and accepting the alternative $p \in W$ otherwise. The goal of this exercise is to quantify the above qualitative considerations. To this end let us fix $\ell \in (\beta, \alpha)$ and K and ask ourselves the following questions:
A. For $p \in B$, how do we upper-bound the probability $\mathrm{Prob}_{p^K}\{\zeta^K \le \ell\}$?
B. For $p \in W$, how do we upper-bound the probability $\mathrm{Prob}_{p^K}\{\zeta^K \ge \ell\}$?
Here $p^K$ is the probability distribution of the i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_t \sim p$. The simplest way to answer these questions is to use Bernstein’s bounding scheme. Specifically, to answer question A, let us select $\gamma \ge 0$ and observe that for every probability distribution p on {1, 2, ..., d} it holds
$$\underbrace{\mathrm{Prob}_{p^K}\{\zeta^K \le \ell\}}_{\pi_{K,-}[p]}\exp\{-\gamma\ell\} \le E_{p^K}\{\exp\{-\gamma\zeta^K\}\} = \Big[\sum_{i=1}^d p_i\exp\Big\{-\frac{1}{K}\gamma\rho_i\Big\}\Big]^K,$$
whence
$$\ln(\pi_{K,-}[p]) \le K\ln\Big(\sum_{i=1}^d p_i\exp\Big\{-\frac{1}{K}\gamma\rho_i\Big\}\Big) + \gamma\ell,$$
implying, via substitution $\gamma = \mu K$, that
$$\forall\mu \ge 0 : \ln(\pi_{K,-}[p]) \le K\psi_-(\mu, p), \quad \psi_-(\mu, p) = \ln\Big(\sum_{i=1}^d p_i\exp\{-\mu\rho_i\}\Big) + \mu\ell.$$
Similarly, setting $\pi_{K,+}[p] = \mathrm{Prob}_{p^K}\{\zeta^K \ge \ell\}$, we get
$$\forall\nu \ge 0 : \ln(\pi_{K,+}[p]) \le K\psi_+(\nu, p), \quad \psi_+(\nu, p) = \ln\Big(\sum_{i=1}^d p_i\exp\{\nu\rho_i\}\Big) - \nu\ell.$$
Now comes the exercise:
1. Extract from the above observations that
$$\mathrm{Risk}(T^{K,\ell}|H_1, H_2) \le \exp\{K\kappa\}, \quad \kappa = \max\Big[\max_{p \in B}\inf_{\mu \ge 0}\psi_-(\mu, p), \ \max_{q \in W}\inf_{\nu \ge 0}\psi_+(\nu, q)\Big],$$
where $T^{K,\ell}$ is the K-observation test which accepts the hypothesis $H_1 : p \in B$ when $\zeta^K \ge \ell$ and accepts the hypothesis $H_2 : p \in W$ otherwise.
2. Verify that $\psi_-(\mu, p)$ is convex in $\mu$ and concave in p, and similarly for $\psi_+(\nu, q)$, so that
$$\max_{p \in B}\inf_{\mu \ge 0}\psi_-(\mu, p) = \inf_{\mu \ge 0}\max_{p \in B}\psi_-(\mu, p), \quad \max_{q \in W}\inf_{\nu \ge 0}\psi_+(\nu, q) = \inf_{\nu \ge 0}\max_{q \in W}\psi_+(\nu, q).$$
Thus, computing $\kappa$ reduces to minimizing on the nonnegative ray the convex functions $\phi_-(\mu) = \max_{p \in B}\psi_-(\mu, p)$ and $\phi_+(\nu) = \max_{q \in W}\psi_+(\nu, q)$.
3. Prove that when $\ell = \frac{1}{2}[\alpha + \beta]$, one has
$$\kappa \le -\frac{1}{12}\Delta^2, \quad \Delta = \alpha - \beta = \|p^* - q^*\|_1.$$
Note that the above test and the quantity $\kappa$ responsible for the upper bound on its risk depend, as on a parameter, on the “acceptance level” $\ell \in (\beta, \alpha)$. The simplest way to select a reasonable value of $\ell$ is to minimize $\kappa$ over an equidistant grid $\Gamma \subset (\beta, \alpha)$, of small cardinality, of values of $\ell$.
Solution: All claims in items 1 and 2 are self-evident. Item 3: One can easily verify that $\exp\{s\} \le 1 + s + \frac{3}{4}s^2$ when $s \in [-1, 1]$. Thus, taking into account that $|\rho_i| \le 1$ for all i, when $\mu \in [0, 1]$, for every probabilistic (i.e., nonnegative with unit sum of entries) vector p it holds
$$\ln\Big(\sum_i p_i\exp\{-\mu\rho_i\}\Big) \le \ln\Big(\sum_i p_i\Big[1 - \mu\rho_i + \frac{3}{4}\mu^2\rho_i^2\Big]\Big) \le \ln\Big(1 - \mu p^T\rho + \frac{3}{4}\mu^2\Big) \le -\mu p^T\rho + \frac{3}{4}\mu^2,$$
implying that $\psi_-(\mu, p) \le -\mu p^T\rho + \frac{3}{4}\mu^2 + \mu\ell$ when $\mu \in [0, 1]$. This, in turn, implies that
$$\max_{p \in B}\inf_{\mu \ge 0}\psi_-(\mu, p) = \inf_{\mu \ge 0}\max_{p \in B}\psi_-(\mu, p) \le \min_{0 \le \mu \le 1}\max_{p \in B}\Big[-\mu p^T\rho + \frac{3}{4}\mu^2 + \mu\ell\Big] \le \min_{0 \le \mu \le 1}\Big[-\mu\alpha + \mu\ell + \frac{3}{4}\mu^2\Big]$$
(recall that $p^T\rho \ge \alpha$ when $p \in B$). The bottom line is that
$$\max_{p \in B}\inf_{\mu \ge 0}\psi_-(\mu, p) \le \min_{0 \le \mu \le 1}\Big[-\mu\alpha + \mu\ell + \frac{3}{4}\mu^2\Big] = \min_{0 \le \mu \le 1}\Big[-\frac{1}{2}\mu\Delta + \frac{3}{4}\mu^2\Big] = -\frac{1}{12}\Delta^2$$
(note that by the origin of $\alpha$, $\beta$ it holds $|\alpha| \le 1$, $|\beta| \le 1$, whence $0 \le \Delta \le 2$). “Symmetric” reasoning shows that
$$\max_{q \in W}\inf_{\nu \ge 0}\psi_+(\nu, q) \le -\frac{1}{12}\Delta^2,$$
and the claim follows.

Now, let us consider an alternative way to pass from a “good” single-observation test to its multi-observation version. Our “building block” now is the minimum risk randomized single-observation test,² and its multi-observation modification is just the majority version of this building block. Our first observation is that building the minimum risk single-observation test reduces to solving a convex optimization problem.
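Returning to item 3 of Exercise 2.5, the bound $\kappa \le -\frac{1}{12}\Delta^2$ can be illustrated numerically in the simplest case of simple hypotheses $B = \{p^*\}$, $W = \{q^*\}$, where (2.156) holds trivially. The sketch below is ours: the function name, the grid search over $\mu$, $\nu$ (which can only overestimate the infima) and the choice $\rho_i = 0$ when $p_i^* = q_i^*$ are all assumptions made for illustration.

```python
import math

def kappa_for_simple_hypotheses(p_star, q_star, grid=2000, mu_max=5.0):
    """Grid upper bound on kappa for B = {p*}, W = {q*}, level l = (alpha + beta) / 2."""
    # rho as in (2.155); entries with p*_i = q*_i are set to 0 (any value in [-1, 1] works).
    rho = [1.0 if pw > qw else (-1.0 if pw < qw else 0.0) for pw, qw in zip(p_star, q_star)]
    alpha = sum(r * pw for r, pw in zip(rho, p_star))
    beta = sum(r * qw for r, qw in zip(rho, q_star))
    level = 0.5 * (alpha + beta)

    def psi_minus(mu):
        return math.log(sum(pw * math.exp(-mu * r) for pw, r in zip(p_star, rho))) + mu * level

    def psi_plus(nu):
        return math.log(sum(qw * math.exp(nu * r) for qw, r in zip(q_star, rho))) - nu * level

    mus = [mu_max * i / grid for i in range(grid + 1)]
    kappa = max(min(psi_minus(m) for m in mus), min(psi_plus(m) for m in mus))
    return kappa, alpha - beta

kappa, delta = kappa_for_simple_hypotheses([0.8, 0.2], [0.9, 0.1])
print(kappa, -delta**2 / 12)
```

For this pair $\Delta = 0.2$, and the computed $\kappa$ lies below $-\Delta^2/12$, so the risk bound $\exp\{K\kappa\}$ is at least as good as $\exp\{-K\Delta^2/12\}$.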
Exercise 2.6. Let, as above, B and W be nonempty nonintersecting closed convex subsets of probabilistic simplex $\Delta_d$. Show that the problem of finding the best, in terms of its risk, randomized single-observation test deciding on $H_1 : p \in B$ vs. $H_2 : p \in W$ via observation $\omega \sim p$ reduces to solving a convex optimization problem. Write down this problem as an explicit LO program when B and W are polyhedral sets given by polyhedral representations:
$$\begin{array}{rcl}
B &=& \{p : \exists u : P_Bp + Q_Bu \le a_B\},\\
W &=& \{p : \exists u : P_Wp + Q_Wu \le a_W\}.
\end{array}$$
Solution: We can parameterize a randomized single-observation test by a vector $\pi \in \mathbf{B}_d = [0, 1]^d$; $\pi_\omega$, $\omega \in \Omega = \{1, ..., d\}$, is the probability for the test to accept $H_1$ given observation $\omega$. Denoting the test associated with parameter vector $\pi$ by $T_\pi$, we clearly have
$$\mathrm{Risk}(T_\pi|H_1, H_2) = \max\Big[f_B(\pi) = \max_{p \in B}[e - \pi]^Tp, \ f_W(\pi) = \max_{q \in W}\pi^Tq\Big],$$
where e is the all-ones vector. Functions $f_B$ and $f_W$ are convex, so that minimizing risk is a convex programming problem. With B, W given by the above polyhedral representations, we have
$$\begin{array}{rcl}
f_B(\pi) &=& \max_{p,u}\left\{[e - \pi]^Tp : P_Bp + Q_Bu \le a_B\right\}\\
&=& \min_{\lambda_B}\left\{a_B^T\lambda_B : \lambda_B \ge 0, P_B^T\lambda_B = e - \pi, Q_B^T\lambda_B = 0\right\} \quad [\hbox{LP Duality}]
\end{array}$$
and similarly
$$f_W(\pi) = \min_{\lambda_W}\left\{a_W^T\lambda_W : \lambda_W \ge 0, P_W^T\lambda_W = \pi, Q_W^T\lambda_W = 0\right\}.$$
Thus, the risk minimization problem can be set as
$$\min_{t,\pi,\lambda_B,\lambda_W}\left\{t : \begin{array}{l}\lambda_B \ge 0, P_B^T\lambda_B = e - \pi, Q_B^T\lambda_B = 0, a_B^T\lambda_B \le t\\ \lambda_W \ge 0, P_W^T\lambda_W = \pi, Q_W^T\lambda_W = 0, a_W^T\lambda_W \le t\\ 0 \le \pi \le e\end{array}\right\}.$$

² This test can differ from the test built in Exercise 2.4: the latter test is optimal in terms of the sum, rather than the maximum, of its partial risks.
We see that the “ideal building block,” the minimum-risk single-observation test, can be built efficiently. What is unclear at this point is whether this block is of any use for majority modifications, that is, whether the risk of this test is < 1/2. This is what we need for the majority version of the minimum-risk single-observation test to be consistent.
Exercise 2.7. Extract from Exercise 2.4 that in the situation of this section, denoting by $\Delta$ the optimal value in the optimization problem (2.154), one has
1. The risk of any single-observation test, deterministic or randomized, is $\ge \frac{1}{2} - \frac{\Delta}{4}$.
2. There exists a single-observation randomized test with risk $\le \frac{1}{2} - \frac{\Delta}{8}$, and thus the risk of the minimum risk single-observation test given by Exercise 2.6 does not exceed $\frac{1}{2} - \frac{\Delta}{8} < 1/2$ as well.
Pay attention to the fact that $\Delta > 0$ (since, by assumption, B and W do not intersect).
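Solution: 1) With $p^*$, $q^*$ defined in Exercise 2.4, the best possible total risk of testing simple hypothesis $\omega \sim p^*$ vs. simple alternative $\omega \sim q^*$ is
$$\sum_\omega \min[p_\omega^*, q_\omega^*] = \frac{1}{2}\sum_\omega [p_\omega^* + q_\omega^* - |p_\omega^* - q_\omega^*|] = 1 - \frac{\Delta}{2}.$$
The best possible risk of testing the same simple hypotheses cannot be less than half of the best total risk, and we arrive at the result announced in the first item of the exercise.
2) By the result stated in Exercise 2.4, there exists a vector $\rho$, $\|\rho\|_\infty \le 1$, such that
$$\begin{array}{l}
\forall(p \in B, q \in W) : \rho^Tp \ge \alpha := \rho^Tp^*, \ \rho^Tq \le \beta := \rho^Tq^*,\\
\alpha - \beta = \rho^Tp^* - \rho^Tq^* = \Delta.
\end{array} \qquad (6.2)$$
Let us set
$$\pi = \Big[\frac{1}{2} - \frac{\alpha + \beta}{8}\Big]e + \frac{1}{4}\rho,$$
where e is the all-ones vector. Note that $\|\rho\|_\infty \le 1$, that is, $-e \le \rho \le e$, whence $|\alpha| \le 1$, $|\beta| \le 1$ and therefore
$$\pi \ge \Big(\frac{1}{2} - \frac{1}{4}\Big)e - \frac{1}{4}e \ge 0, \quad \pi \le \Big(\frac{1}{2} + \frac{1}{4}\Big)e + \frac{1}{4}e = e,$$
so that the randomized single-observation test $T_\pi$ is well-defined. When $p \in B$, we have in view of (6.2)
$$[e - \pi]^Tp = \Big[\frac{1}{2} + \frac{\alpha + \beta}{8}\Big]e^Tp - \frac{1}{4}\rho^Tp \le \frac{1}{2} + \frac{\alpha + \beta}{8} - \frac{1}{4}\alpha = \frac{1}{2} - \frac{\alpha - \beta}{8} = \frac{1}{2} - \frac{\Delta}{8},$$
and when $q \in W$, we have
$$\pi^Tq = \Big[\frac{1}{2} - \frac{\alpha + \beta}{8}\Big]e^Tq + \frac{1}{4}\rho^Tq \le \frac{1}{2} - \frac{\alpha + \beta}{8} + \frac{1}{4}\beta = \frac{1}{2} - \frac{\Delta}{8}.$$
Taken together, the last two inequalities imply that
$$\mathrm{Risk}(T_\pi|H_1, H_2) \le \frac{1}{2} - \frac{\Delta}{8},$$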
as required in the second item of the exercise.

The bottom line is that in the situation of this section, given a target value ǫ of risk and assuming stationary repeated observations are allowed, we have (at least) three options to meet the risk specifications:
1. To start with the optimal, in terms of its total risk, single-observation detector as explained in Exercise 2.4, and then to pass to its multi-observation version built in Exercise 2.5;
2. To use the majority version of the minimum-risk randomized single-observation test built in Exercise 2.6;
3. To use the test based on the minimum risk detector for B, W, as explained in the main body of Chapter 2.
In all cases, we have to specify the number K of observations which guarantees that the risk of the resulting multi-observation test is at most a given target ǫ. A bound on K can be easily obtained by utilizing the results on the risk of a detector-based test in a Discrete o.s. from the main body of Chapter 2 along with risk-related results of Exercises 2.5, 2.6, and 2.7.
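For simple hypotheses ($B = \{p^*\}$, $W = \{q^*\}$) the single-observation constructions of Exercises 2.4 and 2.7 can be checked directly. The following is a minimal sketch with helper names and an example pair of distributions of our own choosing; entries of $\rho$ with $p_i^* = q_i^*$ are set to 0, which is one admissible choice.

```python
def sign_vector(p_star, q_star):
    """rho as in (2.155), with rho_i = 0 where p*_i = q*_i."""
    return [1.0 if pw > qw else (-1.0 if pw < qw else 0.0) for pw, qw in zip(p_star, q_star)]

def partial_risks(pi, p, q):
    """Risks of the test accepting H1 with probability pi[w] given observation w."""
    risk1 = sum((1.0 - a) * pw for a, pw in zip(pi, p))  # H1 rejected under p
    risk2 = sum(a * qw for a, qw in zip(pi, q))          # H1 accepted under q
    return risk1, risk2

p_star, q_star = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]  # illustrative pair
rho = sign_vector(p_star, q_star)
alpha = sum(r * pw for r, pw in zip(rho, p_star))
beta = sum(r * qw for r, qw in zip(rho, q_star))
delta = alpha - beta  # equals the l1 distance between p* and q*

# Exercise 2.4: pi = (1 + rho) / 2 has total risk sum_w min[p*_w, q*_w].
pi_total = [(1.0 + r) / 2.0 for r in rho]
total_risk = sum(partial_risks(pi_total, p_star, q_star))
min_sum = sum(min(pw, qw) for pw, qw in zip(p_star, q_star))

# Exercise 2.7: pi = (1/2 - (alpha + beta)/8) e + rho/4 has risk <= 1/2 - Delta/8.
shift = 0.5 - (alpha + beta) / 8.0
pi_balanced = [shift + r / 4.0 for r in rho]
balanced_risk = max(partial_risks(pi_balanced, p_star, q_star))
print(total_risk, min_sum, balanced_risk, 0.5 - delta / 8.0)
```

In this example the total risk of the first test is 0.7, and the second test has both partial risks equal to 0.425, that is, exactly $\frac{1}{2} - \frac{\Delta}{8}$ with $\Delta = 0.6$.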
Exercise 2.8. Run numerical experimentation to see if one of the three options above always dominates the others (that is, requires smaller sample of observations to ensure the same risk).

Solution: The simplest way to understand what is going on is to consider the case of d = 2, so that $\Delta_d$ is just the segment $[e_1, e_2]$ ($e_1$, $e_2$ are the basic orths in $\mathbf{R}^2$), to take an n-point equidistant grid $\{r_i = e_1 + \lambda_i(e_2 - e_1), 1 \le i \le n\}$ on this segment, with $\lambda_i = \frac{2i - 1}{2n}$, and to look at n(n − 1)/2 pairs of indexes (i, j), $1 \le i < j \le n$. Such a pair (i, j) specifies two nonoverlapping segments $B = B_{ij} := \{\lambda e_1 + (1 - \lambda)e_2 : 0 \le \lambda \le \lambda_i\}$ and $W = W_{ij} := \{\lambda e_1 + (1 - \lambda)e_2 : \lambda_j \le \lambda \le 1\}$. We then compute the sample sizes $K_o = K_o(i, j)$, $K_m = K_m(i, j)$, and $K_d = K_d(i, j)$, as given by the first, the second, and the third of the above options, associated with $B_{ij}$ and $W_{ij}$, and compare these quantities to each other. Here are the results for ǫ = 0.01 and n = 16:
[Figure: histograms of ratios of sample sizes $K_d(\cdot)$, $K_m(\cdot)$, $K_o(\cdot)$; panels $K_d/K_m$, $K_d/K_o$, $K_o/K_m$]

          K_d       K_m       K_o
  K_d      -      0.7391    1.0000
  K_m   0.2609      -       0.3043
  K_o   0.0435    0.6957      -

Fractions of pairs i, j, 1 ≤ i < j ≤ 15 with $K_{\mathrm{row}}(i, j) \le K_{\mathrm{column}}(i, j)$

We see that the third option (the detector-based test) is never outperformed by the first option, and for ≈ 74% of our pairs i, j it does at least as well as the second one. This should not be too surprising: the test based on the minimum risk detector is provably near-optimal!
Let us now focus on a theoretical comparison of the detector-based test and the majority version of the minimum-risk single-observation test (options 3 and 2 above) in the general situation described at the beginning of Section 2.10.3. Given ǫ ∈ (0, 1), the corresponding sample sizes $K_d$ and $K_m$ are completely determined by the relevant “measure of closeness” between B and W. Specifically,
• For $K_d$, the closeness measure is
$$\rho_d(B, W) = 1 - \max_{p \in B, q \in W}\sum_\omega \sqrt{p_\omega q_\omega}; \qquad (2.158)$$
$1 - \rho_d(B, W)$ is the minimal risk of a detector for B, W, and for $\rho_d(B, W)$ and ǫ small, we have $K_d \approx \ln(1/\epsilon)/\rho_d(B, W)$ (why?).
• Given ǫ, $K_m$ is fully specified by the minimal risk $\rho$ of simple randomized single-observation test T deciding on the hypotheses associated with B, W. By Exercise 2.7, we have $\rho = \frac{1}{2} - \delta$, where $\delta$ is within absolute constant factor of the optimal value $\Delta = \min_{p \in B, q \in W}\|p - q\|_1$ of (2.154). The risk bound for the K-observation majority version of T is the probability to get at least K/2 heads in K independent tosses of a coin with probability to get heads in a single toss equal to $\rho = 1/2 - \delta$. When $\rho$ is not close to 0 and ǫ is small, the (1 − ǫ)-quantile of the number of heads in our K coin tosses is $K\rho + O(1)\sqrt{K\ln(1/\epsilon)} = K/2 - \delta K + O(1)\sqrt{K\ln(1/\epsilon)}$ (why?). $K_m$ is the smallest K for which this quantile is < K/2, so that $K_m$ is of the order of $\ln(1/\epsilon)/\delta^2$, or, which is the same, of the order of $\ln(1/\epsilon)/\Delta^2$. We see that the “responsible for $K_m$” closeness between B and W is
$$\rho_m(B, W) = \Delta^2 = \Big[\min_{p \in B, q \in W}\|p - q\|_1\Big]^2,$$
and $K_m$ is of the order of $\ln(1/\epsilon)/\rho_m(B, W)$.
The goal of the next exercise is to compare $\rho_d$ and $\rho_m$.
Exercise 2.9. Prove that in the situation of this section one has
$$\frac{1}{8}\rho_m(B, W) \le \rho_d(B, W) \le \frac{1}{2}\sqrt{\rho_m(B, W)}. \qquad (2.159)$$
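Solution: Let $p \in B$, $q \in W$. Then
$$\sum_\omega \sqrt{p_\omega q_\omega} = 1 - \frac{1}{2}\sum_\omega (\sqrt{p_\omega} - \sqrt{q_\omega})^2 \ge 1 - \frac{1}{2}\sum_\omega |p_\omega - q_\omega|$$
(note that for nonnegative a, b we always have $|\sqrt{a} - \sqrt{b}| \le \sqrt{|a - b|}$), whence
$$1 - \sum_\omega \sqrt{p_\omega q_\omega} \le \frac{1}{2}\sum_\omega |p_\omega - q_\omega|$$
and therefore
$$\rho_d(B, W) = \min_{p \in B, q \in W}\Big[1 - \sum_\omega \sqrt{p_\omega q_\omega}\Big] \le \frac{1}{2}\min_{p \in B, q \in W}\|p - q\|_1 = \frac{1}{2}\Delta = \frac{1}{2}\sqrt{\rho_m(B, W)}.$$
On the other hand, for $p \in B$, $q \in W$ we have
$$\begin{array}{rcl}
\sum_\omega |p_\omega - q_\omega| &=& \sum_\omega |\sqrt{p_\omega} - \sqrt{q_\omega}|(\sqrt{p_\omega} + \sqrt{q_\omega})\\
&\le& \left(\sum_\omega (\sqrt{p_\omega} - \sqrt{q_\omega})^2\right)^{1/2}\left(\sum_\omega (\sqrt{p_\omega} + \sqrt{q_\omega})^2\right)^{1/2}\\
&\le& \left(\sum_\omega (\sqrt{p_\omega} - \sqrt{q_\omega})^2\right)^{1/2}\left(\sum_\omega [2p_\omega + 2q_\omega]\right)^{1/2}\\
&\le& 2\left(\sum_\omega (\sqrt{p_\omega} - \sqrt{q_\omega})^2\right)^{1/2} = 2\sqrt{2}\left(1 - \sum_\omega \sqrt{p_\omega q_\omega}\right)^{1/2},
\end{array}$$
whence
$$\forall(p \in B, q \in W) : 1 - \sum_\omega \sqrt{p_\omega q_\omega} \ge \frac{1}{8}\|p - q\|_1^2.$$
Taking infima of both sides of the resulting inequality over $p \in B$, $q \in W$, we get $\rho_d(B, W) \ge \frac{1}{8}\rho_m(B, W)$. ✷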
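The two pointwise inequalities behind (2.159), namely $\frac{1}{8}\|p - q\|_1^2 \le 1 - \sum_\omega \sqrt{p_\omega q_\omega} \le \frac{1}{2}\|p - q\|_1$, hold for every pair of probability vectors, which is easy to confirm on random instances. The sketch below uses helper names, a dimension and a sample count of our own choosing; it is a sanity check, not a proof.

```python
import math
import random

def hellinger_gap(p, q):
    """1 minus the Hellinger affinity of p and q."""
    return 1.0 - sum(math.sqrt(a * b) for a, b in zip(p, q))

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def random_distribution(d, rng):
    """A random point of the probabilistic simplex (normalized uniform weights)."""
    weights = [rng.random() for _ in range(d)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed for reproducibility
violations = 0
for _ in range(1000):
    p, q = random_distribution(5, rng), random_distribution(5, rng)
    gap, dist = hellinger_gap(p, q), l1_distance(p, q)
    if not (dist**2 / 8.0 - 1e-12 <= gap <= dist / 2.0 + 1e-12):
        violations += 1
print(violations)
```

Since both inequalities are proved above, no violations should be reported; the small tolerances only guard against rounding.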
Relation (2.159) suggests that while Kd never is “much larger” than Km (this we know in advance: in repeated versions of Discrete o.s., a properly built detector-based test provably is nearly optimal), Km might be much larger than Kd . This indeed is the case:
Exercise 2.10. Given $\delta \in (0, 1/2)$, let $B = \{[\delta; 0; 1 - \delta]\}$ and $W = \{[0; \delta; 1 - \delta]\}$. Verify that in this case, the numbers of observations $K_d$ and $K_m$ resulting in a given risk ǫ ≪ 1 of multi-observation tests, as functions of $\delta$ are proportional to $1/\delta$ and $1/\delta^2$, respectively. Compare the numbers when ǫ = 0.01 and $\delta \in \{0.01; 0.05; 0.1\}$.

Solution: We clearly have $\rho_d(B, W) = \delta$, see (2.158), so that the minimal risk of a detector is $1 - \delta$, implying that $K_d = \mathrm{Ceil}(\ln(1/\epsilon)/\ln(1/(1 - \delta))) \approx \ln(1/\epsilon)/\delta$. At the same time, the $\ell_1$-distance $\Delta$ between B and W is $2\delta$, whence $K_m = O(1)\ln(1/\epsilon)/\delta^2$. Here are the numbers for ǫ = 0.01:

    δ      K_d      K_m
  0.10      44      557
  0.05      90     2201
  0.01     459    54315
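The $K_d$ column can be recomputed in a few lines. This is a minimal sketch assuming the per-observation detector risk is the Hellinger affinity $1 - \rho_d(B, W) = 1 - \delta$; the helper `majority_risk_bound` (our name) evaluates the binomial tail which governs $K_m$, and the single-observation risk $(1 - \delta)/2$ passed to it is our own computation for this pair of distributions, not a value stated in the text.

```python
import math

def detector_sample_size(eps, delta):
    """Smallest K with (1 - delta)^K <= eps, where 1 - delta is the Hellinger affinity."""
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / (1.0 - delta)))

def majority_risk_bound(K, rho):
    """Probability of at least K/2 heads in K tosses with head probability rho.

    Terms are accumulated via log-gamma to avoid overflow of binomial coefficients.
    """
    k_min = (K + 1) // 2  # smallest integer k with k >= K/2
    total = 0.0
    for k in range(k_min, K + 1):
        log_term = (math.lgamma(K + 1) - math.lgamma(k + 1) - math.lgamma(K - k + 1)
                    + k * math.log(rho) + (K - k) * math.log(1.0 - rho))
        total += math.exp(log_term)
    return total

for delta in (0.10, 0.05, 0.01):
    print(delta, detector_sample_size(0.01, delta))

# Majority-test risk bound at the tabulated K_m for delta = 0.1 (value not asserted here).
print(majority_risk_bound(557, (1.0 - 0.10) / 2.0))
```

The first loop reproduces the values 44, 90, 459 of the $K_d$ column.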
6.2.4 Miscellaneous exercises
Exercise 2.11. Prove that the conclusion in Proposition 2.18 remains true when the test T in the premise of the proposition is randomized.
Solution: Assume that simple randomized test T given observation $\omega$ accepts $H_1$ with probability $\pi(\omega) \in [0, 1]$. Let us associate with a probability distribution P on $\Omega$ the probability distribution $P^+$ on $\Omega^+ := \Omega \times [0, 1]$, where $P^+$ is the distribution of independent pair $(\omega, \theta)$ with $\omega \sim P$ and $\theta$ uniformly distributed on [0, 1]. Next, let us augment observation $\omega \in \Omega$ sampled from a distribution P on $\Omega$ with $\theta$ sampled, independently of $\omega$, from the uniform distribution on [0, 1]. With this association, families of distributions $P_\chi$, $\chi = 1, 2$, become associated with families $P_\chi^+$ of probability distributions on $\Omega^+$, and hypotheses $H_\chi$, $\chi = 1, 2$, on the distribution P underlying observation $\omega$ give rise to hypotheses $H_\chi^+$ on the distribution $P^+$ underlying the augmented observation $\omega^+ = (\omega, \theta)$. The randomized test T deciding on $H_1$, $H_2$ via observation $\omega$ gives rise to a deterministic simple test $T_+$ deciding on $H_1^+$, $H_2^+$ via observation $\omega^+ = (\omega, \theta)$ as follows: given $(\omega, \theta)$, $T_+$ accepts $H_1^+$ when $\theta \le \pi(\omega)$, and accepts $H_2^+$ otherwise. It is immediately seen that the risks
$$\mathrm{Risk}_1(T|H_1, H_2) = \sup_{P \in P_1}\int_\Omega (1 - \pi(\omega))P(d\omega), \quad \mathrm{Risk}_2(T|H_1, H_2) = \sup_{P \in P_2}\int_\Omega \pi(\omega)P(d\omega)$$
of test T deciding on $H_1$, $H_2$ via observation $\omega$ are exactly the same as the risks $\mathrm{Risk}_1(T_+|H_1^+, H_2^+)$, $\mathrm{Risk}_2(T_+|H_1^+, H_2^+)$ of test $T_+$ deciding on hypotheses $H_1^+$, $H_2^+$ via observation $\omega^+$. Under the premise of Proposition 2.18, these risks are $\le \epsilon \le 1/2$, and since $T_+$ is a deterministic test, the already proved version of the proposition says that there exists a detector $\phi_+(\omega^+) = \phi_+(\omega, \theta)$ such that
$$\begin{array}{l}
\forall(P \in P_1) : \int_\Omega \left[\int_0^1 \exp\{-\phi_+(\omega, \theta)\}d\theta\right]P(d\omega) \le \epsilon_+ := 2\sqrt{\epsilon(1 - \epsilon)},\\
\forall(P \in P_2) : \int_\Omega \left[\int_0^1 \exp\{\phi_+(\omega, \theta)\}d\theta\right]P(d\omega) \le \epsilon_+.
\end{array} \qquad (6.3)$$
After setting $\phi(\omega) = \int_0^1 \phi_+(\omega, \theta)d\theta$ and invoking Jensen’s inequality, (6.3) implies that
$$\forall(P \in P_1) : \int_\Omega \exp\{-\phi(\omega)\}P(d\omega) \le \epsilon_+, \quad \forall(P \in P_2) : \int_\Omega \exp\{\phi(\omega)\}P(d\omega) \le \epsilon_+,$$
and $\phi$ is the detector yielding simple test $T_\phi$ with risks $\le \epsilon_+$; see Proposition 2.14. ✷

Exercise 2.12. Let $p_1(\omega)$, $p_2(\omega)$ be two positive probability densities, taken w.r.t. a reference measure $\Pi$ on an observation space $\Omega$, and let $P_\chi = \{p_\chi\}$, $\chi = 1, 2$. Find the optimal, in terms of its risk, balanced detector for $P_\chi$, $\chi = 1, 2$.
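Solution: An optimal detector is $\phi_*(\omega) = \frac{1}{2}\ln(p_1(\omega)/p_2(\omega))$, and the corresponding risk is the Hellinger affinity $\epsilon_\star := \int_\Omega \sqrt{p_1(\omega)p_2(\omega)}\Pi(d\omega)$ of the distributions. Indeed, the fact that the risk of the detector just defined is $\epsilon_\star$, is evident. On the other hand, if $\phi$ is a balanced detector with risk ǫ, we have
$$2\epsilon \ge \int_\Omega \underbrace{\left(e^{-\phi(\omega)}p_1(\omega) + e^{\phi(\omega)}p_2(\omega)\right)}_{\ge 2\sqrt{p_1(\omega)p_2(\omega)}}\Pi(d\omega) \ge 2\epsilon_\star \Rightarrow \epsilon \ge \epsilon_\star.$$
✷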
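For a discrete observation space with the counting reference measure, the claim of Exercise 2.12 can be verified directly. The sketch below is ours (function name and the two distributions are illustrative assumptions): it computes both risks of $\phi_* = \frac{1}{2}\ln(p_1/p_2)$ and compares them with the Hellinger affinity.

```python
import math

def detector_risks(phi, p1, p2):
    """Risks of a detector phi: sum of exp(-phi) p1 and sum of exp(phi) p2."""
    risk1 = sum(math.exp(-f) * a for f, a in zip(phi, p1))
    risk2 = sum(math.exp(f) * b for f, b in zip(phi, p2))
    return risk1, risk2

p1, p2 = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]  # illustrative positive densities
phi_star = [0.5 * math.log(a / b) for a, b in zip(p1, p2)]
affinity = sum(math.sqrt(a * b) for a, b in zip(p1, p2))
risk1, risk2 = detector_risks(phi_star, p1, p2)
print(risk1, risk2, affinity)

# Shifting the detector by a constant c unbalances it and can only increase the larger risk.
for c in (-0.5, -0.1, 0.1, 0.5):
    shifted = [f + c for f in phi_star]
    print(c, max(detector_risks(shifted, p1, p2)))
```

Both risks of $\phi_*$ coincide with the affinity, and each shifted detector has its larger risk above the affinity, in line with the lower bound $\epsilon \ge \epsilon_\star$.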
Exercise 2.13. Recall that the exponential distribution with parameter $\mu > 0$ on $\Omega = \mathbf{R}_+$ is the distribution with the density $p_\mu(\omega) = \mu e^{-\mu\omega}$, $\omega \ge 0$. Given positive reals $\alpha < \beta$, consider two families of exponential distributions, $P_1 = \{p_\mu : 0 < \mu \le \alpha\}$ and $P_2 = \{p_\mu : \mu \ge \beta\}$. Build the optimal, in terms of its risk, balanced detector for $P_1$, $P_2$. What happens with the risk of the detector you have built when the families $P_\chi$, $\chi = 1, 2$, are replaced with their convex hulls?
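Solution: In light of Exercise 2.12, the risk of a balanced detector cannot be less than the Hellinger affinity of the probability densities $p(\omega) = \alpha e^{-\alpha\omega}$, $q(\omega) = \beta e^{-\beta\omega}$, that is, less than
$$\epsilon_\star := \frac{2\sqrt{\alpha\beta}}{\alpha + \beta}.$$
An educated guess is that the detector $\phi_*(\omega) = \frac{1}{2}\ln(p(\omega)/q(\omega))$ for $P_1$, $P_2$ is balanced and has risk $\epsilon_\star$. All we need to verify in order to justify the guess is that
$$\begin{array}{ll}
\int_\Omega \sqrt{q(\omega)/p(\omega)}\,p_\mu(\omega)d\omega \le \epsilon_\star, & 0 < \mu \le \alpha,\\
\int_\Omega \sqrt{p(\omega)/q(\omega)}\,p_\mu(\omega)d\omega \le \epsilon_\star, & \mu \ge \beta,
\end{array}$$
or, equivalently, that
$$\begin{array}{ll}
\frac{2\sqrt{\beta/\alpha}\,\mu}{\beta - \alpha + 2\mu} \le \frac{2\sqrt{\alpha\beta}}{\alpha + \beta}, & 0 < \mu \le \alpha,\\
\frac{2\sqrt{\alpha/\beta}\,\mu}{\alpha - \beta + 2\mu} \le \frac{2\sqrt{\alpha\beta}}{\alpha + \beta}, & \mu \ge \beta
\end{array}$$
[...]

[...] 0 of your test (i.e., if the lightbulb you are testing does not “die” on time horizon δ, you terminate the test)
3. ω = χζ [...] 0 is the allowed test duration (i.e., you observe whether or not a lightbulb “dies” on time horizon δ, but do not register the lifespan when it is < δ).
Consider the values 0.25, 0.5, 1, 2, 4 of δ.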
Solution: In item 1, we deal with testing hypotheses via stationary K-repeated observations in the situation where the hypotheses admit a balanced detector with risk
$$\epsilon_\star = \frac{2\sqrt{\alpha\beta}}{\alpha + \beta} = \frac{2\sqrt{1 \cdot 1.5}}{1 + 1.5} \approx 0.9798.$$
This detector, for every K, induces a test $T^K$ deciding on our hypotheses via K-repeated observation with risk $\epsilon_\star^K$. The smallest K for which the latter quantity is ≤ 0.01 (we want 0.99-reliable inference) is K = 226.
In item 2, the observation space is the segment $\Delta = [0, \delta]$, and a natural reference measure is the Lebesgue measure on $[0, \delta)$ augmented by unit mass assigned to the point $\delta$, to account for the case that the lightbulb you are testing did not “die” on the testing time horizon (that is, your observation $\min[\delta, \omega]$ is $\delta$). The density of observation w.r.t. the reference measure just defined is
$$\hat{p}_\mu(\omega) = \left\{\begin{array}{ll} \mu e^{-\mu\omega}, & 0 \le \omega < \delta\\ e^{-\mu\delta}, & \omega = \delta \end{array}\right.$$

³ In Reliability, probability distribution of the lifespan ζ of an organism or a technical device is characterized by the failure rate $\lambda(t) = \lim_{\delta \to +0}\frac{\mathrm{Prob}\{t \le \zeta \le t + \delta\}}{\delta \cdot \mathrm{Prob}\{\zeta \ge t\}}$ (so that for small δ, λ(t)δ is the conditional probability to “die” in the time interval [t, t + δ] provided the organism or device is still “alive” at time t). The exponential distribution corresponds to the case of failure rate independent of t; in applications, this indeed is often the case except for “very small” and “very large” values of t.
The hypotheses to be decided upon state that the distribution of observation bebχ , χ = 1, 2, with longs to P b1 = {b b2 = {b P pµ : 0 < µ ≤ α = 1} , P pµ : µ ≥ β = 1.5} .
There is numerical evidence that specifying φ∗ (ω) as the optimal detector for the pair of “extreme” distributions from our families, that is, φ∗ (ω) =
1 2
ln(b pα (ω)/b pβ (ω)),
b1 , P b2 , equal to the risk of the detector we get a detector with risk on the families P on the pair of distributions pbα , pbβ : R R ∀µ ∈ (0, α) : ∆ e−φ∗ (ω) pbµ (ω)Π(dω) ≤ ǫ⋆ :=R ∆pe−φ∗ (ω) pbα (ω)Π(dω) pβ (ω)Π(dω), = ∆ pbα (ω)b R ∀µ ≥ β : ∆ eφ∗ (ω) pbµ (ω)Π(dω) ≤ ǫ⋆ . A reasoning similar to that used in the basic setting yields K =⌋ ln(1/0.01)/ ln(1/ǫ⋆ )⌊. In item 3, our observation takes value 0 with probability 1 − e−µδ and value 1 with complementary probability, and our testing problem becomes to decide, via stationary K-repeated observations in Discrete o.s. with Ω = {0, 1}, on two families of distributions b1 = {[p; 1 − p] : 1 > p ≥ u := e−αδ }, P b2 = {[p; 1 − p] : 0 < p ≤ v := e−βδ }. P
The risk of the optimal balanced single-observation test is
\[ \epsilon_\star = \sqrt{(1-u)(1-v)} + \sqrt{uv}, \]
and the number of observations allowing for 0.99-reliable inference, as yielded by our theory, is $\rfloor \ln(1/0.01)/\ln(1/\epsilon_\star) \lfloor$. Here are the values of K for items 2 and 3:
[Table: values of K for the observations min[ζ, δ] (item 2) and χ_ζ (item 3); the entries are not recoverable from the source.]
1.” In this test, you can select a number K of lightbulbs from the lot, switch them on at time 0 and record the actual lifetimes of the lightbulbs you are testing. As a result at the end of (any) observation
interval $\Delta = [0, \delta]$, you observe K independent realizations of r.v. $\min[\zeta, \delta]$, where $\zeta \sim p_\mu(\cdot)$ with some unknown $\mu$. In your sequential test, you are welcome to make conclusions at the endpoints $\delta_1 < \delta_2 < ... < \delta_S$ of several observation intervals. Note: We deliberately skip details of the problem's setting; how you decide on these missing details is part of your solution to the exercise.
Solution: The construction we are about to describe has several design parameters, specifically
• the number K of lightbulbs we test,
• the number S and S times $0 < \delta_1 < \delta_2 < ... < \delta_S$ at which we observe what happened so far with the lightbulbs we are testing to decide whether the null hypothesis should be rejected.
We assume we are given false alarm probability $\epsilon \in (0, 1)$; the probability that our inference rejects the null hypothesis when it is true (i.e., when the actual parameter $\mu$ of the lot we are testing is ≤ 1) must be at most $\epsilon$. Since we decide on the hypotheses S times, at time instants $\delta_1, ..., \delta_S$, the simplest way to control the false alarm probability is to split the corresponding bound $\epsilon$ into S parts, $\epsilon_1, ..., \epsilon_S$, $\epsilon_s > 0$, $s \le S$, with
\[ \sum_{s=1}^S \epsilon_s = \epsilon, \]
and to ensure that when deciding on the hypotheses at time instant $\delta = \delta_s$, the probability of false alarm is $\le \epsilon_s$, and this is so for all $s \le S$. Now, given $s \le S$, we can find the smallest $\beta = \beta_s > 1$ such that the two hypotheses on the distribution of the lifetime of a lightbulb from the lot, one, $H^s_{\le 1}$, stating that this lifetime is $\sim p_\mu(\cdot)$ with $\mu \le 1$, and another, $H^s_{\ge\beta}$, stating that the lifetime is $\sim p_\mu(\cdot)$ with $\mu \ge \beta$, admit a detector with the risk $\rho := \sqrt{\epsilon_s\epsilon}$, the observations being what we see at time $\delta_s$, that is, the collection $\omega^s = \{\omega^s_k = \min[\zeta_k, \delta_s] : 1 \le k \le K\}$, where $\zeta_1, ..., \zeta_K$ are drawn independently of each other “by nature” from the “true” lifetime distribution $p_\mu(\cdot)$. Invoking the result of Exercise 2.14, $\beta = \beta_s$ is the minimal solution of the equation
\[ \left[ \frac{2\sqrt{\beta}}{1+\beta}\left(1 - e^{-\frac{1+\beta}{2}\delta_s}\right) + e^{-\frac{1+\beta}{2}\delta_s} \right]^K = \sqrt{\epsilon\epsilon_s} \tag{6.4} \]
on the ray $\beta > 1$. The quantity $\beta_s$ induces the single-observation balanced detector applicable at time instant $\delta_s$, namely,
\[ \phi_s(\omega) = \begin{cases} \tfrac12[\ln(1/\beta_s) + (\beta_s - 1)\omega], & \omega < \delta_s\\ \tfrac12(\beta_s - 1)\delta_s, & \omega = \delta_s \end{cases} \ : [0, \delta_s] \to \mathbf{R} \]
and its K-repeated version
\[ \phi_s^{(K)}(\omega^s) = \sum_{k=1}^K \phi_s(\omega^s_k). \]
By construction, the risk of the detector $\phi_s^{(K)}$ on the pair of hypotheses $H^s_{\le 1}$, $H^s_{\ge\beta_s}$, the observation being $\omega^s = \{\omega^s_k : 1 \le k \le K\}$, is equal to $\sqrt{\epsilon\epsilon_s}$. As a result, the test $T_s$ which, given observation $\omega^s$,
- accepts $H^s_{\le 1}$ and rejects $H^s_{\ge\beta_s}$ when $\phi_s^{(K)}(\omega^s) + \tfrac12\ln(\epsilon/\epsilon_s) \ge 0$,
- accepts $H^s_{\ge\beta_s}$ and rejects $H^s_{\le 1}$ when $\phi_s^{(K)}(\omega^s) + \tfrac12\ln(\epsilon/\epsilon_s) < 0$
possesses the following properties (see Section 2.3.2.2):
• when $H^s_{\le 1}$ is true, the probability for $T_s$ to reject the hypothesis is at most $\epsilon_s$;
• when $H^s_{\ge\beta_s}$ is true, the probability for $T_s$ to reject the hypothesis is at most $\epsilon$.
Now consider sequential test $T$ where the inferences are made at time instants $\delta_1, ..., \delta_S$ as follows. At time $\delta_s$, we apply $T_s$ to observations $\omega^s$ (this is what we see at time instant $\delta_s$). If the result is acceptance of $H^s_{\le 1}$, we say that at time $\delta_s$ the observations support the null hypothesis “the true lifetime parameter $\mu$ for our lot is ≤ 1” and pass to time $\delta_{s+1}$ (when $s < S$) or terminate (when $s = S$). In the opposite case, that is, when $T_s$ accepts $H^s_{\ge\beta_s}$, we claim that the alternative to the null hypothesis, that is, the hypothesis “the true lifetime parameter $\mu$ for our lot is > 1” takes place, and terminate further testing of our K lightbulbs. By construction, it is immediately seen that
• for $T$, the probability of false alarm (probability to reject the null hypothesis in one of our S decision making acts) is at most $\epsilon$;
• on the other hand, for every s, if the true lifetime parameter $\mu$ is “large,” namely, $\mu \ge \beta_s$, then the probability for $T$ to terminate at the s-th decision making act or earlier with the conclusion that the alternative to the null hypothesis takes place is at least $1 - \epsilon$.
Here are numerical illustrations for several setups; we used $\epsilon = 0.01$ and $\epsilon_s = \epsilon/S$, $1 \le s \le S = 4$.

  s    δ_s      β_s (K = 1000)   β_s (K = 500)   β_s (K = 300)   β_s (K = 100)
  1    0.100    1.786            2.183           2.620           2.620
  2    0.200    1.549            1.817           2.106           4.345
  3    0.400    1.398            1.587           1.790           3.224
  4    0.800    1.305            1.448           1.601           2.183
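Equation (6.4) has no closed-form solution in β, but its left-hand side decreases in β on the ray β > 1, so plain bisection suffices. The following is a minimal sketch under that observation; the function names and the tolerance are choices made here, not part of the original solution.

```python
import math

def single_observation_risk(beta, delta):
    """Bracketed expression in (6.4): risk of the one-observation detector (alpha = 1)."""
    tail = math.exp(-0.5 * (1.0 + beta) * delta)
    return 2.0 * math.sqrt(beta) / (1.0 + beta) * (1.0 - tail) + tail

def solve_beta(K, delta, eps, eps_s, tol=1e-10):
    """Bisection for the root beta > 1 of single_observation_risk(beta, delta)**K = sqrt(eps*eps_s)."""
    log_target = 0.5 * math.log(eps * eps_s) / K   # log of the required per-observation risk

    def gap(beta):
        return math.log(single_observation_risk(beta, delta)) - log_target

    lo, hi = 1.0, 2.0
    while gap(hi) > 0.0:          # the risk equals 1 at beta = 1 and decreases for beta > 1
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eps, S = 0.01, 4
for K in (1000, 500, 300):
    row = [round(solve_beta(K, d, eps, eps / S), 3) for d in (0.1, 0.2, 0.4, 0.8)]
    print(K, row)
```

The same routine can be called with other splits $\epsilon_s$ of the false alarm budget or other inspection times $\delta_s$.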
Exercise 2.17. Work out the following extension of the Opinion Poll Design problem. You are given two finite sets, $\Omega_1 = \{1, ..., I\}$ and $\Omega_2 = \{1, ..., M\}$, along with L nonempty closed convex subsets $Y_\ell$ of the set
\[ \Delta_{IM} = \Big\{ [y_{im} > 0]_{i,m} : \sum_{i=1}^I \sum_{m=1}^M y_{im} = 1 \Big\} \]
of all nonvanishing probability distributions on $\Omega = \Omega_1 \times \Omega_2 = \{(i, m) : 1 \le i \le I, 1 \le m \le M\}$. Sets $Y_\ell$ are such that all distributions from $Y_\ell$ have a common marginal distribution $\theta^\ell > 0$ of i:
\[ \sum_{m=1}^M y_{im} = \theta^\ell_i, \ 1 \le i \le I, \ \forall y \in Y_\ell, \ 1 \le \ell \le L. \]
Your observations $\omega_1, \omega_2, ...$ are sampled, independently of each other, from a distribution partly selected “by nature,” and partly by you. Specifically, nature selects $\ell \le L$ and a distribution $y \in Y_\ell$, and you select a positive I-dimensional probabilistic vector q from a given convex compact subset Q of the positive part of the I-dimensional probabilistic simplex. Let $y_{|i}$ be the conditional distribution of $m \in \Omega_2$, i being given, induced by y, so that $y_{|i}$ is the M-dimensional probabilistic vector with entries
\[ [y_{|i}]_m = \frac{y_{im}}{\sum_{\mu \le M} y_{i\mu}} = \frac{y_{im}}{\theta^\ell_i}. \]
In order to generate $\omega_t = (i_t, m_t) \in \Omega$, you draw $i_t$ at random from the distribution q, and then nature draws $m_t$ at random from the distribution $y_{|i_t}$. Given closeness relation $\mathcal C$, your goal is to decide, up to closeness $\mathcal C$, on the hypotheses $H_1, ..., H_L$, with $H_\ell$ stating that the distribution y selected by nature belongs to $Y_\ell$. Given “observation budget” (a number K of observations $\omega_k$ you can use), you want to find a probabilistic vector q which results in the test with as small a $\mathcal C$-risk as possible. Pose this Measurement Design problem as an efficiently solvable convex optimization problem.
Solution: Observation $\omega$ induced by your selection of q and the choice of y by nature takes value $(i, m) \in \Omega$ with probability $q_i y_{im}/\theta^\ell_i$. When y runs through $Y_\ell$, the “signal” $x(y) \in \mathbf{R}^{I\times M}$: $x_{im}(y) = y_{im}/\theta^\ell_i$ runs through the convex compact subset
\[ X_\ell = \{x \in \mathbf{R}^{I\times M} : \exists y \in Y_\ell : x_{im} = y_{im}/\theta^\ell_i, \ 1 \le i \le I, \ 1 \le m \le M\} \]
of the convex compact set
\[ X = \Big\{x \in \mathbf{R}^{I\times M}_+ : \sum_{m=1}^M x_{im} = 1 \ \forall i\Big\}, \]
and you lose nothing by assuming that $H_\ell$ states that nature selects $\ell$ and signal $x \in X_\ell$, you select $q \in Q$, and your observations $\omega_t$ are drawn at random and independently of each other from the probability distribution $\mu = A_q x$: $\mu_{im} = [A_q x]_{im} \equiv q_i x_{im}$ on $\Omega$; and of course $A_q x$ is a probability distribution on $\Omega$ whenever $x \in X$ and q is a probabilistic vector. Consequently, you are in the Simple case (2.102) of Discrete o.s., and (2.101) is convex and tractable.
Exercise 2.18.
[probabilities of deviations from the mean] The goal of what follows is to present the most straightforward application of simple families of distributions: bounds on probabilities of deviations of random vectors from their means. Let $\mathcal H \subset \Omega = \mathbf{R}^d$, $\mathcal M$, $\Phi$ be regular data such that $0 \in \mathrm{int}\,\mathcal H$, $\mathcal M$ is compact, $\Phi(0; \mu) = 0$ $\forall \mu \in \mathcal M$, and $\Phi(h; \mu)$ is differentiable at $h = 0$ for every $\mu \in \mathcal M$. Let, further, $\bar P \in \mathcal S[\mathcal H, \mathcal M, \Phi]$ and let $\bar\mu \in \mathcal M$ be a parameter of $\bar P$. Prove that
1. $\bar P$ possesses expectation $e[\bar P]$, and $e[\bar P] = \nabla_h\Phi(0; \bar\mu)$.
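Solution: Denoting by $\omega_i$ the coordinates of $\omega$, and taking into account that $0 \in \mathrm{int}\,\mathcal H$, we see that $\int_{\mathbf{R}^d} [e^{t\omega_i} + e^{-t\omega_i}]\,\bar P(d\omega) < \infty$ for all i and all small positive t, whence $e[\bar P]$ exists. Now, for every $h \in \mathbf{R}^d$ and all small positive t we have
\[ \int_{\mathbf{R}^d} [1 \pm t h^T\omega]\,\bar P(d\omega) \le \int_{\mathbf{R}^d} e^{\pm t h^T\omega}\,\bar P(d\omega) \le \exp\{\Phi(\pm th; \bar\mu)\}, \]
so that
\[ \pm\int_{\mathbf{R}^d} h^T\omega\,\bar P(d\omega) \le t^{-1}\left[e^{\Phi(\pm th; \bar\mu)} - 1\right] \to_{t\to+0} \pm h^T\nabla_h\Phi(0; \bar\mu). \]
The resulting relation holds true for all h, implying that
\[ \int_{\mathbf{R}^d} \omega\,\bar P(d\omega) = \nabla_h\Phi(0; \bar\mu). \]
2. For every linear form $e^T\omega$ on $\Omega$ it holds
\[ \pi := \bar P\{\omega : e^T(\omega - e[\bar P]) \ge 1\} \le \exp\Big\{ \inf_{t \ge 0:\, te \in \mathcal H} \left[\Phi(te; \bar\mu) - t e^T\nabla_h\Phi(0; \bar\mu) - t\right] \Big\}. \tag{2.160} \]
What are the consequences of (2.160) for sub-Gaussian distributions?
Solution: Whenever $t \ge 0$ is such that $te \in \mathcal H$, we have
\[ \ln(\pi) \le \ln\int_{\mathbf{R}^d} e^{t e^T(\omega - e[\bar P]) - t}\,\bar P(d\omega) \le \Phi(te; \bar\mu) - t e^T e[\bar P] - t = \Phi(te; \bar\mu) - t e^T\nabla_h\Phi(0; \bar\mu) - t, \]
where the concluding equality is due to item 1, and (2.160) follows. Specifying $\mathcal H = \mathbf{R}^d$, $\mathcal M = \mathbf{R}^d \times \mathbf{S}^d$, $\Phi(h; \mu, \theta) = h^T\mu + \tfrac12 h^T\theta h$, the family $\mathcal S[\mathcal H, \mathcal M, \Phi]$ is the family of sub-Gaussian distributions on $\mathbf{R}^d$, and for a distribution $\bar P$ of this type, a parameter $\bar\mu$ w.r.t. $\mathcal S$ is a pair $(e[\bar P], \bar\theta)$ of sub-Gaussianity parameters of $\bar P$. Consequently, (2.160) reads
\[ \mathrm{Prob}_{\omega\sim\bar P}\{e^T(\omega - e[\bar P]) \ge 1\} \le \exp\Big\{ \inf_{t\ge0}\left[ t e^T e[\bar P] + \tfrac12 t^2 e^T\bar\theta e - t e^T e[\bar P] - t \right] \Big\} = \exp\Big\{ -\frac{1}{2 e^T\bar\theta e} \Big\}, \]
or, in a more convenient form,
\[ \mathrm{Prob}_{\omega\sim\bar P}\Big\{e^T(\omega - e[\bar P]) > s\sqrt{e^T\bar\theta e}\Big\} \le \exp\{-s^2/2\} \quad \forall s \ge 0. \]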
✷
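The two closing displays of the sub-Gaussian case rest on a one-dimensional minimization and a rescaling of e. A short worked version, assuming $e^T\bar\theta e > 0$ and $s > 0$ (the shorthand $q$ is introduced here only for readability):

```latex
\begin{align*}
q &:= e^T\bar\theta e > 0, \qquad
\inf_{t\ge 0}\Big[\tfrac12 q t^2 - t\Big] = -\frac{1}{2q}
\quad\text{(attained at } t = 1/q\text{)},\\
&\Rightarrow\quad \mathrm{Prob}_{\omega\sim\bar P}\{e^T(\omega - e[\bar P]) \ge 1\} \le \exp\{-1/(2q)\}.\\
\intertext{Applying this bound to $e' = e/(s\sqrt{q})$ in place of $e$:}
e'^T(\omega - e[\bar P]) \ge 1 &\iff e^T(\omega - e[\bar P]) \ge s\sqrt{q},
\qquad e'^T\bar\theta e' = \frac{q}{s^2 q} = \frac{1}{s^2},\\
&\Rightarrow\quad \mathrm{Prob}_{\omega\sim\bar P}\{e^T(\omega - e[\bar P]) \ge s\sqrt{q}\} \le \exp\{-s^2/2\}.
\end{align*}
```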
Exercise 2.19. [testing convex hypotheses on mixtures] Consider the situation as follows. For given positive integers K, L and for $\chi = 1, 2$, given are
• nonempty convex compact signal sets $U_\chi \subset \mathbf{R}^{n_\chi}$,
• regular data $\mathcal H^\chi_{k\ell} \subset \mathbf{R}^{d_k}$, $\mathcal M^\chi_{k\ell}$, $\Phi^\chi_{k\ell}$, and affine mappings $u_\chi \mapsto A^\chi_{k\ell}[u_\chi; 1] : \mathbf{R}^{n_\chi} \to \mathbf{R}^{d_k}$ such that
\[ u_\chi \in U_\chi \Rightarrow A^\chi_{k\ell}[u_\chi; 1] \in \mathcal M^\chi_{k\ell}, \quad 1 \le k \le K, \ 1 \le \ell \le L, \]
• probabilistic vectors $\mu^k = [\mu^k_1; ...; \mu^k_L]$, $1 \le k \le K$.
We can associate with the outlined data families of probability distributions $\mathcal P_\chi$ on the observation space $\Omega = \mathbf{R}^{d_1} \times ... \times \mathbf{R}^{d_K}$ as follows. For $\chi = 1, 2$, $\mathcal P_\chi$ is comprised of all probability distributions P of random vectors $\omega^K = [\omega_1; ...; \omega_K] \in \Omega$ generated as follows: We select
• a signal $u_\chi \in U_\chi$,
• a collection of probability distributions $P_{k\ell} \in \mathcal S[\mathcal H^\chi_{k\ell}, \mathcal M^\chi_{k\ell}, \Phi^\chi_{k\ell}]$, $1 \le k \le K$, $1 \le \ell \le L$, in such a way that $A^\chi_{k\ell}[u_\chi; 1]$ is a parameter of $P_{k\ell}$:
\[ \forall h_k \in \mathcal H^\chi_{k\ell} : \ \ln \mathbf{E}_{\omega_k\sim P_{k\ell}}\{e^{h_k^T\omega_k}\} \le \Phi^\chi_{k\ell}(h_k; A^\chi_{k\ell}[u_\chi; 1]); \]
• we generate the components $\omega_k$, $k = 1, ..., K$, independently across k, from the $\mu^k$-mixture $\Pi[\{P_{k\ell}\}_{\ell=1}^L, \mu^k]$ of distributions $P_{k\ell}$, $\ell = 1, ..., L$, that is, draw at random, from distribution $\mu^k$ on $\{1, ..., L\}$, index $\ell$, and then draw $\omega_k$ from the distribution $P_{k\ell}$.
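Prove that when setting
\[ \begin{aligned}
\mathcal H_\chi &= \Big\{h = [h_1; ...; h_K] \in \mathbf{R}^{d = d_1 + ... + d_K} : h_k \in \bigcap_{\ell=1}^L \mathcal H^\chi_{k\ell}, \ 1 \le k \le K\Big\},\\
\mathcal M_\chi &= \{0\} \subset \mathbf{R},\\
\Phi_\chi(h; \mu) &= \sum_{k=1}^K \ln\Big( \sum_{\ell=1}^L \mu^k_\ell \exp\Big\{ \max_{u_\chi\in U_\chi} \Phi^\chi_{k\ell}(h_k; A^\chi_{k\ell}[u_\chi; 1]) \Big\} \Big) : \mathcal H_\chi \times \mathcal M_\chi \to \mathbf{R},
\end{aligned} \]
we obtain the regular data such that
\[ \mathcal P_\chi \subset \mathcal S[\mathcal H_\chi, \mathcal M_\chi, \Phi_\chi]. \]
Explain how to use this observation to compute via Convex Programming an affine detector and its risk for the families of distributions $\mathcal P_1$ and $\mathcal P_2$.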
Solution: The claim is an immediate consequence of the “calculus rules” in items 2.8.1.3.B (“mixtures”) and 2.8.1.3.A (“direct products”). Affine detectors in question are of the form
\[ \phi_h(\omega^K) = \sum_{k=1}^K h_k^T\omega_k + \tfrac12[\Phi_1(-h; 0) - \Phi_2(h; 0)] \]
with $h \in \mathcal H_1 \cap \mathcal H_2$, and $\mathrm{Risk}(\phi_h|\mathcal P_1, \mathcal P_2) \le \exp\{\tfrac12[\Phi_1(-h; 0) + \Phi_2(h; 0)]\}$, see Proposition 2.39.
Exercise 2.20. [mixture of sub-Gaussian distributions] Let $P_\ell$ be sub-Gaussian distributions on $\mathbf{R}^d$ with sub-Gaussianity parameters $\theta_\ell$, $\Theta$, $1 \le \ell \le L$, with a common $\Theta$-parameter, and let $\nu = [\nu_1; ...; \nu_L]$ be a probabilistic vector. Consider the $\nu$-mixture $P = \Pi[P^L, \nu]$ of distributions $P_\ell$, so that $\omega \sim P$ is generated as follows: we draw at random from distribution $\nu$ index $\ell$ and then draw $\omega$ at random from distribution $P_\ell$. Prove that P is sub-Gaussian with sub-Gaussianity parameters $\bar\theta = \sum_\ell \nu_\ell\theta_\ell$ and $\bar\Theta$, with (any) $\bar\Theta$ chosen to satisfy
\[ \bar\Theta \succeq \Theta + \tfrac65[\theta_\ell - \bar\theta][\theta_\ell - \bar\theta]^T \quad \forall \ell, \]
in particular, according to any of the following rules:
1. $\bar\Theta = \Theta + \tfrac65 \max_\ell \|\theta_\ell - \bar\theta\|_2^2\, I_d$,
2. $\bar\Theta = \Theta + \tfrac65 \sum_\ell (\theta_\ell - \bar\theta)(\theta_\ell - \bar\theta)^T$,
3. $\bar\Theta = \Theta + \tfrac65 \sum_\ell \theta_\ell\theta_\ell^T$, provided that $\nu_1 = ... = \nu_L = 1/L$.
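Solution: Setting $\delta_\ell = \theta_\ell - \bar\theta$, we have
\[ \begin{aligned}
\ln \mathbf{E}_{\omega\sim P}\{e^{h^T\omega}\} &= \ln \sum_\ell \nu_\ell \mathbf{E}_{\omega\sim P_\ell}\{e^{h^T\omega}\} \le \ln \sum_\ell \nu_\ell e^{h^T\theta_\ell + \frac12 h^T\Theta h}\\
&= h^T\bar\theta + \tfrac12 h^T\Theta h + \ln \sum_\ell \nu_\ell e^{h^T\delta_\ell}\\
&\le h^T\bar\theta + \tfrac12 h^T\Theta h + \ln \sum_\ell \nu_\ell \left[ h^T\delta_\ell + e^{\frac35 (h^T\delta_\ell)^2} \right] \quad [\text{due to } \exp\{a\} \le a + \exp\{\tfrac35 a^2\} \text{ for all } a]\\
&= h^T\bar\theta + \tfrac12 h^T\Theta h + \ln \sum_\ell \nu_\ell e^{\frac35 (h^T\delta_\ell)^2} \quad [\text{due to } \textstyle\sum_\ell \nu_\ell\delta_\ell = 0]\\
&\le h^T\bar\theta + \tfrac12 h^T\Theta h + \tfrac35 \max_\ell (h^T\delta_\ell)^2\\
&= h^T\bar\theta + \tfrac12 h^T\Theta h + \tfrac12\cdot\tfrac65 \max_\ell (h^T\delta_\ell)^2.
\end{aligned} \]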
✷
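Both the scalar inequality $\exp\{a\} \le a + \exp\{\tfrac35 a^2\}$ and the resulting bound are easy to probe numerically. The sketch below checks them on a grid in the one-dimensional case; the mixture weights, centers, and grids are arbitrary illustrative choices, and Gaussian components are used so that the first step of the chain holds with equality.

```python
import math

def scalar_gap(a):
    """a + exp(0.6 a^2) - exp(a); the inequality used in the solution says this is >= 0."""
    return a + math.exp(0.6 * a * a) - math.exp(a)

grid = [k / 1000.0 for k in range(-10000, 10001)]
min_gap = min(scalar_gap(a) for a in grid)

# One-dimensional mixture of N(theta_l, Theta); illustrative numbers only.
nu = [0.2, 0.5, 0.3]
theta = [-1.0, 0.5, 2.0]
Theta = 1.0
theta_bar = sum(w * t for w, t in zip(nu, theta))
Theta_bar = Theta + 1.2 * max((t - theta_bar) ** 2 for t in theta)   # rule 1 with d = 1

def mixture_log_mgf(h):
    """ln E exp(h*omega) for the Gaussian mixture."""
    return math.log(sum(w * math.exp(h * t + 0.5 * Theta * h * h) for w, t in zip(nu, theta)))

def sub_gaussian_bound(h):
    return h * theta_bar + 0.5 * Theta_bar * h * h

worst = max(mixture_log_mgf(h) - sub_gaussian_bound(h) for h in [k / 100.0 for k in range(-1000, 1001)])
print(min_gap, worst)
```

A nonpositive `worst` (up to rounding) is what the exercise predicts; the gap vanishes at h = 0.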
Exercise 2.21. The goal of this exercise is to give a simple sufficient condition for quadratic lifting “to work” in the Gaussian case. Namely, let $A_\chi$, $U_\chi$, $\mathcal V_\chi$, $\mathcal G_\chi$, $\chi = 1, 2$, be as in Section 2.9.3, with the only difference that now we do not assume the compact sets $U_\chi$ to be convex, and let $\mathcal Z_\chi$ be convex compact subsets of the sets $\mathcal Z^{n_\chi}$, see item i.2 in Proposition 2.43, such that
\[ [u_\chi; 1][u_\chi; 1]^T \in \mathcal Z_\chi \quad \forall u_\chi \in U_\chi, \ \chi = 1, 2. \]
Augmenting the above data with $\Theta_*^{(\chi)}$, $\delta_\chi$ such that $\mathcal V = \mathcal V_\chi$, $\Theta_* = \Theta_*^{(\chi)}$, $\delta = \delta_\chi$ satisfy (2.129), $\chi = 1, 2$, and invoking Proposition 2.43.ii, we get at our disposal a quadratic detector $\phi_{\mathrm{lift}}$ such that
\[ \mathrm{Risk}[\phi_{\mathrm{lift}}|\mathcal G_1, \mathcal G_2] \le \exp\{\mathrm{SadVal}_{\mathrm{lift}}\}, \]
with $\mathrm{SadVal}_{\mathrm{lift}}$ given by (2.134). A natural question is when $\mathrm{SadVal}_{\mathrm{lift}}$ is negative, meaning that our quadratic detector indeed “is working,” that is, its risk is < 1, so that, when repeated observations are allowed, tests based upon this detector are consistent: they are able to decide on the hypotheses $H_\chi : P \in \mathcal G_\chi$, $\chi = 1, 2$, on the distribution of observation $\zeta \sim P$ with any desired risk $\epsilon \in (0, 1)$. With our computation-oriented ideology, this is not too important a question, since we can answer it via efficient computation. This being said, there is no harm in a “theoretical” answer which could provide us with an additional insight. The goal of the exercise is to justify a simple result on the subject. Here is the exercise:
In the situation in question, assume that $\mathcal V_1 = \mathcal V_2 = \{\Theta_*\}$, which allows us to set $\Theta_*^{(\chi)} = \Theta_*$, $\delta_\chi = 0$, $\chi = 1, 2$. Prove that in this case a necessary and sufficient condition for $\mathrm{SadVal}_{\mathrm{lift}}$ to be negative is that the convex compact sets
\[ \mathcal U_\chi = \{B_\chi Z B_\chi^T : Z \in \mathcal Z_\chi\} \subset \mathbf{S}^{d+1}_+, \quad \chi = 1, 2, \]
do not intersect with each other.
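Solution: Substituting the descriptions (2.130) of $\Phi_{A_\chi,\mathcal Z_\chi}$ into (2.134), under the exercise premise we get
\[ \begin{aligned}
\mathrm{SadVal}_{\mathrm{lift}} = \min_{(h,H)\in\mathcal H_1\cap\mathcal H_2} \Big\{ & \underbrace{-\tfrac12\left[ \ln\mathrm{Det}(I_d + \Theta_*^{1/2}H\Theta_*^{1/2}) + \ln\mathrm{Det}(I_d - \Theta_*^{1/2}H\Theta_*^{1/2}) \right]}_{A(H)\,\ge\,0}\\
& + \max_{W_1\in\mathcal U_1} \mathrm{Tr}\Big( \underbrace{W_1}_{\succeq 0} \Big[ -\begin{bmatrix} H & h\\ h^T & \end{bmatrix} + \underbrace{[H, h]^T[\Theta_*^{-1} + H]^{-1}[H, h]}_{B_+(h,H)\,\succeq\,0} \Big] \Big)\\
& + \max_{W_2\in\mathcal U_2} \mathrm{Tr}\Big( \underbrace{W_2}_{\succeq 0} \Big[ \begin{bmatrix} H & h\\ h^T & \end{bmatrix} + \underbrace{[H, h]^T[\Theta_*^{-1} - H]^{-1}[H, h]}_{B_-(h,H)\,\succeq\,0} \Big] \Big) \Big\}\\
\ge \min_{(h,H)\in\mathcal H_1\cap\mathcal H_2} \ & \max_{W_1\in\mathcal U_1, W_2\in\mathcal U_2} \mathrm{Tr}\Big( [W_2 - W_1] \begin{bmatrix} H & h\\ h^T & \end{bmatrix} \Big),
\end{aligned} \tag{6.5} \]
and the concluding quantity in the chain clearly is nonnegative when $\mathcal U_1 \cap \mathcal U_2 \ne \emptyset$. Now assume that $\mathcal U_1 \cap \mathcal U_2 = \emptyset$, and let us verify that $\mathrm{SadVal}_{\mathrm{lift}} < 0$. Indeed, $\mathcal U_1 \cap \mathcal U_2 = \emptyset$, so that the nonempty convex compact sets $\mathcal U_1$, $\mathcal U_2$ can be strictly separated: there exist $\bar h \in \mathbf{R}^d$, $\bar H \in \mathbf{S}^d$, $\eta \in \mathbf{R}$, and $\alpha > 0$ such that
\[ \mathrm{Tr}\Big( [W_2 - W_1] \begin{bmatrix} \bar H & \bar h\\ \bar h^T & \eta \end{bmatrix} \Big) \le -\alpha \quad \forall (W_1 \in \mathcal U_1, W_2 \in \mathcal U_2). \tag{6.6} \]
From the structure of matrices $B_\chi$, $\chi = 1, 2$ (see (2.131)) it follows that $[W_1]_{d+1,d+1} = [W_2]_{d+1,d+1} = 1$ whenever $W_1 \in \mathcal U_1$ and $W_2 \in \mathcal U_2$, whence (6.6) remains intact when keeping $\bar h$, $\bar H$ as they are and setting $\eta = 0$. Now, looking at (6.5) it is clear that for all small enough $t > 0$ we have $(t\bar h, t\bar H) \in \mathcal H_1 \cap \mathcal H_2$ and
\[ \|B_\pm(t\bar h, t\bar H)\| \le Ct^2, \quad A(t\bar H) \le Ct^2, \]
with C independent of t. Hence, due to boundedness of $\mathcal U_\chi$, $\chi = 1, 2$, (6.5) implies that for all small enough positive t and some constant $\bar C$ independent of t it holds
\[ \mathrm{SadVal}_{\mathrm{lift}} \le \bar C t^2 + t \max_{W_1\in\mathcal U_1, W_2\in\mathcal U_2} \mathrm{Tr}\Big( [W_2 - W_1] \begin{bmatrix} \bar H & \bar h\\ \bar h^T & \eta \end{bmatrix} \Big) \le \bar C t^2 - \alpha t \]
(we have used (6.6)), implying that $\mathrm{SadVal}_{\mathrm{lift}} < 0$ due to $\alpha > 0$.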
✷
Exercise 2.22. Prove that if X is a nonempty convex compact set in $\mathbf{R}^d$, then the function $\hat\Phi(h; \mu)$ given by (2.114) is real-valued and continuous on $\mathbf{R}^d \times X$ and is convex in h and concave in $\mu$.
Solution: Let $\bar x \in \mathrm{rint}\,X$, $\bar X = X - \bar x$, and E be the linear span of $\bar X$. Let $\tilde\Phi(h; \mu)$ be associated with $\bar X$ in exactly the same way as $\hat\Phi$ is associated with X:
\[ \tilde\Phi(h; \mu) = \inf_g \left[ \tilde\Psi(h, g; \mu) := (h - g)^T\mu + \tfrac18[\phi_{\bar X}(h - g) + \phi_{\bar X}(-h + g)]^2 + \phi_{\bar X}(g) \right]. \]
1°. We clearly have $\phi_{\bar X}(u) = \phi_X(u) - \bar x^T u$, implying that
\[ \begin{aligned}
\tilde\Psi(h, g; \mu) &= (h - g)^T\mu + \tfrac18[\phi_X(h - g) + \phi_X(-h + g)]^2 + \phi_X(g) - g^T\bar x\\
&= (h - g)^T[\mu + \bar x] + \tfrac18[\phi_X(h - g) + \phi_X(-h + g)]^2 + \phi_X(g) - h^T\bar x\\
&= \hat\Psi(h, g; \mu + \bar x) - h^T\bar x,
\end{aligned} \]
implying that
\[ \tilde\Phi(h; \mu) = \hat\Phi(h; \mu + \bar x) - h^T\bar x. \]
We conclude that verifying continuity and convexity-concavity of $\hat\Phi(h; \mu)$ on $\mathbf{R}^d \times X$ is equivalent to verifying continuity and convexity-concavity of $\tilde\Phi(h; \mu)$ on $\mathbf{R}^d \times \bar X$, and this is what we do next.
2°. When $\mu \in \bar X$, the functions $\phi_{\bar X}(g)$ and $\tilde\Psi(h, g; \mu)$ clearly depend solely on the orthogonal projections $g_E$ and $h_E$ of g and h on E:
\[ \tilde\Psi(h, g; \mu) = [h_E - g_E]^T\mu + \tfrac18[\phi_{\bar X}(h_E - g_E) + \phi_{\bar X}(-h_E + g_E)]^2 + \phi_{\bar X}(g_E). \]
Now, $\bar X$ contains a Euclidean ball of positive radius centered at the origin in the linear space E, implying that $[\phi_{\bar X}(e) + \phi_{\bar X}(-e)]^2 \ge \alpha\|e\|_2^2$ for properly selected $\alpha > 0$ and all $e \in E$, and that $\phi_{\bar X}(g) \ge 0$ for all g. As an immediate consequence, given a nonempty convex compact subset H of $\mathbf{R}^d$, we can find a nonempty convex compact set $G = G[H] \subset E$ such that
\[ \forall (h \in H, \mu \in \bar X) : \ \inf_g \tilde\Psi(h, g; \mu) = \min_{e\in G[H]} \left\{ [h_E - e]^T\mu + \tfrac18[\phi_{\bar X}(h_E - e) + \phi_{\bar X}(-h_E + e)]^2 + \phi_{\bar X}(e) \right\}. \]
Since $\bar X$ and $G[H]$ are nonempty convex compact sets, the right hand side in this representation is real-valued and continuous in $(h, \mu)$ on $H \times \bar X$, implying that $\tilde\Phi(h; \mu)$ is continuous on $\mathbf{R}^d \times \bar X$. Convexity-concavity of $\tilde\Phi$ is evident. ✷
Exercise 2.23. The goal of what follows is to refine the change detection procedure (let us refer to it as the “basic” one) developed in Section 2.9.5.1. The idea is pretty simple. With the notation from Section 2.9.5.1, in the basic procedure, when testing the null hypothesis $H_0$ vs. signal hypothesis $H^\rho_t$, we look at the difference $\zeta_t = \omega_t - \omega_1$ and try to decide whether the energy of the deterministic component $x_t - x_1$ of $\zeta_t$ is 0, as is the case under $H_0$, or is $\ge \rho^2$, as is the case under $H^\rho_t$. Note that if $\sigma \in [\underline\sigma, \overline\sigma]$ is the actual intensity of the observation noise, then the noise component of $\zeta_t$ is $\mathcal N(0, 2\sigma^2 I_d)$; other things being equal, the larger is the noise in $\zeta_t$, the larger should be $\rho$ to allow for a reliable, with a given reliability level, decision. Now note that under the hypothesis $H^\rho_t$, we have $x_1 = ... = x_{t-1}$, so that the deterministic component of the difference $\zeta_t = \omega_t - \omega_1$ is exactly the same as for the difference $\tilde\zeta_t = \omega_t - \frac{1}{t-1}\sum_{s=1}^{t-1}\omega_s$, while the noise component in $\tilde\zeta_t$ is $\mathcal N(0, \sigma_t^2 I_d)$ with
\[ \sigma_t^2 = \sigma^2 + \frac{1}{t-1}\sigma^2 = \frac{t}{t-1}\sigma^2. \]
Thus, the intensity of noise in $\tilde\zeta_t$ is at most the same as in $\zeta_t$, and this intensity, in contrast to that for $\zeta_t$, decreases as t grows. Here goes the exercise:
Let reliability tolerances $\epsilon, \varepsilon \in (0, 1)$ be given, and let our goal be to design a system of inferences $T_t$, $t = 2, 3, ..., K$, which, when used in the same fashion as tests $T^\kappa_t$ were used in the basic procedure, results in false alarm probability at most $\epsilon$ and in probability to miss a change of energy $\ge \rho^2$ at most $\varepsilon$. Needless to say, we want to achieve this goal with as small $\rho$ as possible. Think how to utilize the above observation to refine the basic procedure, eventually reducing (and provably not increasing) the required value of $\rho$. Implement the basic and the refined change detection procedures and compare their quality (the resulting values of $\rho$), e.g., on the data used in the experiment reported in Section 2.9.5.1.
Solution: Given $\rho > 0$, let us act as follows. For every t, $2 \le t \le K$, we use Proposition 2.43.ii, in the same fashion as in the Basic procedure, to decide on two hypotheses, $G_{1,t}$ and $G^\rho_{2,t}$, on the distribution of observation
\[ \tilde\zeta_t = \omega_t - \frac{1}{t-1}\sum_{s=1}^{t-1}\omega_s = x_t - x_1 + \xi^t. \]
Both hypotheses state that the noise component $\xi^t$ of the observation is $\mathcal N(0, \frac{t\sigma^2}{t-1} I_d)$, with $\sigma \in [\underline\sigma, \overline\sigma]$. On top of that, $G_{1,t}$ states that the deterministic component $x_t - x_1$ of the observation is 0, while $G^\rho_{2,t}$ states that the energy of this component is $\ge \rho^2$. Let $\phi^\rho_t(\cdot)$ and $\kappa_t = \kappa_t(\rho)$ be a detector quadratic in $\tilde\zeta_t$ and (the upper bound on) its risk yielded by Proposition 2.43.ii. Passing from this detector to its shift
\[ \bar\phi^\rho_t(\zeta) = \phi^\rho_t(\zeta) + \ln(\varepsilon/\kappa_t), \]
the simple test $T^\rho_t$ which, given observation $\tilde\zeta_t$, accepts $G_{1,t}$ when $\bar\phi^\rho_t(\tilde\zeta_t) \ge 0$ and accepts $G^\rho_{2,t}$ otherwise, satisfies the relations (cf. (2.151))
\[ \mathrm{Risk}_1(\bar\phi^\rho_t|G_{1,t}, G^\rho_{2,t}) \le \frac{\kappa_t^2(\rho)}{\varepsilon}, \quad \mathrm{Risk}_2(\bar\phi^\rho_t|G_{1,t}, G^\rho_{2,t}) \le \varepsilon. \]
Same as in Section 2.9.5.1, the system of tests $T^\rho_t$, $2 \le t \le K$, gives rise to a change detection procedure $\Pi^\rho$ with the probability to miss a change of energy $\ge \rho^2$ at most $\varepsilon$, and the probability of false alarm at most
\[ \epsilon(\rho) = \sum_{t=2}^K \kappa_t^2(\rho)/\varepsilon. \]
It is immediately seen that $\epsilon(\rho) \to 0$ as $\rho \to \infty$, $\epsilon(\rho) > \epsilon$ for small $\rho$, and $\epsilon(\rho)$ is a nonincreasing function of $\rho$. Applying Bisection, we can find the smallest $\rho = \rho_*$ for which $\epsilon(\rho) \le \epsilon$, and then use the procedure $\Pi^{\rho_*}$ in actual change detection, thus ensuring the reliability specifications. Note that with the above refinement, we sacrifice Proposition 2.48. As for numerical results for the data from the experiment reported in Section 2.9.5.1 ($\dim x = 256^2$, $K = 9$, $\sqrt{2}\,\underline\sigma = \overline\sigma = 10$, $\epsilon = \varepsilon = 0.01$), they are as follows: the “resolution” $\rho$ of the Basic procedure is 2,717.4, and that of the refined one 2,702.4. Not too impressive progress!
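The bisection step can be sketched as follows. This is an illustration only: the true $\kappa_t(\rho)$ comes from the convex program of Proposition 2.43.ii, which is not reproduced here, so `kappa_stub` is a hypothetical stand-in (the Hellinger affinity of two Gaussians with known common variance $\frac{t}{t-1}\sigma^2$ and means at distance $\rho$). All function names are assumptions made for this example.

```python
import math

def kappa_stub(rho, t, sigma=1.0):
    """Hypothetical stand-in for kappa_t(rho): affinity of N(0, s2*I) and N(x, s2*I),
    |x| = rho, with s2 = t/(t-1) * sigma^2. Not the bound of Proposition 2.43.ii."""
    s2 = t / (t - 1.0) * sigma ** 2
    return math.exp(-rho ** 2 / (8.0 * s2))

def false_alarm_bound(rho, K, eps_miss, kappa=kappa_stub):
    """epsilon(rho) = sum_{t=2}^K kappa_t(rho)^2 / eps_miss."""
    return sum(kappa(rho, t) ** 2 for t in range(2, K + 1)) / eps_miss

def smallest_rho(K, eps_false, eps_miss, kappa=kappa_stub, tol=1e-8):
    """Bisection for the smallest rho with false_alarm_bound(rho) <= eps_false,
    relying on the bound being nonincreasing in rho."""
    lo, hi = 0.0, 1.0
    while false_alarm_bound(hi, K, eps_miss, kappa) > eps_false:
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if false_alarm_bound(mid, K, eps_miss, kappa) > eps_false:
            lo = mid
        else:
            hi = mid
    return hi   # the feasible endpoint

rho_star = smallest_rho(K=9, eps_false=0.01, eps_miss=0.01)
print(rho_star)
```

Replacing `kappa_stub` by a routine that solves the actual saddle point problem for each t leaves the bisection logic unchanged.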
6.3 SOLUTIONS FOR CHAPTER 3
Exercise 3.1. In the situation of Section 3.3.4, design of a “good” estimate is reduced to solving convex optimization problem (3.39). Note that the objective in this problem is, in a sense, “implicit”: the design variable is h, and the objective is obtained from an explicit convex-concave function of h and (x, y) by maximization over (x, y). There exist solvers capable of processing problems of this type efficiently. However, commonly used off-the-shelf solvers, like cvx, cannot handle problems of this type. The goal of the exercise to follow is to reformulate (3.39) as a semidefinite program, thus making it amenable for cvx. On an immediate inspection, the situation we are interested in is as follows. We are given
• a nonempty convex compact set $X \subset \mathbf{R}^n$ along with affine function $M(x)$ taking values in $\mathbf{S}^d$ and such that $M(x) \succeq 0$ when $x \in X$, and
• affine function $F(h) : \mathbf{R}^d \to \mathbf{R}^n$.
Given $\gamma > 0$, this data gives rise to the convex function
\[ \Psi(h) = \max_{x\in X} \left\{ F^T(h)x + \gamma\sqrt{h^T M(x) h} \right\}, \]
and we want to find a “nice” representation of this function, specifically, want to represent the inequality $\tau \ge \Psi(h)$ by a bunch of LMI's in variables $\tau$, h, and perhaps additional variables. To achieve our goal, we assume in the sequel that the set
\[ X^+ = \{(x, M) : x \in X, M = M(x)\} \]
can be described by a system of linear and semidefinite constraints in variables x, M, and additional variables $\xi$, namely,
\[ X^+ = \left\{ (x, M) : \exists \xi : \begin{array}{ll} (a) & s_i - a_i^T x - b_i^T\xi - \mathrm{Tr}(C_i M) \ge 0, \ i \le I\\ (b) & S - \mathcal A(x) - \mathcal B(\xi) - \mathcal C(M) \succeq 0\\ (c) & M \succeq 0 \end{array} \right\}. \]
Here $s_i \in \mathbf{R}$, $S \in \mathbf{S}^N$ are some constants, and $\mathcal A(\cdot)$, $\mathcal B(\cdot)$, $\mathcal C(\cdot)$ are (homogeneous) linear functions taking values in $\mathbf{S}^N$. We assume that this system of constraints is essentially strictly feasible, meaning that there exists a feasible solution at which the semidefinite constraints (b), (c) are satisfied strictly (i.e., the left-hand sides of the LMI's are positive definite). Now comes the exercise:
1) Check that $\Psi(h)$ is the optimal value in the semidefinite program
\[ \Psi(h) = \max_{x,M,\xi,t} \left\{ F^T(h)x + \gamma t : \begin{array}{ll} (a) & s_i - a_i^T x - b_i^T\xi - \mathrm{Tr}(C_i M) \ge 0, \ i \le I\\ (b) & S - \mathcal A(x) - \mathcal B(\xi) - \mathcal C(M) \succeq 0\\ (c) & M \succeq 0\\ (d) & \begin{bmatrix} h^T M h & t\\ t & 1 \end{bmatrix} \succeq 0 \end{array} \right\}. \tag{P} \]
2) Passing from (P) to the semidefinite dual of (P), build an explicit semidefinite representation of $\Psi$, that is, an explicit system $\mathcal S$ of LMIs in variables h, $\tau$ and additional variables u such that
\[ \{\tau \ge \Psi(h)\} \Leftrightarrow \{\exists u : (\tau, h, u) \text{ satisfies } \mathcal S\}. \]
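Solution: 1: The last LMI in (P) says exactly that $t^2 \le h^T M h$, that is, $|t| \le \sqrt{h^T M(x) h}$ at feasible points; since $\gamma > 0$, the optimal value in (P) indeed is $\Psi(h)$.
2: From our assumptions it immediately follows that (P) is feasible and bounded; in addition, in the case of $h \ne 0$ (which we assume for the time being) (P) admits a feasible solution at which all semidefinite constraints are satisfied strictly. Invoking the Refined Conic Duality Theorem (see Remark 4.2), $\Psi(h)$ is the optimal value in the solvable semidefinite dual of (P). To build this dual, let $\lambda_i \ge 0$ be Lagrange multipliers for (a), $W \succeq 0$ be the Lagrange multiplier for (b), $Z \succeq 0$ be the Lagrange multiplier for (c), and $\begin{bmatrix} u & -v\\ -v & w \end{bmatrix} \succeq 0$ be the Lagrange multiplier for (d). Taking the weighted sum of the constraints in (P) with the weights given by the multipliers, we get the following consequence of the constraints of (P):
\[ \lambda^T s + \mathrm{Tr}(WS) + w - \Big[\sum_i \lambda_i a_i + \mathcal A^*(W)\Big]^T x - \mathrm{Tr}\Big(\Big[\sum_i \lambda_i C_i + \mathcal C^*(W) - u hh^T - Z\Big] M\Big) - 2vt - \Big[\sum_i \lambda_i b_i + \mathcal B^*(W)\Big]^T \xi \ge 0, \]
where for $\mathcal D(y) = \sum_i y_i D_i : \mathbf{R}^k \to \mathbf{S}^N$ the mapping $\mathcal D^*(\cdot) : \mathbf{S}^N \to \mathbf{R}^k$ is given by
\[ \mathcal D^*(Y) = [\mathrm{Tr}(D_1 Y); ...; \mathrm{Tr}(D_k Y)] \ \Leftrightarrow \ y^T\mathcal D^*(Y) \equiv \mathrm{Tr}(\mathcal D(y)Y). \]
To get the dual problem, we should impose on Lagrange multipliers, besides the above restrictions, also the constraint that the part of the left-hand side in the aggregated constraint homogeneous in x, M, $\xi$, t is identically equal to the minus objective in (P), so that the dual problem reads
\[ \begin{aligned}
\Psi(h) &= \min_{\lambda,W,Z,u,v,w} \left\{ \lambda^T s + \mathrm{Tr}(WS) + w : \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(h)\\ u hh^T + Z = \sum_i \lambda_i C_i + \mathcal C^*(W)\\ 2v = \gamma\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0\\ uw \ge v^2\\ \lambda \ge 0, W \succeq 0, Z \succeq 0, u \ge 0, w \ge 0 \end{array} \right\}\\
&= \min_{\lambda,W,Z,u,w} \left\{ \lambda^T s + \mathrm{Tr}(WS) + w : \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(h)\\ u hh^T + Z = \sum_i \lambda_i C_i + \mathcal C^*(W)\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0\\ uw \ge \frac{\gamma^2}{4}\\ \lambda \ge 0, W \succeq 0, Z \succeq 0, w \ge 0 \end{array} \right\}\\
&= \min_{\lambda,W,w} \left\{ \lambda^T s + \mathrm{Tr}(WS) + w : \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(h)\\ \frac{\gamma^2}{4w} hh^T \preceq \sum_i \lambda_i C_i + \mathcal C^*(W)\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0\\ \lambda \ge 0, W \succeq 0, w \ge 0 \end{array} \right\},
\end{aligned} \]
that is,
\[ \Psi(h) = \min_{\lambda,W,w} \left\{ \lambda^T s + \mathrm{Tr}(WS) + w : \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(h)\\ \begin{bmatrix} \sum_i \lambda_i C_i + \mathcal C^*(W) & \frac{\gamma}{2}h\\ \frac{\gamma}{2}h^T & w \end{bmatrix} \succeq 0\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0, \ \lambda \ge 0, W \succeq 0, w \ge 0 \end{array} \right\}. \tag{6.7} \]
(6.7) was obtained under the assumption that $h \ne 0$, but in fact the relation holds true when $h = 0$ as well, due to
\[ \begin{aligned}
\Psi(0) &= \max_{x,M,\xi} \left\{ F^T(0)x : \begin{array}{ll} (a) & s_i - a_i^T x - b_i^T\xi - \mathrm{Tr}(C_i M) \ge 0, \ i \le I\\ (b) & S - \mathcal A(x) - \mathcal B(\xi) - \mathcal C(M) \succeq 0\\ (c) & M \succeq 0 \end{array} \right\}\\
\Rightarrow\ \Psi(0) &= \min_{\lambda,W,Z} \left\{ \lambda^T s + \mathrm{Tr}(WS) : \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(0)\\ Z = \sum_i \lambda_i C_i + \mathcal C^*(W)\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0, \ \lambda \ge 0, W \succeq 0, Z \succeq 0 \end{array} \right\} \quad [\text{by conic duality}]\\
&= \min_{\lambda,W} \left\{ \lambda^T s + \mathrm{Tr}(WS) : \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(0)\\ \sum_i \lambda_i C_i + \mathcal C^*(W) \succeq 0\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0, \ \lambda \ge 0, W \succeq 0 \end{array} \right\},
\end{aligned} \]
which is exactly the same as (6.7) with $h = 0$. We conclude that
\[ \{\tau \ge \Psi(h)\} \Leftrightarrow \exists (\lambda, W, w) : \ \lambda^T s + \mathrm{Tr}(WS) + w \le \tau \ \ \&\ \ \begin{array}{l} \sum_i \lambda_i a_i + \mathcal A^*(W) = F(h)\\ \begin{bmatrix} \sum_i \lambda_i C_i + \mathcal C^*(W) & \frac{\gamma}{2}h\\ \frac{\gamma}{2}h^T & w \end{bmatrix} \succeq 0\\ \sum_i \lambda_i b_i + \mathcal B^*(W) = 0, \ \lambda \ge 0, W \succeq 0, \end{array} \]
which is the desired representation of $\Psi$.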
✷
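The last two passages in the chain of dual problems (elimination of $Z$ and $u$, then the passage to the block LMI in (6.7)) can be spelled out as follows. This is a sketch of the standard argument for the case $w > 0$; the shorthand $C$ is introduced here only for readability.

```latex
\begin{align*}
&C := \sum_i \lambda_i C_i + \mathcal{C}^*(W).\\
&\text{Eliminating } Z:\quad \exists Z \succeq 0:\ u\,hh^T + Z = C \iff u\,hh^T \preceq C.\\
&\text{Eliminating } u:\quad uw \ge \gamma^2/4,\ u \ge 0,\ w \ge 0 \ \Rightarrow\ w > 0,\ u \ge \frac{\gamma^2}{4w};
\ \text{since } hh^T \succeq 0,\ \text{the choice } u = \frac{\gamma^2}{4w} \text{ is the least restrictive.}\\
&\text{Schur complement } (w > 0):\quad \frac{\gamma^2}{4w}\,hh^T \preceq C
\iff \begin{bmatrix} C & \frac{\gamma}{2}h\\[2pt] \frac{\gamma}{2}h^T & w \end{bmatrix} \succeq 0.
\end{align*}
```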
Exercise 3.2. Let us consider the situation as follows. Given an $m \times n$ “sensing matrix” A which is stochastic, with columns from the probabilistic simplex
\[ \Delta_m = \Big\{ v \in \mathbf{R}^m : v \ge 0, \sum_i v_i = 1 \Big\}, \]
and a nonempty closed subset U of $\Delta_n$, we observe an M-element, $M > 1$, i.i.d. sample $\zeta^M = (\zeta_1, ..., \zeta_M)$ with $\zeta_k$ drawn from the discrete distribution $Au_*$, where $u_*$ is an unknown probabilistic vector (“signal”) known to belong to U. We handle the discrete distribution $Au$, $u \in \Delta_n$, as a distribution on the vertices $e_1, ..., e_m$ of $\Delta_m$, so that possible values of $\zeta_k$ are basic orths $e_1, ..., e_m$ in $\mathbf{R}^m$. Our goal is to recover the value $F(u_*)$ of a given quadratic form
\[ F(u) = u^T Q u + 2q^T u. \]
Observe that for $u \in \Delta_n$, we have $u = [uu^T]\mathbf{1}_n$, where $\mathbf{1}_k$ is the all-ones vector in $\mathbf{R}^k$. This observation allows us to rewrite $F(u)$ as a homogeneous quadratic form:
\[ F(u) = u^T\bar Q u, \quad \bar Q = Q + [q\mathbf{1}_n^T + \mathbf{1}_n q^T]. \tag{3.77} \]
The goal of the exercise is to follow the approach developed in Section 3.4.1 for the Gaussian case in order to build an estimate $\hat g(\zeta^M)$ of $F(u)$. To this end, consider the following construction. Let
\[ \mathcal J_M = \{(i, j) : 1 \le i < j \le M\}, \quad J_M = \mathrm{Card}(\mathcal J_M). \]
For $\zeta^M = (\zeta_1, ..., \zeta_M)$ with $\zeta_k \in \{e_1, ..., e_m\}$, $1 \le k \le M$, let
\[ \omega_{ij}[\zeta^M] = \tfrac12[\zeta_i\zeta_j^T + \zeta_j\zeta_i^T], \quad (i, j) \in \mathcal J_M. \]
The estimates we are interested in are of the form
\[ \hat g(\zeta^M) = \mathrm{Tr}\Big( h \underbrace{\Big[ \frac{1}{J_M} \sum_{(i,j)\in\mathcal J_M} \omega_{ij}[\zeta^M] \Big]}_{\omega[\zeta^M]} \Big) + \kappa, \]
where $h \in \mathbf{S}^m$ and $\kappa \in \mathbf{R}$ are the parameters of the estimate. Now comes the exercise:
1) Verify that when the $\zeta_k$'s stem from signal $u \in U$, the expectation of $\omega[\zeta^M]$ is a linear image $Az[u]A^T$ of the matrix $z[u] = uu^T \in \mathbf{S}^n$: denoting by $P^M_u$ the distribution of $\zeta^M$, we have
\[ \mathbf{E}_{\zeta^M\sim P^M_u}\{\omega[\zeta^M]\} = Az[u]A^T. \tag{3.78} \]
Check that when setting
\[ \mathcal Z_k = \{\omega \in \mathbf{S}^k : \omega \succeq 0, \ \omega \ge 0, \ \mathbf{1}_k^T\omega\mathbf{1}_k = 1\}, \]
where $x \ge 0$ for a matrix x means that x is entrywise nonnegative, the image of $\mathcal Z_n$ under the mapping $z \mapsto AzA^T$ is contained in $\mathcal Z_m$.
Solution: Let us fix signal $u \in \Delta_n$ underlying observations $\zeta_1, ..., \zeta_M$. Since $\zeta_1, ..., \zeta_M$ are independent, and $\zeta_k$ takes value $e_i$ with probability $[Au]_i$, $i = 1, ..., m$, all matrices $\omega_{ij}[\zeta^M]$, $(i, j) \in \mathcal J_M$, have common expectation, namely, $[Au][Au]^T$, and (3.78) follows. The fact that $AzA^T \in \mathcal Z_m$ when $z \in \mathcal Z_n$ immediately follows from $A \ge 0$ and $A^T\mathbf{1}_m = \mathbf{1}_n$.
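Relation (3.78) can be confirmed by exact enumeration on a toy instance. In the sketch below the stochastic matrix `A` and the signal `u` are made-up illustrative data; the expectation of $\omega_{12}$ is computed by summing over all values of the pair $(\zeta_1, \zeta_2)$ and compared with $Az[u]A^T$.

```python
from itertools import product

# Illustrative 3x3 column-stochastic matrix and signal (made-up numbers).
A = [[0.5, 0.1, 0.3],
     [0.2, 0.6, 0.3],
     [0.3, 0.3, 0.4]]
u = [0.2, 0.5, 0.3]
m, n = len(A), len(u)

# Distribution of each zeta_k on the basic orths e_1..e_m: p = A u.
p = [sum(A[i][a] * u[a] for a in range(n)) for i in range(m)]

# E{omega_12}: (zeta_1, zeta_2) = (e_i, e_j) with probability p_i * p_j,
# and omega_12 = (e_i e_j^T + e_j e_i^T) / 2.
expected = [[0.0] * m for _ in range(m)]
for i, j in product(range(m), repeat=2):
    weight = p[i] * p[j]
    expected[i][j] += 0.5 * weight
    expected[j][i] += 0.5 * weight

# A z A^T with z = u u^T.
z = [[u[a] * u[b] for b in range(n)] for a in range(n)]
AzAt = [[sum(A[i][a] * z[a][b] * A[j][b] for a in range(n) for b in range(n))
         for j in range(m)] for i in range(m)]

max_diff = max(abs(expected[i][j] - AzAt[i][j]) for i in range(m) for j in range(m))
total_mass = sum(sum(row) for row in AzAt)
print(max_diff, total_mass)
```

The entries of `AzAt` are nonnegative and sum to one, in line with the claim that the mapping $z \mapsto AzA^T$ sends $\mathcal Z_n$ into $\mathcal Z_m$.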
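2) Let $\boldsymbol\Delta_k = \{z \in \mathbf{S}^k : z \ge 0, \mathbf{1}_k^T z\mathbf{1}_k = 1\}$, so that $\mathcal Z_k$ is the set of all positive semidefinite matrices from $\boldsymbol\Delta_k$. For $\mu \in \boldsymbol\Delta_m$, let $P_\mu$ be the distribution of the random matrix w taking values in $\mathbf{S}^m$ as follows: the possible values of w are matrices of the form $e^{ij} = \tfrac12[e_i e_j^T + e_j e_i^T]$, $1 \le i \le j \le m$; for every $i \le m$, w takes value $e^{ii}$ with probability $\mu_{ii}$, and for every i, j with $i < j$, w takes value $e^{ij}$ with probability $2\mu_{ij}$. Let us set
\[ \Phi_1(h; \mu) = \ln\Big( \sum_{i,j=1}^m \mu_{ij}\exp\{h_{ij}\} \Big) : \mathbf{S}^m \times \boldsymbol\Delta_m \to \mathbf{R}, \]
so that $\Phi_1$ is a continuous convex-concave function on $\mathbf{S}^m \times \boldsymbol\Delta_m$.
2.1. Prove that
\[ \forall (h \in \mathbf{S}^m, \mu \in \mathcal Z_m) : \ \ln \mathbf{E}_{w\sim P_\mu}\{\exp\{\mathrm{Tr}(hw)\}\} = \Phi_1(h; \mu). \]
2.2. Derive from 2.1 that setting
\[ K = K(M) = \lfloor M/2 \rfloor, \quad \Phi_M(h; \mu) = K\Phi_1(h/K; \mu) : \mathbf{S}^m \times \boldsymbol\Delta_m \to \mathbf{R}, \]
$\Phi_M$ is a continuous convex-concave function on $\mathbf{S}^m \times \boldsymbol\Delta_m$ such that $\Phi_M(0; \mu) = 0$ for all $\mu \in \mathcal Z_m$, and whenever $u \in U$, the following holds true:
Let $P_{u,M}$ be the distribution of $\omega = \omega[\zeta^M]$, $\zeta^M \sim P^M_u$. Then for all $u \in U$, $h \in \mathbf{S}^m$,
\[ \ln \mathbf{E}_{\omega\sim P_{u,M}}\{\exp\{\mathrm{Tr}(h\omega)\}\} \le \Phi_M(h; Az[u]A^T), \quad z[u] = uu^T. \tag{3.79} \]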
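Solution: 2.1 is evident. Let us prove 2.2. Continuity and convexity-concavity of $\Phi_M$ and the relation $\Phi_M(0; \mu) = 0$, $\mu \in \mathcal Z_m$, are evident, so that all we need is to verify (3.79). Let us fix $u \in U$ and $h \in \mathbf{S}^m$, and let $\mu = Az[u]A^T$, so that $\mu \in \mathcal Z_m \subset \boldsymbol\Delta_m$. Let $\mathcal S_M$ be the set of all permutations $\sigma$ of $1, ..., M$ such that $\sigma(2k-1) < \sigma(2k)$ for $k = 1, ..., K$, and let
\[ \omega^\sigma[\zeta^M] = \frac1K \sum_{k=1}^K \omega_{\sigma(2k-1)\sigma(2k)}[\zeta^M], \quad \sigma \in \mathcal S_M. \]
We claim that
\[ \omega[\zeta^M] = \frac{1}{\mathrm{Card}(\mathcal S_M)} \sum_{\sigma\in\mathcal S_M} \omega^\sigma[\zeta^M]. \tag{6.8} \]
Indeed, we clearly have
\[ \sum_{\sigma\in\mathcal S_M} \sum_{k=1}^K \omega_{\sigma(2k-1)\sigma(2k)}[\zeta^M] = N \sum_{(i,j)\in\mathcal J_M} \omega_{ij}[\zeta^M], \]
where N is the number of permutations $\sigma \in \mathcal S_M$ such that a particular pair $(i, j) \in \mathcal J_M$ is met among the pairs $(\sigma(2k-1), \sigma(2k))$, $1 \le k \le K$. Comparing the total number of $\omega_{ij}$-terms in the left- and the right-hand sides of the latter equality, we get $\mathrm{Card}(\mathcal S_M)K = N J_M$, which combines with the equality itself to imply that
\[ \frac{1}{J_M} \sum_{(i,j)\in\mathcal J_M} \omega_{ij}[\zeta^M] = \frac{1}{\mathrm{Card}(\mathcal S_M)} \sum_{\sigma\in\mathcal S_M} \frac1K \sum_{k=1}^K \omega_{\sigma(2k-1)\sigma(2k)}[\zeta^M], \]
which is exactly (6.8) (see what $\omega^\sigma[\cdot]$ is). Let $\sigma_{\mathrm{id}}$ be the identity permutation of $1, ..., M$; it clearly belongs to $\mathcal S_M$. We have
\[ \begin{aligned}
\mathbf{E}_{\omega\sim P_{u,M}}\{\exp\{\mathrm{Tr}(h\omega)\}\} &= \mathbf{E}_{\zeta^M\sim P^M_u}\left\{ \exp\{\mathrm{Tr}(h\omega[\zeta^M])\} \right\}\\
&= \mathbf{E}_{\zeta^M\sim P^M_u}\Big\{ \exp\Big\{ \frac{1}{\mathrm{Card}(\mathcal S_M)} \sum_{\sigma\in\mathcal S_M} \mathrm{Tr}(h\omega^\sigma[\zeta^M]) \Big\} \Big\} \quad [\text{by (6.8)}]\\
&\le \prod_{\sigma\in\mathcal S_M} \left[ \mathbf{E}_{\zeta^M\sim P^M_u}\left\{ \exp\{\mathrm{Tr}(h\omega^\sigma[\zeta^M])\} \right\} \right]^{1/\mathrm{Card}(\mathcal S_M)} \quad [\text{by Hölder's inequality}]\\
&= \mathbf{E}_{\zeta^M\sim P^M_u}\left\{ \exp\{\mathrm{Tr}(h\omega^{\sigma_{\mathrm{id}}}[\zeta^M])\} \right\}\\
&\qquad [\text{since the distribution of } \omega^\sigma[\zeta^M], \ \zeta^M \sim P^M_u, \text{ is independent of } \sigma \in \mathcal S_M]\\
&= \mathbf{E}_{\zeta^M\sim P^M_u}\Big\{ \prod_{k=1}^K \exp\{ \tfrac1K \mathrm{Tr}(h[\zeta_{2k-1}\zeta_{2k}^T + \zeta_{2k}\zeta_{2k-1}^T]/2) \} \Big\} \quad [\text{by definition of } \omega^{\sigma_{\mathrm{id}}}[\cdot]]\\
&= \left[ \mathbf{E}_{\zeta^2\sim P^2_u}\left\{ \exp\{\mathrm{Tr}((h/K)[\zeta_1\zeta_2^T + \zeta_2\zeta_1^T]/2)\} \right\} \right]^K \quad [\text{since } \zeta_1, ..., \zeta_M \text{ are i.i.d.}]\\
&= \left[ \mathbf{E}_{w\sim P_\mu}\{\exp\{\mathrm{Tr}((h/K)w)\}\} \right]^K\\
&\qquad [\text{since the distribution of } \tfrac12[\zeta_1\zeta_2^T + \zeta_2\zeta_1^T], \ \zeta^2 \sim P^2_u, \text{ clearly is } P_\mu \text{ with } \mu = Auu^TA^T].
\end{aligned} \]
The resulting inequality combines with (3.78) to imply that
\[ \ln \mathbf{E}_{\omega\sim P_{u,M}}\{\exp\{\mathrm{Tr}(h\omega)\}\} \le K\Phi_1(h/K; Az[u]A^T), \]
and (3.79) follows.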
✷
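The averaging identity (6.8) used in the proof above can be checked numerically on a small instance. The sketch below is only an illustration under stated assumptions: it takes the index set of pairs to be all (i, j) with i < j, uses 0-based indices, and the names `omega_pair`, `omega_sigma`, and the sizes `M = 5`, `m = 3` are hypothetical choices, not part of the text.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
M, m = 5, 3              # M observations in R^m (illustrative sizes)
K = M // 2               # K = floor(M/2) disjoint pairs
zeta = rng.random((M, m))

def omega_pair(i, j):
    """Symmetrized outer product (1/2)[zeta_i zeta_j^T + zeta_j zeta_i^T]."""
    return 0.5 * (np.outer(zeta[i], zeta[j]) + np.outer(zeta[j], zeta[i]))

# Left-hand side of (6.8): average of omega_ij over all pairs i < j
# (assumption: this is the index set J_M of the exercise).
pairs = list(itertools.combinations(range(M), 2))
omega_u = sum(omega_pair(i, j) for i, j in pairs) / len(pairs)

# Permutations with sigma(2k-1) < sigma(2k), k = 1..K (0-based here).
S_M = [s for s in itertools.permutations(range(M))
       if all(s[2 * k] < s[2 * k + 1] for k in range(K))]

def omega_sigma(s):
    """Average of omega over the K consecutive pairs selected by s."""
    return sum(omega_pair(s[2 * k], s[2 * k + 1]) for k in range(K)) / K

# Right-hand side of (6.8): average of omega^sigma over admissible sigma.
omega_avg = sum(omega_sigma(s) for s in S_M) / len(S_M)
identity_holds = bool(np.allclose(omega_u, omega_avg))
print(len(S_M), identity_holds)
```

Brute-force enumeration is feasible only for very small M; it is meant as a sanity check of the counting argument, not as a way to compute the estimate.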
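3) Combine the above observations with Corollary 3.6 to arrive at the following result:

Proposition 3.19 In the situation in question, let $Z$ be a convex compact subset of $Z_n$ such that $uu^T\in Z$ for all $u\in U$. Given $\epsilon\in(0,1)$, let
$$\begin{aligned}
\Psi_+(h,\alpha) &= \max_{z\in Z}\big[\alpha\Phi_M(h/\alpha, AzA^T) - \mathrm{Tr}(\bar Qz)\big] : \mathbf{S}^m\times\{\alpha>0\}\to\mathbf{R},\\
\Psi_-(h,\alpha) &= \max_{z\in Z}\big[\alpha\Phi_M(-h/\alpha, AzA^T) + \mathrm{Tr}(\bar Qz)\big] : \mathbf{S}^m\times\{\alpha>0\}\to\mathbf{R},\\
\widehat\Psi_+(h) &:= \inf_{\alpha>0}\big[\Psi_+(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_M(h/\alpha, AzA^T) - \mathrm{Tr}(\bar Qz) + \alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi_1(h/\beta, AzA^T) - \mathrm{Tr}(\bar Qz) + \tfrac{\beta}{K}\ln(2/\epsilon)\big] \quad [\beta = K\alpha],\\
\widehat\Psi_-(h) &:= \inf_{\alpha>0}\big[\Psi_-(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_M(-h/\alpha, AzA^T) + \mathrm{Tr}(\bar Qz) + \alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi_1(-h/\beta, AzA^T) + \mathrm{Tr}(\bar Qz) + \tfrac{\beta}{K}\ln(2/\epsilon)\big] \quad [\beta = K\alpha].
\end{aligned}$$
The functions $\widehat\Psi_\pm$ are real valued and convex on $\mathbf{S}^m$, and every candidate solution $h$ to the convex optimization problem
$$\mathrm{Opt} = \min_h\Big\{\widehat\Psi(h) := \tfrac12\big[\widehat\Psi_+(h)+\widehat\Psi_-(h)\big]\Big\} \tag{3.80}$$
induces the estimate
$$\widehat g_h(\zeta^M) = \mathrm{Tr}(h\omega[\zeta^M]) + \kappa(h),\quad \kappa(h) = \tfrac12\big[\widehat\Psi_-(h)-\widehat\Psi_+(h)\big]$$
of the functional of interest (3.77) via observation $\zeta^M$ with $\epsilon$-risk on $U$ not exceeding $\rho = \widehat\Psi(h)$:
$$\forall(u\in U):\ \mathrm{Prob}_{\zeta^M\sim P_u^M}\{|F(u)-\widehat g_h(\zeta^M)|>\rho\}\le\epsilon.$$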
Solution: Verification is straightforward.
4) Consider an alternative way to estimate $F(u)$, namely, as follows. Let $u\in U$. Given a pair of independent observations $\zeta_1,\zeta_2$ drawn from distribution $Au$, let us convert them into the symmetric matrix $\omega_{1,2}[\zeta^2] = \frac12[\zeta_1\zeta_2^T+\zeta_2\zeta_1^T]$. The distribution $P_{u,2}$ of this matrix is exactly the distribution $P_{\mu(z[u])}$ (see item B), where $\mu(z) = AzA^T : \Delta_n\to\Delta_m$. Now, given $M=2K$ observations $\zeta^{2K} = (\zeta_1,...,\zeta_{2K})$ stemming from signal $u$, we can split them into $K$ consecutive pairs giving rise to $K$ observations $\omega^K = (\omega_1,...,\omega_K)$, $\omega_k = \omega[[\zeta_{2k-1};\zeta_{2k}]]$, drawn independently of each other from probability distribution $P_{\mu(z[u])}$, and the functional of interest (3.77) is a linear function $\mathrm{Tr}(\bar Qz[u])$ of $z[u]$. Assume that we are given a set $Z$ as in the premise of Proposition 3.19. Observe that we are in the situation as follows:

Given $K$ independent identically distributed observations $\omega^K = (\omega_1,...,\omega_K)$ with $\omega_k\sim P_{\mu(z)}$, where $z$ is an unknown signal known to belong to $Z$, we want to recover the value at $z$ of linear function $G(v) = \mathrm{Tr}(\bar Qv)$ of $v\in\mathbf{S}^n$. Besides this, we know that $P_\mu$, for every $\mu\in\Delta_m$, satisfies the relation
$$\forall(h\in\mathbf{S}^m):\ \ln\mathbf{E}_{\omega\sim P_\mu}\{\exp\{\mathrm{Tr}(h\omega)\}\}\le\Phi_1(h;\mu).$$

This situation fits the setting of Section 3.3.3, with the data specified as
$$\mathcal{H} = E_H = \mathbf{S}^m,\quad \mathcal{M} = \Delta_m\subset E_M = \mathbf{S}^m,\quad \Phi = \Phi_1,\quad \mathcal{X} := Z\subset E_X = \mathbf{S}^n,\quad \mathcal{A}(z) = AzA^T.$$
Therefore, we can use the apparatus developed in that section to upper-bound the $\epsilon$-risk of the affine estimate
$$\mathrm{Tr}\Big(h\frac1K\sum_{k=1}^K\omega_k\Big)+\kappa$$
of $F(u) := G(z[u]) = u^T\bar Qu$ and to build the best, in terms of the upper risk bound, estimate; see Corollary 3.8. On closer inspection (carry it out!), the functions $\widehat\Psi_\pm$ arising in (3.38) and associated with the above data are exactly the functions $\widehat\Psi_\pm$ specified in Proposition 3.19 for $M=2K$. Thus, the approach just outlined to estimating $F(u)$ via observations $\zeta^{2K}$ stemming from $u\in U$ results in a family of estimates
$$\widetilde g_h(\zeta^{2K}) = \mathrm{Tr}\Big(h\frac1K\sum_{k=1}^K\omega[[\zeta_{2k-1};\zeta_{2k}]]\Big)+\kappa(h),\quad h\in\mathbf{S}^m.$$
The resulting upper bound on the $\epsilon$-risk of estimate $\widetilde g_h$ is $\widehat\Psi(h)$, where $\widehat\Psi(\cdot)$ is associated with $M=2K$ according to Proposition 3.19. In other words, this is exactly the upper bound on the $\epsilon$-risk of the estimate $\widehat g_h$ offered by the proposition. Note, however, that the estimates $\widetilde g_h$ and $\widehat g_h$ are not identical:
$$\begin{aligned}
\widetilde g_h(\zeta^{2K}) &= \mathrm{Tr}\Big(h\frac1K\sum_{k=1}^K\omega_{2k-1,2k}[\zeta^{2K}]\Big)+\kappa(h),\\
\widehat g_h(\zeta^{2K}) &= \mathrm{Tr}\Big(h\frac{1}{K(2K-1)}\sum_{1\le i<j\le 2K}\omega_{ij}[\zeta^{2K}]\Big)+\kappa(h).
\end{aligned}$$
[...] 1/2: it makes sense to be lazy! Note, however, that the probability of today's reward being at least that of tomorrow's is exactly the same $u^T\bar Qu$, so that lazy people have scientific reasons to be lazy, and industrious people have equally scientific reasons to stay so .... Now, $M$ historical data can be treated as a stationary $M$-repeated observation $\zeta^M$ of a discrete random variable with distribution $u$, so that we find ourselves in the situation considered in the exercise, with $m=n$, $A=I_n$, and the matrix $\bar Q$ just defined, and we can apply the estimate we have just derived. Our numerical experiments (where we used $n=16$ and $Z=Z_n$) fully support the claim that "in reality" estimate $\widehat g_h$ outperforms its competitor $\widetilde g_h$. Here is a
4 In statistical literature aggregates $\widehat g_h$ and $\widetilde g_h$ are referred to as U-statistics of order 2. There exists a rich and technically advanced theory of U-statistics (cf. [100, 151, 195] and references therein); yet, available results usually impose restrictions on the choice of matrix $h$ which we wish to avoid. This being said, in the simple situation considered in this exercise, a theoretical comparison of estimates $\widehat g_h$ and $\widetilde g_h$ could be carried out using, for instance, results of [119], but we were just too lazy to do it and leave this task to an interested reader.
typical result:
[Figure, four panels. Left, lower: $\widehat g$, unbiased; left, upper: $\widehat g$, optimal; right, lower: $\widetilde g$, unbiased; right, upper: $\widetilde g$, optimal.]
What you see are empirical cumulative distribution functions, built via 200 simulations of $\zeta^{2K}$ with $K=256$, of magnitudes $|\widehat g(\zeta^{2K}) - u^T\bar Qu|$ of recovery errors. In each simulation, we compute $\widehat g_h$- and $\widetilde g_h$-estimates, for both $h=\bar Q$ ("unbiased estimates"), and for $h$ selected as the optimal solution to (3.80) ("optimal estimates"). We see that while there is no significant difference between unbiased and optimal estimates of the same type, the $\widehat g_h$-estimates significantly outperform their $\widetilde g_h$ counterparts.

Exercise 3.3. What follows is a variation of Exercise 3.2. Consider the situation as follows. We observe $K$ realizations $\eta_k$, $k\le K$, of a discrete random variable with $p$ possible values, and $L\ge K$ realizations $\zeta_\ell$, $\ell\le L$, of a discrete random variable with $q$ possible values. All realizations are independent of each other; the $\eta_k$'s are drawn from distribution $Pu$, and the $\zeta_\ell$'s from distribution $Qv$, where $P\in\mathbf{R}^{p\times r}$, $Q\in\mathbf{R}^{q\times s}$ are given stochastic "sensing matrices," and $u$, $v$ are unknown "signals" known to belong to given subsets $U$, $V$ of probabilistic simplexes $\Delta_r$, $\Delta_s$. Our goal is to recover from observations $\{\eta_k,\zeta_\ell\}$ the value at $u$, $v$ of a given bilinear function
$$F(u,v) = u^TFv = \mathrm{Tr}(F[uv^T]^T). \tag{3.81}$$
A "covering story" could be as follows. Imagine that there are two possible actions, say, administering to a patient drug A or drug B. Let $u$ be the probability distribution of a (quantified) outcome of the first action, and $v$ be a similar distribution for the second action. Observing what happens when the first action is utilized $K$ times, and the second $L$ times, we could ask ourselves what the probability is of the outcome of the first action being better than the outcome of the second one. This amounts to computing the probability $\pi$ of the event "$\eta>\zeta$," where $\eta$, $\zeta$ are discrete real-valued random variables independent of each other with distributions $u$, $v$, and $\pi$ is a linear function of the "joint distribution" $uv^T$ of $\eta$, $\zeta$. This story gives rise to the aforementioned estimation problem with the unit sensing matrices $P$ and $Q$. Assuming that there are "measurement errors" (instead of observing the action's outcome "as is," we observe a realization of a random variable with distribution depending, in a prescribed fashion, on the outcome), we arrive at problems where $P$ and $Q$ can be general type stochastic matrices.
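As always, we encode the $p$ possible values of $\eta_k$ by the basic orths $e_1,...,e_p$ in $\mathbf{R}^p$, and the $q$ possible values of $\zeta$ by the basic orths $f_1,...,f_q$ in $\mathbf{R}^q$. We focus on estimates of the form
$$\widehat g_{h,\kappa}(\eta^K,\zeta^L) = \Big[\frac1K\sum_k\eta_k\Big]^Th\Big[\frac1L\sum_\ell\zeta_\ell\Big]+\kappa \qquad [h\in\mathbf{R}^{p\times q},\ \kappa\in\mathbf{R}].$$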
This is what you are supposed to do:
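1) (cf. item 2 in Exercise 3.2) Denoting by $\Delta_{mn}$ the set of nonnegative $m\times n$ matrices with unit sum of all entries (i.e., the set of all probability distributions on $\{1,...,m\}\times\{1,...,n\}$) and assuming $L\ge K$, let us set
$$\mathcal{A}(z) = PzQ^T : \mathbf{R}^{r\times s}\to\mathbf{R}^{p\times q}$$
and
$$\begin{aligned}
\Phi(h;\mu) &= \ln\Big(\sum_{i=1}^p\sum_{j=1}^q\mu_{ij}\exp\{h_{ij}\}\Big) : \mathbf{R}^{p\times q}\times\Delta_{pq}\to\mathbf{R},\\
\Phi_K(h;\mu) &= K\Phi(h/K;\mu) : \mathbf{R}^{p\times q}\times\Delta_{pq}\to\mathbf{R}.
\end{aligned}$$
Verify that $\mathcal{A}$ maps $\Delta_{rs}$ into $\Delta_{pq}$, $\Phi$ and $\Phi_K$ are continuous convex-concave functions on their domains, and that for every $u\in\Delta_r$, $v\in\Delta_s$, the following holds true:

(!) When $\eta^K = (\eta_1,...,\eta_K)$, $\zeta^L = (\zeta_1,...,\zeta_L)$ with mutually independent $\eta_1,...,\zeta_L$ such that $\eta_k\sim Pu$, $\zeta_\ell\sim Qv$ for all $k$, $\ell$, we have
$$\ln\mathbf{E}_{\eta,\zeta}\Big\{\exp\Big\{\Big[\frac1K\sum_k\eta_k\Big]^Th\Big[\frac1L\sum_\ell\zeta_\ell\Big]\Big\}\Big\}\le\Phi_K(h;\mathcal{A}(uv^T)). \tag{3.82}$$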
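Solution: The only nontrivial claim is (!). To verify this claim, let us start with the immediate observation that when $\mu\in\Delta_{pq}$ and $h\in\mathbf{R}^{p\times q}$, treating $\mu$ as the probability distribution of a random $p\times q$ matrix $w$ which takes values $e_if_j^T$ with probabilities $\mu_{ij}$, $1\le i\le p$, $1\le j\le q$, we have
$$\ln\mathbf{E}_{w\sim\mu}\{\exp\{\mathrm{Tr}(hw^T)\}\}\le\Phi(h;\mu),$$
whence, in particular, with $\eta_k$ and $\zeta_\ell$ stemming, as explained above, from $u$, $v$, we have
$$\ln\mathbf{E}\{\exp\{\mathrm{Tr}(h[\eta_k\zeta_\ell^T]^T)\}\}\le\Phi(h;\mathcal{A}(uv^T))\quad\forall(h\in\mathbf{R}^{p\times q},\ k\le K,\ \ell\le L). \tag{$*$}$$
Now let $\mathcal{S}$ be the set of those mappings $k\mapsto\sigma(k)$ of $\{1,...,K\}$ into $\{1,...,L\}$ which are embeddings (i.e., $\sigma(k)\ne\sigma(k')$ when $k\ne k'$). We clearly have
$$\sum_{\sigma\in\mathcal{S}}\sum_{k=1}^K\eta_k\zeta_{\sigma(k)}^T = N\sum_{k=1}^K\sum_{\ell=1}^L\eta_k\zeta_\ell^T$$
with properly selected $N$; counting the numbers of $\eta\zeta^T$-terms in the left- and in the right-hand side, we get
$$N = \frac{\mathrm{Card}(\mathcal{S})}{L},$$
whence
$$\Big[\frac1K\sum_{k=1}^K\eta_k\Big]\Big[\frac1L\sum_{\ell=1}^L\zeta_\ell\Big]^T = \frac{1}{NKL}\Big[N\sum_{k=1}^K\sum_{\ell=1}^L\eta_k\zeta_\ell^T\Big] = \frac{1}{\mathrm{Card}(\mathcal{S})}\sum_{\sigma\in\mathcal{S}}\Big[\frac1K\sum_{k=1}^K\eta_k\zeta_{\sigma(k)}^T\Big],$$
and therefore
$$\begin{aligned}
\ln\mathbf{E}\Big\{\exp\Big\{\mathrm{Tr}\Big(h\Big[\Big[\frac1K\sum_{k=1}^K\eta_k\Big]\Big[\frac1L\sum_{\ell=1}^L\zeta_\ell\Big]^T\Big]^T\Big)\Big\}\Big\}
&= \ln\mathbf{E}\Big\{\exp\Big\{\frac{1}{\mathrm{Card}(\mathcal{S})}\sum_{\sigma\in\mathcal{S}}\mathrm{Tr}\Big(h\Big[\frac1K\sum_{k=1}^K\eta_k\zeta_{\sigma(k)}^T\Big]^T\Big)\Big\}\Big\}\\
&\le \frac{1}{\mathrm{Card}(\mathcal{S})}\sum_{\sigma\in\mathcal{S}}\ln\mathbf{E}\Big\{\exp\Big\{\mathrm{Tr}\Big(h\Big[\frac1K\sum_{k=1}^K\eta_k\zeta_{\sigma(k)}^T\Big]^T\Big)\Big\}\Big\} &&\text{[by Hölder inequality]}\\
&= K\ln\mathbf{E}\big\{\exp\{\mathrm{Tr}(K^{-1}h[\eta_1\zeta_1^T]^T)\}\big\} &&\text{[since } \eta_1,...,\eta_K,\zeta_1,...,\zeta_L \text{ are independent]}\\
&\le K\Phi(h/K;Puv^TQ^T) &&\text{[since } \eta_1\zeta_1^T\sim Puv^TQ^T],
\end{aligned}$$
and (!) follows.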
✷
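The combinatorial identity behind the proof, namely that the product of the two sample means equals the average over all embeddings σ of the "paired" averages, can be verified by brute force on a toy instance. This is a minimal sketch: the sizes `K, L, p, q` and the random one-hot draws are illustrative assumptions, and indices are 0-based.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
K, L, p, q = 2, 3, 4, 5                      # illustrative sizes with L >= K
eta = np.eye(p)[rng.integers(p, size=K)]     # K one-hot observations in R^p
zeta = np.eye(q)[rng.integers(q, size=L)]    # L one-hot observations in R^q

# Left-hand side: [mean of eta_k] [mean of zeta_l]^T
lhs = np.outer(eta.mean(axis=0), zeta.mean(axis=0))

# Embeddings of {0..K-1} into {0..L-1}: K-tuples of distinct indices.
embeddings = list(itertools.permutations(range(L), K))

# Right-hand side: average over embeddings of (1/K) sum_k eta_k zeta_{sigma(k)}^T
rhs = sum(
    sum(np.outer(eta[k], zeta[s[k]]) for k in range(K)) / K
    for s in embeddings
) / len(embeddings)

embedding_identity_holds = bool(np.allclose(lhs, rhs))
print(len(embeddings), embedding_identity_holds)
```

The identity is purely algebraic, so it holds for any vectors, not only one-hot ones; one-hot draws are used here only to mirror the encoding of the exercise.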
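2) Combine (!) with Corollary 3.6 to arrive at the following analog of Proposition 3.19:

Proposition 3.20 In the situation in question, let $Z$ be a convex compact subset of $\Delta_{rs}$ such that $uv^T\in Z$ for all $u\in U$, $v\in V$. Given $\epsilon\in(0,1)$, let
$$\begin{aligned}
\Psi_+(h,\alpha) &= \max_{z\in Z}\big[\alpha\Phi_K(h/\alpha,PzQ^T)-\mathrm{Tr}(Fz^T)\big] : \mathbf{R}^{p\times q}\times\{\alpha>0\}\to\mathbf{R},\\
\Psi_-(h,\alpha) &= \max_{z\in Z}\big[\alpha\Phi_K(-h/\alpha,PzQ^T)+\mathrm{Tr}(Fz^T)\big] : \mathbf{R}^{p\times q}\times\{\alpha>0\}\to\mathbf{R},\\
\widehat\Psi_+(h) &:= \inf_{\alpha>0}\big[\Psi_+(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_K(h/\alpha,PzQ^T)-\mathrm{Tr}(Fz^T)+\alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi(h/\beta,PzQ^T)-\mathrm{Tr}(Fz^T)+\tfrac{\beta}{K}\ln(2/\epsilon)\big]\quad[\beta=K\alpha],\\
\widehat\Psi_-(h) &:= \inf_{\alpha>0}\big[\Psi_-(h,\alpha)+\alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\alpha>0}\big[\alpha\Phi_K(-h/\alpha,PzQ^T)+\mathrm{Tr}(Fz^T)+\alpha\ln(2/\epsilon)\big]\\
&= \max_{z\in Z}\inf_{\beta>0}\big[\beta\Phi(-h/\beta,PzQ^T)+\mathrm{Tr}(Fz^T)+\tfrac{\beta}{K}\ln(2/\epsilon)\big]\quad[\beta=K\alpha].
\end{aligned}$$
The functions $\widehat\Psi_\pm$ are real-valued and convex on $\mathbf{R}^{p\times q}$, and every candidate solution $h$ to the convex optimization problem
$$\mathrm{Opt} = \min_h\Big\{\widehat\Psi(h) := \tfrac12\big[\widehat\Psi_+(h)+\widehat\Psi_-(h)\big]\Big\}$$
induces the estimate
$$\widehat g_h(\eta^K,\zeta^L) = \mathrm{Tr}\Big(h\Big[\Big[\frac1K\sum_k\eta_k\Big]\Big[\frac1L\sum_\ell\zeta_\ell\Big]^T\Big]^T\Big)+\kappa(h),\quad \kappa(h) = \tfrac12\big[\widehat\Psi_-(h)-\widehat\Psi_+(h)\big]$$
of the functional of interest (3.81) via observation $\eta^K$, $\zeta^L$ with $\epsilon$-risk on $U\times V$ not exceeding $\rho = \widehat\Psi(h)$:
$$\forall(u\in U,v\in V):\ \mathrm{Prob}\{|F(u,v)-\widehat g_h(\eta^K,\zeta^L)|>\rho\}\le\epsilon,$$
the probability being taken w.r.t. the distribution of observations $\eta^K$, $\zeta^L$ stemming from signals $u$, $v$.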
Solution: Verification is straightforward.

Exercise 3.4 [recovering mixture weights] The problem to be addressed in this exercise is as follows. We are given $K$ probability distributions $P_1,...,P_K$ on observation space $\Omega$, and let these distributions have densities $p_k(\cdot)$ w.r.t. some reference measure $\Pi$ on $\Omega$; we assume that $\sum_kp_k(\cdot)$ is positive on $\Omega$. We are given also $N$ independent observations
$$\omega_t\sim P_\mu,\quad t=1,...,N,$$
drawn from distribution
$$P_\mu = \sum_{k=1}^K\mu_kP_k,$$
where $\mu$ is an unknown "signal" known to belong to the probabilistic simplex $\Delta_K = \{\mu\in\mathbf{R}^K:\mu\ge0,\ \sum_k\mu_k=1\}$. Given $\omega^N = (\omega_1,...,\omega_N)$, we want to recover the linear image $G\mu$ of $\mu$, where $G\in\mathbf{R}^{\nu\times K}$ is given. We intend to measure the risk of a candidate estimate $\widehat G(\omega^N):\Omega\times...\times\Omega\to\mathbf{R}^\nu$ by the quantity
$$\mathrm{Risk}[\widehat G(\cdot)] = \sup_{\mu\in\Delta_K}\Big[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{\|\widehat G(\omega^N)-G\mu\|_2^2\big\}\Big]^{1/2}.$$
3.4.A. Recovering linear form. Let us start with the case when G = g T is a 1 × K matrix. 3.4.A.1. Preliminaries. To motivate the construction to follow, consider the case when Ω is a finite set (obtained, e.g., by “fine discretization” of the “true” observation space). In this situation our problem becomes an estimation problem in Discrete o.s.: given a stationary N repeated observation stemming from discrete probability distribution Pµ affinely parameterized by signal µ ∈ ∆K , we want to recover a linear form of µ. It is shown in Section 3.1—see Remark 3.2—that in this case a nearly optimal, in terms of its ǫ-risk, estimate is of the form gb(ω N ) =
$$\frac1N\sum_{t=1}^N\phi(\omega_t) \tag{3.83}$$
with properly selected φ. The difficulty with this approach is that as far as computations are concerned, an optimal design of φ requires solving a convex optimization problem of design dimension of order of the cardinality of Ω, and this cardinality could be huge, as is the case when Ω is a discretization of a domain in Rd with d in the range of tens. To circumvent this problem, we are to simplify the outlined approach: from the construction of Section 3.1 we inherit the simple structure (3.83) of the estimator; taking this structure for granted, we are to develop an alternative design of φ. With this new design, we have no theoretical guarantees for the resulting estimates to be near-optimal; we sacrifice these guarantees in order to reduce dramatically the computational effort of building the estimates. 3.4.A.2. Generic estimate. Let us select somehow L functions Fℓ (·) on Ω such that Z Fℓ2 (ω)pk (ω)Π(dω) < ∞, 1 ≤ ℓ ≤ L, 1 ≤ k ≤ K. (3.84) With λ ∈ RL , consider an estimate of the form
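$$\widehat g_\lambda(\omega^N) = \frac1N\sum_{t=1}^N\Phi_\lambda(\omega_t),\quad \Phi_\lambda(\omega) = \sum_\ell\lambda_\ell F_\ell(\omega). \tag{3.85}$$
1) Prove that
$$\begin{aligned}
\mathrm{Risk}[\widehat g_\lambda]\le\mathrm{Risk}(\lambda) &:= \max_{k\le K}\Big[\frac1N\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2p_k(\omega)\Pi(d\omega)+\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]p_k(\omega)\Pi(d\omega)-g^Te_k\Big]^2\Big]^{1/2}\\
&= \max_{k\le K}\Big[\frac1N\lambda^TW_k\lambda+\big[e_k^T[M\lambda-g]\big]^2\Big]^{1/2},
\end{aligned} \tag{3.86}$$
where
$$\begin{aligned}
M &= \Big[M_{k\ell} := \int F_\ell(\omega)p_k(\omega)\Pi(d\omega)\Big]_{k\le K,\ \ell\le L},\\
W_k &= \Big[[W_k]_{\ell\ell'} := \int F_\ell(\omega)F_{\ell'}(\omega)p_k(\omega)\Pi(d\omega)\Big]_{\ell\le L,\ \ell'\le L},\quad 1\le k\le K,
\end{aligned}$$
and $e_1,...,e_K$ are the standard basic orths in $\mathbf{R}^K$.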
Note that $\mathrm{Risk}(\lambda)$ is a convex function of $\lambda$; this function is easy to compute, provided the matrices $M$ and $W_k$, $k\le K$, are available. Assuming this is the case, we can solve the convex optimization problem
$$\mathrm{Opt} = \min_{\lambda\in\mathbf{R}^L}\mathrm{Risk}(\lambda) \tag{3.87}$$
and use the estimate (3.85) associated with optimal solution to this problem; the risk of this estimate will be upper-bounded by Opt.
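To make the second expression in (3.86) concrete, here is a minimal sketch of evaluating the risk bound from given $M$ and $W_k$. The function name `risk_bound` and the toy numbers are hypothetical. The toy data assume the $F_\ell$ are indicators of two disjoint cells, in which case $M_{k\ell}$ is the $P_k$-probability of cell $\ell$ and $W_k$ is the diagonal matrix built from the $k$-th row of $M$.

```python
import numpy as np

def risk_bound(lam, M, W, g, N):
    """Evaluate max_k [ lam^T W_k lam / N + (e_k^T (M lam - g))^2 ]^(1/2).

    M has shape (K, L), W has shape (K, L, L), g has shape (K,).
    """
    lam = np.asarray(lam, dtype=float)
    bias = M @ lam - g                                  # k-th entry: e_k^T [M lam - g]
    var = np.einsum("l,klm,m->k", lam, W, lam) / N      # k-th entry: lam^T W_k lam / N
    return float(np.sqrt(np.max(var + bias ** 2)))

# Toy instance (hypothetical cell probabilities), K = L = 2.
M = np.array([[0.8, 0.2],
              [0.3, 0.7]])
W = np.stack([np.diag(M[k]) for k in range(2)])         # indicator F_l: F_l F_l' = 0 for l != l'
g = np.array([1.0, 0.0])                                # recover mu_1
lam = np.linalg.solve(M, g)                             # makes the bias term vanish
risk = risk_bound(lam, M, W, g, N=100)
print(lam, risk)
```

The choice `lam = M^{-1} g` is just one feasible point with zero bias; solving (3.87) would trade bias against variance and can only give a smaller value of the bound.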
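Solution: Let
$$\Psi(\lambda,\mu) = \mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{[\widehat g_\lambda(\omega^N)-g^T\mu]^2\big\}.$$
Let us associate with $\lambda$ vector
$$\phi_\lambda\in\mathbf{R}^K:\ [\phi_\lambda]_k = \int\Phi_\lambda(\omega)p_k(\omega)\Pi(d\omega),\ k\le K;$$
note that $\phi_\lambda$ is linear in $\lambda$ and that we have $\mathbf{E}_{\omega\sim P_\mu}\{\Phi_\lambda(\omega)\} = \phi_\lambda^T\mu$. Hence,
$$\widehat g_\lambda(\omega^N)-g^T\mu = \frac1N\sum_{t=1}^N[\Phi_\lambda(\omega_t)-\phi_\lambda^T\mu]+[\phi_\lambda-g]^T\mu,$$
and random variables $[\Phi_\lambda(\omega_t)-\phi_\lambda^T\mu]$, $t=1,...,N$, with $\omega_1,...,\omega_N$ sampled, independently from each other, from $P_\mu$ are zero mean i.i.d., implying that
$$\Psi(\lambda,\mu) = \mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{[\widehat g_\lambda(\omega^N)-g^T\mu]^2\big\} = \frac1N\mathbf{E}_{\omega\sim P_\mu}\big\{[\Phi_\lambda(\omega)-\phi_\lambda^T\mu]^2\big\}+([\phi_\lambda-g]^T\mu)^2.$$
That is,
$$\begin{aligned}
\Psi(\lambda,\mu) &= \frac1N\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\big\{[\Phi_\lambda(\omega)-\phi_\lambda^T\mu]^2\big\}+([\phi_\lambda-g]^T\mu)^2\\
&= \frac1N\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\big\{\Phi_\lambda^2(\omega)-2[\phi_\lambda^T\mu]\Phi_\lambda(\omega)+[\phi_\lambda^T\mu]^2\big\}+([\phi_\lambda-g]^T\mu)^2\\
&= \frac1N\Big[\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\{\Phi_\lambda^2(\omega)\}-2[\phi_\lambda^T\mu]\underbrace{\sum_k\mu_k\mathbf{E}_{\omega\sim P_k}\{\Phi_\lambda(\omega)\}}_{\phi_\lambda^T\mu}+[\phi_\lambda^T\mu]^2\Big]+([\phi_\lambda-g]^T\mu)^2\\
&= \frac1N\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\{\Phi_\lambda^2(\omega)\}-\frac1N[\phi_\lambda^T\mu]^2+([\phi_\lambda-g]^T\mu)^2\\
&\le \overline\Psi(\lambda,\mu) := \frac1N\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\{\Phi_\lambda^2(\omega)\}+([\phi_\lambda-g]^T\mu)^2.
\end{aligned}$$
Note that $\Phi_\lambda(\cdot)$ and $\phi_\lambda$ are linear in $\lambda$, implying that $\overline\Psi(\lambda,\mu)$ is convex in $\lambda$. Besides this, we see that this function is convex in $\mu$ as well, so that its maximum over the simplex $\Delta_K$ is attained at a vertex. We conclude that
$$\mathrm{Risk}^2[\widehat g_\lambda] := \max_{\mu\in\Delta_K}\Psi(\lambda,\mu)\le\max_{\mu\in\Delta_K}\overline\Psi(\lambda,\mu) = \mathrm{Risk}^2(\lambda) := \max_k\overline\Psi(\lambda,e_k).$$
In other words,
$$\mathrm{Risk}[\widehat g_\lambda]\le\mathrm{Risk}(\lambda) := \Big[\max_{k\le K}\overline\Psi(\lambda,e_k)\Big]^{1/2} = \max_{k\le K}\Big[\frac1N\lambda^TW_k\lambda+\big[e_k^T[M\lambda-g]\big]^2\Big]^{1/2},$$
$$M = \Big[M_{k\ell} := \int F_\ell(\omega)p_k(\omega)\Pi(d\omega)\Big]_{k\le K,\ \ell\le L},\quad W_k = \Big[[W_k]_{\ell\ell'} := \int F_\ell(\omega)F_{\ell'}(\omega)p_k(\omega)\Pi(d\omega)\Big]_{\ell\le L,\ \ell'\le L},\ 1\le k\le K,$$
as required in (3.86).
✷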
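3.4.A.3. Implementation. When implementing the generic estimate we arrive at the "Measurement Design" question: how do we select the value of $L$ and functions $F_\ell$, $1\le\ell\le L$, resulting in a small (upper bound Opt on the) risk of the estimate (3.85) yielded by an optimal solution to (3.87)? We are about to consider three related options: naive, basic, and Maximum Likelihood (ML).

The naive option is to take $F_\ell = p_\ell$, $1\le\ell\le L=K$, assuming that this selection meets (3.84). For the sake of definiteness, consider the "Gaussian case," where $\Omega=\mathbf{R}^d$, $\Pi$ is the Lebesgue measure, and $p_k$ is Gaussian distribution with parameters $\nu_k$, $\Sigma_k$:
$$p_k(\omega) = (2\pi)^{-d/2}\mathrm{Det}(\Sigma_k)^{-1/2}\exp\big\{-\tfrac12(\omega-\nu_k)^T\Sigma_k^{-1}(\omega-\nu_k)\big\}.$$
In this case, the Naive option leads to easily computable matrices $M$ and $W_k$ appearing in (3.86).

2) Check that in the Gaussian case, when setting
$$\Sigma_{k\ell} = [\Sigma_k^{-1}+\Sigma_\ell^{-1}]^{-1},\quad \Sigma_{k\ell m} = [\Sigma_k^{-1}+\Sigma_\ell^{-1}+\Sigma_m^{-1}]^{-1},\quad \chi_k = \Sigma_k^{-1}\nu_k,$$
$$\alpha_{k\ell} = \sqrt{\frac{\mathrm{Det}(\Sigma_{k\ell})}{(2\pi)^d\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)}},\quad \beta_{k\ell m} = (2\pi)^{-d}\sqrt{\frac{\mathrm{Det}(\Sigma_{k\ell m})}{\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)\mathrm{Det}(\Sigma_m)}},$$
we have
$$\begin{aligned}
M_{k\ell} &:= \int p_\ell(\omega)p_k(\omega)\Pi(d\omega)\\
&= \alpha_{k\ell}\exp\Big\{\tfrac12\Big[[\chi_k+\chi_\ell]^T\Sigma_{k\ell}[\chi_k+\chi_\ell]-\chi_k^T\Sigma_k\chi_k-\chi_\ell^T\Sigma_\ell\chi_\ell\Big]\Big\},\\
[W_k]_{\ell m} &:= \int p_\ell(\omega)p_m(\omega)p_k(\omega)\Pi(d\omega)\\
&= \beta_{k\ell m}\exp\Big\{\tfrac12\Big[[\chi_k+\chi_\ell+\chi_m]^T\Sigma_{k\ell m}[\chi_k+\chi_\ell+\chi_m]-\chi_k^T\Sigma_k\chi_k-\chi_\ell^T\Sigma_\ell\chi_\ell-\chi_m^T\Sigma_m\chi_m\Big]\Big\}.
\end{aligned}$$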
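The closed form for $M_{k\ell}$ in item 2 can be cross-checked against a standard fact: the integral of the product of two Gaussian densities equals the $N(\nu_\ell,\Sigma_k+\Sigma_\ell)$ density evaluated at $\nu_k$. The sketch below compares the two expressions on random data; the function names and the random test matrices are hypothetical, and the reference formula is swapped in for the check rather than taken from the text.

```python
import numpy as np

def m_closed_form(nu_k, S_k, nu_l, S_l):
    """M_kl via the alpha_kl / chi_k expression of item 2."""
    d = len(nu_k)
    Sk_inv, Sl_inv = np.linalg.inv(S_k), np.linalg.inv(S_l)
    S_kl = np.linalg.inv(Sk_inv + Sl_inv)
    chi_k, chi_l = Sk_inv @ nu_k, Sl_inv @ nu_l
    alpha = np.sqrt(np.linalg.det(S_kl)
                    / ((2 * np.pi) ** d * np.linalg.det(S_k) * np.linalg.det(S_l)))
    c = chi_k + chi_l
    expo = 0.5 * (c @ S_kl @ c - chi_k @ S_k @ chi_k - chi_l @ S_l @ chi_l)
    return alpha * np.exp(expo)

def m_reference(nu_k, S_k, nu_l, S_l):
    """Density of N(nu_l, S_k + S_l) at nu_k (standard Gaussian product identity)."""
    d = len(nu_k)
    S = S_k + S_l
    diff = nu_k - nu_l
    quad = diff @ np.linalg.solve(S, diff)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(S))

rng = np.random.default_rng(0)
d = 3

def random_spd():
    B = rng.standard_normal((d, d))
    return B @ B.T + np.eye(d)               # well-conditioned SPD test matrix

nu_k, nu_l = rng.standard_normal(d), rng.standard_normal(d)
S_k, S_l = random_spd(), random_spd()
val_closed = m_closed_form(nu_k, S_k, nu_l, S_l)
val_ref = m_reference(nu_k, S_k, nu_l, S_l)
print(val_closed, val_ref)
```

The same style of check extends to $[W_k]_{\ell m}$ by multiplying three densities, at the cost of a slightly longer reference formula.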
Solution: Verification is straightforward.

Basic option. Though simple, the Naive option does not make much sense: when replacing the reference measure $\Pi$ with another measure $\Pi'$ which has positive density $\theta(\cdot)$ w.r.t. $\Pi$, the densities $p_k$ are updated according to
$$p_k(\cdot)\mapsto p_k'(\cdot) = \theta^{-1}(\cdot)p_k(\cdot),$$
so that selecting $F_\ell' = p_\ell'$, the matrices $M$ and $W_k$ become $M'$ and $W_k'$ with
$$\begin{aligned}
M_{k\ell}' &= \int\frac{p_k(\omega)p_\ell(\omega)}{\theta^2(\omega)}\Pi'(d\omega) = \int\frac{p_k(\omega)p_\ell(\omega)}{\theta(\omega)}\Pi(d\omega),\\
[W_k']_{\ell m} &= \int\frac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^3(\omega)}\Pi'(d\omega) = \int\frac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^2(\omega)}\Pi(d\omega).
\end{aligned}$$
We see that in general $M\ne M'$ and $W_k\ne W_k'$, which makes the Naive option rather unnatural. In the alternative Basic option we set
$$L=K,\quad F_\ell(\omega) = \pi_\ell(\omega) := \frac{p_\ell(\omega)}{\sum_kp_k(\omega)}.$$
The motivation is that the functions $F_\ell$ are invariant when replacing $\Pi$ with $\Pi'$, so that here $M=M'$ and $W_k=W_k'$. Besides this, there are statistical arguments in favor of the Basic option, namely, as follows. Let $\Pi_*$ be the measure with the density $\sum_kp_k(\cdot)$ w.r.t. $\Pi$; taken w.r.t. $\Pi_*$, the densities of $P_k$ are exactly the above $\pi_k(\cdot)$, and $\sum_k\pi_k(\omega)\equiv1$. Now, (3.86) says that the risk of estimate $\widehat g_\lambda$ can be upper-bounded by the function $\mathrm{Risk}(\lambda)$ defined in (3.86), and this function, in turn, can be upper-bounded by the function
$$\begin{aligned}
\mathrm{Risk}^+(\lambda) &:= \Big[\frac1N\sum_k\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2p_k(\omega)\Pi(d\omega)+\max_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]p_k(\omega)\Pi(d\omega)-g^Te_k\Big]^2\Big]^{1/2}\\
&= \Big[\frac1N\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2\Pi_*(d\omega)+\max_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]\pi_k(\omega)\Pi_*(d\omega)-g^Te_k\Big]^2\Big]^{1/2}\\
&\le K\,\mathrm{Risk}(\lambda)
\end{aligned}$$
(we have said that the maximum of $K$ nonnegative quantities is at most their sum, and the latter is at most $K$ times the maximum of the quantities). Consequently, the risk of the estimate (3.85) stemming from an optimal solution to (3.87) can be upper-bounded by the quantity
$$\mathrm{Opt}^+ := \min_\lambda\mathrm{Risk}^+(\lambda)\quad[\ge\mathrm{Opt} := \min_\lambda\mathrm{Risk}(\lambda)].$$
And here comes the punchline:

3.1) Prove that both the quantities Opt defined in (3.87) and the above $\mathrm{Opt}^+$ depend only on the linear span of the functions $F_\ell$, $\ell=1,...,L$, not on how the functions $F_\ell$ are selected in this span.
3.2) Prove that the selection $F_\ell=\pi_\ell$, $1\le\ell\le L=K$, minimizes $\mathrm{Opt}^+$ among all possible selections $L$, $\{F_\ell\}_{\ell=1}^L$ satisfying (3.84).

Conclude that the selection $F_\ell=\pi_\ell$, $1\le\ell\le L=K$, while not necessarily optimal in terms of Opt, definitely is meaningful: this selection optimizes the natural upper bound $\mathrm{Opt}^+$ on Opt. Observe that $\mathrm{Opt}^+\le K\mathrm{Opt}$, so that optimizing instead of Opt the upper bound $\mathrm{Opt}^+$, although rough, is not completely meaningless.

Solution: 3.1: By construction, both $\mathrm{Risk}(\lambda)$ and $\mathrm{Risk}^+(\lambda)$ depend on $\lambda$ solely through the function $\Phi_\lambda(\cdot) = \sum_\ell\lambda_\ell F_\ell(\cdot)$; when $\lambda$ runs through $\mathbf{R}^L$, this function runs exactly through the linear span $\mathcal{L}$ of $F_1(\cdot),...,F_L(\cdot)$. Consequently,
$$\begin{aligned}
\mathrm{Opt}^2 &= \min_{\Phi\in\mathcal{L}}\max_k\Big[\frac1N\int\Phi^2(\omega)p_k(\omega)\Pi(d\omega)+\Big[\int\Phi(\omega)p_k(\omega)\Pi(d\omega)-g^Te_k\Big]^2\Big],\\
[\mathrm{Opt}^+]^2 &= \min_{\Phi\in\mathcal{L}}\Big[\frac1N\int\Phi^2(\omega)\Big[\sum_kp_k(\omega)\Big]\Pi(d\omega)+\max_k\Big[\int\Phi(\omega)p_k(\omega)\Pi(d\omega)-g^Te_k\Big]^2\Big],
\end{aligned} \tag{6.9}$$
so that Opt and $\mathrm{Opt}^+$ depend solely on $\mathcal{L}$.
✷
3.2: We have
$$\mathrm{Risk}^+(\lambda) = \Big[\frac1N\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2\Pi_*(d\omega)+\max_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]\pi_k(\omega)\Pi_*(d\omega)-g^Te_k\Big]^2\Big]^{1/2}.$$
Thus, when replacing $F_\ell$ with their orthogonal, in $L_2[\Omega,\Pi_*]$, projections on the linear span of $\pi_k(\cdot)$, $1\le k\le K$, we can only reduce $\mathrm{Risk}^+(\lambda)$ at every point $\lambda$, implying that $\mathrm{Opt}^+$ as given by (6.9) cannot be smaller than what this relation yields when $\mathcal{L}$ is the linear span of $\pi_1(\cdot),...,\pi_K(\cdot)$, i.e., when setting $F_\ell=\pi_\ell$, $1\le\ell\le L=K$. ✷
A downside of the Basic option is that it seems problematic to get closed form expressions for the associated matrices $M$ and $W_k$; see (3.86). For example, in the Gaussian case, the Naive choice of $F_\ell$'s allows us to represent $M$ and $W_k$ in an explicit closed form; in contrast to this, when selecting $F_\ell=\pi_\ell$, $\ell\le L=K$, seemingly the only way to get $M$ and $W_k$ is to use Monte-Carlo simulations. This being said, we indeed can use Monte-Carlo simulations to compute $M$ and $W_k$, provided we can sample from distributions $P_1,...,P_K$. In this respect, it should be stressed that with $F_\ell\equiv\pi_\ell$, the entries in $M$ and $W_k$ are expectations, w.r.t. $P_1,...,P_K$, of functions of $\omega$ bounded in magnitude by 1, and thus well suited for Monte-Carlo simulation.

Maximum Likelihood option. This choice of $\{F_\ell\}_{\ell\le L}$ follows straightforwardly the idea of discretization we started with in this exercise. Specifically, we split $\Omega$ into $L$ cells $\Omega_1,...,\Omega_L$ in such a way that the intersection of any two different cells is of $\Pi$-measure zero, and treat as our observations not the actual observations $\omega_t$, but the indexes of the cells to which the $\omega_t$'s belong. With our estimation scheme, this is the same as selecting $F_\ell$ as the characteristic function of $\Omega_\ell$, $\ell\le L$. Assuming that for distinct $k$, $k'$ the densities $p_k$, $p_{k'}$ differ from each other $\Pi$-almost surely, the simplest discretization independent of how the reference measure is selected is the Maximum Likelihood discretization
$$\Omega_\ell = \{\omega:\max_kp_k(\omega) = p_\ell(\omega)\},\quad 1\le\ell\le L=K;$$
with the ML option, we take, as $F_\ell$'s, the characteristic functions of the sets $\Omega_\ell$ just defined, $1\le\ell\le L=K$. As with the Basic option, the matrices $M$ and $W_k$ associated with the ML option can be found by Monte-Carlo simulation.

We have discussed three simple options for selecting the $F_\ell$'s. In applications, one can compute the upper risk bounds Opt (see (3.87)) associated with each option, and use the option with the best, that is, the smallest, risk bound ("smart" choice of $F_\ell$'s). Alternatively, one can take as $\{F_\ell,\ell\le L\}$ the union of the three collections yielded by the above options (and, perhaps, further extend this union). Note that the larger is the collection of the $F_\ell$'s, the smaller is the associated Opt, so that the only price for combining different selections is in increasing the computational cost of solving (3.87).

3.4.A.4. Illustration. In the experimental part of this exercise you are expected to

4.1) Run numerical experiments to compare the estimates yielded by the above three options (Naive, Basic, ML). Recommended setup:
• $d=8$, $K=90$;
• Gaussian case with the covariance matrices $\Sigma_k$ of $P_k$ selected at random,
$$S_k = \mathrm{rand}(d,d),\quad \Sigma_k = \frac{S_kS_k^T}{\|S_k\|^2}\quad[\|\cdot\|:\text{ spectral norm}]$$
and the expectations $\nu_k$ of $P_k$ selected at random from $N(0,\sigma^2I_d)$, with $\sigma=0.1$;
• values of $N$: $\{10^s,\ s=0,1,...,5\}$;
• linear form to be recovered: $g^T\mu\equiv\mu_1$.
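The Monte-Carlo computation of $M$ and $W_k$ for the Basic option can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `rand(d, d)` is read as a matrix of i.i.d. uniform [0, 1) entries, `K` and the sample size `n_mc` are reduced from the recommended values to keep the run short, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma, n_mc = 8, 5, 0.1, 2000      # K and n_mc reduced for a quick run

def random_cov(d):
    """Sigma = S S^T / ||S||^2 with S uniform random, ||.|| the spectral norm."""
    S = rng.random((d, d))
    return S @ S.T / np.linalg.norm(S, 2) ** 2

covs = [random_cov(d) for _ in range(K)]
means = sigma * rng.standard_normal((K, d))
chols = [np.linalg.cholesky(C) for C in covs]

def log_densities(x):
    """Return the (n, K) array of log p_k(x_i) for Gaussian p_k."""
    out = np.empty((x.shape[0], K))
    for k in range(K):
        y = np.linalg.solve(chols[k], (x - means[k]).T)   # whitened residuals, (d, n)
        out[:, k] = (-0.5 * np.sum(y ** 2, axis=0)
                     - np.log(np.diag(chols[k])).sum()
                     - 0.5 * d * np.log(2 * np.pi))
    return out

def posteriors(x):
    """pi_l(x) = p_l(x) / sum_k p_k(x), computed stably from log densities."""
    lp = log_densities(x)
    lp -= lp.max(axis=1, keepdims=True)
    w = np.exp(lp)
    return w / w.sum(axis=1, keepdims=True)

M = np.empty((K, K))
W = np.empty((K, K, K))
for k in range(K):
    x = means[k] + rng.standard_normal((n_mc, d)) @ chols[k].T   # samples from P_k
    pi = posteriors(x)
    M[k] = pi.mean(axis=0)          # M_kl  ~ E_{P_k} pi_l
    W[k] = pi.T @ pi / n_mc         # [W_k]_ll' ~ E_{P_k} pi_l pi_l'
print(M.round(3))
```

Since the $\pi_\ell$ sum to one pointwise, each row of the estimated $M$ and each estimated $W_k$ sum to one exactly, which gives a cheap consistency check on the code.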
4.2†). Utilize the Cramer-Rao lower risk bound (see Proposition 4.37, Exercise 4.22) to upper-bound the level of conservatism $\mathrm{Opt}/\mathrm{Risk}_*$ of the estimates built in item 4.1. Here $\mathrm{Risk}_*$ is the minimax risk in our estimation problem:
$$\mathrm{Risk}_* = \inf_{\widehat g(\cdot)}\mathrm{Risk}[\widehat g(\omega^N)] = \inf_{\widehat g(\cdot)}\sup_{\mu\in\Delta_K}\Big[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{|\widehat g(\omega^N)-g^T\mu|^2\big\}\Big]^{1/2},$$
where inf is taken over all estimates.
Solution: 4.1: In our implementation of Basic and ML options, $M$ and $W_k$ were computed by a 30,000-sample Monte-Carlo simulation. Here are the results of a typical experiment following the above setup. Risk bounds for the three estimates we are comparing (that is, the respective values $\mathrm{Opt}_N$, $\mathrm{Opt}_B$, $\mathrm{Opt}_{ML}$ of Opt as given by (3.87)) are as follows:

N        1        10       10^2     10^3     10^4     10^5
Opt_N    0.8639   0.4767   0.1698   0.0553   0.0177   0.0056
Opt_B    0.7071   0.3179   0.1099   0.0355   0.0113   0.0036
Opt_ML   0.7071   0.3206   0.1116   0.0361   0.0115   0.0036

We see that Basic and ML choices of the $F_\ell$'s result in similar risk bounds and in this respect are significantly (although not dramatically) better than the Naive choice. Empirical risks of the estimate yielded by the "smart" choice of $F_\ell$'s, as observed in 100 experiments with randomly selected $\mu$'s, 100 experiments per every tested value of $N$, are as follows:

                   empirical risk
N       Opt_B      mean     median   max
1       0.7071     0.4945   0.4956   0.5055
10      0.3179     0.2966   0.2910   0.3863
10^2    0.1099     0.1097   0.1071   0.1468
10^3    0.0355     0.0350   0.0352   0.0450
10^4    0.0113     0.0106   0.0107   0.0134

Level of conservatism of the estimate yielded by the smart choice of $F_\ell$'s in our experiment was as follows:

N                  1      10     10^2   10^3   10^4   10^5
Opt_B/Risk_* ≤     3.35   3.23   2.59   2.34   2.26   2.23
Our scheme for lower-bounding the minimax risk is as follows. Let us select two points $\underline\mu$, $\overline\mu$ in $\Delta_K$, and let $\|\overline\mu-\underline\mu\|_2 = 2r$, so that $\overline\mu = \underline\mu+2rf$ with $\|\cdot\|_2$-unit vector $f$. From the Cramer-Rao inequality it is easy to derive that the minimax risk corresponding to the stationary $N$-repeated observations can be lower-bounded as follows:
$$\mathrm{Risk}_*\ge\underline{\mathrm{Risk}} = \frac{r|g^Tf|}{r\sqrt{JN}+1},\quad J = \max_{0\le s\le1}\Big[J(s) := \int\frac{[\sum_kf_kp_k(\omega)]^2}{\sum_k\mu_k(s)p_k(\omega)}\Pi(d\omega)\Big],$$
where $f_1,...,f_K$ are the coordinates of $f$, and $\mu_1(s),...,\mu_K(s)$ are the coordinates of the vector $(1-s)\underline\mu+s\overline\mu$. Note that $J(s)$ is convex in $s\in[0,1]$, so that $J = \max[J(0),J(1)]$, and therefore we can compute $J$ by Monte-Carlo simulation.

Lower bounds on $\mathrm{Risk}_*$ underlying the above numbers were obtained in the fashion just described via a simple mechanism for selecting $\underline\mu$ and $\overline\mu$ as follows:
• We specify $\mu_{\min}$ and $\mu_{\max}$ as a minimizer and a maximizer of the linear form $g^T\mu$ on $\Delta_K$, and set $f = [\mu_{\max}-\mu_{\min}]/\|\mu_{\max}-\mu_{\min}\|_2$;
• We build an $m=21$-element equidistant grid $\mu^0=\mu_{\min},\mu^1,...,\mu^{20}=\mu_{\max}$ on $[\mu_{\min},\mu_{\max}]$ and compute by Monte-Carlo simulation the quantities
$$J_i = \int\frac{[\sum_kf_kp_k(\omega)]^2}{\sum_k\mu_k^ip_k(\omega)}\Pi(d\omega),\quad 0\le i\le m-1;$$
• Finally, we apply the above scheme for lower-bounding the minimax risk to each selection $\underline\mu=\mu^i$, $\overline\mu=\mu^j$, $0\le i<j<m$, with $J=\max[J_i,J_j]$ (for the selection associated with $i$, $j$, the quantities $J_i$ and $J_j$ are exactly the above quantities $J(0)$ and $J(1)$). From the $m(m-1)/2$ resulting lower bounds we select the best, that is, the largest, one, and this is the lower bound on the minimax risk which we use to build the above table.

While one can exhibit different attitudes to the claim that an estimation procedure is minimax optimal within a factor like 2.5, one thing is for sure: such a factor is much smaller than similar factors yielded by typical theoretical results on near-optimality of statistical inferences.

3.4.B. Recovering linear images. Now consider the case when $G$ is a general $\nu\times K$ matrix. The analog of the estimate $\widehat g_\lambda(\cdot)$ is now as follows: with somehow chosen $F_1,...,F_L$ satisfying (3.84), we select a $\nu\times L$ matrix $\Lambda = [\lambda_{i\ell}]$, set
$$\Phi_\Lambda(\omega) = \Big[\sum_\ell\lambda_{1\ell}F_\ell(\omega);\ \sum_\ell\lambda_{2\ell}F_\ell(\omega);\ ...;\ \sum_\ell\lambda_{\nu\ell}F_\ell(\omega)\Big],$$
and estimate $G\mu$ by
$$\widehat G_\Lambda(\omega^N) = \frac1N\sum_{t=1}^N\Phi_\Lambda(\omega_t).$$
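5) Prove the following analogy of the results of item 3.4.A:

Proposition 3.21 The risk of the proposed estimator can be upper-bounded as follows:
$$\begin{aligned}
\mathrm{Risk}[\widehat G_\Lambda] &:= \max_{\mu\in\Delta_K}\Big[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{\|\widehat G_\Lambda(\omega^N)-G\mu\|_2^2\big\}\Big]^{1/2}\le\mathrm{Risk}(\Lambda) := \max_{k\le K}\Psi(\Lambda,e_k),\\
\Psi(\Lambda,\mu) &= \Big[\frac1N\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\big\{\|\Phi_\Lambda(\omega)\|_2^2\big\}+\|[\psi_\Lambda-G]\mu\|_2^2\Big]^{1/2}\\
&= \Big[\|[\psi_\Lambda-G]\mu\|_2^2+\frac1N\sum_{k=1}^K\mu_k\int\Big[\sum_{i\le\nu}\Big[\sum_\ell\lambda_{i\ell}F_\ell(\omega)\Big]^2\Big]P_k(d\omega)\Big]^{1/2},
\end{aligned}$$
where
$$\mathrm{Col}_k[\psi_\Lambda] = \mathbf{E}_{\omega\sim P_k(\cdot)}\Phi_\Lambda(\omega) = \begin{bmatrix}\int[\sum_\ell\lambda_{1\ell}F_\ell(\omega)]P_k(d\omega)\\ \cdots\\ \int[\sum_\ell\lambda_{\nu\ell}F_\ell(\omega)]P_k(d\omega)\end{bmatrix},\quad 1\le k\le K,$$
and $e_1,...,e_K$ are the standard basic orths in $\mathbf{R}^K$.

Note that exactly the same reasoning as in the case of the scalar $G\mu\equiv g^T\mu$ demonstrates that a reasonable way to select $L$ and $F_\ell$, $\ell=1,...,L$, is to set $L=K$ and $F_\ell(\cdot)=\pi_\ell(\cdot)$, $1\le\ell\le L$.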
Solution: The proof is given by straightforward modification of the reasoning which led to (3.86).
6.4 SOLUTIONS FOR CHAPTER 4

6.4.1 Linear Estimates vs. Maximum Likelihood
Exercise 4.1. Consider the problem posed at the beginning of Chapter 4: Given observation
$$\omega = Ax+\sigma\xi,\quad \xi\sim N(0,I)$$
of an unknown signal $x$ known to belong to a given signal set $\mathcal{X}\subset\mathbf{R}^n$, we want to recover $Bx$. Let us consider the case where matrix $A$ is square and invertible, $B$ is the identity, and $\mathcal{X}$ is a computationally tractable convex compact set. As far as computational aspects are concerned, the situation is well suited for utilizing the "magic wand" of Statistics, the Maximum Likelihood (ML) estimate, where the recovery of $x$ is
$$\widehat x_{\mathrm{ML}}(\omega) = \mathop{\mathrm{argmin}}_{y\in\mathcal{X}}\|\omega-Ay\|_2, \tag{ML}$$
the signal which maximizes, over $y\in\mathcal{X}$, the likelihood (the probability density) to get the observation we actually got. Indeed, with computationally tractable $\mathcal{X}$, (ML) is an explicit convex, and therefore efficiently solvable, optimization problem. Given the exclusive role played by the ML estimate in Statistics, perhaps the first question about our estimation problem is: how good is the ML estimate?

The goal of this exercise is to show that in the situation we are interested in, the ML estimate can be "heavily nonoptimal," and this may happen even when the techniques we develop in Chapter 4 do result in an efficiently computable near-optimal linear estimate. To justify the claim, investigate the risk (4.2) of the ML estimate in the case where
$$\mathcal{X} = \Big\{x\in\mathbf{R}^n: x_1^2+\epsilon^{-2}\sum_{i=2}^nx_i^2\le1\Big\}\ \ \&\ \ A = \mathrm{Diag}\{1,\epsilon^{-1},...,\epsilon^{-1}\},$$
$\epsilon$ and $\sigma$ are small, and $n$ is large, so that $\sigma^2(n-1)\ge2$. Accompany your theoretical analysis by numerical experiments: compare the empirical risks of the ML estimate with theoretical and empirical risks of the linear estimate optimal under the circumstances. Recommended setup: $n$ runs through $\{256,1024,2048\}$, $\epsilon=\sigma$ runs through $\{0.01;0.05;0.1\}$, and signal $x$ is generated as $x = [\cos(\phi);\sin(\phi)\epsilon\zeta]$, where $\phi\sim\mathrm{Uniform}[0,2\pi]$ and random vector $\zeta$ is independent of $\phi$ and is distributed uniformly on the unit sphere in $\mathbf{R}^{n-1}$.
Solution: We have {Ay : y ∈ X } = {u : kuk2 ≤ 1}, so that x bML (ω) = A−1 argmin kω − uk2 . kuk2 ≤1
Let us look what happens when the true signal x is the first basic orth e1 (which does belong to X ). In this case, x b(Ax + σξ) = A−1 argmin ke1 + σξ − uk2 . kuk2 ≤1
496
SOLUTIONS TO SELECTED EXERCISES
Now, it is easily seen that when η ∼ N (0, Ik ) and δ ∈ (0, 1), we have k [δ + ln(1 − δ)] . Prob kηk22 < (1 − δ)k ≤ exp 2
(!)
In particular,
3k k ≤ exp Probη∼N (0,Ik ) kηk22 < [1/4 + ln(3/4)] ≤ 0.9813, 4 2 implying that when σ 2 (n − 1) ≥ 2 and ξ ∼ N (0, σ 2 In ), the probability of the event n o E = ξ : ke1 + σξk22 ≥ 43 σ 2 (n − 1) & ξ1 ≤ 0 | {z } ≥ 23
is at least some positive absolute constant p C (one can take C = (1 − 0.9813)/2). When E takes place, we have ke1 + σξk2 ≥ 3/2 > 1, whence z := argmin ke1 + σξ − uk2 = [e1 + σξ]/ke1 + σξk2 , kuk2 ≤1
p implying that the first entry in z p is ≤ 2/3 and therefore the norm of the ML recovery error e1 − z is at least 1 − 2/3. We see that under the circumstances the p risk (4.2) of the ML estimate is at least a positive absolute constant (1 − 2/3)C. On the other hand, the risk of the simplest linear estimate
is at most
x b = ω1 = x1 + σξ1
v ) ( u n p X u 2 tmax Eξ∼N σ 2 ξ 2 + xi ≤ σ 2 + ǫ 2 , 1 x∈X
i=2
and we see that when $\sigma$, $\epsilon$ are small and $\sigma^2(n-1) \ge 2$, the risk of the ML estimate can be worse, by a large factor, than the risk of a simple linear estimate. It should be added that we are in the situation when $\mathcal{X}$ is an ellipsoid centered at the origin; in this case, first, it is easy to compute the minimal risk linear estimate, and, second, the latter estimate is near-optimal among all estimates, linear and nonlinear alike, see Proposition 4.5.
Here are the results of our numerical experiments, 1000 simulations for every pair $(n, \sigma)$ with $\epsilon = \sigma$:

          n = 256                  n = 1024                 n = 2048
σ         bound   lin     ML       bound   lin     ML       bound   lin     ML
0.01      0.010   0.010   0.019    0.010   0.010   0.035    0.011   0.011   0.065
0.05      0.061   0.060   0.241    0.064   0.064   0.340    0.066   0.066   0.424
0.10      0.129   0.128   0.434    0.132   0.128   0.511    0.133   0.136   0.570

For each combination of parameters, from left to right: upper risk bound for the linear estimate ("bound"), empirical risk of the linear estimate ("lin"), empirical risk of the ML estimate ("ML").
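The experiment can be mimicked along the following lines. This is only a minimal sketch, not the code behind the table: it assumes the sensing matrix $A = \mathrm{Diag}\{1, 1/\epsilon, ..., 1/\epsilon\}$ (so that $A\mathcal{X}$ is the unit ball), measures risk as the root of the mean squared $\|\cdot\|_2$ error, and uses the simple linear estimate $[\omega_1; 0; ...; 0]$ rather than the optimal one. The function name `simulate_risks` and its parameters are illustrative.

```python
import numpy as np

def simulate_risks(n, eps, sigma, n_sim=1000, seed=0):
    """Empirical risks of the ML estimate and of the simple linear estimate.

    Assumptions (not taken from the text): A = Diag{1, 1/eps, ..., 1/eps},
    risk = sqrt(mean ||xhat - x||_2^2), linear estimate = [omega_1; 0; ...; 0].
    """
    rng = np.random.default_rng(seed)
    a_diag = np.concatenate(([1.0], np.full(n - 1, 1.0 / eps)))
    err_ml = np.empty(n_sim)
    err_lin = np.empty(n_sim)
    for s in range(n_sim):
        phi = rng.uniform(0.0, 2.0 * np.pi)
        zeta = rng.standard_normal(n - 1)
        zeta /= np.linalg.norm(zeta)              # uniform on the unit sphere
        x = np.concatenate(([np.cos(phi)], np.sin(phi) * eps * zeta))
        omega = a_diag * x + sigma * rng.standard_normal(n)
        # ML estimate: project omega onto the unit ball, then apply A^{-1}
        u = omega / max(1.0, np.linalg.norm(omega))
        x_ml = u / a_diag
        x_lin = np.zeros(n)
        x_lin[0] = omega[0]
        err_ml[s] = np.sum((x_ml - x) ** 2)
        err_lin[s] = np.sum((x_lin - x) ** 2)
    return np.sqrt(err_ml.mean()), np.sqrt(err_lin.mean())

risk_ml, risk_lin = simulate_risks(n=256, eps=0.05, sigma=0.05, n_sim=300)
```

Because the signal distribution, the risk definition, and the linear estimate are assumptions here, the numbers produced need not coincide with those in the table; the qualitative gap between the two estimates is what the sketch is meant to show.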
Note that when $\epsilon \le 1$, the risk of the trivial estimate $\widehat{x}(\omega) \equiv 0$ does not exceed 1, and that this trivial risk bound is not that far from the empirical risk of the ML
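estimate when $\sigma \ge 0.05$. To make the solution self-contained, here is the demonstration of (!) via Cramér bounding:
$$\forall \gamma > 0:\ \mathrm{Prob}\{\|\eta\|_2^2 \le (1-\delta)k\}\,\exp\{-\gamma(1-\delta)k\} \le \mathbf{E}\{\exp\{-\gamma\|\eta\|_2^2\}\} = (1+2\gamma)^{-k/2}$$
$$\Rightarrow\ \ln \mathrm{Prob}\{\|\eta\|_2^2 \le (1-\delta)k\} \le \inf_{\gamma > 0}\left[\gamma(1-\delta)k - \tfrac{k}{2}\ln(1+2\gamma)\right] = \tfrac{k}{2}\,[\delta + \ln(1-\delta)].$$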
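As a sanity check of (!), the sketch below compares the bound with a Monte Carlo estimate of the probability and confirms that the infimum over $\gamma$ is attained at $\gamma = \delta/(2(1-\delta))$, the stationary point of the exponent. The values $k = 20$, $\delta = 0.25$ and the sample size are arbitrary illustrative choices.

```python
import numpy as np

def chi2_lower_tail_bound(k, delta):
    """Right-hand side of (!): exp{(k/2)[delta + ln(1 - delta)]}."""
    return np.exp(0.5 * k * (delta + np.log1p(-delta)))

def cramer_exponent(gamma, k, delta):
    """gamma(1-delta)k - (k/2) ln(1 + 2 gamma), minimized over gamma > 0."""
    return gamma * (1.0 - delta) * k - 0.5 * k * np.log1p(2.0 * gamma)

k, delta = 20, 0.25
gamma_star = delta / (2.0 * (1.0 - delta))     # stationary point of the exponent
bound = chi2_lower_tail_bound(k, delta)

# Monte Carlo estimate of Prob{||eta||_2^2 < (1 - delta) k}
rng = np.random.default_rng(0)
eta = rng.standard_normal((100_000, k))
empirical = np.mean(np.sum(eta ** 2, axis=1) < (1.0 - delta) * k)

# Grid search confirming that no gamma does better than gamma_star
grid = np.linspace(1e-4, 5.0, 20_000)
grid_min = cramer_exponent(grid, k, delta).min()
```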
6.4.2 Measurement Design in Signal Recovery
Exercise 4.2. [Measurement Design in Gaussian o.s.] As a preamble to the exercise, please read the story about the possible "physics" of Gaussian o.s. from Section 2.7.3.3. The summary of the story is as follows: We consider the Measurement Design version of signal recovery in Gaussian o.s.; specifically, we are allowed to use observations
$$\omega = A_q x + \sigma\xi \qquad [\xi \sim \mathcal{N}(0, I_m)]$$
where
$$A_q = \mathrm{Diag}\{\sqrt{q_1}, \sqrt{q_2}, ..., \sqrt{q_m}\}A,$$
with a given $A \in \mathbf{R}^{m \times n}$ and vector $q$ which we can select in a given convex compact set $Q \subset \mathbf{R}^m_+$. The signal $x$ underlying the observation is known to belong to a given ellitope $\mathcal{X}$. Your goal is to select $q \in Q$ and a linear recovery $\omega \mapsto G^T\omega$ of the image $Bx$ of $x \in \mathcal{X}$, with given $B$, resulting in the smallest worst-case, over $x \in \mathcal{X}$, expected $\|\cdot\|_2^2$ recovery risk. Modify, according to this goal, problem (4.12). Is it possible to end up with a tractable problem? Work out in full detail the case when $Q = \{q \in \mathbf{R}^m_+ : \sum_i q_i = m\}$.
Solution: For the same reasons as in the case of problem (4.12), we lose nothing by assuming that
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^T S_k x \le t_k,\ 1 \le k \le K\}$$
(we use the same notation and assumptions as everywhere in this chapter). Let us find out the squared risk of a candidate estimate $G^T\omega$. We are in the situation when the $i$-th entry in $\omega$ is $\sqrt{q_i}[Ax]_i + \sigma\xi_i$, where $\xi_1, ..., \xi_m$ are independent $\mathcal{N}(0, 1)$ random variables. Consequently, passing from the matrix variable $G$ in linear recovery to the variable $H = \mathrm{Diag}\{q_1^{1/2}, ..., q_m^{1/2}\}G$, we get
$$\mathbf{E}_{\xi_1,...,\xi_m}\left\{\|G^T\omega - Bx\|_2^2\right\} = \mathbf{E}_{\xi_1,...,\xi_m}\left\{\left\|\sigma H^T \mathrm{Diag}\{q_1^{-1/2}, ..., q_m^{-1/2}\}[\xi_1; ...; \xi_m] + [H^T A - B]x\right\|_2^2\right\}$$
$$= \sigma^2 \sum_{i=1}^m \|\mathrm{Row}_i[H]\|_2^2/q_i + \|[H^T A - B]x\|_2^2,$$
where $\mathrm{Row}_i[H]$ is the transpose of the $i$-th row of $H$. Similarly to the case of (4.12), a natural upper bound on the worst-case, over $x \in \mathcal{X}$, squared $\|\cdot\|_2^2$-risk of recovery is
$$\widehat{R}(H, q) = \sigma^2 \sum_{i=1}^m \|\mathrm{Row}_i[H]\|_2^2/q_i + \min_\lambda\left\{\phi_{\mathcal{T}}(\lambda) : \lambda \ge 0,\ \begin{bmatrix} \sum_k \lambda_k S_k & B^T - A^T H \\ B - H^T A & I_\nu \end{bmatrix} \succeq 0\right\},$$
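and the Measurement Design analog of (4.12) is the optimization problem
$$\mathrm{Opt} = \min_{H, q, \lambda}\left\{\sigma^2 \sum_i q_i^{-1}\mathrm{Row}_i^T[H]\mathrm{Row}_i[H] + \phi_{\mathcal{T}}(\lambda) : \lambda \ge 0,\ q \in Q,\ \begin{bmatrix} \sum_k \lambda_k S_k & B^T - A^T H \\ B - H^T A & I_\nu \end{bmatrix} \succeq 0\right\},$$
where $a^T a/0$ is $+\infty$ or $0$ depending on whether $a$ is nonzero or not. The problem is convex, since the function $a^T a/s$ is convex in $(a, s)$ in the domain $s \ge 0$; indeed, the epigraph of this function is given by the Linear Matrix Inequality
$$\tau \ge \frac{a^T a}{s} \ \Leftrightarrow\ \begin{bmatrix} \tau & a^T \\ a & sI \end{bmatrix} \succeq 0$$
(Schur Complement Lemma) and thus is convex. Finally, when $Q = \{q \ge 0 : \sum_i q_i = m\}$, we can carry out partial minimization in $q$ analytically:
$$\min_{q \in Q} \sum_i \frac{\mathrm{Row}_i^T[H]\mathrm{Row}_i[H]}{q_i}$$
is achieved when
$$q_i = \frac{m\|\mathrm{Row}_i[H]\|_2}{\sum_j \|\mathrm{Row}_j[H]\|_2},\quad 1 \le i \le m,$$
and is equal to
$$\frac{1}{m}\Big(\sum_i \|\mathrm{Row}_i[H]\|_2\Big)^2;$$
as it should be, this quantity is $\le$ its counterpart $\mathrm{Tr}(H^T H) = \sum_i \|\mathrm{Row}_i[H]\|_2^2$ appearing in (4.12).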
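The closed-form partial minimization in $q$ is easy to confirm numerically. The sketch below uses a random $H$ (an illustrative stand-in, not data from the text), evaluates the noise term at the claimed optimal $q$, and compares it with the uniform allocation $q_i \equiv 1$ and with randomly drawn feasible allocations.

```python
import numpy as np

rng = np.random.default_rng(1)
m, nu = 6, 3
H = rng.standard_normal((m, nu))         # illustrative H
r = np.linalg.norm(H, axis=1)            # r_i = ||Row_i[H]||_2

def noise_term(q):
    """sum_i ||Row_i[H]||_2^2 / q_i."""
    return np.sum(r ** 2 / q)

q_opt = m * r / r.sum()                  # claimed minimizer over {q >= 0, sum q = m}
value_opt = noise_term(q_opt)
closed_form = r.sum() ** 2 / m           # (1/m) (sum_i r_i)^2
uniform_value = noise_term(np.ones(m))   # q_i = 1 gives Tr(H^T H)

# Random feasible allocations: Dirichlet samples rescaled to sum to m
trial = rng.dirichlet(np.ones(m), size=1000) * m
trial_values = np.array([noise_term(q) for q in trial])
```

The inequality `closed_form <= uniform_value` is the Cauchy-Schwarz inequality $(\sum_i r_i)^2 \le m\sum_i r_i^2$.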
Exercise 4.3. [follow-up to Exercise 4.2] A translucent bar of length $n = 32$ is comprised of 32 consecutive segments of length 1 each, with density $\rho_i$ of the $i$-th segment known to belong to the interval $[\mu - \delta_i, \mu + \delta_i]$.
[Figure: Sample translucent bar]
The bar is lit from the left end; when light passes through a segment with density $\rho$, the light's intensity is reduced by factor $e^{-\alpha\rho}$. The light intensity at the left endpoint of the bar is 1. You can scan the segments one by one from left to right and measure light intensity $\ell_i$ at the right endpoint of the $i$-th segment during time $q_i$; the result $z_i$ of the measurement is $\ell_i e^{\sigma\xi_i/\sqrt{q_i}}$, where the $\xi_i \sim \mathcal{N}(0, 1)$ are independent across $i$. The total time budget is $n$, and you are interested in recovering the $m = n/2$-dimensional vector of densities of the right $m$ segments. Build an optimization problem responsible for near-optimal linear recovery with and without Measurement Design (in the latter case, we assume that each segment is observed during unit time) and compare the resulting near-optimal risks. Recommended data: $\alpha = 0.01$, $\delta_i = 1.2 + \cos(4\pi(i-1)/n)$, $\mu = 1.1\max_i \delta_i$, $\sigma = 0.001$.
Solution: Passing from measurements to their logarithms, we arrive at observations
$$\bar\omega_i = -\alpha[\rho_1 + ... + \rho_i] + \sigma\xi_i/\sqrt{q_i}.$$
Passing from densities $\rho_i$ to the differences $x_i = \rho_i - \mu$ and from observations $\bar\omega_i$ to observations $\omega_i = \bar\omega_i + \alpha\mu i$, we arrive at the Gaussian o.s.
$$\omega_i = -\alpha[x_1 + ... + x_i] + \sigma\xi_i/\sqrt{q_i},\quad 1 \le i \le n.$$
We are in the situation of Exercise 4.2 with square sensing matrix $A$ proportional, with coefficient $-\alpha$, to the lower-triangular $n \times n$ matrix with all ones in the lower-triangular part, with the "lower half" of the $n \times n$ unit matrix in the role of $B$, and with the box $\{x : |x_i| \le \delta_i,\ 1 \le i \le n\}$ in the role of $\mathcal{X}$; in other words, $\mathcal{X}$ is the ellitope given by
$$\left\{x : \exists t \in [0, 1]^n : x^T S_k x := \delta_k^{-2}x_k^2 \le t_k,\ 1 \le k \le K := n\right\}.$$
Problem (4.12) in this case reads
$$\mathrm{Opt} = \min_{H, \lambda}\left\{\sigma^2\mathrm{Tr}(H^T H) + \sum_k \lambda_k : \lambda \ge 0,\ \begin{bmatrix} \mathrm{Diag}\{\lambda_i/\delta_i^2,\ 1 \le i \le n\} & B^T - A^T H \\ B - H^T A & I_{n/2} \end{bmatrix} \succeq 0\right\}.$$
This problem is different from the one of Measurement Design: in the objective of the latter, the term $\sigma^2\mathrm{Tr}(H^T H)$ is replaced with the term $\frac{\sigma^2}{n}\left(\sum_{i=1}^n \|\mathrm{Row}_i[H]\|_2\right)^2$, and the optimal $q_i$'s are obtained from the optimal solution $(H_*, \lambda_*)$ via the relation
$$q_i^* = \frac{n\|\mathrm{Row}_i[H_*]\|_2}{\sum_j \|\mathrm{Row}_j[H_*]\|_2},\quad 1 \le i \le n.$$
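The data of this exercise can be assembled as follows. This is a sketch of the problem setup only, with the recommended parameter values; it does not solve the semidefinite problems, and the helper name `observe` is illustrative.

```python
import numpy as np

n = 32
m = n // 2
alpha, sigma = 0.01, 0.001
i = np.arange(1, n + 1)
delta = 1.2 + np.cos(4.0 * np.pi * (i - 1) / n)
mu = 1.1 * delta.max()

# Sensing matrix: omega_i = -alpha (x_1 + ... + x_i) + noise
A = -alpha * np.tril(np.ones((n, n)))
# B picks the densities of the right m segments ("lower half" of the unit matrix)
B = np.eye(n)[m:, :]

def observe(rho, q, rng):
    """Log-transformed, re-centered observations for densities rho and times q."""
    ell = np.exp(-alpha * np.cumsum(rho))                  # light intensities
    z = ell * np.exp(sigma * rng.standard_normal(n) / np.sqrt(q))
    return np.log(z) + alpha * mu * i                      # omega_i = log z_i + alpha mu i

rng = np.random.default_rng(0)
x = rng.uniform(-delta, delta)       # a signal from the box |x_i| <= delta_i
rho = mu + x                         # corresponding densities
omega = observe(rho, np.ones(n), rng)
```

With unit observation times, `omega - A @ x` is just the noise vector $\sigma\xi$, which is how the reduction to the Gaussian o.s. can be verified.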
Solving the problems, we find that with no Measurement Design, the risk of the near-optimal linear estimate is ≈ 0.5656; Measurement Design reduces this risk to ≈ 0.4103. The optimal q’s are as shown on the plot below:
Exercise 4.4. Let $X$ be a basic ellitope in $\mathbf{R}^n$:
$$X = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^T S_k x \le t_k,\ 1 \le k \le K\}$$
with our usual restrictions on $S_k$ and $\mathcal{T}$. Let, further, $m$ be a given positive integer, and $x \mapsto Bx : \mathbf{R}^n \to \mathbf{R}^\nu$ be a given linear mapping. Consider the Measurement Design problem where you are looking for a linear recovery $\omega \mapsto \widehat{x}_H(\omega) := H^T\omega$ of $Bx$, $x \in X$, from observation
$$\omega = Ax + \sigma\xi \qquad [\sigma > 0 \text{ is given and } \xi \sim \mathcal{N}(0, I_m)]$$
in which the $m \times n$ sensing matrix $A$ is under your control: it is allowed to be any $m \times n$ matrix of spectral norm not exceeding 1. You are interested in selecting $H$ and $A$ in order to minimize the worst-case, over $x \in X$, expected $\|\cdot\|_2^2$ recovery error. Similarly to (4.12), this problem can be posed as
$$\mathrm{Opt} = \min_{H, \lambda, A}\left\{\sigma^2\mathrm{Tr}(H^T H) + \phi_{\mathcal{T}}(\lambda) : \begin{bmatrix} \sum_k \lambda_k S_k & B^T - A^T H \\ B - H^T A & I_\nu \end{bmatrix} \succeq 0,\ \|A\| \le 1,\ \lambda \ge 0\right\}, \qquad (4.68)$$
where $\|\cdot\|$ stands for the spectral norm. The objective in this problem is the (upper bound on the) squared risk $\mathrm{Risk}^2[\widehat{x}_H|X]$, the sensing matrix being $A$. The problem is nonconvex, since the matrix participating in the semidefinite constraint is bilinear in $H$ and $A$.
A natural way to handle an optimization problem with objective and/or constraints bilinear in the decision variables $u$, $v$ is to use "alternating minimization," where one alternates optimization in $v$ for $u$ fixed and optimization in $u$ for $v$ fixed, the value of the variable fixed in a round being the result of optimization w.r.t. this variable in the previous round. Alternating minimizations are carried out until the value of the objective (which in the outlined process definitely improves from round to round) stops improving (or nearly so). Since the algorithm does not necessarily converge to the globally optimal solution to the problem of interest, it makes sense to run the algorithm several times from different, say, randomly selected, starting points.
Now comes the exercise.
1. Implement Alternating Minimization as applied to (4.68). You may restrict your experimentation to the case where the sizes $m$, $n$, $\nu$ are quite moderate, in the range of tens, and $X$ is either the box $\{x : j^{2\gamma}x_j^2 \le 1,\ 1 \le j \le n\}$, or the ellipsoid $\{x : \sum_{j=1}^n j^{2\gamma}x_j^2 \le 1\}$, where $\gamma$ is a nonnegative parameter (try $\gamma = 0, 1, 2, 3$). As for $B$, you can generate it at random, or enforce $B$ to have prescribed singular values, say, $\sigma_j = j^{-\theta}$, $1 \le j \le \nu$, and a randomly selected system of singular vectors.
2. Identify cases where a globally optimal solution to (4.68) is easy to find and use this information in order to understand how reliable Alternating Minimization is in the application in question, reliability meaning the ability to identify near-optimal, in terms of the objective, solutions. If you are not satisfied with Alternating Minimization "as is," try to improve it.
Solution: When $\sigma$ is small and $m \ge \nu$, we clearly can recover $Bx$ with high accuracy. To this end, we can select $A$ as the projector on the orthogonal complement to the kernel of $B$; note that when $\sigma = 0$, this selection of $A$ allows us to recover $Bx$ exactly. This being said, in our experiments Alternating Minimization "as is" typically was unable to take full advantage of the just outlined favorable circumstances. Seemingly the simplest way to cure this drawback is to start with the singular value decomposition of $B$ and to select the initial $A$ for Alternating Minimization as the projector on the linear span $L$ of the $\mu = \min[m, \mathrm{Rank}(B)]$ leading right singular vectors of $B$ ($L$ can be thought of as the best approximation of $\mathrm{Ker}^\perp(B)$ by a linear subspace of dimension $\le m$).
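The suggested starting point can be sketched as below. One assumption is made to fit the $m \times n$ format of the sensing matrix: the projector on $L$ is represented by the matrix whose first $\mu$ rows are the leading right singular vectors of $B$ and whose remaining rows are zero. The function name `initial_sensing_matrix` and the rank tolerance are illustrative.

```python
import numpy as np

def initial_sensing_matrix(B, m):
    """m x n matrix whose first mu rows are the leading right singular vectors of B.

    Assumption: the projector on L is represented by an orthonormal basis of L
    stacked as rows, padded with zero rows up to m rows.
    """
    _, s, Vt = np.linalg.svd(B, full_matrices=False)
    rank = int(np.sum(s > 1e-10 * s[0]))      # numerical rank of B
    mu = min(m, rank)
    A = np.zeros((m, B.shape[1]))
    A[:mu, :] = Vt[:mu, :]
    return A

rng = np.random.default_rng(0)
nu, n, m = 5, 12, 8
B = rng.standard_normal((nu, n))
A0 = initial_sensing_matrix(B, m)
# With sigma = 0 and m >= Rank(B), H^T = B A0^T recovers Bx exactly: H^T A0 = B
Ht = B @ A0.T
```

Since the rows of `A0` are orthonormal or zero, its spectral norm does not exceed 1, so `A0` is feasible for (4.68).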
3. Modify (4.68) and your experiment to cover the cases where the constraint $\|A\| \le 1$ on the sensing matrix is replaced with one of the following:
• $\|\mathrm{Row}_i[A]\|_2 \le 1$, $1 \le i \le m$,
• $|A_{ij}| \le 1$ for all $i, j$
(note that these two types of restrictions mimic what happens if you are interested in recovering (the linear image of) the vector of parameters in a linear regression model from noisy observations of the model's outputs at $m$ points which you are allowed to select in the unit ball or unit box).
4. [Embedded Exercise] Recall that a $\nu \times n$ matrix $G$ admits singular value decomposition $G = UDV^T$ with orthogonal matrices $U \in \mathbf{R}^{\nu \times \nu}$ and $V \in \mathbf{R}^{n \times n}$ and diagonal $\nu \times n$ matrix $D$ with nonnegative and nonincreasing diagonal entries.[5] These entries are uniquely defined by $G$ and are called singular values $\sigma_i(G)$, $1 \le i \le \min[\nu, n]$. Singular values admit a characterization similar to the variational characterization of eigenvalues of a symmetric matrix; see, e.g., [15, Section A.7.3]:
Theorem 4.23 [VCSV: Variational Characterization of Singular Values] For a $\nu \times n$ matrix $G$ it holds
$$\sigma_i(G) = \min_{E \in \mathcal{E}_i}\ \max_{e \in E, \|e\|_2 = 1} \|Ge\|_2,\quad 1 \le i \le \min[\nu, n],$$
where $\mathcal{E}_i$ is the family of all subspaces in $\mathbf{R}^n$ of codimension $i - 1$.
Corollary 4.24 [SVI: Singular Value Interlacement] Let $G$ and $G'$ be $\nu \times n$ matrices, and let $k = \mathrm{Rank}(G - G')$. Then
$$\sigma_i(G) \ge \sigma_{i+k}(G'),\quad 1 \le i \le \min[\nu, n], \qquad (4.69)$$
where, by definition, singular values of a $\nu \times n$ matrix with indexes $> \min[\nu, n]$ are zeros.
We denote by $\sigma(G)$ the vector of singular values of $G$ arranged in nonincreasing order. The function $\|G\|_{\mathrm{Sh},p} = \|\sigma(G)\|_p$ is called the Shatten $p$-norm of matrix $G$; this indeed is a norm on the space of $\nu \times n$ matrices, and the conjugate norm is $\|\cdot\|_{\mathrm{Sh},q}$, with $\frac{1}{p} + \frac{1}{q} = 1$. An easy and important consequence of Corollary 4.24 is the following fact:
Corollary 4.25 Given a $\nu \times n$ matrix $G$, an integer $k$, $0 \le k \le \min[\nu, n]$, and $p \in [1, \infty]$, (one of) the best approximations of $G$ in the Shatten $p$-norm among matrices of rank $\le k$ is obtained from $G$ by zeroing out all but the $k$ largest singular values, that is, the matrix $G^k = \sum_{i=1}^k \sigma_i(G)\mathrm{Col}_i[U]\mathrm{Col}_i^T[V]$, where $G = UDV^T$ is the singular value decomposition of $G$.
Prove Theorem 4.23 and Corollaries 4.24 and 4.25.
Solution: Let us start with VCSV. Let us denote the right-hand side in the relation of Theorem 4.23 by $\sigma_i'(G)$. Observe that when we multiply $G$ from the left and from the right by orthogonal matrices, the quantities $\sigma_i(G)$ and $\sigma_i'(G)$ remain intact, so that we lose nothing when considering the case when $G$ is a $\nu \times n$ matrix with the only nonzero entries in diagonal cells, these entries being nonnegative and nonincreasing along the diagonal. Given $i \le \min[\nu, n]$, let $E_i$ be the linear subspace of $\mathbf{R}^n$ spanned by the first $i$ standard basic orths in $\mathbf{R}^n$. The dimension of this space is $i$, and for every unit vector $e$ from $E_i$ we have $\|Ge\|_2 \ge \sigma_i(G)$. When $E$ is a linear subspace of $\mathbf{R}^n$ of codimension $i - 1$, the intersection of $E$ and $E_i$ is a nontrivial linear subspace of $\mathbf{R}^n$, so that it contains a unit vector $e$, and since $e \in E_i$, we have $\|Ge\|_2 \ge \sigma_i(G)$. Consequently, $\max_{e \in E : \|e\|_2 = 1}\|Ge\|_2 \ge \sigma_i(G)$, implying that $\sigma_i'(G) \ge \sigma_i(G)$. On the other hand, the linear span $E^i$ of all but the first $i - 1$ standard basic orths in $\mathbf{R}^n$ has codimension $i - 1$, and we clearly have $\max_{e \in E^i : \|e\|_2 = 1}\|Ge\|_2 \le \sigma_i(G)$, implying that $\sigma_i'(G) \le \sigma_i(G)$. The bottom line is that $\sigma_i(G) = \sigma_i'(G)$ for all $i$. ✷
[5] We say that a rectangular matrix $D$ is diagonal if all entries $D_{ij}$ in $D$ with $i \ne j$ are zeros.
Now let us prove SVI. Let us fix $i \le \min[\nu, n]$. The inequality $\sigma_i(G) \ge \sigma_{i+k}(G')$ is evident when $i + k > \min[\nu, n]$, since in this case $\sigma_{i+k}(G') = 0$, while $\sigma_i(G)$ is nonnegative. Now consider the case when $i + k \le \min[\nu, n]$. Let $\Delta = G - G'$, so that the kernel $F$ of $\Delta$ is a linear subspace of $\mathbf{R}^n$ of codimension at most $k = \mathrm{Rank}(\Delta)$. For every linear subspace $E$ of $\mathbf{R}^n$ of codimension $i - 1$, the linear subspace $F \cap E$ of $\mathbf{R}^n$ is of codimension at most $i - 1 + k$, implying by VCSV that $F \cap E$ contains a unit vector $e$ with $\|G'e\|_2 \ge \sigma_{i+k}(G')$. Since $e \in F$, we have $Ge = G'e$, so that $E$ contains a unit vector $e$ with $\|Ge\|_2 \ge \sigma_{i+k}(G')$. Since this is so for every linear subspace $E$ in $\mathbf{R}^n$ of codimension $\le i - 1$, VCSV says that $\sigma_i(G) \ge \sigma_{i+k}(G')$. ✷
Corollary 4.25: The Shatten $p$-norm of $G - G^k$ clearly is the $p$-norm of the $\mu = \min[\nu, n]$-dimensional vector $\bar\sigma$ obtained from $\sigma(G)$ by zeroing out the first $k$ entries. On the other hand, let $H$ be a $\nu \times n$ matrix of rank $\le k$; by SVI, the singular values $\sigma_i(G - H)$ satisfy the inequalities $\sigma_i(G - H) \ge \sigma_{i+k}(G)$, implying that $\|G - H\|_{\mathrm{Sh},p} = \|\sigma(G - H)\|_p \ge \|\bar\sigma\|_p = \|G - G^k\|_{\mathrm{Sh},p}$. Thus, $G^k$ is one of the matrices of rank $\le k$ closest to $G$ in $\|\cdot\|_{\mathrm{Sh},p}$-norm. ✷
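Both corollaries lend themselves to a quick numerical illustration. The sketch below checks SVI on a random rank-$k$ perturbation and compares the truncated SVD against random rank-$k$ competitors in several Shatten norms; random trials illustrate Corollary 4.25 but of course do not prove it. Helper names and sizes are illustrative.

```python
import numpy as np

def best_rank_k(G, k):
    """Truncated SVD: keep the k largest singular values of G."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def schatten_norm(G, p):
    """p-norm of the vector of singular values of G."""
    s = np.linalg.svd(G, compute_uv=False)
    return s.max() if np.isinf(p) else np.sum(s ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
nu, n, k = 6, 9, 2
G = rng.standard_normal((nu, n))
Gk = best_rank_k(G, k)
sig = np.linalg.svd(G, compute_uv=False)

# SVI: sigma_i(G) >= sigma_{i+k}(G') when Rank(G - G') = k
Gp = G + rng.standard_normal((nu, k)) @ rng.standard_normal((k, n))
sig_p = np.linalg.svd(Gp, compute_uv=False)
svi_holds = bool(np.all(sig[: len(sig) - k] >= sig_p[k:] - 1e-12))

# Corollary 4.25: random rank-k competitors never beat the truncated SVD
gaps = []
for p in (1, 2, np.inf):
    best = schatten_norm(G - Gk, p)
    for _ in range(200):
        Hk = rng.standard_normal((nu, k)) @ rng.standard_normal((k, n))
        gaps.append(schatten_norm(G - Hk, p) - best)
```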
5. Consider the Measurement Design problem (4.68) in the case when $X$ is an ellipsoid:
$$X = \Big\{x \in \mathbf{R}^n : \sum_{j=1}^n x_j^2/a_j^2 \le 1\Big\},$$
$A$ is an $m \times n$ matrix of spectral norm not exceeding 1, and there is no noise in the observations: $\sigma = 0$. Find an optimal solution to this problem. Think how this result can be used to get a (hopefully) good starting point for Alternating Minimization in the case when $X$ is an ellipsoid and $\sigma$ is small.
Solution: Without loss of generality, we can assume that $m \le \nu \le n$ (why?). When $\sigma = 0$, (4.68) reduces to minimizing the quantity
$$\max_{x \in X}\|[H^T A - B]x\|_2 = \max_{y \in \mathbf{R}^n, \|y\|_2 \le 1}\|[H^T AD - BD]y\|_2,\quad D = \mathrm{Diag}\{a_1, ..., a_n\},$$
over $H$ and $A$, with $A$ an $m \times n$ matrix of spectral norm $\le 1$. Passing from variables $H$ and $A$ to a single matrix variable $G = H^T AD$ and setting $\bar{B} = BD$, observe that the only constraint on $G$ in the case of $m \le \nu \le n$ is that $\mathrm{Rank}(G) \le m$. Thus, all we need is to find the best, in the spectral norm (i.e., in the Shatten $\infty$-norm), approximation of the matrix $\bar{B}$ by a matrix of rank $\le m$. The solution $G_*$ to this problem is readily given by Corollary 4.25. After $G_*$ is found, it suffices to represent the $\nu \times n$ matrix $G_* D^{-1}$ (which has rank $\le m$ along with $G_*$) as the product of a $\nu \times m$ matrix $H^T$ and an $m \times n$ matrix $A$ of norm not exceeding 1. The decomposition is readily given by the singular value decomposition of $G_* D^{-1}$ with orthogonal factors $U$ and $V$: since this matrix is of rank $\le m$, the number of its nonzero singular values is at most $m$, so that the SVD of the matrix is $G_* D^{-1} = \sum_{i=1}^m \sigma_i \mathrm{Col}_i[U]\mathrm{Col}_i^T[V]$, and we can set $A = [\mathrm{Col}_1^T[V]; ...; \mathrm{Col}_m^T[V]]$ and $H^T = [\sigma_1\mathrm{Col}_1[U], ..., \sigma_m\mathrm{Col}_m[U]]$.
6.4.3 Around semidefinite relaxation
Exercise 4.5. Let $X$ be an ellitope:
$$X = \{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^N, t \in \mathcal{T}) : x = Py,\ y^T S_k y \le t_k,\ k \le K\}$$
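with our standard restrictions on $\mathcal{T}$ and $S_k$. Representing $S_k = \sum_{j=1}^{r_k} s_{kj}s_{kj}^T$, we can pass from the initial ellitopic representation of $X$ to the spectratopic representation of the same set:
$$X = \{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^N, t^+ \in \mathcal{T}^+) : x = Py,\ [s_{kj}^T y]^2 \preceq t^+_{kj}I_1,\ 1 \le k \le K,\ 1 \le j \le r_k\},$$
$$\mathcal{T}^+ = \Big\{t^+ = \{t^+_{kj} \ge 0\} : \exists t \in \mathcal{T} : \sum_{j=1}^{r_k} t^+_{kj} \le t_k,\ 1 \le k \le K\Big\}.$$
If now $C$ is a symmetric $n \times n$ matrix and $\mathrm{Opt}_* = \max_{x \in X} x^T Cx$, we have
$$\mathrm{Opt}_* \le \mathrm{Opt}_e := \min_{\lambda = \{\lambda_k \in \mathbf{R}_+\}}\Big\{\phi_{\mathcal{T}}(\lambda) : P^T CP \preceq \sum_k \lambda_k S_k\Big\},$$
$$\mathrm{Opt}_* \le \mathrm{Opt}_s := \min_{\Lambda = \{\Lambda_{kj} \in \mathbf{R}_+\}}\Big\{\phi_{\mathcal{T}^+}(\Lambda) : P^T CP \preceq \sum_{k,j}\Lambda_{kj}s_{kj}s_{kj}^T\Big\},$$
where the first relation is yielded by the ellitopic representation of $X$ and Proposition 4.6, and the second, on closer inspection (carry this inspection out!), by the spectratopic representation of $X$ and Proposition 4.8. Prove that $\mathrm{Opt}_e = \mathrm{Opt}_s$.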
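Solution: For $\Lambda = \{\Lambda_{kj} \ge 0,\ 1 \le k \le K,\ 1 \le j \le r_k\}$, we have
$$\phi_{\mathcal{T}^+}(\Lambda) = \max_{t_{kj}}\Big\{\sum_{k,j} t_{kj}\Lambda_{kj} : t_{kj} \ge 0,\ \exists t \in \mathcal{T} : \sum_j t_{kj} \le t_k\ \forall k\Big\} = \max_{t \in \mathcal{T}}\sum_{k=1}^K t_k \max_j \Lambda_{kj} = \phi_{\mathcal{T}}(\lambda(\Lambda)),$$
where $\lambda(\Lambda) := [\lambda_1(\Lambda); ...; \lambda_K(\Lambda)]$ and $\lambda_k(\Lambda) = \max_j \Lambda_{kj}$.
It follows that if $\Lambda$ is a feasible solution to the optimization problem specifying $\mathrm{Opt}_s$ and $\overline{\Lambda}$ is obtained from $\Lambda$ by replacing all $\Lambda_{kj}$, $1 \le j \le r_k$, with $\lambda_k(\Lambda)$, then $\overline{\Lambda}$ is another feasible solution to the same problem and with the same value of the objective, namely, the value equal to $\phi_{\mathcal{T}}(\lambda(\Lambda))$. We conclude that
$$\mathrm{Opt}_s = \min_{\Lambda = \{\Lambda_{kj} = \lambda_k \ge 0\}}\Big\{\phi_{\mathcal{T}}(\lambda) : P^T CP \preceq \sum_{k,j}\Lambda_{kj}s_{kj}s_{kj}^T\Big\} = \min_{\lambda = \{\lambda_k \ge 0\}}\Big\{\phi_{\mathcal{T}}(\lambda) : P^T CP \preceq \sum_k \lambda_k \underbrace{\sum_j s_{kj}s_{kj}^T}_{S_k}\Big\} = \mathrm{Opt}_e.$$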
Exercise 4.6. Proposition 4.6 provides us with an upper bound on the quality of semidefinite relaxation as applied to the problem of upper-bounding the maximum of a homogeneous quadratic form over an ellitope. Extend the construction to the case when an inhomogeneous quadratic form is maximized over a shifted ellitope, so that the quantity to upper-bound is
$$\mathrm{Opt} = \max_{x \in X}\left[f(x) := x^T Ax + 2b^T x + c\right],$$
$$X = \{x : \exists (y, t \in \mathcal{T}) : x = Py + p,\ y^T S_k y \le t_k,\ 1 \le k \le K\}$$
with our standard assumptions on $S_k$ and $\mathcal{T}$.
Note: $X$ is centered at $p$, and a natural upper bound on $\mathrm{Opt}$ is
$$\mathrm{Opt} \le f(p) + \widehat{\mathrm{Opt}},$$
where $\widehat{\mathrm{Opt}}$ is an upper bound on the quantity
$$\overline{\mathrm{Opt}} = \max_{x \in X}\,[f(x) - f(p)].$$
What you are interested in upper-bounding is the ratio $\widehat{\mathrm{Opt}}/\overline{\mathrm{Opt}}$.
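Solution: Passing from variables $x$ to variables $u = x - p$ and noting that $f(p + u) - f(p) = u^T Au + 2e^T u$ with $e = Ap + b$, we get
$$\overline{\mathrm{Opt}} = \max_{u, y, t}\left\{u^T Au + 2e^T u : u = Py,\ y^T S_k y \le t_k,\ k \le K,\ t \in \mathcal{T}\right\}$$
$$= \max_{y, t}\left\{y^T[P^T AP]y + 2[P^T e]^T y : y^T S_k y \le t_k,\ k \le K,\ t \in \mathcal{T}\right\}$$
$$\underbrace{=}_{(*)} \max_{y, s, t}\left\{g(y, s) := y^T[P^T AP]y + 2s[P^T e]^T y : y^T S_k y \le t_k,\ k \le K,\ s^2 \le 1,\ t \in \mathcal{T}\right\},$$
where $(*)$ is due to the symmetry w.r.t. the origin of the set $\{y : \exists t \in \mathcal{T} : y^T S_k y \le t_k,\ k \le K\}$. The bottom line is that $\overline{\mathrm{Opt}}$ is the maximum of the homogeneous quadratic form
$$g(y, s) := [y; s]^T\underbrace{\begin{bmatrix} P^T AP & P^T e \\ e^T P & 0 \end{bmatrix}}_{C_+}[y; s]$$
over the ellitope
$$X^+ = \{z := [y; s] : \exists t^+ \in \mathcal{T}^+ : z^T S_k^+ z \le t_k^+,\ 1 \le k \le K^+ = K + 1\},$$
$$\mathcal{T}^+ = \{[t; \tau] : t \in \mathcal{T},\ 0 \le \tau \le 1\},\qquad S_k^+ = \begin{cases}\begin{bmatrix} S_k & \\ & 0 \end{bmatrix}, & k \le K, \\[2mm] \begin{bmatrix} 0 & \\ & 1 \end{bmatrix}, & k = K + 1.\end{cases}$$
Applying Proposition 4.6, the efficiently computable quantity
$$\widehat{\mathrm{Opt}} = \min_\lambda\Big\{\phi_{\mathcal{T}^+}(\lambda) : \lambda \ge 0,\ C_+ \preceq \sum_{k=1}^{K+1}\lambda_k S_k^+\Big\}$$
satisfies
$$\overline{\mathrm{Opt}} \le \widehat{\mathrm{Opt}} \le 3\ln(\sqrt{3}(K + 1))\,\overline{\mathrm{Opt}}.$$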
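The homogenization step can be verified numerically. The sketch below takes the vector $e$ to be $Ap + b$ (which is what the identity $f(p+u) - f(p) = u^T Au + 2e^T u$ requires), builds $C_+$ for random illustrative data, and checks both the identity $f(Py + p) - f(p) = g(y, 1)$ and the sign symmetry $g(y, -1) = g(-y, 1)$ behind step $(*)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 4, 3
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                      # symmetric quadratic part
b = rng.standard_normal(n)
c = 0.7
P = rng.standard_normal((n, N))
p = rng.standard_normal(n)

def f(x):
    """Inhomogeneous quadratic form x^T A x + 2 b^T x + c."""
    return x @ A @ x + 2.0 * b @ x + c

e = A @ p + b                            # linear term after the shift u = x - p
Pte = P.T @ e
C_plus = np.block([[P.T @ A @ P, Pte[:, None]],
                   [Pte[None, :], np.zeros((1, 1))]])

def g(y, s):
    """Homogeneous form [y; s]^T C_+ [y; s]."""
    z = np.append(y, s)
    return z @ C_plus @ z

y = rng.standard_normal(N)
lhs = f(P @ y + p) - f(p)
```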
Exercise 4.7. [estimating Kolmogorov widths of spectratopes/ellitopes]
4.7.A. Preliminaries: Kolmogorov and Gelfand widths. Let $\mathcal{X}$ be a convex compact set in $\mathbf{R}^n$, and let $\|\cdot\|$ be a norm on $\mathbf{R}^n$. Given a linear subspace $E$ in $\mathbf{R}^n$, let
$$\mathrm{dist}_{\|\cdot\|}(x, E) = \min_{z \in E}\|x - z\| : \mathbf{R}^n \to \mathbf{R}_+$$
be the $\|\cdot\|$-distance from $x$ to $E$. The quantity
$$\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E) = \max_{x \in \mathcal{X}}\mathrm{dist}_{\|\cdot\|}(x, E)$$
can be viewed as the worst-case $\|\cdot\|$-accuracy to which vectors from $\mathcal{X}$ can be approximated by vectors from $E$. Given a positive integer $m \le n$ and denoting by $\mathcal{E}_m$ the family of all linear subspaces in $\mathbf{R}^n$ of dimension $m$, the quantity
$$\delta_m(\mathcal{X}, \|\cdot\|) = \min_{E \in \mathcal{E}_m}\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E)$$
can be viewed as the best achievable quality of approximation, measured in $\|\cdot\|$, of vectors from $\mathcal{X}$ by vectors from an $m$-dimensional linear subspace of $\mathbf{R}^n$. This quantity is called the $m$-th Kolmogorov width of $\mathcal{X}$ w.r.t. $\|\cdot\|$.
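Observe that one has
$$\mathrm{dist}_{\|\cdot\|}(x, E) = \max_\xi\{\xi^T x : \|\xi\|_* \le 1,\ \xi \in E^\perp\},\qquad \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E) = \max_{x \in \mathcal{X},\ \|\xi\|_* \le 1,\ \xi \in E^\perp}\xi^T x, \qquad (4.70)$$
where $E^\perp$ is the orthogonal complement to $E$.
1) Prove (4.70). Hint: Represent $\mathrm{dist}_{\|\cdot\|}(x, E)$ as the optimal value in a conic problem on the cone $\mathbf{K} = \{[x; t] : t \ge \|x\|\}$ and use the Conic Duality Theorem.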
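Solution: Representing $E = \mathrm{Ker}\,A$ for a properly selected matrix $A$, we clearly have
$$\mathrm{dist}_{\|\cdot\|}(x, E) = \min_{[z; t]}\{t : Az = 0,\ [x - z; t] \in \mathbf{K}\}, \qquad (*)$$
and the conic problem just specified clearly is strictly feasible and bounded. It is immediately seen that the cone $\mathbf{K}_*$ dual to $\mathbf{K}$ is $\{[y; s] : s \ge \|y\|_*\}$, so that the problem dual to $(*)$ reads
$$\max_{\lambda, y, s}\left\{-y^T x : \|y\|_* \le s,\ -y^T z + ts + \lambda^T Az = t\ \ \forall (z, t)\right\}$$
$$\Leftrightarrow\ \max_{\lambda, y}\left\{-y^T x : \|y\|_* \le 1,\ y = A^T\lambda\right\}$$
$$\Leftrightarrow\ \max_y\left\{y^T x : y \in \mathrm{Im}\,A^T,\ \|y\|_* \le 1\right\}.$$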
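To justify the first equality in (4.70), it remains to note that the optimal value in the dual is equal to the one in the primal, and that $E = \mathrm{Ker}\,A$ implies $E^\perp = \mathrm{Im}\,A^T$. The second equality in (4.70) is an immediate consequence of the first one. ✷
Now consider the case when $\mathcal{X}$ is the unit ball of some norm $\|\cdot\|_{\mathcal{X}}$. In this case (4.70) combines with the definition of Kolmogorov width to imply that
$$\delta_m(\mathcal{X}, \|\cdot\|) = \min_{E \in \mathcal{E}_m}\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, E) = \min_{E \in \mathcal{E}_m}\ \max_{x \in \mathcal{X}}\ \max_{y \in E^\perp, \|y\|_* \le 1} y^T x$$
$$= \min_{E \in \mathcal{E}_m}\ \max_{y \in E^\perp, \|y\|_* \le 1}\ \max_{x : \|x\|_{\mathcal{X}} \le 1} y^T x = \min_{F \in \mathcal{E}_{n-m}}\ \max_{y \in F, \|y\|_* \le 1}\|y\|_{\mathcal{X},*}, \qquad (4.71)$$
where $\|\cdot\|_{\mathcal{X},*}$ is the norm conjugate to $\|\cdot\|_{\mathcal{X}}$. Note that when $\mathcal{Y}$ is a convex compact set in $\mathbf{R}^n$ and $|\cdot|$ is a norm on $\mathbf{R}^n$, the quantity
$$d_m(\mathcal{Y}, |\cdot|) = \min_{F \in \mathcal{E}_{n-m}}\ \max_{y \in \mathcal{Y} \cap F}|y|$$
has a name: it is called the $m$-th Gelfand width of $\mathcal{Y}$ taken w.r.t. $|\cdot|$. The "duality relation" (4.71) states that
When $\mathcal{X}$, $\mathcal{Y}$ are the unit balls of respective norms $\|\cdot\|_{\mathcal{X}}$, $\|\cdot\|_{\mathcal{Y}}$, for every $m < n$ the $m$-th Kolmogorov width of $\mathcal{X}$ taken w.r.t. $\|\cdot\|_{\mathcal{Y},*}$ is the same as the $m$-th Gelfand width of $\mathcal{Y}$ taken w.r.t. $\|\cdot\|_{\mathcal{X},*}$.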
The goal of the remaining part of the exercise is to use our results on the quality of semidefinite relaxation on ellitopes/spectratopes to infer efficiently computable upper bounds on Kolmogorov widths of a given set $\mathcal{X} \subset \mathbf{R}^n$. In the sequel we assume that
• $\mathcal{X}$ is a spectratope,
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists (t \in \mathcal{T}, u) : x = Pu,\ R_k^2[u] \preceq t_k I_{d_k},\ k \le K\};$$
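• The unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$ is a spectratope:
$$\mathcal{B}_* = \{y : \|y\|_* \le 1\} = \{y \in \mathbf{R}^n : \exists (r \in \mathcal{R}, z) : y = Mz,\ S_\ell^2[z] \preceq r_\ell I_{f_\ell},\ \ell \le L\},$$
with our usual restrictions on $\mathcal{T}$, $\mathcal{R}$ and $R_k[\cdot]$ and $S_\ell[\cdot]$.
4.7.B. Simple case: $\|\cdot\| = \|\cdot\|_2$. We start with the simple case where $\|\cdot\| = \|\cdot\|_2$, so that $\mathcal{B}_*$ is the ellitope $\{y : y^T y \le 1\}$. Let $D = \sum_k d_k$ be the size of the spectratope $\mathcal{X}$, and let
$$\kappa = 2\max[\ln(2D), 1].$$
Given integer $m < n$, consider the convex optimization problem
$$\mathrm{Opt}(m) = \min_{\Lambda = \{\Lambda_k, k \le K\}, Y}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda]) : \Lambda_k \succeq 0\ \forall k,\ 0 \preceq Y \preceq I_n,\ \sum_k R_k^*[\Lambda_k] \succeq P^T YP,\ \mathrm{Tr}(Y) = n - m\Big\}. \qquad (P_m)$$
2) Prove the following
Proposition 4.26 Whenever $1 \le \mu \le m < n$, one has
$$\mathrm{Opt}(m) \le \kappa\,\delta_m^2(\mathcal{X}, \|\cdot\|_2)\ \ \&\ \ \delta_m^2(\mathcal{X}, \|\cdot\|_2) \le \frac{m+1}{m+1-\mu}\mathrm{Opt}(\mu). \qquad (4.72)$$
Moreover, the above upper bounds on $\delta_m(\mathcal{X}, \|\cdot\|_2)$ are "constructive," meaning that an optimal solution to $(P_\mu)$, $\mu \le m$, can be straightforwardly converted into a linear subspace $E^{m,\mu}$ of dimension $m$ such that
$$\mathrm{dist}_{\|\cdot\|_2}(\mathcal{X}, E^{m,\mu}) \le \sqrt{\frac{m+1}{m+1-\mu}\mathrm{Opt}(\mu)}.$$
Finally, $\mathrm{Opt}(\mu)$ is nonincreasing in $\mu < n$.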
Solution: Let $E_m$ be an $m$-dimensional linear subspace of $\mathbf{R}^n$ such that $\delta_m(\mathcal{X}, \|\cdot\|_2) = \max_{x \in \mathcal{X}}\mathrm{dist}_{\|\cdot\|_2}(x, E_m)$, and let $Y_m$ be the projector onto the orthogonal complement of $E_m$. The maximum of the quadratic form $x^T Y_m x$ over $\mathcal{X}$ is nothing but $\delta_m^2(\mathcal{X}, \|\cdot\|_2)$, implying by Proposition 4.8 that there exists a collection $\Lambda = \{\Lambda_k \succeq 0, k \le K\}$ such that $\phi_{\mathcal{T}}(\lambda[\Lambda]) \le \kappa\,\delta_m^2(\mathcal{X}, \|\cdot\|_2)$ and $\sum_k R_k^*[\Lambda_k] \succeq P^T Y_m P$. We see that $(\Lambda, Y_m)$ is a feasible solution of $(P_m)$ with the value of the objective at most $\kappa\,\delta_m^2(\mathcal{X}, \|\cdot\|_2)$, which proves the first inequality in (4.72).
To prove the second inequality, let $(\Lambda^*, Y_*)$ be an optimal solution to $(P_\mu)$, and let $\nu_1 \le \nu_2 \le ... \le \nu_n$ be the eigenvalues of $Y_*$, so that $0 \le \nu_s \le 1$, $s \le n$. Denoting by $E^{m,\mu}$ the linear span of the first $m$ eigenvectors of $Y_*$, we have $\dim E^{m,\mu} = m$ and
$$n - \mu = \mathrm{Tr}(Y_*) = \sum_i \nu_i \le (m+1)\nu_{m+1} + (n - m - 1)\ \Rightarrow\ \nu_{m+1} \ge \frac{m+1-\mu}{m+1}$$
$$\Rightarrow\ x^T Y_* x \ge \frac{m+1-\mu}{m+1}\mathrm{dist}^2_{\|\cdot\|_2}(x, E^{m,\mu})\ \Rightarrow\ \mathrm{dist}^2_{\|\cdot\|_2}(x, E^{m,\mu}) \le \frac{m+1}{m+1-\mu}x^T Y_* x$$
$$\Rightarrow\ \delta_m^2(\mathcal{X}, \|\cdot\|_2) \le \max_{x \in \mathcal{X}}\mathrm{dist}^2_{\|\cdot\|_2}(x, E^{m,\mu}) \le \frac{m+1}{m+1-\mu}\max_{x \in \mathcal{X}}x^T Y_* x.$$
On the other hand, $(\Lambda^*, Y_*)$ is feasible for $(P_\mu)$, implying by Proposition 4.8 that $\max_{x \in \mathcal{X}}x^T Y_* x \le \phi_{\mathcal{T}}(\lambda[\Lambda^*]) = \mathrm{Opt}(\mu)$, and we arrive at the second inequality in (4.72).
Finally, if $(\Lambda, Y)$ is feasible for $(P_k)$ and $k \le k' < n$, then $(\Lambda, Y')$, with $Y' = \frac{n-k'}{n-k}Y$, is feasible for $(P_{k'})$, implying that $\mathrm{Opt}(k) \ge \mathrm{Opt}(k')$. ✷
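The eigenvalue argument in the proof can be illustrated numerically. In the sketch below, the matrix `Y` is not the solution of $(P_\mu)$ but a synthetic stand-in satisfying the same constraints $0 \preceq Y \preceq I$, $\mathrm{Tr}(Y) = n - \mu$ (a convex combination of a rank $n - \mu$ projector and a multiple of the identity); the sizes and the mixing weight `theta` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, mu = 10, 6, 3

# Synthetic Y with 0 <= Y <= I and Tr(Y) = n - mu
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthonormal eigenbasis
w = np.zeros(n)
w[mu:] = 1.0                                       # spectrum of a rank n - mu projector
theta = 0.6
eigs = theta * w + (1.0 - theta) * (n - mu) / n
Y = (Q * eigs) @ Q.T

nu_vals, vecs = np.linalg.eigh(Y)   # eigenvalues in ascending order
E = vecs[:, :m]                     # orthonormal basis of E^{m,mu}
nu_m1 = nu_vals[m]                  # (m+1)-st smallest eigenvalue

x = rng.standard_normal(n)
dist2 = np.sum((x - E @ (E.T @ x)) ** 2)   # squared Euclidean distance from x to E^{m,mu}
```

The quantity `x @ Y @ x / nu_m1` is the pointwise version of the bound of Remark 6.15, and `(m + 1) / (m + 1 - mu)` bounds `1 / nu_m1` from above as in Proposition 4.26.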
Remark 6.15. Inspecting the above proof, we arrive at the bound
$$\delta_m^2(\mathcal{X}, \|\cdot\|_2) \le \mathrm{dist}^2_{\|\cdot\|_2}(\mathcal{X}, E^{m,\mu}) \le \nu_{m+1}^{-1}\mathrm{Opt}(\mu),$$
where $\nu_{m+1}$ is the $(m+1)$-st smallest eigenvalue of the $Y$-component of an optimal solution to $(P_\mu)$. This bound holds true whenever $\mu < n$, and when $\mu \le m$, it is at least as good as the upper bound on Kolmogorov width stated in Proposition 4.26.
Remark 6.16. When $\mathcal{X}$ is an ellitope,
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists (t \in \mathcal{T}, y) : x = Py,\ y^T S_k y \le t_k,\ k \le K\},$$
setting
$$\mathrm{Opt}(m) = \min_{\lambda, Y}\Big\{\sum_k \lambda_k : \lambda \ge 0,\ \sum_k \lambda_k S_k \succeq P^T YP,\ 0 \preceq Y \preceq I_n,\ \mathrm{Tr}(Y) = n - m\Big\},$$
the same reasoning as in the proof of the lemma with Proposition 4.6 in the role of Proposition 4.8 results in the validity of (4.72) with $\kappa = 3\ln(\sqrt{3}K)$.
4.7.C. General case. Now consider the case when both $\mathcal{X}$ and the unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$ are spectratopes. As we are about to see, this case is essentially more difficult than the case of $\|\cdot\| = \|\cdot\|_2$, but something still can be done.
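3) Prove the following statement:
(!) Given $m < n$, let $Y$ be a projector of $\mathbf{R}^n$ of rank $n - m$, and let collections $\Lambda = \{\Lambda_k \succeq 0, k \le K\}$ and $\Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}$ satisfy the relation
$$\begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2}P^T YM \\ \frac{1}{2}M^T YP & \sum_\ell S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0. \qquad (4.73)$$
Then
$$\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\,Y) \le \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]).$$
As a result,
$$\delta_m(\mathcal{X}, \|\cdot\|) \le \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\,Y) \qquad (4.74)$$
$$\le \mathrm{Opt} := \min_{\Lambda = \{\Lambda_k, k \le K\},\ \Upsilon = \{\Upsilon_\ell, \ell \le L\}}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l}\Lambda_k \succeq 0\ \forall k,\ \Upsilon_\ell \succeq 0\ \forall \ell, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2}P^T YM \\ \frac{1}{2}M^T YP & \sum_\ell S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0\end{array}\right\}. \qquad (4.75)$$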
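Solution: Let $F = \mathrm{Im}\,Y$, and let $\mathcal{Y} = \{y \in F : \|y\|_* \le 1\}$. Let us fix $y \in \mathcal{Y}$, so that $y = Yy$ due to $y \in F$ and $y = Mz$ with $z$ satisfying for some $r \in \mathcal{R}$ the relations $S_\ell^2[z] \preceq r_\ell I_{f_\ell}$, $\ell \le L$. Now let $x \in \mathcal{X}$, so that $x = Pu$ with $u$ satisfying for some $t \in \mathcal{T}$ the relations $R_k^2[u] \preceq t_k I_{d_k}$, $k \le K$. We have
$$\begin{array}{rcl} x^T y &=& x^T Yy = u^T P^T YMz \\ &\le& u^T\left[\sum_k R_k^*[\Lambda_k]\right]u + z^T\left[\sum_\ell S_\ell^*[\Upsilon_\ell]\right]z \qquad [\text{by (4.73)}] \\ &=& \sum_k \mathrm{Tr}(R_k^*[\Lambda_k][uu^T]) + \sum_\ell \mathrm{Tr}(S_\ell^*[\Upsilon_\ell][zz^T]) \\ &=& \sum_k \mathrm{Tr}(\mathcal{R}_k[uu^T]\Lambda_k) + \sum_\ell \mathrm{Tr}(\mathcal{S}_\ell[zz^T]\Upsilon_\ell) \\ &=& \sum_k \mathrm{Tr}(R_k^2[u]\Lambda_k) + \sum_\ell \mathrm{Tr}(S_\ell^2[z]\Upsilon_\ell) \\ &\le& \sum_k t_k\mathrm{Tr}(\Lambda_k) + \sum_\ell r_\ell\mathrm{Tr}(\Upsilon_\ell) \qquad [\text{due to the origin of } u \text{ and } z] \\ &\le& \theta := \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]), \end{array}$$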
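whence
$$\max_{y \in \mathcal{Y}}\max_{x \in \mathcal{X}} y^T x \le \theta.$$
Recalling that $\mathcal{Y} = \{y \in F : \|y\|_* \le 1\}$ and invoking (4.70), we see that $\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, F^\perp) \le \theta$; noting that $F^\perp = \mathrm{Ker}\,Y$, (4.74) follows. ✷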
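4) Prove the following statement:
(!!) Let $m$, $n$, $Y$ be as in (!). Then
$$\delta_m(\mathcal{X}, \|\cdot\|) \le \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\,Y) \le \widehat{\mathrm{Opt}} := \min_{\nu,\ \Lambda = \{\Lambda_k, k \le K\},\ \Upsilon = \{\Upsilon_\ell, \ell \le L\}}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l}\nu \ge 0,\ \Lambda_k \succeq 0\ \forall k,\ \Upsilon_\ell \succeq 0\ \forall \ell, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2}P^T M \\ \frac{1}{2}M^T P & \sum_\ell S_\ell^*[\Upsilon_\ell] + \nu M^T(I - Y)M \end{bmatrix} \succeq 0\end{array}\right\}, \qquad (4.76)$$
and $\widehat{\mathrm{Opt}} \le \mathrm{Opt}$, with $\mathrm{Opt}$ given by (4.75).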
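Solution: Same as in the proof of (!), let $F = \mathrm{Im}\,Y$ and
$$\mathcal{Y} = \{y \in F : \|y\|_* \le 1\} = \{y \in F : \exists (r \in \mathcal{R}, z) : y = Mz,\ S_\ell^2[z] \preceq r_\ell I_{f_\ell},\ \ell \le L\}.$$
Given $\epsilon > 0$, consider the spectratope
$$\mathcal{Y}_\epsilon = \{y : \exists ([r; \rho] \in \mathcal{R}^+ := \mathcal{R} \times [0, \epsilon], z) : y = Mz,\ S_\ell^2[z] \preceq r_\ell I_{f_\ell},\ \ell \le L,\ z^T M^T(I - Y)Mz \le \rho\}.$$
Observe that for evident reasons we have $\mathcal{Y} \subset \mathcal{Y}_\epsilon$. Now let $\nu, \Lambda, \Upsilon$ be a feasible solution to the optimization problem in (4.76). We have
$$\max_{y \in \mathcal{Y}, x \in \mathcal{X}} y^T x \le \max_{y \in \mathcal{Y}_\epsilon, x \in \mathcal{X}} y^T x$$
$$= \max_{u, t, z, [r; \rho]}\left\{u^T P^T Mz : \begin{array}{l} t \in \mathcal{T},\ R_k^2[u] \preceq t_k I_{d_k}\ \forall k, \\ r \in \mathcal{R},\ \rho \in [0, \epsilon],\ S_\ell^2[z] \preceq r_\ell I_{f_\ell}\ \forall \ell, \\ z^T M^T(I - Y)Mz \le \rho \end{array}\right\}$$
$$\le \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}^+}([\lambda[\Upsilon]; \nu]) = \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \epsilon\nu,$$
where the concluding $\le$ is by Proposition 4.8. Recalling that by (4.70) and the construction of $F$ and $\mathcal{Y}$ it holds
$$\max_{y \in \mathcal{Y}, x \in \mathcal{X}} y^T x = \max_{x \in \mathcal{X},\ \|y\|_* \le 1,\ y \in F} y^T x = \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, F^\perp) = \mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\,Y),$$
we see that $\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\,Y) \le \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \epsilon\nu$ whenever $(\Lambda, \Upsilon, \nu)$ is feasible for the optimization problem in (4.76). Since $\epsilon > 0$ can be made arbitrarily small, (4.76) follows.
It remains to verify that $\widehat{\mathrm{Opt}} \le \mathrm{Opt}$. Indeed, let $\Lambda^o = \{\Lambda_k^o \in \mathbf{S}^{d_k}\}$ be a once and forever fixed collection of positive definite matrices, so that the matrix $Q^o = \sum_k R_k^*[\Lambda_k^o]$ is positive definite by Lemma 4.44. Since $Y$ is a projector, we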
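have
$$\begin{bmatrix} \frac{1}{4}I & \frac{1}{2}(I - Y) \\ \frac{1}{2}(I - Y) & I - Y \end{bmatrix} \succeq 0,$$
whence also
$$\begin{bmatrix} \frac{1}{4}P^T P & \frac{1}{2}P^T(I - Y)M \\ \frac{1}{2}M^T(I - Y)P & M^T(I - Y)M \end{bmatrix} = \mathrm{Diag}\{P^T, M^T\}\begin{bmatrix} \frac{1}{4}I & \frac{1}{2}(I - Y) \\ \frac{1}{2}(I - Y) & I - Y \end{bmatrix}\mathrm{Diag}\{P, M\} \succeq 0,$$
so that for properly selected $a_o > 0$ it holds
$$\begin{bmatrix} \sum_k R_k^*[a_o\Lambda_k^o] & \frac{1}{2}P^T(I - Y)M \\ \frac{1}{2}M^T(I - Y)P & M^T(I - Y)M \end{bmatrix} \succeq 0,$$
whence
$$\begin{bmatrix} \sum_k R_k^*[\epsilon a_o\Lambda_k^o] & \frac{1}{2}P^T(I - Y)M \\ \frac{1}{2}M^T(I - Y)P & \epsilon^{-1}M^T(I - Y)M \end{bmatrix} \succeq 0\quad \forall \epsilon > 0.$$
It follows that whenever $(\Lambda, \Upsilon)$ is feasible for the optimization problem in (4.75) and $\epsilon > 0$, the collection $(\Lambda^\epsilon = \{\Lambda_k + \epsilon a_o\Lambda_k^o\}, \Upsilon, \nu = \epsilon^{-1})$ is feasible for the optimization problem in (4.76): the corresponding matrix in (4.76) is the sum of the matrix in (4.73) and the matrix in the last display. Since $\epsilon > 0$ can be arbitrarily small and the objective of the problem in (4.76) is independent of $\nu$, we arrive at the desired relation $\widehat{\mathrm{Opt}} \le \mathrm{Opt}$. ✷
Statements (!), (!!) suggest the following policy for upper-bounding the Kolmogorov width $\delta_m(\mathcal{X}, \|\cdot\|)$: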
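A. First, we select an integer $\mu$, $1 \le \mu < n$, and solve the convex optimization problem
$$\min_{\Lambda, \Upsilon, Y}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l} 0 \preceq Y \preceq I,\ \mathrm{Tr}(Y) = n - \mu, \\ \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2}P^T YM \\ \frac{1}{2}M^T YP & \sum_\ell S_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \end{array}\right\}. \qquad (P^\mu)$$
B. Next, we take the $Y$-component $Y^\mu$ of the optimal solution to $(P^\mu)$ and "round" it to a projector $Y$ of rank $n - m$ in the same fashion as in the case of $\|\cdot\| = \|\cdot\|_2$, that is, keep the eigenvectors of $Y^\mu$ intact and replace the $m$ smallest eigenvalues with zeros, and all remaining eigenvalues with ones.
C. Finally, we solve the convex optimization problem
$$\mathrm{Opt}_{m,\mu} = \min_{\Lambda, \Upsilon, \nu}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) : \begin{array}{l} \nu \ge 0,\ \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}, \\ \begin{bmatrix} \sum_k R_k^*[\Lambda_k] & \frac{1}{2}P^T M \\ \frac{1}{2}M^T P & \sum_\ell S_\ell^*[\Upsilon_\ell] + \nu M^T(I - Y)M \end{bmatrix} \succeq 0 \end{array}\right\}. \qquad (P^{m,\mu})$$
By (!!), $\mathrm{Opt}_{m,\mu}$ is an upper bound on the Kolmogorov width $\delta_m(\mathcal{X}, \|\cdot\|)$ (and in fact also on $\mathrm{dist}_{\|\cdot\|}(\mathcal{X}, \mathrm{Ker}\,Y)$).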
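The rounding in step B is a few lines of linear algebra. In the sketch below, `Y_mu` is a synthetic symmetric matrix with spectrum in $[0, 1]$ standing in for the $Y$-component of a solution to $(P^\mu)$, which would in practice come from a semidefinite programming solver; the function name `round_to_projector` is illustrative.

```python
import numpy as np

def round_to_projector(Y_mu, m):
    """Keep the eigenvectors of Y_mu; replace the m smallest eigenvalues by 0
    and all remaining eigenvalues by 1."""
    _, vecs = np.linalg.eigh(0.5 * (Y_mu + Y_mu.T))   # eigenvalues in ascending order
    kept = vecs[:, m:]                                # eigenvectors of the n - m largest
    return kept @ kept.T

rng = np.random.default_rng(0)
n, m = 8, 3
M0 = rng.standard_normal((n, n))
S = M0 @ M0.T
Y_mu = S / np.linalg.eigvalsh(S).max()   # stand-in with 0 <= Y_mu <= I
Y = round_to_projector(Y_mu, m)
```

The resulting `Y` is an orthogonal projector of rank $n - m$, and its kernel is the span of the $m$ eigenvectors of `Y_mu` with the smallest eigenvalues, i.e., the candidate approximating subspace.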
Observe all the complications we encounter when passing from the simple case $\|\cdot\| = \|\cdot\|_2$ to the case of a general norm $\|\cdot\|$ with a spectratope as the unit ball of the conjugate norm. Note that Proposition 4.26 gives both a lower bound $\sqrt{\mathrm{Opt}(m)/\kappa}$ on the $m$-th Kolmogorov width of $\mathcal{X}$ w.r.t. $\|\cdot\|_2$, and a family of upper bounds $\sqrt{\frac{m+1}{m+1-\mu}\mathrm{Opt}(\mu)}$, $1 \le \mu \le m$, on this width. As a result, we can approximate $\mathcal{X}$ by $m$-dimensional subspaces in the Euclidean norm in a "nearly optimal" fashion. Indeed, if for some $\epsilon$ and $k$ it holds $\delta_k(\mathcal{X}, \|\cdot\|_2) \le \epsilon$, then $\mathrm{Opt}(k) \le \kappa\epsilon^2$ by Proposition 4.26 as applied with $m = k$. On the other hand, assuming $k < n/2$, the same proposition when applied with $m = 2k$ and $\mu = k$ says that
$$\mathrm{dist}_{\|\cdot\|_2}(\mathcal{X}, E^{m,k}) \le \sqrt{\frac{2k+1}{k+1}\mathrm{Opt}(k)} \le \sqrt{2\,\mathrm{Opt}(k)} \le \sqrt{2\kappa}\,\epsilon.$$
Thus, if $\mathcal{X}$ may somehow be approximated by a $k$-dimensional subspace within $\|\cdot\|_2$-accuracy $\epsilon$, we can efficiently get an approximation of "nearly the same quality" ($\sqrt{2\kappa}\,\epsilon$ instead of $\epsilon$; recall that $\kappa$ is just logarithmic in $D$) and "nearly the same dimension" ($2k$ instead of $k$).
Neither of these options is preserved when passing from the Euclidean norm to a general one: in the latter case, we do not have lower bounds on Kolmogorov widths, and have no understanding of how tight our upper bounds are.
Now, two concluding questions:
5) Why in step A of the above bounding scheme do we utilize statement (!) rather than the less conservative (since $\widehat{\mathrm{Opt}} \le \mathrm{Opt}$) statement (!!)?
6) Implement the scheme numerically and run experiments. Recommended setup:
• Given $\sigma > 0$ and positive integers $n$ and $\kappa$, let $f$ be a function of continuous argument $t \in [0, 1]$ satisfying the smoothness restriction $|f^{(k)}(t)| \le \sigma^k$, $0 \le t \le 1$, $k = 0, 1, 2, ..., \kappa$. Specify $\mathcal{X}$ as the set of $n$-dimensional vectors $x$ obtained by restricting $f$ onto the $n$-point equidistant grid $\{t_i = i/n\}_{i=1}^n$. To this end, translate the description of $f$ into a bunch of two-sided linear constraints on $x$:
$$|d_{(k)}^T[x_i; x_{i+1}; ...; x_{i+k}]| \le \sigma^k,\quad 1 \le i \le n - k,\ 0 \le k \le \kappa,$$
where $d_{(k)} \in \mathbf{R}^{k+1}$ is the vector of coefficients of the finite-difference approximation, with resolution $1/n$, of the $k$-th derivative:
$d_{(0)} = 1$, $d_{(1)} = n[-1; 1]$, $d_{(2)} = n^2[1; -2; 1]$, $d_{(3)} = n^3[-1; 3; -3; 1]$, $d_{(4)} = n^4[1; -4; 6; -4; 1]$, ...
• Recommended parameters: $n = 32$, $m = 8$, $\kappa = 5$, $\sigma \in \{0.25, 0.5, 1, 2, 4\}$.
• Run experiments with $\|\cdot\| = \|\cdot\|_1$ and $\|\cdot\| = \|\cdot\|_2$.
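The linear constraints describing $\mathcal{X}$ in the recommended setup can be generated as follows. The sketch builds the finite-difference vectors $d_{(k)}$ as $n^k$ times alternating binomial coefficients, which reproduces the vectors listed above, and stacks the constraints into a matrix `C` and right-hand side `rhs` so that $\mathcal{X} = \{x : |Cx| \le \mathrm{rhs}\}$ entrywise. The function names are illustrative.

```python
import numpy as np
from math import comb

def fd_coeffs(k, n):
    """d_(k): coefficients of the k-th finite difference with resolution 1/n."""
    signs_binom = [(-1) ** (k - j) * comb(k, j) for j in range(k + 1)]
    return float(n) ** k * np.array(signs_binom, dtype=float)

def smoothness_constraints(n, kappa, sigma):
    """Rows c and bounds b such that X = {x : |c^T x| <= b for every row}."""
    rows, bounds = [], []
    for k in range(kappa + 1):
        d = fd_coeffs(k, n)
        for i in range(n - k):              # i = 1, ..., n - k in the text's indexing
            row = np.zeros(n)
            row[i:i + k + 1] = d
            rows.append(row)
            bounds.append(sigma ** k)
    return np.array(rows), np.array(bounds)

C, rhs = smoothness_constraints(n=32, kappa=5, sigma=1.0)
x_const = np.full(32, 0.5)                  # samples of the constant function f = 0.5
```

With $n = 32$ and $\kappa = 5$ this gives $\sum_{k=0}^{5}(32 - k) = 177$ two-sided constraints; a constant vector with entries of magnitude at most 1 satisfies all of them, since its differences of order $k \ge 1$ vanish.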
Solution: Were we able to optimize in (4.76) over (Λ, Υ) and over projector $Y$ of rank $n - m$, this is definitely what we would do. The problem is that projectors form a nonconvex set, so that we should relax the constraint “$Y$ is a projector.” Now, whatever be a convex outer approximation $\mathcal{P}$ of the set of all projectors of a given positive rank $n - m$, it must contain the matrix $Y_* = \frac{n-m}{n} I$, since this matrix is the average of a properly selected sequence of projectors of rank $n - m$. As a result, setting $Y = Y_*$ and selecting small enough positive definite $\Lambda_k$, $\Upsilon_\ell = 0$, and large enough $\nu$, we get a feasible solution to the relaxed version of (4.76), and the value of the objective at this feasible solution can be made as close to 0 as we want. Thus, all natural convex relaxations of (4.76) are, in a sense, trivial and do not provide any clue to what a good projector $Y$ is. In contrast to this, the optimization problem in (4.76) combines with our “rounding” to yield a “qualified guess” on what may be a good $Y$.

Here are some numerical results:

σ                                    0.25     0.50     1.00     2.00     4.00
dist_{‖·‖_1}(X, E)                   0.0030   0.0104   0.0025   0.0197   0.0462
dist_{‖·‖_2}(X, F)                   0.0015   0.0036   0.0118   0.0105   0.0493

As recommended, we used $n = 32$ and $m = 8$; $E$ and $F$ stand for the resulting eight-dimensional subspaces obtained when minimizing the $\|\cdot\|_1$-width and the $\|\cdot\|_2$-width, respectively.
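The claim that $\frac{n-m}{n} I$ is an average of rank $n - m$ projectors can be checked on the simplest family, the diagonal projectors onto coordinate subspaces. The sketch below is only an illustration with toy sizes; the function name is hypothetical, and exact rational arithmetic is used so that the comparison is exact.

```python
from itertools import combinations
from fractions import Fraction

def average_coordinate_projector(n, rank):
    """Diagonal of the average of all diagonal 0/1 projectors of the given rank.

    Each such projector is diagonal, so the average is diagonal as well and
    it suffices to track the diagonal entries.
    """
    subsets = list(combinations(range(n), rank))
    diag = [Fraction(0)] * n
    for subset in subsets:
        for i in subset:
            diag[i] += Fraction(1, len(subsets))
    return diag

n, m = 6, 2
avg = average_coordinate_projector(n, n - m)
print(avg == [Fraction(n - m, n)] * n)
```

Every coordinate belongs to the same share $(n-m)/n$ of the subsets, which is why the average is a multiple of the identity.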
Notice the “counterintuitive” phenomena exhibited by the presented results: the actual $\|\cdot\|_p$-widths should decrease in $p$ and increase in $\sigma$ (since $\|\cdot\|_p$ decreases as $p$ grows, and $\mathcal{X}$ grows as $\sigma$ grows). We see from the table that some of these relations are violated by our bounds. This is unpleasant, but not surprising: what we compute are conservative upper bounds on the actual widths, and there is no guarantee that these bounds preserve inequalities between the widths.

Exercise 4.8. [more on semidefinite relaxation] The goal of this exercise is to extend SDP relaxation beyond ellitopes/spectratopes. SDP relaxation is aimed at upper-bounding the quantity
$$\mathrm{Opt}_{\mathcal{X}}(B) = \max_{x \in \mathcal{X}} x^T B x \qquad [B \in \mathbf{S}^n]$$
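where $\mathcal{X} \subset \mathbf{R}^n$ is a given set (which we from now on assume to be nonempty convex compact). To this end we look for a computationally tractable convex compact set $\mathcal{U} \subset \mathbf{S}^n$ such that for every $x \in \mathcal{X}$ it holds $xx^T \in \mathcal{U}$; in this case, we refer to $\mathcal{U}$ as a set matching $\mathcal{X}$ (equivalent wording: “$\mathcal{U}$ matches $\mathcal{X}$”). Given such a set $\mathcal{U}$, the optimal value in the convex optimization problem
$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_{U \in \mathcal{U}} \mathrm{Tr}(BU) \qquad (4.77)$$
is an efficiently computable convex upper bound on $\mathrm{Opt}_{\mathcal{X}}(B)$. Given $\mathcal{U}$ matching $\mathcal{X}$, we can pass from $\mathcal{U}$ to the conic hull of $\mathcal{U}$, that is, to the set
$$\mathbf{U}[\mathcal{U}] = \mathrm{cl}\{(U, \mu) \in \mathbf{S}^n \times \mathbf{R}_+ : \mu > 0, \; U/\mu \in \mathcal{U}\}$$
which, as is immediately seen, is a closed convex cone contained in $\mathbf{S}^n \times \mathbf{R}_+$. The only point $(U, \mu)$ in this cone with $\mu = 0$ has $U = 0$ (since $\mathcal{U}$ is compact), and
$$\mathcal{U} = \{U : (U, 1) \in \mathbf{U}\} = \{U : \exists \mu \le 1 : (U, \mu) \in \mathbf{U}\},$$
so that the definition of $\mathrm{Opt}_{\mathcal{U}}$ can be rewritten equivalently as
$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_{U, \mu} \{\mathrm{Tr}(BU) : (U, \mu) \in \mathbf{U}, \; \mu \le 1\}.$$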
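The question, of course, is where to take a set $\mathcal{U}$ matching $\mathcal{X}$, and the answer depends on what we know about $\mathcal{X}$. For example, when $\mathcal{X}$ is a basic ellitope:
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^T S_k x \le t_k, \; k \le K\}$$
with our usual restrictions on $\mathcal{T}$ and $S_k$, it is immediately seen that
$$x \in \mathcal{X} \Rightarrow xx^T \in \mathcal{U} = \{U \in \mathbf{S}^n : U \succeq 0, \; \exists t \in \mathcal{T} : \mathrm{Tr}(U S_k) \le t_k, \; k \le K\}.$$
Similarly, when $\mathcal{X}$ is a basic spectratope,
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : S_k^2[x] \preceq t_k I_{d_k}, \; k \le K\}$$
with our usual restrictions on $\mathcal{T}$ and $S_k[\cdot]$, it is immediately seen that
$$x \in \mathcal{X} \Rightarrow xx^T \in \mathcal{U} = \{U \in \mathbf{S}^n : U \succeq 0, \; \exists t \in \mathcal{T} : S_k[U] \preceq t_k I_{d_k}, \; k \le K\}.$$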
One can verify that the semidefinite relaxation bounds on the maximum of a quadratic form on an ellitope/spectratope $\mathcal{X}$ derived in Sections 4.2.3 (for ellitopes) and 4.3.2 (for spectratopes) are nothing but the bounds (4.77) associated with the $\mathcal{U}$ just defined.

4.8.A Matching via absolute norms. There are other ways to specify a set matching $\mathcal{X}$. The seemingly simplest of them is as follows. Let $p(\cdot)$ be an absolute norm on $\mathbf{R}^n$ (recall that this is a norm $p(x)$ which depends solely on $\mathrm{abs}[x]$, where $\mathrm{abs}[x]$ is the vector comprised of the magnitudes of entries in $x$). We can convert $p(\cdot)$ into the norm $p^+(\cdot)$ on the space $\mathbf{S}^n$ as follows:
$$p^+(U) = p([p(\mathrm{Col}_1[U]); ...; p(\mathrm{Col}_n[U])]) \qquad [U \in \mathbf{S}^n]$$
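1.1) Prove that $p^+$ indeed is a norm on $\mathbf{S}^n$, and $p^+(xx^T) = p^2(x)$. Denoting by $q(\cdot)$ the norm conjugate to $p(\cdot)$, what is the relation between the norm $(p^+)_*(\cdot)$ conjugate to $p^+(\cdot)$ and the norm $q^+(\cdot)$?

1.2) Derive from 1.1 that whenever $p(\cdot)$ is an absolute norm such that $\mathcal{X}$ is contained in the unit ball $B_{p(\cdot)} = \{x : p(x) \le 1\}$ of the norm $p$, the set
$$\mathcal{U}_{p(\cdot)} = \{U \in \mathbf{S}^n : U \succeq 0, \; p^+(U) \le 1\}$$
is matching $\mathcal{X}$. If, in addition,
$$\mathcal{X} \subset \{x : p(x) \le 1, \; Px = 0\}, \qquad (4.78)$$
then the set
$$\mathcal{U}_{p(\cdot),P} = \{U \in \mathbf{S}^n : U \succeq 0, \; p^+(U) \le 1, \; PU = 0\}$$
is matching $\mathcal{X}$.

Assume that in addition to $p(\cdot)$, we have at our disposal a computationally tractable closed convex set $\mathcal{D}$ such that whenever $p(x) \le 1$, the vector $[x]^2 := [x_1^2; ...; x_n^2]$ belongs to $\mathcal{D}$; in the sequel we call such a $\mathcal{D}$ square-dominating $p(\cdot)$. For example, when $p(\cdot) = \|\cdot\|_r$, we can take
$$\mathcal{D} = \begin{cases} \{y \in \mathbf{R}^n_+ : \sum_i y_i \le 1\}, & r \le 2, \\ \{y \in \mathbf{R}^n_+ : \|y\|_{r/2} \le 1\}, & r > 2. \end{cases}$$
Prove that in this situation the above construction can be refined: whenever $\mathcal{X}$ satisfies (4.78), the set
$$\mathcal{U}^{\mathcal{D}}_{p(\cdot),P} = \{U \in \mathbf{S}^n : U \succeq 0, \; p^+(U) \le 1, \; PU = 0, \; \mathrm{dg}(U) \in \mathcal{D}\} \qquad [\mathrm{dg}(U) = [U_{11}; U_{22}; ...; U_{nn}]]$$
matches $\mathcal{X}$.

Note: in the sequel, we suppress $P$ in the notation $\mathcal{U}_{p(\cdot),P}$ and $\mathcal{U}^{\mathcal{D}}_{p(\cdot),P}$ when $P = 0$; thus, $\mathcal{U}_{p(\cdot)}$ is the same as $\mathcal{U}_{p(\cdot),0}$.

1.3) Check that when $p(\cdot) = \|\cdot\|_r$ with $r \in [1, \infty]$, one has
$$p^+(U) = \|U\|_r := \begin{cases} \left(\sum_{i,j} |U_{ij}|^r\right)^{1/r}, & 1 \le r < \infty, \\ \max_{i,j} |U_{ij}|, & r = \infty. \end{cases}$$

1.4) Let $\mathcal{X} = \{x \in \mathbf{R}^n : \|x\|_1 \le 1\}$ and $p(x) = \|x\|_1$, so that $\mathcal{X} \subset \{x : p(x) \le 1\}$, and
$$\mathrm{Conv}\{[x]^2 : x \in \mathcal{X}\} \subset \mathcal{D} = \Big\{y \in \mathbf{R}^n_+ : \sum_i y_i \le 1\Big\}. \qquad (4.79)$$
What are the bounds $\mathrm{Opt}_{\mathcal{U}_{p(\cdot)}}(B)$ and $\mathrm{Opt}_{\mathcal{U}^{\mathcal{D}}_{p(\cdot)}}(B)$? Is it true that the former (the latter) of the bounds is precise? Is it true that the former (the latter) bound is precise when $B \succeq 0$?

1.5) Let $\mathcal{X} = \{x \in \mathbf{R}^n : \|x\|_2 \le 1\}$ and $p(x) = \|x\|_2$, so that $\mathcal{X} \subset \{x : p(x) \le 1\}$ and (4.79) holds true. What are the bounds $\mathrm{Opt}_{\mathcal{U}_{p(\cdot)}}(B)$ and $\mathrm{Opt}_{\mathcal{U}^{\mathcal{D}}_{p(\cdot)}}(B)$? Is the former (or the latter) bound precise?

1.6) Let $\mathcal{X} \subset \mathbf{R}^n_+$ be closed, convex, bounded, and with a nonempty interior. Verify that the set
$$\mathcal{X}^+ = \{x \in \mathbf{R}^n : \exists y \in \mathcal{X} : \mathrm{abs}[x] \le y\}$$
is the unit ball of an absolute norm $p_{\mathcal{X}}$, and this is the largest absolute norm $p(\cdot)$ such that $\mathcal{X} \subset \{x : p(x) \le 1\}$. Derive from this observation that the norm $p_{\mathcal{X}}(\cdot)$ is the best (i.e., resulting in the least conservative bounding scheme) among absolute norms which allow us to upper-bound $\mathrm{Opt}_{\mathcal{X}}(B)$ via the construction from item 1.2.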
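Before proving items 1.1 and 1.3 it may help to check them numerically. The sketch below is only an illustration, not part of the exercise: the helper names are hypothetical, and matrices are stored as lists of rows. It evaluates $p^+$ for $p = \|\cdot\|_r$ on a rank-one matrix $xx^T$ and compares the result with $p^2(x)$ and with the entrywise $\ell_r$ norm.

```python
def lr_norm(x, r):
    """The l_r norm of a vector; r = float('inf') gives the max-norm."""
    mags = [abs(t) for t in x]
    if r == float("inf"):
        return max(mags)
    return sum(t**r for t in mags) ** (1.0 / r)

def p_plus(U, p):
    """p^+(U) = p([p(Col_1[U]); ...; p(Col_n[U])]) for a square matrix U."""
    n = len(U)
    cols = [[U[i][j] for i in range(n)] for j in range(n)]
    return p([p(c) for c in cols])

x = [0.5, -1.0, 2.0]
U = [[a * b for b in x] for a in x]          # U = x x^T

for r in (1, 2, 3, float("inf")):
    p = lambda v, r=r: lr_norm(v, r)
    entrywise = lr_norm([u for row in U for u in row], r)
    print(r, p_plus(U, p), lr_norm(x, r) ** 2, entrywise)
```

For each $r$ the three printed numbers should agree up to rounding, in line with $p^+(xx^T) = p^2(x)$ and with the formula of item 1.3.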
Solution: 1.1: Verification that $p^+$ is a norm is completely straightforward. For example, here is the derivation of the Triangle inequality:
$$\begin{array}{rcl}
p^+(U + V) &=& p([p(\mathrm{Col}_1[U] + \mathrm{Col}_1[V]); ...; p(\mathrm{Col}_n[U] + \mathrm{Col}_n[V])]) \\
&\le_{(a)}& p([p(\mathrm{Col}_1[U]) + p(\mathrm{Col}_1[V]); ...; p(\mathrm{Col}_n[U]) + p(\mathrm{Col}_n[V])]) \\
&\le& p([p(\mathrm{Col}_1[U]); ...; p(\mathrm{Col}_n[U])]) + p([p(\mathrm{Col}_1[V]); ...; p(\mathrm{Col}_n[V])]) \\
&=& p^+(U) + p^+(V),
\end{array}$$
where (a) is due to the entrywise inequalities $p(\mathrm{Col}_i[U] + \mathrm{Col}_i[V]) \le p(\mathrm{Col}_i[U]) + p(\mathrm{Col}_i[V])$, $i \le n$, and to the fact that every absolute norm is entrywise monotone on the nonnegative orthant. Next, we have
$$p^+(xx^T) = p([|x_1| p(x); |x_2| p(x); ...; |x_n| p(x)]) = p(x) p(\mathrm{abs}[x]) = p^2(x).$$
Finally, we have $(p^+)_*(\cdot) \le q^+(\cdot)$, and the inequality may be strict. Indeed, to verify the inequality, let $V \in \mathbf{S}^n$. Then for every $U \in \mathbf{S}^n$, denoting by $u$ the vector comprised of $p(\cdot)$-norms of the columns in $U$, and by $v$ the vector comprised of $q(\cdot)$-norms of the columns of $V$, it holds (recall that $U$ and $V$ are symmetric)
$$\mathrm{Tr}(VU) = \sum_i \mathrm{Col}_i^T[V] \mathrm{Col}_i[U] \le \sum_i v_i u_i \le q(v) p(u) = q^+(V) p^+(U),$$
implying that
$$(p^+)_*(V) = \max_{U \in \mathbf{S}^n : p^+(U) \le 1} \mathrm{Tr}(VU) \le q^+(V).$$
To check that in general the inequality is strict, it suffices to compare $(p^+)_*$ and $q^+$ numerically for, say, $n = 5$ and $p(x) = \|x\|_{(2)}$, where $\|x\|_{(k)}$ is the sum of the $k$ largest magnitudes of entries in $x$. Of course, sometimes $(p^+)_* \equiv q^+$; this is so, e.g., when $p(\cdot) = \|\cdot\|_r$ with $r \in [1, \infty]$. ✷

1.2 is immediate: when $\mathcal{X} \subset \{x : p(x) \le 1\}$, we have
$$x \in \mathcal{X} \Rightarrow p(x) \le 1 \Rightarrow p^+(xx^T) = p^2(x) \le 1 \Rightarrow xx^T \in \mathcal{U}_{p(\cdot)} = \{U \in \mathbf{S}^n : U \succeq 0, \; p^+(U) \le 1\}.$$
Besides this, $\mathrm{dg}[xx^T] = [x]^2$, and if $Px = 0$ for all $x \in \mathcal{X}$, then also $P[xx^T] = 0$ for all $x \in \mathcal{X}$. ✷

1.3: Evident.

1.4: Observing that $p^+(U) = \sum_{i,j} |U_{ij}|$, we see that whenever $U \succeq 0$ and $p^+(U) \le 1$, we have also $\mathrm{dg}(U) \in \mathcal{D}$, implying that $\mathcal{U}^{\mathcal{D}}_{\|\cdot\|_1} = \mathcal{U}_{\|\cdot\|_1}$; thus, the bounds in question are identical. We have
$$\begin{array}{rcl}
\mathrm{Opt}_{\mathcal{U}_{\|\cdot\|_1}}(B) &=& \max_U \Big\{\mathrm{Tr}(BU) : U \succeq 0, \; \sum_{i,j} |U_{ij}| \le 1\Big\} \\
&\le& \max_U \Big\{\sum_{i,j} B_{ij} U_{ij} : \sum_{i,j} |U_{ij}| \le 1\Big\} \\
&=& \max_{i,j} |B_{ij}|.
\end{array}$$
The bound $\mathrm{Opt}_{\mathcal{U}_{\|\cdot\|_1}}(B)$ in general is strictly greater than $\mathrm{Opt}_{\mathcal{X}}(B) = \max_{\|x\|_1 \le 1} x^T B x$. However, the bound is equal to $\mathrm{Opt}_{\mathcal{X}}(B)$ when $B \succeq 0$. Indeed, in the latter case the function $x^T B x$ is convex, and its maximum over $\mathcal{X}$ is achieved at a vertex of
$\mathcal{X} = \{x : \|x\|_1 \le 1\}$, that is, $\mathrm{Opt}_{\mathcal{X}}(B) = \max_i B_{ii}$. On the other hand, when $B \succeq 0$, we have $|B_{ij}| \le \sqrt{B_{ii} B_{jj}}$, implying that $\max_{i,j} |B_{ij}| = \max_i B_{ii}$.

1.5: We have
$$\begin{array}{rcl}
\mathcal{U}_{\|\cdot\|_2} &=& \{U \in \mathbf{S}^n : U \succeq 0, \; \|U\|_2 \le 1\}, \\
\mathcal{U}^{\mathcal{D}}_{\|\cdot\|_2} &=& \{U \in \mathbf{S}^n : U \succeq 0, \; \|U\|_2 \le 1, \; \mathrm{Tr}(U) \le 1\} \\
&=& \{U \in \mathbf{S}^n : U \succeq 0, \; \mathrm{Tr}(U) \le 1\},
\end{array}$$
and
$$\mathrm{Opt}_{\mathcal{U}_{\|\cdot\|_2}}(B) = \max_U \{\mathrm{Tr}(BU) : U \succeq 0, \; \|U\|_2 \le 1\}.$$
Let $\lambda(Q)$ be the vector of eigenvalues of $Q \in \mathbf{S}^n$ taken with their multiplicities and arranged in the non-ascending order. Since $\|Q\|_2$ is the Frobenius norm of a matrix $Q$ and thus remains invariant when replacing $Q$ with $VQW$, with orthogonal $V, W$, we can assume w.l.o.g. that $B$ is diagonal, which results in $\mathrm{Opt}_{\mathcal{U}_{\|\cdot\|_2}}(B) = \|\lambda^+[B]\|_2$, where $\lambda^+[B]$ is the vector comprised of the positive parts $\max[\lambda_i, 0]$ of the eigenvalues $\lambda_i$ of $B$. Recalling that under the circumstances we have $\mathrm{Opt}_{\mathcal{X}}(B) = \max[\lambda_1(B), 0]$, we conclude that the bound in question is precise if and only if $B$ has at most one positive eigenvalue, and can be greater than $\mathrm{Opt}_{\mathcal{X}}(B)$ by a factor as large as $\sqrt{n}$, as in the case of $B = I_n$. In contrast to this, the bound $\mathrm{Opt}_{\mathcal{U}^{\mathcal{D}}_{\|\cdot\|_2}}(B)$ is precise for all $B$; indeed, for the same reasons as above, it suffices to verify this claim when $B$ is diagonal, in which case the claim becomes evident. ✷

1.6: It is clear that $\mathcal{X}^+$ is a closed and bounded convex set symmetric w.r.t. the origin and containing a neighbourhood of the origin. Consequently, $\mathcal{X}^+$ is the unit ball of some norm $p_{\mathcal{X}}(\cdot)$. By construction, if $E$ is a diagonal matrix with diagonal entries $\pm 1$, then $Ex \in \mathcal{X}^+$ if and only if $x \in \mathcal{X}^+$, implying that $p_{\mathcal{X}}$ is an absolute norm. Now, if $p$ is an absolute norm such that $\mathcal{X} \subset B := \{x : p(x) \le 1\}$, then $\mathcal{X}^+ \subset B$. Indeed, when $x \in \mathcal{X}^+$, we have $\mathrm{abs}[x] \le y$ for some $y \in \mathcal{X}$, that is, $p(x) = p(\mathrm{abs}[x]) \le p(y) \le 1$, where the first inequality is due to the fact that the absolute norm $p(\cdot)$ is monotone in the nonnegative orthant, and the last inequality is due to $\mathcal{X} \subset B$. Thus, $\mathcal{X}^+ \subset B$, implying that $p(\cdot) \le p_{\mathcal{X}}(\cdot)$ and consequently $\mathcal{U}_{p_{\mathcal{X}}(\cdot)} \subset \mathcal{U}_{p(\cdot)}$. ✷

4.8.B. “Calculus of matchings.” Observe that the matching we have introduced admits a kind of “calculus.” Specifically, consider the situation as follows: for $1 \le \ell \le L$, we are given
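• nonempty convex compact sets $\mathcal{X}_\ell \subset \mathbf{R}^{n_\ell}$, $0 \in \mathcal{X}_\ell$, along with convex compact sets $\mathcal{U}_\ell \subset \mathbf{S}^{n_\ell}$ matching $\mathcal{X}_\ell$ and giving rise to the closed convex cones
$$\mathbf{U}_\ell = \mathrm{cl}\{(U_\ell, \mu_\ell) \in \mathbf{S}^{n_\ell} \times \mathbf{R}_+ : \mu_\ell > 0, \; \mu_\ell^{-1} U_\ell \in \mathcal{U}_\ell\}.$$
We denote by $\vartheta_\ell(\cdot)$ the Minkowski functions of $\mathcal{X}_\ell$:
$$\vartheta_\ell(y^\ell) = \inf\{t : t > 0, \; t^{-1} y^\ell \in \mathcal{X}_\ell\} : \mathbf{R}^{n_\ell} \to \mathbf{R} \cup \{+\infty\};$$
note that $\mathcal{X}_\ell = \{y^\ell : \vartheta_\ell(y^\ell) \le 1\}$;
• $n_\ell \times n$ matrices $A_\ell$ such that $\sum_\ell A_\ell^T A_\ell \succ 0$.

On top of that, we are given a monotone convex set $\mathcal{T} \subset \mathbf{R}^L_+$ intersecting the interior of $\mathbf{R}^L_+$. These data specify the convex set
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : \vartheta_\ell^2(A_\ell x) \le t_\ell, \; \ell \le L\}. \qquad (*)$$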
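2.1) Prove the following

Lemma 4.27 In the situation in question, the set
$$\mathcal{U} = \{U \in \mathbf{S}^n : U \succeq 0 \ \& \ \exists t \in \mathcal{T} : (A_\ell U A_\ell^T, t_\ell) \in \mathbf{U}_\ell, \; \ell \le L\}$$
is a closed and bounded convex set which matches $\mathcal{X}$. As a result, the efficiently computable quantity
$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_U \{\mathrm{Tr}(BU) : U \in \mathcal{U}\}$$
is an upper bound on
$$\mathrm{Opt}_{\mathcal{X}}(B) = \max_{x \in \mathcal{X}} x^T B x.$$

2.2) Prove that if $\mathcal{X} \subset \mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $m \times n$ matrix, and $\mathcal{U}$ matches $\mathcal{X}$, then the set $\mathcal{V} = \{V = PUP^T : U \in \mathcal{U}\}$ matches $\mathcal{Y} = \{y : \exists x \in \mathcal{X} : y = Px\}$.

2.3) Prove that if $\mathcal{X} \subset \mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $n \times m$ matrix of rank $m$, and $\mathcal{U}$ matches $\mathcal{X}$, then the set $\mathcal{V} = \{V \succeq 0 : PVP^T \in \mathcal{U}\}$ matches $\mathcal{Y} = \{y : Py \in \mathcal{X}\}$.

2.4) Consider the “direct product” case where $\mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_L$. When specifying $A_\ell$ as the matrix which “cuts” the $\ell$-th block $A_\ell x = x^\ell$ of a block vector $x = [x^1; ...; x^L] \in \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_L}$ and setting $\mathcal{T} = [0, 1]^L$, we cover this situation by the setup under consideration. In the direct product case, the construction from item 2.1 is as follows: given the sets $\mathcal{U}_\ell$ matching $\mathcal{X}_\ell$, we build the set
$$\mathcal{U} = \{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1 + ... + n_L} : U \succeq 0, \; U^{\ell\ell} \in \mathcal{U}_\ell, \; \ell \le L\}$$
and claim that this set matches $\mathcal{X}$. Could we be less conservative? While we do not know how to be less conservative in general, we do know how to be less conservative in the special case when the $\mathcal{U}_\ell$ are built via absolute norms. Namely, let $p_\ell(\cdot) : \mathbf{R}^{n_\ell} \to \mathbf{R}_+$, $\ell \le L$, be absolute norms, let sets $\mathcal{D}_\ell$ be square-dominating $p_\ell(\cdot)$, and let
$$\mathcal{X}_\ell \subset \widehat{\mathcal{X}}_\ell = \{x^\ell \in \mathbf{R}^{n_\ell} : P_\ell x^\ell = 0, \; p_\ell(x^\ell) \le 1\},$$
$$\mathcal{U}_\ell = \{U \in \mathbf{S}^{n_\ell} : U \succeq 0, \; P_\ell U = 0, \; p_\ell^+(U) \le 1, \; \mathrm{dg}(U) \in \mathcal{D}_\ell\}.$$
In this case the above construction results in
$$\mathcal{U} = \Big\{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1 + ... + n_L}_+ : P_\ell U^{\ell\ell} = 0, \; p_\ell^+(U^{\ell\ell}) \le 1, \; \mathrm{dg}(U^{\ell\ell}) \in \mathcal{D}_\ell, \; \ell \le L\Big\}.$$
Now let
$$p([x^1; ...; x^L]) = \max[p_1(x^1), ..., p_L(x^L)] : \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_L} \to \mathbf{R},$$
so that $p$ is an absolute norm and $\mathcal{X} \subset \{x = [x^1; ...; x^L] : p(x) \le 1, \; P_\ell x^\ell = 0, \; \ell \le L\}$. Prove that in fact the set
$$\overline{\mathcal{U}} = \Big\{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1 + ... + n_L}_+ : P_\ell U^{\ell\ell} = 0, \; \mathrm{dg}(U^{\ell\ell}) \in \mathcal{D}_\ell, \; \ell \le L, \; p^+(U) \le 1\Big\}$$
matches $\mathcal{X}$, and that we always have $\overline{\mathcal{U}} \subset \mathcal{U}$. Verify that in general this inclusion is strict.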
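The strictness of the inclusion can be seen on a single hand-picked matrix. The sketch below follows the recipe given in the solution to item 2.4 ($L = n_1 = n_2 = 2$, zero $P_\ell$, $p_\ell = \|\cdot\|_2$, diagonal blocks normalized in Frobenius norm, no further restriction coming from $\mathcal{D}_\ell$), but it uses one deterministic matrix instead of random ones; the matrix and all helper names are assumptions made for illustration.

```python
import math

def block_norm(x):
    """p([x^1; x^2]) = max(||x^1||_2, ||x^2||_2) for x in R^2 x R^2."""
    return max(math.hypot(x[0], x[1]), math.hypot(x[2], x[3]))

def p_plus(U):
    """p^+(U): apply p to the vector of p-norms of the columns of U."""
    cols = [[U[i][j] for i in range(4)] for j in range(4)]
    return block_norm([block_norm(c) for c in cols])

def frob(U, rows, cols):
    """Frobenius norm of the block of U with the given row and column indices."""
    return math.sqrt(sum(U[i][j] ** 2 for i in rows for j in cols))

def is_positive_definite(U):
    """Plain Cholesky factorization; it succeeds iff U is positive definite."""
    n = len(U)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = U[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    return False
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True

s, t, c = 0.8, 0.6, 0.69      # c^2 < s*t keeps U positive definite
U = [[s, 0, 0, c],
     [0, t, c, 0],
     [0, c, s, 0],
     [c, 0, 0, t]]

print(is_positive_definite(U))                    # U is positive definite
print(frob(U, (0, 1), (0, 1)), frob(U, (2, 3), (2, 3)))
print(p_plus(U))                                  # exceeds 1
```

Both diagonal blocks have Frobenius norm 1 (up to rounding), so this matrix satisfies the blockwise constraints, while its $p^+$-norm is about 1.056, so the constraint $p^+(U) \le 1$ fails.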
Solution: 2.1: The fact that $\mathcal{U}$ is closed and convex is evident (recall that $\mathcal{T}$ is compact and the $\mathbf{U}_\ell$ are closed convex cones). To check that the set is bounded, assume the opposite; in this case $\mathcal{U}$ admits a nonzero recessive direction $\delta$, and since $\mathcal{T}$ and the sets $\mathcal{U}_\ell$ are bounded, it is immediately seen that $A_\ell \delta A_\ell^T = 0$ for every $\ell$. In addition, we should have $\delta \succeq 0$ due to $\mathcal{U} \subset \mathbf{S}^n_+$. Since $\delta \succeq 0$ is nonzero, we can find a nonzero vector $e \in \mathbf{R}^n$ such that $ee^T \preceq \delta$. Then $0 \preceq A_\ell ee^T A_\ell^T \preceq A_\ell \delta A_\ell^T = 0$, implying that $A_\ell e = 0$ for all $\ell$; but the latter is impossible since $e \ne 0$ and $\sum_\ell A_\ell^T A_\ell \succ 0$.

It remains to verify that $\mathcal{U}$ matches $\mathcal{X}$. Let $x \in \mathcal{X}$, so that for some $t \in \mathcal{T}$ it holds $\vartheta_\ell(A_\ell x) \le \sqrt{t_\ell}$ for all $\ell$. When $\ell$ is such that $t_\ell = 0$, we have $A_\ell x = 0$ (indeed, $\mathcal{X}_\ell$ is bounded, so that $\vartheta_\ell(\cdot) > 0$ outside of the origin). Let us set $y_\ell = A_\ell x / \sqrt{t_\ell}$ when $t_\ell > 0$ and $y_\ell = 0_{n_\ell}$ when $t_\ell = 0$, so that $\vartheta_\ell(y_\ell) \le 1$ for all $\ell$, that is, $y_\ell \in \mathcal{X}_\ell$ and therefore $y_\ell y_\ell^T \in \mathcal{U}_\ell$, implying that $(t_\ell y_\ell y_\ell^T, t_\ell) \in \mathbf{U}_\ell$ for all $\ell$. Taking into account that $t_\ell y_\ell y_\ell^T = A_\ell xx^T A_\ell^T$ for all $\ell$, we see that there exists $t \in \mathcal{T}$ such that $(A_\ell xx^T A_\ell^T, t_\ell) \in \mathbf{U}_\ell$ for all $\ell$, implying that $xx^T \in \mathcal{U}$. ✷

2.2: Obviously, $\mathcal{V}$ is a convex compact set; when $y \in \mathcal{Y}$, there exists $x \in \mathcal{X}$ such that $y = Px$, whence $yy^T = P[xx^T]P^T$ and $U := xx^T \in \mathcal{U}$, so that $yy^T \in \mathcal{V}$. ✷

2.3: The fact that $\mathcal{V}$ is a closed convex set is evident. Boundedness of $\mathcal{V}$ is readily given by $\mathrm{Ker}\, P = \{0\}$, implying that $\mathrm{Im}\, P^T = \mathbf{R}^m$, combined with boundedness of $\mathcal{U}$. Finally, if $y \in \mathcal{Y}$, then $Py \in \mathcal{X}$, whence $P yy^T P^T \in \mathcal{U}$, so that $yy^T \in \mathcal{V}$. ✷

2.4: The fact that $\overline{\mathcal{U}}$ matches $\mathcal{X}$ is readily given by item 1.2. The inclusion $\overline{\mathcal{U}} \subset \mathcal{U}$ is completely straightforward.
The fact that the inclusion $\overline{\mathcal{U}} \subset \mathcal{U}$ in general is strict can be easily verified numerically: take $L = n_1 = n_2 = 2$, zero $P_1$ and $P_2$, $p_1(\cdot) = \|\cdot\|_2$, $p_2(\cdot) = \|\cdot\|_2$, generate several $4 \times 4$ positive semidefinite matrices, normalize them to have the maximum of Frobenius norms of diagonal $2 \times 2$ blocks equal to 1 (that is, normalize them to become points of $\mathcal{U}$), and see whether the resulting matrices are of $p^+$-norm $\le 1$; if they are not, you are done. ✷

4.8.C Illustration: Nullspace property revisited. Recall the sparsity-oriented signal recovery via $\ell_1$ minimization from Chapter 1: Given an $m \times n$ sensing matrix $A$ and (noiseless) observation $y = Aw$ of an unknown signal $w$ known to have at most $s$ nonzero entries, we recover $w$ as
$$\widehat{w} \in \mathop{\mathrm{Argmin}}_z \{\|z\|_1 : Az = y\}.$$
We called matrix $A$ $s$-good, if whenever $y = Aw$ with $s$-sparse $w$, the only optimal solution to the right-hand side optimization problem is $w$. The (difficult to verify!) necessary and sufficient condition for $s$-goodness is the Nullspace property:
$$\mathrm{Opt} := \max_z \{\|z\|_{(s)} : z \in \mathrm{Ker}\, A, \; \|z\|_1 \le 1\} < 1/2,$$
where $\|z\|_{(k)}$ is the sum of the $k$ largest entries in the vector $\mathrm{abs}[z]$. A verifiable sufficient condition for $s$-goodness is
$$\widehat{\mathrm{Opt}} := \min_H \max_j \|\mathrm{Col}_j[I - H^T A]\|_{(s)} < \frac{1}{2}, \qquad (4.80)$$
the reason being that, as is immediately seen, $\widehat{\mathrm{Opt}}$ is an upper bound on $\mathrm{Opt}$ (see Proposition 1.9 with $q = 1$).
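An immediate observation is that $\mathrm{Opt}$ is nothing but the maximum of a quadratic form over an appropriate convex compact set. Specifically, let
$$\mathcal{X} = \Big\{[u; v] \in \mathbf{R}^n \times \mathbf{R}^n : Au = 0, \; \|u\|_1 \le 1, \; \sum_i |v_i| \le s, \; \|v\|_\infty \le 1\Big\}, \qquad B = \begin{bmatrix} & \frac{1}{2} I_n \\ \frac{1}{2} I_n & \end{bmatrix}.$$
Then
$$\begin{array}{rcl}
\mathrm{Opt}_{\mathcal{X}}(B) &=& \max_{[u;v] \in \mathcal{X}} [u; v]^T B [u; v] \\
&=& \max_{u,v} \Big\{u^T v : Au = 0, \; \|u\|_1 \le 1, \; \sum_i |v_i| \le s, \; \|v\|_\infty \le 1\Big\} \\
&=_{(a)}& \max_u \{\|u\|_{(s)} : Au = 0, \; \|u\|_1 \le 1\} \\
&=& \mathrm{Opt},
\end{array}$$
where (a) is due to the well-known fact (prove it!) that whenever $s$ is a positive integer $\le n$, the extreme points of the set
$$\mathcal{V} = \Big\{v \in \mathbf{R}^n : \sum_i |v_i| \le s, \; \|v\|_\infty \le 1\Big\}$$
are exactly the vectors with at most $s$ nonzero entries, the nonzero entries being $\pm 1$; as a result
$$\forall (z \in \mathbf{R}^n) : \; \max_{v \in \mathcal{V}} z^T v = \|z\|_{(s)}.$$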
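The identity $\max_{v \in \mathcal{V}} z^T v = \|z\|_{(s)}$ is easy to sanity-check by brute force in small dimension. The following sketch assumes the description of the candidate maximizers quoted above (vectors with at most $s$ nonzero entries, each equal to $\pm 1$) and simply enumerates them; the function names are hypothetical.

```python
from itertools import combinations, product

def s_norm(z, s):
    """||z||_(s): the sum of the s largest magnitudes of entries of z."""
    return sum(sorted((abs(t) for t in z), reverse=True)[:s])

def max_over_sparse_sign_vectors(z, s):
    """Maximum of z^T v over v with at most s nonzero entries, each +1 or -1."""
    n = len(z)
    best = 0.0                                   # value at v = 0
    for k in range(1, s + 1):
        for support in combinations(range(n), k):
            for signs in product((-1, 1), repeat=k):
                value = sum(sg * z[i] for sg, i in zip(signs, support))
                best = max(best, value)
    return best

z = [0.3, -1.2, 0.7, 0.0, -0.5]
for s in range(1, len(z) + 1):
    print(s, s_norm(z, s), max_over_sparse_sign_vectors(z, s))
```

The enumeration is exponential in $s$ and is meant only as a check on toy instances; the closed form `s_norm` is what one would use in practice.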
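Now, $\mathcal{V}$ is the unit ball of the absolute norm
$$r(v) = \min\{t : \|v\|_1 \le st, \; \|v\|_\infty \le t\},$$
so that $\mathcal{X}$ is contained in the unit ball $\mathcal{B}$ of the absolute norm on $\mathbf{R}^{2n}$ specified as
$$p([u; v]) = \max\{\|u\|_1, r(v)\} \qquad [u, v \in \mathbf{R}^n],$$
i.e.,
$$\mathcal{X} = \{[u; v] : p([u; v]) \le 1, \; Au = 0\}.$$
As a result, whenever $x = [u; v] \in \mathcal{X}$, the matrix
$$U = xx^T = \begin{bmatrix} U^{11} = uu^T & U^{12} = uv^T \\ U^{21} = vu^T & U^{22} = vv^T \end{bmatrix}$$
satisfies the condition $p^+(U) \le 1$ (see item 1.2 above). In addition, this matrix clearly satisfies the condition $A[U^{11}, U^{12}] = 0$. It follows that the set
$$\mathcal{U} = \Big\{U = \begin{bmatrix} U^{11} & U^{12} \\ U^{21} & U^{22} \end{bmatrix} \in \mathbf{S}^{2n} : U \succeq 0, \; p^+(U) \le 1, \; AU^{11} = 0, \; AU^{12} = 0\Big\}$$
(which clearly is a nonempty convex compact set) matches $\mathcal{X}$. As a result, the efficiently computable quantity
$$\begin{array}{rcl}
\overline{\mathrm{Opt}} &=& \max_{U \in \mathcal{U}} \mathrm{Tr}(BU) \\
&=& \max_U \Big\{\mathrm{Tr}(U^{12}) : U = \begin{bmatrix} U^{11} & U^{12} \\ U^{21} & U^{22} \end{bmatrix} \succeq 0, \; p^+(U) \le 1, \; AU^{11} = 0, \; AU^{12} = 0\Big\}
\end{array} \qquad (4.81)$$
is an upper bound on $\mathrm{Opt}$. As a result, the verifiable condition $\overline{\mathrm{Opt}} < 1/2$ is sufficient for $s$-goodness of $A$. Now comes the concluding part of the exercise: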
3.1) Prove that $\overline{\mathrm{Opt}} \le \widehat{\mathrm{Opt}}$, so that (4.81) is less conservative than (4.80).
Hint: Apply Conic Duality to verify that
$$\widehat{\mathrm{Opt}} = \max_V \Big\{\mathrm{Tr}(V) : V \in \mathbf{R}^{n \times n}, \; AV = 0, \; \sum_{i=1}^n r(\mathrm{Col}_i[V^T]) \le 1\Big\}. \qquad (4.82)$$

3.2) Run simulations with randomly generated Gaussian matrices $A$ and play with different values of $s$ to compare $\widehat{\mathrm{Opt}}$ and $\overline{\mathrm{Opt}}$. To save time, you can use toy sizes $m, n$, say, $m = 18$, $n = 24$.

Solution: 3.1: $\widehat{\mathrm{Opt}}$ is the optimal value in the strictly feasible and bounded conic problem
$$\widehat{\mathrm{Opt}} = \min_{H,\tau} \{\tau : [\mathrm{Col}_j[I - H^T A]; \tau] \in \mathbf{K}, \; j \le n\}, \qquad \mathbf{K} = \{[h; \tau] \in \mathbf{R}^n \times \mathbf{R} : \|h\|_{(s)} \le \tau\}. \qquad (P)$$
From the above considerations, the cone $\mathbf{K}_*$ dual to $\mathbf{K}$ is the cone $\{[v; t] \in \mathbf{R}^n \times \mathbf{R} : r(v) \le t\}$, so that by the Conic Duality Theorem $\widehat{\mathrm{Opt}}$ is the optimal value in the problem dual to $(P)$. Straightforward computation shows that the dual problem is
$$\begin{array}{l}
\max_{\{[v_j; t_j], j \le n\}} \Big\{\sum_j [v_j]_j : [v_j; t_j] \in \mathbf{K}_*, \; \sum_j t_j = 1, \; \sum_j v_j^T H^T \mathrm{Col}_j[A] = 0 \ \forall H\Big\} \\
\quad = \max_{V = [v_1^T; v_2^T; ...; v_n^T] \in \mathbf{R}^{n \times n}} \Big\{\mathrm{Tr}(V) : \sum_i r(v_i) \le 1, \; AV = 0\Big\},
\end{array}$$
implying the validity of (4.82). It remains to note that when $U$ runs through the feasible set of (4.81), $V = U^{12}$ is contained in the feasible set of (4.82) (see what $p^+(\cdot)$ is), implying that $\overline{\mathrm{Opt}} \le \widehat{\mathrm{Opt}}$. ✷

3.2: Here is an instructive simulation run:

s                          1        2        3        4
\widehat{Opt}, see (4.80)  0.2497   0.3751   0.5075   0.6319
\overline{Opt}, see (4.81) 0.2497   0.3725   0.4975   0.6209

We see that (4.81) is slightly better than (4.80). For example, (4.81) certifies that the matrix $A$ underlying the experiment is 3-good, while (4.80) only certifies 2-goodness.

6.4.4 Around Propositions 4.4 and 4.14

6.4.4.1 Optimizing linear estimates on convex hulls of unions of spectratopes
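Exercise 4.9 Let
• $\mathcal{X}_1, ..., \mathcal{X}_J$ be spectratopes in $\mathbf{R}^n$:
$$\mathcal{X}_j = \{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^{N_j}, t \in \mathcal{T}_j) : x = P_j y, \; R_{kj}^2[y] \preceq t_k I_{d_{kj}}, \; k \le K_j\}, \; 1 \le j \le J \qquad \Big[R_{kj}[y] = \sum_{i=1}^{N_j} y_i R^{kji}\Big]$$
• $A \in \mathbf{R}^{m \times n}$ and $B \in \mathbf{R}^{\nu \times n}$ be given matrices,
• $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ is a spectratope:
$$\mathcal{B}_* := \{u : \|u\|_* \le 1\} = \{u \in \mathbf{R}^\nu : \exists (z \in \mathbf{R}^N, r \in \mathcal{R}) : u = Mz, \; S_\ell^2[z] \preceq r_\ell I_{f_\ell}, \; \ell \le L\} \qquad \Big[S_\ell[z] = \sum_{i=1}^N z_i S^{\ell i}\Big]$$
• $\Pi$ be a convex compact subset of the interior of the positive semidefinite cone $\mathbf{S}^m_+$,

with our standard restrictions on $R_{kj}[\cdot]$, $S_\ell[\cdot]$, $\mathcal{T}_j$ and $\mathcal{R}$. Let, further,
$$\mathcal{X} = \mathrm{Conv}\Big(\bigcup_j \mathcal{X}_j\Big)$$
be the convex hull of the union of spectratopes $\mathcal{X}_j$. Consider the situation where, given observation $\omega = Ax + \xi$ of an unknown signal $x$ known to belong to $\mathcal{X}$, we want to recover $Bx$. We assume that the matrix of second moments of noise is $\preceq$-dominated by a matrix from $\Pi$, and quantify the performance of a candidate estimate $\widehat{x}(\cdot)$ by its $\|\cdot\|$-risk
$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}|\mathcal{X}] = \sup_{x \in \mathcal{X}} \sup_{P : P \lhd \Pi} \mathbf{E}_{\xi \sim P}\{\|Bx - \widehat{x}(Ax + \xi)\|\}$$
where $P \lhd \Pi$ means that the matrix $\mathrm{Var}[P] = \mathbf{E}_{\xi \sim P}\{\xi\xi^T\}$ of second moments of distribution $P$ is $\preceq$-dominated by a matrix from $\Pi$. Prove the following

Proposition 4.28 In the situation in question, consider the convex optimization problem
$$\begin{array}{rl}
\mathrm{Opt} = \min\limits_{H,\Theta,\Lambda^j,\Upsilon^j,\Upsilon'} \Big\{ & \max_j \big[\phi_{\mathcal{T}_j}(\lambda[\Lambda^j]) + \phi_{\mathcal{R}}(\lambda[\Upsilon^j])\big] + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \Gamma_\Pi(\Theta) : \\
& \Lambda^j = \{\Lambda^j_k \succeq 0, \; k \le K_j\}, \; j \le J, \quad \Upsilon^j = \{\Upsilon^j_\ell \succeq 0, \; \ell \le L\}, \; j \le J, \quad \Upsilon' = \{\Upsilon'_\ell \succeq 0, \; \ell \le L\}, \\
& \begin{bmatrix} \sum_k R^*_{kj}[\Lambda^j_k] & \frac{1}{2} P_j^T [B^T - A^T H] M \\ \frac{1}{2} M^T [B - H^T A] P_j & \sum_\ell S^*_\ell[\Upsilon^j_\ell] \end{bmatrix} \succeq 0, \; j \le J, \\
& \begin{bmatrix} \Theta & \frac{1}{2} HM \\ \frac{1}{2} M^T H^T & \sum_\ell S^*_\ell[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Big\}
\end{array} \qquad (4.83)$$
where, as usual,
$$\phi_{\mathcal{T}_j}(\lambda) = \max_{t \in \mathcal{T}_j} t^T \lambda, \quad \phi_{\mathcal{R}}(\lambda) = \max_{r \in \mathcal{R}} r^T \lambda, \quad \Gamma_\Pi(\Theta) = \max_{Q \in \Pi} \mathrm{Tr}(Q\Theta), \quad \lambda[U_1, ..., U_s] = [\mathrm{Tr}(U_1); ...; \mathrm{Tr}(U_s)],$$
$$S^*_\ell[\cdot] : \mathbf{S}^{f_\ell} \to \mathbf{S}^N : \; S^*_\ell[U] = \big[\mathrm{Tr}(S^{\ell p} U S^{\ell q})\big]_{p,q \le N},$$
$$R^*_{kj}[\cdot] : \mathbf{S}^{d_{kj}} \to \mathbf{S}^{N_j} : \; R^*_{kj}[U] = \big[\mathrm{Tr}(R^{kjp} U R^{kjq})\big]_{p,q \le N_j}.$$
Problem (4.83) is solvable, and the $H$-component $H_*$ of its optimal solution gives rise to the linear estimate $\widehat{x}_{H_*}(\omega) = H_*^T \omega$ such that
$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}] \le \mathrm{Opt}. \qquad (4.84)$$
Moreover, the estimate $\widehat{x}_{H_*}$ is near-optimal among linear estimates:
$$\mathrm{Opt} \le O(1) \ln(D + F)\, \mathrm{RiskOpt}_{\mathrm{lin}} \qquad \Big[D = \max_j \sum_{k \le K_j} d_{kj}, \; F = \sum_{\ell \le L} f_\ell\Big] \qquad (4.85)$$
where
$$\mathrm{RiskOpt}_{\mathrm{lin}} = \inf_H \sup_{x \in \mathcal{X}, Q \in \Pi} \mathbf{E}_{\xi \sim \mathcal{N}(0,Q)}\big\{\|Bx - H^T(Ax + \xi)\|\big\}$$
is the best risk attainable by linear estimates in the current setting under zero mean Gaussian observation noise.

It should be stressed that the convex hull of unions of spectratopes is not necessarily a spectratope, and that Proposition 4.28 states that the linear estimate stemming from (4.83)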
is near-optimal only among linear, not among all estimates (the latter might indeed not be the case).
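Solution: 1°. Let
$$\begin{array}{rcl}
\Phi(H) &=& \max_{x \in \mathcal{X}} \|[B - H^T A]x\|, \\
\Phi_j(H) &=& \max_{x \in \mathcal{X}_j} \|[B - H^T A]x\|, \\
\Phi_j^+(H) &=& \min\limits_{\Lambda^j_k, \Upsilon^j_\ell} \Big\{\phi_{\mathcal{T}_j}(\lambda[\Lambda^j]) + \phi_{\mathcal{R}}(\lambda[\Upsilon^j]) : \; \Lambda^j_k \succeq 0, \; k \le K_j, \; \Upsilon^j_\ell \succeq 0, \; \ell \le L, \\
&& \qquad \begin{bmatrix} \sum_k R^*_{kj}[\Lambda^j_k] & \frac{1}{2} P_j^T [B^T - A^T H] M \\ \frac{1}{2} M^T [B - H^T A] P_j & \sum_\ell S^*_\ell[\Upsilon^j_\ell] \end{bmatrix} \succeq 0 \Big\}, \\
\Psi(H) &=& \max_{P \lhd \Pi} \mathbf{E}_{\xi \sim P}\{\|H^T \xi\|\}, \\
\Psi^+(H) &=& \min\limits_{\Theta, \Upsilon'_\ell} \Big\{\phi_{\mathcal{R}}(\lambda[\Upsilon']) + \Gamma_\Pi(\Theta) : \; \Upsilon'_\ell \succeq 0, \; \ell \le L, \; \begin{bmatrix} \Theta & \frac{1}{2} HM \\ \frac{1}{2} M^T H^T & \sum_\ell S^*_\ell[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Big\}, \\
\Phi^+(H) &=& \max_j \Phi_j^+(H).
\end{array}$$
Observe that $\Phi_j(H)$ is the maximum of a convex function of $x$ (independent of $j$ and depending on $H$ as parameter) over $x \in \mathcal{X}_j$, while $\Phi(H)$ is the maximum of the same function over $x \in \mathcal{X}$. Since $\mathcal{X}$ is the convex hull of $\cup_j \mathcal{X}_j$, we get
$$\Phi(H) = \max_j \Phi_j(H). \qquad (6.10)$$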
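The step behind (6.10), that a convex function attains its maximum over the convex hull of a union on the union itself, can be illustrated on toy data. In the sketch below everything is an assumption made for illustration: the sets are small polytopes given by their vertices (standing in for the spectratopes $\mathcal{X}_j$), and the convex function is $x \mapsto \|Cx\|_1$ with a fixed matrix $C$ standing in for $B - H^T A$.

```python
import random

def g(x, C):
    """Convex function x -> ||C x||_1, a stand-in for x -> ||[B - H^T A] x||."""
    return sum(abs(sum(cij * xj for cij, xj in zip(row, x))) for row in C)

random.seed(0)
C = [[1.0, -2.0], [0.5, 3.0]]
X1 = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]    # vertices of the first polytope
X2 = [(2.0, 2.0), (3.0, -1.0)]                # vertices of the second polytope
vertices = X1 + X2
max_over_union = max(g(v, C) for v in vertices)

# Sample random points of Conv(X1 ∪ X2) and record the largest excess of g
# over the maximum taken on the union; convexity predicts no positive excess.
worst_gap = float("-inf")
for _ in range(2000):
    w = [random.random() for _ in vertices]
    total = sum(w)
    x = [sum(wi / total * v[d] for wi, v in zip(w, vertices)) for d in range(2)]
    worst_gap = max(worst_gap, g(x, C) - max_over_union)

print(max_over_union, worst_gap)
```

The recorded gap stays nonpositive (up to rounding), as the convexity argument predicts.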
It is easily seen that (4.83) is a feasible problem with bounded level sets of the objective, so that the problem is solvable, and we clearly have
$$\mathrm{Opt} = \min_H \big[\Phi^+(H) + \Psi^+(H)\big] = \Phi^+(H_*) + \Psi^+(H_*). \qquad (6.11)$$
Applying Proposition 4.8 to the quadratic form $f_H([u; x]) = u^T [B - H^T A] x$ and spectratope $\mathcal{B}_* \times \mathcal{X}_j$, we get
$$\Phi_j(H) \le \Phi_j^+(H) \le 2 \ln(2(D_j + F)) \Phi_j(H), \quad D_j = \sum_{k \le K_j} d_{kj}, \quad 1 \le j \le J, \qquad (6.12)$$
while by Lemma 4.11 it holds
$$\Psi(H) \le \Psi^+(H). \qquad (6.13)$$
In addition, (6.10) combines with the definition of $\Phi^+$ and (6.12) to imply that
$$\Phi(H) \le \Phi^+(H) \le 2 \ln(2(D + F)) \Phi(H). \qquad (6.14)$$
Finally, by the origin of $\Phi$ and $\Psi$ we have
$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}] \le \Phi(H_*) + \Psi(H_*) = \max_j \Phi_j(H_*) + \Psi(H_*),$$
which combines with (6.11), (6.14), and (6.13) to imply (4.84).
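2°. Let $\Omega$ be the feasible set of (4.83). Consider the function
$$F(\underbrace{H, \Theta, \{\Lambda^j, \Upsilon^j, j \le J\}, \Upsilon'}_{\zeta}; Q) = \max_j \big[\phi_{\mathcal{T}_j}(\lambda[\Lambda^j]) + \phi_{\mathcal{R}}(\lambda[\Upsilon^j])\big] + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(Q\Theta) : \Omega \times \Pi \to \mathbf{R}.$$
This function clearly is continuous convex-concave on its domain, and is coercive in the convex argument $\zeta$ (restricted to vary in $\Omega$). By the Sion-Kakutani Theorem, the function possesses a saddle point $(\zeta_*, Q_*)$. In other words, there exists $Q_* \in \Pi$ such that
$$\begin{array}{rcl}
\mathrm{Opt} &=& \min\limits_{H,\Theta,\Lambda^j,\Upsilon^j,\Upsilon'} \Big\{ \max_j \big[\phi_{\mathcal{T}_j}(\lambda[\Lambda^j]) + \phi_{\mathcal{R}}(\lambda[\Upsilon^j])\big] + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(Q_*\Theta) : \\
&& \quad \Lambda^j = \{\Lambda^j_k \succeq 0, \; k \le K_j\}, \; j \le J, \quad \Upsilon^j = \{\Upsilon^j_\ell \succeq 0, \; \ell \le L\}, \; j \le J, \quad \Upsilon' = \{\Upsilon'_\ell \succeq 0, \; \ell \le L\}, \\
&& \quad \begin{bmatrix} \sum_k R^*_{kj}[\Lambda^j_k] & \frac{1}{2} P_j^T [B^T - A^T H] M \\ \frac{1}{2} M^T [B - H^T A] P_j & \sum_\ell S^*_\ell[\Upsilon^j_\ell] \end{bmatrix} \succeq 0, \; j \le J, \quad \begin{bmatrix} \Theta & \frac{1}{2} HM \\ \frac{1}{2} M^T H^T & \sum_\ell S^*_\ell[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Big\} \\
&=& \min_H \big[\Phi^+(H) + \Psi^+_*(H)\big]
\end{array} \qquad (6.15)$$
where
$$\Psi^+_*(H) := \min_{\Theta, \Upsilon'} \Big\{\phi_{\mathcal{R}}(\lambda[\Upsilon']) + \mathrm{Tr}(Q_*\Theta) : \; \Upsilon'_\ell \succeq 0, \; \ell \le L, \; \begin{bmatrix} \Theta & \frac{1}{2} HM \\ \frac{1}{2} M^T H^T & \sum_\ell S^*_\ell[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \Big\}.$$
Now consider the optimization problem
$$\mathrm{Opt}_* = \min_H \{\Phi(H) + \Psi_*(H)\}, \qquad \Psi_*(H) = \mathbf{E}_{\xi \sim \mathcal{N}(0,Q_*)}\{\|H^T \xi\|\}.$$
We claim that
$$\mathrm{RiskOpt}_{\mathrm{lin}} \ge \frac{1}{2} \mathrm{Opt}_*. \qquad (6.16)$$
Indeed, taking into account that $Q_* \in \Pi$, it suffices to verify that for every linear estimate $\widehat{x}(\omega) = H^T \omega$ we have
$$\mathrm{Risk}[H] := \max_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,Q_*)}\{\|Bx - H^T(Ax + \xi)\|\} \ge \frac{1}{2} \mathrm{Opt}_*.$$
By Jensen’s inequality, we have
$$\mathrm{Risk}[H] \ge \max_{x \in \mathcal{X}} \|\mathbf{E}_{\xi \sim \mathcal{N}(0,Q_*)}\{Bx - H^T(Ax + \xi)\}\| = \max_{x \in \mathcal{X}} \|Bx - H^T Ax\| = \Phi(H).$$
On the other hand, $0 \in \mathcal{X}$, whence
$$\mathrm{Risk}[H] \ge \mathbf{E}_{\xi \sim \mathcal{N}(0,Q_*)}\{\|B \cdot 0 - H^T(A \cdot 0 + \xi)\|\} = \mathbf{E}_{\xi \sim \mathcal{N}(0,Q_*)}\{\|H^T \xi\|\} = \Psi_*(H).$$
We conclude that $2\,\mathrm{Risk}[H] \ge \Phi(H) + \Psi_*(H) \ge \mathrm{Opt}_*$, as claimed.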
3°. It remains to note that by Lemma 4.17 we have
$$\Psi_*(H) \le \Psi^+_*(H) \le O(1) \sqrt{\ln(2F)}\, \Psi_*(H),$$
which combines with (6.14) and (6.15) to imply that
$$\begin{array}{rcl}
\mathrm{Opt} &=& \min_H \{\Phi^+(H) + \Psi^+_*(H)\} \le O(1) \ln(D + F) \min_H \{\Phi(H) + \Psi_*(H)\} \\
&=& O(1) \ln(D + F)\, \mathrm{Opt}_* \le O(1) \ln(D + F) \cdot 2\,\mathrm{RiskOpt}_{\mathrm{lin}},
\end{array}$$
where the concluding inequality is due to (6.16), and (4.85) follows. ✷

6.4.4.2 Recovering nonlinear vector-valued functions
Exercise 4.10 Consider the situation as follows: We are given a noisy observation
$$\omega = Ax + \xi_x \qquad [A \in \mathbf{R}^{m \times n}]$$
of the linear image $Ax$ of an unknown signal $x$ known to belong to a given spectratope $\mathcal{X} \subset \mathbf{R}^n$; here $\xi_x$ is the observation noise with distribution $P_x$ which can depend on $x$. Similarly to Section 4.3.3, we assume that we are given a computationally tractable convex compact set $\Pi \subset \mathrm{int}\, \mathbf{S}^m_+$ such that for every $x \in \mathcal{X}$, $\mathrm{Var}[P_x] \preceq \Theta$ for some $\Theta \in \Pi$; cf. (4.32). We want to recover the value $f(x)$ of a given vector-valued function $f : \mathcal{X} \to \mathbf{R}^\nu$, and we measure the recovery error in a given norm $|\cdot|$ on $\mathbf{R}^\nu$.

4.10.A. Preliminaries and Main observation. Let $\|\cdot\|$ be a norm on $\mathbf{R}^n$, and $g(\cdot) : \mathcal{X} \to \mathbf{R}^\nu$ be a function. Recall that the function is called Lipschitz continuous on $\mathcal{X}$ w.r.t. the pair of norms $\|\cdot\|$ on the argument and $|\cdot|$ on the image spaces, if there exists $L < \infty$ such that
$$|g(x) - g(y)| \le L\|x - y\| \quad \forall (x, y \in \mathcal{X});$$
every $L$ with this property is called a Lipschitz constant of $g$. It is well known that in our finite-dimensional situation, the property of $g$ to be Lipschitz continuous is independent of how the norms $\|\cdot\|$, $|\cdot|$ are selected; this selection affects only the value(s) of the Lipschitz constant(s).

Assume from now on that the function of interest $f$ is Lipschitz continuous on $\mathcal{X}$. Let us call a norm $\|\cdot\|$ on $\mathbf{R}^n$ appropriate for $f$ if $f$ is Lipschitz continuous with constant 1 on $\mathcal{X}$ w.r.t. $\|\cdot\|$, $|\cdot|$. Our immediate observation is as follows:

Observation 4.29 In the situation in question, let $\|\cdot\|$ be appropriate for $f$. Then recovering $f(x)$ is not more difficult than recovering $x$ in the norm $\|\cdot\|$: every estimate $\widehat{x}(\omega)$ of $x$ via $\omega$ such that $\widehat{x}(\cdot) \in \mathcal{X}$ induces the “plug-in” estimate
$$\widehat{f}(\omega) = f(\widehat{x}(\omega))$$
of $f(x)$, and the $\|\cdot\|$-risk
$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}] = \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim P_x}\{\|\widehat{x}(Ax + \xi) - x\|\}$$
of estimate $\widehat{x}$ upper-bounds the $|\cdot|$-risk
$$\mathrm{Risk}_{|\cdot|}[\widehat{f}|\mathcal{X}] = \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim P_x}\big\{|\widehat{f}(Ax + \xi) - f(x)|\big\}$$
of the estimate $\widehat{f}$ induced by $\widehat{x}$:
$$\mathrm{Risk}_{|\cdot|}[\widehat{f}|\mathcal{X}] \le \mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}].$$
When $f$ is defined and Lipschitz continuous with constant 1 w.r.t. $\|\cdot\|$, $|\cdot|$ on the entire $\mathbf{R}^n$, this conclusion remains valid without the assumption that $\widehat{x}$ is $\mathcal{X}$-valued.

4.10.B. Consequences. Observation 4.29 suggests the following simple approach to solving the estimation problem we started with: assuming that we have at our disposal a norm $\|\cdot\|$ on $\mathbf{R}^n$ such that
• $\|\cdot\|$ is appropriate for $f$, and
• $\|\cdot\|$ is good, goodness meaning that the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ is a spectratope given by an explicit spectratopic representation,
we use the machinery of linear estimation developed in Section 4.3.3 to build a near-optimal, in terms of its $\|\cdot\|$-risk, linear estimate of $x$ via $\omega$, and convert this estimate into an estimate of $f(x)$. By the above observation, the $|\cdot|$-risk of the resulting estimate is upper-bounded by the $\|\cdot\|$-risk of the underlying linear estimate.

The construction just outlined needs a correction: in general, the linear estimate $\widetilde{x}(\cdot)$ yielded by Proposition 4.14 (same as any nontrivial, that is, not identically zero, linear estimate) is not guaranteed to take values in $\mathcal{X}$, which is, in general, required for Observation 4.29 to be applicable. This correction is easy: it is enough to convert $\widetilde{x}$ into the estimate $\widehat{x}$ defined by
$$\widehat{x}(\omega) \in \mathop{\mathrm{Argmin}}_{u \in \mathcal{X}} \|u - \widetilde{x}(\omega)\|.$$
This transformation preserves efficient computability of the estimate, and ensures that the corrected estimate takes its values in $\mathcal{X}$; at the same time, the “correction” $\widetilde{x} \mapsto \widehat{x}$ nearly preserves the $\|\cdot\|$-risk:
$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}] \le 2\,\mathrm{Risk}_{\|\cdot\|}[\widetilde{x}|\mathcal{X}]. \qquad (*)$$
Note that when $\|\cdot\|$ is a (general-type) Euclidean norm, that is, $\|x\|^2 = x^T Q x$ for some $Q \succ 0$, the factor 2 on the right-hand side can be discarded.

1) Justify $(*)$.

Solution: When $x \in \mathcal{X}$, we have
$$\begin{array}{l}
\|\widehat{x}(Ax + \xi) - \widetilde{x}(Ax + \xi)\| = \min_{u \in \mathcal{X}} \|u - \widetilde{x}(Ax + \xi)\| \le \|x - \widetilde{x}(Ax + \xi)\| \\
\quad \Rightarrow \; \|\widehat{x}(Ax + \xi) - x\| \le \|\widehat{x}(Ax + \xi) - \widetilde{x}(Ax + \xi)\| + \|\widetilde{x}(Ax + \xi) - x\| \le 2\|x - \widetilde{x}(Ax + \xi)\| \; \Rightarrow \; (*).
\end{array}$$
When $\|\cdot\|$ is a Euclidean norm, the situation is simpler: as is well known, in this case projection onto a convex set makes the point closer to all points of the set: $\|\widehat{x}(\omega) - x\| \le \|\widetilde{x}(\omega) - x\|$ for all $x \in \mathcal{X}$, and the factor 2 does not arise.
✷
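The Euclidean part of the argument can be illustrated on the simplest convex set where the projection is explicit. The sketch below is an illustration only: the box $[0,1]^4$ plays the role of $\mathcal{X}$ (an assumption, chosen because the Euclidean projection onto a box is coordinatewise clipping), and the raw estimate $\widetilde{x}$ is just a random point.

```python
import math
import random

def project_onto_box(u, lo=0.0, hi=1.0):
    """Euclidean projection onto the box [lo, hi]^n (coordinatewise clipping)."""
    return [min(max(t, lo), hi) for t in u]

def dist2(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a, b)))

random.seed(1)
worst = float("-inf")
for _ in range(1000):
    x = [random.random() for _ in range(4)]               # a point of the box
    x_tilde = [random.uniform(-2.0, 3.0) for _ in range(4)]  # raw estimate
    x_hat = project_onto_box(x_tilde)                     # corrected estimate
    worst = max(worst, dist2(x_hat, x) - dist2(x_tilde, x))

print(worst)   # never positive: projecting does not increase the distance
```

For a non-Euclidean norm the nearest point need not have this property, which is where the factor 2 in $(*)$ comes from.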
4.10.C. How to select $\|\cdot\|$. When implementing the outlined approach, the major question is how to select a norm $\|\cdot\|$ appropriate for $f$. The best choice would be to select the smallest among the norms appropriate for $f$ (such a norm does exist under mild assumptions), because the smaller the $\|\cdot\|$, the smaller the $\|\cdot\|$-risk of an estimate of $x$. This ideal can be achieved in rare cases only: first, it could be difficult to identify the smallest among the norms appropriate for $f$; second, our approach requires $\|\cdot\|$ to have an explicitly given spectratope as the unit ball of the conjugate norm. Let us look at a couple of “favorable cases,” where the difficulties just outlined can be (partially) overcome.

Example: a norm-induced $f$. Let us start with the case, important in its own right, when $f$ is a scalar functional which itself is a norm, and this norm has a spectratope as the unit ball of the conjugate norm, as is the case when $f(\cdot) = \|\cdot\|_r$, $r \in [1, 2]$, or when $f(\cdot)$ is the nuclear norm. In this case the smallest of the norms appropriate for $f$ clearly is $f$ itself, and none of the outlined difficulties arises. As an extension, when $f(x)$ is obtained from a good norm $\|\cdot\|$ by operations preserving Lipschitz continuity and constant, such as $f(x) = \|x - c\|$, or $f(x) = \sum_i a_i \|x - c_i\|$, $\sum_i |a_i| \le 1$, or
$$f(x) = \sup_{c \in C} / \inf_{c \in C} \|x - c\|,$$
or even something like
$$f(x) = \sup_{\alpha \in \mathcal{A}} / \inf_{\alpha \in \mathcal{A}} \Big\{\sup_{c \in C_\alpha} / \inf_{c \in C_\alpha} \|x - c\|\Big\},$$
it seems natural to use this norm in our construction, although now this, perhaps, is not the smallest of the norms appropriate for $f$.

Now let us consider the general case. Note that in principle the smallest of the norms appropriate for a given Lipschitz continuous $f$ admits a description. Specifically, assume that $\mathcal{X}$ has a nonempty interior (this is w.l.o.g.; we can always replace $\mathbf{R}^n$ with the linear span of $\mathcal{X}$). A well-known fact of Analysis (Rademacher Theorem) states that in this situation (more generally, when $\mathcal{X}$ is convex with a nonempty interior), a Lipschitz continuous $f$ is differentiable almost everywhere in $\mathcal{X}^o = \mathrm{int}\, \mathcal{X}$, and $f$ is Lipschitz continuous with constant 1 w.r.t. a norm $\|\cdot\|$ if and only if
$$\|f'(x)\|_{\|\cdot\| \to |\cdot|} \le 1$$
whenever $x \in \mathcal{X}^o$ is such that the derivative (a.k.a. Jacobian) of $f$ at $x$ exists; here $\|Q\|_{\|\cdot\| \to |\cdot|}$ is the matrix norm of a $\nu \times n$ matrix $Q$ induced by the norms $\|\cdot\|$ on $\mathbf{R}^n$ and $|\cdot|$ on $\mathbf{R}^\nu$:
$$\|Q\|_{\|\cdot\| \to |\cdot|} := \max_{\|x\| \le 1} |Qx| = \max_{\|x\| \le 1, \, |y|_* \le 1} y^T Q x = \max_{|y|_* \le 1, \, [\|x\|_*]_* \le 1} x^T Q^T y = \|Q^T\|_{|\cdot|_* \to \|\cdot\|_*},$$
where $\|\cdot\|_*$, $|\cdot|_*$ are the conjugates of $\|\cdot\|$, $|\cdot|$.
2) Prove that a norm $\|\cdot\|$ is appropriate for $f$ if and only if the unit ball of the norm conjugate to $\|\cdot\|$ contains the set
$$B_{f,*}=\mathrm{cl\,Conv}\{z:\exists(x\in\mathcal X_o,\ y,\ |y|_*\le 1): z=[f'(x)]^Ty\},$$
where $\mathcal X_o$ is the set of all $x\in\mathcal X^o$ where $f'(x)$ exists. Geometrically, $B_{f,*}$ is the closed convex hull of the union of all images of the unit ball $B_*$ of $|\cdot|_*$ under the linear mappings $y\mapsto[f'(x)]^Ty$ stemming from $x\in\mathcal X_o$. Equivalently: $\|\cdot\|$ is appropriate for $f$ if and only if
$$\|u\|\ge\|u\|_f:=\max_{z\in B_{f,*}}z^Tu.\qquad(!)$$
Check that $\|u\|_f$ is a norm, provided that $B_{f,*}$ (this set by construction is a convex compact set symmetric w.r.t. the origin) possesses a nonempty interior; whenever this is the case, $\|u\|_f$ is the smallest of the norms appropriate for $f$. Derive from the above that the norms $\|\cdot\|$ we can use in our approach are the norms on $\mathbf R^n$ for which the unit ball of the conjugate norm is a spectratope containing $B_{f,*}$.
Proof. By the above Rademacher Theorem, $\|\cdot\|$ is appropriate for $f$ if and only if
$$\|f'(x)\|_{\|\cdot\|\to|\cdot|}=\|[f'(x)]^T\|_{|\cdot|_*\to\|\cdot\|_*}\le 1\quad\forall x\in\mathcal X_o,$$
or, equivalently, if and only if
$$\|u\|\le 1\ \Rightarrow\ u^T[f'(x)]^Ty\le 1\quad\forall(y:|y|_*\le 1,\ x\in\mathcal X_o),$$
or, which is the same, if and only if
$$\|u\|\le 1\ \Rightarrow\ u^Tz\le 1\quad\forall z\in B_{f,*},$$
which is nothing but (!). ✷
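Example. Consider the case of componentwise quadratic $f$:
$$f(x)=\left[\tfrac12x^TQ_1x;\ \tfrac12x^TQ_2x;\ ...;\ \tfrac12x^TQ_\nu x\right]\qquad[Q_i\in\mathbf S^n]$$
and $|u|=\|u\|_q$ with $q\in[1,2]$.$^6$ In this case
$$B_*=\{u\in\mathbf R^\nu:\|u\|_p\le 1\},\ p=\frac{q}{q-1}\in[2,\infty[,\ \text{ and }\ f'(x)=\left[x^TQ_1;\ x^TQ_2;\ ...;\ x^TQ_\nu\right].$$
Setting $\mathcal S=\{s\in\mathbf R^\nu_+:\|s\|_{p/2}\le 1\}$ and
$$\mathcal S^{1/2}=\{s\in\mathbf R^\nu_+:[s_1^2;...;s_\nu^2]\in\mathcal S\}=\{s\in\mathbf R^\nu_+:\|s\|_p\le 1\},$$
the set
$$\mathcal Z=\{[f'(x)]^Tu:x\in\mathcal X,\ u\in B_*\}$$
is contained in the set
$$\mathcal Y=\Big\{y\in\mathbf R^n:\exists(s\in\mathcal S^{1/2},\ x^i\in\mathcal X,\ i\le\nu):\ y=\sum_is_iQ_ix^i\Big\}.$$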
Set $\mathcal Y$ is a spectratope with spectratopic representation readily given by that of $\mathcal X$. Indeed, $\mathcal Y$ is nothing but the $\mathcal S$-sum of the spectratopes $Q_i\mathcal X$, $i=1,...,\nu$; see Section 4.10. As a result, we can use the spectratope $\mathcal Y$ (when $\mathrm{int}\,\mathcal Y\ne\emptyset$) or the arithmetic sum of $\mathcal Y$ with a small Euclidean ball (when $\mathrm{int}\,\mathcal Y=\emptyset$) as the unit ball of the norm conjugate to $\|\cdot\|$, thus ensuring, by 2), that $\|\cdot\|$ is appropriate for $f$. We then can use $\|\cdot\|$ in order to build an estimate of $f(\cdot)$.

3.1) For illustration, work out the problem of recovering the value of a scalar quadratic form
$$f(x)=x^TMx,\ M=\mathrm{Diag}\{i^\alpha,\ i=1,...,n\}\qquad[\nu=1,\ |\cdot|\text{ is the absolute value}]$$
from noisy observation
$$\omega=Ax+\sigma\eta,\ A=\mathrm{Diag}\{i^\beta,\ i=1,...,n\},\ \eta\sim\mathcal N(0,I_n)\qquad(4.86)$$
of a signal $x$ known to belong to the ellipsoid
$$\mathcal X=\{x\in\mathbf R^n:\|Px\|_2\le 1\},\ P=\mathrm{Diag}\{i^\gamma,\ i=1,...,n\},$$
where $\alpha,\beta,\gamma$ are given reals satisfying $\alpha-\gamma-\beta<-1/2$. You could start with the simplest unbiased estimate
$$\tilde x(\omega)=[1^{-\beta}\omega_1;\ 2^{-\beta}\omega_2;\ ...;\ n^{-\beta}\omega_n]$$
of $x$.
6 To save notation, we assume that the linear parts in the components of $f$ are trivial (just zeros). In this respect, note that we can always subtract from $f$ any linear mapping and reduce our estimation problem to two distinct problems: estimating separately the values at the signal $x$ of the modified $f$ and of the linear mapping we have subtracted (we know how to solve the latter problem reasonably well).
3.2) Work out the problem of recovering the norm $f(x)=\|Mx\|_p$, $M=\mathrm{Diag}\{i^\alpha,\ i=1,...,n\}$, $p\in[1,2]$, from observation (4.86) with $\mathcal X=\{x:\|Px\|_r\le 1\}$, $P=\mathrm{Diag}\{i^\gamma,\ i=1,...,n\}$, $r\in[2,\infty]$.
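Solution: 3.1: We are in the situation where the set $\{y=[f'(x)]^Tu:x\in\mathcal X,\ u\in B_*=[-1,1]\}$ is the set of gradients of the quadratic form $f$ taken at points from $\mathcal X$, so that this set is just the ellipsoid $\{y:\sum_{i=1}^ni^{2\gamma-2\alpha}y_i^2\le 1\}$. We can take it as our $\mathcal Y$, resulting in
$$\|y\|_*=\sqrt{\sum\nolimits_{i=1}^ni^{2\gamma-2\alpha}y_i^2},$$
so that the norm $\|\cdot\|$ is
$$\|z\|=\sqrt{\sum\nolimits_{i=1}^ni^{2\alpha-2\gamma}z_i^2}.$$
With $\tilde x(\omega)$ given by $[\tilde x(\omega)]_i=i^{-\beta}\omega_i$, $i\le n$, we have
$$\|\tilde x(Ax+\sigma\eta)-x\|^2=\sum_ii^{2\alpha-2\gamma}\left(i^{-\beta}[i^\beta x_i+\sigma\eta_i]-x_i\right)^2=\sigma^2\sum_ii^{2\alpha-2\gamma-2\beta}\eta_i^2,$$
implying that
$$\mathrm{Risk}_{\|\cdot\|}[\tilde x|\mathcal X]\le\max_{x\in\mathcal X}\sqrt{\mathbf E_{\eta\sim\mathcal N(0,I_n)}\{\|\tilde x(Ax+\sigma\eta)-x\|^2\}}\le\sigma\sqrt{\sum\nolimits_ii^{2\alpha-2\gamma-2\beta}}.$$
When $\delta:=2\alpha-2\gamma-2\beta<-1$, this bound results in
$$\mathrm{Risk}_{\|\cdot\|}[\tilde x|\mathcal X]\le C\sigma,$$
with $C$ depending solely on $\delta$. The bottom line is that in the situation in question, our construction results in an estimate of $f(x)$ with “parametric risk” $O(\sigma)$. Note that when restricting the signals to have all entries, except for the first one, equal to 0, the risk of recovering $f(x)$ can clearly be lower-bounded by $c\sigma$, with an absolute constant $c>0$. In other words, in the situation in question, already a completely unsophisticated implementation of the approach we have developed results in a near-optimal estimate.

3.2: We are in the situation where the set $B_{f,*}=\{y\in\mathbf R^n:\|M^{-1}y\|_{p_*}\le 1\}$, $p_*=\frac{p}{p-1}\in[2,\infty]$, and the optimal norm $\|\cdot\|$ is just $f(\cdot)$. Problem (4.42) yielding the estimate $\tilde x$ reads
$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{H,\lambda,\mu,\mu',\Theta}\Big\{\phi_r(\lambda)+\psi_p(\mu)+\psi_p(\mu')+\sigma^2\mathrm{Tr}(\Theta):\ \lambda\in\mathbf R^n_+,\ \mu\in\mathbf R^n_+,\ \mu'\in\mathbf R^n_+,\\
&&\quad\begin{bmatrix}\mathrm{Diag}\{\lambda_ii^{2\gamma},\ i\le n\}&\frac12[I_n-AH]\\ \frac12[I_n-H^TA]&\mathrm{Diag}\{\mu_ii^{-2\alpha},\ i\le n\}\end{bmatrix}\succeq0,\\
&&\quad\begin{bmatrix}\Theta&\frac12H\\ \frac12H^T&\mathrm{Diag}\{\mu'_ii^{-2\alpha},\ i\le n\}\end{bmatrix}\succeq0\Big\},\\
&&\phi_r(\lambda)=\|\lambda\|_{\frac{r}{r-2}},\quad\psi_p(\mu)=\|\mu\|_{\frac{p}{2-p}}.
\end{array}$$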
It is immediately seen that when seeking an optimal solution, we can restrict ourselves to diagonal $H$ and $\Theta$, which makes the problem easy to solve in the large-scale case. It is also worth mentioning that we can use $\hat x\equiv\tilde x$, since $f$ is Lipschitz continuous with constant 1 w.r.t. the norm $\|\cdot\|$ on the entire $\mathbf R^n$.

6.4.4.3 Suboptimal linear estimation
Exercise 4.11. [recovery of large-scale signals] Consider the problem of estimating the image $Bx\in\mathbf R^\nu$ of signal $x\in\mathcal X$ from observation
$$\omega=Ax+\sigma\xi\in\mathbf R^m$$
in the simplest case where $\mathcal X=\{x\in\mathbf R^n:x^TSx\le 1\}$ is an ellipsoid (so that $S\succ0$), the recovery error is measured in $\|\cdot\|_2$, and $\xi\sim\mathcal N(0,I_m)$. In this case, Problem (4.12) to solve when building a “presumably good linear estimate” reduces to
$$\mathrm{Opt}=\min_{H,\lambda}\left\{\lambda+\sigma^2\|H\|_F^2:\ \begin{bmatrix}\lambda S&B^T-A^TH\\ B-H^TA&I_\nu\end{bmatrix}\succeq0\right\},\qquad(4.87)$$
where $\|\cdot\|_F$ is the Frobenius norm of a matrix. An optimal solution $H_*$ to this problem results in the linear estimate $\hat x_{H_*}(\omega)=H_*^T\omega$ satisfying the risk bound
$$\mathrm{Risk}[\hat x_{H_*}|\mathcal X]:=\max_{x\in\mathcal X}\sqrt{\mathbf E\{\|Bx-H_*^T(Ax+\sigma\xi)\|_2^2\}}\le\sqrt{\mathrm{Opt}}.$$
Now, (4.87) is an efficiently solvable convex optimization problem. However, when the sizes $m,n$ of the problem are large, solving the problem by standard optimization techniques could become prohibitively time-consuming. The goal of what follows is to develop a relatively cheap computational technique for finding a good enough suboptimal solution to (4.87). In the sequel, we assume that $A\ne0$; otherwise (4.87) is trivial.
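As an illustration of how the objective of (4.87) can be evaluated at a given candidate $H$, here is a minimal NumPy sketch. It is not part of the original text: the function name `risk_bound_sq` is hypothetical, and the sketch relies on the Schur complement observation that the LMI in (4.87) holds if and only if $\lambda S\succeq[B-H^TA]^T[B-H^TA]$, so that the smallest feasible $\lambda$ is the squared spectral norm of $[B-H^TA]L^{-T}$, where $S=LL^T$.

```python
import numpy as np

def risk_bound_sq(H, A, B, S, sigma):
    """Value of the objective of (4.87) at H, with the smallest lambda feasible for this H.

    Hypothetical helper: H is m x nu, A is m x n, B is nu x n, S is n x n positive definite.
    """
    E = B - H.T @ A                          # nu x n residual matrix B - H^T A
    L = np.linalg.cholesky(S)                # S = L L^T with lower triangular L
    E_scaled = np.linalg.solve(L, E.T).T     # E L^{-T}
    lam = np.linalg.norm(E_scaled, 2) ** 2   # smallest lambda with lambda*S >= E^T E
    return lam + sigma ** 2 * np.linalg.norm(H, "fro") ** 2

# Tiny example: with H = 0 the bound is the squared worst-case norm of Bx over the ellipsoid.
A = np.eye(2)
B = np.diag([1.0, 2.0])
S = np.diag([1.0, 4.0])
value_at_zero = risk_bound_sq(np.zeros((2, 2)), A, B, S, sigma=0.5)
```

The square root of the returned value upper-bounds the risk of the linear estimate $H^T\omega$; minimizing over $H$ is what (4.87) does.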
1) Prove that problem (4.87) can be reduced to a similar problem with S = In and diagonal positive semidefinite matrix A, the reduction requiring several singular value decompositions and multiplications of matrices of the same sizes as those of A, B and S.
Solution: First, we can assume w.l.o.g. that $A$ is a square matrix (i.e., $m=n$). Indeed, when $m>n$, computing the singular value decomposition $A=UDV^T$ of $A$, multiplying $A$ from the left by $U^T$ and passing from $\omega$ to $U^T\omega$, we reduce the situation to the one where the last $m-n$ rows of $A$ are zero. Eliminating these rows in $A$ and in $H$, we arrive at a problem equivalent to (4.87) of the same form with the initial value of $m$ replaced with $m=n$. Similarly, when $m<n$, adding to $A$ $n-m$ zero rows, we arrive at a problem equivalent to (4.87) of the same form with $m=n$.

Assuming from now on that $A$ is square, let us reduce the situation to the one where $S=I_n$. To this end we find the Cholesky decomposition $S=DD^T$ with triangular $D$ and observe that by multiplying the LMI constraint in (4.87) by $\mathrm{Diag}\{D^{-1},I_\nu\}$ from the left and by the transpose of this matrix from the right the problem becomes
$$\begin{array}{c}
\mathrm{Opt}=\min\limits_{H,\lambda}\left\{\lambda+\sigma^2\|H\|_F^2:\ \begin{bmatrix}\lambda I_n&\widetilde B^T-\widetilde A^TH\\ \widetilde B-H^T\widetilde A&I_\nu\end{bmatrix}\succeq0\right\},\\
\widetilde B=B(D^{-1})^T,\ \widetilde A=A(D^{-1})^T.
\end{array}\qquad(6.17)$$
Now let us compute the singular value decomposition
$$\widetilde A=U\Lambda V^T$$
of $\widetilde A$. Multiplying the semidefinite constraint in (6.17) by $\mathrm{Diag}\{V^T,I_\nu\}$ from the left and by the transpose of this matrix from the right, we arrive at the problem
$$\mathrm{Opt}=\min_{H,\lambda}\left\{\lambda+\sigma^2\|H\|_F^2:\ \begin{bmatrix}\lambda I_n&\widehat B^T-\Lambda^TU^TH\\ \widehat B-H^TU\Lambda&I_\nu\end{bmatrix}\succeq0\right\},$$
equivalent to (6.17), with $\widehat B=\widetilde BV$, and it remains to pass in the latter problem from variable $H$ to variable $\bar H=U^TH$. ✷
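The reduction just described can be sketched numerically as follows. This is an illustrative NumPy sketch rather than code from the original text: the name `reduce_problem` is hypothetical, $A$ is assumed to be already square, and the singular values are reordered to be ascending (NumPy returns them in descending order) to match the convention $0\le\alpha_1\le...\le\alpha_n$ used in the next item.

```python
import numpy as np

def reduce_problem(A, B, S):
    """Reduce (4.87) with square A to the case S = I, A = Diag(alpha).

    Returns (alpha, B_hat, U): the reduced problem has data Diag(alpha), B_hat,
    and its variable H_bar is related to the original H by H_bar = U^T H.
    """
    D = np.linalg.cholesky(S)                  # S = D D^T, D lower triangular
    A_t = np.linalg.solve(D, A.T).T            # A (D^{-1})^T
    B_t = np.linalg.solve(D, B.T).T            # B (D^{-1})^T
    U, alpha, Vt = np.linalg.svd(A_t)          # A_t = U Diag(alpha) V^T
    U, alpha, Vt = U[:, ::-1], alpha[::-1], Vt[::-1, :]   # ascending singular values
    B_hat = B_t @ Vt.T
    return alpha, B_hat, U

# Consistency check on random data: the spectral norms entering the LMI coincide.
rng = np.random.default_rng(0)
n, nu = 5, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((nu, n))
R = rng.standard_normal((n, n))
S = R @ R.T + n * np.eye(n)
alpha, B_hat, U = reduce_problem(A, B, S)
H = rng.standard_normal((n, nu))
H_bar = U.T @ H
D = np.linalg.cholesky(S)
lhs = np.linalg.norm(np.linalg.solve(D, (B - H.T @ A).T), 2)   # ||(B - H^T A) D^{-T}||
rhs = np.linalg.norm(B_hat - H_bar.T @ np.diag(alpha), 2)      # ||B_hat - H_bar^T Diag(alpha)||
```

Since $U$ is orthogonal, $\|H\|_F=\|\bar H\|_F$, so the two problems have the same objective values at corresponding points.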
2) By item 1, we can assume from the very beginning that $S=I$ and $A=\mathrm{Diag}\{\alpha_1,...,\alpha_n\}$ with $0\le\alpha_1\le\alpha_2\le...\le\alpha_n$. Passing in (4.87) from variables $\lambda,H$ to variables $\tau=\sqrt\lambda$, $G=H^T$, the problem becomes
$$\mathrm{Opt}=\min_{G,\tau}\left\{\tau^2+\sigma^2\|G\|_F^2:\ \|B-GA\|\le\tau\right\},\qquad(4.88)$$
where $\|\cdot\|$ is the spectral norm. Now consider the construction as follows:
• Consider a partition $\{1,...,n\}=I_0\cup I_1\cup...\cup I_K$ of the index set $\{1,...,n\}$ into consecutive segments in such a way that (a) $I_0$ is the set of those $i$, if any, for which $\alpha_i=0$, and $I_k\ne\emptyset$ when $k\ge1$, (b) for $k\ge1$ the ratios $\alpha_j/\alpha_i$, $i,j\in I_k$, do not exceed $\theta>1$ ($\theta$ is the parameter of our construction), while (c) for $1\le k<k'\le K$, the ratios $\alpha_j/\alpha_i$, $i\in I_k$, $j\in I_{k'}$, are $>\theta$. The recipe for building the partition is self-evident, and we clearly have
$$K\le\ln(\overline\alpha/\underline\alpha)/\ln(\theta)+1,$$
where $\overline\alpha$ is the largest of the $\alpha_i$, and $\underline\alpha$ is the smallest of those $\alpha_i$ which are positive.
• For $1\le k\le K$, we denote by $i_k$ the first index in $I_k$, set $\alpha^k=\alpha_{i_k}$, $n_k=\mathrm{Card}\,I_k$, and define $A_k$ as an $n_k\times n_k$ diagonal matrix with diagonal entries $\alpha_i$, $i\in I_k$.
Now, given a $\nu\times n$ matrix $C$, let us specify $C_k$, $0\le k\le K$, as the $\nu\times n_k$ submatrix of $C$ comprised of columns with indexes from $I_k$, and consider the following parametric optimization problems:
$$\begin{array}{rcll}
\mathrm{Opt}^*_k(\tau)&=&\min_{G_k\in\mathbf R^{\nu\times n_k}}\left\{\|G_k\|_F^2:\ \|B_k-G_kA_k\|\le\tau\right\}&(P_k^*[\tau])\\
\mathrm{Opt}_k(\tau)&=&\min_{G_k\in\mathbf R^{\nu\times n_k}}\left\{\|G_k\|_F^2:\ \|B_k-\alpha^kG_k\|\le\tau\right\}&(P_k[\tau])
\end{array}$$
where $\tau\ge0$ is the parameter, and $1\le k\le K$. Justify the following simple observations:
2.1) $G_k$ is feasible for $(P_k[\tau])$ if and only if the matrix $G_k^*=\alpha^kG_kA_k^{-1}$ is feasible for $(P_k^*[\tau])$, and $\|G_k^*\|_F\le\|G_k\|_F\le\theta\|G_k^*\|_F$, implying that
$$\mathrm{Opt}^*_k(\tau)\le\mathrm{Opt}_k(\tau)\le\theta^2\mathrm{Opt}^*_k(\tau);$$
2.2) Problems $(P_k[\tau])$ are easy to solve: if $B_k=U_kD_kV_k^T$ is the singular value decomposition of $B_k$ and $\sigma_{k\ell}$, $1\le\ell\le\nu_k:=\min[\nu,n_k]$, are diagonal entries of $D_k$, then an optimal solution to $(P_k[\tau])$ is
$$\widehat G_k[\tau]=[\alpha^k]^{-1}U_kD_k[\tau]V_k^T,$$
where $D_k[\tau]$ is the diagonal matrix obtained from $D_k$ by truncating the diagonal entries $\sigma_{k\ell}\mapsto[\sigma_{k\ell}-\tau]_+$ (from now on, $a_+=\max[a,0]$, $a\in\mathbf R$). The optimal value in $(P_k[\tau])$ is
$$\mathrm{Opt}_k(\tau)=[\alpha^k]^{-2}\sum_{\ell=1}^{\nu_k}[\sigma_{k\ell}-\tau]_+^2.$$
2.3) If $(\tau,G)$ is a feasible solution to (4.88) then $\tau\ge\underline\tau:=\|B_0\|$, and the matrices $G_k$, $1\le k\le K$, are feasible solutions to problems $(P_k^*[\tau])$, implying that
$$\sum_k\mathrm{Opt}^*_k(\tau)\le\|G\|_F^2.$$
And vice versa: if $\tau\ge\underline\tau$, $G_k$, $1\le k\le K$, are feasible solutions to problems $(P_k^*[\tau])$, and
$$K_+=\begin{cases}K,&I_0=\emptyset\\ K+1,&I_0\ne\emptyset\end{cases},$$
then the matrix $G=[0_{\nu\times n_0},G_1,...,G_K]$ and $\tau_+=\sqrt{K_+}\,\tau$ form a feasible solution to (4.88).
Extract from these observations that if $\tau_*$ is an optimal solution to the convex optimization problem
$$\min_\tau\left\{\theta^2\tau^2+\sigma^2\sum_{k=1}^K\mathrm{Opt}_k(\tau):\ \tau\ge\underline\tau\right\}\qquad(4.89)$$
and $G_{k,*}$ are optimal solutions to the problems $(P_k[\tau_*])$, then the pair
$$\widehat\tau=\sqrt{K_+}\,\tau_*,\quad\widehat G=[0_{\nu\times n_0},G^*_{1,*},...,G^*_{K,*}]\qquad[G^*_{k,*}=\alpha^kG_{k,*}A_k^{-1}]$$
is a feasible solution to (4.88), and the value of the objective of the latter problem at this feasible solution is within factor $\max[K_+,\theta^2]$ of the true optimal value Opt of this problem. As a result, $\widehat G$ gives rise to a linear estimate with risk on $\mathcal X$ which is within factor $\max[\sqrt{K_+},\theta]$ of the risk $\sqrt{\mathrm{Opt}}$ of the “presumably good” linear estimate yielded by an optimal solution to (4.87).
Notice that
• After carrying out singular value decompositions of matrices $B_k$, $1\le k\le K$, specifying $\tau_*$ and $G_{k,*}$ requires solving a univariate convex minimization problem with an easy-to-compute objective, so that the problem can be easily solved, e.g., by bisection;
• The computationally cheap suboptimal solution we end up with is not that bad, since $K$ is “moderate”: just logarithmic in the condition number $\overline\alpha/\underline\alpha$ of $A$.
Solution: 2.1 is readily given by the fact that by construction $B_k-\alpha^kG_k=B_k-G_k^*A_k$, combined with $\|G_k^*\|_F\le\|G_k\|_F$ (because $G_k^*$ is obtained from $G_k$ by multiplying $G_k$ by a matrix with spectral norm $\le1$) and $\|G_k\|_F\le\theta\|G_k^*\|_F$ (because $G_k$ is obtained from $G_k^*$ by multiplying $G_k^*$ by a matrix with spectral norm $\le\theta$).
2.2: Representing the variable $G_k$ in $(P_k[\tau])$ as $G_k=U_kE_kV_k^T$ and taking into account that $U_k$ and $V_k$ are orthogonal, $(P_k[\tau])$ becomes the problem
$$\mathrm{Opt}_k(\tau)=\min_{E_k}\left\{\|E_k\|_F^2:\ \|D_k-\alpha^kE_k\|\le\tau\right\},\qquad(*)$$
and all we need to verify is that an optimal solution of the latter problem is obtained from $D_k$ by replacing the diagonal entries $\sigma_{k\ell}$ in $D_k$ with $\gamma_{k\ell}=[\alpha^k]^{-1}[\sigma_{k\ell}-\tau]_+$ and keeping the off-diagonal entries zeros. This is immediate: the latter matrix is a feasible solution to $(*)$ with the value of the objective $\sum_\ell\gamma_{k\ell}^2$, while the $\ell$-th diagonal entry in a feasible solution to $(*)$ should be $\ge\gamma_{k\ell}$ (since the magnitude of an entry in a matrix does not exceed the spectral norm of the matrix), so that the value of the objective at any feasible solution to the problem is $\ge\sum_\ell\gamma_{k\ell}^2$. ✷
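The closed-form solution of $(P_k[\tau])$ from item 2.2 amounts to shrinking the singular values of $B_k$ by $\tau$. A minimal NumPy sketch follows; the function name `solve_Pk` is hypothetical, and the thin SVD is used in place of the full one, which does not change the resulting matrix.

```python
import numpy as np

def solve_Pk(Bk, alpha_k, tau):
    """Optimal solution and optimal value of (P_k[tau]):
    min ||G||_F^2 subject to ||Bk - alpha_k * G|| <= tau (spectral norm)."""
    U, s, Vt = np.linalg.svd(Bk, full_matrices=False)
    s_trunc = np.maximum(s - tau, 0.0)            # sigma -> [sigma - tau]_+
    G = (U * s_trunc) @ Vt / alpha_k              # alpha_k^{-1} U D[tau] V^T
    opt = np.sum(s_trunc ** 2) / alpha_k ** 2
    return G, opt

# Example: singular values 3 and 1, alpha_k = 2, tau = 1.
Bk = np.diag([3.0, 1.0])
G, opt = solve_Pk(Bk, alpha_k=2.0, tau=1.0)
residual = np.linalg.norm(Bk - 2.0 * G, 2)        # spectral norm of the residual
```

In this example the truncated singular values are 2 and 0, so the optimal value is $2^2/2^2=1$ and the residual has spectral norm exactly $\tau=1$.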
2.3: Let $(\tau,G)$ be a feasible solution to (4.88). Since $A$ is diagonal, we have $[B-GA]_0=B_0-G_0A_0=B_0$, and since the spectral norm of a submatrix is $\le$ that of the matrix, we have $\tau\ge\|B_0\|=\underline\tau$. By similar reasons, for $k\ge1$ we have $\tau\ge\|[B-GA]_k\|=\|B_k-G_kA_k\|$, that is, $G_k$ is feasible for $(P_k^*[\tau])$, $1\le k\le K$.
Now let $\tau\ge\underline\tau$ and let $G_k$ be feasible for $(P_k^*[\tau])$, $1\le k\le K$. Setting $G=[0_{\nu\times n_0},G_1,...,G_K]$, observe that
$$\|[B-GA]_0\|=\|B_0\|=\underline\tau\le\tau\ \text{ and }\ \|[B-GA]_k\|=\|B_k-G_kA_k\|\le\tau.$$
It follows that for every vector $z=[z_0;...;z_K]\in\mathbf R^n$ with blocks $z_k$ of dimension $n_k$ it holds
$$\|[B-GA]z\|_2\le\sum_{k=0}^K\|[B_k-G_kA_k]z_k\|_2\le\tau\sum_{k=0}^K\|z_k\|_2\le\tau\sqrt{K_+}\,\|z\|_2,$$
that is, $\|B-GA\|\le\sqrt{K_+}\,\tau$, implying that $(\sqrt{K_+}\,\tau,G)$ is a feasible solution to (4.88). ✷
Now let us extract from 2.1–3 the desired conclusions. Let $(\bar\tau,\bar G)$ be an optimal solution to (4.88), so that
$$\mathrm{Opt}=\bar\tau^2+\sigma^2\sum_{k=0}^K\|\bar G_k\|_F^2.$$
Replacing $\bar G_0$ with $0_{\nu\times n_0}$, we get another feasible solution to the problem and can only reduce the value of the objective, implying that $\bar G_0=0$, whence
$$\mathrm{Opt}=\bar\tau^2+\sigma^2\sum_{k=1}^K\|\bar G_k\|_F^2.$$
By the first part of 2.3, the matrices $\bar G_k$ are feasible solutions to $(P_k^*[\bar\tau])$, whence
$$\bar\tau^2+\sigma^2\sum_{k=1}^K\mathrm{Opt}^*_k(\bar\tau)\le\bar\tau^2+\sigma^2\sum_{k=1}^K\|\bar G_k\|_F^2=\mathrm{Opt}.$$
By 2.1, we have $\bar G_k=\alpha^kG_kA_k^{-1}$ for some feasible solutions $G_k$ to problems $(P_k[\bar\tau])$ such that $\|G_k\|_F^2\le\theta^2\|\bar G_k\|_F^2$, whence
$$\theta^2\bar\tau^2+\sigma^2\sum_k\mathrm{Opt}_k(\bar\tau)\le\theta^2\bar\tau^2+\sigma^2\sum_k\|G_k\|_F^2\le\theta^2\Big[\bar\tau^2+\sigma^2\sum_k\|\bar G_k\|_F^2\Big]\le\theta^2\mathrm{Opt}.$$
Due to the origin of $\tau_*$ (note that $\bar\tau\ge\underline\tau$ is feasible for (4.89)), we conclude that
$$\theta^2\tau_*^2+\sigma^2\sum_k\|G_{k,*}\|_F^2=\theta^2\tau_*^2+\sigma^2\sum_k\mathrm{Opt}_k(\tau_*)\le\theta^2\bar\tau^2+\sigma^2\sum_k\mathrm{Opt}_k(\bar\tau)\le\theta^2\mathrm{Opt},$$
whence, by 2.1,
$$\theta^2\tau_*^2+\sigma^2\sum_k\|G^*_{k,*}\|_F^2\le\theta^2\mathrm{Opt}.\qquad(!)$$
Now, by 2.1, the matrices $G^*_{k,*}$ are feasible solutions to problems $(P_k^*[\tau_*])$, whence by the second part of 2.3, the resulting pair $(\sqrt{K_+}\,\tau_*,[0_{\nu\times n_0},G^*_{1,*},...,G^*_{K,*}])$ is a feasible solution to (4.88), as claimed. Next, when $K_+\le\theta^2$, (!) implies that
$$K_+\tau_*^2+\sigma^2\sum_k\|G^*_{k,*}\|_F^2\le\theta^2\mathrm{Opt},$$
that is, the value of the objective of problem (4.88) at the resulting feasible solution to the problem is within factor $\theta^2$ of Opt. When $K_+>\theta^2$, we get
$$K_+\tau_*^2+\sigma^2\sum_k\|G^*_{k,*}\|_F^2\le K_+\theta^{-2}\Big[\theta^2\tau_*^2+\sigma^2\sum_k\|G^*_{k,*}\|_F^2\Big]\le K_+\mathrm{Opt},$$
where the concluding inequality is due to (!). Thus, in all cases the value of the objective of problem (4.88) at the feasible solution to this problem yielded by our construction is at most $\max[K_+,\theta^2]\,\mathrm{Opt}$. ✷
Your next task is as follows:
3) To get an idea of the performance of the proposed synthesis of “suboptimal” linear estimation, run numerical experiments as follows:
• select somehow $n$ and generate at random the $n\times n$ data matrices $S$, $A$, $B$;
• for “moderate” values of $n$ compute both the linear estimate yielded by the optimal solution to (4.12)$^7$ and the suboptimal estimate as yielded by the above construction, and compare their risk bounds and the associated CPU times. For “large” $n$, where solving (4.12) becomes prohibitively time consuming, compute only a suboptimal estimate in order to get an impression of how the corresponding CPU time grows with $n$.
Recommended setup:
• range of $n$: 50, 100 (“moderate” values), 1,000, 2,000 (“large” values)
• range of $\sigma$: $\{1.0, 0.01, 0.0001\}$
• generation of $S$, $A$, $B$: generate the matrices at random according to
$$S=U_S\mathrm{Diag}\{1,2,...,n\}U_S^T,\quad A=U_A\mathrm{Diag}\{\mu_1,...,\mu_n\}V_A^T,\quad B=U_B\mathrm{Diag}\{\mu_1,...,\mu_n\}V_B^T,$$
where $U_S,U_A,V_A,U_B,V_B$ are random orthogonal $n\times n$ matrices, and $\mu_i$ form a geometric progression with $\mu_1=0.01$ and $\mu_n=1$.
You could run the above construction for several values of $\theta$ and select the best, in terms of its risk bound, of the resulting suboptimal estimates.
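As a starting point for such experiments, the construction of item 2 can be sketched in NumPy as follows, for the already reduced problem (4.88) with $S=I$ and $A=\mathrm{Diag}\{\alpha\}$, $\alpha$ ascending. This is an illustrative sketch, not the authors' code: the names `partition` and `suboptimal_solution` are hypothetical, the partition is built by a greedy rule (a new segment starts once the ratio to the first element of the current segment exceeds $\theta$, which ensures property (b)), and the univariate convex problem (4.89) is solved by ternary search instead of bisection.

```python
import numpy as np

def partition(alpha, theta):
    """Greedy split of ascending alpha into I0 (zero entries) and segments I1..IK."""
    n0 = int(np.sum(alpha == 0))               # zeros come first, alpha being ascending
    groups, start = [], n0
    for i in range(n0 + 1, len(alpha)):
        if alpha[i] > theta * alpha[start]:    # ratio to the first index exceeds theta
            groups.append(np.arange(start, i))
            start = i
    if start < len(alpha):
        groups.append(np.arange(start, len(alpha)))
    return np.arange(n0), groups

def suboptimal_solution(alpha, B, sigma, theta, iters=100):
    """Feasible (G, tau_hat) for (4.88) and the objective value, via (4.89)."""
    I0, groups = partition(alpha, theta)
    tau_lo = np.linalg.norm(B[:, I0], 2) if len(I0) else 0.0     # ||B_0||
    svds = [np.linalg.svd(B[:, I], full_matrices=False) for I in groups]
    a = [alpha[I[0]] for I in groups]                            # alpha^k

    def objective(tau):                                          # objective of (4.89)
        tail = sum(np.sum(np.maximum(s - tau, 0.0) ** 2) / ak ** 2
                   for (_, s, _), ak in zip(svds, a))
        return theta ** 2 * tau ** 2 + sigma ** 2 * tail

    lo, hi = tau_lo, max([tau_lo] + [s[0] for _, s, _ in svds])
    for _ in range(iters):                                       # ternary search, convex objective
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if objective(m1) <= objective(m2):
            hi = m2
        else:
            lo = m1
    tau = 0.5 * (lo + hi)

    G = np.zeros(B.shape)
    for I, (U, s, Vt), ak in zip(groups, svds, a):
        Gk = (U * np.maximum(s - tau, 0.0)) @ Vt / ak            # optimal for (P_k[tau])
        G[:, I] = ak * Gk / alpha[I]                             # G*_k = alpha^k G_k A_k^{-1}
    K_plus = len(groups) + (1 if len(I0) else 0)
    tau_hat = np.sqrt(K_plus) * tau
    return G, tau_hat, tau_hat ** 2 + sigma ** 2 * np.sum(G ** 2)

rng = np.random.default_rng(1)
alpha = np.concatenate([[0.0], np.geomspace(0.01, 1.0, 9)])
B = rng.standard_normal((4, 10))
G, tau_hat, value = suboptimal_solution(alpha, B, sigma=0.1, theta=2.0)
```

The returned `value` is the objective of (4.88) at the constructed feasible point; its square root is the risk bound of the suboptimal estimate. Running the function for several values of `theta` and keeping the smallest `value` mirrors the recommendation above.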
Solution: Here are some numerical results ($\theta$ runs through the range $\{2^{i/2},\ 1\le i\le4\}$):

n = 50
σ        RiskO     RiskSO    RiskSO/RiskO   cpuO      cpuSO   cpuO/cpuSO
1.0000   0.37877   0.41538   1.10           14.92     0.02    7.5e2
0.0100   0.13623   0.21085   1.55           10.18     0.01    1.0e3
0.0001   0.00775   0.00778   1.00           14.72     0.02    7.7e2

n = 100
σ        RiskO     RiskSO    RiskSO/RiskO   cpuO      cpuSO   cpuO/cpuSO
1.0000   0.33499   0.37705   1.13           633.29    0.10    6.3e3
0.0100   0.12953   0.17978   1.39           364.68    0.04    9.1e3
0.0001   0.01095   0.01114   1.02           532.94    0.05    1.2e4

n = 1000
σ        RiskSO    cpuSO
1.0000   0.34067   3.84
0.0100   0.08272   3.97
0.0001   0.02631   4.03

n = 2000
σ        RiskSO    cpuSO
1.0000   0.31069   53.72
0.0100   0.07098   53.20
0.0001   0.02714   53.98

RiskO, cpuO: risk bound and CPU time (sec) for the optimal linear estimate
RiskSO, cpuSO: risk bound and CPU time (sec) for the suboptimal linear estimate
No data for RiskO and cpuO for n = 1000 and n = 2000: (4.12) becomes too computationally expensive

7 When $\mathcal X$ is an ellipsoid, the semidefinite relaxation bound on the maximum of a quadratic form over $x\in\mathcal X$ is exact, so that we are in the case where an optimal solution to (4.12) yields the best, in terms of its risk on $\mathcal X$, linear estimate.
We believe the results speak for themselves.

4.11.A. Simple case. There is a trivial case where (4.88) is really easy; this is the case where the right orthogonal factors in the singular value decompositions of $A$ and $B$ are the same, that is, when
$$B=WFV^T,\quad A=UDV^T$$
with orthogonal $n\times n$ matrices $W,U,V$ and diagonal $F,D$. This case, very special at first glance, is in fact of some importance: it covers the denoising situation where $B=A$, so that our goal is to denoise our observation of $Ax$ given a priori information $x\in\mathcal X$ on $x$. In this situation, setting $W^TH^TU=G$, problem (4.88) becomes
$$\mathrm{Opt}=\min_G\left\{\|F-GD\|^2+\sigma^2\|G\|_F^2\right\}.\qquad(4.90)$$
Now comes the concluding part of the exercise:
4) Prove that in the situation in question an optimal solution $G_*$ to (4.90) can be selected to be diagonal, with diagonal entries $\gamma_i$, $1\le i\le n$, yielded by the optimal solution to the optimization problem
$$\mathrm{Opt}=\min_\gamma\left\{f(G):=\max_{i\le n}(\phi_i-\gamma_i\delta_i)^2+\sigma^2\sum_{i=1}^n\gamma_i^2\right\}\qquad[\phi_i=F_{ii},\ \delta_i=D_{ii}]$$
Solution: Let $E$ be a diagonal $n\times n$ matrix with diagonal entries $\pm1$. It is immediately seen that for every candidate solution $G$ to (4.90), $EGE$ is another candidate solution with the same value of the objective. Since the problem is convex, taking the average of solutions $EGE$ over all $2^n$ matrices $E$ of the above type, we get a solution $\bar G$ which clearly is diagonal and satisfies $f(\bar G)\le f(G)$. Thus, we lose nothing when restricting ourselves to diagonal solutions to (4.90). ✷
Exercise 4.12. [image reconstruction–follow-up to Exercise 4.11] A grayscale image can be represented by an m × n matrix x = [xpq ] 0≤p A∗ + θǫ B∗ } ≤ ǫ
(4.97)
√
2 ln( 2e1/4 /ǫ).
Solution: (i): Let $(H,\Lambda,\Upsilon,\Upsilon',\Theta)$ be feasible for (4.42). When $\alpha>0$, the collection $(H,\alpha\Lambda,\alpha^{-1}\Upsilon,\Upsilon',\Theta)$ is feasible for (4.42) as well. Applying to this collection the first inequality in (4.95) and taking into account that $\phi_{\mathcal T}$ and $\phi_{\mathcal R}$ are positively homogeneous of degree 1, we get
$$\max_{y\in\mathcal X}\|[B-H^TA]y\|\le\alpha\phi_{\mathcal T}(\lambda[\Lambda])+\alpha^{-1}\phi_{\mathcal R}(\lambda[\Upsilon]),$$
which after minimization of the right-hand side in $\alpha$ results in
$$D:=\max_{y\in\mathcal X}\|[B-H^TA]y\|\le A_*.\qquad(6.22)$$
Similarly, when $\alpha>0$, the collection $(H,\Lambda,\Upsilon,\alpha^{-1}\Upsilon',\alpha\Theta)$ is feasible for (4.42). Applying to this collection (4.95) and taking into account that $\Gamma_\Pi(\cdot)$ and $\phi_{\mathcal R}(\cdot)$ are positively homogeneous of degree 1, we get from the second relation in (4.95) that
$$\mathbf E_{\xi\sim P}\|H^T\xi\|\le\alpha\Gamma_\Pi(\Theta)+\alpha^{-1}\phi_{\mathcal R}(\lambda[\Upsilon']);$$
minimizing the right-hand side in $\alpha>0$, we get
$$E:=\mathbf E_{\xi\sim P}\|H^T\xi\|\le B_*.\qquad(6.23)$$
When $x\in\mathcal X$, we clearly have $\mathbf E_{\xi\sim P}\{\|Bx-\hat x_H(Ax+\xi)\|\}\le D+E$, so that (6.22) and (6.23) yield the result stated in (i). ✷
(ii): Since $\mu\mu^T+S\lhd\Pi$, there exists $Q\in\Pi$ such that $\mu\mu^T+S\preceq Q$; we lose nothing when assuming that $Q=\mu\mu^T+S$. Let us set
$$\varphi:=\phi_{\mathcal R}(\lambda[\Upsilon']),\quad t=\sqrt{8\ln(\sqrt2e^{1/4}/\epsilon)},\quad\rho=\mathrm{Tr}(\Theta Q)\le\Gamma_\Pi(\Theta).\qquad(6.24)$$
Note that by Lemma 4.30 we have
$$\mathrm{Prob}_{\xi\sim P}\left\{\sqrt{\xi^T\Theta\xi}>t\sqrt\rho\right\}\le\sqrt2e^{1/4}\exp\{-t^2/8\}=\epsilon.\qquad(6.25)$$
Now, $\Theta$, $\Upsilon'$, and $H$ satisfy the last semidefinite constraint in (4.42), which, by a slight modification of the computation in the proof of Lemma 4.11, results in the bound
$$\forall\xi:\ \|H^T\xi\|\le2\sqrt{\xi^T\Theta\xi}\,\sqrt\varphi.\qquad(6.26)$$
Here is the required computation: for every $\alpha>0$ and every $y\in\mathcal Y$ we have
$$\begin{array}{rcl}
|y^TM^TH^T\xi|&=&|[\alpha^{1/2}y]^TM^TH^T[\alpha^{-1/2}\xi]|\ \le\ \alpha y^T\left[\sum_\ell\mathcal S^*_\ell[\Upsilon'_\ell]\right]y+\alpha^{-1}\xi^T\Theta\xi\quad[\text{by the last constraint in (4.42)}]\\
&\le&\alpha\max_{y\in\mathcal Y}\sum_\ell\mathrm{Tr}(\mathcal S^*_\ell[\Upsilon'_\ell]yy^T)+\alpha^{-1}\xi^T\Theta\xi\\
&=&\alpha\max_{y\in\mathcal Y}\sum_\ell\mathrm{Tr}(\Upsilon'_\ell\mathcal S_\ell[yy^T])+\alpha^{-1}\xi^T\Theta\xi\\
&=&\alpha\max_{y\in\mathcal Y}\sum_\ell\mathrm{Tr}(\Upsilon'_\ell S_\ell^2[y])+\alpha^{-1}\xi^T\Theta\xi\\
&\le&\alpha\max_{r\in\mathcal R}\sum_\ell r_\ell\mathrm{Tr}(\Upsilon'_\ell)+\alpha^{-1}\xi^T\Theta\xi\ =\ \alpha\phi_{\mathcal R}(\lambda[\Upsilon'])+\alpha^{-1}\xi^T\Theta\xi.
\end{array}$$
For $\xi$ fixed, optimizing the resulting bound in $\alpha$, we get
$$\forall(y\in\mathcal Y):\ |y^TM^TH^T\xi|\le2\sqrt{[\xi^T\Theta\xi]\varphi}.$$
When maximizing the left-hand side in this inequality in $y\in\mathcal Y$, we arrive at (6.26).
It remains to note that in view of $x\in\mathcal X$ and relations (6.22) and (6.26) one has
$$\|Bx-\hat x_H(Ax+\xi)\|=\|[B-H^TA]x-H^T\xi\|\le\|[B-H^TA]x\|+\|H^T\xi\|\le A_*+2\sqrt{\xi^T\Theta\xi}\,\sqrt\varphi.$$
Recalling what $\varphi$ is and taking into account (6.24) and (6.25), we arrive at (4.97). ✷
3) Suppose we are given observation $\omega=Ax+\xi$ of an unknown signal $x$ known to belong to a given spectratope $\mathcal X\subset\mathbf R^n$ and want to recover the signal. We quantify the error of a recovery $\hat x$ by $\max_{k\le K}\|B_k(\hat x-x)\|_{(k)}$, where $B_k\in\mathbf R^{\nu_k\times n}$ are given matrices, and $\|\cdot\|_{(k)}$ are given norms on $\mathbf R^{\nu_k}$ (for example, $x$ can represent a discretization of a continuous-time signal, and $B_kx$ can be finite-difference approximations of the signal's derivatives). We also assume, as in item 2, that observation noise $\xi$ is independent of signal $x$ and is sub-Gaussian with sub-Gaussianity parameters $\mu,S$ satisfying $\mu\mu^T+S\preceq Q$, for some given matrix $Q\succ0$. Finally, we suppose that the unit balls of the norms conjugate to the norms $\|\cdot\|_{(k)}$ are spectratopes. In this situation, Proposition 4.14 provides us with $K$ efficiently computable linear estimates $\hat x_k(\omega)=H_k^T\omega:\mathbf R^{\dim\omega}\to\mathbf R^{\nu_k}$ along with upper bounds $\mathrm{Opt}_k$ on their risks $\max_{x\in\mathcal X}\mathbf E\{\|B_kx-\hat x_k(Ax+\xi)\|_{(k)}\}$. Think about how, given reliability tolerance $\epsilon\in(0,1)$, to aggregate these linear estimates into a single estimate $\hat x(\omega):\mathbf R^{\dim\omega}\to\mathbf R^n$ such that for every $x\in\mathcal X$, the probability of the event
$$\|B_k(\hat x(Ax+\xi)-x)\|_{(k)}\le\theta\,\mathrm{Opt}_k,\ 1\le k\le K,\qquad(!)$$
is at least $1-\epsilon$, for some moderate (namely, logarithmic in $K$ and $1/\epsilon$) “assembling price” $\theta$.
Solution: Let $t_*=t_*(\epsilon):=\sqrt{8\ln\left(\sqrt2\exp\{1/4\}K/\epsilon\right)}$. By item 2, for every $x\in\mathcal X$ and $k\le K$ we have
$$\mathrm{Prob}\{\|B_kx-\hat x_k(Ax+\xi)\|_{(k)}>t_*\mathrm{Opt}_k\}\le\epsilon/K,$$
whence for every $x\in\mathcal X$ the probability of the event
$$\mathcal E_x=\{\xi:\|B_kx-\hat x_k(Ax+\xi)\|_{(k)}\le t_*\mathrm{Opt}_k,\ 1\le k\le K\}$$
is at least $1-\epsilon$. Now let us build the estimate $\hat x(\cdot)$ as follows: given $\omega$, we compute $x_k=\hat x_k(\omega)$, $1\le k\le K$, and solve the convex feasibility problem
$$\text{find }\bar x:\ \|B_k\bar x-x_k\|_{(k)}\le t_*\mathrm{Opt}_k,\ k\le K.$$
If this problem is feasible, $\hat x(\omega)$ is its feasible solution; otherwise $\hat x(\omega)$ is, say, the origin in the space where signals live. Let us verify that the resulting estimate, augmented with $\theta=2t_*$, is what we are looking for. Indeed, let $x\in\mathcal X$ and $\omega$ be such that $\mathcal E_x$ takes place (which happens with probability at least $1-\epsilon$). In this case our feasibility problem is solvable, $x$ being one of its feasible solutions, implying that $\bar x=\hat x(\omega)$ is another feasible solution to this problem. Since both $x$ and $\bar x$ are feasible, we conclude that
$$\|B_k(x-\bar x)\|_{(k)}\le\|B_kx-x_k\|_{(k)}+\|x_k-B_k\bar x\|_{(k)}\le2t_*\mathrm{Opt}_k\quad\forall k.$$
A better construction (which does not require a priori knowledge of $\epsilon$ and hopefully is less conservative in practice) is to specify $\hat x(\omega)$ as the $y$-component of the optimal solution to the convex optimization problem
$$\min_{t,y}\left\{t:\ t\ge0,\ \|B_ky-\hat x_k(\omega)\|_{(k)}\le t\,\mathrm{Opt}_k,\ 1\le k\le K\right\}.$$
Slightly and straightforwardly modifying the above reasoning, we conclude that this estimate, for every $x\in\mathcal X$ and every $\epsilon\in(0,1)$, satisfies the relation
$$\|B_k(x-\hat x(Ax+\xi))\|_{(k)}\le2t_*(\epsilon)\mathrm{Opt}_k,\ k\le K$$
with probability at least $1-\epsilon$.
✷
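To get a feeling for the size of the assembling price $\theta=2t_*(\epsilon)$ from the solution above, one can simply tabulate the formula. The sketch below uses only the Python standard library; the function name `assembling_price` is hypothetical.

```python
import math

def assembling_price(K, eps):
    """theta = 2 * t_*, with t_* = sqrt(8 * ln(sqrt(2) * exp(1/4) * K / eps))."""
    t_star = math.sqrt(8.0 * math.log(math.sqrt(2.0) * math.exp(0.25) * K / eps))
    return 2.0 * t_star

# The price grows only logarithmically in K and 1/eps.
prices = {(K, eps): assembling_price(K, eps)
          for K in (1, 10, 100) for eps in (1e-2, 1e-4)}
```

For instance, increasing $K$ from 1 to 100 at $\epsilon=0.01$ raises $\theta$ from about 12.9 to about 17.8, in line with the logarithmic dependence stated in the exercise.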
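Exercise 4.15 Prove that if $\xi$ is uniformly distributed on the unit sphere $\{x:\|x\|_2=1\}$ in $\mathbf R^n$, then $\xi$ is sub-Gaussian with parameters $(0,\frac1nI_n)$.

Solution: The function $f(\eta):=\mathbf E_\xi\{\exp\{\xi^T\eta\}\}$ clearly depends solely on $\|\eta\|_2$: $f(\eta)=\phi(\|\eta\|_2)$. Now let $\zeta$ be an $n$-dimensional Rademacher vector independent of $\xi$, and let $t\ge0$. We have
$$\mathbf E_{\xi,\zeta}\left\{\exp\{t\xi^T\zeta\}\right\}=\mathbf E_\zeta\left\{\mathbf E_\xi\left\{\exp\{\xi^T[t\zeta]\}\right\}\right\}=\mathbf E_\zeta\{\phi(\|t\zeta\|_2)\}=\mathbf E_\zeta\{\phi(t\sqrt n)\}=\phi(t\sqrt n),$$
and
$$\mathbf E_{\xi,\zeta}\left\{\exp\{t\xi^T\zeta\}\right\}=\mathbf E_\xi\left\{\mathbf E_\zeta\left\{\exp\{t\sum\nolimits_i\xi_i\zeta_i\}\right\}\right\}=\mathbf E_\xi\left\{\prod\nolimits_i\cosh(t\xi_i)\right\}\le\mathbf E_\xi\left\{\prod\nolimits_i\exp\{t^2\xi_i^2/2\}\right\}=\exp\{t^2/2\}$$
(we have used the fact that $\cosh(a)\le\exp\{a^2/2\}$, as can be seen by comparing the coefficients in the power series expansions of both quantities, and that $\sum_i\xi_i^2=1$). We conclude that $\phi(t\sqrt n)\le\exp\{t^2/2\}$ for all $t\ge0$, or, which is the same, $f(\eta)=\phi(\|\eta\|_2)\le\exp\{\eta^T\eta/(2n)\}$ for all $\eta$. ✷

6.4.4.5 Linear recovery under signal-dependent noise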
Exercise 4.16. Consider the situation as follows: we observe a realization $\omega$ of an $m$-dimensional random vector
$$\omega=Ax+\xi_x,$$
where
• $x$ is an unknown signal belonging to a given signal set $\mathcal X$, specifically, a spectratope (which, as usual, we can assume to be basic):
$$\mathcal X=\{x\in\mathbf R^n:\exists t\in\mathcal T:\ R_k^2[x]\preceq t_kI_{d_k},\ k\le K\}$$
with standard restrictions on $\mathcal T$ and $R_k[\cdot]$;
• $\xi_x$ is the observation noise with distribution which can depend on $x$; all we know is that
$$\mathrm{Var}[\xi_x]:=\mathbf E\{\xi_x\xi_x^T\}\preceq C[x],$$
where the entries of the symmetric matrix $C[x]$ are quadratic in $x$. We assume in the sequel that signals $x$ belong to the subset
$$\mathcal X_C=\{x\in\mathcal X:C[x]\succeq0\}$$
of $\mathcal X$;
• Our goal is to recover $Bx$, with given $B\in\mathbf R^{\nu\times n}$, in a given norm $\|\cdot\|$ such that the unit ball $B_*$ of the conjugate norm is a spectratope:
$$B_*=\{u:\|u\|_*\le1\}=M\mathcal V,\quad\mathcal V=\{v:\exists r\in\mathcal R:\ S_\ell^2[v]\preceq r_\ell I_{f_\ell},\ \ell\le L\}.$$
We quantify the performance of a candidate estimate $\hat x(\omega):\mathbf R^m\to\mathbf R^\nu$ by the risk
$$\mathrm{Risk}_{\|\cdot\|}[\hat x|\mathcal X_C]=\sup_{x\in\mathcal X_C}\ \sup_{\xi_x:\mathrm{Var}[\xi_x]\preceq C[x]}\mathbf E\{\|Bx-\hat x(Ax+\xi_x)\|\}.$$
1) Utilize semidefinite relaxation in order to build, in a computationally efficient fashion, a “presumably good” linear estimate; specifically, prove the following:
Proposition 4.32 In the situation in question, for $G\in\mathbf S^m$ let us define $\alpha_0[G]\in\mathbf R$, $\alpha_1[G]\in\mathbf R^n$, $\alpha_2[G]\in\mathbf S^n$ from the identity
$$\mathrm{Tr}(C[x]G)=\alpha_0[G]+\alpha_1^T[G]x+x^T\alpha_2[G]x\quad\forall(x\in\mathbf R^n,\ G\in\mathbf S^m),$$
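so that the $\alpha_\chi[G]$ are affine in $G$. Consider the convex optimization problem
$$\begin{array}{l}
\mathrm{Opt}=\min\limits_{H,\mu,D,\Lambda,\Upsilon,\Upsilon',G}\Big\{\mu+\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon])+\phi_{\mathcal R}(\lambda[\Upsilon']):\\
\quad\Lambda=\{\Lambda_k\in\mathbf S^{d_k}_+,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\in\mathbf S^{f_\ell}_+,\ \ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\in\mathbf S^{f_\ell}_+,\ \ell\le L\},\ D\in\mathbf S^m_+,\\
\quad\begin{bmatrix}\alpha_0[G]&\frac12\alpha_1^T[G]&\\ \frac12\alpha_1[G]&\alpha_2[G]&\frac12[B^T-A^TH]M\\ &\frac12M^T[B-H^TA]&\end{bmatrix}\preceq\begin{bmatrix}\mu-\alpha_0[D]&-\frac12\alpha_1^T[D]&\\ -\frac12\alpha_1[D]&\sum_k\mathcal R^*_k[\Lambda_k]-\alpha_2[D]&\\ &&\sum_\ell\mathcal S^*_\ell[\Upsilon_\ell]\end{bmatrix},\\
\quad\begin{bmatrix}G&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mathcal S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\},
\end{array}$$
where
$$\begin{array}{l}
[\mathcal R^*_k[\Lambda_k]]_{ij}=\mathrm{Tr}(\Lambda_k\frac12[R^{ki}R^{kj}+R^{kj}R^{ki}]),\text{ where }R_k[x]=\sum_jx_jR^{kj},\\
{}[\mathcal S^*_\ell[\Upsilon_\ell]]_{ij}=\mathrm{Tr}(\Upsilon_\ell\frac12[S^{\ell i}S^{\ell j}+S^{\ell j}S^{\ell i}]),\text{ where }S_\ell[v]=\sum_jv_jS^{\ell j},\\
\lambda[\{Z_i,\ i\le I\}]=[\mathrm{Tr}(Z_1);...;\mathrm{Tr}(Z_I)],\quad\phi_{\mathcal A}(q)=\max_{s\in\mathcal A}q^Ts.
\end{array}$$
Whenever $H,\mu,D,\Lambda,\Upsilon,\Upsilon'$, and $G$ are feasible for the problem, one has
$$\mathrm{Risk}_{\|\cdot\|}[\hat x_H|\mathcal X_C]\le\mu+\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon])+\phi_{\mathcal R}(\lambda[\Upsilon']),$$
where $\hat x_H(\omega)=H^T\omega$.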
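Solution: For a linear estimate $\hat x_H(\omega)=H^T\omega$, $x\in\mathcal X_C$, and random noise $\xi_x$ with matrix of second moments $\preceq C[x]$ we have
$$\begin{array}{rcl}
\mathbf E\{\|Bx-H^T(Ax+\xi_x)\|\}&\le&\|[B-H^TA]x\|+\mathbf E\{\|H^T\xi_x\|\}\\
&\le&\|[B-H^TA]x\|+\min\limits_{G,\Upsilon'}\left\{\phi_{\mathcal R}(\lambda[\Upsilon'])+\mathrm{Tr}(C[x]G):\ \begin{array}{l}\Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\}\\ \begin{bmatrix}G&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mathcal S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\end{array}\right\}\quad[\text{by Corollary 4.12}]\\
&=&\min\limits_{G,\Upsilon'}\left\{\|[B-H^TA]x\|+\phi_{\mathcal R}(\lambda[\Upsilon'])+\mathrm{Tr}(C[x]G):\ \begin{array}{l}\Upsilon'=\{\Upsilon'_\ell\succeq0,\ \ell\le L\}\\ \begin{bmatrix}G&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mathcal S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\end{array}\right\}.
\end{array}\qquad(6.27)$$
Consider the quantity
$$F_G(v,x)=v^TM^T[B-H^TA]x+\mathrm{Tr}(C[x]G).$$
This is a quadratic function of $(v,x)$:
$$F_G(v,x)=\alpha_0[G]+\alpha_1^T[G]x+x^T\alpha_2[G]x+v^TM^T[B-H^TA]x.$$
We can write
$$F_G(v,x)=[1;x;v]^T\underbrace{\begin{bmatrix}\alpha_0[G]&\frac12\alpha_1^T[G]&\\ \frac12\alpha_1[G]&\alpha_2[G]&\frac12[B^T-A^TH]M\\ &\frac12M^T[B-H^TA]&\end{bmatrix}}_{\mathcal A[G,H]}[1;x;v].$$
Now let $\Upsilon=\{\Upsilon_\ell\in\mathbf S^{f_\ell},\ \ell\le L\}$, $\Lambda=\{\Lambda_k\in\mathbf S^{d_k},\ k\le K\}$, $\mu\in\mathbf R$, $D\in\mathbf S^m$ be such that
$$\Upsilon_\ell\succeq0,\ \ell\le L,\quad\Lambda_k\succeq0,\ k\le K,\quad D\succeq0,$$
and, in addition,
$$\mathcal A[G,H]\preceq\begin{bmatrix}\mu-\alpha_0[D]&-\frac12\alpha_1^T[D]&\\ -\frac12\alpha_1[D]&\sum_k\mathcal R^*_k[\Lambda_k]-\alpha_2[D]&\\ &&\sum_\ell\mathcal S^*_\ell[\Upsilon_\ell]\end{bmatrix}.\qquad(!)$$
We claim that in this case for all $x\in\mathcal X_C$ it holds
$$\|[B-H^TA]x\|+\mathrm{Tr}(C[x]G)\le\mu+\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]).$$
Indeed, we have
$$\begin{array}{rcl}
\max\limits_{x\in\mathcal X_C}\left\{\|[B-H^TA]x\|+\mathrm{Tr}(C[x]G)\right\}&=&\max\limits_{v\in\mathcal V,x\in\mathcal X_C}F_G(v,x)\\
&=&\max\limits_{v\in\mathcal V,x\in\mathcal X_C}[1;x;v]^T\mathcal A[G,H][1;x;v]\\
&\le&\max\limits_{v\in\mathcal V,x\in\mathcal X_C}\left\{[1;x;v]^T\mathcal A[G,H][1;x;v]+\mathrm{Tr}(C[x]D)\right\}\\
&&[\text{since }\mathrm{Tr}(C[x]D)\ge0\text{ when }x\in\mathcal X_C\text{ due to }D\succeq0]\\
&=&\max\limits_{v\in\mathcal V,x\in\mathcal X_C}[1;x;v]^T\left(\mathcal A[G,H]+\begin{bmatrix}\alpha_0[D]&\frac12\alpha_1^T[D]&\\ \frac12\alpha_1[D]&\alpha_2[D]&\\ &&0\end{bmatrix}\right)[1;x;v]\\
&\le&\max\limits_{v\in\mathcal V,x\in\mathcal X}[1;x;v]^T\begin{bmatrix}\mu&&\\ &\sum_k\mathcal R^*_k[\Lambda_k]&\\ &&\sum_\ell\mathcal S^*_\ell[\Upsilon_\ell]\end{bmatrix}[1;x;v]\quad[\text{by (!)}]\\
&=&\mu+\max\limits_{x\in\mathcal X}x^T\left[\sum_k\mathcal R^*_k[\Lambda_k]\right]x+\max\limits_{v\in\mathcal V}v^T\left[\sum_\ell\mathcal S^*_\ell[\Upsilon_\ell]\right]v\\
&=&\mu+\max\limits_{x\in\mathcal X}\sum_k\mathrm{Tr}(\Lambda_kR_k^2[x])+\max\limits_{v\in\mathcal V}\sum_\ell\mathrm{Tr}(\Upsilon_\ell S_\ell^2[v])\\
&\le&\mu+\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon])\quad[\text{cf. proof of Lemma 4.9}].
\end{array}$$
Invoking (6.27), we conclude that
$$\begin{array}{l}
\mathrm{Risk}_{\|\cdot\|}[\hat x_H|\mathcal X_C]\le\min\limits_{\mu,D,\Lambda,\Upsilon,\Upsilon',G}\Big\{\mu+\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon])+\phi_{\mathcal R}(\lambda[\Upsilon']):\\
\quad\Lambda=\{\Lambda_k\in\mathbf S^{d_k}_+,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\in\mathbf S^{f_\ell}_+,\ \ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\in\mathbf S^{f_\ell}_+,\ \ell\le L\},\ D\in\mathbf S^m_+,\\
\quad\begin{bmatrix}\alpha_0[G]&\frac12\alpha_1^T[G]&\\ \frac12\alpha_1[G]&\alpha_2[G]&\frac12[B^T-A^TH]M\\ &\frac12M^T[B-H^TA]&\end{bmatrix}\preceq\begin{bmatrix}\mu-\alpha_0[D]&-\frac12\alpha_1^T[D]&\\ -\frac12\alpha_1[D]&\sum_k\mathcal R^*_k[\Lambda_k]-\alpha_2[D]&\\ &&\sum_\ell\mathcal S^*_\ell[\Upsilon_\ell]\end{bmatrix},\\
\quad\begin{bmatrix}G&\frac12HM\\ \frac12M^TH^T&\sum_\ell\mathcal S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\},
\end{array}$$
and the conclusion of the proposition follows.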
✷
2) Work out the following special case of the above situation dealing with Poisson Imaging (see Section 2.4.3.2): your observation is an $m$-dimensional random vector with independent Poisson entries, the vector of parameters of the corresponding Poisson distributions being $Py$; here $P$ is an $m\times n$ entrywise nonnegative matrix, and the unknown signal $y$ is known to belong to a given box $\mathcal Y=\{y\in\mathbf R^n:\underline a\le y\le\overline a\}$, where $0\le\underline a<\overline a$. You want to recover $y$ in $\|\cdot\|_p$-norm with given $p\in[1,2]$.

Solution: Setting $y_0=\frac12[\underline a+\overline a]$, $B=\mathrm{Diag}\{\frac12[\overline a-\underline a]\}$, parameterizing $y\in\mathcal Y$ as
$$y=y_0+Bx,\quad x\in\mathcal X:=[-1,1]^n=\{x\in\mathbf R^n:\exists t\in\mathcal T:=[0,1]^n:\ R_k^2[x]:=x_k^2\preceq t_kI_1,\ k\le K:=n\},$$
and subtracting from our actual observation the vector $Py_0$, we reduce the situation to the one where our observation is
$$\omega=Ax+\xi_x,$$
where $A=PB$, and the observation noise $\xi_x$ has independent entries, the $i$-th entry being the difference between a realization of the Poisson random variable with parameter $[Py_0+Ax]_i$ and the expectation of this random variable, so that $\xi_x$ has zero mean. Our goal is to recover in $\|\cdot\|_p$ the vector $y_0+Bx$, or, which is the same, the vector $Bx$. Immediate computation shows that the covariance matrix of $\xi_x$ is $C[x]=\mathrm{Diag}\{Py_0+Ax\}$, and we find ourselves in the situation considered in Proposition 4.32.

Exercise 4.17 The goal of what follows is to “transfer” the constructions of linear estimates to the case of multiple indirect observations of discrete random variables. Specifically, we are interested in the situation where
• Our observation is a $K$-element sample $\omega^K=(\omega_1,...,\omega_K)$ with independent identically distributed components $\omega_k$ taking values in an $m$-element set. As always, we encode the points from this $m$-element set by the standard basic orths $e_1,...,e_m$ in $\mathbf R^m$.
• The (common for all $k$) probability distribution of $\omega_k$ is $Ax$, where $x$ is an unknown “signal,” an $n$-dimensional probabilistic vector known to belong to a closed convex subset $\mathcal X$ of the $n$-dimensional probabilistic simplex $\Delta_n=\{x\in\mathbf R^n:x\ge0,\ \sum_ix_i=1\}$, and $A$ is a given $m\times n$ column-stochastic matrix (i.e., entrywise nonnegative matrix with unit column sums).
• Our goal is to recover $Bx$, where $B$ is a given $\nu\times n$ matrix, and we quantify a candidate estimate $\hat x(\omega^K):\mathbf R^{mK}\to\mathbf R^\nu$ by its risk
$$\mathrm{Risk}_{\|\cdot\|}[\hat x|\mathcal X]=\sup_{x\in\mathcal X}\mathbf E_{\omega^K\sim[Ax]\times...\times[Ax]}\left\{\|Bx-\hat x(\omega^K)\|\right\},$$
where $\|\cdot\|$ is a given norm on $\mathbf R^\nu$.
We use linear estimates, that is, estimates of the form
$$\hat x_H(\omega^K)=H^T\underbrace{\left[\frac1K\sum_{k=1}^K\omega_k\right]}_{\hat\omega_K[\omega^K]},\qquad(4.98)$$
where $H\in\mathbf R^{m\times\nu}$.
1) In the main body of Chapter 4, $\mathcal X$ always was assumed to be symmetric w.r.t. the origin, which easily implies that we gain nothing when passing from linear estimates to affine ones (sums of linear estimates and constants). Now we are in the case where $\mathcal X$ can be “heavily asymmetric,” which, in general, can make “genuinely affine” estimates preferable. Show that in the case in question, we still lose nothing when restricting ourselves to linear, rather than affine, estimates.
Solution: Denoting by $1_k$ the $k$-dimensional all-ones vector, observe that we are in the situation when $1_m^T\widehat{\omega}_K[\omega^K]=1$ for all signals and observations, implying that every affine estimate can be written down as a linear one:
\[
H^T\widehat{\omega}_K[\omega^K]+h\equiv\bar{H}^T\widehat{\omega}_K[\omega^K],\quad\bar{H}=H+1_mh^T.
\]
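The reduction of affine estimates to linear ones can be checked numerically. The sketch below is an illustration only: the sizes, the random integer matrix H, and the shift h are arbitrary assumptions, and exact rational arithmetic is used so that the two estimates can be compared with `==`.

```python
from fractions import Fraction
import random

def mat_T_vec(H, w):
    """Return H^T w for an m x nu matrix H (list of rows) and an m-vector w."""
    nu = len(H[0])
    return [sum(H[i][j] * w[i] for i in range(len(w))) for j in range(nu)]

def affine_to_linear(H, h):
    """Return H_bar = H + 1_m h^T, i.e., add the vector h to every row of H."""
    return [[H[i][j] + h[j] for j in range(len(h))] for i in range(len(H))]

random.seed(0)
m, nu, K = 4, 3, 7          # illustrative sizes (assumption)
H = [[Fraction(random.randint(-5, 5)) for _ in range(nu)] for _ in range(m)]
h = [Fraction(random.randint(-5, 5)) for _ in range(nu)]

# Empirical distribution of K one-hot observations: its entries sum to 1.
draws = [random.randrange(m) for _ in range(K)]
omega_hat = [Fraction(draws.count(i), K) for i in range(m)]

affine = [a + b for a, b in zip(mat_T_vec(H, omega_hat), h)]   # H^T w + h
linear = mat_T_vec(affine_to_linear(H, h), omega_hat)          # H_bar^T w
print(affine == linear)
```

The identity relies only on the entries of the averaged observation summing to one, which is why it holds for every sample.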
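4.17.A. Observation scheme revisited. When observation $\omega^K$ stems from a signal $x\in\Delta_n$, we have $\widehat{\omega}_K[\omega^K]=Ax+\xi_x$, where
\[
\xi_x=\frac1K\sum_{k=1}^K[\omega_k-Ax]
\]
is the average of $K$ independent identically distributed zero mean random vectors with common covariance matrix $Q[x]$.

2) Check that
\[
Q[x]=\mathrm{Diag}\{Ax\}-[Ax][Ax]^T,
\]
and derive from this fact that the covariance matrix of $\xi_x$ is
\[
Q_K[x]=\frac1KQ[x].
\]
Setting
\[
\Pi=\Pi_{\mathcal{X}}=\left\{Q=\frac1K\mathrm{Diag}\{Ax\}:x\in\mathcal{X}\right\},
\]
check that $\Pi_{\mathcal{X}}$ is a convex compact subset of the positive semidefinite cone $\mathbf{S}^m_+$, and that whenever $x\in\mathcal{X}$, one has $Q_K[x]\preceq Q$ for some $Q\in\Pi$.

4.17.B. Upper-bounding risk of a linear estimate. We can upper-bound the risk of a linear estimate $\widehat{x}_H$ as follows:
\[
\begin{array}{rcl}
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H|\mathcal{X}]&=&\sup_{x\in\mathcal{X}}\mathbf{E}_{\omega^K\sim[Ax]\times...\times[Ax]}\left\{\|Bx-H^T\widehat{\omega}_K[\omega^K]\|\right\}\\
&=&\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi_x}\left\{\|[B-H^TA]x-H^T\xi_x\|\right\}\\
&\le&\underbrace{\sup_{x\in\mathcal{X}}\|[B-H^TA]x\|}_{\Phi(H)}+\underbrace{\sup_{\xi:\mathrm{Cov}[\xi]\in\Pi_{\mathcal{X}}}\mathbf{E}_\xi\left\{\|H^T\xi\|\right\}}_{\Psi_{\mathcal{X}}(H)}.
\end{array}
\]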
As in the main body of Chapter 4, we intend to build a “presumably good” linear estimate by minimizing over $H$ the sum of efficiently computable upper bounds $\overline{\Phi}(H)$ on $\Phi(H)$ and $\overline{\Psi}^{\mathcal{X}}(H)$ on $\Psi_{\mathcal{X}}(H)$. Assuming from now on that the unit ball $\mathcal{B}_*$ of the norm conjugate to $\|\cdot\|$ is a spectratope,
\[
\mathcal{B}_*:=\{u:\|u\|_*\le1\}=\{u:\exists r\in\mathcal{R},y:\ u=My,\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ \ell\le L\}
\]
with our usual restrictions on $\mathcal{R}$ and $S_\ell$, we can take as $\overline{\Psi}^{\mathcal{X}}(\cdot)$ the function (4.40). For the sake of simplicity, we from now on assume that $\mathcal{X}$ is cut off $\Delta_n$ by linear inequalities:
\[
\mathcal{X}=\{x\in\Delta_n:Gx\le g,\ Ex=e\}\qquad[G\in\mathbf{R}^{p\times n},\ E\in\mathbf{R}^{q\times n}].
\]
Observe that replacing $G$ with $G-g1_n^T$ and $E$ with $E-e1_n^T$, we reduce the situation to that where all linear constraints are homogeneous, that is,
\[
\mathcal{X}=\{x\in\Delta_n:Gx\le0,\ Ex=0\},
\]
and this is what we assume from now on. Setting $F=[G;E;-E]\in\mathbf{R}^{(p+2q)\times n}$, we have also $\mathcal{X}=\{x\in\Delta_n:Fx\le0\}$.
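The passage to homogeneous constraints rests on the identity $(G-g1_n^T)x=Gx-g$ for every $x$ with unit sum of entries. A minimal check of this identity, with a small hypothetical $G$, $g$ and a hand-picked point of the simplex (all three are assumptions made for illustration):

```python
from fractions import Fraction

def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def homogenize(G, g):
    """Return G - g 1_n^T: subtract g_i from every entry of row i of G."""
    return [[G[i][j] - g[i] for j in range(len(G[i]))] for i in range(len(G))]

# Hypothetical data, chosen only for illustration.
G = [[Fraction(1), Fraction(-2), Fraction(0)],
     [Fraction(3), Fraction(1), Fraction(-1)]]
g = [Fraction(1, 2), Fraction(2)]
x = [Fraction(1, 5), Fraction(3, 10), Fraction(1, 2)]   # entries sum to 1

lhs = matvec(homogenize(G, g), x)                 # (G - g 1^T) x
rhs = [a - b for a, b in zip(matvec(G, x), g)]    # G x - g
print(lhs == rhs)
```

Consequently $Gx\le g$ holds on the simplex exactly when the homogenized inequality $(G-g1_n^T)x\le0$ does; the same computation applies to the equality constraints.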
Suppose that $\mathcal{X}$ is nonempty. Finally, in addition to what was already assumed about the norm $\|\cdot\|$, let us also suppose that this norm is absolute, that is, $\|u\|$ depends only on the vector of magnitudes of entries in $u$. From this assumption it immediately follows that if $0\le u\le u'$, then $\|u\|\le\|u'\|$ (why?). Our next task is to efficiently upper-bound $\Phi(\cdot)$.

4.17.C. Bounding $\Phi$, simple case. We start with the simple case where there are no linear constraints (formally, $G$ and $E$ are zero matrices); in this case bounding $\Phi$ is straightforward:

3) Prove that in the simple case $\Phi$ is convex and efficiently computable “as is”:
\[
\Phi(H)=\max_{i\le n}\|(B-H^TA)g_i\|,
\]
where $g_1,...,g_n$ are the standard basic orths in $\mathbf{R}^n$.
Solution: Recall that in the simple case $\mathcal{X}=\Delta_n$ is the convex hull of $g_1,...,g_n$, and that the maximum of a convex function $\|(B-H^TA)x\|$ over $x\in\mathcal{X}$ is achieved at an extreme point of $\mathcal{X}$.

4.17.D. Lagrange upper bound on $\Phi$.

4) Observing that when $\mu\in\mathbf{R}^{p+2q}_+$, the function
\[
\|(B-H^TA)x\|-\mu^TFx
\]
of $x$ is convex in $x\in\Delta_n$ and overestimates $\|(B-H^TA)x\|$ everywhere on $\mathcal{X}$, conclude that the efficiently computable convex function
\[
\Phi_L(H)=\min_\mu\max_{i\le n}\left\{\|(B-H^TA)g_i\|-\mu^TFg_i:\ \mu\ge0\right\}
\]
upper-bounds $\Phi(H)$. In the sequel, we call this function the Lagrange upper bound on $\Phi$.
Solution: $\Phi_L(H)$ is obtained from the convex function of $H,\mu$
\[
\Psi(H,\mu)=\max_{i\le n}\left[\|(B-H^TA)g_i\|-\mu^TFg_i\right]
\]
by minimization in $\mu\ge0$, and a function of this type is convex in $H$ on every convex set where the function does not take value $-\infty$. It remains to note that $\Phi(H)\ge0$ (since $\mathcal{X}$ is nonempty), and by construction $\Psi(H,\mu)\ge\Phi(H)\ge0$ whenever $\mu\ge0$, implying that $\Phi_L(\cdot)$ is nonnegative and thus does not take value $-\infty$. □

4.17.E. Basic upper bound on $\Phi$. For vectors $u$ and $v$ of the same dimension, say, $k$, let $\mathrm{Max}[u,v]$ stand for the entrywise maximum of $u,v$: $[\mathrm{Max}[u,v]]_i=\max[u_i,v_i]$, and let $[u]_+=\mathrm{Max}[u,0_k]$, where $0_k$ is the $k$-dimensional zero vector.

5.1) Let $\Lambda_+\ge0$ and $\Lambda_-\ge0$ be $\nu\times(p+2q)$ matrices, $\Lambda\ge0$ meaning that matrix $\Lambda$ is entrywise nonnegative. Prove that whenever $x\in\mathcal{X}$, one has
\[
\|(B-H^TA)x\|\le\mathcal{B}(x,H,\Lambda_+,\Lambda_-):=\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA)x-\Lambda_+Fx]_+,\,[-(B-H^TA)x-\Lambda_-Fx]_+\right]\right\}
\]
and that $\mathcal{B}(x,H,\Lambda_+,\Lambda_-)$ is convex in $x$.
5.2) Derive from 5.1 that whenever $\Lambda_\pm$ are as in 5.1, one has
\[
\Phi(H)\le\mathcal{B}^+(H,\Lambda_+,\Lambda_-):=\max_{i\le n}\mathcal{B}(g_i,H,\Lambda_+,\Lambda_-),
\]
where, as in item 3, $g_1,...,g_n$ are the standard basic orths in $\mathbf{R}^n$. Conclude that
\[
\Phi(H)\le\Phi_B(H)=\inf_{\Lambda_\pm}\left\{\mathcal{B}^+(H,\Lambda_+,\Lambda_-):\ \Lambda_\pm\in\mathbf{R}_+^{\nu\times(p+2q)}\right\}
\]
and that $\Phi_B$ is convex and real-valued. In the sequel we refer to $\Phi_B(\cdot)$ as the Basic upper bound on $\Phi(\cdot)$.
Solution: 5.1: Let $x\in\mathcal{X}$, so that $Fx\le0$ and therefore $\Lambda_\pm Fx\le0$. Since $\|\cdot\|$ is an absolute norm, we have
\[
\begin{array}{rcl}
\|(B-H^TA)x\|&=&\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA)x]_+,\,[-(B-H^TA)x]_+\right]\right\}\\
&\le&\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA)x-\Lambda_+Fx]_+,\,[-(B-H^TA)x-\Lambda_-Fx]_+\right]\right\},
\end{array}
\]
with the inequality due to $x\in\mathcal{X}$, so that $-\Lambda_\pm Fx$ are nonnegative vectors. Thus, we indeed have
\[
x\in\mathcal{X},\ \Lambda_\pm\ge0\ \Rightarrow\ \|(B-H^TA)x\|\le\mathcal{B}(x,H,\Lambda_+,\Lambda_-),
\]
and the right hand side in this inequality clearly is convex in $x$.
5.2: Invoking 5.1, we have
\[
\begin{array}{rcl}
\Phi(H)&=&\max_{x\in\mathcal{X}}\|(B-H^TA)x\|\le\max_{x\in\mathcal{X}}\mathcal{B}(x,H,\Lambda_+,\Lambda_-)\\
&\le&\max_{x\in\Delta_n}\mathcal{B}(x,H,\Lambda_+,\Lambda_-)=\mathcal{B}^+(H,\Lambda_+,\Lambda_-),
\end{array}
\]
where the last equality is due to the fact that $\mathcal{B}(\cdot)$ is convex in $x$. The rest of 5.2 is evident. □

4.17.F. Sherali-Adams upper bound on $\Phi$. Let us apply the approach we used in Chapter 1, Section 1.3.2, when deriving verifiable sufficient conditions for $s$-goodness; see p. 21. Specifically, setting
\[
W=\begin{bmatrix}G&I\\E&0\end{bmatrix},
\]
let us introduce the slack variable $z\in\mathbf{R}^p$ and rewrite the description of $\mathcal{X}$ as
\[
\mathcal{X}=\{x\in\Delta_n:\exists z\ge0:\ W[x;z]=0\},
\]
so that $\mathcal{X}$ is the projection of the polyhedral set
\[
\mathcal{X}^+=\{[x;z]:\ x\in\Delta_n,\ z\ge0,\ W[x;z]=0\}
\]
on the $x$-space. The projection $Z$ of $\mathcal{X}^+$ on the $z$-space is a nonempty (since $\mathcal{X}$ is so) and clearly bounded subset of the nonnegative orthant $\mathbf{R}^p_+$, and we can in many ways cover $Z$ by the simplex
\[
\Delta[\alpha]=\Big\{z\in\mathbf{R}^p:\ z\ge0,\ \sum_i\alpha_iz_i\le1\Big\},
\]
where all $\alpha_i$ are positive.
6.1) Let $\alpha>0$ be such that $Z\subset\Delta[\alpha]$. Prove that
\[
\mathcal{X}^+=\left\{[x;z]:\ W[x;z]=0,\ [x;z]\in\mathrm{Conv}\{v_{ij}=[g_i;h_j],\ 1\le i\le n,\ 0\le j\le p\}\right\},\tag{!}
\]
where $g_i$ are the standard basic orths in $\mathbf{R}^n$, $h_0=0\in\mathbf{R}^p$, and $\alpha_jh_j$, $1\le j\le p$, are the standard basic orths in $\mathbf{R}^p$.

6.2) Derive from 6.1 that the efficiently computable convex function
\[
\Phi_{SA}(H)=\inf_C\left\{\max_{i,j}\|(B-H^TA)g_i+C^TWv_{ij}\|:\ C\in\mathbf{R}^{(p+q)\times\nu}\right\}
\]
is an upper bound on $\Phi(H)$. In the sequel, we refer to $\Phi_{SA}(H)$ as the Sherali-Adams bound [214].
Solution: 6.1: Clearly, whenever $[x;z]\in\mathcal{X}^+$, we have $W[x;z]=0$, $x\in\Delta_n$, $z\in Z\subset\Delta[\alpha]$, the latter implying that $z$ is a convex combination of $h_j$, so that $[x;z]$ is a convex combination of $v_{ij}$. Thus, $\mathcal{X}^+$ is contained in the right hand side set of (!). Vice versa, if $[x;z]$ belongs to the right hand side set of (!), we have $x\in\Delta_n$, $z\ge0$, and $W[x;z]=0$, implying that $[x;z]\in\mathcal{X}^+$. 6.1 is proved.
6.2: Because $\mathcal{X}$ is the projection of $\mathcal{X}^+$ on the $x$-space, for every $C\in\mathbf{R}^{(p+q)\times\nu}$ we have
\[
\begin{array}{rcl}
\Phi(H)&=&\max_{x\in\mathcal{X}}\|(B-H^TA)x\|=\max_{[x;z]\in\mathcal{X}^+}\|(B-H^TA)x\|\\
&=&\max_{[x;z]\in\mathcal{X}^+}\|(B-H^TA)x+C^TW[x;z]\|\\
&\le&\max_{[x;z]}\left\{\|(B-H^TA)x+C^TW[x;z]\|:\ [x;z]\in\mathrm{Conv}\{v_{ij}\}\right\}\quad[\mbox{by (!)}]\\
&=&\max_{i,j}\|(B-H^TA)g_i+C^TWv_{ij}\|,
\end{array}
\]
\[
\Rightarrow\ \Phi(H)\le\inf_C\max_{i,j}\|(B-H^TA)g_i+C^TWv_{ij}\|=\Phi_{SA}(H),
\]
as claimed in 6.2. □
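4.17.G. Combined bound. We can combine the above bounds, specifically, as follows:

7) Prove that the efficiently computable convex function
\[
\Phi_{LBS}(H)=\inf_{(\Lambda_\pm,C_\pm,\mu,\mu_+)\in\mathcal{R}}\ \max_{i,j}\ \mathcal{G}_{ij}(H,\Lambda_\pm,C_\pm,\mu,\mu_+),\tag{\#}
\]
where
\[
\begin{array}{rcl}
\mathcal{G}_{ij}(H,\Lambda_\pm,C_\pm,\mu,\mu_+)&:=&-\mu^TFg_i+\mu_+^TWv_{ij}\\
&&+\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA-\Lambda_+F)g_i+C_+^TWv_{ij}]_+,\,[(-B+H^TA-\Lambda_-F)g_i+C_-^TWv_{ij}]_+\right]\right\},\\
\mathcal{R}&=&\left\{(\Lambda_\pm,C_\pm,\mu,\mu_+):\ \Lambda_\pm\in\mathbf{R}_+^{\nu\times(p+2q)},\ C_\pm\in\mathbf{R}^{(p+q)\times\nu},\ \mu\in\mathbf{R}_+^{p+2q},\ \mu_+\in\mathbf{R}^{p+q}\right\},
\end{array}
\]
is an upper bound on $\Phi(H)$, and that this Combined bound is at least as good as any of the Lagrange, Basic, or Sherali-Adams bounds.

Solution: Whenever $C_\pm\in\mathbf{R}^{(p+q)\times\nu}$, $\mu\in\mathbf{R}_+^{p+2q}$, $\mu_+\in\mathbf{R}^{p+q}$, and $\Lambda_\pm\in\mathbf{R}_+^{\nu\times(p+2q)}$,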
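for $[x;z]\in\mathcal{X}^+$ one has
\[
\begin{array}{rcl}
\|(B-H^TA)x\|&=&\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA)x]_+,\,[(-B+H^TA)x]_+\right]\right\}\\
&\le&\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA)x-\Lambda_+Fx+C_+^TW[x;z]]_+,\,[(-B+H^TA)x-\Lambda_-Fx+C_-^TW[x;z]]_+\right]\right\}\\
&&[\mbox{due to }\Lambda_\pm Fx\le0\mbox{ and }C_\pm^TW[x;z]=0]\\
&\le&\min_t\left\{\|t\|:\ t\ge\mathrm{Max}\left[[(B-H^TA)x-\Lambda_+Fx+C_+^TW[x;z]]_+,\,[(-B+H^TA)x-\Lambda_-Fx+C_-^TW[x;z]]_+\right]\right\}\\
&&-\mu^TFx+\mu_+^TW[x;z]\\
&&[\mbox{due to }\mu\ge0,\ Fx\le0\mbox{ and }W[x;z]=0]\\
&=:&\mathcal{F}(H,\Lambda_\pm,C_\pm,\mu,\mu_+;[x;z]);
\end{array}
\]
note that $\mathcal{F}(H,\Lambda_\pm,C_\pm,\mu,\mu_+;[x;z])$ clearly is convex in $[x;z]$. Recalling that $\mathcal{X}^+$ is contained in $\mathrm{Conv}\{v_{ij},\ 1\le i\le n,\ 0\le j\le p\}$, we conclude that
\[
\begin{array}{rcl}
\max_{x\in\mathcal{X}}\|(B-H^TA)x\|&=&\max_{[x;z]\in\mathcal{X}^+}\|(B-H^TA)x\|\\
&\le&\max_{[x;z]\in\mathcal{X}^+}\mathcal{F}(H,\Lambda_\pm,C_\pm,\mu,\mu_+;[x;z])\\
&\le&\max_{i,j}\mathcal{F}(H,\Lambda_\pm,C_\pm,\mu,\mu_+;v_{ij})\\
&=&\max_{i,j}\mathcal{G}_{ij}(H,\Lambda_\pm,C_\pm,\mu,\mu_+).
\end{array}
\]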
The concluding quantity clearly is convex in $(H,\Lambda_\pm,C_\pm,\mu,\mu_+)$, implying that $\Phi_{LBS}(H)$ is a convex in $H$ upper bound on $\Phi(H)$. Finally, to see that the Combined bound is not worse than each of the bounds we merge, it suffices to note that the latter bounds can be obtained from the right-hand side in (#) by imposing additional constraints on the minimization variables, specifically,
• for the Lagrange bound: the constraints $C_\pm=0$, $\Lambda_\pm=0$;
• for the Basic bound: the constraints $C_\pm=0$, $\mu=0$, $\mu_+=0$;
• for the Sherali-Adams bound: the constraints $C_+=-C_-$, $\Lambda_\pm=0$, $\mu=0$, $\mu_+=0$. □

4.17.H. How to select $\alpha$? A shortcoming of the Sherali-Adams and the Combined upper bounds on $\Phi$ is the presence of a “degree of freedom,” the positive vector $\alpha$. Intuitively, we would like to select $\alpha$ to make the simplex $\Delta[\alpha]\supset Z$ “as small as possible.” It is unclear, however, what “as small as possible” is in our context, not to mention how to select the required $\alpha$ after we agree on how we measure the “size” of $\Delta[\alpha]$. It turns out, however, that we can efficiently select $\alpha$ resulting in the smallest volume $\Delta[\alpha]$.

8) Prove that minimizing the volume of $\Delta[\alpha]\supset Z$ in $\alpha$ reduces to solving the following convex optimization problem:
\[
\inf_{\alpha,u,v}\left\{-\sum_{s=1}^p\ln(\alpha_s):\ 0\le\alpha\le-v,\ E^Tu+G^Tv\le1_n\right\}.\tag{$*$}
\]
Solution: $\alpha\ge0$ satisfies $Z\subset\Delta[\alpha]$ if and only if the linear inequality
\[
\alpha^T\zeta\le1
\]
in variable $\zeta$ is a consequence of the feasible system of linear constraints
\[
\begin{array}{ll}
(a)&x\ge0\\
(b)&1_n^Tx=1\\
(c)&Ex=0\\
(d)&Gx+\zeta=0\\
(e)&\zeta\ge0
\end{array}
\]
in variables $x,\zeta$. By the Inhomogeneous Farkas Lemma, the latter is the case if and only if there exist Lagrange multipliers $\lambda_a\ge0$, $\lambda_b$, $\lambda_c$, $\lambda_d$, $\lambda_e\ge0$ such that
\[
\lambda_a+\lambda_b1_n+E^T\lambda_c+G^T\lambda_d=0,\quad\lambda_d+\lambda_e=-\alpha,\quad\lambda_b\ge-1,
\]
which boils down to the existence of $\lambda_c=u$ and $\lambda_d=v$ such that $\alpha\le-v$ and $E^Tu+G^Tv\le1_n$. The lengths of edges of $\Delta[\alpha]$ incident to the vertex $0$ of $\Delta[\alpha]$ are inversely proportional to the corresponding entries in $\alpha$, and the volume of $\Delta[\alpha]$ is proportional to the product of these lengths, so that the minimization of this volume reduces to the convex problem ($*$). □

9) Run numerical experiments to evaluate the quality of the above bounds. It makes sense to generate problems where we know in advance the actual value of $\Phi$, e.g., to take
\[
\mathcal{X}=\{x\in\Delta_n:x\ge a\}\tag{a}
\]
with $a\ge0$ such that $\sum_ia_i\le1$. In this case, we can easily list the extreme points of $\mathcal{X}$ (how?) and thus can easily compute $\Phi(H)$. In your experiments, you can use the matrices stemming from “presumably good” linear estimates yielded by the optimization problems
\[
\mathrm{Opt}=\min_{H,\Upsilon,\Theta}\left\{\overline{\Phi}(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Gamma_{\mathcal{X}}(\Theta):\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \begin{bmatrix}\Theta&\frac12HM\\\frac12M^TH^T&\sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\},\tag{4.99}
\]
where
\[
\Gamma_{\mathcal{X}}(\Theta)=\frac1K\max_{x\in\mathcal{X}}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)
\]
(see Corollary 4.12), with the actual $\Phi$ (which is available for our $\mathcal{X}$), or the upper bounds on $\Phi$ (Lagrange, Basic, Sherali-Adams, and Combined) in the role of $\overline{\Phi}$. Note that it may make sense to test seven bounds rather than just four. Indeed, with additional constraints on the optimization variables in (#), we can get, besides “pure” Lagrange, Basic, and Sherali-Adams bounds and their “three-component combination” (Combined bound), pairwise combinations of the pure bounds as well. For example, to combine the Lagrange and Sherali-Adams bounds, it suffices to add to (#) the constraints $\Lambda_\pm=0$.
Solution: In our experiments,
• the sizes of instances were $n=\nu=m=p=10$, $q=0$ (no constraints $Ex=0$), $K=10000$;
• we used $B=I_{10}$, $\|\cdot\|=\|\cdot\|_1$, and generated $10\times10$ stochastic matrices $A$ at random, by normalizing the columns in a rand(10,10) matrix to have unit column sums;
• we selected $a$ in (a) at random, by generating a rand(10,1) vector and then scaling it to have a rand(1,1) sum of entries;
• $\alpha$ was selected as explained in item 8.
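The data generation described above, together with the exact computation of $\Phi$ for the set (a), can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: Python's `random` module stands in for `rand`, the function names are hypothetical, $H=I$ is an arbitrary test matrix, and the list of extreme points uses the standard observation (not spelled out in the text) that $\{x\in\Delta_n:x\ge a\}=a+(1-\sum_ia_i)\Delta_n$.

```python
import random

def random_column_stochastic(m, n, rng):
    """m x n matrix with nonnegative entries, columns normalized to unit sums."""
    raw = [[rng.random() for _ in range(n)] for _ in range(m)]
    col_sums = [sum(raw[i][j] for i in range(m)) for j in range(n)]
    return [[raw[i][j] / col_sums[j] for j in range(n)] for i in range(m)]

def random_lower_bound(n, rng):
    """Nonnegative vector a scaled so that its entries sum to a random number in [0, 1)."""
    raw = [rng.random() for _ in range(n)]
    target, s = rng.random(), sum(raw)
    return [target * v / s for v in raw]

def extreme_points(a):
    """Vertices a + (1 - sum(a)) e_i of the set {x in simplex : x >= a}."""
    slack = 1.0 - sum(a)
    n = len(a)
    return [[a[j] + (slack if j == i else 0.0) for j in range(n)] for i in range(n)]

def phi_exact(A, H, vertices):
    """max over the vertices of ||x - H^T A x||_1 (the case B = I, l1 norm)."""
    m, n = len(A), len(A[0])
    best = 0.0
    for x in vertices:
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        HtAx = [sum(H[i][k] * Ax[i] for i in range(m)) for k in range(len(H[0]))]
        best = max(best, sum(abs(x[k] - HtAx[k]) for k in range(len(x))))
    return best

rng = random.Random(0)
n = m = 10
A = random_column_stochastic(m, n, rng)
a = random_lower_bound(n, rng)
V = extreme_points(a)
H = [[1.0 if i == k else 0.0 for k in range(n)] for i in range(m)]  # test choice H = I
phi = phi_exact(A, H, V)
print(phi)
```

Since the convex function $\|(B-H^TA)x\|$ attains its maximum over a polytope at a vertex, evaluating it at the $n$ listed points gives $\Phi(H)$ exactly, which is what makes this family of signal sets convenient for judging the upper bounds.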
Here are the optimal values of (P) for 10 randomly selected data sets (corresponding to 10 vertical stripes on the plot):
[Figure: optimal values (vertical axis, from 0 to 1) for the 10 data sets (horizontal axis, from 0 to 10).]
∗: Φ(·); □: L/B/S-A; +: L/S-A; ◦: L/B; ∗: B/S-A; pentagram: L; ▽: S-A; △: B
We see that the “winners,” besides the Combined bound, are pairwise combinations of Lagrange and Sherali-Adams, and of Basic and Sherali-Adams bounds, as well as the “pure” Sherali-Adams bound, and under the circumstances these bounds are equal to the actual Φ within machine accuracy. Note that excellent performance exhibited in the above experiments by four of our seven estimates is seemingly due to the special structure of the signal set used in these experiments. This is what happens when we modify X by imposing upper and lower bounds on x-probabilities of some prescribed events rather than imposing lower bounds on the entries in x:
[Figure: the same plot for the modified signal set (vertical axis from 0 to 1, horizontal axis from 0 to 10).]
□: L/B/S-A; +: L/S-A; ◦: L/B; ∗: B/S-A; pentagram: L; ▽: S-A; △: B
Now the winner, besides the Combined bound, is the pairwise combination of Lagrange and Sherali-Adams bounds.

Exercise 4.18. The exercise to follow deals with recovering discrete probability distributions in the Wasserstein norm. The Wasserstein distance between probability distributions is extremely popular in Statistics today; it is defined as follows.¹⁴ Consider discrete random variables taking values in finite observation space $\Omega=\{1,2,...,n\}$ which is equipped with the metric $\{d_{ij}:1\le i,j\le n\}$ satisfying the standard axioms.¹⁵ As always, we identify probability distributions on $\Omega$ with $n$-dimensional probabilistic vectors $p=[p_1;...;p_n]$, where $p_i$ is the probability mass assigned by $p$ to $i\in\Omega$. The Wasserstein distance between probability distributions $p$ and $q$ is defined as
\[
W(p,q)=\min_{x=[x_{ij}]}\Big\{\sum_{i,j}d_{ij}x_{ij}:\ x_{ij}\ge0,\ \sum_jx_{ij}=p_i,\ \sum_ix_{ij}=q_j\ \ \forall1\le i,j\le n\Big\}.\tag{4.100}
\]
In other words, one may think of $p$ and $q$ as two distributions of unit mass on the points of $\Omega$, and consider the mass transport problem of redistributing the mass assigned to points by distribution $p$ to get the distribution $q$. Denoting by $x_{ij}$ the mass moved from point $i$ to point $j$, constraints $\sum_jx_{ij}=p_i$ say that the total mass taken from point $i$ is exactly $p_i$, constraints $\sum_ix_{ij}=q_j$ say that as the result of transportation, the mass at point $j$ will be exactly $q_j$, and the constraints $x_{ij}\ge0$ reflect the fact that transport of a negative mass is forbidden. Assuming that the cost of transporting a mass $\mu$ from point $i$ to point $j$ is $d_{ij}\mu$, the Wasserstein distance $W(p,q)$ between $p$ and $q$ is the cost of the cheapest transportation plan which converts $p$ into $q$. As compared to other natural distances between discrete probability distributions, like $\|p-q\|_1$, the advantage of the Wasserstein distance is that it allows us to model the situation (indeed arising in some applications) where the effect, measured in terms of the intended application, of changing probability masses of points from $\Omega$ is small when the probability mass of a point is redistributed among close points.¹⁶

Now comes the first part of the exercise:

1) Let $p,q$ be two probability distributions. Prove that
\[
W(p,q)=\max_{f\in\mathbf{R}^n}\Big\{\sum_if_i(p_i-q_i):\ |f_i-f_j|\le d_{ij}\ \forall i,j\Big\}.\tag{4.101}
\]

¹⁴ The distance we consider stems from the Wasserstein 1-distance between discrete probability distributions. This is a particular case of the general Wasserstein $p$-distance between (not necessarily discrete) probability distributions.
¹⁵ Namely, positivity and symmetry: $d_{ij}=d_{ji}\ge0$, with $d_{ij}=0$ if and only if $i=j$; and the triangle inequality: $d_{ik}\le d_{ij}+d_{jk}$ for all triples $i,j,k$.
¹⁶ In fact, the Wasserstein distance shares this property with some other distances between distributions used in Probability Theory, such as Skorohod, or Prokhorov, or Ky Fan distances. What makes the Wasserstein distance so “special” is its representation (4.100) as the optimal value of a Linear Programming problem, responsible for efficient computational handling of this distance.

Solution: By definition, $W(p,q)$ is the optimal value in a (clearly solvable) Linear Programming program. By the LP Duality Theorem, passing to the dual problem we have
\[
W(p,q)=\max_{f,g}\left\{f^Tp-g^Tq:\ f_i-g_j\le d_{ij},\ 1\le i,j\le n\right\}.\tag{$*$}
\]
All we need to prove is that we can find an optimal solution to ($*$) where $f=g$. By the continuity argument, it suffices to prove this claim in the case when $p>0$, $q>0$. Assuming that the latter is the case, let $\bar{f},\bar{g}$ be an optimal solution to ($*$). We claim that for all $i,j$ it holds
\[
\bar{g}_i+d_{ij}\ge\bar{g}_j.\tag{$**$}
\]
Indeed, assume that for some pair $i_*,j_*$ the inequality opposite to ($**$) holds true: $\bar{g}_{i_*}+d_{i_*j_*}<\bar{g}_{j_*}$, implying that $i_*\ne j_*$, and let $g'$ be obtained from $\bar{g}$ by updating a single entry, specifically, by setting $g'_{j_*}=\bar{g}_{i_*}+d_{i_*j_*}$. Let us verify that $\bar{f},g'$ is a feasible solution to ($*$). Indeed, all we need to verify is that for all $i$ it holds $\bar{f}_i-g'_{j_*}\le d_{ij_*}$, or, which is the same, that for all $i$ it holds $\bar{f}_i-\bar{g}_{i_*}-d_{i_*j_*}\le d_{ij_*}$, which is nothing but
\[
\bar{f}_i-\bar{g}_{i_*}\le d_{ij_*}+d_{i_*j_*}.
\]
By the triangle inequality, the right-hand side in the latter relation is $\ge d_{ii_*}$, so that the inequality indeed is valid due to $\bar{f}_i-\bar{g}_{i_*}\le d_{ii_*}$. Thus, $\bar{f},g'$ is a feasible solution to ($*$); when passing from $\bar{f},\bar{g}$ to $\bar{f},g'$ we keep $\bar{f}$ and all entries in $\bar{g}$, except for the $j_*$-th one, intact, and strictly decrease the $j_*$-th entry in $\bar{g}$, thus strictly increasing the value of the objective in ($*$) due to $q_{j_*}>0$. The latter is impossible, since $\bar{f},\bar{g}$ is an optimal solution to ($*$). Thus, ($**$) indeed holds true. It remains to note that by feasibility of $\bar{f},\bar{g}$ for ($*$) we have $\bar{f}_i\le\bar{g}_i$ for all $i$ due to $d_{ii}=0$, and by ($**$) and due to $d_{ij}=d_{ji}$ the pair $\bar{g},\bar{g}$ is feasible for ($*$). When passing from the optimal solution $\bar{f},\bar{g}$ to ($*$) to the feasible solution $\bar{g},\bar{g}$, we entrywise increase (perhaps nonstrictly) $\bar{f}$ due to the already proved relation $\bar{f}\le\bar{g}$ and consequently increase (perhaps nonstrictly) the objective of ($*$). Since actual increase is impossible ($\bar{f},\bar{g}$ form an optimal solution to the problem!) and $p>0$, we conclude that $\bar{f}=\bar{g}$, as claimed. □

Treating vector $f\in\mathbf{R}^n$ as a function on $\Omega$, the value of the function at a point $i\in\Omega$ being $f_i$, (4.101) admits a very transparent interpretation: the Wasserstein distance $W(p,q)$ between probability distributions $p,q$ is the maximum of inner products of $p-q$ and Lipschitz continuous, with constant 1 w.r.t. the metric $d$, functions $f$ on $\Omega$. When shifting $f$ by a constant, the inner product remains intact (since $p-q$ is a vector with zero sum of entries). Therefore, denoting by
\[
D=\max_{i,j}d_{ij}
\]
the $d$-diameter of $\Omega$, we have
\[
W(p,q)=\max_f\left\{f^T(p-q):\ |f_i-f_j|\le d_{ij},\ |f_i|\le D/2\ \ \forall i,j\right\},\tag{4.102}
\]
the reason being that every Lipschitz continuous, with constant 1 w.r.t. metric $d$, function $f$ on $\Omega$ can be shifted by a constant to ensure $\|f\|_\infty\le D/2$ (look what happens when the shift ensures that $\min_if_i=-D/2$). Representation (4.102) shows that the Wasserstein distance is generated by a norm on $\mathbf{R}^n$: for all probability distributions on $\Omega$ one has
\[
W(p,q)=\|p-q\|_W,
\]
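where $\|\cdot\|_W$ is the Wasserstein norm on $\mathbf{R}^n$ given by
\[
\begin{array}{c}
\|x\|_W=\max_{f\in\mathcal{B}_*}f^Tx,\\
\mathcal{B}_*=\left\{u\in\mathbf{R}^n:\ u^TS_{ij}u\le1,\ 1\le i\le j\le n\right\},\\
S_{ij}=\left\{\begin{array}{ll}d_{ij}^{-2}[e_i-e_j][e_i-e_j]^T,&1\le i<j\le n,\\4D^{-2}e_ie_i^T,&1\le i=j\le n,\end{array}\right.
\end{array}\tag{4.103}
\]
where $e_1,...,e_n$ are the standard basic orths in $\mathbf{R}^n$.

2) Let us equip the $n$-element set $\Omega=\{1,...,n\}$ with the metric $d_{ij}=\left\{\begin{array}{ll}2,&i\ne j\\0,&i=j\end{array}\right.$. What is the associated Wasserstein norm?

Solution: This is just the $\|\cdot\|_1$-norm.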
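The answer to item 2 can be certified on a concrete pair of distributions by weak duality: a feasible transportation plan in (4.100) and a feasible $f$ in (4.101) with equal values must both be optimal. The sketch below is an illustration only; the greedy plan, the sign vector $f$, and the sample $p,q$ are our own choices, not taken from the text.

```python
from fractions import Fraction

def greedy_plan(p, q):
    """Feasible plan for (4.100): keep min(p_i, q_i) in place, ship surpluses to deficits."""
    n = len(p)
    x = [[Fraction(0)] * n for _ in range(n)]
    surplus = [max(p[i] - q[i], Fraction(0)) for i in range(n)]
    deficit = [max(q[j] - p[j], Fraction(0)) for j in range(n)]
    for i in range(n):
        x[i][i] = min(p[i], q[i])
    i = j = 0
    while i < n and j < n:
        moved = min(surplus[i], deficit[j])
        x[i][j] += moved
        surplus[i] -= moved
        deficit[j] -= moved
        if surplus[i] == 0:
            i += 1
        else:              # then deficit[j] == 0
            j += 1
    return x

def plan_cost(x):
    """Transportation cost for the metric d_ij = 2 (i != j), d_ii = 0."""
    n = len(x)
    return sum(2 * x[i][j] for i in range(n) for j in range(n) if i != j)

def dual_value(p, q):
    """Value in (4.101) of f_i = sign(p_i - q_i); feasible since |f_i - f_j| <= 2."""
    return sum((1 if pi >= qi else -1) * (pi - qi) for pi, qi in zip(p, q))

p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6), Fraction(0)]
q = [Fraction(1, 4)] * 4
x = greedy_plan(p, q)
l1 = sum(abs(pi - qi) for pi, qi in zip(p, q))
print(plan_cost(x) == dual_value(p, q) == l1)
```

The plan moves exactly the total surplus $\frac12\|p-q\|_1$ at cost 2 per unit, while the sign vector attains $\|p-q\|_1$ in the dual, so both values coincide with $W(p,q)$.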
Note that the set B∗ in (4.103) is the unit ball of the norm conjugate to k · kW , and as we see, this set is a basic ellitope. As a result, the estimation machinery developed in Chapter 4 is well suited for recovering discrete probability distributions in the Wasserstein norm. This observation motivates the concluding part of the exercise:
3) Consider the situation as follows: Given an $m\times n$ column-stochastic matrix $A$ and a $\nu\times n$ column-stochastic matrix $B$, we observe $K$ samples $\omega_k$, $1\le k\le K$, independent of each other and drawn from discrete probability distribution $Ax\in\Delta_m$ (as always, $\Delta_\nu\subset\mathbf{R}^\nu$ is the probabilistic simplex in $\mathbf{R}^\nu$), $x\in\Delta_n$ being an unknown “signal” underlying observations; realizations of $\omega_k$ are identified with respective vertices $f_1,...,f_m$ of $\Delta_m$. Our goal is to use the observations to estimate the distribution $Bx\in\Delta_\nu$. We are given a metric $d$ on the set $\Omega_\nu=\{1,2,...,\nu\}$ of indices of entries in $Bx$, and measure the recovery error in the Wasserstein norm $\|\cdot\|_W$ associated with $d$. Build an explicit convex optimization problem responsible for “presumably good” linear recovery of the form
\[
\widehat{x}_H=H^T\left[\frac1K\sum_{k=1}^K\omega_k\right].
\]
Solution: The estimate in question is given by the $H$-component of the optimal solution to the convex optimization problem
\[
\mathrm{Opt}=\min_{H,G,\lambda}\left\{\max_{j\le n}\|\mathrm{Col}_j[B-H^TA]\|_W+K^{-1}\max_{i\le n}[A^T\mathrm{dg}(G)]_i+\sum_{1\le i\le j\le\nu}\lambda_{ij}:\ \lambda_{ij}\ge0,\ \begin{bmatrix}G&\frac12H\\\frac12H^T&\sum_{1\le i\le j\le\nu}\lambda_{ij}S_{ij}\end{bmatrix}\succeq0\right\},
\]
where $\mathrm{dg}(G)$ is the diagonal of $G$, and $S_{ij}$ are the matrices specified in (4.103), with $\nu$ in the role of $n$; cf. Simple Case in Exercise 4.17. The risk $\mathrm{Risk}_{\|\cdot\|_W}[\widehat{x}|\Delta_n]$ of this estimate does not exceed Opt.

Exercise 4.19. [follow-up to Exercise 4.17] In Exercise 4.17, we have built a “presumably good” linear estimate $\widehat{x}_{H_*}(\cdot)$ (see (4.98)) yielded by the $H$-component $H_*$ of an optimal solution to problem (4.99). The optimal value Opt in this problem is an upper bound on the risk $\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]$ (here and in what follows we use the same notation and impose the same assumptions as in Exercise 4.17). Recall that $\mathrm{Risk}_{\|\cdot\|}$ is the worst, w.r.t. signals $x\in\mathcal{X}$ underlying our observations, expected norm of the recovery error. It makes sense also to provide upper bounds on the probabilities of deviations of the error's magnitude from its expected value, and this is the problem we consider here, cf. Exercise 4.14.
1) Prove the following

Lemma 4.33 Let $Q\in\mathbf{S}^m_+$, let $K$ be a positive integer, and let $p\in\Delta_m$. Let, further, $\omega^K=(\omega_1,...,\omega_K)$ be i.i.d. random vectors, with $\omega_k$ taking the value $e_j$ ($e_1,...,e_m$ are the standard basic orths in $\mathbf{R}^m$) with probability $p_j$. Finally, let $\xi_k=\omega_k-\mathbf{E}\{\omega_k\}=\omega_k-p$, and $\widehat{\xi}=\frac1K\sum_{k=1}^K\xi_k$. Then for every $\epsilon\in(0,1)$ it holds
\[
\mathrm{Prob}\left\{\|\widehat{\xi}\|_2^2\le\frac{12\ln(2m/\epsilon)}{K}\right\}\ge1-\epsilon.
\]
Hint: use the classical Bernstein inequality: Let $X_1,...,X_K$ be independent zero mean random variables taking values in $[-M,M]$, and let $\sigma_k^2=\mathbf{E}\{X_k^2\}$. Then for every $t\ge0$ one has
\[
\mathrm{Prob}\Big\{\sum_{k=1}^KX_k\ge t\Big\}\le\exp\left\{-\frac{t^2}{2[\sum_k\sigma_k^2+\frac13Mt]}\right\}.
\]
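Solution: Let us fix $p\in\Delta_m$, and let $\xi_k$, $1\le k\le K$, be the random vectors associated with $p$, as explained in the premise of the Lemma. For $i\le m$ fixed, denote by $X_{ki}$ the $i$-th entry in $\xi_k$, so that $X_{ki}$ are random reals taking values in the segment $[-p_i,1-p_i]\subset[-1,1]$ with second moments $p_i(1-p_i)\le p_i$ and independent across $k$. Applying the Bernstein inequality, we conclude that for every $t_i\ge0$ it holds
\[
\begin{array}{rcl}
\mathrm{Prob}\{[\widehat{\xi}]_i>t_i\}&=&\mathrm{Prob}\Big\{\sum_kX_{ki}>Kt_i\Big\}\le\exp\left\{-\frac{K^2t_i^2}{2[Kp_i+\frac13Kt_i]}\right\}\\
&=&\exp\left\{-\frac{Kt_i^2}{2[p_i+\frac13t_i]}\right\},
\end{array}
\]
and similarly $\mathrm{Prob}\{[\widehat{\xi}]_i<-t_i\}\le\exp\left\{-\frac{Kt_i^2}{2[p_i+\frac13t_i]}\right\}$, implying that
\[
\mathrm{Prob}\{|[\widehat{\xi}]_i|>t_i\}\le2\exp\left\{-\frac{Kt_i^2}{2[p_i+\frac13t_i]}\right\}.
\]
Now let us specify $t_i$ in such a way that the right hand side in the above inequality is equal to $\epsilon/m$. Denoting $\alpha=\frac{2\ln(2m/\epsilon)}{K}$, we get
\[
t_i=\frac{\alpha}{6}+\sqrt{\frac{\alpha^2}{36}+\alpha p_i}\le\frac{\alpha}{3}+\sqrt{\alpha p_i}.
\]
With our choice of $t_i$ we have
\[
\forall i\le m:\ \mathrm{Prob}\{|[\widehat{\xi}]_i|>t_i\}\le\frac{\epsilon}{m},
\]
so that the probability of the event $\mathcal{E}=\{\widehat{\xi}:|[\widehat{\xi}]_i|\le t_i\ \forall i\}$ is at least $1-\epsilon$. Now let
\[
I=\left\{i:\ p_i\le\frac{\alpha}{9}\right\}.
\]
Note that $t_i\le2\sqrt{\alpha p_i}$ when $i\not\in I$ and $t_i\le2\alpha/3$ when $i\in I$. Note also that by the origin of $\widehat{\xi}$ we have $\|\widehat{\xi}\|_1\le2$. Consequently, when $\mathcal{E}$ takes place we have
\[
\begin{array}{rcl}
\sum_{i=1}^m[\widehat{\xi}]_i^2&=&\sum_{i\in I}[\widehat{\xi}]_i^2+\sum_{i\not\in I}[\widehat{\xi}]_i^2\le\sum_{i\in I}|[\widehat{\xi}]_i|t_i+\sum_{i\not\in I}t_i^2\\
&\le&2\max_{i\in I}t_i+\sum_{i\not\in I}t_i^2\le\frac{4\alpha}{3}+4\sum_{i\not\in I}\alpha p_i\le6\alpha.
\end{array}
\]
Substituting the expression for $\alpha$, we arrive at the desired result. □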
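A quick simulation can make Lemma 4.33 tangible. The sketch below is an illustration under arbitrary assumptions (the distribution `p`, the values of $K$ and $\epsilon$, the number of trials, and the seed are ours); it first confirms that the stated $t_i$ makes the two-sided Bernstein tail equal to $\epsilon/m$, and then estimates how often the bound on $\|\widehat{\xi}\|_2^2$ holds.

```python
import math
import random

def xi_hat_sq_norm(p, K, rng):
    """Squared l2 norm of (empirical distribution of K draws from p) minus p."""
    m = len(p)
    counts = [0] * m
    for i in rng.choices(range(m), weights=p, k=K):
        counts[i] += 1
    return sum((counts[i] / K - p[i]) ** 2 for i in range(m))

rng = random.Random(1)
p = [0.5, 0.2, 0.15, 0.1, 0.05]          # illustrative distribution (assumption)
m, K, eps, trials = len(p), 200, 0.1, 2000

# With alpha = 2 ln(2m/eps)/K, t_i solves 2 exp{-K t^2 / (2 [p_i + t/3])} = eps/m.
alpha = 2 * math.log(2 * m / eps) / K
tails = []
for pi in p:
    t = alpha / 6 + math.sqrt(alpha ** 2 / 36 + alpha * pi)
    tails.append(2 * math.exp(-K * t ** 2 / (2 * (pi + t / 3))))

bound = 12 * math.log(2 * m / eps) / K    # right-hand side in Lemma 4.33
hits = sum(xi_hat_sq_norm(p, K, rng) <= bound for _ in range(trials))
frequency = hits / trials
print(frequency >= 1 - eps)
```

In runs of this kind the bound is typically satisfied with a large margin, which is consistent with the conservative constants produced by the union bound over the $m$ coordinates.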
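2) Consider the situation described in Exercise 4.17 with $\mathcal{X}=\Delta_n$, specifically,
• Our observation is a sample $\omega^K=(\omega_1,...,\omega_K)$ with i.i.d. components $\omega_k\sim Ax$, where $x\in\Delta_n$ is an unknown $n$-dimensional probabilistic vector, $A$ is an $m\times n$ stochastic matrix (nonnegative matrix with unit column sums), and $\omega\sim Ax$ means that $\omega$ is a random vector taking value $e_i$ ($e_i$ are standard basic orths in $\mathbf{R}^m$) with probability $[Ax]_i$, $1\le i\le m$;
• Our goal is to recover $Bx$ in a given norm $\|\cdot\|$; here $B$ is a given $\nu\times n$ matrix;
• We assume that the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ is a spectratope,
\[
\mathcal{B}_*=\{u=My,\ y\in\mathcal{Y}\},\quad\mathcal{Y}=\{y\in\mathbf{R}^N:\exists r\in\mathcal{R}:\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ \ell\le L\}.
\]
Our goal is to build a presumably good linear estimate
\[
\widehat{x}_H(\omega^K)=H^T\widehat{\omega}[\omega^K],\quad\widehat{\omega}[\omega^K]=\frac1K\sum_k\omega_k.
\]
Prove the following

Proposition 4.34 Let $H,\Theta,\Upsilon$ be a feasible solution to the convex optimization problem
\[
\min_{H,\Theta,\Upsilon}\left\{\Phi(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Gamma(\Theta)/K:\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ \begin{bmatrix}\Theta&\frac12HM\\\frac12M^TH^T&\sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq0\right\},\tag{4.104}
\]
where
\[
\Phi(H)=\max_{j\le n}\|\mathrm{Col}_j[B-H^TA]\|,\quad\Gamma(\Theta)=\max_{x\in\Delta_n}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta).
\]
Then
(i) For every $x\in\Delta_n$ it holds
\[
\begin{array}{rcl}
\mathbf{E}_{\omega^K}\left\{\|Bx-\widehat{x}_H(\omega^K)\|\right\}&\le&\Phi(H)+2K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma(\Theta)}\\
&\le&\Phi(H)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\Gamma(\Theta)/K.
\end{array}\tag{4.105}
\]
(ii) Let $\epsilon\in(0,1)$. For every $x\in\Delta_n$, with
\[
\gamma=2\sqrt{3\ln(2m/\epsilon)}
\]
one has
\[
\mathrm{Prob}_{\omega^K}\left\{\|Bx-\widehat{x}_H(\omega^K)\|\le\Phi(H)+2\gamma K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\|\Theta\|_{\mathrm{Sh},\infty}}\right\}\ge1-\epsilon.\tag{4.106}
\]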
Solution: (i): Let us fix $x\in\mathcal{X}$, let $p=Ax$, so that $p\in\Delta_m$, and let $P$ be the distribution of the i.i.d. sample $\omega^K=(\omega_1,...,\omega_K)$ with $\omega_k\sim p$. Setting $\xi_k=\omega_k-p$, $\widehat{\xi}=\frac1K\sum_k\xi_k$, observe that $\xi_1,...,\xi_K$ are i.i.d. zero mean with covariance matrix $\mathrm{Diag}\{p\}-pp^T$, whence $\widehat{\xi}$ is zero mean with covariance matrix $Q_x=\frac1K[\mathrm{Diag}\{p\}-pp^T]\preceq\frac1K\mathrm{Diag}\{p\}$. Besides this, $\widehat{\omega}[\omega^K]=p+\widehat{\xi}$. We now have
\[
\begin{array}{rcl}
\|Bx-\widehat{x}_H(\omega^K)\|&=&\|Bx-H^T(p+\widehat{\xi})\|=\|Bx-H^T(Ax+\widehat{\xi})\|\\
&\le&\|[B-H^TA]x\|+\|H^T\widehat{\xi}\|\\
&\le&\Phi(H)+\|H^T\widehat{\xi}\|\quad[\mbox{recall that }x\in\Delta_n].
\end{array}\tag{6.28}
\]
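The covariance formula used above can be verified by exact enumeration over the $m$ possible values of a single observation. The sketch below is illustrative: the vector `p` and the test vector `v` are arbitrary assumptions, and rational arithmetic keeps the comparison exact.

```python
from fractions import Fraction

def covariance_by_enumeration(p):
    """E[(w - p)(w - p)^T] for w = e_j with probability p_j, summed over all j."""
    m = len(p)
    cov = [[Fraction(0)] * m for _ in range(m)]
    for j in range(m):
        d = [(1 if i == j else 0) - p[i] for i in range(m)]   # e_j - p
        for a in range(m):
            for b in range(m):
                cov[a][b] += p[j] * d[a] * d[b]
    return cov

def closed_form(p):
    """Diag{p} - p p^T."""
    m = len(p)
    return [[(p[a] if a == b else Fraction(0)) - p[a] * p[b] for b in range(m)]
            for a in range(m)]

p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
Q = covariance_by_enumeration(p)
print(Q == closed_form(p))

# Diag{p} - Q = p p^T is positive semidefinite: v^T (Diag{p} - Q) v = (p^T v)^2.
v = [Fraction(3), Fraction(-1), Fraction(2), Fraction(-5)]
gap = sum(v[a] * ((p[a] if a == b else 0) - Q[a][b]) * v[b]
          for a in range(4) for b in range(4))
print(gap == sum(pi * vi for pi, vi in zip(p, v)) ** 2)
```

Dividing by $K$ gives the covariance of the average $\widehat{\xi}$ and the domination $Q_x\preceq\frac1K\mathrm{Diag}\{p\}$ used in the proof.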
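Next, from the last semidefinite constraint in (4.104) by exactly the same argument as in the proof of Proposition 4.31 (see derivation of (6.26) in Solution to Exercise 4.14) one has
\[
\|H^T\widehat{\xi}\|\le2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])[\widehat{\xi}^T\Theta\widehat{\xi}]}.\tag{6.29}
\]
Taking expectations and invoking Cauchy's inequality, we get
\[
\mathbf{E}\left\{\|H^T\widehat{\xi}\|\right\}\le2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\mathbf{E}\{\widehat{\xi}^T\Theta\widehat{\xi}\}}\le2K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)},
\]
where the concluding inequality is due to $\Theta\succeq0$ and $\widehat{\xi}$ being zero mean with covariance matrix $\preceq K^{-1}\mathrm{Diag}\{Ax\}$. Thus,
\[
\mathbf{E}\left\{\|H^T\widehat{\xi}\|\right\}\le2K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma(\Theta)},
\]
which combines with (6.28) to yield the first inequality in (4.105). The second inequality in (4.105) is evident. □
(ii): In the notation from the proof of (i), by (6.28), (6.29) we have
\[
\|Bx-\widehat{x}_H(\omega^K)\|\le\Phi(H)+2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])[\widehat{\xi}^T\Theta\widehat{\xi}]},
\]
whence
\[
\|Bx-\widehat{x}_H(\omega^K)\|\le\Phi(H)+2\|\widehat{\xi}\|_2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\|\Theta\|_{\mathrm{Sh},\infty}}.
\]
It remains to apply Lemma 4.33. □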
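3) Look what happens when $\nu=m=n$, $A$ and $B$ are the unit matrices, and $H=I$, i.e., we want to understand how good the recovery of a discrete probability distribution is by the empirical distribution of a $K$-element i.i.d. sample drawn from the original distribution. Take, as $\|\cdot\|$, the norm $\|\cdot\|_p$ with $p\in[1,2]$, and show that for every $x\in\Delta_n$ and every $\epsilon\in(0,1)$ one has
\[
\forall(x\in\Delta_n):\quad
\begin{array}{l}
\mathbf{E}\left\{\|x-\widehat{x}_I(\omega^K)\|_p\right\}\le n^{\frac1p-\frac12}K^{-\frac12},\\
\mathrm{Prob}\left\{\|x-\widehat{x}_I(\omega^K)\|_p\le2\sqrt{3\ln(2n/\epsilon)}\,n^{\frac1p-\frac12}K^{-\frac12}\right\}\ge1-\epsilon.
\end{array}
\]
Solution: In the situation in question it is immediately seen that the $\Upsilon_\ell$ are just reals $\upsilon_\ell$, $\phi_{\mathcal{R}}(\lambda[\Upsilon])=\|\upsilon\|_q$, $q=\frac{p}{2-p}$, and problem (4.104) with $H$ set to $I$ boils down to
\[
\min_{\Theta,\upsilon}\left\{\|\upsilon\|_q+K^{-1}\max_{x\in\Delta_n}\mathrm{Tr}(\mathrm{Diag}\{x\}\Theta):\ \upsilon\ge0,\ \begin{bmatrix}\Theta&\frac12I\\\frac12I&\mathrm{Diag}\{\upsilon\}\end{bmatrix}\succeq0\right\},\quad q=\frac{p}{2-p}.
\]
As far as the bounds (4.105) and (4.106) are concerned, we clearly lose nothing when restricting ourselves to $\Theta=\frac12I$ and $\upsilon=\frac12[1;...;1]$, so that the bounds become
\[
\forall(x\in\Delta_n):\quad
\begin{array}{l}
\mathbf{E}\left\{\|x-\widehat{x}_I(\omega^K)\|_p\right\}\le n^{\frac1p-\frac12}K^{-\frac12},\\
\mathrm{Prob}\left\{\|x-\widehat{x}_I(\omega^K)\|_p\le2\sqrt{3\ln(2n/\epsilon)}\,n^{\frac1p-\frac12}K^{-\frac12}\right\}\ge1-\epsilon,
\end{array}
\]
as claimed. □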
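The bound on the expected error of the empirical distribution can be compared with simulated values. The sketch below is an illustration only: the uniform signal, the values $n=10$, $K=100$, the three values of $p$, the number of trials, and the seed are all assumptions of ours.

```python
import random

def lp_error(x, K, p_norm, rng):
    """l_p distance between x and the empirical distribution of K draws from x."""
    n = len(x)
    counts = [0] * n
    for i in rng.choices(range(n), weights=x, k=K):
        counts[i] += 1
    return sum(abs(counts[i] / K - x[i]) ** p_norm for i in range(n)) ** (1 / p_norm)

rng = random.Random(2)
n, K, trials = 10, 100, 2000
x = [1 / n] * n                      # uniform signal, an arbitrary test choice

results = []
for p_norm in (1.0, 1.5, 2.0):
    mean_err = sum(lp_error(x, K, p_norm, rng) for _ in range(trials)) / trials
    bound = n ** (1 / p_norm - 0.5) * K ** -0.5     # n^{1/p - 1/2} K^{-1/2}
    results.append((p_norm, mean_err, bound, mean_err <= bound))

for row in results:
    print(row)
```

For $p=2$ the bound is nearly tight on the uniform signal, since there $\mathbf{E}\|\widehat{\xi}\|_2^2=(1-1/n)/K$; for smaller $p$ the factor $n^{1/p-1/2}$ comes from comparing the $\ell_p$ and $\ell_2$ norms.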
Exercise 4.20. [follow-up to Exercise 4.17] Consider the situation as follows. A retailer sells $n$ items by offering customers via internet bundles of $m<n$ items, so that an offer is an $m$-element subset $B$ of the set $S=\{1,...,n\}$ of the items. A customer has private preferences represented by a subset $P$ of $S$, the customer's preference set. We assume that if an offer $B$ intersects with the preference set $P$ of a customer, the latter buys an item drawn at random from the uniform distribution on $B\cap P$, and if $B\cap P=\emptyset$, the customer declines the offer. In the pilot stage we are interested in, the seller learns the market by making offers to $K$ customers. Specifically, the seller draws the $k$-th customer, $k\le K$, at random from the uniform distribution on the population of customers, and makes the selected customer an offer drawn at random from the uniform distribution on the set $\mathcal{S}_{m,n}$ of all $m$-item offers. What is observed in the $k$-th experiment is the item, if any, bought by the customer, and we want to make statistical inferences from these observations.

The outlined observation scheme can be formalized as follows. Let $\mathcal{S}$ be the set of all subsets of the $n$-element set $S$, so that $\mathcal{S}$ is of cardinality $N=2^n$. The population of customers induces a probability distribution $p$ on $\mathcal{S}$: for $P\in\mathcal{S}$, $p_P$ is the fraction of customers with the preference set being $P$; we refer to $p$ as the preference distribution. An outcome of a single experiment can be represented by a pair $(\iota,B)$, where $B\in\mathcal{S}_{m,n}$ is the offer used in the experiment, and $\iota$ is either 0 (“nothing is bought,” $P\cap B=\emptyset$), or a point from $P\cap B$, the item which was bought, when $P\cap B\ne\emptyset$. Note that the distribution $A_P$ of the outcome of an experiment with a customer whose preference set is $P$ is a probability distribution on the ($M=(m+1)\binom{n}{m}$)-element set $\Omega=\{(\iota,B)\}$ of possible outcomes. As a result, our observation scheme is fully specified by the $M\times N$ column-stochastic matrix $A$ known to us with the columns $A_P$ indexed by $P\in\mathcal{S}$.
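For small $m$ and $n$ the matrix $A$ can be written down explicitly. The sketch below is one possible construction under the stated model; the function name and the ordering of outcomes and preference sets are our own conventions, and exact fractions are used so that column sums can be checked exactly.

```python
from fractions import Fraction
from itertools import combinations

def outcome_matrix(n, m):
    """Column-stochastic matrix A with columns A_P indexed by preference sets P."""
    items = range(1, n + 1)
    offers = list(combinations(items, m))
    # Outcomes (iota, B): iota = 0 means "nothing is bought", otherwise an item of B.
    outcomes = [(iota, B) for B in offers for iota in (0,) + B]
    row = {o: r for r, o in enumerate(outcomes)}
    prefs = [frozenset(P) for k in range(n + 1) for P in combinations(items, k)]
    A = [[Fraction(0)] * len(prefs) for _ in outcomes]
    for c, P in enumerate(prefs):
        for B in offers:                          # offer uniform on all m-item offers
            common = [i for i in B if i in P]
            if common:                            # purchase uniform on B ∩ P
                for i in common:
                    A[row[(i, B)]][c] += Fraction(1, len(offers) * len(common))
            else:                                 # offer declined
                A[row[(0, B)]][c] += Fraction(1, len(offers))
    return A, outcomes, prefs

A, outcomes, prefs = outcome_matrix(5, 3)
print(len(A), len(A[0]))   # M = (m+1) * C(n, m) = 40 rows, N = 2^n = 32 columns
column_sums = [sum(A[r][c] for r in range(len(A))) for c in range(len(prefs))]
print(all(s == 1 for s in column_sums))
```

With $m=3$, $n=5$ this gives a $40\times32$ matrix, small enough for the semidefinite programs of Exercise 4.17 to be solved routinely.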
When a customer is drawn at random from the uniform distribution on the population of customers, the distribution of the outcome clearly is $Ap$, where $p$ is the (unknown) preference distribution. Our inferences should be based on the $K$-element sample $\omega^K = (\omega_1, ..., \omega_K)$, with $\omega_1, ..., \omega_K$ drawn, independently of each other, from the distribution $Ap$. Now we can pose various inference problems, e.g., that of estimating $p$. We, however, intend to focus on a simpler problem: that of recovering $Ap$. In terms of our story, this makes sense: when we know $Ap$, we know, e.g., what the probability is for every offer to be "successful" (something indeed is bought) and/or to result in a specific profit, etc. With this knowledge at hand, the seller can pass from a "blind" offering policy (drawing an offer at random from the uniform distribution on the set $S_{m,n}$) to something more rewarding. Now comes the exercise:

1. Use the results of Exercise 4.17 to build a "presumably good" linear estimate
$$\hat{x}_H(\omega^K) = H^T\left[\frac{1}{K}\sum_{k=1}^K \omega_k\right]$$
of $Ap$ (as always, we encode observations $\omega$, which are elements of the $M$-element set $\Omega$, by standard basic orths in $\mathbf{R}^M$). As the norm $\|\cdot\|$ quantifying the recovery error, use $\|\cdot\|_1$ and/or $\|\cdot\|_2$. In order to avoid computational difficulties, use small $m$ and $n$ (e.g., $m = 3$ and $n = 5$). Compare your results with those for the "straightforward" estimate $\frac{1}{K}\sum_{k=1}^K \omega_k$ (the empirical distribution of $\omega \sim Ap$).

2. Assuming that the "presumably good" linear estimate outperforms the straightforward one, how could this phenomenon be explained? Note that we have no nontrivial a priori information on $p$!
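The observation scheme described above can be made concrete with a short script. The following is a minimal sketch, not the authors' code: it builds the column-stochastic matrix $A$ for given $n$, $m$ and computes the "straightforward" empirical estimate of $Ap$. The function names, the ordering of outcomes and preference sets, and the Dirichlet draw of $p$ are all illustrative assumptions; the presumably good estimate itself requires solving the semidefinite program from Exercise 4.17 and is not attempted here.

```python
from itertools import combinations

import numpy as np


def build_observation_matrix(n, m):
    """Return (A, offers, prefs) for the bundle-offer observation scheme.

    Rows index outcomes (iota, B): each offer B owns m + 1 consecutive rows,
    the first for "nothing bought" and one per item of B.
    Columns index preference sets P (all 2**n subsets of {0, ..., n-1}).
    The row/column ordering is an arbitrary choice made for this sketch.
    """
    offers = list(combinations(range(n), m))
    prefs = [frozenset(i for i in range(n) if (mask >> i) & 1)
             for mask in range(2 ** n)]
    A = np.zeros(((m + 1) * len(offers), len(prefs)))
    for col, P in enumerate(prefs):
        for b, B in enumerate(offers):
            hits = [pos for pos, item in enumerate(B) if item in P]
            base = b * (m + 1)
            if not hits:
                # offer B is drawn with prob. 1/|offers| and declined
                A[base, col] = 1.0 / len(offers)
            else:
                # item bought is uniform on the intersection of B and P
                for pos in hits:
                    A[base + 1 + pos, col] = 1.0 / (len(offers) * len(hits))
    return A, offers, prefs


def empirical_estimate(A, p, K, rng):
    """Draw K i.i.d. outcomes from A @ p; return their empirical distribution."""
    counts = rng.multinomial(K, A @ p)
    return counts / K


A, offers, prefs = build_observation_matrix(n=5, m=3)
rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(A.shape[1]))      # an arbitrary preference distribution
x_emp = empirical_estimate(A, p, K=1000, rng=rng)
print(A.shape, np.abs(x_emp - A @ p).sum())  # matrix size and l1 recovery error
```

For $m = 3$, $n = 5$ the matrix has $M = 4\cdot\binom{5}{3} = 40$ rows and $N = 2^5 = 32$ columns, small enough for the semidefinite programs of the solution to be handled by any off-the-shelf solver.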
Solution: We have no a priori information on $p$; all we know is that $p \in \Delta_N$. Consequently, we are in the Simple case of Exercise 4.17, and the optimization problem $(P)$ (see Exercise 4.17) responsible for building a presumably good linear estimate, $\|\cdot\|$ being $\|\cdot\|_r$ with $r \in \{1, 2\}$, after immediate simplification becomes
SOLUTIONS TO SELECTED EXERCISES
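in the case of $r = 1$:
$$\mathrm{Opt}_1 = \min_{H,\,\lambda\in\mathbf{R}^M,\,\Theta}\left\{\max_{P\in\mathcal{S}}\|A_P - H^TA_P\|_1 + \sum_{k=1}^M \lambda_k + \frac{1}{K}\max_{P\in\mathcal{S}}\mathrm{Tr}(\mathrm{Diag}\{A_P\}\Theta):\ \lambda \ge 0,\ \begin{bmatrix}\Theta & \frac{1}{2}H\\ \frac{1}{2}H^T & \mathrm{Diag}\{\lambda\}\end{bmatrix}\succeq 0\right\} \qquad (P_1)$$
in the case of $r = 2$:
$$\mathrm{Opt}_2 = \min_{H,\,\lambda\in\mathbf{R},\,\Theta}\left\{\max_{P\in\mathcal{S}}\|A_P - H^TA_P\|_2 + \lambda + \frac{1}{K}\max_{P\in\mathcal{S}}\mathrm{Tr}(\mathrm{Diag}\{A_P\}\Theta):\ \lambda \ge 0,\ \begin{bmatrix}\Theta & \frac{1}{2}H\\ \frac{1}{2}H^T & \lambda I_M\end{bmatrix}\succeq 0\right\} \qquad (P_2)$$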
Note that Opt is an upper bound on the risk of the linear estimate yielded by the $H$-component of the optimal solution. When $m = 3$, $n = 5$ and $K = 1000$, the computation yields $\mathrm{Opt}_1 \approx 0.1414$, $\mathrm{Opt}_2 \approx 0.0230$. The results of 100 simulation runs, $K = 1000$ observations each, with a randomly selected $p$ common to all 100 runs, are as follows:

Linear estimate from (P1) / empirical distribution:
                          min              mean             median           max
  ‖x̂(·) − Ap‖_1           0.0605/0.0794    0.1008/0.1232    0.0994/0.1227    0.1643/0.1867
  ‖x̂(·) − Ap‖_2           0.0118/0.0190    0.0201/0.0298    0.0195/0.0300    0.0317/0.0442
  ‖x̂(·) − Ap‖_∞           0.0042/0.0073    0.0081/0.0144    0.0077/0.0137    0.0175/0.0253

Linear estimate from (P2) / empirical distribution:
                          min              mean             median           max
  ‖x̂(·) − Ap‖_1           0.0555/0.0748    0.1084/0.1101    0.1080/0.1108    0.1620/0.1600
  ‖x̂(·) − Ap‖_2           0.0119/0.0210    0.0218/0.0309    0.0221/0.0309    0.0314/0.0412
  ‖x̂(·) − Ap‖_∞           0.0049/0.0100    0.0088/0.0162    0.0086/0.0156    0.0137/0.0282
The first number in a cell corresponds to the linear estimate yielded by the optimal solution to $(P_1)$ (top half of the table) or $(P_2)$ (bottom half of the table). The second number in a cell corresponds to estimating $Ap$ by the empirical distribution of the observations. We see that the near-optimal linear estimate indeed outperforms the straightforward one. The reason is simple: while we have no nontrivial information on $p$, we do have information on $Ap$: we know that this distribution belongs to the convex hull of the vectors $A_P$, $P \in \mathcal{S}$, and we utilize this information when building the linear estimate.

Exercise 4.21. [Poisson Imaging] The Poisson Imaging Problem is to recover an unknown signal observed via the Poisson observation scheme. More specifically, assume that our observation is a realization of a random vector $\omega \in \mathbf{R}^m_+$ with Poisson entries $\omega_i \sim \mathrm{Poisson}([Ax]_i)$ independent of each other. Here $A$ is a given entrywise nonnegative $m \times n$ matrix, and $x$ is an unknown signal known to belong to a given compact convex subset $\mathcal{X}$ of $\mathbf{R}^n_+$. Our goal is to recover, in a given norm $\|\cdot\|$, the linear image $Bx$ of $x$, where $B$ is a given $\nu \times n$ matrix. We assume in the sequel that $\mathcal{X}$ is a subset cut off the $n$-dimensional probabilistic simplex $\Delta_n$ by a collection of linear equality and inequality constraints. The assumption $\mathcal{X} \subset \Delta_n$ is not too restrictive. Indeed, assume that we know in advance a linear inequality $\sum_i \alpha_i x_i \le 1$ with positive coefficients which is valid on $\mathcal{X}$.$^{17}$ Introducing the slack variable $s$ given by $\sum_i \alpha_i x_i + s = 1$ and passing from signal $x$ to the new signal $[\alpha_1 x_1; ...; \alpha_n x_n; s]$, after a straightforward modification of matrices $A$ and $B$, we arrive at the situation where $\mathcal{X}$ is a subset of the probabilistic simplex.

17 For example, in PET (see Section 2.4.3.2), where $x$ is the density of a radioactive tracer injected into the patient undergoing the PET procedure, we know in advance the total amount $\sum_i v_i x_i$ of the tracer, $v_i$ being the volumes of voxels.

Our goal in the sequel is to build a presumably good linear estimate $\hat{x}_H(\omega) = H^T\omega$ of $Bx$. As in Exercise 4.17, we start with upper-bounding the risk of a linear estimate. When representing $\omega = Ax + \xi_x$, we arrive at zero mean observation noise $\xi_x$ with entries $[\xi_x]_i = \omega_i - [Ax]_i$ independent of each other and covariance matrix $\mathrm{Diag}\{Ax\}$. We now can upper-bound the risk of a linear estimate $\hat{x}_H(\cdot)$ in the same way as in Exercise 4.17. Specifically, denoting by $\Pi_{\mathcal{X}}$ the set of all diagonal matrices $\mathrm{Diag}\{Ax\}$, $x \in \mathcal{X}$, and by $P_{i,x}$ the Poisson distribution with parameter $[Ax]_i$, we have
$$\begin{array}{rcl}
\mathrm{Risk}_{\|\cdot\|}[\hat{x}_H|\mathcal{X}] &=& \sup_{x\in\mathcal{X}} \mathbf{E}_{\omega\sim P_{1,x}\times\dots\times P_{m,x}}\left\{\|Bx - H^T\omega\|\right\}\\
&=& \sup_{x\in\mathcal{X}} \mathbf{E}_{\xi_x}\left\{\|[B - H^TA]x - H^T\xi_x\|\right\}\\
&\le& \underbrace{\sup_{x\in\mathcal{X}} \|[B - H^TA]x\|}_{\Phi(H)} + \underbrace{\sup_{\xi:\,\mathrm{Cov}[\xi]\in\Pi_{\mathcal{X}}} \mathbf{E}_{\xi}\left\{\|H^T\xi\|\right\}}_{\Psi_{\mathcal{X}}(H)}.
\end{array}$$
In order to build a presumably good linear estimate, it suffices to build efficiently computable upper bounds $\overline{\Phi}(H)$ on $\Phi(H)$ and $\overline{\Psi}^{\mathcal{X}}(H)$ on $\Psi_{\mathcal{X}}(H)$, convex in $H$, and then take as $H$ an optimal solution to the convex optimization problem
$$\mathrm{Opt} = \min_H\left[\overline{\Phi}(H) + \overline{\Psi}^{\mathcal{X}}(H)\right].$$
As in Exercise 4.17, assume from now on that $\|\cdot\|$ is an absolute norm, and the unit ball $\mathcal{B}_*$ of the conjugate norm is a spectratope,
$$\mathcal{B}_* := \{u : \|u\|_* \le 1\} = \{u : \exists r \in \mathcal{R}, y : u = My,\ S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell \le L\}.$$
Observe that
• In order to build $\overline{\Phi}$, we can use exactly the same techniques as those developed in Exercise 4.17. Indeed, as far as building $\overline{\Phi}$ is concerned, the only difference with the situation of Exercise 4.17 is that in the latter, $A$ was a column-stochastic matrix, while now $A$ is just an entrywise nonnegative matrix. Note, however, that when upper-bounding $\Phi$ in Exercise 4.17, we never used the fact that $A$ is column-stochastic.
• In order to upper-bound $\Psi_{\mathcal{X}}$, we can use the bound (4.40) of Exercise 4.17.
The bottom line is that in order to build a presumably good linear estimate, we need to solve the convex optimization problem
$$\mathrm{Opt} = \min_{H,\Upsilon,\Theta}\left\{\overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta):\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\} \qquad (P)$$
where
$$\Gamma_{\mathcal{X}}(\Theta) = \max_{x\in\mathcal{X}} \mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$$
(cf. problem (4.99)), with $\overline{\Phi}$ yielded by any construction from Exercise 4.17, e.g., the least conservative Combined upper bound on $\Phi$. What differs significantly in our present situation from the situation of Exercise 4.17 are the bounds on probabilities of large deviations (for the Discrete o.s., established in Exercise 4.19), and the goal of what follows is to establish these bounds for Poisson Imaging. Here is what you are supposed to do:
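1. Let $\omega \in \mathbf{R}^m$ be a random vector with independent entries $\omega_i \sim \mathrm{Poisson}(\mu_i)$, and let $\mu = [\mu_1; ...; \mu_m]$. Prove that whenever $h \in \mathbf{R}^m$, $\gamma > 0$, and $\delta \ge 0$, one has
$$\ln \mathrm{Prob}\{h^T\omega > h^T\mu + \delta\} \le \sum_i [\exp\{\gamma h_i\} - 1]\mu_i - \gamma h^T\mu - \gamma\delta. \qquad (4.107)$$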
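2. Taking for granted (or see, e.g., [178]) that $e^x - x - 1 \le \frac{x^2}{2(1 - x/3)}$ when $|x| < 3$, prove that in the situation of item 1 one has for $t > 0$:
$$0 \le \gamma < \frac{3}{\|h\|_\infty}\ \Rightarrow\ \ln \mathrm{Prob}\{h^T\omega > h^T\mu + t\} \le \frac{\gamma^2\sum_i h_i^2\mu_i}{2(1 - \gamma\|h\|_\infty/3)} - \gamma t. \qquad (4.108)$$
Derive from the latter fact that
$$\mathrm{Prob}\left\{h^T\omega > h^T\mu + \delta\right\} \le \exp\left\{-\frac{\delta^2}{2[\sum_i h_i^2\mu_i + \|h\|_\infty\delta/3]}\right\}, \qquad (4.109)$$
and conclude that
$$\mathrm{Prob}\left\{|h^T\omega - h^T\mu| > \delta\right\} \le 2\exp\left\{-\frac{\delta^2}{2[\sum_i h_i^2\mu_i + \|h\|_\infty\delta/3]}\right\}. \qquad (4.110)$$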
3. Extract from (4.110) the following
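Proposition 4.35 In the situation and under the assumptions of Exercise 4.21, let Opt be the optimal value, and $H, \Upsilon, \Theta$ be a feasible solution to problem $(P)$. Whenever $x \in \mathcal{X}$ and $\epsilon \in (0, 1)$, denoting by $P_x$ the distribution of observations stemming from $x$ (i.e., the distribution of the random vector $\omega$ with independent entries $\omega_i \sim \mathrm{Poisson}([Ax]_i)$), one has
$$\mathbf{E}_{\omega\sim P_x}\{\|Bx - \hat{x}_H(\omega)\|\} \le \overline{\Phi}(H) + 2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\,\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)} \le \overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) \qquad (4.111)$$
and
$$\mathrm{Prob}_{\omega\sim P_x}\left\{\|Bx - \hat{x}_H(\omega)\| \le \overline{\Phi}(H) + 4\sqrt{\tfrac{2}{9}\ln^2(2m/\epsilon)\mathrm{Tr}(\Theta) + \ln(2m/\epsilon)\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)}\,\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])}\right\} \ge 1 - \epsilon. \qquad (4.112)$$
Note that in the case of $[Ax]_i \ge 1$ for all $x \in \mathcal{X}$ and all $i$ we have $\mathrm{Tr}(\Theta) \le \mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$, so that in this case the $P_x$-probability of the event
$$\left\{\omega : \|Bx - \hat{x}_H(\omega)\| \le \overline{\Phi}(H) + O(1)\ln(2m/\epsilon)\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma_{\mathcal{X}}(\Theta)}\right\}$$
is at least $1 - \epsilon$.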
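Before proving the Bernstein-type bound (4.110), it can be instructive to see how it compares with simulated Poisson data. The sketch below evaluates the right-hand side of (4.110) and a Monte Carlo estimate of the left-hand side; the vectors `h`, `mu`, the threshold `delta`, the sample size, and the function names are illustrative assumptions, not values from the text.

```python
import numpy as np


def poisson_bernstein_bound(h, mu, delta):
    """Right-hand side of (4.110) for independent omega_i ~ Poisson(mu_i)."""
    h = np.asarray(h, dtype=float)
    mu = np.asarray(mu, dtype=float)
    variance = np.sum(h ** 2 * mu)          # sum_i h_i^2 mu_i
    h_inf = np.max(np.abs(h))               # ||h||_inf
    return 2.0 * np.exp(-delta ** 2 / (2.0 * (variance + h_inf * delta / 3.0)))


def empirical_two_sided_tail(h, mu, delta, n_samples, rng):
    """Monte Carlo frequency of the event |h^T omega - h^T mu| > delta."""
    h = np.asarray(h, dtype=float)
    mu = np.asarray(mu, dtype=float)
    omega = rng.poisson(mu, size=(n_samples, mu.size))
    deviation = omega @ h - h @ mu
    return float(np.mean(np.abs(deviation) > delta))


h = [1.0, -0.5, 2.0]        # illustrative weights
mu = [3.0, 5.0, 1.0]        # illustrative Poisson intensities
delta = 6.0
rng = np.random.default_rng(0)

bound = poisson_bernstein_bound(h, mu, delta)
freq = empirical_two_sided_tail(h, mu, delta, n_samples=20000, rng=rng)
print(f"bound (4.110): {bound:.3f}, empirical frequency: {freq:.3f}")
```

The bound is typically conservative at moderate deviations; its value lies in the explicit dependence on $\sum_i h_i^2\mu_i$ and $\|h\|_\infty$, which is exactly what the proof of Proposition 4.35 exploits.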
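Solution: 1. Taking into account that $\omega_i \sim \mathrm{Poisson}(\mu_i)$ are independent across $i$, we have
$$\mathbf{E}\left\{\exp\{\gamma h^T\omega\}\right\} = \prod_i \mathbf{E}\left\{\exp\{\gamma h_i\omega_i\}\right\} = \prod_i \exp\{[\exp\{\gamma h_i\} - 1]\mu_i\} = \exp\Big\{\sum_i [\exp\{\gamma h_i\} - 1]\mu_i\Big\},$$
whence by the Chebyshev inequality
$$\begin{array}{rcl}
\mathrm{Prob}\{h^T\omega > h^T\mu + \delta\} &=& \mathrm{Prob}\{\gamma h^T\omega > \gamma h^T\mu + \gamma\delta\}\\
&\le& \mathbf{E}\left\{\exp\{\gamma h^T\omega\}\right\}\exp\{-\gamma h^T\mu - \gamma\delta\}\\
&\le& \exp\Big\{\sum_i [\exp\{\gamma h_i\} - 1]\mu_i - \gamma h^T\mu - \gamma\delta\Big\},
\end{array}$$
and the required bound follows. ✷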
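2. Relation (4.108) is an immediate consequence of (4.107) combined with the inequality
$$\exp\{\gamma h_i\} \le 1 + \gamma h_i + \frac{\gamma^2 h_i^2}{2(1 - \gamma\|h\|_\infty/3)},$$
which takes place when $|\gamma|\|h\|_\infty < 3$. Assuming w.l.o.g. that $\sum_i h_i^2\mu_i > 0$ (otherwise (4.109) is evident, since then $h^T\omega \equiv h^T\mu = 0$), with $\gamma = \frac{t}{\sum_i h_i^2\mu_i + \|h\|_\infty t/3}$ (4.108) results in (4.109). The latter relation combines with a similar relation for $-h$ in the role of $h$ to yield (4.110). ✷

3. Let $x \in \mathcal{X}$ and $\epsilon \in (0, 1)$, and let $\mu = Ax$ and $\alpha = \ln(2m/\epsilon)$. Let also
$$\Theta = U\mathrm{Diag}\{\lambda_1, ..., \lambda_m\}U^T$$
be the eigenvalue decomposition of $\Theta$, and let $u_p$ be the $p$-th column of $U$ and $h^p = \sqrt{\lambda_p}u_p$, $1 \le p \le m$, so that
$$\Theta = \sum_{p=1}^m h^p[h^p]^T,\qquad \rho_p := \|h^p\|_\infty \le \|h^p\|_2 = \sqrt{\lambda_p},$$
and
$$\vartheta := \sum_i \Theta_{ii}\mu_i = \sum_i\sum_p [h^p_i]^2\mu_i = \sum_p \underbrace{\Big[\sum_i [h^p_i]^2\mu_i\Big]}_{\sigma_p}.$$
Let $P_x$ be the distribution of the observation $\omega$ stemming from $x$, and let $P'_x$ be the distribution of $\xi := \omega - \mu = \omega - Ax$ induced by the distribution $P_x$ of $\omega$, so that $\xi$ is zero mean with the covariance matrix $\mathrm{Diag}\{Ax\}$. For $1 \le p \le m$, let
$$\delta_p = \tfrac{2}{3}\alpha\rho_p + \sqrt{2\alpha\sigma_p},$$
so that for every $p \le m$ it holds $\delta_p \ge 0$ and
$$\frac{\delta_p^2}{2\sum_i [h^p_i]^2\mu_i + \frac{2}{3}\|h^p\|_\infty\delta_p} = \frac{\delta_p^2}{2\sigma_p + \frac{2}{3}\rho_p\delta_p} \ge \alpha.$$
Invoking (4.110) with $\delta = \delta_p$, $h = h^p$ we get
$$\mathrm{Prob}_{\omega\sim P_x}\{|[h^p]^T\omega - [h^p]^T\mu| > \delta_p\} \le 2\exp\left\{-\frac{\delta_p^2}{2[\sum_i [h^p_i]^2\mu_i + \|h^p\|_\infty\delta_p/3]}\right\} \le 2e^{-\alpha} = \frac{\epsilon}{m},$$
so that the $P'_x$-probability of the event
$$\Xi = \Xi_\Theta := \{\xi : |[h^p]^T\xi| \le \delta_p,\ 1 \le p \le m\}$$
is at least $1 - \epsilon$. When this event takes place, we have
$$\begin{array}{rcl}
\xi^T\Theta\xi &=& \xi^T\Big[\sum_p h^p[h^p]^T\Big]\xi \le \sum_p \delta_p^2 \le \sum_p 2\left[\tfrac{4}{9}\alpha^2\rho_p^2 + 2\alpha\sigma_p\right]\\
&\le& 2\left[\tfrac{4}{9}\alpha^2\sum_p \rho_p^2 + 2\alpha\sum_p\sigma_p\right] \le 2\left[\tfrac{4}{9}\alpha^2\sum_p \|h^p\|_2^2 + 2\alpha\vartheta\right]\\
&=& 2\left[\tfrac{4}{9}\alpha^2\sum_p \lambda_p + 2\alpha\vartheta\right] = 2\left[\tfrac{4}{9}\alpha^2\mathrm{Tr}(\Theta) + 2\alpha\vartheta\right].
\end{array} \qquad (6.30)$$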
It remains to note that for every realization of $\omega$ it holds
$$\|Bx - \hat{x}_H(\omega)\| = \|Bx - H^T[Ax + \xi]\| \le \|[B - H^TA]x\| + \|H^T\xi\| \le \overline{\Phi}(H) + 2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])[\xi^T\Theta\xi]}, \qquad (6.31)$$
where the concluding inequality is given by exactly the same computation as that used to derive (4.106); see the solution to Exercise 4.19. Taking expectations of both sides in (6.31) and using the fact that $\xi$ is zero mean with covariance matrix $\mathrm{Diag}\{Ax\}$, we get (4.111). Furthermore, (6.31) combines with (6.30) to imply that when $\xi = \omega - Ax \in \Xi_\Theta$ (which happens with probability at least $1 - \epsilon$), one has
$$\|Bx - \hat{x}_H(\omega)\| \le \overline{\Phi}(H) + 4\sqrt{\tfrac{2}{9}\ln^2(2m/\epsilon)\mathrm{Tr}(\Theta) + \ln(2m/\epsilon)\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)}\,\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])},$$
and (4.112) follows. ✷

6.4.5 Numerical lower-bounding minimax risk
Exercise 4.22

4.22.A. Motivation. From the theoretical viewpoint, the results on near-optimality of presumably good linear estimates stated in Propositions 4.5, 4.16 seem to be quite strong and general. This being said, for a practically oriented user the "nonoptimality factors" arising in these propositions can be too large to make any practical sense. This drawback of our theoretical results is not too crucial: what matters in applications is whether the risk of a proposed estimate is appropriate for the application in question, and not by how much it could be improved were we smart enough to build the "ideal" estimate; results of the latter type, from a practical viewpoint, offer no more than some "moral support." Nevertheless, the "moral support" has its value, and it makes sense to strengthen it by improving the lower risk bounds as compared to those underlying Propositions 4.5 and 4.16. In this respect, an appealing idea is to pass from lower risk bounds yielded by theoretical considerations to computation-based ones. The goal of this exercise is to develop some methodology yielding computation-based lower risk bounds. We start with the main ingredient of this methodology, the classical Cramer-Rao bound.

4.22.B. Cramer-Rao bound. Consider the situation as follows: we are given
• an observation space $\Omega$ equipped with reference measure $\Pi$, basic examples being (A) $\Omega = \mathbf{R}^m$ with Lebesgue measure $\Pi$, and (B) a (finite or countable) discrete set $\Omega$ with counting measure $\Pi$;
• a convex compact set $\Theta \subset \mathbf{R}^k$ and a family $\mathcal{P} = \{p(\omega, \theta) : \theta \in \Theta\}$ of probability densities, taken w.r.t. $\Pi$.

Our goal is, given an observation $\omega \sim p(\cdot, \theta)$ stemming from an unknown $\theta$ known to belong to $\Theta$, to recover $\theta$. We quantify the risk of a candidate estimate $\hat{\theta}$ as
$$\mathrm{Risk}[\hat{\theta}|\Theta] = \sup_{\theta\in\Theta}\left(\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\hat{\theta}(\omega) - \theta\|_2^2\right\}\right)^{1/2}, \qquad (4.113)$$
and define the "ideal" minimax risk as
$$\mathrm{Risk}_{\mathrm{opt}} = \inf_{\hat{\theta}} \mathrm{Risk}[\hat{\theta}|\Theta],$$
the infimum being taken w.r.t. all estimates, or, which is the same, all bounded estimates (indeed, passing from a candidate estimate $\hat{\theta}$ to the projected estimate $\hat{\theta}_\Theta(\omega) = \mathrm{argmin}_{\theta\in\Theta}\|\hat{\theta}(\omega) - \theta\|_2$ will only reduce the estimate risk).
The Cramer-Rao inequality [58, 205], which we intend to use,18 is a certain relation between the covariance matrix of a bounded estimate and its bias; this relation is valid under mild regularity assumptions on the family P, specifically, as follows:
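1) $p(\omega, \theta) > 0$ for all $\omega \in \Omega$, $\theta \in \Theta$, and $p(\omega, \theta)$ is differentiable in $\theta$, with $\nabla_\theta p(\omega, \theta)$ continuous in $\theta \in \Theta$;
2) The Fisher Information matrix
$$\mathcal{I}(\theta) = \int_\Omega \frac{\nabla_\theta p(\omega, \theta)[\nabla_\theta p(\omega, \theta)]^T}{p(\omega, \theta)}\Pi(d\omega)$$
is well-defined for all $\theta \in \Theta$;
3) There exists a function $M(\omega) \ge 0$ such that $\int_\Omega M(\omega)\Pi(d\omega) < \infty$ and $\|\nabla_\theta p(\omega, \theta)\|_2 \le M(\omega)$ for all $\omega \in \Omega$, $\theta \in \Theta$.

The derivation of the Cramer-Rao bound is as follows. Let $\hat{\theta}(\omega)$ be a bounded estimate, and let
$$\phi(\theta) = [\phi_1(\theta); ...; \phi_k(\theta)] = \int_\Omega \hat{\theta}(\omega)p(\omega, \theta)\Pi(d\omega)$$
be the expected value of the estimate. By item 3, $\phi(\theta)$ is differentiable on $\Theta$, with the Jacobian $\phi'(\theta) = \left[\frac{\partial\phi_i(\theta)}{\partial\theta_j}\right]_{i,j\le k}$ given by
$$\phi'(\theta)h = \int_\Omega \hat{\theta}(\omega)h^T\nabla_\theta p(\omega, \theta)\Pi(d\omega),\quad h \in \mathbf{R}^k.$$
Besides this, since $\int_\Omega p(\omega, \theta)\Pi(d\omega) \equiv 1$, invoking item 3 we have $\int_\Omega h^T\nabla_\theta p(\omega, \theta)\Pi(d\omega) = 0$, whence, in view of the previous identity,
$$\phi'(\theta)h = \int_\Omega [\hat{\theta}(\omega) - \phi(\theta)]h^T\nabla_\theta p(\omega, \theta)\Pi(d\omega),\quad h \in \mathbf{R}^k.$$
Therefore for all $g, h \in \mathbf{R}^k$ we have
$$\begin{array}{rcl}
[g^T\phi'(\theta)h]^2 &=& \left[\int_\Omega [g^T(\hat{\theta} - \phi(\theta))][h^T\nabla_\theta p(\omega, \theta)/p(\omega, \theta)]p(\omega, \theta)\Pi(d\omega)\right]^2\\
&\le& \left[\int_\Omega g^T[\hat{\theta} - \phi(\theta)][\hat{\theta} - \phi(\theta)]^Tg\,p(\omega, \theta)\Pi(d\omega)\right] \times \left[\int_\Omega [h^T\nabla_\theta p(\omega, \theta)/p(\omega, \theta)]^2p(\omega, \theta)\Pi(d\omega)\right]\quad \text{[by the Cauchy Inequality]}\\
&=& \left[g^T\mathrm{Cov}_{\hat{\theta}}(\theta)g\right]\left[h^T\mathcal{I}(\theta)h\right],
\end{array}$$
where $\mathrm{Cov}_{\hat{\theta}}(\theta)$ is the covariance matrix $\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\hat{\theta}(\omega) - \phi(\theta)][\hat{\theta}(\omega) - \phi(\theta)]^T\right\}$ of $\hat{\theta}(\omega)$ induced by $\omega \sim p(\cdot, \theta)$. We have arrived at the inequality
$$\left[g^T\mathrm{Cov}_{\hat{\theta}}(\theta)g\right]\left[h^T\mathcal{I}(\theta)h\right] \ge [g^T\phi'(\theta)h]^2\quad \forall (g, h \in \mathbf{R}^k, \theta \in \Theta). \qquad (*)$$
For $\theta \in \Theta$ fixed, let $\mathcal{J}$ be a positive definite matrix such that $\mathcal{J} \succeq \mathcal{I}(\theta)$, whence by $(*)$ it holds
$$\left[g^T\mathrm{Cov}_{\hat{\theta}}(\theta)g\right]\left[h^T\mathcal{J}h\right] \ge [g^T\phi'(\theta)h]^2\quad \forall (g, h \in \mathbf{R}^k). \qquad (**)$$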
18 As a matter of fact, the classical Cramer-Rao inequality dealing with unbiased estimates is not sufficient for our purposes “as is.” What we need to build bounds for the minimax risk is a “bias enabled” version of this inequality. Such an inequality may be developed using Bayesian argument [99, 233].
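For $g$ fixed, the maximum of the right hand side quantity in $(**)$ over $h$ satisfying $h^T\mathcal{J}h \le 1$ is $g^T\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^Tg$, and we arrive at the Cramer-Rao inequality
$$\forall (\theta \in \Theta,\ \mathcal{J} \succeq \mathcal{I}(\theta),\ \mathcal{J} \succ 0):\ \mathrm{Cov}_{\hat{\theta}}(\theta) \succeq \phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T \qquad (4.114)$$
$$\left[\mathrm{Cov}_{\hat{\theta}}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\hat{\theta} - \phi(\theta)][\hat{\theta} - \phi(\theta)]^T\right\},\ \phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\hat{\theta}(\omega)\right\}\right]$$
which holds true for every bounded estimate $\hat{\theta}(\cdot)$. Note also that for every $\theta \in \Theta$ and every bounded estimate $\hat{\theta}$ we have
$$\begin{array}{rcl}
\mathrm{Risk}^2[\hat{\theta}] &\ge& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\hat{\theta}(\omega) - \theta\|_2^2\right\} = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|[\hat{\theta}(\omega) - \phi(\theta)] + [\phi(\theta) - \theta]\|_2^2\right\}\\
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\hat{\theta}(\omega) - \phi(\theta)\|_2^2\right\} + \|\phi(\theta) - \theta\|_2^2 + 2\underbrace{\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\hat{\theta}(\omega) - \phi(\theta)]^T[\phi(\theta) - \theta]\right\}}_{=0}\\
&=& \mathrm{Tr}(\mathrm{Cov}_{\hat{\theta}}(\theta)) + \|\phi(\theta) - \theta\|_2^2.
\end{array}$$
Hence, in view of (4.114), for every bounded estimate $\hat{\theta}$ it holds
$$\forall (\mathcal{J} \succ 0 : \mathcal{J} \succeq \mathcal{I}(\theta)\ \forall \theta \in \Theta):\quad \mathrm{Risk}^2[\hat{\theta}] \ge \sup_{\theta\in\Theta}\left[\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T) + \|\phi(\theta) - \theta\|_2^2\right] \qquad (4.115)$$
$$\left[\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\{\hat{\theta}(\omega)\}\right].$$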
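Inequality (4.114) can be checked numerically in a case where every quantity is available in closed form. The sketch below assumes a Gaussian observation $\omega \sim \mathcal{N}(A\theta, \sigma^2 I_m)$ and a linear estimate $\hat{\theta}(\omega) = G\omega$; in this illustration (which ignores the boundedness requirement on the estimate, since only the algebra is being checked) one has $\phi'(\theta) = GA$, $\mathrm{Cov}_{\hat{\theta}}(\theta) = \sigma^2GG^T$, and $\mathcal{I}(\theta) = \sigma^{-2}A^TA$. The matrices `A`, `G` and the sizes are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, sigma = 6, 3, 0.5

A = rng.standard_normal((m, k))   # sensing matrix (full column rank almost surely)
G = rng.standard_normal((k, m))   # linear estimate theta_hat(omega) = G @ omega

fisher = A.T @ A / sigma ** 2     # Fisher information of N(A theta, sigma^2 I)
jacobian = G @ A                  # phi'(theta), since phi(theta) = G A theta
cov = sigma ** 2 * G @ G.T        # covariance of the estimate

# Cramer-Rao gap: Cov - phi' J^{-1} phi'^T with J equal to the Fisher information
gap = cov - jacobian @ np.linalg.solve(fisher, jacobian.T)
min_eig = np.linalg.eigvalsh((gap + gap.T) / 2).min()
print("smallest eigenvalue of the Cramer-Rao gap:", min_eig)

# Squared risk at a point theta splits into variance plus squared bias
theta = rng.standard_normal(k)
bias = (jacobian - np.eye(k)) @ theta
risk_sq = np.trace(cov) + bias @ bias
cr_lower = np.trace(jacobian @ np.linalg.solve(fisher, jacobian.T)) + bias @ bias
```

Here the gap equals $\sigma^2G(I_m - \Pi_A)G^T$ with $\Pi_A$ the orthogonal projector onto the image of $A$, which is positive semidefinite, in agreement with (4.114); consequently `risk_sq` is never smaller than `cr_lower`, the pointwise quantity under the supremum in (4.115).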
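The fact that we considered the risk of estimating "the entire" $\theta$ rather than a given vector-valued function $f(\theta) : \Theta \to \mathbf{R}^\nu$ plays no special role, and in fact the Cramer-Rao inequality admits the following modification yielded by a completely similar reasoning:

Proposition 4.36 In the situation described in item 4.22.B and under assumptions 1)–3) of this item, let $f(\cdot) : \Theta \to \mathbf{R}^\nu$ be a bounded Borel function, and let $\hat{f}(\omega)$ be a bounded estimate of $f(\theta)$ via observation $\omega \sim p(\cdot, \theta)$. Then, setting for $\theta \in \Theta$
$$\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\hat{f}(\omega)\right\},\qquad \mathrm{Cov}_{\hat{f}}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{[\hat{f}(\omega) - \phi(\theta)][\hat{f}(\omega) - \phi(\theta)]^T\right\},$$
one has
$$\forall (\theta \in \Theta,\ \mathcal{J} \succeq \mathcal{I}(\theta),\ \mathcal{J} \succ 0):\ \mathrm{Cov}_{\hat{f}}(\theta) \succeq \phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T.$$
As a result, for
$$\mathrm{Risk}[\hat{f}] = \sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\hat{f}(\omega) - f(\theta)\|_2^2\right\}\right]^{1/2}$$
it holds
$$\forall (\mathcal{J} \succ 0 : \mathcal{J} \succeq \mathcal{I}(\theta)\ \forall \theta \in \Theta):\quad \mathrm{Risk}^2[\hat{f}] \ge \sup_{\theta\in\Theta}\left[\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T) + \|\phi(\theta) - f(\theta)\|_2^2\right].$$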
Now comes the first part of the exercise: 1) Derive from (4.115) the following:
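Proposition 4.37 In the situation of item 4.22.B, let
• $\Theta \subset \mathbf{R}^k$ be a $\|\cdot\|_2$-ball of radius $r > 0$,
• the family $\mathcal{P}$ be such that $\mathcal{I}(\theta) \preceq \mathcal{J}$ for some $\mathcal{J} \succ 0$ and all $\theta \in \Theta$.

Then the minimax optimal risk satisfies the bound
$$\mathrm{Risk}_{\mathrm{opt}} \ge \frac{rk}{r\sqrt{\mathrm{Tr}(\mathcal{J})} + k}. \qquad (4.116)$$
In particular, when $\mathcal{J} = \alpha^{-1}I_k$, we have
$$\mathrm{Risk}_{\mathrm{opt}} \ge \frac{r\sqrt{\alpha k}}{r + \sqrt{\alpha k}}. \qquad (4.117)$$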
Hint. Assuming w.l.o.g. that $\Theta$ is centered at the origin, and given a bounded estimate $\hat{\theta}$ with risk $R$, let $\phi(\theta)$ be associated with the estimate via (4.115). Select $\gamma \in (0, 1)$ and consider two cases: (a): there exists $\theta \in \partial\Theta$ such that $\|\phi(\theta) - \theta\|_2 > \gamma r$, and (b): $\|\phi(\theta) - \theta\|_2 \le \gamma r$ for all $\theta \in \partial\Theta$. In the case of (a), lower-bound $R$ by $\max_{\theta\in\Theta}\|\phi(\theta) - \theta\|_2$; see (4.115). In the case of (b), lower-bound $R^2$ by $\max_{\theta\in\Theta}\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T)$ (see (4.115)) and use the Gauss Divergence theorem to lower-bound the latter quantity in terms of the flux of the vector field $\phi(\cdot)$ over $\partial\Theta$. When implementing the above program, you might find useful the following fact (prove it!):

Lemma 4.38 Let $\Phi$ be an $n \times n$ matrix, and $\mathcal{J}$ be a positive definite $n \times n$ matrix. Then
$$\mathrm{Tr}(\Phi\mathcal{J}^{-1}\Phi^T) \ge \frac{\mathrm{Tr}^2(\Phi)}{\mathrm{Tr}(\mathcal{J})}.$$
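Solution: Let us act as suggested in the hint. We lose nothing when assuming that $\Theta$ is centered at the origin: $\Theta = \{\theta \in \mathbf{R}^k : \|\theta\|_2 \le r\}$. Under the premise of the proposition, we can use $\mathcal{J}$ in (4.115). In the case of (a), (4.115) says that $R \ge \max_{\theta\in\partial\Theta}\|\phi(\theta) - \theta\|_2 \ge \gamma r$. In the case of (b), denoting by $S$ the boundary of $\Theta$ (that is, the Euclidean sphere of radius $r$ centered at the origin), by $dS$ the element of $(k-1)$-dimensional surface area of $S$, by $\mathcal{S}$ the entire surface area, and by $V$ the volume of $\Theta$, we have $V = k^{-1}r\mathcal{S}$. Let now $n(\theta)$ be the unit outer normal to $S$ at a point $\theta \in S$. The flux $\int_S n^T(\theta)\phi(\theta)dS$ is at least $(1 - \gamma)r\mathcal{S}$ due to
$$n^T(\theta)\phi(\theta) \ge n^T(\theta)\theta - \|\theta - \phi(\theta)\|_2 \ge r - \gamma r.$$
On the other hand, by the Divergence Theorem this flux is equal to $\int_\Theta\left[\sum_i \frac{\partial\phi_i(\theta)}{\partial\theta_i}\right]d\theta$, and we arrive at the inequality
$$(1 - \gamma)r\mathcal{S} \le \int_\Theta\left[\sum_i \frac{\partial\phi_i(\theta)}{\partial\theta_i}\right]d\theta \le V\max_{\theta\in\Theta}\sum_i \frac{\partial\phi_i(\theta)}{\partial\theta_i},$$
whence
$$\max_{\theta\in\Theta}\sum_i \frac{\partial\phi_i(\theta)}{\partial\theta_i} \ge \frac{(1 - \gamma)r\mathcal{S}}{V} = k(1 - \gamma).$$
Thus, there exists $\theta_* \in \Theta$ such that
$$\sum_i \frac{\partial\phi_i(\theta_*)}{\partial\theta_i} \ge k(1 - \gamma). \qquad (6.32)$$
Now, taking Lemma 4.38 for granted, (4.115) says that
$$R^2 \ge \mathrm{Tr}(\phi'(\theta_*)\mathcal{J}^{-1}[\phi'(\theta_*)]^T) \ge \mathrm{Tr}^2(\phi'(\theta_*))/\mathrm{Tr}(\mathcal{J}).$$
Invoking (6.32), we obtain
$$R^2 \ge (1 - \gamma)^2k^2/\mathrm{Tr}(\mathcal{J}).$$
Therefore, in all cases we have
$$R \ge \min\left[\gamma r,\ (1 - \gamma)k/\sqrt{\mathrm{Tr}(\mathcal{J})}\right],$$
which, setting $\gamma = \frac{k}{r\sqrt{\mathrm{Tr}(\mathcal{J})} + k}$, results in $R \ge \frac{rk}{r\sqrt{\mathrm{Tr}(\mathcal{J})} + k}$. Since this lower bound holds true for the risk $R$ of an arbitrary bounded estimate $\hat{\theta}$, (4.116) follows.

It remains to prove Lemma 4.38. Let $\mathcal{J} = U\mathrm{Diag}\{\lambda\}U^T$ be the eigenvalue decomposition of $\mathcal{J}$, let $\alpha = \sum_i \lambda_i = \mathrm{Tr}(\mathcal{J})$, and let $\Psi = U^T\Phi U$. We have
$$\begin{array}{rcl}
\mathrm{Tr}(\Phi\mathcal{J}^{-1}\Phi^T) &=& \mathrm{Tr}([\Phi U]\mathrm{Diag}\{1/\lambda_j, j \le n\}[U^T\Phi^T]) = \mathrm{Tr}([U^T\Phi U]\mathrm{Diag}\{1/\lambda_j, j \le n\}[U^T\Phi^TU])\\
&=& \mathrm{Tr}(\Psi\mathrm{Diag}\{1/\lambda_j, j \le n\}\Psi^T) = \sum_{i,j}\Psi_{ij}^2/\lambda_j \ge \sum_j \Psi_{jj}^2/\lambda_j\\
&\ge& \min\limits_{\mu_j\ge0:\ \sum_j\mu_j=\alpha}\ \sum_j \Psi_{jj}^2/\mu_j = \Big[\sum_j |\Psi_{jj}|\Big]^2/\alpha\\
&\ge& \mathrm{Tr}^2(\Psi)/\alpha = \mathrm{Tr}^2(\Phi)/\mathrm{Tr}(\mathcal{J}).
\end{array}$$
✷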
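Two ingredients of the proof just given are easy to sanity-check numerically: the choice of $\gamma$ that equalizes the two terms under the minimum, and Lemma 4.38 itself. The following is a small illustrative sketch; the function names and the random test matrices are assumptions made for the example, and a check on one random instance is of course no substitute for the proof.

```python
import numpy as np


def ball_lower_bound(r, k, trace_J):
    """Right-hand side of (4.116): r k / (r sqrt(Tr J) + k)."""
    return r * k / (r * np.sqrt(trace_J) + k)


def balancing_gamma(r, k, trace_J):
    """The gamma in (0, 1) for which gamma r = (1 - gamma) k / sqrt(Tr J)."""
    return k / (r * np.sqrt(trace_J) + k)


r, k, alpha = 2.0, 3, 0.5
trace_J = k / alpha                      # J = alpha^{-1} I_k, as in (4.117)
gamma = balancing_gamma(r, k, trace_J)
case_a = gamma * r                       # bound obtained in case (a)
case_b = (1 - gamma) * k / np.sqrt(trace_J)   # bound obtained in case (b)

# Lemma 4.38 on a random instance: Tr(Phi J^{-1} Phi^T) >= Tr(Phi)^2 / Tr(J)
rng = np.random.default_rng(3)
n = 5
Phi = rng.standard_normal((n, n))
M = rng.standard_normal((n, n))
J = M @ M.T + 0.1 * np.eye(n)            # positive definite
lhs = np.trace(Phi @ np.linalg.solve(J, Phi.T))
rhs = np.trace(Phi) ** 2 / np.trace(J)
print(case_a, case_b, lhs >= rhs)
```

With this $\gamma$ both cases give the same value, which is exactly the bound (4.116); for $\mathcal{J} = \alpha^{-1}I_k$ it reduces to the expression in (4.117).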
4.22.C. Application to signal recovery. Proposition 4.37 allows us to build computation-based lower risk bounds in the signal recovery problem considered in Section 4.2, in particular, the problem where one wants to recover the linear image $Bx$ of an unknown signal $x$ known to belong to a given ellitope
$$\mathcal{X} = \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^TS_\ell x \le t_\ell,\ \ell \le L\}$$
(with our usual restriction on $S_\ell$ and $\mathcal{T}$) via observation
$$\omega = Ax + \sigma\xi,\quad \xi \sim \mathcal{N}(0, I_m),$$
and the risk of a candidate estimate, as in Section 4.2, is defined according to (4.113).$^{19}$ It is convenient to assume that the matrix $B$ (which in our general setup can be an arbitrary $\nu \times n$ matrix) is a nonsingular $n \times n$ matrix.$^{20}$ Under this assumption, setting
$$\mathcal{Y} = B\mathcal{X} = \{y \in \mathbf{R}^n : \exists t \in \mathcal{T} : y^T[B^{-1}]^TS_\ell B^{-1}y \le t_\ell,\ \ell \le L\}$$
and $\bar{A} = AB^{-1}$, we lose nothing when replacing the sensing matrix $A$ with $\bar{A}$ and treating as our signal $y \in \mathcal{Y}$ rather than $x \in \mathcal{X}$. Note that in our new situation $A$ is replaced with $\bar{A}$, $\mathcal{X}$ with $\mathcal{Y}$, and $B$ is the unit matrix $I_n$. For the sake of simplicity, we assume from now on that $A$ (and therefore $\bar{A}$) has trivial kernel. Finally, let $\tilde{S}_\ell \succeq S_\ell$ be positive definite matrices close to $S_\ell$, e.g., $\tilde{S}_\ell = S_\ell + 10^{-100}I_n$. Setting $\bar{S}_\ell = [B^{-1}]^T\tilde{S}_\ell B^{-1}$ and
$$\bar{\mathcal{Y}} = \{y \in \mathbf{R}^n : \exists t \in \mathcal{T} : y^T\bar{S}_\ell y \le t_\ell,\ \ell \le L\},$$
we get $\bar{S}_\ell \succ 0$ and $\bar{\mathcal{Y}} \subset \mathcal{Y}$. Therefore, any lower bound on the $\|\cdot\|_2$-risk of recovering $y \in \bar{\mathcal{Y}}$ via observation $\omega = AB^{-1}y + \sigma\xi$, $\xi \sim \mathcal{N}(0, I_m)$, automatically is a lower bound on the minimax risk $\mathrm{Risk}_{\mathrm{opt}}$ corresponding to our original problem of interest.

19 In fact, the approach to be developed can be applied to signal recovery problems involving Discrete/Poisson observation schemes and norms different from $\|\cdot\|_2$ used to measure the recovery error, signal-dependent noises, etc.

20 This assumption is nonrestrictive. Indeed, when $B \in \mathbf{R}^{\nu\times n}$ with $\nu < n$, we can add to $B$ $n - \nu$ zero rows, which keeps our estimation problem intact. When $\nu \ge n$, we can add to $B$ a small perturbation to ensure $\mathrm{Ker}\,B = \{0\}$, which, for small enough perturbation, again keeps our estimation problem basically intact. It remains to note that when $\mathrm{Ker}\,B = \{0\}$ we can replace $\mathbf{R}^\nu$ with the image space of $B$, which again does not affect the estimation problem we are interested in.

Now assume that we can point out a $k$-dimensional linear subspace $E$ in $\mathbf{R}^n$ and positive reals $r$, $\gamma$ such that

(i) the $\|\cdot\|_2$-ball $\Theta = \{\theta \in E : \|\theta\|_2 \le r\}$ is contained in $\bar{\mathcal{Y}}$;
(ii) the restriction $\bar{A}_E$ of $\bar{A}$ onto $E$ satisfies the relation $\mathrm{Tr}(\bar{A}_E^*\bar{A}_E) \le \gamma$ ($\bar{A}_E^* : \mathbf{R}^m \to E$ is the conjugate of the linear map $\bar{A}_E : E \to \mathbf{R}^m$).

Consider the auxiliary estimation problem obtained from the (reformulated) problem of interest by replacing the signal set $\bar{\mathcal{Y}}$ with $\Theta$. Since $\Theta \subset \bar{\mathcal{Y}}$, the minimax risk in the auxiliary problem is a lower bound on the minimax risk $\mathrm{Risk}_{\mathrm{opt}}$ we are interested in. On the other hand, the auxiliary problem is nothing but the problem of recovering the parameter $\theta \in \Theta$ from observation $\omega \sim \mathcal{N}(\bar{A}\theta, \sigma^2I)$, which is a special case of the problem considered in item 4.22.B. As is immediately seen, the Fisher Information matrix in this problem is independent of $\theta$ and is $\sigma^{-2}\bar{A}_E^*\bar{A}_E$:
$$e^T\mathcal{I}(\theta)e = \sigma^{-2}e^T\bar{A}_E^*\bar{A}_Ee,\quad e \in E.$$
Invoking Proposition 4.37, we arrive at the lower bound on the minimax risk in the auxiliary problem (and thus in the problem of interest as well):
$$\mathrm{Risk}_{\mathrm{opt}} \ge \frac{r\sigma k}{r\sqrt{\gamma} + \sigma k}. \qquad (4.118)$$
The resulting risk bound depends on $r$, $k$, $\gamma$ and is larger the smaller $\gamma$ is and the larger $k$ and $r$ are.

Lower-bounding $\mathrm{Risk}_{\mathrm{opt}}$. In order to make the bounding scheme just outlined give its best, we need a mechanism which allows us to generate $k$-dimensional "disks" $\Theta \subset \bar{\mathcal{Y}}$ along with associated quantities $r$, $\gamma$. In order to design such a mechanism, it is convenient to represent $k$-dimensional linear subspaces of $\mathbf{R}^n$ as the image spaces of orthogonal $n \times n$ projectors $P$ of rank $k$. Such a projector $P$ gives rise to the disk $\Theta_P$ of radius $r = r_P$ contained in $\bar{\mathcal{Y}}$, where $r_P$ is the largest $\rho$ such that the set $\{y \in \mathrm{Im}\,P : y^TPy \le \rho^2\}$ is contained in $\bar{\mathcal{Y}}$ ("condition $C(r)$"), and we can equip the disk with $\gamma$ satisfying (ii) if and only if
$$\mathrm{Tr}(P\bar{A}^T\bar{A}P) \le \gamma,$$
or, which is the same (recall that $P$ is an orthogonal projector),
$$\mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma \qquad (4.119)$$
("condition $D(\gamma)$"). Now, when $P$ is a nonzero orthogonal projector, the simplest sufficient condition for the validity of $C(r)$ is the existence of $t \in \mathcal{T}$ such that
$$\forall (y \in \mathbf{R}^n, \ell \le L):\ y^TP\bar{S}_\ell Py \le t_\ell r^{-2}y^TPy,$$
or, which is the same,
$$\exists s:\ r^2s \in \mathcal{T}\ \&\ P\bar{S}_\ell P \preceq s_\ell P,\ \ell \le L. \qquad (4.120)$$
Let us rewrite (4.119) and (4.120) as a system of linear matrix inequalities. This is what you are supposed to do:

2.1) Prove the following simple fact:

Observation 4.39 Let $Q$ be a positive definite matrix, $R$ be a nonzero positive semidefinite matrix, and let $s$ be a real. Then
$$RQR \preceq sR$$
if and only if
$$sQ^{-1} \succeq R.$$
2.2) Extract from the above observation the conclusion as follows. Let $\mathbf{T}$ be the conic hull of $\mathcal{T}$:
$$\mathbf{T} = \mathrm{cl}\{[s; \tau] : \tau > 0, s/\tau \in \mathcal{T}\} = \{[s; \tau] : \tau > 0, s/\tau \in \mathcal{T}\} \cup \{0\}.$$
Consider the system of constraints
$$[s; \tau] \in \mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma,\quad P \text{ is an orthogonal projector of rank } k \ge 1 \qquad (\#)$$
in variables $[s; \tau]$, $k$, $\gamma$, and $P$. Every feasible solution to this system gives rise to a $k$-dimensional Euclidean subspace $E \subset \mathbf{R}^n$ (the image space of $P$) such that the Euclidean ball $\Theta$ in $E$ centered at the origin of radius
$$r = 1/\sqrt{\tau},$$
taken along with $\gamma$, satisfies conditions (i)–(ii). Consequently, such a feasible solution yields the lower bound
$$\mathrm{Risk}_{\mathrm{opt}} \ge \psi_{\sigma,k}(\gamma, \tau) := \frac{\sigma k}{\sqrt{\gamma} + \sigma\sqrt{\tau}k}$$
on the minimax risk in the problem of interest.
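The equivalence claimed in Observation 4.39 is what turns the condition $P\bar{S}_\ell P \preceq s_\ell P$, which is not linear in $P$, into the linear matrix inequality $s_\ell\bar{S}_\ell^{-1} \succeq P$. A quick numerical illustration on a random instance is sketched below; the matrices, the helper `is_psd`, and the tolerance are assumptions made for this example. The threshold `s_star` is the largest eigenvalue of $RQ$, which has the same nonzero eigenvalues as $Q^{1/2}RQ^{1/2}$, so that $sQ^{-1} \succeq R$ exactly when $s \ge$ `s_star`.

```python
import numpy as np


def is_psd(X, tol=1e-9):
    """True when the symmetric part of X has no eigenvalue below -tol."""
    return bool(np.linalg.eigvalsh((X + X.T) / 2).min() >= -tol)


rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
Q = M @ M.T + np.eye(n)                 # positive definite
F = rng.standard_normal((n, 2))
R = F @ F.T                             # nonzero positive semidefinite, rank 2

# s Q^{-1} >= R  iff  s >= lambda_max(Q^{1/2} R Q^{1/2}) = lambda_max(R Q)
s_star = np.linalg.eigvals(R @ Q).real.max()

results = []
for s in (0.5 * s_star, 2.0 * s_star):
    first = is_psd(s * R - R @ Q @ R)            # R Q R <= s R
    second = is_psd(s * np.linalg.inv(Q) - R)    # s Q^{-1} >= R
    results.append((first, second))
print(results)
```

Both relations fail below the threshold and hold above it, as the observation predicts; the values of `s` are deliberately chosen away from `s_star` so that rounding errors do not blur the comparison.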
Solution: 2.1: Since $Q \succ 0$ and $0 \preceq R \ne 0$, relation $RQR \preceq sR$ implies that $s > 0$. When $s > 0$, this relation by the Schur Complement Lemma implies that $\begin{bmatrix}R & R\\ R & sQ^{-1}\end{bmatrix} \succeq 0$, whence $\begin{bmatrix}R + \epsilon I & R\\ R & sQ^{-1}\end{bmatrix} \succeq 0$ whenever $\epsilon > 0$, implying, by the same Schur Complement Lemma, that $sQ^{-1} \succeq R(R + \epsilon I)^{-1}R \to R$ as $\epsilon \to +0$, that is, the first of the relations in question implies the second. Vice versa, assuming $sQ^{-1} \succeq R$, we clearly have $s > 0$. Let us fix $s' > s$, so that $s'Q^{-1} \succeq R + \epsilon I$ for all small enough $\epsilon > 0$, whence by the Schur Complement Lemma $\begin{bmatrix}R + \epsilon I & R + \epsilon I\\ R + \epsilon I & s'Q^{-1}\end{bmatrix} \succeq 0$ for all small enough positive $\epsilon$, implying that $\begin{bmatrix}R & R\\ R & s'Q^{-1}\end{bmatrix} \succeq 0$, whence $RQR \preceq s'R$. Since this relation holds true for all $s' > s$, we get $RQR \preceq sR$. ✷

2.2: Let $[s; \tau], k, \gamma, P$ form a feasible solution to $(\#)$. Then clearly $s > 0$ and $\tau > 0$, so that $t := s/\tau$ is well-defined and $t \in \mathcal{T}$. Applying Observation 4.39 we conclude that
$$P\bar{S}_\ell P \preceq s_\ell P,\ \ell \le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma,$$
which implies, via (4.120), that the conditions $C(1/\sqrt{\tau})$ and $D(\gamma)$ take place. The claim now follows from the origin of these conditions. ✷

Ideally, to utilize item 2.2 to lower-bound $\mathrm{Risk}_{\mathrm{opt}}$, we should look through $k = 1, ..., n$ and maximize for every $k$ the lower risk bound $\psi_{\sigma,k}(\gamma, \tau)$ under constraints $(\#)$, thus arriving at the problem
$$\min_{[s;\tau],\gamma,P}\left\{\frac{\sigma}{\psi_{\sigma,k}(\gamma, \tau)} = \sqrt{\gamma}/k + \sigma\sqrt{\tau}:\ [s; \tau] \in \mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma,\ P \text{ is an orthogonal projector of rank } k\right\}. \qquad (P_k)$$
This problem seems to be computationally intractable, since the constraints of $(P_k)$ include the nonconvex restriction on $P$ to be a projector of rank $k$. A natural convex relaxation of this constraint is
$$0 \preceq P \preceq I_n,\quad \mathrm{Tr}(P) = k.$$
The (minor) remaining difficulty is that the objective in $(P_k)$ is nonconvex. Note, however, that to minimize $\sqrt{\gamma}/k + \sigma\sqrt{\tau}$ is basically the same as to minimize the convex function $\gamma/k^2 + \sigma^2\tau$, which is a tight "proxy" of the squared objective of $(P_k)$. We arrive at a convex "proxy" of $(P_k)$, the problem
$$\min_{[s;\tau],\gamma,P}\left\{\gamma/k^2 + \sigma^2\tau:\ [s; \tau] \in \mathbf{T},\ 0 \preceq P \preceq I_n,\ \mathrm{Tr}(P) = k,\ s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L,\ \mathrm{Tr}(\bar{A}P\bar{A}^T) \le \gamma\right\}, \qquad (P[k])$$
$k = 1, ..., n$. Problem $(P[k])$ clearly is solvable, and the $P$-component $P^{(k)}$ of its optimal solution gives rise to a collection of orthogonal projectors $P^{(k)}_\kappa$, $\kappa = 1, ..., n$, obtained from $P^{(k)}$ by "rounding": to get $P^{(k)}_\kappa$, we replace the $\kappa$ leading eigenvalues of $P^{(k)}$ with ones, and the remaining eigenvalues with zeros, while keeping the eigenvectors intact. We can now for every $\kappa = 1, ..., n$ fix the $P$-variable in $(P_\kappa)$ as $P^{(k)}_\kappa$ and solve the resulting problem in the remaining variables $[s; \tau]$ and $\gamma$, which is easy: with $P$ fixed, the problem clearly reduces to minimizing $\tau$ under the convex constraints
$$s_\ell\bar{S}_\ell^{-1} \succeq P,\ \ell \le L,\quad [s; \tau] \in \mathbf{T}$$
on $[s; \tau]$. As a result, for every $k \in \{1, ..., n\}$, we get $n$ lower bounds on $\mathrm{Risk}_{\mathrm{opt}}$, that is, a total of $n^2$ lower risk bounds, of which we select the best, that is, the largest.

Now comes the next part of the exercise:

3) Implement the outlined program numerically and compare the lower bound on the minimax risk with the upper risk bounds of presumably good linear estimates yielded by Proposition 4.4. Recommended setup:
• Sizes: $m = n = \nu = 16$.
• $A$, $B$: $B = I_n$, $A = \mathrm{Diag}\{a_1, ..., a_n\}$ with $a_i = i^{-\alpha}$ and $\alpha$ running through $\{0, 1, 2\}$.
• $\mathcal{X} = \{x \in \mathbf{R}^n : x^TS_\ell x \le 1, \ell \le L\}$ (i.e., $\mathcal{T} = [0, 1]^L$) with randomly generated $S_\ell$.
• Range of $L$: $\{1, 4, 16\}$. For $L$ in this range, you can generate $S_\ell$, $\ell \le L$, as $S_\ell = R_\ell R_\ell^T$ with $R_\ell = \mathrm{randn}(n, p)$, where $p = ⌋n/L⌊$.
• Range of $\sigma$: $\{1.0, 0.1, 0.01, 0.001, 0.0001\}$.
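The rounding step and the evaluation of the bound for a fixed projector need only linear algebra, and can be sketched as follows. This is a partial illustration under stated assumptions, not the full program: solving $(P[k])$ requires a semidefinite programming solver and is replaced here by a hypothetical stand-in matrix `P_relaxed`. The code assumes the recommended setup ($B = I_n$, $\mathcal{T} = [0,1]^L$, so that $[s;\tau] \in \mathbf{T}$ means $0 \le s_\ell \le \tau$), uses $S_\ell$ in place of the slightly perturbed $\bar{S}_\ell$, and takes $p$ as $n/L$ rounded up, which coincides with $n/L$ for the recommended values. All function names are illustrative.

```python
import numpy as np


def make_setup(n, L, alpha, rng):
    """A = Diag{i^{-alpha}}, and S_l = R_l R_l^T with Gaussian n x p factors R_l."""
    p = int(np.ceil(n / L))
    A = np.diag(np.arange(1, n + 1, dtype=float) ** (-alpha))
    S = []
    for _ in range(L):
        R = rng.standard_normal((n, p))
        S.append(R @ R.T)
    return A, S


def round_to_projector(P, kappa):
    """Projector onto the span of the kappa leading eigenvectors of symmetric P."""
    _, V = np.linalg.eigh((P + P.T) / 2)      # eigenvalues in ascending order
    lead = V[:, -kappa:]
    return lead @ lead.T


def lower_bound_for_projector(P, A, S, sigma):
    """psi_{sigma,k}(gamma, tau) for a fixed orthogonal projector P.

    For T = [0,1]^L the smallest feasible tau is max_l s_l, where s_l is the
    smallest s with P S_l P <= s P, i.e. the top eigenvalue of P S_l P.
    """
    k = int(round(np.trace(P)))
    gamma = np.trace(A @ P @ A.T)
    tau = max(np.linalg.eigvalsh((P @ S_l @ P + (P @ S_l @ P).T) / 2)[-1]
              for S_l in S)
    return sigma * k / (np.sqrt(gamma) + sigma * np.sqrt(tau) * k)


rng = np.random.default_rng(0)
n, L, alpha, sigma = 16, 4, 1, 0.1
A, S = make_setup(n, L, alpha, rng)

# Hypothetical stand-in for the P-component of an optimal solution to (P[k])
P_relaxed = np.diag(np.linspace(1.0, 0.0, n))

bounds = {kappa: lower_bound_for_projector(round_to_projector(P_relaxed, kappa),
                                           A, S, sigma)
          for kappa in range(1, n + 1)}
best_kappa = max(bounds, key=bounds.get)
print(best_kappa, bounds[best_kappa])
```

By item 2.2, every orthogonal projector gives a feasible solution to $(\#)$ and hence a valid, though possibly weak, lower bound; the role of $(P[k])$ is only to propose projectors for which the bound is large.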
Solution: The results of typical numerical experiments implemented according to the above setup are presented in Table 6.3. We see that the presumably good linear estimates indeed are good: (theoretical upper bounds on) their risks are within a factor of about 2 of the lower bounds on the minimax optimal risk (the largest ratio in Table 6.3 is 2.1).

 α   L   σ        Risk     R    |  α   L   σ        Risk     R    |  α   L    σ        Risk     R
 0   1   1.0000   1.5156   1.5  |  0   4   1.0000   1.6949   1.6  |  0   16   1.0000   2.1007   1.7
 0   1   0.1000   0.2860   1.7  |  0   4   0.1000   0.3451   1.7  |  0   16   0.1000   0.3796   1.8
 0   1   0.0100   0.0397   1.3  |  0   4   0.0100   0.0399   1.2  |  0   16   0.0100   0.0400   1.2
 0   1   0.0010   0.0040   1.0  |  0   4   0.0010   0.0040   1.0  |  0   16   0.0010   0.0040   1.0
 0   1   0.0001   0.0004   1.0  |  0   4   0.0001   0.0004   1.0  |  0   16   0.0001   0.0004   1.0
 1   1   1.0000   5.2537   1.4  |  1   4   1.0000   4.8681   1.3  |  1   16   1.0000   5.3097   2.0
 1   1   0.1000   1.2918   1.5  |  1   4   0.1000   1.1947   1.9  |  1   16   0.1000   1.7878   1.9
 1   1   0.0100   0.2724   1.7  |  1   4   0.0100   0.3118   2.0  |  1   16   0.0100   0.3667   1.9
 1   1   0.0010   0.0384   1.4  |  1   4   0.0010   0.0386   1.3  |  1   16   0.0010   0.0387   1.3
 1   1   0.0001   0.0039   1.2  |  1   4   0.0001   0.0039   1.2  |  1   16   0.0001   0.0039   1.2
 2   1   1.0000   4.5613   1.3  |  2   4   1.0000   3.3401   1.1  |  2   16   1.0000   9.5095   1.5
 2   1   0.1000   3.0733   1.4  |  2   4   0.1000   2.2972   1.5  |  2   16   0.1000   4.1331   1.7
 2   1   0.0100   0.7586   1.7  |  2   4   0.0100   1.1169   1.5  |  2   16   0.0100   1.5697   2.1
 2   1   0.0010   0.2730   1.7  |  2   4   0.0010   0.3841   2.0  |  2   16   0.0010   0.4406   2.0
 2   1   0.0001   0.0486   1.4  |  2   4   0.0001   0.0492   1.4  |  2   16   0.0001   0.0493   1.3

Table 6.3: Sample numerical results for Exercise 4.22.3. Risk: upper bound $\sqrt{\mathrm{Opt}}$ on the risk of the linear estimate; see Proposition 4.4; R: ratios of Risk to the lower risk bounds.

Exercise 4.23 [follow-up to Exercise 4.22]

1) Prove the following version of Proposition 4.37:

Proposition 4.40 In the situation of item 4.22.B and under Assumptions 1)–3) from this item, let
• $\|\cdot\|$ be a norm on $\mathbf{R}^k$ such that $\|\theta\|_2 \le \kappa\|\theta\|$ for all $\theta \in \mathbf{R}^k$,
• $\Theta \subset \mathbf{R}^k$ be a $\|\cdot\|$-ball of radius $r > 0$,
• the family $\mathcal{P}$ be such that $\mathcal{I}(\theta) \preceq \mathcal{J}$ for some $\mathcal{J} \succ 0$ and all $\theta \in \Theta$.

Then the minimax optimal risk
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} = \inf_{\hat{\theta}(\cdot)}\left(\sup_{\theta\in\Theta}\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\theta - \hat{\theta}(\omega)\|^2\right\}\right)^{1/2}$$
of recovering the parameter $\theta \in \Theta$ from observation $\omega \sim p(\cdot, \theta)$ in the norm $\|\cdot\|$ satisfies the bound
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} \ge \frac{rk}{r\kappa\sqrt{\mathrm{Tr}(\mathcal{J})} + k}. \qquad (4.121)$$
In particular, when $\mathcal{J} = \alpha^{-1}I_k$, we get
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} \ge \frac{r\sqrt{\alpha k}}{r\kappa + \sqrt{\alpha k}}. \qquad (4.122)$$
Solution: Let us reuse the argument from the proof of Proposition 4.37. By a straightforward approximation argument, we can assume that $\|\cdot\|$ is continuously differentiable outside of the origin. Assuming w.l.o.g. that $\Theta$ is centered at the origin, let $S$ be the boundary of $\Theta$, $dS$ be the element of $(k-1)$-dimensional surface area of $S$, and $n(\theta)$ be the $\|\cdot\|_2$-unit outer normal to $S$ at a point $\theta\in S$. Observe that $n(\theta)$ is obtained from the gradient $g(\theta)$ of $\|\cdot\|$, taken at point $\theta\in S$, by normalization $n(\theta)=\|g(\theta)\|_2^{-1}g(\theta)$, whence
$$\theta\in S\ \Rightarrow\ n^T(\theta)\theta=\|g(\theta)\|_2^{-1}g^T(\theta)\theta=\|g(\theta)\|_2^{-1}r.$$
Let $\widehat{\theta}$ be a bounded candidate estimate, $\phi(\theta)=\int_\Omega\widehat{\theta}(\omega)p(\omega,\theta)\Pi(d\omega)$, and $\psi(\theta)=\theta$. Denoting by $R$ the $\|\cdot\|$-risk of estimate $\widehat{\theta}$, observe that by Jensen's inequality for $\theta\in\Theta$ one has $\|\phi(\theta)-\theta\|^2\le R^2$. Now let $\gamma\in(0,1)$. It may happen ("case (a)") that $R\ge\gamma r$; otherwise ("case (b)") we should have $\|\phi(\theta)-\theta\|\le\gamma r$ for $\theta\in S$. Assume that (b) is the case. From the origin of $g(\cdot)$, denoting by $\|\cdot\|_*$ the norm conjugate to $\|\cdot\|$, we have $\|g(\theta)\|_*=1$ whenever $\theta\ne0$, implying that
$$g^T(\theta)\phi(\theta)\ \ge\ g^T(\theta)\theta-\|\theta-\phi(\theta)\|=r-\|\theta-\phi(\theta)\|\ \ge\ (1-\gamma)r.$$
It follows that the flux $\int_S n^T(\theta)\phi(\theta)dS(\theta)$ satisfies
$$\begin{array}{rcl}\int_S n^T(\theta)\phi(\theta)dS(\theta)&=&\int_S\|g(\theta)\|_2^{-1}g^T(\theta)\phi(\theta)dS(\theta)\ \ge\ (1-\gamma)\int_S\|g(\theta)\|_2^{-1}r\,dS(\theta)\\&=&(1-\gamma)\int_S\|g(\theta)\|_2^{-1}g^T(\theta)\psi(\theta)dS(\theta)=(1-\gamma)\int_S n^T(\theta)\psi(\theta)dS(\theta).\end{array}$$
SOLUTIONS TO SELECTED EXERCISES
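The resulting inequality between the fluxes of $\phi$ and $\psi$ combines with the Divergence Theorem to imply that
$$\int_\Theta\Big[\sum_i\frac{\partial\phi_i(\theta)}{\partial\theta_i}\Big]d\theta\ \ge\ (1-\gamma)\int_\Theta\Big[\sum_i\frac{\partial\psi_i(\theta)}{\partial\theta_i}\Big]d\theta=(1-\gamma)k\int_\Theta d\theta,$$
so that
$$\sum_i\frac{\partial\phi_i(\theta_*)}{\partial\theta_i}\ \ge\ (1-\gamma)k \tag{6.33}$$
for some $\theta_*\in\Theta$. On the other hand, setting
$$\mathrm{Cov}_{\widehat{\theta}}(\theta)=\int_\Omega[\widehat{\theta}(\omega)-\phi(\theta)][\widehat{\theta}(\omega)-\phi(\theta)]^Tp(\omega,\theta)\Pi(d\omega)$$
and observing that
$$\int_\Omega\|\widehat{\theta}(\omega)-\theta\|_2^2\,p(\omega,\theta)\Pi(d\omega)\ \ge\ \int_\Omega\|\widehat{\theta}(\omega)-\phi(\theta)\|_2^2\,p(\omega,\theta)\Pi(d\omega)=\mathrm{Tr}(\mathrm{Cov}_{\widehat{\theta}}(\theta)),$$
we have
$$\begin{array}{rcll}R^2&\ge&\int_\Omega\|\widehat{\theta}(\omega)-\theta_*\|^2p(\omega,\theta_*)\Pi(d\omega)&\\&\ge&\kappa^{-2}\int_\Omega\|\widehat{\theta}(\omega)-\theta_*\|_2^2\,p(\omega,\theta_*)\Pi(d\omega)&[\hbox{since }\|\cdot\|\ge\kappa^{-1}\|\cdot\|_2]\\&\ge&\kappa^{-2}\mathrm{Tr}(\mathrm{Cov}_{\widehat{\theta}}(\theta_*))&\\&\ge&\kappa^{-2}\mathrm{Tr}(\phi'(\theta_*)\mathcal{J}^{-1}[\phi'(\theta_*)]^T)&[\hbox{by (4.114)}]\\&\ge&\kappa^{-2}(1-\gamma)^2k^2/\mathrm{Tr}(\mathcal{J}),&\end{array}$$
where the concluding inequality is yielded by Lemma 4.38 combined with (6.33). Now, in case (a) we have $R\ge\gamma r$. We see that in all cases we have
$$R\ \ge\ \min\Big[\gamma r,\ (1-\gamma)k\big/\big(\kappa\sqrt{\mathrm{Tr}(\mathcal{J})}\big)\Big].$$
Setting $\gamma=\frac{k}{r\kappa\sqrt{\mathrm{Tr}(\mathcal{J})}+k}$, we arrive at $R\ge\frac{rk}{r\kappa\sqrt{\mathrm{Tr}(\mathcal{J})}+k}$, and (4.121) follows. ✷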
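The last step of the proof can be checked numerically. The sketch below (function names and test values are illustrative assumptions, not part of the text) evaluates the right-hand side of (4.121), verifies that the chosen $\gamma$ equalizes the two quantities under the min, and confirms that (4.122) is the special case $\mathcal{J}=\alpha^{-1}I_k$.

```python
import math

def cr_bound(r, k, kappa, trace_J):
    """Right-hand side of (4.121): r*k / (r*kappa*sqrt(Tr(J)) + k)."""
    return r * k / (r * kappa * math.sqrt(trace_J) + k)

def balancing_gamma(r, k, kappa, trace_J):
    """The gamma used at the end of the proof: k / (r*kappa*sqrt(Tr(J)) + k)."""
    return k / (r * kappa * math.sqrt(trace_J) + k)

def two_terms(gamma, r, k, kappa, trace_J):
    """The two quantities under the min: gamma*r and (1-gamma)*k/(kappa*sqrt(Tr(J)))."""
    return gamma * r, (1.0 - gamma) * k / (kappa * math.sqrt(trace_J))

# Hypothetical values, chosen only for illustration.
r, k, kappa, alpha = 2.0, 5, 1.5, 0.3
trace_J = k / alpha                          # J = I_k / alpha, so Tr(J) = k / alpha
gamma = balancing_gamma(r, k, kappa, trace_J)
term_a, term_b = two_terms(gamma, r, k, kappa, trace_J)
bound = cr_bound(r, k, kappa, trace_J)
special = r * math.sqrt(alpha * k) / (r * kappa + math.sqrt(alpha * k))   # (4.122)
print(term_a, term_b, bound, special)
```

All four printed numbers should agree up to rounding, which is exactly the statement that the min is maximized where its two arguments coincide.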
2) Apply Proposition 4.40 to get lower bounds on the minimax $\|\cdot\|$-risk in the following estimation problems:
2.1) Given indirect observation $\omega=A\theta+\sigma\xi$, $\xi\sim\mathcal{N}(0,I_m)$, of an unknown vector $\theta$ known to belong to $\Theta=\{\theta\in\mathbf{R}^k:\|\theta\|_p\le r\}$ with given $A$, $\mathrm{Ker}\,A=\{0\}$, $p\in[2,\infty]$, $r>0$, we want to recover $\theta$ in $\|\cdot\|_p$.
2.2) Given indirect observation $\omega=L\theta R+\sigma\xi$, where $\theta$ is an unknown $\mu\times\nu$ matrix known to belong to the Schatten norm ball $\Theta=\{\theta\in\mathbf{R}^{\mu\times\nu}:\|\theta\|_{\mathrm{Sh},p}\le r\}$, we want to recover $\theta$ in $\|\cdot\|_{\mathrm{Sh},p}$. Here $L\in\mathbf{R}^{m\times\mu}$, $\mathrm{Ker}\,L=\{0\}$, and $R\in\mathbf{R}^{\nu\times n}$, $\mathrm{Ker}\,R^T=\{0\}$, are given matrices, $p\in[2,\infty]$, and $\xi$ is a random Gaussian $m\times n$ matrix (i.e., the entries of $\xi$ are $\mathcal{N}(0,1)$ random variables independent of each other).
2.3) Given a $K$-repeated observation $\omega^K=(\omega_1,...,\omega_K)$ with i.i.d. components $\omega_t\sim\mathcal{N}(0,\theta)$, $1\le t\le K$, with an unknown $\theta\in\mathbf{S}^n$ known to belong to the matrix box $\Theta=\{\theta:\beta_-I_n\preceq\theta\preceq\beta_+I_n\}$ with given $0<\beta_-<\beta_+<\infty$, we want to recover $\theta$ in the spectral norm.
Solution: 2.1: We are in the case of $p(\omega,\theta)=\mathcal{N}(A\theta,\sigma^2I)$, resulting in $\nabla_\theta\ln p(\omega,\theta)=\sigma^{-2}A^T(\omega-A\theta)$ and $I(\theta)=\sigma^{-2}A^TA$. Consequently, the premise of Proposition 4.40 holds true with the radius $r$ of $\Theta$, $\kappa=k^{\frac12-\frac1p}$ and $\mathcal{J}=\sigma^{-2}A^TA$, resulting in the bound
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_p}\ \ge\ \frac{r\sigma k^{\frac12+\frac1p}}{r\|A\|_F+\sigma k^{\frac12+\frac1p}}.$$
2.2: This is, basically, the same situation as in 2.1: rearranging the entries of $\theta$ to form a $(k=\mu\nu)$-dimensional vector, and the entries in $\omega$ to form an $mn$-dimensional vector, and denoting by $A$ the matrix of the linear transformation $\theta\mapsto L\theta R$ (which now is a linear mapping of $\mathbf{R}^{\mu\nu}$ into $\mathbf{R}^{mn}$), we get $\mathrm{Ker}\,A=\{0\}$ and $\|A\|_F=\|L\|_F\|R\|_F$ (why?). As for the value of $\kappa$, note that $\|\cdot\|_2$ on the argument space (a.k.a. the Frobenius norm on $\mathbf{R}^{\mu\times\nu}$) is the Euclidean norm of the vector of singular values of a $\mu\times\nu$ matrix, while the Schatten $p$-norm is the $\ell_p$ norm of the same vector. Since the number of nonzero singular values in a $\mu\times\nu$ matrix is at most $\min[\mu,\nu]$, and every nonnegative vector of the latter dimension is the vector of singular values of an appropriate $\mu\times\nu$ matrix, we get $\kappa=\min^{\frac12-\frac1p}[\mu,\nu]$. The resulting risk bound is therefore
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},p}}\ \ge\ \frac{r\sigma\mu\nu\min^{\frac1p-\frac12}[\mu,\nu]}{r\|L\|_F\|R\|_F+\sigma\mu\nu\min^{\frac1p-\frac12}[\mu,\nu]}.$$
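2.3: We are in the case where $\ln p(\omega^K,\theta)$, within an additive term independent of $\theta$, is $-\sum_{t=1}^K\frac12[\omega_t^T\theta^{-1}\omega_t+\ln\mathrm{Det}\,\theta]$, resulting in
$$\langle\nabla_\theta\ln p(\omega^K,\theta),d\theta\rangle=\frac12\sum_{t=1}^K\left[\omega_t^T\theta^{-1}d\theta\,\theta^{-1}\omega_t-\mathrm{Tr}(\theta^{-1}d\theta)\right],$$
$\langle\cdot,\cdot\rangle$ being the Frobenius inner product on $\mathbf{S}^n$. Consequently, the quadratic form associated with the Fisher Information matrix is
$$\begin{array}{rcl}\langle d\theta,I(\theta)d\theta\rangle&=&\frac K4\mathbf{E}_{\omega\sim\mathcal{N}(0,\theta)}\left\{[\omega^T\theta^{-1}d\theta\,\theta^{-1}\omega-\mathrm{Tr}(\theta^{-1}d\theta)]^2\right\}\\&=&\frac K4\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\Big\{\big[[\theta^{1/2}\eta]^T\theta^{-1}d\theta\,\theta^{-1}[\theta^{1/2}\eta]-\mathrm{Tr}(\underbrace{\theta^{-1/2}d\theta\,\theta^{-1/2}}_{h})\big]^2\Big\}\\&=&\frac K4\mathbf{E}_{\eta\sim\mathcal{N}(0,I_n)}\left\{[\eta^Th\eta-\mathrm{Tr}(h)]^2\right\}.\end{array}$$
Denoting by $\lambda_i$ the eigenvalues of $h$, the latter quantity in the chain is nothing but
$$\begin{array}{l}\frac K4\mathbf{E}_{\xi\sim\mathcal{N}(0,I_n)}\Big\{\big[\sum_i\xi_i^2\lambda_i-\sum_i\lambda_i\big]^2\Big\}\\\quad=\frac K4\mathbf{E}_{\xi\sim\mathcal{N}(0,I_n)}\Big\{\sum_i\lambda_i^2\xi_i^4+\sum_{i\ne j}\lambda_i\lambda_j\xi_i^2\xi_j^2-2\big[\sum_j\lambda_j\big]\big[\sum_i\lambda_i\xi_i^2\big]+\big[\sum_i\lambda_i\big]^2\Big\}\\\quad=\frac K4\Big[3\sum_i\lambda_i^2+\sum_{i\ne j}\lambda_i\lambda_j-\big[\sum_i\lambda_i\big]^2\Big]=\frac K2\sum_i\lambda_i^2,\end{array}$$
that is,
$$\langle d\theta,I(\theta)d\theta\rangle=\frac K2\sum_i\lambda_i^2=\frac K2\mathrm{Tr}(\theta^{-1/2}d\theta\,\theta^{-1}d\theta\,\theta^{-1/2})=\frac K2\mathrm{Tr}(\theta^{-1}d\theta\,\theta^{-1}d\theta).$$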
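The identity $\mathbf{E}\{[\eta^Th\eta-\mathrm{Tr}(h)]^2\}=2\sum_i\lambda_i^2$ behind this computation can be checked by simulation. The following is a minimal Monte Carlo sketch; the eigenvalues, sample size and seed are arbitrary assumptions, and $h$ is taken diagonal, which loses nothing by rotation invariance of $\mathcal{N}(0,I_n)$.

```python
import random

def mc_quadratic_form(lams, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[(sum_i lam_i*xi_i^2 - sum_i lam_i)^2], xi ~ N(0, I)."""
    rng = random.Random(seed)
    trace = sum(lams)
    acc = 0.0
    for _ in range(n_samples):
        q = sum(lam * rng.gauss(0.0, 1.0) ** 2 for lam in lams)
        acc += (q - trace) ** 2
    return acc / n_samples

lams = [1.0, -0.5, 2.0]                        # hypothetical eigenvalues of h
estimate = mc_quadratic_form(lams)
exact = 2.0 * sum(lam * lam for lam in lams)   # 2 * sum_i lam_i^2
print(estimate, exact)
```

Multiplying either number by $K/4$ gives the value of the quadratic form $\langle d\theta,I(\theta)d\theta\rangle$ for a $K$-repeated observation.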
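Now, the function $\mathrm{Tr}(\theta_1^{-1}d\theta\,\theta_2^{-1}d\theta)$ of $\theta_1\succ0$, $\theta_2\succ0$ is $\succeq$-nonincreasing in $\theta_1$ and $\theta_2$, meaning that replacing $\theta_1$ with $\theta_1'\succeq\theta_1$, the value of the function does not increase, and similarly for $\theta_2$. Indeed, to see that the first claim holds true, it suffices to rewrite the value of the function as $\mathrm{Tr}([\theta_2^{-1/2}d\theta]\theta_1^{-1}[\theta_2^{-1/2}d\theta]^T)$; to observe that the second claim is true, it suffices to rewrite the value as $\mathrm{Tr}([\theta_1^{-1/2}d\theta]\theta_2^{-1}[\theta_1^{-1/2}d\theta]^T)$. It follows that when $\theta\in\Theta$, we have
$$\langle d\theta,I(\theta)d\theta\rangle\ \le\ \frac K2\beta_-^{-2}\mathrm{Tr}(d\theta^2)=\frac K2\beta_-^{-2}\|d\theta\|_F^2,$$
where $\|\cdot\|_F$ is the Frobenius norm. Besides this, setting $\|\cdot\|=\|\cdot\|_{\mathrm{Sh},\infty}$, $\Theta$ is the $\|\cdot\|$-ball of radius $r=\frac{\beta_+-\beta_-}{2}$ centered at $\frac{\beta_-+\beta_+}{2}I_n$, and, as in the previous item, $\|\cdot\|\le\|\cdot\|_F\le\kappa\|\cdot\|$, $\kappa=\sqrt n$. We conclude that with the $\kappa$, $r$ just defined, and with $\alpha=2\beta_-^2K^{-1}$, $k=\dim\mathbf{S}^n=\frac{n(n+1)}{2}$, the situation is under the premise of Proposition 4.40, so that by this proposition
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},\infty}}\ \ge\ \frac{\beta_-[\beta_+/\beta_--1]\sqrt{n+1}}{\sqrt K[\beta_+/\beta_--1]+2\sqrt{n+1}}.$$
Note. When $\beta_+/\beta_-$ is large enough, the lower risk bound can be improved. Indeed, it is clear that for $\beta_+$ fixed, the true minimax risk is a decreasing function of $\beta_-$ (since $\Theta$ shrinks as $\beta_-$ grows), which is not the case with our bound. The inverse of this bound is
$$\rho(\beta_-,\beta_+)=\frac{1}{\sqrt{n+1}}\left[\frac{2\sqrt{n+1}}{\beta_+-\beta_-}+\frac{\sqrt K}{\beta_-}\right],$$
which is not an increasing function of $\beta_-$ in the entire range $0<\beta_-\le\beta_+$. Clearly, a better lower risk bound is $\frac{1}{\rho(\beta_*,\beta_+)}$ with $\beta_*$ obtained by minimization of $\rho(\beta,\beta_+)$ in $\beta\in[\beta_-,\beta_+]$. The resulting bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},\infty}}\ \ge\ \left\{\begin{array}{ll}\dfrac{\beta_+\sqrt{n+1}}{\big(\sqrt[4]{K}+\sqrt2\sqrt[4]{n+1}\big)^2},&\beta_-\le\dfrac{\sqrt[4]{K}}{\sqrt[4]{K}+\sqrt2\sqrt[4]{n+1}}\beta_+,\\[3mm]\dfrac{\beta_-[\beta_+/\beta_--1]\sqrt{n+1}}{\sqrt K[\beta_+/\beta_--1]+2\sqrt{n+1}},&\hbox{otherwise.}\end{array}\right.$$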
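The improved bound in the Note can be cross-checked against a brute-force minimization of $\rho(\beta,\beta_+)$ over $\beta\in[\beta_-,\beta_+)$. The sketch below does this under assumed values of $n$, $K$, $\beta_\pm$; the function names are hypothetical, and the grid search is only a sanity check, not a recommended way of computing the bound.

```python
import math

def rho(beta, beta_plus, n, K):
    """Inverse of the lower risk bound as a function of the lower edge beta of the box."""
    return 2.0 / (beta_plus - beta) + math.sqrt(K) / (beta * math.sqrt(n + 1))

def improved_bound(beta_minus, beta_plus, n, K):
    """Closed-form improved bound (both branches of the case distinction)."""
    k4, n4 = K ** 0.25, (n + 1) ** 0.25
    beta_star = k4 * beta_plus / (k4 + math.sqrt(2.0) * n4)   # unconstrained minimizer of rho
    if beta_minus <= beta_star:
        return beta_plus * math.sqrt(n + 1) / (k4 + math.sqrt(2.0) * n4) ** 2
    return 1.0 / rho(beta_minus, beta_plus, n, K)

def grid_bound(beta_minus, beta_plus, n, K, steps=200_000):
    """Brute force: 1 / min of rho over a uniform grid in [beta_minus, beta_plus)."""
    width = beta_plus - beta_minus
    best = min(rho(beta_minus + width * i / steps, beta_plus, n, K) for i in range(steps))
    return 1.0 / best

# Hypothetical problem sizes.
n, K, beta_plus = 4, 100, 1.0
wide = (improved_bound(0.01, beta_plus, n, K), grid_bound(0.01, beta_plus, n, K))
narrow = (improved_bound(0.8, beta_plus, n, K), grid_bound(0.8, beta_plus, n, K))
print(wide, narrow)
```

In the "wide box" case the minimizer of $\rho$ is interior and the first branch applies; in the "narrow box" case $\rho$ is increasing on $[\beta_-,\beta_+)$ and the original bound is recovered.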
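Exercise 4.24 [More on Cramer-Rao risk bound] Let us fix $\mu\in(1,\infty)$ and a norm $\|\cdot\|$ on $\mathbf{R}^k$, and let $\|\cdot\|_*$ be the norm conjugate to $\|\cdot\|$, and $\mu_*=\frac{\mu}{\mu-1}$. Assume that we are in the situation of item 4.22.B and under assumptions 1) and 3) from this item; as for assumption 2) we now replace it with the assumption that the quantity
$$I_{\|\cdot\|_*,\mu_*}(\theta):=\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}\right\}\right]^{1/\mu_*}$$
is well-defined and bounded on $\Theta$; in the sequel, we set
$$I_{\|\cdot\|_*,\mu_*}=\sup_{\theta\in\Theta}I_{\|\cdot\|_*,\mu_*}(\theta).$$
1) Prove the following variant of the Cramer-Rao risk bound:
Proposition 4.41 In the situation described in the beginning of item 4.22.D, let $\Theta\subset\mathbf{R}^k$ be a $\|\cdot\|$-ball of radius $r$. Then the minimax $\|\cdot\|$-risk of recovering $\theta\in\Theta$ via observation $\omega\sim p(\cdot,\theta)$ can be lower-bounded as
$$\begin{array}{c}\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta]:=\inf\limits_{\widehat{\theta}(\cdot)}\sup\limits_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\widehat{\theta}(\omega)-\theta\|^\mu\right\}\right]^{1/\mu}\ \ge\ \dfrac{rk}{rI_{\|\cdot\|_*,\mu_*}+k},\\I_{\|\cdot\|_*,\mu_*}=\max\limits_{\theta\in\Theta}\left[I_{\|\cdot\|_*,\mu_*}(\theta):=\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}\right\}\right]^{1/\mu_*}\right].\end{array} \tag{4.123}$$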
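Solution: Given a bounded estimate $\widehat{\theta}(\cdot)$, let
$$R=\sup_{\theta\in\Theta}\left[\int\|\widehat{\theta}(\omega)-\theta\|^\mu p(\omega,\theta)\Pi(d\omega)\right]^{1/\mu},\qquad\phi(\theta)=\int\widehat{\theta}(\omega)p(\omega,\theta)\Pi(d\omega).$$
From now on, we refer to $R$ as the $\|\cdot\|$-risk of the estimate; note that when $\|\cdot\|=\|\cdot\|_2$ and $\mu=2$, this risk becomes what was called estimation risk in item 4.22.B of Exercise 4.22. By Jensen's inequality we have $\|\phi(\theta)-\theta\|\le R$. Moreover,
$$\begin{array}{rcl}\mathrm{Tr}(\phi'(\theta))&=&\int[\nabla_\theta p(\omega,\theta)]^T\widehat{\theta}(\omega)\Pi(d\omega)=\int[\nabla_\theta p(\omega,\theta)]^T[\widehat{\theta}(\omega)-\theta]\Pi(d\omega)\\&\le&\int\|\widehat{\theta}(\omega)-\theta\|\,\|\nabla_\theta\ln(p(\omega,\theta))\|_*\,p(\omega,\theta)\Pi(d\omega)\\&\le&\left[\int\|\widehat{\theta}(\omega)-\theta\|^\mu p(\omega,\theta)\Pi(d\omega)\right]^{1/\mu}\left[\int\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}p(\omega,\theta)\Pi(d\omega)\right]^{1/\mu_*}\\&\le&RI_{\|\cdot\|_*,\mu_*}(\theta).\end{array}$$
On the other hand, from the proof of Proposition 4.40 (see (6.33) in Solution to Exercise 4.23) it follows that if $\Theta\subset\mathbf{R}^k$ is a $\|\cdot\|$-ball of radius $r$ and for some $\gamma\in(0,1)$ one has $R\le\gamma r$, then there exists $\theta_*\in\Theta$ such that $\mathrm{Tr}(\phi'(\theta_*))\ge(1-\gamma)k$, resulting in $(1-\gamma)k\le RI_{\|\cdot\|_*,\mu_*}(\theta_*)$. Thus,
$$R\ \ge\ \min\left[\gamma r,\frac{(1-\gamma)k}{I_{\|\cdot\|_*,\mu_*}(\theta_*)}\right]\ \ge\ \min\left[\gamma r,\frac{(1-\gamma)k}{I_{\|\cdot\|_*,\mu_*}}\right].$$
We conclude that $R\ge\sup_{\gamma\in(0,1)}\min\left[\gamma r,\frac{(1-\gamma)k}{I_{\|\cdot\|_*,\mu_*}}\right]$, that is,
$$R\ \ge\ \frac{rk}{rI_{\|\cdot\|_*,\mu_*}+k},$$
as claimed. ✷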
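Example I: Gaussian case, estimating shift. Let $\mu=2$, and let $p(\omega,\theta)=\mathcal{N}(A\theta,\sigma^2I_m)$ with $A\in\mathbf{R}^{m\times k}$. Then
$$\begin{array}{l}\nabla_\theta\ln(p(\omega,\theta))=\sigma^{-2}A^T(\omega-A\theta)\\\Rightarrow\ \int\|\nabla_\theta\ln(p(\omega,\theta))\|_*^2\,p(\omega,\theta)d\omega=\sigma^{-4}\int\|A^T(\omega-A\theta)\|_*^2\,p(\omega,\theta)d\omega\\\qquad=\sigma^{-4}\frac{1}{[\sqrt{2\pi}\sigma]^m}\int\|A^T\omega\|_*^2\exp\{-\frac{\omega^T\omega}{2\sigma^2}\}d\omega\\\qquad=\sigma^{-4}\frac{1}{[2\pi]^{m/2}}\int\|A^T\sigma\xi\|_*^2\exp\{-\xi^T\xi/2\}d\xi\\\qquad=\sigma^{-2}\frac{1}{[2\pi]^{m/2}}\int\|A^T\xi\|_*^2\exp\{-\xi^T\xi/2\}d\xi,\end{array}$$
whence
$$I_{\|\cdot\|_*,2}=\sigma^{-1}\underbrace{\left[\mathbf{E}_{\xi\sim\mathcal{N}(0,I_m)}\left\{\|A^T\xi\|_*^2\right\}\right]^{1/2}}_{\gamma_{\|\cdot\|}(A)}.$$
Consequently, assuming $\Theta$ to be a $\|\cdot\|$-ball of radius $r$ in $\mathbf{R}^k$, lower bound (4.123) becomes
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta]\ \ge\ \frac{rk}{rI_{\|\cdot\|_*,2}+k}=\frac{rk}{r\sigma^{-1}\gamma_{\|\cdot\|}(A)+k}=\frac{r\sigma k}{r\gamma_{\|\cdot\|}(A)+\sigma k}. \tag{4.124}$$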
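The case of direct observations. To see "how it works," consider the case $m=k$, $A=I_k$ of direct observations, and let $\Theta=\{\theta\in\mathbf{R}^k:\|\theta\|\le r\}$. Then
• We have $\gamma_{\|\cdot\|_1}(I_k)\le O(1)\sqrt{\ln(k)}$, whence the $\|\cdot\|_1$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_1}[\Theta]\ \ge\ O(1)\frac{r\sigma k}{r\sqrt{\ln(k)}+\sigma k};\qquad[\Theta=\{\theta\in\mathbf{R}^k:\|\theta-a\|_1\le r\}]$$
• We have $\gamma_{\|\cdot\|_2}(I_k)=\sqrt k$, whence the $\|\cdot\|_2$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_2}[\Theta]\ \ge\ \frac{r\sigma\sqrt k}{r+\sigma\sqrt k};\qquad[\Theta=\{\theta\in\mathbf{R}^k:\|\theta-a\|_2\le r\}]$$
• We have $\gamma_{\|\cdot\|_\infty}(I_k)\le O(1)k$, whence the $\|\cdot\|_\infty$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_\infty}[\Theta]\ \ge\ O(1)\frac{r\sigma}{r+\sigma}.\qquad[\Theta=\{\theta\in\mathbf{R}^k:\|\theta-a\|_\infty\le r\}]$$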
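The three values of $\gamma_{\|\cdot\|}(I_k)$ quoted above can be estimated by simulation. The sketch below is a Monte Carlo illustration under assumed values of $k$, sample size and seed; it uses the fact that the norms conjugate to $\|\cdot\|_1$, $\|\cdot\|_2$, $\|\cdot\|_\infty$ are $\|\cdot\|_\infty$, $\|\cdot\|_2$, $\|\cdot\|_1$, respectively, and the helper names are hypothetical.

```python
import math
import random

def gamma_direct(k, dual_norm, n_samples=5_000, seed=0):
    """Monte Carlo estimate of gamma(I_k) = sqrt(E ||xi||_*^2), xi ~ N(0, I_k)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(k)]
        acc += dual_norm(xi) ** 2
    return math.sqrt(acc / n_samples)

def norm_inf(v):
    return max(abs(t) for t in v)

def norm_2(v):
    return math.sqrt(sum(t * t for t in v))

def norm_1(v):
    return sum(abs(t) for t in v)

def risk_bound(r, sigma, k, gamma):
    """Right-hand side of (4.124): r*sigma*k / (r*gamma + sigma*k)."""
    return r * sigma * k / (r * gamma + sigma * k)

k = 50
g_l1 = gamma_direct(k, norm_inf)    # gamma for ||.||_1 (conjugate norm: l_inf)
g_l2 = gamma_direct(k, norm_2)      # gamma for ||.||_2, close to sqrt(k)
g_linf = gamma_direct(k, norm_1)    # gamma for ||.||_inf (conjugate norm: l_1), at most k
print(g_l1, g_l2, g_linf)
print(risk_bound(1.0, 0.1, k, g_l1), risk_bound(1.0, 0.1, k, g_l2), risk_bound(1.0, 0.1, k, g_linf))
```

The estimate for $\|\cdot\|_1$ stays below $\sqrt{2\ln(ek)}$, in line with the logarithmic growth stated in the first bullet.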
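In fact, the above examples are essentially covered by the following
Observation 4.42 Let $\|\cdot\|$ be a norm on $\mathbf{R}^k$, and let $\Theta=\{\theta\in\mathbf{R}^k:\|\theta\|\le r\}$. Consider the problem of recovering signal $\theta\in\Theta$ via observation $\omega\sim\mathcal{N}(\theta,\sigma^2I_k)$. Let
$$\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta]=\sup_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim\mathcal{N}(\theta,\sigma^2I)}\left\{\|\widehat{\theta}(\omega)-\theta\|^2\right\}\right]^{1/2}$$
be the $\|\cdot\|$-risk of an estimate $\widehat{\theta}(\cdot)$, and let
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta]=\inf_{\widehat{\theta}(\cdot)}\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta]$$
be the associated minimax risk. Assume that the norm $\|\cdot\|$ is absolute and symmetric w.r.t. permutations of coordinates. Then
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta]\ \ge\ \frac{r\sigma k}{\sqrt{2\ln(ek)}\,r\alpha_*+\sigma k},\qquad\alpha_*=\|[1;...;1]\|_*. \tag{4.125}$$
Here is the concluding part of the exercise:
2) Prove the observation and compare the lower risk bound it yields with the $\|\cdot\|$-risk of the "plug-in" estimate $\widehat{\chi}(\omega)\equiv\omega$.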
Solution: By (4.123) we have
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta]\ \ge\ \frac{r\sigma k}{rI_*+\sigma k},\qquad I_*=\left[\mathbf{E}_{\xi\sim\mathcal{N}(0,I_k)}\left\{\|\xi\|_*^2\right\}\right]^{1/2}. \tag{6.34}$$
The norm $\|\cdot\|_*$ is absolute and symmetric along with $\|\cdot\|$, whence the maximum of $\|\cdot\|_*$ over the unit box $\{x:\|x\|_\infty\le1\}$ is $\alpha_*=\|[1;...;1]\|_*$. Hence, $\|\xi\|_*\le\alpha_*\|\xi\|_\infty$, so that
$$\begin{array}{rcl}I_*^2&\le&\alpha_*^2\mathbf{E}_{\xi\sim\mathcal{N}(0,I_k)}\left\{\|\xi\|_\infty^2\right\}=\alpha_*^2\int_0^\infty t^2[-dF(t)]\qquad\left[F(t)=\mathrm{Prob}_{\xi\sim\mathcal{N}(0,I_k)}\{\|\xi\|_\infty\ge t\}\le\min[1,k\mathrm{e}^{-t^2/2}]\right]\\&=&2\alpha_*^2\int_0^\infty tF(t)dt\ \le\ 2\alpha_*^2\int_0^\infty t\min[1,k\mathrm{e}^{-t^2/2}]dt=\alpha_*^2\left[2\ln(k)+2k\int_{\sqrt{2\ln(k)}}^\infty t\mathrm{e}^{-t^2/2}dt\right]\\&=&\alpha_*^2[2\ln(k)+2k\mathrm{e}^{-\ln(k)}]=2\alpha_*^2[\ln(k)+1],\end{array}$$
resulting in
$$I_*\ \le\ \sqrt{2\ln(ek)}\,\alpha_*,$$
and (4.125) follows from (6.34). ✷

Observe that the $\|\cdot\|$-risk of the "plug-in" estimate $\widehat{\chi}(\omega)\equiv\omega$ is
$$I:=\sigma\left[\mathbf{E}_{\xi\sim\mathcal{N}(0,I_k)}\left\{\|\xi\|^2\right\}\right]^{1/2}\ \le\ \sigma\alpha\sqrt{2\ln(ek)},\qquad\alpha=\|[1;...;1]\|,$$
where the inequality holds for exactly the same reasons as the similar inequality for $I_*$. We clearly have $\alpha\alpha_*=k$ (why?), implying that
$$\mathrm{Risk}_{\|\cdot\|}[\widehat{\chi}|\Theta]\ \le\ \frac{\sqrt{2\ln(ek)}\,\sigma k}{\alpha_*}.$$
In the case of "small $\sigma$," specifically,
$$\sigma k\ \le\ \sqrt{2\ln(ek)}\,r\alpha_*, \tag{6.35}$$
the lower bound on the minimax $\|\cdot\|$-risk given by the observation is at least $\frac{\sigma k}{2\sqrt{2\ln(ek)}\,\alpha_*}$, that is, the plug-in estimate is optimal within the logarithmic factor $4\ln(ek)$. Note that when (6.35) does not hold, the lower risk bound from the observation is at least $\frac r2$, implying that in this case the trivial estimate $\widehat{\chi}(\omega)\equiv0$ is optimal within factor 2. Moreover, when $\|\cdot\|=\|\cdot\|_p$ with $p\in(1,\infty)$, a straightforward refinement of the above computations shows that when $p\in[\pi_*,\pi]$, for some $\pi\in[2,\infty)$, the logarithmic in $k$ factors in the lower bound on the minimax $\|\cdot\|_p$-risk and the upper bound on the $\|\cdot\|_p$-risk of the plug-in estimate can be replaced with factors depending solely on $\pi$.

6.4.6 Around S-Lemma
S-Lemma is a classical result of extreme importance in Semidefinite Optimization. Basically, the lemma states that when the ellitope $X$ in Proposition 4.6 is an ellipsoid, (4.19) can be strengthened to $\mathrm{Opt}=\mathrm{Opt}_*$. In fact, S-Lemma is even stronger:
Lemma 4.43. [S-Lemma] Consider two quadratic forms $f(x)=x^TAx+2a^Tx+\alpha$ and $g(x)=x^TBx+2b^Tx+\beta$ such that $g(\bar x)<0$ for some $\bar x$. Then the implication
$$g(x)\le0\ \Rightarrow\ f(x)\le0$$
takes place if and only if for some $\lambda\ge0$ it holds $f(x)\le\lambda g(x)$ for all $x$, or, which is the same, if and only if the Linear Matrix Inequality
$$\begin{bmatrix}\lambda B-A&\lambda b-a\\\lambda b^T-a^T&\lambda\beta-\alpha\end{bmatrix}\succeq0$$
in scalar variable $\lambda$ has a nonnegative solution.
Proof of S-Lemma can be found, e.g., in [15, Section 3.5.2]. The goal of subsequent exercises is to get "tight" tractable outer approximations of sets obtained from ellitopes by quadratic lifting. We fix an ellitope
$$X=\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:x^TS_kx\le t_k,\ 1\le k\le K\} \tag{4.132}$$
where, as always, $S_k$ are positive semidefinite matrices with positive definite sum, and $\mathcal{T}$ is a computationally tractable convex compact subset in $\mathbf{R}^K_+$ such that $t\in\mathcal{T}$ implies $t'\in\mathcal{T}$ whenever $0\le t'\le t$ and $\mathcal{T}$ contains a positive vector.
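Exercise 4.25. Let us associate with ellitope $X$ given by (4.132) the sets
$$\begin{array}{rcl}\mathcal{X}&=&\mathrm{Conv}\{xx^T:x\in X\},\\\widehat{\mathcal{X}}&=&\{Y\in\mathbf{S}^n:Y\succeq0,\ \exists t\in\mathcal{T}:\mathrm{Tr}(S_kY)\le t_k,\ 1\le k\le K\},\end{array}$$
so that $\mathcal{X}$, $\widehat{\mathcal{X}}$ are convex compact sets containing the origin, and $\widehat{\mathcal{X}}$ is computationally tractable along with $\mathcal{T}$. Prove that
1. When $K=1$, we have $\mathcal{X}=\widehat{\mathcal{X}}$;
2. We always have $\mathcal{X}\subset\widehat{\mathcal{X}}\subset3\ln(\sqrt3K)\mathcal{X}$.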
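Solution: We clearly have $\mathcal{X}\subset\widehat{\mathcal{X}}$, and both sets are convex compact sets containing the origin. As a result, in order to verify that $\widehat{\mathcal{X}}\subset\kappa\mathcal{X}$ for some $\kappa>0$, it suffices to show that the support functions
$$\begin{array}{rcl}\phi_{\mathcal{X}}(W)&=&\max\limits_{Y\in\mathcal{X}}\mathrm{Tr}(YW)=\max\limits_{x\in X}x^TWx:\ \mathbf{S}^n\to\mathbf{R},\\\phi_{\widehat{\mathcal{X}}}(W)&=&\max\limits_{Y\in\widehat{\mathcal{X}}}\mathrm{Tr}(YW)\end{array}$$
satisfy the relation
$$\phi_{\widehat{\mathcal{X}}}(W)\le\kappa\phi_{\mathcal{X}}(W)\ \ \forall W. \tag{6.36}$$
In the case of $K=1$ we should prove that (6.36) holds true with $\kappa=1$. Assuming w.l.o.g. that the largest element in $\mathcal{T}\subset\mathbf{R}_+$ is 1, we have $X=\{x:g(x):=x^TS_1x-1\le0\}$. Given $W\in\mathbf{S}^n$ and setting $f(x)=x^TWx-\phi_{\mathcal{X}}(W)$, we ensure the validity of the implication $g(x)\le0\Rightarrow f(x)\le0$, and clearly $g(0)<0$. Applying S-Lemma, we conclude that $f(x)\le\lambda g(x)$ for some $\lambda\ge0$ and all $x$, whence $W\preceq\lambda S_1$ and $\phi_{\mathcal{X}}(W)\ge\lambda$, so that $\mathrm{Tr}(WY)\le\lambda\mathrm{Tr}(S_1Y)$ for all $Y\succeq0$, implying that
$$\forall(Y\in\widehat{\mathcal{X}}):\ \mathrm{Tr}(WY)\le\lambda\mathrm{Tr}(S_1Y)\le\lambda\le\phi_{\mathcal{X}}(W).$$
Thus, $\phi_{\widehat{\mathcal{X}}}(W)\le\phi_{\mathcal{X}}(W)$, that is, (6.36) indeed holds true with $\kappa=1$.
In the case of $K>1$ we should prove that (6.36) holds true with $\kappa=3\ln(\sqrt3K)$. We clearly have
$$\Big\{W\preceq\sum_k\lambda_kS_k\ \&\ \lambda\ge0\Big\}\ \Rightarrow\ \phi_{\widehat{\mathcal{X}}}(W)\le\phi_{\mathcal{T}}(\lambda),$$
whence
$$\phi_{\widehat{\mathcal{X}}}(W)\ \le\ \min_\lambda\Big\{\phi_{\mathcal{T}}(\lambda):\lambda\ge0,\ W\preceq\sum_k\lambda_kS_k\Big\}.$$
Now, by Proposition 4.6 the right hand side in this inequality is at most
$$3\ln(\sqrt3K)\max_{x\in X}x^TWx=3\ln(\sqrt3K)\phi_{\mathcal{X}}(W),$$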
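implying that (6.36) holds true with $\kappa=3\ln(\sqrt3K)$. ✷

Exercise 4.26. For $x\in\mathbf{R}^n$ let $Z(x)=[x;1][x;1]^T$, $Z^o[x]=\begin{bmatrix}xx^T&x\\x^T&0\end{bmatrix}$. Let
$$C=\begin{bmatrix}0&0\\0&1\end{bmatrix},$$
and let us associate with ellitope $X$ given by (4.132) the sets
$$\begin{array}{rcl}\mathcal{X}^+&=&\mathrm{Conv}\{Z^o[x]:x\in X\},\\\widehat{\mathcal{X}}^+&=&\left\{Y=\begin{bmatrix}U&u\\u^T&0\end{bmatrix}\in\mathbf{S}^{n+1}:Y+C\succeq0,\ \exists t\in\mathcal{T}:\mathrm{Tr}(S_kU)\le t_k,\ 1\le k\le K\right\},\end{array}$$
so that $\mathcal{X}^+$, $\widehat{\mathcal{X}}^+$ are convex compact sets containing the origin, and $\widehat{\mathcal{X}}^+$ is computationally tractable along with $\mathcal{T}$. Prove that
1. When $K=1$, we have $\mathcal{X}^+=\widehat{\mathcal{X}}^+$;
2. We always have $\mathcal{X}^+\subset\widehat{\mathcal{X}}^+\subset3\ln(\sqrt3(K+1))\mathcal{X}^+$.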
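Solution: Let $\mathbf{S}^{n+1}_o$ be the space of all symmetric $(n+1)\times(n+1)$ matrices $W$ with $W_{n+1,n+1}=0$; thus, a generic matrix $W$ from $\mathbf{S}^{n+1}_o$ is $W[V,v]:=\begin{bmatrix}V&v\\v^T&0\end{bmatrix}$ with $V\in\mathbf{S}^n$ and $v\in\mathbf{R}^n$. We clearly have $\mathcal{X}^+\subset\widehat{\mathcal{X}}^+\subset\mathbf{S}^{n+1}_o$, and both sets are convex, compact and contain the origin. As a result, in order to verify that $\widehat{\mathcal{X}}^+\subset\kappa\mathcal{X}^+$ for some $\kappa>0$, it suffices to show that the support functions
$$\begin{array}{rcl}\phi_{\mathcal{X}^+}(V,v)&=&\max\limits_{Y\in\mathcal{X}^+}\mathrm{Tr}(W[V,v]Y)=\max\limits_{x\in X}[x^TVx+2v^Tx]:\ \mathbf{S}^{n+1}_o\to\mathbf{R},\\\phi_{\widehat{\mathcal{X}}^+}(V,v)&=&\max\limits_{Y\in\widehat{\mathcal{X}}^+}\mathrm{Tr}(W[V,v]Y)\end{array}$$
satisfy the relation
$$\phi_{\widehat{\mathcal{X}}^+}(V,v)\le\kappa\phi_{\mathcal{X}^+}(V,v)\ \ \forall(V\in\mathbf{S}^n,v\in\mathbf{R}^n). \tag{6.37}$$
In the case of $K=1$ we should prove that (6.37) holds true with $\kappa=1$. Assuming w.l.o.g. that the largest element in $\mathcal{T}\subset\mathbf{R}_+$ is 1, we have
$$X=\{x:g(x):=x^TS_1x-1\le0\}.$$
Given $V$, $v$ and setting
$$f(x)=x^TVx+2v^Tx-\phi_{\mathcal{X}^+}(V,v)=x^TVx+2v^Tx-\max_{y\in X}[y^TVy+2v^Ty],$$
we ensure the validity of the implication $g(x)\le0\Rightarrow f(x)\le0$, and clearly $g(0)<0$. Applying S-Lemma, we conclude that $f(x)\le\lambda g(x)$ for some $\lambda\ge0$ and all $x$, whence
$$G:=\begin{bmatrix}\lambda S_1-V&-v\\-v^T&\phi_{\mathcal{X}^+}(V,v)-\lambda\end{bmatrix}\succeq0.$$
Now let $Y=\begin{bmatrix}U&u\\u^T&0\end{bmatrix}\in\widehat{\mathcal{X}}^+$, so that $Y+C\succeq0$, implying that $\mathrm{Tr}(G(Y+C))\ge0$, that is,
$$\lambda\mathrm{Tr}(S_1U)-\mathrm{Tr}(VU)-2v^Tu+\phi_{\mathcal{X}^+}(V,v)-\lambda\ge0.$$
Taking into account that $\mathrm{Tr}(S_1U)\le1$ due to $Y\in\widehat{\mathcal{X}}^+$ and our normalization $\max_{t\in\mathcal{T}}t=1$ of $\mathcal{T}$, we get
$$\mathrm{Tr}(VU)+2v^Tu\le\phi_{\mathcal{X}^+}(V,v)$$
for all $Y=\begin{bmatrix}U&u\\u^T&0\end{bmatrix}\in\widehat{\mathcal{X}}^+$, and we see that (6.37) indeed holds with $\kappa=1$.
Now let $K>1$. Consider the ellitope
$$\mathcal{R}=X\times[-1,1]=\{[x;s]\in\mathbf{R}^{n+1}:\exists t\in\mathcal{T}:x^TS_kx\le t_k,\ 1\le k\le K,\ s^2\le1\},$$
and let us set
$$\mathcal{R}^+=\{Y\in\mathbf{S}^{n+1}:Y\succeq0,\ \exists t\in\mathcal{T}:\mathrm{Tr}(\mathrm{Diag}\{S_k,0\}Y)\le t_k,\ 1\le k\le K,\ \mathrm{Tr}(CY)\le1\}.$$
We clearly have
$$\begin{array}{rcl}\phi_{\widehat{\mathcal{X}}^+}(V,v)&=&\max\limits_{U,u,t}\left\{\mathrm{Tr}(VU)+2v^Tu:\ \mathrm{Tr}(S_kU)\le t_k,\ k\le K,\ Y=\begin{bmatrix}U&u\\u^T&1\end{bmatrix}\succeq0,\ t\in\mathcal{T}\right\}\\&=&\max\limits_Y\left\{\mathrm{Tr}\left(\begin{bmatrix}V&v\\v^T&0\end{bmatrix}Y\right):Y\in\mathcal{R}^+\right\}\\&\le&3\ln(\sqrt3(K+1))\max\limits_{[x;s]\in\mathcal{R}=X\times[-1,1]}\left[x^TVx+2sv^Tx\right]\qquad[\hbox{Proposition 4.6 as applied to ellitope }\mathcal{R}]\\&=&3\ln(\sqrt3(K+1))\max\limits_{x\in X}\left[x^TVx+2v^Tx\right]=3\ln(\sqrt3(K+1))\phi_{\mathcal{X}^+}(V,v),\end{array}$$
and we conclude that (6.37) holds true with $\kappa=3\ln(\sqrt3(K+1))$. ✷

6.4.7 Miscellaneous exercises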
Exercise 4.27. Let $X\subset\mathbf{R}^n$ be a convex compact set, let $b\in\mathbf{R}^n$, and let $A$ be an $m\times n$ matrix. Consider the problem of affine recovery $\omega\mapsto h^T\omega+c$ of the linear function $Bx=b^Tx$ of $x\in X$ from indirect observation
$$\omega=Ax+\sigma\xi,\ \ \xi\sim\mathcal{N}(0,I_m).$$
Given tolerance $\epsilon\in(0,1)$, we are interested in minimizing the worst-case, over $x\in X$, width of the $(1-\epsilon)$ confidence interval, that is, the smallest $\rho$ such that
$$\mathrm{Prob}\{\xi:b^Tx-[h^T(Ax+\sigma\xi)+c]>\rho\}\le\epsilon/2\ \ \&\ \ \mathrm{Prob}\{\xi:b^Tx-[h^T(Ax+\sigma\xi)+c]<-\rho\}\le\epsilon/2\ \ \forall x\in X.$$
Pose the problem as a convex optimization problem and consider in detail the case where $X$ is the box $\{x\in\mathbf{R}^n:a_j|x_j|\le1,\ 1\le j\le n\}$, where $a_j>0$ for all $j$.
Solution: In order to get the desired probability bounds for given $x$ and $\rho$ we should have
$$\mathrm{Erfc}\left(\frac{\rho-[h^TAx+c-b^Tx]}{\sigma\|h\|_2}\right)\le\epsilon/2\ \ \&\ \ \mathrm{Erfc}\left(\frac{[h^TAx+c-b^Tx]+\rho}{\sigma\|h\|_2}\right)\le\epsilon/2,$$
that is, $\rho\pm[h^TAx+c-b^Tx]\ge\sigma\mathrm{ErfcInv}(\epsilon/2)\|h\|_2$, or, equivalently, $\rho\ge|[A^Th-b]^Tx+c|+\sigma\mathrm{ErfcInv}(\epsilon/2)\|h\|_2$. Thus, the design problem becomes
$$\min_{h,c}\left\{\psi(h,c)=\max_{x\in X}|[A^Th-b]^Tx+c|+\sigma\mathrm{ErfcInv}(\epsilon/2)\|h\|_2\right\}. \tag{\#}$$
Function $\psi$ clearly is convex (as the supremum, over $x\in X$, of a family of convex functions of $h$, $c$ parameterized by $x$), and is efficiently computable, provided $X$ is a computationally tractable convex set; indeed, computing $\psi(h,c)$ reduces to maximizing over $x\in X$ two affine functions of $x$, namely, $[A^Th-b]^Tx+c$ and $-[A^Th-b]^Tx-c$. Now, when $X$ is symmetric w.r.t. the origin, we have $\psi(h,c)=\psi(h,-c)$, implying that (#) has an optimal solution with $c=0$. In the case of $X=\{x:a_j|x_j|\le1,\ 1\le j\le n\}$ we have $\psi(h,0)=\sum_{j=1}^n|\mathrm{Col}_j^T[A]h-b_j|/a_j+\sigma\mathrm{ErfcInv}(\epsilon/2)\|h\|_2$, and (#) becomes the convex optimization problem
$$\min_h\left\{\sum_{j=1}^na_j^{-1}|\mathrm{Col}_j^T[A]h-b_j|+\sigma\mathrm{ErfcInv}(\epsilon/2)\|h\|_2\right\}.$$
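The objective of the reduced problem for the box is easy to evaluate directly. The sketch below is an illustration under stated assumptions: $\mathrm{ErfcInv}$ is taken to be the inverse of $s\mapsto\mathrm{Prob}\{\mathcal{N}(0,1)\ge s\}$, the data $A$, $b$, $a$, $h$ are made up, and all function names are hypothetical. It also compares the closed-form bias term with a brute-force maximum over the vertices of the box.

```python
import itertools
import math
from statistics import NormalDist

def erfc_inv(p):
    """s such that Prob{N(0,1) >= s} = p (the convention assumed here for ErfcInv)."""
    return NormalDist().inv_cdf(1.0 - p)

def residual(A, b, h):
    """The vector A^T h - b; A is a list of m rows of length n."""
    m, n = len(A), len(b)
    return [sum(A[i][j] * h[i] for i in range(m)) - b[j] for j in range(n)]

def bias_term(A, b, a, h):
    """sum_j |Col_j[A]^T h - b_j| / a_j for the box X = {x : a_j |x_j| <= 1}."""
    return sum(abs(g) / aj for g, aj in zip(residual(A, b, h), a))

def psi(A, b, a, h, sigma, eps):
    """Objective with c = 0: bias term + sigma * ErfcInv(eps/2) * ||h||_2."""
    return bias_term(A, b, a, h) + sigma * erfc_inv(eps / 2.0) * math.sqrt(sum(t * t for t in h))

def bias_by_vertices(A, b, a, h):
    """Brute force: max over the 2^n vertices of the box of |[A^T h - b]^T x|."""
    g = residual(A, b, h)
    return max(abs(sum(gj * s / aj for gj, s, aj in zip(g, signs, a)))
               for signs in itertools.product((-1.0, 1.0), repeat=len(b)))

# Made-up data with m = 2 observations and n = 3 unknowns.
A = [[1.0, 0.5, 0.0], [0.0, 1.0, 2.0]]
b = [1.0, -1.0, 0.5]
a = [1.0, 2.0, 4.0]
h = [0.3, -0.2]
closed_form = bias_term(A, b, a, h)
brute_force = bias_by_vertices(A, b, a, h)
print(closed_form, brute_force, psi(A, b, a, h, sigma=0.1, eps=0.05))
```

Since the objective is a sum of absolute values of affine functions of $h$ plus a multiple of $\|h\|_2$, any solver for second-order cone programs can minimize it; the sketch only evaluates it.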
Exercise 4.28. Prove Proposition 4.21.
Solution: Let $\omega=A\bar x+\eta$ with $\bar x\in\mathcal{X}$ and $\|\eta\|_{(m)}\le\sigma$. Problem $(F[\omega])$ clearly is solvable (a solution is $\bar x$), so that $\widehat{x}:=\widehat{x}_*(\omega)$ is well-defined, and
$$\|A(\bar x-\widehat{x}_*(\omega))\|_{(m)}\le\|A\bar x-\omega\|_{(m)}+\|A\widehat{x}-\omega\|_{(m)}\le2\sigma.$$
Hence, $\|B\bar x-B\widehat{x}\|_{(\nu)}\le\Upsilon$ by definition of $\Upsilon$; we have proved the first inequality in (4.66). On the other hand, let $x,y\in\mathcal{X}$ be such that $\|A(x-y)\|_{(m)}\le2\sigma$. Setting $\bar x=\frac12[x+y]$ and $\zeta=\frac12A[x-y]$, we get $\bar x\in\mathcal{X}$, $\|\zeta\|_{(m)}\le\sigma$ and $Ax-\zeta=Ay+\zeta=A\bar x$. It follows that the observation $\bar\omega=A\bar x$ is compatible with signals $x$ and $y$:
$$\|Ax-\bar\omega\|_{(m)}=\|Ay-\bar\omega\|_{(m)}\le\sigma,$$
implying that for every estimate $\widehat{x}(\cdot)$ it holds
$$\mathrm{Risk}_{\mathcal{H},\|\cdot\|}[\widehat{x}|\mathcal{X}]\ \ge\ \max[\|Bx-\widehat{x}(\bar\omega)\|_{(\nu)},\|By-\widehat{x}(\bar\omega)\|_{(\nu)}]\ \ge\ \tfrac12\|Bx-By\|_{(\nu)}.$$
Thus,
$$\mathrm{Risk}_{\mathrm{opt},\sigma}\ \ge\ \tfrac12\|Bx-By\|_{(\nu)}\ \ \forall(x,y\in\mathcal{X}:\|A(x-y)\|_{(m)}\le2\sigma),$$
implying the second inequality in (4.66). ✷
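Exercise 4.29. Prove Proposition 4.22.
Solution: When $\mathcal{X}$ is the spectratope defined in (4.63), the image of $\mathcal{X}\times\mathcal{X}$ under the linear mapping $(x,y)\mapsto\frac12[x-y]$ is exactly $\mathcal{X}$, so that (4.66) reads
$$\begin{array}{rcl}\mathrm{Opt}_\#&=&2\max\limits_x\left\{\|Bx\|_{(\nu)}:x\in\mathcal{X},\ Ax\in\sigma\mathcal{B}_{(m)}\right\}\\&=&2\max\limits_{x,v}\left\{v^TBx:x\in\mathcal{X},\ Ax\in\sigma\mathcal{B}_{(m)},\ v\in\mathcal{B}^*_{(\nu)}\right\}\\&=&2\max\limits_{x,w}\left\{w^TM^TBx:x\in\mathcal{X},\ Ax\in\sigma\mathcal{B}_{(m)},\ \exists r\in\mathcal{R}:S_\ell^2[w]\preceq r_\ell I_{f_\ell},\ \ell\le L\right\}\\&=&2\max\limits_{x,w}\left\{w^TM^TBx:\ \exists(t\in\mathcal{T},p\in\mathcal{P},r\in\mathcal{R}):\ R_k^2[x]\preceq t_kI_{d_k},\ k\le K,\right.\\&&\left.\qquad Q_j^2[Ax]\preceq\sigma^2p_jI_{e_j},\ j\le J,\ S_\ell^2[w]\preceq r_\ell I_{f_\ell},\ \ell\le L\right\}\\&=&\max\limits_{z=[w;x]}\left\{z^T\widehat{B}z:\ \exists(t\in\mathcal{T},p\in\sigma^2\mathcal{P},r\in\mathcal{R}):\ R_k^2[x]\preceq t_kI_{d_k},\ k\le K,\right.\\&&\left.\qquad M_j^2[x]\preceq p_jI_{e_j},\ j\le J,\ S_\ell^2[w]\preceq r_\ell I_{f_\ell},\ \ell\le L\right\}\end{array}$$
where
$$M_j[x]=Q_j[Ax],\qquad\widehat{B}=\begin{bmatrix}0&M^TB\\B^TM&0\end{bmatrix}.$$
By Proposition 4.8, the quantity
$$\begin{array}{rl}\mathrm{Opt}=\min\limits_{\Lambda=\{\Lambda_k,k\le K\},\,\Upsilon=\{\Upsilon_\ell,\ell\le L\},\,\Sigma=\{\Sigma_j,j\le J\}}&\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\sigma^2\phi_{\mathcal{P}}(\lambda[\Sigma])+\phi_{\mathcal{R}}(\lambda[\Sigma]):\\&\ \Lambda_k\succeq0,\ \Upsilon_\ell\succeq0,\ \Sigma_j\succeq0\ \forall(k,\ell,j),\\&\ \begin{bmatrix}\sum_\ell S_\ell^*[\Upsilon_\ell]&M^TB\\B^TM&\sum_kR_k^*[\Lambda_k]+\sum_jM_j^*[\Sigma_j]\end{bmatrix}\succeq0\Big\}\end{array}$$
upper-bounds $\mathrm{Opt}_\#$, and this bound is tight within the factor $2\max[\ln(2D),1]$, $D=\sum_kd_k+\sum_\ell f_\ell+\sum_je_j$. It remains to note that, as it is immediately seen, $M_j^*[\Sigma_j]=A^TQ_j^*[\Sigma_j]A$. ✷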
6.5 SOLUTIONS FOR CHAPTER 5

6.5.1 Estimation by Stochastic Optimization
Exercise 5.1. Consider the following "multinomial" version of the logistic regression problem from Section 5.2.1: For $k=1,...,K$, we observe pairs
$$(\zeta_k,\ell_k)\in\mathbf{R}^n\times\{0,1,...,m\} \tag{5.70}$$
drawn independently of each other from a probability distribution $P_x$ parameterized by an unknown signal $x=[x^1;...;x^m]\in\mathbf{R}^n\times...\times\mathbf{R}^n$ as follows:
• The probability distribution of regressor $\zeta$ induced by the distribution $S_x$ of $(\zeta,\ell)$ is a once and forever fixed, and independent of $x$, distribution $R$ on $\mathbf{R}^n$ with finite second order moments and a positive definite matrix $Z=\mathbf{E}_{\zeta\sim R}\{\zeta\zeta^T\}$ of second order moments;
• The conditional probability distribution of label $\ell$ given $\zeta$, induced by the distribution $S_x$ of $(\zeta,\ell)$, is the distribution of a discrete random variable taking value $\iota\in\{0,1,...,m\}$ with probability
$$p_\iota=\left\{\begin{array}{ll}\dfrac{\exp\{\zeta^Tx^\iota\}}{1+\sum_{i=1}^m\exp\{\zeta^Tx^i\}},&1\le\iota\le m,\\[3mm]\dfrac{1}{1+\sum_{i=1}^m\exp\{\zeta^Tx^i\}},&\iota=0.\end{array}\right.\qquad[x=[x^1;...;x^m]]$$
Given a nonempty convex compact set $\mathcal{X}\subset\mathbf{R}^{mn}$ known to contain the (unknown) signal $x$ underlying observations (5.70), we want to recover $x$. Note that the recovery problem associated with the standard logistic regression model is the case $m=1$ of the problem just defined. Your task is to process the above recovery problem via the approach developed in Section 5.2 and to compare the resulting SAA estimate with the Maximum Likelihood estimate.
Solution: Let us associate with $\zeta\in\mathbf{R}^n$ the $mn\times m$ matrix
$$\eta[\zeta]=\mathrm{Diag}\{\underbrace{\zeta,...,\zeta}_{m}\}=\begin{bmatrix}\zeta&&\\&\ddots&\\&&\zeta\end{bmatrix}$$
and, as always, let us associate with labels $\ell_k\in\{0,1,...,m\}$ vectors $y_k=y[\ell_k]$, where $y[0]=0$ and $y[\iota]=e_\iota$, $1\le\iota\le m$, $e_1,...,e_m$ being the standard basic orths in $\mathbf{R}^m$. Finally, consider the vector field
$$f([s_1;...;s_m])=\frac{[\exp\{s_1\};...;\exp\{s_m\}]}{1+\sum_{i=1}^m\exp\{s_i\}}=\nabla_s\underbrace{\ln\Big(1+\sum_{i=1}^m\exp\{s_i\}\Big)}_{\phi(s)}:\ \mathbf{R}^m\to\mathbf{R}^m.$$
With these conventions, observations (5.70) stemming from a signal $x$ can be converted, in a deterministic fashion, into pairs
$$\omega_k=(\eta_k=\eta[\zeta_k],\ y_k=y[\ell_k])\in\mathbf{R}^{mn\times m}\times\mathbf{R}^m$$
which are i.i.d. samples drawn from distribution $P_x$ parameterized by $x$ of $(\eta,y)$ as follows:
• the distribution $Q$ of $\eta\in\mathbf{R}^{mn\times m}$ is the distribution of $\eta[\zeta]$ induced by the distribution $R$ of $\zeta$; $Q$ is independent of $x$, has finite second moments, and
$$\mathbf{E}_{\eta\sim Q}\{\eta\eta^T\}=\mathrm{Diag}\{\underbrace{Z,...,Z}_{m}\}$$
is positive definite along with $Z=\mathbf{E}_{\zeta\sim R}\{\zeta\zeta^T\}$;
• Conditional distribution of $y$ given $\eta=\eta[\zeta]$, induced by distribution $P_x$ of $\omega=(\eta,y)$, is a discrete distribution supported on the set $\{0,e_1,...,e_m\}\subset\mathbf{R}^m$, and the expectation of this conditional distribution is $f(\eta^Tx)$.
Note that $f$ is the gradient field of the smooth convex function $\phi$, and the Hessian of $\phi$ is positive definite everywhere. Consequently, $f$ is a continuous vector field which is strongly monotone on bounded sets. We find ourselves in the situation described in Section 5.2.1 and under Assumptions A.1–3 (verification of these assumptions is immediate, with strong monotonicity of $F$ on bounded sets readily given by the similar property of $f$ combined with positive definiteness of $\mathbf{E}_{\eta\sim Q}\{\eta\eta^T\}$) and can therefore utilize the constructions and estimates developed in Section 5.2. In particular, the vector field $G_{(\eta_k,y_k)}(z)$ of $z=[z^1;...;z^m]\in\mathbf{R}^{mn}$ (see (5.55)) now becomes
$$\begin{array}{rcl}G_{(\eta_k,y_k)}(z)&=&\eta_kf(\eta_k^Tz)-\eta_ky_k=\eta[\zeta_k]f(\eta^T[\zeta_k]z)-\eta[\zeta_k]y[\ell_k]\\&=&\dfrac{[\exp\{\zeta_k^Tz^1\}\zeta_k;...;\exp\{\zeta_k^Tz^m\}\zeta_k]}{1+\sum_{i=1}^m\exp\{\zeta_k^Tz^i\}}-\eta[\zeta_k]y_k\\&=&\nabla_z\left[\ln\Big(1+\sum_{i=1}^m\exp\{\zeta_k^Tz^i\}\Big)-d_k^Tz\right],\end{array}$$
$$d_k=\left\{\begin{array}{ll}[0;...;0]\in\mathbf{R}^{mn},&\ell_k=0;\\{}[\underbrace{0;...;0}_{n(\ell_k-1)};\zeta_k;0;...;0],&\ell_k\ne0.\end{array}\right.$$
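The last equality above, that the field $G$ is the gradient of $z\mapsto\ln(1+\sum_i\exp\{\zeta_k^Tz^i\})-d_k^Tz$, can be verified numerically. The sketch below compares the closed-form field with central finite differences for a single observation; the data and helper names are illustrative assumptions, with $z$ stored as a list of $m$ blocks of length $n$.

```python
import math

def neg_log_lik(z, zeta, label):
    """ln(1 + sum_i exp(zeta^T z^i)) - d^T z for one observation (zeta, label)."""
    scores = [sum(c * w for c, w in zip(zeta, block)) for block in z]
    value = math.log(1.0 + sum(math.exp(s) for s in scores))
    if label > 0:
        value -= scores[label - 1]        # d^T z = zeta^T z^label when label != 0
    return value

def field_G(z, zeta, label):
    """G(z) = eta f(eta^T z) - eta y, returned as m blocks of length n."""
    scores = [sum(c * w for c, w in zip(zeta, block)) for block in z]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    blocks = []
    for i, s in enumerate(scores, start=1):
        weight = math.exp(s) / denom - (1.0 if i == label else 0.0)
        blocks.append([weight * c for c in zeta])
    return blocks

def numeric_grad(z, zeta, label, step=1e-6):
    """Central finite differences of neg_log_lik, block by block."""
    grad = []
    for i in range(len(z)):
        row = []
        for j in range(len(zeta)):
            z_plus = [block[:] for block in z]
            z_minus = [block[:] for block in z]
            z_plus[i][j] += step
            z_minus[i][j] -= step
            row.append((neg_log_lik(z_plus, zeta, label)
                        - neg_log_lik(z_minus, zeta, label)) / (2.0 * step))
        grad.append(row)
    return grad

def max_gap(z, zeta, label):
    """Largest entrywise discrepancy between field_G and the numerical gradient."""
    G, g = field_G(z, zeta, label), numeric_grad(z, zeta, label)
    return max(abs(u - v) for Gi, gi in zip(G, g) for u, v in zip(Gi, gi))

# Made-up data: m = 3 classes beyond the zero label, n = 2 regressor entries.
zeta = [1.0, -2.0]
z = [[0.1, -0.2], [0.3, 0.0], [-0.5, 0.4]]
print(max_gap(z, zeta, 0), max_gap(z, zeta, 2))
```

Both printed gaps should be at the level of finite-difference error, for the zero label as well as for a nonzero one.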
On the other hand, the Maximum Likelihood estimate in multinomial logistic regression, the observations being $\{(\zeta_k,\ell_k),k\le K\}$, is the minimizer, over $z=[z^1;...;z^m]\in\mathbf{R}^{mn}$, of the conditional negative log-likelihood $L(z)$ given $\{\zeta_k,k\le K\}$,
$$\begin{array}{rcl}L(z)&=&\sum_{k=1}^KL_k(z),\\L_k(z)&=&\left\{\begin{array}{ll}-\ln\left(\dfrac{\exp\{\zeta_k^Tz^{\ell_k}\}}{1+\sum_{i=1}^m\exp\{\zeta_k^Tz^i\}}\right),&\ell_k>0\\[3mm]-\ln\left(\dfrac{1}{1+\sum_{i=1}^m\exp\{\zeta_k^Tz^i\}}\right),&\ell_k=0\end{array}\right.\\&=&\ln\Big(1+\sum_{i=1}^m\exp\{\zeta_k^Tz^i\}\Big)-d_k^Tz,\end{array}$$
whence
$$G_{(\eta_k,y_k)}(z)=\nabla_zL_k(z),$$
that is, the SAA estimate coincides with that by Maximum Likelihood.

Exercise 5.2. Let
$$H(x):\mathbf{R}^n\to\mathbf{R}^n$$
be a strongly monotone and Lipschitz continuous on the entire space vector field:
$$\forall(x,x'\in\mathbf{R}^n):\ [H(x)-H(x')]^T[x-x']\ge\kappa\|x-x'\|_2^2,\ \ \|H(x)-H(x')\|_2\le L\|x-x'\|_2$$
for some $\kappa>0$ and $L<\infty$.
1.1) Prove that for every $x\in\mathbf{R}^n$, the vector equation $H(z)=x$ in variable $z\in\mathbf{R}^n$ has a unique solution (which we denote $H^{-1}(x)$), and that for every $x,y\in\mathbf{R}^n$ one has
$$\|H^{-1}(x)-y\|_2\le\kappa^{-1}\|x-H(y)\|_2. \tag{5.72}$$
1.2) Prove that the vector field
$$x\mapsto H^{-1}(x)$$
is strongly monotone with modulus $\kappa_*=\kappa/L^2$ and Lipschitz continuous, with constant $1/\kappa$ w.r.t. $\|\cdot\|_2$, on the entire $\mathbf{R}^n$.
Solution: 1: Given $x,y\in\mathbf{R}^n$, let $H_x(\cdot)=H(\cdot)-x$, and let $\mathcal{X}$ be the $\|\cdot\|_2$-ball of radius larger than $\|H(y)-x\|_2/\kappa$ centered at $y$. Let, next, $z$ be a weak (and thus strong, since $H_x$ is continuous) solution to $\mathrm{VI}(H_x,\mathcal{X})$ (as we know, such a solution exists). Since $z$ is a strong solution to the VI and $H_x$ is strongly monotone with modulus $\kappa$ along with $H$, we have
$$H_x^T(y)[y-z]\ \ge\ H_x^T(z)[y-z]+\kappa\|y-z\|_2^2\ \ge\ \kappa\|y-z\|_2^2,$$
whence $\|H_x(y)\|_2\|y-z\|_2\ge\kappa\|y-z\|_2^2$, that is, $\|y-z\|_2\le\|H_x(y)\|_2/\kappa$. In particular, $z$ is an interior point of $\mathcal{X}$, and since $z$ is a strong solution to $\mathrm{VI}(H_x,\mathcal{X})$, we should have $H_x^T(z)[z'-z]\ge0$ for all $z'$ from a neighborhood of $z$, implying that $H_x(z)=0$, that is, $H(z)=x$. Thus, the equation $H(\cdot)=x$ has a solution, namely, $z$. This solution is unique, since for every two solutions $z,z'$ to the equation we should have $0=[H(z)-H(z')]^T[z-z']\ge\kappa\|z-z'\|_2^2$. As a byproduct of our reasoning, we have seen that
$$\|H^{-1}(x)-y\|_2\le\|H_x(y)\|_2/\kappa=\|H(y)-x\|_2/\kappa,$$
which is (5.72). ✷
2: Let $x,x'\in\mathbf{R}^n$, and let $y=H^{-1}(x)$, $y'=H^{-1}(x')$, so that $H(y)=x$ and $H(y')=x'$. By strong monotonicity of $H$ we have
$$\kappa\|y-y'\|_2^2\ \le\ [H(y)-H(y')]^T[y-y']=[x-x']^T[y-y']=[H^{-1}(x)-H^{-1}(x')]^T[x-x'], \tag{$*$}$$
and by Lipschitz continuity of $H$ we have $\|x-x'\|_2=\|H(y)-H(y')\|_2\le L\|y-y'\|_2$. We conclude that $\|y-y'\|_2\ge L^{-1}\|x-x'\|_2$, which combines with $(*)$ to imply that
$$[H^{-1}(x)-H^{-1}(x')]^T[x-x']\ \ge\ L^{-2}\kappa\|x-x'\|_2^2.$$
Thus, $H^{-1}$ is strongly monotone with modulus $\kappa/L^2$ on the entire $\mathbf{R}^n$, as claimed. Next, given $x,x'\in\mathbf{R}^n$, let us set $y=H^{-1}(x')$, so that $H(y)=x'$. Applying (5.72), we get $\|H^{-1}(x)-y\|_2\le\|H(y)-x\|_2/\kappa$, which is nothing but
$$\|H^{-1}(x)-H^{-1}(x')\|_2\le\|x-x'\|_2/\kappa.$$
✷
Let us interpret $-H(\cdot)$ as a field of "reaction forces" applied to a particle: when the particle is in a position $y\in\mathbf{R}^n$, the reaction force applied to the particle is $-H(y)$. Next, let us interpret $x\in\mathbf{R}^n$ as an external force applied to the particle. An equilibrium $y$ is a point in space where the reaction force $-H(y)$ compensates the external force, that is, $H(y)=x$, or, which is the same, $y=H^{-1}(x)$. Note that with this interpretation, strong monotonicity of $H$ makes perfect sense, implying that the equilibrium in question is stable: when the particle is moved from the equilibrium $y=H^{-1}(x)$ to a position $y+\Delta$, the total force acting at the particle becomes $f=x-H(y+\Delta)$, so that
$$f^T\Delta=[x-H(y+\Delta)]^T\Delta=[H(y)-H(y+\Delta)]^T[\Delta]\le-\kappa\|\Delta\|_2^2,$$
that is, the force is oriented "against" the displacement $\Delta$ and "wants" to return the particle to the equilibrium position.
Now imagine that we can observe in noise the equilibrium $H^{-1}(x)$ of the particle, the external force $x$ being unknown, and want to recover $x$ from our observation. For the sake of simplicity, let the observation noise be zero mean Gaussian, so that our observation is
$$\omega=H^{-1}(x)+\sigma\xi,\ \ \xi\sim\mathcal{N}(0,I_n).$$
2) Verify that the recovery problem we have posed is a special case of the "single observation" recovery problem from Section 5.2.6, with $\mathbf{R}^n$ in the role of $\mathcal{X}$ (see footnote 21), and that the SAA estimate $\widehat{x}(\omega)$ from that section under the circumstances is just the root of the equation
$$H^{-1}(\cdot)=\omega,$$
that is,
$$\widehat{x}(\omega)=H(\omega).$$
Prove also that
$$\mathbf{E}\{\|\widehat{x}(\omega)-x\|_2^2\}\le n\sigma^2L^2. \tag{5.73}$$
Note that in the situation in question the ML estimate should be the minimizer of the function
$$f(z)=\|\omega-H^{-1}(z)\|_2^2,$$
and this minimizer is nothing but $\widehat{x}(\omega)$.
Solution: The required verification is straightforward: in the notation from Section 5.2.6 we should set $K=1$, $m=n$, $\eta=I_n$, and $y\equiv\omega$. To verify (5.73), note that $\widehat{x}(\omega)=H(\omega)=H(\sigma\xi+H^{-1}(x))$ and $x=H(H^{-1}(x))$, so that by Lipschitz continuity of $H$ it holds
$$\|\widehat{x}(\omega)-x\|_2=\|H(\sigma\xi+H^{-1}(x))-H(H^{-1}(x))\|_2\le L\|\sigma\xi\|_2.$$
Squaring and taking expectation, we get $\mathbf{E}\{\|\widehat{x}(\omega)-x\|_2^2\}\le L^2\sigma^2\mathbf{E}\{\|\xi\|_2^2\}=n\sigma^2L^2$, which is (5.73).
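Bound (5.73) can be illustrated by simulation. The sketch below assumes a specific field, $H(z)_i=2z_i+\sin(z_i)$, which is strongly monotone with $\kappa=1$ and Lipschitz continuous with $L=3$ because its coordinate derivatives lie in $[1,3]$; this choice, the signal, the noise level and the function names are all assumptions made for illustration.

```python
import math
import random

KAPPA, LIP = 1.0, 3.0   # derivative of s -> 2s + sin(s) lies in [1, 3]

def H(z):
    """Assumed example field, acting coordinatewise."""
    return [2.0 * s + math.sin(s) for s in z]

def H_inv(x):
    """Coordinatewise inverse of H by bisection (each coordinate map is increasing)."""
    out = []
    for c in x:
        lo, hi = (c - 1.0) / 2.0, (c + 1.0) / 2.0   # since |sin| <= 1
        for _ in range(60):
            mid = 0.5 * (lo + hi)
            if 2.0 * mid + math.sin(mid) < c:
                lo = mid
            else:
                hi = mid
        out.append(0.5 * (lo + hi))
    return out

def mean_squared_error(x, sigma, n_trials=20_000, seed=0):
    """Empirical E||x_hat(omega) - x||_2^2 with omega = H^{-1}(x) + sigma*xi, x_hat = H(omega)."""
    rng = random.Random(seed)
    y = H_inv(x)
    acc = 0.0
    for _ in range(n_trials):
        omega = [yi + sigma * rng.gauss(0.0, 1.0) for yi in y]
        x_hat = H(omega)
        acc += sum((u - v) ** 2 for u, v in zip(x_hat, x))
    return acc / n_trials

x_true = [0.5, -1.0, 2.0]          # hypothetical external force
sigma = 0.3
mse = mean_squared_error(x_true, sigma)
bound = len(x_true) * sigma ** 2 * LIP ** 2      # n * sigma^2 * L^2
print(mse, bound)
```

For this field the bound is fairly tight whenever the equilibrium sits where the derivative of $H$ is close to $L$, and loose where it is close to $\kappa$.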
21 In Section 5.2.6, X was assumed to be closed, convex, and bounded; a straightforward inspection shows that when the vector field φ is strongly monotone, with some positive modulus, on the entire space, and η has trivial kernel, all constructions and results of Section 5.2.6 can be extended to the case of an arbitrary closed convex X.

Exercise 5.3 [identification of parameters of a linear dynamic system] Consider the problem as follows: A deterministic sequence $x=\{x_t:t\ge-d+1\}$ satisfies the linear finite-difference equation
$$\sum_{i=0}^d\alpha_ix_{t-i}=y_t,\ \ t=1,2,... \tag{5.74}$$
of given order $d$ and is bounded, $|x_t|\le M_x<\infty$, $\forall t\ge-d+1$, implying that the sequence $\{y_t\}$ also is bounded: $|y_t|\le M_y<\infty$, $\forall t\ge1$. The vector $\alpha=[\alpha_0;...;\alpha_d]$ is unknown, all we know is that this vector belongs to a given closed convex set $\mathcal{X}\subset\mathbf{R}^{d+1}$. We have at our disposal observations
$$\omega_t=x_t+\sigma_x\xi_t,\ \ -d+1\le t\le K, \tag{5.75}$$
of the terms in the sequence, with $\xi_t\sim\mathcal{N}(0,1)$ independent across $t$, with some given $\sigma_x$, and observations
$$\zeta_t=y_t+\sigma_y\eta_t \tag{5.76}$$
with $\eta_t\sim\mathcal{N}(0,1)$ independent across $t$ and independent of $\{\xi_\tau\}_\tau$. Our goal is to recover from these observations the vector $\alpha$.
Strategy. To get the rationale underlying the construction to follow, let us start with the case when there is no observation noise at all: $\sigma_x=\sigma_y=0$. In this case we could act as follows: let us denote
$$x^t=[x_t;x_{t-1};x_{t-2};...;x_{t-d}],\ \ 1\le t\le K,$$
and rewrite (5.74) as
$$[x^t]^T\alpha=y_t,\ \ 1\le t\le K.$$
When setting
$$A_K=\frac1K\sum_{t=1}^Kx^t[x^t]^T,\qquad a_K=\frac1K\sum_{t=1}^Ky_tx^t,$$
we get
$$A_K\alpha=a_K. \tag{5.77}$$
Assuming that $K$ is large and the trajectory $x$ is “rich enough” to ensure that $A_K$ is nonsingular, we could identify $\alpha$ by solving the linear system (5.77). Now, when the observation noise is present, we could try to use the noisy observations of $x_t$ and $y_t$ we have at our disposal in order to build empirical approximations to $A_K$ and $a_K$ which are good for large $K$, and identify $\alpha$ by solving the “empirical counterpart” of (5.77). The straightforward way would be to define $\omega^t$ as an “observable version” of $x^t$,
$$\omega^t = [\omega_t; \omega_{t-1}; ...; \omega_{t-d}] = x^t + \sigma_x \underbrace{[\xi_t; \xi_{t-1}; ...; \xi_{t-d}]}_{\xi^t},$$
and to replace $A_K$ and $a_K$ with
$$\widetilde{A}_K = \frac{1}{K}\sum_{t=1}^K \omega^t [\omega^t]^T, \quad \widetilde{a}_K = \frac{1}{K}\sum_{t=1}^K \zeta_t \omega^t.$$
As far as empirical approximation of $a_K$ is concerned, this approach works: we have
$$\widetilde{a}_K = a_K + \delta a_K, \quad \delta a_K = \frac{1}{K}\sum_{t=1}^K \underbrace{\left[\sigma_x y_t \xi^t + \sigma_y \eta_t x^t + \sigma_x \sigma_y \eta_t \xi^t\right]}_{\delta_t}.$$
Since the sequence $\{y_t\}$ is bounded, the random error $\delta a_K$ of approximation $\widetilde{a}_K$ of $a_K$ is small for large $K$ with overwhelming probability. Indeed, $\delta a_K$ is the average of $K$ zero mean random vectors $\delta_t$ (recall that $\xi^t$ and $\eta_t$ are independent and zero mean) with22
$$E\{\|\delta_t\|_2^2\} \le 3(d+1)\left[\sigma_x^2 M_y^2 + \sigma_y^2 M_x^2 + \sigma_x^2\sigma_y^2\right],$$
and $\delta_t$ is independent of $\delta_s$ whenever $|t-s| > d+1$, implying that
$$E\{\|\delta a_K\|_2^2\} \le \frac{3(d+1)(2d+1)\left[\sigma_x^2 M_y^2 + \sigma_y^2 M_x^2 + \sigma_x^2\sigma_y^2\right]}{K}. \tag{5.78}$$
22 We use the elementary inequality $\left\|\sum_{t=1}^p a_t\right\|_2^2 \le p \sum_{t=1}^p \|a_t\|_2^2$.
The quality of approximating $A_K$ with $\widetilde{A}_K$ is essentially worse: setting
$$\delta A_K = \widetilde{A}_K - A_K = \frac{1}{K}\sum_{t=1}^K \underbrace{\left[\sigma_x^2 \xi^t[\xi^t]^T + \sigma_x \xi^t [x^t]^T + \sigma_x x^t [\xi^t]^T\right]}_{\Delta_t},$$
we see that $\delta A_K$ is the average of $K$ random matrices $\Delta_t$ with nonzero mean, namely, the mean $\sigma_x^2 I_{d+1}$, and as such $\delta A_K$ is “large” independently of how large $K$ is. There is, however, a simple way to overcome this difficulty: splitting observations $\omega_t$.23
Splitting observations. Let $\theta$ be a random $n$-dimensional vector with an unknown mean $\mu$ and known covariance matrix, namely, $\sigma^2 I_n$, and let $\chi \sim \mathcal{N}(0, I_n)$ be independent of $\theta$. Finally, let $\kappa > 0$ be a deterministic real.
1) Prove that setting
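$$\theta' = \theta + \sigma\kappa\chi, \quad \theta'' = \theta - \sigma\kappa^{-1}\chi,$$
we get two random vectors with mean $\mu$ and covariance matrices $\sigma^2(1+\kappa^2)I_n$ and $\sigma^2(1+1/\kappa^2)I_n$, respectively, and these vectors are uncorrelated:
$$E\{[\theta' - \mu][\theta'' - \mu]^T\} = 0.$$
Solution: $\theta'$, $\theta''$ clearly have mean $\mu$, and their covariance matrices are exactly as stated above. Besides this, setting $\delta = \sigma^{-1}(\theta - \mu)$, so that $\delta$ has zero mean and covariance matrix $I_n$, and is independent of $\chi \sim \mathcal{N}(0, I_n)$, we have
$$E\{[\theta' - \mu][\theta'' - \mu]^T\} = \sigma^2 E\{[\delta + \kappa\chi][\delta - \kappa^{-1}\chi]^T\} = \sigma^2\Big[\underbrace{E\{\delta\delta^T\}}_{=I_n} + \kappa\underbrace{E\{\chi\delta^T\}}_{=0} - \kappa^{-1}\underbrace{E\{\delta\chi^T\}}_{=0} - \underbrace{E\{\chi\chi^T\}}_{=I_n}\Big] = 0.$$
✷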
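The claim of item 1 is easy to check empirically. The following is a minimal Monte Carlo sketch, not part of the original solution; the helper name `split`, the Gaussian choice for $\theta$, and the values of `mu`, `sigma`, `kappa`, and the sample size are all illustrative assumptions.

```python
import numpy as np

def split(theta, sigma, kappa, rng):
    """Return (theta', theta'') = (theta + sigma*kappa*chi, theta - sigma/kappa*chi)."""
    chi = rng.standard_normal(theta.shape)  # chi ~ N(0, I_n), independent of theta
    return theta + sigma * kappa * chi, theta - (sigma / kappa) * chi

rng = np.random.default_rng(0)
n, N = 3, 200_000                      # dimension and number of samples (assumed)
mu = np.array([1.0, -2.0, 0.5])        # "unknown" mean, used here only for checking
sigma, kappa = 0.7, 2.0

# theta has mean mu and covariance sigma^2 I_n (Gaussian here for convenience only)
theta = mu + sigma * rng.standard_normal((N, n))
theta1, theta2 = split(theta, sigma, kappa, rng)

d1, d2 = theta1 - mu, theta2 - mu
cov1 = d1.T @ d1 / N    # should approach sigma^2 (1 + kappa^2) I_n
cov2 = d2.T @ d2 / N    # should approach sigma^2 (1 + 1/kappa^2) I_n
cross = d1.T @ d2 / N   # should approach the zero matrix
```

With $\kappa = 1$ this is the split used below, where both halves have variance $2\sigma_x^2$.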
In view of item 1, let us do as follows: given observations $\{\omega_t\}$ and $\{\zeta_t\}$, let us generate an i.i.d. sequence $\{\chi_t \sim \mathcal{N}(0,1), t \ge -d+1\}$, so that the sequences $\{\xi_t\}$, $\{\eta_t\}$, and $\{\chi_t\}$ are i.i.d. and independent of each other, and let us set
$$u_t = \omega_t + \sigma_x\chi_t, \quad v_t = \omega_t - \sigma_x\chi_t.$$
Note that given the sequence $\{\omega_t\}$ of actual observations, sequences $\{u_t\}$ and $\{v_t\}$ are observable as well, and that the sequence $\{(u_t, v_t)\}$ is i.i.d. Moreover, for all $t$,
$$E\{u_t\} = E\{v_t\} = x_t, \quad E\{[u_t - x_t]^2\} = 2\sigma_x^2, \quad E\{[v_t - x_t]^2\} = 2\sigma_x^2,$$
and for all $t$ and $s$
$$E\{[u_t - x_t][v_s - x_s]\} = 0.$$
Now, let us put
$$u^t = [u_t; u_{t-1}; ...; u_{t-d}], \quad v^t = [v_t; v_{t-1}; ...; v_{t-d}],$$
23 The model (5.74)– (5.76) is referred to as the Errors in Variables model [85] in the statistical literature or Output Error model in the literature on System Identification [173, 218]. In general, statistical inference for such models is difficult—for instance, the parameter estimation problem in such models is ill-posed. The estimate we develop in this exercise can be seen as a special application of the general Instrumental Variables methodology [7, 219, 241].
and let
$$\widehat{A}_K = \frac{1}{K}\sum_{t=1}^K u^t [v^t]^T.$$
2) Prove that $\widehat{A}_K$ is a good empirical approximation of $A_K$:
$$E\{\widehat{A}_K\} = A_K, \quad E\{\|\widehat{A}_K - A_K\|_F^2\} \le \frac{12[d+1]^2[2d+3]\left[M_x^2 + \sigma_x^2\right]\sigma_x^2}{K}, \tag{5.79}$$
the expectation being taken over the distribution of observation noises $\{\xi_t\}$ and auxiliary random sequence $\{\chi_t\}$.
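Solution: Setting
$$\delta u_t = u_t - x_t, \quad \delta v_t = v_t - x_t, \quad \delta u^t = [\delta u_t; \delta u_{t-1}; ...; \delta u_{t-d}], \quad \delta v^t = [\delta v_t; \delta v_{t-1}; ...; \delta v_{t-d}], \quad \Delta = \widehat{A}_K - A_K,$$
we have
$$\Delta = \frac{1}{K}\sum_{t=1}^K \underbrace{\left[\delta u^t[\delta v^t]^T + \delta u^t [x^t]^T + x^t [\delta v^t]^T\right]}_{\Delta_t}.$$
Since the joint distribution of $\{\delta u_t, \delta v_t, -d+1 \le t \le K\}$ is Gaussian and these random variables are mutually uncorrelated, they are mutually independent; besides this, these variables are zero mean with variance $2\sigma_x^2$. As a result, the random matrices $\Delta_t$ have zero mean. Besides this, when $|t-s| > d+1$, the matrices $\Delta_t$ and $\Delta_s$ are independent of each other. We have
$$\begin{array}{rcl}
E\{\|\Delta_t\|_F^2\} &\le& 3\left[E\{\|\delta u^t[\delta v^t]^T\|_F^2\} + E\{\|\delta u^t[x^t]^T\|_F^2\} + E\{\|x^t[\delta v^t]^T\|_F^2\}\right]\\
&=& 3\left[E\Big\{\sum\limits_{t-d\le r,s\le t}[\delta u_r]^2[\delta v_s]^2\Big\} + E\Big\{\sum\limits_{t-d\le r,s\le t}[\delta u_r]^2 x_s^2\Big\} + E\Big\{\sum\limits_{t-d\le r,s\le t} x_r^2[\delta v_s]^2\Big\}\right]\\
&\le& \gamma^2 := 12[d+1]^2\left[M_x^2 + \sigma_x^2\right]\sigma_x^2
\end{array}$$
(recall that $\delta u_t \sim \mathcal{N}(0, 2\sigma_x^2)$, $\delta v_t \sim \mathcal{N}(0, 2\sigma_x^2)$, and $\delta u_t$ and $\delta v_s$ are independent of each other for all $t, s$). It follows that
$$\begin{array}{rcll}
E\{\|\Delta\|_F^2\} &=& \frac{1}{K^2}\sum\limits_{1\le t,s\le K} E\{\mathrm{Tr}(\Delta_t\Delta_s^T)\}&\\
&=& \frac{1}{K^2}\sum\limits_{1\le t,s\le K,\ |t-s|\le d+1} E\{\mathrm{Tr}(\Delta_t\Delta_s^T)\} & \text{[since } \Delta_t, \Delta_s \text{ are zero mean and mutually independent when } |t-s| > d+1\text{]}\\
&\le& \frac{1}{K^2}\sum\limits_{1\le t,s\le K,\ |t-s|\le d+1} \gamma^2 & \text{[by the Cauchy inequality]}\\
&\le& \frac{2d+3}{K}\gamma^2.&
\end{array}$$
✷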
Conclusion. We see that as $K \to \infty$, the differences $\widehat{A}_K - A_K$ and $\widetilde{a}_K - a_K$, for typical realizations, approach 0. It follows that if the sequence $\{x_t\}$ is “rich enough” to ensure that the minimal eigenvalue of $A_K$ for large $K$ stays bounded away from 0, the estimate
$$\widehat{\alpha}_K \in \mathop{\mathrm{Argmin}}_{\beta\in\mathbf{R}^{d+1}} \|\widehat{A}_K\beta - \widetilde{a}_K\|_2^2$$
will converge in probability to the desired vector $\alpha$, and we can even say something reasonable about the rate of convergence. To account for a priori information $\alpha \in \mathcal{X}$, we can modify the estimate by setting
$$\widehat{\alpha}_K \in \mathop{\mathrm{Argmin}}_{\beta\in\mathcal{X}} \|\widehat{A}_K\beta - \widetilde{a}_K\|_2^2.$$
Note that the assumption that the noises affecting observations of the $x_t$'s and $y_t$'s are zero mean Gaussian random variables independent of each other with known dispersions is not that important; we could survive the situation where the samples $\{\omega_t - x_t, t > -d\}$ and $\{\zeta_t - y_t, t \ge 1\}$ are zero mean i.i.d. and independent of each other, with a priori known variance of $\omega_t - x_t$. Under this and mild additional assumptions (like finiteness of the fourth moments of $\omega_t - x_t$ and $\zeta_t - y_t$), the results obtained would be similar to those for the Gaussian case.
Now comes the concluding part of the exercise:
3) To evaluate numerically the performance of the proposed identification scheme, run experiments as follows:
• Given an even value of $d$ and $\rho \in (0, 1]$, select $d/2$ complex numbers $\lambda_i$ at random on the circle $\{z \in \mathbf{C} : |z| = \rho\}$, and build a real polynomial of degree $d$ with roots $\lambda_i$, $\lambda_i^*$ ($*$ here stands for complex conjugation). Build a finite-difference equation (5.74) with this polynomial as the characteristic polynomial.
• Generate i.i.d. $\mathcal{N}(0,1)$ “inputs” $\{y_t, t = 1, 2, ...\}$, select at random initial conditions $x_{-d+1}, x_{-d+2}, ..., x_0$ for the trajectory $\{x_t\}$ of states (5.74), and simulate the trajectory along with observations $\omega_t$ of $x_t$ and $\zeta_t$ of $y_t$, with $\sigma_x$, $\sigma_y$ being the experiment's parameters.
• Look at the performance of the estimate $\widehat{\alpha}_K$ on the simulated data.
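The experiment described in item 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the finite-difference equation is read as $\sum_{i=0}^d \alpha_i x_{t-i} = y_t$ (that is, $[x^t]^T\alpha = y_t$) with a monic characteristic polynomial, the initial conditions are drawn from $\mathcal{N}(0,1)$, the a priori set is taken to be the whole space, and the function names `simulate`, `stack`, and `estimate` are hypothetical.

```python
import numpy as np

def simulate(d, rho, sigma_x, sigma_y, K, rng):
    """Simulate sum_i alpha_i x_{t-i} = y_t and noisy observations omega, zeta."""
    angles = rng.uniform(0.0, 2.0 * np.pi, d // 2)
    lam = rho * np.exp(1j * angles)
    # real monic polynomial with roots lam, conj(lam); alpha[0] == 1
    alpha = np.real(np.poly(np.concatenate([lam, lam.conj()])))
    y = rng.standard_normal(K)                 # inputs y_1..y_K
    x = np.empty(K + d)                        # x[j] stores x_{j-d+1}
    x[:d] = rng.standard_normal(d)             # initial conditions (assumed N(0,1))
    for t in range(1, K + 1):
        i = t + d - 1                          # array index of x_t
        past = x[i - d:i][::-1]                # [x_{t-1}, ..., x_{t-d}]
        x[i] = y[t - 1] - alpha[1:] @ past     # alpha_0 = 1
    omega = x + sigma_x * rng.standard_normal(K + d)
    zeta = y + sigma_y * rng.standard_normal(K)
    return alpha, omega, zeta

def stack(seq, d, K):
    """Rows are the vectors [seq_t; seq_{t-1}; ...; seq_{t-d}], t = 1..K."""
    return np.stack([seq[t - 1:t + d][::-1] for t in range(1, K + 1)])

def estimate(omega, zeta, d, sigma_x, rng):
    """Splitting-based estimate: minimize ||A_hat beta - a_tilde||_2 over beta."""
    K = zeta.size
    chi = rng.standard_normal(omega.size)      # auxiliary noise for splitting
    u, v = omega + sigma_x * chi, omega - sigma_x * chi
    U, V, W = stack(u, d, K), stack(v, d, K), stack(omega, d, K)
    A_hat = U.T @ V / K                        # (1/K) sum_t u^t [v^t]^T
    a_tilde = W.T @ zeta / K                   # (1/K) sum_t zeta_t omega^t
    return np.linalg.lstsq(A_hat, a_tilde, rcond=None)[0]

rng = np.random.default_rng(0)
alpha, omega, zeta = simulate(d=2, rho=0.8, sigma_x=0.1, sigma_y=0.1, K=1000, rng=rng)
alpha_hat = estimate(omega, zeta, d=2, sigma_x=0.1, rng=rng)
print(np.linalg.norm(alpha_hat - alpha))
```

Repeating the last three lines over many seeds and parameter settings gives mean, median, and maximal recovery errors of the kind tabulated in the solution.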
Solution: These are our results (obtained via 40 simulations for every collection of experiment's parameters with $\sigma_x = \sigma_y = \sigma$):

$\|\cdot\|_2$-error of recovery

                       K = 1000                    K = 100,000
  d    σ     ρ     mean     median   max       mean     median   max
  2   1.0   1.0   0.0032   0.0024   0.012     0.0000   0.0000   0.000
  2   1.0   0.8   0.0585   0.0544   0.146     0.0061   0.0055   0.027
  2   0.5   1.0   0.0018   0.0014   0.015     0.0000   0.0000   0.000
  2   0.5   0.8   0.0251   0.0186   0.088     0.0028   0.0027   0.006
  2   0.1   1.0   0.0003   0.0002   0.001     0.0000   0.0000   0.000
  2   0.1   0.8   0.0038   0.0030   0.015     0.0004   0.0004   0.001
  6   1.0   1.0   2.5353   0.0393   47.59     0.5370   0.0002   20.82
  6   1.0   0.8   2.9813   1.0660   19.29     0.6703   0.0532   12.74
  6   0.5   1.0   1.3901   0.0170   27.24     0.5929   0.0001   15.71
  6   0.5   0.8   1.5914   0.4164   19.73     0.0704   0.0185   1.52
  6   0.1   1.0   0.3452   0.0015   13.68     1.8412   0.0000   46.85
  6   0.1   0.8   0.0677   0.0349   0.426     0.0078   0.0025   0.121
 10   1.0   1.0   26.362   0.4579   389.7     8.2724   0.0006   123.5
 10   1.0   0.8   9.9474   2.9212   90.14     9.7451   2.1950   70.97
 10   0.5   1.0   5.0241   0.1142   144.5     16.543   0.0002   179.9
 10   0.5   0.8   9.4643   2.8923   97.38     5.3034   0.2000   105.5
 10   0.1   1.0   26.671   0.0132   241.8     27.462   0.0002   527.3
 10   0.1   0.8   2.9864   0.1886   46.50     3.0461   0.0152   101.3
Notice the substantial discrepancy between the mean and the median of the recovery errors. On close inspection, these differences, as well as the large maximal recovery errors, result from severe ill-conditioning (with condition number $\sim 10^{11}$ or more) of some realizations of the random matrix $A_K$.
Exercise 5.4. [more on generalized linear models] Consider a generalized linear model as follows: we observe i.i.d. random pairs
$$\omega_k = (y_k, \zeta_k) \in \mathbf{R} \times \mathbf{R}^{\nu\times\mu}, \quad k = 1, ..., K,$$
where the conditional expectation of the scalar label $y_k$ given $\zeta_k$ is $\psi(\zeta_k^T z)$, $z$ being an unknown signal underlying the observations. What we know is that $z$ belongs to a given convex compact set $\mathcal{Z} \subset \mathbf{R}^n$. Our goal is to recover $z$. Note that while the estimation problem we have just posed looks similar to those treated in Section 5.2, it cannot be straightforwardly handled via techniques developed in that section unless $\mu = 1$. Indeed, these techniques in the case of $\mu > 1$ require $\psi$ to be a monotone vector field on $\mathbf{R}^\mu$, while our $\psi$ is just a scalar function on $\mathbf{R}^\mu$. The goal of the exercise is to show that when
$$\psi(w) = \sum_{q\in\mathcal{Q}} c_q w^q \equiv \sum_{q\in\mathcal{Q}} c_q w_1^{q_1}\cdots w_\mu^{q_\mu} \quad [c_q \ne 0,\ q \in \mathcal{Q} \subset \mathbf{Z}^\mu_+]$$
is an algebraic polynomial (which we assume from now on), one can use lifting to reduce the situation to that considered in Section 5.2. The construction is straightforward. Let us associate with an algebraic monomial in $\nu$ variables24
$$z^p := z_1^{p_1} z_2^{p_2}\cdots z_\nu^{p_\nu}$$
a real variable $x_p$. For example, the monomial $z_1 z_2$ is associated with $x_{1,1,0,...,0}$, $z_1^2 z_\nu^3$ is associated with $x_{2,0,...,0,3}$, etc. For $q \in \mathcal{Q}$, the contribution of the monomial $c_q w^q$ into $\psi(\zeta^T z)$ is
$$c_q [\mathrm{Col}_1^T[\zeta]z]^{q_1}[\mathrm{Col}_2^T[\zeta]z]^{q_2}\cdots[\mathrm{Col}_\mu^T[\zeta]z]^{q_\mu} = \sum_{p\in\mathcal{P}_q} h_{pq}(\zeta) z_1^{p_1} z_2^{p_2}\cdots z_\nu^{p_\nu},$$
where $\mathcal{P}_q$ is a properly built set of multi-indices $p = (p_1, ..., p_\nu)$, and $h_{pq}(\zeta)$ are easily computable functions of $\zeta$. Consequently,
$$\psi(\zeta^T z) = \sum_{q\in\mathcal{Q}}\sum_{p\in\mathcal{P}_q} h_{pq}(\zeta) z^p = \sum_{p\in\mathcal{P}} H_p(\zeta) z^p,$$
with properly selected finite set $\mathcal{P}$ and readily given functions $H_p(\zeta)$, $p \in \mathcal{P}$. We can always take, as $\mathcal{P}$, the set of all $\nu$-entry multi-indices with the sum of entries not exceeding $d$, where $d$ is the total degree of the polynomial $\psi$. This being said, the structure of $\psi$ and/or the common structure, if any, of regressors $\zeta_k$ can force some of the functions $H_p(\cdot)$ to be identically zero. When this is the case, it makes sense to eliminate the corresponding “redundant” multi-index $p$ from $\mathcal{P}$.
Next, consider the mapping $x[z]$ which maps a vector $z \in \mathbf{R}^\nu$ into a vector with entries $x_p[z] = z^p$ indexed by $p \in \mathcal{P}$, and let us associate with our estimation problem its “lifting” with observations
$$\overline{\omega}_k = (y_k, \eta_k = \{H_p(\zeta_k), p \in \mathcal{P}\}).$$
That is, the new observations are deterministic transformations of the actual observations; observe that the new observations are still i.i.d., and the conditional expectation of $y_k$ given $\eta_k$ is nothing but
$$\sum_{p\in\mathcal{P}} [\eta_k]_p x_p[z].$$
In our new situation, the “signal” underlying observations is a vector from $\mathbf{R}^N$, $N = \mathrm{Card}(\mathcal{P})$, the regressors are vectors from the same $\mathbf{R}^N$, and regression is linear: the conditional expectation of the label $y_k$ given regressor $\eta_k$ is the linear function $\eta_k^T x$ of our new signal. Given a convex compact localizer $\mathcal{Z}$ for the “true signal” $z$, we can in many ways find a convex compact localizer $\mathcal{X}$ for $x = x[z]$. Thus, we find ourselves in the simplest possible case of the situation considered in Section 5.2 (one with scalar $\phi(s) \equiv s$), and can apply the estimation procedures developed in this section. Note that in the “lifted” problem the SAA estimate $\widehat{x}(\cdot)$ of the lifted signal $x = x[z]$ is nothing but the standard Least Squares:
$$\widehat{x}(\omega^K) \in \mathop{\mathrm{Argmin}}_{x\in\mathcal{X}}\left\{\frac{1}{2}x^T\Big[\sum_{k=1}^K \eta_k\eta_k^T\Big]x - \Big[\sum_{k=1}^K y_k\eta_k\Big]^T x\right\} = \mathop{\mathrm{Argmin}}_{x\in\mathcal{X}} \sum_k (y_k - \eta_k^T x)^2. \tag{5.80}$$
24 Note that factors in the monomial are ordered according to the indices of the variables.
Of course, there is no free lunch, and there are some potential complications:
• It may happen that the matrix $H = E_{\eta\sim Q}\{\eta\eta^T\}$ ($Q$ is the common distribution of “artificial” regressors $\eta_k$ induced by the common distribution of the actual regressors $\zeta_k$) is not positive definite, which would make it impossible to recover well the signal $x[z]$ underlying our transformed observations, however large $K$ may be;
• Even when $H$ is positive definite, so that $x[z]$ can be recovered well provided $K$ is large, we still need to recover $z$ from $x[z]$, that is, to solve a system of polynomial equations, which can be difficult; besides, this system can have more than one solution;
• Even when the above difficulties can be somehow avoided, “lifting” $z \to x[z]$ typically increases significantly the number of parameters to be identified, which, in turn, deteriorates “finite time” accuracy bounds.
Note also that when $H$ is not positive definite, this still is not the end of the world. Indeed, $H$ is positive semidefinite; assuming that it has a nontrivial kernel $L$ which we can identify, a realization $\eta_k$ of our artificial regressor is orthogonal to $L$ with probability 1, implying that when replacing the artificial signal $x$ with its orthogonal projection onto $L^\perp$, we almost surely keep the value of the objective in (5.80) intact. Thus, we lose nothing when restricting the optimization domain in (5.80) to the orthogonal projection of $\mathcal{X}$ onto $L^\perp$. Since the restriction of $H$ onto $L^\perp$ is positive definite, with this approach, for large enough values of $K$ we will still get a good approximation of the projection of $x[z]$ onto $L^\perp$. With luck, this approximation, taken together with the fact that the “artificial signal” we are looking for is not an arbitrary vector from $\mathcal{X}$ (it is of the form $x[z]$ for some $z \in \mathcal{Z}$), will allow us to get a good approximation of $z$.
Here is the first part of the exercise:
1) Carry out the outlined approach in the situation where
• The common distribution $\Pi$ of regressors $\zeta_k$ has density w.r.t. the Lebesgue measure on $\mathbf{R}^{\nu\times\mu}$ and possesses finite moments of all orders;
• $\psi(w)$ is a quadratic form, either (case A) homogeneous,
$$\psi(w) = w^T S w \quad [S \ne 0],$$
or (case B) inhomogeneous:
$$\psi(w) = w^T S w + s^T w \quad [S \ne 0,\ s \ne 0];$$
• The labels are linked to the regressors and to the true signal $z$ by the relation $y_k = \psi(\zeta_k^T z) + \chi_k$, where the $\chi_k \sim \mathcal{N}(0,1)$ are mutually independent and independent of the regressors.
Solution: Case A: Formally, we should set $\mathcal{P} = \{p = (p_1, ..., p_\nu) \in \mathbf{Z}^\nu_+ : \sum_i p_i = 2\}$ and write some rather messy expressions for $h_p(\zeta)$, $p \in \mathcal{P}$. To save notation, it is much better to implement our approach “from scratch”, namely,
• to associate with $z \in \mathbf{R}^\nu$ a $\nu\times\nu$ symmetric matrix $x[z] = zz^T$;
• to write the relation between $y_k$, $\eta_k$, and $z$ as
$$\begin{array}{rcl}
y_k &=& [\zeta_k^T z]^T S [\zeta_k^T z] + \chi_k = \mathrm{Tr}(S[\zeta_k^T z][\zeta_k^T z]^T) + \chi_k = \mathrm{Tr}(S\zeta_k^T x[z]\zeta_k) + \chi_k\\
&=& \mathrm{Tr}(\underbrace{\zeta_k S\zeta_k^T}_{\eta_k} x[z]) + \chi_k.
\end{array}$$
Thus, we arrive at a linear regression problem where the observed labels $y_k$ and “artificial regressors” $\eta_k = \zeta_k S\zeta_k^T \in \mathbf{S}^\nu$ are linked by the relation
$$y_k = \langle\eta_k, x\rangle_F + \chi_k$$
($\langle\cdot,\cdot\rangle_F$ is the Frobenius inner product), the signal $x$ underlying the observations lives in $\mathbf{S}^\nu$, and $\eta_k$, $\chi_k$, $k = 1, ..., K$, are i.i.d. Positive definiteness of the matrix $H$ is equivalent to the fact that whenever $P \in \mathbf{S}^\nu$ and $P \ne 0$, the expectation of the random quantity $\langle\eta_k, P\rangle_F^2$ is positive; we are about to prove that this indeed is the case. Note that the latter quantity is nothing but $f(\zeta_k) := [\mathrm{Tr}(\zeta_k S\zeta_k^T P)]^2 \ge 0$, and we claim that the set $Z$ of those $\zeta$ for which $f(\zeta) = 0$ is of Lebesgue measure zero. Indeed, $f(\zeta)$ is a polynomial in $\zeta$; as such, it is either identically zero, or the set of its zeros is of Lebesgue measure 0 (this simple claim can be easily verified by induction on the number of variables in the polynomial). In the first case $\mathrm{Tr}(\zeta S\zeta^T P) = 0$ for all $\zeta$; assuming that this is the case and taking the derivative in $\zeta$ along a direction $d$, we get $\mathrm{Tr}(dS\zeta^T P + \zeta S d^T P) = 0$ identically in $d, \zeta$. Taking the derivative in $\zeta$ again along a direction $e$, we get $\mathrm{Tr}(dSe^T P + eSd^T P) = 0$, or, which is the same due to the symmetry of $S$ and $P$, $\mathrm{Tr}(dSe^T P) = 0$ identically in $d, e$, whence $Se^T P = 0$ identically in $e \in \mathbf{R}^{\nu\times\mu}$. The latter clearly contradicts the fact that $P$ and $S$ are nonzero. Indeed, assuming that $g \in \mathbf{R}^\nu$ and $f \in \mathbf{R}^\mu$ are such that $Pg \ne 0$ and $Sf \ne 0$ and setting $e = [Pg]f^T$, we get $Se^T Pg = Sf[Pg]^T Pg = \|Pg\|_2^2\, Sf \ne 0$, which is the desired contradiction. Thus, our $f(\zeta) \ge 0$ vanishes only on a set of Lebesgue measure zero, and since the distribution of $\zeta_k$ has a density w.r.t. the Lebesgue measure, the expected value of $f(\zeta_k)$ is positive, as claimed.
From positive definiteness of $H$ it follows that for large $K$ we can recover $x[z]$ well; since $x[z] = zz^T$, this means that for large $K$ we can also recover $z$ up to multiplication of $z$ by $-1$. The resulting ambiguity in recovering $z$ is quite natural: our observations do not distinguish between the cases where the signal is $z$ or $-z$.
Case B. Here again, to save notation, it makes sense to repeat the derivation “from scratch.” Setting $x[z] = (zz^T, z) \in \mathbf{S}^\nu \times \mathbf{R}^\nu$, we have
$$\begin{array}{rcl}
y_k &=& [\zeta_k^T z]^T S[\zeta_k^T z] + s^T\zeta_k^T z + \chi_k = \mathrm{Tr}(\zeta_k S\zeta_k^T[zz^T]) + [\zeta_k s]^T z + \chi_k\\
&=& \langle\underbrace{(\zeta_k S\zeta_k^T, \zeta_k s)}_{\eta_k}, x[z]\rangle + \chi_k,
\end{array}$$
with the inner product on the space $E := \mathbf{S}^\nu \times \mathbf{R}^\nu$ where the “lifted signal” $x[z]$ lives given by
$$\langle(P, p), (Q, q)\rangle = \mathrm{Tr}(PQ) + p^T q.$$
Positive definiteness of the matrix $H$ is nothing but the fact that whenever $(P, p) \in E$ is nonzero, we have
$$E\{\langle(\zeta_k S\zeta_k^T, \zeta_k s), (P, p)\rangle^2\} > 0,$$
and we are about to demonstrate that this indeed is the case. Similarly to Case A, all we need to do is to lead to a contradiction the assumption that the polynomial $f(\zeta) = \mathrm{Tr}(\zeta S\zeta^T P) + s^T\zeta^T p$ is identically in $\zeta \in \mathbf{R}^{\nu\times\mu}$ equal to 0. Assuming that the latter is the case and differentiating $f(\zeta)$ in $\zeta$ along a direction $d$, we get
$$\mathrm{Tr}(dS\zeta^T P) + \mathrm{Tr}(\zeta Sd^T P) + s^T d^T p = 0 \quad \forall \zeta, d \in \mathbf{R}^{\nu\times\mu}.$$
The latter is impossible when $p \ne 0$ (set $\zeta = 0$ and note that $s \ne 0$). When $p = 0$, we have $P \ne 0$, whence $f(\zeta)$ is not identically zero by the same argument as in
Case A. The consequences are completely similar to those for Case A, with the only (and pleasant) difference that now $z$ is a component of $x[z]$, so that good recovery of $x[z]$ automatically implies good recovery of $z$.
Now comes the concluding part of the exercise, where you are supposed to apply the approach we have developed to the situation as follows:
You are given a DC electric circuit comprised of resistors, that is, a connected oriented graph with $m$ nodes and $n$ arcs $\gamma_j = (s_j, t_j)$, $1 \le j \le n$, where $1 \le s_j, t_j \le m$ and $s_j \ne t_j$ for all $j$; arcs $\gamma_j$ are assigned resistances $R_j > 0$ known to us. At instant $k = 1, 2, ..., K$, “nature” specifies “external currents” (charge flows from the “environment” into the circuit) $s_1, ..., s_m$ at the nodes; these external currents specify currents in the arcs and voltages at the nodes, and consequently, the power dissipated by the circuit. Note that nature cannot be completely free in generating the external currents: their total should be zero. As a result, all that matters is the vector $s = [s_1; ...; s_{m-1}]$ of external currents at the first $m-1$ nodes, due to $s_m \equiv -[s_1 + ... + s_{m-1}]$. We assume that the mechanism of generating the vector of external currents at instant $k$ (let this vector be denoted by $s^k \in \mathbf{R}^{m-1}$) is as follows. There are somewhere $m-1$ sources producing currents $z_1, ..., z_{m-1}$. At time $k$ nature selects a one-to-one correspondence $i \mapsto \pi_k(i)$, $i = 1, ..., m-1$, between these sources and the first $m-1$ nodes of the circuit, and “forwards” current $z_{\pi_k(i)}$ to node $i$:
$$s^k_i = z_{\pi_k(i)}, \quad 1 \le i \le m-1.$$
For the sake of definiteness, assume that the permutations $\pi_k$ of $1, ..., m-1$, $k = 1, ..., K$, are i.i.d. drawn from the uniform distribution on the set of $(m-1)!$ permutations of $m-1$ elements. Assume that at time instants $k = 1, ..., K$ we observe the permutations $\pi_k$ and noisy measurements of the power dissipated at this instant by the circuit; given those observations, we want to recover the vector $z$. Here is your task:
2) Assuming the noises in the dissipated power measurements to be zero mean Gaussian with variance $\sigma^2$, independent of each other and of the $\pi_k$, apply to the estimation problem in question the approach developed in item 1 of the exercise and run numerical experiments.
Solution: Let us associate with our circuit its incidence matrix $B$, the $m\times n$ matrix whose rows are indexed by the nodes and columns by the arcs; the $j$-th column has just two nonzero entries, namely, entry $+1$ in row $s_j$ ($s_j$ is the starting node of arc $\gamma_j$), and entry $-1$ in row $t_j$ ($t_j$ is the ending node of the arc). Let node $m$ be the ground node, and let $A$ be the $(m-1)\times n$ matrix comprised of the first $\mu := m-1$ rows of $B$. Let now $R$ be the diagonal matrix with diagonal entries $R_j$, $1 \le j \le n$, and $v$ be the vector of voltages at the first $\mu$ nodes, the voltage at the ground node being zero. Let $s$ be the vector of external currents at the first $\mu$ nodes, and let $\iota$ be the vector of currents in arcs. Finally, let $P$ be the dissipated power. The Kirchhoff Laws say that
$$\begin{array}{rcll}
R\iota &=& A^T v & \text{[Ohm's Law]}\\
A\iota &=& s & \text{[current conservation]}\\
P &=& \iota^T A^T v.&
\end{array}$$
Note that $A$ is of rank $\mu$ (recall that our circuit is connected). As a result, we get
$$P = s^T S s, \quad S = (AR^{-1}A^T)^{-1}.$$
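The passage from the Kirchhoff Laws to $P = s^T S s$ can be checked numerically. Below is a minimal sketch on a small circuit; the 4-node, 5-arc graph, the resistances, and the external currents are hypothetical values chosen only for illustration (nodes are numbered from 0, the last node being the ground).

```python
import numpy as np

# Hypothetical connected circuit: arcs (start, end), nodes 0..3, node 3 is the ground
arcs = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
m, n = 4, len(arcs)

B = np.zeros((m, n))                       # incidence matrix
for j, (start, end) in enumerate(arcs):
    B[start, j] = 1.0                      # +1 at the starting node of arc j
    B[end, j] = -1.0                       # -1 at the ending node of arc j

A = B[:-1]                                 # first mu = m - 1 rows (ground row dropped)
R = np.diag([1.0, 1.5, 2.0, 1.2, 1.8])     # arc resistances (assumed values)
S = np.linalg.inv(A @ np.linalg.inv(R) @ A.T)

s = np.array([1.0, -0.5, 0.25])            # external currents at the first mu nodes
v = S @ s                                  # node voltages: (A R^{-1} A^T) v = s
iota = np.linalg.solve(R, A.T @ v)         # Ohm's Law: R iota = A^T v

power_direct = iota @ R @ iota             # sum_j R_j iota_j^2
power_formula = s @ S @ s                  # P = s^T S s
print(power_direct, power_formula)
```

The two printed numbers should agree up to rounding, since $\iota^T R\iota = \iota^T A^T v = s^T v = s^T S s$.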
Now let us associate with a permutation $\pi$ the $\mu\times\mu$ permutation matrix $D_\pi$; the only nonzero entry in the $\pi(i)$-th row of the matrix is in cell $i$ and is equal to 1, so that $(D_\pi^T x)_i = x_{\pi(i)}$, $i = 1, ..., \mu$. In other words, the vector $s^k$ of external currents at time $k$ is $D_{\pi_k}^T z$, so that our $k$-th label is
$$y_k = [D_{\pi_k}^T z]^T S[D_{\pi_k}^T z] + \chi_k = \langle\underbrace{D_{\pi_k}SD_{\pi_k}^T}_{\eta_k}, \underbrace{zz^T}_{x[z]}\rangle_F + \chi_k,$$
where $\chi_k \sim \mathcal{N}(0, \sigma^2)$ are observation noises independent of each other and of $\pi_1, ..., \pi_K$. Observe that the matrix $H$ in our situation is positive semidefinite, but not positive definite. Indeed, all matrices $\eta_k$ inherit from $S$ the trace and the sum of all elements. In other words, all realizations of our artificial regressors belong to the affine plane of codimension 2 in $\mathbf{S}^\mu$ associated with $S$, and thus are orthogonal to a nonzero vector $f$ readily given by this affine plane. Since all realizations of $\eta_k$ are orthogonal to $f$, $f$ is in the kernel of $H$. An “educated guess” (fully supported by numerical evidence) is that the kernel of $H$ is the line $\mathbf{R}f$. Denoting by $E$ the orthogonal complement to this line in $\mathbf{S}^\mu$, this is what we used when building our Least Squares estimate
$$\widehat{x}(\{y_k, \pi_k\}_{k\le K}) \in \mathop{\mathrm{Argmin}}_{x\in E,\ \|x\|_F\le M}\left\{\frac{1}{2}\mathrm{Vec}^T[x]\Big[\sum_{k=1}^K \mathrm{Vec}[\eta_k]\mathrm{Vec}^T[\eta_k]\Big]\mathrm{Vec}[x] - \Big[\sum_{k=1}^K y_k\mathrm{Vec}[\eta_k]\Big]^T\mathrm{Vec}[x]\right\}.$$
Here for $x \in \mathbf{S}^\mu$, $\mathrm{Vec}[x]$ is the $\mu^2$-dimensional vector obtained by vertical concatenation of the columns of $x$, and $M$ is an a priori upper bound on the Frobenius norm of $x[z]$ for $z \in \mathcal{Z}$. After $\widehat{x}$ is built, we can try to recover $x[z]$ by looking at the points of the line $\widehat{x} + \mathbf{R}f$ and selecting on this line a “nearly rank 1” matrix; the leading eigenvector of this matrix, multiplied by the square root of the corresponding eigenvalue, is our estimate $\widehat{z}$ of $z$ (or, better said, “$\pm\widehat{z}$ is our estimate of $\pm z$”: in our problem, observations do not allow us to distinguish between $z$ and $-z$).
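The structural facts used above (the lifting identity, and the invariance of the trace and of the sum of entries of $\eta_k$) can be verified by enumeration on a small instance. This is a sketch under assumptions: $S$ is a randomly generated symmetric positive definite matrix standing in for $(AR^{-1}A^T)^{-1}$, and the explicit direction `f_dir` $= (\mathbf{1}^T S\mathbf{1})I - \mathrm{Tr}(S)\mathbf{1}\mathbf{1}^T$ is one convenient choice of a matrix orthogonal to all regressors, derived here from the two invariants rather than quoted from the text.

```python
import numpy as np
from itertools import permutations

def perm_matrix(pi):
    """D_pi with (D_pi^T x)_i = x_{pi(i)}: row pi(i) has its single 1 in column i."""
    mu = len(pi)
    D = np.zeros((mu, mu))
    for i, target in enumerate(pi):
        D[target, i] = 1.0
    return D

rng = np.random.default_rng(0)
mu = 3
G = rng.standard_normal((mu, mu))
S = G @ G.T + np.eye(mu)              # hypothetical symmetric positive definite S
z = rng.standard_normal(mu)
x_lift = np.outer(z, z)               # lifted signal x[z] = z z^T

ones = np.ones(mu)
# <I, eta> = Tr(S) and <11^T, eta> = 1^T S 1 for every regressor eta, hence
# f_dir below has zero Frobenius inner product with every eta.
f_dir = (ones @ S @ ones) * np.eye(mu) - np.trace(S) * np.outer(ones, ones)

checks = []
for pi in permutations(range(mu)):
    D = perm_matrix(pi)
    eta = D @ S @ D.T                 # artificial regressor
    s_k = D.T @ z                     # external currents for this permutation
    checks.append((
        np.isclose(s_k @ S @ s_k, np.sum(eta * x_lift)),  # noiseless label = <eta, x[z]>_F
        np.isclose(np.trace(eta), np.trace(S)),           # trace is inherited from S
        np.isclose(eta.sum(), S.sum()),                   # so is the sum of all entries
        np.isclose(np.sum(eta * f_dir), 0.0),             # eta is orthogonal to f_dir
    ))
print(all(all(c) for c in checks))
```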
Our sample numerical results are as follows:

  K     100      1000     10000
  δ     0.0356   0.0107   0.0047

  K     100      1000     10000
  δ     0.0668   0.0391   0.0270

δ: $\|\cdot\|_2$-recovery error

In the reported experiments, we used $\sigma = 0.1$ and $M = 40$. The signal $z$ was drawn at random from $\mathcal{N}(0, I_4)$, and the resistances were drawn at random from the uniform distribution on $[1, 2]$.
Exercise 5.5 [shift estimation] Consider the situation as follows: given a continuous vector field $f(u) : \mathbf{R}^m \to \mathbf{R}^m$ which is strongly monotone on bounded subsets of $\mathbf{R}^m$ and a convex compact set $S \subset \mathbf{R}^m$, we observe in noise vectors $f(p - s)$, where $p \in \mathbf{R}^m$ is an observation point known to us, and $s \in \mathbf{R}^m$ is a shift unknown to us and known to belong to $S$. Precisely, assume that our observations are
$$y_k = f(p_k - s) + \xi_k, \quad k = 1, ..., K,$$
where $p_1, ..., p_K$ is a deterministic sequence known to us, and $\xi_1, ..., \xi_K$ are $\mathcal{N}(0, \gamma^2 I_m)$ observation noises independent of each other. Our goal is to recover the shift $s$ from observations $y_1, ..., y_K$.
1. Pose the problem as a single-observation version of the estimation problem from Section 5.2.
2. Assuming $f$ to be strongly monotone, with modulus $\kappa > 0$, on the entire space, what is the error bound for the SAA estimate?
3. Run simulations in the case of $m = 2$, $S = \{u \in \mathbf{R}^2 : \|u\|_2 \le 1\}$ and
$$f(u) = \begin{bmatrix} 2u_1 + \sin(u_1) + 5u_2\\ 2u_2 - \sin(u_2) - 5u_1\end{bmatrix}.$$
Note: Field $f(\cdot)$ is not potential; this is the monotone vector field associated with the strongly convex-concave function
$$\psi(u) = u_1^2 - \cos(u_1) - u_2^2 - \cos(u_2) + 5u_1u_2,$$
so that $f(u) = \left[\frac{\partial}{\partial u_1}\psi(u); -\frac{\partial}{\partial u_2}\psi(u)\right]$. Compare the actual recovery errors with their theoretical upper bounds.
4. Think about what can be done when our observations are
$$y_k = f(Ap_k - s) + \xi_k, \quad 1 \le k \le K,$$
with known $p_k$, noises $\xi_k \sim \mathcal{N}(0, \gamma^2 I_2)$ independent across $k$, and unknown $A$ and $s$ which we want to recover.
Solution: Item 1: Let us associate with $p \in \mathbf{R}^m$ the $(m+1)\times m$ matrix (“regressor”) $\eta[p] = [p^T; -I_m]$, and with a shift $s \in \mathbf{R}^m$ the signal $x[s] = [1; s]$. Then our observations become
$$y_k = f(\underbrace{\eta^T[p_k]}_{\eta_k^T} x[s]) + \xi_k, \quad 1 \le k \le K,$$
and we find ourselves in the situation considered in Section 5.2.6, the signal set being $\mathcal{X} = \{[1; s] : s \in S\}$.
Item 2: The SAA estimate $\widehat{x} = \widehat{x}(y)$ is the weak solution to $\mathrm{VI}(G_y, \mathcal{X})$, where
$$G_y(z) = \sum_k \left[\eta_k f(\eta_k^T z) - \eta_k y_k\right], \quad \mathcal{X} = \{z = [1; s] : s \in S\}.$$
Let us lower-bound the constant of strong monotonicity of $G_y(\cdot)$ on $\mathcal{X}$. Let $z = [1; r]$, $z' = [1; r']$ be two points from $\mathcal{X}$. We have
$$\begin{array}{rcl}
[G_y(z) - G_y(z')]^T[z - z'] &=& \sum_k \left[f(\eta_k^T z) - f(\eta_k^T z')\right]^T\eta_k^T[z - z']\\
&=& \sum_k [f(p_k - r) - f(p_k - r')]^T[r' - r]\\
&=& \sum_k [f(p_k - r) - f(p_k - r')]^T[[p_k - r] - [p_k - r']]\\
&\ge& \sum_k \kappa\|[p_k - r] - [p_k - r']\|_2^2 = K\kappa\|r - r'\|_2^2 = K\kappa\|z - z'\|_2^2,
\end{array}$$
so that $G_y$ is strongly monotone on $\mathcal{X}$ with modulus $K\kappa$. Now let us check the validity of Assumption A.3′ from Section 5.2.6. In the notation from this assumption, we have
$$\eta[y - \phi(\eta^T x[s])] = \sum_k \eta_k[y_k - f(\eta_k^T[1; s])] = \sum_k \eta_k[y_k - f(p_k - s)] = \sum_k \eta_k\xi_k = \sum_k [p_k^T; -I_m]\xi_k,$$
whence
$$E\left\{\|\eta[y - \phi(\eta^T x[s])]\|_2^2\right\} = \gamma^2\sum_k\left[\|p_k\|_2^2 + m\right].$$
Thus, Assumption A.3′ is satisfied with
$$\sigma = \gamma\sqrt{\sum_k\left[\|p_k\|_2^2 + m\right]},$$
and the error bound from Proposition 5.14 reads
$$E\{\|\widehat{x}(y) - [1; s]\|_2^2\} \le \frac{1}{\kappa^2}\cdot\frac{\gamma^2\left[m + \frac{1}{K}\sum_k\|p_k\|_2^2\right]}{K}.$$
Item 3: Let us compute the modulus of strong monotonicity of our vector field. The Jacobian of the field at a point $u$ is
$$\begin{bmatrix} 2 + \cos(u_1) & 5\\ -5 & 2 - \cos(u_2)\end{bmatrix},$$
and its symmetric part is $\succeq I_2$, implying that the field is strongly monotone with modulus 1 on the entire space. Consequently, the theoretical upper bound on the recovery error for the SAA estimate as given by Proposition 5.14 becomes
$$\sqrt{E\{\|\widehat{x} - s\|_2^2\}} \le \rho := \frac{\gamma\sqrt{m + \frac{1}{K}\sum_k\|p_k\|_2^2}}{\sqrt{K}}.$$
The numerical experiments we are about to report were organized as follows. In a particular experiment, the “true” shift $s$ was drawn at random from $\mathcal{N}(0, I_2)$ and then projected onto the set $S = \{s \in \mathbf{R}^2 : \|s\|_2 \le 5\}$. The observation points $p_1, ..., p_K$ were drawn at random, independently of each other, from the uniform distribution on the boundary of $S$, and the noise level $\gamma$ was set to 1. In all experiments, the number of observations $K$ was set to 100. The median of the recovery error $\|\widehat{x} - s\|_2$ in the 10 experiments we have carried out was as small as 0.019; the largest observed ratio of the actual recovery error to its theoretical upper bound $\rho$ was 0.089.
Item 4: Similarly to the case of an unknown shift, our measurements are of the form
$$y_k = f(\eta^T[p_k]x) + \xi_k,$$
where $\eta_k = \eta[p_k]$ is a properly defined $6\times 2$ matrix affinely depending on $p_k$, and $x$ is the six-dimensional vector obtained by writing in a single column the two columns of $A$ and the shift $s$. We can now build the SAA estimate of $x$ similarly to what was done in the case when $A$ was the identity.
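The experiment of Item 3 can be reproduced along the following lines. This is a sketch under assumptions, not the authors' solver: the variational inequality in $s$ is solved by a plain projected iteration $s \leftarrow \mathrm{Proj}_S(s - \tau\bar{F}(s))$ with $\bar{F}(s) = \frac{1}{K}\sum_k[y_k - f(p_k - s)]$, which is the $s$-part of $G_y$ divided by $K$. The step $\tau = 1/68$ relies on $\bar{F}$ being strongly monotone with modulus 1 and Lipschitz with constant $L$, $L^2 \le 68$ (the trace of $J^TJ$ for the Jacobian $J$ above is at most $9 + 9 + 50$). The names `field`, `saa_shift`, and `bound_rho` are hypothetical.

```python
import numpy as np

def field(u):
    """Vector field f from item 3, applied row-wise to an array of shape (K, 2)."""
    u1, u2 = u[:, 0], u[:, 1]
    return np.column_stack([2 * u1 + np.sin(u1) + 5 * u2,
                            2 * u2 - np.sin(u2) - 5 * u1])

def project_ball(s, radius):
    """Euclidean projection onto the ball {s : ||s||_2 <= radius}."""
    norm = np.linalg.norm(s)
    return s if norm <= radius else s * (radius / norm)

def saa_shift(p, y, radius, iters=5000, step=1.0 / 68.0):
    """Projected iteration for the VI defining the SAA estimate of the shift."""
    s = np.zeros(2)
    for _ in range(iters):
        F = np.mean(y - field(p - s), axis=0)   # monotone operator in s
        s = project_ball(s - step * F, radius)
    return s

def bound_rho(p, gamma):
    """Theoretical bound rho = gamma * sqrt(m + mean ||p_k||^2) / sqrt(K)."""
    K, m = p.shape
    return gamma * np.sqrt(m + np.mean(np.sum(p ** 2, axis=1))) / np.sqrt(K)

rng = np.random.default_rng(0)
K, gamma, radius = 100, 1.0, 5.0
s_true = project_ball(rng.standard_normal(2), radius)
angles = rng.uniform(0.0, 2.0 * np.pi, K)
p = radius * np.column_stack([np.cos(angles), np.sin(angles)])  # points on the boundary of S
y = field(p - s_true) + gamma * rng.standard_normal((K, 2))

s_hat = saa_shift(p, y, radius)
print(np.linalg.norm(s_hat - s_true), bound_rho(p, gamma))
```

With observation points on the circle of radius 5, $m = 2$, $\gamma = 1$, and $K = 100$, the bound evaluates to $\rho = \sqrt{27}/10 \approx 0.52$.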
Appendix: Executive Summary on Efficient Solvability of Convex Optimization Problems
Convex Programming is a “solvable case” in Optimization: under mild computability and boundedness assumptions, a globally optimal solution to a convex optimization problem can be “approximated to any desired accuracy in reasonable time.” The goal of what follows is to provide a reader with an understanding, “sufficient for all practical purposes,” of what these words mean.1
In the sequel we are interested in computational tractability of a convex optimization problem in the form
$$\mathrm{Opt} = \min_{x\in B^n_R}\{f_0(x) : f_i(x) \le 0,\ 1 \le i \le m\}, \quad B^n_R = \{x \in \mathbf{R}^n : \|x\|_2 \le R\}. \tag{1}$$
We always assume that $f_i(\cdot)$, $0 \le i \le m$, are convex and continuous real-valued functions on $B^n_R$, and what we are looking for is an $\epsilon$-accurate solution to the problem, that is, a point $x_\epsilon \in B^n_R$ such that
$$\begin{array}{rcll}
f_i(x_\epsilon) &\le& \epsilon,\ i = 1, ..., m & \text{[}\epsilon\text{-feasibility]}\\
f_0(x_\epsilon) &\le& \mathrm{Opt} + \epsilon & \text{[}\epsilon\text{-optimality]}
\end{array}$$
where $\epsilon > 0$ is a given tolerance. Note that when (1) is infeasible, i.e., $\mathrm{Opt} = +\infty$, every point of $B^n_R$ is an $\epsilon$-optimal (but not necessarily $\epsilon$-feasible, and thus not necessarily $\epsilon$-accurate) solution.
We intend to provide two versions of what “efficient solvability” means: “practical” and “scientific.”
“Practical” version of efficient solvability: For most practical purposes, efficient solvability means that we can feed (1) to cvx, that is, rewrite the problem in the form
$$\mathrm{Opt} = \min_{x,u}\left\{c^T[x; u] : \mathcal{A}(x, u) \succeq 0\right\}, \tag{2}$$
where $\mathcal{A}(x, u)$ is a symmetric matrix which is affine in $[x; u]$.
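To make the form (2) concrete, here is one small worked instance (our own illustration, not taken from the text): minimizing a linear objective $c^Tx$ over the ball $B^n_R$ with $R > 0$. By the Schur Complement Lemma, $\|x\|_2 \le R$ holds if and only if the $(n+1)\times(n+1)$ matrix below is positive semidefinite, so this problem is already of the form (2), with no additional variables $u$ needed.

```latex
\min_{x}\left\{ c^T x : \|x\|_2 \le R \right\}
\;=\;
\min_{x}\left\{ c^T x :
\mathcal{A}(x) := \begin{bmatrix} R\, I_n & x \\ x^T & R \end{bmatrix} \succeq 0 \right\},
\qquad\text{since}\quad
\begin{bmatrix} R\, I_n & x \\ x^T & R \end{bmatrix} \succeq 0
\;\Longleftrightarrow\;
R - x^T (R\, I_n)^{-1} x \ge 0
\;\Longleftrightarrow\;
\|x\|_2^2 \le R^2 .
```

The matrix $\mathcal{A}(x)$ is symmetric and affine in $x$, as (2) requires; more general convex constraints typically need auxiliary variables $u$ to admit such a representation.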
“Scientific” version of efficient solvability. Let us start with the following basic fact:
(!) Assume that when solving (1), we have at our disposal $R$, a real $V$ such that
$$\max_{x\in B^n_R}|f_i(x)| \le V/2, \tag{3}$$
1 For rigorous treatment of this subject in the context of continuous optimization (this is what we deal with in our book) the reader is referred to [15, Chapter 5].
and access to a First Order oracle, that is, a “black box” which, given on input a query point $x \in B^n_R$ and tolerance $\delta > 0$, returns $\delta$-subgradients of $f_i$ at $x$, that is, affine functions $g_{i,x}(\cdot)$, $0 \le i \le m$, such that
$$g_{i,x}(y) \le f_i(y)\ \forall y \in B^n_R \quad \& \quad g_{i,x}(x) \ge f_i(x) - \delta, \quad 0 \le i \le m.$$
Then for every $\epsilon \in (0, V)$, an $\epsilon$-accurate solution to (1), or a correct claim that the problem is infeasible, can be found by an appropriate algorithm (e.g., the Ellipsoid method) at cost at most
$$N(\epsilon) = \rfloor 2n^2\ln(2V/\epsilon)\lfloor + 1 \tag{4}$$
subsequent steps, with
• at most one call to the First Order oracle per step, the input at the $t$-th step being $(x_t, \epsilon/2)$, with $x_1 = 0$ and recursively computed $x_2, ..., x_{N(\epsilon)}$;
• at most $O(1)n(m+n)$ operations of precise real arithmetic per step needed to update $x_t$ and the output of the First Order oracle (if the latter was invoked at step $t$) into $x_{t+1}$.
The remaining computational effort when executing the algorithm is just $O(1)N(\epsilon)n$ operations of precise real arithmetic needed to convert the trajectory $x_1, ..., x_{N(\epsilon)}$ and the outputs of the First Order oracle into the result of the computation: either an $\epsilon$-accurate solution to (1), or a correct infeasibility claim.
The consequences are as follows. Consider a family $\mathcal{F}$ of convex functions of a given structure, so that every function $f$ in the family can be identified by a finite-dimensional data vector $\mathrm{Data}(f)$. For the sake of simplicity, assume that all functions from the family are real-valued (extensions to partially defined functions are straightforward). Examples include (but, of course, do not reduce to)
1. affine functions of $n = 1, 2, ...$ variables,
2. convex quadratic functions of $n = 1, 2, ...$ variables,
3. (logarithms of) posynomials: functions of $n = 1, 2, ...$ variables of the form $\ln\left(\sum_{i=1}^m \exp\{\phi_i(x)\}\right)$, each with its own $m$, with affine $\phi_i$,
4. functions of the form $\lambda_{\max}\left(A_0 + \sum_{j=1}^n x_j A_j\right)$, where $A_j$, $0 \le j \le n$, are $m \times m$ symmetric matrices, $\lambda_{\max}(\cdot)$ is the maximal eigenvalue of a symmetric matrix, and $m, n$ can be arbitrary positive integers
(in all these examples, it is self-evident what the data is). For $f \in \mathcal{F}$, denoting by $n(f)$ the dimension of the argument of $f$, let us call the quantity $\mathrm{Size}(f) = \max[n(f), \dim \mathrm{Data}(f)]$ the size of $f$. Let us say that family $\mathcal{F}$
• is with polynomial growth, if for all $f \in \mathcal{F}$ and $R > 0$ it holds
$$V(f, R) := \max_{x \in B_R^{n(f)}} f(x) - \min_{x \in B_R^{n(f)}} f(x) \le \chi\left[\mathrm{Size}(f) + R + \|\mathrm{Data}(f)\|_\infty\right]^{(\chi \mathrm{Size}^{\chi}(f))};$$
here and in what follows $\chi$’s stand for positive constants, perhaps different in different places, depending solely on $\mathcal{F}$;
• is polynomially computable, if there exists a code for a Real Arithmetic computer² with the following property: whenever $f \in \mathcal{F}$, $R > 0$, and $\delta > 0$, executing the code on input comprised of $\mathrm{Data}(f)$ augmented by $\delta$, $R$, and a query vector $x \in \mathbf{R}^{n(f)}$ with $\|x\|_2 \le R$, the computer after finitely many operations outputs the coefficients of an affine function $g_{f,x}(\cdot)$ which is a $\delta$-subgradient of $f$, taken at $x$, on $B_R^{n(f)}$,
$$g_{f,x}(y) \le f(y)\ \ \forall y \in B_R^{n(f)} \quad \& \quad f(x) \le g_{f,x}(x) + \delta,$$
and the number $N$ of arithmetic operations in this computation is upper-bounded by a polynomial in $\mathrm{Size}(f)$ and “the required number of accuracy digits”
$$\mathrm{Digits}(f, R, \delta) = \log\left(\frac{\mathrm{Size}(f) + \|\mathrm{Data}(f)\|_\infty + R + \delta^2}{\delta}\right),$$
that is, $N \le \chi[\mathrm{Size}(f) + \mathrm{Digits}(f, R, \delta)]^{\chi}$.
Observe that typical families of convex functions, like those in the above examples, are both with polynomial growth and polynomially computable.
In the main body of this book, the words “a convex function $f$ is efficiently computable” mean that $f$ belongs to a polynomially computable family (it is always clear from the context what this family is). Similarly, the words “a closed convex set $X$ is computationally tractable” mean that the convex function $f(x) = \min_{y \in X} \|y - x\|_2$ is efficiently computable.
In our context, the role of the notions introduced is as follows. Consider problem (1) and assume that the functions $f_i$, $i = 0, 1, ..., m$, participating in the problem are taken from a family $\mathcal{F}$ which is polynomially computable and with polynomial growth (as is the case when (1) is a linear, or a second order conic, or a semidefinite program). In this situation a particular instance $P$ of (1) is fully specified by its data vector $\mathrm{Data}(P)$ obtained by augmenting the “sizes” $n, m, R$ of the instance by the concatenation of the data vectors of $f_0, f_1, ..., f_m$. Similarly to the above, let us define the size of $P$ as $\mathrm{Size}(P) = \max[n, m, \dim \mathrm{Data}(P)]$, so that $\mathrm{Size}(P) \ge \mathrm{Size}(f_i)$ for all $i$, $0 \le i \le m$. Given $\mathrm{Data}(P)$ and $R$ and invoking the fact that $\mathcal{F}$ is with polynomial growth, we can easily compute $V$ satisfying (3) and such that
$$V = V(P, R) \le \chi\left[\mathrm{Size}(P) + R + \|\mathrm{Data}(P)\|_\infty\right]^{(\chi \mathrm{Size}^{\chi}(P))}. \tag{5}$$
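As an illustration of polynomial computability, the following Python sketch implements a First Order oracle for one of the example families above, the logarithm of a posynomial $f(x) = \ln\sum_i \exp\{a_i^T x + b_i\}$, together with the quantities Digits and $N(\epsilon)$. The function names (`logsumexp_oracle`, `digits`, `step_bound`) and the data layout (rows `A[i]`, offsets `b[i]`) are illustrative assumptions, not notation from the text. The logarithm in Digits is taken to be natural, and the brackets in (4) are read as rounding up; both readings are assumptions. Since $f$ is convex and differentiable, its tangent at $x$ is an affine minorant that is exact at $x$, hence a $\delta$-subgradient for every $\delta > 0$ (in floating point, up to rounding).

```python
import math


def logsumexp(A, b, y):
    """f(y) = ln(sum_i exp(a_i . y + b_i)), computed in a numerically stable way."""
    phi = [sum(aij * yj for aij, yj in zip(ai, y)) + bi for ai, bi in zip(A, b)]
    m = max(phi)
    return m + math.log(sum(math.exp(p - m) for p in phi))


def logsumexp_oracle(A, b, x):
    """Return (c, d) such that g(y) = c . y + d is the tangent of f at x.

    By convexity g(y) <= f(y) for all y, and g(x) = f(x).
    """
    phi = [sum(aij * xj for aij, xj in zip(ai, x)) + bi for ai, bi in zip(A, b)]
    m = max(phi)
    w = [math.exp(p - m) for p in phi]
    total = sum(w)
    fx = m + math.log(total)
    w = [wi / total for wi in w]  # softmax weights; the gradient is sum_i w_i a_i
    c = [sum(wi * ai[j] for wi, ai in zip(w, A)) for j in range(len(x))]
    d = fx - sum(cj * xj for cj, xj in zip(c, x))
    return c, d


def digits(size, data_norm, R, delta):
    """Digits = log((Size + ||Data||_inf + R + delta^2) / delta), natural log assumed."""
    return math.log((size + data_norm + R + delta ** 2) / delta)


def step_bound(n, V, eps):
    """Step count from (4), with the brackets read as rounding up (assumption)."""
    return math.ceil(2 * n ** 2 * math.log(2 * V / eps)) + 1
```

For instance, with $n = 2$, $V = 1$, and $\epsilon = 0.5$, `step_bound(2, 1.0, 0.5)` evaluates to 13. The oracle costs $O(mn)$ arithmetic operations plus $m$ evaluations of the exponent, which is in line with the polynomial bound required by the definition.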
² An idealized computer capable of storing reals and carrying out operations of precise real arithmetic: the four arithmetic operations, comparison, and the computing of elementary univariate functions, like $\sin(s)$, $\sqrt{s}$, etc.
Similarly to the above, we set
$$\mathrm{Digits}(P, R, \delta) = \log\left(\frac{\mathrm{Size}(P) + \|\mathrm{Data}(P)\|_\infty + R + \delta^2}{\delta}\right),$$
so that $\mathrm{Digits}(P, R, \delta) \ge \mathrm{Digits}(f_i, R, \delta)$, $0 \le i \le m$. Invoking polynomial computability of $\mathcal{F}$, we can implement the First Order oracle for problems $P$ of the form (1) with $f_i \in \mathcal{F}$ on the Real Arithmetic computer in such a way that executing the resulting code on input comprised of the data vector $\mathrm{Data}(P)$ augmented by $\delta > 0$, $R$, and a query vector $x \in B^n_R$ will produce $\delta$-subgradients, taken at $x$, of $f_i$, $0 \le i \le m$; the total number $M = M(P, R, \delta)$ of real arithmetic operations in the course of computation is upper-bounded by a polynomial in $\mathrm{Size}(P)$ and $\mathrm{Digits}(P, R, \delta)$:
$$M(P, R, \delta) \le \chi[\mathrm{Size}(P) + \mathrm{Digits}(P, R, \delta)]^{\chi}. \tag{6}$$
Finally, given $\mathrm{Data}(P)$, $R$, and a desired accuracy $\epsilon > 0$ and assuming w.l.o.g. that $\epsilon < V = V(P, R)$,³ we can use the above First Order oracle (with $\delta$ set to $\epsilon/2$) in (!) in order to find an $\epsilon$-accurate solution to problem $P$ (or conclude correctly that the problem is infeasible). The number $N$ of steps in this computation, in view of (4) and (5), is upper-bounded by a polynomial in $\mathrm{Size}(P)$ and $\mathrm{Digits}(P, R, \epsilon)$:
$$N \le O(1)[\mathrm{Size}(P) + \mathrm{Digits}(P, R, \epsilon)]^{\chi},$$
with computational expenses per step stemming from mimicking the First Order oracle upper-bounded by a polynomial in $\mathrm{Size}(P)$ and $\mathrm{Digits}(P, R, \epsilon)$ (by (6)). By (!), the overall “computational overhead” needed to process the oracle’s outputs and to generate the result is bounded by another polynomial of the same type.
The bottom line is that when $\mathcal{F}$ is a polynomially computable family of convex functions of polynomial growth, and the objective and the constraints $f_i$, $i \le m$, in (1) belong to $\mathcal{F}$, the overall number of arithmetic operations needed to find an $\epsilon$-approximate solution to (1) (or to conclude correctly that (1) is infeasible) is, for every $\epsilon > 0$, upper-bounded by a polynomial, depending solely on $\mathcal{F}$, in the size $\mathrm{Size}(P)$ of the instance and the number $\mathrm{Digits}(P, R, \epsilon)$ of accuracy digits in the desired solution.
For all our purposes, this is a general enough “scientific translation” of the informal claim “an explicit convex problem with computationally tractable objective and constraints is efficiently solvable.”
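The polynomial bound on the number of steps can be checked in a few lines. The sketch below uses shorthand that is not in the text: $S = \mathrm{Size}(P)$, $D = \|\mathrm{Data}(P)\|_\infty$, $\mathrm{Dg} = \mathrm{Digits}(P, R, \epsilon)$. It assumes natural logarithms (another base changes only constants), reads the brackets in (4) as rounding up, takes a single constant $\chi$ large enough to serve in both places in (5), and uses $S \ge 1$ and $\epsilon < V$.

```latex
\begin{align*}
\ln V &\le \ln\chi + \chi S^{\chi}\ln(S+R+D)
  && \text{by (5)},\\
\ln(1/\epsilon) &\le \ln\frac{S+D+R+\epsilon^2}{\epsilon} = \mathrm{Dg}
  && \text{since } S \ge 1,\\
\ln(S+R+D) &\le 2\,\mathrm{Dg}
  && \text{since } \frac{a+\epsilon^2}{\epsilon} \ge 2\sqrt{a}
     \text{ for } a = S+D+R \ge 1,\\
\ln(2V/\epsilon) &\le \ln(2\chi) + \left(2\chi S^{\chi}+1\right)\mathrm{Dg},\\
N(\epsilon) &\le 2n^2\ln(2V/\epsilon) + 2
  \le 2S^2\left[\ln(2\chi) + \left(2\chi S^{\chi}+1\right)\mathrm{Dg}\right] + 2
  && \text{since } n \le S.
\end{align*}
```

The right-hand side of the last line is a polynomial in $S$ and $\mathrm{Dg}$ whose coefficients depend only on $\chi$, hence only on the family $\mathcal{F}$.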
³ This indeed is w.l.o.g., since, say, the origin is a $V$-accurate solution to $P$.
Bibliography
[1]
M. Aizerman, E. Braverman, and L. Rozonoer. Theoretical foundations of the potential function method in pattern recognition. Avtomatika i Telemekhanika, 25:917–936, 1964. English translation: Automation & Remote Control.
[2]
M. Aizerman, E. Braverman, and L. Rozonoer. Method of potential functions in the theory of learning machines. Nauka, Moscow, 1970.
[3]
E. Anderson. The MOSEK optimization toolbox for MATLAB Manual. Version 8.0, 2015. http://docs.mosek.com/8.0/toolbox/.
[4]
T. Anderson. The integral of a symmetric unimodal function over a symmetric convex set and some probability inequalities. Proceedings of the American Mathematical Society, 6(2):170–176, 1955.
[5]
A. Antoniadis and I. Gijbels. Detecting abrupt changes by wavelet methods. Journal of Nonparametric Statistics, 14(1-2):7–29, 2002.
[6]
B. Arnold and P. Stahlecker. Another view of the Kuks–Olman estimator. Journal of Statistical Planning and Inference, 89(1):169–174, 2000.
[7]
K. Åström and P. Eykhoff. System identification—a survey. Automatica, 7(2):123–162, 1971.
[8]
T. Augustin and R. Hable. On the impact of robust statistics on imprecise probability models: a review. Structural Safety, 32(6):358–365, 2010.
[9]
R. Bakeman and J. Gottman. Observing Interaction: An Introduction to Sequential Analysis. Cambridge University Press, 1997.
[10]
M. Basseville. Detecting changes in signals and systems—a survey. Automatica, 24(3):309–326, 1988.
[11]
M. Basseville and I. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, Englewood Cliffs, N.J., 1993.
[12]
T. Bednarski. Binary experiments, minimax tests and 2-alternating capacities. The Annals of Statistics, 10(1):226–232, 1982.
[13]
D. Belomestny and A. Goldenshluger. Nonparametric density estimation from observations with multiplicative measurement errors. arXiv 1709.00629, 2017. https://arxiv.org/pdf/1709.00629.pdf.
[14]
A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust Optimization. Princeton University Press, 2009.
[15]
A. Ben-Tal and A. Nemirovski. Lectures on Modern Convex Optimization: Analysis, Algorithms, and Engineering Applications. SIAM, 2001.
[16]
M. Bertero and P. Boccacci. Application of the OS-EM method to the restoration of LBT images. Astronomy and Astrophysics Supplement Series, 144(1):181–186, 2000.
[17]
M. Bertero and P. Boccacci. Image restoration methods for the large binocular telescope (LBT). Astronomy and Astrophysics Supplement Series, 147(2):323–333, 2000.
[18]
E. Betzig, G. Patterson, R. Sougrat, O. W. Lindwasser, S. Olenych, J. Bonifacino, M. Davidson, J. Lippincott-Schwartz, and H. Hess. Imaging intracellular fluorescent proteins at nanometer resolution. Science, 313(5793):1642–1645, 2006.
[19]
P. Bickel and Y. Ritov. Estimating integrated squared density derivatives: sharp best order of convergence estimates. Sankhyā: The Indian Journal of Statistics, Series A, 50(3):381–393, 1988.
[20]
P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732, 2009.
[21]
L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation: inégalités de Cramér-Chernoff et théorie asymptotique des tests. PhD thesis, Université Paris VII, 1980.
[22]
L. Birgé. Vitesses maximales de décroissance des erreurs et tests optimaux associés. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 55(3):261–273, 1981.
[23]
L. Birgé. Approximation dans les espaces métriques et théorie de l’estimation. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 65(2):181–237, 1983.
[24]
L. Birgé. Robust testing for independent non identically distributed variables and Markov chains. In J. Florens, M. Mouchart, J. Raoult, L. Simar, and A. Smith, editors, Specifying Statistical Models, volume 16 of Lecture Notes in Statistics, pages 134–162. Springer, 1983.
[25]
L. Birgé. Sur un théorème de minimax et son application aux tests. Probability and Mathematical Statistics, 3(2):259–282, 1984.
[26]
L. Birgé. Model selection via testing: an alternative to (penalized) maximum likelihood estimators. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 42(3):273–325, 2006.
[27]
L. Birgé. Robust tests for model selection. In M. Banerjee, F. Bunea, J. Huang, V. Koltchinskii, and M. Maathuis, editors, From Probability to Statistics and Back: High-Dimensional Models and Processes—A Festschrift in Honor of Jon A. Wellner, pages 47–64. Institute of Mathematical Statistics, 2013.
[28]
L. Birgé and P. Massart. Estimation of integral functionals of a density. The Annals of Statistics, 23(1):11–29, 1995.
[29]
H.-D. Block. The perceptron: A model for brain functioning. I. Reviews of Modern Physics, 34(1):123, 1962.
[30]
O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
[31]
S. Boyd, L. El Ghaoui, E. Feron, and V. Balakrishnan. Linear Matrix Inequalities in System and Control Theory. SIAM, 1994.
[32]
E. Brodsky and B. Darkhovsky. Nonparametric Methods in Change Point Problems. Springer Science & Business Media, 2013.
[33]
E. Brunel, F. Comte, and V. Genon-Catalot. Nonparametric density and survival function estimation in the multiplicative censoring model. Test, 25(3):570–590, 2016.
[34]
A. Buchholz. Operator Khintchine inequality in non-commutative probability. Mathematische Annalen, 319(1):1–16, 2001.
[35]
A. Buja. On the Huber-Strassen theorem. Probability Theory and Related Fields, 73(1):149–152, 1986.
[36]
M. Burnashev. On the minimax detection of an imperfectly known signal in a white noise background. Theory of Probability & Its Applications, 24(1):107–119, 1979.
[37]
M. Burnashev. Discrimination of hypotheses for Gaussian measures and a geometric characterization of the Gaussian distribution. Mathematical Notes of the Academy of Sciences of the USSR, 32:757–761, 1982.
[38]
C. Butucea and F. Comte. Adaptive estimation of linear functionals in the convolution model and applications. Bernoulli, 15(1):69–98, 2009.
[39]
C. Butucea and K. Meziani. Quadratic functional estimation in inverse problems. Statistical Methodology, 8(1):31–41, 2011.
[40]
T. T. Cai and M. Low. A note on nonparametric estimation of linear functionals. The Annals of Statistics, 31(4):1140–1153, 2003.
[41]
T. T. Cai and M. Low. Minimax estimation of linear functionals over nonconvex parameter spaces. The Annals of Statistics, 32(2):552–576, 2004.
[42]
T. T. Cai and M. Low. On adaptive estimation of linear functionals. The Annals of Statistics, 33(5):2311–2343, 2005.
[43]
T. T. Cai and M. Low. Optimal adaptive estimation of a quadratic functional. The Annals of Statistics, 34(5):2298–2325, 2006.
[44]
E. Candes. Compressive sampling. In Proceedings of the International Congress of Mathematicians, volume 3, pages 1433–1452. Madrid, August 22-30, Spain, 2006.
[45]
E. Candes. The restricted isometry property and its implications for compressed sensing. Comptes rendus de l’Académie des Sciences, Mathématique, 346(9-10):589–592, 2008.
[46]
E. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
[47]
E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
[48]
E. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
[49]
E. Candes and T. Tao. The Dantzig selector: statistical estimation when p is much larger than n. The Annals of Statistics, 35(6):2313–2351, 2007.
[50]
Y. Cao, V. Guigues, A. Juditsky, A. Nemirovski, and Y. Xie. Change detection via affine and quadratic detectors. Electronic Journal of Statistics, 12(1):1–57, 2018.
[51]
J. Chen and A. Gupta. Parametric statistical change point analysis: with applications to genetics, medicine, and finance. Boston: Birkhäuser, 2012.
[52]
S. Chen and D. Donoho. Basis pursuit. In Proceedings of 1994 28th Asilomar Conference on Signals, Systems and Computers, pages 41–44. IEEE, 1994.
[53]
S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Review, 43(1):129–159, 2001.
[54]
N. Chentsov. Evaluation of an unknown distribution density from observations. Doklady Academii Nauk SSSR, 147(1):45, 1962. English translation: Soviet Mathematics.
[55]
H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23(4):493–507, 1952.
[56]
H. Chernoff. Sequential Analysis and Optimal Design. SIAM, 1972.
[57]
N. Christopeit and K. Helmes. Linear minimax estimation with ellipsoidal constraints. Acta Applicandae Mathematica, 43(1):3–15, 1996.
[58]
H. Cramér. Mathematical Methods of Statistics. Princeton University Press, 1946.
[59]
A. d’Aspremont and L. El Ghaoui. Testing the nullspace property using semidefinite programming. Mathematical Programming Series B, 127(1):123–144, 2011. https://arxiv.org/pdf/0807.3520.pdf.
[60]
I. Dattner, A. Goldenshluger, and A. Juditsky. On deconvolution of distribution functions. The Annals of Statistics, 39(5):2477–2501, 2011.
[61]
R. DeVore. Deterministic constructions of compressed sensing matrices. Journal of Complexity, 23(4-6):918–925, 2007.
[62]
I. Devyaterikov, A. Propoi, and Y. Tsypkin. Iterative learning algorithms for pattern recognition. Avtomatika i Telemekhanika, 28:122–132, 1967. English translation: Automation & Remote Control.
[63]
D. Donoho. Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. In I. Daubechies, editor, Proceedings of Symposia in Applied Mathematics, volume 47, pages 173–205. AMS, 1993.
[64]
D. Donoho. Statistical estimation and optimal recovery. The Annals of Statistics, 22(1):238–270, 1994.
[65]
D. Donoho. De-noising by soft-thresholding. IEEE Transactions on Information Theory, 41(3):613–627, 1995.
[66]
D. Donoho. Nonlinear solution of linear inverse problems by wavelet–vaguelette decomposition. Applied and Computational Harmonic Analysis, 2(2):101–126, 1995.
[67]
D. Donoho. Neighborly polytopes and sparse solutions of underdetermined linear equations. Technical report, Stanford University Statistics Report 200504, 2005. https://statistics.stanford.edu/research/neighborly-polytopes-andsparse-solution-underdetermined-linear-equations.
[68]
D. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.
[69]
D. Donoho, M. Elad, and V. Temlyakov. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory, 52(1):6–18, 2006.
[70]
D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
[71]
D. Donoho and I. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81(3):425–455, 1994.
[72]
D. Donoho and I. Johnstone. Minimax risk over ℓp -balls for ℓp -error. Probability Theory and Related Fields, 99(2):277–303, 1994.
[73]
D. Donoho and I. Johnstone. Adapting to unknown smoothness via wavelet shrinkage. Journal of the American Statistical Association, 90(432):1200–1224, 1995.
[74]
D. Donoho and I. Johnstone. Minimax estimation via wavelet shrinkage. The Annals of Statistics, 26(3):879–921, 1998.
[75]
D. Donoho, I. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: asymptopia? Journal of the Royal Statistical Society: Series B, 57(2):301–337, 1995.
[76]
D. Donoho and R. Liu. Geometrizing rate of convergence I. Technical report, 137a, Department of Statistics, University of California, Berkeley, 1987.
[77]
D. Donoho and R. Liu. Geometrizing rates of convergence, II. The Annals of Statistics, 19(2):633–667, 1991.
[78]
D. Donoho and R. Liu. Geometrizing rates of convergence, III. The Annals of Statistics, 19(2):668–701, 1991.
[79]
D. Donoho, R. Liu, and B. MacGibbon. Minimax risk over hyperrectangles, and implications. The Annals of Statistics, 18(3):1416–1437, 1990.
[80]
D. Donoho and M. Low. Renormalization exponents and optimal pointwise rates of convergence. The Annals of Statistics, 20(2):944–970, 1992.
[81]
D. Donoho and M. Nussbaum. Minimax quadratic estimation of a quadratic functional. Journal of Complexity, 6(3):290–323, 1990.
[82]
H. Drygas. Spectral methods in linear minimax estimation. Acta Applicandae Mathematica, 43(1):17–42, 1996.
[83]
M. Duarte, M. Davenport, D. Takhar, J. Laska, T. Sun, K. Kelly, and R. Baraniuk. Single-pixel imaging via compressive sampling. IEEE Signal Processing Magazine, 25(2):83–91, 2008.
[84]
L. Dumbgen and V. Spokoiny. Multiscale testing of qualitative hypotheses. The Annals of Statistics, 29(1):124–152, 2001.
[85]
J. Durbin. Errors in variables. Revue de l’Institut International de Statistique, 22(1/3):23–32, 1954.
[86]
S. Efromovich. Nonparametric Curve Estimation: Methods, Theory, and Applications. Springer Science & Business Media, 1999.
[87]
S. Efromovich and M. Low. On optimal adaptive estimation of a quadratic functional. The Annals of Statistics, 24(3):1106–1125, 1996.
[88]
S. Efromovich and M. Pinsker. Sharp-optimal and adaptive estimation for heteroscedastic nonparametric regression. Statistica Sinica, 6(4):925–942, 1996.
[89]
T. Eltoft, T. Kim, and T.-W. Lee. On the multivariate Laplace distribution. IEEE Signal Processing Letters, 13(5):300–303, 2006.
[90]
J. Fan. On the estimation of quadratic functionals. The Annals of Statistics, 19(3):1273–1294, 1991.
[91]
J. Fan. On the optimal rates of convergence for nonparametric deconvolution problems. The Annals of Statistics, 19(3):1257–1272, 1991.
[92]
G. Fellouris and G. Sokolov. Second-order asymptotic optimality in multisensor sequential change detection. IEEE Transactions on Information Theory, 62(6):3662–3675, 2016.
[93]
J.-J. Fuchs. On sparse representations in arbitrary redundant bases. IEEE Transactions on Information Theory, 50(6):1341–1344, 2004.
[94]
J.-J. Fuchs. Recovery of exact sparse representations in the presence of bounded noise. IEEE Transactions on Information Theory, 51(10):3601–3608, 2005.
[95]
W. Gaffey. A consistent estimator of a component of a convolution. The Annals of Mathematical Statistics, 30(1):198–205, 1959.
[96]
U. Gamper, P. Boesiger, and S. Kozerke. Compressed sensing in dynamic MRI. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 59(2):365–373, 2008.
[97]
G. Gayraud and K. Tribouley. Wavelet methods to estimate an integrated quadratic functional: Adaptivity and asymptotic law. Statistics & Probability Letters, 44(2):109–122, 1999.
[98]
N. Gholson and R. Moose. Maneuvering target tracking using adaptive state estimation. IEEE Transactions on Aerospace and Electronic Systems, 13(3):310–317, 1977.
[99]
R. Gill and B. Levit. Applications of the van Trees inequality: a Bayesian Cramér-Rao bound. Bernoulli, 1(1-2):59–79, 1995.
[100] E. Giné, R. Latala, and J. Zinn. Exponential and moment inequalities for U-statistics. In E. Giné, D. Mason, and J. Wellner, editors, High Dimensional Probability II, volume 47 of Progress in Probability, pages 13–38. Birkhäuser, 2000.
[101] A. Goldenshluger. A universal procedure for aggregating estimators. The Annals of Statistics, 37(1):542–568, 2009.
[102] A. Goldenshluger, A. Juditsky, and A. Nemirovski. Hypothesis testing by convex optimization. Electronic Journal of Statistics, 9(2):1645–1712, 2015.
[103] A. Goldenshluger, A. Juditsky, A. Tsybakov, and A. Zeevi. Change point estimation from indirect observations. I. Minimax complexity. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 44:787–818, 2008.
[104] A. Goldenshluger, A. Juditsky, A. Tsybakov, and A. Zeevi. Change point estimation from indirect observations. II. Adaptation. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 44(5):819–836, 2008.
[105] A. Goldenshluger, A. Tsybakov, and A. Zeevi. Optimal change-point estimation from indirect observations. The Annals of Statistics, 34(1):350–372, 2006.
[106] Y. Golubev, B. Levit, and A. Tsybakov. Asymptotically efficient estimation of analytic functions in Gaussian noise. Bernoulli, 2(2):167–181, 1996.
[107] L. Gordon and M. Pollak. An efficient sequential nonparametric scheme for detecting a change of distribution. The Annals of Statistics, 22(2):763–804, 1994.
[108] M. Grant and S. Boyd. The CVX Users’ Guide. Release 2.1, 2014. https://web.cvxr.com/cvx/doc/CVX.pdf.
[109] M. Grasmair, H. Li, and A. Munk. Variational multiscale nonparametric regression: Smooth functions. Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, 54(2):1058–1097, 2018.
[110] V. Guigues, A. Juditsky, and A. Nemirovski. Hypothesis testing via Euclidean
separation. arXiv 1705.07196, 2017. https://arxiv.org/pdf/1705.07196.pdf.
[111] F. Gustafsson. Adaptive Filtering and Change Detection. John Wiley & Sons, 2000.
[112] W. Härdle, G. Kerkyacharian, D. Picard, and A. Tsybakov. Wavelets, Approximation, and Statistical Applications. Springer Science & Business Media, 1998.
[113] S. Hell. Toward fluorescence nanoscopy. Nature Biotechnology, 21(11):1347, 2003.
[114] S. Hell. Microscopy and its focal switch. Nature Methods, 6(1):24, 2009.
[115] S. Hell and J. Wichmann. Breaking the diffraction resolution limit by stimulated emission: stimulated-emission-depletion fluorescence microscopy. Optics Letters, 19(11):780–782, 1994.
[116] D. Helmbold and M. Warmuth. On weak learning. Journal of Computer and System Sciences, 50(3):551–573, 1995.
[117] S. Hess, T. Girirajan, and M. Mason. Ultra-high resolution imaging by fluorescence photoactivation localization microscopy. Biophysical Journal, 91(11):4258–4272, 2006.
[118] J.-B. Hiriart-Urruty and C. Lemarechal. Convex Analysis and Minimization Algorithms I: Fundamentals. Springer, 1993.
[119] C. Houdré and P. Reynaud-Bouret. Exponential inequalities, with constants, for U-statistics of order two. In E. Giné, C. Houdré, and D. Nualart, editors, Stochastic Inequalities and Applications, volume 56 of Progress in Probability, pages 55–69. Birkhäuser, 2003.
[120] L.-S. Huang and J. Fan. Nonparametric estimation of quadratic regression functionals. Bernoulli, 5(5):927–949, 1999.
[121] P. Huber. A robust version of the probability ratio test. The Annals of Mathematical Statistics, 36(6):1753–1758, 1965.
[122] P. Huber and V. Strassen. Minimax tests and the Neyman-Pearson lemma for capacities. The Annals of Statistics, 1(2):251–263, 1973.
[123] P. Huber and V. Strassen. Note: Correction to minimax tests and the Neyman-Pearson lemma for capacities. The Annals of Statistics, 2(1):223–224, 1974.
[124] I. Ibragimov and R. Khasminskii. Statistical Estimation: Asymptotic Theory. Springer, 1981.
[125] I. Ibragimov and R. Khasminskii. On nonparametric estimation of the value of a linear functional in Gaussian white noise. Theory of Probability & Its Applications, 29(1):18–32, 1985.
[126] I. Ibragimov and R. Khasminskii. Estimation of linear functionals in Gaussian
noise. Theory of Probability & Its Applications, 32(1):30–39, 1988.
[127] I. Ibragimov, A. Nemirovskii, and R. Khasminskii. Some problems on nonparametric estimation in Gaussian white noise. Theory of Probability & Its Applications, 31(3):391–406, 1987.
[128] Y. Ingster and I. Suslina. Nonparametric Goodness-of-Fit Testing Under Gaussian Models, volume 169 of Lecture Notes in Statistics. Springer, 2002.
[129] A. Juditsky, F. Kilinc-Karzan, and A. Nemirovski. Verifiable conditions of ℓ1-recovery for sparse signals with sign restrictions. Mathematical Programming, 127(1):89–122, 2011.
[130] A. Juditsky, F. Kilinc-Karzan, A. Nemirovski, and B. Polyak. Accuracy guaranties for ℓ1-recovery of block-sparse signals. The Annals of Statistics, 40(6):3077–3107, 2012.
[131] A. Juditsky and A. Nemirovski. Nonparametric estimation by convex programming. The Annals of Statistics, 37(5a):2278–2300, 2009.
[132] A. Juditsky and A. Nemirovski. Accuracy guarantees for ℓ1-recovery. IEEE Transactions on Information Theory, 57(12):7818–7839, 2011.
[133] A. Juditsky and A. Nemirovski. On verifiable sufficient conditions for sparse signal recovery via ℓ1 minimization. Mathematical Programming, 127(1):57–88, 2011.
[134] A. Juditsky and A. Nemirovski. On sequential hypotheses testing via convex optimization. Automation & Remote Control, 76(5):809–825, 2015. https://arxiv.org/pdf/1412.1605.pdf.
[135] A. Juditsky and A. Nemirovski. Estimating linear and quadratic forms via indirect observations. arXiv 1612.01508, 2016. https://arxiv.org/pdf/1612.01508.pdf.
[136] A. Juditsky and A. Nemirovski. Hypothesis testing via affine detectors. Electronic Journal of Statistics, 10:2204–2242, 2016.
[137] A. Juditsky and A. Nemirovski. Near-optimality of linear recovery from indirect observations. Mathematical Statistics and Learning, 1(2):101–110, 2018. https://arxiv.org/pdf/1704.00835.pdf.
[138] A. Juditsky and A. Nemirovski. Near-optimality of linear recovery in Gaussian observation scheme under $\|\cdot\|_2^2$-loss. The Annals of Statistics, 46(4):1603–1629, 2018.
[139] A. Juditsky and A. Nemirovski. On polyhedral estimation of signals via indirect observations. arXiv 1803.06446, 2018. https://arxiv.org/pdf/1803.06446.pdf.
[140] A. Juditsky and A. Nemirovski. Signal recovery by stochastic optimization. Avtomatika i Telemekhanika, 80(10):153–172, 2019. https://arxiv.org/pdf/1903.07349.pdf. English translation: Automation & Remote Control.
[141] S. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 927–935. Curran Associates, Inc., 2011.
[142] A. Kalai and R. Sastry. The isotron algorithm: High-dimensional isotonic regression. In COLT 2009 - The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, June 18-21, 2009.
[143] R. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[144] R. Kalman and R. Bucy. New results in linear filtering and prediction theory. Journal of Basic Engineering, 83(1):95–108, 1961.
[145] G. Kerkyacharian and D. Picard. Minimax or maxisets? Bernoulli, 8:219–253, 2002.
[146] J. Kiefer and J. Wolfowitz. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 23(3), 1952.
[147] J. Klemelä. Sharp adaptive estimation of quadratic functionals. Probability Theory and Related Fields, 134(4):539–564, 2006.
[148] J. Klemelä and A. Tsybakov. Sharp adaptive estimation of linear functionals. The Annals of Statistics, 29(6):1567–1600, 2001.
[149] V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 23(1):110–133, 2017.
[150] V. Koltchinskii, K. Lounici, and A. Tsybakov. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329, 2011.
[151] V. Korolyuk and Y. Borovskich. Theory of U-statistics. Springer Science & Business Media, 1994.
[152] A. Korostelev and O. Lepski. On a multi-channel change-point problem. Mathematical Methods of Statistics, 17(3):187–197, 2008.
[153] S. Kotz and S. Nadarajah. Multivariate t-Distributions and Their Applications. Cambridge University Press, 2004.
[154] C. Kraft. Some conditions for consistency and uniform consistency of statistical procedures. University of California Publications in Statistics, 2:493–507, 1955.
[155] J. Kuks and W. Olman. Minimax linear estimation of regression coefficients (I). Iswestija Akademija Nauk Estonskoj SSR, 20:480–482, 1971.
[156] J. Kuks and W. Olman. Minimax linear estimation of regression coefficients (II). Iswestija Akademija Nauk Estonskoj SSR, 21:66–72, 1972.
[157] V. Kuznetsov. Stable detection when signal and spectrum of normal noise are inaccurately known. Telecommunications and Radio Engineering, 30(3):58–64, 1976.
[158] T. L. Lai. Sequential changepoint detection in quality control and dynamical systems. Journal of the Royal Statistical Society. Series B, 57(4):613–658, 1995.
[159] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. ACM SIGCOMM Computer Communication Review, 34(4):219–230, 2004.
[160] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 28(5):1302–1338, 2000.
[161] L. Le Cam. On the assumptions used to prove asymptotic normality of maximum likelihood estimates. The Annals of Mathematical Statistics, 41(3):802–828, 1970.
[162] L. Le Cam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, 1(1):38–53, 1973.
[163] L. Le Cam. On local and global properties in the theory of asymptotic normality of experiments. Stochastic Processes and Related Topics, 1:13–54, 1975.
[164] L. Le Cam. Asymptotic Methods in Statistical Decision Theory, volume 26 of Springer Series in Statistics. Springer, 1986.
[165] O. Lepski. Asymptotically minimax adaptive estimation: I. Upper bounds. Optimally adaptive estimates. Theory of Probability & Its Applications, 36(4):645–659, 1991.
[166] O. Lepski. On a problem of adaptive estimation in Gaussian white noise. Theory of Probability & Its Applications, 35(3):454–466, 1991.
[167] O. Lepski. Some new ideas in nonparametric estimation. arXiv 1603.03934, 2016. https://arxiv.org/pdf/1603.03934.pdf.
[168] O. Lepski and V. Spokoiny. Optimal pointwise adaptive methods in nonparametric estimation. The Annals of Statistics, 25(6):2512–2546, 1997.
[169] O. Lepski and T. Willer. Oracle inequalities and adaptive estimation in the convolution structure density model. The Annals of Statistics, 47(1):233–287, 2019.
[170] B. Levit. Conditional estimation of linear functionals. Problemy Peredachi Informatsii, 11(4):39–54, 1975. English translation: Problems of Information Transmission.
[171] R. Liptser and A. Shiryaev. Statistics of Random Processes I: General Theory. Springer, 2001.
[172] R. Liptser and A. Shiryaev. Statistics of Random Processes II: Applications. Springer, 2001.
[173] L. Ljung. System Identification: Theory for the User. Prentice Hall, 1986.
[174] G. Lorden. Procedures for reacting to a change in distribution. The Annals of Mathematical Statistics, 42(6):1897–1908, 1971.
[175] F. Lust-Piquard. Inégalités de Khintchine dans $C_p$ ($1 < p < \infty$). Comptes rendus de l'Académie des Sciences, Série I, 303(7):289–292, 1986.
[176] M. Lustig, D. Donoho, and J. Pauly. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine: An Official Journal of the International Society for Magnetic Resonance in Medicine, 58(6):1182–1195, 2007.
[177] L. Mackey, M. Jordan, R. Chen, B. Farrell, and J. Tropp. Matrix concentration inequalities via the method of exchangeable pairs. The Annals of Probability, 42(3):906–945, 2014.
[178] P. Massart. Concentration Inequalities and Model Selection. Springer, 2007.
[179] P. Mathé and S. Pereverzev. Direct estimation of linear functionals from indirect noisy observations. Journal of Complexity, 18(2):500–516, 2002.
[180] E. Mazor, A. Averbuch, Y. Bar-Shalom, and J. Dayan. Interacting multiple model methods in target tracking: a survey. IEEE Transactions on Aerospace and Electronic Systems, 34(1):103–123, 1998.
[181] Y. Mei. Asymptotic optimality theory for decentralized sequential hypothesis testing in sensor networks. IEEE Transactions on Information Theory, 54(5):2072–2089, 2008.
[182] A. Meister. Deconvolution Problems in Nonparametric Statistics, volume 193 of Lecture Notes in Statistics. Springer, 2009.
[183] G. Moustakides. Optimal stopping times for detecting changes in distributions. The Annals of Statistics, 14(4):1379–1387, 1986.
[184] H.-G. Müller and U. Stadtmüller. Discontinuous versus smooth regression. The Annals of Statistics, 27(1):299–337, 1999.
[185] A. Nemirovski. Topics in non-parametric statistics. In P. Bernard, editor, Lectures on Probability Theory and Statistics, École d'Été de Probabilités de Saint-Flour, volume 1738 of Lecture Notes in Mathematics, pages 87–285. Springer, 2000.
[186] A. Nemirovski. Interior Point Polynomial Time methods in Convex Programming. Lecture Notes, 2005. https://www.isye.gatech.edu/~nemirovs/Lect_IPM.pdf.
[187] A. Nemirovski. Introduction to Linear Optimization. Lecture Notes, 2015. https://www2.isye.gatech.edu/~nemirovs/OPTI_LectureNotes2016.pdf.
[188] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
[189] A. Nemirovski, S. Onn, and U. Rothblum. Accuracy certificates for computational problems with convex structure. Mathematics of Operations Research, 35(1):52–78, 2010.
[190] A. Nemirovski, B. Polyak, and A. Tsybakov. Convergence rate of nonparametric estimates of maximum-likelihood type. Problemy Peredachi Informatsii, 21(4):17–33, 1985. English translation: Problems of Information Transmission.
[191] A. Nemirovski, C. Roos, and T. Terlaky. On maximization of quadratic form over intersection of ellipsoids with common center. Mathematical Programming, 86(3):463–473, 1999.
[192] A. Nemirovskii. Nonparametric estimation of smooth regression functions. Izvestia AN SSSR, Seria Tekhnicheskaya Kibernetika, 23(6):1–11, 1985. English translation: Engineering Cybernetics: Soviet Journal of Computer and Systems Sciences.
[193] Y. Nesterov and A. Nemirovskii. Interior-Point Polynomial Algorithms in Convex Programming. SIAM, 1994.
[194] M. Neumann. Optimal change point estimation in inverse problems. Scandinavian Journal of Statistics, 24(4):503–521, 1997.
[195] D. Nolan and D. Pollard. U-processes: Rates of convergence. The Annals of Statistics, 15(2):780–799, 1987.
[196] F. Österreicher. On the construction of least favourable pairs of distributions. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 43(1):49–55, 1978.
[197] J. Pilz. Minimax linear regression estimation with symmetric parameter restrictions. Journal of Statistical Planning and Inference, 13:297–318, 1986.
[198] M. Pinsker. Optimal filtration of square-integrable signals in Gaussian noise. Problemy Peredachi Informatsii, 16(2):120–133, 1980. English translation: Problems of Information Transmission.
[199] G. Pisier. Non-commutative vector valued $L_p$-spaces and completely p-summing maps. Astérisque, 247, 1998.
[200] M. Pollak. Optimal detection of a change in distribution. The Annals of Statistics, 13(1):206–227, 1985.
[201] M. Pollak. Average run lengths of an optimal method of detecting a change in distribution. The Annals of Statistics, 15(2):749–779, 1987.
[202] H. V. Poor and O. Hadjiliadis. Quickest Detection. Cambridge University Press, 2009.
[203] K. Proksch, F. Werner, and A. Munk. Multiscale scanning in inverse problems. The Annals of Statistics, 46(6B):3569–3602, 2018.
[204] A. Rakhlin, S. Mukherjee, and T. Poggio. Stability results in learning theory. Analysis and Applications, 3(4):397–417, 2005.
[205] C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of Calcutta Mathematical Society, 37:81–91, 1945.
[206] C. R. Rao. Linear Statistical Inference and Its Applications. John Wiley & Sons, 1973.
[207] C. R. Rao. Estimation of parameters in a linear model. The Annals of Statistics, 4(6):1023–1037, 1976.
[208] J. Rice. Bandwidth choice for nonparametric regression. The Annals of Statistics, 12(4):1215–1230, 1984.
[209] H. Rieder. Least favorable pairs for special capacities. The Annals of Statistics, 5(5):909–921, 1977.
[210] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[211] M. Rust, M. Bates, and X. Zhuang. Sub-diffraction-limit imaging by stochastic optical reconstruction microscopy (STORM). Nature Methods, 3(10):793, 2006.
[212] S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan. Stochastic convex optimization. In COLT 2009 – The 22nd Conference on Learning Theory, Montreal, Quebec, Canada, 2009.
[213] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory, Second Edition. SIAM, 2014.
[214] H. Sherali and W. Adams. A hierarchy of relaxations between the continuous and convex hull representations for zero-one programming problems. SIAM Journal on Discrete Mathematics, 3(3):411–430, 1990.
[215] A. Shiryaev. On optimum methods in quickest detection problems. Theory of Probability & Its Applications, 8(1):22–46, 1963.
[216] D. Siegmund. Sequential Analysis: Tests and Confidence Intervals. Springer Science & Business Media, 1985.
[217] D. Siegmund and B. Yakir. The Statistics of Gene Mapping. Springer Science & Business Media, 2007.
[218] T. Söderström, U. Soverini, and K. Mahata. Perspectives on errors-in-variables estimation for dynamic systems. Signal Processing, 82(8):1139–1154, 2002.
[219] T. Söderström and P. Stoica. Comparison of some instrumental variable methods–consistency and accuracy aspects. Automatica, 17(1):101–115, 1981.
[220] K. Sridharan, S. Shalev-Shwartz, and N. Srebro. Fast rates for regularized objectives. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1545–1552. 2009.
[221] J. Stoer and C. Witzgall. Convexity and Optimization in Finite Dimensions I. Springer, 1970.
[222] C. Stone. Optimal rates of convergence for nonparametric estimators. The Annals of Statistics, 8(6):1348–1360, 1980.
[223] C. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982.
[224] A. Tartakovsky, I. Nikiforov, and M. Basseville. Sequential Analysis: Hypothesis Testing and Change point Detection. CRC Press, 2014.
[225] A. Tartakovsky and V. Veeravalli. Change point detection in multichannel and distributed systems. In N. Mukhopadhyay, S. Datta, and S. Chattopadhyay, editors, Applied Sequential Methodologies: Real-World Examples with Data Analysis, pages 339–370. CRC Press, 2004.
[226] A. Tartakovsky and V. Veeravalli. Asymptotically optimal quickest change detection in distributed sensor systems. Sequential Analysis, 27(4):441–475, 2008.
[227] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, pages 267–288, 1996.
[228] J. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 8(1-2):1–230, 2015.
[229] A. Tsybakov. Optimal rates of aggregation. In B. Schölkopf and M. Warmuth, editors, Learning Theory and Kernel Machines, volume 2777 of Lecture Notes in Computer Science, pages 303–313. Springer, 2003.
[230] A. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York, 2009.
[231] S. Van De Geer. The deterministic Lasso. Technical report, Seminar für Statistik, Eidgenössische Technische Hochschule (ETH) Zürich, 2007. https://stat.ethz.ch/~geer/lasso.pdf. JSM Proceedings, 2007, paper nr. 489.
[232] S. Van De Geer and P. Bühlmann. On the conditions used to prove oracle results for the Lasso. Electronic Journal of Statistics, 3:1360–1392, 2009.
[233] H. Van Trees. Detection, Estimation, and Modulation Theory, Part I: Detection, Estimation, and Linear Modulation Theory. John Wiley & Sons, 1968.
[234] Y. Vardi, L. Shepp, and L. Kaufman. A statistical model for positron emission tomography. Journal of the American Statistical Association, 80(389):8–20, 1985.
[235] A. Wald. Sequential tests of statistical hypotheses. The Annals of Mathematical Statistics, 16(2):117–186, 1945.
[236] A. Wald. Sequential Analysis. John Wiley & Sons, 1947.
[237] A. Wald and J. Wolfowitz. Optimum character of the sequential probability ratio test. The Annals of Mathematical Statistics, 19(1):326–339, 1948.
[238] Y. Wang. Jump and sharp cusp detection by wavelets. Biometrika, 82(2):385–397, 1995.
[239] L. Wasserman. All of Nonparametric Statistics. Springer Science & Business Media, 2006.
[240] A. Willsky. Detection of Abrupt Changes in Dynamic Systems. Springer, 1985.
[241] K. Wong and E. Polak. Identification of linear discrete time systems using the instrumental variable method. IEEE Transactions on Automatic Control, 12(6):707–718, 1967.
[242] Y. Xie and D. Siegmund. Sequential multi-sensor change-point detection. The Annals of Statistics, 41(2):670–692, 2013.
[243] Y. Yin. Detection of the number, locations and magnitudes of jumps. Communications in Statistics. Stochastic Models, 4(3):445–455, 1988.
[244] C.-H. Zhang. Fourier methods for estimating mixing densities and distributions. The Annals of Statistics, 18(2):806–831, 1990.
Index

$O(1)$, xvi
Diag, xv
Erfc, 16
ErfcInv, 17
Risk, 41
$E_\xi\{\,\}$, $E_{\xi\sim P}\{\,\}$, $E\{\,\}$, xvi
$Q_q(s,\kappa)$-condition, 12
  links with RIP, 24
  tractability when $q=\infty$, 22
  verifiable sufficient conditions for, 20
$R^n$, $R^{m\times n}$, xv
$R_+$, $R^n_+$, xvi
$S^n$, xv
$S^n_+$, xvi
$A^*$, xv
$N(\mu,\Theta)$, xvi
$R_k[\cdot]$, $R_k^*[\cdot]$, $S_\ell[\cdot]$, $S_\ell^*[\cdot]$, ..., 275
dg, xv
$\ell_1$ minimization, see Compressed Sensing
Poisson(·), xvi
Uniform(·), xvi
$\int_\Omega f(\xi)\Pi(d\xi)$, xvi
$\lambda[\cdot]$, 276
$\succeq$, $\succ$, $\preceq$, $\prec$, xvi
$\xi\sim P$, $\xi\sim p(\cdot)$, xvi
$s$-goodness, 9
$\|\cdot\|_p$, xvi
$\|\cdot\|_{2,2}$, xvi

Bisection estimate, 198
  near-optimality of, 202

closeness relation, 59
Compressed Sensing, 3–6
  via $\ell_1$ minimization, 6–17
    imperfect, 11
    validity of, 8
    verifiable sufficient validity conditions, 17–26
    verifiable sufficient validity conditions, limits of performance, 24
  via penalized $\ell_1$ recovery, 14
  via regular $\ell_1$ recovery, 13
conditional quantile, 198
cone
  dual, 262
  Lorentz, 262
  regular, 262
  semidefinite, 262
conic
  problem, 263
    dual of, 263
  programming, 262, 265
Conic Duality Theorem, 264
conic hull, 263
contrast matrix, see nullspace property quantification
Cramer-Rao risk bound, 347–350, 354–355, 570

detector, 65
  affine, 123
  in simple observation schemes, 83
  quadratic, 139
  risks of, 65
  structural properties, 65

ellitope, 265–266
  calculus of, 299–302, 429
estimation
  of $N$-convex functions, 193–211
  of linear form, 185, 211–222
    from repeated observations, 215–217
    of sub-Gaussianity parameters, 217–222
    of sub-Gaussianity parameters, direct product case, 219
  of quadratic form, 222–232
    Gaussian case, 222–228
    Gaussian case, consistency, 227
    Gaussian case, construction, 224
    sub-Gaussian case, 228, 232
    sub-Gaussian case, construction, 228

family of distributions
  regular/simple, 124–132
    calculus of, 126
    examples of, 125
  spherical, 53
    cap of, 53
function
  $N$-convex, 197
    examples of, 197

Gaussian mixtures, 53
generalized linear model, 415
GLM, see generalized linear model

Hellinger affinity, 83
Hypothesis Testing
  change detection via quadratic lifting, 149–156
  of multiple hypotheses, 58–64
    in simple observation schemes, 87–105
    up to closeness, 59, 91
    via Euclidean separation, 62–64
    via repeated observations, 95
  of unions, 87
  problem's setting, 41
  sequential, 105–113
  test, 42
    detector-based, 65
    detector-based, limits of performance, 70
    detector-based, via repeated observations, 66
    deterministic, 42
    partial risks of, 45
    randomized, 42
    simple, 42
    total risk of, 45
    two-point lower risk bound, 46
  via affine detectors, 132–139
  via Euclidean separation, 49–58
    and repeated observations, 55
    majority test, 56
    multiple hypotheses case, 62–64
    pairwise, 50
  via quadratic lifting, 139
    Gaussian case, 139–145
    sub-Gaussian case, 145–149
  via repeated observations, 42

inequality
  Cramer-Rao, 349, 572

lemma
  on Schur Complement, see Schur Complement Lemma
LMI, xvi
logistic regression, 414–415

matrices
  notation, xv
  sensing, 1
MD, see Measurement Design
Measurement Design, 113–123
  simple case
    discrete o.s., 118
    Gaussian o.s., 122
    Poisson o.s., 121
Mutual Incoherence, 23

norm
  conjugate, 279
  Schatten, 305
  Wasserstein, 340, 559
Nullspace property, 9, 10
  quantification of, 11

o.s., see observation scheme
observation scheme
  discrete, 77, 85
  Gaussian, 74, 84
  Poisson, 75, 84
  simple, 72–87
    $K$-th power of, 85
    definition of, 73
    direct product of, 77

PET, see Positron Emission Tomography
Poisson Imaging, 75
polyhedral estimate, 385–414
Positron Emission Tomography, 75

Rademacher random vector, 363
regular data, 124
repeated observations
  quasi-stationary, 44
  semi-stationary, 43
  stationary, 42
Restricted Isometry Property, 20
RIP, see Restricted Isometry Property
risk
  $\mathrm{Risk}(T|H_1,...,H_L)$, 45
  $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[X]$, 290
  $\mathrm{Risk}[\hat{x}(\cdot)|X]$, 260
  $\mathrm{Risk}^*_\epsilon$, 234
  $\mathrm{Risk}^C(T|H_1,...,H_L)$, 60
  $\mathrm{Risk}^C_\ell(T|H_1,...,H_L)$, 60
  $\mathrm{Risk}^{\mathrm{opt}}_\epsilon(K)$, 220
  $\mathrm{Risk}_\ell(T|H_1,...,H_L)$, 45
  $\mathrm{Risk}_\epsilon(\hat{g}(\cdot)|G, X, \upsilon, A, H, M, \Phi)$, 212
  $\mathrm{Risk}_\pm[\phi|P]$, $\mathrm{Risk}[\phi|P_1,P_2]$, 65
  $\mathrm{Risk}_{\mathrm{tot}}(T|H_1,...,H_L)$, 45
  $\mathrm{Risk}_H[\hat{x}_*|X]$, 298
  $\mathrm{Risk}_{\mathrm{opt}}[X]$, 260
  $\mathrm{Risk}_{H,\|\cdot\|}[\hat{x}|X]$, 295
  $C$-, 60
  $H$-, 295
  $\epsilon$-, 212
  $\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat{x}|X]$, 279
  $\mathrm{Risk}_{\Pi,H,\|\cdot\|}[\hat{x}|X]$, 299
  in Hypothesis Testing
    partial, 45
    total, 45
    up to closeness, 59
  minimax, 266
    $\epsilon$-, 220
  of detector, 65
  of simple test, 45

SA, see Stochastic Approximation
SAA, see Sample Average Approximation
saddle point
  convex-concave saddle point problem, 79
Sample Average Approximation, 419–421
Schur Complement Lemma, 265
semidefinite relaxation
  on ellitope
    tightness of, 274
  on spectratope
    tightness of, 277
signal estimation, see signal recovery
signal recovery
  linear, 267
    on ellitope, 267–271
    on ellitope, near-optimality of, 271–274
    on spectratope, 277–291
    on spectratope under uncertain-but-bounded noise, 291–297
    on spectratope under uncertain-but-bounded noise, near-optimality of, 297
    on spectratope, near-optimality of, 277, 290
  problem setting, 1, 260
sparsity, $s$-sparsity, 3
spectratope, 275
  calculus of, 299–302, 429
  examples of, 276
Stochastic Approximation, 421–424

test, see Hypothesis Testing, test
theorem
  Sion-Kakutani, 81

vectors
  notation, xv