Statistical Inference via Convex Optimization 9780691200316

This authoritative book draws on the latest research to explore the interplay of high-dimensional statistics with optimi

English Pages 656 [657] Year 2020

Statistical Inference via Convex Optimization

Princeton Series in Applied Mathematics

Ingrid Daubechies (Duke University); Weinan E (Princeton University); Jan Karel Lenstra (Centrum Wiskunde & Informatica, Amsterdam); Endre Süli (University of Oxford), Series Editors

The Princeton Series in Applied Mathematics features high-quality advanced texts and monographs in all areas of applied mathematics. The series includes books of a theoretical and general nature as well as those that deal with the mathematics of specific applications and real-world scenarios.

For a full list of titles in the series, go to https://press.princeton.edu/series/princeton-series-in-applied-mathematics

Statistical Inference via Convex Optimization, Anatoli Juditsky and Arkadi Nemirovski
A Dynamical Systems Theory of Thermodynamics, Wassim M. Haddad
Formal Verification of Control System Software, Pierre-Loic Garoche
Rays, Waves, and Scattering: Topics in Classical Mathematical Physics, John A. Adam
Mathematical Methods in Elasticity Imaging, Habib Ammari, Elie Bretin, Josselin Garnier, Hyeonbae Kang, Hyundae Lee, and Abdul Wahab
Hidden Markov Processes: Theory and Applications to Biology, M. Vidyasagar
Topics in Quaternion Linear Algebra, Leiba Rodman
Mathematical Analysis of Deterministic and Stochastic Problems in Complex Media Electromagnetics, G. F. Roach, I. G. Stratis, and A. N. Yannacopoulos
Stability and Control of Large-Scale Dynamical Systems: A Vector Dissipative Systems Approach, Wassim M. Haddad and Sergey G. Nersesov
Matrix Completions, Moments, and Sums of Hermitian Squares, Mihály Bakonyi and Hugo J. Woerdeman
Modern Anti-windup Synthesis: Control Augmentation for Actuator Saturation, Luca Zaccarian and Andrew R. Teel
Totally Nonnegative Matrices, Shaun M. Fallat and Charles R. Johnson
Graph Theoretic Methods in Multiagent Networks, Mehran Mesbahi and Magnus Egerstedt
Matrices, Moments and Quadrature with Applications, Gene H. Golub and Gérard Meurant
Control Theoretic Splines: Optimal Control, Statistics, and Path Planning, Magnus Egerstedt and Clyde Martin
Robust Optimization, Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski
Distributed Control of Robotic Networks: A Mathematical Approach to Motion Coordination Algorithms, Francesco Bullo, Jorge Cortés, and Sonia Martinez
Algebraic Curves over a Finite Field, J.W.P. Hirschfeld, G. Korchmáros, and F. Torres
Wave Scattering by Time-Dependent Perturbations: An Introduction, G. F. Roach
Genomic Signal Processing, Ilya Shmulevich and Edward R. Dougherty
The Traveling Salesman Problem: A Computational Study, David L. Applegate, Robert E. Bixby, Vašek Chvátal, and William J. Cook
Positive Definite Matrices, Rajendra Bhatia
Impulsive and Hybrid Dynamical Systems: Stability, Dissipativity, and Control, Wassim M. Haddad, VijaySekhar Chellaboina, and Sergey G. Nersesov

Statistical Inference via Convex Optimization Anatoli Juditsky Arkadi Nemirovski

Princeton University Press Princeton and Oxford

Copyright © 2020 by Princeton University Press

Requests for permission to reproduce material from this work should be sent to [email protected]

Published by Princeton University Press
41 William Street, Princeton, New Jersey 08540
6 Oxford Street, Woodstock, Oxfordshire OX20 1TR
press.princeton.edu

All Rights Reserved

ISBN 978-0-691-19729-6
ISBN (e-book) 978-0-691-20031-6

British Library Cataloging-in-Publication Data is available

Editorial: Susannah Shoemaker and Lauren Bucca
Production Editorial: Nathan Carr
Production: Jacquie Poirier
Publicity: Matthew Taylor and Katie Lewis
Jacket/Cover Credit: Adapted from François de Kresz, “Excusez-moi, excusez-moi...,” 1974
Copyeditor: Bhisham Bherwani

The publisher would like to acknowledge the authors of this volume for providing the camera-ready copy from which this book was printed.

This book has been composed in LaTeX

Printed on acid-free paper.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Contents

List of Figures
Preface
Acknowledgements
Notational Conventions
About Proofs
On Computational Tractability

1 Sparse Recovery via ℓ1 Minimization
  1.1 Compressed Sensing: What is it about?
    1.1.1 Signal Recovery Problem
    1.1.2 Signal Recovery: Parametric and nonparametric cases
    1.1.3 Compressed Sensing via ℓ1 minimization: Motivation
  1.2 Validity of sparse signal recovery via ℓ1 minimization
    1.2.1 Validity of ℓ1 minimization in the noiseless case
    1.2.2 Imperfect ℓ1 minimization
    1.2.3 Regular ℓ1 recovery
    1.2.4 Penalized ℓ1 recovery
    1.2.5 Discussion
  1.3 Verifiability and tractability issues
    1.3.1 Restricted Isometry Property and s-goodness of random matrices
    1.3.2 Verifiable sufficient conditions for Q_q(s, κ)
    1.3.3 Tractability of Q_∞(s, κ)
  1.4 Exercises for Chapter 1
  1.5 Proofs
    1.5.1 Proofs of Theorem 1.3, 1.4
    1.5.2 Proof of Theorem 1.5
    1.5.3 Proof of Proposition 1.7
    1.5.4 Proof of Propositions 1.8 and 1.12
    1.5.5 Proof of Proposition 1.10
    1.5.6 Proof of Proposition 1.13

2 Hypothesis Testing
  2.1 Preliminaries from Statistics: Hypotheses, Tests, Risks
    2.1.1 Hypothesis Testing Problem
    2.1.2 Tests
    2.1.3 Testing from repeated observations
    2.1.4 Risk of a simple test
    2.1.5 Two-point lower risk bound
  2.2 Hypothesis Testing via Euclidean Separation
    2.2.1 Situation
    2.2.2 Pairwise Hypothesis Testing via Euclidean Separation
    2.2.3 Euclidean Separation, Repeated Observations, and Majority Tests
    2.2.4 From Pairwise to Multiple Hypotheses Testing
  2.3 Detectors and Detector-Based Tests
    2.3.1 Detectors and their risks
    2.3.2 Detector-based tests
  2.4 Simple observation schemes
    2.4.1 Simple observation schemes: Motivation
    2.4.2 Simple observation schemes: The definition
    2.4.3 Simple observation schemes: Examples
    2.4.4 Simple observation schemes: Main result
    2.4.5 Simple observation schemes: Examples of optimal detectors
  2.5 Testing multiple hypotheses
    2.5.1 Testing unions
    2.5.2 Testing multiple hypotheses “up to closeness”
    2.5.3 Illustration: Selecting the best among a family of estimates
  2.6 Sequential Hypothesis Testing
    2.6.1 Motivation: Election polls
    2.6.2 Sequential hypothesis testing
    2.6.3 Concluding remarks
  2.7 Measurement Design in simple observation schemes
    2.7.1 Motivation: Opinion polls revisited
    2.7.2 Measurement Design: Setup
    2.7.3 Formulating the MD problem
  2.8 Affine detectors beyond simple observation schemes
    2.8.1 Situation
    2.8.2 Main result
  2.9 Beyond the scope of affine detectors: lifting the observations
    2.9.1 Motivation
    2.9.2 Quadratic lifting: Gaussian case
    2.9.3 Quadratic lifting: Does it help?
    2.9.4 Quadratic lifting: Sub-Gaussian case
    2.9.5 Generic application: Quadratically constrained hypotheses
  2.10 Exercises for Chapter 2
    2.10.1 Two-point lower risk bound
    2.10.2 Around Euclidean Separation
    2.10.3 Hypothesis testing via ℓ1-separation
    2.10.4 Miscellaneous exercises
  2.11 Proofs
    2.11.1 Proof of the observation in Remark 2.8
    2.11.2 Proof of Proposition 2.6 in the case of quasi-stationary K-repeated observations
    2.11.3 Proof of Theorem 2.23
    2.11.4 Proof of Proposition 2.37
    2.11.5 Proof of Proposition 2.43
    2.11.6 Proof of Proposition 2.46

3 From Hypothesis Testing to Estimating Functionals
  3.1 Estimating linear forms on unions of convex sets
    3.1.1 The problem
    3.1.2 The estimate
    3.1.3 Main result
    3.1.4 Near-optimality
    3.1.5 Illustration
  3.2 Estimating N-convex functions on unions of convex sets
    3.2.1 Outline
    3.2.2 Estimating N-convex functions: Problem setting
    3.2.3 Bisection estimate: Construction
    3.2.4 Building Bisection estimate
    3.2.5 Bisection estimate: Main result
    3.2.6 Illustration
    3.2.7 Estimating N-convex functions: An alternative
  3.3 Estimating linear forms beyond simple observation schemes
    3.3.1 Situation and goal
    3.3.2 Construction and main results
    3.3.3 Estimation from repeated observations
    3.3.4 Application: Estimating linear forms of sub-Gaussianity parameters
  3.4 Estimating quadratic forms via quadratic lifting
    3.4.1 Estimating quadratic forms, Gaussian case
    3.4.2 Estimating quadratic form, sub-Gaussian case
  3.5 Exercises for Chapter 3
  3.6 Proofs
    3.6.1 Proof of Proposition 3.3
    3.6.2 Verifying 1-convexity of the conditional quantile
    3.6.3 Proof of Proposition 3.4
    3.6.4 Proof of Proposition 3.14

4 Signal Recovery by Linear Estimation
  Overview
  4.1 Preliminaries: Executive summary on Conic Programming
    4.1.1 Cones
    4.1.2 Conic problems and their duals
    4.1.3 Schur Complement Lemma
  4.2 Near-optimal linear estimation from Gaussian observations
    4.2.1 Situation and goal
    4.2.2 Building a linear estimate
    4.2.3 Byproduct on semidefinite relaxation
  4.3 From ellitopes to spectratopes
    4.3.1 Spectratopes: Definition and examples
    4.3.2 Semidefinite relaxation on spectratopes
    4.3.3 Linear estimates beyond ellitopic signal sets and ‖·‖_2-risk
  4.4 Linear estimates of stochastic signals
    4.4.1 Minimizing Euclidean risk
    4.4.2 Minimizing ‖·‖-risk
  4.5 Linear estimation under uncertain-but-bounded noise
    4.5.1 Uncertain-but-bounded noise
    4.5.2 Mixed noise
  4.6 Calculus of ellitopes/spectratopes
  4.7 Exercises for Chapter 4
    4.7.1 Linear estimates vs. Maximum Likelihood
    4.7.2 Measurement Design in Signal Recovery
    4.7.3 Around semidefinite relaxation
    4.7.4 Around Propositions 4.4 and 4.14
    4.7.5 Signal recovery in Discrete and Poisson observation schemes
    4.7.6 Numerical lower-bounding minimax risk
    4.7.7 Around S-Lemma
    4.7.8 Miscellaneous exercises
  4.8 Proofs
    4.8.1 Preliminaries
    4.8.2 Proof of Proposition 4.6
    4.8.3 Proof of Proposition 4.8
    4.8.4 Proof of Lemma 4.17
    4.8.5 Proofs of Propositions 4.5, 4.16 and 4.19
    4.8.6 Proofs of Propositions 4.18 and 4.19, and justification of Remark 4.20

5 Signal Recovery Beyond Linear Estimates
  Overview
  5.1 Polyhedral estimation
    5.1.1 Motivation
    5.1.2 Generic polyhedral estimate
    5.1.3 Specifying sets H_δ for basic observation schemes
    5.1.4 Efficient upper-bounding of R[H] and contrast design, I.
    5.1.5 Efficient upper-bounding of R[H] and contrast design, II.
    5.1.6 Assembling estimates: Contrast aggregation
    5.1.7 Numerical illustration
    5.1.8 Calculus of compatibility
  5.2 Recovering signals from nonlinear observations by Stochastic Optimization
    5.2.1 Problem setting
    5.2.2 Assumptions
    5.2.3 Estimating via Sample Average Approximation
    5.2.4 Stochastic Approximation estimate
    5.2.5 Numerical illustration
    5.2.6 “Single-observation” case
  5.3 Exercises for Chapter 5
    5.3.1 Estimation by Stochastic Optimization
  5.4 Proofs
    5.4.1 Proof of (5.8)
    5.4.2 Proof of Lemma 5.6
    5.4.3 Verification of (5.44)
    5.4.4 Proof of Proposition 5.10

Solutions to Selected Exercises
  6.1 Solutions for Chapter 1
  6.2 Solutions for Chapter 2
    6.2.1 Two-point lower risk bound
    6.2.2 Around Euclidean Separation
    6.2.3 Hypothesis testing via ℓ1 separation
    6.2.4 Miscellaneous exercises
  6.3 Solutions for Chapter 3
  6.4 Solutions for Chapter 4
    6.4.1 Linear Estimates vs. Maximum Likelihood
    6.4.2 Measurement Design in Signal Recovery
    6.4.3 Around semidefinite relaxation
    6.4.4 Around Propositions 4.4 and 4.14
    6.4.5 Numerical lower-bounding minimax risk
    6.4.6 Around S-Lemma
    6.4.7 Miscellaneous exercises
  6.5 Solutions for Chapter 5
    6.5.1 Estimation by Stochastic Optimization

Appendix: Executive Summary on Efficient Solvability of Convex Optimization Problems
Bibliography
Index

List of Figures

1.1 Top: true 256 × 256 image; bottom: sparse in the wavelet basis approximations of the image. Wavelet basis is orthonormal, and a natural way to quantify near-sparsity of a signal is to look at the fraction of total energy (sum of squares of wavelet coefficients) stored in the leading coefficients; these are the “energy data” presented in the figure.

1.2 Single-pixel camera.

1.3 Regular and penalized ℓ1 recovery of nearly s-sparse signals. o: true signals, +: recoveries (to make the plots readable, one per eight consecutive vector’s entries is shown). Problem sizes are $m = 256$ and $n = 2m = 512$, noise level is $\sigma = 0.01$, deviation from s-sparsity is $\|x - x^s\|_1 = 1$, contrast pair is $(H = \sqrt{n/m}\,A, \|\cdot\|_\infty)$. In penalized recovery, $\lambda = 2s$, parameter $\rho$ of regular recovery is set to $\sigma \cdot \mathrm{ErfcInv}(0.005/n)$.

1.4 Erroneous ℓ1 recovery of a 25-sparse signal, no observation noise. Top: frequency domain, o – true signal, + – recovery. Bottom: time domain.

2.1 “Gaussian Separation” (Example 2.5): Optimal test deciding on whether the mean of Gaussian r.v. belongs to the domain A ($H_1$) or to the domain B ($H_2$). Hyperplane o-o separates the acceptance domains for $H_1$ (“left” half-space) and for $H_2$ (“right” half-space).

2.2 Drawing for Proposition 2.4.

2.3 Positron Emission Tomography (PET)

2.4 Nine hypotheses on the location of the mean $\mu$ of observation $\omega \sim \mathcal{N}(\mu, I_2)$, each stating that $\mu$ belongs to a specific polygon.

2.5 Signal (top, solid) and candidate estimates (top, dotted). Bottom: the primitive of the signal.

2.6 3-candidate hypotheses in probabilistic simplex $\Delta_3$

2.7 PET scanner

2.8 Frames from a “movie”

3.1 Boxplot of empirical distributions, over 20 random estimation problems, of the upper 0.01-risk bounds $\max_{1\le i,j\le 100}\rho_{ij}$ (as in (3.15)) for different observation sample sizes K.

3.2 Bisection via Hypothesis Testing.

3.3 A circuit (nine nodes and 16 arcs). a: arc of interest; b: arcs with measured currents; c: input node where external current and voltage are measured.

3.4 Histograms of recovery errors in experiments, 1,000 simulations per experiment.

4.1 True distribution of temperature $U_* = B(x)$ at time $t_0 = 0.01$ (left) along with its recovery $\widehat{U}$ via the optimal linear estimate (center) and the “naive” recovery $\widetilde{U}$ (right).

5.1 Recovery errors for the near-optimal linear estimate (circles) and for polyhedral estimates yielded by Proposition 5.8 (PolyI, pentagrams) and by the construction from Section 5.1.4 (PolyII, triangles), 20 simulations per each value of $\sigma$.

5.2 Left: functions h; right: moduli of strong monotonicity of the operators $F(\cdot)$ in $\{z : \|z\|_2 \le R\}$ as functions of R. Dashed lines – case A (logistic sigmoid), solid lines – case B (linear regression), dash-dotted lines – case C (hinge function), dotted line – case D (ramp sigmoid).

5.3 Mean errors and CPU times for SA (solid lines) and SAA estimates (dashed lines) as functions of the number of observations K. o – case A (logistic link), x – case B (linear link), + – case C (hinge function), □ – case D (ramp sigmoid).

5.4 Solid curve: $M_{\omega^K}(z)$, dashed curve: $H_{\omega^K}(z)$. True signal x (solid vertical line): +0.081; SAA estimate (unique minimizer of $H_{\omega^K}$, dashed vertical line): −0.252; ML estimate (global minimizer of $M_{\omega^K}$ on [−20, 20]): −20.00, closest to x local minimizer of $M_{\omega^K}$ (dotted vertical line): −0.363.

5.5 Mean errors and CPU times for standard deviation $\lambda = 1$ (solid line) and $\lambda = 0.1$ (dashed line).

6.1 Recovery of a 1200×1600 image at different noise levels, ill-posed case $\Delta = 0$.

6.2 Recovery of a 1200×1600 image at different noise levels, well-posed case $\Delta = 0.25$.

6.3 Smooth curve: f; “bumpy” curve: recovery; gray cloud: observations. In all experiments, $n = 8192$, $\kappa = 2$, $p_0 = p_1 = p_2 = 2$, $\sigma = 0.5$, $L_\iota = (10\pi)^\iota$, $0 \le \iota \le 2$.

PREFACE

When speaking about links between Statistics and Optimization, what comes to mind first is the indispensable role played by optimization algorithms in the “computational toolbox” of Statistics (think about the numerical implementation of the fundamental Maximum Likelihood method). However, on second thought, we should conclude that no matter how significant this role could be, the fact that it comes to our mind first primarily reflects the weaknesses of Optimization rather than its strengths; were the optimization algorithms used in Statistics as efficient and as reliable as, say, Linear Algebra techniques, nobody would think about special links between Statistics and Optimization, just as nobody usually thinks about special links between Statistics and Linear Algebra. When computational, rather than methodological, issues are concerned, we start to think about links of Statistics with Optimization, Linear Algebra, Numerical Analysis, etc. only when computational tools offered to us by these disciplines do not work well and need the attention of experts in these disciplines.

The goal of this book is to present other types of links between Optimization and Statistics, those which have little in common with algorithms and number-crunching. What we are speaking about are the situations where Optimization theory (theory, not algorithms!) seems to be of methodological value in Statistics, acting as the source of statistical inferences with provably optimal, or nearly so, performance. In this context, we focus on utilizing Convex Programming theory, mainly due to its power, but also due to the desire to end up with inference routines reducing to solving convex optimization problems and thus implementable in a computationally efficient fashion. Therefore, while we do not mention computational issues explicitly, we do remember that at the end of the day we need a number, and in this respect, intrinsically computationally friendly convex optimization models are the first choice.

The three topics we intend to consider are:

A. Sparsity-oriented Compressive Sensing. Here the role of Convex Optimization theory as a creative tool motivating the construction of inference procedures is relatively less important than in the two other topics. This being said, its role is far from negligible in the analysis of Compressive Sensing routines (it allows one, e.g., to derive from “first principles” the necessary and sufficient conditions for the validity of ℓ1 recovery). On account of this, and also due to its popularity and the fact that now it is one of the major “customers” of advanced convex optimization algorithms, we believe that Compressive Sensing is worthy of being considered.

B. Pairwise and Multiple Hypothesis Testing, including sequential tests, estimation of linear functionals, and some rudimentary design of experiments.

C. Recovery of signals from noisy observations of their linear images.

B and C are the topics where, as of now, the approaches we present in this book appear to be the most successful.

The exposition does not require prior knowledge of Statistics and Optimization; as far as these disciplines are concerned, all necessary facts and concepts are incorporated into the text. The actual prerequisites are basic Calculus, Probability, and Linear Algebra.

Selection and treatment of our topics are inspired by a kind of “philosophy” which can be explained to an expert as follows. Compare two well-known results of nonparametric statistics (“⟨...⟩” marks fragments irrelevant to the discussion to follow):

Theorem A [I. Ibragimov & R. Khasminskii [124], 1979] Given $\alpha$, $L$, $k$, let $\mathcal{X}$ be the set of all functions $f : [0, 1] \to \mathbf{R}$ with $(\alpha, L)$-Hölder continuous k-th derivative. For a given $t$, the minimax risk of estimating $f(t)$, $f \in \mathcal{X}$, from noisy observations $y = f|_{\Gamma_n} + \xi$, $\xi \sim \mathcal{N}(0, I_n)$, taken along the n-point equidistant grid $\Gamma_n$, up to a factor $C(\beta) = \langle...\rangle$, $\beta := k + \alpha$, is $(L n^{-\beta})^{1/(2\beta+1)}$, and the upper risk bound is attained at the affine in $y$ estimate explicitly given by ⟨...⟩.

Theorem B [D. Donoho [64], 1994] Let $\mathcal{X} \subset \mathbf{R}^N$ be a convex compact set, $A$ be an $n \times N$ matrix, and $g(\cdot)$ be a linear form on $\mathcal{X}$. The minimax, over $f \in \mathcal{X}$, risk of recovering $g(f)$ from the noisy observations $y = Af + \xi$, $\xi \sim \mathcal{N}(0, I_n)$, within factor 1.2 is attained at an affine in $y$ estimate which, along with its risk, can be built efficiently by solving convex optimization problem ⟨...⟩.

In many respects, A and B are similar: both are theorems on minimax optimal estimation of a given linear form of an unknown “signal” $f$ known to belong to a given convex set $\mathcal{X}$ from observations, corrupted by Gaussian noise, of the image of $f$ under linear mapping,¹ and both are associated with efficiently computable near-optimal (in a minimax sense) estimators which happen to be affine in observations. There is, however, a significant structural difference: A gives an explicit “closed form” analytic description of the minimax risk as a function of $n$ and smoothness parameters of $f$, along with explicit description of the near-optimal estimator. Numerous results of this type (let us call them descriptive) form the backbone of the deep and rich theory of Nonparametric Statistics.

¹ Infinite dimensionality of $\mathcal{X}$ in A is of no importance: nothing changes when replacing the original $\mathcal{X}$ with its n-dimensional image under the mapping $f \mapsto f|_{\Gamma_n}$.

This being said, strong “explanation power” of descriptive results has its price: we need to impose assumptions, sometimes quite restrictive, on the entities involved. For example, A says nothing about what happens with the minimax risk/estimate when in addition to smoothness other a priori information on $f$, like monotonicity or convexity, is available, and/or when “direct” observations of $f|_{\Gamma_n}$ are replaced with observations of a linear image of $f$ (say, convolution of $f$ with a given kernel; more often than not, this is what happens in applications), and descriptive answers to the questions just posed require a dedicated (and sometimes quite problematic) investigation more or less “from scratch.”

In contrast, the explanation power of B is basically nonexistent: the statement presents “closed form” expressions neither for the near-optimal estimate, nor for its worst-case risk. As a compensation, B makes only (relatively) mild general structural assumptions about the model (convexity and compactness of $\mathcal{X}$, linear dependence of $y$ on $f$), and all the rest (the near-optimal estimate and its risk) can be found by efficient computation. Moreover, we know in advance that the risk, whatever it happens to be, is within 20% of the actual minimax risk achievable under the circumstances. In this respect, B is an operational, rather than a descriptive, result: it explains how to act to achieve the (nearly) best possible performance, with no a priori prediction of what this performance will be. This hardly is a “big issue” in applications: with huge computational power readily available, efficient computability is, basically, as good as a “simple explicit formula.”

We strongly believe that as far as applications of high-dimensional statistics are concerned, operational results, possessing much broader scope than their descriptive counterparts, are of significant importance and potential. Our main motivation when writing this book was to contribute to the body of operational results in Statistics, and this is what Chapters 2–5 to follow are about.

Anatoli Juditsky & Arkadi Nemirovski
March 6, 2019

ACKNOWLEDGEMENTS

We are greatly indebted to H. Edwin Romeijn who initiated creating the Ph.D. course “Topics in Data Science.” The Lecture Notes for this course form the seed of the book to follow. We gratefully acknowledge support from NSF Grant CCF-1523768 Statistical Inference via Convex Optimization; this research project is the source of basically all novel results presented in Chapters 2–5. Our deepest gratitude goes to Lucien Birgé, who encouraged us to write this monograph, and to Stephen Boyd, who many years ago taught one of the authors “operational philosophy,” motivating the research we are presenting.

Our separate thanks to those who decades ago guided our first steps along the road which led to this book: Rafail Khasminskii, Yakov Tsypkin, and Boris Polyak. We are deeply indebted to our colleagues Alekh Agarwal, Aharon Ben-Tal, Fabienne Comte, Arnak Dalalyan, David Donoho, Céline Duval, Valentine Genon-Catalot, Alexander Goldenshluger, Yuri Golubev, Zaid Harchaoui, Gérard Kerkyacharian, Vladimir Koltchinskii, Oleg Lepski, Pascal Massart, Eric Moulines, Axel Munk, Aleksander Nazin, Yuri Nesterov, Dominique Picard, Alexander Rakhlin, Philippe Rigollet, Alex Shapiro, Vladimir Spokoiny, Alexandre Tsybakov, and Frank Werner for their advice and remarks.

We would like to thank Elitsa Marielle, Andrey Kulunchakov and Hlib Tsyntseus for their assistance when preparing the manuscript. It was our pleasure to collaborate with Princeton University Press on this project. We highly appreciate valuable comments of the anonymous referees, which helped to improve the initial text. We are greatly impressed by the professionalism of Princeton University Press editors, and in particular, Lauren Bucca, Nathan Carr, and Susannah Shoemaker, and also by their care and patience.

Needless to say, responsibility for all drawbacks of the book is ours.

A. J. & A. N.

NOTATIONAL CONVENTIONS

Vectors and matrices. By default, all vectors are column ones; to write them down, we use “Matlab notation”: $\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$ is written as $[1; 2; 3]$. More generally, for vectors/matrices $A, B, C, ..., Z$ of the same “width” (or of the same “height”), $[A; B; C; ...; Z]$ (or $[A, B, C, ..., Z]$) is the matrix obtained by vertical (or horizontal) concatenation of $A$, $B$, $C$, etc.

Examples: For what in the “normal” notation is written down as
\[
A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix},\quad B = \begin{bmatrix} 5 & 6 \end{bmatrix},\quad C = \begin{bmatrix} 7 \\ 8 \end{bmatrix},
\]
we have
\[
[A; B] = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix} = [1, 2; 3, 4; 5, 6],\qquad [A, C] = \begin{bmatrix} 1 & 2 & 7 \\ 3 & 4 & 8 \end{bmatrix} = [1, 2, 7; 3, 4, 8].
\]

Blanks in matrices replace (blocks of) zero entries. For example,
\[
\begin{bmatrix} 1 & & \\ 2 & & \\ 3 & 4 & 5 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 0 & 0 \\ 3 & 4 & 5 \end{bmatrix}.
\]

$\mathrm{Diag}\{A_1, A_2, ..., A_k\}$ stands for a block-diagonal matrix with diagonal blocks $A_1, A_2, ..., A_k$. For example,
\[
\mathrm{Diag}\{1, 2, 3\} = \begin{bmatrix} 1 & & \\ & 2 & \\ & & 3 \end{bmatrix},\qquad \mathrm{Diag}\{[1, 2]; [3; 4]\} = \begin{bmatrix} 1 & 2 & \\ & & 3 \\ & & 4 \end{bmatrix}.
\]

For an $m \times n$ matrix $A$, $\mathrm{dg}(A)$ is the diagonal of $A$: a vector of dimension $\min[m, n]$ with entries $A_{ii}$, $1 \le i \le \min[m, n]$.

Standard linear spaces in our book are Rn (the space of n-dimensional column vectors), Rm×n (the space of m × n real matrices), and Sn (the space of n × n real symmetric matrices). All these linear spaces are equipped with the standard inner product: X hA, Bi = Aij Bij = Tr(AB T ) = Tr(BAT ) = Tr(AT B) = Tr(B T A); i,j

in the case when $A = a$ and $B = b$ are column vectors, this simplifies to $\langle a, b \rangle = a^T b = b^T a$, and when $A$, $B$ are symmetric, there is no need to write $B^T$ in $\mathrm{Tr}(AB^T)$. Usually, we denote vectors by lowercase, and matrices by uppercase letters; sometimes, however, lowercase letters are used also for matrices.

Given a linear mapping $\mathcal{A}(x) : E_x \to E_y$, where $E_x$, $E_y$ are standard linear spaces, one can define the conjugate mapping $\mathcal{A}^*(y) : E_y \to E_x$ via the identity
$$\langle \mathcal{A}(x), y \rangle = \langle x, \mathcal{A}^*(y) \rangle \quad \forall (x \in E_x, y \in E_y).$$
One always has $(\mathcal{A}^*)^* = \mathcal{A}$. When $E_x = \mathbf{R}^n$, $E_y = \mathbf{R}^m$ and $\mathcal{A}(x) = Ax$, one has $\mathcal{A}^*(y) = A^T y$; when $E_x = \mathbf{R}^n$, $E_y = \mathbf{S}^m$, so that $\mathcal{A}(x) = \sum_{i=1}^n x_i A_i$, $A_i \in \mathbf{S}^m$, we have
$$\mathcal{A}^*(Y) = [\mathrm{Tr}(A_1 Y); ...; \mathrm{Tr}(A_n Y)].$$

$\mathbf{Z}^n$ is the set of $n$-dimensional integer vectors.
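The defining identity of the conjugate mapping can be checked numerically on a small instance. The sketch below does this for the mapping x ↦ x_1 A_1 + x_2 A_2 with two symmetric 2 × 2 matrices; the matrices, the vectors, and all function names are made up for illustration and are not taken from the book.

```python
def inner(A, B):
    """Standard inner product <A, B> = sum_ij A_ij * B_ij of two matrices of equal size."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))

def apply_map(mats, x):
    """A(x) = sum_i x_i * A_i for a list `mats` of symmetric m x m matrices."""
    m = len(mats[0])
    return [[sum(x[i] * mats[i][r][c] for i in range(len(mats))) for c in range(m)]
            for r in range(m)]

def apply_conjugate(mats, Y):
    """A*(Y) = [Tr(A_1 Y); ...; Tr(A_n Y)]; for symmetric matrices Tr(A_i Y) = <A_i, Y>."""
    return [inner(Ai, Y) for Ai in mats]

# Hypothetical data: two symmetric matrices, a vector x and a symmetric matrix Y.
A1 = [[1, 2], [2, 0]]
A2 = [[0, 1], [1, 3]]
x = [2, -1]
Y = [[1, 4], [4, 5]]

lhs = inner(apply_map([A1, A2], x), Y)                           # <A(x), Y>
rhs = sum(xi * ci for xi, ci in zip(x, apply_conjugate([A1, A2], Y)))  # <x, A*(Y)>
print(lhs, rhs)  # both equal 11
```

Integer data are used so that the two sides can be compared exactly rather than up to rounding.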

Norms. For $1 \le p \le \infty$ and for a vector $x = [x_1; ...; x_n] \in \mathbf{R}^n$, $\|x\|_p$ is the standard $p$-norm of $x$:
$$\|x\|_p = \begin{cases} \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}, & 1 \le p < \infty, \\ \max_i |x_i| = \lim_{p' \to \infty} \|x\|_{p'}, & p = \infty. \end{cases}$$
The spectral norm (the largest singular value) of a matrix $A$ is denoted by $\|A\|_{2,2}$; notation for other norms of matrices is specified when used.

Standard cones. $\mathbf{R}_+$ is the nonnegative ray on the real axis; $\mathbf{R}^n_+$ stands for the $n$-dimensional nonnegative orthant, the cone comprised of all entrywise nonnegative vectors from $\mathbf{R}^n$; $\mathbf{S}^n_+$ stands for the positive semidefinite cone in $\mathbf{S}^n$, the cone comprised of all positive semidefinite matrices from $\mathbf{S}^n$.

Miscellaneous.
• For matrices $A$, $B$, relation $A \preceq B$, or, equivalently, $B \succeq A$, means that $A$, $B$ are symmetric matrices of the same size such that $B - A$ is positive semidefinite; we write $A \succeq 0$ to express the fact that $A$ is a symmetric positive semidefinite matrix. Strict version $A \succ B$ ($\Leftrightarrow B \prec A$) of $A \succeq B$ means that $A - B$ is positive definite (and, as above, $A$ and $B$ are symmetric matrices of the same size).
• Linear Matrix Inequality (LMI, a.k.a. semidefinite constraint) in variables $x$ is the constraint on $x$ stating that a symmetric matrix affinely depending on $x$ is positive semidefinite. When $x \in \mathbf{R}^n$, LMI reads
$$A_0 + \sum_{i=1}^n x_i A_i \succeq 0 \quad [A_i \in \mathbf{S}^m, 0 \le i \le n].$$
• $\mathcal{N}(\mu, \Theta)$ stands for the Gaussian distribution with mean $\mu$ and covariance matrix $\Theta$. $\mathrm{Poisson}(\mu)$ denotes the Poisson distribution with parameter $\mu \in \mathbf{R}_+$, i.e., the distribution of a random variable taking values $i = 0, 1, 2, ...$ with probabilities $\frac{\mu^i}{i!} e^{-\mu}$. $\mathrm{Uniform}([a, b])$ is the uniform distribution on segment $[a, b]$.
• For a probability distribution $P$,
  • $\xi \sim P$ means that $\xi$ is a random variable with distribution $P$. Sometimes we express the same fact by writing $\xi \sim p(\cdot)$, where $p$ is the density of $P$ taken w.r.t. some reference measure (the latter always is fixed by the context);
  • $\mathbf{E}_{\xi \sim P}\{f(\xi)\}$ is the expectation of $f(\xi)$, $\xi \sim P$; when $P$ is clear from the context, this notation can be shortened to $\mathbf{E}_\xi\{f(\xi)\}$, or $\mathbf{E}_P\{f(\xi)\}$, or even $\mathbf{E}\{f(\xi)\}$. Similarly, $\mathrm{Prob}_{\xi \sim P}\{...\}$, $\mathrm{Prob}_\xi\{...\}$, $\mathrm{Prob}_P\{...\}$, and $\mathrm{Prob}\{...\}$ denote the $P$-probability of the event specified inside the braces.
• $O(1)$’s stand for positive absolute constants: positive reals with numerical values (completely independent of the parameters of the situation at hand) which we do not want or are too lazy to write down explicitly, as in $\sin(x) \le O(1)|x|$.
• $\int_\Omega f(\xi) \Pi(d\xi)$ stands for the integral, taken w.r.t. measure $\Pi$ over domain $\Omega$, of function $f$.
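As a small illustration of the definition of the p-norm given under “Norms” above, the following sketch evaluates it for several values of p and shows the value approaching the maximum magnitude as p grows. The function name and the test vector are chosen here for illustration only.

```python
import math

def p_norm(x, p):
    """Standard p-norm of a vector x for 1 <= p <= infinity (p = math.inf gives max |x_i|)."""
    if p == math.inf:
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [3, -4]  # hypothetical example vector
for p in (1, 2, 10, 200, math.inf):
    # The printed values decrease from |3| + |4| toward max(|3|, |4|).
    print(p, p_norm(x, p))
```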

ABOUT PROOFS The book is basically self-contained in terms of proofs of the statements to follow. Simple proofs usually are placed immediately after the corresponding statements; more technical proofs are transferred to dedicated sections titled “Proof of ...” at the end of each chapter, and this is where a reader should look for “missing” proofs.

ON COMPUTATIONAL TRACTABILITY In the main body of the book, one can frequently meet sentences like “Φ(·) is an efficiently computable convex function,” or “X is a computationally tractable convex set,” or “(P ) is an explicit, and therefore efficiently solvable, convex optimization problem.” For an “executive summary” on what these words actually mean, we refer the reader to the Appendix.


Chapter One

Sparse Recovery via $\ell_1$ Minimization

In this chapter, we overview basic results of Compressed Sensing, a relatively new and rapidly developing area in Statistics and Signal Processing dealing with recovering signals (vectors $x$ from some $\mathbf{R}^n$) from their noisy observations $Ax + \eta$ ($A$ is a given $m \times n$ sensing matrix, $\eta$ is observation noise) in the case when the number of observations $m$ is much smaller than the signal’s dimension $n$, but is essentially larger than the “true” dimension (the number of nonzero entries) of the signal. This setup leads to a deep, elegant and highly innovative theory and possesses quite significant application potential. It should be added that along with the plain sparsity (small number of nonzero entries), Compressed Sensing deals with other types of “low-dimensional structure” hidden in high-dimensional signals, most notably, with the case of low rank matrix recovery (when the signal is a matrix, and sparse signals are matrices with low ranks) and the case of block sparsity, where the signal is a block vector, and sparsity means that only a small number of blocks are nonzero. In our presentation, we do not consider these extensions, and restrict ourselves to the simplest sparsity paradigm.

1.1 COMPRESSED SENSING: WHAT IS IT ABOUT?

1.1.1 Signal Recovery Problem

One of the basic problems in Signal Processing is the problem of recovering a signal $x \in \mathbf{R}^n$ from noisy observations
$$y = Ax + \eta \tag{1.1}$$
of a linear image of the signal under a given sensing mapping $x \mapsto Ax : \mathbf{R}^n \to \mathbf{R}^m$; in (1.1), $\eta$ is the observation error. Matrix $A$ in (1.1) is called the sensing matrix. Recovery problems of the outlined types arise in many applications, including, but by no means limited to,
• communications, where $x$ is the signal sent by the transmitter, $y$ is the signal recorded by the receiver, and $A$ represents the communication channel (reflecting, e.g., dependencies of decays in the signals’ amplitude on the transmitter-receiver distances); $\eta$ here typically is modeled as the standard (zero mean, unit covariance matrix) $m$-dimensional Gaussian noise;¹
• image reconstruction, where the signal $x$ is an image (a 2D array in the usual photography, or a 3D array in tomography) and $y$ is data acquired by the imaging device. Here $\eta$ in many cases (although not always) can again be modeled as the standard Gaussian noise;
• linear regression, arising in a wide range of applications. In linear regression, one is given $m$ pairs, each consisting of an input $a_i \in \mathbf{R}^n$ to a “black box” and the corresponding output $y_i \in \mathbf{R}$. Sometimes we have reason to believe that the output is a noise-corrupted version of the “existing in nature,” but unobservable, “ideal output” $y_i^* = x^T a_i$ which is just a linear function of the input (this is called “linear regression model,” with inputs $a_i$ called “regressors”). Our goal is to convert actual observations $(a_i, y_i)$, $1 \le i \le m$, into estimates of the unknown “true” vector of parameters $x$. Denoting by $A$ the matrix with the rows $[a_i]^T$ and assembling individual observations $y_i$ into a single observation $y = [y_1; ...; y_m] \in \mathbf{R}^m$, we arrive at the problem of recovering vector $x$ from noisy observations of $Ax$. Here again the most popular model for $\eta$ is the standard Gaussian noise.

¹ While the “physical” noise indeed is often Gaussian with zero mean, its covariance matrix is not necessarily the unit matrix. Note, however, that a zero mean Gaussian noise $\eta$ always can be represented as $Q\xi$ with standard Gaussian $\xi$. Assuming that $Q$ is known and is nonsingular (which indeed is so when the covariance matrix of $\eta$ is positive definite), we can rewrite (1.1) equivalently as $Q^{-1} y = [Q^{-1} A] x + \xi$ and treat $Q^{-1} y$ and $Q^{-1} A$ as our new observation and new sensing matrix; the new observation noise $\xi$ is indeed standard. Thus, in the case of Gaussian zero mean observation noise, to assume the noise standard Gaussian is the same as to assume that its covariance matrix is known.

1.1.2 Signal Recovery: Parametric and nonparametric cases

Recovering signal $x$ from observation $y$ would be easy if there were no observation noise ($\eta = 0$) and the rank of matrix $A$ were equal to the dimension $n$ of the signals. In this case, which arises only when $m \ge n$ (“more observations than unknown parameters”), and is typical in this range of $m$ and $n$, the desired $x$ would be the unique solution to the system of linear equations, and to find $x$ would be a simple problem of Linear Algebra. Aside from this trivial “enough observations, no noise” case, people over the years have looked at the following two versions of the recovery problem:

Parametric case: $m \gg n$, $\eta$ is nontrivial noise with zero mean, say, standard Gaussian. This is the classical statistical setup with the emphasis on how to use numerous available observations in order to suppress in the recovery, to the extent possible, the influence of observation noise.

Nonparametric case: $m \ll n$.² If addressed literally, this case seems to be senseless: when the number of observations is less than the number of unknown parameters, even in the noiseless case we arrive at the necessity to solve an underdetermined (fewer equations than unknowns) system of linear equations. Linear Algebra says that if solvable, the system has infinitely many solutions. Moreover, the solution set (an affine subspace of positive dimension) is unbounded, meaning that the solutions are in no sense close to each other. A typical way to make the case of $m \ll n$ meaningful is to add to the observations (1.1) some a priori information about the signal. In traditional Nonparametric Statistics, this additional information is summarized in a bounded convex set $X \subset \mathbf{R}^n$, given to us in advance, known to contain the true signal $x$. This set usually is such that every signal $x \in X$ can be approximated by a linear combination of $s = 1, 2, ..., n$ vectors from a properly selected basis known to us in advance (“dictionary” in the slang of signal processing) within accuracy $\delta(s)$, where $\delta(s)$ is a function, known in advance, approaching 0 as $s \to \infty$. In this situation, with appropriate $A$ (e.g., just the unit matrix, as in the denoising problem), we can select some $s \ll m$ and try to recover $x$ as if it were a vector from the linear span $E_s$ of the first $s$ vectors of the outlined basis [54, 86, 124, 112, 208]. In the “ideal case,” $x \in E_s$, recovering $x$ in fact reduces to the case where the dimension of the signal is $s \ll m$ rather than $n \gg m$, and we arrive at the well-studied situation of recovering a signal of low (compared to the number of observations) dimension. In the “realistic case” of $x$ being $\delta(s)$-close to $E_s$, deviation of $x$ from $E_s$ results in an additional component in the recovery error (“bias”); a typical result of traditional Nonparametric Statistics quantifies the resulting error and minimizes it in $s$ [86, 124, 178, 222, 223, 230, 239]. Of course, this outline of the traditional approach to “nonparametric” (with $n \gg m$) recovery problems is extremely sketchy, but it captures the most important fact in our context: with the traditional approach to nonparametric signal recovery, one assumes that after representing the signals by vectors of their coefficients in a properly selected basis, the $n$-dimensional signal to be recovered can be well approximated by an $s$-sparse (at most $s$ nonzero entries) signal, with $s \ll n$, and this sparse approximation can be obtained by zeroing out all but the first $s$ entries in the signal vector. The assumption just formulated indeed takes place for signals obtained by discretization of smooth uni- and multivariate functions, and this class of signals for several decades was the main, if not the only, focus of Nonparametric Statistics.

² Of course, this is a blatant simplification: the nonparametric case covers also a variety of important and by far nontrivial situations in which $m$ is comparable to $n$ or larger than $n$ (or even $\gg n$). However, this simplification is very convenient, and we will use it in this introduction.

Compressed Sensing. The situation changed dramatically around the year 2000 as a consequence of important theoretical breakthroughs due to D. Donoho, T. Tao, J.
Romberg, E. Candès, and J.-J. Fuchs, among many other researchers [49, 44, 45, 46, 48, 67, 68, 69, 70, 93, 94]; as a result of these breakthroughs, a novel and rich area of research, called Compressed Sensing, emerged.

In the Compressed Sensing (CS) setup of the Signal Recovery problem, as in the traditional Nonparametric Statistics approach to the $m \ll n$ case, it is assumed that after passing to an appropriate basis, the signal to be recovered is $s$-sparse (has $\le s$ nonzero entries, with $s \ll m$), or is well approximated by an $s$-sparse signal. The difference with the traditional approach is that now we assume nothing about the location of the nonzero entries. Thus, the a priori information about the signal $x$ both in the traditional and in the CS settings is summarized in a set $X$ known to contain the signal $x$ we want to recover. The difference is that in the traditional setting, $X$ is a bounded convex and “nice” (well approximated by its low-dimensional cross-sections) set, while in CS this set is, computationally speaking, a “monster”: already in the simplest case of recovering exactly $s$-sparse signals, $X$ is the union of all $s$-dimensional coordinate planes, which is a heavily combinatorial entity.

Note that in many applications we indeed can assume that the true vector of parameters $x$ is sparse. Consider, e.g., the following story about signal detection. There are $n$ locations where signal transmitters could be placed, and $m$ locations with the receivers. The contribution of a signal of unit magnitude originating in location $j$ to the signal measured by receiver $i$ is a known quantity $A_{ij}$, and signals originating in different locations merely sum up in the receivers. Thus, if $x$ is the $n$-dimensional vector with entries $x_j$ representing the magnitudes of signals transmitted in locations $j = 1, 2, ..., n$, then the $m$-dimensional vector $y$ of measurements of the $m$ receivers is $y = Ax + \eta$, where $\eta$ is the observation noise. Given $y$, we intend to recover $x$. Now, if the receivers are, say, hydrophones registering noises emitted by submarines in a certain part of the Atlantic, tentative positions of “submarines” being discretized with resolution 500 m, the dimension of the vector $x$ (the number of points in the discretization grid) may be in the range of tens of thousands, if not tens of millions. At the same time, presumably, there is only a handful of “submarines” (i.e., nonzero entries in $x$) in the area.

To “see” sparsity in everyday life, look at the 256 × 256 image at the top of Figure 1.1. The image can be thought of as a $256^2 = 65{,}536$-dimensional vector comprised of the pixels’ intensities in gray scale, and there is not much sparsity in this vector. However, when representing the image in the wavelet basis, whatever it means, we get a “nearly sparse” vector of wavelet coefficients (this is true for typical “nonpathological” images). At the bottom of Figure 1.1 we see what happens when we zero out all but a small percentage of the wavelet coefficients largest in magnitude and replace the true image by its sparse (in the wavelet basis) approximations.

This simple visual illustration along with numerous similar examples shows the “everyday presence” of sparsity and the possibility to utilize it when compressing signals. The difficulty, however, is that simple compression (compute the coefficients of the signal in an appropriate basis and then keep, say, 10% of the largest in magnitude coefficients) requires us to start with digitizing the signal, that is, representing it as an array of all its coefficients in some orthonormal basis. These coefficients are inner products of the signal with vectors of the basis; for a “physical” signal, like speech or image, these inner products are computed by analog devices, with subsequent discretization of the results.
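The “energy kept” figures of the kind reported in Figure 1.1 can be computed as in the sketch below. It assumes the coefficients are taken in an orthonormal basis, so that energy is the sum of squared coefficients. The coefficient sequence is synthetic (a made-up, rapidly decaying sequence), not the wavelet coefficients of the image in the figure, and the function name is hypothetical.

```python
def energy_fraction_kept(coeffs, k):
    """Fraction of the total energy (sum of squares) carried by the k largest-magnitude coefficients."""
    total = sum(c * c for c in coeffs)
    if total == 0:
        return 1.0
    leading = sorted((c * c for c in coeffs), reverse=True)[:k]
    return sum(leading) / total

# Synthetic "nearly sparse" coefficient vector: magnitudes decay like 1 / (i + 1)^2.
coeffs = [((-1) ** i) / (i + 1) ** 2 for i in range(1000)]

for percent in (1, 5, 10, 25):
    k = len(coeffs) * percent // 100
    print(f"{percent}% of leading coefficients kept: "
          f"{100 * energy_fraction_kept(coeffs, k):.4f}% of energy")
```

For a vector whose sorted magnitudes decay quickly, a small fraction of leading coefficients already carries nearly all of the energy, which is the sense in which such a vector is “nearly sparse.”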
After the measurements are discretized, processing the signal (denoising, compression, storing, etc.) can be fully computerized. The major (to some extent, already actualized) advantage of Compressed Sensing is in the possibility to reduce the “analog effort” in the outlined process: instead of computing by analog means $n$ linear forms of the $n$-dimensional signal $x$ (its coefficients in a basis), we use an analog device to compute $m \ll n$ other linear forms of the signal and then use the signal’s sparsity in a basis known to us in order to recover the signal reasonably well from these $m$ observations.

In our “picture illustration” this technology would work (in fact, works: it is called “single pixel camera” [83]; see Figure 1.2) as follows. In reality, the digital 256 × 256 image on the top of Figure 1.1 was obtained by an analog device, a digital camera which gets on input an analog signal (light of varying intensity along the field of view caught by the camera’s lens) and discretizes the light’s intensity in every pixel to get the digitized image. We then can compute the wavelet coefficients of the digitized image, compress its representation by keeping, say, just 10% of leading coefficients, etc., but “the damage is already done”: we have already spent our analog resources to get the entire digitized image. The technology utilizing Compressed Sensing would work as follows: instead of measuring and discretizing light intensity in each of the 65,536 pixels, we compute (using an analog device) the integral, taken over the field of view, of the product of light intensity and an analog-generated “mask.” We repeat it for, say, 20,000 different masks, thus obtaining measurements of 20,000 linear forms of our 65,536-dimensional signal. Next we utilize, via the Compressed Sensing machinery, the signal’s sparsity in the wavelet basis in order to recover the signal from these 20,000 measurements. With this approach, we reduce the “analog component” of signal processing effort, at the price of increasing the “computerized component” of the effort (instead of a ready-to-use digitized image directly given by 65,536 analog measurements, we need to recover the image by applying computationally nontrivial decoding algorithms to our 20,000 “indirect” measurements). When taking pictures with your camera or iPad, the game is not worth the candle: the analog component of taking usual pictures is cheap enough, and decreasing it at the cost of nontrivial decoding of the digitized measurements would be counterproductive. There are, however, important applications where the advantages stemming from reduced “analog effort” outweigh significantly the drawbacks caused by the necessity to use nontrivial computerized decoding [96, 176].

[Figure 1.1: Top: true 256 × 256 image; bottom: sparse in the wavelet basis approximations of the image. Panels: 1% of leading wavelet coefficients (97.83% of energy) kept; 5% of leading wavelet coefficients (99.51% of energy) kept; 10% of leading wavelet coefficients (99.82% of energy) kept; 25% of leading wavelet coefficients (99.97% of energy) kept. The wavelet basis is orthonormal, and a natural way to quantify near-sparsity of a signal is to look at the fraction of total energy (sum of squares of wavelet coefficients) stored in the leading coefficients; these are the “energy data” presented in the figure.]

[Figure 1.2: Single-pixel camera.]

1.1.3 Compressed Sensing via $\ell_1$ minimization: Motivation

1.1.3.1 Preliminaries

In principle there is nothing surprising in the fact that under reasonable assumptions on the $m \times n$ sensing matrix $A$ we may hope to recover from noisy observations of $Ax$ an $s$-sparse signal $x$, with $s \ll m$. Indeed, assume for the sake of simplicity that there are no observation errors, and let $\mathrm{Col}_j[A]$ be the $j$-th column of $A$. If we knew the locations $j_1 < j_2 < ... < j_s$ of the nonzero entries in $x$, identifying $x$ could be reduced to solving the system of linear equations $\sum_{\ell=1}^{s} x_{j_\ell} \mathrm{Col}_{j_\ell}[A] = y$ with $m$ equations and $s \ll m$ unknowns; assuming every $s$ columns in $A$ to be linearly independent (a quite unrestrictive assumption on a matrix with $m \ge s$ rows), the solution to the above system is unique, and is exactly the signal we are looking for. Of course, the assumption that we know the locations of nonzeros in $x$ makes the recovery problem completely trivial. However, it suggests the following course of action: given noiseless observation $y = Ax$ of an $s$-sparse signal $x$, let us solve the combinatorial optimization problem
$$\min_z \left\{ \|z\|_0 : Az = y \right\}, \tag{1.2}$$
where $\|z\|_0$ is the number of nonzero entries in $z$. Clearly, the problem has a solution with the value of the objective at most $s$. Moreover, it is immediately seen that if every $2s$ columns in $A$ are linearly independent (which again is a very unrestrictive assumption on the matrix $A$ provided that $m \ge 2s$), then the true signal $x$ is the unique optimal solution to (1.2).

What was said so far can be extended to the case of noisy observations and “nearly $s$-sparse” signals $x$. For example, assuming that the observation error is “uncertain-but-bounded,” specifically some known norm $\|\cdot\|$ of this error does not exceed a given $\epsilon > 0$, and that the true signal is $s$-sparse, we could solve the combinatorial optimization problem
$$\min_z \left\{ \|z\|_0 : \|Az - y\| \le \epsilon \right\}. \tag{1.3}$$

Assuming that every $m \times 2s$ submatrix $\bar{A}$ of $A$ not only has linearly independent columns (i.e., trivial kernel), but is reasonably well conditioned,
$$\|\bar{A} w\| \ge C^{-1} \|w\|_2$$
for all $(2s)$-dimensional vectors $w$, with some constant $C$, it is immediately seen that the true signal $x$ underlying the observation and the optimal solution $\hat{x}$ of (1.3) are close to each other within accuracy of order of $\epsilon$: $\|x - \hat{x}\|_2 \le 2C\epsilon$. It is easily seen that the resulting error bound is basically as good as it could be.
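The bound $\|x - \hat{x}\|_2 \le 2C\epsilon$ is stated above as “immediately seen”; one way to spell out the argument is sketched below. The notation $w$, $\bar{w}$ is introduced here for illustration: $w = x - \hat{x}$, $\bar{A}$ is an $m \times 2s$ submatrix of $A$ whose columns cover the support of $w$, and $\bar{w}$ is the corresponding $2s$-dimensional subvector of $w$.

```latex
\begin{align*}
&\|\hat{x}\|_0 \le \|x\|_0 \le s
  && \text{($x$ is feasible for (1.3), $\hat{x}$ is optimal)} \\
&\Rightarrow\ w := x - \hat{x} \text{ has at most } 2s \text{ nonzero entries, so } Aw = \bar{A}\bar{w},\ \|w\|_2 = \|\bar{w}\|_2, \\
&\|Aw\| = \|(Ax - y) - (A\hat{x} - y)\| \le \|Ax - y\| + \|A\hat{x} - y\| \le 2\epsilon
  && \text{(both $x$ and $\hat{x}$ are feasible)} \\
&\Rightarrow\ \|x - \hat{x}\|_2 = \|\bar{w}\|_2 \le C\,\|\bar{A}\bar{w}\| = C\,\|Aw\| \le 2C\epsilon
  && \text{(conditioning assumption on $\bar{A}$)}
\end{align*}
```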

We see that the difficulties with recovering sparse signals stem not from the lack of information; they are of purely computational nature: (1.2) is a difficult combinatorial problem. As far as known theoretical complexity guarantees are concerned, they are not better than “brute force” search through all guesses on where the nonzeros in $x$ are located: inspecting first the only option that there are no nonzeros in $x$ at all, then the $n$ options that there is only one nonzero, for every one of $n$ locations of this nonzero, then the $n(n-1)/2$ options that there are exactly two nonzeros, etc., until the current option results in a solvable system of linear equations $Az = y$ in variables $z$ with entries restricted to vanish outside the locations prescribed by the current option. The running time of this “brute force” search, beyond the range of small values of $s$ and $n$ (by far too small to be of any applied interest), is by many orders of magnitude larger than what we can afford in reality.³

³ When $s = 5$ and $n = 100$, a sharp upper bound on the number of linear systems we should process before termination in the “brute force” algorithm is ≈ 7.53e7: a lot, but perhaps doable. When $n = 200$ and $s = 20$, the number of systems to be processed jumps to ≈ 1.61e27, which is by many orders of magnitude beyond our “computational grasp”; we would be unable to carry out that many computations even if the fate of mankind were at stake. And from the perspective of Compressed Sensing, $n = 200$ still is a completely toy size, 3–4 orders of magnitude less than we would like to handle.

A partial remedy is as follows. Well, if we do not know how to minimize the “bad” objective $\|z\|_0$ under linear constraints, as in (1.2), let us “approximate” this objective with one which we do know how to minimize. The true objective is separable: $\|z\|_0 = \sum_{j=1}^n \xi(z_j)$, where $\xi(s)$ is the function on the axis equal to 0 at the origin and equal to 1 otherwise. As a matter of fact, the separable functions which we do know how to minimize under linear constraints are sums of convex functions of $z_1, ..., z_n$. The most natural candidate to the role of convex approximation of $\xi(s)$ is $|s|$; with this approximation, (1.2) converts into the $\ell_1$ minimization problem
$$\min_z \left\{ \|z\|_1 := \sum_{j=1}^n |z_j| : Az = y \right\}, \tag{1.4}$$
and (1.3) becomes the convex optimization problem
$$\min_z \left\{ \|z\|_1 : \|Az - y\| \le \epsilon \right\}. \tag{1.5}$$
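For contrast with the convex problems (1.4) and (1.5), the “brute force” search for (1.2) described above can be sketched as follows. This is a minimal illustration under stated assumptions, not an implementation from the book: it uses exact rational arithmetic, tries supports of growing size, and stops at the first solvable system. The function names and the toy matrix are made up; the last lines reproduce the orders of magnitude quoted in the footnote, which correspond to the binomial coefficients C(n, s).

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def solve_on_support(A, y, support):
    """Return z with A z = y and z_j = 0 outside `support`, or None if no such z exists."""
    m, n, k = len(A), len(A[0]), len(support)
    # Augmented matrix [A restricted to `support` | y], in exact rational arithmetic.
    M = [[Fraction(A[i][j]) for j in support] + [Fraction(y[i])] for i in range(m)]
    pivots, row = [], 0
    for col in range(k):
        p = next((r for r in range(row, m) if M[r][col] != 0), None)
        if p is None:
            continue  # no pivot in this column: the variable is free and is left at 0
        M[row], M[p] = M[p], M[row]
        pivot = M[row][col]
        M[row] = [v / pivot for v in M[row]]
        for r in range(m):
            if r != row and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[row])]
        pivots.append(col)
        row += 1
    # Rows below the pivot rows have zero coefficients; a nonzero right-hand side means "unsolvable".
    if any(M[r][k] != 0 for r in range(row, m)):
        return None
    z = [Fraction(0)] * n
    for r, col in enumerate(pivots):
        z[support[col]] = M[r][k]
    return z

def l0_brute_force(A, y):
    """Sparsest z with A z = y, found by trying supports of size 0, 1, 2, ... in turn."""
    n = len(A[0])
    for k in range(n + 1):
        for support in combinations(range(n), k):
            z = solve_on_support(A, y, support)
            if z is not None:
                return z
    return None

# Toy instance (hypothetical): every 2 columns of A_toy are linearly independent,
# so every 1-sparse signal is the unique sparsest solution.
A_toy = [[1, 0, 1, 1],
         [0, 1, 1, 2]]
y_toy = [3, 3]                       # y = A_toy x for x = [0, 0, 3, 0]
print(l0_brute_force(A_toy, y_toy))  # recovers the 1-sparse signal

# Number of supports of size exactly s, for the sizes quoted in the footnote.
print(comb(100, 5))            # 75287520, i.e. about 7.53e7
print(float(comb(200, 20)))    # about 1.6e27
```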

Problems (1.4) and (1.5) are both efficiently solvable, which is nice; the question, however, is how relevant these problems are in our context: whether it is true that they do recover the “true” $s$-sparse signals in the noiseless case, or “nearly recover” these signals when the observation error is small. Since we want to be able to handle any $s$-sparse signal, the validity of $\ell_1$ recovery (its ability to recover well every $s$-sparse signal) depends solely on the sensing matrix $A$. Our current goal is to understand which sensing matrices are “good” in this respect.
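That the answer depends on $A$ and on the sparsity level can be seen on a toy instance solvable by hand. The sketch below assumes a 3 × 4 matrix whose kernel is spanned by a single vector w, so the feasible set of (1.4) with y = Ax is the line {x + t w}, and the convex piecewise linear objective attains its minimum at a breakpoint t = −x_j / w_j. The matrix and the function names are made up for illustration; this is not a general-purpose solver.

```python
from fractions import Fraction

# Toy sensing matrix (an assumption for illustration): m = 3, n = 4, rank 3,
# so Ker A is the line spanned by w = [1, 1, 1, 1].
A = [[1, -1, 0, 0],
     [0, 1, -1, 0],
     [0, 0, 1, -1]]
w = [1, 1, 1, 1]

def matvec(M, v):
    """Matrix-vector product M v."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def l1_minimizers_on_line(x, w):
    """Minimize ||x + t*w||_1 over real t (w must have at least one nonzero entry).

    Returns the optimal value and the optimal points x + t*w attained at breakpoints.
    The optimal set is a segment whose endpoints are breakpoints, so the minimizer
    is unique exactly when the returned list has a single element.
    """
    breakpoints = sorted({Fraction(-xj) / wj for xj, wj in zip(x, w) if wj != 0})

    def objective(t):
        return sum(abs(xj + t * wj) for xj, wj in zip(x, w))

    best = min(objective(t) for t in breakpoints)
    optimal = [[xj + t * wj for xj, wj in zip(x, w)]
               for t in breakpoints if objective(t) == best]
    return best, optimal

print(matvec(A, w))                            # [0, 0, 0]: w is in Ker A
print(l1_minimizers_on_line([3, 0, 0, 0], w))  # 1-sparse x: the unique minimizer is x itself
print(l1_minimizers_on_line([1, 1, 0, 0], w))  # 2-sparse x: two optimal endpoints, no uniqueness
```

In this toy case every 1-sparse signal is recovered exactly, while the 2-sparse signal [1, 1, 0, 0] shares its objective value with [0, 0, −1, −1], so $\ell_1$ minimization cannot single it out.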

1.2 VALIDITY OF SPARSE SIGNAL RECOVERY VIA $\ell_1$ MINIMIZATION

What follows is based on the standard basic results of Compressed Sensing theory originating from [19, 49, 45, 44, 46, 47, 48, 67, 69, 70, 93, 94, 232] and augmented by the results of [129, 130, 132, 133].⁴

⁴ In fact, in the latter source, an extension of the sparsity, the so-called block sparsity, is considered; in what follows, we restrict the results of [130] to the case of plain sparsity.

1.2.1 Validity of $\ell_1$ minimization in the noiseless case

The minimal requirement on sensing matrix $A$ which makes $\ell_1$ minimization valid is to guarantee the correct recovery of exactly $s$-sparse signals in the noiseless case, and we start with investigating this property.

1.2.1.1 Notational convention

From now on, for a vector $x \in \mathbf{R}^n$
• $I_x = \{j : x_j \ne 0\}$ stands for the support of $x$; we also set
$$I_x^+ = \{j : x_j > 0\}, \quad I_x^- = \{j : x_j < 0\} \quad [\Rightarrow I_x = I_x^+ \cup I_x^-];$$
• for a subset $I$ of the index set $\{1, ..., n\}$, $x_I$ stands for the vector obtained from $x$ by zeroing out entries with indices not in $I$, and $I^o$ for the complement of $I$:
$$I^o = \{i \in \{1, ..., n\} : i \notin I\};$$
• for $s \le n$, $x^s$ stands for the vector obtained from $x$ by zeroing out all but the $s$ entries largest in magnitude.⁵ Note that $x^s$ is the best $s$-sparse approximation of $x$ in all $\ell_p$ norms, $1 \le p \le \infty$;
• for $s \le n$ and $p \in [1, \infty]$, we set
$$\|x\|_{s,p} = \|x^s\|_p;$$
note that $\|\cdot\|_{s,p}$ is a norm.

1.2.1.2 s-Goodness

Definition of s-goodness. Let us say that an $m \times n$ sensing matrix $A$ is $s$-good if whenever the true signal $x$ underlying noiseless observations is $s$-sparse, this signal will be recovered exactly by $\ell_1$ minimization. In other words, $A$ is $s$-good if whenever $y$ in (1.4) is of the form $y = Ax$ with $s$-sparse $x$, $x$ is the unique optimal solution to (1.4).

Nullspace property. There is a simple-looking necessary and sufficient condition for a sensing matrix $A$ to be $s$-good: the nullspace property originating from [70]. After this property is guessed, it is easy to see that it indeed is necessary and sufficient for $s$-goodness; we, however, prefer to derive this condition from the “first principles,” which can be easily done via Convex Optimization. Thus, in the case in question, as in many other cases, there is no necessity to be smart to arrive at the truth via a “lucky guess”; it suffices to be knowledgeable and use the standard tools.

Let us start with a necessary condition for $A$ to be such that whenever $x$ is $s$-sparse, $x$ is an optimal solution (perhaps not the unique one) of the optimization problem
$$\min_z \left\{ \|z\|_1 : Az = Ax \right\}; \tag{$P[x]$}$$
we refer to the latter property of $A$ as weak $s$-goodness. Our first observation is as follows:

Proposition 1.1. If $A$ is weakly $s$-good, then the following condition holds true: whenever $I$ is a subset of $\{1, ..., n\}$ of cardinality $\le s$, we have
$$\forall w \in \mathrm{Ker} A : \quad \|w_I\|_1 \le \|w_{I^o}\|_1. \tag{1.6}$$

Proof is immediate. Assume $A$ is weakly $s$-good, and let us verify (1.6). Let $I$ be an $s$-element subset of $\{1, ..., n\}$, and $x$ be an $s$-sparse vector with support $I$. Since $A$ is weakly $s$-good, $x$ is an optimal solution to ($P[x]$). Rewriting the latter problem in the form of LP, that is, as
$$\min_{z,t} \Big\{ \sum_j t_j : t_j + z_j \ge 0, \ t_j - z_j \ge 0, \ Az = Ax \Big\},$$
and invoking LP optimality conditions, the necessary and sufficient condition for $z = x$ to be the $z$-component of an optimal solution is the existence of $\lambda_j^+$, $\lambda_j^-$, $\mu \in \mathbf{R}^m$ (Lagrange multipliers for the constraints $t_j - z_j \ge 0$, $t_j + z_j \ge 0$, and $Az = Ax$, respectively) such that
$$\begin{array}{lrcll}
(a) & \lambda_j^+ + \lambda_j^- & = & 1 & \forall j, \\
(b) & \lambda^+ - \lambda^- + A^T \mu & = & 0, & \\
(c) & \lambda_j^+ (|x_j| - x_j) & = & 0 & \forall j, \\
(d) & \lambda_j^- (|x_j| + x_j) & = & 0 & \forall j, \\
(e) & \lambda_j^+ & \ge & 0 & \forall j, \\
(f) & \lambda_j^- & \ge & 0 & \forall j.
\end{array} \tag{1.7}$$
From (c, d), we have $\lambda_j^+ = 1$, $\lambda_j^- = 0$ for $j \in I_x^+$ and $\lambda_j^+ = 0$, $\lambda_j^- = 1$ for $j \in I_x^-$. From (a) and nonnegativity of $\lambda_j^\pm$ it follows that for $j \notin I_x$ we should have $-1 \le \lambda_j^+ - \lambda_j^- \le 1$. With this in mind, the above optimality conditions admit eliminating $\lambda$’s and reduce to the following conclusion:

(!) $x$ is an optimal solution to ($P[x]$) if and only if there exists a vector $\mu \in \mathbf{R}^m$ such that the $j$-th entry of $A^T \mu$ is $-1$ if $x_j > 0$, $+1$ if $x_j < 0$, and a real from $[-1, 1]$ if $x_j = 0$.

Now let $w \in \mathrm{Ker} A$ be a vector with the same signs of entries $w_i$, $i \in I$, as those of the entries in $x$. Then
$$0 = \mu^T A w = [A^T \mu]^T w = \sum_j [A^T \mu]_j w_j$$
$$\Rightarrow \ \sum_{j \in I_x} |w_j| = -\sum_{j \in I_x} [A^T \mu]_j w_j = \sum_{j \notin I_x} [A^T \mu]_j w_j \le \sum_{j \notin I_x} |w_j|$$
(we have used the fact that $[A^T \mu]_j = -\mathrm{sign}\, x_j = -\mathrm{sign}\, w_j$ for $j \in I_x$ and $|[A^T \mu]_j| \le 1$ for all $j$). Since $I$ can be an arbitrary $s$-element subset of $\{1, ..., n\}$ and the pattern of signs of an $s$-sparse vector $x$ supported on $I$ can be arbitrary, (1.6) holds true. ✷

⁵ Note that in general $x^s$ is not uniquely defined by $x$ and $s$, since the $s$-th largest among the magnitudes of entries in $x$ can be achieved at several entries. In our context, it does not matter how ties of this type are resolved; for the sake of definiteness, we can assume that when ordering the entries in $x$ according to their magnitudes, from the largest to the smallest, entries of equal magnitude are ordered in the order of their indices.

1.2.1.3 Nullspace property

In fact, it can be shown that (1.6) is not only a necessary, but also sufficient condition for weak s-goodness of A; we, however, skip this verification, since our goal so far was to guess the condition for s-goodness, and this goal has already been achieved—from what we already know it immediately follows that a necessary condition for s-goodness is for the inequality in (1.6) to be strict whenever w ∈ Ker A is nonzero. Indeed, we already know that if A is s-good, then for every I of cardinality s and every nonzero w ∈ Ker A it holds kwI k1 ≤ kwI o k1 . If the latter inequality for some I and w in question holds true as equality, then A clearly is not s-good, since the s-sparse signal x = wI is not the unique optimal solution to (P [x])—the vector −wI o is a different feasible solution to the same problem and with the same value of the objective. We conclude that for A to be s-good, a necessary condition is ∀(0 6= w ∈ Ker A, I, Card(I) ≤ s) : kwI k1 < kwI o k1 .


By the standard compactness argument, this is the same as the existence of $\gamma \in (0, 1)$ such that
$$\forall (w \in \mathrm{Ker}\,A,\ I,\ \mathrm{Card}(I) \le s):\ \|w_I\|_1 \le \gamma \|w_{I^o}\|_1,$$
or, which is the same, existence of $\kappa \in (0, 1/2)$ such that
$$\forall (w \in \mathrm{Ker}\,A,\ I,\ \mathrm{Card}(I) \le s):\ \|w_I\|_1 \le \kappa \|w\|_1.$$
Finally, the supremum of $\|w_I\|_1$ over $I$ of cardinality $s$ is the norm $\|w\|_{s,1}$ (the sum of $s$ largest magnitudes of entries) of $w$, so that the condition we are processing finally can be formulated as
$$\exists \kappa \in (0, 1/2):\ \|w\|_{s,1} \le \kappa \|w\|_1 \quad \forall w \in \mathrm{Ker}\,A. \tag{1.8}$$
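To get a feeling for condition (1.8), one can evaluate the ratio $\|w\|_{s,1}/\|w\|_1$ on randomly drawn vectors from $\mathrm{Ker}\,A$. The sketch below assumes NumPy; the names `s1_norm` and `sampled_nullspace_ratio` are hypothetical helpers introduced only for this illustration. Sampling yields merely a lower bound on the best possible $\kappa$, so it can refute $s$-goodness but never certify it.

```python
import numpy as np


def s1_norm(w, s):
    """||w||_{s,1}: sum of the s largest magnitudes of the entries of w."""
    return float(np.sort(np.abs(w))[::-1][:s].sum())


def sampled_nullspace_ratio(A, s, trials=2000, seed=0):
    """Largest ||w||_{s,1} / ||w||_1 seen over random w in Ker A.

    This is only a lower bound on the supremum in (1.8).
    """
    rng = np.random.default_rng(seed)
    _, sv, Vt = np.linalg.svd(A)                 # full SVD: Vt is n x n
    rank = int(np.sum(sv > 1e-10 * sv.max()))
    kernel = Vt[rank:].T                         # orthonormal basis of Ker A
    if kernel.shape[1] == 0:
        return 0.0                               # trivial kernel: (1.8) holds vacuously
    best = 0.0
    for _ in range(trials):
        w = kernel @ rng.standard_normal(kernel.shape[1])
        best = max(best, s1_norm(w, s) / np.abs(w).sum())
    return best
```

A returned value of at least 1/2 shows that $A$ is not $s$-good; a value below 1/2 proves nothing, since the worst kernel vector may simply not have been sampled.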

The resulting nullspace condition in fact is necessary and sufficient for $A$ to be $s$-good:

Proposition 1.2. Condition (1.8) is necessary and sufficient for $A$ to be $s$-good.

Proof. We have already seen that the nullspace condition is necessary for $s$-goodness. To verify sufficiency, let $A$ satisfy the nullspace condition, and let us prove that $A$ is $s$-good. Indeed, let $x$ be an $s$-sparse vector, and $y$ be an optimal solution to $(P[x])$; all we need is to prove that $y = x$. Let $I$ be the support of $x$, and $w = y - x$, so that $w \in \mathrm{Ker}\,A$. By the nullspace property we have
$$\|w_I\|_1 \le \kappa\|w\|_1 = \kappa[\|w_I\|_1 + \|w_{I^o}\|_1] = \kappa[\|w_I\|_1 + \|y_{I^o}\|_1] \;\Rightarrow\; \|w_I\|_1 \le \frac{\kappa}{1-\kappa}\|y_{I^o}\|_1$$
$$\Rightarrow\; \|x\|_1 = \|x_I\|_1 = \|y_I - w_I\|_1 \le \|y_I\|_1 + \frac{\kappa}{1-\kappa}\|y_{I^o}\|_1 \le \|y_I\|_1 + \|y_{I^o}\|_1 = \|y\|_1,$$
where the concluding $\le$ is due to $\kappa \in [0, 1/2)$. Since $x$ is a feasible, and $y$ is an optimal solution to $(P[x])$, the resulting inequality $\|x\|_1 \le \|y\|_1$ must be equality, which, again due to $\kappa \in [0, 1/2)$, is possible only when $y_{I^o} = 0$. Thus, $y$ has the same support $I$ as $x$, and $w = y - x \in \mathrm{Ker}\,A$ is supported on the $s$-element set $I$; by the nullspace property, we should have $\|w_I\|_1 \le \kappa\|w\|_1 = \kappa\|w_I\|_1$, which is possible only when $w = 0$. ✷

1.2.2 Imperfect ℓ1 minimization

We have found a necessary and sufficient condition for ℓ1 minimization to recover exactly s-sparse signals in the noiseless case. More often than not, both these assumptions are violated: instead of s-sparse signals, we should speak about “nearly s-sparse” ones, quantifying the deviation from sparsity by the distance from the signal x underlying the observations to its best s-sparse approximation xs . Similarly, we should allow for nonzero observation noise. With noisy observations and/or imperfect sparsity, we cannot hope to recover the signal exactly. All we may hope for, is to recover it with some error depending on the level of observation noise and “deviation from s-sparsity,” and tending to zero as the level and deviation tend to 0. We are about to quantify the nullspace property to allow for instructive “error analysis.”


1.2.2.1 Contrast matrices and quantifications of Nullspace property

By itself, the nullspace property says something about the signals from the kernel of the sensing matrix. We can reformulate it equivalently to say something important about all signals. Namely, observe that given sparsity $s$ and $\kappa \in (0, 1/2)$, the nullspace property
$$\|w\|_{s,1} \le \kappa\|w\|_1 \quad \forall w \in \mathrm{Ker}\,A \tag{1.9}$$
is satisfied if and only if for a properly selected constant $C$ one has⁶
$$\|w\|_{s,1} \le C\|Aw\|_2 + \kappa\|w\|_1 \quad \forall w. \tag{1.10}$$
Indeed, (1.10) clearly implies (1.9); to get the inverse implication, note that for every $h$ orthogonal to $\mathrm{Ker}\,A$ it holds $\|Ah\|_2 \ge \sigma\|h\|_2$, where $\sigma > 0$ is the minimal positive singular value of $A$. Now, given $w \in \mathbf{R}^n$, we can decompose $w$ into the sum of $\bar w \in \mathrm{Ker}\,A$ and $h \in (\mathrm{Ker}\,A)^\perp$, so that
$$\|w\|_{s,1} \le \|\bar w\|_{s,1} + \|h\|_{s,1} \le \kappa\|\bar w\|_1 + \sqrt{s}\|h\|_{s,2} \le \kappa[\|w\|_1 + \|h\|_1] + \sqrt{s}\|h\|_2$$
$$\le \kappa\|w\|_1 + [\kappa\sqrt{n} + \sqrt{s}]\|h\|_2 \le \underbrace{\sigma^{-1}[\kappa\sqrt{n} + \sqrt{s}]}_{C}\,\underbrace{\|Ah\|_2}_{=\|Aw\|_2} + \kappa\|w\|_1,$$
as required in (1.10).

Condition $Q_1(s, \kappa)$. For our purposes, it is convenient to present the condition (1.10) in the following flexible form:
$$\|w\|_{s,1} \le s\|H^TAw\| + \kappa\|w\|_1, \tag{1.11}$$
where $H$ is an $m \times N$ contrast matrix and $\|\cdot\|$ is some norm on $\mathbf{R}^N$. Whenever a pair $(H, \|\cdot\|)$, called contrast pair, satisfies (1.11), we say that $(H, \|\cdot\|)$ satisfies condition $Q_1(s, \kappa)$. From what we have seen, if $A$ possesses the nullspace property with some sparsity level $s$ and some $\kappa \in (0, 1/2)$, then there are many ways to select pairs $(H, \|\cdot\|)$ satisfying $Q_1(s, \kappa)$, e.g., to take $H = CI_m$ with appropriately large $C$ and $\|\cdot\| = \|\cdot\|_2$.

Conditions $Q_q(s, \kappa)$. As we will see in a while, it makes sense to embed the condition $Q_1(s, \kappa)$ into a parametric family of conditions $Q_q(s, \kappa)$, where the parameter $q$ runs through $[1, \infty]$. Specifically, given an $m \times n$ sensing matrix $A$, sparsity level $s \le n$, and $\kappa \in (0, 1/2)$, we say that an $m \times N$ matrix $H$ and a norm $\|\cdot\|$ on $\mathbf{R}^N$ satisfy condition $Q_q(s, \kappa)$ if
$$\|w\|_{s,q} \le s^{\frac{1}{q}}\|H^TAw\| + \kappa s^{\frac{1}{q}-1}\|w\|_1 \quad \forall w \in \mathbf{R}^n. \tag{1.12}$$
Let us make two immediate observations on relations between the conditions:

A. When a pair $(H, \|\cdot\|)$ satisfies condition $Q_q(s, \kappa)$, the pair satisfies also all conditions $Q_{q'}(s, \kappa)$ with $1 \le q' \le q$.

Indeed, in the situation in question for $1 \le q' \le q$ it holds
$$\|w\|_{s,q'} \le s^{\frac{1}{q'}-\frac{1}{q}}\|w\|_{s,q} \le s^{\frac{1}{q'}-\frac{1}{q}}\left[s^{\frac{1}{q}}\|H^TAw\| + \kappa s^{\frac{1}{q}-1}\|w\|_1\right] = s^{\frac{1}{q'}}\|H^TAw\| + \kappa s^{\frac{1}{q'}-1}\|w\|_1,$$
where the first inequality is the standard inequality between $\ell_p$-norms of the $s$-dimensional vector $w^s$.

B. When a pair $(H, \|\cdot\|)$ satisfies condition $Q_q(s, \kappa)$ and $1 \le s' \le s$, the pair $((s/s')^{\frac{1}{q}}H, \|\cdot\|)$ satisfies the condition $Q_q(s', \kappa)$.

Indeed, in the situation in question we clearly have for $1 \le s' \le s$:
$$\|w\|_{s',q} \le \|w\|_{s,q} \le (s')^{\frac{1}{q}}\left\|\left[(s/s')^{\frac{1}{q}}H\right]^TAw\right\| + \kappa\underbrace{s^{\frac{1}{q}-1}}_{\le (s')^{\frac{1}{q}-1}}\|w\|_1.$$

6 Note that (1.9) is exactly the $\phi^2(s, \kappa)$-Compatibility condition of [231] with $\phi(s, \kappa) = C/\sqrt{s}$; see also [232] for the analysis of relationships of this condition with other assumptions (e.g., a similar Restricted Eigenvalue assumption of [20]) used to analyse ℓ1-minimization procedures.

1.2.3 Regular ℓ1 recovery

Given the observation scheme (1.1) with an $m \times n$ sensing matrix $A$, we define the regular ℓ1 recovery of $x$ via observation $y$ as
$$\hat x_{\mathrm{reg}}(y) \in \mathop{\mathrm{Argmin}}_u \left\{\|u\|_1 : \|H^T(Au - y)\| \le \rho\right\}, \tag{1.13}$$
where the contrast matrix $H \in \mathbf{R}^{m \times N}$, the norm $\|\cdot\|$ on $\mathbf{R}^N$ and $\rho > 0$ are parameters of the construction. The role of $Q$-conditions we have introduced is clear from the following

Theorem 1.3. Let $s$ be a positive integer, $q \in [1, \infty]$ and $\kappa \in (0, 1/2)$. Assume that a pair $(H, \|\cdot\|)$ satisfies the condition $Q_q(s, \kappa)$ associated with $A$, and let
$$\Xi_\rho = \{\eta : \|H^T\eta\| \le \rho\}. \tag{1.14}$$
Then for all $x \in \mathbf{R}^n$ and $\eta \in \Xi_\rho$ one has
$$\|\hat x_{\mathrm{reg}}(Ax + \eta) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}}{1 - 2\kappa}\left[\rho + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.15}$$

The above result can be slightly strengthened by replacing the assumption that $(H, \|\cdot\|)$ satisfies $Q_q(s, \kappa)$ with some $\kappa < 1/2$, with a weaker (by observation A from Section 1.2.2.1) assumption that $(H, \|\cdot\|)$ satisfies $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ and satisfies $Q_q(s, \kappa)$ with some (perhaps large) $\kappa$:

Theorem 1.4. Given $A$, integer $s > 0$, and $q \in [1, \infty]$, assume that $(H, \|\cdot\|)$ satisfies the condition $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ and the condition $Q_q(s, \kappa)$ with some $\kappa \ge \varkappa$, and let $\Xi_\rho$ be given by (1.14). Then for all $x \in \mathbf{R}^n$ and $\eta \in \Xi_\rho$ it holds:
$$\|\hat x_{\mathrm{reg}}(Ax + \eta) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}[1 + \kappa - \varkappa]^{\frac{q(p-1)}{p(q-1)}}}{1 - 2\varkappa}\left[\rho + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.16}$$

For proofs of Theorems 1.3 and 1.4, see Section 1.5.1. Before commenting on the above results, let us present their alternative versions.
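When the norm in (1.13) is $\|\cdot\|_\infty$, the regular recovery is a linear program. The following is a minimal sketch, assuming NumPy and SciPy's `linprog` are available; the function name `regular_l1_recovery` and the variable splitting are our illustrative choices, not a prescribed implementation.

```python
import numpy as np
from scipy.optimize import linprog


def regular_l1_recovery(A, H, y, rho):
    """Solve min ||u||_1 s.t. ||H^T (A u - y)||_inf <= rho as a linear program."""
    A, H, y = np.asarray(A, float), np.asarray(H, float), np.asarray(y, float)
    n = A.shape[1]
    G, g = H.T @ A, H.T @ y              # constraint reads |G u - g| <= rho entrywise
    N = G.shape[0]
    I, Z = np.eye(n), np.zeros((N, n))
    # Variables (u, t) with |u_j| <= t_j; we minimize the sum of the t_j.
    c = np.concatenate([np.zeros(n), np.ones(n)])
    A_ub = np.block([[I, -I], [-I, -I], [G, Z], [-G, Z]])
    b_ub = np.concatenate([np.zeros(2 * n), rho + g, rho - g])
    bounds = [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n]
```

In the case of direct observations ($A = H = I$), the problem decouples by coordinates and the recovery reduces to shrinking every entry of $y$ toward zero by $\rho$, which gives a simple sanity check of the code.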


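1.2.4 Penalized ℓ1 recovery

Penalized ℓ1 recovery of signal $x$ from its observation (1.1) is
$$\hat x_{\mathrm{pen}}(y) \in \mathop{\mathrm{Argmin}}_u \left\{\|u\|_1 + \lambda\|H^T(Au - y)\|\right\}, \tag{1.17}$$
where $H \in \mathbf{R}^{m \times N}$, a norm $\|\cdot\|$ on $\mathbf{R}^N$, and a positive real $\lambda$ are parameters of the construction.

Theorem 1.5. Given $A$, positive integer $s$, and $q \in [1, \infty]$, assume that $(H, \|\cdot\|)$ satisfies the conditions $Q_q(s, \kappa)$ and $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ and $\kappa \ge \varkappa$. Then

(i) Let $\lambda \ge 2s$. Then for all $x \in \mathbf{R}^n$, $y \in \mathbf{R}^m$ it holds:
$$\|\hat x_{\mathrm{pen}}(y) - x\|_p \le \frac{4\lambda^{\frac{1}{p}}}{1 - 2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]^{\frac{q(p-1)}{p(q-1)}}\left[\|H^T(Ax - y)\| + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.18}$$
In particular, with $\lambda = 2s$ we have:
$$\|\hat x_{\mathrm{pen}}(y) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}}{1 - 2\varkappa}[1 + \kappa - \varkappa]^{\frac{q(p-1)}{p(q-1)}}\left[\|H^T(Ax - y)\| + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q. \tag{1.19}$$

(ii) Let $\rho \ge 0$, and let $\Xi_\rho$ be given by (1.14). Then for all $x \in \mathbf{R}^n$ and all $\eta \in \Xi_\rho$ one has:
$$\begin{array}{ll}
\lambda \ge 2s \Rightarrow & \|\hat x_{\mathrm{pen}}(Ax + \eta) - x\|_p \le \dfrac{4\lambda^{\frac{1}{p}}}{1 - 2\varkappa}\left[1 + \dfrac{\kappa\lambda}{2s} - \varkappa\right]^{\frac{q(p-1)}{p(q-1)}}\left[\rho + \dfrac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q;\\[2ex]
\lambda = 2s \Rightarrow & \|\hat x_{\mathrm{pen}}(Ax + \eta) - x\|_p \le \dfrac{4(2s)^{\frac{1}{p}}}{1 - 2\varkappa}[1 + \kappa - \varkappa]^{\frac{q(p-1)}{p(q-1)}}\left[\rho + \dfrac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q.
\end{array} \tag{1.20}$$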
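With $\|\cdot\| = \|\cdot\|_\infty$, the penalized recovery (1.17) is a linear program as well: one extra scalar variable $\tau$ upper-bounds the residual norm. The sketch below assumes NumPy and SciPy's `linprog`; the name `penalized_l1_recovery` is hypothetical and the formulation is just one of several equivalent ones.

```python
import numpy as np
from scipy.optimize import linprog


def penalized_l1_recovery(A, H, y, lam):
    """Solve min ||u||_1 + lam * ||H^T (A u - y)||_inf as a linear program."""
    A, H, y = np.asarray(A, float), np.asarray(H, float), np.asarray(y, float)
    n = A.shape[1]
    G, g = H.T @ A, H.T @ y
    N = G.shape[0]
    I = np.eye(n)
    zn, ZN, oneN = np.zeros((n, 1)), np.zeros((N, n)), np.ones((N, 1))
    # Variables (u, t, tau): |u_j| <= t_j and |G u - g| <= tau entrywise.
    c = np.concatenate([np.zeros(n), np.ones(n), [lam]])
    A_ub = np.block([
        [I, -I, zn],
        [-I, -I, zn],
        [G, ZN, -oneN],
        [-G, ZN, -oneN],
    ])
    b_ub = np.concatenate([np.zeros(2 * n), g, -g])
    bounds = [(None, None)] * n + [(0, None)] * (n + 1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n]
```

Note that, unlike the regular recovery, no noise-level parameter enters the problem; only the penalty $\lambda$ (e.g., $\lambda = 2s$ as in (1.19)) has to be chosen.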

For proof, see Section 1.5.2.

1.2.5 Discussion

Some remarks are in order.

A. Qualitatively speaking, Theorems 1.3, 1.4, and 1.5 say the same thing: when $Q$-conditions are satisfied, the regular or penalized recoveries reproduce the true signal exactly when there is no observation noise and the signal is $s$-sparse. In the presence of observation error $\eta$ and imperfect sparsity, the signal is recovered within the error which can be upper-bounded by the sum of two terms, one proportional to the magnitude of observation noise and one proportional to the deviation $\|x - x^s\|_1$ of the signal from $s$-sparse ones. In the penalized recovery, the observation error is measured in the scale given by the contrast matrix and the norm $\|\cdot\|$, as $\|H^T\eta\|$, and in the regular recovery by an a priori upper bound $\rho$ on $\|H^T\eta\|$; when $\rho \ge \|H^T\eta\|$, $\eta$ belongs to $\Xi_\rho$ and thus the bounds (1.15) and (1.16) are applicable to the actual observation error $\eta$. Clearly, in qualitative terms, an error bound of this type is the best we may hope for. Now let us look at the quantitative aspect. Assume that in the regular recovery we use $\rho \approx \|H^T\eta\|$, and in the penalized one $\lambda = 2s$. In this case, error bounds (1.15), (1.16), and (1.20), up to factors $C$ depending solely on $\varkappa$ and $\kappa$ (the parameters of the conditions $Q_1$ and $Q_q$), are the same, specifically,
$$\|\hat x - x\|_p \le Cs^{1/p}\left[\|H^T\eta\| + \|x - x^s\|_1/s\right], \quad 1 \le p \le q. \tag{!}$$


Is this error bound bad or good? The answer depends on many factors, including on how well we select $H$ and $\|\cdot\|$. To get a kind of orientation, consider the trivial case of direct observations, where matrix $A$ is square and, moreover, is proportional to the unit matrix: $A = \alpha I$. Let us assume in addition that $x$ is exactly $s$-sparse. In this case, the simplest way to ensure condition $Q_q(s, \kappa)$, even with $\kappa = 0$, is to take $\|\cdot\| = \|\cdot\|_{s,q}$ and $H = s^{-1/q}\alpha^{-1}I$, so that (!) becomes
$$\|\hat x - x\|_p \le C\alpha^{-1}s^{1/p - 1/q}\|\eta\|_{s,q}, \quad 1 \le p \le q. \tag{!!}$$

As far as the dependence of the bound on the magnitude $\|\eta\|_{s,q}$ of the observation noise is concerned, this dependence is as good as it can be: even if we knew in advance the positions of the $s$ entries of $x$ of largest magnitudes, we would be unable to recover $x$ in $q$-norm with error $\le \alpha^{-1}\|\eta\|_{s,q}$. In addition, with the $s$ largest magnitudes of entries in $\eta$ equal to each other, the $\|\cdot\|_p$-norm of the recovery error clearly cannot be guaranteed to be less than $\alpha^{-1}\|\eta\|_{s,p} = \alpha^{-1}s^{1/p - 1/q}\|\eta\|_{s,q}$. Thus, at least for $s$-sparse signals $x$, our error bound is, basically, the best one can get already in the “ideal” case of direct observations.

B. Given that $(H, \|\cdot\|)$ obeys $Q_1(s, \varkappa)$ with some $\varkappa < 1/2$, the larger the $q$ such that the pair $(H, \|\cdot\|)$ obeys the condition $Q_q(s, \kappa)$ with a given $\kappa \ge \varkappa$ (recall that $\kappa$ can be $\ge 1/2$) and $s$, the larger the range $p \le q$ of values of $p$ where the error bounds (1.16) and (1.20) are applicable. This is in full accordance with the fact that if a pair $(H, \|\cdot\|)$ obeys condition $Q_q(s, \kappa)$, it obeys also all conditions $Q_{q'}(s, \kappa)$ with $1 \le q' \le q$ (item A in Section 1.2.2.1).

C. The flexibility offered by contrast matrix $H$ and norm $\|\cdot\|$ allows us to adjust, to some extent, the recovery to the “geometry of observation errors.” For example, when $\eta$ is “uncertain but bounded,” say, when all we know is that $\|\eta\|_2 \le \delta$ with some given $\delta$, all that matters (on top of the requirement for $(H, \|\cdot\|)$ to obey $Q$-conditions) is how large $\|H^T\eta\|$ could be when $\|\eta\|_2 \le \delta$. In particular, when $\|\cdot\| = \|\cdot\|_2$, the error bound “is governed” by the spectral norm of $H$. Consequently, if we have a technique allowing us to design $H$ such that $(H, \|\cdot\|_2)$ obeys $Q$-condition(s) with given parameters, it makes sense to look for a design with as small a spectral norm of $H$ as possible. In contrast to this, in the case of Gaussian noise, the most interesting for applications,
$$y = Ax + \eta, \quad \eta \sim N(0, \sigma^2I_m), \tag{1.21}$$
looking at the spectral norm of $H$, with $\|\cdot\|_2$ in the role of $\|\cdot\|$, is counterproductive, since a typical realization of $\eta$ is of Euclidean norm of order of $\sqrt{m}\sigma$ and thus is quite large when $m$ is large. In this case to quantify “the magnitude” of $H^T\eta$ by the product of the spectral norm of $H$ and the Euclidean norm of $\eta$ is completely misleading: in typical cases, this product will grow rapidly with the number of observations $m$, completely ignoring the fact that $\eta$ is random with zero mean.⁷

7 The simplest way to see the difference is to look at a particular entry $h^T\eta$ in $H^T\eta$. Operating with spectral norms, we upper-bound this entry by $\|h\|_2\|\eta\|_2$, and the second factor for $\eta \sim N(0, \sigma^2I_m)$ is typically as large as $\sigma\sqrt{m}$. This is in sharp contrast to the fact that typical values of $h^T\eta$ are of order of $\sigma\|h\|_2$, independently of what $m$ is!

What is much better suited for the case of Gaussian noise is the $\|\cdot\|_\infty$ norm in the role of $\|\cdot\|$ and the norm of $H$ which is “the maximum of $\|\cdot\|_2$-norms of the columns in $H$,” denoted by $\|H\|_{1,2}$. Indeed, with $\eta \sim N(0, \sigma^2I_m)$, the entries in $H^T\eta$ are Gaussian with zero mean and variance bounded by $\sigma^2\|H\|_{1,2}^2$, so that $\|H^T\eta\|_\infty$ is the maximum of magnitudes of $N$ zero mean Gaussian random variables with standard deviations bounded by $\sigma\|H\|_{1,2}$. As a result,
$$\mathrm{Prob}\{\|H^T\eta\|_\infty \ge \rho\} \le 2N\,\mathrm{Erfc}\left(\frac{\rho}{\sigma\|H\|_{1,2}}\right) \le N\exp\left\{-\frac{\rho^2}{2\sigma^2\|H\|_{1,2}^2}\right\}, \tag{1.22}$$
where
$$\mathrm{Erfc}(s) = \mathrm{Prob}_{\xi \sim N(0,1)}\{\xi \ge s\} = \frac{1}{\sqrt{2\pi}}\int_s^\infty e^{-t^2/2}\,dt$$
is the (slightly rescaled) complementary error function.

It follows that the typical values of $\|H^T\eta\|_\infty$, $\eta \sim N(0, \sigma^2I_m)$, are of order of at most $\sigma\sqrt{\ln(N)}\|H\|_{1,2}$. In applications we consider in this chapter, we have $N = O(m)$, so that with $\sigma$ and $\|H\|_{1,2}$ given, typical values of $\|H^T\eta\|_\infty$ are nearly independent of $m$. The bottom line is that ℓ1 minimization is capable of handling large-scale Gaussian observation noise incomparably better than “uncertain-but-bounded” observation noise of similar magnitude (measured in Euclidean norm).
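The tail bound (1.22) is easy to probe by simulation. The sketch below, assuming NumPy, draws Gaussian noise and counts how often $\|H^T\eta\|_\infty$ exceeds a given level; the helper names `col_norm_12` and `exceed_frequency` are hypothetical and introduced only for this illustration.

```python
import numpy as np


def col_norm_12(H):
    """||H||_{1,2}: the largest Euclidean norm of a column of H."""
    return float(np.linalg.norm(H, axis=0).max())


def exceed_frequency(H, sigma, rho, trials=20000, seed=0):
    """Empirical frequency of the event ||H^T eta||_inf >= rho, eta ~ N(0, sigma^2 I_m)."""
    rng = np.random.default_rng(seed)
    eta = sigma * rng.standard_normal((trials, H.shape[0]))   # one noise vector per row
    sup_norms = np.abs(eta @ H).max(axis=1)                   # ||H^T eta||_inf per trial
    return float((sup_norms >= rho).mean())
```

With $\rho = \sigma\sqrt{2\ln(N/\epsilon)}\|H\|_{1,2}$ the right-hand side of (1.22) equals $\epsilon$, so the observed frequency should not exceed $\epsilon$ (up to sampling error); in practice it is noticeably smaller, since the union bound behind (1.22) is conservative.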

D. As far as comparison of regular and penalized ℓ1 recoveries with the same pair $(H, \|\cdot\|)$ is concerned, the situation is as follows. Assume for the sake of simplicity that $(H, \|\cdot\|)$ satisfies $Q_q(s, \kappa)$ with some $s$ and $\kappa < 1/2$, and let the observation error be random. Given $\epsilon \in (0, 1)$, let
$$\rho_\epsilon[H, \|\cdot\|] = \min\left\{\rho : \mathrm{Prob}\{\eta : \|H^T\eta\| \le \rho\} \ge 1 - \epsilon\right\}; \tag{1.23}$$
this is nothing but the smallest $\rho$ such that
$$\mathrm{Prob}\{\eta \in \Xi_\rho\} \ge 1 - \epsilon \tag{1.24}$$
(see (1.14)), and thus the smallest $\rho$ for which the error bound (1.15) for the regular ℓ1 recovery holds true with probability $1 - \epsilon$ (or at least the smallest $\rho$ for which the latter claim is supported by Theorem 1.3). With $\rho = \rho_\epsilon[H, \|\cdot\|]$, the regular ℓ1 recovery guarantees (and that is the best guarantee one can extract from Theorem 1.3) that

(#) For some set $\Xi$, $\mathrm{Prob}\{\eta \in \Xi\} \ge 1 - \epsilon$, of “good” realizations of $\eta \sim N(0, \sigma^2I_m)$, one has
$$\|\hat x(Ax + \eta) - x\|_p \le \frac{4(2s)^{\frac{1}{p}}}{1 - 2\kappa}\left[\rho_\epsilon[H, \|\cdot\|] + \frac{\|x - x^s\|_1}{2s}\right], \quad 1 \le p \le q, \tag{1.25}$$
whenever $x \in \mathbf{R}^n$ and $\eta \in \Xi$.

The error bound (1.19) (where we set $\kappa = \varkappa$) says that (#) holds true for the penalized ℓ1 recovery with $\lambda = 2s$. The latter observation suggests that the penalized ℓ1 recovery associated with $(H, \|\cdot\|)$ and $\lambda = 2s$ is better than its regular counterpart, the reason being twofold. First, in order to ensure (#) with the regular recovery, the “built in” parameter $\rho$ of this recovery should be set to $\rho_\epsilon[H, \|\cdot\|]$, and the latter quantity is not always easy to identify. In contrast to this, the construction of penalized ℓ1 recovery is completely independent of a priori assumptions on the structure of observation errors, while automatically ensuring (#) for the error model we use. Second, and more importantly, for the penalized recovery the bound (1.25) is no more than the “worst, with confidence $1 - \epsilon$, case,” while the typical values of the quantity $\|H^T\eta\|$ which indeed participates in the error bound (1.18) may be essentially smaller than $\rho_\epsilon[H, \|\cdot\|]$. Numerical experience fully supports the above claim: the difference in observed performance of the two routines in question, although not dramatic, is definitely in favor of the penalized recovery. The only potential disadvantage of the latter routine is that the penalty parameter $\lambda$ should be tuned to the level $s$ of sparsity we aim at, while the regular recovery is free of any guess of this type. Of course, the “tuning” is rather loose: all we need (and experiments show that we indeed need this) is the relation $\lambda \ge 2s$, so that a rough upper bound on $s$ will do. However, the bound (1.18) deteriorates as $\lambda$ grows.

Finally, we remark that when $H$ is $m \times N$ and $\eta \sim N(0, \sigma^2I_m)$, we have
$$\rho_\epsilon[H, \|\cdot\|_\infty] \le \sigma\,\mathrm{ErfcInv}\left(\frac{\epsilon}{2N}\right)\|H\|_{1,2} \le \sigma\sqrt{2\ln(N/\epsilon)}\,\|H\|_{1,2}$$
(see (1.22)); here $\mathrm{ErfcInv}(\delta)$ is the inverse complementary error function:
$$\mathrm{Erfc}(\mathrm{ErfcInv}(\delta)) = \delta, \quad 0 < \delta < 1. \tag{1.26}$$
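For orientation, both upper bounds on $\rho_\epsilon[H, \|\cdot\|_\infty]$ are straightforward to evaluate. The following sketch uses only the Python standard library; `erfc_inv` and `rho_eps_bounds` are hypothetical names, and `erfc_inv` follows the convention of the text (an upper tail of the standard Gaussian), not the classical $\mathrm{erfc}$ of numerical libraries.

```python
import math
from statistics import NormalDist


def erfc_inv(delta):
    """ErfcInv in the convention of (1.26): Prob{xi >= ErfcInv(delta)} = delta, xi ~ N(0, 1)."""
    # By symmetry of the Gaussian distribution, the upper delta-quantile
    # is minus the lower delta-quantile.
    return -NormalDist().inv_cdf(delta)


def rho_eps_bounds(sigma, col_norm, N, eps):
    """Return (quantile bound, cruder logarithmic bound) on rho_eps[H, ||.||_inf].

    col_norm stands for ||H||_{1,2}, the largest Euclidean norm of a column of H.
    """
    sharp = sigma * erfc_inv(eps / (2 * N)) * col_norm
    crude = sigma * math.sqrt(2 * math.log(N / eps)) * col_norm
    return sharp, crude
```

For instance, `rho_eps_bounds(0.01, 1.0, 512, 0.01)` evaluates both bounds for $\sigma = 0.01$, $N = 512$, $\epsilon = 0.01$ and $\|H\|_{1,2} = 1$ (the last value is an assumption made only for this example); the first returned number never exceeds the second.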

How it works. Here we present a small numerical illustration. We observe in Gaussian noise $m = n/2$ randomly selected terms in an $n$-element “time series” $z = (z_1, ..., z_n)$ and want to recover this series under the assumption that the series is “nearly $s$-sparse in frequency domain,” that is, that $z = Fx$ with $\|x - x^s\|_1 \le \delta$, where $F$ is the matrix of the $n \times n$ Inverse Discrete Cosine Transform, $x^s$ is the vector obtained from $x$ by zeroing out all but the $s$ entries of largest magnitudes, and $\delta$ upper-bounds the distance from $x$ to $s$-sparse signals. Denoting by $A$ the $m \times n$ submatrix of $F$ corresponding to the time instants $t$ where $z_t$ is observed, our observation becomes $y = Ax + \sigma\xi$, where $\xi$ is the standard Gaussian noise. After the signal in the frequency domain, that is, $x$, is recovered by ℓ1 minimization (let the recovery be $\hat x$), we recover the signal in the time domain as $\hat z = F\hat x$. In Figure 1.3, we present four test signals, of different (near-)sparsity, along with their regular and penalized ℓ1 recoveries. The data in Figure 1.3 clearly show how the quality of ℓ1 recovery deteriorates as the number $s$ of “essential nonzeros” of the signal in the frequency domain grows. It is seen also that the penalized recovery meaningfully outperforms the regular one in the range of sparsities up to 64.


[Plots of Figure 1.3: four panels, s = 16, s = 32, s = 64, s = 128, with entry index from 0 to 500 on the horizontal axis. In each panel, top plot: regular ℓ1 recovery, bottom plot: penalized ℓ1 recovery.]

Recovery errors, regular ℓ1 recovery:

                        s = 16    s = 32    s = 64    s = 128
$\|z - \hat z\|_2$        0.2417    0.3871    0.8178    4.8256
$\|z - \hat z\|_\infty$   0.0343    0.0514    0.1744    0.8272

Recovery errors, penalized ℓ1 recovery:

                        s = 16    s = 32    s = 64    s = 128
$\|z - \hat z\|_2$        0.1399    0.2385    0.4216    5.3431
$\|z - \hat z\|_\infty$   0.0177    0.0362    0.1023    0.9141

Figure 1.3: Regular and penalized ℓ1 recovery of nearly $s$-sparse signals. o: true signals, +: recoveries (to make the plots readable, one per eight consecutive vector's entries is shown). Problem sizes are $m = 256$ and $n = 2m = 512$, noise level is $\sigma = 0.01$, deviation from $s$-sparsity is $\|x - x^s\|_1 = 1$, contrast pair is $(H = \sqrt{n/m}\,A, \|\cdot\|_\infty)$. In penalized recovery, $\lambda = 2s$; parameter $\rho$ of regular recovery is set to $\sigma \cdot \mathrm{ErfcInv}(0.005/n)$.

1.3 VERIFIABILITY AND TRACTABILITY ISSUES

The good news about ℓ1 recovery stated in Theorems 1.3, 1.4, and 1.5 is “conditional”: we assume that we are smart enough to point out a pair $(H, \|\cdot\|)$ satisfying condition $Q_1(s, \varkappa)$ with $\varkappa < 1/2$ (and condition $Q_q(s, \kappa)$ with a “moderate” $\kappa$⁸). The related issues are twofold:

1. First, we do not know in which range of $s$, $m$, and $n$ these conditions, or even the weaker than $Q_1(s, \varkappa)$, $\varkappa < 1/2$, nullspace property can be satisfied; and without the nullspace property, ℓ1 minimization becomes useless, at least when we want to guarantee its validity whatever be the $s$-sparse signal we want to recover;

2. Second, it is unclear how to verify whether a given sensing matrix $A$ satisfies the nullspace property for a given $s$, or a given pair $(H, \|\cdot\|)$ satisfies the condition $Q_q(s, \kappa)$ with given parameters.

What is known about these crucial issues can be outlined as follows.

1. It is known that for given $m$, $n$ with $m \ll n$ (say, $m/n \le 1/2$), there exist $m \times n$ sensing matrices which are $s$-good for the values of $s$ “nearly as large as $m$,” specifically, for $s \le O(1)\frac{m}{\ln(n/m)}$.⁹ Moreover, there are natural families of matrices where this level of goodness “is a rule.” E.g., when drawing an $m \times n$ matrix at random from Gaussian or Rademacher distributions (i.e., when filling the matrix with independent realizations of a random variable which is either a standard (zero mean, unit variance) Gaussian one, or takes values $\pm 1$ with probabilities 0.5), the result will be $s$-good, for the outlined value of $s$, with probability approaching 1 as $m$ and $n$ grow. All this remains true when instead of speaking about matrices $A$ satisfying “plain” nullspace properties, we are speaking about matrices $A$ for which it is easy to point out a pair $(H, \|\cdot\|)$ satisfying the condition $Q_2(s, \kappa)$ with, say, $\kappa = 1/4$.

The above results can be considered as good news. The bad news is that we do not know how to check efficiently, given an $s$ and a sensing matrix $A$, that the matrix is $s$-good, just as we do not know how to check that $A$ admits good (i.e., satisfying $Q_1(s, \varkappa)$ with $\varkappa < 1/2$) pairs $(H, \|\cdot\|)$. Even worse: we do not know an efficient recipe allowing us to build, given $m$, an $m \times 2m$ matrix $A^m$ which is provably $s$-good for $s$ larger than $O(1)\sqrt{m}$, which is a much smaller “level of goodness” than the one promised by theory for randomly generated matrices.¹⁰ The “common life” analogy of this situation would be as follows: you know that 90% of bricks in your wall are made of gold, and at the same time, you do not know how to tell a golden brick from a usual one.

2. There exist verifiable sufficient conditions for $s$-goodness of a sensing matrix, similarly to verifiable sufficient conditions for a pair $(H, \|\cdot\|)$ to satisfy condition $Q_q(s, \kappa)$. The bad news is that when $m \ll n$, these verifiable sufficient conditions can be satisfied only when $s \le O(1)\sqrt{m}$, once again, in a much more narrow range of values of $s$ than when typical randomly selected sensing matrices are $s$-good. In fact, $s = O(\sqrt{m})$ is so far the best known sparsity level for which we know individual $s$-good $m \times n$ sensing matrices with $m \le n/2$.

8 $Q_q(s, \kappa)$ is always satisfied with “large enough” $\kappa$, e.g., $\kappa = s$, but such values of $\kappa$ are of no interest: the associated bounds on $p$-norms of the recovery error are straightforward consequences of the bounds on the $\|\cdot\|_1$-norm of this error yielded by the condition $Q_1(s, \varkappa)$.

9 Recall that $O(1)$'s denote positive absolute constants: appropriately chosen numbers like 0.5, or 1, or perhaps 100,000. We could, in principle, replace all $O(1)$'s with specific numbers; following the standard mathematical practice, we do not do it, partly out of laziness, partly because particular values of these numbers in our context are irrelevant.

10 Note that the naive algorithm “generate $m \times 2m$ matrices at random until an $s$-good, with $s$ promised by the theory, matrix is generated” is not an efficient recipe, since we still do not know how to check $s$-goodness efficiently.

1.3.1 Restricted Isometry Property and s-goodness of random matrices

There are several sufficient conditions for $s$-goodness, equally difficult to verify, but provably satisfied for typical random sensing matrices. The best known of them is the Restricted Isometry Property (RIP) defined as follows:

Definition 1.6. Let $k$ be an integer and $\delta \in (0, 1)$. We say that an $m \times n$ sensing matrix $A$ possesses the Restricted Isometry Property with parameters $\delta$ and $k$, RIP($\delta$, $k$), if for every $k$-sparse $x \in \mathbf{R}^n$ one has
$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2. \tag{1.27}$$
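The two-sided inequality (1.27) can be explored empirically on a random matrix with independent $N(0, 1/m)$ entries. The sketch below assumes NumPy; `gaussian_sensing_matrix` and `sampled_rip_range` are hypothetical helper names. Sampling random $k$-sparse vectors yields only an inner estimate of the range of $\|Ax\|_2^2/\|x\|_2^2$, and thus a lower bound on the true $\delta$; it is not a verification of RIP, which would require looking at all supports of cardinality $k$.

```python
import numpy as np


def gaussian_sensing_matrix(m, n, seed=0):
    """m x n matrix with independent N(0, 1/m) entries."""
    return np.random.default_rng(seed).standard_normal((m, n)) / np.sqrt(m)


def sampled_rip_range(A, k, trials=500, seed=0):
    """Smallest and largest ||A x||_2^2 / ||x||_2^2 over random k-sparse x."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    lo, hi = np.inf, 0.0
    for _ in range(trials):
        support = rng.choice(n, size=k, replace=False)   # random support of size k
        x = np.zeros(n)
        x[support] = rng.standard_normal(k)
        ratio = float(np.sum((A @ x) ** 2) / np.sum(x ** 2))
        lo, hi = min(lo, ratio), max(hi, ratio)
    return lo, hi
```

For a matrix satisfying RIP($\delta$, $k$), the returned pair must lie within $[1 - \delta, 1 + \delta]$; observing a ratio outside this segment refutes RIP($\delta$, $k$) for the matrix in question.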

It turns out that for natural ensembles of random $m \times n$ matrices, a typical matrix from the ensemble satisfies RIP($\delta$, $k$) with small $\delta$ and $k$ “nearly as large as $m$,” and that RIP($\frac{1}{6}$, $2s$) implies the nullspace condition, and more. The simplest versions of the corresponding results are as follows.

Proposition 1.7. Given $\delta \in (0, \frac{1}{5}]$, with properly selected positive $c = c(\delta)$, $d = d(\delta)$, $f = f(\delta)$ for all $m \le n$ and all positive integers $k$ such that
$$k \le \frac{m}{c\ln(n/m) + d} \tag{1.28}$$
the probability for a random $m \times n$ matrix $A$ with independent $N(0, \frac{1}{m})$ entries to satisfy RIP($\delta$, $k$) is at least $1 - \exp\{-fm\}$.

For proof, see Section 1.5.3.

Proposition 1.8. Let $A \in \mathbf{R}^{m \times n}$ satisfy RIP($\delta$, $2s$) for some $\delta < 1/3$ and positive integer $s$. Then

(i) The pair $\left(H = \frac{s^{-1/2}}{\sqrt{1 - \delta}}I_m, \|\cdot\|_2\right)$ satisfies the condition $Q_2\left(s, \frac{\delta}{1 - \delta}\right)$ associated with $A$;

(ii) The pair $\left(H = \frac{1}{1 - \delta}A, \|\cdot\|_\infty\right)$ satisfies the condition $Q_2\left(s, \frac{\delta}{1 - \delta}\right)$ associated with $A$.

For proof, see Section 1.5.4.

1.3.2 Verifiable sufficient conditions for $Q_q(s, \kappa)$

When speaking about verifiable sufficient conditions for a pair $(H, \|\cdot\|)$ to satisfy $Q_q(s, \kappa)$, it is convenient to restrict ourselves to the case where $H$, like $A$, is an $m \times n$ matrix, and $\|\cdot\| = \|\cdot\|_\infty$.

Proposition 1.9. Let $A$ be an $m \times n$ sensing matrix, and $s \le n$ be a sparsity level. Given an $m \times n$ matrix $H$ and $q \in [1, \infty]$, let us set
$$\nu_{s,q}[H] = \max_{j \le n}\|\mathrm{Col}_j[I - H^TA]\|_{s,q}, \tag{1.29}$$
where $\mathrm{Col}_j[C]$ is the $j$-th column of matrix $C$. Then
$$\|w\|_{s,q} \le s^{1/q}\|H^TAw\|_\infty + \nu_{s,q}[H]\|w\|_1 \quad \forall w \in \mathbf{R}^n, \tag{1.30}$$
implying that the pair $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_q(s, s^{1 - \frac{1}{q}}\nu_{s,q}[H])$.

Proof is immediate. Setting $V = I - H^TA$, we have
$$\|w\|_{s,q} = \|[H^TA + V]w\|_{s,q} \le \|H^TAw\|_{s,q} + \|Vw\|_{s,q} \le s^{1/q}\|H^TAw\|_\infty + \sum_j |w_j|\,\|\mathrm{Col}_j[V]\|_{s,q} \le s^{1/q}\|H^TAw\|_\infty + \nu_{s,q}[H]\|w\|_1.$$ ✷
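The quantity $\nu_{s,q}[H]$ from (1.29) is indeed cheap to compute. A minimal sketch, assuming NumPy; the names `sq_norm`, `nu` and `certified_kappa` are hypothetical and chosen for this illustration only:

```python
import numpy as np


def sq_norm(w, s, q):
    """||w||_{s,q}: l_q norm of the s largest-magnitude entries of w (q may be np.inf)."""
    top = np.sort(np.abs(w))[::-1][:s]
    return float(np.linalg.norm(top, ord=q))


def nu(A, H, s, q):
    """nu_{s,q}[H] = max_j ||Col_j[I - H^T A]||_{s,q}, see (1.29)."""
    V = np.eye(A.shape[1]) - H.T @ A
    return max(sq_norm(V[:, j], s, q) for j in range(V.shape[1]))


def certified_kappa(A, H, s, q):
    """The kappa for which Proposition 1.9 certifies Q_q(s, kappa) for (H, ||.||_inf)."""
    return s ** (1.0 - 1.0 / q) * nu(A, H, s, q)
```

If `certified_kappa(A, H, s, q)` is below 1/2, the pair $(H, \|\cdot\|_\infty)$ is a certified contrast pair for the given sparsity level; a value of 1/2 or more means only that this particular $H$ certifies nothing.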



Observe that the function $\nu_{s,q}[H]$ is an efficiently computable convex function of $H$, so that the set
$$\mathcal{H}^\kappa_{s,q} = \{H \in \mathbf{R}^{m \times n} : \nu_{s,q}[H] \le s^{\frac{1}{q} - 1}\kappa\} \tag{1.31}$$
is a computationally tractable convex set. When this set is nonempty for some $\kappa < 1/2$, every point $H$ in this set is a contrast matrix such that $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_q(s, \kappa)$, that is, we can find contrast matrices making ℓ1 minimization valid. Moreover, we can design the contrast matrix, e.g., by minimizing over $\mathcal{H}^\kappa_{s,q}$ the function $\|H\|_{1,2}$, thus optimizing the sensitivity of the corresponding ℓ1 recoveries to Gaussian observation noise; see items C, D in Section 1.2.5.

Explanation. The sufficient condition for $s$-goodness of $A$ stated in Proposition 1.9 looks as if coming out of thin air; in fact it is a particular case of a simple and general construction as follows. Let $f(x)$ be a real-valued convex function on $\mathbf{R}^n$, and $X \subset \mathbf{R}^n$ be a nonempty bounded polytope represented as
$$X = \{x \in \mathrm{Conv}\{g_1, ..., g_N\} : Ax = 0\},$$
where $\mathrm{Conv}\{g_1, ..., g_N\} = \{\sum_i \lambda_ig_i : \lambda \ge 0, \sum_i \lambda_i = 1\}$ is the convex hull of vectors $g_1, ..., g_N$. Our goal is to upper-bound the maximum $\mathrm{Opt} = \max_{x \in X} f(x)$; this is a meaningful problem, since precisely maximizing a convex function over a polyhedron typically is a computationally intractable task. Let us act as follows: clearly, for any matrix $H$ of the same size as $A$ we have $\max_{x \in X} f(x) = \max_{x \in X} f([I - H^TA]x)$, since on $X$ we have $[I - H^TA]x = x$. As a result,
$$\begin{array}{rcl}
\mathrm{Opt} &:=& \max\limits_{x \in X} f(x) = \max\limits_{x \in X} f([I - H^TA]x)\\
&\le& \max\limits_{x \in \mathrm{Conv}\{g_1, ..., g_N\}} f([I - H^TA]x)\\
&=& \max\limits_{j \le N} f([I - H^TA]g_j).
\end{array}$$
We get a parametric (the parameter being $H$) upper bound on $\mathrm{Opt}$, namely, the bound $\max_{j \le N} f([I - H^TA]g_j)$. This parametric bound is convex in $H$, and thus is well suited for minimization over this parameter.

The result of Proposition 1.9 is inspired by this construction as applied to the nullspace property: given an $m \times n$ sensing matrix $A$ and setting
$$X = \{x \in \mathbf{R}^n : \|x\|_1 \le 1, Ax = 0\} = \{x \in \mathrm{Conv}\{\pm e_1, ..., \pm e_n\} : Ax = 0\}$$
($e_i$ are the basic orths in $\mathbf{R}^n$), $A$ is $s$-good if and only if
$$\mathrm{Opt}_s := \max_{x \in X}\{f(x) := \|x\|_{s,1}\} < 1/2.$$
A verifiable sufficient condition for this, as yielded by the above construction, is the existence of an $m \times n$ matrix $H$ such that
$$\max_{j \le n}\max\left[f([I_n - H^TA]e_j), f(-[I_n - H^TA]e_j)\right] < 1/2,$$
or, which is the same,
$$\max_j \|\mathrm{Col}_j[I_n - H^TA]\|_{s,1} < 1/2.$$
This observation brings to our attention the matrix $I - H^TA$ with varying $H$ and the idea of expressing sufficient conditions for $s$-goodness and related properties in terms of this matrix.

1.3.3 Tractability of $Q_\infty(s, \kappa)$

As we have already mentioned, the conditions $Q_q(s, \kappa)$ are intractable, in the sense that we do not know how to verify whether a given pair $(H, \|\cdot\|)$ satisfies the condition. Surprisingly, this is not the case with the strongest of these conditions, the one with $q = \infty$. Namely,

Proposition 1.10. Let $A$ be an $m \times n$ sensing matrix, $s$ be a sparsity level, and $\kappa \ge 0$. Then whenever a pair $(\bar H, \|\cdot\|)$ satisfies the condition $Q_\infty(s, \kappa)$, there exists an $m \times n$ matrix $H$ such that
$$\|\mathrm{Col}_j[I_n - H^TA]\|_{s,\infty} = \|\mathrm{Col}_j[I_n - H^TA]\|_\infty \le s^{-1}\kappa, \quad 1 \le j \le n$$
(so that $(H, \|\cdot\|_\infty)$ satisfies $Q_\infty(s, \kappa)$ by Proposition 1.9), and also
$$\|H^T\eta\|_\infty \le \|\bar H^T\eta\| \quad \forall \eta \in \mathbf{R}^m. \tag{1.32}$$
In addition, the $m \times n$ contrast matrix $H$ such that the pair $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa)$ with as small $\kappa$ as possible can be found as follows. Consider $n$ LP programs
$$\mathrm{Opt}_i = \min_{\nu, h}\left\{\nu : \|A^Th - e_i\|_\infty \le \nu\right\}, \tag{$\#_i$}$$
where $e_i$ is the $i$-th basic orth of $\mathbf{R}^n$. Let $\mathrm{Opt}_i$ and $h_i$, $i = 1, ..., n$, be the optimal values of and optimal solutions to these problems; we set $H = [h_1, ..., h_n]$; the corresponding value of $\kappa$ is
$$\kappa_* = s\max_i \mathrm{Opt}_i.$$
Besides this, there exists a transparent alternative description of the quantities $\mathrm{Opt}_i$ (and thus of $\kappa_*$); specifically,
$$\mathrm{Opt}_i = \max_x\{x_i : \|x\|_1 \le 1, Ax = 0\}. \tag{1.33}$$
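The $n$ LP programs $(\#_i)$ can be solved with any LP solver. Below is a minimal sketch, assuming NumPy and SciPy's `linprog`; the function name `q_inf_contrast` is hypothetical. It returns the contrast matrix $H = [h_1, ..., h_n]$ together with the vector of optimal values $\mathrm{Opt}_i$, so that $\kappa_* = s\max_i \mathrm{Opt}_i$.

```python
import numpy as np
from scipy.optimize import linprog


def q_inf_contrast(A):
    """Solve Opt_i = min {nu : ||A^T h - e_i||_inf <= nu} for i = 1..n."""
    A = np.asarray(A, float)
    m, n = A.shape
    ones = np.ones((n, 1))
    # Variables (h, nu): A^T h - nu <= e_i and -A^T h - nu <= -e_i, entrywise.
    c = np.concatenate([np.zeros(m), [1.0]])
    A_ub = np.block([[A.T, -ones], [-A.T, -ones]])
    bounds = [(None, None)] * m + [(0, None)]
    H, opt = np.zeros((m, n)), np.zeros(n)
    for i in range(n):
        e = np.zeros(n)
        e[i] = 1.0
        res = linprog(c, A_ub=A_ub, b_ub=np.concatenate([e, -e]),
                      bounds=bounds, method="highs")
        if not res.success:
            raise RuntimeError(res.message)
        H[:, i], opt[i] = res.x[:m], res.fun
    return H, opt
```

For the $1 \times 2$ matrix $A = [1, -1]$, the kernel is spanned by $(1, 1)$, so (1.33) gives $\mathrm{Opt}_1 = \mathrm{Opt}_2 = 1/2$, which the LP reproduces; consequently $\kappa_* = s/2 \ge 1/2$, and no contrast matrix certifies even 1-goodness of this $A$.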

For proof, see Section 1.5.5.

Taken along with (1.32) and error bounds of Theorems 1.3, 1.4, and 1.5, Proposition 1.10 says that

As far as the condition $Q_\infty(s, \kappa)$ is concerned, we lose nothing when restricting ourselves with pairs $(H \in \mathbf{R}^{m \times n}, \|\cdot\|_\infty)$ and contrast matrices $H$ satisfying the condition
$$|[I_n - H^TA]_{ij}| \le s^{-1}\kappa, \tag{1.34}$$
implying that $(H, \|\cdot\|_\infty)$ satisfies $Q_\infty(s, \kappa)$.

The good news is that (1.34) is an explicit convex constraint on $H$ (in fact, even on $H$ and $\kappa$), so that we can solve the design problems, where we want to optimize a convex function of $H$ under the requirement that $(H, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa)$ (and, perhaps, additional convex constraints on $H$ and $\kappa$).

1.3.3.1 Mutual Incoherence

The simplest (and up to some point in time, the only) verifiable sufficient condition for s-goodness of a sensing matrix A is expressed in terms of mutual incoherence of A, defined as |ColTi [A]Colj [A]| . (1.35) µ(A) = max i6=j kColi [A]k22 This quantity is well defined whenever A has no zero columns (otherwise A is not even 1-good). Note that when A is normalized to have all columns of equal k · k2 lengths,11 µ(A) is small when the columns of A are nearly mutually orthogonal. The standard related result is that Whenever A and a positive integer s are such that

2µ(A) 1+µ(A)

< 1s , A is s-good.

It is immediately seen that the latter condition is weaker than what we can get with the aid of (1.34): Proposition 1.11. Let A be an m × n matrix, and let the columns of m × n matrix H be given by Colj (H) =

1 Colj (A), 1 ≤ j ≤ n. (1 + µ(A))kColj (A)k22

Then |[Im − H T A]ij | ≤

µ(A) ∀i, j. 1 + µ(A)

(1.36)

11 As far as ℓ minimization is concerned, this normalization is non-restrictive: we always can 1 enforce it by diagonal scaling of the signal underlying observations (1.1), and ℓ1 minimization in scaled variables is the same as weighted ℓ1 minimization in original variables.


In particular, when $\frac{2\mu(A)}{1+\mu(A)} < \frac{1}{s}$, $A$ is s-good.

Proof. With $H$ as above, the diagonal entries in $I - H^TA$ are equal to $1 - \frac{1}{1+\mu(A)} = \frac{\mu(A)}{1+\mu(A)}$, while by definition of mutual incoherence the magnitudes of the off-diagonal entries in $I - H^TA$ are $\le \frac{\mu(A)}{1+\mu(A)}$ as well, implying (1.36). The “in particular” claim is given by (1.36) combined with Proposition 1.9. ✷
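The quantities in (1.35) and (1.36) are easy to evaluate for a small matrix. The sketch below is only an illustration under stated assumptions: matrices are plain lists of rows, and the function names (`mutual_incoherence`, `contrast_from_incoherence`, `certified_sparsity`) are hypothetical helpers introduced here, not notation from the text.

```python
import math


def columns(A):
    """Columns of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mutual_incoherence(A):
    """mu(A) = max over i != j of |Col_i^T Col_j| / ||Col_i||_2^2, cf. (1.35)."""
    cols = columns(A)
    return max(
        abs(dot(ci, cj)) / dot(ci, ci)
        for i, ci in enumerate(cols)
        for j, cj in enumerate(cols)
        if i != j
    )


def contrast_from_incoherence(A):
    """H with Col_j(H) = Col_j(A) / ((1 + mu) ||Col_j(A)||_2^2), cf. Proposition 1.11."""
    mu = mutual_incoherence(A)
    h_cols = [[a / ((1.0 + mu) * dot(c, c)) for a in c] for c in columns(A)]
    return [list(row) for row in zip(*h_cols)]  # back to a list of rows


def max_abs_entry_of_I_minus_HtA(H, A):
    """Largest magnitude among the entries of I_n - H^T A."""
    h_cols, a_cols = columns(H), columns(A)
    return max(
        abs((1.0 if i == j else 0.0) - dot(hi, aj))
        for i, hi in enumerate(h_cols)
        for j, aj in enumerate(a_cols)
    )


def certified_sparsity(mu):
    """Largest integer s with 2*mu/(1+mu) < 1/s; assumes mu > 0."""
    if mu <= 0:
        raise ValueError("mu must be positive for a finite answer")
    return math.ceil((1.0 + mu) / (2.0 * mu)) - 1


# Toy 2 x 3 matrix with unit-norm columns (an arbitrary example).
A = [[1.0, 0.0, 0.6], [0.0, 1.0, 0.8]]
mu = mutual_incoherence(A)
H = contrast_from_incoherence(A)
print(mu, max_abs_entry_of_I_minus_HtA(H, A), mu / (1.0 + mu), certified_sparsity(mu))
```

For this toy matrix the largest entry of $I_n - H^TA$ sits exactly at the level $\mu(A)/(1+\mu(A))$, in line with the diagonal computation in the proof.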

1.3.3.2

From RIP to conditions Qq (·, κ)

It turns out that when $A$ is RIP($\delta, k$) and $q \ge 2$, it is easy to point out pairs $(H, \|\cdot\|)$ satisfying $Q_q(t,\kappa)$ with a desired $\kappa > 0$ and properly selected $t$:

Proposition 1.12. Let $A$ be an $m \times n$ sensing matrix satisfying RIP($\delta, 2s$) with some $s$ and some $\delta \in (0,1)$, and let $q \in [2,\infty]$ and $\kappa > 0$ be given. Then

(i) Whenever a positive integer $t$ satisfies

$$t \le \min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac{q}{q-1}},\ s^{\frac{q-2}{2q-2}}\right] s^{\frac{q}{2q-2}}, \tag{1.37}$$

the pair $\left(H = \frac{t^{-1/q}}{\sqrt{1-\delta}}\, I_m,\ \|\cdot\|_2\right)$ satisfies $Q_q(t,\kappa)$;

(ii) Whenever a positive integer $t$ satisfies (1.37), the pair $\left(H = \frac{s^{1/2}\, t^{-1/q}}{1-\delta}\, A,\ \|\cdot\|_\infty\right)$ satisfies $Q_q(t,\kappa)$.
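To get a feel for how restrictive (1.37) is, one can evaluate its right-hand side directly. This is a minimal sketch; the function name `max_certified_t` is a hypothetical helper, and the case $q = \infty$ is handled by taking the limits of the exponents.

```python
import math


def max_certified_t(s, delta, kappa, q=math.inf):
    """Largest integer t allowed by the right-hand side of (1.37); 0 means no t >= 1 fits.

    Assumes 0 < delta < 1, kappa > 0 and 2 <= q <= infinity.
    """
    ratio = kappa * (1.0 - delta) / delta
    if math.isinf(q):
        # Limits of the exponents q/(q-1), (q-2)/(2q-2), q/(2q-2) as q -> infinity.
        first, second, power = ratio, math.sqrt(s), 0.5
    else:
        first = ratio ** (q / (q - 1.0))
        second = s ** ((q - 2.0) / (2.0 * q - 2.0))
        power = q / (2.0 * q - 2.0)
    return math.floor(min(first, second) * s ** power)


# With delta = 0.2, kappa = 1/3 and q = infinity the bound is (4/3) * sqrt(s).
for s in (25, 100, 400):
    print(s, max_certified_t(s, 0.2, 1.0 / 3.0))
```

With these sample values the certified level grows like $\sqrt{s}$ rather than like $s$.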

For proof, see Section 1.5.4.

The most important consequence of Proposition 1.12 deals with the case of $q = \infty$ and states that when s-goodness of a sensing matrix $A$ can be ensured by the difficult to verify condition RIP($\delta, 2s$) with, say, $\delta = 0.2$, the somewhat worse level of sparsity, $t = O(1)\sqrt{s}$ with properly selected absolute constant $O(1)$, can be certified via condition $Q_\infty(t, \frac{1}{3})$: there exists a pair $(H, \|\cdot\|_\infty)$ satisfying this condition. The point is that by Proposition 1.10, if the condition $Q_\infty(t, \frac{1}{3})$ can at all be satisfied, a pair $(H, \|\cdot\|_\infty)$ satisfying this condition can be found efficiently.

Unfortunately, the significant “dropdown” in the level of sparsity when passing from unverifiable RIP to verifiable $Q_\infty$ is inevitable; this bad news is what is on our agenda now.

1.3.3.3

Limits of performance of verifiable sufficient conditions for goodness

Proposition 1.13. Let $A$ be an $m \times n$ sensing matrix which is “essentially nonsquare,” specifically, such that $2m \le n$, and let $q \in [1,\infty]$. Whenever a positive integer $s$ and an $m \times n$ matrix $H$ are linked by the relation

$$\|\mathrm{Col}_j[I_n - H^TA]\|_{s,q} < \tfrac{1}{2}\, s^{\frac{1}{q}-1}, \quad 1 \le j \le n, \tag{1.38}$$

one has

$$s \le \sqrt{m}. \tag{1.39}$$

As a result, the sufficient condition for the validity of $Q_q(s,\kappa)$ with $\kappa < 1/2$ from Proposition 1.9 can never be satisfied when $s > \sqrt{m}$. Similarly, the verifiable sufficient condition $Q_\infty(s,\kappa)$, $\kappa < 1/2$, for s-goodness of $A$ cannot be satisfied when $s > \sqrt{m}$.

Figure 1.4: Erroneous $\ell_1$ recovery of a 25-sparse signal, no observation noise. Top: frequency domain, o – true signal, + – recovery. Bottom: time domain.

For proof, see Section 1.5.6. We see that unless $A$ is “nearly square,” our (same as all others known to us) verifiable sufficient conditions for s-goodness are unable to justify this property for “large” $s$. This unpleasant fact is in full accordance with the already mentioned fact that no individual provably s-good “essentially nonsquare” $m \times n$ matrices with $s \ge O(1)\sqrt{m}$ are known. Matrices for which our verifiable sufficient conditions do establish s-goodness with $s \le O(1)\sqrt{m}$ do exist.

How it works: Numerical illustration. Let us apply our machinery to the 256 × 512 randomly selected submatrix $A$ of the matrix of the 512 × 512 Inverse Discrete Cosine Transform which we used in the experiments reported in Figure 1.3. These experiments exhibit nice performance of $\ell_1$ minimization when recovering sparse (even nearly sparse) signals with as many as 64 nonzeros. In fact, the level of goodness of $A$ is at most 24, as is witnessed in Figure 1.4.

In order to upper-bound the level of goodness of a matrix $A$, one can try to maximize the convex function $\|w\|_{s,1}$ over the set $W = \{w : Aw = 0, \|w\|_1 \le 1\}$: if, for a given $s$, the maximum of $\|\cdot\|_{s,1}$ over $W$ is $\ge 1/2$, the matrix is not s-good, since it does not possess the nullspace property. Now, while global maximization of the convex function $\|w\|_{s,1}$ over $W$ is difficult, we can try to find suboptimal solutions as follows. Let us start with a vector $w_1 \in W$ of $\|\cdot\|_1$-norm 1, and let $u_1$ be obtained from $w_1$ by replacing the $s$ entries in $w_1$ of largest magnitudes by the signs of these entries and zeroing out all other entries, so that $w_1^Tu_1 = \|w_1\|_{s,1}$. After $u_1$ is found, let us solve the LO program $\max_w\{[u_1]^Tw : w \in W\}$. $w_1$ is a feasible solution to this problem, so that for the optimal solution $w_2$ we have $[u_1]^Tw_2 \ge [u_1]^Tw_1 = \|w_1\|_{s,1}$; this inequality, by virtue of what $u_1$ is, implies that $\|w_2\|_{s,1} \ge \|w_1\|_{s,1}$, and, by construction, $w_2 \in W$. We now can iterate the construction, with $w_2$ in the role of $w_1$, to get $w_3 \in W$ with $\|w_3\|_{s,1} \ge \|w_2\|_{s,1}$, etc. Proceeding in this way, we generate a sequence of points from $W$ with monotonically increasing value of the objective $\|\cdot\|_{s,1}$ we want to maximize.

We terminate this recurrence either when the achieved value of the objective becomes $\ge 1/2$ (then we know for sure that $A$ is not s-good, and can proceed to investigating s-goodness for a smaller value of $s$) or when the recurrence gets stuck, that is, the observed progress in the objective falls below a given threshold, say, $10^{-6}$. When it happens, we can restart the process from a new starting point randomly selected in $W$, after getting stuck, restart again, etc., until we exhaust our time budget. The output of the process is the best of the points we have generated, that of the largest $\|\cdot\|_{s,1}$. Applying this approach to the matrix $A$ in question, in a couple of minutes it turns out that the matrix is at most 24-good.
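The only non-LP ingredient of the recurrence above is the passage from $w_t$ to the sign vector $u_t$. The sketch below illustrates just that step, assuming vectors are plain Python lists; the LO program itself would be handed to an LP solver and is not shown. The names `s_largest_l1` and `sign_vector` are hypothetical.

```python
def s_largest_l1(w, s):
    """||w||_{s,1}: sum of the s largest magnitudes among the entries of w."""
    return sum(sorted((abs(x) for x in w), reverse=True)[:s])


def sign_vector(w, s):
    """Signs of the s largest-magnitude entries of w, zeros elsewhere."""
    top = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[:s]
    u = [0.0] * len(w)
    for i in top:
        u[i] = 1.0 if w[i] > 0 else -1.0 if w[i] < 0 else 0.0
    return u


def inner(u, w):
    return sum(a * b for a, b in zip(u, w))


w1 = [0.5, -0.3, 0.1, -0.1]           # a vector of l1-norm 1 (toy data)
u1 = sign_vector(w1, 2)
print(inner(u1, w1), s_largest_l1(w1, 2))   # the two values coincide

# Any w2 with u1^T w2 >= u1^T w1 automatically has ||w2||_{s,1} >= ||w1||_{s,1},
# because u1 has at most s nonzero entries, each of magnitude 1.
w2 = [0.2, -0.7, 0.05, 0.05]
print(inner(u1, w2), s_largest_l1(w2, 2))
```

The second pair of numbers illustrates why the objective cannot decrease along the iterations: $u_1^Tw \le \|w\|_{s,1}$ for every $w$.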

One can ask how it may happen that previous experiments with recovering 64-sparse signals went fine, when in fact some 25-sparse signals cannot be recovered by $\ell_1$ minimization even in the ideal noiseless case. The answer is simple: in our experiments, we dealt with randomly selected signals, and typical randomly selected data are much nicer, whatever be the purpose of a numerical experiment, than the worst-case data.

It is interesting to understand also which goodness we can certify using our verifiable sufficient conditions. Computations show that the fully verifiable (and strongest in our scale of sufficient conditions for s-goodness) condition $Q_\infty(s,\kappa)$ can be satisfied with $\kappa < 1/2$ when $s$ is as large as 7 and $\kappa = 0.4887$, and cannot be satisfied with $\kappa < 1/2$ when $s = 8$. As for Mutual Incoherence, it can only justify 3-goodness, no more. We can hardly be happy with the resulting bounds (goodness at least 7 and at most 24); however, it could be worse.

1.4

EXERCISES FOR CHAPTER 1

Exercise 1.1. The k-th Hadamard matrix, $H_k$ (here $k$ is a nonnegative integer), is the $n_k \times n_k$ matrix, $n_k = 2^k$, given by the recurrence

$$H_0 = [1]; \quad H_{k+1} = \begin{bmatrix} H_k & H_k \\ H_k & -H_k \end{bmatrix}.$$

In the sequel, we assume that $k > 0$. Now comes the exercise:

1. Check that $H_k$ is a symmetric matrix with entries ±1, and columns of the matrix are mutually orthogonal, so that $H_k/\sqrt{n_k}$ is an orthogonal matrix.
2. Check that when $k > 0$, $H_k$ has just two distinct eigenvalues, $\sqrt{n_k}$ and $-\sqrt{n_k}$, each of multiplicity $m_k := 2^{k-1} = n_k/2$.
3. Prove that whenever $f$ is an eigenvector of $H_k$, one has $\|f\|_\infty \le \|f\|_1/\sqrt{n_k}$.

Derive from this observation the conclusion as follows:

Let $a_1, ..., a_{m_k} \in \mathbf{R}^{n_k}$ be unit vectors orthogonal to each other which are eigenvectors of $H_k$ with eigenvalue $\sqrt{n_k}$ (by the above, the dimension of the eigenspace of $H_k$ associated with the eigenvalue $\sqrt{n_k}$ is $m_k$, so that the required $a_1, ..., a_{m_k}$ do exist), and let $A$ be the $m_k \times n_k$ matrix with the rows $a_1^T, ..., a_{m_k}^T$. For every $x \in \mathrm{Ker}\, A$ it holds $\|x\|_\infty \le \frac{1}{\sqrt{n_k}}\|x\|_1$, whence $A$ satisfies the nullspace property whenever the sparsity $s$ satisfies $2s < \sqrt{n_k} = \sqrt{2m_k}$. Moreover, there exists (and can be found efficiently) an $m_k \times n_k$ contrast matrix $H = H_k$ such that for every $s < \frac{1}{2}\sqrt{n_k}$, the pair $(H_k, \|\cdot\|_\infty)$ satisfies the condition $Q_\infty(s, \kappa_s = s/\sqrt{n_k})$ associated | {z } O(1)n/ mo , for properly selected absolute constant O(1).

Exercise 1.5. Utilize the results of Exercise 1.3 in a numerical experiment as follows.

• select $n$ as an integer power $2^k$ of 2, say, $n = 2^{10} = 1024$;
• select a “representative” sequence $M$ of values of $m$, $1 \le m < n$, including values of $m$ close to $n$ and “much smaller” than $n$, say, M = {2, 5, 8, 16, 32, 64, 128, 256, 512, 7, 896, 960, 992, 1008, 1016, 1020, 1022, 1023};
• for every $m \in M$,
  – generate at random an $m \times n$ submatrix $A$ of the $n \times n$ Hadamard matrix $H_k$ and utilize the result of item 4 of Exercise 1.3 in order to find the largest $s$ such that the s-goodness of $A$ can be certified via the condition $Q_\infty(\cdot,\cdot)$; call $s(m)$ the resulting value of $s$;
  – generate a moderate sample of Gaussian $m \times n$ sensing matrices $A_i$ with independent $N(0, 1/m)$ entries and use the construction from Exercise 1.2 to upper-bound the largest $s$ for which a matrix from the sample satisfies RIP($1/3, 2s$); call $\widehat{s}(m)$ the largest, over your $A_i$'s, of the resulting upper bounds.

The goal of the exercise is to compare the computed values of $s(m)$ and $\widehat{s}(m)$; in other words, we again want to understand how “theoretically perfect” RIP compares to “conservative restricted scope” condition $Q_\infty$.
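As a possible starting point for the experiment of Exercise 1.5, the Hadamard recurrence of Exercise 1.1 can be coded in a few lines. This is only a sketch under assumptions: matrices are lists of rows, “submatrix” is read here as a random selection of $m$ distinct rows, and the names `hadamard` and `random_row_submatrix` are hypothetical. The certification step itself (item 4 of Exercise 1.3) is not shown.

```python
import random


def hadamard(k):
    """H_k built by H_0 = [1], H_{k+1} = [[H_k, H_k], [H_k, -H_k]]."""
    H = [[1]]
    for _ in range(k):
        top = [row + row for row in H]
        bottom = [row + [-x for x in row] for row in H]
        H = top + bottom
    return H


def random_row_submatrix(H, m, seed=0):
    """m distinct rows of H picked at random (seeded, so runs are reproducible)."""
    rng = random.Random(seed)
    picked = sorted(rng.sample(range(len(H)), m))
    return [H[i] for i in picked]


H3 = hadamard(3)                      # 8 x 8 matrix with entries +1 / -1
gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in H3] for r1 in H3]
print(gram[0][0], gram[0][1])         # 8 on the diagonal, 0 off the diagonal
A = random_row_submatrix(H3, 3)
print(len(A), len(A[0]))
```

The Gram matrix check is a numerical sanity test of item 1 of Exercise 1.1 for one value of $k$; it is not a substitute for the proof the exercise asks for.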

1.5

PROOFS

1.5.1

Proofs of Theorem 1.3, 1.4

All we need is to prove Theorem 1.4, since Theorem 1.3 is the particular case $\kappa = \varkappa < 1/2$ of Theorem 1.4. Let us fix $x \in \mathbf{R}^n$ and $\eta \in \Xi_\rho$, and let us set $\widehat{x} = \widehat{x}_{\mathrm{reg}}(Ax + \eta)$. Let also $I \subset \{1, ..., n\}$ be the set of indexes of the $s$ entries in $x$ of largest magnitudes, $I^o$ be the complement of $I$ in $\{1, ..., n\}$, and, for $w \in \mathbf{R}^n$, $w_I$ and $w_{I^o}$ be the vectors obtained from $w$ by zeroing entries with indexes $j \notin I$ and $j \notin I^o$, respectively, and keeping the remaining entries intact. Finally, let $z = \widehat{x} - x$.

1°. By the definition of $\Xi_\rho$ and due to $\eta \in \Xi_\rho$, we have

$$\|H^T([Ax + \eta] - Ax)\| \le \rho, \tag{1.40}$$

so that $x$ is a feasible solution to the optimization problem specifying $\widehat{x}$, whence $\|\widehat{x}\|_1 \le \|x\|_1$. We therefore have

$$\|\widehat{x}_{I^o}\|_1 = \|\widehat{x}\|_1 - \|\widehat{x}_I\|_1 \le \|x\|_1 - \|\widehat{x}_I\|_1 = \|x_I\|_1 + \|x_{I^o}\|_1 - \|\widehat{x}_I\|_1 \le \|z_I\|_1 + \|x_{I^o}\|_1, \tag{1.41}$$

and therefore $\|z_{I^o}\|_1 \le \|\widehat{x}_{I^o}\|_1 + \|x_{I^o}\|_1 \le \|z_I\|_1 + 2\|x_{I^o}\|_1$. It follows that

$$\|z\|_1 = \|z_I\|_1 + \|z_{I^o}\|_1 \le 2\|z_I\|_1 + 2\|x_{I^o}\|_1. \tag{1.42}$$

Further, by definition of $\widehat{x}$ we have $\|H^T([Ax + \eta] - A\widehat{x})\| \le \rho$, which combines with (1.40) to imply that

$$\|H^TA(\widehat{x} - x)\| \le 2\rho. \tag{1.43}$$

2°. Since $(H, \|\cdot\|)$ satisfies $Q_1(s, \varkappa)$, we have

$$\|z\|_{s,1} \le s\|H^TAz\| + \varkappa\|z\|_1.$$

By (1.43), it follows that $\|z\|_{s,1} \le 2s\rho + \varkappa\|z\|_1$, which combines with the evident inequality $\|z_I\|_1 \le \|z\|_{s,1}$ (recall that $\mathrm{Card}(I) = s$) and with (1.42) to imply that

$$\|z_I\|_1 \le 2s\rho + \varkappa\|z\|_1 \le 2s\rho + 2\varkappa\|z_I\|_1 + 2\varkappa\|x_{I^o}\|_1,$$

whence

$$\|z_I\|_1 \le \frac{2s\rho + 2\varkappa\|x_{I^o}\|_1}{1 - 2\varkappa}.$$

Invoking (1.42), we conclude that

$$\|z\|_1 \le \frac{4s}{1 - 2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right]. \tag{1.44}$$

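3°. Since $(H, \|\cdot\|)$ satisfies $Q_q(s, \kappa)$, we have

$$\|z\|_{s,q} \le s^{\frac{1}{q}}\|H^TAz\| + \kappa s^{\frac{1}{q}-1}\|z\|_1,$$

which combines with (1.44) and (1.43) to imply that

$$\|z\|_{s,q} \le s^{\frac{1}{q}}\, 2\rho + \kappa s^{\frac{1}{q}}\, \frac{4\rho + 2s^{-1}\|x_{I^o}\|_1}{1 - 2\varkappa} \le \frac{4 s^{\frac{1}{q}}[1 + \kappa - \varkappa]}{1 - 2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right] \tag{1.45}$$

(we have taken into account that $\varkappa < 1/2$ and $\kappa \ge \varkappa$). Let $\theta$ be the $(s+1)$-st largest magnitude of entries in $z$, and let $w = z - z^s$. Now (1.45) implies that

$$\theta \le \|z\|_{s,q}\, s^{-\frac{1}{q}} \le \frac{4[1 + \kappa - \varkappa]}{1 - 2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right].$$

Hence invoking (1.44) we have

$$\begin{aligned}
\|w\|_q &\le \|w\|_\infty^{\frac{q-1}{q}}\|w\|_1^{\frac{1}{q}} \le \theta^{\frac{q-1}{q}}\|z\|_1^{\frac{1}{q}}\\
&\le \theta^{\frac{q-1}{q}}\, \frac{(4s)^{\frac{1}{q}}}{[1 - 2\varkappa]^{\frac{1}{q}}}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right]^{\frac{1}{q}}\\
&\le \frac{4 s^{\frac{1}{q}}[1 + \kappa - \varkappa]^{\frac{q-1}{q}}}{1 - 2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right].
\end{aligned}$$

Taking into account (1.45) and the fact that the supports of $z^s$ and $w$ do not intersect, we get

$$\|z\|_q \le 2^{\frac{1}{q}}\max[\|z^s\|_q, \|w\|_q] = 2^{\frac{1}{q}}\max[\|z\|_{s,q}, \|w\|_q] \le \frac{4(2s)^{\frac{1}{q}}[1 + \kappa - \varkappa]}{1 - 2\varkappa}\left[\rho + \frac{\|x_{I^o}\|_1}{2s}\right].$$

This bound combines with (1.44), the Moment inequality,¹² and with the relation $\|x_{I^o}\|_1 = \|x - x^s\|_1$ to imply (1.16). ✷

¹² The Moment inequality states that if $(\Omega, \mu)$ is a space with measure and $f$ is a $\mu$-measurable real-valued function on $\Omega$, then $\phi(\rho) = \ln\left(\int_\Omega |f(\omega)|^{\frac{1}{\rho}}\mu(d\omega)\right)^{\rho}$ is a convex function of $\rho$ on every segment $\Delta \subset [0,1]$ such that $\phi(\cdot)$ is well defined at the endpoints of $\Delta$. As a corollary, when $x \in \mathbf{R}^n$ and $1 \le p \le q \le \infty$, one has $\|x\|_p \le \|x\|_1^{\frac{q-p}{p(q-1)}}\, \|x\|_q^{\frac{q(p-1)}{p(q-1)}}$.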
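The corollary of the Moment inequality quoted in the footnote is what converts the $\ell_1$ and $\ell_q$ bounds on $z$ into bounds for all intermediate $p$. A quick numerical sanity check can be sketched as follows; the helper names `lp_norm` and `interpolation_bound` are hypothetical, finite $p$ is assumed, and $q = \infty$ is handled through the limits of the exponents.

```python
import math
import random


def lp_norm(x, p):
    """Standard l_p norm of a list of numbers; p may be math.inf."""
    if math.isinf(p):
        return max(abs(t) for t in x)
    return sum(abs(t) ** p for t in x) ** (1.0 / p)


def interpolation_bound(x, p, q):
    """||x||_1^{(q-p)/(p(q-1))} * ||x||_q^{q(p-1)/(p(q-1))} for 1 <= p <= q, q > 1, p finite."""
    if math.isinf(q):
        a, b = 1.0 / p, (p - 1.0) / p          # limits of the two exponents
    else:
        a = (q - p) / (p * (q - 1.0))
        b = q * (p - 1.0) / (p * (q - 1.0))
    return lp_norm(x, 1) ** a * lp_norm(x, q) ** b


rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(20)]
for p, q in [(1.5, 2.0), (2.0, 4.0), (2.0, math.inf), (3.0, math.inf)]:
    print(p, q, lp_norm(x, p), interpolation_bound(x, p, q))
```

In each printed row the third number should not exceed the fourth; the two exponents always sum to 1 only when $p = 1$, so the bound is a genuine interpolation between the two norms rather than a convex combination.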

1.5.2

Proof of Theorem 1.5

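Let us prove (i). Let us fix $x \in \mathbf{R}^n$ and $\eta$, and let us set $\widehat{x} = \widehat{x}_{\mathrm{pen}}(Ax + \eta)$. Let also $I \subset \{1, ..., n\}$ be the set of indexes of the $s$ entries in $x$ of largest magnitudes, $I^o$ be the complement of $I$ in $\{1, ..., n\}$, and, for $w \in \mathbf{R}^n$, $w_I$ and $w_{I^o}$ be the vectors obtained from $w$ by zeroing out all entries with indexes not in $I$ and not in $I^o$, respectively. Finally, let $z = \widehat{x} - x$ and $\nu = \|H^T\eta\|$.

1°. We have

$$\|\widehat{x}\|_1 + \lambda\|H^T(A\widehat{x} - Ax - \eta)\| \le \|x\|_1 + \lambda\|H^T\eta\|$$

and

$$\|H^T(A\widehat{x} - Ax - \eta)\| = \|H^T(Az - \eta)\| \ge \|H^TAz\| - \|H^T\eta\|,$$

whence

$$\|\widehat{x}\|_1 + \lambda\|H^TAz\| \le \|x\|_1 + 2\lambda\|H^T\eta\| = \|x\|_1 + 2\lambda\nu. \tag{1.46}$$

We have

$$\|\widehat{x}\|_1 = \|x + z\|_1 = \|x_I + z_I\|_1 + \|x_{I^o} + z_{I^o}\|_1 \ge \|x_I\|_1 - \|z_I\|_1 + \|z_{I^o}\|_1 - \|x_{I^o}\|_1,$$

which combines with (1.46) to imply that

$$\|x_I\|_1 - \|z_I\|_1 + \|z_{I^o}\|_1 - \|x_{I^o}\|_1 + \lambda\|H^TAz\| \le \|x\|_1 + 2\lambda\nu,$$

or, which is the same,

$$\|z_{I^o}\|_1 - \|z_I\|_1 + \lambda\|H^TAz\| \le 2\|x_{I^o}\|_1 + 2\lambda\nu. \tag{1.47}$$

Since $(H, \|\cdot\|)$ satisfies $Q_1(s, \varkappa)$, we have

$$\|z_I\|_1 \le \|z\|_{s,1} \le s\|H^TAz\| + \varkappa\|z\|_1,$$

so that

$$(1 - \varkappa)\|z_I\|_1 - \varkappa\|z_{I^o}\|_1 - s\|H^TAz\| \le 0. \tag{1.48}$$

Taking a weighted sum of (1.47) and (1.48), the weights being 1 and 2, respectively, we get

$$(1 - 2\varkappa)\left[\|z_I\|_1 + \|z_{I^o}\|_1\right] + (\lambda - 2s)\|H^TAz\| \le 2\|x_{I^o}\|_1 + 2\lambda\nu,$$

whence, due to $\lambda \ge 2s$,

$$\|z\|_1 \le \frac{2\lambda\nu + 2\|x_{I^o}\|_1}{1 - 2\varkappa} \le \frac{2\lambda}{1 - 2\varkappa}\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]. \tag{1.49}$$

Further, by (1.46) we have

$$\lambda\|H^TAz\| \le \|x\|_1 - \|\widehat{x}\|_1 + 2\lambda\nu \le \|z\|_1 + 2\lambda\nu,$$

which combines with (1.49) to imply that

$$\lambda\|H^TAz\| \le \frac{2\lambda\nu + 2\|x_{I^o}\|_1}{1 - 2\varkappa} + 2\lambda\nu = \frac{2\lambda\nu(2 - 2\varkappa) + 2\|x_{I^o}\|_1}{1 - 2\varkappa}. \tag{1.50}$$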

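From $Q_q(s, \kappa)$ it follows that

$$\|z\|_{s,q} \le s^{\frac{1}{q}}\|H^TAz\| + \kappa s^{\frac{1}{q}-1}\|z\|_1,$$

which combines with (1.50) and (1.49) to imply that

$$\begin{aligned}
\|z\|_{s,q} &\le s^{\frac{1}{q}-1}\left[s\|H^TAz\| + \kappa\|z\|_1\right] \le s^{\frac{1}{q}-1}\left[\frac{4s\nu(1 - \varkappa) + \frac{2s}{\lambda}\|x_{I^o}\|_1}{1 - 2\varkappa} + \frac{\kappa\left[2\lambda\nu + \frac{\lambda}{s}\|x_{I^o}\|_1\right]}{1 - 2\varkappa}\right]\\
&= s^{\frac{1}{q}}\, \frac{[4(1 - \varkappa) + 2s^{-1}\lambda\kappa]\nu + [2\lambda^{-1} + \kappa s^{-2}\lambda]\|x_{I^o}\|_1}{1 - 2\varkappa}\\
&\le \frac{4 s^{\frac{1}{q}}}{1 - 2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]
\end{aligned} \tag{1.51}$$

(recall that $\lambda \ge 2s$, $\kappa \ge \varkappa$, and $\varkappa < 1/2$). It remains to repeat the reasoning following (1.45) in item 3° of the proof of Theorem 1.4. Specifically, denoting by $\theta$ the $(s+1)$-st largest magnitude of entries in $z$, (1.51) implies that

$$\theta \le s^{-1/q}\|z\|_{s,q} \le \frac{4}{1 - 2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right], \tag{1.52}$$

so that for the vector $w = z - z^s$ one has

$$\|w\|_q \le \theta^{1 - \frac{1}{q}}\|w\|_1^{\frac{1}{q}} \le \frac{4(\lambda/2)^{\frac{1}{q}}}{1 - 2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]^{\frac{q-1}{q}}\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]$$

(we have used (1.52) and (1.49)). Hence, taking into account that $z^s$ and $w$ have nonintersecting supports,

$$\|z\|_q \le 2^{\frac{1}{q}}\max[\|z^s\|_q, \|w\|_q] = 2^{\frac{1}{q}}\max[\|z\|_{s,q}, \|w\|_q] \le \frac{4\lambda^{\frac{1}{q}}}{1 - 2\varkappa}\left[1 + \frac{\kappa\lambda}{2s} - \varkappa\right]\left[\nu + \frac{\|x_{I^o}\|_1}{2s}\right]$$

(we have used (1.51) along with $\lambda \ge 2s$ and $\kappa \ge \varkappa$). This combines with (1.49) and the Moment inequality to imply (1.18). All remaining claims of Theorem 1.5 are immediate corollaries of (1.18). ✷

1.5.3

Proof of Proposition 1.7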

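1°. Assuming $k \le m$ and selecting a set $I$ of $k$ indices from $\{1, ..., n\}$ distinct from each other, consider an $m \times k$ submatrix $A_I$ of $A$ comprised of columns with indexes from $I$, and let $u$ be a unit vector in $\mathbf{R}^k$. The entries in the vector $m^{1/2}A_Iu$ are independent $N(0,1)$ random variables, so that for the random variable $\zeta_u = \sum_{i=1}^m (m^{1/2}A_Iu)_i^2$ and $\gamma \in (-1/2, 1/2)$ it holds (in what follows, expectations and probabilities are taken w.r.t. our ensemble of random $A$'s)

$$\ln\left(\mathbf{E}\{\exp\{\gamma\zeta_u\}\}\right) = m\ln\left(\frac{1}{\sqrt{2\pi}}\int e^{\gamma t^2 - \frac{1}{2}t^2}\,dt\right) = -\frac{m}{2}\ln(1 - 2\gamma).$$

Given $\alpha \in (0, 0.1]$ and selecting $\gamma$ in such a way that $1 - 2\gamma = \frac{1}{1+\alpha}$, we get $0 < \gamma < 1/2$ and therefore

$$\begin{aligned}
\mathrm{Prob}\{\zeta_u > m(1+\alpha)\} &\le \mathbf{E}\{\exp\{\gamma\zeta_u\}\}\exp\{-m\gamma(1+\alpha)\}\\
&= \exp\{-\tfrac{m}{2}\ln(1 - 2\gamma) - m\gamma(1+\alpha)\}\\
&= \exp\{\tfrac{m}{2}[\ln(1+\alpha) - \alpha]\} \le \exp\{-\tfrac{m}{5}\alpha^2\},
\end{aligned}$$

and similarly, selecting $\gamma$ in such a way that $1 - 2\gamma = \frac{1}{1-\alpha}$, we get $-1/2 < \gamma < 0$ and therefore

$$\begin{aligned}
\mathrm{Prob}\{\zeta_u < m(1-\alpha)\} &\le \mathbf{E}\{\exp\{\gamma\zeta_u\}\}\exp\{-m\gamma(1-\alpha)\}\\
&= \exp\{-\tfrac{m}{2}\ln(1 - 2\gamma) - m\gamma(1-\alpha)\}\\
&= \exp\{\tfrac{m}{2}[\ln(1-\alpha) + \alpha]\} \le \exp\{-\tfrac{m}{5}\alpha^2\},
\end{aligned}$$

and we end up with

$$u \in \mathbf{R}^k,\ \|u\|_2 = 1 \ \Rightarrow\ \begin{cases} \mathrm{Prob}\{A : \|A_Iu\|_2^2 > 1 + \alpha\} \le \exp\{-\tfrac{m}{5}\alpha^2\},\\ \mathrm{Prob}\{A : \|A_Iu\|_2^2 < 1 - \alpha\} \le \exp\{-\tfrac{m}{5}\alpha^2\}. \end{cases} \tag{1.53}$$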
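The two elementary inequalities used at the end of each chain above, $\frac{1}{2}[\ln(1+\alpha) - \alpha] \le -\frac{1}{5}\alpha^2$ and $\frac{1}{2}[\ln(1-\alpha) + \alpha] \le -\frac{1}{5}\alpha^2$ for $\alpha \in (0, 0.1]$, can be checked numerically on a grid. The sketch below does only that; the function names are hypothetical.

```python
import math


def upper_tail_exponent(alpha):
    """(1/2)[ln(1 + alpha) - alpha]: exponent per unit of m for Prob{zeta_u > m(1 + alpha)}."""
    return 0.5 * (math.log1p(alpha) - alpha)


def lower_tail_exponent(alpha):
    """(1/2)[ln(1 - alpha) + alpha]: exponent per unit of m for Prob{zeta_u < m(1 - alpha)}."""
    return 0.5 * (math.log1p(-alpha) + alpha)


for alpha in (0.001, 0.01, 0.05, 0.1):
    print(alpha, upper_tail_exponent(alpha), lower_tail_exponent(alpha), -alpha ** 2 / 5.0)
```

A grid check is of course not a proof; it only confirms that the constant $1/5$ in (1.53) is consistent with the restriction $\alpha \le 0.1$.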

2°. As above, let $\alpha \in (0, 0.1]$, let

$$M = 1 + 2\alpha, \quad \epsilon = \frac{\alpha}{2(1 + 2\alpha)},$$

and let us build an $\epsilon$-net on the unit sphere $S$ in $\mathbf{R}^k$ as follows. We start with a point $u_1 \in S$; after $\{u_1, ..., u_t\} \subset S$ is already built, we check whether there is a point in $S$ at the $\|\cdot\|_2$-distance $> \epsilon$ from all points of the set. If it is the case, we add such a point to the net built so far and proceed with building the net; otherwise we terminate with the net $\{u_1, ..., u_t\}$. By compactness of $S$ and due to $\epsilon > 0$, this process eventually terminates; upon termination, we have at our disposal the collection $\{u_1, ..., u_N\}$ of unit vectors such that every two of them are at $\|\cdot\|_2$-distance $> \epsilon$ from each other, and every point from $S$ is at distance at most $\epsilon$ from some point of the collection. We claim that the cardinality $N$ of the resulting set can be bounded as

$$N \le \left[\frac{2 + \epsilon}{\epsilon}\right]^k = \left[\frac{4 + 9\alpha}{\alpha}\right]^k \le \left[\frac{5}{\alpha}\right]^k. \tag{1.54}$$

Indeed, the interiors of the $\|\cdot\|_2$-balls of radius $\epsilon/2$ centered at the points $u_1, ..., u_N$ are mutually disjoint, and their union is contained in the $\|\cdot\|_2$-ball of radius $1 + \epsilon/2$ centered at the origin; comparing the volume of the union and that of the ball, we arrive at (1.54).

3°. Consider event $E$ comprised of all realizations of $A$ such that for all $k$-element subsets $I$ of $\{1, ..., n\}$ and all $t \le N$ it holds

$$1 - \alpha \le \|A_Iu_t\|_2^2 \le 1 + \alpha. \tag{1.55}$$

By (1.53) and the union bound,

$$\mathrm{Prob}\{A \notin E\} \le 2N\binom{n}{k}\exp\{-\tfrac{m}{5}\alpha^2\}. \tag{1.56}$$

We claim that

$$A \in E \ \Rightarrow\ (1 - 2\alpha) \le \|A_Iu\|_2^2 \le 1 + 2\alpha \quad \forall \left(\begin{array}{l} I \subset \{1, ..., n\} : \mathrm{Card}(I) = k\\ u \in \mathbf{R}^k : \|u\|_2 = 1 \end{array}\right). \tag{1.57}$$

Indeed, let $A \in E$, let us fix $I \subset \{1, ..., n\}$, $\mathrm{Card}(I) = k$, and let $M$ be the maximal value of the quadratic form $f(u) = u^TA_I^TA_Iu$ on the unit $\|\cdot\|_2$-ball $B$, centered at the origin, in $\mathbf{R}^k$. In this ball, $f$ is Lipschitz continuous with constant $2M$ w.r.t. $\|\cdot\|_2$; denoting by $\bar{u}$ a maximizer of the form on $B$, we lose nothing when assuming that $\bar{u}$ is a unit vector. Now let $u_s$ be the point of our net which is at $\|\cdot\|_2$-distance at most $\epsilon$ from $\bar{u}$. We have

$$M = f(\bar{u}) \le f(u_s) + 2M\epsilon \le 1 + \alpha + 2M\epsilon,$$

whence

$$M \le \frac{1 + \alpha}{1 - 2\epsilon} = 1 + 2\alpha,$$

implying the right inequality in (1.57). Now let $u$ be a unit vector in $\mathbf{R}^k$, and $u_s$ be a point in the net at $\|\cdot\|_2$-distance $\le \epsilon$ from $u$. We have

$$f(u) \ge f(u_s) - 2M\epsilon \ge 1 - \alpha - 2\,\frac{1 + \alpha}{1 - 2\epsilon}\,\epsilon = 1 - 2\alpha,$$

justifying the first inequality in (1.57). The bottom line is:

$$\delta \in (0, 0.2],\ 1 \le k \le n \ \Rightarrow\ \mathrm{Prob}\{A : A \text{ does not satisfy RIP}(\delta, k)\} \le \underbrace{2\left[\frac{10}{\delta}\right]^k}_{\le\, \left(\frac{20}{\delta}\right)^k}\binom{n}{k}\exp\{-\tfrac{m\delta^2}{20}\}. \tag{1.58}$$

Indeed, setting $\alpha = \delta/2$, we have seen that whenever $A \in E$, we have $(1 - \delta) \le \|Au\|_2^2 \le (1 + \delta)$ for all unit $k$-sparse $u$, which is nothing but RIP($\delta, k$); with this in mind, (1.58) follows from (1.56) and (1.54).

4°. It remains to verify that with properly selected (depending solely on $\delta$) positive quantities $c$, $d$, $f$, for every $k \ge 1$ satisfying (1.28) the right-hand side in (1.58) is at most $\exp\{-fm\}$. Passing to logarithms, our goal is to ensure the relation

$$G := a(\delta)m - b(\delta)k - \ln\binom{n}{k} \ge mf(\delta) > 0 \qquad \left[a(\delta) = \frac{\delta^2}{20},\ b(\delta) = \ln\left(\frac{20}{\delta}\right)\right] \tag{1.59}$$

provided that $k \ge 1$ satisfies (1.28).

Let $k$ satisfy (1.28) with some $c$, $d$ to be specified later, and let $y = k/m$. Assuming $d \ge 3$, we have $0 \le y \le 1/3$. Now, it is well known that

$$C := \ln\binom{n}{k} \le n\left[\frac{k}{n}\ln\left(\frac{n}{k}\right) + \frac{n-k}{n}\ln\left(\frac{n}{n-k}\right)\right],$$

whence

$$\begin{aligned}
C &\le n\Big[\frac{m}{n}\,y\ln\left(\frac{n}{my}\right) + \frac{n-k}{n}\underbrace{\ln\left(1 + \frac{k}{n-k}\right)}_{\le\, \frac{k}{n-k}}\Big]\\
&\le n\left[\frac{m}{n}\,y\ln\left(\frac{n}{my}\right) + \frac{k}{n}\right] = m\left[y\ln\left(\frac{n}{my}\right) + y\right] \le 2my\ln\left(\frac{n}{my}\right)
\end{aligned}$$

(recall that $n \ge m$ and $y \le 1/3$). It follows that

$$\begin{aligned}
G &= a(\delta)m - b(\delta)k - C \ge a(\delta)m - b(\delta)ym - 2my\ln\left(\frac{n}{my}\right)\\
&= m\underbrace{\left[a(\delta) - b(\delta)y - 2y\ln\left(\frac{n}{m}\right) - 2y\ln\left(\frac{1}{y}\right)\right]}_{H},
\end{aligned}$$

and all we need is to select $c$, $d$ in such a way that (1.28) would imply that $H \ge f$ with some positive $f = f(\delta)$. This is immediate: we can find $u(\delta) > 0$ such that when $0 \le y \le u(\delta)$, we have $2y\ln(1/y) + b(\delta)y \le \frac{1}{3}a(\delta)$; selecting $d(\delta) \ge 3$ large enough, (1.28) would imply $y \le u(\delta)$, and thus would imply

$$H \ge \frac{2}{3}a(\delta) - 2y\ln\left(\frac{n}{m}\right).$$

Now we can select $c(\delta)$ large enough for (1.28) to ensure that $2y\ln(\frac{n}{m}) \le \frac{1}{3}a(\delta)$. With the $c$, $d$ just specified, (1.28) implies that $H \ge \frac{1}{3}a(\delta)$, and we can take the latter quantity as $f(\delta)$. ✷
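The “well known” bound on $\ln\binom{n}{k}$ used in item 4°, together with the simplification to $2k\ln(n/k)$ (that is, $2my\ln(\frac{n}{my})$ with $k = my$), can be sanity-checked for moderate $n$. The sketch below is an illustration only; `log_binom` and `entropy_bound` are hypothetical names, and the second check assumes $k \le n/3$, which follows from $y \le 1/3$ and $n \ge m$.

```python
import math


def log_binom(n, k):
    """ln of the binomial coefficient C(n, k)."""
    return math.log(math.comb(n, k))


def entropy_bound(n, k):
    """n[(k/n) ln(n/k) + ((n-k)/n) ln(n/(n-k))], valid for 0 < k < n."""
    return k * math.log(n / k) + (n - k) * math.log(n / (n - k))


n = 512
for k in (1, 8, 64, 170):
    print(k, log_binom(n, k), entropy_bound(n, k), 2 * k * math.log(n / k))
```

In every row the three numbers should be nondecreasing from left to right as long as $k \le n/3$.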

1.5.4

Proof of Propositions 1.8 and 1.12

Let $x \in \mathbf{R}^n$, and let $x^1, ..., x^q$ be obtained from $x$ by the following construction: $x^1$ is obtained from $x$ by zeroing all but the $s$ entries of largest magnitudes; $x^2$ is obtained by the same procedure applied to $x - x^1$; $x^3$ by the same procedure applied to $x - x^1 - x^2$; and so on; the process is terminated at the first step $q$ when it happens that $x = x^1 + ... + x^q$. Note that for $j \ge 2$ we have $\|x^j\|_\infty \le s^{-1}\|x^{j-1}\|_1$ and $\|x^j\|_1 \le \|x^{j-1}\|_1$, whence also $\|x^j\|_2 \le \sqrt{\|x^j\|_\infty\|x^j\|_1} \le s^{-1/2}\|x^{j-1}\|_1$. It is easily seen that if $A$ is RIP($\delta, 2s$), then for every two s-sparse vectors $u$, $v$ with nonoverlapping supports we have

$$|v^TA^TAu| \le \delta\|u\|_2\|v\|_2. \tag{$*$}$$

Indeed, for s-sparse $u$, $v$, let $I$ be the index set of cardinality $\le 2s$ containing the supports of $u$ and $v$, so that, denoting by $A_I$ the submatrix of $A$ comprised of columns with indexes from $I$, we have $v^TA^TAu = v_I^T[A_I^TA_I]u_I$. By RIP, the eigenvalues $\lambda_i = 1 + \mu_i$ of the symmetric matrix $Q = A_I^TA_I$ are in-between $1 - \delta$ and $1 + \delta$; representing $u_I$ and $v_I$ by vectors $w$ and $z$ of their coordinates in the orthonormal eigenbasis of $Q$, we get $|v^TA^TAu| = |\sum_i \lambda_iw_iz_i| = |\sum_i w_iz_i + \sum_i \mu_iw_iz_i| \le |w^Tz| + \delta\|w\|_2\|z\|_2$. It remains to note that $w^Tz = u_I^Tv_I = 0$ and $\|w\|_2 = \|u\|_2$, $\|z\|_2 = \|v\|_2$.
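The block construction $x = x^1 + ... + x^q$ and the two norm relations noted for it are easy to reproduce on a toy vector. In the sketch below, `sparse_blocks` is a hypothetical helper; it sorts the nonzero entries by magnitude once and cuts them into consecutive groups of $s$, which yields the same blocks as the step-by-step procedure (ties are broken by index order, an arbitrary choice).

```python
import math


def sparse_blocks(x, s):
    """Blocks x^1, x^2, ...: each keeps the s largest-magnitude entries of what is left."""
    order = sorted((i for i in range(len(x)) if x[i] != 0),
                   key=lambda i: abs(x[i]), reverse=True)
    blocks = []
    for start in range(0, len(order), s):
        block = [0.0] * len(x)
        for i in order[start:start + s]:
            block[i] = x[i]
        blocks.append(block)
    return blocks


def l1(v):
    return sum(abs(t) for t in v)


def l2(v):
    return math.sqrt(sum(t * t for t in v))


def linf(v):
    return max(abs(t) for t in v)


x = [3.0, -1.0, 0.5, 2.0, -4.0, 0.25, 0.0]
s = 2
blocks = sparse_blocks(x, s)
for prev, cur in zip(blocks, blocks[1:]):
    # ||x^j||_inf <= ||x^{j-1}||_1 / s  and  ||x^j||_2 <= ||x^{j-1}||_1 / sqrt(s)
    print(linf(cur), l1(prev) / s, l2(cur), l1(prev) / math.sqrt(s))
```

Note that for $x = 0$ this helper returns an empty list of blocks, a degenerate case the text does not need.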

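We have

$$\begin{aligned}
\|Ax^1\|_2\|Ax\|_2 &\ge [x^1]^TA^TAx = \|Ax^1\|_2^2 + \sum_{j=2}^q [x^1]^TA^TAx^j\\
&\ge \|Ax^1\|_2^2 - \delta\sum_{j=2}^q \|x^1\|_2\|x^j\|_2 \quad \text{[by $(*)$]}\\
&\ge \|Ax^1\|_2^2 - \delta s^{-1/2}\|x^1\|_2\sum_{j=2}^q \|x^{j-1}\|_1 \ge \|Ax^1\|_2^2 - \delta s^{-1/2}\|x^1\|_2\|x\|_1\\
\Rightarrow\ \|Ax^1\|_2^2 &\le \|Ax^1\|_2\|Ax\|_2 + \delta s^{-1/2}\|x^1\|_2\|x\|_1\\
\Rightarrow\ \|x^1\|_2 &= \frac{\|x^1\|_2}{\|Ax^1\|_2^2}\|Ax^1\|_2^2 \le \frac{\|x^1\|_2}{\|Ax^1\|_2}\|Ax\|_2 + \delta s^{-1/2}\left[\frac{\|x^1\|_2}{\|Ax^1\|_2}\right]^2\|x\|_1\\
\Rightarrow\ \|x\|_{s,2} &= \|x^1\|_2 \le \frac{1}{\sqrt{1-\delta}}\|Ax\|_2 + \frac{\delta s^{-1/2}}{1-\delta}\|x\|_1 \quad \text{[by RIP($\delta, 2s$)]}
\end{aligned} \tag{!}$$

and we see that the pair $\left(H = \frac{s^{-1/2}}{\sqrt{1-\delta}}\, I_m,\ \|\cdot\|_2\right)$ satisfies $Q_2(s, \frac{\delta}{1-\delta})$, as claimed in Proposition 1.8.i. Moreover, when $q \ge 2$, $\kappa > 0$, and integer $t \ge 1$ satisfy $t \le s$ and $\kappa t^{1/q-1} \ge \frac{\delta s^{-1/2}}{1-\delta}$, by (!) we have

$$\|x\|_{t,q} \le \|x\|_{s,q} \le \|x\|_{s,2} \le \frac{1}{\sqrt{1-\delta}}\|Ax\|_2 + \kappa t^{1/q-1}\|x\|_1,$$

or, equivalently,

$$1 \le t \le \min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac{q}{q-1}},\ s^{\frac{q-2}{2q-2}}\right]s^{\frac{q}{2q-2}} \ \Rightarrow\ \left(H = \frac{t^{-1/q}}{\sqrt{1-\delta}}\, I_m,\ \|\cdot\|_2\right) \text{ satisfies } Q_q(t,\kappa),$$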

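as required in Proposition 1.12.i. Next, we have

$$\begin{aligned}
\|x^1\|_1\|A^TAx\|_\infty &\ge [x^1]^TA^TAx = \|Ax^1\|_2^2 + \sum_{j=2}^q [x^1]^TA^TAx^j\\
&\ge \|Ax^1\|_2^2 - \delta s^{-1/2}\|x^1\|_2\|x\|_1 \quad \text{[exactly as above]}\\
\Rightarrow\ \|Ax^1\|_2^2 &\le \|x^1\|_1\|A^TAx\|_\infty + \delta s^{-1/2}\|x^1\|_2\|x\|_1\\
\Rightarrow\ (1-\delta)\|x^1\|_2^2 &\le \|x^1\|_1\|A^TAx\|_\infty + \delta s^{-1/2}\|x^1\|_2\|x\|_1 \quad \text{[by RIP($\delta, 2s$)]}\\
&\le s^{1/2}\|x^1\|_2\|A^TAx\|_\infty + \delta s^{-1/2}\|x^1\|_2\|x\|_1\\
\Rightarrow\ \|x\|_{s,2} &= \|x^1\|_2 \le \frac{s^{1/2}}{1-\delta}\|A^TAx\|_\infty + \frac{\delta}{1-\delta}s^{-1/2}\|x\|_1
\end{aligned} \tag{!!}$$

and we see that the pair $\left(H = \frac{1}{1-\delta}A,\ \|\cdot\|_\infty\right)$ satisfies the condition $Q_2\left(s, \frac{\delta}{1-\delta}\right)$, as required in Proposition 1.8.ii. Moreover, when $q \ge 2$, $\kappa > 0$, and integer $t \ge 1$ satisfy $t \le s$ and $\kappa t^{1/q-1} \ge \frac{\delta}{1-\delta}s^{-1/2}$, we have by (!!)

$$\|x\|_{t,q} \le \|x\|_{s,q} \le \|x\|_{s,2} \le \frac{1}{1-\delta}s^{1/2}\|A^TAx\|_\infty + \kappa t^{1/q-1}\|x\|_1,$$

or, equivalently,

$$1 \le t \le \min\left[\left[\frac{\kappa(1-\delta)}{\delta}\right]^{\frac{q}{q-1}},\ s^{\frac{q-2}{2q-2}}\right]s^{\frac{q}{2q-2}} \ \Rightarrow\ \left(H = \frac{s^{1/2}t^{-1/q}}{1-\delta}A,\ \|\cdot\|_\infty\right) \text{ satisfies } Q_q(t,\kappa),$$

as required in Proposition 1.12.ii.

1.5.5

Proof of Proposition 1.10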

(i): Let $\bar{H} \in \mathbf{R}^{m\times N}$ and $\|\cdot\|$ satisfy $Q_\infty(s,\kappa)$. Then for every $k \le n$ we have

$$|x_k| \le \|\bar{H}^TAx\| + s^{-1}\kappa\|x\|_1,$$

or, which is the same by homogeneity,

$$\min_x\left\{\|\bar{H}^TAx\| - x_k : \|x\|_1 \le 1\right\} \ge -s^{-1}\kappa.$$

In other words, the optimal value $\mathrm{Opt}_k$ of the conic optimization problem¹³

$$\mathrm{Opt}_k = \min_{x,t}\left\{t - [e^k]^Tx : \|\bar{H}^TAx\| \le t,\ \|x\|_1 \le 1\right\},$$

where $e^k \in \mathbf{R}^n$ is the k-th basic orth, is $\ge -s^{-1}\kappa$. Since the problem clearly is strictly feasible, this is the same as saying that the dual problem

$$\max_{\mu\in\mathbf{R},\, g\in\mathbf{R}^n,\, \eta\in\mathbf{R}^N}\left\{-\mu : A^T\bar{H}\eta + g = e^k,\ \|g\|_\infty \le \mu,\ \|\eta\|_* \le 1\right\},$$

where $\|\cdot\|_*$ is the norm conjugate to $\|\cdot\|$,

$$\|u\|_* = \max_{\|h\|\le 1} h^Tu,$$

has a feasible solution with the value of the objective $\ge -s^{-1}\kappa$. It follows that there exist $\eta = \eta^k$ and $g = g^k$ such that

$$\begin{array}{ll}
(a): & e^k = A^Th^k + g^k,\\
(b): & h^k := \bar{H}\eta^k,\ \|\eta^k\|_* \le 1,\\
(c): & \|g^k\|_\infty \le s^{-1}\kappa.
\end{array} \tag{1.60}$$

Denoting $H = [h^1, ..., h^n]$, $V = I - H^TA$, we get

$$\mathrm{Col}_k[V^T] = e^k - A^Th^k = g^k,$$

implying that $\|\mathrm{Col}_k[V^T]\|_\infty \le s^{-1}\kappa$. Since the latter inequality is true for all $k \le n$, we conclude that

$$\|\mathrm{Col}_k[V]\|_{s,\infty} = \|\mathrm{Col}_k[V]\|_\infty \le s^{-1}\kappa, \quad 1 \le k \le n,$$

whence, by Proposition 1.9, $(H, \|\cdot\|_\infty)$ satisfies $Q_\infty(s,\kappa)$. Moreover, for every $\eta \in \mathbf{R}^m$ and every $k \le n$ we have, in view of (b) and (c),

$$|[h^k]^T\eta| = |[\eta^k]^T\bar{H}^T\eta| \le \|\eta^k\|_*\|\bar{H}^T\eta\|,$$

whence $\|H^T\eta\|_\infty \le \|\bar{H}^T\eta\|$.

Now let us prove the “In addition” part of the proposition. Let $H = [h_1, ..., h_n]$ be the contrast matrix specified in this part. We have

$$|[I_n - H^TA]_{ij}| = |[[e^i]^T - h_i^TA]_j| \le \|[e^i]^T - h_i^TA\|_\infty = \|e^i - A^Th_i\|_\infty \le \mathrm{Opt}_i,$$

implying by Proposition 1.9 that $(H, \|\cdot\|_\infty)$ does satisfy the condition $Q_\infty(s, \kappa_*)$ with $\kappa_* = s\max_i \mathrm{Opt}_i$. Now assume that there exists a matrix $H'$ which, taken along with some norm $\|\cdot\|$, satisfies the condition $Q_\infty(s,\kappa)$ with $\kappa < \kappa_*$, and let us lead this assumption to a contradiction. By the already proved first part of Proposition 1.10, our assumption implies that there exists an $m \times n$ matrix $\bar{H} = [\bar{h}_1, ..., \bar{h}_n]$ such that $\|\mathrm{Col}_j[I_n - \bar{H}^TA]\|_\infty \le s^{-1}\kappa$ for all $j \le n$, implying that $|[[e^i]^T - \bar{h}_i^TA]_j| \le s^{-1}\kappa$ for all $i$ and $j$, or, which is the same, $\|e^i - A^T\bar{h}_i\|_\infty \le s^{-1}\kappa$ for all $i$. Due to the origin of $\mathrm{Opt}_i$, we have $\mathrm{Opt}_i \le \|e^i - A^T\bar{h}_i\|_\infty$ for all $i$, and we arrive at $s^{-1}\kappa_* = \max_i \mathrm{Opt}_i \le s^{-1}\kappa$, that is, $\kappa_* \le \kappa$, which is a desired contradiction.

It remains to prove (1.33), which is just an exercise on LP duality: denoting by $e$ an $n$-dimensional all-ones vector, we have

$$\begin{aligned}
\mathrm{Opt}_i &:= \min_h \|e^i - A^Th\|_\infty = \min_{h,t}\left\{t : e^i - A^Th \le te,\ A^Th - e^i \le te\right\}\\
&= \max_{\lambda,\mu}\left\{\lambda_i - \mu_i : \lambda, \mu \ge 0,\ A[\lambda - \mu] = 0,\ \textstyle\sum_i \lambda_i + \sum_i \mu_i = 1\right\} \quad \text{[LP duality]}\\
&= \max_{x := \lambda - \mu}\left\{x_i : Ax = 0,\ \|x\|_1 \le 1\right\},
\end{aligned}$$

where the concluding equality follows from the fact that vectors $x$ representable as $\lambda - \mu$ with $\lambda, \mu \ge 0$ satisfying $\|\lambda\|_1 + \|\mu\|_1 = 1$ are exactly vectors $x$ with $\|x\|_1 \le 1$. ✷

¹³ For a summary on conic programming, see Section 4.1.

1.5.6

Proof of Proposition 1.13

Let $H$ satisfy (1.38). Since $\|v\|_{s,1} \le s^{1-1/q}\|v\|_{s,q}$, it follows that $H$ satisfies for some $\alpha < 1/2$ the condition

$$\|\mathrm{Col}_j[I_n - H^TA]\|_{s,1} \le \alpha, \quad 1 \le j \le n, \tag{1.61}$$

whence, as we know from Proposition 1.9,

$$\|x\|_{s,1} \le s\|H^TAx\|_\infty + \alpha\|x\|_1 \quad \forall x \in \mathbf{R}^n.$$

It follows that $s \le m$, since otherwise there exists a nonzero s-sparse vector $x$ with $Ax = 0$; for this $x$, the inequality above cannot hold true.

Let us set $\bar{n} = 2m$, so that $\bar{n} \le n$, and let $\bar{H}$ and $\bar{A}$ be the $m \times \bar{n}$ matrices comprised of the first $2m$ columns of $H$ and $A$, respectively. Relation (1.61) implies that the matrix $V = I_{\bar{n}} - \bar{H}^T\bar{A}$ satisfies

$$\|\mathrm{Col}_j[V]\|_{s,1} \le \alpha < 1/2, \quad 1 \le j \le \bar{n}. \tag{1.62}$$

Now, since the rank of $\bar{H}^T\bar{A}$ is $\le m$, at least $\bar{n} - m$ singular values of $V$ are $\ge 1$, and therefore the squared Frobenius norm $\|V\|_F^2$ of $V$ is at least $\bar{n} - m$. On the other hand, we can upper-bound this squared norm as follows. Observe that for every $\bar{n}$-dimensional vector $f$ one has

$$\|f\|_2^2 \le \max\left[\frac{\bar{n}}{s^2}, 1\right]\|f\|_{s,1}^2. \tag{1.63}$$

Indeed, by homogeneity it suffices to verify the inequality when $\|f\|_{s,1} = 1$; besides, we can assume w.l.o.g. that the entries in $f$ are nonnegative, and that $f_1 \ge f_2 \ge ... \ge f_{\bar{n}}$. We have $f_s \le \|f\|_{s,1}/s = \frac{1}{s}$; in addition, $\sum_{j=s+1}^{\bar{n}} f_j^2 \le (\bar{n} - s)f_s^2$. Now, due to $\|f\|_{s,1} = 1$, for fixed $f_s \in [0, 1/s]$ we have

$$\sum_{j=1}^{s} f_j^2 \le f_s^2 + \max_t\left\{\sum_{j=1}^{s-1} t_j^2 : t_j \ge f_s,\ j \le s-1,\ \sum_{j=1}^{s-1} t_j = 1 - f_s\right\}.$$

The maximum on the right-hand side is the maximum of a convex function over a bounded polytope; it is achieved at an extreme point, that is, at a point where one of the $t_j$ is equal to $1 - (s-1)f_s$, and all remaining $t_j$ are equal to $f_s$. As a result,

$$\sum_j f_j^2 \le \left[(1 - (s-1)f_s)^2 + (s-1)f_s^2\right] + (\bar{n} - s)f_s^2 \le (1 - (s-1)f_s)^2 + (\bar{n} - 1)f_s^2.$$

The right-hand side in the latter inequality is convex in $f_s$ and thus achieves its maximum over the range $[0, 1/s]$ of allowed values of $f_s$ at an endpoint, yielding $\sum_j f_j^2 \le \max[1, \bar{n}/s^2]$, as claimed.
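Inequality (1.63) lends itself to a quick randomized sanity check. The sketch below draws random vectors and compares the two sides; the helper names are hypothetical, and a passing run is of course no replacement for the argument just given.

```python
import random


def s_largest_l1(f, s):
    """||f||_{s,1}: sum of the s largest magnitudes among the entries of f."""
    return sum(sorted((abs(t) for t in f), reverse=True)[:s])


def sides_of_1_63(f, s):
    """Return (||f||_2^2, max[n/s^2, 1] * ||f||_{s,1}^2) for a vector f of length n."""
    lhs = sum(t * t for t in f)
    rhs = max(len(f) / s ** 2, 1.0) * s_largest_l1(f, s) ** 2
    return lhs, rhs


rng = random.Random(0)
f = [rng.gauss(0.0, 1.0) for _ in range(16)]
for s in (1, 2, 4, 8, 16):
    print(s, sides_of_1_63(f, s))

# The endpoint f_s = 1/s in the proof corresponds to a flat vector,
# for which both sides coincide when s^2 <= n.
print(sides_of_1_63([1.0] * 16, 4))
```

The flat vector shows that the factor $\bar{n}/s^2$ cannot be improved in general.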

Applying (1.63) to the columns of $V$ and recalling that $\bar{n} = 2m$, we get

$$\|V\|_F^2 = \sum_{j=1}^{2m}\|\mathrm{Col}_j[V]\|_2^2 \le \max\left[1, \frac{2m}{s^2}\right]\sum_{j=1}^{2m}\|\mathrm{Col}_j[V]\|_{s,1}^2 \le 2\alpha^2m\max\left[1, \frac{2m}{s^2}\right].$$

The left-hand side in this inequality, as we remember, is $\ge \bar{n} - m = m$, and we arrive at

$$m \le 2\alpha^2m\max[1, 2m/s^2].$$

Since $\alpha < 1/2$, this inequality implies $2m/s^2 \ge 2$, whence $s \le \sqrt{m}$.

It remains to prove that when $m \le n/2$, the condition $Q_\infty(s,\kappa)$ with $\kappa < 1/2$ can be satisfied only when $s \le \sqrt{m}$. This is immediate: by Proposition 1.10, assuming $Q_\infty(s,\kappa)$ satisfiable, there exists an $m \times n$ contrast matrix $H$ such that $|[I_n - H^TA]_{ij}| \le \kappa/s$ for all $i$, $j$, which, by the already proved part of Proposition 1.13, is impossible when $s > \sqrt{m}$. ✷

Chapter Two

Hypothesis Testing

Disclaimer for experts. In what follows, we allow for “general” probability and observation spaces, general probability distributions, etc., which, formally, would make it necessary to address the related measurability issues. In order to streamline our exposition, and taking into account that we do not expect our target audience to be experts in formal nuances of measure theory, we decided to omit in the text comments (always self-evident for an expert) on measurability and replace them with a “disclaimer” as follows:

Below, unless the opposite is explicitly stated,
• all probability and observation spaces are Polish (complete separable metric) spaces equipped with σ-algebras of Borel sets;
• all random variables (i.e., functions from a probability space to some other space) take values in Polish spaces; these variables, like other functions we deal with, are Borel;
• all probability distributions we are dealing with are σ-additive Borel measures on the respective probability spaces; the same is true for all reference measures and probability densities taken w.r.t. these measures.

When an entity (a random variable, or a probability density, or a function, say, a test) is part of the data, the Borel property is a default assumption; e.g., the sentence “Let random variable η be a deterministic transformation of random variable ξ” should be read as “let η = f(ξ) for some Borel function f,” and the sentence “Consider a test T deciding on hypotheses $H_1, ..., H_L$ via observation ω ∈ Ω” should be read as “Consider a Borel function T on Polish space Ω, the values of the function being subsets of the set {1, ..., L}.” When an entity is built by us rather than being part of the data, the Borel property is (an always straightforwardly verifiable) property of the construction. For example, the statement “The test T given by ... is such that ...” should be read as “The test T given by ... is a Borel function of observations and is such that ....”

On several occasions, we still use the word “Borel”; those not acquainted with the notion are welcome to just ignore this word.

2.1

PRELIMINARIES FROM STATISTICS: HYPOTHESES, TESTS, RISKS

2.1.1

Hypothesis Testing Problem

Hypothesis Testing is one of the fundamental problems of Statistics. Informally, this is the problem where one is given an observation (a realization of a random variable with unknown, at least partly, probability distribution) and wants to decide, based on this observation, on two or more hypotheses on the actual distribution of the observed variable. A formal setting convenient for us is as follows:

Given are:
• Observation space Ω, where the observed random variable (r.v.) takes its values;
• L families $\mathcal{P}_\ell$ of probability distributions on Ω. We associate with these families L hypotheses $H_1, ..., H_L$, with $H_\ell$ stating that the probability distribution P of the observed r.v. belongs to the family $\mathcal{P}_\ell$ (shorthand: $H_\ell : P \in \mathcal{P}_\ell$). We shall say that the distributions from $\mathcal{P}_\ell$ obey hypothesis $H_\ell$. Hypothesis $H_\ell$ is called simple if $\mathcal{P}_\ell$ is a singleton, and is called composite otherwise.

Our goal is, given an observation, that is, a realization ω of the r.v. in question, to decide which of the hypotheses is true.

2.1.2

Tests

Informally, a test is an inference procedure one can use in the above testing problem. Formally, a test for this testing problem is a function T(ω) of ω ∈ Ω; the value T(ω) of this function at a point ω is some subset of the set {1, ..., L}:
$$T(\omega) \subset \{1, ..., L\}.$$
Given observation ω, the test accepts all hypotheses $H_\ell$ with ℓ ∈ T(ω) and rejects all hypotheses $H_\ell$ with ℓ ∉ T(ω). We call a test simple if T(ω) is a singleton for every ω, that is, whatever the observation, the test accepts exactly one of the hypotheses $H_1, ..., H_L$ and rejects all other hypotheses.

Note: What we have defined is a deterministic test. Sometimes we shall also consider randomized tests, where the set of accepted hypotheses is a (deterministic) function of an observation ω and a realization θ of a random parameter (which w.l.o.g. can be assumed to be uniformly distributed on [0, 1]) independent of ω. Thus, in a randomized test, the inference depends both on the observation ω and the outcome θ of “flipping a coin,” while in a deterministic test the inference depends on the observation only. In fact, randomized testing can be reduced to deterministic testing. To this end it suffices to pass from our “actual” observation ω to the new observation $\omega_+ = (\omega, \theta)$, where θ ∼ Uniform[0, 1] is independent of ω; the ω-component of our new observation $\omega_+$ is, as before, generated “by nature,” and the θ-component is generated by us. Now, given families $\mathcal{P}_\ell$, 1 ≤ ℓ ≤ L, of probability distributions on the original observation space Ω, we can associate with them families $\mathcal{P}_{\ell,+} = \{P \times \mathrm{Uniform}[0, 1] : P \in \mathcal{P}_\ell\}$ of probability distributions on our new observation space $\Omega_+ = \Omega \times [0, 1]$. Clearly, to decide on the hypotheses associated with the families $\mathcal{P}_\ell$ via observation ω is the same as to decide on the hypotheses associated with the families $\mathcal{P}_{\ell,+}$ via our new observation $\omega_+$, and deterministic tests for the latter testing problem are exactly the randomized tests for the former one.

2.1.3 Testing from repeated observations

There are situations where an inference can be based on several observations $\omega_1, ..., \omega_K$ rather than on a single one. Our related setup is as follows:

We are given L families $\mathcal{P}_\ell$, ℓ = 1, ..., L, of probability distributions on the observation space Ω and a collection
$$\omega^K = (\omega_1, ..., \omega_K) \in \Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K}$$
and want to make conclusions on how the distribution of $\omega^K$ “is positioned” w.r.t. the families $\mathcal{P}_\ell$, 1 ≤ ℓ ≤ L.

We will be interested in three situations of this type, specifically, as follows.

2.1.3.1 Stationary K-repeated observations

In the case of stationary K-repeated observations, $\omega_1, ..., \omega_K$ are drawn, independently of each other, from a distribution P. Our goal is to decide, given $\omega^K$, on the hypotheses $P \in \mathcal{P}_\ell$, ℓ = 1, ..., L.

Equivalently: Families $\mathcal{P}_\ell$ of probability distributions of ω ∈ Ω, 1 ≤ ℓ ≤ L, give rise to the families
$$\mathcal{P}_\ell^{\odot,K} = \{P^K = \underbrace{P \times ... \times P}_{K} : P \in \mathcal{P}_\ell\}$$
of probability distributions on $\Omega^K$; we refer to the family $\mathcal{P}_\ell^{\odot,K}$ as the K-th diagonal power of the family $\mathcal{P}_\ell$. Given observation $\omega^K \in \Omega^K$, we want to decide on the hypotheses
$$H_\ell^{\odot,K} : \omega^K \sim P^K \in \mathcal{P}_\ell^{\odot,K}, \quad 1 \le \ell \le L.$$

2.1.3.2 Semi-stationary K-repeated observations

In the case of semi-stationary K-repeated observations, “nature” selects somehow a sequence $P_1, ..., P_K$ of distributions on Ω, and then draws, independently across k, observations $\omega_k$, k = 1, ..., K, from these distributions:
$$\omega_k \sim P_k, \quad \omega_k \text{ are independent across } k \le K.$$
Our goal is to decide, given $\omega^K = (\omega_1, ..., \omega_K)$, on the hypotheses $\{P_k \in \mathcal{P}_\ell, 1 \le k \le K\}$, ℓ = 1, ..., L.

Equivalently: Families $\mathcal{P}_\ell$ of probability distributions of ω ∈ Ω, 1 ≤ ℓ ≤ L, give rise to the families
$$\mathcal{P}_\ell^{\oplus,K} = \{P^K = P_1 \times ... \times P_K : P_k \in \mathcal{P}_\ell, 1 \le k \le K\}$$
of probability distributions on $\Omega^K$. Given observation $\omega^K \in \Omega^K$, we want to decide on the hypotheses
$$H_\ell^{\oplus,K} : \omega^K \sim P^K \in \mathcal{P}_\ell^{\oplus,K}, \quad 1 \le \ell \le L.$$
In the sequel, we refer to families $\mathcal{P}_\ell^{\oplus,K}$ as the K-th direct powers of the families $\mathcal{P}_\ell$. A closely related notion is that of the direct product
$$\mathcal{P}_\ell^{\oplus,K} = \bigoplus_{k=1}^{K} \mathcal{P}_{\ell,k}$$
of K families $\mathcal{P}_{\ell,k}$ of probability distributions on $\Omega_k$, over k = 1, ..., K. By definition,
$$\mathcal{P}_\ell^{\oplus,K} = \{P^K = P_1 \times ... \times P_K : P_k \in \mathcal{P}_{\ell,k}, 1 \le k \le K\}.$$

2.1.3.3 Quasi-stationary K-repeated observations

Quasi-stationary K-repeated observations $\omega_1 \in \Omega, ..., \omega_K \in \Omega$ stemming from a family $\mathcal{P}$ of probability distributions on an observation space Ω are generated as follows:

“In nature” there exists a random sequence $\zeta^K = (\zeta_1, ..., \zeta_K)$ of “driving factors” (or states) such that for every k, $\omega_k$ is a deterministic function of $\zeta_1, ..., \zeta_k$,
$$\omega_k = \theta_k(\zeta_1, ..., \zeta_k),$$
and the conditional distribution $P_{\omega_k|\zeta_1,...,\zeta_{k-1}}$ of $\omega_k$ given $\zeta_1, ..., \zeta_{k-1}$ always (i.e., for all $\zeta_1, ..., \zeta_{k-1}$) belongs to $\mathcal{P}$.

With the above mechanism, the collection $\omega^K = (\omega_1, ..., \omega_K)$ has some distribution $P^K$ which depends on the distribution of driving factors and functions $\theta_k(\cdot)$. We denote by $\mathcal{P}^{\otimes,K}$ the family of all distributions $P^K$ which can be obtained in this fashion, and we refer to random observations $\omega^K$ with distribution $P^K$ of the type just defined as the quasi-stationary K-repeated observations stemming from $\mathcal{P}$. The quasi-stationary version of our hypothesis testing problem reads: Given L families $\mathcal{P}_\ell$, ℓ = 1, ..., L, of probability distributions on Ω and an observation $\omega^K \in \Omega^K$, decide on the hypotheses
$$H_\ell^{\otimes,K} = \{P^K \in \mathcal{P}_\ell^{\otimes,K}\}, \quad 1 \le \ell \le L,$$
on the distribution $P^K$ of the observation $\omega^K$.

A related notion is that of the quasi-direct product
$$\mathcal{P}_\ell^{\otimes,K} = \bigotimes_{k=1}^{K} \mathcal{P}_{\ell,k}$$
of K families $\mathcal{P}_{\ell,k}$ of probability distributions on $\Omega_k$, over k = 1, ..., K. By definition, $\mathcal{P}_\ell^{\otimes,K}$ is comprised of all probability distributions of random sequences $\omega^K = (\omega_1, ..., \omega_K)$, $\omega_k \in \Omega_k$, which can be generated as follows: “in nature” there exists a random sequence $\zeta^K = (\zeta_1, ..., \zeta_K)$ of “driving factors” such that for every k ≤ K, $\omega_k$ is a deterministic function of $\zeta^k = (\zeta_1, ..., \zeta_k)$, and the conditional distribution of $\omega_k$ given $\zeta^{k-1}$ always belongs to $\mathcal{P}_{\ell,k}$.

The description of quasi-stationary K-repeated observations may seem too complicated. However, this is exactly what happens in some important applications, e.g., in hidden Markov chains. Suppose that Ω = {1, ..., d} is a finite set, and that $\omega_k \in \Omega$, k = 1, 2, ..., are generated as follows: “in nature” there exists a Markov chain with D-element state space S split into d nonoverlapping bins, and $\omega_k$ is the index $\beta(\eta_k)$ of the bin to which the state $\eta_k$ of the chain belongs. Every column $Q_j$ of the transition matrix Q of the chain (this column is a probability distribution on {1, ..., D}) generates a probability distribution $P_j$ on Ω, specifically, the distribution of β(η), $\eta \sim Q_j$. Now, a family $\mathcal{P}$ of distributions on Ω induces a family $\mathcal{Q}[\mathcal{P}]$ of all D × D stochastic matrices Q for which all D distributions $P_j$, j = 1, ..., D, belong to $\mathcal{P}$. When $Q \in \mathcal{Q}[\mathcal{P}]$, observations $\omega_k$, k = 1, 2, ..., clearly are given by the above “quasi-stationary mechanism” with $\eta_k$ in the role of driving factors and $\mathcal{P}$ in the role of $\mathcal{P}_\ell$. Thus, in the situation in question, given L families $\mathcal{P}_\ell$, ℓ = 1, ..., L, of probability distributions on Ω, deciding on hypotheses $Q \in \mathcal{Q}[\mathcal{P}_\ell]$, ℓ = 1, ..., L, on the transition matrix Q of the Markov chain underlying our observations reduces to hypothesis testing via quasi-stationary K-repeated observations.

2.1.4 Risk of a simple test

Let $\mathcal{P}_\ell$, ℓ = 1, ..., L, be families of probability distributions on observation space Ω; these families give rise to hypotheses
$$H_\ell : P \in \mathcal{P}_\ell, \quad \ell = 1, ..., L,$$
on the distribution P of a random observation ω ∼ P. We are about to define the risks of a simple test T deciding on the hypotheses $H_\ell$, ℓ = 1, ..., L, via observation ω. Recall that simplicity means that as applied to an observation, our test accepts exactly one hypothesis and rejects all other hypotheses.

Partial risks $\mathrm{Risk}_\ell(T|H_1, ..., H_L)$ are the worst-case, over $P \in \mathcal{P}_\ell$, P-probabilities of T rejecting the ℓ-th hypothesis when it is true, that is, when ω ∼ P:
$$\mathrm{Risk}_\ell(T|H_1, ..., H_L) = \sup_{P \in \mathcal{P}_\ell} \mathrm{Prob}_{\omega \sim P}\{\omega : T(\omega) \ne \{\ell\}\}, \quad \ell = 1, ..., L.$$

Obviously, for ℓ fixed, the ℓ-th partial risk depends on how we order the hypotheses; when reordering them, we should reorder risks as well. In particular, for a test T deciding on two hypotheses H, H ′ we have Risk1 (T |H, H ′ ) = Risk2 (T |H ′ , H).

Total risk $\mathrm{Risk}_{\mathrm{tot}}(T|H_1, ..., H_L)$ is the sum of all L partial risks:
$$\mathrm{Risk}_{\mathrm{tot}}(T|H_1, ..., H_L) = \sum_{\ell=1}^{L} \mathrm{Risk}_\ell(T|H_1, ..., H_L).$$

Risk $\mathrm{Risk}(T|H_1, ..., H_L)$ is the maximum of all L partial risks:
$$\mathrm{Risk}(T|H_1, ..., H_L) = \max_{1 \le \ell \le L} \mathrm{Risk}_\ell(T|H_1, ..., H_L).$$
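The three risk notions just defined can be illustrated on a finite observation space, where every supremum is a maximum over finitely many distributions. The following is a minimal sketch under assumptions that are not part of the text: Ω = {0, 1, 2}, hypotheses indexed from 0, each family given as a finite list of probability vectors, and a hypothetical helper name `partial_risks`.

```python
def partial_risks(test, families):
    """Partial risks of a simple test on a finite observation space.

    test[w]      -- index of the single hypothesis accepted at observation w
    families[l]  -- finite list of probability vectors obeying hypothesis l
    """
    risks = []
    for ell, family in enumerate(families):
        # Worst case, over P in the family, of P{T(w) != {ell}}.
        worst = max(
            sum(p[w] for w in range(len(p)) if test[w] != ell)
            for p in family
        )
        risks.append(worst)
    return risks


# Hypothetical toy data: Omega = {0, 1, 2}, two hypotheses.
families = [
    [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2]],  # composite hypothesis (index 0)
    [[0.1, 0.2, 0.7]],                   # simple hypothesis (index 1)
]
test = [0, 0, 1]  # accept hypothesis 0 at w in {0, 1}, hypothesis 1 at w = 2

risks = partial_risks(test, families)
total_risk = sum(risks)   # sum of partial risks
risk = max(risks)         # maximum of partial risks
print(risks, total_risk, risk)
```

Since every partial risk is at most the risk and the total risk sums L partial risks, one always has Risk ≤ Risk_tot ≤ L · Risk.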

Note that at first glance, we have defined risks for single-observation tests only; in fact, we have defined them for tests based on stationary, semi-stationary, and quasi-stationary K-repeated observations as well, since, as we remember from Section 2.1.3, the corresponding testing problems, after redefining observations and families of probability distributions ($\omega^K$ in the role of ω and, say, $\mathcal{P}_\ell^{\oplus,K} = \bigoplus_{k=1}^{K} \mathcal{P}_\ell$ in the role of $\mathcal{P}_\ell$), become single-observation testing problems.

Pay attention to the following two important observations:

• Partial risks of a simple test are defined in the worst-case fashion: as the maximal, over the true distributions P of observations compatible with the hypothesis in question, probability to reject this hypothesis.
• Risks of a simple test say what happens, statistically speaking, when the true distribution P of observation obeys one of the hypotheses in question, and say nothing about what happens when P does not obey any of the L hypotheses.

Remark 2.1. “The smaller are the hypotheses, the less are the risks.” Specifically, given families of probability distributions $\mathcal{P}_\ell \subset \mathcal{P}'_\ell$, ℓ = 1, ..., L, on observation space Ω, along with hypotheses $H_\ell : P \in \mathcal{P}_\ell$, $H'_\ell : P \in \mathcal{P}'_\ell$ on the distribution P of an observation ω ∈ Ω, every test T deciding on the “larger” hypotheses $H'_1, ..., H'_L$ can be considered as a test deciding on the smaller hypotheses $H_1, ..., H_L$ as well, and the risks of the test when passing from larger hypotheses to smaller ones can only drop down:
$$\mathcal{P}_\ell \subset \mathcal{P}'_\ell, \ 1 \le \ell \le L \ \Rightarrow \ \mathrm{Risk}(T|H_1, ..., H_L) \le \mathrm{Risk}(T|H'_1, ..., H'_L).$$
For example, families of probability distributions $\mathcal{P}_\ell$, 1 ≤ ℓ ≤ L, on Ω and a positive integer K induce three families of hypotheses on a distribution $P^K$ of K-repeated observations:
$$H_\ell^{\odot,K} : P^K \in \mathcal{P}_\ell^{\odot,K}, \quad H_\ell^{\oplus,K} : P^K \in \mathcal{P}_\ell^{\oplus,K} = \bigoplus_{k=1}^{K} \mathcal{P}_\ell, \quad H_\ell^{\otimes,K} : P^K \in \mathcal{P}_\ell^{\otimes,K} = \bigotimes_{k=1}^{K} \mathcal{P}_\ell, \quad 1 \le \ell \le L$$
(see Section 2.1.3), and clearly
$$\mathcal{P}_\ell^{\odot,K} \subset \mathcal{P}_\ell^{\oplus,K} \subset \mathcal{P}_\ell^{\otimes,K}.$$
It follows that when passing from quasi-stationary K-repeated observations to semi-stationary K-repeated observations, and then to stationary K-repeated observations, the risks of a test can only go down.

2.1.5 Two-point lower risk bound

The following well-known [162, 164] observation is nearly evident:

Proposition 2.2. Consider two simple hypotheses $H_1 : P = P_1$ and $H_2 : P = P_2$ on the distribution P of observation ω ∈ Ω, and assume that $P_1$, $P_2$ have densities $p_1$, $p_2$ w.r.t. some reference measure Π on Ω.¹ Then for any simple test T deciding on $H_1$, $H_2$ it holds
$$\mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) \ge \int_\Omega \min[p_1(\omega), p_2(\omega)]\,\Pi(d\omega). \quad (2.1)$$
Note that the right-hand side in this relation is independent of how Π is selected.

1 This assumption is w.l.o.g.: we can take, as Π, the sum of the measures $P_1$ and $P_2$.

Proof. Consider a simple test T, perhaps a randomized one, and let π(ω) be the probability for this test to accept $H_1$ and reject $H_2$ when the observation is ω. Since the test is simple, the probability for T to accept $H_2$ and to reject $H_1$, the observation being ω, is 1 − π(ω). Consequently,
$$\mathrm{Risk}_1(T|H_1, H_2) = \int_\Omega (1 - \pi(\omega))\,p_1(\omega)\,\Pi(d\omega), \quad \mathrm{Risk}_2(T|H_1, H_2) = \int_\Omega \pi(\omega)\,p_2(\omega)\,\Pi(d\omega),$$
whence, since $(1 - \pi)p_1 + \pi p_2 \ge \min[p_1, p_2]$ whenever 0 ≤ π ≤ 1,
$$\mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) = \int_\Omega [(1 - \pi(\omega))\,p_1(\omega) + \pi(\omega)\,p_2(\omega)]\,\Pi(d\omega) \ge \int_\Omega \min[p_1(\omega), p_2(\omega)]\,\Pi(d\omega). \quad ✷$$
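The bound (2.1) is easy to check by brute force when Ω is finite and Π is the counting measure, so that the integral becomes a sum. The sketch below is only an illustration: the two probability vectors are hypothetical, and `total_risk` is a helper name introduced here. It enumerates all deterministic simple tests and also evaluates the likelihood ratio test that splits ties evenly (the test described in Remark 2.3).

```python
from itertools import product


def total_risk(pi, p1, p2):
    """Total risk of the test accepting H1 with probability pi[w] at w."""
    risk1 = sum((1 - a) * p for a, p in zip(pi, p1))  # reject H1 under P1
    risk2 = sum(a * q for a, q in zip(pi, p2))        # reject H2 under P2
    return risk1 + risk2


# Hypothetical densities w.r.t. the counting measure on Omega = {0, 1, 2}.
p1 = [0.5, 0.3, 0.2]
p2 = [0.2, 0.3, 0.5]

# Right-hand side of (2.1).
lower_bound = sum(min(a, b) for a, b in zip(p1, p2))

# All 2^3 deterministic simple tests: pi takes values in {0, 1}.
deterministic_risks = [
    total_risk(pi, p1, p2) for pi in product([0.0, 1.0], repeat=len(p1))
]

# Likelihood ratio test with ties split evenly.
lr_pi = [1.0 if a > b else 0.0 if a < b else 0.5 for a, b in zip(p1, p2)]
lr_risk = total_risk(lr_pi, p1, p2)

print(lower_bound, min(deterministic_risks), lr_risk)
```

In this toy instance no test falls below the bound, and the tie-splitting likelihood ratio test attains it.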

Remark 2.3. Note that the lower risk bound (2.1) is achievable; given an observation ω, the corresponding test T accepts $H_1$ with probability 1 (i.e., π(ω) = 1) when $p_1(\omega) > p_2(\omega)$, accepts $H_2$ when $p_1(\omega) < p_2(\omega)$ (i.e., π(ω) = 0 when $p_1(\omega) < p_2(\omega)$), and accepts $H_1$ and $H_2$ with probabilities 1/2 in the case of a tie (i.e., π(ω) = 1/2 when $p_1(\omega) = p_2(\omega)$). This is nothing but the likelihood ratio test naturally adjusted to account for ties.

Example 2.1. Let $\Omega = \mathbf{R}^d$, let the reference measure Π be the Lebesgue measure on $\mathbf{R}^d$, and let $p_\chi(\cdot) = \mathcal{N}(\mu_\chi, I_d)$ be the Gaussian densities on $\mathbf{R}^d$ with unit covariance and means $\mu_\chi$, χ = 1, 2. In this case, assuming $\mu_1 \ne \mu_2$, the recipe from Remark 2.3 reduces to the following: Let
$$\phi_{1,2}(\omega) = \tfrac{1}{2}[\mu_1 - \mu_2]^T[\omega - w], \quad w = \tfrac{1}{2}[\mu_1 + \mu_2]. \quad (2.2)$$
Consider the simple test T which, given an observation ω, accepts $H_1 : p = p_1$ (and rejects $H_2 : p = p_2$) when $\phi_{1,2}(\omega) \ge 0$, and accepts $H_2$ (and rejects $H_1$) otherwise. For this test,
$$\mathrm{Risk}_1(T|H_1, H_2) = \mathrm{Risk}_2(T|H_1, H_2) = \mathrm{Risk}(T|H_1, H_2) = \tfrac{1}{2}\mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) = \mathrm{Erfc}(\tfrac{1}{2}\|\mu_1 - \mu_2\|_2) \quad (2.3)$$

(see (1.22) for the definition of Erfc), and the test is optimal in terms of its risk and its total risk. Note that optimality of T in terms of total risk is given by Proposition 2.2 and Remark 2.3; optimality in terms of risk is ensured by optimality in terms of total risk combined with the first equality in (2.3).

[Figure 2.1: “Gaussian Separation” (Example 2.2): Optimal test deciding on whether the mean of Gaussian r.v. belongs to the domain A ($H_1$) or to the domain B ($H_2$). Hyperplane o-o separates the acceptance domains for $H_1$ (“left” half-space) and for $H_2$ (“right” half-space).]

Example 2.1 admits an immediate and useful extension [36, 37, 84, 128]:

Example 2.2. Let $\Omega = \mathbf{R}^d$, let the reference measure Π be the Lebesgue measure on $\mathbf{R}^d$, and let $M_1$ and $M_2$ be two nonempty closed convex sets in $\mathbf{R}^d$ with empty intersection and such that the convex optimization program
$$\min_{\mu_1, \mu_2} \{\|\mu_1 - \mu_2\|_2 : \mu_\chi \in M_\chi, \ \chi = 1, 2\} \quad (*)$$
has an optimal solution $\mu_1^*$, $\mu_2^*$ (this definitely is the case when at least one of the sets $M_1$, $M_2$ is bounded). Let
$$\phi_{1,2}(\omega) = \tfrac{1}{2}[\mu_1^* - \mu_2^*]^T[\omega - w], \quad w = \tfrac{1}{2}[\mu_1^* + \mu_2^*], \quad (2.4)$$
and let the simple test T deciding on the hypotheses
$$H_1 : p = \mathcal{N}(\mu, I_d) \text{ with } \mu \in M_1, \quad H_2 : p = \mathcal{N}(\mu, I_d) \text{ with } \mu \in M_2$$
be as follows (see Figure 2.1): given an observation ω, T accepts $H_1$ (and rejects $H_2$) when $\phi_{1,2}(\omega) \ge 0$, and accepts $H_2$ (and rejects $H_1$) otherwise. Then
$$\mathrm{Risk}_1(T|H_1, H_2) = \mathrm{Risk}_2(T|H_1, H_2) = \mathrm{Risk}(T|H_1, H_2) = \tfrac{1}{2}\mathrm{Risk}_{\mathrm{tot}}(T|H_1, H_2) = \mathrm{Erfc}(\tfrac{1}{2}\|\mu_1^* - \mu_2^*\|_2), \quad (2.5)$$

and the test is optimal in terms of its risk and its total risk.

Justification of Example 2.2 is immediate. Let e be the $\|\cdot\|_2$-unit vector in the direction of $\mu_1^* - \mu_2^*$, and let $\xi[\omega] = e^T(\omega - w)$. From optimality conditions for (∗) it follows that
$$e^T\mu \ge e^T\mu_1^* \ \ \forall \mu \in M_1 \quad \& \quad e^T\mu \le e^T\mu_2^* \ \ \forall \mu \in M_2.$$
As a result, if $\mu \in M_1$ and the density of ω is $p_\mu = \mathcal{N}(\mu, I_d)$, the random variable ξ[ω] is a scalar Gaussian random variable with unit variance and expectation $\ge \delta := \tfrac{1}{2}\|\mu_1^* - \mu_2^*\|_2$, implying that the $p_\mu$-probability for ξ[ω] to be negative (which is exactly the same as the $p_\mu$-probability for T to reject $H_1$ and accept $H_2$) is at most Erfc(δ). Similarly, when $\mu \in M_2$ and the density of ω is $p_\mu = \mathcal{N}(\mu, I_d)$, ξ[ω] is a scalar Gaussian random variable with unit variance and expectation ≤ −δ, implying that the $p_\mu$-probability for ξ[ω] to be nonnegative (which is exactly the same as the probability for T to reject $H_2$ and accept $H_1$) is at most Erfc(δ). These observations imply the validity of (2.5). The test optimality in terms of risks follows from the fact that the risks of a simple test deciding on our now composite hypotheses $H_1$, $H_2$ on the density p of observation ω can be only larger than the risks of a simple test deciding on two simple hypotheses $p = p_{\mu_1^*}$ and $p = p_{\mu_2^*}$. In other words, the quantity $\mathrm{Erfc}(\tfrac{1}{2}\|\mu_1^* - \mu_2^*\|_2)$ (see Example 2.1) is a lower bound on the risk and half of the total risk of a test deciding on $H_1$, $H_2$. With this in mind, the announced optimalities of T in terms of risks are immediate consequences of (2.5).

We remark that the (nearly self-evident) result stated in Example 2.2 seems to have first been noticed in [36].

Example 2.2 allows for substantial extensions in two directions: first, it turns out that the “Euclidean separation” underlying the test built in this example can be used to decide on hypotheses on the location of a “center” of d-dimensional distribution far beyond the Gaussian observation model considered in this example. This extension will be our goal in the next section, based on the recent paper [110]. Less straightforward and, we believe, more instructive extensions, originating from [102], will be considered in Section 2.3.
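The construction of Example 2.2 can be traced numerically in a case where program (∗) has a closed-form solution. The sketch below rests on assumptions that are not part of the text: $M_1$ and $M_2$ are taken to be two disjoint Euclidean balls (so the closest points lie on the segment joining the centers), and Erfc is implemented as the standard Gaussian upper tail Prob{ξ ≥ s}, ξ ∼ N(0, 1), which is the role it plays in the justification above. All function names are hypothetical.

```python
import math


def gauss_tail(s):
    """Prob{xi >= s} for xi ~ N(0, 1), via the standard library erfc."""
    return 0.5 * math.erfc(s / math.sqrt(2.0))


def closest_points_of_balls(c1, r1, c2, r2):
    """Closest points of two disjoint Euclidean balls (closed form)."""
    diff = [a - b for a, b in zip(c1, c2)]
    dist = math.sqrt(sum(t * t for t in diff))
    if dist <= r1 + r2:
        raise ValueError("the balls must not intersect")
    u = [t / dist for t in diff]                 # unit vector from c2 to c1
    mu1 = [a - r1 * t for a, t in zip(c1, u)]    # point of ball 1 nearest ball 2
    mu2 = [b + r2 * t for b, t in zip(c2, u)]    # point of ball 2 nearest ball 1
    return mu1, mu2


def make_detector(mu1, mu2):
    """Affine function phi(omega) = 0.5 * (mu1 - mu2)^T (omega - w), cf. (2.4)."""
    w = [(a + b) / 2 for a, b in zip(mu1, mu2)]
    g = [(a - b) / 2 for a, b in zip(mu1, mu2)]
    return lambda omega: sum(gi * (oi - wi) for gi, oi, wi in zip(g, omega, w))


# Hypothetical data: unit balls centered at (3, 0) and (-3, 0).
mu1, mu2 = closest_points_of_balls([3.0, 0.0], 1.0, [-3.0, 0.0], 1.0)
phi = make_detector(mu1, mu2)
delta = 0.5 * math.dist(mu1, mu2)
risk = gauss_tail(delta)            # right-hand side of (2.5) in this instance

accept_h1 = phi([1.0, 5.0]) >= 0    # the test accepts H1 when phi >= 0
print(mu1, mu2, delta, risk, accept_h1)
```

For general convex sets the closest points have to be found by a convex solver; the ball case is used here only because it needs no solver.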

2.2 HYPOTHESIS TESTING VIA EUCLIDEAN SEPARATION

2.2.1 Situation

In this section, we will be interested in testing hypotheses
$$H_\ell : P \in \mathcal{P}_\ell, \quad \ell = 1, ..., L \quad (2.6)$$
on the probability distribution of a random observation ω in the situation where the families of distributions $\mathcal{P}_\ell$ are obtained from a given family $\mathcal{P}$ of probability distributions by shifts. Specifically, we are given
• a family $\mathcal{P}$ of probability distributions on $\Omega = \mathbf{R}^d$ such that all distributions from $\mathcal{P}$ possess densities with respect to the Lebesgue measure on $\mathbf{R}^d$, and these densities are even functions on $\mathbf{R}^d$;²
• a collection $X_1, ..., X_L$ of nonempty closed and convex subsets of $\mathbf{R}^d$, with at most one of the sets unbounded.

These data specify L families $\mathcal{P}_\ell$ of distributions on $\mathbf{R}^d$; $\mathcal{P}_\ell$ is comprised of distributions of random vectors of the form x + ξ, where $x \in X_\ell$ is deterministic, and ξ is random with distribution from $\mathcal{P}$. Note that with this setup, deciding upon hypotheses (2.6) via observation ω ∼ P is exactly the same as deciding, given observation
$$\omega = x + \xi, \quad (2.7)$$
where x is a deterministic “signal” and ξ is random noise with distribution P known to belong to $\mathcal{P}$, on the “position” of x w.r.t. $X_1, ..., X_L$, the ℓ-th hypothesis $H_\ell$ saying that $x \in X_\ell$. The latter allows us to write down the ℓ-th hypothesis as $H_\ell : x \in X_\ell$ (of course, this shorthand makes sense only within the scope of our current “signal plus noise” setup).

2 Allowing for a slight abuse of notation, we write $P \in \mathcal{P}$, where P is a probability distribution, to express the fact that P belongs to $\mathcal{P}$ (no abuse of notation so far), and write $p(\cdot) \in \mathcal{P}$ (this is an abuse of notation), where p(·) is the density of the probability distribution P, to express the fact that $P \in \mathcal{P}$.

2.2.2 Pairwise Hypothesis Testing via Euclidean Separation

2.2.2.1 The simplest case

Consider nearly the simplest case of the situation from Section 2.2.1 where L = 2, $X_1 = \{x_1\}$ and $X_2 = \{x_2\}$, $x_1 \ne x_2$, are singletons, and $\mathcal{P}$ also is a singleton. Let the probability density of the only distribution from $\mathcal{P}$ be of the form
$$p(u) = f(\|u\|_2), \quad f(\cdot) \text{ is a strictly monotonically decreasing function on the nonnegative ray.} \quad (2.8)$$
This situation is a generalization of the one considered in Example 2.1, where we dealt with the special case of f, namely, with
$$p(u) = (2\pi)^{-d/2} e^{-u^T u/2}.$$
In the case in question our goal is to decide on two simple hypotheses $H_\chi : p(u) = f(\|u - x_\chi\|_2)$, χ = 1, 2, on the density of observation (2.7). Let us set
$$\delta = \tfrac{1}{2}\|x_1 - x_2\|_2, \quad e = \frac{x_1 - x_2}{\|x_1 - x_2\|_2}, \quad \phi(\omega) = e^T\omega - \underbrace{\tfrac{1}{2}e^T[x_1 + x_2]}_{c}, \quad (2.9)$$
and consider the test T which, given observation ω = x + ξ, accepts the hypothesis $H_1 : x = x_1$ when φ(ω) ≥ 0, and accepts the hypothesis $H_2 : x = x_2$ otherwise.

[Figure: densities $p_1(\cdot)$ and $p_2(\cdot)$ centered at $x_1$ and $x_2$, with the regions φ(ω) > 0, φ(ω) = 0, and φ(ω) < 0.]

We have (cf. Example 2.1)
$$\mathrm{Risk}_1(T|H_1, H_2) = \int_{\omega : \phi(\omega) < 0} p_1(\omega)\,d\omega = \ldots$$

[...] 0, ω ∈ Ω

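• $\mathcal{F}$ is the space of all real-valued functions on the finite set Ω.

Clearly, Discrete o.s. is simple; the function
$$f(\mu) := \ln\left(\int_\Omega e^{\phi(\omega)} p_\mu(\omega)\,\Pi(d\omega)\right) = \ln\left(\sum_{\omega \in \Omega} e^{\phi(\omega)} \mu_\omega\right)$$
indeed is concave in $\mu \in \mathcal{M}$.

2.4.3.4 Direct products of simple observation schemes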

6 Large Binocular Telescope [16, 17] is a cutting-edge instrument for high-resolution optical/infrared astronomical imaging; it is the subject of a huge ongoing international project; see http://www.lbto.org. Nanoscale Fluorescent Microscopy (a.k.a. Poisson Biophotonics) is a revolutionary tool for cell imaging triggered by the advent of techniques [18, 113, 117, 211] (2014 Nobel Prize in Chemistry) allowing us to break the diffraction barrier and to view biological molecules “at work” at a resolution of 10–20 nm, yielding entirely new insights into the signalling and transport processes within cells.

Given K simple observation schemes
$$\mathcal{O}_k = \left(\Omega_k, \Pi_k; \{p_{\mu,k}(\cdot) : \mu \in \mathcal{M}_k\}; \mathcal{F}_k\right), \quad 1 \le k \le K,$$
we can define their direct product
$$\mathcal{O}^K = \prod_{k=1}^{K} \mathcal{O}_k = \left(\Omega^K, \Pi^K; \{p_\mu : \mu \in \mathcal{M}^K\}; \mathcal{F}^K\right)$$
by modeling the situation where our observation is a tuple $\omega^K = (\omega_1, ..., \omega_K)$ with components $\omega_k$ yielded, independently of each other, by observation schemes $\mathcal{O}_k$, namely, as follows:
• The observation space $\Omega^K$ is the direct product of observation spaces $\Omega_1, ..., \Omega_K$, and the reference measure $\Pi^K$ is the product of the measures $\Pi_1, ..., \Pi_K$;
• The parameter space $\mathcal{M}^K$ is the direct product of partial parameter spaces $\mathcal{M}_1, ..., \mathcal{M}_K$, and the distribution $p_\mu(\omega^K)$ associated with parameter
$$\mu = (\mu_1, \mu_2, ..., \mu_K) \in \mathcal{M}^K = \mathcal{M}_1 \times ... \times \mathcal{M}_K$$
is the probability distribution on $\Omega^K$ with the density
$$p_\mu(\omega^K) = \prod_{k=1}^{K} p_{\mu_k,k}(\omega_k)$$
w.r.t. $\Pi^K$. In other words, random observation $\omega^K \sim p_\mu$ is a sample of observations $\omega_1, ..., \omega_K$, drawn, independently of each other, from the distributions $p_{\mu_1,1}, p_{\mu_2,2}, ..., p_{\mu_K,K}$;
• The space $\mathcal{F}^K$ is comprised of all separable functions
$$\phi(\omega^K) = \sum_{k=1}^{K} \phi_k(\omega_k)$$

with $\phi_k(\cdot) \in \mathcal{F}_k$, 1 ≤ k ≤ K.

It is immediately seen that the direct product of simple o.s.’s is simple. When all factors $\mathcal{O}_k$, 1 ≤ k ≤ K, are the identical simple o.s. $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$, the direct product of the factors can be “truncated” to yield the K-th power (called also the stationary K-repeated version) of $\mathcal{O}$, denoted by
$$[\mathcal{O}]^K = \left(\Omega^K, \Pi^K; \{p_\mu^{(K)} : \mu \in \mathcal{M}\}; \mathcal{F}^{(K)}\right)$$
and defined as follows.
• $\Omega^K$ and $\Pi^K$ are exactly the same as in the direct product:
$$\Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K}, \quad \Pi^K = \underbrace{\Pi \times ... \times \Pi}_{K};$$
• the parameter space is $\mathcal{M}$ rather than the direct product of K copies of $\mathcal{M}$, and the densities are
$$p_\mu^{(K)}(\omega^K = (\omega_1, ..., \omega_K)) = \prod_{k=1}^{K} p_\mu(\omega_k).$$
In other words, random observations $\omega^K \sim p_\mu^{(K)}$ are K-element samples with components drawn, independently of each other, from $p_\mu$;
• the space $\mathcal{F}^{(K)}$ is comprised of separable functions
$$\phi^{(K)}(\omega^K) = \sum_{k=1}^{K} \phi(\omega_k)$$
with identical components belonging to $\mathcal{F}$ (i.e., $\phi \in \mathcal{F}$).

It is immediately seen that a power of a simple o.s. is simple.

Remark 2.19. Gaussian, Poisson, and Discrete o.s.’s clearly are nondegenerate. It is also clear that the direct product of nondegenerate o.s.’s is nondegenerate.

2.4.4 Simple observation schemes: Main result

We are about to demonstrate that when deciding on convex, in some precise sense to be specified below, hypotheses in simple observation schemes, optimal detectors can be found efficiently by solving convex-concave saddle point problems. We start with an “executive summary” of convex-concave saddle point problems.

2.4.4.1 Executive summary of convex-concave saddle point problems

The results to follow are absolutely standard, and their proofs can be found in all textbooks on the subject; see, e.g., [221] or [15, Section D.4].

Let U and V be nonempty sets, and let Φ : U × V → R be a function. These data define an antagonistic game of two players, I and II, where player I selects a point u ∈ U, and player II selects a point v ∈ V; as an outcome of these selections, player I pays to player II the sum Φ(u, v). Clearly, player I is interested in minimizing this payment, and player II in maximizing it. The data U, V, Φ are known to the players in advance, and the question is, what should be their selections?

When player I makes his selection u first, and player II makes his selection v with u already known, player I should be ready to pay for a selection u ∈ U a toll as large as
$$\overline{\Phi}(u) = \sup_{v \in V} \Phi(u, v).$$
In this situation, a risk-averse player I would select u by minimizing the above worst-case payment, by solving the primal problem
$$\mathrm{Opt}(P) = \inf_{u \in U} \overline{\Phi}(u) = \inf_{u \in U} \sup_{v \in V} \Phi(u, v) \quad (P)$$
associated with the data U, V, Φ.

Similarly, if player II makes his selection v first, and player I selects u after v becomes known, player II should be ready to get, as a result of selecting v ∈ V, the amount as small as
$$\underline{\Phi}(v) = \inf_{u \in U} \Phi(u, v).$$
In this situation, a risk-averse player II would select v by maximizing the above worst-case payment, by solving the dual problem
$$\mathrm{Opt}(D) = \sup_{v \in V} \underline{\Phi}(v) = \sup_{v \in V} \inf_{u \in U} \Phi(u, v). \quad (D)$$
Intuitively, the first situation is less preferable for player I than the second one, so that his guaranteed payment in the first situation, that is, Opt(P), should be ≥ his guaranteed payment, Opt(D), in the second situation:
$$\mathrm{Opt}(P) := \inf_{u \in U} \sup_{v \in V} \Phi(u, v) \ge \sup_{v \in V} \inf_{u \in U} \Phi(u, v) =: \mathrm{Opt}(D).$$
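For finite U and V the quantities Opt(P) and Opt(D) reduce to row and column scans of a payoff matrix, which makes the inequality between them easy to observe. The sketch below is only an illustration: the two matrices are hypothetical, `A[u][v]` plays the role of Φ(u, v), and the function names are introduced here.

```python
def primal_value(A):
    """Opt(P): player I picks a row u first and pays the worst column entry."""
    return min(max(row) for row in A)


def dual_value(A):
    """Opt(D): player II picks a column v first and receives the least entry."""
    n_rows, n_cols = len(A), len(A[0])
    return max(min(A[u][v] for u in range(n_rows)) for v in range(n_cols))


# Hypothetical payoff matrices.
A_gap = [[1, -1],
         [-1, 1]]       # "matching pennies": the two values differ
A_equal = [[3, 1],
           [4, 2]]      # the two values coincide

for A in (A_gap, A_equal):
    # Opt(P) >= Opt(D) holds for every payoff matrix.
    assert primal_value(A) >= dual_value(A)
    print(primal_value(A), dual_value(A))
```

In the first matrix the order of moves matters (values 1 and −1); in the second it does not (both values equal 3).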

The inequality Opt(P) ≥ Opt(D), called Weak Duality, indeed is true.

The central question related to the game is what should the players do when making their selections simultaneously, with no knowledge of what is selected by the adversary. There is a case when this question has a completely satisfactory answer: this is the case where Φ has a saddle point on U × V.

Definition 2.20. A point $(u_*, v_*) \in U \times V$ is called a saddle point⁷ of function Φ(u, v) : U × V → R if Φ as a function of u ∈ U attains at this point its minimum, and as a function of v ∈ V its maximum, that is, if
$$\Phi(u, v_*) \ge \Phi(u_*, v_*) \ge \Phi(u_*, v) \quad \forall (u \in U, v \in V).$$
From the viewpoint of our game, a saddle point $(u_*, v_*)$ is an equilibrium: when one of the players sticks to the selection stemming from this point, the other one has no incentive to deviate from his selection stemming from the point. Indeed, if player II selects $v_*$, there is no reason for player I to deviate from selecting $u_*$, since with another selection, his loss (the payment) can only increase; similarly, when player I selects $u_*$, there is no reason for player II to deviate from $v_*$, since with any other selection, his gain (the payment) can only decrease. As a result, if the cost function Φ has a saddle point on U × V, this saddle point $(u_*, v_*)$ can be considered as a solution to the game, as the pair of preferred selections of rational players. It can be easily seen that while Φ can have many saddle points, the values of Φ at all these points are equal to each other; we denote this common value by SadVal. If $(u_*, v_*)$ is a saddle point and player I selects $u = u_*$, his worst-case loss, over selections v ∈ V of player II, is SadVal, and if player I selects any u ∈ U, his worst-case loss, over the selections of player II, can be only ≥ SadVal. Similarly, when player II selects $v = v_*$, his worst-case gain, over the selections of player I, is SadVal, and if player II selects any v ∈ V, his worst-case gain, over the selections of player I, can be only ≤ SadVal.

Existence of saddle points of Φ (min in u ∈ U, max in v ∈ V) can be expressed in terms of the primal problem (P) and the dual problem (D):

7 More precisely, “saddle point (min in u ∈ U , max in v ∈ V )”; we will usually skip the clarification in parentheses, since it always will be clear from the context what are the minimization variables and what are the maximization ones.


Proposition 2.21. Φ has a saddle point (min in u ∈ U, max in v ∈ V) if and only if problems (P) and (D) are solvable with equal optimal values:
$$\mathrm{Opt}(P) := \inf_{u \in U} \sup_{v \in V} \Phi(u, v) = \sup_{v \in V} \inf_{u \in U} \Phi(u, v) =: \mathrm{Opt}(D). \quad (2.55)$$
Whenever this is the case, the saddle points of Φ are exactly the pairs $(u_*, v_*)$ comprised of optimal solutions to problems (P) and (D), and the value of Φ at every one of these points is the common value SadVal of Opt(P) and Opt(D).

Existence of a saddle point of a function is a “rare commodity,” and the standard sufficient condition for it is convexity-concavity of Φ coupled with convexity of U and V. The precise statement is as follows:

Theorem 2.22. [Sion-Kakutani; see, e.g., [221] or [15, Theorems D.4.3, D.4.4]] Let $U \subset \mathbf{R}^m$, $V \subset \mathbf{R}^n$ be nonempty closed convex sets, with V bounded, and let Φ : U × V → R be a continuous function which is convex in u ∈ U for every fixed v ∈ V, and is concave in v ∈ V for every fixed u ∈ U. Then the equality (2.55) holds true (although it may happen that Opt(P) = Opt(D) = −∞). If, in addition, Φ is coercive in u, meaning that the level sets
$$\{u \in U : \Phi(u, v) \le a\}$$
are bounded for every a ∈ R and v ∈ V (equivalently: for every v ∈ V, $\Phi(u_i, v) \to +\infty$ along every sequence $u_i \in U$ going to ∞: $\|u_i\| \to \infty$ as i → ∞), then Φ admits saddle points (min in u ∈ U, max in v ∈ V).

Note that the “true” Sion-Kakutani Theorem is a bit stronger than Theorem 2.22; the latter, however, covers all our related needs.

2.4.4.2 Main result

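Theorem 2.23. Let $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$ be a simple observation scheme, and let $M_1$, $M_2$ be nonempty compact convex subsets of $\mathcal{M}$. Then

(i) The function
$$\Phi(\phi, [\mu; \nu]) = \tfrac{1}{2}\left[\ln\left(\int_\Omega e^{-\phi(\omega)} p_\mu(\omega)\,\Pi(d\omega)\right) + \ln\left(\int_\Omega e^{\phi(\omega)} p_\nu(\omega)\,\Pi(d\omega)\right)\right] : \ \mathcal{F} \times (M_1 \times M_2) \to \mathbf{R} \quad (2.56)$$
is continuous on its domain, convex in $\phi(\cdot) \in \mathcal{F}$, concave in $[\mu; \nu] \in M_1 \times M_2$, and possesses a saddle point (min in $\phi \in \mathcal{F}$, max in $[\mu; \nu] \in M_1 \times M_2$) $(\phi_*(\cdot), [\mu_*; \nu_*])$ on $\mathcal{F} \times (M_1 \times M_2)$. W.l.o.g. $\phi_*$ can be assumed to satisfy the relation⁸
$$\int_\Omega \exp\{-\phi_*(\omega)\}\,p_{\mu_*}(\omega)\,\Pi(d\omega) = \int_\Omega \exp\{\phi_*(\omega)\}\,p_{\nu_*}(\omega)\,\Pi(d\omega). \quad (2.57)$$

8 Note that $\mathcal{F}$ contains constants, and shifting by a constant the φ-component of a saddle point of Φ and keeping its [µ; ν]-component intact, we clearly get another saddle point of Φ.

Denoting the common value of the two quantities in (2.57) by $\varepsilon_\star$, the saddle point value
$$\min_{\phi \in \mathcal{F}} \max_{[\mu;\nu] \in M_1 \times M_2} \Phi(\phi, [\mu; \nu])$$
is $\ln(\varepsilon_\star)$. Besides this, setting $\phi_*^a(\cdot) = \phi_*(\cdot) - a$, one has
$$\begin{array}{ll} (a) & \int_\Omega \exp\{-\phi_*^a(\omega)\}\,p_\mu(\omega)\,\Pi(d\omega) \le \exp\{a\}\,\varepsilon_\star \quad \forall \mu \in M_1, \\ (b) & \int_\Omega \exp\{\phi_*^a(\omega)\}\,p_\nu(\omega)\,\Pi(d\omega) \le \exp\{-a\}\,\varepsilon_\star \quad \forall \nu \in M_2. \end{array} \quad (2.58)$$
In view of Proposition 2.14 this implies that when deciding via an observation ω ∈ Ω on the hypotheses
$$H_\chi : \omega \sim p_\mu \text{ with } \mu \in M_\chi, \quad \chi = 1, 2,$$
the risks of the simple test $T_{\phi_*^a}$ based on the detector $\phi_*^a$ can be upper-bounded as follows:
$$\mathrm{Risk}_1(T_{\phi_*^a}|H_1, H_2) \le \exp\{a\}\,\varepsilon_\star, \quad \mathrm{Risk}_2(T_{\phi_*^a}|H_1, H_2) \le \exp\{-a\}\,\varepsilon_\star.$$
Moreover, $\phi_*$, $\varepsilon_\star$ form an optimal solution to the optimization problem
$$\min_{\phi, \epsilon} \left\{ \epsilon : \begin{array}{l} \int_\Omega e^{-\phi(\omega)} p_\mu(\omega)\,\Pi(d\omega) \le \epsilon \ \ \forall \mu \in M_1 \\ \int_\Omega e^{\phi(\omega)} p_\mu(\omega)\,\Pi(d\omega) \le \epsilon \ \ \forall \mu \in M_2 \end{array} \right\} \quad (2.59)$$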

(the minimum in (2.59) is taken over all $\epsilon > 0$ and all Π-measurable functions φ(·), not just over $\phi \in \mathcal{F}$).

(ii) The dual problem associated with the saddle point data Φ, $\mathcal{F}$, $M_1 \times M_2$ is
$$\max_{\mu \in M_1, \nu \in M_2} \left[\underline{\Phi}(\mu, \nu) := \inf_{\phi \in \mathcal{F}} \Phi(\phi; [\mu; \nu])\right]. \quad (D)$$
The objective in this problem is in fact the logarithm of the Hellinger affinity of $p_\mu$ and $p_\nu$,
$$\underline{\Phi}(\mu, \nu) = \ln\left(\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,\Pi(d\omega)\right), \quad (2.60)$$
and this objective is concave and continuous on $M_1 \times M_2$.

The (µ, ν)-components of saddle points of Φ are exactly the maximizers $(\mu_*, \nu_*)$ of the concave function $\underline{\Phi}$ on $M_1 \times M_2$. Given such a maximizer $[\mu_*; \nu_*]$ and setting
$$\phi_*(\omega) = \tfrac{1}{2}\ln\left(p_{\mu_*}(\omega)/p_{\nu_*}(\omega)\right) \quad (2.61)$$
we get a saddle point $(\phi_*, [\mu_*; \nu_*])$ of Φ satisfying (2.57).

(iii) Let $[\mu_*; \nu_*]$ be a maximizer of $\underline{\Phi}$ over $M_1 \times M_2$. Let, further, $\epsilon \in [0, 1/2]$ be such that there exists any (perhaps randomized) test for deciding via observation ω ∈ Ω on two simple hypotheses
$$(A) : \omega \sim p(\cdot) := p_{\mu_*}(\cdot), \quad (B) : \omega \sim q(\cdot) := p_{\nu_*}(\cdot)$$
with total risk $\le 2\epsilon$. Then
$$\varepsilon_\star \le 2\sqrt{\epsilon(1 - \epsilon)}. \quad (2.62)$$
In other words, if the simple hypotheses (A), (B) can be decided, by any test, with total risk $2\epsilon$, the risks of the simple test with detector $\phi_*$ given by (2.61) on the composite hypotheses $H_1$, $H_2$ do not exceed $2\sqrt{\epsilon(1 - \epsilon)}$.

For proof, see Section 2.11.3.
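The identity (2.60) can be made plausible by a short computation based on the Cauchy-Schwarz inequality. What follows is a sketch rather than the proof referred to above; in particular, it assumes without verification that the function $\tfrac{1}{2}\ln(p_\mu/p_\nu)$ belongs to $\mathcal{F}$ and that all integrals involved are finite.

```latex
% For every \phi, split \sqrt{p_\mu p_\nu} into two factors and apply Cauchy-Schwarz:
\int_\Omega \sqrt{p_\mu p_\nu}\,\Pi(d\omega)
  = \int_\Omega \bigl(e^{-\phi/2}\sqrt{p_\mu}\bigr)\bigl(e^{\phi/2}\sqrt{p_\nu}\bigr)\,\Pi(d\omega)
  \le \Bigl(\int_\Omega e^{-\phi} p_\mu\,\Pi(d\omega)\Bigr)^{1/2}
      \Bigl(\int_\Omega e^{\phi} p_\nu\,\Pi(d\omega)\Bigr)^{1/2}.
% Taking logarithms and comparing with (2.56):
\ln\Bigl(\int_\Omega \sqrt{p_\mu p_\nu}\,\Pi(d\omega)\Bigr) \le \Phi(\phi,[\mu;\nu])
  \quad \text{for all } \phi .
% For \phi = \tfrac12 \ln(p_\mu/p_\nu) one has
% e^{-\phi} p_\mu = e^{\phi} p_\nu = \sqrt{p_\mu p_\nu},
% so both integrals on the right equal the Hellinger affinity and the bound is attained:
\inf_{\phi} \Phi(\phi,[\mu;\nu])
  = \ln\Bigl(\int_\Omega \sqrt{p_\mu p_\nu}\,\Pi(d\omega)\Bigr).
```

The same choice of φ makes the two integrals in (2.57) equal, which is consistent with the detector (2.61).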

Remark 2.24. Assume that we are under the premise of Theorem 2.23 and that the simple o.s. in question is nondegenerate (see Section 2.4.2). Then $\varepsilon_\star < 1$ if and only if the sets $M_1$ and $M_2$ do not intersect.

Indeed, by Theorem 2.23.i, $\ln(\varepsilon_\star)$ is the saddle point value of Φ(φ, [µ; ν]) on $\mathcal{F} \times (M_1 \times M_2)$, or, which is the same by Theorem 2.23.ii, the maximum of the function (2.60) on $M_1 \times M_2$; since saddle points exist, this maximum is achieved at some pair $[\mu; \nu] \in M_1 \times M_2$. Since (2.60) is ≤ 0, we conclude that $\varepsilon_\star \le 1$, and clearly the equality takes place if and only if $\int_\Omega \sqrt{p_\mu(\omega)\,p_\nu(\omega)}\,\Pi(d\omega) = 1$ for some $\mu \in M_1$ and $\nu \in M_2$, or, which is the same, $\int_\Omega \left(\sqrt{p_\mu(\omega)} - \sqrt{p_\nu(\omega)}\right)^2 \Pi(d\omega) = 0$ for these µ and ν. Since $p_\mu(\cdot)$ and $p_\nu(\cdot)$ are continuous and the support of Π is the entire Ω, the latter can happen if and only if $p_\mu = p_\nu$ for our µ, ν, or, by nondegeneracy of $\mathcal{O}$, if and only if $M_1 \cap M_2 \ne \emptyset$. ✷

2.4.5 Simple observation schemes: Examples of optimal detectors

Theorem 2.23.i states that when the observation scheme O = (Ω, Π; {pµ : µ ∈ M}; F) is simple and we are interested in deciding on a pair of hypotheses on the distribution of observation ω ∈ Ω,

Hχ : ω ∼ pµ with µ ∈ Mχ, χ = 1, 2,

and the hypotheses are convex, meaning that the underlying parameter sets Mχ are convex and compact, building an optimal, in terms of its risk, detector φ∗ (that is, solving the optimization problem (2.59), which in general is semi-infinite and infinite-dimensional) reduces to solving a finite-dimensional convex problem. Specifically, an optimal solution (φ∗, ε⋆) can be built as follows:

1. We solve the optimization problem

$$\mathrm{Opt} = \max_{\mu\in M_1,\,\nu\in M_2}\left[\Phi(\mu,\nu) := \ln\left(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right)\right] \tag{2.63}$$

of maximizing Hellinger affinity (the quantity under the logarithm) of a pair of distributions obeying H1 and H2, respectively; for a simple o.s., the objective in this problem is concave and continuous, and optimal solutions do exist;

2. (Any) optimal solution [µ∗; ν∗] to (2.63) gives rise to an optimal detector φ∗ and its risk ε⋆, according to

$$\phi_*(\omega) = \frac12\ln\left(\frac{p_{\mu_*}(\omega)}{p_{\nu_*}(\omega)}\right),\qquad \varepsilon_\star = \exp\{\mathrm{Opt}\}.$$

The risks of the simple test Tφ∗ associated with the above detector and deciding on H1, H2 satisfy the bounds

$$\max\left[\mathrm{Risk}_1(\mathcal{T}_{\phi_*}|H_1,H_2),\ \mathrm{Risk}_2(\mathcal{T}_{\phi_*}|H_1,H_2)\right]\le\varepsilon_\star,$$

and the test is near-optimal, meaning that whenever the hypotheses H1, H2 (and in fact even two simple hypotheses stating that ω ∼ pµ∗ and ω ∼ pν∗, respectively) can be decided upon by a test with total risk ≤ 2ǫ ≤ 1, Tφ∗ exhibits a “comparable” risk:

$$\varepsilon_\star\le 2\sqrt{\epsilon(1-\epsilon)}. \tag{2.64}$$

The test Tφ∗ is just the maximum likelihood test induced by the probability densities pµ∗ and pν∗.

Note that once we know that (φ∗, ε⋆) form an optimal solution to (2.59), some kind of near-optimality of the test Tφ∗ is guaranteed already by Proposition 2.18. By this proposition, whenever in nature there exists a simple test T which decides on H1, H2 with risks Risk1, Risk2 bounded by some ǫ ≤ 1/2, the upper bound ε⋆ on the risks of Tφ∗ can be bounded according to (2.64). Our present near-optimality statement is slightly stronger: first, we allow T to have the total risk ≤ 2ǫ, which is weaker than to have both risks ≤ ǫ; second, and more important, now 2ǫ should upper-bound the total risk of T on a pair of simple hypotheses “embedded” into the hypotheses H1, H2; both these modifications extend the family of tests T to which we compare the test Tφ∗, and thus enrich the comparison. Let us look at how the above recipe works for our basic simple o.s.’s.

2.4.5.1 Gaussian o.s.

When O is a Gaussian o.s., that is, {pµ : µ ∈ M} are Gaussian densities with expectations µ ∈ M = $\mathbf{R}^d$ and common positive definite covariance matrix Θ, and F is the family of affine functions on Ω = $\mathbf{R}^d$,
• M1, M2 can be arbitrary nonempty convex compact subsets of $\mathbf{R}^d$,
• problem (2.63) becomes the convex optimization problem

$$\mathrm{Opt} = -\min_{\mu\in M_1,\,\nu\in M_2}\tfrac18(\mu-\nu)^T\Theta^{-1}(\mu-\nu), \tag{2.65}$$

• the optimal detector φ∗ and the upper bound ε⋆ on its risks given by an optimal solution (µ∗, ν∗) to (2.65) are

$$\begin{array}{rcl}
\phi_*(\omega) &=& \tfrac12[\mu_*-\nu_*]^T\Theta^{-1}[\omega-w],\quad w = \tfrac12[\mu_*+\nu_*],\\
\varepsilon_\star &=& \exp\{-\tfrac18[\mu_*-\nu_*]^T\Theta^{-1}[\mu_*-\nu_*]\}.
\end{array} \tag{2.66}$$
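As an illustration of (2.66), here is a minimal Python sketch. It assumes, for simplicity, a diagonal covariance Θ = diag(θ1, ..., θd) and that an optimal pair (µ∗, ν∗) for (2.65) is already available; the function name `gaussian_detector` and the sample data are ours and purely illustrative.

```python
import math

def gaussian_detector(mu_star, nu_star, theta_diag):
    """Detector phi_* and risk bound eps_star from (2.66).

    Assumptions (ours): Theta = diag(theta_diag) is diagonal, and
    (mu_star, nu_star) already solves problem (2.65).
    """
    diff = [m - n for m, n in zip(mu_star, nu_star)]
    mid = [(m + n) / 2.0 for m, n in zip(mu_star, nu_star)]
    # quadratic form [mu* - nu*]^T Theta^{-1} [mu* - nu*]
    quad = sum(d * d / t for d, t in zip(diff, theta_diag))
    eps_star = math.exp(-quad / 8.0)

    def phi(omega):
        # (1/2) [mu* - nu*]^T Theta^{-1} [omega - w], w = midpoint
        return 0.5 * sum(d * (o - w) / t
                         for d, o, w, t in zip(diff, omega, mid, theta_diag))

    return phi, eps_star

# Illustrative data: mu* = (2, 0), nu* = (0, 0), Theta = I_2.
phi, eps_star = gaussian_detector([2.0, 0.0], [0.0, 0.0], [1.0, 1.0])
print(eps_star)         # equals exp(-1/2)
print(phi([1.0, 0.0]))  # the detector vanishes at the midpoint w
```

The detector is affine in ω, and the simple test accepts H1 when φ∗(ω) ≥ 0, that is, when ω lies on the µ∗ side of the hyperplane through the midpoint w.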

Note that when Θ = $I_d$, the test Tφ∗ becomes exactly the optimal test from Example 2.1. The upper bound on the risks of this test established in Example 2.1 (in our present notation, this bound is $\mathrm{Erfc}(\tfrac12\|\mu_*-\nu_*\|_2)$) is slightly better than the bound $\varepsilon_\star = \exp\{-\|\mu_*-\nu_*\|_2^2/8\}$ given by (2.66) when Θ = $I_d$. Note, however, that when speaking about the distance $\delta = \|\mu_*-\nu_*\|_2$ between M1 and M2 allowing for a test with risks ≤ ǫ ≪ 1, the results of Example 2.1 and (2.66) say nearly the same thing: Example 2.1 says that δ should be ≥ 2ErfcInv(ǫ), with ErfcInv defined in (1.26), and (2.66) says that δ should be ≥ $2\sqrt{2\ln(1/\epsilon)}$. When ǫ → +0, the ratio of these two lower bounds on δ tends to 1. It should be noted that our general construction of optimal detectors as applied to Gaussian o.s. and a pair of convex hypotheses results in exactly an optimal test and can be analyzed directly, without any “science” (see Example 2.1).
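The comparison of the two lower bounds on δ can be checked numerically. The sketch below assumes that Erfc(s) is the standard normal tail probability Prob{ξ ≥ s}, ξ ∼ N(0, 1), so that ErfcInv(ǫ) is the upper ǫ-quantile of N(0, 1); the helper names are ours.

```python
import math
from statistics import NormalDist

def delta_exact(eps):
    """2*ErfcInv(eps), assuming Erfc(s) = Prob{N(0,1) >= s}."""
    # By symmetry, the upper eps-quantile equals -inv_cdf(eps).
    return -2.0 * NormalDist().inv_cdf(eps)

def delta_detector(eps):
    """Smallest delta with exp(-delta^2/8) <= eps, i.e. 2*sqrt(2*ln(1/eps))."""
    return 2.0 * math.sqrt(2.0 * math.log(1.0 / eps))

for eps in (1e-2, 1e-6, 1e-12):
    ratio = delta_exact(eps) / delta_detector(eps)
    print(f"eps={eps:g}  exact={delta_exact(eps):.3f}  "
          f"detector={delta_detector(eps):.3f}  ratio={ratio:.3f}")
```

The ratio stays below 1 and grows toward 1 as ǫ decreases, although the convergence is slow (logarithmic in 1/ǫ).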

2.4.5.2 Poisson o.s.

When O is a Poisson o.s., that is, M = $\mathbf{R}^d_{++}$ is the interior of the nonnegative orthant in $\mathbf{R}^d$, and pµ, µ ∈ M, is the density

$$p_\mu(\omega) = \prod_i\left(\frac{\mu_i^{\omega_i}}{\omega_i!}e^{-\mu_i}\right),\quad \omega = (\omega_1,...,\omega_d)\in\mathbf{Z}^d_+,$$

taken w.r.t. the counting measure Π on Ω = $\mathbf{Z}^d_+$, and F is the family of affine functions on Ω, the recipe from the beginning of Section 2.4.5 reads as follows:
• M1, M2 can be arbitrary nonempty convex compact subsets of $\mathbf{R}^d_{++} = \{x\in\mathbf{R}^d: x > 0\}$;
• problem (2.63) becomes the convex optimization problem

$$\mathrm{Opt} = -\min_{\mu\in M_1,\,\nu\in M_2}\frac12\sum_{i=1}^d\left(\sqrt{\mu_i}-\sqrt{\nu_i}\right)^2; \tag{2.67}$$

• the optimal detector φ∗ and the upper bound ε⋆ on its risks given by an optimal solution (µ∗, ν∗) to (2.67) are

$$\phi_*(\omega) = \frac12\sum_{i=1}^d\ln\left(\frac{\mu_i^*}{\nu_i^*}\right)\omega_i + \frac12\sum_{i=1}^d[\nu_i^*-\mu_i^*],\qquad \varepsilon_\star = e^{\mathrm{Opt}}.$$

2.4.5.3 Discrete o.s.

When O is a Discrete o.s., that is, Ω = {1, ..., d}, Π is a counting measure on Ω, $M = \{\mu\in\mathbf{R}^d: \mu > 0, \sum_i\mu_i = 1\}$, and

$$p_\mu(\omega) = \mu_\omega,\quad \omega = 1,...,d,\ \mu\in M,$$

the recipe from the beginning of Section 2.4.5 reads as follows:⁹
• M1, M2 can be arbitrary nonempty convex compact subsets of the relative interior M of the probabilistic simplex,
• problem (2.63) is equivalent to the convex program

$$\varepsilon_\star = \max_{\mu\in M_1,\,\nu\in M_2}\sum_{i=1}^d\sqrt{\mu_i\nu_i}; \tag{2.68}$$

• the optimal detector φ∗ given by an optimal solution (µ∗, ν∗) to (2.68) is

$$\phi_*(\omega) = \tfrac12\ln\left(\frac{\mu^*_\omega}{\nu^*_\omega}\right), \tag{2.69}$$

and the upper bound ε⋆ on the risks of this detector is given by (2.68).

⁹ It should be mentioned that the results of this section as applied to the Discrete observation scheme are a simple particular case (that of finite Ω) of the results of [21, 22, 25] on distinguishing convex sets of probability distributions.
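The formulas (2.68) and (2.69) are easy to verify numerically. The following Python sketch assumes that an optimal pair (µ∗, ν∗) is already at hand (here the sets M1, M2 are taken to be singletons, so the pair is trivially optimal); the function name and the data are illustrative only.

```python
import math

def discrete_detector(mu_star, nu_star):
    """Detector (2.69) and risk bound (2.68) for the Discrete o.s.

    Assumption (ours): (mu_star, nu_star) is already an optimal
    solution to (2.68); both vectors are positive and sum to 1.
    phi[k] is the detector value at the outcome omega = k + 1.
    """
    eps_star = sum(math.sqrt(m * n) for m, n in zip(mu_star, nu_star))
    phi = [0.5 * math.log(m / n) for m, n in zip(mu_star, nu_star)]
    return phi, eps_star

mu, nu = [0.5, 0.5], [0.9, 0.1]
phi, eps_star = discrete_detector(mu, nu)

# Risks of the detector: E_mu[exp(-phi)] and E_nu[exp(phi)].
risk1 = sum(math.exp(-p) * m for p, m in zip(phi, mu))
risk2 = sum(math.exp(p) * n for p, n in zip(phi, nu))
print(eps_star, risk1, risk2)  # the three numbers coincide
```

On the pair (µ∗, ν∗) itself both risks equal the Hellinger affinity, which is why ε⋆ in (2.68) is exactly the risk bound of the detector.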

2.4.5.4 K-th power of a simple o.s.

Recall that the K-th power of a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F) (see Section 2.4.3.4) is the o.s.

$$[\mathcal{O}]^K = (\Omega^K,\Pi^K;\{p_\mu^{(K)}:\mu\in\mathcal{M}\};\mathcal{F}^{(K)}),$$

where $\Omega^K$ is the direct product of K copies of Ω, $\Pi^K$ is the product of K copies of Π, the densities $p_\mu^{(K)}$ are product densities induced by K copies of the density pµ, µ ∈ M,

$$p_\mu^{(K)}(\omega^K = (\omega_1,...,\omega_K)) = \prod_{k=1}^Kp_\mu(\omega_k),$$

and $\mathcal{F}^{(K)}$ is comprised of functions

$$\phi^{(K)}(\omega^K = (\omega_1,...,\omega_K)) = \sum_{k=1}^K\phi(\omega_k)$$

stemming from functions φ ∈ F. Clearly, $[\mathcal{O}]^K$ is the observation scheme describing the stationary K-repeated observations $\omega^K = (\omega_1,...,\omega_K)$ with ωk stemming from the o.s. O; see Section 2.3.2.3. As we remember, $[\mathcal{O}]^K$ is simple provided that O is so. Assuming O simple, it is immediately seen that as applied to the o.s. $[\mathcal{O}]^K$, the recipe from the beginning of Section 2.4.5 reads as follows:
• M1, M2 can be arbitrary nonempty convex compact subsets of M, and the corresponding hypotheses, $H_\chi^K$, χ = 1, 2, state that the components ωk of observation $\omega^K = (\omega_1,...,\omega_K)$ are drawn, independently of each other, from distribution pµ with µ ∈ M1 (hypothesis $H_1^K$) or µ ∈ M2 (hypothesis $H_2^K$);
• problem (2.63) is the convex program

$$\mathrm{Opt}(K) = \max_{\mu\in M_1,\,\nu\in M_2}\ \underbrace{\ln\left(\int_{\Omega^K}\sqrt{p_\mu^{(K)}(\omega^K)p_\nu^{(K)}(\omega^K)}\,\Pi^K(d\omega^K)\right)}_{\equiv K\ln\left(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right)} \tag{$D_K$}$$

implying that any optimal solution to the “single-observation” problem $(D_1)$ associated with M1, M2 is optimal for the “K-observation” problem $(D_K)$ associated with M1, M2, and Opt(K) = K Opt(1);
• the optimal detector $\phi_*^{(K)}$ given by an optimal solution (µ∗, ν∗) to $(D_1)$ (this solution is optimal for $(D_K)$ as well) is

$$\phi_*^{(K)}(\omega^K) = \sum_{k=1}^K\phi_*(\omega_k),\qquad \phi_*(\omega) = \frac12\ln\left(\frac{p_{\mu_*}(\omega)}{p_{\nu_*}(\omega)}\right), \tag{2.70}$$

and the upper bound ε⋆(K) on the risks of the detector $\phi_*^{(K)}$ on the pair of families of distributions obeying hypotheses $H_1^K$ or $H_2^K$ is

$$\varepsilon_\star(K) = e^{\mathrm{Opt}(K)} = e^{K\mathrm{Opt}(1)} = [\varepsilon_\star(1)]^K. \tag{2.71}$$


The results just outlined on powers of simple observation schemes allow us to express near-optimality of detector-based tests in simple o.s.’s in a nicer form.

Proposition 2.25. Let O = (Ω, Π; {pµ : µ ∈ M}; F) be a simple observation scheme, M1, M2 be two nonempty convex compact subsets of M, and (µ∗, ν∗) be an optimal solution to the convex optimization problem (cf. Theorem 2.23)

$$\mathrm{Opt} = \max_{\mu\in M_1,\,\nu\in M_2}\ln\left(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right).$$

Let φ∗ and $\phi_*^{(K)}$ be single- and K-observation detectors induced by (µ∗, ν∗) via (2.70). Let ǫ ∈ (0, 1/2), and assume that for some positive integer K in nature there exists a simple test $\mathcal{T}^K$ deciding via K i.i.d. observations $\omega^K = (\omega_1,...,\omega_K)$ with ωk ∼ pµ, for some unknown µ ∈ M, on the hypotheses

$$H_\chi^{(K)}:\ \mu\in M_\chi,\quad \chi = 1,2,$$

with risks Risk1, Risk2 not exceeding ǫ. Then setting

$$K_+ = \left\lceil\frac{2}{1-\ln(4(1-\epsilon))/\ln(1/\epsilon)}\,K\right\rceil,$$

the simple test $\mathcal{T}_{\phi_*^{(K_+)}}$ utilizing $K_+$ i.i.d. observations decides on $H_1^{(K_+)}$, $H_2^{(K_+)}$ with risks ≤ ǫ. Note that $K_+$ “is of the order of K”: $K_+/K\to 2$ as ǫ → +0.
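The sample size K+ from Proposition 2.25 is straightforward to compute. The sketch below evaluates it (rounding up to an integer) and checks the inequality that drives the proof, namely that the bound [2√(ǫ(1−ǫ))]^{K+/K} on ε⋆(K+) implied by (2.71) does not exceed ǫ; the function name `k_plus` is ours.

```python
import math

def k_plus(K, eps):
    """Number of observations K_+ from Proposition 2.25 (rounded up)."""
    if not 0.0 < eps < 0.5:
        raise ValueError("eps must lie in (0, 1/2)")
    factor = 2.0 / (1.0 - math.log(4.0 * (1.0 - eps)) / math.log(1.0 / eps))
    return math.ceil(factor * K)

K, eps = 10, 0.01
Kp = k_plus(K, eps)
# Bound on eps_star(K_+) obtained from (2.71) and item (iii) of Theorem 2.23.
bound = (2.0 * math.sqrt(eps * (1.0 - eps))) ** (Kp / K)
print(Kp, bound, bound <= eps)
```

For K = 10 and ǫ = 0.01 this gives K+ = 29, so fewer than three times as many observations suffice for the detector-based test to match the risk of the hypothetical test.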

Proof. Applying item (iii) of Theorem 2.23 to the simple o.s. $[\mathcal{O}]^K$, we see that what above was called ε⋆(K) satisfies

$$\varepsilon_\star(K)\le 2\sqrt{\epsilon(1-\epsilon)}.$$

By (2.71), we conclude that $\varepsilon_\star(1)\le\big(2\sqrt{\epsilon(1-\epsilon)}\big)^{1/K}$, whence, by the same (2.71), $\varepsilon_\star(T)\le\big(2\sqrt{\epsilon(1-\epsilon)}\big)^{T/K}$, T = 1, 2, .... Plugging T = $K_+$ into this bound, we get the inequality $\varepsilon_\star(K_+)\le\epsilon$. It remains to recall that $\varepsilon_\star(K_+)$ upper-bounds the risks of the test $\mathcal{T}_{\phi_*^{(K_+)}}$ when deciding on $H_1^{(K_+)}$ vs. $H_2^{(K_+)}$. ✷

2.5 TESTING MULTIPLE HYPOTHESES

So far, we have focused on detector-based tests deciding on pairs of hypotheses, and our “constructive” results were restricted to pairs of convex hypotheses dealing with a simple o.s.

$$\mathcal{O} = (\Omega,\Pi;\{p_\mu:\mu\in\mathcal{M}\};\mathcal{F}), \tag{2.72}$$

convexity of a hypothesis meaning that the family of probability distributions obeying the hypothesis is {pµ : µ ∈ X}, associated with a convex (in fact, convex compact) set X ⊂ M. In this section, we will be interested in pairwise testing unions of convex hypotheses and testing multiple (more than two) hypotheses.

2.5.1 Testing unions

2.5.1.1 Situation and goal

Let Ω be an observation space, and assume we are given two finite collections of families of probability distributions on Ω: families of red distributions $\mathcal{R}_i$, 1 ≤ i ≤ r, and families of blue distributions $\mathcal{B}_j$, 1 ≤ j ≤ b. These families give rise to r red and b blue hypotheses on the distribution P of an observation ω ∈ Ω, specifically,

$$R_i:\ P\in\mathcal{R}_i\ \text{(red hypotheses)}\quad\text{and}\quad B_j:\ P\in\mathcal{B}_j\ \text{(blue hypotheses)}.$$

Assume that for every i ≤ r, j ≤ b we have at our disposal a simple detector-based test Tij capable of deciding on Ri vs. Bj. What we want is to assemble these tests into a test T deciding on the union R of red hypotheses vs. the union B of blue ones:

$$R:\ P\in\mathcal{R} := \bigcup_{i=1}^r\mathcal{R}_i,\qquad B:\ P\in\mathcal{B} := \bigcup_{j=1}^b\mathcal{B}_j.$$

Here P, as always, stands for the probability distribution of observation ω ∈ Ω. Our motivation primarily stems from the case where Ri and Bj are convex hypotheses in a simple o.s. (2.72):

$$\mathcal{R}_i = \{p_\mu:\mu\in M_i\},\qquad \mathcal{B}_j = \{p_\mu:\mu\in N_j\},$$

where Mi and Nj are convex compact subsets of M. In this case we indeed know how to build near-optimal tests deciding on Ri vs. Bj, and the question we have posed becomes, how do we assemble these tests into a test deciding on R vs. B, with

$$R:\ P\in\mathcal{R} = \{p_\mu:\mu\in X\},\ X = \bigcup_iM_i,\qquad B:\ P\in\mathcal{B} = \{p_\mu:\mu\in Y\},\ Y = \bigcup_jN_j?$$

While the structure of R, B is similar to that of Ri , Bj , there is a significant difference: the sets X, Y are, in general, nonconvex, and therefore the techniques we have developed fail to address testing R vs. B directly.

2.5.1.2 The construction

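In the situation just described, let φij be the detectors underlying the tests Tij; w.l.o.g., we can assume these detectors balanced (see Section 2.3.2.2) with some risks ǫij:

$$\left.\begin{array}{l}
\int_\Omega e^{-\phi_{ij}(\omega)}P(d\omega)\le\epsilon_{ij}\ \ \forall P\in\mathcal{R}_i\\
\int_\Omega e^{\phi_{ij}(\omega)}P(d\omega)\le\epsilon_{ij}\ \ \forall P\in\mathcal{B}_j
\end{array}\right\},\quad 1\le i\le r,\ 1\le j\le b. \tag{2.73}$$

Let us assemble the detectors φij into a detector for R, B as follows:

$$\phi(\omega) = \max_{1\le i\le r}\min_{1\le j\le b}[\phi_{ij}(\omega)-\alpha_{ij}], \tag{2.74}$$

where the shifts αij are parameters of the construction.

Proposition 2.26. The risks of φ on R, B can be bounded as

$$\begin{array}{ll}
\forall P\in\mathcal{R}: & \int_\Omega e^{-\phi(\omega)}P(d\omega)\le\max_{i\le r}\left[\sum_{j=1}^b\epsilon_{ij}e^{\alpha_{ij}}\right],\\
\forall P\in\mathcal{B}: & \int_\Omega e^{\phi(\omega)}P(d\omega)\le\max_{j\le b}\left[\sum_{i=1}^r\epsilon_{ij}e^{-\alpha_{ij}}\right].
\end{array} \tag{2.75}$$

Therefore, the risks of φ on R, B are upper-bounded by the quantity

$$\varepsilon_\star = \max\left[\max_{i\le r}\left[\sum_{j=1}^b\epsilon_{ij}e^{\alpha_{ij}}\right],\ \max_{j\le b}\left[\sum_{i=1}^r\epsilon_{ij}e^{-\alpha_{ij}}\right]\right], \tag{2.76}$$

whence the risks of the simple test Tφ, based on the detector φ, deciding on R, B are upper-bounded by ε⋆.

Proof. Let P ∈ R, so that $P\in\mathcal{R}_{i_*}$ for some i∗ ≤ r. Then

$$\begin{array}{rcl}
\int_\Omega e^{-\phi(\omega)}P(d\omega) &=& \int_\Omega e^{\min_{i\le r}\max_{j\le b}[-\phi_{ij}(\omega)+\alpha_{ij}]}P(d\omega)\\
&\le& \int_\Omega e^{\max_{j\le b}[-\phi_{i_*j}(\omega)+\alpha_{i_*j}]}P(d\omega)\ \le\ \sum_{j=1}^b\int_\Omega e^{-\phi_{i_*j}(\omega)+\alpha_{i_*j}}P(d\omega)\\
&=& \sum_{j=1}^be^{\alpha_{i_*j}}\int_\Omega e^{-\phi_{i_*j}(\omega)}P(d\omega)\\
&\le& \sum_{j=1}^b\epsilon_{i_*j}e^{\alpha_{i_*j}}\quad[\text{by (2.73) due to }P\in\mathcal{R}_{i_*}]\\
&\le& \max_{i\le r}\left[\sum_{j=1}^b\epsilon_{ij}e^{\alpha_{ij}}\right].
\end{array}$$

Now let P ∈ B, so that $P\in\mathcal{B}_{j_*}$ for some j∗. We have

$$\begin{array}{rcl}
\int_\Omega e^{\phi(\omega)}P(d\omega) &=& \int_\Omega e^{\max_{i\le r}\min_{j\le b}[\phi_{ij}(\omega)-\alpha_{ij}]}P(d\omega)\\
&\le& \int_\Omega e^{\max_{i\le r}[\phi_{ij_*}(\omega)-\alpha_{ij_*}]}P(d\omega)\ \le\ \sum_{i=1}^r\int_\Omega e^{\phi_{ij_*}(\omega)-\alpha_{ij_*}}P(d\omega)\\
&=& \sum_{i=1}^re^{-\alpha_{ij_*}}\int_\Omega e^{\phi_{ij_*}(\omega)}P(d\omega)\\
&\le& \sum_{i=1}^r\epsilon_{ij_*}e^{-\alpha_{ij_*}}\quad[\text{by (2.73) due to }P\in\mathcal{B}_{j_*}]\\
&\le& \max_{j\le b}\left[\sum_{i=1}^r\epsilon_{ij}e^{-\alpha_{ij}}\right].
\end{array}$$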

(2.75) is proved. The remaining claims of the proposition are readily given by (2.75) combined with Proposition 2.14. ✷

Optimal choice of shift parameters. The detector and the test considered in Proposition 2.26, like the resulting risk bound ε⋆, depend on the shifts αij. Let us optimize the risk bound w.r.t. these shifts. To this end, consider the r × b matrix

$$E = [\epsilon_{ij}]_{i\le r,\ j\le b}$$

and the symmetric (r + b) × (r + b) matrix

$$\mathcal{E} = \begin{bmatrix} & E\\ E^T & \end{bmatrix}.$$

As is well known, the eigenvalues of the symmetric matrix $\mathcal{E}$ are comprised of the pairs (σs, −σs), where σs are the singular values of E, and several zeros; in particular, the leading eigenvalue of $\mathcal{E}$ is the spectral norm $\|E\|_{2,2}$ (the largest singular value) of matrix E. Further, E is a matrix with positive entries, so that $\mathcal{E}$ is a symmetric entrywise nonnegative matrix. By the Perron-Frobenius Theorem, the leading eigenvector of this matrix can be selected to be nonnegative. Denoting this nonnegative eigenvector [g; h] with r-dimensional g and b-dimensional h, and setting $\rho = \|E\|_{2,2}$, we have

$$\rho g = Eh,\qquad \rho h = E^Tg. \tag{2.77}$$

Observe that ρ > 0 (evident), whence both g and h are nonzero (since otherwise (2.77) would imply g = h = 0, which is impossible, because the eigenvector [g; h] is nonzero). Since h and g are nonzero nonnegative vectors, ρ > 0 and E is entrywise positive, (2.77) says that g and h are strictly positive vectors. The latter allows us to define shifts αij according to

$$\alpha_{ij} = \ln(h_j/g_i). \tag{2.78}$$

With these shifts, we get

$$\max_{i\le r}\left[\sum_{j=1}^b\epsilon_{ij}e^{\alpha_{ij}}\right] = \max_{i\le r}\sum_{j=1}^b\epsilon_{ij}h_j/g_i = \max_{i\le r}(Eh)_i/g_i = \max_{i\le r}\rho = \rho$$

(we have used the first relation in (2.77)), and

$$\max_{j\le b}\left[\sum_{i=1}^r\epsilon_{ij}e^{-\alpha_{ij}}\right] = \max_{j\le b}\sum_{i=1}^r\epsilon_{ij}g_i/h_j = \max_{j\le b}[E^Tg]_j/h_j = \max_{j\le b}\rho = \rho$$

(we have used the second relation in (2.77)). The bottom line is as follows:

Proposition 2.27. In the situation and the notation of Section 2.5.1.1, the risks of the detector (2.74) with shifts (2.77), (2.78) on the families R, B do not exceed the quantity $\|E := [\epsilon_{ij}]_{i\le r,j\le b}\|_{2,2}$. As a result, the risks of the simple test Tφ deciding on the hypotheses R, B do not exceed $\|E\|_{2,2}$ as well.

In fact, the shifts in the above proposition are the best possible; this is an immediate consequence of the following simple fact:

Proposition 2.28. Let E = [eij] be a nonzero entrywise nonnegative n × n symmetric matrix. Then the optimal value in the optimization problem

$$\mathrm{Opt} = \min_{\alpha_{ij}}\left\{\max_{i\le n}\sum_{j=1}^ne_{ij}e^{\alpha_{ij}}:\ \alpha_{ij} = -\alpha_{ji}\right\} \tag{$*$}$$

is equal to $\|E\|_{2,2}$. When the Perron-Frobenius eigenvector f of E can be selected positive, the problem is solvable, and an optimal solution is given by

$$\alpha_{ij} = \ln(f_j/f_i),\quad 1\le i,j\le n. \tag{2.79}$$
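The optimal shifts (2.77), (2.78) for a rectangular matrix E of pairwise risks can be sketched in a few lines of Python. The code below is an illustration only: it assumes E is entrywise positive, approximates the leading singular pair by plain power iteration (our choice of numerical method, not part of the text), and uses made-up risk values.

```python
import math

def optimal_shifts(E, iters=500):
    """Approximate rho = ||E||_{2,2}, vectors g, h with rho*g = E h,
    rho*h = E^T g as in (2.77), and shifts alpha_ij = ln(h_j/g_i) as in (2.78).
    E is an r x b list of lists with positive entries (assumption)."""
    r, b = len(E), len(E[0])
    h = [1.0] * b
    for _ in range(iters):  # power iteration on E^T E
        g = [sum(E[i][j] * h[j] for j in range(b)) for i in range(r)]
        h = [sum(E[i][j] * g[i] for i in range(r)) for j in range(b)]
        norm = math.sqrt(sum(x * x for x in h))
        h = [x / norm for x in h]
    Eh = [sum(E[i][j] * h[j] for j in range(b)) for i in range(r)]
    rho = math.sqrt(sum(x * x for x in Eh))  # ||E h||_2 with ||h||_2 = 1
    g = [x / rho for x in Eh]
    alpha = [[math.log(h[j] / g[i]) for j in range(b)] for i in range(r)]
    return rho, alpha

def risk_bound(E, alpha):
    """The quantity (2.76) for given shifts."""
    r, b = len(E), len(E[0])
    red = max(sum(E[i][j] * math.exp(alpha[i][j]) for j in range(b)) for i in range(r))
    blue = max(sum(E[i][j] * math.exp(-alpha[i][j]) for i in range(r)) for j in range(b))
    return max(red, blue)

E = [[0.1, 0.2], [0.3, 0.1]]          # hypothetical pairwise risks
rho, alpha = optimal_shifts(E)
zero = [[0.0, 0.0], [0.0, 0.0]]
print(rho, risk_bound(E, alpha), risk_bound(E, zero))
```

With the optimal shifts the bound (2.76) equals the spectral norm of E, while zero shifts give the larger of the maximal row and column sums, which can only be worse.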

Proof. Let us prove that Opt ≤ ρ := $\|E\|_{2,2}$. Given ǫ > 0, we clearly can find an entrywise nonnegative symmetric matrix E′ with entries e′ij in between eij and eij + ǫ such that the Perron-Frobenius eigenvector f of E′ can be selected positive (it suffices, e.g., to set e′ij = eij + ǫ). Selecting αij according to (2.79), we get a feasible solution to (∗) such that

$$\forall i:\ \sum_je_{ij}e^{\alpha_{ij}}\le\sum_je'_{ij}f_j/f_i = \|E'\|_{2,2},$$

implying that Opt ≤ $\|E'\|_{2,2}$. Passing to limit as ǫ → +0, we get Opt ≤ $\|E\|_{2,2}$. As a byproduct of our reasoning, if E admits a positive Perron-Frobenius eigenvector f, then (2.79) yields a feasible solution to (∗) with the value of the objective equal to $\|E\|_{2,2}$.


It remains to prove that Opt ≥ $\|E\|_{2,2}$. Assume that this is not the case, so that (∗) admits a feasible solution $\hat\alpha_{ij}$ such that

$$\hat\rho := \max_i\sum_je_{ij}e^{\hat\alpha_{ij}} < \rho := \|E\|_{2,2}.$$

By an arbitrarily small perturbation of E, we can make this matrix symmetric and entrywise positive, and still satisfying the above strict inequality; to save notation, assume that already the original E is entrywise positive. Let f be a positive Perron-Frobenius eigenvector of E, and let, as above, αij = ln(fj/fi), so that

$$\sum_je_{ij}e^{\alpha_{ij}} = \sum_je_{ij}f_j/f_i = \rho\quad\forall i.$$

Setting $\delta_{ij} = \hat\alpha_{ij}-\alpha_{ij}$, we conclude that the convex functions

$$\theta_i(t) = \sum_je_{ij}e^{\alpha_{ij}+t\delta_{ij}}$$

all are equal to ρ at t = 0, and all are ≤ $\hat\rho$ < ρ at t = 1, implying that θi(1) < θi(0) for every i. The latter, in view of convexity of θi(·), implies that

$$\theta_i'(0) = \sum_je_{ij}e^{\alpha_{ij}}\delta_{ij} = \sum_je_{ij}(f_j/f_i)\delta_{ij} < 0\quad\forall i.$$

Multiplying the resulting inequalities by $f_i^2$ and summing up over i, we get

$$\sum_{i,j}e_{ij}f_if_j\delta_{ij} < 0,$$

which is impossible: we have eij = eji and δij = −δji, implying that the left-hand side in the latter inequality is 0. ✷

2.5.2 Testing multiple hypotheses “up to closeness”

So far, we have considered detector-based simple tests deciding on pairs of hypotheses, specifically, convex hypotheses in simple o.s.’s (Section 2.4.4) and unions of convex hypotheses (Section 2.5.1).¹⁰ Now we intend to consider testing of multiple (perhaps more than 2) hypotheses “up to closeness”; the latter notion was introduced in Section 2.2.4.2.

¹⁰ Strictly speaking, in Section 2.5.1 it was not explicitly stated that the unions under consideration involve convex hypotheses in simple o.s.’s; our emphasis was on how to decide on a pair of union-type hypotheses given pairwise detectors for “red” and “blue” components of the unions from the pair. However, for now, the only situation where we indeed have at our disposal good pairwise detectors for red and blue components is that in which these components are convex hypotheses in a good o.s.

2.5.2.1 Situation and goal

Let Ω be an observation space, and let a collection P1, ..., PL of families of probability distributions on Ω be given. As always, families Pℓ give rise to hypotheses

Hℓ : P ∈ Pℓ

on the distribution P of observation ω ∈ Ω. Assume also that we are given a closeness relation C on {1, ..., L}. Recall that, formally, a closeness relation is some set of pairs of indices (ℓ, ℓ′) with ℓ, ℓ′ ∈ {1, ..., L}; we interpret the inclusion (ℓ, ℓ′) ∈ C as the fact that hypothesis Hℓ “is close” to hypothesis Hℓ′. When (ℓ, ℓ′) ∈ C, we say that ℓ′ is close (or C-close) to ℓ. We always assume that
• C contains the diagonal: (ℓ, ℓ) ∈ C for every ℓ ≤ L (“each hypothesis is close to itself”), and
• C is symmetric: whenever (ℓ, ℓ′) ∈ C, we have also (ℓ′, ℓ) ∈ C (“if the ℓ-th hypothesis is close to the ℓ′-th one, then the ℓ′-th hypothesis is close to the ℓ-th one”).

Recall that a test T deciding on the hypotheses H1, ..., HL via observation ω ∈ Ω is a procedure which, given on input ω ∈ Ω, builds some set T(ω) ⊂ {1, ..., L}, accepts all hypotheses Hℓ with ℓ ∈ T(ω), and rejects all other hypotheses.

Risks of an “up to closeness” test. The notion of C-risk of a test was introduced in Section 2.2.4.2; we reproduce it here for the reader’s convenience. Given closeness C and a test T, we define the C-risk RiskC(T|H1, ..., HL) of T as the smallest ǫ ≥ 0 such that

Whenever an observation ω is drawn from a distribution $P\in\bigcup_\ell\mathcal{P}_\ell$, and ℓ∗ is such that P ∈ Pℓ∗ (i.e., hypothesis Hℓ∗ is true), the P-probability of the event “ℓ∗ ∉ T(ω) (the true hypothesis Hℓ∗ is not accepted), or there exists ℓ′ not close to ℓ∗ such that Hℓ′ is accepted” is at most ǫ.

Equivalently: RiskC(T|H1, ..., HL) ≤ ǫ if and only if the following takes place:

Whenever an observation ω is drawn from a distribution $P\in\bigcup_\ell\mathcal{P}_\ell$, and ℓ∗ is such that P ∈ Pℓ∗ (i.e., hypothesis Hℓ∗ is true), the P-probability of the event “ℓ∗ ∈ T(ω) (the true hypothesis Hℓ∗ is accepted), and ℓ′ ∈ T(ω) implies that (ℓ∗, ℓ′) ∈ C (all accepted hypotheses are C-close to the true hypothesis Hℓ∗)” is at least 1 − ǫ.

For example, consider nine polygons presented in Figure 2.4 and associate with them nine hypotheses on a 2D “signal plus noise” observation ω = x + ξ, ξ ∼ N(0, I2), with the ℓ-th hypothesis stating that x belongs to the ℓ-th polygon. We define closeness C on the collection of hypotheses presented in Figure 2.4 as “two hypotheses are close if and only if the corresponding polygons intersect,” like A and B, or A and E. Now the fact that a test T has C-risk ≤ 0.01 would imply, in particular, that if the probability distribution P underlying the observed ω obeys hypothesis A (i.e., the mean of P belongs to the polygon A), then with P-probability at least 0.99 the list of accepted hypotheses includes hypothesis A, and the only other hypotheses in this list are among hypotheses B, D, and E.

Figure 2.4: Nine hypotheses on the location of the mean µ of observation ω ∼ N(µ, I2), each stating that µ belongs to a specific polygon.

2.5.2.2 “Building blocks” and construction

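The construction we are about to present is, essentially, that used in Section 2.2.4.3 as applied to detector-generated tests. This being said, the presentation to follow is self-contained. The building blocks of our construction are pairwise detectors φℓℓ′(ω), 1 ≤ ℓ < ℓ′ ≤ L, for pairs Pℓ, Pℓ′ along with (upper bounds on) the risks ǫℓℓ′ of these detectors:

$$\left.\begin{array}{l}
\forall(P\in\mathcal{P}_\ell):\ \int_\Omega e^{-\phi_{\ell\ell'}(\omega)}P(d\omega)\le\epsilon_{\ell\ell'}\\
\forall(P\in\mathcal{P}_{\ell'}):\ \int_\Omega e^{\phi_{\ell\ell'}(\omega)}P(d\omega)\le\epsilon_{\ell\ell'}
\end{array}\right\},\quad 1\le\ell<\ell'\le L.$$

Setting

$$\begin{array}{ll}
\phi_{\ell'\ell}(\omega) = -\phi_{\ell\ell'}(\omega),\ \epsilon_{\ell'\ell} = \epsilon_{\ell\ell'}, & 1\le\ell<\ell'\le L,\\
\phi_{\ell\ell}(\omega)\equiv 0,\ \epsilon_{\ell\ell} = 1, & 1\le\ell\le L,
\end{array}$$

we get what we refer to as a balanced system of detectors φℓℓ′ and risks ǫℓℓ′, 1 ≤ ℓ, ℓ′ ≤ L, for the collection P1, ..., PL, meaning that

$$\begin{array}{ll}
\phi_{\ell\ell'}(\omega)+\phi_{\ell'\ell}(\omega)\equiv 0,\ \epsilon_{\ell\ell'} = \epsilon_{\ell'\ell}, & 1\le\ell,\ell'\le L,\\
\forall P\in\mathcal{P}_\ell:\ \int_\Omega e^{-\phi_{\ell\ell'}(\omega)}P(d\omega)\le\epsilon_{\ell\ell'}, & 1\le\ell,\ell'\le L.
\end{array} \tag{2.80}$$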

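Given closeness C, we associate with it the symmetric L × L matrix $\mathbf{C}$ given by

$$\mathbf{C}_{\ell\ell'} = \begin{cases}0, & (\ell,\ell')\in\mathcal{C},\\ 1, & (\ell,\ell')\notin\mathcal{C}.\end{cases} \tag{2.81}$$

Test TC. Let a collection of shifts αℓℓ′ ∈ R satisfying the relation

$$\alpha_{\ell\ell'} = -\alpha_{\ell'\ell},\quad 1\le\ell,\ell'\le L \tag{2.82}$$

be given. The detectors φℓℓ′ and the shifts αℓℓ′ specify a test TC deciding on hypotheses H1, ..., HL. Precisely, given an observation ω, the test TC accepts exactly those hypotheses Hℓ for which φℓℓ′(ω) − αℓℓ′ > 0 whenever ℓ′ is not C-close to ℓ:

$$\mathcal{T}_{\mathcal{C}}(\omega) = \{\ell:\ \phi_{\ell\ell'}(\omega)-\alpha_{\ell\ell'} > 0\ \ \forall(\ell': (\ell,\ell')\notin\mathcal{C})\}.$$

Proposition 2.29. (i) The C-risk of the test TC just defined is upper-bounded by the quantity

$$\varepsilon[\alpha] = \max_{\ell\le L}\sum_{\ell'=1}^L\epsilon_{\ell\ell'}\mathbf{C}_{\ell\ell'}e^{\alpha_{\ell\ell'}}$$

with $\mathbf{C}$ given by (2.81).
(ii) The infimum, over shifts α satisfying (2.82), of the risk bound ε[α] is the quantity $\varepsilon_\star = \|E\|_{2,2}$, where the L × L symmetric entrywise nonnegative matrix E is given by

$$E = \left[e_{\ell\ell'} := \epsilon_{\ell\ell'}\mathbf{C}_{\ell\ell'}\right]_{\ell,\ell'\le L}.$$

Assuming E admits a strictly positive Perron-Frobenius vector f, an optimal choice of the shifts is

$$\alpha_{\ell\ell'} = \ln(f_{\ell'}/f_\ell),\quad 1\le\ell,\ell'\le L,$$

resulting in $\varepsilon[\alpha] = \varepsilon_\star = \|E\|_{2,2}$.

Proof. (i): Setting

$$\bar\phi_{\ell\ell'}(\omega) = \phi_{\ell\ell'}(\omega)-\alpha_{\ell\ell'},\qquad \bar\epsilon_{\ell\ell'} = \epsilon_{\ell\ell'}e^{\alpha_{\ell\ell'}},$$

(2.80) and (2.82) imply that

$$\begin{array}{lll}
(a) & \bar\phi_{\ell\ell'}(\omega)+\bar\phi_{\ell'\ell}(\omega)\equiv 0, & 1\le\ell,\ell'\le L,\\
(b) & \forall P\in\mathcal{P}_\ell:\ \int_\Omega e^{-\bar\phi_{\ell\ell'}(\omega)}P(d\omega)\le\bar\epsilon_{\ell\ell'}, & 1\le\ell,\ell'\le L.
\end{array} \tag{2.83}$$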

Now let ℓ∗ be such that the distribution P of observation ω belongs to Pℓ∗. Then for every ℓ′ the P-probability of the event $\bar\phi_{\ell_*\ell'}(\omega)\le 0$ is ≤ $\bar\epsilon_{\ell_*\ell'}$ by (2.83.b), whence the P-probability of the event

$$E_* = \left\{\omega:\ \exists\ell': (\ell_*,\ell')\notin\mathcal{C}\ \&\ \bar\phi_{\ell_*\ell'}(\omega)\le 0\right\}$$

is upper-bounded by

$$\sum_{\ell':(\ell_*,\ell')\notin\mathcal{C}}\bar\epsilon_{\ell_*\ell'} = \sum_{\ell'=1}^L\mathbf{C}_{\ell_*\ell'}\epsilon_{\ell_*\ell'}e^{\alpha_{\ell_*\ell'}}\le\varepsilon[\alpha].$$

Assume that E∗ does not take place (as we have seen, this indeed is so with P-probability ≥ 1 − ε[α]). Then $\bar\phi_{\ell_*\ell'}(\omega) > 0$ for all ℓ′ such that (ℓ∗, ℓ′) ∉ C, implying, first, that Hℓ∗ is accepted by our test. Second, $\bar\phi_{\ell'\ell_*}(\omega) = -\bar\phi_{\ell_*\ell'}(\omega) < 0$ whenever (ℓ∗, ℓ′) ∉ C, or, due to the symmetry of closeness, whenever (ℓ′, ℓ∗) ∉ C, implying that the test TC rejects the hypothesis Hℓ′ when ℓ′ is not C-close to ℓ∗. Thus, the P-probability of the event “Hℓ∗ is accepted, and all accepted hypotheses are C-close to Hℓ∗” is at least 1 − ε[α]. We conclude that the C-risk RiskC(TC|H1, ..., HL) of the test TC is at most ε[α]. (i) is proved. (ii) is readily given by Proposition 2.28. ✷
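The acceptance rule of the test TC and the risk bound ε[α] from Proposition 2.29 translate directly into code. The Python sketch below is illustrative: the detector values, risks, and closeness relation are made-up inputs, indices run from 0 instead of 1, and the function names are ours.

```python
import math

def accepted(phi_vals, alpha, close):
    """Indices accepted by the test T_C.

    phi_vals[l][m] is the value phi_{l m}(omega) of a balanced system of
    detectors at the observation, alpha[l][m] are shifts with
    alpha[l][m] == -alpha[m][l], and close[l][m] is True iff (l, m) is in C.
    """
    L = len(phi_vals)
    return [l for l in range(L)
            if all(phi_vals[l][m] - alpha[l][m] > 0
                   for m in range(L) if not close[l][m])]

def risk_bound(eps, alpha, close):
    """The bound eps[alpha] of Proposition 2.29(i)."""
    L = len(eps)
    return max(sum(eps[l][m] * math.exp(alpha[l][m])
                   for m in range(L) if not close[l][m])
               for l in range(L))

# Hypothetical example with L = 3: hypotheses 0 and 2 are not close.
close = [[True, True, False],
         [True, True, True],
         [False, True, True]]
phi_vals = [[0.0, 0.4, 1.5],
            [-0.4, 0.0, 0.7],
            [-1.5, -0.7, 0.0]]
eps = [[1.0, 0.9, 0.05],
       [0.9, 1.0, 0.8],
       [0.05, 0.8, 1.0]]
alpha = [[0.0] * 3 for _ in range(3)]   # trivial shifts

print(accepted(phi_vals, alpha, close))  # hypotheses 0 and 1 are accepted
print(risk_bound(eps, alpha, close))     # 0.05
```

Note that a hypothesis all of whose competitors are close to it (index 1 here) is accepted on every observation, since the acceptance condition is then vacuous.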

2.5.2.3 Testing multiple hypotheses via repeated observations

In the situation of Section 2.5.2.1, given a balanced system of detectors φℓℓ′ and risks ǫℓℓ′, 1 ≤ ℓ, ℓ′ ≤ L, for the collection P1, ..., PL (see (2.80)) and a positive integer K, we can
• pass from detectors φℓℓ′ and risks ǫℓℓ′ to the entities

$$\phi^{(K)}_{\ell\ell'}(\omega^K = (\omega_1,...,\omega_K)) = \sum_{k=1}^K\phi_{\ell\ell'}(\omega_k),\qquad \epsilon^{(K)}_{\ell\ell'} = \epsilon^K_{\ell\ell'},\quad 1\le\ell,\ell'\le L;$$

• associate with the families Pℓ families $\mathcal{P}^{(K)}_\ell$ of probability distributions underlying quasi-stationary K-repeated versions of observations ω ∼ P ∈ Pℓ (see Section 2.3.2.3), and thus arrive at hypotheses $H^K_\ell = H^{\otimes,K}_\ell$ stating that the distribution $P^K$ of K-repeated observation $\omega^K = (\omega_1,...,\omega_K)$, ωk ∈ Ω, belongs to the family $\mathcal{P}^{\otimes,K}_\ell = \bigotimes_{k=1}^K\mathcal{P}_\ell$ associated with Pℓ; see Section 2.1.3.3.

By Proposition 2.16 and (2.80), we arrive at the following analog of (2.80):

$$\begin{array}{ll}
\phi^{(K)}_{\ell\ell'}(\omega^K)+\phi^{(K)}_{\ell'\ell}(\omega^K)\equiv 0,\ \epsilon^{(K)}_{\ell\ell'} = \epsilon^{(K)}_{\ell'\ell} = \epsilon^K_{\ell\ell'}, & 1\le\ell,\ell'\le L,\\
\forall P^K\in\mathcal{P}^{(K)}_\ell:\ \int_{\Omega^K}e^{-\phi^{(K)}_{\ell\ell'}(\omega^K)}P^K(d\omega^K)\le\epsilon^{(K)}_{\ell\ell'}, & 1\le\ell,\ell'\le L.
\end{array}$$

Given shifts αℓℓ′ satisfying (2.82) and applying the construction from Section 2.5.2.2 to these shifts and our newly constructed detectors and risks, we arrive at the test $\mathcal{T}^K_{\mathcal{C}}$ deciding on hypotheses $H^K_1,...,H^K_L$ via K-repeated observation $\omega^K$. Specifically, given an observation $\omega^K$, the test $\mathcal{T}^K_{\mathcal{C}}$ accepts exactly those hypotheses $H^K_\ell$ for which $\phi^{(K)}_{\ell\ell'}(\omega^K)-\alpha_{\ell\ell'} > 0$ whenever ℓ′ is not C-close to ℓ:

$$\mathcal{T}^K_{\mathcal{C}}(\omega^K) = \{\ell:\ \phi^{(K)}_{\ell\ell'}(\omega^K)-\alpha_{\ell\ell'} > 0\ \ \forall(\ell': (\ell,\ell')\notin\mathcal{C})\}.$$

Invoking Proposition 2.29, we arrive at

Proposition 2.30. (i) The C-risk of the test $\mathcal{T}^K_{\mathcal{C}}$ just defined is upper-bounded by the quantity

$$\varepsilon[\alpha,K] = \max_{\ell\le L}\sum_{\ell'=1}^L\epsilon^K_{\ell\ell'}\mathbf{C}_{\ell\ell'}e^{\alpha_{\ell\ell'}}.$$

(ii) The infimum, over shifts α satisfying (2.82), of the risk bound ε[α, K] is the quantity $\varepsilon_\star(K) = \|E^{(K)}\|_{2,2}$, where the L × L symmetric entrywise nonnegative matrix $E^{(K)}$ is given by

$$E^{(K)} = \left[e^{(K)}_{\ell\ell'} := \epsilon^K_{\ell\ell'}\mathbf{C}_{\ell\ell'}\right]_{\ell,\ell'\le L}.$$

Assuming $E^{(K)}$ admits a strictly positive Perron-Frobenius vector f, an optimal choice of the shifts is

$$\alpha_{\ell\ell'} = \ln(f_{\ell'}/f_\ell),\quad 1\le\ell,\ell'\le L,$$

resulting in $\varepsilon[\alpha,K] = \varepsilon_\star(K) = \|E^{(K)}\|_{2,2}$.

2.5.2.4 Consistency and near-optimality

Observe that when closeness C is such that ǫℓℓ′ < 1 whenever ℓ, ℓ′ are not C-close to each other, the entries in the matrix $E^{(K)}$ go to 0 as K → ∞ exponentially fast, whence the C-risk of test $\mathcal{T}^K_{\mathcal{C}}$ also goes to 0 as K → ∞, meaning that test $\mathcal{T}^K_{\mathcal{C}}$ is consistent. When, in addition, Pℓ correspond to convex hypotheses in a simple o.s., the test $\mathcal{T}^K_{\mathcal{C}}$ possesses the property of near-optimality similar to that stated in Proposition 2.25:

Proposition 2.31. Consider the special case of the situation from Section 2.5.2.1 where, given a simple o.s. O = (Ω, Π; {pµ : µ ∈ M}; F), the families Pℓ of probability distributions are of the form Pℓ = {pµ : µ ∈ Nℓ}, where Nℓ, 1 ≤ ℓ ≤ L, are nonempty convex compact subsets of M. Let also the pairwise detectors φℓℓ′ and their risks ǫℓℓ′ underlying the construction from Section 2.5.2.2 be obtained by applying Theorem 2.23 to the pairs Nℓ, Nℓ′, so that for 1 ≤ ℓ < ℓ′ ≤ L one has

$$\phi_{\ell\ell'}(\omega) = \tfrac12\ln\big(p_{\mu_{\ell\ell'}}(\omega)/p_{\nu_{\ell\ell'}}(\omega)\big),\qquad \epsilon_{\ell\ell'} = \exp\{\mathrm{Opt}_{\ell\ell'}\},$$

where

$$\mathrm{Opt}_{\ell\ell'} = \max_{\mu\in N_\ell,\,\nu\in N_{\ell'}}\ln\left(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\right),$$

and (µℓℓ′, νℓℓ′) form an optimal solution to the optimization problem on the right-hand side. Assume that for some positive integer K∗ in nature there exists a test $\mathcal{T}^{K_*}$ which decides with C-risk ǫ ∈ (0, 1/2), via stationary K∗-repeated observation $\omega^{K_*}$, on the hypotheses $H^{(K_*)}_\ell$, stating that the components in $\omega^{K_*}$ are drawn, independently of each other, from a distribution P ∈ Pℓ, ℓ = 1, ..., L, and let

$$K = \left\lceil 2\,\frac{1+\ln(L-1)/\ln(1/\epsilon)}{1-\ln(4(1-\epsilon))/\ln(1/\epsilon)}\,K_*\right\rceil. \tag{2.84}$$

Then the test $\mathcal{T}^K_{\mathcal{C}}$ yielded by the construction from Section 2.5.2.2 as applied to the above φℓℓ′, ǫℓℓ′, and trivial shifts αℓℓ′ ≡ 0, decides on the hypotheses $H^K_\ell$ (see Section 2.5.2.3) via quasi-stationary K-repeated observations $\omega^K$, with C-risk ≤ ǫ. Note that K/K∗ → 2 as ǫ → +0.

Proof. Let

$$\bar\epsilon = \max_{\ell,\ell'}\left\{\epsilon_{\ell\ell'}:\ \ell<\ell',\ \text{and }\ell,\ell'\text{ are not }\mathcal{C}\text{-close to each other}\right\}.$$

Denoting by (ℓ∗, ℓ′∗) the corresponding maximizer, note that $\mathcal{T}^{K_*}$ induces a simple test T able to decide via stationary K∗-repeated observations $\omega^{K_*}$ on the pair of hypotheses $H^{(K_*)}_{\ell_*}$, $H^{(K_*)}_{\ell'_*}$ with risks ≤ ǫ (it suffices to make T accept the first of the hypotheses in the pair and reject the second one whenever $\mathcal{T}^{K_*}$ on the same observation accepts $H^{(K_*)}_{\ell_*}$; otherwise T rejects the first hypothesis in the pair and accepts the second one). This observation, by the same argument as in the proof of Proposition 2.25, implies that $\bar\epsilon^{K_*}\le 2\sqrt{\epsilon(1-\epsilon)} < 1$, whence all entries in the matrix $E^{(K)}$ do not exceed $\bar\epsilon^K = [\bar\epsilon^{K_*}]^{K/K_*}\le[2\sqrt{\epsilon(1-\epsilon)}]^{K/K_*}$, implying by Proposition 2.29 that the C-risk of the test $\mathcal{T}^K_{\mathcal{C}}$ does not exceed

$$\epsilon(K) := (L-1)\left[2\sqrt{\epsilon(1-\epsilon)}\right]^{K/K_*}.$$

It remains to note that for K given by (2.84) one has ǫ(K) ≤ ǫ. ✷

Remark 2.32. Note that tests $\mathcal{T}_{\mathcal{C}}$ and $\mathcal{T}^K_{\mathcal{C}}$ we have built may, depending on observations, accept no hypotheses at all, which sometimes is undesirable. Clearly, every test deciding on multiple hypotheses up to C-closeness always can be modified to ensure that a hypothesis always is accepted. To this end, it suffices, for instance, that the modified test accepts exactly those hypotheses, if any, which are accepted by our original test, and accepts, say, hypothesis # 1 when the original test accepts no hypotheses. It is immediate to see that the C-risk of the modified test cannot be larger than the risk of the original test.

2.5.3 Illustration: Selecting the best among a family of estimates

Let us illustrate our machinery for multiple hypothesis testing by applying it to the situation as follows:

We are given:
• a simple nondegenerate observation scheme O = (Ω, Π; {pµ(·) : µ ∈ M}; F),
• a seminorm ‖·‖ on $\mathbf{R}^n$,¹¹
• a convex compact set X ⊂ $\mathbf{R}^n$ along with a collection of M points xi ∈ $\mathbf{R}^n$, 1 ≤ i ≤ M, and a positive D such that the ‖·‖-diameter of the set X⁺ = X ∪ {xi : 1 ≤ i ≤ M} is at most D:

$$\|x-x'\|\le D\quad\forall(x,x'\in X^+),$$

• an affine mapping x ↦ A(x) from $\mathbf{R}^n$ into the embedding space of M such that A(x) ∈ M for all x ∈ X,
• a tolerance ǫ ∈ (0, 1).

¹¹ A seminorm on $\mathbf{R}^n$ is defined by exactly the same requirements as a norm, except that now we allow zero seminorms for some nonzero vectors. Thus, a seminorm on $\mathbf{R}^n$ is a nonnegative function ‖·‖ which is even and homogeneous: ‖λx‖ = |λ|‖x‖, and satisfies the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖. A universal example is $\|x\| = \|Bx\|_o$, where $\|\cdot\|_o$ is a norm on some $\mathbf{R}^m$ and B is an m × n matrix; whenever this matrix has a nontrivial kernel, ‖·‖ is a seminorm rather than a norm.

We observe a K-element sample $\omega^K = (\omega_1,...,\omega_K)$ of observations

$$\omega_k\sim p_{A(x_*)},\quad 1\le k\le K, \tag{2.85}$$

independent across k, where x∗ ∈ $\mathbf{R}^n$ is an unknown signal known to belong to X. Our “ideal goal” is to use $\omega^K$ in order to identify, with probability ≥ 1 − ǫ, the ‖·‖-closest to x∗ point among the points x1, ..., xM. The goal just outlined may be too ambitious, and in the sequel we focus on the relaxed goal as follows:

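Given a positive integer N and a “resolution” $\theta > 1$, consider the grid

$$\Gamma = \{r_j = D\theta^{-j},\ 0 \le j \le N\}$$

and let

$$\rho(x) = \min\Big\{r_j \in \Gamma : r_j \ge \min_{1 \le i \le M} \|x - x_i\|\Big\}.$$

Given the design parameters $\alpha \ge 1$ and $\beta \ge 0$, we want to specify a volume of observations K and an inference routine $\omega^K \mapsto i_{\alpha,\beta}(\omega^K) \in \{1, ..., M\}$ such that

$$\forall (x_* \in X):\ \mathrm{Prob}\big\{\|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le \alpha\rho(x_*) + \beta\big\} \ge 1 - \epsilon. \tag{2.86}$$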
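The discretized distance $\rho(\cdot)$ is easy to compute once the grid is fixed. Here is a minimal sketch in Python, under our own illustrative assumptions: vectors are plain lists, the seminorm is passed as a function, and the helper names `grid`, `rho`, and `l1_mean` are ours and do not come from the text.

```python
def grid(D, theta, N):
    """Grid Gamma = {r_j = D * theta**(-j), 0 <= j <= N}, listed for j = 0..N."""
    return [D * theta ** (-j) for j in range(N + 1)]


def rho(x, candidates, D, theta, N, seminorm):
    """Smallest grid value r_j with r_j >= min_i ||x - x_i||."""
    dist = min(seminorm([a - b for a, b in zip(x, xi)]) for xi in candidates)
    # r_0 = D bounds the diameter of X^+, so for x in X this list is nonempty.
    feasible = [r for r in grid(D, theta, N) if r >= dist]
    return min(feasible)


def l1_mean(y):
    """Illustrative seminorm: the discrete L1 norm (1/n) * sum |y_s|."""
    return sum(abs(t) for t in y) / len(y)
```

With two candidates `[0, 0]` and `[1, 1]`, D = 1, θ = 2, and N = 3, the point `[0.2, 0.2]` is at distance 0.2 from the nearest candidate, and the smallest grid value not below 0.2 is 0.25.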

Note that when passing from the “ideal” to the relaxed goal, the simplification is twofold: first, instead of the precise distance $\min_i \|x_* - x_i\|$ from $x_*$ to $\{x_1, ..., x_M\}$ we look at the best upper bound $\rho(x_*)$ on this distance from the grid $\Gamma$; second, we allow factor $\alpha$ and additive term $\beta$ in mimicking the (discretized) distance $\rho(x_*)$ by $\|x_* - x_{i_{\alpha,\beta}(\omega^K)}\|$.

The problem we have posed is quite popular in Statistics and originates from the estimate aggregation problem [185, 229, 101] as follows: let $x_i$ be candidate estimates of $x_*$ yielded by a number of a priori “models” of $x_*$ and perhaps some preliminary noisy observations of $x_*$. Given $x_i$ and a matrix B, we want to select among the vectors $Bx_i$ the (nearly) best approximation of $Bx_*$ w.r.t. a given norm $\|\cdot\|_o$, utilizing additional observations $\omega^K$ of the signal. To bring this problem into our framework, it suffices to specify the seminorm as $\|x\| = \|Bx\|_o$. We shall see in due course that in the context of this problem, the “discretization of distances” is, for all practical purposes, irrelevant: the dependence of the volume of observations on N is just logarithmic, so that we can easily handle a fine grid, like the one with $\theta = 1.001$ and $\theta^{-N} = 10^{-10}$. As for factor $\alpha$ and additive term $\beta$, they indeed could be “expensive in terms of applications,” but the “nearly ideal” goal of making $\alpha$ close to 1 and $\beta$ close to 0 may be unattainable.

2.5.3.1

The construction

Let us associate with $i \le M$ and j, $0 \le j \le N$, the hypothesis $H_{ij}$ stating that the observations $\omega_k$, independent across k (see (2.85)), stem from $x_* \in X_{ij} := \{x \in X : \|x - x_i\| \le r_j\}$. Note that the sets $X_{ij}$ are convex and compact. We denote by $\mathcal{J}$ the set of all pairs (i, j) for which $i \in \{1, ..., M\}$, $j \in \{0, 1, ..., N\}$, and $X_{ij} \ne \emptyset$. Further, we define closeness $\mathcal{C}_{\alpha,\beta}$ on the set of hypotheses $H_{ij}$, $(i, j) \in \mathcal{J}$, as follows: $(ij, i'j') \in \mathcal{C}_{\alpha,\beta}$ if and only if

$$\|x_i - x_{i'}\| \le \bar\alpha (r_j + r_{j'}) + \beta, \quad \bar\alpha = \frac{\alpha - 1}{2} \tag{2.87}$$

(here and in what follows, $k\ell$ denotes the ordered pair $(k, \ell)$).

Applying Theorem 2.23, we can build, in a computation-friendly fashion, the system $\phi_{ij,i'j'}(\omega)$, $ij, i'j' \in \mathcal{J}$, of optimal balanced detectors for the hypotheses $H_{ij}$ along with the risks of these detectors, so that

$$\begin{array}{ll}
\phi_{ij,i'j'}(\omega) \equiv -\phi_{i'j',ij}(\omega) & \forall (ij, i'j' \in \mathcal{J}),\\
\int_\Omega e^{-\phi_{ij,i'j'}(\omega)}\, p_{A(x)}(\omega)\, \Pi(d\omega) \le \epsilon_{ij,i'j'} & \forall (ij \in \mathcal{J},\ i'j' \in \mathcal{J},\ x \in X_{ij}).
\end{array}$$
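The closeness relation (2.87) determines which pairs of hypotheses actually need a detector: only the pairs that are not close. A minimal Python sketch of the predicate follows; the function names and the list-based vectors are our illustrative assumptions, and the seminorm is supplied by the caller.

```python
def l1_mean(y):
    """Illustrative seminorm: the discrete L1 norm (1/n) * sum |y_s|."""
    return sum(abs(t) for t in y) / len(y)


def is_close(x_i, x_ip, r_j, r_jp, alpha, beta, seminorm):
    """Closeness test (2.87): ||x_i - x_i'|| <= alpha_bar*(r_j + r_j') + beta."""
    alpha_bar = (alpha - 1.0) / 2.0
    dist = seminorm([a - b for a, b in zip(x_i, x_ip)])
    return dist <= alpha_bar * (r_j + r_jp) + beta
```

For example, with α = 3 (so that $\bar\alpha = 1$) and β = 0.05, two candidates at distance 1 are close at radii $r_j = r_{j'} = 0.5$ but not at radii $r_j = r_{j'} = 0.25$.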

Let us say that a pair $(\alpha, \beta)$ is admissible if $\alpha \ge 1$, $\beta \ge 0$, and

$$\forall \big((i, j) \in \mathcal{J},\ (i', j') \in \mathcal{J},\ (ij, i'j') \notin \mathcal{C}_{\alpha,\beta}\big):\ A(X_{ij}) \cap A(X_{i'j'}) = \emptyset.$$

Note that checking admissibility of a given pair $(\alpha, \beta)$ is a computationally tractable task. Given an admissible pair $(\alpha, \beta)$, we associate with it a positive integer $K = K(\alpha, \beta)$ and inference $\omega^K \mapsto i_{\alpha,\beta}(\omega^K)$ as follows:

1. $K = K(\alpha, \beta)$ is the smallest integer such that the detector-based test $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$ yielded by the machinery of Section 2.5.2.3 decides on the hypotheses $H_{ij}$, $ij \in \mathcal{J}$, with $\mathcal{C}_{\alpha,\beta}$-risk not exceeding $\epsilon$. Note that by admissibility, $\epsilon_{ij,i'j'} < 1$ whenever $(ij, i'j') \notin \mathcal{C}_{\alpha,\beta}$, so that $K(\alpha, \beta)$ is well defined.
2. Given observation $\omega^K$, $K = K(\alpha, \beta)$, we define $i_{\alpha,\beta}(\omega^K)$ as follows:
   a) We apply to $\omega^K$ the test $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$. If the test accepts no hypothesis (case A), $i_{\alpha,\beta}(\omega^K)$ is undefined. The observations $\omega^K$ resulting in case A comprise some set, which we denote by $\mathcal{B}$; given $\omega^K$, we can recognize whether or not $\omega^K \in \mathcal{B}$.
   b) When $\omega^K \notin \mathcal{B}$, the test $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$ accepts some of the hypotheses $H_{ij}$; let the set of their indices ij be $\mathcal{J}(\omega^K)$. We select from the pairs $ij \in \mathcal{J}(\omega^K)$ the one with the largest j, and set $i_{\alpha,\beta}(\omega^K)$ to be equal to the first component, and $j_{\alpha,\beta}(\omega^K)$ to be equal to the second component of the selected pair.

We have the following:

Proposition 2.33. Assuming $(\alpha, \beta)$ admissible, for the inference $\omega^K \mapsto i_{\alpha,\beta}(\omega^K)$ just defined and for every $x_* \in X$, denoting by $P^K_{x_*}$ the distribution of stationary K-repeated observation $\omega^K$ stemming from $x_*$, one has

$$\|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le \alpha\rho(x_*) + \beta \tag{2.88}$$

with $P^K_{x_*}$-probability at least $1 - \epsilon$.

Proof. Let us fix $x_* \in X$, and let $j_* = j_*(x_*)$ be the largest $j \le N$ such that

$$r_j \ge \min_{i \le M} \|x_* - x_i\|;$$

note that $j_*$ is well defined due to $r_0 = D \ge \|x_* - x_1\|$, and that $r_{j_*} = \rho(x_*)$. We specify $i_* = i_*(x_*) \le M$ in such a way that

$$\|x_* - x_{i_*}\| \le r_{j_*}. \tag{2.89}$$

Note that $i_*$ is well defined and that observations (2.85) stemming from $x_*$ obey the hypothesis $H_{i_*j_*}$. Let $\mathcal{E}$ be the set of those $\omega^K$ for which the predicate

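$\mathcal{P}$: As applied to observation $\omega^K$, the test $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$ accepts $H_{i_*j_*}$, and all hypotheses accepted by the test are $\mathcal{C}_{\alpha,\beta}$-close to $H_{i_*j_*}$

holds true. Taking into account that the $\mathcal{C}_{\alpha,\beta}$-risk of $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$ does not exceed $\epsilon$ and that the hypothesis $H_{i_*j_*}$ is true, the $P^K_{x_*}$-probability of the event $\mathcal{E}$ is at least $1 - \epsilon$. Let observation $\omega^K$ satisfy

$$\omega^K \in \mathcal{E}. \tag{2.90}$$

Then

1. The test $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$ accepts the hypothesis $H_{i_*j_*}$, that is, $\omega^K \notin \mathcal{B}$. By construction of $i_{\alpha,\beta}(\omega^K)$ and $j_{\alpha,\beta}(\omega^K)$ (see rule 2b above) and due to the fact that $\mathcal{T}^K_{\mathcal{C}_{\alpha,\beta}}$ accepts $H_{i_*j_*}$, we have $j_{\alpha,\beta}(\omega^K) \ge j_*$.
2. The hypothesis $H_{i_{\alpha,\beta}(\omega^K) j_{\alpha,\beta}(\omega^K)}$ is $\mathcal{C}_{\alpha,\beta}$-close to $H_{i_*j_*}$, so that

$$\|x_{i_*} - x_{i_{\alpha,\beta}(\omega^K)}\| \le \bar\alpha\big(r_{j_*} + r_{j_{\alpha,\beta}(\omega^K)}\big) + \beta \le 2\bar\alpha r_{j_*} + \beta = 2\bar\alpha\rho(x_*) + \beta,$$

where the concluding inequality is due to the fact that, as we have already seen, $j_{\alpha,\beta}(\omega^K) \ge j_*$ when (2.90) takes place.

Invoking (2.89), we conclude that with $P^K_{x_*}$-probability at least $1 - \epsilon$ it holds

$$\|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le (2\bar\alpha + 1)\rho(x_*) + \beta = \alpha\rho(x_*) + \beta.$$

2.5.3.2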



A modification

From the computational viewpoint, an obvious shortcoming of the construction presented in the previous section is the necessity to operate with $M(N + 1)$ hypotheses, which might require computing as many as $O(M^2N^2)$ detectors. We are about to present a modified construction, where we deal at most $N + 1$ times with just M hypotheses at a time (i.e., with the total of at most $O(M^2N)$ detectors). The idea is to replace simultaneously processing all hypotheses $H_{ij}$, $ij \in \mathcal{J}$, with processing them in stages $j = 0, 1, ...$, with stage j operating only with the hypotheses $H_{ij}$, $i = 1, ..., M$.

The implementation of this idea is as follows. In the situation of Section 2.5.3, given the same entities $\Gamma$, $(\alpha, \beta)$, $H_{ij}$, $X_{ij}$, $ij \in \mathcal{J}$, as at the beginning of Section 2.5.3.1 and specifying closeness $\mathcal{C}_{\alpha,\beta}$ according to (2.87), we now act as follows.

Preprocessing. For $j = 0, 1, ..., N$
1. we identify the set $I_j = \{i \le M : X_{ij} \ne \emptyset\}$ and stop if this set is empty. If this set is nonempty,
2. we specify the closeness $\mathcal{C}^j_{\alpha,\beta}$ on the set of hypotheses $H_{ij}$, $i \in I_j$, as a “slice” of the closeness $\mathcal{C}_{\alpha,\beta}$: $H_{ij}$ and $H_{i'j}$ (equivalently, i and i') are $\mathcal{C}^j_{\alpha,\beta}$-close to each other if $(ij, i'j)$ are $\mathcal{C}_{\alpha,\beta}$-close, that is,

$$\|x_i - x_{i'}\| \le 2\bar\alpha r_j + \beta, \quad \bar\alpha = \frac{\alpha - 1}{2}.$$

3. We build the optimal detectors $\phi_{ij,i'j}$, along with their risks $\epsilon_{ij,i'j}$, for all $i, i' \in I_j$ such that $(i, i') \notin \mathcal{C}^j_{\alpha,\beta}$. If $\epsilon_{ij,i'j} = 1$ for a pair i, i' of the latter type, that is, $A(X_{ij}) \cap A(X_{i'j}) \ne \emptyset$, we claim that $(\alpha, \beta)$ is inadmissible and stop. Otherwise we find the smallest $K = K_j$ such that the spectral norm of the symmetric $M \times M$ matrix $E^{jK}$ with the entries

$$E^{jK}_{ii'} = \begin{cases} \epsilon^K_{ij,i'j}, & i \in I_j,\ i' \in I_j,\ (i, i') \notin \mathcal{C}^j_{\alpha,\beta},\\ 0, & \text{otherwise} \end{cases}$$

does not exceed $\bar\epsilon = \epsilon/(N + 1)$. We then use the machinery of Section 2.5.2.3 to build the detector-based test $\mathcal{T}^{K_j}_{\mathcal{C}^j_{\alpha,\beta}}$, which decides on the hypotheses $H_{ij}$, $i \in I_j$, with $\mathcal{C}^j_{\alpha,\beta}$-risk not exceeding $\bar\epsilon$.
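The search for $K_j$ in step 3 can be sketched as follows. This is an illustration under our own assumptions rather than a prescribed implementation: the pairwise risks are given as a symmetric nonnegative matrix `risks` (a list of lists, with zeros for close pairs and entries below 1 otherwise), the spectral norm is estimated by power iteration on $E^2$, and K is found by a plain linear search. Power iteration approaches the norm from below, so serious code would rather call a library eigenvalue routine.

```python
import math


def matvec(E, v):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in E]


def spectral_norm(E, iters=200):
    """Estimate the spectral norm of a symmetric matrix with nonnegative entries.

    Power iteration is run on E*E, which is positive semidefinite; the square
    root of the resulting estimate of its top eigenvalue is returned.
    """
    v = [1.0] * len(E)
    estimate = 0.0
    for _ in range(iters):
        w = matvec(E, matvec(E, v))
        norm_w = math.sqrt(sum(t * t for t in w))
        norm_v = math.sqrt(sum(t * t for t in v))
        if norm_w == 0.0:
            return 0.0
        estimate = math.sqrt(norm_w / norm_v)
        v = [t / norm_w for t in w]
    return estimate


def smallest_K(risks, eps_bar, K_max=100000):
    """Smallest K whose entrywise K-th power of `risks` has norm <= eps_bar."""
    for K in range(1, K_max + 1):
        E_K = [[r ** K for r in row] for row in risks]
        if spectral_norm(E_K) <= eps_bar:
            return K
    return None  # not found within K_max
```

For two hypotheses with pairwise risk 0.5 and $\bar\epsilon = 0.01$, the matrix has spectral norm $0.5^K$, and the search returns K = 7.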

It may happen that the outlined process stops when processing some value $\bar j$ of j; if this does not happen, we set $\bar j = N + 1$. Now, if the process does stop, and stops with the claim that $(\alpha, \beta)$ is inadmissible, we call $(\alpha, \beta)$ inadmissible and terminate; in this case we fail to produce a desired inference. Note that if this is the case, $(\alpha, \beta)$ is inadmissible in the sense of Section 2.5.3.1 as well. When we do not stop with the inadmissibility claim, we call $(\alpha, \beta)$ admissible, and in this case we do produce an inference, specifically, as follows.

Processing observations:
1. We set $\bar J = \{0, 1, ..., \hat j = \bar j - 1\}$, $K = K(\alpha, \beta) = \max_{0 \le j \le \hat j} K_j$. Note that $\bar J$ is nonempty due to $\bar j > 0$.$^{12}$
2. Let $\omega^K = (\omega_1, ..., \omega_K)$ with independent across k components stemming from unknown signal $x_* \in X$ according to (2.85). We put $\hat I_{-1}(\omega^K) = \{1, ..., M\} = I_0$.
   a) For $j = 0, 1, ..., \hat j$, we act as follows. When processing j, we have at our disposal subsets $\hat I_k(\omega^K) \subset \{1, ..., M\}$, $-1 \le k < j$. To build the set $\hat I_j(\omega^K)$
      i. we apply the test $\mathcal{T}^{K_j}_{\mathcal{C}^j_{\alpha,\beta}}$ to the initial $K_j$ components of the observation $\omega^K$. Let $I_j^+(\omega^K)$ be the set of hypotheses $H_{ij}$, $i \in I_j$, accepted by the test;
      ii. it may happen that $I_j^+(\omega^K) = \emptyset$; if it is so, we terminate;
      iii. if $I_j^+(\omega^K)$ is nonempty, we look, one by one, at indices $i \in I_j^+(\omega^K)$ and call the index i good if $i \in \hat I_\ell(\omega^K)$ for every $\ell \in \{-1, 0, ..., j - 1\}$;
      iv. we define $\hat I_j(\omega^K)$ as the set of good indices of $I_j^+(\omega^K)$ if this set is not empty and proceed to the next value of j (if $j < \hat j$), or terminate (if $j = \hat j$). We terminate if there are no good indices in $I_j^+(\omega^K)$.
   b) Upon termination, we have at our disposal a collection $\hat I_j(\omega^K)$, $0 \le j \le \tilde j(\omega^K)$, of all sets $\hat I_j(\omega^K)$ we have built (this collection can be empty, which we encode by setting $\tilde j(\omega^K) = -1$). When $\tilde j(\omega^K) = -1$, our inference remains undefined. Otherwise we select from the set $\hat I_{\tilde j(\omega^K)}(\omega^K)$ an index $i_{\alpha,\beta}(\omega^K)$, say, the smallest one, and claim that the point $x_{i_{\alpha,\beta}(\omega^K)}$ is the point among $x_1, ..., x_M$ “nearly closest” to $x_*$.

$^{12}$ All the sets $X_{i0}$ contain X and thus are nonempty, so that $I_0 = \{1, ..., M\} \ne \emptyset$, and thus we cannot stop at step $j = 0$ due to $I_0 = \emptyset$; the other possibility to stop at step $j = 0$ is ruled out by the fact that we are in the case where $(\alpha, \beta)$ is admissible.

We have the following analog of Proposition 2.33:

Proposition 2.34. Assuming $(\alpha, \beta)$ admissible, for the inference $\omega^K \mapsto i_{\alpha,\beta}(\omega^K)$ just defined and for every $x_* \in X$, denoting by $P^K_{x_*}$ the distribution of stationary K-repeated observation $\omega^K$ stemming from $x_*$, one has

$$P^K_{x_*}\big\{\omega^K : i_{\alpha,\beta}(\omega^K) \text{ is well defined and } \|x_* - x_{i_{\alpha,\beta}(\omega^K)}\| \le \alpha\rho(x_*) + \beta\big\} \ge 1 - \epsilon.$$

Proof. Let us fix the signal $x_* \in X$ underlying observations $\omega^K$. As in the proof of Proposition 2.33, let $j_*$ be such that $\rho(x_*) = r_{j_*}$, and let $i_* \le M$ be such that $x_* \in X_{i_*j_*}$. Clearly, $i_*$ and $j_*$ are well defined, and the hypotheses $H_{i_*j}$, $0 \le j \le j_*$, are true. In particular, $X_{i_*j} \ne \emptyset$ when $j \le j_*$, implying that $i_* \in I_j$, $0 \le j \le j_*$, whence also $\hat j \ge j_*$. For $0 \le j \le j_*$, let $\mathcal{E}_j$ be the set of all realizations of $\omega^K$ such that

$$i_* \in I_j^+(\omega^K)\ \ \&\ \ \big\{(i_*, i) \in \mathcal{C}^j_{\alpha,\beta}\ \ \forall i \in I_j^+(\omega^K)\big\}.$$

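Since the $\mathcal{C}^j_{\alpha,\beta}$-risk of the test $\mathcal{T}^{K_j}_{\mathcal{C}^j_{\alpha,\beta}}$ is $\le \bar\epsilon$, we conclude that the $P^K_{x_*}$-probability of $\mathcal{E}_j$ is at least $1 - \bar\epsilon$, whence the $P^K_{x_*}$-probability of the event

$$\mathcal{E} = \bigcap_{j=0}^{j_*} \mathcal{E}_j$$

is at least $1 - (N + 1)\bar\epsilon = 1 - \epsilon$. Now let $\omega^K \in \mathcal{E}$. Then,
• By the definition of $\mathcal{E}_j$, when $j \le j_*$, we have $i_* \in I_j^+(\omega^K)$, whence, by evident induction in j, $i_* \in \hat I_j(\omega^K)$ for all $j \le j_*$.
• We conclude from the above that $\tilde j(\omega^K) \ge j_*$. In particular, $i := i_{\alpha,\beta}(\omega^K)$ is well defined and turned out to be good at step $\tilde j \ge j_*$, implying that $i \in \hat I_{j_*}(\omega^K) \subset I^+_{j_*}(\omega^K)$.

Thus, $i \in I^+_{j_*}(\omega^K)$, which combines with the definition of $\mathcal{E}_{j_*}$ to imply that i and $i_*$ are $\mathcal{C}^{j_*}_{\alpha,\beta}$-close to each other, whence

$$\|x_{i_{\alpha,\beta}(\omega^K)} - x_{i_*}\| \le 2\bar\alpha r_{j_*} + \beta = 2\bar\alpha\rho(x_*) + \beta,$$

resulting in the desired relation

$$\|x_{i_{\alpha,\beta}(\omega^K)} - x_*\| \le 2\bar\alpha\rho(x_*) + \beta + \|x_{i_*} - x_*\| \le [2\bar\alpha + 1]\rho(x_*) + \beta = \alpha\rho(x_*) + \beta.$$

✷

2.5.3.3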

“Near-optimality”

We augment the above constructions with the following

Proposition 2.35. Let for some positive integer $\bar K$, $\epsilon \in (0, 1/2)$, and a pair $(a, b) \ge 0$ there exist an inference $\omega^{\bar K} \mapsto i(\omega^{\bar K}) \in \{1, ..., M\}$ such that whenever $x_* \in X$, we have

$$\mathrm{Prob}_{\omega^{\bar K} \sim P^{\bar K}_{x_*}}\big\{\|x_* - x_{i(\omega^{\bar K})}\| \le a\rho(x_*) + b\big\} \ge 1 - \epsilon.$$

Then the pair $(\alpha = 2a + 3, \beta = 2b)$ is admissible in the sense of Section 2.5.3.1 (and thus in the sense of Section 2.5.3.2), and for the constructions in Sections 2.5.3.1 and 2.5.3.2 one has

$$K(\alpha, \beta) \le \mathrm{Ceil}\left(2\,\frac{1 + \ln(M(N + 1))/\ln(1/\epsilon)}{1 - \frac{\ln(4(1 - \epsilon))}{\ln(1/\epsilon)}}\,\bar K\right). \tag{2.91}$$

Proof. Consider the situation of Section 2.5.3.1 (the situation of Section 2.5.3.2 can be processed in a completely similar way). Observe that with $\alpha$, $\beta$ as above, there exists a simple test deciding on a pair of hypotheses $H_{ij}$, $H_{i'j'}$ which are not $\mathcal{C}_{\alpha,\beta}$-close to each other via stationary $\bar K$-repeated observation $\omega^{\bar K}$ with risk $\le \epsilon$. Indeed, the desired test $\mathcal{T}$ is as follows: given $ij \in \mathcal{J}$, $i'j' \in \mathcal{J}$, and observation $\omega^{\bar K}$, we compute $i(\omega^{\bar K})$ and accept $H_{ij}$ if and only if $\|x_{i(\omega^{\bar K})} - x_i\| \le (a + 1)r_j + b$, and accept $H_{i'j'}$ otherwise. Let us check that the risk of this test indeed is at most $\epsilon$. Assume, first, that $H_{ij}$ takes place. The $P^{\bar K}_{x_*}$-probability of the event

$$\mathcal{E} : \|x_{i(\omega^{\bar K})} - x_*\| \le a\rho(x_*) + b$$

is at least $1 - \epsilon$ due to the origin of $i(\cdot)$, and $\|x_i - x_*\| \le r_j$ since $H_{ij}$ takes place, implying that $\rho(x_*) \le r_j$ by the definition of $\rho(\cdot)$. Thus, in the case of $\mathcal{E}$ it holds

$$\|x_{i(\omega^{\bar K})} - x_i\| \le \|x_{i(\omega^{\bar K})} - x_*\| + \|x_i - x_*\| \le a\rho(x_*) + b + r_j \le (a + 1)r_j + b.$$

We conclude that if $H_{ij}$ is true and $\omega^{\bar K} \in \mathcal{E}$, then the test $\mathcal{T}$ accepts $H_{ij}$, and thus the $P^{\bar K}_{x_*}$-probability for the simple test $\mathcal{T}$ not to accept $H_{ij}$ when the hypothesis takes place is $\le \epsilon$.

Now let $H_{i'j'}$ take place, and let $\mathcal{E}$ be the same event as above. When $\omega^{\bar K} \in \mathcal{E}$, which happens with the $P^{\bar K}_{x_*}$-probability at least $1 - \epsilon$, for the same reasons as above, we have $\|x_{i(\omega^{\bar K})} - x_{i'}\| \le (a + 1)r_{j'} + b$. It follows that when $H_{i'j'}$ takes place and $\omega^{\bar K} \in \mathcal{E}$, we have $\|x_{i(\omega^{\bar K})} - x_i\| > (a + 1)r_j + b$, since otherwise we would have

$$\begin{array}{rcl}
\|x_i - x_{i'}\| &\le& \|x_{i(\omega^{\bar K})} - x_i\| + \|x_{i(\omega^{\bar K})} - x_{i'}\| \le (a + 1)r_j + b + (a + 1)r_{j'} + b\\
&=& (a + 1)(r_j + r_{j'}) + 2b = \frac{\alpha - 1}{2}(r_j + r_{j'}) + \beta,
\end{array}$$

which contradicts the fact that ij and i'j' are not $\mathcal{C}_{\alpha,\beta}$-close. Thus, whenever $H_{i'j'}$ holds true and $\mathcal{E}$ takes place, we have $\|x_{i(\omega^{\bar K})} - x_i\| > (a + 1)r_j + b$, implying by the definition of $\mathcal{T}$ that $\mathcal{T}$ accepts $H_{i'j'}$. Thus, the $P^{\bar K}_{x_*}$-probability not to accept $H_{i'j'}$ when the hypothesis is true is at most $\epsilon$.

From the fact that whenever $(ij, i'j') \notin \mathcal{C}_{\alpha,\beta}$, the hypotheses $H_{ij}$, $H_{i'j'}$ can be decided upon, via $\bar K$ observations, with risk $\le \epsilon < 0.5$, it follows that for the ij, i'j' in question, the sets $A(X_{ij})$ and $A(X_{i'j'})$ do not intersect, so that $(\alpha, \beta)$ is an admissible pair.

As in the proof of Proposition 2.31, by basic properties of simple observation schemes, the fact that the hypotheses $H_{ij}$, $H_{i'j'}$ with $(ij, i'j') \notin \mathcal{C}_{\alpha,\beta}$ can be decided upon via $\bar K$-repeated observations (2.85) with risk $\le \epsilon < 1/2$ implies that $\epsilon_{ij,i'j'} \le [2\sqrt{\epsilon(1 - \epsilon)}]^{1/\bar K}$, whence, again by basic results on simple observation schemes (look once again at the proof of Proposition 2.31), the $\mathcal{C}_{\alpha,\beta}$-risk of the K-observation detector-based test $\mathcal{T}^K$ deciding on the hypotheses $H_{ij}$, $ij \in \mathcal{J}$, up to closeness $\mathcal{C}_{\alpha,\beta}$ does not exceed

$$\mathrm{Card}(\mathcal{J})\,[2\sqrt{\epsilon(1 - \epsilon)}]^{K/\bar K} \le M(N + 1)\,[2\sqrt{\epsilon(1 - \epsilon)}]^{K/\bar K}.$$

The right-hand side is $\le \epsilon$ as soon as $K/\bar K \ge 2\ln(M(N + 1)/\epsilon)/[\ln(1/\epsilon) - \ln(4(1 - \epsilon))]$, and (2.91) follows. ✷

Comment. Proposition 2.35 says that in our problem, the “statistical toll” for quite large values of N and M is quite moderate: with $\epsilon = 0.01$, resolution $\theta = 1.001$ (which for all practical purposes is the same as no discretization of distances at all), $D/r_N$ as large as $10^{10}$, and M as large as 10,000, (2.91) reads $K = \mathrm{Ceil}(10.7\bar K)$, not a disaster! The actual statistical toll of our construction is in replacing the “existing in nature” a and b with $\alpha = 2a + 3$ and $\beta = 2b$. And of course there is a huge computational toll for large M and N: we need to operate with a large (albeit polynomial in M, N) number of hypotheses and detectors.

2.5.3.4

Numerical illustration

As an illustration of the approach presented in this section consider the following (toy) problem:

A signal $x_* \in \mathbf{R}^n$ (one may think of $x_*$ as of the restriction on the equidistant n-point grid in [0, 1] of a function of continuous argument $t \in [0, 1]$) is observed according to

$$\omega = Ax_* + \xi, \quad \xi \sim \mathcal{N}(0, \sigma^2 I_n), \tag{2.92}$$

where A is a “discretized integration”:

$$(Ax)_s = \frac{1}{n}\sum_{j=1}^{s} x_j, \quad s = 1, ..., n.$$

We want to approximate x in the discrete version of the $L_1$-norm

$$\|y\| = \frac{1}{n}\sum_{s=1}^{n} |y_s|, \quad y \in \mathbf{R}^n,$$

by a low-order polynomial.

In order to build the approximation, we use a single observation $\omega$ as in (2.92) to build five candidate estimates $x_i$, $i = 1, ..., 5$, of $x_*$. Specifically, $x_i$ is the Least Squares polynomial approximation, of degree $\le i - 1$, of x:

$$x_i = \mathop{\mathrm{argmin}}_{y \in \mathcal{P}_{i-1}} \|Ay - \omega\|_2^2,$$

where $\mathcal{P}_\kappa$ is the linear space of algebraic polynomials, of degree $\le \kappa$, of discrete argument s varying in $\{1, 2, ..., n\}$. After the candidate estimates are built, we use additional K observations (2.92) “to select the model,” that is, to select among our estimates the $\|\cdot\|$-closest to $x_*$.

In the experiment reported below we use $n = 128$ and $\sigma = 0.01$. The true signal $x_*$ is a discretization of a piecewise linear function of continuous argument $t \in [0, 1]$, with slope 2 to the left of $t = 0.5$, and with slope −2 to the right of $t = 0.5$; at $t = 0.5$, the function has a jump. The a priori information on the true signal is that

it belongs to the box $\{x \in \mathbf{R}^n : \|x\|_\infty \le 1\}$. The signal and sample polynomial approximations $x_i$ of $x_*$, $1 \le i \le 5$, are presented on the top plot in Figure 2.5; their actual $\|\cdot\|$-distances to $x_*$ are as follows:

  i                   1      2      3      4      5
  $\|x_i - x_*\|$     0.534  0.354  0.233  0.161  0.172

Figure 2.5: Signal (top, solid) and candidate estimates (top, dotted). Bottom: the primitive of the signal.

Setting $\epsilon = 0.01$, $N = 22$, and $\theta = 2^{1/4}$, $\alpha = 3$ and $\beta = 0.05$ resulted in $K = 3$. In a series of 1,000 simulations of the resulting inference, all 1,000 results correctly identified the candidate estimate $x_4$ as the $\|\cdot\|$-closest to $x_*$, in spite of the factor $\alpha = 3$ in (2.88). Surprisingly, the same holds true when we use the resulting inference with the reduced values of K, namely, $K = 1$ and $K = 2$, although the theoretical guarantees deteriorate: with $K = 1$ and $K = 2$, the theory guarantees the validity of (2.88) with probabilities 0.77 and 0.97, respectively.

2.6

SEQUENTIAL HYPOTHESIS TESTING

2.6.1

Motivation: Election polls

Let us consider the following “practical” question.

One of L candidates for an office is about to be selected by a populationwide majority vote. Every member of the population votes for exactly one candidate. How do we predict the winner via an opinion poll?

A (naive) model of the situation could be as follows. Let us represent the preference of a particular voter by his preference vector, a basic orth e in $\mathbf{R}^L$ with unit entry in a position $\ell$ meaning that the voter is about to vote for the $\ell$-th candidate. The entries $\mu_\ell$ in the average $\mu$, over the population, of these vectors are the fractions of votes in favor of the $\ell$-th candidate, and the elected candidate is the one “indexing” the largest of the $\mu_\ell$'s. Now assume that we select at random, from the uniform distribution, a member of the population and observe his preference vector. Our observation $\omega$ is a realization of a discrete random variable taking values in the set $\Omega = \{e_1, ..., e_L\}$ of basic orths in $\mathbf{R}^L$, and $\mu$ is the distribution of $\omega$ (technically, the density of this distribution w.r.t. the counting measure $\Pi$ on $\Omega$). Selecting a small threshold $\delta$ and assuming that the true (unknown to us) $\mu$ is such that the largest entry in $\mu$ is at least by $\delta$ larger than every other entry and that $\mu_\ell \ge \frac{1}{N}$ for all $\ell$, N being the population size,$^{13}$ we can model the population preference for the $\ell$-th candidate with

$$\mu \in M_\ell = \Big\{\mu \in \mathbf{R}^d : \mu_i \ge \tfrac{1}{N},\ \sum_i \mu_i = 1,\ \mu_\ell \ge \mu_i + \delta\ \forall (i \ne \ell)\Big\} \subset \mathcal{M} = \Big\{\mu \in \mathbf{R}^d : \mu > 0,\ \sum_i \mu_i = 1\Big\}.$$

In an (idealized) poll, we select at random a number K of voters and observe their preferences, thus arriving at a sample $\omega^K = (\omega_1, ..., \omega_K)$ of observations drawn, independently of each other, from an unknown distribution $\mu$ on $\Omega$, with $\mu$ known to belong to $\bigcup_{\ell=1}^{L} M_\ell$. Therefore, to predict the winner is the same as to decide on L convex hypotheses, $H_1, ..., H_L$, in the Discrete o.s., with $H_\ell$ stating that $\omega_1, ..., \omega_K$ are drawn, independently of each other, from a distribution $\mu \in M_\ell$. What we end up with is the problem of deciding on L convex hypotheses in the Discrete o.s. with L-element $\Omega$ via stationary K-repeated observations.

$^{13}$ With the size N of population in the range of tens of thousands and $\delta$ as 1/N, both these assumptions seem to be quite realistic.

Illustration. Consider two-candidate elections; now the goal of a poll is, given K independent of each other realizations $\omega_1, ..., \omega_K$ of random variable $\omega$ taking value $\chi = 1, 2$ with probability $\mu_\chi$, $\mu_1 + \mu_2 = 1$, to decide what is larger, $\mu_1$ or $\mu_2$. As explained above, we select somehow a threshold $\delta$ and impose on the unknown $\mu$ an a priori assumption that the gap between the largest and the next largest (in our case, just the smallest) entry of $\mu$ is at least $\delta$, thus arriving at two hypotheses,

$$H_1 : \mu_1 \ge \mu_2 + \delta, \qquad H_2 : \mu_2 \ge \mu_1 + \delta,$$

which is the same as

$$\begin{array}{l}
H_1 : \mu \in M_1 = \{\mu : \mu_1 \ge \frac{1 + \delta}{2},\ \mu_2 \ge 0,\ \mu_1 + \mu_2 = 1\},\\
H_2 : \mu \in M_2 = \{\mu : \mu_2 \ge \frac{1 + \delta}{2},\ \mu_1 \ge 0,\ \mu_1 + \mu_2 = 1\}.
\end{array}$$

We now want to decide on these two hypotheses via a stationary K-repeated observation. We are in the case of a simple (specifically, Discrete) o.s.; the optimal detector as given by Theorem 2.23 stems from the optimal solution $(\mu^*, \nu^*)$ to the convex optimization problem

$$\varepsilon_\star = \max_{\mu \in M_1,\ \nu \in M_2} \big[\sqrt{\mu_1\nu_1} + \sqrt{\mu_2\nu_2}\big]; \tag{2.93}$$

the optimal balanced single-observation detector is

$$\phi_*(\omega) = f_*^T\omega, \quad f_* = \tfrac{1}{2}\big[\ln(\mu_1^*/\nu_1^*);\ \ln(\mu_2^*/\nu_2^*)\big]$$

(recall that we encoded observations $\omega_k$ by basic orths from $\mathbf{R}^2$), the risk of this detector being $\varepsilon_\star$. In other words,

$$\mu^* = \big[\tfrac{1 + \delta}{2};\ \tfrac{1 - \delta}{2}\big], \quad \nu^* = \big[\tfrac{1 - \delta}{2};\ \tfrac{1 + \delta}{2}\big], \quad \varepsilon_\star = \sqrt{1 - \delta^2},$$
$$f_* = \tfrac{1}{2}\big[\ln((1 + \delta)/(1 - \delta));\ \ln((1 - \delta)/(1 + \delta))\big].$$

The optimal balanced K-observation detector and its risk are

$$\phi_*^{(K)}(\underbrace{\omega_1, ..., \omega_K}_{\omega^K}) = f_*^T(\omega_1 + ... + \omega_K), \qquad \varepsilon_\star^{(K)} = (1 - \delta^2)^{K/2}.$$
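Since the risk bound $\varepsilon_\star^{(K)} = (1 - \delta^2)^{K/2}$ is explicit, the corresponding poll size is a one-line computation. The following Python sketch assumes that the poll size is taken to be the smallest K for which this bound does not exceed the target risk; the function names are ours and purely illustrative.

```python
import math


def two_candidate_risk(delta, K):
    """Risk bound (1 - delta^2)^(K/2) of the K-observation detector-based test."""
    return (1.0 - delta * delta) ** (K / 2.0)


def poll_size(delta, eps):
    """Smallest K with (1 - delta^2)^(K/2) <= eps, for 0 < delta < 1, 0 < eps < 1."""
    # Closed form: K >= 2 ln(1/eps) / (-ln(1 - delta^2)).
    K = math.ceil(2.0 * math.log(1.0 / eps) / -math.log1p(-delta * delta))
    # Guard against floating-point rounding right at the boundary.
    while K > 1 and two_candidate_risk(delta, K - 1) <= eps:
        K -= 1
    while two_candidate_risk(delta, K) > eps:
        K += 1
    return K
```

For instance, `poll_size(0.1, 0.01)` returns 917: a gap of δ = 0.1 between the two candidates already calls for roughly nine hundred respondents at risk level 0.01.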

The near-optimal K-observation test $\mathcal{T}^K_{\phi_*}$ accepts $H_1$ and rejects $H_2$ if $\phi_*^{(K)}(\omega^K) \ge 0$; otherwise it accepts $H_2$ and rejects $H_1$. Both risks of this test do not exceed $\varepsilon_\star^{(K)}$.

Given risk level $\epsilon$, we can identify the minimal “poll size” K for which the risks $\mathrm{Risk}_1$, $\mathrm{Risk}_2$ of the test $\mathcal{T}^K_{\phi_*}$ do not exceed $\epsilon$. This poll size depends on $\epsilon$ and on our a priori “hypotheses separation” parameter $\delta$: $K = K_\epsilon(\delta)$. Some impression on this size can be obtained from Table 2.1, where, as in all subsequent “election illustrations,” $\epsilon$ is set to 0.01. We see that while poll sizes for “landslide” elections are surprisingly low, reliable prediction of the results of “close run” elections requires surprisingly high sizes of the polls. Note that this phenomenon reflects reality (to the extent to which the reality is captured by our model).$^{14}$ Indeed, from Proposition 2.25 we know that our poll size is within an explicit factor, depending solely on $\epsilon$, from the “ideal” poll sizes, that is, the smallest ones which allow to decide upon $H_1$, $H_2$ with risk $\le \epsilon$. For $\epsilon = 0.01$, this factor is about 2.85, meaning that when $\delta = 0.01$, the ideal poll size is larger than 32,000.

  δ                      0.5623  0.3162  0.1778  0.1000  0.0562  0.0316  0.0177  0.0100
  K_{0.01}(δ), L = 2         25      88     287     917    2908    9206   29118   92098
  K_{0.01}(δ), L = 5         32     114     373    1193    3784   11977   37885  119745

Table 2.1: Sample of values of poll size $K_{0.01}(\delta)$ as a function of $\delta$ for 2-candidate (L = 2) and 5-candidate (L = 5) elections. Values of $\delta$ form a geometric progression with ratio $10^{-1/4}$.

$^{14}$ In actual opinion polls, additional information is used. For instance, in reality voters can be split into groups according to their age, sex, education, income, etc., with variability of preferences within a group essentially lower than across the entire population. When planning a poll, respondents are selected at random within these groups, with a prearranged number of selections in every group, and their preferences are properly weighted, yielding more accurate predictions as compared to the case when the respondents are selected from the uniform distribution. In other words, in actual polls a nontrivial a priori information on the “true” distribution of preferences is used, something we do not have in our naive model.

In fact, we can easily construct more accurate “numerical” lower bounds on the sizes of ideal polls, specifically, as follows. When computing the optimal detector $\phi_*$, we get, as a byproduct, two distributions, $\mu^*$, $\nu^*$, obeying $H_1$, $H_2$, respectively. Denoting by $\mu^*_K$ and $\nu^*_K$ the distributions of K-element i.i.d. samples drawn from $\mu^*$ and $\nu^*$, the risk of deciding on two simple hypotheses on the distribution of $\omega^K$, stating that this distribution is $\mu^*_K$ and $\nu^*_K$, respectively, can be only smaller than the risk of deciding on $H_1$, $H_2$ via K-repeated stationary observations. On the other hand, the former risk can be lower-bounded by one half of the total risk of deciding on our two simple hypotheses, and the latter risk admits a sharp lower bound given by Proposition 2.2, namely,

$$\sum_{i_1, ..., i_K \in \{1,2\}} \min\Big[\prod_\ell \mu^*_{i_\ell},\ \prod_\ell \nu^*_{i_\ell}\Big] = \mathbf{E}_{(i_1, ..., i_K)}\Big\{\min\Big[\prod_\ell (2\mu^*_{i_\ell}),\ \prod_\ell (2\nu^*_{i_\ell})\Big]\Big\},$$

with the expectation taken w.r.t. independent tuples of K integers taking values 1 and 2 with probabilities 1/2. Of course, when K is in the range of a few tens and more, we cannot compute the $2^K$-term sum above exactly; however, we can use Monte Carlo simulation in order to estimate the sum reliably with moderate accuracy, like 0.005, and use this estimate to lower-bound the value of K for which an “ideal” K-observation test decides on $H_1$, $H_2$ with risks $\le 0.01$. Here are the resulting lower bounds (along with upper bounds from Table 2.1):

  δ                  0.5623  0.3162  0.1778  0.1000  0.0562  0.0316  0.0177  0.0100
  $\underline{K}$        14      51     166     534    1699    5379   17023   53820
  $\overline{K}$         25      88     287     917    2908    9206   29122   92064

Lower ($\underline{K}$) and upper ($\overline{K}$) bounds on the “ideal” poll sizes

We see that the poll sizes as yielded by our machinery are within factor 2 of the “ideal” poll sizes.

Clearly, the outlined approach can be extended to L-candidate elections with $L \ge 2$. In our model of the corresponding problem we decide, via stationary K-repeated observations drawn from unknown probability distribution $\mu$ on L-element set, on L hypotheses

$$H_\ell : \mu \in M_\ell = \Big\{\mu \in \mathbf{R}^d : \mu_i \ge \tfrac{1}{N},\ i \le L,\ \sum_i \mu_i = 1,\ \mu_\ell \ge \mu_{\ell'} + \delta\ \forall (\ell' \ne \ell)\Big\}, \quad \ell \le L. \tag{2.94}$$

Here $\delta > 0$ is a threshold selected in advance small enough to believe that the actual preferences of the voters correspond to $\mu \in \bigcup_\ell M_\ell$. Defining closeness $\mathcal{C}$ in the strongest possible way ($H_\ell$ is close to $H_{\ell'}$ if and only if $\ell = \ell'$), predicting the outcome of elections with risk $\epsilon$ becomes the problem of deciding upon our multiple hypotheses with $\mathcal{C}$-risk $\le \epsilon$. Thus, we can use pairwise detectors yielded by Theorem 2.23 to identify the smallest possible $K = K_\epsilon$ such that the test $\mathcal{T}^K_{\mathcal{C}}$ from Section 2.5.2.3 is capable of deciding upon our L hypotheses with $\mathcal{C}$-risk $\le \epsilon$. A numerical illustration of the performance of this approach in 5-candidate elections is presented in Table 2.1 (where $\epsilon$ is set to 0.01).

2.6.2

Sequential hypothesis testing

In view of the above analysis, when predicting outcomes of “close run” elections, huge poll sizes are necessary. This, however, does not mean that nothing can be done in order to build more reasonable opinion polls. The classical related statistical idea, going back to Wald [236], is to pass to sequential tests where the observations are processed one by one, and at every instant we either accept some of our hypotheses and terminate, or conclude that the observations obtained so far are insufficient to make a reliable inference and pass to the next observation. The idea is that a properly built sequential test, while still ensuring a desired risk, will be able to make “early decisions” in the case when the distribution underlying observations is “well inside” the true hypothesis and thus is far from the alternatives.

Figure 2.6: 3-candidate hypotheses in probabilistic simplex $\Delta_3$.
  [area A] $M_1$: dark tetragon + light border strip: candidate A wins with margin $\ge \delta_S$
  [area A] $M_1^s$: dark tetragon: candidate A wins with margin $\ge \delta_s > \delta_S$
  [area B] $M_2$: dark tetragon + light border strip: candidate B wins with margin $\ge \delta_S$
  [area B] $M_2^s$: dark tetragon: candidate B wins with margin $\ge \delta_s > \delta_S$
  [area C] $M_3$: dark tetragon + light border strip: candidate C wins with margin $\ge \delta_S$
  [area C] $M_3^s$: dark tetragon: candidate C wins with margin $\ge \delta_s > \delta_S$
  $\mathcal{C}_s$ closeness: hypotheses in the tuple $\{G^s_{2\ell-1} : \mu \in M_\ell,\ G^s_{2\ell} : \mu \in M^s_\ell,\ 1 \le \ell \le 3\}$ are not $\mathcal{C}_s$-close to each other if the corresponding M-sets belong to different areas and at least one of the sets is painted dark, like $M_1^s$ and $M_2$, but not $M_1$ and $M_2$.

Let us show how our machinery can be utilized to conceive a sequential test for the problem of predicting the outcome of L-candidate elections. Thus, our goal is, given a small threshold $\delta$, to decide upon L hypotheses (2.94). Let us act as follows.

1. We select a factor $\theta \in (0, 1)$, say, $\theta = 10^{-1/4}$, and consider thresholds $\delta_1 = \theta$, $\delta_2 = \theta\delta_1$, $\delta_3 = \theta\delta_2$, and so on, until for the first time we get a threshold $\le \delta$; to save notation, we assume that this threshold is exactly $\delta$, and let the number of the thresholds be S.
2. We split somehow (e.g., equally) the risk $\epsilon$ which we want to guarantee into S portions $\epsilon_s$, $1 \le s \le S$, so that $\epsilon_s$ are positive and

$$\sum_{s=1}^{S} \epsilon_s = \epsilon.$$

3. For $s \in \{1, 2, ..., S\}$, we define, along with the hypotheses $H_\ell$, the hypotheses

$$H^s_\ell : \mu \in M^s_\ell = \{\mu \in M_\ell : \mu_\ell \ge \mu_{\ell'} + \delta_s,\ \forall (\ell' \ne \ell)\}, \quad \ell = 1, ..., L$$

(see Figure 2.6), and introduce 2L hypotheses $G^s_{2\ell-1} = H_\ell$ and $G^s_{2\ell} = H^s_\ell$, $1 \le \ell \le L$. It is convenient to color these hypotheses in L colors, with $G^s_{2\ell-1} = H_\ell$ and $G^s_{2\ell} = H^s_\ell$ assigned color $\ell$. We define also s-th closeness $\mathcal{C}_s$ as follows:
   When $s < S$, hypotheses $G^s_i$ and $G^s_j$ are $\mathcal{C}_s$-close to each other if either they are of the same color, or they are of different colors and both of them have odd indices (that is, one of them is $H_\ell$, and another one is $H_{\ell'}$ with $\ell \ne \ell'$).

   When $s = S$ (in this case $G^S_{2\ell-1} = H_\ell = G^S_{2\ell}$), hypotheses $G^S_\ell$ and $G^S_{\ell'}$ are $\mathcal{C}_S$-close to each other if and only if they are of the same color, i.e., both coincide with the same hypothesis $H_\ell$.
   Observe that $G^s_i$ is a convex hypothesis:

$$G^s_i : \mu \in Y^s_i \qquad [Y^s_{2\ell-1} = M_\ell,\ Y^s_{2\ell} = M^s_\ell].$$

   The key observation is that when $G^s_i$ and $G^s_j$ are not $\mathcal{C}_s$-close, sets $Y^s_i$ and $Y^s_j$ are “separated” by at least $\delta_s$, meaning that for some vector $e \in \mathbf{R}^L$ with just two nonvanishing entries, equal to 1 and −1, we have

$$\min_{\mu \in Y^s_i} e^T\mu \ge \delta_s + \max_{\mu \in Y^s_j} e^T\mu. \tag{2.95}$$

   Indeed, let $G^s_i$ and $G^s_j$ not be $\mathcal{C}_s$-close to each other. That means that the hypotheses are of different colors, say, $\ell$ and $\ell' \ne \ell$, and at least one of them has even index. W.l.o.g. we can assume that the even-indexed hypothesis is $G^s_i$, so that

$$Y^s_i \subset \{\mu : \mu_\ell - \mu_{\ell'} \ge \delta_s\},$$

   while $Y^s_j$ is contained in the set $\{\mu : \mu_{\ell'} \ge \mu_\ell\}$. Specifying e as the vector with just two nonzero entries, $\ell$-th equal to 1 and $\ell'$-th equal to −1, we ensure (2.95).

4. For $1 \le s \le S$, we apply the construction from Section 2.5.2.3 to identify the smallest $K = K(s)$ for which the test $\mathcal{T}_s$ yielded by this construction as applied to a stationary K-repeated observation allows us to decide on the hypotheses $G^s_1, ..., G^s_{2L}$ with $\mathcal{C}_s$-risk $\le \epsilon_s$. The required K exists due to the already mentioned separation of members in a pair of not $\mathcal{C}_s$-close hypotheses $G^s_i$, $G^s_j$. It is easily seen that $K(1) \le K(2) \le ... \le K(S - 1)$. However, it may happen that $K(S - 1) > K(S)$, the reason being that $\mathcal{C}_S$ is defined differently than $\mathcal{C}_s$ with $s < S$. We set

$$\mathcal{S} = \{s \le S : K(s) \le K(S)\}.$$

   For example, here is what we get in the L-candidate Opinion Poll problem when $S = 8$, $\delta = \delta_S = 0.01$, and for properly selected $\epsilon_s$ with $\sum_{s=1}^{8} \epsilon_s = 0.01$:

  L   K(1)  K(2)  K(3)  K(4)   K(5)   K(6)    K(7)    K(8)
  2    177   617  1829  5099  15704  49699  153299  160118
  5    208   723  2175  6204  19205  60781  188203  187718

   $S = 8$, $\delta_s = 10^{-s/4}$. $\mathcal{S} = \{1, 2, ..., 8\}$ when $L = 2$ and $\mathcal{S} = \{1, 2, ..., 6\} \cup \{8\}$ when $L = 5$.

5. Our sequential test $T_{\rm seq}$ works in attempts (stages) $s \in \mathcal{S}$: it tries to make conclusions after observing $K(s)$, $s \in \mathcal{S}$, realizations $\omega_k$ of $\omega$. At the $s$-th attempt, we apply the test $T_s$ to the collection $\omega^{K(s)}$ of observations obtained so far to decide on hypotheses $G^s_1, ..., G^s_{2L}$. If $T_s$ accepts some of these hypotheses and all accepted hypotheses are of the same color, say $\ell$, the sequential test accepts the hypothesis $H_\ell$ and terminates; otherwise we continue to observe the realizations of $\omega$ (if $s < S$) or terminate with no hypotheses accepted/rejected (if $s = S$). It is easily seen that the risk of the outlined sequential test $T_{\rm seq}$ does not exceed $\epsilon$, meaning that whatever be the distribution $\mu \in \bigcup_{\ell=1}^{L} M_\ell$ underlying observations $\omega_1, \omega_2, ..., \omega_{K(S)}$ and $\ell_*$ such that $\mu \in M_{\ell_*}$, the $\mu$-probability of the event

“$T_{\rm seq}$ accepts exactly one hypothesis, namely, $H_{\ell_*}$”

is at least $1 - \epsilon$.

Indeed, observe, first, that the sequential test always accepts at most one of the hypotheses H1 , ..., HL . Second, let ωk ∼ µ with µ obeying Hℓ∗ . Consider events Es , s ∈ S, defined as follows:

• when $s < S$, $E_s$ is the event “the test $T_s$ as applied to observation $\omega^{K(s)}$ does not accept the true hypothesis $G^s_{2\ell_*-1} = H_{\ell_*}$”;
• $E_S$ is the event “as applied to observation $\omega^{K(S)}$, the test $T_S$ does not accept the true hypothesis $G^S_{2\ell_*-1} = H_{\ell_*}$ or accepts a hypothesis not $\mathcal{C}_S$-close to $G^S_{2\ell_*-1}$.”

Note that by our selection of $K(s)$’s, the $\mu$-probability of $E_s$ does not exceed $\epsilon_s$, so that the $\mu$-probability of none of the events $E_s$, $s \in \mathcal{S}$, taking place is at least $1 - \epsilon$. To justify the above claim on the risk of the sequential test, all we need to verify is that when none of the events $E_s$, $s \in \mathcal{S}$, takes place, the sequential test accepts the true hypothesis $H_{\ell_*}$. Verification is immediate: let the observations be such that none of the $E_s$’s takes place. We claim that in this case

(a) The sequential test does accept a hypothesis: if this does not happen at the $s$-th attempt with some $s < S$, it definitely happens at the $S$-th attempt.

Indeed, since $E_S$ does not take place, $T_S$ accepts $G^S_{2\ell_*-1}$ and all other hypotheses, if any, accepted by $T_S$ are $\mathcal{C}_S$-close to $G^S_{2\ell_*-1}$, implying by construction of $\mathcal{C}_S$ that $T_S$ does accept hypotheses, and all these hypotheses are of the same color. That is, the sequential test at the $S$-th attempt does accept a hypothesis.

(b) The sequential test does not accept a wrong hypothesis.

Indeed, assume that the sequential test accepts a wrong hypothesis $H_{\ell'}$, $\ell' \ne \ell_*$, and that this happens at the $s$-th attempt; let us show that this assumption leads to a contradiction. Observe that under our assumption the test $T_s$ as applied to observation $\omega^{K(s)}$ does accept some hypothesis $G^s_i$ of color $\ell'$, but does not accept the true hypothesis $G^s_{2\ell_*-1} = H_{\ell_*}$. Indeed, were $G^s_{2\ell_*-1}$ accepted, its color, which is $\ell_*$, would have to be the same as the color $\ell'$ of $G^s_i$, since the sequential test accepts $H_{\ell'}$ at the $s$-th attempt only when all hypotheses accepted by $T_s$ are of the same color; this is impossible because $\ell' \ne \ell_*$. On the other hand, we are in the case where $E_s$ does not take place, that is, $T_s$ does accept the true hypothesis $G^s_{2\ell_*-1}$, and we arrive at the desired contradiction.

(a) and (b) provide us with the verification we were looking for.
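The stage logic of the sequential test described in item 5 can be sketched in a few lines of Python. This is only an illustrative sketch: the callable `run_stage_test` is a hypothetical stand-in for the test $T_s$ applied to the first $K(s)$ observations (the construction of $T_s$ itself is not reproduced here), and the function names are ours, not the book's.

```python
from typing import Callable, Iterable, Optional, Set, Tuple


def color(i: int) -> int:
    """Color of hypothesis G^s_i: indices 2l-1 and 2l both carry color l."""
    return (i + 1) // 2


def sequential_test(
    stages: Iterable[int],
    run_stage_test: Callable[[int], Set[int]],
) -> Optional[Tuple[int, int]]:
    """Run the attempts in increasing order of s.

    run_stage_test(s) is assumed (hypothetically) to return the set of
    indices i in {1, ..., 2L} of the hypotheses G^s_i accepted by T_s.
    Returns (stage, accepted color l), or None when no attempt ends with
    a nonempty set of accepted hypotheses of a single color.
    """
    for s in sorted(stages):
        accepted = run_stage_test(s)
        colors = {color(i) for i in accepted}
        if len(colors) == 1:  # something is accepted, and all of one color
            return s, colors.pop()
    return None
```

For instance, if the first attempt accepts nothing, the second accepts hypotheses of two different colors, and the third accepts $G^3_3$ and $G^3_4$ (both of color 2), the sketch terminates at the third attempt and accepts $H_2$.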

Discussion and illustration. It can be easily seen that when $\epsilon_s = \epsilon/S$ for all $s$, the worst-case duration $K(S)$ of our sequential test is within a factor, logarithmic in $SL$, of the duration of any other test capable of deciding on our $L$ hypotheses with risk $\epsilon$. At the same time it is easily seen that when the distribution $\mu$ of our observation is “deeply inside” some set $M_\ell$, specifically, $\mu \in M^s_\ell$ for some $s \in \mathcal{S}$, $s < S$, then the $\mu$-probability to terminate not later than just after $K(s)$ realizations $\omega_k$ of $\omega \sim \mu$ are observed and to infer correctly what is the true hypothesis is at least $1 - \epsilon$. Informally speaking, in the case of “landslide” elections, a reliable prediction of the elections’ outcome will be made after a relatively small number of respondents are interviewed.

Indeed, let $s \in \mathcal{S}$ and $\omega_k \sim \mu \in M^s_\ell$, so that $\mu$ obeys the hypothesis $G^s_{2\ell}$. Consider the $s$ events $E_t$, $1 \le t \le s$, defined as follows:
• For $t < s$, $E_t$ occurs when the sequential test terminates at attempt $t$ by accepting, instead of $H_\ell$, the wrong hypothesis $H_{\ell'}$, $\ell' \ne \ell$. Note that $E_t$ can take place only when $T_t$ does not accept the true hypothesis $G^t_{2\ell-1} = H_\ell$, and the $\mu$-probability of this outcome is $\le \epsilon_t$.
• $E_s$ occurs when $T_s$ does not accept the true hypothesis $G^s_{2\ell}$ or accepts it along with some hypothesis $G^s_j$, $1 \le j \le 2L$, of color different from $\ell$. Note that we are in the situation where the hypothesis $G^s_{2\ell}$ is true, and, by construction of $\mathcal{C}_s$, all hypotheses $\mathcal{C}_s$-close to $G^s_{2\ell}$ are of the same color $\ell$ as $G^s_{2\ell}$. Recalling what $\mathcal{C}_s$-risk is and that the $\mathcal{C}_s$-risk of $T_s$ is $\le \epsilon_s$, we conclude that the $\mu$-probability of $E_s$ is at most $\epsilon_s$.

The bottom line is that the $\mu$-probability of the event $\bigcup_{t \le s} E_t$ is at most $\sum_{t=1}^{s} \epsilon_t \le \epsilon$; by construction of the sequential test, if the event $\bigcup_{t \le s} E_t$ does not take place, the test terminates in the course of the first $s$ attempts by accepting the correct hypothesis $H_\ell$. Our claim is justified.
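To make the “landslide” claim concrete, the following Python sketch looks up, for a given winning margin, the earliest attempt $s \in \mathcal{S}$ with $\delta_s$ not exceeding the margin, and the corresponding number $K(s)$ of observations by which the sequential test terminates correctly with probability at least $1 - \epsilon$. The values $K(s)$ are those reported in item 4 for $S = 8$, $\delta_s = 10^{-s/4}$; the function names and the dictionary `K_TABLE` are our own illustrative choices, and the additional constraint $\mu_i \ge 1/N$ from the definition of $M_\ell$ is ignored.

```python
# K(1), ..., K(8) from the example in item 4 (S = 8, delta_s = 10**(-s/4)).
K_TABLE = {
    2: [177, 617, 1829, 5099, 15704, 49699, 153299, 160118],
    5: [208, 723, 2175, 6204, 19205, 60781, 188203, 187718],
}


def stage_set(K):
    """The set of attempts {s <= S : K(s) <= K(S)}, with s counted from 1."""
    S = len(K)
    return [s for s in range(1, S + 1) if K[s - 1] <= K[S - 1]]


def guaranteed_stop(K, margin):
    """Earliest attempt s in the stage set with delta_s <= margin.

    Returns (s, K(s)), or None when the margin is below delta_S, in which
    case the distribution obeys none of the hypotheses H_l.
    """
    for s in stage_set(K):
        if 10 ** (-s / 4) <= margin:
            return s, K[s - 1]
    return None
```

With these data, a margin of 0.2 in a 2-candidate election yields the third attempt, that is, 1829 observations instead of the worst-case 160118, while a margin of 0.02 in a 5-candidate election falls through to the last attempt because $s = 7$ is not in $\mathcal{S}$ when $L = 5$.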

Numerical illustration. To get an impression of the “power” of sequential hypothesis testing, here are the data on the durations of non-sequential and sequential tests with risk $\epsilon = 0.01$ for various values of $\delta$; in the sequential tests, $\theta = 10^{-1/4}$ is used. The worst-case data for 2-candidate and 5-candidate elections are as follows (below, “volume” stands for the number of observations used by the test):

δ         K, L = 2    S / K(S), L = 2    K, L = 5    S / K(S), L = 5
0.5623          25    1 / 25                   32    1 / 32
0.3162          88    2 / 152                 114    2 / 179
0.1778         287    3 / 499                 373    3 / 585
0.1000         917    4 / 1594               1193    4 / 1870
0.0562        2908    5 / 5056               3784    5 / 5931
0.0316        9206    6 / 16005             11977    6 / 18776
0.0177       29118    7 / 50624             37885    7 / 59391
0.0100       92098    8 / 160118           119745    8 / 187720

Volume $K$ of non-sequential test, number $S$ of stages, and worst-case volume $K(S)$ of sequential test as functions of threshold $\delta = \delta_S$. Risk $\epsilon$ is set to 0.01.

As it should be, the worst-case volume of the sequential test is significantly larger than the volume of the non-sequential test.^{15} This being said, look at what happens in the “average,” rather than the worst, case; specifically, let us look at the empirical distribution of the volume when the distribution $\mu$ of observations is selected in the $L$-dimensional probabilistic simplex $\Delta_L = \{\mu \in \mathbf{R}^L : \mu \ge 0, \sum_\ell \mu_\ell = 1\}$ at random. Here are the empirical statistics of test volume obtained when drawing $\mu$ from the uniform distribution on $\bigcup_{\ell \le L} M_\ell$ and running the sequential test^{16} on observations drawn from the selected $\mu$:

L    risk     median    mean    60%    65%    70%     75%     80%     85%     90%      95%     100%
2    0.0010      177    9182    177    397    617     617    1223    1829    8766    87911   160118
5    0.0040     1449   18564   2175   4189   6204   12704   19205   39993   60781   124249   187718

Parameters (columns “median, mean”) and quantiles (columns “60%,..., 100%”) of the sample distribution of the observation volume of the Sequential test for a given empirical risk (column “risk”). The data in the table are obtained from 1,000 experiments.

We see that with the Sequential test, “typical” numbers of observations before termination are much less than the worst-case values of these numbers. For example, in as many as 80% of experiments these numbers were below quite reasonable levels, at least in the case $L = 2$. Of course, what is “typical,” and what is not, depends on how we generate $\mu$’s (this is called “prior Bayesian distribution”). Were our generation more likely to produce “close run” distributions, the advantages of sequential decision making would be reduced. This ambiguity is, however, unavoidable when attempting to go beyond worst-case-oriented analysis.

^{15} The reason is twofold: first, for $s < S$ we pass from deciding on $L$ hypotheses to deciding on $2L$ of them; second, the desired risk $\epsilon$ is now distributed among several tests, so that each of them should be more reliable than the non-sequential test with risk $\epsilon$.
^{16} Corresponding to $\delta = 0.01$, $\theta = 10^{-1/4}$ and $\epsilon = 0.01$.

2.6.3 Concluding remarks

Application of our machinery to sequential hypothesis testing is in no sense restricted to the simple election model considered so far. A natural general setup we can handle is as follows: We are given a simple observation scheme O and a number L of related convex hypotheses, colored in d colors, on the distribution of an observation, with distributions obeying hypotheses of different colors being distinct from each other. Given the risk level ǫ, we want to decide (1 − ǫ)-reliably on the color of the distribution underlying observations (i.e., the color of the hypothesis obeyed by this distribution) from stationary K-repeated observations, utilizing as small a number of observations as possible. For detailed description of related constructions and results, an interested reader is referred to [134].
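Returning briefly to the numerical illustration given earlier in this section, the price and the payoff of sequential testing can be read off the reported tables with a few lines of arithmetic. The Python sketch below only re-uses the published numbers; the variable and function names are ours.

```python
# Worst-case volumes from the table of non-sequential vs. sequential tests,
# listed for delta = 0.5623, 0.3162, ..., 0.0100 (eight thresholds).
K_NONSEQ = {
    2: [25, 88, 287, 917, 2908, 9206, 29118, 92098],
    5: [32, 114, 373, 1193, 3784, 11977, 37885, 119745],
}
K_SEQ_WORST = {
    2: [25, 152, 499, 1594, 5056, 16005, 50624, 160118],
    5: [32, 179, 585, 1870, 5931, 18776, 59391, 187720],
}
# Median volumes of the sequential test at delta = 0.01 (empirical table).
MEDIAN_SEQ = {2: 177, 5: 1449}


def worst_case_overhead(L):
    """Ratio K(S) / K for each threshold: the price of going sequential."""
    return [ks / k for ks, k in zip(K_SEQ_WORST[L], K_NONSEQ[L])]


def typical_saving(L):
    """Median sequential volume relative to the non-sequential volume at delta = 0.01."""
    return MEDIAN_SEQ[L] / K_NONSEQ[L][-1]
```

At $\delta = 0.01$ the worst-case overhead is about 1.74 for $L = 2$ and about 1.57 for $L = 5$, while the median volume of the sequential test is roughly 0.2% and 1.2%, respectively, of the non-sequential volume, under the particular random generation of $\mu$ used in the experiments.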

2.7 MEASUREMENT DESIGN IN SIMPLE OBSERVATION SCHEMES

2.7.1 Motivation: Opinion polls revisited

Consider the same situation as in Section 2.6.1—we want to use an opinion poll to predict the winner in a population-wide election with L candidates. When addressing this situation earlier, no essential a priori information on the distribution of voters’ preferences was available. Now consider the case when the population is split into I groups (according to age, sex, income, etc., etc.), with the i-th group forming the fraction θi of the entire population, and we have at our disposal, at least for some i, nontrivial a priori information about the distribution pi of the preferences across group # i (the ℓ-th entry piℓ in pi is the fraction of voters of group i voting for candidate ℓ). For instance, we could know in advance that at least 90% of members of group #1 vote for candidate #1, and at least 85% of members of group #2 vote for candidate #2; no information of this type for group #3 is available. In this situation it would be wise to select respondents in the poll via a two-stage procedure, first selecting at random, with probabilities q1 , ..., qI , the group from which the next respondent will be picked, and second selecting the respondent from this group at random according to the uniform distribution on the group. When the qi are proportional to the sizes of the groups (i.e., qi = θi for all i), we come back to selecting respondents at random from the uniform distribution over the entire population. The point, however, is that in the presence of a priori information, it makes sense to use qi different from θi , specifically, to make the


ratios $q_i/\theta_i$ “large” or “small” depending on whether a priori information on group #i is poor or rich.

The story we have just told is an example of a situation in which we can “design measurements,” that is, draw observations from a distribution which is partly under our control. Indeed, what in fact happens in the story is the following. “In nature” there exist $I$ probabilistic vectors $p^1, ..., p^I$ of dimension $L$ representing distributions of voting preferences within the corresponding groups; the distribution of preferences across the entire population is $p = \sum_i \theta_i p^i$. With the two-stage selection of respondents, the outcome of a particular interview becomes a pair $(i, \ell)$, with $i$ identifying the group to which the respondent belongs, and $\ell$ identifying the candidate preferred by this respondent. In subsequent interviews, the pairs $(i, \ell)$ (these are our observations) are drawn, independently of each other, from the probability distribution on the pairs $(i, \ell)$, $i \le I$, $\ell \le L$, with the probability of an outcome $(i, \ell)$ equal to $p(i, \ell) = q_i p^i_\ell$. Thus, we find ourselves in the situation of stationary repeated observations stemming from the Discrete o.s. with observation space $\Omega$ of cardinality $IL$; the distribution from which the observations are drawn is a probabilistic vector $\mu$ of the form $\mu = Ax$, where
• $x = [p^1; ...; p^I]$ is the “signal” underlying our observations and representing the preferences of the population; this signal is selected by nature in the set $\mathcal{X}$ known to us and defined in terms of our a priori information on $p^1, ..., p^I$:
$$\mathcal{X} = \{x = [x^1; ...; x^I] : x^i \in \Pi_i, \ 1 \le i \le I\}, \tag{2.96}$$
where the $\Pi_i$ are the sets, given by our a priori information, of possible values of the preference vectors $p^i$ of the voters from the $i$-th group. In the sequel, we assume that the $\Pi_i$ are convex compact subsets of the positive part $\Delta^o_L = \{p \in \mathbf{R}^L : p > 0, \sum_\ell p_\ell = 1\}$ of the $L$-dimensional probabilistic simplex;
• $A$ is a “sensing matrix” which, to some extent, is under our control; specifically,
$$A[x^1; ...; x^I] = [q_1 x^1; q_2 x^2; ...; q_I x^I], \tag{2.97}$$

with $q = [q_1; ...; q_I]$ fully controlled by us (up to the fact that $q$ must be a probabilistic vector).

Note that in the situation under consideration the hypotheses we want to decide upon can be represented by convex sets in the space of signals, with a particular hypothesis stating that the observations stem from a distribution $\mu$ on $\Omega$, with $\mu$ belonging to the image of some convex compact set $X_\ell \subset \mathcal{X}$ under the mapping $x \mapsto \mu = Ax$. For example, when $\nu = \sum_i \theta_i x^i$, the hypotheses
$$H_\ell : \nu \in M_\ell = \Big\{\nu \in \mathbf{R}^L : \sum_j \nu_j = 1, \ \nu_j \ge \tfrac{1}{N} \ \forall j, \ \nu_\ell \ge \nu_{\ell'} + \delta \ \ \forall (\ell' \ne \ell)\Big\}, \quad 1 \le \ell \le L,$$
considered in Section 2.6.1 can be expressed in terms of the signal $x = [x^1; ...; x^I]$:
$$H_\ell : \mu = Ax, \ x \in X_\ell = \left\{x = [x^1; ...; x^I] : \begin{array}{l} x^i \ge 0, \ \sum_\ell x^i_\ell = 1 \ \ \forall i \le I, \\ \sum_i \theta_i x^i_\ell \ge \sum_i \theta_i x^i_{\ell'} + \delta \ \ \forall (\ell' \ne \ell), \\ \sum_i \theta_i x^i_j \ge \tfrac{1}{N} \ \ \forall j \end{array}\right\}. \tag{2.98}$$

The challenge we intend to address is as follows: so far, we were interested in inferences from observations drawn from distributions selected “by nature.” Now our goal is to make inferences from observations drawn from a distribution selected partly by nature and partly by us: nature selects the signal $x$, we select the matrix $A$ from some set, and the observations are drawn from the distribution $Ax$. As a result, we arrive at a question completely new for us: how do we utilize the freedom in selecting $A$ in order to improve our inferences (this is somewhat similar to what is called “design of experiments” in Statistics)?

2.7.2 Measurement Design: Setup

In what follows we address measurement design in simple observation schemes, and our setup is as follows (to make our intentions transparent, we illustrate our general setup by explaining how it should be specified to cover the outlined two-stage Opinion Poll Design (OPD) problem). Given are
• a simple observation scheme $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$, specifically, Gaussian, Poisson, or Discrete, with $\mathcal{M} \subset \mathbf{R}^d$.
In OPD, $\mathcal{O}$ is the Discrete o.s. with $\Omega = \{(i, \ell) : 1 \le i \le I, 1 \le \ell \le L\}$, that is, points of $\Omega$ are the potential outcomes “reference group, preferred candidate” of individual interviews.
• a nonempty closed convex signal space $\mathcal{X} \subset \mathbf{R}^n$, along with $L$ nonempty convex compact subsets $X_\ell$ of $\mathcal{X}$, $\ell = 1, ..., L$.
In OPD, $\mathcal{X}$ is the set (2.96) comprised of tuples of allowed distributions of voters’ preferences from various groups, and $X_\ell$ are the sets (2.98) of signals associated with the hypotheses $H_\ell$ we intend to decide upon.
• a nonempty convex compact set $\mathcal{Q}$ in some $\mathbf{R}^N$ along with a continuous mapping $q \mapsto A_q$ acting from $\mathcal{Q}$ into the space of $d \times n$ matrices such that
$$\forall (x \in \mathcal{X}, q \in \mathcal{Q}) : A_q x \in \mathcal{M}. \tag{2.99}$$
In OPD, $\mathcal{Q}$ is the set of probabilistic vectors $q = [q_1; ...; q_I]$ specifying our measurement design, and $A_q$ is the matrix of the mapping (2.97).
• a closeness $\mathcal{C}$ on the set $\{1, ..., L\}$ (that is, a set $\mathcal{C}$ of pairs $(i, j)$ with $1 \le i, j \le L$ such that $(i, i) \in \mathcal{C}$ for all $i \le L$ and $(j, i) \in \mathcal{C}$ whenever $(i, j) \in \mathcal{C}$), and a positive integer $K$.
In OPD, the closeness $\mathcal{C}$ is as strict as it could be: $i$ is close to $j$ if and only if $i = j$,^{17} and $K$ is the total number of interviews in the poll.

^{17} This closeness makes sense when the goal of the poll is to predict the winner; a less ambitious goal, e.g., to decide whether the winner will or will not belong to a particular set of candidates, would require weaker closeness.
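For the OPD instance of this setup, the mapping (2.97) and the resulting observations are easy to simulate. The following Python sketch builds the distribution $\mu = A_q x$ on pairs (group, candidate) and draws $K$ interviews from it; it is a minimal illustration under our own naming (`opd_distribution`, `sample_interviews`) and with 0-based indices for groups and candidates, not an implementation taken from the text.

```python
import random


def opd_distribution(q, x):
    """mu = A_q x for the opinion poll: mu[(i, l)] = q[i] * x[i][l].

    q is the design (a probabilistic vector over groups); x[i] is the
    preference distribution of group i (a probabilistic vector over candidates).
    """
    return {(i, l): q[i] * x[i][l]
            for i in range(len(q))
            for l in range(len(x[i]))}


def sample_interviews(q, x, K, seed=0):
    """Draw K i.i.d. outcomes (group, candidate) from mu = A_q x."""
    rng = random.Random(seed)
    mu = opd_distribution(q, x)
    outcomes = list(mu)
    return rng.choices(outcomes, weights=[mu[o] for o in outcomes], k=K)
```

Since every $x^i$ and $q$ are probabilistic vectors, the entries of $\mu$ sum to one, which is the OPD form of requirement (2.99).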


We associate with $q \in \mathcal{Q}$ and $X_\ell$, $\ell \le L$, the nonempty convex compact sets $M^q_\ell$ in the space $\mathcal{M}$,
$$M^q_\ell = \{A_q x : x \in X_\ell\},$$
and hypotheses $H^q_\ell$ on $K$-repeated stationary observations $\omega^K = (\omega_1, ..., \omega_K)$, $H^q_\ell$ stating that the $\omega_k$, $k = 1, ..., K$, are drawn, independently of each other, from a distribution $\mu \in M^q_\ell$, $\ell = 1, ..., L$. Closeness $\mathcal{C}$ can be thought of as closeness on the collection of hypotheses $H^q_1, H^q_2, ..., H^q_L$. Given $q \in \mathcal{Q}$, we can use the construction from Section 2.5.2 in order to build the test $T^K_{\phi_*}$ deciding on the hypotheses $H^q_\ell$ up to closeness $\mathcal{C}$, the $\mathcal{C}$-risk of the test being the smallest allowed by the construction. Note that this $\mathcal{C}$-risk depends on $q$; the “Measurement Design” (MD for short) problem we are about to consider is to select $q \in \mathcal{Q}$ which minimizes the $\mathcal{C}$-risk of the associated test $T^K_{\phi_*}$.

2.7.3 Formulating the MD problem

By Proposition 2.30, the $\mathcal{C}$-risk of the test $T^K_{\phi_*}$ is upper-bounded by the spectral norm of the symmetric entrywise nonnegative $L \times L$ matrix
$$E^{(K)}(q) = [\epsilon_{\ell\ell'}(q)]_{\ell,\ell'},$$
and this is what we intend to minimize in our MD problem. In the above formula, $\epsilon_{\ell\ell'}(q) = \epsilon_{\ell'\ell}(q)$ are zeros if $(\ell, \ell') \in \mathcal{C}$. For $(\ell, \ell') \notin \mathcal{C}$ and $1 \le \ell < \ell' \le L$, the quantities $\epsilon_{\ell\ell'}(q) = \epsilon_{\ell'\ell}(q)$ are defined depending on what the simple o.s. $\mathcal{O}$ is. Specifically,
• In the case of the Gaussian observation scheme (see Section 2.4.5.1), restriction (2.99) does not restrain the dependence of $A_q$ on $q$ at all (modulo the default constraint that $A_q$ is a $d \times n$ matrix continuous in $q \in \mathcal{Q}$), and
$$\epsilon_{\ell\ell'}(q) = \exp\{K {\rm Opt}_{\ell\ell'}(q)\},$$
where
$${\rm Opt}_{\ell\ell'}(q) = \max_{x \in X_\ell, y \in X_{\ell'}} -\tfrac{1}{8} [A_q(x - y)]^T \Theta^{-1} [A_q(x - y)] \tag{$G_q$}$$
and $\Theta$ is the common covariance matrix of the Gaussian densities forming the family $\{p_\mu : \mu \in \mathcal{M}\}$;
• In the case of Poisson o.s. (see Section 2.4.5.2), restriction (2.99) requires $A_q x$ to be a positive vector whenever $q \in \mathcal{Q}$ and $x \in \mathcal{X}$, and
$$\epsilon_{\ell\ell'}(q) = \exp\{K {\rm Opt}_{\ell\ell'}(q)\},$$
where
$${\rm Opt}_{\ell\ell'}(q) = \max_{x \in X_\ell, y \in X_{\ell'}} \sum_i \Big[\sqrt{[A_q x]_i [A_q y]_i} - \tfrac{1}{2} [A_q x]_i - \tfrac{1}{2} [A_q y]_i\Big]; \tag{$P_q$}$$
• In the case of Discrete o.s. (see Section 2.4.5.3), restriction (2.99) requires $A_q x$ to be a positive probabilistic vector whenever $q \in \mathcal{Q}$ and $x \in \mathcal{X}$, and
$$\epsilon_{\ell\ell'}(q) = [{\rm Opt}_{\ell\ell'}(q)]^K,$$
where
$${\rm Opt}_{\ell\ell'}(q) = \max_{x \in X_\ell, y \in X_{\ell'}} \sum_i \sqrt{[A_q x]_i [A_q y]_i}. \tag{$D_q$}$$

The summary of the above observations is as follows. The norm $\|E^{(K)}\|_{2,2}$, the quantity we are interested in minimizing in $q \in \mathcal{Q}$, is, as a function of $q \in \mathcal{Q}$, of the form
$$\Psi(q) = \psi\big(\underbrace{\{{\rm Opt}_{\ell\ell'}(q) : (\ell, \ell') \notin \mathcal{C}\}}_{{\rm Opt}(q)}\big) \tag{2.100}$$

where the outer function $\psi$ is an explicitly given real-valued function on $\mathbf{R}^N$ ($N$ is the cardinality of the set of pairs $(\ell, \ell')$, $1 \le \ell, \ell' \le L$, with $(\ell, \ell') \notin \mathcal{C}$) which is convex and nondecreasing in each argument. Indeed, denoting by $\Gamma(S)$ the spectral norm of the $d \times d$ matrix $S$, note that $\Gamma$ is a convex function of $S$, and this function is nondecreasing in every one of the entries of $S$, provided that $S$ is restricted to be entrywise nonnegative.^{18} $\psi(\cdot)$ is obtained from $\Gamma(S)$ by substituting for the entries $S_{\ell\ell'}$ of $S$ certain (explicit everywhere) convex, nonnegative, and nondecreasing functions of variables $z = \{z_{\ell\ell'} : (\ell, \ell') \notin \mathcal{C}, 1 \le \ell, \ell' \le L\}$. Namely,
• when $(\ell, \ell') \in \mathcal{C}$, we set $S_{\ell\ell'}$ to zero;
• when $(\ell, \ell') \notin \mathcal{C}$, we set $S_{\ell\ell'} = \exp\{K z_{\ell\ell'}\}$ in the case of Gaussian and Poisson o.s.’s, and set $S_{\ell\ell'} = \max[0, z_{\ell\ell'}]^K$ in the case of Discrete o.s.
As a result, we indeed get a convex and nondecreasing, in every argument, function $\psi$ of $z \in \mathbf{R}^N$.

Now, the Measurement Design problem we want to solve reads
$${\rm Opt} = \min_{q \in \mathcal{Q}} \psi({\rm Opt}(q)). \tag{2.101}$$
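The outer function $\psi$ is simple enough to evaluate directly. The Python sketch below does so for the Discrete o.s.: it fills the symmetric matrix $S$ with the entries $\max[0, z_{\ell\ell'}]^K$ for the pairs that are not close and returns its spectral norm. The norm is approximated by power iteration applied to $S^T S$, which is our choice for a dependency-free illustration (convergence may be slow when the two largest singular values are close); all names are hypothetical, and the inner values ${\rm Opt}_{\ell\ell'}(q)$ are assumed to have been computed elsewhere.

```python
import math


def hellinger_affinity(mu, nu):
    """sum_i sqrt(mu_i * nu_i): the objective of (D_q) at a fixed pair of distributions."""
    return sum(math.sqrt(a * b) for a, b in zip(mu, nu))


def spectral_norm(S, iters=200):
    """Approximate the spectral norm of S via power iteration on S^T S."""
    n = len(S)
    v = [1.0] * n  # entrywise positive start, suitable for nonnegative S
    lam = 0.0
    for _ in range(iters):
        Sv = [sum(S[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [sum(S[i][j] * Sv[i] for i in range(n)) for j in range(n)]  # S^T (S v)
        norm_w = math.sqrt(sum(t * t for t in w))
        if norm_w == 0.0:
            return 0.0
        lam = norm_w / math.sqrt(sum(t * t for t in v))  # estimate of lambda_max(S^T S)
        v = [t / norm_w for t in w]
    return math.sqrt(lam)


def psi_discrete(opt, L, K):
    """psi for the Discrete o.s.

    opt maps 0-based pairs (l, lp), l < lp, of hypotheses that are NOT close
    to the values z = Opt_{l lp}(q); pairs missing from opt are treated as
    close and get a zero entry.
    """
    S = [[0.0] * L for _ in range(L)]
    for (l, lp), z in opt.items():
        S[l][lp] = S[lp][l] = max(0.0, z) ** K
    return spectral_norm(S)
```

For two hypotheses with the strictest closeness, $\psi$ reduces to the single entry ${\rm Opt}_{12}(q)^K$, so that minimizing $\psi$ over $q$ amounts to minimizing ${\rm Opt}_{12}(q)$.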

As we remember, the entries in the inner function ${\rm Opt}(q)$ are optimal values of solvable convex optimization problems and as such are efficiently computable. When these entries are also convex functions of $q \in \mathcal{Q}$, the objective in (2.101), due to the already established convexity and monotonicity properties of $\psi$, is a convex function of $q$, meaning that (2.101) is a convex and thus efficiently solvable problem. On the other hand, when some of the entries in ${\rm Opt}(q)$ are nonconvex in $q$, we can hardly expect (2.101) to be easy to solve. Unfortunately, convexity of the entries in ${\rm Opt}(q)$ in $q$ turns out to be a “rare commodity.” For example, we can verify by inspection that the objectives in $(G_q)$, $(P_q)$, and $(D_q)$ as functions of $A_q$ (not of $q$!) are concave rather than convex. Thus, the optimal values in the problems, as functions of $q$, are maxima, over the parameters, of parametric families of concave functions of $A_q$ (the parameters in these parametric families are the optimization variables in $(G_q)$ – $(D_q)$) and as such can hardly be convex as functions of $A_q$. And indeed, as a matter of fact, the MD problem usually is nonconvex and difficult to solve.

We intend to consider a “Simple case” where this difficulty does not arise, i.e., the case where the objectives of the optimization problems specifying ${\rm Opt}_{\ell\ell'}(q)$ are affine in $q$. In this case, ${\rm Opt}_{\ell\ell'}(q)$ as a function of $q$ is the maximum, over the parameters (optimization variables in the corresponding problems), of parametric families of affine functions of $q$ and as such is convex. Our current goal is to understand what our sufficient condition for tractability of the MD problem (affinity in $q$ of the objectives in the respective problems $(G_q)$, $(P_q)$, and $(D_q)$) actually means, and to show that this, by itself quite restrictive, assumption indeed takes place in some important applications.

^{18} The monotonicity follows from the fact that for an entrywise nonnegative $S$, we have
$$\|S\|_{2,2} = \max_{x,y} \{x^T S y : \|x\|_2 \le 1, \|y\|_2 \le 1\} = \max_{x,y} \{x^T S y : \|x\|_2 \le 1, \|y\|_2 \le 1, x \ge 0, y \ge 0\}.$$

2.7.3.1 Simple case, Discrete o.s.

Looking at the optimization problem $(D_q)$, we see that the simplest way to ensure that its objective is affine in $q$ is to assume that
$$A_q = {\rm Diag}\{Bq\} A, \tag{2.102}$$
where $A$ is some fixed $d \times n$ matrix, and $B$ is some fixed $d \times (\dim q)$ matrix such that $Bq$ is positive whenever $q \in \mathcal{Q}$. Indeed, under (2.102) we have $[A_q x]_i = [Bq]_i [Ax]_i$, so that the objective of $(D_q)$ becomes $\sum_i [Bq]_i \sqrt{[Ax]_i [Ay]_i}$, which is linear in $q$. On top of this, we should ensure that when $q \in \mathcal{Q}$ and $x \in \mathcal{X}$, $A_q x$ is a positive probabilistic vector; this amounts to some restrictions linking $\mathcal{Q}$, $\mathcal{X}$, $A$, and $B$.

Illustration. The Opinion Poll Design problem of Section 2.7.1 provides an instructive example of the Simple case of Measurement Design in Discrete o.s.: recall that in this problem the voting population is split into $I$ groups, with the $i$-th group constituting fraction $\theta_i$ of the entire population. The distribution of voters’ preferences in the $i$-th group is represented by an unknown $L$-dimensional probabilistic vector $x^i = [x^i_1; ...; x^i_L]$ ($L$ is the number of candidates, $x^i_\ell$ is the fraction of voters in the $i$-th group intending to vote for the $\ell$-th candidate), known to belong to a given convex compact subset $\Pi_i$ of the “positive part” $\Delta^o_L = \{x \in \mathbf{R}^L : x > 0, \sum_\ell x_\ell = 1\}$ of the $L$-dimensional probabilistic simplex. We are given threshold $\delta > 0$ and want to decide on $L$ hypotheses $H_1, ..., H_L$, with $H_\ell$ stating that the population-wide vector $y = \sum_{i=1}^{I} \theta_i x^i$ of voters’ preferences belongs to the closed convex set
$$Y_\ell = \Big\{y = \sum_{i=1}^{I} \theta_i x^i : x^i \in \Pi_i, \ 1 \le i \le I, \ y_\ell \ge y_{\ell'} + \delta \ \ \forall (\ell' \ne \ell)\Big\}.$$
Note that $Y_\ell$ is the image, under the linear mapping
$$[x^1; ...; x^I] \mapsto y(x) = \sum_i \theta_i x^i,$$
of the compact convex set
$$X_\ell = \{x = [x^1; ...; x^I] : x^i \in \Pi_i, \ 1 \le i \le I, \ y_\ell(x) \ge y_{\ell'}(x) + \delta \ \ \forall (\ell' \ne \ell)\},$$
which is a subset of the convex compact set
$$\mathcal{X} = \{x = [x^1; ...; x^I] : x^i \in \Pi_i, \ 1 \le i \le I\}.$$
The $k$-th poll interview is organized as follows: We draw at random a group among the $I$ groups of voters, with probability $q_i$ to draw the $i$-th group, and then draw at random, from the uniform distribution on the group, the respondent to be interviewed. The outcome of


the interview, our observation $\omega_k$, is the pair $(i, \ell)$, where $i$ is the group to which the respondent belongs, and $\ell$ is the candidate preferred by the respondent. This results in a sensing matrix $A_q$ (see (2.97)) which is in the form of (2.102), namely,
$$A_q = {\rm Diag}\{q_1 I_L, q_2 I_L, ..., q_I I_L\} \qquad [q \in \Delta_I],$$
and the outcome of the $k$-th interview is drawn at random from the discrete probability distribution $A_q x$, where $x \in \mathcal{X}$ is the “signal” summarizing voters’ preferences in the groups. Given the total number of observations $K$, our goal is to decide with a given risk $\epsilon$ on our $L$ hypotheses. Whether this goal is or is not achievable depends on $K$ and on $A_q$. What we want is to find $q$ for which the above goal can be attained with as small a $K$ as possible; in the case in question, this reduces to solving, for various trial values of $K$, problem (2.101), which under the circumstances is an explicit convex optimization problem.

To get an impression of the potential of Measurement Design, we present a sample of numerical results. In all reported experiments, we use $\delta = 0.05$, $\epsilon = 0.01$ and equal fractions $\theta_i = I^{-1}$ for all groups. The sets $\Pi_i$, $1 \le i \le I$, are generated as follows: we pick at random a probabilistic vector $\bar{p}^i$ of dimension $L$, and define $\Pi_i$ as the intersection of the box $\{p : \bar{p}^i_\ell - u_i \le p_\ell \le \bar{p}^i_\ell + u_i\}$ centered at $\bar{p}^i$ with the probabilistic simplex $\Delta_L$, where the $u_i$, $i = 1, ..., I$, are prescribed “uncertainty levels.” Note that uncertainty level $u_i \ge 1$ is the same as absence of any a priori information on the preferences of voters from the $i$-th group. The results of our numerical experiments are as follows:

L   I   Uncertainty levels u      K_ini   q_opt                        K_opt
2   2   [0.03;1.00]                1212   [0.437;0.563]                 1194
2   2   [0.02;1.00]                2699   [0.000;1.000]                 1948
3   3   [0.02;0.03;1.00]           3177   [0.000;0.455;0.545]           2726
5   4   [0.02;0.02;0.03;1.00]      2556   [0.000;0.131;0.322;0.547]     2086
5   4   [1.00;1.00;1.00;1.00]      4788   [0.250;0.250;0.250;0.250]     4788

Effect of measurement design: poll sizes required for 0.99-reliable winner prediction when $q = \theta$ (column $K_{\rm ini}$) and $q = q_{\rm opt}$ (column $K_{\rm opt}$).

We see that measurement design allows us to reduce (for some data, quite significantly) the volume of observations as compared to the straightforward selecting of the respondents uniformly across the entire population. To compare our current model and results with those from Section 2.6.1, note that now we have more a priori information on the true distribution of voting preferences due to some a priori knowledge of preferences within groups, which allows us to reduce the poll sizes with both straightforward and optimal measurement designs.^{19} On the other hand, the difference between $K_{\rm ini}$ and $K_{\rm opt}$ is fully due to the measurement design.

^{19} To illustrate this point, look at the last two lines in the table: utilizing a priori information allows us to reduce the poll size from 4,788 to 2,556 even with the straightforward measurement design.

Comparative drug study. A Simple case of the Measurement Design in Discrete o.s. related to OPD and perhaps more interesting is as follows. Suppose that now,


instead of $L$ competing candidates running for an office we have $L$ competing drugs, and the population of patients the drugs are aimed at rather than the population of voters. For the sake of simplicity, assume that when a particular drug is administered to a particular patient, the outcome is binary: (positive) “effect” or “no effect” (what follows can be easily extended to the case of non-binary categorical outcomes, like “strong positive effect,” “weak positive effect,” “negative effect,” and the like). Our goal is to organize a clinical study in order to decide on comparative drug efficiency, measured by the percentage of patients on which a particular drug has effect. The difference with organizing an opinion poll is that now we cannot just ask a respondent what his or her preferences are; we may only administer to a participant of the study a single drug of our choice and look at the result.

As in the OPD problem, we assume that the population of patients is split into $I$ groups, with the $i$-th group comprising a fraction $\theta_i$ of the entire population. We model the situation as follows. We associate with a patient a Boolean vector of dimension $2L$, with the $\ell$-th entry in the vector equal to 1 or 0 depending on whether drug #$\ell$ has effect on the patient, and the $(L+\ell)$-th entry complementing the $\ell$-th one to 1 (that is, if the $\ell$-th entry is $\chi$, then the $(L+\ell)$-th entry is $1 - \chi$). Let $x^i$ be the average of these vectors over patients from group $i$. We define the “signal” $x$ underlying our measurements as the vector $[x^1; ...; x^I]$ and assume that our a priori information allows us to localize $x$ in a closed convex subset $\mathcal{X}$ of the set
$$\mathcal{Y} = \{x = [x^1; ...; x^I] : x^i \ge 0, \ x^i_\ell + x^i_{L+\ell} = 1, \ 1 \le i \le I, \ 1 \le \ell \le L\}$$
to which all our signals belong by construction. Note that the vector
$$y = Bx = \sum_i \theta_i x^i$$
can be treated as a “population-wise distribution of drug effects”: $y_\ell$, $\ell \le L$, is the fraction, in the entire population of patients, of those patients on whom drug $\ell$ has effect, and $y_{L+\ell} = 1 - y_\ell$. As a result, typical hypotheses related to comparison of the drugs, like “drug $\ell$ has effect on a larger fraction, at least by margin $\delta$, of patients than drug $\ell'$,” become convex hypotheses on the signal $x$.

In order to test hypotheses of this type, we can use a two-stage procedure for observing drug effects, namely, as follows. To get a particular observation, we select at random, with probability $q_{i\ell}$, a pair $(i, \ell)$ from the set $\{(i, \ell) : 1 \le i \le I, 1 \le \ell \le L\}$, select a patient from group $i$ according to the uniform distribution on the group, administer to the patient the drug $\ell$, and check whether the drug has effect. Thus, a single observation is a triple $(i, \ell, \chi)$, where $\chi = 0$ if the administered drug has no effect on the patient, and $\chi = 1$ otherwise. The probability of getting observation $(i, \ell, 1)$ is $q_{i\ell} x^i_\ell$, and the probability of getting observation $(i, \ell, 0)$ is $q_{i\ell} x^i_{L+\ell}$. Thus, we arrive at the Discrete o.s. where the distribution $\mu$ of observations is of the form $\mu = A_q x$, with the rows in $A_q$ indexed by triples $\omega = (i, \ell, \chi) \in \Omega := \{1, 2, ..., I\} \times \{1, 2, ..., L\} \times \{0, 1\}$ and given by
$$(A_q [x^1; ...; x^I])_{i,\ell,\chi} = \begin{cases} q_{i\ell} x^i_\ell, & \chi = 1, \\ q_{i\ell} x^i_{L+\ell}, & \chi = 0. \end{cases}$$
Specifying the set $\mathcal{Q}$ of admissible measurement designs as a closed convex subset of the set of all nonvanishing discrete probability distributions on the set $\{1, 2, ..., I\} \times \{1, 2, ..., L\}$, we find ourselves in the Simple case of Discrete o.s., as defined by (2.102), and $A_q x$ is a probabilistic vector whenever $q \in \mathcal{Q}$ and $x \in \mathcal{Y}$.

2.7.3.2 Simple case, Poisson o.s.

Figure 2.7: PET scanner

Looking at the optimization problem $(P_q)$, we see that the simplest way to ensure that its objective is affine in $q$ is, as in the case of Discrete o.s., to assume that $A_q = {\rm Diag}\{Bq\} A$, where $A$ is some fixed $d \times n$ matrix, and $B$ is some fixed $d \times (\dim q)$ matrix such that $Bq$ is positive whenever $q \in \mathcal{Q}$. On top of this, we should ensure that when $q \in \mathcal{Q}$ and $x \in \mathcal{X}$, $A_q x$ is a positive vector; this amounts to some restrictions linking $\mathcal{Q}$, $\mathcal{X}$, $A$, and $B$.

Application Example: PET with time control. Positron Emission Tomography was already mentioned, as an example of Poisson o.s., in Section 2.4.3.2. As explained in the section, in PET we observe a random vector $\omega \in \mathbf{R}^d$ with independent entries $[\omega]_i \sim {\rm Poisson}(\mu_i)$, $1 \le i \le d$, where the vector of parameters $\mu = [\mu_1; ...; \mu_d]$ of the Poisson distributions is the linear image $\mu = A\lambda$ of an unknown “signal” $\lambda$ (the tracer’s density in the patient’s body) belonging to some known subset $\Lambda$ of $\mathbf{R}^D_+$, with entrywise nonnegative matrix $A$. Our goal is to make inferences about $\lambda$. Now, in an actual PET scan, the patient’s position w.r.t. the scanner is not the same during the entire study; the position is kept fixed for an $i$-th time period, $1 \le i \le I$, and changes from period to period in order to expose to the scanner the entire “area of interest.” For example, with the scanner shown in Figure 2.7, during the PET study the imaging table with the patient will be shifted several times along the axis of the scanning ring. As a result, the observed vector $\omega$ can be split into blocks $\omega^i$, $i = 1, ..., I$, of data acquired during the $i$-th period, $1 \le i \le I$. On closer inspection, the corresponding block $\mu^i$ in $\mu$ is
$$\mu^i = q_i A_i \lambda,$$
where $A_i$ is an entrywise nonnegative matrix known in advance, and $q_i$ is the duration of the $i$-th period. In principle, the $q_i$ could be treated as nonnegative design variables subject to the “budget constraint” $\sum_{i=1}^{I} q_i = T$, where $T$ is the total duration of the study,^{20} and perhaps some other convex constraints, say, positive lower bounds on $q_i$. It is immediately seen that the outlined situation is exactly as is required in the Simple case of Poisson o.s.

2.7.3.3 Simple case, Gaussian o.s.

Looking at the optimization problem (Gq), we see that the simplest way to ensure that its objective is affine in q is to assume that the covariance matrix Θ is diagonal, and

$$A_q = \mathrm{Diag}\{\sqrt{q_1}, ..., \sqrt{q_d}\}A, \qquad (2.103)$$

where A is a fixed d × n matrix, and q runs through a convex compact subset of R^d_+. It turns out that there are situations where assumption (2.103) makes perfect sense. Let us start with a preamble. In Gaussian o.s.

$$\omega = Ax + \xi \quad \left[A \in \mathbf{R}^{d\times n},\ \xi \sim \mathcal{N}(0, \Sigma),\ \Sigma = \mathrm{Diag}\{\sigma_1^2, ..., \sigma_d^2\}\right] \qquad (2.104)$$

the “physics” behind the observations in many cases is as follows. There are d sensors (receivers), the i-th registering a continuous-time analog input depending linearly on the underlying signal x. On the time horizon on which the measurements are taken, this input is constant in time and is registered by the i-th sensor on time interval ∆i. The deterministic component of the measurement registered by sensor i is the integral of the corresponding input taken over ∆i, and the stochastic component of the measurement is obtained by integrating white Gaussian noise over the same interval. As far as this noise is concerned, what matters is that when the white noise affecting the i-th sensor is integrated over a time interval ∆i, the result is a Gaussian random variable with zero mean and variance σ_i^2 |∆i| (here |∆i| is the length of ∆i), and the random variables obtained by integrating white noise over nonoverlapping segments are independent. Besides this, we assume that the noisy components of measurements are independent across the sensors. Now, there could be two basic versions of the situation just outlined, both leading to the same observation model (2.104). In the first, “parallel,” version, all d sensors work in parallel on the same time horizon of duration 1. In the second, “sequential,” version, the sensors are activated and scanned one by one, each working unit time; thus, here the full time horizon is d, and the sensors are registering their respective inputs on consecutive time intervals of duration 1 each. In this second “physical” version of Gaussian o.s., we can, in principle, allow for sensors to register their inputs on consecutive time segments of varying durations q_1 ≥ 0, q_2 ≥ 0, ..., q_d ≥ 0, with the restriction, additional to nonnegativity, that our total time budget is respected: Σ_i q_i = d (perhaps with some other convex constraints on q_i). Let us see what observation scheme we end up with.
Assuming that (2.104) represents correctly our observations in the reference case where all the |∆i| are equal to 1, the deterministic component of the measurement registered by sensor i in a time interval of duration q_i will be q_i Σ_j a_ij x_j, and the standard deviation of the noisy component will be σ_i √q_i, so that the measurements become

$$z_i = \sigma_i\sqrt{q_i}\,\zeta_i + q_i\sum_j a_{ij}x_j,\quad i = 1, ..., d,$$

with standard (zero mean, unit variance) Gaussian noises ζ_i independent of each other. Now, since we know q_i, we can scale the latter observations by making the standard deviation of the noisy component the same σ_i as in the reference case. Specifically, we lose nothing when assuming that our observations are

$$\omega_i = z_i/\sqrt{q_i} = \underbrace{\sigma_i\zeta_i}_{\xi_i} + \sqrt{q_i}\sum_j a_{ij}x_j,$$

or, equivalently,

$$\omega = \xi + \underbrace{\mathrm{Diag}\{\sqrt{q_1}, ..., \sqrt{q_d}\}A}_{A_q}\,x,\quad \xi \sim \mathcal{N}(0, \mathrm{Diag}\{\sigma_1^2, ..., \sigma_d^2\}) \qquad [A = [a_{ij}]],$$

where q runs through a convex compact subset Q of the simplex {q ∈ R^d_+ : Σ_i q_i = d}. Thus, if the “physical nature” of a Gaussian o.s. is sequential, then, making the activity times of the sensors our design variables, as is natural under the circumstances, we arrive at (2.103), and, as a result, end up with an easy-to-solve Measurements Design problem.

²⁰ T cannot be too large; aside from other considerations, the tracer disintegrates, and its density can be considered as nearly constant only on a properly restricted time horizon.
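The rescaling step can be checked numerically. The sketch below is only an illustration: the matrix, signal, noise levels, and durations are made-up values, and the helper names `raw_measurement` and `rescale` are hypothetical. It forms z_i for fixed noise draws ζ_i, divides by √q_i, and compares the result with σ_i ζ_i + √q_i Σ_j a_ij x_j.

```python
import math
import random

def raw_measurement(a_row, x, sigma_i, q_i, zeta_i):
    """z_i = sigma_i * sqrt(q_i) * zeta_i + q_i * sum_j a_ij x_j."""
    signal = sum(a * xj for a, xj in zip(a_row, x))
    return sigma_i * math.sqrt(q_i) * zeta_i + q_i * signal

def rescale(z_i, q_i):
    """omega_i = z_i / sqrt(q_i); requires q_i > 0."""
    return z_i / math.sqrt(q_i)

# Made-up illustration data: d = 3 sensors, n = 2 signal entries.
A = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
x = [0.2, 0.7]
sigma = [0.1, 0.3, 0.2]
q = [0.5, 1.5, 1.0]            # durations; time budget: sum(q) == d

rng = random.Random(0)
zeta = [rng.gauss(0.0, 1.0) for _ in range(len(A))]

omega = [rescale(raw_measurement(A[i], x, sigma[i], q[i], zeta[i]), q[i])
         for i in range(len(A))]

# Target form: omega_i = sigma_i * zeta_i + sqrt(q_i) * [A x]_i.
target = [sigma[i] * zeta[i]
          + math.sqrt(q[i]) * sum(a * xj for a, xj in zip(A[i], x))
          for i in range(len(A))]

max_gap = max(abs(u - v) for u, v in zip(omega, target))
print(max_gap)  # differs from zero only by floating-point rounding
```

Note that the noise term of each rescaled observation no longer depends on q_i; only the signal term is scaled by √q_i, which is what (2.103) expresses.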

2.8 AFFINE DETECTORS BEYOND SIMPLE OBSERVATION SCHEMES

On closer inspection, the “common denominator” of our basic simple o.s.’s (Gaussian, Poisson, and Discrete) is that in all these cases the minimal risk detector for a pair of convex hypotheses is affine. At first glance, this indeed is so for Gaussian and Poisson o.s.’s, where F is comprised of affine functions on the corresponding observation space Ω (R^d for Gaussian o.s., and Z^d_+ ⊂ R^d for Poisson o.s.), but is not so for Discrete o.s.: in that case, Ω = {1, ..., d}, and F is comprised of all functions on Ω, while “affine functions on Ω = {1, ..., d}” make no sense. Note, however, that we can encode (and from now on this is what we do) the points i = 1, ..., d of a d-element set by basic orths e_i = [0; ...; 0; 1; 0; ...; 0] ∈ R^d, thus making our observation space Ω a subset of R^d. With this encoding, every real-valued function on {1, ..., d} becomes the restriction to Ω of an affine function. Note that when passing from our basic simple o.s.’s to their direct products, the minimum risk detectors for pairs of convex hypotheses remain affine. Now, in our context the following two properties of simple o.s.’s are essential:
A) the best affine detector (the one with the smallest possible risk), like its risk, can be efficiently computed;
B) the smallest risk affine detector from A) is the best detector, in terms of risk, available under the circumstances, so that the associated test is near-optimal.
Note that as far as practical applications of the detector-based hypothesis testing are concerned, one “can survive” without B) (near-optimality of the constructed detectors), while A) is a requisite.
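The encoding of the points 1, ..., d by basic orths can be made concrete with a few lines of code. In this minimal sketch the function values are arbitrary and the helper names (`basic_orth`, `as_affine`, `evaluate`) are hypothetical; it shows that a function f on {1, ..., d} is recovered as ω ↦ c^T ω + b restricted to the orths, with c the vector of values of f and b = 0.

```python
def basic_orth(i, d):
    """Return e_i in R^d as a list (1 in position i, counted from 1)."""
    return [1.0 if k == i - 1 else 0.0 for k in range(d)]

def as_affine(values):
    """Given [f(1), ..., f(d)], return (c, b) such that f(i) = c^T e_i + b."""
    return list(values), 0.0

def evaluate(c, b, point):
    """Evaluate the affine function c^T point + b."""
    return sum(ck * pk for ck, pk in zip(c, point)) + b

f_values = [2.5, -1.0, 0.0, 7.0]          # an arbitrary function on {1, 2, 3, 4}
c, b = as_affine(f_values)
recovered = [evaluate(c, b, basic_orth(i, 4)) for i in range(1, 5)]
print(recovered)
```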


In this section we focus on families of probability distributions obeying A). This class turns out to be incomparably larger than what was defined as simple o.s.’s in Section 2.4; in particular, it includes nonparametric families of distributions. Staying within this much broader class, we still are able to construct in a computationally efficient way the best affine detectors, in a certain precise sense, for a pair of “convex” hypotheses, along with valid upper bounds on the risks of the detectors. What we, in general, cannot claim anymore is that the tests associated with such detectors are near-optimal. This being said, we believe that investigating possibilities for building tests and quantifying their performance in a computationally friendly manner is of value even when we cannot provably guarantee near-optimality of these tests. The results to follow originate from [135, 136].

2.8.1 Situation

In what follows, we fix an observation space Ω = R^d, and let Pj, 1 ≤ j ≤ J, be given families of probability distributions on Ω. Put broadly, our goal still is, given a random observation ω ∼ P, where P ∈ ⋃_{j≤J} Pj, to decide upon the hypotheses Hj : P ∈ Pj, j = 1, ..., J. We intend to address this goal in the case when the families Pj are simple: they are comprised of distributions for which moment-generating functions admit an explicit upper bound.

2.8.1.1 Preliminaries: Regular data and associated families of distributions

Definition 2.36. A. Regular data is a triple H, M, Φ(·, ·), where
– H is a nonempty closed convex set in Ω = R^d symmetric w.r.t. the origin,
– M is a closed convex set in some R^n,
– Φ(h; µ) : H × M → R is a continuous function convex in h ∈ H and concave in µ ∈ M.

B. Regular data H, M, Φ(·, ·) define two families of probability distributions on Ω:
– the family of regular distributions R = R[H, M, Φ] comprised of all probability distributions P on Ω such that

$$\forall h \in \mathcal{H}\ \exists \mu \in \mathcal{M}:\ \ln\left(\int_\Omega \exp\{h^T\omega\}P(d\omega)\right) \le \Phi(h;\mu);$$

– the family of simple distributions S = S[H, M, Φ] comprised of probability distributions P on Ω such that

$$\exists \mu \in \mathcal{M}:\ \forall h \in \mathcal{H}:\ \ln\left(\int_\Omega \exp\{h^T\omega\}P(d\omega)\right) \le \Phi(h;\mu). \qquad (2.105)$$

For a probability distribution P ∈ S[H, M, Φ], every µ ∈ M satisfying (2.105) is referred to as a parameter of P w.r.t. S. Note that a distribution may have many parameters different from each other.


Recall that beginning with Section 2.3, the starting point in all our constructions is a “plausibly good” detector-based test which, given two families P1 and P2 of distributions with common observation space, and repeated observations ω1, ..., ωt drawn from a distribution P ∈ P1 ∪ P2, decides whether P ∈ P1 or P ∈ P2. Our interest in the families of regular/simple distributions stems from the fact that when the families P1 and P2 are of this type, building such a test reduces to solving a convex-concave saddle point problem and thus can be carried out in a computationally efficient manner. We postpone the related construction and analysis to Section 2.8.2, and continue with presenting some basic examples of families of simple and regular distributions along with a simple “calculus” of these families.

2.8.1.2 Basic examples of simple families of probability distributions

2.8.1.2.A. Sub-Gaussian distributions: Let H = Ω = R^d, let M be a closed convex subset of the set G_d = {µ = (θ, Θ) : θ ∈ R^d, Θ ∈ S^d_+}, where S^d_+ is the cone of positive semidefinite matrices in the space S^d of symmetric d × d matrices, and let Φ(h; θ, Θ) = θ^T h + ½ h^T Θh. Recall that a distribution P on Ω = R^d is called sub-Gaussian with sub-Gaussianity parameters θ ∈ R^d and Θ ∈ S^d_+ if

$$E_{\omega\sim P}\{\exp\{h^T\omega\}\} \le \exp\{\theta^Th + \tfrac12 h^T\Theta h\}\quad \forall h \in \mathbf{R}^d. \qquad (2.106)$$

Whenever this is the case, θ is the expected value of P. We shall use the notation ξ ∼ SG(θ, Θ) as a shortcut for the sentence “random vector ξ is sub-Gaussian with parameters θ, Θ.” It is immediately seen that when ξ ∼ N(θ, Θ), we also have ξ ∼ SG(θ, Θ), and (2.106) in this case is an identity rather than an inequality. With Φ as above, S[H, M, Φ] clearly contains every sub-Gaussian distribution P on R^d with sub-Gaussianity parameters (forming a parameter of P w.r.t. S) from M. In particular, S[H, M, Φ] contains all Gaussian distributions N(θ, Θ) with (θ, Θ) ∈ M.

2.8.1.2.B. Poisson distributions: Let H = Ω = R^d, let M be a closed convex subset of the d-dimensional nonnegative orthant R^d_+, and let

$$\Phi(h = [h_1; ...; h_d]; \mu = [\mu_1; ...; \mu_d]) = \sum_{i=1}^d \mu_i[\exp\{h_i\} - 1] : \mathcal{H}\times\mathcal{M} \to \mathbf{R}.$$

The family S = S[H, M, Φ] contains all Poisson distributions Poisson[µ] with vectors µ of parameters belonging to M; here Poisson[µ] is the distribution of a random d-dimensional vector with entries independent of each other, the i-th entry being a Poisson random variable with parameter µ_i. µ is a parameter of Poisson[µ] w.r.t. S.

2.8.1.2.C. Discrete distributions. Consider a discrete random variable taking values in the d-element set {1, 2, ..., d}, and let us think of such a variable as a random variable taking values e_i ∈ R^d, i = 1, ..., d, where e_i = [0; ...; 0; 1; 0; ...; 0] (1 in position i) are standard basic orths in R^d. The probability distribution of such a variable can be identified with a point µ = [µ_1; ...; µ_d] from the d-dimensional probabilistic simplex

$$\Delta_d = \left\{\nu \in \mathbf{R}^d_+ : \sum_{i=1}^d \nu_i = 1\right\},$$

where µ_i is the probability for the variable to take value e_i. With these identifications, setting H = R^d, specifying M as a closed convex subset of ∆_d, and setting

$$\Phi(h = [h_1; ...; h_d]; \mu = [\mu_1; ...; \mu_d]) = \ln\left(\sum_{i=1}^d \mu_i\exp\{h_i\}\right),$$

the family S = S[H, M, Φ] contains distributions of all discrete random variables taking values in {1, ..., d} with probabilities µ_1, ..., µ_d comprising a vector from M. This vector is a parameter of the corresponding distribution w.r.t. S.
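The three functions Φ of examples A, B, and C are easy to evaluate, and their defining inequalities can be spot-checked numerically. The following is a minimal sketch using only the standard library; the function names and test values are illustrative assumptions. It compares the Poisson Φ with a directly summed series for the log-moment-generating function, checks the sub-Gaussian bound (2.106) for a Rademacher variable (values ±1 with probability ½ each, whose moment-generating function is cosh h), and evaluates the Discrete Φ through the orth encoding.

```python
import math

def phi_subgaussian(h, theta, Theta):
    """theta^T h + 0.5 * h^T Theta h for lists h, theta and a nested-list matrix Theta."""
    d = len(h)
    lin = sum(theta[i] * h[i] for i in range(d))
    quad = sum(h[i] * Theta[i][j] * h[j] for i in range(d) for j in range(d))
    return lin + 0.5 * quad

def phi_poisson(h, mu):
    """sum_i mu_i * (exp(h_i) - 1)."""
    return sum(m * (math.exp(hi) - 1.0) for hi, m in zip(h, mu))

def phi_discrete(h, mu):
    """ln(sum_i mu_i * exp(h_i))."""
    return math.log(sum(m * math.exp(hi) for hi, m in zip(h, mu)))

def poisson_log_mgf_1d(h, mu, terms=200):
    """ln E exp(h*k) for k ~ Poisson(mu), by summing the series term by term."""
    total, pmf = 0.0, math.exp(-mu)      # pmf starts at P(k = 0)
    for k in range(terms):
        total += pmf * math.exp(h * k)
        pmf *= mu / (k + 1)              # P(k + 1) from P(k)
    return math.log(total)

def discrete_log_mgf(h, mu):
    """ln E exp(h^T e_i), where e_i is drawn with probability mu_i."""
    d = len(mu)
    atoms = [[1.0 if k == i else 0.0 for k in range(d)] for i in range(d)]
    return math.log(sum(mu[i] * math.exp(sum(hk * ek for hk, ek in zip(h, atoms[i])))
                        for i in range(d)))

poisson_gap = abs(poisson_log_mgf_1d(0.7, 2.0) - phi_poisson([0.7], [2.0]))
rademacher_slack = phi_subgaussian([1.3], [0.0], [[1.0]]) - math.log(math.cosh(1.3))
discrete_gap = abs(discrete_log_mgf([0.4, -1.0, 2.0], [0.2, 0.3, 0.5])
                   - phi_discrete([0.4, -1.0, 2.0], [0.2, 0.3, 0.5]))
print(poisson_gap, rademacher_slack, discrete_gap)
```

For Poisson and Discrete distributions the bound holds with equality, while for the Rademacher variable the slack is strictly positive for h ≠ 0.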

2.8.1.2.D. Distributions with bounded support. Consider the family P[X] of probability distributions supported on a closed and bounded convex set X ⊂ Ω = R^d, and let

$$\phi_X(h) = \max_{x\in X} h^Tx$$

be the support function of X. We have the following result (to be refined in Section 2.8.1.3):

Proposition 2.37. For every P ∈ P[X] it holds

$$\forall h \in \mathbf{R}^d:\ \ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega\}P(d\omega)\right) \le h^Te[P] + \tfrac18\left[\phi_X(h) + \phi_X(-h)\right]^2, \qquad (2.107)$$

where e[P] = ∫_{R^d} ω P(dω) is the expectation of P, and the function on the right-hand side of (2.107) is convex. As a result, setting

$$\mathcal{H} = \mathbf{R}^d,\quad \mathcal{M} = X,\quad \Phi(h;\mu) = h^T\mu + \tfrac18\left[\phi_X(h) + \phi_X(-h)\right]^2,$$

we obtain regular data such that P[X] ⊂ S[H, M, Φ], e[P] being a parameter of a distribution P ∈ P[X] w.r.t. S. For proof, see Section 2.11.4.
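In dimension one, with X = [lo, hi], the quantity φ_X(h) + φ_X(−h) equals |h|(hi − lo), so (2.107) takes the familiar form of a Hoeffding-type bound. The sketch below checks this on a grid of h for one made-up three-point distribution; the atoms, probabilities, and helper names are assumptions chosen for illustration only.

```python
import math

def support_fn_interval(h, lo, hi):
    """Support function of the interval [lo, hi]: max of h*x over x in [lo, hi]."""
    return max(h * lo, h * hi)

def log_mgf(h, atoms, probs):
    """ln E exp(h*omega) for a finitely supported distribution."""
    return math.log(sum(p * math.exp(h * w) for w, p in zip(atoms, probs)))

def bound_2_107(h, mean, lo, hi):
    """Right-hand side of (2.107) for X = [lo, hi]."""
    width = support_fn_interval(h, lo, hi) + support_fn_interval(-h, lo, hi)
    return h * mean + width ** 2 / 8.0

# Made-up distribution supported on X = [-1, 2].
atoms = [-1.0, 0.5, 2.0]
probs = [0.3, 0.5, 0.2]
lo, hi = -1.0, 2.0
mean = sum(p * w for w, p in zip(atoms, probs))

grid = [k / 10 for k in range(-50, 51)]
worst_slack = min(bound_2_107(h, mean, lo, hi) - log_mgf(h, atoms, probs) for h in grid)
print(worst_slack)  # nonnegative up to rounding; zero at h = 0
```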

2.8.1.3 Calculus of regular and simple families of probability distributions

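Families of regular and simple distributions admit “fully algorithmic” calculus, with the main calculus rules as follows.

2.8.1.3.A. Direct summation. For 1 ≤ ℓ ≤ L, let regular data Hℓ ⊂ Ωℓ = R^{dℓ}, Mℓ ⊂ R^{nℓ}, Φℓ(h^ℓ; µ_ℓ) : Hℓ × Mℓ → R be given. Let us set

$$\begin{array}{rcl}
\Omega &=& \Omega_1\times...\times\Omega_L = \mathbf{R}^d,\quad d = d_1 + ... + d_L,\\
\mathcal{H} &=& \mathcal{H}_1\times...\times\mathcal{H}_L = \{h = [h^1; ...; h^L] : h^\ell\in\mathcal{H}_\ell,\ \ell\le L\},\\
\mathcal{M} &=& \mathcal{M}_1\times...\times\mathcal{M}_L = \{\mu = [\mu_1; ...; \mu_L] : \mu_\ell\in\mathcal{M}_\ell,\ \ell\le L\} \subset \mathbf{R}^n,\quad n = n_1 + ... + n_L,\\
\Phi(h = [h^1; ...; h^L]; \mu = [\mu_1; ...; \mu_L]) &=& \sum_{\ell=1}^L \Phi_\ell(h^\ell;\mu_\ell) : \mathcal{H}\times\mathcal{M}\to\mathbf{R}.
\end{array}$$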

Then H is a closed convex set in Ω = R^d, symmetric w.r.t. the origin, M is a nonempty closed convex set in R^n, Φ : H × M → R is a continuous convex-concave function, and clearly
• the family R[H, M, Φ] contains all product-type distributions P = P1 × ... × PL on Ω = Ω1 × ... × ΩL with Pℓ ∈ R[Hℓ, Mℓ, Φℓ], 1 ≤ ℓ ≤ L;
• the family S = S[H, M, Φ] contains all product-type distributions P = P1 × ... × PL on Ω = Ω1 × ... × ΩL with Pℓ ∈ Sℓ = S[Hℓ, Mℓ, Φℓ], 1 ≤ ℓ ≤ L, a parameter of P w.r.t. S being the vector of parameters of Pℓ w.r.t. Sℓ.

2.8.1.3.B. Mixing. For 1 ≤ ℓ ≤ L, let regular data Hℓ ⊂ Ω = R^d, Mℓ ⊂ R^{nℓ}, Φℓ(h; µ_ℓ) : Hℓ × Mℓ → R be given, with compact Mℓ. Let also ν = [ν_1; ...; ν_L] be a probabilistic vector. For a tuple P^L = {Pℓ ∈ R[Hℓ, Mℓ, Φℓ]}_{ℓ=1}^L, let Π[P^L, ν] be the ν-mixture of distributions P1, ..., PL defined as the distribution of a random vector ω ∈ Ω generated as follows: we draw at random, from probability distribution ν on {1, ..., L}, an index ℓ, and then draw ω at random from the distribution Pℓ. Finally, let P be the set of all probability distributions on Ω which can be obtained as Π[P^L, ν] from the outlined tuples P^L and vectors ν running through the probabilistic simplex ∆_L = {ν ∈ R^L : ν ≥ 0, Σ_ℓ ν_ℓ = 1}. Let us set

$$\begin{array}{rcl}
\mathcal{H} &=& \bigcap\limits_{\ell=1}^L \mathcal{H}_\ell,\\
\Psi_\ell(h) &=& \max\limits_{\mu_\ell\in\mathcal{M}_\ell}\Phi_\ell(h;\mu_\ell) : \mathcal{H}_\ell\to\mathbf{R},\\
\Phi(h;\nu) &=& \ln\left(\sum_{\ell=1}^L \nu_\ell\exp\{\Psi_\ell(h)\}\right) : \mathcal{H}\times\Delta_L\to\mathbf{R}.
\end{array}$$

Then H, ∆_L, Φ clearly is regular data (recall that all Mℓ are compact sets), and for every ν ∈ ∆_L and tuple P^L of the above type one has

$$P = \Pi[P^L,\nu]\ \Rightarrow\ \ln\left(\int_\Omega e^{h^T\omega}P(d\omega)\right) \le \Phi(h;\nu)\quad\forall h\in\mathcal{H}, \qquad (2.108)$$

implying that P ⊂ S[H, ∆_L, Φ], ν being a parameter of P = Π[P^L, ν] ∈ P. Indeed, (2.108) is readily given by the fact that for P = Π[P^L, ν] ∈ P and h ∈ H it holds

$$\ln\left(E_{\omega\sim P}\left\{e^{h^T\omega}\right\}\right) = \ln\left(\sum_{\ell=1}^L \nu_\ell E_{\omega\sim P_\ell}\{e^{h^T\omega}\}\right) \le \ln\left(\sum_{\ell=1}^L \nu_\ell\exp\{\Psi_\ell(h)\}\right) = \Phi(h;\nu),$$

with the concluding inequality given by h ∈ H ⊂ Hℓ and Pℓ ∈ R[Hℓ, Mℓ, Φℓ], 1 ≤ ℓ ≤ L.
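The mixing rule can be illustrated in the simplest setting of one-dimensional Gaussian components. In the sketch below, each family is assumed to consist of the Gaussian distributions N(θ, v) with θ ranging over an interval and v fixed, so that Ψℓ(h) is available in closed form; the intervals, weights, and helper names are made-up illustration values. The code compares the exact log-moment-generating function of one particular mixture with the bound Φ(h; ν) of (2.108).

```python
import math

def psi(h, theta_lo, theta_hi, var):
    """max over theta in [theta_lo, theta_hi] of theta*h + 0.5*var*h^2."""
    return max(theta_lo * h, theta_hi * h) + 0.5 * var * h * h

def phi_mix(h, nu, families):
    """Phi(h; nu) = ln(sum_l nu_l * exp(Psi_l(h)))."""
    return math.log(sum(w * math.exp(psi(h, *fam)) for w, fam in zip(nu, families)))

def mixture_log_mgf(h, nu, components):
    """Exact log-MGF of a mixture of 1-D Gaussians N(theta, var)."""
    return math.log(sum(w * math.exp(t * h + 0.5 * v * h * h)
                        for w, (t, v) in zip(nu, components)))

families = [(-1.0, 0.0, 1.0), (2.0, 3.0, 0.5)]   # (theta_lo, theta_hi, var) per family
components = [(-0.4, 1.0), (2.5, 0.5)]            # one N(theta, var) chosen from each family
nu = [0.25, 0.75]                                 # mixing weights

grid = [k / 10 for k in range(-30, 31)]
min_slack = min(phi_mix(h, nu, families) - mixture_log_mgf(h, nu, components)
                for h in grid)
print(min_slack)  # nonnegative up to rounding
```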


We have built a simple family of distributions S := S[H, ∆_L, Φ] which contains all mixtures of distributions from given regular families Rℓ := R[Hℓ, Mℓ, Φℓ], 1 ≤ ℓ ≤ L, which makes S a simple outer approximation of mixtures of distributions from the simple families Sℓ := S[Hℓ, Mℓ, Φℓ], 1 ≤ ℓ ≤ L. In this latter capacity, S has a drawback: the only parameter of the mixture P = Π[P^L, ν] of distributions Pℓ ∈ Sℓ is ν, while the parameters of the Pℓ’s disappear. In some situations, this makes the outer approximation S of P too conservative. We are about to get rid, to some extent, of this drawback.

A modification. In the situation described at the beginning of 2.8.1.3.B, let a vector ν̄ ∈ ∆_L be given, and let

$$\bar\Phi(h;\mu_1, ..., \mu_L) = \sum_{\ell=1}^L \bar\nu_\ell\Phi_\ell(h;\mu_\ell) : \mathcal{H}\times(\mathcal{M}_1\times...\times\mathcal{M}_L)\to\mathbf{R}.$$

Let d × d matrix Q ⪰ 0 satisfy

$$\left(\Phi_\ell(h;\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)\right)^2 \le h^TQh\quad\forall(h\in\mathcal{H},\ \ell\le L,\ \mu\in\mathcal{M}_1\times...\times\mathcal{M}_L), \qquad (2.109)$$

and let

$$\Phi(h;\mu_1, ..., \mu_L) = \tfrac35 h^TQh + \bar\Phi(h;\mu_1, ..., \mu_L) : \mathcal{H}\times(\mathcal{M}_1\times...\times\mathcal{M}_L)\to\mathbf{R}. \qquad (2.110)$$

Φ clearly is convex-concave and continuous on its domain, whence H = ⋂_ℓ Hℓ, M1 × ... × ML, Φ is regular data.

Proposition 2.38. In the situation just defined, denoting by P_ν̄ the family of all probability distributions P = Π[P^L, ν̄] stemming from tuples

$$P^L = \{P_\ell\in\mathcal{S}[\mathcal{H}_\ell,\mathcal{M}_\ell,\Phi_\ell]\}_{\ell=1}^L, \qquad (2.111)$$

one has P_ν̄ ⊂ S[H, M1 × ... × ML, Φ].

As a parameter of distribution P = Π[P^L, ν̄] ∈ P_ν̄ with P^L as in (2.111), one can take µ^L = [µ_1; ...; µ_L].

Proof. It is easily seen that

$$e^a \le a + e^{\frac35 a^2}\quad\forall a.$$

As a result, when a_ℓ, ℓ = 1, ..., L, satisfy Σ_ℓ ν̄_ℓ a_ℓ = 0, we have

$$\sum_\ell \bar\nu_\ell e^{a_\ell} \le \sum_\ell \bar\nu_\ell a_\ell + \sum_\ell \bar\nu_\ell e^{\frac35 a_\ell^2} \le e^{\frac35\max_\ell a_\ell^2}. \qquad (2.112)$$

Now let P^L be as in (2.111), and let h ∈ H = ⋂_ℓ Hℓ. Setting P = Π[P^L, ν̄], we have

$$\begin{array}{rcl}
\ln\left(\int_\Omega e^{h^T\omega}P(d\omega)\right) &=& \ln\left(\sum_\ell \bar\nu_\ell\int_\Omega e^{h^T\omega}P_\ell(d\omega)\right) \le \ln\left(\sum_\ell \bar\nu_\ell\exp\{\Phi_\ell(h,\mu_\ell)\}\right)\\
&=& \bar\Phi(h;\mu_1, ..., \mu_L) + \ln\left(\sum_\ell \bar\nu_\ell\exp\{\Phi_\ell(h,\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)\}\right)\\
&\underbrace{\le}_{a}& \bar\Phi(h;\mu_1, ..., \mu_L) + \tfrac35\max_\ell\left[\Phi_\ell(h,\mu_\ell) - \bar\Phi(h;\mu_1, ..., \mu_L)\right]^2 \underbrace{\le}_{b} \Phi(h;\mu_1, ..., \mu_L),
\end{array}$$

where a is given by (2.112) as applied to a_ℓ = Φℓ(h, µ_ℓ) − Φ̄(h; µ_1, ..., µ_L), and b is due to (2.109) and (2.110). The resulting inequality, which holds true for all h ∈ H, is all we need. ✷

2.8.1.3.C. I.i.d. summation. Let Ω = R^d be an observation space, (H, M, Φ) be regular data on this space, and λ = {λ_ℓ}_{ℓ=1}^L be a collection of reals. We can associate with the outlined entities new data (Hλ, M, Φλ) on Ω by setting

$$\mathcal{H}_\lambda = \{h\in\Omega : \|\lambda\|_\infty h\in\mathcal{H}\},\quad \Phi_\lambda(h;\mu) = \sum_{\ell=1}^L \Phi(\lambda_\ell h;\mu) : \mathcal{H}_\lambda\times\mathcal{M}\to\mathbf{R}.$$

Now, given a probability distribution P on Ω, we can associate with it and with the above λ a new probability distribution P^λ on Ω as follows: P^λ is the distribution of Σ_ℓ λ_ℓ ω_ℓ, where ω_1, ω_2, ..., ω_L are drawn, independently of each other, from P. An immediate observation is that the data (Hλ, M, Φλ) is regular, and whenever a probability distribution P belongs to S[H, M, Φ], the distribution P^λ belongs to S[Hλ, M, Φλ], and every parameter of P is a parameter of P^λ. In particular, when ω ∼ P ∈ S[H, M, Φ], the distribution P^L of the sum of L independent copies of ω belongs to S[H, M, LΦ].

2.8.1.3.D. Semi-direct summation. For 1 ≤ ℓ ≤ L, let regular data Hℓ ⊂ Ωℓ = R^{dℓ}, Mℓ, Φℓ be given. To avoid complications, we assume that for every ℓ,
• Hℓ = Ωℓ,
• Mℓ is bounded.
Let also an ϵ > 0 be given. We assume that ϵ is small, namely, Lϵ < 1. Let us aggregate the given regular data into a new one by setting

H = Ω := Ω1 × ... × ΩL = R^d, d = d_1 + ... + d_L, M = M1 × ... × ML,

and let us define function Φ(h; µ) : Ω × M → R as follows:

$$\Phi(h = [h^1; ...; h^L]; \mu = [\mu_1; ...; \mu_L]) = \inf_{\lambda\in\Delta^\epsilon}\sum_{\ell=1}^L \lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\mu_\ell),\quad \Delta^\epsilon = \left\{\lambda\in\mathbf{R}^L : \lambda_\ell\ge\epsilon\ \forall\ell\ \&\ \sum_{\ell=1}^L \lambda_\ell = 1\right\}. \qquad (2.113)$$

For evident reasons, the infimum in the description of Φ is achieved, and Φ is continuous. In addition, Φ is convex in h ∈ R^d and concave in µ ∈ M. Postponing for a moment verification, the consequences are that H = Ω = R^d, M, and Φ form regular data. We claim that

Whenever ω = [ω^1; ...; ω^L] is a random variable taking values in Ω = R^{d_1} × ... × R^{d_L}, and the marginal distributions Pℓ, 1 ≤ ℓ ≤ L, of ω belong to the families Sℓ = S[R^{dℓ}, Mℓ, Φℓ] for all 1 ≤ ℓ ≤ L, the distribution P of ω belongs to S = S[R^d, M, Φ], a parameter of P w.r.t. S being the vector comprised of parameters of Pℓ w.r.t. Sℓ.

Indeed, since Pℓ ∈ S[R^{dℓ}, Mℓ, Φℓ], there exists µ̂_ℓ ∈ Mℓ such that

$$\ln\left(E_{\omega^\ell\sim P_\ell}\{\exp\{g^T\omega^\ell\}\}\right) \le \Phi_\ell(g;\hat\mu_\ell)\quad\forall g\in\mathbf{R}^{d_\ell}.$$


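Let us set µ̂ = [µ̂_1; ...; µ̂_L], and let h = [h^1; ...; h^L] ∈ Ω be given. We can find λ ∈ ∆^ϵ such that

$$\Phi(h;\hat\mu) = \sum_{\ell=1}^L \lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\hat\mu_\ell).$$

Applying the Hölder inequality, we get

$$E_{[\omega^1;...;\omega^L]\sim P}\left\{\exp\Big\{\sum_\ell [h^\ell]^T\omega^\ell\Big\}\right\} \le \prod_{\ell=1}^L\left(E_{\omega^\ell\sim P_\ell}\left\{\exp\{[h^\ell]^T\omega^\ell/\lambda_\ell\}\right\}\right)^{\lambda_\ell},$$

whence

$$\ln\left(E_{[\omega^1;...;\omega^L]\sim P}\left\{\exp\Big\{\sum_\ell [h^\ell]^T\omega^\ell\Big\}\right\}\right) \le \sum_{\ell=1}^L \lambda_\ell\Phi_\ell(h^\ell/\lambda_\ell;\hat\mu_\ell) = \Phi(h;\hat\mu).$$

We see that

$$\ln\left(E_{[\omega^1;...;\omega^L]\sim P}\left\{\exp\Big\{\sum_\ell [h^\ell]^T\omega^\ell\Big\}\right\}\right) \le \Phi(h;\hat\mu)\quad\forall h\in\mathcal{H} = \mathbf{R}^d,$$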

and thus P ∈ S[R^d, M, Φ], as claimed. It remains to verify that the function Φ defined by (2.113) indeed is convex in h ∈ R^d and concave in µ ∈ M. Concavity in µ is evident. Further, functions λ_ℓ Φℓ(h^ℓ/λ_ℓ; µ) (as perspective transformations of convex functions Φℓ(·; µ)) are jointly convex in λ and h^ℓ, and so is Ψ(λ, h; µ) = Σ_{ℓ=1}^L λ_ℓ Φℓ(h^ℓ/λ_ℓ, µ). Thus Φ(·; µ), obtained by partial minimization of Ψ in λ, indeed is convex.

2.8.1.3.E. Affine image. Let H, M, Φ be regular data, Ω be the embedding space of H, and x ↦ Ax + a be an affine mapping from Ω to Ω̄ = R^{d̄}, and let us set

$$\bar{\mathcal{H}} = \{\bar h\in\mathbf{R}^{\bar d} : A^T\bar h\in\mathcal{H}\},\quad \bar{\mathcal{M}} = \mathcal{M},\quad \bar\Phi(\bar h;\mu) = \Phi(A^T\bar h;\mu) + a^T\bar h : \bar{\mathcal{H}}\times\bar{\mathcal{M}}\to\mathbf{R}.$$

Note that H̄, M̄, Φ̄ is regular data. It is immediately seen that

Whenever the probability distribution P of a random variable ω belongs to R[H, M, Φ] (or belongs to S[H, M, Φ]), the distribution P̄[P] of the random variable ω̄ = Aω + a belongs to R[H̄, M̄, Φ̄] (respectively, belongs to S[H̄, M̄, Φ̄], and every parameter of P is a parameter of P̄[P]).

2.8.1.3.F. Incorporating support information. Consider the situation as follows. We are given regular data H ⊂ Ω = R^d, M, Φ and are interested in a family P of distributions known to belong to S[H, M, Φ]. In addition, we know that all distributions P from P are supported on a given closed convex set X ⊂ R^d. How could we incorporate this domain information to pass from the family S[H, M, Φ] containing P to a smaller family of the same type still containing P? We are about to give an answer in the simplest case of H = Ω. Denoting by φ_X(·) the support function of X and selecting somehow a closed convex set G ⊂ R^d containing the origin, let us set

$$\hat\Phi(h;\mu) = \inf_{g\in G}\left[\Phi^+(h,g;\mu) := \Phi(h - g;\mu) + \phi_X(g)\right],$$

where Φ(h; µ) : R^d × M → R is the continuous convex-concave function participating in the original regular data. Assuming that Φ̂ is real-valued and continuous on the domain R^d × M (which definitely is the case when G is a compact set such that φ_X is finite and continuous on G), note that Φ̂ is convex-concave on this domain, so that R^d, M, Φ̂ is regular data. We claim that

The family S[R^d, M, Φ̂] contains P, provided the family S[R^d, M, Φ] does so, and the first of these two families is smaller than the second one.

Verification of the claim is immediate. Let P ∈ P, so that for properly selected µ = µ_P ∈ M and for all e ∈ R^d it holds

$$\ln\left(\int_{\mathbf{R}^d}\exp\{e^T\omega\}P(d\omega)\right) \le \Phi(e;\mu_P).$$

On the other hand, for every g ∈ G we have φ_X(g) − g^T ω ≥ 0 on the support of P, whence for every h ∈ R^d one has

$$\ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega\}P(d\omega)\right) \le \ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega + \phi_X(g) - g^T\omega\}P(d\omega)\right) \le \phi_X(g) + \Phi(h - g;\mu_P).$$

Since the resulting inequality holds true for all g ∈ G, we get

$$\ln\left(\int_{\mathbf{R}^d}\exp\{h^T\omega\}P(d\omega)\right) \le \hat\Phi(h;\mu_P)\quad\forall h\in\mathbf{R}^d,$$

implying that P ∈ S[R^d, M, Φ̂]; because P ∈ P is arbitrary, the first part of the claim is justified. The inclusion S[R^d, M, Φ̂] ⊂ S[R^d, M, Φ] is readily given by the inequality Φ̂ ≤ Φ, and the latter is due to Φ̂(h, µ) ≤ Φ(h − 0, µ) + φ_X(0).
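To see what the construction buys, consider a one-dimensional sketch with X = [0, 1] and Φ(h; µ) = hµ + h²/8 (the bounded-support data for this X, where φ_X(h) + φ_X(−h) = |h|). The code below is an illustration under stated assumptions: G = [−10, 10] is replaced by a finite grid, the test distribution is Bernoulli with a made-up parameter, and all names are hypothetical. Because the grid contains g = 0 and g = h, the computed value still satisfies Φ̂ ≤ Φ and Φ̂ ≤ φ_X(h), and because the bound holds for every single g, it still dominates the true log-moment-generating function.

```python
import math

def phi(h, mu):
    """Phi(h; mu) = h*mu + h^2/8 for X = [0, 1]."""
    return h * mu + h * h / 8.0

def support_fn(g):
    """Support function of X = [0, 1]."""
    return max(0.0, g)

G_GRID = [k / 100 for k in range(-1000, 1001)]   # finite stand-in for G = [-10, 10]

def phi_hat(h, mu):
    """Grid approximation of inf over g in G of Phi(h - g; mu) + phi_X(g)."""
    return min(phi(h - g, mu) + support_fn(g) for g in G_GRID)

def bernoulli_log_mgf(h, p):
    """ln E exp(h*omega) for omega in {0, 1} with P(omega = 1) = p."""
    return math.log(1.0 - p + p * math.exp(h))

p = 0.3   # Bernoulli parameter, which is also the expectation e[P]
rows = []
for k in (-800, -200, -50, 0, 50, 200, 800):
    h = k / 100   # same arithmetic as the grid, so g = h is a grid point
    rows.append((h, bernoulli_log_mgf(h, p), phi_hat(h, p), phi(h, p), support_fn(h)))

for h, f_true, f_hat, f_old, f_sup in rows:
    print(f"h={h:+.1f}  F_P={f_true:.4f}  Phi_hat={f_hat:.4f}  Phi={f_old:.4f}  phi_X={f_sup:.4f}")
```

For large |h| the refined bound stays close to φ_X(h), while the original Φ grows quadratically.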

Illustration: Distributions with bounded support revisited. In Section 2.8.1.2, given a convex compact set X ⊂ R^d with support function φ_X, we checked that the data H = R^d, M = X, Φ(h; µ) = h^T µ + ⅛[φ_X(h) + φ_X(−h)]² is regular and the family S[R^d, M, Φ] contains the family P[X] of all probability distributions supported on X. Moreover, for every µ ∈ M = X, the family S[R^d, {µ}, Φ restricted to R^d × {µ}] contains all distributions supported on X with the expectations e[P] = µ. Note that Φ(h; e[P]) describes well the behavior of the logarithm F_P(h) = ln(∫_{R^d} e^{h^T ω} P(dω)) of the moment-generating function of P ∈ P[X] when h is small (indeed, F_P(h) = h^T e[P] + O(‖h‖²) as h → 0), and by far overestimates F_P(h) when h is large. Utilizing the above construction, we replace Φ with the real-valued, convex-concave, and continuous on R^d × M = R^d × X (see Exercise 2.22) function

$$\hat\Phi(h;\mu) = \inf_g\left[\hat\Psi(h,g;\mu) := (h - g)^T\mu + \tfrac18\left[\phi_X(h - g) + \phi_X(-h + g)\right]^2 + \phi_X(g)\right] \le \Phi(h;\mu). \qquad (2.114)$$

It is easy to see that Φ̂(·; ·) still ensures the inclusion P ∈ S[R^d, {e[P]}, Φ̂ restricted to R^d × {e[P]}] for every distribution P ∈ P[X] and “reproduces F_P(h) reasonably well” for both small and large h. Indeed, since F_P(h) ≤ Φ̂(h; e[P]) ≤ Φ(h; e[P]), for small h Φ̂(h; e[P]) reproduces F_P(h) even better than Φ(h; e[P]), and we clearly have

$$\hat\Phi(h;\mu) \le (h - h)^T\mu + \tfrac18\left[\phi_X(h - h) + \phi_X(-h + h)\right]^2 + \phi_X(h) = \phi_X(h)\quad\forall\mu,$$

and φ_X(h) is a correct description of F_P(h) for large h.

2.8.2 Main result

2.8.2.1 Situation & Construction

Assume we are given two collections of regular data with common Ω = R^d and common H, specifically, the collections (H, Mχ, Φχ), χ = 1, 2. We start with constructing a specific detector for the associated families of regular probability distributions Pχ = R[H, Mχ, Φχ], χ = 1, 2. When building the detector, we impose on the regular data in question the following

Assumption I: The regular data (H, Mχ, Φχ), χ = 1, 2, are such that the convex-concave function

$$\Psi(h;\mu_1,\mu_2) = \tfrac12\left[\Phi_1(-h;\mu_1) + \Phi_2(h;\mu_2)\right] : \mathcal{H}\times(\mathcal{M}_1\times\mathcal{M}_2)\to\mathbf{R} \qquad (2.115)$$

has a saddle point (min in h ∈ H, max in (µ_1, µ_2) ∈ M1 × M2).

A simple sufficient condition for existence of a saddle point of (2.115) is

Condition A: The sets M1 and M2 are compact, and the function

$$\overline{\Phi}(h) = \max_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\Psi(h;\mu_1,\mu_2)$$

is coercive on H, meaning that $\overline{\Phi}(h_i)$ → ∞ along every sequence h_i ∈ H with ‖h_i‖_2 → ∞ as i → ∞.

Indeed, under Condition A by the Sion-Kakutani Theorem (Theorem 2.22) it holds

$$\mathrm{SadVal}[\Psi] := \inf_{h\in\mathcal{H}}\underbrace{\max_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\Psi(h;\mu_1,\mu_2)}_{\overline{\Phi}(h)} = \sup_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\underbrace{\inf_{h\in\mathcal{H}}\Psi(h;\mu_1,\mu_2)}_{\underline{\Phi}(\mu_1,\mu_2)},$$

so that the optimization problems

$$\begin{array}{lrcl}
(P) & \mathrm{Opt}(P) &=& \min\limits_{h\in\mathcal{H}}\overline{\Phi}(h)\\
(D) & \mathrm{Opt}(D) &=& \max\limits_{\mu_1\in\mathcal{M}_1,\mu_2\in\mathcal{M}_2}\underline{\Phi}(\mu_1,\mu_2)
\end{array}$$

have equal optimal values. Under Condition A, problem (P) clearly is a problem of minimizing a continuous coercive function over a closed set and as such is solvable; thus, Opt(P) = Opt(D) is a real. Problem (D) clearly is the problem of maximizing over a compact set an upper semi-continuous (since Ψ is continuous) function taking real values and, perhaps, value −∞, and not identically equal to −∞ (since Opt(D) is a real), and thus (D) is solvable. As a result, (P) and (D) are solvable with common optimal values, and therefore Ψ has a saddle point.

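2.8.2.2 Main Result

An immediate (and essential) observation is as follows:

Proposition 2.39. In the situation of Section 2.8.2.1, let h ∈ H be such that the quantities

$$\Psi_1(h) = \sup_{\mu_1\in\mathcal{M}_1}\Phi_1(-h;\mu_1),\quad \Psi_2(h) = \sup_{\mu_2\in\mathcal{M}_2}\Phi_2(h;\mu_2)$$

are finite. Consider the affine detector

$$\phi_h(\omega) = h^T\omega + \underbrace{\tfrac12\left[\Psi_1(h) - \Psi_2(h)\right]}_{\kappa}.$$

Then

$$\mathrm{Risk}[\phi_h|\mathcal{R}[\mathcal{H},\mathcal{M}_1,\Phi_1],\mathcal{R}[\mathcal{H},\mathcal{M}_2,\Phi_2]] \le \exp\{\tfrac12[\Psi_1(h) + \Psi_2(h)]\}.$$

Proof. Let h satisfy the premise of the proposition. For every µ_1 ∈ M1, we have Φ1(−h; µ_1) ≤ Ψ1(h), and for every P ∈ R[H, M1, Φ1] we have

$$\int_\Omega\exp\{-h^T\omega\}P(d\omega) \le \exp\{\Phi_1(-h;\mu_1)\}$$

for properly selected µ_1 ∈ M1. Thus,

$$\int_\Omega\exp\{-h^T\omega\}P(d\omega) \le \exp\{\Psi_1(h)\}\quad\forall P\in\mathcal{R}[\mathcal{H},\mathcal{M}_1,\Phi_1],$$

whence also

$$\int_\Omega\exp\{-h^T\omega - \kappa\}P(d\omega) \le \exp\{\Psi_1(h) - \kappa\} = \exp\{\tfrac12[\Psi_1(h) + \Psi_2(h)]\}\quad\forall P\in\mathcal{R}[\mathcal{H},\mathcal{M}_1,\Phi_1].$$

Similarly, for every µ_2 ∈ M2, we have Φ2(h; µ_2) ≤ Ψ2(h), and for every P ∈ R[H, M2, Φ2], we have

$$\int_\Omega\exp\{h^T\omega\}P(d\omega) \le \exp\{\Phi_2(h;\mu_2)\}$$

for properly selected µ_2 ∈ M2. Thus,

$$\int_\Omega\exp\{h^T\omega\}P(d\omega) \le \exp\{\Psi_2(h)\}\quad\forall P\in\mathcal{R}[\mathcal{H},\mathcal{M}_2,\Phi_2],$$

and

$$\int_\Omega\exp\{h^T\omega + \kappa\}P(d\omega) \le \exp\{\Psi_2(h) + \kappa\} = \exp\{\tfrac12[\Psi_1(h) + \Psi_2(h)]\}\quad\forall P\in\mathcal{R}[\mathcal{H},\mathcal{M}_2,\Phi_2].\ ✷$$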
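Proposition 2.39 can be tried out on a one-dimensional Gaussian example. The sketch below assumes unit variance, takes M1 = [1, 2] and M2 = [−3, −1.5] as the sets of admissible means (all made-up values), and uses Φχ(h; θ) = θh + ½h². Since Φχ is linear in θ, the suprema defining Ψ1 and Ψ2 are attained at interval endpoints. The code minimizes the risk bound over a grid of h and compares it with the exact Gaussian values of the two integrals in the proof.

```python
import math

VAR = 1.0
M1 = (1.0, 2.0)      # endpoints of the mean interval under hypothesis 1
M2 = (-3.0, -1.5)    # endpoints of the mean interval under hypothesis 2

def phi(h, theta):
    """Gaussian log-MGF: theta*h + 0.5*VAR*h^2."""
    return theta * h + 0.5 * VAR * h * h

def psi_1(h):
    """sup over theta in M1 of Phi_1(-h; theta); endpoints suffice (linear in theta)."""
    return max(phi(-h, t) for t in M1)

def psi_2(h):
    """sup over theta in M2 of Phi_2(h; theta)."""
    return max(phi(h, t) for t in M2)

def kappa(h):
    """Shift of the affine detector phi_h(omega) = h*omega + kappa."""
    return 0.5 * (psi_1(h) - psi_2(h))

def risk_bound(h):
    """exp{0.5 * [Psi_1(h) + Psi_2(h)]}."""
    return math.exp(0.5 * (psi_1(h) + psi_2(h)))

def exact_risks(h, theta1, theta2):
    """E exp{-phi_h} under N(theta1, VAR) and E exp{+phi_h} under N(theta2, VAR)."""
    r1 = math.exp(phi(-h, theta1) - kappa(h))
    r2 = math.exp(phi(h, theta2) + kappa(h))
    return r1, r2

grid = [k / 100 for k in range(0, 301)]
best_h = min(grid, key=risk_bound)
best_bound = risk_bound(best_h)
print(best_h, kappa(best_h), best_bound)
```

In this example the bound is attained at the means closest to each other (θ = 1 and θ = −1.5), which is the saddle point structure that Assumption I formalizes.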

An immediate corollary is as follows:

Proposition 2.40. In the situation of Section 2.8.2.1 and under Assumption I, let us associate with a saddle point (h_*; µ_1^*, µ_2^*) of the convex-concave function (2.115) the following entities:
• the risk

$$\epsilon_\star := \exp\{\Psi(h_*;\mu_1^*,\mu_2^*)\}; \qquad (2.116)$$

this quantity is uniquely defined by the saddle point value of Ψ and thus is independent of how we select a saddle point;
• the detector φ_*(ω), the affine function of ω ∈ R^d given by

$$\phi_*(\omega) = h_*^T\omega + a,\quad a = \tfrac12\left[\Phi_1(-h_*;\mu_1^*) - \Phi_2(h_*;\mu_2^*)\right]. \qquad (2.117)$$

Then

$$\mathrm{Risk}[\phi_*|\mathcal{R}[\mathcal{H},\mathcal{M}_1,\Phi_1],\mathcal{R}[\mathcal{H},\mathcal{M}_2,\Phi_2]] \le \epsilon_\star.$$

Consequences. Assume we are given L collections (H, Mℓ, Φℓ) of regular data on a common observation space Ω = R^d and with common H, and let Pℓ = R[H, Mℓ, Φℓ] be the corresponding families of regular distributions. Assume also that for every pair (ℓ, ℓ′), 1 ≤ ℓ < ℓ′ ≤ L, the pair (H, Mℓ, Φℓ), (H, Mℓ′, Φℓ′) of regular data satisfies Assumption I, so that the convex-concave functions

$$\Psi_{\ell\ell'}(h;\mu_\ell,\mu_{\ell'}) = \tfrac12\left[\Phi_\ell(-h;\mu_\ell) + \Phi_{\ell'}(h;\mu_{\ell'})\right] : \mathcal{H}\times(\mathcal{M}_\ell\times\mathcal{M}_{\ell'})\to\mathbf{R}\quad [1\le\ell<\ell'\le L]$$

have saddle points (h^*_{ℓℓ′}; (µ^*_ℓ, µ^*_{ℓ′})) (min in h ∈ H, max in (µ_ℓ, µ_ℓ′) ∈ Mℓ × Mℓ′). These saddle points give rise to the affine detectors

$$\phi_{\ell\ell'}(\omega) = [h^*_{\ell\ell'}]^T\omega + \tfrac12\left[\Phi_\ell(-h^*_{\ell\ell'};\mu^*_\ell) - \Phi_{\ell'}(h^*_{\ell\ell'};\mu^*_{\ell'})\right]\quad [1\le\ell<\ell'\le L]$$

and the quantities

$$\epsilon_{\ell\ell'} = \exp\left\{\tfrac12\left[\Phi_\ell(-h^*_{\ell\ell'};\mu^*_\ell) + \Phi_{\ell'}(h^*_{\ell\ell'};\mu^*_{\ell'})\right]\right\}\quad [1\le\ell<\ell'\le L];$$

by Proposition 2.40, ϵ_{ℓℓ′} are upper bounds on the risks, taken w.r.t. Pℓ, Pℓ′, of the detectors φ_{ℓℓ′}:

$$\int_\Omega e^{-\phi_{\ell\ell'}(\omega)}P(d\omega) \le \epsilon_{\ell\ell'}\ \ \forall P\in\mathcal{P}_\ell\quad\&\quad\int_\Omega e^{\phi_{\ell\ell'}(\omega)}P(d\omega) \le \epsilon_{\ell\ell'}\ \ \forall P\in\mathcal{P}_{\ell'}\quad [1\le\ell<\ell'\le L].$$

Setting φ_{ℓℓ′}(·) = −φ_{ℓ′ℓ}(·) and ϵ_{ℓℓ′} = ϵ_{ℓ′ℓ} when L ≥ ℓ > ℓ′ ≥ 1 and φ_{ℓℓ}(·) ≡ 0, ϵ_{ℓℓ} = 1, 1 ≤ ℓ ≤ L, we get a system of detectors and risks satisfying (2.80) and, consequently, can use these “building blocks” in the machinery developed so far for pairwise and multiple hypothesis testing from single and repeated observations (stationary, semi-stationary, and quasi-stationary).

Numerical example. To get some impression of how Proposition 2.40 extends the grasp of our computation-friendly machinery of test design consider a toy problem as follows:

135

HYPOTHESIS TESTING

We are given an observation √ √ ω = Ax + σADiag { x1 , ..., xn } ξ,

(2.118)

where • unknown signal x is known to belong to a given convex compact subset M of the interior of Rn+ ; • A is a given n × n matrix of rank n, σ > 0 is a given noise intensity, and ξ ∼ N (0, In ).

Our goal is to decide via a K-repeated version of observations (2.118) on the pair of hypotheses x ∈ Xχ , χ = 1, 2, where X1 , X2 are given nonempty convex compact subsets of M .

Note that an essential novelty, as compared to the standard Gaussian o.s., is that now we deal with zero mean Gaussian noise with covariance matrix Θ(x) = σ 2 ADiag{x}AT depending on the true signal—the larger the signal, the greater the noise. We can easily process the situation in question utilizing the machinery developed in this section. Namely, let us set Hχ = Rn , Mχ = {(x, Diag{x}) : x ∈ Xχ } ⊂ Rn+ × Sn+ , 2 Φχ (h; x, Ξ) = hT AT x + σ2 hT [AΞAT ]h : Mχ → R.

[χ = 1, 2]

It is immediately seen that for χ = 1, 2, H, Mχ , Φχ is regular data, and that the distribution P of observation (2.118) stemming from a signal x ∈ Xχ belongs to S[H, Mχ , Φχ ], so that we can use Proposition 2.40 to build an affine detector for the families Pχ , χ = 1, 2, of distributions of observations (2.118) stemming from signals x ∈ Xχ . The corresponding recipe boils down to the necessity to find a saddle point (h∗ ; x∗ , y∗ ) of the simple convex-concave function   σ2 T 1 T h A(y − x) + h ADiag{x + y}AT h Ψ(h; x, y) = 2 2 (min in h ∈ Rn , max in (x, y) ∈ X1 × X2 ). Such a point clearly exists and is easily found, and gives rise to affine detector φ∗ (ω) = hT∗ ω + 41 σ 2 hT∗ ADiag{x∗ − y∗ }AT h∗ − 21 hT∗ A[x∗ + y∗ ] | {z } a

such that

Risk[φ∗ |P1 , P2 ] ≤ exp

   σ2 T 1 T h∗ A[y∗ − x∗ ] + h∗ ADiag{x∗ + y∗ }AT h∗ . 2 2

(2.119)

Note that we could also process the situation by defining the regular data as $\mathcal{H},\ \mathcal{M}^+_\chi=X_\chi,\ \Phi^+_\chi$, $\chi=1,2$, where
$$
\Phi^+_\chi(h;x)=h^TAx+\frac{\sigma^2\theta}{2}h^TAA^Th\qquad\left[\theta=\max_{x\in X_1\cup X_2}\|x\|_\infty\right],
$$
which, basically, means passing from our actual observations (2.118) to the “more noisy” observations given by the Gaussian o.s.
$$
\omega=Ax+\eta,\quad\eta\sim\mathcal{N}(0,\sigma^2\theta AA^T).
\tag{2.120}
$$

It is easily seen that, for this Gaussian o.s., the risk $\mathrm{Risk}[\phi_\#|\mathcal{P}_1,\mathcal{P}_2]$ of the optimal detector $\phi_\#$ can be upper-bounded by the risk $\mathrm{Risk}[\phi_\#|\mathcal{P}^+_1,\mathcal{P}^+_2]$ known to us, where $\mathcal{P}^+_\chi$ is the family of distributions of observations (2.120) induced by signals $x\in X_\chi$. Note that $\mathrm{Risk}[\phi_\#|\mathcal{P}^+_1,\mathcal{P}^+_2]$ is seemingly the best risk bound available for us “within the realm of detector-based tests in simple o.s.’s.” The goal of the small numerical experiment we are about to report on is to understand how our new risk bound (2.119) compares to the “old” bound $\mathrm{Risk}[\phi_\#|\mathcal{P}^+_1,\mathcal{P}^+_2]$. We use
$$
n=16,\quad
X_1=\left\{x\in\mathbf{R}^{16}:\begin{array}{l}0.001\le x_1\le\delta\\0.001\le x_i\le1,\ 2\le i\le16\end{array}\right\},\quad
X_2=\left\{x\in\mathbf{R}^{16}:\begin{array}{l}2\delta\le x_1\le1\\0.001\le x_i\le1,\ 2\le i\le16\end{array}\right\}
$$
and $\sigma=0.1$. The “separation parameter” $\delta$ is set to 0.1. Finally, the $16\times16$ matrix $A$ has condition number 100 (singular values $0.01^{(i-1)/15}$, $1\le i\le16$) and randomly oriented systems of left and right singular vectors. With this setup, a typical numerical result is as follows:
• the right-hand side in (2.119) is 0.4346, implying that with detector $\phi_*$, a 6-repeated observation is sufficient to decide on our two hypotheses with risk $\le0.01$;
• the quantity $\mathrm{Risk}[\phi_\#|\mathcal{P}^+_1,\mathcal{P}^+_2]$ is 0.8825, meaning that with detector $\phi_\#$, we need at least a 37-repeated observation to guarantee risk $\le0.01$.
When the separation parameter $\delta$ participating in the descriptions of $X_1$, $X_2$ is reduced to 0.01, the risks in question grow to 0.9201 and 0.9988, respectively (a 56-repeated observation to decide on the hypotheses with risk 0.01 when $\phi_*$ is used vs. a 3685-repeated observation needed when $\phi_\#$ is used). The bottom line is that the new developments can indeed significantly improve the performance of our inferences.

2.8.2.3 Sub-Gaussian and Gaussian cases

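For $\chi=1,2$, let $U_\chi$ be a nonempty closed convex set in $\mathbf{R}^d$, and $\mathcal{V}_\chi$ be a compact convex subset of the interior of the positive semidefinite cone $\mathbf{S}^d_+$. We assume that $U_1$ is compact. Setting
$$
\mathcal{H}_\chi=\Omega=\mathbf{R}^d,\quad\mathcal{M}_\chi=U_\chi\times\mathcal{V}_\chi,\quad
\Phi_\chi(h;\theta,\Theta)=\theta^Th+\tfrac12h^T\Theta h:\ \mathcal{H}_\chi\times\mathcal{M}_\chi\to\mathbf{R},\quad\chi=1,2,
\tag{2.121}
$$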

we get two collections $(\mathcal{H},\mathcal{M}_\chi,\Phi_\chi)$, $\chi=1,2$, of regular data. As we know from Section 2.8.1.2, for $\chi=1,2$, the families of distributions $\mathcal{S}[\mathbf{R}^d,\mathcal{M}_\chi,\Phi_\chi]$ contain the families $\mathcal{SG}[U_\chi,\mathcal{V}_\chi]$ of sub-Gaussian distributions on $\mathbf{R}^d$ with sub-Gaussianity parameters $(\theta,\Theta)\in U_\chi\times\mathcal{V}_\chi$ (see (2.106)), as well as families $\mathcal{G}[U_\chi,\mathcal{V}_\chi]$ of Gaussian distributions on $\mathbf{R}^d$ with parameters $(\theta,\Theta)$ (expectation and covariance matrix) running through $U_\chi\times\mathcal{V}_\chi$. Besides this, the pair of regular data in question clearly satisfies Condition A. Consequently, the test $\mathcal{T}^K_*$ given by the above construction as applied to the collections of regular data (2.121) is well defined and allows us to decide on hypotheses
$$
H_\chi:\ P\in\mathcal{R}[\mathbf{R}^d,U_\chi,\mathcal{V}_\chi],\quad\chi=1,2,
$$
on the distribution $P$ underlying the $K$-repeated observation $\omega^K$. The same test can also be used to decide on stricter hypotheses $H^G_\chi$, $\chi=1,2$, stating that the observations $\omega_1,\ldots,\omega_K$ are i.i.d. and drawn from a Gaussian distribution $P$ belonging to $\mathcal{G}[U_\chi,\mathcal{V}_\chi]$. Our goal now is to process in detail the situation in question and to refine our conclusions on the risk of the test $\mathcal{T}^1_*$ when the Gaussian hypotheses $H^G_\chi$ are considered and the situation is symmetric, that is, when $\mathcal{V}_1=\mathcal{V}_2$.

Observe, first, that the convex-concave function $\Psi$ from (2.115) in the current setting becomes
$$
\Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2)=\tfrac12h^T[\theta_2-\theta_1]+\tfrac14h^T\Theta_1h+\tfrac14h^T\Theta_2h.
\tag{2.122}
$$

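We are interested in solutions to the saddle point problem
$$
\min_{h\in\mathbf{R}^d}\ \max_{\substack{\theta_1\in U_1,\theta_2\in U_2\\\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}}\Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2)
\tag{2.123}
$$
associated with the function (2.122). From the structure of $\Psi$ and compactness of $U_1$, $\mathcal{V}_1$, $\mathcal{V}_2$, combined with the fact that $\mathcal{V}_\chi$, $\chi=1,2$, are comprised of positive definite matrices, it immediately follows that saddle points do exist, and a saddle point $(h_*;\theta_1^*,\Theta_1^*,\theta_2^*,\Theta_2^*)$ satisfies the relations
$$
\begin{array}{ll}
(a)&h_*=[\Theta_1^*+\Theta_2^*]^{-1}[\theta_1^*-\theta_2^*],\\
(b)&h_*^T(\theta_1-\theta_1^*)\ge0\ \ \forall\theta_1\in U_1,\quad h_*^T(\theta_2^*-\theta_2)\ge0\ \ \forall\theta_2\in U_2,\\
(c)&h_*^T\Theta_1h_*\le h_*^T\Theta_1^*h_*\ \ \forall\Theta_1\in\mathcal{V}_1,\quad h_*^T\Theta_2h_*\le h_*^T\Theta_2^*h_*\ \ \forall\Theta_2\in\mathcal{V}_2.
\end{array}
\tag{2.124}
$$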

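From (2.124.a) it immediately follows that the affine detector $\phi_*(\cdot)$ and risk $\epsilon_\star$, as given by (2.116) and (2.117), are
$$
\begin{array}{rcl}
\phi_*(\omega)&=&h_*^T[\omega-w_*]+\tfrac14h_*^T[\Theta_1^*-\Theta_2^*]h_*,\quad w_*=\tfrac12[\theta_1^*+\theta_2^*];\\
\epsilon_\star&=&\exp\{-\tfrac14[\theta_1^*-\theta_2^*]^T[\Theta_1^*+\Theta_2^*]^{-1}[\theta_1^*-\theta_2^*]\}\\
&=&\exp\{-\tfrac14h_*^T[\Theta_1^*+\Theta_2^*]h_*\}.
\end{array}
\tag{2.125}
$$
(The coefficient $\tfrac14$ of $h_*^T[\Theta_1^*-\Theta_2^*]h_*$ is the one for which both partial risks of $\phi_*$ equal $\epsilon_\star$; this term vanishes in the symmetric case considered below.)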

Note that in the symmetric case (where $\mathcal{V}_1=\mathcal{V}_2$), there always exists a saddle point of $\Psi$ with $\Theta_1^*=\Theta_2^*$,²¹ and the test $\mathcal{T}^1_*$ associated with such a saddle point is quite transparent: it is the maximum likelihood test for two Gaussian distributions, $\mathcal{N}(\theta_1^*,\Theta_*)$, $\mathcal{N}(\theta_2^*,\Theta_*)$, where $\Theta_*$ is the common value of $\Theta_1^*$ and $\Theta_2^*$. The bound $\epsilon_\star$ on the risk of the test is nothing but the Hellinger affinity of these two Gaussian distributions, or, equivalently,
$$
\epsilon_\star=\exp\left\{-\tfrac18[\theta_1^*-\theta_2^*]^T\Theta_*^{-1}[\theta_1^*-\theta_2^*]\right\}.
$$

²¹ Indeed, from (2.122) it follows that when $\mathcal{V}_1=\mathcal{V}_2$, the function $\Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2)$ is symmetric w.r.t. $\Theta_1,\Theta_2$, implying similar symmetry of the function $\underline{\Psi}(\theta_1,\Theta_1,\theta_2,\Theta_2)=\min_{h\in\mathcal{H}}\Psi(h;\theta_1,\Theta_1,\theta_2,\Theta_2)$. Since $\underline{\Psi}$ is concave, the set $M$ of its maximizers over $\mathcal{M}_1\times\mathcal{M}_2$ (which, as we know, is nonempty) is symmetric w.r.t. the swap of $\Theta_1$ and $\Theta_2$ and is convex, implying that if $(\theta_1,\Theta_1,\theta_2,\Theta_2)\in M$, then $(\theta_1,\tfrac12[\Theta_1+\Theta_2],\theta_2,\tfrac12[\Theta_1+\Theta_2])\in M$ as well, and the latter point is the desired component of the saddle point of $\Psi$ with $\Theta_1=\Theta_2$.


We arrive at the following result:

Proposition 2.41. In the symmetric sub-Gaussian case (i.e., in the case of (2.121) with $\mathcal{V}_1=\mathcal{V}_2$), saddle point problem (2.122), (2.123) admits a saddle point of the form $(h_*;\theta_1^*,\Theta_*,\theta_2^*,\Theta_*)$, and the associated affine detector and its risk are given by
$$
\begin{array}{rcl}
\phi_*(\omega)&=&h_*^T[\omega-w_*],\quad w_*=\tfrac12[\theta_1^*+\theta_2^*];\\
\epsilon_\star&=&\exp\{-\tfrac18[\theta_1^*-\theta_2^*]^T\Theta_*^{-1}[\theta_1^*-\theta_2^*]\}.
\end{array}
$$
As a result, when deciding, via $\omega^K$, on “sub-Gaussian hypotheses” $H_\chi$, $\chi=1,2$, the risk of the test $\mathcal{T}^K_*$ associated with $\phi^{(K)}_*(\omega^K):=\sum_{t=1}^K\phi_*(\omega_t)$ is at most $\epsilon_\star^K$.
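To make the formulas of Proposition 2.41 concrete, here is a minimal Python sketch for the simplest symmetric instance, in which $U_1$, $U_2$, and the common set $\mathcal{V}$ are singletons, so that the saddle point is read off directly from (2.124.a). The numerical data and the helper `solve2` are hypothetical and serve only as an illustration.

```python
import math

def solve2(M, b):
    """Solve the 2x2 linear system M z = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - M[0][1] * b[1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

# Hypothetical data: U_1 = {theta1}, U_2 = {theta2}, V = {Theta}.
theta1, theta2 = [1.0, 0.0], [-1.0, 0.5]
Theta = [[2.0, 0.5], [0.5, 1.0]]

diff = [a - b for a, b in zip(theta1, theta2)]
h = [0.5 * z for z in solve2(Theta, diff)]           # h_* = (1/2) Theta^{-1} (theta1 - theta2)
w = [0.5 * (a + b) for a, b in zip(theta1, theta2)]  # w_* = (theta1 + theta2) / 2
# eps_* = exp{-(1/8) diff^T Theta^{-1} diff} = exp{-(1/4) diff^T h_*}
eps = math.exp(-0.25 * sum(d * hi for d, hi in zip(diff, h)))

def detector(omega):
    """phi_*(omega) = h_*^T (omega - w_*); nonnegative values favor the first hypothesis."""
    return sum(hi * (o - wi) for hi, o, wi in zip(h, omega, w))

# Smallest K for which the bound eps_*^K on the risk of the K-repeated test is <= 0.01.
K = math.ceil(math.log(0.01) / math.log(eps))
print(h, eps, K)
```

The detector is antisymmetric about the midpoint $w_*$, which is why a single threshold at zero serves both hypotheses.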

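In the symmetric single-observation Gaussian case, that is, when $\mathcal{V}_1=\mathcal{V}_2$ and we apply the test $\mathcal{T}_*=\mathcal{T}^1_*$ to observation $\omega\equiv\omega_1$ in order to decide on the hypotheses $H^G_\chi$, $\chi=1,2$, the above risk bound can be improved:

Proposition 2.42. Consider the symmetric case $\mathcal{V}_1=\mathcal{V}_2=\mathcal{V}$, let $(h_*;\theta_1^*,\Theta_1^*,\theta_2^*,\Theta_2^*)$ be the “symmetric” (with $\Theta_1^*=\Theta_2^*=\Theta_*$) saddle point of function $\Psi$ given by (2.122), and let $\phi_*$ be the affine detector given by (2.124) and (2.125):
$$
\phi_*(\omega)=h_*^T[\omega-w_*],\quad h_*=\tfrac12\Theta_*^{-1}[\theta_1^*-\theta_2^*],\quad w_*=\tfrac12[\theta_1^*+\theta_2^*].
$$
Let also
$$
\delta=\sqrt{h_*^T\Theta_*h_*}=\tfrac12\sqrt{[\theta_1^*-\theta_2^*]^T\Theta_*^{-1}[\theta_1^*-\theta_2^*]},
$$
so that
$$
\delta^2=h_*^T[\theta_1^*-w_*]=h_*^T[w_*-\theta_2^*]\ \ \text{and}\ \ \epsilon_\star=\exp\left\{-\tfrac12\delta^2\right\}.
\tag{2.126}
$$
Let, further, $\alpha\le\delta^2$, $\beta\le\delta^2$. Then
$$
\begin{array}{ll}
(a)&\forall(\theta\in U_1,\Theta\in\mathcal{V}):\ \mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{\phi_*(\omega)\le\alpha\}\le\mathrm{Erfc}(\delta-\alpha/\delta),\\
(b)&\forall(\theta\in U_2,\Theta\in\mathcal{V}):\ \mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{\phi_*(\omega)\ge-\beta\}\le\mathrm{Erfc}(\delta-\beta/\delta).
\end{array}
\tag{2.127}
$$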

In particular, when deciding, via a single observation $\omega$, on Gaussian hypotheses $H^G_\chi$, $\chi=1,2$, with $H^G_\chi$ stating that $\omega\sim\mathcal{N}(\theta,\Theta)$ with $(\theta,\Theta)\in U_\chi\times\mathcal{V}$, the risk of the test $\mathcal{T}^1_*$ associated with $\phi_*$ is at most $\mathrm{Erfc}(\delta)$.

Proof. Let us prove (a) (the proof of (b) is completely similar). For $\theta\in U_1$, $\Theta\in\mathcal{V}$ we have
$$
\begin{array}{rcl}
\mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{\phi_*(\omega)\le\alpha\}
&=&\mathrm{Prob}_{\omega\sim\mathcal{N}(\theta,\Theta)}\{h_*^T[\omega-w_*]\le\alpha\}\\
&=&\mathrm{Prob}_{\xi\sim\mathcal{N}(0,I)}\{h_*^T[\theta+\Theta^{1/2}\xi-w_*]\le\alpha\}\\
&=&\mathrm{Prob}_{\xi\sim\mathcal{N}(0,I)}\{[\Theta^{1/2}h_*]^T\xi\le\alpha-\underbrace{h_*^T[\theta-w_*]}_{\ge h_*^T[\theta_1^*-w_*]=\delta^2\ \text{by (2.124.b), (2.126)}}\}\\
&\le&\mathrm{Prob}_{\xi\sim\mathcal{N}(0,I)}\{[\Theta^{1/2}h_*]^T\xi\le\alpha-\delta^2\}\\
&=&\mathrm{Erfc}([\delta^2-\alpha]/\|\Theta^{1/2}h_*\|_2)\\
&\le&\mathrm{Erfc}([\delta^2-\alpha]/\|\Theta_*^{1/2}h_*\|_2)\qquad[\text{due to }\delta^2-\alpha\ge0\text{ and }h_*^T\Theta h_*\le h_*^T\Theta_*h_*\text{ by (2.124.c)}]\\
&=&\mathrm{Erfc}([\delta^2-\alpha]/\delta).
\end{array}
$$
The “in particular” part of the Proposition is readily given by (2.127) as applied with $\alpha=\beta=0$. ✷

Note that the progress, as compared to our results on the minimum risk detectors for convex hypotheses in Gaussian o.s., is that we no longer assume that the covariance matrix is fixed once and for all. Now neither the mean nor the covariance matrix of the observed Gaussian random variable is known in advance. In this setting, the mean runs through a closed convex set (depending on the hypothesis), and the covariance runs, independently of the mean, through a given convex compact subset of the interior of the positive semidefinite cone, and this subset should be common for both hypotheses we are deciding upon.
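As the proof shows, Erfc here denotes the tail probability of the standard Gaussian distribution, $\mathrm{Erfc}(s)=\mathrm{Prob}_{\xi\sim\mathcal{N}(0,1)}\{\xi\ge s\}$. The short Python sketch below compares the single-observation bound $\mathrm{Erfc}(\delta)$ of Proposition 2.42 with the general bound $\epsilon_\star=\exp\{-\delta^2/2\}$ from (2.126). The helper name `gaussian_tail` and the sample values of $\delta$ are illustrative choices; the tail is expressed through the standard library's complementary error function.

```python
import math

def gaussian_tail(s):
    """P{xi >= s} for xi ~ N(0, 1), i.e. Erfc(s) in the notation of the text."""
    return 0.5 * math.erfc(s / math.sqrt(2.0))

for delta in (0.5, 1.0, 2.0, 3.0):
    general = math.exp(-0.5 * delta ** 2)  # eps_* from (2.126)
    refined = gaussian_tail(delta)         # bound of Proposition 2.42
    print(f"delta={delta}: exp(-delta^2/2)={general:.4f}, Erfc(delta)={refined:.4f}")
```

For every $\delta\ge0$ the refined bound is at most half of the general one, since the Gaussian tail satisfies $\mathrm{Erfc}(s)\le\tfrac12e^{-s^2/2}$ for $s\ge0$.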

2.9 BEYOND THE SCOPE OF AFFINE DETECTORS: LIFTING THE OBSERVATIONS

2.9.1 Motivation

The detectors considered in Section 2.8 were affine functions of observations. Note, however, that what an observation is depends, to some extent, on us. To give an instructive example, consider the Gaussian observation
$$
\zeta=A[u;1]+\xi\in\mathbf{R}^d,
$$
where $u$ is an unknown signal known to belong to a given set $U\subset\mathbf{R}^n$, $u\mapsto A[u;1]$ is a given affine mapping from $\mathbf{R}^n$ into the observation space $\mathbf{R}^d$, and $\xi$ is zero mean Gaussian observation noise with covariance matrix $\Theta$ known to belong to a given convex compact subset $\mathcal{V}$ of the interior of the positive semidefinite cone $\mathbf{S}^d_+$. Treating the observation “as is,” a detector affine in the observation is affine in $[u;\xi]$. On the other hand, we can treat as our observation the image of the actual observation $\zeta$ under any deterministic mapping, e.g., the “quadratic lifting” $\zeta\mapsto(\zeta,\zeta\zeta^T)$. A detector affine in the new observation is quadratic in $u$ and $\xi$: we get access to a wider set of detectors as compared to those affine in $\zeta$! At first glance, applying our “affine detectors” machinery to appropriate “nonlinear liftings” of actual observations, we can handle quite complicated detectors, e.g., polynomial, of arbitrary degree, in $\zeta$. The bottleneck here stems from the fact that in general it is difficult to “cover” the distribution of a “nonlinearly lifted” observation $\zeta$ (even as simple as the Gaussian observation above) by an explicitly defined family of regular distributions, and such a “covering” is what we need in order to apply our affine detector machinery to the lifted observation. It turns out, however, that in some important cases the desired covering is achievable. We are about to demonstrate that this takes place in the case of the quadratic lifting $\zeta\mapsto(\zeta,\zeta\zeta^T)$ of (sub)Gaussian observation $\zeta$, and the resulting quadratic detectors allow us to handle some important inference problems which are far beyond the grasp of “genuinely affine” detectors.

2.9.2 Quadratic lifting: Gaussian case

Given positive integer $d$, we define $\mathcal{E}^d$ as the linear space $\mathbf{R}^d\times\mathbf{S}^d$ equipped with the inner product
$$
\langle(z,S),(z',S')\rangle=z^Tz'+\tfrac12\mathrm{Tr}(SS').
$$
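The factor $\tfrac12$ in this inner product is what makes a linear form of the lifted observation $(\zeta,\zeta\zeta^T)$ with coefficients $(h,H)$ equal to $h^T\zeta+\tfrac12\zeta^TH\zeta$. The following Python sketch checks this identity numerically; the function names and the sample data are hypothetical.

```python
def lift(z):
    """Quadratic lifting z -> (z, z z^T)."""
    return z, [[zi * zj for zj in z] for zi in z]

def inner(p, q):
    """Inner product <(z, S), (z', S')> = z^T z' + (1/2) Tr(S S') on R^d x S^d."""
    (z, S), (z2, S2) = p, q
    d = len(z)
    lin = sum(a * b for a, b in zip(z, z2))
    tr = sum(S[i][j] * S2[j][i] for i in range(d) for j in range(d))
    return lin + 0.5 * tr

# Hypothetical coefficients (h, H) with symmetric H, and a sample observation zeta.
h = [1.0, -2.0]
H = [[0.5, 0.25], [0.25, -1.0]]
zeta = [3.0, 1.0]

affine_in_lifting = inner((h, H), lift(zeta))
quadratic_in_zeta = (sum(hi * zi for hi, zi in zip(h, zeta))
                     + 0.5 * sum(zeta[i] * H[i][j] * zeta[j]
                                 for i in range(2) for j in range(2)))
print(affine_in_lifting, quadratic_in_zeta)
```

Both numbers coincide, so a detector that is affine in the lifted observation is exactly a quadratic function of $\zeta$.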

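Note that the quadratic lifting $z\mapsto(z,zz^T)$ maps the space $\mathbf{R}^d$ into $\mathcal{E}^d$. In the sequel, an instrumental role is played by the following result.

Proposition 2.43.
(i) Assume we are given
• a nonempty and bounded subset $U$ of $\mathbf{R}^n$;
• a convex compact set $\mathcal{V}$ contained in the interior of the cone $\mathbf{S}^d_+$ of positive semidefinite $d\times d$ matrices;
• a $d\times(n+1)$ matrix $A$.
These data specify the family $\mathcal{G}_A[U,\mathcal{V}]$ of distributions of quadratic liftings $(\zeta,\zeta\zeta^T)$ of Gaussian random vectors $\zeta\sim\mathcal{N}(A[u;1],\Theta)$ stemming from $u\in U$ and $\Theta\in\mathcal{V}$. Let us select some
1. $\gamma\in(0,1)$,
2. convex compact subset $\mathcal{Z}$ of the set $\mathcal{Z}^n=\{Z\in\mathbf{S}^{n+1}:Z\succeq0,\ Z_{n+1,n+1}=1\}$ such that
$$
Z(u):=[u;1][u;1]^T\in\mathcal{Z}\quad\forall u\in U,
\tag{2.128}
$$
3. positive definite $d\times d$ matrix $\Theta_*\in\mathbf{S}^d_+$ and $\delta\in[0,2]$ such that
$$
\Theta_*\succeq\Theta\ \ \forall\Theta\in\mathcal{V}\quad\&\quad\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\ \ \forall\Theta\in\mathcal{V},
\tag{2.129}
$$
where $\|\cdot\|$ is the spectral norm,²²
and set
$$
\mathcal{H}=\mathcal{H}_\gamma:=\{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:-\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},
$$
$$
\begin{array}{rcl}
\Phi_{A,\mathcal{Z}}(h,H;\Theta)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2})+\frac12\mathrm{Tr}([\Theta-\Theta_*]H)
+\frac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2\\
&&+\ \frac12\phi_{\mathcal{Z}}\left(B^T\left[\left[\begin{array}{c|c}H&h\\\hline h^T&\end{array}\right]+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\right]B\right):\ \mathcal{H}\times\mathcal{V}\to\mathbf{R},
\end{array}
\tag{2.130}
$$
where $B$ is given by
$$
B=\left[\begin{array}{c}A\\{[0,\ldots,0,1]}\end{array}\right],
\tag{2.131}
$$
the function
$$
\phi_{\mathcal{Z}}(Y):=\max_{Z\in\mathcal{Z}}\mathrm{Tr}(ZY)
\tag{2.132}
$$
is the support function of $\mathcal{Z}$, and $\|\cdot\|_F$ is the Frobenius norm.
Function $\Phi_{A,\mathcal{Z}}$ is continuous on its domain, convex in $(h,H)\in\mathcal{H}$ and concave in $\Theta\in\mathcal{V}$, so that $(\mathcal{H},\mathcal{V},\Phi_{A,\mathcal{Z}})$ is regular data. Besides this,
(#) Whenever $u\in\mathbf{R}^n$ is such that $[u;1][u;1]^T\in\mathcal{Z}$ and $\Theta\in\mathcal{V}$, the Gaussian random vector $\zeta\sim\mathcal{N}(A[u;1],\Theta)$ satisfies the relation
$$
\forall(h,H)\in\mathcal{H}:\ \ln\left(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],\Theta)}\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right)\le\Phi_{A,\mathcal{Z}}(h,H;\Theta).
\tag{2.133}
$$
The latter relation combines with (2.128) to imply that
$$
\mathcal{G}_A[U,\mathcal{V}]\subset\mathcal{S}[\mathcal{H},\mathcal{V},\Phi_{A,\mathcal{Z}}].
$$
In addition, $\Phi_{A,\mathcal{Z}}$ is coercive in $(h,H)$: $\Phi_{A,\mathcal{Z}}(h_i,H_i;\Theta)\to+\infty$ as $i\to\infty$ whenever $\Theta\in\mathcal{V}$, $(h_i,H_i)\in\mathcal{H}$, and $\|(h_i,H_i)\|\to\infty$, $i\to\infty$.

²² It is easily seen that with $\delta=2$, the second relation in (2.129) is satisfied for all $\Theta$ such that $0\preceq\Theta\preceq\Theta_*$, so that the restriction $\delta\le2$ is w.l.o.g.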

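(ii) Let two collections of entities from (i), $(\mathcal{V}_\chi,\Theta_*^{(\chi)},\delta_\chi,\gamma_\chi,A_\chi,\mathcal{Z}_\chi)$, $\chi=1,2$, with common $d$ be given, giving rise to the sets $\mathcal{H}_\chi$, matrices $B_\chi$, and functions $\Phi_{A_\chi,\mathcal{Z}_\chi}(h,H;\Theta)$, $\chi=1,2$. These collections specify the families of normal distributions
$$
\mathcal{G}_\chi=\{\mathcal{N}(v,\Theta):\Theta\in\mathcal{V}_\chi\ \&\ \exists u\in U:v=A_\chi[u;1]\},\quad\chi=1,2.
$$
Consider the convex-concave saddle point problem
$$
\mathcal{SV}=\min_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,-H;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h,H;\Theta_2)\right]}_{\Phi(h,H;\Theta_1,\Theta_2)}.
\tag{2.134}
$$
A saddle point $(H_*,h_*;\Theta_1^*,\Theta_2^*)$ in this problem does exist, and the induced quadratic detector
$$
\phi_*(\omega)=\tfrac12\omega^TH_*\omega+h_*^T\omega+\underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)-\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)\right]}_{a},
\tag{2.135}
$$
when applied to the families of Gaussian distributions $\mathcal{G}_\chi$, $\chi=1,2$, has the risk
$$
\mathrm{Risk}[\phi_*|\mathcal{G}_1,\mathcal{G}_2]\le\epsilon_\star:=e^{\mathcal{SV}},
$$
that is,
$$
\begin{array}{ll}
(a)&\int_{\mathbf{R}^d}e^{-\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{G}_1,\\
(b)&\int_{\mathbf{R}^d}e^{\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{G}_2.
\end{array}
\tag{2.136}
$$
For proof, see Section 2.11.5.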
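Bounds of the form (2.133) rest on the standard closed-form expression for the expectation of the exponent of a quadratic form of a Gaussian vector. As a sanity check, the Python sketch below treats the scalar case $d=1$: for $\zeta\sim\mathcal{N}(\mu,\theta)$ and $\eta<1/\theta$ one has $\ln\mathbf{E}\{e^{\frac12\eta\zeta^2+h\zeta}\}=-\frac12\ln(1-\theta\eta)+\frac12\eta\mu^2+h\mu+\frac{(\eta\mu+h)^2}{2(\theta^{-1}-\eta)}$, and one can check that this is what the right-hand side of (2.130) reduces to when $\Theta=\Theta_*$, $\delta=0$, and $\mathcal{Z}$ consists of the single matrix $Z(u)$. The function names, the quadrature scheme, and the numerical values are assumptions made for illustration.

```python
import math

def log_mgf_closed_form(mu, theta, eta, h):
    """ln E exp{eta*z^2/2 + h*z} for z ~ N(mu, theta); requires eta < 1/theta."""
    b = eta * mu + h
    return (-0.5 * math.log(1.0 - theta * eta) + 0.5 * eta * mu ** 2 + h * mu
            + 0.5 * b ** 2 / (1.0 / theta - eta))

def log_mgf_quadrature(mu, theta, eta, h, half_width=30.0, n=200_000):
    """Same quantity via midpoint-rule integration against the N(mu, theta) density."""
    step = 2.0 * half_width / n
    total = 0.0
    for k in range(n):
        z = mu - half_width + (k + 0.5) * step
        log_density = -0.5 * (z - mu) ** 2 / theta - 0.5 * math.log(2.0 * math.pi * theta)
        total += math.exp(0.5 * eta * z ** 2 + h * z + log_density) * step
    return math.log(total)

mu, theta, eta, h = 0.5, 1.0, 0.3, 0.2  # hypothetical values with eta < 1/theta
closed = log_mgf_closed_form(mu, theta, eta, h)
numeric = log_mgf_quadrature(mu, theta, eta, h)
print(closed, numeric)
```

The requirement $\eta<1/\theta$ is the scalar counterpart of the constraint $H\preceq\gamma\Theta_*^{-1}$ with $\gamma<1$ in the definition of $\mathcal{H}_\gamma$: without it the expectation is infinite.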

Remark 2.44. Note that the computational effort to solve (2.134) reduces dramatically in the “easy case” of the situation described in item (ii) of Proposition 2.43 where
• the observations are direct, meaning that $A_\chi[u;1]\equiv u$, $u\in\mathbf{R}^d$, $\chi=1,2$;
• the sets $\mathcal{V}_\chi$ are comprised of positive definite diagonal matrices, and matrices $\Theta_*^{(\chi)}$ are diagonal as well, $\chi=1,2$;
• the sets $\mathcal{Z}_\chi$, $\chi=1,2$, are convex compact sets of the form
$$
\mathcal{Z}_\chi=\{Z\in\mathbf{S}^{d+1}_+:Z\succeq0,\ \mathrm{Tr}(ZQ^\chi_j)\le q^\chi_j,\ 1\le j\le J_\chi\}
$$
with diagonal matrices $Q^\chi_j$,²³ and these sets intersect the interior of the positive semidefinite cone $\mathbf{S}^{d+1}_+$.
In this case, the convex-concave saddle point problem (2.134) admits a saddle point $(h_*,H_*;\Theta_1^*,\Theta_2^*)$ where $h_*=0$ and $H_*$ is diagonal.

Justifying the remark. In the easy case, we have $B_\chi=I_{d+1}$ and therefore
$$
\begin{array}{rcl}
M_\chi(h,H)&:=&B_\chi^T\left[\left[\begin{array}{c|c}H&h\\\hline h^T&\end{array}\right]+[H,h]^T\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}[H,h]\right]B_\chi\\
&=&\left[\begin{array}{c|c}
H+H\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}H&h+H\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}h\\\hline
h^T+h^T\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}H&h^T\left[[\Theta_*^{(\chi)}]^{-1}-H\right]^{-1}h
\end{array}\right]
\end{array}
$$
and
$$
\begin{array}{rcl}
\phi_{\mathcal{Z}_\chi}(Z)&=&\max_W\left\{\mathrm{Tr}(ZW):W\succeq0,\ \mathrm{Tr}(WQ^\chi_j)\le q^\chi_j,\ 1\le j\le J_\chi\right\}\\
&=&\min_\lambda\left\{\sum_jq^\chi_j\lambda_j:\lambda\ge0,\ Z\preceq\sum_j\lambda_jQ^\chi_j\right\},
\end{array}
$$
where the last equality is due to semidefinite duality.²⁴ From the second representation of $\phi_{\mathcal{Z}_\chi}(\cdot)$ and the fact that all $Q^\chi_j$ are diagonal it follows that $\phi_{\mathcal{Z}_\chi}(M_\chi(-h,H))=\phi_{\mathcal{Z}_\chi}(M_\chi(h,H))$ (indeed, with diagonal $Q^\chi_j$, if $\lambda$ is feasible for the minimization problem participating in the representation when $Z=M_\chi(h,H)$, it clearly remains feasible when $Z$ is replaced with $M_\chi(-h,H)$). This, in turn, combines straightforwardly with (2.130) to imply that when replacing $h_*$ with 0 in a saddle point $(h_*,H_*;\Theta_1^*,\Theta_2^*)$ of (2.134), we end up with another saddle point of (2.134). In other words, when solving (2.134), we can from the very beginning set $h$ to 0, thus converting (2.134) into the convex-concave saddle point problem
$$
\mathcal{SV}=\min_{H:(0,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\Phi(0,H;\Theta_1,\Theta_2).
\tag{2.137}
$$
Taking into account that we are in the case where all matrices from the sets $\mathcal{V}_\chi$, like the matrices $\Theta_*^{(\chi)}$ and all the matrices $Q^\chi_j$, $\chi=1,2$, are diagonal, it is immediate to verify that $\Phi(0,H;\Theta_1,\Theta_2)=\Phi(0,EHE;\Theta_1,\Theta_2)$ for any $d\times d$ diagonal matrix $E$ with diagonal entries $\pm1$. Due to convexity-concavity of $\Phi$ this implies that (2.137) admits a saddle point $(0,H_*;\Theta_1^*,\Theta_2^*)$ with $H_*$ invariant w.r.t. transformations $H_*\mapsto EH_*E$ with the above $E$, that is, with diagonal $H_*$, as claimed. ✷

2.9.3 Quadratic lifting: Does it help?

Assume that for $\chi=1,2$, we are given
• affine mappings $u\mapsto\mathcal{A}_\chi(u)=A_\chi[u;1]:\mathbf{R}^{n_\chi}\to\mathbf{R}^d$,
• nonempty convex compact sets $U_\chi\subset\mathbf{R}^{n_\chi}$,
• nonempty convex compact sets $\mathcal{V}_\chi\subset\mathrm{int}\,\mathbf{S}^d_+$.
These data define families $\mathcal{G}_\chi$ of Gaussian distributions on $\mathbf{R}^d$: $\mathcal{G}_\chi$ is comprised of all distributions $\mathcal{N}(\mathcal{A}_\chi(u),\Theta)$ with $u\in U_\chi$ and $\Theta\in\mathcal{V}_\chi$. The data define also families $\mathcal{SG}_\chi$ of sub-Gaussian distributions on $\mathbf{R}^d$: $\mathcal{SG}_\chi$ is comprised of all sub-Gaussian distributions with parameters $(\mathcal{A}_\chi(u),\Theta)$ with $(u,\Theta)\in U_\chi\times\mathcal{V}_\chi$.

Assume we observe random variable $\zeta\in\mathbf{R}^d$ drawn from a distribution $P$ known to belong to $\mathcal{G}_1\cup\mathcal{G}_2$, and our goal is to decide from a stationary $K$-repeated version of our observation on the pair of hypotheses $H_\chi:P\in\mathcal{G}_\chi$, $\chi=1,2$; we refer to this situation as the Gaussian case, and we assume from now on that we are in this case.²⁵

At present, we have developed two approaches to building detector-based tests for $H_1$, $H_2$:

A. Utilizing the affine in $\zeta$ detector $\phi_{\mathrm{aff}}$ given by solution to the saddle point problem (see (2.122), (2.123) and set $\theta_\chi=\mathcal{A}_\chi(u_\chi)$ with $u_\chi$ running through $U_\chi$)
$$
\mathrm{SadVal}_{\mathrm{aff}}=\min_{h\in\mathbf{R}^d}\ \max_{\substack{u_1\in U_1,u_2\in U_2\\\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}}\frac12\left[h^T[\mathcal{A}_2(u_2)-\mathcal{A}_1(u_1)]+\frac12h^T[\Theta_1+\Theta_2]h\right];
$$
this detector satisfies the risk bound $\mathrm{Risk}[\phi_{\mathrm{aff}}|\mathcal{G}_1,\mathcal{G}_2]\le\exp\{\mathrm{SadVal}_{\mathrm{aff}}\}$.

Q. Utilizing the quadratic in $\zeta$ detector $\phi_{\mathrm{lift}}$ given by Proposition 2.43.ii, with the risk bound $\mathrm{Risk}[\phi_{\mathrm{lift}}|\mathcal{G}_1,\mathcal{G}_2]\le\exp\{\mathrm{SadVal}_{\mathrm{lift}}\}$, with $\mathrm{SadVal}_{\mathrm{lift}}$ given by (2.134).

A natural question is which of these options results in a better risk bound. Note that we cannot just say “clearly, the second option is better, since there are more quadratic detectors than affine ones”: the difficulty is that the key relation (2.133), in the context of Proposition 2.43, is inequality rather than equality.²⁶ We are about to show that under reasonable assumptions, the second option indeed is better:

Proposition 2.45. In the situation in question, assume that the sets $\mathcal{V}_\chi$, $\chi=1,2$, contain the $\succeq$-largest elements, and that these elements are taken as the matrices $\Theta_*^{(\chi)}$ participating in Proposition 2.43.ii. Let, further, the convex compact sets $\mathcal{Z}_\chi$ participating in Proposition 2.43.ii satisfy
$$
\mathcal{Z}_\chi\subset\bar{\mathcal{Z}}_\chi:=\left\{Z=\left[\begin{array}{c|c}W&u\\\hline u^T&1\end{array}\right]\succeq0,\ u\in U_\chi\right\}
\tag{2.138}
$$
(this assumption does not restrict generality, since $\bar{\mathcal{Z}}_\chi$ is, along with $U_\chi$, a closed convex set which clearly contains all matrices $[u;1][u;1]^T$ with $u\in U_\chi$). Then
$$
\mathrm{SadVal}_{\mathrm{lift}}\le\mathrm{SadVal}_{\mathrm{aff}},
\tag{2.139}
$$
that is, option Q is at least as efficient as option A.

²³ In terms of the sets $U_\chi$, this assumption means that the latter sets are given by linear inequalities on the squares of entries in $u$.
²⁴ See Section 4.1 (or [187, Section 7.1] for more details).
²⁵ It is easily seen that what follows can be straightforwardly extended to the sub-Gaussian case, where the hypotheses we would decide upon state that $P\in\mathcal{SG}_\chi$.
²⁶ One cannot make (2.133) an equality by redefining the right-hand side function: it will lose the convexity-concavity properties required in our context.


  ρ       σ1    σ2    unrestricted H and h    H = 0    h = 0
  0.5     2     2     0.31                    0.31     1.00
  0.5     1     4     0.24                    0.39     0.62
  0.01    1     4     0.41                    1.00     0.41

Table 2.2: Risk of quadratic detector $\phi(\zeta)=h^T\zeta+\tfrac12\zeta^TH\zeta+\kappa$.

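Proof. Let $A_\chi=[\bar A_\chi,a_\chi]$. Looking at (2.122) (where one should substitute $\theta_\chi=\mathcal{A}_\chi(u_\chi)$ with $u_\chi$ running through $U_\chi$) and taking into account that $\Theta_\chi\preceq\Theta_*^{(\chi)}\in\mathcal{V}_\chi$ when $\Theta_\chi\in\mathcal{V}_\chi$, we conclude that
$$
\mathrm{SadVal}_{\mathrm{aff}}=\min_h\ \max_{u_1\in U_1,u_2\in U_2}\frac12\left[h^T[\bar A_2u_2-\bar A_1u_1+a_2-a_1]+\frac12h^T\left[\Theta_*^{(1)}+\Theta_*^{(2)}\right]h\right].
\tag{2.140}
$$
At the same time, we have by Proposition 2.43.ii:
$$
\begin{array}{rcl}
\mathrm{SadVal}_{\mathrm{lift}}&=&\min\limits_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max\limits_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\frac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,-H;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h,H;\Theta_2)\right]\\
&\le&\min\limits_{h\in\mathbf{R}^d}\ \max\limits_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\frac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,0;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h,0;\Theta_2)\right]\\
&=&\min\limits_{h\in\mathbf{R}^d}\ \max\limits_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\frac12\left[\frac12\max\limits_{Z_1\in\mathcal{Z}_1}\mathrm{Tr}\left(Z_1\left[\begin{array}{c|c}&-\bar A_1^Th\\\hline-h^T\bar A_1&-2h^Ta_1+h^T\Theta_*^{(1)}h\end{array}\right]\right)\right.\\
&&\left.\qquad+\ \frac12\max\limits_{Z_2\in\mathcal{Z}_2}\mathrm{Tr}\left(Z_2\left[\begin{array}{c|c}&\bar A_2^Th\\\hline h^T\bar A_2&2h^Ta_2+h^T\Theta_*^{(2)}h\end{array}\right]\right)\right]\\
&&[\text{by direct computation utilizing (2.130)}]\\
&\le&\min\limits_{h\in\mathbf{R}^d}\frac12\left[\frac12\max\limits_{u_1\in U_1}\left[-2u_1^T\bar A_1^Th-2a_1^Th+h^T\Theta_*^{(1)}h\right]+\frac12\max\limits_{u_2\in U_2}\left[2u_2^T\bar A_2^Th+2a_2^Th+h^T\Theta_*^{(2)}h\right]\right]\\
&&[\text{due to (2.138)}]\\
&=&\mathrm{SadVal}_{\mathrm{aff}},
\end{array}
$$
where the concluding equality is due to (2.140). ✷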

Numerical illustration. To get an impression of the performance of quadratic detectors as compared to affine ones under the premise of Proposition 2.45, we present here the results of an experiment where $U_1=U_1^\rho=\{u\in\mathbf{R}^{12}:u_i\ge\rho,\ 1\le i\le12\}$, $U_2=U_2^\rho=-U_1^\rho$, $A_1=A_2\in\mathbf{R}^{8\times13}$, and $\mathcal{V}_\chi=\{\Theta_*^{(\chi)}=\sigma_\chi^2I_8\}$ are singletons. The risks of affine, quadratic and “purely quadratic” (with $h$ set to 0) detectors on the associated families $\mathcal{G}_1$, $\mathcal{G}_2$ are given in Table 2.2. We see that
• When deciding on families of Gaussian distributions with a common covariance matrix and expectations varying in the convex sets associated with the families, passing from affine detectors described by Proposition 2.41 to quadratic detectors does not affect the risk (first row in the table). This should be expected: we are in the scope of Gaussian o.s., where minimum risk affine detectors are optimal among all possible detectors.
• When deciding on families of Gaussian distributions in the case where distributions from different families can have close expectations (third row in the table), affine detectors are useless, while the quadratic ones are not, provided that $\Theta_*^{(1)}$ differs from $\Theta_*^{(2)}$. This is how it should be: we are in the case where the first moments of the distribution of observation bear no definitive information on the family to which this distribution belongs, making affine detectors useless. In contrast, quadratic detectors are able to utilize information (valuable when $\Theta_*^{(1)}\ne\Theta_*^{(2)}$) “stored” in the second moments of the observation.
• “In general” (second row in the table), both affine and purely quadratic components in a quadratic detector are useful; suppressing one of them may significantly increase the attainable risk.

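2.9.4 Quadratic lifting: Sub-Gaussian case

The sub-Gaussian version of Proposition 2.43 is as follows:

Proposition 2.46.
(i) Assume we are given
• a nonempty and bounded subset $U$ of $\mathbf{R}^n$;
• a convex compact set $\mathcal{V}$ contained in the interior of the cone $\mathbf{S}^d_+$ of positive semidefinite $d\times d$ matrices;
• a $d\times(n+1)$ matrix $A$.
These data specify the family $\mathcal{SG}_A[U,\mathcal{V}]$ of distributions of quadratic liftings $(\zeta,\zeta\zeta^T)$ of sub-Gaussian random vectors $\zeta$ with sub-Gaussianity parameters $A[u;1]$, $\Theta$ stemming from $u\in U$ and $\Theta\in\mathcal{V}$. Let us select some
1. reals $\gamma$, $\gamma^+$ such that $0<\gamma<\gamma^+<1$,
2. convex compact subset $\mathcal{Z}$ of the set $\mathcal{Z}^n=\{Z\in\mathbf{S}^{n+1}:Z\succeq0,\ Z_{n+1,n+1}=1\}$ such that relation (2.128) takes place,
3. positive definite $d\times d$ matrix $\Theta_*\in\mathbf{S}^d_+$ and $\delta\in[0,2]$ such that (2.129) takes place.
These data specify the closed convex sets
$$
\begin{array}{rcl}
\mathcal{H}&=&\mathcal{H}_\gamma:=\{(h,H)\in\mathbf{R}^d\times\mathbf{S}^d:-\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
\widehat{\mathcal{H}}&=&\widehat{\mathcal{H}}^{\gamma,\gamma^+}:=\left\{(h,H,G)\in\mathbf{R}^d\times\mathbf{S}^d\times\mathbf{S}^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1},\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ H\preceq G\right\}
\end{array}
$$
and the functions
$$
\begin{array}{rcl}
\Psi_{A,\mathcal{Z}}(h,H,G)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})\\
&&+\ \frac12\phi_{\mathcal{Z}}\left(B^T\left[\left[\begin{array}{c|c}H&h\\\hline h^T&\end{array}\right]+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right):\ \widehat{\mathcal{H}}\times\mathcal{Z}\to\mathbf{R},\\
\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)&=&-\frac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\frac12\mathrm{Tr}([\Theta-\Theta_*]G)
+\frac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\
&&+\ \frac12\phi_{\mathcal{Z}}\left(B^T\left[\left[\begin{array}{c|c}H&h\\\hline h^T&\end{array}\right]+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right):\ \widehat{\mathcal{H}}\times\{0\preceq\Theta\preceq\Theta_*\}\to\mathbf{R},
\end{array}
\tag{2.141}
$$
where $B$ is given by (2.131) and $\phi_{\mathcal{Z}}(\cdot)$ is the support function of $\mathcal{Z}$ given by (2.132), along with
$$
\begin{array}{rcl}
\Phi_{A,\mathcal{Z}}(h,H)&=&\min\limits_G\left\{\Psi_{A,\mathcal{Z}}(h,H,G):(h,H,G)\in\widehat{\mathcal{H}}\right\}:\ \mathcal{H}\to\mathbf{R},\\
\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta)&=&\min\limits_G\left\{\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta):(h,H,G)\in\widehat{\mathcal{H}}\right\}:\ \mathcal{H}\times\{0\preceq\Theta\preceq\Theta_*\}\to\mathbf{R}.
\end{array}
$$
$\Phi_{A,\mathcal{Z}}(h,H)$ is convex and continuous on its domain, and $\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta)$ is continuous on its domain, convex in $(h,H)\in\mathcal{H}$ and concave in $\Theta\in\{0\preceq\Theta\preceq\Theta_*\}$. Besides this,
(##) Whenever $u\in\mathbf{R}^n$ is such that $[u;1][u;1]^T\in\mathcal{Z}$ and $\Theta\in\mathcal{V}$, the sub-Gaussian random vector $\zeta$, with parameters $(A[u;1],\Theta)$, satisfies the relation
$$
\forall(h,H)\in\mathcal{H}:\quad
\begin{array}{ll}
(a)&\ln\left(\mathbf{E}_\zeta\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right)\le\Phi_{A,\mathcal{Z}}(h,H),\\
(b)&\ln\left(\mathbf{E}_\zeta\left\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\right\}\right)\le\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta),
\end{array}
\tag{2.142}
$$
which combines with (2.128) to imply that
$$
\mathcal{SG}_A[U,\mathcal{V}]\subset\mathcal{S}[\mathcal{H},\mathcal{V},\Phi_{A,\mathcal{Z}}]\quad\&\quad\mathcal{SG}_A[U,\mathcal{V}]\subset\mathcal{S}[\mathcal{H},\mathcal{V},\Phi^\delta_{A,\mathcal{Z}}].
\tag{2.143}
$$
In addition, $\Phi_{A,\mathcal{Z}}$ and $\Phi^\delta_{A,\mathcal{Z}}$ are coercive in $(h,H)$: $\Phi_{A,\mathcal{Z}}(h_i,H_i)\to+\infty$ and $\Phi^\delta_{A,\mathcal{Z}}(h_i,H_i;\Theta)\to+\infty$ as $i\to\infty$ whenever $\Theta\in\mathcal{V}$, $(h_i,H_i)\in\mathcal{H}$, and $\|(h_i,H_i)\|\to\infty$, $i\to\infty$.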

(ii) Let two collections of data from (i): $(\mathcal{V}_\chi,\Theta_*^{(\chi)},\delta_\chi,\gamma_\chi,\gamma_\chi^+,A_\chi,\mathcal{Z}_\chi)$, $\chi=1,2$, with common $d$ be given, giving rise to the sets $\mathcal{H}_\chi$, matrices $B_\chi$, and functions $\Phi_{A_\chi,\mathcal{Z}_\chi}(h,H)$, $\Phi^{\delta_\chi}_{A_\chi,\mathcal{Z}_\chi}(h,H;\Theta)$, $\chi=1,2$. These collections specify the families $\mathcal{SG}_\chi=\mathcal{SG}_{A_\chi}[U_\chi,\mathcal{V}_\chi]$ of sub-Gaussian distributions. Consider the convex-concave saddle point problem
$$
\mathcal{SV}=\min_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\ \max_{\Theta_1\in\mathcal{V}_1,\Theta_2\in\mathcal{V}_2}\underbrace{\tfrac12\left[\Phi^{\delta_1}_{A_1,\mathcal{Z}_1}(-h,-H;\Theta_1)+\Phi^{\delta_2}_{A_2,\mathcal{Z}_2}(h,H;\Theta_2)\right]}_{\Phi^{\delta_1,\delta_2}(h,H;\Theta_1,\Theta_2)}.
\tag{2.144}
$$
A saddle point $(H_*,h_*;\Theta_1^*,\Theta_2^*)$ in this problem does exist, and the induced quadratic detector
$$
\phi_*(\omega)=\tfrac12\omega^TH_*\omega+h_*^T\omega+\underbrace{\tfrac12\left[\Phi^{\delta_1}_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)-\Phi^{\delta_2}_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)\right]}_{a},
$$
when applied to the families of sub-Gaussian distributions $\mathcal{SG}_\chi$, $\chi=1,2$, has the risk
$$
\mathrm{Risk}[\phi_*|\mathcal{SG}_1,\mathcal{SG}_2]\le\epsilon_\star:=e^{\mathcal{SV}}.
$$
As a result,
$$
\begin{array}{ll}
(a)&\int_{\mathbf{R}^d}e^{-\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{SG}_1,\\
(b)&\int_{\mathbf{R}^d}e^{\phi_*(\omega)}P(d\omega)\le\epsilon_\star\quad\forall P\in\mathcal{SG}_2.
\end{array}
$$
Similarly, the convex minimization problem
$$
\mathrm{Opt}=\min_{(h,H)\in\mathcal{H}_1\cap\mathcal{H}_2}\underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h,-H)+\Phi_{A_2,\mathcal{Z}_2}(h,H)\right]}_{\Phi(h,H)}
\tag{2.145}
$$
is solvable, and the quadratic detector induced by its optimal solution $(h_*,H_*)$,
$$
\phi_*(\omega)=\tfrac12\omega^TH_*\omega+h_*^T\omega+\underbrace{\tfrac12\left[\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*)-\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*)\right]}_{a},
\tag{2.146}
$$
when applied to the families of sub-Gaussian distributions $\mathcal{SG}_\chi$, $\chi=1,2$, has the risk
$$
\mathrm{Risk}[\phi_*|\mathcal{SG}_1,\mathcal{SG}_2]\le\epsilon_\star:=e^{\mathrm{Opt}},
$$
so that relation (2.145) takes place for the $\phi_*$ and $\epsilon_\star$ just defined.

For proof, see Section 2.11.6.

Remark 2.47. Proposition 2.46 offers two options for building quadratic detectors for the families $\mathcal{SG}_1$, $\mathcal{SG}_2$, those based on the saddle point of (2.144) and on the optimal solution to (2.145). Inspecting the proof, the number of options can be increased to 4: we can replace any of the functions $\Phi^{\delta_\chi}_{A_\chi,\mathcal{Z}_\chi}$, $\chi=1,2$ (or both these functions simultaneously), with $\Phi_{A_\chi,\mathcal{Z}_\chi}$. The second of the original two options is exactly what we get when replacing both $\Phi^{\delta_\chi}_{A_\chi,\mathcal{Z}_\chi}$, $\chi=1,2$, with $\Phi_{A_\chi,\mathcal{Z}_\chi}$. It is easily seen that depending on the data, each of these four options can be the best, that is, result in the smallest risk bound. Thus, it makes sense to keep all these options in mind and to use the one which, under the circumstances, results in the best risk bound. Note that the risk bounds are efficiently computable, so that identifying the best option is easy.

2.9.5 Generic application: Quadratically constrained hypotheses

Propositions 2.43 and 2.46 operate with Gaussian/sub-Gaussian observations $\zeta$ with matrix parameters $\Theta$ running through convex compact subsets $\mathcal{V}$ of $\mathrm{int}\,\mathbf{S}^d_+$, and means of the form $A[u;1]$, with “signals” $u$ running through given sets $U\subset\mathbf{R}^n$. The constructions, however, involved additional entities: convex compact sets $\mathcal{Z}\subset\mathcal{Z}^n:=\{Z\in\mathbf{S}^{n+1}_+:Z_{n+1,n+1}=1\}$ containing quadratic liftings $[u;1][u;1]^T$ of all signals $u\in U$. Other things being equal, the smaller the $\mathcal{Z}$, the smaller the associated function $\Phi_{A,\mathcal{Z}}$ (or $\Phi^\delta_{A,\mathcal{Z}}$), and consequently, the smaller the (upper bounds on the) risks of the quadratic in $\zeta$ detectors we end up with. In order to implement these constructions, we need to understand how to build the required sets $\mathcal{Z}$ in an “economical” way. There is a relatively simple case when it is easy to get reasonable candidates for the role of $\mathcal{Z}$: the case of a quadratically constrained signal set $U$,
$$
U=\{u\in\mathbf{R}^n:f_k(u):=u^TQ_ku+2q_k^Tu\le b_k,\ 1\le k\le K\}.
\tag{2.147}
$$
Indeed, the constraints $f_k(u)\le b_k$ are just linear constraints on the quadratic lifting $[u;1][u;1]^T$ of $u$:
$$
u^TQ_ku+2q_k^Tu\le b_k\ \Leftrightarrow\ \mathrm{Tr}(F_k[u;1][u;1]^T)\le b_k,\quad F_k=\left[\begin{array}{c|c}Q_k&q_k\\\hline q_k^T&\end{array}\right]\in\mathbf{S}^{n+1}.
$$
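The equivalence above is easy to check numerically. The Python sketch below builds $F_k$ from $(Q_k,q_k)$, reading the blank lower-right block of $F_k$ as zero (an assumption, though the only reading under which the identity holds), and compares $u^TQ_ku+2q_k^Tu$ with $\mathrm{Tr}(F_k[u;1][u;1]^T)$. All names and numerical values are hypothetical.

```python
def lifted(u):
    """Quadratic lifting [u; 1][u; 1]^T of the signal u."""
    v = list(u) + [1.0]
    return [[a * b for b in v] for a in v]

def F_matrix(Q, q):
    """Block matrix [[Q, q], [q^T, 0]] encoding the form u^T Q u + 2 q^T u."""
    n = len(q)
    return [list(Q[i]) + [q[i]] for i in range(n)] + [list(q) + [0.0]]

def trace_product(F, Z):
    """Tr(F Z) for square matrices of equal size."""
    m = len(F)
    return sum(F[i][j] * Z[j][i] for i in range(m) for j in range(m))

# Hypothetical data for illustration.
Q = [[2.0, 0.5], [0.5, 1.0]]
q = [1.0, -1.0]
u = [0.3, -0.7]

f_direct = (sum(u[i] * Q[i][j] * u[j] for i in range(2) for j in range(2))
            + 2.0 * sum(qi * ui for qi, ui in zip(q, u)))
f_lifted = trace_product(F_matrix(Q, q), lifted(u))
print(f_direct, f_lifted)
```

Because the constraint is linear in the lifted matrix, it remains meaningful for every $Z\in\mathbf{S}^{n+1}$, not only for rank-one liftings, which is what the relaxation relies on.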


Consequently, in the case of (2.147), the simplest candidate for the role of $\mathcal{Z}$ is the set
$$
\mathcal{Z}=\{Z\in\mathbf{S}^{n+1}:Z\succeq0,\ Z_{n+1,n+1}=1,\ \mathrm{Tr}(F_kZ)\le b_k,\ 1\le k\le K\}.
\tag{2.148}
$$
This set clearly is closed and convex (the latter even when $U$ itself is not convex), and indeed contains the quadratic liftings $[u;1][u;1]^T$ of all points $u\in U$. We need also the compactness of $\mathcal{Z}$; the latter definitely takes place when the quadratic constraints describing $U$ contain a constraint of the form $u^Tu\le R^2$, which, in turn, can be ensured, basically “for free,” when $U$ is bounded. It should be stressed that the “ideal” choice of $\mathcal{Z}$ would be the convex hull $\mathcal{Z}[U]$ of all rank 1 matrices $[u;1][u;1]^T$ with $u\in U$: this definitely is the smallest convex set which contains the quadratic liftings of all points from $U$. Moreover, $\mathcal{Z}[U]$ is closed and bounded, provided $U$ is so. The difficulty is that $\mathcal{Z}[U]$ can be computationally intractable (and thus useless in our context) already for pretty simple sets $U$ of the form (2.147). The set (2.148) is a simple outer approximation of $\mathcal{Z}[U]$, and this approximation can be very loose: for instance, when $U=\{u:-1\le u_k\le1,\ 1\le k\le n\}$ is just the unit box in $\mathbf{R}^n$, the set (2.148) is
$$
\{Z\in\mathbf{S}^{n+1}:Z\succeq0,\ Z_{n+1,n+1}=1,\ |Z_{k,n+1}|\le1,\ 1\le k\le n\};
$$
this set is not even bounded, while $\mathcal{Z}[U]$ clearly is bounded. There is, essentially, just one generic case when the set (2.148) is exactly equal to $\mathcal{Z}[U]$: the case where $U=\{u:u^TQu\le c\}$, $Q\succ0$, is an ellipsoid centered at the origin; the fact that in this case the set given by (2.148) is exactly $\mathcal{Z}[U]$ is a consequence of what is called the S-Lemma.

Though, in general, the set $\mathcal{Z}$ can be a very loose outer approximation of $\mathcal{Z}[U]$, this does not mean that this construction cannot be improved. As an instructive example, let $U=\{u\in\mathbf{R}^n:\|u\|_\infty\le1\}$. We get an approximation of $\mathcal{Z}[U]$ much better than the one above when applying (2.148) to an equivalent description of the box by quadratic constraints:
$$
U:=\{u\in\mathbf{R}^n:\|u\|_\infty\le1\}=\{u\in\mathbf{R}^n:u_k^2\le1,\ 1\le k\le n\}.
$$
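The effect of the two descriptions can be seen already for $n=1$, where $U=[-1,1]$ and $Z$ is a symmetric $2\times2$ matrix. Following the recipe (2.148), the linear description $-1\le u\le1$ yields the constraint $|Z_{12}|\le1$, while the quadratic description $u^2\le1$ yields $Z_{11}\le1$. The Python sketch below (function names and test points are illustrative assumptions) tests membership of the matrices with $Z_{11}=t$, $Z_{12}=0$, $Z_{22}=1$ in both sets.

```python
def is_psd_2x2(z11, z12, z22, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its diagonal entries and determinant are nonnegative."""
    return z11 >= -tol and z22 >= -tol and z11 * z22 - z12 * z12 >= -tol

def in_Z_linear(z11, z12, z22):
    """Set (2.148) built from the linear description -1 <= u <= 1: |Z_12| <= 1."""
    return is_psd_2x2(z11, z12, z22) and z22 == 1.0 and abs(z12) <= 1.0

def in_Z_quadratic(z11, z12, z22):
    """Set (2.148) built from the quadratic description u^2 <= 1: Z_11 <= 1."""
    return is_psd_2x2(z11, z12, z22) and z22 == 1.0 and z11 <= 1.0

for t in (1.0, 10.0, 1e6):
    print(t, in_Z_linear(t, 0.0, 1.0), in_Z_quadratic(t, 0.0, 1.0))
```

The first set contains matrices with arbitrarily large $Z_{11}$, hence is unbounded. In the second set $Z_{11}\le1$, and positive semidefiniteness then forces $Z_{12}^2\le Z_{11}Z_{22}\le1$, so the set is bounded.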
Applying the recipe of (2.148) to the latter description of $U$, we arrive at a significantly less conservative outer approximation of $\mathcal{Z}[U]$, specifically,
$$
\mathcal{Z}=\{Z\in\mathbf{S}^{n+1}:Z\succeq0,\ Z_{n+1,n+1}=1,\ Z_{kk}\le1,\ 1\le k\le n\}.
$$
Not only is the resulting set $\mathcal{Z}$ bounded; we can get a reasonable “upper bound” on the discrepancy between $\mathcal{Z}$ and $\mathcal{Z}[U]$. Namely, denoting by $Z^o$ the matrix obtained from a symmetric $(n+1)\times(n+1)$ matrix $Z$ by zeroing out the entry $Z_{n+1,n+1}$ and keeping the remaining entries intact, we have
$$
\mathcal{Z}^o[U]:=\{Z^o:Z\in\mathcal{Z}[U]\}\subset\mathcal{Z}^o:=\{Z^o:Z\in\mathcal{Z}\}\subset O(1)\ln(n+1)\,\mathcal{Z}^o[U].
$$
This is a particular case of a general result (which goes back to [191]; we shall get this result as a byproduct of our forthcoming considerations, specifically, Proposition 4.6) as follows:


Let U be a bounded set given by a system of convex quadratic constraints without linear terms: U = {u ∈ Rn : uT Qk u ≤ ck , 1 ≤ k ≤ K}, Qk  0, 1 ≤ k ≤ K, and let Z be the associated set (2.148): Z = {Z ∈ Sn+1 : Z  0, Zn+1,n+1 = 1, Tr(ZDiag{Qk , 1}) ≤ ck , 1 ≤ k ≤ K}

Then √ Z o [U ] := {Z o : Z ∈ Z[U ]} ⊂ Z o := {Z o : Z ∈ Z} ⊂ 3 ln( 3(K + 1))Z o [U ]. Note that when K = 1 (i.e., U is an ellipsoid centered at the origin), the factor 4 ln(5(K + 1)), as it was already mentioned, can be replaced by 1. √ One can think that the factor 3 ln( 3(K + 1)) is too large to be of interest; well, this is nearly the best factor one can get under the circumstances, and a nice fact is that the factor is “nearly independent” of K. Finally, we remark that, as in the case of a box, we can try to reduce the conservatism of the outer approximation (2.148) of Z[U ] by passing from the initial description of U to an equivalent one. The standard recipe here is to replace linear constraints in the description of U by their quadratic consequences; for example, we can augment a pair of linear constraints qiT u ≤ ci , qjT u ≤ cj , assuming there is such a pair, with the quadratic constraint (ci −qiT u)(cj −qjT u) ≥ 0. While this constraint is redundant, as far as the description of U itself is concerned, adding this constraint reduces, and sometimes significantly, the set given by (2.148). Informally speaking, transition from (2.147) to (2.148) is by itself “too stupid” to utilize the fact (known to every kid) that the product of two nonnegative quantities is nonnegative; when augmenting linear constraints in the description of U by their pairwise products, we somehow compensate for this stupidity. Unfortunately, while “computationally tractable” assistance of this type allows us to reduce the conservatism of (2.148), it usually does not allow us to eliminate it completely: a grave “fact of life” is that even in the case of the unit box U , the set Z[U ] is computationally intractable. Scientifically speaking: maximizing quadratic forms over the unit box U is provably an NP-hard problem; were we able to get a computationally tractable description of Z[U ], we would be able to solve this NP-hard problem efficiently, implying that P=NP. 
While we do not know for sure that the latter is not the case, “informal odds” are strongly against this possibility. The bottom line is that while the approach we are discussing in some situations could result in quite conservative tests, “some” is by far not the same as “always”; on the positive side, this approach allows us to process some important problems. We are about to present a simple and instructive illustration.

2.9.5.1 Simple change detection

In Figure 2.8, you see a sample of frames from a “movie” in which a noisy picture of a dog gradually transforms into a noisy picture of a lady; several initial frames differ just by realizations of noise, and starting from some instant, the “signal” (the deterministic component of the image) starts to drift from the dog towards the lady. What, in your opinion, is the change point, that is, the first time instant where the signal component of the image differs from the signal component of the initial image?

[Figure 2.8: Frames from a “movie” (frames #1 to #8, #15, #20, #28, #30)]

A simple model of the situation is as follows: we observe, one by one, vectors (in fact, 2D arrays, but we can “vectorize” them)
$$\omega_t = x_t + \xi_t,\quad t = 1, 2, ..., K, \tag{2.149}$$

where the $x_t$ are deterministic components of the observations and the $\xi_t$ are random noises. It may happen that for some τ ∈ {2, 3, ..., K}, the vectors $x_t$ are independent of t when t < τ, and $x_\tau$ differs from $x_{\tau-1}$ (“τ is a change point”); if it is the case, τ is uniquely defined by $x^K = (x_1, ..., x_K)$. An alternative is that $x_t$ is independent of t, for all 1 ≤ t ≤ K (“no change”). The goal is to decide, based on observation $\omega^K = (\omega_1, ..., \omega_K)$, whether there was a change point, and if yes, then, perhaps, to localize it.

The model we have just described is the simplest case of “change detection,” where, given noisy observations on some time horizon, one is interested in detecting a “change” in some time series underlying the observations. In our simple model, this time series is comprised of deterministic components $x_t$ of observations, and “change at time τ” is understood in the most straightforward way, namely as the fact that $x_\tau$ differs from the preceding $x_t$'s, which are equal to each other. In more complicated situations, our observations are obtained from the underlying time series $\{x_t\}$ by a non-anticipative transformation, like
$$\omega_t = \sum_{s=1}^{t} A_{ts} x_s + \xi_t,\quad t = 1, ..., K,$$

and we still want to detect the change, if any, in the time series $\{x_t\}$. As an instructive example, consider observations, taken along an equidistant time grid, of the positions of an aircraft which “normally” flies with constant velocity, but at some time instant can start to maneuver. In this situation, the underlying time series is comprised of the velocities of the aircraft at consecutive time instants, observations are obtained from this time series by integration, and to detect a maneuver means to detect that on the observation horizon, there was a change in the series of velocities.

Change detection is the subject of a huge literature dealing with a wide range of models differing from each other in
• whether we deal with direct observations of the time series of interest, as in (2.149), or with indirect ones (in the latter case, there is a wide spectrum of options related to how the observations depend on the underlying time series),
• what are the assumptions on the noise,
• what happens with the $x_t$'s after the change: do they jump from their common value prior to time τ to a new common value starting with this time, or start to depend on time (and if yes, then how), etc.

A significant role in change detection is played by hypothesis testing; as far as affine/quadratic-detector-based techniques developed in this section are concerned, their applications in the context of change detection are discussed in [50]. In what follows, we focus on the simplest of these applications.

Situation and goal. We consider the situation as follows:
1. Our observations are given by (2.149) with noises $\xi_t \sim \mathcal{N}(0, \sigma^2 I_d)$ independent across t = 1, ..., K. We do not know σ a priori; what we know is that σ is independent of t and belongs to a given segment $[\underline{\sigma}, \overline{\sigma}]$, with $0 < \underline{\sigma} \le \overline{\sigma}$;
2. Observations (2.149) arrive one by one, so that at time t, 2 ≤ t ≤ K, we have at our disposal observation $\omega^t = (\omega_1, ..., \omega_t)$.

Our goal is to build a system of inferences $\mathcal{T}_t$, 2 ≤ t ≤ K, such that $\mathcal{T}_t$ as applied to $\omega^t$ either infers that there was a change at time t or earlier, in which case we terminate, or infers that so far there has been no change, in which case we either proceed to time t + 1 (if t < K), or terminate (if t = K) with a “no change” conclusion. We are given ǫ ∈ (0, 1) and want our collection of inferences to satisfy the bound ǫ on the probability of false alarm (i.e., on the probability of terminating somewhere on time horizon 2, 3, ..., K with a “there was a change” conclusion in the situation where there was no change: $x_1 = ... = x_K$). Under this restriction, we want to make as small as possible the probability of a miss (of not detecting the change at all in the situation where there was a change).

The “small probability of a miss” desire should be clarified. When the noise is nontrivial, we have no chance to detect very small changes and still respect the bound on the probability of false alarm. A realistic goal is to make as small as possible the probability of missing a not too small change, which can be formalized as follows. Given ρ > 0, and tolerances ǫ, ε ∈ (0, 1), let us look for a system of inferences $\{\mathcal{T}_t : 2 \le t \le K\}$ such that
• the probability of false alarm is at most ǫ, and
• the probability of a “ρ-miss,” that is, the probability of detecting no change when there was a change of energy ≥ ρ² (i.e., when there was a change at time τ, and, moreover, $\|x_\tau - x_1\|_2^2 \ge \rho^2$), is at most ε.
What we are interested in is to achieve the goal just formulated with as small a ρ as possible.

Construction.
Let us select a large “safety parameter” R, like $R = 10^8$ or even $R = 10^{80}$, so that we can assume that for all time series we are interested in it holds $\|x_t - x_\tau\|_2^2 \le R^2$.²⁷ Let us associate with ρ > 0 “signal hypotheses” $H_t^\rho$, t = 2, 3, ..., K, on the distribution of observation $\omega^K$ given by (2.149), with $H_t^\rho$ stating that at time t there is a change, of energy at least ρ², in the time series $\{x_t\}_{t=1}^K$ underlying the observation $\omega^K$:
$$x_1 = x_2 = ... = x_{t-1}\ \ \&\ \ \|x_t - x_{t-1}\|_2^2 = \|x_t - x_1\|_2^2 \ge \rho^2$$
(and on top of that, $\|x_t - x_\tau\|_2^2 \le R^2$ for all t, τ). Let us augment these hypotheses by the null hypothesis $H_0$ stating that there is no change at all: the observation $\omega^K$ stems from a stationary time series $x_1 = x_2 = ... = x_K$. We are about to use our machinery of detector-based tests in order to build a system of tests deciding, with partial risks ǫ and ε, on the null hypothesis vs. the “signal alternative” $\bigcup_t H_t^\rho$ for as small a ρ as possible.

The implementation is as follows. Given ρ > 0 such that ρ² < R², consider two hypotheses, $G_1$ and $G_2^\rho$, on the distribution of observation
$$\zeta = x + \xi \in \mathbf{R}^d. \tag{2.150}$$
Both hypotheses state that $\xi \sim \mathcal{N}(0, \sigma^2 I_d)$ with unknown σ known to belong to a given segment $\Delta := [\sqrt{2}\,\underline{\sigma}, \sqrt{2}\,\overline{\sigma}]$. In addition, $G_1$ states that x = 0, and $G_2^\rho$ that $\rho^2 \le \|x\|_2^2 \le R^2$. We can use the result of Proposition 2.43.ii to build a detector quadratic in ζ for the families of distributions $\mathcal{P}_1$, $\mathcal{P}_2^\rho$ obeying the hypotheses $G_1$, $G_2^\rho$, respectively. To this end it suffices to apply the proposition to the collections
$$\mathcal{V}_\chi = \{\sigma^2 I_d : \sigma \in \Delta\},\ \ \Theta_*^{(\chi)} = 2\overline{\sigma}^2 I_d,\ \ \delta_\chi = 1 - \underline{\sigma}/\overline{\sigma},\ \ \gamma_\chi = 0.999,\ \ A_\chi = I_d,\ \ \mathcal{Z}_\chi \quad [\chi = 1, 2],$$
where
$$\mathcal{Z}_1 = \{[0; ...; 0; 1][0; ...; 0; 1]^T\} \subset \mathbf{S}^{d+1}_+,\qquad \mathcal{Z}_2 = \mathcal{Z}_2^\rho = \{Z \in \mathbf{S}^{d+1}_+ : Z_{d+1,d+1} = 1,\ 1 + R^2 \ge \mathrm{Tr}(Z) \ge 1 + \rho^2\}.$$

27 R is needed only to make the domains we are working with bounded, thus allowing us to apply the theory we have developed so far. The actual value of R does not enter our constructions and conclusions.

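The (upper bound on the) risk of the quadratic in ζ detector yielded by a saddle point of function (2.134), as given by Proposition 2.43.ii, is immediate: by the same argument as used when justifying Remark 2.44, in the situation in question one can look for a saddle point with h = 0, $H = \eta I_d$, and identifying the required η reduces to solving the univariate convex problem
$$\mathrm{Opt}(\rho) = \min_\eta \left\{ \frac12\left[ -\frac d2 \ln(1 - \hat\sigma^4\eta^2) - \frac d2\,\hat\sigma^2(1 - \underline\sigma^2/\overline\sigma^2)\,\eta + \frac{d\,\delta(2+\delta)\,\hat\sigma^4\eta^2}{1 + \hat\sigma^2\eta} + \frac{\rho^2\eta}{2(1 - \hat\sigma^2\eta)} \right] : -\gamma \le \hat\sigma^2\eta \le 0 \right\}$$
$$\left[\hat\sigma = \sqrt2\,\overline\sigma,\ \ \delta = 1 - \underline\sigma/\overline\sigma\right],$$
which can be done in no time by Bisection. The resulting detector and the upper bound on its risk are given by the optimal solution η(ρ) to the latter problem according to
$$\phi^*_\rho(\zeta) = \frac12\,\eta(\rho)\,\zeta^T\zeta + \underbrace{\frac d4\left[ \ln\left(\frac{1 - \hat\sigma^2\eta(\rho)}{1 + \hat\sigma^2\eta(\rho)}\right) + \hat\sigma^2(1 - \underline\sigma^2/\overline\sigma^2)\,\eta(\rho) - \frac{\rho^2\eta(\rho)}{d\,(1 - \hat\sigma^2\eta(\rho))} \right]}_{a(\rho)}$$
with
$$\mathrm{Risk}[\phi^*_\rho\,|\,\mathcal{P}_1, \mathcal{P}_2^\rho] \le \mathrm{Risk}(\rho) := e^{\mathrm{Opt}(\rho)}$$
(observe that R appears neither in the definition of the optimal detector nor in the risk bound). It is immediately seen that Opt(ρ) → 0 as ρ → +0 and Opt(ρ) → −∞ as ρ → +∞, implying that given κ ∈ (0, 1), we can easily find by bisection ρ = ρ(κ) such that Risk(ρ) = κ; in what follows, we assume w.l.o.g. that R > ρ(κ) for the value of κ we end with; see below. Next, let us pass from the detector $\phi^*_{\rho(\kappa)}(\cdot)$ to its shift
$$\phi^{*,\kappa}(\zeta) = \phi^*_{\rho(\kappa)}(\zeta) + \ln(\varepsilon/\kappa),$$
so that for the simple test $\mathcal{T}^\kappa$ which, given observation ζ, accepts $G_1$ and rejects $G_2^{\rho(\kappa)}$ whenever $\phi^{*,\kappa}(\zeta) \ge 0$, and accepts $G_2^{\rho(\kappa)}$ and rejects $G_1$ otherwise, it holds
$$\mathrm{Risk}_1(\mathcal{T}^\kappa\,|\,G_1, G_2^{\rho(\kappa)}) \le \frac{\kappa^2}{\varepsilon},\qquad \mathrm{Risk}_2(\mathcal{T}^\kappa\,|\,G_1, G_2^{\rho(\kappa)}) \le \varepsilon; \tag{2.151}$$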

see Proposition 2.14 and (2.48).

We are nearly done. Given κ ∈ (0, 1), consider the system of tests $\mathcal{T}_t^\kappa$, t = 2, 3, ..., K, as follows. At time t ∈ {2, 3, ..., K}, given observations $\omega_1, ..., \omega_t$ stemming from (2.149), let us form the vector $\zeta_t = \omega_t - \omega_1$ and compute the quantity $\phi^{*,\kappa}(\zeta_t)$. If this quantity is negative, we claim that the change has already taken place and terminate; otherwise we claim that so far, there was no change, and proceed to time t + 1 (if t < K) or terminate (if t = K).

The risk analysis for the resulting system of inferences is immediate. Observe that

(!) For every t = 2, 3, ..., K:
• if there is no change on time horizon 1, ..., t: $x_1 = x_2 = ... = x_t$ (case A), the probability for $\mathcal{T}_t^\kappa$ to conclude that there was a change is at most κ²/ε;
• if, on the other hand, $\|x_t - x_1\|_2^2 \ge \rho^2(\kappa)$ (case B), then the probability for $\mathcal{T}_t^\kappa$ to conclude that so far there was no change is at most ε.

Indeed, we clearly have
$$\zeta_t = [x_t - x_1] + \xi^t,$$
where $\xi^t = \xi_t - \xi_1 \sim \mathcal{N}(0, \sigma^2 I_d)$ with $\sigma \in [\sqrt2\,\underline\sigma, \sqrt2\,\overline\sigma]$. Our action at time t is nothing but application of the test $\mathcal{T}^\kappa$ to the observation $\zeta_t$. In case A the distribution of this observation obeys the hypothesis $G_1$, and the probability for $\mathcal{T}_t^\kappa$ to claim that there was a change is at most κ²/ε by the first inequality in (2.151). In case B, the distribution of $\zeta_t$ obeys the hypothesis $G_2^{\rho(\kappa)}$, and thus the probability for $\mathcal{T}_t^\kappa$ to claim that there was no change on time horizon 1, ..., t is ≤ ε by the second inequality in (2.151).

In view of (!), the probability of false alarm for the system of inferences $\{\mathcal{T}_t^\kappa\}_{t=2}^K$ is at most (K − 1)κ²/ε, and specifying κ as
$$\kappa = \sqrt{\epsilon\varepsilon/(K-1)},$$
we make this probability ≤ ǫ. The resulting procedure, by the same (!), detects a change at time t ∈ {2, 3, ..., K} with probability at least 1 − ε, provided that the energy of this change is at least $\rho_*^2$, with
$$\rho_* = \rho\left(\sqrt{\epsilon\varepsilon/(K-1)}\right). \tag{2.152}$$
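The sequential procedure just described can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `kappa_for` and `detect_change` are hypothetical, frames are assumed to be already vectorized into plain lists of floats, and the shifted detector is assumed to be supplied in the form $\frac12\eta\,\zeta^T\zeta + a$ with η ≤ 0 (the constant a already including the shift ln(ε/κ)).

```python
import math


def kappa_for(eps_false_alarm, eps_miss, K):
    """kappa = sqrt(eps_fa * eps_miss / (K - 1)), so that (K - 1) * kappa**2 / eps_miss = eps_fa."""
    return math.sqrt(eps_false_alarm * eps_miss / (K - 1))


def detect_change(frames, eta, a_shifted):
    """Apply the tests T_t, t = 2..K, to vectorized frames omega_1..omega_K.

    Assumed detector form: phi(zeta) = 0.5 * eta * zeta^T zeta + a_shifted, eta <= 0.
    Returns the first t (1-based) with phi(omega_t - omega_1) < 0,
    or None, meaning a "no change" conclusion at time K.
    """
    first = frames[0]
    for t in range(2, len(frames) + 1):
        # squared Euclidean norm of zeta_t = omega_t - omega_1
        zeta_sq = sum((w - w1) ** 2 for w, w1 in zip(frames[t - 1], first))
        if 0.5 * eta * zeta_sq + a_shifted < 0:
            return t  # claim "change at time t or earlier" and terminate
    return None


# Toy usage with made-up numbers: an alarm is raised once ||omega_t - omega_1||^2 > 2.
toy_frames = [[0.0, 0.0], [0.1, 0.0], [2.0, 0.0], [3.0, 0.0]]
first_alarm = detect_change(toy_frames, eta=-1.0, a_shifted=1.0)
```

Every test compares the current frame with the first one only, so the per-step cost is linear in the frame size and no history beyond $\omega_1$ has to be stored.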

In fact we can say a bit more:

Proposition 2.48. Let the deterministic sequence $x_1, ..., x_K$ underlying observations (2.149) be such that for some t it holds $\|x_t - x_1\|_2^2 \ge \rho_*^2$, with $\rho_*$ given by (2.152). Then the probability for the system of inferences we have built to detect a change at time t or earlier is at least 1 − ε.

Indeed, under the premise of the proposition, the probability for $\mathcal{T}_t^\kappa$ to claim that a change already took place is at least 1 − ε, and this probability can be only smaller than the probability to detect change on time horizon 2, 3, ..., t.

How it works. As applied to the “movie” story we started with, the outlined procedure works as follows. The images in question are of the size 256 × 256, so that we are in the case of d = 256² = 65536. The images are represented by 2D arrays in gray scale, that is, as 256 × 256 matrices with entries in the range [0, 255]. In the experiment to be reported (just as in the movie) we assumed the maximal noise intensity $\overline\sigma$ to be 10, and used $\underline\sigma = \overline\sigma/\sqrt2$. The reliability tolerances ǫ, ε were set to 0.01, and K was set to 9, resulting in
$$\rho_*^2 = 7.38 \cdot 10^6,$$
which corresponds to the per pixel energy $\rho_*^2/65536 = 112.68$, just 12% above the allowed expected per pixel energy of noise (the latter is $\overline\sigma^2 = 100$). The resulting detector is
$$\phi^*(\zeta) = -2.7138\,\frac{\zeta^T\zeta}{10^5} + 366.9548.$$
In other words, test $\mathcal{T}_t^\kappa$ claims that the change took place when the average, over pixels, per pixel energy in the difference $\omega_t - \omega_1$ was at least 206.33, which is pretty close to the expected per pixel energy (200.0) in the noise $\xi_t - \xi_1$ affecting the difference $\omega_t - \omega_1$.

Finally, this is how the system of inferences just described worked in simulations. The underlying sequence of images is obtained from the “basic sequence”
$$\bar{x}_t = D + 0.0357(t-1)(L - D),\quad t = 1, 2, ...\,^{28} \tag{2.153}$$

where D is the image of the dog and L is the image of the lady (up to noise, these are the first and the last frames in Figure 2.8). To get the observations in a particular simulation, we augment this sequence from the left by a random number of images D in such a way that with probability 1/2 there was no change of image on the time horizon 1, 2, ..., 9, and with probability 1/2 there was a change at time instant τ chosen at random from the uniform distribution on {2, 3, ..., 9}. The observation is obtained by taking the first nine images in the resulting sequence, and adding to them observation noises independent across the images drawn at random from $\mathcal{N}(0, 100 I_{65536})$. In the series of 3,000 simulations of this type we have not observed a single false alarm, while the empirical probability of a miss was 0.0553. Besides, the change at time t, if detected, was never detected with a delay more than 1.

Finally, in the particular “movie” in Figure 2.8 the change takes place at time t = 3, and the system of inferences we have just developed discovered the change at time 4. How does this compare to the time when you managed to detect the change?

28 The coefficient 0.0357 corresponds to a 28-frame linear transition from D to L.

“Numerical near-optimality.” Recall that beyond the realm of simple o.s.’s we have no theoretical guarantees of near-optimality for the inferences we are developing. This does not mean, however, that we cannot quantify the conservatism of our techniques numerically. To give an example, let us forget, for the sake of simplicity, about change detection per se and focus on the auxiliary problem we have introduced above, that of deciding upon hypotheses $G_1$ and $G_2^\rho$ via observation (2.150), and suppose that we want to decide on these two hypotheses from a single observation with risk ≤ ǫ, for a given ǫ ∈ (0, 1). Whether this is possible or not depends on ρ; let us denote by $\rho_+$ the smallest ρ for which we can meet the risk specification with our detector-based approach ($\rho_+$ is nothing but what was above called ρ(ǫ)), and by $\underline\rho$ the smallest ρ for which there exists “in nature” a simple test deciding on $G_1$ vs. $G_2^\rho$ with risk ≤ ǫ. We can consider the ratio $\rho_+/\underline\rho$ as the “index of conservatism” of our approach. Now, $\rho_+$ is given by an efficient computation; what about $\underline\rho$? Well, there is a simple way to get a lower bound on $\underline\rho$, namely, as follows. Observe that if the composite hypotheses $G_1$, $G_2^\rho$ can be decided upon with risk ≤ ǫ, the same holds true for two simple hypotheses stating that the distribution of observation (2.150) is $P_1$ or $P_2$ respectively, where $P_1$, $P_2$ correspond to the cases where
• ($P_1$): ζ is drawn from $\mathcal{N}(0, 2\sigma^2 I_d)$;
• ($P_2$): ζ is obtained by adding $\mathcal{N}(0, 2\sigma^2 I_d)$-noise to a random signal u, independent of the noise, uniformly distributed on the sphere $\{\|u\|_2 = \rho\}$.
Indeed, $P_1$ obeys hypothesis $G_1$, and $P_2$ is a mixture of distributions obeying $G_2^\rho$; as a result, a simple test $\mathcal{T}$ deciding (1 − ǫ)-reliably on $G_1$ vs. $G_2^\rho$ would induce a test deciding equally reliably on $P_1$ vs. $P_2$, specifically, the test which, given observation ζ, accepts $P_1$ if $\mathcal{T}$ on the same observation accepts $G_1$, and accepts $P_2$ otherwise. We can now use a two-point lower bound (Proposition 2.2) to lower-bound the risk of deciding on $P_1$ vs. $P_2$.
Because both distributions are spherically symmetric, computing this bound reduces to computing a similar bound for the univariate distributions of $\zeta^T\zeta$ induced by $P_1$ and $P_2$, and these univariate distributions are easy to compute. The resulting lower risk bound depends on ρ, and we can find the smallest ρ for which the bound is ≤ 0.01, and use this ρ in the role of $\underline\rho$; the associated indexes of conservatism can be only larger than the true ones. Let us look at what these indexes are for the data used in our change detection experiment, that is, ǫ = 0.01, d = 256² = 65536, $\overline\sigma = 10$, $\underline\sigma = \overline\sigma/\sqrt2$. Computation shows that in this case we have
$$\rho_+ = 2702.4,\quad \rho_+/\underline\rho \le 1.04,$$
that is, nearly no conservatism at all! When eliminating the uncertainty in the intensity of noise by increasing $\underline\sigma$ from $\overline\sigma/\sqrt2$ to $\overline\sigma$, we get
$$\rho_+ = 668.46,\quad \rho_+/\underline\rho \le 1.15,$$
which is still not that much conservatism!
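The quantity $\rho_+ = \rho(ǫ)$ quoted above comes from the univariate problem defining Opt(ρ), so it can be recomputed with a short script. The sketch below is an illustration under stated assumptions, not the authors' code: it works in the variable $t = \hat\sigma^2\eta \in [-\gamma, 0]$, minimizes the convex objective by ternary search (any univariate convex solver would do), and then finds ρ(κ) by bisection. The function names are hypothetical, and the additive constant of the detector is taken as $a(\rho) = \frac d4[\ln\frac{1-t}{1+t} + (1 - \underline\sigma^2/\overline\sigma^2)\,t - \frac{\rho^2\eta}{d(1-t)}]$.

```python
import math


def opt_and_eta(rho, d, s_lo, s_hi, gamma=0.999):
    """Return (Opt(rho), eta(rho)) for noise bounds s_lo <= sigma <= s_hi."""
    s_hat2 = 2.0 * s_hi ** 2                  # sigma_hat^2 = 2 * sigma_bar^2
    delta = 1.0 - s_lo / s_hi
    r = 1.0 - (s_lo / s_hi) ** 2

    def f(t):                                 # t stands for sigma_hat^2 * eta
        return 0.5 * (-0.5 * d * math.log(1.0 - t * t)
                      - 0.5 * d * r * t
                      + d * delta * (2.0 + delta) * t * t / (1.0 + t)
                      + (rho ** 2 / s_hat2) * t / (2.0 * (1.0 - t)))

    lo, hi = -gamma, 0.0
    for _ in range(200):                      # ternary search on a convex function
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if f(m1) <= f(m2):
            hi = m2
        else:
            lo = m1
    t = 0.5 * (lo + hi)
    return f(t), t / s_hat2


def rho_of_kappa(kappa, d, s_lo, s_hi):
    """Bisection in rho for Risk(rho) = exp(Opt(rho)) = kappa; Opt decreases in rho."""
    target, lo, hi = math.log(kappa), 0.0, 1.0
    while opt_and_eta(hi, d, s_lo, s_hi)[0] > target:
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if opt_and_eta(mid, d, s_lo, s_hi)[0] > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def detector(rho, d, s_lo, s_hi):
    """Coefficient of zeta^T zeta and constant a(rho) of the unshifted detector."""
    _, eta = opt_and_eta(rho, d, s_lo, s_hi)
    t = 2.0 * s_hi ** 2 * eta
    r = 1.0 - (s_lo / s_hi) ** 2
    a = 0.25 * d * (math.log((1.0 - t) / (1.0 + t)) + r * t
                    - rho ** 2 * eta / (d * (1.0 - t)))
    return 0.5 * eta, a


d, s_hi = 256 * 256, 10.0
rho_uncertain = rho_of_kappa(0.01, d, s_hi / math.sqrt(2.0), s_hi)  # text reports 2702.4
rho_known = rho_of_kappa(0.01, d, s_hi, s_hi)                       # text reports 668.46

kappa = math.sqrt(0.01 * 0.01 / 8)            # K = 9 in the movie experiment
rho_star = rho_of_kappa(kappa, d, s_hi / math.sqrt(2.0), s_hi)
coef, a = detector(rho_star, d, s_hi / math.sqrt(2.0), s_hi)
a_shifted = a + math.log(0.01 / kappa)        # text reports -2.7138e-5 and 366.9548
```

Ternary search is used only to keep the sketch dependency-free; the text itself refers to Bisection for this step.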

2.10 EXERCISES FOR CHAPTER 2

2.10.1 Two-point lower risk bound

Exercise 2.1. Let p and q be two probability distributions distinct from each other on the d-element observation space Ω = {1, ..., d}, and consider two simple hypotheses on the distribution of observation ω ∈ Ω, $H_1$: ω ∼ p, and $H_2$: ω ∼ q.
1. Is it true that there always exists a simple deterministic test deciding on $H_1$, $H_2$ with risk < 1/2?
2. Is it true that there always exists a simple randomized test deciding on $H_1$, $H_2$ with risk < 1/2?
3. Is it true that when quasi-stationary K-repeated observations are allowed, one can decide on $H_1$, $H_2$ with any small risk, provided K is large enough?

2.10.2 Around Euclidean Separation

Exercise 2.2. Justify the “immediate observation” in Section 2.2.2.3.B.

Exercise 2.3.
1) Prove Proposition 2.9.
Hint: You can find useful the following simple observation (prove it, provided you indeed use it): Let f(ω), g(ω) be probability densities taken w.r.t. a reference measure P on an observation space Ω, and let ǫ ∈ (0, 1/2] be such that
$$2\bar\epsilon := \int_\Omega \min[f(\omega), g(\omega)]\,P(d\omega) \le 2\epsilon.$$
Then
$$\int_\Omega \sqrt{f(\omega)g(\omega)}\,P(d\omega) \le 2\sqrt{\epsilon(1-\epsilon)}.$$
2) Justify the illustration in Section 2.2.3.2.C.

2.10.3 Hypothesis testing via ℓ1-separation

Let d be a positive integer, and the observation space Ω be the finite set {1, ..., d} equipped with the counting reference measure.²⁹ Probability distributions on Ω can be identified with points p of the d-dimensional probabilistic simplex
$$\Delta_d = \{p \in \mathbf{R}^d : p \ge 0,\ \sum_i p_i = 1\};$$
the i-th entry $p_i$ in $p \in \Delta_d$ is the probability for the random variable distributed according to p to take value i ∈ {1, ..., d}. With this interpretation, p is the probability density taken w.r.t. the counting measure on Ω. Assume B and W are two nonintersecting nonempty closed convex subsets of $\Delta_d$; we interpret B and W as the sets of black and white probability distributions on Ω, and our goal is to find the optimal, in terms of its total risk, test deciding on the hypotheses $H_1$: p ∈ B, $H_2$: p ∈ W via a single observation ω ∼ p.

Warning: Everywhere in this section, “test” means “simple test.”

29 Counting measure is the measure on a discrete (finite or countable) set Ω which assigns every point of Ω with mass 1, so that the measure of a subset of Ω is the cardinality of the subset when it is finite and is +∞ otherwise.

Exercise 2.4. Our first goal is to find the optimal test, in terms of its total risk, deciding on the hypotheses $H_1$, $H_2$ via a single observation ω ∼ p ∈ B ∪ W. To this end we consider the convex optimization problem
$$\mathrm{Opt} = \min_{p \in B,\, q \in W} \left[ f(p, q) := \sum_{i=1}^d |p_i - q_i| \right] \tag{2.154}$$
and let $(p^*, q^*)$ be an optimal solution to this problem (it clearly exists).
1. Extract from optimality conditions that there exist reals $\rho_i \in [-1, 1]$, 1 ≤ i ≤ d, such that
$$\rho_i = \begin{cases} 1, & p_i^* > q_i^* \\ -1, & p_i^* < q_i^* \end{cases} \tag{2.155}$$
and
$$\rho^T(p - p^*) \ge 0\ \ \forall p \in B\quad \&\quad \rho^T(q - q^*) \le 0\ \ \forall q \in W. \tag{2.156}$$
2. Extract from the previous item that the test $\mathcal{T}$ which, given an observation ω ∈ {1, ..., d}, accepts $H_1$ with probability $\pi_\omega = (1 + \rho_\omega)/2$ and accepts $H_2$ with complementary probability, has its total risk equal to
$$\sum_{\omega \in \Omega} \min[p^*_\omega, q^*_\omega], \tag{2.157}$$

and thus is minimax optimal in terms of the total risk.

Comments. Exercise 2.4 describes an efficiently computable and, in terms of worst-case total risk, optimal simple test deciding on a pair of “convex” composite hypotheses on the distribution of a discrete random variable. While it seems an attractive result, we believe by itself this result is useless, since typically in the testing problem in question a single observation by far is not enough for a reasonable inference; such an inference requires observing several independent realizations $\omega_1, ..., \omega_K$ of the random variable in question. And the construction presented in Exercise 2.4 says nothing on how to adjust the test to the case of repeated observation. Of course, when $\omega^K = (\omega_1, ..., \omega_K)$ is a K-element i.i.d. sample drawn from a probability distribution p on Ω = {1, ..., d}, $\omega^K$ can be thought of as a single observation of a discrete random variable taking value in the set
$$\Omega^K = \underbrace{\Omega \times ... \times \Omega}_{K},$$
the probability distribution $p^K$ of $\omega^K$ being readily given by p. So, why not apply the construction from Exercise 2.4 to $\omega^K$ in the role of ω? On a close inspection, this idea fails. One of the reasons for this failure is that the cardinality of $\Omega^K$ (which, among other factors, is responsible for the computational complexity of implementing the test in Exercise 2.4) blows up exponentially as K grows. Another, even more serious, complication is that $p^K$ depends on p nonlinearly, so that the family of distributions $p^K$ of $\omega^K$ induced by a convex family of distributions p of ω (convexity meaning that the p’s in question fill a convex subset of the probabilistic simplex) is not convex; and convexity of the sets B, W in the context of Exercise 2.4 is crucial. Thus, passing from a single realization of a discrete random variable to the sample of K > 1 independent realizations of the variable results in severe structural and quantitative complications “killing,” at least at first glance, the approach undertaken in Exercise 2.4.³⁰

In spite of the above pessimistic conclusions, the single-observation test from Exercise 2.4 admits a meaningful multi-observation modification, which is the subject of our next exercise.

Exercise 2.5. There is a straightforward way to use the optimal (in terms of its total risk) single-observation test built in Exercise 2.4 in the “multi-observation” environment. Specifically, following the notation from Exercise 2.4, let $\rho \in \mathbf{R}^d$, $p^*$, $q^*$ be the entities built in this exercise, so that $p^* \in B$, $q^* \in W$, all entries in ρ belong to [−1, 1], and
$$\{\rho^T p \ge \alpha := \rho^T p^*\ \ \forall p \in B\}\ \ \&\ \ \{\rho^T q \le \beta := \rho^T q^*\ \ \forall q \in W\}\ \ \&\ \ \alpha - \beta = \rho^T[p^* - q^*] = \|p^* - q^*\|_1.$$
Given an i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_t \sim p$, where p ∈ B ∪ W, we could try to decide on the hypotheses $H_1$: p ∈ B, $H_2$: p ∈ W as follows. Let us set $\zeta_t = \rho_{\omega_t}$.
For large K, given $\omega^K$, the observable quantity $\zeta^K := \frac1K \sum_{t=1}^K \zeta_t$, by the Law of Large Numbers, will be with overwhelming probability close to $\mathbf{E}_{\omega \sim p}\{\rho_\omega\} = \rho^T p$, and the latter quantity is ≥ α when p ∈ B and is ≤ β < α when p ∈ W. Consequently, selecting a “comparison level” ℓ ∈ (β, α), we can decide on the hypotheses p ∈ B vs. p ∈ W by computing $\zeta^K$, comparing the result to ℓ, accepting the hypothesis p ∈ B when $\zeta^K \ge \ell$, and accepting the alternative p ∈ W otherwise. The goal of this exercise is to quantify the above qualitative considerations. To this end let us fix ℓ ∈ (β, α) and K and ask ourselves the following questions:
A. For p ∈ B, how do we upper-bound the probability $\mathrm{Prob}_{p^K}\{\zeta^K \le \ell\}$?
B. For p ∈ W, how do we upper-bound the probability $\mathrm{Prob}_{p^K}\{\zeta^K \ge \ell\}$?
Here $p^K$ is the probability distribution of the i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_t \sim p$.

The simplest way to answer these questions is to use Bernstein’s bounding scheme. Specifically, to answer question A, let us select γ ≥ 0 and observe that for every probability distribution p on {1, 2, ..., d} it holds
$$\underbrace{\mathrm{Prob}_{p^K}\{\zeta^K \le \ell\}}_{\pi_{K,-}[p]}\ \exp\{-\gamma\ell\} \le \mathbf{E}_{p^K}\{\exp\{-\gamma\zeta^K\}\} = \left[\sum_{i=1}^d p_i \exp\{-\tfrac1K \gamma\rho_i\}\right]^K,$$
whence
$$\ln(\pi_{K,-}[p]) \le K \ln\left(\sum_{i=1}^d p_i \exp\{-\tfrac1K \gamma\rho_i\}\right) + \gamma\ell,$$
implying, via substitution γ = µK, that
$$\forall \mu \ge 0:\ \ln(\pi_{K,-}[p]) \le K\psi_-(\mu, p),\qquad \psi_-(\mu, p) = \ln\left(\sum_{i=1}^d p_i \exp\{-\mu\rho_i\}\right) + \mu\ell.$$
Similarly, setting $\pi_{K,+}[p] = \mathrm{Prob}_{p^K}\{\zeta^K \ge \ell\}$, we get
$$\forall \nu \ge 0:\ \ln(\pi_{K,+}[p]) \le K\psi_+(\nu, p),\qquad \psi_+(\nu, p) = \ln\left(\sum_{i=1}^d p_i \exp\{\nu\rho_i\}\right) - \nu\ell.$$

30 Though directly extending the optimal single-observation test to the case of repeated observations encounters significant technical difficulties, it was carried out in some specific situations. For instance, in [122, 123] such an extension has been proposed for the case of sets B and W of distributions which are dominated by bi-alternating capacities (see, e.g., [8, 12, 35], and references therein); explicit constructions of the test were proposed for some special sets of distributions [121, 196, 209].

Now comes the exercise:
1. Extract from the above observations that
$$\mathrm{Risk}(\mathcal{T}^{K,\ell}|H_1, H_2) \le \exp\{K\kappa\},\qquad \kappa = \max\left[\max_{p \in B} \inf_{\mu \ge 0} \psi_-(\mu, p),\ \max_{q \in W} \inf_{\nu \ge 0} \psi_+(\nu, q)\right],$$
where $\mathcal{T}^{K,\ell}$ is the K-observation test which accepts the hypothesis $H_1$: p ∈ B when $\zeta^K \ge \ell$ and accepts the hypothesis $H_2$: p ∈ W otherwise.
2. Verify that $\psi_-(\mu, p)$ is convex in µ and concave in p, and similarly for $\psi_+(\nu, q)$, so that
$$\max_{p \in B} \inf_{\mu \ge 0} \psi_-(\mu, p) = \inf_{\mu \ge 0} \max_{p \in B} \psi_-(\mu, p),\qquad \max_{q \in W} \inf_{\nu \ge 0} \psi_+(\nu, q) = \inf_{\nu \ge 0} \max_{q \in W} \psi_+(\nu, q).$$
Thus, computing κ reduces to minimizing on the nonnegative ray the convex functions $\phi_-(\mu) = \max_{p \in B} \psi_-(\mu, p)$ and $\phi_+(\nu) = \max_{q \in W} \psi_+(\nu, q)$.
3. Prove that when $\ell = \frac12[\alpha + \beta]$, one has
$$\kappa \le -\frac{1}{12}\Delta^2,\qquad \Delta = \alpha - \beta = \|p^* - q^*\|_1.$$
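To get a feeling for the quantities in this exercise, here is a small numerical sketch in Python. It is only an illustration under simplifying assumptions: B and W are taken to be singletons {p} and {q}, so that the maxima over B and W disappear and ρ can be chosen as the sign vector of p − q (with 0 on ties), the ray µ ≥ 0 is truncated to [0, 50] (an arbitrary cap), and the function names are hypothetical.

```python
import math


def psi_minus(mu, p, rho, ell):
    """psi_-(mu, p) = ln(sum_i p_i exp(-mu * rho_i)) + mu * ell."""
    return math.log(sum(pi * math.exp(-mu * ri) for pi, ri in zip(p, rho))) + mu * ell


def psi_plus(nu, q, rho, ell):
    """psi_+(nu, q) = ln(sum_i q_i exp(nu * rho_i)) - nu * ell."""
    return math.log(sum(qi * math.exp(nu * ri) for qi, ri in zip(q, rho))) - nu * ell


def inf_on_ray(g, upper=50.0, iters=200):
    """Ternary search for the minimum of a convex g on [0, upper]."""
    lo, hi = 0.0, upper
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return g(0.5 * (lo + hi))


# Singleton "black" and "white" sets (an assumption made for illustration).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
rho = [1.0 if a > b else -1.0 if a < b else 0.0 for a, b in zip(p, q)]
alpha = sum(r * a for r, a in zip(rho, p))   # rho^T p
beta = sum(r * b for r, b in zip(rho, q))    # rho^T q
ell = 0.5 * (alpha + beta)                   # comparison level from item 3
delta_l1 = alpha - beta                      # equals ||p - q||_1 here

kappa = max(inf_on_ray(lambda mu: psi_minus(mu, p, rho, ell)),
            inf_on_ray(lambda nu: psi_plus(nu, q, rho, ell)))
risk_bound_K20 = math.exp(20 * kappa)        # bound exp{K * kappa} for K = 20
```

For genuinely composite B and W, the inner maxima over p and q have to be computed as well, which is what item 2 of the exercise addresses.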

Note that the above test and the quantity κ responsible for the upper bound on its risk depend, as on a parameter, on the “acceptance level” ℓ ∈ (β, α). The simplest way to select a reasonable value of ℓ is to minimize κ over an equidistant grid Γ ⊂ (β, α), of small cardinality, of values of ℓ.

Now, let us consider an alternative way to pass from a “good” single-observation test to its multi-observation version. Our “building block” now is the minimum risk randomized single-observation test,³¹ and its multi-observation modification is just the majority version of this building block. Our first observation is that building the minimum risk single-observation test reduces to solving a convex optimization problem.

31 This test can differ from the test built in Exercise 2.4: the latter test is optimal in terms of the sum, rather than the maximum, of its partial risks.

Exercise 2.6. Let, as above, B and W be nonempty nonintersecting closed convex subsets of the probabilistic simplex $\Delta_d$. Show that the problem of finding the best (in terms of its risk) randomized single-observation test deciding on $H_1$: p ∈ B vs. $H_2$: p ∈ W via observation ω ∼ p reduces to solving a convex optimization problem. Write down this problem as an explicit LO program when B and W are polyhedral sets given by polyhedral representations:
$$B = \{p : \exists u : P_B p + Q_B u \le a_B\},\qquad W = \{p : \exists u : P_W p + Q_W u \le a_W\}.$$

We see that the “ideal building block,” the minimum-risk single-observation test, can be built efficiently. What is at this point unclear is whether this block is of any use for majority modifications, that is, whether the risk of this test is < 1/2; this is what we need for the majority version of the minimum-risk single-observation test to be consistent.

Exercise 2.7. Extract from Exercise 2.4 that in the situation of this section, denoting by ∆ the optimal value in the optimization problem (2.154), one has
1. The risk of any single-observation test, deterministic or randomized, is $\ge \frac12 - \frac{\Delta}{4}$;
2. There exists a single-observation randomized test with risk $\le \frac12 - \frac{\Delta}{8}$, and thus the risk of the minimum risk single-observation test given by Exercise 2.6 does not exceed $\frac12 - \frac{\Delta}{8} < 1/2$ as well.
Pay attention to the fact that ∆ > 0 (since, by assumption, B and W do not intersect).

The bottom line is that in the situation of this section, given a target value ǫ of risk and assuming stationary repeated observations are allowed, we have (at least) three options to meet the risk specifications:
1. To start with the optimal (in terms of its total risk) single-observation detector as explained in Exercise 2.4, and then to pass to its multi-observation version built in Exercise 2.5;
2. To use the majority version of the minimum-risk randomized single-observation test built in Exercise 2.6;
3. To use the test based on the minimum risk detector for B, W, as explained in the main body of Chapter 2.
In all cases, we have to specify the number K of observations which guarantees that the risk of the resulting multi-observation test is at most a given target ǫ. A bound on K can be easily obtained by utilizing the results on the risk of a detector-based test in a Discrete o.s. from the main body of Chapter 2 along with risk-related results of Exercises 2.5, 2.6, and 2.7.

Exercise 2.8.

Run numerical experiments to see if one of the three options above always dominates the others (that is, requires a smaller sample of observations to ensure the same risk).

Let us now focus on a theoretical comparison of the detector-based test and the majority version of the minimum-risk single-observation test (options 3 and 2 above) in the general situation described at the beginning of Section 2.10.3. Given ǫ ∈ (0, 1), the corresponding sample sizes $K_d$ and $K_m$ are completely determined by the relevant “measure of closeness” between B and W. Specifically,
• For $K_d$, the closeness measure is
$$\rho_d(B, W) = 1 - \max_{p \in B,\, q \in W} \sum_\omega \sqrt{p_\omega q_\omega}; \tag{2.158}$$
$1 - \rho_d(B, W)$ is the minimal risk of a detector for B, W, and for $\rho_d(B, W)$ and ǫ small, we have $K_d \approx \ln(1/\epsilon)/\rho_d(B, W)$ (why?).
• Given ǫ, $K_m$ is fully specified by the minimal risk ρ of a simple randomized single-observation test $\mathcal{T}$ deciding on the hypotheses associated with B, W. By Exercise 2.7, we have $\rho = \frac12 - \delta$, where δ is within an absolute constant factor of the optimal value $\Delta = \min_{p \in B, q \in W} \|p - q\|_1$ of (2.154). The risk bound for the K-observation majority version of $\mathcal{T}$ is the probability to get at least K/2 heads in K independent tosses of a coin with probability to get heads in a single toss equal to ρ = 1/2 − δ. When ρ is not close to 0 and ǫ is small, the (1 − ǫ)-quantile of the number of heads in our K coin tosses is $K\rho + O(1)\sqrt{K\ln(1/\epsilon)} = K/2 - \delta K + O(1)\sqrt{K\ln(1/\epsilon)}$ (why?). $K_m$ is the smallest K for which this quantile is < K/2, so that $K_m$ is of the order of $\ln(1/\epsilon)/\delta^2$, or, which is the same, of the order of $\ln(1/\epsilon)/\Delta^2$. We see that the closeness between B and W “responsible for $K_m$” is
$$\rho_m(B, W) = \Delta^2 = \left[\min_{p \in B,\, q \in W} \|p - q\|_1\right]^2,$$
and $K_m$ is of the order of $\ln(1/\epsilon)/\rho_m(B, W)$.

The goal of the next exercise is to compare $\rho_d$ and $\rho_m$.

Exercise 2.9. Prove that in the situation of this section one has
$$\frac18\,\rho_m(B, W) \le \rho_d(B, W) \le \frac12\sqrt{\rho_m(B, W)}. \tag{2.159}$$

Relation (2.159) suggests that while $K_d$ never is “much larger” than $K_m$ (this we know in advance: in repeated versions of Discrete o.s., a properly built detector-based test provably is nearly optimal), $K_m$ might be much larger than $K_d$. This indeed is the case:

Exercise 2.10. Given δ ∈ (0, 1/2), let B = {[δ; 0; 1 − δ]} and W = {[0; δ; 1 − δ]}. Verify that in this case the numbers of observations $K_d$ and $K_m$, resulting in a given risk ǫ ≪ 1 of multi-observation tests, as functions of δ are proportional to 1/δ and 1/δ², respectively. Compare the numbers when ǫ = 0.01 and δ ∈ {0.01; 0.05; 0.1}.
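A rough numerical check of the two sample sizes for this pair B, W can be sketched as follows. The sketch rests on assumptions that the exercise asks the reader to verify: the K-observation detector-based risk is bounded by $(1-\delta)^K$, since $\sum_\omega \sqrt{p_\omega q_\omega} = 1 - \delta$ here, and the minimal single-observation risk equals $\frac12 - \frac{\delta}{2}$ (accept B on ω = 1, accept W on ω = 2, toss a fair coin on ω = 3). The function names and the search cap `k_max` are hypothetical.

```python
import math


def k_detector(delta, eps):
    """Smallest K with (1 - delta)**K <= eps (assumed detector-based risk bound)."""
    return math.ceil(math.log(1.0 / eps) / (-math.log(1.0 - delta)))


def majority_risk_bound(K, rho):
    """Probability of at least K/2 heads in K tosses with head probability rho."""
    k0 = (K + 1) // 2                          # smallest integer >= K/2
    log_r, log_1r = math.log(rho), math.log(1.0 - rho)
    total = 0.0
    for k in range(k0, K + 1):
        # binomial term computed in log scale to avoid overflow for large K
        log_term = (math.lgamma(K + 1) - math.lgamma(k + 1) - math.lgamma(K - k + 1)
                    + k * log_r + (K - k) * log_1r)
        total += math.exp(log_term)
    return total


def k_majority(delta, eps, k_max=5000):
    """Smallest K <= k_max whose majority risk bound is <= eps, else None."""
    rho = 0.5 - delta / 2.0                    # assumed single-observation risk
    for K in range(1, k_max + 1):
        if majority_risk_bound(K, rho) <= eps:
            return K
    return None


k_d = k_detector(0.1, 0.01)
k_m = k_majority(0.1, 0.01)
```

The linear scan in `k_majority` costs on the order of K² operations, so for δ = 0.01 a smarter search (or a normal approximation) would be preferable.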

2.10.4 Miscellaneous exercises

Exercise 2.11. Prove that the conclusion in Proposition 2.18 remains true when the test T in the premise of the proposition is randomized.

Exercise 2.12. Let p1(ω), p2(ω) be two positive probability densities, taken w.r.t. a reference measure Π on an observation space Ω, and let Pχ = {pχ}, χ = 1, 2. Find the optimal (in terms of its risk) balanced detector for Pχ, χ = 1, 2.

Exercise 2.13. Recall that the exponential distribution on Ω = R+, with parameter µ > 0, is the distribution with the density pµ(ω) = µe^{−µω}, ω ≥ 0. Given positive reals α < β, consider two families of exponential distributions, P1 = {pµ : 0 < µ ≤ α} and P2 = {pµ : µ ≥ β}. Build the optimal (in terms of its risk) balanced detector for P1, P2. What happens with the risk of the detector you have built when the families Pχ, χ = 1, 2, are replaced with their convex hulls?

Exercise 2.14. [Follow-up to Exercise 2.13] Assume that the “lifetime” ζ of a lightbulb is a realization of a random variable with exponential distribution (i.e., the density pµ(ζ) = µe^{−µζ}, ζ ≥ 0; in particular, the expected lifespan of a lightbulb in this model is 1/µ).³² Given a lot of lightbulbs, you should decide whether they were produced under normal conditions (resulting in µ ≤ α = 1) or under abnormal ones (resulting in µ ≥ β = 1.5). To this end, you can select at random K lightbulbs and test them. How many lightbulbs should you test in order to make a 0.99-reliable conclusion? Answer this question in the situations when the observation ω in a test is
1. the lifespan of a lightbulb (i.e., ω ∼ pµ(·));
2. the minimum ω = min[ζ, δ] of the lifespan ζ ∼ pµ(·) and the allowed duration δ > 0 of your test (i.e., if the lightbulb you are testing does not “die” on time horizon δ, you terminate the test);
3. ω = χ_{ζ<δ} (that is, ω = 1 when ζ < δ and ω = 0 otherwise), where, as above, δ > 0 is the allowed test duration (i.e., you observe whether or not a lightbulb “dies” on time horizon δ, but do not register the lifespan when it is < δ).
Consider the values 0.25, 0.5, 1, 2, 4 of δ.

Exercise 2.15.
³² In Reliability, the probability distribution of the lifespan ζ of an organism or a technical device is characterized by the failure rate λ(t) = lim_{δ→+0} Prob{t ≤ ζ ≤ t + δ} / (δ · Prob{ζ ≥ t}) (so that for small δ, λ(t)δ is the conditional probability to “die” in the time interval [t, t + δ] provided the organism or device is still “alive” at time t). The exponential distribution corresponds to the case of failure rate independent of t; in applications, this indeed is often the case except for “very small” and “very large” values of t.


[Follow-up to Exercise 2.14] In the situation of Exercise 2.14, build a sequential test for deciding on the null hypothesis “the lifespan of a lightbulb from a given lot is ζ ∼ pµ(·) with µ ≤ 1” (recall that pµ(z) is the exponential density µe^{−µz} on the ray {z ≥ 0}) vs. the alternative “the lifespan is ζ ∼ pµ(·) with µ > 1.” In this test, you can select a number K of lightbulbs from the lot, switch them on at time 0, and record the actual lifetimes of the lightbulbs you are testing. As a result, at the end of (any) observation interval ∆ = [0, δ], you observe K independent realizations of the r.v. min[ζ, δ], where ζ ∼ pµ(·) with some unknown µ. In your sequential test, you are welcome to make conclusions at the endpoints δ1 < δ2 < ... < δS of several observation intervals.
Note: We deliberately skip details of the problem’s setting; how you decide on these missing details is part of your solution to the exercise.

Exercise 2.16. In Section 2.6, we considered a model of elections where every member of the population was supposed to cast a vote. Enrich the model by incorporating the option for a voter not to participate in the elections at all. Implement the Sequential test for the resulting model and run simulations.

Exercise 2.17. Work out the following extension of the Opinion Poll Design problem. You are given two finite sets, Ω1 = {1, ..., I} and Ω2 = {1, ..., M}, along with L nonempty closed convex subsets Yℓ of the set
\[ \Delta_{IM} = \Big\{[y_{im} > 0]_{i,m} : \sum_{i=1}^{I}\sum_{m=1}^{M} y_{im} = 1\Big\} \]
of all nonvanishing probability distributions on Ω = Ω1 × Ω2 = {(i, m) : 1 ≤ i ≤ I, 1 ≤ m ≤ M}. Sets Yℓ are such that all distributions from Yℓ have a common marginal distribution θ^ℓ > 0 of i:
\[ \sum_{m=1}^{M} y_{im} = \theta^\ell_i,\quad 1 \le i \le I,\ \forall y \in Y_\ell,\ 1 \le \ell \le L. \]
Your observations ω1, ω2, ... are sampled, independently of each other, from a distribution partly selected “by nature,” and partly by you. Specifically, nature selects ℓ ≤ L and a distribution y ∈ Yℓ, and you select a positive I-dimensional probabilistic vector q from a given convex compact subset Q of the positive part of the I-dimensional probabilistic simplex. Let y|i be the conditional distribution of m ∈ Ω2 given i induced by y, so that y|i is the M-dimensional probabilistic vector with entries
\[ [y_{|i}]_m = \frac{y_{im}}{\sum_{\mu \le M} y_{i\mu}} = \frac{y_{im}}{\theta^\ell_i}. \]
In order to generate ωt = (it, mt) ∈ Ω, you draw it at random from the distribution q, and then nature draws mt at random from the distribution y|it. Given closeness relation C, your goal is to decide, up to closeness C, on the hypotheses H1, ..., HL, with Hℓ stating that the distribution y selected by nature belongs to Yℓ. Given an “observation budget” (a number K of observations ωk you can use), you want to find a probabilistic vector q which results in the test with as


small a C-risk as possible. Pose this Measurement Design problem as an efficiently solvable convex optimization problem.

Exercise 2.18. [Probabilities of deviations from the mean] The goal of what follows is to present the most straightforward application of simple families of distributions: bounds on probabilities of deviations of random vectors from their means. Let H ⊂ Ω = R^d, M, Φ be regular data such that 0 ∈ int H, M is compact, Φ(0; µ) = 0 ∀µ ∈ M, and Φ(h; µ) is differentiable at h = 0 for every µ ∈ M. Let, further, P̄ ∈ S[H, M, Φ] and let µ̄ ∈ M be a parameter of P̄. Prove that
1. P̄ possesses expectation e[P̄], and e[P̄] = ∇_h Φ(0; µ̄).
2. For every linear form e^T ω on Ω it holds
\[ \pi := \bar P\{\omega : e^T(\omega - e[\bar P]) \ge 1\} \le \exp\Big\{\inf_{t \ge 0:\, te \in H}\big[\Phi(te;\bar\mu) - t e^T\nabla_h\Phi(0;\bar\mu) - t\big]\Big\}. \tag{2.160} \]

What are the consequences of (2.160) for sub-Gaussian distributions?

Exercise 2.19. [Testing convex hypotheses on mixtures] Consider the situation as follows. For given positive integers K and L and for χ = 1, 2, given are
• nonempty convex compact signal sets Uχ ⊂ R^{nχ},
• regular data H^χ_{kℓ} ⊂ R^{d_k}, M^χ_{kℓ}, Φ^χ_{kℓ}, and affine mappings uχ ↦ A^χ_{kℓ}[uχ; 1] : R^{nχ} → R^{d_k} such that
\[ u_\chi \in U_\chi \ \Rightarrow\ A^\chi_{k\ell}[u_\chi;1] \in M^\chi_{k\ell},\quad 1 \le k \le K,\ 1 \le \ell \le L, \]
• probabilistic vectors µ^k = [µ^k_1; ...; µ^k_L], 1 ≤ k ≤ K.
We can associate with the outlined data families of probability distributions Pχ on the observation space Ω = R^{d_1} × ... × R^{d_K} as follows. For χ = 1, 2, Pχ is comprised of all probability distributions P of random vectors ω^K = [ω1; ...; ωK] ∈ Ω generated as follows: We select
• a signal uχ ∈ Uχ,
• a collection of probability distributions P_{kℓ} ∈ S[H^χ_{kℓ}, M^χ_{kℓ}, Φ^χ_{kℓ}], 1 ≤ k ≤ K, 1 ≤ ℓ ≤ L, in such a way that A^χ_{kℓ}[uχ; 1] is a parameter of P_{kℓ}:
\[ \forall h_k \in H^\chi_{k\ell}:\ \ln\Big(E_{\omega_k \sim P_{k\ell}}\big\{e^{h_k^T\omega_k}\big\}\Big) \le \Phi^\chi_{k\ell}(h_k; A^\chi_{k\ell}[u_\chi;1]); \]
• we generate the components ωk, k = 1, ..., K, independently across k, from the µ^k-mixture Π[{P_{kℓ}}_{ℓ=1}^L, µ^k] of distributions P_{kℓ}, ℓ = 1, ..., L, that is, draw at random,


from distribution µ^k on {1, ..., L}, index ℓ, and then draw ωk from the distribution P_{kℓ}.
Prove that when setting
\[ \begin{array}{rcl}
H^\chi &=& \{h = [h_1;...;h_K] \in R^{d = d_1+...+d_K} : h_k \in \bigcap_{\ell=1}^{L} H^\chi_{k\ell},\ 1 \le k \le K\},\\
M^\chi &=& \{0\} \subset R,\\
\Phi^\chi(h;\mu) &=& \sum_{k=1}^{K}\ln\Big(\sum_{\ell=1}^{L}\mu^k_\ell \exp\Big\{\max_{u_\chi \in U_\chi}\Phi^\chi_{k\ell}(h_k; A^\chi_{k\ell}[u_\chi;1])\Big\}\Big) : H^\chi \times M^\chi \to R,
\end{array} \]
we obtain the regular data such that
\[ P_\chi \subset S[H^\chi, M^\chi, \Phi^\chi]. \]
Explain how to use this observation to compute via Convex Programming an affine detector and its risk for the families of distributions P1 and P2.

Exercise 2.20. [Mixture of sub-Gaussian distributions] Let Pℓ be sub-Gaussian distributions on R^d with sub-Gaussianity parameters θℓ, Θ, 1 ≤ ℓ ≤ L, with a common Θ-parameter, and let ν = [ν1; ...; νL] be a probabilistic vector. Consider the ν-mixture P = Π[P^L, ν] of distributions Pℓ, so that ω ∼ P is generated as follows: we draw at random from distribution ν index ℓ and then draw ω at random from distribution Pℓ. Prove that P is sub-Gaussian with sub-Gaussianity parameters θ̄ = Σ_ℓ νℓ θℓ and Θ̄, with (any) Θ̄ chosen to satisfy
\[ \bar\Theta \succeq \Theta + \tfrac{6}{5}[\theta_\ell - \bar\theta][\theta_\ell - \bar\theta]^T\ \ \forall \ell, \]
in particular, according to any one of the following rules:
1. Θ̄ = Θ + (6/5) max_ℓ ‖θℓ − θ̄‖₂² I_d,
2. Θ̄ = Θ + (6/5) Σ_ℓ (θℓ − θ̄)(θℓ − θ̄)^T,
3. Θ̄ = Θ + (6/5) Σ_ℓ θℓ θℓ^T, provided that ν1 = ... = νL = 1/L.

Exercise 2.21.

The goal of this exercise is to give a simple sufficient condition for quadratic lifting “to work” in the Gaussian case. Namely, let Aχ, Uχ, Vχ, Gχ, χ = 1, 2, be as in Section 2.9.3, with the only difference that now we do not assume the compact sets Uχ to be convex, and let Zχ be convex compact subsets of the sets Z^{nχ} (see item i.2 in Proposition 2.43) such that
\[ [u_\chi;1][u_\chi;1]^T \in Z_\chi\ \ \forall u_\chi \in U_\chi,\ \chi = 1, 2. \]
Augmenting the above data with Θ∗^{(χ)}, δχ such that V = Vχ, Θ∗ = Θ∗^{(χ)}, δ = δχ satisfy (2.129), χ = 1, 2, and invoking Proposition 2.43.ii, we get at our disposal a quadratic detector φ_lift such that
\[ \mathrm{Risk}[\phi_{\rm lift}|G_1, G_2] \le \exp\{\mathrm{SadVal}_{\rm lift}\}, \]
with SadVal_lift given by (2.134). A natural question is when SadVal_lift is negative, meaning that our quadratic detector indeed “is working”: its risk is < 1, implying that when repeated observations are allowed, tests based upon this detector are consistent, that is, able to decide on the hypotheses Hχ : P ∈ Gχ, χ = 1, 2, on the distribution of observation ζ ∼ P with any small desired risk ϵ ∈ (0, 1). With our computation-oriented ideology, this is not too important a question, since we can answer it via efficient computation. This being said, there is no harm in a “theoretical” answer which could provide us with an additional insight. The goal of the exercise is to justify a simple result on the subject. Here is the exercise:
In the situation in question, assume that V1 = V2 = {Θ∗}, which allows us to set Θ∗^{(χ)} = Θ∗, δχ = 0, χ = 1, 2. Prove that in this case a necessary and sufficient condition for SadVal_lift to be negative is that the convex compact sets
\[ U_\chi = \{B_\chi Z B_\chi^T : Z \in Z_\chi\} \subset S^{d+1}_+,\ \chi = 1, 2 \]
do not intersect with each other.

Exercise 2.22. Prove that if X is a nonempty convex compact set in R^d, then the function Φ̂(h; µ) given by (2.114) is real-valued and continuous on R^d × X and is convex in h and concave in µ.

Exercise 2.23.

The goal of what follows is to refine the change detection procedure (let us refer to it as the “basic” one) developed in Section 2.9.5.1. The idea is pretty simple. With the notation from Section 2.9.5.1, in the basic procedure, when testing the null hypothesis H0 vs. signal hypothesis H^ρ_t, we look at the difference ζt = ωt − ω1 and try to decide whether the energy of the deterministic component xt − x1 of ζt is 0, as is the case under H0, or is ≥ ρ², as is the case under H^ρ_t. Note that if σ ∈ [σ̲, σ̄] is the actual intensity of the observation noise, then the noise component of ζt is N(0, 2σ²I_d); other things being equal, the larger is the noise in ζt, the larger should be ρ to allow for a reliable (with a given reliability level) decision. Now note that under the hypothesis H^ρ_t, we have x1 = ... = x_{t−1}, so that the deterministic component of the difference ζt = ωt − ω1 is exactly the same as for the difference
\[ \tilde\zeta_t = \omega_t - \frac{1}{t-1}\sum_{s=1}^{t-1}\omega_s, \]
while the noise component in ζ̃t is N(0, σt² I_d) with
\[ \sigma_t^2 = \sigma^2 + \frac{1}{t-1}\sigma^2 = \frac{t}{t-1}\sigma^2. \]
Thus, the intensity of noise in ζ̃t is at most the same as in ζt, and this intensity, in contrast to that for ζt, decreases as t grows.

Here comes the exercise: Let reliability tolerances ϵ, ε ∈ (0, 1) be given, and let our goal be to design a system of inferences Tt, t = 2, 3, ..., K, which, when used in the same fashion as tests T^κ_t were used in the basic procedure, results in false alarm probability at most ϵ and in probability to miss a change of energy ≥ ρ² at most ε. Needless to say, we want to achieve this goal with as small a ρ as possible. Think how to utilize the above observation to refine the basic procedure, eventually reducing (and provably not increasing) the required value of ρ. Implement the basic and the refined change detection procedures and compare their quality (the resulting values of ρ), e.g., on the data used in the experiment reported in Section 2.9.5.1.
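The variance comparison behind Exercise 2.23 can be illustrated numerically. The sketch below is an assumption-laden toy, not the procedure of Section 2.9.5.1: it works in dimension d = 1 with x1 = ... = xt = 0, and the function names are hypothetical. It compares the noise variance 2σ² of the basic difference ωt − ω1 with the variance tσ²/(t − 1) of the refined difference ωt − (1/(t − 1)) Σ_{s<t} ωs.

```python
import random


def basic_noise_var(sigma):
    """Noise variance of omega_t - omega_1."""
    return 2.0 * sigma ** 2


def refined_noise_var(sigma, t):
    """Noise variance of omega_t minus the average of omega_1, ..., omega_{t-1}."""
    return sigma ** 2 * t / (t - 1)


def empirical_vars(sigma, t, n_trials=20000, seed=0):
    """Monte Carlo estimates of both variances (signal identically zero, d = 1)."""
    rng = random.Random(seed)
    sq_basic = 0.0
    sq_refined = 0.0
    for _ in range(n_trials):
        omega = [rng.gauss(0.0, sigma) for _ in range(t)]
        basic = omega[-1] - omega[0]
        refined = omega[-1] - sum(omega[:-1]) / (t - 1)
        sq_basic += basic * basic        # both differences have zero mean here
        sq_refined += refined * refined
    return sq_basic / n_trials, sq_refined / n_trials


if __name__ == "__main__":
    for t in (2, 5, 20):
        print(t, basic_noise_var(1.0), refined_noise_var(1.0, t), empirical_vars(1.0, t))
```

For t = 2 the two variances coincide, and the refined one decreases toward σ² as t grows, which is the observation the exercise asks to exploit.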

2.11 PROOFS

2.11.1 Proof of the observation in Remark 2.8

We have to prove that if p = [p1; ...; pK] ∈ B = [0, 1]^K, then the probability P_M(p) of the event

  The total number of heads in K independent coin tosses, with probability pk to get heads in the k-th toss, is at least M

is a nondecreasing function of p: if p′ ≤ p′′, p′, p′′ ∈ B, then P_M(p′) ≤ P_M(p′′). To see it, let us associate with p ∈ B a subset of B, specifically, Bp = {x ∈ B : 0 ≤ xk ≤ pk, 1 ≤ k ≤ K}, and a function χp(x) : B → {0, 1} which is equal to 0 at every point x ∈ B where the number of entries xk satisfying xk ≤ pk is less than M, and is equal to 1 otherwise. It is immediately seen that
\[ P_M(p) \equiv \int_B \chi_p(x)\,dx \tag{2.161} \]
(since with respect to the uniform distribution on B, the events Ek = {x ∈ B : xk ≤ pk} are independent across k and have probabilities pk, and the right-hand side in (2.161) is exactly the probability, taken w.r.t. the uniform distribution on B, of the event “at least M of the events E1, ..., EK take place”). But the right-hand side in (2.161) clearly is nondecreasing in p ∈ B, since χp, by construction, is the characteristic function of the set B[p] = {x : at least M of the entries xk in x satisfy xk ≤ pk}, and these sets clearly grow when p increases entrywise.

2.11.2 Proof of Proposition 2.6 in the case of quasi-stationary K-repeated observations

2.11.2.A Situation and goal. We are in the case QS (see Section 2.2.3.1) of the setting described at the beginning of Section 2.2.3. It suffices to verify that if Hℓ, ℓ ∈ {1, 2}, is true, then the probability for T^maj_K to reject Hℓ is at most the quantity ϵ_K defined in (2.23). Let us verify this statement in the case of ℓ = 1; the reasoning for ℓ = 2 “mirrors” the one to follow. It is clear that our situation and goal can be formulated as follows:

• “In nature” there exists a random sequence ζ^K = (ζ1, ..., ζK) of driving factors and a collection of deterministic functions θk(ζ^k = (ζ1, ..., ζk))³³ taking values in Ω = R^d such that our k-th observation is ωk = θk(ζ^k). Additionally, the conditional distribution P_{ωk|ζ^{k−1}} of ωk given ζ^{k−1} always belongs to the family P1 comprised of distributions of random vectors of the form x + ξ, where deterministic x belongs to X1 and the distribution of ξ belongs to P^d_γ.

• There exist deterministic functions χk : Ω → {0, 1} and an integer M, 1 ≤ M ≤ K, such that the test T^maj_K, as applied to observation ω^K = (ω1, ..., ωK), rejects H1 if and only if the number of 1’s among the quantities χk(ωk), 1 ≤ k ≤ K, is at least M. In the situation of Proposition 2.6, M = ⌋K/2⌊ and χk(·) are in fact independent of k: χk(ω) = 1 if and only if φ(ω) ≤ 0.³⁴

• What we know is that the conditional probability of the event χk(ωk = θk(ζ^k)) = 1, ζ^{k−1} being given, is at most ϵ⋆:
\[ P_{\omega_k|\zeta^{k-1}}\{\omega_k : \chi_k(\omega_k) = 1\} \le \epsilon_\star\ \ \forall \zeta^{k-1}. \]
Indeed, P_{ωk|ζ^{k−1}} ∈ P1. As a result,
\[ P_{\omega_k|\zeta^{k-1}}\{\omega_k : \chi_k(\omega_k) = 1\} = P_{\omega_k|\zeta^{k-1}}\{\omega_k : \phi(\omega_k) \le 0\} = P_{\omega_k|\zeta^{k-1}}\{\omega_k : \phi(\omega_k) < 0\} \le \epsilon_\star, \]
where the second equality is due to the fact that φ(ω) is a nonconstant affine function and P_{ωk|ζ^{k−1}}, along with all distributions from P1, has a density, and the inequality is given by the origin of ϵ⋆, which upper-bounds the risk of the single-observation test underlying T^maj_K.

³³ As always, given a K-element sequence, say, ζ1, ..., ζK, we write ζ^t, t ≤ K, as a shorthand for the fragment ζ1, ..., ζt of this sequence.

What we want to prove is that under the circumstances we have just summarized, we have
\[ P_{\omega^K}\{\omega^K = (\omega_1,...,\omega_K) : \mathrm{Card}\{k \le K : \chi_k(\omega_k) = 1\} \ge M\} \le \epsilon_M = \sum_{M \le k \le K}\binom{K}{k}\epsilon_\star^k(1-\epsilon_\star)^{K-k}, \tag{2.162} \]

where P_{ω^K} is the distribution of ω^K = {ωk = θk(ζ^k)}_{k=1}^K induced by the distribution of hidden factors. There is nothing to prove when ϵ⋆ = 1, since in this case ϵ_M = 1. Thus, we assume from now on that ϵ⋆ < 1.

2.11.2.B Achieving the goal, step 1. Our reasoning, inspired by that used to justify Remark 2.8, is as follows. Consider a sequence of random variables ηk, 1 ≤ k ≤ K, uniformly distributed on [0, 1] and independent of each other and of ζ^K, and consider new driving factors λk = [ζk; ηk] and new observations³⁵
\[ \mu_k = [\omega_k = \theta_k(\zeta^k); \eta_k] = \Theta_k(\lambda^k = (\lambda_1,...,\lambda_k)) \tag{2.163} \]
driven by these new driving factors, and let ψk(µk = [ωk; ηk]) = χk(ωk).

³⁴ In fact, we need to write φ(ω) < 0 instead of φ(ω) ≤ 0; we replace the strict inequality with its nonstrict version in order to make our reasoning applicable to the case of ℓ = 2, where nonstrict inequalities do arise. Clearly, replacing in the definition of χk the strict inequality with the nonstrict one, we only increase the “rejection domain” of H1, so that the upper bound on the probability of this domain we are about to get automatically is valid for the true rejection domain.

³⁵ In this display, as in what follows, whenever some of the variables λ, ω, ζ, η, µ appear in the same context, it should always be understood that ζt and ηt are components of λt = [ζt; ηt], µt = [ωt; ηt] = Θt(λ^t), and ωt = θt(ζ^t). To remind us about these “hidden relations,” we sometimes write something like φ(ωk = θk(ζ^k)) to stress that we are speaking about the value of function φ at the point ωk = θk(ζ^k).

It is immediately seen that

• µk = [ωk = θk(ζ^k); ηk] is a deterministic function, Θk(λ^k), of λ^k, and the conditional distribution P_{µk|λ^{k−1}} of µk given λ^{k−1} = [ζ^{k−1}; η^{k−1}] is the product distribution P_{ωk|ζ^{k−1}} × U on Ω × [0, 1], where U is the uniform distribution on [0, 1]. In particular,
\[ \pi_k(\lambda^{k-1}) := P_{\mu_k|\lambda^{k-1}}\{\mu_k = [\omega_k;\eta_k] : \chi_k(\omega_k) = 1\} = P_{\omega_k|\zeta^{k-1}}\{\omega_k : \chi_k(\omega_k) = 1\} \le \epsilon_\star. \tag{2.164} \]

• We have
\[ P_{\lambda^K}\{\lambda^K : \mathrm{Card}\{k \le K : \psi_k(\mu_k = \Theta_k(\lambda^k)) = 1\} \ge M\} = P_{\omega^K}\{\omega^K = (\omega_1,...,\omega_K) : \mathrm{Card}\{k \le K : \chi_k(\omega_k) = 1\} \ge M\}, \tag{2.165} \]

where P_{ω^K} is as in (2.162), and Θk(·) is defined in (2.163).

Now let us define ψ⁺_k(λ^k) as follows:
• when ψk(Θk(λ^k)) = 1, or, which is the same, χk(ωk = θk(ζ^k)) = 1, we set ψ⁺_k(λ^k) = 1 as well;
• when ψk(Θk(λ^k)) = 0, or, which is the same, χk(ωk = θk(ζ^k)) = 0, we set ψ⁺_k(λ^k) = 1 whenever
\[ \eta_k \le \gamma_k(\lambda^{k-1}) := \frac{\epsilon_\star - \pi_k(\lambda^{k-1})}{1 - \pi_k(\lambda^{k-1})} \]
and ψ⁺_k(λ^k) = 0 otherwise.

Let us make the following immediate observations:
(A) Whenever λ^k is such that ψk(µk = Θk(λ^k)) = 1, we also have ψ⁺_k(λ^k) = 1;
(B) The conditional probability of the event ψ⁺_k(λ^k) = 1, given λ^{k−1} = [ζ^{k−1}; η^{k−1}], is exactly ϵ⋆.

Indeed, let P_{λk|λ^{k−1}} be the conditional distribution of λk given λ^{k−1}. Let us fix λ^{k−1}. The event E = {λk : ψ⁺_k(λ^k) = 1}, by construction, is the union of two nonoverlapping events:
\[ \begin{array}{rcl}
E_1 &=& \{\lambda_k = [\zeta_k;\eta_k] : \chi_k(\theta_k(\zeta^k)) = 1\},\\
E_2 &=& \{\lambda_k = [\zeta_k;\eta_k] : \chi_k(\theta_k(\zeta^k)) = 0,\ \eta_k \le \gamma_k(\lambda^{k-1})\}.
\end{array} \]
Taking into account that the conditional distribution of µk = [ωk = θk(ζ^k); ηk], λ^{k−1} being fixed, is the product distribution P_{ωk|ζ^{k−1}} × U, we conclude in view of (2.164) that
\[ \begin{array}{rcl}
P_{\lambda_k|\lambda^{k-1}}\{E_1\} &=& P_{\omega_k|\zeta^{k-1}}\{\omega_k : \chi_k(\omega_k) = 1\} = \pi_k(\lambda^{k-1}),\\
P_{\lambda_k|\lambda^{k-1}}\{E_2\} &=& P_{\omega_k|\zeta^{k-1}}\{\omega_k : \chi_k(\omega_k) = 0\}\,U\{\eta \le \gamma_k(\lambda^{k-1})\} = (1 - \pi_k(\lambda^{k-1}))\gamma_k(\lambda^{k-1}),
\end{array} \]
which combines with the definition of γk(·) to imply (B).
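The randomization behind observation (B) can be checked with a few lines of code. The following is a minimal sketch with hypothetical names, assuming a fixed conditional rejection probability π ≤ ϵ⋆ < 1: the rejection event is enlarged by an independent uniform η so that its probability becomes exactly ϵ⋆.

```python
import random


def gamma(pi_k, eps_star):
    """Threshold gamma_k = (eps_star - pi_k) / (1 - pi_k); requires pi_k <= eps_star < 1."""
    return (eps_star - pi_k) / (1.0 - pi_k)


def prob_psi_plus(pi_k, eps_star):
    """P(E1) + P(E2) = pi_k + (1 - pi_k) * gamma_k."""
    return pi_k + (1.0 - pi_k) * gamma(pi_k, eps_star)


def simulate_psi_plus(pi_k, eps_star, n_trials=200000, seed=0):
    """Empirical frequency of psi_plus = 1 for independent chi ~ Bernoulli(pi_k), eta ~ U[0, 1]."""
    rng = random.Random(seed)
    g = gamma(pi_k, eps_star)
    hits = 0
    for _ in range(n_trials):
        chi = rng.random() < pi_k      # original rejection indicator
        eta = rng.random()             # auxiliary uniform variable
        if chi or eta <= g:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    print(prob_psi_plus(0.03, 0.1), simulate_psi_plus(0.03, 0.1))
```

Whatever the value of π in [0, ϵ⋆], the enlarged event has probability ϵ⋆, which is what allows step 2 to compare with independent coin tosses.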


2.11.2.C Achieving the goal, step 2. By (A) combined with (2.165) we have
\[ \begin{array}{l}
P_{\omega^K}\{\omega^K : \mathrm{Card}\{k \le K : \chi_k(\omega_k) = 1\} \ge M\}\\
\quad = P_{\lambda^K}\{\lambda^K : \mathrm{Card}\{k \le K : \psi_k(\mu_k = \Theta_k(\lambda^k)) = 1\} \ge M\}\\
\quad \le P_{\lambda^K}\{\lambda^K : \mathrm{Card}\{k \le K : \psi^+_k(\lambda^k) = 1\} \ge M\},
\end{array} \]
and all we need to verify is that the first quantity in this chain is upper-bounded by the quantity ϵ_M given by (2.162). Invoking (B), it is enough to prove the following claim:

(!) Let λ^K = (λ1, ..., λK) be a random sequence with probability distribution P, let ψ⁺_k(λ^k) take values 0 and 1 only, and let for every k ≤ K the conditional probability for ψ⁺_k(λ^k) to take value 1, λ^{k−1} being fixed, be equal to ϵ⋆, for all λ^{k−1}. Then the P-probability of the event {λ^K : Card{k ≤ K : ψ⁺_k(λ^k) = 1} ≥ M} is equal to ϵ_M given by (2.162).

This is immediate. For integers k, m, 1 ≤ k ≤ K, m ≥ 0, let χ^k_m(λ^k) be the characteristic function of the event {λ^k : Card{t ≤ k : ψ⁺_t(λ^t) = 1} = m}, and let
\[ \pi^k_m = P\{\lambda^K : \chi^k_m(\lambda^k) = 1\}. \]
We have the following evident recurrence:
\[ \chi^k_m(\lambda^k) = \chi^{k-1}_m(\lambda^{k-1})(1 - \psi^+_k(\lambda^k)) + \chi^{k-1}_{m-1}(\lambda^{k-1})\psi^+_k(\lambda^k),\quad k = 1, 2, ..., \]
augmented by the “boundary conditions” χ^0_m = 0, m > 0, χ^0_0 = 1, χ^{k−1}_{−1} = 0 for all k ≥ 1. Taking expectation w.r.t. P and utilizing the fact that the conditional expectation of ψ⁺_k(λ^k) given λ^{k−1} is, identically in λ^{k−1}, equal to ϵ⋆, we get
\[ \pi^k_m = \pi^{k-1}_m(1 - \epsilon_\star) + \pi^{k-1}_{m-1}\epsilon_\star,\ k = 1,...,K;\qquad
\pi^0_m = \begin{cases} 1, & m = 0,\\ 0, & m > 0, \end{cases}\qquad \pi^{k-1}_{-1} = 0,\ k = 1, 2, ..., \]
whence
\[ \pi^k_m = \begin{cases} \binom{k}{m}\epsilon_\star^m(1 - \epsilon_\star)^{k-m}, & m \le k,\\ 0, & m > k. \end{cases} \]
Therefore,
\[ P\{\lambda^K : \mathrm{Card}\{k \le K : \psi^+_k(\lambda^k) = 1\} \ge M\} = \sum_{M \le k \le K}\pi^K_k = \epsilon_M, \]
as required.
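The recurrence above and its closed-form solution are easy to cross-check numerically. The sketch below (function names are hypothetical) runs the recurrence for π^k_m from the boundary condition π^0_0 = 1 and compares the resulting tail sum with the binomial expression for ϵ_M in (2.162).

```python
from math import comb


def count_distribution(K, eps_star):
    """Return [pi^K_0, ..., pi^K_K] via pi^k_m = pi^{k-1}_m (1 - eps) + pi^{k-1}_{m-1} eps."""
    pi = [1.0]  # k = 0: zero successes with probability 1
    for k in range(1, K + 1):
        new = [0.0] * (k + 1)
        for m in range(k + 1):
            stay = pi[m] if m <= k - 1 else 0.0   # pi^{k-1}_m, zero when m > k - 1
            up = pi[m - 1] if m >= 1 else 0.0     # pi^{k-1}_{m-1}, zero when m = 0
            new[m] = stay * (1.0 - eps_star) + up * eps_star
        pi = new
    return pi


def eps_M(K, M, eps_star):
    """Binomial tail sum_{M <= k <= K} C(K, k) eps^k (1 - eps)^(K - k), as in (2.162)."""
    return sum(comb(K, k) * eps_star ** k * (1.0 - eps_star) ** (K - k)
               for k in range(M, K + 1))


if __name__ == "__main__":
    K, M, eps_star = 15, 8, 0.2
    print(sum(count_distribution(K, eps_star)[M:]), eps_M(K, M, eps_star))
```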



2.11.3 Proof of Theorem 2.23

1°. Since O is a simple o.s., the function Φ(φ, [µ; ν]) given by (2.56) is a well-defined real-valued function on F × (M × M) which is concave in [µ; ν]; convexity of the function in φ ∈ F is evident. Since both F and M are convex sets coinciding with their relative interiors, convexity-concavity and real-valuedness of Φ on F × (M × M) imply the continuity of Φ on the indicated domain. As a consequence, Φ is a convex-concave continuous real-valued function on F × (M1 × M2). Now let
\[ \underline{\Phi}(\mu,\nu) = \inf_{\phi \in F}\Phi(\phi,[\mu;\nu]). \tag{2.166} \]
Note that Φ̲, being the infimum of a family of concave functions of [µ; ν] ∈ M × M, is concave on M × M. We claim that for µ, ν ∈ M the function
\[ \phi_{\mu,\nu}(\omega) = \tfrac{1}{2}\ln(p_\mu(\omega)/p_\nu(\omega)) \]
(which, by definition of a simple o.s., belongs to F) is an optimal solution to the right-hand side minimization problem in (2.166), so that
\[ \forall(\mu \in M_1, \nu \in M_2):\ \underline{\Phi}([\mu;\nu]) := \inf_{\phi \in F}\Phi(\phi,[\mu;\nu]) = \Phi(\phi_{\mu,\nu},[\mu;\nu]) = \ln\Big(\int_\Omega\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big). \tag{2.167} \]
Indeed, we have
\[ \exp\{-\phi_{\mu,\nu}(\omega)\}p_\mu(\omega) = \exp\{\phi_{\mu,\nu}(\omega)\}p_\nu(\omega) = g(\omega) := \sqrt{p_\mu(\omega)p_\nu(\omega)}, \tag{2.168} \]
whence Φ(φ_{µ,ν}, [µ; ν]) = ln(∫_Ω g(ω)Π(dω)). On the other hand, for φ(·) = φ_{µ,ν}(·) + δ(·) ∈ F we have
\[ \begin{array}{lrcl}
(a) & \int_\Omega g(\omega)\Pi(d\omega) &=& \int_\Omega\big[\sqrt{g(\omega)}\exp\{-\delta(\omega)/2\}\big]\big[\sqrt{g(\omega)}\exp\{\delta(\omega)/2\}\big]\Pi(d\omega)\\
& &\le& \Big(\int_\Omega g(\omega)\exp\{-\delta(\omega)\}\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega g(\omega)\exp\{\delta(\omega)\}\Pi(d\omega)\Big)^{1/2}\\
& &=& \Big(\int_\Omega\exp\{-\phi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega\exp\{\phi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)^{1/2}\quad[\text{by (2.168)}]\\
(b) & \Rightarrow\ \ln\Big(\int_\Omega g(\omega)\Pi(d\omega)\Big) &\le& \Phi(\phi,[\mu;\nu]),
\end{array} \]
and thus Φ(φ_{µ,ν}, [µ; ν]) ≤ Φ(φ, [µ; ν]) for every φ ∈ F.
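The optimality of φ_{µ,ν} can be sanity-checked on a toy discrete observation space, where the integrals become finite sums. The sketch below is an illustration only: the function names and the three-point distributions are made up, and Φ is taken in the form ½[ln Σ e^{−φ}p + ln Σ e^{φ}q] used in the computation above.

```python
import math
import random


def Phi(phi, p, q):
    """(1/2) [ln sum_w exp(-phi_w) p_w + ln sum_w exp(phi_w) q_w] on a finite space."""
    a = sum(math.exp(-f) * pw for f, pw in zip(phi, p))
    b = sum(math.exp(f) * qw for f, qw in zip(phi, q))
    return 0.5 * (math.log(a) + math.log(b))


def optimal_detector(p, q):
    """phi_w = (1/2) ln(p_w / q_w), the claimed minimizer."""
    return [0.5 * math.log(pw / qw) for pw, qw in zip(p, q)]


def log_affinity(p, q):
    """ln sum_w sqrt(p_w q_w), the claimed minimal value."""
    return math.log(sum(math.sqrt(pw * qw) for pw, qw in zip(p, q)))


if __name__ == "__main__":
    p, q = [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]   # hypothetical distributions
    phi_star = optimal_detector(p, q)
    print(Phi(phi_star, p, q), log_affinity(p, q))
    rng = random.Random(1)
    perturbed = [f + rng.uniform(-1.0, 1.0) for f in phi_star]
    print(Phi(perturbed, p, q))   # not smaller than the value at phi_star
```

Adding a constant to φ leaves Φ unchanged, so the minimizer is unique only up to a shift, in agreement with Remark 2.50.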

Remark 2.49. Note that the above reasoning did not use the fact that the minimization on the right-hand side of (2.166) is over φ ∈ F; in fact, this reasoning shows that φ_{µ,ν}(·) minimizes Φ(φ, [µ; ν]) over all functions φ for which the integrals ∫_Ω exp{−φ(ω)}pµ(ω)Π(dω) and ∫_Ω exp{φ(ω)}pν(ω)Π(dω) exist.

Remark 2.50. Note that the inequality in (b) can be equality only when the inequality in (a) is so. In other words, if φ̄ is a minimizer of Φ(φ, [µ; ν]) over φ ∈ F, setting δ(·) = φ̄(·) − φ_{µ,ν}(·), the functions √(g(ω)) exp{−δ(ω)/2} and √(g(ω)) exp{δ(ω)/2}, considered as elements of L2[Ω, Π], are proportional to each other. Since g is positive and g, δ are continuous, while the support of Π is the entire Ω, this “L2-proportionality” means that the functions in question differ by a constant factor, or, which is the same, that δ(·) is constant. Thus, the minimizers of Φ(φ, [µ; ν]) over φ ∈ F are exactly the functions of the form φ(ω) = φ_{µ,ν}(ω) + const.


2°. Let us verify that Φ(φ, [µ; ν]) has a saddle point (min in φ ∈ F, max in [µ; ν] ∈ M1 × M2). First, observe that on the domain of Φ it holds
\[ \Phi(\phi(\cdot) + a,[\mu;\nu]) = \Phi(\phi(\cdot),[\mu;\nu])\ \ \forall(a \in R, \phi \in F). \tag{2.169} \]
Let us select some µ̄ ∈ M, and let P be the measure on Ω with density p_µ̄ w.r.t. Π. For φ ∈ F, the integrals ∫_Ω e^{±φ(ω)}P(dω) are finite (since O is simple), implying that φ ∈ L1[Ω, P]; note also that P is a probabilistic measure. Let now F0 = {φ ∈ F : ∫_Ω φ(ω)P(dω) = 0}, so that F0 is a linear subspace in F, and all functions φ ∈ F can be obtained by shifts of functions from F0 by constants. Now, by (2.169), to prove the existence of a saddle point of Φ on F × (M1 × M2) is exactly the same as to prove the existence of a saddle point of Φ on F0 × (M1 × M2). Let us verify that Φ(φ, [µ; ν]) indeed has a saddle point on F0 × (M1 × M2). Because M1 × M2 is a convex compact set, and Φ is continuous on F0 × (M1 × M2) and convex-concave, invoking the Sion-Kakutani Theorem we see that all we need in order to prove the existence of a saddle point is to verify that Φ is coercive in the first argument. In other words, we have to show that for every fixed [µ; ν] ∈ M1 × M2 one has Φ(φ, [µ; ν]) → +∞ as φ ∈ F0 and ‖φ‖ → ∞ (whatever be the norm ‖·‖ on F0; recall that F0 is a finite-dimensional linear space). Setting
\[ \Theta(\phi) = \Phi(\phi,[\mu;\nu]) = \frac{1}{2}\Big[\ln\Big(\int_\Omega e^{-\phi(\omega)}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int_\Omega e^{\phi(\omega)}p_\nu(\omega)\Pi(d\omega)\Big)\Big] \]
and taking into account that Θ is convex and finite on F0, in order to prove that Θ is coercive, it suffices to verify that Θ(tφ) → ∞, t → ∞, for every nonzero φ ∈ F0, which is evident: since ∫_Ω φ(ω)P(dω) = 0 and φ is nonzero, we have ∫_Ω max[φ(ω), 0]P(dω) = ∫_Ω max[−φ(ω), 0]P(dω) > 0, whence φ > 0 and φ < 0 on sets of Π-positive measure, so that Θ(tφ) → ∞ as t → ∞ due to the fact that both pµ(·) and pν(·) are continuous and everywhere positive.

3°. Now let (φ∗(·); [µ∗; ν∗]) be a saddle point of Φ on F × (M1 × M2). Shifting, if necessary, φ∗(·) by a constant (by (2.169), this does not affect the fact that (φ∗, [µ∗; ν∗]) is a saddle point of Φ), we can assume that
\[ \varepsilon_\star := \int_\Omega\exp\{-\phi_*(\omega)\}p_{\mu_*}(\omega)\Pi(d\omega) = \int_\Omega\exp\{\phi_*(\omega)\}p_{\nu_*}(\omega)\Pi(d\omega), \]
so that the saddle point value of Φ is
\[ \Phi_* := \max_{[\mu;\nu] \in M_1 \times M_2}\min_{\phi \in F}\Phi(\phi,[\mu;\nu]) = \Phi(\phi_*,[\mu_*;\nu_*]) = \ln(\varepsilon_\star), \tag{2.170} \]
as claimed in item (i) of the theorem. Now let us prove (2.58). For µ ∈ M1, we have
\[ \begin{array}{rcl}
\ln(\varepsilon_\star) = \Phi_* &\ge& \Phi(\phi_*,[\mu;\nu_*])\\
&=& \tfrac{1}{2}\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \tfrac{1}{2}\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_{\nu_*}(\omega)\Pi(d\omega)\Big)\\
&=& \tfrac{1}{2}\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \tfrac{1}{2}\ln(\varepsilon_\star).
\end{array} \]
Hence,
\[ \ln\Big(\int_\Omega\exp\{-\phi^a_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) = \ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + a \le \ln(\varepsilon_\star) + a, \]


and (2.58.a) follows. Similarly, when ν ∈ M2, we have
\[ \begin{array}{rcl}
\ln(\varepsilon_\star) = \Phi_* &\ge& \Phi(\phi_*,[\mu_*;\nu])\\
&=& \tfrac{1}{2}\ln\Big(\int_\Omega\exp\{-\phi_*(\omega)\}p_{\mu_*}(\omega)\Pi(d\omega)\Big) + \tfrac{1}{2}\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\\
&=& \tfrac{1}{2}\ln(\varepsilon_\star) + \tfrac{1}{2}\ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big),
\end{array} \]
so that
\[ \ln\Big(\int_\Omega\exp\{\phi^a_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big) = \ln\Big(\int_\Omega\exp\{\phi_*(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big) - a \le \ln(\varepsilon_\star) - a, \]
and (2.58.b) follows.

We have proved all statements of item (i), except for the claim that the φ∗, ε⋆ just defined form an optimal solution to (2.59). Note that by (2.58) as applied with a = 0, the pair in question is feasible for (2.59). Assuming that the problem admits a feasible solution (φ̄, ϵ) with ϵ < ε⋆, let us lead this assumption to a contradiction. Note that φ̄ should be such that
\[ \int_\Omega e^{-\bar\phi(\omega)}p_{\mu_*}(\omega)\Pi(d\omega) < \varepsilon_\star\ \ \&\ \ \int_\Omega e^{\bar\phi(\omega)}p_{\nu_*}(\omega)\Pi(d\omega) < \varepsilon_\star, \]
and consequently Φ(φ̄, [µ∗; ν∗]) < ln(ε⋆). On the other hand, Remark 2.49 says that Φ(φ̄, [µ∗; ν∗]) cannot be less than min_{φ∈F} Φ(φ, [µ∗; ν∗]), and the latter quantity is Φ(φ∗, [µ∗; ν∗]) because (φ∗, [µ∗; ν∗]) is a saddle point of Φ on F × (M1 × M2). Thus, assuming that the optimal value in (2.59) is < ε⋆, we conclude that Φ(φ∗, [µ∗; ν∗]) ≤ Φ(φ̄, [µ∗; ν∗]) < ln(ε⋆), contradicting (2.170). Item (i) of Theorem 2.23 is proved.

4°. Let us prove item (ii) of Theorem 2.23. Relation (2.60) and concavity of the right-hand side of this relation in [µ; ν] were already proved; moreover, these relations were proved in the range M × M of [µ; ν]. Since this range coincides with its relative interior, the real-valued concave function Φ̲ of (2.166) is continuous on M × M and thus is continuous on M1 × M2. Next, let φ∗ be the φ-component of a saddle point of Φ on F × (M1 × M2) (we already know that such a saddle point exists). By Proposition 2.21, the [µ; ν]-components of saddle points of Φ on F × (M1 × M2) are exactly the maximizers of Φ̲ on M1 × M2; let [µ∗; ν∗] be such a maximizer. By the same proposition, (φ∗, [µ∗; ν∗]) is a saddle point of Φ, whence Φ(φ, [µ∗; ν∗]) attains its minimum over φ ∈ F at φ = φ∗. We have also seen that Φ(φ, [µ∗; ν∗]) attains its minimum over φ ∈ F at φ = φ_{µ∗,ν∗}. These observations combine with Remark 2.50 to imply that φ∗ and φ_{µ∗,ν∗} differ by a constant, which, in view of (2.169), means that (φ_{µ∗,ν∗}, [µ∗; ν∗]) is a saddle point of Φ along with (φ∗, [µ∗; ν∗]). (ii) is proved.

5°. It remains to prove item (iii) of Theorem 2.23. In the notation from (iii), simple hypotheses (A) and (B) can be decided with the total risk ≤ 2ϵ, and therefore, by Proposition 2.2,
\[ 2\bar\epsilon := \int_\Omega\min[p(\omega), q(\omega)]\Pi(d\omega) \le 2\epsilon. \]

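On the other hand, we have seen that the saddle point value of Φ is ln(ε⋆); since [µ∗; ν∗] is a component of a saddle point of Φ, it follows that min_{φ∈F} Φ(φ, [µ∗; ν∗]) = ln(ε⋆). The left-hand side in this equality, by item 1°, is Φ(φ_{µ∗,ν∗}, [µ∗; ν∗]), and we arrive at
\[ \ln(\varepsilon_\star) = \Phi\big(\tfrac{1}{2}\ln(p_{\mu_*}(\cdot)/p_{\nu_*}(\cdot)),[\mu_*;\nu_*]\big) = \ln\Big(\int_\Omega\sqrt{p_{\mu_*}(\omega)p_{\nu_*}(\omega)}\,\Pi(d\omega)\Big), \]
so that
\[ \varepsilon_\star = \int_\Omega\sqrt{p_{\mu_*}(\omega)p_{\nu_*}(\omega)}\,\Pi(d\omega) = \int_\Omega\sqrt{p(\omega)q(\omega)}\,\Pi(d\omega). \]
We now have
\[ \begin{array}{rcl}
\varepsilon_\star &=& \int_\Omega\sqrt{p(\omega)q(\omega)}\,\Pi(d\omega) = \int_\Omega\sqrt{\min[p(\omega),q(\omega)]}\sqrt{\max[p(\omega),q(\omega)]}\,\Pi(d\omega)\\
&\le& \Big(\int_\Omega\min[p(\omega),q(\omega)]\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega\max[p(\omega),q(\omega)]\Pi(d\omega)\Big)^{1/2}\\
&=& \Big(\int_\Omega\min[p(\omega),q(\omega)]\Pi(d\omega)\Big)^{1/2}\Big(\int_\Omega(p(\omega) + q(\omega) - \min[p(\omega),q(\omega)])\Pi(d\omega)\Big)^{1/2}\\
&=& \sqrt{2\bar\epsilon(2 - 2\bar\epsilon)} \le 2\sqrt{(1-\epsilon)\epsilon},
\end{array} \]
where the concluding inequality is due to ϵ̄ ≤ ϵ ≤ 1/2. (iii) is proved, and the proof of Theorem 2.23 is complete. ✷

2.11.4 Proof of Proposition 2.37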

All we need is to verify (2.107) and to check that the right-hand side function in this relation is convex. The latter is evident, since φX(h) + φX(−h) ≥ 2φX(0) = 0 and φX(h) + φX(−h) is convex. To verify (2.107), let us fix P ∈ P[X] and h ∈ R^d and set ν = h^T e[P], so that ν is the expectation of h^T ω with ω ∼ P. Note that for ω ∼ P we have h^T ω ∈ [−φX(−h), φX(h)] with P-probability 1, whence −φX(−h) ≤ ν ≤ φX(h). In particular, when φX(h) + φX(−h) = 0, h^T ω = ν with P-probability 1, so that (2.107) definitely holds true. Now let
\[ \eta := \tfrac{1}{2}[\phi_X(h) + \phi_X(-h)] > 0, \]
and let
\[ a = \tfrac{1}{2}[\phi_X(h) - \phi_X(-h)],\quad \beta = (\nu - a)/\eta. \]
Denoting by Ph the distribution of h^T ω induced by the distribution P of ω and noting that this distribution is supported on [−φX(−h), φX(h)] = [a − η, a + η] and has expectation ν, we get β ∈ [−1, 1] and
\[ \gamma := \int\exp\{h^T\omega\}P(d\omega) = \int_{a-\eta}^{a+\eta}[e^s - \lambda(s - \nu)]P_h(ds) \]
for all λ ∈ R. Hence,
\[ \begin{array}{rcl}
\ln(\gamma) &\le& \inf_\lambda\ln\Big(\max_{a-\eta \le s \le a+\eta}[e^s - \lambda(s - \nu)]\Big)\\
&=& a + \inf_\rho\ln\Big(\max_{-\eta \le t \le \eta}[e^t - \rho(t - [\nu - a])]\Big)\quad[\text{substituting } \lambda = e^a\rho,\ s = a + t]\\
&=& a + \inf_\rho\ln\Big(\max_{-\eta \le t \le \eta}[e^t - \rho(t - \eta\beta)]\Big) \le a + \ln\Big(\max_{-\eta \le t \le \eta}[e^t - \bar\rho(t - \eta\beta)]\Big)
\end{array} \]
with ρ̄ = (2η)^{−1}(e^η − e^{−η}). The function g(t) = e^t − ρ̄(t − ηβ) is convex on [−η, η], and g(−η) = g(η) = cosh(η) + β sinh(η), which combines with the above computation to yield the relation
\[ \ln(\gamma) \le a + \ln(\cosh(\eta) + \beta\sinh(\eta)). \tag{2.171} \]

Thus, all we need to verify is that
\[ \forall(\eta > 0, \beta \in [-1, 1]):\ \beta\eta + \tfrac{1}{2}\eta^2 - \ln(\cosh(\eta) + \beta\sinh(\eta)) \ge 0. \tag{2.172} \]
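Before turning to the analytic verification, inequality (2.172) can be sanity-checked numerically. The following sketch uses hypothetical function names and a finite grid (so it is evidence, not a proof); it evaluates the left-hand side of (2.172) over η ∈ (0, 5], β ∈ [−1, 1] and also evaluates the function r(η) = ½η² + 1 − η coth(η) − ln(sinh(η)/η), which is the minimum of the left-hand side over β.

```python
from math import cosh, log, sinh


def lhs_2172(eta, beta):
    """beta*eta + eta^2/2 - ln(cosh(eta) + beta*sinh(eta))."""
    return beta * eta + 0.5 * eta ** 2 - log(cosh(eta) + beta * sinh(eta))


def r(eta):
    """eta^2/2 + 1 - eta*coth(eta) - ln(sinh(eta)/eta)."""
    return 0.5 * eta ** 2 + 1.0 - eta * cosh(eta) / sinh(eta) - log(sinh(eta) / eta)


def min_on_grid(n_eta=200, n_beta=201, eta_max=5.0):
    """Smallest value of the left-hand side of (2.172) over a rectangular grid."""
    worst = float("inf")
    for i in range(1, n_eta + 1):
        eta = eta_max * i / n_eta
        for j in range(n_beta):
            beta = -1.0 + 2.0 * j / (n_beta - 1)
            worst = min(worst, lhs_2172(eta, beta))
    return worst


if __name__ == "__main__":
    print(min_on_grid())
    print([r(eta) for eta in (0.1, 1.0, 3.0)])
```

The grid is capped at η = 5 only to avoid cancellation in cosh(η) − sinh(η) near β = −1; at β = ±1 the left-hand side equals η²/2 exactly.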

Indeed, if (2.172) holds true (2.171) implies that ln(γ) ≤ a + βη + 12 η 2 = ν + 21 η 2 , which, recalling what γ, ν, and η are, is exactly what we want to prove. Verification of (2.172) is as follows. The left-hand side in (2.172) is convex in β for β > − cosh(η) sinh(η) containing, due to η > 0, the range of β in (2.172). Furthermore, the minimum of the left-hand side of (2.172) over β > − coth(η) is attained at cosh(η) and is equal to β = sinh(η)−η η sinh(η) r(η) = 12 η 2 + 1 − η coth(η) − ln(sinh(η)/η). All we need to prove is that the latter quantity is nonnegative whenever η > 0. We have r′ (η) = η − coth(η) − η(1 − coth2 (η)) − coth(η) + η −1 = (η coth(η) − 1)2 η −1 ≥ 0, and since r(+0) = 0, we get r(η) ≥ 0 when η > 0. 2.11.5



Proof of Proposition 2.43

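2.11.5.A Proof of Proposition 2.43.i

1°. Let $b=[0;...;0;1]\in\mathbf{R}^{n+1}$, so that $B=\begin{bmatrix}A\\ b^T\end{bmatrix}$, and let $A(u)=A[u;1]$. For any $u\in\mathbf{R}^n$, $h\in\mathbf{R}^d$, $\Theta\in\mathbf{S}^d_+$, and $H\in\mathbf{S}^d$ such that $-I\prec\Theta^{1/2}H\Theta^{1/2}\prec I$ we have

$$\begin{array}{rcl}
\Psi(h,H;u,\Theta)&:=&\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{N}(A(u),\Theta)}\big\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\big\}\Big)\\
&=&\ln\Big(\mathbf{E}_{\xi\sim\mathcal{N}(0,I)}\big\{\exp\{h^T[A(u)+\Theta^{1/2}\xi]+\tfrac12[A(u)+\Theta^{1/2}\xi]^TH[A(u)+\Theta^{1/2}\xi]\}\big\}\Big)\\
&=&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+h^TA(u)+\tfrac12A(u)^THA(u)\\
&&+\tfrac12[HA(u)+h]^T\Theta^{1/2}[I-\Theta^{1/2}H\Theta^{1/2}]^{-1}\Theta^{1/2}[HA(u)+h]\\
&=&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+\tfrac12[u;1]^T\big[bh^TA+A^Thb^T+A^THA\big][u;1]\\
&&+\tfrac12[u;1]^T\big[B^T[H,h]^T\Theta^{1/2}[I-\Theta^{1/2}H\Theta^{1/2}]^{-1}\Theta^{1/2}[H,h]B\big][u;1]
\end{array}\tag{2.173}$$

due to $h^TA(u)=[u;1]^Tbh^TA[u;1]=[u;1]^TA^Thb^T[u;1]$ and $HA(u)+h=[H,h]B[u;1]$.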

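Observe that when $(h,H)\in\mathcal{H}^\gamma$, we have

$$\Theta^{1/2}[I-\Theta^{1/2}H\Theta^{1/2}]^{-1}\Theta^{1/2}=[\Theta^{-1}-H]^{-1}\preceq[\Theta_*^{-1}-H]^{-1},$$

so that (2.173) implies that for all $u\in\mathbf{R}^n$, $\Theta\in\mathcal{V}$, and $(h,H)\in\mathcal{H}^\gamma$,

$$\begin{array}{rcl}
\Psi(h,H;u,\Theta)&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})\\
&&+\tfrac12[u;1]^T\underbrace{\big[bh^TA+A^Thb^T+A^THA+B^T[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]B\big]}_{Q[H,h]}[u;1]\\
&=&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+\tfrac12\mathrm{Tr}(Q[H,h]Z(u))\\
&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})+\Gamma_{\mathcal{Z}}(h,H),\qquad \Gamma_{\mathcal{Z}}(h,H)=\tfrac12\phi_{\mathcal{Z}}(Q[H,h])
\end{array}\tag{2.174}$$

(we have taken into account that $Z(u)\in\mathcal{Z}$ when $u\in U$, the premise of the proposition, and therefore $\mathrm{Tr}(Q[H,h]Z(u))\le\phi_{\mathcal{Z}}(Q[H,h])$). Note that the above function $Q[H,h]$ is nothing but

$$Q[H,h]=B^T\left(\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\right)B.\tag{2.175}$$

2°. We need the following:

Lemma 2.51. Let $\Theta_*$ be a $d\times d$ symmetric positive definite matrix, let $\delta\in[0,2]$, and let $\mathcal{V}$ be a closed convex subset of $\mathbf{S}^d_+$ such that

$$\Theta\in\mathcal{V}\ \Rightarrow\ \{\Theta\preceq\Theta_*\}\ \&\ \{\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\}\tag{2.176}$$

(cf. (2.129)). Let also $\mathcal{H}^o:=\{H\in\mathbf{S}^d:-\Theta_*^{-1}\prec H\prec\Theta_*^{-1}\}$. Then

$$\begin{array}{l}
\forall(H,\Theta)\in\mathcal{H}^o\times\mathcal{V}:\\
\quad G(H;\Theta):=-\tfrac12\ln\mathrm{Det}(I-\Theta^{1/2}H\Theta^{1/2})\\
\quad\le G^+(H;\Theta):=-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta-\Theta_*]H)+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2,
\end{array}\tag{2.177}$$

where $\|\cdot\|$ is the spectral, and $\|\cdot\|_F$ the Frobenius norm of a matrix. In addition, $G^+(H,\Theta)$ is a continuous function on $\mathcal{H}^o\times\mathcal{V}$ which is convex in $H\in\mathcal{H}^o$ and concave (in fact, affine) in $\Theta\in\mathcal{V}$.

Proof. Let us set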

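$$d(H)=\|\Theta_*^{1/2}H\Theta_*^{1/2}\|,$$

so that $d(H)<1$ for $H\in\mathcal{H}^o$. For $H\in\mathcal{H}^o$ and $\Theta\in\mathcal{V}$ fixed we have

$$\begin{array}{rcl}
\|\Theta^{1/2}H\Theta^{1/2}\|&=&\|[\Theta^{1/2}\Theta_*^{-1/2}][\Theta_*^{1/2}H\Theta_*^{1/2}][\Theta^{1/2}\Theta_*^{-1/2}]^T\|\\
&\le&\|\Theta^{1/2}\Theta_*^{-1/2}\|^2\,\|\Theta_*^{1/2}H\Theta_*^{1/2}\|\ \le\ \|\Theta_*^{1/2}H\Theta_*^{1/2}\|=d(H)
\end{array}\tag{2.178}$$

(we have used the fact that $0\preceq\Theta\preceq\Theta_*$ implies $\|\Theta^{1/2}\Theta_*^{-1/2}\|\le1$). Noting that $\|AB\|_F\le\|A\|\|B\|_F$, a computation completely similar to the one in (2.178) yields

$$\|\Theta^{1/2}H\Theta^{1/2}\|_F\le\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F=:D(H).\tag{2.179}$$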

Besides this, setting $F(X)=-\ln\mathrm{Det}(X):\mathrm{int}\,\mathbf{S}^d_+\to\mathbf{R}$ and equipping $\mathbf{S}^d$ with the Frobenius inner product, we have $\nabla F(X)=-X^{-1}$, so that with $R_0=\Theta_*^{1/2}H\Theta_*^{1/2}$, $R_1=\Theta^{1/2}H\Theta^{1/2}$, and $\Delta=R_1-R_0$, we have for properly selected $\lambda\in(0,1)$ and $R_\lambda=\lambda R_0+(1-\lambda)R_1$:

$$\begin{array}{rcl}
F(I-R_1)&=&F(I-R_0-\Delta)=F(I-R_0)+\langle\nabla F(I-R_\lambda),-\Delta\rangle\\
&=&F(I-R_0)+\langle(I-R_\lambda)^{-1},\Delta\rangle\\
&=&F(I-R_0)+\langle I,\Delta\rangle+\langle(I-R_\lambda)^{-1}-I,\Delta\rangle.
\end{array}$$

We conclude that

$$F(I-R_1)\le F(I-R_0)+\mathrm{Tr}(\Delta)+\|I-(I-R_\lambda)^{-1}\|_F\|\Delta\|_F.\tag{2.180}$$

Denoting by $\mu_i$ the eigenvalues of $R_\lambda$ and noting that $\|R_\lambda\|\le\max[\|R_0\|,\|R_1\|]=d(H)$ (see (2.178)), we have $|\mu_i|\le d(H)$, and therefore the eigenvalues $\nu_i=1-\frac{1}{1-\mu_i}=-\frac{\mu_i}{1-\mu_i}$ of $I-(I-R_\lambda)^{-1}$ satisfy $|\nu_i|\le|\mu_i|/(1-\mu_i)\le|\mu_i|/(1-d(H))$, whence $\|I-(I-R_\lambda)^{-1}\|_F\le\|R_\lambda\|_F/(1-d(H))$. Noting that $\|R_\lambda\|_F\le\max[\|R_0\|_F,\|R_1\|_F]\le D(H)$ (see (2.179)), we conclude that $\|I-(I-R_\lambda)^{-1}\|_F\le D(H)/(1-d(H))$, so that (2.180) yields

$$F(I-R_1)\le F(I-R_0)+\mathrm{Tr}(\Delta)+D(H)\|\Delta\|_F/(1-d(H)).\tag{2.181}$$

Further, by (2.129) the matrix $D=\Theta^{1/2}\Theta_*^{-1/2}-I$ satisfies $\|D\|\le\delta$, whence

$$\Delta=\underbrace{\Theta^{1/2}H\Theta^{1/2}}_{R_1}-\underbrace{\Theta_*^{1/2}H\Theta_*^{1/2}}_{R_0}=(I+D)R_0(I+D^T)-R_0=DR_0+R_0D^T+DR_0D^T.$$

Consequently,

$$\begin{array}{rcl}
\|\Delta\|_F&\le&\|DR_0\|_F+\|R_0D^T\|_F+\|DR_0D^T\|_F\\
&\le&[2\|D\|+\|D\|^2]\|R_0\|_F\\
&\le&\delta(2+\delta)\|R_0\|_F=\delta(2+\delta)D(H).
\end{array}$$

This combines with (2.181) and the relation

$$\mathrm{Tr}(\Delta)=\mathrm{Tr}(\Theta^{1/2}H\Theta^{1/2}-\Theta_*^{1/2}H\Theta_*^{1/2})=\mathrm{Tr}([\Theta-\Theta_*]H)$$

to yield

$$\begin{array}{rcl}
F(I-R_1)&\le&F(I-R_0)+\mathrm{Tr}([\Theta-\Theta_*]H)+\dfrac{\delta(2+\delta)}{1-d(H)}D^2(H)\\
&=&F(I-R_0)+\mathrm{Tr}([\Theta-\Theta_*]H)+\dfrac{\delta(2+\delta)}{1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|}\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2,
\end{array}$$

and we arrive at (2.177) (recall that $G=\frac12F$). It remains to prove that $G^+(H;\Theta)$ is convex-concave and continuous on $\mathcal{H}^o\times\mathcal{V}$. The only component of this claim which is not completely evident is convexity of the function in $H\in\mathcal{H}^o$. To see that it is the case, note that $\ln\mathrm{Det}(S)$ is concave on the interior of the semidefinite cone, the function $f(u,v)=\frac{u^2}{1-v}$ is convex and nondecreasing in $u,v$ in the convex domain $\Pi=\{(u,v):u\ge0,\ v<1\}$, and the function

$$\frac{\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2}{1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|}$$

is obtained from $f$ by the convex substitution of variables $H\mapsto(\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F,\ \|\Theta_*^{1/2}H\Theta_*^{1/2}\|)$ mapping $\mathcal{H}^o$ into $\Pi$. ✷
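The bound (2.177) can be sanity-checked numerically. The sketch below is only an illustration, not part of the proof: it draws a random positive definite $\Theta_*$, builds one $\Theta\preceq\Theta_*$, takes $\mathcal{V}=\{\Theta\}$ with $\delta=\|\Theta^{1/2}\Theta_*^{-1/2}-I\|$, picks $H\in\mathcal{H}^o$ with $\|\Theta_*^{1/2}H\Theta_*^{1/2}\|=0.9$, and returns $G^+(H;\Theta)-G(H;\Theta)$. The function names, the sampling scheme, and the value 0.9 are assumptions made for this example.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive semidefinite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def logdet(M):
    """ln Det(M) for a matrix with positive determinant."""
    sign, val = np.linalg.slogdet(M)
    assert sign > 0
    return val

def gap_2_177(d, rng):
    """Return G_plus(H; Theta) - G(H; Theta) for one random instance."""
    eye = np.eye(d)
    # Random positive definite Theta_* and its square roots.
    A = rng.standard_normal((d, d))
    theta_star = A @ A.T + eye
    ts_half = sym_sqrt(theta_star)
    ts_inv_half = np.linalg.inv(ts_half)
    # Theta = Theta_*^{1/2} M Theta_*^{1/2} with 0.5 I <= M <= I, hence Theta <= Theta_*.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    M = (Q * rng.uniform(0.5, 1.0, size=d)) @ Q.T
    theta = ts_half @ M @ ts_half
    theta = 0.5 * (theta + theta.T)
    t_half = sym_sqrt(theta)
    # delta for the singleton set V = {Theta}, as in (2.176).
    delta = np.linalg.norm(t_half @ ts_inv_half - eye, 2)
    # H = Theta_*^{-1/2} R Theta_*^{-1/2} with symmetric R, spectral norm 0.9 < 1.
    S = rng.standard_normal((d, d))
    R = 0.5 * (S + S.T)
    R *= 0.9 / np.linalg.norm(R, 2)
    H = ts_inv_half @ R @ ts_inv_half
    G = -0.5 * logdet(eye - t_half @ H @ t_half)
    G_plus = (-0.5 * logdet(eye - R)
              + 0.5 * np.trace((theta - theta_star) @ H)
              + delta * (2 + delta) / (2 * (1 - np.linalg.norm(R, 2)))
              * np.linalg.norm(R, "fro") ** 2)
    return G_plus - G

rng = np.random.default_rng(0)
gaps = [gap_2_177(4, rng) for _ in range(200)]
print("smallest gap over 200 random instances:", min(gaps))
```

A nonnegative smallest gap is what the lemma predicts; the check says nothing about how tight the bound is.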

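3°. Combining (2.177), (2.174), and (2.130) and the origin of $\Psi$ (see (2.173)), we arrive at

$$\forall\big((u,\Theta)\in U\times\mathcal{V},\ (h,H)\in\mathcal{H}^\gamma=\mathcal{H}\big):\quad \ln\Big(\mathbf{E}_{\zeta\sim\mathcal{N}(A[u;1],\Theta)}\big\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\big\}\Big)\le\Phi_{A,\mathcal{Z}}(h,H;\Theta),$$

as claimed in (2.133).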

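4°. Now let us check that $\Phi_{A,\mathcal{Z}}(h,H;\Theta):\mathcal{H}\times\mathcal{V}\to\mathbf{R}$ is continuous and convex-concave. Recalling that the function $G^+(H;\Theta)$ from (2.177) is convex-concave and continuous on $\mathcal{H}^o\times\mathcal{V}$, all we need to verify is that $\Gamma_{\mathcal{Z}}(h,H)$ is convex and continuous on $\mathcal{H}$. Recalling that $\mathcal{Z}$ is a nonempty compact set, the function $\phi_{\mathcal{Z}}(\cdot):\mathbf{S}^{n+1}\to\mathbf{R}$ is continuous, implying the continuity of $\Gamma_{\mathcal{Z}}(h,H)=\frac12\phi_{\mathcal{Z}}(Q[H,h])$ on $\mathcal{H}=\mathcal{H}^\gamma$ ($Q[H,h]$ is defined in (2.175)). To prove convexity of $\Gamma_{\mathcal{Z}}$, note that $\mathcal{Z}$ is contained in $\mathbf{S}^{n+1}_+$, implying that $\phi_{\mathcal{Z}}(\cdot)$ is convex and $\succeq$-monotone. On the other hand, by the Schur Complement Lemma, we have

$$\begin{array}{rcl}
S&:=&\{(h,H,G):\ G\succeq Q[H,h],\ (h,H)\in\mathcal{H}^\gamma\}\\
&=&\left\{(h,H,G):\ \begin{bmatrix}G-[bh^TA+A^Thb^T+A^THA]&B^T[H,h]^T\\ [H,h]B&\Theta_*^{-1}-H\end{bmatrix}\succeq0,\ (h,H)\in\mathcal{H}^\gamma\right\},
\end{array}$$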

implying that $S$ is convex. Since $\phi_{\mathcal{Z}}(\cdot)$ is $\succeq$-monotone, we have

$$\{(h,H,\tau):(h,H)\in\mathcal{H}^\gamma,\ \tau\ge\Gamma_{\mathcal{Z}}(h,H)\}=\{(h,H,\tau):\exists G:\ G\succeq Q[H,h],\ 2\tau\ge\phi_{\mathcal{Z}}(G),\ (h,H)\in\mathcal{H}^\gamma\},$$

and we see that the epigraph of $\Gamma_{\mathcal{Z}}$ is convex (since the set $S$ and the epigraph of $\phi_{\mathcal{Z}}$ are so), as claimed.

5°. It remains to prove that $\Phi_{A,\mathcal{Z}}$ is coercive in $H,h$. Let $\Theta\in\mathcal{V}$ and $(h_i,H_i)\in\mathcal{H}^\gamma$ with $\|(h_i,H_i)\|\to\infty$ as $i\to\infty$, and let us prove that $\Phi_{A,\mathcal{Z}}(h_i,H_i;\Theta)\to\infty$. Looking at the expression for $\Phi_{A,\mathcal{Z}}(h_i,H_i;\Theta)$, it is immediately seen that all terms in this expression, except for the terms coming from $\phi_{\mathcal{Z}}(\cdot)$, remain bounded as $i$ grows, so that all we need to verify is that the $\phi_{\mathcal{Z}}(\cdot)$-term goes to $\infty$ as $i\to\infty$. Observe that $H_i$ are uniformly bounded due to $(h_i,H_i)\in\mathcal{H}^\gamma$, implying that $\|h_i\|_2\to\infty$ as $i\to\infty$. Denoting $e=[0;...;0;1]\in\mathbf{R}^{d+1}$ and, as before, $b=[0;...;0;1]\in\mathbf{R}^{n+1}$, note that, by construction, $B^Te=b$. Now let $W\in\mathcal{Z}$, so that $W_{n+1,n+1}=1$. Taking into account that the matrices $[\Theta_*^{-1}-H_i]^{-1}$ satisfy $\alpha I_d\preceq[\Theta_*^{-1}-H_i]^{-1}\preceq\beta I_d$ for some positive $\alpha,\beta$ due to $(h_i,H_i)\in\mathcal{H}^\gamma$, observe that

$$\underbrace{\begin{bmatrix}H_i&h_i\\ h_i^T&\end{bmatrix}+[H_i,h_i]^T[\Theta_*^{-1}-H_i]^{-1}[H_i,h_i]}_{Q_i}=\underbrace{h_i^T[\Theta_*^{-1}-H_i]^{-1}h_i}_{\alpha_i\|h_i\|_2^2}\,ee^T+R_i,$$

where $\alpha_i\ge\alpha>0$ and $\|R_i\|_F\le C(1+\|h_i\|_2)$; by (2.175), $Q[H_i,h_i]=B^TQ_iB$. As a result,

$$\begin{array}{rcl}
\phi_{\mathcal{Z}}(B^TQ_iB)&\ge&\mathrm{Tr}(WB^TQ_iB)=\mathrm{Tr}(WB^T[\alpha_i\|h_i\|_2^2ee^T+R_i]B)\\
&\ge&\alpha_i\|h_i\|_2^2\underbrace{\mathrm{Tr}(Wbb^T)}_{=W_{n+1,n+1}=1}-\|BWB^T\|_F\|R_i\|_F\\
&\ge&\alpha\|h_i\|_2^2-C(1+\|h_i\|_2)\|BWB^T\|_F,
\end{array}$$

and the concluding quantity tends to $\infty$ as $i\to\infty$ due to $\|h_i\|_2\to\infty$, $i\to\infty$. Part (i) is proved.

2.11.5.B Proof of Proposition 2.43.ii

By (i) the function $\Phi(h,H;\Theta_1,\Theta_2)$, as defined in (2.134), is continuous and convex-concave on the domain $\underbrace{(\mathcal{H}_1\cap\mathcal{H}_2)}_{\mathcal{H}}\times\underbrace{(\mathcal{V}_1\times\mathcal{V}_2)}_{\mathcal{V}}$ and is coercive in $(h,H)$, $\mathcal{H}$ and $\mathcal{V}$ are closed and convex, and $\mathcal{V}$ in addition is compact, so that saddle point problem (2.134) is solvable (Sion-Kakutani Theorem, a.k.a. Theorem 2.22). Now let $(h_*,H_*;\Theta_1^*,\Theta_2^*)$ be a saddle point. To prove (2.136), let $P\in\mathcal{G}_1$, that is, $P=\mathcal{N}(A_1[u;1],\Theta_1)$ for some $\Theta_1\in\mathcal{V}_1$ and some $u$ with $[u;1][u;1]^T\in\mathcal{Z}_1$. Applying (2.133) to the first collection of data, with $a$ given by (2.135), we get the first $\le$ in the following chain:

$$\ln\Big(\int e^{-\frac12\omega^TH_*\omega-\omega^Th_*-a}P(d\omega)\Big)\le\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1)-a\underbrace{\le}_{(a)}\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)-a\underbrace{=}_{(b)}\mathcal{SV},$$

where (a) is due to the fact that $\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1)+\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2)$ attains its maximum over $(\Theta_1,\Theta_2)\in\mathcal{V}_1\times\mathcal{V}_2$ at the point $(\Theta_1^*,\Theta_2^*)$, and (b) is due to the origin of $a$ and the relation $\mathcal{SV}=\frac12[\Phi_{A_1,\mathcal{Z}_1}(-h_*,-H_*;\Theta_1^*)+\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)]$. The bound in (2.136.a) is proved. Similarly, let $P\in\mathcal{G}_2$, that is, $P=\mathcal{N}(A_2[u;1],\Theta_2)$ for some $\Theta_2\in\mathcal{V}_2$ and some $u$ with $[u;1][u;1]^T\in\mathcal{Z}_2$. Applying (2.133) to the second collection of data, with the same $a$ as above, we get the first $\le$ in the following chain:

$$\ln\Big(\int e^{\frac12\omega^TH_*\omega+\omega^Th_*+a}P(d\omega)\Big)\le\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2)+a\underbrace{\le}_{(a)}\Phi_{A_2,\mathcal{Z}_2}(h_*,H_*;\Theta_2^*)+a\underbrace{=}_{(b)}\mathcal{SV},$$

with exactly the same justification of (a) and (b) as above. The bound in (2.136.b) is proved. ✷

2.11.6 Proof of Proposition 2.46

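2.11.6.A Preliminaries

We start with the following result:

Lemma 2.52. Let $\bar\Theta$ be a positive definite $d\times d$ matrix, let $u\mapsto C(u)=A[u;1]$ be an affine mapping from $\mathbf{R}^n$ into $\mathbf{R}^d$, and let

$$B=\begin{bmatrix}A\\ 0,...,0,1\end{bmatrix}.$$

Finally, let $h\in\mathbf{R}^d$, $H\in\mathbf{S}^d$ and $P\in\mathbf{S}^d$ satisfy the relations

$$0\preceq P\prec I_d\ \ \&\ \ P\succeq\bar\Theta^{1/2}H\bar\Theta^{1/2}.\tag{2.182}$$

Then, for every $u\in\mathbf{R}^n$ and every $\zeta\sim\mathcal{SG}(C(u),\bar\Theta)$ it holds

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_\zeta\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&\le&-\tfrac12\ln\mathrm{Det}(I-P)\\
&&+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T\bar\Theta^{1/2}[I-P]^{-1}\bar\Theta^{1/2}[H,h]\right]B[u;1].
\end{array}\tag{2.183}$$

Equivalently (set $G=\bar\Theta^{-1/2}P\bar\Theta^{-1/2}$): whenever $h\in\mathbf{R}^d$, $H\in\mathbf{S}^d$ and $G\in\mathbf{S}^d$ satisfy the relations

$$0\preceq G\prec\bar\Theta^{-1}\ \ \&\ \ G\succeq H,\tag{2.184}$$

one has for every $u\in\mathbf{R}^n$ and every $\zeta\sim\mathcal{SG}(C(u),\bar\Theta)$:

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_\zeta\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&\le&-\tfrac12\ln\mathrm{Det}(I-\bar\Theta^{1/2}G\bar\Theta^{1/2})\\
&&+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\bar\Theta^{-1}-G]^{-1}[H,h]\right]B[u;1].
\end{array}\tag{2.185}$$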

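Proof. 1°. Let us start with the following observation:

Lemma 2.53. Let $\Theta\in\mathbf{S}^d_+$ and $S\in\mathbf{R}^{d\times d}$ be such that $S\Theta S^T\prec I_d$. Then for every $\nu\in\mathbf{R}^d$ one has

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_{\xi\sim\mathcal{SG}(0,\Theta)}\big\{e^{\nu^TS\xi+\frac12\xi^TS^TS\xi}\big\}\Big)&\le&\ln\Big(\mathbf{E}_{\eta\sim\mathcal{N}(\nu,I_d)}\big\{e^{\frac12\eta^TS\Theta S^T\eta}\big\}\Big)\\
&=&-\tfrac12\ln\mathrm{Det}(I_d-S\Theta S^T)+\tfrac12\nu^TS\Theta S^T(I_d-S\Theta S^T)^{-1}\nu.
\end{array}\tag{2.186}$$

Indeed, let $\xi\sim\mathcal{SG}(0,\Theta)$ and $\eta\sim\mathcal{N}(\nu,I_d)$ be independent. We have

$$\mathbf{E}_\xi\big\{e^{\nu^TS\xi+\frac12\xi^TS^TS\xi}\big\}\underbrace{=}_{a}\mathbf{E}_\xi\big\{\mathbf{E}_\eta\big\{e^{[S\xi]^T\eta}\big\}\big\}=\mathbf{E}_\eta\big\{\mathbf{E}_\xi\big\{e^{[S^T\eta]^T\xi}\big\}\big\}\underbrace{\le}_{b}\mathbf{E}_\eta\big\{e^{\frac12\eta^TS\Theta S^T\eta}\big\},$$

where $a$ is due to $\eta\sim\mathcal{N}(\nu,I_d)$ and $b$ is due to $\xi\sim\mathcal{SG}(0,\Theta)$. We have verified the inequality in (2.186); the equality in (2.186) is given by direct computation. ✷

2°. Now, in the situation described in Lemma 2.52, by continuity it suffices to prove (2.183) in the case when $P\succeq0$ in (2.182) is replaced with $P\succ0$. Under the premise of the lemma, given $u\in\mathbf{R}^n$ and assuming $P\succ0$, let us set $\mu=C(u)=A[u;1]$, $\nu=P^{-1/2}\bar\Theta^{1/2}[H\mu+h]$, and $S=P^{1/2}\bar\Theta^{-1/2}$, so that $S\bar\Theta S^T=P\prec I_d$, and let $G=\bar\Theta^{-1/2}P\bar\Theta^{-1/2}$, so that $G\succeq H$. Let $\zeta\sim\mathcal{SG}(\mu,\bar\Theta)$. Representing $\zeta$ as $\zeta=\mu+\xi$ with $\xi\sim\mathcal{SG}(0,\bar\Theta)$, we have

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_\zeta\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&=&h^T\mu+\tfrac12\mu^TH\mu+\ln\Big(\mathbf{E}_\xi\big\{e^{[h+H\mu]^T\xi+\frac12\xi^TH\xi}\big\}\Big)\\
&\le&h^T\mu+\tfrac12\mu^TH\mu+\ln\Big(\mathbf{E}_\xi\big\{e^{[h+H\mu]^T\xi+\frac12\xi^TG\xi}\big\}\Big)\quad[\text{since }G\succeq H]\\
&=&h^T\mu+\tfrac12\mu^TH\mu+\ln\Big(\mathbf{E}_\xi\big\{e^{\nu^TS\xi+\frac12\xi^TS^TS\xi}\big\}\Big)\quad[\text{since }S^T\nu=h+H\mu\text{ and }G=S^TS]\\
&\le&h^T\mu+\tfrac12\mu^TH\mu-\tfrac12\ln\mathrm{Det}(I_d-S\bar\Theta S^T)+\tfrac12\nu^TS\bar\Theta S^T(I_d-S\bar\Theta S^T)^{-1}\nu\quad[\text{by Lemma 2.53 with }\Theta=\bar\Theta]\\
&=&h^T\mu+\tfrac12\mu^TH\mu-\tfrac12\ln\mathrm{Det}(I_d-P)+\tfrac12[H\mu+h]^T\bar\Theta^{1/2}(I_d-P)^{-1}\bar\Theta^{1/2}[H\mu+h]\quad[\text{plugging in }S\text{ and }\nu].
\end{array}$$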
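The "direct computation" behind the equality in (2.186) can be checked numerically in the scalar case $d=1$, where $S\Theta S^T$ reduces to a number $m\in[0,1)$ and the identity reads $\mathbf{E}_{\eta\sim\mathcal{N}(\nu,1)}\{e^{m\eta^2/2}\}=(1-m)^{-1/2}\exp\{\frac12\nu^2m/(1-m)\}$. The sketch below compares a Riemann sum on a fine grid with this closed form; the function names, grid width, and test values are assumptions made for this illustration only.

```python
import numpy as np

def lhs_quadrature(nu, m, half_width=40.0, n=400001):
    """E_{eta ~ N(nu, 1)} exp(m * eta^2 / 2), by a Riemann sum on a uniform grid."""
    eta = np.linspace(nu - half_width, nu + half_width, n)
    step = eta[1] - eta[0]
    # log of exp(m eta^2 / 2) times the N(nu, 1) density
    log_integrand = 0.5 * m * eta**2 - 0.5 * (eta - nu) ** 2 - 0.5 * np.log(2 * np.pi)
    return step * np.exp(log_integrand).sum()

def rhs_closed_form(nu, m):
    """Scalar case of the right-hand side of (2.186), exponentiated."""
    return np.exp(-0.5 * np.log(1.0 - m) + 0.5 * nu**2 * m / (1.0 - m))

for nu, m in [(0.0, 0.3), (1.3, 0.5), (-2.0, 0.8)]:
    print(nu, m, lhs_quadrature(nu, m), rhs_closed_form(nu, m))
```

The two columns printed for each pair $(\nu,m)$ should agree to many digits; the grid is wide enough for these values, but it would need widening for $m$ very close to 1.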

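It is immediately seen that the concluding quantity in this chain is nothing but the right-hand side quantity in (2.183). ✷

2.11.6.B Completing the proof of Proposition 2.46

1°. Let us prove (2.142.a). By Lemma 2.52 (see (2.185)) applied with $\bar\Theta=\Theta_*$, setting $C(u)=A[u;1]$, we have

$$\begin{array}{l}
\forall\big((h,H)\in\mathcal{H},\ G:0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H,\ u\in\mathbf{R}^n:[u;1][u;1]^T\in\mathcal{Z}\big):\\
\quad\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta_*)}\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})\\
\qquad+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B[u;1]\\
\quad\le-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right)\\
\quad=\Psi_{A,\mathcal{Z}}(h,H,G),
\end{array}\tag{2.187}$$

implying, due to the origin of $\Phi_{A,\mathcal{Z}}$, that under the premise of (2.187) we have

$$\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta_*)}\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)\le\Phi_{A,\mathcal{Z}}(h,H),\quad\forall(h,H)\in\mathcal{H}.$$

Taking into account that when $\zeta\sim\mathcal{SG}(C(u),\Theta)$ with $\Theta\in\mathcal{V}$, we have also $\zeta\sim\mathcal{SG}(C(u),\Theta_*)$, (2.142.a) follows.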

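2°. Now let us prove (2.142.b). All we need is to verify the relation

$$\begin{array}{l}
\forall\big((h,H)\in\mathcal{H},\ G:0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H,\ u\in\mathbf{R}^n:[u;1][u;1]^T\in\mathcal{Z},\ \Theta\in\mathcal{V}\big):\\
\quad\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)\le\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta);
\end{array}\tag{2.188}$$

with this relation at our disposal (2.142.b) can be obtained by the same argument as the one we used in item 1° to derive (2.142.a).

To establish (2.188), let us fix $h,H,G,u,\Theta$ satisfying the premise of (2.188); recall that under the premise of Proposition 2.46.i, we have $0\preceq\Theta\preceq\Theta_*$. Now let $\lambda\in(0,1)$, and let $\Theta_\lambda=\Theta+\lambda(\Theta_*-\Theta)$, so that $0\prec\Theta_\lambda\preceq\Theta_*$, and let $\delta_\lambda=\|\Theta_\lambda^{1/2}\Theta_*^{-1/2}-I_d\|$, implying that $\delta_\lambda\in[0,2]$. We have $0\preceq G\preceq\gamma^+\Theta_*^{-1}\preceq\gamma^+\Theta_\lambda^{-1}$, that is, $H,G$ satisfy (2.184) w.r.t. $\bar\Theta=\Theta_\lambda$. As a result, for our $h,G,H,u$, the $\bar\Theta$ just defined and $\zeta\sim\mathcal{SG}(C(u),\Theta_\lambda)$, relation (2.185) holds true:

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_\zeta\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\\
&&+\tfrac12[u;1]^TB^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_\lambda^{-1}-G]^{-1}[H,h]\right]B[u;1]\\
&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\\
&&+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_\lambda^{-1}-G]^{-1}[H,h]\right]B\right)
\end{array}\tag{2.189}$$

(recall that $[u;1][u;1]^T\in\mathcal{Z}$). As a result,

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})\\
&&+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right).
\end{array}\tag{2.190}$$

When deriving (2.190) from (2.189), we have used that
- $\Theta\preceq\Theta_\lambda$, so that when $\zeta\sim\mathcal{SG}(C(u),\Theta)$, we have also $\zeta\sim\mathcal{SG}(C(u),\Theta_\lambda)$,
- $0\prec\Theta_\lambda\preceq\Theta_*$ and $G\prec\Theta_*^{-1}$, whence $[\Theta_\lambda^{-1}-G]^{-1}\preceq[\Theta_*^{-1}-G]^{-1}$,
- $\mathcal{Z}\subset\mathbf{S}^{n+1}_+$, whence $\phi_{\mathcal{Z}}$ is $\succeq$-monotone: $\phi_{\mathcal{Z}}(M)\le\phi_{\mathcal{Z}}(N)$ whenever $M\preceq N$.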

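By Lemma 2.51 applied with $\Theta_\lambda$ in the role of $\Theta$ and $\delta_\lambda$ in the role of $\delta$, we have

$$\begin{array}{rcl}
-\tfrac12\ln\mathrm{Det}(I-\Theta_\lambda^{1/2}G\Theta_\lambda^{1/2})&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta_\lambda-\Theta_*]G)\\
&&+\dfrac{\delta_\lambda(2+\delta_\lambda)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2.
\end{array}$$

Consequently, (2.190) implies that

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta_\lambda-\Theta_*]G)\\
&&+\dfrac{\delta_\lambda(2+\delta_\lambda)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\
&&+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right).
\end{array}$$

The resulting inequality holds true for all small positive $\lambda$; taking $\liminf$ of the right-hand side as $\lambda\to+0$, and recalling that $\Theta_0=\Theta$, we get

$$\begin{array}{rcl}
\ln\Big(\mathbf{E}_{\zeta\sim\mathcal{SG}(C(u),\Theta)}\big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\big\}\Big)&\le&-\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})+\tfrac12\mathrm{Tr}([\Theta-\Theta_*]G)\\
&&+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\
&&+\tfrac12\phi_{\mathcal{Z}}\left(B^T\left[\begin{bmatrix}H&h\\ h^T&\end{bmatrix}+[H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\right]B\right)
\end{array}$$

(note that under the premise of Proposition 2.46.i we clearly have $\liminf_{\lambda\to+0}\delta_\lambda\le\delta$). The right-hand side of the resulting inequality is nothing but $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ (see (2.141)), and we arrive at the inequality required in the conclusion of (2.188).

3°. To complete the proof of Proposition 2.46.i, it remains to show that functions $\Phi_{A,\mathcal{Z}}$, $\Phi^\delta_{A,\mathcal{Z}}$, as announced in the proposition, possess continuity, convexity-concavity, and coerciveness properties. Let us verify that this indeed is so for $\Phi^\delta_{A,\mathcal{Z}}$; the reasoning which follows, with obvious simplifications, is applicable to $\Phi_{A,\mathcal{Z}}$ as well.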


Observe, first, that for exactly the same reasons as in item 4° of the proof of Proposition 2.43, the function $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ is real-valued, continuous and convex-concave on the domain

$$\widehat{\mathcal{H}}\times\mathcal{V}=\{(h,H,G):-\gamma^+\Theta_*^{-1}\preceq H\preceq\gamma^+\Theta_*^{-1},\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ H\preceq G\}\times\mathcal{V}.$$

The function $\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta):\mathcal{H}\times\mathcal{V}\to\mathbf{R}$ is obtained from $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ by the following two operations: we first minimize $\Psi^\delta_{A,\mathcal{Z}}(h,H,G;\Theta)$ over $G$ linked to $(h,H)$ by the convex constraints $0\preceq G\preceq\gamma^+\Theta_*^{-1}$ and $G\succeq H$, thus obtaining a function

$$\bar\Phi(h,H;\Theta):\underbrace{\{(h,H):-\gamma^+\Theta_*^{-1}\preceq H\preceq\gamma^+\Theta_*^{-1}\}}_{\bar{\mathcal{H}}}\times\mathcal{V}\to\mathbf{R}\cup\{+\infty\}\cup\{-\infty\}.$$

Second, we restrict the function $\bar\Phi(h,H;\Theta)$ from $\bar{\mathcal{H}}\times\mathcal{V}$ onto $\mathcal{H}\times\mathcal{V}$. For $(h,H)\in\bar{\mathcal{H}}$, the set of $G$'s linked to $(h,H)$ by the above convex constraints clearly is a nonempty compact set; as a result, $\bar\Phi$ is a real-valued convex-concave function on $\bar{\mathcal{H}}\times\mathcal{V}$. From continuity of $\Psi^\delta_{A,\mathcal{Z}}$ on its domain it immediately follows that $\Psi^\delta_{A,\mathcal{Z}}$ is bounded and uniformly continuous on every bounded subset of this domain. This implies that $\bar\Phi(h,H;\Theta)$ is bounded in every domain of the form $\bar B\times\mathcal{V}$, where $\bar B$ is a bounded subset of $\bar{\mathcal{H}}$, and is continuous on $\bar B\times\mathcal{V}$ in $\Theta\in\mathcal{V}$ with properly selected modulus of continuity independent of $(h,H)\in\bar B$. Furthermore, by construction, $\mathcal{H}\subset\mathrm{int}\,\bar{\mathcal{H}}$, implying that if $B$ is a convex compact subset of $\mathcal{H}$, it belongs to the interior of a properly selected convex compact subset $\bar B$ of $\bar{\mathcal{H}}$. Since $\bar\Phi$ is bounded on $\bar B\times\mathcal{V}$ and is convex in $(h,H)$, the function $\bar\Phi$ is Lipschitz continuous in $(h,H)\in B$ with a Lipschitz constant which can be selected to be independent of $\Theta\in\mathcal{V}$. Taking into account that $\mathcal{H}$ is convex and closed, the bottom line is that $\Phi^\delta_{A,\mathcal{Z}}$ is not just a real-valued convex-concave function on the domain $\mathcal{H}\times\mathcal{V}$, but is also continuous on this domain.

Coerciveness of $\Phi^\delta_{A,\mathcal{Z}}(h,H;\Theta)$ in $(h,H)$ is proved in exactly the same way as the similar property of function (2.130); see item 5° in the proof of Proposition 2.43. The proof of item (i) of Proposition 2.46 is complete.

4°. Item (ii) of Proposition 2.46 can be derived from item (i) of the proposition following the steps of the proof of (ii) of Proposition 2.43. ✷

Chapter Three

From Hypothesis Testing to Estimating Functionals

In this chapter we extend the techniques developed in Chapter 2 beyond the hypothesis testing problem and apply them to estimating properly structured scalar functionals of the unknown signal, specifically:

• In simple observation schemes: linear (and more generally, N-convex; see Section 3.2) functionals on unions of convex sets (Sections 3.1 and 3.2);
• Beyond simple observation schemes: linear and quadratic functionals on convex sets (Sections 3.3 and 3.4).

3.1 ESTIMATING LINEAR FORMS ON UNIONS OF CONVEX SETS

The key to the subsequent developments in this section and in Sections 3.3 and 3.4 is the following simple observation. Let $\mathcal{P}=\{P_x:x\in\mathcal{X}\}$ be a parametric family of distributions on $\mathbf{R}^d$, $\mathcal{X}$ being a convex subset of some $\mathbf{R}^m$. Suppose that given a linear form $g^Tx$ on $\mathbf{R}^m$ and an observation $\omega\sim P_x$ stemming from unknown signal $x\in\mathcal{X}$, we want to recover $g^Tx$, and intend to use for this purpose an affine function $h^T\omega+\kappa$ of the observation. How do we ensure that the recovery, with a given probability $1-\epsilon$, deviates from $g^Tx$ by at most a given margin $\rho$, for all $x\in\mathcal{X}$?

Let us focus on one "half" of the answer: how to ensure that the probability of the event $h^T\omega+\kappa>g^Tx+\rho$ does not exceed $\epsilon/2$, for every $x\in\mathcal{X}$. The answer becomes easy when assuming that we have at our disposal an upper bound on the exponential moments of the distributions from the family, that is, a function $\Phi(h;x)$ such that

$$\ln\Big(\int e^{h^T\omega}P_x(d\omega)\Big)\le\Phi(h;x)\quad\forall(h\in\mathbf{R}^d,\ x\in\mathcal{X}).$$

Indeed, in this case, by Markov's inequality applied to $e^{h^T\omega}$, the $P_x$-probability of the event $h^T\omega+\kappa-g^Tx>\rho$ is at most $\exp\{\Phi(h;x)-[g^Tx+\rho-\kappa]\}$. To add some flexibility, note that when $\alpha>0$, the event in question is the same as the event $(h/\alpha)^T\omega+\kappa/\alpha>[g^Tx+\rho]/\alpha$; thus we arrive at a parametric family of upper bounds $\exp\{\Phi(h/\alpha;x)-[g^Tx+\rho-\kappa]/\alpha\}$, $\alpha>0$, on the $P_x$-probability of our "bad" event. It follows that a sufficient condition for this probability to be $\le\epsilon/2$, for a given $x\in\mathcal{X}$, is the existence of $\alpha>0$ such that

$$\exp\{\Phi(h/\alpha;x)-[g^Tx+\rho-\kappa]/\alpha\}\le\epsilon/2,$$

or $\Phi(h/\alpha;x)-[g^Tx+\rho-\kappa]/\alpha\le\ln(\epsilon/2)$, or, which again is the same, the existence of $\alpha>0$ such that

$$\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\le\rho-\kappa.$$

In other words, a sufficient condition for the relation $\mathrm{Prob}_{\omega\sim P_x}\{h^T\omega+\kappa>g^Tx+\rho\}\le\epsilon/2$ is

$$\inf_{\alpha>0}\big[\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\big]\le\rho-\kappa.$$

If we want the bad event in question to take place with $P_x$-probability $\le\epsilon/2$ whatever be $x\in\mathcal{X}$, the sufficient condition for this is

$$\sup_{x\in\mathcal{X}}\inf_{\alpha>0}\big[\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\big]\le\rho-\kappa.\tag{3.1}$$

Now assume that $\mathcal{X}$ is convex and compact, and $\Phi(h;x)$ is continuous, convex in $h$, and concave in $x$. In this case the function $\alpha\Phi(h/\alpha;x)$ is convex in $(h,\alpha)$ in the domain $\alpha>0$ (see footnote 1) and is concave in $x$, so that we can switch $\sup$ and $\inf$, thus arriving at the sufficient condition

$$\exists\alpha>0:\ \max_{x\in\mathcal{X}}\big[\alpha\Phi(h/\alpha;x)+\alpha\ln(2/\epsilon)-g^Tx\big]\le\rho-\kappa,\tag{3.2}$$

for the validity of the relation

$$\forall x\in\mathcal{X}:\ \mathrm{Prob}_{\omega\sim P_x}\{h^T\omega+\kappa-g^Tx\le\rho\}\ge1-\epsilon/2.$$
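As a concrete illustration of condition (3.2), consider a case that is an assumption of this example rather than part of the text: $\omega\sim\mathcal{N}(x,\sigma^2I_d)$ with $x$ in the unit box $\{\|x\|_\infty\le1\}$, for which one can take $\Phi(h;x)=h^Tx+\frac12\sigma^2\|h\|_2^2$. Then the maximum over $x$ in (3.2) equals $\|h-g\|_1+\sigma^2\|h\|_2^2/(2\alpha)+\alpha\ln(2/\epsilon)$, and minimizing over $\alpha>0$ gives $\|h-g\|_1+\sigma\|h\|_2\sqrt{2\ln(2/\epsilon)}$. The sketch below computes this margin and compares it with the exact Gaussian tail probability at the vertices of the box; all names are hypothetical.

```python
import itertools
import math

import numpy as np

def margin(h, g, sigma, eps):
    """Smallest rho - kappa allowed by (3.2) for omega ~ N(x, sigma^2 I), x in the unit box.

    max over the box of (h - g)^T x is ||h - g||_1, and
    inf over alpha > 0 of sigma^2 ||h||^2 / (2 alpha) + alpha ln(2/eps)
    is sigma ||h|| sqrt(2 ln(2/eps)).
    """
    h = np.asarray(h, dtype=float)
    g = np.asarray(g, dtype=float)
    return np.abs(h - g).sum() + sigma * np.linalg.norm(h) * math.sqrt(2 * math.log(2 / eps))

def exact_tail(h, g, sigma, x, rho_minus_kappa):
    """P{h^T omega + kappa > g^T x + rho} for omega ~ N(x, sigma^2 I)."""
    t = (g @ x + rho_minus_kappa - h @ x) / (sigma * np.linalg.norm(h))
    return 0.5 * math.erfc(t / math.sqrt(2))

h = np.array([0.8, -0.1, 0.4])
g = np.array([1.0, 0.0, 0.5])
sigma, eps = 0.5, 0.05
m = margin(h, g, sigma, eps)
worst = max(exact_tail(h, g, sigma, np.array(v), m)
            for v in itertools.product([-1.0, 1.0], repeat=3))
print("margin:", m, "worst-case tail:", worst, "target:", eps / 2)
```

Since $(g-h)^Tx$ is linear in $x$, the worst case over the box is attained at a vertex, which is why only vertices are scanned. The exact tail is below $\epsilon/2$ with room to spare, as expected from a bound based on exponential moments.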

Note that our sufficient condition is expressed in terms of a convex constraint on $h,\kappa,\rho,\alpha$. Consider also the dramatic simplification allowed by the convexity-concavity of $\Phi$: in (3.1), every $x\in\mathcal{X}$ should be "served" by its own $\alpha$, so that (3.1) is an infinite system of constraints on $h,\rho,\kappa$. In contrast, in (3.2) all $x\in\mathcal{X}$ are "served" by a single $\alpha$. The developments in this section and Sections 3.3 and 3.4 are no more than implementations, under various circumstances, of the simple idea we have just outlined.

3.1.1 The problem

Let $\mathcal{O}=(\Omega,\Pi,\{p_\mu(\cdot):\mu\in\mathcal{M}\},\mathcal{F})$ be a simple observation scheme (see Section 2.4.2). The problem we consider in this section is as follows:

We are given a positive integer $K$ and $I$ nonempty convex compact sets $X_j\subset\mathbf{R}^n$, along with affine mappings $A_j(\cdot):\mathbf{R}^n\to\mathbf{R}^M$ such that $A_j(x)\in\mathcal{M}$ whenever $x\in X_j$, $1\le j\le I$. In addition, we are given a linear function $g^Tx$ on $\mathbf{R}^n$. Given random observation $\omega^K=(\omega_1,...,\omega_K)$ with $\omega_k$ drawn, independently across $k$, from $p_{A_j(x)}$ with $j\le I$ and $x\in X_j$, we want to recover $g^Tx$. It should be stressed that we do not know $j$ and $x$ underlying our observation.

[Footnote 1: This is due to the following standard fact: if $f(h)$ is a convex function, then the projective transformation $\alpha f(h/\alpha)$ of $f$ is convex in $(h,\alpha)$ in the domain $\alpha>0$.]

Given reliability tolerance $\epsilon\in(0,1)$, we quantify the performance of a candidate estimate, a Borel function $\widehat g(\cdot):\Omega\to\mathbf{R}$, by the worst-case, over $j$ and $x$, width of a $(1-\epsilon)$-confidence interval. Precisely, we say that $\widehat g(\cdot)$ is $(\rho,\epsilon)$-reliable if

$$\forall(j\le I,\ x\in X_j):\ \mathrm{Prob}_{\omega\sim p_{A_j(x)}}\{|\widehat g(\omega)-g^Tx|>\rho\}\le\epsilon.\tag{3.3}$$

We define the $\epsilon$-risk of the estimate as

$$\mathrm{Risk}_\epsilon[\widehat g]=\inf\{\rho:\ \widehat g\text{ is }(\rho,\epsilon)\text{-reliable}\},$$

i.e., $\mathrm{Risk}_\epsilon[\widehat g]$ is the smallest $\rho$ such that $\widehat g$ is $(\rho,\epsilon)$-reliable.

The technique we are about to develop originates from [131] where estimating a linear form on a convex compact set in a simple o.s. (i.e., the case $I=1$ of the problem at hand) was considered, and where it was proved that in this situation the estimate

$$\widehat g(\omega^K)=\sum_k\phi(\omega_k)+\kappa$$

with properly selected $\phi\in\mathcal{F}$ and $\kappa\in\mathbf{R}$ is near-optimal. The problem of estimating linear functionals of a signal in Gaussian o.s. has a long history; see, e.g., [38, 40, 124, 125, 125, 127, 126, 170, 179] and references therein. In particular, in the case of $I=1$, using different techniques, a similar fact was proved by D. Donoho [64] in 1991; related results in the case of $I>1$ are available in [41, 42].

3.1.2 The estimate

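In the sequel, we associate with the simple o.s. $\mathcal{O}=(\Omega,\Pi,\{p_\mu(\cdot):\mu\in\mathcal{M}\},\mathcal{F})$ in question the function

$$\Phi_{\mathcal{O}}(\phi;\mu)=\ln\Big(\int e^{\phi(\omega)}p_\mu(\omega)\Pi(d\omega)\Big),\quad(\phi,\mu)\in\mathcal{F}\times\mathcal{M}.$$

Recall that by definition of a simple o.s., this function is real-valued on $\mathcal{F}\times\mathcal{M}$, concave in $\mu\in\mathcal{M}$, convex in $\phi\in\mathcal{F}$, and continuous on $\mathcal{F}\times\mathcal{M}$ (the latter follows from convexity-concavity and relative openness of $\mathcal{M}$ and $\mathcal{F}$). Let us associate with a pair $(i,j)$, $1\le i,j\le I$, the functions

$$\begin{array}{rcl}
\Phi_{ij}(\alpha,\phi;x,y)&=&\tfrac12\big[K\alpha\Phi_{\mathcal{O}}(\phi/\alpha;A_i(x))+K\alpha\Phi_{\mathcal{O}}(-\phi/\alpha;A_j(y))+g^T(y-x)+2\alpha\ln(2I/\epsilon)\big]:\\
&&\{\alpha>0,\ \phi\in\mathcal{F}\}\times[X_i\times X_j]\to\mathbf{R},\\
\Psi_{ij}(\alpha,\phi)&=&\max_{x\in X_i,\,y\in X_j}\Phi_{ij}(\alpha,\phi;x,y)\\
&=&\tfrac12\big[\Psi_{i,+}(\alpha,\phi)+\Psi_{j,-}(\alpha,\phi)\big]:\ \{\alpha>0\}\times\mathcal{F}\to\mathbf{R},
\end{array}$$

where

$$\begin{array}{rcl}
\Psi_{\ell,+}(\beta,\psi)&=&\max_{x\in X_\ell}\big[K\beta\Phi_{\mathcal{O}}(\psi/\beta;A_\ell(x))-g^Tx+\beta\ln(2I/\epsilon)\big]:\ \{\beta>0,\ \psi\in\mathcal{F}\}\to\mathbf{R},\\
\Psi_{\ell,-}(\beta,\psi)&=&\max_{x\in X_\ell}\big[K\beta\Phi_{\mathcal{O}}(-\psi/\beta;A_\ell(x))+g^Tx+\beta\ln(2I/\epsilon)\big]:\ \{\beta>0,\ \psi\in\mathcal{F}\}\to\mathbf{R}.
\end{array}$$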

Note that the function $\alpha\Phi_{\mathcal{O}}(\phi/\alpha;A_i(x))$ is obtained from the continuous convex-concave function $\Phi_{\mathcal{O}}(\cdot,\cdot)$ by projective transformation in the convex argument, and affine substitution in the concave argument, so that the former function is convex-concave and continuous on the domain $\{\alpha>0,\ \phi\in\mathcal{F}\}\times X_i$. By similar argument, the function $\alpha\Phi_{\mathcal{O}}(-\phi/\alpha;A_j(y))$ is convex-concave and continuous on the domain $\{\alpha>0,\ \phi\in\mathcal{F}\}\times X_j$. These observations combine with compactness of $X_i$ and $X_j$ to imply that $\Psi_{ij}(\alpha,\phi)$ is a real-valued continuous convex function on the domain

$$\mathcal{F}^+=\{\alpha>0\}\times\mathcal{F}.$$

Observe that functions $\Psi_{ii}(\alpha,\phi)$ are nonnegative on $\mathcal{F}^+$. Indeed, selecting some $\bar x\in X_i$, and setting $\mu=A_i(\bar x)$, we have

$$\begin{array}{rcl}
\Psi_{ii}(\alpha,\phi)&\ge&\Phi_{ii}(\alpha,\phi;\bar x,\bar x)=\alpha\Big[\tfrac12\big[\Phi_{\mathcal{O}}(\phi/\alpha;\mu)+\Phi_{\mathcal{O}}(-\phi/\alpha;\mu)\big]K+\ln(2I/\epsilon)\Big]\\
&\ge&\alpha\Big[\underbrace{\Phi_{\mathcal{O}}(0;\mu)}_{=0}K+\ln(2I/\epsilon)\Big]=\alpha\ln(2I/\epsilon)\ge0
\end{array}$$

(we have used convexity of $\Phi_{\mathcal{O}}$ in the first argument). Functions $\Psi_{ij}$ give rise to convex and feasible optimization problems

$$\mathrm{Opt}_{ij}=\mathrm{Opt}_{ij}(K)=\min_{(\alpha,\phi)\in\mathcal{F}^+}\Psi_{ij}(\alpha,\phi).\tag{3.4}$$

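By its origin, $\mathrm{Opt}_{ij}$ is either a real, or $-\infty$; by the observation above, $\mathrm{Opt}_{ii}$ are nonnegative. Our estimate is as follows.

1. For $1\le i,j\le I$, we select some feasible solutions $\alpha_{ij},\phi_{ij}$ to problems (3.4) (the less the values of the corresponding objectives, the better) and set

$$\begin{array}{rcl}
\rho_{ij}&=&\Psi_{ij}(\alpha_{ij},\phi_{ij})=\tfrac12\big[\Psi_{i,+}(\alpha_{ij},\phi_{ij})+\Psi_{j,-}(\alpha_{ij},\phi_{ij})\big],\\
\kappa_{ij}&=&\tfrac12\big[\Psi_{j,-}(\alpha_{ij},\phi_{ij})-\Psi_{i,+}(\alpha_{ij},\phi_{ij})\big],\\
g_{ij}(\omega^K)&=&\sum_{k=1}^K\phi_{ij}(\omega_k)+\kappa_{ij},\\
\rho&=&\max_{1\le i,j\le I}\rho_{ij}.
\end{array}\tag{3.5}$$

2. Given observation $\omega^K$, we specify the estimate $\widehat g(\omega^K)$ as follows:

$$\begin{array}{rcl}
r_i&=&\max_{j\le I}g_{ij}(\omega^K),\\
c_j&=&\min_{i\le I}g_{ij}(\omega^K),\\
\widehat g(\omega^K)&=&\tfrac12\big[\min_{i\le I}r_i+\max_{j\le I}c_j\big].
\end{array}\tag{3.6}$$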
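The aggregation step (3.6) is a few lines of array code once the pairwise estimates $g_{ij}(\omega^K)$ are arranged in an $I\times I$ matrix. The sketch below assumes such a matrix `E` is already available (how the entries are produced depends on the observation scheme); the function name `aggregate` is hypothetical.

```python
import numpy as np

def aggregate(E):
    """Estimate (3.6) from the matrix E with E[i, j] = g_ij(omega^K)."""
    E = np.asarray(E, dtype=float)
    r = E.max(axis=1)  # r_i = max over j of g_ij: row maxima
    c = E.min(axis=0)  # c_j = min over i of g_ij: column minima
    return 0.5 * (r.min() + c.max())

# Small example: rows are indexed by i, columns by j.
E = np.array([[1.0, 3.0],
              [0.0, 2.0]])
print(aggregate(E))
```

In this example the row maxima are (3, 2) and the column minima are (0, 2), so the estimate is the midpoint of 2 and 2. Since every row maximum dominates every column minimum, $\min_ir_i\ge\max_jc_j$ always holds.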

3.1.3 Main result

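Proposition 3.1. The $\epsilon$-risk of the estimate $\widehat g(\omega^K)$ can be upper-bounded as follows:

$$\mathrm{Risk}_\epsilon[\widehat g]\le\rho.\tag{3.7}$$

Proof. Let the common distribution $p$ of components $\omega_k$ independent across $k$ in observation $\omega^K$ be $p_{A_\ell(u)}$ for some $\ell\le I$ and $u\in X_\ell$. Let us fix these $\ell$ and $u$; we denote $\mu=A_\ell(u)$ and let $p^K$ stand for the distribution of $\omega^K$.

1°. We have

$$\begin{array}{l}
\Psi_{\ell,+}(\alpha_{\ell j},\phi_{\ell j})=\max_{x\in X_\ell}\big[K\alpha_{\ell j}\Phi_{\mathcal{O}}(\phi_{\ell j}/\alpha_{\ell j},A_\ell(x))-g^Tx\big]+\alpha_{\ell j}\ln(2I/\epsilon)\\
\quad\ge K\alpha_{\ell j}\Phi_{\mathcal{O}}(\phi_{\ell j}/\alpha_{\ell j},\mu)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\quad[\text{since }u\in X_\ell\text{ and }\mu=A_\ell(u)]\\
\quad=K\alpha_{\ell j}\ln\Big(\int\exp\{\phi_{\ell j}(\omega)/\alpha_{\ell j}\}p_\mu(\omega)\Pi(d\omega)\Big)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\quad[\text{by definition of }\Phi_{\mathcal{O}}]\\
\quad=\alpha_{\ell j}\ln\Big(\mathbf{E}_{\omega^K\sim p^K}\big\{\exp\{\alpha_{\ell j}^{-1}\sum_k\phi_{\ell j}(\omega_k)\}\big\}\Big)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\\
\quad=\alpha_{\ell j}\ln\Big(\mathbf{E}_{\omega^K\sim p^K}\big\{\exp\{\alpha_{\ell j}^{-1}[g_{\ell j}(\omega^K)-\kappa_{\ell j}]\}\big\}\Big)-g^Tu+\alpha_{\ell j}\ln(2I/\epsilon)\\
\quad=\alpha_{\ell j}\ln\Big(\mathbf{E}_{\omega^K\sim p^K}\big\{\exp\{\alpha_{\ell j}^{-1}[g_{\ell j}(\omega^K)-g^Tu-\rho_{\ell j}]\}\big\}\Big)+\rho_{\ell j}-\kappa_{\ell j}+\alpha_{\ell j}\ln(2I/\epsilon)\\
\quad\ge\alpha_{\ell j}\ln\Big(\mathrm{Prob}_{\omega^K\sim p^K}\big\{g_{\ell j}(\omega^K)>g^Tu+\rho_{\ell j}\big\}\Big)+\rho_{\ell j}-\kappa_{\ell j}+\alpha_{\ell j}\ln(2I/\epsilon)\\
\Rightarrow\\
\alpha_{\ell j}\ln\Big(\mathrm{Prob}_{\omega^K\sim p^K}\big\{g_{\ell j}(\omega^K)>g^Tu+\rho_{\ell j}\big\}\Big)\le\Psi_{\ell,+}(\alpha_{\ell j},\phi_{\ell j})+\kappa_{\ell j}-\rho_{\ell j}+\alpha_{\ell j}\ln\big(\tfrac{\epsilon}{2I}\big)\\
\quad=\alpha_{\ell j}\ln\big(\tfrac{\epsilon}{2I}\big)\quad[\text{by (3.5)}],
\end{array}$$

and we arrive at

$$\mathrm{Prob}_{\omega^K\sim p^K}\big\{g_{\ell j}(\omega^K)>g^Tu+\rho_{\ell j}\big\}\le\frac{\epsilon}{2I}.\tag{3.8}$$

Similarly,

$$\begin{array}{l}
\Psi_{\ell,-}(\alpha_{i\ell},\phi_{i\ell})=\max_{y\in X_\ell}\big[K\alpha_{i\ell}\Phi_{\mathcal{O}}(-\phi_{i\ell}/\alpha_{i\ell},A_\ell(y))+g^Ty\big]+\alpha_{i\ell}\ln(2I/\epsilon)\\
\quad\ge K\alpha_{i\ell}\Phi_{\mathcal{O}}(-\phi_{i\ell}/\alpha_{i\ell},\mu)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\quad[\text{since }u\in X_\ell\text{ and }\mu=A_\ell(u)]\\
\quad=K\alpha_{i\ell}\ln\Big(\int\exp\{-\phi_{i\ell}(\omega)/\alpha_{i\ell}\}p_\mu(\omega)\Pi(d\omega)\Big)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\quad[\text{by definition of }\Phi_{\mathcal{O}}]\\
\quad=\alpha_{i\ell}\ln\Big(\mathbf{E}_{\omega^K\sim p^K}\big\{\exp\{-\alpha_{i\ell}^{-1}\sum_k\phi_{i\ell}(\omega_k)\}\big\}\Big)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\\
\quad=\alpha_{i\ell}\ln\Big(\mathbf{E}_{\omega^K\sim p^K}\big\{\exp\{\alpha_{i\ell}^{-1}[-g_{i\ell}(\omega^K)+\kappa_{i\ell}]\}\big\}\Big)+g^Tu+\alpha_{i\ell}\ln(2I/\epsilon)\\
\quad=\alpha_{i\ell}\ln\Big(\mathbf{E}_{\omega^K\sim p^K}\big\{\exp\{\alpha_{i\ell}^{-1}[-g_{i\ell}(\omega^K)+g^Tu-\rho_{i\ell}]\}\big\}\Big)+\rho_{i\ell}+\kappa_{i\ell}+\alpha_{i\ell}\ln(2I/\epsilon)\\
\quad\ge\alpha_{i\ell}\ln\Big(\mathrm{Prob}_{\omega^K\sim p^K}\big\{g_{i\ell}(\omega^K)<g^Tu-\rho_{i\ell}\big\}\Big)+\rho_{i\ell}+\kappa_{i\ell}+\alpha_{i\ell}\ln(2I/\epsilon)\\
\Rightarrow\\
\alpha_{i\ell}\ln\Big(\mathrm{Prob}_{\omega^K\sim p^K}\big\{g_{i\ell}(\omega^K)<g^Tu-\rho_{i\ell}\big\}\Big)\le\Psi_{\ell,-}(\alpha_{i\ell},\phi_{i\ell})-\kappa_{i\ell}-\rho_{i\ell}+\alpha_{i\ell}\ln\big(\tfrac{\epsilon}{2I}\big)\\
\quad=\alpha_{i\ell}\ln\big(\tfrac{\epsilon}{2I}\big)\quad[\text{by (3.5)}],
\end{array}$$

and we arrive at

$$\mathrm{Prob}_{\omega^K\sim p^K}\big\{g_{i\ell}(\omega^K)<g^Tu-\rho_{i\ell}\big\}\le\frac{\epsilon}{2I}.\tag{3.9}$$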

2°. Let

$$\mathcal{E}=\{\omega^K:\ g_{\ell j}(\omega^K)\le g^Tu+\rho_{\ell j},\ g_{i\ell}(\omega^K)\ge g^Tu-\rho_{i\ell},\ 1\le i,j\le I\}.$$

From (3.8) and (3.9) and the union bound (applied to the $2I$ events excluded from $\mathcal{E}$, each of probability at most $\epsilon/(2I)$) it follows that the $p^K$-probability of the event $\mathcal{E}$ is $\ge1-\epsilon$. As a result, all we need to complete the proof of the proposition is to verify that

$$\omega^K\in\mathcal{E}\ \Rightarrow\ |\widehat g(\omega^K)-g^Tu|\le\rho_\ell:=\max\big[\max_i\rho_{i\ell},\ \max_j\rho_{\ell j}\big],\tag{3.10}$$

since clearly $\rho_\ell\le\rho:=\max_{i,j}\rho_{ij}$. To this end, let us fix $\omega^K\in\mathcal{E}$, and let $E$ be the $I\times I$ matrix with entries $E_{ij}=g_{ij}(\omega^K)$, $1\le i,j\le I$. The quantity $r_i$ (see (3.6)) is the maximum of the entries in the $i$-th row of $E$, while the quantity $c_j$ is the minimum of the entries in the $j$-th column of $E$. In particular, $r_i\ge E_{ij}\ge c_j$ for all $i,j$, implying that $r_i\ge c_j$ for all $i,j$. Now, let

$$\Delta=[g^Tu-\rho_\ell,\ g^Tu+\rho_\ell].$$

Since $\omega^K\in\mathcal{E}$, we have $E_{\ell\ell}=g_{\ell\ell}(\omega^K)\ge g^Tu-\rho_{\ell\ell}\ge g^Tu-\rho_\ell$ and $E_{\ell j}=g_{\ell j}(\omega^K)\le g^Tu+\rho_{\ell j}\le g^Tu+\rho_\ell$ for all $j$, implying that $r_\ell=\max_jE_{\ell j}\in\Delta$. Similarly, $\omega^K\in\mathcal{E}$ implies that $E_{\ell\ell}=g_{\ell\ell}(\omega^K)\le g^Tu+\rho_\ell$ and $E_{i\ell}=g_{i\ell}(\omega^K)\ge g^Tu-\rho_{i\ell}\ge g^Tu-\rho_\ell$ for all $i$, implying that $c_\ell=\min_iE_{i\ell}\in\Delta$. We see that both $r_\ell$ and $c_\ell$ belong to $\Delta$; since $r_*:=\min_ir_i\le r_\ell$ and, as we have already seen, $r_i\ge c_\ell$ for all $i$, we conclude that $r_*\in\Delta$. By a similar argument, $c_*:=\max_jc_j\in\Delta$ as well. By construction, $\widehat g(\omega^K)=\frac12[r_*+c_*]$, that is, $\widehat g(\omega^K)\in\Delta$, and the conclusion in (3.10) indeed takes place. ✷

Remark 3.2. Let us consider a special case of $I=1$. In this case, given a $K$-repeated observation of the signal in a simple o.s., our construction yields an estimate of a linear form $g^Tx$ of unknown signal $x$, known to belong to a given convex compact set $X_1$. This estimate is

$$\widehat g(\omega^K)=\sum_{k=1}^K\phi(\omega_k)+\kappa,\tag{3.11}$$

and is associated with the optimization problem

$$\begin{array}{c}
\min_{\alpha>0,\,\phi\in\mathcal{F}}\big\{\Psi(\alpha,\phi):=\tfrac12[\Psi_+(\alpha,\phi)+\Psi_-(\alpha,\phi)]\big\},\\
\Psi_+(\alpha,\phi)=\max_{x\in X_1}\big[K\alpha\Phi_{\mathcal{O}}(\phi/\alpha,A_1(x))-g^Tx+\alpha\ln(2/\epsilon)\big],\\
\Psi_-(\alpha,\phi)=\max_{x\in X_1}\big[K\alpha\Phi_{\mathcal{O}}(-\phi/\alpha,A_1(x))+g^Tx+\alpha\ln(2/\epsilon)\big].
\end{array}$$

By Proposition 3.1, when $\alpha,\phi$ is a feasible solution to the problem and $\kappa=\frac12[\Psi_-(\alpha,\phi)-\Psi_+(\alpha,\phi)]$, the $\epsilon$-risk of estimate (3.11) does not exceed $\Psi(\alpha,\phi)$.

3.1.4 Near-optimality

Observe that by properly selecting $\phi_{ij}$ and $\alpha_{ij}$ we can make, in a computationally efficient manner, the upper bound $\rho$ on the $\epsilon$-risk of the above estimate arbitrarily close to

$$\mathrm{Opt}(K)=\max_{1\le i,j\le I}\mathrm{Opt}_{ij}(K).$$

We are about to demonstrate that the quantity $\mathrm{Opt}(K)$ "nearly lower-bounds" the minimax optimal $\epsilon$-risk

$$\mathrm{Risk}^*_\epsilon(K)=\inf_{\widehat g(\cdot)}\mathrm{Risk}_\epsilon[\widehat g],$$

the infimum being taken over all estimates (all Borel functions of $\omega^K$). The precise statement is as follows:

Proposition 3.3. In the situation of this section, let $\epsilon\in(0,1/2)$ and $\overline K$ be a positive integer. Then for every integer $K$ satisfying

$$K/\overline K>\frac{2\ln(2I/\epsilon)}{\ln\Big(\frac{1}{4\epsilon(1-\epsilon)}\Big)}\tag{3.12}$$

one has

$$\mathrm{Opt}(K)\le\mathrm{Risk}^*_\epsilon(\overline K).\tag{3.13}$$

In addition, in the special case where for every $i,j$ there exists $x_{ij}\in X_i\cap X_j$ such that $A_i(x_{ij})=A_j(x_{ij})$ one has

$$K\ge\overline K\ \Rightarrow\ \mathrm{Opt}(K)\le\frac{2\ln(2I/\epsilon)}{\ln\Big(\frac{1}{4\epsilon(1-\epsilon)}\Big)}\mathrm{Risk}^*_\epsilon(\overline K).\tag{3.14}$$

For proof, see Section 3.6.1.

3.1.5 Illustration

We illustrate our construction with the simplest possible example in which Xi = {xi } are singletons in Rn , i = 1, ..., I, and the observation scheme is Gaussian. Thus, setting yi = Ai (xi ) ∈ Rm , the observation’s components ωk , 1 ≤ k ≤ K, stemming from the signal xi , are drawn, independently of each other, from the normal distribution N (yi , Im ). The family F of functions φ associated with Gaussian o.s. is the family of all affine functions φ(ω) = φ0 +ϕT ω on the observation space (which at present is Rm ); we identify φ ∈ F with the pair (φ0 , ϕ). The function ΨO associated with the Gaussian observation scheme with m-dimensional observations is ΦO (φ; µ) = φ0 + ϕT µ + 21 ϕT ϕ : (R × Rm ) × Rm → R; a straightforward computation shows that in the case in question, setting θ = ln(2I/ǫ),

192

CHAPTER 3

we have Ψi,+ (α, φ)

= =

Ψj,− (α, φ)

=

Optij

= = = =

  Kα φ0 + ϕT yi /α + 21 ϕT ϕ/α2 + αθ − g T xi K T ϕ ϕ + αθ, Kαφ0 + KϕT yi − g T xi + 2α K T −Kαφ0 − KϕT yj + g T xj + ϕ ϕ + αθ, 2α inf 1 [Ψi,+ (α, φ) + Ψj,− (α, φ)] α>0,φ 2    K T K T 1 T ϕ [yi − yj ] + inf ϕ ϕ + αθ g [xj − xi ] + inf 2 ϕ α>0 2α 2   √ K T 1 T ϕ [yi − yj ] + 2Kθkϕk2 g [xj − xi ] + inf 2 ϕ 2 p  1 T g [xj − xi ], kyi − yj k2 ≤ 2p2θ/K 2 −∞, kyi − yj k2 > 2 2θ/K.

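We see that we can put $\phi_0 = 0$, and that setting
$$\mathcal{I} = \{(i, j) : \|y_i - y_j\|_2 \le 2\sqrt{2\theta/K}\},$$
$\mathrm{Opt}_{ij}(K)$ is finite if and only if $(i, j) \in \mathcal{I}$ and is $-\infty$ otherwise. In both cases, the optimization problem specifying $\mathrm{Opt}_{ij}$ has no optimal solution.2 Indeed, this clearly is the case when $(i, j) \notin \mathcal{I}$; when $(i, j) \in \mathcal{I}$, a minimizing sequence is, e.g., $\phi_0 \equiv 0$, $\varphi \equiv 0$, $\alpha \to +0$, but its limit is not in the minimization domain (on this domain, $\alpha$ should be positive). In this particular case, the simplest way to overcome the difficulty is to replace the optimization domain $\mathcal{F}^+$ in (3.4) with its compact subset $\{\alpha \ge 1/R,\ \phi_0 = 0,\ \|\varphi\|_2 \le R\}$ with large $R$, like $R = 10^{10}$ or $10^{20}$. Then we specify the entities participating in (3.5) as
$$\phi_{ij}(\omega) = \varphi_{ij}^T\omega, \quad \varphi_{ij} = \begin{cases} 0, & (i, j) \in \mathcal{I}, \\ -R[y_i - y_j]/\|y_i - y_j\|_2, & (i, j) \notin \mathcal{I}, \end{cases} \qquad \alpha_{ij} = \begin{cases} 1/R, & (i, j) \in \mathcal{I}, \\ \sqrt{\frac{K}{2\theta}}\,R, & (i, j) \notin \mathcal{I}, \end{cases}$$
resulting in
$$\begin{aligned}
\kappa_{ij} &= \tfrac{1}{2}\left[\Psi_{j,-}(\alpha_{ij}, \phi_{ij}) - \Psi_{i,+}(\alpha_{ij}, \phi_{ij})\right] \\
&= \tfrac{1}{2}\left[-K\varphi_{ij}^T y_j + g^T x_j + \frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij} + \alpha_{ij}\theta - K\varphi_{ij}^T y_i + g^T x_i - \frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij} - \alpha_{ij}\theta\right] \\
&= \tfrac{1}{2} g^T[x_i + x_j] - \frac{K}{2}\varphi_{ij}^T[y_i + y_j]
\end{aligned}$$
and
$$\begin{aligned}
\rho_{ij} &= \tfrac{1}{2}\left[\Psi_{i,+}(\alpha_{ij}, \phi_{ij}) + \Psi_{j,-}(\alpha_{ij}, \phi_{ij})\right] \\
&= \tfrac{1}{2}\left[K\varphi_{ij}^T y_i - g^T x_i + \frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij} + \alpha_{ij}\theta - K\varphi_{ij}^T y_j + g^T x_j + \frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij} + \alpha_{ij}\theta\right] \\
&= \frac{K}{2\alpha_{ij}}\varphi_{ij}^T\varphi_{ij} + \alpha_{ij}\theta + \tfrac{1}{2} g^T[x_j - x_i] + \frac{K}{2}\varphi_{ij}^T[y_i - y_j] \\
&= \begin{cases} \tfrac{1}{2} g^T[x_j - x_i] + R^{-1}\theta, & (i, j) \in \mathcal{I}, \\ \tfrac{1}{2} g^T[x_j - x_i] + \left[\sqrt{2K\theta} - \frac{K}{2}\|y_i - y_j\|_2\right] R, & (i, j) \notin \mathcal{I}. \end{cases}
\end{aligned} \tag{3.15}$$

2 Handling this case was exactly the reason why in our construction we required $\phi_{ij}$, $\alpha_{ij}$ to be feasible, and not necessarily optimal, solutions to the optimization problems (3.4).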

In the numerical experiment we report on we use $n = 20$, $m = 10$, and $I = 100$, with $x_i$, $i \le I$, drawn independently of each other from $\mathcal{N}(0, I_n)$, and $y_i = Ax_i$ with a randomly generated matrix $A$ (specifically, a matrix with independent $\mathcal{N}(0, 1)$ entries normalized to have unit spectral norm). The linear form to be recovered is the first coordinate of $x$, the confidence parameter is set to $\epsilon = 0.01$, and $R = 10^{20}$. Results of a typical experiment are presented in Figure 3.1.

Figure 3.1: Boxplot of empirical distributions, over 20 random estimation problems, of the upper 0.01-risk bounds $\max_{1\le i,j\le 100}\rho_{ij}$ (as in (3.15)) for different observation sample sizes $K$. [Plot not reproduced; horizontal axis: $K$ = 20, 30, 40, 50, 100, 200, 300; vertical axis: risk bound, from 0 to 2.5.]
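The closed form (3.15) is easy to evaluate directly. The following Python fragment is a minimal sketch of that computation for singleton signal sets in the Gaussian case; the function names `rho_ij` and `risk_bound` and the argument layout are illustrative assumptions, not part of the original construction.

```python
import math

def rho_ij(g, x_i, x_j, y_i, y_j, K, eps, num_signals, R=1e20):
    """Evaluate rho_ij as in (3.15) for singleton sets X_i = {x_i}, X_j = {x_j}.

    g            : coefficients of the linear form to be recovered
    y_i, y_j     : images A_i(x_i), A_j(x_j) of the two signals
    num_signals  : the number I of candidate signals
    R            : the (large) radius of the compactified domain
    """
    theta = math.log(2 * num_signals / eps)
    base = 0.5 * sum(gk * (b - a) for gk, a, b in zip(g, x_i, x_j))  # g^T[x_j - x_i]/2
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(y_i, y_j)))
    if dist <= 2 * math.sqrt(2 * theta / K):
        # the pair (i, j) belongs to the index set I of "close" pairs
        return base + theta / R
    return base + (math.sqrt(2 * K * theta) - 0.5 * K * dist) * R

def risk_bound(g, xs, ys, K, eps, R=1e20):
    """Upper risk bound max_{i,j} rho_ij over all ordered pairs of signals."""
    n = len(xs)
    return max(rho_ij(g, xs[i], xs[j], ys[i], ys[j], K, eps, n, R)
               for i in range(n) for j in range(n))
```

With $R$ as large as $10^{20}$, pairs outside the close set contribute hugely negative values, so the maximum is attained on pairs whose images $y_i$, $y_j$ cannot be told apart at the given sample size.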

3.2 ESTIMATING N-CONVEX FUNCTIONS ON UNIONS OF CONVEX SETS

In this section, we apply our testing machinery to the estimation problem as follows. Given are:
• a simple o.s. $\mathcal{O} = (\Omega, \Pi; \{p_\mu : \mu \in \mathcal{M}\}; \mathcal{F})$,
• a signal space $X \subset \mathbf{R}^n$ along with the affine mapping $x \mapsto A(x) : X \to \mathcal{M}$,
• a real-valued function $f$ on $X$.

Given observation ω ∼ pA(x∗ ) stemming from unknown signal x∗ known to belong to X, we want to recover f (x∗ ).

Figure 3.2: Bisection via Hypothesis Testing. [Figure not reproduced. Panel (a): rectangle split into "left" and "right" halves; test "left vs. right"; new localizer when left is accepted and when right is accepted. Panel (b): "left" and "right" rectangles separated by a middle rectangle; test "left vs. right"; new localizer when left is accepted and when right is accepted. Panel (c): two partitions with tests I and II, each "left vs. right"; new localizer when I and II accept left, when I and II accept right, and when I and II are in disagreement.]

Our approach imposes severe restrictions on $f$ (satisfied, e.g., when $f$ is linear, or linear-fractional, or is the maximum of several linear functions); as a compensation, we allow for rather "complex" $X$: finite unions of convex sets.

3.2.1 Outline

Though the estimator we develop is, in a nutshell, quite simple, its formal description turns out to be rather involved.3 For this reason we start its presentation with an informal outline, which exposes some simple ideas underlying its construction.

3 It should be mentioned that the proposed estimation procedure is a "close relative" of the binary search algorithm of [77].

Consider the situation where the signal space $X$ is the 2D rectangle as presented on the top of Figure 3.2(a), and let the function to be recovered be $f(x) = x_1$. Thus, "nature" has somehow selected $x = [x_1, x_2]$ in the rectangle, and we observe a Gaussian random vector with the mean $A(x)$ and known covariance matrix, where $A(\cdot)$ is a given affine mapping. Note that hypotheses $f(x) \ge b$ and $f(x) \le a$ translate into convex hypotheses on the expectation of the observed Gaussian r.v., so that we can use our hypothesis testing machinery to decide on hypotheses of this type and to localize $f(x)$ in a (hopefully, small) segment by a Bisection-type process.

Before describing the process, let us make a terminological agreement. In the sequel we shall use pairwise hypothesis testing in the situation where it may happen that neither of the hypotheses we are deciding upon is true. In this case, we will say that the outcome of a test is correct if the rejected hypothesis indeed is wrong (the accepted hypothesis can be wrong as well, but the latter can happen only in the case when both our hypotheses are wrong).

This is what the Bisection might look like.

1. Were we able to decide reliably on the left and the right hypotheses in Figure 3.2(a), that is, to understand via observations whether $x$ belongs to the left or to the right half of the original rectangle, our course of action would be clear: depending on this decision, we would replace our original rectangle with a smaller rectangle localizing $x$, as shown in Figure 3.2(a), and then iterate this process. The difficulty, of course, is that our left and right hypotheses intersect, so that it is impossible to decide on them reliably.

2. In order to make the left and right hypotheses distinguishable from each other, we could act as shown in Figure 3.2(b), by shrinking the left and the right rectangles and inserting a rectangle in the middle ("no man's land"). Assuming that the width of the middle rectangle allows us to decide reliably on our new left and right hypotheses and utilizing the available observation, we can localize $x$ either in the left, or in the right rectangle as shown in Figure 3.2(b). Specifically, assume that our "left vs. right" test correctly rejected the right hypothesis. Then $x$ can be located either in the left, or in the middle rectangle shown on the top, and thus $x$ is in the new left localizer, which is the union of the left and the middle original rectangles. Similarly, if our test correctly rejects the left hypothesis, then we can take, as the new localizer of $x$, the union of the original right and middle rectangles. Note that our localization is as reliable as our test is, and that it reduces the width of the localizer by a factor close to 2, provided the width of the middle rectangle is small compared to the width of the original localizer of $x$.
We can iterate this process until we arrive at a localizer so narrow that the corresponding separator, the "no man's land" (this part cannot be too narrow, since it should allow for a reliable decision on the current left and right hypotheses), becomes too large to allow reducing significantly the localizer's width. Note that in this implementation of the binary search (same as in the implementation proposed in [77]), starting from the second step of the Bisection, the hypotheses to decide upon depend on the observations (e.g., when $x$ belongs to the middle part of the three-rectangle localizer in Figure 3.2, deciding on "left vs. right" can, depending on observation, result in accepting either the left or the right hypothesis, leading to different updated localizers). Analysing this situation usually brings about complications we would like to avoid.

3. A simple modification of the Bisection allows us to circumvent the difficulties related to testing random hypotheses. Indeed, let us consider the following construction: given the current localizer for $x$ (at the first step, the initial rectangle), we consider two "three-rectangle" partitions of it as presented in Figure 3.2(c). In the first partition, the left rectangle is the left half of the original rectangle; in the second partition, the right rectangle is the right half of the original rectangle. We then run two "left vs. right" tests, the first on the pair of left and right hypotheses stemming from the first partition, and the second on the pair of left and right hypotheses stemming from the second partition. Assuming that in both tests the rejected hypotheses indeed were wrong, the results of these tests allow us to make the following conclusions:
• when both tests reject the right hypotheses from the corresponding pairs, $x$ is located in the left half of the initial rectangle (since otherwise in the second test the rejected hypothesis would in fact be true, contradicting the assumption that both tests make no wrong rejections);
• when both tests reject the left hypotheses from the corresponding pairs, $x$ is located in the right half of the original rectangle (for exactly the same reasons as in the previous case);
• when the tests "disagree," rejecting hypotheses of different types (like left in the first, and right in the second test), $x$ is located in the union of the two middle rectangles we deal with. Indeed, otherwise $x$ should be either in the left rectangles of both our three-rectangle partitions, or in the right rectangles of both of them. Since we have assumed that in both tests no wrong rejections took place, in the first case both tests must reject the right hypotheses, and both should reject the left hypotheses in the second, while neither of these events took place.

Now, in the first two cases we can safely say to which of the "halves" (left or right) of the initial rectangle $x$ belongs, and take this half as the new localizer. In the third case, we take as a new localizer for $x$ the middle rectangle shown at the bottom of Figure 3.2 and terminate our estimation process: the new localizer already is narrow! In the proposed algorithm, unless we terminate at the very first step, we carry out the second step exactly in the same way as the first one, with the localizer of $x$ yielded by the first step in the role of the initial localizer, then carry out, in the same way, the third step, etc., until termination either due to running into a disagreement, or due to reaching a prescribed number of steps. Upon termination, we return the last localizer for $x$ which we have built, and claim that $f(x) = x_1$ belongs to the projection of this localizer onto the $x_1$-axis. In all tests from the above process, we use the same observation.
Note that in the present situation, in contrast to that discussed earlier, reutilizing a single observation creates no difficulties, since with no wrong rejections in the pairwise tests we use, the pairs of hypotheses participating in the tests are not random at all: they are uniquely defined by $f(x) = x_1$. Indeed, with no wrong rejections, prior to termination everything is as if we were running deterministic Bisection, that is, were updating subsequent rectangles $\Delta_t$ containing $x$ according to the rules
• $\Delta_1$ is a rectangle containing $x$ given in advance,
• $\Delta_{t+1}$ is precisely the half of $\Delta_t$ containing $x$ (say, the left half in the case of a tie).
Thus, given $x$ and assuming that there are no wrong rejections, the situation is as if a single observation were used in $L$ tests running in "parallel" rather than sequentially. The only elaboration caused by the sequential nature of our process is the "risk accumulation": we want the probability of error in one or more of our $L$ tests to be less than the desired risk $\epsilon$ of wrong "bracketing" of $f(x)$, implying, in the absence of something better, that the risks of the individual tests should be at most $\epsilon/L$. These risks, in turn, define the allowed width of separators and thus the accuracy to which $f(x)$ can be estimated. It should be noted that the number $L$ of steps of Bisection always is a moderate integer (since otherwise the width of "no man's land," which at the concluding Bisection steps is of order of $2^{-L}$, would be too small to allow for deciding on the concluding pairs of our hypotheses with risk $\epsilon/L$, at least when our observations possess non-negligible volatility). As a result, "the cost" of Bisection turns out to be significantly lower than in the case where every test uses its own observation.

From the above sketch of our construction it is clear that all that matters is our ability to decide on the pairs of hypotheses $\{x \in X : f(x) \le a\}$ and $\{x \in X : f(x) \ge b\}$, with $a$ and $b$ given, via observation drawn from $p_{A(x)}$. In our outline, these were convex hypotheses in Gaussian o.s., and in this case we can use detector-based pairwise tests yielded by Theorem 2.23. Applying the machinery developed in Section 2.5.1, we could also handle the case when the sets $\{x \in X : f(x) \le a\}$ and $\{x \in X : f(x) \ge b\}$ are unions of a moderate number of convex sets (e.g., $f$ is affine, and $X$ is the union of a number of convex sets), the o.s. in question still being simple, and this is the situation we intend to consider.

3.2.2 Estimating N-convex functions: Problem setting

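In the rest of this section, we consider the situation as follows. We are given
1. simple o.s. $\mathcal{O} = (\Omega, \Pi, \{p_\mu(\cdot) : \mu \in \mathcal{M}\}, \mathcal{F})$,
2. convex compact set $\mathcal{X} \subset \mathbf{R}^n$ along with a collection of $I$ convex compact sets $X_i \subset \mathcal{X}$,
3. affine mapping $x \mapsto A(x) : \mathcal{X} \to \mathcal{M}$,
4. a continuous function $f(x) : \mathcal{X} \to \mathbf{R}$ which is $N$-convex, meaning that for every $a \in \mathbf{R}$ the sets $\mathcal{X}^{a,\ge} = \{x \in \mathcal{X} : f(x) \ge a\}$ and $\mathcal{X}^{a,\le} = \{x \in \mathcal{X} : f(x) \le a\}$ can be represented as the unions of at most $N$ closed convex sets $\mathcal{X}_\nu^{a,\ge}$, $\mathcal{X}_\nu^{a,\le}$:
$$\mathcal{X}^{a,\ge} = \bigcup_{\nu=1}^N \mathcal{X}_\nu^{a,\ge}, \qquad \mathcal{X}^{a,\le} = \bigcup_{\nu=1}^N \mathcal{X}_\nu^{a,\le}.$$

For some unknown $x$ known to belong to $X = \bigcup_{i=1}^I X_i$, we have at our disposal observation $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. $\omega_t \sim p_{A(x)}(\cdot)$, and our goal is to estimate from this observation the quantity $f(x)$. Given tolerances $\rho > 0$, $\epsilon \in (0, 1)$, let us call a candidate estimate $\widehat f(\omega^K)$ $(\rho, \epsilon)$-reliable (cf. (3.3)) if for every $x \in X$, with the $p_{A(x)}$-probability at least $1 - \epsilon$, it holds $|\widehat f(\omega^K) - f(x)| \le \rho$ or, which is the same,
$$\forall (x \in X) : \ \mathrm{Prob}_{\omega^K \sim p_{A(x)} \times ... \times p_{A(x)}}\left\{|\widehat f(\omega^K) - f(x)| > \rho\right\} \le \epsilon.$$

3.2.2.1 Examples of N-convex functions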

Example 3.1. [Minima and maxima of linear-fractional functions] Every function which can be obtained from linear-fractional functions $\frac{g_\nu(x)}{h_\nu(x)}$ ($g_\nu$, $h_\nu$ are affine functions on $\mathcal{X}$ and $h_\nu$ are positive on $\mathcal{X}$) by taking maxima and minima is $N$-convex for appropriately selected $N$ due to the following immediate observations:
• a linear-fractional function $\frac{g(x)}{h(x)}$ with denominator positive on $\mathcal{X}$ is 1-convex on $\mathcal{X}$;
• if $f(x)$ is $N$-convex, so is $-f(x)$;
• if $f_i(x)$ is $N_i$-convex, $i = 1, 2, ..., I$, then $f(x) = \max_i f_i(x)$ is $N$-convex with
$$N = \max\left[\prod_i N_i,\ \sum_i N_i\right],$$
due to
$$\{x \in \mathcal{X} : f(x) \le a\} = \bigcap_{i=1}^I \{x : f_i(x) \le a\}, \qquad \{x \in \mathcal{X} : f(x) \ge a\} = \bigcup_{i=1}^I \{x : f_i(x) \ge a\}.$$
Note that the first set is the intersection of $I$ unions of convex sets with $N_i$ components in the $i$-th union, and thus is the union of $\prod_i N_i$ convex sets. The second set is the union of $I$ unions of convex sets with $N_i$ elements in the $i$-th union, and thus is the union of $\sum_i N_i$ convex sets.
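As a small worked instance of the counting rule in Example 3.1, the fragment below computes the bound $N = \max[\prod_i N_i, \sum_i N_i]$ for the maximum of several functions with known orders $N_i$. The helper name `n_convexity_of_max` is an assumption made for illustration only.

```python
from math import prod

def n_convexity_of_max(component_orders):
    """Bound on N for f = max_i f_i, where f_i is N_i-convex.

    The sublevel set of f is an intersection of unions (product of the N_i),
    the superlevel set is a union of unions (sum of the N_i).
    """
    return max(prod(component_orders), sum(component_orders))
```

For example, the maximum of a 2-convex and a 3-convex function is bounded by $\max[6, 5] = 6$, while the maximum of three linear-fractional (1-convex) functions is bounded by $\max[1, 3] = 3$. Since $-f$ inherits the order of $f$, the same bound applies to minima.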

Example 3.2. [Conditional quantile] Let $S = \{s_1 < s_2 < ... < s_M\} \subset \mathbf{R}$. For a nonvanishing probability distribution $q$ on $S$ and $\alpha \in [0, 1]$, let $\chi_\alpha[q]$ be the regularized $\alpha$-quantile of $q$ defined as follows: we pass from $q$ to the distribution $\bar q$ on $[s_1, s_M]$ by spreading uniformly the mass $q_\nu$, $1 < \nu \le M$, over $[s_{\nu-1}, s_\nu]$, and assigning mass $q_1$ to the point $s_1$; $\chi_\alpha[q]$ is the usual $\alpha$-quantile of the resulting distribution $\bar q$:
$$\chi_\alpha[q] = \min\{s \in [s_1, s_M] : \bar q\{[s_1, s]\} \ge \alpha\}.$$

[Figure not reproduced: regularized quantile as function of $\alpha$, $M = 4$; horizontal axis $\alpha$ with marks $0$, $q_1$, $q_1+q_2$, $q_1+q_2+q_3$, $1$; vertical axis $s$ with marks $s_1$, $s_2$, $s_3$, $s_4$.]

Given, along with $S$, a finite set $T$, let $\mathcal{X}$ be a convex compact set in the space of nonvanishing probability distributions on $S \times T$. For $\tau \in T$, consider the conditional, given $t = \tau$, distribution $p_\tau(\cdot)$ of $s \in S$ induced by a distribution $p(\cdot, \cdot) \in \mathcal{X}$:
$$p_\tau(\mu) = \frac{p(\mu, \tau)}{\sum_{\nu=1}^M p(\nu, \tau)}, \quad 1 \le \mu \le M,$$

where p(µ, τ ) is the p-probability for (s, t) to take value (sµ , τ ), and pτ (µ) is the pτ -probability for s to take value sµ , 1 ≤ µ ≤ M . The function χα [pτ ] : X → R turns out to be 1-convex; for verification see Section 3.6.2.
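The regularized quantile of Example 3.2 is a piecewise linear function of $\alpha$ and can be evaluated in a few lines. The sketch below assumes the support is passed as a sorted list `s`, a distribution as a list `q` of positive masses, and a joint distribution as a table `p[mu][tau]`; these names and data layouts are illustrative assumptions.

```python
def regularized_quantile(s, q, alpha):
    """Regularized alpha-quantile chi_alpha[q] of a positive distribution q on s.

    Mass q[0] sits at s[0]; mass q[nu], nu >= 1, is spread uniformly
    over the segment [s[nu-1], s[nu]].
    """
    if alpha <= q[0]:
        return s[0]
    cum = q[0]
    for nu in range(1, len(s)):
        if alpha <= cum + q[nu]:
            frac = (alpha - cum) / q[nu]          # position inside [s[nu-1], s[nu]]
            return s[nu - 1] + frac * (s[nu] - s[nu - 1])
        cum += q[nu]
    return s[-1]                                   # guards against rounding when alpha = 1

def conditional_quantile(s, p, tau, alpha):
    """chi_alpha[p_tau], where p[mu][tau] is the joint probability of (s[mu], tau)."""
    column = [row[tau] for row in p]
    total = sum(column)
    return regularized_quantile(s, [c / total for c in column], alpha)
```

With `s = [0, 1, 2, 3]` and uniform `q`, the quantile stays at $s_1 = 0$ for $\alpha \le 0.25$ and then grows linearly, reaching $s_4 = 3$ at $\alpha = 1$.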

3.2.3 Bisection estimate: Construction

While the construction to be presented admits numerous refinements, we focus here on its simplest version.

3.2.3.1 Preliminaries

Upper and lower feasibility/infeasibility, sets $Z_i^{a,\ge}$ and $Z_i^{a,\le}$. Let $a$ be a real. We associate with $a$ a collection of upper $a$-sets defined as follows: we look at the sets $X_i \cap \mathcal{X}_\nu^{a,\ge}$, $1 \le i \le I$, $1 \le \nu \le N$, and arrange the nonempty sets from this family into a sequence $Z_i^{a,\ge}$, $1 \le i \le I_{a,\ge}$. Here $I_{a,\ge} = 0$ if all sets in the family are empty; in the latter case, we call $a$ upper-infeasible, and call it upper-feasible otherwise. Similarly, we associate with $a$ the collection of lower $a$-sets $Z_i^{a,\le}$, $1 \le i \le I_{a,\le}$, by arranging into a sequence all nonempty sets from the family $X_i \cap \mathcal{X}_\nu^{a,\le}$, and call $a$ lower-feasible or lower-infeasible depending on whether $I_{a,\le}$ is positive or zero. Note that upper and lower $a$-sets are nonempty convex compact sets, and
$$X^{a,\ge} := \{x \in X : f(x) \ge a\} = \bigcup_{1\le i\le I_{a,\ge}} Z_i^{a,\ge}, \qquad X^{a,\le} := \{x \in X : f(x) \le a\} = \bigcup_{1\le i\le I_{a,\le}} Z_i^{a,\le}.$$

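Right tests. Given a segment $\Delta = [a, b]$ of positive length with lower-feasible $a$, we associate with this segment a right test (a function $T^K_{\Delta,r}(\omega^K)$ taking values right and left) and risk $\sigma_{\Delta,r} \ge 0$ as follows:
1. if $b$ is upper-infeasible, $T^K_{\Delta,r}(\cdot) \equiv$ left and $\sigma_{\Delta,r} = 0$;
2. if $b$ is upper-feasible, the collections of "right sets" $\{A(Z_i^{b,\ge})\}_{i\le I_{b,\ge}}$ and of "left sets" $\{A(Z_j^{a,\le})\}_{j\le I_{a,\le}}$ are nonempty, and the test is given by the construction from Section 2.5.1 as applied to these sets and the stationary $K$-repeated version of $\mathcal{O}$, specifically,
• for $1 \le i \le I_{b,\ge}$, $1 \le j \le I_{a,\le}$, we build the detectors
$$\phi^K_{ij\Delta}(\omega^K) = \sum_{t=1}^K \phi_{ij\Delta}(\omega_t),$$
with $\phi_{ij\Delta}(\omega)$ given by
$$\begin{aligned}
(r_{ij\Delta}, s_{ij\Delta}) &\in \mathop{\mathrm{Argmin}}_{r\in Z_i^{b,\ge},\ s\in Z_j^{a,\le}} \ln\left(\int_\Omega \sqrt{p_{A(r)}(\omega)\,p_{A(s)}(\omega)}\,\Pi(d\omega)\right), \\
\phi_{ij\Delta}(\omega) &= \tfrac{1}{2}\ln\left(p_{A(r_{ij\Delta})}(\omega)/p_{A(s_{ij\Delta})}(\omega)\right).
\end{aligned}$$
We set
$$\epsilon_{ij\Delta} = \int_\Omega \sqrt{p_{A(r_{ij\Delta})}(\omega)\,p_{A(s_{ij\Delta})}(\omega)}\,\Pi(d\omega)$$
and build the $I_{b,\ge} \times I_{a,\le}$ matrix $E_{\Delta,r} = [\epsilon^K_{ij\Delta}]_{1\le i\le I_{b,\ge},\ 1\le j\le I_{a,\le}}$;
• we define $\sigma_{\Delta,r}$ as the spectral norm of $E_{\Delta,r}$. We compute the Perron-Frobenius eigenvector $[g^{\Delta,r}; h^{\Delta,r}]$ of the matrix $\begin{bmatrix} & E_{\Delta,r} \\ E_{\Delta,r}^T & \end{bmatrix}$, so that (see Section 2.5.1.2)
$$g^{\Delta,r} > 0, \quad h^{\Delta,r} > 0, \quad \sigma_{\Delta,r}\, g^{\Delta,r} = E_{\Delta,r} h^{\Delta,r}, \quad \sigma_{\Delta,r}\, h^{\Delta,r} = E_{\Delta,r}^T g^{\Delta,r}.$$
Finally, we define the matrix-valued function
$$D_{\Delta,r}(\omega^K) = \left[\phi^K_{ij\Delta}(\omega^K) - \ln(h_j^{\Delta,r}) + \ln(g_i^{\Delta,r})\right]_{1\le i\le I_{b,\ge},\ 1\le j\le I_{a,\le}}.$$
Test $T^K_{\Delta,r}(\omega^K)$ takes value right iff the matrix $D_{\Delta,r}(\omega^K)$ has a nonnegative row, and takes value left otherwise.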
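To make the recipe concrete, here is a minimal sketch of the right test for a deliberately simplified case: the observation scheme is Gaussian with unit covariance, and the "right" and "left" sets are finite collections of points, so that the Argmin over each pair of sets is trivial. In this case $\epsilon_{ij\Delta} = \exp\{-\|r_i - s_j\|_2^2/8\}$ and $\phi_{ij\Delta}(\omega) = \frac{1}{2}(r_i - s_j)^T(\omega - \frac{1}{2}(r_i + s_j))$. The function name `right_test` and the use of power iteration to get the leading singular pair are assumptions of this sketch; for general convex sets one would need a convex solver to find $(r_{ij\Delta}, s_{ij\Delta})$.

```python
import math

def right_test(right_pts, left_pts, observations, iters=200):
    """Gaussian N(mu, I) sketch of the right test for finite point sets.

    Returns (decision, sigma), where decision is "right" or "left" and
    sigma approximates the spectral norm of E = [eps_ij ** K].
    """
    K = len(observations)
    sq = lambda u, v: sum((a - b) ** 2 for a, b in zip(u, v))
    norm = lambda v: math.sqrt(sum(a * a for a in v))
    # K-th powers of the Hellinger affinities exp(-||r - s||^2 / 8);
    # for very distant points these may underflow to 0.0.
    E = [[math.exp(-K * sq(r, s) / 8.0) for s in left_pts] for r in right_pts]
    nr, nl = len(right_pts), len(left_pts)
    # Power iteration for the leading singular pair: sigma*g = E h, sigma*h = E^T g.
    h = [1.0] * nl
    sigma = 0.0
    for _ in range(iters):
        g = [sum(E[i][j] * h[j] for j in range(nl)) for i in range(nr)]
        ng = norm(g)
        g = [x / ng for x in g]
        h = [sum(E[i][j] * g[i] for i in range(nr)) for j in range(nl)]
        sigma = norm(h)
        h = [x / sigma for x in h]
    # Shifted K-repeated detectors D_ij = phi^K_ij(omega^K) - ln h_j + ln g_i.
    D = []
    for i, r in enumerate(right_pts):
        row = []
        for j, s in enumerate(left_pts):
            phi = sum(0.5 * sum((ri - si) * (w - 0.5 * (ri + si))
                                for ri, si, w in zip(r, s, omega))
                      for omega in observations)
            row.append(phi - math.log(h[j]) + math.log(g[i]))
        D.append(row)
    decision = "right" if any(all(d >= 0 for d in row) for row in D) else "left"
    return decision, sigma
```

Since all entries of $E$ are positive, the iterates $g$, $h$ stay positive, in agreement with the Perron-Frobenius property used above.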

Given $\delta > 0$, $\kappa > 0$, we call segment $\Delta = [a, b]$ $\delta$-good (right) if $a$ is lower-feasible, $b > a$, and $\sigma_{\Delta,r} \le \delta$. We call a $\delta$-good (right) segment $\Delta = [a, b]$ $\kappa$-maximal if the segment $[a, b - \kappa]$ is not $\delta$-good (right).

Left tests. The "mirror" version of the above is as follows. Given a segment $\Delta = [a, b]$ of positive length with upper-feasible $b$, we associate with this segment a left test (a function $T^K_{\Delta,l}(\omega^K)$ taking values right and left) and risk $\sigma_{\Delta,l} \ge 0$ as follows:
1. if $a$ is lower-infeasible, $T^K_{\Delta,l}(\cdot) \equiv$ right and $\sigma_{\Delta,l} = 0$;
2. if $a$ is lower-feasible, we set $T^K_{\Delta,l} \equiv T^K_{\Delta,r}$, $\sigma_{\Delta,l} = \sigma_{\Delta,r}$.

Given $\delta > 0$, $\kappa > 0$, we call segment $\Delta = [a, b]$ $\delta$-good (left) if $b$ is upper-feasible, $b > a$, and $\sigma_{\Delta,l} \le \delta$. We call a $\delta$-good (left) segment $\Delta = [a, b]$ $\kappa$-maximal if the segment $[a + \kappa, b]$ is not $\delta$-good (left).

Explanation: When $a < b$ and $a$ is lower-feasible, $b$ is upper-feasible, so that the sets
$$X^{a,\le} = \{x \in X : f(x) \le a\}, \qquad X^{b,\ge} = \{x \in X : f(x) \ge b\}$$
are nonempty, the right and the left tests $T^K_{\Delta,l}$, $T^K_{\Delta,r}$ are identical to each other and coincide with the minimal risk test, built as explained in Section 2.5.1, deciding, via stationary $K$-repeated observations, on the "location" of the distribution $p_{A(x)}$ underlying the observations: whether this location is left (left hypothesis stating that $x \in X$ and $f(x) \le a$, whence $A(x) \in \bigcup_{1\le i\le I_{a,\le}} A(Z_i^{a,\le})$), or right (right hypothesis stating that $x \in X$ and $f(x) \ge b$, whence $A(x) \in \bigcup_{1\le i\le I_{b,\ge}} A(Z_i^{b,\ge})$).

When $a$ is lower-feasible and $b$ is not upper-feasible, the right hypothesis is empty, and the right test associated with $[a, b]$, naturally, always accepts the left hypothesis; similarly, when $a$ is lower-infeasible and $b$ is upper-feasible, the left test associated with $[a, b]$ always accepts the right hypothesis. A segment $[a, b]$ with $a < b$ is $\delta$-good (left) if the right hypothesis corresponding to the segment is nonempty, and the left test $T^K_{\Delta,l}$ associated with $[a, b]$ decides on the right and the left hypotheses with risk $\le \delta$, and similarly for the $\delta$-good (right) segment $[a, b]$.

3.2.4 Building Bisection estimate

3.2.4.1 Control parameters

The control parameters of the Bisection estimate are
1. positive integer $L$, the maximum allowed number of bisection steps,
2. tolerances $\delta \in (0, 1)$ and $\kappa > 0$.

3.2.4.2 Bisection estimate: Construction

The estimate of $f(x)$ ($x$ is the signal underlying our observations: $\omega_t \sim p_{A(x)}$) is given by the following recurrence run on the observation $\bar\omega^K = (\bar\omega_1, ..., \bar\omega_K)$ at our disposal:

1. Initialization. We find a valid upper bound $b_0$ on $\max_{u\in X} f(u)$ and valid lower bound $a_0$ on $\min_{u\in X} f(u)$ and set $\Delta_0 = [a_0, b_0]$. We assume w.l.o.g. that $a_0 < b_0$; otherwise the estimation is trivial.
Note: $f(x) \in \Delta_0$.

2. Bisection Step $\ell$, $1 \le \ell \le L$. Given the localizer $\Delta_{\ell-1} = [a_{\ell-1}, b_{\ell-1}]$ with $a_{\ell-1} < b_{\ell-1}$, we act as follows:

a) We set $c_\ell = \frac{1}{2}[a_{\ell-1} + b_{\ell-1}]$. If $c_\ell$ is not upper-feasible, we set $\Delta_\ell = [a_{\ell-1}, c_\ell]$ and pass to 2e, and if $c_\ell$ is not lower-feasible, we set $\Delta_\ell = [c_\ell, b_{\ell-1}]$ and pass to 2e.
Note: When the rule requires us to pass to 2e, the set $\Delta_{\ell-1} \setminus \Delta_\ell$ does not intersect with $f(X)$; in particular, in such a case $f(x) \in \Delta_\ell$ provided that $f(x) \in \Delta_{\ell-1}$.

b) When $c_\ell$ is both upper- and lower-feasible, we check whether the segment $[c_\ell, b_{\ell-1}]$ is $\delta$-good (right). If it is not the case, we terminate and claim that $f(x) \in \bar\Delta := \Delta_{\ell-1}$; otherwise find $v_\ell$, $c_\ell < v_\ell \le b_{\ell-1}$, such that the segment $\Delta_{\ell,\mathrm{rg}} = [c_\ell, v_\ell]$ is $\delta$-good (right) $\kappa$-maximal.
Note: In terms of the outline of our strategy presented in Section 3.2.1, termination when the segment $[c_\ell, b_{\ell-1}]$ is not $\delta$-good (right) corresponds to the case when the current localizer is too small to allow for a "no-man's land" wide enough to ensure low-risk decision on the left and the right hypotheses. To find $v_\ell$, we check the candidates $v_\ell^k = b_{\ell-1} - k\kappa$, $k = 0, 1, ...$, until arriving for the first time at a segment $[c_\ell, v_\ell^k]$ which is not $\delta$-good (right), and take as $v_\ell$ the quantity $v_\ell^{k-1}$ (because $k \ge 1$, the resulting value of $v_\ell$ is well-defined and clearly meets the above requirements).

c) Similarly, we check whether the segment $[a_{\ell-1}, c_\ell]$ is $\delta$-good (left). If it is not the case, we terminate and claim that $f(x) \in \bar\Delta := \Delta_{\ell-1}$; otherwise find $u_\ell$, $a_{\ell-1} \le u_\ell < c_\ell$, such that the segment $\Delta_{\ell,\mathrm{lf}} = [u_\ell, c_\ell]$ is $\delta$-good (left) $\kappa$-maximal.
Note: The rules for building $u_\ell$ are completely similar to those for $v_\ell$.

d) We compute $T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K)$ and $T^K_{\Delta_{\ell,\mathrm{lf}},l}(\bar\omega^K)$. If $T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K) = T^K_{\Delta_{\ell,\mathrm{lf}},l}(\bar\omega^K)$ ("consensus"), we set
$$\Delta_\ell = [a_\ell, b_\ell] = \begin{cases} [c_\ell, b_{\ell-1}], & T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K) = \text{right}, \\ [a_{\ell-1}, c_\ell], & T^K_{\Delta_{\ell,\mathrm{rg}},r}(\bar\omega^K) = \text{left} \end{cases} \tag{3.16}$$
and pass to 2e. Otherwise ("disagreement") we terminate and claim that $f(x) \in \bar\Delta = [u_\ell, v_\ell]$.

e) We pass to step $\ell + 1$ when $\ell < L$; otherwise we terminate with the claim that $f(x) \in \bar\Delta := \Delta_L$.

3. Output of the estimation procedure is the segment $\bar\Delta$ built upon termination and claimed to contain $f(x)$ (see rules 2b–2e); the midpoint of this segment is the estimate of $f(x)$ yielded by our procedure.

3.2.5 Bisection estimate: Main result

Our main result on Bisection is as follows:

Proposition 3.4. Consider the situation described at the beginning of Section 3.2.2, and let $\epsilon \in (0, 1/2)$ be given. Then
(i) [reliability of Bisection] For every positive integer $L$ and every $\kappa > 0$, Bisection with control parameters
$$L, \quad \delta = \frac{\epsilon}{2L}, \quad \kappa \tag{3.17}$$
is $(1 - \epsilon)$-reliable: for every $x \in X$, the $p_{A(x)}$-probability of the event $f(x) \in \bar\Delta$ ($\bar\Delta$ is the Bisection output as defined above) is at least $1 - \epsilon$.
(ii) [near-optimality] Let $\rho > 0$ and positive integer $\bar K$ be such that "in nature" there exists a $(\rho, \epsilon)$-reliable estimate $\widehat f(\cdot)$ of $f(x)$, $x \in X := \bigcup_{i\le I} X_i$, via stationary $\bar K$-repeated observation $\omega^{\bar K}$ with $\omega_k \sim p_{A(x)}$, $1 \le k \le \bar K$. Given $\widehat\rho > 2\rho$, the Bisection estimate utilizing stationary $K$-repeated observations, with
$$K \ge \frac{2\ln(2LNI/\epsilon)}{\ln\left(\frac{1}{4\epsilon(1-\epsilon)}\right)}\,\bar K, \tag{3.18}$$
the control parameters of the estimate being
$$L = \left\lceil \log_2\left(\frac{b_0 - a_0}{2\widehat\rho}\right)\right\rceil, \quad \delta = \frac{\epsilon}{2L}, \quad \kappa = \widehat\rho - 2\rho, \tag{3.19}$$
is $(\widehat\rho, \epsilon)$-reliable. Note that $K$ is only "slightly larger" than $\bar K$.

For proof, see Section 3.6.3. Note that the observation sample size $K$ of the Bisection estimate as given by (3.18) exceeds $\bar K$ by a factor which is at most logarithmic in $N$, $I$, $L$, and $1/\epsilon$; note also that $L$ is just logarithmic in $1/\widehat\rho$. Assume, e.g., that for some $\gamma > 0$ "in nature" there exist $(\epsilon^\gamma, \epsilon)$-reliable estimates, parameterized by $\epsilon \in (0, 1/2)$, utilizing $\bar K = \bar K(\epsilon)$ observations. Then Bisection with the volume of observation and control parameters given by (3.18) and (3.19), where $\widehat\rho = 3\rho = 3\epsilon^\gamma$ and $\bar K = \bar K(\epsilon)$, is $(3\epsilon^\gamma, \epsilon)$-reliable and requires $K = K(\epsilon)$-repeated observations with $\lim_{\epsilon\to+0} K(\epsilon)/\bar K(\epsilon) \le 2$.
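The control parameters (3.19), the sample size bound (3.18), and the recurrence of Section 3.2.4.2 can be put together in a short sketch. Everything test-specific is hidden behind caller-supplied oracles: `upper_feasible(c)`, `lower_feasible(c)`, `good_right(lo, hi)`, `good_left(lo, hi)` (which are assumed to return False for segments of nonpositive length), and `test_right(lo, hi)`, `test_left(lo, hi)` returning the strings "right" or "left". These names, the reading of the brackets in (3.19) as rounding up, and the safeguard `max(1, ...)` are assumptions of this illustration.

```python
import math

def bisection_parameters(a0, b0, rho_hat, rho, eps, N, I, K_bar):
    """Control parameters as in (3.19) and the smallest K allowed by (3.18)."""
    L = max(1, math.ceil(math.log2((b0 - a0) / (2 * rho_hat))))  # assumed rounding up
    delta = eps / (2 * L)
    kappa = rho_hat - 2 * rho
    factor = 2 * math.log(2 * L * N * I / eps) / math.log(1 / (4 * eps * (1 - eps)))
    return L, delta, kappa, math.ceil(factor * K_bar)

def bisection_localizer(a0, b0, L, kappa, upper_feasible, lower_feasible,
                        good_right, good_left, test_right, test_left):
    """Run at most L bisection steps; return the final localizer (lo, hi)."""
    a, b = a0, b0
    for _ in range(L):
        c = 0.5 * (a + b)
        if not upper_feasible(c):            # rule 2a: f never reaches c on X
            b = c
            continue
        if not lower_feasible(c):            # rule 2a: f never drops to c on X
            a = c
            continue
        if not good_right(c, b) or not good_left(a, c):
            return a, b                      # rules 2b, 2c: localizer too narrow
        k = 0                                # kappa-maximal delta-good (right) [c, v]
        while good_right(c, b - (k + 1) * kappa):
            k += 1
        v = b - k * kappa
        k = 0                                # kappa-maximal delta-good (left) [u, c]
        while good_left(a + (k + 1) * kappa, c):
            k += 1
        u = a + k * kappa
        right_says, left_says = test_right(c, v), test_left(u, c)
        if right_says != left_says:          # rule 2d: disagreement
            return u, v
        if right_says == "right":            # rule 2d: consensus, see (3.16)
            a = c
        else:
            b = c
    return a, b                              # rule 2e after L steps
```

The estimate of $f(x)$ is then the midpoint of the returned segment. In a real implementation the oracles would wrap the right and left tests of Section 3.2.3.1, all evaluated on the same observation $\bar\omega^K$.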

3.2.6 Illustration

To illustrate bisection-based estimation of an $N$-convex function, consider the following situation.4 There are $M$ devices ("receivers") recording a signal $u$ known to belong to a given convex compact and nonempty set $U \subset \mathbf{R}^n$; the output of the $i$-th receiver is the vector
$$y_i = A_i u + \sigma\xi \in \mathbf{R}^m \quad [\xi \sim \mathcal{N}(0, I_m)]$$
where $A_i$ are given $m \times n$ matrices (you may think of $M$ allowed positions for a single receiver, and of $y_i$ as the output of the receiver when the latter is in position $i$). Our observation $\omega$ is one of the vectors $y_i$, $1 \le i \le M$, with index $i$ unknown to us ("we observe a noisy record of a signal, but do not know the position in which this record was taken"). Given $\omega$, we want to recover a given linear function $g(u) = e^T u$ of the signal.

4 Our goal is to illustrate a mathematical construction rather than to work out a particular application; the reader is welcome to invent a plausible "covering story."

The problem can be modeled as follows. Consider $M$ sets
$$X_i = \{x = [x_1; ...; x_M] \in \mathbf{R}^{Mn} = \underbrace{\mathbf{R}^n \times ... \times \mathbf{R}^n}_{M} : x_j = 0,\ j \ne i;\ x_i \in U\}$$
along with the linear mapping
$$A[x_1; ...; x_M] = \sum_{i=1}^M A_i x_i : \mathbf{R}^{Mn} \to \mathbf{R}^m$$
and linear function
$$f([x_1; ...; x_M]) = e^T \sum_i x_i : \mathbf{R}^{Mn} \to \mathbf{R}.$$
Let $\mathcal{X}$ be a convex compact set in $\mathbf{R}^{Mn}$ containing all the sets $X_i$, $1 \le i \le M$. Observe that the problem we are interested in is nothing but the problem of recovering $f(x)$ via observation
$$\omega = Ax + \sigma\xi, \quad \xi \sim \mathcal{N}(0, I_m), \tag{3.20}$$
where the unknown signal $x$ is known to belong to the union $\bigcup_{i=1}^M X_i$ of known convex compact sets $X_i$. As a result, our problem can be solved via the machinery developed in this section.

Numerical illustration. In the numerical experiments to be reported, we use $n = 128$, $m = 64$ and $M = 2$. The data is generated as follows:
• The set $U \subset \mathbf{R}^{128}$ of candidate signals is comprised of restrictions onto the equidistant ($n = 128$)-point grid in $[0, 1]$ of twice differentiable functions $h(t)$ of continuous argument $t \in [0, 1]$ satisfying the relations $|h(0)| \le 1$, $|h'(0)| \le 1$, $|h''(t)| \le 1$, $0 \le t \le 1$. For the discretized signal $u = [h(0); h(1/n); ...; h(1 - 1/n)]$ this translates into the system of convex constraints
$$|u_1| \le 1, \quad n|u_2 - u_1| \le 1, \quad n^2|u_{i+1} - 2u_i + u_{i-1}| \le 1, \ 2 \le i \le n - 1.$$

• We look to estimate the discretized counterpart of the integral $\int_0^1 h(t)\,dt$, specifically, the quantity $e^T u = \alpha\sum_{i=1}^n u_i$. The normalizing constant $\alpha$ is selected to ensure $\max_{u\in U} e^T u = 1$, $\min_{u\in U} e^T u = -1$, allowing us to run Bisection over $\Delta_0 = [-1, 1]$.
• We generate $A_1$ as an ($m = 64$)$\times$($n = 128$) matrix with singular values $\sigma_i = \theta^{i-1}$, $1 \le i \le m$, with $\theta$ selected from the requirement $\sigma_m = 0.1$. The system of left singular vectors of $A_1$ is obtained from the system of basic orths in $\mathbf{R}^n$ by random rotation. Matrix $A_2$ was selected as $A_2 = A_1 S$, where $S$ is a symmetry w.r.t. the axis $e$, that is,
$$Se = e \ \ \&\ \ Sh = -h \text{ whenever } h \text{ is orthogonal to } e. \tag{3.21}$$
Signals $u$ underlying the observations are selected at random in $U$.
• The reliability $1 - \epsilon$ of the estimate is set to 0.99, while the maximal allowed number $L$ of Bisection steps is set to 8. We use single observation (3.20) (i.e., use $K = 1$ in our general scheme) with $\sigma = 0.01$.

The results of our experiments are presented in Table 3.1.

Characteristic          min     median   mean    max
error bound             0.008   0.015    0.014   0.015
actual error            0.001   0.002    0.002   0.005
# of Bisection steps    5       7.00     6.60    8

Table 3.1: Data of 10 Bisection experiments, $\sigma = 0.01$. In the table, "error bound" is the half-length of the final localizer, which is a 0.99-reliable upper bound on the estimation error; the "actual error" is the actual estimation error.

Observe that in the considered problem there exists an intrinsic obstacle for high accuracy estimation even in the case of noiseless observations and invertible matrices $A_i$, $i = 1, 2$ (recall that we are in the case of $M = 2$). Indeed, assume that there exist $u \in U$, $u' \in U$ such that $A_1 u = A_2 u'$ and $e^T u \ne e^T u'$. Since we do not know which of the matrices, $A_1$ or $A_2$, underlies the observation and $A_1 u = A_2 u'$, there is no way to distinguish between the two cases we have described, implying that the quantity
$$\rho = \max_{u, u' \in U}\left\{\tfrac{1}{2}|e^T(u - u')| : A_1 u = A_2 u'\right\} \tag{3.22}$$

is a lower bound on the worst-case, over signals from U , error of a reliable recovery of eT u, independently of how small the noise is. In the reported experiments, we used A2 = A1 S with S linked to e (see (3.21)); with this selection of S, e, and A2 , and invertible A1 , the lower bound ρ would be trivial—just zero. Note that the selected A1 is not invertible, resulting in a positive ρ. However, computation shows that with our data, this positive ρ is negligibly small (about 2.0e − 5). When we destroy the link between e and S, the estimation problem can become intrinsically more difficult, and the performance of our estimation procedure can deteriorate. Let us look at what happens when we keep A1 and A2 = A1 S exactly as they are, but replace the linear form to be estimated with eT u, e being randomly selected.5 The corresponding results are presented in Table 3.2. The data in the

5 In the experiments to be reported, e is selected as follows: we start with a random unit vector drawn from the uniform distribution on the unit sphere in Rn and then normalize it to

205

FROM HYPOTHESIS TESTING TO ESTIMATING FUNCTIONALS

Characteristic error bound actual error # of Bisection steps

min 0.057 0.001 1

median 0.457 0.297 1.00

mean 0.441 0.350 2.20

max 1.000 1.000 5

“Difficult” signals, data over 10 experiments ρ error bound

0.022

0.028

0.154

0.170

0.213

0.248

0.250

0.500

0.605

0.924

0.057

0.063

0.219

0.239

0.406

0.508

0.516

0.625

0.773

1.000

Error bound vs. ρ, experiments sorted according to the values of ρ Characteristic error bound actual error # of Bisection steps

min 0.016 0.005 1

median 0.274 0.066 2.00

mean 0.348 0.127 2.80

max 1.000 0.556 7

Random signals, data over 10 experiments ρ error bound

0.010

0.085

0.177

0.243

0.294

0.334

0.337

0.554

0.630

0.762

0.016

0.182

0.376

0.438

0.602

0.029

0.031

0.688

0.125

1.000

Error bound vs. ρ, experiments sorted according to the values of ρ

Table 3.2: Results of experiments with randomly selected linear form, σ = 0.01.

top part of the table match “difficult” signals u, that is, those participating in forming the lower bound (3.22) on the recovery error, while the data in the bottom part of the table correspond to randomly selected signals.⁶ Observe that when estimating a randomly selected linear form, the error bounds indeed deteriorate, as compared to those in Table 3.1. We see also that the resulting error bounds are in reasonably good agreement with the lower bound ρ, illustrating the basic property of nearly optimal estimates: the guaranteed performance of an estimate can be bad or good, but it is always nearly as good as is possible under the circumstances. As for the actual estimation errors, in some experiments they are significantly smaller than the error bounds, especially when random signals are used.

3.2.7 Estimating N-convex functions: An alternative

Observe that the problem of estimating an N-convex function on the union of convex sets posed in Section 3.2.2 can be processed not only by Bisection. An alternative is as follows. In the notation of Section 3.2.2, we start with computing the range ∆ of function f on the set $X = \bigcup_{i\le I} X_i$, that is, we compute the quantities

$$\underline{f} = \min_{x\in X} f(x), \qquad \overline{f} = \max_{x\in X} f(x)$$

have $\max_{u\in U} e^T u - \min_{u\in U} e^T u = 2$.

⁶ Precisely, to generate a signal u, we draw a point $\bar{u}$ at random from the uniform distribution on the sphere of radius $10\sqrt{n}$, and take as u the point of U that is $\|\cdot\|_2$-closest to $\bar{u}$.

CHAPTER 3

and set $\Delta = [\underline{f}, \overline{f}]$. We assume that this segment is not a singleton; otherwise estimating f is trivial. Let $L \in \mathbf{Z}_+$ and let $\delta_L = (\overline{f} - \underline{f})/L$ be the desired estimation accuracy. We split ∆ into L segments $\Delta_\ell$ of equal length $\delta_L$ and consider the sets

$$X_{i\ell} = \{x \in X_i : f(x) \in \Delta_\ell\}, \quad 1 \le i \le I,\ 1 \le \ell \le L.$$

Since f is N-convex, each set $X_{i\ell}$ is a union of $M_{i\ell} \le N^2$ convex compact sets $X_{i\ell j}$, $1 \le j \le M_{i\ell}$. Thus, we have at our disposal a collection of at most $ILN^2$ convex compact sets; let us eliminate from this collection empty sets and arrange the nonempty ones into a sequence $Y_1, ..., Y_M$, $M \le ILN^2$. Note that $\bigcup_{s\le M} Y_s = X$, so that the goal set in Section 3.2.2 can be reformulated as follows:

For some unknown x known to belong to $X = \bigcup_{s=1}^{M} Y_s$, we have at our disposal observation $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. $\omega_t \sim p_{A(x)}(\cdot)$; we aim at estimating the quantity f(x) from this observation.

The sets $Y_s$ give rise to M hypotheses $H_1, ..., H_M$ on the distribution of the observations $\omega_t$, $1 \le t \le K$; according to $H_s$, $\omega_t \sim p_{A(x)}(\cdot)$ with some $x \in Y_s$. Let us define a closeness C on the set of our M hypotheses as follows. Given $s \le M$, the set $Y_s$ is some $X_{i(s)\ell(s)j(s)}$; we say that two hypotheses, $H_s$ and $H_{s'}$, are C-close if the segments $\Delta_{\ell(s)}$ and $\Delta_{\ell(s')}$ intersect. Observe that when $H_s$ and $H_{s'}$ are not C-close, the convex compact sets $Y_s$ and $Y_{s'}$ do not intersect, since the values of f on $Y_s$ belong to $\Delta_{\ell(s)}$, the values of f on $Y_{s'}$ belong to $\Delta_{\ell(s')}$, and the segments $\Delta_{\ell(s)}$ and $\Delta_{\ell(s')}$ do not intersect.

Now let us apply to the hypotheses $H_1, ..., H_M$ our machinery for testing up to closeness C; see Section 2.5.2. Assuming that whenever $H_s$ and $H_{s'}$ are not C-close, the risks $\epsilon_{ss'}$ defined in Section 2.5.2.2 are < 1,⁷ we, given tolerance ϵ ∈ (0, 1), can find K = K(ϵ) such that stationary K-repeated observation $\omega^K$ allows us to decide (1 − ϵ)-reliably on $H_1, ..., H_M$ up to closeness C. As applied to $\omega^K$, the corresponding test $T^K$ will accept some (perhaps, none) of the hypotheses; let the indexes of the accepted hypotheses form the set $S = S(\omega^K)$. We convert S into an estimate $\hat{f}(\omega^K)$ of f(x), $x \in X = \bigcup_{s\le M} Y_s$ being the signal underlying our observation, as follows:

• when S = ∅, the estimate is, say, $(\underline{f} + \overline{f})/2$;
• when S is nonempty, we take the union ∆(S) of the segments $\Delta_{\ell(s)}$, s ∈ S, and our estimate is the average of the largest and the smallest elements of ∆(S).

It is immediately seen that if the signal x underlying our stationary K-repeated observation $\omega^K$ belongs to some $Y_{s_*}$, so that the hypothesis $H_{s_*}$ is true, and the outcome S of $T^K$ contains $s_*$ and is such that for all s ∈ S the hypotheses $H_s$ and $H_{s_*}$ are C-close to each other, we have $|f(x) - \hat{f}(\omega^K)| \le \delta_L$. Note that since the C-risk of $T^K$ is ≤ ϵ, the $p_{A(x)}$-probability to get such a “good” outcome, and thus to get $|f(x) - \hat{f}(\omega^K)| \le \delta_L$, is at least 1 − ϵ.

⁷ In standard simple o.s.’s, this is the case whenever for s, s′ in question the images of $Y_s$ and $Y_{s'}$ under the mapping x ↦ A(x) do not intersect. Because for s, s′ in question, $Y_s$ and $Y_{s'}$ do not intersect, this definitely is the case when A(·) is an embedding.
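The rule converting the set S of accepted hypotheses into an estimate can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the test $T^K$ itself is not implemented; the sketch assumes we are simply handed the segment indices ℓ(s) of the accepted hypotheses.

```python
def segments(f_lo, f_hi, L):
    """Split [f_lo, f_hi] into L segments of equal length delta_L."""
    delta = (f_hi - f_lo) / L
    return [(f_lo + k * delta, f_lo + (k + 1) * delta) for k in range(L)]


def estimate_from_accepted(f_lo, f_hi, L, accepted_levels):
    """Estimate f(x) from the 1-based segment indices l(s) of accepted hypotheses.

    Empty acceptance set: return the midpoint of the whole range.
    Otherwise: return the average of the smallest and the largest points
    of the union of the accepted segments.
    """
    levels = sorted(set(accepted_levels))
    if not levels:
        return 0.5 * (f_lo + f_hi)
    segs = segments(f_lo, f_hi, L)
    lowest = segs[levels[0] - 1][0]    # left end of the lowest accepted segment
    highest = segs[levels[-1] - 1][1]  # right end of the highest accepted segment
    return 0.5 * (lowest + highest)
```

For example, with range [1, 10] and L = 8, accepting the two adjacent segments 3 and 4 yields the common endpoint of these segments as the estimate, which is within $\delta_L$ of every point of either segment.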

3.2.7.1 Numerical illustration

Our illustration deals with the situation when I = 1, $X = X_1$ is a convex compact set, and f(x) is fractional-linear: $f(x) = a^T x / c^T x$ with denominator positive on X. Specifically, assume we are given noisy measurements of voltages $V_i$ at some nodes i and currents $I_{ij}$ in some arcs (i, j) of an electric circuit, and want to recover the resistance of a particular arc $(i_*, j_*)$:

$$r_{i_* j_*} = \frac{V_{j_*} - V_{i_*}}{I_{i_* j_*}}.$$

The observation noises are assumed to be N (0, σ 2 ) and independent across the measurements. In our experiment, we work with the data as follows:

[Figure: the electric circuit, with the input node and the output node marked]

x = [voltages at nodes; currents in arcs]
Ax = [observable voltages; observable currents]

• Currents are measured in all arcs except for a, b
• Voltages are measured at all nodes except for c
• We want to recover the resistance of arc b
• X is given by the following constraints:
  – conservation of current, except for input/output nodes;
  – zero voltage at input node, nonnegative currents;
  – current in arc b at least 1, total of currents at most 33;
  – Ohm’s Law, resistances of arcs between 1 and 10.
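Since the current in the arc of interest is positive on X, a band of resistance values [lo, hi] corresponds to a pair of linear inequalities lo·I ≤ V ≤ hi·I on the voltage drop V and the current I, which is why each level band of the fractional-linear function is convex. The sketch below illustrates this; the function name and default arguments are hypothetical, and the default range [1, 10] is the range of admissible arc resistances from the description of X.

```python
def band_index(v_drop, current, r_lo=1.0, r_hi=10.0, L=8):
    """Return the 1-based index of the first resistance band containing the point.

    The band [a_prev, a_next] of values of v_drop / current is tested through
    the equivalent linear inequalities a_prev * current <= v_drop <= a_next * current,
    which is valid because current > 0 is assumed.
    Returns None when the point lies outside all bands.
    """
    if current <= 0:
        raise ValueError("the sketch assumes a positive current")
    delta = (r_hi - r_lo) / L
    for level in range(1, L + 1):
        a_prev = r_lo + (level - 1) * delta
        a_next = r_lo + level * delta
        if a_prev * current <= v_drop <= a_next * current:
            return level
    return None
```

A point on the common boundary of two adjacent bands belongs to both of them; the sketch reports the lower band in that case.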

We are in the situation of N = 1 and I = 1, implying M = L. When using L = 8, the projections of the sets $Y_s$, 1 ≤ s ≤ L = 8, onto the 2D plane of variables $(V_{j_*} - V_{i_*}, I_{i_* j_*})$ are the “stripes” shown below:

[Figure: the stripes in the plane with axes $V_{j_*} - V_{i_*}$ and $I_{i_* j_*}$]

The range of the unknown resistance turns out to be ∆ = [1, 10]. We set ϵ = 0.01, and instead of looking for K such that the K-repeated observation allows us to recover 0.99-reliably the resistance in the arc of interest within accuracy |∆|/L, we look for the largest observation noise σ allowing us to achieve the desired recovery with a single observation. The results for L = 8, 16, 32 are as follows:

L | $\delta_L$ | σ (underlined) | $\sigma_{\rm opt}/\sigma \le$ (underlined) | σ (slanted) | $\sigma_{\rm opt}/\sigma \le$ (slanted)
8 | 9/8 ≈ 1.13 | 0.024 | 1.31 | 0.031 | 1.01
16 | 9/16 ≈ 0.56 | 0.010 | 1.31 | 0.013 | 1.06
32 | 9/32 ≈ 0.28 | 0.005 | 1.33 | 0.006 | 1.08

In the above table:

• $\sigma_{\rm opt}$ is the largest σ for which “in nature” there exists a test deciding on $H_1, ..., H_L$ with C-risk ≤ 0.01;
• Underlined data: Risks $\epsilon_{ss'}$ of pairwise tests are bounded via risks of optimal detectors; C-risk of T is bounded by

$$\left\| \left[\epsilon_{ss'}\chi_{ss'}\right]_{s,s'=1}^{L} \right\|_{2,2}, \qquad \chi_{ss'} = \begin{cases} 1, & (s,s') \notin C,\\ 0, & (s,s') \in C; \end{cases}$$

see Proposition 2.29;
• “Slanted” data: Risks $\epsilon_{ss'}$ of pairwise tests are bounded via the error function; C-risk of T is bounded by

$$\max_{s} \sum_{s':(s,s')\notin C} \epsilon_{ss'}$$

(it is immediately seen that in the case of Gaussian o.s., this indeed is a legitimate risk bound).
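Both bounds on the C-risk are simple functions of the matrix of pairwise risks. The NumPy sketch below evaluates them for the case M = L considered here, where hypotheses s and s′ are C-close exactly when |s − s′| ≤ 1 (adjacent segments share an endpoint). The function names are hypothetical, and any matrix of pairwise risks passed to them is the user's own input, not data from the experiment.

```python
import numpy as np


def non_closeness_mask(L):
    """chi[s, s'] = 1 when segments s and s' do not intersect (|s - s'| > 1), else 0."""
    idx = np.arange(L)
    return (np.abs(idx[:, None] - idx[None, :]) > 1).astype(float)


def spectral_norm_bound(eps):
    """Spectral norm of the matrix [eps_{ss'} * chi_{ss'}]."""
    chi = non_closeness_mask(eps.shape[0])
    return float(np.linalg.norm(eps * chi, 2))


def row_sum_bound(eps):
    """max over s of the sum of eps_{ss'} over s' not C-close to s."""
    chi = non_closeness_mask(eps.shape[0])
    return float((eps * chi).sum(axis=1).max())
```

For a symmetric matrix of pairwise risks, the spectral norm never exceeds the maximal row sum, so the two quantities differ only in which pairwise risks (detector-based or error-function-based) are plugged in and which general bound applies.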

Figure 3.3: A circuit (nine nodes and 16 arcs). a: arc of interest; b: arcs with measured currents; c: input node where external current and voltage are measured.

3.2.7.2 Estimating dissipated power

The alternative approach to estimating N-convex functions proposed in Section 3.2.7 can be combined with the quadratic lifting described in Section 2.9 to yield, under favorable circumstances, estimates of quadratic and quadratic fractional functions. We are about to consider an instructive example of this type.

Figure 3.3 represents a DC circuit. We have access to repeated noisy measurements of currents in some arcs and voltages at some nodes, with the voltage of the ground node equal to 0. The arcs are oriented; this orientation, however, is of no relevance in our context and therefore is not displayed. Our goal is to use these observations to estimate the power dissipated in a given “arc of interest.” The a priori information is as follows:

• the (unknown) arc resistances are known to belong to a given range [r, R], with 0 < r < R < ∞;
• the currents and the voltages are linked by Kirchhoff’s laws:
  – at every node, the sum of currents in the outgoing arcs is equal to the sum of currents in the incoming arcs plus the external current at the node. In our circuit, there are just two external currents, one at the ground node and one at the input node c;
  – the voltages and the currents are linked by Ohm’s law: for every (inner) arc γ, we have $I_\gamma r_\gamma = V_{j(\gamma)} - V_{i(\gamma)}$, where $I_\gamma$ is the current in the arc, $r_\gamma$ is the arc’s resistance, $V_s$ is the voltage at node s, and i(γ), j(γ) are the initial and the terminal nodes linked by arc γ;
• magnitudes of all currents and voltages are bounded by 1.

We assume that the measurements of observable currents and voltages are affected by zero mean Gaussian noise with scalar covariance matrix $\theta^2 I$, with unknown θ from a given range $[\underline{\sigma}, \overline{\sigma}]$.

Processing the problem. We specify the “signal” underlying our observation as a collection u of the voltages at nine nodes and currents $I_\gamma$ in 16 (inner) arcs γ of the circuit, augmented by the external current $I_o$ at the input node (so that $-I_o$ is the external current at the ground node). Thus, our single-time observation is

$$\zeta = Au + \theta\xi, \tag{3.23}$$

where A extracts from u four entries (currents in two arcs b and external current and voltage at the input node c), $\xi \sim \mathcal{N}(0, I_4)$, and $\theta \in [\underline{\sigma}, \overline{\sigma}]$. Our a priori information on u states that u belongs to the compact set U given by the quadratic constraints, namely, as follows:

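$$
U = \left\{ u = \{I_\gamma, I_o, V_i\} :
\begin{array}{ll}
I_\gamma^2 \le 1,\ V_i^2 \le 1\ \ \forall \gamma, i;\quad u^T J^T J u \le 0 & \\[4pt]
\left.\begin{array}{l}
{[V_{j(\gamma)} - V_{i(\gamma)}]^2}/R - I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] \le 0\\
I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] - {[V_{j(\gamma)} - V_{i(\gamma)}]^2}/r \le 0
\end{array}\right\}\ \forall\gamma & (a)\\[10pt]
\left.\begin{array}{l}
r I_\gamma^2 - I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] \le 0\\
I_\gamma [V_{j(\gamma)} - V_{i(\gamma)}] - R I_\gamma^2 \le 0
\end{array}\right\}\ \forall\gamma & (b)
\end{array}
\right\} \tag{3.24}
$$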

where Ju = 0 expresses the first Kirchhoff’s law, and quadratic constraints (a) and (b) account for Ohm’s law in the situation when we do not know the exact resistances but only their range [r, R]. Note that groups (a) and (b) of constraints in (3.24) are “logical consequences” of each other, and thus one of the groups seems to be redundant. However, on closer inspection, quadratic inequalities valid on U do not tighten the outer approximation $\mathcal{Z}$ of Z[U] and thus are redundant in our context only when these inequalities can be obtained from the inequalities we do include into the description of $\mathcal{Z}$ “in a linear fashion,” that is, by taking weighted sums with nonnegative coefficients. This is not how (b) is obtained from (a). As a result, to get a smaller $\mathcal{Z}$, it makes sense to keep both (a) and (b).

The dissipated power we are interested in estimating is the quadratic function

$$f(u) = I_{\gamma_*}[V_{j_*} - V_{i_*}] = [u;1]^T G [u;1]$$

where $\gamma_* = (i_*, j_*)$ is the arc of interest, and $G \in \mathbf{S}^{n+1}$, n = dim u, is a properly built matrix. In order to build an estimate, we “lift quadratically” the observations, $\zeta \mapsto \omega = (\zeta, \zeta\zeta^T)$, and pass from the domain U of actual signals to the outer approximation $\mathcal{Z}$ of the quadratic lifting of U:

$$\mathcal{Z} := \{Z \in \mathbf{S}^{n+1} : Z \succeq 0,\ Z_{n+1,n+1} = 1,\ \mathrm{Tr}(Q_s Z) \le c_s,\ 1 \le s \le S\} \ \supset\ \{[u;1][u;1]^T : u \in U\}.$$

Here the matrix $Q_s \in \mathbf{S}^{n+1}$ represents the left-hand side $F_s(u)$ of the s-th quadratic constraint in the description (3.24) of U: $F_s(u) \equiv [u;1]^T Q_s [u;1]$, and $c_s$ is the right-hand side of the s-th constraint. We process the problem similarly to what was done in Section 3.2.7.1, where our goal was to estimate a fractional-linear function. Specifically,

1. We compute the range of f on U; the smallest value $\underline{f}$ of f on U clearly is zero, and an upper bound on the maximum of f(u) over u ∈ U is the optimal value in the convex optimization problem

$$\overline{f} = \max_{Z\in\mathcal{Z}} \mathrm{Tr}(GZ).$$

2. Given a positive integer L, we split the range $[\underline{f}, \overline{f}]$ into L segments $\Delta_\ell = [a_{\ell-1}, a_\ell]$ of equal length $\delta_L = (\overline{f} - \underline{f})/L$ and define convex compact sets

$$\mathcal{Z}_\ell = \{Z \in \mathcal{Z} : a_{\ell-1} \le \mathrm{Tr}(GZ) \le a_\ell\}, \quad 1 \le \ell \le L,$$

so that

$$u \in U,\ f(u) \in \Delta_\ell \ \Rightarrow\ [u;1][u;1]^T \in \mathcal{Z}_\ell, \quad 1 \le \ell \le L.$$

3. We specify L quadratically constrained hypotheses $H_1, ..., H_L$ on the distribution of observation (3.23), with $H_\ell$ stating that $\zeta \sim \mathcal{N}(Au, \theta^2 I_4)$ with some u ∈ U satisfying $f(u) \in \Delta_\ell$ (so that $[u;1][u;1]^T \in \mathcal{Z}_\ell$), and θ belongs to the above segment $[\underline{\sigma}, \overline{\sigma}]$. We equip our hypotheses with a closeness relation C; specifically, we consider $H_\ell$ and $H_{\ell'}$ C-close if and only if the segments $\Delta_\ell$ and $\Delta_{\ell'}$ intersect.

4. We use Propositions 2.43.ii and 2.40 to build detectors $\phi_{\ell\ell'}$ quadratic in ζ for the families of distributions obeying $H_\ell$ and $H_{\ell'}$, respectively, along with upper bounds $\epsilon_{\ell\ell'}$ on the risks of these detectors. Then we use the machinery from Section 2.5.2 to find the smallest K and a test $T_C^K$, based on a stationary K-repeated version of observation (3.23), able to decide on $H_1, ..., H_L$ with C-risk ≤ ϵ, where ϵ ∈ (0, 1) is a given tolerance.

Finally, given stationary K-repeated observation (3.23), we apply to it test $T_C^K$, look at the hypotheses, if any, accepted by the test, and build the union ∆ of the corresponding segments $\Delta_\ell$. If ∆ = ∅, we estimate f(u) as the midpoint of the power range $[\underline{f}, \overline{f}]$; otherwise the estimate is the mean of the largest and the smallest points in ∆. It is easily seen that for this estimate, the probability for the estimation error to be > $\delta_L$ is ≤ ϵ.

The numerical results we present here correspond to the circuit presented in Figure 3.3. We set $\overline{\sigma} = 0.01$, $\underline{\sigma} = \overline{\sigma}/\sqrt{2}$, [r, R] = [1, 2], ϵ = 0.01, and L = 8. The simulation setting is as follows: the computed range $[\underline{f}, \overline{f}]$ of the dissipated power is [0, 0.821], so that the estimate built recovers the dissipated power within accuracy 0.103 and reliability 0.99. The resulting value of K is K = 95. In all 500 simulation runs, the actual recovery error was less than the bound 0.103, and the average error was as small as 0.041.
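The quadratic lifting used above is easy to check numerically. The NumPy sketch below builds one possible matrix G with $[u;1]^T G [u;1] = I_{\gamma_*}[V_{j_*} - V_{i_*}]$ and verifies that the lifted matrix $Z = [u;1][u;1]^T$ satisfies Tr(GZ) = f(u), $Z_{n+1,n+1} = 1$ and Z ⪰ 0. The function names and the positions of the current and the two voltages inside u are illustrative assumptions; the last line merely reproduces the accuracy 0.821/8 ≈ 0.103 quoted in the text.

```python
import numpy as np


def power_matrix(n, k_current, k_vj, k_vi):
    """Symmetric G with [u;1]^T G [u;1] = u[k_current] * (u[k_vj] - u[k_vi])."""
    G = np.zeros((n + 1, n + 1))
    for k_v, sign in ((k_vj, 1.0), (k_vi, -1.0)):
        # each product I * V is split evenly between the two symmetric entries
        G[k_current, k_v] += 0.5 * sign
        G[k_v, k_current] += 0.5 * sign
    return G


def lift(u):
    """Quadratic lifting u -> [u;1][u;1]^T."""
    w = np.append(np.asarray(u, dtype=float), 1.0)
    return np.outer(w, w)


# accuracy delta_L for the reported power range [0, 0.821] and L = 8
delta_L = (0.821 - 0.0) / 8
```

The same pattern (one symmetric matrix per quadratic form) yields the matrices $Q_s$ of the constraints in (3.24).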

3.3 ESTIMATING LINEAR FORMS BEYOND SIMPLE OBSERVATION SCHEMES

We are about to show that the techniques developed in Section 2.8 can be applied to building estimates of linear and quadratic forms of the parameters of observed distributions. As compared to the machinery of Section 3.2, our new approach has somewhat restricted scope: we do not estimate general N-convex functions nor handle domains which are unions of convex sets; now we need the function to be linear (perhaps, after quadratic lifting of observations) and the domain to be convex.⁸ As a compensation, we are not limited to simple observation schemes anymore: our approach is in fact a natural extension of the approach developed in Section 3.1 beyond simple o.s.’s. In this section, we focus on estimating linear forms; estimating quadratic forms will be our subject in Section 3.4.

3.3.1 Situation and goal

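Consider the situation as follows: given are Euclidean spaces $\Omega = E_H$, $E_M$, $E_X$ along with

• regular data (see Section 2.8.1.1) $\mathcal{H} \subset E_H$, $\mathcal{M} \subset E_M$, $\Phi(\cdot;\cdot) : \mathcal{H} \times \mathcal{M} \to \mathbf{R}$, with $0 \in \mathrm{int}\,\mathcal{H}$,
• a nonempty convex compact set $\mathcal{X} \subset E_X$,
• an affine mapping $x \mapsto A(x) : E_X \to E_M$ such that $A(\mathcal{X}) \subset \mathcal{M}$,
• a continuous convex calibrating function $\upsilon(x) : \mathcal{X} \to \mathbf{R}$,
• a vector $g \in E_X$ and a constant c specifying the linear form $G(x) = \langle g, x\rangle + c : E_X \to \mathbf{R}$,⁹
• a tolerance ϵ ∈ (0, 1).

These data specify, in particular, the family $\mathcal{P} = \mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ of probability distributions on $\Omega = E_H$; see Section 2.8.1.1. Given random observation

$$\omega \sim P(\cdot) \tag{3.25}$$

where $P \in \mathcal{P}$ is such that

$$\forall h \in \mathcal{H} : \ \ln\left( \int_{E_H} e^{\langle h, \omega\rangle} P(d\omega) \right) \le \Phi(h; A(x)) \tag{3.26}$$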

for some $x \in \mathcal{X}$ (that is, A(x) is a parameter, as defined in Section 2.8.1.1, of distribution P), we want to recover the quantity G(x).

ϵ-risk. Given ρ > 0, we call an estimate $\hat{g}(\cdot) : E_H \to \mathbf{R}$ (ρ, ϵ, υ(·))-accurate if for all pairs $x \in \mathcal{X}$, $P \in \mathcal{P}$ satisfying (3.26) it holds

$$\mathrm{Prob}_{\omega\sim P}\{|\hat{g}(\omega) - G(x)| > \rho + \upsilon(x)\} \le \epsilon.$$

If $\rho_*$ is the infimum of those ρ for which estimate $\hat{g}$ is (ρ, ϵ, υ(·))-accurate, then clearly $\hat{g}$ is ($\rho_*$, ϵ, υ(·))-accurate; we shall call $\rho_*$ the ϵ-risk of the estimate $\hat{g}$ taken w.r.t. the data G(·), $\mathcal{X}$, υ(·), and (A, $\mathcal{H}$, $\mathcal{M}$, Φ):

$$
\mathrm{Risk}_\epsilon(\hat{g}(\cdot)|G, \mathcal{X}, \upsilon, A, \mathcal{H}, \mathcal{M}, \Phi) = \min\left\{ \rho :
\begin{array}{l}
\mathrm{Prob}_{\omega\sim P}\{\omega : |\hat{g}(\omega) - G(x)| > \rho + \upsilon(x)\} \le \epsilon\\[4pt]
\forall (x, P) : \left\{\begin{array}{l} P \in \mathcal{P},\ x \in \mathcal{X}\\ \ln\left(\int e^{h^T\omega} P(d\omega)\right) \le \Phi(h; A(x))\ \ \forall h \in \mathcal{H} \end{array}\right.
\end{array}
\right\}. \tag{3.27}
$$

When G, $\mathcal{X}$, υ, A, $\mathcal{H}$, $\mathcal{M}$, and Φ are clear from the context, we shorten $\mathrm{Risk}_\epsilon(\hat{g}(\cdot)|G, \mathcal{X}, \upsilon, A, \mathcal{H}, \mathcal{M}, \Phi)$ to $\mathrm{Risk}_\epsilon(\hat{g}(\cdot))$.

Given the data listed at the beginning of this section, we are about to build, in a computationally efficient fashion, an affine estimate $\hat{g}(\omega) = \langle h_*, \omega\rangle + \kappa$ along with $\rho_*$ such that the estimate is ($\rho_*$, ϵ, υ(·))-accurate.

⁸ The latter is just for the sake of simplicity, to not overload the presentation to follow. An interested reader will certainly be able to reproduce the corresponding construction of Section 3.1 in the situation of this section.

⁹ From now on, ⟨u, v⟩ denotes the inner product of vectors u, v belonging to a Euclidean space; what this space is will always be clear from the context.

3.3.2 Construction and main results

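Let us set

$$\mathcal{H}^+ = \{(h, \alpha) : h \in E_H,\ \alpha > 0,\ h/\alpha \in \mathcal{H}\}$$

so that $\mathcal{H}^+$ is a nonempty convex set in $E_H \times \mathbf{R}_+$, and let

$$
\begin{array}{lrcl}
(a) & \Psi_+(h, \alpha) &=& \sup\limits_{x\in\mathcal{X}} \left[\alpha\Phi(h/\alpha, A(x)) - G(x) - \upsilon(x)\right] : \mathcal{H}^+ \to \mathbf{R},\\[6pt]
(b) & \Psi_-(h, \beta) &=& \sup\limits_{x\in\mathcal{X}} \left[\beta\Phi(-h/\beta, A(x)) + G(x) - \upsilon(x)\right] : \mathcal{H}^+ \to \mathbf{R},
\end{array} \tag{3.28}
$$

so that $\Psi_\pm$ are convex real-valued functions on $\mathcal{H}^+$ (recall that Φ is convex-concave and continuous on $\mathcal{H} \times \mathcal{M}$, while $A(\mathcal{X})$ is a compact subset of $\mathcal{M}$). Our starting point is quite simple:

Proposition 3.5. Given ϵ ∈ (0, 1), let $\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho}$ be a feasible solution to the system of convex constraints

$$
\begin{array}{lrcl}
(a_1) & (h, \alpha) &\in& \mathcal{H}^+\\
(a_2) & (h, \beta) &\in& \mathcal{H}^+\\
(b_1) & \alpha \ln(\epsilon/2) &\ge& \Psi_+(h, \alpha) - \rho + \kappa\\
(b_2) & \beta \ln(\epsilon/2) &\ge& \Psi_-(h, \beta) - \rho - \kappa
\end{array} \tag{3.29}
$$

in variables h, α, β, ρ, κ. Setting

$$\hat{g}(\omega) = \langle \bar{h}, \omega\rangle + \bar{\kappa},$$

we obtain an estimate with ϵ-risk at most $\bar{\rho}$.

Proof. Let ϵ ∈ (0, 1), $\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho}$ satisfy the premise of the proposition, and let $x \in \mathcal{X}$, P satisfy (3.26). We have

$$
\begin{array}{rcl}
\mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\} &=& \mathrm{Prob}_{\omega\sim P}\left\{ \dfrac{\langle \bar{h}, \omega\rangle}{\bar{\alpha}} > \dfrac{G(x) + \bar{\rho} - \bar{\kappa} + \upsilon(x)}{\bar{\alpha}} \right\}\\[10pt]
\Rightarrow\ \mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\} &\le& \left[\displaystyle\int e^{\langle \bar{h}, \omega\rangle/\bar{\alpha}} P(d\omega)\right] e^{-\frac{G(x) + \bar{\rho} - \bar{\kappa} + \upsilon(x)}{\bar{\alpha}}}\\[10pt]
&\le& e^{\Phi(\bar{h}/\bar{\alpha}, A(x))}\, e^{-\frac{G(x) + \bar{\rho} - \bar{\kappa} + \upsilon(x)}{\bar{\alpha}}}.
\end{array}
$$

As a result,

$$
\begin{array}{rcll}
\bar{\alpha} \ln\left(\mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\}\right) &\le& \bar{\alpha}\Phi(\bar{h}/\bar{\alpha}, A(x)) - G(x) - \bar{\rho} + \bar{\kappa} - \upsilon(x) & \\
&\le& \Psi_+(\bar{h}, \bar{\alpha}) - \bar{\rho} + \bar{\kappa} & [\text{by definition of } \Psi_+ \text{ and due to } x \in \mathcal{X}]\\
&\le& \bar{\alpha} \ln(\epsilon/2) & [\text{by (3.29.}b_1)]
\end{array}
$$

so that $\mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) > G(x) + \bar{\rho} + \upsilon(x)\} \le \epsilon/2$. Similarly,

$$
\begin{array}{rcl}
\mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\} &=& \mathrm{Prob}_{\omega\sim P}\left\{ \dfrac{-\langle \bar{h}, \omega\rangle}{\bar{\beta}} > \dfrac{-G(x) + \bar{\rho} + \bar{\kappa} + \upsilon(x)}{\bar{\beta}} \right\}\\[10pt]
\Rightarrow\ \mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\} &\le& \left[\displaystyle\int e^{-\langle \bar{h}, \omega\rangle/\bar{\beta}} P(d\omega)\right] e^{-\frac{-G(x) + \bar{\rho} + \bar{\kappa} + \upsilon(x)}{\bar{\beta}}}\\[10pt]
&\le& e^{\Phi(-\bar{h}/\bar{\beta}, A(x))}\, e^{\frac{G(x) - \bar{\rho} - \bar{\kappa} - \upsilon(x)}{\bar{\beta}}}.
\end{array}
$$

Thus

$$
\begin{array}{rcll}
\bar{\beta} \ln\left(\mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\}\right) &\le& \bar{\beta}\Phi(-\bar{h}/\bar{\beta}, A(x)) + G(x) - \bar{\rho} - \bar{\kappa} - \upsilon(x) & \\
&\le& \Psi_-(\bar{h}, \bar{\beta}) - \bar{\rho} - \bar{\kappa} & [\text{by definition of } \Psi_- \text{ and due to } x \in \mathcal{X}]\\
&\le& \bar{\beta} \ln(\epsilon/2) & [\text{by (3.29.}b_2)]
\end{array}
$$

and $\mathrm{Prob}_{\omega\sim P}\{\hat{g}(\omega) < G(x) - \bar{\rho} - \upsilon(x)\} \le \epsilon/2$. ✷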



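Corollary 3.6. In the situation described in Section 3.3.1, let Φ satisfy the relation

$$\Phi(0; \mu) \ge 0 \quad \forall \mu \in \mathcal{M}. \tag{3.30}$$

Then

$$
\begin{array}{lrcl}
(a) & \widehat{\Psi}_+(h) &:=& \inf_\alpha \left\{\Psi_+(h, \alpha) + \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+\right\}\\
& &=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+} \left[\alpha\Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + \alpha \ln(2/\epsilon)\right],\\[8pt]
(b) & \widehat{\Psi}_-(h) &:=& \inf_\alpha \left\{\Psi_-(h, \alpha) + \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+\right\}\\
& &=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+} \left[\alpha\Phi(-h/\alpha, A(x)) + G(x) - \upsilon(x) + \alpha \ln(2/\epsilon)\right],
\end{array} \tag{3.31}
$$

and functions $\widehat{\Psi}_\pm : E_H \to \mathbf{R}$ are convex. Furthermore, let $\bar{h}, \bar{\kappa}, \tilde{\rho}$ be a feasible solution to the system of convex constraints

$$\widehat{\Psi}_+(h) \le \rho - \kappa, \qquad \widehat{\Psi}_-(h) \le \rho + \kappa \tag{3.32}$$

in variables h, ρ, κ. Then the estimate

$$\hat{g}(\omega) = \langle \bar{h}, \omega\rangle + \bar{\kappa}$$

of G(x), $x \in \mathcal{X}$, has ϵ-risk at most $\tilde{\rho}$:

$$\mathrm{Risk}_\epsilon(\hat{g}(\cdot)|G, \mathcal{X}, \upsilon, A, \mathcal{H}, \mathcal{M}, \Phi) \le \tilde{\rho}. \tag{3.33}$$

Relation (3.32) (and thus the risk bound (3.33)) clearly holds true when $\bar{h}$ is a candidate solution to the convex optimization problem

$$\mathrm{Opt} = \min_h \left\{ \widehat{\Psi}(h) := \tfrac{1}{2}\left[\widehat{\Psi}_+(h) + \widehat{\Psi}_-(h)\right] \right\}, \tag{3.34}$$

and

$$\tilde{\rho} = \widehat{\Psi}(\bar{h}), \qquad \bar{\kappa} = \tfrac{1}{2}\left[\widehat{\Psi}_-(\bar{h}) - \widehat{\Psi}_+(\bar{h})\right].$$

As a result, by properly selecting $\bar{h}$, we can make (an upper bound on) the ϵ-risk of estimate $\hat{g}(\cdot)$ arbitrarily close to Opt, and equal to Opt when optimization problem (3.34) is solvable.

Proof. Let us first verify the identities in (3.31). The function

$$\Theta_+(h, \alpha; x) = \alpha\Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + \alpha \ln(2/\epsilon) : \mathcal{H}^+ \times \mathcal{X} \to \mathbf{R}$$

is convex-concave and continuous, and $\mathcal{X}$ is compact, whence by the Sion-Kakutani Theorem

$$
\begin{array}{rcl}
\widehat{\Psi}_+(h) &:=& \inf_\alpha \left\{\Psi_+(h, \alpha) + \alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+\right\}\\
&=& \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+}\ \max\limits_{x\in\mathcal{X}} \Theta_+(h, \alpha; x)\\
&=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+} \Theta_+(h, \alpha; x)\\
&=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+} \left[\alpha\Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + \alpha \ln(2/\epsilon)\right],
\end{array}
$$

as required in (3.31.a). As we know, $\Psi_+(h, \alpha)$ is a real-valued continuous function on $\mathcal{H}^+$, so that $\widehat{\Psi}_+$ is convex on $E_H$, provided that the function is real-valued. Now, let $\bar{x} \in \mathcal{X}$, and let e be a subgradient of $\phi(h) = \Phi(h; A(\bar{x}))$ taken at h = 0. For $h \in E_H$ and all α > 0 such that $(h, \alpha) \in \mathcal{H}^+$ we have

$$
\begin{array}{rcl}
\Psi_+(h, \alpha) + \alpha \ln(2/\epsilon) &\ge& \alpha\Phi(h/\alpha; A(\bar{x})) - G(\bar{x}) - \upsilon(\bar{x}) + \alpha \ln(2/\epsilon)\\
&\ge& \alpha\left[\Phi(0; A(\bar{x})) + \langle e, h/\alpha\rangle\right] - G(\bar{x}) - \upsilon(\bar{x}) + \alpha \ln(2/\epsilon)\\
&\ge& \langle e, h\rangle - G(\bar{x}) - \upsilon(\bar{x})
\end{array}
$$

(we have used (3.30)), and therefore $\Psi_+(h, \alpha) + \alpha \ln(2/\epsilon)$ as a function of α is bounded from below on the set $\{\alpha > 0 : h/\alpha \in \mathcal{H}\}$. In addition, this set is nonempty, since $\mathcal{H}$ contains a neighbourhood of the origin. Thus, $\widehat{\Psi}_+$ is real-valued and convex on $E_H$. Verification of (3.31.b) and of the fact that $\widehat{\Psi}_-(h)$ is a real-valued convex function on $E_H$ is completely similar.

Now, given a feasible solution $(\bar{h}, \bar{\kappa}, \tilde{\rho})$ to (3.32), let us select some $\bar{\rho} > \tilde{\rho}$. Taking into account the definition of $\widehat{\Psi}_\pm$, we can find $\bar{\alpha}$ and $\bar{\beta}$ such that

$$
\begin{array}{l}
(\bar{h}, \bar{\alpha}) \in \mathcal{H}^+ \ \ \&\ \ \Psi_+(\bar{h}, \bar{\alpha}) + \bar{\alpha} \ln(2/\epsilon) \le \bar{\rho} - \bar{\kappa},\\
(\bar{h}, \bar{\beta}) \in \mathcal{H}^+ \ \ \&\ \ \Psi_-(\bar{h}, \bar{\beta}) + \bar{\beta} \ln(2/\epsilon) \le \bar{\rho} + \bar{\kappa},
\end{array}
$$

implying that the collection $(\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho})$ is a feasible solution to (3.29). Invoking Proposition 3.5, we get

$$\mathrm{Prob}_{\omega\sim P}\{\omega : |\hat{g}(\omega) - G(x)| > \bar{\rho} + \upsilon(x)\} \le \epsilon$$

for all $(x \in \mathcal{X}, P \in \mathcal{P})$ satisfying (3.26). Since $\bar{\rho}$ can be selected arbitrarily close to $\tilde{\rho}$, $\hat{g}(\cdot)$ indeed is a ($\tilde{\rho}$, ϵ, υ(·))-accurate estimate. ✷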

3.3.3 Estimation from repeated observations

Assume that in the situation described in Section 3.3.1 we have access to K observations $\omega_1, ..., \omega_K$ sampled, independently of each other, from a probability distribution P, and aim to build the estimate based on these K observations rather than on a single observation. We can immediately reduce this new situation to the previous one, just by redefining the data. Specifically, given initial data $\mathcal{H} \subset E_H$, $\mathcal{M} \subset E_M$, $\Phi(\cdot;\cdot) : \mathcal{H} \times \mathcal{M} \to \mathbf{R}$, $\mathcal{X} \subset E_X$, υ(·), A(·), $G(x) = \langle g, x\rangle + c$ (see Section 3.3.1) and a positive integer K, let us update part of the data, namely, replace $\mathcal{H} \subset E_H$ with

$$\mathcal{H}^K := \underbrace{\mathcal{H} \times ... \times \mathcal{H}}_{K} \ \subset\ E_H^K := \underbrace{E_H \times ... \times E_H}_{K}$$

and replace $\Phi(\cdot, \cdot) : \mathcal{H} \times \mathcal{M} \to \mathbf{R}$ with

$$\Phi_K(h^K = (h_1, ..., h_K); \mu) = \sum_{i=1}^{K} \Phi(h_i; \mu) : \mathcal{H}^K \times \mathcal{M} \to \mathbf{R}.$$

It is immediately seen that the updated data satisfy all requirements imposed on the data in Section 3.3.1, and that whenever $x \in \mathcal{X}$ and a Borel probability distribution P on $E_H$ are linked by (3.26), x and the distribution $P^K$ of K-element i.i.d. sample $\omega^K = (\omega_1, ..., \omega_K)$ drawn from P are linked by the relation

$$
\begin{array}{rcl}
\forall h^K = (h_1, ..., h_K) \in \mathcal{H}^K : \quad \ln\left(\displaystyle\int_{E_H^K} e^{\langle h^K, \omega^K\rangle} P^K(d\omega^K)\right) &=& \displaystyle\sum_{i=1}^{K} \ln\left(\int_{E_H} e^{\langle h_i, \omega_i\rangle} P(d\omega_i)\right)\\[10pt]
&\le& \Phi_K(h^K; A(x)).
\end{array}
$$

Applying to our new data the construction from Section 3.3.2, we arrive at “repeated observation” versions of Proposition 3.5 and Corollary 3.6. Note that the resulting convex constraints/objectives are symmetric w.r.t. permutations of the components $h_1, ..., h_K$ of $h^K$, implying that we lose nothing when restricting ourselves to collections $h^K$ with components equal to each other; it is convenient to denote the common value of these components h/K. With this observation in mind, Proposition 3.5 and Corollary 3.6 translate into the following statements (we use the assumptions and the notation from the previous sections):

Proposition 3.7. Given ϵ ∈ (0, 1) and positive integer K, let

$$
\begin{array}{lrcl}
(a) & \Psi_+(h, \alpha) &=& \sup\limits_{x\in\mathcal{X}} \left[\alpha\Phi(h/\alpha, A(x)) - G(x) - \upsilon(x)\right] : \mathcal{H}^+ \to \mathbf{R},\\[6pt]
(b) & \Psi_-(h, \beta) &=& \sup\limits_{x\in\mathcal{X}} \left[\beta\Phi(-h/\beta, A(x)) + G(x) - \upsilon(x)\right] : \mathcal{H}^+ \to \mathbf{R},
\end{array}
$$

and let $\bar{h}, \bar{\alpha}, \bar{\beta}, \bar{\kappa}, \bar{\rho}$ be a feasible solution to the system of convex constraints

$$
\begin{array}{lrcl}
(a_1) & (h, \alpha) &\in& \mathcal{H}^+\\
(a_2) & (h, \beta) &\in& \mathcal{H}^+\\
(b_1) & \alpha K^{-1} \ln(\epsilon/2) &\ge& \Psi_+(h, \alpha) - \rho + \kappa\\
(b_2) & \beta K^{-1} \ln(\epsilon/2) &\ge& \Psi_-(h, \beta) - \rho - \kappa
\end{array} \tag{3.35}
$$

in variables h, α, β, ρ, κ. Setting

$$\hat{g}(\omega^K) = \left\langle \bar{h}, \frac{1}{K}\sum_{i=1}^{K} \omega_i \right\rangle + \bar{\kappa},$$

we obtain an estimate of G(x) via independent K-repeated observations $\omega_i \sim P$, i = 1, ..., K, with the ϵ-risk on $\mathcal{X}$ not exceeding $\bar{\rho}$. In other words, whenever $x \in \mathcal{X}$ and a Borel probability distribution P on $E_H$ are linked by (3.26), one has

$$\mathrm{Prob}_{\omega^K\sim P^K}\left\{\omega^K : |\hat{g}(\omega^K) - G(x)| > \bar{\rho} + \upsilon(x)\right\} \le \epsilon. \tag{3.36}$$

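Corollary 3.8. In the situation described at the beginning of Section 3.3.1, let Φ satisfy relation (3.30), and let a positive integer K be given. Then

$$
\begin{array}{lrcl}
(a) & \widehat{\Psi}_{+,K}(h) &:=& \inf_\alpha \left\{\Psi_+(h, \alpha) + K^{-1}\alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+\right\}\\
& &=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+} \left[\alpha\Phi(h/\alpha, A(x)) - G(x) - \upsilon(x) + K^{-1}\alpha \ln(2/\epsilon)\right],\\[8pt]
(b) & \widehat{\Psi}_{-,K}(h) &:=& \inf_\alpha \left\{\Psi_-(h, \alpha) + K^{-1}\alpha \ln(2/\epsilon) : \alpha > 0,\ (h, \alpha) \in \mathcal{H}^+\right\}\\
& &=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0,(h,\alpha)\in\mathcal{H}^+} \left[\alpha\Phi(-h/\alpha, A(x)) + G(x) - \upsilon(x) + K^{-1}\alpha \ln(2/\epsilon)\right],
\end{array}
$$

and functions $\widehat{\Psi}_{\pm,K} : E_H \to \mathbf{R}$ are convex. Furthermore, let $\bar{h}, \bar{\kappa}, \tilde{\rho}$ be a feasible solution to the system of convex constraints

$$\widehat{\Psi}_{+,K}(h) \le \rho - \kappa, \qquad \widehat{\Psi}_{-,K}(h) \le \rho + \kappa \tag{3.37}$$

in variables h, ρ, κ. Then the ϵ-risk of the estimate

$$\hat{g}(\omega^K) = \left\langle \bar{h}, \frac{1}{K}\sum_{i=1}^{K} \omega_i \right\rangle + \bar{\kappa}$$

of G(x), $x \in \mathcal{X}$, is at most $\tilde{\rho}$, implying that whenever $x \in \mathcal{X}$ and a Borel probability distribution P on $E_H$ are linked by (3.26), relation (3.36) holds true with $\bar{\rho} = \tilde{\rho}$. Relation (3.37) clearly holds true when $\bar{h}$ is a candidate solution to the convex optimization problem

$$\mathrm{Opt}_K = \min_h \left\{ \widehat{\Psi}_K(h) := \tfrac{1}{2}\left[\widehat{\Psi}_{+,K}(h) + \widehat{\Psi}_{-,K}(h)\right] \right\}, \tag{3.38}$$

and

$$\tilde{\rho} = \widehat{\Psi}_K(\bar{h}), \qquad \bar{\kappa} = \tfrac{1}{2}\left[\widehat{\Psi}_{-,K}(\bar{h}) - \widehat{\Psi}_{+,K}(\bar{h})\right].$$

As a result, by properly selecting $\bar{h}$ we can make (an upper bound on) the ϵ-risk of the estimate $\hat{g}(\cdot)$ arbitrarily close to $\mathrm{Opt}_K$, and equal to $\mathrm{Opt}_K$ when optimization problem (3.38) is solvable.

From now on, if not explicitly stated otherwise, we deal with K-repeated observations; to get back to the single-observation case, it suffices to set K = 1.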
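Once $\bar{h}$ and $\bar{\kappa}$ are available, applying the estimate to a K-repeated observation amounts to averaging the observations and evaluating an affine function. A minimal NumPy sketch (the function name is hypothetical; observations are assumed to be stored as the rows of an array) could look like this:

```python
import numpy as np


def affine_estimate(h, kappa, samples):
    """Return <h, (1/K) * sum_i omega_i> + kappa.

    samples: array of shape (K, d) whose rows are the observations omega_i;
    a single observation of shape (d,) is treated as K = 1.
    """
    samples = np.atleast_2d(np.asarray(samples, dtype=float))
    return float(np.asarray(h, dtype=float) @ samples.mean(axis=0) + kappa)
```

With K = 1 the function reduces to the single-observation estimate $\langle \bar{h}, \omega\rangle + \bar{\kappa}$.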

3.3.4 Application: Estimating linear forms of sub-Gaussianity parameters

Consider the simplest case of the situation from Sections 3.3.1 and 3.3.3, where

• $\mathcal{H} = E_H = \mathbf{R}^d$, $\mathcal{M} = E_M = \mathbf{R}^d \times \mathbf{S}^d_+$, $\Phi(h; \mu, M) = h^T\mu + \tfrac{1}{2} h^T M h : \mathbf{R}^d \times (\mathbf{R}^d \times \mathbf{S}^d_+) \to \mathbf{R}$, so that $\mathcal{S}[\mathcal{H}, \mathcal{M}, \Phi]$ is the family of all sub-Gaussian distributions on $\mathbf{R}^d$;
• $\mathcal{X} \subset E_X = \mathbf{R}^{n_x}$ is a nonempty convex compact set;
• A(x) = (Ax + a, M(x)), where A is a $d \times n_x$ matrix, and M(x) is a symmetric d × d matrix affinely depending on x such that $M(x) \succeq 0$ when $x \in \mathcal{X}$;
• υ(x) is a convex continuous function on $\mathcal{X}$;
• G(x) is an affine function on $E_X$.

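In the case in question, (3.30) clearly takes place, and the left-hand sides in constraints (3.37) become

$$
\begin{array}{rcl}
\widehat{\Psi}_{+,K}(h) &=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0} \left\{ h^T[Ax + a] + \dfrac{1}{2\alpha} h^T M(x) h + K^{-1}\alpha \ln(2/\epsilon) - G(x) - \upsilon(x) \right\}\\[8pt]
&=& \max\limits_{x\in\mathcal{X}} \left\{ \sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x) h]} + h^T[Ax + a] - G(x) - \upsilon(x) \right\},\\[10pt]
\widehat{\Psi}_{-,K}(h) &=& \sup\limits_{x\in\mathcal{X}}\ \inf\limits_{\alpha>0} \left\{ -h^T[Ax + a] + \dfrac{1}{2\alpha} h^T M(x) h + K^{-1}\alpha \ln(2/\epsilon) + G(x) - \upsilon(x) \right\}\\[8pt]
&=& \max\limits_{x\in\mathcal{X}} \left\{ \sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x) h]} - h^T[Ax + a] + G(x) - \upsilon(x) \right\}.
\end{array}
$$

Thus, system (3.37) reads

$$
\begin{array}{rcl}
a^T h + \max\limits_{x\in\mathcal{X}} \left[ \sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x) h]} + h^T A x - G(x) - \upsilon(x) \right] &\le& \rho - \kappa,\\[8pt]
-a^T h + \max\limits_{x\in\mathcal{X}} \left[ \sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x) h]} - h^T A x + G(x) - \upsilon(x) \right] &\le& \rho + \kappa.
\end{array}
$$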
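The passage from the infimum over α to the square root is the elementary fact that, for q > 0 and c > 0, the function q/(2α) + cα of α > 0 is minimized at $\alpha = \sqrt{q/(2c)}$ with minimal value $\sqrt{2cq}$; here $q = h^T M(x) h$ and $c = K^{-1}\ln(2/\epsilon)$. The short Python check below illustrates this; the function names are hypothetical and serve only this illustration.

```python
import math


def inner_objective(alpha, q, c):
    """q / (2 * alpha) + c * alpha, with q = h^T M(x) h and c = ln(2/eps) / K."""
    return q / (2.0 * alpha) + c * alpha


def closed_form(q, c):
    """Return (alpha_star, value): the minimizer over alpha > 0 and the minimum."""
    alpha_star = math.sqrt(q / (2.0 * c))
    value = math.sqrt(2.0 * c * q)
    return alpha_star, value
```

When q = 0 the infimum equals 0 and is approached as α → +0, which agrees with the square-root expression as well.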

We arrive at the following version of Corollary 3.8:

Proposition 3.9. In the situation described at the beginning of Section 3.3.4, given ϵ ∈ (0, 1), let $\bar{h}$ be a feasible solution to the convex optimization problem

$$\mathrm{Opt}_K = \min_{h\in\mathbf{R}^d} \widehat{\Psi}_K(h) \tag{3.39}$$

where

$$
\widehat{\Psi}_K(h) := \frac{1}{2}\Big[ \underbrace{\max_{x\in\mathcal{X}} \left[ \sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x) h]} + h^T A x - G(x) - \upsilon(x) \right] + a^T h}_{\widehat{\Psi}_{+,K}(h)}
+ \underbrace{\max_{y\in\mathcal{X}} \left[ \sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(y) h]} - h^T A y + G(y) - \upsilon(y) \right] - a^T h}_{\widehat{\Psi}_{-,K}(h)} \Big].
$$

Then, setting

$$\bar{\kappa} = \tfrac{1}{2}\left[\widehat{\Psi}_{-,K}(\bar{h}) - \widehat{\Psi}_{+,K}(\bar{h})\right], \qquad \bar{\rho} = \widehat{\Psi}_K(\bar{h}),$$

the affine estimate

$$\hat{g}(\omega^K) = \frac{1}{K}\sum_{i=1}^{K} \bar{h}^T \omega_i + \bar{\kappa}$$

has ϵ-risk, taken w.r.t. the data listed at the beginning of this section, at most $\bar{\rho}$.

It is immediately seen that optimization problem (3.39) is solvable, provided that

$$\bigcap_{x\in\mathcal{X}} \mathrm{Ker}(M(x)) = \{0\},$$

and an optimal solution $h_*$ to the problem, taken along with

$$\kappa_* = \tfrac{1}{2}\left[\widehat{\Psi}_{-,K}(h_*) - \widehat{\Psi}_{+,K}(h_*)\right], \tag{3.40}$$

yields the affine estimate

$$\hat{g}_*(\omega^K) = \frac{1}{K}\sum_{i=1}^{K} h_*^T \omega_i + \kappa_*$$

with ϵ-risk, taken w.r.t. the data listed at the beginning of this section, at most $\mathrm{Opt}_K$.

3.3.4.1 Consistency

Assuming υ(x) ≡ 0, we can easily answer the natural question “when is the proposed estimation scheme consistent?”, meaning that for every ϵ ∈ (0, 1), it allows us to achieve arbitrarily small ϵ-risk, provided that K is large enough. Specifically, denoting by $g^T x$ the linear part of G(x): $G(x) = g^T x + c$, from Proposition 3.9 it is immediately seen that a necessary and sufficient condition for consistency is the existence of $\bar{h} \in \mathbf{R}^d$ such that $\bar{h}^T A x = g^T x$ for all $x \in \mathcal{X} - \mathcal{X}$, or, equivalently, the condition that g is orthogonal to the intersection of the kernel of A with the linear span of $\mathcal{X} - \mathcal{X}$. Indeed, under this assumption, for every fixed ϵ ∈ (0, 1) we clearly have $\lim_{K\to\infty} \widehat{\Psi}_K(\bar{h}) = 0$, implying that $\lim_{K\to\infty} \mathrm{Opt}_K = 0$, with $\widehat{\Psi}_K$ and $\mathrm{Opt}_K$ given by (3.39). On the other hand, if the condition is violated, then there exist $x', x'' \in \mathcal{X}$ such that Ax′ = Ax″ and G(x′) ≠ G(x″); we lose nothing when assuming that G(x″) > G(x′). Looking at (3.39), we see that

$$
\begin{array}{rcl}
\widehat{\Psi}_K(h) &\ge& \dfrac{1}{2}\Big[ \left[\sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x') h]} + h^T A x' - G(x') + a^T h\right]\\[6pt]
&& \quad +\ \left[\sqrt{2K^{-1}\ln(2/\epsilon)\,[h^T M(x'') h]} - h^T A x'' + G(x'') - a^T h\right] \Big]\\[6pt]
&\ge& \tfrac{1}{2}\left[G(x'') - G(x')\right],
\end{array}
$$

whence $\mathrm{Opt}_K$, for all K, is lower-bounded by $\tfrac{1}{2}[G(x'') - G(x')] > 0$.

3.3.4.2 Direct product case

Further simplifications are possible in the direct product case, where, in addition to what was assumed at the beginning of Section 3.3.4,

• $E_X = E_U \times E_V$ and $\mathcal{X} = U \times V$, with convex compact sets $U \subset E_U = \mathbf{R}^{n_u}$ and $V \subset E_V = \mathbf{R}^{n_v}$,
• $A(x = (u, v)) = [Au + a, M(v)] : U \times V \to \mathbf{R}^d \times \mathbf{S}^d$, with $M(v) \succeq 0$ for v ∈ V,
• $G(x = (u, v)) = g^T u + c$ depends solely on u, and
• $\upsilon(x = (u, v)) = \varrho(u)$ depends solely on u.

It is immediately seen that in the direct product case problem (3.39) reads

$$\mathrm{Opt}_K = \min_{h\in\mathbf{R}^d} \left\{ \frac{\phi_U(A^T h - g) + \phi_U(-A^T h + g)}{2} + \max_{v\in V} \sqrt{2K^{-1}\ln(2/\epsilon)\, h^T M(v) h} \right\}, \tag{3.41}$$

where

$$\phi_U(f) = \max_{u\in U} \left[u^T f - \varrho(u)\right]. \tag{3.42}$$

Assuming $\bigcap_{v\in V} \mathrm{Ker}(M(v)) = \{0\}$, the problem is solvable, and its optimal solution $h_*$ gives rise to the affine estimate

$$\hat{g}_*(\omega^K) = \frac{1}{K}\sum_{i} h_*^T \omega_i + \kappa_*, \qquad \kappa_* = \tfrac{1}{2}\left[\phi_U(-A^T h_* + g) - \phi_U(A^T h_* - g)\right] - a^T h_* + c$$

with ϵ-risk ≤ $\mathrm{Opt}_K$.

Near-optimality. In addition to the assumption that we are in the direct product case, assume that υ(·) ≡ 0 and, for the sake of simplicity, that $M(v) \succ 0$ whenever v ∈ V. In this case (3.39) reads

$$\mathrm{Opt}_K = \min_h \max_{v\in V} \left\{ \Theta(h, v) := \tfrac{1}{2}\left[\phi_U(A^T h - g) + \phi_U(-A^T h + g)\right] + \sqrt{2K^{-1}\ln(2/\epsilon)\, h^T M(v) h} \right\}.$$

Hence, taking into account that Θ(h, v) clearly is convex in h and concave in v, while V is a convex compact set, by the Sion-Kakutani Theorem we get also

$$\mathrm{Opt}_K = \max_{v\in V} \left\{ \mathrm{Opt}(v) = \min_h \left[ \tfrac{1}{2}\left[\phi_U(A^T h - g) + \phi_U(-A^T h + g)\right] + \sqrt{2K^{-1}\ln(2/\epsilon)\, h^T M(v) h} \right] \right\}. \tag{3.43}$$
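The objective of (3.41) is straightforward to evaluate when U is a box and ϱ ≡ 0, since then $\phi_U$ is available in closed form. The NumPy sketch below is an illustration under these assumptions only; the function names are hypothetical, and the set V is represented by a finite list of matrices M(v), which is exact when V is a polytope and the list contains M(v) at its vertices (for fixed h, the quantity $h^T M(v) h$ is affine in v, so its maximum over V is attained at a vertex).

```python
import numpy as np


def support_box(f, lo, hi):
    """phi_U(f) = max over the box lo <= u <= hi of u^T f (with varrho = 0)."""
    return float(np.sum(np.maximum(lo * f, hi * f)))


def objective_3_41(h, A, g, M_list, lo, hi, K, eps):
    """Value at h of the objective in (3.41) for a box U and a finite list of M(v)."""
    r = A.T @ h - g
    bias = 0.5 * (support_box(r, lo, hi) + support_box(-r, lo, hi))
    c = 2.0 * np.log(2.0 / eps) / K
    stochastic = max(np.sqrt(c * float(h @ M @ h)) for M in M_list)
    return bias + stochastic
```

When $A^T h = g$, the first term vanishes and the objective reduces to the stochastic term, which decays as $K^{-1/2}$; this is the mechanism behind the consistency condition of Section 3.3.4.1.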

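Now consider the problem of estimating $g^T u$ from independent observations $\omega_i$, i ≤ K, sampled from $\mathcal{N}(Au + a, M(v))$, where unknown u is known to belong to U and v ∈ V is known. Let $\rho_\epsilon(v)$ be the minimax ϵ-risk of recovery:

$$\rho_\epsilon(v) = \inf_{\hat{g}(\cdot)} \left\{ \rho : \mathrm{Prob}_{\omega^K\sim[\mathcal{N}(Au+a,M(v))]^K}\{\omega^K : |\hat{g}(\omega^K) - g^T u| > \rho\} \le \epsilon \ \ \forall u \in U \right\},$$

where inf is taken over all Borel functions $\hat{g}(\cdot) : \mathbf{R}^{Kd} \to \mathbf{R}$. Invoking [131, Theorem 3.1], it is immediately seen that whenever ϵ < 1/4, one has

$$\rho_\epsilon(v) \ge \left[ \frac{2\ln(2/\epsilon)}{\ln\left(\frac{1}{4\epsilon}\right)} \right]^{-1} \mathrm{Opt}(v).$$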
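To get a feeling for the size of the factor $2\ln(2/\epsilon)/\ln(1/(4\epsilon))$ in this bound, it can be evaluated directly. The helper below is a hypothetical convenience function, not part of the text:

```python
import math


def near_optimality_factor(eps):
    """Return 2 * ln(2/eps) / ln(1/(4*eps)); defined for 0 < eps < 1/4."""
    if not 0.0 < eps < 0.25:
        raise ValueError("the bound requires 0 < eps < 1/4")
    return 2.0 * math.log(2.0 / eps) / math.log(1.0 / (4.0 * eps))
```

For ϵ = 0.01 the factor is about 3.29, and it tends to 2 as ϵ → +0.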

Since the family SG(U, V) of all sub-Gaussian distributions on $\mathbf{R}^d$ with parameters (Au + a, M(v)), u ∈ U, v ∈ V, contains all Gaussian distributions $\mathcal{N}(Au + a, M(v))$ induced by (u, v) ∈ U × V, we arrive at the following conclusion:

Proposition 3.10. In the just described situation, the minimax optimal ϵ-risk

$$\mathrm{Risk}^{\rm opt}_\epsilon(K) = \inf_{\hat{g}(\cdot)} \mathrm{Risk}_\epsilon(\hat{g}(\cdot))$$

of recovering $g^T u$ from a K-repeated i.i.d. sub-Gaussian observation with parameters (Au + a, M(v)), (u, v) ∈ U × V, is within a moderate factor of the upper bound $\mathrm{Opt}_K$ on the ϵ-risk, taken w.r.t. the same data, of the affine estimate $\hat{g}_*(\cdot)$ yielded by an optimal solution to (3.41), namely,

$$\mathrm{Opt}_K \le \frac{2\ln(2/\epsilon)}{\ln\left(\frac{1}{4\epsilon}\right)}\, \mathrm{Risk}^{\rm opt}_\epsilon(K).$$

3.3.4.3 Numerical illustration

The numerical illustration we are about to discuss models the situation in which we want to recover a linear form of a signal x known to belong to a given convex compact subset X via indirect observations Ax affected by sub-Gaussian “relative noise,” meaning that the larger the signal, the larger the variance of the observation. Specifically, our observation is ω ∼ SG(Ax, M(x)), where
\[
x\in X=\{x\in\mathbf R^n:\ 0\le x_j\le j^{-\alpha},\ 1\le j\le n\},\qquad M(x)=\sigma^2\sum_{j=1}^n x_j\Theta_j, \tag{3.44}
\]

where A ∈ R^{d×n} and Θ_j ∈ S^d_+, j = 1, ..., n, are given matrices; the linear form to be estimated is G(x) = g^T x. The entities g, A, and {Θ_j}_{j=1}^n and reals α ≥ 0 (“degree of smoothness”) and σ > 0 (“noise intensity”) are parameters of the estimation problem we intend to process. The parameters g, A, Θ_j are as follows:
• g ≥ 0 is selected at random and then normalized to have
\[
\max_{x\in X} g^Tx = \max_{x,y\in X} g^T[x-y] = 2;
\]
• we deal with the case of n > d (“deficient observations”); the d nonzero singular values of A were set to θ^{−(i−1)/(d−1)}, 1 ≤ i ≤ d, where “condition number” θ ≥ 1 is a parameter; the orthonormal systems U and V of the first d left and, respectively, right singular vectors of A were drawn at random from rotationally invariant distributions;
• the positive semidefinite d × d matrices Θ_j are orthogonal projectors on randomly selected subspaces in R^d of dimension ⌊d/2⌋;
• in all our experiments, we consider the single-observation case K = 1 and use υ(·) ≡ 0.
Note that X possesses the ≥-largest point x̄, whence M(x) ⪯ M(x̄) whenever x ∈ X; as a result, sub-Gaussian distributions with matrix parameter M(x), x ∈ X, can be thought also to have matrix parameter M(x̄). One of the goals of the considered experiment is to understand how much we might lose were we replacing M(·) with M̂(x) ≡ M(x̄), that is, were we ignoring the fact that small signals result


in low-noise observations. In our experiment we use d = 32, n = 48, α = 2, θ = 2, and σ = 0.01. With these parameters, we generated at random, as described above, 10 collections {g, A, Θ_j, j ≤ n}, thus arriving at 10 estimation problems. For each problem, we apply the outlined machinery to build an estimate of g^T x affine in ω as yielded by the optimal solution to (3.39), and compute the upper bound Opt on the (ǫ = 0.01)-risk of this estimate. In fact, for each problem, we build two estimates and two risk bounds: the first for the problem “as is,” and the second for the aforementioned “direct product envelope” of the problem, where the mapping x ↦ M(x) is replaced with the conservative x ↦ M̂(x) := M(x̄). The results are as follows:

                        min     median   mean    max
ω ∼ SG(Ax, M(x))        0.138   0.190    0.212   0.299
ω ∼ SG(Ax, M(x̄))        0.150   0.210    0.227   0.320

Upper bounds on 0.01-risk, data over 10 estimation problems [d = 32, n = 48, α = 2, θ = 2, σ = 0.01].

Note the significant “noise amplification” in the estimate (about 20 times the observation noise level σ) and high risk variability across the experiments. Seemingly, both these phenomena stem from the fact that we have highly deficient observations (n/d = 1.5) combined with a random orientation of the 16-dimensional kernel of A.
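The random data of this experiment can be generated along the following lines. This is only a sketch of the setup described above, not the code used to produce the table: the names `make_problem` and `random_orthonormal` are hypothetical, the uniform distribution used for g is an assumption (the text only says that g ≥ 0 is "selected at random"), and orthonormal systems are drawn via QR factorization of Gaussian matrices.

```python
import numpy as np

def random_orthonormal(rows, cols, rng):
    """rows x cols matrix with orthonormal columns, from QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q * np.sign(np.diag(r))  # sign fix: rotationally invariant distribution

def make_problem(d=32, n=48, alpha=2.0, theta=2.0, sigma=0.01, seed=0):
    rng = np.random.default_rng(seed)
    # Box X = {0 <= x_j <= j^(-alpha)}; x_bar is its >=-largest point.
    x_bar = np.arange(1, n + 1, dtype=float) ** (-alpha)
    # g >= 0 (uniform entries: an assumption), scaled so that max_X g^T x = g^T x_bar = 2.
    g = rng.random(n)
    g *= 2.0 / (g @ x_bar)
    # A = U diag(s) V^T with singular values theta^(-(i-1)/(d-1)), i = 1..d.
    s = theta ** (-np.arange(d) / (d - 1))
    U = random_orthonormal(d, d, rng)
    V = random_orthonormal(n, d, rng)
    A = (U * s) @ V.T
    # Theta_j: orthogonal projectors onto random subspaces of dimension floor(d/2).
    Thetas = np.array([P @ P.T for P in
                       (random_orthonormal(d, d // 2, rng) for _ in range(n))])

    def M(x):
        """M(x) = sigma^2 * sum_j x_j Theta_j, see (3.44)."""
        return sigma ** 2 * np.tensordot(x, Thetas, axes=1)

    return g, A, Thetas, x_bar, M
```

Since all Θ_j are positive semidefinite and x ≤ x̄ on X, the matrix M(x̄) − M(x) returned by this construction is positive semidefinite for every x in the box, which is exactly the domination used to pass to the "direct product envelope."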

3.4 ESTIMATING QUADRATIC FORMS VIA QUADRATIC LIFTING

In the situation of Section 3.3.1, passing from “original” observations (3.25) to their quadratic lifting, we can use the machinery just developed to estimate quadratic, rather than linear, forms of the underlying parameters. We investigate the related possibilities in the cases of Gaussian and sub-Gaussian observations. The results of this section form an essential extension of the results of [39, 81] where a similar approach to estimating quadratic functionals of the mean of a Gaussian vector was used.

3.4.1 Estimating quadratic forms, Gaussian case

3.4.1.1 Preliminaries

Consider the situation where we are given
• a nonempty bounded set U in R^m;
• a nonempty convex compact subset 𝒱 of the positive semidefinite cone S^d_+;
• a matrix Θ_* ≻ 0 such that Θ_* ⪰ Θ for all Θ ∈ 𝒱;
• an affine mapping u ↦ A[u; 1] : R^m → Ω = R^d, where A is a given d × (m + 1) matrix;
• a convex continuous function ϱ(·) on S^{m+1}_+.

A pair (u ∈ U, Θ ∈ 𝒱) specifies Gaussian random vector ζ ∼ N(A[u; 1], Θ) and thus specifies probability distribution P[u, Θ] of (ζ, ζζ^T). Let Q(U, 𝒱) be the family of probability distributions on Ω = R^d × S^d stemming this way from Gaussian distributions with parameters from U × 𝒱. Our goal is to cover the family Q(U, 𝒱) by a family of the type S[N, M, Φ]. It is convenient to represent a linear form on Ω = R^d × S^d as
\[
h^Tz + \tfrac12\mathrm{Tr}(HZ),
\]
where (h, H) ∈ R^d × S^d is the “vector of coefficients” of the form, and (z, Z) ∈ R^d × S^d is the argument of the form. We assume that for some δ ∈ [0, 2] it holds
\[
\|\Theta^{1/2}\Theta_*^{-1/2} - I_d\| \le \delta\quad \forall\,\Theta\in\mathcal V, \tag{3.45}
\]
where ‖·‖ is the spectral norm (cf. (2.129)). Finally, we set
\[
b = [0;...;0;1]\in\mathbf R^{m+1},\qquad B = \begin{bmatrix} A\\ b^T\end{bmatrix},
\]
and
\[
\mathcal Z^+ = \{W\in \mathbf S^{m+1}_+ :\ W_{m+1,m+1}=1\}.
\]

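The statement below is nothing but a straightforward reformulation of Proposition 2.43.i:

Proposition 3.11. In the just described situation, let us select γ ∈ (0, 1) and set
\[
\begin{array}{rcl}
\mathcal H &=& \mathcal H_\gamma := \{(h,H)\in \mathbf R^d\times\mathbf S^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
\mathcal M^+ &=& \mathcal V\times\mathcal Z^+,\\
\Phi(h,H;\Theta,Z) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2}) + \tfrac12\mathrm{Tr}([\Theta-\Theta_*]H)\\
&& {}+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\,\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2 + \Gamma(h,H;Z):\ \mathcal H\times\mathcal M^+\to\mathbf R,
\end{array}
\]
where ‖·‖ is the spectral, ‖·‖_F is the Frobenius norm, and
\[
\begin{array}{rcl}
\Gamma(h,H;Z) &=& \tfrac12\mathrm{Tr}\Big(Z\big[bh^TA + A^Thb^T + A^THA + B^T[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]B\big]\Big)\\
&=& \tfrac12\mathrm{Tr}\Big(ZB^T\Big[\begin{bmatrix}H&h\\ h^T&\end{bmatrix} + [H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\Big]B\Big).
\end{array}
\]
Then ℋ, ℳ^+, Φ is a regular data, and for every (u, Θ) ∈ R^m × 𝒱 it holds
\[
\forall (h,H)\in\mathcal H:\quad \ln\Big(\mathbf E_{\zeta\sim\mathcal N(A[u;1],\Theta)}\Big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\Big\}\Big)\le \Phi(h,H;\Theta,[u;1][u;1]^T).
\]
Besides this, function Φ(h, H; Θ, Z) is coercive in the convex argument: whenever (Θ, Z) ∈ ℳ^+ and (h_i, H_i) ∈ ℋ and ‖(h_i, H_i)‖ → ∞ as i → ∞, we have Φ(h_i, H_i; Θ, Z) → ∞, i → ∞.

3.4.1.2 Estimating quadratic form: Situation and goal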

Let us assume that we are given a sample ζ^K = (ζ_1, ..., ζ_K) of identically distributed observations
\[
\zeta_i\sim\mathcal N(A[u;1],M(v)),\quad 1\le i\le K, \tag{3.46}
\]
independent across i, where
• (u, v) is an unknown “signal” known to belong to a given set U × V, where
  – U ⊂ R^m is a compact set, and
  – V ⊂ R^k is a compact convex set;
• A is a given d × (m + 1) matrix, and v ↦ M(v) : R^k → S^d is an affine mapping such that M(v) ⪰ 0 whenever v ∈ V.
We are also given a convex calibrating function ϱ(Z) : S^{m+1}_+ → R and “functional of interest”
\[
F(u,v) = [u;1]^TQ[u;1]+q^Tv, \tag{3.47}
\]
where Q and q are a known (m + 1) × (m + 1) symmetric matrix and a k-dimensional vector, respectively. Our goal is to estimate the value F(u, v), for unknown (u, v) known to belong to U × V. Given a tolerance ǫ ∈ (0, 1), we quantify the quality of a candidate estimate ĝ(ζ^K) of F(u, v) by the smallest ρ such that for all (u, v) ∈ U × V it holds
\[
\mathrm{Prob}_{\zeta^K\sim[\mathcal N(A[u;1],M(v))]^K}\big\{|\hat g(\zeta^K)-F(u,v)|>\rho+\varrho([u;1][u;1]^T)\big\}\le\epsilon.
\]

3.4.1.3 Construction and result

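Let
\[
\mathcal V = \{M(v):\ v\in V\},
\]
so that 𝒱 is a convex compact subset of the positive semidefinite cone S^d_+. Let us select some
1. matrix Θ_* ≻ 0 such that Θ_* ⪰ Θ for all Θ ∈ 𝒱;
2. convex compact subset 𝒵 of the set 𝒵^+ = {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} such that [u; 1][u; 1]^T ∈ 𝒵 for all u ∈ U;
3. real γ ∈ (0, 1) and a nonnegative real δ such that (3.45) takes place.
We further set (cf. Proposition 3.11)
\[
\begin{array}{rcl}
B &=& \begin{bmatrix} A\\ [0,...,0,1]\end{bmatrix}\in\mathbf R^{(d+1)\times(m+1)},\\
\mathcal H &=& \mathcal H_\gamma := \{(h,H)\in \mathbf R^d\times\mathbf S^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
\mathcal M &=& \mathcal V\times\mathcal Z,\\
\Phi(h,H;\Theta,Z) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}H\Theta_*^{1/2}) + \tfrac12\mathrm{Tr}([\Theta-\Theta_*]H)\\
&& {}+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}H\Theta_*^{1/2}\|)}\,\|\Theta_*^{1/2}H\Theta_*^{1/2}\|_F^2 + \Gamma(h,H;Z):\ \mathcal H\times\mathcal M\to\mathbf R,
\end{array} \tag{3.48}
\]
where
\[
\begin{array}{rcl}
\Gamma(h,H;Z) &=& \tfrac12\mathrm{Tr}\Big(Z\big[bh^TA + A^Thb^T + A^THA + B^T[H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]B\big]\Big)\\
&=& \tfrac12\mathrm{Tr}\Big(ZB^T\Big[\begin{bmatrix}H&h\\ h^T&\end{bmatrix} + [H,h]^T[\Theta_*^{-1}-H]^{-1}[H,h]\Big]B\Big),
\end{array}
\]
and treat, as observation, the quadratic lifting of observation (3.46), that is, our observation is
\[
\omega^K = \{\omega_i = (\zeta_i,\zeta_i\zeta_i^T)\}_{i=1}^K,\ \text{with independent } \zeta_i\sim\mathcal N(A[u;1],M(v)). \tag{3.49}
\]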

Note that by Proposition 3.11 function Φ(h, H; Θ, Z) : ℋ × ℳ → R is a continuous convex-concave function which is coercive in convex argument and is such that
\[
\forall (u\in U,\ v\in V,\ (h,H)\in\mathcal H):\quad \ln\Big(\mathbf E_{\zeta\sim\mathcal N(A[u;1],M(v))}\Big\{e^{\frac12\zeta^TH\zeta+h^T\zeta}\Big\}\Big)\le \Phi(h,H;M(v),[u;1][u;1]^T). \tag{3.50}
\]
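Inequality (3.50) is easy to sanity-check numerically, since for Gaussian ζ the left-hand side is available in closed form: for ζ ∼ N(μ, Θ) with Θ^{1/2}HΘ^{1/2} ≺ I one has ln E exp{h^Tζ + ½ζ^THζ} = −½ ln Det(I − P) + h^Tμ + ½μ^THμ + ½c^T(I − P)^{-1}c, where P = Θ^{1/2}HΘ^{1/2} and c = Θ^{1/2}(h + Hμ). The sketch below evaluates both sides at Z = [u; 1][u; 1]^T; the function names are hypothetical, and the right-hand side is coded directly from the expression for Φ in (3.48), using the fact that at this Z the term Γ reduces to h^Tμ + ½μ^THμ + ½(Hμ + h)^T[Θ_*^{-1} − H]^{-1}(Hμ + h) with μ = A[u; 1].

```python
import numpy as np

def sym_sqrt(S):
    """Symmetric square root of a positive semidefinite matrix."""
    w, Q = np.linalg.eigh(S)
    return (Q * np.sqrt(np.clip(w, 0.0, None))) @ Q.T

def gaussian_log_mgf(h, H, mu, Theta):
    """Exact ln E exp{h^T z + z^T H z / 2} for z ~ N(mu, Theta)."""
    R = sym_sqrt(Theta)
    P = R @ H @ R                      # must satisfy P < I
    c = R @ (h + H @ mu)
    I = np.eye(len(mu))
    return (-0.5 * np.linalg.slogdet(I - P)[1] + h @ mu + 0.5 * mu @ H @ mu
            + 0.5 * c @ np.linalg.solve(I - P, c))

def phi_bound(h, H, Theta, Theta_star, A, u, delta):
    """Phi(h, H; Theta, [u;1][u;1]^T) as in (3.48)."""
    d = len(h)
    R = sym_sqrt(Theta_star)
    P = R @ H @ R
    spec = np.linalg.norm(P, 2)        # spectral norm, assumed < 1
    fro2 = np.linalg.norm(P, "fro") ** 2
    mu = A @ np.append(u, 1.0)         # A[u;1]
    e = H @ mu + h                     # [H,h] B [u;1]
    gamma = (h @ mu + 0.5 * mu @ H @ mu
             + 0.5 * e @ np.linalg.solve(np.linalg.inv(Theta_star) - H, e))
    return (-0.5 * np.linalg.slogdet(np.eye(d) - P)[1]
            + 0.5 * np.trace((Theta - Theta_star) @ H)
            + delta * (2.0 + delta) * fro2 / (2.0 * (1.0 - spec))
            + gamma)
```

In particular, when Θ = Θ_* and δ = 0 the two functions coincide, so in this case the bound (3.50) holds as an equality.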

We are about to demonstrate that when estimating the functional of interest (3.47) at a point (u, v) ∈ U × V via observation (3.49), we are in the situation considered in Section 3.3 and can utilize the corresponding machinery. Indeed, let us specify the following data introduced in Section 3.3.1:
• ℋ = {f = (h, H) ∈ ℋ} ⊂ E_H = R^d × S^d, with ℋ defined in (3.48), and the inner product on E_H defined as
\[
\langle (h,H),(h',H')\rangle = h^Th' + \tfrac12\mathrm{Tr}(HH'),
\]
E_M = S^d × S^{m+1}, and ℳ, Φ defined as in (3.48);
• E_X = R^k × S^{m+1}, 𝒳 = V × 𝒵;
• 𝒜(x = (v, Z)) = (M(v), Z); note that 𝒜 is an affine mapping from E_X into E_M which maps 𝒳 into ℳ, as required in Section 3.3.1. Observe that when u ∈ U and v ∈ V, the common distribution P = P_{u,v} of i.i.d. observations ω_i defined by (3.49) satisfies the relation
\[
\forall (f=(h,H)\in\mathcal H):\quad
\ln\Big(\mathbf E_{\omega\sim P}\big\{e^{\langle f,\omega\rangle}\big\}\Big)
= \ln\Big(\mathbf E_{\zeta\sim\mathcal N(A[u;1],M(v))}\Big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\Big\}\Big)
\le \Phi(h,H;M(v),[u;1][u;1]^T); \tag{3.51}
\]
see (3.50);
• υ(x = (v, Z)) = ϱ(Z) : 𝒳 → R;
• we define affine functional G(x) on E_X by the relation
\[
G(x) = \langle g, x := (v,Z)\rangle = q^Tv + \mathrm{Tr}(QZ);
\]
see (3.47). As a result, for x = (v, [u; 1][u; 1]^T) with v ∈ V and u ∈ U we have F(u, v) = G(x).
Applying Corollary 3.8 to the data just specified (which is legitimate, because our Φ clearly satisfies (3.30)), we arrive at the result as follows:

Proposition 3.12. In the situation just described, let us set

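\[
\begin{array}{rcl}
\hat\Psi_{+,K}(h,H) &:=& \inf\limits_{\alpha}\Big\{\max\limits_{(v,Z)\in V\times\mathcal Z}\big[\alpha\Phi(h/\alpha,H/\alpha;M(v),Z) - G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\big]:\\
&&\qquad\quad \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\
&=& \max\limits_{(v,Z)\in V\times\mathcal Z}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\big[\alpha\Phi(h/\alpha,H/\alpha;M(v),Z) - G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\big],\\[6pt]
\hat\Psi_{-,K}(h,H) &:=& \inf\limits_{\alpha}\Big\{\max\limits_{(v,Z)\in V\times\mathcal Z}\big[\alpha\Phi(-h/\alpha,-H/\alpha;M(v),Z) + G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\big]:\\
&&\qquad\quad \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\
&=& \max\limits_{(v,Z)\in V\times\mathcal Z}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\big[\alpha\Phi(-h/\alpha,-H/\alpha;M(v),Z) + G(v,Z)-\varrho(Z)+K^{-1}\alpha\ln(2/\epsilon)\big],
\end{array} \tag{3.52}
\]
so that functions Ψ̂_{±,K}(h, H) : R^d × S^d → R are convex. Furthermore, whenever h̄, H̄, ρ̄, κ̄ form a feasible solution to the system of convex constraints
\[
\hat\Psi_{+,K}(h,H)\le\rho-\kappa,\qquad \hat\Psi_{-,K}(h,H)\le\rho+\kappa \tag{3.53}
\]
in variables (h, H) ∈ R^d × S^d, ρ ∈ R, κ ∈ R, setting
\[
\hat g(\zeta^K := (\zeta_1,...,\zeta_K)) = \frac1K\sum_{i=1}^K\Big[\bar h^T\zeta_i+\tfrac12\zeta_i^T\bar H\zeta_i\Big]+\bar\kappa, \tag{3.54}
\]
we get an estimate of the functional of interest F(u, v) = [u; 1]^T Q[u; 1] + q^T v via K independent observations ζ_i ∼ N(A[u; 1], M(v)), i = 1, ..., K, with the following property:
\[
\forall (u,v)\in U\times V:\quad \mathrm{Prob}_{\zeta^K\sim[\mathcal N(A[u;1],M(v))]^K}\big\{|F(u,v)-\hat g(\zeta^K)|>\bar\rho+\varrho([u;1][u;1]^T)\big\}\le\epsilon. \tag{3.55}
\]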
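Once h̄, H̄, κ̄ are available, evaluating the estimate (3.54) is straightforward: it is the sample average of the linear form ⟨f, ω_i⟩ of the lifted observations, shifted by κ̄. The sketch below spells this out; the function names are hypothetical, and the parameters h̄, H̄, κ̄ are assumed to come from a solver for (3.53), which is not shown.

```python
import numpy as np

def lift(zeta):
    """Quadratic lifting omega = (zeta, zeta zeta^T) of an observation, cf. (3.49)."""
    return zeta, np.outer(zeta, zeta)

def pair(f, omega):
    """Inner product <(h,H),(z,Z)> = h^T z + Tr(HZ)/2 on R^d x S^d."""
    (h, H), (z, Z) = f, omega
    return h @ z + 0.5 * np.trace(H @ Z)

def estimate(zetas, h_bar, H_bar, kappa_bar):
    """Estimate (3.54): average of <f, omega_i> over the sample, plus kappa_bar."""
    f = (h_bar, H_bar)
    return np.mean([pair(f, lift(z)) for z in zetas]) + kappa_bar
```

With this inner product, ⟨(h, H), (ζ, ζζ^T)⟩ = h^Tζ + ½ζ^THζ, so an estimate that is quadratic in ζ is affine in the lifted observation ω, which is what allows the affine-estimate machinery of Section 3.3 to be applied.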

Proof. Under the premise of the proposition, let us fix u ∈ U, v ∈ V, so that x := (v, Z := [u; 1][u; 1]^T) ∈ 𝒳. Denoting, as above, by P = P_{u,v} the distribution of ω := (ζ, ζζ^T) with ζ ∼ N(A[u; 1], M(v)), and invoking (3.51), we see that for the (x, P) just defined, relation (3.26) takes place. Applying Corollary 3.8, we conclude that
\[
\mathrm{Prob}_{\zeta^K\sim[\mathcal N(A[u;1],M(v))]^K}\big\{|\hat g(\zeta^K)-G(x)|>\bar\rho+\varrho([u;1][u;1]^T)\big\}\le\epsilon.
\]
It remains to note that by construction for the x = (v, Z) in question it holds
\[
G(x) = q^Tv+\mathrm{Tr}(QZ) = q^Tv + \mathrm{Tr}(Q[u;1][u;1]^T) = q^Tv+[u;1]^TQ[u;1] = F(u,v).
\]
✷

An immediate consequence of Proposition 3.12 is as follows:

Corollary 3.13. Under the premise and in the notation of Proposition 3.12, let (h, H) ∈ R^d × S^d. Setting
\[
\rho = \tfrac12\big[\hat\Psi_{+,K}(h,H)+\hat\Psi_{-,K}(h,H)\big],\qquad
\kappa = \tfrac12\big[\hat\Psi_{-,K}(h,H)-\hat\Psi_{+,K}(h,H)\big], \tag{3.56}
\]
the ǫ-risk of estimate (3.54) does not exceed ρ.

Indeed, with ρ and κ given by (3.56), h, H, ρ, κ satisfy (3.53).

3.4.1.4 Consistency

We are about to present a simple sufficient condition for the estimator defined in Proposition 3.12 to be consistent in the sense of Section 3.3.4.1. Specifically, in the situation and with the notation from Sections 3.4.1.1 and 3.4.1.3 assume that
A.1. ϱ(·) ≡ 0;
A.2. V = {v̄} is a singleton and M(v̄) ≻ 0, which allows us to set Θ_* = M(v̄), to satisfy (3.45) with δ = 0, and to assume w.l.o.g. that F(u, v) = [u; 1]^T Q[u; 1], G(Z) = Tr(QZ);
A.3. the first m columns of the d × (m + 1) matrix A are linearly independent.
By A.3, the columns of (d + 1) × (m + 1) matrix B (see (3.48)) are linearly independent, so that we can find (m + 1) × (d + 1) matrix C such that CB = I_{m+1}. Let us define (h̄, H̄) ∈ R^d × S^d from the relation
\[
\begin{bmatrix}\bar H&\bar h\\ \bar h^T&\end{bmatrix} = 2(C^TQC)^o, \tag{3.57}
\]
where for (d + 1) × (d + 1) matrix S, S^o is the matrix obtained from S by zeroing out the entry in the cell (d + 1, d + 1). The consistency of our estimation machinery is given by the following simple statement:

Proposition 3.14. In the situation just described and under assumptions A.1–3, given ǫ ∈ (0, 1), consider the estimate
\[
\hat g_K(\zeta^K) = \frac1K\sum_{k=1}^K\Big[\bar h^T\zeta_k+\tfrac12\zeta_k^T\bar H\zeta_k\Big]+\kappa_K,
\]
where
\[
\kappa_K = \tfrac12\big[\hat\Psi_{-,K}(\bar h,\bar H)-\hat\Psi_{+,K}(\bar h,\bar H)\big]
\]
and Ψ̂_{±,K} are given by (3.52). Then the ǫ-risk of ĝ_K(·) goes to 0 as K → ∞.

For proof, see Section 3.6.4.

3.4.1.5 A modification

In the situation described at the beginning of Section 3.4.1.2, let a set W ⊂ U × V be given, and assume we are interested in estimating the value of F(u, v), as defined in (3.47), at points (u, v) ∈ W only. When reducing the “domain of interest” U × V to W, we hopefully can reduce the attainable ǫ-risk of recovery. Let us assume that we can point out a convex compact set 𝒲 ⊂ V × 𝒵 such that
\[
(u,v)\in W\ \Rightarrow\ (v,[u;1][u;1]^T)\in\mathcal W.
\]
A straightforward inspection justifies the following:

Remark 3.15. In the situation just described, the conclusion of Proposition 3.12 remains valid when the set U × V participating in (3.55) is reduced to W, and the set V × 𝒵 participating in relations (3.52) is reduced to 𝒲. This modification enlarges the feasible set of (3.53) and thus reduces the risk bound ρ̄.

3.4.2 Estimating quadratic form, sub-Gaussian case

3.4.2.1 Situation

In the rest of this section we are interested in the situation as follows: we are given K i.i.d. observations
\[
\zeta_i\sim\mathcal{SG}(A[u;1],M(v)),\quad i=1,...,K \tag{3.58}
\]
(i.e., ζ_i are sub-Gaussian random vectors with parameters A[u; 1] ∈ R^d and M(v) ∈ S^d_+), where
• (u, v) is an unknown “signal” known to belong to a given set U × V, where
  – U ⊂ R^m is a compact set, and
  – V ⊂ R^k is a compact convex set;
• A is a given d × (m + 1) matrix, and v ↦ M(v) : R^k → S^d is an affine mapping such that M(v) ⪰ 0 whenever v ∈ V.
We are also given a convex calibrating function ϱ(Z) : S^{m+1}_+ → R and “functional of interest”
\[
F(u,v) = [u;1]^TQ[u;1]+q^Tv, \tag{3.59}
\]
where Q and q are a known (m + 1) × (m + 1) symmetric matrix and a k-dimensional vector, respectively. Our goal is to recover F(u, v), for unknown (u, v) known to belong to U × V, via observation (3.58). Note that the only difference between our present setting and that considered in Section 3.4.1.1 is that now we allow for sub-Gaussian, and not necessarily Gaussian, observations.

3.4.2.2 Construction and result

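Let
\[
\mathcal V = \{M(v):\ v\in V\},
\]
so that 𝒱 is a convex compact subset of the positive semidefinite cone S^d_+. Let us select some
1. matrix Θ_* ≻ 0 such that Θ_* ⪰ Θ for all Θ ∈ 𝒱;
2. convex compact subset 𝒵 of the set 𝒵^+ = {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} such that [u; 1][u; 1]^T ∈ 𝒵 for all u ∈ U;
3. reals γ, γ^+ ∈ (0, 1) with γ < γ^+ (say, γ = 0.99, γ^+ = 0.999).
Preliminaries. Given the data of the above description and δ ∈ [0, 2], we set (cf. Proposition 3.11)
\[
\begin{array}{rcl}
\mathcal H &=& \mathcal H_\gamma := \{(h,H)\in \mathbf R^d\times\mathbf S^d:\ -\gamma\Theta_*^{-1}\preceq H\preceq\gamma\Theta_*^{-1}\},\\
B &=& \begin{bmatrix} A\\ [0,...,0,1]\end{bmatrix}\in\mathbf R^{(d+1)\times(m+1)},\\
\mathcal M &=& \mathcal V\times\mathcal Z,\\
\Psi(h,H,G;Z) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2})\\
&& {}+\tfrac12\mathrm{Tr}\Big(ZB^T\Big[\begin{bmatrix}H&h\\ h^T&\end{bmatrix} + [H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\Big]B\Big):\\
&& \big(\mathcal H\times\{G: 0\preceq G\preceq\gamma^+\Theta_*^{-1}\}\big)\times\mathcal Z\to\mathbf R,\\
\Psi_\delta(h,H,G;\Theta,Z) &=& -\tfrac12\ln\mathrm{Det}(I-\Theta_*^{1/2}G\Theta_*^{1/2}) + \tfrac12\mathrm{Tr}([\Theta-\Theta_*]G)
+\dfrac{\delta(2+\delta)}{2(1-\|\Theta_*^{1/2}G\Theta_*^{1/2}\|)}\,\|\Theta_*^{1/2}G\Theta_*^{1/2}\|_F^2\\
&& {}+\tfrac12\mathrm{Tr}\Big(ZB^T\Big[\begin{bmatrix}H&h\\ h^T&\end{bmatrix} + [H,h]^T[\Theta_*^{-1}-G]^{-1}[H,h]\Big]B\Big):\\
&& \big(\mathcal H\times\{G: 0\preceq G\preceq\gamma^+\Theta_*^{-1}\}\big)\times\big(\{0\preceq\Theta\preceq\Theta_*\}\times\mathcal Z\big)\to\mathbf R,\\
\Phi(h,H;Z) &=& \min\limits_G\big\{\Psi(h,H,G;Z):\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H\big\}:\ \mathcal H\times\mathcal Z\to\mathbf R,\\
\Phi_\delta(h,H;\Theta,Z) &=& \min\limits_G\big\{\Psi_\delta(h,H,G;\Theta,Z):\ 0\preceq G\preceq\gamma^+\Theta_*^{-1},\ G\succeq H\big\}:\\
&& \mathcal H\times\big(\{0\preceq\Theta\preceq\Theta_*\}\times\mathcal Z\big)\to\mathbf R.
\end{array} \tag{3.60}
\]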

The following statement is a straightforward reformulation of Proposition 2.46.i:

Proposition 3.16. In the situation described in Sections 3.4.2.1 and 3.4.2.2 we have
(i) Φ is a well-defined real-valued continuous function on the domain ℋ × 𝒵; the function is convex in (h, H) ∈ ℋ, concave in Z ∈ 𝒵, and Φ(0; Z) ≥ 0. Furthermore, let (h, H) ∈ ℋ, u ∈ U, v ∈ V, and let ζ ∼ SG(A[u; 1], M(v)). Then
\[
\ln\Big(\mathbf E_\zeta\big\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\big\}\Big)\le\Phi(h,H;[u;1][u;1]^T). \tag{3.61}
\]
(ii) Assume that
\[
\forall\,\Theta\in\mathcal V:\quad \|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta. \tag{3.62}
\]
Then Φ_δ(h, H; Θ, Z) is a well-defined real-valued continuous function on the domain ℋ × (𝒱 × 𝒵); it is convex in (h, H) ∈ ℋ, concave in (Θ, Z) ∈ 𝒱 × 𝒵, and Φ_δ(0; Θ, Z) ≥ 0. Furthermore, let (h, H) ∈ ℋ, u ∈ U, v ∈ V, and let ζ ∼ SG(A[u; 1], M(v)). Then
\[
\ln\Big(\mathbf E_\zeta\big\{\exp\{h^T\zeta+\tfrac12\zeta^TH\zeta\}\big\}\Big)\le\Phi_\delta(h,H;M(v),[u;1][u;1]^T). \tag{3.63}
\]
The estimate. Our construction of the estimate is completely similar to the case of Gaussian observations. Specifically, let us pass from observations (3.58) to their

quadratic lifts, so that our observations become
\[
\omega_i = (\zeta_i,\zeta_i\zeta_i^T),\ 1\le i\le K,\qquad \zeta_i\sim\mathcal{SG}(A[u;1],M(v))\ \text{are i.i.d.} \tag{3.64}
\]

As in the Gaussian case, we find ourselves in the situation considered in Section 3.3.3 and can use the corresponding constructions. Indeed, let us specify the data introduced in Section 3.3.1 and participating in the constructions of Section 3.3 as follows:
• ℋ = {f = (h, H) ∈ ℋ} ⊂ E_H = R^d × S^d, with ℋ defined in (3.60), and the inner product on E_H defined as
\[
\langle (h,H),(h',H')\rangle = h^Th' + \tfrac12\mathrm{Tr}(HH'),
\]
E_M = S^d × S^{m+1}, and ℳ, Φ defined as in (3.60);
• E_X = R^k × S^{m+1}, 𝒳 = V × 𝒵;
• 𝒜(x = (v, Z)) = (M(v), Z); note that 𝒜 is an affine mapping from E_X into E_M mapping 𝒳 into ℳ, as required in Section 3.3. Observe that when u ∈ U and v ∈ V, the common distribution P = P_{u,v} of i.i.d. observations ω_i defined by (3.64) satisfies the relation
\[
\forall (f=(h,H)\in\mathcal H):\quad
\ln\Big(\mathbf E_{\omega\sim P}\big\{e^{\langle f,\omega\rangle}\big\}\Big)
= \ln\Big(\mathbf E_{\zeta\sim\mathcal{SG}(A[u;1],M(v))}\Big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\Big\}\Big)
\le \Phi(h,H;[u;1][u;1]^T); \tag{3.65}
\]
see (3.61). Moreover, in the case of (3.62), we have also
\[
\forall (f=(h,H)\in\mathcal H):\quad
\ln\Big(\mathbf E_{\omega\sim P}\big\{e^{\langle f,\omega\rangle}\big\}\Big)
= \ln\Big(\mathbf E_{\zeta\sim\mathcal{SG}(A[u;1],M(v))}\Big\{e^{h^T\zeta+\frac12\zeta^TH\zeta}\Big\}\Big)
\le \Phi_\delta(h,H;M(v),[u;1][u;1]^T); \tag{3.66}
\]
see (3.63);
• we set υ(x = (v, Z)) = ϱ(Z);
• we define affine functional G(x) on E_X by the relation
\[
G(x := (v,Z)) = q^Tv + \mathrm{Tr}(QZ);
\]
see (3.59). As a result, for x = (v, [u; 1][u; 1]^T) with v ∈ V and u ∈ U we have F(u, v) = G(x).
The result. Applying to the data just specified Corollary 3.8 (which is legitimate, because our Φ clearly satisfies (3.30)), we arrive at the result as follows:

Proposition 3.17. In the situation described in Sections 3.4.2.1 and 3.4.2.2 let us

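set
\[
\begin{array}{rcl}
\hat\Psi_{+,K}(h,H) &:=& \inf\limits_{\alpha}\Big\{\max\limits_{(v,Z)\in V\times\mathcal Z}\big[\alpha\Phi(h/\alpha,H/\alpha;Z) - G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big]:\\
&&\qquad\quad \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\
&=& \max\limits_{(v,Z)\in V\times\mathcal Z}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\big[\alpha\Phi(h/\alpha,H/\alpha;Z) - G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big],\\[6pt]
\hat\Psi_{-,K}(h,H) &:=& \inf\limits_{\alpha}\Big\{\max\limits_{(v,Z)\in V\times\mathcal Z}\big[\alpha\Phi(-h/\alpha,-H/\alpha;Z) + G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big]:\\
&&\qquad\quad \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\
&=& \max\limits_{(v,Z)\in V\times\mathcal Z}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\big[\alpha\Phi(-h/\alpha,-H/\alpha;Z) + G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big].
\end{array} \tag{3.67}
\]
Thus, functions Ψ̂_{±,K}(h, H) : R^d × S^d → R are convex. Furthermore, whenever h̄, H̄, ρ̄, κ̄ form a feasible solution to the system of convex constraints
\[
\hat\Psi_{+,K}(h,H)\le\rho-\kappa,\qquad \hat\Psi_{-,K}(h,H)\le\rho+\kappa \tag{3.68}
\]
in variables (h, H) ∈ R^d × S^d, ρ ∈ R, κ ∈ R, the estimate
\[
\hat g(\zeta^K) = \frac1K\sum_{i=1}^K\Big[\bar h^T\zeta_i+\tfrac12\zeta_i^T\bar H\zeta_i\Big]+\bar\kappa
\]
of F(u, v) = [u; 1]^T Q[u; 1] + q^T v via i.i.d. observations ζ_i ∼ SG(A[u; 1], M(v)), 1 ≤ i ≤ K, satisfies for all (u, v) ∈ U × V:
\[
\mathrm{Prob}_{\zeta^K\sim[\mathcal{SG}(A[u;1],M(v))]^K}\big\{|F(u,v)-\hat g(\zeta^K)|>\bar\rho+\varrho([u;1][u;1]^T)\big\}\le\epsilon.
\]

Proof. Under the premise of the proposition, let us fix u ∈ U, v ∈ V, and let x = (v, Z := [u; 1][u; 1]^T). Denoting by P the distribution of ω := (ζ, ζζ^T) with ζ ∼ SG(A[u; 1], M(v)), and invoking (3.65), we see that for the (x, P) just defined relation (3.26) takes place. Applying Corollary 3.8, we conclude that
\[
\mathrm{Prob}_{\zeta^K\sim[\mathcal{SG}(A[u;1],M(v))]^K}\big\{|\hat g(\zeta^K)-G(x)|>\bar\rho+\varrho([u;1][u;1]^T)\big\}\le\epsilon.
\]
It remains to note that by construction for the x = (v, Z) in question it holds
\[
G(x) = q^Tv+\mathrm{Tr}(QZ) = q^Tv+[u;1]^TQ[u;1] = F(u,v).
\]
✷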



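Remark 3.18. In the situation described in Sections 3.4.2.1 and 3.4.2.2 let δ ∈ [0, 2] be such that
\[
\|\Theta^{1/2}\Theta_*^{-1/2}-I_d\|\le\delta\quad\forall\,\Theta\in\mathcal V.
\]
Then the conclusion of Proposition 3.17 remains valid when the function Φ in (3.67) is replaced with the function Φ_δ, that is, when Ψ̂_{±,K} are defined as
\[
\begin{array}{rcl}
\hat\Psi_{+,K}(h,H) &:=& \inf\limits_{\alpha}\Big\{\max\limits_{(v,Z)\in V\times\mathcal Z}\big[\alpha\Phi_\delta(h/\alpha,H/\alpha;M(v),Z) - G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big]:\\
&&\qquad\quad \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\
&=& \max\limits_{(v,Z)\in V\times\mathcal Z}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\big[\alpha\Phi_\delta(h/\alpha,H/\alpha;M(v),Z) - G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big],\\[6pt]
\hat\Psi_{-,K}(h,H) &:=& \inf\limits_{\alpha}\Big\{\max\limits_{(v,Z)\in V\times\mathcal Z}\big[\alpha\Phi_\delta(-h/\alpha,-H/\alpha;M(v),Z) + G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big]:\\
&&\qquad\quad \alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}\Big\}\\
&=& \max\limits_{(v,Z)\in V\times\mathcal Z}\ \inf\limits_{\alpha>0,\ -\gamma\alpha\Theta_*^{-1}\preceq H\preceq\gamma\alpha\Theta_*^{-1}}\big[\alpha\Phi_\delta(-h/\alpha,-H/\alpha;M(v),Z) + G(v,Z)-\varrho(Z)+\alpha K^{-1}\ln(2/\epsilon)\big].
\end{array}
\]
To justify Remark 3.18, it suffices to replace relation (3.65) in the proof of Proposition 3.17 with (3.66). Note that what is better in terms of the risk of the resulting estimate (Proposition 3.17 “as is” or its modification presented in Remark 3.18) depends on the situation, so that it makes sense to keep in mind both options.

3.4.2.3 Numerical illustration, direct observations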

The problem. Our initial illustration is deliberately selected to be extremely simple: given direct noisy observations
\[
\zeta = u+\xi
\]
of unknown signal u ∈ R^m known to belong to a given set U, we want to recover the “energy” u^T u of u. We are interested in an estimate of u^T u quadratic in ζ with as small as possible an ǫ-risk on U; here ǫ ∈ (0, 1) is a given design parameter. The details of our setup are as follows:
• U is the “spherical layer” U = {u ∈ R^m : r² ≤ u^T u ≤ R²}, where r and R, 0 ≤ r < R < ∞, are given. As a result, the “main ingredient” of constructions from Sections 3.4.1.3 and 3.4.2.2, namely the convex compact subset 𝒵 of the set {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} containing all matrices [u; 1][u; 1]^T, u ∈ U, can be specified as
\[
\mathcal Z = \{Z\in\mathbf S^{m+1}_+:\ Z_{m+1,m+1}=1,\ 1+r^2\le\mathrm{Tr}(Z)\le 1+R^2\};
\]
• ξ is either ∼ N(0, Θ) (Gaussian case), or ∼ SG(0, Θ) (sub-Gaussian case), with matrix Θ known to be diagonal with diagonal entries equal to each other satisfying θσ² ≤ Θ_ii ≤ σ², 1 ≤ i ≤ d = m, with known θ ∈ [0, 1] and σ² > 0;
• the calibrating function ϱ(Z) is ϱ(Z) = ς(Σ_{i=1}^m Z_ii), where ς is a convex continuous real-valued function on R_+. Note that with this selection, the claim that ǫ-risk of an estimate ĝ(·) is ≤ ρ means that whenever u ∈ U, one has
\[
\mathrm{Prob}\big\{|\hat g(u+\xi)-u^Tu|>\rho+\varsigma(u^Tu)\big\}\le\epsilon. \tag{3.69}
\]
Processing the problem. It is easily seen that in the situation in question the apparatus in Sections 3.4.1 and 3.4.2 translates into the following:

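1. We lose nothing when restricting ourselves to estimates of the form
\[
\hat g(\zeta) = \tfrac12\eta\,\zeta^T\zeta+\kappa, \tag{3.70}
\]
with properly selected scalars η and κ;
2. In the Gaussian case, η and κ are yielded by the convex optimization problem with only three variables α_+, α_−, and η, namely the problem
\[
\min_{\alpha_\pm,\eta}\Big\{\hat\Psi(\alpha_+,\alpha_-,\eta) = \tfrac12\big[\hat\Psi_+(\alpha_+,\eta)+\hat\Psi_-(\alpha_-,\eta)\big]:\ \sigma^2|\eta|<\alpha_\pm\Big\}, \tag{3.71}
\]
where
\[
\begin{array}{rcl}
\hat\Psi_+(\alpha_+,\eta) &=& -\dfrac{d\alpha_+}{2}\ln(1-\sigma^2\eta/\alpha_+) + \dfrac d2\sigma^2(1-\theta)\max[-\eta,0] + \dfrac{d\,\delta(2+\delta)\sigma^4\eta^2}{2(\alpha_+-\sigma^2|\eta|)}\\
&& {}+\max\limits_{r^2\le t\le R^2}\Big[\Big[\dfrac{\alpha_+\eta}{2(\alpha_+-\sigma^2\eta)}-1\Big]t-\varsigma(t)\Big] + \alpha_+\ln(2/\epsilon),\\[8pt]
\hat\Psi_-(\alpha_-,\eta) &=& -\dfrac{d\alpha_-}{2}\ln(1+\sigma^2\eta/\alpha_-) + \dfrac d2\sigma^2(1-\theta)\max[\eta,0] + \dfrac{d\,\delta(2+\delta)\sigma^4\eta^2}{2(\alpha_--\sigma^2|\eta|)}\\
&& {}+\max\limits_{r^2\le t\le R^2}\Big[\Big[-\dfrac{\alpha_-\eta}{2(\alpha_-+\sigma^2\eta)}+1\Big]t-\varsigma(t)\Big] + \alpha_-\ln(2/\epsilon),
\end{array}
\]
with δ = 1 − θ. Now, the η-component of a feasible solution to (3.71) augmented by the quantity
\[
\kappa = \tfrac12\big[\hat\Psi_-(\alpha_-,\eta)-\hat\Psi_+(\alpha_+,\eta)\big]
\]
yields estimate (3.70) with ǫ-risk on U not exceeding Ψ̂(α_+, α_−, η);
3. In the sub-Gaussian case, η and κ are yielded by the convex optimization problem with five variables, α_±, g_±, and η, namely, the problem
\[
\min_{\alpha_\pm,g_\pm,\eta}\Big\{\hat\Psi(\alpha_\pm,g_\pm,\eta) = \tfrac12\big[\hat\Psi_+(\alpha_+,g_+,\eta)+\hat\Psi_-(\alpha_-,g_-,\eta)\big]:\
0\le\sigma^2g_\pm<\alpha_\pm,\ -\alpha_+<\sigma^2\eta<\alpha_-,\ \eta\le g_+,\ -\eta\le g_-\Big\}, \tag{3.72}
\]
where
\[
\begin{array}{rcl}
\hat\Psi_+(\alpha_+,g_+,\eta) &=& -\dfrac{d\alpha_+}{2}\ln(1-\sigma^2g_+/\alpha_+) + \alpha_+\ln(2/\epsilon)
+\max\limits_{r^2\le t\le R^2}\Big[\Big[\dfrac{\sigma^2\eta^2}{2(\alpha_+-\sigma^2g_+)}+\tfrac12\eta-1\Big]t-\varsigma(t)\Big],\\[8pt]
\hat\Psi_-(\alpha_-,g_-,\eta) &=& -\dfrac{d\alpha_-}{2}\ln(1-\sigma^2g_-/\alpha_-) + \alpha_-\ln(2/\epsilon)
+\max\limits_{r^2\le t\le R^2}\Big[\Big[\dfrac{\sigma^2\eta^2}{2(\alpha_--\sigma^2g_-)}-\tfrac12\eta+1\Big]t-\varsigma(t)\Big].
\end{array}
\]
The η-component of a feasible solution to (3.72) augmented by the quantity
\[
\kappa = \tfrac12\big[\hat\Psi_-(\alpha_-,g_-,\eta)-\hat\Psi_+(\alpha_+,g_+,\eta)\big]
\]
yields estimate (3.70) with ǫ-risk on U not exceeding Ψ̂(α_±, g_±, η).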
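To make the Gaussian-case recipe of item 2 concrete, here is a minimal sketch that evaluates Ψ̂_±(α_±, η) from (3.71) for the trivial calibrating function ς ≡ 0 and runs a coarse grid search over (α_+, α_−, η). The function names, the grid, and the search ranges are our assumptions; the grid search is only a stand-in for a proper convex solver and is not expected to reproduce reported risk values to full accuracy. The maximum over t ∈ [r², R²] of a linear function is attained at an endpoint, which is what the code uses.

```python
import math

def psi_plus(alpha, eta, d, r, R, sigma=1.0, theta=1.0, eps=0.01):
    """Psi_hat_+(alpha_+, eta) of (3.71) with varsigma = 0; +inf outside the domain."""
    s2 = sigma ** 2
    if alpha <= s2 * abs(eta):
        return math.inf
    delta = 1.0 - theta                      # as stated after (3.71)
    slope = alpha * eta / (2.0 * (alpha - s2 * eta)) - 1.0
    return (-0.5 * d * alpha * math.log(1.0 - s2 * eta / alpha)
            + 0.5 * d * s2 * (1.0 - theta) * max(-eta, 0.0)
            + d * delta * (2.0 + delta) * s2 ** 2 * eta ** 2 / (2.0 * (alpha - s2 * abs(eta)))
            + max(slope * r ** 2, slope * R ** 2)
            + alpha * math.log(2.0 / eps))

def psi_minus(alpha, eta, d, r, R, sigma=1.0, theta=1.0, eps=0.01):
    """Psi_hat_-(alpha_-, eta) of (3.71) with varsigma = 0; +inf outside the domain."""
    s2 = sigma ** 2
    if alpha <= s2 * abs(eta):
        return math.inf
    delta = 1.0 - theta
    slope = -alpha * eta / (2.0 * (alpha + s2 * eta)) + 1.0
    return (-0.5 * d * alpha * math.log(1.0 + s2 * eta / alpha)
            + 0.5 * d * s2 * (1.0 - theta) * max(eta, 0.0)
            + d * delta * (2.0 + delta) * s2 ** 2 * eta ** 2 / (2.0 * (alpha - s2 * abs(eta)))
            + max(slope * r ** 2, slope * R ** 2)
            + alpha * math.log(2.0 / eps))

def crude_search(d, r, R, sigma=1.0, theta=1.0, eps=0.01,
                 n_eta=41, n_alpha=60, alpha_min=1e-2, alpha_max=1e4):
    """Coarse grid search for (3.71); returns (rho, eta, kappa) for estimate (3.70)."""
    alphas = [alpha_min * (alpha_max / alpha_min) ** (i / (n_alpha - 1))
              for i in range(n_alpha)]
    kw = dict(d=d, r=r, R=R, sigma=sigma, theta=theta, eps=eps)
    best = None
    for j in range(n_eta):
        eta = 2.0 * j / (n_eta - 1)          # eta in [0, 2]: an assumed search range
        p = min(psi_plus(a, eta, **kw) for a in alphas)   # alpha_+ and alpha_- decouple
        m = min(psi_minus(a, eta, **kw) for a in alphas)
        rho, kappa = 0.5 * (p + m), 0.5 * (m - p)
        if best is None or rho < best[0]:
            best = (rho, eta, kappa)
    return best
```

A call such as `crude_search(d=64, r=0.0, R=16.0)` returns a risk bound ρ together with the coefficients η and κ of the estimate ½ηζ^Tζ + κ; dividing ρ by R² gives the relative risk in the sense used in the numerical results for this problem.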

Note that the Gaussian case of our “energy estimation” problem is well studied in the literature (see, among others, [19, 43, 81, 87, 90, 97, 120, 124, 147, 160]), mainly in the case ξ ∼ N(0, σ² I_m) of white Gaussian noise with exactly known variance σ². Available results investigate analytically the interplay between the dimension m of signal, noise intensity σ² and the parameters R, r and offer estimates which are provably optimal, up to absolute constant factors. A nice property of the proposed approach is that (3.71) automatically takes care of the parameters and results in estimates with seemingly near-optimal performance, as witnessed by the numerical experiments we are about to present.

   d      r      R     θ     Relative 0.01-risk,   Relative 0.01-risk,   Optimality
                             Gaussian case         sub-Gaussian case     ratio
   64     0     16     1     0.34808               0.44469               1.22
   64     0     16     0.5   0.43313               0.44469               1.48
   64     0    128     1     0.04962               0.05181               1.28
   64     0    128     0.5   0.05064               0.05181               1.34
   64     8     80     1     0.07827               0.08376               1.28
   64     8     80     0.5   0.08095               0.08376               1.34
  256     0     32     1     0.19503               0.30457               1.28
  256     0     32     0.5   0.26813               0.30457               1.41
  256     0    512     1     0.01264               0.01314               1.28
  256     0    512     0.5   0.01289               0.01314               1.34
  256    16    160     1     0.03996               0.04501               1.28
  256    16    160     0.5   0.04255               0.04501               1.34
 1024     0     64     1     0.10272               0.21923               1.28
 1024     0     64     0.5   0.17032               0.21923               1.34
 1024     0   2048     1     0.00317               0.00330               1.28
 1024     0   2048     0.5   0.00324               0.00330               1.34
 1024    32    320     1     0.02019               0.02516               1.28
 1024    32    320     0.5   0.02273               0.02516               1.41

Table 3.3: Estimating the signal energy from direct observations.

Numerical results. In the first series of experiments we use the trivial calibrating function: ς(·) ≡ 0. A typical sample of numerical results is presented in Table 3.3. To avoid large numbers, we display in the table relative 0.01-risk of the estimates, that is, the plain risk as given by (3.71) divided by R²; keeping this in mind, one will not be surprised that when extending the range [r, R] of allowed norms of the observed signal, all other components of the setup being fixed, the relative risk can decrease (the actual risk, of course, can only increase). Note that in all our experiments σ is set to 1. Along with the values of the relative 0.01-risk, we present also the values of “optimality ratios,” that is, the ratios of the upper risk bounds given by (3.71) in the Gaussian case, to (lower bounds on) the best 0.01-risks Risk*_{0.01} possible under the circumstances, defined as the infimum of the 0.01-risk over all estimates recovering ‖u‖₂² via single observation ω = u + ξ. These lower bounds are obtained as follows. Let us select some values r_1 < r_2 in the allowed range [r, R] of ‖u‖₂, along with two values, σ_1, σ_2, in the allowed range [θσ, σ] = [θ, 1] of values of diagonal entries in diagonal matrices Θ, and consider two distributions of observations P_1 and P_2 as follows: P_χ is the distribution of the random vector x + ξ, where x and ξ are independent, x is uniformly distributed on the sphere ‖x‖₂ = r_χ, and ξ ∼ N(0, σ_χ² I_d).
It is immediately seen that whenever the two simple hypotheses ω ∼ P_1 and ω ∼ P_2 cannot be decided upon via a single observation by a test with total risk (the sum, over the two hypotheses in question, of probabilities for the test to reject the hypothesis when it is true) ≤ 2ǫ, the quantity δ = ½(r_2² − r_1²) is a lower bound on the optimal ǫ-risk, Risk*_ǫ. In other words, denoting by p_χ(·) the density


of Pχ , we have 0.02
d (“deficient observations”),
• u ∈ R^m is a signal known to belong to a compact set U,
• ξ ∼ N(0, Θ) (Gaussian case) or ξ ∼ SG(0, Θ) (sub-Gaussian case) is the observation noise; Θ is a positive semidefinite d × d matrix known to belong to a given convex compact set 𝒱 ⊂ S^d_+.
Our goal is to estimate the energy
\[
F(u) = \frac1m\|u\|_2^2
\]
of the signal given observation (3.74). In our experiment, the data is specified as follows:
1. We think of u ∈ R^m as of discretization of a smooth function x(t) of continuous argument t ∈ [0, 1]: u_i = x(i/m), 1 ≤ i ≤ m. We set U = {u : ‖Su‖₂ ≤ 1}, where u ↦ Su is the finite-difference approximation of the mapping x(·) ↦ (x(0), x′(0), x″(·)), so that U is a natural discrete-time analog of the Sobolev-type ball {x : [x(0)]² + [x′(0)]² + ∫₀¹ [x″(t)]² dt ≤ 1}.
2. d × m matrix B is of the form U D V^T, where U and V are randomly selected d × d and m × m orthogonal matrices, and the d diagonal entries in diagonal d × m matrix D are of the form θ^{−(i−1)/(d−1)}, 1 ≤ i ≤ d.
3. The set 𝒱 of admissible matrices Θ is the set of all diagonal d × d matrices with diagonal entries varying in [0, σ²].
Both σ and θ are components of the experiment setup.
Processing the problem. The described estimation problem clearly is covered by the setups considered in Sections 3.4.1 (Gaussian case) and 3.4.2 (sub-Gaussian case); in terms of these setups, it suffices to specify Θ_* as σ² I_d, M(v) as the identity mapping of 𝒱 onto itself, the mapping u ↦ A[u; 1] as the mapping u ↦ Bu, and the set 𝒵 (which should be a convex compact subset of the set {Z ∈ S^{m+1}_+ : Z_{m+1,m+1} = 1} containing all matrices of the form [u; 1][u; 1]^T, u ∈ U) as the set
\[
\mathcal Z = \{Z\in\mathbf S^{m+1}_+:\ Z_{m+1,m+1}=1,\ \mathrm{Tr}\big(Z\,\mathrm{Diag}\{S^TS,0\}\big)\le 1\}.
\]
As suggested by Propositions 3.12 (Gaussian case) and 3.17 (sub-Gaussian case), the linear in “lifted observation” ω = (ζ, ζζ^T) estimates of F(u) = (1/m)‖u‖₂² stem from the optimal solution (h_*, H_*) to the convex optimization problem
\[
\mathrm{Opt} = \min_{h,H}\ \tfrac12\big[\hat\Psi_+(h,H)+\hat\Psi_-(h,H)\big], \tag{3.75}
\]

with Ψ̂_±(·) given by (3.52) in the Gaussian, and by (3.67) in the sub-Gaussian cases, with the number K of observations in (3.52) and (3.67) set to 1. The resulting estimate is
\[
\zeta\mapsto h_*^T\zeta+\tfrac12\zeta^TH_*\zeta+\kappa,\qquad \kappa=\tfrac12\big[\hat\Psi_-(h_*,H_*)-\hat\Psi_+(h_*,H_*)\big], \tag{3.76}
\]
and the ǫ-risk of the estimate is (upper-bounded by) Opt.

Problem (3.75) is a well-structured convex-concave saddle point problem and as such is beyond the “immediate scope” of the standard Convex Programming software toolbox primarily aimed at solving well-structured convex minimization (or maximization) problems. However, applying conic duality, one can easily eliminate in (3.52) and (3.67) the inner maxima over v, Z to end up with a reformulation which can be solved numerically by CVX [108], and this is how we process (3.75) in our experiments.

Numerical results. In the experiments to be reported, we use the trivial calibrating function: ϱ(·) ≡ 0. We present some typical numerical results in Table 3.4. To qualify the performance of our approach, we present, along with the upper risk bounds for the computed estimates, simple lower bounds on ǫ-risk.

 d, m      Opt, Gaussian case    Opt, sub-Gaussian case    LwBnd
 8, 12     0.1362 (+65%)         0.1382 (+67%)             0.0825
 16, 24    0.1614 (+53%)         0.1640 (+55%)             0.1058
 32, 48    0.0687 (+46%)         0.0692 (+48%)             0.0469

Table 3.4: Upper bound (Opt) on the 0.01-risk of estimate (3.76), (3.75) vs. lower bound (LwBnd) on the 0.01-risk attainable under the circumstances. In the experiments, σ = 0.025 and θ = 10. Data in parentheses: excess of Opt over LwBnd.

The origin of the lower bounds is as follows. Assume we have at our disposal a signal w ∈ U, and let
\[
t(w) = \|Bw\|_2,\qquad \rho = 2\sigma\,\mathrm{ErfcInv}(\epsilon),
\]
where ErfcInv is the inverse error function as defined in (1.26). Setting θ(w) = max[1 − ρ/t(w), 0], observe that w′ := θ(w)w ∈ U and ‖Bw − Bw′‖₂ ≤ ρ, which, due to the origin of ρ, implies that there is no way to decide via observation Bu + ξ, ξ ∼ N(0, σ² I_d), with risk < ǫ on the two simple hypotheses u = w and u = w′. As an immediate consequence, the quantity φ(w) := ½[‖w‖₂² − ‖w′‖₂²] = ‖w‖₂²[1 − θ²(w)]/2 is a lower bound on the ǫ-risk, on U, of any estimate of ‖u‖₂². We can now try to maximize the resulting lower risk bound over U, thus arriving at the lower risk bound
\[
\mathrm{LwBnd} = \max_{w\in U}\Big\{\tfrac12\|w\|_2^2\,(1-\theta^2(w))\Big\}.
\]
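For a given candidate signal w, the quantity φ(w) entering LwBnd is a one-line computation. The sketch below evaluates it; the function names are hypothetical, and we assume (this should be checked against the definition in (1.26)) that ErfcInv(ǫ) is the (1 − ǫ)-quantile of the standard Gaussian distribution, that is, the s with Prob{N(0, 1) > s} = ǫ.

```python
import numpy as np
from statistics import NormalDist

def erfc_inv(eps):
    """ASSUMPTION: ErfcInv(eps) = (1 - eps)-quantile of N(0,1); verify against (1.26)."""
    return NormalDist().inv_cdf(1.0 - eps)

def lower_bound_at(w, B, sigma, eps):
    """phi(w) = ||w||^2 (1 - theta(w)^2) / 2 with theta(w) = max[1 - rho/t(w), 0]."""
    t = np.linalg.norm(B @ w)                 # t(w) = ||Bw||_2
    rho = 2.0 * sigma * erfc_inv(eps)
    shrink = max(1.0 - rho / t, 0.0) if t > 0 else 0.0
    return 0.5 * float(w @ w) * (1.0 - shrink ** 2)
```

Maximizing `lower_bound_at` over w ∈ U is a nonconvex problem, so in practice one would evaluate it at several candidate points of U (for instance, points found by a local search) and keep the largest value; any such value is a valid lower bound.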

On closer inspection, the latter problem is not a convex one, which does not prevent building a suboptimal solution to this problem, and this is how the lower risk bounds in Table 3.4 are built (we omit the details). We see that the $\epsilon$-risks of our estimates are within a moderate factor of the optimal ones. Figure 3.4 shows empirical error distributions of the estimates built in the three experiments reported in Table 3.4. When simulating the observations and estimates, we used $\mathcal{N}(0, \sigma^2 I_d)$ noise and selected signals in $U$ by maximizing over $U$ randomly selected linear forms. Finally, we note that already with fixed design parameters $d$, $m$, $\theta$ and $\sigma$ we deal with a family of estimation problems rather than with a single problem, the reason being that our $U$ is an ellipsoid with half-axes essentially different from each other. In this situation, attainable risks heavily depend on how the right singular vectors of $A$ are oriented with respect to the directions of the half-axes of $U$, so that the risks of our estimates vary significantly from instance to instance. Note also that the “sub-Gaussian experiments” were conducted on exactly the same data as “Gaussian experiments” of the same sizes $d$ and $m$.

Figure 3.4: Histograms of recovery errors in experiments, 1,000 simulations per experiment. [Panels: Gaussian case and sub-Gaussian case, each for $d = 8, m = 12$; $d = 16, m = 24$; $d = 32, m = 48$.]

3.5 EXERCISES FOR CHAPTER 3

Exercise 3.1. In the situation of Section 3.3.4, design of a “good” estimate is reduced to solving convex optimization problem (3.39). Note that the objective in this problem is, in a sense, “implicit”: the design variable is $h$, and the objective is obtained from an explicit convex-concave function of $h$ and $(x, y)$ by maximization over $(x, y)$. There exist solvers able to process problems of this type efficiently. However, commonly used off-the-shelf solvers, like cvx, cannot handle problems of this type. The goal of the exercise to follow is to reformulate (3.39) as a semidefinite program, thus making it amenable for cvx. On an immediate inspection, the situation we are interested in is as follows. We are given
• a nonempty convex compact set $X \subset \mathbf{R}^n$ along with affine function $M(x)$ taking values in $\mathbf{S}^d$ and such that $M(x) \succeq 0$ when $x \in X$, and
• an affine function $F(h) : \mathbf{R}^d \to \mathbf{R}^n$.

Given $\gamma > 0$, this data gives rise to the convex function
$$\Psi(h) = \max_{x\in X}\left[F^T(h)x + \gamma\sqrt{h^T M(x) h}\right],$$
and we want to find a “nice” representation of this function, specifically, we want to represent the inequality $\tau \ge \Psi(h)$ by a bunch of LMIs in variables $\tau$, $h$, and perhaps additional variables. To achieve our goal, we assume in the sequel that the set $X^+ = \{(x, M) : x \in X, M = M(x)\}$ can be described by a system of linear and semidefinite constraints in variables $x$, $M$, and additional variables $\xi$, namely,
$$X^+ = \left\{(x, M) : \exists \xi :\ \begin{array}{ll} (a) & s_i - a_i^T x - b_i^T \xi - \mathrm{Tr}(C_i M) \ge 0,\ i \le I\\ (b) & S - A(x) - B(\xi) - C(M) \succeq 0\\ (c) & M \succeq 0 \end{array}\right\}.$$

Here $s_i \in \mathbf{R}$, $S \in \mathbf{S}^N$ are some constants, and $A(\cdot)$, $B(\cdot)$, $C(\cdot)$ are (homogeneous) linear functions taking values in $\mathbf{S}^N$. We assume that this system of constraints is essentially strictly feasible, meaning that there exists a feasible solution at which the semidefinite constraints (b) and (c) are satisfied strictly (i.e., the left-hand sides of the LMIs are positive definite). Here comes the exercise:

1) Check that $\Psi(h)$ is the optimal value in the semidefinite program
$$\Psi(h) = \max_{x,M,\xi,t}\left\{F^T(h)x + \gamma t :\ \begin{array}{ll} (a) & s_i - a_i^T x - b_i^T \xi - \mathrm{Tr}(C_i M) \ge 0,\ i \le I\\ (b) & S - A(x) - B(\xi) - C(M) \succeq 0\\ (c) & M \succeq 0\\ (d) & \left[\begin{array}{cc} h^T M h & t\\ t & 1 \end{array}\right] \succeq 0 \end{array}\right\}. \tag{P}$$

2) Passing from $(P)$ to the semidefinite dual of $(P)$, build explicit semidefinite representation of $\Psi$, that is, an explicit system $\mathcal{S}$ of LMIs in variables $h$, $\tau$, and additional variables $u$ such that
$$\{\tau \ge \Psi(h)\} \Leftrightarrow \{\exists u : (\tau, h, u) \text{ satisfies } \mathcal{S}\}.$$

Exercise 3.2. Let us consider the situation as follows. Given an $m \times n$ “sensing matrix” $A$ which is stochastic with columns from the probabilistic simplex
$$\Delta_m = \Big\{v \in \mathbf{R}^m : v \ge 0,\ \sum_i v_i = 1\Big\}$$
and a nonempty closed subset $U$ of $\Delta_n$, we observe an $M$-element, $M > 1$, i.i.d. sample $\zeta^M = (\zeta_1, ..., \zeta_M)$ with $\zeta_k$ drawn from the discrete distribution $Au_*$, where $u_*$ is an unknown probabilistic vector (“signal”) known to belong to $U$. We handle the discrete distribution $Au$, $u \in \Delta_n$, as a distribution on the vertices $e_1, ..., e_m$ of $\Delta_m$, so that possible values of $\zeta_k$ are basic orths $e_1, ..., e_m$ in $\mathbf{R}^m$. Our goal is to recover the value $F(u_*)$ of a given quadratic form $F(u) = u^T Q u + 2q^T u$. Observe that for $u \in \Delta_n$, we have $u = [uu^T]\mathbf{1}_n$, where $\mathbf{1}_k$ is the all-ones vector in $\mathbf{R}^k$. This observation allows us to rewrite $F(u)$ as a homogeneous quadratic form:
$$F(u) = u^T \bar{Q} u,\quad \bar{Q} = Q + [q\mathbf{1}_n^T + \mathbf{1}_n q^T]. \tag{3.77}$$
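The homogenization (3.77) relies only on the fact that the entries of $u$ sum to one. A minimal numerical check of this identity can be sketched as follows; the function names and the test data are illustrative and not part of the text.

```python
import random


def quad_form(u, Q, q):
    """F(u) = u^T Q u + 2 q^T u for vectors and matrices given as lists."""
    n = len(u)
    quadratic = sum(u[i] * Q[i][j] * u[j] for i in range(n) for j in range(n))
    linear = 2.0 * sum(q[i] * u[i] for i in range(n))
    return quadratic + linear


def homogenize(Q, q):
    """Q_bar = Q + q 1^T + 1 q^T, i.e. entry (i, j) is Q[i][j] + q[i] + q[j]."""
    n = len(q)
    return [[Q[i][j] + q[i] + q[j] for j in range(n)] for i in range(n)]


def homogeneous_form(u, Q_bar):
    """u^T Q_bar u."""
    n = len(u)
    return sum(u[i] * Q_bar[i][j] * u[j] for i in range(n) for j in range(n))


def random_simplex_point(n, rng):
    """A random probabilistic vector: nonnegative entries summing to one."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)
Q_demo = [[1.0, 0.5, -0.2], [0.5, 2.0, 0.3], [-0.2, 0.3, 0.7]]
q_demo = [0.4, -1.0, 0.25]
u_demo = random_simplex_point(3, rng)
gap = abs(quad_form(u_demo, Q_demo, q_demo)
          - homogeneous_form(u_demo, homogenize(Q_demo, q_demo)))
```

For a vector that is not on the simplex the two expressions differ in general, which is why the restriction $u \in \Delta_n$ matters.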

The goal of the exercise is to follow the approach developed in Section 3.4.1 for the Gaussian case in order to build an estimate $\widehat{g}(\zeta^M)$ of $F(u)$. To this end, consider the following construction. Let
$$\mathcal{J}_M = \{(i, j) : 1 \le i < j \le M\},\quad J_M = \mathrm{Card}(\mathcal{J}_M).$$
For $\zeta^M = (\zeta_1, ..., \zeta_M)$ with $\zeta_k \in \{e_1, ..., e_m\}$, $1 \le k \le M$, let
$$\omega_{ij}[\zeta^M] = \tfrac12[\zeta_i\zeta_j^T + \zeta_j\zeta_i^T],\quad (i, j) \in \mathcal{J}_M.$$
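The estimates considered in this exercise use the average $\omega[\zeta^M]$ of the matrices $\omega_{ij}[\zeta^M]$ over all pairs in $\mathcal{J}_M$. Summing over pairs costs $O(M^2)$ operations, but since each $\zeta_k$ is a basic orth the same matrix equals $(cc^T - \mathrm{Diag}(c))/(M(M-1))$, where $c = \sum_k \zeta_k$ is the vector of counts. The sketch below compares the two computations; encoding $\zeta_k = e_a$ by the 0-based index `a` and the function names are assumptions made for illustration.

```python
from itertools import combinations


def omega_bruteforce(labels, m):
    """Average of 0.5*(e_a e_b^T + e_b e_a^T) over all pairs i < j."""
    M = len(labels)
    acc = [[0.0] * m for _ in range(m)]
    for i, j in combinations(range(M), 2):
        a, b = labels[i], labels[j]
        acc[a][b] += 0.5
        acc[b][a] += 0.5   # when a == b this adds up to the full e_a e_a^T
    pairs = M * (M - 1) // 2
    return [[x / pairs for x in row] for row in acc]


def omega_from_counts(labels, m):
    """Same matrix via counts: (c c^T - Diag(c)) / (M (M - 1))."""
    M = len(labels)
    c = [0] * m
    for a in labels:
        c[a] += 1
    denom = M * (M - 1)
    return [[(c[a] * c[b] - (c[a] if a == b else 0)) / denom
             for b in range(m)] for a in range(m)]


sample = [0, 2, 1, 2, 2, 0]          # six observations on m = 3 outcomes
omega_slow = omega_bruteforce(sample, 3)
omega_fast = omega_from_counts(sample, 3)
```

Both routines return a symmetric, entrywise nonnegative matrix whose entries sum to one.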

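The estimates we are interested in are of the form
$$\widehat{g}(\zeta^M) = \mathrm{Tr}\Big(h \underbrace{\Big[\frac{1}{J_M}\sum_{(i,j)\in\mathcal{J}_M}\omega_{ij}[\zeta^M]\Big]}_{\omega[\zeta^M]}\Big) + \kappa,$$
where $h \in \mathbf{S}^m$ and $\kappa \in \mathbf{R}$ are the parameters of the estimate. Now comes the exercise:

1) Verify that when the $\zeta_k$'s stem from signal $u \in U$, the expectation of $\omega[\zeta^M]$ is a linear image $Az[u]A^T$ of the matrix $z[u] = uu^T \in \mathbf{S}^n$: denoting by $P_u^M$ the distribution of $\zeta^M$, we have
$$\mathbf{E}_{\zeta^M\sim P_u^M}\{\omega[\zeta^M]\} = Az[u]A^T. \tag{3.78}$$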

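Check that when setting
$$\mathcal{Z}_k = \{\omega \in \mathbf{S}^k : \omega \succeq 0,\ \omega \ge 0,\ \mathbf{1}_k^T\omega\mathbf{1}_k = 1\},$$
where $x \ge 0$ for a matrix $x$ means that $x$ is entrywise nonnegative, the image of $\mathcal{Z}_n$ under the mapping $z \mapsto AzA^T$ is contained in $\mathcal{Z}_m$.

2) Let $\boldsymbol{\Delta}_k = \{z \in \mathbf{S}^k : z \ge 0,\ \mathbf{1}_k^T z \mathbf{1}_k = 1\}$, so that $\mathcal{Z}_k$ is the set of all positive semidefinite matrices from $\boldsymbol{\Delta}_k$. For $\mu \in \boldsymbol{\Delta}_m$, let $P_\mu$ be the distribution of the random matrix $w$ taking values in $\mathbf{S}^m$ as follows: the possible values of $w$ are matrices of the form $e^{ij} = \frac12[e_ie_j^T + e_je_i^T]$, $1 \le i \le j \le m$; for every $i \le m$, $w$ takes value $e^{ii}$ with probability $\mu_{ii}$, and for every $i, j$ with $i < j$, $w$ takes value $e^{ij}$ with probability $2\mu_{ij}$. Let us set
$$\Phi_1(h; \mu) = \ln\Big(\sum_{i,j=1}^m \mu_{ij}\exp\{h_{ij}\}\Big) :\ \mathbf{S}^m \times \boldsymbol{\Delta}_m \to \mathbf{R},$$
so that $\Phi_1$ is a continuous convex-concave function on $\mathbf{S}^m \times \boldsymbol{\Delta}_m$.

2.1. Prove that
$$\forall (h \in \mathbf{S}^m, \mu \in \mathcal{Z}_m) :\ \ln\big(\mathbf{E}_{w\sim P_\mu}\{\exp\{\mathrm{Tr}(hw)\}\}\big) = \Phi_1(h; \mu).$$

2.2. Derive from 2.1 that setting
$$K = K(M) = \lfloor M/2 \rfloor,\quad \Phi_M(h; \mu) = K\Phi_1(h/K; \mu) :\ \mathbf{S}^m \times \boldsymbol{\Delta}_m \to \mathbf{R},$$
$\Phi_M$ is a continuous convex-concave function on $\mathbf{S}^m \times \boldsymbol{\Delta}_m$ such that $\Phi_M(0; \mu) = 0$ for all $\mu \in \mathcal{Z}_m$, and whenever $u \in U$, the following holds true: Let $P_{u,M}$ be the distribution of $\omega = \omega[\zeta^M]$, $\zeta^M \sim P_u^M$. Then for all $u \in U$, $h \in \mathbf{S}^m$,
$$\ln\big(\mathbf{E}_{\omega\sim P_{u,M}}\{\exp\{\mathrm{Tr}(h\omega)\}\}\big) \le \Phi_M(h; Az[u]A^T),\quad z[u] = uu^T. \tag{3.79}$$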

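3) Combine the above observations with Corollary 3.6 to arrive at the following result:

Proposition 3.19. In the situation in question, let $Z$ be a convex compact subset of $\mathcal{Z}_n$ such that $uu^T \in Z$ for all $u \in U$. Given $\epsilon \in (0, 1)$, let
$$\begin{aligned}
\Psi_+(h, \alpha) &= \max_{z\in Z}\left[\alpha\Phi_M(h/\alpha, AzA^T) - \mathrm{Tr}(\bar{Q}z)\right] :\ \mathbf{S}^m \times \{\alpha > 0\} \to \mathbf{R},\\
\Psi_-(h, \alpha) &= \max_{z\in Z}\left[\alpha\Phi_M(-h/\alpha, AzA^T) + \mathrm{Tr}(\bar{Q}z)\right] :\ \mathbf{S}^m \times \{\alpha > 0\} \to \mathbf{R},\\
\widehat\Psi_+(h) &:= \inf_{\alpha > 0}\left[\Psi_+(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\alpha > 0}\left[\alpha\Phi_M(h/\alpha, AzA^T) - \mathrm{Tr}(\bar{Q}z) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\beta > 0}\left[\beta\Phi_1(h/\beta, AzA^T) - \mathrm{Tr}(\bar{Q}z) + \tfrac{\beta}{K}\ln(2/\epsilon)\right]\quad [\beta = K\alpha],\\
\widehat\Psi_-(h) &:= \inf_{\alpha > 0}\left[\Psi_-(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\alpha > 0}\left[\alpha\Phi_M(-h/\alpha, AzA^T) + \mathrm{Tr}(\bar{Q}z) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\beta > 0}\left[\beta\Phi_1(-h/\beta, AzA^T) + \mathrm{Tr}(\bar{Q}z) + \tfrac{\beta}{K}\ln(2/\epsilon)\right]\quad [\beta = K\alpha].
\end{aligned}$$
The functions $\widehat\Psi_\pm$ are real-valued and convex on $\mathbf{S}^m$, and every candidate solution $h$ to the convex optimization problem
$$\mathrm{Opt} = \min_h\left\{\widehat\Psi(h) := \tfrac12\left[\widehat\Psi_+(h) + \widehat\Psi_-(h)\right]\right\} \tag{3.80}$$
induces the estimate
$$\widehat{g}_h(\zeta^M) = \mathrm{Tr}(h\,\omega[\zeta^M]) + \kappa(h),\quad \kappa(h) = \tfrac12[\widehat\Psi_-(h) - \widehat\Psi_+(h)]$$
of the functional of interest (3.77) via observation $\zeta^M$ with $\epsilon$-risk on $U$ not exceeding $\rho = \widehat\Psi(h)$:
$$\forall (u \in U) :\ \mathrm{Prob}_{\zeta^M\sim P_u^M}\{|F(u) - \widehat{g}_h(\zeta^M)| > \rho\} \le \epsilon.$$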

4) Consider an alternative way to estimate $F(u)$, namely, as follows. Let $u \in U$. Given a pair of independent observations $\zeta_1, \zeta_2$ drawn from distribution $Au$, let us convert them into the symmetric matrix $\omega_{1,2}[\zeta^2] = \frac12[\zeta_1\zeta_2^T + \zeta_2\zeta_1^T]$. The distribution $P_{u,2}$ of this matrix is exactly the distribution $P_{\mu(z[u])}$ (see item 2), where $\mu(z) = AzA^T : \boldsymbol{\Delta}_n \to \boldsymbol{\Delta}_m$. Now, given $M = 2K$ observations $\zeta^{2K} = (\zeta_1, ..., \zeta_{2K})$ stemming from signal $u$, we can split them into $K$ consecutive pairs giving rise to $K$ observations $\omega^K = (\omega_1, ..., \omega_K)$, $\omega_k = \omega[[\zeta_{2k-1}; \zeta_{2k}]]$, drawn independently of each other from probability distribution $P_{\mu(z[u])}$, and the functional of interest (3.77) is a linear function $\mathrm{Tr}(\bar{Q}z[u])$ of $z[u]$. Assume that we are given a set $Z$ as in the premise of Proposition 3.19. Observe that we are in the situation as follows:

Given $K$ i.i.d. observations $\omega^K = (\omega_1, ..., \omega_K)$ with $\omega_k \sim P_{\mu(z)}$, where $z$ is an unknown signal known to belong to $Z$, we want to recover the value at $z$ of linear function $G(v) = \mathrm{Tr}(\bar{Q}v)$ of $v \in \mathbf{S}^n$. Besides this, we know that $P_\mu$, for every $\mu \in \boldsymbol{\Delta}_m$, satisfies the relation
$$\forall (h \in \mathbf{S}^m) :\ \ln\big(\mathbf{E}_{\omega\sim P_\mu}\{\exp\{\mathrm{Tr}(h\omega)\}\}\big) \le \Phi_1(h; \mu).$$

This situation fits the setting of Section 3.3.3, with the data specified as
$$\mathcal{H} = E_H = \mathbf{S}^m,\quad \mathcal{M} = \boldsymbol{\Delta}_m \subset E_M = \mathbf{S}^m,\quad \Phi = \Phi_1,\quad \mathcal{X} := Z \subset E_X = \mathbf{S}^n,\quad \mathcal{A}(z) = AzA^T.$$
Therefore, we can use the apparatus developed in that section to upper-bound the $\epsilon$-risk of the affine estimate
$$\mathrm{Tr}\Big(h\,\frac1K\sum_{k=1}^K\omega_k\Big) + \kappa$$
of $F(u) := G(z[u]) = u^T\bar{Q}u$ and to build the best, in terms of the upper risk bound, estimate; see Corollary 3.8. On closer inspection (carry it out!), the functions $\widehat\Psi_\pm$ arising in (3.38) and associated with the above data are exactly the functions $\widehat\Psi_\pm$ specified in Proposition 3.19 for $M = 2K$. Thus, the approach to estimating $F(u)$ via observations $\zeta^{2K}$ stemming from $u \in U$ results in a family of estimates
$$\widetilde{g}_h(\zeta^{2K}) = \mathrm{Tr}\Big(h\,\frac1K\sum_{k=1}^K\omega[[\zeta_{2k-1}; \zeta_{2k}]]\Big) + \kappa(h),\quad h \in \mathbf{S}^m.$$
The resulting upper bound on the $\epsilon$-risk of estimate $\widetilde{g}_h$ is $\widehat\Psi(h)$, where $\widehat\Psi(\cdot)$ is associated with $M = 2K$ according to Proposition 3.19. In other words, this is exactly the upper bound on the $\epsilon$-risk of the estimate $\widehat{g}_h$ offered by the proposition. Note, however, that the estimates $\widetilde{g}_h$ and $\widehat{g}_h$ are not identical:
$$\begin{aligned}
\widetilde{g}_h(\zeta^{2K}) &= \mathrm{Tr}\Big(h\,\frac1K\sum_{k=1}^K\omega_{2k-1,2k}[\zeta^{2K}]\Big) + \kappa(h),\\
\widehat{g}_h(\zeta^{2K}) &= \mathrm{Tr}\Big(h\,\frac{1}{K(2K-1)}\sum_{1\le i < j\le 2K}\omega_{ij}[\zeta^{2K}]\Big) + \kappa(h).
\end{aligned}$$

[...] $\zeta$,” where $\eta$, $\zeta$ are discrete real-valued random variables independent of each other with distributions $u$, $v$, and $\pi$ is a linear function of the “joint distribution” $uv^T$ of $\eta$, $\zeta$. This story gives rise to the aforementioned estimation problem with the unit sensing matrices $P$ and $Q$. Assuming that there are “measurement errors” (instead of observing an action’s outcome “as is,” we observe a realization of a random variable with distribution depending, in a prescribed fashion, on the outcome), we arrive at problems where $P$ and $Q$ can be general type stochastic matrices. As always, we encode the $p$ possible values of $\eta_k$ by the basic orths $e_1, ..., e_p$ in $\mathbf{R}^p$, and the $q$ possible values of $\zeta$ by the basic orths $f_1, ..., f_q$ in $\mathbf{R}^q$. We focus on estimates of the form
$$\widehat{g}_{h,\kappa}(\eta^K, \zeta^L) = \Big[\frac1K\sum_k\eta_k\Big]^T h\,\Big[\frac1L\sum_\ell\zeta_\ell\Big] + \kappa\qquad [h \in \mathbf{R}^{p\times q},\ \kappa \in \mathbf{R}].$$

This is what you are supposed to do:



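1) (cf. item 2 in Exercise 3.2) Denoting by $\Delta^{mn}$ the set of nonnegative $m \times n$ matrices with unit sum of all entries (i.e., the set of all probability distributions on $\{1, ..., m\} \times \{1, ..., n\}$) and assuming $L \ge K$, let us set
$$\mathcal{A}(z) = PzQ^T :\ \mathbf{R}^{r\times s} \to \mathbf{R}^{p\times q}$$
and
$$\begin{aligned}
\Phi(h; \mu) &= \ln\Big(\sum_{i=1}^p\sum_{j=1}^q\mu_{ij}\exp\{h_{ij}\}\Big) :\ \mathbf{R}^{p\times q} \times \Delta^{pq} \to \mathbf{R},\\
\Phi_K(h; \mu) &= K\Phi(h/K; \mu) :\ \mathbf{R}^{p\times q} \times \Delta^{pq} \to \mathbf{R}.
\end{aligned}$$
Verify that $\mathcal{A}$ maps $\Delta^{rs}$ into $\Delta^{pq}$, $\Phi$ and $\Phi_K$ are continuous convex-concave functions on their domains, and that for every $u \in \Delta_r$, $v \in \Delta_s$, the following holds true:

(!) When $\eta^K = (\eta_1, ..., \eta_K)$, $\zeta^L = (\zeta_1, ..., \zeta_L)$ with mutually independent $\eta_1, ..., \zeta_L$ such that $\eta_k \sim Pu$, $\zeta_\ell \sim Qv$ for all $k, \ell$, we have
$$\ln\Big(\mathbf{E}_{\eta,\zeta}\Big\{\exp\Big\{\Big[\frac1K\sum_k\eta_k\Big]^T h\,\Big[\frac1L\sum_\ell\zeta_\ell\Big]\Big\}\Big\}\Big) \le \Phi_K(h; \mathcal{A}(uv^T)). \tag{3.82}$$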



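2) Combine (!) with Corollary 3.6 to arrive at the following analog of Proposition 3.19:

Proposition 3.20. In the situation in question, let $Z$ be a convex compact subset of $\Delta^{rs}$ such that $uv^T \in Z$ for all $u \in U$, $v \in V$. Given $\epsilon \in (0, 1)$, let
$$\begin{aligned}
\Psi_+(h, \alpha) &= \max_{z\in Z}\left[\alpha\Phi_K(h/\alpha, PzQ^T) - \mathrm{Tr}(Fz^T)\right] :\ \mathbf{R}^{p\times q} \times \{\alpha > 0\} \to \mathbf{R},\\
\Psi_-(h, \alpha) &= \max_{z\in Z}\left[\alpha\Phi_K(-h/\alpha, PzQ^T) + \mathrm{Tr}(Fz^T)\right] :\ \mathbf{R}^{p\times q} \times \{\alpha > 0\} \to \mathbf{R},\\
\widehat\Psi_+(h) &:= \inf_{\alpha > 0}\left[\Psi_+(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\alpha > 0}\left[\alpha\Phi_K(h/\alpha, PzQ^T) - \mathrm{Tr}(Fz^T) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\beta > 0}\left[\beta\Phi(h/\beta, PzQ^T) - \mathrm{Tr}(Fz^T) + \tfrac{\beta}{K}\ln(2/\epsilon)\right]\quad [\beta = K\alpha],\\
\widehat\Psi_-(h) &:= \inf_{\alpha > 0}\left[\Psi_-(h, \alpha) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\alpha > 0}\left[\alpha\Phi_K(-h/\alpha, PzQ^T) + \mathrm{Tr}(Fz^T) + \alpha\ln(2/\epsilon)\right]\\
&= \max_{z\in Z}\inf_{\beta > 0}\left[\beta\Phi(-h/\beta, PzQ^T) + \mathrm{Tr}(Fz^T) + \tfrac{\beta}{K}\ln(2/\epsilon)\right]\quad [\beta = K\alpha].
\end{aligned}$$
The functions $\widehat\Psi_\pm$ are real-valued and convex on $\mathbf{R}^{p\times q}$, and every candidate solution $h$ to the convex optimization problem
$$\mathrm{Opt} = \min_h\left\{\widehat\Psi(h) := \tfrac12\left[\widehat\Psi_+(h) + \widehat\Psi_-(h)\right]\right\}$$
induces the estimate
$$\widehat{g}_h(\eta^K, \zeta^L) = \mathrm{Tr}\Big(h\Big[\Big[\frac1K\sum_k\eta_k\Big]\Big[\frac1L\sum_\ell\zeta_\ell\Big]^T\Big]^T\Big) + \kappa(h),\quad \kappa(h) = \tfrac12[\widehat\Psi_-(h) - \widehat\Psi_+(h)]$$
of the functional of interest (3.81) via observation $\eta^K, \zeta^L$ with $\epsilon$-risk on $U \times V$ not exceeding $\rho = \widehat\Psi(h)$:
$$\forall (u \in U, v \in V) :\ \mathrm{Prob}\{|F(u, v) - \widehat{g}_h(\eta^K, \zeta^L)| > \rho\} \le \epsilon,$$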

the probability being taken w.r.t. the distribution of observations $\eta^K, \zeta^L$ stemming from signals $u, v$.

Exercise 3.4. [recovering mixture weights] The problem to be addressed in this exercise is as follows. We are given $K$ probability distributions $P_1, ..., P_K$ on observation space $\Omega$, and let these distributions have densities $p_k(\cdot)$ w.r.t. some reference measure $\Pi$ on $\Omega$; we assume that $\sum_k p_k(\cdot)$ is positive on $\Omega$. We are given also $N$ independent observations $\omega_t \sim P_\mu$, $t = 1, ..., N$, drawn from distribution
$$P_\mu = \sum_{k=1}^K\mu_kP_k,$$
where $\mu$ is an unknown “signal” known to belong to the probabilistic simplex $\Delta_K = \{\mu \in \mathbf{R}^K : \mu \ge 0, \sum_k\mu_k = 1\}$. Given $\omega^N = (\omega_1, ..., \omega_N)$, we want to recover the linear image $G\mu$ of $\mu$, where $G \in \mathbf{R}^{\nu\times K}$ is given.

We intend to measure the risk of a candidate estimate $\widehat{G}(\omega^N) : \Omega \times ... \times \Omega \to \mathbf{R}^\nu$ by the quantity
$$\mathrm{Risk}[\widehat{G}(\cdot)] = \sup_{\mu\in\Delta_K}\Big[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{\|\widehat{G}(\omega^N) - G\mu\|_2^2\big\}\Big]^{1/2}.$$

3.4.A. Recovering linear form. Let us start with the case when $G = g^T$ is a $1 \times K$ matrix.

3.4.A.1. Preliminaries. To motivate the construction to follow, consider the case when $\Omega$ is a finite set (obtained, e.g., by “fine discretization” of the “true” observation space). In this situation our problem becomes an estimation problem in Discrete o.s.: given a stationary $N$-repeated observation stemming from a discrete probability distribution $P_\mu$ affinely parameterized by signal $\mu \in \Delta_K$, we want to recover a linear form of $\mu$. It is shown in Section 3.1 (see Remark 3.2) that in this case a nearly optimal, in terms of its $\epsilon$-risk, estimate is of the form
$$\widehat{g}(\omega^N) = \frac1N\sum_{t=1}^N\phi(\omega_t) \tag{3.83}$$

with properly selected φ. The difficulty with this approach is that as far as computations are concerned, optimal design of φ requires solving a convex optimization problem of design dimension of order of the cardinality of Ω, and this cardinality could be huge, as is the case when Ω is a discretization of a domain in Rd with d in the range of tens. To circumvent this problem, we are to simplify the outlined approach: from the construction of Section 3.1 we inherit the simple structure (3.83) of the estimator; taking this structure for granted, we are to develop an alternative design of φ. With this new design, we have no theoretical guarantees for the resulting estimates to be near-optimal; we sacrifice these guarantees in order to reduce dramatically the computational effort of building the estimates.

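3.4.A.2. Generic estimate. Let us select somehow $L$ functions $F_\ell(\cdot)$ on $\Omega$ such that
$$\int F_\ell^2(\omega)p_k(\omega)\Pi(d\omega) < \infty,\quad 1 \le \ell \le L,\ 1 \le k \le K. \tag{3.84}$$
With $\lambda \in \mathbf{R}^L$, consider estimate of the form
$$\widehat{g}_\lambda(\omega^N) = \frac1N\sum_{t=1}^N\Phi_\lambda(\omega_t),\quad \Phi_\lambda(\omega) = \sum_\ell\lambda_\ell F_\ell(\omega). \tag{3.85}$$

1) Prove that
$$\begin{aligned}
\mathrm{Risk}[\widehat{g}_\lambda] \le \mathrm{Risk}(\lambda) &:= \max_{k\le K}\Big[\frac1N\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2p_k(\omega)\Pi(d\omega) + \Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]p_k(\omega)\Pi(d\omega) - g^Te_k\Big]^2\Big]^{1/2}\\
&= \max_{k\le K}\Big[\frac1N\lambda^TW_k\lambda + \big[e_k^T[M\lambda - g]\big]^2\Big]^{1/2},
\end{aligned} \tag{3.86}$$
where
$$M = \Big[M_{k\ell} := \int F_\ell(\omega)p_k(\omega)\Pi(d\omega)\Big]_{k\le K,\ \ell\le L},\quad W_k = \Big[[W_k]_{\ell\ell'} := \int F_\ell(\omega)F_{\ell'}(\omega)p_k(\omega)\Pi(d\omega)\Big]_{\ell\le L,\ \ell'\le L},\ 1 \le k \le K,$$
and $e_1, ..., e_K$ are the standard basic orths in $\mathbf{R}^K$.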
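Once the matrices $M$ and $W_k$ are available, the right-hand side of (3.86) is a finite maximum of square roots of quadratic forms and can be evaluated directly. The sketch below does this with plain Python lists; the function name `risk` and the toy data are illustrative assumptions, not part of the text.

```python
import math


def risk(lam, M, W, g, N):
    """Risk(lambda) = max_k sqrt(lam^T W_k lam / N + (e_k^T (M lam - g))^2).

    M is a K x L matrix, W is a list of K matrices of size L x L,
    g has length K, lam has length L, and N is the number of observations.
    """
    K, L = len(M), len(lam)
    worst = 0.0
    for k in range(K):
        # Second-moment term, which upper-bounds the variance contribution.
        quad = sum(lam[a] * W[k][a][b] * lam[b]
                   for a in range(L) for b in range(L))
        # Bias of the estimate when the signal is the k-th vertex e_k.
        bias = sum(M[k][l] * lam[l] for l in range(L)) - g[k]
        worst = max(worst, quad / N + bias * bias)
    return math.sqrt(worst)


# Toy data with K = L = 2 (hypothetical numbers chosen for easy checking).
M_demo = [[1.0, 0.0], [0.0, 1.0]]
W_demo = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
g_demo = [1.0, 0.0]
value = risk([1.0, 0.0], M_demo, W_demo, g_demo, N=4)
```

Since each $W_k$ is positive semidefinite, every term under the maximum is the Euclidean norm of an affine function of $\lambda$, so the function is convex and can be passed to a generic convex solver.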

Note that $\mathrm{Risk}(\lambda)$ is a convex function of $\lambda$; this function is easy to compute, provided the matrices $M$ and $W_k$, $k \le K$, are available. Assuming this is the case, we can solve the convex optimization problem
$$\mathrm{Opt} = \min_{\lambda\in\mathbf{R}^L}\mathrm{Risk}(\lambda) \tag{3.87}$$
and use the estimate (3.85) associated with the optimal solution to this problem; the risk of this estimate will be upper-bounded by Opt.

3.4.A.3. Implementation. When implementing the generic estimate we arrive at the “Measurement Design” question: how do we select the value of $L$ and functions $F_\ell$, $1 \le \ell \le L$, resulting in small (upper bound Opt on the) risk of the estimate (3.85) yielded by an optimal solution to (3.87)? We are about to consider three related options: Naive, Basic, and Maximum Likelihood (ML).

The Naive option is to take $F_\ell = p_\ell$, $1 \le \ell \le L = K$, assuming that this selection meets (3.84). For the sake of definiteness, consider the “Gaussian case,” where $\Omega = \mathbf{R}^d$, $\Pi$ is the Lebesgue measure, and $p_k$ is Gaussian distribution with parameters $\nu_k$, $\Sigma_k$:
$$p_k(\omega) = (2\pi)^{-d/2}\mathrm{Det}(\Sigma_k)^{-1/2}\exp\left\{-\tfrac12(\omega - \nu_k)^T\Sigma_k^{-1}(\omega - \nu_k)\right\}.$$
In this case, the Naive option leads to easily computable matrices $M$ and $W_k$ appearing in (3.86).

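2) Check that in the Gaussian case, when setting
$$\begin{aligned}
&\Sigma_{k\ell} = [\Sigma_k^{-1} + \Sigma_\ell^{-1}]^{-1},\quad \Sigma_{k\ell m} = [\Sigma_k^{-1} + \Sigma_\ell^{-1} + \Sigma_m^{-1}]^{-1},\quad \chi_k = \Sigma_k^{-1}\nu_k,\\
&\alpha_{k\ell} = \sqrt{\frac{\mathrm{Det}(\Sigma_{k\ell})}{(2\pi)^d\,\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)}},\quad \beta_{k\ell m} = (2\pi)^{-d}\sqrt{\frac{\mathrm{Det}(\Sigma_{k\ell m})}{\mathrm{Det}(\Sigma_k)\mathrm{Det}(\Sigma_\ell)\mathrm{Det}(\Sigma_m)}},
\end{aligned}$$
we have
$$\begin{aligned}
M_{k\ell} &:= \int p_\ell(\omega)p_k(\omega)\Pi(d\omega)\\
&= \alpha_{k\ell}\exp\left\{\tfrac12\left[[\chi_k + \chi_\ell]^T\Sigma_{k\ell}[\chi_k + \chi_\ell] - \chi_k^T\Sigma_k\chi_k - \chi_\ell^T\Sigma_\ell\chi_\ell\right]\right\},\\
[W_k]_{\ell m} &:= \int p_\ell(\omega)p_m(\omega)p_k(\omega)\Pi(d\omega)\\
&= \beta_{k\ell m}\exp\left\{\tfrac12\left[[\chi_k + \chi_\ell + \chi_m]^T\Sigma_{k\ell m}[\chi_k + \chi_\ell + \chi_m] - \chi_k^T\Sigma_k\chi_k - \chi_\ell^T\Sigma_\ell\chi_\ell - \chi_m^T\Sigma_m\chi_m\right]\right\}.
\end{aligned}$$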
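Before working through the general case of item 2, it can help to sanity-check the formula for $M_{k\ell}$ numerically. The sketch below specializes it to $d = 1$ (so that $\Sigma_k$ is a scalar variance) and compares it with a midpoint-rule integral of $p_k p_\ell$; the function names, the integration window and the step count are illustrative assumptions.

```python
import math


def gauss_pdf(x, nu, var):
    """Density of N(nu, var) on the real line."""
    return math.exp(-0.5 * (x - nu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def m_closed_form(nu_k, var_k, nu_l, var_l):
    """M_{kl} from the closed-form expression, specialized to d = 1."""
    s_kl = 1.0 / (1.0 / var_k + 1.0 / var_l)         # Sigma_{kl}
    chi_k, chi_l = nu_k / var_k, nu_l / var_l        # chi = Sigma^{-1} nu
    alpha = math.sqrt(s_kl / (2.0 * math.pi * var_k * var_l))
    expo = 0.5 * ((chi_k + chi_l) ** 2 * s_kl
                  - chi_k ** 2 * var_k - chi_l ** 2 * var_l)
    return alpha * math.exp(expo)


def m_numeric(nu_k, var_k, nu_l, var_l, lo=-20.0, hi=20.0, steps=20000):
    """Midpoint-rule approximation of the integral of p_k * p_l."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += gauss_pdf(x, nu_k, var_k) * gauss_pdf(x, nu_l, var_l)
    return h * total


closed = m_closed_form(0.3, 1.0, -0.2, 0.5)
approx = m_numeric(0.3, 1.0, -0.2, 0.5)
```

In one dimension the same integral is also the $N(0, \sigma_k^2 + \sigma_\ell^2)$ density evaluated at $\nu_k - \nu_\ell$, which gives a second independent check.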

Basic option. Though simple, the Naive option does not make much sense: when replacing the reference measure $\Pi$ with another measure $\Pi'$ which has positive density $\theta(\cdot)$ w.r.t. $\Pi$, the densities $p_k$ are updated according to $p_k(\cdot) \mapsto p'_k(\cdot) = \theta^{-1}(\cdot)p_k(\cdot)$, so that selecting $F'_\ell = p'_\ell$, the matrices $M$ and $W_k$ become $M'$ and $W'_k$ with
$$\begin{aligned}
M'_{k\ell} &= \int\frac{p_k(\omega)p_\ell(\omega)}{\theta^2(\omega)}\Pi'(d\omega) = \int\frac{p_k(\omega)p_\ell(\omega)}{\theta(\omega)}\Pi(d\omega),\\
[W'_k]_{\ell m} &= \int\frac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^3(\omega)}\Pi'(d\omega) = \int\frac{p_k(\omega)p_\ell(\omega)p_m(\omega)}{\theta^2(\omega)}\Pi(d\omega).
\end{aligned}$$
We see that in general $M \ne M'$ and $W_k \ne W'_k$, which makes the Naive option rather unnatural. In the alternative Basic option we set
$$L = K,\quad F_\ell(\omega) = \pi_\ell(\omega) := \frac{p_\ell(\omega)}{\sum_kp_k(\omega)}.$$
The motivation is that the functions $F_\ell$ are invariant when replacing $\Pi$ with $\Pi'$, so that here $M = M'$ and $W_k = W'_k$. Besides this, there are statistical arguments in favor of the Basic option, namely, as follows. Let $\Pi_*$ be the measure with the density $\sum_kp_k(\cdot)$ w.r.t. $\Pi$; taken w.r.t. $\Pi_*$, the densities of $P_k$ are exactly the above $\pi_k(\cdot)$, and $\sum_k\pi_k(\omega) \equiv 1$. Now, (3.86) says that the risk of estimate $\widehat{g}_\lambda$ can be upper-bounded by the function $\mathrm{Risk}(\lambda)$ defined in (3.86), and this function, in turn, can be upper-bounded by the function
$$\begin{aligned}
\mathrm{Risk}^+(\lambda) &:= \Big[\sum_k\frac1N\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2p_k(\omega)\Pi(d\omega) + \max_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]p_k(\omega)\Pi(d\omega) - g^Te_k\Big]^2\Big]^{1/2}\\
&= \Big[\frac1N\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]^2\Pi_*(d\omega) + \max_k\Big[\int\Big[\sum_\ell\lambda_\ell F_\ell(\omega)\Big]\pi_k(\omega)\Pi_*(d\omega) - g^Te_k\Big]^2\Big]^{1/2}\\
&\le K\,\mathrm{Risk}(\lambda)
\end{aligned}$$
(we have said that the maximum of $K$ nonnegative quantities is at most their sum, and the latter is at most $K$ times the maximum of the quantities). Consequently, the risk of the estimate (3.85) stemming from an optimal solution to (3.87) can be upper-bounded by the quantity
$$\mathrm{Opt}^+ := \min_\lambda\mathrm{Risk}^+(\lambda)\quad [\ge \mathrm{Opt} := \min_\lambda\mathrm{Risk}(\lambda)].$$

And here comes the punchline:

3.1) Prove that both the quantities Opt defined in (3.87) and the above $\mathrm{Opt}^+$ depend only on the linear span of the functions $F_\ell$, $\ell = 1, ..., L$, not on how the functions $F_\ell$ are selected in this span.

3.2) Prove that the selection $F_\ell = \pi_\ell$, $1 \le \ell \le L = K$, minimizes $\mathrm{Opt}^+$ among all possible selections $L$, $\{F_\ell\}_{\ell=1}^L$ satisfying (3.84). Conclude that the selection $F_\ell = \pi_\ell$, $1 \le \ell \le L = K$, while not necessarily optimal in terms of Opt, definitely is meaningful: this selection optimizes the natural upper bound $\mathrm{Opt}^+$ on Opt. Observe that $\mathrm{Opt}^+ \le K\,\mathrm{Opt}$, so that optimizing instead of Opt the upper bound $\mathrm{Opt}^+$, although rough, is not completely meaningless.

A downside of the Basic option is that it seems problematic to get closed form expressions for the associated matrices $M$ and $W_k$; see (3.86). For example, in the Gaussian case, the Naive choice of $F_\ell$’s allows us to represent $M$ and $W_k$ in an explicit closed form; in contrast to this, when selecting $F_\ell = \pi_\ell$, $\ell \le L = K$, seemingly the only way to get $M$ and $W_k$ is to use Monte-Carlo simulations. This being said, we indeed can use Monte-Carlo simulations to compute $M$ and $W_k$, provided we can sample from distributions $P_1, ..., P_K$. In this respect, it should be stressed that with $F_\ell \equiv \pi_\ell$, the entries in $M$ and $W_k$ are expectations, w.r.t. $P_1, ..., P_K$, of functions of $\omega$ bounded in magnitude by 1, and thus well-suited for Monte-Carlo simulation.

Maximum Likelihood option. This choice of $\{F_\ell\}_{\ell\le L}$ follows straightforwardly the idea of discretization we started with in this exercise. Specifically, we split $\Omega$ into $L$ cells $\Omega_1, ..., \Omega_L$ in such a way that the intersection of any two different cells is of $\Pi$-measure zero, and treat as our observations not the actual observations $\omega_t$, but the indexes of the cells to which the $\omega_t$’s belong. With our estimation scheme, this is the same as selecting $F_\ell$ as the characteristic function of $\Omega_\ell$, $\ell \le L$. Assuming that for distinct $k, k'$ the densities $p_k$, $p_{k'}$ differ from each other $\Pi$-almost surely, the simplest discretization independent of how the reference measure is selected is the Maximum Likelihood discretization
$$\Omega_\ell = \{\omega : \max_kp_k(\omega) = p_\ell(\omega)\},\quad 1 \le \ell \le L = K;$$
with the ML option, we take, as $F_\ell$’s, the characteristic functions of the sets $\Omega_\ell$, $1 \le \ell \le L = K$, just defined. As with the Basic option, the matrices $M$ and $W_k$ associated with the ML option can be found by Monte-Carlo simulation.

We have discussed three simple options for selecting $F_\ell$’s. In applications, one can compute the upper risk bounds Opt (see (3.87)) associated with each option, and use the option with the best, that is, the smallest, risk bound (“smart” choice of $F_\ell$’s). Alternatively, one can take as $\{F_\ell, \ell \le L\}$ the union of the three collections yielded by the above options (and, perhaps, further extend this union). Note that the larger is the collection of the $F_\ell$’s, the smaller is the associated Opt, so that the only price for combining different selections is in increasing the computational cost of solving (3.87).

3.4.A.4. Illustration. In the experimental part of this exercise you are expected to

4.1) Run numerical experiments to compare the estimates yielded by the above three options (Naive, Basic, ML). Recommended setup:
• $d = 8$, $K = 90$;
• Gaussian case with the covariance matrices $\Sigma_k$ of $P_k$ selected at random,
$$S_k = \mathrm{rand}(d, d),\quad \Sigma_k = \frac{S_kS_k^T}{\|S_k\|^2}\qquad [\|\cdot\|: \text{spectral norm}]$$
and the expectations $\nu_k$ of $P_k$ selected at random from $\mathcal{N}(0, \sigma^2I_d)$, with $\sigma = 0.1$;
• values of $N$: $\{10^s, s = 0, 1, ..., 5\}$;
• linear form to be recovered: $g^T\mu \equiv \mu_1$.

4.2†) Utilize the Cramer-Rao lower risk bound (see Proposition 4.37, Exercise 4.22) to upper-bound the level of conservatism $\frac{\mathrm{Opt}}{\mathrm{Risk}_*}$ of the estimates built in item 4.1. Here $\mathrm{Risk}_*$ is the minimax risk in our estimation problem:
$$\mathrm{Risk}_* = \inf_{\widehat{g}(\cdot)}\mathrm{Risk}[\widehat{g}(\omega^N)] = \inf_{\widehat{g}(\cdot)}\sup_{\mu\in\Delta_K}\Big[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{|\widehat{g}(\omega^N) - g^T\mu|^2\big\}\Big]^{1/2},$$
where inf is taken over all estimates.
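For the Basic option the matrices $M$ and $W_k$ have to be estimated by simulation. The following sketch does so for a one-dimensional Gaussian mixture, which is much smaller than the recommended setup and is meant only to show the mechanics; the function names, the parameters and the sample size are illustrative assumptions.

```python
import math
import random


def gauss_pdf(x, nu, var):
    """Density of N(nu, var) on the real line."""
    return math.exp(-0.5 * (x - nu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def pi_weights(x, nus, variances):
    """Basic-option functions pi_l(x) = p_l(x) / sum_k p_k(x)."""
    dens = [gauss_pdf(x, nu, var) for nu, var in zip(nus, variances)]
    total = sum(dens)   # positive here; may underflow for far-away x
    return [p / total for p in dens]


def monte_carlo_m_w(nus, variances, samples, rng):
    """Estimate M[k][l] = E_{P_k} pi_l and W[k][l][m] = E_{P_k} pi_l * pi_m."""
    K = len(nus)
    M = [[0.0] * K for _ in range(K)]
    W = [[[0.0] * K for _ in range(K)] for _ in range(K)]
    for k in range(K):
        for _ in range(samples):
            x = rng.gauss(nus[k], math.sqrt(variances[k]))
            pi = pi_weights(x, nus, variances)
            for a in range(K):
                M[k][a] += pi[a] / samples
                for b in range(K):
                    W[k][a][b] += pi[a] * pi[b] / samples
    return M, W


rng = random.Random(0)
M_est, W_est = monte_carlo_m_w([-1.0, 0.0, 1.5], [1.0, 0.5, 2.0], 200, rng)
```

Because the $\pi_\ell$ sum to one pointwise, every row of the estimated $M$ sums to one and the entries of each estimated $W_k$ sum to one, whatever the sample; this gives a quick consistency check of an implementation. The ML option is handled the same way, with `pi_weights` replaced by the indicator of the most likely component.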

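3.4.B. Recovering linear images. Now consider the case when $G$ is a general $\nu \times K$ matrix. The analog of the estimate $\widehat{g}_\lambda(\cdot)$ is now as follows: with somehow chosen $F_1, ..., F_L$ satisfying (3.84), we select a $\nu \times L$ matrix $\Lambda = [\lambda_{i\ell}]$, set
$$\Phi_\Lambda(\omega) = \Big[\sum_\ell\lambda_{1\ell}F_\ell(\omega);\ \sum_\ell\lambda_{2\ell}F_\ell(\omega);\ ...;\ \sum_\ell\lambda_{\nu\ell}F_\ell(\omega)\Big],$$
and estimate $G\mu$ by
$$\widehat{G}_\Lambda(\omega^N) = \frac1N\sum_{t=1}^N\Phi_\Lambda(\omega_t).$$

5) Prove the following counterpart of the results of item 3.4.A:

Proposition 3.21. The risk of the proposed estimator can be upper-bounded as follows:
$$\begin{aligned}
\mathrm{Risk}[\widehat{G}_\Lambda] &:= \max_{\mu\in\Delta_K}\Big[\mathbf{E}_{\omega^N\sim P_\mu\times...\times P_\mu}\big\{\|\widehat{G}_\Lambda(\omega^N) - G\mu\|_2^2\big\}\Big]^{1/2}\\
&\le \mathrm{Risk}(\Lambda) := \max_{k\le K}\Psi(\Lambda, e_k),\\
\Psi(\Lambda, \mu) &= \Big[\frac1N\sum_{k=1}^K\mu_k\mathbf{E}_{\omega\sim P_k}\big\{\|\Phi_\Lambda(\omega)\|_2^2\big\} + \|[\psi_\Lambda - G]\mu\|_2^2\Big]^{1/2}\\
&= \Big[\|[\psi_\Lambda - G]\mu\|_2^2 + \frac1N\sum_{k=1}^K\mu_k\int\Big[\sum_{i\le\nu}\Big[\sum_\ell\lambda_{i\ell}F_\ell(\omega)\Big]^2\Big]P_k(d\omega)\Big]^{1/2},
\end{aligned}$$
where
$$\mathrm{Col}_k[\psi_\Lambda] = \mathbf{E}_{\omega\sim P_k(\cdot)}\Phi_\Lambda(\omega) = \left[\begin{array}{c}\int[\sum_\ell\lambda_{1\ell}F_\ell(\omega)]P_k(d\omega)\\ \cdots\\ \int[\sum_\ell\lambda_{\nu\ell}F_\ell(\omega)]P_k(d\omega)\end{array}\right],\quad 1 \le k \le K,$$
and $e_1, ..., e_K$ are the standard basic orths in $\mathbf{R}^K$.

Note that exactly the same reasoning as in the case of the scalar $G\mu \equiv g^T\mu$ demonstrates that a reasonable way to select $L$ and $F_\ell$, $\ell = 1, ..., L$, is to set $L = K$ and $F_\ell(\cdot) = \pi_\ell(\cdot)$, $1 \le \ell \le L$.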

3.6 PROOFS

3.6.1 Proof of Proposition 3.3

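1°. Observe that $\mathrm{Opt}_{ij}(K)$ is the saddle point value in the convex-concave saddle point problem:
$$\mathrm{Opt}_{ij}(K) = \inf_{\alpha > 0,\,\phi\in\mathcal{F}}\ \max_{x\in X_i,\,y\in X_j}\Big[\tfrac12K\alpha\left[\Phi_{\mathcal{O}}(\phi/\alpha; A_i(x)) + \Phi_{\mathcal{O}}(-\phi/\alpha; A_j(y))\right] + \tfrac12g^T[y - x] + \alpha\ln(2I/\epsilon)\Big].$$
The domain of the maximization variable is compact and the cost function is continuous on its domain, whence, by the Sion-Kakutani Theorem, we also have
$$\begin{aligned}
\mathrm{Opt}_{ij}(K) &= \max_{x\in X_i,\,y\in X_j}\Theta_{ij}(x, y),\\
\Theta_{ij}(x, y) &= \inf_{\alpha > 0,\,\phi\in\mathcal{F}}\Big[\tfrac12K\alpha\left[\Phi_{\mathcal{O}}(\phi/\alpha; A_i(x)) + \Phi_{\mathcal{O}}(-\phi/\alpha; A_j(y))\right] + \alpha\ln(2I/\epsilon)\Big] + \tfrac12g^T[y - x].
\end{aligned} \tag{3.88}$$
Note that
$$\begin{aligned}
\Theta_{ij}(x, y) &= \inf_{\alpha > 0,\,\psi\in\mathcal{F}}\Big[\tfrac12K\alpha\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] + \alpha\ln(2I/\epsilon)\Big] + \tfrac12g^T[y - x]\\
&= \inf_{\alpha > 0}\Big[\tfrac12\alpha K\inf_{\psi\in\mathcal{F}}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] + \alpha\ln(2I/\epsilon)\Big] + \tfrac12g^T[y - x].
\end{aligned}$$
Given $x \in X_i$, $y \in X_j$ and setting $\mu = A_i(x)$, $\nu = A_j(y)$, we obtain
$$\inf_{\psi\in\mathcal{F}}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] = \inf_{\psi\in\mathcal{F}}\Big[\ln\Big(\int\exp\{\psi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\psi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\Big].$$
Since $\mathcal{O}$ is a good o.s., the function $\bar\psi(\omega) = \frac12\ln(p_\nu(\omega)/p_\mu(\omega))$ belongs to $\mathcal{F}$, and
$$\begin{aligned}
&\inf_{\psi\in\mathcal{F}}\Big[\ln\Big(\int\exp\{\psi(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\psi(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\Big]\\
&\quad= \inf_{\delta\in\mathcal{F}}\Big[\ln\Big(\int\exp\{\bar\psi(\omega) + \delta(\omega)\}p_\mu(\omega)\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\bar\psi(\omega) - \delta(\omega)\}p_\nu(\omega)\Pi(d\omega)\Big)\Big]\\
&\quad= \inf_{\delta\in\mathcal{F}}\underbrace{\Big[\ln\Big(\int\exp\{\delta(\omega)\}\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big) + \ln\Big(\int\exp\{-\delta(\omega)\}\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big)\Big]}_{f(\delta)}.
\end{aligned}$$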

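Observe that $f(\delta)$ clearly is a convex and even function of $\delta \in \mathcal{F}$; as such, it attains its minimum over $\delta \in \mathcal{F}$ when $\delta = 0$. The bottom line is that
$$\inf_{\psi\in\mathcal{F}}\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right] = 2\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big), \tag{3.89}$$
and
$$\begin{aligned}
\Theta_{ij}(x, y) &= \inf_{\alpha > 0}\alpha\Big[K\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big) + \ln(2I/\epsilon)\Big] + \tfrac12g^T[y - x]\\
&= \begin{cases}\tfrac12g^T[y - x], & K\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big) + \ln(2I/\epsilon) \ge 0,\\ -\infty, & \text{otherwise}.\end{cases}
\end{aligned}$$
This combines with (3.88) to imply that
$$\mathrm{Opt}_{ij}(K) = \max_{x,y}\Big\{\tfrac12g^T[y - x] :\ x \in X_i,\ y \in X_j,\ \Big[\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big]^K \ge \frac{\epsilon}{2I}\Big\}. \tag{3.90}$$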
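Relation (3.89) can be checked numerically when $\Omega$ is a finite set, so that densities are probability vectors and integrals are sums. The sketch below is an illustration in this discrete setting only; the function names and the two test distributions are assumptions made for the example.

```python
import math
import random


def log_mgf_sum(psi, p, q):
    """ln(sum_w exp(psi_w) p_w) + ln(sum_w exp(-psi_w) q_w)."""
    a = sum(math.exp(s) * pw for s, pw in zip(psi, p))
    b = sum(math.exp(-s) * qw for s, qw in zip(psi, q))
    return math.log(a) + math.log(b)


def hellinger_affinity(p, q):
    """sum_w sqrt(p_w q_w), the discrete analog of the integral in (3.89)."""
    return sum(math.sqrt(pw * qw) for pw, qw in zip(p, q))


p_demo = [0.5, 0.3, 0.2]
q_demo = [0.2, 0.2, 0.6]

# The minimizer from the proof: psi_bar = 0.5 * ln(q / p).
psi_bar = [0.5 * math.log(qw / pw) for pw, qw in zip(p_demo, q_demo)]
at_minimizer = log_mgf_sum(psi_bar, p_demo, q_demo)
claimed_min = 2.0 * math.log(hellinger_affinity(p_demo, q_demo))

# Any other psi should give a value that is at least as large.
rng = random.Random(0)
others = [log_mgf_sum([rng.uniform(-2.0, 2.0) for _ in p_demo], p_demo, q_demo)
          for _ in range(100)]
```

Here `at_minimizer` agrees with `claimed_min` up to rounding, and no entry of `others` falls below it, in line with the Cauchy-Schwarz argument behind the minimality of $\delta = 0$.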

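2°. We claim that under the premise of the proposition, for all $i, j$, $1 \le i, j \le I$, one has
$$\mathrm{Opt}_{ij}(\bar{K}) \le \mathrm{Risk}^*_\epsilon(K),$$
implying the validity of (3.13). Indeed, assume that for some pair $i, j$ the opposite inequality holds true, $\mathrm{Opt}_{ij}(\bar{K}) > \mathrm{Risk}^*_\epsilon(K)$, and let us lead this assumption to a contradiction. Under our assumption optimization problem in (3.90) (with $\bar{K}$ in the role of $K$) has a feasible solution $(\bar{x}, \bar{y})$ such that
$$r := \tfrac12g^T[\bar{y} - \bar{x}] > \mathrm{Risk}^*_\epsilon(K), \tag{3.91}$$
implying, due to the origin of $\mathrm{Risk}^*_\epsilon(K)$, that there exists an estimate $\widetilde{g}(\omega^K)$ such that for $\mu = A_i(\bar{x})$, $\nu = A_j(\bar{y})$ it holds
$$\begin{aligned}
\mathrm{Prob}_{\omega^K\sim p_\nu^K}\left\{\widetilde{g}(\omega^K) \le \tfrac12g^T[\bar{x} + \bar{y}]\right\} &\le \mathrm{Prob}_{\omega^K\sim p_\nu^K}\left\{|\widetilde{g}(\omega^K) - g^T\bar{y}| \ge r\right\} \le \epsilon,\\
\mathrm{Prob}_{\omega^K\sim p_\mu^K}\left\{\widetilde{g}(\omega^K) \ge \tfrac12g^T[\bar{x} + \bar{y}]\right\} &\le \mathrm{Prob}_{\omega^K\sim p_\mu^K}\left\{|\widetilde{g}(\omega^K) - g^T\bar{x}| \ge r\right\} \le \epsilon.
\end{aligned}$$
In other words, we can decide on two simple hypotheses stating that observation $\omega^K$ obeys distribution $p_\mu^K$ or $p_\nu^K$, with risk $\le \epsilon$. Consequently, setting $\Pi^K = \underbrace{\Pi \times ... \times \Pi}_{K}$ and $p_\theta^K(\omega^K) = \prod_{k=1}^Kp_\theta(\omega_k)$, we have
$$\int\min\left[p_\mu^K(\omega^K), p_\nu^K(\omega^K)\right]\Pi^K(d\omega^K) \le 2\epsilon.$$
Hence,
$$\begin{aligned}
\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]^K &= \int\sqrt{p_\mu^K(\omega^K)p_\nu^K(\omega^K)}\,\Pi^K(d\omega^K)\\
&= \int\sqrt{\min\left[p_\mu^K(\omega^K), p_\nu^K(\omega^K)\right]}\sqrt{\max\left[p_\mu^K(\omega^K), p_\nu^K(\omega^K)\right]}\,\Pi^K(d\omega^K)\\
&\le \Big[\int\min\left[p_\mu^K, p_\nu^K\right]\Pi^K(d\omega^K)\Big]^{1/2}\Big[\int\max\left[p_\mu^K, p_\nu^K\right]\Pi^K(d\omega^K)\Big]^{1/2}\\
&= \Big[\int\min\left[p_\mu^K, p_\nu^K\right]\Pi^K(d\omega^K)\Big]^{1/2}\Big[\int\left[p_\mu^K + p_\nu^K - \min\left[p_\mu^K, p_\nu^K\right]\right]\Pi^K(d\omega^K)\Big]^{1/2}\\
&= \Big[\int\min\left[p_\mu^K, p_\nu^K\right]\Pi^K(d\omega^K)\Big]^{1/2}\Big[2 - \int\min\left[p_\mu^K, p_\nu^K\right]\Pi^K(d\omega^K)\Big]^{1/2}\\
&\le 2\sqrt{\epsilon(1 - \epsilon)}.
\end{aligned}$$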
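The chain above bounds the Hellinger affinity of two distributions by $\sqrt{m(2 - m)}$, where $m$ is the integral of the pointwise minimum of the densities, and then uses $m \le 2\epsilon$. A small discrete illustration can be sketched as follows; the function names and the test distributions are assumptions made for the example.

```python
import math


def affinity(p, q):
    """Hellinger affinity sum_w sqrt(p_w q_w) of two probability vectors."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


def overlap(p, q):
    """m = sum_w min(p_w, q_w), the discrete analog of the integral of min."""
    return sum(min(a, b) for a, b in zip(p, q))


def affinity_bound(p, q):
    """Cauchy-Schwarz bound sqrt(m * (2 - m)) from the chain above."""
    m = overlap(p, q)
    return math.sqrt(m * (2.0 - m))


# Two-point example where the bound is attained: m = 0.2, i.e. eps = 0.1.
p_demo, q_demo = [0.9, 0.1], [0.1, 0.9]
eps = overlap(p_demo, q_demo) / 2.0
lhs = affinity(p_demo, q_demo)
rhs = 2.0 * math.sqrt(eps * (1.0 - eps))
```

In this example the affinity, the bound $\sqrt{m(2 - m)}$ and $2\sqrt{\epsilon(1 - \epsilon)}$ all equal 0.6 up to rounding, so the inequality in the proof cannot be improved in general.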

Therefore, for $\bar{K}$ satisfying (3.12) we have
$$\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]^{\bar{K}} \le \left[2\sqrt{\epsilon(1 - \epsilon)}\right]^{\bar{K}/K} < \frac{\epsilon}{2I},$$
which is the desired contradiction (recall that $\mu = A_i(\bar{x})$, $\nu = A_j(\bar{y})$ and $(\bar{x}, \bar{y})$ is feasible for (3.90)).

3°. Now let us prove that under the premise of the proposition, (3.14) takes place. To this end, let us set
$$w_{ij}(s) = \max_{x\in X_i,\,y\in X_j}\Big\{\tfrac12g^T[y - x] :\ K\underbrace{\ln\Big(\int\sqrt{p_{A_i(x)}(\omega)p_{A_j(y)}(\omega)}\,\Pi(d\omega)\Big)}_{H(x,y)} + s \ge 0\Big\}. \tag{3.92}$$
As we have seen in item 1° (see (3.89)), one has
$$H(x, y) = \inf_{\psi\in\mathcal{F}}\tfrac12\left[\Phi_{\mathcal{O}}(\psi; A_i(x)) + \Phi_{\mathcal{O}}(-\psi; A_j(y))\right],$$
that is, $H(x, y)$ is the infimum of a parametric family of concave functions of $(x, y) \in X_i \times X_j$ and as such is concave. Besides this, the optimization problem in (3.92) is feasible whenever $s \ge 0$, a feasible solution being $y = x = x_{ij}$. At this feasible solution we have $g^T[y - x] = 0$, implying that $w_{ij}(s) \ge 0$ for $s \ge 0$. Observe also that from concavity of $H(x, y)$ it follows that $w_{ij}(s)$ is concave on the ray $\{s \ge 0\}$. Finally, we claim that
$$w_{ij}(\bar{s}) \le \mathrm{Risk}^*_\epsilon(K),\quad \bar{s} = -\ln\left(2\sqrt{\epsilon(1 - \epsilon)}\right). \tag{3.93}$$
Indeed, $w_{ij}(s)$ is nonnegative, concave, and bounded (since $X_i$, $X_j$ are compact) on $\mathbf{R}_+$, implying that $w_{ij}(s)$ is continuous on $\{s > 0\}$. Assuming, on the contrary to what we need to prove, that $w_{ij}(\bar{s}) > \mathrm{Risk}^*_\epsilon(K)$, there exists $s' \in (0, \bar{s})$ such that $w_{ij}(s') > \mathrm{Risk}^*_\epsilon(K)$ and thus there exist $\bar{x} \in X_i$, $\bar{y} \in X_j$ such that $(\bar{x}, \bar{y})$ is feasible for the optimization problem specifying $w_{ij}(s')$ and (3.91) takes place. We have seen in item 2° that the latter relation implies that for $\mu = A_i(\bar{x})$, $\nu = A_j(\bar{y})$ it holds
$$\Big[\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big]^K \le 2\sqrt{\epsilon(1 - \epsilon)},$$
that is,
$$K\ln\Big(\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big) + \bar{s} \le 0.$$
Hence,
$$K\ln\Big(\int\sqrt{p_\mu(\omega)p_\nu(\omega)}\,\Pi(d\omega)\Big) + s' < 0,$$
contradicting the feasibility of $(\bar{x}, \bar{y})$ to the optimization problem specifying $w_{ij}(s')$. It remains to note that (3.93) combines with concavity of $w_{ij}(\cdot)$ and the relation $w_{ij}(0) \ge 0$ to imply that
$$w_{ij}(\ln(2I/\epsilon)) \le \vartheta w_{ij}(\bar{s}) \le \vartheta\,\mathrm{Risk}^*_\epsilon(K),\quad\text{where}\quad \vartheta = \ln(2I/\epsilon)/\bar{s} = \frac{2\ln(2I/\epsilon)}{\ln([4\epsilon(1 - \epsilon)]^{-1})}.$$
Invoking (3.90), we conclude that
$$\mathrm{Opt}_{ij}(K) = w_{ij}(\ln(2I/\epsilon)) \le \vartheta\,\mathrm{Risk}^*_\epsilon(K)\quad \forall i, j.$$
Finally, from (3.90) it immediately follows that $\mathrm{Opt}_{ij}(K)$ is nonincreasing in $K$ (as $K$ grows, the feasible set of the optimization problem in (3.90) shrinks), so that for $\bar{K} \ge K$ we have
$$\mathrm{Opt}(\bar{K}) \le \mathrm{Opt}(K) = \max_{i,j}\mathrm{Opt}_{ij}(K) \le \vartheta\,\mathrm{Risk}^*_\epsilon(K),$$
and (3.14) follows.

3.6.2 Verifying 1-convexity of the conditional quantile

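Let $r$ be a nonvanishing probability distribution on $S$, and let

$$F_m(r) = \sum_{i=1}^m r_i,\quad 1 \le m \le M,$$

so that $0 < F_1(r) < F_2(r) < ... < F_M(r) = 1$. Denoting by $\mathcal{P}$ the set of all nonvanishing probability distributions on $S$, observe that for every $r \in \mathcal{P}$, $\chi_\alpha[r]$ is a piecewise linear function of $\alpha \in [0, 1]$ with breakpoints $0, F_1(r), F_2(r), F_3(r), ..., F_M(r)$, the values of the function at these breakpoints being $s_1, s_1, s_2, s_3, ..., s_M$. In particular, this function is equal to $s_1$ on $[0, F_1(r)]$ and is strictly increasing on $[F_1(r), 1]$. Now let $s \in \mathbf{R}$, and let

$$\mathcal{P}^{\le}_\alpha[s] = \{r \in \mathcal{P} : \chi_\alpha[r] \le s\},\quad \mathcal{P}^{\ge}_\alpha[s] = \{r \in \mathcal{P} : \chi_\alpha[r] \ge s\}.$$

Observe that the just introduced sets are cut off $\mathcal{P}$ by nonstrict linear inequalities, specifically,

• when $s < s_1$, we have $\mathcal{P}^{\le}_\alpha[s] = \emptyset$, $\mathcal{P}^{\ge}_\alpha[s] = \mathcal{P}$;
• when $s = s_1$, we have $\mathcal{P}^{\le}_\alpha[s] = \{r \in \mathcal{P} : F_1(r) \ge \alpha\}$, $\mathcal{P}^{\ge}_\alpha[s] = \mathcal{P}$;
• when $s > s_M$, we have $\mathcal{P}^{\le}_\alpha[s] = \mathcal{P}$, $\mathcal{P}^{\ge}_\alpha[s] = \emptyset$;
• when $s_1 < s \le s_M$, for every $r \in \mathcal{P}$ the equation $\chi_\gamma[r] = s$ in variable $\gamma \in [0, 1]$ has exactly one solution $\gamma(r)$ which can be found as follows: we specify $k = k_s \in \{1, ..., M-1\}$ such that $s_k < s \le s_{k+1}$ and set

$$\gamma(r) = \frac{(s_{k+1} - s)F_k(r) + (s - s_k)F_{k+1}(r)}{s_{k+1} - s_k}.$$

Since $\chi_\alpha[r]$ is strictly increasing in $\alpha$ when $\alpha \in [F_1(r), 1]$, for $s \in (s_1, s_M]$ we have

$$\begin{array}{rcl}
\mathcal{P}^{\le}_\alpha[s] &=& \{r \in \mathcal{P} : \alpha \le \gamma(r)\} = \left\{r \in \mathcal{P} : \dfrac{(s_{k+1} - s)F_k(r) + (s - s_k)F_{k+1}(r)}{s_{k+1} - s_k} \ge \alpha\right\},\\[2mm]
\mathcal{P}^{\ge}_\alpha[s] &=& \{r \in \mathcal{P} : \alpha \ge \gamma(r)\} = \left\{r \in \mathcal{P} : \dfrac{(s_{k+1} - s)F_k(r) + (s - s_k)F_{k+1}(r)}{s_{k+1} - s_k} \le \alpha\right\}.
\end{array}$$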
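The description of the quantile as a piecewise linear function of $\alpha$, and the formula for $\gamma(r)$, can be checked numerically. The following is a minimal sketch, assuming $S = \{s_1 < ... < s_M\}$ is stored as an increasing array and $r$ as the array of its probabilities; the function names `quantile` and `gamma_of` are illustrative and do not come from the text.

```python
import numpy as np

def quantile(alpha, r, s):
    """chi_alpha[r]: piecewise linear in alpha, with breakpoints
    0, F_1(r), ..., F_M(r) and values s_1, s_1, s_2, ..., s_M."""
    F = np.cumsum(r)                      # F_m(r) = r_1 + ... + r_m
    xp = np.concatenate(([0.0], F))       # breakpoints
    fp = np.concatenate(([s[0]], s))      # values at the breakpoints
    return float(np.interp(alpha, xp, fp))

def gamma_of(level, r, s):
    """Solution gamma(r) of chi_gamma[r] = level, for s_1 < level <= s_M."""
    F = np.cumsum(r)
    # 0-based index k with s[k] < level <= s[k + 1]
    k = int(np.searchsorted(s, level, side="left")) - 1
    return ((s[k + 1] - level) * F[k] + (level - s[k]) * F[k + 1]) / (s[k + 1] - s[k])

r = np.array([0.2, 0.3, 0.5])
s = np.array([1.0, 2.0, 4.0])
g = gamma_of(3.0, r, s)
print(g, quantile(g, r, s))
```

Since $\gamma(r)$ is a ratio of linear functions of $r$ with a constant denominator, the condition $\alpha \le \gamma(r)$ is a linear inequality in $r$, which is the point of the argument.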

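As an immediate consequence of this description, given $\alpha \in [0, 1]$ and $\tau \in T$ and setting

$$G_{\tau,\mu}(p) = \sum_{\iota=1}^{\mu} p(\iota, \tau),\quad 1 \le \mu \le M,$$

and

$$\mathcal{X}^{s,\le} = \{p(\cdot,\cdot) \in \mathcal{X} : \chi_\alpha[p_\tau] \le s\},\quad \mathcal{X}^{s,\ge} = \{p(\cdot,\cdot) \in \mathcal{X} : \chi_\alpha[p_\tau] \ge s\},$$

we get

$$\begin{array}{ll}
s < s_1: & \mathcal{X}^{s,\le} = \emptyset,\ \mathcal{X}^{s,\ge} = \mathcal{X},\\[1mm]
s = s_1: & \mathcal{X}^{s,\le} = \{p \in \mathcal{X} : G_{\tau,1}(p) \ge \alpha G_{\tau,M}(p)\},\ \mathcal{X}^{s,\ge} = \mathcal{X},\\[1mm]
s > s_M: & \mathcal{X}^{s,\le} = \mathcal{X},\ \mathcal{X}^{s,\ge} = \emptyset,\\[1mm]
s_1 < s \le s_M: & \mathcal{X}^{s,\le} = \left\{p \in \mathcal{X} : \dfrac{(s_{k+1}-s)G_{\tau,k}(p) + (s-s_k)G_{\tau,k+1}(p)}{s_{k+1}-s_k} \ge \alpha G_{\tau,M}(p)\right\},\\[3mm]
& \mathcal{X}^{s,\ge} = \left\{p \in \mathcal{X} : \dfrac{(s_{k+1}-s)G_{\tau,k}(p) + (s-s_k)G_{\tau,k+1}(p)}{s_{k+1}-s_k} \le \alpha G_{\tau,M}(p)\right\},\\[3mm]
& k = k_s : s_k < s \le s_{k+1},
\end{array}$$

implying 1-convexity of the conditional quantile on $\mathcal{X}$ (recall that $G_{\tau,\mu}(p)$ are linear in $p$). ✷

3.6.3 Proof of Proposition 3.4

3.6.3.1 Proof of Proposition 3.4.i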

We call step $\ell$ essential if at this step rule 2d is invoked.

1°. Let $x \in X$ be the true signal underlying the observation $\bar\omega^K$, so that $\bar\omega_1, ..., \bar\omega_K$ are drawn from the distribution $p_{A(x)}$ independently of each other. Consider the "ideal" estimate given by exactly the same rules as the Bisection procedure in Section 3.2.4.2 (in the sequel, we refer to the latter as the "true" one), with tests $T^K_{\Delta_{\ell,rg},r}(\cdot)$, $T^K_{\Delta_{\ell,lf},l}(\cdot)$ in rule 2d replaced with the "ideal tests"

$$\widehat T_{\Delta_{\ell,rg},r} = \widehat T_{\Delta_{\ell,lf},l} = \begin{cases} \text{right}, & f(x) > c_\ell,\\ \text{left}, & f(x) \le c_\ell.\end{cases}$$

Marking by $*$ the entities produced by the resulting fully deterministic procedure, we arrive at the sequence of nested segments $\Delta^*_\ell = [a^*_\ell, b^*_\ell]$, $0 \le \ell \le L^* \le L$, along with subsegments $\Delta^*_{\ell,rg} = [c^*_\ell, v^*_\ell]$, $\Delta^*_{\ell,lf} = [u^*_\ell, c^*_\ell]$ of $\Delta^*_{\ell-1}$, defined for all $*$-essential values of $\ell$, and the output segment $\bar\Delta^*$ claimed to contain $f(x)$. Note that the ideal procedure cannot terminate due to arriving at a disagreement, and that $f(x)$, as is immediately seen, is contained in all segments $\Delta^*_\ell$, $0 \le \ell \le L^*$, just as $f(x) \in \bar\Delta^*$.

Let $\mathcal{L}^*$ be the set of all $*$-essential values of $\ell$. For $\ell \in \mathcal{L}^*$, let the event $E_\ell[x]$ parameterized by $x$ be defined as follows:

$$E_\ell[x] = \begin{cases}
\left\{\omega^K : T^K_{\Delta^*_{\ell,rg},r}(\omega^K) = \text{right or } T^K_{\Delta^*_{\ell,lf},l}(\omega^K) = \text{right}\right\}, & f(x) \le u^*_\ell,\\[1mm]
\left\{\omega^K : T^K_{\Delta^*_{\ell,rg},r}(\omega^K) = \text{right}\right\}, & u^*_\ell < f(x) \le c^*_\ell,\\[1mm]
\left\{\omega^K : T^K_{\Delta^*_{\ell,lf},l}(\omega^K) = \text{left}\right\}, & c^*_\ell < f(x) < v^*_\ell,\\[1mm]
\left\{\omega^K : T^K_{\Delta^*_{\ell,rg},r}(\omega^K) = \text{left or } T^K_{\Delta^*_{\ell,lf},l}(\omega^K) = \text{left}\right\}, & f(x) \ge v^*_\ell.
\end{cases} \qquad (3.94)$$

2°. Observe that by construction and in view of Proposition 2.27 we have

$$\forall \ell \in \mathcal{L}^* : \mathrm{Prob}_{\omega^K \sim p_{A(x)} \times ... \times p_{A(x)}}\{E_\ell[x]\} \le 2\delta. \qquad (3.95)$$

Indeed, let $\ell \in \mathcal{L}^*$.

• When $f(x) \le u^*_\ell$, we have $x \in X$ and $f(x) \le u^*_\ell \le c^*_\ell$, implying that $E_\ell[x]$ takes place only when either the left test $T^K_{\Delta^*_{\ell,lf},l}$ or the right test $T^K_{\Delta^*_{\ell,rg},r}$, or both, accept the wrong ("right") hypotheses from the pairs of right and left hypotheses. Since the corresponding intervals ($[u^*_\ell, c^*_\ell]$ for the left side test, $[c^*_\ell, v^*_\ell]$ for the right side one) are $\delta$-good left and right, respectively, the risks of the tests do not exceed $\delta$, and the $p_{A(x)}$-probability of the event $E_\ell[x]$ is at most $2\delta$;
• when $u^*_\ell < f(x) \le c^*_\ell$, the event $E_\ell[x]$ takes place only when the right side test $T^K_{\Delta^*_{\ell,rg},r}$ accepts the wrong ("right") hypothesis from the pair; as above, this can happen with $p_{A(x)}$-probability at most $\delta$;
• when $c^*_\ell < f(x) < v^*_\ell$, the event $E_\ell[x]$ takes place only if the left test $T^K_{\Delta^*_{\ell,lf},l}$ accepts the wrong ("left") hypothesis from the pair to which it was applied, which again happens with $p_{A(x)}$-probability $\le \delta$;
• finally, when $f(x) \ge v^*_\ell$, the event $E_\ell[x]$ takes place only when either the left side test $T^K_{\Delta^*_{\ell,lf},l}$ or the right side test $T^K_{\Delta^*_{\ell,rg},r}$, or both, accept the wrong ("left") hypotheses from the pairs; as above, this can happen with $p_{A(x)}$-probability at most $2\delta$.

3°. Let $\bar L = \bar L(\bar\omega^K)$ be the last step of the true estimating procedure as run on the observation $\bar\omega^K$. We claim that the following holds true:

(!) Let $E := \bigcup_{\ell \in \mathcal{L}^*} E_\ell[x]$, so that the $p_{A(x)}$-probability of the event $E$, the observations stemming from $x$, is at most $2\delta L = \epsilon$ (see (3.17), (3.95)). Assume that $\bar\omega^K \notin E$. Then $\bar L(\bar\omega^K) \le L^*$, and only two cases are possible:
A. The true estimating procedure does not terminate due to arriving at a disagreement. In this case $L^* = \bar L(\bar\omega^K)$ and the trajectories of the ideal and the true procedures are identical (same localizers and essential steps, same output segments, etc.), and, in particular, $f(x) \in \bar\Delta$, or
B. The true estimating procedure terminates due to arriving at a disagreement. Then $\Delta_\ell = \Delta^*_\ell$ for $\ell < \bar L$, and $f(x) \in \bar\Delta$.

In view of A and B the $p_{A(x)}$-probability of the event $f(x) \in \bar\Delta$ is at least $1 - \epsilon$, as claimed in Proposition 3.4.

To prove (!), note that the actions at step $\ell$ in ideal and true procedures depend solely on $\Delta_{\ell-1}$ and on the outcome of rule 2d. Taking into account that $\Delta_0 = \Delta^*_0$, all we need to verify is the following claim:

(!!) Let $\bar\omega^K \notin E$, and let $\ell \le L^*$ be such that $\Delta_{\ell-1} = \Delta^*_{\ell-1}$, whence also $u_\ell = u^*_\ell$, $c_\ell = c^*_\ell$, and $v_\ell = v^*_\ell$. Assume that $\ell$ is essential (given that $\Delta_{\ell-1} = \Delta^*_{\ell-1}$, this may happen if and only if $\ell$ is $*$-essential as well). Then either
C. At step $\ell$ the true procedure terminates due to disagreement, in which case $f(x) \in \bar\Delta$, or
D. At step $\ell$ there was no disagreement, in which case $\Delta_\ell$ as given by (3.16) is identical to $\Delta^*_\ell$ as given by the ideal counterpart of (3.16) in the case of $\Delta^*_{\ell-1} = \Delta_{\ell-1}$, that is, by the rule

$$\Delta^*_\ell = \begin{cases} [c_\ell, b_{\ell-1}], & f(x) > c_\ell,\\ [a_{\ell-1}, c_\ell], & f(x) \le c_\ell.\end{cases} \qquad (3.96)$$

To verify (!!), let $\bar\omega^K$ and $\ell$ satisfy the premise of (!!). Note that due to $\Delta_{\ell-1} = \Delta^*_{\ell-1}$ we have $u_\ell = u^*_\ell$, $c_\ell = c^*_\ell$, and $v_\ell = v^*_\ell$, and thus also $\Delta^*_{\ell,lf} = \Delta_{\ell,lf}$, $\Delta^*_{\ell,rg} = \Delta_{\ell,rg}$. Consider first the case when the true estimation procedure terminates by disagreement at step $\ell$, so that $T^K_{\Delta^*_{\ell,lf},l}(\bar\omega^K) \ne T^K_{\Delta^*_{\ell,rg},r}(\bar\omega^K)$. When assuming that $f(x) < u_\ell = u^*_\ell$, the relation $\bar\omega^K \notin E_\ell[x]$ combines with (3.94) to imply that $T^K_{\Delta^*_{\ell,rg},r}(\bar\omega^K) = T^K_{\Delta^*_{\ell,lf},l}(\bar\omega^K) = \text{left}$, which under disagreement is impossible. Assuming $f(x) > v_\ell = v^*_\ell$, the same argument results in $T^K_{\Delta^*_{\ell,rg},r}(\bar\omega^K) = T^K_{\Delta^*_{\ell,lf},l}(\bar\omega^K) = \text{right}$, which again is impossible. We conclude that in the case in question $u_\ell \le f(x) \le v_\ell$, i.e., $f(x) \in \bar\Delta$, as claimed in C. C is proved.

Now, suppose that there was a consensus at step $\ell$ in the true estimating procedure. Because $\bar\omega^K \notin E_\ell[x]$ this can happen in the following four cases:

1. $T^K_{\Delta^*_{\ell,rg},r}(\bar\omega^K) = \text{left}$ and $f(x) \le u_\ell = u^*_\ell$,
2. $T^K_{\Delta^*_{\ell,rg},r}(\bar\omega^K) = \text{left}$ and $u_\ell < f(x) \le c_\ell = c^*_\ell$,
3. $T^K_{\Delta^*_{\ell,lf},l}(\bar\omega^K) = \text{right}$ and $c_\ell < f(x) < v_\ell = v^*_\ell$,
4. $T^K_{\Delta^*_{\ell,lf},l}(\bar\omega^K) = \text{right}$ and $v_\ell \le f(x)$.

Due to consensus at step $\ell$, in situations 1 and 2 (3.16) says that $\Delta_\ell = [a_{\ell-1}, c_\ell]$, which combines with (3.96) and $v_\ell = v^*_\ell$ to imply that $\Delta_\ell = \Delta^*_\ell$. Similarly, in situations 3 and 4, due to consensus at step $\ell$, (3.16) implies that $\Delta_\ell = [c_\ell, b_{\ell-1}]$, which combines with $u_\ell = u^*_\ell$ and (3.96) to imply that $\Delta_\ell = \Delta^*_\ell$. D is proved. ✷
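The "ideal" procedure driven by rule (3.96) is easy to simulate. The sketch below is an illustration under simplifying assumptions that are not taken from the text: the point $c_\ell$ is taken to be the midpoint of $\Delta_{\ell-1}$, every step is treated as essential, and the early-termination rules of the Bisection procedure are ignored. The name `ideal_bisection` is hypothetical.

```python
def ideal_bisection(f_x, a0, b0, L):
    """Run L steps of the ideal rule (3.96) and return all segments.

    Assumptions (illustrative only): c_l is the midpoint of the current
    segment, and no termination rule is ever triggered.
    """
    a, b = a0, b0
    segments = [(a, b)]
    for _ in range(L):
        c = 0.5 * (a + b)          # assumed choice of c_l
        if f_x > c:
            a = c                  # ideal test says "right": keep [c_l, b_{l-1}]
        else:
            b = c                  # ideal test says "left": keep [a_{l-1}, c_l]
        segments.append((a, b))
    return segments

segs = ideal_bisection(f_x=0.3, a0=0.0, b0=1.0, L=5)
for a, b in segs:
    print(a, b, a <= 0.3 <= b)
```

Under these assumptions every segment contains $f(x)$ and the length halves at each step, which is the behavior the true procedure reproduces whenever the observation avoids the event $E$.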

3.6.3.2 Proof of Proposition 3.4.ii

There is nothing to prove when $\frac{b_0 - a_0}{2} \le \widehat\rho$, since in this case the estimate $\frac{a_0 + b_0}{2}$, which does not use observations at all, is $(\widehat\rho, 0)$-reliable. From now on we assume that $b_0 - a_0 > 2\widehat\rho$, implying that $L$ is a positive integer.

1°. Observe, first, that if $a$ and $b$ are such that $a$ is lower-feasible, $b$ is upper-feasible, and $b - a > 2\rho$, then for every $i \le I_{b,\ge}$ and $j \le I_{a,\le}$ there exists a test, based on $\bar K$ observations, which decides upon the hypotheses $H_1$, $H_2$, stating that the observations are drawn from $p_{A(x)}$ with $x \in Z_i^{b,\ge}$ ($H_1$) or with $x \in Z_j^{a,\le}$ ($H_2$), with risk at most $\epsilon$. Indeed, it suffices to consider the test which accepts $H_1$ and rejects $H_2$ when $\widehat f(\omega^{\bar K}) \ge \frac{a+b}{2}$ and accepts $H_2$ and rejects $H_1$ otherwise.

2°. With parameters of Bisection chosen according to (3.19), by already proved Proposition 3.4.i, we have

E. For every $x \in X$, the $p_{A(x)}$-probability of the event $f(x) \in \bar\Delta$, $\bar\Delta$ being the output segment of our Bisection, is at least $1 - \epsilon$.

3°. We claim also that

F.1. Every segment $\Delta = [a, b]$ with $b - a > 2\rho$ and lower-feasible $a$ is $\delta$-good (right),
F.2. Every segment $\Delta = [a, b]$ with $b - a > 2\rho$ and upper-feasible $b$ is $\delta$-good (left),
F.3. Every $\kappa$-maximal $\delta$-good (left or right) segment has length at most $2\rho + \kappa = \widehat\rho$. As a result, for every essential step $\ell$, the lengths of the segments $\Delta_{\ell,rg}$ and $\Delta_{\ell,lf}$ do not exceed $\widehat\rho$.

Let us verify F.1 (verification of F.2 is completely similar, and F.3 is an immediate consequence of the definitions and F.1-2). Let $[a, b]$ satisfy the premise of F.1. It may happen that $b$ is upper-infeasible, whence $\Delta = [a, b]$ is 0-good (right), and we are done. Now let $b$ be upper-feasible. As we have already seen, whenever $i \le I_{b,\ge}$ and $j \le I_{a,\le}$, the hypotheses stating that $\omega_k$ are sampled from $p_{A(x)}$ for some $x \in Z_i^{b,\ge}$ and for some $x \in Z_j^{a,\le}$, respectively, can be decided upon with risk $\le \epsilon$, implying, as in the proof of Proposition 2.25, that

$$\epsilon_{ij\Delta} \le [2\sqrt{\epsilon(1-\epsilon)}]^{1/\bar K}.$$

Hence, taking into account that the column and the row sizes of $E_{\Delta,r}$ do not exceed $NI$,

$$\sigma_{\Delta,r} \le NI \max_{i,j} \epsilon^K_{ij\Delta} \le NI[2\sqrt{\epsilon(1-\epsilon)}]^{K/\bar K} \le \frac{\epsilon}{2L} = \delta$$

(we have used (3.19)), that is, $\Delta$ indeed is $\delta$-good (right).
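To get a feeling for the sample sizes involved, one can solve the last inequality for $K$. The sketch below rests on an assumed reading of the display: the exponent is the ratio of the number $K$ of observations used by Bisection to the number of observations used by the reference estimate $\widehat f$, called `K_ref` here. The helper names are hypothetical, and $\epsilon < 1/2$ is required so that $2\sqrt{\epsilon(1-\epsilon)} < 1$.

```python
import math

def good_bound(eps, N, I, K, K_ref):
    """Right-hand side N*I*[2*sqrt(eps*(1-eps))]^(K/K_ref) of the bound on sigma."""
    return N * I * (2.0 * math.sqrt(eps * (1.0 - eps))) ** (K / K_ref)

def min_observations(eps, L, N, I, K_ref):
    """Smallest integer K with good_bound(...) <= eps/(2L); needs 0 < eps < 1/2."""
    ratio = 2.0 * math.log(2.0 * L * N * I / eps) / math.log(1.0 / (4.0 * eps * (1.0 - eps)))
    return math.ceil(K_ref * ratio)

eps, L, N, I, K_ref = 0.01, 10, 3, 2, 5
K = min_observations(eps, L, N, I, K_ref)
print(K, good_bound(eps, N, I, K, K_ref), eps / (2 * L))
```

The required $K$ grows only logarithmically in $L$, $N$, $I$ and $1/\epsilon$, and linearly in the sample size of the reference estimate.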

4°. Let us fix $x \in X$ and consider a trajectory of Bisection, the observation being drawn from $p_{A(x)}$. The output $\bar\Delta$ of the procedure is given by one of the following options:

1. At some step $\ell$ of Bisection, the process terminated according to rules in 2b or 2c. In the first case, the segment $[c_\ell, b_{\ell-1}]$ has lower-feasible left endpoint and is not $\delta$-good (right), implying by F.1 that the length of this segment (which is half the length of $\bar\Delta = \Delta_{\ell-1}$) is $\le 2\rho$, so that the length $|\bar\Delta|$ of $\bar\Delta$ is at most $4\rho \le 2\widehat\rho$. The same conclusion, by a completely similar argument, holds true if the process terminated at step $\ell$ according to rule 2c.
2. At some step $\ell$ of Bisection, the process terminated due to disagreement. In this case, by F.3, we have $|\bar\Delta| \le 2\widehat\rho$.
3. Bisection terminated at step $L$, and $\bar\Delta = \Delta_L$. In this case, termination clauses in rules 2b, 2c, and 2d were never invoked, clearly implying that $|\Delta_s| \le |\Delta_{s-1}|/2$, $1 \le s \le L$, and thus $|\bar\Delta| = |\Delta_L| \le 2^{-L}|\Delta_0| \le 2\widehat\rho$ (see (3.19)).

Thus, we have $|\bar\Delta| \le 2\widehat\rho$, implying that whenever the signal $x \in X$ underlying the observations and the output segment $\bar\Delta$ are such that $f(x) \in \bar\Delta$, the error of the Bisection estimate (which is the midpoint of $\bar\Delta$) is at most $\widehat\rho$. Invoking E, we conclude that the Bisection estimate is $(\widehat\rho, \epsilon)$-reliable. ✷

3.6.4 Proof of Proposition 3.14

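Let us fix $\epsilon \in (0, 1)$. Setting

$$\rho_K = \frac12\left[\widehat\Psi_{+,K}(\bar h, \bar H) + \widehat\Psi_{-,K}(\bar h, \bar H)\right]$$

and invoking Corollary 3.13, all we need to prove is that in the case of A.1-3 one has

$$\limsup_{K \to \infty}\left[\widehat\Psi_{+,K}(\bar h, \bar H) + \widehat\Psi_{-,K}(\bar h, \bar H)\right] \le 0. \qquad (3.97)$$

To this end, note that in our current situation, (3.48) and (3.52) simplify to

$$\begin{array}{rcl}
\Phi(h, H; Z) &=& -\frac12 \ln\mathrm{Det}(I - \Theta_*^{1/2} H \Theta_*^{1/2}) + \frac12 \mathrm{Tr}\Big(Z \underbrace{B^T\Big[\begin{bmatrix} H & h\\ h^T & \end{bmatrix} + [H, h]^T[\Theta_*^{-1} - H]^{-1}[H, h]\Big]B}_{Q(h,H)}\Big),\\[4mm]
\widehat\Psi_{+,K}(h, H) &=& \inf\limits_{\alpha}\max\limits_{Z \in \mathcal{Z}}\Big\{\alpha\Phi(h/\alpha, H/\alpha; Z) - \mathrm{Tr}(QZ) + K^{-1}\alpha\ln(2/\epsilon) :\\
&& \qquad\qquad \alpha > 0,\ -\gamma\alpha\Theta_*^{-1} \preceq H \preceq \gamma\alpha\Theta_*^{-1}\Big\},\\[2mm]
\widehat\Psi_{-,K}(h, H) &=& \inf\limits_{\alpha}\max\limits_{Z \in \mathcal{Z}}\Big\{\alpha\Phi(-h/\alpha, -H/\alpha; Z) + \mathrm{Tr}(QZ) + K^{-1}\alpha\ln(2/\epsilon) :\\
&& \qquad\qquad \alpha > 0,\ -\gamma\alpha\Theta_*^{-1} \preceq H \preceq \gamma\alpha\Theta_*^{-1}\Big\}.
\end{array}$$

Hence

$$\begin{array}{l}
\widehat\Psi_{+,K}(\bar h, \bar H) + \widehat\Psi_{-,K}(\bar h, \bar H)\\[1mm]
\quad \le \inf\limits_{\alpha}\max\limits_{Z_1, Z_2 \in \mathcal{Z}}\Big\{\alpha\Phi(\bar h/\alpha, \bar H/\alpha; Z_1) - \mathrm{Tr}(QZ_1) + \alpha\Phi(-\bar h/\alpha, -\bar H/\alpha; Z_2) + \mathrm{Tr}(QZ_2)\\
\qquad\qquad + 2K^{-1}\alpha\ln(2/\epsilon) : \alpha > 0,\ -\gamma\alpha\Theta_*^{-1} \preceq \bar H \preceq \gamma\alpha\Theta_*^{-1}\Big\}\\[2mm]
\quad = \inf\limits_{\alpha}\max\limits_{Z_1, Z_2 \in \mathcal{Z}}\Big\{-\frac12\alpha\ln\mathrm{Det}\big(I - [\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big) + 2K^{-1}\alpha\ln(2/\epsilon) + \mathrm{Tr}(Q[Z_2 - Z_1])\\
\qquad\qquad + \frac12\alpha\mathrm{Tr}\big(Z_1 Q(\bar h/\alpha, \bar H/\alpha)\big) + \frac12\alpha\mathrm{Tr}\big(Z_2 Q(-\bar h/\alpha, -\bar H/\alpha)\big) :\\
\qquad\qquad \alpha > 0,\ -\gamma\alpha\Theta_*^{-1} \preceq \bar H \preceq \gamma\alpha\Theta_*^{-1}\Big\}\\[2mm]
\quad = \inf\limits_{\alpha}\max\limits_{Z_1, Z_2 \in \mathcal{Z}}\Big\{-\frac12\alpha\ln\mathrm{Det}\big(I - [\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big) + 2K^{-1}\alpha\ln(2/\epsilon)\\
\qquad\qquad + \frac12\mathrm{Tr}\big(Z_1 B^T[\bar H, \bar h]^T[\alpha\Theta_*^{-1} - \bar H]^{-1}[\bar H, \bar h]B\big)\\
\qquad\qquad + \frac12\mathrm{Tr}\big(Z_2 B^T[\bar H, \bar h]^T[\alpha\Theta_*^{-1} + \bar H]^{-1}[\bar H, \bar h]B\big)\\
\qquad\qquad + \underbrace{\mathrm{Tr}(Q[Z_2 - Z_1]) + \frac12\mathrm{Tr}\Big([Z_1 - Z_2]B^T\begin{bmatrix}\bar H & \bar h\\ \bar h^T & \end{bmatrix}B\Big)}_{T(Z_1, Z_2)} :\\
\qquad\qquad \alpha > 0,\ -\gamma\alpha\Theta_*^{-1} \preceq \bar H \preceq \gamma\alpha\Theta_*^{-1}\Big\}.
\end{array} \qquad (3.98)$$

By (3.57) we have

$$\frac12 B^T\begin{bmatrix}\bar H & \bar h\\ \bar h^T & \end{bmatrix}B = B^T[C^TQC + J]B,$$

where the only nonzero entry, if any, in the $(d+1) \times (d+1)$ matrix $J$ is in the cell $(d+1, d+1)$. By definition of $B$ (see (3.48)), the only nonzero element, if any, in $\bar J = B^TJB$ is in the cell $(m+1, m+1)$, and we conclude that

$$\frac12 B^T\begin{bmatrix}\bar H & \bar h\\ \bar h^T & \end{bmatrix}B = (CB)^TQ(CB) + \bar J = Q + \bar J$$

(recall that $CB = I_{m+1}$). Now, when $Z_1, Z_2 \in \mathcal{Z}$, the entries of $Z_1$, $Z_2$ in the cell $(m+1, m+1)$ both are equal to 1, whence

$$\frac12\mathrm{Tr}\Big([Z_1 - Z_2]B^T\begin{bmatrix}\bar H & \bar h\\ \bar h^T & \end{bmatrix}B\Big) = \mathrm{Tr}([Z_1 - Z_2]Q) + \mathrm{Tr}([Z_1 - Z_2]\bar J) = \mathrm{Tr}([Z_1 - Z_2]Q),$$

implying that the quantity $T(Z_1, Z_2)$ in (3.98) is zero, provided $Z_1, Z_2 \in \mathcal{Z}$. Consequently, (3.98) becomes

$$\begin{array}{l}
\widehat\Psi_{+,K}(\bar h, \bar H) + \widehat\Psi_{-,K}(\bar h, \bar H)\\[1mm]
\quad \le \inf\limits_{\alpha}\max\limits_{Z_1, Z_2 \in \mathcal{Z}}\Big\{-\frac12\alpha\ln\mathrm{Det}\big(I - [\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big) + 2K^{-1}\alpha\ln(2/\epsilon)\\
\qquad\qquad + \frac12\mathrm{Tr}\big(Z_1 B^T[\bar H, \bar h]^T[\alpha\Theta_*^{-1} - \bar H]^{-1}[\bar H, \bar h]B\big)\\
\qquad\qquad + \frac12\mathrm{Tr}\big(Z_2 B^T[\bar H, \bar h]^T[\alpha\Theta_*^{-1} + \bar H]^{-1}[\bar H, \bar h]B\big) :\\
\qquad\qquad \alpha > 0,\ -\gamma\alpha\Theta_*^{-1} \preceq \bar H \preceq \gamma\alpha\Theta_*^{-1}\Big\}.
\end{array} \qquad (3.99)$$

Now, for an appropriately selected real $c$ independent of $K$, for $\alpha$ allowed by (3.99), and all $Z_1, Z_2 \in \mathcal{Z}$ we have (recall that $\mathcal{Z}$ is bounded)

$$\frac12\mathrm{Tr}\big(Z_1 B^T[\bar H, \bar h]^T[\alpha\Theta_*^{-1} - \bar H]^{-1}[\bar H, \bar h]B\big) + \frac12\mathrm{Tr}\big(Z_2 B^T[\bar H, \bar h]^T[\alpha\Theta_*^{-1} + \bar H]^{-1}[\bar H, \bar h]B\big) \le c/\alpha,$$

along with

$$-\frac12\alpha\ln\mathrm{Det}\big(I - [\Theta_*^{1/2}\bar H\Theta_*^{1/2}]^2/\alpha^2\big) \le c/\alpha.$$

Therefore, given $\delta > 0$, we can find $\alpha = \alpha_\delta > 0$ large enough to ensure that

$$-\gamma\alpha_\delta\Theta_*^{-1} \preceq \bar H \preceq \gamma\alpha_\delta\Theta_*^{-1}\quad\text{and}\quad 2c/\alpha_\delta \le \delta,$$

which combines with (3.99) to imply that

$$\widehat\Psi_{+,K}(\bar h, \bar H) + \widehat\Psi_{-,K}(\bar h, \bar H) \le \delta + 2K^{-1}\alpha_\delta\ln(2/\epsilon),$$

and (3.97) follows.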



Chapter Four

Signal Recovery by Linear Estimation

OVERVIEW

In this chapter we consider several variations of one of the most basic problems of high-dimensional statistics: signal recovery. In its simplest form the problem is as follows: given positive definite $m \times m$ matrix $\Gamma$, $m \times n$ matrix $A$, $\nu \times n$ matrix $B$, and indirect noisy observation

$$\omega = Ax + \xi \quad [\xi \sim \mathcal{N}(0, \Gamma)] \qquad (4.1)$$

of unknown "signal" $x$ known to belong to a given convex compact subset $\mathcal{X}$ of $\mathbf{R}^n$, we want to recover the linear image $Bx \in \mathbf{R}^\nu$ of $x$. We focus first on the case where the quality of a candidate recovery $\omega \mapsto \widehat x(\omega)$ is quantified by its worst-case, over $x \in \mathcal{X}$, expected $\|\cdot\|_2^2$-error, that is, by the risk

$$\mathrm{Risk}[\widehat x(\cdot)|\mathcal{X}] = \sup_{x \in \mathcal{X}}\sqrt{\mathbf{E}_{\xi \sim \mathcal{N}(0,\Gamma)}\{\|\widehat x(Ax + \xi) - Bx\|_2^2\}}. \qquad (4.2)$$

The simplest and the most studied type of recovery is an affine one: $\widehat x(\omega) = H^T\omega + h$; assuming $\mathcal{X}$ to be symmetric w.r.t. the origin, we lose nothing when passing from affine estimates to linear ones, those of the form $\widehat x_H(\omega) = H^T\omega$. An advantage of linear estimates is that under favorable circumstances (e.g., when $\mathcal{X}$ is an ellipsoid), minimizing risk over linear estimates is an efficiently solvable problem, and there exists a huge body of literature on linear estimates that are optimal in terms of their risk (see, e.g., [6, 57, 82, 155, 156, 197, 206, 207] and references therein). Moreover, in the case of signal recovery from direct observations in white Gaussian noise (the case of $B = A = I_n$, $\Gamma = \sigma^2 I_n$), there is a huge body of results on near-optimality of properly selected linear estimates among all possible recovery routines; see, e.g., [79, 88, 106, 124, 198, 230, 239] and references therein. A typical result of this type states that when recovering $x \in \mathcal{X}$ from direct observation $\omega = x + \sigma\xi$, $\xi \sim \mathcal{N}(0, I_n)$, where $\mathcal{X}$ is an ellipsoid of the form

$$\Big\{x \in \mathbf{R}^n : \sum_j j^{2\alpha}x_j^2 \le L^2\Big\},$$

or the box $\{x \in \mathbf{R}^n : j^\alpha|x_j| \le L,\ j \le n\}$, with fixed $L < \infty$ and $\alpha > 0$, the ratio of the risk of a properly selected linear estimate to the minimax risk

$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}] := \inf_{\widehat x(\cdot)}\mathrm{Risk}[\widehat x|\mathcal{X}] \qquad (4.3)$$

(the infimum is taken over all estimates, not necessarily linear) remains bounded, or even tends to 1, as $\sigma \to +0$, and this happens uniformly in $n$, $\alpha$ and $L$ being fixed.
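As a toy illustration of the risk (4.2) for a linear estimate, consider the scalar case $n = 1$, $A = B = 1$, $\Gamma = \sigma^2$, with the signal set $[-L, L]$. This special case is chosen here for illustration and is not worked out in the text. For the estimate $\widehat x(\omega) = h\omega$ the squared risk is $h^2\sigma^2 + (1-h)^2L^2$, and it is minimized by $h = L^2/(L^2 + \sigma^2)$. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def linear_risk_sq(h, L, sigma):
    """Worst-case (over |x| <= L) expected squared error of x_hat = h * omega,
    where omega = x + sigma * xi and xi ~ N(0, 1)."""
    return (h * sigma) ** 2 + ((1.0 - h) * L) ** 2

def best_linear(L, sigma):
    """Minimizer of linear_risk_sq over h, and the resulting squared risk."""
    h = L ** 2 / (L ** 2 + sigma ** 2)
    return h, linear_risk_sq(h, L, sigma)

h_opt, risk_sq = best_linear(L=2.0, sigma=1.0)
grid = np.linspace(0.0, 1.0, 10001)
print(h_opt, risk_sq, min(linear_risk_sq(g, 2.0, 1.0) for g in grid))
```

The optimal weight shrinks the observation toward the origin, and the shrinkage disappears as $\sigma \to +0$.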


Similar "near-optimality" results are known for the "diagonal" case, where $\mathcal{X}$ is an ellipsoid/box and $A$, $B$, $\Gamma$ are diagonal matrices. To the best of our knowledge, the only "general" (that is, not imposing severe restrictions on how the geometries of $\mathcal{X}$, $A$, $B$, $\Gamma$ are linked to each other) result on optimality of linear estimates is due to D. Donoho, who proved [64] that when recovering a linear form (i.e., in the case of one-dimensional $Bx$), the best risk over all linear estimates is within the factor 1.2 of the minimax risk.

The primary goal of this chapter is to establish rather general results on near-optimality of properly built linear estimates as compared to all possible estimates. Results of this type are bound to impose some restrictions on $\mathcal{X}$, since there are cases (e.g., the case of a high-dimensional $\|\cdot\|_1$-ball $\mathcal{X}$) where linear estimates are by far nonoptimal. Our restrictions on $\mathcal{X}$ reduce to the existence of a special type representation of $\mathcal{X}$ and are satisfied, e.g., when $\mathcal{X}$ is the intersection of $K < \infty$ ellipsoids/elliptic cylinders,

$$\mathcal{X} = \{x \in \mathbf{R}^n : x^TR_kx \le 1,\ 1 \le k \le K\}\quad \Big[R_k \succeq 0,\ \sum_k R_k \succ 0\Big] \qquad (4.4)$$

in particular, $\mathcal{X}$ can be a symmetric w.r.t. the origin compact polytope given by $2K$ linear inequalities $-1 \le r_k^Tx \le 1$, $1 \le k \le K$, or, equivalently, $\mathcal{X} = \{x : x^T\underbrace{(r_kr_k^T)}_{R_k}x \le 1,\ 1 \le k \le K\}$. Another instructive example is a set of the form $\mathcal{X} = \{x : \|Sx\|_p \le L\}$, where $p \ge 2$ and $S$ is a matrix with trivial kernel. It should be stressed that while imposing some restrictions on $\mathcal{X}$, we require nothing from $A$, $B$, and $\Gamma$, aside from positive definiteness of the latter matrix. Our main result (Proposition 4.5) states, in particular, that with $\mathcal{X}$ given by (4.4) and with arbitrary $A$ and $B$, the risk of a properly selected linear estimate $\widehat x_{H_*}$, with both $H_*$ and the risk efficiently computable, satisfies the bound

$$\mathrm{Risk}[\widehat x_{H_*}|\mathcal{X}] \le O(1)\sqrt{\ln(K+1)}\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}], \qquad (*)$$

where $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ is the minimax risk, and $O(1)$ is an absolute constant. Note that the outlined result is an "operational" one: the risk of the provably nearly optimal estimate and the estimate itself are given by efficient computation. This is in sharp contrast with traditional results of nonparametric statistics, where near-optimal estimates and their risks are given in a "closed analytical form," at the price of severe restrictions on the structure of the "data" $\mathcal{X}$, $A$, $B$, $\Gamma$. This being said, it should be stressed that one of the crucial components in our construction is quite classical: this is the idea, going back to M.S. Pinsker [198], of bounding from below the minimax risk via the Bayesian risk associated with a properly selected Gaussian prior.1

The main body of the chapter originates from [138, 137] and is organized as follows.

• Section 4.1 presents basic results on Conic Programming and Conic Duality, the principal optimization tools utilized in all subsequent constructions and proofs.
• Section 4.2 contains problem formulation (Section 4.2.1), construction of the linear estimate we deal with (Section 4.2.2) and the central result on near-optimality of this estimate (Section 4.2.2.2). We discuss also the "expressive abilities" of the family of sets (we call them ellitopes) to which our main result applies.
• In Section 4.3 we extend the results of the previous section from ellitopes to their "matrix analogs," spectratopes in the role of signal sets, passing simultaneously from the norm $\|\cdot\|_2$ in which the recovery error is measured to arbitrary spectratopic norms, those for which the unit ball of the conjugate norm is a spectratope. In addition, we allow for observation noise to have nonzero mean and to be non-Gaussian.
• Section 4.4 adjusts our preceding results on linear estimation to the case where the signals to be recovered possess stochastic components.
• Finally, Section 4.5 deals with "uncertain-but-bounded" observation noise, that is, noise selected "by nature," perhaps in an adversarial fashion, from a given bounded set.

1 [88, 198] address the problem of $\|\cdot\|_2$-recovery of a signal $x$ from direct observations ($A = B = I$) in the case when $\mathcal{X}$ is a high-dimensional ellipsoid with "regularly decreasing half-axes," like $\mathcal{X} = \{x \in \mathbf{R}^n : \sum_j j^{2\alpha}x_j^2 \le L^2\}$ with $\alpha > 0$. In this case Pinsker's construction shows that as $\sigma \to +0$, the risk of a properly built linear estimate is, uniformly in $n$, $(1 + o(1))$ times the minimax risk. This is much stronger than (∗), and it seems to be unlikely that a similarly strong result holds true in the general case underlying (∗).

4.1 PRELIMINARIES: EXECUTIVE SUMMARY ON CONIC PROGRAMMING

4.1.1 Cones

A cone in Euclidean space $E$ is a nonempty set $K$ which is closed w.r.t. taking conic combinations of its elements, that is, linear combinations with nonnegative coefficients. Equivalently: $K \subset E$ is a cone if $K$ is nonempty, and

• $x, y \in K \Rightarrow x + y \in K$;
• $x \in K$, $\lambda \ge 0 \Rightarrow \lambda x \in K$.

It is immediately seen that a cone is a convex set. We call a cone $K$ regular if it is closed, pointed (that is, does not contain lines passing through the origin, or, equivalently, $K \cap [-K] = \{0\}$) and possesses a nonempty interior. Given a cone $K \subset E$, we can associate with it its dual cone $K^*$ defined as

$$K^* = \{y \in E : \langle y, x\rangle \ge 0\ \forall x \in K\}\quad [\langle\cdot,\cdot\rangle \text{ is inner product on } E].$$

It is immediately seen that $K^*$ is a closed cone, and $K \subset (K^*)^*$. It is well known that

• if $K$ is a closed cone, it holds $K = (K^*)^*$;
• $K$ is a regular cone if and only if $K^*$ is so.

Examples of regular cones "useful in applications" are as follows:

1. Nonnegative orthants $\mathbf{R}^d_+ = \{x \in \mathbf{R}^d : x \ge 0\}$;
2. Lorentz cones $\mathbf{L}^d_+ = \{x \in \mathbf{R}^d : x_d \ge \sqrt{\sum_{i=1}^{d-1}x_i^2}\}$;
3. Semidefinite cones $\mathbf{S}^d_+$ comprised of positive semidefinite symmetric $d \times d$ matrices. Semidefinite cone $\mathbf{S}^d_+$ lives in the space $\mathbf{S}^d$ of symmetric matrices equipped with the Frobenius inner product

$$\langle A, B\rangle = \mathrm{Tr}(AB^T) = \mathrm{Tr}(AB) = \sum_{i,j=1}^d A_{ij}B_{ij},\quad A, B \in \mathbf{S}^d.$$

All cones listed so far are self-dual.
4. Let $\|\cdot\|$ be a norm on $\mathbf{R}^n$. The set $\{[x; t] \in \mathbf{R}^n \times \mathbf{R} : t \ge \|x\|\}$ is a regular cone, and the dual cone is $\{[y; \tau] : \|y\|_* \le \tau\}$, where

$$\|y\|_* = \max_x\{x^Ty : \|x\| \le 1\}$$

is the norm on $\mathbf{R}^n$ conjugate to $\|\cdot\|$.

An additional example of a regular cone useful for the sequel is the conic hull of a convex compact set defined as follows. Let $\mathcal{T}$ be a convex compact set with a nonempty interior in Euclidean space $E$. We can associate with $\mathcal{T}$ its closed conic hull

$$\mathbf{T} = \mathrm{cl}\underbrace{\left\{[t; \tau] \in E^+ = E \times \mathbf{R} : \tau > 0,\ t/\tau \in \mathcal{T}\right\}}_{K^o(\mathcal{T})}.$$

It is immediately seen that $\mathbf{T}$ is a regular cone, and that to get this cone, one should add to the convex set $K^o(\mathcal{T})$ the origin of $E^+$. It is also clear that one can "see $\mathcal{T}$ in $\mathbf{T}$": $\mathcal{T}$ is nothing but the cross-section of the cone $\mathbf{T}$ by the hyperplane $\tau = 1$ in $E^+ = \{[t; \tau]\}$:

$$\mathcal{T} = \{t \in E : [t; 1] \in \mathbf{T}\}.$$

It is easily seen that the cone $\mathbf{T}_*$ dual to $\mathbf{T}$ is given by

$$\mathbf{T}_* = \{[g; s] \in E^+ : s \ge \phi_{\mathcal{T}}(-g)\},$$

where

$$\phi_{\mathcal{T}}(g) = \max_{t \in \mathcal{T}}\langle g, t\rangle$$

is the support function of $\mathcal{T}$.

4.1.2 Conic problems and their duals

Given regular cones $K_i \subset E_i$, $1 \le i \le m$, consider an optimization problem of the form

$$\mathrm{Opt}(P) = \min_x\left\{\langle c, x\rangle : \begin{array}{l} A_ix - b_i \in K_i,\ i = 1, ..., m\\ Rx = r\end{array}\right\}, \qquad (P)$$

where $x \mapsto A_ix - b_i$ are affine mappings acting from some Euclidean space $E$ to the spaces $E_i$ where the cones $K_i$ live. A problem in this form is called a conic problem on the cones $K_1, ..., K_m$; the constraints $A_ix - b_i \in K_i$ on $x$ are called conic constraints. We call a conic problem (P) strictly feasible if it admits a strictly feasible solution $\bar x$, meaning that $\bar x$ satisfies the equality constraints and satisfies strictly the conic constraints, i.e., $A_i\bar x - b_i \in \mathrm{int}\,K_i$.

One can associate with conic problem (P) its dual, which also is a conic problem. The origin of the dual problem is the desire to obtain lower bounds on the optimal value Opt(P) of the primal problem (P) in a systematic way, by linear aggregation of constraints. Linear aggregation of constraints works as follows: let us equip every conic constraint $A_ix - b_i \in K_i$ with aggregation weight, called Lagrange multiplier, $y_i$ restricted to reside in the cone $K_i^*$ dual to $K_i$. Similarly, we equip the system $Rx = r$ of equality constraints in (P) with Lagrange multiplier $z$, a vector of the same dimension as $r$. Now let $x$ be a feasible solution to the conic problem, and let $y_i \in K_i^*$, $i \le m$, and $z$ be Lagrange multipliers. By the definition of the dual cone and due to $A_ix - b_i \in K_i$, $y_i \in K_i^*$ we have

$$\langle y_i, A_ix\rangle \ge \langle y_i, b_i\rangle,\ 1 \le i \le m,$$

and of course

$$z^TRx \ge r^Tz.$$

Summing up all resulting inequalities, we arrive at the scalar linear inequality

$$\Big\langle R^*z + \sum_i A_i^*y_i,\ x\Big\rangle \ge r^Tz + \sum_i\langle b_i, y_i\rangle \qquad (!)$$

where $A_i^*$ are the conjugates to $A_i$: $\langle y, A_ix\rangle_{E_i} \equiv \langle A_i^*y, x\rangle_E$, and $R^*$ is the conjugate of $R$. By its origin, (!) is a consequence of the system of constraints in (P) and as such is satisfied everywhere on the feasible domain of the problem. If we are lucky to get the objective of (P) as the linear function of $x$ in the left hand side of (!), that is, if

$$R^*z + \sum_i A_i^*y_i = c,$$

(!) imposes a lower bound on the objective of the primal conic problem (P) everywhere on the feasible domain of the primal problem, and the conic dual of (P) is the problem

$$\mathrm{Opt}(D) = \max_{y_i, z}\left\{r^Tz + \sum_i\langle b_i, y_i\rangle : \begin{array}{l} y_i \in K_i^*,\ 1 \le i \le m\\ R^*z + \sum_{i=1}^m A_i^*y_i = c\end{array}\right\} \qquad (D)$$

of maximizing this lower bound on Opt(P).

The relations between the primal and the dual conic problems are the subject of the standard Conic Duality Theorem as follows:

Theorem 4.1. [Conic Duality Theorem] Consider conic problem (P) (where all $K_i$ are regular cones) along with its dual problem (D). Then

1. Duality is symmetric: the dual problem (D) is conic, and the conic dual of (D) is (equivalent to) (P);
2. Weak duality: It always holds $\mathrm{Opt}(D) \le \mathrm{Opt}(P)$;
3. Strong duality: If one of the problems (P), (D) is strictly feasible and bounded,2 then the other problem in the pair is solvable, and the optimal values of the problems are equal to each other. In particular, if both (P) and (D) are strictly feasible, then both problems are solvable with equal optimal values.

2 For a minimization problem, boundedness means that the objective is bounded from below on the feasible set; for a maximization problem, that it is bounded from above on the feasible set.

Remark 4.2. While the Conic Duality Theorem in the form just presented meets all our subsequent needs, it makes sense to note that in fact the Strong Duality part of the theorem can be strengthened by replacing strict feasibility with "essential strict feasibility" defined as follows: a conic problem in the form of (P) (or, which is the same, form of (D)) is called essentially strictly feasible if it admits a feasible solution $\bar x$ which satisfies strictly the non-polyhedral conic constraints, that is, $A_i\bar x - b_i \in \mathrm{int}\,K_i$ for all $i$ for which the cone $K_i$ is not polyhedral, that is, is not given by a finite list of homogeneous linear inequality constraints.

The proof of the Conic Duality Theorem can be found in numerous sources, e.g., in [187, Section 7.1.3].

4.1.3 Schur Complement Lemma

The following simple fact is extremely useful:

Lemma 4.3. [Schur Complement Lemma] A symmetric block matrix

$$A = \begin{bmatrix} P & Q^T\\ Q & R\end{bmatrix}$$

with $R \succ 0$ is positive (semi)definite if and only if the matrix $P - Q^TR^{-1}Q$ is so.

Proof. With $u$, $v$ of the same sizes as $P$, $R$, we have

$$\min_v\,[u; v]^TA[u; v] = u^T[P - Q^TR^{-1}Q]u$$

(direct computation utilizing the fact that $R \succ 0$). It follows that the quadratic form associated with $A$ is nonnegative everywhere if and only if the quadratic form with the matrix $[P - Q^TR^{-1}Q]$ is nonnegative everywhere (since the latter quadratic form is obtained from the former one by partial minimization). ✷
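The lemma and the partial minimization behind its proof can be checked numerically. The sketch below is only an illustration on random data; the helper names (`assemble`, `schur_complement`, `best_v`, `min_eig`) are hypothetical, and the minimizer $v = -R^{-1}Qu$ is the standard closed form for minimizing a strictly convex quadratic in $v$.

```python
import numpy as np

def assemble(P, Q, R):
    """Block matrix A = [[P, Q^T], [Q, R]]."""
    return np.block([[P, Q.T], [Q, R]])

def schur_complement(P, Q, R):
    """P - Q^T R^{-1} Q, computed with a linear solve instead of an inverse."""
    return P - Q.T @ np.linalg.solve(R, Q)

def best_v(Q, R, u):
    """Minimizer over v of [u; v]^T A [u; v], namely v = -R^{-1} Q u."""
    return -np.linalg.solve(R, Q @ u)

def min_eig(S):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(S).min())

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 2))
G = rng.standard_normal((3, 3))
R = G @ G.T + np.eye(3)                        # R is positive definite
P = Q.T @ np.linalg.solve(R, Q) + np.eye(2)    # Schur complement equals I
print(min_eig(assemble(P, Q, R)), min_eig(schur_complement(P, Q, R)))
```

In semidefinite programming this equivalence is typically used in the opposite direction: a nonlinear constraint on $P - Q^TR^{-1}Q$ is rewritten as a linear matrix inequality on the block matrix.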

4.2 NEAR-OPTIMAL LINEAR ESTIMATION FROM GAUSSIAN OBSERVATIONS

4.2.1 Situation and goal

Given an $m \times n$ matrix $A$, a $\nu \times n$ matrix $B$, and an $m \times m$ matrix $\Gamma \succ 0$, consider the problem of estimating the linear image $Bx$ of an unknown signal $x$ known to belong to a given set $\mathcal{X} \subset \mathbf{R}^n$ via noisy observation

$$\omega = Ax + \xi,\quad \xi \sim \mathcal{N}(0, \Gamma), \qquad (4.5)$$

where $\xi$ is the observation noise. A candidate estimate in this case is a (Borel) function $\widehat x(\cdot) : \mathbf{R}^m \to \mathbf{R}^\nu$, and the performance of such an estimate in what follows will be quantified by the Euclidean risk $\mathrm{Risk}[\widehat x|\mathcal{X}]$ defined by (4.2).

4.2.1.1 Ellitopes

From now on we assume that $\mathcal{X} \subset \mathbf{R}^n$ is a set given by

$$\mathcal{X} = \left\{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^{\bar n}, t \in \mathcal{T}) : x = Py,\ y^TR_ky \le t_k,\ 1 \le k \le K\right\}, \qquad (4.6)$$

where

• $P$ is an $n \times \bar n$ matrix,
• $R_k \succeq 0$ are $\bar n \times \bar n$ matrices with $\sum_k R_k \succ 0$,
• $\mathcal{T}$ is a nonempty computationally tractable convex compact subset of $\mathbf{R}^K_+$ intersecting the interior of $\mathbf{R}^K_+$ and such that $\mathcal{T}$ is monotone, meaning that the relations $0 \le \tau \le t$ and $t \in \mathcal{T}$ imply that $\tau \in \mathcal{T}$.3 Note that under our assumptions $\mathrm{int}\,\mathcal{T} \ne \emptyset$.

In the sequel, we refer to a set of the form (4.6) with data $[P, \{R_k, 1 \le k \le K\}, \mathcal{T}]$ satisfying the assumptions just formulated as an ellitope, and to (4.6) as an ellitopic representation of $\mathcal{X}$. Here are instructive examples of ellitopes (in all these examples, $P$ is the identity mapping; in the sequel, we call ellitopes of this type basic):

• when $K = 1$, $\mathcal{T} = [0, 1]$, and $R_1 \succ 0$, $\mathcal{X}$ is the ellipsoid $\{x : x^TR_1x \le 1\}$;
• when $K \ge 1$ and $\mathcal{T} = \{t \in \mathbf{R}^K : 0 \le t_k \le 1,\ k \le K\}$, $\mathcal{X}$ is the intersection

$$\bigcap_{1 \le k \le K}\{x : x^TR_kx \le 1\}$$

of ellipsoids/elliptic cylinders centered at the origin. In particular, when $U$ is a $K \times n$ matrix of rank $n$ with rows $u_k^T$, $1 \le k \le K$, and $R_k = u_ku_k^T$, $\mathcal{X}$ is the symmetric w.r.t. the origin polytope $\{x : \|Ux\|_\infty \le 1\}$;
• when $U$, $u_k$ and $R_k$ are as in the latter example and $\mathcal{T} = \{t \in \mathbf{R}^K_+ : \sum_k t_k^{p/2} \le 1\}$ for some $p \ge 2$, we get $\mathcal{X} = \{x : \|Ux\|_p \le 1\}$.

It should be added that the family of ellitope-representable sets is quite rich: this family admits a "calculus," so that more ellitopes can be constructed by taking intersections, direct products, linear images (direct and inverse) or arithmetic sums of ellitopes given by the above examples. In fact, the property of being an ellitope is preserved by nearly all basic operations with sets preserving convexity and symmetry w.r.t. the origin (a regrettable exception is taking the convex hull of a finite union); see Section 4.6.

As another example of an ellitope instructive in the context of nonparametric statistics, consider the situation where our signals $x$ are discretizations of functions of continuous argument running through a compact $d$-dimensional domain $D$, and the functions $f$ we are interested in are those satisfying a Sobolev-type smoothness constraint, that is, an upper bound on the $L_p(D)$-norm of $\mathcal{L}f$, where $\mathcal{L}$ is a linear differential operator with constant coefficients. After discretization, this restriction can be modeled as $\|Lx\|_p \le 1$, with properly selected matrix $L$. As we already know from the above example, when $p \ge 2$, the set $\mathcal{X} = \{x : \|Lx\|_p \le 1\}$ is an ellitope, and as such is captured by our machinery. Note also that by the outlined calculus, imposing on the functions $f$ in question several Sobolev-type smoothness constraints with parameters $p \ge 2$ still results in a set of signals which is an ellitope.
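Membership in a basic ellitope ($P$ the identity) is easy to test: since $\mathcal{T}$ is monotone, $x$ belongs to the set if and only if the vector with entries $t_k = x^TR_kx$ itself belongs to $\mathcal{T}$. The sketch below illustrates this for the examples above; the names `in_basic_ellitope`, `T_box` and `T_pnorm` are hypothetical, and the small numerical tolerance is an implementation choice.

```python
import numpy as np

def in_basic_ellitope(x, Rs, in_T):
    """x is in the basic ellitope iff t_k = x^T R_k x lies in T
    (valid because T is monotone)."""
    t = np.array([x @ R @ x for R in Rs])
    return in_T(np.maximum(t, 0.0))      # clip tiny negative rounding errors

def T_box(t, tol=1e-12):
    """T = {t : 0 <= t_k <= 1}: intersection of ellipsoids/elliptic cylinders."""
    return bool(np.all(t <= 1.0 + tol))

def T_pnorm(p, tol=1e-12):
    """T = {t >= 0 : sum_k t_k^(p/2) <= 1}, giving {x : ||Ux||_p <= 1} when R_k = u_k u_k^T."""
    return lambda t: bool(np.sum(t ** (p / 2.0)) <= 1.0 + tol)

rng = np.random.default_rng(1)
U = rng.standard_normal((5, 3))
Rs = [np.outer(u, u) for u in U]         # R_k = u_k u_k^T
x = rng.standard_normal(3)
x = 0.9 * x / np.linalg.norm(U @ x, 4)   # now ||Ux||_4 = 0.9
print(in_basic_ellitope(x, Rs, T_pnorm(4)), in_basic_ellitope(x, Rs, T_box))
```

With $R_k = u_ku_k^T$ one has $t_k = (u_k^Tx)^2$, so $\sum_k t_k^{p/2} = \|Ux\|_p^p$, which is why the last example reproduces the $\|\cdot\|_p$-ball.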
3 The latter relation is "for free": given a nonempty convex compact set $\mathcal{T} \subset \mathbf{R}^K_+$, the right-hand side of (4.6) remains intact when passing from $\mathcal{T}$ to its "monotone hull" $\{\tau \in \mathbf{R}^K_+ : \exists t \in \mathcal{T} : \tau \le t\}$ which already is a monotone convex compact set.

4.2.1.2 Estimates and their risks

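In the outlined situation, a candidate estimate is a Borel function $\widehat x(\cdot) : \mathbf{R}^m \to \mathbf{R}^\nu$; given observation (4.5), we recover $w = Bx$ as $\widehat x(\omega)$. In the sequel, we quantify the quality of an estimate by its worst-case, over $x \in \mathcal{X}$, expected $\|\cdot\|_2^2$ recovery error

$$\mathrm{Risk}[\widehat x|\mathcal{X}] = \sup_{x \in \mathcal{X}}\left[\mathbf{E}_{\xi \sim \mathcal{N}(0,\Gamma)}\left\{\|\widehat x(Ax + \xi) - Bx\|_2^2\right\}\right]^{1/2},$$

and define the optimal, or the minimax, risk as

$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}] = \inf_{\widehat x(\cdot)}\mathrm{Risk}[\widehat x|\mathcal{X}], \qquad (4.7)$$

where inf is taken over all Borel candidate estimates.

4.2.1.3 Main goal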

The main goal of what follows is to demonstrate that an estimate linear in $\omega$,

$$\hat{x}_H(\omega)=H^T\omega\qquad(4.8)$$

with a properly selected efficiently computable matrix $H$, is near-optimal in terms of its risk. Our first observation is that when $\mathcal{X}$ is the ellitope (4.6), replacing matrices $A$ and $B$ with $AP$ and $BP$, respectively, we pass from the initial estimation problem of interest to the transformed problem, where the signal set is

$$\bar{\mathcal{X}}=\{y\in\mathbf{R}^{\bar n}:\exists t\in\mathcal{T}:\ y^TR_ky\le t_k,\ 1\le k\le K\},$$

and we want to recover $[BP]y$, $y\in\bar{\mathcal{X}}$, via observation $\omega=[AP]y+\xi$. It is obvious that the considered families of estimates (the family of all linear estimates and the family of all estimates), like the risks of the estimates, remain intact under this transformation; in particular,

$$\mathrm{Risk}[\hat{x}|\mathcal{X}]=\sup_{y\in\bar{\mathcal{X}}}\left[\mathbf{E}_\xi\left\{\|\hat{x}([AP]y+\xi)-[BP]y\|_2^2\right\}\right]^{1/2}.$$

Therefore, to save notation, from now on, unless explicitly stated otherwise, we assume that matrix $P$ is identity, so that $\mathcal{X}$ is the basic ellitope

$$\mathcal{X}=\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:\ x^TR_kx\le t_k,\ 1\le k\le K\}.\qquad(4.9)$$

We assume in the sequel that $B\ne 0$, since otherwise one has $Bx=0$ for all $x\in\mathcal{X}$, and the estimation problem is trivial.

4.2.2

Building a linear estimate

We start with building a “presumably good” linear estimate. Restricting ourselves to linear estimates (4.8), we may be interested in the estimate with the smallest risk, that is, the estimate associated with an $m\times\nu$ matrix $H$ which is an optimal solution to the optimization problem

$$\min_H\left\{R(H):=\mathrm{Risk}^2[\hat{x}_H|\mathcal{X}]\right\}.$$

We have

$$\begin{aligned}
R(H)&=\max_{x\in\mathcal{X}}\mathbf{E}_\xi\{\|H^T\omega-Bx\|_2^2\}=\mathbf{E}_\xi\{\|H^T\xi\|_2^2\}+\max_{x\in\mathcal{X}}\|H^TAx-Bx\|_2^2\\
&=\mathrm{Tr}(H^T\Gamma H)+\max_{x\in\mathcal{X}}x^T(H^TA-B)^T(H^TA-B)x.
\end{aligned}$$
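The decomposition of the squared error into a noise term and a bias term can be checked numerically at a fixed signal $x$. The sketch below is only an illustration: the function name `squared_risk_at`, the dimensions, and the random data are assumptions made here, not part of the text. It compares the closed form $\mathrm{Tr}(H^T\Gamma H)+\|(H^TA-B)x\|_2^2$ with a Monte Carlo average.

```python
import numpy as np

def squared_risk_at(H, A, B, Gamma, x):
    """Closed form of E ||H^T (A x + xi) - B x||_2^2 for xi ~ N(0, Gamma)."""
    variance = np.trace(H.T @ Gamma @ H)   # E ||H^T xi||_2^2
    bias = (H.T @ A - B) @ x               # deterministic part of the error
    return float(variance + bias @ bias)

# Hypothetical small instance: m observations, n-dimensional signal, nu-dimensional target.
rng = np.random.default_rng(0)
m, n, nu = 5, 4, 3
A = rng.standard_normal((m, n))
B = rng.standard_normal((nu, n))
H = rng.standard_normal((m, nu))
G = rng.standard_normal((m, m))
Gamma = G @ G.T                            # a positive semidefinite noise covariance
x = rng.standard_normal(n)

closed_form = squared_risk_at(H, A, B, Gamma, x)

# Monte Carlo check: rows of `errors` are H^T (A x + xi) - B x for sampled xi.
xi = rng.multivariate_normal(np.zeros(m), Gamma, size=200_000)
errors = (A @ x + xi) @ H - B @ x
monte_carlo = float(np.mean(np.sum(errors ** 2, axis=1)))
print(closed_form, monte_carlo)
```

The worst case of the bias term over $x\in\mathcal{X}$ is what makes $R(H)$ hard to evaluate; the sketch only covers a single $x$.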

This function, while convex, can be hard to compute. For this reason, we use a linear estimate yielded by minimizing an efficiently computable convex upper bound on $R(H)$ which is built as follows. Let $\phi_{\mathcal{T}}$ be the support function of $\mathcal{T}$:

$$\phi_{\mathcal{T}}(\lambda)=\max_{t\in\mathcal{T}}\lambda^Tt:\ \mathbf{R}^K\to\mathbf{R}.$$

Observe that whenever $\lambda\in\mathbf{R}^K_+$ and $H$ are such that

$$[B-H^TA]^T[B-H^TA]\preceq\sum_k\lambda_kR_k,\qquad(4.10)$$

for $x\in\mathcal{X}$ it holds

$$\|Bx-H^TAx\|_2^2\le\phi_{\mathcal{T}}(\lambda).\qquad(4.11)$$

Indeed, in the case of (4.10) and with $x\in\mathcal{X}$, there exists $t\in\mathcal{T}$ such that $x^TR_kx\le t_k$ for all $k$, and consequently the vector $\bar t$ with the entries $\bar t_k=x^TR_kx$ also belongs to $\mathcal{T}$ (recall that $\mathcal{T}$ is monotone), whence

$$\|Bx-H^TAx\|_2^2=x^T[B-H^TA]^T[B-H^TA]x\le\sum_k\lambda_kx^TR_kx=\lambda^T\bar t\le\phi_{\mathcal{T}}(\lambda),$$

which combines with (4.9) to imply (4.11). From (4.11) it follows that if $H$ and $\lambda\ge 0$ are linked by (4.10), then

$$\begin{aligned}
\mathrm{Risk}^2[\hat{x}_H|\mathcal{X}]&=\max_{x\in\mathcal{X}}\mathbf{E}\left\{\|Bx-H^T(Ax+\xi)\|_2^2\right\}\\
&=\mathrm{Tr}(H^T\Gamma H)+\max_{x\in\mathcal{X}}\|[B-H^TA]x\|_2^2\\
&\le\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal{T}}(\lambda).
\end{aligned}$$

We see that the efficiently computable convex function

$$\widehat{R}(H)=\inf_\lambda\left\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal{T}}(\lambda):\ (B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k,\ \lambda\ge 0\right\}$$

(which clearly is well defined due to compactness of $\mathcal{T}$ combined with $\sum_kR_k\succ 0$) is an upper bound on $R(H)$.$^4$ Note that by the Schur Complement Lemma the matrix inequality $(B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k$ is equivalent to the matrix inequality

$$\begin{bmatrix}\sum_k\lambda_kR_k & B^T-A^TH\\ B-H^TA & I_\nu\end{bmatrix}\succeq 0$$

linear in $H$, $\lambda$. We have arrived at the following result:

Proposition 4.4. In the situation of this section, the risk of the “presumably good” linear estimate $\hat{x}_{H_*}(\omega)=H_*^T\omega$ yielded by an optimal solution $(H_*,\lambda_*)$ to the (clearly solvable) convex optimization problem

$$\begin{aligned}
\mathrm{Opt}&=\min_{H,\lambda}\left\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal{T}}(\lambda):\ (B-H^TA)^T(B-H^TA)\preceq\sum_k\lambda_kR_k,\ \lambda\ge 0\right\}\\
&=\min_{H,\lambda}\left\{\mathrm{Tr}(H^T\Gamma H)+\phi_{\mathcal{T}}(\lambda):\ \begin{bmatrix}\sum_k\lambda_kR_k & B^T-A^TH\\ B-H^TA & I_\nu\end{bmatrix}\succeq 0,\ \lambda\ge 0\right\}
\end{aligned}\qquad(4.12)$$

is upper-bounded by $\sqrt{\mathrm{Opt}}$.

4 It is well known that when $K=1$ (i.e., $\mathcal{X}$ is an ellipsoid), the above bounding scheme is exact: $\widehat{R}(\cdot)\equiv R(\cdot)$. For more complicated $\mathcal{X}$'s, $\widehat{R}(\cdot)$ could be larger than $R(\cdot)$, although the ratio $\widehat{R}(\cdot)/R(\cdot)$ is bounded by $O(\log(K))$; see Section 4.2.3.

4.2.2.1

Illustration: Recovering temperature distribution

Situation: A square steel plate was somewhat heated at time 0 and left to cool, the temperature along the perimeter of the plate being kept at zero at all times. At time $t_1$, we measure the temperatures at $m$ points of the plate, and want to recover the distribution of the temperature along the plate at a given time $t_0$, $0<t_0<t_1$. Physics, after suitable discretization of spatial variables, offers the following model of the situation. We represent the distribution of temperature at time $t$ as a $(2N-1)\times(2N-1)$ matrix $U(t)=[u_{ij}(t)]_{i,j=1}^{2N-1}$, where $u_{ij}(t)$ is the temperature, at time $t$, at the point

$$P_{ij}=(p_i,p_j),\quad p_k=k/N-1,\quad 1\le i,j\le 2N-1$$

of the plate (in our model, this plate occupies the square $S=\{(p,q):|p|\le 1,|q|\le 1\}$). Here positive integer $N$ is responsible for spatial discretization. For $1\le k\le 2N-1$, let us specify functions $\phi_k(s)$ on the segment $-1\le s\le 1$ as follows:

$$\phi_{2\ell-1}(s)=c_{2\ell-1}\cos(\omega_{2\ell-1}s),\quad\phi_{2\ell}(s)=c_{2\ell}\sin(\omega_{2\ell}s),\quad\omega_{2\ell-1}=(\ell-1/2)\pi,\quad\omega_{2\ell}=\ell\pi,$$

where $c_k$ are readily given by the normalization condition $\sum_{i=1}^{2N-1}\phi_k^2(p_i)=1$; note that $\phi_k(\pm 1)=0$. It is immediately seen that the matrices

$$\Phi^{k\ell}=[\phi_k(p_i)\phi_\ell(p_j)]_{i,j=1}^{2N-1},\quad 1\le k,\ell\le 2N-1$$

form an orthonormal basis in the space of $(2N-1)\times(2N-1)$ matrices, so that we can write

$$U(t)=\sum_{k,\ell\le 2N-1}x_{k\ell}(t)\Phi^{k\ell}.$$

The advantage of representing temperature fields in the basis $\{\Phi^{k\ell}\}_{k,\ell\le 2N-1}$ stems from the fact that in this basis the heat equation governing evolution of the temperature distribution in time becomes extremely simple, just

$$\frac{d}{dt}x_{k\ell}(t)=-(\omega_k^2+\omega_\ell^2)x_{k\ell}(t)\ \Rightarrow\ x_{k\ell}(t)=\exp\{-(\omega_k^2+\omega_\ell^2)t\}x_{k\ell}(0).^5$$

Now we can convert the situation into the one considered in our general estimation scheme, namely, as follows:

• We select some discretization parameter $N$ and treat $x=\{x_{k\ell}(0),\ 1\le k,\ell\le 2N-1\}$ as the signal underlying our observations. In every potential application, we can safely upper-bound the magnitudes of the initial temperatures and thus the magnitude of $x$, say, by a constraint of the form

$$\sum_{k,\ell}x_{k\ell}^2(0)\le R^2$$

with properly selected $R$, which allows us to specify the domain $\mathcal{X}$ of the signal as the Euclidean ball:

$$\mathcal{X}=\{x\in\mathbf{R}^{(2N-1)\times(2N-1)}:\|x\|_2^2\le R^2\}.\qquad(4.13)$$

• Let the measurements of the temperature at time $t_1$ be taken along the points $P_{i(\nu),j(\nu)}$, $1\le\nu\le m$,$^6$ and let them be affected by $\mathcal{N}(0,\sigma^2I_m)$ noise, so that our observation is

$$\omega=A(x)+\xi,\quad\xi\sim\mathcal{N}(0,\sigma^2I_m).$$

Here $x\mapsto A(x)$ is the linear mapping from $\mathbf{R}^{(2N-1)\times(2N-1)}$ into $\mathbf{R}^m$ given by

$$[A(x)]_\nu=\sum_{k,\ell=1}^{2N-1}e^{-(\omega_k^2+\omega_\ell^2)t_1}\phi_k(p_{i(\nu)})\phi_\ell(p_{j(\nu)})x_{k\ell}(0).\qquad(4.14)$$

• We want to recover the temperatures at time $t_0$ taken along some grid, say, the square $(2K-1)\times(2K-1)$ grid $\{Q_{ij}=(r_i,r_j),\ 1\le i,j\le 2K-1\}$, where $r_i=i/K-1$, $1\le i\le 2K-1$. In other words, we want to recover $B(x)$, where the linear mapping $x\mapsto B(x)$ from $\mathbf{R}^{(2N-1)\times(2N-1)}$ into $\mathbf{R}^{(2K-1)\times(2K-1)}$ is given by

$$[B(x)]_{ij}=\sum_{k,\ell=1}^{2N-1}e^{-(\omega_k^2+\omega_\ell^2)t_0}\phi_k(r_i)\phi_\ell(r_j)x_{k\ell}(0).$$
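The construction above can be sketched in a few lines of NumPy. The helper names `heat_basis` and `observation_matrix`, the small value of $N$, and the randomly drawn measurement points are assumptions made for illustration only; they are not taken from the experiment reported in the text. The sketch builds the sampled functions $\phi_k$, normalizes them on the grid, and assembles the matrix of the mapping $x\mapsto A(x)$ from (4.14).

```python
import numpy as np

def heat_basis(N):
    """Rows k = 1..2N-1 hold phi_k sampled at p_i = i/N - 1, i = 1..2N-1."""
    p = np.arange(1, 2 * N) / N - 1.0
    k = np.arange(1, 2 * N)
    omega = k * np.pi / 2.0            # omega_{2l-1} = (l - 1/2) pi, omega_{2l} = l pi
    odd = (k % 2 == 1)[:, None]        # odd k -> cosine, even k -> sine
    arg = omega[:, None] * p[None, :]
    Phi = np.where(odd, np.cos(arg), np.sin(arg))
    Phi /= np.linalg.norm(Phi, axis=1, keepdims=True)   # c_k: sum_i phi_k(p_i)^2 = 1
    return Phi, omega, p

def observation_matrix(Phi, omega, t1, rows, cols):
    """Matrix of x -> A(x) in (4.14); x is flattened row-major over (k, l)."""
    decay = np.exp(-(omega[:, None] ** 2 + omega[None, :] ** 2) * t1)
    # entry [nu, k, l] = decay[k, l] * phi_k(p_{i(nu)}) * phi_l(p_{j(nu)})
    A = decay[None, :, :] * Phi[:, rows].T[:, :, None] * Phi[:, cols].T[:, None, :]
    return A.reshape(len(rows), -1)

# Hypothetical small instance (grid indices are 0-based here).
N, t1, m = 4, 0.03, 6
Phi, omega, p = heat_basis(N)
rng = np.random.default_rng(0)
rows = rng.integers(0, 2 * N - 1, size=m)
cols = rng.integers(0, 2 * N - 1, size=m)
A = observation_matrix(Phi, omega, t1, rows, cols)

# Cross-check against the temperature field U(t1) = sum_{k,l} x_kl(t1) Phi^{kl}.
X0 = rng.standard_normal((2 * N - 1, 2 * N - 1))
decay = np.exp(-(omega[:, None] ** 2 + omega[None, :] ** 2) * t1)
U_t1 = Phi.T @ (decay * X0) @ Phi
print(np.allclose(Phi @ Phi.T, np.eye(2 * N - 1)))    # orthonormality of the phi_k on the grid
print(np.allclose(A @ X0.ravel(), U_t1[rows, cols]))  # A(x) reads U(t1) at the chosen points
```

The matrix of $x\mapsto B(x)$ can be assembled in the same way, with $t_0$ in place of $t_1$ and the functions $\phi_k$ sampled at the points $r_i$ instead of $p_i$.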

5 The explanation is simple: the functions $\phi_{k\ell}(p,q)=\phi_k(p)\phi_\ell(q)$, $k,\ell=1,2,...$, form an orthogonal basis in $L_2(S)$ and vanish on the boundary of $S$, and the heat equation

$$\frac{\partial}{\partial t}u(t;p,q)=\left[\frac{\partial^2}{\partial p^2}+\frac{\partial^2}{\partial q^2}\right]u(t;p,q)$$

governing evolution of the temperature field $u(t;p,q)$, $(p,q)\in S$, with time $t$, in terms of the coefficients $x_{k\ell}(t)$ of the temperature field in the orthogonal basis $\{\phi_{k\ell}(p,q)\}_{k,\ell}$ becomes

$$\frac{d}{dt}x_{k\ell}(t)=-(\omega_k^2+\omega_\ell^2)x_{k\ell}(t).$$

In our discretization, we truncate the expansion of $u(t;p,q)$, keeping only the terms with $k,\ell\le 2N-1$, and restrict the spatial variables to reside in the grid $\{P_{ij},\ 1\le i,j\le 2N-1\}$.

6 The construction can be easily extended to allow for measurement points outside of the grid $\{P_{ij}\}$.


Ill-posedness. Our problem is a typical example of an ill-posed inverse problem, where one wants to recover a past state of a dynamical system converging exponentially fast to equilibrium and thus “forgetting rapidly” its past. More specifically, in our situation ill-posedness stems from the fact that, as is clearly seen from (4.14), contributions of “high frequency” (i.e., with large $\omega_k^2+\omega_\ell^2$) components $x_{k\ell}(0)$ of the signal to $A(x)$ decrease exponentially fast, with high decay rate, as $t_1$ grows. As a result, high frequency components $x_{k\ell}(0)$ are impossible to recover from noisy observations of $A(x)$, unless the corresponding time instant $t_1$ is very small. As a kind of compensation, contributions of high frequency components $x_{k\ell}(0)$ to $B(x)$ are also very small, provided that $t_0$ is not too small, implying that there is no necessity to recover high frequency components well, unless they are huge. Our linear estimate, roughly speaking, seeks the best trade-off between these two opposite phenomena, utilizing (4.13) as the source of upper bounds on the magnitudes of high frequency components of the signal.

Numerical results. In the experiment to be reported, we used $N=32$, $m=100$, $K=6$, $t_0=0.01$, $t_1=0.03$ (i.e., temperature is measured at time 0.03 at 100 points selected at random on a $63\times 63$ square grid, and we want to recover the temperatures at time 0.01 along an $11\times 11$ square grid). We used $R=15$, that is,

$$\mathcal{X}=\Big\{[x_{k\ell}]_{k,\ell=1}^{63}:\ \sum_{k,\ell}x_{k\ell}^2\le 225\Big\},$$

and $\sigma=0.001$. Under the circumstances, the risk of the best linear estimate turns out to be 0.3968. Figure 4.1 shows a sample temperature distribution $B(x)=U_*(t_0)$ at time $t_0$ resulting from a randomly selected signal $x\in\mathcal{X}$ along with the recovery $\widehat{U}(t_0)$ of $U_*$ by the optimal linear estimate and the naive “least squares” recovery $\widetilde{U}(t_0)$ of $U_*$. The latter is defined as $B(x_*)$, where $x_*$ is the least squares recovery of the signal underlying observation $\omega$:

$$x_*=x_*(\omega):=\mathop{\mathrm{argmin}}_x\|A(x)-\omega\|_2.$$

Notice the dramatic difference in performance between the “naive least squares” and the optimal linear estimate.

4.2.2.2

Near-optimality of $\hat{x}_{H_*}$

Proposition 4.5. The efficiently computable linear estimate $\hat{x}_{H_*}(\omega)=H_*^T\omega$ yielded by an optimal solution to the optimization problem (4.12) is nearly optimal in terms of its risk:

$$\mathrm{Risk}[\hat{x}_{H_*}|\mathcal{X}]\le\sqrt{\mathrm{Opt}}\le 64\sqrt{45\ln 2\,(\ln K+5\ln 2)}\ \mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}],\qquad(4.15)$$

where the minimax optimal risk $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ is given by (4.7).

For proof, see Section 4.8.5. Note that the “nonoptimality factor” in (4.15) depends logarithmically on $K$ and is completely independent of what $A$, $B$, $\Gamma$ are and of the “details” $R_k$, $\mathcal{T}$ (see (4.9)) specifying ellitope $\mathcal{X}$.
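To get a feeling for the size of the factor in (4.15), one can tabulate it for several values of $K$. The sketch below assumes that the factor reads $64\sqrt{45\ln 2\,(\ln K+5\ln 2)}$; the function name `nonoptimality_factor` is a label chosen here for illustration.

```python
import math

def nonoptimality_factor(K):
    """Factor multiplying the minimax risk in (4.15), under the reading stated above."""
    return 64.0 * math.sqrt(45.0 * math.log(2.0) * (math.log(K) + 5.0 * math.log(2.0)))

# The factor grows only like sqrt(ln K), but its absolute size is large.
factors = {K: nonoptimality_factor(K) for K in (1, 10, 100, 10**4, 10**6)}
for K, value in factors.items():
    print(f"K = {K:>7d}: factor = {value:.1f}")
```

Under this reading the factor is in the hundreds already for $K=1$, while going from $K=1$ to $K=10^6$ increases it by less than a factor of three; this is the sense in which the bound is logarithmic in $K$ yet numerically conservative.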

Figure 4.1: True distribution of temperature $U_*=B(x)$ at time $t_0=0.01$ (left) along with its recovery $\widehat{U}$ via the optimal linear estimate (center) and the “naive” recovery $\widetilde{U}$ (right). Panel data: $\|U_*\|_2=2.01$, $\|U_*\|_\infty=0.347$; $\|\widehat{U}-U_*\|_2=0.318$, $\|\widehat{U}-U_*\|_\infty=0.078$; $\|\widetilde{U}-U_*\|_2=44.82$, $\|\widetilde{U}-U_*\|_\infty=12.47$.

4.2.2.3

Relaxing the symmetry requirement

Sets $\mathcal{X}$ of the form (4.6), which we called ellitopes, are symmetric w.r.t. the origin convex compact sets of special structure. This structure is rather flexible, but the symmetry is “built in.” We are about to demonstrate that, to some extent, the symmetry requirement can be relaxed. Specifically, assume instead of (4.6) that the convex compact set $\mathcal{X}$ known to contain the signals $x$ underlying observations (4.5) can be “sandwiched” by two ellitopes known to us and similar to each other, with coefficient $\alpha\ge 1$:

$$\underbrace{\left\{x\in\mathbf{R}^n:\exists(y\in\mathbf{R}^{\bar n},t\in\mathcal{T}):\ x=Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\right\}}_{\underline{\mathcal{X}}}\subset\mathcal{X}\subset\alpha\underline{\mathcal{X}},$$

with $R_k$ and $\mathcal{T}$ possessing the properties postulated in Section 4.2.1.1. Let Opt and $H_*$ be the optimal value and optimal solution of the optimization problem (4.12) associated with the data $R_1,...,R_K$, $\mathcal{T}$ and matrices $\bar A=AP$, $\bar B=BP$ in the role of $A$, $B$, respectively. It is immediately seen that the risk $\mathrm{Risk}[\hat{x}_{H_*}|\mathcal{X}]$ of the linear estimate $\hat{x}_{H_*}(\omega)$ is at most $\alpha\sqrt{\mathrm{Opt}}$. On the other hand, we have $\mathrm{Risk}_{\mathrm{opt}}[\underline{\mathcal{X}}]\le\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$, and by Proposition 4.5 also $\sqrt{\mathrm{Opt}}\le O(1)\sqrt{\ln(2K)}\,\mathrm{Risk}_{\mathrm{opt}}[\underline{\mathcal{X}}]$. Taken together, these relations imply that

$$\mathrm{Risk}[\hat{x}_{H_*}|\mathcal{X}]\le O(1)\alpha\sqrt{\ln(2K)}\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}].\qquad(4.16)$$

In other words, as far as the “level of nonoptimality” of efficiently computable linear estimates is concerned, signal sets $\mathcal{X}$ which can be approximated by ellitopes within a factor $\alpha$ of order of 1 are nearly as good as the ellitopes. To give an example: it is known that whenever the intersection $\mathcal{X}$ of $K$ elliptic cylinders $\{x:(x-c_k)^TR_k(x-c_k)\le 1\}$, $R_k\succeq 0$, concentric or not, is bounded and has a nonempty interior, $\mathcal{X}$ can be approximated by an ellipsoid within the factor $\alpha=K+2\sqrt{K}$.$^7$ Assuming w.l.o.g. that the approximating ellipsoid is centered at the origin, the level of nonoptimality of a linear estimate is bounded by (4.16) with $O(1)K$ in the role of $\alpha$.

4.2.2.4

Comments

Note that bound (4.16) rapidly deteriorates when $\alpha$ grows, and this phenomenon to some extent “reflects the reality.” For example, a perfect simplex $\mathcal{X}$ inscribed into the unit sphere in $\mathbf{R}^n$ is in-between two Euclidean balls centered at the origin with the ratio of radii equal to $n$ (i.e., $\alpha=n$). It is immediately seen that with $A=B=I$, $\Gamma=\sigma^2I$, in the range $\sigma\le n\sigma^2\le 1$ of values of $n$ and $\sigma$, we have

$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]\approx\sqrt{\sigma},\qquad\mathrm{Risk}_{\mathrm{opt}}[\hat{x}_{H_*}|\mathcal{X}]=O(1)\sqrt{n}\sigma,$$

with $\approx$ meaning “up to logarithmic in $n/\sigma$ factor.” In other words, for large $n\sigma$ linear estimates indeed are significantly (albeit not to the full extent of (4.16)) outperformed by nonlinear ones.

Another situation “bad for linear estimates” suggested by (4.15) is the one where the description (4.6) of $\mathcal{X}$, albeit possible, requires a very large value of $K$. Here again (4.15) reflects to some extent the reality: when $\mathcal{X}$ is the unit $\|\cdot\|_1$ ball in $\mathbf{R}^n$, (4.6) takes place with $K=2^{n-1}$; consequently, the factor at $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ in the right-hand side of (4.15) becomes at least $\sqrt{n}$. On the other hand, with $A=B=I$, $\Gamma=\sigma^2I$, in the range $\sigma\le n\sigma^2\le 1$ of values of $n$, $\sigma$, the risks $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$, $\mathrm{Risk}_{\mathrm{opt}}[\hat{x}_{H_*}|\mathcal{X}]$ are basically the same as in the case of $\mathcal{X}$ being the perfect simplex inscribed into the unit sphere in $\mathbf{R}^n$, and linear estimates indeed are “heavily nonoptimal” when $n\sigma$ is large.

4.2.2.5

How near is “near-optimal”: Numerical illustration

The “nonoptimality factor” $\theta$ in the upper bound $\sqrt{\mathrm{Opt}}\le\theta\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ from Proposition 4.5, while logarithmic, seems to be unpleasantly large. On closer inspection, one can get numerically less conservative bounds on nonoptimality factors. Here are some illustrations. In the six experiments to be reported, we used $n=m=\nu=32$ and $\Gamma=\sigma^2I_m$. In the first triple of experiments, $\mathcal{X}$ was the ellipsoid

$$\mathcal{X}=\Big\{x\in\mathbf{R}^{32}:\ \sum_{j=1}^{32}j^2x_j^2\le 1\Big\},$$

that is, $P$ was the identity, $K=1$, $R_1=\sum_{j=1}^{32}j^2e_je_j^T$ ($e_j$ are basic orths), and $\mathcal{T}=[0,1]$. In the second triple of experiments, $\mathcal{X}$ was the box circumscribed around the above ellipsoid:

$$\mathcal{X}=\{x\in\mathbf{R}^{32}:\ j|x_j|\le 1,\ 1\le j\le 32\}\qquad\left[P=I,\ K=32,\ R_k=k^2e_ke_k^T,\ k\le K,\ \mathcal{T}=[0,1]^K\right].$$

7 Namely, setting $F(x)=-\sum_{k=1}^K\ln\left(1-(x-c_k)^TR_k(x-c_k)\right):\ \mathrm{int}\,\mathcal{X}\to\mathbf{R}$ and denoting by $\bar x$ the analytic center $\mathop{\mathrm{argmin}}_{x\in\mathrm{int}\,\mathcal{X}}F(x)$, one has

$$\{x:(x-\bar x)^TF''(\bar x)(x-\bar x)\le 1\}\subset\mathcal{X}\subset\{x:(x-\bar x)^TF''(\bar x)(x-\bar x)\le[K+2\sqrt{K}]^2\}.$$

X           σ        √Opt    LwB     √Opt/LwB
ellipsoid   0.0100   0.288   0.153   1.88
ellipsoid   0.0010   0.103   0.060   1.71
ellipsoid   0.0001   0.019   0.018   1.06
box         0.0100   0.698   0.231   3.02
box         0.0010   0.163   0.082   2.00
box         0.0001   0.021   0.020   1.06

Table 4.1: Performance of linear estimates (4.8), (4.12), m = n = 32, B = I.

In these experiments, $B$ was the identity matrix, and $A$ was a randomly rotated matrix common for all experiments, with singular values $\lambda_j$, $1\le j\le 32$, forming a geometric progression, with $\lambda_1=1$ and $\lambda_{32}=0.01$. Experiments in a triple differed by the values of $\sigma$ (0.01, 0.001, 0.0001). The results of the experiments are presented in Table 4.1, where, as above, $\sqrt{\mathrm{Opt}}$ is the upper bound given by (4.12) on the risk $\mathrm{Risk}[\hat{x}_{H_*}|\mathcal{X}]$ of recovering $Bx=x$, $x\in\mathcal{X}$, by the linear estimate yielded by (4.8) and (4.12), and LwB is the lower bound on $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ computed via the techniques outlined in Exercise 4.22 (we skip the details). Whatever might be your attitude to the “reality” as reflected by the data in Table 4.1, this reality is much better than the theoretical upper bound on $\theta$ appearing in (4.15).

4.2.3

Byproduct on semidefinite relaxation

We are about to present a byproduct, important in its own right, of the reasoning underlying Proposition 4.5. This byproduct is not directly related to Statistics; it relates to the quality of the standard semidefinite relaxation. Specifically, given a quadratic form $x^TCx$ and an ellitope $\mathcal{X}$ represented by (4.6), consider the problem

$$\mathrm{Opt}_*=\max_{x\in\mathcal{X}}x^TCx=\max_y\left\{y^TP^TCPy:\ \exists t\in\mathcal{T}:\ y^TR_ky\le t_k,\ k\le K\right\}.\qquad(4.17)$$

This problem can be NP-hard (this is already so when $\mathcal{X}$ is the unit box and $C$ a general-type positive semidefinite matrix); however, $\mathrm{Opt}_*$ admits an efficiently computable upper bound given by semidefinite relaxation as follows: whenever $\lambda\ge 0$ is such that

$$P^TCP\preceq\sum_{k=1}^K\lambda_kR_k,$$

for $y\in\bar{\mathcal{X}}:=\{y:\exists t\in\mathcal{T}:\ y^TR_ky\le t_k,\ k\le K\}$ we clearly have

$$[Py]^TCPy\le\sum_k\lambda_ky^TR_ky\le\phi_{\mathcal{T}}(\lambda),$$

where the last $\le$ is due to the fact that the vector with the entries $y^TR_ky$, $1\le k\le K$, belongs to $\mathcal{T}$. As a result, the efficiently computable quantity

$$\mathrm{Opt}=\min_\lambda\Big\{\phi_{\mathcal{T}}(\lambda):\ \lambda\ge 0,\ P^TCP\preceq\sum_k\lambda_kR_k\Big\}\qquad(4.18)$$

is an upper bound on $\mathrm{Opt}_*$. We have the following

Proposition 4.6. Let $C$ be a symmetric $n\times n$ matrix and $\mathcal{X}$ be given by ellitopic representation (4.6), and let $\mathrm{Opt}_*$ and Opt be given by (4.17) and (4.18). Then

$$\frac{\mathrm{Opt}}{3\ln(\sqrt{3}K)}\le\mathrm{Opt}_*\le\mathrm{Opt}.\qquad(4.19)$$

For proof, see Section 4.8.2.
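The upper-bounding mechanism behind (4.18) can be illustrated without an SDP solver. In the sketch below, $\mathcal{X}$ is the unit box (so that $P=I$, $R_k=e_ke_k^T$, $\mathcal{T}=[0,1]^K$ and $\phi_{\mathcal{T}}(\lambda)=\sum_k\lambda_k$ for $\lambda\ge 0$), and $C$ is positive semidefinite, so the maximum of the convex form $x^TCx$ over the box is attained at a vertex and can be found by enumeration in small dimension. The vector of absolute row sums is used as one feasible $\lambda$; this is a convenient choice made here for illustration, generally not the minimizer in (4.18), and the function names are hypothetical.

```python
import itertools
import numpy as np

def box_maximum(C):
    """Opt_* over the unit box by vertex enumeration (valid for PSD C, small n only)."""
    n = C.shape[0]
    return max(float(np.array(v) @ C @ np.array(v))
               for v in itertools.product((-1.0, 1.0), repeat=n))

def row_sum_certificate(C):
    """A feasible lambda: Diag(lambda) - C is diagonally dominant, hence PSD."""
    return np.abs(C).sum(axis=1)

rng = np.random.default_rng(1)
G = rng.standard_normal((6, 6))
C = G @ G.T                                  # a PSD matrix of the quadratic form

lam = row_sum_certificate(C)
slack = np.diag(lam) - C                     # must be PSD for lam to be feasible
upper_bound = float(lam.sum())               # phi_T(lam) for T = [0, 1]^K
opt_star = box_maximum(C)
print(opt_star, upper_bound, np.linalg.eigvalsh(slack).min())
```

Any feasible $\lambda$ certifies an upper bound on $\mathrm{Opt}_*$; the quantity Opt in (4.18) is the best such certificate, and Proposition 4.6 quantifies how far it can be from $\mathrm{Opt}_*$.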

4.3

FROM ELLITOPES TO SPECTRATOPES

So far, the domains of signals we dealt with were ellitopes. In this section we demonstrate that basically all our constructions and results can be extended onto a much wider family of signal domains, namely, spectratopes.

4.3.1

Spectratopes: Definition and examples

We call a set $\mathcal{X}\subset\mathbf{R}^n$ a basic spectratope if it admits a simple spectratopic representation, that is, a representation of the form

$$\mathcal{X}=\left\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:\ R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K\right\}\qquad(4.20)$$

where

S.1. $R_k[x]=\sum_{i=1}^nx_iR^{ki}$ are symmetric $d_k\times d_k$ matrices linearly depending on $x\in\mathbf{R}^n$ (i.e., “matrix coefficients” $R^{ki}$ belong to $\mathbf{S}^{d_k}$).

S.2. $\mathcal{T}\subset\mathbf{R}^K_+$ is the set with the same properties as in the definition of an ellitope, that is, $\mathcal{T}$ is a convex compact subset of $\mathbf{R}^K_+$ which contains a positive vector and is monotone: $0\le t'\le t\in\mathcal{T}\Rightarrow t'\in\mathcal{T}$.

S.3. Whenever $x\ne 0$, it holds $R_k[x]\ne 0$ for at least one $k\le K$.

An immediate observation is as follows:

Remark 4.7. By the Schur Complement Lemma, the set (4.20) given by data satisfying S.1–2 can be represented as

$$\mathcal{X}=\left\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:\ \begin{bmatrix}t_kI_{d_k} & R_k[x]\\ R_k[x] & I_{d_k}\end{bmatrix}\succeq 0,\ k\le K\right\}.$$

By the latter representation, $\mathcal{X}$ is nonempty, closed, convex, symmetric w.r.t. the origin, and contains a neighbourhood of the origin. This set is bounded if and only if the data, in addition to S.1–2, satisfies S.3.

A spectratope $\mathcal{X}\subset\mathbf{R}^\nu$ is a set represented as a linear image of a basic spectratope:

$$\mathcal{X}=\{x\in\mathbf{R}^\nu:\exists(y\in\mathbf{R}^n,t\in\mathcal{T}):\ x=Py,\ R_k^2[y]\preceq t_kI_{d_k},\ 1\le k\le K\},\qquad(4.21)$$

where $P$ is a $\nu\times n$ matrix, and $R_k[\cdot]$, $\mathcal{T}$ are as in S.1–3. We associate with a basic spectratope (4.20), S.1–3, the following entities:

1. The size $D=\sum_{k=1}^Kd_k$;

2. Linear mappings

$$Q\mapsto\mathcal{R}_k[Q]=\sum_{i,j}Q_{ij}R^{ki}R^{kj}:\ \mathbf{S}^n\to\mathbf{S}^{d_k}.$$

As is immediately seen, we have

$$\mathcal{R}_k[xx^T]\equiv R_k^2[x],\qquad(4.22)$$

implying that $\mathcal{R}_k[Q]\succeq 0$ whenever $Q\succeq 0$, whence $\mathcal{R}_k[\cdot]$ is $\succeq$-monotone:

$$Q'\succeq Q\Rightarrow\mathcal{R}_k[Q']\succeq\mathcal{R}_k[Q].\qquad(4.23)$$

Besides this, we have

$$Q\succeq 0\Rightarrow\mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\{R_k^2[\xi]\}=\mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\{\mathcal{R}_k[\xi\xi^T]\}=\mathcal{R}_k[Q],\qquad(4.24)$$

where the first equality is given by (4.22).

3. Linear mappings $\Lambda_k\mapsto\mathcal{R}_k^*[\Lambda_k]:\ \mathbf{S}^{d_k}\to\mathbf{S}^n$ given by

$$[\mathcal{R}_k^*[\Lambda_k]]_{ij}=\tfrac{1}{2}\mathrm{Tr}(\Lambda_k[R^{ki}R^{kj}+R^{kj}R^{ki}]),\quad 1\le i,j\le n.\qquad(4.25)$$

It is immediately seen that $\mathcal{R}_k^*[\cdot]$ is the conjugate of $\mathcal{R}_k[\cdot]$:

$$\langle\Lambda_k,\mathcal{R}_k[Q]\rangle_F=\mathrm{Tr}(\Lambda_k\mathcal{R}_k[Q])=\mathrm{Tr}(\mathcal{R}_k^*[\Lambda_k]Q)=\langle\mathcal{R}_k^*[\Lambda_k],Q\rangle_F,\qquad(4.26)$$

where $\langle A,B\rangle_F=\mathrm{Tr}(AB)$ is the Frobenius inner product of symmetric matrices. Besides this, we have

$$\Lambda_k\succeq 0\Rightarrow\mathcal{R}_k^*[\Lambda_k]\succeq 0.\qquad(4.27)$$

Indeed, $\mathcal{R}_k^*[\Lambda_k]$ is linear in $\Lambda_k$, so that it suffices to verify (4.27) for dyadic matrices $\Lambda_k=ff^T$; for such a $\Lambda_k$, (4.25) reads

$$(\mathcal{R}_k^*[ff^T])_{ij}=[R^{ki}f]^T[R^{kj}f],$$

that is, $\mathcal{R}_k^*[ff^T]$ is a Gram matrix and as such is $\succeq 0$. Another way to arrive at (4.27) is to note that when $\Lambda_k\succeq 0$ and $Q=xx^T$, the first quantity in (4.26) is nonnegative by (4.22), and therefore (4.26) states that $x^T\mathcal{R}_k^*[\Lambda_k]x\ge 0$ for every $x$, implying $\mathcal{R}_k^*[\Lambda_k]\succeq 0$.

4. The linear space $\Lambda^K=\mathbf{S}^{d_1}\times...\times\mathbf{S}^{d_K}$ of all ordered collections $\Lambda=\{\Lambda_k\in\mathbf{S}^{d_k}\}_{k\le K}$ along with the linear mapping

$$\Lambda\mapsto\lambda[\Lambda]:=[\mathrm{Tr}(\Lambda_1);...;\mathrm{Tr}(\Lambda_K)]:\ \Lambda^K\to\mathbf{R}^K.$$
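The mappings from items 2 and 3 are easy to implement for a single index $k$, and relations (4.22), (4.26) and (4.27) can then be verified numerically. In the sketch below the matrix coefficients are stored as an array of shape `(n, d, d)`; the function names and the random test data are assumptions made for illustration.

```python
import numpy as np

def R_of_x(coeffs, x):
    """R[x] = sum_i x_i R^i, with coeffs[i] = R^i a symmetric d x d matrix."""
    return np.tensordot(x, coeffs, axes=1)

def R_cal(coeffs, Q):
    """Q -> sum_{i,j} Q_ij R^i R^j, mapping symmetric n x n into symmetric d x d."""
    return np.einsum("ij,iab,jbc->ac", Q, coeffs, coeffs)

def R_cal_star(coeffs, Lam):
    """Conjugate mapping (4.25): entry (i, j) is 0.5 * Tr(Lam [R^i R^j + R^j R^i])."""
    T = np.einsum("ab,ibc,jca->ij", Lam, coeffs, coeffs)   # T[i, j] = Tr(Lam R^i R^j)
    return 0.5 * (T + T.T)

rng = np.random.default_rng(2)
n, d = 4, 3
G = rng.standard_normal((n, d, d))
coeffs = 0.5 * (G + G.transpose(0, 2, 1))    # symmetric coefficients R^1, ..., R^n

x = rng.standard_normal(n)
S = rng.standard_normal((n, n))
Q = 0.5 * (S + S.T)                          # an arbitrary symmetric Q
F = rng.standard_normal((d, d))
Lam = F @ F.T                                # a positive semidefinite Lambda

Rx = R_of_x(coeffs, x)
print(np.allclose(R_cal(coeffs, np.outer(x, x)), Rx @ Rx))          # (4.22)
print(np.isclose(np.trace(Lam @ R_cal(coeffs, Q)),
                 np.trace(R_cal_star(coeffs, Lam) @ Q)))            # (4.26)
print(np.linalg.eigvalsh(R_cal_star(coeffs, Lam)).min() >= -1e-9)   # (4.27)
```

For a spectratope with $K>1$ one would keep one such coefficient array per $k$, with $\lambda[\Lambda]$ collecting the traces of the $\Lambda_k$.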


4.3.1.1

Examples of spectratopes

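Example: Ellitopes. Every ellitope

$$\mathcal{X}=\{x\in\mathbf{R}^\nu:\exists(y\in\mathbf{R}^n,t\in\mathcal{T}):\ x=Py,\ y^TR_ky\le t_k,\ k\le K\}\qquad\Big[R_k\succeq 0,\ \sum_kR_k\succ 0\Big]$$

is a spectratope as well. Indeed, let $R_k=\sum_{j=1}^{p_k}r_{kj}r_{kj}^T$, $p_k=\mathrm{Rank}(R_k)$, be a dyadic representation of the positive semidefinite matrix $R_k$, so that

$$y^TR_ky=\sum_j(r_{kj}^Ty)^2\quad\forall y,$$

and let

$$\widehat{\mathcal{T}}=\Big\{\{t_{kj}\ge 0,\ 1\le j\le p_k,\ 1\le k\le K\}:\ \exists t\in\mathcal{T}:\ \sum_jt_{kj}\le t_k\Big\},\qquad R_{kj}[y]=r_{kj}^Ty\in\mathbf{S}^1=\mathbf{R}.$$

We clearly have

$$\mathcal{X}=\{x\in\mathbf{R}^\nu:\exists(\{t_{kj}\}\in\widehat{\mathcal{T}},y):\ x=Py,\ R_{kj}^2[y]\preceq t_{kj}I_1\ \forall k,j\},$$

and the right-hand side is a legitimate spectratopic representation of $\mathcal{X}$.

Example: “Matrix box.” Let $L$ be a positive definite $d\times d$ matrix. Then the “matrix box”

$$\begin{aligned}
\mathcal{X}&=\{X\in\mathbf{S}^d:\ -L\preceq X\preceq L\}=\{X\in\mathbf{S}^d:\ -I_d\preceq L^{-1/2}XL^{-1/2}\preceq I_d\}\\
&=\{X\in\mathbf{S}^d:\ R^2[X]:=[L^{-1/2}XL^{-1/2}]^2\preceq I_d\}
\end{aligned}$$

is a basic spectratope (augment $R_1[\cdot]:=R[\cdot]$ with $K=1$, $\mathcal{T}=[0,1]$). As a result, a bounded set $\mathcal{X}\subset\mathbf{R}^\nu$ given by a system of “two-sided” Linear Matrix Inequalities, specifically,

$$\mathcal{X}=\{x\in\mathbf{R}^\nu:\exists t\in\mathcal{T}:\ -\sqrt{t_k}L_k\preceq S_k[x]\preceq\sqrt{t_k}L_k,\ k\le K\}$$

where $S_k[x]$ are symmetric $d_k\times d_k$ matrices linearly depending on $x$, $L_k\succ 0$, and $\mathcal{T}$ satisfies S.2, is a basic spectratope:

$$\mathcal{X}=\{x\in\mathbf{R}^\nu:\exists t\in\mathcal{T}:\ R_k^2[x]\preceq t_kI_{d_k},\ k\le K\}\qquad\left[R_k[x]=L_k^{-1/2}S_k[x]L_k^{-1/2}\right].$$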
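The equivalence behind the “matrix box” example can be checked numerically: membership in $\{X:-L\preceq X\preceq L\}$ is tested once directly and once through the inequality $[L^{-1/2}XL^{-1/2}]^2\preceq I_d$. The helper names, the tolerance, and the test matrices below are assumptions made for illustration.

```python
import numpy as np

def inv_sqrt(L):
    """L^{-1/2} for a symmetric positive definite L, via its eigendecomposition."""
    w, V = np.linalg.eigh(L)
    return (V / np.sqrt(w)) @ V.T

def in_matrix_box(X, L, tol=1e-9):
    """Direct test of -L <= X <= L in the positive semidefinite order."""
    return bool(np.linalg.eigvalsh(L - X).min() >= -tol and
                np.linalg.eigvalsh(L + X).min() >= -tol)

def in_spectratope(X, L, tol=1e-9):
    """Test R^2[X] <= I_d with R[X] = L^{-1/2} X L^{-1/2}."""
    S = inv_sqrt(L)
    R = S @ X @ S
    R = 0.5 * (R + R.T)                      # symmetrize against rounding errors
    return bool(np.linalg.eigvalsh(R @ R).max() <= 1.0 + tol)

rng = np.random.default_rng(3)
d = 4
G = rng.standard_normal((d, d))
L = G @ G.T + np.eye(d)                      # a positive definite L

samples = [0.5 * L, 2.0 * L]                 # clearly inside / clearly outside the box
for scale in (0.1, 0.5, 1.0, 3.0):
    Y = rng.standard_normal((d, d))
    samples.append(scale * 0.5 * (Y + Y.T))

agreement = [in_matrix_box(X, L) == in_spectratope(X, L) for X in samples]
print(agreement)
```

The two tests agree because a symmetric matrix $Y$ satisfies $-I_d\preceq Y\preceq I_d$ exactly when all its eigenvalues lie in $[-1,1]$, that is, when $Y^2\preceq I_d$.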

Like ellitopes, spectratopes admit fully algorithmic calculus; see Section 4.6.

4.3.2

Semidefinite relaxation on spectratopes

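Now let us extend Proposition 4.6 to our current situation. The extension reads as follows:

Proposition 4.8. Let $C$ be a symmetric $n\times n$ matrix and $\mathcal{X}$ be given by spectratopic representation

$$\mathcal{X}=\{x\in\mathbf{R}^n:\exists y\in\mathbf{R}^\mu,t\in\mathcal{T}:\ x=Py,\ R_k^2[y]\preceq t_kI_{d_k},\ k\le K\},\qquad(4.28)$$

let

$$\mathrm{Opt}_*=\max_{x\in\mathcal{X}}x^TCx,$$

and let

$$\mathrm{Opt}=\min_{\Lambda=\{\Lambda_k\}_{k\le K}}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \Lambda_k\succeq 0,\ P^TCP\preceq\sum_k\mathcal{R}_k^*[\Lambda_k]\Big\}\qquad\left[\lambda[\Lambda]=[\mathrm{Tr}(\Lambda_1);...;\mathrm{Tr}(\Lambda_K)]\right].\qquad(4.29)$$

Then (4.29) is solvable, and

$$\mathrm{Opt}_*\le\mathrm{Opt}\le 2\max[\ln(2D),1]\,\mathrm{Opt}_*,\qquad D=\sum_kd_k.\qquad(4.30)$$

Let us verify the easy and instructive part of the proposition, namely, the left inequality in (4.30); the remaining claims will be proved in Section 4.8.3. The left inequality in (4.30) is readily given by the following

Lemma 4.9. Let $\mathcal{X}$ be spectratope (4.28) and $Q\in\mathbf{S}^n$. Whenever $\Lambda_k\in\mathbf{S}^{d_k}_+$ satisfy

$$P^TQP\preceq\sum_k\mathcal{R}_k^*[\Lambda_k],$$

for all $x\in\mathcal{X}$ we have

$$x^TQx\le\phi_{\mathcal{T}}(\lambda[\Lambda]),\qquad\lambda[\Lambda]=[\mathrm{Tr}(\Lambda_1);...;\mathrm{Tr}(\Lambda_K)].$$

Proof of the lemma: Let $x\in\mathcal{X}$, so that for some $t\in\mathcal{T}$ and $y$ it holds

$$x=Py,\quad R_k^2[y]\preceq t_kI_{d_k}\ \forall k\le K.$$

Consequently,

$$\begin{aligned}
x^TQx&=y^TP^TQPy\le y^T\sum_k\mathcal{R}_k^*[\Lambda_k]y=\sum_k\mathrm{Tr}(\mathcal{R}_k^*[\Lambda_k][yy^T])\\
&=\sum_k\mathrm{Tr}(\Lambda_k\mathcal{R}_k[yy^T])\quad[\text{by (4.26)}]\\
&=\sum_k\mathrm{Tr}(\Lambda_kR_k^2[y])\quad[\text{by (4.22)}]\\
&\le\sum_kt_k\mathrm{Tr}(\Lambda_kI_{d_k})\quad[\text{since }\Lambda_k\succeq 0\text{ and }R_k^2[y]\preceq t_kI_{d_k}]\\
&\le\phi_{\mathcal{T}}(\lambda[\Lambda]).\qquad✷
\end{aligned}$$

4.3.3

Linear estimates beyond ellitopic signal sets and $\|\cdot\|_2$-risk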

In Section 4.2, we have developed a computationally efficient scheme for building “presumably good” linear estimates of the linear image $Bx$ of unknown signal $x$ known to belong to a given ellitope $\mathcal{X}$ in the case when the (squared) risk is defined as the worst, w.r.t. $x\in\mathcal{X}$, expected squared Euclidean norm $\|\cdot\|_2^2$ of the recovery error. We are about to extend these results to the case when $\mathcal{X}$ is a spectratope, and the norm used to measure the recovery error, while not being completely arbitrary, is not necessarily $\|\cdot\|_2$. Besides this, in what follows we also relax our assumptions on observation noise.

4.3.3.1

Situation and goal

We consider the problem of recovering the image $Bx\in\mathbf{R}^\nu$ of a signal $x\in\mathbf{R}^n$ known to belong to a given spectratope

$$\mathcal{X}=\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:\ R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K\}$$

from noisy observation

$$\omega=Ax+\xi,\qquad(4.31)$$

where $A$ is a known $m\times n$ matrix, and $\xi$ is random observation noise.

Observation noise. In typical signal processing applications, the distribution of noise is fixed and is a part of the data of the estimation problem. In order to cover some applications (e.g., the one in Section 4.3.3.7), we allow for “ambiguous” noise distributions; all we know is that this distribution belongs to a family $\mathcal{P}$ of Borel probability distributions on $\mathbf{R}^m$ associated with a given convex compact subset $\Pi$ of the interior of the cone $\mathbf{S}^m_+$ of positive semidefinite $m\times m$ matrices, “association” meaning that the matrix of second moments of every distribution $P\in\mathcal{P}$ is $\preceq$-dominated by a matrix from $\Pi$:

$$P\in\mathcal{P}\Rightarrow\exists Q\in\Pi:\ \mathrm{Var}[P]:=\mathbf{E}_{\xi\sim P}\{\xi\xi^T\}\preceq Q.\qquad(4.32)$$

The actual distribution of noise in (4.31) is selected from $\mathcal{P}$ by nature (and may, e.g., depend on $x$). In the sequel, for a probability distribution $P$ on $\mathbf{R}^m$ we write $P\lhd\Pi$ to express the fact that the matrix of second moments of $P$ is $\preceq$-dominated by a matrix from $\Pi$:

$$\{P\lhd\Pi\}\Leftrightarrow\{\exists\Theta\in\Pi:\ \mathrm{Var}[P]\preceq\Theta\}.$$

Quantifying risk. Given $\Pi$ and a norm $\|\cdot\|$ on $\mathbf{R}^\nu$, we quantify the quality of a candidate estimate $\hat{x}(\cdot):\mathbf{R}^m\to\mathbf{R}^\nu$ by its $(\Pi,\|\cdot\|)$-risk on $\mathcal{X}$ defined as

$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat{x}|\mathcal{X}]=\sup_{x\in\mathcal{X},P\lhd\Pi}\mathbf{E}_{\xi\sim P}\{\|\hat{x}(Ax+\xi)-Bx\|\}.$$

Goal. As before, our focus is on linear estimates, that is, estimates of the form

$$\hat{x}_H(\omega)=H^T\omega$$

given by $m\times\nu$ matrices $H$. Our goal is to demonstrate that under some restrictions on the signal domain $\mathcal{X}$, a “presumably good” linear estimate yielded by an optimal solution to an efficiently solvable convex optimization problem is near-optimal in terms of its risk among all estimates, linear and nonlinear alike.

4.3.3.2

Assumptions

Preliminaries: Conjugate norms. Recall that a norm $\|\cdot\|$ on a Euclidean space $E$, e.g., on $\mathbf{R}^k$, gives rise to its conjugate norm

$$\|y\|_*=\max_x\{\langle y,x\rangle:\ \|x\|\le 1\},$$

where $\langle\cdot,\cdot\rangle$ is the inner product in $E$. Equivalently, $\|\cdot\|_*$ is the smallest norm such that

$$\langle x,y\rangle\le\|x\|\|y\|_*\quad\forall x,y.\qquad(4.33)$$

It is well known that taken twice, norm conjugation recovers the initial norm: $(\|\cdot\|_*)_*$ is exactly $\|\cdot\|$; in other words,

$$\|x\|=\max_y\{\langle x,y\rangle:\ \|y\|_*\le 1\}.$$
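As a small illustration of the definition, the conjugate of the $\ell_1$-norm on $\mathbf{R}^k$ can be computed by maximizing the linear form over the vertices $\pm e_i$ of the unit $\ell_1$-ball, which yields the $\ell_\infty$-norm (a standard fact). The function name `conjugate_of_l1` and the random test vectors are assumptions made here.

```python
import numpy as np

def conjugate_of_l1(y):
    """max{<y, x> : ||x||_1 <= 1}; a linear form on a polytope peaks at a vertex +-e_i."""
    k = len(y)
    vertices = np.vstack([np.eye(k), -np.eye(k)])
    return float(np.max(vertices @ y))

rng = np.random.default_rng(4)
y = rng.standard_normal(7)
x = rng.standard_normal(7)

dual_norm = conjugate_of_l1(y)
print(np.isclose(dual_norm, np.max(np.abs(y))))           # conjugate of l_1 is l_inf
print(x @ y <= np.sum(np.abs(x)) * dual_norm + 1e-12)     # inequality (4.33) for this pair
```

The same vertex-enumeration idea does not apply to norms whose unit balls are not polytopes; there the conjugate norm has to be identified analytically.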

The standard examples are the conjugates to the standard $\ell_p$-norms on $E=\mathbf{R}^k$, $p\in[1,\infty]$: it turns out that

$$(\|\cdot\|_p)_*=\|\cdot\|_{p_*},$$

where $p_*\in[1,\infty]$ is linked to $p\in[1,\infty]$ by the symmetric relation

$$\frac{1}{p}+\frac{1}{p_*}=1,$$

so that $1_*=\infty$, $\infty_*=1$, $2_*=2$. The corresponding version of inequality (4.33) is called the Hölder inequality, an extension of the Cauchy-Schwarz inequality dealing with the case $\|\cdot\|=\|\cdot\|_*=\|\cdot\|_2$.

Assumptions. From now on we make the following assumptions:

Assumption A: The unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to the norm $\|\cdot\|$ in the formulation of our estimation problem is a spectratope:

$$\mathcal{B}_*=\{z\in\mathbf{R}^\nu:\exists y\in\mathcal{Y}:\ z=My\},\qquad\mathcal{Y}:=\{y\in\mathbf{R}^q:\exists r\in\mathcal{R}:\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\},\qquad(4.34)$$

where the right-hand side data are as required in a spectratopic representation.

Note that Assumption A is satisfied when $\|\cdot\|=\|\cdot\|_p$ with $p\in[1,2]$: in this case,

$$\mathcal{B}_*=\{u\in\mathbf{R}^\nu:\ \|u\|_{p_*}\le 1\},\qquad p_*=\frac{p}{p-1}\in[2,\infty],$$

so that $\mathcal{B}_*$ is an ellitope (see Section 4.2.1.1) and thus is a spectratope. Another potentially useful example of a norm $\|\cdot\|$ which obeys Assumption A is the nuclear norm $\|V\|_{\mathrm{Sh},1}$ on the space $\mathbf{R}^\nu=\mathbf{R}^{p\times q}$ of $p\times q$ matrices, that is, the sum of singular values of a matrix $V$. In this case the conjugate norm is the spectral norm $\|\cdot\|_{2,2}$ on $\mathbf{R}^\nu=\mathbf{R}^{p\times q}$, and the unit ball of the latter norm is a spectratope:

$$\{X\in\mathbf{R}^{p\times q}:\ \|X\|_{2,2}\le 1\}=\{X:\exists t\in\mathcal{T}=[0,1]:\ R^2[X]\preceq tI_{p+q}\},\qquad R[X]=\begin{bmatrix}0 & X^T\\ X & 0\end{bmatrix}.$$

Besides Assumption A, we make

Assumption B: The signal set $\mathcal{X}$ is a basic spectratope:

$$\mathcal{X}=\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:\ R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K\},$$

where the right-hand side data are as required in a spectratopic representation.

Note: Similarly to what we have observed in Section 4.2.1.3 in the case of ellitopes, the situation where the signal set is a general type spectratope can be straightforwardly reduced to the one where $\mathcal{X}$ is a basic spectratope. In addition we make the following regularity assumption:

Assumption R: All matrices from $\Pi$ are positive definite.

4.3.3.3

Building linear estimate

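Let $H\in\mathbf{R}^{m\times\nu}$. We clearly have

$$\begin{aligned}
\mathrm{Risk}_{\Pi,\|\cdot\|}[\hat{x}_H(\cdot)|\mathcal{X}]&=\sup_{x\in\mathcal{X},P\lhd\Pi}\mathbf{E}_{\xi\sim P}\left\{\|[B-H^TA]x-H^T\xi\|\right\}\\
&\le\sup_{x\in\mathcal{X}}\|[B-H^TA]x\|+\sup_{P\lhd\Pi}\mathbf{E}_{\xi\sim P}\left\{\|H^T\xi\|\right\}\\
&=\|B-H^TA\|_{\mathcal{X},\|\cdot\|}+\Psi_\Pi(H),
\end{aligned}\qquad(4.35)$$

where

$$\begin{aligned}
\|V\|_{\mathcal{X},\|\cdot\|}&=\max_x\{\|Vx\|:\ x\in\mathcal{X}\}:\ \mathbf{R}^{\nu\times n}\to\mathbf{R},\\
\Psi_\Pi(H)&=\sup_{P\lhd\Pi}\mathbf{E}_{\xi\sim P}\left\{\|H^T\xi\|\right\}.
\end{aligned}$$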

As in Section 4.2.2, we need to derive efficiently computable convex upper bounds on the norm $\|\cdot\|_{\mathcal{X},\|\cdot\|}$ and the function $\Psi_\Pi$, which by themselves, while being convex, can be difficult to compute.

4.3.3.4

Upper-bounding $\|\cdot\|_{\mathcal{X},\|\cdot\|}$

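With Assumptions A, B in force, consider the spectratope

$$\begin{aligned}
\mathcal{Z}:=\mathcal{X}\times\mathcal{Y}&=\{[x;y]\in\mathbf{R}^n\times\mathbf{R}^q:\exists s=[t;r]\in\mathcal{T}\times\mathcal{R}:\ R_k^2[x]\preceq t_kI_{d_k},\ 1\le k\le K,\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\}\\
&=\{w=[x;y]\in\mathbf{R}^n\times\mathbf{R}^q:\exists s=[t;r]\in\mathcal{S}=\mathcal{T}\times\mathcal{R}:\ U_i^2[w]\preceq s_iI_{g_i},\ 1\le i\le I=K+L\}
\end{aligned}$$

with $U_i[\cdot]$ readily given by $R_k[\cdot]$ and $S_\ell[\cdot]$. Given a $\nu\times n$ matrix $V$ and setting

$$W[V]=\frac{1}{2}\begin{bmatrix}0 & V^TM\\ M^TV & 0\end{bmatrix}$$

we have

$$\|V\|_{\mathcal{X},\|\cdot\|}=\max_{x\in\mathcal{X}}\|Vx\|=\max_{x\in\mathcal{X},z\in\mathcal{B}_*}z^TVx=\max_{x\in\mathcal{X},y\in\mathcal{Y}}y^TM^TVx=\max_{w\in\mathcal{Z}}w^TW[V]w.$$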
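The last equality in the chain rests on the identity $w^TW[V]w=y^TM^TVx$ for $w=[x;y]$, which turns the bilinear form into a quadratic one so that Proposition 4.8 becomes applicable. A minimal numerical check follows; the function name `W_of_V`, the dimensions, and the random data are assumptions made for illustration.

```python
import numpy as np

def W_of_V(V, M):
    """Symmetric (n + q) x (n + q) matrix 0.5 * [[0, V^T M], [M^T V, 0]]."""
    n, q = V.shape[1], M.shape[1]
    top = np.hstack([np.zeros((n, n)), V.T @ M])
    bottom = np.hstack([M.T @ V, np.zeros((q, q))])
    return 0.5 * np.vstack([top, bottom])

rng = np.random.default_rng(5)
nu, n, q = 3, 4, 5
V = rng.standard_normal((nu, n))       # the matrix whose norm is being bounded
M = rng.standard_normal((nu, q))       # the matrix M from the representation (4.34)
x = rng.standard_normal(n)
y = rng.standard_normal(q)

W = W_of_V(V, M)
w = np.concatenate([x, y])
print(np.allclose(W, W.T))                        # W[V] is symmetric
print(np.isclose(w @ W @ w, y @ M.T @ V @ x))     # quadratic form equals the bilinear form
```

Note that $W[V]$ is linear in $V$, which is what makes the resulting semidefinite constraint convex in $V$.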

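Applying Proposition 4.8, we arrive at the following

Corollary 4.10. In the situation just defined, the efficiently computable convex function

$$\|V\|^+_{\mathcal{X},\|\cdot\|}=\min_{\Lambda,\Upsilon}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon]):\ \begin{array}{l}\Lambda=\{\Lambda_k\in\mathbf{S}^{d_k}_+\}_{k\le K},\ \Upsilon=\{\Upsilon_\ell\in\mathbf{S}^{f_\ell}_+\}_{\ell\le L},\\[2pt] \begin{bmatrix}\sum_k\mathcal{R}_k^*[\Lambda_k] & \frac{1}{2}V^TM\\ \frac{1}{2}M^TV & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq 0\end{array}\right\}\qquad(4.36)$$

$$\left[\begin{array}{l}\phi_{\mathcal{T}}(\lambda)=\max_{t\in\mathcal{T}}\lambda^Tt,\quad\phi_{\mathcal{R}}(\lambda)=\max_{r\in\mathcal{R}}\lambda^Tr,\quad\lambda[\{\Xi_1,...,\Xi_N\}]=[\mathrm{Tr}(\Xi_1);...;\mathrm{Tr}(\Xi_N)],\\[2pt] [\mathcal{R}_k^*[\Lambda_k]]_{ij}=\frac{1}{2}\mathrm{Tr}(\Lambda_k[R^{ki}R^{kj}+R^{kj}R^{ki}]),\ \text{where}\ R_k[x]=\sum_ix_iR^{ki},\\[2pt] [\mathcal{S}_\ell^*[\Upsilon_\ell]]_{ij}=\frac{1}{2}\mathrm{Tr}(\Upsilon_\ell[S^{\ell i}S^{\ell j}+S^{\ell j}S^{\ell i}]),\ \text{where}\ S_\ell[y]=\sum_iy_iS^{\ell i}\end{array}\right]$$

is a norm on $\mathbf{R}^{\nu\times n}$, and this norm is a tight upper bound on $\|\cdot\|_{\mathcal{X},\|\cdot\|}$, namely,

$$\forall V\in\mathbf{R}^{\nu\times n}:\ \|V\|_{\mathcal{X},\|\cdot\|}\le\|V\|^+_{\mathcal{X},\|\cdot\|}\le 2\max[\ln(2D),1]\,\|V\|_{\mathcal{X},\|\cdot\|},\qquad D=\sum_kd_k+\sum_\ell f_\ell.$$

4.3.3.5

Upper-bounding $\Psi_\Pi(\cdot)$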

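The next step is to derive an efficiently computable convex upper bound on the function $\Psi_\Pi$ stemming from a norm obeying Assumption A. The underlying observation is as follows:

Lemma 4.11. Let $V$ be an $m\times\nu$ matrix, $Q\in\mathbf{S}^m_+$, and $P$ be a probability distribution on $\mathbf{R}^m$ with $\mathrm{Var}[P]\preceq Q$. Let, further, $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ with the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ given by (4.34). Finally, let $\Upsilon=\{\Upsilon_\ell\in\mathbf{S}^{f_\ell}_+\}_{\ell\le L}$ and a matrix $\Theta\in\mathbf{S}^m$ satisfy the constraint

$$\begin{bmatrix}\Theta & \frac{1}{2}VM\\ \frac{1}{2}M^TV^T & \sum_\ell\mathcal{S}_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq 0\qquad(4.37)$$

(for notation, see (4.34), (4.36)). Then

$$\mathbf{E}_{\eta\sim P}\{\|V^T\eta\|\}\le\mathrm{Tr}(Q\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon]).\qquad(4.38)$$

Proof is immediate. In the case of (4.37), we have

$$\begin{aligned}
\|V^T\xi\|&=\max_{z\in\mathcal{B}_*}z^TV^T\xi=\max_{y\in\mathcal{Y}}y^TM^TV^T\xi\\
&\le\max_{y\in\mathcal{Y}}\Big[\xi^T\Theta\xi+\sum_\ell y^T\mathcal{S}_\ell^*[\Upsilon_\ell]y\Big]\quad[\text{by (4.37)}]\\
&=\max_{y\in\mathcal{Y}}\Big[\xi^T\Theta\xi+\sum_\ell\mathrm{Tr}(\mathcal{S}_\ell^*[\Upsilon_\ell]yy^T)\Big]\\
&=\max_{y\in\mathcal{Y}}\Big[\xi^T\Theta\xi+\sum_\ell\mathrm{Tr}(\Upsilon_\ell S_\ell^2[y])\Big]\quad[\text{by (4.22) and (4.26)}]\\
&=\xi^T\Theta\xi+\max_{y,r}\Big\{\sum_\ell\mathrm{Tr}(\Upsilon_\ell S_\ell^2[y]):\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ \ell\le L,\ r\in\mathcal{R}\Big\}\quad[\text{by (4.34)}]\\
&\le\xi^T\Theta\xi+\max_{r\in\mathcal{R}}\sum_\ell\mathrm{Tr}(\Upsilon_\ell)r_\ell\quad[\text{by }\Upsilon_\ell\succeq 0]\\
&=\xi^T\Theta\xi+\phi_{\mathcal{R}}(\lambda[\Upsilon]).
\end{aligned}$$

Taking the expectation of both sides of the resulting inequality w.r.t. distribution $P$ of $\xi$ and taking into account that $\mathrm{Tr}(\mathrm{Var}[P]\Theta)\le\mathrm{Tr}(Q\Theta)$ due to $\Theta\succeq 0$ (by (4.37)) and $\mathrm{Var}[P]\preceq Q$, we get (4.38). ✷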

283

SIGNAL RECOVERY BY LINEAR ESTIMATION

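Note that when $P=\mathcal{N}(0,Q)$, the smallest upper bound on $\mathbf{E}_{\eta\sim P}\{\|V^T\eta\|\}$ which can be extracted from Lemma 4.11 (this bound is efficiently computable) is tight; see Lemma 4.17 below. An immediate consequence of the bound in Lemma 4.11 is:

Corollary 4.12. Let

$$\Gamma(\Theta)=\max_{Q\in\Pi}\mathrm{Tr}(Q\Theta) \tag{4.39}$$

and

$$\overline{\Psi}_\Pi(H)=\min_{\{\Upsilon_\ell\}_{\ell\le L},\,\Theta\in\mathbf{S}^m}\left\{\Gamma(\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon]):\ \Upsilon_\ell\succeq0\ \forall\ell,\ \begin{bmatrix}\Theta&\frac12HM\\ \frac12M^TH^T&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\}. \tag{4.40}$$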

Then $\overline{\Psi}_\Pi(\cdot):\mathbf{R}^{m\times\nu}\to\mathbf{R}$ is an efficiently computable convex upper bound on $\Psi_\Pi(\cdot)$.

Indeed, given Lemma 4.11, the only non-evident part of the corollary is that $\overline{\Psi}_\Pi(\cdot)$ is a well-defined real-valued function, which is readily given by Lemma 4.44 stating, in particular, that the optimization problem in (4.40) is feasible, combined with the fact that the objective is coercive on the feasible set (i.e., is not bounded from above along every unbounded sequence of feasible solutions).

Remark 4.13. When $\Upsilon=\{\Upsilon_\ell\}_{\ell\le L}$, $\Theta$ is a feasible solution to the right-hand side problem in (4.40) and $s>0$, the pair $\Upsilon'=\{s\Upsilon_\ell\}_{\ell\le L}$, $\Theta'=s^{-1}\Theta$ also is a feasible solution. Since $\phi_{\mathcal{R}}(\cdot)$ and $\Gamma(\cdot)$ are positive homogeneous of degree 1, we conclude that $\overline{\Psi}_\Pi$ is in fact the infimum of the function

$$2\sqrt{\Gamma(\Theta)\phi_{\mathcal{R}}(\lambda[\Upsilon])}=\inf_{s>0}\left[s^{-1}\Gamma(\Theta)+s\phi_{\mathcal{R}}(\lambda[\Upsilon])\right]$$

over $\Upsilon,\Theta$ satisfying the constraints of the problem (4.40). In addition, for every feasible solution $\Upsilon=\{\Upsilon_\ell\}_{\ell\le L}$, $\Theta$ to (4.40) with $\mathcal{M}[\Upsilon]:=\sum_\ell S^*_\ell[\Upsilon_\ell]\succ0$, the pair $\Upsilon$, $\widehat{\Theta}=\frac14HM\mathcal{M}^{-1}[\Upsilon]M^TH^T$ is feasible for the problem as well, and $0\preceq\widehat{\Theta}\preceq\Theta$ (Schur Complement Lemma), so that $\Gamma(\widehat{\Theta})\le\Gamma(\Theta)$. As a result,

$$\overline{\Psi}_\Pi(H)=\inf_{\Upsilon}\left\{\tfrac14\Gamma(HM\mathcal{M}^{-1}[\Upsilon]M^TH^T)+\phi_{\mathcal{R}}(\lambda[\Upsilon]):\ \Upsilon=\{\Upsilon_\ell\in\mathbf{S}^{f_\ell}_+\}_{\ell\le L},\ \mathcal{M}[\Upsilon]\succ0\right\}. \tag{4.41}$$

Illustration. Suppose that $\|u\|=\|u\|_p$ with $p\in[1,2]$, and let us apply the just described scheme for upper-bounding $\Psi_\Pi$, assuming $\{Q\}\subset\Pi\subset\{S\in\mathbf{S}^m_+:S\preceq Q\}$ for some given $Q\succ0$, so that $\Gamma(\Theta)=\mathrm{Tr}(Q\Theta)$, $\Theta\succeq0$. The unit ball of the norm conjugate to $\|\cdot\|$, that is, the norm $\|\cdot\|_q$, $q=\frac{p}{p-1}\in[2,\infty]$, is the basic spectratope (in fact, ellitope)

$$\mathcal{B}_*=\{y\in\mathbf{R}^\nu:\ \exists r\in\mathcal{R}:=\{r\in\mathbf{R}^\nu_+:\|r\|_{q/2}\le1\}:\ S^2_\ell[y]\le r_\ell,\ 1\le\ell\le L=\nu\},\quad S_\ell[y]=y_\ell.$$

As a result, $\Upsilon$'s from Remark 4.13 are collections of $\nu$ positive semidefinite $1\times1$ matrices, and we can identify them with $\nu$-dimensional nonnegative vectors $\upsilon$,


CHAPTER 4

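resulting in $\lambda[\Upsilon]=\upsilon$ and $\mathcal{M}[\Upsilon]=\mathrm{Diag}\{\upsilon\}$. Furthermore, for nonnegative $\upsilon$ we clearly have $\phi_{\mathcal{R}}(\upsilon)=\|\upsilon\|_{p/(2-p)}$, so the optimization problem in (4.41) now reads

$$\overline{\Psi}_\Pi(H)=\inf_{\upsilon\in\mathbf{R}^\nu}\left\{\tfrac14\mathrm{Tr}(V\,\mathrm{Diag}^{-1}\{\upsilon\}V^T)+\|\upsilon\|_{p/(2-p)}:\ \upsilon>0\right\}\qquad[V=Q^{1/2}H],$$

and when setting $a_\ell=\|\mathrm{Col}_\ell[V]\|_2$, (4.41) becomes

$$\overline{\Psi}_\Pi(H)=\inf_{\upsilon>0}\left\{\frac14\sum_\ell\frac{a_\ell^2}{\upsilon_\ell}+\|\upsilon\|_{p/(2-p)}\right\}.$$

This results in $\overline{\Psi}_\Pi(H)=\|[a_1;...;a_\nu]\|_p$. Recalling what $a_\ell$ and $V$ are, we end up with

$$\forall P,\ \mathrm{Var}[P]\preceq Q:\quad \mathbf{E}_{\xi\sim P}\{\|H^T\xi\|\}\le\overline{\Psi}_\Pi(H):=\left\|\left[\|\mathrm{Row}_1[H^TQ^{1/2}]\|_2;\ldots;\|\mathrm{Row}_\nu[H^TQ^{1/2}]\|_2\right]\right\|_p.$$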
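The closed-form bound just obtained is easy to evaluate. The following sketch (an added illustration; the helper name `lp_risk_bound` and the test data are hypothetical) computes $\|[\sigma_1;\ldots;\sigma_\nu]\|_p$ with $\sigma_i=\|\mathrm{Row}_i[H^TQ^{1/2}]\|_2$ and compares it with a Monte Carlo estimate of $\mathbf{E}\|H^T\xi\|_p$ for Gaussian $\xi\sim\mathcal{N}(0,Q)$. For $p=1$ it also evaluates the objective of the $\upsilon$-problem at $\upsilon=a/2$, which attains $\|a\|_1$.

```python
import numpy as np

def lp_risk_bound(H, Q, p):
    """|| [ ||Row_i[H^T Q^{1/2}]||_2 ]_i ||_p for a symmetric PSD matrix Q."""
    w, U = np.linalg.eigh(Q)
    Q_half = (U * np.sqrt(np.clip(w, 0.0, None))) @ U.T   # symmetric square root
    sigma = np.linalg.norm(H.T @ Q_half, axis=1)          # row norms sigma_i
    return np.linalg.norm(sigma, ord=p), sigma

rng = np.random.default_rng(1)
m, nu, p = 6, 4, 1.5
H = rng.standard_normal((m, nu))
G = rng.standard_normal((m, m))
Q = G @ G.T / m

bound, sigma = lp_risk_bound(H, Q, p)

# Monte Carlo estimate of E ||H^T xi||_p, xi ~ N(0, Q)
xi = rng.multivariate_normal(np.zeros(m), Q, size=200_000)
empirical = np.linalg.norm(xi @ H, ord=p, axis=1).mean()

# p = 1: the objective (1/4) sum a^2/ups + ||ups||_1 at ups = a/2 equals ||a||_1
a = sigma                                  # a_l = ||Col_l[Q^{1/2} H]||_2 = sigma_l
ups = a / 2.0
objective_p1 = 0.25 * np.sum(a ** 2 / ups) + np.sum(ups)

print(f"p = {p}: bound {bound:.4f} vs empirical {empirical:.4f}")
```

The gap between the two numbers reflects that the bound must hold for every distribution with $\mathrm{Var}[P]\preceq Q$, not only for the Gaussian one.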

This result is quite transparent and could easily be obtained directly. Indeed, when $\mathrm{Var}[P]\preceq Q$ and $\xi\sim P$, the vector $\zeta=H^T\xi$ clearly satisfies $\mathbf{E}\{\zeta_i^2\}\le\sigma_i^2:=\|\mathrm{Row}_i[H^TQ^{1/2}]\|_2^2$, implying, due to $p\in[1,2]$, that $\mathbf{E}\{\sum_i|\zeta_i|^p\}\le\sum_i\sigma_i^p$, whence $\mathbf{E}\{\|\zeta\|_p\}\le\|[\sigma_1;...;\sigma_\nu]\|_p$.

4.3.3.6 Putting things together

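An immediate outcome of Corollaries 4.10 and 4.12 is the following recipe for building a “presumably good” linear estimate:

Proposition 4.14. In the situation of Section 4.3.3.1 and under Assumptions A, B, and R (see Section 4.3.3.2) consider the convex optimization problem (for notation, see (4.36) and (4.39))

$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{H,\Lambda,\Upsilon,\Upsilon',\Theta}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\Gamma(\Theta):\\
&&\quad\Lambda=\{\Lambda_k\succeq0,k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ell\le L\},\\
&&\quad\begin{bmatrix}\sum_kR^*_k[\Lambda_k]&\frac12[B^T-A^TH]M\\ \frac12M^T[B-H^TA]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0,\quad
\begin{bmatrix}\Theta&\frac12HM\\ \frac12M^TH^T&\sum_\ell S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\}.
\end{array}\tag{4.42}$$

The problem is solvable, and the $H$-component $H_*$ of its optimal solution yields linear estimate $\widehat{x}_{H_*}(\omega)=H_*^T\omega$ such that

$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}_{H_*}(\cdot)|\mathcal{X}]\le\mathrm{Opt}. \tag{4.43}$$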

Note that the only claim in Proposition 4.14 which is not an immediate consequence of Corollaries 4.10 and 4.12 is that problem (4.42) is solvable; this fact is readily given by the feasibility of the problem (by Lemma 4.44) and the coerciveness of the objective on the feasible set (recall that $\Gamma(\Theta)$ is coercive on $\mathbf{S}^m_+$ due to $\Pi\subset\mathrm{int}\,\mathbf{S}^m_+$ and that $y\mapsto My$ is an onto mapping, since $\mathcal{B}_*$ is full-dimensional).


4.3.3.7 Illustration: Covariance matrix estimation

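Suppose that we observe a sample

$$\eta^T=\{\eta_k=A\xi_k\}_{k\le T} \tag{4.44}$$

where $A$ is a given $m\times n$ matrix, and $\xi_1,...,\xi_T$ are sampled, independently of each other, from a zero mean Gaussian distribution with unknown covariance matrix $\vartheta$ known to satisfy

$$\gamma\vartheta_*\preceq\vartheta\preceq\vartheta_*, \tag{4.45}$$

where $\gamma\ge0$ and $\vartheta_*\succ0$ are given. Our goal is to recover $\vartheta$, and the norm on $\mathbf{S}^n$ in which the recovery error is measured satisfies Assumption A.

Processing the problem. We can process the problem just outlined as follows.

1. We represent the set $\{\vartheta\in\mathbf{S}^n_+:\gamma\vartheta_*\preceq\vartheta\preceq\vartheta_*\}$ as the image of the matrix box

$$\mathcal{V}=\{v\in\mathbf{S}^n:\|v\|_{2,2}\le1\}\qquad[\|\cdot\|_{2,2}:\text{ spectral norm}]$$

under affine mapping; specifically, we set

$$\vartheta_0=\frac{1+\gamma}{2}\vartheta_*,\qquad\sigma=\frac{1-\gamma}{2}$$

and treat the matrix

$$v=\sigma^{-1}\vartheta_*^{-1/2}(\vartheta-\vartheta_0)\vartheta_*^{-1/2}\qquad\left[\Leftrightarrow\ \vartheta=\vartheta_0+\sigma\vartheta_*^{1/2}v\vartheta_*^{1/2}\right]$$

as the signal underlying our observations. Note that our a priori information on $\vartheta$ reduces to $v\in\mathcal{V}$.

2. We pass from observations $\eta_k$ to “lifted” observations $\eta_k\eta_k^T\in\mathbf{S}^m$, so that

$$\mathbf{E}\{\eta_k\eta_k^T\}=\mathbf{E}\{A\xi_k\xi_k^TA^T\}=A\vartheta A^T=A\underbrace{(\vartheta_0+\sigma\vartheta_*^{1/2}v\vartheta_*^{1/2})}_{\vartheta[v]}A^T,$$

and treat as “actual” observations the matrices $\omega_k=\eta_k\eta_k^T-A\vartheta_0A^T$. We have$^8$

$$\omega_k=\mathcal{A}v+\zeta_k\ \text{ with }\ \mathcal{A}v=\sigma A\vartheta_*^{1/2}v\vartheta_*^{1/2}A^T\ \text{ and }\ \zeta_k=\eta_k\eta_k^T-A\vartheta[v]A^T. \tag{4.46}$$

Observe that random matrices $\zeta_1,...,\zeta_T$ are i.i.d. with zero mean and covariance mapping $\mathcal{Q}[v]$ (that of random matrix-valued variable $\zeta=\eta\eta^T-\mathbf{E}\{\eta\eta^T\}$, $\eta\sim\mathcal{N}(0,A\vartheta[v]A^T)$).

$^8$ In our current considerations, we need to operate with linear mappings acting from $\mathbf{S}^p$ to $\mathbf{S}^q$. We treat $\mathbf{S}^k$ as Euclidean space equipped with the Frobenius inner product $\langle u,v\rangle=\mathrm{Tr}(uv)$ and denote linear mappings from $\mathbf{S}^p$ into $\mathbf{S}^q$ by capital calligraphic letters, like $\mathcal{A}$, $\mathcal{Q}$, etc. Thus, $\mathcal{A}$ in (4.46) denotes the linear mapping which, on closer inspection, maps matrix $v\in\mathbf{S}^n$ into the matrix $\mathcal{A}v=A[\vartheta[v]-\vartheta[0]]A^T$.

3. Let us $\succeq$-upper-bound the covariance mapping of $\zeta$. Observe that $\mathcal{Q}[v]$ is a symmetric linear mapping of $\mathbf{S}^m$ into itself given by

$$\langle h,\mathcal{Q}[v]h\rangle=\mathbf{E}\{\langle h,\zeta\rangle^2\}=\mathbf{E}\{\langle h,\eta\eta^T\rangle^2\}-\langle h,\mathbf{E}\{\eta\eta^T\}\rangle^2,\quad h\in\mathbf{S}^m.$$

Given $v\in\mathcal{V}$, let us set $\theta=\vartheta[v]$, so that $0\preceq\theta\preceq\theta_*$, and let $\mathcal{H}(h)=\theta^{1/2}A^ThA\theta^{1/2}$. We have

$$\begin{array}{rcl}
\langle h,\mathcal{Q}[v]h\rangle&=&\mathbf{E}_{\xi\sim\mathcal{N}(0,\theta)}\{\mathrm{Tr}^2(hA\xi\xi^TA^T)\}-\mathrm{Tr}^2(h\,\mathbf{E}_{\xi\sim\mathcal{N}(0,\theta)}\{A\xi\xi^TA^T\})\\
&=&\mathbf{E}_{\chi\sim\mathcal{N}(0,I_n)}\{\mathrm{Tr}^2(hA\theta^{1/2}\chi\chi^T\theta^{1/2}A^T)\}-\mathrm{Tr}^2(hA\theta A^T)\\
&=&\mathbf{E}_{\chi\sim\mathcal{N}(0,I_n)}\{(\chi^T\mathcal{H}(h)\chi)^2\}-\mathrm{Tr}^2(\mathcal{H}(h)).
\end{array}$$

We have $\mathcal{H}(h)=U\mathrm{Diag}\{\lambda\}U^T$ with orthogonal $U$, so that

$$\begin{array}{l}
\mathbf{E}_{\chi\sim\mathcal{N}(0,I_n)}\{(\chi^T\mathcal{H}(h)\chi)^2\}-\mathrm{Tr}^2(\mathcal{H}(h))
=\mathbf{E}_{\bar\chi:=U^T\chi\sim\mathcal{N}(0,I_n)}\{(\bar\chi^T\mathrm{Diag}\{\lambda\}\bar\chi)^2\}-(\sum_i\lambda_i)^2\\
\quad=\mathbf{E}_{\bar\chi\sim\mathcal{N}(0,I_n)}\{(\sum_i\lambda_i\bar\chi_i^2)^2\}-(\sum_i\lambda_i)^2
=\sum_{i\ne j}\lambda_i\lambda_j+3\sum_i\lambda_i^2-(\sum_i\lambda_i)^2\\
\quad=2\sum_i\lambda_i^2=2\mathrm{Tr}([\mathcal{H}(h)]^2).
\end{array}$$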
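The Gaussian identity $\mathbf{E}\{(\chi^TH\chi)^2\}-\mathrm{Tr}^2(H)=2\mathrm{Tr}(H^2)$ used above can be checked exactly, without sampling, from the fourth moments of a standard Gaussian vector, $\mathbf{E}\{\chi_i\chi_j\chi_k\chi_l\}=\delta_{ij}\delta_{kl}+\delta_{ik}\delta_{jl}+\delta_{il}\delta_{jk}$ (Isserlis' formula). The sketch below is an added illustration with hypothetical variable names.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
G = rng.standard_normal((n, n))
Hm = (G + G.T) / 2.0                       # an arbitrary symmetric matrix

I = np.eye(n)
# Fourth-moment tensor of chi ~ N(0, I_n): sum over the three pairings
M4 = (np.einsum("ij,kl->ijkl", I, I)
      + np.einsum("ik,jl->ijkl", I, I)
      + np.einsum("il,jk->ijkl", I, I))

# E{(chi^T H chi)^2} = sum_{ijkl} H_ij H_kl E{chi_i chi_j chi_k chi_l}
second_moment = np.einsum("ij,kl,ijkl->", Hm, Hm, M4)

lhs = second_moment - np.trace(Hm) ** 2
rhs = 2.0 * np.trace(Hm @ Hm)
print(lhs, rhs)
```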

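Thus,

$$\begin{array}{rcl}
\langle h,\mathcal{Q}[v]h\rangle&=&2\mathrm{Tr}([\mathcal{H}(h)]^2)=2\mathrm{Tr}(\theta^{1/2}A^ThA\theta A^ThA\theta^{1/2})\\
&\le&2\mathrm{Tr}(\theta^{1/2}A^ThA\theta_*A^ThA\theta^{1/2})\qquad[\text{since }0\preceq\theta\preceq\theta_*]\\
&=&2\mathrm{Tr}(\theta_*^{1/2}A^ThA\theta A^ThA\theta_*^{1/2})\le2\mathrm{Tr}(\theta_*^{1/2}A^ThA\theta_*A^ThA\theta_*^{1/2})\\
&=&2\mathrm{Tr}(\theta_*A^ThA\theta_*A^ThA).
\end{array}$$

We conclude that

$$\forall v\in\mathcal{V}:\ \mathcal{Q}[v]\preceq\mathcal{Q},\qquad\langle e,\mathcal{Q}h\rangle=2\mathrm{Tr}(\vartheta_*A^ThA\vartheta_*A^TeA),\quad e,h\in\mathbf{S}^m. \tag{4.47}$$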

4. To continue, we need to set some additional notation to be used when operating with Euclidean spaces $\mathbf{S}^p$, $p=1,2,...$.

• We denote $\bar p=\frac{p(p+1)}{2}=\dim\mathbf{S}^p$, $\mathcal{I}_p=\{(i,j):1\le i\le j\le p\}$, and for $(i,j)\in\mathcal{I}_p$ set

$$e_p^{ij}=\begin{cases}e_ie_i^T,&i=j,\\ \frac{1}{\sqrt2}[e_ie_j^T+e_je_i^T],&i<j,\end{cases}$$

where the $e_i$ are standard basic orths in $\mathbf{R}^p$. Note that $\{e_p^{ij}:(i,j)\in\mathcal{I}_p\}$ is the standard orthonormal basis in $\mathbf{S}^p$. Given $v\in\mathbf{S}^p$, we denote by $X^p(v)$ the vector of coordinates of $v$ in this basis:

$$X^p_{ij}(v)=\mathrm{Tr}(ve_p^{ij})=\begin{cases}v_{ii},&i=j,\\ \sqrt2v_{ij},&i<j,\end{cases}\qquad(i,j)\in\mathcal{I}_p.$$

Similarly, for $x\in\mathbf{R}^{\bar p}$, we index the entries in $x$ by pairs $ij$, $(i,j)\in\mathcal{I}_p$, and set $V^p(x)=\sum_{(i,j)\in\mathcal{I}_p}x_{ij}e_p^{ij}$, so that $v\mapsto X^p(v)$ and $x\mapsto V^p(x)$ are linear norm-preserving maps inverse to each other identifying the Euclidean spaces $\mathbf{S}^p$ and $\mathbf{R}^{\bar p}$ (recall that the inner products on these spaces are, respectively, the Frobenius and the standard one).


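• Recall that $\mathcal{V}$ is the matrix box $\{v\in\mathbf{S}^n:v^2\preceq I_n\}=\{v\in\mathbf{S}^n:\exists t\in\mathcal{T}:=[0,1]:v^2\preceq tI_n\}$. We denote by $\mathcal{X}$ the image of $\mathcal{V}$ under the mapping $X^n$:

$$\mathcal{X}=\{x\in\mathbf{R}^{\bar n}:\exists t\in\mathcal{T}:R^2[x]\preceq tI_n\},\qquad R[x]=\sum_{(i,j)\in\mathcal{I}_n}x_{ij}e_n^{ij},\quad\bar n=\tfrac12n(n+1).$$

Note that $\mathcal{X}$ is a basic spectratope of size $n$.

Now we can assume that the signal underlying our observations is $x\in\mathcal{X}$, and the observations themselves are

$$w_k=X^m(\omega_k)=\underbrace{X^m(\mathcal{A}V^n(x))}_{=:\overline{A}x}+z_k,\qquad z_k=X^m(\zeta_k).$$

Note that $z_k\in\mathbf{R}^{\bar m}$, $1\le k\le T$, are zero mean i.i.d. random vectors with covariance matrix $Q[x]$ satisfying, in view of (4.47), the relation

$$Q[x]\preceq Q,\ \text{ where }\ Q_{ij,k\ell}=2\mathrm{Tr}(\vartheta_*A^Te_m^{ij}A\vartheta_*A^Te_m^{k\ell}A),\quad(i,j)\in\mathcal{I}_m,\ (k,\ell)\in\mathcal{I}_m.$$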
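The coordinate maps $X^p$, $V^p$ and the matrix $Q$ are straightforward to implement. The sketch below is an added illustration: the function names (`index_pairs`, `basis`, `X_map`, `V_map`, `covariance_bound`) and the test data are hypothetical, and indices are 0-based. It checks that $X^p$ and $V^p$ are mutually inverse and norm-preserving and that the resulting $Q$ is symmetric positive semidefinite.

```python
import numpy as np

def index_pairs(p):
    """The index set I_p = {(i, j): i <= j}, 0-based."""
    return [(i, j) for i in range(p) for j in range(i, p)]

def basis(p, i, j):
    """Element e_p^{ij} of the standard orthonormal basis of S^p."""
    E = np.zeros((p, p))
    if i == j:
        E[i, i] = 1.0
    else:
        E[i, j] = E[j, i] = 1.0 / np.sqrt(2.0)
    return E

def X_map(v):
    """Coordinates X^p(v) of a symmetric matrix v: entries Tr(v e_p^{ij})."""
    p = v.shape[0]
    return np.array([np.trace(v @ basis(p, i, j)) for i, j in index_pairs(p)])

def V_map(x, p):
    """Inverse map V^p(x) = sum_{ij} x_ij e_p^{ij}."""
    return sum(x_ij * basis(p, i, j) for x_ij, (i, j) in zip(x, index_pairs(p)))

def covariance_bound(A, theta_star):
    """Matrix with entries 2 Tr(theta* A^T e^{ij} A theta* A^T e^{kl} A)."""
    m = A.shape[0]
    K = [theta_star @ A.T @ basis(m, i, j) @ A for i, j in index_pairs(m)]
    return np.array([[2.0 * np.trace(Ka @ Kb) for Kb in K] for Ka in K])

rng = np.random.default_rng(3)
m, n = 3, 4
A = rng.standard_normal((m, n))
G = rng.standard_normal((n, n))
theta_star = G @ G.T + np.eye(n)

S = rng.standard_normal((n, n))
v = (S + S.T) / 2.0
x = X_map(v)
Q_bar = covariance_bound(A, theta_star)
print(x.shape, Q_bar.shape)
```

Here `Q_bar` has size $\bar m\times\bar m$ with $\bar m=m(m+1)/2$, in line with $z_k\in\mathbf{R}^{\bar m}$.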

Our goal is to estimate $\vartheta[v]-\vartheta[0]$, or, which is the same, to recover

$$\overline{B}x:=X^n(\vartheta[V^n(x)]-\vartheta[0]).$$

We assume that the norm in which the estimation error is measured is “transferred” from $\mathbf{S}^n$ to $\mathbf{R}^{\bar n}$; we denote the resulting norm on $\mathbf{R}^{\bar n}$ by $\|\cdot\|$ and assume that the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ is given by spectratopic representation:

$$\{u\in\mathbf{R}^{\bar n}:\|u\|_*\le1\}=\{u\in\mathbf{R}^{\bar n}:\exists y\in\mathcal{Y}:u=My\},\qquad\mathcal{Y}:=\{y\in\mathbf{R}^q:\exists r\in\mathcal{R}:S^2_\ell[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\}. \tag{4.48}$$

The formulated description of the estimation problem fits the premises of Proposition 4.14, specifically:

• the signal $x$ underlying our observation $w^{(T)}=[w_1;...;w_T]$ is known to belong to basic spectratope $\mathcal{X}\subset\mathbf{R}^{\bar n}$, and the observation itself is of the form

$$w^{(T)}=\overline{A}^{(T)}x+z^{(T)},\qquad\overline{A}^{(T)}=[\underbrace{\overline{A};...;\overline{A}}_{T}],\quad z^{(T)}=[z_1;...;z_T];$$

• the noise $z^{(T)}$ is zero mean, and its covariance matrix is $\preceq Q_T:=\mathrm{Diag}\{\underbrace{Q,...,Q}_{T}\}$, which allows us to set $\Pi=\{Q_T\}$;

• our goal is to recover $\overline{B}x$, and the norm $\|\cdot\|$ in which the recovery error is measured satisfies (4.48).

Proposition 4.14 supplies the linear estimate

$$\widehat{x}(w^{(T)})=\sum_{k=1}^TH_{*k}^Tw_k$$


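of $\overline{B}x$ with $H_*=[H_{*1};...;H_{*T}]$ stemming from the optimal solution to the convex optimization problem

$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{H=[H_1;...;H_T],\Lambda,\Upsilon}\Big\{\mathrm{Tr}(\Lambda)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\overline{\Psi}_{\{Q_T\}}(H_1,...,H_T):\\
&&\quad\Lambda\in\mathbf{S}^n_+,\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\
\begin{bmatrix}R^*[\Lambda]&\frac12[\overline{B}^T-\overline{A}^T\sum_kH_k]M\\ \frac12M^T[\overline{B}-[\sum_kH_k]^T\overline{A}]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\Big\}
\end{array}\tag{4.49}$$

where

$$R^*[\Lambda]\in\mathbf{S}^{\bar n}:\quad(R^*[\Lambda])_{ij,k\ell}=\mathrm{Tr}(\Lambda e_n^{ij}e_n^{k\ell}),\quad(i,j)\in\mathcal{I}_n,\ (k,\ell)\in\mathcal{I}_n,$$

and (cf. (4.40))

$$\overline{\Psi}_{\{Q_T\}}(H_1,...,H_T)=\min_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Theta\in\mathbf{S}^{\bar mT},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ell\le L\},\ \begin{bmatrix}\Theta&\frac12[H_1M;...;H_TM]\\ \frac12[M^TH_1^T,...,M^TH_T^T]&\sum_\ell S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\right\}.$$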

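5. Evidently, the function $\overline{\Psi}_{\{Q_T\}}([H_1,...,H_T])$ remains intact when permuting $H_1,...,H_T$; with this in mind, it is clear that permuting $H_1,...,H_T$ and keeping intact $\Lambda$ and $\Upsilon$ is a symmetry of (4.49): such a transformation maps the feasible set onto itself and preserves the value of the objective. Since (4.49) is convex and solvable, it follows that there exists an optimal solution to the problem with $H_1=...=H_T=H$. On the other hand, writing $W[\Upsilon']:=\sum_\ell S^*_\ell[\Upsilon'_\ell]$ for brevity,

$$\begin{array}{l}
\overline{\Psi}_{\{Q_T\}}(H,...,H)\\
\quad=\min\limits_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Theta\in\mathbf{S}^{\bar mT},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ell\le L\},\ \begin{bmatrix}\Theta&\frac12[HM;...;HM]\\ \frac12[M^TH^T,...,M^TH^T]&W[\Upsilon']\end{bmatrix}\succeq0\right\}\\
\quad=\inf\limits_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Theta\in\mathbf{S}^{\bar mT},\ \Upsilon'=\{\Upsilon'_\ell\succ0,\ell\le L\},\ \begin{bmatrix}\Theta&\frac12[HM;...;HM]\\ \frac12[M^TH^T,...,M^TH^T]&W[\Upsilon']\end{bmatrix}\succeq0\right\}\\
\quad=\inf\limits_{\Upsilon',\Theta}\left\{\mathrm{Tr}(Q_T\Theta)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Theta\in\mathbf{S}^{\bar mT},\ \Upsilon'=\{\Upsilon'_\ell\succ0,\ell\le L\},\ \Theta\succeq\tfrac14[HM;...;HM]\,W^{-1}[\Upsilon']\,[HM;...;HM]^T\right\}\\
\quad=\inf\limits_{\Upsilon'}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\tfrac{T}{4}\mathrm{Tr}\left(QHM\,W^{-1}[\Upsilon']\,M^TH^T\right):\ \Upsilon'=\{\Upsilon'_\ell\succ0,\ell\le L\}\right\}
\end{array}$$

due to $Q_T=\mathrm{Diag}\{Q,...,Q\}$, and we arrive at

$$\overline{\Psi}_{\{Q_T\}}(H,...,H)=\min_{\Upsilon',G}\left\{T\,\mathrm{Tr}(QG)+\phi_{\mathcal{R}}(\lambda[\Upsilon']):\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ell\le L\},\ G\in\mathbf{S}^{\bar m},\ \begin{bmatrix}G&\frac12HM\\ \frac12M^TH^T&\sum_\ell S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\right\} \tag{4.50}$$

(we have used the Schur Complement Lemma combined with the fact that $\sum_\ell S^*_\ell[\Upsilon'_\ell]\succ0$ whenever $\Upsilon'_\ell\succ0$ for all $\ell$; see Lemma 4.44).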
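The step that collapses the $T$-block problem rests on the block-diagonal structure of $Q_T$: for any matrix $W\succ0$, $\mathrm{Tr}\big(Q_T\cdot\frac14[HM;...;HM]W^{-1}[HM;...;HM]^T\big)=\frac T4\mathrm{Tr}(QHMW^{-1}M^TH^T)$. A small numerical confirmation (added illustration; `HM` and `W` are hypothetical stand-ins for the product $HM$ and for $\sum_\ell S^*_\ell[\Upsilon'_\ell]$):

```python
import numpy as np

rng = np.random.default_rng(5)
m_bar, nu, T = 3, 2, 4
HM = rng.standard_normal((m_bar, nu))        # stands for the product H M
G = rng.standard_normal((m_bar, m_bar))
Q = G @ G.T                                  # one diagonal block of Q_T
W = np.diag(rng.uniform(0.5, 2.0, nu))       # stands for sum_l S*_l[Upsilon'_l] > 0

Q_T = np.kron(np.eye(T), Q)                  # Diag{Q, ..., Q}, T blocks
stacked = np.vstack([HM] * T)                # [HM; ...; HM]
W_inv = np.linalg.inv(W)

lhs = np.trace(Q_T @ (0.25 * stacked @ W_inv @ stacked.T))
rhs = 0.25 * T * np.trace(Q @ HM @ W_inv @ HM.T)
print(lhs, rhs)
```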


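In view of the above observations, when replacing variables $H$ and $G$ with $\overline{H}=TH$ and $\overline{G}=T^2G$, respectively, problem (4.49), (4.50) becomes

$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{\overline{H},\overline{G},\Lambda,\Upsilon,\Upsilon'}\Big\{\mathrm{Tr}(\Lambda)+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\frac1T\mathrm{Tr}(Q\overline{G}):\\
&&\quad\Lambda\in\mathbf{S}^n_+,\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ell\le L\},\\
&&\quad\begin{bmatrix}R^*[\Lambda]&\frac12[\overline{B}^T-\overline{A}^T\overline{H}]M\\ \frac12M^T[\overline{B}-\overline{H}^T\overline{A}]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0,\quad
\begin{bmatrix}\overline{G}&\frac12\overline{H}M\\ \frac12M^T\overline{H}^T&\sum_\ell S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\},
\end{array}\tag{4.51}$$

and the estimate

$$\widehat{x}(w^T)=\frac1T\overline{H}^T\sum_{k=1}^Tw_k$$

brought about by an optimal solution to (4.51) satisfies $\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}|\mathcal{X}]\le\mathrm{Opt}$ where $\Pi=\{Q_T\}$.

4.3.3.8 Estimation from repeated observations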

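Consider the special case of the situation from Section 4.3.3.1 where observation $\omega$ in (4.31) is a $T$-element sample $\omega=[\bar\omega_1;...;\bar\omega_T]$ with components

$$\bar\omega_t=\overline{A}x+\xi_t,\quad t=1,...,T,$$

and $\xi_t$ are i.i.d. observation noises with zero mean distribution $\overline{P}$ satisfying $\overline{P}\lhd\overline{\Pi}$ for some convex compact set $\overline{\Pi}\subset\mathrm{int}\,\mathbf{S}^{\bar m}_+$. In other words, we are in the situation where

$$A=[\underbrace{\overline{A};...;\overline{A}}_{T}]\in\mathbf{R}^{m\times n}\ \text{ for some }\overline{A}\in\mathbf{R}^{\bar m\times n}\text{ and }m=T\bar m,\qquad\Pi=\{Q=\mathrm{Diag}\{\underbrace{\overline{Q},...,\overline{Q}}_{T}\},\ \overline{Q}\in\overline{\Pi}\}.$$

The same argument as used in item 5 of Section 4.3.3.7 above justifies the following

Proposition 4.15. In the situation in question and under Assumptions A, B, and R the linear estimate of $Bx$ yielded by an optimal solution to problem (4.42) can be found as follows. Consider the convex optimization problem

$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{\overline{H},\Lambda,\Upsilon,\Upsilon',\overline{\Theta}}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{R}}(\lambda[\Upsilon'])+\frac1T\overline{\Gamma}(\overline{\Theta}):\\
&&\quad\Lambda=\{\Lambda_k\succeq0,k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\ \Upsilon'=\{\Upsilon'_\ell\succeq0,\ell\le L\},\\
&&\quad\begin{bmatrix}\sum_kR^*_k[\Lambda_k]&\frac12[B^T-\overline{A}^T\overline{H}]M\\ \frac12M^T[B-\overline{H}^T\overline{A}]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0,\quad
\begin{bmatrix}\overline{\Theta}&\frac12\overline{H}M\\ \frac12M^T\overline{H}^T&\sum_\ell S^*_\ell[\Upsilon'_\ell]\end{bmatrix}\succeq0\Big\}
\end{array}\tag{4.52}$$

where

$$\overline{\Gamma}(\overline{\Theta})=\max_{\overline{Q}\in\overline{\Pi}}\mathrm{Tr}(\overline{Q}\,\overline{\Theta}).$$

The problem is solvable, and the estimate in question is yielded by the $\overline{H}$-component $\overline{H}_*$ of the optimal solution according to

$$\widehat{x}([\bar\omega_1;...;\bar\omega_T])=\frac1T\overline{H}_*^T\sum_{t=1}^T\bar\omega_t.$$

The upper bound provided by Proposition 4.14 on the risk $\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}(\cdot)|\mathcal{X}]$ of this estimate is equal to Opt.

The advantage of this result as compared to what is stated under the circumstances by Proposition 4.14 is that the sizes of optimization problem (4.52) are independent of $T$.

4.3.3.9 Near-optimality in the Gaussian case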

The risk of the linear estimate $\widehat{x}_{H_*}(\cdot)$ constructed in (4.42) can be compared to the minimax optimal risk of recovering $Bx$, $x\in\mathcal{X}$, from observations corrupted by zero mean Gaussian noise with covariance matrix from $\Pi$. Formally, the minimax risk is defined as

$$\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]=\sup_{Q\in\Pi}\inf_{\widehat{x}(\cdot)}\left[\sup_{x\in\mathcal{X}}\mathbf{E}_{\xi\sim\mathcal{N}(0,Q)}\{\|Bx-\widehat{x}(Ax+\xi)\|\}\right] \tag{4.53}$$

where the infimum is taken over all estimates.

Proposition 4.16. Under the premise and in the notation of Proposition 4.14, we have

$$\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]\ge\frac{\mathrm{Opt}}{64\sqrt{(2\ln F+10\ln2)(2\ln D+10\ln2)}}, \tag{4.54}$$

where

$$D=\sum_kd_k,\qquad F=\sum_\ell f_\ell. \tag{4.55}$$

Thus, the upper bound Opt on the risk $\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]$ of the presumably good linear estimate $\widehat{x}_{H_*}$ yielded by an optimal solution to optimization problem (4.42) is within a factor, logarithmic in the sizes of the spectratopes $\mathcal{X}$ and $\mathcal{B}_*$, of the Gaussian minimax risk $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[\mathcal{X}]$.

For the proof, see Section 4.8.5. The key component of the proof is the following fact important in its own right (for proof, see Section 4.8.4):

Lemma 4.17. Let $Y$ be an $N\times\nu$ matrix, let $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal{B}_*$ of the conjugate norm is the spectratope (4.34), and let $\zeta\sim\mathcal{N}(0,Q)$ for some positive semidefinite $N\times N$ matrix $Q$. Then the best upper bound on $\psi_Q(Y):=\mathbf{E}\{\|Y^T\zeta\|\}$ yielded by Lemma 4.11, that is, the optimal value $\mathrm{Opt}[Q]$ in the convex optimization problem (cf. (4.40))

$$\mathrm{Opt}[Q]=\min_{\Theta,\Upsilon}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\mathrm{Tr}(Q\Theta):\ \Upsilon=\{\Upsilon_\ell\succeq0,1\le\ell\le L\},\ \Theta\in\mathbf{S}^N,\ \begin{bmatrix}\Theta&\frac12YM\\ \frac12M^TY^T&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\} \tag{4.56}$$

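(for notation, see Lemma 4.11 and (4.36)), satisfies the identity

$$\forall(Q\succeq0):\quad\mathrm{Opt}[Q]=\overline{\mathrm{Opt}}[Q]:=\min_{G,\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\mathrm{Tr}(G):\ \Upsilon_\ell\succeq0,\ \begin{bmatrix}G&\frac12Q^{1/2}YM\\ \frac12M^TY^TQ^{1/2}&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\}, \tag{4.57}$$

and is a tight bound on $\psi_Q(Y)$, namely,

$$\psi_Q(Y)\le\mathrm{Opt}[Q]\le22\sqrt{2\ln F+10\ln2}\;\psi_Q(Y), \tag{4.58}$$

where $F=\sum_\ell f_\ell$ is the size of the spectratope (4.34). Besides this, for all $\kappa\ge1$ one has

$$\mathrm{Prob}_\zeta\left\{\|Y^T\zeta\|\ge\frac{\mathrm{Opt}[Q]}{4\kappa}\right\}\ge\beta_\kappa:=1-\frac{e^{3/8}}{2}-2Fe^{-\kappa^2/2}. \tag{4.59}$$

In particular, when selecting $\kappa=\sqrt{2\ln F+10\ln2}$, we obtain

$$\mathrm{Prob}_\zeta\left\{\|Y^T\zeta\|\ge\frac{\mathrm{Opt}[Q]}{4\sqrt{2\ln F+10\ln2}}\right\}\ge0.2100>\frac{3}{16}. \tag{4.60}$$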
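The constant in (4.60) follows from (4.59) by direct substitution: with $\kappa=\sqrt{2\ln F+10\ln2}$ one has $2Fe^{-\kappa^2/2}=2F\cdot F^{-1}2^{-5}=1/16$ regardless of $F$, so $\beta_\kappa=1-e^{3/8}/2-1/16$. A short check (added illustration; the function names are hypothetical):

```python
import math

def beta_kappa(F, kappa):
    """beta_kappa from (4.59)."""
    return 1.0 - math.exp(3.0 / 8.0) / 2.0 - 2.0 * F * math.exp(-kappa ** 2 / 2.0)

def kappa_star(F):
    """The choice kappa = sqrt(2 ln F + 10 ln 2) used in (4.60)."""
    return math.sqrt(2.0 * math.log(F) + 10.0 * math.log(2.0))

values = {F: beta_kappa(F, kappa_star(F)) for F in (1, 10, 1000)}
for F, b in values.items():
    print(F, round(b, 6))
```

Combined with Markov's inequality, a probability of at least $3/16$ of exceeding $\mathrm{Opt}[Q]/(4\kappa)$ gives $\mathrm{Opt}[Q]\le\frac{64}{3}\kappa\,\psi_Q(Y)$, which is consistent with the factor 22 in (4.58).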

4.4 LINEAR ESTIMATES OF STOCHASTIC SIGNALS

In the recovery problem considered so far in this chapter, the signal $x$ underlying observation $\omega=Ax+\xi$ was “deterministic uncertain but bounded”: all the a priori information on $x$ was that $x\in\mathcal{X}$ for a given signal set $\mathcal{X}$. There is a well-known alternative model, where the signal $x$ has a random component, specifically, $x=[\eta;u]$ where the “stochastic component” $\eta$ is random with (partly) known probability distribution $P_\eta$, and the “deterministic component” $u$ is known to belong to a given set $\mathcal{X}$. As a typical example, consider a linear dynamical system given by

$$\begin{array}{rcl}y_{t+1}&=&P_ty_t+\eta_t+u_t,\\ \omega_t&=&C_ty_t+\xi_t,\quad1\le t\le T,\end{array}\tag{4.61}$$

where $y_t$, $\eta_t$, and $u_t$ are, respectively, the state, the random “process noise,” and the deterministic “uncertain but bounded” disturbance affecting the system at time $t$, $\omega_t$ is the output (it is what we observe at time $t$), and $\xi_t$ is the observation noise. We assume that the matrices $P_t$, $C_t$ are known in advance. Note that the trajectory $y=[y_1;...;y_T]$ of the states depends not only on the trajectories of process noises $\eta_t$ and disturbances $u_t$, but also on the initial state $y_1$, which can be modeled as a realization of either the initial noise $\eta_0$, or the initial disturbance $u_0$. When $u_t\equiv0$, $y_1=\eta_0$ and the random vectors $\{\eta_t,0\le t\le T,\ \xi_t,1\le t\le T\}$ are zero mean Gaussian independent of each other, (4.61) is the model underlying the celebrated Kalman filter [143, 144, 171, 172].

Now, given model (4.61), we can use the equations of the model to represent the trajectory of the states as a linear image of the trajectory of noises $\eta=\{\eta_t\}$ and the trajectory of disturbances $u=\{u_t\}$,

$$y=P\eta+Qu$$

(recall that the initial state is either the component $\eta_0$ of $\eta$, or the component $u_0$ of $u$), and our “full observation” becomes

$$\omega=[\omega_1;...;\omega_T]=A[\eta;u]+\xi,\qquad\xi=[\xi_1;...;\xi_T].$$

A typical statistical problem associated with the outlined situation is to estimate the linear image $B[\eta;u]$ of the “signal” $x=[\eta;u]$ underlying the observation. For example, when speaking about (4.61), the goal could be to recover $y_{T+1}$ (“forecast”).

We arrive at the following estimation problem:

Given noisy observation

$$\omega=Ax+\xi\in\mathbf{R}^m$$

of signal $x=[\eta;u]$ with random component $\eta\in\mathbf{R}^p$ and deterministic component $u$ known to belong to a given set $\mathcal{X}\subset\mathbf{R}^q$, we want to recover the image $Bx\in\mathbf{R}^\nu$ of the signal. Here $A$ and $B$ are given matrices, $\eta$ is independent of $\xi$, and we have a priori (perhaps, incomplete) information on the probability distribution $P_\eta$ of $\eta$, specifically, we know that $P_\eta\in\mathcal{P}_\eta$ for a given family $\mathcal{P}_\eta$ of probability distributions. Similarly, we assume that what we know about the noise $\xi$ is that its distribution belongs to a given family $\mathcal{P}_\xi$ of distributions on the observation space.

Given a norm $\|\cdot\|$ on the image space of $B$, it makes sense to specify the risk of a candidate estimate $\widehat{x}(\omega)$ by taking the expectation of the norm $\|\widehat{x}(A[\eta;u]+\xi)-B[\eta;u]\|$ of the error over both $\xi$ and $\eta$ and then taking the supremum of the result over the allowed distributions of $\eta$, $\xi$ and over $u\in\mathcal{X}$:

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}]=\sup_{u\in\mathcal{X}}\ \sup_{P_\xi\in\mathcal{P}_\xi,P_\eta\in\mathcal{P}_\eta}\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\{\|\widehat{x}(A[\eta;u]+\xi)-B[\eta;u]\|\}.$$

When $\|\cdot\|=\|\cdot\|_2$ and all distributions from $\mathcal{P}_\xi$ and $\mathcal{P}_\eta$ are with zero means and finite covariance matrices, it is technically more convenient to operate with the Euclidean risk

$$\mathrm{Risk}_{\mathrm{Eucl}}[\widehat{x}]=\left[\sup_{u\in\mathcal{X}}\ \sup_{P_\xi\in\mathcal{P}_\xi,P_\eta\in\mathcal{P}_\eta}\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|\widehat{x}(A[\eta;u]+\xi)-B[\eta;u]\|_2^2\right\}\right]^{1/2}.$$

Our next goal is to show that as far as the design of “presumably good” linear estimates $\widehat{x}(\omega)=H^T\omega$ is concerned, the techniques developed so far can be straightforwardly extended to the case of signals with random component.
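To make the representation of the state trajectory of (4.61) as a linear image of the inputs concrete, here is a minimal sketch (an added illustration). It assumes, for this example only, that states are $y_1,...,y_T$, that $y_1=\eta_0+u_0$, and that $y_{t+1}=P_ty_t+\eta_t+u_t$ for $1\le t\le T-1$; under these conventions the matrices multiplying $\eta$ and $u$ coincide. The function name `transition_matrix` is hypothetical.

```python
import numpy as np

def transition_matrix(P_list, d):
    """Block lower-triangular G with y = G @ (eta + u), P_list = [P_1, ..., P_{T-1}].

    Row block t holds y_{t+1}; column block s holds the input at time s.
    """
    T = len(P_list) + 1
    G = np.zeros((T * d, T * d))
    for s in range(T):
        Phi = np.eye(d)                       # influence of input s on y_{s+1}
        for t in range(s, T):
            G[t * d:(t + 1) * d, s * d:(s + 1) * d] = Phi
            if t < T - 1:
                Phi = P_list[t] @ Phi         # propagate one step: y_{t+2} = P_{t+1} y_{t+1} + ...
    return G

rng = np.random.default_rng(3)
d, T = 2, 5
P_list = [0.5 * rng.standard_normal((d, d)) for _ in range(T - 1)]
eta = rng.standard_normal((T, d))             # eta_0, ..., eta_{T-1}
u = rng.uniform(-1.0, 1.0, size=(T, d))       # u_0, ..., u_{T-1}

# Direct simulation of the recursion
y = np.zeros((T, d))
y[0] = eta[0] + u[0]
for t in range(1, T):
    y[t] = P_list[t - 1] @ y[t - 1] + eta[t] + u[t]

# The same trajectory as a linear image of the stacked inputs
G = transition_matrix(P_list, d)
y_linear = G @ (eta + u).reshape(-1)
print(np.max(np.abs(y.reshape(-1) - y_linear)))
```

Stacking the outputs $\omega_t=C_ty_t+\xi_t$ then yields the observation matrix $A$ of the text as the product of $\mathrm{Diag}\{C_1,...,C_T\}$ with the corresponding blocks of `G`.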

4.4.1 Minimizing Euclidean risk

For the time being, assume that $\mathcal{P}_\xi$ is comprised of all probability distributions $P$ on $\mathbf{R}^m$ with zero mean and covariance matrices $\mathrm{Cov}[P]=\mathbf{E}_{\xi\sim P}\{\xi\xi^T\}$ running through a computationally tractable convex compact subset $\mathcal{Q}_\xi\subset\mathrm{int}\,\mathbf{S}^m_+$, and $\mathcal{P}_\eta$ is comprised of all probability distributions $P$ on $\mathbf{R}^p$ with zero mean and covariance matrices running through a computationally tractable convex compact subset $\mathcal{Q}_\eta\subset\mathrm{int}\,\mathbf{S}^p_+$. Let, in addition, $\mathcal{X}$ be a basic spectratope:

$$\mathcal{X}=\{x\in\mathbf{R}^q:\exists t\in\mathcal{T}:R_k^2[x]\preceq t_kI_{d_k},\ k\le K\}$$

with our standard restrictions on $\mathcal{T}$ and $R_k[\cdot]$. Let us derive an efficiently solvable convex optimization problem “responsible” for a presumably good, in terms of its Euclidean risk, linear estimate. For a linear estimate $H^T\omega$, $u\in\mathcal{X}$, $P_\xi\in\mathcal{P}_\xi$, $P_\eta\in\mathcal{P}_\eta$, denoting by $Q_\xi$ and $Q_\eta$ the covariance matrices of $P_\xi$ and $P_\eta$, and partitioning $A$ as $A=[A_\eta,A_u]$ and $B=[B_\eta,B_u]$ according to the partition $x=[\eta;u]$, we have

$$\begin{array}{l}
\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|H^T(A[\eta;u]+\xi)-B[\eta;u]\|_2^2\right\}\\
\quad=\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|[H^TA_\eta-B_\eta]\eta+H^T\xi+[H^TA_u-B_u]u\|_2^2\right\}\\
\quad=u^T[B_u-H^TA_u]^T[B_u-H^TA_u]u+\mathbf{E}_{\xi\sim P_\xi}\left\{\mathrm{Tr}(H^T\xi\xi^TH)\right\}+\mathbf{E}_{\eta\sim P_\eta}\left\{\mathrm{Tr}([B_\eta-H^TA_\eta]\eta\eta^T[B_\eta-H^TA_\eta]^T)\right\}\\
\quad=u^T[B_u-H^TA_u]^T[B_u-H^TA_u]u+\mathrm{Tr}(H^TQ_\xi H)+\mathrm{Tr}([B_\eta-H^TA_\eta]Q_\eta[B_\eta-H^TA_\eta]^T).
\end{array}$$

Hence, the squared Euclidean risk of the linear estimate $\widehat{x}_H(\omega)=H^T\omega$ is

$$\begin{array}{rcl}
\mathrm{Risk}^2_{\mathrm{Eucl}}[\widehat{x}_H]&=&\Phi(H)+\Psi_\xi(H)+\Psi_\eta(H),\\
\Phi(H)&=&\max\limits_{u\in\mathcal{X}}u^T[B_u-H^TA_u]^T[B_u-H^TA_u]u,\\
\Psi_\xi(H)&=&\max\limits_{Q\in\mathcal{Q}_\xi}\mathrm{Tr}(H^TQH),\\
\Psi_\eta(H)&=&\max\limits_{Q\in\mathcal{Q}_\eta}\mathrm{Tr}([B_\eta-H^TA_\eta]Q[B_\eta-H^TA_\eta]^T).
\end{array}$$

Functions $\Psi_\xi$ and $\Psi_\eta$ are convex and efficiently computable, function $\Phi(H)$, by Proposition 4.8, admits an efficiently computable convex upper bound

$$\overline{\Phi}(H)=\min_\Lambda\left\{\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \Lambda=\{\Lambda_k\succeq0,k\le K\},\ [B_u-H^TA_u]^T[B_u-H^TA_u]\preceq\sum_kR^*_k[\Lambda_k]\right\}$$

which is tight within the factor $2\max[\ln(2\sum_kd_k),1]$ (see Proposition 4.8). Thus, the efficiently solvable convex problem yielding a presumably good linear estimate is

$$\mathrm{Opt}=\min_H\left[\overline{\Phi}(H)+\Psi_\xi(H)+\Psi_\eta(H)\right];$$

the Euclidean risk of the linear estimate $H_*^T\omega$ yielded by the optimal solution to the problem is upper-bounded by $\sqrt{\mathrm{Opt}}$ and is within factor $\sqrt{2\max[\ln(2\sum_kd_k),1]}$ of the minimal Euclidean risk achievable with linear estimates.
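The three-term decomposition of the squared error is easy to confirm numerically. The sketch below is an added illustration: all matrices and the point $u$ are arbitrary test data, and Gaussian distributions are used only as a convenient instance of zero mean distributions with the prescribed covariances. It compares the closed-form expression with a Monte Carlo average for fixed $u$, $Q_\xi$, $Q_\eta$.

```python
import numpy as np

rng = np.random.default_rng(4)
m, p, q, nu = 5, 3, 2, 2
A_eta, A_u = rng.standard_normal((m, p)), rng.standard_normal((m, q))
B_eta, B_u = rng.standard_normal((nu, p)), rng.standard_normal((nu, q))
H = 0.3 * rng.standard_normal((m, nu))
Q_xi = 0.5 * np.eye(m)
Q_eta = np.diag([1.0, 0.5, 0.25])
u = np.array([0.7, -0.4])

# Closed form: bias term + noise term + stochastic-signal term
E_u = B_u - H.T @ A_u
E_eta = B_eta - H.T @ A_eta
formula = (u @ E_u.T @ E_u @ u
           + np.trace(H.T @ Q_xi @ H)
           + np.trace(E_eta @ Q_eta @ E_eta.T))

# Monte Carlo average of ||H^T omega - B[eta; u]||_2^2
N = 400_000
xi = rng.multivariate_normal(np.zeros(m), Q_xi, size=N)
eta = rng.multivariate_normal(np.zeros(p), Q_eta, size=N)
omega = eta @ A_eta.T + A_u @ u + xi          # rows: A[eta; u] + xi
err = omega @ H - (eta @ B_eta.T + B_u @ u)   # rows: H^T omega - B[eta; u]
monte_carlo = np.mean(np.sum(err ** 2, axis=1))

print(f"formula {formula:.4f}, Monte Carlo {monte_carlo:.4f}")
```

The cross terms vanish because $\xi$ and $\eta$ have zero mean and are independent, which is exactly what the second equality in the derivation uses.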

4.4.2 Minimizing $\|\cdot\|$-risk

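Now let $\mathcal{P}_\xi$ be comprised of all probability distributions $P$ on $\mathbf{R}^m$ with matrices of second moments $\mathrm{Var}[P]=\mathbf{E}_{\xi\sim P}\{\xi\xi^T\}$ running through a computationally tractable convex compact subset $\mathcal{Q}_\xi\subset\mathrm{int}\,\mathbf{S}^m_+$, and $\mathcal{P}_\eta$ be comprised of all probability distributions $P$ on $\mathbf{R}^p$ with matrices of second moments $\mathrm{Var}[P]$ running through a computationally tractable convex compact subset $\mathcal{Q}_\eta\subset\mathrm{int}\,\mathbf{S}^p_+$. Let, as above, $\mathcal{X}$ be a basic spectratope,

$$\mathcal{X}=\{u\in\mathbf{R}^q:\exists t\in\mathcal{T}:R_k^2[u]\preceq t_kI_{d_k},\ k\le K\},$$

and let $\|\cdot\|$ be such that the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ is a spectratope:

$$\mathcal{B}_*=\{y:\|y\|_*\le1\}=\left\{y\in\mathbf{R}^\nu:\exists(r\in\mathcal{R},z\in\mathbf{R}^N):y=Mz,\ S_\ell^2[z]\preceq r_\ell I_{f_\ell},\ \ell\le L\right\},$$

with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$, $R_k[\cdot]$ and $S_\ell[\cdot]$. Here the efficiently solvable convex optimization problem “responsible” for a presumably good, in terms of its risk $\mathrm{Risk}_{\|\cdot\|}$, linear estimate can be built as follows. For a linear estimate $H^T\omega$, $u\in\mathcal{X}$, $P_\xi\in\mathcal{P}_\xi$, $P_\eta\in\mathcal{P}_\eta$, denoting by $Q_\xi$ and $Q_\eta$ the matrices of second moments of $P_\xi$ and $P_\eta$, and partitioning $A$ as $A=[A_\eta,A_u]$ and $B=[B_\eta,B_u]$ according to the partition $x=[\eta;u]$, we have

$$\begin{array}{l}
\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|H^T(A[\eta;u]+\xi)-B[\eta;u]\|\right\}\\
\quad=\mathbf{E}_{[\xi;\eta]\sim P_\xi\times P_\eta}\left\{\|[H^TA_\eta-B_\eta]\eta+H^T\xi+[H^TA_u-B_u]u\|\right\}\\
\quad\le\|[B_u-H^TA_u]u\|+\mathbf{E}_{\xi\sim P_\xi}\left\{\|H^T\xi\|\right\}+\mathbf{E}_{\eta\sim P_\eta}\left\{\|[B_\eta-H^TA_\eta]\eta\|\right\}.
\end{array}$$

It follows that for a linear estimate $\widehat{x}_H(\omega)=H^T\omega$ one has

$$\begin{array}{rcl}
\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H]&\le&\Phi(H)+\Psi_\xi(H)+\Psi_\eta(H),\\
\Phi(H)&=&\max_{u\in\mathcal{X}}\|[B_u-H^TA_u]u\|,\\
\Psi_\xi(H)&=&\sup_{P_\xi\in\mathcal{P}_\xi}\mathbf{E}_{\xi\sim P_\xi}\{\|H^T\xi\|\},\\
\Psi_\eta(H)&=&\sup_{P_\eta\in\mathcal{P}_\eta}\mathbf{E}_{\eta\sim P_\eta}\{\|[B_\eta-H^TA_\eta]\eta\|\}.
\end{array}$$

As was shown in Section 4.3.3.3, the functions $\Phi$, $\Psi_\xi$, $\Psi_\eta$ admit efficiently computable upper bounds as follows (for notation, see Section 4.3.3.3):

$$\begin{array}{l}
\Phi(H)\le\overline{\Phi}(H):=\min\limits_{\Lambda,\Upsilon}\left\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon]):\ \Lambda=\{\Lambda_k\succeq0,k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\ \begin{bmatrix}\sum_kR^*_k[\Lambda_k]&\frac12[B_u^T-A_u^TH]M\\ \frac12M^T[B_u-H^TA_u]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\};\\[2ex]
\Psi_\xi(H)\le\overline{\Psi}_\xi(H):=\min\limits_{\Upsilon,G}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\max\limits_{Q\in\mathcal{Q}_\xi}\mathrm{Tr}(GQ):\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\ \begin{bmatrix}G&\frac12HM\\ \frac12M^TH^T&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\},\\[2ex]
\Psi_\eta(H)\le\overline{\Psi}_\eta(H):=\min\limits_{\Upsilon,G}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\max\limits_{Q\in\mathcal{Q}_\eta}\mathrm{Tr}(GQ):\ \Upsilon=\{\Upsilon_\ell\succeq0,\ell\le L\},\ \begin{bmatrix}G&\frac12[B_\eta^T-A_\eta^TH]M\\ \frac12M^T[B_\eta-H^TA_\eta]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\},
\end{array}$$

and these bounds are reasonably tight (for details on tightness, see Proposition 4.8 and Lemma 4.17). As a result, to get a presumably good linear estimate, one needs to solve the efficiently solvable convex optimization problem

$$\mathrm{Opt}=\min_H\left[\overline{\Phi}(H)+\overline{\Psi}_\xi(H)+\overline{\Psi}_\eta(H)\right].$$

The linear estimate $\widehat{x}_{H_*}=H_*^T\omega$ yielded by an optimal solution $H_*$ to this problem admits the risk bound

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}]\le\mathrm{Opt}.$$

Note that the above derivation did not use independence of $\xi$ and $\eta$.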

4.5 LINEAR ESTIMATION UNDER UNCERTAIN-BUT-BOUNDED NOISE

So far, the main subject of our interest was recovering (linear images of) signals via indirect observations of these signals corrupted by random noise. In this section, we focus on alternative observation schemes: those with “uncertain-but-bounded” and “mixed” noise.

4.5.1 Uncertain-but-bounded noise

Consider the estimation problem where, given observation

$$\omega=Ax+\eta \tag{4.62}$$

of unknown signal $x$ known to belong to a given signal set $\mathcal{X}$, one wants to recover linear image $Bx$ of $x$. Here $A$ and $B$ are given $m\times n$ and $\nu\times n$ matrices. The situation looks exactly as before; the difference with our previous considerations is that now we do not assume the observation noise to be random: all we assume about $\eta$ is that it belongs to a given compact set $\mathcal{H}$ (“uncertain-but-bounded observation noise”). In the situation in question, a natural definition of the risk on $\mathcal{X}$ of a candidate estimate $\omega\mapsto\widehat{x}(\omega)$ is

$$\mathrm{Risk}_{\mathcal{H},\|\cdot\|}[\widehat{x}|\mathcal{X}]=\sup_{x\in\mathcal{X},\eta\in\mathcal{H}}\|Bx-\widehat{x}(Ax+\eta)\|$$

(“$\mathcal{H}$-risk”).

We are about to prove that when $\mathcal{X}$, $\mathcal{H}$, and the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ are spectratopes, which we assume from now on, an efficiently computable linear estimate is near-optimal in terms of its $\mathcal{H}$-risk.

Our initial observation is that in this case the model (4.62) reduces straightforwardly to the model without observation noise. Indeed, let $\mathcal{Y}=\mathcal{X}\times\mathcal{H}$; then $\mathcal{Y}$ is a spectratope, and we lose nothing when assuming that the signal underlying observation $\omega$ is $y=[x;\eta]\in\mathcal{Y}$:

$$\omega=Ax+\eta=\overline{A}y,\qquad\overline{A}=[A,I_m],$$

while the entity to be recovered is

$$Bx=\overline{B}y,\qquad\overline{B}=[B,0_{\nu\times m}].$$

With these conventions, the $\mathcal{H}$-risk of a candidate estimate $\widehat{x}(\cdot):\mathbf{R}^m\to\mathbf{R}^\nu$ becomes the quantity

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}\times\mathcal{H}]=\sup_{y=[x;\eta]\in\mathcal{X}\times\mathcal{H}}\|\overline{B}y-\widehat{x}(\overline{A}y)\|,$$

and we indeed arrive at the situation where the observation noise is identically zero.

To avoid messy notation, let us assume that the outlined reduction has been carried out in advance, so that

(!) The problem of interest is to recover the linear image $Bx\in\mathbf{R}^\nu$ of an unknown signal $x$ known to belong to a given spectratope $\mathcal{X}$ (which, as always, we can assume w.l.o.g. to be basic) from (noiseless) observation $\omega=Ax\in\mathbf{R}^m$. The risk of a candidate estimate is defined as

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}]=\sup_{x\in\mathcal{X}}\|Bx-\widehat{x}(Ax)\|,$$

where $\|\cdot\|$ is a given norm with a spectratope $\mathcal{B}_*$ (see (4.34)) as the unit ball of the conjugate norm:

$$\begin{array}{rcl}
\mathcal{X}&=&\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:R_k^2[x]\preceq t_kI_{d_k},\ k\le K\},\\
\mathcal{B}_*&=&\{z\in\mathbf{R}^\nu:\exists y\in\mathcal{Y}:z=My\},\quad\mathcal{Y}:=\{y\in\mathbf{R}^q:\exists r\in\mathcal{R}:S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\},
\end{array}\tag{4.63}$$

with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$ and $R_k[\cdot]$, $S_\ell[\cdot]$.

4.5.1.1 Building a linear estimate

Let us build a presumably good linear estimate. For a linear estimate $\widehat{x}_H(\omega)=H^T\omega$, we have

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_H|\mathcal{X}]=\max_{x\in\mathcal{X}}\|(B-H^TA)x\|=\max_{[u;x]\in\mathcal{B}_*\times\mathcal{X}}[u;x]^T\begin{bmatrix}&\frac12(B-H^TA)\\ \frac12(B-H^TA)^T&\end{bmatrix}[u;x].$$

Applying Proposition 4.8, we arrive at the following:

Proposition 4.18. In the situation of this section, consider the convex optimization problem

$$\mathrm{Opt}_\#=\min_{H,\Upsilon=\{\Upsilon_\ell\},\Lambda=\{\Lambda_k\}}\left\{\phi_{\mathcal{R}}(\lambda[\Upsilon])+\phi_{\mathcal{T}}(\lambda[\Lambda]):\ \Upsilon_\ell\succeq0,\ \Lambda_k\succeq0\ \forall(\ell,k),\ \begin{bmatrix}\sum_kR^*_k[\Lambda_k]&\frac12[B-H^TA]^TM\\ \frac12M^T[B-H^TA]&\sum_\ell S^*_\ell[\Upsilon_\ell]\end{bmatrix}\succeq0\right\}, \tag{4.64}$$

where $R^*_k[\cdot]$, $S^*_\ell[\cdot]$ are induced by $R_k[\cdot]$, $S_\ell[\cdot]$, respectively, as explained in Section 4.3.1. The problem is solvable, and the risk of the linear estimate $\widehat{x}_{H_*}(\cdot)$ yielded by the $H$-component of an optimal solution does not exceed $\mathrm{Opt}_\#$.

For proof, see Section 4.8.6.1.

4.5.1.2 Near-optimality

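Proposition 4.19. The linear estimate $\widehat{x}_{H_*}$ yielded by Proposition 4.18 is near-optimal in terms of its risk:

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]\le\mathrm{Opt}_\#\le O(1)\ln(D)\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}],\qquad D=\sum_kd_k+\sum_\ell f_\ell, \tag{4.65}$$

where $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ is the minimax optimal risk:

$$\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]=\inf_{\widehat{x}}\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}]$$

with inf taken w.r.t. all Borel estimates.

Remark 4.20. When $\mathcal{X}$ and $\mathcal{B}_*$ are ellitopes rather than spectratopes,

$$\begin{array}{rcl}
\mathcal{X}&=&\{x\in\mathbf{R}^n:\exists t\in\mathcal{T}:x^TR_kx\le t_k,\ k\le K\},\\
\mathcal{B}_*&:=&\{u\in\mathbf{R}^\nu:\|u\|_*\le1\}=\{u\in\mathbf{R}^\nu:\exists r\in\mathcal{R},z:u=Mz,\ z^TS_\ell z\le r_\ell,\ \ell\le L\}\\
&&[R_k\succeq0,\ \sum_kR_k\succ0,\ S_\ell\succeq0,\ \sum_\ell S_\ell\succ0],
\end{array}$$

problem (4.64) becomes

$$\mathrm{Opt}_\#=\min_{H,\lambda,\mu}\left\{\phi_{\mathcal{R}}(\mu)+\phi_{\mathcal{T}}(\lambda):\ \lambda\ge0,\ \mu\ge0,\ \begin{bmatrix}\sum_k\lambda_kR_k&\frac12[B-H^TA]^TM\\ \frac12M^T[B-H^TA]&\sum_\ell\mu_\ell S_\ell\end{bmatrix}\succeq0\right\},$$

and (4.65) can be strengthened to

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}]\le\mathrm{Opt}_\#\le O(1)\ln(K+L)\,\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}].$$

For proofs, see Section 4.8.6.

4.5.1.3 Nonlinear estimation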

The uncertain-but-bounded model of observation error makes it easy to point out an efficiently computable near-optimal nonlinear estimate. Indeed, in the situation described at the beginning of Section 4.5.1, let us assume that the range of observation error $\eta$ is

$$\mathcal{H}=\{\eta\in\mathbf{R}^m:\|\eta\|_{(m)}\le\sigma\},$$

where $\|\cdot\|_{(m)}$ and $\sigma>0$ are a given norm on $\mathbf{R}^m$ and a given error bound, and let us measure the recovery error by a given norm $\|\cdot\|_{(\nu)}$ on $\mathbf{R}^\nu$. We can immediately point out a (nonlinear) estimate optimal within factor 2 in terms of its $\mathcal{H}$-risk, namely, estimate $\widehat{x}_*$, as follows:

Given $\omega$, we solve the feasibility problem

$$\text{find }x\in\mathcal{X}:\ \|Ax-\omega\|_{(m)}\le\sigma. \tag{$F[\omega]$}$$

Let $x_\omega$ be a feasible solution; we set $\widehat{x}_*(\omega)=Bx_\omega$.

Note that the estimate is well-defined, since $(F[\omega])$ clearly is solvable, with one of the feasible solutions being the true signal underlying observation $\omega$. When $\mathcal{X}$ is a computationally tractable convex compact set, and $\|\cdot\|_{(m)}$ is an efficiently computable norm, a feasible solution to $(F[\omega])$ can be found in a computationally efficient fashion. Let us make the following immediate observation:

Proposition 4.21. The estimate $\widehat{x}_*$ is optimal within factor 2:

$$\mathrm{Risk}_{\mathcal{H}}[\widehat{x}_*|\mathcal{X}]\le\mathrm{Opt}_*:=\sup_{x,y}\left\{\|Bx-By\|_{(\nu)}:\ x,y\in\mathcal{X},\ \|A(x-y)\|_{(m)}\le2\sigma\right\}\le2\,\mathrm{Risk}_{\mathrm{opt},\mathcal{H}} \tag{4.66}$$

where $\mathrm{Risk}_{\mathrm{opt},\mathcal{H}}$ is the infimum of $\mathcal{H}$-risk over all estimates.

The proof of the proposition is the subject of Exercise 4.28.

4.5.1.4 Quantifying risk

Note that Proposition 4.21 does not impose restrictions on $\mathcal{X}$ and the norms $\|\cdot\|_{(m)}$, $\|\cdot\|_{(\nu)}$. The only, but essential, shortcoming of the estimate $\widehat{x}_*$ is that we do not know, in general, what its $\mathcal{H}$-risk is. From (4.66) it follows that this risk is tightly (namely, within factor 2) upper-bounded by $\mathrm{Opt}_*$, but this quantity, being the maximum of a convex function over some domain, can be difficult to compute. Aside from a handful of special cases where this difficulty does not arise, there is a generic situation when $\mathrm{Opt}_*$ can be tightly upper-bounded by efficient computation. This is the situation where $\mathcal{X}$ is the spectratope defined in (4.63), $\|\cdot\|_{(m)}$ is such that the unit ball of this norm is a basic spectratope,

$$\mathcal{B}_{(m)}:=\{u:\|u\|_{(m)}\le1\}=\{u\in\mathbf{R}^m:\exists p\in\mathcal{P}:Q_j^2[u]\preceq p_jI_{e_j},\ 1\le j\le J\},$$

and the unit ball of the norm $\|\cdot\|_{(\nu),*}$ conjugate to the norm $\|\cdot\|_{(\nu)}$ is a spectratope,

$$\mathcal{B}^*_{(\nu)}:=\{v\in\mathbf{R}^\nu:\|v\|_{(\nu),*}\le1\}=\{v:\exists(w\in\mathbf{R}^N,r\in\mathcal{R}):v=Mw,\ S_\ell^2[w]\preceq r_\ell I_{f_\ell},\ 1\le\ell\le L\},$$

with the usual restrictions on $\mathcal{P}$, $\mathcal{R}$, $Q_j[\cdot]$, and $S_\ell[\cdot]$.

Proposition 4.22. In the situation in question, consider the convex optimization problem

$$\begin{array}{rcl}
\mathrm{Opt}&=&\min\limits_{\Lambda=\{\Lambda_k,k\le K\},\ \Upsilon=\{\Upsilon_\ell,\ell\le L\},\ \Sigma=\{\Sigma_j,j\le J\}}\Big\{\phi_{\mathcal{T}}(\lambda[\Lambda])+\phi_{\mathcal{R}}(\lambda[\Upsilon])+\sigma^2\phi_{\mathcal{P}}(\lambda[\Sigma]):\\
&&\quad\Lambda_k\succeq0,\ \Upsilon_\ell\succeq0,\ \Sigma_j\succeq0\ \forall(k,\ell,j),\quad
\begin{bmatrix}\sum_\ell S^*_\ell[\Upsilon_\ell]&M^TB\\ B^TM&\sum_kR^*_k[\Lambda_k]+A^T[\sum_jQ^*_j[\Sigma_j]]A\end{bmatrix}\succeq0\Big\}
\end{array}\tag{4.67}$$

where $R^*_k[\cdot]$ is associated with mapping $x\mapsto R_k[x]$ according to (4.25), $S^*_\ell[\cdot]$ and $Q^*_j[\cdot]$ are associated in the same fashion with mappings $w\mapsto S_\ell[w]$ and $u\mapsto Q_j[u]$, respectively, and $\phi_{\mathcal{T}}$, $\phi_{\mathcal{R}}$, and $\phi_{\mathcal{P}}$ are the support functions of the corresponding sets $\mathcal{T}$, $\mathcal{R}$, and $\mathcal{P}$. The optimal value in (4.67) is an efficiently computable upper bound on the quantity $\mathrm{Opt}_*$ defined in (4.66), and this bound is tight within factor

$$2\max[\ln(2D),1],\qquad D=\sum_kd_k+\sum_\ell f_\ell+\sum_je_j.$$

Proof of the proposition is the subject of Exercise 4.29.

4.5.2 Mixed noise

So far, we have considered separately the cases of random and uncertain-but-bounded observation noises in (4.31). Note that both these observation schemes are covered by the following “mixed” scheme:

\[
\omega = Ax+\xi+\eta,
\]

where, as above, $A$ is a given $m\times n$ matrix, $x$ is an unknown deterministic signal known to belong to a given signal set $\mathcal X$, $\xi$ is random noise with distribution known to belong to a family $\mathcal P$ of Borel probability distributions on $\mathbf R^m$ satisfying (4.32) for a given convex compact set $\Pi\subset\mathrm{int}\,\mathbf S^m_+$, and $\eta$ is an “uncertain-but-bounded” observation error known to belong to a given set $\mathcal H$. As before, our goal is to estimate $Bx\in\mathbf R^\nu$ via observation $\omega$. In our present setting, given a norm $\|\cdot\|$ on $\mathbf R^\nu$, we can quantify the performance of a candidate estimate $\omega\mapsto\hat x(\omega):\mathbf R^m\to\mathbf R^\nu$ by its risk

\[
\mathrm{Risk}_{\Pi,\mathcal H,\|\cdot\|}[\hat x|\mathcal X] = \sup_{x\in\mathcal X,\ P\lhd\Pi,\ \eta\in\mathcal H} \mathbf E_{\xi\sim P}\left\{\|Bx-\hat x(Ax+\xi+\eta)\|\right\}.
\]

Observe that the estimation problem associated with the “mixed” observation scheme straightforwardly reduces to a similar problem for the random observation scheme, by the same trick we have used in Section 4.5 to eliminate the observation noise. Indeed, let us treat $x^+=[x;\eta]\in\mathcal X^+:=\mathcal X\times\mathcal H$ and $\mathcal X^+$ as the new signal/signal set underlying our observation, and set $\bar Ax^+ = Ax+\eta$, $\bar Bx^+ = Bx$. With these conventions, the “mixed” observation scheme reduces to

\[
\omega=\bar Ax^++\xi,
\]

and for every candidate estimate $\hat x(\cdot)$ it clearly holds

\[
\mathrm{Risk}_{\Pi,\mathcal H,\|\cdot\|}[\hat x|\mathcal X] = \mathrm{Risk}_{\Pi,\|\cdot\|}[\hat x|\mathcal X^+],
\]

so that we find ourselves in the situation of Section 4.3.3.1. Assuming that $\mathcal X$ and $\mathcal H$ are spectratopes, so is $\mathcal X^+$, meaning that all results of Section 4.3.3 on building presumably good linear estimates and their near-optimality are applicable to our present setup.
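The reduction above amounts to passing from $(A,B)$ to the augmented matrices $\bar A=[A,\,I_m]$ and $\bar B=[B,\,0]$ acting on $x^+=[x;\eta]$. The following is a minimal sketch of this bookkeeping in plain Python; the helper names `augment` and `matvec` and the toy data are our own illustrative assumptions, not part of the text.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def augment(A, B):
    """Return (A_bar, B_bar) with A_bar = [A, I_m] and B_bar = [B, 0].

    Then A_bar @ [x; eta] = A x + eta and B_bar @ [x; eta] = B x.
    """
    m = len(A)
    A_bar = [list(row) + [1.0 if i == j else 0.0 for j in range(m)]
             for i, row in enumerate(A)]
    B_bar = [list(row) + [0.0] * m for row in B]
    return A_bar, B_bar


# Toy data (hypothetical): m = 3 observations, n = 2, nu = 1.
A = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
B = [[1.0, 0.0]]
x = [0.5, -1.0]
eta = [0.1, -0.2, 0.3]          # uncertain-but-bounded error

A_bar, B_bar = augment(A, B)
x_plus = x + eta                # x^+ = [x; eta]

noiseless_mixed = [a + e for a, e in zip(matvec(A, x), eta)]   # A x + eta
noiseless_reduced = matvec(A_bar, x_plus)                      # A_bar x^+
```

Both noiseless observations coincide, so any estimate built for the pair $(\bar A,\bar B)$ and the signal set $\mathcal X^+$ can be applied to the mixed scheme verbatim.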

4.6 CALCULUS OF ELLITOPES/SPECTRATOPES

We present here the rules of the calculus of ellitopes/spectratopes. We formulate these rules for ellitopes; the “spectratopic versions” of the rules are straightforward modifications of their “ellitopic versions.”

• Intersection $\mathcal X=\bigcap_{i=1}^I\mathcal X_i$ of ellitopes

\[
\mathcal X_i=\{x\in\mathbf R^n : \exists (y^i\in\mathbf R^{n_i}, t_i\in\mathcal T_i) :\ x=P_iy^i\ \&\ [y^i]^TR_{ik}y^i\le t_{ik},\ 1\le k\le K_i\}
\]

is an ellitope. Indeed, this is evident when $\mathcal X=\{0\}$. Assuming $\mathcal X\neq\{0\}$, we have

\[
\begin{array}{rcl}
\mathcal X&=&\{x\in\mathbf R^n : \exists (y=[y^1;...;y^I]\in\mathcal Y,\ t=(t_1,...,t_I)\in\mathcal T=\mathcal T_1\times...\times\mathcal T_I) :\\
&&\ \ x=Py:=P_1y^1\ \&\ \underbrace{[y^i]^TR_{ik}y^i}_{y^TR^+_{ik}y}\le t_{ik},\ 1\le k\le K_i,\ 1\le i\le I\},\\[4pt]
\mathcal Y&=&\{[y^1;...;y^I]\in\mathbf R^{n_1+...+n_I} :\ P_iy^i=P_1y^1,\ 2\le i\le I\}
\end{array}
\]

(note that $\mathcal Y$ can be identified with $\mathbf R^{\bar n}$ with a properly selected $\bar n>0$).

• The direct product $\mathcal X=\prod_{i=1}^I\mathcal X_i$ of ellitopes

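\[
\mathcal X_i=\{x^i\in\mathbf R^{n_i} : \exists (y^i\in\mathbf R^{\bar n_i}, t_i\in\mathcal T_i) :\ x^i=P_iy^i\ \&\ [y^i]^TR_{ik}y^i\le t_{ik},\ 1\le k\le K_i\},\quad 1\le i\le I,
\]

is an ellitope:

\[
\begin{array}{rcl}
\mathcal X&=&\Big\{[x^1;...;x^I]\in\mathbf R^{n_1}\times...\times\mathbf R^{n_I} : \exists \left(\begin{array}{l}y=[y^1;...;y^I]\in\mathbf R^{\bar n_1+...+\bar n_I}\\ t=(t_1,...,t_I)\in\mathcal T=\mathcal T_1\times...\times\mathcal T_I\end{array}\right) :\\[8pt]
&&\ \ x=Py:=[P_1y^1;...;P_Iy^I],\ \underbrace{[y^i]^TR_{ik}y^i}_{y^TR^+_{ik}y}\le t_{ik},\ 1\le k\le K_i,\ 1\le i\le I\Big\}.
\end{array}
\]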

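• The linear image $\mathcal Z=\{Rx : x\in\mathcal X\}$, $R\in\mathbf R^{p\times n}$, of an ellitope $\mathcal X=\{x\in\mathbf R^n : \exists (y\in\mathbf R^{\bar n}, t\in\mathcal T) : x=Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\}$ is an ellitope:

\[
\mathcal Z=\{z\in\mathbf R^p : \exists (y\in\mathbf R^{\bar n}, t\in\mathcal T) :\ z=[RP]y\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\}.
\]

• The inverse linear image $\mathcal Z=\{z\in\mathbf R^q : Rz\in\mathcal X\}$, $R\in\mathbf R^{n\times q}$, of an ellitope $\mathcal X=\{x\in\mathbf R^n : \exists (y\in\mathbf R^{\bar n}, t\in\mathcal T) : x=Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\}$ under linear mapping $z\mapsto Rz:\mathbf R^q\to\mathbf R^n$ is an ellitope, provided that the mapping is an embedding: $\mathrm{Ker}\,R=\{0\}$. Indeed, setting $E=\{y\in\mathbf R^{\bar n} : Py\in\mathrm{Im}\,R\}$, we get a linear subspace in $\mathbf R^{\bar n}$. If $E=\{0\}$, $\mathcal Z=\{0\}$ is an ellitope; if $E\neq\{0\}$, we have

\[
\begin{array}{rcl}
\mathcal Z&=&\{z\in\mathbf R^q : \exists (y\in E, t\in\mathcal T) :\ z=\bar Py\ \&\ y^TR_ky\le t_k,\ 1\le k\le K\},\\
\bar P&=&\Pi P,\ \text{where } \Pi:\mathrm{Im}\,R\to\mathbf R^q \text{ is the inverse of } z\mapsto Rz:\mathbf R^q\to\mathrm{Im}\,R
\end{array}
\]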

($E$ can be identified with some $\mathbf R^k$, and $\Pi$ is well-defined since $R$ is an embedding).

• The arithmetic sum $\mathcal X=\left\{x=\sum_{i=1}^Ix^i : x^i\in\mathcal X_i,\ 1\le i\le I\right\}$ of ellitopes $\mathcal X_i$ is an ellitope, with representation readily given by those of $\mathcal X_1,...,\mathcal X_I$. Indeed, $\mathcal X$ is the image of $\mathcal X_1\times...\times\mathcal X_I$ under the linear mapping $[x^1;...;x^I]\mapsto x^1+...+x^I$, and taking direct products and images under linear mappings preserves ellitopes.

• “S-product.” Let

\[
\mathcal X_i=\{x^i\in\mathbf R^{n_i} : \exists (y^i\in\mathbf R^{\bar n_i}, t_i\in\mathcal T_i) :\ x^i=P_iy^i\ \&\ [y^i]^TR_{ik}y^i\le t_{ik},\ 1\le k\le K_i\},\quad 1\le i\le I,
\]

be ellitopes, and let $\mathcal S$ be a convex compact set in $\mathbf R^I_+$ which intersects the interior of $\mathbf R^I_+$ and is monotone: $0\le s'\le s\in\mathcal S$ implies $s'\in\mathcal S$. We associate with $\mathcal S$ the set

\[
\mathcal S^{1/2}=\left\{s\in\mathbf R^I_+ : [s_1^2;...;s_I^2]\in\mathcal S\right\}
\]

of entrywise square roots of points from $\mathcal S$; clearly, $\mathcal S^{1/2}$ is a convex compact set. $\mathcal X_i$ and $\mathcal S$ specify the S-product of the sets $\mathcal X_i$, $i\le I$, defined as the set

\[
\mathcal Z=\left\{z=[z^1;...;z^I] : \exists (s\in\mathcal S^{1/2}, x^i\in\mathcal X_i, i\le I) :\ z^i=s_ix^i,\ 1\le i\le I\right\},
\]

or, equivalently,

\[
\begin{array}{rcl}
\mathcal Z&=&\big\{z=[z^1;...;z^I] : \exists (r=[r_1;...;r_I]\in\mathcal R,\ y^1,...,y^I) :\\
&&\ \ z^i=P_iy^i\ \forall i\le I,\ [y^i]^TR_{ik}y^i\le r_{ik}\ \forall (i\le I, k\le K_i)\big\},\\[4pt]
\mathcal R&=&\{[r_1;...;r_I]\ge0 : \exists (s\in\mathcal S^{1/2}, t_i\in\mathcal T_i) :\ r_i=s_i^2t_i\ \forall i\le I\}.
\end{array}
\]

We claim that $\mathcal Z$ is an ellitope. All we need to verify to this end is that the set $\mathcal R$ is as it should be in an ellitopic representation, that is, that $\mathcal R$ is a compact and monotone subset of $\mathbf R^{K_1+...+K_I}_+$ containing a strictly positive vector (all this is evident), and that $\mathcal R$ is convex. To verify convexity, let $\mathbf T_i=\mathrm{cl}\{[t_i;\tau_i] : \tau_i>0,\ t_i/\tau_i\in\mathcal T_i\}$ be the conic hulls of the $\mathcal T_i$'s. We clearly have

\[
\mathcal R=\{[r_1;...;r_I] : \exists s\in\mathcal S^{1/2} :\ [r_i;s_i^2]\in\mathbf T_i,\ i\le I\}
=\{[r_1;...;r_I] : \exists\sigma\in\mathcal S :\ [r_i;\sigma_i]\in\mathbf T_i,\ i\le I\},
\]

where the concluding equality is due to the origin of $\mathcal S^{1/2}$. The concluding set in the above chain clearly is convex, and we are done.

As an example, consider the situation where the ellitopes $\mathcal X_i$ possess nonempty interiors and thus can be thought of as unit balls of norms $\|\cdot\|_{(i)}$ on the respective spaces $\mathbf R^{n_i}$, and let $\mathcal S=\{s\in\mathbf R^I_+ : \|s\|_{p/2}\le1\}$, where $p\ge2$. In this situation, $\mathcal S^{1/2}=\{s\in\mathbf R^I_+ : \|s\|_p\le1\}$, whence $\mathcal Z$ is the unit ball of the “block p-norm”

\[
\|[z^1;...;z^I]\|=\left\|\left[\|z^1\|_{(1)};...;\|z^I\|_{(I)}\right]\right\|_p.
\]

Note also that the usual direct product of $I$ ellitopes is their S-product, with $\mathcal S=[0,1]^I$.

• “S-weighted sum.” Let $\mathcal X_i\subset\mathbf R^n$ be ellitopes, $1\le i\le I$, and let $\mathcal S\subset\mathbf R^I_+$, $\mathcal S^{1/2}$ be the same as in the previous rule. Then the S-weighted sum of the sets $\mathcal X_i$, defined as

\[
\mathcal X=\Big\{x : \exists (s\in\mathcal S^{1/2}, x^i\in\mathcal X_i, i\le I) :\ x=\sum_is_ix^i\Big\},
\]

is an ellitope. Indeed, the set in question is the image of the S-product of $\mathcal X_i$ under the linear mapping $[z^1;...;z^I]\mapsto z^1+...+z^I$, and taking S-products and linear images preserves the property of being an ellitope.

It should be stressed that the outlined “calculus rules” are fully algorithmic: representation (4.6) of the result of an operation is readily given by the representations (4.6) of the operands.
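To illustrate what “fully algorithmic” means, here is a minimal sketch of three of the rules (linear image, direct product, arithmetic sum) acting on the data $(P,\{R_k\})$ of an ellitopic representation. It is a toy under a simplifying assumption of ours: every set $\mathcal T$ is the box $[0,1]^K$, so that a vector $y$ certifies $x=Py\in\mathcal X$ exactly when $y^TR_ky\le1$ for all $k$. The class `Ellitope` and all function names are hypothetical.

```python
from dataclasses import dataclass


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def quad(R, y):
    """Quadratic form y^T R y."""
    return sum(yi * ri for yi, ri in zip(y, matvec(R, y)))


def zeros(p, q):
    return [[0.0] * q for _ in range(p)]


def block_diag(A, B):
    ca, cb = len(A[0]), len(B[0])
    return [list(r) + [0.0] * cb for r in A] + [[0.0] * ca + list(r) for r in B]


@dataclass
class Ellitope:
    P: list    # n x nbar matrix
    Rs: list   # list of nbar x nbar positive semidefinite matrices R_k

    def contains_witness(self, y, tol=1e-9):
        # Assumption: T = [0,1]^K, hence the test y^T R_k y <= 1 for every k.
        return all(quad(R, y) <= 1.0 + tol for R in self.Rs)


def linear_image(M, X):
    """{M x : x in X}: replace P by M P and keep the R_k."""
    return Ellitope(matmul(M, X.P), X.Rs)


def direct_product(X1, X2):
    """X1 x X2: block-diagonal P, each R_ik padded with a zero block (R_ik^+)."""
    n1, n2 = len(X1.P[0]), len(X2.P[0])
    Rs = [block_diag(R, zeros(n2, n2)) for R in X1.Rs]
    Rs += [block_diag(zeros(n1, n1), R) for R in X2.Rs]
    return Ellitope(block_diag(X1.P, X2.P), Rs)


def arithmetic_sum(X1, X2):
    """X1 + X2 as the image of X1 x X2 under [x1; x2] -> x1 + x2."""
    n = len(X1.P)
    S = [[1.0 if j in (i, n + i) else 0.0 for j in range(2 * n)] for i in range(n)]
    return linear_image(S, direct_product(X1, X2))


I2 = [[1.0, 0.0], [0.0, 1.0]]
disk = Ellitope(I2, [I2])                                       # unit Euclidean ball in R^2
box = Ellitope(I2, [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]])  # unit box
Z = arithmetic_sum(disk, box)
y = [0.6, 0.8, 1.0, -1.0]     # witness: a point of the disk stacked with a box vertex
x = matvec(Z.P, y)            # the corresponding point of disk + box
```

Each rule only rearranges the matrices of the operands, with no optimization involved, which is the sense in which the representation of the result is “readily given.”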

4.7 EXERCISES FOR CHAPTER 4

4.7.1 Linear estimates vs. Maximum Likelihood

Exercise 4.1. Consider the problem posed at the beginning of Chapter 4: Given observation

\[
\omega=Ax+\sigma\xi,\qquad \xi\sim\mathcal N(0,I),
\]

of unknown signal $x$ known to belong to a given signal set $\mathcal X\subset\mathbf R^n$, we want to recover $Bx$. Let us consider the case where matrix $A$ is square and invertible, $B$ is the identity, and $\mathcal X$ is a computationally tractable convex compact set. As far as computational aspects are concerned, the situation is well suited for utilizing the “magic wand” of Statistics, the Maximum Likelihood (ML) estimate, where the recovery of $x$ is

\[
\hat x_{\mathrm{ML}}(\omega)=\mathop{\mathrm{argmin}}_{y\in\mathcal X}\|\omega-Ay\|_2, \tag{ML}
\]

that is, the signal which maximizes, over $y\in\mathcal X$, the likelihood (the probability density) of getting the observation we actually got. Indeed, with computationally tractable $\mathcal X$, (ML) is an explicit convex, and therefore efficiently solvable, optimization problem. Given the exclusive role played by the ML estimate in Statistics, perhaps the first question about our estimation problem is: how good is the ML estimate?

The goal of this exercise is to show that in the situation we are interested in, the ML estimate can be “heavily nonoptimal,” and this may happen even when the techniques we develop in Chapter 4 do result in an efficiently computable near-optimal linear estimate. To justify the claim, investigate the risk (4.2) of the ML estimate in the case where

\[
\mathcal X=\Big\{x\in\mathbf R^n : x_1^2+\epsilon^{-2}\sum_{i=2}^nx_i^2\le1\Big\}\quad\&\quad A=\mathrm{Diag}\{1,\epsilon^{-1},...,\epsilon^{-1}\},
\]

$\epsilon$ and $\sigma$ are small, and $n$ is large, so that $\sigma^2(n-1)\ge2$. Accompany your theoretical analysis by numerical experiments: compare the empirical risks of the ML estimate with theoretical and empirical risks of the linear estimate optimal under the circumstances.
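One ingredient of the numerical part can be sketched as follows. For the specific $\mathcal X$ and $A$ of the exercise one has $\mathcal X=\{x:\|Ax\|_2\le1\}$, so that (ML) reduces to projecting $\omega$ onto the unit Euclidean ball and applying $A^{-1}$. The code below is a minimal sketch under this observation; the function names are ours, and the quantity it reports is the root-mean-square error over randomly drawn signals, which is only an empirical proxy and not the worst-case risk (4.2) itself.

```python
import math
import random


def ml_estimate(omega, eps):
    """ML estimate for X = {x : ||A x||_2 <= 1}, A = Diag{1, 1/eps, ..., 1/eps}."""
    norm = math.sqrt(sum(w * w for w in omega))
    scale = 1.0 / max(1.0, norm)            # projection of omega onto the unit ball
    u = [scale * w for w in omega]          # u = A x_hat
    return [u[0]] + [eps * ui for ui in u[1:]]   # x_hat = A^{-1} u


def sample_signal(n, eps, rng):
    """x = [cos(phi); sin(phi) * eps * zeta], zeta uniform on the unit sphere of R^{n-1}."""
    phi = rng.uniform(0.0, 2.0 * math.pi)
    g = [rng.gauss(0.0, 1.0) for _ in range(n - 1)]
    g_norm = math.sqrt(sum(v * v for v in g))
    return [math.cos(phi)] + [math.sin(phi) * eps * v / g_norm for v in g]


def observe(x, eps, sigma, rng):
    """omega = A x + sigma * xi with xi ~ N(0, I)."""
    Ax = [x[0]] + [xi / eps for xi in x[1:]]
    return [a + sigma * rng.gauss(0.0, 1.0) for a in Ax]


def empirical_rms_error(n, eps, sigma, trials, seed=0):
    """Root-mean-square ||x_hat - x||_2 over `trials` random signals and noises."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = sample_signal(n, eps, rng)
        x_hat = ml_estimate(observe(x, eps, sigma, rng), eps)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return math.sqrt(total / trials)


rms = empirical_rms_error(n=256, eps=0.05, sigma=0.05, trials=20)
```

The linear estimate to compare with is not included here; it has to be built by the techniques of Chapter 4.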

Recommended setup: $n$ runs through $\{256, 1024, 2048\}$, $\epsilon=\sigma$ runs through $\{0.01, 0.05, 0.1\}$, and signal $x$ is generated as $x=[\cos(\phi);\sin(\phi)\epsilon\zeta]$, where $\phi\sim\mathrm{Uniform}[0,2\pi]$ and random vector $\zeta$ is independent of $\phi$ and is distributed uniformly on the unit sphere in $\mathbf R^{n-1}$.

4.7.2 Measurement Design in Signal Recovery

Exercise 4.2. [Measurement Design in Gaussian o.s.] As a preamble to the exercise, please read the story about possible “physics” of Gaussian o.s. from Section 2.7.3.3. The summary of the story is as follows: We consider the Measurement Design version of signal recovery in Gaussian o.s.; specifically, we are allowed to use observations

\[
\omega=A_qx+\sigma\xi\qquad[\xi\sim\mathcal N(0,I_m)]
\]

where

\[
A_q=\mathrm{Diag}\{\sqrt{q_1},\sqrt{q_2},...,\sqrt{q_m}\}A,
\]

with a given $A\in\mathbf R^{m\times n}$ and vector $q$ which we can select in a given convex compact set $\mathcal Q\subset\mathbf R^m_+$. The signal $x$ underlying the observation is known to belong to a given ellitope $\mathcal X$. Your goal is to select $q\in\mathcal Q$ and a linear recovery $\omega\mapsto G^T\omega$ of the image $Bx$ of $x\in\mathcal X$, with given $B$, resulting in the smallest worst-case, over $x\in\mathcal X$, expected $\|\cdot\|_2^2$ recovery risk. Modify, according to this goal, problem (4.12). Is it possible to end up with a tractable problem? Work out in full detail the case when $\mathcal Q=\{q\in\mathbf R^m_+ : \sum_iq_i=m\}$.

Exercise 4.3. [follow-up to Exercise 4.2] A translucent bar of length $n=32$ is comprised of 32 consecutive segments of length 1 each, with density $\rho_i$ of the $i$-th segment known to belong to the interval $[\mu-\delta_i,\mu+\delta_i]$.

[Figure: sample translucent bar]

The bar is lit from the left end; when light passes through a segment with density $\rho$, the light's intensity is reduced by factor $e^{-\alpha\rho}$. The light intensity at the left endpoint of the bar is 1. You can scan the segments one by one from left to right and measure light intensity $\ell_i$ at the right endpoint of the $i$-th segment during time $q_i$; the result $z_i$ of the measurement is $\ell_ie^{\sigma\xi_i/\sqrt{q_i}}$, where $\xi_i\sim\mathcal N(0,1)$ are independent across $i$. The total time budget is $n$, and you are interested in recovering the $m=n/2$-dimensional vector of densities of the right $m$ segments. Build an optimization problem responsible for near-optimal linear recovery with and without Measurement Design (in the latter case, we assume that each segment is observed during unit time) and compare the resulting near-optimal risks.

Recommended data: $\alpha=0.01$, $\delta_i=1.2+\cos(4\pi(i-1)/n)$, $\mu=1.1\max_i\delta_i$, $\sigma=0.001$.
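A natural first step, sketched below as our own reading of the setup rather than as the intended solution, is to linearize the measurements. Since $\ell_i=\exp\{-\alpha\sum_{j\le i}\rho_j\}$, taking logarithms and multiplying by $\sqrt{q_i}$ gives $\omega_i:=\sqrt{q_i}\ln z_i=-\alpha\sqrt{q_i}\sum_{j\le i}\rho_j+\sigma\xi_i$, which is the observation scheme of Exercise 4.2 with $A=-\alpha L$, $L$ being the lower triangular matrix of ones. The function names and the way the densities are drawn are illustrative assumptions.

```python
import math
import random


def simulate_bar(rho, q, alpha, sigma, rng):
    """Return (z, xi) with z_i = l_i * exp(sigma * xi_i / sqrt(q_i))."""
    z, xi, cumulative = [], [], 0.0
    for r, qi in zip(rho, q):
        cumulative += r                          # sum of densities of segments 1..i
        noise = rng.gauss(0.0, 1.0)
        intensity = math.exp(-alpha * cumulative)   # l_i
        z.append(intensity * math.exp(sigma * noise / math.sqrt(qi)))
        xi.append(noise)
    return z, xi


def linearize(z, q):
    """omega_i = sqrt(q_i) * ln z_i."""
    return [math.sqrt(qi) * math.log(zi) for zi, qi in zip(z, q)]


def apply_A_q(rho, q, alpha):
    """(A_q rho)_i = -alpha * sqrt(q_i) * sum_{j <= i} rho_j."""
    out, cumulative = [], 0.0
    for r, qi in zip(rho, q):
        cumulative += r
        out.append(-alpha * math.sqrt(qi) * cumulative)
    return out


n, alpha, sigma = 32, 0.01, 0.001
delta = [1.2 + math.cos(4.0 * math.pi * i / n) for i in range(n)]   # i plays the role of i-1
mu = 1.1 * max(delta)
rng = random.Random(0)
rho = [mu + d * rng.uniform(-1.0, 1.0) for d in delta]   # some densities in the allowed intervals
q = [1.0] * n                                            # no Measurement Design: unit times

z, xi = simulate_bar(rho, q, alpha, sigma, rng)
omega = linearize(z, q)
residual = [w - a for w, a in zip(omega, apply_A_q(rho, q, alpha))]  # equals sigma * xi
```

After the shift $x=\rho-\mu[1;...;1]$ the signal set becomes a box centered at the origin, and $B$ selects the last $m$ coordinates.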

Exercise 4.4. Let $X$ be a basic ellitope in $\mathbf R^n$:

\[
X=\{x\in\mathbf R^n : \exists t\in\mathcal T :\ x^TS_kx\le t_k,\ 1\le k\le K\}
\]

with our usual restrictions on $S_k$ and $\mathcal T$. Let, further, $m$ be a given positive integer, and $x\mapsto Bx:\mathbf R^n\to\mathbf R^\nu$ be a given linear mapping. Consider the Measurement Design problem where you are looking for a linear recovery $\omega\mapsto\hat x_H(\omega):=H^T\omega$ of $Bx$, $x\in X$, from observation

\[
\omega=Ax+\sigma\xi\qquad[\sigma>0\ \text{is given and}\ \xi\sim\mathcal N(0,I_m)]
\]

in which the $m\times n$ sensing matrix $A$ is under your control: it is allowed to be any $m\times n$ matrix of spectral norm not exceeding 1. You are interested in selecting $H$ and $A$ in order to minimize the worst-case, over $x\in X$, expected $\|\cdot\|_2^2$ recovery error. Similarly to (4.12), this problem can be posed as

\[
\mathrm{Opt}=\min_{H,\lambda,A}\left\{\sigma^2\mathrm{Tr}(H^TH)+\phi_{\mathcal T}(\lambda) :\ 
\left[\begin{array}{cc}\sum_k\lambda_kS_k & B^T-A^TH\\ B-H^TA & I_\nu\end{array}\right]\succeq0,\ \|A\|\le1,\ \lambda\ge0\right\}, \tag{4.68}
\]

where $\|\cdot\|$ stands for the spectral norm. The objective in this problem is the (upper bound on the) squared risk $\mathrm{Risk}^2[\hat x_H|X]$, the sensing matrix being $A$. The problem is nonconvex, since the matrix participating in the semidefinite constraint is bilinear in $H$ and $A$.

A natural way to handle an optimization problem with objective and/or constraints bilinear in the decision variables $u$, $v$ is to use “alternating minimization,” where one alternates optimization in $v$ for $u$ fixed and optimization in $u$ for $v$ fixed, the value of the variable fixed in a round being the result of optimization w.r.t. this variable in the previous round. Alternating minimizations are carried out until the value of the objective (which in the outlined process cannot get worse from round to round) stops improving (or nearly so). Since the algorithm does not necessarily converge to the globally optimal solution to the problem of interest, it makes sense to run the algorithm several times from different, say, randomly selected, starting points.

Now comes the exercise.

1. Implement Alternating Minimization as applied to (4.68). You may restrict your experimentation to the case where the sizes $m$, $n$, $\nu$ are quite moderate, in the range of tens, and $X$ is either the box $\{x : j^{2\gamma}x_j^2\le1,\ 1\le j\le n\}$, or the ellipsoid $\{x : \sum_{j=1}^nj^{2\gamma}x_j^2\le1\}$, where $\gamma$ is a nonnegative parameter (try $\gamma=0,1,2,3$). As for $B$, you can generate it at random, or enforce $B$ to have prescribed singular values, say, $\sigma_j=j^{-\theta}$, $1\le j\le\nu$, and a randomly selected system of singular vectors.

2. Identify cases where a globally optimal solution to (4.68) is easy to find and use this information in order to understand how reliable Alternating Minimization is in the application in question, reliability meaning the ability to identify near-optimal, in terms of the objective, solutions. If you are not satisfied with Alternating Minimization “as is,” try to improve it.

3. Modify (4.68) and your experiment to cover the cases where the constraint $\|A\|\le1$ on the sensing matrix is replaced with one of the following:

• $\|\mathrm{Row}_i[A]\|_2\le1$, $1\le i\le m$,
• $|A_{ij}|\le1$ for all $i,j$

(note that these two types of restrictions mimic what happens if you are interested in recovering (the linear image of) the vector of parameters in a linear regression model from noisy observations of the model's outputs at the $m$ points which you are allowed to select in the unit ball or unit box).

4. [Embedded Exercise] Recall that a $\nu\times n$ matrix $G$ admits singular value decomposition $G=UDV^T$ with orthogonal matrices $U\in\mathbf R^{\nu\times\nu}$ and $V\in\mathbf R^{n\times n}$ and diagonal $\nu\times n$ matrix $D$ with nonnegative and nonincreasing diagonal entries [footnote 9]. These entries are uniquely defined by $G$ and are called singular values $\sigma_i(G)$, $1\le i\le\min[\nu,n]$. Singular values admit characterization similar to variational characterization of eigenvalues of a symmetric matrix; see, e.g., [15, Section A.7.3]:

Theorem 4.23. [VCSV: Variational Characterization of Singular Values] For a $\nu\times n$ matrix $G$ it holds

\[
\sigma_i(G)=\min_{E\in\mathcal E_i}\ \max_{e\in E,\ \|e\|_2=1}\|Ge\|_2,\quad 1\le i\le\min[\nu,n], \tag{4.69}
\]

where $\mathcal E_i$ is the family of all subspaces in $\mathbf R^n$ of codimension $i-1$.

Corollary 4.24. [SVI: Singular Value Interlacement] Let $G$ and $G'$ be $\nu\times n$ matrices, and let $k=\mathrm{Rank}(G-G')$. Then

\[
\sigma_i(G)\ge\sigma_{i+k}(G'),\quad 1\le i\le\min[\nu,n],
\]

where, by definition, singular values of a $\nu\times n$ matrix with indexes $>\min[\nu,n]$ are zeros.

We denote by $\sigma(G)$ the vector of singular values of $G$ arranged in nonincreasing order. The function $\|G\|_{\mathrm{Sh},p}=\|\sigma(G)\|_p$ is called the Schatten $p$-norm of matrix $G$; this indeed is a norm on the space of $\nu\times n$ matrices, and the conjugate norm is $\|\cdot\|_{\mathrm{Sh},q}$, with $\frac1p+\frac1q=1$. An easy and important consequence of Corollary 4.24 is the following fact:

Corollary 4.25. Given a $\nu\times n$ matrix $G$, an integer $k$, $0\le k\le\min[\nu,n]$, and $p\in[1,\infty]$, (one of) the best approximations of $G$ in the Schatten $p$-norm among matrices of rank $\le k$ is obtained from $G$ by zeroing out all but $k$ largest singular values, that is, the matrix $G^k=\sum_{i=1}^k\sigma_i(G)\mathrm{Col}_i[U]\mathrm{Col}_i^T[V]$, where $G=UDV^T$ is the singular value decomposition of $G$.

Prove Theorem 4.23 and Corollaries 4.24 and 4.25.

5. Consider the Measurement Design problem (4.68) in the case when $X$ is an ellipsoid:

\[
X=\Big\{x\in\mathbf R^n : \sum_{j=1}^nx_j^2/a_j^2\le1\Big\},
\]

$A$ is an $m\times n$ matrix of spectral norm not exceeding 1, and there is no noise in observations: $\sigma=0$. Find an optimal solution to this problem. Think how this result can be used to get a (hopefully) good starting point for Alternating Minimization in the case when $X$ is an ellipsoid and $\sigma$ is small.

[Footnote 9] We say that a rectangular matrix $D$ is diagonal if all entries $D_{ij}$ in $D$ with $i\neq j$ are zeros.

4.7.3 Around semidefinite relaxation

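Exercise 4.5. Let $\mathcal X$ be an ellitope:

\[
\mathcal X=\{x\in\mathbf R^n : \exists (y\in\mathbf R^N, t\in\mathcal T) :\ x=Py,\ y^TS_ky\le t_k,\ k\le K\}
\]

with our standard restrictions on $\mathcal T$ and $S_k$. Representing $S_k=\sum_{j=1}^{r_k}s_{kj}s_{kj}^T$, we can pass from the initial ellitopic representation of $\mathcal X$ to the spectratopic representation of the same set:

\[
\begin{array}{c}
\mathcal X=\{x\in\mathbf R^n : \exists (y\in\mathbf R^N, t^+\in\mathcal T^+) :\ x=Py,\ [s_{kj}^Ty]^2\preceq t^+_{kj}I_1,\ 1\le k\le K,\ 1\le j\le r_k\}\\[4pt]
\left[\mathcal T^+=\Big\{t^+=\{t^+_{kj}\ge0\} : \exists t\in\mathcal T :\ \sum_{j=1}^{r_k}t^+_{kj}\le t_k,\ 1\le k\le K\Big\}\right].
\end{array}
\]

If now $C$ is a symmetric $n\times n$ matrix and $\mathrm{Opt}_*=\max_{x\in\mathcal X}x^TCx$, we have

\[
\begin{array}{l}
\mathrm{Opt}_*\le\mathrm{Opt}_e:=\min\limits_{\lambda=\{\lambda_k\in\mathbf R_+\}}\Big\{\phi_{\mathcal T}(\lambda) :\ P^TCP\preceq\sum_k\lambda_kS_k\Big\},\\[6pt]
\mathrm{Opt}_*\le\mathrm{Opt}_s:=\min\limits_{\Lambda=\{\Lambda_{kj}\in\mathbf R_+\}}\Big\{\phi_{\mathcal T^+}(\Lambda) :\ P^TCP\preceq\sum_{k,j}\Lambda_{kj}s_{kj}s_{kj}^T\Big\},
\end{array}
\]

where the first relation is yielded by the ellitopic representation of $\mathcal X$ and Proposition 4.6, and the second, on closer inspection (carry this inspection out!), by the spectratopic representation of $\mathcal X$ and Proposition 4.8. Prove that $\mathrm{Opt}_e=\mathrm{Opt}_s$.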

Exercise 4.6. Proposition 4.6 provides us with an upper bound on the quality of the semidefinite relaxation as applied to the problem of upper-bounding the maximum of a homogeneous quadratic form over an ellitope. Extend the construction to the case where an inhomogeneous quadratic form is maximized over a shifted ellitope, so that the quantity to upper-bound is

\[
\mathrm{Opt}=\max_{x\in\mathcal X}\left[f(x):=x^TAx+2b^Tx+c\right],\qquad
\mathcal X=\{x : \exists (y, t\in\mathcal T) :\ x=Py+p,\ y^TS_ky\le t_k,\ 1\le k\le K\}
\]

with our standard assumptions on $S_k$ and $\mathcal T$.

Note: $\mathcal X$ is centered at $p$, and a natural upper bound on $\mathrm{Opt}$ is

\[
\mathrm{Opt}\le f(p)+\widehat{\mathrm{Opt}},
\]

where $\widehat{\mathrm{Opt}}$ is an upper bound on the quantity

\[
\overline{\mathrm{Opt}}=\max_{x\in\mathcal X}\left[f(x)-f(p)\right].
\]

What you are interested in upper-bounding is the ratio $\widehat{\mathrm{Opt}}/\overline{\mathrm{Opt}}$.

Exercise 4.7. [estimating Kolmogorov widths of spectratopes/ellitopes]

4.7.A. Preliminaries: Kolmogorov and Gelfand widths. Let $\mathcal X$ be a convex compact set in $\mathbf R^n$, and let $\|\cdot\|$ be a norm on $\mathbf R^n$. Given a linear subspace $E$ in $\mathbf R^n$, let

\[
\mathrm{dist}_{\|\cdot\|}(x,E)=\min_{z\in E}\|x-z\| :\ \mathbf R^n\to\mathbf R_+
\]

be the $\|\cdot\|$-distance from $x$ to $E$. The quantity

\[
\mathrm{dist}_{\|\cdot\|}(\mathcal X,E)=\max_{x\in\mathcal X}\mathrm{dist}_{\|\cdot\|}(x,E)
\]

can be viewed as the worst-case $\|\cdot\|$-accuracy to which vectors from $\mathcal X$ can be approximated by vectors from $E$. Given positive integer $m\le n$ and denoting by $\mathcal E_m$ the family of all linear subspaces in $\mathbf R^n$ of dimension $m$, the quantity

\[
\delta_m(\mathcal X,\|\cdot\|)=\min_{E\in\mathcal E_m}\mathrm{dist}_{\|\cdot\|}(\mathcal X,E)
\]

can be viewed as the best achievable quality of approximation, measured in $\|\cdot\|$, of vectors from $\mathcal X$ by vectors from an $m$-dimensional linear subspace of $\mathbf R^n$. This quantity is called the $m$-th Kolmogorov width of $\mathcal X$ w.r.t. $\|\cdot\|$.

Observe that one has

\[
\begin{array}{rcl}
\mathrm{dist}_{\|\cdot\|}(x,E)&=&\max\limits_\xi\{\xi^Tx : \|\xi\|_*\le1,\ \xi\in E^\perp\},\\
\mathrm{dist}_{\|\cdot\|}(\mathcal X,E)&=&\max\limits_{x\in\mathcal X,\ \|\xi\|_*\le1,\ \xi\in E^\perp}\xi^Tx,
\end{array}
\tag{4.70}
\]

where $E^\perp$ is the orthogonal complement to $E$.

1) Prove (4.70). Hint: Represent $\mathrm{dist}_{\|\cdot\|}(x,E)$ as the optimal value in a conic problem on the cone $K=\{[x;t] : t\ge\|x\|\}$ and use the Conic Duality Theorem.

Now consider the case when $\mathcal X$ is the unit ball of some norm $\|\cdot\|_{\mathcal X}$. In this case (4.70) combines with the definition of Kolmogorov width to imply that

\[
\begin{array}{rcl}
\delta_m(\mathcal X,\|\cdot\|)&=&\min\limits_{E\in\mathcal E_m}\mathrm{dist}_{\|\cdot\|}(\mathcal X,E)=\min\limits_{E\in\mathcal E_m}\ \max\limits_{x\in\mathcal X}\ \max\limits_{y\in E^\perp,\ \|y\|_*\le1}y^Tx\\[4pt]
&=&\min\limits_{E\in\mathcal E_m}\ \max\limits_{y\in E^\perp,\ \|y\|_*\le1}\ \max\limits_{x:\|x\|_{\mathcal X}\le1}y^Tx\\[4pt]
&=&\min\limits_{F\in\mathcal E_{n-m}}\ \max\limits_{y\in F,\ \|y\|_*\le1}\|y\|_{\mathcal X,*},
\end{array}
\tag{4.71}
\]

where $\|\cdot\|_{\mathcal X,*}$ is the norm conjugate to $\|\cdot\|_{\mathcal X}$. Note that when $\mathcal Y$ is a convex compact set in $\mathbf R^n$ and $|\cdot|$ is a norm on $\mathbf R^n$, the quantity

\[
d^m(\mathcal Y,|\cdot|)=\min_{F\in\mathcal E_{n-m}}\ \max_{y\in\mathcal Y\cap F}|y|
\]

has a name: it is called the $m$-th Gelfand width of $\mathcal Y$ taken w.r.t. $|\cdot|$. The “duality relation” (4.71) states that

When $\mathcal X$, $\mathcal Y$ are the unit balls of respective norms $\|\cdot\|_{\mathcal X}$, $\|\cdot\|_{\mathcal Y}$, for every $m<n$ the $m$-th Kolmogorov width of $\mathcal X$ taken w.r.t. $\|\cdot\|_{\mathcal Y,*}$ is the same as the $m$-th Gelfand width of $\mathcal Y$ taken w.r.t. $\|\cdot\|_{\mathcal X,*}$.

The goal of the remaining part of the exercise is to use our results on the quality of semidefinite relaxation on ellitopes/spectratopes to infer efficiently computable upper bounds on Kolmogorov widths of a given set $\mathcal X\subset\mathbf R^n$. In the sequel we assume that

• $\mathcal X$ is a spectratope: $\mathcal X=\{x\in\mathbf R^n : \exists (t\in\mathcal T, u) :\ x=Pu,\ R_k^2[u]\preceq t_kI_{d_k},\ k\le K\}$;
• The unit ball $\mathcal B_*$ of the norm conjugate to $\|\cdot\|$ is a spectratope: $\mathcal B_*=\{y : \|y\|_*\le1\}=\{y\in\mathbf R^n : \exists (r\in\mathcal R, z) :\ y=Mz,\ S_\ell^2[z]\preceq r_\ell I_{f_\ell},\ \ell\le L\}$,

with our usual restrictions on $\mathcal T$, $\mathcal R$ and $R_k[\cdot]$ and $S_\ell[\cdot]$.

4.7.B. Simple case: $\|\cdot\|=\|\cdot\|_2$. We start with the simple case where $\|\cdot\|=\|\cdot\|_2$, so that $\mathcal B_*$ is the ellitope $\{y : y^Ty\le1\}$. Let $D=\sum_kd_k$ be the size of the spectratope $\mathcal X$, and let

\[
\kappa=2\max[\ln(2D),1].
\]

Given integer $m<n$, consider the convex optimization problem

\[
\mathrm{Opt}(m)=\min_{\Lambda=\{\Lambda_k,k\le K\},\,Y}\Big\{\phi_{\mathcal T}(\lambda[\Lambda]) :\ \Lambda_k\succeq0\ \forall k,\ 0\preceq Y\preceq I_n,\ \sum_kR_k^*[\Lambda_k]\succeq P^TYP,\ \mathrm{Tr}(Y)=n-m\Big\}. \tag{$P_m$}
\]

2) Prove the following:

Proposition 4.26. Whenever $1\le\mu\le m<n$, one has

\[
\mathrm{Opt}(m)\le\kappa\,\delta_m^2(\mathcal X,\|\cdot\|_2)\quad\&\quad\delta_m^2(\mathcal X,\|\cdot\|_2)\le\frac{m+1}{m+1-\mu}\mathrm{Opt}(\mu). \tag{4.72}
\]

Moreover, the above upper bounds on $\delta_m(\mathcal X,\|\cdot\|_2)$ are “constructive,” meaning that an optimal solution to $(P_\mu)$, $\mu\le m$, can be straightforwardly converted into a linear subspace $E^{m,\mu}$ of dimension $m$ such that

\[
\mathrm{dist}_{\|\cdot\|_2}(\mathcal X,E^{m,\mu})\le\sqrt{\frac{m+1}{m+1-\mu}\mathrm{Opt}(\mu)}.
\]

Finally, $\mathrm{Opt}(\mu)$ is nonincreasing in $\mu<n$.

4.7.C. General case. Now consider the case when both $\mathcal X$ and the unit ball $\mathcal B_*$ of the norm conjugate to $\|\cdot\|$ are spectratopes. As we are about to see, this case is essentially more difficult than the case of $\|\cdot\|=\|\cdot\|_2$, but something still can be done.

3) Prove the following statement:

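(!) Given $m<n$, let $Y$ be an orthoprojector of $\mathbf R^n$ of rank $n-m$, and let collections $\Lambda=\{\Lambda_k\succeq0,\ k\le K\}$ and $\Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\}$ satisfy the relation

\[
\left[\begin{array}{cc}\sum_kR_k^*[\Lambda_k] & \frac12P^TYM\\ \frac12M^TYP & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{array}\right]\succeq0. \tag{4.73}
\]

Then

\[
\mathrm{dist}_{\|\cdot\|}(\mathcal X,\mathrm{Ker}\,Y)\le\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]). \tag{4.74}
\]

As a result,

\[
\begin{array}{rcl}
\delta_m(\mathcal X,\|\cdot\|)&\le&\mathrm{dist}_{\|\cdot\|}(\mathcal X,\mathrm{Ker}\,Y)\\[4pt]
&\le&\mathrm{Opt}:=\min\limits_{\Lambda=\{\Lambda_k,k\le K\},\ \Upsilon=\{\Upsilon_\ell,\ell\le L\}}\Big\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]) :\\[4pt]
&&\quad\Lambda_k\succeq0\ \forall k,\ \Upsilon_\ell\succeq0\ \forall\ell,\ 
\left[\begin{array}{cc}\sum_kR_k^*[\Lambda_k] & \frac12P^TYM\\ \frac12M^TYP & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{array}\right]\succeq0\Big\}.
\end{array}
\tag{4.75}
\]

4) Prove the following statement:

(!!) Let $m$, $n$, $Y$ be as in (!). Then

\[
\begin{array}{rcl}
\delta_m(\mathcal X,\|\cdot\|)&\le&\mathrm{dist}_{\|\cdot\|}(\mathcal X,\mathrm{Ker}\,Y)\\[4pt]
&\le&\widehat{\mathrm{Opt}}:=\min\limits_{\nu,\ \Lambda=\{\Lambda_k,k\le K\},\ \Upsilon=\{\Upsilon_\ell,\ell\le L\}}\Big\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]) :\ \nu\ge0,\ \Lambda_k\succeq0\ \forall k,\ \Upsilon_\ell\succeq0\ \forall\ell,\\[4pt]
&&\quad\left[\begin{array}{cc}\sum_kR_k^*[\Lambda_k] & \frac12P^TM\\ \frac12M^TP & \sum_\ell S_\ell^*[\Upsilon_\ell]+\nu M^T(I-Y)M\end{array}\right]\succeq0\Big\},
\end{array}
\tag{4.76}
\]

and $\widehat{\mathrm{Opt}}\le\mathrm{Opt}$, with $\mathrm{Opt}$ given by (4.75).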

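Statements (!) and (!!) suggest the following policy for upper-bounding the Kolmogorov width $\delta_m(\mathcal X,\|\cdot\|)$:

A. First, we select an integer $\mu$, $1\le\mu<n$, and solve the convex optimization problem

\[
\begin{array}{l}
\min\limits_{\Lambda,\Upsilon,Y}\Big\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]) :\ 0\preceq Y\preceq I,\ \mathrm{Tr}(Y)=n-\mu,\\[4pt]
\qquad\Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\ 
\left[\begin{array}{cc}\sum_kR_k^*[\Lambda_k] & \frac12P^TYM\\ \frac12M^TYP & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{array}\right]\succeq0\Big\}.
\end{array}
\tag{$P^\mu$}
\]

B. Next, we take the $Y$-component $Y^\mu$ of the optimal solution to $(P^\mu)$ and “round” it to an orthoprojector $Y$ of rank $n-m$ in the same fashion as in the case of $\|\cdot\|=\|\cdot\|_2$, that is, keep the eigenvectors of $Y^\mu$ intact and replace the $m$ smallest eigenvalues with zeros, and all remaining eigenvalues with ones.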

C. Finally, we solve the convex optimization problem

\[
\begin{array}{l}
\mathrm{Opt}_{m,\mu}=\min\limits_{\Lambda,\Upsilon,\nu}\Big\{\phi_{\mathcal T}(\lambda[\Lambda])+\phi_{\mathcal R}(\lambda[\Upsilon]) :\ \nu\ge0,\ \Lambda=\{\Lambda_k\succeq0,\ k\le K\},\ \Upsilon=\{\Upsilon_\ell\succeq0,\ \ell\le L\},\\[4pt]
\qquad\left[\begin{array}{cc}\sum_kR_k^*[\Lambda_k] & \frac12P^TM\\ \frac12M^TP & \sum_\ell S_\ell^*[\Upsilon_\ell]+\nu M^T(I-Y)M\end{array}\right]\succeq0\Big\}.
\end{array}
\tag{$P^{m,\mu}$}
\]

By (!!), $\mathrm{Opt}_{m,\mu}$ is an upper bound on the Kolmogorov width $\delta_m(\mathcal X,\|\cdot\|)$ (and in fact also on $\mathrm{dist}_{\|\cdot\|}(\mathcal X,\mathrm{Ker}\,Y)$).

Observe all the complications we encounter when passing from the simple case $\|\cdot\|=\|\cdot\|_2$ to the case of general norm $\|\cdot\|$ with a spectratope as the unit ball of the conjugate norm. Note that Proposition 4.26 gives both a lower bound $\sqrt{\mathrm{Opt}(m)/\kappa}$ on the $m$-th Kolmogorov width of $\mathcal X$ w.r.t. $\|\cdot\|_2$, and a family of upper bounds $\sqrt{\frac{m+1}{m+1-\mu}\mathrm{Opt}(\mu)}$, $1\le\mu\le m$, on this width. As a result, we can approximate $\mathcal X$ by $m$-dimensional subspaces in the Euclidean norm in a “nearly optimal” fashion. Indeed, if for some $\epsilon$ and $k$ it holds $\delta_k(\mathcal X,\|\cdot\|_2)\le\epsilon$, then $\mathrm{Opt}(k)\le\kappa\epsilon^2$ by Proposition 4.26 as applied with $m=k$. On the other hand, assuming $k<n/2$, the same proposition when applied with $m=2k$ and $\mu=k$ says that

\[
\mathrm{dist}_{\|\cdot\|_2}(\mathcal X,E^{m,k})\le\sqrt{\frac{2k+1}{k+1}\mathrm{Opt}(k)}\le\sqrt{2\,\mathrm{Opt}(k)}\le\sqrt{2\kappa}\,\epsilon.
\]

Thus, if $\mathcal X$ can be approximated by a $k$-dimensional subspace within $\|\cdot\|_2$-accuracy $\epsilon$, we can efficiently get approximation of “nearly the same quality” ($\sqrt{2\kappa}\,\epsilon$ instead of $\epsilon$; recall that $\kappa$ is just logarithmic in $D$) and “nearly the same dimension” ($2k$ instead of $k$).

Neither of these options is preserved when passing from the Euclidean norm to a general one: in the latter case, we do not have lower bounds on Kolmogorov widths, and have no understanding of how tight our upper bounds are.

Now, two concluding questions:

5) Why in step A of the above bounding scheme do we utilize statement (!) rather than the less conservative (since $\widehat{\mathrm{Opt}}\le\mathrm{Opt}$) statement (!!)?

6) Implement the scheme numerically and run experiments. Recommended setup:

• Given $\sigma>0$ and positive integers $n$ and $\kappa$, let $f$ be a function of continuous argument $t\in[0,1]$ satisfying the smoothness restriction $|f^{(k)}(t)|\le\sigma^k$, $0\le t\le1$, $k=0,1,2,...,\kappa$. Specify $\mathcal X$ as the set of $n$-dimensional vectors $x$ obtained by restricting $f$ onto the $n$-point equidistant grid $\{t_i=i/n\}_{i=1}^n$. To this end, translate the description on $f$ into a bunch of two-sided linear constraints on $x$:

\[
|d_{(k)}^T[x_i;x_{i+1};...;x_{i+k}]|\le\sigma^k,\quad1\le i\le n-k,\ 0\le k\le\kappa,
\]

where $d_{(k)}\in\mathbf R^{k+1}$ is the vector of coefficients of finite-difference approximation, with resolution $1/n$, of the $k$-th derivative:

\[
d_{(0)}=1,\ d_{(1)}=n[-1;1],\ d_{(2)}=n^2[1;-2;1],\ d_{(3)}=n^3[-1;3;-3;1],\ d_{(4)}=n^4[1;-4;6;-4;1],\ ....
\]

• Recommended parameters: $n=32$, $m=8$, $\kappa=5$, $\sigma\in\{0.25, 0.5, 1, 2, 4\}$.

• Run experiments with $\|\cdot\|=\|\cdot\|_1$ and $\|\cdot\|=\|\cdot\|_2$.

Exercise 4.8. [more on semidefinite relaxation] The goal of this exercise is to extend SDP relaxation beyond ellitopes/spectratopes. SDP relaxation is aimed at upper-bounding the quantity

\[
\mathrm{Opt}_{\mathcal X}(B)=\max_{x\in\mathcal X}x^TBx,\qquad[B\in\mathbf S^n]
\]

where $\mathcal X\subset\mathbf R^n$ is a given set (which we from now on assume to be nonempty convex compact). To this end we look for a computationally tractable convex compact set $\mathcal U\subset\mathbf S^n$ such that for every $x\in\mathcal X$ it holds $xx^T\in\mathcal U$; in this case, we refer to $\mathcal U$ as a set matching $\mathcal X$ (equivalent wording: “$\mathcal U$ matches $\mathcal X$”). Given such a set $\mathcal U$, the optimal value in the convex optimization problem

\[
\mathrm{Opt}_{\mathcal U}(B)=\max_{U\in\mathcal U}\mathrm{Tr}(BU) \tag{4.77}
\]

is an efficiently computable convex upper bound on $\mathrm{Opt}_{\mathcal X}(B)$.

Given $\mathcal U$ matching $\mathcal X$, we can pass from $\mathcal U$ to the conic hull of $\mathcal U$, that is, to the set

\[
\mathbf U[\mathcal U]=\mathrm{cl}\{(U,\mu)\in\mathbf S^n\times\mathbf R_+ : \mu>0,\ U/\mu\in\mathcal U\}
\]

which, as is immediately seen, is a closed convex cone contained in $\mathbf S^n\times\mathbf R_+$. The only point $(U,\mu)$ in this cone with $\mu=0$ has $U=0$ (since $\mathcal U$ is compact), and

\[
\mathcal U=\{U : (U,1)\in\mathbf U\}=\{U : \exists\mu\le1 : (U,\mu)\in\mathbf U\},
\]

so that the definition of $\mathrm{Opt}_{\mathcal U}$ can be rewritten equivalently as

\[
\mathrm{Opt}_{\mathcal U}(B)=\max_{U,\mu}\{\mathrm{Tr}(BU) : (U,\mu)\in\mathbf U,\ \mu\le1\}.
\]

The question, of course, is where to take a set $\mathcal U$ matching $\mathcal X$, and the answer depends on what we know about $\mathcal X$. For example, when $\mathcal X$ is a basic ellitope

\[
\mathcal X=\{x\in\mathbf R^n : \exists t\in\mathcal T :\ x^TS_kx\le t_k,\ k\le K\}
\]

with our usual restrictions on $\mathcal T$ and $S_k$, it is immediately seen that

\[
x\in\mathcal X\ \Rightarrow\ xx^T\in\mathcal U=\{U\in\mathbf S^n : U\succeq0,\ \exists t\in\mathcal T :\ \mathrm{Tr}(US_k)\le t_k,\ k\le K\}.
\]

Similarly, when $\mathcal X$ is a basic spectratope

\[
\mathcal X=\{x\in\mathbf R^n : \exists t\in\mathcal T :\ S_k^2[x]\preceq t_kI_{d_k},\ k\le K\}
\]

with our usual restrictions on $\mathcal T$ and $S_k[\cdot]$, it is immediately seen that

\[
x\in\mathcal X\ \Rightarrow\ xx^T\in\mathcal U=\{U\in\mathbf S^n : U\succeq0,\ \exists t\in\mathcal T :\ S_k[U]\preceq t_kI_{d_k},\ k\le K\}.
\]

One can verify that the semidefinite relaxation bounds on the maximum of a quadratic form on an ellitope/spectratope $\mathcal X$ derived in Sections 4.2.3 (for ellitopes) and 4.3.2 (for spectratopes) are nothing but the bounds (4.77) associated with the $\mathcal U$ just defined.

4.8.A Matching via absolute norms. There are other ways to specify a set matching $\mathcal X$. The seemingly simplest of them is as follows. Let $p(\cdot)$ be an absolute norm on $\mathbf R^n$ (recall that this is a norm $p(x)$ which depends solely on $\mathrm{abs}[x]$, where $\mathrm{abs}[x]$ is the vector comprised of the magnitudes of entries in $x$). We can convert $p(\cdot)$ into the norm $p^+(\cdot)$ on the space $\mathbf S^n$ as follows:

\[
p^+(U)=p\big([p(\mathrm{Col}_1[U]);...;p(\mathrm{Col}_n[U])]\big)\qquad[U\in\mathbf S^n].
\]

1.1) Prove that $p^+$ indeed is a norm on $\mathbf S^n$, and $p^+(xx^T)=p^2(x)$. Denoting by $q(\cdot)$ the norm conjugate to $p(\cdot)$, what is the relation between the norm $(p^+)_*(\cdot)$ conjugate to $p^+(\cdot)$ and the norm $q^+(\cdot)$?

1.2) Derive from 1.1 that whenever $p(\cdot)$ is an absolute norm such that $\mathcal X$ is contained in the unit ball $\mathcal B_{p(\cdot)}=\{x : p(x)\le1\}$ of the norm $p$, the set

\[
\mathcal U_{p(\cdot)}=\{U\in\mathbf S^n : U\succeq0,\ p^+(U)\le1\}
\]

is matching $\mathcal X$. If, in addition,

\[
\mathcal X\subset\{x : p(x)\le1,\ Px=0\}, \tag{4.78}
\]

then the set

\[
\mathcal U_{p(\cdot),P}=\{U\in\mathbf S^n : U\succeq0,\ p^+(U)\le1,\ PU=0\}
\]

is matching $\mathcal X$.

Assume that in addition to $p(\cdot)$, we have at our disposal a computationally tractable closed convex set $\mathcal D$ such that whenever $p(x)\le1$, the vector $[x]^2:=[x_1^2;...;x_n^2]$ belongs to $\mathcal D$; in the sequel we call such a $\mathcal D$ square-dominating $p(\cdot)$. For example, when $p(\cdot)=\|\cdot\|_r$, we can take

\[
\mathcal D=\left\{\begin{array}{ll}\{y\in\mathbf R^n_+ : \sum_iy_i\le1\}, & r\le2,\\ \{y\in\mathbf R^n_+ : \|y\|_{r/2}\le1\}, & r>2.\end{array}\right.
\]

Prove that in this situation the above construction can be refined: whenever $\mathcal X$ satisfies (4.78), the set

\[
\mathcal U^{\mathcal D}_{p(\cdot),P}=\{U\in\mathbf S^n : U\succeq0,\ p^+(U)\le1,\ PU=0,\ \mathrm{dg}(U)\in\mathcal D\}\qquad[\mathrm{dg}(U)=[U_{11};U_{22};...;U_{nn}]]
\]

matches $\mathcal X$.

Note: in the sequel, we suppress $P$ in the notation $\mathcal U_{p(\cdot),P}$ and $\mathcal U^{\mathcal D}_{p(\cdot),P}$ when $P=0$; thus, $\mathcal U_{p(\cdot)}$ is the same as $\mathcal U_{p(\cdot),0}$.

1.3) Check that when $p(\cdot)=\|\cdot\|_r$ with $r\in[1,\infty]$, one has

\[
p^+(U)=\|U\|_r:=\left\{\begin{array}{ll}\big(\sum_{i,j}|U_{ij}|^r\big)^{1/r}, & 1\le r<\infty,\\ \max_{i,j}|U_{ij}|, & r=\infty.\end{array}\right.
\]

1.4) Let $\mathcal X=\{x\in\mathbf R^n : \|x\|_1\le1\}$ and $p(x)=\|x\|_1$, so that $\mathcal X\subset\{x : p(x)\le1\}$, and

\[
\mathrm{Conv}\{[x]^2 : x\in\mathcal X\}\subset\mathcal D=\Big\{y\in\mathbf R^n_+ : \sum_iy_i\le1\Big\}. \tag{4.79}
\]

What are the bounds $\mathrm{Opt}_{\mathcal U_{p(\cdot)}}(B)$ and $\mathrm{Opt}_{\mathcal U^{\mathcal D}_{p(\cdot)}}(B)$? Is it true that the former (the latter) of the bounds is precise? Is it true that the former (the latter) bound is precise when $B\succeq0$?

1.5) Let $\mathcal X=\{x\in\mathbf R^n : \|x\|_2\le1\}$ and $p(x)=\|x\|_2$, so that $\mathcal X\subset\{x : p(x)\le1\}$ and (4.79) holds true. What are the bounds $\mathrm{Opt}_{\mathcal U_{p(\cdot)}}(B)$ and $\mathrm{Opt}_{\mathcal U^{\mathcal D}_{p(\cdot)}}(B)$? Is the former (the latter) bound precise?

1.6) Let $\mathcal X\subset\mathbf R^n_+$ be closed, convex, bounded, and with a nonempty interior. Verify that the set

\[
\mathcal X^+=\{x\in\mathbf R^n : \exists y\in\mathcal X : \mathrm{abs}[x]\le y\}
\]

is the unit ball of an absolute norm $p_{\mathcal X}$, and this is the largest absolute norm $p(\cdot)$ such that $\mathcal X\subset\{x : p(x)\le1\}$. Derive from this observation that the norm $p_{\mathcal X}(\cdot)$ is the best (i.e., resulting in the least conservative bounding scheme) among absolute norms which allow us to upper-bound $\mathrm{Opt}_{\mathcal X}(B)$ via the construction from item 1.2.

4.8.B “Calculus of matchings.” Observe that the matching we have introduced admits a kind of “calculus.” Specifically, consider the situation as follows: for $1\le\ell\le L$, we are given

• nonempty convex compact sets $\mathcal X_\ell\subset\mathbf R^{n_\ell}$, $0\in\mathcal X_\ell$, along with matching $\mathcal X_\ell$ convex compact sets $\mathcal U_\ell\subset\mathbf S^{n_\ell}$ giving rise to the closed convex cones

\[
\mathbf U_\ell=\mathrm{cl}\{(U_\ell,\mu_\ell)\in\mathbf S^{n_\ell}\times\mathbf R_+ : \mu_\ell>0,\ \mu_\ell^{-1}U_\ell\in\mathcal U_\ell\}.
\]

We denote by $\vartheta_\ell(\cdot)$ the Minkowski functions of $\mathcal X_\ell$:

\[
\vartheta_\ell(y^\ell)=\inf\{t : t>0,\ t^{-1}y^\ell\in\mathcal X_\ell\} :\ \mathbf R^{n_\ell}\to\mathbf R\cup\{+\infty\};
\]

note that $\mathcal X_\ell=\{y^\ell : \vartheta_\ell(y^\ell)\le1\}$;

• $n_\ell\times n$ matrices $A_\ell$ such that $\sum_\ell A_\ell^TA_\ell\succ0$.

On top of that, we are given a monotone convex set T ⊂ RL + intersecting the interior of RL +. These data specify the convex set X = {x ∈ Rn : ∃t ∈ T : ϑ2ℓ (Aℓ x) ≤ tℓ , ℓ ≤ L}.

(∗)

2.1) Prove the following: Lemma 4.27. In the situation in question, the set  U = U ∈ Sn : U  0 & ∃t ∈ T : (Aℓ U ATℓ , tℓ ) ∈ Uℓ , ℓ ≤ L

is a closed and bounded convex set which matches X . As a result, the efficiently


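computable quantity

$$\mathrm{Opt}_{\mathcal{U}}(B) = \max_U \{\mathrm{Tr}(BU) : U \in \mathcal{U}\}$$

is an upper bound on

$$\mathrm{Opt}_{\mathcal{X}}(B) = \max_{x \in \mathcal{X}} x^T B x.$$

2.2) Prove that if $\mathcal{X} \subset \mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $m \times n$ matrix, and $\mathcal{U}$ matches $\mathcal{X}$, then the set $\mathcal{V} = \{V = P U P^T : U \in \mathcal{U}\}$ matches $\mathcal{Y} = \{y : \exists x \in \mathcal{X} : y = Px\}$.

2.3) Prove that if $\mathcal{X} \subset \mathbf{R}^n$ is a nonempty convex compact set, $P$ is an $n \times m$ matrix of rank $m$, and $\mathcal{U}$ matches $\mathcal{X}$, then the set $\mathcal{V} = \{V \succeq 0 : P V P^T \in \mathcal{U}\}$ matches $\mathcal{Y} = \{y : Py \in \mathcal{X}\}$.

2.4) Consider the “direct product” case where $\mathcal{X} = \mathcal{X}_1 \times ... \times \mathcal{X}_L$. When specifying $A_\ell$ as the matrix which “cuts” the $\ell$-th block $A_\ell x = x^\ell$ of a block vector $x = [x^1; ...; x^L] \in \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_L}$ and setting $\mathcal{T} = [0, 1]^L$, we cover this situation by the setup under consideration. In the direct product case, the construction from item 2.1 is as follows: given the sets $\mathcal{U}_\ell$ matching $\mathcal{X}_\ell$, we build the set

$$\mathcal{U} = \{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1 + ... + n_L} : U \succeq 0,\ U^{\ell\ell} \in \mathcal{U}_\ell,\ \ell \le L\}$$

and claim that this set matches $\mathcal{X}$. Could we be less conservative? While we do not know how to be less conservative in general, we do know how to be less conservative in the special case when the $\mathcal{U}_\ell$ are built via absolute norms. Namely, let $p_\ell(\cdot) : \mathbf{R}^{n_\ell} \to \mathbf{R}_+$, $\ell \le L$, be absolute norms, let sets $\mathcal{D}_\ell$ be square-dominating $p_\ell(\cdot)$,

$$\mathcal{X}_\ell \subset \widehat{\mathcal{X}}_\ell = \{x^\ell \in \mathbf{R}^{n_\ell} : P_\ell x^\ell = 0,\ p_\ell(x^\ell) \le 1\},$$

and let

$$\mathcal{U}_\ell = \{U \in \mathbf{S}^{n_\ell} : U \succeq 0,\ P_\ell U = 0,\ p_\ell^+(U) \le 1,\ \mathrm{dg}(U) \in \mathcal{D}_\ell\}.$$

In this case the above construction results in

$$\mathcal{U} = \left\{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1 + ... + n_L}_+ : U \succeq 0,\ \begin{array}{l} P_\ell U^{\ell\ell} = 0\\ p_\ell^+(U^{\ell\ell}) \le 1\\ \mathrm{dg}(U^{\ell\ell}) \in \mathcal{D}_\ell \end{array},\ \ell \le L\right\}.$$

Now let

$$p([x^1; ...; x^L]) = \max[p_1(x^1), ..., p_L(x^L)] : \mathbf{R}^{n_1} \times ... \times \mathbf{R}^{n_L} \to \mathbf{R},$$

so that $p$ is an absolute norm and $\mathcal{X} \subset \{x = [x^1; ...; x^L] : p(x) \le 1,\ P_\ell x^\ell = 0,\ \ell \le L\}$. Prove that in fact the set

$$\overline{\mathcal{U}} = \left\{U = [U^{\ell\ell'} \in \mathbf{R}^{n_\ell \times n_{\ell'}}]_{\ell,\ell' \le L} \in \mathbf{S}^{n_1 + ... + n_L}_+ : U \succeq 0,\ \begin{array}{l} P_\ell U^{\ell\ell} = 0\\ \mathrm{dg}(U^{\ell\ell}) \in \mathcal{D}_\ell \end{array},\ \ell \le L,\ p^+(U) \le 1\right\}$$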


matches $\mathcal{X}$, and that we always have $\overline{\mathcal{U}} \subset \mathcal{U}$. Verify that in general this inclusion is strict.

4.8.C Illustration: Nullspace property revisited. Recall the sparsity-oriented signal recovery via $\ell_1$ minimization from Chapter 1: Given an $m \times n$ sensing matrix $A$ and (noiseless) observation $y = Aw$ of unknown signal $w$ known to have at most $s$ nonzero entries, we recover $w$ as

$$\widehat{w} \in \mathop{\mathrm{Argmin}}_z \{\|z\|_1 : Az = y\}.$$

We called matrix $A$ $s$-good if whenever $y = Aw$ with $s$-sparse $w$, the only optimal solution to the right-hand side optimization problem is $w$. The (difficult to verify!) necessary and sufficient condition for $s$-goodness is the Nullspace property:

$$\mathrm{Opt} := \max_z \{\|z\|_{(s)} : z \in \mathrm{Ker}\, A,\ \|z\|_1 \le 1\} < 1/2,$$

where $\|z\|_{(k)}$ is the sum of the $k$ largest entries in the vector $\mathrm{abs}[z]$. A verifiable sufficient condition for $s$-goodness is

$$\widehat{\mathrm{Opt}} := \min_H \max_j \|\mathrm{Col}_j[I - H^T A]\|_{(s)} < \frac{1}{2}, \qquad (4.80)$$

the reason being that, as is immediately seen, $\widehat{\mathrm{Opt}}$ is an upper bound on $\mathrm{Opt}$ (see Proposition 1.9 with $q = 1$). An immediate observation is that $\mathrm{Opt}$ is nothing but the maximum of a quadratic form over an appropriate convex compact set. Specifically, let

$$\mathcal{X} = \Big\{[u; v] \in \mathbf{R}^n \times \mathbf{R}^n : Au = 0,\ \|u\|_1 \le 1,\ \sum_i |v_i| \le s,\ \|v\|_\infty \le 1\Big\}, \qquad B = \begin{bmatrix} & \frac{1}{2} I_n \\ \frac{1}{2} I_n & \end{bmatrix}.$$

Then

$$\begin{array}{rcl} \mathrm{Opt}_{\mathcal{X}}(B) & = & \max_{[u;v] \in \mathcal{X}} [u; v]^T B [u; v]\\ & = & \max_{u,v} \big\{u^T v : Au = 0,\ \|u\|_1 \le 1,\ \sum_i |v_i| \le s,\ \|v\|_\infty \le 1\big\}\\ & \underbrace{=}_{(a)} & \max_u \big\{\|u\|_{(s)} : Au = 0,\ \|u\|_1 \le 1\big\}\\ & = & \mathrm{Opt}, \end{array}$$

where (a) is due to the well-known fact (prove it!) that whenever $s$ is a positive integer $\le n$, the extreme points of the set

$$V = \Big\{v \in \mathbf{R}^n : \sum_i |v_i| \le s,\ \|v\|_\infty \le 1\Big\}$$

are exactly the vectors with exactly $s$ nonzero entries, the nonzero entries being $\pm 1$; as a result

$$\forall (z \in \mathbf{R}^n) : \max_{v \in V} z^T v = \|z\|_{(s)}.$$
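The identity $\max_{v \in V} z^T v = \|z\|_{(s)}$ can be checked numerically on a toy instance. The following is a minimal sketch, not part of the original exercise: the helper names `s_norm` and `max_over_sign_vectors` and the test vector are illustrative assumptions, and the brute-force enumeration over sign vectors with exactly $s$ nonzero entries is practical only for very small $n$.

```python
from itertools import combinations, product

def s_norm(z, s):
    """||z||_(s): sum of the s largest entries of abs[z]."""
    return sum(sorted((abs(t) for t in z), reverse=True)[:s])

def max_over_sign_vectors(z, s):
    """Brute-force max of z^T v over vectors v with exactly s entries equal to +-1."""
    best = float("-inf")
    for support in combinations(range(len(z)), s):
        for signs in product((-1.0, 1.0), repeat=s):
            value = sum(sg * z[i] for i, sg in zip(support, signs))
            best = max(best, value)
    return best

# Illustrative vector (an assumption, not data from the text).
z = [0.5, -2.0, 1.5, 0.25, -1.0]
for s in range(1, len(z) + 1):
    assert abs(s_norm(z, s) - max_over_sign_vectors(z, s)) < 1e-12
```

Since a linear form attains its maximum over a convex compact set at an extreme point, agreement of the two quantities is what the extreme-point description of $V$ predicts.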


Now, $V$ is the unit ball of the absolute norm

$$r(v) = \min\{t : \|v\|_1 \le st,\ \|v\|_\infty \le t\},$$

so that $\mathcal{X}$ is contained in the unit ball $\mathcal{B}$ of the absolute norm on $\mathbf{R}^{2n}$ specified as

$$p([u; v]) = \max\{\|u\|_1, r(v)\} \qquad [u, v \in \mathbf{R}^n],$$

i.e., $\mathcal{X} = \{[u; v] : p([u; v]) \le 1,\ Au = 0\}$. As a result, whenever $x = [u; v] \in \mathcal{X}$, the matrix

$$U = xx^T = \begin{bmatrix} U^{11} = uu^T & U^{12} = uv^T\\ U^{21} = vu^T & U^{22} = vv^T \end{bmatrix}$$

satisfies the condition $p^+(U) \le 1$ (see item 1.2 above). In addition, this matrix clearly satisfies the condition $A[U^{11}, U^{12}] = 0$. It follows that the set

$$\mathcal{U} = \left\{U = \begin{bmatrix} U^{11} & U^{12}\\ U^{21} & U^{22} \end{bmatrix} \in \mathbf{S}^{2n} : U \succeq 0,\ p^+(U) \le 1,\ AU^{11} = 0,\ AU^{12} = 0\right\}$$

(which clearly is a nonempty convex compact set) matches $\mathcal{X}$. As a result, the efficiently computable quantity

$$\begin{array}{rcl} \overline{\mathrm{Opt}} & = & \max_{U \in \mathcal{U}} \mathrm{Tr}(BU)\\ & = & \max_U \left\{\mathrm{Tr}(U^{12}) : U = \begin{bmatrix} U^{11} & U^{12}\\ U^{21} & U^{22} \end{bmatrix} \succeq 0,\ p^+(U) \le 1,\ AU^{11} = 0,\ AU^{12} = 0\right\} \end{array} \qquad (4.81)$$

is an upper bound on $\mathrm{Opt}$. As a result, the verifiable condition $\overline{\mathrm{Opt}} < 1/2$ is sufficient for $s$-goodness of $A$. Now comes the concluding part of the exercise:

3.1) Prove that $\overline{\mathrm{Opt}} \le \widehat{\mathrm{Opt}}$, so that (4.81) is less conservative than (4.80).
Hint: Apply Conic Duality to verify that

$$\widehat{\mathrm{Opt}} = \max_V \left\{\mathrm{Tr}(V) : V \in \mathbf{R}^{n \times n},\ AV = 0,\ \sum_{i=1}^n r(\mathrm{Col}_i[V^T]) \le 1\right\}. \qquad (4.82)$$

3.2) Run simulations with randomly generated Gaussian matrices $A$ and play with different values of $s$ to compare $\widehat{\mathrm{Opt}}$ and $\overline{\mathrm{Opt}}$. To save time, you can use toy sizes $m$, $n$, say, $m = 18$, $n = 24$.


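4.7.4 Around Propositions 4.4 and 4.14

4.7.4.1 Optimizing linear estimates on convex hulls of unions of spectratopes

Exercise 4.9. Let

• $\mathcal{X}_1, ..., \mathcal{X}_J$ be spectratopes in $\mathbf{R}^n$:
$$\mathcal{X}_j = \{x \in \mathbf{R}^n : \exists (y \in \mathbf{R}^{N_j}, t \in \mathcal{T}_j) : x = P_j y,\ R_{kj}^2[y] \preceq t_k I_{d_{kj}},\ k \le K_j\},\ 1 \le j \le J, \qquad \Big[R_{kj}[y] = \sum_{i=1}^{N_j} y_i R^{kji}\Big],$$
• $A \in \mathbf{R}^{m \times n}$ and $B \in \mathbf{R}^{\nu \times n}$ be given matrices,
• $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal{B}_*$ of the conjugate norm $\|\cdot\|_*$ is a spectratope:
$$\begin{array}{rcl} \mathcal{B}_* & := & \{u : \|u\|_* \le 1\}\\ & = & \{u \in \mathbf{R}^\nu : \exists (z \in \mathbf{R}^N, r \in \mathcal{R}) : u = Mz,\ S_\ell^2[z] \preceq r_\ell I_{f_\ell},\ \ell \le L\} \end{array} \qquad \Big[S_\ell[z] = \sum_{i=1}^N z_i S^{\ell i}\Big],$$
• $\Pi$ be a convex compact subset of the interior of the positive semidefinite cone $\mathbf{S}^m_+$,

with our standard restrictions on $R_{kj}[\cdot]$, $S_\ell[\cdot]$, $\mathcal{T}_j$ and $\mathcal{R}$. Let, further,

$$\mathcal{X} = \mathrm{Conv}\Big(\bigcup_j \mathcal{X}_j\Big)$$

be the convex hull of the union of spectratopes $\mathcal{X}_j$. Consider the situation where, given observation $\omega = Ax + \xi$ of unknown signal $x$ known to belong to $\mathcal{X}$, we want to recover $Bx$. We assume that the matrix of second moments of noise is $\preceq$-dominated by a matrix from $\Pi$, and quantify the performance of a candidate estimate $\widehat{x}(\cdot)$ by its $\|\cdot\|$-risk

$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}|\mathcal{X}] = \sup_{x \in \mathcal{X}}\ \sup_{P : P \lhd \Pi} \mathbf{E}_{\xi \sim P}\{\|Bx - \widehat{x}(Ax + \xi)\|\},$$

where $P \lhd \Pi$ means that the matrix $\mathrm{Var}[P] = \mathbf{E}_{\xi \sim P}\{\xi\xi^T\}$ of second moments of distribution $P$ is $\preceq$-dominated by a matrix from $\Pi$. Prove the following:

Proposition 4.28. In the situation in question, consider the convex optimization problem

$$\mathrm{Opt} = \min_{H,\Theta,\Lambda^j,\Upsilon^j,\Upsilon'} \left\{ \max_j \big[\phi_{\mathcal{T}_j}(\lambda[\Lambda^j]) + \phi_{\mathcal{R}}(\lambda[\Upsilon^j])\big] + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \Gamma_\Pi(\Theta) : \begin{array}{l} \Lambda^j = \{\Lambda^j_k \succeq 0,\ k \le K_j\},\ j \le J,\\ \Upsilon^j = \{\Upsilon^j_\ell \succeq 0,\ \ell \le L\},\ j \le J,\quad \Upsilon' = \{\Upsilon'_\ell \succeq 0,\ \ell \le L\},\\ \begin{bmatrix} \sum_k R^*_{kj}[\Lambda^j_k] & \frac{1}{2} P_j^T [B^T - A^T H] M\\ \frac{1}{2} M^T [B - H^T A] P_j & \sum_\ell S^*_\ell[\Upsilon^j_\ell] \end{bmatrix} \succeq 0,\ j \le J,\\ \begin{bmatrix} \Theta & \frac{1}{2} HM\\ \frac{1}{2} M^T H^T & \sum_\ell S^*_\ell[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \end{array} \right\} \qquad (4.83)$$

where, as usual,

$$\begin{array}{l} \phi_{\mathcal{T}_j}(\lambda) = \max_{t \in \mathcal{T}_j} t^T \lambda,\quad \phi_{\mathcal{R}}(\lambda) = \max_{r \in \mathcal{R}} r^T \lambda,\\ \Gamma_\Pi(\Theta) = \max_{Q \in \Pi} \mathrm{Tr}(Q\Theta),\quad \lambda[U_1, ..., U_s] = [\mathrm{Tr}(U_1); ...; \mathrm{Tr}(U_s)],\\ S^*_\ell[\cdot] : \mathbf{S}^{f_\ell} \to \mathbf{S}^N : S^*_\ell[U] = \big[\mathrm{Tr}(S^{\ell p} U S^{\ell q})\big]_{p,q \le N},\\ R^*_{kj}[\cdot] : \mathbf{S}^{d_{kj}} \to \mathbf{S}^{N_j} : R^*_{kj}[U] = \big[\mathrm{Tr}(R^{kjp} U R^{kjq})\big]_{p,q \le N_j}. \end{array}$$

Problem (4.83) is solvable, and the $H$-component $H_*$ of its optimal solution gives rise to the linear estimate $\widehat{x}_{H_*}(\omega) = H_*^T \omega$ such that

$$\mathrm{Risk}_{\Pi,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}] \le \mathrm{Opt}. \qquad (4.84)$$

Moreover, the estimate $\widehat{x}_{H_*}$ is near-optimal among linear estimates:

$$\mathrm{Opt} \le O(1) \ln(D + F)\, \mathrm{RiskOpt}_{\mathrm{lin}} \qquad \Big[D = \max_j \sum_{k \le K_j} d_{kj},\ F = \sum_{\ell \le L} f_\ell\Big], \qquad (4.85)$$

where

$$\mathrm{RiskOpt}_{\mathrm{lin}} = \inf_H\ \sup_{x \in \mathcal{X},\, Q \in \Pi} \mathbf{E}_{\xi \sim \mathcal{N}(0,Q)}\big\{\|Bx - H^T(Ax + \xi)\|\big\}$$

is the best risk attainable by linear estimates in the current setting under zero mean Gaussian observation noise.

It should be stressed that the convex hull of a union of spectratopes is not necessarily a spectratope, and that Proposition 4.28 states that the linear estimate stemming from (4.83) is near-optimal only among linear, not among all estimates (the latter might indeed not be the case).

4.7.4.2 Recovering nonlinear vector-valued functions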

Exercise 4.10. Consider the situation as follows: We are given a noisy observation

$$\omega = Ax + \xi_x \qquad [A \in \mathbf{R}^{m \times n}]$$

of the linear image $Ax$ of an unknown signal $x$ known to belong to a given spectratope $\mathcal{X} \subset \mathbf{R}^n$; here $\xi_x$ is the observation noise with distribution $P_x$ which can depend on $x$. As in Section 4.3.3, we assume that we are given a computationally tractable convex compact set $\Pi \subset \mathrm{int}\, \mathbf{S}^m_+$ such that for every $x \in \mathcal{X}$, $\mathrm{Var}[P_x] \preceq \Theta$ for some $\Theta \in \Pi$; cf. (4.32). We want to recover the value $f(x)$ of a given vector-valued function $f : \mathcal{X} \to \mathbf{R}^\nu$, and we measure the recovery error in a given norm $|\cdot|$ on $\mathbf{R}^\nu$.


4.10.A. Preliminaries and the Main observation. Let $\|\cdot\|$ be a norm on $\mathbf{R}^n$, and $g(\cdot) : \mathcal{X} \to \mathbf{R}^\nu$ be a function. Recall that the function is called Lipschitz continuous on $\mathcal{X}$ w.r.t. the pair of norms $\|\cdot\|$ on the argument and $|\cdot|$ on the image spaces, if there exists $L < \infty$ such that

$$|g(x) - g(y)| \le L \|x - y\| \quad \forall (x, y \in \mathcal{X});$$

every $L$ with this property is called a Lipschitz constant of $g$. It is well known that in our finite-dimensional situation, the property of $g$ to be Lipschitz continuous is independent of how the norms $\|\cdot\|$, $|\cdot|$ are selected; this selection affects only the value(s) of Lipschitz constant(s). Assume from now on that the function of interest $f$ is Lipschitz continuous on $\mathcal{X}$. Let us call a norm $\|\cdot\|$ on $\mathbf{R}^n$ appropriate for $f$ if $f$ is Lipschitz continuous with constant 1 on $\mathcal{X}$ w.r.t. $\|\cdot\|$, $|\cdot|$. Our immediate observation is as follows:

Observation 4.29. In the situation in question, let $\|\cdot\|$ be appropriate for $f$. Then recovering $f(x)$ is not more difficult than recovering $x$ in the norm $\|\cdot\|$: every estimate $\widehat{x}(\omega)$ of $x$ via $\omega$ such that $\widehat{x}(\cdot) \in \mathcal{X}$ induces the “plug-in” estimate

$$\widehat{f}(\omega) = f(\widehat{x}(\omega))$$

of $f(x)$, and the $\|\cdot\|$-risk

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}] = \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim P_x}\{\|\widehat{x}(Ax + \xi) - x\|\}$$

of estimate $\widehat{x}$ upper-bounds the $|\cdot|$-risk

$$\mathrm{Risk}_{|\cdot|}[\widehat{f}|\mathcal{X}] = \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim P_x}\big\{|\widehat{f}(Ax + \xi) - f(x)|\big\}$$

of the estimate $\widehat{f}$ induced by $\widehat{x}$:

$$\mathrm{Risk}_{|\cdot|}[\widehat{f}|\mathcal{X}] \le \mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}].$$

When $f$ is defined and Lipschitz continuous with constant 1 w.r.t. $\|\cdot\|$, $|\cdot|$ on the entire $\mathbf{R}^n$, this conclusion remains valid without the assumption that $\widehat{x}$ is $\mathcal{X}$-valued.
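The mechanism behind Observation 4.29 is a pointwise inequality: for a norm appropriate for $f$, the plug-in error never exceeds the recovery error. A small numerical illustration follows; it is only a sketch under assumptions that are not in the text: $f$ is taken to be the $\ell_1$ norm on $\mathbf{R}^4$ (which is 1-Lipschitz w.r.t. $\|\cdot\|_1$ by the triangle inequality), the "estimate" is the signal plus Gaussian perturbation, and the seed and sample size are arbitrary.

```python
import random

def l1(v):
    """The l1 norm, used both as the hypothetical f and as the norm ||.||."""
    return sum(abs(t) for t in v)

f = l1  # assumption: f(x) = ||x||_1, so ||.||_1 is appropriate for f

random.seed(1)
max_gap = float("-inf")
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(4)]         # "true" signal
    x_hat = [xi + random.gauss(0, 0.3) for xi in x]       # some estimate of x
    plug_in_error = abs(f(x_hat) - f(x))                  # |f(x_hat) - f(x)|
    recovery_error = l1([a - b for a, b in zip(x_hat, x)])  # ||x_hat - x||
    max_gap = max(max_gap, plug_in_error - recovery_error)

# Pointwise domination (up to rounding); taking expectations and suprema
# of both sides yields the risk inequality of Observation 4.29.
assert max_gap <= 1e-12
```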

4.10.B. Consequences. Observation 4.29 suggests the following simple approach to solving the estimation problem we started with: assuming that we have at our disposal a norm $\|\cdot\|$ on $\mathbf{R}^n$ such that

• $\|\cdot\|$ is appropriate for $f$, and
• $\|\cdot\|$ is good, meaning that the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ is a spectratope given by explicit spectratopic representation,

we use the machinery of linear estimation developed in Section 4.3.3 to build a near-optimal, in terms of its $\|\cdot\|$-risk, linear estimate of $x$ via $\omega$, and convert this estimate into an estimate of $f(x)$. By the above observation, the $|\cdot|$-risk of the resulting estimate is upper-bounded by the $\|\cdot\|$-risk of the underlying linear estimate. The construction just outlined needs a correction: in general, the linear estimate $\widetilde{x}(\cdot)$ yielded by Proposition 4.14 (same as any nontrivial, that is, not identically zero, linear estimate) is not guaranteed to take values in $\mathcal{X}$, which is, in general, required for


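Observation 4.29 to be applicable. This correction is easy: it is enough to convert $\widetilde{x}$ into the estimate $\widehat{x}$ defined by

$$\widehat{x}(\omega) \in \mathop{\mathrm{Argmin}}_{u \in \mathcal{X}} \|u - \widetilde{x}(\omega)\|.$$

This transformation preserves efficient computability of the estimate, and ensures that the corrected estimate takes its values in $\mathcal{X}$; at the same time, “correction” $\widetilde{x} \mapsto \widehat{x}$ nearly preserves the $\|\cdot\|$-risk:

$$\mathrm{Risk}_{\|\cdot\|}[\widehat{x}|\mathcal{X}] \le 2\, \mathrm{Risk}_{\|\cdot\|}[\widetilde{x}|\mathcal{X}]. \qquad (*)$$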
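The correction step can be tried out on a toy instance. The sketch below rests on assumptions that are not in the text: $\mathcal{X}$ is taken to be the box $[-1,1]^5$ and $\|\cdot\| = \|\cdot\|_\infty$, in which case coordinatewise clipping is one minimizer of $\|u - \widetilde{x}\|_\infty$ over $\mathcal{X}$; the raw estimate is the signal plus Gaussian noise. It only verifies the pointwise counterpart of $(*)$ on random samples and is not a proof.

```python
import random

def linf(v):
    """The uniform norm ||v||_inf."""
    return max(abs(t) for t in v)

def project_box(u, lo=-1.0, hi=1.0):
    """A minimizer of ||w - u||_inf over the box [lo, hi]^n: coordinatewise clipping."""
    return [min(max(t, lo), hi) for t in u]

random.seed(0)
worst_ratio = 0.0
for _ in range(1000):
    x = [random.uniform(-1, 1) for _ in range(5)]        # true signal, lies in X
    x_tilde = [xi + random.gauss(0, 1) for xi in x]      # raw estimate, may leave X
    x_hat = project_box(x_tilde)                         # corrected estimate, in X
    err_tilde = linf([a - b for a, b in zip(x_tilde, x)])
    err_hat = linf([a - b for a, b in zip(x_hat, x)])
    assert err_hat <= 2 * err_tilde + 1e-12              # pointwise version of (*)
    if err_tilde > 0:
        worst_ratio = max(worst_ratio, err_hat / err_tilde)
```

In this particular box example the observed ratio does not exceed 1, since clipping moves every coordinate toward the box; the factor 2 in $(*)$ is what holds for a general norm and a general $\mathcal{X}$.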

Note that when $\|\cdot\|$ is a (general-type) Euclidean norm, $\|x\|^2 = x^T Q x$ for some $Q \succ 0$, the factor 2 on the right-hand side can be discarded.

1) Justify $(*)$.

4.10.C. How to select $\|\cdot\|$. When implementing the outlined approach, the major question is how to select a norm $\|\cdot\|$ appropriate for $f$. The best choice would be to select the smallest among the norms appropriate for $f$ (such a norm does exist under mild assumptions), because the smaller the $\|\cdot\|$, the smaller the $\|\cdot\|$-risk of an estimate of $x$. This ideal can be achieved in rare cases only: first, it could be difficult to identify the smallest among the norms appropriate for $f$; second, our approach requires for $\|\cdot\|$ to have an explicitly given spectratope as the unit ball of the conjugate norm. Let us look at a couple of “favorable cases,” where the difficulties just outlined can be (partially) overcome.

Example: A norm-induced $f$. Let us start with the case, important in its own right, when $f$ is a scalar functional which itself is a norm, and this norm has a spectratope as the unit ball of the conjugate norm, as is the case when $f(\cdot) = \|\cdot\|_r$, $r \in [1, 2]$, or when $f(\cdot)$ is the nuclear norm. In this case the smallest of the norms appropriate for $f$ clearly is $f$ itself, and none of the outlined difficulties arises. As an extension, consider the case when $f(x)$ is obtained from a good norm $\|\cdot\|$ by operations preserving Lipschitz continuity and constant, such as $f(x) = \|x - c\|$, or $f(x) = \sum_i a_i \|x - c_i\|$ with $\sum_i |a_i| \le 1$, or

$$f(x) = \sup_{c \in C} / \inf_{c \in C} \|x - c\|,$$

or even something like

$$f(x) = \sup_{\alpha \in \mathcal{A}} / \inf_{\alpha \in \mathcal{A}} \Big\{ \sup_{c \in C_\alpha} / \inf_{c \in C_\alpha} \|x - c\| \Big\}.$$

In such a case, it seems natural to use this norm in our construction, although now this, perhaps, is not the smallest of the norms appropriate for $f$.

Now let us consider the general case. Note that in principle the smallest of the norms appropriate for a given Lipschitz continuous $f$ admits a description. Specifically, assume that $\mathcal{X}$ has a nonempty interior (this is w.l.o.g.: we can always replace $\mathbf{R}^n$ with the linear span of $\mathcal{X}$). A well-known fact of Analysis (Rademacher Theorem) states that in this situation (more generally, when $\mathcal{X}$ is convex with a nonempty interior), a Lipschitz continuous $f$ is differentiable almost everywhere in $\mathcal{X}^o = \mathrm{int}\, \mathcal{X}$, and $f$ is Lipschitz continuous with constant 1 w.r.t. a norm $\|\cdot\|$ if and


only if

$$\|f'(x)\|_{\|\cdot\| \to |\cdot|} \le 1$$

whenever $x \in \mathcal{X}^o$ is such that the derivative (a.k.a. Jacobian) of $f$ at $x$ exists; here $\|Q\|_{\|\cdot\| \to |\cdot|}$ is the matrix norm of a $\nu \times n$ matrix $Q$ induced by the norms $\|\cdot\|$ on $\mathbf{R}^n$ and $|\cdot|$ on $\mathbf{R}^\nu$:

$$\|Q\|_{\|\cdot\| \to |\cdot|} := \max_{\|x\| \le 1} |Qx| = \max_{\|x\| \le 1,\ |y|_* \le 1} y^T Q x = \max_{|y|_* \le 1,\ [\|x\|_*]_* \le 1} x^T Q^T y = \|Q^T\|_{|\cdot|_* \to \|\cdot\|_*},$$

where $\|\cdot\|_*$, $|\cdot|_*$ are the conjugates of $\|\cdot\|$, $|\cdot|$.

2) Prove that a norm $\|\cdot\|$ is appropriate for $f$ if and only if the unit ball of the norm conjugate to $\|\cdot\|$ contains the set

$$\mathcal{B}_{f,*} = \mathrm{cl}\, \mathrm{Conv}\{z : \exists (x \in \mathcal{X}_o,\ y,\ |y|_* \le 1) : z = [f'(x)]^T y\},$$

where $\mathcal{X}_o$ is the set of all $x \in \mathcal{X}^o$ where $f'(x)$ exists. Geometrically, $\mathcal{B}_{f,*}$ is the closed convex hull of the union of all images of the unit ball $B_*$ of $|\cdot|_*$ under the linear mappings $y \mapsto [f'(x)]^T y$ stemming from $x \in \mathcal{X}_o$. Equivalently: $\|\cdot\|$ is appropriate for $f$ if and only if

$$\|u\| \ge \|u\|_f := \max_{z \in \mathcal{B}_{f,*}} z^T u. \qquad (!)$$

Check that $\|u\|_f$ is a norm, provided that $\mathcal{B}_{f,*}$ (this set by construction is a convex compact set symmetric w.r.t. the origin) possesses a nonempty interior; whenever this is the case, $\|u\|_f$ is the smallest of the norms appropriate for $f$. Derive from the above that the norms $\|\cdot\|$ we can use in our approach are the norms on $\mathbf{R}^n$ for which the unit ball of the conjugate norm is a spectratope containing $\mathcal{B}_{f,*}$.

Example. Consider the case of componentwise quadratic $f$:

$$f(x) = \big[\tfrac{1}{2} x^T Q_1 x;\ \tfrac{1}{2} x^T Q_2 x;\ ...;\ \tfrac{1}{2} x^T Q_\nu x\big] \qquad [Q_i \in \mathbf{S}^n]$$

and $|u| = \|u\|_q$ with $q \in [1, 2]$.¹⁰ In this case

$$B_* = \{u \in \mathbf{R}^\nu : \|u\|_p \le 1\},\ p = \frac{q}{q-1} \in [2, \infty[, \quad \text{and} \quad f'(x) = \big[x^T Q_1;\ x^T Q_2;\ ...;\ x^T Q_\nu\big].$$

Setting $\mathcal{S} = \{s \in \mathbf{R}^\nu_+ : \|s\|_{p/2} \le 1\}$ and

$$\mathcal{S}^{1/2} = \{s \in \mathbf{R}^\nu_+ : [s_1^2; ...; s_\nu^2] \in \mathcal{S}\} = \{s \in \mathbf{R}^\nu_+ : \|s\|_p \le 1\},$$

the set

$$\mathcal{Z} = \{[f'(x)]^T u : x \in \mathcal{X},\ u \in B_*\}$$

is contained in the set

$$\mathcal{Y} = \Big\{y \in \mathbf{R}^n : \exists (s \in \mathcal{S}^{1/2},\ x^i \in \mathcal{X},\ i \le \nu) : y = \sum_i s_i Q_i x^i\Big\}.$$

¹⁰ To save notation, we assume that the linear parts in the components of $f$ are trivial (just zeros). In this respect, note that we always can subtract from $f$ any linear mapping and reduce our estimation problem to two distinct problems of estimating separately the values at the signal $x$ of the modified $f$ and the linear mapping we have subtracted (we know how to solve the latter problem reasonably well).

The set $\mathcal{Y}$ is a spectratope with spectratopic representation readily given by that of $\mathcal{X}$. Indeed, $\mathcal{Y}$ is nothing but the $\mathcal{S}$-sum of the spectratopes $Q_i \mathcal{X}$, $i = 1, ..., \nu$; see Section 4.10. As a result, we can use the spectratope $\mathcal{Y}$ (when $\mathrm{int}\, \mathcal{Y} \ne \emptyset$) or the arithmetic sum of $\mathcal{Y}$ with a small Euclidean ball (when $\mathrm{int}\, \mathcal{Y} = \emptyset$) as the unit ball of the norm conjugate to $\|\cdot\|$, thus ensuring that $\|\cdot\|$ is appropriate for $f$. We then can use $\|\cdot\|$ in order to build an estimate of $f(\cdot)$.

3.1) For illustration, work out the problem of recovering the value of a scalar quadratic form

$$f(x) = x^T M x,\ M = \mathrm{Diag}\{i^\alpha,\ i = 1, ..., n\} \qquad [\nu = 1,\ |\cdot| \text{ is the absolute value}]$$

from noisy observation

$$\omega = Ax + \sigma\eta,\ A = \mathrm{Diag}\{i^\beta,\ i = 1, ..., n\},\ \eta \sim \mathcal{N}(0, I_n) \qquad (4.86)$$

of a signal $x$ known to belong to the ellipsoid

$$\mathcal{X} = \{x \in \mathbf{R}^n : \|Px\|_2 \le 1\},\ P = \mathrm{Diag}\{i^\gamma,\ i = 1, ..., n\},$$

where $\alpha$, $\beta$, $\gamma$ are given reals satisfying $\alpha - \gamma - \beta < -1/2$. You could start with the simplest unbiased estimate

$$\widetilde{x}(\omega) = [1^{-\beta}\omega_1;\ 2^{-\beta}\omega_2;\ ...;\ n^{-\beta}\omega_n]$$

of $x$.

3.2) Work out the problem of recovering the norm

$$f(x) = \|Mx\|_p,\ M = \mathrm{Diag}\{i^\alpha,\ i = 1, ..., n\},\ p \in [1, 2],$$

from observation (4.86) with

$$\mathcal{X} = \{x : \|Px\|_r \le 1\},\ P = \mathrm{Diag}\{i^\gamma,\ i = 1, ..., n\},\ r \in [2, \infty].$$

4.7.4.3 Suboptimal linear estimation

Exercise 4.11. [recovery of large-scale signals] Consider the problem of estimating the image $Bx \in \mathbf{R}^\nu$ of signal $x \in \mathcal{X}$ from observation

$$\omega = Ax + \sigma\xi \in \mathbf{R}^m$$

in the simplest case where $\mathcal{X} = \{x \in \mathbf{R}^n : x^T S x \le 1\}$ is an ellipsoid (so that $S \succ 0$), the recovery error is measured in $\|\cdot\|_2$, and $\xi \sim \mathcal{N}(0, I_m)$. In this case, Problem (4.12) to solve when building a “presumably good linear estimate” reduces to

$$\mathrm{Opt} = \min_{H,\lambda} \left\{\lambda + \sigma^2 \|H\|_F^2 : \begin{bmatrix} \lambda S & B^T - A^T H\\ B - H^T A & I_\nu \end{bmatrix} \succeq 0\right\}, \qquad (4.87)$$

where $\|\cdot\|_F$ is the Frobenius norm of a matrix. An optimal solution $H_*$ to this problem results in the linear estimate $\widehat{x}_{H_*}(\omega) = H_*^T \omega$ satisfying the risk bound

$$\mathrm{Risk}[\widehat{x}_{H_*}|\mathcal{X}] := \max_{x \in \mathcal{X}} \sqrt{\mathbf{E}\{\|Bx - H_*^T(Ax + \sigma\xi)\|_2^2\}} \le \sqrt{\mathrm{Opt}}.$$

Now, (4.87) is an efficiently solvable convex optimization problem. However, when the sizes $m$, $n$ of the problem are large, solving the problem by standard optimization techniques could become prohibitively time-consuming. The goal of what follows is to develop a relatively cheap computational technique for finding a good enough suboptimal solution to (4.87). In the sequel, we assume that $A \ne 0$; otherwise (4.87) is trivial.

1) Prove that problem (4.87) can be reduced to a similar problem with $S = I_n$ and diagonal positive semidefinite matrix $A$, the reduction requiring several singular value decompositions and multiplications of matrices of the same sizes as those of $A$, $B$, and $S$.

2) By item 1, we can assume from the very beginning that $S = I$ and $A = \mathrm{Diag}\{\alpha_1, ..., \alpha_n\}$ with $0 \le \alpha_1 \le \alpha_2 \le ... \le \alpha_n$. Passing in (4.87) from variables $\lambda$, $H$ to variables $\tau = \sqrt{\lambda}$, $G = H^T$, the problem becomes

$$\mathrm{Opt} = \min_{G,\tau} \big\{\tau^2 + \sigma^2 \|G\|_F^2 : \|B - GA\| \le \tau\big\}, \qquad (4.88)$$

where $\|\cdot\|$ is the spectral norm. Now consider the construction as follows:

• Consider a partition $\{1, ..., n\} = I_0 \cup I_1 \cup ... \cup I_K$ of the index set $\{1, ..., n\}$ into consecutive segments in such a way that (a) $I_0$ is the set of those $i$, if any, for which $\alpha_i = 0$, and $I_k \ne \emptyset$ when $k \ge 1$, (b) for $k \ge 1$ the ratios $\alpha_j/\alpha_i$, $i, j \in I_k$, do not exceed $\theta > 1$ ($\theta$ is the parameter of our construction), while (c) for $1 \le k < k' \le K$, the ratios $\alpha_j/\alpha_i$, $i \in I_k$, $j \in I_{k'}$, are $> \theta$. The recipe for building the partition is self-evident, and we clearly have
$$K \le \ln(\overline{\alpha}/\underline{\alpha})/\ln(\theta) + 1,$$
where $\overline{\alpha}$ is the largest of $\alpha_i$, and $\underline{\alpha}$ is the smallest of those $\alpha_i$ which are positive.
• For $1 \le k \le K$, we denote by $i_k$ the first index in $I_k$, set $\alpha_k = \alpha_{i_k}$, $n_k = \mathrm{Card}\, I_k$, and define $A_k$ as the $n_k \times n_k$ diagonal matrix with diagonal entries $\alpha_i$, $i \in I_k$.
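One way to read the "self-evident" recipe is a single greedy pass over the sorted diagonal. The sketch below is an assumption about that recipe rather than the authors' construction: the function name `build_partition` is hypothetical, indices are 0-based, and a new segment is opened as soon as the current entry exceeds $\theta$ times the first entry of the current segment. This guarantees (a) and (b), and it makes the first entries of consecutive segments grow by a factor larger than $\theta$, which is what the bound on $K$ uses.

```python
def build_partition(alpha, theta):
    """Greedy partition of indices of a nondecreasing nonnegative list alpha.

    Returns (I0, segments): I0 holds the indices with alpha_i == 0, and each
    segment holds consecutive indices whose entries are within a factor theta
    of the segment's first entry.
    """
    assert theta > 1
    assert all(a >= 0 for a in alpha)
    assert all(alpha[i] <= alpha[i + 1] for i in range(len(alpha) - 1))
    I0 = [i for i, a in enumerate(alpha) if a == 0]
    segments = []
    for i in range(len(I0), len(alpha)):
        if segments and alpha[i] <= theta * alpha[segments[-1][0]]:
            segments[-1].append(i)      # still within factor theta of the segment head
        else:
            segments.append([i])        # open a new segment with head i
    return I0, segments

# Illustrative data (not from the text).
I0, segments = build_partition([0, 0, 1, 1.5, 2, 5, 9, 40], theta=2)
heads = [seg[0] for seg in segments]    # the indices i_k
```

For the illustrative data this yields three segments, in line with the logarithmic bound on $K$.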

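Now, given a $\nu \times n$ matrix $C$, let us specify $C_k$, $0 \le k \le K$, as the $\nu \times n_k$ submatrix of $C$ comprised of columns with indexes from $I_k$, and consider the following parametric optimization problems:

$$\mathrm{Opt}^*_k(\tau) = \min_{G_k \in \mathbf{R}^{\nu \times n_k}} \big\{\|G_k\|_F^2 : \|B_k - G_k A_k\| \le \tau\big\} \qquad (P_k^*[\tau])$$
$$\mathrm{Opt}_k(\tau) = \min_{G_k \in \mathbf{R}^{\nu \times n_k}} \big\{\|G_k\|_F^2 : \|B_k - \alpha_k G_k\| \le \tau\big\} \qquad (P_k[\tau])$$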


where $\tau \ge 0$ is the parameter, and $1 \le k \le K$. Justify the following simple observations:

2.1) $G_k$ is feasible for $(P_k[\tau])$ if and only if the matrix $G_k^* = \alpha_k G_k A_k^{-1}$ is feasible for $(P_k^*[\tau])$, and $\|G_k^*\|_F \le \|G_k\|_F \le \theta \|G_k^*\|_F$, implying that
$$\mathrm{Opt}^*_k(\tau) \le \mathrm{Opt}_k(\tau) \le \theta^2\, \mathrm{Opt}^*_k(\tau);$$

2.2) Problems $(P_k[\tau])$ are easy to solve: if $B_k = U_k D_k V_k^T$ is the singular value decomposition of $B_k$ and $\sigma_{k\ell}$, $1 \le \ell \le \nu_k := \min[\nu, n_k]$, are diagonal entries of $D_k$, then an optimal solution to $(P_k[\tau])$ is
$$\widehat{G}_k[\tau] = [\alpha_k]^{-1} U_k D_k[\tau] V_k^T,$$
where $D_k[\tau]$ is the diagonal matrix obtained from $D_k$ by truncating the diagonal entries $\sigma_{k\ell} \mapsto [\sigma_{k\ell} - \tau]_+$ (from now on, $a_+ = \max[a, 0]$, $a \in \mathbf{R}$). The optimal value in $(P_k[\tau])$ is
$$\mathrm{Opt}_k(\tau) = [\alpha_k]^{-2} \sum_{\ell=1}^{\nu_k} [\sigma_{k\ell} - \tau]_+^2.$$
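Once the singular values of $B_k$ are available, the closed form in item 2.2 is a one-liner. The following minimal sketch assumes the singular values are passed in as a plain list (computing the SVD itself is left to a linear algebra library); the function names `truncated_singular_values` and `opt_k` and the numbers are illustrative, not from the text.

```python
def truncated_singular_values(singular_values, tau):
    """Diagonal of D_k[tau]: each sigma is mapped to [sigma - tau]_+."""
    return [max(s - tau, 0.0) for s in singular_values]

def opt_k(singular_values, alpha_k, tau):
    """Optimal value of (P_k[tau]): alpha_k^{-2} * sum_l [sigma_l - tau]_+^2."""
    return sum(d * d for d in truncated_singular_values(singular_values, tau)) / alpha_k ** 2

# Illustrative numbers: singular values 3, 1, 0.5 and alpha_k = 2.
sigmas = [3.0, 1.0, 0.5]
value_at_1 = opt_k(sigmas, alpha_k=2.0, tau=1.0)   # only sigma = 3 survives: (3 - 1)^2 / 4
value_at_3 = opt_k(sigmas, alpha_k=2.0, tau=3.0)   # tau >= largest sigma: G_k = 0 is feasible
```

As a function of $\tau$, this value is convex and nonincreasing, which is what makes the later univariate search over $\tau$ cheap.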

2.3) If $(\tau, G)$ is a feasible solution to (4.88) then $\tau \ge \underline{\tau} := \|B_0\|$, and the matrices $G_k$, $1 \le k \le K$, are feasible solutions to problems $(P_k^*[\tau])$, implying that
$$\sum_k \mathrm{Opt}^*_k(\tau) \le \|G\|_F^2.$$
And vice versa: if $\tau \ge \underline{\tau}$, $G_k$, $1 \le k \le K$, are feasible solutions to problems $(P_k^*[\tau])$, and
$$K_+ = \begin{cases} K, & I_0 = \emptyset,\\ K + 1, & I_0 \ne \emptyset, \end{cases}$$
then the matrix $G = [0_{\nu \times n_0}, G_1, ..., G_K]$ and $\tau_+ = \sqrt{K_+}\, \tau$ form a feasible solution to (4.88).

Extract from these observations that if $\tau_*$ is an optimal solution to the convex optimization problem

$$\min_\tau \left\{\theta^2 \tau^2 + \sigma^2 \sum_{k=1}^K \mathrm{Opt}_k(\tau) : \tau \ge \underline{\tau}\right\} \qquad (4.89)$$

and $G_{k,*}$ are optimal solutions to the problems $(P_k[\tau_*])$, then the pair

$$\widehat{\tau} = \sqrt{K_+}\, \tau_*,\quad \widehat{G} = [0_{\nu \times n_0}, G^*_{1,*}, ..., G^*_{K,*}] \qquad [G^*_{k,*} = \alpha_k G_{k,*} A_k^{-1}]$$

is a feasible solution to (4.88), and the value of the objective of the latter problem at this feasible solution is within the factor $\max[K_+, \theta^2]$ of the true optimal value $\mathrm{Opt}$ of this problem. As a result, $\widehat{G}$ gives rise to a linear estimate with risk on $\mathcal{X}$ which is within the factor $\max[\sqrt{K_+}, \theta]$ of the risk $\sqrt{\mathrm{Opt}}$ of the “presumably


good” linear estimate yielded by an optimal solution to (4.87). Notice that

• After carrying out singular value decompositions of matrices $B_k$, $1 \le k \le K$, specifying $\tau_*$ and $G_{k,*}$ requires solving a univariate convex minimization problem with an easy-to-compute objective, so that the problem can be easily solved, e.g., by bisection;
• The computationally cheap suboptimal solution we end up with is not that bad, since $K$ is “moderate”: just logarithmic in the condition number $\overline{\alpha}/\underline{\alpha}$ of $A$.

Your next task is as follows:

3) To get an idea of the performance of the proposed synthesis of “suboptimal” linear estimation, run numerical experiments as follows:

• select some $n$ and generate at random the $n \times n$ data matrices $S$, $A$, $B$;
• for “moderate” values of $n$ compute both the linear estimate yielded by the optimal solution to (4.12)¹¹ and the suboptimal estimate as yielded by the above construction. Compare their risk bounds and the associated CPU times. For “large” $n$, where solving (4.12) becomes prohibitively time-consuming, compute only a suboptimal estimate in order to get an impression of how the corresponding CPU time grows with $n$.

Recommended setup:

• range of $n$: 50, 100 (“moderate” values), 1000, 2000 (“large” values)
• range of $\sigma$: $\{1.0, 0.01, 0.0001\}$
• generation of $S$, $A$, $B$: generate the matrices at random according to
$$S = U_S\, \mathrm{Diag}\{1, 2, ..., n\}\, U_S^T,\quad A = U_A\, \mathrm{Diag}\{\mu_1, ..., \mu_n\}\, V_A^T,\quad B = U_B\, \mathrm{Diag}\{\mu_1, ..., \mu_n\}\, V_B^T,$$
where $U_S$, $U_A$, $V_A$, $U_B$, $V_B$ are random orthogonal $n \times n$ matrices, and the $\mu_i$ form a geometric progression with $\mu_1 = 0.01$ and $\mu_n = 1$.

You could run the above construction for several values of $\theta$ and select the best, in terms of its risk bound, of the resulting suboptimal estimates.

¹¹ When $\mathcal{X}$ is an ellipsoid, the semidefinite relaxation bound on the maximum of a quadratic form over $x \in \mathcal{X}$ is exact, so that we are in the case when an optimal solution to (4.12) yields the best, in terms of risk on $\mathcal{X}$, linear estimate.

4.11.A. Simple case. There is a trivial case where (4.88) is really easy; this is the case where the right orthogonal factors in the singular value decompositions of $A$ and $B$ are the same, that is, when

$$B = W F V^T,\quad A = U D V^T$$

with orthogonal $n \times n$ matrices $W$, $U$, $V$ and diagonal $F$, $D$. This very special case is in fact of some importance: it covers the denoising situation where $B = A$, so that our goal is to denoise our observation of $Ax$ given a priori information $x \in \mathcal{X}$ on $x$. In this situation, setting $W^T H^T U = G$, problem (4.88) becomes

$$\mathrm{Opt} = \min_G \big\{\|F - GD\|^2 + \sigma^2 \|G\|_F^2\big\}. \qquad (4.90)$$

Now goes the concluding part of the exercise:

4) Prove that in the situation in question an optimal solution $G_*$ to (4.90) can be selected to be diagonal, with diagonal entries $\gamma_i$, $1 \le i \le n$, yielded by the optimal solution to the optimization problem

$$\mathrm{Opt} = \min_\gamma \left\{f(G) := \max_{i \le n} (\phi_i - \gamma_i \delta_i)^2 + \sigma^2 \sum_{i=1}^n \gamma_i^2\right\} \qquad [\phi_i = F_{ii},\ \delta_i = D_{ii}].$$
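The diagonal problem in item 4 can be explored numerically with a one-dimensional search. The sketch below relies on a reduction that is an assumption of this example, not a statement from the text: for a fixed bound $t$ on $\max_i |\phi_i - \gamma_i \delta_i|$, the cheapest choice is $\gamma_i = \mathrm{sign}(\phi_i)[|\phi_i| - t]_+/\delta_i$ when $\delta_i \ne 0$ (and $t \ge |\phi_i|$ is forced when $\delta_i = 0$), which leaves a convex function of $t$ that is minimized here by ternary search. The names `solve_diagonal` and `objective` are hypothetical.

```python
def objective(phi, delta, sigma, gamma):
    """max_i (phi_i - gamma_i * delta_i)^2 + sigma^2 * sum_i gamma_i^2."""
    worst = max((p - g * d) ** 2 for p, d, g in zip(phi, delta, gamma))
    return worst + sigma ** 2 * sum(g * g for g in gamma)

def solve_diagonal(phi, delta, sigma, iters=200):
    """Ternary search over the residual level t (assumed reduction, see lead-in)."""
    t_lo = max([abs(p) for p, d in zip(phi, delta) if d == 0], default=0.0)
    t_hi = max(t_lo, max(abs(p) for p in phi))

    def gammas(t):
        return [0.0 if d == 0 else (1.0 if p >= 0 else -1.0) * max(abs(p) - t, 0.0) / d
                for p, d in zip(phi, delta)]

    def h(t):
        return t * t + sigma ** 2 * sum(g * g for g in gammas(t))

    lo, hi = t_lo, t_hi
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if h(m1) <= h(m2):
            hi = m2
        else:
            lo = m1
    t = 0.5 * (lo + hi)
    return gammas(t), h(t)

# Scalar sanity check: minimizing (1 - g)^2 + g^2 gives g = 1/2 and value 1/2.
gamma, value = solve_diagonal([1.0], [1.0], sigma=1.0)
```

The returned value upper-bounds the objective at the returned `gamma`, so comparing the two is a quick consistency check on larger instances.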

Exercise 4.12. [image reconstruction, follow-up to Exercise 4.11] A grayscale image can be represented by an $m \times n$ matrix $x = [x_{pq}]$, $0 \le p$ [...] $0$ be such that $\mathcal{Z} \subset \Delta[\alpha]$. Prove that

$$\mathcal{X}^+ = \{[x; z] : W[x; z] = 0,\ [x; z] \in \mathrm{Conv}\{v_{ij} = [g_i; h_j],\ 1 \le i \le n,\ 0 \le j \le p\}\}, \qquad (!)$$

where the $g_i$ are the standard basic orths in $\mathbf{R}^n$, $h_0 = 0 \in \mathbf{R}^p$, and $\alpha_j h_j$, $1 \le j \le p$, are the standard basic orths in $\mathbf{R}^p$.

6.2) Derive from 5.1 that the efficiently computable convex function

$$\Phi_{SA}(H) = \inf_C \Big\{\max_{i,j} \|(B - H^T A) g_i + C^T W v_{ij}\| : C \in \mathbf{R}^{(p+q) \times \nu}\Big\}$$

is an upper bound on $\Phi(H)$. In the sequel, we refer to $\Phi_{SA}(H)$ as the Sherali-Adams bound [214].

4.17.G. Combined bound. We can combine the above bounds, specifically, as follows:

7) Prove that the efficiently computable convex function

$$\Phi_{LBS}(H) = \inf_{(\Lambda_\pm, C_\pm, \mu, \mu_+) \in \mathcal{R}}\ \max_{i,j}\ G_{ij}(H, \Lambda_\pm, C_\pm, \mu, \mu_+), \qquad (\#)$$

where

$$G_{ij}(H, \Lambda_\pm, C_\pm, \mu, \mu_+) := -\mu^T F g_i + \mu_+^T W v_{ij} + \min_t \Big\{\|t\| : t \ge \mathrm{Max}\Big[\big[(B - H^T A - \Lambda_+ F) g_i + C_+^T W v_{ij}\big]_+,\ \big[(-B + H^T A - \Lambda_- F) g_i + C_-^T W v_{ij}\big]_+\Big]\Big\},$$
$$\mathcal{R} = \{(\Lambda_\pm, C_\pm, \mu, \mu_+) : \Lambda_\pm \in \mathbf{R}_+^{\nu \times (p+2q)},\ C_\pm \in \mathbf{R}^{(p+q) \times \nu},\ \mu \in \mathbf{R}_+^{p+2q},\ \mu_+ \in \mathbf{R}^{p+q}\},$$

is an upper bound on $\Phi(H)$, and that this Combined bound is at least as good as any of the Lagrange, Basic, or Sherali-Adams bounds.


4.17.H. How to select $\alpha$? A shortcoming of the Sherali-Adams and the Combined upper bounds on $\Phi$ is the presence of a “degree of freedom”: the positive vector $\alpha$. Intuitively, we would like to select $\alpha$ to make the simplex $\Delta[\alpha] \supset \mathcal{Z}$ “as small as possible.” It is unclear, however, what “as small as possible” is in our context, not to speak of how to select the required $\alpha$ after we agree on how we measure the “size” of $\Delta[\alpha]$. It turns out, however, that we can efficiently select $\alpha$ resulting in the smallest volume $\Delta[\alpha]$.

8) Prove that minimizing the volume of $\Delta[\alpha] \supset \mathcal{Z}$ in $\alpha$ reduces to solving the following convex optimization problem:

$$\inf_{\alpha,u,v} \left\{-\sum_{s=1}^p \ln(\alpha_s) : 0 \le \alpha \le -v,\ E^T u + G^T v \le \mathbf{1}_n\right\}. \qquad (*)$$

9) Run numerical experiments to evaluate the quality of the above bounds. It makes sense to generate problems where we know in advance the actual value of $\Phi$, e.g., to take

$$\mathcal{X} = \{x \in \Delta_n : x \ge a\} \qquad (a)$$

with $a \ge 0$ such that $\sum_i a_i \le 1$. In this case, we can easily list the extreme points of $\mathcal{X}$ (how?) and thus can easily compute $\Phi(H)$. In your experiments, you can use the matrices stemming from “presumably good” linear estimates yielded by the optimization problems

$$\mathrm{Opt} = \min_{H,\Upsilon,\Theta} \left\{\Phi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) : \Upsilon = \{\Upsilon_\ell \succeq 0,\ \ell \le L\},\ \begin{bmatrix} \Theta & \frac{1}{2} HM\\ \frac{1}{2} M^T H^T & \sum_\ell S^*_\ell[\Upsilon_\ell] \end{bmatrix} \succeq 0\right\}, \qquad (4.99)$$

where

$$\Gamma_{\mathcal{X}}(\Theta) = \frac{1}{K} \max_{x \in \mathcal{X}} \mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$$

(see Corollary 4.12), with the actual $\Phi$ (which is available for our $\mathcal{X}$), or the upper bounds on $\Phi$ (Lagrange, Basic, Sherali-Adams, and Combined) in the role of $\Phi$. Note that it may make sense to test seven bounds rather than just four. Indeed, with additional constraints on the optimization variables in (#), we can get, besides “pure” Lagrange, Basic, and Sherali-Adams bounds and their “three-component combination” (Combined bound), pairwise combinations of the pure bounds as well. For example, to combine Lagrange and Sherali-Adams bounds, it suffices to add to (#) the constraints $\Lambda_\pm = 0$.

Exercise 4.18. The exercise to follow deals with recovering discrete probability distributions in the Wasserstein norm. The Wasserstein distance between probability distributions is extremely popular today in Statistics; it is defined as follows.¹⁷ Consider discrete random variables taking values in finite observation space $\Omega = \{1, 2, ..., n\}$ which is equipped with the metric $\{d_{ij} : 1 \le i, j \le n\}$ satisfying the standard axioms.¹⁸ As always, we identify probability distributions on $\Omega$ with $n$-dimensional probabilistic vectors $p = [p_1; ...; p_n]$, where $p_i$ is the probability mass assigned by $p$ to $i \in \Omega$. The Wasserstein distance between probability distributions $p$ and $q$ is defined as

$$W(p, q) = \min_{x = [x_{ij}]} \left\{\sum_{ij} d_{ij} x_{ij} : x_{ij} \ge 0,\ \sum_j x_{ij} = p_i,\ \sum_i x_{ij} = q_j\ \ \forall 1 \le i, j \le n\right\}. \qquad (4.100)$$

¹⁷ The distance we consider stems from the Wasserstein 1-distance between discrete probability distributions. This is a particular case of the general Wasserstein $p$-distance between (not necessarily discrete) probability distributions.

In other words, one may think of $p$ and $q$ as two distributions of unit mass on the points of $\Omega$, and consider the mass transport problem of redistributing the mass assigned to points by distribution $p$ to get the distribution $q$. Denoting by $x_{ij}$ the mass moved from point $i$ to point $j$, constraints $\sum_j x_{ij} = p_i$ say that the total mass taken from point $i$ is exactly $p_i$, constraints $\sum_i x_{ij} = q_j$ say that as the result of transportation, the mass at point $j$ will be exactly $q_j$, and the constraints $x_{ij} \ge 0$ reflect the fact that transport of a negative mass is forbidden. Assuming that the cost of transporting a mass $\mu$ from point $i$ to point $j$ is $d_{ij}\mu$, the Wasserstein distance $W(p, q)$ between $p$ and $q$ is the cost of the cheapest transportation plan which converts $p$ into $q$. As compared to other natural distances between discrete probability distributions, like $\|p - q\|_1$, the advantage of the Wasserstein distance is that it allows us to model the situation (indeed arising in some applications) where the effect, measured in terms of intended application, of changing probability masses of points from $\Omega$ is small when the probability mass of a point is redistributed among close points.¹⁹

Now comes the first part of the exercise:

1) Let $p$, $q$ be two probability distributions. Prove that

$$W(p, q) = \max_{f \in \mathbf{R}^n} \left\{\sum_i f_i (p_i - q_i) : |f_i - f_j| \le d_{ij}\ \ \forall i, j\right\}. \qquad (4.101)$$
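The pair (4.100)-(4.101) can be illustrated on a case where both problems admit explicit candidates. The sketch below makes assumptions that are not in the text: the metric is $d_{ij} = |i - j|$ (points on a line), the transport plan is the greedy "north-west corner" plan, and the dual candidate $f$ is built from the signs of the partial sums of $p - q$. A feasible plan and a feasible $f$ with equal objective values certify each other's optimality by weak duality; exact rational arithmetic keeps the comparison free of rounding. All function names are hypothetical.

```python
from fractions import Fraction as Fr

def greedy_plan(p, q):
    """Feasible plan for (4.100): move mass from p to q in left-to-right order."""
    n = len(p)
    x = [[Fr(0)] * n for _ in range(n)]
    p_left, q_left = list(p), list(q)
    i = j = 0
    while i < n and j < n:
        m = min(p_left[i], q_left[j])
        x[i][j] += m
        p_left[i] -= m
        q_left[j] -= m
        if p_left[i] == 0:
            i += 1          # source i is exhausted
        else:
            j += 1          # destination j is filled
    return x

def plan_cost(x):
    """Objective of (4.100) for the line metric d_ij = |i - j|."""
    n = len(x)
    return sum(abs(i - j) * x[i][j] for i in range(n) for j in range(n))

def dual_candidate(p, q):
    """f with f_i - f_{i+1} = sign of the i-th partial sum of p - q (1-Lipschitz)."""
    f, c = [Fr(0)], Fr(0)
    for pi, qi in zip(p[:-1], q[:-1]):
        c += pi - qi
        f.append(f[-1] - (c > 0) + (c < 0))
    return f

p = [Fr(1, 2), Fr(1, 2), Fr(0)]
q = [Fr(0), Fr(1, 2), Fr(1, 2)]
x = greedy_plan(p, q)
f = dual_candidate(p, q)
primal = plan_cost(x)
dual = sum(fi * (pi - qi) for fi, pi, qi in zip(f, p, q))
assert primal == dual   # equal values certify optimality of both candidates
```

For a general metric $d$ neither shortcut applies, and (4.100) has to be solved as a Linear Programming problem.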

Treating vector $f \in \mathbf{R}^n$ as a function on $\Omega$, the value of the function at a point $i \in \Omega$ being $f_i$, (4.101) admits a very transparent interpretation: the Wasserstein distance $W(p, q)$ between probability distributions $p$ and $q$ is the maximum of inner products of $p - q$ and functions $f$ on $\Omega$ which are Lipschitz continuous w.r.t. the metric $d$, with constant 1. When shifting $f$ by a constant, the inner product remains intact (since $p - q$ is a vector with zero sum of entries). Therefore, denoting by

$$D = \max_{i,j} d_{ij}$$

the $d$-diameter of $\Omega$, we have

$$W(p, q) = \max_f \big\{f^T (p - q) : |f_i - f_j| \le d_{ij},\ |f_i| \le D/2\ \ \forall i, j\big\}, \qquad (4.102)$$

¹⁸ Namely, symmetry and positivity: $d_{ij} = d_{ji} \ge 0$, with $d_{ij} = 0$ if and only if $i = j$; and the triangle inequality: $d_{ik} \le d_{ij} + d_{jk}$ for all triples $i, j, k$.
¹⁹ In fact, the Wasserstein distance shares this property with some other distances between distributions used in Probability Theory, such as Skorohod, or Prokhorov, or Ky Fan distances. What makes the Wasserstein distance so “special” is its representation (4.100) as the optimal value of a Linear Programming problem, responsible for efficient computational handling of this distance.


the reason being that every function $f$ on $\Omega$ which is Lipschitz continuous, with constant 1, w.r.t. metric $d$ can be shifted by a constant to ensure $\|f\|_\infty \le D/2$ (look what happens when the shift ensures that $\min_i f_i = -D/2$). Representation (4.102) shows that the Wasserstein distance is generated by a norm on $\mathbf{R}^n$: for all probability distributions on $\Omega$ one has $W(p,q) = \|p-q\|_W$, where $\|\cdot\|_W$ is the Wasserstein norm on $\mathbf{R}^n$ given by

$$\|x\|_W = \max_{f\in\mathcal{B}_*} f^Tx,\qquad \mathcal{B}_* = \big\{u\in\mathbf{R}^n : u^TS_{ij}u \le 1,\ 1\le i\le j\le n\big\},$$
$$S_{ij} = \begin{cases} d_{ij}^{-2}[e_i - e_j][e_i - e_j]^T, & 1\le i<j\le n,\\ 4D^{-2}e_ie_i^T, & 1\le i=j\le n,\end{cases} \tag{4.103}$$

where $e_1, ..., e_n$ are the standard basic orths in $\mathbf{R}^n$.

2) Let us equip the $n$-element set $\Omega = \{1, ..., n\}$ with the metric

$$d_{ij} = \begin{cases} 2, & i\ne j,\\ 0, & i=j.\end{cases}$$

What is the associated Wasserstein norm?

Note that the set B∗ in (4.103) is the unit ball of the norm conjugate to k·kW , and as we see, this set is a basic ellitope. As a result, the estimation machinery developed in Chapter 4 is well suited for recovering discrete probability distributions in the Wasserstein norm. This observation motivates the concluding part of the exercise: 3) Consider the situation as follows: Given an m × n column-stochastic matrix A and a ν × n column-stochastic matrix B, we observe K samples ωk , 1 ≤ k ≤ K, independent of each other, drawn from the discrete probability distribution Ax ∈ ∆m (as always, ∆ν ⊂ Rν is the probabilistic simplex in Rν ), x ∈ ∆n being an unknown “signal” underlying the observations; realizations of ωk are identified with respective vertices f1 , ..., fm of ∆m . Our goal is to use the observations to estimate the distribution Bx ∈ ∆ν . We are given a metric d on the set Ων = {1, 2, ..., ν} of indices of entries in Bx, and measure the recovery error in the Wasserstein norm k · kW associated with d. Build an explicit convex optimization problem responsible for a “presumably good” linear recovery of the form

$$\hat{x}_H = \frac{1}{K}H^T\sum_{k=1}^K \omega_k.$$

Exercise 4.19.

[follow-up to Exercise 4.17] In Exercise 4.17, we have built a “presumably good” linear estimate $\hat{x}_{H_*}(\cdot)$ (see (4.98)) yielded by the $H$-component $H_*$ of an optimal solution to problem (4.99). The optimal value Opt in this problem is an upper bound on the risk $\mathrm{Risk}_{\|\cdot\|}[\hat{x}_{H_*}|\mathcal{X}]$ (here and in what follows we use the same notation and impose the same assumptions as in Exercise 4.17). Recall that $\mathrm{Risk}_{\|\cdot\|}$ is the worst, w.r.t. signals $x\in\mathcal{X}$ underlying our observations, expected norm of the recovery error. It makes sense also to provide upper bounds on the probabilities of deviations of the error’s magnitude from its expected value, and this is the problem


we consider here; cf. Exercise 4.14.

1) Prove the following

Lemma 4.33. Let $Q\in\mathbf{S}^m_+$, let $K$ be a positive integer, and let $p\in\Delta_m$. Let, further, $\omega^K = (\omega_1, ..., \omega_K)$ be i.i.d. random vectors, with $\omega_k$ taking the value $e_j$ ($e_1, ..., e_m$ are the standard basic orths in $\mathbf{R}^m$) with probability $p_j$. Finally, let $\xi_k = \omega_k - \mathbf{E}\{\omega_k\} = \omega_k - p$, and $\hat\xi = \frac{1}{K}\sum_{k=1}^K\xi_k$. Then for every $\epsilon\in(0,1)$ it holds

$$\mathrm{Prob}\Big\{\|\hat\xi\|_2^2 \le \frac{12\ln(2m/\epsilon)}{K}\Big\} \ge 1-\epsilon.$$

Hint: use the classical Bernstein inequality: Let $X_1, ..., X_K$ be independent zero mean random variables taking values in $[-M, M]$, and let $\sigma_k^2 = \mathbf{E}\{X_k^2\}$. Then for every $t\ge 0$ one has

$$\mathrm{Prob}\Big\{\sum_{k=1}^K X_k \ge t\Big\} \le \exp\Big\{-\frac{t^2}{2\big[\sum_k\sigma_k^2 + \frac{1}{3}Mt\big]}\Big\}.$$
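Before proving the lemma, it may help to see how conservative the bound $12\ln(2m/\epsilon)/K$ is in a small simulation. The sketch below is only an illustration under assumed parameters ($p$, $K$, $\epsilon$, the number of trials and the seed are arbitrary choices, and `empirical_check` is a hypothetical helper name); it draws $K$-element samples, forms $\hat\xi$ as the difference between the empirical and the true distribution, and reports the fraction of trials in which $\|\hat\xi\|_2^2$ stays below the bound.

```python
import math
import random

def empirical_check(p, K, eps, trials=2000, seed=0):
    """Fraction of trials with ||xi_hat||_2^2 <= 12 ln(2m/eps)/K, and the bound itself."""
    rng = random.Random(seed)
    m = len(p)
    bound = 12.0 * math.log(2 * m / eps) / K
    hits = 0
    for _ in range(trials):
        counts = [0] * m
        # omega_k = e_j with probability p_j; only the counts matter for the average
        for j in rng.choices(range(m), weights=p, k=K):
            counts[j] += 1
        sq_norm = sum((counts[j] / K - p[j]) ** 2 for j in range(m))
        hits += sq_norm <= bound
    return hits / trials, bound

frac, bound = empirical_check([0.5, 0.3, 0.2], K=50, eps=0.05)
print(frac, bound)
```

Since $\mathbf{E}\{\|\hat\xi\|_2^2\} \le 1/K$, the observed fraction is typically far above $1-\epsilon$; the lemma is a guarantee, not a tight description of the tail.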

2) Consider the situation described in Exercise 4.17 with X = ∆n , specifically,

• Our observation is a sample $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. components $\omega_k\sim Ax$, where $x\in\Delta_n$ is an unknown $n$-dimensional probabilistic vector, $A$ is an $m\times n$ stochastic matrix (nonnegative matrix with unit column sums), and $\omega\sim Ax$ means that $\omega$ is a random vector taking value $e_i$ ($e_i$ are standard basic orths in $\mathbf{R}^m$) with probability $[Ax]_i$, $1\le i\le m$.
• Our goal is to recover $Bx$ in a given norm $\|\cdot\|$; here $B$ is a given $\nu\times n$ matrix.
• We assume that the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ is a spectratope:
$$\mathcal{B}_* = \{u = My,\ y\in\mathcal{Y}\},\qquad \mathcal{Y} = \{y\in\mathbf{R}^N : \exists r\in\mathcal{R} : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell\le L\}.$$

Our goal is to build a presumably good linear estimate

$$\hat{x}_H(\omega^K) = H^T\hat\omega[\omega^K],\qquad \hat\omega[\omega^K] = \frac{1}{K}\sum_k\omega_k.$$

Prove the following

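Proposition 4.34. Let $H, \Theta, \Upsilon$ be a feasible solution to the convex optimization problem

$$\min_{H,\Theta,\Upsilon}\left\{\Phi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma(\Theta)/K : \Upsilon = \{\Upsilon_\ell\succeq 0,\ \ell\le L\},\ \begin{bmatrix}\Theta & \frac12 HM\\ \frac12 M^TH^T & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq 0\right\} \tag{4.104}$$

where

$$\Phi(H) = \max_{j\le n}\|\mathrm{Col}_j[B - H^TA]\|,\qquad \Gamma(\Theta) = \max_{x\in\Delta_n}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta).$$

Then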


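(i) For every $x\in\Delta_n$ it holds

$$\mathbf{E}_{\omega^K}\big\{\|Bx - \hat{x}_H(\omega^K)\|\big\} \le \Phi(H) + 2K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma(\Theta)} \le \Phi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma(\Theta)/K. \tag{4.105}$$

(ii) Let $\epsilon\in(0,1)$. For every $x\in\Delta_n$, with

$$\gamma = 2\sqrt{3\ln(2m/\epsilon)}$$

one has

$$\mathrm{Prob}_{\omega^K}\Big\{\|Bx - \hat{x}_H(\omega^K)\| \le \Phi(H) + 2\gamma K^{-1/2}\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\|\Theta\|_{\mathrm{Sh},\infty}}\Big\} \ge 1-\epsilon. \tag{4.106}$$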

3) Look what happens when $\nu = m = n$, $A$ and $B$ are the unit matrices, and $H = I$, i.e., we want to understand how good the recovery of a discrete probability distribution by the empirical distribution of a $K$-element i.i.d. sample drawn from the original distribution is. Take, as $\|\cdot\|$, the norm $\|\cdot\|_p$ with $p\in[1,2]$, and show that for every $x\in\Delta_n$ and every $\epsilon\in(0,1)$ one has

$$\mathbf{E}\big\{\|x - \hat{x}_I(\omega^K)\|_p\big\} \le n^{\frac1p-\frac12}K^{-\frac12},$$
$$\mathrm{Prob}\Big\{\|x - \hat{x}_I(\omega^K)\|_p \le 2\sqrt{3\ln(2n/\epsilon)}\,n^{\frac1p-\frac12}K^{-\frac12}\Big\} \ge 1-\epsilon.$$
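The expectation bound in item 3 can be sanity-checked by simulation before it is proved. The following sketch assumes a uniform $x$ with $n = 5$ and $K = 100$ (arbitrary illustrative choices; `mean_lp_error` is a hypothetical helper name) and compares a Monte Carlo estimate of $\mathbf{E}\{\|x - \hat{x}_I(\omega^K)\|_p\}$ with $n^{1/p-1/2}K^{-1/2}$ for several $p\in[1,2]$.

```python
import random

def mean_lp_error(x, K, p, trials=2000, seed=1):
    """Monte Carlo estimate of E||x - empirical distribution||_p for a K-element sample."""
    rng = random.Random(seed)
    n = len(x)
    total = 0.0
    for _ in range(trials):
        counts = [0] * n
        for j in rng.choices(range(n), weights=x, k=K):
            counts[j] += 1
        total += sum(abs(counts[j] / K - x[j]) ** p for j in range(n)) ** (1.0 / p)
    return total / trials

n, K = 5, 100
x = [1.0 / n] * n
# For each p: (simulated mean error, bound n^(1/p - 1/2) K^(-1/2))
results = {p: (mean_lp_error(x, K, p), n ** (1.0 / p - 0.5) * K ** -0.5)
           for p in (1.0, 1.5, 2.0)}
for p, (err, bnd) in results.items():
    print(p, round(err, 4), round(bnd, 4))
```

A simulation of this kind only illustrates the claim for one $x$; the exercise asks for a proof valid for all $x\in\Delta_n$.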

Exercise 4.20.

[follow-up to Exercise 4.17] Consider the situation as follows. A retailer sells $n$ items by offering customers, via internet, bundles of $m < n$ items, so that an offer is an $m$-element subset $B$ of the set $S = \{1, ..., n\}$ of the items. A customer has personal preferences represented by a subset $P$ of $S$, the customer’s preference set. We assume that if an offer $B$ intersects with the preference set $P$ of a customer, the latter buys an item drawn at random from the uniform distribution on $B\cap P$, and if $B\cap P = \emptyset$, the customer declines the offer. In the pilot stage we are interested in, the seller learns the market by making offers to $K$ customers. Specifically, the seller draws the $k$-th customer, $k\le K$, at random from the uniform distribution on the population of customers, and makes the selected customer an offer drawn at random from the uniform distribution on the set $\mathcal{S}_{m,n}$ of all $m$-item offers. What is observed in the $k$-th experiment is the item, if any, bought by the customer, and we want to make statistical inferences from these observations.

The outlined observation scheme can be formalized as follows. Let $\mathcal{S}$ be the set of all subsets of the $n$-element set, so that $\mathcal{S}$ is of cardinality $N = 2^n$. The population of customers induces a probability distribution $p$ on $\mathcal{S}$: for $P\in\mathcal{S}$, $p_P$ is the fraction of customers with the preference set being $P$; we refer to $p$ as the preference distribution. An outcome of a single experiment can be represented by a pair $(\iota, B)$, where $B\in\mathcal{S}_{m,n}$ is the offer used in the experiment, and $\iota$ is either 0 (“nothing is bought”, $P\cap B = \emptyset$), or a point from $P\cap B$, the item which was bought, when $P\cap B\ne\emptyset$. Note that for a customer with preference set $P$, the distribution $A_P$ of the outcome is a probability distribution on the $\big(M = (m+1)\binom{n}{m}\big)$-element set $\Omega = \{(\iota, B)\}$ of possible outcomes. As a result, our observation scheme is fully specified by an $M\times N$ column-stochastic matrix $A$ known to us with the


columns $A_P$ indexed by $P\in\mathcal{S}$. When a customer is drawn at random from the uniform distribution on the population of customers, the distribution of the outcome clearly is $Ap$, where $p$ is the (unknown) preference distribution. Our inferences should be based on the $K$-element sample $\omega^K = (\omega_1, ..., \omega_K)$, with $\omega_1, ..., \omega_K$ drawn, independently of each other, from the distribution $Ap$.

Now we can pose various inference problems, e.g., that of estimating $p$. We, however, intend to focus on a simpler problem: that of recovering $Ap$. In terms of our story, this makes sense: when we know $Ap$, we know, e.g., what the probability is for every offer to be “successful” (something indeed is bought) and/or to result in a specific profit, etc. With this knowledge at hand, the seller can pass from a “blind” offering policy (drawing an offer at random from the uniform distribution on the set $\mathcal{S}_{m,n}$) to something more rewarding.

Now comes the exercise:

1. Use the results of Exercise 4.17 to build a “presumably good” linear estimate
$$\hat{x}_H(\omega^K) = H^T\Big[\frac{1}{K}\sum_{k=1}^K\omega_k\Big]$$
of $Ap$ (as always, we encode observations $\omega$, which are elements of the $M$-element set $\Omega$, by standard basic orths in $\mathbf{R}^M$). As the norm $\|\cdot\|$ quantifying the recovery error, use $\|\cdot\|_1$ and/or $\|\cdot\|_2$. In order to avoid computational difficulties, use small $m$ and $n$ (e.g., $m = 3$ and $n = 5$). Compare your results with those for the “straightforward” estimate $\frac{1}{K}\sum_{k=1}^K\omega_k$ (the empirical distribution of $\omega\sim Ap$).
2. Assuming that the “presumably good” linear estimate outperforms the straightforward one, how could this phenomenon be explained? Note that we have no nontrivial a priori information on $p$!

Exercise 4.21. [Poisson Imaging] The Poisson Imaging Problem is to recover an unknown signal observed via the Poisson observation scheme. More specifically, assume that our observation is a realization of random vector $\omega\in\mathbf{R}^m_+$ with Poisson entries $\omega_i = \mathrm{Poisson}([Ax]_i)$ independent of each other. Here $A$ is a given entrywise nonnegative $m\times n$ matrix, and $x$ is an unknown signal known to belong to a given compact convex subset $\mathcal{X}$ of $\mathbf{R}^n_+$. Our goal is to recover in a given norm $\|\cdot\|$ the linear image $Bx$ of $x$, where $B$ is a given $\nu\times n$ matrix.

We assume in the sequel that $\mathcal{X}$ is a subset cut off the $n$-dimensional probabilistic simplex $\Delta_n$ by a collection of linear equality and inequality constraints. The assumption $\mathcal{X}\subset\Delta_n$ is not too restrictive. Indeed, assume that we know in advance a linear inequality $\sum_i\alpha_ix_i\le 1$ with positive coefficients which is valid on $\mathcal{X}$.20 Introducing slack variable $s$ given by $\sum_i\alpha_ix_i + s = 1$ and passing from signal $x$ to the new signal $[\alpha_1x_1; ...; \alpha_nx_n; s]$, after a straightforward modification of matrices $A$ and $B$, we arrive at the situation where $\mathcal{X}$ is a subset of the probabilistic simplex.

Our goal in the sequel is to build a presumably good linear estimate $\hat{x}_H(\omega) = H^T\omega$ of $Bx$.
As in Exercise 4.17, we start with upper-bounding the risk of a linear estimate. When representing $\omega = Ax + \xi_x$, we arrive at zero mean observation noise $\xi_x$ with entries $[\xi_x]_i = \omega_i - [Ax]_i$ independent of each other and covariance matrix $\mathrm{Diag}\{Ax\}$. We now can upper-bound the risk of a linear estimate $\hat{x}_H(\cdot)$ in the same way as in Exercise 4.17. Specifically, denoting by $\Pi_{\mathcal{X}}$ the set of all diagonal matrices $\mathrm{Diag}\{Ax\}$, $x\in\mathcal{X}$, and by $P_{i,x}$ the Poisson distribution with parameter $[Ax]_i$, we have

$$\begin{array}{rcl}
\mathrm{Risk}_{\|\cdot\|}[\hat{x}_H|\mathcal{X}] &=& \sup_{x\in\mathcal{X}}\mathbf{E}_{\omega\sim P_{1,x}\times...\times P_{m,x}}\big\{\|Bx - H^T\omega\|\big\}\\
&=& \sup_{x\in\mathcal{X}}\mathbf{E}_{\xi_x}\big\{\|[B - H^TA]x - H^T\xi_x\|\big\}\\
&\le& \underbrace{\sup_{x\in\mathcal{X}}\|[B - H^TA]x\|}_{\Phi(H)} + \underbrace{\sup_{\xi:\mathrm{Cov}[\xi]\in\Pi_{\mathcal{X}}}\mathbf{E}_\xi\big\{\|H^T\xi\|\big\}}_{\Psi^{\mathcal{X}}(H)}.
\end{array}$$

20 For example, in PET (see Section 2.4.3.2), where $x$ is the density of a radioactive tracer injected into the patient taking the PET procedure, we know in advance the total amount $\sum_iv_ix_i$ of the tracer, $v_i$ being the volume of voxels.

In order to build a presumably good linear estimate, it suffices to build efficiently computable upper bounds $\overline{\Phi}(H)$ on $\Phi(H)$ and $\overline{\Psi}^{\mathcal{X}}(H)$ on $\Psi^{\mathcal{X}}(H)$ convex in $H$, and then take as $H$ an optimal solution to the convex optimization problem
$$\mathrm{Opt} = \min_H\Big[\overline{\Phi}(H) + \overline{\Psi}^{\mathcal{X}}(H)\Big].$$

As in Exercise 4.17, assume from now on that $\|\cdot\|$ is an absolute norm, and the unit ball $\mathcal{B}_*$ of the conjugate norm is a spectratope:
$$\mathcal{B}_* := \{u : \|u\|_*\le 1\} = \{u : \exists r\in\mathcal{R},\ y : u = My,\ S_\ell^2[y]\preceq r_\ell I_{f_\ell},\ \ell\le L\}.$$

Observe that
• In order to build $\overline{\Phi}$, we can use exactly the same techniques as those developed in Exercise 4.17. Indeed, as far as building $\overline{\Phi}$ is concerned, the only difference with the situation of Exercise 4.17 is that in the latter, $A$ was a column-stochastic matrix, while now $A$ is just an entrywise nonnegative matrix. Note, however, that when upper-bounding $\Phi$ in Exercise 4.17, we never used the fact that $A$ is column-stochastic.
• In order to upper-bound $\Psi^{\mathcal{X}}$, we can use the bound (4.40) of Exercise 4.17.

The bottom line is that in order to build a presumably good linear estimate, we need to solve the convex optimization problem

$$\mathrm{Opt} = \min_{H,\Upsilon,\Theta}\left\{\overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) : \Upsilon = \{\Upsilon_\ell\succeq 0,\ \ell\le L\},\ \begin{bmatrix}\Theta & \frac12 HM\\ \frac12 M^TH^T & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix}\succeq 0\right\} \tag{P}$$

where

$$\Gamma_{\mathcal{X}}(\Theta) = \max_{x\in\mathcal{X}}\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$$

(cf. problem (4.99)) with $\overline{\Phi}$ yielded by any construction from Exercise 4.17, e.g., the least conservative Combined upper bound on $\Phi$. What differs significantly in our present situation from the situation of Exercise 4.17 are the bounds on probabilities of large deviations (for Discrete o.s., established in Exercise 4.19). The goal of what follows is to establish these bounds for


Poisson Imaging. Here is what you are supposed to do:

1. Let $\omega\in\mathbf{R}^m$ be a random vector with independent entries $\omega_i\sim\mathrm{Poisson}(\mu_i)$, and let $\mu = [\mu_1; ...; \mu_m]$. Prove that whenever $h\in\mathbf{R}^m$, $\gamma > 0$, and $\delta\ge 0$, one has

$$\ln\mathrm{Prob}\{h^T\omega > h^T\mu + \delta\} \le \sum_i[\exp\{\gamma h_i\} - 1]\mu_i - \gamma h^T\mu - \gamma\delta. \tag{4.107}$$

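2. Taking for granted (or see, e.g., [178]) that $e^x - x - 1\le\frac{x^2}{2(1-x/3)}$ when $|x| < 3$, prove that in the situation of item 1 one has for $t > 0$:

$$0\le\gamma<\frac{3}{\|h\|_\infty}\ \Rightarrow\ \ln\mathrm{Prob}\{h^T\omega > h^T\mu + t\}\le\frac{\gamma^2\sum_ih_i^2\mu_i}{2(1-\gamma\|h\|_\infty/3)} - \gamma t. \tag{4.108}$$

Derive from the latter fact that

$$\mathrm{Prob}\big\{h^T\omega > h^T\mu + \delta\big\}\le\exp\Big\{-\frac{\delta^2}{2\big[\sum_ih_i^2\mu_i + \|h\|_\infty\delta/3\big]}\Big\}, \tag{4.109}$$

and conclude that

$$\mathrm{Prob}\big\{|h^T\omega - h^T\mu| > \delta\big\}\le 2\exp\Big\{-\frac{\delta^2}{2\big[\sum_ih_i^2\mu_i + \|h\|_\infty\delta/3\big]}\Big\}. \tag{4.110}$$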
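The one-sided bound (4.109) is easy to compare with exact Poisson tails in the simplest case $m = 1$, $h = 1$. The sketch below does this for a few illustrative values of $\mu$ and $\delta$ (these values, the truncation length `terms`, and the helper names `poisson_upper_tail` and `bernstein_bound` are assumptions made for the illustration, not part of the exercise). The tail is summed directly from the probability mass function, so truncation can only underestimate it.

```python
import math

def poisson_upper_tail(mu, threshold, terms=400):
    """Prob{omega > threshold} for omega ~ Poisson(mu), by summing the pmf (truncated)."""
    k0 = math.floor(threshold) + 1
    return sum(math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))
               for k in range(k0, k0 + terms))

def bernstein_bound(mu, delta, h=1.0):
    """Right-hand side of (4.109) for m = 1: h^2 * mu stands for sum_i h_i^2 mu_i."""
    return math.exp(-delta ** 2 / (2.0 * (h * h * mu + abs(h) * delta / 3.0)))

# Rows: (mu, delta, exact tail Prob{omega > mu + delta}, bound from (4.109))
rows = [(mu, d, poisson_upper_tail(mu, mu + d), bernstein_bound(mu, d))
        for mu in (0.5, 5.0, 50.0) for d in (1.0, 5.0, 20.0)]
for mu, d, tail, bound in rows:
    print(mu, d, tail, bound)
```

The gap between the two columns grows with $\delta/\mu$, which is typical for Bernstein-type bounds applied to Poisson variables.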

3. Extract from (4.110) the following

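Proposition 4.35. In the situation and under the assumptions of Exercise 4.21, let Opt be the optimal value, and $H, \Upsilon, \Theta$ be a feasible solution to problem $(P)$. Whenever $x\in\mathcal{X}$ and $\epsilon\in(0,1)$, denoting by $P_x$ the distribution of observations stemming from $x$ (i.e., the distribution of random vector $\omega$ with independent entries $\omega_i\sim\mathrm{Poisson}([Ax]_i)$), one has

$$\mathbf{E}_{\omega\sim P_x}\{\|Bx - \hat{x}_H(\omega)\|\}\le\overline{\Phi}(H) + 2\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)}\le\overline{\Phi}(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Gamma_{\mathcal{X}}(\Theta) \tag{4.111}$$

and

$$\mathrm{Prob}_{\omega\sim P_x}\Big\{\|Bx - \hat{x}_H(\omega)\|\le\overline{\Phi}(H) + 4\sqrt{\big[\tfrac29\ln^2(2m/\epsilon)\mathrm{Tr}(\Theta) + \ln(2m/\epsilon)\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)\big]\phi_{\mathcal{R}}(\lambda[\Upsilon])}\Big\}\ge 1-\epsilon. \tag{4.112}$$

Note that in the case of $[Ax]_i\ge 1$ for all $x\in\mathcal{X}$ and all $i$ we have $\mathrm{Tr}(\Theta)\le\mathrm{Tr}(\mathrm{Diag}\{Ax\}\Theta)$, so that in this case the $P_x$-probability of the event
$$\Big\{\omega : \|Bx - \hat{x}_H(\omega)\|\le\overline{\Phi}(H) + O(1)\ln(2m/\epsilon)\sqrt{\phi_{\mathcal{R}}(\lambda[\Upsilon])\Gamma_{\mathcal{X}}(\Theta)}\Big\}$$
is at least $1-\epsilon$.

4.7.6 Numerical lower-bounding minimax risk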

Exercise 4.22. 4.22.A. Motivation. From the theoretical viewpoint, the results on near-optimality of presumably good linear estimates stated in Propositions 4.5 and 4.16 seem


to be quite strong and general. This being said, for a practically oriented user the “nonoptimality factors” arising in these propositions can be too large to make any practical sense. This drawback of our theoretical results is not too crucial: what matters in applications is whether the risk of a proposed estimate is appropriate for the application in question, and not by how much it could be improved were we smart enough to build the “ideal” estimate; results of the latter type from a practical viewpoint offer no more than some “moral support.” Nevertheless, the “moral support” has its value, and it makes sense to strengthen it by improving the lower risk bounds as compared to those underlying Propositions 4.5 and 4.16. In this respect, an appealing idea is to pass from lower risk bounds yielded by theoretical considerations to computation-based ones. The goal of this exercise is to develop some methodology yielding computation-based lower risk bounds. We start with the main ingredient of this methodology, the classical Cramer-Rao bound.

4.22.B. Cramer-Rao bound. Consider the situation as follows: we are given
• an observation space $\Omega$ equipped with reference measure $\Pi$, basic examples being (A) $\Omega = \mathbf{R}^m$ with Lebesgue measure $\Pi$, and (B) (finite or countable) discrete set $\Omega$ with counting measure $\Pi$;
• a convex compact set $\Theta\subset\mathbf{R}^k$ and a family $\mathcal{P} = \{p(\omega,\theta) : \theta\in\Theta\}$ of probability densities, taken w.r.t. $\Pi$.

Our goal is, given an observation $\omega\sim p(\cdot,\theta)$ stemming from unknown $\theta$ known to belong to $\Theta$, to recover $\theta$. We quantify the risk of a candidate estimate $\hat\theta$ as

$$\mathrm{Risk}[\hat\theta|\Theta] = \sup_{\theta\in\Theta}\Big(\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\hat\theta(\omega) - \theta\|_2^2\big\}\Big)^{1/2}, \tag{4.113}$$

and define the “ideal” minimax risk as

$$\mathrm{Risk}_{\mathrm{opt}} = \inf_{\hat\theta}\mathrm{Risk}[\hat\theta],$$

the infimum being taken w.r.t. all estimates, or, which is the same, all bounded estimates (indeed, passing from a candidate estimate $\hat\theta$ to the projected estimate $\hat\theta_\Theta(\omega) = \mathrm{argmin}_{\theta\in\Theta}\|\hat\theta(\omega) - \theta\|_2$ will only reduce the estimate risk). The Cramer-Rao inequality [58, 205], which we intend to use,21 is a certain relation between the covariance matrix of a bounded estimate and its bias; this relation is valid under mild regularity assumptions on the family $\mathcal{P}$, specifically, as follows:

1) $p(\omega,\theta) > 0$ for all $\omega\in\Omega$, $\theta\in\Theta$, and $p(\omega,\theta)$ is differentiable in $\theta$, with $\nabla_\theta p(\omega,\theta)$ continuous in $\theta\in\Theta$;
2) The Fisher Information matrix
$$\mathcal{I}(\theta) = \int_\Omega\frac{\nabla_\theta p(\omega,\theta)[\nabla_\theta p(\omega,\theta)]^T}{p(\omega,\theta)}\Pi(d\omega)$$
is well-defined for all $\theta\in\Theta$;
3) There exists function $M(\omega)\ge 0$ such that $\int_\Omega M(\omega)\Pi(d\omega) < \infty$ and
$$\|\nabla_\theta p(\omega,\theta)\|_2\le M(\omega)\quad\forall\omega\in\Omega,\ \theta\in\Theta.$$

21 As a matter of fact, the classical Cramer-Rao inequality dealing with unbiased estimates is not sufficient for our purposes “as is.” What we need to build is a “bias enabled” version of this inequality. Such an inequality may be developed using Bayesian argument [99, 233].

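The derivation of the Cramer-Rao bound is as follows. Let $\hat\theta(\omega)$ be a bounded estimate, and let

$$\phi(\theta) = [\phi_1(\theta); ...; \phi_k(\theta)] = \int_\Omega\hat\theta(\omega)p(\omega,\theta)\Pi(d\omega)$$

be the expected value of the estimate. By item 3, $\phi(\theta)$ is differentiable on $\Theta$, with the Jacobian $\phi'(\theta) = \big[\frac{\partial\phi_i(\theta)}{\partial\theta_j}\big]_{i,j\le k}$ given by

$$\phi'(\theta)h = \int_\Omega\hat\theta(\omega)h^T\nabla_\theta p(\omega,\theta)\Pi(d\omega),\quad h\in\mathbf{R}^k.$$

Besides this, recalling that $\int_\Omega p(\omega,\theta)\Pi(d\omega)\equiv 1$ and invoking item 3, we have $\int_\Omega h^T\nabla_\theta p(\omega,\theta)\Pi(d\omega) = 0$, whence, in view of the previous identity,

$$\phi'(\theta)h = \int_\Omega[\hat\theta(\omega) - \phi(\theta)]h^T\nabla_\theta p(\omega,\theta)\Pi(d\omega),\quad h\in\mathbf{R}^k.$$

Therefore, for all $g, h\in\mathbf{R}^k$ we have

$$\begin{array}{rcl}
[g^T\phi'(\theta)h]^2 &=& \Big[\int_\Omega[g^T(\hat\theta - \phi(\theta))][h^T\nabla_\theta p(\omega,\theta)/p(\omega,\theta)]p(\omega,\theta)\Pi(d\omega)\Big]^2\\
&\le& \Big[\int_\Omega g^T[\hat\theta - \phi(\theta)][\hat\theta - \phi(\theta)]^Tg\,p(\omega,\theta)\Pi(d\omega)\Big]\times\Big[\int_\Omega[h^T\nabla_\theta p(\omega,\theta)/p(\omega,\theta)]^2p(\omega,\theta)\Pi(d\omega)\Big]\quad\text{[by the Cauchy Inequality]}\\
&=& \big[g^T\mathrm{Cov}_{\hat\theta}(\theta)g\big]\big[h^T\mathcal{I}(\theta)h\big],
\end{array}$$

where $\mathrm{Cov}_{\hat\theta}(\theta)$ is the covariance matrix $\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{[\hat\theta(\omega) - \phi(\theta)][\hat\theta(\omega) - \phi(\theta)]^T\big\}$ of $\hat\theta(\omega)$ induced by $\omega\sim p(\cdot,\theta)$. We have arrived at the inequality

$$\big[g^T\mathrm{Cov}_{\hat\theta}(\theta)g\big]\big[h^T\mathcal{I}(\theta)h\big]\ge[g^T\phi'(\theta)h]^2\quad\forall(g, h\in\mathbf{R}^k,\ \theta\in\Theta). \tag{$*$}$$

For $\theta\in\Theta$ fixed, let $\mathcal{J}$ be a positive definite matrix such that $\mathcal{J}\succeq\mathcal{I}(\theta)$, whence by $(*)$ it holds

$$\big[g^T\mathrm{Cov}_{\hat\theta}(\theta)g\big]\big[h^T\mathcal{J}h\big]\ge[g^T\phi'(\theta)h]^2\quad\forall(g, h\in\mathbf{R}^k). \tag{$**$}$$

For $g$ fixed, the maximum of the right-hand side quantity in $(**)$ over $h$ satisfying $h^T\mathcal{J}h\le 1$ is $g^T\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^Tg$, and we arrive at the Cramer-Rao inequality

$$\forall(\theta\in\Theta,\ \mathcal{J}\succeq\mathcal{I}(\theta),\ \mathcal{J}\succ 0) :\ \mathrm{Cov}_{\hat\theta}(\theta)\succeq\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T \tag{4.114}$$
$$\Big[\mathrm{Cov}_{\hat\theta}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{[\hat\theta - \phi(\theta)][\hat\theta - \phi(\theta)]^T\big\},\ \phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\hat\theta(\omega)\big\}\Big]$$

which holds true for every bounded estimate $\hat\theta(\cdot)$. Note also that for every $\theta\in\Theta$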


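and every bounded estimate $\hat\theta$ we have

$$\begin{array}{rcl}
\mathrm{Risk}^2[\hat\theta] &\ge& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\hat\theta(\omega) - \theta\|_2^2\big\} = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|[\hat\theta(\omega) - \phi(\theta)] + [\phi(\theta) - \theta]\|_2^2\big\}\\
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\hat\theta(\omega) - \phi(\theta)\|_2^2\big\} + \|\phi(\theta) - \theta\|_2^2 + 2\underbrace{\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{[\hat\theta(\omega) - \phi(\theta)]^T[\phi(\theta) - \theta]\big\}}_{=0}\\
&=& \mathrm{Tr}(\mathrm{Cov}_{\hat\theta}(\theta)) + \|\phi(\theta) - \theta\|_2^2.
\end{array}$$

Hence, in view of (4.114), for every bounded estimate $\hat\theta$ it holds

$$\forall(\mathcal{J}\succ 0 : \mathcal{J}\succeq\mathcal{I}(\theta)\ \forall\theta\in\Theta) :\quad \mathrm{Risk}^2[\hat\theta]\ge\sup_{\theta\in\Theta}\Big[\mathrm{Tr}\big(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T\big) + \|\phi(\theta) - \theta\|_2^2\Big] \tag{4.115}$$
$$\Big[\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\{\hat\theta(\omega)\}\Big].$$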

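The fact that we considered the risk of estimating “the entire” $\theta$ rather than a given vector-valued function $f(\theta) : \Theta\to\mathbf{R}^\nu$ plays no special role, and in fact the Cramer-Rao inequality admits the following modification yielded by a completely similar reasoning:

Proposition 4.36. In the situation described in item 4.22.B and under assumptions 1)–3) of this item, let $f(\cdot) : \Theta\to\mathbf{R}^\nu$ be a bounded Borel function, and let $\hat{f}(\omega)$ be a bounded estimate of $f(\theta)$ via observation $\omega\sim p(\cdot,\theta)$. Then, setting for $\theta\in\Theta$

$$\phi(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\hat{f}(\omega)\big\},\qquad \mathrm{Cov}_{\hat{f}}(\theta) = \mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{[\hat{f}(\omega) - \phi(\theta)][\hat{f}(\omega) - \phi(\theta)]^T\big\},$$

one has

$$\forall(\theta\in\Theta,\ \mathcal{J}\succeq\mathcal{I}(\theta),\ \mathcal{J}\succ 0) :\ \mathrm{Cov}_{\hat{f}}(\theta)\succeq\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T.$$

As a result, for

$$\mathrm{Risk}[\hat{f}] = \sup_{\theta\in\Theta}\Big[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\hat{f}(\omega) - f(\theta)\|_2^2\big\}\Big]^{1/2}$$

it holds

$$\forall(\mathcal{J}\succ 0 : \mathcal{J}\succeq\mathcal{I}(\theta)\ \forall\theta\in\Theta) :\quad \mathrm{Risk}^2[\hat{f}]\ge\sup_{\theta\in\Theta}\Big[\mathrm{Tr}\big(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T\big) + \|\phi(\theta) - f(\theta)\|_2^2\Big].$$

Now comes the first part of the exercise:

1) Derive from (4.115) the following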

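Proposition 4.37. In the situation of item 4.22.B, let
• $\Theta\subset\mathbf{R}^k$ be a $\|\cdot\|_2$-ball of radius $r > 0$,
• the family $\mathcal{P}$ be such that $\mathcal{I}(\theta)\preceq\mathcal{J}$ for some $\mathcal{J}\succ 0$ and all $\theta\in\Theta$.

Then the minimax optimal risk satisfies the bound

$$\mathrm{Risk}_{\mathrm{opt}}\ge\frac{rk}{r\sqrt{\mathrm{Tr}(\mathcal{J})} + k}. \tag{4.116}$$

In particular, when $\mathcal{J} = \alpha^{-1}I_k$, we have

$$\mathrm{Risk}_{\mathrm{opt}}\ge\frac{r\sqrt{\alpha k}}{r + \sqrt{\alpha k}}. \tag{4.117}$$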

Hint. Assuming w.l.o.g. that $\Theta$ is centered at the origin, and given a bounded estimate $\hat\theta$ with risk $R$, let $\phi(\theta)$ be associated with the estimate via (4.115). Select $\gamma\in(0,1)$ and consider two cases: (a): there exists $\theta\in\partial\Theta$ such that $\|\phi(\theta) - \theta\|_2 > \gamma r$, and (b): $\|\phi(\theta) - \theta\|_2\le\gamma r$ for all $\theta\in\partial\Theta$. In the case of (a), lower-bound $R$ by $\max_{\theta\in\Theta}\|\phi(\theta) - \theta\|_2$; see (4.115). In the case of (b), lower-bound $R^2$ by $\max_{\theta\in\Theta}\mathrm{Tr}(\phi'(\theta)\mathcal{J}^{-1}[\phi'(\theta)]^T)$ (see (4.115)) and use the Gauss Divergence theorem to lower-bound the latter quantity in terms of the flux of the vector field $\phi(\cdot)$ over $\partial\Theta$. When implementing the above program, you might find useful the following fact (prove it!):

Lemma 4.38. Let $\Phi$ be an $n\times n$ matrix, and $\mathcal{J}$ be a positive definite $n\times n$ matrix. Then
$$\mathrm{Tr}(\Phi\mathcal{J}^{-1}\Phi^T)\ge\frac{\mathrm{Tr}^2(\Phi)}{\mathrm{Tr}(\mathcal{J})}.$$

4.22.C. Application to signal recovery. Proposition 4.37 allows us to build computation-based lower risk bounds in the signal recovery problem considered in Section 4.2, in particular, the problem where one wants to recover the linear image $Bx$ of an unknown signal $x$ known to belong to a given ellitope
$$\mathcal{X} = \{x\in\mathbf{R}^n : \exists t\in\mathcal{T} : x^TS_\ell x\le t_\ell,\ \ell\le L\}$$
(with our usual restriction on $S_\ell$ and $\mathcal{T}$) via observation $\omega = Ax + \sigma\xi$, $\xi\sim\mathcal{N}(0, I_m)$, and the risk of a candidate estimate, as in Section 4.2, is defined according to (4.113).22 It is convenient to assume that the matrix $B$ (which in our general setup can be an arbitrary $\nu\times n$ matrix) is a nonsingular $n\times n$ matrix.23 Under this assumption, setting
$$\mathcal{Y} = B^{-1}\mathcal{X} = \{y\in\mathbf{R}^n : \exists t\in\mathcal{T} : y^T[B^{-1}]^TS_\ell B^{-1}y\le t_\ell,\ \ell\le L\}$$
and $\bar{A} = AB^{-1}$, we lose nothing when replacing the sensing matrix $A$ with $\bar{A}$ and treating as our signal $y\in\mathcal{Y}$ rather than $x\in\mathcal{X}$. Note that in our new situation $A$ is replaced with $\bar{A}$, $\mathcal{X}$ with $\mathcal{Y}$, and $B$ is the unit matrix $I_n$. For the sake of simplicity, we assume from now on that $A$ (and therefore $\bar{A}$) has trivial kernel. Finally, let $\tilde{S}_\ell\succeq S_\ell$ be close to $S_\ell$ positive definite matrices, e.g., $\tilde{S}_\ell = S_\ell + 10^{-100}I_n$. Setting $\bar{S}_\ell = [B^{-1}]^T\tilde{S}_\ell B^{-1}$ and
$$\bar{\mathcal{Y}} = \{y\in\mathbf{R}^n : \exists t\in\mathcal{T} : y^T\bar{S}_\ell y\le t_\ell,\ \ell\le L\},$$
we get $\bar{S}_\ell\succ 0$ and $\bar{\mathcal{Y}}\subset\mathcal{Y}$. Therefore, any lower bound on the $\|\cdot\|_2$-risk of recovery $y\in\bar{\mathcal{Y}}$ via observation $\omega = AB^{-1}y + \sigma\xi$, $\xi\sim\mathcal{N}(0, I_m)$, automatically is a lower bound on the minimax risk $\mathrm{Risk}_{\mathrm{opt}}$ corresponding to our original problem of interest.

Now assume that we can point out a $k$-dimensional linear subspace $E$ in $\mathbf{R}^n$ and positive reals $r, \gamma$ such that
(i) the $\|\cdot\|_2$-ball $\Theta = \{\theta\in E : \|\theta\|_2\le r\}$ is contained in $\bar{\mathcal{Y}}$;
(ii) the restriction $\bar{A}_E$ of $\bar{A}$ onto $E$ satisfies the relation
$$\mathrm{Tr}(\bar{A}_E^*\bar{A}_E)\le\gamma$$
($\bar{A}_E^* : \mathbf{R}^m\to E$ is the conjugate of the linear map $\bar{A}_E : E\to\mathbf{R}^m$).

Consider the auxiliary estimation problem obtained from the (reformulated) problem of interest by replacing the signal set $\bar{\mathcal{Y}}$ with $\Theta$. Since $\Theta\subset\bar{\mathcal{Y}}$, the minimax risk in the auxiliary problem is a lower bound on the minimax risk $\mathrm{Risk}_{\mathrm{opt}}$ we are interested in. On the other hand, the auxiliary problem is nothing but the problem of recovering parameter $\theta\in\Theta$ from observation $\omega\sim\mathcal{N}(\bar{A}\theta, \sigma^2I)$, which is just a special case of the problem considered in item 4.22.B. As it is immediately seen, the Fisher Information matrix in this problem is independent of $\theta$ and is $\sigma^{-2}\bar{A}_E^*\bar{A}_E$:
$$e^T\mathcal{I}(\theta)e = \sigma^{-2}e^T\bar{A}_E^*\bar{A}_Ee,\quad e\in E.$$
Invoking Proposition 4.37, we arrive at the lower bound on the minimax risk in the auxiliary problem (and thus in the problem of interest as well):

$$\mathrm{Risk}_{\mathrm{opt}}\ge\frac{r\sigma k}{r\sqrt{\gamma} + \sigma k}. \tag{4.118}$$

22 In fact, the approach to be developed can be applied to signal recovery problems involving Discrete/Poisson observation schemes and norms different from $\|\cdot\|_2$ used to measure the recovery error, signal-dependent noises, etc.
23 This assumption is nonrestrictive. Indeed, when $B\in\mathbf{R}^{\nu\times n}$ with $\nu < n$, we can add to $B$ $n - \nu$ zero rows, which keeps our estimation problem intact. When $\nu\ge n$, we can add to $B$ a small perturbation to ensure $\mathrm{Ker}\,B = \{0\}$, which, for small enough perturbation, again keeps our estimation problem basically intact. It remains to note that when $\mathrm{Ker}\,B = \{0\}$ we can replace $\mathbf{R}^\nu$ with the image space of $B$, which again does not affect the estimation problem we are interested in.
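The right-hand side of (4.118) is a simple closed-form function of $r$, $k$, $\gamma$ and $\sigma$, so its dependence on the parameters is easy to tabulate. A minimal sketch follows; the function name `cramer_rao_lower_bound` and the sample parameter values are assumptions made for illustration.

```python
import math

def cramer_rao_lower_bound(r, k, gamma, sigma):
    """Right-hand side of (4.118): r*sigma*k / (r*sqrt(gamma) + sigma*k)."""
    return r * sigma * k / (r * math.sqrt(gamma) + sigma * k)

base = cramer_rao_lower_bound(r=1.0, k=4, gamma=2.0, sigma=0.1)
smaller_gamma = cramer_rao_lower_bound(r=1.0, k=4, gamma=1.0, sigma=0.1)
larger_r = cramer_rao_lower_bound(r=2.0, k=4, gamma=2.0, sigma=0.1)
larger_k = cramer_rao_lower_bound(r=1.0, k=8, gamma=2.0, sigma=0.1)
print(base, smaller_gamma, larger_r, larger_k)
```

Note that in the construction that follows, $\gamma$ and $r$ are not free parameters: both depend on the chosen subspace, so increasing $k$ usually forces a larger $\gamma$ and a smaller $r$. This trade-off is what the search over projectors below is meant to resolve.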

The resulting risk bound depends on $r, k, \gamma$ and is larger the smaller $\gamma$ is and the larger $k$ and $r$ are.

Lower-bounding $\mathrm{Risk}_{\mathrm{opt}}$. In order to make the bounding scheme just outlined give its best, we need a mechanism which allows us to generate $k$-dimensional “disks” $\Theta\subset\bar{\mathcal{Y}}$ along with associated quantities $r, \gamma$. In order to design such a mechanism, it is convenient to represent $k$-dimensional linear subspaces of $\mathbf{R}^n$ as the image spaces of orthogonal $n\times n$ projectors $P$ of rank $k$. Such a projector $P$ gives rise to the disk $\Theta_P$ of the radius $r = r_P$ contained in $\bar{\mathcal{Y}}$, where $r_P$ is the largest $\rho$ such that the set $\{y\in\mathrm{Im}P : y^TPy\le\rho^2\}$ is contained in $\bar{\mathcal{Y}}$ (“condition $\mathcal{C}(r)$”), and we can equip the disk with $\gamma$ satisfying (ii) if and only if
$$\mathrm{Tr}(P\bar{A}^T\bar{A}P)\le\gamma,$$
or, which is the same (recall that $P$ is an orthogonal projector)

$$\mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma \tag{4.119}$$

(“condition $\mathcal{D}(\gamma)$”). Now, when $P$ is a nonzero orthogonal projector, the simplest sufficient condition for the validity of $\mathcal{C}(r)$ is the existence of $t\in\mathcal{T}$ such that
$$\forall(y\in\mathbf{R}^n,\ \ell\le L) :\ y^TP\bar{S}_\ell Py\le t_\ell r^{-2}y^TPy,$$
or, which is the same,

$$\exists s :\ r^2s\in\mathcal{T}\ \&\ P\bar{S}_\ell P\preceq s_\ell P,\ \ell\le L. \tag{4.120}$$

Let us rewrite (4.119) and (4.120) as a system of linear matrix inequalities. This is what you are supposed to do:

2.1) Prove the following simple fact:

Observation 4.39. Let $Q$ be a positive definite, $R$ be a nonzero positive semidefinite matrix, and let $s$ be a real. Then
$$RQR\preceq sR$$
if and only if
$$sQ^{-1}\succeq R.$$

2.2) Extract from the above observation the conclusion as follows. Let $\mathbf{T}$ be the conic hull of $\mathcal{T}$:
$$\mathbf{T} = \mathrm{cl}\{[s;\tau] : \tau > 0,\ s/\tau\in\mathcal{T}\} = \{[s;\tau] : \tau > 0,\ s/\tau\in\mathcal{T}\}\cup\{0\}.$$
Consider the system of constraints

$$[s;\tau]\in\mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma,\quad P\text{ is an orthogonal projector of rank }k\ge 1 \tag{\#}$$

in variables $[s;\tau]$, $k$, $\gamma$, and $P$. Every feasible solution to this system gives rise to a $k$-dimensional Euclidean subspace $E\subset\mathbf{R}^n$ (the image space of $P$) such that the Euclidean ball $\Theta$ in $E$ centered at the origin of radius
$$r = 1/\sqrt{\tau}$$
taken along with $\gamma$ satisfies conditions (i)–(ii). Consequently, such a feasible solution yields the lower bound

$$\mathrm{Risk}_{\mathrm{opt}}\ge\psi_{\sigma,k}(\gamma,\tau) := \frac{\sigma k}{\sqrt{\gamma} + \sigma\sqrt{\tau}k}$$

on the minimax risk in the problem of interest.

Ideally, to utilize item 2.2 to lower-bound $\mathrm{Risk}_{\mathrm{opt}}$, we should look through $k = 1, ..., n$ and maximize for every $k$ the lower risk bound $\psi_{\sigma,k}(\gamma,\tau)$ under constraints (#), thus arriving at the problem

$$\min_{[s;\tau],\gamma,P}\left\{\frac{\sigma}{\psi_{\sigma,k}(\gamma,\tau)} = \sqrt{\gamma}/k + \sigma\sqrt{\tau} :\ \begin{array}{l}[s;\tau]\in\mathbf{T}\ \&\ s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L\ \&\ \mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma,\\ P\text{ is an orthogonal projector of rank }k\end{array}\right\}. \tag{$P_k$}$$

This problem seems to be computationally intractable, since the constraints of $(P_k)$ include the nonconvex restriction on $P$ to be a projector of rank $k$. A natural convex relaxation of this constraint is
$$0\preceq P\preceq I_n,\quad\mathrm{Tr}(P) = k.$$
The (minor) remaining difficulty is that the objective in $(P_k)$ is nonconvex. Note, however, that to minimize $\sqrt{\gamma}/k + \sigma\sqrt{\tau}$ is basically the same as to minimize the convex function $\gamma/k^2 + \sigma^2\tau$ which is a tight “proxy” of the squared objective of $(P_k)$. We arrive at a convex “proxy” of $(P_k)$, the problem

$$\min_{[s;\tau],\gamma,P}\left\{\gamma/k^2 + \sigma^2\tau :\ \begin{array}{l}[s;\tau]\in\mathbf{T},\ 0\preceq P\preceq I_n,\ \mathrm{Tr}(P) = k,\\ s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L,\ \mathrm{Tr}(\bar{A}P\bar{A}^T)\le\gamma\end{array}\right\}, \tag{$P[k]$}$$

$k = 1, ..., n$. Problem $(P[k])$ clearly is solvable, and the $P$-component $P^{(k)}$ of its optimal solution gives rise to a collection of orthogonal projectors $P^{(k)}_\kappa$, $\kappa = 1, ..., n$, obtained from $P^{(k)}$ by “rounding”: to get $P^{(k)}_\kappa$, we replace the $\kappa$ leading eigenvalues of $P^{(k)}$ with ones, and the remaining eigenvalues with zeros, while keeping the eigenvectors intact. We can now for every $\kappa = 1, ..., n$ fix the $P$-variable in $(P_k)$ as $P^{(k)}_\kappa$ and solve the resulting problem in the remaining variables $[s;\tau]$ and $\gamma$, which is easy: with $P$ fixed, the problem clearly reduces to minimizing $\tau$ under the convex constraints
$$s_\ell\bar{S}_\ell^{-1}\succeq P,\ \ell\le L,\quad[s;\tau]\in\mathbf{T}$$
on $[s;\tau]$. As a result, for every $k\in\{1, ..., n\}$, we get $n$ lower bounds on $\mathrm{Risk}_{\mathrm{opt}}$, that is, a total of $n^2$ lower risk bounds, of which we select the best (the largest).

Now comes the next part of the exercise:

3) Implement the outlined program numerically and compare the lower bound on the minimax risk with the upper risk bounds of presumably good linear estimates yielded by Proposition 4.4.

Recommended setup:
• Sizes: $m = n = \nu = 16$.
• $A$, $B$: $B = I_n$, $A = \mathrm{Diag}\{a_1, ..., a_n\}$ with $a_i = i^{-\alpha}$ and $\alpha$ running through $\{0, 1, 2\}$.
• $\mathcal{X} = \{x\in\mathbf{R}^n : x^TS_\ell x\le 1,\ \ell\le L\}$ (i.e., $\mathcal{T} = [0,1]^L$) with randomly generated $S_\ell$.
• Range of $L$: $\{1, 4, 16\}$. For $L$ in this range, you can generate $S_\ell$, $\ell\le L$, as $S_\ell = R_\ell R_\ell^T$ with $R_\ell = \mathrm{randn}(n, p)$, where $p = \lfloor n/L\rfloor$.
• Range of $\sigma$: $\{1.0, 0.1, 0.01, 0.001, 0.0001\}$.

Exercise 4.23. [follow-up to Exercise 4.22]


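1) Prove the following version of Proposition 4.37:

Proposition 4.40. In the situation of item 4.22.B and under Assumptions 1)–3) from this item, let
• $\|\cdot\|$ be a norm on $\mathbf{R}^k$ such that $\|\theta\|_2\le\kappa\|\theta\|$ $\forall\theta\in\mathbf{R}^k$,
• $\Theta\subset\mathbf{R}^k$ be a $\|\cdot\|$-ball of radius $r > 0$,
• the family $\mathcal{P}$ be such that $\mathcal{I}(\theta)\preceq\mathcal{J}$ for some $\mathcal{J}\succ 0$ and all $\theta\in\Theta$.

Then the minimax optimal risk

$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|} = \inf_{\hat\theta(\cdot)}\Big[\sup_{\theta\in\Theta}\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\theta - \hat\theta(\omega)\|^2\big\}\Big]^{1/2}$$

of recovering parameter $\theta\in\Theta$ from observation $\omega\sim p(\cdot,\theta)$ in the norm $\|\cdot\|$ satisfies the bound

$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}\ge\frac{rk}{r\kappa\sqrt{\mathrm{Tr}(\mathcal{J})} + k}. \tag{4.121}$$

In particular, when $\mathcal{J} = \alpha^{-1}I_k$, we get

$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}\ge\frac{r\sqrt{\alpha k}}{r\kappa + \sqrt{\alpha k}}. \tag{4.122}$$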

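2) Apply Proposition 4.40 to get lower bounds on the minimax $\|\cdot\|$-risk in the following estimation problems:

2.1) Given indirect observation $\omega = A\theta + \sigma\xi$, $\xi\sim\mathcal{N}(0, I_m)$ of unknown vector $\theta$ known to belong to $\Theta = \{\theta\in\mathbf{R}^k : \|\theta\|_p\le r\}$ with given $A$, $\mathrm{Ker}\,A = \{0\}$, $p\in[2,\infty]$, $r > 0$, we want to recover $\theta$ in $\|\cdot\|_p$.

2.2) Given indirect observation $\omega = L\theta R + \sigma\xi$, where $\theta$ is unknown $\mu\times\nu$ matrix known to belong to the Shatten norm ball $\Theta = \{\theta\in\mathbf{R}^{\mu\times\nu} : \|\theta\|_{\mathrm{Sh},p}\le r\}$, we want to recover $\theta$ in $\|\cdot\|_{\mathrm{Sh},p}$. Here $L\in\mathbf{R}^{m\times\mu}$, $\mathrm{Ker}\,L = \{0\}$ and $R\in\mathbf{R}^{\nu\times n}$, $\mathrm{Ker}\,R^T = \{0\}$ are given matrices, $p\in[2,\infty]$, and $\xi$ is a random Gaussian $m\times n$ matrix (i.e., the entries in $\xi$ are $\mathcal{N}(0,1)$ random variables independent of each other).

2.3) Given a $K$-repeated observation $\omega^K = (\omega_1, ..., \omega_K)$ with i.i.d. components $\omega_t\sim\mathcal{N}(0,\theta)$, $1\le t\le K$, with unknown $\theta\in\mathbf{S}^n$ known to belong to the matrix box $\Theta = \{\theta : \beta_-I_n\preceq\theta\preceq\beta_+I_n\}$ with given $0 < \beta_- < \beta_+ < \infty$, we want to recover $\theta$ in the spectral norm.

Exercise 4.24. [More on Cramer-Rao risk bound] Let us fix $\mu\in(1,\infty)$ and a norm $\|\cdot\|$ on $\mathbf{R}^k$, and let $\|\cdot\|_*$ be the norm conjugate to $\|\cdot\|$, and $\mu_* = \frac{\mu}{\mu-1}$. Assume that we are in the situation of item 4.22.B and under assumptions 1) and 3) from this item; as for assumption 2) we now replace it with the assumption that the quantity

$$\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta) := \Big[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}\big\}\Big]^{1/\mu_*}$$

is well-defined and bounded on $\Theta$; in the sequel, we set

$$\mathcal{I}_{\|\cdot\|_*,\mu_*} = \sup_{\theta\in\Theta}\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta).$$

1) Prove the following variant of the Cramer-Rao risk bound:

Proposition 4.41. In the situation described at the beginning of item 4.22.D, let $\Theta\subset\mathbf{R}^k$ be a $\|\cdot\|$-ball of radius $r$. Then the minimax $\|\cdot\|$-risk of recovering $\theta\in\Theta$ via observation $\omega\sim p(\cdot,\theta)$ can be lower-bounded as

$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] := \inf_{\hat\theta(\cdot)}\sup_{\theta\in\Theta}\Big[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\hat\theta(\omega) - \theta\|^\mu\big\}\Big]^{1/\mu}\ge\frac{rk}{r\mathcal{I}_{\|\cdot\|_*,\mu_*} + k}, \tag{4.123}$$
$$\Big[\mathcal{I}_{\|\cdot\|_*,\mu_*} = \max_{\theta\in\Theta}\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta),\quad\mathcal{I}_{\|\cdot\|_*,\mu_*}(\theta) := \Big[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\big\{\|\nabla_\theta\ln(p(\omega,\theta))\|_*^{\mu_*}\big\}\Big]^{1/\mu_*}\Big].$$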

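Example I: Gaussian case, estimating shift. Let µ = 2, and let $p(\omega, \theta) = N(A\theta, \sigma^2 I_m)$ with $A \in \mathbf{R}^{m\times k}$. Then $\nabla_\theta \ln(p(\omega,\theta)) = \sigma^{-2}A^T(\omega - A\theta)$, so that
$$\begin{array}{rcl}
\int \|\nabla_\theta \ln(p(\omega,\theta))\|_*^2\, p(\omega,\theta)\,d\omega &=& \sigma^{-4}\int \|A^T(\omega - A\theta)\|_*^2\, p(\omega,\theta)\,d\omega\\
&=& \sigma^{-4}\frac{1}{[\sqrt{2\pi}\sigma]^m}\int \|A^T\omega\|_*^2 \exp\{-\frac{\omega^T\omega}{2\sigma^2}\}\,d\omega\\
&=& \sigma^{-4}\frac{1}{[2\pi]^{m/2}}\int \|A^T\sigma\xi\|_*^2 \exp\{-\xi^T\xi/2\}\,d\xi\\
&=& \sigma^{-2}\frac{1}{[2\pi]^{m/2}}\int \|A^T\xi\|_*^2 \exp\{-\xi^T\xi/2\}\,d\xi,
\end{array}$$
whence
$$I_{\|\cdot\|_*,2} = \sigma^{-1}\underbrace{\left[\mathbf{E}_{\xi\sim N(0,I_m)}\left\{\|A^T\xi\|_*^2\right\}\right]^{1/2}}_{\gamma_{\|\cdot\|}(A)}.$$
Consequently, assuming Θ to be a $\|\cdot\|$-ball of radius r in $\mathbf{R}^k$, lower bound (4.123) becomes
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] \ge \frac{rk}{r I_{\|\cdot\|_*,2} + k} = \frac{rk}{r\sigma^{-1}\gamma_{\|\cdot\|}(A) + k} = \frac{r\sigma k}{r\gamma_{\|\cdot\|}(A) + \sigma k}. \tag{4.124}$$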
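As a small numerical illustration of how (4.124) follows from (4.123), the sketch below evaluates both right-hand sides. The function names and the sample values of r, σ, k are illustrative assumptions, not part of the text; the value γ = √k used in the demo is the Euclidean, direct-observation case.

```python
import math


def cramer_rao_bound(r: float, k: int, fisher: float) -> float:
    """Right-hand side of (4.123): r*k / (r*I + k), I being the Fisher-type quantity."""
    return r * k / (r * fisher + k)


def gaussian_shift_bound(r: float, sigma: float, k: int, gamma: float) -> float:
    """Right-hand side of (4.124): r*sigma*k / (r*gamma + sigma*k)."""
    return r * sigma * k / (r * gamma + sigma * k)


if __name__ == "__main__":
    r, sigma, k = 2.0, 0.5, 100          # illustrative values only
    gamma = math.sqrt(k)                 # gamma for A = I_k and the Euclidean norm
    print(gaussian_shift_bound(r, sigma, k, gamma))
    # Same number obtained from (4.123) with I = gamma / sigma:
    print(cramer_rao_bound(r, k, gamma / sigma))
```

Note that the bound never exceeds r, the risk of the trivial estimate that always returns the center of the ball.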

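The case of direct observations. To see “how it works,” consider the case m = k, $A = I_k$ of direct observations, and let $\Theta = \{\theta \in \mathbf{R}^k : \|\theta\| \le r\}$. Then

• We have $\gamma_{\|\cdot\|_1}(I_k) \le O(1)\sqrt{\ln(k)}$, whence the $\|\cdot\|_1$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_1}[\Theta] \ge O(1)\frac{r\sigma k}{r\sqrt{\ln(k)} + \sigma k} \qquad [\Theta = \{\theta \in \mathbf{R}^k : \|\theta - a\|_1 \le r\}].$$

• We have $\gamma_{\|\cdot\|_2}(I_k) = \sqrt{k}$, whence the $\|\cdot\|_2$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_2}[\Theta] \ge \frac{r\sigma\sqrt{k}}{r + \sigma\sqrt{k}} \qquad [\Theta = \{\theta \in \mathbf{R}^k : \|\theta - a\|_2 \le r\}].$$

• We have $\gamma_{\|\cdot\|_\infty}(I_k) \le O(1)k$, whence the $\|\cdot\|_\infty$-risk bound is
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_\infty}[\Theta] \ge O(1)\frac{r\sigma}{r + \sigma} \qquad [\Theta = \{\theta \in \mathbf{R}^k : \|\theta - a\|_\infty \le r\}].$$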
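The three values of γ quoted above can be estimated by simulation. The following is a minimal Monte Carlo sketch, assuming the definition γ(I_k) = (E‖ξ‖_*²)^{1/2} with ξ ∼ N(0, I_k) and using the standard conjugacy of ℓ1/ℓ∞ and self-conjugacy of ℓ2; the helper names, sample size, and seed are assumptions made for illustration.

```python
import math
import random


def conjugate_norm(v, p):
    """Norm conjugate to the l_p norm, for p in {1, 2, inf}."""
    if p == 1:
        return max(abs(x) for x in v)            # conjugate of l_1 is l_inf
    if p == 2:
        return math.sqrt(sum(x * x for x in v))  # l_2 is self-conjugate
    if p == math.inf:
        return sum(abs(x) for x in v)            # conjugate of l_inf is l_1
    raise ValueError("p must be 1, 2 or math.inf")


def estimate_gamma(k, p, n_samples=2000, seed=0):
    """Monte Carlo estimate of (E ||xi||_*^2)^(1/2) for xi ~ N(0, I_k)."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, 1.0) for _ in range(k)]
        acc += conjugate_norm(xi, p) ** 2
    return math.sqrt(acc / n_samples)


if __name__ == "__main__":
    k = 50
    for p in (1, 2, math.inf):
        print(p, estimate_gamma(k, p))
```

For the Euclidean norm the estimate should be close to √k, while for ℓ1 and ℓ∞ it grows like √ln k and like k, respectively, in line with the bullets above.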

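In fact, the above examples are essentially covered by the following:

Observation 4.42. Let $\|\cdot\|$ be a norm on $\mathbf{R}^k$, and let $\Theta = \{\theta \in \mathbf{R}^k : \|\theta\| \le r\}$. Consider the problem of recovering signal θ ∈ Θ via observation $\omega \sim N(\theta, \sigma^2 I_k)$. Let
$$\mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta] = \sup_{\theta\in\Theta}\left(\mathbf{E}_{\omega\sim N(\theta,\sigma^2 I)}\left\{\|\widehat{\theta}(\omega) - \theta\|^2\right\}\right)^{1/2}$$
be the $\|\cdot\|$-risk of an estimate $\widehat{\theta}(\cdot)$, and let
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] = \inf_{\widehat{\theta}(\cdot)} \mathrm{Risk}_{\|\cdot\|}[\widehat{\theta}|\Theta]$$
be the associated minimax risk. Assume that the norm $\|\cdot\|$ is absolute and symmetric w.r.t. permutations of the coordinates. Then
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|}[\Theta] \ge \frac{r\sigma k}{\sqrt{2\ln(ek)}\,r\alpha_* + \sigma k}, \qquad \alpha_* = \|[1; ...; 1]\|_*. \tag{4.125}$$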

Here is the concluding part of the exercise:

2) Prove the observation and compare the lower risk bound it yields with the $\|\cdot\|$-risk of the “plug-in” estimate $\widehat{\chi}(\omega) \equiv \omega$.
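To get a feel for the comparison requested in item 2), here is a minimal numerical sketch for the Euclidean norm only. It assumes $\alpha_* = \sqrt{k}$ (the ℓ2 norm of the all-ones vector) and uses the fact that the plug-in estimate ω has ℓ2-risk σ√k; the function names and parameter values are illustrative and this is not a proof of the observation.

```python
import math


def observation_bound(r: float, sigma: float, k: int, alpha_star: float) -> float:
    """Right-hand side of (4.125)."""
    c = math.sqrt(2.0 * math.log(math.e * k))
    return r * sigma * k / (c * r * alpha_star + sigma * k)


def plug_in_risk_l2(sigma: float, k: int) -> float:
    """l_2-risk of the estimate omega itself: (E ||sigma*xi||_2^2)^(1/2) = sigma*sqrt(k)."""
    return sigma * math.sqrt(k)


if __name__ == "__main__":
    sigma, k = 1.0, 100
    for r in (1.0, 10.0, 100.0, 1000.0):
        lower = observation_bound(r, sigma, k, math.sqrt(k))
        upper = plug_in_risk_l2(sigma, k)
        print(r, lower, upper, upper / lower)
```

In this ℓ2 case the ratio of the two quantities equals $\sqrt{2\ln(ek)} + \sigma\sqrt{k}/r$, so the plug-in estimate is within a logarithmic factor of the bound once r is at least of order σ√k.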

Example II: Gaussian case, estimating covariance. Let µ = 2, let K be a positive integer, and let our observation ω be a collection of K i.i.d. samples $\omega_t \sim N(0, \theta)$, 1 ≤ t ≤ K, with unknown θ known to belong to a given convex compact subset Θ of the interior of the positive semidefinite cone $\mathbf{S}^n_+$. Given $\omega_1, ..., \omega_K$, we want to recover θ in the Schatten norm $\|\cdot\|_{\mathrm{Sh},s}$ with s ∈ [1, ∞]. Our estimation problem is covered by the setup of Exercise 4.22 with P comprised of the product probability densities $p(\omega, \theta) = \prod_{t=1}^K g(\omega_t, \theta)$, θ ∈ Θ, where g(·, θ) is the density of N(0, θ). We have
$$\begin{array}{rcl}
\nabla_\theta \ln(p(\omega,\theta)) &=& \sum_t \nabla_\theta \ln(g(\omega_t,\theta)) = \frac{1}{2}\sum_t\left[\theta^{-1}\omega_t\omega_t^T\theta^{-1} - \theta^{-1}\right]\\
&=& \frac{1}{2}\theta^{-1/2}\left[\sum_t\left([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T - I_n\right)\right]\theta^{-1/2}.
\end{array} \tag{4.126}$$
With some effort [149] it can be proved that when K ≥ n, which we assume from now on, for random vectors $\xi_1, ..., \xi_K$ independent across t sampled from the standard Gaussian distribution $N(0, I_n)$, for every u ∈ [1, ∞] one has
$$\left[\mathbf{E}\left\{\Big\|\sum_{t=1}^K[\xi_t\xi_t^T - I_n]\Big\|_{\mathrm{Sh},u}^2\right\}\right]^{1/2} \le C n^{\frac{1}{2}+\frac{1}{u}}\sqrt{K} \tag{4.127}$$

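with appropriate absolute constant C. Consequently, for θ ∈ Θ and all u ∈ [1, ∞] we have
$$\begin{array}{l}
\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_{\mathrm{Sh},u}^2\right\}\\
\quad= \frac{1}{4}\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\Big\|\theta^{-1/2}\Big[\sum_t\big([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T - I_n\big)\Big]\theta^{-1/2}\Big\|_{\mathrm{Sh},u}^2\right\}\quad\text{[by (4.126)]}\\
\quad= \frac{1}{4}\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\Big\|\theta^{-1/2}\Big[\sum_t\big(\xi_t\xi_t^T - I_n\big)\Big]\theta^{-1/2}\Big\|_{\mathrm{Sh},u}^2\right\}\quad\text{[setting } \theta^{-1/2}\omega_t = \xi_t]\\
\quad\le \frac{1}{4}\|\theta^{-1/2}\|_{\mathrm{Sh},\infty}^4\,\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\Big\|\sum_t\big[\xi_t\xi_t^T - I_n\big]\Big\|_{\mathrm{Sh},u}^2\right\}\quad\text{[since } \|AB\|_{\mathrm{Sh},u}\le\|A\|_{\mathrm{Sh},\infty}\|B\|_{\mathrm{Sh},u}]\\
\quad\le \frac{1}{4}\|\theta^{-1/2}\|_{\mathrm{Sh},\infty}^4\left[C n^{\frac{1}{2}+\frac{1}{u}}\sqrt{K}\right]^2\quad\text{[by (4.127)]}
\end{array}$$
and we arrive at
$$\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_{\mathrm{Sh},u}^2\right\}\right]^{1/2} \le \frac{C}{2}\|\theta^{-1}\|_{\mathrm{Sh},\infty}\, n^{\frac{1}{2}+\frac{1}{u}}\sqrt{K}. \tag{4.128}$$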

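Now assume that Θ is a $\|\cdot\|_{\mathrm{Sh},s}$-ball of radius r < 1 centered at $I_n$:
$$\Theta = \{\theta \in \mathbf{S}^n : \|\theta - I_n\|_{\mathrm{Sh},s} \le r\}. \tag{4.129}$$
In this case the estimation problem from Example II is within the scope of Proposition 4.41, and the quantity $I_{\|\cdot\|_*,2}$ as defined in (4.123) can be upper-bounded as follows (here $s_* = \frac{s}{s-1}$, so that $\|\cdot\|_{\mathrm{Sh},s_*}$ is the norm conjugate to $\|\cdot\|_{\mathrm{Sh},s}$):
$$\begin{array}{rcl}
I_{\|\cdot\|_*,2} &=& \max_{\theta\in\Theta}\left[\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|\nabla_\theta\ln(p(\omega,\theta))\|_{\mathrm{Sh},s_*}^2\right\}\right]^{1/2}\\
&\le& O(1)\,n^{\frac{1}{2}+\frac{1}{s_*}}\sqrt{K}\,\max_{\theta\in\Theta}\|\theta^{-1}\|_{\mathrm{Sh},\infty}\quad\text{[see (4.128)]}\\
&\le& O(1)\,\frac{n^{\frac{1}{2}+\frac{1}{s_*}}\sqrt{K}}{1-r}.
\end{array}$$
We can now use Proposition 4.41 to lower-bound the minimax $\|\cdot\|_{\mathrm{Sh},s}$-risk, thus arriving at
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},s}}[\Theta] \ge O(1)\frac{n(1-r)r}{\sqrt{K}\,n^{\frac{1}{2}-\frac{1}{s}}\,r + n(1-r)} \tag{4.130}$$
(note that we are in the case of $k = \dim\theta = \frac{n(n+1)}{2}$).

Let us compare this lower risk bound with the $\|\cdot\|_{\mathrm{Sh},s}$-risk of the “plug-in” estimate
$$\widehat{\theta}(\omega) = \frac{1}{K}\sum_{t=1}^K \omega_t\omega_t^T.$$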


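Assuming θ ∈ Θ, we have
$$\begin{array}{rcl}
\mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\|K[\widehat{\theta}(\omega) - \theta]\|_{\mathrm{Sh},s}^2\right\} &=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\Big\|\sum_t[\omega_t\omega_t^T - \theta]\Big\|_{\mathrm{Sh},s}^2\right\}\\
&=& \mathbf{E}_{\omega\sim p(\cdot,\theta)}\left\{\Big\|\theta^{1/2}\Big[\sum_t\big([\theta^{-1/2}\omega_t][\theta^{-1/2}\omega_t]^T - I_n\big)\Big]\theta^{1/2}\Big\|_{\mathrm{Sh},s}^2\right\}\\
&=& \mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\Big\|\theta^{1/2}\Big[\sum_t\big(\xi_t\xi_t^T - I_n\big)\Big]\theta^{1/2}\Big\|_{\mathrm{Sh},s}^2\right\}\\
&\le& \|\theta^{1/2}\|_{\mathrm{Sh},\infty}^4\,\mathbf{E}_{\xi\sim p(\cdot,I_n)}\left\{\Big\|\sum_t[\xi_t\xi_t^T - I_n]\Big\|_{\mathrm{Sh},s}^2\right\}\\
&\le& \|\theta^{1/2}\|_{\mathrm{Sh},\infty}^4\left[C n^{\frac{1}{2}+\frac{1}{s}}\sqrt{K}\right]^2\quad\text{[see (4.127)]},
\end{array}$$
and we arrive at
$$\mathrm{Risk}_{\|\cdot\|_{\mathrm{Sh},s}}[\widehat{\theta}|\Theta] \le O(1)\max_{\theta\in\Theta}\|\theta\|_{\mathrm{Sh},\infty}\,\frac{n^{\frac{1}{2}+\frac{1}{s}}}{\sqrt{K}}.$$
In the case of (4.129) we have $\|\theta\|_{\mathrm{Sh},\infty} \le 1 + r < 2$ for all θ ∈ Θ, and the latter bound becomes
$$\mathrm{Risk}_{\|\cdot\|_{\mathrm{Sh},s}}[\widehat{\theta}|\Theta] \le O(1)\frac{n^{\frac{1}{2}+\frac{1}{s}}}{\sqrt{K}}. \tag{4.131}$$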

For the sake of simplicity, assume that r in (4.129) is 1/2 (what actually matters below is that r ∈ (0, 1) is bounded away from 0 and from 1). In this case the lower bound (4.130) on the minimax $\|\cdot\|_{\mathrm{Sh},s}$-risk reads
$$\mathrm{Risk}_{\mathrm{opt},\|\cdot\|_{\mathrm{Sh},s}}[\Theta] \ge O(1)\min\left[\frac{n^{\frac{1}{2}+\frac{1}{s}}}{\sqrt{K}},\,1\right].$$
When K is “large”: $K \ge n^{1+\frac{2}{s}}$, this lower bound matches, within an absolute constant factor, the upper bound (4.131) on the risk of the plug-in estimate, so that the latter estimate is near-optimal. When $K < n^{1+\frac{2}{s}}$, the lower risk bound becomes O(1), so that here a nearly optimal estimate is the trivial estimate $\widehat{\theta}(\omega) \equiv I_n$.
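The two regimes just described can be tabulated with a few lines of code. The sketch below drops all O(1) factors, so it only shows the shape of the bounds; the function names and the sample values of n, K, s are illustrative assumptions.

```python
import math


def lower_bound_shape(n: int, K: int, s: float) -> float:
    """Shape of the minimax lower bound for r = 1/2: min[n^(1/2+1/s)/sqrt(K), 1]."""
    return min(n ** (0.5 + 1.0 / s) / math.sqrt(K), 1.0)


def plug_in_upper_shape(n: int, K: int, s: float) -> float:
    """Shape of the plug-in upper bound (4.131): n^(1/2+1/s)/sqrt(K)."""
    return n ** (0.5 + 1.0 / s) / math.sqrt(K)


def sample_size_threshold(n: int, s: float) -> float:
    """Number of observations n^(1+2/s) at which the two regimes meet."""
    return n ** (1.0 + 2.0 / s)


if __name__ == "__main__":
    n = 10
    for s in (1.0, 2.0, math.inf):       # 1/inf evaluates to 0.0
        K0 = sample_size_threshold(n, s)
        for K in (int(K0 / 4), int(K0), int(4 * K0)):
            print(s, K, lower_bound_shape(n, K, s), plug_in_upper_shape(n, K, s))
```

Below the threshold the lower bound saturates at a constant (trivial estimate regime); above it the two shapes coincide (plug-in regime).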

4.7.7 Around S-Lemma

S-Lemma is a classical result of extreme importance in Semidefinite Optimization. Basically, the lemma states that when the ellitope X in Proposition 4.6 is an ellipsoid, (4.19) can be strengthened to $\mathrm{Opt} = \mathrm{Opt}_*$. In fact, S-Lemma is even stronger:

Lemma 4.43. [S-Lemma] Consider two quadratic forms $f(x) = x^TAx + 2a^Tx + \alpha$ and $g(x) = x^TBx + 2b^Tx + \beta$ such that $g(\bar{x}) < 0$ for some $\bar{x}$. Then the implication g(x) ≤ 0 ⇒ f(x) ≤ 0 takes place if and only if for some λ ≥ 0 it holds f(x) ≤ λg(x) for all x, or, which is the same, if and only if the Linear Matrix Inequality
$$\begin{bmatrix}\lambda B - A & \lambda b - a\\ \lambda b^T - a^T & \lambda\beta - \alpha\end{bmatrix} \succeq 0$$
in scalar variable λ has a nonnegative solution.

Proof of S-Lemma can be found, e.g., in [15, Section 3.5.2].

The goal of subsequent exercises is to get “tight” tractable outer approximations of sets obtained from ellitopes by quadratic lifting. We fix an ellitope
$$X = \{x \in \mathbf{R}^n : \exists t \in T : x^TS_kx \le t_k,\ 1 \le k \le K\} \tag{4.132}$$
where, as always, $S_k$ are positive semidefinite matrices with positive definite sum, and T is a computationally tractable convex compact subset of $\mathbf{R}^K_+$ such that t ∈ T implies t′ ∈ T whenever 0 ≤ t′ ≤ t, and T contains a positive vector.

Exercise 4.25.

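Let us associate with ellitope X given by (4.132) the sets
$$\begin{array}{rcl}
\mathcal{X} &=& \mathrm{Conv}\{xx^T : x \in X\},\\
\widehat{\mathcal{X}} &=& \{Y \in \mathbf{S}^n : Y \succeq 0,\ \exists t \in T : \mathrm{Tr}(S_kY) \le t_k,\ 1 \le k \le K\},
\end{array}$$
so that $\mathcal{X}$, $\widehat{\mathcal{X}}$ are convex compact sets containing the origin, and $\widehat{\mathcal{X}}$ is computationally tractable along with T. Prove that

1. When K = 1, we have $\mathcal{X} = \widehat{\mathcal{X}}$;
2. We always have $\mathcal{X} \subset \widehat{\mathcal{X}} \subset 3\ln(\sqrt{3}K)\,\mathcal{X}$.

Exercise 4.26. For $x \in \mathbf{R}^n$ let
$$Z(x) = [x; 1][x; 1]^T, \qquad Z^o[x] = \begin{bmatrix} xx^T & x\\ x^T & 0\end{bmatrix}.$$
Let
$$C = \begin{bmatrix} 0 & 0\\ 0 & 1\end{bmatrix}$$
(so that $Z(x) = Z^o[x] + C$), and let us associate with ellitope X given by (4.132) the sets
$$\begin{array}{rcl}
\mathcal{X}^+ &=& \mathrm{Conv}\{Z^o[x] : x \in X\},\\
\widehat{\mathcal{X}}^+ &=& \left\{Y = \begin{bmatrix} U & u\\ u^T & 0\end{bmatrix} \in \mathbf{S}^{n+1} : Y + C \succeq 0,\ \exists t \in T : \mathrm{Tr}(S_kU) \le t_k,\ 1 \le k \le K\right\},
\end{array}$$
so that $\mathcal{X}^+$, $\widehat{\mathcal{X}}^+$ are convex compact sets containing the origin, and $\widehat{\mathcal{X}}^+$ is computationally tractable along with T. Prove that

1. When K = 1, we have $\mathcal{X}^+ = \widehat{\mathcal{X}}^+$;
2. We always have $\mathcal{X}^+ \subset \widehat{\mathcal{X}}^+ \subset 3\ln(\sqrt{3}(K+1))\,\mathcal{X}^+$.

4.7.8 Miscellaneous exercises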

Exercise 4.27. Let $X \subset \mathbf{R}^n$ be a convex compact set, let $b \in \mathbf{R}^n$, and let A be an m × n matrix. Consider the problem of affine recovery $\omega \mapsto h^T\omega + c$ of the linear function $Bx = b^Tx$ of x ∈ X from indirect observation $\omega = Ax + \sigma\xi$, $\xi \sim N(0, I_m)$. Given tolerance ε ∈ (0, 1), we are interested in minimizing the worst-case, over x ∈ X, width of the (1 − ε) confidence interval, that is, the smallest ρ such that
$$\mathrm{Prob}\{\xi : b^Tx - [h^T(Ax + \sigma\xi) + c] > \rho\} \le \epsilon/2\ \ \&\ \ \mathrm{Prob}\{\xi : b^Tx - [h^T(Ax + \sigma\xi) + c] < -\rho\} \le \epsilon/2\quad \forall x \in X.$$
Pose the problem as a convex optimization problem and consider in detail the case where X is the box $\{x \in \mathbf{R}^n : a_j|x_j| \le 1,\ 1 \le j \le n\}$, where $a_j > 0$ for all j.

Exercise 4.28. Prove Proposition 4.21.

Exercise 4.29. Prove Proposition 4.22.

4.8 PROOFS

4.8.1 Preliminaries

4.8.1.1 Technical lemma

Lemma 4.44. Given basic spectratope
$$X = \{x \in \mathbf{R}^n : \exists t \in T : R_k^2[x] \preceq t_kI_{d_k},\ 1 \le k \le K\} \tag{4.133}$$
and a positive definite n × n matrix Q and setting $\Lambda_k = R_k[Q]$ (for notation, see Section 4.3.1), we get a collection of positive semidefinite matrices, and $\sum_k R_k^*[\Lambda_k]$ is positive definite. As corollaries,
(i) whenever $M_k$, k ≤ K, are positive definite matrices, the matrix $\sum_k R_k^*[M_k]$ is positive definite;
(ii) the set $Q_T = \{Q \succeq 0 : R_k[Q] \preceq TI_{d_k},\ k \le K\}$ is bounded for every T.

Proof. Let us prove the first claim. Assuming the opposite, we would be able to find a nonzero vector y such that $\sum_k y^TR_k^*[\Lambda_k]y \le 0$, whence
$$0 \ge \sum_k y^TR_k^*[\Lambda_k]y = \sum_k \mathrm{Tr}(R_k^*[\Lambda_k][yy^T]) = \sum_k \mathrm{Tr}(\Lambda_kR_k[yy^T])$$
(we have used (4.26), (4.22)). Since $\Lambda_k = R_k[Q] \succeq 0$ due to $Q \succeq 0$ (see (4.23)), it follows that $\mathrm{Tr}(\Lambda_kR_k[yy^T]) = 0$ for all k. Now, the linear mapping $R_k[\cdot]$ is $\succeq$-monotone, and Q is positive definite, implying that $Q \succeq r_kyy^T$ for some $r_k > 0$, whence $\Lambda_k \succeq r_kR_k[yy^T]$, and therefore $\mathrm{Tr}(\Lambda_kR_k[yy^T]) = 0$ implies that $\mathrm{Tr}(R_k^2[yy^T]) = 0$, that is, $R_k[yy^T] = R_k^2[y] = 0$. Since $R_k[\cdot]$ takes values in $\mathbf{S}^{d_k}$, we get $R_k[y] = 0$ for all k, which is impossible due to y ≠ 0 and property S.3; see Section 4.3.1.

To verify (i), note that when $M_k$ are positive definite, we can find γ > 0 such that $\Lambda_k \preceq \gamma M_k$ for all k ≤ K; invoking (4.27), we conclude that $R_k^*[\Lambda_k] \preceq \gamma R_k^*[M_k]$, whence $\sum_k R_k^*[M_k]$ is positive definite along with $\sum_k R_k^*[\Lambda_k]$.

To verify (ii), assume, on the contrary to what should be proved, that $Q_T$ is unbounded. Since $Q_T$ is closed and convex, it must possess a nonzero recessive


direction, that is, there should exist a nonzero positive semidefinite matrix D such that $R_k[D] \preceq 0$ for all k. Selecting positive definite matrices $M_k$, the matrices $R_k^*[M_k]$ are positive semidefinite (see Section 4.3.1), and their sum S is positive definite by (i). We have
$$0 \ge \sum_k \mathrm{Tr}(R_k[D]M_k) = \sum_k \mathrm{Tr}(DR_k^*[M_k]) = \mathrm{Tr}(DS),$$
where the first inequality is due to $M_k \succeq 0$, and the first equality is due to (4.26). The resulting inequality is impossible due to $0 \ne D \succeq 0$ and $S \succ 0$, which is the desired contradiction. □

4.8.1.2 Noncommutative Khintchine Inequality

We will use a deep result from Functional Analysis (“Noncommutative Khintchine Inequality”) due to Lust-Piquard [175], Pisier [199] and Buchholz [34]; see [228, Theorem 4.6.1]:

Theorem 4.45. Let $Q_i \in \mathbf{S}^n$, 1 ≤ i ≤ I, and let $\xi_i$, i = 1, ..., I, be independent Rademacher (±1 with probabilities 1/2) or N(0, 1) random variables. Then for all t ≥ 0 one has
$$\mathrm{Prob}\left\{\Big\|\sum_{i=1}^I \xi_iQ_i\Big\| \ge t\right\} \le 2n\exp\left\{-\frac{t^2}{2v_Q}\right\},$$
where $\|\cdot\|$ is the spectral norm, and $v_Q = \big\|\sum_{i=1}^I Q_i^2\big\|$.

We need the following immediate consequence of the theorem:

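Lemma 4.46. Given spectratope (4.20), let $Q \in \mathbf{S}^n_+$ be such that
$$R_k[Q] \preceq \rho t_kI_{d_k},\ 1 \le k \le K, \tag{4.134}$$
for some t ∈ T and some ρ ∈ (0, 1]. Then
$$\mathrm{Prob}_{\xi\sim N(0,Q)}\{\xi \notin X\} \le \min\left[2De^{-\frac{1}{2\rho}},\,1\right],\qquad D := \sum_{k=1}^K d_k.$$

Proof. When setting $\xi = Q^{1/2}\eta$, $\eta \sim N(0, I_n)$, we have
$$R_k[\xi] = R_k[Q^{1/2}\eta] =: \sum_{i=1}^n \eta_i\bar{R}^{ki} = \bar{R}_k[\eta]$$
with
$$\sum_i[\bar{R}^{ki}]^2 = \mathbf{E}_{\eta\sim N(0,I_n)}\left\{\bar{R}_k^2[\eta]\right\} = \mathbf{E}_{\xi\sim N(0,Q)}\left\{R_k^2[\xi]\right\} = R_k[Q] \preceq \rho t_kI_{d_k}$$
due to (4.24). Hence, by Theorem 4.45,
$$\mathrm{Prob}_{\xi\sim N(0,Q)}\{\|R_k[\xi]\|^2 \ge t_k\} = \mathrm{Prob}_{\eta\sim N(0,I_n)}\{\|\bar{R}_k[\eta]\|^2 \ge t_k\} \le 2d_ke^{-\frac{1}{2\rho}}.$$
We conclude that
$$\mathrm{Prob}_{\xi\sim N(0,Q)}\{\xi \notin X\} \le \mathrm{Prob}_{\xi\sim N(0,Q)}\{\exists k : \|R_k[\xi]\|^2 > t_k\} \le 2De^{-\frac{1}{2\rho}}.\qquad\Box$$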
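The tail bound of Theorem 4.45, on which the proof above rests, can be checked exactly in one very special case. The sketch below assumes $Q_i = e_ie_i^T$, i = 1, ..., n (an illustrative choice, not taken from the text) with Gaussian $\xi_i$, so that $\|\sum_i \xi_iQ_i\| = \max_i|\xi_i|$ and $v_Q = \|I_n\| = 1$; the function names are hypothetical.

```python
import math


def exact_tail_diagonal(n: int, t: float) -> float:
    """Prob{max_i |xi_i| >= t} for n independent N(0,1) variables."""
    return 1.0 - math.erf(t / math.sqrt(2.0)) ** n


def khintchine_bound(n: int, t: float, v_q: float) -> float:
    """Right-hand side of Theorem 4.45: 2n exp(-t^2 / (2 v_Q))."""
    return 2.0 * n * math.exp(-t * t / (2.0 * v_q))


if __name__ == "__main__":
    for n in (1, 10, 100):
        for t in (0.5, 1.0, 2.0, 4.0):
            print(n, t, exact_tail_diagonal(n, t), khintchine_bound(n, t, 1.0))
```

In this diagonal case the theorem reduces to the usual union bound over n Gaussian tails, which is why the factor n (the matrix size) appears in front of the exponent.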



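The ellitopic version of Lemma 4.46 is as follows:

Lemma 4.47. Given ellitope (4.9), let $Q \in \mathbf{S}^n_+$ be such that
$$\mathrm{Tr}(R_kQ) \le \rho t_k,\ 1 \le k \le K, \tag{4.135}$$
for some t ∈ T and some ρ ∈ (0, 1]. Then
$$\mathrm{Prob}_{\xi\sim N(0,Q)}\{\xi \notin X\} \le 2K\exp\left\{-\frac{1}{3\rho}\right\}.$$

Proof. Observe that if $P \in \mathbf{S}^n_+$ satisfies Tr(P) ≤ 1, we have
$$\mathbf{E}_{\eta\sim N(0,I_n)}\left\{\exp\{\tfrac{1}{3}\eta^TP\eta\}\right\} \le \sqrt{3}. \tag{4.136}$$
Indeed, we lose nothing when assuming that $P = \mathrm{Diag}\{\lambda_1, ..., \lambda_n\}$ with $\lambda_i \ge 0$, $\sum_i\lambda_i \le 1$. In this case
$$\mathbf{E}_{\eta\sim N(0,I_n)}\left\{\exp\{\tfrac{1}{3}\eta^TP\eta\}\right\} = f(\lambda) := \mathbf{E}_{\eta\sim N(0,I_n)}\left\{\exp\Big\{\tfrac{1}{3}\sum_i\lambda_i\eta_i^2\Big\}\right\}.$$
Function f is convex, so that its maximum on the simplex $\{\lambda \ge 0 : \sum_i\lambda_i \le 1\}$ is achieved at a vertex, that is,
$$f(\lambda) \le \mathbf{E}_{\eta\sim N(0,1)}\left\{\exp\{\tfrac{1}{3}\eta^2\}\right\} = \sqrt{3};$$
(4.136) is proved. Note that (4.136) implies that
$$\mathrm{Prob}_{\eta\sim N(0,I_n)}\left\{\eta : \eta^TP\eta > s\right\} < \sqrt{3}\exp\{-s/3\},\ s \ge 0. \tag{4.137}$$
Now let Q and t satisfy the Lemma's premise. Setting $\xi = Q^{1/2}\eta$, $\eta \sim N(0, I_n)$, for k ≤ K such that $t_k > 0$ we have
$$\xi^TR_k\xi = \rho t_k\,\eta^TP_k\eta,\quad P_k := [\rho t_k]^{-1}Q^{1/2}R_kQ^{1/2} \succeq 0\ \ \&\ \ \mathrm{Tr}(P_k) = [\rho t_k]^{-1}\mathrm{Tr}(QR_k) \le 1,$$
so that
$$\mathrm{Prob}_{\xi\sim N(0,Q)}\left\{\xi : \xi^TR_k\xi > s\rho t_k\right\} = \mathrm{Prob}_{\eta\sim N(0,I_n)}\left\{\eta^TP_k\eta > s\right\} < \sqrt{3}\exp\{-s/3\}, \tag{4.138}$$
where the inequality is due to (4.137). Relation (4.138) was established for k with $t_k > 0$; it is trivially true when $t_k = 0$, since in this case $Q^{1/2}R_kQ^{1/2} = 0$ due to $\mathrm{Tr}(QR_k) \le 0$ and $R_k, Q \in \mathbf{S}^n_+$. Setting s = 1/ρ, we get from (4.138) that
$$\mathrm{Prob}_{\xi\sim N(0,Q)}\left\{\xi^TR_k\xi > t_k\right\} \le \sqrt{3}\exp\left\{-\frac{1}{3\rho}\right\},\ k \le K,$$
and the conclusion of the lemma follows due to the union bound (note that $\sqrt{3}K \le 2K$). □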
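The constant √3 in (4.136) and the tail bound (4.137) are easy to confirm numerically in the scalar case n = 1, P = 1. The following is a minimal sketch using a trapezoidal rule and the complementary error function; the function names, integration window, and step count are illustrative assumptions.

```python
import math


def exp_moment_one_third(half_width: float = 40.0, steps: int = 40000) -> float:
    """Trapezoidal approximation of E exp(eta^2 / 3) for eta ~ N(0, 1).

    The integrand exp(x^2/3) * exp(-x^2/2) / sqrt(2*pi) equals exp(-x^2/6) / sqrt(2*pi).
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = -half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * math.exp(-x * x / 6.0)
    return total * h / math.sqrt(2.0 * math.pi)


def chi_square_tail(s: float) -> float:
    """Prob{eta^2 > s} = Prob{|eta| > sqrt(s)} for eta ~ N(0, 1)."""
    return math.erfc(math.sqrt(s / 2.0))


if __name__ == "__main__":
    print(exp_moment_one_third(), math.sqrt(3.0))
    for s in (0.0, 1.0, 3.0, 9.0):
        print(s, chi_square_tail(s), math.sqrt(3.0) * math.exp(-s / 3.0))
```

The first printed pair should agree to many digits, and each tail probability should stay below the corresponding value of √3·exp(−s/3).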



4.8.1.3 Anderson’s Lemma

Below we use a simple-looking, but by far nontrivial, fact.

Anderson’s Lemma [4]. Let f be a nonnegative even (f(x) ≡ f(−x)) summable function on $\mathbf{R}^N$ such that the level sets {x : f(x) ≥ t} are convex for all t, and let $X \subset \mathbf{R}^N$ be a closed convex set symmetric w.r.t. the origin. Then for every $y \in \mathbf{R}^N$
$$\int_{X+ty} f(z)\,dz$$
is a nonincreasing function of t ≥ 0. In particular, if ζ is a zero mean N-dimensional Gaussian random vector, then for every $y \in \mathbf{R}^N$
$$\mathrm{Prob}\{\zeta \notin y + X\} \ge \mathrm{Prob}\{\zeta \notin X\}.$$
Hence, for every norm $\|\cdot\|$ on $\mathbf{R}^N$ it holds
$$\mathrm{Prob}\{\zeta : \|\zeta - y\| > \rho\} \ge \mathrm{Prob}\{\zeta : \|\zeta\| > \rho\}\quad \forall (y \in \mathbf{R}^N,\ \rho \ge 0).$$
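In dimension N = 1 the concluding inequality can be verified in closed form. The sketch below assumes ζ ∼ N(0, σ²) and X = [−ρ, ρ], and evaluates Prob{|ζ − y| > ρ} through the standard normal CDF; the helper names and grid values are illustrative.

```python
import math


def std_normal_cdf(x: float) -> float:
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def prob_outside_shifted_interval(y: float, rho: float, sigma: float = 1.0) -> float:
    """Prob{|zeta - y| > rho} for zeta ~ N(0, sigma^2)."""
    inside = std_normal_cdf((y + rho) / sigma) - std_normal_cdf((y - rho) / sigma)
    return 1.0 - inside


if __name__ == "__main__":
    rho = 1.5
    for y in (0.0, 0.5, 1.0, 2.0, 4.0):
        # The probability of missing the interval grows as the shift |y| grows.
        print(y, prob_outside_shifted_interval(y, rho))
```

The printed probabilities increase with |y|, which is the one-dimensional instance of the monotonicity in t stated in the lemma.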

4.8.2 Proof of Proposition 4.6

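1°. We need the following:

Lemma 4.48. Let S be a positive semidefinite N × N matrix with trace ≤ 1 and ξ be an N-dimensional Rademacher random vector (i.e., the entries in ξ are independent and take values ±1 with probabilities 1/2). Then
$$\mathbf{E}\left\{\exp\{\tfrac{1}{3}\xi^TS\xi\}\right\} \le \sqrt{3},$$
implying that
$$\mathrm{Prob}\{\xi^TS\xi > s\} \le \sqrt{3}\exp\{-s/3\},\ s \ge 0.$$

Proof. Let $S = \sum_i\sigma_ih^i[h^i]^T$ be the eigenvalue decomposition of S, so that $[h^i]^Th^i = 1$, $\sigma_i \ge 0$, and $\sum_i\sigma_i \le 1$. The function
$$F(\sigma_1, ..., \sigma_N) = \mathbf{E}\left\{e^{\frac{1}{3}\sum_i\sigma_i\xi^Th^i[h^i]^T\xi}\right\}$$
is convex on the simplex $\{\sigma \ge 0,\ \sum_i\sigma_i \le 1\}$ and thus attains its maximum over the simplex at a vertex, implying that for some $f = h^i$, $f^Tf = 1$, it holds
$$\mathbf{E}\{e^{\frac{1}{3}\xi^TS\xi}\} \le \mathbf{E}\{e^{\frac{1}{3}(f^T\xi)^2}\}.$$
Let ζ ∼ N(0, 1) be independent of ξ. We have
$$\begin{array}{rcl}
\mathbf{E}_\xi\left\{\exp\{\tfrac{1}{3}(f^T\xi)^2\}\right\} &=& \mathbf{E}_\xi\left\{\mathbf{E}_\zeta\left\{\exp\{[\sqrt{2/3}f^T\xi]\zeta\}\right\}\right\}\\
&=& \mathbf{E}_\zeta\left\{\mathbf{E}_\xi\left\{\exp\{[\sqrt{2/3}f^T\xi]\zeta\}\right\}\right\} = \mathbf{E}_\zeta\left\{\prod_{j=1}^N\mathbf{E}_\xi\left\{\exp\{\sqrt{2/3}\zeta f_j\xi_j\}\right\}\right\}\\
&=& \mathbf{E}_\zeta\left\{\prod_{j=1}^N\cosh(\sqrt{2/3}\zeta f_j)\right\} \le \mathbf{E}_\zeta\left\{\prod_{j=1}^N\exp\{\tfrac{1}{3}\zeta^2f_j^2\}\right\}\\
&=& \mathbf{E}_\zeta\left\{\exp\{\tfrac{1}{3}\zeta^2\}\right\} = \sqrt{3}.\qquad\Box
\end{array}$$

2°. The right inequality in (4.19) has been justified in Section 4.2.3. To prove the left inequality in (4.19), let $\mathbf{T}$ be the closed conic hull of T (see Section 4.1.1), and let us consider the conic problem
$$\mathrm{Opt}_\# = \max_{Q,t}\left\{\mathrm{Tr}(P^TCPQ) : Q \succeq 0,\ \mathrm{Tr}(QR_k) \le t_k\ \forall k \le K,\ [t; 1] \in \mathbf{T}\right\}. \tag{4.139}$$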

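We claim that
$$\mathrm{Opt} = \mathrm{Opt}_\#. \tag{4.140}$$
Indeed, (4.139) clearly is a strictly feasible and bounded conic problem, so that its optimal value is equal to the optimal value of its conic dual (Conic Duality Theorem). Taking into account that the cone $\mathbf{T}_*$ dual to $\mathbf{T}$ is $\{[g; s] : s \ge \phi_T(-g)\}$ (see Section 4.1.1), we therefore get
$$\begin{array}{rcl}
\mathrm{Opt}_\# &=& \min\limits_{\lambda,[g;s],L}\left\{s : \begin{array}{l}\mathrm{Tr}([\sum_k\lambda_kR_k - L]Q) - \sum_k[\lambda_k + g_k]t_k = \mathrm{Tr}(P^TCPQ)\ \forall (Q, t),\\ \lambda \ge 0,\ L \succeq 0,\ s \ge \phi_T(-g)\end{array}\right\}\\
&=& \min\limits_{\lambda,[g;s],L}\left\{s : \begin{array}{l}\sum_k\lambda_kR_k - L = P^TCP,\ g = -\lambda,\\ \lambda \ge 0,\ L \succeq 0,\ s \ge \phi_T(-g)\end{array}\right\}\\
&=& \min\limits_{\lambda}\left\{\phi_T(\lambda) : \sum_k\lambda_kR_k \succeq P^TCP,\ \lambda \ge 0\right\} = \mathrm{Opt},
\end{array}$$
as claimed.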
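Before Lemma 4.48 is put to work in the next item, its moment bound can be confirmed by brute-force enumeration for small N. The sketch below assumes two illustrative test matrices of trace 1 (a rank-one matrix built from the uniform unit vector, and I/N); the function names are hypothetical. It also evaluates the threshold 3 ln(√3 K) at which √3·exp(−s/3) drops to 1/K.

```python
import itertools
import math


def rademacher_exp_moment(S):
    """Exact E exp(xi^T S xi / 3) over all 2^N sign vectors xi."""
    n = len(S)
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        quad = sum(S[i][j] * signs[i] * signs[j] for i in range(n) for j in range(n))
        total += math.exp(quad / 3.0)
    return total / 2 ** n


def rank_one_uniform(n):
    """S = f f^T with f = (1, ..., 1)/sqrt(n): positive semidefinite, trace 1."""
    return [[1.0 / n for _ in range(n)] for _ in range(n)]


def scaled_identity(n):
    """S = I / n: positive semidefinite, trace 1."""
    return [[(1.0 / n if i == j else 0.0) for j in range(n)] for i in range(n)]


def threshold(K):
    """Value 3 ln(sqrt(3) K) at which sqrt(3) exp(-s/3) equals 1/K."""
    return 3.0 * math.log(math.sqrt(3.0) * K)


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, rademacher_exp_moment(rank_one_uniform(n)),
              rademacher_exp_moment(scaled_identity(n)), math.sqrt(3.0))
    print(threshold(10), math.sqrt(3.0) * math.exp(-threshold(10) / 3.0))
```

For S = I/N the quadratic form is identically 1, so the moment equals exp(1/3); the rank-one case comes closer to √3 as N grows, consistent with the Gaussian limit used in the proof.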

3°. With Lemma 4.48 and (4.140) at our disposal, we can now complete the proof of Proposition 4.6 by adjusting the technique from [191]. Specifically, problem (4.139) clearly is solvable; let $Q_*, t^*$ be an optimal solution to the problem. Next, let us set $R_* = Q_*^{1/2}$, $\bar{C} = R_*P^TCPR_*$, let $\bar{C} = UDU^T$ be the eigenvalue decomposition of $\bar{C}$, and let $\bar{R}_k = U^TR_*R_kR_*U$. Observe that
$$\begin{array}{rcl}
\mathrm{Tr}(\bar{C}) &=& \mathrm{Tr}(R_*P^TCPR_*) = \mathrm{Tr}(Q_*P^TCP) = \mathrm{Opt}_\# = \mathrm{Opt},\\
\mathrm{Tr}(\bar{R}_k) &=& \mathrm{Tr}(R_*R_kR_*) = \mathrm{Tr}(Q_*R_k) \le t_k^*.
\end{array}$$
Now let ξ be a Rademacher random vector. For k with $t_k^* > 0$, applying Lemma 4.48 to matrices $\bar{R}_k/t_k^*$, we get for s > 0
$$\mathrm{Prob}\{\xi^T\bar{R}_k\xi > st_k^*\} \le \sqrt{3}\exp\{-s/3\}; \tag{4.141}$$
if k is such that $t_k^* = 0$, we have $\mathrm{Tr}(\bar{R}_k) = 0$, that is, $\bar{R}_k = 0$ (since $\bar{R}_k \succeq 0$), and (4.141) holds true as well. Now let
$$s_* = 3\ln(\sqrt{3}K),$$
so that $\sqrt{3}\exp\{-s/3\} < 1/K$ when $s > s_*$. The latter relation combines with (4.141) to imply that for every $s > s_*$ there exists a realization $\bar{\xi}$ of ξ such that
$$\bar{\xi}^T\bar{R}_k\bar{\xi} \le st_k^*\ \ \forall k.$$
Let us set $\bar{y} = \frac{1}{\sqrt{s}}R_*U\bar{\xi}$. Then
$$\bar{y}^TR_k\bar{y} = s^{-1}\bar{\xi}^TU^TR_*R_kR_*U\bar{\xi} = s^{-1}\bar{\xi}^T\bar{R}_k\bar{\xi} \le t_k^*\ \ \forall k,$$
implying that $\bar{x} := P\bar{y} \in X$, and
$$\bar{x}^TC\bar{x} = s^{-1}\bar{\xi}^TU^T\underbrace{R_*P^TCPR_*}_{\bar{C}}U\bar{\xi} = s^{-1}\bar{\xi}^TD\bar{\xi} = s^{-1}\mathrm{Tr}(D) = s^{-1}\mathrm{Tr}(\bar{C}) = s^{-1}\mathrm{Opt}.$$
Thus, $\mathrm{Opt}_* := \max_{x\in X}x^TCx \ge s^{-1}\mathrm{Opt}$ whenever $s > s_*$, which implies the left inequality in (4.19). □

4.8.3 Proof of Proposition 4.8

The proof follows the lines of the proof of Proposition 4.6. First, passing from C to the matrix $\bar{C} = P^TCP$, the situation clearly reduces to the one where P = I. To save notation, in the rest of the proof we assume that P is the identity. Second, from Lemma 4.44 and the fact that the level sets of $\phi_T(\cdot)$ on the nonnegative orthant are bounded (since T contains a positive vector) it immediately follows that problem (4.29) is feasible with bounded level sets of the objective, so that the problem is solvable. The left inequality in (4.30) was proved in Section 4.3.2. Thus, all we need is to prove the right inequality in (4.30).

1°. Let $\mathbf{T}$ be the closed conic hull of T (see Section 4.1.1). Consider the conic problem
$$\mathrm{Opt}_\# = \max_{Q,t}\left\{\mathrm{Tr}(CQ) : Q \succeq 0,\ R_k[Q] \preceq t_kI_{d_k}\ \forall k \le K,\ [t; 1] \in \mathbf{T}\right\}. \tag{4.142}$$
This problem clearly is strictly feasible; by Lemma 4.44, the feasible set of the problem is bounded, so the problem is solvable. We claim that $\mathrm{Opt}_\# = \mathrm{Opt}$. Indeed, (4.142) is a strictly feasible and bounded conic problem, so that its optimal value is equal to the one in its conic dual, that is,
$$\begin{array}{rcl}
\mathrm{Opt}_\# &=& \min\limits_{\Lambda=\{\Lambda_k\}_{k\le K},[g;s],L}\left\{s : \begin{array}{l}\mathrm{Tr}([\sum_kR_k^*[\Lambda_k] - L]Q) - \sum_k[\mathrm{Tr}(\Lambda_k) + g_k]t_k = \mathrm{Tr}(CQ)\ \forall (Q, t),\\ \Lambda_k \succeq 0\ \forall k,\ L \succeq 0,\ s \ge \phi_T(-g)\end{array}\right\}\\
&=& \min\limits_{\Lambda,[g;s],L}\left\{s : \begin{array}{l}\sum_kR_k^*[\Lambda_k] - L = C,\ g = -\lambda[\Lambda],\\ \Lambda_k \succeq 0\ \forall k,\ L \succeq 0,\ s \ge \phi_T(-g)\end{array}\right\}\\
&=& \min\limits_{\Lambda}\left\{\phi_T(\lambda[\Lambda]) : \sum_kR_k^*[\Lambda_k] \succeq C,\ \Lambda_k \succeq 0\ \forall k\right\} = \mathrm{Opt},
\end{array}$$
as claimed.

2°. Problem (4.142), as we already know, is solvable; let $Q_*, t^*$ be an optimal solution to the problem. Next, let us set $R_* = Q_*^{1/2}$, $\widehat{C} = R_*CR_*$, and let $\widehat{C} = UDU^T$ be the eigenvalue decomposition of $\widehat{C}$, so that the matrix $D = U^TR_*CR_*U$ is diagonal, and the trace of this matrix is $\mathrm{Tr}(R_*CR_*) = \mathrm{Tr}(CQ_*) = \mathrm{Opt}_\# = \mathrm{Opt}$. Now let $V = R_*U$, and let ξ = Vη, where η is an n-dimensional random Rademacher vector (independent entries taking values ±1 with probabilities 1/2). We have
$$\xi^TC\xi = \eta^T[V^TCV]\eta = \eta^T[U^TR_*CR_*U]\eta = \eta^TD\eta \equiv \mathrm{Tr}(D) = \mathrm{Opt} \tag{4.143}$$
(recall that D is diagonal) and
$$\mathbf{E}_\xi\{\xi\xi^T\} = \mathbf{E}_\eta\{V\eta\eta^TV^T\} = VV^T = R_*UU^TR_* = R_*^2 = Q_*.$$
From the latter relation,
$$\begin{array}{rcl}
\mathbf{E}_\xi\left\{R_k^2[\xi]\right\} &=& \mathbf{E}_\xi\left\{R_k[\xi\xi^T]\right\} = R_k[\mathbf{E}_\xi\{\xi\xi^T\}]\\
&=& R_k[Q_*] \preceq t_k^*I_{d_k},\ 1 \le k \le K.
\end{array} \tag{4.144}$$
On the other hand, with properly selected symmetric matrices $\bar{R}^{ki}$ we have
$$R_k[Vy] = \sum_i\bar{R}^{ki}y_i$$
identically in $y \in \mathbf{R}^n$, whence
$$\mathbf{E}_\xi\left\{R_k^2[\xi]\right\} = \mathbf{E}_\eta\left\{R_k^2[V\eta]\right\} = \mathbf{E}_\eta\left\{\Big[\sum_i\eta_i\bar{R}^{ki}\Big]^2\right\} = \sum_{i,j}\mathbf{E}_\eta\{\eta_i\eta_j\}\bar{R}^{ki}\bar{R}^{kj} = \sum_i[\bar{R}^{ki}]^2.$$
This combines with (4.144) to imply that
$$\sum_i[\bar{R}^{ki}]^2 \preceq t_k^*I_{d_k},\ 1 \le k \le K. \tag{4.145}$$

3°. Let us fix k ≤ K. Assuming $t_k^* > 0$ and applying Theorem 4.45, we derive from (4.145) that
$$\mathrm{Prob}\{\eta : \|\bar{R}_k[\eta]\|^2 > t_k^*/\rho\} < 2d_ke^{-\frac{1}{2\rho}},$$
and recalling the relation between ξ and η, we arrive at
$$\mathrm{Prob}\{\xi : \|R_k[\xi]\|^2 > t_k^*/\rho\} < 2d_ke^{-\frac{1}{2\rho}}\ \ \forall\rho \in (0, 1]. \tag{4.146}$$
Note that when $t_k^* = 0$, (4.145) implies $\bar{R}^{ki} = 0$ for all i, so that $R_k[\xi] \equiv \bar{R}_k[\eta] \equiv 0$, and (4.146) holds for those k as well.

Now let us set $\rho = \frac{1}{2\max[\ln(2D),1]}$. For this ρ, the sum over k ≤ K of the right-hand sides in inequalities (4.146) is ≤ 1, implying that there exists a realization $\bar{\xi}$ of ξ such that
$$\|R_k[\bar{\xi}]\|^2 \le t_k^*/\rho\ \ \forall k,$$
or, equivalently,
$$\bar{x} := \rho^{1/2}\bar{\xi} \in X$$
(recall that P = I), implying that
$$\mathrm{Opt}_* := \max_{x\in X}x^TCx \ge \bar{x}^TC\bar{x} = \rho\bar{\xi}^TC\bar{\xi} = \rho\,\mathrm{Opt}$$
(the concluding equality is due to (4.143)), and we arrive at the right inequality in (4.30). □

4.8.4 Proof of Lemma 4.17

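1°. Let us verify (4.57). When $Q \succ 0$, passing from variables (Θ, Υ) in problem (4.56) to the variables $(G = Q^{1/2}\Theta Q^{1/2}, \Upsilon)$, the problem becomes exactly the optimization problem in (4.57), implying that $\mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q]$ when $Q \succ 0$. As is easily seen, both sides in this equality are continuous in $Q \succeq 0$, and (4.57) follows.

2°. Let us prove (4.59). Setting $\zeta = Q^{1/2}\eta$ with $\eta \sim N(0, I_N)$ and $Z = Q^{1/2}Y$, to justify (4.59) we have to show that when κ ≥ 1 one has
$$\bar{\delta} = \frac{\mathrm{Opt}[Q]}{4\kappa}\ \Rightarrow\ \mathrm{Prob}_\eta\{\|Z^T\eta\| \ge \bar{\delta}\} \ge \beta_\kappa := 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2}, \tag{4.147}$$
where (cf. (4.57))
$$[\mathrm{Opt}[Q] =]\ \overline{\mathrm{Opt}}[Q] := \min_{\Theta,\Upsilon=\{\Upsilon_\ell,\ell\le L\}}\left\{\phi_R(\lambda[\Upsilon]) + \mathrm{Tr}(\Theta) : \Upsilon_\ell \succeq 0,\ \begin{bmatrix}\Theta & \frac{1}{2}ZM\\ \frac{1}{2}M^TZ^T & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0\right\}. \tag{4.148}$$
Justification of (4.147) is as follows.

2.1°. Let us represent Opt[Q] as the optimal value of a conic problem. Setting
$$\mathbf{K} = \mathrm{cl}\{[r; s] : s > 0,\ r/s \in R\},$$
we ensure that
$$R = \{r : [r; 1] \in \mathbf{K}\},\qquad \mathbf{K}_* = \{[g; s] : s \ge \phi_R(-g)\},$$
where $\mathbf{K}_*$ is the cone dual to $\mathbf{K}$. Consequently, (4.148) reads
$$\mathrm{Opt}[Q] = \min_{\Theta,\Upsilon,\theta}\left\{\theta + \mathrm{Tr}(\Theta) : \begin{array}{ll}\Upsilon_\ell \succeq 0,\ 1 \le \ell \le L & (a)\\ \begin{bmatrix}\Theta & \frac{1}{2}ZM\\ \frac{1}{2}M^TZ^T & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0 & (b)\\ {[-\lambda[\Upsilon]; \theta]} \in \mathbf{K}_* & (c)\end{array}\right\}. \tag{P}$$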

2.2°. Now let us prove that there exists a matrix $W \in \mathbf{S}^q_+$ and r ∈ R such that
$$S_\ell[W] \preceq r_\ell I_{f_\ell},\ \ell \le L, \tag{4.149}$$
and
$$\mathrm{Opt}[Q] \le \sum_i\sigma_i(ZMW^{1/2}), \tag{4.150}$$
where $\sigma_1(\cdot) \ge \sigma_2(\cdot) \ge ...$ are singular values.

To get the announced result, let us pass from problem (P) to its conic dual. Applying Lemma 4.44 we conclude that (P) is strictly feasible; in addition, (P) clearly is bounded, so that the dual to (P) problem (D) is solvable with optimal value Opt[Q]. Let us build (D). Denoting by $\Lambda_\ell \succeq 0$, ℓ ≤ L, $\begin{bmatrix}G & -R\\ -R^T & W\end{bmatrix} \succeq 0$, $[r; \tau] \in \mathbf{K}$ the Lagrange multipliers for the respective constraints in (P), and aggregating these constraints, the multipliers being the aggregation weights, we arrive at the following aggregated constraint:
$$\mathrm{Tr}(\Theta G) + \mathrm{Tr}\Big(W\sum_\ell S_\ell^*[\Upsilon_\ell]\Big) + \sum_\ell\mathrm{Tr}(\Lambda_\ell\Upsilon_\ell) - \sum_\ell r_\ell\mathrm{Tr}(\Upsilon_\ell) + \theta\tau \ge \mathrm{Tr}(ZMR^T).$$
To get the dual problem, we impose on the Lagrange multipliers, in addition to the initial conic constraints like $\Lambda_\ell \succeq 0$, 1 ≤ ℓ ≤ L, the restriction that the left-hand side in the aggregated constraint, identically in Θ, $\Upsilon_\ell$, and θ, is equal to the objective of (P), that is,
$$G = I,\quad S_\ell[W] + \Lambda_\ell - r_\ell I_{f_\ell} = 0,\ 1 \le \ell \le L,\quad \tau = 1,$$
and maximize, under the resulting restrictions, the right-hand side of the aggregated constraint. After immediate simplifications, we arrive at
$$\mathrm{Opt}[Q] = \max_{W,R,r}\left\{\mathrm{Tr}(ZMR^T) : W \succeq R^TR,\ r \in R,\ S_\ell[W] \preceq r_\ell I_{f_\ell},\ 1 \le \ell \le L\right\}$$
(note that r ∈ R is equivalent to $[r; 1] \in \mathbf{K}$, and $W \succeq R^TR$ is the same as $\begin{bmatrix}I & -R\\ -R^T & W\end{bmatrix} \succeq 0$). Now, to say that $R^TR \preceq W$ is exactly the same as to say that $R = SW^{1/2}$ with the spectral norm $\|S\|_{2,2}$ of S not exceeding 1, so that
$$\mathrm{Opt}[Q] = \max_{W,S,r}\Big\{\underbrace{\mathrm{Tr}(ZM[SW^{1/2}]^T)}_{=\mathrm{Tr}([ZMW^{1/2}]S^T)} : W \succeq 0,\ \|S\|_{2,2} \le 1,\ r \in R,\ S_\ell[W] \preceq r_\ell I_{f_\ell},\ \ell \le L\Big\}.$$
We can immediately eliminate the S-variable, using the well-known fact that for a p × q matrix J it holds
$$\max_{S\in\mathbf{R}^{p\times q},\|S\|_{2,2}\le 1}\mathrm{Tr}(JS^T) = \|J\|_{\mathrm{Sh},1},$$
where $\|J\|_{\mathrm{Sh},1}$ is the nuclear norm (the sum of singular values) of J. We arrive at
$$\mathrm{Opt}[Q] = \max_{W,r}\left\{\|ZMW^{1/2}\|_{\mathrm{Sh},1} : r \in R,\ W \succeq 0,\ S_\ell[W] \preceq r_\ell I_{f_\ell},\ \ell \le L\right\}.$$
The resulting problem clearly is solvable, and its optimal solution W ensures the target relations (4.149) and (4.150).

2.3°. Given W satisfying (4.149) and (4.150), let $UJV = W^{1/2}M^TZ^T$ be the singular value decomposition of $W^{1/2}M^TZ^T$, so that U and V are, respectively, q × q and N × N orthogonal matrices, J is a q × N matrix with diagonal $\sigma = [\sigma_1; ...; \sigma_p]$, p = min[q, N], and zero off-diagonal entries; the diagonal entries $\sigma_i$, 1 ≤ i ≤ p, are the singular values of $W^{1/2}M^TZ^T$, or, which is the same, of $ZMW^{1/2}$. Therefore, by (4.150) we have
$$\sum_i\sigma_i \ge \mathrm{Opt}[Q]. \tag{4.151}$$


Now consider the following construction. Let $\eta \sim N(0, I_N)$; we denote by υ the vector comprised of the first p entries in Vη; note that $\upsilon \sim N(0, I_p)$, since V is orthogonal. We then augment, if necessary, υ by q − p N(0, 1) random variables independent of each other and of η to obtain a q-dimensional random vector $\upsilon' \sim N(0, I_q)$, and set χ = Uυ′. Because U is orthogonal we also have $\chi \sim N(0, I_q)$. Observe that
$$\chi^TW^{1/2}M^TZ^T\eta = \chi^TUJV\eta = [\upsilon']^TJ[V\eta] = \sum_{i=1}^p\sigma_i\upsilon_i^2. \tag{4.152}$$
To continue we need two simple observations.

(i) One has
$$\alpha := \mathrm{Prob}\left\{\sum_{i=1}^p\sigma_i\upsilon_i^2 < \frac{1}{4}\sum_{i=1}^p\sigma_i\right\} \le \frac{e^{3/8}}{2}. \tag{4.153}$$
Indeed, setting $\sigma = \sum_i\sigma_i$, it suffices to consider the case σ > 0; let us apply the Cramer bounding scheme. Namely, given γ > 0, consider the random variable
$$\omega = \exp\left\{\frac{1}{4}\gamma\sum_i\sigma_i - \gamma\sum_i\sigma_i\upsilon_i^2\right\}.$$
Note that ω > 0 a.s., and is > 1 when $\sum_{i=1}^p\sigma_i\upsilon_i^2 < \frac{1}{4}\sum_{i=1}^p\sigma_i$, so that α ≤ E{ω}, or, equivalently, thanks to $\upsilon \sim N(0, I_p)$,
$$\ln(\alpha) \le \ln(\mathbf{E}\{\omega\}) = \frac{1}{4}\gamma\sum_i\sigma_i + \sum_i\ln\left(\mathbf{E}\{\exp\{-\gamma\sigma_i\upsilon_i^2\}\}\right) \le \frac{1}{4}\gamma\sigma - \frac{1}{2}\sum_i\ln(1 + 2\gamma\sigma_i).$$
Function $-\sum_i\ln(1 + 2\gamma\sigma_i)$ is convex in $[\sigma_1; ...; \sigma_p] \ge 0$; therefore, its maximum over the simplex $\{\sigma_i \ge 0,\ i \le p,\ \sum_i\sigma_i = \sigma\}$ is attained at a vertex, and we get
$$\ln(\alpha) \le \frac{1}{4}\gamma\sigma - \frac{1}{2}\ln(1 + 2\gamma\sigma).$$
Minimizing the right-hand side in γ > 0 (the minimum is attained at γσ = 3/2), we arrive at (4.153).

(ii) Whenever κ ≥ 1, one has
$$\mathrm{Prob}\{\|MW^{1/2}\chi\|_* > \kappa\} \le 2F\exp\{-\kappa^2/2\}, \tag{4.154}$$
with F given by (4.55).

Indeed, setting $\rho = 1/\kappa^2 \le 1$ and $\omega = \sqrt{\rho}W^{1/2}\chi$, we get $\omega \sim N(0, \rho W)$. Let us apply Lemma 4.46 to Q = ρW, R in the role of T, L in the role of K, and $S_\ell[\cdot]$ in the role of $R_k[\cdot]$. Denoting
$$Y := \{y : \exists r \in R : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ \ell \le L\},$$
we have $S_\ell[Q] = \rho S_\ell[W] \preceq \rho r_\ell I_{f_\ell}$, ℓ ≤ L, with r ∈ R (see (4.149)), so we are under the premise of Lemma 4.46 (with Y in the role of X and thus with F in the role of D). Applying the lemma, we conclude that
$$\mathrm{Prob}\left\{\chi : \kappa^{-1}W^{1/2}\chi \notin Y\right\} \le 2F\exp\{-1/(2\rho)\} = 2F\exp\{-\kappa^2/2\}.$$
Recalling that $B_* = MY$, we see that $\mathrm{Prob}\{\chi : \kappa^{-1}MW^{1/2}\chi \notin B_*\}$ is indeed upper-bounded by the right-hand side of (4.154), and (4.154) follows.

2.4°. Now, for κ ≥ 1, let
$$E_\kappa = \left\{(\chi, \eta) : \|MW^{1/2}\chi\|_* \le \kappa,\ \sum_i\sigma_i\upsilon_i^2 \ge \frac{1}{4}\sum_i\sigma_i\right\},$$
and let $E_\kappa^+ = \{\eta : \exists\chi : (\chi, \eta) \in E_\kappa\}$. For $\eta \in E_\kappa^+$ there exists χ such that $(\chi, \eta) \in E_\kappa$, leading to
$$\kappa\|Z^T\eta\| \ge \|MW^{1/2}\chi\|_*\|Z^T\eta\| \ge \chi^TW^{1/2}M^TZ^T\eta = \sum_i\sigma_i\upsilon_i^2 \ge \frac{1}{4}\sum_i\sigma_i \ge \frac{1}{4}\mathrm{Opt}[Q]$$
(we have used (4.152) and (4.151)). Thus,
$$\eta \in E_\kappa^+\ \Rightarrow\ \|Z^T\eta\| \ge \frac{\mathrm{Opt}[Q]}{4\kappa}.$$
On the other hand, due to (4.153) and (4.154), for our random (χ, η) it holds
$$\mathrm{Prob}\{E_\kappa\} \ge 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2} = \beta_\kappa,$$
and the marginal distribution of η is $N(0, I_N)$, implying that
$$\mathrm{Prob}_{\eta\sim N(0,I_N)}\{\eta \in E_\kappa^+\} \ge \beta_\kappa.$$
(4.147) is proved.

3°. As was explained in the beginning of item 2°, (4.147) is exactly the same as (4.59). The latter relation clearly implies (4.60) which, in turn, implies the right inequality in (4.58). □

4.8.5 Proofs of Propositions 4.5, 4.16 and 4.19

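Below, we focus on the proof of Proposition 4.16; Propositions 4.5 and 4.19 will be derived from it in Sections 4.8.5.2, 4.8.6.2, respectively.

4.8.5.1 Proof of Proposition 4.16

In what follows, we use the assumptions and the notation of Proposition 4.16.

1°. Let
$$\Phi(H, \Lambda, \Upsilon, \Upsilon', \Theta; Q) = \phi_T(\lambda[\Lambda]) + \phi_R(\lambda[\Upsilon]) + \phi_R(\lambda[\Upsilon']) + \mathrm{Tr}(Q\Theta) : \mathcal{M}\times\Pi \to \mathbf{R},$$
where
$$\mathcal{M} = \left\{(H, \Lambda, \Upsilon, \Upsilon', \Theta) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\\ \begin{bmatrix}\sum_kR_k^*[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}.$$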

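Looking at (4.42), we see immediately that the optimal value Opt in (4.42) is nothing but
$$\mathrm{Opt} = \min_{(H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}}\left[\overline{\Phi}(H, \Lambda, \Upsilon, \Upsilon', \Theta) := \max_{Q\in\Pi}\Phi(H, \Lambda, \Upsilon, \Upsilon', \Theta; Q)\right]. \tag{4.155}$$
Note that sets $\mathcal{M}$ and Π are closed and convex, Π is compact, and Φ is a continuous convex-concave function on $\mathcal{M}\times\Pi$. In view of these observations, the fact that $\Pi \subset \mathrm{int}\,\mathbf{S}^m_+$ combines with the Sion-Kakutani Theorem to imply that Φ possesses a saddle point $(H_*, \Lambda_*, \Upsilon_*, \Upsilon'_*, \Theta_*; Q_*)$ (min in (H, Λ, Υ, Υ′, Θ), max in Q) on $\mathcal{M}\times\Pi$, whence Opt is the saddle point value of Φ by (4.155). We conclude that for properly selected $Q_* \in \Pi$ it holds
$$\begin{array}{rcl}
\mathrm{Opt} &=& \min\limits_{(H,\Lambda,\Upsilon,\Upsilon',\Theta)\in\mathcal{M}}\Phi(H, \Lambda, \Upsilon, \Upsilon', \Theta; Q_*)\\
&=& \min\limits_{H,\Lambda,\Upsilon,\Upsilon',\Theta}\left\{\phi_T(\lambda[\Lambda]) + \phi_R(\lambda[\Upsilon]) + \phi_R(\lambda[\Upsilon']) + \mathrm{Tr}(Q_*\Theta) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\\ \begin{bmatrix}\sum_kR_k^*[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\ \begin{bmatrix}\Theta & \frac{1}{2}HM\\ \frac{1}{2}M^TH^T & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \min\limits_{H,\Lambda,\Upsilon,\Upsilon',G}\left\{\phi_T(\lambda[\Lambda]) + \phi_R(\lambda[\Upsilon]) + \phi_R(\lambda[\Upsilon']) + \mathrm{Tr}(G) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\\ \begin{bmatrix}\sum_kR_k^*[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0,\\ \begin{bmatrix}G & \frac{1}{2}Q_*^{1/2}HM\\ \frac{1}{2}M^TH^TQ_*^{1/2} & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix} \succeq 0\end{array}\right\}\\
&=& \min\limits_{H,\Lambda,\Upsilon}\left\{\phi_T(\lambda[\Lambda]) + \phi_R(\lambda[\Upsilon]) + \Psi(H) : \begin{array}{l}\Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\\ \begin{bmatrix}\sum_kR_k^*[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M\\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell S_\ell^*[\Upsilon_\ell]\end{bmatrix} \succeq 0\end{array}\right\},
\end{array} \tag{4.156}$$
where
$$\Psi(H) := \min_{G,\Upsilon'}\left\{\phi_R(\lambda[\Upsilon']) + \mathrm{Tr}(G) : \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\},\ \begin{bmatrix}G & \frac{1}{2}Q_*^{1/2}HM\\ \frac{1}{2}M^TH^TQ_*^{1/2} & \sum_\ell S_\ell^*[\Upsilon'_\ell]\end{bmatrix} \succeq 0\right\},$$
and Opt is given by (4.42), and the equalities are due to (4.56) and (4.57).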


From now on we assume that the noise ξ in observation (4.31) is $\xi \sim N(0, Q_*)$. We also assume that B ≠ 0, since otherwise the conclusion of Proposition 4.16 is evident.

2°. ε-risk. In Proposition 4.16, we are speaking about the $\|\cdot\|$-risk of an estimate, that is, the maximal, over signals x ∈ X, expected norm $\|\cdot\|$ of the error of recovering Bx; what we need to prove is that the minimax optimal risk $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}[X]$ as given by (4.53) can be lower-bounded by a quantity “of order of” Opt. To this end, of course, it suffices to build such a lower bound for the quantity
$$\mathrm{RiskOpt}_{\|\cdot\|} := \inf_{\widehat{x}(\cdot)}\left[\sup_{x\in X}\mathbf{E}_{\xi\sim N(0,Q_*)}\{\|Bx - \widehat{x}(Ax + \xi)\|\}\right],$$
since this quantity is a lower bound on $\mathrm{RiskOpt}_{\Pi,\|\cdot\|}$. Technically, it is more convenient to work with the ε-risk defined in terms of “$\|\cdot\|$-confidence intervals” rather than in terms of the expected norm of the error. Specifically, in the sequel we will heavily use the minimax ε-risk defined as
$$\mathrm{RiskOpt}_\epsilon = \inf_{\widehat{x},\rho}\left\{\rho : \mathrm{Prob}_{\xi\sim N(0,Q_*)}\{\|Bx - \widehat{x}(Ax + \xi)\| > \rho\} \le \epsilon\ \ \forall x \in X\right\},$$
where $\widehat{x}$ in the infimum runs through the set of all Borel estimates. When ε ∈ (0, 1) is once and forever fixed (in the sequel, we use ε = 1/8) we can use ε-risk to lower-bound $\mathrm{RiskOpt}_{\|\cdot\|}$, since by evident reasons
$$\mathrm{RiskOpt}_{\|\cdot\|} \ge \epsilon\cdot\mathrm{RiskOpt}_\epsilon. \tag{4.157}$$
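The “evident reasons” behind (4.157) amount to Markov's inequality: if a nonnegative error exceeds ρ with probability greater than ε, its expectation exceeds ερ. The sketch below illustrates this on an empirical sample; the sampling distribution (absolute values of standard Gaussians), the seed, and the function name are assumptions made purely for illustration.

```python
import bisect
import random


def eps_quantile(sample, eps):
    """Smallest sample value v such that the fraction of entries strictly above v is <= eps."""
    xs = sorted(sample)
    n = len(xs)
    for v in xs:
        if n - bisect.bisect_right(xs, v) <= eps * n:
            return v
    return xs[-1]


if __name__ == "__main__":
    rng = random.Random(1)
    errors = [abs(rng.gauss(0.0, 1.0)) for _ in range(10000)]  # nonnegative "error norms"
    eps = 1.0 / 8.0
    rho = eps_quantile(errors, eps)
    mean_error = sum(errors) / len(errors)
    # Markov-type relation: the expected error is at least eps times the eps-quantile.
    print(mean_error, eps * rho)
```

For an empirical distribution the relation mean ≥ ε·ρ holds deterministically, whatever the sample.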

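Consequently, all we need in order to prove Proposition 4.16 is to lower-bound $\mathrm{RiskOpt}_{1/8}$ by a "not too small" multiple of Opt, and this is our current objective.

3°. Let $W$ be a positive semidefinite $n \times n$ matrix, let $\eta \sim \mathcal{N}(0, W)$ be a random signal, and let $\xi \sim \mathcal{N}(0, Q_*)$ be independent of $\eta$; vectors $(\eta, \xi)$ induce the random vector $\omega = A\eta + \xi \sim \mathcal{N}(0, AWA^T + Q_*)$. Consider the Bayesian version of the estimation problem where given $\omega$ we are interested in recovering $B\eta$. Recall that, because $[\omega; B\eta]$ is zero mean Gaussian, the conditional expectation $\mathbf{E}_{|\omega}\{B\eta\}$ of $B\eta$ given $\omega$ is linear in $\omega$: $\mathbf{E}_{|\omega}\{B\eta\} = \bar{H}^T\omega$ for some $\bar{H}$ depending on $W$ only.²⁴ Therefore, denoting by $P_{|\omega}$ the conditional probability distribution given $\omega$, for any $\rho > 0$ and estimate $\widehat{x}(\cdot)$ one has

$$\begin{aligned}
\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \widehat{x}(A\eta + \xi)\| \ge \rho\}
&= \mathbf{E}_{\omega}\left\{ \mathrm{Prob}_{|\omega}\{\|B\eta - \widehat{x}(\omega)\| \ge \rho\} \right\} \\
&\ge \mathbf{E}_{\omega}\left\{ \mathrm{Prob}_{|\omega}\{\|B\eta - \mathbf{E}_{|\omega}\{B\eta\}\| \ge \rho\} \right\}
= \mathrm{Prob}_{\eta,\xi}\{\|B\eta - \bar{H}^T(A\eta + \xi)\| \ge \rho\},
\end{aligned}$$

with the inequality given by the Anderson Lemma as applied to the shift of the Gaussian distribution $P_{|\omega}$ by its mean. Applying the Anderson Lemma again we get

$$\begin{aligned}
\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \bar{H}^T(A\eta + \xi)\| \ge \rho\}
&= \mathbf{E}_{\xi}\left\{ \mathrm{Prob}_{\eta}\{\|(B - \bar{H}^TA)\eta - \bar{H}^T\xi\| \ge \rho\} \right\} \\
&\ge \mathrm{Prob}_{\eta}\{\|(B - \bar{H}^TA)\eta\| \ge \rho\},
\end{aligned}$$

and, by "symmetric" reasoning,

$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \bar{H}^T(A\eta + \xi)\| \ge \rho\} \ge \mathrm{Prob}_{\xi}\{\|\bar{H}^T\xi\| \ge \rho\}.$$

We conclude that for any $\widehat{x}(\cdot)$

$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \widehat{x}(\omega)\| \ge \rho\} \ge \max\left[ \mathrm{Prob}_{\eta}\{\|(B - \bar{H}^TA)\eta\| \ge \rho\},\ \mathrm{Prob}_{\xi}\{\|\bar{H}^T\xi\| \ge \rho\} \right]. \tag{4.158}$$

²⁴ We have used the following standard fact [172]: let $\zeta = [\omega; \eta] \sim \mathcal{N}(0, S)$, the covariance matrix of the marginal distribution of $\omega$ being nonsingular. Then the conditional distribution of $\eta$ given $\omega$ is Gaussian with the mean linearly depending on $\omega$ and covariance matrix independent of $\omega$.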
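To make the linearity of the conditional expectation concrete, here is a scalar sketch (n = m = ν = 1). The closed form used for the gain, covariance of the target with the observation divided by the variance of the observation, is the standard Gaussian conditioning formula; it is an assumption added here, since the text only uses the existence of the matrix H̄. All numerical values are illustrative.

```python
def conditional_gain(a, b, w, q):
    """Scalar h_bar with E{b*eta | omega} = h_bar * omega, where
    omega = a*eta + xi, eta ~ N(0, w), xi ~ N(0, q) are independent."""
    cov_target_obs = b * w * a          # Cov(b*eta, omega)
    var_obs = a * a * w + q             # Var(omega), assumed positive
    return cov_target_obs / var_obs

def error_variance(h, a, b, w, q):
    """Variance of b*eta - h*omega for an arbitrary linear gain h."""
    return (b - h * a) ** 2 * w + h * h * q

a, b, w, q = 2.0, 3.0, 0.5, 1.0         # illustrative values
h_bar = conditional_gain(a, b, w, q)

# Orthogonality principle: the residual b*eta - h_bar*omega is uncorrelated with omega.
residual_cov = b * w * a - h_bar * (a * a * w + q)
print(h_bar, residual_cov, error_variance(h_bar, a, b, w, q))
```

In this instance the gain equals 1, and perturbing it in either direction increases the error variance, as expected of a conditional mean.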

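4°. Let $H$ be an $m \times \nu$ matrix. Applying Lemma 4.17 to $N = m$, $Y = \bar{H}$, $Q = Q_*$, we get from (4.59)

$$\mathrm{Prob}_{\xi \sim \mathcal{N}(0,Q_*)}\left\{ \|\bar{H}^T\xi\| \ge [4\kappa]^{-1}\Psi(\bar{H}) \right\} \ge \beta_\kappa \quad \forall \kappa \ge 1, \tag{4.159}$$

where $\Psi(H)$ is defined by (4.156). Similarly, applying Lemma 4.17 to $N = n$, $Y = (B - \bar{H}^TA)^T$, $Q = W$, we obtain

$$\mathrm{Prob}_{\eta \sim \mathcal{N}(0,W)}\left\{ \|(B - \bar{H}^TA)\eta\| \ge [4\kappa]^{-1}\Phi(W, \bar{H}) \right\} \ge \beta_\kappa \quad \forall \kappa \ge 1, \tag{4.160}$$

where

$$\Phi(W, H) = \min_{\Upsilon = \{\Upsilon_\ell, \ell \le L\},\, \Theta} \left\{ \mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) :\ \Upsilon_\ell \succeq 0\ \forall \ell,\ \begin{bmatrix} \Theta & \frac{1}{2}[B^T - A^TH]M \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \right\}. \tag{4.161}$$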

Let us put $\rho(W, \bar{H}) = [8\kappa]^{-1}[\Psi(\bar{H}) + \Phi(W, \bar{H})]$; when combining (4.160) with (4.159) we conclude that

$$\max\left[ \mathrm{Prob}_{\eta}\{\|(B - \bar{H}^TA)\eta\| \ge \rho(W, \bar{H})\},\ \mathrm{Prob}_{\xi}\{\|\bar{H}^T\xi\| \ge \rho(W, \bar{H})\} \right] \ge \beta_\kappa,$$

and the same inequality holds if $\rho(W, \bar{H})$ is replaced with the smaller quantity

$$\bar{\rho}(W) = [8\kappa]^{-1} \inf_{H} [\Psi(H) + \Phi(W, H)].$$

Now, the latter bound combines with (4.158) to imply the following result:

Lemma 4.49. Let $W$ be a positive semidefinite $n \times n$ matrix, and $\kappa \ge 1$. Then for any estimate $\widehat{x}(\cdot)$ of $B\eta$ given observation $\omega = A\eta + \xi$, where $\eta \sim \mathcal{N}(0, W)$ is independent of $\xi \sim \mathcal{N}(0, Q_*)$, one has

$$\mathrm{Prob}_{\eta,\xi}\left\{ \|B\eta - \widehat{x}(\omega)\| \ge [8\kappa]^{-1} \inf_{H} [\Psi(H) + \Phi(W, H)] \right\} \ge \beta_\kappa = 1 - \frac{e^{3/8}}{2} - 2Fe^{-\kappa^2/2},$$

where $\Psi(H)$ and $\Phi(W, H)$ are defined, respectively, by (4.156) and (4.161). In particular, for

$$\kappa = \bar{\kappa} := \sqrt{2\ln F + 10\ln 2} \tag{4.162}$$

it holds

$$\mathrm{Prob}_{\eta,\xi}\left\{ \|B\eta - \widehat{x}(\omega)\| \ge [8\bar{\kappa}]^{-1} \inf_{H} [\Psi(H) + \Phi(W, H)] \right\} > \frac{3}{16}.$$
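The choice (4.162) is made so that the last term of β_κ equals 1/16 regardless of F. A short numerical check of this arithmetic, a sketch with illustrative values of F and hypothetical function names, is as follows.

```python
import math

def kappa_bar(F):
    """kappa from (4.162): sqrt(2 ln F + 10 ln 2)."""
    return math.sqrt(2.0 * math.log(F) + 10.0 * math.log(2.0))

def beta(kappa, F):
    """beta_kappa = 1 - e^{3/8}/2 - 2 F exp(-kappa^2 / 2) from Lemma 4.49."""
    return 1.0 - math.exp(3.0 / 8.0) / 2.0 - 2.0 * F * math.exp(-kappa ** 2 / 2.0)

# With kappa = kappa_bar(F): 2 F exp(-kappa^2/2) = 2 F * F^{-1} * 2^{-5} = 1/16.
target = 1.0 - math.exp(3.0 / 8.0) / 2.0 - 1.0 / 16.0
for F in (1, 10, 1000):                      # illustrative sizes
    print(F, kappa_bar(F), beta(kappa_bar(F), F))
```

For every F the value is about 0.21, which indeed exceeds 3/16 = 0.1875.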

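5°. For $0 < \kappa \le 1$, let us set

$$\begin{array}{ll}
(a) & \mathcal{W}_\kappa = \{ W \in \mathbf{S}^n_+ : \exists t \in \mathcal{T} : \mathcal{R}_k[W] \preceq \kappa t_k I_{d_k},\ 1 \le k \le K \}, \\[4pt]
(b) & \mathcal{Z} = \left\{ (\Upsilon = \{\Upsilon_\ell, \ell \le L\}, \Theta, H) :\ \Upsilon_\ell \succeq 0\ \forall \ell,\ \begin{bmatrix} \Theta & \frac{1}{2}[B^T - A^TH]M \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \right\}.
\end{array}$$

Note that $\mathcal{W}_\kappa$ is a nonempty convex and compact (by Lemma 4.44) set such that $\mathcal{W}_\kappa = \kappa\mathcal{W}_1$, and $\mathcal{Z}$ is a nonempty closed convex set. Consider the parametric saddle point problem

$$\mathrm{Opt}(\kappa) = \max_{W \in \mathcal{W}_\kappa}\ \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} \left[ E(W; \Upsilon, \Theta, H) := \mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H) \right]. \tag{4.163}$$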

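This problem is convex-concave; utilizing the fact that $\mathcal{W}_\kappa$ is compact and contains positive definite matrices, it is immediately seen that the Sion-Kakutani theorem ensures the existence of a saddle point whenever $\kappa \in (0, 1]$. We claim that

$$0 < \kappa \le 1 \ \Rightarrow\ \mathrm{Opt}(\kappa) \ge \sqrt{\kappa}\,\mathrm{Opt}(1). \tag{4.164}$$

Indeed, $\mathcal{Z}$ is invariant w.r.t. scalings

$$(\Upsilon = \{\Upsilon_\ell, \ell \le L\}, \Theta, H) \mapsto (\theta\Upsilon := \{\theta\Upsilon_\ell, \ell \le L\}, \theta^{-1}\Theta, H) \qquad [\theta > 0].$$

When taking into account that $\phi_{\mathcal{R}}(\lambda[\theta\Upsilon]) = \theta\phi_{\mathcal{R}}(\lambda[\Upsilon])$, we get

$$\begin{aligned}
\underline{E}(W) &:= \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} E(W; \Upsilon, \Theta, H) = \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}}\ \inf_{\theta > 0} E(W; \theta\Upsilon, \theta^{-1}\Theta, H) \\
&= \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} \left[ 2\sqrt{\mathrm{Tr}(W\Theta)\,\phi_{\mathcal{R}}(\lambda[\Upsilon])} + \Psi(H) \right].
\end{aligned}$$

Because $\Psi$ is nonnegative we conclude that whenever $W \succeq 0$ and $\kappa \in (0, 1]$, one has $\underline{E}(\kappa W) \ge \sqrt{\kappa}\,\underline{E}(W)$. This combines with $\mathcal{W}_\kappa = \kappa\mathcal{W}_1$ to imply that

$$\mathrm{Opt}(\kappa) = \max_{W \in \mathcal{W}_\kappa} \underline{E}(W) = \max_{W \in \mathcal{W}_1} \underline{E}(\kappa W) \ge \sqrt{\kappa} \max_{W \in \mathcal{W}_1} \underline{E}(W) = \sqrt{\kappa}\,\mathrm{Opt}(1),$$

and (4.164) follows.

6°. We claim that

$$\mathrm{Opt}(1) = \mathrm{Opt}, \tag{4.165}$$

where Opt is given by (4.42) (and, as we have seen, by (4.156) as well). Note that (4.165) combines with (4.164) to imply that

$$0 < \kappa \le 1 \ \Rightarrow\ \mathrm{Opt}(\kappa) \ge \sqrt{\kappa}\,\mathrm{Opt}. \tag{4.166}$$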
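The scaling step in item 5° rests on the elementary identity that a/θ + θb, minimized over θ > 0, equals 2√(ab), together with the fact that multiplying a by κ ∈ (0, 1] and keeping a nonnegative additive term shrinks the value by a factor of at most √κ. The sketch below checks both facts on scalars; here a stands for Tr(WΘ), b for φ_R(λ[Υ]) and psi for Ψ(H), and all numerical values and function names are illustrative assumptions.

```python
import math

def scaled_objective(a, b, psi, theta):
    """Value of Tr(W Theta)/theta + theta*phi_R + Psi after the scaling by theta."""
    return a / theta + theta * b + psi

def inner_infimum(a, b, psi):
    """Closed form of the infimum over theta > 0: 2*sqrt(a*b) + psi."""
    return 2.0 * math.sqrt(a * b) + psi

a, b, psi = 3.0, 0.75, 0.4             # illustrative nonnegative values
theta_star = math.sqrt(a / b)          # minimizer of a/theta + theta*b
kappa = 0.3

lhs = inner_infimum(kappa * a, b, psi)             # value after W -> kappa * W
rhs = math.sqrt(kappa) * inner_infimum(a, b, psi)  # sqrt(kappa) times the old value
print(scaled_objective(a, b, psi, theta_star), inner_infimum(a, b, psi), lhs, rhs)
```

Since the inequality holds for every feasible triple, it survives taking the infimum over the feasible set, which is how it is used in the proof.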


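Verification of (4.165) is given by the following computation. By the Sion-Kakutani Theorem,

$$\begin{aligned}
\mathrm{Opt}(1) &= \max_{W \in \mathcal{W}_1}\ \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} \left\{ \mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H) \right\} \\
&= \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}}\ \max_{W \in \mathcal{W}_1} \left\{ \mathrm{Tr}(W\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H) \right\} \\
&= \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} \left\{ \Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \max_{W} \left\{ \mathrm{Tr}(\Theta W) : W \succeq 0,\ \exists t \in \mathcal{T} : \mathcal{R}_k[W] \preceq t_k I_{d_k},\ k \le K \right\} \right\} \\
&= \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} \left\{ \Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \max_{W,t} \left\{ \mathrm{Tr}(\Theta W) : W \succeq 0,\ [t; 1] \in \mathbf{T},\ \mathcal{R}_k[W] \preceq t_k I_{d_k},\ k \le K \right\} \right\},
\end{aligned}$$

where $\mathbf{T}$ is the closed conic hull of $\mathcal{T}$. On the other hand, using Conic Duality combined with the fact that $\mathbf{T}_* = \{[g; s] : s \ge \phi_{\mathcal{T}}(-g)\}$ we obtain

$$\begin{aligned}
&\max_{W,t} \left\{ \mathrm{Tr}(\Theta W) : W \succeq 0,\ [t; 1] \in \mathbf{T},\ \mathcal{R}_k[W] \preceq t_k I_{d_k},\ k \le K \right\} \\
&\quad = \min_{Z,[g;s],\Lambda = \{\Lambda_k\}} \left\{ s :\ \begin{array}{l} Z \succeq 0,\ [g; s] \in \mathbf{T}_*,\ \Lambda_k \succeq 0,\ k \le K, \\ -\mathrm{Tr}(ZW) - g^Tt + \sum_k \mathrm{Tr}(\mathcal{R}_k^*[\Lambda_k]W) - \sum_k t_k\mathrm{Tr}(\Lambda_k) = \mathrm{Tr}(\Theta W) \\ \forall (W \in \mathbf{S}^n, t \in \mathbf{R}^K) \end{array} \right\} \\
&\quad = \min_{Z,[g;s],\Lambda = \{\Lambda_k\}} \left\{ s :\ \begin{array}{l} Z \succeq 0,\ s \ge \phi_{\mathcal{T}}(-g),\ \Lambda_k \succeq 0,\ k \le K, \\ \Theta = \sum_k \mathcal{R}_k^*[\Lambda_k] - Z,\ g = -\lambda[\Lambda] \end{array} \right\} \\
&\quad = \min_{\Lambda} \left\{ \phi_{\mathcal{T}}(\lambda[\Lambda]) :\ \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Theta \preceq \sum_k \mathcal{R}_k^*[\Lambda_k] \right\},
\end{aligned}$$

and we arrive at

$$\begin{aligned}
\mathrm{Opt}(1) &= \inf_{\Upsilon,\Theta,H,\Lambda} \left\{ \Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{T}}(\lambda[\Lambda]) :\ \begin{array}{l} \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \\ \Theta \preceq \sum_k \mathcal{R}_k^*[\Lambda_k], \\ \begin{bmatrix} \Theta & \frac{1}{2}[B^T - A^TH]M \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \end{array} \right\} \\
&= \inf_{\Upsilon,H,\Lambda} \left\{ \Psi(H) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{T}}(\lambda[\Lambda]) :\ \begin{array}{l} \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \\ \begin{bmatrix} \sum_k \mathcal{R}_k^*[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0 \end{array} \right\} \\
&= \mathrm{Opt} \quad \text{[see (4.156)]}.
\end{aligned}$$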

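7°. Now we can complete the proof. For $\kappa \in (0, 1]$, let $W_\kappa$ be the $W$-component of a saddle point solution to the saddle point problem (4.163). Then, by (4.166),

$$\sqrt{\kappa}\,\mathrm{Opt} \le \mathrm{Opt}(\kappa) = \inf_{(\Upsilon,\Theta,H) \in \mathcal{Z}} \left\{ \mathrm{Tr}(W_\kappa\Theta) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \Psi(H) \right\} = \inf_{H} \left\{ \Phi(W_\kappa, H) + \Psi(H) \right\}. \tag{4.167}$$

On the other hand, when applying Lemma 4.46 to $Q = W_\kappa$ and $\rho = \kappa$, we obtain, in view of relations $0 < \kappa \le 1$, $W_\kappa \in \mathcal{W}_\kappa$,

$$\delta(\kappa) := \mathrm{Prob}_{\zeta \sim \mathcal{N}(0,I_n)}\{W_\kappa^{1/2}\zeta \notin \mathcal{X}\} \le 2De^{-\frac{1}{2\kappa}}, \tag{4.168}$$

with $D$ given by (4.55). In particular, when setting

$$\bar{\kappa} = \frac{1}{2\ln D + 10\ln 2} \tag{4.169}$$

we obtain $\delta(\bar{\kappa}) \le 1/16$. Therefore,

$$\mathrm{Prob}_{\eta \sim \mathcal{N}(0,W_{\bar{\kappa}})}\{\eta \notin \mathcal{X}\} \le \frac{1}{16}. \tag{4.170}$$

Now let

$$\varrho_* := \frac{\mathrm{Opt}}{8\sqrt{(2\ln F + 10\ln 2)(2\ln D + 10\ln 2)}}. \tag{4.171}$$

All we need in order to achieve our goal of justifying (4.54) is to show that

$$\mathrm{RiskOpt}_{1/8} \ge \varrho_*, \tag{4.172}$$

since given the latter relation, (4.54) will be immediately given by (4.157) as applied with $\epsilon = \frac{1}{8}$.

To prove (4.172), assume, on the contrary to what should be proved, that the $\frac{1}{8}$-risk is $< \varrho_*$, and let $\bar{x}(\cdot)$ be an estimate with $\frac{1}{8}$-risk $\varrho' < \varrho_*$. We can utilize $\bar{x}$ to estimate $B\eta$ in the Bayesian problem of recovering $B\eta$ from observation $\omega = A\eta + \xi$, $(\eta, \xi) \sim \mathcal{N}(0, \Sigma)$ with $\Sigma = \mathrm{Diag}\{W_{\bar{\kappa}}, Q_*\}$. From (4.170) we conclude that

$$\begin{aligned}
\mathrm{Prob}_{(\eta,\xi) \sim \mathcal{N}(0,\Sigma)}\{\|B\eta - \bar{x}(A\eta + \xi)\| > \varrho'\}
&\le \mathrm{Prob}_{(\eta,\xi) \sim \mathcal{N}(0,\Sigma)}\{\|B\eta - \bar{x}(A\eta + \xi)\| > \varrho',\ \eta \in \mathcal{X}\} \\
&\quad + \mathrm{Prob}_{\eta \sim \mathcal{N}(0,W_{\bar{\kappa}})}\{\eta \notin \mathcal{X}\} \le \frac{1}{8} + \frac{1}{16} = \frac{3}{16}.
\end{aligned} \tag{4.173}$$

On the other hand, by (4.167) we have

$$\inf_{H} [\Phi(W_{\bar{\kappa}}, H) + \Psi(H)] = \mathrm{Opt}(\bar{\kappa}) \ge \sqrt{\bar{\kappa}}\,\mathrm{Opt} = [8\bar{\kappa}]\varrho_*,$$

where $\bar{\kappa}$ in $\mathrm{Opt}(\bar{\kappa})$ and $\sqrt{\bar{\kappa}}$ is given by (4.169), while $\bar{\kappa}$ in the factor $[8\bar{\kappa}]$ is given by (4.162). Thus, by Lemma 4.49, for any estimate $\widehat{x}(\cdot)$ of $B\eta$ via observation $\omega = A\eta + \xi$ it holds

$$\mathrm{Prob}_{\eta,\xi}\{\|B\eta - \widehat{x}(A\eta + \xi)\| \ge \varrho_*\} \ge \beta_{\bar{\kappa}} > 3/16;$$

in particular, this relation should hold true for $\widehat{x}(\cdot) \equiv \bar{x}(\cdot)$, but the latter is impossible: the $\frac{3}{16}$-risk of $\bar{x}$ is $\le \varrho' < \varrho_*$; see (4.173). ✷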
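The constants in item 7° fit together through two pieces of arithmetic: the choice (4.169) makes the right-hand side of (4.168) equal to 1/16, and the definition (4.171) makes the square root of the value (4.169), times Opt, equal to 8 times the value (4.162) times ϱ_*. The following sketch verifies both for illustrative sizes D and F and an illustrative value of Opt; the function names are hypothetical.

```python
import math

def kappa_signal(D):
    """kappa_bar from (4.169): 1 / (2 ln D + 10 ln 2)."""
    return 1.0 / (2.0 * math.log(D) + 10.0 * math.log(2.0))

def kappa_norm(F):
    """kappa_bar from (4.162): sqrt(2 ln F + 10 ln 2)."""
    return math.sqrt(2.0 * math.log(F) + 10.0 * math.log(2.0))

def rho_star(opt, D, F):
    """rho_* written literally as in (4.171)."""
    log_f = 2.0 * math.log(F) + 10.0 * math.log(2.0)
    log_d = 2.0 * math.log(D) + 10.0 * math.log(2.0)
    return opt / (8.0 * math.sqrt(log_f * log_d))

D, F, opt = 50, 20, 1.0                       # illustrative sizes and Opt
delta_bound = 2 * D * math.exp(-1.0 / (2.0 * kappa_signal(D)))  # (4.168) at kappa_bar
left = math.sqrt(kappa_signal(D)) * opt       # sqrt(kappa_bar) * Opt
right = 8.0 * kappa_norm(F) * rho_star(opt, D, F)
print(delta_bound, left, right)
```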

4.8.5.2 Proof of Proposition 4.5

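We shall extract Proposition 4.5 from the following result, meaningful in its own right (it can be considered as an "ellitopic refinement" of Proposition 4.16):

Proposition 4.50. Consider the recovery of the linear image $Bx \in \mathbf{R}^\nu$ of unknown signal $x$ known to belong to a given signal set $\mathcal{X} \subset \mathbf{R}^n$ from noisy observation

$$\omega = Ax + \xi \in \mathbf{R}^m \qquad [\xi \sim \mathcal{N}(0, \Gamma),\ \Gamma \succ 0],$$

the recovery error being measured in norm $\|\cdot\|$ on $\mathbf{R}^\nu$. Assume that $\mathcal{X}$ and the unit ball $\mathcal{B}_*$ of the norm $\|\cdot\|_*$ conjugate to $\|\cdot\|$ are ellitopes:

$$\begin{aligned}
\mathcal{X} &= \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : x^TR_kx \le t_k,\ k \le K\}, \\
\mathcal{B}_* &= \{u \in \mathbf{R}^\nu : \exists (r \in \mathcal{R}, y) : u = My,\ y^TS_\ell y \le r_\ell,\ \ell \le L\},
\end{aligned} \tag{4.174}$$

with our standard restrictions on $\mathcal{T}$, $\mathcal{R}$, $R_k$ and $S_\ell$ (as always, we lose nothing when assuming that the ellitope $\mathcal{X}$ is basic). Consider the optimization problem

$$\mathrm{Opt}_\# = \min_{\Theta,H,\lambda,\mu,\mu'} \left\{ \phi_{\mathcal{T}}(\lambda) + \phi_{\mathcal{R}}(\mu) + \phi_{\mathcal{R}}(\mu') + \mathrm{Tr}(\Gamma\Theta) :\ \begin{array}{l} \lambda \ge 0,\ \mu \ge 0,\ \mu' \ge 0, \\ \begin{bmatrix} \sum_k \lambda_kR_k & \frac{1}{2}[B - H^TA]^TM \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0, \\ \begin{bmatrix} \Theta & \frac{1}{2}HM \\ \frac{1}{2}M^TH^T & \sum_\ell \mu'_\ell S_\ell \end{bmatrix} \succeq 0 \end{array} \right\}. \tag{4.175}$$

The problem is solvable, and the linear estimate $\widehat{x}_{H_*}(\omega) = H_*^T\omega$ yielded by the $H$-component of an optimal solution to the problem satisfies the risk bound

$$\mathrm{Risk}_{\Gamma,\|\cdot\|}[\widehat{x}_{H_*}|\mathcal{X}] := \max_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}_{H_*}(Ax + \xi)\|\} \le \mathrm{Opt}_\#.$$

Furthermore, the estimate $\widehat{x}_{H_*}(\cdot)$ is near-optimal:

$$\mathrm{Opt}_\# \le 64\sqrt{(3\ln K + 15\ln 2)(3\ln L + 15\ln 2)}\ \mathrm{RiskOpt}, \tag{4.176}$$

where RiskOpt is the minimax optimal risk

$$\mathrm{RiskOpt} = \inf_{\widehat{x}} \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}(Ax + \xi)\|\},$$

the infimum being taken w.r.t. all estimates.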

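Proposition 4.50 ⇒ Proposition 4.5: Clearly, the situation considered in Proposition 4.5 is a particular case of the setting of Proposition 4.50, namely, the case where $\mathcal{B}_*$ is the standard Euclidean ball, $\mathcal{B}_* = \{u \in \mathbf{R}^\nu : u^Tu \le 1\}$. In this case, problem (4.175) reads

$$\begin{aligned}
\mathrm{Opt}_\# &= \min_{\Theta,H,\lambda,\mu,\mu'} \left\{ \phi_{\mathcal{T}}(\lambda) + \mu + \mu' + \mathrm{Tr}(\Gamma\Theta) :\ \begin{array}{l} \lambda \ge 0,\ \mu \ge 0,\ \mu' \ge 0, \\ \begin{bmatrix} \sum_k \lambda_kR_k & \frac{1}{2}[B - H^TA]^T \\ \frac{1}{2}[B - H^TA] & \mu I_\nu \end{bmatrix} \succeq 0, \\ \begin{bmatrix} \Theta & \frac{1}{2}H \\ \frac{1}{2}H^T & \mu' I_\nu \end{bmatrix} \succeq 0 \end{array} \right\} \\
&= \min_{\Theta,H,\lambda,\mu,\mu'} \left\{ \phi_{\mathcal{T}}(\lambda) + \mu + \mu' + \mathrm{Tr}(\Gamma\Theta) :\ \begin{array}{l} \lambda \ge 0,\ \mu \ge 0,\ \mu' \ge 0, \\ \mu\left[\sum_k \lambda_kR_k\right] \succeq \frac{1}{4}[B - H^TA]^T[B - H^TA], \\ \mu'\Theta \succeq \frac{1}{4}HH^T \end{array} \right\} \quad \text{[Schur Complement Lemma]} \\
&= \min_{\chi,H} \left\{ \sqrt{\phi_{\mathcal{T}}(\chi)} + \sqrt{\mathrm{Tr}(H^T\Gamma H)} :\ \chi \ge 0,\ \begin{bmatrix} \sum_k \chi_kR_k & [B - H^TA]^T \\ [B - H^TA] & I_\nu \end{bmatrix} \succeq 0 \right\}
\end{aligned}$$

[by eliminating $\mu$, $\mu'$; note that $\phi_{\mathcal{T}}(\cdot)$ is positively homogeneous of degree 1].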
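The passage from the linear matrix inequalities to the quadratic matrix inequalities above is the Schur Complement Lemma. In the scalar case, where every block is a number, it says that the 2 × 2 matrix with diagonal entries s and μ > 0 and off-diagonal entry c/2 is positive semidefinite if and only if μs ≥ c²/4. The sketch below compares the two conditions on a few hand-picked triples; the function names, the tolerance, and the test values are illustrative assumptions.

```python
import math

def min_eig_2x2(x, y, z):
    """Smallest eigenvalue of the symmetric matrix [[x, y], [y, z]]."""
    return (x + z) / 2.0 - math.hypot((x - z) / 2.0, y)

def lmi_holds(s, c, mu, tol=1e-12):
    """Scalar instance of the LMI [[s, c/2], [c/2, mu]] >= 0."""
    return min_eig_2x2(s, c / 2.0, mu) >= -tol

def schur_holds(s, c, mu, tol=1e-12):
    """Schur complement form of the same condition (mu > 0): mu*s >= c^2/4."""
    return mu > 0 and mu * s >= c * c / 4.0 - tol

# (s, c, mu) triples: two clearly feasible, two clearly infeasible.
cases = [(2.0, 1.0, 1.0), (0.5, -0.4, 0.3), (0.1, 2.0, 0.5), (1.0, -3.0, 1.0)]
for s, c, mu in cases:
    print((s, c, mu), lmi_holds(s, c, mu), schur_holds(s, c, mu))
```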

Comparing the resulting representation of $\mathrm{Opt}_\#$ with (4.12), we see that the upper bound $\sqrt{\mathrm{Opt}}$ on the risk of the linear estimate $\widehat{x}_{H_*}$ appearing in (4.15) is $\le \mathrm{Opt}_\#$. Combining this observation with (4.176) and the evident relation

$$\mathrm{RiskOpt} = \inf_{\widehat{x}} \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}(Ax + \xi)\|_2\} \le \sqrt{\inf_{\widehat{x}} \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,\Gamma)}\{\|Bx - \widehat{x}(Ax + \xi)\|_2^2\}} = \mathrm{Risk}_{\mathrm{opt}}$$

(recall that we are in the case of $\|\cdot\| = \|\cdot\|_2$), we arrive at (4.15) and thus justify Proposition 4.5. ✷

Proof of Proposition 4.50. It is immediately seen that problem (4.175) is nothing but problem (4.42) in the case when the spectratopes $\mathcal{X}$, $\mathcal{B}_*$ and the set $\Pi$ participating in Proposition 4.14 are, respectively, the ellitopes given by (4.174), and the singleton $\{\Gamma\}$. Thus, Proposition 4.50 is, essentially, a particular case of Proposition 4.16. The only refinement in Proposition 4.50 as compared to Proposition 4.16 is the form of the logarithmic "nonoptimality" factor in (4.176); a similar factor in Proposition 4.16 is expressed in terms of spectratopic sizes $D$, $F$ of $\mathcal{X}$ and $\mathcal{B}_*$ (the total ranks of matrices $R_k$, $k \le K$, and $S_\ell$, $\ell \le L$, in the case of (4.174)), while in (4.176) the nonoptimality factor is expressed in terms of ellitopic sizes $K$, $L$ of $\mathcal{X}$ and $\mathcal{B}_*$. Strictly speaking, to arrive at this refinement (a slight one, since the sizes in question are under logs), we were supposed to reproduce, with minimal modifications, the reasoning of items 2°–7° of Section 4.8.5.1, with $\Gamma$ in the role of $Q_*$, and slightly refine Lemma 4.17 underlying this reasoning. Instead of carrying out this plan literally, we detail "local modifications" to be made in the proof of Proposition 4.16 in order to prove Proposition 4.50. Here are these modifications:

A. The collections of matrices $\Lambda = \{\Lambda_k \succeq 0, k \le K\}$, $\Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\}$ should be substituted by collections of nonnegative reals $\lambda \in \mathbf{R}^K_+$ or $\mu \in \mathbf{R}^L_+$, and vectors $\lambda[\Lambda]$, $\lambda[\Upsilon]$ with vectors $\lambda$ or $\mu$. Expressions like $\mathcal{R}_k[W]$, $\mathcal{R}_k^*[\Lambda_k]$, and $\mathcal{S}_\ell^*[\Upsilon_\ell]$ should be replaced, respectively, with $\mathrm{Tr}(R_kW)$, $\lambda_kR_k$, and $\mu_\ell S_\ell$. Finally, $Q_*$ should be replaced with $\Gamma$, and scalar matrices, like $t_kI_{d_k}$, should be replaced with the corresponding reals, like $t_k$.

B. The role of Lemma 4.17 is now played by

Lemma 4.51. Let $Y$ be an $N \times \nu$ matrix, let $\|\cdot\|$ be a norm on $\mathbf{R}^\nu$ such that the unit ball $\mathcal{B}_*$ of the conjugate norm is the ellitope

$$\mathcal{B}_* = \{u \in \mathbf{R}^\nu : \exists (r \in \mathcal{R}, y) : u = My,\ y^TS_\ell y \le r_\ell,\ \ell \le L\}, \tag{4.174}$$

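and let $\zeta \sim \mathcal{N}(0, Q)$ for some positive semidefinite $N \times N$ matrix $Q$. Then the best upper bound on $\psi_Q(Y) := \mathbf{E}\{\|Y^T\zeta\|\}$ yielded by Lemma 4.11, that is, the optimal value $\mathrm{Opt}[Q]$ in the convex optimization problem (cf. (4.40))

$$\mathrm{Opt}[Q] = \min_{\Theta,\mu} \left\{ \phi_{\mathcal{R}}(\mu) + \mathrm{Tr}(Q\Theta) :\ \mu \ge 0,\ \begin{bmatrix} \Theta & \frac{1}{2}YM \\ \frac{1}{2}M^TY^T & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 \right\}$$

satisfies for all $Q \succeq 0$ the identity

$$\mathrm{Opt}[Q] = \overline{\mathrm{Opt}}[Q] := \min_{G,\mu} \left\{ \phi_{\mathcal{R}}(\mu) + \mathrm{Tr}(G) :\ \mu \ge 0,\ \begin{bmatrix} G & \frac{1}{2}Q^{1/2}YM \\ \frac{1}{2}M^TY^TQ^{1/2} & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 \right\}, \tag{4.177}$$

and is a tight bound on $\psi_Q(Y)$. Namely,

$$\psi_Q(Y) \le \mathrm{Opt}[Q] \le 22\sqrt{3\ln L + 15\ln 2}\ \psi_Q(Y),$$

where $L$ is the size of the ellitope $\mathcal{B}_*$; see (4.174). Furthermore, for all $\kappa \ge 1$ one has

$$\mathrm{Prob}_\zeta\left\{ \|Y^T\zeta\| \ge \frac{\mathrm{Opt}[Q]}{4\kappa} \right\} \ge \beta_\kappa := 1 - \frac{e^{3/8}}{2} - 2Le^{-\kappa^2/3}. \tag{4.178}$$

In particular, when selecting $\kappa = \sqrt{3\ln L + 15\ln 2}$, we obtain

$$\mathrm{Prob}_\zeta\left\{ \|Y^T\zeta\| \ge \frac{\mathrm{Opt}[Q]}{4\sqrt{3\ln L + 15\ln 2}} \right\} \ge \beta_\kappa = 0.2100 > \frac{3}{16}.$$

Proof of Lemma 4.51 follows the lines of the proof of Lemma 4.17, with Lemma 4.47 substituting Lemma 4.46.

1°. Relation (4.177) can be verified exactly in the same fashion as in the case of Lemma 4.17.

2°. Let us set $\zeta = Q^{1/2}\eta$ with $\eta \sim \mathcal{N}(0, I_N)$ and $Z = Q^{1/2}Y$. Observe that to prove (4.178) is the same as to show that when $\kappa \ge 1$ one has

$$\bar{\delta} = \frac{\overline{\mathrm{Opt}}[Q]}{4\kappa} \ \Rightarrow\ \mathrm{Prob}_\eta\{\|Z^T\eta\| \ge \bar{\delta}\} \ge \beta_\kappa := 1 - \frac{e^{3/8}}{2} - 2Le^{-\kappa^2/3}, \tag{4.179}$$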

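where

$$[\mathrm{Opt}[Q] =]\ \ \overline{\mathrm{Opt}}[Q] := \min_{\Theta,\mu} \left\{ \phi_{\mathcal{R}}(\mu) + \mathrm{Tr}(\Theta) :\ \mu \ge 0,\ \begin{bmatrix} \Theta & \frac{1}{2}ZM \\ \frac{1}{2}M^TZ^T & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 \right\}. \tag{4.180}$$

Justification of (4.179) goes as follows.

2.1°. Let us represent $\mathrm{Opt}[Q]$ as the optimal value of a conic problem. Setting

$$\mathbf{K} = \mathrm{cl}\{[r; s] : s > 0,\ r/s \in \mathcal{R}\},$$

we ensure that

$$\mathcal{R} = \{r : [r; 1] \in \mathbf{K}\}, \qquad \mathbf{K}_* = \{[g; s] : s \ge \phi_{\mathcal{R}}(-g)\},$$

where $\mathbf{K}_*$ is the cone dual to $\mathbf{K}$. Consequently, (4.180) reads

$$\mathrm{Opt}[Q] = \min_{\Theta,\mu,\theta} \left\{ \theta + \mathrm{Tr}(\Theta) :\ \begin{array}{ll} \mu \ge 0 & (a) \\ \begin{bmatrix} \Theta & \frac{1}{2}ZM \\ \frac{1}{2}M^TZ^T & \sum_\ell \mu_\ell S_\ell \end{bmatrix} \succeq 0 & (b) \\ {[-\mu; \theta]} \in \mathbf{K}_* & (c) \end{array} \right\}. \tag{$P_E$}$$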

2.2°. Now let us prove that there exist a matrix $W \in \mathbf{S}^q_+$ and $r \in \mathcal{R}$ such that

$$\mathrm{Tr}(WS_\ell) \le r_\ell,\ \ell \le L, \tag{4.181}$$

and

$$\mathrm{Opt}[Q] \le \sum_i \sigma_i(ZMW^{1/2}), \tag{4.182}$$

where $\sigma_1(\cdot) \ge \sigma_2(\cdot) \ge \ldots$ are singular values.

To get the announced result, let us pass from problem $(P_E)$ to its conic dual. $(P_E)$ clearly is strictly feasible and bounded, so that the problem $(D_E)$ dual to $(P_E)$ is solvable with optimal value $\mathrm{Opt}[Q]$. Denoting by $\lambda_\ell \ge 0$, $\ell \le L$, $\begin{bmatrix} G & -R \\ -R^T & W \end{bmatrix} \succeq 0$, $[r; \tau] \in \mathbf{K}$ the Lagrange multipliers for the respective constraints in $(P_E)$, and aggregating these constraints, the multipliers being the aggregation weights, we arrive at the aggregated constraint:

$$\mathrm{Tr}(\Theta G) + \mathrm{Tr}\Big(W\sum_\ell \mu_\ell S_\ell\Big) + \sum_\ell \lambda_\ell\mu_\ell - \sum_\ell r_\ell\mu_\ell + \theta\tau \ge \mathrm{Tr}(ZMR^T).$$

To get the dual problem, we impose on the Lagrange multipliers, in addition to the initial constraints, the restriction that the left-hand side in the aggregated constraint is equal to the objective of $(P_E)$, identically in $\Theta$, $\mu_\ell$, and $\theta$, that is,

$$G = I,\quad \mathrm{Tr}(WS_\ell) + \lambda_\ell - r_\ell = 0,\ 1 \le \ell \le L,\quad \tau = 1,$$

and maximize the right-hand side of the aggregated constraint. After immediate simplifications, we arrive at

$$\mathrm{Opt}[Q] = \max_{W,R,r} \left\{ \mathrm{Tr}(ZMR^T) :\ W \succeq R^TR,\ r \in \mathcal{R},\ \mathrm{Tr}(WS_\ell) \le r_\ell,\ 1 \le \ell \le L \right\}$$

(note that $r \in \mathcal{R}$ is equivalent to $[r; 1] \in \mathbf{K}$, and $W \succeq R^TR$ is the same as $\begin{bmatrix} I & -R \\ -R^T & W \end{bmatrix} \succeq 0$). Exactly as in the proof of Lemma 4.17, the above representation of $\mathrm{Opt}[Q]$ implies that

$$\mathrm{Opt}[Q] = \max_{W,r} \left\{ \|ZMW^{1/2}\|_{\mathrm{Sh},1} :\ r \in \mathcal{R},\ W \succeq 0,\ \mathrm{Tr}(WS_\ell) \le r_\ell,\ \ell \le L \right\}.$$

The resulting problem clearly is solvable, and its optimal solution $W$ ensures the target relations (4.181) and (4.182).

2.3°. Given $W$ satisfying (4.181) and (4.182), we proceed exactly as in item 2.3° of the proof of Lemma 4.17, thus arriving at three random vectors $(\chi, \upsilon, \eta)$ with marginal distributions $\mathcal{N}(0, I_q)$, $\mathcal{N}(0, I_q)$, and $\mathcal{N}(0, I_N)$, respectively, such that

$$\chi^TW^{1/2}M^TZ^T\eta = \sum_{i=1}^{p} \sigma_i\upsilon_i^2, \tag{4.183}$$

where $p = \min[q, N]$ and $\sigma_i = \sigma_i(ZMW^{1/2})$. As in item 3°.i of the proof of Lemma 4.17, we have

(i)
$$\alpha := \mathrm{Prob}\left\{ \sum_{i=1}^{p} \sigma_i\upsilon_i^2 < \frac{1}{4}\sum_{i=1}^{p} \sigma_i \right\} \le \frac{e^{3/8}}{2}\ [= 0.7275...]. \tag{4.184}$$

The role of item 3°.ii in the aforementioned proof is now played by

(ii) Whenever $\kappa \ge 1$, one has

$$\mathrm{Prob}\{\|MW^{1/2}\chi\|_* > \kappa\} \le 2L\exp\{-\kappa^2/3\}, \tag{4.185}$$

with $L$ as defined in (4.174).

Indeed, setting $\rho = 1/\kappa^2 \le 1$ and $\omega = \sqrt{\rho}\,W^{1/2}\chi$, we get $\omega \sim \mathcal{N}(0, \rho W)$. Let us apply Lemma 4.47 to $Q = \rho W$, with $\mathcal{R}$ in the role of $\mathcal{T}$, $L$ in the role of $K$, and $S_\ell$'s in the role of $R_k$'s. Denoting

$$\mathcal{Y} := \{y : \exists r \in \mathcal{R} : y^TS_\ell y \le r_\ell,\ \ell \le L\},$$

we have $\mathrm{Tr}(QS_\ell) = \rho\mathrm{Tr}(WS_\ell) \le \rho r_\ell$, $\ell \le L$, with $r \in \mathcal{R}$ (see (4.181)), so we are under the premise of Lemma 4.47 (with $\mathcal{Y}$ in the role of $\mathcal{X}$ and therefore with $L$ in the role of $K$). Applying the lemma, we conclude that

$$\mathrm{Prob}\left\{ \chi : \kappa^{-1}W^{1/2}\chi \notin \mathcal{Y} \right\} \le 2L\exp\{-1/(3\rho)\} = 2L\exp\{-\kappa^2/3\}.$$

Recalling that $\mathcal{B}_* = M\mathcal{Y}$, we see that $\mathrm{Prob}\{\chi : \kappa^{-1}MW^{1/2}\chi \notin \mathcal{B}_*\}$ is indeed upper-bounded by the right-hand side of (4.185), and (4.185) follows.

With (i) and (ii) at our disposal, we complete the proof of Lemma 4.51 in exactly the same way as in items 2.4° and 3° of the proof of Lemma 4.17. ✷

C. As a result of substituting Lemma 4.17 with Lemma 4.51, the counterpart of Lemma 4.49 used in item 4° of the proof of Proposition 4.16 now reads as follows:

Lemma 4.52. Let $W$ be a positive semidefinite $n \times n$ matrix, and $\kappa \ge 1$. Then for any estimate $\widehat{x}(\cdot)$ of $B\eta$ given observation $\omega = A\eta + \xi$ with $\eta \sim \mathcal{N}(0, W)$ and $\xi \sim \mathcal{N}(0, \Gamma)$ independent of each other, one has

$$\mathrm{Prob}_{\eta,\xi}\left\{ \|B\eta - \widehat{x}(\omega)\| \ge [8\kappa]^{-1} \inf_{H} [\Psi(H) + \Phi(W, H)] \right\} \ge \beta_\kappa = 1 - \frac{e^{3/8}}{2} - 2Le^{-\kappa^2/3},$$

where $\Psi(H)$ and $\Phi(W, H)$ are defined, respectively, by (4.156) (where $Q_*$ should be set to $\Gamma$) and (4.161). In particular, for

$$\kappa = \bar{\kappa} := \sqrt{3\ln L + 15\ln 2}$$

the latter probability is $> 3/16$.

D. We substitute the reference to Lemma 4.46 in item 7° of the proof with Lemma 4.47, resulting in replacing
• definition of $\delta(\kappa)$ in (4.168) with
$$\delta(\kappa) := \mathrm{Prob}_{\zeta \sim \mathcal{N}(0,I_n)}\{W_\kappa^{1/2}\zeta \notin \mathcal{X}\} \le 2Ke^{-\frac{1}{3\kappa}},$$
• definition (4.169) of $\bar{\kappa}$ with
$$\bar{\kappa} = \frac{1}{3\ln K + 15\ln 2},$$
• and, finally, definition (4.171) of $\varrho_*$ with
$$\varrho_* := \frac{\mathrm{Opt}}{8\sqrt{(3\ln L + 15\ln 2)(3\ln K + 15\ln 2)}}.$$

4.8.6 Proofs of Propositions 4.18 and 4.19, and justification of Remark 4.20

4.8.6.1 Proof of Proposition 4.18

The only claim of the proposition which is not an immediate consequence of Proposition 4.8 is that problem (4.64) is solvable; let us justify this claim. Let $F = \mathrm{Im}A$. Clearly, feasibility of a candidate solution $(H, \Lambda, \Upsilon)$ to the problem depends solely on the restriction of the linear mapping $z \mapsto H^Tz$ onto $F$, so that adding to the constraints of the problem the requirement that the restriction of this linear mapping on the orthogonal complement of $F$ in $\mathbf{R}^m$ is identically zero, we get an equivalent problem. It is immediately seen that in the resulting problem, the feasible solutions with the value of the objective $\le a$ for every $a \in \mathbf{R}$ form a compact set, so that the latter problem (and thus the original one) indeed is solvable. ✷

4.8.6.2 Proof of Proposition 4.19

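We are about to derive Proposition 4.19 from Proposition 4.16. Observe that in the situation of the latter Proposition, setting formally $\Pi = \{0\}$, problem (4.42) becomes problem (4.64), so that Proposition 4.19 looks like the special case $\Pi = \{0\}$ of Proposition 4.16. However, the premise of the latter proposition forbids specializing $\Pi$ as $\{0\}$: this would violate the regularity assumption R which is part of the premise. The difficulty, however, can be easily resolved. Assume w.l.o.g. that the image space of $A$ is the entire $\mathbf{R}^m$ (otherwise we could from the very beginning replace $\mathbf{R}^m$ with the image space of $A$), and let us pass from our current noiseless recovery problem of interest (!), see Section 4.5.1, to its "noisy modification," the differences with (!) being

• noisy observation $\omega = Ax + \sigma\xi$, $\sigma > 0$, $\xi \sim \mathcal{N}(0, I_m)$;
• risk quantification of a candidate estimate $\widehat{x}(\cdot)$ according to
$$\mathrm{Risk}^\sigma_{\|\cdot\|}[\widehat{x}(Ax + \sigma\xi)|\mathcal{X}] = \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,I_m)}\{\|Bx - \widehat{x}(Ax + \sigma\xi)\|\},$$
the corresponding minimax optimal risk being
$$\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}] = \inf_{\widehat{x}(\cdot)} \mathrm{Risk}^\sigma_{\|\cdot\|}[\widehat{x}(Ax + \sigma\xi)|\mathcal{X}].$$

Proposition 4.16 does apply to the modified problem: it suffices to specify $\Pi$ as $\{\sigma^2I_m\}$. According to this proposition, the quantity

$$\mathrm{Opt}[\sigma] = \min_{H,\Lambda,\Upsilon,\Upsilon',\Theta} \left\{ \phi_{\mathcal{T}}(\lambda[\Lambda]) + \phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{R}}(\lambda[\Upsilon']) + \sigma^2\mathrm{Tr}(\Theta) :\ \begin{array}{l} \Lambda = \{\Lambda_k \succeq 0, k \le K\},\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ \Upsilon' = \{\Upsilon'_\ell \succeq 0, \ell \le L\}, \\ \begin{bmatrix} \sum_k \mathcal{R}_k^*[\Lambda_k] & \frac{1}{2}[B^T - A^TH]M \\ \frac{1}{2}M^T[B - H^TA] & \sum_\ell \mathcal{S}_\ell^*[\Upsilon_\ell] \end{bmatrix} \succeq 0, \\ \begin{bmatrix} \Theta & \frac{1}{2}HM \\ \frac{1}{2}M^TH^T & \sum_\ell \mathcal{S}_\ell^*[\Upsilon'_\ell] \end{bmatrix} \succeq 0 \end{array} \right\}$$

satisfies the relation

$$\mathrm{Opt}[\sigma] \le O(1)\ln(D)\,\mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}] \tag{4.186}$$

with $D$ defined in (4.65). Looking at problem (4.64) we immediately conclude that $\mathrm{Opt}_\# \le \mathrm{Opt}[\sigma]$. Thus, all we need in order to extract the target relation (4.65) from (4.186) is to prove that the minimax optimal risk $\mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ defined in Proposition 4.19 satisfies the relation

$$\liminf_{\sigma \to +0} \mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}] \le \mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]. \tag{4.187}$$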

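To prove this relation, let us fix $r > \mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$, so that for some Borel estimate $\widehat{x}(\cdot)$ it holds

$$\sup_{x \in \mathcal{X}} \|Bx - \widehat{x}(Ax)\| < r. \tag{4.188}$$

Were we able to ensure that $\widehat{x}(\cdot)$ is bounded and continuous, we would be done, since in this case, due to compactness of $\mathcal{X}$, it clearly holds

$$\liminf_{\sigma \to +0} \mathrm{RiskOpt}^\sigma_{\|\cdot\|}[\mathcal{X}] \le \liminf_{\sigma \to +0} \sup_{x \in \mathcal{X}} \mathbf{E}_{\xi \sim \mathcal{N}(0,I_m)}\{\|Bx - \widehat{x}(Ax + \sigma\xi)\|\} \le \sup_{x \in \mathcal{X}} \|Bx - \widehat{x}(Ax)\| < r,$$

and since $r > \mathrm{Risk}_{\mathrm{opt}}[\mathcal{X}]$ is arbitrary, (4.187) would follow. Thus, all we need to do is to verify that given a Borel estimate $\widehat{x}(\cdot)$ satisfying (4.188), we can update it into a bounded and continuous estimate satisfying the same relation. Verification is as follows:

1. Setting $\beta = \max_{x \in \mathcal{X}} \|Bx\|$ and replacing estimate $\widehat{x}$ with its truncation

$$\widetilde{x}(\omega) = \begin{cases} \widehat{x}(\omega), & \|\widehat{x}(\omega)\| \le 2\beta, \\ 0, & \text{otherwise}, \end{cases}$$

for any $x \in \mathcal{X}$ we only reduce the norm of the recovery error. At the same time, $\widetilde{x}$ is Borel and bounded. Thus, we lose nothing when assuming in the rest of the proof that $\widehat{x}(\cdot)$ is Borel and bounded.

2. For $\epsilon > 0$, let $\widehat{x}_\epsilon(\omega) = (1 + \epsilon)\widehat{x}(\omega/(1 + \epsilon))$ and let $\mathcal{X}_\epsilon = (1 + \epsilon)\mathcal{X}$. Observe that

$$\begin{aligned}
\sup_{x \in \mathcal{X}_\epsilon} \|Bx - \widehat{x}_\epsilon(Ax)\| &= \sup_{y \in \mathcal{X}} \|B[1 + \epsilon]y - \widehat{x}_\epsilon(A[1 + \epsilon]y)\| \\
&= \sup_{y \in \mathcal{X}} \|B[1 + \epsilon]y - [1 + \epsilon]\widehat{x}(Ay)\| = [1 + \epsilon]\sup_{y \in \mathcal{X}} \|By - \widehat{x}(Ay)\|,
\end{aligned}$$

implying, in view of (4.188), that for small enough positive $\epsilon$ we have

$$\bar{r} := \sup_{x \in \mathcal{X}_\epsilon} \|Bx - \widehat{x}_\epsilon(Ax)\| < r. \tag{4.189}$$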

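3. Finally, let $A^\dagger$ be the pseudoinverse of $A$, so that $AA^\dagger z = z$ for every $z \in \mathbf{R}^m$ (recall that the image space of $A$ is the entire $\mathbf{R}^m$). Given $\rho > 0$, let $\theta_\rho(\cdot)$ be a nonnegative smooth function on $\mathbf{R}^m$ with integral 1 such that $\theta_\rho$ vanishes outside of the ball of radius $\rho$ centered at the origin, and let

$$\widehat{x}_{\epsilon,\rho}(\omega) = \int_{\mathbf{R}^m} \widehat{x}_\epsilon(\omega - z)\theta_\rho(z)\,dz$$

be the convolution of $\widehat{x}_\epsilon$ and $\theta_\rho$. Since $\widehat{x}_\epsilon(\cdot)$ is Borel and bounded, this convolution is a well-defined smooth function on $\mathbf{R}^m$. Because $\mathcal{X}$ contains a neighbourhood of the origin, for all small enough $\rho > 0$, all $z$ from the support of $\theta_\rho$ and all $x \in \mathcal{X}$ the point $x - A^\dagger z$ belongs to $\mathcal{X}_\epsilon$. For such $\rho$ and any $x \in \mathcal{X}$ we have

$$\begin{aligned}
\|Bx - \widehat{x}_\epsilon(Ax - z)\| &= \|Bx - \widehat{x}_\epsilon(A[x - A^\dagger z])\| \\
&\le \|BA^\dagger z\| + \|B[x - A^\dagger z] - \widehat{x}_\epsilon(A[x - A^\dagger z])\| \\
&\le C\rho + \bar{r}
\end{aligned}$$

with properly selected constant $C$ independent of $\rho$ (we have used (4.189); note that for our $\rho$ and $x$ we have $x - A^\dagger z \in \mathcal{X}_\epsilon$). We conclude that for properly selected $r' < r$, $\rho > 0$ and all $x \in \mathcal{X}$ we have

$$\|Bx - \widehat{x}_\epsilon(Ax - z)\| \le r' \quad \forall (z \in \mathrm{supp}\,\theta_\rho),$$

implying, by construction of $\widehat{x}_{\epsilon,\rho}$, that

$$\forall (x \in \mathcal{X}) : \|Bx - \widehat{x}_{\epsilon,\rho}(Ax)\| \le r' < r.$$

The resulting estimate $\widehat{x}_{\epsilon,\rho}$ is the continuous and bounded estimate satisfying (4.188) we were looking for. ✷

4.8.6.3 Justification of Remark 4.20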

Justification of Remark 4.20 is given by repeating word by word the proof of Proposition 4.19, with Proposition 4.50 in the role of Proposition 4.16.

Chapter Five

Signal Recovery Beyond Linear Estimates

OVERVIEW

In this chapter, as in Chapter 4, we focus on signal recovery. In contrast to the previous chapter, on our agenda now are

• a special kind of nonlinear estimation, the polyhedral estimate (Section 5.1), an alternative to linear estimates which were our subject in Chapter 4. We demonstrate that as applied to the same estimation problem as in Chapter 4 (recovery of an unknown signal via noisy observation of a linear image of the signal), polyhedral estimation possesses the same attractive properties as linear estimation, that is, efficient computability and near-optimality, provided the signal set is an ellitope/spectratope. Besides this, we show that properly built polyhedral estimates are near-optimal in several special cases where linear estimates could be heavily suboptimal.
• recovering signals from noisy observations of nonlinear images of the signal. Specifically, we consider signal recovery in generalized linear models, where the expected value of an observation is a known nonlinear transformation of the signal we want to recover, in contrast to observation model (4.1) where this expectation is linear in the signal.

5.1 POLYHEDRAL ESTIMATION

5.1.1 Motivation

The estimation problem we were considering so far is as follows:

We want to recover the image $Bx \in \mathbf{R}^\nu$ of unknown signal $x$ known to belong to signal set $\mathcal{X} \subset \mathbf{R}^n$ from a noisy observation
$$\omega = Ax + \xi_x \in \mathbf{R}^m,$$
where $\xi_x$ is observation noise (index $x$ in $\xi_x$ indicates that the distribution $P_x$ of the observation noise may depend on $x$). Here $\mathcal{X}$ is a given nonempty convex compact set, and $A$ and $B$ are given $m \times n$ and $\nu \times n$ matrices; in addition, we are given a norm $\|\cdot\|$ on $\mathbf{R}^\nu$ in which the recovery error is measured.

We have seen that if $\mathcal{X}$ is an ellitope/spectratope then, under reasonable assumptions on observation noise and $\|\cdot\|$, an appropriate efficiently computable estimate linear in $\omega$ is near-optimal. Note that the ellitopic/spectratopic structure of $\mathcal{X}$ is crucial here. What follows is motivated by the desire to build an alternative estimation scheme which works beyond the ellitopic/spectratopic case, where linear estimates can become "heavily nonoptimal."

Motivating example. Consider the simple-looking problem of recovering $Bx = x$ in the $\|\cdot\|_2$-norm from direct observations ($Ax = x$) corrupted by the standard Gaussian noise $\xi \sim \mathcal{N}(0, \sigma^2I)$, and let $\mathcal{X}$ be the unit $\|\cdot\|_1$-ball:

$$\mathcal{X} = \Big\{x \in \mathbf{R}^n : \sum_i |x_i| \le 1\Big\}.$$

In this situation, one can easily build the linear estimate $\widehat{x}_H(\omega) = H^T\omega$ that is optimal in terms of the worst-case, over $x \in \mathcal{X}$, expected squared risk:

$$\begin{aligned}
\mathrm{Risk}^2[\widehat{x}_H|\mathcal{X}] &:= \max_{x \in \mathcal{X}} \mathbf{E}\left\{\|\widehat{x}_H(\omega) - Bx\|_2^2\right\} = \max_{x \in \mathcal{X}} \left\{\|[I - H^T]x\|_2^2 + \sigma^2\mathrm{Tr}(HH^T)\right\} \\
&= \max_{i \le n} \|\mathrm{Col}_i[I - H^T]\|_2^2 + \sigma^2\mathrm{Tr}(HH^T).
\end{aligned}$$

Clearly, the optimal $H$ is just a scalar matrix $hI$, the optimal $h$ is the minimizer of the univariate quadratic function $(1 - h)^2 + \sigma^2nh^2$, and the best squared risk attainable with linear estimates is

$$R^2 = \min_h \left[(1 - h)^2 + \sigma^2nh^2\right] = \frac{n\sigma^2}{1 + n\sigma^2}.$$

On the other hand, consider a nonlinear estimate $\widehat{x}(\omega)$ as follows. Given observation $\omega$, specify $\widehat{x}(\omega)$ as an optimal solution to the optimization problem

$$\mathrm{Opt}(\omega) = \min_{y \in \mathcal{X}} \|y - \omega\|_\infty.$$

Note that for every $\rho > 0$ the probability that the true signal satisfies $\|x - \omega\|_\infty \le \rho\sigma$ ("event $\mathcal{E}$") is at least $1 - 2n\exp\{-\rho^2/2\}$, and if this event happens, then both $x$ and $\widehat{x}$ belong to the box $\{y : \|y - \omega\|_\infty \le \rho\sigma\}$, implying that $\|x - \widehat{x}\|_\infty \le 2\rho\sigma$. In addition, we always have $\|x - \widehat{x}\|_2 \le \|x - \widehat{x}\|_1 \le 2$, since $x \in \mathcal{X}$ and $\widehat{x} \in \mathcal{X}$. We therefore have

$$\|x - \widehat{x}\|_2 \le \sqrt{\|x - \widehat{x}\|_\infty\|x - \widehat{x}\|_1} \le \begin{cases} 2\sqrt{\rho\sigma}, & \omega \in \mathcal{E}, \\ 2, & \omega \notin \mathcal{E}, \end{cases}$$

whence

$$\mathbf{E}\left\{\|\widehat{x} - x\|_2^2\right\} \le 4\rho\sigma + 8n\exp\{-\rho^2/2\}. \tag{$*$}$$

Assuming $\sigma \le 2n\exp\{-1/2\}$ and specifying $\rho$ as $\sqrt{2\ln(2n/\sigma)}$, we get $\rho \ge 1$ and $2n\exp\{-\rho^2/2\} \le \sigma$, implying that the right hand side in $(*)$ is at most $8\rho\sigma$. In other words, for our nonlinear estimate $\widehat{x}(\omega)$ it holds

$$\mathrm{Risk}^2[\widehat{x}|\mathcal{X}] \le 8\sqrt{2\ln(2n/\sigma)}\,\sigma.$$

When $n\sigma^2$ is of order of 1, the latter bound on the squared risk is of order of $\sigma\sqrt{\ln(1/\sigma)}$, while the best squared risk achievable with linear estimates under the circumstances is of order of 1. We conclude that when $\sigma$ is small and $n$ is large (specifically, is of order of $1/\sigma^2$), the best linear estimate is by far inferior compared to our nonlinear estimate: the ratio of the corresponding squared risks is as large as $\frac{O(1)}{\sigma\sqrt{\ln(1/\sigma)}}$, the factor which is "by far" worse than the nonoptimality factor in the case of ellitope/spectratope $\mathcal{X}$.
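The gap between the two estimates is easy to tabulate from the formulas above. The sketch below evaluates the best linear squared risk nσ²/(1 + nσ²) and the bound 8ρσ with ρ = √(2 ln(2n/σ)) in the regime n = 1/σ². The function names and the chosen values of σ are illustrative assumptions, and the nonlinear quantity is only the upper bound derived above, not the exact risk.

```python
import math

def best_linear_sq_risk(n, sigma):
    """min over h of (1-h)^2 + sigma^2 * n * h^2, attained at h = 1/(1 + n*sigma^2)."""
    a = n * sigma ** 2
    return a / (1.0 + a)

def nonlinear_sq_risk_bound(n, sigma):
    """Upper bound 8*rho*sigma on the squared risk, with rho = sqrt(2 ln(2n/sigma))."""
    rho = math.sqrt(2.0 * math.log(2.0 * n / sigma))
    return 8.0 * rho * sigma

for sigma in (1e-2, 1e-3, 1e-4):
    n = round(1.0 / sigma ** 2)              # regime where n * sigma^2 is about 1
    print(sigma, n, best_linear_sq_risk(n, sigma), nonlinear_sq_risk_bound(n, sigma))
```

In this regime the linear squared risk stays near 1/2 while the bound for the nonlinear estimate decreases roughly in proportion to σ, up to the logarithmic factor.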

The construction of the nonlinear estimate $\widehat{x}$ which we have built¹ admits a natural extension yielding what we shall call the polyhedral estimate, and our present goal is to design and to analyse presumably good estimates of this type.

5.1.2 Generic polyhedral estimate

A generic polyhedral estimate is as follows:

Given the data $A \in \mathbf{R}^{m \times n}$, $B \in \mathbf{R}^{\nu \times n}$, $\mathcal{X} \subset \mathbf{R}^n$ of our recovery problem (where $\mathcal{X}$ is a computationally tractable convex compact set) and a "reliability tolerance" $\epsilon \in (0, 1)$, we specify somehow a positive integer $N$ along with $N$ linear forms $h_\ell^Tz$ on the space $\mathbf{R}^m$ where observations live. These forms define linear forms $g_\ell^Tx := h_\ell^TAx$ on the space of signals $\mathbf{R}^n$. Assuming that the observation noise $\xi_x$ is zero mean for every $x \in \mathcal{X}$, the "plug-in" estimates $h_\ell^T\omega$ are unbiased estimates of the forms $g_\ell^Tx$. Assume that vectors $h_\ell$ are selected in such a way that

$$\forall (x \in \mathcal{X}) : \mathrm{Prob}\{|h_\ell^T\xi_x| > 1\} \le \epsilon/N \quad \forall \ell. \tag{5.1}$$

In this situation, setting $H = [h_1, ..., h_N]$ (in the sequel, $H$ is referred to as contrast matrix), we can ensure that whatever be the signal $x \in \mathcal{X}$ underlying our observation $\omega = Ax + \xi_x$, the observable vector $H^T\omega$ satisfies the relation

$$\mathrm{Prob}\left\{\|H^T\omega - H^TAx\|_\infty > 1\right\} \le \epsilon. \tag{5.2}$$

With the polyhedral estimation scheme, we act as if all information about $x$ contained in our observation $\omega$ were represented by $H^T\omega$, and we estimate $Bx$ by $B\bar{x}$, where $\bar{x} = \bar{x}(\omega)$ is any vector from $\mathcal{X}$ compatible with this information, specifically, such that $\bar{x}$ solves the feasibility problem

$$\text{find } \bar{x} \in \mathcal{X} \text{ such that } \|H^T\omega - H^TA\bar{x}\|_\infty \le 1.$$

Note that this feasibility problem with positive probability can be unsolvable; all we know in this respect is that the latter probability is $\le \epsilon$, since by construction the true signal $x$ underlying observation $\omega$ is with probability at least $1 - \epsilon$ a feasible solution. In other words, such $\bar{x}$ is not always well-defined. To circumvent this difficulty, let us define $\bar{x}$ as

$$\bar{x} \in \mathop{\mathrm{Argmin}}_{u} \left\{\|H^T\omega - H^TAu\|_\infty : u \in \mathcal{X}\right\}, \tag{5.3}$$

so that $\bar{x}$ always is well-defined and belongs to $\mathcal{X}$, and estimate $Bx$ by $B\bar{x}$. Thus,

a polyhedral estimate is specified by an $m \times N$ contrast matrix $H = [h_1, ..., h_N]$ with columns $h_\ell$ satisfying (5.1) and is as follows: given observation $\omega$, we build $\bar{x} = \bar{x}(\omega) \in \mathcal{X}$ according to (5.3) and estimate $Bx$ by $\widehat{x}_H(\omega) = B\bar{x}(\omega)$.
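In general, (5.3) is a convex program that calls for a numerical solver. In the setting of the motivating example (A = B = I, the signal set is the unit ℓ1-ball, and the contrast matrix is a positive multiple of the identity), it reduces to minimizing the ℓ∞-distance from ω to the ℓ1-ball, which admits a closed-form solution by soft-thresholding. The sketch below implements this special case only. The soft-thresholding solution and the sorting-based search for the threshold are standard facts brought in here as assumptions rather than taken from the text, and the function name is hypothetical.

```python
import math

def linf_projection_onto_l1_ball(omega, radius=1.0):
    """Return (y, tau): one minimizer y of ||y - omega||_inf over
    {y : ||y||_1 <= radius} and the optimal value tau.
    Soft-thresholding omega at level tau gives the point of the box
    {y : ||y - omega||_inf <= tau} with the smallest l1-norm."""
    magnitudes = sorted((abs(w) for w in omega), reverse=True)
    if sum(magnitudes) <= radius:
        return list(omega), 0.0            # omega itself is feasible
    tau, running = 0.0, 0.0
    for k, a_k in enumerate(magnitudes, start=1):
        running += a_k
        candidate = (running - radius) / k
        if a_k - candidate > 0:            # the k largest entries stay nonzero
            tau = candidate
    y = [math.copysign(max(abs(w) - tau, 0.0), w) for w in omega]
    return y, tau

y, tau = linf_projection_onto_l1_ball([0.9, -0.6, 0.1])
print(y, tau)
```

The minimizer is in general not unique; the code returns the one with the smallest ℓ1-norm among all points at ℓ∞-distance tau from ω.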

The rationale behind the polyhedral estimation scheme is the desire to reduce complex estimating problems to those of estimating linear forms. To the best of our knowledge, this approach was first used in [192] (see also [185, Chapter 2]) in connection with recovering from direct observations (restrictions on regular grids of) multivariate functions from Sobolev balls. Recently, the ideas underlying the results of [192] have been taken up in the MIND estimator of [109], then applied to multiple testing in [203]. What follows is based on [139].

¹ In fact, this estimate is nearly optimal under the circumstances in a meaningful range of values of $n$ and $\sigma$.

$(\epsilon, \|\cdot\|)$-risk. Given a desired "reliability tolerance" $\epsilon \in (0, 1)$, it is convenient to quantify the performance of the polyhedral estimate by its $(\epsilon, \|\cdot\|)$-risk

$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}(\cdot)|\mathcal{X}] = \inf\left\{\rho : \mathrm{Prob}\{\|Bx - \widehat{x}(Ax + \xi_x)\| > \rho\} \le \epsilon\ \ \forall x \in \mathcal{X}\right\}, \tag{5.4}$$

that is, the worst, over $x \in \mathcal{X}$, size of the "$(1 - \epsilon)$-reliable $\|\cdot\|$-confidence interval" associated with the estimate $\widehat{x}(\cdot)$. An immediate observation is as follows:

Proposition 5.1. In the situation in question, denoting by $\mathcal{X}_s = \frac{1}{2}(\mathcal{X} - \mathcal{X})$ the symmetrization of $\mathcal{X}$, given a contrast matrix $H = [h_1, ..., h_N]$ with columns satisfying (5.1), the quantity

$$R[H] = \max_z \left\{\|Bz\| : \|H^TAz\|_\infty \le 2,\ z \in 2\mathcal{X}_s\right\} \tag{5.5}$$

is an upper bound on the $(\epsilon, \|\cdot\|)$-risk of the polyhedral estimate $\widehat{x}_H(\cdot)$:

$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}_H|\mathcal{X}] \le R[H].$$

Proof is immediate. Let us fix $x \in \mathcal{X}$, and let $\mathcal{E}$ be the set of all realizations of $\xi_x$ such that $\|H^T\xi_x\|_\infty \le 1$, so that $P_x(\mathcal{E}) \ge 1 - \epsilon$ by (5.2). Let us fix a realization $\xi \in \mathcal{E}$ of the observation noise, and let $\omega = Ax + \xi$, $\bar{x} = \bar{x}(Ax + \xi)$. Then $u = x$ is a feasible solution to the optimization problem (5.3) with the value of the objective $\le 1$, implying that the value of this objective at the optimal solution $\bar{x}$ to the problem is $\le 1$ as well, so that $\|H^TA[x - \bar{x}]\|_\infty \le 2$. Besides this, $z = x - \bar{x} \in 2\mathcal{X}_s$. We see that $z$ is a feasible solution to (5.5), whence $\|B[x - \bar{x}]\| = \|Bx - \widehat{x}_H(\omega)\| \le R[H]$. It remains to note that the latter relation holds true whenever $\omega = Ax + \xi$ with $\xi \in \mathcal{E}$, and the $P_x$-probability of the latter inclusion is at least $1 - \epsilon$, whatever be $x \in \mathcal{X}$. ✷

What is ahead. In what follows our focus will be on the following questions pertinent to the design of polyhedral estimates:

1. Given the data of our estimation problem and a tolerance $\delta \in (0, 1)$, how to find a set $\mathcal{H}_\delta$ of vectors $h \in \mathbf{R}^m$ satisfying the relation

$$\forall (x \in \mathcal{X}) : \mathrm{Prob}\left\{|h^T\xi_x| > 1\right\} \le \delta. \tag{5.6}$$

With our approach, after the number N of columns in a contrast matrix has been selected, we choose the columns of H from Hδ , with δ = ǫ/N , ǫ being a given reliability tolerance of the estimate we are designing. Thus, the problem of constructing sets Hδ arises, the larger Hδ , the better. 2. The upper bound R[H] on the (ǫ, k · k)-risk of the polyhedral estimate x bH is, in general, difficult to compute—this is the maximum of a convex function over a computationally tractable convex set. Thus, similarly to the case of linear

390

CHAPTER 5

estimates, we need techniques for computationally efficient upper bounding of R[·]. 3. With “raw materials” (sets Hδ ) and efficiently computable upper bounds on the risk of candidate polyhedral estimates at our disposal, how do we design the best in terms of (the upper bound on) its risk polyhedral estimate? We are about to consider these questions one by one. 5.1.3

Specifying sets Hδ for basic observation schemes

To specify reasonable sets $\mathcal H_\delta$ we need to make some assumptions on the distributions of observation noises we want to handle. In the sequel we restrict ourselves to three special cases as follows:

• Sub-Gaussian case: For every $x\in\mathcal X$, the observation noise $\xi_x$ is sub-Gaussian with parameters $(0,\sigma^2I_m)$, where $\sigma>0$, i.e., $\xi_x\sim\mathcal{SG}(0,\sigma^2I_m)$.

• Discrete case: $\mathcal X$ is a convex compact subset of the probabilistic simplex $\Delta_n=\{x\in\mathbf R^n:\ x\ge0,\ \sum_ix_i=1\}$, $A$ is a column-stochastic matrix, and

$$\omega=\frac1K\sum_{k=1}^K\zeta_k$$

with random vectors $\zeta_k$ independent across $k\le K$, $\zeta_k$ taking value $e_i$ with probability $[Ax]_i$, $i=1,...,m$, $e_i$ being the basic orths in $\mathbf R^m$.

• Poisson case: $\mathcal X$ is a convex compact subset of the nonnegative orthant $\mathbf R^n_+$, $A$ is entrywise nonnegative, and the observation $\omega$ stemming from $x\in\mathcal X$ is a random vector with entries $\omega_i\sim\mathrm{Poisson}([Ax]_i)$ independent across $i$.

The associated sets $\mathcal H_\delta$ can be built as follows.

5.1.3.1 Sub-Gaussian case

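When $h\in\mathbf R^m$ is deterministic and $\xi$ is sub-Gaussian with parameters $0,\sigma^2I_m$, we have

$$\mathrm{Prob}\{|h^T\xi|>1\}\le2\exp\left\{-\frac{1}{2\sigma^2\|h\|_2^2}\right\}.$$

Indeed, when $h\ne0$ and $\gamma>0$, we have

$$\mathrm{Prob}\{h^T\xi>1\}\le\exp\{-\gamma\}\,\mathbf E\left\{\exp\{\gamma h^T\xi\}\right\}\le\exp\left\{\tfrac12\sigma^2\gamma^2\|h\|_2^2-\gamma\right\}.$$

Minimizing the resulting bound in $\gamma>0$ (the minimizer is $\gamma=1/(\sigma^2\|h\|_2^2)$), we get $\mathrm{Prob}\{h^T\xi>1\}\le\exp\left\{-\frac{1}{2\|h\|_2^2\sigma^2}\right\}$; the same reasoning as applied to $-h$ in the role of $h$ results in $\mathrm{Prob}\{h^T\xi<-1\}\le\exp\left\{-\frac{1}{2\|h\|_2^2\sigma^2}\right\}$. Consequently

$$\pi_G(h):=\underbrace{\sigma\sqrt{2\ln(2/\delta)}}_{\vartheta_G}\,\|h\|_2\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi|>1\}\le\delta,$$

and we can set

$$\mathcal H_\delta=\mathcal H_\delta^G:=\{h:\ \pi_G(h)\le1\}.$$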
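The implication $\pi_G(h)\le1\Rightarrow\mathrm{Prob}\{|h^T\xi|>1\}\le\delta$ can be illustrated numerically. The sketch below is only an illustration under an extra assumption not made in the text: the noise is exactly Gaussian, $\xi\sim\mathcal N(0,\sigma^2I_m)$ (a particular sub-Gaussian distribution), so that the tail probability has a closed form. The function names and the numbers `sigma`, `delta`, `h_raw` are hypothetical.

```python
import math

def theta_G(sigma, delta):
    # theta_G = sigma * sqrt(2 ln(2/delta)), the factor in the definition of pi_G
    return sigma * math.sqrt(2.0 * math.log(2.0 / delta))

def pi_G(h, sigma, delta):
    # pi_G(h) = theta_G * ||h||_2
    return theta_G(sigma, delta) * math.sqrt(sum(v * v for v in h))

def gaussian_tail(h, sigma):
    # Exact Prob{|h^T xi| > 1} when xi ~ N(0, sigma^2 I): h^T xi ~ N(0, sigma^2 ||h||_2^2)
    s = sigma * math.sqrt(sum(v * v for v in h))
    return math.erfc(1.0 / (s * math.sqrt(2.0)))

sigma, delta = 0.5, 0.01          # illustrative values
h_raw = [3.0, -4.0]               # arbitrary direction
scale = pi_G(h_raw, sigma, delta)
h = [v / scale for v in h_raw]    # rescaled so that pi_G(h) = 1
tail = gaussian_tail(h, sigma)    # should not exceed delta
```

For genuinely sub-Gaussian (non-Gaussian) noise the exact tail is not available in this form; the bound of the text still applies.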

5.1.3.2 Discrete case

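Given $x\in\mathcal X$, setting $\mu=Ax$ and $\eta_k=\zeta_k-\mu$, we get

$$\omega=Ax+\underbrace{\frac1K\sum_{k=1}^K\eta_k}_{\xi_x}.$$

Given $h\in\mathbf R^m$,

$$h^T\xi_x=\frac1K\sum_k\underbrace{h^T\eta_k}_{\chi_k}.$$

Random variables $\chi_1,...,\chi_K$ are independent zero mean and clearly satisfy

$$\mathbf E\{\chi_k^2\}\le\sum_i[Ax]_ih_i^2,\qquad|\chi_k|\le2\|h\|_\infty\ \ \forall k.$$

Applying Bernstein's inequality² we get (cf. Exercise 4.19)

$$\mathrm{Prob}\{|h^T\xi_x|>1\}=\mathrm{Prob}\Big\{\Big|\sum_k\chi_k\Big|>K\Big\}\le2\exp\left\{-\frac{K}{2\sum_i[Ax]_ih_i^2+\frac43\|h\|_\infty}\right\}.\tag{5.7}$$

Setting

$$\pi_D(h)=\sqrt{\vartheta_D^2\max_{x\in\mathcal X}\sum_i[Ax]_ih_i^2+\varrho_D^2\|h\|_\infty^2},\qquad\vartheta_D=2\sqrt{\frac{\ln(2/\delta)}{K}},\quad\varrho_D=\frac{8\ln(2/\delta)}{3K},$$

after a completely straightforward computation, we conclude from (5.7) that

$$\pi_D(h)\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\delta,\ \ \forall x\in\mathcal X.$$

Thus, in the Discrete case we can set $\mathcal H_\delta=\mathcal H_\delta^D:=\{h:\ \pi_D(h)\le1\}$.

5.1.3.3 Poisson case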

In the Poisson case, for $x\in\mathcal X$, setting $\mu=Ax$, we have

$$\omega=Ax+\xi_x,\qquad\xi_x=\omega-\mu.$$

It turns out that for every $h\in\mathbf R^m$ one has

$$\forall t\ge0:\ \mathrm{Prob}\{|h^T\xi_x|\ge t\}\le2\exp\left\{-\frac{t^2}{2\left[\sum_ih_i^2\mu_i+\frac13\|h\|_\infty t\right]}\right\}\tag{5.8}$$

(for verification, see Exercise 4.21 or Section 5.4.1). As a result, we conclude via a straightforward computation that setting

$$\pi_P(h)=\sqrt{\vartheta_P^2\max_{x\in\mathcal X}\sum_i[Ax]_ih_i^2+\varrho_P^2\|h\|_\infty^2},\qquad\vartheta_P=\sqrt{2\ln(2/\delta)},\quad\varrho_P=\tfrac43\ln(2/\delta),$$

we ensure that

$$\pi_P(h)\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\delta,\ \ \forall x\in\mathcal X.$$

Thus, in the Poisson case we can set $\mathcal H_\delta=\mathcal H_\delta^P:=\{h:\ \pi_P(h)\le1\}$.

² The classical Bernstein inequality states that if $X_1,...,X_K$ are independent zero mean scalar random variables with finite variances $\sigma_k^2$ such that $|X_k|\le M$ a.s., then for every $t>0$ one has

$$\mathrm{Prob}\{X_1+...+X_K>t\}\le\exp\left\{-\frac{t^2}{2\left[\sum_k\sigma_k^2+\frac13Mt\right]}\right\}.$$

5.1.4 Efficient upper-bounding of R[H] and contrast design, I.

The scheme for upper-bounding $R[H]$ to be presented in this section (an alternative, completely different, scheme will be presented in Section 5.1.5) is inspired by our motivating example. Note that there is a special case of (5.5) where $R[H]$ is easy to compute: the case where $\|\cdot\|$ is the uniform norm $\|\cdot\|_\infty$, whence

$$R[H]=\widehat R[H]:=2\max_{i\le\nu}\max_x\left\{\mathrm{Row}_i^T[B]x:\ x\in\mathcal X_s,\ \|H^TAx\|_\infty\le1\right\}$$

is the maximum of $\nu$ efficiently computable convex functions. It turns out that when $\|\cdot\|=\|\cdot\|_\infty$, it is not only easy to compute $R[H]$, but to optimize this risk bound in $H$ as well.³ These observations underlie the forthcoming developments in this section: under appropriate assumptions, we bound the risk of a polyhedral estimate with contrast matrix $H$ via the efficiently computable quantity $\widehat R[H]$ and then show that the resulting risk bounds can be efficiently optimized w.r.t. $H$. We shall also see that in some "simple for analytical analysis" situations, like that of the example, the resulting estimates are nearly minimax optimal.

³ On closer inspection, in the situation considered in the motivating example the $\|\cdot\|_\infty$-optimal contrast matrix $H$ is proportional to the unit matrix, and the quantity $\widehat R[H]$ can be easily translated into an upper bound on, say, the $\|\cdot\|_2$-risk of the associated polyhedral estimate.

5.1.4.1 Assumptions

We stay within the setup introduced in Section 5.1.1 which we augment with the following assumptions:

A.1. $\|\cdot\|=\|\cdot\|_r$ with $r\in[1,\infty]$.

A.2. We have at our disposal a sequence $\gamma=\{\gamma_i>0,\ i\le\nu\}$ and $\rho\in[1,\infty]$ such that the image of $\mathcal X_s$ under the mapping $x\mapsto Bx$ is contained in the "scaled $\|\cdot\|_\rho$-ball"

$$\mathcal Y=\{y\in\mathbf R^\nu:\ \|\mathrm{Diag}\{\gamma\}y\|_\rho\le1\}.\tag{5.9}$$

5.1.4.2 Simple observation

Let $B_\ell^T$ be the $\ell$-th row in $B$, $1\le\ell\le\nu$. Let us make the following observation:


Proposition 5.2. In the situation described in Section 5.1.1, let us assume that Assumptions A.1-2 hold. Let $\epsilon\in(0,1)$ and let a positive integer $N\ge\nu$ be given; let also $\pi(\cdot)$ be a norm on $\mathbf R^m$ such that

$$\forall(h:\ \pi(h)\le1,\ x\in\mathcal X):\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\epsilon/N.$$

Next, let a matrix $H=[H_1,...,H_\nu]$ with $H_\ell\in\mathbf R^{m\times m_\ell}$, $m_\ell\ge1$, and positive reals $\varsigma_\ell$, $\ell\le\nu$, satisfy the relations

$$\begin{array}{ll}(a)&\pi(\mathrm{Col}_j[H])\le1,\ 1\le j\le N;\\(b)&\max_x\left\{B_\ell^Tx:\ x\in\mathcal X_s,\ \|H_\ell^TAx\|_\infty\le1\right\}\le\varsigma_\ell,\ 1\le\ell\le\nu.\end{array}\tag{5.10}$$

Then the quantity $R[H]$ as defined in (5.5) can be upper-bounded as follows:

$$R[H]\le\Psi(\varsigma):=2\max_w\left\{\|[w_1/\gamma_1;...;w_\nu/\gamma_\nu]\|_r:\ \|w\|_\rho\le1,\ 0\le w_\ell\le\gamma_\ell\varsigma_\ell,\ \ell\le\nu\right\},\tag{5.11}$$

which combines with Proposition 5.1 to imply that

$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x_H|\mathcal X]\le\Psi(\varsigma).\tag{5.12}$$

Function $\Psi$ is nondecreasing on the nonnegative orthant and is easy to compute.

Proof. Let $z=2\bar z$ be a feasible solution to (5.5), thus $\bar z\in\mathcal X_s$ and $\|H^TA\bar z\|_\infty\le1$. Let $y=B\bar z$, so that $y\in\mathcal Y$ (see (5.9)) due to $\bar z\in\mathcal X_s$ and A.2. Then $\|\mathrm{Diag}\{\gamma\}y\|_\rho\le1$. Besides this, by (5.10.b) relations $\bar z\in\mathcal X_s$ and $\|H^TA\bar z\|_\infty\le1$ combine with the symmetry of $\mathcal X_s$ w.r.t. the origin to imply that

$$|y_\ell|=|B_\ell^T\bar z|\le\varsigma_\ell,\ \ell\le\nu.$$

Taking into account that $\|\cdot\|=\|\cdot\|_r$ by A.1, we see that

$$\begin{array}{rcl}R[H]&=&\max_z\left\{\|Bz\|_r:\ z\in2\mathcal X_s,\ \|H^TAz\|_\infty\le2\right\}\\&\le&2\max_y\left\{\|y\|_r:\ |y_\ell|\le\varsigma_\ell,\ \ell\le\nu,\ \|\mathrm{Diag}\{\gamma\}y\|_\rho\le1\right\}\\&=&2\max_w\left\{\|[w_1/\gamma_1;...;w_\nu/\gamma_\nu]\|_r:\ \|w\|_\rho\le1,\ 0\le w_\ell\le\gamma_\ell\varsigma_\ell,\ \ell\le\nu\right\},\end{array}$$

as stated in (5.11). It is evident that $\Psi$ is nondecreasing on the nonnegative orthant. Computing $\Psi$ can be carried out as follows:

1. When $r=\infty$, we need to compute $\max_{\ell\le\nu}\max_w\{w_\ell/\gamma_\ell:\ \|w\|_\rho\le1,\ 0\le w_j\le\gamma_j\varsigma_j,\ j\le\nu\}$, so that evaluating $\Psi$ reduces to solving $\nu$ simple convex optimization problems;

2. When $\rho=\infty$, we clearly have $\Psi(\varsigma)=2\|[\bar w_1/\gamma_1;...;\bar w_\nu/\gamma_\nu]\|_r$, $\bar w_\ell=\min[1,\gamma_\ell\varsigma_\ell]$;

3. When $1\le r,\rho<\infty$, passing from variables $w_\ell$ to variables $u_\ell=w_\ell^\rho$, we get

$$\Psi^r(\varsigma)=2^r\max_u\left\{\sum_\ell\gamma_\ell^{-r}u_\ell^{r/\rho}:\ \sum_\ell u_\ell\le1,\ 0\le u_\ell\le(\gamma_\ell\varsigma_\ell)^\rho\right\}.$$

When $r\le\rho$, the optimization problem on the right-hand side is the easily solvable problem of maximizing a simple concave function over a simple convex compact set. When $\infty>r>\rho$, this problem can be solved by Dynamic Programming. ✷
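Two of the cases in the recipe for computing $\Psi$ admit closed forms, which the following sketch evaluates. It is an illustration only: the function names are hypothetical, and the formula used for $r=\infty$ (put all the weight of $w$ on a single coordinate, giving $2\max_\ell\min[1/\gamma_\ell,\varsigma_\ell]$) is our own one-line consequence of the definition (5.11), not a formula quoted from the text.

```python
import numpy as np

def psi_rho_inf(varsigma, gamma, r):
    # Case rho = infinity: the constraint ||w||_inf <= 1 decouples, so the
    # maximizer is w_l = min(1, gamma_l * varsigma_l) and Psi = 2 * ||w / gamma||_r.
    w_bar = np.minimum(1.0, gamma * varsigma)
    return 2.0 * np.linalg.norm(w_bar / gamma, ord=r)

def psi_r_inf(varsigma, gamma):
    # Case r = infinity (any rho): maximizing a single ratio w_l / gamma_l, it is
    # optimal to set all other entries of w to zero, so w_l = min(1, gamma_l * varsigma_l).
    return 2.0 * np.max(np.minimum(1.0 / gamma, varsigma))

varsigma = np.array([0.5, 2.0])   # illustrative values of varsigma_l
gamma = np.array([1.0, 4.0])      # illustrative values of gamma_l
psi_2 = psi_rho_inf(varsigma, gamma, 2)         # rho = inf, r = 2
psi_inf_a = psi_rho_inf(varsigma, gamma, np.inf)  # rho = inf, r = inf
psi_inf_b = psi_r_inf(varsigma, gamma)            # r = inf via the second formula
```

When both $r=\infty$ and $\rho=\infty$ the two functions must agree, which gives a quick sanity check.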



Comment. When we want to recover $Bx$ in $\|\cdot\|_\infty$ (i.e., we are in the case of $r=\infty$), under the premise of Proposition 5.2 we clearly have $\Psi(\varsigma)\le2\max_\ell\varsigma_\ell$, resulting in the bound

$$\mathrm{Risk}_{\epsilon,\|\cdot\|_\infty}[\widehat x_H|\mathcal X]\le2\max_{\ell\le\nu}\varsigma_\ell.$$

Note that this bound in fact does not require Assumption A.2 (since, $\mathcal X_s$ being bounded, the assumption is satisfied for any $\rho$ with small enough $\gamma_i$'s).

5.1.4.3 Specifying contrasts

Risk bound (5.12) allows for an easy design of contrast matrices. Recalling that $\Psi$ is monotone on the nonnegative orthant, all we need is to select $h_\ell$'s satisfying (5.10) and resulting in the smallest possible $\varsigma_\ell$'s, which is what we are about to do now.

Preliminaries. Given a vector $b\in\mathbf R^n$ and a norm $s(\cdot)$ on $\mathbf R^m$, consider the convex-concave saddle point problem

$$\mathrm{Opt}=\inf_{g\in\mathbf R^m}\max_{x\in\mathcal X_s}\left[\phi(g,x):=[b-A^Tg]^Tx+s(g)\right]\tag{SP}$$

along with the induced primal and dual problems

$$\begin{array}{rcl}\mathrm{Opt}(P)&=&\inf_{g\in\mathbf R^m}\left[\overline\phi(g):=\max_{x\in\mathcal X_s}\phi(g,x)\right]\\&=&\inf_{g\in\mathbf R^m}\left[s(g)+\max_{x\in\mathcal X_s}[b-A^Tg]^Tx\right]\end{array}\tag{P}$$

and

$$\begin{array}{rcl}\mathrm{Opt}(D)&=&\max_{x\in\mathcal X_s}\left[\underline\phi(x):=\inf_{g\in\mathbf R^m}\phi(g,x)\right]\\&=&\max_{x\in\mathcal X_s}\left[\inf_{g\in\mathbf R^m}\left[b^Tx-[Ax]^Tg+s(g)\right]\right]\\&=&\max_x\left\{b^Tx:\ x\in\mathcal X_s,\ q(Ax)\le1\right\}\end{array}\tag{D}$$

where $q(\cdot)$ is the norm conjugate to $s(\cdot)$ (we have used the evident fact that $\inf_{g\in\mathbf R^m}[f^Tg+s(g)]$ is either $-\infty$ or $0$ depending on whether $q(f)>1$ or $q(f)\le1$). Since $\mathcal X_s$ is compact, we have $\mathrm{Opt}(P)=\mathrm{Opt}(D)=\mathrm{Opt}$ by the Sion-Kakutani Theorem. Besides this, $(D)$ is solvable (evident) and $(P)$ is solvable as well, since $\overline\phi(g)$ is continuous due to the compactness of $\mathcal X_s$ and $\overline\phi(g)\ge s(g)$, so that $\overline\phi(\cdot)$ has bounded level sets. Let $\bar g$ be an optimal solution to $(P)$, let $\bar x$ be an optimal solution to $(D)$, and let $\bar h$ be the $s(\cdot)$-unit normalization of $\bar g$, so that $s(\bar h)=1$ and $\bar g=s(\bar g)\bar h$. Now let us make the following observation:

Observation 5.3. In the situation in question, we have

$$\max_x\left\{|b^Tx|:\ x\in\mathcal X_s,\ |\bar h^TAx|\le1\right\}\le\mathrm{Opt}.\tag{5.13}$$

In addition, for any matrix $G=[g^1,...,g^M]\in\mathbf R^{m\times M}$ with $s(g^j)\le1$, $j\le M$, one has

$$\max_x\left\{|b^Tx|:\ x\in\mathcal X_s,\ \|G^TAx\|_\infty\le1\right\}=\max_x\left\{b^Tx:\ x\in\mathcal X_s,\ \|G^TAx\|_\infty\le1\right\}\ge\mathrm{Opt}.\tag{5.14}$$

Proof. Let $x$ be a feasible solution to the problem in (5.13). Replacing, if necessary, $x$ with $-x$, we can assume that $|b^Tx|=b^Tx$. We now have

$$\begin{array}{rcl}|b^Tx|&=&b^Tx=[\bar g^TAx-s(\bar g)]+\underbrace{[b-A^T\bar g]^Tx+s(\bar g)}_{\le\overline\phi(\bar g)=\mathrm{Opt}(P)}\\&\le&\mathrm{Opt}(P)+[s(\bar g)\bar h^TAx-s(\bar g)]\le\mathrm{Opt}(P)+s(\bar g)\underbrace{|\bar h^TAx|}_{\le1}-s(\bar g)\\&\le&\mathrm{Opt}(P)=\mathrm{Opt},\end{array}$$

as claimed in (5.13). Now, the equality in (5.14) is due to the symmetry of $\mathcal X_s$ w.r.t. the origin. To verify the inequality in (5.14), note that $\bar x$ satisfies the relations $\bar x\in\mathcal X_s$ and $q(A\bar x)\le1$, implying, due to the fact that the columns of $G$ are of $s(\cdot)$-norm $\le1$, that $\bar x$ is a feasible solution to the optimization problems in (5.14). As a result, the second quantity in (5.14) is at least $b^T\bar x=\mathrm{Opt}(D)=\mathrm{Opt}$, and (5.14) follows. ✷

Comment. Note that problem $(P)$ has a very transparent origin. In the situation of Section 5.1.1, assume that our goal is to estimate, given observation $\omega=Ax+\xi_x$, the value at $x\in\mathcal X$ of the linear function $b^Tx$, and we want to use for this purpose an estimate $\widehat g(\omega)=g^T\omega+\gamma$ affine in $\omega$. Given $\epsilon\in(0,1)$, how do we construct a presumably good in terms of its $\epsilon$-risk estimate? Let us show that a meaningful answer is yielded by the optimal solution to $(P)$. Indeed, we have

$$b^Tx-\widehat g(Ax+\xi_x)=[b-A^Tg]^Tx-\gamma-g^T\xi_x.$$

Assume that we have at our disposal a norm $s(\cdot)$ on $\mathbf R^m$ such that

$$\forall(h\in\mathbf R^m,\ s(h)\le1,\ x\in\mathcal X):\ \mathrm{Prob}\{\xi_x:\ |h^T\xi_x|>1\}\le\epsilon,$$

or, which is the same,

$$\forall(g\in\mathbf R^m,\ x\in\mathcal X):\ \mathrm{Prob}\{\xi_x:\ |g^T\xi_x|>s(g)\}\le\epsilon.$$

Then we can safely upper-bound the $\epsilon$-risk of a candidate estimate $\widehat g(\cdot)$ by the quantity

$$\rho=\underbrace{\max_{x\in\mathcal X}\left|[b-A^Tg]^Tx-\gamma\right|}_{\text{bias }B(g,\gamma)}+s(g).$$

Observe that for $g$ fixed, the minimal, over $\gamma$, bias is

$$M(g):=\max_{x\in\mathcal X_s}[b-A^Tg]^Tx.$$

Postponing verification of this claim, here is the conclusion: in the present setting, problem $(P)$ is nothing but the problem of building the best in terms of the upper bound $\rho$ on the $\epsilon$-risk affine estimate of linear function $b^Tx$.

It remains to justify the above claim, which is immediate: on one hand, for all $u\in\mathcal X$, $v\in\mathcal X$ we have

$$B(g,\gamma)\ge[b-A^Tg]^Tu-\gamma,\qquad B(g,\gamma)\ge-[b-A^Tg]^Tv+\gamma,$$

implying that

$$B(g,\gamma)\ge\tfrac12[b-A^Tg]^T[u-v]\ \ \forall(u\in\mathcal X,\ v\in\mathcal X),$$

that is, $B(g,\gamma)\ge M(g)$. On the other hand, let

$$M_+(g)=\max_{x\in\mathcal X}[b-A^Tg]^Tx,\qquad M_-(g)=-\min_{x\in\mathcal X}[b-A^Tg]^Tx,$$

so that $M(g)=\frac12[M_+(g)+M_-(g)]$. Setting $\bar\gamma=\frac12[M_+(g)-M_-(g)]$, we have

$$\begin{array}{rcl}\max_{x\in\mathcal X}\left[[b-A^Tg]^Tx-\bar\gamma\right]&=&M_+(g)-\bar\gamma=\tfrac12[M_+(g)+M_-(g)]=M(g),\\\min_{x\in\mathcal X}\left[[b-A^Tg]^Tx-\bar\gamma\right]&=&-M_-(g)-\bar\gamma=-\tfrac12[M_+(g)+M_-(g)]=-M(g).\end{array}$$

That is, $B(g,\bar\gamma)=M(g)$. Combining these observations, we arrive at $\min_\gamma B(g,\gamma)=M(g)$, as claimed.
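The optimal shift $\bar\gamma$ from the argument above is easy to check numerically. The sketch below assumes, purely for illustration, that $\mathcal X$ is the convex hull of finitely many points (stored as rows of the hypothetical array `X_pts`), so that maxima of affine functions over $\mathcal X$ are attained at these points; the vector `c` stands for $b-A^Tg$.

```python
import numpy as np

def bias(c, gamma, X_pts):
    # B(g, gamma) = max over X of |c^T x - gamma|, where c plays the role of b - A^T g.
    # For X = conv(rows of X_pts) the maximum of |affine function| is attained at a listed point.
    return np.max(np.abs(X_pts @ c - gamma))

def best_shift(c, X_pts):
    # gamma_bar = (M_plus - M_minus) / 2 and M(g) = (M_plus + M_minus) / 2
    vals = X_pts @ c
    M_plus, M_minus = vals.max(), -vals.min()
    return 0.5 * (M_plus - M_minus), 0.5 * (M_plus + M_minus)

X_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])  # hypothetical extreme points of X
c = np.array([1.0, 1.0])                                 # hypothetical value of b - A^T g
gamma_bar, M_g = best_shift(c, X_pts)
bias_at_opt = bias(c, gamma_bar, X_pts)    # equals M_g
bias_elsewhere = bias(c, 0.5, X_pts)       # any other shift can only do worse
```

Here $M(g)$ is the half-width of the range of $[b-A^Tg]^Tx$ over $\mathcal X$, and $\bar\gamma$ is the midpoint of that range.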



Contrast design. Proposition 5.2 and Observation 5.3 allow for a straightforward solution of the associated contrast design problem, at least in the case of sub-Gaussian, Discrete, and Poisson observation schemes. Indeed, in these cases, when designing a contrast matrix with $N$ columns, with our approach we are supposed to select its columns in the respective sets $\mathcal H_{\epsilon/N}$; see Section 5.1.3. Note that these sets, while shrinking as $N$ grows, are "nearly independent" of $N$, since the norms $\pi_G$, $\pi_D$, $\pi_P$ in the description of the respective sets $\mathcal H_\delta^G$, $\mathcal H_\delta^D$, $\mathcal H_\delta^P$ depend on $1/\delta$ via factors logarithmic in $1/\delta$. It follows that we lose nearly nothing when assuming that $N\ge\nu$. Let us act as follows:

We set $N=\nu$, specify $\bar\pi(\cdot)$ as the norm ($\pi_G$, or $\pi_D$, or $\pi_P$) associated with the observation scheme (sub-Gaussian, or Discrete, or Poisson) in question and $\delta=\epsilon/\nu$. We solve $\nu$ convex optimization problems

$$\mathrm{Opt}_\ell=\min_{g\in\mathbf R^m}\left[\overline\phi_\ell(g):=\max_{x\in\mathcal X_s}\phi_\ell(g,x)\right],\qquad\phi_\ell(g,x)=[B_\ell-A^Tg]^Tx+\bar\pi(g).\tag{$P_\ell$}$$

Next, we convert optimal solution $g_\ell$ to $(P_\ell)$ into vector $h_\ell\in\mathbf R^m$ by representing $g_\ell=\bar\pi(g_\ell)h_\ell$ with $\bar\pi(h_\ell)=1$, and set $H_\ell=h_\ell$. As a result, we obtain an $m\times\nu$ contrast matrix $H=[h_1,...,h_\nu]$ which, taken along with $N=\nu$, quantities

$$\varsigma_\ell=\mathrm{Opt}_\ell,\ 1\le\ell\le\nu,\tag{5.15}$$

and with $\pi(\cdot)\equiv\bar\pi(\cdot)$, in view of the first claim in Observation 5.3 as applied with $s(\cdot)\equiv\bar\pi(\cdot)$, satisfies the premise of Proposition 5.2. Consequently, by Proposition 5.2 we have

$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x_H|\mathcal X]\le\Psi([\mathrm{Opt}_1;...;\mathrm{Opt}_\nu]).\tag{5.16}$$

Comment. Optimality of the outlined contrast design for the sub-Gaussian, or Discrete, or Poisson observation scheme stems, within the framework set by Proposition 5.2, from the second claim of Observation 5.3, which states that when $N\ge\nu$ and the columns of the $m\times N$ contrast matrix $H=[H_1,...,H_\nu]$ belong to the set $\mathcal H_{\epsilon/N}$ associated with the observation scheme in question (i.e., the norm $\pi(\cdot)$ in the proposition is the norm $\pi_G$, or $\pi_D$, or $\pi_P$ associated with $\delta=\epsilon/N$), the quantities $\varsigma_\ell$ participating in (5.10.b) cannot be less than $\mathrm{Opt}_\ell$.

Indeed, the norm $\pi(\cdot)$ from Proposition 5.2 is $\ge$ the norm $\bar\pi(\cdot)$ participating in $(P_\ell)$ (because the value $\epsilon/N$ in the definition of $\pi(\cdot)$ is at most $\epsilon/\nu$), implying, by (5.10.a), that the columns of matrix $H$ obeying the premise of the proposition satisfy the relation $\bar\pi(\mathrm{Col}_j[H])\le1$. Invoking the second part of Observation 5.3 with $s(\cdot)\equiv\bar\pi(\cdot)$, $b=B_\ell$, and $G=H_\ell$, and taking (5.10.b) into account, we conclude that $\varsigma_\ell\ge\mathrm{Opt}_\ell$ for all $\ell$, as claimed.

Since the bound on the risk of a polyhedral estimate offered by Proposition 5.2 improves as the $\varsigma_\ell$'s decrease, we see that as far as this bound is concerned, the outlined design procedure is the best possible, provided $N\ge\nu$.

An attractive feature of the contrast design we have just presented is that it is completely independent of the entities participating in assumptions A.1-2: these entities affect theoretical risk bounds of the resulting polyhedral estimate, but not the estimate itself.

5.1.4.4 Illustration: Diagonal case

Let us consider the diagonal case of our estimation problem, where

• $\mathcal X=\{x\in\mathbf R^n:\ \|Dx\|_\rho\le1\}$, where $D$ is a diagonal matrix with positive diagonal entries $D_{\ell\ell}=:d_\ell$,
• $m=\nu=n$, and $A$ and $B$ are diagonal matrices with diagonal entries $0<A_{\ell\ell}=:a_\ell$, $0<B_{\ell\ell}=:b_\ell$,
• $\|\cdot\|=\|\cdot\|_r$,
• We are in the sub-Gaussian case, that is, observation noise $\xi_x$ is $(0,\sigma^2I_n)$-sub-Gaussian for every $x\in\mathcal X$.

Let us implement the approach developed in Sections 5.1.4.1–5.1.4.3.

1. Given reliability tolerance $\epsilon$, we set

$$\delta=\epsilon/n,\qquad\vartheta_G:=\sigma\sqrt{2\ln(2/\delta)}=\sigma\sqrt{2\ln(2n/\epsilon)},\tag{5.17}$$

and

$$\mathcal H=\mathcal H_\delta^G=\{h\in\mathbf R^n:\ \pi_G(h):=\vartheta_G\|h\|_2\le1\}.$$

2. We solve $\nu=n$ convex optimization problems $(P_\ell)$ associated with $\bar\pi(\cdot)\equiv\pi_G(\cdot)$, which is immediate: the resulting contrast matrix is $H=\vartheta_G^{-1}I_n$, and

$$\mathrm{Opt}_\ell=\varsigma_\ell:=b_\ell\min[\vartheta_G/a_\ell,\ 1/d_\ell].\tag{5.18}$$

Risk analysis. The $(\epsilon,\|\cdot\|)$-risk of the resulting polyhedral estimate $\widehat x(\cdot)$ can be bounded by Proposition 5.2. Note that setting $\gamma_\ell=d_\ell/b_\ell$, $1\le\ell\le n$, we meet assumptions A.1-2, and the above choice of $H$, $N=n$, and $\varsigma_\ell$ satisfies the premise of Proposition 5.2. By this proposition,

$$\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat x_H|\mathcal X]\le\Psi:=2\max_w\left\{\|[w_1/\gamma_1;...;w_n/\gamma_n]\|_r:\ \|w\|_\rho\le1,\ 0\le w_\ell\le\gamma_\ell\varsigma_\ell\right\}.\tag{5.19}$$


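Let us work out what happens in the simple case where

$$\begin{array}{ll}(a)&1\le\rho\le r<\infty,\\(b)&a_\ell/d_\ell\text{ and }b_\ell/a_\ell\text{ are nonincreasing in }\ell.\end{array}\tag{5.20}$$

Proposition 5.4. In the simple case just defined, let $\mathfrak n=n$ when

$$\sum_{\ell=1}^n(\vartheta_Gd_\ell/a_\ell)^\rho\le1;$$

otherwise let $\mathfrak n$ be the smallest integer such that

$$\sum_{\ell=1}^{\mathfrak n}(\vartheta_Gd_\ell/a_\ell)^\rho>1,$$

with $\vartheta_G$ given by (5.17). Then for the contrast matrix $H=\vartheta_G^{-1}I_n$ one has

$$\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat x_H|\mathcal X]\le\Psi\le2\left[\sum_{\ell=1}^{\mathfrak n}(\vartheta_Gb_\ell/a_\ell)^r\right]^{1/r}.$$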

Proof. Consider the optimization problem specifying $\Psi$ in (5.19). Setting $\theta=r/\rho\ge1$, let us pass in this problem from variables $w_\ell$ to variables $z_\ell=w_\ell^\rho$, so that

$$\Psi^r=2^r\max_z\left\{\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r:\ \sum_\ell z_\ell\le1,\ 0\le z_\ell\le(d_\ell\varsigma_\ell/b_\ell)^\rho\right\}\le2^r\Gamma,$$

where

$$\Gamma=\max_z\left\{\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r:\ \sum_\ell z_\ell\le1,\ 0\le z_\ell\le\chi_\ell:=(\vartheta_Gd_\ell/a_\ell)^\rho\right\}$$

(we have used (5.18)). Note that $\Gamma$ is the optimal value in the problem of maximizing a convex (since $\theta\ge1$) function $\sum_\ell z_\ell^\theta(b_\ell/d_\ell)^r$ over a bounded polyhedral set, so that the maximum is attained at an extreme point $\bar z$ of the feasible set. By the standard characterization of extreme points, the (clearly nonempty) set $I$ of positive entries in $\bar z$ is as follows. Let us denote by $I'$ the set of indexes $\ell\in I$ such that $\bar z_\ell$ is on its upper bound $\bar z_\ell=\chi_\ell$; note that the cardinality $|I'|$ of $I'$ is at least $|I|-1$. Since $\sum_{\ell\in I'}\bar z_\ell=\sum_{\ell\in I'}\chi_\ell\le1$ and $\chi_\ell$ are nondecreasing in $\ell$ by (5.20.b), we conclude that

$$\sum_{\ell=1}^{|I'|}\chi_\ell\le1,$$

implying that $|I'|<\mathfrak n$ provided that $\mathfrak n<n$, so that in this case $|I|\le\mathfrak n$; and of course $|I|\le\mathfrak n$ when $\mathfrak n=n$. Next, we have

$$\Gamma=\sum_{\ell\in I}\bar z_\ell^\theta(b_\ell/d_\ell)^r\le\sum_{\ell\in I}\chi_\ell^\theta(b_\ell/d_\ell)^r=\sum_{\ell\in I}(\vartheta_Gb_\ell/a_\ell)^r,$$

and since $b_\ell/a_\ell$ is nonincreasing in $\ell$ and $|I|\le\mathfrak n$, the latter quantity is at most $\sum_{\ell=1}^{\mathfrak n}(\vartheta_Gb_\ell/a_\ell)^r$. ✷
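The quantities entering the diagonal case are all explicit, so the risk bound of Proposition 5.4 can be evaluated in a few lines. The following sketch is an illustration under the hypotheses (5.20); the function name `diagonal_design` and the particular power-law sequences used in the example are our own choices, not taken from the text.

```python
import numpy as np

def diagonal_design(a, b, d, sigma, eps, rho, r):
    """Evaluate (5.17), (5.18) and the bound of Proposition 5.4 (assumes (5.20) holds)."""
    n = len(a)
    theta = sigma * np.sqrt(2.0 * np.log(2.0 * n / eps))   # theta_G from (5.17)
    varsigma = b * np.minimum(theta / a, 1.0 / d)           # Opt_l from (5.18)
    chi = (theta * d / a) ** rho                            # chi_l from the proof
    exceed = np.nonzero(np.cumsum(chi) > 1.0)[0]
    # frak_n = n if the full sum is <= 1, else the first index where the partial sum exceeds 1
    frak_n = n if exceed.size == 0 else int(exceed[0]) + 1
    bound = 2.0 * np.sum((theta * b[:frak_n] / a[:frak_n]) ** r) ** (1.0 / r)
    return theta, varsigma, frak_n, bound

ell = np.arange(1, 51, dtype=float)
a, b, d = ell ** -1.0, ell ** -1.0, ell ** 1.0   # a_l/d_l and b_l/a_l are nonincreasing
theta, varsigma, frak_n, bound = diagonal_design(a, b, d, sigma=0.01, eps=0.05, rho=2, r=2)
```

In this example $b_\ell/a_\ell\equiv1$, so the bound reduces to $2\vartheta_G\sqrt{\mathfrak n}$.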

Application. Consider the "standard case" [72, 74] where

$$0<\sqrt{\ln(2n/\epsilon)}\,\sigma\le1,\qquad a_\ell=\ell^{-\alpha},\quad b_\ell=\ell^{-\beta},\quad d_\ell=\ell^\kappa$$

with $\beta\ge\alpha\ge0$, $\kappa\ge0$ and $(\beta-\alpha)r<1$. In this case for large $n$, namely,

$$n\ge c\vartheta_G^{-\frac{1}{\alpha+\kappa+1/\rho}}\qquad[\vartheta_G=\sigma\sqrt{2\ln(2n/\epsilon)}]\tag{5.21}$$

(here and in what follows, the factors denoted by $c$ and $C$ depend solely on $\alpha,\beta,\kappa,r,\rho$) we get

$$\mathfrak n\le C\vartheta_G^{-\frac{1}{\alpha+\kappa+1/\rho}},$$

resulting in

$$\mathrm{Risk}_{\epsilon,\|\cdot\|_r}[\widehat x|\mathcal X]\le C\vartheta_G^{\frac{\beta+\kappa+1/\rho-1/r}{\alpha+\kappa+1/\rho}}.\tag{5.22}$$

Setting $x=D^{-1}y$, $\bar\alpha=\alpha+\kappa$, $\bar\beta=\beta+\kappa$ and treating $y$, rather than $x$, as the signal underlying the observation, we obtain the estimation problem which is similar to the original one in which $\alpha$, $\beta$, $\kappa$ and $\mathcal X$ are replaced, respectively, with $\bar\alpha$, $\bar\beta$, $\bar\kappa=0$, and $\mathcal Y=\{y:\ \|y\|_\rho\le1\}$, and $A$, $B$ replaced with $\bar A=\mathrm{Diag}\{\ell^{-\bar\alpha},\ \ell\le n\}$, $\bar B=\mathrm{Diag}\{\ell^{-\bar\beta},\ \ell\le n\}$. When $n$ is large enough, namely, $n\ge\sigma^{-\frac{1}{\bar\alpha+1/\rho}}$, $\mathcal Y$ contains the "coordinate box"

$$\overline{\mathcal Y}=\{y:\ |y_\ell|\le m^{-1/\rho},\ m/2\le\ell\le m,\ y_\ell=0\text{ otherwise}\}$$

of dimension $\ge m/2$, where

$$m\ge c\sigma^{-\frac{1}{\bar\alpha+1/\rho}}.$$

Observe that for all $y\in\overline{\mathcal Y}$, $\|\bar Ay\|_2\le Cm^{-\bar\alpha}\|y\|_2$, and $\|\bar By\|_r\ge cm^{-\bar\beta}\|y\|_r$. This observation, when combined with the Fano inequality, implies (cf. [79]) that for $\epsilon\ll1$ the minimax optimal w.r.t. the family of all Borel estimates $(\epsilon,\|\cdot\|_r)$-risk on the signal set $\overline{\mathcal X}=D^{-1}\overline{\mathcal Y}\subset\mathcal X$ is at least

$$c\sigma^{\frac{\bar\beta+1/\rho-1/r}{\bar\alpha+1/\rho}}.$$

In other words, in this situation, the upper bound (5.22) on the risk of the polyhedral estimate is within a factor logarithmic in $n/\epsilon$ from the minimax risk. In particular, without surprise, in the case of $\beta=0$ the polyhedral estimates attain well-known optimal rates [72, 109].

5.1.5 Efficient upper-bounding of R[H] and contrast design, II.

5.1.5.1 Outline

In this section we develop an alternative approach to the design of polyhedral estimates which resembles in many aspects the approach to building linear estimates from Chapter 4. Recall that the principal technique underlying the design of a presumably good linear estimate $\widehat x_H(\omega)=H^T\omega$ was upper-bounding of maximal risk of the estimate (the maximum of a quadratic form, depending on $H$ as a parameter, over the signal set $\mathcal X$), and we were looking for a bounding scheme allowing us to efficiently optimize the bound in $H$. The design of a presumably good polyhedral estimate also reduces to minimizing the optimal value in a parametric maximization problem (5.5) over the contrast matrix $H$. However, while the design of a presumably good linear estimate reduces to unconstrained minimization, to conceive a polyhedral estimate we need to minimize bound $R[H]$ on the estimation risk under the restriction on the contrast matrix $H$: the columns $h_\ell$ of this matrix should satisfy condition (5.1). In other words, in the case of polyhedral estimate the "design parameter" affects the constraints of the optimization problem rather than the objective.

Our strategy can be outlined as follows. Let us denote by

$$\mathcal B_*=\{u\in\mathbf R^\nu:\ \|u\|_*\le1\}$$

the unit ball of the norm $\|\cdot\|_*$ conjugate to the norm $\|\cdot\|$ in the formulation of the estimation problem in Section 5.1.2. Assume that we have at our disposal a technique for bounding quadratic forms on the set $\mathcal B_*\times\mathcal X_s$, in other words, we have an efficiently computable convex function $\mathcal M(M)$ on $\mathbf S^{\nu+n}$ such that

$$\mathcal M(M)\ge\max_{[u;z]\in\mathcal B_*\times\mathcal X_s}[u;z]^TM[u;z]\ \ \forall M\in\mathbf S^{\nu+n}.\tag{5.23}$$

Note that the upper bound $R[H]$, as defined in (5.5), on the risk of a candidate polyhedral estimate $\widehat x_H$ is nothing but

$$R[H]=2\max_{[u;z]}\Big\{[u;z]^T\underbrace{\begin{bmatrix}0&\frac12B\\\frac12B^T&0\end{bmatrix}}_{B_+}[u;z]:\ u\in\mathcal B_*,\ z\in\mathcal X_s,\ z^TA^Th_\ell h_\ell^TAz\le1,\ \ell\le N\Big\}.\tag{5.24}$$

Given $\lambda\in\mathbf R^N_+$, the constraints $z^TA^Th_\ell h_\ell^TAz\le1$ in (5.24) can be aggregated to yield the quadratic constraint

$$z^TA^T\Theta_\lambda Az\le\mu_\lambda,\qquad\Theta_\lambda=H\mathrm{Diag}\{\lambda\}H^T,\quad\mu_\lambda=\sum_\ell\lambda_\ell.$$

Observe that for every $\lambda\ge0$ we have

$$R[H]\le2\mathcal M\Big(\underbrace{\begin{bmatrix}0&\frac12B\\\frac12B^T&-A^T\Theta_\lambda A\end{bmatrix}}_{B_+[\Theta_\lambda]}\Big)+2\mu_\lambda.\tag{5.25}$$

Indeed, let $[u;z]$ be a feasible solution to the optimization problem (5.24) specifying $R[H]$. Then

$$[u;z]^TB_+[u;z]=[u;z]^TB_+[\Theta_\lambda][u;z]+z^TA^T\Theta_\lambda Az;$$

the first term on the right-hand side is $\le\mathcal M(B_+[\Theta_\lambda])$ since $[u;z]\in\mathcal B_*\times\mathcal X_s$, and the second term on the right-hand side, as we have already seen, is $\le\mu_\lambda$, and (5.25) follows.
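The two algebraic facts behind (5.25), namely the splitting of $[u;z]^TB_+[u;z]$ and the aggregated constraint $z^TA^T\Theta_\lambda Az\le\mu_\lambda$, can be checked on random data. The sketch below is a sanity check only; the dimensions and the randomly drawn matrices are hypothetical, and no claim is made about the value of $\mathcal M$ itself.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, nu, N = 4, 3, 2, 5                      # hypothetical sizes
A = rng.standard_normal((m, n))
B = rng.standard_normal((nu, n))
H = rng.standard_normal((m, N))               # columns h_l of a candidate contrast matrix
lam = rng.random(N)                           # aggregation weights, lam >= 0

z = rng.standard_normal(n)
z /= np.max(np.abs(H.T @ A @ z))              # now |h_l^T A z| <= 1 for all l
u = rng.standard_normal(nu)
w = np.concatenate([u, z])

# B_+ = [[0, B/2], [B^T/2, 0]], so that [u;z]^T B_+ [u;z] = u^T B z
B_plus = np.block([[np.zeros((nu, nu)), 0.5 * B],
                   [0.5 * B.T, np.zeros((n, n))]])
Theta = H @ np.diag(lam) @ H.T                # Theta_lambda
B_plus_theta = B_plus.copy()
B_plus_theta[nu:, nu:] = -A.T @ Theta @ A     # B_+[Theta_lambda]

lhs = w @ B_plus @ w
agg = z @ A.T @ Theta @ A @ z                 # aggregated quadratic form
split = w @ B_plus_theta @ w + agg            # right-hand side of the splitting identity
```

The aggregated form equals $\sum_\ell\lambda_\ell(h_\ell^TAz)^2$, which is why it cannot exceed $\sum_\ell\lambda_\ell$ once every $|h_\ell^TAz|\le1$.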

Now assume that we have at our disposal a computationally tractable cone

$$\mathbf H\subset\mathbf S^m_+\times\mathbf R_+$$

satisfying the following assumption:

C. Whenever $(\Theta,\mu)\in\mathbf H$, we can efficiently find an $m\times N$ matrix $H=[h_1,...,h_N]$ and a nonnegative vector $\lambda\in\mathbf R^N_+$ such that

$$\begin{array}{ll}(a)&h_\ell\text{ satisfies (5.1)},\ 1\le\ell\le N,\\(b)&\Theta=H\mathrm{Diag}\{\lambda\}H^T,\\(c)&\sum_i\lambda_i\le\mu.\end{array}\tag{5.26}$$

The following simple observation is crucial to what follows:

Proposition 5.5. Consider the estimation problem posed in Section 5.1.1, and let efficiently computable convex function $\mathcal M$ and computationally tractable closed convex cone $\mathbf H$ satisfy (5.23) and Assumption C, respectively. Consider the convex optimization problem

$$\mathrm{Opt}=\min_{\tau,\Theta,\mu}\left\{2\tau+2\mu:\ (\Theta,\mu)\in\mathbf H,\ \mathcal M(B_+[\Theta])\le\tau\right\},\qquad B_+[\Theta]=\begin{bmatrix}0&\frac12B\\\frac12B^T&-A^T\Theta A\end{bmatrix}.\tag{5.27}$$

Given a feasible solution $(\tau,\Theta,\mu)$ to this problem, by C we can efficiently convert it to $(H,\lambda)$ such that $H=[h_1,...,h_N]$ with $h_\ell$ satisfying (5.1) and $\lambda\ge0$ with $\sum_\ell\lambda_\ell\le\mu$. We have

$$R[H]\le2\tau+2\mu,$$

whence the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate $\widehat x_H$ satisfies the bound

$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x_H|\mathcal X]\le2\tau+2\mu.\tag{5.28}$$

Consequently, we can efficiently construct polyhedral estimates with $(\epsilon,\|\cdot\|)$-risk arbitrarily close to Opt (and with risk exactly Opt, provided problem (5.27) is solvable).

Proof is readily given by the reasoning preceding the proposition. Indeed, with $\tau,\Theta,\mu,H,\lambda$ as in the premise of the proposition, the columns $h_\ell$ of $H$ satisfy (5.1) by C, implying, by Proposition 5.1, that $\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat x_H|\mathcal X]\le R[H]$. Besides this, C says that for our $H,\lambda$ it holds $\Theta=\Theta_\lambda$ and $\mu_\lambda\le\mu$, so that (5.25) combines with the constraints of (5.27) to imply that $R[H]\le2\tau+2\mu$, and (5.28) follows by Proposition 5.1. ✷

The approach to the design of polyhedral estimates we develop in this section amounts to reducing the construction of the estimate (i.e., construction of the contrast matrix $H$) to finding (nearly) optimal solutions to (5.27). Implementing this approach requires devising techniques for constructing cones $\mathbf H$ satisfying C along with efficiently computable functions $\mathcal M(\cdot)$ satisfying (5.23). These tasks are the subjects of the sections to follow.

5.1.5.2 Specifying cones H

We specify cones $\mathbf H$ in the case when the number $N$ of columns in the candidate contrast matrices is $m$ and under the following assumption on the given reliability tolerance $\epsilon$ and observation scheme in question:

D. There is a computationally tractable convex compact subset $\mathcal Z\subset\mathbf R^m_+$ intersecting $\mathrm{int}\,\mathbf R^m_+$ such that the norm $\pi(\cdot)$

$$\pi(h)=\sqrt{\max_{z\in\mathcal Z}\sum_iz_ih_i^2}$$

induced by $\mathcal Z$ satisfies the relation

$$\pi(h)\le1\ \Rightarrow\ \mathrm{Prob}\{|h^T\xi_x|>1\}\le\epsilon/m\ \ \forall x\in\mathcal X.$$

Note that condition D is satisfied for sub-Gaussian, Discrete, and Poisson observation schemes: according to the results of Section 5.1.3,

• in the sub-Gaussian case, it suffices to take $\mathcal Z=\{2\sigma^2\ln(2m/\epsilon)[1;...;1]\}$;

• in the Discrete case, it suffices to take

$$\mathcal Z=\frac{4\ln(2m/\epsilon)}{K}A\mathcal X+\frac{64\ln^2(2m/\epsilon)}{9K^2}\Delta_m,$$

where $A\mathcal X=\{Ax:\ x\in\mathcal X\}$, $\Delta_m=\{y\in\mathbf R^m:\ y\ge0,\ \sum_iy_i=1\}$;

• in the Poisson case, it suffices to take

$$\mathcal Z=2\ln(2m/\epsilon)A\mathcal X+\tfrac{16}{9}\ln^2(2m/\epsilon)\Delta_m,$$

with $A\mathcal X$ and $\Delta_m$ as above.

Note that in all these cases $\mathcal Z$ only "marginally" (logarithmically) depends on $\epsilon$ and $m$. Under Assumption D, the cone $\mathbf H$ can be built as follows:

• When $\mathcal Z$ is a singleton, $\mathcal Z=\{\bar z\}$, so that $\pi(\cdot)$ is a scaled Euclidean norm, we set

$$\mathbf H=\Big\{(\Theta,\mu)\in\mathbf S^m_+\times\mathbf R_+:\ \mu\ge\sum_i\bar z_i\Theta_{ii}\Big\}.$$

Given $(\Theta,\mu)\in\mathbf H$, the $m\times m$ matrix $H$ and $\lambda\in\mathbf R^m_+$ are built as follows: setting $S=\mathrm{Diag}\{\sqrt{\bar z_1},...,\sqrt{\bar z_m}\}$, we compute the eigenvalue decomposition of the matrix $S\Theta S$:

$$S\Theta S=U\mathrm{Diag}\{\lambda\}U^T,$$

where $U$ is orthonormal, and set $H=S^{-1}U$, thus ensuring $\Theta=H\mathrm{Diag}\{\lambda\}H^T$. Since $\mu\ge\sum_i\bar z_i\Theta_{ii}$, we have $\sum_i\lambda_i=\mathrm{Tr}(S\Theta S)\le\mu$. Finally, a column $h$ of $H$ is of the form $S^{-1}f$ with $\|\cdot\|_2$-unit vector $f$, implying that

$$\pi(h)=\sqrt{\sum_i\bar z_i[S^{-1}f]_i^2}=\sqrt{\sum_if_i^2}=1,$$

so that $h$ satisfies (5.1) by D.

• When $\mathcal Z$ is not a singleton, we set

$$\begin{array}{rcl}\phi(r)&=&\max_{z\in\mathcal Z}z^Tr,\\\kappa&=&6\ln(2\sqrt3m^2),\\\mathbf H&=&\{(\Theta,\mu)\in\mathbf S^m_+\times\mathbf R_+:\ \mu\ge\kappa\phi(\mathrm{dg}(\Theta))\},\end{array}\tag{5.29}$$

where $\mathrm{dg}(Q)$ is the diagonal of a (square) matrix $Q$. Note that $\phi(r)>0$ whenever $r\ge0$, $r\ne0$, since $\mathcal Z$ contains a positive vector.

The justification of this construction and the efficient (randomized) algorithm for converting a pair $(\Theta,\mu)\in\mathbf H$ into $(H,\lambda)$ satisfying, when taken along with $(\Theta,\mu)$, the requirements of C are given by the following:

Lemma 5.6. Let norm $\pi(\cdot)$ satisfy D.

(i) Whenever $H$ is an $m\times m$ matrix with columns $h_\ell$ satisfying $\pi(h_\ell)\le1$ and $\lambda\in\mathbf R^m_+$, we have

$$\Big(\Theta_\lambda=H\mathrm{Diag}\{\lambda\}H^T,\ \mu=\kappa\sum_i\lambda_i\Big)\in\mathbf H.$$

(ii) Given $(\Theta,\mu)\in\mathbf H$ with $\Theta\ne0$, we find decomposition $\Theta=QQ^T$ with $m\times m$ matrix $Q$, and fix an orthonormal $m\times m$ matrix $V$ with magnitudes of entries not exceeding $\sqrt{2/m}$ (e.g., the orthonormal scaling of the matrix of the cosine transform). When $\mu>0$, we set $\lambda=\frac\mu m[1;...;1]\in\mathbf R^m$ and consider the random matrix

$$H_\chi=\sqrt{\frac m\mu}\,Q\mathrm{Diag}\{\chi\}V,$$

where $\chi$ is the $m$-dimensional Rademacher random vector. We have

$$H_\chi\mathrm{Diag}\{\lambda\}H_\chi^T\equiv\Theta,\quad\lambda\ge0,\quad\sum_i\lambda_i=\mu.\tag{5.30}$$

Moreover, the probability of the event

$$\pi(\mathrm{Col}_\ell[H_\chi])\le1\ \ \forall\ell\le m\tag{5.31}$$

is at least 1/2. Thus, generating independent samples of $\chi$ and terminating with $H=H_\chi$ when the latter matrix satisfies (5.31), we with probability 1 terminate with $(H,\lambda)$ satisfying C, and the probability for the outlined procedure to terminate in the course of the first $M=1,2,...$ steps is at least $1-2^{-M}$.

When $\mu=0$, we have $\Theta=0$ (since $\mu=0$ implies $\phi(\mathrm{dg}(\Theta))=0$, which with $\Theta\succeq0$ is possible only when $\Theta=0$); thus, when $\mu=0$, we set $H=0_{m\times m}$ and $\lambda=0_{m\times1}$.

Note that the lemma states, essentially, that the cone $\mathbf H$ is a tight, up to a factor logarithmic in $m$, inner approximation of the set

$$\Big\{(\Theta,\mu):\ \exists(\lambda\in\mathbf R^m_+,\ H\in\mathbf R^{m\times m}):\ \Theta=H\mathrm{Diag}\{\lambda\}H^T,\ \pi(\mathrm{Col}_\ell[H])\le1,\ \ell\le m,\ \mu\ge\sum_\ell\lambda_\ell\Big\}.$$

For proof, see Section 5.4.2.
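Both conversions $(\Theta,\mu)\mapsto(H,\lambda)$ described in this section are short enough to sketch in code. The sketch below makes several assumptions that are not in the text: $\mathcal Z$ in the non-singleton case is taken to be the convex hull of finitely many points (rows of the hypothetical array `Z_pts`), the factor $Q$ is obtained from an eigenvalue decomposition, $V$ is built as the orthonormal DCT-II matrix, and a cap `max_tries` on the number of Rademacher draws is added for safety. All function names are illustrative.

```python
import numpy as np

def convert_singleton(Theta, z_bar):
    # Z = {z_bar}: S Theta S = U Diag{lam} U^T with S = Diag{sqrt(z_bar)}, H = S^{-1} U
    s = np.sqrt(z_bar)
    lam, U = np.linalg.eigh((s[:, None] * Theta) * s[None, :])
    return U / s[:, None], np.clip(lam, 0.0, None)

def dct_matrix(m):
    # Orthonormal DCT-II matrix; all entries have magnitude <= sqrt(2/m)
    k = np.arange(m)[:, None]
    j = np.arange(m)[None, :]
    V = np.sqrt(2.0 / m) * np.cos(np.pi * (2 * j + 1) * k / (2.0 * m))
    V[0, :] = np.sqrt(1.0 / m)
    return V

def pi_norm(h, Z_pts):
    # pi(h) = sqrt(max_{z in Z} sum_i z_i h_i^2); the max over conv(Z_pts) is attained at a listed point
    return np.sqrt(np.max(Z_pts @ (h ** 2)))

def convert_randomized(Theta, mu, Z_pts, rng, max_tries=1000):
    # Lemma 5.6(ii) for mu > 0: H_chi = sqrt(m/mu) Q Diag{chi} V, lam = (mu/m) [1;...;1]
    m = Theta.shape[0]
    w, E = np.linalg.eigh(Theta)
    Q = E * np.sqrt(np.clip(w, 0.0, None))[None, :]        # Theta = Q Q^T
    V = dct_matrix(m)
    lam = np.full(m, mu / m)
    for _ in range(max_tries):
        chi = rng.choice([-1.0, 1.0], size=m)              # Rademacher vector
        H = np.sqrt(m / mu) * (Q * chi[None, :]) @ V
        if all(pi_norm(H[:, l], Z_pts) <= 1.0 for l in range(m)):
            return H, lam
    raise RuntimeError("no admissible H found within max_tries draws")
```

A possible usage, with $\mu$ set to the smallest value allowed by (5.29): `mu = kappa * np.max(Z_pts @ np.diag(Theta))` with `kappa = 6 * np.log(2 * np.sqrt(3) * m**2)`, followed by `convert_randomized(Theta, mu, Z_pts, np.random.default_rng(0))`.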

5.1.5.3 Specifying functions M

In this section we focus on computationally efficient upper-bounding of maxima of quadratic forms over convex compact sets symmetric w.r.t. the origin by semidefinite relaxation, our goal being to specify a "presumably good" efficiently computable convex function $\mathcal M(\cdot)$ satisfying (5.23).

Cones compatible with convex sets. Given a nonempty convex compact set $\mathcal Y\subset\mathbf R^N$, we say that a cone $\mathbf Y$ is compatible with $\mathcal Y$ if

• $\mathbf Y$ is a closed convex computationally tractable cone contained in $\mathbf S^N_+\times\mathbf R_+$
• one has
$$\forall(V,\tau)\in\mathbf Y:\ \max_{y\in\mathcal Y}y^TVy\le\tau\tag{5.32}$$
• $\mathbf Y$ contains a pair $(V,\tau)$ with $V\succ0$
• relations $(V,\tau)\in\mathbf Y$ and $\tau'\ge\tau$ imply that $(V,\tau')\in\mathbf Y$.⁴

We call a cone $\mathbf Y$ sharp if $\mathbf Y$ is a closed convex cone contained in $\mathbf S^N_+\times\mathbf R_+$ and such that the only pair $(V,\tau)\in\mathbf Y$ with $\tau=0$ is the pair $(0,0)$, or, equivalently, a sequence $\{(V_i,\tau_i)\in\mathbf Y,\ i\ge1\}$ is bounded if and only if the sequence $\{\tau_i,\ i\ge1\}$ is bounded.

Note that whenever the linear span of $\mathcal Y$ is the entire $\mathbf R^N$, every cone compatible with $\mathcal Y$ is sharp.

Observe that if $\mathcal Y\subset\mathbf R^N$ is a nonempty convex compact set and $\mathbf Y$ is a cone compatible with a shift $\mathcal Y-a$ of $\mathcal Y$, then $\mathbf Y$ is compatible with $\mathcal Y_s$.

Indeed, when shifting a set $\mathcal Y$, its symmetrization $\frac12[\mathcal Y-\mathcal Y]$ remains intact, so that we can assume that $\mathbf Y$ is compatible with $\mathcal Y$. Now let $(V,\tau)\in\mathbf Y$ and $y,y'\in\mathcal Y$. We have

$$[y-y']^TV[y-y']+\underbrace{[y+y']^TV[y+y']}_{\ge0}=2[y^TVy+[y']^TVy']\le4\tau,$$

whence for $z=\frac12[y-y']$ it holds $z^TVz\le\tau$. Since every $z\in\mathcal Y_s$ is of the form $\frac12[y-y']$ with $y,y'\in\mathcal Y$, the claim follows. Note that the claim can be "nearly inverted": if $0\in\mathcal Y$ and $\mathbf Y$ is compatible with $\mathcal Y_s$, then the "widening" of $\mathbf Y$, that is, the cone

$$\mathbf Y^+=\{(V,\tau):\ (V,\tau/4)\in\mathbf Y\},$$

is compatible with $\mathcal Y$ (evident, since when $0\in\mathcal Y$, every vector from $\mathcal Y$ is proportional, with coefficient 2, to a vector from $\mathcal Y_s$).
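The symmetrization argument can be checked on a toy instance. In the sketch below, which is only an illustration, $\mathcal Y$ is a shifted unit Euclidean ball $a+\{v:\|v\|_2\le1\}$, and we use the fact that the pair $(V,\lambda_{\max}(V))$ with $V\succeq0$ satisfies (5.32) for the unshifted ball; the shift `a`, the matrix `V`, and the sampling scheme are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 3
a = np.array([5.0, -1.0, 2.0])               # hypothetical shift; Y = a + unit Euclidean ball
G = rng.standard_normal((N, N))
V = G @ G.T                                   # a positive semidefinite V
tau = np.linalg.eigvalsh(V)[-1]               # max over the unit ball of v^T V v

def sample_ball(rng, N):
    # A random point of the unit Euclidean ball (not uniformly distributed; enough for a check)
    v = rng.standard_normal(N)
    return v / max(1.0, np.linalg.norm(v))

worst = 0.0
identity_gap = 0.0
for _ in range(1000):
    y0, y1 = sample_ball(rng, N), sample_ball(rng, N)   # points of Y - a
    y, yp = a + y0, a + y1                              # points of Y
    z = 0.5 * (y - yp)                                  # a point of the symmetrization Y_s
    worst = max(worst, z @ V @ z)
    # parallelogram identity from the text, applied to the shifted-back points
    lhs = (y0 - y1) @ V @ (y0 - y1) + (y0 + y1) @ V @ (y0 + y1)
    rhs = 2.0 * (y0 @ V @ y0 + y1 @ V @ y1)
    identity_gap = max(identity_gap, abs(lhs - rhs))
```

The shift `a` cancels in `z`, which is exactly why compatibility with a shift of $\mathcal Y$ is enough to control quadratic forms on $\mathcal Y_s$.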

⁴ The latter requirement is "for free": passing from a computationally tractable closed convex cone $\mathbf Y\subset\mathbf S^N_+\times\mathbf R_+$ satisfying (5.32) to the cone $\mathbf Y^+=\{(V,\tau):\ \exists\bar\tau\le\tau:\ (V,\bar\tau)\in\mathbf Y\}$, we get a cone larger than $\mathbf Y$ and still compatible with $\mathcal Y$. It will be clear from the sequel that in our context, the larger is a cone compatible with $\mathcal Y$, the better.

Constructing functions $\mathcal M$. The role of compatibility in our context becomes clear from the following observation:

Proposition 5.7. In the situation described in Section 5.1.1, assume that we have at our disposal cones $\mathbf X$ and $\mathbf U$ compatible, respectively, with $\mathcal X_s$ and with the unit ball

$$\mathcal B_*=\{u\in\mathbf R^\nu:\ \|u\|_*\le1\}$$

of the norm $\|\cdot\|_*$ conjugate to the norm $\|\cdot\|$. Given $M\in\mathbf S^{\nu+n}$, let us set

$$\mathcal M(M)=\inf_{X,t,U,s}\left\{t+s:\ (X,t)\in\mathbf X,\ (U,s)\in\mathbf U,\ \mathrm{Diag}\{U,X\}\succeq M\right\}.\tag{5.33}$$

Then $\mathcal M$ is a real-valued efficiently computable convex function on $\mathbf S^{\nu+n}$ such that (5.23) takes place: for every $M\in\mathbf S^{n+\nu}$ it holds

$$\mathcal M(M)\ge\max_{[u;z]\in\mathcal B_*\times\mathcal X_s}[u;z]^TM[u;z].$$

In addition, when $\mathbf X$ and $\mathbf U$ are sharp, the infimum in (5.33) is achieved.

Proof is immediate. Given that the objective of the optimization problem specifying $\mathcal M(M)$ is nonnegative on the feasible set, the fact that $\mathcal M$ is real-valued is equivalent to problem's feasibility, and the latter is readily given by the fact that $\mathbf X$ is a cone containing a pair $(X,t)$ with $X\succ0$ and similarly for $\mathbf U$. Convexity of $\mathcal M$ is evident. To verify (5.23), let $(X,t,U,s)$ form a feasible solution to the optimization problem in (5.33). When $[u;z]\in\mathcal B_*\times\mathcal X_s$ we have

$$[u;z]^TM[u;z]\le u^TUu+z^TXz\le s+t,$$

where the first inequality is due to the $\succeq$-constraint in (5.33), and the second is due to the fact that $\mathbf U$ is compatible with $\mathcal B_*$, and $\mathbf X$ is compatible with $\mathcal X_s$. Since the resulting inequality holds true for all feasible solutions to the optimization problem in (5.33), (5.23) follows. Finally, when $\mathbf X$ and $\mathbf U$ are sharp, (5.33) is a feasible conic problem with bounded level sets of the objective and as such is solvable. ✷

5.1.5.4 Putting things together

The following statement combining the results of Propositions 5.7 and 5.5 summarizes our second approach to the design of the polyhedral estimate.

Proposition 5.8. In the situation of Section 5.1.1, assume that we have at our disposal cones $\mathbf{X}$ and $\mathbf{U}$ compatible, respectively, with $\mathcal{X}_s$ and with the unit ball $B_*$ of the norm conjugate to $\|\cdot\|$. Given reliability tolerance $\epsilon \in (0,1)$ along with a positive integer $N$ and a computationally tractable cone $\mathbf{H}$ satisfying Assumption C, consider the (clearly feasible) convex optimization problem
$$\mathrm{Opt} = \min_{\Theta,\mu,X,t,U,s}\left\{ f(t,s,\mu) := 2(t+s+\mu) :\ \begin{array}{l} (\Theta,\mu) \in \mathbf{H},\ (X,t) \in \mathbf{X},\ (U,s) \in \mathbf{U},\\[2pt] \begin{bmatrix} U & \frac12 B \\ \frac12 B^T & A^T\Theta A + X \end{bmatrix} \succeq 0 \end{array}\right\}. \tag{5.34}$$
Let $\Theta, \mu, X, t, U, s$ be a feasible solution to (5.34). Invoking Assumption C, we can convert, in a computationally efficient manner, $(\Theta,\mu)$ into $(H,\lambda)$ such that the columns of the $m \times N$ contrast matrix $H$ satisfy (5.1), $\Theta = H\,\mathrm{Diag}\{\lambda\}H^T$, and $\mu \ge \sum_\ell \lambda_\ell$. The $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate $\widehat{x}_H$ satisfies the bound
$$\mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}_H|\mathcal{X}] \le f(t,s,\mu). \tag{5.35}$$

In particular, we can build, in a computationally efficient manner, polyhedral estimates with risks arbitrarily close to Opt (and with risk Opt, provided that (5.34) is solvable).

Proof. Let $\Theta, \mu, X, t, U, s$ form a feasible solution to (5.34). By the semidefinite constraint in (5.34) (which is preserved when both off-diagonal blocks change sign) we have
$$0 \preceq \begin{bmatrix} U & -\frac12 B \\ -\frac12 B^T & A^T\Theta A + X \end{bmatrix} = \mathrm{Diag}\{U,X\} - \underbrace{\begin{bmatrix} 0 & \frac12 B \\ \frac12 B^T & -A^T\Theta A \end{bmatrix}}_{=:M},$$
whence for the function $\mathcal{M}$ defined in (5.33) one has
$$\mathcal{M}(M) \le t + s.$$
Since $\mathcal{M}$, by Proposition 5.7, satisfies (5.23), invoking Proposition 5.5 we arrive at
$$R[H] \le 2(\mu + \mathcal{M}(M)) \le f(t,s,\mu).$$
By Proposition 5.1 this implies the target relation (5.35). ✷

5.1.5.5 Compatibility: Basic examples and calculus

Our approach to the design of polyhedral estimates utilizing the recipe described in Proposition 5.8 relies upon our ability to equip convex “sets of interest” (in our context, these are the symmetrization $\mathcal{X}_s$ of the signal set and the unit ball $B_*$ of the norm conjugate to the norm $\|\cdot\|$) with compatible cones.5 Below, we discuss two principal sources of such cones, namely (a) spectratopes/ellitopes, and (b) absolute norms. More examples of compatible cones can be constructed using a “compatibility calculus.” Namely, let us assume that we are given a finite collection of convex sets (operands) and apply to them some basic operation, such as taking the intersection, or arithmetic sum, direct or inverse linear image, or convex hull of the union. It turns out that cones compatible with the results of such operations can be easily (in a fully algorithmic fashion) obtained from the cones compatible with the operands; see Section 5.1.8 for principal calculus rules.

5 Recall that we already know how to specify the second element of the construction, the cone $\mathbf{H}$.

In view of Proposition 5.8, the larger the cones $\mathbf{X}$ and $\mathbf{U}$ compatible with $\mathcal{X}_s$ and $B_*$ are, the better: the wider is the optimization domain in (5.34) and, consequently, the smaller is the (best) risk bound achievable with the recipe presented in the proposition. Given a convex compact set $\mathcal{Y} \subset \mathbf{R}^N$, the “ideal” (the largest) candidate to the role of the cone compatible with $\mathcal{Y}$ would be
$$\mathbf{Y}_* = \{(V,\tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \tau \ge \max_{y \in \mathcal{Y}} y^T V y\}.$$
However, this cone is typically intractable; therefore, we look for “as large as possible” tractable inner approximations of $\mathbf{Y}_*$.

5.1.5.5.A. Cones compatible with ellitopes/spectratopes are readily given by semidefinite relaxation. Specifically, when
$$\mathcal{Y} = \{y \in \mathbf{R}^N : \exists (r \in \mathcal{R}, z \in \mathbf{R}^K) : y = Mz,\ R_\ell^2[z] \preceq r_\ell I_{d_\ell},\ \ell \le L\} \qquad \Big[R_\ell[z] = \sum_j z_j R^{\ell j},\ R^{\ell j} \in \mathbf{S}^{d_\ell}\Big]$$
with our standard restrictions on $\mathcal{R}$, invoking Proposition 4.8 it is immediately seen that the set
$$\mathbf{Y} = \Big\{(V,\tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \exists \Lambda = \{\Lambda_\ell \in \mathbf{S}^{d_\ell}_+, \ell \le L\} : M^T V M \preceq \sum_\ell R^*_\ell[\Lambda_\ell],\ \phi_{\mathcal{R}}(\lambda[\Lambda]) \le \tau\Big\} \tag{5.36}$$
is a closed convex cone which is compatible with $\mathcal{Y}$; here, as usual,
$$[R^*_\ell[\Lambda_\ell]]_{ij} = \mathrm{Tr}(R^{\ell i}\Lambda_\ell R^{\ell j}),\quad \lambda[\Lambda] = [\mathrm{Tr}(\Lambda_1); ...; \mathrm{Tr}(\Lambda_L)],\quad \phi_{\mathcal{R}}(\lambda) = \max_{r \in \mathcal{R}} r^T\lambda.$$
Similarly, when $\mathcal{Y}$ is an ellitope:
$$\mathcal{Y} = \{y \in \mathbf{R}^N : \exists (r \in \mathcal{R}, z \in \mathbf{R}^K) : y = Mz,\ z^T R_\ell z \le r_\ell,\ \ell \le L\}$$
with our standard restrictions on $R_\ell$, invoking Proposition 4.6, the set
$$\mathbf{Y} = \Big\{(V,\tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \exists \lambda \in \mathbf{R}^L_+ : M^T V M \preceq \sum_\ell \lambda_\ell R_\ell,\ \phi_{\mathcal{R}}(\lambda) \le \tau\Big\} \tag{5.37}$$

is a closed convex cone which is compatible with $\mathcal{Y}$. In both cases, $\mathbf{Y}$ is sharp, provided that the image space of $M$ is the entire $\mathbf{R}^N$. Note that in both these cases $\mathbf{Y}$ is a reasonably tight inner approximation of $\mathbf{Y}_*$: whenever $(V,\tau) \in \mathbf{Y}_*$, we have $(V,\theta\tau) \in \mathbf{Y}$, with a moderate $\theta$ (specifically, $\theta = O(1)\ln\big(2\sum_\ell d_\ell\big)$ in the spectratopic, and $\theta = O(1)\ln(2L)$ in the ellitopic case; see Propositions 4.8, 4.6, respectively).

5.1.5.5.B. Compatibility via absolute norms.

Preliminaries. Recall that a norm $p(\cdot)$ on $\mathbf{R}^N$ is called absolute if $p(x)$ is a function of the vector $\mathrm{abs}[x] := [|x_1|; ...; |x_N|]$ of the magnitudes of entries in $x$. It is well known that an absolute norm $p$ is monotone on $\mathbf{R}^N_+$, so that $\mathrm{abs}[x] \le \mathrm{abs}[x']$ implies that $p(x) \le p(x')$, and that the norm
$$p_*(x) = \max_{y : p(y) \le 1} x^T y$$
conjugate to $p(\cdot)$ is absolute along with $p$. Let us say that an absolute norm $r(\cdot)$ fits an absolute norm $p(\cdot)$ on $\mathbf{R}^N$ if for every vector $x$ with $p(x) \le 1$ the entrywise square $[x]^2 = [x_1^2; ...; x_N^2]$ of $x$ satisfies $r([x]^2) \le 1$. For example, the largest norm $r(\cdot)$ which fits the absolute norm $p(\cdot) = \|\cdot\|_s$, $s \in [1,\infty]$, is
$$r(\cdot) = \begin{cases} \|\cdot\|_1, & 1 \le s \le 2 \\ \|\cdot\|_{s/2}, & s \ge 2 \end{cases}.$$

An immediate observation is that an absolute norm $p(\cdot)$ on $\mathbf{R}^N$ can be “lifted” to a norm on $\mathbf{S}^N$, specifically, the norm
$$p^+(Y) = p([p(\mathrm{Col}_1[Y]); ...; p(\mathrm{Col}_N[Y])]) : \mathbf{S}^N \to \mathbf{R}_+, \tag{5.38}$$
where $\mathrm{Col}_j[Y]$ is the $j$-th column of $Y$. It is immediately seen that when $p$ is an absolute norm, the right-hand side in (5.38) indeed is a norm on $\mathbf{S}^N$ satisfying the identity
$$p^+(xx^T) = p^2(x),\quad x \in \mathbf{R}^N. \tag{5.39}$$
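The two facts just stated, the identity (5.39) and the fitting of $\|\cdot\|_{\max[s/2,1]}$ to $\|\cdot\|_s$, can be checked numerically. The following is a minimal sketch, assuming $p = \|\cdot\|_s$; the helper names (`lp_norm`, `lifted_norm`, `fit_exponent`) and the random test vectors are illustrative choices, not part of the text.

```python
import random

def lp_norm(v, s):
    """l_s norm of a list of numbers; s = float('inf') gives the max-norm."""
    if s == float("inf"):
        return max(abs(a) for a in v)
    return sum(abs(a) ** s for a in v) ** (1.0 / s)

def lifted_norm(Y, s):
    """p^+(Y) as in (5.38): apply p to the vector of p-norms of the columns of Y."""
    n = len(Y)
    cols = [[Y[i][j] for i in range(n)] for j in range(n)]
    return lp_norm([lp_norm(c, s) for c in cols], s)

def fit_exponent(s):
    """Exponent q = max(s/2, 1) of the norm ||.||_q that fits ||.||_s."""
    return max(s / 2.0, 1.0)

random.seed(0)
max_gap = 0.0   # largest observed violation of p^+(x x^T) = p(x)^2
fits = True     # whether r([x]^2) <= 1 held for every sampled x with p(x) = 1
for s in (1.0, 1.5, 2.0, 3.0, float("inf")):
    for _ in range(200):
        x = [random.uniform(-1.0, 1.0) for _ in range(5)]
        scale = lp_norm(x, s)
        x = [a / scale for a in x]            # normalize so that p(x) = 1
        xxT = [[a * b for b in x] for a in x]
        max_gap = max(max_gap, abs(lifted_norm(xxT, s) - lp_norm(x, s) ** 2))
        squares = [a * a for a in x]          # entrywise square [x]^2
        fits = fits and lp_norm(squares, fit_exponent(s)) <= 1.0 + 1e-9
```

The identity holds because the $j$-th column of $xx^T$ is $x_j x$, whose $p$-norm is $|x_j|p(x)$.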

Absolute norms and compatibility. Our interest in absolute norms is motivated by the following immediate observation:

Observation 5.9. Let $p(\cdot)$ be an absolute norm on $\mathbf{R}^N$, and $r(\cdot)$ be another absolute norm which fits $p(\cdot)$, both norms being computationally tractable. These norms give rise to the computationally tractable and sharp closed convex cone
$$\mathbf{P} = \mathbf{P}_{p(\cdot),r(\cdot)} = \Big\{(V,\tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \exists (W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) : V \preceq W + \mathrm{Diag}\{w\},\ [p^+]_*(W) + r_*(w) \le \tau\Big\} \tag{5.40}$$
where $[p^+]_*(\cdot)$ is the norm on $\mathbf{S}^N$ conjugate to the norm $p^+(\cdot)$, and $r_*(\cdot)$ is the norm on $\mathbf{R}^N$ conjugate to the norm $r(\cdot)$, and this cone is compatible with the unit ball of the norm $p(\cdot)$ (and thus with any convex compact subset of this ball).

Verification is immediate. The fact that $\mathbf{P}$ is a computationally tractable and closed convex cone is evident. Now let $(V,\tau) \in \mathbf{P}$, so that $V \succeq 0$ and $V \preceq W + \mathrm{Diag}\{w\}$ with $[p^+]_*(W) + r_*(w) \le \tau$. For $x$ with $p(x) \le 1$ we have
$$\begin{array}{rcl} x^T V x &\le& x^T[W + \mathrm{Diag}\{w\}]x = \mathrm{Tr}(W[xx^T]) + w^T[x]^2 \\ &\le& p^+(xx^T)[p^+]_*(W) + r([x]^2)r_*(w) = p^2(x)[p^+]_*(W) + r([x]^2)r_*(w) \\ &\le& [p^+]_*(W) + r_*(w) \le \tau \end{array}$$
(we have used (5.39), (5.40), and the fact that $r(\cdot)$ fits $p(\cdot)$), whence $x^T V x \le \tau$ for all $x$ with $p(x) \le 1$. ✷

Let us look at the proposed construction in the case where $p(\cdot) = \|\cdot\|_s$, $s \in [1,\infty]$, and let $r(\cdot) = \|\cdot\|_{\bar s}$, $\bar s = \max[s/2, 1]$. Setting $s_* = \frac{s}{s-1}$, $\bar s_* = \frac{\bar s}{\bar s - 1}$, we clearly have
$$[p^+]_*(W) = \|W\|_{s_*} := \begin{cases} \Big(\sum_{i,j}|W_{ij}|^{s_*}\Big)^{1/s_*}, & s_* < \infty \\ \max_{i,j}|W_{ij}|, & s_* = \infty \end{cases},\qquad r_*(w) = \|w\|_{\bar s_*}, \tag{5.41}$$
resulting in
$$\mathbf{P}_s := \mathbf{P}_{\|\cdot\|_s,\|\cdot\|_{\bar s}} = \Big\{(V,\tau) : V \in \mathbf{S}^N_+,\ \exists (W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) : V \preceq W + \mathrm{Diag}\{w\},\ \|W\|_{s_*} + \|w\|_{\bar s_*} \le \tau\Big\}. \tag{5.42}$$
By Observation 5.9, $\mathbf{P}_s$ is compatible with the unit ball of the $\|\cdot\|_s$-norm on $\mathbf{R}^N$ (and therefore with every closed convex subset of this ball).

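When $s = 1$, that is, $s_* = \bar s_* = \infty$, (5.42) results in
$$\begin{array}{rcl} \mathbf{P}_1 &=& \Big\{(V,\tau) : V \succeq 0,\ \exists (W \in \mathbf{S}^N, w \in \mathbf{R}^N_+) : V \preceq W + \mathrm{Diag}\{w\},\ \|W\|_\infty + \|w\|_\infty \le \tau\Big\} \\ &=& \{(V,\tau) : V \succeq 0,\ \|V\|_\infty \le \tau\}, \end{array} \tag{5.43}$$
and it is easily seen that the situation is as good as it could be, namely,
$$\mathbf{P}_1 = \{(V,\tau) : V \succeq 0,\ \max_{\|x\|_1 \le 1} x^T V x \le \tau\}.$$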
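The claim that $\mathbf{P}_1$ is the largest cone compatible with the unit $\|\cdot\|_1$-ball rests on the fact that, for $V \succeq 0$, the maximum of $x^T V x$ over that ball equals the largest magnitude of entries of $V$, which in turn is the largest diagonal entry. Below is a small numerical illustration; the random matrix, the sample size, and the helper `quad_form` are assumptions made for the sketch only.

```python
import random

def quad_form(V, x):
    """Return x^T V x for a square matrix V given as a list of rows."""
    n = len(x)
    return sum(x[i] * V[i][j] * x[j] for i in range(n) for j in range(n))

random.seed(1)
n = 4
G = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
# V = G G^T is positive semidefinite by construction.
V = [[sum(G[i][k] * G[j][k] for k in range(n)) for j in range(n)] for i in range(n)]

max_abs_entry = max(abs(V[i][j]) for i in range(n) for j in range(n))  # ||V||_inf
max_diag = max(V[i][i] for i in range(n))  # value of x^T V x at the best vertex +-e_i

# x -> x^T V x is convex, so its maximum over the l_1 ball is attained at a vertex.
# Random points of the l_1 unit sphere should therefore never exceed max_diag.
best_sampled = 0.0
for _ in range(5000):
    x = [random.uniform(-1.0, 1.0) for _ in range(n)]
    l1 = sum(abs(a) for a in x)
    x = [a / l1 for a in x]
    best_sampled = max(best_sampled, quad_form(V, x))
```

For positive semidefinite $V$ one has $|V_{ij}| \le \sqrt{V_{ii}V_{jj}}$, which is why `max_abs_entry` and `max_diag` coincide.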

It can be shown (see Section 5.4.3) that when $s \in [2,\infty]$, and so $\bar s_* = \frac{s}{s-2}$, (5.42) results in
$$\mathbf{P}_s = \{(V,\tau) : V \succeq 0,\ \exists (w \in \mathbf{R}^N_+) : V \preceq \mathrm{Diag}\{w\}\ \&\ \|w\|_{\frac{s}{s-2}} \le \tau\}. \tag{5.44}$$
Note that
$$\mathbf{P}_2 = \{(V,\tau) : V \succeq 0,\ \|V\|_{2,2} \le \tau\},$$
and this is exactly the largest cone compatible with the unit Euclidean ball. When $s \ge 2$, the unit ball $\mathcal{Y}$ of the norm $\|\cdot\|_s$ is an ellitope:
$$\{y \in \mathbf{R}^N : \|y\|_s \le 1\} = \{y \in \mathbf{R}^N : \exists (t \ge 0, \|t\|_{\bar s} \le 1) : y^T R_\ell y := y_\ell^2 \le t_\ell,\ \ell \le L = N\},$$
so that one of the cones compatible with $\mathcal{Y}$ is given by (5.37) with the identity matrix in the role of $M$. As is immediately seen, the latter cone is nothing but the cone (5.44).

5.1.5.6 Near-optimality of polyhedral estimate in the spectratopic sub-Gaussian case

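As an instructive application of the approach developed so far, consider the special case of the estimation problem stated in Section 5.1.1, where

1. The signal set $\mathcal{X}$ and the unit ball $B_*$ of the norm conjugate to $\|\cdot\|$ are spectratopes:
$$\begin{array}{rcl} \mathcal{X} &=& \{x \in \mathbf{R}^n : \exists t \in \mathcal{T} : R_k^2[x] \preceq t_k I_{d_k},\ 1 \le k \le K\}, \\ B_* &=& \{z \in \mathbf{R}^\nu : \exists y \in \mathcal{Y} : z = My\},\quad \mathcal{Y} := \{y \in \mathbf{R}^q : \exists r \in \mathcal{R} : S_\ell^2[y] \preceq r_\ell I_{f_\ell},\ 1 \le \ell \le L\}, \end{array}$$
(cf. Assumptions A, B in Section 4.3.3.2; as always, we lose nothing assuming spectratope $\mathcal{X}$ to be basic).
2. For every $x \in \mathcal{X}$, observation noise $\xi_x$ is sub-Gaussian, i.e., $\xi_x \sim \mathcal{SG}(0, \sigma^2 I_m)$.

We are about to show that in the present situation, the polyhedral estimate constructed in Sections 5.1.5.2–5.1.5.4, i.e., yielded by the efficiently computable (high accuracy near-) optimal solution to the optimization problem (5.34), is near-optimal in the minimax sense. Given reliability tolerance $\epsilon \in (0,1)$, the recipe for constructing the $m \times m$ contrast matrix $H$ as presented in Proposition 5.8 is as follows:

• Set
$$Z = \{\vartheta^2[1; ...; 1]\},\quad \vartheta = \sigma\kappa,\quad \kappa = \sqrt{2\ln(2m/\epsilon)},$$
and utilize the construction from Section 5.1.5.2, thus arriving at the cone
$$\mathbf{H} = \{(\Theta,\mu) \in \mathbf{S}^m_+ \times \mathbf{R}_+ : \sigma^2\kappa^2\,\mathrm{Tr}(\Theta) \le \mu\}$$
satisfying the requirements of Assumption C.
• Specify the cones $\mathbf{X}$ and $\mathbf{U}$ compatible with $\mathcal{X}_s = \mathcal{X}$, and $B_*$, respectively, according to (5.36).

The resulting problem (5.34), after immediate straightforward simplifications, reads
$$\mathrm{Opt} = \min_{\Theta,U,\Lambda,\Upsilon}\left\{ 2\big[\phi_{\mathcal{R}}(\lambda[\Upsilon]) + \phi_{\mathcal{T}}(\lambda[\Lambda]) + \sigma^2\kappa^2\,\mathrm{Tr}(\Theta)\big] :\ \begin{array}{l} \Theta \succeq 0,\ U \succeq 0,\ \Lambda = \{\Lambda_k \succeq 0, k \le K\}, \\ \Upsilon = \{\Upsilon_\ell \succeq 0, \ell \le L\},\ M^T U M \preceq \sum_\ell S^*_\ell[\Upsilon_\ell], \\[2pt] \begin{bmatrix} U & \frac12 B \\ \frac12 B^T & A^T\Theta A + \sum_k R^*_k[\Lambda_k] \end{bmatrix} \succeq 0 \end{array}\right\} \tag{5.45}$$
where, as always,
$$\begin{array}{ll} [R^*_k[\Lambda_k]]_{ij} = \mathrm{Tr}(R^{ki}\Lambda_k R^{kj}) & \big[R_k[x] = \sum_i x_i R^{ki}\big], \\ {[S^*_\ell[\Upsilon_\ell]]}_{ij} = \mathrm{Tr}(S^{\ell i}\Upsilon_\ell S^{\ell j}) & \big[S_\ell[u] = \sum_i u_i S^{\ell i}\big], \end{array}$$
and
$$\lambda[\Lambda] = [\mathrm{Tr}(\Lambda_1); ...; \mathrm{Tr}(\Lambda_K)],\quad \lambda[\Upsilon] = [\mathrm{Tr}(\Upsilon_1); ...; \mathrm{Tr}(\Upsilon_L)],\quad \phi_W(f) = \max_{w \in W} w^T f.$$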

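Let now
$$\mathrm{RiskOpt}_\epsilon = \inf_{\widehat{x}(\cdot)}\ \sup_{x \in \mathcal{X}}\ \inf\Big\{\rho : \mathrm{Prob}_{\xi \sim \mathcal{N}(0,\sigma^2 I)}\{\|Bx - \widehat{x}(Ax + \xi)\| > \rho\} \le \epsilon\ \ \forall x \in \mathcal{X}\Big\}$$
be the minimax optimal $(\epsilon,\|\cdot\|)$-risk of estimating $Bx$ in the Gaussian observation scheme where $\xi_x \sim \mathcal{N}(0,\sigma^2 I_m)$ independently of $x \in \mathcal{X}$.

Proposition 5.10. When $\epsilon \le 1/8$, the polyhedral estimate $\widehat{x}_H$ yielded by a feasible near-optimal, in terms of the objective, solution to problem (5.45) is minimax optimal within the logarithmic factor, namely
$$\begin{array}{rcl} \mathrm{Risk}_{\epsilon,\|\cdot\|}[\widehat{x}_H|\mathcal{X}] &\le& O(1)\sqrt{\ln\Big(\sum_k d_k\Big)\ln\Big(\sum_\ell f_\ell\Big)\ln(2m/\epsilon)}\ \mathrm{RiskOpt}_{\frac18} \\ &\le& O(1)\sqrt{\ln\Big(\sum_k d_k\Big)\ln\Big(\sum_\ell f_\ell\Big)\ln(2m/\epsilon)}\ \mathrm{RiskOpt}_\epsilon \end{array}$$
where $O(1)$ is an absolute constant.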

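See Section 5.4.4 for the proof.

Discussion. It is worth mentioning that the approach described in Section 5.1.4 is complementary to the approach developed in this section. In fact, it is easily seen that the bound Opt for the risk of the polyhedral estimate stemming from (5.34) is suboptimal in the simple situation described in the motivating example from Section 5.1.1. Indeed, let $\mathcal{X}$ be the unit $\|\cdot\|_1$-ball, $\|\cdot\| = \|\cdot\|_2$, and let us consider the problem of estimating $x \in \mathcal{X}$ from the direct observation $\omega = x + \xi$ with Gaussian observation noise $\xi \sim \mathcal{N}(0,\sigma^2 I)$. We equip the ball $B_* = \{u \in \mathbf{R}^n : \|u\|_2 \le 1\}$ with the cone
$$\mathbf{U} = \mathbf{P}_2 = \{(U,\tau) : U \succeq 0,\ \|U\|_{2,2} \le \tau\}$$
and $\mathcal{X}$ with the cone
$$\mathbf{X} = \mathbf{P}_1 = \{(X,t) : X \succeq 0,\ \|X\|_\infty \le t\}$$
(note that both cones are the largest w.r.t. inclusion cones compatible with the respective sets). The corresponding problem (5.34) reads
$$\begin{array}{rcl} \mathrm{Opt} &=& \min\limits_{\Theta,X,U}\left\{ 2\big[\kappa^2\sigma^2\,\mathrm{Tr}(\Theta) + \max_i X_{ii} + \|U\|_{2,2}\big] :\ \begin{array}{l} \Theta \succeq 0,\ X \succeq 0,\ U \succeq 0, \\ \begin{bmatrix} U & \frac12 I_n \\ \frac12 I_n & \Theta + X \end{bmatrix} \succeq 0 \end{array}\right\} \\[12pt] &=& \min\limits_{\Theta,X,\tau}\left\{ 2\big[\kappa^2\sigma^2\,\mathrm{Tr}(\Theta) + \max_i X_{ii} + \tau\big] :\ \begin{array}{l} \Theta \succeq 0,\ X \succeq 0, \\ \begin{bmatrix} \tau I_n & \frac12 I_n \\ \frac12 I_n & \Theta + X \end{bmatrix} \succeq 0 \end{array}\right\}. \end{array} \tag{5.46}$$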

Observe that every $n \times n$ matrix of the form $Q = EP$, where $E$ is diagonal with diagonal entries $\pm 1$, and $P$ is a permutation matrix, induces a symmetry $(\Theta, X, \tau) \mapsto (Q\Theta Q^T, QXQ^T, \tau)$ of the second optimization problem in (5.46), that is, a transformation which maps the feasible set onto itself and keeps the objective intact. Since the problem is convex and solvable, we conclude that it has an optimal solution which remains intact under the symmetries in question, i.e., a solution with scalar matrices $\Theta = \theta I_n$ and $X = u I_n$. As a result,
$$\mathrm{Opt} = \min_{\theta \ge 0, u \ge 0, \tau}\left\{2(\kappa^2\sigma^2 n\theta + u + \tau) : \tau(\theta + u) \ge \tfrac14\right\} = 2\min\left[\kappa\sigma\sqrt{n},\ 1\right]. \tag{5.47}$$
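The closed form in (5.47) can be sanity-checked by brute force. The sketch below is an illustration only: it writes $c = \kappa^2\sigma^2 n$, eliminates $\tau$ (for fixed $\theta, u$ the best choice is $\tau = 1/(4(\theta+u))$), and searches a grid whose step and range are arbitrary assumptions.

```python
import math

def opt_grid(c, step=0.01, upper=3.0):
    """Grid-search 2*(c*theta + u + tau) s.t. tau*(theta + u) >= 1/4, theta, u >= 0.

    c stands for kappa^2 * sigma^2 * n. For fixed (theta, u) the smallest feasible
    tau is 1 / (4*(theta + u)), so only theta and u are searched.
    """
    m = int(round(upper / step))
    best = float("inf")
    for i in range(m + 1):
        theta = i * step
        for j in range(m + 1):
            u = j * step
            if theta + u == 0.0:
                continue  # infeasible: tau would have to be infinite
            tau = 1.0 / (4.0 * (theta + u))
            best = min(best, 2.0 * (c * theta + u + tau))
    return best

def opt_closed_form(c):
    """Right-hand side of (5.47), 2*min(kappa*sigma*sqrt(n), 1), with c = (kappa*sigma)^2 * n."""
    return 2.0 * min(math.sqrt(c), 1.0)

# Map each test value of c to the pair (grid value, closed-form value).
results = {c: (opt_grid(c), opt_closed_form(c)) for c in (0.25, 1.0, 4.0)}
```

When $c < 1$ the minimum is attained with $u = 0$, $\theta = 1/(2\sqrt{c})$; when $c > 1$, with $\theta = 0$, $u = 1/2$, which matches the two branches of the minimum in (5.47).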

A similar derivation shows that the value Opt remains intact if we replace the set $\mathcal{X} = \{x : \|x\|_1 \le 1\}$ with $\mathcal{X} = \{x : \|x\|_s \le 1\}$, $s \in [1,2]$, and the cone $\mathbf{X} = \mathbf{P}_1$ with $\mathbf{X} = \mathbf{P}_s$; see (5.42). Since the $\Theta$-component of an optimal solution to (5.46) can be selected to be scalar, the contrast matrix $H$ we end up with can be selected to be the unit matrix. An unpleasant observation is that when $s < 2$, the quantity Opt given by (5.47) “heavily overestimates” the actual risk of the polyhedral estimate with $H = I_n$. Indeed, the analysis of this estimate in Section 5.1.4 results in the risk bound (up to a factor logarithmic in $n$) $\min[\sigma^{1-s/2}, \sigma\sqrt{n}]$, which can be much less than $\mathrm{Opt} = 2\min[\kappa\sigma\sqrt{n}, 1]$, e.g., in the case of large $n$ and $\sigma\sqrt{n} = O(1)$.

5.1.6 Assembling estimates: Contrast aggregation

The good news is that whenever the approaches to the design of polyhedral estimates presented in Sections 5.1.4 and 5.1.5 are applicable, they can be utilized simultaneously. The underlying observation is that

(!) In the problem setting described in Section 5.1.2, a collection of $K$ candidate polyhedral estimates can be assembled into a single polyhedral estimate with the (upper bound on the) risk, as given by Proposition 5.1, being nearly the minimum of the risks of the estimates we aggregate.

Indeed, given an observation scheme (that is, a collection of probability distributions $P_x$ of noises $\xi_x$, $x \in \mathcal{X}$), assume we have at our disposal norms $\pi_\delta(\cdot) : \mathbf{R}^m \to \mathbf{R}$ parameterized by $\delta \in (0,1)$ such that $\pi_\delta(h)$, for every $h$, is the larger the smaller $\delta$ is, and
$$\forall (x \in \mathcal{X}, \delta \in (0,1), h \in \mathbf{R}^m) : \pi_\delta(h) \le 1 \Rightarrow \mathrm{Prob}_{\xi \sim P_x}\{\xi : |h^T\xi| > 1\} \le \delta.$$
Assume also (as is indeed the case in all our constructions) that we ensure (5.1) by imposing on the columns $h_\ell$ of an $m \times N$ contrast matrix $H$ the restrictions $\pi_{\epsilon/N}(h_\ell) \le 1$. Now suppose that given risk tolerance $\epsilon \in (0,1)$, we have generated $K$ candidate contrast matrices $H_k \in \mathbf{R}^{m \times N_k}$ such that
$$\pi_{\epsilon/N_k}(\mathrm{Col}_j[H_k]) \le 1,\quad j \le N_k,$$
so that the $(\epsilon,\|\cdot\|)$-risk of the polyhedral estimate yielded by the contrast matrix $H_k$ does not exceed
$$R_k = \max_x\left\{\|Bx\| : x \in 2\mathcal{X}_s,\ \|H_k^T A x\|_\infty \le 2\right\}.$$
Let us combine the contrast matrices $H_1, ..., H_K$ into a single contrast matrix $H$ with $N = N_1 + ... + N_K$ columns by normalizing the columns of the concatenated matrix $[H_1, ..., H_K]$ to have $\pi_{\epsilon/N}$-norms equal to 1, so that
$$H = [\bar H_1, ..., \bar H_K],\quad \mathrm{Col}_j[\bar H_k] = \theta_{jk}\,\mathrm{Col}_j[H_k]\ \ \forall (k \le K, j \le N_k)$$
with
$$\theta_{jk} = \frac{1}{\pi_{\epsilon/N}(\mathrm{Col}_j[H_k])} \ge \vartheta_k := \min_{h \ne 0}\frac{\pi_{\epsilon/N_k}(h)}{\pi_{\epsilon/N}(h)},$$
where the concluding $\ge$ is due to $\pi_{\epsilon/N_k}(\mathrm{Col}_j[H_k]) \le 1$. We claim that in terms of $(\epsilon,\|\cdot\|)$-risk, the polyhedral estimate yielded by $H$ is “almost as good” as the best of the polyhedral estimates yielded by the contrast matrices $H_1, ..., H_K$, specifically,6
$$R[H] := \max_x\left\{\|Bx\| : x \in 2\mathcal{X}_s,\ \|H^T A x\|_\infty \le 2\right\} \le \min_k \vartheta_k^{-1}R_k.$$
The justification is readily given by the following observation: when $\vartheta \in (0,1)$, we have
$$R_{k,\vartheta} := \max_x\left\{\|Bx\| : x \in 2\mathcal{X}_s,\ \|H_k^T A x\|_\infty \le 2/\vartheta\right\} \le R_k/\vartheta.$$

Indeed, when $x$ is a feasible solution to the maximization problem specifying $R_{k,\vartheta}$, $\vartheta x$ is a feasible solution to the problem specifying $R_k$, implying that $\vartheta\|Bx\| \le R_k$. It remains to note that we clearly have $R[H] \le \min_k R_{k,\vartheta_k}$.

The bottom line is that the aggregation just described of contrast matrices $H_1, ..., H_K$ into a single contrast matrix $H$ results in a polyhedral estimate which in terms of the upper bound $R[\cdot]$ on its $(\epsilon,\|\cdot\|)$-risk is, up to the factor $\bar\vartheta = \max_k \vartheta_k^{-1}$, not worse than the best of the $K$ estimates yielded by the original contrast matrices. Consequently, if $\pi_\delta(\cdot)$ grows slowly as $\delta$ decreases, the “price” $\bar\vartheta$ of assembling the original estimates is quite moderate. For example, in our basic cases (sub-Gaussian, Discrete, and Poisson), $\bar\vartheta$ is logarithmic in $\max_k N_k^{-1}(N_1 + ... + N_K)$, and $\bar\vartheta = 1 + o(1)$ as $\epsilon \to +0$ for $K, N_1, ..., N_K$ fixed.

6 This is the precise “quantitative expression” of the observation (!).
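To get a feeling for the size of the price $\bar\vartheta$, here is a minimal sketch for the sub-Gaussian case. It assumes the norm $\pi_\delta(h) = \sigma\sqrt{2\ln(2/\delta)}\,\|h\|_2$, which is consistent with the choice $\kappa = \sqrt{2\ln(2m/\epsilon)}$ used earlier but is stated here as an assumption; the function names and the column counts are hypothetical.

```python
import math

def pi_scale(delta, sigma=1.0):
    """Assumed sub-Gaussian norm: pi_delta(h) = pi_scale(delta) * ||h||_2."""
    return sigma * math.sqrt(2.0 * math.log(2.0 / delta))

def aggregation_price(column_counts, eps):
    """Return (varthetas, price), where vartheta_k = pi_{eps/N_k} / pi_{eps/N}
    (the ratio does not depend on h for this family of norms) and
    price = max_k 1 / vartheta_k."""
    total = sum(column_counts)
    varthetas = [pi_scale(eps / n_k) / pi_scale(eps / total) for n_k in column_counts]
    return varthetas, max(1.0 / v for v in varthetas)

# Three candidate contrast matrices with 32, 64 and 1024 columns (illustrative sizes).
varthetas, price = aggregation_price([32, 64, 1024], eps=0.1)
# The price approaches 1 as eps -> +0 with the column counts fixed.
_, price_small_eps = aggregation_price([32, 64, 1024], eps=1e-12)
```

Under this assumption $\vartheta_k^{-1} = \sqrt{\ln(2N/\epsilon)/\ln(2N_k/\epsilon)}$, so the price depends on the column counts only through logarithms.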

5.1.7 Numerical illustration

We are about to illustrate the numerical performance of polyhedral estimates by comparing it to the performance of a “presumably good” linear estimate. Our setup is deliberately simple: the signal set $\mathcal{X}$ is just the unit box $\{x \in \mathbf{R}^n : \|x\|_\infty \le 1\}$, and $B \in \mathbf{R}^{n \times n}$ is “numerical double integration”: for a $\delta > 0$,
$$B_{ij} = \begin{cases} \delta^2(i - j + 1), & j \le i \\ 0, & j > i \end{cases},$$
so that $x$, modulo boundary effects, is the second order finite difference derivative of $w = Bx$,
$$x_i = \frac{w_i - 2w_{i-1} + w_{i-2}}{\delta^2},\quad 2 < i \le n;$$
and $Ax$ is comprised of $m$ randomly selected entries of $Bx$. The observation is
$$\omega = Ax + \xi,\quad \xi \sim \mathcal{N}(0,\sigma^2 I_m),$$
and the recovery norm is $\|\cdot\|_2$. In other words, we want to recover a restriction of a twice differentiable function of one variable on the $n$-point regular grid on the segment $\Delta = [0, n\delta]$ from noisy observations of this restriction taken along $m$ randomly selected points of the grid. A priori information on the function is that the magnitude of its second order derivative does not exceed 1. Note that in the considered situation both the linear estimate $\widehat{x}_H$ yielded by Proposition 4.14 and the polyhedral estimate $\widehat{x}_H$ yielded by Proposition 5.7 are near-optimal in the minimax sense in terms of their $\|\cdot\|_2$- or $(\epsilon,\|\cdot\|_2)$-risk.

In the experiments reported in Figure 5.1, we used $n = 64$, $m = 32$, and $\delta = 4/n$ (i.e., $\Delta = [0,4]$); the reliability parameter for the polyhedral estimate was set to $\epsilon = 0.1$. For each noise level $\sigma \in \{0.1, 0.01, 0.001, 0.0001\}$ we generate 20 random signals $x$ from $\mathcal{X}$ and record the $\|\cdot\|_2$-recovery errors of the linear and the polyhedral estimates. In addition to testing the nearly optimal polyhedral estimate PolyI yielded by Proposition 5.8 as applied in the framework of item 5.1.5.5.A, we also record the performance of the polyhedral estimate PolyII yielded by the construction from Section 5.1.4. The observed $\|\cdot\|_2$-recovery errors of the three estimates are plotted in Figure 5.1. All three estimates exhibit similar empirical performance in these simulations.
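The data of this experiment are easy to reproduce. The following sketch builds the double-integration matrix $B$, checks the second-difference relation between $w = Bx$ and $x$, and generates one noisy observation. How the random signals were drawn is not specified above, so sampling uniformly from the unit box is an assumption, as are the helper names; the estimates themselves are not computed here.

```python
import random

def double_integration_matrix(n, delta):
    """B[i][j] = delta^2 * (i - j + 1) for j <= i, else 0 (1-based i, j as in the text)."""
    return [[delta ** 2 * (i - j + 1) if j <= i else 0.0 for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def matvec(B, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(b * a for b, a in zip(row, x)) for row in B]

random.seed(2)
n, delta = 64, 4.0 / 64
B = double_integration_matrix(n, delta)
x = [random.uniform(-1.0, 1.0) for _ in range(n)]   # assumed: uniform signal from the unit box
w = matvec(B, x)

# Second-order finite difference of w recovers x_i for 2 < i <= n (0-based index >= 2).
x_back = [(w[i] - 2.0 * w[i - 1] + w[i - 2]) / delta ** 2 for i in range(2, n)]
max_err = max(abs(a - b) for a, b in zip(x_back, x[2:]))

# Observation: m randomly selected entries of Bx, corrupted by N(0, sigma^2) noise.
m, sigma = 32, 0.01
rows = sorted(random.sample(range(n), m))
omega = [w[i] + random.gauss(0.0, sigma) for i in rows]
```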
However, when the noise level becomes small, polyhedral estimates seem to outperform the linear one. In addition, the estimate PolyII seems to “work” better than or, at the very worst, similarly to PolyI in spite of the fact that in the situation in question the estimate PolyI, in contrast to PolyII, is provably near-optimal.

5.1.8 Calculus of compatibility

The principal rules of the calculus of compatibility are as follows (verification of the rules is straightforward and is therefore skipped):

1. [passing to a subset] When $\mathcal{Y}' \subset \mathcal{Y}$ are convex compact subsets of $\mathbf{R}^N$ and a cone $\mathbf{Y}$ is compatible with $\mathcal{Y}$, the cone is compatible with $\mathcal{Y}'$ as well.

Figure 5.1: Recovery errors for the near-optimal linear estimate (circles) and for polyhedral estimates yielded by Proposition 5.8 (PolyI, pentagrams) and by the construction from Section 5.1.4 (PolyII, triangles), 20 simulations per each value of σ. [Four panels: σ = 0.1, σ = 0.01, σ = 0.001, σ = 0.0001.]

2. [finite intersection] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j \subset \mathbf{R}^N$, $j = 1, ..., J$. Then the cone
$$\mathbf{Y} = \mathrm{cl}\Big\{(V,\tau) \in \mathbf{S}^N_+ \times \mathbf{R}_+ : \exists ((V_j,\tau_j) \in \mathbf{Y}^j, j \le J) : V \preceq \sum_j V_j,\ \sum_j \tau_j \le \tau\Big\}$$
is compatible with $\mathcal{Y} = \bigcap_j \mathcal{Y}_j$. The closure operation can be skipped when all cones $\mathbf{Y}^j$ are sharp, in which case $\mathbf{Y}$ is sharp as well.

3. [convex hulls of finite union] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j \subset \mathbf{R}^N$, $j = 1, ..., J$, and let there exist $(V,\tau)$ such that $V \succ 0$ and
$$(V,\tau) \in \mathbf{Y} := \bigcap_j \mathbf{Y}^j.$$
Then $\mathbf{Y}$ is compatible with $\mathcal{Y} = \mathrm{Conv}\{\bigcup_j \mathcal{Y}_j\}$ and, in addition, is sharp provided that at least one of the $\mathbf{Y}^j$ is sharp.

4. [direct product] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j \subset \mathbf{R}^{N_j}$, $j = 1, ..., J$. Then the cone
$$\mathbf{Y} = \Big\{(V,\tau) \in \mathbf{S}^{N_1 + ... + N_J}_+ \times \mathbf{R}_+ : \exists (V_j,\tau_j) \in \mathbf{Y}^j : V \preceq \mathrm{Diag}\{V_1, ..., V_J\}\ \&\ \tau \ge \sum_j \tau_j\Big\}$$
is compatible with $\mathcal{Y} = \mathcal{Y}_1 \times ... \times \mathcal{Y}_J$. This cone is sharp, provided that all the $\mathbf{Y}^j$ are so.

5. [linear image] Let cone $\mathbf{Y}$ be compatible with convex compact set $\mathcal{Y} \subset \mathbf{R}^N$, let $A$ be a $K \times N$ matrix, and let $\mathcal{Z} = A\mathcal{Y}$. The cone
$$\mathbf{Z} = \mathrm{cl}\{(V,\tau) \in \mathbf{S}^K_+ \times \mathbf{R}_+ : \exists U \succeq A^T V A : (U,\tau) \in \mathbf{Y}\}$$
is compatible with $\mathcal{Z}$. The closure operation can be skipped whenever $\mathbf{Y}$ is either sharp or complete, completeness meaning that $(V,\tau) \in \mathbf{Y}$ and $0 \preceq V' \preceq V$ imply that $(V',\tau) \in \mathbf{Y}$. The cone $\mathbf{Z}$ is sharp, provided $\mathbf{Y}$ is so and the rank of $A$ is $K$.

6. [inverse linear image] Let cone $\mathbf{Y}$ be compatible with convex compact set $\mathcal{Y} \subset \mathbf{R}^N$, let $A$ be an $N \times K$ matrix with trivial kernel, and let $\mathcal{Z} = A^{-1}\mathcal{Y} := \{z \in \mathbf{R}^K : Az \in \mathcal{Y}\}$. The cone
$$\mathbf{Z} = \mathrm{cl}\{(V,\tau) \in \mathbf{S}^K_+ \times \mathbf{R}_+ : \exists U : A^T U A \succeq V\ \&\ (U,\tau) \in \mathbf{Y}\}$$
is compatible with $\mathcal{Z}$. The closure operation can be skipped whenever $\mathbf{Y}$ is sharp, in which case $\mathbf{Z}$ is sharp as well.

7. [arithmetic summation] Let cones $\mathbf{Y}^j$ be compatible with convex compact sets $\mathcal{Y}_j \subset \mathbf{R}^N$, $j = 1, ..., J$. Then the arithmetic sum $\mathcal{Y} = \mathcal{Y}_1 + ... + \mathcal{Y}_J$ of the sets $\mathcal{Y}_j$ can be equipped with a compatible cone readily given by the cones $\mathbf{Y}^j$; this cone is sharp, provided all the $\mathbf{Y}^j$ are so. Indeed, the arithmetic sum of the $\mathcal{Y}_j$ is the linear image of the direct product of the $\mathcal{Y}_j$'s under the mapping $[y^1; ...; y^J] \mapsto y^1 + ... + y^J$, and it remains to combine rules 4 and 5; note that the cone yielded by rule 4 is complete, so that when applying rule 5, the closure operation can be skipped.

5.2 RECOVERING SIGNALS FROM NONLINEAR OBSERVATIONS BY STOCHASTIC OPTIMIZATION

The “common denominator” of all estimation problems considered so far in this chapter is that what we observed was obtained by adding noise to the linear image of the unknown signal to be recovered. In this section we consider the problem of signal estimation in the case where the observation is obtained by adding noise to a nonlinear transformation of the signal.

5.2.1 Problem setting

A motivating example for what follows is provided by the logistic regression model, where

• the unknown signal to be recovered is a vector $x$ known to belong to a given signal set $\mathcal{X} \subset \mathbf{R}^n$, which we assume to be a nonempty convex compact set;
• our observation $\omega^K = \{\omega_k = (\eta_k, y_k), 1 \le k \le K\}$ stemming from a signal $x$ is as follows:
  – the regressors $\eta_1, ..., \eta_K$ are i.i.d. realizations of an $n$-dimensional random vector $\eta$ with distribution $Q$ independent of $x$ and such that $Q$ possesses a finite and positive definite matrix $\mathbf{E}_{\eta \sim Q}\{\eta\eta^T\}$ of second moments;
  – the labels $y_k$ are generated as follows: $y_k$ is the Bernoulli random variable independent of the “history” $\eta_1, ..., \eta_{k-1}, y_1, ..., y_{k-1}$, and the conditional, given $\eta_k$, probability for $y_k$ to be 1 is $\phi(\eta_k^T x)$, where
$$\phi(s) = \frac{\exp\{s\}}{1 + \exp\{s\}}.$$

In this model, the standard (and very well-studied) approach to estimating the signal $x$ underlying the observations is to use the Maximum Likelihood (ML) estimate: the logarithm of the conditional, given $\eta_k$, $1 \le k \le K$, probability of getting the observed labels as a function of a candidate signal $z$ is
$$\begin{array}{rcl} \ell(z, \omega^K) &=& \sum\limits_{k=1}^K \left[y_k \ln\big(\phi(\eta_k^T z)\big) + (1 - y_k)\ln\big(1 - \phi(\eta_k^T z)\big)\right] \\ &=& \Big[\sum\limits_k y_k\eta_k\Big]^T z - \sum\limits_k \ln\big(1 + \exp\{\eta_k^T z\}\big), \end{array} \tag{5.48}$$
and the ML estimate of the “true” signal $x$ underlying our observation $\omega^K$ is obtained by maximizing the log-likelihood $\ell(z, \omega^K)$ over $z \in \mathcal{X}$,
$$\widehat{x}_{\mathrm{ML}}(\omega^K) \in \mathop{\mathrm{Argmax}}_{z \in \mathcal{X}} \ell(z, \omega^K), \tag{5.49}$$
which is a convex optimization problem.
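The two expressions for the log-likelihood in (5.48) can be compared numerically, and concavity of $\ell(\cdot, \omega^K)$ (which is what makes (5.49) a convex problem) can be spot-checked at a midpoint. This is a minimal sketch on synthetic data; the dimension, sample size, Gaussian regressors, and test points are illustrative assumptions.

```python
import math
import random

def phi(s):
    """Logistic function exp(s) / (1 + exp(s))."""
    return 1.0 / (1.0 + math.exp(-s))

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def loglik_direct(z, data):
    """First expression in (5.48)."""
    return sum(y * math.log(phi(dot(eta, z))) + (1 - y) * math.log(1.0 - phi(dot(eta, z)))
               for eta, y in data)

def loglik_compact(z, data):
    """Second expression in (5.48): linear term minus the sum of ln(1 + exp(eta^T z))."""
    n = len(z)
    weighted = [sum(y * eta[i] for eta, y in data) for i in range(n)]
    return dot(weighted, z) - sum(math.log(1.0 + math.exp(dot(eta, z))) for eta, _ in data)

random.seed(3)
x_true = [0.5, -0.3, 0.2]
data = []
for _ in range(200):
    eta = [random.gauss(0.0, 1.0) for _ in x_true]       # assumed Gaussian regressors
    y = 1 if random.random() < phi(dot(eta, x_true)) else 0
    data.append((eta, y))

z1, z2 = [0.1, 0.4, -0.2], [-0.3, 0.2, 0.6]
gap = abs(loglik_direct(z1, data) - loglik_compact(z1, data))

# Midpoint concavity check: l((z1 + z2)/2) >= (l(z1) + l(z2)) / 2.
mid = [(a + b) / 2.0 for a, b in zip(z1, z2)]
concave_at_midpoint = (loglik_compact(mid, data)
                       >= 0.5 * (loglik_compact(z1, data) + loglik_compact(z2, data)) - 1e-9)
```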

The problem we intend to consider (referred to as the generalized linear model (GLM) in Statistics) can be viewed as a natural generalization of the logistic regression just presented and is as follows:

Our observation depends on an unknown signal $x$ known to belong to a given convex compact set $\mathcal{X} \subset \mathbf{R}^n$ and is
$$\omega^K = \{\omega_k = (\eta_k, y_k), 1 \le k \le K\} \tag{5.50}$$
with $\omega_k$, $1 \le k \le K$, which are i.i.d. realizations of a random pair $(\eta, y)$ with the distribution $P_x$ such that

• the regressor $\eta$ is a random $n \times m$ matrix with some probability distribution $Q$ independent of $x$;
• the label $y$ is an $m$-dimensional random vector such that the conditional distribution of $y$ given $\eta$ induced by $P_x$ has the expectation $f(\eta^T x)$:
$$\mathbf{E}_{x|\eta}\{y\} = f(\eta^T x), \tag{5.51}$$
where $\mathbf{E}_{x|\eta}\{y\}$ is the conditional expectation of $y$ given $\eta$ stemming from the distribution $P_x$ of $\omega = (\eta, y)$, and $f(\cdot) : \mathbf{R}^m \to \mathbf{R}^m$ (“link function”) is a given mapping.

Note that the logistic regression model corresponds to the case where $m = 1$, $f(s) = \frac{\exp\{s\}}{1 + \exp\{s\}}$, and $y$ takes values 0, 1, with the conditional probability of taking value 1 given $\eta$ equal to $f(\eta^T x)$. Another example is provided by the model
$$y = f(\eta^T x) + \xi,$$
where $\xi$ is a random vector with zero mean independent of $\eta$, say, $\xi \sim \mathcal{N}(0, \sigma^2 I_m)$. Note that in the latter case the ML estimate of the signal $x$ underlying the observations is
$$\widehat{x}_{\mathrm{ML}}(\omega^K) \in \mathop{\mathrm{Argmin}}_{z \in \mathcal{X}} \sum_k \|y_k - f(\eta_k^T z)\|_2^2. \tag{5.52}$$
In contrast to what happens with logistic regression, now the optimization problem (“Nonlinear Least Squares”) responsible for the ML estimate typically is nonconvex and can be computationally difficult. Following [140], we intend to impose on the data of the estimation problem we have just described (namely, on $\mathcal{X}$, $f(\cdot)$, and the distributions $P_x$, $x \in \mathcal{X}$, of the pair $(\eta, y)$) assumptions which allow us to reduce our estimation problem to a problem with convex structure: a strongly monotone variational inequality represented by a stochastic oracle. At the end of the day, this will lead to a consistent estimate of the signal, with explicit “finite sample” accuracy guarantees.

5.2.2 Assumptions

Preliminaries: Monotone vector fields. A monotone vector field on $\mathbf{R}^m$ is a single-valued everywhere defined mapping $g(\cdot) : \mathbf{R}^m \to \mathbf{R}^m$ which possesses the monotonicity property
$$[g(z) - g(z')]^T[z - z'] \ge 0\ \ \forall z, z' \in \mathbf{R}^m.$$
We say that such a field is monotone with modulus $\kappa \ge 0$ on a closed convex set $Z \subset \mathbf{R}^m$ if
$$[g(z) - g(z')]^T[z - z'] \ge \kappa\|z - z'\|_2^2\ \ \forall z, z' \in Z,$$
and say that $g$ is strongly monotone on $Z$ if the modulus of monotonicity of $g$ on $Z$ is positive. It is immediately seen that for a monotone vector field $g$ which is continuously differentiable on a closed convex set $Z$ with a nonempty interior, the necessary and sufficient condition for being monotone with modulus $\kappa$ on the set is
$$d^T g'(z) d \ge \kappa d^T d\ \ \forall (d \in \mathbf{R}^m, z \in Z). \tag{5.53}$$
Basic examples of monotone vector fields are:

• gradient fields $\nabla\phi(x)$ of continuously differentiable convex functions of $m$ variables or, more generally, the vector fields $[\nabla_x\phi(x,y); -\nabla_y\phi(x,y)]$ stemming from continuously differentiable functions $\phi(x,y)$ which are convex in $x$ and concave in $y$;
• “diagonal” vector fields $f(x) = [f_1(x_1); f_2(x_2); ...; f_m(x_m)]$ with monotonically nondecreasing univariate components $f_i(\cdot)$. If, in addition, the $f_i(\cdot)$ are continuously differentiable with positive first order derivatives, then the associated field $f$ is strongly monotone on every compact convex subset of $\mathbf{R}^m$, the monotonicity modulus depending on the subset.

Monotone vector fields on $\mathbf{R}^n$ admit simple calculus which includes, in particular, the following two rules:

I. [affine substitution of argument]: If $f(\cdot)$ is a monotone vector field on $\mathbf{R}^m$ and $A$ is an $n \times m$ matrix, the vector field
$$g(x) = Af(A^T x + a)$$
is monotone on $\mathbf{R}^n$; if, in addition, $f$ is monotone with modulus $\kappa \ge 0$ on a closed convex set $Z \subset \mathbf{R}^m$ and $X \subset \mathbf{R}^n$ is closed, convex, and such that $A^T x + a \in Z$ whenever $x \in X$, $g$ is monotone with modulus $\sigma^2\kappa$ on $X$, where $\sigma$ is the $n$-th singular value of $A$ (i.e., the largest $\gamma$ such that $\|A^T x\|_2 \ge \gamma\|x\|_2$ for all $x$).

II. [summation]: If $S$ is a Polish space, $f(x,s) : \mathbf{R}^m \times S \to \mathbf{R}^m$ is a Borel vector-valued function which is monotone in $x$ for every $s \in S$, and $\mu(ds)$ is a Borel probability measure on $S$ such that the vector field
$$F(x) = \int_S f(x,s)\mu(ds)$$
is well-defined for all $x$, then $F(\cdot)$ is monotone. If, in addition, $X$ is a closed convex set in $\mathbf{R}^m$ and $f(\cdot,s)$ is monotone on $X$ with Borel in $s$ modulus $\kappa(s)$ for every $s \in S$, then $F$ is monotone on $X$ with modulus $\int_S \kappa(s)\mu(ds)$.

Assumptions. In what follows, we make the following assumptions on the ingredients of the estimation problem posed in Section 5.2.1:

• A.1. The vector field $f(\cdot)$ is continuous and monotone, and the vector field
$$F(z) = \mathbf{E}_{\eta \sim Q}\left\{\eta f(\eta^T z)\right\}$$
is well-defined (and therefore is monotone along with $f$ by I, II);
• A.2. The signal set $\mathcal{X}$ is a nonempty convex compact set, and the vector field $F$ is monotone with positive modulus $\kappa$ on $\mathcal{X}$;
• A.3. For properly selected $M < \infty$ and every $x \in \mathcal{X}$ it holds
$$\mathbf{E}_{(\eta,y) \sim P_x}\left\{\|\eta y\|_2^2\right\} \le M^2. \tag{5.54}$$

A simple sufficient condition for the validity of Assumptions A.1–3 with properly selected $M < \infty$ and $\kappa > 0$ is as follows:

• The distribution $Q$ of $\eta$ has finite moments of all orders, and $\mathbf{E}_{\eta \sim Q}\{\eta\eta^T\} \succ 0$;
• $f$ is continuously differentiable, and $d^T f'(z)d > 0$ for all $d \ne 0$ and all $z$. Besides this, $f$ is of polynomial growth: for some constants $C \ge 0$ and $p \ge 0$ and all $z$ one has $\|f(z)\|_2 \le C(1 + \|z\|_2^p)$.

Verification of sufficiency is straightforward.

The principal observation underlying the construction we are about to discuss is as follows.

Proposition 5.11. With Assumptions A.1–3 in force, let us associate with a pair $(\eta, y) \in \mathbf{R}^{n \times m} \times \mathbf{R}^m$ the vector field
$$G_{(\eta,y)}(z) = \eta f(\eta^T z) - \eta y : \mathbf{R}^n \to \mathbf{R}^n. \tag{5.55}$$
Then for every $x \in \mathcal{X}$ we have
$$\begin{array}{rcll} \mathbf{E}_{(\eta,y) \sim P_x}\left\{G_{(\eta,y)}(z)\right\} &=& F(z) - F(x)\ \ \forall z \in \mathbf{R}^n & (a) \\ \|F(z)\|_2 &\le& M\ \ \forall z \in \mathcal{X} & (b) \\ \mathbf{E}_{(\eta,y) \sim P_x}\left\{\|G_{(\eta,y)}(z)\|_2^2\right\} &\le& 4M^2\ \ \forall z \in \mathcal{X}. & (c) \end{array} \tag{5.56}$$

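Proof is immediate. Indeed, let $x \in \mathcal{X}$. Then
$$\mathbf{E}_{(\eta,y) \sim P_x}\{\eta y\} = \mathbf{E}_{\eta \sim Q}\left\{\mathbf{E}_{x|\eta}\{\eta y\}\right\} = \mathbf{E}_\eta\left\{\eta f(\eta^T x)\right\} = F(x)$$
(we have used (5.51) and the definition of $F$), whence
$$\begin{array}{rcl} \mathbf{E}_{(\eta,y) \sim P_x}\left\{G_{(\eta,y)}(z)\right\} &=& \mathbf{E}_{(\eta,y) \sim P_x}\left\{\eta f(\eta^T z) - \eta y\right\} = \mathbf{E}_{(\eta,y) \sim P_x}\left\{\eta f(\eta^T z)\right\} - F(x) \\ &=& \mathbf{E}_{\eta \sim Q}\left\{\eta f(\eta^T z)\right\} - F(x) = F(z) - F(x), \end{array}$$
as stated in (5.56.a). Besides this, for $x, z \in \mathcal{X}$, taking into account that the marginal distribution of $\eta$ induced by $P_z$ is $Q$, we have
$$\begin{array}{rcll} \mathbf{E}_{(\eta,y) \sim P_x}\{\|\eta f(\eta^T z)\|_2^2\} &=& \mathbf{E}_{\eta \sim Q}\left\{\|\eta f(\eta^T z)\|_2^2\right\} & \\ &=& \mathbf{E}_{\eta \sim Q}\left\{\|\mathbf{E}_{y \sim P^z_{|\eta}}\{\eta y\}\|_2^2\right\} & [\text{since } \mathbf{E}_{y \sim P^z_{|\eta}}\{y\} = f(\eta^T z)] \\ &\le& \mathbf{E}_{\eta \sim Q}\left\{\mathbf{E}_{y \sim P^z_{|\eta}}\left\{\|\eta y\|_2^2\right\}\right\} & [\text{by Jensen's inequality}] \\ &=& \mathbf{E}_{(\eta,y) \sim P_z}\left\{\|\eta y\|_2^2\right\} \le M^2 & [\text{by A.3 due to } z \in \mathcal{X}]. \end{array}$$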

This combines with the relation $\mathbf{E}_{(\eta,y) \sim P_x}\{\|\eta y\|_2^2\} \le M^2$ given by A.3 due to $x \in \mathcal{X}$ to imply (5.56.b) and (5.56.c). ✷

Consequences. Our goal is to recover the signal $x \in \mathcal{X}$ underlying observations (5.50), and under assumptions A.1–3, $x$ is a root of the monotone vector field
$$G(z) = F(z) - F(x),\quad F(z) = \mathbf{E}_{\eta \sim Q}\left\{\eta f(\eta^T z)\right\}; \tag{5.57}$$
we know that this root belongs to $\mathcal{X}$, and this root is unique because $G(\cdot)$ is strongly monotone on $\mathcal{X}$ along with $F(\cdot)$. Now, the problem of finding a root, known to belong to a given convex compact set $\mathcal{X}$, of a vector field $G$ which is strongly monotone on this set is known to be computationally tractable, provided we have access to an “oracle” which, given on input a point $z \in \mathcal{X}$, returns the value $G(z)$ of the field at the point. The latter is not exactly the case in the situation we are interested in: the field $G$ is the expectation of a random field:
$$G(z) = \mathbf{E}_{(\eta,y) \sim P_x}\left\{\eta f(\eta^T z) - \eta y\right\},$$
and we do not know a priori what the distribution is over which the expectation is taken. However, we can sample from this distribution (the samples are exactly the observations (5.50)), and we can use these samples to approximate $G$ and use this approximation to approximate the signal $x$.7 Two standard implementations of this idea are Sample Average Approximation (SAA) and Stochastic Approximation (SA). We are about to consider these two techniques as applied to the situation we are in.

5.2.3 Estimating via Sample Average Approximation

The idea underlying SAA is quite transparent: given observations (5.50), let us approximate the field of interest $G$ with its empirical counterpart
$$G_{\omega^K}(z) = \frac{1}{K}\sum_{k=1}^{K}\left[\eta_k f(\eta_k^T z) - \eta_k y_k\right].$$
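As a hedged illustration of the empirical field, the sketch below evaluates $G_{\omega^K}(z)$ from simulated data and compares it with the field it approximates. The setting is an assumption chosen for simplicity, not the book's experiment: scalar case $n = m = 1$, $\eta \sim \mathcal{N}(0,1)$, linear link $f(s) = s$ and unit-variance Gaussian noise, so that $F(z) = z$ and $G(z) = z - x$. The names `empirical_field`, `linear_link` and `x_true` are illustrative.

```python
import random


def empirical_field(z, etas, ys, f):
    """Scalar empirical field G_{omega^K}(z) = (1/K) * sum_k eta_k * (f(eta_k * z) - y_k)."""
    return sum(eta * (f(eta * z) - y) for eta, y in zip(etas, ys)) / len(etas)


def linear_link(s):
    """Assumed link f(s) = s; with eta ~ N(0, 1) this gives F(z) = E{eta^2} * z = z."""
    return s


rng = random.Random(0)          # fixed seed so the run is reproducible
x_true = 0.3                    # hypothetical signal
K = 100_000
etas = [rng.gauss(0.0, 1.0) for _ in range(K)]
ys = [linear_link(eta * x_true) + rng.gauss(0.0, 1.0) for eta in etas]

z = 0.8
g_hat = empirical_field(z, etas, ys, linear_link)
g_exact = z - x_true            # G(z) = F(z) - F(x) in this assumed setting
print(f"empirical field: {g_hat:.4f}, exact field: {g_exact:.4f}")
```

With a sample of this size the two printed values should agree to roughly two decimal places, which is the Law of Large Numbers effect the next paragraph relies on.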

By the Law of Large Numbers, as $K \to \infty$, the empirical field $G_{\omega^K}$ converges to the field of interest $G$, so that under mild regularity assumptions, when $K$ is large, $G_{\omega^K}$, with overwhelming probability, will be close to $G$ uniformly on $\mathcal{X}$. Due to strong monotonicity of $G$, this would imply that a set of “near-zeros” of $G_{\omega^K}$ on $\mathcal{X}$ will be close to the zero $x$ of $G$, which is nothing but the signal we want to recover. The only question is how we can consistently define a “near-zero” of $G_{\omega^K}$ on $\mathcal{X}$.⁸ A convenient notion of a “near-zero” in our context is provided by the concept of a weak solution to a variational inequality with a monotone operator, defined as follows (we restrict the general definition to the situation of interest):

Let $\mathcal{X} \subset \mathbf{R}^n$ be a nonempty convex compact set, and $H(z) : \mathcal{X} \to \mathbf{R}^n$ be a monotone (i.e., $[H(z) - H(z')]^T[z - z'] \ge 0$ for all $z, z' \in \mathcal{X}$) vector field. A vector $z_* \in \mathcal{X}$ is called a weak solution to the variational inequality (VI) associated with $H, \mathcal{X}$ when
$$H^T(z)[z - z_*] \ge 0 \quad \forall z \in \mathcal{X}.$$

Let $\mathcal{X} \subset \mathbf{R}^n$ be a nonempty convex compact set and $H$ be monotone on $\mathcal{X}$. It is well known that
• The VI associated with $H, \mathcal{X}$ (let us denote it by VI$(H, \mathcal{X})$) always has a weak solution. It is clear that if $\bar{z} \in \mathcal{X}$ is a root of $H$, then $\bar{z}$ is a weak solution to VI$(H, \mathcal{X})$.⁹
• When $H$ is continuous on $\mathcal{X}$, every weak solution $\bar{z}$ to VI$(H, \mathcal{X})$ is also a strong solution, meaning that
$$H^T(\bar{z})(z - \bar{z}) \ge 0 \quad \forall z \in \mathcal{X}. \tag{5.58}$$
Indeed, (5.58) clearly holds true when $z = \bar{z}$. Assuming $z \ne \bar{z}$ and setting $z_t = \bar{z} + t(z - \bar{z})$, $0 < t \le 1$, we have $H^T(z_t)(z_t - \bar{z}) \ge 0$ (since $\bar{z}$ is a weak solution), whence $H^T(z_t)(z - \bar{z}) \ge 0$ (since $z - \bar{z}$ is a positive multiple of $z_t - \bar{z}$). Passing to the limit as $t \to +0$ and invoking the continuity of $H$, we get $H^T(\bar{z})(z - \bar{z}) \ge 0$, as claimed.
• When $H$ is the gradient field of a continuously differentiable convex function on $\mathcal{X}$ (such a field indeed is monotone), weak (or strong, which in the case of continuous $H$ is the same) solutions to VI$(H, \mathcal{X})$ are exactly the minimizers of the function on $\mathcal{X}$.

Note also that a strong solution to VI$(H, \mathcal{X})$ with monotone $H$ always is a weak one: if $\bar{z} \in \mathcal{X}$ satisfies $H^T(\bar{z})(z - \bar{z}) \ge 0$ for all $z \in \mathcal{X}$, then $H^T(z)(z - \bar{z}) \ge 0$ for all $z \in \mathcal{X}$, since by monotonicity $H^T(z)(z - \bar{z}) \ge H^T(\bar{z})(z - \bar{z})$.

⁷ The observation expressed by Proposition 5.11, however simple, and the resulting course of actions seem to be new. In retrospect, one can recognize unperceived ad hoc utilization of this approach in Perceptron and Isotron algorithms, see [1, 2, 29, 62, 116, 141, 142, 210] and references therein.
⁸ Note that we in general cannot define a “near-zero” of $G_{\omega^K}$ on $\mathcal{X}$ as a root of $G_{\omega^K}$ on this set: while $G$ does have a root belonging to $\mathcal{X}$, nobody told us that the same holds true for $G_{\omega^K}$.
⁹ Indeed, when $\bar{z} \in \mathcal{X}$ and $H(\bar{z}) = 0$, monotonicity of $H$ implies that $H^T(z)[z - \bar{z}] = [H(z) - H(\bar{z})]^T[z - \bar{z}] \ge 0$ for all $z \in \mathcal{X}$, that is, $\bar{z}$ is a weak solution to the VI.

In the sequel, we utilize the following simple and well-known fact:

Lemma 5.12. Let $\mathcal{X}$ be a convex compact set, and $H$ be a monotone vector field on $\mathcal{X}$ with monotonicity modulus $\kappa > 0$, i.e.,
$$\forall z, z' \in \mathcal{X}: \quad [H(z) - H(z')]^T[z - z'] \ge \kappa\|z - z'\|_2^2.$$
Further, let $\bar{z}$ be a weak solution to VI$(H, \mathcal{X})$. Then the weak solution to VI$(H, \mathcal{X})$ is unique. Besides this,
$$H^T(z)[z - \bar{z}] \ge \kappa\|z - \bar{z}\|_2^2 \quad \forall z \in \mathcal{X}. \tag{5.59}$$

Proof: Under the premise of the lemma, let $z \in \mathcal{X}$ and let $\bar{z}$ be a weak solution to VI$(H, \mathcal{X})$ (recall that it does exist). Setting $z_t = \bar{z} + t(z - \bar{z})$, for $t \in (0, 1)$ we have
$$H^T(z)[z - z_t] \ge H^T(z_t)[z - z_t] + \kappa\|z - z_t\|_2^2 \ge \kappa\|z - z_t\|_2^2,$$
where the first $\ge$ is due to strong monotonicity of $H$, and the second $\ge$ is due to the fact that $H^T(z_t)[z - z_t]$ is proportional, with positive coefficient, to $H^T(z_t)[z_t - \bar{z}]$, and the latter quantity is nonnegative since $\bar{z}$ is a weak solution to the VI in question. We end up with $H^T(z)(z - z_t) \ge \kappa\|z - z_t\|_2^2$; passing to the limit as $t \to +0$, we arrive at (5.59). To prove uniqueness of a weak solution, assume that besides the weak solution $\bar{z}$ there exists a weak solution $\tilde{z}$ distinct from $\bar{z}$, and let us set $z' = \frac{1}{2}[\bar{z} + \tilde{z}]$. Since both $\bar{z}$ and $\tilde{z}$ are weak solutions, both the quantities $H^T(z')[z' - \bar{z}]$ and $H^T(z')[z' - \tilde{z}]$ should be nonnegative, and because the sum of these quantities is 0, both of them are zero. Thus, when applying (5.59) to $z = z'$, we get $z' = \bar{z}$, whence $\tilde{z} = \bar{z}$ as well. ✷

Now let us come back to the estimation problem under consideration. Let Assumptions A.1–3 hold, so that the vector fields $G_{(\eta_k,y_k)}(z)$ defined in (5.55), and therefore the vector field $G_{\omega^K}(z)$, are continuous and monotone. When using the SAA, we compute a weak solution $\hat{x}(\omega^K)$ to VI$(G_{\omega^K}, \mathcal{X})$ and treat it as the SAA estimate of the signal $x$ underlying observations (5.50). Since the vector field $G_{\omega^K}(\cdot)$ is monotone with efficiently computable values, provided that so is $f$, computing (a high accuracy approximation to) a weak solution to VI$(G_{\omega^K}, \mathcal{X})$ is a computationally tractable problem (see, e.g., [189]).
Moreover, utilizing the techniques from [30, 204, 220, 212, 213], under mild regularity assumptions additional to A.1–3 one can get a non-asymptotic upper bound on, say, the expected $\|\cdot\|_2^2$-error of the SAA estimate as a function of the sample size $K$ and find out the rate at which this bound converges to 0 as $K \to \infty$; this analysis, however, goes beyond our scope.
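To make the notion of the SAA estimate concrete, here is a minimal sketch for the scalar case, under assumptions not taken from the text: $n = m = 1$, $\mathcal{X} = [-R, R]$, and a continuous nondecreasing link. On an interval, a weak solution to VI$(G_{\omega^K}, \mathcal{X})$ is an endpoint when the field does not change sign on $\mathcal{X}$, and a root found by bisection otherwise. The function names and the test data are hypothetical; for the linear link the result coincides with the least squares estimate clipped to $\mathcal{X}$, which is used below as a sanity check.

```python
import random


def empirical_field(z, etas, ys, f):
    """G_{omega^K}(z) = (1/K) * sum_k eta_k * (f(eta_k * z) - y_k) for scalar z."""
    return sum(eta * (f(eta * z) - y) for eta, y in zip(etas, ys)) / len(etas)


def saa_estimate_1d(etas, ys, f, radius, tol=1e-10):
    """Weak solution to VI(G_{omega^K}, [-radius, radius]) for a continuous monotone field."""
    lo, hi = -radius, radius
    if empirical_field(lo, etas, ys, f) >= 0.0:
        return lo                  # field is nonnegative on X: left endpoint solves the VI
    if empirical_field(hi, etas, ys, f) <= 0.0:
        return hi                  # field is nonpositive on X: right endpoint solves the VI
    while hi - lo > tol:           # sign change: bisect for the root of the nondecreasing field
        mid = 0.5 * (lo + hi)
        if empirical_field(mid, etas, ys, f) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def linear_link(s):
    return s


rng = random.Random(1)
x_true = 0.4                       # hypothetical signal
etas = [rng.gauss(0.0, 1.0) for _ in range(500)]
ys = [linear_link(eta * x_true) + rng.gauss(0.0, 1.0) for eta in etas]

x_saa = saa_estimate_1d(etas, ys, linear_link, radius=1.0)
x_saa_small_set = saa_estimate_1d(etas, ys, linear_link, radius=0.01)
least_squares = sum(e * y for e, y in zip(etas, ys)) / sum(e * e for e in etas)
print(x_saa, x_saa_small_set, least_squares)
```

In higher dimensions the same estimate would be computed by a general-purpose method for monotone variational inequalities; bisection is used here only because the scalar case allows it.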


Let us specify the SAA estimate in the logistic regression model. In this case we have $f(u) = (1 + e^{-u})^{-1}$, and
$$\begin{array}{rcl}
G_{(\eta_k,y_k)}(z) &=& \left[\dfrac{\exp\{\eta_k^T z\}}{1 + \exp\{\eta_k^T z\}} - y_k\right]\eta_k,\\[2ex]
G_{\omega^K}(z) &=& \dfrac{1}{K}\displaystyle\sum_{k=1}^{K}\left[\dfrac{\exp\{\eta_k^T z\}}{1 + \exp\{\eta_k^T z\}} - y_k\right]\eta_k
= \dfrac{1}{K}\nabla_z\left[\displaystyle\sum_{k}\left(\ln\left(1 + \exp\{\eta_k^T z\}\right) - y_k\eta_k^T z\right)\right].
\end{array}$$
In other words, $G_{\omega^K}(z)$ is proportional, with negative coefficient $-1/K$, to the gradient field of the log-likelihood $\ell(z, \omega^K)$; see (5.48). As a result, in the case in question weak solutions to VI$(G_{\omega^K}, \mathcal{X})$ are exactly the maximizers of the log-likelihood $\ell(z, \omega^K)$ over $z \in \mathcal{X}$, that is, for the logistic regression the SAA estimate is nothing but the Maximum Likelihood estimate $\hat{x}_{\mathrm{ML}}(\omega^K)$ as defined in (5.49).¹⁰

On the other hand, in the “nonlinear least squares” example described in Section 5.2.1 with (for the sake of simplicity, scalar) monotone $f(\cdot)$ the vector field $G_{\omega^K}(\cdot)$ is given by
$$G_{\omega^K}(z) = \frac{1}{K}\sum_{k=1}^{K}\left[f(\eta_k^T z) - y_k\right]\eta_k,$$
which is different (provided that $f$ is nonlinear) from the gradient field
$$2\sum_{k=1}^{K} f'(\eta_k^T z)\left[f(\eta_k^T z) - y_k\right]\eta_k$$
of the minus log-likelihood appearing in (5.52). As a result, in this case the ML estimate (5.52) is, in general, different from the SAA estimate (and, in contrast to the ML, the SAA estimate is easy to compute).

¹⁰ This phenomenon is specific to the logistic regression model. The equality of the SAA and the ML estimates in this case is due to the fact that the logistic sigmoid $f(s) = \exp\{s\}/(1 + \exp\{s\})$ “happens” to satisfy the identity $f'(s) = f(s)(1 - f(s))$. When replacing the logistic sigmoid with $f(s) = \phi(s)/(1 + \phi(s))$ with differentiable monotonically nondecreasing positive $\phi(\cdot)$, the SAA estimate becomes the weak solution to VI$(\Phi, \mathcal{X})$ with
$$\Phi(z) = \sum_{k}\left[\frac{\phi(\eta_k^T z)}{1 + \phi(\eta_k^T z)} - y_k\right]\eta_k.$$
On the other hand, the gradient field of the minus log-likelihood
$$-\sum_{k}\left[y_k\ln\left(f(\eta_k^T z)\right) + (1 - y_k)\ln\left(1 - f(\eta_k^T z)\right)\right]$$
we need to minimize when computing the ML estimate is
$$\Psi(z) = \sum_{k}\frac{\phi'(\eta_k^T z)}{\phi(\eta_k^T z)}\left[\frac{\phi(\eta_k^T z)}{1 + \phi(\eta_k^T z)} - y_k\right]\eta_k.$$
When $K > 1$ and $\phi$ is not an exponential, $\Phi$ and $\Psi$ are “essentially different,” so that the SAA estimate typically will differ from the ML one.
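The claim that, for the logistic link, $G_{\omega^K}$ is the gradient field of $\frac{1}{K}\sum_k[\ln(1 + \exp\{\eta_k^T z\}) - y_k\eta_k^T z]$ can be checked numerically. The sketch below does this in the scalar case with a central finite difference; the data, the evaluation point and the function names are illustrative assumptions.

```python
import math
import random


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def empirical_field(z, etas, ys):
    """G_{omega^K}(z) = (1/K) * sum_k (f(eta_k z) - y_k) * eta_k with the logistic f."""
    return sum((logistic(eta * z) - y) * eta for eta, y in zip(etas, ys)) / len(etas)


def scaled_minus_loglik(z, etas, ys):
    """(1/K) * sum_k [ln(1 + exp(eta_k z)) - y_k eta_k z], i.e. minus log-likelihood over K."""
    return sum(math.log1p(math.exp(eta * z)) - y * eta * z
               for eta, y in zip(etas, ys)) / len(etas)


rng = random.Random(2)
x_true = 0.7                       # hypothetical signal
etas = [rng.gauss(0.0, 1.0) for _ in range(200)]
ys = [1.0 if rng.random() < logistic(eta * x_true) else 0.0 for eta in etas]

z, step = 0.3, 1e-5
finite_difference = (scaled_minus_loglik(z + step, etas, ys)
                     - scaled_minus_loglik(z - step, etas, ys)) / (2.0 * step)
field_value = empirical_field(z, etas, ys)
print(finite_difference, field_value)
```

Replacing `logistic` by another link in `empirical_field` while keeping the same likelihood would break this agreement, in line with the footnote above.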

5.2.4 Stochastic Approximation estimate

The Stochastic Approximation (SA) estimate stems from a simple algorithm, Subgradient Descent, for solving the variational inequality VI$(G, \mathcal{X})$. Were the values of the vector field $G(\cdot)$ available, one could approximate a root $x \in \mathcal{X}$ of this VI using the recurrence
$$z_k = \mathrm{Proj}_{\mathcal{X}}[z_{k-1} - \gamma_k G(z_{k-1})], \quad k = 1, 2, ..., K,$$
where
• $\mathrm{Proj}_{\mathcal{X}}[z]$ is the metric projection of $\mathbf{R}^n$ onto $\mathcal{X}$: $\mathrm{Proj}_{\mathcal{X}}[z] = \mathop{\mathrm{argmin}}_{u \in \mathcal{X}} \|z - u\|_2$;
• $\gamma_k > 0$ are given stepsizes;
• the initial point $z_0$ is an arbitrary point of $\mathcal{X}$.

It is well known that under Assumptions A.1–3 this recurrence with properly selected stepsizes and started at a point from $\mathcal{X}$ allows one to approximate the root of $G$ (in fact, the unique weak solution to VI$(G, \mathcal{X})$) to any desired accuracy, provided $K$ is large enough. However, we are in the situation when the actual values of $G$ are not available; the standard way to cope with this difficulty is to replace in the above recurrence the “unobservable” values $G(z_{k-1})$ of $G$ with their unbiased random estimates $G_{(\eta_k,y_k)}(z_{k-1})$. This modification gives rise to Stochastic Approximation (going back to [146]), the recurrence
$$z_k = \mathrm{Proj}_{\mathcal{X}}[z_{k-1} - \gamma_k G_{(\eta_k,y_k)}(z_{k-1})], \quad 1 \le k \le K, \tag{5.60}$$

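where $z_0$ is a once and forever chosen point from $\mathcal{X}$, and $\gamma_k > 0$ are deterministic stepsizes. The next item on our agenda is the (well-known) convergence analysis of SA under assumptions A.1–3. To this end observe that the $z_k$ are deterministic functions of the initial fragments
$$\omega^k = \{\omega_t, 1 \le t \le k\} \sim P_x^k := \underbrace{P_x \times ... \times P_x}_{k}$$
of our sequence of observations $\omega^K = \{\omega_k = (\eta_k, y_k), 1 \le k \le K\}$: $z_k = Z_k(\omega^k)$. Let us set
$$D_k(\omega^k) = \tfrac{1}{2}\|Z_k(\omega^k) - x\|_2^2 = \tfrac{1}{2}\|z_k - x\|_2^2, \qquad d_k = \mathbf{E}_{\omega^k \sim P_x^k}\{D_k(\omega^k)\},$$
where $x \in \mathcal{X}$ is the signal underlying observations (5.50). Note that, as is well known, the metric projection onto a closed convex set $\mathcal{X}$ is contracting:
$$\forall (z \in \mathbf{R}^n, u \in \mathcal{X}): \quad \|\mathrm{Proj}_{\mathcal{X}}[z] - u\|_2 \le \|z - u\|_2.$$
Consequently, for $1 \le k \le K$ it holds
$$\begin{array}{rcl}
D_k(\omega^k) &=& \tfrac{1}{2}\|\mathrm{Proj}_{\mathcal{X}}[z_{k-1} - \gamma_k G_{\omega_k}(z_{k-1})] - x\|_2^2\\
&\le& \tfrac{1}{2}\|z_{k-1} - \gamma_k G_{\omega_k}(z_{k-1}) - x\|_2^2\\
&=& \tfrac{1}{2}\|z_{k-1} - x\|_2^2 - \gamma_k G_{\omega_k}(z_{k-1})^T(z_{k-1} - x) + \tfrac{1}{2}\gamma_k^2\|G_{\omega_k}(z_{k-1})\|_2^2.
\end{array}$$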

Taking expectations w.r.t. $\omega^k \sim P_x^k$ on both sides of the resulting inequality and keeping in mind relations (5.56) along with the fact that $z_{k-1} \in \mathcal{X}$, we get
$$d_k \le d_{k-1} - \gamma_k \mathbf{E}_{\omega^{k-1} \sim P_x^{k-1}}\left\{G(z_{k-1})^T(z_{k-1} - x)\right\} + 2\gamma_k^2 M^2. \tag{5.61}$$
Recalling that we are in the case where $G$ is strongly monotone on $\mathcal{X}$ with modulus $\kappa > 0$, $x$ is the weak solution to VI$(G, \mathcal{X})$, and $z_{k-1}$ takes values in $\mathcal{X}$, and invoking (5.59), the expectation in (5.61) is at least $\kappa\mathbf{E}\{\|z_{k-1} - x\|_2^2\} = 2\kappa d_{k-1}$, and we arrive at the relation
$$d_k \le (1 - 2\kappa\gamma_k)d_{k-1} + 2\gamma_k^2 M^2. \tag{5.62}$$
We put
$$S = \frac{2M^2}{\kappa^2}, \qquad \gamma_k = \frac{1}{\kappa(k + 1)}. \tag{5.63}$$
Let us verify by induction in $k$ that for $k = 0, 1, ..., K$ it holds
$$d_k \le (k + 1)^{-1}S. \tag{$*_k$}$$
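Before the formal argument, the claim can be sanity-checked numerically. The sketch below iterates (5.62) with equality, starting from the largest value of $d_0$ allowed by $(*_0)$, namely $d_0 = S$, and verifies $d_k \le S/(k + 1)$ along the way. The values of $M$ and $\kappa$ and the function name are arbitrary illustrative choices; this is a check of the arithmetic, not a substitute for the induction.

```python
def worst_case_recursion(M, kappa, K):
    """Iterate d_k = (1 - 2*kappa*gamma_k) * d_{k-1} + 2 * gamma_k**2 * M**2, d_0 = S.

    Here S = 2 * M**2 / kappa**2 and gamma_k = 1 / (kappa * (k + 1)), as in (5.63).
    """
    S = 2.0 * M ** 2 / kappa ** 2
    d = [S]
    for k in range(1, K + 1):
        gamma = 1.0 / (kappa * (k + 1))
        d.append((1.0 - 2.0 * kappa * gamma) * d[-1] + 2.0 * gamma ** 2 * M ** 2)
    return S, d


S, d = worst_case_recursion(M=1.5, kappa=0.2, K=1000)
bound_holds = all(d_k <= S / (k + 1) + 1e-12 for k, d_k in enumerate(d))
print(bound_holds, d[1], S / 2)
```

With these stepsizes the factor $1 - 2\kappa\gamma_1$ vanishes, so the first iterate equals $S/4$ regardless of $d_0$.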

Base $k = 0$. Let $D$ stand for the $\|\cdot\|_2$-diameter of $\mathcal{X}$, and $z_\pm \in \mathcal{X}$ be such that $\|z_+ - z_-\|_2 = D$. By (5.56) we have $\|F(z)\|_2 \le M$ for all $z \in \mathcal{X}$, and by strong monotonicity of $G(\cdot)$ on $\mathcal{X}$ we have
$$[G(z_+) - G(z_-)]^T[z_+ - z_-] = [F(z_+) - F(z_-)]^T[z_+ - z_-] \ge \kappa\|z_+ - z_-\|_2^2 = \kappa D^2.$$
By the Cauchy inequality, the left-hand side in the concluding $\ge$ is at most $2MD$, and we get
$$D \le \frac{2M}{\kappa},$$
whence $S \ge D^2/2$. On the other hand, due to the origin of $d_0$ we have $d_0 \le D^2/2$. Thus, $(*_0)$ holds true.

Inductive step $(*_{k-1}) \Rightarrow (*_k)$. Now assume that $(*_{k-1})$ holds true for some $k$, $1 \le k \le K$, and let us prove that $(*_k)$ holds true as well. Observe that $\kappa\gamma_k = (k + 1)^{-1} \le 1/2$, so that
$$\begin{array}{rcll}
d_k &\le& d_{k-1}(1 - 2\kappa\gamma_k) + 2\gamma_k^2 M^2 & [\text{by (5.62)}]\\
&\le& \dfrac{S}{k}(1 - 2\kappa\gamma_k) + 2\gamma_k^2 M^2 & [\text{by } (*_{k-1}) \text{ and due to } \kappa\gamma_k \le 1/2]\\
&=& \dfrac{S}{k}\left(1 - \dfrac{2}{k + 1}\right) + \dfrac{S}{(k + 1)^2} = \dfrac{S}{k + 1}\left(\dfrac{k - 1}{k} + \dfrac{1}{k + 1}\right) \le \dfrac{S}{k + 1}, &
\end{array}$$
so that $(*_k)$ holds true. Induction is complete. Recalling that $d_k = \frac{1}{2}\mathbf{E}\{\|z_k - x\|_2^2\}$, we arrive at the following:

Proposition 5.13. Under Assumptions A.1–3 and with the stepsizes
$$\gamma_k = \frac{1}{\kappa(k + 1)}, \quad k = 1, 2, ..., \tag{5.64}$$
for every signal $x \in \mathcal{X}$ the sequence of estimates $\hat{x}_k(\omega^k) = z_k$ given by the SA recurrence (5.60) and $\omega_k = (\eta_k, y_k)$ defined in (5.50) obeys the error bound
$$\mathbf{E}_{\omega^k \sim P_x^k}\left\{\|\hat{x}_k(\omega^k) - x\|_2^2\right\} \le \frac{4M^2}{\kappa^2(k + 1)}, \quad k = 0, 1, ..., \tag{5.65}$$
$P_x$ being the distribution of $(\eta, y)$ stemming from signal $x$.
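The SA recurrence (5.60) with stepsizes (5.64) takes only a few lines of code. The following is a minimal sketch under stated assumptions: $\mathcal{X}$ is the unit Euclidean ball, $Q = \mathcal{N}(0, I_n)$, the link is the logistic sigmoid with Bernoulli labels, and `kappa=0.1` is a hand-picked value rather than a computed monotonicity modulus. The helper names (`project_unit_ball`, `sa_estimate`) are hypothetical.

```python
import math
import random


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def project_unit_ball(z):
    """Metric projection onto X = {z : ||z||_2 <= 1}."""
    norm = math.sqrt(sum(c * c for c in z))
    return list(z) if norm <= 1.0 else [c / norm for c in z]


def sa_estimate(observations, n, kappa):
    """Run z_k = Proj_X[z_{k-1} - gamma_k * G_{(eta_k, y_k)}(z_{k-1})], gamma_k = 1/(kappa*(k+1))."""
    z = [0.0] * n                                   # z_0: a fixed point of X
    for k, (eta, y) in enumerate(observations, start=1):
        gamma = 1.0 / (kappa * (k + 1))
        residual = logistic(sum(e * c for e, c in zip(eta, z))) - y
        # G_{(eta, y)}(z) = eta * (f(eta^T z) - y) for scalar observations y
        z = project_unit_ball([c - gamma * residual * e for c, e in zip(z, eta)])
    return z


rng = random.Random(3)
n, K = 5, 20000
direction = [rng.gauss(0.0, 1.0) for _ in range(n)]
scale = math.sqrt(sum(c * c for c in direction))
x_true = [c / scale for c in direction]             # hypothetical signal on the unit sphere

observations = []
for _ in range(K):
    eta = [rng.gauss(0.0, 1.0) for _ in range(n)]
    p = logistic(sum(e * c for e, c in zip(eta, x_true)))
    observations.append((eta, 1.0 if rng.random() < p else 0.0))

x_hat = sa_estimate(observations, n, kappa=0.1)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_hat, x_true)))
print(f"recovery error after {K} observations: {error:.3f}")
```

Each observation is used exactly once and the cost per step is linear in $n$, which is why SA is cheap compared with solving the sample-average variational inequality.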

5.2.5 Numerical illustration

To illustrate the above developments, we present here the results of some numerical experiments. Our deliberately simplistic setup is as follows:
• $\mathcal{X} = \{x \in \mathbf{R}^n : \|x\|_2 \le 1\}$;
• the distribution $Q$ of $\eta$ is $\mathcal{N}(0, I_n)$;
• $f$ is the monotone vector field on $\mathbf{R}$ given by one of the following four options:
  A. $f(s) = \exp\{s\}/(1 + \exp\{s\})$ (“logistic sigmoid”);
  B. $f(s) = s$ (“linear regression”);
  C. $f(s) = \max[s, 0]$ (“hinge function”);
  D. $f(s) = \min[1, \max[s, 0]]$ (“ramp sigmoid”).
• the conditional distribution of $y$ given $\eta$ induced by $P_x$ is
  – Bernoulli distribution with probability $f(\eta^T x)$ of outcome 1 in the case of A (i.e., A corresponds to the logistic model),
  – Gaussian distribution $\mathcal{N}(f(\eta^T x), I_n)$ in cases B–D.

Note that when $m = 1$ and $\eta \sim \mathcal{N}(0, I_n)$, one can easily compute the field $F(z)$. Indeed, we have $\forall z \in \mathbf{R}^n \setminus \{0\}$:
$$\eta = \frac{zz^T}{\|z\|_2^2}\eta + \underbrace{\left(I - \frac{zz^T}{\|z\|_2^2}\right)\eta}_{\eta_\perp},$$
and due to the independence of $\eta^T z$ and $\eta_\perp$,
$$F(z) = \mathbf{E}_{\eta \sim \mathcal{N}(0,I)}\{\eta f(\eta^T z)\} = \mathbf{E}_{\eta \sim \mathcal{N}(0,I)}\left\{\frac{zz^T\eta}{\|z\|_2^2}f(\eta^T z)\right\} = \frac{z}{\|z\|_2}\mathbf{E}_{\zeta \sim \mathcal{N}(0,1)}\{\zeta f(\|z\|_2\zeta)\},$$
so that $F(z)$ is proportional to $z/\|z\|_2$ with proportionality coefficient $h(\|z\|_2) = \mathbf{E}_{\zeta \sim \mathcal{N}(0,1)}\{\zeta f(\|z\|_2\zeta)\}$. In Figure 5.2 we present the plots of the function $h(t)$ for the situations A–D and of the moduli of strong monotonicity of the corresponding mappings $F$ on the $\|\cdot\|_2$-ball of radius $R$ centered at the origin, as functions of $R$. The dimension $n$ in all experiments was set to 100, and the number of observations $K$ was 400, 1,000, 4,000, 10,000, and 40,000. For each combination of parameters we ran 10 simulations for signals $x$ underlying observations (5.50) drawn randomly from the uniform distribution on the unit sphere (the boundary of $\mathcal{X}$).
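The proportionality coefficient $h(t) = \mathbf{E}_{\zeta \sim \mathcal{N}(0,1)}\{\zeta f(t\zeta)\}$ is a one-dimensional Gaussian integral and is easy to evaluate numerically. The sketch below uses a plain trapezoid rule on $[-10, 10]$; the quadrature scheme, the grid size and the function names are choices made for this illustration and are not claimed to be what produced Figure 5.2. Two closed forms serve as checks: $h(t) = t$ for the linear link and $h(t) = t/2$ for the hinge function.

```python
import math


def gaussian_expectation(g, half_width=10.0, intervals=4000):
    """Trapezoid-rule approximation of E{g(zeta)} for zeta ~ N(0, 1)."""
    step = 2.0 * half_width / intervals
    total = 0.0
    for i in range(intervals + 1):
        zeta = -half_width + i * step
        weight = 0.5 if i in (0, intervals) else 1.0
        total += weight * g(zeta) * math.exp(-0.5 * zeta * zeta)
    return total * step / math.sqrt(2.0 * math.pi)


def h(t, f):
    """h(t) = E{zeta * f(t * zeta)}, the coefficient of z/||z||_2 in F(z) at ||z||_2 = t."""
    return gaussian_expectation(lambda zeta: zeta * f(t * zeta))


links = {
    "A": lambda s: 1.0 / (1.0 + math.exp(-s)),     # logistic sigmoid
    "B": lambda s: s,                              # linear regression
    "C": lambda s: max(s, 0.0),                    # hinge function
    "D": lambda s: min(1.0, max(s, 0.0)),          # ramp sigmoid
}

values_at_one = {name: h(1.0, f) for name, f in links.items()}
for name, value in sorted(values_at_one.items()):
    print(f"case {name}: h(1) = {value:.4f}")
```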

Figure 5.2: Left: functions $h$; right: moduli of strong monotonicity of the operators $F(\cdot)$ in $\{z : \|z\|_2 \le R\}$ as functions of $R$. Dashed lines – case A (logistic sigmoid), solid lines – case B (linear regression), dash-dotted lines – case C (hinge function), dotted line – case D (ramp sigmoid).

In each experiment, we computed the SAA and the SA estimates (note that in the cases A and B the SAA estimate is the Maximum Likelihood estimate as well). The SA stepsizes $\gamma_k$ were selected according to (5.64) with “empirically tuned” $\kappa$.¹¹ Namely, given observations $\omega_k = (\eta_k, y_k)$, $k \le K$ (see (5.50)), we used them to build the SA estimate in two stages:
• at the tuning stage, we generate a random “training signal” $x' \in \mathcal{X}$ and then generate labels $y_k'$ as if $x'$ were the actual signal. For instance, in the case of A, $y_k'$ is assigned value 1 with probability $f(\eta_k^T x')$ and value 0 with complementary probability. After the “training signal” and associated labels are generated, we run on the resulting artificial observations SA with different values of $\kappa$, compute the accuracy of the resulting estimates, and select the value of $\kappa$ resulting in the best recovery;
• at the execution stage, we run SA on the actual data with stepsizes (5.64) specified by the $\kappa$ found at the tuning stage.

The results of some numerical experiments are presented in Figure 5.3. Note that the CPU time for SA includes both tuning and execution stages. The conclusion from these experiments is that as far as estimation quality is concerned, the SAA estimate marginally outperforms the SA, while being significantly more time consuming. Note also that the dependence of recovery errors on $K$ observed in our experiments is consistent with the convergence rate $O(1/\sqrt{K})$ established by Proposition 5.13.

¹¹ We could get (lower bounds on) the moduli of strong monotonicity of the vector fields $F(\cdot)$ we are interested in analytically, but this would be boring and conservative.

Figure 5.3: Mean errors and CPU times for SA (solid lines) and SAA estimates (dashed lines) as functions of the number of observations $K$. o – case A (logistic link), x – case B (linear link), + – case C (hinge function), ✷ – case D (ramp sigmoid). Panels: mean estimation error $\|\hat{x}_k(\omega^k) - x\|_2$; CPU time (sec).

Comparison with Nonlinear Least Squares. Observe that in the case $m = 1$ of scalar monotone field $f$, the SAA estimate yielded by our approach as applied to observation $\omega^K$ is the minimizer of the convex function
$$H_{\omega^K}(z) = \frac{1}{K}\sum_{k=1}^{K}\left[v(\eta_k^T z) - y_k\eta_k^T z\right], \qquad v(r) = \int_0^r f(s)\,ds,$$
on the signal set $\mathcal{X}$. When $f$ is the logistic sigmoid, $H_{\omega^K}(\cdot)$ is exactly the convex loss function leading to the ML estimate in the logistic regression model. As we have already mentioned, this is not the case for a general GLM. Consider, e.g., the situation where the regressors and the signals are reals, the distribution of regressor $\eta$ is $\mathcal{N}(0, 1)$, and the conditional distribution of $y$ given $\eta$ is $\mathcal{N}(f(\eta x), \sigma^2)$, with $f(s) = \arctan(s)$. In this situation the ML estimate stemming from observation $\omega^K$ is the minimizer on $\mathcal{X}$ of the function
$$M_{\omega^K}(z) = \frac{1}{K}\sum_{k=1}^{K}\left[y_k - \arctan(\eta_k z)\right]^2. \tag{5.66}$$

The latter function is typically nonconvex and can be multi-extremal. For example, when running simulations¹² we from time to time observe a situation similar to that presented in Figure 5.4. Of course, in our toy situation of scalar $x$ the existence of several local minima of $M_{\omega^K}(\cdot)$ is not an issue: we can easily compute the ML estimate by a brute force search along a dense grid. What to do in the multidimensional case is another question. We could also add that in the simulations which led to Figure 5.4 both the SAA and the ML estimates exhibited nearly the same performance in terms of the estimation error: in 1,000 experiments, the median of the observed recovery errors was 0.969 for the ML, and 0.932 for the SAA estimate. When increasing the number of observations to 1,000, the empirical median (taken over 1,000 simulations) of recovery errors became 0.079 for the ML, and 0.085 for the SAA estimate.

¹² In these simulations, the “true” signal $x$ underlying observations was drawn from $\mathcal{N}(0, 1)$, the number $K$ of observations also was random with uniform distribution on $\{1, ..., 20\}$, and $\mathcal{X} = [-20, 20]$, $\sigma = 3$ were used.
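The contrast between the two objectives in the arctan example can be reproduced with a few lines of code. The sketch below evaluates $M_{\omega^K}$ from (5.66) and the convex function $H_{\omega^K}$ on a grid over $\mathcal{X} = [-20, 20]$, using the closed form $v(r) = r\arctan(r) - \frac{1}{2}\ln(1 + r^2)$ of the antiderivative of $\arctan$. It is an illustration only: the seed, the fixed sample size $K = 10$ (the text draws $K$ at random) and the grid step are assumptions, and no attempt is made to reproduce the numbers reported above.

```python
import math
import random


def ml_objective(z, etas, ys):
    """M_{omega^K}(z) = (1/K) * sum_k (y_k - arctan(eta_k * z))**2, cf. (5.66)."""
    return sum((y - math.atan(eta * z)) ** 2 for eta, y in zip(etas, ys)) / len(etas)


def antiderivative(r):
    """v(r) = integral of arctan over [0, r] = r * arctan(r) - 0.5 * ln(1 + r**2)."""
    return r * math.atan(r) - 0.5 * math.log1p(r * r)


def saa_objective(z, etas, ys):
    """H_{omega^K}(z) = (1/K) * sum_k [v(eta_k * z) - y_k * eta_k * z]."""
    return sum(antiderivative(eta * z) - y * eta * z for eta, y in zip(etas, ys)) / len(etas)


rng = random.Random(4)
sigma, K = 3.0, 10                                   # K fixed here for simplicity
x_true = rng.gauss(0.0, 1.0)
etas = [rng.gauss(0.0, 1.0) for _ in range(K)]
ys = [math.atan(eta * x_true) + sigma * rng.gauss(0.0, 1.0) for eta in etas]

grid = [-20.0 + 0.01 * i for i in range(4001)]       # dense grid on X = [-20, 20]
h_values = [saa_objective(z, etas, ys) for z in grid]
m_values = [ml_objective(z, etas, ys) for z in grid]

x_saa = grid[h_values.index(min(h_values))]          # grid minimizer of the convex H
x_ml = grid[m_values.index(min(m_values))]           # brute-force grid minimizer of M
interior_local_minima = sum(
    1 for i in range(1, len(grid) - 1)
    if m_values[i] < m_values[i - 1] and m_values[i] < m_values[i + 1]
)
print(x_true, x_saa, x_ml, interior_local_minima)
```

Because $H_{\omega^K}$ is convex, any local search would find `x_saa`; the grid search is needed only for $M_{\omega^K}$, and it is exactly this step that does not scale to higher dimensions.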

Figure 5.4: Solid curve: $M_{\omega^K}(z)$, dashed curve: $H_{\omega^K}(z)$. True signal $x$ (solid vertical line): +0.081; SAA estimate (unique minimizer of $H_{\omega^K}$, dashed vertical line): −0.252; ML estimate (global minimizer of $M_{\omega^K}$ on $[-20, 20]$): −20.00; the local minimizer of $M_{\omega^K}$ closest to $x$ (dotted vertical line): −0.363.

5.2.6 “Single-observation” case

Let us look at the special case of our estimation problem where the sequence η1 , ..., ηK of regressors in (5.50) is deterministic. At first glance, this situation goes beyond our setup, where the regressors should be i.i.d. drawn from some distribution Q. However, we can circumvent this “contradiction” by saying that we are now