333 5 2MB
English Pages 320 Year 2019
Universitext
J. Frédéric Bonnans
Convex and Stochastic Optimization
Universitext
Universitext
Series Editors Sheldon Axler San Francisco State University Carles Casacuberta Universitat de Barcelona Angus MacIntyre Queen Mary University of London Kenneth Ribet University of California, Berkeley Claude Sabbah École Polytechnique, CNRS, Université Paris-Saclay, Palaiseau Endre Süli University of Oxford Wojbor A. Woyczyński, Case Western Reserve University
Universitext is a series of textbooks that presents material from a wide variety of mathematical disciplines at master’s level and beyond. The books, often well class-tested by their author, may have an informal, personal even experimental approach to their subject matter. Some of the most successful and established books in the series have evolved through several editions, always following the evolution of teaching curricula, to very polished texts. Thus as research topics trickle down into graduate-level teaching, first textbooks written for new, cutting-edge courses may make their way into Universitext.
More information about this series at http://www.springer.com/series/223
J. Frédéric Bonnans
Convex and Stochastic Optimization
123
J. Frédéric Bonnans Inria-Saclay and Centre de Mathématiques Appliquées École Polytechnique Palaiseau, France
ISSN 0172-5939 ISSN 2191-6675 (electronic) Universitext ISBN 978-3-030-14976-5 ISBN 978-3-030-14977-2 (eBook) https://doi.org/10.1007/978-3-030-14977-2 Library of Congress Control Number: 2019933717 Mathematics Subject Classification (2010): 90C15, 09C25, 90C39, 90C40, 90C46 © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is dedicated to Viviane, Juliette, Antoine, and Na Yeong
Preface
These lecture notes are an extension of those given in the master programs at the Universities Paris VI and Paris-Saclay, and in the École Polytechnique. They give an introduction to convex analysis and its applications to stochastic programming, i.e., to optimization problems where the decision must be taken in the presence of uncertainties. This is an active subject of research that covers many applications. Classical textbooks are Birge and Louveaux [21], Kall and Wallace [62]. The book [123] by Wallace and Ziemba is dedicated to applications. Some more advanced material is presented in Ruszczynski and Shapiro [105], Shapiro et al. [113], Föllmer and Schied [49], and Carpentier et al. [32]. Let us also mention the historical review paper by Wets [124]. The basic tool for studying such problems is the combination of convex analysis with measure theory. Classical sources in convex analysis are Rockafellar [96], Ekeland and Temam [46]. An introduction to integration and probability theory is given in Malliavin [76]. The author expresses his thanks to Alexander Shapiro (Georgia Tech) for introducing him to the subject, Darinka Dentchev (Stevens Institute of Technology), Andrzej Ruszczyńki (Rutgers), Michel de Lara, and Jean-Philippe Chancelier (Ecole des Ponts-Paris Tech) for stimulating discussions, and Pierre Carpentier with whom he shared the course on stochastic optimization in the optimization masters at the Université Paris-Saclay. Palaiseau, France
J. Frédéric Bonnans
vii
Contents
1 A Convex Optimization Toolbox . . . . . . . . . . . . . . . . . . . . . . 1.1 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1.1 Optimization Problems . . . . . . . . . . . . . . . . . . . . . 1.1.2 Separation of Convex Sets . . . . . . . . . . . . . . . . . . 1.1.3 Weak Duality and Saddle Points . . . . . . . . . . . . . . 1.1.4 Linear Programming and Hoffman Bounds . . . . . . 1.1.5 Conjugacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Duality Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2.1 Perturbation Duality . . . . . . . . . . . . . . . . . . . . . . . 1.2.2 Subdifferential Calculus . . . . . . . . . . . . . . . . . . . . 1.2.3 Minimax Theorems . . . . . . . . . . . . . . . . . . . . . . . 1.2.4 Calmness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3 Specific Structures, Applications . . . . . . . . . . . . . . . . . . . 1.3.1 Maxima of Bounded Functions . . . . . . . . . . . . . . . 1.3.2 Linear Conical Optimization . . . . . . . . . . . . . . . . . 1.3.3 Polyhedra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.3.4 Infimal Convolution . . . . . . . . . . . . . . . . . . . . . . . 1.3.5 Recession Functions and the Perspective Function . 1.4 Duality for Nonconvex Problems . . . . . . . . . . . . . . . . . . . 1.4.1 Convex Relaxation . . . . . . . . . . . . . . . . . . . . . . . 1.4.2 Applications of the Shapley–Folkman Theorem . . . 1.4.3 First-Order Optimality Conditions . . . . . . . . . . . . . 1.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
1 1 1 4 10 11 17 32 32 44 49 51 53 53 56 57 60 62 65 65 70 72 74
2 Semidefinite and Semi-infinite Programming . . . . . . 2.1 Matrix Optimization . . . . . . . . . . . . . . . . . . . . . . 2.1.1 The Frobenius Norm . . . . . . . . . . . . . . . . 2.1.2 Positive Semidefinite Linear Programming
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
75 75 75 77
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
ix
x
Contents
2.2 Rotationally Invariant Matrix Functions . . . . . . . . . . . . . . . 2.2.1 Computation of the Subdifferential . . . . . . . . . . . . . 2.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.3 Logarithmic Penalty . . . . . . . . . . . . . . . . . . . . . . . . 2.3 SDP Relaxations of Nonconvex Problems . . . . . . . . . . . . . 2.3.1 Relaxation of Quadratic Problems . . . . . . . . . . . . . . 2.3.2 Relaxation of Integer Constraints . . . . . . . . . . . . . . 2.4 Second-Order Cone Constraints . . . . . . . . . . . . . . . . . . . . . 2.4.1 Examples of SOC Reformulations . . . . . . . . . . . . . 2.4.2 Linear SOC Duality . . . . . . . . . . . . . . . . . . . . . . . . 2.4.3 SDP Representation . . . . . . . . . . . . . . . . . . . . . . . . 2.5 Semi-infinite Programming . . . . . . . . . . . . . . . . . . . . . . . . 2.5.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.5.2 Multipliers with Finite Support . . . . . . . . . . . . . . . . 2.5.3 Chebyshev Approximation . . . . . . . . . . . . . . . . . . . 2.5.4 Chebyshev Polynomials and Lagrange Interpolation . 2.6 Nonnegative Polynomials over R . . . . . . . . . . . . . . . . . . . 2.6.1 Nonnegative Polynomials . . . . . . . . . . . . . . . . . . . . 2.6.2 Characterisation of Moments . . . . . . . . . . . . . . . . . 2.6.3 Maximal Loading . . . . . . . . . . . . . . . . . . . . . . . . . 2.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 An Integration Toolbox . . . . . . . . . . . . . . . . . . . . . . . 3.1 Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Measurable Spaces . . . . . . . . . . . . . . . . . . . 3.1.2 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.3 Kolmogorov’s Extension of Measures . . . . . 3.1.4 Limits of Measurable Functions . . . . . . . . . 3.1.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.6 Lp Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.7 Bochner Integrals . . . . . . . . . . . . . . . . . . . . 3.2 Integral Functionals . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 Minimization of Carathéodory Integrals . . . 3.2.2 Measurable Multimappings . . . . . . . . . . . . . 3.2.3 Convex Integrands . . . . . . . . . . . . . . . . . . . 3.2.4 Conjugates of Integral Functionals . . . . . . . 3.2.5 Deterministic Decisions in Rm . . . . . . . . . . 3.2.6 Constrained Random Decisions . . . . . . . . . 3.2.7 Linear Programming with Simple Recourse . 3.3 Applications of the Shapley–Folkman Theorem . . . 3.3.1 Integrals of Multimappings . . . . . . . . . . . . . 3.3.2 Constraints on Integral Terms . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . .
80 80 84 85 87 87 90 91 91 93 94 95 95 97 101 103 106 106 111 114 115
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . .
117 117 117 122 125 127 128 134 139 141 142 143 145 146 150 152 153 156 156 159
Contents
xi
3.4 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 3.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164 . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
165 165 165 165 167 168 168 169 170 172 172 174 176
5 Sampling and Optimizing . . . . . . . . . . . . . . . . . . . . . . 5.1 Examples and Motivation . . . . . . . . . . . . . . . . . . . 5.1.1 Maximum Likelihood . . . . . . . . . . . . . . . . 5.2 Convergence in Law and Related Asymptotics . . . . 5.2.1 Probabilities over Metric Spaces . . . . . . . . . 5.2.2 Convergence in Law . . . . . . . . . . . . . . . . . 5.2.3 Central Limit Theorems . . . . . . . . . . . . . . . 5.2.4 Delta Theorems . . . . . . . . . . . . . . . . . . . . . 5.2.5 Solving Equations . . . . . . . . . . . . . . . . . . . 5.3 Error Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.3.1 The Empirical Distribution . . . . . . . . . . . . . 5.3.2 Minimizing over a Sample . . . . . . . . . . . . . 5.3.3 Uniform Convergence of Values . . . . . . . . . 5.3.4 The Asymptotic Law . . . . . . . . . . . . . . . . . 5.3.5 Expectation Constraints . . . . . . . . . . . . . . . 5.4 Large Deviations . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 The Principle of Large Deviations . . . . . . . 5.4.2 Error Estimates in Stochastic Programming . 5.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
177 177 177 178 178 179 185 186 188 190 190 191 193 194 195 198 198 200 200
6 Dynamic Stochastic Optimization . . . . . . . . . . . . . . . . . . 6.1 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . 6.1.1 Functional Dependency . . . . . . . . . . . . . . . . . 6.1.2 Construction of the Conditional Expectation . . 6.1.3 The Conditional Expectation of Non-integrable Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 6.1.4 Computation in Some Simple Cases . . . . . . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
201 201 201 201
4 Risk Measures . . . . . . . . . . . . . . . . . . . . . . . . . 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Utility Functions . . . . . . . . . . . . . . . . . . . . . 4.2.1 Framework . . . . . . . . . . . . . . . . . . . 4.2.2 Optimized Utility . . . . . . . . . . . . . . . 4.3 Monetary Measures of Risk . . . . . . . . . . . . 4.3.1 General Properties . . . . . . . . . . . . . . 4.3.2 Convex Monetary Measures of Risk . 4.3.3 Acceptation Sets . . . . . . . . . . . . . . . 4.3.4 Risk Trading . . . . . . . . . . . . . . . . . . 4.3.5 Deviation and Semideviation . . . . . . 4.3.6 Value at Risk and CVaR . . . . . . . . . 4.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . 206 . . . . . . . . . 206
xii
Contents
6.1.5 Convergence Theorems . . . . . . . . . . . . . . . . 6.1.6 Conditional Variance . . . . . . . . . . . . . . . . . . 6.1.7 Compatibility with a Subspace . . . . . . . . . . . 6.1.8 Compatibility with Measurability Constraints 6.1.9 No Recourse . . . . . . . . . . . . . . . . . . . . . . . . 6.2 Dynamic Stochastic Programming . . . . . . . . . . . . . . 6.2.1 Dynamic Uncertainty . . . . . . . . . . . . . . . . . . 6.2.2 Abstract Optimality Conditions . . . . . . . . . . . 6.2.3 The Growing Information Framework . . . . . . 6.2.4 The Standard full Information Framework . . . 6.2.5 Independent Noises . . . . . . . . . . . . . . . . . . . 6.2.6 Elementary Examples . . . . . . . . . . . . . . . . . . 6.2.7 Application to the Turbining Problem . . . . . . 6.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
207 208 209 212 213 214 214 215 217 218 219 219 220 222
7 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . 7.1 Controlled Markov Chains . . . . . . . . . . . . . . . . . 7.1.1 Markov Chains . . . . . . . . . . . . . . . . . . . . 7.1.2 The Dynamic Programming Principle . . . . 7.1.3 Infinite Horizon Problems . . . . . . . . . . . . 7.1.4 Numerical Algorithms . . . . . . . . . . . . . . . 7.1.5 Exit Time Problems . . . . . . . . . . . . . . . . . 7.1.6 Problems with Stopping Decisions . . . . . . 7.1.7 Undiscounted Problems . . . . . . . . . . . . . . 7.2 Advanced Material on Controlled Markov Chains 7.2.1 Expectation Constraints . . . . . . . . . . . . . . 7.2.2 Partial Information . . . . . . . . . . . . . . . . . . 7.2.3 Linear Programming Formulation . . . . . . . 7.3 Ergodic Markov Chains . . . . . . . . . . . . . . . . . . . 7.3.1 Orientation . . . . . . . . . . . . . . . . . . . . . . . 7.3.2 Transient and Recurrent States . . . . . . . . . 7.3.3 Ergodic Dynamic Programming . . . . . . . . 7.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . .
223 223 223 229 231 235 239 240 243 243 243 248 254 255 255 256 263 266
8 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.1 Stochastic Dual Dynamic Programming (SDDP) . 8.1.1 Static Case: Kelley’s Algorithm . . . . . . . . 8.1.2 Deterministic Dual Dynamic Programming 8.1.3 Stochastic Case . . . . . . . . . . . . . . . . . . . . 8.2 Introduction to Linear Decision Rules . . . . . . . . . 8.2.1 About the Frobenius Norm . . . . . . . . . . . . 8.2.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
267 267 267 269 272 275 275 276
Contents
8.2.3 Linear Programming Reformulation 8.2.4 Linear Conic Reformulation . . . . . . 8.2.5 Dual Bounds in a Conic Setting . . . 8.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . .
xiii
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
. . . .
276 277 278 281
9 Generalized Convexity and Transportation Theory 9.1 Generalized Convexity . . . . . . . . . . . . . . . . . . . 9.1.1 Generalized Fenchel Conjugates . . . . . . . 9.1.2 Cyclical Monotonicity . . . . . . . . . . . . . . 9.1.3 Duality . . . . . . . . . . . . . . . . . . . . . . . . . 9.1.4 Augmented Lagrangian . . . . . . . . . . . . . 9.2 Convex Functions of Measures . . . . . . . . . . . . . 9.2.1 A First Result . . . . . . . . . . . . . . . . . . . . 9.2.2 A Second Result . . . . . . . . . . . . . . . . . . 9.3 Transportation Theory . . . . . . . . . . . . . . . . . . . . 9.3.1 The Compact Framework . . . . . . . . . . . . 9.3.2 Optimal Transportation Maps . . . . . . . . . 9.3.3 Penalty Approximations . . . . . . . . . . . . . 9.3.4 Barycenters . . . . . . . . . . . . . . . . . . . . . . 9.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
. . . . . . . . . . . . . . .
283 283 283 285 287 287 289 290 293 293 294 296 297 299 301
. . . .
. . . .
. . . .
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Chapter 1
A Convex Optimization Toolbox
Summary This chapter presents the duality theory for optimization problems, by both the minimax and perturbation approach, in a Banach space setting. Under some stability (qualification) hypotheses, it is shown that the dual problem has a nonempty and bounded set of solutions. This leads to the subdifferential calculus, which appears to be nothing but a partial subdifferential rule. Applications are provided to the infimal convolution, as well as recession and perspective functions. The relaxation of some nonconvex problems is analyzed thanks to the Shapley–Folkman theorem.
1.1 Convex Functions 1.1.1 Optimization Problems 1.1.1.1
The Language of Minimization Problems
¯ := R ∪ {−∞} ∪ {+∞}. A minimizaDenote the set of extended real numbers by R tion problem is of the form (P f,K ) Min f (x); x ∈ K , x
¯ (we say that f is extended where K is a subset of some set X , and f : X → R real-valued). The domain of f is dom( f ) := {x ∈ X ;
f (x) < +∞}.
(1.1)
We say that f is proper if its domain is not empty, and if f (x) > −∞, for all x ∈ X . The feasible set and value of (P f,K ) are resp. F(P f,K ) := dom( f ) ∩ K ; val(P f,K ) := inf{ f (x); x ∈ F(P)}. © Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_1
(1.2) 1
2
1 A Convex Optimization Toolbox
Since the infimum over the empty set is +∞, we have that val(P f,K ) < +∞ iff F(P f,K ) = ∅. The solution set of (P f,K ) is defined as S(P f,K ) := {x ∈ F(P f,K ); f (x) = val(P f,K )}.
(1.3)
Note that S(P f,K ) = ∅ when F(P f,K ) = ∅. A metric over X is a function, say d : X × X → R+ , such that d(x, y) = 0 iff x = y, that is symmetric: d(x, y) = d(y, x) and that satisfies the triangle inequality d(x, z) ≤ d(x, y) + d(y, z), for all x, y, z in X.
(1.4)
We say that X is a metric space if it is endowed with a metric. In that case, we say that f is lower semicontinuous, or l.s.c., if for all x ∈ X , f (x) ≤ lim inf k f (x k ) whenever x k → x. A minimizing sequence for problem (P f,K ) is a sequence xk in F(P f,K ) such that f (xk ) → val(P f,K ). Such a sequence exists iff F(P f,K ) is nonempty. Any (infinite) subsequence of a minimizing sequence is itself a minimizing sequence. If X is a metric space, K is closed and f is l.s.c., then any limit point of a minimizing sequence is a solution of (P f,K ). Example 1.1 Consider the problem of minimizing the exponential function over R. The value is finite, but the solution set is empty. Note that minimizing subsequences have no limit point in R.
1.1.1.2
Operations on Extended Real-Valued Functions
In the context of minimization problems, that f (x) = +∞ is just a way to express that x is not feasible. Therefore the following algebraic rules for extended real-valued functions are to be used: if f and g are extended real-valued functions over X , then h := f + g is the extended real-valued function over X defined by h(x) =
+∞ if max( f (x), g(x)) = +∞, f (x) + g(x) otherwise.
(1.5)
Note that there is no ambiguity in this definition (taking the usual addition rules in the presence of ±∞). The domain of the sum is the intersection of the domains. ¯ Example 1.2 With a subset K of X we associate the indicatrix function I K : X → R defined by 0 if x ∈ K , I K (x) := (1.6) +∞ otherwise. Let f K (x) := f (x) + I K (x). Then (P f,K ) has the same feasible set, value and set of solutions as (P f K ,X ).
1.1 Convex Functions
3
If f is an extended real-valued function and λ > 0, we may define λ f by the natural rule (λ f )(x) = λ f (x), for all x ∈ X. (1.7) Observe that (Pλ f,K ) has the same feasible set and set of solutions as (P f,X ), and the values are related by (1.8) val(Pλ f,K ) = λ val(P f,K ). For λ = 0, we must think of 0 f as the limit of λ f as λ ↓ 0, and therefore set (0 f )(x) =
f (x) if f (x) = ±∞, 0 otherwise.
(1.9)
Then (P0 f,K ) has the same feasible set as (P f,X ), and its set of solutions is F(P f,X ) if f is proper. Example 1.3 Consider the entropy function f (x) = x log x (with the convention that 0 log 0 = 0) if x ≥ 0, and +∞ otherwise. Then 0 f is the indicatrix of R+ . More generally, if f is proper, then 0 f is the indicatrix of its domain.
1.1.1.3
Maximization Problems
For a maximization problem Max g(x); x ∈ K x
(Dg,K )
we have similar conventions, adapted to the maximization framework. In particular the domain of g is dom(g) = {x ∈ X ; g(x) > −∞}. The domain of the sum is still the intersection of the domains. Note that (Dg,K ) is essentially the same problem as Min −g(x); x ∈ K . x
(Dg,K )
Indeed these two problems have the same feasible set and set of solutions, and they have opposite values.
1.1.1.4
Convex Sets and Functions
Let X be a vector space. We say that K ⊂ X is convex if For any x and y in K , and α ∈ (0, 1), we have that αx + (1 − α)y ∈ K . ¯ is convex if We say that f : X → R
(1.10)
4
1 A Convex Optimization Toolbox
For any x and y in dom( f ), and α ∈ (0, 1), we have that f (αx + (1 − α)y) ≤ α f (x) + (1 − α) f (y).
(1.11)
We see that a convex function has a convex domain. ¯ is the set The epigraph of f : X → R epi( f ) := {(x, α) ∈ X × R; α ≥ f (x)}.
(1.12)
Its projection over X is dom( f ). One easily checks the following: ¯ Then Lemma 1.4 Let f : X → R. (i) its epigraph is convex iff f is convex, (ii) if X is a metric space, its epigraph is closed iff f is l.s.c. Example 1.5 The epigraph of the indicatrix of K ⊂ X is K × R+ .
1.1.2 Separation of Convex Sets We recall without proof the Hahn–Banach theorem, valid in a vector space setting, and deduce from it some results of separation of convex sets in normed vector spaces.
1.1.2.1
The Hahn–Banach Theorem
Let X be a vector space. We say that p : X → R is positively homogeneous and subadditive if it satisfies (i) p(αx) = αp(x), for all x ∈ X and α > 0, (1.13) (ii) p(x + y) ≤ p(x) + p(y), for all x and y in X. Remark 1.6 (a) Taking x = 0 in (1.13)(i), we obtain that p(0) = 0, and so we could as well take α = 0 in (1.13)(i). (b) If β ∈ (0, 1), combining the above relations, we obtain that p(βx + (1 − β)y) ≤ βp(x) + (1 − β) p(y),
(1.14)
i.e., p is convex. Conversely, it is easily checked that a positively homogeneous (finite-valued) convex function is subadditive. The analytical form of the Hahn–Banach theorem, a nontrivial consequence of Zorn’s lemma, is as follows (see [28] for a proof): Theorem 1.7 Let p satisfy (1.13), X 1 be a vector subspace of X , and λ be a linear form defined on X 1 that is dominated by p in the sense that
1.1 Convex Functions
5
λ(x) ≤ p(x), for all x ∈ X 1 .
(1.15)
Then there exists a linear form μ on X , dominated by p, whose restriction to X 1 coincides with λ. We say that a real vector space X is a normed space when endowed with a mapping X → R, x → x, satisfying the three axioms ⎧ with equality iff x = 0, ⎨ x ≥ 0, αx = |α|x, for all α ∈ R, x ∈ X, ⎩ x + x ≤ x + x , (triangle inequality).
(1.16)
Then (x, y) → x − y is a metric over X . We denote the norm of Euclidean spaces, i.e., finite-dimensional spaces endowed with the norm ( i xi2 )1/2 , by |x|. A sequence xk in a normed vector space X is said to be a Cauchy sequence if x p − xq → 0 when p, q ↑ ∞. We say that X is a Banach space if every Cauchy sequence has a (necessarily unique) limit. The topological dual X ∗ of the normed vector space X is the set of continuous linear forms (maps X → R) on X . In the sequel, by dual space we will mean the topological dual. We denote the duality product between x ∗ ∈ X ∗ and x ∈ X by x ∗ , x X or simply x ∗ , x. Note that a linear form, say over X , is continuous iff it is continuous at 0, which holds iff sup{(x); x ≤ 1} < ∞. So we may endow X ∗ with the norm (1.17) x ∗ ∗ := sup{x ∗ , x; x ≤ 1}. It is easily checked that X ∗ is a Banach space. The dual of Rn (space of vertical vectors) is denoted by Rn∗ (space of horizontal vectors). In the sequel we may denote the dual norm by x ∗ . If X and Y are Banach spaces, we denote by L(X, Y ) the Banach space of linear continuous mappings X → Y , endowed with the norm A := sup{Ax; x ≤ 1}. We denote by B X (resp. B¯ X ) the open (resp. closed) unit ball of X . If x1∗ is a continuous linear form on a linear subspace X 1 of X , its norm is defined accordingly: x1∗ 1,∗ = sup{x ∗ , x; x ∈ X 1 , x ≤ 1}.
(1.18)
Here are some other corollaries of the Hahn–Banach theorem. Corollary 1.8 Let x1∗ be a continuous linear form on a linear subspace X 1 of the normed space X . Then there exists an x ∗ ∈ X ∗ whose restriction to X 1 coincides with x1∗ , and such that (1.19) x ∗ ∗ = x1∗ 1,∗ . Proof Apply Theorem 1.7 with p(x) := x1∗ 1,∗ x. Since x ∗ , ±x ≤ p(x), we have that x ≤ 1 implies x ∗ , ±x ≤ x1∗ 1,∗ . The result follows.
6
1 A Convex Optimization Toolbox
Corollary 1.9 Let x0 belong to the normed vector space X . Then there exists an x ∗ ∈ X ∗ such that x ∗ = 1 and x ∗ , x0 = x0 . Proof Apply Corollary 1.8 with X 1 = Rx0 and x1∗ (t x0 ) = tx0 , for t ∈ R.
The orthogonal of E ⊂ X is the closed subspace of X ∗ defined by E ⊥ := {x ∗ ∈ X ∗ ; x ∗ , x = 0, for all x ∈ E}.
(1.20)
Lemma 1.10 Let E be a subspace of X . Then E ⊥ = {0} iff E is dense. Proof (a) If E is dense, given x ∈ X , there exists a sequence xk in E, xk → x and hence, for all x ∗ ∈ E ⊥ , x ∗ , x = limk x ∗ , xk = 0, proving that x ∗ = 0. / E¯ (closure of E). We may assume that x0 = 1 and that (b) If E is not dense, let x0 ∈ B(x0 , ε) ∩ E = ∅ for some ε > 0. Let E 0 := E ⊕ (Rx0 ) denote the space spanned by E 0 and x0 . Consider the linear form λ on E 0 defined by λ(e + αx0 ) = α, for all e ∈ E and α ∈ R.
(1.21)
Since any x ∈ E 0 has a unique decomposition as x = e + αx0 with e ∈ E and α ∈ R, the linear form is well-defined. Let such an x satisfy α = 0. Since e := −e/α does not belong to B(x0 , ε), we have that x = |α|x0 − e ≥ ε|α|, and hence, λ(x) = α ≤ x/ε. If α = 0 we still have λ(x) ≤ x/ε. By Corollary 1.8, λ has an extension to a continuous linear form on X , which is a nonzero element of E ⊥ . Bidual space, Reflexivity Given x ∈ X , the mapping x : X ∗ → R, x ∗ → x ∗ , x is by (1.17) linear continuous. Since |x ∗ , x| ≤ x ∗ x, its norm x (in the bidual space X ∗∗ ) is not greater than x, and as a consequence of Corollary 1.9, is equal to x: the mapping x → x is isometric. This allows us to identify X with a closed subspace of X ∗∗ . We say that X is reflexive if X = X ∗∗ . The Hilbert spaces are reflexive, see [28].
1.1.2.2
Separation Theorems
We assume here that X is a normed vector space. A (topological) hyperplane of X is a set of the form: Hx ∗ ,α := {x ∈ X ; x ∗ , x = α, for some (x ∗ , α) ∈ X ∗ × R, x ∗ = 0}.
(1.22)
We call a set of the form {x ∈ X ; x ∗ , x ≤ α}, where x ∗ = 0, a (closed) half-space of X .
(1.23)
1.1 Convex Functions
7
Definition 1.11 Let A and B be two subsets of X . We say that the hyperplane Hx ∗ ,α separates A and B if x ∗ , a ≤ α ≤ x ∗ , b, for all (a, b) ∈ A × B.
(1.24)
We speak of a strict separation if x ∗ , a < x ∗ , b, for all (a, b) ∈ A × B,
(1.25)
and of a strong separation if, for some ε > 0, x ∗ , a + ε ≤ α ≤ x ∗ , b − ε, for all (a, b) ∈ A × B.
(1.26)
We say that x ∗ ∈ X ∗ (nonzero) separates A and B if (1.24) holds for some α, strictly separates A and B if (1.25) holds, and strongly separates A and B if (1.26) holds for some ε > 0 and α. If A is the singleton {a}, then we say that x ∗ separates a and B, etc. Given two subsets A and B of a vector space X , we define their Minkowski sum and difference as A + B = {a + b; a ∈ A, b ∈ B}, (1.27) A − B = {a − b; a ∈ A, b ∈ B}. The first geometric form of the Hahn–Banach theorem is as follows: Theorem 1.12 Let A and B be two nonempty subsets of the normed vector space X , with empty intersection. If A − B is convex and has a nonempty interior, then there exists a hyperplane Hx ∗ ,α separating A and B, such that x ∗ , a < x ∗ , b, whenever (a, b) ∈ A × B and a − b ∈ int(A − B).
(1.28)
Note that A − B has a nonempty interior whenever either A or B has a nonempty interior. The proof needs the following concept. Definition 1.13 Let C be a convex subset of X whose interior contains 0. The gauge function of C is (1.29) gC (x) := inf{β > 0; β −1 x ∈ C}. Example 1.14 If C is the closed unit ball of X , then gC (x) = x for all x ∈ X . A gauge function is obviously positively homogeneous and finite. If B(0, ε) ⊂ C for some ε > 0, then: gC (x) ≤ x/ε, for all x ∈ X, (1.30) so it is bounded over bounded sets. In addition, for any β > gC (x) and γ > 0, since x ∈ βC and B(0, γ ε) ⊂ γ C, we get x + B(0, γ ε) ⊂ (β + γ )C, so that gC (y) ≤ gC (x) + γ , for all y ∈ B(x, γ ε). We have proved that
8
1 A Convex Optimization Toolbox
If B(0, ε) ⊂ C, then gC is Lipschitz with constant 1/ε.
(1.31)
It easily follows that {x ∈ X ; gC (x) < 1} = int(C) ⊂ C¯ = {x ∈ X ; gC (x) ≤ 1}.
(1.32)
Lemma 1.15 A gauge is subadditive and convex. Proof Let x and y belong to X . For all βx > gC (x) and β y > gC (y), we have that (βx )−1 x ∈ C and (β y )−1 y ∈ C, so that βy x+y βx = (βx )−1 x + (β y )−1 y ∈ C. βx + β y βx + β y βx + β y
(1.33)
Therefore, gC (x + y) ≤ βx + β y . Since this holds for any βx > gC (x) and β y > gC (y), we obtain that gC is subadditive. Since gC is positively homogeneous, it easily follows that gc is convex. Proof (Proof of theorem 1.12) Let x0 ∈ int(B − A); since A ∩ B = ∅, x0 = 0. Set C := {a − b + x0 , b ∈ B, a ∈ A}.
(1.34)
We easily check that 0 ∈ int(C). Obviously, x ∗ separates A and B iff it separates C and {x0 }. Let λ be the linear form defined on X 1 := Rx0 by λ(t x0 ) = t, for t ∈ / C, and hence, gC (x0 ) ≥ 1. It easily follows that λ is R. Since A ∩ B = ∅, x0 ∈ dominated by gC on X 1 . By Theorem 1.7, there exists a (possibly not continuous) linear form x ∗ on X , dominated by gC , whose restriction to X 1 coincides with λ. Since 0 ∈ int(C) we have that (1.30) holds for some ε > 0, and hence, being dominated by gC , x ∗ is continuous. It follows that x ∗ , x ≤ 1, for all x ∈ C, or equivalently x ∗ , a − x ∗ , b + x ∗ , x0 ≤ 1, for all (a, b) ∈ A × B,
(1.35)
whereas x ∗ , x0 = 1. Therefore x ∗ separates A and B. In addition, if a − b ∈ int(A − B), say B(a − b, ε) ⊂ A − B for some ε > 0, then x ∗ , a − b + e ≤ 0 whenever e ≤ ε; maximizing over e ∈ B(0, ε) we obtain that x ∗ , a − b ≤ −εx ∗ . Relation (1.28) follows. Corollary 1.16 Let E be a closed convex subset of the normed space X . Then there exists a hyperplane that strongly separates any x0 ∈ / E and E. Proof For ε > 0 small enough, the open convex set A := B(x0 , ε) has empty intersection with E. By Theorem 1.12, there exists an x ∗ = 0 separating A and E, that is (1.36) x ∗ , x0 + εx ∗ ∗ = sup{x ∗ , a; a ∈ A} ≤ inf{x ∗ , b; b ∈ E}. The conclusion follows.
1.1 Convex Functions
9
Remark 1.17 Corollary 1.16 can be reformulated as follows: any closed convex subset of a normed space is the intersection of half spaces in which it is contained. The following example shows that, even in a Hilbert space, one cannot in general separate two convex sets with empty intersection. Example 1.18 Let X = 2 be the space of real sequences whose sum of squares of coefficients is summable. Let C be the subset of X of sequences with finitely many nonzero coefficients, the last one being positive. Then C is a convex cone that does not contain 0. Let x ∗ separate 0 and C. We can identify the Hilbert space X with its dual, and therefore x ∗ with an element of X . Since each element ei of the natural basis belongs to the cone C, we must have xi∗ ≥ 0 for all i, and x ∗j > 0 for some j. For any ε > 0 small enough, x := −e j + εe j+1 belongs to C, but x ∗ , x = −x ∗j + εx ∗j+1 < 0. This shows that one cannot separate the convex sets. So, 0 and C cannot be separated.
1.1.2.3
Relative Interior
Again, let X be a normed vector space, and E be a convex subset of X . We denote by affhull(E) the intersection of affine spaces containing E, and by affhull(E) its closure; the latter is the smallest closed affine space containing E. The relative interior of E, denoted by rint(E), is the interior of E viewed as a subset of affhull(E). Proposition 1.19 Let A and B be two nonempty subsets of X , with empty intersection. If A − B is convex and has a nonempty relative interior, then there exists a hyperplane Hx ∗ ,α separating A and B, and such that x ∗ , a < x ∗ , b, whenever (a, b) ∈ A × B and a − b ∈ rint(A − B).
(1.37)
Proof Set E := B − A and Y := affhull(E). By Theorem 1.12, there exists a y ∗ in Y ∗ separating 0 and E, with strict inequality for rint(E). By Theorem 1.7, there exists an x ∗ ∈ X ∗ whose restriction to Y is y ∗ , and the conclusion holds with x ∗ . Remark 1.20 By the previous proposition when B = {b} is a singleton, noting that rint(A − b) = rint(A) − b, we obtain that when A is convex, if b ∈ / rint(A) then there exists an x ∗ ∈ X ∗ such that x ∗ , a < x ∗ , b, whenever a ∈ rint(A).
(1.38)
Since any convex subset of a finite-dimensional subspace has a nonempty relative interior,1 we deduce the following: 1 Except
maybe when the set is a singleton and then the dimension is zero, where this is a matter of definition. However the case when A − B reduces to a singleton means that both A and B are singletons and then it is easy to separate them.
10
1 A Convex Optimization Toolbox
Corollary 1.21 Let A and B be two convex and nonempty subsets of a Euclidean space, with empty intersection. Then there exists a hyperplane Hx ∗ ,α separating A and B, such that (1.37) holds.
1.1.3 Weak Duality and Saddle Points Let X and Y be two sets and let L : X × Y → R. Then we have the weak duality inequality (1.39) sup inf L(x, y) ≤ inf sup L(x, y). y∈Y x∈X
x∈X y∈Y
Indeed, let (x0 , y0 ) ∈ X × Y . Then inf L(x, y0 ) ≤ L(x0 , y0 ) ≤ sup L(x0 , y).
x∈X
(1.40)
y∈Y
Removing the middle term, maximizing the left-hand side w.r.t. y0 and minimizing the right-hand side w.r.t. x0 , we obtain (1.39). We next define the primal and dual values, resp., for x ∈ X and y ∈ Y , by p(x) := sup L(x, y); d(y) := inf L(x, y), x∈X
y∈Y
(1.41)
and the primal and dual problem by Min p(x)
(P)
Max d(y).
(D)
x∈X
y∈Y
The weak duality inequality says that val(D) ≤ val(P). We say that (x, ¯ y¯ ) ∈ X × Y is a saddle point of L over X × Y if L(x, ¯ y) ≤ L(x, ¯ y¯ ) ≤ L(x, y¯ ), for all (x, y) ∈ X × Y.
(1.42)
An equivalent relation is ¯ y) = L(x, ¯ y¯ ) = inf L(x, y¯ ). sup L(x, y∈Y
x∈X
(1.43)
Minorizing the left-hand term by changing x¯ into the infimum w.r.t. x ∈ X and majorizing symmetrically the right-hand term, we obtain ¯ y¯ ) ≤ sup inf L(x, y), inf sup L(x, y) ≤ L(x,
x∈X y∈Y
y∈Y x∈X
(1.44)
1.1 Convex Functions
11
which, combined with the weak duality inequality, shows that x¯ ∈ S(P), y¯ ∈ S(D) and val(D) = val(P) = L(x, ¯ y¯ ). (1.45) But we have more, in fact, if we denote by S P(L) the set of saddle points, then: Lemma 1.22 The following holds: {val(D) = val(P) is finite}
⇒
S P(L) = S(P) × S(D).
(1.46)
Proof Indeed, let x¯ ∈ S(P) and y¯ ∈ S(D). Then ¯ y¯ ) ≤ sup L(x, ¯ y) = val(P). val(D) = inf L(x, y¯ ) ≤ L(x, x∈X
(1.47)
y∈Y
If val(D) = val(P), then these inequalities are equalities, so that (1.43) holds, and therefore (x, ¯ y¯ ) is a saddle point. The converse implication has already been obtained.
1.1.4 Linear Programming and Hoffman Bounds 1.1.4.1
Linear Programming
We assume in this section that X is a vector space (with no associated topology; this abstract setting does not make the proofs more complicated). Consider the infinitedimensional linear program Minc, x; ai , x ≤ bi , i = 1, . . . , p; x∈X
(L P)
where c and ai , i = 1, . . . , p, are linear forms over X , b ∈ R p , and ·, · denotes the action of a linear form over X . The associated Lagrangian function L : X × R p∗ → R is defined as p λi (ai , x − bi ) , (1.48) L(x, λ) := c, x + i=1 p∗
where the multiplier λ has to belong to R+ . The primal value satisfies p(x) = sup L(x, λ) = p∗ λ∈R+
c, x if x ∈ F(L P), +∞ otherwise.
(1.49)
Therefore (L P) and the primal problem (ofminimizing p(x)) have the same value p and set of solutions. Since L(x, λ) = c + i=1 λi ai , x − λb, we have that
12
1 A Convex Optimization Toolbox
d(λ) = inf L(x, λ) = x
p −λb if c + i=1 λi ai = 0, −∞ otherwise.
(1.50)
The dual problem has therefore the same value and set of solutions as the following problem, called dual to (L P): −λb; c + Max p∗
λ∈R+
p
λi ai = 0.
(L D)
i=1
For x ∈ F(L P), we denote the associated set of active constraints by I (x) := {1 ≤ i ≤ p; ai , x = bi }.
(1.51)
Consider the optimality system
p (i) c + i=1 λi ai = 0, (ii) λi ≥ 0, ai , x ≤ bi , λi (ai , x − bi ) = 0, i = 1, . . . , p.
(1.52)
Lemma 1.23 The pair (x, λ) ∈ F(L P) × F(L D) is a saddle point of the Lagrangian iff (1.52) holds. Proof Let x ∈ F(L P) and λ ∈ F(L D). Then (1.52)(i) holds, implying that the difference of cost function is equal to c, x + λb =
p
λi (bi − ai , x).
(1.53)
i=1
This sum of nonnegative terms is equal to zero iff the last relation of (1.52)(ii) holds, and then x ∈ S(L P) and λ ∈ S(L D), proving that (x, λ) is a saddle point. The converse implication is easily obtained. We next deal with the existence of solutions. Lemma 1.24 If (L P) (resp. (L D)) has a finite value, then its set of solutions is nonempty. Proof The proof of the two cases being similar, it suffices to prove the first statement. Since (L P) has a finite value, there exists a minimizing sequence x k . Extracting a subsequence if necessary, we may assume that I (x k ) is constant, say equal to J . Among such minimizing sequences we may assume that J is of maximal cardinality. If c, x k has, for large enough k, a constant value, then the corresponding x k is a solution of (L P). Otherwise, extracting a subsequence if necessary, we may assume that c, x k+1 < c, x k for all k. Set d k := x k+1 − x k , and consider the set E k := {ρ ≥ 0; x k + ρd k ∈ F(L P)}. Since c, d k < 0 and val(L P) > −∞, this set is bounded. The maximal element is
1.1 Convex Functions
13
ρk := argmin{(bi − ai , x k )/ai , d k ; ai , d k > 0}.
(1.54)
i
We have that y k := x k + ρk d k ∈ F(L P), c, y k < c, x k and J ⊂ I (y k ) strictly. Extracting from y k a minimizing sequence with constant set of active constraints, strictly containing J , we contradict the definition of J . q Lemma 1.25 Given y 1 , . . . , y q in Rn , the set E := { i=1 λi y i , λ ≥ 0} is closed. Proof (i) Assume first that the y i are linearly independent, and denote by Y the generated vector space, of which they are a basis. The coefficients λi represent the coordinates in this basis, and so, if ek → e in E, its coordinates converge to those of e and the result follows. (ii) We next deal with the general case. Let ek → e in E. Let us associate with each q q k k k e some λ ∈ R+ such that e = i=1 λik y i , and that λk has minimal support (the support being the set of nonzero components). Taking if necessary a subsequence, we may assume that this support J is constant q along the sequence. We claim that {y i , i ∈ J } is linearly independent, for if i=1 μi y i = 0 with μ = 0, and μi = 0 k k when λik = 0, we can find β k so that μ := λ + βk μ is nonnegative and has a support q smaller than J , and ek = i=1 μik y i , in contradiction with the definition of J . we deduce as in step (i) that the λk Since {y i , i ∈ J } is linearly independent, q converge to some λ¯ and that e = i=1 λ¯ i y i . The conclusion follows. We next prove a strong duality result. Lemma 1.26 If val(L P) is finite, then val(L P) = val(L D) and both S(L P) and S(L D) are nonempty. Proof By Lemma 1.26, (L P) has a solution x. ¯ Set J := I (x), ¯ j := |J | (the cardinality of J ); we may assume that J = 1, . . . , j. Consider the set C := {d ∈ X ; ai , d ≤ 0, i ∈ J }.
(1.55)
Obviously, if d ∈ C, then x¯ + ρd ∈ F(L P) for small enough ρ > 0, and consequently c, d ≥ 0, for all d ∈ C. (1.56) Consider the mapping Ax := (a1 , x, . . . , a j , x, c, x) over X , with image E 1 ⊂ R j+1 . We claim that the point z := (0, . . . , 0, −1) does not belong to the set j+1 E 2 := E 1 + R+ . Indeed, otherwise we would have z ≥ Ax, for some x ∈ X , and so ai , x ≤ 0, for all i ∈ J , whereas z j+1 = −1 contradicts (1.56). Let z 1 , . . . , z q be a basis of the vector space E 1 . Then E 2 is the set of nonnegative linear combinations of {±z 1 , . . . , ±z q , e1 , . . . , e j+1 }, where by ei we denote the elements of the natural basis of R j+1 . By Lemma 1.25, E 2 is closed. Corollary 1.16, allows us to strictly separate z and E 2 . That is, − λ j+1 = λz < inf λy, for some nonzero λ ∈ R j+1 . y∈E 2
(1.57)
14
1 A Convex Optimization Toolbox
Since E 2 is a cone, the above infimum is zero, whence λ j+1 > 0. Changing λ into j+1 λ/λ j+1 if necessary, we may assume that λ j+1 = 1. Since E 2 = E 1 + R+ it follows j+1 that λ ∈ R+ , and since E 1 ⊂ E 2 we deduce that 0 ≤ i∈J λi ai + c, d for all / J. d ∈ X , meaning that i∈J λi ai + c = 0. Let us now set λi = 0 for i ≤ p, i ∈ Then (1.52) holds at the point x. ¯ By Lemma 1.23, (x, ¯ λ) is a saddle point and the conclusion follows. Remark 1.27 It may happen, even in a finite-dimensional setting, that both the primal and dual problem are unfeasible, so that they have value +∞ and −∞ resp.; consider for instance the problem Min x∈R {−x; 0 × x = 1; −x ≤ 0}, whose dual is Maxλ∈R2 {−λ1 ; −1 − λ2 = 0; λ2 ≥ 0}. 1.1.4.2
Hoffman Bounds
As an application of linear programming duality we present Hoffman’s lemma [59]. Lemma 1.28 Given a Banach space X , a1 , . . . , a p in X ∗ , and b ∈ R p , set Cb := {x ∈ X ; ai , x ≤ bi , i = 1, . . . , p}.
(1.58)
Then there exists a Hoffman constant M > 0, not depending on b, such that, if Cb = ∅, then dist(x, Cb ) ≤ M
p
(ai , x − bi )+ , for all x ∈ X.
(1.59)
i=1
Proof (a) Define A ∈ L(X, R p ) by Ax = (a1 , x, . . . , a p , x). Let x1 , . . . , xq be elements of X such that (Ax1 , . . . , Axq ) is a basis of Im(A). Then q ≤ p, and the family {x1 , . . . , xq } is linearly independent. Denote by H the vector space with basis x1 , . . . , xq . The Euclidean norm over H is equivalent to the one induced by X , since all norms are equivalent on finite-dimensional spaces. (b) Let x ∈ X . We may express the coordinates of Ax in the basis {Ax j } as functions q of x, i.e., write Ax = j=1 α j (x)Ax j for some linear function α : X → Rq that is continuous. Indeed, by the equivalence of norms in a finite-dimensional space, for some positive c not depending on x: |α(x)| ≤ c |Ax| ≤ c Ax. Setting C x =
q j=1
(1.60)
α j (x)x j , and Bx := x − C x, we may write x = Bx + C x, and ABx = Ax −
q j=1
α j (x)Ax j = 0,
(1.61)
1.1 Convex Functions
15
i.e., AB = 0. So, Cb (assumed to be nonempty) is invariant under the addition of an element of the image of B. In particular, it contains an element xˆ such that xˆ − x ∈ H , that is, q βjxj. (1.62) xˆ = x + j=1
Specifically, let xˆ be such an element of Cb for which |β| is minimum, that is, β is a solution of the problem 1 γ j x j ≤ bi , i = 1, . . . , p. Minq |γ |2 ; ai , x + γ ∈R 2 j=1 q
(1.63)
q
The following optimality conditions hold: there exists a λ ∈ R+ such that γj +
p
λi ai , x j = 0,
j = 1, . . . , q,
(1.64)
i=1
⎛
and
λi ⎝ai , x +
q
⎞ γ j x j − bi ⎠ = 0, i = 1, . . . , p.
(1.65)
j=1
Using first (1.64) and then (1.65), we obtain |γ |2 =
q j=1
γ j2 = −
λi ai , γ j x j =
i, j
q
λi (ai , x − bi ) ≤ λ∞
i=1
q
(ai , x − bi )+ .
i=1
(1.66) (c) Among all possible multipliers λ we may take one with minimal support. From (1.64) we deduce the existence of M1 > 0 not depending on x and b, such that λ∞ ≤ M1 |γ |.
(1.67)
Combining with (1.66), we deduce that |γ | ≤ M1
q
(ai , x − bi )+ .
(1.68)
i=1
The conclusion follows since, as noticed before, the Euclidean norm on H is equivalent to the one induced by the norm of X .
16
1.1.4.3
1 A Convex Optimization Toolbox
The Open Mapping Theorem
We will generalize the previous result, and for this we need the following fundamental result in functional analysis. Theorem 1.29 (Open mapping theorem) Let X and Y be Banach spaces, and let A ∈ L(X, Y ) be surjective. Then α BY ⊂ AB X , for some α > 0.
Proof See e.g. [28].
Corollary 1.30 Let A and α be as in Theorem 1.29. Then Im(A ) is closed, and A λ ≥ αλ, for all λ ∈ Y ∗ .
(1.69)
Proof By the open mapping theorem, we have that A λ = sup λ, Ax X ≥ α sup λ, yY = αλ, x≤1
y≤1
(1.70)
proving (1.69). Let us now check that Im(A ) is closed. Let xk∗ in Im(A ) converge to x ∗ . There exists a sequence λk ∈ Y ∗ such that xk∗ = A λk . In view of (1.69), λk is a Cauchy sequence and hence has a limit λ¯ ∈ Y ∗ . Therefore x ∗ = A λ¯ ∈ Y ∗ . The conclusion follows. Proposition 1.31 Let X and Y be Banach spaces, and A ∈ L(X, Y ). Then Im(A ) ⊂ (Ker A)⊥ , with equality if A has a closed range. Proof (a) Let x ∈ Im(A ), i.e., x = A y ∗ , for some y ∈ Y , and x ∈ Ker A. Then x, x X = y, AxY = 0. Therefore, Im(A ) ⊂ (Ker A)⊥ . (b) Assume now that A has closed range. Let x ∗ ∈ (Ker A)⊥ . For y ∈ Y , set v(y) := x ∗ , x, where x ∈ X satisfies Ax = y.
(1.71)
Since x ∗ ∈ (Ker A)⊥ , any x such that Ax = y gives the same value of x ∗ , x X , and therefore v(y) is well-defined. It is easily checked that it is a linear function. By the open mapping theorem, applied to the restriction of A from X to its image (the latter being a Banach space by hypothesis), there exists an x ∈ α −1 yY B X such that Ax = y, so that |v(y)| ≤ α −1 x ∗ yY . So, v is a linear and continuous mapping, i.e., there exists a y ∗ ∈ Y ∗ such that v(y) = y ∗ , yY . For all x ∈ X , we have therefore x ∗ , x X = y ∗ , AxY = A y ∗ , x X , so that x ∗ = A y ∗ , as was to be proved. Remark 1.32 See in Example 1.115 another proof, based on duality theory. Example 1.33 Let X := L 2 (0, 1), Y := L 1 (0, 1), and A ∈ L(X, Y ) be the injection of X into Y . Then Ker A is reduced to 0, and therefore its orthogonal is X ∗ . On the other hand, we have that for y ∗ ∈ L ∞ (0, 1), A y ∗ is the operator in X ∗ defined by 1 x → 0 y ∗ (t)x(t)dt. So the image of A is a dense subspace of X ∗ , but A is not surjective, and therefore its image is not closed.
1.1 Convex Functions
17
We next give a useful generalization of Hoffman’s Lemma 1.28 in the homogeneous case. Lemma 1.34 Given Banach spaces X and Y , A ∈ L(X, Y ) surjective, and a1 , . . . , a p in X ∗ , set C := {x ∈ X ; Ax = 0; ai , x ≤ 0, i = 1, . . . , p}. Then there exists a Hoffman constant M > 0 such that
dist(x, C) ≤ M Ax +
p
(ai , x)+ , for all x ∈ X.
(1.72)
i=1
Proof By the open mapping Theorem 1.29, there exists an x ∈ Ker A such that x − x ≤ α −1 Ax, where α is given by Theorem 1.29. Therefore for some M > 0 not depending on x: ai , x + ≤ (ai , x)+ + MAx.
(1.73)
Applying Lemma 1.28 to x , with Ker A in place of X , we obtain the desired conclusion.
1.1.5 Conjugacy 1.1.5.1
Basic Properties
¯ Its (Legendre–Fenchel) conjugate is the Let X be a Banach space and f : X → R. ∗ ∗ ¯ function f : X → R defined by f ∗ (x ∗ ) := supx ∗ , x − f (x).
(1.74)
x∈X
This can be motivated as follows. Let us look for an affine minorant of f of the form x ∗ , x − β. For given x ∗ , the best (i.e., minimal) value of β is precisely f ∗ (x ∗ ). Being a supremum of affine functions, f ∗ is l.s.c. convex. We obviously have the Fenchel–Young inequality f ∗ (x ∗ ) ≥ x ∗ , x − f (x), for all x ∈ X and x ∗ ∈ X ∗ .
(1.75)
Remark 1.35 We have that f ∗ (x ∗ ) := −∞ for each x ∗ if dom( f ) = ∅.
(1.76)
f ∗ (x ∗ ) > −∞ for each x ∗ if dom( f ) = ∅.
(1.77)
18
1 A Convex Optimization Toolbox
Since the supremum over an empty set is −∞, we may always express f ∗ by maximizing over dom( f ): f ∗ (x ∗ ) :=
sup x ∗ , x − f (x).
x∈dom( f )
(1.78)
If f (x) is finite we can write the Fenchel–Young inequality in the more symmetric form x ∗ , x ≤ f (x) + f ∗ (x ∗ ). (1.79) Lemma 1.36 Let f be proper. Then the symmetric form (1.79) of the Fenchel–Young inequality is valid for any (x, x ∗ ) in X × X ∗ . Proof By (1.77), f ∗ (x ∗ ) > −∞, and f (x) > −∞, so that (1.79) makes sense and is equivalent to the Fenchel–Young inequality. Example 1.37 Let X be a Hilbert space, and define f : X → R by f (x) := 21 x2 . We identify X with its dual. Then it happens that f ∗ (x) = f (x) for all x in X , leading to the well-known inequality (where the l.h.s. is the scalar product between x and y): (x, y) ≤
1 1 x2 + y2 , for all x, y in X. 2 2
(1.80)
Example 1.38 Let p > 1. Define f : R → R by f (x) := |x| p / p. For y ∈ R, the maximum of x → x y − f (x) is attained at 0 if y = 0, and otherwise for some x = 0 of the same sign as y such that |x| p−1 = |y|. Introducing the conjugate ∗ exponent p ∗ such that 1/ p ∗ + 1/ p = 1, we get f ∗ (y) = |y| p / p ∗ , so that x y ≤ p p∗ ∗ for some p > 1, let f : Rn → R |x| / p + |y| / p , for all x, y in R. Similarly, p p n |xi | p . We easily obtain that be defined by f (x) := x p / p, where x p = i=1 ∗ p f ∗ (y) = y p∗ / p ∗ , leading to the Young inequality n
xi yi ≤
i=1
1 1 p∗ x pp + ∗ y p∗ , for all x, y in R. p p
(1.81)
Exercise 1.39 Let A be a symmetric, positive definite n × n matrix. (i) Check that the conjugate of f (x) := 21 x Ax is f ∗ (y) := 21 y A−1 y. Taking x = y, deduce the Young inequality |x|2 ≤
1 1 x Ax + x A−1 x, for all x ∈ Rn . 2 2
(1.82)
Conclude that A + A−1 − 2I is positive semidefinite, if A is symmetric and positive definite. (1.83)
1.1 Convex Functions
19
Exercise 1.40 Check that the conjugate of the indicatrix of the (open or closed) unit ball of X is the dual norm. ¯ Show that Exercise 1.41 Let f (x) := αg(x) with α > 0 and g : X → R. f ∗ (x ∗ ) = αg ∗ (x ∗ /α).
(1.84)
Exercise 1.42 Show that the Fenchel conjugate of the exponential is the entropy function H with value H (x) = x(log x − 1) if x > 0, H (0) = 0, and H (x) = ∞ if x < 0. Deduce the inequality x y ≤ e x + y(log y − 1), for all x ∈ R and y > 0. ¯ defined by The biconjugate of f is the function f ∗∗ : X → R f ∗∗ (x) := sup x ∗ , x − f ∗ (x ∗ ). x ∗ ∈X ∗
(1.85)
Proposition 1.43 The biconjugate f ∗∗ is the supremum of the affine minorants of f . Proof Let x ∗ ∈ X ∗ . If f ∗ (x ∗ ) = ∞, then f has no affine minorant with slope x ∗ . Otherwise, as we already observed, x ∗ , x − f ∗ (x ∗ ) is an affine minorant of f with the best possible constant term. The conclusion follows. A hyperplane (−x ∗ , β) ∈ X ∗ × R separating (x0 , α0 ) from epi( f ) is such that −x ∗ , x0 + βα0 ≤ −x ∗ , x + βα, for all (x, α) ∈ epi( f ).
(1.86)
If dom( f ) = ∅, then for some x ∈ dom( f ), we can take α → +∞ and it follows that β ≥ 0. If β = 0, we say that the hyperplane is vertical and the above relation reduces to (1.87) x ∗ , x0 ≥ x ∗ , x for all x ∈ dom( f ), i.e., −x ∗ separates x0 from dom( f ). Otherwise, we say that the separating hyperplane is oblique. We may then assume that β = 1 and we obtain that −x ∗ , x0 + α0 ≤ −x ∗ , x + f (x), for all x ∈ dom( f ).
(1.88)
This is equivalent to saying that x → x ∗ , x − x0 + α0 is an affine minorant of f . ¯ be proper, l.s.c. convex. Then f = f ∗∗ . Theorem 1.44 Let f : X → R Proof (a) It suffices to prove that any (x0 , α0 ) ∈ / epi( f ) can be strictly separated from epi( f ) by an oblique hyperplane. Indeed, the corresponding affine minorant / epi( f ∗∗ ). Since f ∗∗ ≤ f , it follows that epi( f ) = then guarantees that (x0 , α0 ) ∈ ∗∗ epi( f ) as was to be proved. (b) Since epi( f ) is a closed convex subset of X × R, by Corollary 1.16, it can be strictly separated from (x0 , α0 ). Note that not all separating hyperplanes are vertical, since otherwise f would have value −∞ over its (nonempty) domain. It follows that f has an affine minorant, say x → x ∗ , x + γ .
20
1 A Convex Optimization Toolbox
If the hyperplane strictly separating (x0 , α0 ) and epi( f ) is oblique, it provides an affine minorant of f with value greater than α0 at the point x0 , as required. If, on the contrary, the hyperplane strictly separating (x0 , α0 ) and epi( f ) is vertical, say y ∗ , x − x0 + ε ≤ 0 for all x ∈ dom( f ), with y ∗ = 0 and ε > 0, then we have that for any β > 0 and all x ∈ dom( f ): f (x) ≥ β(y ∗ , x − x0 + ε) + x ∗ , x + γ ,
(1.89)
meaning that the above r.h.s. is an affine minorant of f . At the same time, its value at x0 is βε + x ∗ , x0 + γ , which for β > 0 large enough is larger than α0 . So this r.h.s. is an oblique hyperplane separating (x0 , α0 ) from epi( f ). The conclusion follows. Definition 1.45 (i) Let E ⊂ X . The convex hull conv(E) is the smallest convex set containing E, i.e., the set of finite convex combinations (linear combinations with nonnegative weights whose sum is 1) of elements of E. The convex closure of E, denoted by conv(E), is the smallest closed convex set containing E (i.e., the intersection of closed convex set containing E). ¯ ¯ The convex closure of f is the function conv( f ) : X → R (ii) Let f : X → R. whose epigraph is conv(epi( f )) (note that conv( f ) is the supremum of l.s.c. convex minorants of f ). We obviously have that f = conv( f ) iff f is convex and l.s.c. ¯ We have the folTheorem 1.46 (Fenchel–Moreau–Rockafellar) Let f : X → R. lowing alternative: either (i) f ∗∗ = −∞ identically, conv( f ) has no finite value, and has value −∞ at some point, or (ii) f ∗∗ = conv( f ) and conv( f )(x) > −∞, for all x ∈ X . Proof If f is identically equal to +∞, the conclusion is obvious. So we may assume that dom( f ) = ∅. Since f ∗∗ is an l.s.c. convex minorant of f , we have that f ∗∗ ≤ conv( f ). So, if conv( f )(x1 ) = −∞ for some x1 ∈ X , then f has no affine minorant and f ∗∗ = −∞. In addition, since conv( f ) is l.s.c. convex, for any x ∈ dom(conv( f )), setting x θ := θ x + (1 − θ )x1 , we have that conv( f )(x) ≤ lim conv( f )(x θ ) ≤ θ conv( f )(x) + (1 − θ ) conv( f )(x1 ) = −∞, θ↑1
(1.90) so that (i) holds. On the contrary, if (i) does not hold, then f has a continuous affine minorant, so that then conv( f )(x) > −∞, for all x ∈ X . Being proper, l.s.c. and convex, conv( f ) is by Theorem 1.44 the supremum of its affine minorants, which coincide with the affine minorants of f . The conclusion follows. ¯ Then Corollary 1.47 Let f be convex X → R. (i) conv( f )(x) = lim inf x →x f (x ), for all x ∈ X , (ii) if f is finite-valued and l.s.c. at some x0 ∈ X , then f (x0 ) = f ∗∗ (x0 ).
1.1 Convex Functions
21
Proof (i) Set g(x) := lim inf x →x f (x ). It is easily checked that g is an l.s.c. convex minorant of f , and therefore g ≤ conv( f ). On the other hand, since conv( f ) is an l.s.c. minorant of f , we have that conv( f )(x) ≤ lim inf x →x f (x ) = g(x), proving (i). (ii) By point (i), since f is finite-valued and l.s.c. at x0 , we have that f (x0 ) = conv( f )(x0 ) > −∞, and we conclude by Theorem 1.46. Example 1.48 Let K be a nonempty closed convex subset of X , and set f (x) = −∞ if x ∈ K , and f (x) = +∞ otherwise. Then f is l.s.c. convex, and f ∗∗ has value −∞ everywhere, so that f = f ∗∗ .
1.1.5.2
Conjugacy in Dual Spaces
¯ Its (Legendre–Fenchel) conjugate (in the Let X be a Banach space, and g : X ∗ → R. ¯ defined by dual sense) is the function g ∗ : X → R g ∗ (x) := sup x ∗ , x − g(x ∗ ). x ∗ ∈X ∗
(1.91)
So we have the dual Fenchel–Young inequality g ∗ (x) ≥ x ∗ , x − g(x ∗ ), for all x ∈ X and x ∗ ∈ X ∗ .
(1.92)
Being a supremum of affine functions, g ∗ is l.s.c. convex. Its biconjugate g ∗∗ is the Legendre–Fenchel conjugate (in the sense of Sect. 1.1.5) of g ∗ , i.e. g ∗∗ (x ∗ ) := supx ∗ , x − g ∗ (x).
(1.93)
x∈X
Let us call a function X ∗ → R of the form x ∗ → x ∗ , x + α, with (x, α) ∈ X × R, a ∗affine function; note that this excludes the affine functions of the form x ∗ → x ∗∗ , x ∗ + α, where x ∗∗ ∈ X ∗∗ \ X . We call the ∗affine functions that minorize g ∗affine minorants of g. By the same arguments as in Sect. 1.1.5, we obtain that g ∗∗ is the supremum of ∗ affine minorants of g,
(1.94)
and so we get the following result: ¯ We have that g = g ∗∗ iff g is a supremum of ∗affine Lemma 1.49 Let g : X ∗ → R. functions. ¯ the Fenchel conjugate f ∗ is a supremum of Remark 1.50 For any f : X → R, ∗affine functions. It follows that f ∗∗∗ = f ∗ .
(1.95)
22
1 A Convex Optimization Toolbox
Example 1.51 Recall the definition (1.6) of the indicatrix function. The support ¯ (sometimes also denoted by σ (·, K )) is defined by function σ K : X ∗ → R σ K (x ∗ ) := sup{x ∗ , x; x ∈ K }.
(1.96)
Clearly σ K is the conjugate of I K . If K is closed and convex, then I K is proper, l.s.c. and convex, and hence, is equal to its biconjugate, so that the conjugate of σ K is I K . Otherwise, let K be the smallest closed convex set containing K . It is easily checked ∗ = σ K . Since IK is l.s.c. convex and proper it is equal to its biconjugate. We that IK proved that (1.97) IK and σ K are conjugate to each other.
1.1.5.3
Continuity and Subdifferentiability
¯ have a finite value at some x ∈ X . We define the subdifferential of Let f : X → R f at x as the set ∂ f (x) := {x ∗ ∈ X ∗ ;
f (x ) ≥ f (x) + x ∗ , x − x, for all x ∈ X }.
(1.98)
Equivalently, ∂ f (x) is the set of slopes of affine minorants of f that are exact (i.e., equal to f ) at the point x. The inequality in (1.98) may be written as x ∗ , x − f (x) ≥ x ∗ , x − f (x ), for all x ∈ X.
(1.99)
Therefore x ∗ ∈ ∂ f (x) iff x attains the maximum in the definition of f ∗ (x ∗ ), i.e., we have that (1.100) {x ∗ ∈ ∂ f (x)} ⇔ { f ∗ (x ∗ ) + f (x) = x ∗ , x}. In other words, x ∗ ∈ ∂ f (x) iff it gives an equality in the Fenchel–Young inequality (1.79). By (1.85), f ∗∗ is the supremum of its affine minorants which are of the form ∗ x , x − β, β ≥ f ∗ (x ∗ ), for x ∗ ∈ dom( f ∗ ). It follows that the affine minorants that are exact at x for f ∗∗ are those which attain the supremum in (1.85), i.e. {x ∗ ∈ ∂ f ∗∗ (x)} ⇔ { f ∗ (x ∗ ) + f ∗∗ (x) = x ∗ , x}.
(1.101)
Also, if ∂ f (x) = ∅, then the corresponding affine minorants, exact at x, are also minorants of f ∗∗ exact at x, and therefore {∂ f (x) = ∅} ⇒ { f ∗∗ (x) = f (x)} ⇒ {∂ f ∗∗ (x) = ∂ f (x)}.
(1.102)
¯ with finite value at We may also define the subdifferential of a function g : X ∗ → R ∗ x as
1.1 Convex Functions
23
∂g(x ∗ ) := {x ∈ X ; g(y ∗ ) ≥ g(x ∗ ) + y ∗ − x ∗ , x, for all y ∗ ∈ X ∗ }.
(1.103)
Similarly to what was done before we can express the above inequality as x ∗ , x − g(x ∗ ) ≥ y ∗ , x − g(y ∗ ), for all y ∗ ∈ X ∗ .
(1.104)
This means that x ∈ ∂g(x ∗ ) iff x ∗ attains the maximum in the definition of g ∗ (x), i.e., we have that {x ∈ ∂g(x ∗ )} ⇔ {g(x ∗ ) + g ∗ (x) = x ∗ , x}.
(1.105)
With similar arguments we obtain that {x ∈ ∂g ∗∗ (x ∗ )} ⇔ {g ∗∗ (x ∗ ) + g ∗ (x) = x ∗ , x}.
(1.106)
When g is itself a conjugate function we deduce the following. ¯ have a finite value at some x ∈ X . That equality holds Lemma 1.52 Let f : X → R in the Fenchel–Young inequality (1.79) implies that x ∈ ∂ f ∗ (x ∗ ); the converse holds if f is proper, l.s.c. convex. Proof If (1.79) holds with equality, we know that f (x) = f ∗∗ (x) and so, by (1.105) applied to g = f ∗ , x ∈ ∂ f ∗ (x ∗ ) holds. Conversely, if x ∈ ∂ f ∗ (x ∗ ), then by (1.105) applied to g = f ∗ , we have that f ∗ (x ∗ ) + f ∗∗ (x) = x ∗ , x. When f is proper, l.s.c. convex, f ∗∗ (x) = f (x), so that equality holds in the Fenchel–Young inequality, as was to be proved. Remark 1.53 So, if f is proper, l.s.c. convex, we have that x ∗ ∈ ∂ f (x) iff x ∈ ∂ f ∗ (x ∗ ).
(1.107)
In this sense the Fenchel Legendre transform is an extension of the property of the classical Legendre transform which, under certain conditions, associates with a smooth function f over Rn another smooth function fˆ over Rn such that y = f (x) iff x = fˆ (y). In the analysis of stochastic problems we will need the following sensitivity analysis results for linear programs. Example 1.54 Given d ∈ Rm , b ∈ R p and matrices A and M of size p × m and ¯ be defined by p × n resp., let f : Rn → R f (x) := infm {d · y; Ay = b + M x}. y∈R+
This is the value of a linear program whose dual is
(1.108)
24
1 A Convex Optimization Toolbox
Maxp −λ · (b + M x); d + A λ ≥ 0. λ∈R
(Dx )
The next lemma gives an expression of ∂ f (x). Lemma 1.55 Let f have a finite value at x¯ ∈ X . Then ∂ f (x) ¯ is nonempty and satisfies ∂ f (x) ¯ = {−M λ; λ ∈ S(Dx¯ )}. (1.109) Proof The conjugate of f is f ∗ (x ∗ ) = supx;y≥0 {x ∗ · x − dy; Ay = b + M x} = − inf x;y≥0 {d · y − x ∗ · x; Ay = b + M x}.
(1.110)
Since f (x) ¯ is finite, the linear program involved in the above r.h.s. is feasible. By linear programming duality (Lemma 1.26) it has the same value as its dual, and hence, − f ∗ (x ∗ ) = supλ∈R p {−λ · b; x ∗ = −M λ; d + A λ ≥ 0}.
(1.111)
The Fenchel–Young inequality implies 0 ≤ f (x) ¯ + f ∗ (x ∗ ) − x ∗ · x¯ = f (x) ¯ − x ∗ · x¯ + inf λ∈R p {λb; x ∗ = −M λ; d + A λ ≥ 0}.
(1.112)
Since f (x) ¯ is the finite value of a feasible linear program, it is equal to val(Dx¯ ). So, let λ¯ ∈ S(Dx¯ ). The Fenchel–Young inequality (1.112) is equivalent to λ¯ · (b + M x) ¯ ≤ −x ∗ · x¯ + infp {λ · b; x ∗ = −M λ; d + A λ ≥ 0}. λ∈R
(1.113)
When equality holds, the linear program on the r.h.s. has a solution, say λ, and ¯ so that equality holds iff −x ∗ · x¯ = λ M x, λ¯ · (b + M x) ¯ = minp {λ · (b + M x); ¯ x ∗ = −M λ; d + A λ ≥ 0}. λ∈R
(1.114)
Recall that this is the case of equality in the Fenchel–Young inequality, and therefore ¯ Since the cost function and last constraint correspond to those it holds iff x ∗ ∈ ∂ f (x). of (Dx¯ ), it follows that any solution λˆ of the linear program on the r.h.s. belongs to S(Dx¯ ). We have proved that, if x ∗ ∈ ∂ f (x), then x ∗ = −M λ for some λ ∈ S(Dx¯ ). The converse obviously holds in view of (1.114). Remark 1.56 Consider the particular case when b = 0 and M is the opposite of the identity. Rewriting as b the variable x, we obtain that the function f (b) := infm {d · y; Ay + b = 0} y∈R+
(1.115)
1.1 Convex Functions
25
has, over its domain, a subdifferential equal to the set of solutions of the dual problem Maxp λ · b; d + A λ ≥ 0. λ∈R
(1.116)
We now show that, for convex functions, a local uniform upper bound implies a Lipschitz property as well as subdifferentiability. ¯ be convex, finitely-valued at x0 , and uniformly upper Lemma 1.57 Let f : X → R bounded near x0 , i.e., for some a ∈ R and r > 0: f (x) ≤ a whenever x − x0 ≤ r.
(1.117)
Then f is Lipschitz with constant say L on B(x0 , 21 r ). ¯ 0 , r ), we have by Proof Let ε ∈]0, 1[ and h ∈ X , h = εr . Since x0 ± ε−1 h ∈ B(x convexity of f that f (x0 + h) ≤ (1 − ε) f (x0 ) + ε f (x0 + ε−1 h) ≤ (1 − ε) f (x0 ) + εa, f (x0 + h) ≥ (1 + ε) f (x0 ) − ε f (x0 − ε−1 h) ≥ (1 + ε) f (x0 ) − εa. It follows that | f (x0 + h) − f (x0 )| ≤ ε(a − f (x0 )) = r −1 (a − f (x0 ))h.
(1.118)
¯ 0 , r ), we have that f (x) ≥ b, with b := f (x0 ) − Therefore, for all x ∈ B(x ¯ 0 , r1 ), where r1 := 1 r . Then b ≤ f (x) ≤ a, for all x ∈ (a − f (x0 )). Let x1 ∈ B(x 2 ¯ 1 , r1 ). Applying (1.118) at the point x1 , with r = r1 , we get B(x | f (x1 + h) − f (x1 )| ≤ r1−1 (a − b)h, for allh < r1 .
(1.119)
¯ 0 , r1 ), as was to be Therefore f is Lipschitz with constant r1−1 (a − b) over B(x proved. ¯ be proper convex. Then it is locally Lipschitz over Corollary 1.58 Let f : Rn → R the interior of its domain. Proof Let x¯ ∈ int dom( f ). There exists x 0 , . . . , x n in dom( f ) such that x¯ ∈ int E, where E := conv({x 0 , . . . , x n }). Then f (x) ≤ max{ f (x 0 ), . . . , f (x n )} over E. We conclude by Lemma 1.57. ¯ be convex, and Lipschitz with constant L near x0 ∈ X . Lemma 1.59 Let f : X → R ¯ L). Then ∂ f (x0 ) is nonempty and included in B(0, Proof Let xˆ ∈ B(x0 , 21 r ). Set E = {(x, γ ) ∈ X × R; γ > f (x)}. Since f is continuous at x0 , for ε > 0 small enough,
26
1 A Convex Optimization Toolbox
B(x0 , ε) × [ f (x0 ) + 1, ∞) ⊂ E,
(1.120)
so that E has a nonempty interior. By Theorem 1.12, there exists (λ, α) ∈ X ∗ × R separating (x, ˆ f (x)) ˆ and E, i.e., such that λ, x ˆ + α f (x) ˆ ≤ λ, x + αγ , for all x ∈ dom( f ), γ > f (x).
(1.121)
Taking x = x0 and γ > f (x), γ → +∞, we see that α ≥ 0. The separating hyperplane cannot be vertical, since xˆ ∈ int(dom( f )), so that we may take α = 1. Minimizing w.r.t. γ we obtain that f (x) ≥ f (x) ˆ − λ, x − x, ˆ proving that −λ ∈ ∂ f (x). ˆ ¯ We now check that ∂ f (x) ⊂ B(0, L). Assume that x ∗ ∈ ∂ f (x), with x ∗ ∗ > L. Then there exists a d ∈ X with d = 1 and x ∗ , d > L. Therefore by the definition of a subdifferential, lim σ ↓0
f (x + σ d) − f (x) ≥ x ∗ , d > L , σ
in contradiction with the fact that L is a local Lipschitz constant.
(1.122)
Example 1.60 Consider the entropy function f (x) = x log x if x ≥ 0 (with value 0 at zero), and f (x) = +∞ otherwise. Then f is l.s.c. convex, and the subdifferential is empty at x = 0. So, in general, even in a Euclidean space, an l.s.c. convex function may have an empty subdifferential at some points of its domain. We next introduce a concept that in some sense is an algebraic variant of the interior. Definition 1.61 Let S ⊂ X , where X is a vector space. Then we say that x ∈ S belongs to the core of S and write x ∈ core(S) if, for each h ∈ X , there exists an ε > 0 such that [x − εh, x + εh] ⊂ S. Lemma 1.62 We have that int(S) ⊂ core(S), and the converse holds in the following cases: (i) int(S) = ∅, (ii) S is finite-dimensional, (iii) S is closed and convex. Proof That int(X ) ⊂ core(S) is an immediate consequence of the definition. That the converse holds in cases (i) and (ii) is left as an exercise. Let us suppose now that S is closed and convex. If core(S) = ∅, the conclusion trivially holds. Otherwise, we may assume that 0 ∈ core(S). For k ∈ N, set Sk := k S; then X = ∪k Sk . By Baire’s lemma,2 at least one element of the family has a nonempty interior. Since Sk = k S this means that there exists an x1 ∈ S such that B(x1 , ε) ⊂ S for some ε > 0. We have that −x1 ∈ S for some ∈ N, as well as B(x1 , ε) ⊂ S , and so since S is convex, B(0, 21 ε) ⊂ S , proving that B(0, ε ) ⊂ S with ε := 21 ε/. The conclusion follows. 2 Baire’s lemma tells us that any countable intersection of dense subsets in
X is dense, or equivalently, that any countable union of closed sets with empty interiors has an empty interior.
1.1 Convex Functions
27
We next give an example3 of a set with an empty interior, and a nonempty core. Example 1.63 Let X be an infinite-dimensional Banach space. It is known that there exists a non-continuous linear form on X , that we denote by a(x). Set A := {x ∈ X ; |a(x)| ≤ 1}.
(1.123)
Clearly, 0 ∈ core(A). However, since a(x) is not continuous, and therefore not bounded in a neighbourhood of 0, A has an empty interior. ¯ Then it is continuous over Proposition 1.64 Let f be a convex function Rn → R. the interior of its domain. Proof Let x¯ belong to the interior of dom( f ). Then there exists x 0 , . . . , x n in dom( f ) whose convex hull E is such that B(x, ¯ ε) ∈ int(E), for some ε > 0. Since f is convex ¯ ε). So, the conclusion follows it follows that f (x) ≤ maxi f (x i ) for all x in B(x, from Lemma 1.57. ¯ Then it is conProposition 1.65 Let f be a proper l.s.c. convex function X → R. tinuous over the interior of its domain. Proof Let x0 ∈ int(dom( f )), and set S := {x ∈ X ; f (x) ≤ f (x0 ) + 1}. Since f is l.s.c., this is a closed set. Fix h ∈ X ; for t ∈ R, the function ϕ(t) := f (x0 + th) has a finite value at 0, and its domain contains [−ε, ε] for some ε > 0. For t ∈ [−ε, ε], we have that ϕ(t) ≤ max( f (x0 − εh), f (x0 + εh)). By Lemma 1.57, ϕ is continuous at 0, proving that x0 ∈ core(S). By Lemma 1.62, x0 ∈ int(S), meaning that f is bounded from above near x0 . We conclude with Lemma 1.57. If f is convex, then for all x and h in X , ( f (x + th) − f (x))/t is nondecreasing w.r.t. t ∈ (0, ∞). Therefore the directional derivative f (x, h) := lim t↓0
f (x + th) − f (x) t
(1.124)
always exists. Let us see how its value is related to the subdifferential. For this we need a preliminary lemma on positively homogeneous functions. ¯ be positively homogeneous, i.e. Lemma 1.66 Let F : X → R F(γ x) = γ F(x), for all γ > 0,
(1.125)
with F(0) = 0. Then (i) F ∗ is the indicatrix of ∂ F(0), and
3 Provided
∂ F(0) := {x ∗ ∈ X ∗ ; x ∗ , h ≤ F(h), for all h ∈ X },
(1.126)
F ∗∗ (h) = sup{x ∗ , h; x ∗ ∈ ∂ F(0)}.
(1.127)
by Lionel Thibault, U. Montpellier II.
28
1 A Convex Optimization Toolbox
(ii) If ∂ F(0) = ∅, then ∂ F ∗∗ (h) = {x ∗ ∈ ∂ F(0); F ∗∗ (h) = x ∗ , h}.
(1.128)
Proof (i) Relation (1.126) is the definition of ∂ F(0). By positive homogeneity, F ∗ (x ∗ ) is equal to 0 if x ∗ ∈ ∂ F(0), and +∞ otherwise, proving that F ∗ is the indicatrix of ∂ F(0). It follows that F ∗∗ satisfies (1.127), and so is positively homogeneous. (ii) if ∂ F(0) = ∅, then by (1.102), F ∗∗ (0) = 0 = F(0), and ∂ F(0) = ∂ F ∗∗ (0). Now let x ∗ ∈ ∂ F ∗∗ (h). It is easily checked that for each γ > 0 and y ∈ X : γ F ∗∗ (y) = F ∗∗ (γ y) ≥ F ∗∗ (h) + x ∗ , γ y − h.
(1.129)
Dividing by γ ↑ ∞ we obtain that x ∗ , y ≤ F ∗∗ (y), for all y ∈ X , i.e., x ∗ ∈ ∂ F ∗∗ (0) = ∂ F(0). Taking y = 0 in (1.129) we get the opposite inequality F ∗∗ (h) ≤ x ∗ , h; therefore x ∗ belongs to the r.h.s. of (1.128). Conversely, let x ∗ ∈ ∂ F ∗∗ (0) = ∂ F(0) be such that x ∗ , h = F ∗∗ (h). Then for any y ∈ X , we have that F ∗∗ (y) ≥ x ∗ , y = x ∗ , y − h + F ∗∗ (h), proving that x ∗ ∈ ∂ F ∗∗ (h). The conclusion follows. ¯ be convex, finitely-valued at x, Theorem 1.67 Let f : X → R ¯ and set F(·) := f (x, ¯ ·). Then (i) ∂ f (x) ¯ = ∂ F(0), and (ii) if ∂ f (x) ¯ = ∅, then ¯ h) ≥ lim inf f (x, ¯ h ) = sup{x ∗ , h; x ∗ ∈ ∂ f (x)}. ¯ f (x, h →h
(1.130)
Proof (i) The function F is positively homogeneous, with value 0 at 0, and is easily proved to be convex. Let x ∗ ∈ ∂ F(0). Then ¯ = F(0) + x ∗ , x − x ¯ ≤ F(x − x) ¯ ≤ f (x) − f (x), ¯ for all x ∈ X, x ∗ , x − x (1.131) ¯ Then, for any h ∈ X : implying that ∂ F(0) ⊂ ∂ f (x). ¯ Conversely, let x ∗ ∈ ∂ f (x). F(h) = lim t↓0
f (x¯ + th) − f (x) ¯ ≥ x ∗ , h, t
(1.132)
proving the converse inclusion; point (i) follows. (ii) Since ∂ f (x) ¯ = ∅, we have that ∂ F(0) = ∅. By Theorem 1.46 and Corollary 1.47(i), we have that F ∗∗ (h) = conv(F)(h) = lim inf F(h ). h →h
(1.133)
We then deduce the equality in (1.130) from Lemma 1.66(i). The first inequality being trivial, the conclusion follows.
1.1 Convex Functions
29
Definition 1.68 Let F : X → Y , where X and Y are Banach spaces. We say that F is Gâteaux differentiable (or G-differentiable) at x¯ ∈ X if, for any h ∈ X , the directional ¯ h) exists and the mapping h → F (x, ¯ h) is linear and continuous. derivative F (x, ¯ h), We denote by D F(x) ¯ ∈ L(X, Y ) the derivative of F defined by D F(x)h ¯ = F (x, for all h ∈ X . ¯ be convex, and continuous at x. Corollary 1.69 Let f : X → R ¯ Then ¯ h) = max{x ∗ , h; x ∗ ∈ ∂ f (x)}. ¯ f (x,
(1.134)
If in addition ∂ f (x) ¯ is the singleton {x ∗ }, then f is G-differentiable at x¯ with G∗ derivative x . Proof By Lemmas 1.57 and 1.59, f is locally Lipschitz near x¯ and ∂ f (x) ¯ is ¯ ·) is Lipschitz (as is easily shown), we have equality nonempty. Since F(·) := f (x, in (1.130), and F = F ∗∗ has a subderivative at h, characterized by (1.128), so that the supremum in (1.130) is a maximum. If ∂ f (x) ¯ is a singleton, the G-differentiability easily follows. Exercise 1.70 Let X := 2 be the spaceof square summable sequences and f : X → R ∪ {+∞} be defined by f (x) := k kxk2 . Let x¯ be the null sequence. Show that ∂ f (x) ¯ is a singleton, although f is not G-differentiable at x. ¯ Hint: show that ∂ f (x) ¯ = {0}, and that t → f (t x) is not continuous at 0 if x ∈ / dom( f ). Definition 1.71 The norm of X is said to be differentiable if it is Fréchet differentiable at any nonzero point, and strictly subadditive if x + x < x + x except if x and x are both nonnegative multiples of some x ∈ X . We adopt similar definitions for X ∗ . Exercise 1.72 Show that (i) the conjugate of the norm is the closed unit ball of X ∗ , (ii) the norm is never differentiable at 0 (except for null spaces!), (iii) if x ∈ X is nonzero, then ∂x = {x ∗ ∈ X ∗ ; x ∗ ∗ = 1; x ∗ , x = x},
(1.135)
(iv) the space X has a differentiable norm if the dual norm is strictly subadditive. Example 1.73 (Example of strict inequality in (1.130)) Let f (x) := 21 x12 /x2 , with domain the elements x ∈ R2 such that x1 > 0 and x2 > 0. Since x1 /x2 ; − 21 x12 /x22
∇ f (x) =
D 2 f (x) =
1/x2 −x1 /x22 , −x1 /x22 x12 /x23
(1.136)
we easily check that D 2 f (x) is positive semidefinite, and hence, f is convex over its convex domain. We set f (x) = +∞ if x1 < 0 or x2 < 0, and examine how to define f on R2+ when min(x1 , x2 ) = 0, in order to make f l.s.c., i.e. we compute
30
1 A Convex Optimization Toolbox
f (x) := lim inf{ f (x ); min(x1 , x2 ) > 0}. Clearly, when x2 > 0 (resp. x1 > 0) there exists a limit of value 0 (resp. +∞), and so, since f is nonnegative, its value at 0 should be 0 (resp. +∞). So we finally set f (x) :=
⎧ ⎨0 ⎩
1 2 x /x2 2 1
+∞
if x = 0, if x1 ≥ 0 and x2 > 0, otherwise.
(1.137)
We easily check that f has the following strange property: while min( f ) = 0, there exists a sequence x k ∈ dom( f ) such that D f (x k ) → 0 and f (x k ) → +∞. The directional derivatives of f at x = 0 are for h = 0: f (0, h) :=
0 if h 1 ≥ 0 and h 2 > 0, +∞ otherwise.
(1.138)
¯ In (1.130), For h¯ = (1 0) , we have that lim inf h →h¯ f (0, h ) = 0 < +∞ = f (0, h). the inequality is strict, and the supremum is attained for x ∗ = 0.
1.1.5.4
Polarity of Convex Sets
We have already discussed in Example 1.51 the link between the indicatrix and support functions. Definition 1.74 Let K be a subset of a Banach space X , and x0 ∈ X . The (negative) polar set of K w.r.t. x0 is the set K − (x0 ) := {x ∗ ∈ X ; x ∗ , x − x0 ≤ 1, for all x ∈ K }.
(1.139)
Let K ∗ be a subset of X ∗ , and x0∗ ∈ X ∗ . The (negative) polar set of K ∗ w.r.t. x0∗ is the set K ∗− (x0∗ ) := {x ∈ X ; x ∗ − x0∗ , x ≤ 1, for all x ∗ ∈ K ∗ }.
(1.140)
Observe that we obtain the same polar sets if we replace K or K ∗ by their convex closure. We also define the positive polar sets as K + (x0 ) := −K − (x0 ) = {x ∗ ∈ X ; −x ∗ , x − x0 ≤ 1, for all x ∈ K }, (1.141) and similarly K ∗− (x0∗ ) := −K ∗− (x0∗ ). When x0 = 0 (resp. x0∗ = 0), we simply denote the polar set by K − (resp. K ∗− ). The bipolar set is defined as e.g. K −− := (K − )− . Exercise 1.75 Let C be the closed unit ball of the Banach space X . Check that C − is the closed unit ball of X ∗ , and that C −− = C. Hint: use Corollary 1.8.
1.1 Convex Functions
31
Exercise 1.76 Let K be a convex subset of a Banach space X . Check that: (i) If B(x0 , ε) ⊂ K , for some ε > 0, then K − (x0 ) ⊂ B(0, 1/ε). (ii) If K is bounded, then 0 ∈ int(K − (x0 )). Lemma 1.77 Let K be a subset of X . Then K −− = conv(K ∪ {0}). In particular, if K is closed and convex, and contains 0, then K −− = K . Proof It suffices to prove the first statement. It is easily seen that both K and 0 belong to K −− . Since K −− is closed and convex, it contains K := conv(K ∪ {0}). Now let x¯ ∈ / K . We can strictly separate K from x, ¯ i.e., there exists an x ∗ ∈ X ∗ such that ¯ Since 0 ∈ K , x ∗ , x ¯ > 0. For any positive α < x ∗ , x, ¯ supx∈K x ∗ , x < x ∗ , x. ¯ we have that y ∗ := α −1 x ∗ is such that y ∗ , x ¯ > 1, and close enough to x ∗ , x, y ∗ , x ≤ 1 for all x ∈ K , so that y ∗ ∈ K − and then x¯ cannot belong to K −− . The conclusion follows. We will mostly use the notion of polarity for convex cones. Exercise 1.78 Check that, when K (resp. K ∗− ) is a cone, then (i) K − (resp. K ∗− ) is itself a cone, called the (negative) polar cone, such that
K − := {x ∗ ∈ X ∗ ; x ∗ , x ≤ 0, for all x ∈ K }, K ∗− := {x ∈ X ; x ∗ , x ≤ 0, for all x ∗ ∈ K ∗ },
(1.142)
and (ii) the Fenchel conjugate of the corresponding indicatrix functions satisfy σ K = I K∗ = I K − ;
I K∗ ∗ = I K ∗− .
(1.143)
Exercise 1.79 Let X be a Banach space and C1 and C2 be two convex cones of the same space Y , with either Y = X or Y = X ∗ . Check that (C1 + C2 )− = C1− ∩ C2− .
(1.144)
Definition 1.80 Let K be a convex subset of a Banach space X , and x¯ ∈ K . (i) We ¯ the tangent cone (in the sense of convex analysis) to call the closure of R+ (K − x) ¯ (ii) We call the set K at x, ¯ and denote it by TK (x). ¯ := {x ∗ ∈ X ∗ ; x ∗ , x − x ¯ ≤ 0, for all x ∈ K } N K (x)
(1.145)
the normal cone to K at x. ¯ We note that, in the setting of the previous definition, if h ∈ TK (x), ¯ then dist(x¯ + σ h, K ) = o(σ ), for σ > 0.
(1.146)
Exercise 1.81 Let K be a closed convex subset of a Banach space X . 1. Check that the tangent and normal cone (to a convex set) are polar to each other.
32
1 A Convex Optimization Toolbox
2. Let x¯ ∈ K . Check that ∂ I K (x) ¯ = N K (x). ¯ 3. Let x ∗ ∈ X ∗ be such that σ K (x ∗ ) is finite. Show that ∂σ K (x ∗ ) = {x ∈ K ; x ∗ , x ≥ x ∗ , x , for all x ∈ K }, or equivalently, setting N K−1 (x ∗ ) := {x ∈ K ; x ∗ ∈ N K (x)}: ∂σ K (x ∗ ) = N K−1 (x ∗ ).
(1.147)
Exercise 1.82 Let C be a closed convex cone of a Banach space X , and let x¯ ∈ C. Check that ¯ = C − ∩ (x) ¯ ⊥ ; TC (x) ¯ = C + Rx. ¯ (1.148) NC (x) Hint: for the second relation, apply (1.144) with C1 = C and C2 = Rx. ¯
1.2 Duality Theory 1.2.1 Perturbation Duality 1.2.1.1
General Relations
Consider the family of “primal” problems Min ϕ(x, y) − x ∗ , x, x∈X
(Py )
¯ x ∗ ∈ X ∗ , and y ∈ Y . We denote where X and Y are Banach spaces, ϕ : X × Y → R, the associated value function by v(y) := inf (ϕ(x, y) − x ∗ , x). x
(1.149)
Observe that v∗ (y ∗ ) = sup y ∗ , y − inf (ϕ(x, y) − x ∗ , x) x y = sup y ∗ , y + x ∗ , x − ϕ(x, y) = ϕ ∗ (x ∗ , y ∗ ).
(1.150)
x,y
It follows that
v∗∗ (y) = sup y ∗ , y − ϕ ∗ (x ∗ , y ∗ ). y ∗ ∈Y ∗
(1.151)
1.2 Duality Theory
33
Define the dual problem as Max y ∗ , y − ϕ ∗ (x ∗ , y ∗ ).
y ∗ ∈Y ∗
(D y )
Then by the definition of (D y ) we obtain, without any hypothesis, using (1.101)– (1.102): Proposition 1.83 The following weak duality relation holds: val(D y ) = v∗∗ (y) ≤ v(y) = val(Py ),
(1.152)
and we have that, if val(D y ) is finite: S(D y ) = ∂v∗∗ (y).
(1.153)
If ∂v(y) = ∅, then ∂v(y) = S(D y ).
(1.154)
Additionally: In the sequel we will analyze the case of strong duality, i.e. when v(y) = v∗∗ (y), in order to get some information of ∂v(y). Remark 1.84 The dual problem can also be obtained by dualizing in the usual way an equality constraint. Indeed, write the primal problem in the form below, with z ∈ Y: Min ϕ(x, z) − x ∗ , x; y − z = 0, (1.155) x,z
with associated duality Lagrangian function, where y ∗ ∈ Y ∗ : L (x, z, y, y ∗ ) := ϕ(x, z) − x ∗ , x + y ∗ , y − z. We have that sup y ∗ L (x, z, y, y ∗ ) = ϕ(x, y) − x ∗ , x if y = z, +∞ otherwise, inf x,z L (x, z, y, y ∗ ) = y ∗ , y − ϕ ∗ (x ∗ , y ∗ ).
(1.156)
(1.157)
The dual problem obtained in the present perturbation duality framework may therefore be viewed as a particular case of the minimax duality discussed in Sect. 1.1.3.
1.2.1.2
Problems in Dual Spaces
¯ Consider a family of problems in the dual space: Let ψ : X ∗ × Y ∗ → R. Min ψ(x ∗ , y ∗ ) − y ∗ , y,
y ∗ ∈Y ∗
(PxD∗ )
34
1 A Convex Optimization Toolbox
¯ satisfies with value denoted by v D (x ∗ ). Then v∗D : X → R v∗D (x) = sup x ∗ , x + y ∗ , y − ψ(x ∗ , y ∗ ) = ψ ∗ (x, y), x ∗ ,y ∗
so that
∗ ∗ ∗ v∗∗ D (x ) = sup x , x − ψ (x, y) .
(1.158)
(1.159)
x
Therefore, we may define a problem dual to (PxD∗ ) as Max x ∗ , x − ψ ∗ (x, y). x∈X
(DxD∗ )
As in Proposition 1.83, we have the weak duality relation ∗ D D ∗ v∗∗ D (x ) = val(D x ∗ ) ≤ val(Px ∗ ) = v D (x ),
(1.160)
and also, in view of (1.106):
Additionally,
S(DxD∗ ) = ∂v∗∗ D (y).
(1.161)
if ∂v D (x ∗ ) = ∅, then ∂v D (x ∗ ) = S(DxD∗ ).
(1.162)
Now starting from a problem of type (Py ), and rewriting its dual (D y ) as a minimization problem, we can dualize it. Writing the obtained bidual as a minimization problem, we see that its expression is nothing but Min ϕ ∗∗ (x, y) − x ∗ , x. x∈X
(Py∗∗ )
By Theorem 1.44, the duality mapping is involutive in the class of proper, l.s.c. convex functions, in the following sense: Lemma 1.85 Let ϕ be proper, l.s.c. and convex. Then (Py ) and its bidual problem coincide. Remark 1.86 If X and Y are reflexive, then the bidual problem is the classical dual of the dual one, so that we will be able to apply the duality theory that follows to the dual problem.
1.2.1.3
Strong Duality
We call the relation of equality between a primal and a dual cost, that is, for (x, y, y ∗ ) ∈ X × Y × Y ∗ :
1.2 Duality Theory
35
ϕ(x, y) − x ∗ , x = y ∗ , y − ϕ ∗ (x ∗ , y ∗ )
(1.163)
an optimality condition (in the context of duality theory). By weak duality, this implies that the primal and dual problem have the same value. If the latter is finite, then x ∈ S(Py ) and y ∗ ∈ S(D y ), and (1.163) is equivalent to ϕ(x, y) + ϕ ∗ (x ∗ , y ∗ ) = x ∗ , x + y ∗ , y.
(1.164)
We recognize the case of equality in the Fenchel–Young inequality. By (1.100), this is equivalent to (1.165) (x ∗ , y ∗ ) ∈ ∂ϕ(x, y). Theorem 1.87 The following relations hold: ∂v(y) = ∅
⇒
val(D y ) = val(Py )
⇒
∂v(y) = S(D y )
(1.166)
and
If (1.163) holds with finite value, then x ∈ S(Py ), y ∗ ∈ S(D y ), val(Py ) = val(D y ), and ∂v(y) = S(D y ).
(1.167)
Proof Relation (1.166) follows from Proposition 1.83, and is easily seen to imply (1.167). We next need stronger assumptions that guarantee the equality of the primal and dual cost. Theorem 1.88 Assume that v is convex, uniformly upper bounded near y, and finitely-valued at y. Then (i) val(D y ) = val(Py ), (ii) x ∈ S(Py ) iff there exists a y ∗ ∈ Y ∗ such that the optimality condition (1.163) holds, (iii) ∂v(y) = S(D y ), the latter being nonempty and bounded, and (iv) the directional derivatives of v satisfy, for all z ∈ Y : (1.168) v (y, z) = sup{y ∗ , z; y ∗ ∈ S(D y )}. Proof By Lemma 1.57, v is continuous at y. By Corollary 1.47, v(y) = v∗∗ (y), meaning that val(D y ) = val(Py ), and by Lemma 1.59, ∂v(y) is nonempty and bounded. The conclusion follows from the second implication in (1.166) and Corollary 1.69. Remark 1.89 (i) A sufficient condition for v to be convex is that ϕ is convex. (ii) A sufficient condition for having a uniform upper bound near y is that ϕ(x0 , ·) is continuous at y, for some x0 ∈ X . It may happen, however, that while ϕ is l.s.c. convex, v is not l.s.c., and this prevents us from deducing its continuity from Proposition 1.65.
36
1 A Convex Optimization Toolbox
Exercise 1.90 Let X = L ∞ (0, 1), Y = L 2 (0, 1), and denote by A the injection from X into Y . Take x ∗ = 0 and ϕ(x, y) := 0 if Ax = y, +∞ otherwise.
(1.169)
Check that ϕ is l.s.c. convex, but v(y), equal to the indicatrix of L ∞ (0, 1), is not l.s.c. (see the related analysis in Example 1.136). We next state a stability condition, also called a qualification condition, that provides a sufficient condition for the continuity of the value function. The condition is that y ∈ int(dom(v)), or equivalently:
For all y ∈ Y close enough to y, there exists an x ∈ X such that ϕ(x , y ) < ∞.
(1.170)
Lemma 1.91 Assume that ϕ is l.s.c. convex, the stability condition (1.170) holds, and v( y¯ ) is finite. Then v is continuous at y¯ . Proof See e.g. [26, Prop. 2.152]; the proof is too technical to be reproduced here. Corollary 1.92 Under the assumptions of Lemma 1.91, the conclusion of Theorem 1.88 holds. Proof Combine the previous lemma with Theorem 1.88.
Example 1.93 (A strange example) Consider the reverse entropy function, where x ∈ R: Hˆ (x) = x log x if x > 0, Hˆ (0) = 0, and Hˆ (x) = +∞ if x < 0.
(1.171)
This is an l.s.c. convex function, with domain R+ . Consider the problem Min x; s.t. Hˆ (x) ≤ 0, x∈R
(1.172)
corresponding to ϕ(x, y) = x + I{ Hˆ (x)+y≤0} . It obviously has the unique solution x¯ = 0, and the stability condition holds (with here y = 0). The Lagrangian of the problem is L(x, λ) := x + λ Hˆ (x), and we can check that the dual problem is Max δ(λ), λ≥0
(1.173)
where δ(λ) := inf x L(x, λ). By the duality theory, the dual problem has a bounded and nonempty set of solutions and the primal and dual value are equal, i.e., λ is a dual solution iff δ(λ) = 0, with infimum in the Lagrangian attained at 0. Now if λ > 0, the infimum is attained at a positive point. So, the unique dual solution is λ¯ = 0 and the optimality condition reads
1.2 Duality Theory
37
0 ∈ argmin x + 0 × Hˆ (λ) .
(1.174)
x∈R
This indeed holds if we correctly interpret the product 0 × Hˆ (λ) as being equal to +∞ whenever Hˆ (λ) = +∞, see Sect. 1.1.1.2.
1.2.1.4
Projections and Moreau–Yosida Approximations
In many applications, we can check in a direct way the continuity of the value function. Here is a specific example. Proposition 1.94 Let K be a closed convex subset of the Hilbert space X . Then the function v(y) := 21 dist(y, K )2 is convex and of class C 1 , with derivative Dv(y) = y − PK (y). Proof Consider the function X × X → R, ϕ(x, y) := 21 x − y2 + I K (x), and take x ∗ = 0. Obviously, ϕ is l.s.c. and convex, and the unique solution of the primal problem (Py ) is x(y) := PK (y), the projection of y onto K . The (convex) primal value is v(y). We next compute the dual cost, identifying X and its dual. We have that ϕ ∗ (0, y ∗ ) = supx,y (y ∗ , y) − ϕ(x, y) = supx∈K ,y (y ∗ , y − x) − 21 x − y2 + (y ∗ , x) = supx∈K ,y (y ∗ , y ) − 21 y 2 + (y ∗ , x) = 21 y ∗ 2 + σ K (y ∗ ).
(1.175)
Since v is locally upper bounded, it is locally Lipschitz, so that by Lemma 1.59, its subdifferential is nonempty and bounded. It is equal to the solution of the dual problem 1 (y ∗ , y) − y ∗ 2 − σ K (y ∗ ), (1.176) Max y ∗ ∈X 2 and the optimality condition can be arranged in the following way: 1 1 x − y2 + y ∗ 2 − (y ∗ , y − x) + I K (x) + σ K (y ∗ ) − (y ∗ , x) = 0. (1.177) 2 2 The sum of the three first terms is 21 y − x − y ∗ 2 , and the sum of the three last is, by the Fenchel–Young inequality, nonnegative. Therefore (1.177) is equivalent to (i) y ∗ = y − x; (ii) (y ∗ , x − x) ≤ 0, for all x ∈ K .
(1.178)
This is easily seen to be equivalent to x = PK (y), so that ∂v(y) = {y − PK (y)}. Since v is a continuous function, we deduce from Corollary 1.69 that it is G-differentiable
38
1 A Convex Optimization Toolbox
with G-derivative v (y) = y − PK (y). The derivative being a continuous function of y, it follows that v is of class C 1 . A generalization of the above result is provided by the Moreau–Yosida approximation, which is the subject of the exercise below. Exercise 1.95 Given a Hilbert space X (identified with its dual), f l.s.c. proper ¯ y ∈ X , and ε > 0, consider the problem convex X → R, ε Min f (x) + x − y2 . x∈X 2
(1.179)
(i) Show that this problem has a unique solution xε (y) (hint: the cost is strongly convex), called the proximal point to y. (ii) Check that the dual problem is (y ∗ , y) − Max ∗ y ∈X
1 ∗ 2 y − f ∗ (y ∗ ). 2ε
(1.180)
(iii) Show that the primal and dual values are equal, and that the dual problem has a unique solution yε∗ (y) = ε(y − xε (y)). (iv) Show that f ε (y) := inf x ( f (x) + 2ε x − y2 ) (the Moreau–Yosida approximation) has a continuous derivative D f ε (y) = ε(y − xε (y)).
1.2.1.5
Composite Functions
In most applications we have to solve optimization problems with the following structure: Min f (x) + F(G(x) + y) − x ∗ , x. (Py ) x∈X
¯ This enters into our general framework, with here Here G : X → Y and F : Y → R. ϕ(x, y) := f (x) + F(G(x) + y).
(1.181)
Defining the standard Lagrangian4 L(x, y ∗ ) := f (x) + y ∗ , G(x), we have that
4 Not
to be confused with the duality Lagrangian defined in (1.156).
(1.182)
1.2 Duality Theory
39
ϕ(x ∗ , y ∗ ) = supx,y (y ∗ , y − f (x) − F(G(x) + y) + x ∗ , x) = supx,y (y ∗ , G(x) + y − F(G(x) + y) − L(x, y ∗ ) + x ∗ , x) = supx,y (y ∗ , y − F(y ) − L(x, y ∗ ) + x ∗ , x) = F ∗ (y ∗ ) − inf x (L(x, y ∗ ) − x ∗ , x), (1.183) so that the dual problem is y ∗ , y − F ∗ (y ∗ ) + inf (L(x, y ∗ ) − x ∗ , x). Max ∗ x
y
(D y )
We can express the optimality condition (1.164) in the form + y) + F ∗ (y ∗ ) − y ∗ , G(x) + y) (F(G(x) + L(x, y ∗ ) − x ∗ , x) − inf x (L(x , y ∗ ) − x ∗ , x ) = 0.
(1.184)
Each row above being nonnegative by the Fenchel–Young inequality, this is equivalent to the relations (i) y ∗ ∈ ∂ F(G(x) + y); (1.185) (ii) x ∈ argmin(L(·, y ∗ ) − x ∗ , ·). Remark 1.96 Since, as we have seen, these relations express nothing but the Fenchel– Young equality for ϕ, we deduce that if ϕ(x, y) is finite, then ∂ϕ(x, y) = {(x ∗ , y ∗ ) ∈ X ∗ × Y ∗ ; (1.186) holds}.
(1.186)
Remark 1.97 Since (Py ) is feasible iff y ∈ dom(F) − G(x) for some x ∈ dom( f ), we have that dom(v) = dom(F) − G(dom( f )), and the stability condition (1.170) reads: y ∈ int (dom(F) − G(dom( f ))) . (1.187) Proposition 1.98 Let ϕ be l.s.c. convex, and y ∈ Y be such that v(y) is finite, and (1.187) holds. Then v is continuous at y, and the conclusion of Theorem 1.88 holds. Proof Immediate consequence of Corollary 1.92.
Example 1.99 Given K ⊂ Y nonempty, closed and convex, the problem min f (x) − x ∗ , x; G(x) + y ∈ K x
(1.188)
enters into the previous framework with F = I K , the indicatrix of K . In that case the optimality conditions (1.185) reduce to
(i) y ∗ ∈ N K (G(x) + y); (ii) x ∈ argmin(L(·, y ∗ ) − x ∗ , ·).
(1.189)
40
1 A Convex Optimization Toolbox
1.2.1.6
Convexity of Composite Functions
Under which conditions is the function ϕ defined in (1.181) jointly convex and l.s.c.? Varying only y, we see that F must be l.s.c. convex. An obvious case is when f and F are l.s.c. convex, and G is affine and continuous. But there are some other cases when this property holds, although G is nonlinear. Example 1.100 Let F be a nondecreasing, l.s.c. proper convex function over R, and G be an l.s.c. proper convex function over X . We claim that ψ(x, y) := F(G(x) + y) is l.s.c. convex. Setting X := X × Y and G (x, y) := G(x) + y, we reduce the discussion to the l.s.c. and convexity of F(G(x)). Let xk → x¯ in X . Then F(G(x)) ¯ ≤ F(lim G(xk )) ≤ lim F(G(xk )). k
k
(1.190)
The first inequality uses the fact that F is nondecreasing and G is l.s.c.; the second inequality uses the l.s.c. of F. So, F ◦ G is l.s.c. Now for α ∈ (0, 1) and x , x in X , setting x := αx + (1 − α)x : F(G(x)) ≤ F(αG(x ) + (1 − α)G(x ) ≤ α F(G(x )) + (1 − α)F(G(x )). (1.191) We have used the convexity of G and the fact that F is nondecreasing in the first inequality, and the convexity of F in the second one. So, F ◦ G is convex; the claim follows. Example 1.101 More generally, consider the case when F is an l.s.c. proper convex function over R p that is nondecreasing (for the usual order relation y ≤ z if yi ≤ z i , for i = 1 to p), and G(x) = (G 1 (x), . . . , G p (x)) with G i (x) an l.s.c. proper convex function over X , for i = 1 to p). By similar arguments we get that ψ(x, y) := F(G(x) + y) is l.s.c. convex. A particular case is that of the supremum of convex functions, see Sect. 1.2.3. A more general analysis of the case of composite functions in the format (1.181) is as follows. Assume F to be l.s.c. proper convex. By Theorem 1.44, it is equal to its biconjugate, and hence, F(y) = sup{y ∗ , y − F ∗ (y ∗ );
y ∗ ∈ dom F ∗ }.
(1.192)
Therefore, ϕ(x, y) = f (x) + sup{y ∗ , G(x) + y − F ∗ (y ∗ );
y ∗ ∈ dom F ∗ }.
(1.193)
Since the supremum of l.s.c. convex functions is l.s.c. convex, we deduce that Lemma 1.102 Let F be l.s.c. proper convex, and x → y ∗ , G(x) be l.s.c. convex for any y ∗ ∈ dom F ∗ . Then ϕ is l.s.c. convex.
1.2 Duality Theory
1.2.1.7
41
Convex Mappings
Definition 1.103 The recession cone of the closed convex subset K of Y is the closed convex cone defined by K ∞ := {y ∈ Y ;
K ⊂ K + y}.
(1.194)
Remark 1.104 (i) If K is bounded, its recession cone reduces to {0}. The converse holds if Y is finite-dimensional. In infinite-dimensional spaces, there may exist unbounded convex sets with recession cone reducing to {0}: see [26, Example 2.43]. (ii) We have that K ∞ = K if K is a closed convex cone. Definition 1.105 Let G : X → Y , and K be a closed convex subset of Y . We say that G is K -convex if, for all α ∈ (0, 1) and x , x in X : G(αx + (1 − α)x ) − αG(x ) − (1 − α)G(x ) ∈ K ∞ .
(1.195)
Remark 1.106 We slightly changed the classical definition [26, Def. 2.103], but the theory is essentially the same. Note that any affine mapping is K -convex. The converse holds if K ∞ = {0}. On the other hand, if K = Y then any mapping is K -convex. ¯ be l.s.c. convex, and G be continuous and K -convex, Lemma 1.107 Let f : X → R where K is a closed convex subset of Y . Then ϕ(x, y) := f (x) + I K (G(x) + y) is l.s.c. convex. Proof The l.s.c. being obvious, it suffices to check that I K (G(x) + y) is convex. Let α ∈ (0, 1), x , x in X , y , y in Y . Set (x, y) := α(x , y ) + (1 − α)(x , y ). Then κ := G(x) − αG(x ) − (1 − α)G(x )
(1.196)
belongs to K ∞ , and therefore G(x) + y = α(y + G(x )) + (1 − α)(y + G(x )) + κ belongs to K . The result follows.
(1.197)
We next give a practical tool for recognizing K -convex mappings. Lemma 1.108 We have that G : X → Y is K -convex iff, for any λ ∈ (K ∞ )−, the function G λ (x) := λ, G(x) is convex. Proof Since K ∞ is closed and convex, by Lemma 1.77, it is the negative polar cone of (K ∞ )−, i.e., y0 ∈ K ∞ iff λ, y0 ≤ 0, for all λ ∈ (K ∞ )− . Therefore, G is K -convex iff (1.198) λ, G(αx + (1 − α)x ) − αG(x ) − (1 − α)G(x ) ≤ 0,
42
1 A Convex Optimization Toolbox
for all λ ∈ (K ∞ )−. The conclusion follows.
Remark 1.109 Lemma 1.107 can be deduced from Lemma 1.102, where F = I K and F ∗ = σ K , observing that dom(σ K ) ⊂ (K ∞ )− . p
Example 1.110 Let Y := R p and K := R− (the case of finitely many inequality p constraints). Then (K ∞ )− = K − = R+ . As expected we obtain that G is K -convex iff each of the p components of G is convex. Example 1.111 Let Y := C(Ω) where Ω is a metric compact set, and K := Y− (the case of punctual inequality constraints). Then (K ∞ )− = K − = Y+∗ is the set of nonnegative Borel measures on Y, and we obtain that G is K -convex iff G ω (x) is convex, for each ω ∈ Ω. Exercise 1.112 Let K = {x ∈ R2 ; x2 ≥ x12 }. (i) Show that K ∞ = {0} × R+ , and that we have the strict inclusion dom(σ K ) = {0} ∪ (R × (−∞, 0)) ⊂ R × R− = (K ∞ )− .
(1.199)
(ii) Show that G is K convex iff G 1 is affine and G 2 is concave.
1.2.1.8
Fenchel Duality
When G(x) = Ax, with A ∈ L(X, Y ), the Lagrangian defined in (1.182) is such that A y ∗ − x ∗ , x) inf x (L(x, y ∗ ) − x ∗ , x) = inf x ( f (x) + ∗ = − supx x − A y ∗ , x − f (x) = − f ∗ (x ∗ − A y ∗ ).
(1.200)
The expression of the primal and dual problem are therefore Min f (x) + F(Ax + y) − x ∗ , x,
(Py )
Max y ∗ , y − f ∗ (x ∗ − A y ∗ ) − F ∗ (y ∗ ). ∗
(D y )
x∈X
y
Finally, the optimality condition
f (x) + f ∗ (x ∗ − A y ∗ ) − x ∗ − A y ∗ , x + (F(Ax + y) + F ∗ (y ∗ ) − y ∗ , Ax + y) = 0
(1.201)
is equivalent to the relations y ∗ ∈ ∂ F(Ax + y); ∂ f (x) + A y ∗ x ∗ .
(1.202)
1.2 Duality Theory
43
The function ϕ(x, y) = f (x) + F(Ax + y) is l.s.c. convex if f and F are l.s.c. convex, and the stability condition (1.170) reads, since dom(v) = y + dom(F) − A dom( f ): y ∈ int (dom(F) − A dom( f )) . (1.203) We have obtained the following: Theorem 1.113 (Fenchel duality) Let f and F be l.s.c. convex, and (1.203) hold. Then {− f ∗ (x ∗ − A y ∗ ) − F ∗ (y ∗ )} < +∞, inf { f (x) + F(Ax) − x ∗ , x} = max ∗ x
y
(1.204) the maximum being attained on a nonempty and bounded set if the above value is finite. Example 1.114 Given a nonempty closed convex subset K of X , the problem Min f (x) − x ∗ , x); x
Ax + y ∈ K
(Py )
is the particular case of the previous example in which F(y) = I K (y) and x ∗ = 0, and therefore the dual problem is y ∗ , y − σ K (y ∗ ) − f ∗ (x ∗ − A y ∗ ). Max ∗ y
(D y )
The optimality condition is equivalent to x ∗ − A y ∗ ∈ ∂ f (x);
y ∗ ∈ N K (Ax + y).
(1.205)
The function ϕ(x, y) = f (x) + I K (Ax + y) is l.s.c. convex if f is l.s.c. convex and K is a closed convex set, and dom(v) = K − A dom( f ). By Theorem 1.88 applied when y = 0, we have that ⎧ ⎨ If f is l.s.c. convex and 0 ∈ int (K − A dom( f )) , then inf x { f (x) − x ∗ , x; Ax ∈ K } = max y ∗ {− f ∗ (x ∗ − A y ∗ ) − σ K (y ∗ )}, ⎩ the maximum being attained on a bounded set if the value is finite. (1.206) Example 1.115 Consider the particular case of the previous example in which A is surjective, K = {0}, x ∗ = 0, and f (x) = c, x, with c ∈ (Ker A)⊥ . By the open mapping theorem, for some c > 0, there exists a feasible x(y) such that x(y) ≤ cy. Since c ∈ (Ker A)⊥ , x(y) is a primal solution. The value function v(·), being both locally upper bounded and finite, is locally Lipschitz. By the discussion in Example 1.114, we have that c = A λ, for some λ ∈ Y ∗ . We have proved that (Ker A)⊥ ⊂ Im(A ). Since the converse inclusion is easily proved, we have obtained another proof of Proposition 1.31.
44
1 A Convex Optimization Toolbox
Exercise 1.116 (Tychonoff and Lasso [120] type regression) Assuming that Y is a Hilbert space identified with its topological dual, and given A ∈ L(X, Y ), b ∈ Y , ¯ consider the regularized linear leastε > 0 and a ‘regularizing function’ R : X → R, square problem 1 (Py ) Min Ax − b2H + ε R(x). x∈X 2 Deduce from (1.84) that the dual problem is 1 Max −(λ, b)Y − λ2Y − ε R ∗ (−A λ/ε). λ∈Y 2
(1.207)
Show that the optimality conditions are 1 λ = Ax − b; − A λ ∈ ∂ R(x). ε
(1.208)
In the case of the Tychonoff regularization R(x) = 21 x2X (assuming X to be a Hilbert space identified with its topological dual), show that: the primal and dual values are equal, both the primal and dual problems have a unique solution, and the second relation in (1.208) reduces to −A λ = εx. In the case when R is positively homogeneous, convex and continuous, with subdifferential at 0 denoted by K , show that: the primal and dual values are equal, and the dual problem is 1 (1.209) Max −(λ, b)Y − λ2Y ; −A λ ∈ εK , λ∈Y 2 and the second relation in (1.208) is equivalent to − A λ ∈ εK and − (λ, Ax) H = ε R(x).
(1.210)
n p Specialize this result n to the Lasso type regularization where X = R , Y = R and R(x) = x1 = i=1 |xi |.
We will next see how to compute the subdifferential of a composition of functions. This will be a consequence of the duality theory, based on a formula for partial subdifferentials.
1.2.2 Subdifferential Calculus 1.2.2.1
General Subdifferential Calculus Rules
¯ we We come back to the general format of Sect. 1.2.1.1. Given ϕ : X × Y → R, denote the partial subdifferential w.r.t. x by
1.2 Duality Theory
45
∂x ϕ(x, y) := {x ∗ ∈ X ∗ ; ϕ(x , y) ≥ ϕ(x, y) + x ∗ , x − x, for all x ∈ X }. (1.211) As in the case of differentiable functions one may ask if the partial subdifferential is the restriction of the “full” subdifferential, i.e., if x ∗ ∈ ∂x ϕ(x, y), does there exist a y ∗ ∈ Y ∗ such that (1.212) (x ∗ , y ∗ ) ∈ ∂ϕ(x, y). Theorem 1.117 If (1.212) holds, then x ∗ ∈ ∂x ϕ(x, y). Conversely, if ϕ is l.s.c. convex and the stability condition (1.170) holds, then x ∗ ∈ ∂x ϕ(x, y) iff the set of y ∗ ∈ Y ∗ satisfying (1.212) is nonempty and bounded. Proof That x ∗ ∈ ∂x ϕ(x, y) when (1.212) holds follows from the definition of full and partial subdifferentials. Now let ϕ be as in the theorem. It suffices to prove that if x ∗ ∈ ∂x ϕ(x, y), then there exists a y ∗ ∈ Y ∗ such that (1.212) holds. Since x ∗ ∈ ∂x ϕ(x, y), we have that the function x → ϕ(x , y) − x ∗ , x attains its minimum at x. By the duality result in Corollary 1.92, the set of solutions y ∗ of the dual problem, satisfying the optimality condition (1.163), which (by the discussion after (1.163)) is equivalent to (1.212), is nonempty and bounded. The conclusion follows. We now specialize the previous theorem to the case of the composite function ϕ(x, y) = f (x) + F(G(x) + y),
(1.213)
recalling that the (standard) Lagrangian was defined in (1.182). We give a direct proof of the expression of the subdifferential of ϕ, already obtained in Remark 1.96: Lemma 1.118 We have that (x ∗ , y ∗ ) ∈ ∂ϕ(x, y) iff (1.185) holds. Proof Let (x ∗ , y ∗ ) ∈ ∂ϕ(x, y). Using ϕ(x, y ) ≥ ϕ(x, y) + y ∗ , y − y for all y ∈ Y , we obtain (1.185)(i). Taking x ∈ X and y := G(x) − G(x ) + y, we get that f (x ) + F(G(x) + y) ≥ ϕ(x, y) + x ∗ , x − x + y ∗ , G(x) − G(x )
(1.214)
or equivalently L(x , y ∗ ) ≥ L(x, y ∗ ) + x ∗ , x − x for all x ∈ X,
(1.215)
implying (1.185)(ii). Conversely, let (1.185) hold. Then ϕ(x , y ) = f (x ) + F(G(x ) + y ) ≥ f (x ) + F(G(x) + y) + y ∗ , G(x ) − G(x) + y − y, = ϕ(x, y) + L(x , y ∗ ) − L(x, y ∗ ) + y ∗ , y − y, ≥ ϕ(x, y) + x ∗ , x − x + y ∗ , y − y, proving that (x ∗ , y ∗ ) ∈ ∂ϕ(x, y).
(1.216)
46
1 A Convex Optimization Toolbox
Theorem 1.119 Assume that ϕ(x, y) = f (x) + F(G(x) + y) is l.s.c. convex, and that the stability condition (1.187) holds. Then x ∗ ∈ ∂x ϕ(x, y) iff the set of y ∗ ∈ Y ∗ such that (1.185) holds is nonempty and bounded.
Proof Combine Theorem 1.117 and Lemma 1.118.
1.2.2.2
Fenchel’s Duality
In the case of Fenchel’s duality, i.e., when G(x) = Ax with A ∈ L(X, Y ), we see that (1.185)(ii) holds iff f (x ) ≥ f (x) + x ∗ − A y ∗ , x − x, for all x ∈ R,
(1.217)
i.e. iff x ∗ − A y ∗ ∈ ∂ f (x). We obtain the following Fenchel subdifferential formula: ¯ and Lemma 1.120 Let X and Y be Banach spaces, A ∈ L(X, Y ), f : Y → R ∗ ¯ F : Y → R be l.s.c. convex, and set ϕ(x, y) = f (x) + F(Ax + y). Then (x , y ∗ ) ∈ ∂ϕ(x, y) iff (1.218) (i) y ∗ ∈ ∂ F(Ax + y); (ii) x ∗ − A y ∗ ∈ ∂ f (x). We have that x ∗ ∈ ∂x ϕ(x, y) iff (1.218) holds for some y ∗ ∈ Y ∗ , whenever the stability condition (1.203) is satisfied.
Proof Direct application of the previous statements. In the case when A is the identity operator, we obtain the
¯ with finite value at Corollary 1.121 Let f and g be l.s.c. convex functions X → R, x0 . If 0 ∈ int (dom( f ) − dom(g)) (which holds in particular if f or g is continuous at x0 ), then ∂( f + g)(x0 ) = ∂ f (x0 ) + ∂g(x0 ). We next discuss the case of the sum of a finite number of functions. Example 1.122 Let gi , i = 1 to n, be l.s.c. proper convex functions over the Banach space X . We set G(x) :=
n
gi (x),
n with dom(G) = ∩i=1 dom(gi ).
(1.219)
i=1
Then G is of the form F ◦ A, with Y := X n , Ax = (x, . . . , x) (n times), and F(x1 , . . . , xn ) :=
n i=1
gi (xi ),
n with dom(F) = Πi=1 dom(gi ).
(1.220)
1.2 Duality Theory
47
n For (x1∗ , . . . , xn∗ ) ∈ (X ∗ )n , we have that A (x1∗ , . . . , xn∗ ) = i=1 xi∗ (the transpose of the copy operator is the sum). The qualification condition (1.203) can be written, since BY = (B X )n , as ∀(x1 , . . . , xn ) ∈ ε(B X )n ; ∃ x ∈ X ; xi ⊂ dom(gi ) − x, i = 1, . . . , n. (1.221) It follows by Lemma 1.120, where here f = 0 and y = 0, that ∂G(x) =
n
∂gi (x), for all x ∈ X, if (1.221) holds.
(1.222)
i=1
Remark 1.123 A sufficient condition for (1.221) is that (indeed, take x = x0 − xn ):
1.2.2.3
There exists an x0 ∈ dom(gn ) such that gi is continuous at x0 , for i = 1 to n − 1.
(1.223)
Geometric Calculus Rules
We show here how subdifferential calculus gives calculus rules for normal and tangent cones, starting with the simple case of the intersection of two convex sets. Lemma 1.124 Let K 1 and K 2 be two closed convex subsets of X , and let K := K 1 ∩ K 2 , and x¯ ∈ K . Then ¯ ⊂ TK 1 (x) ¯ ∩ TK 2 (x) ¯ and N K (x) ¯ ⊃ N K 1 (x) ¯ + N K 2 (x). ¯ TK (x)
(1.224)
If in addition 0 ∈ int(K 1 − K 2 ), equality holds in the above two inclusions. Proof The relations in (1.224) are easy consequences of the definition of tangent and normal cones. We next apply Corollary 1.121 with f := I K 1 and g := I K 2 , so that f + g = I K . Since dom( f ) − dom(g) = K 1 − K 2 and ∂ I K (x) = N K (x), we ¯ = N K 1 (x) ¯ + N K 2 (x). ¯ Computing the deduce that if 0 ∈ int(K 1 − K 2 ), then N K (x) normal cones (we have seen in (1.144) that the polar of a sum of convex cones is ¯ = TK 1 (x) ¯ ∩ TK 2 (x). ¯ The the intersection of their polar cones), it follows that TK (x) conclusion follows. By similar techniques one can prove various extensions of this result, given as exercises. Exercise 1.125 Consider the subsets of R2 defined by K 1 = {x; x2 ≥ x12 }, K 2 := −K 1 , and K := K 1 ∩ K 2 . Check that (1.224) holds with strict inclusion. Does 0 belong to int(K 1 − K 2 )? Make the connection with Lemma 1.124. Exercise 1.126 Let K 1 , . . . , K n be closed convex subsets of X . Set K := K 1 ∩ · · · ∩ K n . Let x¯ ∈ K . Assume that
48
1 A Convex Optimization Toolbox
∀(x1 , . . . , xn ) ∈ ε(B X )n ; ∃ x ∈ X ; xi ⊂ K i − x, i = 1, . . . , n.
(1.225)
Show that, then: ¯ = N K (x)
n
n N K i (x); ¯ TK (x) ¯ = ∩i=1 N K i (x). ¯
(1.226)
i=1
¯ = N K (x). ¯ Hint: apply Example 1.122 with gi (x) = I K i (x), and use ∂ I K (x) Exercise 1.127 Let K X and K be closed convex subsets of X and Y resp., A ∈ L(X, Y ), and b ∈ Y . Set K := {x ∈ K X ; Ax + b ∈ K }.
(1.227)
(i) Show that K is a closed convex set. (ii) Let x¯ ∈ K . Show that, if 0 ∈ int (K − b − AK X ), then ¯ = N K X (x) ¯ + A N K (A x¯ + b). NK (x)
(1.228)
Hint: apply Lemma 1.120, with f = I K X and F(y) = I K (y). Exercise 1.128 Let K X and K be closed convex subsets of X and Y resp., and G : X → Y . Set for y¯ ∈ Y :
Kˆ := {(x, y ) ∈ K X × Y ; G(x) + y ∈ K }, K := {x ∈ K X ; G(x) + y¯ ∈ K }.
(1.229)
Assume that Kˆ is a closed convex set, and that 0 ∈ int (K − G(K X ) − y¯ ) .
(1.230)
Set L(x, y ∗ ) := y ∗ , G(x). Show that NK (x) ¯ =
x ∗ ∈ X ∗ ; x¯ ∈ argmin L(·, y ∗ ) − x ∗ , x, for some y ∗ ∈ N K (G(x) ¯ + y¯ ) x∈K X
.
(1.231) Hint: apply Theorem 1.119, with f = I K X , and F = I K . Remark 1.129 In the framework of the previous exercise, assume in addition that ¯ + y¯ ). G(x) is G-differentiable and x → L(x, y ∗ ) is convex, for all y ∗ ∈ N K (G(x) ¯ + DG(x) ¯ y ∗ , so that Then x¯ ∈ argmin x∈K X (L(·, y ∗ ) − x ∗ , x) iff x ∗ ∈ N K X (x) NK (x) ¯ = N K X (x) ¯ + DG(x) ¯ N K (G(x) ¯ + y¯ ).
(1.232)
1.2 Duality Theory
49
This holds in particular if G is affine (and continuous): we then recover the conclusion of Exercise 1.127.
1.2.3 Minimax Theorems In this section we start from a relatively general Lagrangian function and see how to obtain the minmax duality thanks to the perturbation duality. Let X and Y be Banach spaces, X 0 ⊂ X and Y0∗ ⊂ Y ∗ , both nonempty, L : X 0 × Y0∗ → R. By (1.39), we have the weak duality inequality: sup inf L(x, y ∗ ) ≤ inf sup L(x, y ∗ ).
y ∗ ∈Y0∗ x∈X 0
x∈X 0 y ∗ ∈Y ∗ 0
(1.233)
In order to see when equality holds, it is of interest to introduce the perturbation Lagrangian, where y ∈ Y is the perturbation parameter: L (x, y, y ∗ ) := y ∗ , y + L(x, y ∗ ).
(1.234)
We have the more general weak duality inequality sup inf L (x, y, y ∗ ) ≤ inf sup L (x, y, y ∗ ), for all y ∈ Y.
y ∗ ∈Y0∗ x∈X 0
x∈X 0 y ∗ ∈Y ∗ 0
(1.235)
Let us apply the perturbation duality theory of Sect. 1.2.1 with: ϕ(x, y) :=
sup y ∗ ∈Y0∗ L (x, y, y ∗ ) if x ∈ X 0 , +∞ otherwise.
(1.236)
Clearly the primal problem Min ϕ(x, y) x∈X
(Py )
has value v(y) = val(Py ) equal to the r.h.s. of (1.235). We know by (1.151) that ¯ by v∗∗ (y) = sup y ∗ y ∗ , y − ϕ ∗ (0, y ∗ ). Define Lˆ : X × Y ∗ → R ⎧ / Y0∗ , if y ∗ ∈ ⎨ +∞ ∗ ∗ ˆ L(x, y ) := −L(x, y ) if (x, y ∗ ) ∈ X 0 × Y0∗ , ⎩ −∞ otherwise.
(1.237)
Denoting by Lˆ ∗y (x, y) the partial Fenchel–Legendre transform (in the dual space Y ∗ ) ˆ of L(x, ·) w.r.t. the second variable, we have that for all x ∈ X : ˆ ϕ(x, y) = sup y ∗ , y − L(x, y ∗ ) = Lˆ ∗y (x, y). y ∗ ∈Y ∗
(1.238)
50
1 A Convex Optimization Toolbox
∗ It follows that ϕ y∗ (x, y ∗ ) := Lˆ ∗∗ / X 0 ), and therefore y (x, y ) (equal to −∞ if x ∈
ϕ ∗ (0, y ∗ ) = sup ϕ y∗ (x, y ∗ ) = − inf
x∈X 0
x∈X 0
∗ − Lˆ ∗∗ y (x, y ) .
(1.239)
Consequently, v∗∗ (y) = val(D y ) where the dual problem (D y ) is defined as Max inf ∗ ∗
y ∈Y x∈X 0
∗ y ∗ , y − Lˆ ∗∗ y (x, y ) .
(D y )
Since a function always majorizes its biconjugate, ∗ ˆ y ∗ ) ≤ y ∗ , y − Lˆ ∗∗ L (x, y, y ∗ ) = y ∗ , y − L(x, y (x, y ).
(1.240)
We deduce the “canonical” relation between minimax and perturbation dualities sup inf L (x, y, y ∗ ) ≤ v∗∗ (y) ≤ v(y) = inf sup L (x, y, y ∗ ).
y ∗ ∈Y0∗ x∈X 0
x∈X 0 y ∗ ∈Y ∗ 0
(1.241)
In view of the expression of (D y ), the inequality on the left is an equality whenever ∗ ˆ L(x, y ∗ ) = Lˆ ∗∗ y (x, y ), for all x ∈ X 0 .
(1.242)
ˆ y ∗ ) is a supremum By Lemma 1.49, this holds iff for each x ∈ X 0 , y ∗ → L(x, ∗ ∗ of ∗affine functions, or equivalently, if y → L(x, y ) is an infimum of ∗affine functions. Theorem 1.130 Assume that X 0 and Y0∗ are nonempty and convex subsets, X 0 is closed, L(·, y ∗ ) is l.s.c. convex for each y ∗ ∈ Y0∗ , (1.242) holds, and Y0∗ is bounded. Then equality holds in (1.233), and the set of y ∗ for which the supremum on the left is attained is nonempty and bounded. Proof (a) Since X 0 is convex and closed, for each y ∗ ∈ Y0∗ , the function (x, y) → / X 0 is an l.s.c. convex function of (x, y), and L (x, y, y ∗ ) extended by +∞ if x ∈ hence, its supremum w.r.t. y ∗ ∈ Y0∗ , i.e. ϕ(x, y), is itself l.s.c. convex. (b) Let us check that v(y) < +∞. Fix x0 ∈ X 0 . Since y ∗ → L(x0 , y ∗ ) is an infimum of ∗affine functions, we have that for some (y0 , c0 ) ∈ Y × R (depending on x0 ): L(x0 , y ∗ ) ≤ y ∗ , y0 + c0 , for all y ∗ ∈ Y0∗ ,
(1.243)
and then since Y0∗ is bounded: v(y) ≤ ϕ(x0 , y) ≤ sup y ∗ , y + y0 + c0 < ∞. y ∗ ∈Y0∗
(1.244)
1.2 Duality Theory
51
(c) If the primal value is −∞, the conclusion follows from the weak duality inequality, the maximum of the dual cost being attained at each y ∗ ∈ Y0∗ . (d) In view of the expression of ϕ in (1.236), if v is finite at some y ∈ Y , we have that
|v(y ) − v(y)| ≤ sup |y ∗ , y − y| ≤ y ∗ ∈Y0∗
sup y ∗ y − y,
y ∗ ∈Y0∗
(1.245)
proving that v is everywhere finite and Lipschitz. Since v is convex and Lipschitz, by Lemma 1.59, ∂v(y) is nonempty and bounded, and therefore v(y) = v∗∗ (y) and the set of dual solutions is not empty and bounded. We conclude by (1.241), in which by (1.242) the first inequality is an equality. A direct consequence of the previous result is, see [94, Corollary 37.3.2]: Lemma 1.131 Let A and B be nonempty closed convex subsets of Rn and Rq , resp., with B bounded, and L be a continuous convex-concave mapping A × B → R. Then sup inf L(x, y) = inf sup L(x, y), y∈Y x∈X
x∈X y∈Y
(1.246)
and the supremum on the l.h.s. is attained.
1.2.4 Calmness ¯ have a finite value at x. Definition 1.132 Let f : X → R ¯ We say that f is calm at x¯ with constant r > 0 if f (x) ¯ ≤ f (x) + r x − x, ¯ for all x ∈ X.
(1.247)
¯ be convex, and calm at x¯ with constant r > 0. Then Lemma 1.133 Let f : X → R (i) f is l.s.c. at x, ¯ and (ii) ∂ f (x) ¯ has at least an element of norm at most r . Proof (i) Immediate consequence of (1.247). (ii) Let f¯(x) := conv( f )(x). By the Fenchel–Moreau–Rockafellar Theorem 1.46, ¯ = f¯(x) ¯ = f ∗∗ (x). ¯ f¯ = f ∗∗ . In view of (i) and Corollary 1.47(i), f (x) ¯ ¯ ¯ attains its minimum at x, ¯ and so 0 ∈ By (1.247), fr (x) := f (x) + r x − x ¯ By the subdifferential calculus rule for a sum (Corollary 1.121), and since ∂ f¯r (x). the subdifferential of the norm is the closed dual unit ball, we have that ¯ r )X ∗ , ¯ = ∂ f¯(x) ¯ + B(0, 0 ∈ ∂ f¯r (x) ¯ r ) X ∗ . The conclusion follows. proving that ∂ f¯(x) ¯ has an element in B(0,
(1.248)
52
1 A Convex Optimization Toolbox
¯ has a subdifferential q at x¯ of norm not Remark 1.134 Conversely, if f : X → R greater than r > 0, then f (x) ≥ f (x) ¯ + q, x − x ¯ ≥ f (x) ¯ − r x − x, ¯
(1.249)
which shows that f is calm at x¯ with constant r . So, if f is convex and f (x) ¯ is finite, then ∂ f (x) ¯ is nonempty iff f is calm at x. ¯ Corollary 1.135 In the framework of the perturbation duality theory presented in Sect. 1.2.1.1, assume that ϕ is convex (not necessarily l.s.c.), and that the value function v(·) is calm at y ∈ Y , with constant r > 0. Then val(Py ) = val(D y ), and ∂v(y) = S(D y ) has at least one element of norm at most r . Since, by Remark 1.134, calmness characterizes subdifferentiability for convex functions, the difficulty is of course to check this condition! We first present a “pathological” example that illustrates the theory. Example 1.136 Let X = L 2 (0, 1), Y = L 1 (0, 1), g ∈ X , and A be the canonical injection X → Y . Denote by (·, ·) X the scalar product in X . Consider the problem Min (g, x) X ; x∈X
Ax + y = 0 in Y.
(Py )
This enters into the framework of perturbation duality, with ϕ(x, y) =
(g, x) X if x = −y, +∞ otherwise.
(1.250)
The value of (Py ) is therefore v(y) =
−(g, y) X if y ∈ X, +∞ otherwise.
(1.251)
We distinguish two cases: (a) g ∈ / L ∞ (0, 1). Given y ∈ Y , it is easy to build a sequence yk in X such that yk → y in Y , and (g, yk ) → −∞. So v(·) is nowhere l.s.c. (b) g ∈ L ∞ (0, 1). Then for y, y in X we have that |v(y ) − v(y)| ≤ g∞ y − y1 ,
(1.252)
proving that v is calm with constant r := g∞ at each y ∈ X . We compute the dual problem by applying Example 1.114, with here K = {0}. Since f ∗ (x ∗ ) = 0 if x ∗ = g, and +∞ otherwise, we get: y ∗ , y; g = −A y ∗ . Max ∗ y ∈Z
If y ∈ X then y = −Ax, for some x ∈ X . Then, if y ∗ ∈ F(D y ):
(D y )
1.2 Duality Theory
53
y ∗ , y = −y ∗ , Ax = −A y ∗ , x = (g, x) X = −(g, y) X ,
(1.253)
and so the primal and dual values are equal, and the dual problem has solution −g, in accordance with Corollary 1.135. Of course it can be checked by direct means that ∂v(y) = −g. Example 1.137 Consider the family of linear optimization problems Min c, x; ai , x + yi ≤ 0, i = 1, . . . , p. x∈X
(Py )
Here ϕ ∗ (0, y ∗ ) = sup y ∗ · y − c, x; ai , x + yi ≤ 0, i = 1, . . . , p.
(1.254)
x,y
A supremum less than +∞ implies y ∗ ≥ 0, and the optimal choice for y is then p yi = −ai , x, so that ϕ ∗ (0, y ∗ ) = 0 if c + i=1 yi∗ ai = 0, and +∞ otherwise. The dual problem (in the framework of perturbation duality) is therefore, denoting by λ the dual variable: p λi ai = 0. (D y ) Maxp λ · y; c + λ∈R+
i=1
By Hoffman’s Lemma 1.28, calmness is satisfied whenever v(y) is finite, and hence, the primal and dual values are equal and the dual problem has a solution, in agreement with Lemma 1.26. Remark 1.138 The stability condition (1.170) does not hold in Example 1.136, and does not necessarily hold in Example 1.137. So these examples show the usefulness of the concept of calmness.
1.3 Specific Structures, Applications 1.3.1 Maxima of Bounded Functions Coming back to the minimization of composite functions in Sect. 1.2.1.5, assume that f is proper, l.s.c. convex, and that F is l.s.c., convex, and positively homogeneous with value 0 at 0. Then F(x) > −∞ for all x. By Theorem 1.44, F is equal to its biconjugate, and by Lemma 1.66, F(y) = σ K ∗ (y), and F ∗ = I K ∗ , where K ∗ = ∂ F(0). So problem (Py ) in Sect. 1.2.1.5 is of the form Min f (x) − x ∗ , x + sup y ∗ , G(x) + y. x∈X
y ∗ ∈K ∗
(1.255)
54
1 A Convex Optimization Toolbox
As in (1.182) we set L(x, y ∗ ) := f (x) + y ∗ , G(x). Since F ∗ = I K ∗ , by Sect. 1.2.1.5, the dual problem can be expressed as Max y ∗ , y + inf (L(x, y ∗ ) − x ∗ , x).
y ∗ ∈K ∗
x
(1.256)
In the sequel to this section, we assume that Y is a space of bounded functions, denoted by yω , over a certain set Ω, containing constant functions, and is a Banach space endowed with the uniform norm y := sup {|yω |; ω ∈ Ω} .
(1.257)
Remark 1.139 An obvious choice for Y is the space of bounded functions over Ω. If Ω is a compact metric space, we can also choose the space of continuous and bounded functions over Ω (indeed, by the Heine–Cantor theorem, a continuous function over a compact set is uniformly continuous, and this easily implies that a uniform limit of continuous functions is continuous). The dual space Y ∗ is endowed with the norm y ∗ := sup{y ∗ , y; y ∈ Y, |yω | ≤ 1, for all ω ∈ Ω}. We say that y ∗ ∈ Y ∗ is nonnegative, and write y ∗ ≥ 0, if y ∗ , y ≥ 0, for all y ≥ 0 (we recognize here a polarity relation between the (closed convex) cone of nonnegative functions of Y , and the positive polar cone of nonnegative linear forms). In the sequel to this section, we discuss problems of the type (in certain applications, the supremum will be an essential supremum) Min f (x) − x ∗ , x + sup {G ω (x) + yω }. x
ω∈Ω
(Py )
If Y : Ω → R, we denote the supremum function by sup y := sup{yω , ω ∈ Ω}. Let us denote by 1 the function with constant value 1 over Ω. We will see that the subdifferential of the supremum at 0 is the set S (Ω) := {y ∗ ∈ Y ∗ ; y ∗ ≥ 0; y ∗ , 1 = 1}.
(1.258)
We say that a function is non-expansive if it has Lipschitz constant one. Lemma 1.140 The convex, positively homogeneous function sup : Y → R is nonexpansive, and its subdifferential at 0 is S (Ω), so that for all y ∈ Y and y ∗ ∈ Y ∗ :
sup(y) = max y ∗ ∈S (Ω) y ∗ , y; (sup)∗ (y ∗ ) = IS (Ω) (y ∗ ); ∂ sup(y) = {y ∗ ∈ S (Ω); sup(y) = y ∗ , y}.
(1.259)
Proof The non-expansivity is a direct consequence of the definition of the supremum, and ensures that the subdifferential of the supremum at any point is contained in the
1.3 Specific Structures, Applications
55
closed unit ball. Let y ∗ ∈ S (Ω), and y ∈ Y . Since y ∗ is nonnegative, we have sup(y) = y ∗ , sup(y)1 − y + y ∗ , y ≥ y ∗ , y, which proves that S (Ω) ⊂ ∂ sup(0). Conversely, let y ∗ ∈ ∂ sup(0). Then ±1 = sup(±1) ≥ y ∗ , ±1 implies y ∗ , 1 = 1. In addition, for all y ≥ 0, we have that 0 ≥ sup(−y) ≥ y ∗ , −y, so y ∗ ≥ 0. We have shown that y ∗ ∈ S (Ω). We conclude by Lemma 1.66. It follows from Sect. 1.2.1.5 that the dual of (Py ) can be written in the form Max y ∗ , y + inf (L(x, y ∗ ) − x ∗ , x).
y ∗ ∈S (Ω)
(D y )
x
Theorem 1.141 Let Y be a Banach space endowed with the norm (1.257), containing the constant functions. We assume that f is proper, l.s.c., convex, x → G(x) is continuous and that for any y ∗ ∈ S (Ω), x → y ∗ , G(x) is convex. Then problems (Py ) and (D y ) have the same value, that is finite or equal to −∞. If this value is finite, then S(D y ) is nonempty (necessarily bounded since S (Ω) is). In addition, x ∈ S(Py ) and y ∗ ∈ S(D y ) iff (x, y ∗ ) satisfies x ∗ ∈ ∂x L(x, y ∗ ); y ∗ ∈ S (Ω); y ∗ , G(x) + y = sup(G(x) + y).
(1.260)
Proof The function (x, y) → σS (Ω) (G(x) + y) = sup (G ω (x) + yω ) = ω∈Ω
sup y ∗ , G(x) + y
y ∗ ∈S (Ω)
(1.261) is continuous (being a composition of continuous functions), and convex. Since f is l.s.c. convex, the function (x, y) → ϕ(x, y) := f (x) + σS (Ω) (G(x) + y) is l.s.c. convex. As f is proper, (Py ) is feasible for all y. Corollary 1.92 ensures the equality of primal and dual values. Finally, the optimality conditions follow from the duality theory for the composite functions (Proposition 1.98) combined with Lemma 1.66. Note that the hypotheses made on G in the above theorem imply in particular that for all ω ∈ Ω, the function x → G ω (x) is convex continuous (since y → yω is a linear continuous form that belongs to S (Ω)). Example 1.142 Under the hypotheses of Theorem 1.141, let Ω be finite, of cardinalC(Ω) with Rp ity p (therefore each component G i is convex). We will then ∗ identify p p p and S (Ω) with the set of probabilities over Ω: S := y ∈ R+ ; i=1 yi∗ = 1 . Using the subdifferential calculus rule in Example 1.122 and especially (1.223), we see that the optimality condition (1.260) reduces to 0 ∈ ∂ f (x) +
p i=1
yi∗ ∂x G i (x);
y∗ ∈ S p ;
y ∗j = 0, j ∈ / argmax G i (x). (1.262) i
56
1 A Convex Optimization Toolbox
Another case of interest is that of compact spaces. Example 1.143 Let Y = C(Ω), the space of continuous functions on the compact metric space Ω. The dual space is the space of finite Borel measures over Ω, and S (Ω) is nothing but the set P(Ω) of Borel probability measures over Ω. We can define the support of a measure, denoted by supp(·), as the complement of the largest open set where it is equal to 0. Then the two last relations of (1.260) are equivalent to (1.263) y ∗ ∈ P(Ω); supp(y ∗ ) ⊂ argmax(G(x) + y).
1.3.2 Linear Conical Optimization The literature often refers to linear conical optimization problems, which are as follows. Given two Banach spaces X and Y , consider the problem Min c, x; x∈X
Ax − b ∈ C,
(1.264)
where C ⊂ Y is a closed convex cone, c ∈ X ∗ , A ∈ L(X, Y ), and b ∈ Y . When Y = p Rq+ p and C = {0}Rq × R− , we recover the class of (possibly infinite-dimensional) linear programs. If C is the cone Sn+ of symmetric positive semidefinite matrices of size n, we obtain (possibly infinite-dimensional) semidefinite programming problems. Linear conical problems are nothing but particular cases of Fenchel duality, more precisely of those problems discussed in Example 1.114, where K := b + C, so that σ K (λ) = λ, b + IC − (λ), and f (x) := c, x is such that f ∗ (z) =
0 if z = c, +∞ otherwise.
(1.265)
So the expression of the dual problem when y = 0 is Max− −λ, b; c + A λ = 0. λ∈C
(1.266)
If we prefer to use the positive polar set C + = −C − , the expression of the dual problem becomes (1.267) Max+ η, b; A η = c. η∈C
The dual problem is itself in the conical linear class, except of course that the spaces are of dual type. It can be rewritten, setting K := C − × {0} (zero in X ∗ ), in the form (formally close to (1.264)): Min∗ λ, b; (λ, c + A λ) ∈ K . λ∈Y
(1.268)
1.3 Specific Structures, Applications
57
We can also dualize the dual problem (1.268); in view of Lemma 1.85, the resulting bidual problem will coincide with the original one. Corollary 1.144 (Primal qualification) If there exists an ε > 0 such that ε BY ⊂ C + b + Im(A)
(1.269)
(in particular, if there exists an x0 ∈ X such that Ax0 − b ∈ int C), then (1.264) and (1.266) have the same value. If the latter is finite, then the solution set of the dual problem (1.266) is nonempty and bounded. Proof Apply Corollary 1.92.
Corollary 1.145 (Dual qualification) Let X and Y be reflexive. If there exists an ε > 0 such that (1.270) ε B X ∗ ⊂ c + A C − , then (1.264) and (1.266) have the same value, and if this common value is finite, then the solution set of the primal problem (1.264) is nonempty and bounded. Proof It suffices to check that (1.270) is equivalent to the stability condition for the dual problem. The latter holds iff there exists an ε > 0 such that ε (BY ∗ × B X ∗ ) ⊂ C − × {−c} − {(λ, A λ); λ ∈ Y ∗ }.
(1.271)
This holds iff, for all (μ, η) close to 0 in Y ∗ × X ∗ , there exists a λ ∈ Y ∗ such that μ ∈ C − − λ; η = −c − A λ.
(1.272)
The first relation is equivalent to λ = λˆ − μ, with λˆ ∈ C − . Eliminating λ in the second relation, we obtain c + A λˆ = A μ − η. One easily shows that this is equivalent to the existence of an ε > 0 such that (1.270) holds.
1.3.3 Polyhedra Let X be a Banach space. A polyhedron P of X is a subset defined by a finite number of inequalities: (1.273) P = {x ∈ X ; ai , x ≤ bi , i = 1, . . . , p}, where a1 , . . . , a p belong to X ∗ . We set I = {1, . . . , p} and call {(ai , bi ); i ∈ I } a parameterization of P. The latter is of course not unique. If x ∈ P, we denote the set of active constraints by I (x) := {i ∈ I ; ai , x = bi }.
(1.274)
58
1 A Convex Optimization Toolbox
Lemma 1.146 Let x¯ ∈ P. Then N P (x) ¯ =
⎧ ⎨ ⎩
i∈I (x) ¯
⎫ ⎬
λi ai ; λ ≥ 0 . ⎭
(1.275)
Proof Let Nˆ P (x) ¯ denote the r.h.s. of (1.275). If x ∗ = i∈I (x) ¯ λi ai with λ ≥ 0, and x ∈ P, then ¯ = λi ai , x − x ¯ = λi (ai , x − bi ) ≤ 0, (1.276) x ∗ , x − x i∈I (x) ¯
i∈I (x) ¯
proving that Nˆ P (x) ¯ ⊂ N P (x). ¯ Conversely, let x ∗ ∈ N P (x). ¯ Then x¯ is a solution of the linear program Min{−x ∗ , x; x ∈ P}. By the strong duality for linear programs (Lemma 1.26), there exists a solution λ of the dual problem. By dual feasibility and ¯ the complementarity conditions, we deduce that x ∗ ∈ Nˆ P (x). Consider now a collection ai in X ∗ , for i in I ∪ J , the sets I and J being finite. Set Q = {x ∈ X ; a j , x ≤ 0, j ∈ J }, and for x ∈ Q,
I (x) = {i ∈ I ; ai , x ≥ ak , x, for all k ∈ I }; J (x) = { j ∈ J ; a j , x = 0}.
(1.277)
Set g(x) := max{ai , x; i ∈ I }. Lemma 1.147 Let x¯ ∈ X . Then ∂g(x) ¯ = conv{ai ; i ∈ I (x)}. ¯
(1.278)
Proof (i) For ∈ Rn , set max(z) := max(z 1 , . . . , z n ). It is an elementary exercise to check that (1.279) ∂ max(z) = conv{ei ; 1 ≤ i ≤ n; z i = max(z)}. Since g is the composition of the max function by the linear mapping (assuming that |I | = n): x → Ax := (a1 , x, . . . , an , x), so that A λ = j λ j ai , we conclude with Lemma 1.120. ¯ be defined by Let Φ : X → R Φ(x) := max{ai , x; i ∈ I } + I Q (x) = g(x) + I Q (x).
(1.280)
Definition 1.148 If E is a subset of X , we denote the convex cone generated by E (the set of finite nonnegative combinations of elements of E) by cone(E).
1.3 Specific Structures, Applications
59
Lemma 1.149 Let x¯ ∈ Q. Then ¯ + cone a j , i ∈ J (x) ¯ . ∂Φ(x) ¯ = conv {ai , i ∈ I (x)}
(1.281)
Proof Since g is convex and continuous, and I Q is l.s.c. convex, by the subdifferential ¯ We conclude calculus rules (Lemma 1.120), we have that ∂Φ(x) ¯ = ∂g(x) ¯ + ∂ I Q (x). ¯ = N Q (x), ¯ whose expression is given by Lemma 1.146, and by noting that ∂ I Q (x) by Lemma 1.147. Lemma 1.150 With the above notations, let M ∈ L(Z , X ), where Z is a Banach space, and let Ψ = Φ ◦ M have a finite value at z¯ ∈ Z . Set x¯ = M z¯ . Then ∂Ψ (¯z ) = M ∂Φ(M z¯ ).
(1.282)
Proof The function Ψ is of the same nature as Φ, replacing the ai by M ai , for i ∈ I ∪ J . The conclusion follows by Lemma 1.149. We admit the Minkowski–Weyl theorem of representations of polyhedra (see [97, Part IV], or [110, Chap. 8]; we assume that X = Rn . Theorem 1.151 Let P satisfy (1.273) and be nonempty. Then there exists an element xi ∈ X , with i ∈ I ∪ J , I and J finite sets, such that P = conv{xi , i ∈ I } + cone{x j , j ∈ J }.
(1.283)
Consider now the following family of linear programs Minc, x; ai , x + yi ≤ 0, i = 1, . . . , p, x∈X
(L Py )
parameterized by y ∈ R p . The dual problem is Maxp λ · y; c + λ∈R+
p
λi ai = 0.
(L D y )
i=1
Now let Z be a Banach space, and let Mi ∈ L(Z , X ), for i = 1 to p. Define M z := (M1 z, . . . , M p z) . Set v(y) := val(P L y ), and V (z) := v(M z). Theorem 1.152 Fix z¯ ∈ Z , set y¯ = M z¯ , and let x¯ ∈ S(L Py¯ ). Then ∂ V (¯z ) = M ∂v( y¯ ) = M S(L D y¯ ).
(1.284)
60
1 A Convex Optimization Toolbox
Proof By linear programming duality and the general duality theory (Lemma 1.26 and Theorem 1.87), we have that val(L Py¯ ) = val(L D y¯ ); ∂v( y¯ ) = S(L D y¯ ).
(1.285)
Let {λi , i ∈ I ∪ J } be a Minkowski–Weyl representation of F(L D y¯ ); note that the latter does not depend on y¯ . It is easily checked that val(L D y¯ ) = Φ(M z) = Ψ (z), where Φ was defined in (1.280). We conclude by Lemma 1.150.
(1.286)
This result, which is essentially another proof of (1.109), will be used in Sect. 3.2.7.
1.3.4 Infimal Convolution Let X be a Banach space, and f 1 , f 2 be two extended real-valued functions over X . Their infimal convolution is the extended real-valued function over X defined as f 1 f 2 (y) := inf ( f 1 (y − x) + f 2 (x)) . x∈X
(1.287)
It is easily seen that the operator (that to two extended real-valued functions over X associates their infimal convolution) is commutative and associative. More generally, the infimal convolution of n extended real-valued functions f 1 , . . . , f n over X is defined as n n n i=1 f i (y) := infn f i (xi ); xi = y . (1.288) x∈X
i=1
i=1
3 f i (y). In order to fit with our duality One easily checks that ( f 1 f 2 ) f 3 = i=1 theory, consider the related problem Minn x∈X
n
n f i (xi ) − xi∗ , xi ; xi = y,
i=1
(Py )
i=1
with value function denoted by v(y); we have that n v(y) = i=1 f i (y) whenever x ∗ = 0,
(1.289)
1.3 Specific Structures, Applications
61
as well as v∗ (y ∗ ) = sup y y ∗ , y − v(y) n n ∗ xi , xi − f i (xi ) ; xi = y, = supy ∗ , y + x,y
i=1
i=1
n n ∗ = sup y + xi∗ , xi − f i (xi ) = f i∗ (y ∗ + xi∗ ). x
i=1
(1.290)
i=1
Taking all xi∗ equal to zero, we obtain that the Fenchel conjugate of the infimal convolution is the sum of conjugates, i.e. n n ∗ i=1 f i (y ∗ ) = f i∗ (y ∗ ).
(1.291)
i=1
The dual problem to (Py ) is y ∗ , y − Max ∗ ∗
y ∈Y
Since dom(v) =
n i=1
n
f i∗ (y ∗ + xi∗ ).
(D y )
i=1
dom( f i ), the stability condition is y ∈ int
n
dom( f i ) .
(1.292)
i=1
We deduce that Proposition 1.153 Assume that the f i are l.s.c. convex, and let (1.292) hold. Then v(y) = val(D y ), and if the value is finite, S(D y ) is nonempty and bounded. When the f i are proper, l.s.c. convex, the cost function of (Py ) is itself a proper, l.s.c. and convex function of (x, y). By Lemma 1.85, (Py ) is the dual of (D y ). In view of Remark 1.86, when X is reflexive, we may regard (Py ) as the “classical” dual of (D y ), with perturbation parameter x ∗ . Clearly (D y ) is feasible iff there exists a y ∗ ∈ Y ∗ such that y ∗ + xi∗ ∈ dom( f i∗ ), for i = 1 to n, i.e., if x ∗ ∈ Πi dom( f i ) − Ay ∗ , where the operator A : Y ∗ → (Y ∗ )n is defined by Ay ∗ = (y ∗ , . . . , y ∗ ) (n times). The dual stability condition is therefore n dom( f i∗ ) − AY ∗ . (xi∗ , . . . , xn∗ ) ∈ int Πi=1
(1.293)
We have proved that: Proposition 1.154 Let X be reflexive and the f i be proper, l.s.c. convex, and (1.293) hold. Then val(Py ) = val(D y ), and if the value is finite, S(Py ) is nonempty and bounded.
62
1 A Convex Optimization Toolbox
Example 1.155 Let X = R, f 1 (x) = e x , f 2 (x) = e−x . Set g(x) := ( f 1 f 2 )(x). Then g(x) = 0 for all x in R, dom( f 1∗ ) = R+ , dom( f 2∗ ) = R− , and, with the convention that 0 log 0 = 0: f 1∗ (x ) = x log x − x ;
f 2∗ (x ) = −x log(−x ) + x .
(1.294)
Let x1∗ = x2∗ = 0. The dual problem reads Max y ∗ , y − f 1∗ (y ∗ ) − f 2∗ (y ∗ ).
y ∗ ∈Y ∗
(D y )
The unique feasible point is y ∗ = 0, which is also the unique dual solution. The primal stability condition holds, and accordingly we find that the primal and dual values are equal and that the dual solution (equal to 0) is the subgradient of the infimal convolution. The dual stability condition cannot hold since the infimum in the infimal convolution is not attained. Indeed this condition is that for any x ∗ close to 0 in R2 , there exists a y ∈ R such that x2∗ ≤ y ≤ x1∗ , which is impossible.
1.3.5 Recession Functions and the Perspective Function 1.3.5.1
Recession Functions
Let f be a proper l.s.c. convex function over X . Given x0 ∈ dom( f ), we define the ¯ by recession function f ∞ : X → R f ∞ (d) := sup τ >0
f (x0 + τ d) − f (x0 ) . τ
(1.295)
It is easily checked that f ∞ is convex and positively homogeneous, and that the supremum is attained when τ → +∞. Lemma 1.156 The recession function is the support function of the domain of f ∗ , that is, (1.296) f ∞ (d) = sup x ∗ , d; f ∗ (x ∗ ) < +∞ . x ∗ ∈X ∗
Proof Being proper l.s.c. convex, f is equal to its biconjugate, that is, f (x0 + τ d) = sup x ∗ , x0 + τ d − f ∗ (x ∗ ). x ∗ ∈X ∗
Therefore, f ∞ (d) = sup sup
τ >0 x ∗ ∈X ∗
x ∗ , x0 + τ d − f ∗ (x ∗ ) − f (x0 ) . τ
(1.297)
(1.298)
1.3 Specific Structures, Applications
63
Changing the order of maximization, we get that x ∗ , x0 − f ∗ (x ∗ ) − f (x0 ) ∗ . f ∞ (d) = sup x , d + sup τ τ >0 x ∗ ∈X ∗
(1.299)
By the Fenchel–Young inequality, and since f ∗ is proper, the second supremum is 0 if f ∗ (x ∗ ) < +∞, and −∞ otherwise. The result follows. By the above lemma, the recession function does not depend on the element x0 ∈ dom( f ) used in its definition.
1.3.5.2
Perspective Function
¯ with domain With f as before we associate the perspective function g : X × R → R, ]0, ∞[× dom( f ), defined by g(x, t) := t f (x/t) (where t > 0).
(1.300)
Being proper l.s.c. convex f has an affine minorant, say a, x X + b; then g has affine minorant a, x X + bt. Lemma 1.157 The perspective function is convex and positively homogeneous; its conjugate is the indicatrix of the set C := {(x ∗ , t ∗ ) ∈ X ∗ × R;
f ∗ (x ∗ ) + t ∗ ≤ 0}.
(1.301)
Proof Note that the domain of g is convex. Let x1 , x2 in X , t1 > 0, t2 > 0, and θ ∈]0, 1[. Set x := θ x1 + (1 − θ )x2 ; t := θ t1 + (1 − θ )t2 ; θ := θ t1 /t.
(1.302)
Then θ ∈]0, 1[, (1 − θ ) = (1 − θ )t2 /t, and x/t = θ x1 /t1 + (1 − θ )x2 /t2 . Using the convexity of f , we get g(x, t) ≤ t θ f (x1 /t1 ) + (1 − θ ) f (x2 /t2 ) = θ t1 f (x1 /t1 ) + (1 − θ )t2 f (x2 /t2 ) = θg1 (x1 , t1 ) + (1 − θ )g(x2 , t2 ),
(1.303)
proving that g is convex. The positive homogeneity is obvious; it follows that, by Lemma 1.66, g ∗ is the indicatrix of the convex set C1 := (x ∗ , t ∗ ) ∈ X ∗ × R; x ∗ , x + t ∗ t ≤ g(x, t), for all (x, t) ∈ dom(g) . (1.304)
64
1 A Convex Optimization Toolbox
Dividing by t > 0 and setting y := x/t we see that the above set of inequalities is equivalent to x ∗ , y − f (y) + t ∗ ≤ 0, for all y ∈ X . Maximizing in y we obtain the conclusion. Since f is proper l.s.c. convex, so is f ∗ . Therefore C is nonempty. It follows that g = σC is never equal to −∞ (this also follows from the fact that g has, as already established, an affine minorant). By the Fenchel–Moreau–Rockafellar Theorem 1.46, g ∗∗ is equal to the convex closure of g. ∗∗
Lemma 1.158 The biconjugate of the perspective function satisfies, for all x ∈ X : ⎧ ⎪ ⎨ (i) +∞ if t < 0, ∗∗ g (x, t) = (ii) g(x, t) if t > 0, ⎪ ⎩ (iii) f ∞ (x) if t = 0.
(1.305)
Proof We have that g ∗∗ is the support function of C, and therefore: g ∗∗ (x, t) = sup{x ∗ , x + tt ∗ ;
f ∗ (x ∗ ) + t ∗ ≤ 0}.
(1.306)
(i) If t < 0, we may take x0∗ in the nonempty set dom( f ∗ ) and set (x ∗ , t ∗ ) := (x0∗ , − f ∗ (x0∗ ) − t ), with t → ∞; it follows that g ∗∗ (x, t) = +∞. (ii) If t > 0, maximizing in t ∗ in (1.306) and since f = f ∗∗ , we get g ∗∗ (x, t) = sup{x ∗ , x − t f ∗ (x ∗ )} = t sup{x ∗ , x/t − f ∗ (x ∗ )} = t f (x/t) = g(x, t). x∗
x∗
(iii) If t = 0, then g ∗∗ (x, 0) = sup{x ∗ , x;
f ∗ (x ∗ ) ≤ −t ∗ } = sup{x ∗ , x; x ∗ ∈ dom( f ∗ )}. (1.307) So, by Lemma 1.156, g ∗∗ (x, 0) = f ∞ (x). 1.3.5.3
Minimizing over a Union of Convex Sets
We next relate the perspective function to the resolution of the nonconvex problem Minc, x; x∈X
f 1 (x) ≤ 0 or f 2 (x) ≤ 0,
(P12 )
with c ∈ X ∗ , f i l.s.c. proper convex functions X → R, for i = 1, 2. We assume that the sets K i := f i−1 (R− ), i = 1, 2 are nonempty. Next, consider the convex problem Min
(x1 ,x2 )∈X ×X,t1 >0,t2 >0
c, x1 + x2 ; t1 f 1 (x1 /t1 ) ≤ 0 t2 f 2 (x2 /t2 ) ≤ 0; t1 > 0; t2 > 0; t1 + t2 = 1.
) (P12
1.3 Specific Structures, Applications
65
Lemma 1.159 Problems (P12 ) and (P12 ) have the same value. ). Setting xi := xi /ti , for i = Proof Let (x1 , x2 , t1 , t2 ) be in the feasible set of (P12 1, 2, one easily checks that (P12 ) has the same value as the problem
Min
(x1 ,x2 )∈X ×X,t1 >0,t2 >0
c, t1 x1 + t2 x2 ; f 1 (x1 ) ≤ 0
f 2 (x2 ) ≤ 0;
(P12 )
t1 > 0; t2 > 0; t1 + t2 = 1.
Minimizing w.r.t. to (x1 , x2 ) first, we see that the value of problem (P12 ) is equal to
inf
ti >0,t1 +t2 =1
t1 inf
x1 ∈K 1
c, x1
+ t2 inf
x2 ∈K 2
c, x2
= min
inf
x1 ∈K 1
c, x1 ,
inf
x2 ∈K 2
c, x2
.
(1.308)
The result easily follows.
1.4 Duality for Nonconvex Problems 1.4.1 Convex Relaxation In this section we discuss a nonconvex optimization problem with a finite dimensional constraint.
1.4.1.1
Coercive Dual Cost
Consider a problem of the form Min f (x); g(x) ∈ K . x∈X
(P)
Here X is an arbitrary set, f : X → R, g : X → R p , and K is a (possibly nonconvex) nonempty subset of R p . Denote by K the closed convex hull of K , and recall that K and K have the same support function. The associated Lagrangian is L(x, λ) := f (x) +
p
λi gi (x),
(1.309)
i=1
and the opposite of the dual criterion is d(λ) := σ K (λ) + sup{−L(x, λ)}.
(1.310)
x
This is obviously an l.s.c. convex function, everywhere greater than −∞. We will assume that it is proper; this holds, for instance, if inf f > −∞, since then 0 ∈
66
1 A Convex Optimization Toolbox
dom(d). The dual problem can be written as Min d(λ).
λ∈R p
(D )
We denote it by (D ) to take into account the change of sign, but call the amount − val(D ) the dual value in order to remain coherent with the general duality theory. Proposition 1.160 We assume that (i) the function d(·) is proper, and (ii) the existence of ε > 0 such that ε B ⊂ conv (g(X ) − K ) .
(1.311)
Then the dual problem has a nonempty and compact set of solutions. Remark 1.161 Note that (1.311) is equivalent to the same relation in which we write K instead of K . Proof (Proof of Proposition 1.160) Since Y ⊂ R p , (1.311) implies the existence of x1 , . . . xr in X and k1 , . . . kr in K such that
We have that
1 ε B ⊂ conv ({ki − g(xi ), i = 1, . . . , r }) . 2
(1.312)
d(λ) ≥ maxi {− f (xi ) + λ, ki − g(xi )} ≥ mini {− f (xi )} + maxi {λ, ki − g(xi )} ≥ mini {− f (xi )} + 21 ε|λ|,
(1.313)
the last inequality using the fact that a maximum of linear forms is equal to the maximum over their convex hull. It follows that a minimizing sequence λk (which exists since d is proper) is bounded and therefore has a subsequence converging to some λ¯ . Since d(·) is l.s.c. (being a supremum of linear forms), λ¯ ∈ S(D ). That S(D ) is bounded is a consequence of the coercivity property (1.313).
1.4.1.2
Dual Optimality Conditions
Let us now add some hypotheses for ensuring the existence of points minimizing the Lagrangian in the vicinity of dual solutions. Proposition 1.162 Assume that there exists a metric compact set Ω ⊂ X such that, if λ is close enough to S(D ), the set of minima of L(·, λ) has at least one point in Ω, and that f and g are continuous over Ω. a Borelian probability measure μ over Ω such Then λ ∈ S(D ) iff there exists that, denoting by Eμ g(x) = Ω g(x)dμ(x) the associated expectation, the following holds:
1.4 Duality for Nonconvex Problems
67
supp μ ⊂ argmin L(·, λ); Eμ g(x) ∈ K ; λ ∈ NK (Eμ g(x)).
(1.314)
Proof Set δ(λ) := supx∈X {−L(x, λ)}. By our assumptions, when λ is close enough to S(D ), δ(λ) is equal to the continuous function δ (λ) := max{−L(x, λ)}. x∈Ω
(1.315)
Since δ(·) and δ (·) are convex, and coincide near S(D ), they have the same subdifferential near S(D ). Let λ ∈ S(D ). Since δ (·) is continuous at λ, Corollary 1.121 implies that 0 ∈ ∂d(λ) = ∂σK (λ) + ∂δ (λ).
(1.316)
By (1.147), y ∈ ∂σK (λ) iff y ∈ K and λ ∈ NK (y). So, (1.316) is equivalent to λ ∈ NK (−q), for some q ∈ ∂δ (λ).
(1.317)
We next give an expression for ∂δ (λ). We have that δ (λ) = F[G(λ)], where F : C(Ω) → R is defined by F(y) := max{yx , x ∈ Ω}, and G affine R p → C(Ω) is defined by (denoting the value at x ∈ Ω by a subindex) G(λ)x := −L(x, λ). Set A := DG(λ). Since F is Lipschitz, the subdifferential calculus rules (Theorem 1.119) apply, so that by (1.317): q ∈ ∂δ (λ) = A ∂ F(G(λ)).
(1.318)
p Now A ∈ L(R p , C(Ω)) satisfies (Aλ)x := − i=1 λi gi (x). For μ ∈ C(Ω)∗ we have p λi gi (x)dμ(x) μ, AλC(Ω) = − Ω
i=1
so that A μ = Ω g(x)dμ(x). By Lemma 1.140, ∂ F(y) is equal to the set of Borel measures over Ω, with support over the set of points where y attains its maximum. The conclusion follows. Remark 1.163 When X is a metric compact set we can also consider the following relaxed formulation
f (x)dμ(x);
Min
μ∈P (X )
X
g(x)dμ(x) ∈ K .
(1.319)
X
The stability condition for this convex problem is precisely (1.311). So, under this condition, if the above problem is feasible, there is no duality gap. The Lagrangian is
68
1 A Convex Optimization Toolbox L (μ, λ) =
f (x) + X
p
λi gi (x) dμ(x) − σK (λ) =
i=1
X
L(x, λ)dμ(x) − σK (λ).
(1.320) Therefore the infimum of the Lagrangian w.r.t. the primal variable μ ∈ P can be expressed as inf
μ∈P (X )
L (μ, λ) − σK (λ) = inf L(x, λ) − σK (λ). x∈X
(1.321)
That is, (1.319) has the same dual as the original problem. If (1.311) holds, then the stability condition holds for the convex problem (1.319), and hence, there is no duality gap. We can therefore interpret the dual problem as the dual of the relaxed problem. ¯ is such that g(x) Proposition 1.164 (i) Let λ¯ ∈ S(D ). If x¯ ∈ argmin L(·, λ) ¯ ∈K ¯ then x¯ ∈ S(P), and the primal and dual problems have the same and λ ∈ NK (g(x)), value. (ii) Under the hypotheses of Proposition 1.162, if K is closed and convex, and λ¯ ∈ S(D ) is such that x → g(x) is constant over argmin L(·, λ¯ ) (which is the ¯ attains its minimum at a single point), then any case in particular if L(·, λ) x¯ ∈ argmin L(·, λ¯ ) is a solution of (P) and the conclusion of point (i) is therefore satisfied. Proof (i) Since L(x, ¯ λ¯ ) = inf x L(x, λ¯ ) and λ¯ ∈ NK (g(x)), ¯ and consequently σK (λ¯ ) ¯ g(x), is equal to λ, ¯ we have that f (x) ¯ = L(x, ¯ λ¯ ) − σK (λ¯ ) = inf L(x, λ¯ ) − σK (λ¯ ) = −d(λ¯ ), x
(1.322)
i.e., x¯ and λ¯ are primal and dual feasible with equal cost, meaning that x¯ is a solution of the primal problem, λ¯ is a solution of the dual one, and the primal and dual values are equal. ¯ we obtain (ii) We apply Proposition 1.162. Since g(x) is constant over argmin L(·, λ), the existence of a probability measure with support over argmin L(·, λ¯ ), such that ¯ g(x) for any x¯ ∈ argmin L(·, λ), ¯ = Eμ g(x) ∈ K . We conclude using point (i). Remark 1.165 In most non-convex problems there is a duality gap; the hypotheses of the proposition above are not satisfied. Now the hypotheses of Propositions 1.160 and 1.162 are weak. So, in general, the dual problem has a compact and nonempty set of solutions, but in each of them, the minimum of the Lagrangian is reached at several points (with different values of the constraint g(x)). Remark 1.166 We will apply point (ii) of Proposition 1.164 (in a case when the set of minima of the Lagrangian is in general not a singleton) to the study of controlled Markov chains with expectation constraints, see Theorem 7.34. Exercise 1.167 For x ∈ R, let f (x) = 1 − x 2 and g(x) = x. The problem of minimizing f (x) over X := [−1, 1], under the constraint g(x) = 0, has a unique solution
1.4 Duality for Nonconvex Problems
69
x¯ = 0 and value 1. The Lagrangian L(x, λ) = 1 − x 2 + λx is concave, and therefore attains its minimum over ±1, so that the opposite of dual cost is d(λ) = |λ| (note that here σ K is the null function). So, the dual problem has unique solution λ¯ = 0, for which the Lagrangian attains its minimum at ±1, and so, the relaxed solution is the measure with equal probability 1/2 at ±1 (so that, as required, the expectation of g(x) is zero).
1.4.1.3
Estimate of Duality Gap
We assume here that X is a convex subset of a Banach space X and that g(x) = ¯ is defined, for x ∈ Ax, with A ∈ L(X , Rn ). The convexification of f : X → R conv(dom( f )), by conv( f )(x) := inf
i
αi f (x ); i
αi x = x , i
(1.323)
i
over all finite families i ∈ I with αi ≥ 0, i αi = 1 and x i ∈ X . This is the largest convex function minorizing f . It satisfies, for x ∈ X : f (x) − conv( f )(x) ≤ ρ X ( f ), where ρ X ( f ) := sup ( f (x) − conv( f )(x)) . x∈X
(1.324) Note that ρ X ( f ) ≥ 0, with equality iff f is convex over X . We call it the estimate of lack of convexity of f over X . The perturbed relaxed primal problem, in this setting, for y ∈ R p , reads (Py ) Min conv( f )(x); Ax + y ∈ K . x∈X
In this setting the stability condition similar to (1.311) reads as ε B ⊂ conv (A dom( f ) − K ) .
(1.325)
Proposition 1.168 Let (1.325) hold, and val(P) be finite. Then val(P) − val(D) ≤ ρ X ( f ).
(1.326)
Proof If ρ X ( f ) = ∞ the conclusion is obvious. So, let us assume that ρ X ( f ) < ∞. We easily check that (P) and (P0 ) have the same dual (a similar observation was made after (1.321)). It follows that val(D) ≤ val(P0 ) ≤ val(P) < ∞.
(1.327)
70
1 A Convex Optimization Toolbox
By the stability condition (1.325), val(Py ) is finite near y¯ = 0. By Proposition 1.64, val(Py ) is continuous at y¯ = 0. So, by Theorem 1.88, val(D) = val(P0 ). Therefore, val(P) − val(D) = val(P) − val(P0 ) ≤ ρ X ( f ),
(1.328)
where the last inequality follows from the definition of ρ X ( f ). The conclusion follows. Remark 1.169 Proposition 1.168 was obtained by Aubin and Ekeland [11, Thm. A]. We will next see how to improve this estimate in the case of decomposable problems.
1.4.2 Applications of the Shapley–Folkman Theorem 1.4.2.1
The Shapley–Folkman Theorem
We give a simple proof of this theorem, following [127]. Theorem 1.170 Let Si , i = 1 to p, be nonempty subsets of Rn , with p > n. Set S := p S1 + · · · + S p . Then any x ∈ conv(S) has the representation x = i=1 xi , where xi ∈ conv(Si ) for all i, and xi ∈ Si for at least ( p − n) indices. p Proof Sincea sum of convex sets is convex, and S ⊂ i=1 conv(Si ), we have p that p conv(S) ⊂ i=1 conv(Si ). So any x ∈ conv(S) has the representation x = i=1 yi , of conv(Si ), there exists finite sets Ji , coefwith yi ∈ conv(Si ). By the definition ficients αi j ≥ 0,j ∈ Ji , with j∈Ji αi j = 1, and elements yi j ∈ Si , for all j ∈ Ji , such that yi = j∈Ji αi j yi j . Define the following elements of Rn+ p by ⎧ z ⎪ ⎪ ⎪ ⎪ ⎪ z 1j ⎪ ⎪ ⎪ ⎨ z 2j ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ z pj ⎪ ⎪ ⎩
:= (x , 1, 1, . . . , 1), := (y1j , 1, 0, . . . , 0), := (y2j , 0, 1, . . . , 0), ... := (y , 0, 0, . . . , 1), pj
(1.329)
...
p Then z = i=1 j∈Ji αi j z i j . Since any nonnegative combination of elements of Rn+ p is a nonnegative combination of at most n + p of them,5 we have that the minimal number of a nonnegative combination was greater than n + p, adding some linear combination of these elements equal to 0 and with nonzero elements, we could easily find another nonnegative combination of z with fewer nonzero coefficients, which would give a contradiction.
5 If
1.4 Duality for Nonconvex Problems
71
p z = i=1 j∈Ji βi j z i j with at most n + p nonzero βi j . Since for all i, j∈Ji βi j = 1, for each i, meaning that at most n indices have more than at least one βi j is nonzero p one nonzero βi j . As x = i=1 j∈Ji βi j yi j , the conclusion follows. 1.4.2.2
Estimate of Duality Gap for Decomposable Problems
Consider again problem (P) of Sect. 1.4.1.1, that is, Min f (x); g(x) ∈ K ,
(1.330)
x∈X
with K a convex subset of R p , assuming that the constraints are linear and a decomposable cost function f (x) =
N
f k (xk ); g(x) =
k=1
N
gk (xk ), gk (xk ) = Ak xk , k = 1, . . . , N ,
k=1
(1.331) with X = X 1 × · · · × X N , X k a closed convex subset of a Banach space X k , Ak ∈ L(X k , R p ), and xk ∈ X k for each k. The associated Lagrangian defined in (1.309) satisfies L(x, λ) :=
N
L k (xk , λ), where L k (xk , λ) := f k (xk ) + λ · gk (x).
(1.332)
k=1
So, we have a decomposability property for the Lagrangian: the (opposite of) the dual cost satisfies d(λ) =
N
dk (λ), where dk (λ) := inf L k (xk , λ).
(1.333)
xk ∈X k
k=1
We recall the definition of the measure of lack of convexity in (1.324), and set ρk := ρ X k ( f k ), for k = 1 to N . Proposition 1.171 Let (1.325) and (1.331) hold. Then val(P) − val(D) ≤
max
I ⊂{1,...,N }
ρk ; |I | ≤ p + 1 ≤ ( p + 1) max ρk . (1.334)
k∈I
k
Proof Let Sk := {( f (xk ), g(xk ), xk ∈ X k }, for k = 1 to N , and set S := S1 + · · · + S N . Write s ∈ S as (s , s ) where s is the first component and s ∈ R p . The relaxed problem has the same value as Min s ; s ∈ K .
s∈conv(S)
(1.335)
72
1 A Convex Optimization Toolbox
Let s be feasible for this problem. By the Shapley–Folkman Theorem 1.170, we may assume that sk = ( f (x¯k ), g(x¯k )), with x¯k ∈ X k , except for at most p + 1 indexes, say the p + 1 first. For k = 1 to p + 1, since X k is convex, there exists x¯k ∈ X k such that sk = Ak x¯k . Then x¯ (which is well defined as an element of X ) is feasible and satisfies f k (xk ) ≤ sk + ρk , for k = 1 to p + 1. The result follows. Remark 1.172 This result is due to Aubin and Ekeland N [11, Thm. D]. In this decomposable setting, it is easily checked that ρ X ( f ) = k=1 ρk . So, in general the duality estimate improves the one in Proposition 1.168 when p + 1 < N .
1.4.3 First-Order Optimality Conditions While these notes are mainly devoted to convex problems, it is useful to discuss optimality conditions in the case of nonlinear equality constraints. The (general) result below will have an application in the theory of semidefinite programming, see the proof of Lemma 2.12. So, consider the following problem Min f (x); g(x) = 0, x∈X
(P)
where X and Y are Banach spaces, g : X → Y is of class C 1 , and F : X → R is continuous and convex. Definition 1.173 We say that x¯ ∈ X is a local solution of problem (P) if g(x) ¯ =0 and f (x) ¯ ≤ f (x) whenever g(x) = 0 and x is close enough to x. ¯ Theorem 1.174 Let x¯ be a local solution of (P), such that Dg(x) ¯ is onto. Then there exists a unique λ ∈ Y ∗ such that ∂ f (x) ¯ + Dg(x) ¯ λ 0. The proof of the theorem is based on Liusternik’s theorem that essentially gives a sufficient condition for an element of Ker Dg(x) ˜ to be a tangent direction to g −1 (0). Theorem 1.175 Let x˜ be such that g(x) ˜ = 0 and Dg(x) ˜ is onto. Let h ∈ Ker Dg(x). ˜ Then there exists a path R+ → X , t → x(t) such that x(t) = x˜ + th + o(t) and g(x(t)) = 0. Proof Set A := Dg(x) ˜ and denote by c(·) the modulus of continuity of Dg(x) at x, ˜ such that Dg(x) − Dg(x) ˜ ≤ c(r ) whenever x − x ˜ ≤ r. (1.336) By the open mapping Theorem 1.29, there exists a c > 0 such that, for any b ∈ Y , there exists an a ∈ X that satisfies Aa = b and a ≤ c A b. So given t > 0, consider the sequence x k in X such that x 0 = x˜ + th and g(x k ) + A(x k+1 − x k ) = 0 and x k+1 − x k ≤ c A g(x k ), k ≥ 1.
(1.337)
1.4 Duality for Nonconvex Problems
73
Set ek := x k+1 − x k . Then g(x
k+1
) = g(x ) + k
1
1
Dg(x + σ e )e dσ = k
k
0
k
Dg(x k + σ ek ) − A ek dσ
0
(1.338)
and therefore g(x
k+1
) ≤
1
Dg(x k + σ ek ) − A dσ ek ≤ ck ek ≤ ck c A g(x k ),
0
where
(1.339) ˜ c(x k − x) ˜ . ck ≤ max c(x k+1 − x),
(1.340)
˜ ≤ R. Let K be Let R > 0 be such that c(x − x) ˜ ≤ 1/(2c A ) whenever x − x ˜ ≤ R. Then by an integer such that for all 0 ≤ k ≤ K + 1, we have that x k − x induction we obtain that g(x k+1 ) ≤ 2−k−1 g(x); ek ≤ 2−k c A g(x),
(1.341)
and so ˜ ≤ x − x ˜ + 2c A g(x), for all ≤ K + 1. x − x
(1.342)
Now let x be such that x − x ˜ + 2c A g(x) ≤ R. The above relations imply that (1.341) hold for all k. Therefore x k converges to x a such that g(x a ) = 0, and in addition x a − x ≤ 2c A g(x 0 ). Since g(x 0 ) = g(x + th) = o(t), the result follows by taking x(t) := x a . Proof (Proof of Theorem 1.174) (a) The difference of two multipliers belongs to the kernel of Dg(x) ¯ . However, that Dg(x) ¯ is onto implies that its transpose is injective. The uniqueness of the multiplier follows. (b) We prove the existence of the multiplier. Given h ∈ Ker Dg(x), ¯ let x(t) be the associated feasible path provided by Theorem 1.175. As F is locally Lipschitz, we have that F(x(t)) = F(x¯ + th) + o(t). Since x¯ is a local solution it follows that 0 ≤ lim t↓0
F(x(t)) − F(x) ¯ F(x¯ + th) − F(x) ¯ ¯ h). = lim = F (x, t↓0 t t
(1.343)
Consider the convex problem min F(x¯ + h); h∈X
Dg(x)h ¯ = 0.
(1.344)
74
1 A Convex Optimization Toolbox
If h is feasible then 0 ≤ F (x, ¯ h) ≤ F(x¯ + h) − F(x), ¯ and therefore h¯ = 0 is a solution of (1.344). This problem satisfies the stability condition since Dg(x) ¯ is onto. We conclude by applying Fenchel’s duality (Example 1.114), with here K = {0}, and ¯ = Y ∗. so, N K (g(x))
1.5 Notes Conjugate functions were introduced by Mandelbrojt [78] for functions on R, and in the Rn setting by Fenchel [47]. The Fenchel conjugate, in the smooth case, reduces to the Legendre transform. Then (as quoted in [93], which includes an extension of this result) Fenchel stated a strong duality result [48] for problems which have a structure corresponding to our Example 1.2.1.8, whence the name “Fenchel duality”. Many extensions were obtained in the sixties, especially by Moreau in his university lecture notes and various notes to the French Academy of Sciences, synthetized in [82, 84], and by Rockafellar [93, 95], who introduced the technique of duality through perturbations [100]. Some classical references, still worth consulting, are the lecture notes by Moreau [83], and the books by Rockafellar [97] in the finitedimensional setting, and by Ekeland and Temam [46] for infinite-dimensional spaces. Theorem 1.130 is a particular case of Sion’s theorem [117], in which hypotheses of “quasi-convexity” and “quasi-concavity” are made, in a topological vector space setting; see [64] for a simple proof. The Attouch–Brézis theorem [10] has a weak qualification condition under which the equality of primal and dual values hold, the dual problem having solutions. About extensions of the perspective function, see Maréchal [79].
Chapter 2
Semidefinite and Semi-infinite Programming
Summary This chapter discusses optimization problems in the cone of positive semidefinite matrices, and the duality theory for such ‘linear’ problems. We relate convex rotationally invariant matrix functions to convex functions of the spectrum; this allows us to compute the conjugate of the logarithmic barrier function and the dual of associate optimization problems. The semidefinite relaxation of problems with nonconvex quadratic cost and constraints is presented. Second-order cone optimization is shown to be a subclass of semidefinite programming. The second part of the chapter is devoted to semi-infinite programming and its dual in the space of measures with finite support, with application to Chebyshev approximation and to one-dimensional polynomial optimization.
2.1 Matrix Optimization This section is devoted to optimization problems in matrix spaces. We identify L(R p , Rn ) with the vector space of matrices of size p × n, and denote by S n the space of symmetric matrices of size n.
2.1.1 The Frobenius Norm Let us endow L(R p , Rn ) with the Frobenius norm and its associated scalar product; for p × n matrices A and B: ⎛ A F := ⎝
⎞1/2 Ai2j ⎠
;
A, B F :=
i, j
© Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_2
Ai j Bi j .
(2.1)
i, j
75
76
2 Semidefinite and Semi-infinite Programming
In particular, let x and x be in Rn , y and y be in R p . Denoting by “·” the Euclidean scalar product, we get: A, yx F = y Ax;
y (x ) , yx F = (y · y) (x · x).
(2.2)
Let A and B belong to L(R p , Rn ). Then A, B F = trace(AB ) = trace(B A ) = trace(A B) = trace(B A).
(2.3)
To prove this, it suffices to check the first relation and use the fact that the Frobenius scalar product is symmetric and the identity A, B F = A , B F . Being the sum of eigenvalues, the trace of a matrix is invariant under a basis change: for all square matrices M and P, with P invertible, we have that trace(M) = trace(P −1 M P).
(2.4)
In particular, let Q and Qˆ be orthonormal matrices of size resp. p and n, so that Q −1 = Q and Qˆ −1 = Qˆ . Then ˆ Q B Q ˆ F. A, B F = trace(Q A Qˆ Qˆ B Q) = Q A Q,
(2.5)
In other words, the Frobenius scalar product is invariant under orthonormal basis changes in Rn and R p . In particular, let x 1 , . . . , x n be an orthonormal system (i.e., the columns of an orthonormal matrix Q). Then A2F = trace(Q A AQ) =
|Ax i |2 .
(2.6)
i
Consider now the case of symmetric matrices. We know that A ∈ S n can be diagonalized by an orthonormal basis change. Denoting by λi (A) the eigenvalues of A, counted with their multiplicity, and arranged in nonincreasing order, we obtain by (2.5): n λi (A)2 . (2.7) A2F = trace(A2 ) = i=1
Let A and B belong to S n . Denote by λi and μ j their eigenvalues, and x i , y j an orthonormal system of associated eigenvectors. We get by (2.2) A, B F =
n
λi μ j (x i · y j )2 .
i, j
One easily deduces from this formula the following result:
(2.8)
2.1 Matrix Optimization
77
Theorem 2.1 (Fejer) The symmetric square matrix A is positive semidefinite iff we have A, B F ≥ 0 for all symmetric positive semidefinite B. Denote by S+n the set of semidefinite positive matrices. By the Fejer theorem this is a selfdual (i.e., equal to its positive polar) cone. Proposition 2.2 Let A ∈ S n , and Q be an orthonormal matrix such that A = Q D Q, where D is a diagonal matrix. Then the projection in the Frobenius norm of A over S+n is Q D+ Q, where D+ is the diagonal matrix of diagonal elements (Dii )+ , i = 1 to n. Proof The Frobenius norm endows S n with a Hilbert space structure. The projection, say B, of A over the nonempty closed convex set S+n is therefore well defined, and characterized by the relation B ∈ S+n ; B − A, C − B F ≥ 0, for all C ∈ S+n ,
(2.9)
which is a consequence of (and in fact is equivalent to) the two relations B − A, C F ≥ 0, for all C ∈ S+n , and B − A, B F = 0. Clearly, Q D+ Q satisfies these relations (the first one by Fejer’s theorem).
2.1.2 Positive Semidefinite Linear Programming 2.1.2.1
Framework
Positive semidefinite linear programs (SDP) are optimization problems of the form Minn c · x; A0 + x∈R
n
xi Ai 0,
(S D P)
i=1
where the Ai , i = 0 to n, are symmetric matrices of size p, and, given two symmetric matrices A and B of the same size, “A B” means that A − B is positive semidefinite. (In a similar way we will use to denote positive definiteness, and and ≺ for negative semidefiniteness and negative definiteness resp.). Let us see how to reduce some optimization problems to the SDP format. It is trivial to reduce linear constraints to SDP constraints1 : Ax − b ≤ 0
⇔
−diag(Ax − b) 0.
In the case of quadratic convex constraints such as 1 We
denote by diag the operator that to a vector associates the diagonal matrix having this vector for its diagonal, and also the operator that to a square matrix associates its diagonal.
78
2 Semidefinite and Semi-infinite Programming
q(x) := (Ax + b) · (Ax + b) − c · x − d, we have that q(x) ≤ 0 iff
I Ax + b (Ax + b) c · x + d
0.
This is a trivial consequence of the following, easily proved lemma: A B , with A and C symmetric and A Lemma 2.3 (Schur lemma) Let A = B C invertible. Then A 0 ⇔ {A 0 and C B A−1 B}. Example 2.4 Let (X, x) ∈ S n × Rn . Then X x x iff
1 x x X
0.
The following problem, with quadratic criterion and constraints: Minn q0 (x) ; qi (x) ≤ 0, i = 1 to p, x∈R
is equivalent to the problem with a linear cost and quadratic constraints: Min t ; q0 (x) − t ≤ 0 ; qi (x) ≤ 0, i = 1 to p. (x,t)
This allows us to reduce problems with convex quadratic cost function and constraints to the SDP format. Another type of example is that of minimisation of the greatest eigenvalue: Min t ; t I − A(x) 0. (x,t)
2.1.2.2
Linear Duality
We next apply the duality theory to problem (S D P) of Sect. 2.1.2.1. It is a special case of linear conical optimization (Chap. 1, Sect. 1.3.2). However we will derive the dual problem in a direct way. We have seen that, by Fejer’s theorem 2.1, the polar −S+n . cone of S+n is S−n := n xi Ai . The Lagrangian of problem (S D P) is Set A(x) := A0 + i=1 L(x, λ) = c · x + λ, A(x) F with λ ∈ S n , i.e., L(x, λ) = λ, A0 F + is therefore
n i=1
(ci + λ, Ai F ) xi . The dual problem
Max A0 , λ F ; ci + Ai , λ F = 0, i = 1, . . . , n.
λ∈S −n
(DS D P)
2.1 Matrix Optimization
79
On the other hand, the family of perturbed problems associated with (S D P) is Minn c · x ; A0 +
n
x∈R
xi Ai + y 0,
(S D Py )
i=1
with here y ∈ S p . Set v(y) := val(S D Py ). The (strong duality) Corollary 1.92 implies the following. Theorem 2.5 Assume that val(S D P) is finite, and that the following stability conˆ 0. Then dition holds: there exists an xˆ ∈ Rn such that A(x) (a) we have the equality val(DS D P) = val(S D P), (b) the set S(DS D P) is nonempty and bounded, (c) for all z ∈ S n , we have v (0, z) = max{y ∗ , z F ; y ∗ ∈ S(DS D P)}. By Lemma 1.85, the primal problem is also the dual of its dual. So, consider the perturbation of equality constraints Max A0 , λ F ; ci + Ai , λ F + h i = 0, i = 1, . . . , n.
λ∈S −n
(DS D Ph )
Here h ∈ Rn can be interpreted as a perturbation of the primal cost. Set w(h) := val(DS D Ph ); this is a concave function. When h = 0, the bidual problem is nothing else than the primal problem (S D P). Applying the strong duality Corollary 1.92, we deduce that: Theorem 2.6 Assume that val(DS D P) finite, and the following stability condition is satisfied: the family A1 , . . . , An is linearly independent, and there exists a λ ≺ 0, feasible for (DS D P). Then (a) we have the equality val(DS D P) = val(S D P), (b) the set S(S D P) is nonempty and bounded, (c) for all d ∈ Rn , we have w (0, d) = min{x, d; x ∈ S(S D P)}. The exercises below, taken from [125, Chap. 4], show important differences with the duality theory for linear programming. Exercise 2.7 Check that the following problem has no solution, despite the absence of a duality gap and the finiteness of the common value: Min x1 ;
x1 1 1 x2
0.
Exercise 2.8 Show that the following problem has a nonzero duality gap, although both the primal and dual problems have solutions: ⎛
⎞ x2 + 1 0 0 Min x2 ; ⎝ 0 x1 x2 ⎠ 0. 0 x2 0
80
2 Semidefinite and Semi-infinite Programming
Hint: check that the feasible set is R+ × {0}, and so the primal value is 0, while the dual is Maxn λ11 ; λ22 = 0, 1 + λ11 + 2λ23 = 0, λ∈S −
and therefore, any dual feasible λ satisfies λ23 = 0, and so λ11 = −1, so that the dual value is −1.
2.2 Rotationally Invariant Matrix Functions 2.2.1 Computation of the Subdifferential ¯ One says that F is rotationally invariant if, for Let F be an application of S n in R. all orthonormal matrices Q of size n, we have F(M) = F(Q M Q ), for all M ∈ S n .
(2.10)
¯ One says that f is symmetric if, for all permutations π of {1, . . . , n} Let f : Rn → R. (i.e., a bijective mapping from {1, . . . , n} into itself), we have f (x1 , . . . , xn ) = f (xπ1 , . . . , xπn ), for all x ∈ Rn .
(2.11)
Let us recall that we denote by λ1 (M), . . . , λn (M) the eigenvalues of M ∈ S n in nonincreasing order, and we set λ(M) := (λ1 (M), . . . , λn (M)) . ¯ is rotationally invariant iff there exists a Lemma 2.9 The function F : S n → R n ¯ symmetric function f : R → R, such that F(M) = f (λ1 (M), . . . , λn (M)) for all M ∈ S n .
(2.12)
Proof Let F be rotationally invariant. We can choose Q in such a way that Q M Q is a diagonal matrix, whose diagonal elements are the eigenvalues of M arranged in an arbitrary order. It follows that F is a symmetric function of the spectrum of M, whence (2.12). The converse is immediate. We will call f the spectral function associated with F. Let us see how to compute the Fenchel conjugate of a rotationally invariant function. ¯ be rotationally invariant, and f the associated Theorem 2.10 Let F : S n → R spectral function. Then (i) the Fenchel conjugate of F is rotationally invariant, with associated spectral function f ∗ , the Fenchel conjugate of f , (ii) the function F is convex, l.s.c. and proper iff f is so. The first step of the proof deals with the cone of nonincreasing vectors:
2.2 Rotationally Invariant Matrix Functions
81
K d := x ∈ Rn ; x1 ≥ x2 ≥ · · · ≥ xn .
(2.13)
Lemma 2.11 (i) The polar of the cone of nonincreasing vectors is K d− =
y ∈ Rn ;
j
yi ≤ 0, j = 1, . . . , n − 1;
i=1
n
yi = 0 .
(2.14)
i=1
(ii) In addition, if x ∈ K d and y ∈ K d− , then x · y = 0 iff (xi−1 − xi )
i−1
yk = 0, i = 2, . . . , n.
(2.15)
k=1
(iii) If x and z are elements of K d , and P is a permutation matrix, then y := P z − z belongs to K d− , and x y = 0 iff there exists a permutation matrix Q such that Qx = x; Q P z = z.
(2.16)
Proof If x and y belong to Rn , we have x y = (x1 − x2 )y1 + (x2 − x3 )(y1 + y2 ) + · · · + (xn−1 − xn )
n−1 k=1
yk + xn
n
yk .
k=1
(2.17) It follows that the r.h.s. of (2.14) is included in K − . Conversely, let 1 ≤ p ≤ n − 1, are equal to 1, and the other ones y ∈ K d− , and x ∈ Rn whose p first coordinates p to 0. Then x ∈ K d , and so 0 ≥ x y = k=1 yk . Choosing x = ±1, we obtain 0 ≥ (±1) y = − nk=1 yk , whence (i). Point (ii) is an immediate consequence of (i) and (2.17). Let us show (iii). By the definition of y and (i), it is clear that y ∈ K d− . If (2.16) is satisfied, then x P z = x Q Q P z = (Qx) z = x · z
(2.18)
and so x · y = 0. Let us show the converse. By (ii), x · y = 0 iff (2.15) is satisfied. Let I be the set of equivalence classes of components of x, and Q be a permutation; then Qx = x iff any I ∈ I is stable under Q. In particular, there exists a permutation Q for which Q Pz is nonincreasing over all I ∈ I . If i(I ) denotes the smallest index of i(I )−1 each class, we observe that (2.15) is equivalent to k=1 yk = 0, for all I ∈ I , and so i∈I yi = 0, I ∈ I , i.e., i∈I (Q P z)i = i∈I z i , for all I ∈ I . But Q P z ≤ z, and these two vectors have nondecreasing components over each I ∈ I ; they are therefore equal. Lemma 2.12 Let X and Y belong to S n . Then X, Y F ≤ λ(X ) · λ(Y ),
(2.19)
82
2 Semidefinite and Semi-infinite Programming
with equality iff there exists an orthonormal matrix U diagonalizing these two matrices, and such that U XU = diag(λ(X )); U Y U = diag(λ(Y )).
(2.20)
Proof Consider the optimization problem Max trace Z X Z Y ; I − Z Z = 0,
Z ∈M n
(2.21)
where I is the identity in Rn . We take S n (and not M n ) as the constraint space. The feasible set is the set of orthonormal matrices, which is compact. So, the above problem has (at least) one solution Z¯ . Let us check that the derivative of the constraints is surjective at this point. Indeed, the linearized equation − Z¯ W − W Z¯ = A,
(2.22)
where A ∈ S n , has solution W = − 21 Z¯ A. The Lagrangian of this problem can be expressed as (2.23) trace Z X Z Y + Λ − Z Z Λ . By Theorem 1.174, there exists a unique Lagrange multiplier Λ ∈ S n such that the above Lagrangian has a zero derivative w.r.t. Z at Z¯ . In other words, for all W ∈ M n , we have that trace W X Z¯ Y + Z¯ X W Y − W Z¯ Λ − Z¯ W Λ = 0.
(2.24)
Using (2.3), we obtain the equivalent expression trace W X Z¯ Y − Z¯ Λ + trace Y Z¯ X − Λ Z¯ W = 0.
(2.25)
Set M := X Z¯ Y − Z¯ Λ. Taking W = M, we obtain 0 = trace(M M) = M2F , and so M = 0. It follows that Z¯ X Z¯ Y = Λ = Λ = Y Z¯ X Z¯ .
(2.26)
This means that Y and Z¯ X Z¯ do commute. So there exists [60] an orthonormal matrix V diagonalizing these two matrices, which means that V Y V = diag(P1 λ(Y )); V Z¯ X Z¯ V = diag(P2 λ( Z¯ X Z¯ )) = diag(P2 λ(X )),
(2.27)
where P1 and P2 are permutation matrices. We can assume that P2 = I . We get then, since Z¯ is a solution of (2.21), that
2.2 Rotationally Invariant Matrix Functions
83
trace X Y ≤ trace Z¯ X Z¯ Y = trace(V Z¯ X Z¯ V V Y V ) = λ(X ) P1 λ(Y ). (2.28) By Lemma 2.11(iii), we have λ(X ) P1 λ(Y ) ≤ λ(X ) · λ(Y ), with equality iff there exists a permutation matrix Q leaving λ(X ) invariant, and such that Q P1 λ(Y ) = λ(Y ). Then U := V Q satisfies (2.20). Indeed, using (2.27), we get (leaving the details of the last equality in (2.29) to the reader),
and
U Y U = QV Y V Q = Qdiag(P1 λ(Y ))Q = diag(λ(Y ))
(2.29)
U XU = QV X V Q = Qdiag(λ(X ))Q = diag(λ(X )).
(2.30)
Conversely, if (2.20) is satisfied, it is clear that equality holds in (2.19).
Proof (Proof of theorem 2.10) By Lemma 2.12, we have that F ∗ (Y ) = sup {X, Y F − F(X )} ≤ sup {λ(X ) · λ(Y ) − f (λ(X ))} ≤ f ∗ (λ(Y )). X ∈S n
X ∈S n
(2.31) Taking Y = U diag(λ(Y ))U , with U orthonormal, and X of the form U diag(x)U , with x ∈ Rn , we get F ∗ (Y ) ≥ supx∈Rn {U diag(x)U, Y F − F(X )} = supx∈Rn {x λ(Y ) − f (x)} = f ∗ (λ(Y )).
(2.32)
By (2.31)–(2.32), f ∗ is the spectral function associated with F ∗ , whence (i). If F is convex, l.s.c. and proper, it is, by Theorem 1.44, equal to its biconjugate F ∗∗ , which by (i) has the spectral function f = f ∗∗ . Therefore f is convex l.s.c., and proper since F is. Conversely, if f is convex, l.s.c. and proper, then F = F ∗∗ by (i), so F is l.s.c. convex, and proper since f is; whence (ii). One deduces from the previous results an expression for the subdifferential of a rotationally invariant function. ¯ be rotationally invariant, f be the associated Proposition 2.13 Let F : S n → R n spectral function, and let X ∈ S be such that F(X ) ∈ R. Then Y ∈ ∂ F(X ) iff the following two relations are satisfied: (i) λ(Y ) ∈ ∂ f (λ(X )) and (ii) there exists an orthonormal matrix U satisfying (2.20). Proof The Fenchel–Young inequality ensures that Y ∈ ∂ F(X ) iff F(X ) + F ∗ (Y ) = X, Y F , which is equivalent to f (λ(X )) + f ∗ (λ(Y )) = X, Y F . Now Lemma 2.12, combined with the Fenchel–Young inequality, ensures that the equality is satisfied iff λ(Y ) ∈ ∂ f (λ(X )) and there exists an orthogonal matrix U satisfying (2.20). The conclusion follows.
84
2 Semidefinite and Semi-infinite Programming
2.2.2 Examples When applying the previous results, it is convenient to rewrite (2.20) in the form X = U diag(λ(X ))U ; Y = U diag(λ(Y ))U .
(2.33)
The columns of U form an orthonormal basis of eigenvectors of X , the latter by nonincreasing order of eigenvalues. We will speak of an ordered basis. The condition over Y is therefore that at least one ordered basis for X is also an ordered basis for Y . Denote by Ui the ith column of U . Then Y = U diag(λ(Y ))U =
n
λi (Y )Ui Ui .
(2.34)
i=1
Example 2.14 Let q be a nonnegative integer, and let the function F : S n → R be q q defined by F(X ) := trace(X q ). Since λ(X n) = λ(X )q (we take here the power of the vector componentwise) n we qget F(X ) = i=1 λi (X ) , and so the associated spectral xi . function is f (x) = i=1 If q is even, then f is convex and (since it is differentiable) its subdifferential reduces to its derivative. We get, for U orthonormal satisfying (2.20): D F(X ) = Y = qU diag(λ(X q−1 ))U = q X q−1 .
(2.35)
Example 2.15 The function F(X ) := λ1 (X ) (greatest eigenvalue) has associated spectral function f (x) = maxi xi . If x ∈ Rn has nonincreasing components, and x1 = · · · = x p > x p+1 , it follows from Lemma 1.140 that
n ∂ f (x) = y ∈ Rn+ , i=1 yi = 1; yi = 0, i > p .
(2.36)
Denote by U the set of orthonormal matrices whose p first columns form a base of the eigenspace E 1 associated with λ1 (X )). By (2.34), Y ∈ ∂ F(X ) iff, for a certain U ∈U: p p p Y = αi Ui Ui ; α ∈ R+ , αi = 1. (2.37) i=1
Setting
i=1
Pp = α ∈
p R+ ,
p i=1
we deduce the directional derivative formula
αi = 1 ,
(2.38)
2.2 Rotationally Invariant Matrix Functions
λ1 (X, Z ) = max{Y,
pZ F ; Y∈ ∂λ1 (X )} = max i=1 αi Ui ZUi ; α ∈ P p , U ∈ U = max{λ1 (U1:p ZU1: p ); U ∈ U }.
85
(2.39)
We have proved the following: ⎧ ⎨ The directional derivative of the greatest eigenvalue of X in direction Z is the greatest eigenvalue ⎩ of the restriction of Z (seen as a quadratic form) to E 1 .
(2.40)
2.2.3 Logarithmic Penalty 2.2.3.1
Logarithmic Barrier Function
Set R++ := (0, ∞), R−− := −R++ . The function f (λ) := −
n
log λi if λi > 0, i = 1, . . . , n, +∞ otherwise,
(2.41)
i=1
is l.s.c. convex, and differentiable over its domain Rn++ . The associated matrix function, called the logarithmic barrier of the cone S+n , is F(X ) := − log det X if X 0, +∞ otherwise.
(2.42)
By the above theory, its derivative is the opposite of the inverse of X : D F(X ) = U diag(−λ(X )−1 ))U = −X −1 ,
(2.43)
where here still the inversion of the vector is computed componentwise. Since ∗ ∗ ∗ the conjugate of f (t) = − log t (with domain R ++ ) is f (t ) = −1 − log(−t ) n n ), the conjugate of f (x) = − i=1 log xi (with domain R++ ) (with domain R−− n log(−xi∗ ) (with domain Rn−− ), and the conjugate function is f ∗ (x ∗ ) = −n − i=1 of F is (2.44) F ∗ (Y ∗ ) = −n − log det(−Y ∗ ), whose domain is the set of negative definite symmetric matrices of size n.
2.2.3.2
Central Trajectory
The logarithmic barrier allows the extension to linear SDP problems of the interior point algorithms for linear programming. Here we just give a brief discussion of
86
2 Semidefinite and Semi-infinite Programming
the penalized problem. With problem (S D P) of Sect. 2.1.2.1 we associate one with logarithmic penalty, where μ > 0 is the penalty parameter, setting A(x) := A0 + n x A : i=1 i i (S D Pμ ) Minn c · x − μ log det(A(x)). x∈R
We apply Fenchel’s duality (Chap. 1, Sect. 1.2.1.8), taking into account that (i) the conjugate of x → c · x is the indicatrix of {c}, (ii) F ∗ (μ−1 Y ) = F ∗ (Y ) + n log μ, (iii) the conjugate of F1 (A) := μF(A0 + A) is F1∗ (Y ) = supY ∈S n Y, A F − μF(A0 + A) = supY ∈S n −Y, A0 F + μ(μ−1 Y, (A + A0 ) F − F(A0 + A)) = −Y, A0 F + μF ∗ (μ−1 Y ) = −nμ(1 − log μ) − Y, A0 F − μ log det(−Y ).
(2.45)
The dual problem is therefore Max
n Y ∈S−−
nμ(1 − log μ) + A0 , Y F + μ log det(−Y ); ci = −Ai , Y F , i = 1, . . . , n. (DS D Pμ )
It is usually written in terms of S = −Y as
Max nμ(1 − log μ) − A0 , S F + μ log det S; ci = Ai , S F , i = 1, . . . , n.
n S∈S ++
(DS D Pμ )
The optimality condition can be written as n
n
c·x+ xi Ai , Y F I{ci } (−Ai , Y F ) + i=1 i=1 + μ − log det(A(x)) − n − log det(−μ−1 Y ) − A(x), μ−1 Y F = 0.
(2.46)
Each row corresponds to an equality in the Fenchel–Young inequality for f (x) := c · x and F resp., and by (2.43), the above display is equivalent to, using the variable S rather than Y and denoting by Id the identity matrix: S A(x) = μId ; S 0; A(x) 0; ci = Ai , S F , i = 1, . . . , n.
(2.47)
One may prefer to rewrite the first relation in a symmetrized form (which is equivalent since S A(x) = μId implies that S and A(x) commute): S A(x) + A(x)S = μId ; S 0; A(x) 0; ci = Ai , S F , i = 1, . . . , n. (2.48) See [125] for how to solve this system by efficient algorithms.
2.3 SDP Relaxations of Nonconvex Problems
87
2.3 SDP Relaxations of Nonconvex Problems 2.3.1 Relaxation of Quadratic Problems In this section we study a problem with quadratic criterion and constraints: Minn f 0 (x); x∈R
with f i (x) =
f i (x) ≤ 0, i = 1, . . . , p,
1 i x A x + bi · x + ci , i = 0, . . . , p, 2
(QC P)
(2.49)
where the Ai , bi and ci are given in S n , Rn and R respectively; we can assume that c0 = 0. We already discussed this problem in the case when the Ai are positive semidefinite; here we make no such hypothesis, so that problem (QC P) is in general non-convex. We can write it in the form Min
x ∈ Rn X ∈Sn
1 0 1 A , X F + b0 · x; Ai , X F + bi · x + ci ≤ 0, i = 1, . . . , p; X = x x . 2 2 (QC P )
We will call the SDP relaxation of problem (QC P) the variant of the formulation (QC P ) in which we relax X = x x in X x x . By Example 2.4, an equivalent formulation of the relaxed problem is Min
x ∈ Rn X ∈Sn
1 0 A , X F + b0 · x; 21 Ai , X F + bi · x + ci ≤ 0, i = 1, . . . , p; 2 (R QC P) 1x 0. x X
The SDP relaxation is therefore a linear SDP problem. Remark 2.16 It may happen that bi = 0, for i = 0 to p. It is then optimal to choose x = 0 in the SDP relaxation, and the SDP constraint reduces to X 0. Proposition 2.17 We have that val(R QC P) ≤ val(QC P). If in addition the matrices Ai , i = 0 to p, are positive semidefinite (in other words if the criterion and the constraints are convex), then val(R QC P) = val(QC P). Proof Since problem (R QC P) has the same criterion as (QC P ), and a larger feasible set, we have that val(R QC P) ≤ val(QC P). If the matrices Ai , i = 0 to p, are positive semidefinite, let (x, X ) ∈ F(R QC P). Define X := x x and ϕ(x, X ) := 1 A0 , X F + b0 · x. Since X X , by Fejer’s theorem, Ai , X F ≤ Ai , X F , so 2 that (x, X ) ∈ F(R QC P), and ϕ(x, X ) ≤ ϕ(x, X ); so val(QC P ) ≤ val(R QC P) and the result follows.
88
2 Semidefinite and Semi-infinite Programming
In the sequel we will show that SDP relaxation is strongly related to classical duality, whose discussion needs a generalization of the Schur lemma 2.3. We introduce the pseudo inverse of A ∈ S n , A† :=
n
λi† xi xi
(2.50)
i=1
where xi is an orthonormal basis of eigenvectors of A, λi their associated eigenvalues, and (2.51) λi† = λi−1 if λi = 0, and 0 otherwise. We leave the (easy) proof of the next lemma as an exercise. A B , with A and C symLemma 2.18 (Generalized Schur lemma) Let A = B C metric. Then A 0
⇔ {A 0, C B A† B, and Im(B) ⊂ Im(A)}.
The Lagrangian of problem (QC P) is L(x, λ) =
1 x A(λ)x + b(λ) · x + c(λ), 2
where λ ∈ R p and (setting λ0 = 1) A(λ) =
p
λi A ; b(λ) = i
i=0
p
λi b ; c(λ) = i
p
i=0
λi ci .
i=1
We will denote the dual criterion by q(λ) := inf x L(x, λ). The dual problem is therefore: (D QC P) Maxp q(λ); λ ≥ 0. λ∈R
Lemma 2.19 (i) The dual criterion can be expressed as q(λ) =
c(λ) − 21 b(λ) A(λ)† b(λ) if A(λ) 0 and b(λ) ∈ Im(A(λ)), −∞ otherwise.
(ii) The dual problem is equivalent to the following SDP problem (in the sense that it has the same value, and their solutions have the same components for λ) Max w;
λ≥0,w∈R
c(λ) − w 21 b(λ) 1 b(λ) 2
1 2
A(λ)
0.
(D QC P )
2.3 SDP Relaxations of Nonconvex Problems
89
Proof Point (i) is an elementary computation, and point (ii) is an immediate application of the generalized Schur lemma. Since (D QC P ) is an SDP linear problem, we know how to compute its dual problem; it is convenient to call the latter the bidual of (QC P). Let us write problem α x the multiplier, an element of S n+1 , in the form . The Lagrangian of problem x X (D QC P ) can be expressed as 1 L (λ, w, α, x, X ) := w + α(c(λ) − w) + b(λ) · x + A(λ), X F . 2 Define
αx n n 0 . C := (α, x, X ) ∈ R × R × S ; x X
We can rewrite (D QC P ) in the form Max
inf
λ≥0,w (α,x,X )∈C
L (λ, w, α, x, X ).
The bidual problem is therefore Min
sup L (λ, w, α, x, X ).
(α,x,X )∈C λ≥0,w
(B QC P)
We get L (λ, w, α, x, X ) = (1 − α)w +
p
λi
i=0
1 i i A , X F + b · x + ci . 2
By an elementary computation, we obtain the Lemma 2.20 The bidual problem coincides with (R QC P). The qualification hypothesis for problem (D QC P ) is equivalent to the existence of λ ∈ R p and w ∈ R such that c(λ) − w 21 b(λ) λi > 0, i = 1, . . . , p;
0. (2.52) 1 b(λ) 21 A(λ) 2 It is easy to see that one obtains an equivalent condition by writing λi ≥ 0 in lieu of λi > 0; using the Schur lemma, we see that (2.52) is equivalent to p
There exists a λ ∈ R+ ;
A(λ) 0.
(2.53)
This hypothesis is for example satisfied if A0 0 (take λ = 0). The following theorem sums up the main results of the section.
90
2 Semidefinite and Semi-infinite Programming
Theorem 2.21 (i) We have the relations val(D QC P) ≤ val(R QC P) ≤ val(QC P).
(2.54)
(ii) If the criterion and constraints of (QC P) are convex, val(R QC P) = val(QC P). (iii) If problem (D QC P ) satisfies the qualification hypothesis (2.53), then (D QC P) and (R QC P) have the same value, i.e., the SDP relaxation has the same value as the classical dual. SDP relaxation therefore does as well as classical duality and, in many cases, both have the same value.
2.3.2 Relaxation of Integer Constraints Consider a variant of the previous problem, where the f i are still defined by (2.49), with an additional integrity constraint: Minn f 0 (x); x∈R
f i (x) ≤ 0, i = 1, . . . , p; x ∈ E := {−1, 1}n .
(QC P I )
Remark 2.22 One easily reduces the more usual constraint x ∈ {0, 1}n to the above integrity constraint. Observe that when x ∈ E, X = x x has all diagonal elements equal to 1. This leads to the SDP relaxation ⎧ Min 21 A0 , X F + b0 · x; ⎪ ⎪ ⎪ x ∈ Rn ⎪ n ⎪ ⎪ ⎨ X ∈S 1 Ai , X F + bi · x + ci ≤ 0, i = 1, . . . , p; (R QC P I ) 2 ⎪ ⎪ ⎪ ⎪ 1x ⎪ ⎪ 0; X ii = 1, i = 1, . . . , n. ⎩ x X Note the obvious extension of Remark 2.16 to the present framework. Remark 2.23 In the case of a linear programming problem with the above integrity constraints, we have that all Ai are equal to 0, and hence, the formulation of the relaxed problem reduces to ⎧ ⎪ Min b0 · x; bi · x + ci ≤ 0, i = 1, . . . , p; ⎪ ⎨ x ∈ Rnn X ∈S 1x ⎪ ⎪ 0; X ii = 1, i = 1, . . . , n. ⎩ x X
(2.55)
2.4 Second-Order Cone Constraints
91
2.4 Second-Order Cone Constraints 2.4.1 Examples of SOC Reformulations Given a nonzero, nonnegative integer m, we choose to denote by s = (s0 , . . . , sm ) the elements of Rm+1 , and we set s¯ := (s1 , . . . , sm ) . The second-order cone (SOC), or Lorenz cone, is defined as Q m+1 := {s ∈ Rm+1 ; s0 ≥ |¯s |}.
(2.56)
The associated order relation is, given x and y in Rm+1 , x Q m+1 y if x − y ∈ Q m+1 . We will see how to rewrite various relations in the form p
Ax + b ∈ R− × Q m 1 +1 × · · · × Q m q +1 .
(2.57)
We then speak of a linear SOC reformulation. Exercise 2.24 Let w ∈ Rn , α and β scalars. Check that
α ≥ 0, β ≥ 0, |w| ≤ αβ 2
⇔
2w . α + β ≥ α−β
(2.58)
Exercise 2.25 Given ai , i = 1 to p, and c j , j = 1 to q in Rn , and given b ∈ R p and d ∈ Rq , consider the problem Minn x∈R
p
1/(ai · x + bi );
i=1
ai · x + bi > 0, i = 1, . . . , p,
(2.59)
ci · x + di ≥ 0, i = 1, . . . , q.
Check that an equivalent formulation is Min n
x∈R ,t∈R p
p
ti
i=1
ti (ai · x + bi ) ≥ 1, i = 1, . . . , p, ti ≥ 0, ai · x + bi ≥ 0, i = 1, . . . , p, ci · x + di ≥ 0, i = 1, . . . , q.
(2.60)
Obtain a linear SOC reformulation, by applying Example 2.24. Exercise 2.26 Given ai , i = 1 to p in Rn , and b ∈ R p with positive coordinates, consider the problem of uniform approximation, in the logarithmic scale: Minn max |log(ai · x) − log(bi )| x∈R
(2.61)
92
2 Semidefinite and Semi-infinite Programming
with the implicit constraint ai · x > 0 for all i. Show that a reformulation of this problem is 1 ai · x ≤ t. (2.62) Min t; ≤ x∈Rn ,t∈R t bi Apply to the inequalities on the left, rewritten in the form tai · x ≥ bi , the result of Exercise 2.24, and conclude that we have a linear SOC reformulation of this problem. We next discuss some more elaborate examples. Example 2.27 Let be a positive integer. Let us show that we can rewrite “linearly” the relation x ∈ R2+ ; t ∈ R; t ≤ (x1 x2 . . . x2 )1/2 . (2.63) For = 1 this boils down to t ≤ τ; 0 ≤ τ ≤
√
x1 x2 ,
(2.64)
and the last inequality can be rewritten as τ 2 ≤ x1 x2 ; we conclude by applying Exercise 2.24. For = 2 we introduce y ∈ R2 and rewrite (2.63) in the form x ∈ R4+ ;
y ∈ R2+ ; t ∈ R; t ≤ τ ; 0 ≤ τ ≤
√
y1 y2 ;
y1 ≤
√
x1 x2 ;
y2 ≤
√
x3 x4 ;
(2.65)
which itself can be rewritten as x ≥ 0;
y ≥ 0; τ ≥ 0; t ≤ τ ; τ 2 ≤ y1 y2 ;
y12 ≤ x1 x2 ;
y22 ≤ x3 x4 . (2.66)
We again apply Exercise 2.24. We leave to the reader the generalization to arbitrary , and check that one obtains O(2 ) “linear” relations in R3 . Example 2.28 Consider the relations x ∈ Rn+ ; t ∈ R+ ; t ≤ x1π1 x2π2 . . . xnπn . We assume that πi = pi / p, with pi a positive integer and p an integer, p ≥ Let be such that 2 ≥ p. Consider the relation 1/2 0 ≤ t ≤ x1 x2 . . . x2
(2.67) i
pi .
(2.68)
where the xi are replaced by x1 for the first p1 indexes, x2 for the following p2 indexes, until xn , and then by t for the following 2 − p indexes, and finally by 1 for the p − i pi remaining indexes. Computing the power 2 of both sides of (2.68), we get p p t 2 ≤ x1 1 x2 2 . . . xnpn t 2 − p . (2.69)
2.4 Second-Order Cone Constraints
93
Simplifying by t 2 and taking the pth root, we see that (2.68) is equivalent to (2.67); using Example 2.27, it follows that (2.67) has a linear SOC reformulation. Note that, in particular, the geometric mean can be SOC linearly rewritten.
2.4.2 Linear SOC Duality Consider the following SOC linear problem, in which A j is an (m j + 1) × n matrix and b j ∈ Rm j +1 , for j = 1, . . . , J : Minn c · x ; A j x − b j Q m j +1 0, j = 1, . . . , J. x∈R
(L S OC P)
In order to compute the dual problem, we introduce the operator Rm+1 → Rm+1 , y → y˜ := (y0 , − y¯ ), that leaves Q m+1 invariant. Lemma 2.29 (i) The second-order cone Q m+1 is selfdual (equal to its positive polar cone). (ii) In addition, when x and y are two nonzero elements of Q m+1 , we have ¯ and y ∈ R+ x. ˜ x · y = 0 iff x0 = |x| Proof Let x and y belong to Q m+1 . If x0 = 0, then x is zero and x · y = 0, and the same for y. Assume now that x0 and y0 are positive. Then |x| ¯ | y¯ | . ¯ y¯ | = x0 y0 1 − x · y = x0 y0 + x¯ · y¯ ≥ x0 y0 − |x|| x0 y0
(2.70)
By definition of Q m+1 , the above fractions have values in [0, 1], whence x · y ≥ 0, ¯ y0 = | y¯ | and which proves that Q m+1 ⊂ Q + m+1 . In addition x · y = 0 iff x 0 = | x|, ¯ whence (ii). x¯ · y¯ = −|x|| ¯ y¯ |. Since x¯ = 0, the last relation is equivalent to y¯ ∈ R− x, + ¯ = 0, let x ∈ Q m+1 be It remains to show that Q + m+1 ⊂ Q m+1 . Let y ∈ Q m+1 . If y such that x0 = 1. We get that 0 ≤ x · y = y0 , so y ∈ Q m+1 . If on the contrary y¯ = 0, set z := (| y¯ |, − y¯ ). Then z ∈ Q m+1 , and so 0 ≤ y · z = y0 | y¯ | − | y¯ |2 = | y¯ |(y0 − | y¯ |), implying y0 ≥ | y¯ |, as was to be shown. The dual of (LSOCP) (which again is a particular case of conical linear optimization) can therefore be expressed as Max
J y∈Πi=1 Q m j +1
J j=1
bj · y j;
J (A j ) y j = c.
(LSOCP∗ )
j=1
We deduce the optimality conditions: Primal and dual feasibility and complementarity, the latter being obtained for each j:
94
2 Semidefinite and Semi-infinite Programming
A j x − b j ∈ Q m j +1 , y j ∈ Q m j +1 , (A j x − b j ) · y j = 0, j = 1, . . . , J ;
J
(A j ) y j = c.
j=1
(2.71)
2.4.3 SDP Representation Let us show how to represent a linear SOC constraint as a linear SDP constraint. Given s ∈ Q m+1 , we define the “arrow” mapping: Rm+1 → S m+1 , (we recall that S m+1 is the space of symmetric matrices of size m + 1): s0 s¯ . Arw(s) := s¯ s0 Im
(2.72)
Lemma 2.30 We have s ∈ Q m+1 iff Arw(s) 0. Proof If s0 < 0, then s ∈ / Q m+1 , and Arw(s) cannot be positive semidefinite. If s0 > 0, by application of the Schur lemma (eliminating the last block) Arw(s) 0 iff s0 − |¯s |2 /s0 ≥ 0, and so s ∈ Q m+1 iff Arw(s) 0. Finally, if s0 = 0, we know that a symmetric matrix with diagonal zero is positive semidefinite iff it is equal to 0, and so Arw(s) 0 iff s = 0, whence the conclusion. We can therefore rewrite an SOC linear problem as an SDP linear problem. We will compare the dual solutions. The primal formulation can be expressed as Minn c · x; Arw(A j x − b j ) 0, j = 1, . . . , J.
(L S D P)
x∈R
We define s j := A j x − b j , j = 1 to J . Partitioning the symmetric matrices of S m+1 (with index from 0 to n) in the form Y =
Y00 Y¯0
Y¯0 , Y¯
(2.73)
we obtain that Arw(s) · Y = s0 trace(Y ) + 2¯s · Y¯0 , and so Arw : S m+1 → Rm+1 can be expressed as trace(Y ) . (2.74) Arw Y := 2Y¯0 The dual formulation of (L S D P) hence has the expression Max
m j +1 J Y ∈Πi=1 S+
J j=1
b0 trace(Y j ) + 2b¯ · Y¯0 ; j
j
J j=1
(A j )
trace(Y j ) j 2Y¯0
= c. (LSDP∗ )
2.4 Second-Order Cone Constraints
95
Proposition 2.31 (i) The dual problems (LSOCP∗ ) and (LSDP∗ ) have the same value. (ii) The feasible set of (LSOCP∗ ) is the image under the mapping Arw of the feasible set of (LSDP∗ ). Proof It suffices to check point (ii), which, in view of the dual costs, implies point (i). Let us show that Arw S m+1 ⊂ Q m+1 . Indeed, if s ∈ Q m+1 and Y ∈ S m+1 , we have by Fejer’s theorem 2.1 s Arw Y = Arw(s), Y F ≥ 0
(2.75)
and we conclude by Lemma 2.29. On the other hand, Arw is injective; its transpose operator is therefore surjective. We conclude by identifying the feasible points of j (LSOCP∗ ) with the elements of the form (trace(Y j ), 2Y¯0 ), where Y j ∈ S m j +1 . Note that no qualification hypothesis was made, so that the primal and dual values can be different. In order to obtain an expression for the solutions of (LSDP∗ ) as a function of those of (LSOCP∗ ), we must, given y ∈ Q m+1 , express the set Arw− (y) = {Y ∈ S m+1 ; Arw Y = y}.
(2.76)
We will only discuss the most interesting case when y0 = | y¯ | > 0. Lemma 2.32 Let y ∈ Q m+1 such that y0 = | y¯ | > 0. Then Arw− (y) reduces to the single element 1 y0 ( y¯ ) . (2.77) Y (y) = 2 y¯ y¯ y¯ /y0 Proof We have that Arw Y (y) = y, and by the Schur lemma 2.3, Y (y) 0. Let now Y ∈ Arw− (y). Since Y¯0 = 21 y¯ , and Y00 cannot be zero, the Schur lemma implies Y¯ = 14 y¯ y¯ /Y00 + M, with M 0. Therefore, y0 = trace(Y ) = Y00 + trace(Y¯ ) = Y00 + 41 y02 /Y00 + trace(M) 2 = y0 + Y00 − 21 y0 /Y00 + trace(M).
(2.78)
This implies that Y00 = 21 y0 and trace(M) = 0, whence M = 0, as was to be proved.
2.5 Semi-infinite Programming 2.5.1 Framework In this section we study linear semi-infinite programming problems of the following type
96
2 Semidefinite and Semi-infinite Programming
Minn c · x; aω · x ≤ bω , ω ∈ Ω. x∈R
(S I L)
Here c ∈ Rn , Ω is a compact metric space, and for each ω ∈ Ω, aω ∈ Rn , bω ∈ R, and the mapping Ω → Rn+1 , ω → (aω , bω ) is continuous. We denote by Y = C(Ω) the space of continuous functions over Ω. Endowed with the norm y := max{|yω |; ω ∈ Ω}, this is a Banach space. One defines the contact set of x¯ ∈ F(S I L) as (2.79) Ω(x) ¯ := {ω ∈ Ω; aω x¯ = bω }, and the qualification (Slater) condition by ¯ There exists an xˆ ∈ Rn such that aω xˆ < bω , for all ω ∈ Ω(x).
(2.80)
By the compactness of Ω and the continuity of the application ω → (aω , bω ), this hypothesis implies the existence of an ε > 0 such that aω · xˆ − bω ≤ −ε,
for all ω ∈ Ω.
(2.81)
Finally, the linearized problem at point x¯ is: Minn c · h; aω h ≤ 0, ω ∈ Ω(x). ¯ h∈R
(L x¯ )
The following lemma allows us to reduce the study of first-order optimality conditions to those of a homogeneous problem. Lemma 2.33 (i) If x¯ ∈ F(S I L) is such that h = 0 is a solution of the linearized problem, then x¯ ∈ S(S I L). (ii) If the qualification condition (2.80) holds, and x¯ ∈ F(S I L), then x¯ ∈ S(S I L) iff h = 0 is a solution of the linearized problem. Proof (i) If, on the contrary, x¯ ∈ / S(S I L), then there exists an x˜ ∈ F(S I L) such that c · x˜ < c · x, ¯ and then h := x˜ − x¯ is feasible for the linearized problem, and c · h < 0, so that 0 is not a solution of the linearized problem. (ii) In view of step (i), it suffices to prove that, if x¯ ∈ S(S I L), then h = 0 is a solution of the linearized problem. Assume on the contrary that c · h < 0, for some ¯ satisfies c · h ∈ F(L x¯ ). Let ε > 0 small enough be such that h ε := h + ε(xˆ − x) h ε < 0. Set x(t) := x¯ + th ε . Let us show that, for t > 0 small enough, we have x(t) ∈ F(S I P). If this is not the case, there exists a sequence tk ↓ 0 and ωk ∈ Ω such that aωk x(tk ) > bωk . Extracting a subsequence if necessary, we can assume that ¯ Passing to the limit in the previous inequality, we get ω¯ ∈ Ω(x), ¯ and so ωk → ω. ¯ ε ≤ εaω¯ (xˆ − x) ¯ < 0. For (x, ω) in (x, ¯ ω), ¯ we there exists an α > 0 such that aω¯ (x)h have therefore aω h ε < −0, and so, if k is large enough: aωk x(t) = aωk x¯ + tk aωk h ε < bωk , which gives the desired contradiction.
(2.82)
2.5 Semi-infinite Programming
97
Since x(t) ∈ S(S I L), we have 0 ≤ limt↓0 t −1 c · (x(t) − x)) ¯ = c · h ε . Passing to the limit over ε, we get c · h ≥ 0, for all h ∈ F(L x¯ ), implying val(L x¯ ) ≥ 0. Since 0 ∈ F(L x¯ ), point (i) follows.
2.5.2 Multipliers with Finite Support We know that the topological dual of C(Ω) is the space M(Ω) of (signed) finite, Borel measures over Ω, (see Malliavin [77, Chap. 2]), or in short, measures. We will rather show how to obtain in a “direct way” the existence of Lagrange multipliers as measures with finite support, i.e., linear combinations of finitely many Dirac measures, in the form λ, y = ω∈supp(λ) λω yω . Here the set supp(λ) is a finite subset of Ω, called the support of λ, and such that λω = 0, for all ω ∈ supp(λ). Denote by M(Ω)+ the cone of positive measures. If λ is a measure with finite support, we call {λω , ω ∈ supp(λ)} the components of λ, and we will say that λ is positive if its components are. We will denote by M F (Ω) p the set of finite measures over Ω, by M F (Ω) the set of finite measures of support of p cardinality at most p, and by M F (Ω)+ , M F (Ω)+ the corresponding positive cones. One defines the dual problem “with finite support”, or “finite dual”, as Max
λ∈M F (Ω)+
−bω λω ; c +
ω∈supp(λ)
λω aω = 0.
(F S I D)
ω∈supp(λ)
Let us first state a weak duality result: Proposition 2.34 (i) We have val(F S I D) ≤ val(S I D). (ii) Let λ ∈ F(F S I D) and x ∈ F(S I L). If val(F S I D) = val(S I D), then λ ∈ S(F S I D) and x ∈ S(S I L) implies the complementarity condition aω · x = bω ,
for all ω ∈ supp(λ).
(2.83)
(iii) Conversely, if λ ∈ F(F S I D) and x ∈ F(S I L) satisfy (2.83), then (F S I D) and (S I L) have the same value, λ ∈ S(F S I D), and x ∈ S(S I L). Proof Let λ ∈ F(F S I D) and x ∈ F(S I L). Then c·x ≥c·x+
ω∈supp(λ)
λω (aω · x − bω ) =
−bω λω .
ω∈supp(λ)
Taking the infimum over x ∈ F(S I L) and the supremum over λ ∈ F(F S I D), we obtain (i). In addition, if the primal and dual values are equal, this relation implies that x ∈ S(S I L) and λ ∈ S(F S I D) iff the inequality is in fact an equality, whence (ii) and by the same type of argument (iii).
98
2 Semidefinite and Semi-infinite Programming
Let us now state the main result of the section. We will say that E ⊂ M F (Ω)+ is bounded if there exists an α > 0 such that ω∈supp(λ) λω ≤ α, for all λ ∈ E. Theorem 2.35 Let the Slater hypothesis (2.80) hold, and val(S I L) be finite. Then val(S I L) = val(F S I D), and S(F S I D) is nonempty and bounded. In addition, (F S I D) has at least one solution with support of cardinality at most n. The proof is based on the next lemmas, which have their own interest. Remark 2.36 Applying the duality theory of Chap. 1, we obtain the existence of Lagrange multipliers in M(ω)+ . Then the Krein–Milman theorem [65] allows us to obtain the existence of multipliers with finite support. On the other hand, our approach uses only elementary computations. Let the convex cone generated by {aω ; ω ∈ Ω} be denoted by C :=
⎧ ⎨ ⎩
λω aω ; λ ∈ M F (Ω)+
ω∈supp(λ)
⎫ ⎬ ⎭
∪ {0}.
(2.84)
Lemma 2.37 The cone C is generated by the nonnegative linear combinations of at most n terms of aω . In other words,
For all y ∈ C \{0}, there exists a λ ∈ M Fn (Ω)+ such that y =
λω a ω .
ω∈supp(λ)
(2.85)
Proof Let y ∈ C . There exists a λ ∈ M F (Ω)+ such that y = ω∈supp(λ) λω aω . Choose such a λ with support of minimal cardinality, say p. Let us obtain a contradiction if p > n. Then there exists a μ ∈ R p , μ = 0, such that ω∈supp(λ) μω aω = 0. Changing μ into −μ if necessary, we can assume that min μω < 0. Let t > 0 be the smallest positive value such that λω + tμω ≥ 0, for all ω ∈ supp(λ). Then y = ω∈supp(λ) (λω + tμω ) aω , and the support of λ + tμ is strictly included in that of λ. This gives the desired contradiction. Lemma 2.38 Let the Slater hypothesis (2.80) hold. If the dual problem (F S I D) is feasible, then it has a nonempty and bounded solution set. Proof Let λ ∈ F(F S I D). Using (2.81), we get for some ε > 0: −c · xˆ =
λω aω · xˆ ≤
ω∈supp(λ)
and so ε
ω∈supp(λ)
λω ≤ c · xˆ +
λω (bω − ε),
ω∈supp(λ)
ω∈supp(λ)
λω bω .
2.5 Semi-infinite Programming
99
If λ is an ε solution of (F S I D), with ε > 0 (this exists since, the primal being feasible, the dual value is finite), let −
λω bω ≥ val(F S I D) − ε ,
(2.86)
ω∈supp(λ)
then we obtain the estimate ω∈supp(λ) λω = O(1). As a consequence, a maximizing sequence {λk } of problem (F S I D) is bounded. In addition, by Lemma 2.37, we can w.l.o.g. assume that the cardinality of the support of elements of this sequence is equal to at most n. Extracting a subsequence if necessary, we can assume that this cardinal p ≤ n is constant along the sequence. So, let {ω1k , . . . , ωkp } denote the support of λk . Extracting again a subsequence, we can assume that the points in the supports converge to (ω¯ 1 , . . . , ω¯ p ) (some of these limits could coincide) and that λik → λ¯ i . We deduce that λ¯ ∈ S(F S I D), with support of cardinality at most n. Lemma 2.39 Let the Slater hypothesis (2.80) hold. If (S I L) has a solution, then it has the same value as (F S I D), and S(F S I D) is nonempty and bounded. Proof (a) Let x¯ ∈ S(S I L). By Lemma 2.33, h = 0 is a solution of the linearized problem (L x¯ ). Set C (x) ¯ :=
⎧ ⎨ ⎩
⎫ ⎬
λω aω ; λ ∈ M F (Ω)+ ; supp(λ) ⊂ Ω(x) ¯
⎭
ω∈supp(λ)
∪ {0}.
(2.87)
The argument of the proof of Lemma 2.37 tells us that ∀ y ∈ C (x)\{0}; ¯ ∃λ ∈ M Fn (Ω)+ ; supp(λ) ⊂ Ω(x); ¯ y=
λω a ω .
ω∈supp(λ)
(2.88) (b) Let us show that C (x) ¯ is closed. Indeed, let y ∈ C (x). ¯ Computing the scalar product of y by xˆ − x, ¯ with xˆ given by the Slater condition, we get y · (xˆ − x) ¯ =
ω∈supp(λ)
λω aω · (xˆ − x) ¯ =
ω∈supp(λ)
λω aω · xˆ − bω ≤ −ε
λω ,
ω∈supp(λ)
¯ converges to which shows that ω∈supp(λ) λω = O(y). If the sequence y k of C (x) k y¯ , the associated sequence λ (for which, by Lemma 2.37, we can assume that the support is of cardinality at most n) is therefore bounded, and hence, we can pass to the limit, whence the closedness of C (x). ¯ (c) Let us show that −c ∈ C (x). ¯ Since C (x) ¯ is convex and closed, if this is not the case, we can strictly separate −c and C (x). ¯ So, there exist h ∈ Rn and α ∈ R such that −c · h > α and y · h ≤ α, for all y ∈ C (x). ¯ Taking y = 0, we get c · h < 0, and ¯ which gives the desired contradiction to Lemma also aω · h ≤ 0, for all ω ∈ Ω(x), 2.33.
100
2 Semidefinite and Semi-infinite Programming
(d) Since −c ∈ C (x), ¯ there exists a λ ∈ F(F S I D) with support in C (x). ¯ By proposition 2.34(iii), λ ∈ S(F S I D), and problems (S I L) and (F S I D) have the same value. Finally, S(F S I D) is nonempty and bounded in view of Lemma 2.38. We now relax the hypothesis of existence of a solution to the primal problem. Lemma 2.40 Let val(S I L) be finite, and the Slater hypothesis (2.80) hold. Then (S I L) and (F S I D) have the same value, and S(F S I D) has at least an element of cardinality at most n. Proof (a) We apply Lemma 2.39 to the perturbed problem Minn c · x + γ x∈R
n
|xi |; aω · x ≤ bω , ω ∈ Ω,
(S I L γ )
i=1
where γ > 0. We first show that this problem, whose value is finite, has a solution. Given ε > 0, let x be an ε solution of (S I L γ ). We then have val(S I L) + γ
n
|xi | ≤ c · x + γ
i=1
and so γ
n
n
|xi | ≤ val(S I L γ ) + ε,
(2.89)
i=1
|xi | ≤ val(S I L γ ) + ε − val(S I L).
i=1
A minimizing sequence of (S I L γ ) is therefore bounded. Passing to the limit, we deduce, for all γ > 0, the existence of x γ ∈ S(S I L γ ). (b) Problem (S I L γ ) can be rewritten as a linear, semi-infinite optimization problem: c·x +γ Min n x∈R
z∈Rn
n
z i ; ±xi ≤ z i , i = 1, . . . , n; aω · x ≤ bω , ω ∈ Ω. (S I L γ )
i=1
ˆ zˆ ) In addition, set zˆ i := 1 + |xˆi |, i = 1 to n (where xˆ satisfies (2.80)). Then (x, satisfies the Slater hypothesis for problem (S I L γ ). Denote by 1 the vector of Rn with components equal to 1. Lemma 2.39 implies the equality of values of (S I L γ ) and of its finite dual, that can be written as −b λ ; c + μ − η + λω aω = 0; μ + η = γ 1. Max ω ω n n μ∈R η∈R
+ + λ∈M F (Ω)+ ω∈supp(λ)
ω∈supp(λ)
(F S I Dγ ) γ γ γ It also ensures that (F S I Dγ ) has a solution (μ , η , λ ). We have in addition, when
γ ↓ 0,
2.5 Semi-infinite Programming
101
γ γ −bω λω → val(S I L); c + λω a ω → 0. ω∈supp(λγ ) ω∈supp(λγ )
(2.90)
Indeed, the first relation follows from the equality val(S I L γ ) = val(F S I Dγ ), and from the equality limγ ↓0 val(S I L γ ) = val(S I L), which can be easily checked. The second is an immediate consequence of the definition of (F S I Dγ ). These relations allow us to show that λγ is bounded; indeed, ⎛ ⎞ o(1) = ⎝c + λγω aω ⎠ · xˆ ≤ c · xˆ + λγω (bω − ε), ω∈supp(λγ )
ω∈supp(λγ )
and so, by the first relation of (2.90), ε
λγω ≤ c · xˆ +
ω∈supp(λγ )
λγω bω + o(1) = O(1).
ω∈supp(λγ )
To obtain λ ∈ S(F S I D), via Proposition 2.34(i) it then suffices to pass to the limit (in a subsequence) in (2.90). Proof (Proof of theorem 2.35) Under the hypotheses of the theorem, Lemma 2.40 ensures the equality val(S I L) = val(F S I D) as well as the existence of an element of S(F S I D) of cardinality at most n. Combining with Lemma 2.38, we obtain that S(F S I D) is bounded.
2.5.3 Chebyshev Approximation Let a and b be two real numbers, with a < b. The problem of the best uniform approximation of a continuous function f over [a, b] by a polynomial of degree n can be written as (AT ) Min max | p(x) − f (x)|; x ∈ [a, b], p∈P n
where Pn denotes the set of polynomials of degree at most n with real coefficients. We denote by I+ ( p) (resp. I− ( p)) the set of points where p(x) − f (x) attains its maximum (resp. minimum), and we set I ( p) := I+ ( p) ∪ I− ( p). We recall that f ∞ := sup{| f (x)|, x ∈ [a, b]}. Lemma 2.41 A polynomial p ∈ Pn is a solution of (AT ) iff there exists no polynomial r ∈ Pn such that ( f (x) − p(x))r (x) < 0,
for all x ∈ I ( p).
(2.91)
102
2 Semidefinite and Semi-infinite Programming
Proof We can rewrite (AT ) as a linear semi-infinite optimization problem: (AT )
Min v; ±( f (x) − p(x)) − v ≤ 0, for all x ∈ [a, b]. v∈R
p∈P n
The cost function and constraints are affine functions of the optimization parameters, and the Slater condition (2.80) is satisfied (take h¯ = (¯v, p) ¯ with v¯ = 1 + f ∞ and p¯ = 0). By Lemma 2.33, (v, p) ∈ S(AT ) iff (w, r ) = 0 is a solution of the linearized problem. The latter can be written as follows: Min w; r (x) ≤ w, x ∈ I+ ( p); −r (x) ≤ w, x ∈ I− ( p). w∈R
(L AT )
r ∈P n
In other words, (v, p) ∈ S(AT ) iff there exists no polynomial r ∈ Pn such that r (x) < 0 when x ∈ I+ ( p) and r (x) > 0 when x ∈ I− ( p). The conclusion follows. Theorem 2.42 (Characterization theorem) A polynomial p ∈ Pn is a solution of (AT ) iff there exist n + 2 points x0 < x1 < · · · < xn+1 in [a, b] such that | p(xi ) − f (xi )| = p − f ∞ , i = 0, . . . , n + 1.
(2.92)
p(xi+1 ) − f (xi+1 ) = −[ p(xi ) − f (xi )], i = 0, . . . , n.
(2.93)
Proof By Lemma 2.41, it suffices to check that (2.92)–(2.93) is satisfied iff (2.91) has no solution. If (2.92)–(2.93) is satisfied, then by (2.91), r changes sign at least n + 1 times, and therefore has at least n + 1 distinct roots, which is impossible. If on the contrary (2.92)–(2.93) is not satisfied, then ( p − f ) changes sign at most n / I ( p) times over I ( p). So there exist m ≤ n and numbers α0 , . . . , αm+1 , with αi ∈ for all i, such that a = α0 < α1 < · · · < αm+1 = b, and ( p − f ) is non-zero and has a constant sign, alternatively +1 and −1, over m (x − αi ). ThereI ( p)∩]αi , αi+1 [ for all i = 0 to m. The same holds for r (x) := Πi=1 fore either r or −r satisfies (2.91). We say that the set of points x0 < x1 < · · · < xn+1 in [a, b] is a reference of the polynomial p if (2.92)–(2.93) is satisfied. Theorem 2.43 (Uniqueness theorem) Problem (AT ) has a unique solution. Proof (a) Existence: the space Pn of polynomials of degree n, whose elements n i = are denoted by p z i=0 z i x , being of finite dimension, the two norm pz P := n i=0 |z i | and pz ∞ are equivalent. A minimizing sequence is therefore bounded. We easily deduce the existence of a solution. (b) Uniqueness. Let p and q be two distinct solutions. Set r := p − q, and let x0 , . . . , xn+1 be a reference of p. Relations f − p∞ = f − q∞ and
2.5 Semi-infinite Programming
r (xi ) = ( p(xi ) − f (xi )) − (q(xi ) − f (xi )), i = 0, . . . , n + 1
103
(2.94)
imply that either r (xi ) is equal to zero, or it has the sign of p(xi ) − f (xi ), for all i. Set I := {i : r (xi ) = 0, i = 0, . . . , n + 1}; J := {i : r (xi ) = 0, i = 0, . . . , n + 1}. Since r is a polynomial of degree n, it suffices to check that it has at least n + 1 zeros, the latter being counted with their order of multiplicity. If J is empty, and so r changes sign at least n + 1 times, this holds; we get the same conclusion if J contains no other points than 0 and n + 1. Otherwise, let i ∈ J be different from 0 and n + 1, and s = ±1 be the sign of p(xi ) − f (xi ). Set α := max{sr (x); xi−1 ≤ x ≤ xi+1 }. Then α ≥ sr (xi ) ≥ 0, and as r (xi+1 ) and r (xi−1 ) have a sign that is opposite to that of s, the maximum is attained at a point xˆi ∈]xi−1 , xi+1 [. Set xˆi = xi , when i ∈ I , and i = 0 or n + 1. Then r (xˆi ) is of the same sign as p(xi ) − f (xi ), and if r (xˆi ) = 0, then xˆi is a zero of r with multiplicity at least two. Therefore, to each interval ]xˆi , xˆi+1 [ we can associate a zero of r that corresponds either to a change of sign of r over ]xˆi , xˆi+1 [, or to one of the multiple zeros of r at xˆi or xˆi+1 (or to simple zeros at the end points of the interval). We have shown that r has at least n + 1 distinct zeros, as was to be done.
2.5.4 Chebyshev Polynomials and Lagrange Interpolation The previous results allow us to present the theory of Chebyshev polynomials, and their application to Lagrange interpolation. The Chebyshev polynomial of degree n, denoted by Tn , is defined over [−1, 1] by the equality Tn (cos θ ) = cos(nθ ), or equivalently Tn (x) = cos(n cos−1 x). The formula cos((n + 1)θ ) + cos((n − 1)θ ) = 2 cos θ cos(nθ ) implies the induction relation Tn+1 (x) = 2x Tn (x) − Tn−1 (x). In particular, T0 (x) = 1, T1 (x) = x, T2 (x) = 2x 2 − 1. Proposition 2.44 For all n ∈ N, the polynomial ( 21 )n Tn+1 is, among all polynomials of degree n + 1 whose coefficient of x n+1 is 1, the one of minimal uniform norm over [−1, 1]. Proof (a) One easily checks by induction that the coefficient of x n+1 of Tn+1 is 2n . The coefficient of x n+1 of ( 21 )n Tn+1 is therefore 1.
104
2 Semidefinite and Semi-infinite Programming
(b) We want to show that the coefficients of degree 0 to n of ( 21 )n Tn+1 are a solution of the problem n n+1 i (2.95) Min max x − zi x . z 0 ,...,z n x∈[−1,1] i=0
This can be interpreted as the problem of uniform approximation of x n+1 by a polynomial of degree n over [−1, 1]. By Theorems 2.42 and 2.43, there exists a unique solution characterized by (2.92)–(2.93), with here f (x) = x n+1 . Now the uniform norm of ( 21 )n Tn+1 is ( 21 )n , and ( 21 )n Tn+1 (x) is equal to alternatively ±( 21 )n when x = cos(1 − i/(n + 1))π , i = 0, . . . , n + 1. The result follows. Consider now the problem of interpolation of a continuous function f by a polynomial of degree n over an interval [a, b]. The method of Lagrange interpolation consists in the choice of n + 1 distinct points, called interpolation points, x0 , . . . , xn in [a, b], and of the polynomial of degree n equal to f at these n + 1 points: p(xi ) = f (xi ), i = 1, . . . , n.
(2.96)
Given j ∈ {0, . . . , n}, there is a unique polynomial of degree nthat vanishes at all points xi , except at x j where it is equal to one, that is j (x) := i= j (x − xi )/(x j − n xi ) and (2.96) therefore has the unique solution p(x) = i=0 f (xi ) j (x). A naive choice of the interpolation points is to take them with constant increments. This leads to significant errors. We will see that, in some sense, the zeros of the Chebyshev polynomial are the best possible choice. Lemma 2.45 Let f ∈ C n+1 [a, b]. Then the error e(x) = f (x) − p(x) satisfies 1 (x − xi ) f (n+1) (ξ ), (n + 1)! i=0 n
e(x) =
(2.97)
where the point ξ ∈ [a, b] depends on x. Proof (a) If a function of class C 1 over [a, b] vanishes at two distinct points, then by Rolle’s theorem, its derivative has at least a zero between these two points. By induction, we deduce that if g ∈ C n+1 [a, b] vanishes at n + 2 distinct points, then its derivative of order n + 1 has at least a zero in [a, b]. (b) If x ∈ [a, b] is an interpolation point, (2.97) is trivially satisfied. Otherwise, set g(t) = f (t) − p(t) − e(x)
n t − xi . x − xi i=0
(2.98)
Since g is of class C n+1 [a, b] and vanishes at all interpolation points and at x, there exists an ξ ∈ [a, b] such that g (n+1) (ξ ) = 0. Computing g (n+1) (ξ ), the result follows.
2.5 Semi-infinite Programming
105
The previous lemma suggests, in the absence of specific information about f (n+1) , to choose the interpolation points that minimize the uniform norm of the product function n (x − xi ). (2.99) prod(x) := i=0
Proposition 2.46 There exists a unique choice of interpolation points minimizing the uniform norm of the product function, which is xi =
1 [2(n − i) + 1)]π 1 (a + b) + (b − a) cos , i = 0, . . . , n. 2 2 2(n + 1)
(2.100)
Choosing these points, we have that f − p∞ ≤
f (n+1) ∞ i = 0, . . . , n. 2n (n + 1)!
(2.101)
Proof It suffices to check the result when [a, b] = [−1, 1]. We must find the polynomial of degree n + 1, with coefficient of x n+1 equal to 1 and roots in [−1, 1], of minimal uniform norm. By Proposition 2.44, ( 21 )n Tn+1 is the unique solution of this problem, and the interpolation points are the zeros of Tn+1 , whence (2.100). Using Tn+1 ∞ = 1 and (2.97), we obtain (2.101). Corollary 2.47 If f is a polynomial of degree n + 1, its best uniform approximation by a degree n polynomial is the one obtained by taking for reference the points given by (2.100). Proof Since f (n+1) is constant, by Lemma 2.45, the maximal error is proportional to the uniform norm of the product function. By Proposition 2.46, this amount is minimal if the interpolation points are given by (2.100). Remark 2.48 The corollary is useful in the following situation. Let p be a polynomial of degree n + 1, which is a candidate for the approximation of f . We can wonder what would be the quality of the approximation of f by a polynomial of degree n. Taking the polynomial q obtained by using the interpolation points given by (2.100), we have by (2.101) the estimate f − q∞ ≤ f − p∞ +
p (n+1) ∞ . 2n (n + 1)!
(2.102)
106
2 Semidefinite and Semi-infinite Programming
2.6 Nonnegative Polynomials over R 2.6.1 Nonnegative Polynomials We show in this section that the nonnegativity of a polynomial over an interval of R is equivalent to the semidefinite positivity of a matrix whose coefficients are related to those of the polynomial by linear relations. This is of interest since efficient algorithms for solving linear positive semidefinite optimization problems are known. Consider the polynomial n z k ωk (2.103) pz (ω) = k=0
of coefficients (z 0 , . . . , z n ). Let us start by characterizing the non negativity of this polynomial over R. Of course, this implies that n is even. Lemma 2.49 Let n be even. Then the polynomial pz (ω) is nonnegative over R iff there exists a symmetric, positive semidefinite matrix Φ = {Φi j }, 0 ≤ i, j ≤ n/2, such that Φi j , k = 0, . . . , n. (2.104) zk = i+ j=k
Proof If (2.104) holds, set y := (1, ω, . . . , ωn ). Then pz (ω) =
n k=0 i+ j=k
Φi j ω = k
n
Φi j ωi ω j = yΦy ≥ 0,
k=0 i+ j=k
and therefore the polynomial pz is nonnegative. Conversely, assume that pz (ω) ≥ 0 for all ω. Then its real roots have an even multiplicity, otherwise a change of sign would occur. Denote by αi the real roots, of multiplicity 2ri , and by a j ± ib j the conjugate complex roots. Necessarily z n ≥ 0. Let us decompose the polynomial q(ω) := z n1/2 Πi (ω − αi )ri Π j (ω − a j − ib j ) in the form q(ω) = A(ω) + i B(ω), where A and B are polynomials with real coefficients. Then pz (ω) = A(ω)2 + B(ω)2 . We have obtained a decomposition pz (ω) as a sum of two squares of polynomials, of degree at most n/2. Consider a polynomial n/2 of degree at most n/2: k=0 ci ωi . Its square is of the desired form with Φi j = ci c j for all i and j (this rank 1 matrix is positive semidefinite). The same holds for a sum of squares (it suffices to sum the corresponding matrices). Remark 2.50 (i) We have shown that a polynomial is nonnegative over R iff it is the sum of at most two squares of polynomials. (ii) Lemma 2.49 allows us to check the nonnegativity of a polynomial by solving an SDP problem.
2.6 Nonnegative Polynomials over R
107
Example 2.51 Let K be a polyhedron in Rn+1 . Then the problem Min z
n
ci z i ; z ∈ K ;
i=0
n
z k ωk ≥ 0,
for all ω ∈ R,
k=0
is equivalent to the SDP problem Min z,Φ
n
ci z i ; z ∈ K ; z k =
i=0
Φi j , k = 0, . . . , n; Φ 0.
i+ j=k
The previous result allows us to deduce an analogous result in the case of the nonnegativity of a polynomial over R+ . Lemma 2.52 A polynomial pz (ω) of degree n is nonnegative over R+ iff there exists a symmetric, positive semidefinite matrix Φ = {Φi j }, 0 ≤ i, j ≤ n, such that
0 = i+ j=2k−1 Φi j , 1 ≤ k ≤ n, z k = i+ j=2k Φi j , 0 ≤ k ≤ n.
(2.105)
Proof The nonnegativity of pz (ω) over R+ is equivalent to that of the polynomial z 0 + z 1 ω2 + · · · + z n ω2n over R, whence the result by Lemma 2.49.
(2.106)
This parametrization involves a matrix of size 1 + n. We can do better by parametrizing with two matrices of size 1 + 21 n. We first give a preliminary result. Lemma 2.53 Let a, f 1 and f 2 be three functions R → R such that f i (ω) = qi (ω)2 + a(ω)ri (ω)2 , where qi (resp. ri ) are polynomials of degree n i (resp. n i−1 ). Then the function f (ω) = f 1 (ω) f 2 (ω) is of the form q(ω)2 + a(ω)r (ω)2 , where q and r are functions R → R that are polynomial if a(·) is. If in addition a(·) is a polynomial of degree at most 2, we can choose polynomials q and r of degree at most n 1 + n 2 and n 1 + n 2 − 1, respectively. Proof It suffices to use the identity f 1 (ω) f 2 (ω) = (q1 (ω)q2 (ω) + a(ω)r1 (ω)r2 (ω))2 +a(ω)(q1 (ω)r2 (ω) − q2 (ω)r1 (ω))2 .
(2.107)
We denote by x the integer part of x (greatest integer less than or equal to x). Lemma 2.54 A polynomial pz (ω) of degree n is nonnegative over R+ iff it satisfies one of the following two conditions:
108
2 Semidefinite and Semi-infinite Programming
(i) There exist two polynomials q and r of degree at most 21 n and 21 n − 1, respectively, such that (2.108) pz (ω) = q(ω)2 + ωr (ω)2 . (ii) There exist two positive semidefinite matrices Φ and Ψ , of indexes varying from 0 to 1 + 21 n and 0 to 21 (n − 1) respectively, such that
z 0 = Φ00 ; z k =
Φi j +
i+ j=k
Ψi j , k = 1, . . . , n.
(2.109)
i+ j=k−1
Proof (i) It is clear that if pz is of the form (2.108), it is nonnegative over R+ . Conversely, let pz be nonnegative over R+ . We give a proof by induction over n. If n = 0 or 1, the decomposition (2.108) is easily obtained. Let us deal with the case n = 2. If pz has real roots, either there is a double root and (2.108) holds with r = 0, or they are simple and then pz is the product of two affine functions that are nonnegative over R+ ; then (2.108) is a consequence of Lemma 2.53. Finally, in the case of conjugate complex roots, then pz (ω) = a[(ω + β)2 + α], with β ∈ R, α > 0, is enough to discuss the casewhen a = 1. and a > 0 since p is nonnegative over R+ . It Then pz (ω) = (ω − γ )2 + δω, with γ = β 2 + α and δ := 2β + 2 β 2 + α > 0 which gives the desired decomposition. Assume that now the conclusion holds until n − 1, with n > 2. Let us check the existence of α ≥ 0 and β ∈ R such that pz (ω) = (ω + α)q(ω) or pz (ω) = ((ω + β)2 + α)q(ω),
(2.110)
where q is a nonnegative polynomial over R+ , of degree n − 1 in the first case, and n − 2 in the second. Indeed, if pz has a root ω0 over R− , it is of the first form with α = −ω0 . Otherwise, pz has either a positive √ root −β, that necessarily has even multiplicity, or two conjugate roots −β ± i α. In both cases p is of the second form. We have shown that pz is a product of polynomials of the desired form (taking into account the discussion of the case n = 2). We conclude then by Lemma 2.53. (ii) If (2.109) is satisfied, set y := (1, ω, . . . , ωn ). Then pz (ω) =
n k=0
=
⎛ ⎝
Φi j +
i+ j=k
n k=0 i+ j=k
Ψi j ⎠ ωk
i+ j=k−1 n
Φi j ωi ω j + ω
= yΦy + ωyΨ y
⎞
Ψi j ωi ω j
k=0 i+ j=k−1
is nonnegative over R+ . Conversely, let pz be nonnegative over R+ . Then it has a decomposition of the form (2.108). Denote by z 1 and z 2 the coefficients of q and r , resp. Then the matrices Φ = z 1 (z 1 ) and Ψ = z 2 (z 2 ) satisfy (2.109). We can state a similar result in the case of a bounded interval.
2.6 Nonnegative Polynomials over R
109
Lemma 2.55 Let a and b be two real numbers, with a < b. A polynomial pz (ω) of degree n is nonnegative over [a, b] iff it satisfies one of the two following conditions: (i) There exist two polynomials q and r of degree at most 21 n and 21 n − 1 resp. if n is even, and at most 21 (n − 1) if n is odd, such that pz (ω) =
q(ω)2 + (b − ω)(ω − a)r (ω)2 if n is even, (ω − a)q(ω)2 + (b − ω)r (ω)2 otherwise.
(2.111)
(ii) There exist two positive semidefinite matrices Φ and Ψ , with index varying from 0 to 21 n and 21 n − 1 resp. if n is even, and from 0 to 21 (n − 1) if n is odd, such that, if n is even: zk =
Φi j − ab
i+ j=k
Ψi j + (a + b)
i+ j=k
Ψi j −
i+ j=k−1
Ψi j , k = 1, . . . , n.
i+ j=k−2
(2.112) and if n is odd: z k = −a
Φi j +
i+ j=k
i+ j=k−1
Φi j + b
i+ j=k
Ψi j −
Ψi j , k = 1, . . . , n.
i+ j=k−1
(2.113) Proof We follow the scheme of proof of Lemma 2.54(i). (a) We first check that the set of polynomials of the form (2.111) is stable under multiplication. For the product of two even polynomials, this follows from Lemma 2.53, with here a(ω) = (b − ω)(ω − a). For the product of two odd polynomials of the form pi = (ω − a)qi (ω)2 + (b − ω)ri (ω)2 , with i = 1, 2, we obtain with (2.107), omitting the argument ω: b−ω 2 b−ω 2 2 2 q2 + r r p1 p2 = (ω − a) q1 + ω−a 1 ω−a 2 2 b−ω b−ω 2 2 q1 q2 + = (ω − a) r1 r2 + (q1r2 − q2 r1 ) ω−a ω−a = ((ω − a)q1 q2 + (b − ω)r1r2 )2 + (b − ω)(ω − a)(q1r2 − q2 r1 )2 , 2
which is of the form (2.111), since p1 p2 is even. Finally, if p1 = q12 + (b − ω)(ω − a)r12 is even, and p2 = (ω − a)q22 + (b − ω)r22 is odd, (2.114) we get by (2.107)
110
2 Semidefinite and Semi-infinite Programming
q12 b−ω 2 b−ω 2 2 q2 + r r p1 p2 = (ω − a) + (ω − a)2 ω−a 1 ω−a 2 2 2 q1 q1 b−ω b−ω 3 = (ω − a) q2 + r1 r2 + r 2 − q2 r 1 (ω − a) ω−a ω − a (ω − a) = (ω − a)(q1 q2 + (b − ω)r1r2 )2 + (b − ω)(q1r2 − (ω − a)q2 r1 )2 , 3
which is still of the desired form. (b) If pz is of the form (2.111), it is clear that it is nonnegative over [a, b]. Conversely, let pz be nonnegative over [a, b]. We proceed by induction over n. For n = 0 or 1, one easily obtains the decomposition (2.111). Let us check it when n = 2. In that case q and r are of degree at most 1 and 0. If p has a real root, either it is of even multiplicity and we obtain the desired form with δ = 0, or it is outside (a, b) and p is then the product of two factors of the desired form for n = 1; we have checked in point (a) that the product still has the desired form. It remains to deal with the case of conjugate complex roots, i.e., (normalizing the leading coefficient) p1 (ω) = (ω + β)2 + α, with β ∈ R, and α > 0. By the change of variable ω = (ω − a)/(b − a), we boil down to the case when a = 0 and b = 1. In the sequel we look for q of degree 1. If β = − 21 , the desired decomposition is p1 = (ω − 1 2 ) + α = (4α + 1)(ω − 21 )2 + 4αω(1 − ω). If β = 21 , let us check that p1 is of 2 the form γ (ω − ω0 )2 + δω(1 − ω), with γ ≥ 0, δ ≥ 0, and ω0 ∈ R. Writing the equality of coefficients of each degree, and eliminating δ = γ − 1 (second degree), it remains to solve (1 − 2ω0 )γ = (2β + 1) and γ ω02 = α + β 2 . Since β = 21 , we have ω0 = 21 , and so, γ = (2β + 1)/(1 − 2ω0 ). Combining with the previous equality, we get (2β + 1)ω02 = (1 − 2ω0 )(α + β 2 ), which (since it has positive discriminant) necessarily has a real solution, different from 21 . Now assume the conclusion holds up to n − 1, with n ≥ 3. Proceeding as in the proof of Lemma 2.54, we see that pz can be written as a product of polynomials with constant sign over [a, b] of the form (2.110), with α ∈ R. We conclude by Lemma 2.53. (ii) The argument is similar to the one used in the previous proofs. Remark 2.56 We can also reduce nonnegativity over an interval to nonnegativity over R, by using the following relations: pz (ω) ≥ 0, ω ∈ [a, ∞[, iff pz a+ ω2 ≥ 0, ω ∈ R. pz (ω) ≥ 0, ω ∈ (−∞, a], iff pz a − ω2 ≥ 0, ω ∈ R. ω2 pz (ω) ≥ 0, ω ∈ [a, b], iff (1 + ω2 )n pz a + (b − a) ≥ 0, ω ∈ R. 1 + ω2 However, if the polynomial is of degree n, the SDP constraints involve a matrix of size 1 + n for the nonnegativity over a half-space or over a bounded interval. This is less efficient than the characterizations of the previous lemmas.
2.6 Nonnegative Polynomials over R
111
2.6.2 Characterisation of Moments In this section we assume that Ω is a closed interval of R, non-reduced to a point. We have defined the space of measures M(Ω), as well as their positive and negative cones M(Ω)+ and M F (Ω)− in Sect. 2.5.2. The moment of order k ∈ N of the positive measure μ ∈ M(Ω)+ is, whenever it is defined, the integral Mk (μ) :=
Ω
ωk dμ(ω).
(2.115)
We denote the set of possible values of the first n + 1 moments of positive measures over Ω by M n := {(m 0 , . . . , m n ); ∃μ ∈ M(Ω)+ ; Mk (μ) = m k , k = 0, . . . , n} . Similarly, we denote by M Fn the set of moments of positive measures with finite support over Ω; of course M Fn ⊂ M n . We will, in this section, study characterizations of the sets M n and M Fn . The latter are obviously convex cones of Rn . Lemma 2.57 The set M Fn has a nonempty interior, and Rn = M Fn − M Fn . Proof We will prove a more precise result: the conclusion remains true if Ω includes at least n + 1 distinct points. (a) Let ω0 , . . . , ωn be distinct points of Ω. Let us show that the set of moments of measures with support over ω0 , . . . , ωn is equal to Rn+1 . Indeed these moments form a vector subspace; let z belong to its orthogonal. We then have, for all (λ0 , . . . , λn ), 0=
n i=0
zi
n k=0
(ωk ) λk i
=
n
λk pz (ωk ).
k=0
This proves that the polynomial pz vanishes at the points ω0 , . . . , ωn , i.e., it has more roots than its degree, implying that z = 0, as was to be proved. (b) We show that the interior of M Fn is nonempty, by checking that the set E of moments of positive measures with support over the distinct points ω0 , . . . , ωn of Ω has a nonempty interior. Since E is convex, if it has an empty interior, it is included in a hyperplane with normal λ; then λ is also normal to E − E, but we checked that E − E = Rn , which is a contradiction. Remark 2.58 The set of moments of positive measures with support over the points {ω0 , . . . , ωn } is of course the cone generated by the n + 1 Dirac measures associated with {ω0 , . . . , ωn }. It is therefore characterized by a finite number of linear inequalities (Pulleyblank [90]). The aim of this section is to present a method of characterization of the set M n . We first recall a classical result, based on the following matrices (often called moment matrices in the literature)
112
2 Semidefinite and Semi-infinite Programming
⎛
m0 m1 ⎜m 1 m 2 ⎜ M0 (m) := ⎜ . .. ⎝ .. . m n m n+1 ⎛
m1 m2 .. .
⎜ ⎜ M1 (m) := ⎜ ⎝
m n+1
⎞ · · · mn · · · m n+1 ⎟ ⎟ .. .. ⎟ ; . . ⎠ · · · m 2n
⎞ m 2 · · · m n+1 m 3 · · · m n+2 ⎟ ⎟ .. .. .. ⎟ . . . . ⎠ m n+2 · · · m 2n+1
Lemma 2.59 (i) Let (m 0 , . . . , m 2n+1 ) ∈ M 2n+1 . Then M0 (m) 0. (ii) If in addition Ω ⊂ R+ , then M1 (m) 0. Proof (i) Set x(ω) = (1, ω, ω2 , . . . , ωn ) . From x(ω)x(ω) 0 and μ ≥ 0, we deduce that x(ω)x(ω) dμ(ω) 0. (2.116) M0 (m) = Ω
(ii) Since Ω ⊂ R+ , the vector x(ω) ˆ = (ω1/2 , ω3/2 , . . . , ωn+1/2 ) is well-defined. 0 and μ ≥ 0, imply that The relations x(ω) ˆ x(ω) ˆ M1 (m) =
Ω
x(ω) ˆ x(ω) ˆ dμ(ω) 0.
(2.117)
Remark 2.60 We can give other examples of necessary conditions based on similar arguments. For instance, when Ω = [0, 1], the vector x(ω) ˜ = ((1 − ω)1/2 , (1 − ω)3/2 , . . . , (1 − ω)n+1/2 ) # is well-defined, and so, M2 (m) := Ω x(ω) ˜ x(ω) ˜ dμ(ω) 0. This gives additional information: for example, the nonnegativity of the first element of this matrix gives m0 ≥ m1. Our study of characterizations of moments will use duality theory. Consider the following problem: Min
z∈Rn+1
n
m k zk ;
pz (ω) ≥ 0
over Ω.
(P M)
k=0
The criterion is linear, and the feasible domain is a cone; the value of this problem is therefore 0 or −∞. The “finite dual” problem (in the sense of Sect. 2.5) is Max
μ∈M F (Ω)+
0;
Mk (μ) = m k , k = 0, . . . , n.
(D M F )
2.6 Nonnegative Polynomials over R
113
Its value is 0 if m ∈ M n , and −∞ otherwise. Its feasible set is the set of positive measures having m for first moments. Lemma 2.61 (i) We have val(D M F ) ≤ val(P M). (ii) If in addition Ω is compact, then val(D M F ) = val(P M), and M n = M Fn . Proof (i) If the dual is not feasible, so that its value is −∞, then val(D M F ) ≤ val(P M) trivially holds. Otherwise, let μ ∈ M(Ω) be such that Mk (μ) = m k , k = 0 to n, and z ∈ F(P M). Then n
m k zk ≥
k=0
n
m k zk −
k=0
Ω
pz (ω)dμ(ω) =
n
z k (m k − Mk (μ)) = 0.
(2.118)
k=0
In particular, taking μ ∈ (D M F ), we obtain (i). (ii) Problem (P M) satisfies the Slater hypothesis (2.80): it suffices to take the polynomial constant equal to 1. Theorem 2.35 ensures the equality val(D M F ) = val(P M). In addition, if m ∈ M n , then val(P M) = 0 by (2.118), and Theorem 2.35 implies m ∈ M Fn , whence the conclusion. The lemmas of Sect. 2.6.1 imply that, when Ω is a closed interval of R, bounded or not, problem (P M) has the same value as an SDP problem of the type Min z,Φ
n
m k zk ; z =
L
A Φ , Φ 0, = 1, . . . , L ,
(2.119)
=1
k=0
where L = 1 or 2, the Φ being symmetric; the linear mappings (depending on Ω) A : S n → Rn+1 , for some n , can be deduced from the relations in Lemmas 2.53– 2.55, or from Remark 2.56. Lemma 2.62 Problem (2.119) has value 0 if A m 0, for = 1 to L, and −∞ otherwise. Proof Eliminating z, we can write (2.119) in the form Min Φ
L Φ , A m; Φ 0, = 1, . . . , L .
(P M )
=1
We conclude by Fejer’s theorem 2.1.
Example 2.63 Let Ω = R+ . We have defined matrices M0 (m) and M1 (m) in (2.116)–(2.117). Let A1 and A2 be deduced from the parametrization (2.109). We can check that Ai m = Mi (m), for i = 0, 1 (taking the convention that the indexes M1 (m) and M2 (m) go from 0 to n). Lemma 2.62 then implies that val(P M) = 0 iff M0 (m) 0 and M1 (m) 0. Combining Lemmas 2.61 and 2.62, we deduce the following result:
114
2 Semidefinite and Semi-infinite Programming
Theorem 2.64 Let Ω be a closed interval of R. (i) If m ∈ M n , then A m 0, = 1, . . . , L. (ii) If in addition Ω is bounded, then the converse holds: m ∈ M n iff A m 0, = 1, . . . , L. In addition, there exists a finite measure, with cardinality of support at most n + 1, having m for its first moments. The previous theorem provides a characterization of the set M n whenever Ω is bounded. Let us briefly discuss the case when Ω is unbounded. Proposition 2.65 Let m ∈ int M n . Then there exists a finite measure, with cardinality of support at most n + 1, having m for first moments. Proof Let r > 0 and set Ωr := Ω ∩ [−r, r ]. If the conclusion does not hold, there By Lemma 2.61, exists no measure with support in Ωr having m for first moments. m k z rk < 0. Let there exists a z r ∈ Rn+1 such that pzr is nonnegative over Ωr , and k r r z¯ = 0 be a limit point of z /|z |. Then p z¯ is nonnegative over Ω, and k m k z¯ k ≤ 0. Choose m so close to m that k m k z¯ k < 0. Then problem (P M) for m has / M n , in contradiction with m ∈ value −∞, which by (2.118) implies that m ∈ n int M . Example 2.66 Let us show that if Ω = R+ , the set M n is not closed. Let r > 1. To the measure μr = (1 − r −n )δ0 + r −n δr are associated the moments m r = / M n. (1, r 1−n , . . . , 1) with limit m = (1, 0, . . . , 0, 1). It is clear that m ∈
2.6.3 Maximal Loading Let n ∈ N, n > 0, and S be an interval contained in Ω. We consider the problem of maximal loading on the set S, under constraints of moments: Mk (μ) = m k , k = 0, . . . , n.
dμ(ω);
Max
μ∈M(Ω)+
(D M S )
S
The data are m = (m 0 , . . . , m n ) . We may assume that m 0 = 1; the value of this problem is equal to 1 iff it is possible to realize the moments m k with a probability with support over S. Denote by χ S the characteristic function of S: χ S (ω) =
1 if ω ∈ S, 0 otherwise.
(2.120)
With z = (z 0 , . . . , z n ) we associate the polynomial defined in (2.103). We will interpret this problem as the dual of the following “primal” problem: Min
z∈Rn+1
n k=0
m k zk ;
pz (ω) − χ S (ω) ≥ 0 over Ω.
(P M S )
2.6 Nonnegative Polynomials over R
115
We can rewrite the latter in the form Min
n
z∈Rn+1
m k zk ;
pz (ω) ≥ 1 over S;
pz (ω) ≥ 0 over Ω\S.
(P M S )
k=0
Given measures μ1 and μ2 with support S and Ω\S resp., the Lagrangian of the problem is, denoting by M0:n (·) the vector of moments of order 0 to n: L(μ, z) :=
n
m k zk −
k=0
( pz (ω) − 1) dμ1 (ω) − S
= (m − M0:n (μ1 + μ2 )) · z +
Ω\S
pz (ω)dμ2 (ω) (2.121)
dμ1 (ω) S
and therefore the dual problem is dμ1 (ω);
Max μ1 ,μ2
M0:n (μ1 + μ2 ) = m; μ1 ∈ M(S)+ ; μ2 ∈ M(Ω\S)+ .
S
(2.122) We can write in a unique way μ2 = μ2 + μ2 with μ2 (Ω \ S) = 0 and μ2 (S) = 0. Changing if necessary μ1 into μ1 + μ2 and μ2 into μ2 , we see that it is optimal that μ2 (S) = 0, and hence the dual cost has the same value and constraints as the maximal loading problem for μ := μ1 + μ2 . We deduce the following result: Theorem 2.67 (i) We have val(D M S ) ≤ val(P M S ). (ii) If in addition Ω is compact, then val(D M S ) = val(P M S ), and (D M S ) has a solution with finite support. Remark 2.68 By the results of Sect. 2.6.1, problem (P M) is equivalent to a linear positive semidefinite optimization problem. When Ω is compact, the optimal loading problem therefore reduces to an SDP problem. Remark 2.69 The previous results can be useful in the context of risk control. Assume that certain moments of a probability of gains, with values in a bounded interval Ω, are known. We can then compute the maximal value probability of gain below a certain threshold s, by solving a maximal loading problem, with here S :=] − ∞, s] ∩ Ω.
2.7 Notes An overview of SDP optimization is provided in the Handbook [125] edited by Wolkowicz et al. Proposition 2.13 is due to Lewis [72]; see Lewis and Overton [73] and Lewis [71]. The SDP relaxation of quadratic problems is discussed in [125, Chap. 13]; our presentation is inspired by Lemaréchal and Oustry [70]. Second-order cone models are discussed in Ben-Tal and Nemirovski [15] and Lobo Sousa et al. [76];
116
2 Semidefinite and Semi-infinite Programming
questions of sensitivity are dealt with in Bonnans and Ramírez [25], and an overview is given in Alizadeh and Goldfarb [4]. About semi-infinite programming, see Bonnans and Shapiro [26, Sect. 5.4], or Goberna and Lopez [53]. The problem of moments is discussed in Chap. 16 of [125]; a classical reference is Akhiezer [2]. The related work by Lasserre [69] deals with the minimization of polynomial functions of several variables, with polynomial constraints. Our discussion of Chebyshev interpolation follows Powell’s book [89], a classical reference in approximation theory.
Chapter 3
An Integration Toolbox
Summary This chapter gives a concise presentation of integration theory in a general measure space, including classical theorems on the limit of integrals. It gives an extension to the Bochner integrals, needed for measurable functions with values in a Banach space. Then it shows how to compute the conjugate and subdifferential of integral functionals, either in the convex case, based on convex integrand theory, or in the case of Carathéodory integrands. Then optimization problems with integral cost and constraint functions are analyzed using the Shapley–Folkman theorem.
3.1 Measure Theory 3.1.1 Measurable Spaces Soit Ω be a set; we denote by P(Ω) the set of its subsets. We say that F ⊂ P(Ω) is an algebra (resp. σ -algebra) if it contains ∅ and Ω, the complement of each of its elements, and the finite (resp. countable) unions of its elements. Note that an algebra (resp. a σ -algebra) also contains the finite (resp. countable) intersections of its elements. The trivial σ -algebra is the algebra {∅, Ω}. An intersection of algebras (resp. σ -algebras) is an algebra (resp. σ -algebra). Therefore, if E ⊂ P(Ω), we may define its generated algebra (resp. σ generated-algebra) as the intersection of algebras (resp. σ -algebras) containing it, or equivalently the smallest algebra (resp. σ -algebra) containing it. The above intersections are not over an empty set since they contain the trivial σ -algebra. If F is a σ -algebra of Ω, we say that (Ω, F ) is a measurable space, and call the elements of F measurable sets. Remark 3.1 We can build the algebra (resp. σ -algebra) generated by E ⊂ P(Ω) as follows. Consider the sequence Ek ⊂ P(Ω), k ∈ N, such that E0 := E , and Ek+1 is the subset of P(Ω) whose elements are the elements of Ek , as well as their complements and finite (resp. countable) unions. We can call Ek the k steps completion © Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_3
117
118
3 An Integration Toolbox
(resp. σ -completion) of E . This is a nondecreasing sequence, and the algebra (resp. σ -algebra) generated by E ⊂ P(Ω) is the limiting set ∪k Ek . Example 3.2 (i) If E is a partition of Ω (a finite family of pairwise disjoint subsets with union Ω), the generated σ -algebra is the set of (possibly empty) unions of elements of E . More generally, if E is a countable partition of Ω (a countable family of pairwise disjoint subsets with union Ω), the generated σ -algebra is the set of (possibly empty) unions of elements of E . (ii) If f : E → Ω, where E is an arbitrary set and F is a σ -algebra in Ω, then { f −1 (A); A ∈ F } is a σ -algebra in E, called the σ -algebra generated by f . (iii) If Ω is a topological space,1 then we call the σ -algebra generated by the open subsets of Ω the Borel σ -algebra and denote it by B(Ω). Definition 3.3 Given two sets Ω1 and Ω2 , and subsets Fˆi of P(Ωi ), with generated σ -algebras denoted by Fi , for i = 1, 2, we set Fˆ1 ⊗ Fˆ2 := {F1 × F2 , Fi ∈ Fˆi , i = 1, 2},
(3.1)
σ
and let Fˆ1 ⊗ Fˆ2 be the σ -algebra in Ω1 × Ω2 generated by Fˆ1 ⊗ Fˆ2 (called the product σ -algebra). We have the obvious inclusion σ
σ
Fˆ1 ⊗ Fˆ2 ⊂ F1 ⊗ F2 .
(3.2)
Consider the following hypothesis Ωi is a countable union of elements of Fˆi , for i = 1, 2. σ
(3.3)
σ
Proposition 3.4 If (3.3) holds, then Fˆ1 ⊗ Fˆ2 = F1 ⊗ F2 . Proof We follow Villani [121, Prop. III.35].
σ
σ
(a) In view of (3.2) it suffices to prove that F1 ⊗ F2 ⊂ Fˆ1 ⊗ Fˆ2 . Since the r.h.s. is a σ -algebra, this holds if the following claim holds: for any A ∈ F1 and B ∈ F2 , σ A × B ∈ Fˆ1 ⊗ Fˆ2 . σ (b) When A ∈ Fˆ1 and B ∈ Fˆ2 , by (3.3), A × Ω2 and Ω1 × B belong to Fˆ1 ⊗ Fˆ2 . Using this, given B ∈ Fˆ2 , we easily check that the set of A ⊂ Ω1 such that A × B ∈ σ
σ
Fˆ1 ⊗ Fˆ2 is a σ -algebra. So, A × B ∈ Fˆ1 ⊗ Fˆ2 whenever A ∈ F1 and B ∈ Fˆ2 . σ Since Fˆ1 ⊗ Fˆ2 is a σ -algebra, the claim follows.
means that there exists a subset O of P (Ω) that contains Ω and ∅, and is stable under finite intersection and arbitrary union. Its elements are called open sets. The complements of open sets are called closed sets.
1 This
3.1 Measure Theory
119
Definition 3.5 Let Ωi , i = 1, 2, be topological spaces. Then the product topology in Ω := Ω1 × Ω2 is defined as follows: A ⊂ Ω is an open set if for any a ∈ A, there exists O1 and O2 , open subsets of Ω1 and Ω2 resp., such that a ∈ O1 × O2 ⊂ A. If (Y, ρ) is a metric space, its open balls with center y ∈ Y and radius r > 0 are denoted as B(y, r ) := {y ∈ Y ; ρ(y, y ) < r }. (3.4) The open subsets of Y are defined as unions of open balls. This makes Y a topological space. In the sequel metric spaces will always be endowed with their Borel σ -algebra. Proposition 3.6 Let Ωi , i = 1, 2, be separable metric spaces. Let Ω := Ω1 × Ω2 be endowed with the product topology. Then σ
B(Ω) = B(Ω1 ) ⊗ B(Ω2 ).
(3.5)
Proof We follow Villani [121, Prop. III.36]. (a) Let ai ∈ Ωi , for i = 1, 2. Since Ωi is the union of open balls B(ai , k) for k ∈ N, hypothesis (3.3) holds. Let Oi denote the family of open subsets of Ωi . By Propoσ σ sition 3.4, where Fˆi = Oi and Fi = B(Ωi ), we have that O1 ⊗ O2 = B(Ω1 ) ⊗ σ
σ
B(Ω2 ). So, we need to prove that O1 ⊗ O2 = B(Ω). That O1 ⊗ O2 ⊂ B(Ω) follows from the obvious inclusion O1 ⊗ O2 ⊂ B(Ω). We next show the converse inclusion. Since B(Ω) is generated by the open subsets, it suffices to prove that any σ
open subset of Ω belongs to O1 ⊗ O2 . (b) For i = 1, 2, let xki be a dense sequence in Ωi , Oi be an open subset of Ωi , and x i ∈ Oi . Then B(x i , 2/n) ⊂ Oi for large enough n ∈ N. Pick k such that dist(xki , x i ) < 1/n. Then x i ∈ B(xk , 1/n). (c) Let O be an open subset of Ω, and z ∈ O. For i = 1, 2, there exist Oi , open subsets of Ωi , such that z = (x, y) ∈ O1 × O2 ⊂ O. By point (b), x ∈ B(xk1 , 1/n ) ⊂ O1 and y ∈ B(xk2 , 1/n ) ⊂ O2 . So, an open subset of Ω is a countable union of sets of σ
the form B(xk1 , 1/n ) × B(xk2 , 1/n ). It therefore belongs to O1 ⊗ O2 , as was to be proved.
Measurable Mappings Let (X, F X ) and (Y, FY ) be two measurable spaces. The mapping f : X → Y is said to be measurable if, for all Y1 ∈ FY , f −1 (Y1 ) ∈ F X . A composition of measurable mappings is therefore measurable, if the σ -algebra of the intermediate space is the same for the two mappings. Lemma 3.7 Let (X, F X ) and (Y, FY ) be two measurable spaces, the σ -algebra FY being generated by G ⊂ P(Y ). Then f : X → Y is measurable iff f −1 (g) is measurable, for all g ∈ G .
120
3 An Integration Toolbox
Proof If f is measurable and g ∈ G , clearly f −1 (g) ⊂ F X . Conversely, assume that f −1 (g) ⊂ F X , for all g ∈ G . As in Remark 3.1, denote by Gk the k step σ -completion of G . For k = 0 we have that f −1 (g) ⊂ F X , for all g ∈ Gk . On the other hand, if for some k ∈ N, f −1 (g) ⊂ F X , for all g ∈ Gk , since F X is a σ -algebra,we easily see that f −1 (g) ⊂ F X , for all g ∈ Gk+1 . So, f −1 (g) ⊂ F X , for all g ∈ Gk , for all k ∈ N. Since the σ -algebra FY generated by G coincides with ∪k Gk , the conclusion follows. Corollary 3.8 Let (X, F X ) and (Y, FY ) be two measurable spaces, and let f : X → Y . If FY is a Borel σ -algebra, then f is measurable iff the inverse image of any open set is measurable. Lemma 3.9 Let (X, F X ), (Y, FY ) and (Z , F Z ) be three measurable spaces, and let f : X → Y × Z , with components denoted as f (x) := ( f 1 (x), f 2 (x)). Then f is measurable iff its components f 1 and f 2 are. Proof If f is measurable, then for all A ∈ FY , f 1−1 (A) = f −1 (A × Z ) is measurable. Therefore f 1 is measurable, as is f 2 by a symmetry argument. Assume now that f 1 and f 2 are measurable. Since the product σ -algebra is generated by the elements of the form A × B, with A ∈ FY and B ∈ F Z , by Lemma 3.7, it suffices to check that f −1 (A × B) is measurable, which is immediate since f −1 (A × B) = f 1−1 (A) ∩ f 2−1 (B). In the sequel (Y, ρ) is a metric space. Definition 3.10 We denote by L 0 (Ω, Y ) the vector space of measurable functions on Ω with values in Y , and by E 0 (Ω, Y ) the subspace of simple functions (sometimes called step functions), i.e., of measurable functions with finite range. If Y = R, we denote these spaces by L 0 (Ω) and E 0 (Ω) resp. By simple convergence of a sequence of functions we mean the convergence at any point. If V ⊂ Y and y ∈ Y , we define the distance function ρ(y, V ) := inf{ρ(y, y ); y ∈ V }, and for r > 0, we set2 Vr := {y ∈ Y ; ρ(y, Y \ V ) > 1/r }. Lemma 3.11 Let f k be a sequence of measurable functions Ω → Y , simply converging to f¯. Then f¯ is measurable, and for any open set O in Y : f¯−1 (O) =
r >0 k∈N
f −1 (Or )
.
(3.6)
≥k
Proof Let O be an open subset of Y . Clearly, O = ∪r >0 Or , and hence, x ∈ f¯−1 (O) iff there exists an r0 > 0 such that x ∈ f¯−1 (Or0 ), i.e., there exists a y ∈ Or0 such that y = f¯(x) = limk f k (x). This holds iff, for any r1 > r0 , f k (x) ∈ Or1 for large enough k, i.e., iff x belongs to ≥k f −1 (Or1 ) for large enough k: relation (3.6) follows. 2 By
the definition, A \ B := {x ∈ A; x ∈ / B}.
3.1 Measure Theory
121
We next give a way to approximate measurable functions with values in Rn by functions having a countable or finite image. Definition 3.12 Let a denote the integer part of a, i.e., the greatest integer m such that m ≤ a. For k ∈ N, set −k k 2 2 a if a ≥ 0, (3.7) a k := −2−k −2k a otherwise. If now f is a real-valued function, define f k by f k (x) := f (x) k . If f is a mapping with values in Rn , define f k by f i k (x) := f i k (x), i = 1 to n. We call f k the floor approximation of f . We recall that a function is simple if it is measurable with finite image. Lemma 3.13 Let f be a measurable function. Then (i) f k is measurable, has a countable range, and converges uniformly to f , (ii) the truncation f k := max(−k, min(k, f k ))
(3.8)
is a sequence of simple functions that converges simply to f . In addition if f is nonnegative, so is f k , and f k as well as f k are nondecreasing. Proof It suffices to discuss the case when f is nonnegative. The image of f k is included in 2−k N, and for j ∈ N, f k−1 (2−k j) = f −1 ([2−k j, 2−k ( j + 1)[) is measur able, so that f k is measurable. The conclusion easily follows. Definition 3.14 We say that f : Rn → R p is Borelian if it is measurable when Rn and R p are endowed with the Borel σ -algebra. More generally, if f is measurable X → Y , where X and Y are topological sets endowed with their Borel σ -algebra, we say that f is Borelian. Lemma 3.15 (Doob–Dynkin) Let Ω be an arbitrary set and X , Y be two measurable functions from Ω to Rn and R p resp. Denote by F X the σ -algebra generated by X . Then Y is F X measurable iff there exists a Borelian function g : Rn → R p such that Y = g(X ). Proof If Y = g(X ) for some Borelian function g : Rn → R p , and if B is an open set in R p , then Y −1 (B) = X −1 [g −1 (B)] ∈ F X , and so Y is F X measurable. We show the converse in the case when p = 1, the extension to p > 1 being easy. Let Y be F X measurable. If Y = 1 A is the characteristic function of the set A ∈ F X , ), and the since A = X −1 (B) where B is Borelian, we have that Y = 1 X −1 (B) = 1 B (X conclusion holds with g = 1 B . More generally, let Y be of the form Y = k αk 1 Ak (finite or countable sum) with αk all different and Ak = Y −1 (αk ) ∈ F X for all k, necessarily pairwise disjoint. Note that the sum is well defined since at most one term is nonzero. Then Ak = X −1 (Bk ), where the Bk are pairwise disjoint Borelian sets and the conclusion holds with g(x) = k αk 1 Bk (x) (again, at most one term in
122
3 An Integration Toolbox
the sum is nonzero). Finally, in the general case, by the previous discussion, there exists a measurable function gk such that Yk = gk (X ), and for m > n, we have |gm (x) − gk (x)| ≤ 2−k , proving that the sequence gk simply converges to some g, which is measurable by Lemma 3.26. Remark 3.16 For a measurable function f with value in a separable (i.e. that contains a dense sequence) metric space (E, ρ), we can do a somewhat similar construction. Let ek , k ∈ N be a dense sequence in E. Set E k, := {e j , k ≤ j ≤ }. Define f k by induction: f k (x) = e0 if ρ(e0 , f (x)) ≤ ρ(e, f (x)) for all e ∈ E 0,k , and at step i, 1 ≤ i ≤ k, if f k (x) has not been set yet, then f k (x) = ei if ρ(ei , f (x)) ≤ ρ(e, f (x)), for all e ∈ E i,k . In this way we obtain a sequence of simple functions that simply converge to f . It follows that the above Doob–Dynkin lemma holds when replacing R p by a separable metric space.
3.1.2 Measures Let (Ω, F ) be a measurable space. We say that μ : F → R+ ∪ {+∞} is a measure if it satisfies the two axioms of countable additivity
μ (∪i∈I Ai ) = i∈I μ(Ai ), for I finite or countable and {Ai }i∈I ⊂ F such that Ai ∩ A j = ∅ if i = j,
(3.9)
and σ -finiteness:
There exists an exhaustion sequence Ak in F , i.e., such that μ(Ak ) < ∞ and Ω = ∪k Ak .
(3.10)
We say that (Ω, F , μ) is a measure space. If in addition μ(Ω) = 1, we say that μ is a probability measure and that (Ω, F , μ) is a probability space. If A ∈ F , we then interpret μ(A) as the probability that ω ∈ A. It follows from (3.9) that μ (∪i∈I Ai ) ≤
μ(Ai ),
if {Ai }i∈I is a finite or countable family in F . (3.11)
i∈I
Indeed, we may assume that I = N. Setting Bi := ∪ j≤i Ai and Ci := Bi \ B j−1 (with B0 := ∅) it suffices to apply (3.9) to the family {Ci }i∈I whose union is ∪i∈I Ai ; since Ci ⊂ Ai , the result follows from μ (∪i∈I Ai ) = μ (∪i∈I Ci ) =
i∈I
μ(Ci ) ≤
μ(Ai ).
(3.12)
i∈I
We also have μ (∪k Ak ) = lim μ(Ak ) if Ak ⊂ Ak+1 for all k. k
(3.13)
3.1 Measure Theory
123
Indeed, apply (3.9) to the disjoint family Ak \ Ak−1 (for k ≥ 1, assuming w.l.o.g. that A0 = ∅). Let us show next that (3.13) implies ⎧ ⎨ μ (∩k Ak ) = limk μ(Ak ) if {Ak } is a nonincreasing family of measurable sets ⎩ having finite measure for large enough k.
(3.14)
We may assume that mes(A1 ) is finite. The family Ak := A1 \ Ak is nondecreasing, and A1 \ (∩k Ak ) = ∪k Ak . By (3.13), μ (∩k Ak ) = μ(A1 ) − limk μ(Ak ) = limk μ(Ak ).
Construction of Measures In the case of Lebesgue measure over R the starting point is the length of finite intervals; one has to check that the latter can be extended to a measure over the Borelian σ -algebra. More generally let Fˆ be an algebra of subsets of Ω, and let F denote the generated σ -algebra. Let μ : Fˆ → R+ ∪ {+∞} be a σ -additive function over Fˆ , i.e., if ∪i∈I Ai ∈ Fˆ , then
μ (∪i∈I Ai ) =
i∈I
μ(Ai ), for I finite or countable
and {Ai }i∈I ⊂ Fˆ such that Ai ∩ A j = ∅ if i = j.
(3.15)
We have Carathéodory’s extension theorem: Theorem 3.17 Let μˆ be a nonnegative σ -additive function over Fˆ . Then it has a unique extension as a measure over F . Proof See Royden [105, Chap. 12, Th. 8].
Remark 3.18 (i) Note that the proof needs Ω to be σ -finite. (ii) The delicate point when applying this theorem is to check the σ -additivity assumption. (iii) For an extension see Villani [121, Thm. I.69]. As a consequence we obtain the construction of the Lebesgue measure on R. Corollary 3.19 There exists a unique measure over R, endowed with the Borelian σ -algebra, that for a finite segment [a, b] has value b − a. Proof (i) It is easily checked that the length of segments has a unique extension μˆ to the algebra Fˆ generated by segments, which is nothing but the finite union of (finite or not) segments. So, by the extension Theorem 3.17, it suffices to check the σ -additivity assumption of μˆ over Fˆ . For this, see Royden [105, Chap. 3], or Dudley [45, Chap. 3]. Remark 3.20 For the construction of a non-measurable subset of R, see Royden [105, Chap. 4, Sect. 4].
124
3 An Integration Toolbox
Product Spaces We next show how to construct measures over products of measure spaces. Proposition 3.21 Let (Ωi , Fi , μi ), for i = 1, 2, be two measure spaces. Let μ be the set function over F1 × F2 defined by μ(F1 × F2 ) := μ1 (F1 )μ2 (F2 ), for all F1 ∈ F1 σ
and F2 ∈ F2 . Then there exists a unique measure μ¯ over F1 ⊗ F2 that extends μ, in the sense that μ(F) ¯ = μ(F), for all F ∈ F1 × F2 . Proof (Taken from Royden [105, Chap. 12, lemma 14]) (i) Let Fˆ denote the algebra generated by F1 × F2 . Any A ∈ Fˆ is a finite disjoint with Ai in F1 and Ai in F2 , union of elements of F1 × F2 , say A = i Ai × Ai the sum being over a finite set. We define μ(A) ˆ := i μ1 (Ai )μ2 (Ai ). While the decomposition is not unique, all possible decompositions give the same value for μ(A), ˆ so that μˆ is well-defined as a nonnegative, finitely additive set function of Fˆ . (ii) By the extension Theorem 3.17, if suffices now to check that μˆ is σ -additive over Fˆ . Since each element of Fˆ has a representation as a finite disjoint union of elements of F1 × F2 , it suffices to prove that μ is σ -additive over F1 × F2 . (iii) So, let A × B ∈ F1 × F2 be the disjoint union of Ai × Bi ∈ F1 × F2 , with i ∈ I countable. Given (x, y) ∈ A × B, (x, y) belongs to only one of the Ai × Bi , whose index is denoted by j (x, y). Then I (x) := ∪ y∈B j (x, y) denotes the subset of those i ∈ I such that x ∈ Ai . Since A × B is the disjoint union of the Ai × Bi , B = ∪i∈I (x) Bi . Since μ2 is σ -additive it follows that μ2 (B) = i∈I (x) μ2 (Bi ), and so, μ2 (Bi )χ Ai (x). (3.16) μ2 (B)χ A (x) = i∈I (x)
By the Lebesgue theorem on series Theorem 3.32 (obtained later, but by independent arguments) we deduce that μ2 (B)μ1 (A) = i∈I (x) μ2 (Bi )μ1 (Ai ), as was to be proved. Negligible Sets and Completed σ -Algebras A (not necessarily measurable) subset A of Ω is said to be negligible if, for all ε > 0, it is contained in a measurable set of measure less than ε. In particular, for each nonzero k ∈ N, there exists a measurable set Ωk of measure at most 1/k such that A ⊂ Ωk , and so A ⊂ ∩k Ωk . This proves that a negligible subset is included in a measurable set of zero measure. A countable union of negligible sets is negligible. A property that holds outside of a negligible set (or equivalently, outside of a subset of zero measure) is said to be true almost everywhere (a.e.), or almost surely (a.s.) in the case of a probability measure. We say that the σ -algebra F is complete if it contains the negligible sets. We call the family of subsets
3.1 Measure Theory
125
Fc := {A := A1 ∪ A2 ; A1 ∈ F ; A2 is negligible}
(3.17)
the completed σ -algebra of F . It is easily seen that Fc is a σ -algebra, and we endow it with the completed measure defined by μc (A1 ∪ A2 ) := μ(A1 ). The completion of the Borel σ -algebra of Rn is called the Lebesgue σ -algebra. Lemma 3.22 Let (Ω, F , μ) be a measure space, and f : Ω → R be Fc measurable. Then there exists an F -measurable real-valued function g such that f = g a.e. Proof We follow Aliprantis and Border [3, Thm. 10.35] (a) Decomposing f as a difference of nonnegative functions, we see that it suffices to consider the case when f (x) ≥ 0 a.e. If f = χ A for some A ∈ Fc , then A = A1 ∪ A2 with A1 ∈ F and A2 negligible, and we may take g = χ A1 . The set of functions f for which the conclusion holds is a vector space, and so, the conclusion holds when f is a simple function. (b) In the general case, by Lemma 3.13, there exists a nondecreasing sequence f k of Fc -simple functions, a.e. converging to f . By step (a), there exists gk , F -measurable and equal to f k , except for a negligible set Ak . Being negligible, ∪k Ak is included in a zero measure set whose complement is denoted by B. Then gk (x) := gk (x)χ B (x) converges a.e. to the function g equal to f (x) on B, and to 0 on its complement. By Lemma 3.11, g is measurable. The conclusion follows. A Useful Lemma Lemma 3.23 (Borel–Cantelli) Let (Ω, F , μ) be a measure space. If the sequence Ak in F satisfies k μ(Ak ) < ∞, then the following holds for almost all ω ∈ Ω: {k ∈ N; ω belongs toAk } is finite.
(3.18)
Proof That Bn := ∪k≥n Ak is nonincreasing implies μ(∩n Bn ) = limn μ(Bn ). As / ∩n Bn iff (3.18) holds, the μ(Bn ) ≤ k≥n μ(An ), the limit is equal to 0. Since ω ∈ conclusion follows.
3.1.3 Kolmogorov’s Extension of Measures Set X = (R p )∞ , i.e., any x ∈ X has the representation x = (x1 , x2 , . . .) with each xi in R p . To any Borelian subset A of R p×n we associate the cylinder C(A) := {x ∈ X ; (x1 , . . . , xn ) ∈ A}.
(3.19)
Denote by Fˆ (resp. F ) the algebra (resp. σ -algebra) generated by the cylinders. Let μn be a sequence of probability measures on R p×n (endowed with the Borelian σ -algebra), having the consistency property
126
3 An Integration Toolbox
μn+1 (A × R p ) = μn (A), n = 1, . . . .
(3.20)
Then we can define a finitely additive set function μˆ over the set of cylinders of X by (3.21) μ(C(A)) ˆ := μn (A), for each measurable A in R p×n . If A and A are Borelian sets in R p×i and R p× j resp. with j > i, then C(A) coincides ˆ is well defined. with C(A ) iff A = A × R p×( j−i) . So, in view of (3.20), μ(C(A)) Theorem 3.24 The set function μˆ has a unique extension to a probability measure μ on F . Proof (Taken from Shiryaev [115, Chap. 2, Sect. 3]). By Carathéodory’s extension Theorem 3.17, it suffices to prove that μˆ is σ -additive. Since μˆ is finitely additive, its σ -additivity is equivalent (taking complements) to the property of “continuity at 0”, i.e., if Ck := C(Ak ) is a decreasing sequence in Fˆ with empty intersection, then μ(C ˆ k ) → 0. Note that we may assume that Ak is ˆ k ) → δ > 0. a Borelian set in R p×n k , with n k ≥ k. Assume on the contrary that μ(C We prove (independently) in Lemma 5.4 that any Borelian subset A of Rn is such that For any ε > 0, there exist F, G resp. closed and open subsets of Rn (3.22) such that F ⊂ A ⊂ G and P(G \ F) < ε. Intersecting F with a closed ball of arbitrarily large radius, it is easily seen that we may in addition assume F to be compact. So, let Fk ⊂ Ak be compact sets such that μn k (Ak \ Fk ) < δ/2k+1 . Let Fˆk := C(∩q≤k Fk ) = ∩q≤k C(Fk ). Then Ck \ Fˆk = Ck \ ∩q≤k C(Fk ) = ∪q≤k Ck \ C(Fk ),
(3.23)
and therefore μ(C ˆ k \ Fˆk ) ≤
p≤k
μ(C ˆ k \ C(Fk )) =
μn p (A p \ F p ) ≤ δ/2.
(3.24)
p≤k
ˆ Fˆk ) ≥ δ/2. So, there exists a sequence Since μ(C ˆ k ) → δ > 0 it follows that limk μ( x k in R∞ such that x k ∈ Fˆk for all k. Let p ≥ 1 be an integer. Since Fk is a compact subset of Rn k , with n k ≥ k, p → x kp is bounded. By a diagonal argument we can, up to the extraction of a subsequence, assume that p → x kp is convergent for each p to, say, x p . We easily see that x belongs to ∩k Ck , contradicting our hypothesis. Remark 3.25 As mentioned in [115], the proof has an immediate extension to the case when X = Y ∞ , where Y is a metric space endowed with the Borelian σ -algebra. We need probabilities μn on Y n , satisfying the consistency property μn+1 (A × Y ) = μn (A), for all n = 1, . . .. Then we are able to build an extension of the μn on the
3.1 Measure Theory
127
σ -algebra generated by the cylinders, provided that to each Borelian A ∈ Y n and ε > 0 we can associate a compact set F ⊂ A such that μn (A \ F) < ε.
3.1.4 Limits of Measurable Functions Let (X, F X ), (Y, FY ) be two measurable spaces. We endow (X, F X ) with a measure μ. Denote by M the vector space of functions X → Y that are measurable, after modification on a negligible set, and by Mμ the quotient space through the equivalence relation f ∼ f iff f (x) = f (x) a.e. We call an element of an equivalence class a representative of that class, and say that the sequence f k ∈ Mμ converges a.e. to f ∈ Mμ if the convergence holds a.e. for some representative. Lemma 3.26 Let (X, F X , μ) be a measure space, and (Y, dY ) be a metric space. If a sequence in Mμ converges a.e., then its limit is a measurable function. Proof Let f k be a sequence in Mμ converging a.e. to f¯. Let Ω0 be a zero measure set such that f k simply converges on Ω1 := Ω \ Ω0 . Let y0 ∈ Y , and set gk (ω) := f k (ω) if ω ∈ Ω \ Ω1 ; gk (ω) := y0 otherwise.
(3.25)
Then g ∈ Mμ simply converges to the function f˜ equal to f¯ on Ω1 , and y0 on Ω0 . By Lemma 3.11, f˜ is measurable; so is f¯, being equal to f˜ a.e. Theorem 3.27 (Egoroff) Let (X, F X , μ) be a measure space such that μ(X ) < ∞, ¯ then (Y, ρ) a metric space, and f¯k a sequence of Mμ (X, Y ). If f¯k converge a.e. to g, ¯ and ε > 0, there exists a K ⊂ F X such that for any representatives ( f k , g) of ( f¯k , g) μ(X \K ) ≤ ε and f k uniformly converges to g on K . Proof The family indexed by k and q in N, q ≥ 1: Ak,q := ∪≥k {x ∈ X ; ρ( f (x), g(x)) > 1/q}
(3.26)
is nonincreasing in k, and by (3.14), limk μ(Ak,q ) = μ(∩k Ak,q ) = 0. So there exists kq ∈ N such that μ(Akq ,q ) ≤ ε2−q . Set Kˆ := ∪q Akq ,q , and let K be the complement of Kˆ . Then μ( Kˆ ) ≤ ε and ρ( f (x), g(x)) < 1/q whenever ≥ kq , for all x ∈ K , implying the uniform convergence on K .
(3.27)
If (Y, dY ) is a metric space, we say that a sequence f k in Mμ (X, Y ) converges in measure (in probability) to g ∈ Mμ (X, Y ) if For all ε > 0, we have that μ ({x ∈ X ; ρ( f k (x), g(x)) > ε}) → 0.
(3.28)
128
3 An Integration Toolbox
Theorem 3.28 Let (X, F X , μ) be a measure space, and (Y, ρ) be a metric space. Let f k be a sequence of measurable mappings X → Y . Then: (i) Convergence in measure implies convergence a.e. of a subsequence. (ii) If μ(X ) < ∞, convergence a.e. implies convergence in measure. Proof (i) Let f k converge in measure to f¯, and ε > 0. Set Ak := {ω ∈ Ω; ρ( f k (ω), f¯(ω)) > ε}.
(3.29)
Extracting a subsequence if necessary, we may assume that μ(Ak ) ≤ 2−k . By the Borel–Cantelli Lemma 3.23, for a.a. ω ∈ Ω, ω belongs to finitely many Ak ; that is, there exists a function k(ω) such that ρ( f k (ω), f¯(ω)) < ε a.e. for k > k(ω). This being true for all ε > 0, the convergence a.e. of f k to f¯ follows, along the subsequence. (ii) Immediate consequence of Egoroff’s Theorem 3.27.
Metrizability of Convergence in Measure We briefly review some results, referring for the proof to [77, Chap. 1]. For f , g in L 0 (Ω) set (3.30) e( f, g) := inf {ε + μ(| f − g| > ε)}. ε>0
This is a symmetric function with values in R+ ∪ {+∞}, such that e( f, g) = 0 iff f = g a.e., and that satisfies the triangle inequality e( f, h) ≤ e( f, g) + e(g, h).
(3.31)
It easily follows that δ( f, g) := e( f, g)/(1 + e( f, g)) is a metric over Mμ . Theorem 3.29 We have that f k → f in measure iff δ( f k , f ) → 0, and Mμ endowed with the metric δ is complete.
3.1.5 Integration Let (Ω, F , μ) be a measure space. The spaces L 0 (Ω) and E 0 (Ω) of measurable and simple functions, resp., were introduced in Definition 3.10. If f ∈ E 0 (Ω) has values a1 < · · · < an , the sets Ai := f −1 (ai ), i = 1 to n, are measurable and give a partition of Ω. Denote by E 1 (Ω) the subspace of E 0 (Ω) for which the Ai have a finite measure whenever ai = 0. If f ∈ E 1 (Ω), we define the integral of f as Ω
f (ω)dμ(ω) :=
n i=1
ai μ(Ai ).
(3.32)
3.1 Measure Theory
129
This defines a linear form over E 1 (Ω), and the function f 1 :=
Ω
| f (ω)|dμ(ω) =
n
|ai |μ(Ai ),
(3.33)
i=1
where |ai | μ(Ai ) = 0 if ai = 0, is a seminorm (nonnegative, positively homogeneous function that satisfies the triangle inequality) of the same value for all representatives of the equivalence class under the relation of being equal a.e. Over these equivalence classes, this seminorm induces a norm, denoted again by · 1 , that satisfies the Tchebycheff inequality: for all ε > 0, 1 ε
1 f − g1 . ε Ω (3.34) We shall build L 1 (Ω) as the completion, for the norm · 1 , of the equivalence classes of functions of the space E 1 (Ω). More precisely, let f k be a Cauchy sequence in E 1 (Ω). Extracting a subsequence if necessary, we may assume that f k − f k+1 1 ≤ 2−k−1 . Fix ε > 0 and set Ak := ω ∈ Ω; sup | f k (ω) − f (ω)| > ε . (3.35) μ ({ω ∈ Ω; | f (ω) − g(ω)| > ε}) ≤
| f (ω) − g(ω)|dμ(ω) =
≥k
A variant of the Tchebycheff inequality gives 1 μ ( Ak ) ≤ ε
sup | f k (ω) − f (ω)| ≤
Ω ≥k
1 2−k . f +1 − f 1 ≤ ε ≥k ε
(3.36)
By the Borel–Cantelli Lemma 3.23, ω belongs a.e. to a finite number of Ak , showing that | f k (ω) − f (ω)| ≤ ε for k = k(ω) large enough and ≥ k. In other words, f k (ω) is a.e. a Cauchy sequence, and therefore f k converges a.e. to some g, which is measurableby Lemma 3.26. Since the integral is a continuous mapping in the norm · 1 , limk Ω f k (ω)d(ω) also converges. Lemma 3.30 We may set Ω g(ω)d(ω) := limk Ω f k (ω)d(ω), in the sense that if another Cauchy sequence f k in E 1 (Ω) converges a.e. to the same function g, then Ω f k (ω)d(ω) and Ω f k (ω)d(ω) have the same limit. Proof We follow [77, French edition, p. 37]. If the conclusion does not hold, then h k := f k − f k is a Cauchy sequence that converges a.e. to zero and such that h ≥ Ω h k (ω)dμ(ω) has a nonzero limit, say γ , so that, for large enough k, k 1 1 |γ | > 0 and h k − h 1 < |γ |/8 for > k. Fix such a k and write h k = q ξq 1 Aq . 2 Then, for > k:
130
3 An Integration Toolbox
q
Aq
|ξq (ω) − h (ω)|dμ(ω) ≤ h k − h 1 ≤ 41 h k 1 =
1 4
q
|ξq |dμ(ω). Aq
(3.37) So, we must have that |ξq (ω) − h (ω)|dμ(ω) ≤ Aq
1 4
Aq
|ξq (ω)|dμ(ω) = 41 |ξq (ω)|μ(Aq ), for some q. (3.38)
For this particular q, by the Tchebycheff inequality: μ({ω ∈ Aq ; |ξq (ω) − h (ω)| > 21 |ξq (ω)|}) < 21 μ(Aq ),
(3.39)
μ({ω ∈ Aq ; |h (ω)| > 21 |ξq (ω)|}) ≥ 21 μ(Aq ).
(3.40)
so that
Since 1 Aq h converges a.e. to zero on Aq , which has finite measure, by Theorem 3.28(ii), it converges in measure, which contradicts (3.40). The conclusion follows. The vector space L 1 (Ω) of equivalent classes (for the relation of equality a.e.) of such limits is endowed with g1 := limk f k 1 , which is easily checked to be a norm. This space, being constructed as limitsof Cauchy sequences, is easily seen to be complete. Over E 1 (Ω), the operator g → Ω g(ω)dμ(ω) is linear, nondecreasing and non-expansive (Lipschitz with constant 1). Since E 1 (Ω) is a dense subset of L 1 (Ω), the integral has a unique extension to L 1 (Ω) that keeps these properties. The integral is a continuous linear form over the space L 1 (Ω), with unit norm. Therefore, f k → g in L 1 (Ω) implies Ω f k (ω)dμ(ω) → Ω g(ω)dμ(ω). Since the Tchebycheff inequality (3.34) holds on L 1 (Ω), by Theorem 3.28, the following holds: Lemma 3.31 Convergence in L 1 (Ω) implies convergence in measure, and hence, convergence a.e. for a subsequence. If f ∈ L 0 (Ω) is bounded, by Lemma 3.13, we can approximate it uniformly by functions in E 0 (Ω). Therefore, if μ(Ω) < ∞, or more generally if f is zero a.e. outside of a set of finite measure, then f ∈ L 1 (Ω). Theorem 3.32 (Lebesgue’s theorem on series) Let the sequencef k in L 1 (Ω) con n verge normally, i.e., k f k 1 < ∞. Then (i) the series Fn (ω) := k=0 f k converges 1 lim Ω Fk (ω)dμ(ω), (iii) the series Fk (ω) in L (Ω) to some g, (ii) Ω g(ω)dμ(ω) = absolutely converges a.e. to g(ω), that is, k | f k (ω)| < ∞ and Fk (ω) → g(ω) a.e. Proof (i) Any normally convergent sequence in a complete space is convergent. (ii) Since the integral is linear and continuous, the integral ofthe limit is the limit of integrals of the partial sums. (iii) The remainders rn := ∞ k=n | f k | converge to 0 in L 1 (Ω), and hence, a.e. for a subsequence. For a nonincreasing and nonnegative sequence, convergence a.e. to 0 for a subsequence implies convergence a.e. for the sequence. So the sequence rn converges a.e. to 0. The result follows.
3.1 Measure Theory
131
We can define integrals with infinite values in the following way. Definition 3.33 Let f ∈ L 0 (Ω). We set Ω f (ω)dμ(ω) := −∞ if f + ∈ L 1 (Ω) and / L 1 (Ω), and Ω f (ω)dμ(ω) := +∞ if f − ∈ L 1 (Ω) and f + ∈ / L 1 (Ω). f− ∈ With the above definition we have the usual calculus rules such as ( f + g)(ω)dμ(ω) = f (ω)dμ(ω) + g(ω)dμ(ω), Ω
Ω
(3.41)
Ω
whenever the integrals of f and g are defined, except of course if f and g have infinite integrals of opposite sign. Theorem 3.34 (Monotone convergence) Let f k be a nondecreasing sequence of L 1 (Ω), with limit a.e. g. Then
lim k
Ω
f k (ω)dμ(ω) =
g(ω)dμ(ω),
(3.42)
Ω
the limit being possibly +∞. If in addition, limk Ω f k (ω)dμ(ω) < ∞, then g ∈ L 1 (Ω) and f k → g in L 1 (Ω). Proof Since f k ≤ g, Ω f k (ω)dμ(ω) ≤ Ω g(ω)dμ(ω). So, (3.42) holds if lim k
Ω
f k (ω)dμ(ω) = ∞.
(3.43)
Otherwise, we conclude by applying Theorem 3.32 to the normally convergent series f k+1 − f k . Example 3.35 The sequence of functions f k : R → R, f k (x) = −1x≥k (x), is nondecreasing and has limit g(x) = 0 a.e., and yet
lim k
Ω
f k (ω)dω = −∞ < 0 =
g(ω)dμ(ω).
(3.44)
Ω
The above theorem does not apply since f k is not integrable. 1 Lemma 3.36 Let f ∈ L (Ω) be nonnegative. Then the mapping F → R, A → ρ f (A) := A f (ω)dω is a measure.
Proof The σ -finiteness axiom (3.10) holds since ρ f (Ω) := f 1 < ∞. It remains to in (3.9), then ρ A show that, if the Ai satisfy the assumptions = (∪ ) f i∈I i i∈I ρ(Ai ), or equivalently ∪i∈I Ai f (ω)dω = i∈I Ai f (ω)dω. This follows from the mono tone convergence Theorem 3.34, where we set f k (ω) := f (ω) ≤k 1 A (ω). Corollary 3.37 Let {Bk } ⊂ F be such that Bk+1 ⊂ Bk , and B := ∩k Bk has zero measure. Then Bk f (ω)dμ(ω) → 0, for all f ∈ L 1 (Ω).
132
3 An Integration Toolbox
Proof Decomposing f into its positive and negative parts, we see that it suffices to prove the result when f ≥ 0. Since Bk f (ω)dμ(ω) = ρ f (Bk ), and ρ(B) = 0, this follows from Lemma 3.36 and (3.14). Theorem 3.38 (Lebesgue dominated convergence) Let the sequence f k of L 1 (Ω) converge a.e. to g, and be dominated by h ∈ L 1(Ω), in the sense that | f k (ω)| ≤ h(ω) a.e. Then g ∈ L 1 (Ω), f k → g in L 1 (Ω), and Ω f k (ω)dμ(ω) → Ω f (ω)dμ(ω). Proof (a) Since g is dominated by h ∈ L 1 (Ω), so are the floor approximations g k , which (being measurable) are therefore integrable. Applying the monotone convergence Theorem 3.34 to the positive and negative parts of g k , we deduce that g ∈ L 1 (Ω). The relation Ω f k (ω)dμ(ω) → Ω g(ω)dμ(ω) is a consequence of the convergence of f k to g in L 1 (Ω), which we prove next. (b) We first assume that μ(Ω) < ∞. By Egoroff’s Theorem 3.27, for each ∈ N, > 0, there exists K ⊂ F X such that K := X \K satisfies μ(K ) ≤ 1/, and f k converges uniformly on K to g. Changing if necessary K into ∪q≤ K q , we may assume that K is nonincreasing. Let ρh denote the measure associated with h (see Lemma 3.36). Since ∩ K has zero measure, and ρh (K ) is finite, by (3.14), we have that lim ρh (K ) = lim K |h(ω)|dω = 0, and so when ↑ +∞:
α := sup k
K
| f k (ω) − g(ω)|dμ(ω) ≤ 2
K
h(ω)dμ(ω) → 0.
(3.45)
On the other hand, since f k converges uniformly on K : | f k (ω) − g(ω)|dμ(ω) → 0.
(3.46)
K
So, given γ > 0, take such that α ≤ γ . By (3.46), we have that lim supk f k − g1 ≤ γ . It follows that f k → g in L 1 (Ω). (c) Assume now that μ(Ω) = ∞, and let A be the exhaustion sequence in (3.10). Set B := Ω \ A . By step (b), A | f k (ω) − g(ω)|dω → 0, and so
lim sup f k − g1 = lim sup k
k
| f k (ω) − g(ω)|dω ≤ 2 B
|h(ω)|dω.
(3.47)
B
Now B |h(ω)|dω = ρh (B ). Since ∩ B has zero measure, by Corollary 3.37, the above r.h.s. converges to 0 when ↑ +∞. The conclusion follows. Remark 3.39 We have proved in step (a) that a measurable function belongs to L 1 (Ω) whenever it is dominated by some h ∈ L 1 (Ω). Example 3.40 Define the functions f k and g R → R by f k (x) := e−(x−k) , g(x) = 0. Then f k and g are integrable, and f k → g a.e. However, the integral of f k does not converge to that of g. The above theorem does not apply, since the domination hypothesis does not hold. 2
3.1 Measure Theory
133
We recall that, if f k is a sequence of real-valued functions over Ω, its lower limit is defined by lim inf k f k (ω) := limk inf j≥k f j (ω). Since the r.h.s. is nondecreasing, ¯ the limit exists in R. Lemma 3.41 (Fatou’s lemma) Let f k be a sequence in L 1 (Ω), with f k ≥ g, where g is an integrable function. Then
Ω
lim inf f k (ω)dμ(ω) ≤ lim inf k
k
Ω
f k (ω)dμ(ω).
(3.48)
Proof If lim inf k Ω f k (ω)dμ(ω) = ∞, then (3.48) certainly holds. Otherwise, note that gk := inf j≥k f j satisfies g ≤ gk ≤ f k . So, by Remark 3.39, gk ∈ L 1 (Ω) and it satisfies gk (ω)dμ(ω) ≤ f k (ω)dμ(ω). (3.49) Ω
Ω
Since the l.h.s. is nondecreasing, we have that
lim k
Ω
gk (ω)dμ(ω) ≤ lim inf k
Ω
f k (ω)dμ(ω).
(3.50)
Let g(ω) ¯ := limk gk (ω) = lim inf k f k (ω). By the monotone convergence Theo¯ The conclusion rem 3.34, the l.h.s. of (3.50) is equal to Ω g(ω)dμ(ω). follows. Remark 3.42 Fatou’s lemma allows us to prove the l.s.c. of some integral functionals, see e.g. after (3.134). Corollary 3.43 Let f k be a sequence in L 1 (Ω) such that f k 1 ≤ C for all k. If f k converges a.e. to f , then f ∈ L 1 (Ω) and f 1 ≤ C. Proof Apply Lemma 3.41 to the sequence | f k |, which converges a.e. to | f |, with g = 0. Example 3.44 The integrable sequence f k (x) := e−(x−k) simply converges to 0, and gives an example of strict inequality in (3.48). It also shows that the convergence in L 1 (Ω) does not necessarily occur in the setting of Corollary 3.43. Taking now 2 f k (x) := −e−(x−k) , we verify that, to obtain (3.48), the hypothesis that f k ≥ g, with g integrable, cannot be omitted. 2
We have until now presented the standard theorems of integration theory. We now end this section with some more advanced results. Let us first show that Fatou’s lemma implies an easy and useful generalized dominated convergence theorem, see Royden [105, Chap. 4, Thm. 17]. Theorem 3.45 Let f k and gk be sequences in L 1 (Ω) such that (i) | f k (ω)| ≤ gk (ω) a.e., (ii) ( f k , gk ) converges a.e. to ( f, g),
134
3 An Integration Toolbox
(iii) Ω g(ω)dμ(ω). gk (ω)dμ(ω) → Ω Then Ω f k (ω)dμ(ω) → Ω f (ω)dμ(ω). ± Proof Since ψk± := gk ± fk is integrable, nonnegative, and converges a.e. to ψ := g ± f , by Fatou’slemma, Ω ψ ± (ω)dμ(ω) ≤ lim inf k Ω ψk± dμ(ω). Using (iii), it follows that ± Ω f (ω)dμ(ω) ≤ lim inf k Ω (± f k (ω))dμ(ω). The conclusion follows.
We next present Vitali’s convergence theorem. Definition 3.46 (Uniform integrability) Let (Ω, F , P) be a probability space. We say that a set E of measurable functions is uniformly integrable if, for all ε > 0, there exists an Mε > 0 such that E| f |1{| f |>Mε } ≤ ε, for all f ∈ E. Theorem 3.47 Let (Ω, F , P) be a probability space, and f k be a uniformly integrable sequence in L 1 (Ω), with a.s. finite limit f . Then f ∈ L 1 (Ω), and f k → f in L 1 (Ω). Proof Let ε, Mε be as above. Since f k 1 ≤ Mε + ε, f k is bounded in L 1 (Ω), and Corollary 3.43 implies that f ∈ L 1 (Ω). By Egoroff’s Theorem 3.27, for all ε j ↓ 0, there exists an E j ∈ F such that f k → f uniformly over E j , and F j := Ω \ E j has measure less than ε j . Therefore, Fj
| f k (ω)|dP(ω) ≤ Mε |F j | + E| f k |1{| fk |>Mε } ≤ ε j Mε + ε
(3.51)
for all k. Changing E j into ∪i≤ j E i if necessary, we may assume that F j is a nonincreasing sequence, whose intersection has zero measure. By Corollary 3.37, we may fix j such that F j | f (ω)|dμ(ω) ≤ ε and ε j Mε + ε ≤ 2ε, so that F j | f k − f |(ω)dP(ω) ≤ 3ε. Since f k → f uniformly on E j , the conclusion follows. Remark 3.48 The theorem does not hold over a measure space when μ(Ω) is not finite, as Example 3.44 shows. Exercise 3.49 Let Ω := [0, 1] be endowed with Lebesgue’s measure. Let fk (ω) = k over [0, 1/k] and f k (ω) = 0 otherwise. Show that this sequence is not uniformly integrable, and does not satisfy the conclusion of Vitali’s theorem.
3.1.6 L p Spaces Let (Ω, F , μ) be a measure space. For f ∈ L 0 (Ω), set f ∞ := inf{α > 0; | f (ω)| ≤ α a.e.}.
(3.52)
3.1 Measure Theory
135
Let
L ∞ (ω) := { f ∈ L 0 (Ω); f ∞ < ∞}.
(3.53)
It is easily checked that this space, endowed with the norm · ∞ , is a Banach space. Now, for p ∈ [1, ∞) set L p (ω) := { f ∈ L 0 (Ω); | f | p ∈ L 1 (Ω)}.
(3.54)
For f ∈ L p (Ω) we set f p :=
1/ p | f (ω)| dμ(ω)
.
p
Ω
(3.55)
We will check in Lemma 3.53 that this is a norm. Let us prove that L p (Ω) is a vector space. It is enough to check that if f , g in L p (ω), then f + g in L p (ω). Indeed, the function x → |x| p being convex, we have that 2− p f + g pp =
3.1.6.1
Ω
| 21 ( f + g)| p ≤
1 2
Ω
| f |p +
1 2
Ω
|g| p = 21 f pp + 21 g pp . (3.56)
Hölder’s Inequality
Let p ∈ [1, ∞], and q be the conjugate exponent, such that 1/ p + 1/q = 1. The following lemma shows that to every element of L q (Ω) is associated a continuous linear form on L p (Ω): Lemma 3.50 (Hölder inequality) Let f ∈ L p (Ω) and g ∈ L q (Ω). Then f g ∈ L 1 (Ω), and (3.57) f g1 ≤ f p gq . Proof The result is obvious if p ∈ {1, ∞}. So, let p ∈ (1, ∞). Since the inequality (3.57) is positively homogeneous w.r.t. f and g, it is enough to check that f g1 ≤ 1 whenever f p = gq = 1. So, given f ∈ L p (Ω), f p = 1, we need to check that the convex problem below has value not less than −1: Min −
g∈L q (Ω)
Ω
f (ω)g(ω)dμ(ω);
1 q
Ω
|g(ω)|q dμ(ω) ≤
1 . q
(3.58)
We may always assume that f g ≥ 0 a.e., since otherwise we obtain a lower cost by changing g(ω) over −g(ω) on {ω ∈ Ω; f (ω)g(ω) < 0}. We may assume that f (ω) ≤ 0 a.e., in view of the discussion on the sign of f g. We will solve this qualified convex problem by finding a solution to the optimality system, with multiplier λ > 0. The Lagrangian function can be expressed as
136
3 An Integration Toolbox
Ω
− f (ω)g(ω) + λ|g(ω)|q /q dμ(ω) − λ/q,
(3.59)
whose minimum is attained for g ≥ 0 such that − f (ω) + λg(ω)q−1 = 0 a.e., i.e., g(ω) = ( f (ω)/λ) p/q , which is an element of L q (Ω). Since λ > 0 the constraint is binding, and so, 1=
|g(ω)| dμ(ω) = λ q
Ω
−p
Ω
| f (ω)| p dμ(ω) = λ− p ,
(3.60)
so that λ = 1. Finally, integrating the product of f (ω) = −g(ω)q−1 with g(ω), we see that the value of problem (3.58) is −1, as was to be proved. Corollary 3.51 Let 1/ p + 1/q = 1/r with r ≥ 1, f ∈ L p (Ω), and g ∈ L q (Ω). Then f g ∈ L 1 (Ω), and (3.61) f gr ≤ f p gq . Proof Apply the Hölder inequality (3.57) to f := | f |r and g := |g|r .
Corollary 3.52 Let μ(Ω) < ∞, and 1/ p + 1/q = 1/r with r ∈ (1, ∞). Then L p (Ω) ⊂ L r (Ω) and if f ∈ L p (Ω), we have that f r ≤ μ(ω)1/q f p . Proof Apply Corollary 3.51 with g(ω) = 1.
(3.62)
Lemma 3.53 The space L p (Ω) is a normed vector space, so that for any f , g in L p (Ω), the following Minkowski inequality holds: f + g p ≤ f p + g p .
(3.63)
Proof It is enough to check the triangle inequality (3.63) when f and g are nonnegative. Let q be such that 1/ p + 1/q = 1. By Lemma 3.50, since p − 1 = p/q:
Ω
( f + g) p =
Ω
f ( f + g) p−1 +
Ω
g( f + g) p−1 ≤ ( f p + g p )( f + g) p/q q .
(3.64)
Note that 1/q
( f + g) p/q q =
Ω p
( f + g) p
= f + g p/q = f + g p−1 p p .
(3.65)
p−1
We obtain that f + g p ≤ ( f p + g p ) f + g p . The conclusion follows. Note the following variant in L p (Ω) of the dominated convergence Theorem 3.38:
3.1 Measure Theory
137
Theorem 3.54 (Dominated convergence in L p (Ω)) Let the sequence f k of L p (Ω), with p ∈ (1, ∞), converge a.e. to g, and be dominated by h ∈ L p (Ω), in the sense that | f k (ω)| ≤ h(ω) a.e. Then g ∈ L p (Ω), and f k → g in L p (Ω). Proof Apply the dominated convergence Theorem 3.38 to f k := | f k − g| p , which converges a.e. to 0, is integrable and is dominated by the integrable function 2ph p. Remark 3.55 Under the hypotheses of the above theorem, if μ(Ω) < ∞, by Corolr r lary 3.52, for all r ∈ [1, to L (Ω) and f k → g in L (Ω). Taking p), f k and g belong r = 1 it follows that Ω f (ω)dμ(ω) → Ω g(ω)dμ(ω). Next we will check that, for p ∈ (1, ∞), L p (Ω) is complete, by characterizing it as a dual space.
3.1.6.2
Dual Spaces: the Riesz Theorem
In the sequel we will characterize the dual of L p (Ω) spaces. See also Royden [105, Chap. 11] or Lang [68, Chap. VII]. Theorem 3.56 (Riesz representation theorem) Let G be a continuous linear form on L p (Ω), with p ∈ [1, ∞[. Then there exists a g ∈ L q (Ω), with 1/ p + 1/q = 1, such that f (ω)g(ω)dμ(ω), for any f ∈ L p (Ω). (3.66) G( f ) = Ω
Proof We just give the proof in the case when p = 1. (a) Assume first that μ(Ω) < ∞. Then L 2 (Ω) ⊂ L 1 (Ω) with continuous inclusion. Denote by G the restriction of G to L 2 (Ω). By the Cauchy–Schwarz inequality, for all f ∈ L 2 (Ω), we have that (3.67) |G ( f )| ≤ G f 1 ≤ G μ(Ω) f 2 , and therefore G is a continuous linear form over L 2 (Ω). By the Riesz reprethere exists a g ∈ L 2 (Ω) such that G( f ) = sentation theorem for Hilbert spaces, 2 ∞ Ω g(ω) f (ω)dμ(ω), for all f ∈ L (Ω). We next prove that g ∈ L (Ω). Let f k be the characteristic function of the set {ω; g(ω) ≥ n}. Then G( f k ) ≥ k f k 1 and therefore we must have f k = 0 for large enough k. This proves that esssup g < ∞, and by a symmetric argument we obtain that g∈ L ∞ (Ω). Since L 2 (Ω) is a dense subset of L 1 (Ω), it easily follows that G( f ) = Ω g(ω) f (ω)dμ(ω), for all f ∈ L 1 (Ω), and so the conclusion holds. (b) When Ω = ∪k Ω k with μ(Ωk ) < ∞, by the previous arguments, for each k we have that G( f ) = Ωk gk (ω) f (ω)dμ(ω), for all f ∈ L 1 (Ωk ), with gk ∞ ≤ G. We may assume that the Ωk are nondecreasing. Then we may define g ∈ L ∞ (Ω) by g(ω) = gk (ω) for all ω ∈ Ωk and all k. Given f ∈ L 1 (Ω), let f k (ω) = f (ω) if
138
3 An Integration Toolbox
ω ∈ Ωk , and f k (ω) = 0 otherwise. By dominated convergence, ( f k , g f k ) → ( f, g f ) in L 1 (Ω), and therefore G( f ) = lim G( f k ) = lim g(ω) f (ω)dμ(ω) = g(ω) f (ω)dμ(ω). (3.68) k
k
Ωk
Ω
The result follows.
Remark 3.57 The conclusion when p ∈ (1, 2] can be obtained in a similar way, using again the Riesz representation theorem for Hilbert spaces. For p ∈ (2, ∞) the idea is to decompose a continuous linear form into the difference of nonnegative linear forms. Applying such a nonnegative linear form G to characteristic functions, we obtain a measure with value zero on negligible sets. It can be proved then that this measure has a density g w.r.t. μ, and g is in L q (Ω).
3.1.6.3
The Brézis–Lieb Theorem
A somewhat surprising improvement of Fatou’s lemma is due to Brézis and Lieb [29]. Theorem 3.58 Let f k be a bounded sequence in L p (Ω), p ∈ [1, ∞[, converging a.e. to some f . Then we have that f ∈ L p (Ω), and in addition, f pp = lim f k pp − f − f k pp . k
(3.69)
Proof That f ∈ L p (Ω) easily follows from Corollary 3.43. We check in Remark 3.59 below that, for any ε > 0, there exists a Cε > 0 such that, for any a, b in R : |a + b| p − |a| p ≤ ε|a| p + Cε |b| p .
(3.70)
Set h k (ω) := | f k (ω)| p − | f k (ω) − f (ω)| p − | f (ω)| p and gk (ω) := |h k (ω)| − ε| f k (ω) − f (ω)| p + .
(3.71)
Obviously h k → 0 a.e., and so does gk . Taking a := f k (ω) − f (ω) and b := f (ω) in (3.70), we obtain that |h k (ω)| ≤ || f k (ω)| p − | f k (ω) − f (ω)| p | + | f (ω)| p ≤ ε| f k (ω) − f (ω)| p + (1 + Cε )| f (ω)| p ,
(3.72)
so that |gk (ω)| ≤ (1 + Cε )| f (ω)| p . By the Corollary 3.43 of Fatou’s lemma, | f | p is integrable. So, by the dominated convergence Theorem 3.38, gk → 0 in L 1 (Ω). p On the other hand, |h k (ω)| ≤ gk (ω) + ε| f k (ω) − f (ω)| , and so, lim supk h k 1 = 1 O(ε). Therefore, h k → 0 in L (ω), and so, Ω h k (ω)dμ(ω) → 0. The conclusion follows.
3.1 Measure Theory
139
Remark 3.59 The inequality (3.70) can be justified as follows. For p = 1 it is trivial. Now let p ∈ (1, ∞). If |b| > 2|a|, then |a + b| p − |a| p = |a + b| p − |a| p ≤ 2 p |b| p ,
(3.73)
and the desired relation holds with Cε = 2 p . Otherwise, by the mean value theorem, we have that, for some θ ∈]0, 1[: |a + b| p − |a| p = p|a + θ b| p−1 |b| ≤ 3 p−1 p|a| p−1 |b|.
(3.74)
Let q be such that 1/ p + 1/q = 1. We conclude by using Young’s inequality (1.81): αβ ≤ α q /q + β p / p (for α, β nonnegative) with α = (qε)1/q |a| p−1 and β := (qε)−1/q 3 p−1 p|b|. The desired relation holds with Cε = max(2 p , (qε)− p/q 3 p( p−1) p p−1 ).
(3.75)
3.1.7 Bochner Integrals We need to discuss integrals with values in a Banach space Y . Given a measure space (Ω, F , μ), by L 0 (Ω; Y ) we denote the space of measurable functions of (Ω) with image Y ; remember that the Banach space Y is implicitly endowed with the Borel σ algebra (the one generated by open subsets), so that f ∈ L 0 (Ω; Y ) iff, for any Borel subset A of Y , f −1 (A) is Lebesgue measurable. The subspace of simple functions 00 (with finitely many values except on a null n set of [Ω]) is denoted by L (Ω; Y ). Simple functions can be written as f = i=1 yi 1 Ai , where yi ∈ Y , and the Ai are measurable subsets of [Ω], with negligible intersections. We may define the integral and norm of the simple function f by Ω
f (ω)dω :=
n
yi mes(Ai ); f 1,Y :=
i=1
n
yi Y mes(Ai ).
(3.76)
i=1
Note that f 1,Y =
Ω
f (ω)Y dω, for all f ∈ L 00 (Ω; Y ).
(3.77)
The space L 1 (Ω; Y ) of (Bochner) integrable functions is obtained, as is done for the Lebesgue integral, by passing to the limit in Cauchy sequences of simple functions. If f k is such a sequence, extracting a subsequence if necessary, we may assume that f q − f p 1,Y ≤ 2−q for any q < p, so that the series f k+1 − f k 1,Y is convergent. Consider the series sk (ω) := f k+1 (ω) − f k (ω)Y and the corresponding sums Sk (ω) := ≤k sk (ω). By the monotone convergence theorem, Sk converges
140
3 An Integration Toolbox
in L 1 (Ω) to some S∞ , and (being nondecreasing) converges also for a.a. ω. So, for a.a. ω, the normally convergent sequence f k (ω) has a limit f (ω) in Y , such that f (ω) − f k (ω)Y ≤ S∞ (ω) − Sk (ω). Therefore Ω
(3.78)
f (ω) − f k (ω)Y dω ≤
Ω
|S∞ (ω) − Sk (ω)|dω = o(1).
(3.79)
We define the integral and norm of f as the limit of those of the f k . These integral and norm of f are well defined, since they coincide for every Cauchy sequence of simple functions having the same limit. Indeed, let f k be another Cauchy sequence of simple functions for the L 1 norm, converging to f for a.a. ω. By (3.77), (3.79) applied to f k and f k , and the triangle inequality: f k − f k 1,Y = ≤
Ω Ω
f k (ω) − f k (ω)1 dω
f k (ω)
− f (ω)1 dω +
Ω
(3.80) f (ω) − f k (ω)1 dω
converges to 0, so that gk := f k − f k converges to zero both in L 1 and (by the previous discussion) a.e. Remark 3.60 By (3.77) and (3.79), we have that
f 1,Y = lim f k 1,Y = lim k
k
Ω
f k (ω)Y dω =
Ω
f (ω)Y dω,
(3.81)
the last equality being a consequence of the dominated convergence theorem; the domination hypothesis holds since f k (ω)Y ≤ f 0 (ω)Y + S∞ (ω) and the r.h.s. belongs to L 1 (Ω). Remark 3.61 An element of L 0 (Ω; Y ) is said to be Bochner measurable (or strongly measurable) if it has values (up to a null measure subset of (Ω)) in a separable subspace of Y (we recall that a subspace is separable if it contains a dense sequence). Being an a.e. limit of simple functions, an element of L 1 (Ω; Y ) is strongly measurable. Conversely, let f be strongly measurable. By Remark 3.16, f is a limit a.e. of simple functions. If in addition f (ω)Y is integrable, using the σ -finiteness hypothesis (3.10) we easily deduce that f ∈ L 1 (Ω; Y ). In general, we have the strict inclusion 1 0 f (ω)Y dω < ∞ . (3.82) L (Ω; Y ) ⊂ f ∈ L (Ω; Y ); Ω
Note that there is a version of the dominated convergence theorem for Bochner integrals, see also Aliprantis and Border [3, Thm. 11.46]:
3.1 Measure Theory
141
Theorem 3.62 Let f k be a sequence in L 1 (Ω; Y ), converging a.e. to f ∈ L 0 (Ω; Y ), such that f k (ω)Y ≤ g(ω) for a.a. ω, where g ∈ L 1 (Ω). Then f ∈ L 1 (Ω; Y ), and f k → f in L 1 (Ω; Y ). Proof Let gk (ω) := f k (ω) − f (ω)Y . Then gk → 0 a.e. and gk (ω) ≤ 2g(ω)Y a.e. By the (standard) dominated convergence theorem, gk → 0 in L 1 (Ω). Extracting a subsequence if necessary, we may assume that gk L 1 (Ω) ≤ 2−k . Then f q − f k L 1 (Ω;Y ) ≤ Ω f q (ω) − f (ω)Y dμ(ω) + Ω f (ω) − f k (ω)Y dμ(ω) = Ω (gq (ω) + gk (ω))dμ(ω) (3.83) converges to 0. That is, f k is a Cauchy sequence in L 1 (Ω; Y ). Being constructed as a set of limits of Cauchy sequences, L 1 (Ω; Y ) is necessarily complete, and we have seen that convergence in this space implies convergence a.e. for a subsequence. Since f k → f a.e, it follows that f k → f in L 1 (Ω; Y ). The conclusion follows. Example 3.63 Let Y = C(X ), the space on continuous functions over the metric compact set X , known to be separable (as a consequence of the Stone–Weierstrass theorem). Then L 1 (Ω, Y ) coincides with the set of measurable functions f : Ω → Y such that f (x, ω)Y is integrable, and f L 1 (Ω,Y ) =
max | f (x, ω)|dμ(ω).
Ω x∈X
(3.84)
By the above dominated convergence theorem, if f k ∈ L 1 (Ω, Y ) satisfies the domination hypothesis, and if f k (·, ω) → f (·, ω) a.e. in C(X ), i.e., maxx∈X | f k (x, ω) − f (x, ω)| → 0 a.e., then f k → f in L 1 (Ω, Y ), that is, max | f k (x, ω) − f (x, ω)|dμ(ω) → 0.
Ω x∈X
(3.85)
3.2 Integral Functionals ¯ We consider an optiLet (Ω, F , μ) be a measure space, and let f : Ω × Rm → R. mization problem of the form Min p
u∈L (Ω;Rm )
F(u) :=
Ω
f (ω, u(ω))dμ(ω); u(ω) ∈ U a.e.,
(3.86)
where p ∈ [1, ∞], and U ⊂ Rm . We adopt the following definition of the domain of an integral cost, valid in the context of a minimization problem. Definition 3.64 Let F be as above. We define its domain as
142
3 An Integration Toolbox
dom(F) :=
u ∈ L p (Ω; Rm ); f (ω, u(ω)) is measurable . f (ω, u(ω))+ is integrable
(3.87)
Denote the set of elements of L p (Ω; Rm ) with values a.e. in U by L p (Ω; U ) := {u ∈ L p (Ω)m ; u(ω) ∈ U a.e.}.
(3.88)
Intuitively, we would expect that the infimum in (3.86) is obtained by the exchange property of the minimization and integration operators, i.e.
inf
u∈L p (Ω;U ) Ω
f (ω, u(ω))dμ(ω) =
inf f (ω, v)dμ(ω).
Ω v∈U
(3.89)
This, however, raises some technical issues, the first of them being to check that the r.h.s. integral is well-defined. We will solve this problem assuming that f is a Carathéodory function, and then in the case when in addition the local constraint depends on ω. We will analyze conjugate functions, also in the case of more general convex integrands, and discuss the related problem of minimizing such an integral subject to the constraint that some integrals of the same type are nonpositive.
3.2.1 Minimization of Carathéodory Integrals Definition 3.65 We say that f : Ω × Rm → R is a Carathéodory function if, for a.a. ω, f (ω, ·) is continuous, and if, for all v ∈ Rm , f (·, v) is measurable. Lemma 3.66 Let f be a Carathéodory function. Then ω → f (ω, u(ω)) is measurable, for all u ∈ L 0 (Ω; Rm ). 0 m Proof By Lemma 3.13, u ∈ L (Ω; R ) is the limit a.e. of a sequence of simple functions u k (ω) = i∈Ik u ki 1{ω∈Aki } , where the Ik are finite sets, and Aki are measurable sets with null measure intersections. Therefore f (ω, u ki )1{ω∈Aki } (3.90) f (ω, u k (ω)) = i∈Ik
is measurable and converges a.e. to f (ω, u(ω)). We conclude by Lemma 3.26.
Proposition 3.67 Let f be a Carathéodory function, and dom(F) be nonempty. Then ω → inf v∈U f (ω, v) is a measurable function, and the exchange property (3.89) holds. Proof (a) Let uˆ ∈ dom(F). Consider a dense sequence ak in Rm . Let bk ∈ U be such that |ak − bk | ≤ 2 dist(ak , U ). Then bk is a dense sequence in U . Let the sequence u k of functions Ω → Rm be inductively defined by u 0 = u, ˆ and for k ≥ 1:
3.2 Integral Functionals
143
u k (ω) =
u k−1 (ω) if f (ω, u k−1 (ω)) ≤ f (ω, bk ), otherwise. bk
(3.91)
(b) Then u k is measurable and f (ω, u k (ω)) is a nonincreasing function of k. Since ω → f (ω, u 0 (ω)) ∈ dom(F), it follows that f (ω, u k (ω)) ∈ dom(F) as well. Since bk is a dense sequence in U , and f (ω, ·) is continuous, we have that lim f (ω, u k (ω)) = inf f (ω, v), k
v∈U
(3.92)
proving that the r.h.s. is measurable. If Ω f (ω, u k (ω)dμ(ω) = −∞ for large enough k, then the equality (3.89) holds with value −∞. Otherwise, we conclude by the monotone convergence Theorem 3.34.
3.2.2 Measurable Multimappings We next discuss a more general case where we have a constraint of the form u(ω) ∈ U (ω), for a.a. ω ∈ Ω,
(3.93)
where U is a multimapping Ω → P(Rm ). We say that U is measurable if, for any closed set C ⊂ Rm , U −1 (C) is measurable, and that U is closed-valued if U (ω) is closed for a.a. ω. Given a measurable multimapping U , for p ∈ [1, ∞], consider the set (3.94) L p (Ω; U ) := u ∈ L p (Ω; Rm ) ; u(ω) ∈ U (ω), a.a. ω ∈ Ω . Definition 3.68 Let U be a multimapping Ω → P(Rm ). We call a sequence u k in L p (Ω; U ) such that, for a.a. ω, U (ω) is the closure of {u k (ω), k ∈ N} a Castaing representation of U in L p (Ω; Rm ). By a result due to C. Castaing (see e.g. [102, Thm. 1B, p. 161]), any measurable multimapping with closed values has a Castaing representation. We next prove this result. We first need to properly define a single-valued projection on a nonconvex, nonempty closed set. Let C be a closed subset of Rm . For z ∈ Rm , set Pz C := {c ∈ C; |z − c| = dist(z, C)}. Next, let z 0 , . . . , z m be affinely independent (i.e., not included in a hyperplane). Set πz 0 ...,z m C := Pz 0 · · · Pz m C.
(3.95)
It can be proved by induction that the intersection of k + 1 spheres with affinely independent centers in Rm is a sphere in a subspace of dimension less than m − k. It follows that πz 0 ,...,z m is a singleton. Definition 3.69 Assuming that z m = 0, we define the projection of a point a ∈ Rm over a closed set C by:
144
3 An Integration Toolbox
Pˆa (C) := πa+z 0 ...,a+z m C.
(3.96)
Clearly, Pˆa (C) ∈ Pa C, so that if C is convex we recover the usual projection on a convex set. We next denote by I the countable set of affinely independent elements of (Rm )m , with rational coordinates. We say that the multimapping U (·) is compatible with the space L p (Ω; Rm ) if the function dist(0, U (ω)) (which by the proposition below is measurable) belongs to L p (Ω; Rm ). Proposition 3.70 Let U : Ω → P(Rm ) be a measurable and closed-valued multimapping. Then: (i) For any b ∈ I , the map ω ∈ Ω → πˆ b (ω) := πb U (ω) ∈ Rm is measurable. (ii) If U (ω) is compatible with L p (Ω; Rm ), then the family {πˆ b (ω), b ∈ I } is a Castaing representation of U . Proof Since (ii) easily follows from (i), it suffices to prove the latter. We essentially reproduce the arguments in [102]. Let a ∈ Rm . Since Pˆa is a composition of projections, it suffices to prove that, if Γ (ω) is a measurable closed-valued multimapping, then Pa (ω) := Pa Γ (ω) is measurable. For this, consider the sequence of multimappings Γ k (ω) := {v ∈ Rm ; dist(v, Γ (ω)) < k −1 ; |v − a| < dist(a, Γ (ω)) + k −1 }. (3.97) Let C be a closed subset of Rm . Then Pa (ω) ∈ C iff C ∩ Γ k (ω) = ∅ for all k, and thus (3.98) Pa−1 (C) = ∩k Γk−1 (C). Next, let D be a countable dense subset of C, which always exists. We claim that (Γ k )−1 (C) = (Γ k )−1 (D) = ∪d∈D (Γ k )−1 (d).
(3.99)
The second equality is obvious and, since D is a dense subset of C, in order to establish the first equality it suffices to check that if c ∈ C and ω0 ∈ (Γ k )−1 (c), then for c close enough to c we have that ω0 ∈ (Γ k )−1 (c ). But this follows directly from the definition of Γ k (ω) in (3.97). Our claim follows. On the other hand, for any v ∈ Rm and α ≥ 0 we have {ω ∈ Ω ; dist(v, Γ (ω)) < α} = Γ −1 (v + α B),
(3.100)
where B is the unit ball in Rm . Thus, since Γ is measurable, so is the process dist(v, Γ (ω)). Therefore, from the definition (3.97), for any (ω, v) ∈ Ω × Rm we have that (Γ k )−1 (v) is measurable. By (3.98), (3.99), Pˆa is an intersection of unions of measurable sets, and is therefore itself measurable. We apply the previous result to the problem of minimizing an integral cost.
3.2 Integral Functionals
145
Proposition 3.71 Let f : Ω × Rm → R be a Carathéodory function, and U (ω) be a measurable, closed-valued multimapping from Ω to Rm . Assume that there exists a uˆ in dom(F) ∩ L p (Ω, U ). Then the following exchange property holds:
inf p
u∈L (Ω;U ) Ω
f (ω, u(ω))dμ(ω) =
inf
Ω v∈U (ω)
f (ω, v)dμ(ω).
(3.101)
Proof Let ak be a Castaing representation of the multimapping U . Consider the ˆ and for k ≥ 1: sequence u k defined by u 0 := u, u k (ω) =
u k−1 (ω) if f (ω, u k−1 (ω)) ≤ f (ω, ak (ω)), ak (ω) otherwise.
(3.102)
We conclude, as in step (b) of the proof of Proposition 3.67, by the monotone convergence Theorem 3.34. Remark 3.72 If U (ω) is, for a.a. ω, a finite set, then the above minimizing sequence u k converges a.e. to some u¯ ∈ L 0 (Ω), with values in U (ω). By the monotone convergence Theorem 3.34, we have that
inf p
u∈L (Ω;U ) Ω
f (ω, u(ω))dμ(ω) =
Ω
f (ω, u(ω))dμ(ω). ¯
(3.103)
3.2.3 Convex Integrands In the case of convex integrands (such that f (ω, ·) is, for a.a. ω, convex) we can deal with integral functionals using the following result. ¯ and E be a dense Lemma 3.73 Let g be a proper, l.s.c. convex function Rm → R, subset of dom(g). Then, for all y ∈ dom(g), we have that g(y) = lim inf{g(x); x ∈ E, x → y}.
(3.104)
Proof Denote by g(y) ˆ the r.h.s. of (3.104). Since g is l.s.c., g(y) ≤ g(y). ˆ We next prove the opposite inequality. Changing if necessary Rm into the affine space spanned by dom(g), we may assume that the latter has a nonempty interior. We know that g is continuous over the interior of its domain. Since E is a dense part of dom(g), if y ∈ int(dom(g)), then (3.104) holds, and hence, for all y ∈ dom(g): g(y) ˆ ≤ lim inf{g(x); x → y; x ∈ int(dom(g))}.
(3.105)
Let y ∈ dom(g), y0 ∈ int(dom(g)), and t ∈ [0, 1]. Set yt := (1 − t)y0 + t y, with t ∈ (0, 1). Since t → g(yt ) is l.s.c. convex, we have
146
3 An Integration Toolbox
g(y) ˆ ≤ lim sup g(yt )) ≤ lim sup ((1 − t)g(y0 ) + tg(y)) = g(y), t↑1
(3.106)
t↑1
as was to be shown.
Proposition 3.74 Let f : Ω × Rm → R be such that, for a.a. ω, f (ω, ·) is l.s.c. convex with a nonempty interior, and for all v ∈ Rm , f (·, v) is measurable. Assume that there exists a uˆ ∈ dom(F). Then the exchange property (3.101) holds. Proof The proof is similar to that of Proposition 3.67, with here U (ω) = Rm . Given a dense sequence ak in Rm , set u k (ω) =
u k−1 (ω) if f (ω, u k−1 (ω)) ≤ f (ω, ak ), otherwise. ak
(3.107)
Then u k is measurable, and limk f (ω, u k (ω)) = inf v∈U f (ω, v) in view of Lemma 3.73. We next deal with the more general situation when dom( f (ω, ·)) may have an empty interior. ¯ is a normal inteDefinition 3.75 Let p ∈ [1, ∞]. We say that f : Ω × Rm → R grand if the multimapping dom( f (ω, ·)) has a Castaing representation, i.e., if there exists a sequence u k in L p (Ω)m such that {u k (ω)} is dense in dom( f (ω, ·)), for a.a. ω. If in addition f (ω, ·) is l.s.c. convex for a.a. ω, we say that f is a normal convex integrand. Proposition 3.76 Let f : Ω × Rm → R be a normal convex integrand. Then the exchange property (3.101) holds. Proof The proof is an easy variant of that of Proposition 3.74. The details are left to the reader. Remark 3.77 The difficulty here is to check the existence of a Castaing representation in Definition 3.75. If dom( f (ω, ·)) is a closed-valued measurable multimapping, this follows from Proposition 3.70. If f is a convex integrand and dom( f (ω, ·)) has a nonempty interior a.e., a Castaing representation is the sequence u k constructed in the proof of Proposition 3.74.
3.2.4 Conjugates of Integral Functionals 3.2.4.1
Case p < ∞
As we have seen, when p ∈ [1, ∞), the dual of L p (Ω)m is L q (Ω)m , where 1/ p + 1/q = 1, and when p = ∞, its dual contains L 1 (Ω)m . Let U (·) be a measurable
3.2 Integral Functionals
147
¯ consider the multimapping over Ω with image in Rm . Given f : Ω × Rm → R, ¯ function F : L p (Ω)m → R, F(u) := f (ω, u(ω))dμ(ω), (3.108) Ω
with domain dom(F) := {u ∈ L p (Ω)m ; f (ω, u(ω)) is measurable; f (ω, u(ω))+ ∈ L 1 (Ω)}, (3.109) and (3.110) FU (u) := F(u) if u ∈ L p (Ω; U ), +∞ otherwise, with domain dom(FU ) := dom(F) ∩ L p (Ω; U ).
(3.111)
Let u ∗ ∈ L q (Ω)m . Then FU∗ (u ∗ ) :=
sup
u∈dom(FU ) Ω
∗ u (ω) · u(ω) − f (ω, u(ω)) dμ(ω).
(3.112)
This amounts to minimizing the integral of ω → f (ω, u(ω)) − u ∗ (ω) · u(ω) over L p (Ω, U ). The latter is Carathéodory (resp. a normal convex integrand) iff the same holds for f (ω, u). Set fU (ω, u) := f (ω, u) + IU (ω) (u),
(3.113)
whose Fenchel conjugate is fU∗ (ω, u ∗ ) := sup
u∈U (ω)
∗ u (ω) · u − f (ω, u) .
(3.114)
As a consequence of Propositions 3.71 and 3.76, we obtain the following statements: Proposition 3.78 Let f : R × Rm → R be a Carathéodory function, and U (ω) be a measurable, closed-valued multimapping, such that dom(FU ) = ∅. Let p ∈ [1, ∞], and u ∗ ∈ L q (Ω)m . Then ∗ ∗ FU (u ) := fU∗ (ω, u ∗ (ω))dμ(ω). (3.115) Ω
Corollary 3.79 Let f , p and u ∗ be as in Proposition 3.78, and let F have a finite value at u. Then u ∗ ∈ ∂ FU (u) iff u ∗ (ω) ∈ ∂ fU (ω, u(ω)) a.e. Proof By the above proposition, the Fenchel–Young inequality for F reads
148
3 An Integration Toolbox
Ω
fU (ω, u(ω)) + fU∗ (ω, u ∗1 (ω)) − u ∗1 (ω) · u(ω) dμ(ω) ≥ 0,
(3.116)
and u ∗ ∈ ∂ FU (u) iff equality holds, i.e., iff the above integrand is equal to 0 a.e. The conclusion follows. Proposition 3.80 Let f : Ω × Rm → R be a normal convex integrand. Let p ∈ [1, ∞], and u ∗ ∈ L q (Ω)m . If dom(F) = ∅, then F ∗ (u ∗ ) :=
Ω
f ∗ (ω, u ∗ (ω))dμ(ω).
(3.117)
Corollary 3.81 Let f , p and u ∗ be as in Proposition 3.80, and let F have a finite value at u. Then u ∗ ∈ ∂ F(u) iff u ∗ (ω) ∈ ∂ f (ω, u(ω)) a.e. Proof The argument is similar to the one in the proof of Corollary 3.79.
Example 3.82 We extend Example 1.38 to the present setting as follows. Take f (x) := |x| p / p with p > 1. Then for u ∗ ∈ L p (Ω), with 1/ p + 1/q = 1, we have that 1 u ∗ (ω)qq dμ(ω). (3.118) F ∗ (u ∗ ) = q Ω 3.2.4.2
General Case When p = ∞
/ L 1 (Ω)m . We need the following We next consider the case when p = ∞, and u ∗ ∈ ∞ ∗ characterization of elements of L (Ω) . Lemma 3.83 Each u ∗ ∈ L ∞ (Ω)∗ has the unique decomposition u ∗ = u 1 + u s , where the regular part u 1 belongs to L 1 (Ω), and the singular part u s is such that there exists a nondecreasing sequence Ak of measurable subsets of Ω, such that Ω = ∪k Ak , and that for any k ∈ N, u s , u = 0, for all u ∈ L ∞ (Ω) with zero value on Ω \ Ak . Proof This difficult result is due to Yosida and Hewitt [126]. See also Castaing and Valadier [33, Chap. 8] (it is convenient to say that u s is concentrated on the complement of the Ak , in the sense of the above definition). Example 3.84 Take Ω = N, endowed with a probability measure μ such that each “basis” element ei (a sequence of zeros except for the ith term, which is equal to 1) has a positive probability. Let X := ∞ be the space of bounded sequences. Given ∗ u∗ ∈ X ∗ , set ai := u ∗ , ei , i ∈ N. Then the regular part is defined by u 1 , u = ∞ i∈N ai u i , for all u ∈ , with u = i∈N u i ei , and the singular part depends only on the behavior at infinity of u. We can construct a singular element of X ∗ as follows. If x ∈ X has a limit, denote it by lim(x). This is a continuous linear form over the subspace X 1 of sequences having a limit. Then extend this linear form over X thanks to Corollary 1.8.
3.2 Integral Functionals
149
¯ Lemma 3.85 Let f be a normal convex integrand, and F : L ∞ (Ω; Rm ) → R, defined by F(u) = Ω f (ω, u(ω))dμ(ω), be proper. Let u ∗ ∈ L ∞ (Ω; Rm )∗ have decomposition u ∗1 + u ∗s as in Lemma 3.83. Then we have the decoupling property: ⎧ ⎨ F ∗ (u ∗ ) = F ∗ (u ∗1 ) + σ (u ∗s , dom(F)), ⎩ F ∗ (u ∗1 ) =
(3.119)
f ∗ (u ∗1 (ω), ω)dμ(ω).
Ω
Proof The second relation follows from Proposition 3.80; let us prove the first one. For all α < F ∗ (u ∗1 ), there exists a u α ∈ dom(F) such that α < u ∗1 , u α −
Ω
f (ω, u α (ω))dμ(ω).
(3.120)
For all β < σ (u ∗s , dom(F)), there exists a u β ∈ dom(F) such that u ∗s , u β > β. Let Ak be as in the definition of a singular part of an element of L ∞ (Ω). Set u α,β,k (ω) := Then
u α (ω) if ω ∈ Ak , u β (ω) otherwise.
(3.121)
u ∗s , u α,β,k = u ∗s , 1Ω\Ak u β = u ∗s , u β > β.
(3.122)
For a.a. ω, u α,β,k (ω) = u α (ω) for large enough k, so that u α,β,k (ω) → u α and f (ω, u α,β,k (ω)) → f (ω, u α (ω)) a.e., when k ↑ ∞. (3.123) So, by the dominated convergence theorem (note that, by the definition of u α and u β , f (ω, u α (ω)) and f (ω, u β (ω)) are integrable), we get limk
Ω
u ∗1 (ω) · u α,β,k (ω) − f (ω, u α,β,k (ω)) dμ(ω) ∗ = u 1 (ω) · u α (ω) − f (ω, u α (ω)) dμ(ω),
(3.124)
Ω
and so F (u ) ≥ lim u ∗1 + u ∗s , u α,β,k − ∗
∗
k
Ω
f (ω, u α,β,k (ω))dμ(ω) > α + β. (3.125)
Maximizing over α and β, we get F ∗ (u ∗ ) ≥ F ∗ (u ∗1 ) + σ (u ∗s , dom(F)). The opposite inequality is obvious, since for u ∈ dom(F), by the Fenchel–Young inequality for F ∗ (u ∗1 ), we get u ∗ , u − F(u) = u ∗1 , u − F(u) + u ∗s , u ≤ F ∗ (u ∗1 ) + σ (u ∗s , dom(F)). (3.126)
150
3 An Integration Toolbox
The conclusion follows.
Corollary 3.86 Let F be as in Lemma 3.85, with finite value at u. Then u ∗ = u ∗1 + u ∗s belongs to ∂ F(u) iff the two conditions below hold:
(i) u ∗1 ∈ ∂ F(u), i.e., u ∗ (ω) ∈ ∂ f (ω, u(ω)) a.e., (ii) u ∗s , u = σ (u ∗s , dom(F)), i.e., u ∗s ∈ Ndom(F) (u).
(3.127)
Proof By Lemma 3.85, the Fenchel–Young inequality for F reads as
Ω
f (ω, u(ω)) + f ∗ (ω, u ∗1 (ω)) − u ∗1 (ω) · u(ω) dμ(ω) +σ (u ∗s , dom(F)) − u ∗s , u ≥ 0,
(3.128)
and u ∗ ∈ ∂ F(u) iff the sum equals 0, i.e., iff both the integral and Δ := (σ (u ∗s , dom(F)) − u ∗s , u)
(3.129)
are equal to 0 (since each of these two terms is nonnegative). Now Δ = 0 iff u ∗s , u − u ≤ 0, for all u ∈ dom(F), i.e., Δ = 0 iff u ∗s ∈ Ndom(F) (u). The conclusion follows.
(3.130)
Remark 3.87 With the previous notation, if u ∈ int(dom(F)), then Ndom(F) (u) = {0} and ∂ F(u) ⊂ L 1 (Ω, Rm ).
(3.131)
3.2.5 Deterministic Decisions in Rm Consider now the case when the decision x ∈ Rm should not depend on ω. We have to minimize f (ω, x)dμ(ω), (3.132) f¯(x) := Ω
where x ∈ Rm s and f is a normal convex integrand. Set Y := L p (Ω, Rm ), with p ∈ [1, ∞]. We need that Y includes constant functions, so that μ(Ω) < ∞, and so, we may assume that (Ω, F , μ) is a probability space. Denote by A the operator that ¯ by to x ∈ Rm associates the constant function on Y with value x. Define F : Y → R F(y) :=
Ω
f (ω, y(ω))dμ(ω).
(3.133)
3.2 Integral Functionals
151
Then f¯ = F ◦ A, F is convex, and assuming that for some g ∈ L 1 (ω): f (ω, y(ω)) ≥ g(ω) a.e.
(3.134)
it follows from Fatou’s Lemma 3.41 and the l.s.c. of f (ω, ·) a.s. that F is l.s.c. Given a nonempty closed, convex subset K of Rm , consider the problem Min f¯(x).
(P)
x∈K
We are in the framework of the Fenchel duality theory in Example 1.2.1.8. The expression of the stability condition (1.203) becomes ε BY ⊂ dom(F) − AK , for some ε > 0.
(3.135)
By the subdifferential calculus rules (Lemma 1.120) if f¯ has a finite value at x ∈ Rm , then: (3.136) ∂ f¯(x) ⊃ A ∂ F(Ax), with equality if (3.135) holds. Let us give the expression of A . Here ( p, q) are such that 1/ p + 1/q = 1. Definition 3.88 Let ys∗ be a singular element of L ∞ (Ω, Rm )∗ . Let 1 denote the constant function of L ∞ (Ω) with value 1. Then we define the expectation of ys∗ by, for i = 1 to n: (3.137) (Eys∗ )i = ysi∗ , 1. Lemma 3.89 (i) If y ∗ ∈ L q (Ω, Rm ), then A y ∗ = Ω y ∗ (ω)dμ(ω) = Ey ∗ . (ii) When p = ∞, if y ∗ ∈ L ∞ (Ω, Rm )∗ has the decomposition y ∗ = y1∗ + ys∗ , with y1∗ ∈ L 1 (Ω, Rm ), and ys∗ singular with components denoted by ysi∗ , i = 1, . . . , n, we have (3.138) A y ∗ = Ey1∗ + Eys∗ . Proof Point (i) follows from y ∗ , Ax =
Ω
y ∗ (ω) · xdμ(ω) =
Point (ii) follows from y ∗ , Ax =
Ω
Ω
y ∗ (ω)dμ(ω) · x.
m y1∗ (ω)dμ(ω) · x + i=1 xi ysi∗ , 1.
(3.139)
We deduce the following result. Proposition 3.90 Let f be a normal convex integrand such that f¯ has a finite value at x and that (3.134), (3.135) hold. Then, when p ∈ [1, ∞[ : ∂ f¯(x) =
Ω
x (ω)dμ(ω); x ∈ L (Ω); x (ω) ∈ ∂ f (ω, x) p.s. , ∗
∗
q
∗
(3.140)
152
3 An Integration Toolbox
and when p = ∞, for some singular xs∗ : ⎧ ⎫ ∗ ∗ ∗ 1 ⎨ x1 (ω)dμ(ω) + Exs ; x1 ∈ L (Ω); ⎬ ∂ f¯(x) = . Ω ⎩ ∗ ⎭ ∗ x1 (ω) ∈ ∂ f (ω, x) p.s.; xs ∈ Ndom(F) (x1)
(3.141)
Corollary 3.91 If x is a solution of (P), and (3.134), (3.135) hold, we deduce from the above proposition that, if p ∈ [1,∞), then Ω x ∗ (ω)dμ(ω) + N K (x) 0, with x ∗ as in (3.140), and if p = ∞, then Ω x1∗ (ω)dμ(ω) + Exs∗ + N K (x) 0, with x1∗ , xs∗ as in (3.141). Remark 3.92 The sum in the first line of Equation 3.141 reduces to E(x1∗ + xs∗ ) (where the expectation of the sum is defined as the sum of expectations, which is correct since the decomposition is unique).
3.2.6 Constrained Random Decisions We consider a more general situation where the decision is a Banach space different from L p (Ω), having in mind the case when ω is a vector and the decision might depend on some components of the vector. So let X be a Banach space and let A ∈ L(X, L p (Ω)). Given a closed convex subset K of X , we consider the problem Min f¯(x), x∈K
(P)
where F(y) := Ω f (ω, y(ω))dμ(ω), and f¯(x) := F(Ax). We assume that the following optimality condition (similar to (3.135)) holds: ε BY ⊂ dom(F) − AK .
(3.142)
By the same arguments as before we obtain that Proposition 3.93 Let f be a normal convex integrand such that f¯ has a finite value at x, and that (3.134) and (3.142) hold. Then (i) when p ∈ [1, ∞[ : ∂ f¯(x) = A x ∗ ; x ∗ ∈ L q (Ω); x ∗ (ω) ∈ ∂ f (ω, x) p.s. ,
(3.143)
and when p = ∞, for some singular xs∗ : ∂ f¯(x) =
A (x1∗ + xs∗ ) x1∗ ∈ L 1 (Ω); . x1∗ (ω) ∈ ∂ f (ω, x) p.s.; xs∗ ∈ Ndom(F) (x1)
(3.144)
(ii) If x is solution of (P), say when p = ∞, we have, with x1∗ and xs∗ as above, that
3.2 Integral Functionals
153
A (x1∗ + xs∗ ) + N K (x) 0.
(3.145)
We next apply this result when Ω = Ω1 × Ω2 , (Ωi , Fi , μi ) are measure spaces for i = 1, 2, F is the product σ -algebra and μ is the product of μ1 and μ2 . We take X = L p (Ω1 )m , that is, the decision x may depend on ω1 , but not on ω2 . Then A is the embedding from X = L p (Ω1 )m into Y = L p (Ω)m , and so, A is the restriction of x ∗ ∈ Y ∗ to the subspace X . If u ∗ ∈ L q (Ω)m , its restriction v∗ := A u ∗ is such that for any v ∈ L p (Ω1 )m :
u ∗ (ω1 , ω2 ) · v(ω1 )dμ(ω) Ω ∗ = v(ω1 ) · u (ω1 , ω2 )dμ2 (ω2 ) dμ1 (ω1 )
v∗ , v =
Ω1
(3.146)
Ω2
and therefore, for a.a. ω1 : v∗ (ω1 ) =
Ω2
u ∗ (ω1 , ω2 )dμ2 (ω2 ).
(3.147)
3.2.7 Linear Programming with Simple Recourse Let us consider the following problem of linear programming with simple recourse Min c · x + Eω dω · yω x,y
x ∈ Rn+ ; A0 x ≤ b0 ω ω ω yω ∈ Rm + ; A yω = b + M x, a.s.
(3.148)
Here (Ω, F , μ) is a probability space, and (dω , Aω , bω , M ω ) are measurable, essentially bounded vector or matrix functions whose dimensions do not depend on ω. For given x ∈ Rn and ω, the recourse yω is the solution of the following problem: Aω yω = bω + M ω x.
(Pω (x))
−λω · (bω + M ω x); dω + (Aω ) λω ≥ 0. Max ω
(Dω (x))
Min dω · yω ; yω
yω ∈ Rm +;
The linear program dual to (Pω (x)) is λ
Since its feasible set does not depend on x, it is natural to suppose that it is nonempty a.s. (otherwise (3.148) would have infimum −∞ whenever it is feasible). Denote by vω (x) the value of problem (Pω (x)). By linear programming duality theory (Lemma 1.26), we have that
154
3 An Integration Toolbox
vω (x) = val(Pω (x)) = val(Dω (x)) a.s.
(3.149)
Lemma 3.94 We have that vω is a normal convex integrand. Proof (i) It is easily checked that vω (x) is a.s. convex. Since vω (x) = val(Dω (x)) a.s., the latter being a supremum of affine functions of x, it is also l.s.c. 1 (ii) Let y k be a dense sequence in Rm + . Let | · |1 denote the norm in a finitedimensional space. The function ϕω (x) := inf |Aω y k − bω − M ω x|1 k
(3.150)
is measurable, and satisfies ϕω (x) = minm |Aω y − bω − M ω x|1 , y∈R+
(3.151)
the infimum being attained a.e. since it corresponds to the value of a linear program. Therefore, ϕω (x) = 0 iff x ∈ dom vω , and dom vω = ϕω−1 (0) is a.s. nonempty. In addition, let the sequence x j → x be such that ϕω (x j ) → 0. Then there exists a ω j ω ω j ω j sequence y j ∈ Rm + such that |A y − b − M x |1 → 0. It follows that |A y − ω ω b − M x|1 → 0, that is, x j → x and ϕω (x j ) → 0 implies ϕω (x) = 0. Set n G ω := {(yω , x) ∈ Rm + × R+ ;
Aω yω − bω − M ω x = 0}.
(3.152)
(3.153)
By Lemma 1.28, there exists a Hoffman constant cω > 0 such that n dist((yω , x), G ω ) ≤ cω |Aω yω − bω − M ω x|, for all (yω , x) ∈ Rm + × R+ . (3.154) , we obtain Minimizing the r.h.s. over yω ∈ Rm +
dist(x, dom vω ) ≤ inf m dist((yω , x), G ω ) ≤ cω ϕω (x). yω ∈R+
(3.155)
Now it is enough to check that for any bounded closed subset C of Rn , (dom v)−1 (C) is measurable. Let c be a dense sequence in C. We claim that (dom v)−1 (C) = E, where E := {ω ∈ Ω; ∩ε>0 ∪ {B(c , ε); ϕω (c ) ≤ ε}. (3.156) Indeed, For any sequence εk ↓ 0 there exists some cˆk in the sequence c such that |x − cˆk | < εk . Being a minimum of continuous functions, ϕω (·) is u.s.c. and therefore lim supk ϕω (cˆk ) ≤ ϕω (x) = 0. It follows that ω ∈ E.
3.2 Integral Functionals
155
Conversely, let ω ∈ E. Given εk ↓ 0 there exists a sequence ck in C such that ϕω (ck ) ≤ εk . Extracting a subsequence if necessary, we may assume that ck → x ∈ C. By (3.152) we have that ϕω (x) = 0, that is, x ∈ dom(vω ) as was to be proved. The claim (3.156) follows. Since the set E is obviously measurable, the multimapping ω → dom vω is measurable. Since vω (·) is a closed-valued multimapping, we deduce from Proposition 3.70 the existence of a Castaing representation. The conclusion follows. Since vω is a normal convex integrand, we have that inf
y∈L ∞ (Ω)m
{Edω · yω ; yω ∈ F(Pω (x)) a.s.} = Evω (x).
(3.157)
Therefore, the original problem is equivalent to Min c · x + Eω vω (x); x ∈ Rn+ ; A0 x ≤ b0 . x
(3.158)
Consider the qualification condition There exists ε > 0 and xˆ ∈ Rn such that B(x, ˆ ε) ∈ dom vω a.s. xˆ > 0 and A0 xˆ < b0 . (3.159) ¯ by F(z) := Eω vω (z ω ). We recall the definition of expecDefine F : L ∞ (Ω)n → R tation of elements of L ∞ (Ω, Rn )∗ given in Definition 3.88. Theorem 3.95 Let the qualification condition (3.159) hold. If x¯ is a solution of (3.148), then there exists s¯ ∈ Rn+ , λ¯ (ω) ∈ L 1 (Ω), with λ¯ (ω) ∈ S(Dω (x)) a.s., and xs∗ ∈ Ndom(F) (x1), such that c + s¯ + (A0 ) η¯ + Exs∗ − E(M ω ) λ¯ (ω) = 0, s¯ ≥ 0; s¯ · x¯ = 0; η¯ ≥ 0; η¯ · (A0 x¯ − b0 ) = 0.
(3.160)
Proof Denote by ∂vω (x) the partial subdifferential of vω (x) w.r.t x. By Lemma 1.55, we have that (3.161) ∂vω (x) = −(M ω ) S(Dω (x)) a.s. Proposition 3.90, whose hypotheses hold in view of (3.159), imply that ⎫ ⎧ ⎨ x1∗ (ω)dμ(ω) + Exs∗ , 1; x1∗ ∈ L 1 (Ω); ⎬ . ∂ F(x) ¯ = Ω ⎭ ⎩ ∗ x1 (ω) ∈ ∂vω (x) ¯ a.s.; xs∗ ∈ Ndom(F) (x1)
(3.162)
Let P 0 := {x ∈ Rn+ ; A0 x ≤ 0}. Condition (3.159) implies that ¯ 0. c + ∂ F(x) ¯ + N P 0 (x)
(3.163)
156
3 An Integration Toolbox
By linear programming duality q
¯ = {(s, η) ∈ Rn+ × R+ ; s · x¯ = η · (A0 x¯ − b0 ) = 0}. N P 0 (x)
(3.164)
We conclude by (3.161).
3.3 Applications of the Shapley–Folkman Theorem We have already stated the Shapley–Folkman Theorem 1.170.
3.3.1 Integrals of Multimappings Let (Ω, F , μ) be a probability space. We assume that μ is non-atomic, i.e., for any A ∈ F , with μ(A) > 0, there exists a B ∈ F , B ⊂ A, such that 0 < μ(B) < μ(A). This is known to be equivalent to the Darboux property3
For all α ∈ (0, 1), there exists a B ∈ F , B ⊂ A, such that μ(B) = αμ(A).
(3.165)
Let F be a (not necessarily measurable) multimapping Ω → Rn , defined a.e. on Ω. If f ∈ L 1 (Ω) is such that f (ω) ∈ F(ω) a.e., we say that f is an integrable selection of F. We set F := f (ω)dμ(ω); f is an integrable selection of F . (3.166) Ω
Ω
The following holds [74]. Our proof follows [119]. Theorem 3.96 We have that Ω F is a convex subset of Rn . Proof Let x1 and x2 belong x = αx1 + (1 − α)x2 , for some α ∈ (0, 1). to Ω F, and We have to prove that x ∈ Ω F. So, f 1 , f 2 being the integrable selections associated with x1 and x2 , it suffices to consider the case when F(ω) := { f 1 (ω), f 2 (ω)}. Let p > n. The Darboux property implies that Ω is the union of p disjoint measurable sets Ai , each of measure 1/ p. Then
x ∈ conv
3 For
F Ω
= conv
p i=1
F .
(3.167)
Ai
a proof of the Darboux property, based on Zorn’s lemma, see [3, Theorem 10.52, p. 395].
3.3 Applications of the Shapley–Folkman Theorem
157
By the Shapley–Folkman Theorem 1.170, there exists an Ip ⊂ {1, . . . , p} of cardinality at most n, such that we have the representation x = i=1 xi , with xi ∈ conv Ai F if i ∈ I , and xi ∈ Ai F otherwise. Repeating a similar argument for each set Ai , for i ∈ I , we obtain by induction a sequence of representations of the form partition (Ak , Bk ) of Ω, with Ak nondex = y k + z k , where for some measurable creasing and μ(Bk ) → 0, y k ∈ Ak F, Bk = ∪∈Ik Bk , the Bk being disjoint mea surable subsets of Bk , and z k = ∈Ik z k , with z k ∈ conv( Bk F) for all ∈ Ik , Set f¯(ω) := max(| f 1 (ω)|, | f 2 (ω)|). Since f¯ is integrable, |z k | ≤ Bk f¯(ω)dμ(ω) and μ(Bk ) → 0, by Corollary 3.37, we have that z k → 0. We conclude by passing to the limit. We next discuss the case when for some measurable multimapping I : Ω → P{1, . . . , p}, and integrable functions f 1 , . . . , f p , F is defined by F(ω) = { f i (ω); i ∈ I (ω)}, a.e. on Ω.
(3.168)
The multimapping conv F is defined by conv F(ω) = conv{ f i (ω); i ∈ I (ω)}, a.e. on Ω.
Set A :=
F, A :=
p
c
Ω
Ω
conv F, S p := {α ∈ R+ ;
αi = 1},
(3.169)
(3.170)
i
and / I (ω); a.e. , S Ic := α ∈ L ∞ (Ω) p ; α(ω) ∈ S p ; αi (ω) = 0, i ∈ S I := α ∈ S Ic ; αi (ω) ∈ {0, 1}; a.e. .
(3.171) (3.172)
The next proposition is a variant of the Lyapunov convexity theorem. Proposition 3.97 Let (3.168) hold. Then A is equal to Ac , is convex and compact, and any x ∈ A has the following representation: x=
p i=1
Ω
αi (ω) f i (ω)dμ(ω), for some α ∈ S I .
(3.173)
Proof (a) By Theorem 3.96, A is convex; since the f i are integrable, it is bounded. Let f be an integrable selection of F. Set E 1 := {ω ∈ Ω; i ∈ I (ω); f 1 (ω) = f (ω)},
(3.174)
158
3 An Integration Toolbox
and by induction, for i = 2 to p: / E j , j < i; f i (ω) = f (ω)}. E i := {ω ∈ Ω; i ∈ I (ω); ω ∈
(3.175)
Let αi be the indicatrix of E i . Then α ∈ S I and (3.173) holds. (b) It remains to show that A is closed and is equal to Ac . Since A is a convex subset of Ac , it suffices to check that any x¯ ∈ ∂ Ac belongs to A. By Corollary 1.21, we can separate x¯ and rint(Ac ), and so, there exists a λ ∈ Rn∗ such that λx¯ ≤ λx, for all x ∈ Ac , or equivalently λx¯ = inf c λ α∈S I
p i=1
Ω
αi (ω) f i (ω)dμ(ω).
(3.176)
Let f λ := (λ f 1 , . . . , λ f p ) . By Proposition 3.71, we have that λx¯ =
Ω
min{ f iλ (ω); i ∈ I (ω)}dμ(ω).
(3.177)
By Remark 3.72, there exists an α ∈ S that reaches the infimum in (3.176). Set Acλ := {x ∈ A¯ c ; λx = λx}; ¯
Aλ := {x ∈ A; λx = λx}. ¯
(3.178)
We have proved that Acλ contains an element of Aλ . On the other hand, since Ac is bounded, to any nonzero λ ∈ R p is associated some x¯ ∈ ∂ Ac such that (3.176) holds. Therefore A and Ac have the same support function, and since these sets are convex, they have the same closure.4 So it suffices to prove that A is closed, which ¯ λ) as above. is equivalent to the equality Acλ = Aλ , for any pair (x, We conclude by an induction argument over the dimension of A. If A is one¯ and since it contains one point in A, it follows that dimensional, then Acλ = {x}, Acλ = Aλ . Let the result hold when the dimension of A is q − 1, for q ≥ 2, and let A have dimension q. Define I λ (ω) := {i ∈ I (ω); f iλ (ω) ≤ f jλ (ω), for all j ∈ I (ω)},
(3.179)
and consider the multimapping F λ (ω) = { f i (ω); i ∈ I λ (ω)}, a.e. on Ω.
(3.180)
We see that Aλ = Ω F λ . Since F λ has the same structure as F and Aλ has dimension at most q − 1, we have that Aλ is closed and equal to Acλ . The conclusion follows. 4 The
indicatrix function of a nonempty closed convex set is l.s.c. convex, and hence, equal to its biconjugate. Since the conjugate of the indicatrix is the corresponding support function, two closed convex sets having the same associated support function are equal.
3.3 Applications of the Shapley–Folkman Theorem
159
3.3.2 Constraints on Integral Terms We next consider the following problem Min F0 (u); (F1 (u), . . . , Fq (u)) ∈ K ,
u∈L p (Ω)
(P I )
where again μ is a non-atomic probability measure, K is a closed convex subset of Rq , and for i = 0 to q, given Carathéodory functions i : Ω × Rm → R and a measurable multimapping U : Ω → Rm , for all u ∈ L p (Ω, Rm ): Fi (u) :=
⎧ ⎨
i (ω, u(ω))dμ(ω) if u(ω) ∈ U (ω) a.e., Ω ⎩ +∞ otherwise,
with the convention that F0 (u) is equal to +∞ if 0 (ω, u(ω))+ is not integrable. We assume (for the sake of simplicity) that for any u ∈ L p (U ), i (ω, u(ω)) is integrable, ¯ is defined by i = 0 to q. The Lagrangian of the problem L : L p (Ω)m × Rq∗ → R L(u, λ) := F0 (u) +
q
λi Fi (u).
(3.181)
i=1
The dual problem is Max d(λ) := λ
inf
u∈L p (U )
L(u, λ) − σ K (λ).
(D I )
Set F(u) := (F0 (u), . . . , Fq (u)) , with range E := {F(u); u ∈ L p (U )}.
(3.182)
By Theorem 3.96, this set is convex; its components are indexed from 0 to q. We may rewrite the primal problem in the form Min e0 ; e1:q ∈ K , e∈E
where e1:q ∈ Rq has components e1 to eq , and set E 1:q := {e1:q ; e ∈ E}. The primal problem is feasible iff 0 ∈ E 1:q − K , and the stability condition (1.170) of perturbation duality is (3.183) ε B ⊂ E 1:q − K , for some ε > 0. Theorem 3.98 Let (3.183) hold. Then val(P I ) = val(D I ), and u¯ ∈ S(P I ) iff there ¯ such that exists a λ ∈ N K (F(u)) L(u, ¯ λ) ≤ L(u, λ), for all u ∈ L p (Ω).
(3.184)
160
3 An Integration Toolbox
Proof The convex sets E and E := (−∞, val(P I )) × K are disjoint, since any point in the intersection is the image of a feasible u ∈ L p (Ω) with cost function lower than the value of (P). By Corollary 1.21, we can separate E and E, i.e., there exists a nonzero pair (β, λ) ∈ R × R p∗ such that βγ + λk ≤ βe0 + λ · e1:q , for all γ < val(P I ) and (k, e) ∈ K × E.
(3.185)
Fixing k ∈ K and making γ ↓ −∞ we deduce that β ≥ 0. If β = 0 then λ = 0 and λ(e1:q − k) ≥ 0, for all (k, e) ∈ K × E. By (3.183) this implies that λ = 0, which is a contradiction. We have proved that β > 0, and so dividing (β, λ) by β if necessary, we can assume that β = 1. Maximizing over γ in (3.185) and recalling the definition of E, we deduce that val(P I ) ≤ L(u, λ) − λk, for all (k, u) ∈ K × L p (Ω).
(3.186)
Minimizing the r.h.s. over (k, u) we obtain that val(P I ) ≤ d(λ) ≤ val(D I ). Since the converse inequality obviously holds, the primal and dual values are equal. Assume now that u¯ ∈ S(P I ). Then, by (3.186), ¯ = val(P I ) ≤ L(u, λ) − λk, for all (k, u) ∈ K × L p (Ω), L(u, ¯ λ) − λF1:q (u) (3.187) or equivalently 0≤
inf p
u∈L (Ω)
¯ − k) . ¯ λ)) + inf λ(F1:q (u) (L(u, λ) − L(u, k∈K
(3.188)
¯ we see that each infimum is nonpositive. Therefore Taking u = u¯ and k = F1:q (u), ¯ and (3.184) holds. Conversely, if they are both equal to zero, i.e., λ ∈ N K (F(u)) ¯ and (3.184) holds, then for all u ∈ dom(F): u¯ ∈ dom(F) is such that λ ∈ N K (F(u)) ¯ λ) − λF1:q (u) ¯ = F0 (u), ¯ F0 (u) = L(u, λ) − λF1:q (u) ≥ L(u, and hence, u¯ ∈ S(P I ). The conclusion follows.
(3.189)
Remark 3.99 While problem (P I ) is not convex in general (for instance, its cost function is not convex) we have been able to reformulate it as a convex problem. Set [λ](ω, u) := 0 (ω, u) + λ1:q (ω, u). Since the Lagrangian is itself an integral, to which Proposition 3.71 applies, we deduce that, under the hypotheses of the above theorem, if u¯ ∈ S(P I ), then u(ω) ¯ ∈ argmin 0 [λ](ω, u), for a.a. ω.
(3.190)
Exercise 3.100 Discuss the case when Ω = [0, 1], the integrands f i do not depend on ω and are polynomials of degree at most n, and U (ω) = R.
3.4 Examples and Exercises
161
3.4 Examples and Exercises Example 3.101 (Constrained entropy maximization) Let Ω be a measurable subset of Rn with finite Lebesgue measure. Consider the set of measurable, a.e. positive functions in X := L 1 (Ω): X + := {u ∈ L 1 (Ω); u(ω) ≥ 0 a.e.}.
(3.191)
We have observations Ω
ai (ω)u(ω)dω = bi , i = 1, . . . , N ,
(3.192)
where each ai belongs to X ∗ = L ∞ (Ω) and b ∈ R N is a noisy measurement, so that the available information is that b ∈ K , where K is a closed convex subset of R N . We define ˆ Hˆ (u(ω))dω. (3.193) H (x) := x log x; H (u) := Ω
The strictly l.s.c. convex function Hˆ (x) has domain R+ , with value 0 at 0. We have in view cases when u is a density probability and so we assume that a1 (ω) = 1. In the crystallographic applications that we have in mind, u(ω) is the density probability for atoms to be at position ω and the observations correspond to the computation of Fourier modes, see [39]. The problem to be considered is Min H (u); u∈X
Au ∈ K ,
(3.194)
where (Au)i := Ω ai (ω)u(ω)dω. The cost function is obviously convex, and is l.s.c. in view of Fatou’s Lemma 3.41 (where we can take g(ω) = −c, c being the maximum of − Hˆ ). So, the Fenchel duality framework is applicable. Set Hˆ λ (ω, v) := Hˆ (v) +
N
λi ai (ω) · v.
(3.195)
i=1
Let a(ω) := (a1 (ω), . . . , a N (ω)) . Observe that inf Hˆ λ (ω, v) = − Hˆ ∗ (−a(ω) · λ). v
(3.196)
The Lagrangian function is
L(u, λ) := H (u) + λ Au =
Ω
Hˆ λ (ω, u(ω))dω.
The integrand is normal convex. Therefore, the dual cost satisfies
(3.197)
162
3 An Integration Toolbox
δ(λ) = inf
u∈X
Ω
Hˆ λ (ω, u(ω))dω − σ K (λ) = −
Ω
Hˆ ∗ (−a(ω) · λ)dω − σ K (λ).
(3.198) We assume that the primal problem is feasible and that the stability condition holds: 0 ∈ int(K − A dom(H )).
(3.199)
Since Hˆ (v) ≥ −c, the primal value is not less than −c|Ω|. So, the primal problem has a finite value. By (3.199), the primal and dual values are equal and the set of dual solutions is nonempty and bounded. Let λ¯ be a dual solution. Then u ∈ dom Hˆ is a primal solution iff it satisfies the optimality condition ¯ a.e. Hˆ (u(ω)) + Hˆ ∗ (−a(ω) · λ) = −(a(ω) · λ)u(ω)
(3.200)
Since Hˆ is strictly convex, there is a unique primal solution u¯ that is determined by the above relation. Indeed, we have that D Hˆ (v) = 1 + log v = z, iff v = e z−1 and so ¯ (3.201) u(ω) ¯ = e−a(ω)·λ−1 . It follows that
¯ Hˆ (u(ω)) ¯ = −(a(ω) · λ¯ + 1)e−a(ω)·λ−1 ,
(3.202)
so that the dual cost is δ(λ) = −
Ω
¯ ¯ e−a(ω)·λ−1 dω − σ K (λ).
(3.203)
Example 3.102 Consider the particular case of the previous example when N = 1, and the constraint has a probability density, i.e. a(ω) = 1 a.e. and K = {1}. Then the dual cost is −|Ω|e−λ−1 − λ, which attains its maximum when |Ω|e−λ−1 = 1, i.e., ¯ for λ¯ = log |Ω| − 1; the optimal density is u = e−λ−1 = 1/|Ω|, as expected (the uniform law maximizes the entropy). Example 3.103 (Phase transition models, see [80]) Let f : R → R, f (u) := u(1 − u), and let Ω be a measurable subset of Rn . We choose the function space X := L p (Ω), p ∈ [1, ∞). For u ∈ X , set F(u) := Ω f (u(ω))dμ(ω), where dμ is the Lebesgue measure. Consider theproblem of minimizing F(u) with the constraints u(ω) ∈ U a.e., U := [0, 1], and Ω u(ω)dμ(ω) = a, a ∈ (0, mes(Ω)). Given λ ∈ R, the Lagrangian of this problem is L(u, λ) := FU (u) + λ
Ω
u(ω)dμ(ω) − a
=
Ω
( f (u(ω)) + λu(ω))dμ(ω) − λa.
(3.204)
3.4 Examples and Exercises
163
The dual cost function is therefore δ(λ) =
−FU∗ (−λ)
We compute
− λa = −
Ω
fU∗ (−λ)dμ(ω) − λa.
fU∗ (z) := sup uz − u(1 − u).
(3.205)
(3.206)
u∈U
Since u(1 − u) is concave the supremum is attained at 0 if z ≤ 0 and at 1 otherwise, and so, 0 if z ≤ 0, ∗ fU (z) = (3.207) z if z ≥ 0. In other words, fU∗ (z) = max(0, z) = z + . So
δ(λ) = −
Ω
(−λ)+ dμ(ω) − λa =
λ(mes(Ω) − a) if λ ≤ 0, −λa if λ ≥ 0.
(3.208)
Clearly it attains its maximum at λ¯ = 0, and so, the primal and dual values are equal, although the problem is nonconvex. Example 3.104 This example illustrates how singular multipliers occur in optimality systems. Consider the problem Min x; x + 1/(k + 1) ≥ 0, k = 0, 1, . . . . x∈R
(3.209)
We choose ∞ (the space of bounded sequences) as the constraint space and denote by 1 and b the sequences with generic term 1 and 1/(k + 1), respectively. Thus we are considering the problem Min x; x1 + b ≥ 0, k = 0, 1, . . . x∈R
(3.210)
where we have used the natural order relation for sequences. Let K = ∞ + be the convex cone of elements of ∞ with nonnegative elements and let A : R → ∞ , Ax := x1. The constraint can be written as Ax + b ∈ K . The duality Lagrangian is x + λ, x1 + b − σk (λ) = λ, b + x(1 + λ, 1) − σk (λ).
(3.211)
The dual cone to K is K − := {λ ∈ (∞ )∗ ; λ, y ≤ 0, for all y ∈ ∞ + }. So, the dual problem is
(3.212)
164
3 An Integration Toolbox
Max λ, b; 1 + λ, 1 = 0.
λ∈K −
(3.213)
The problem is convex and the stability condition obviously holds, and so, primal and dual values are equal. The optimality condition is, in view of the dual constraint: 0 = x − λ, b = −λ, x1 + b.
(3.214)
For any y ∈ K , N K (y) = K − ∩ y ⊥ (see Chap. 1), so that K − ∩ (x1 + b)⊥ = N K (x1 + b), the set of dual solutions (which is nonempty and bounded), is {λ ∈ N K (x1 + b); 1 + λ, 1 = 0}.
(3.215)
We now use the structure of elements of (∞ )∗ . Any λ ∈ (∞ )∗ can be uniquely decomposed as λ = λ1 + λs , where λ1 ∈ 1 and the singular part λs depends only on the behavior at infinity. For any y ∈ K , we have that 0 ≥ λ, y = λ1 , y + λs , y.
(3.216)
Taking, for i ∈ N, y = ei (the sequence with all components equal to 0 except the ith equal to 1) we obtain that λ1 ∈ K − . Then let y ∈ K . Denote by y N the sequence whose N first terms are zero, the others being equal to those of y. We have that λ1 , y N = o(1) and λs , y N = λs , y. Since λ ∈ K − , we deduce that λs ∈ K − . Finally take y = ek /(k + 1). Then x1 ± y ∈ K , and therefore 0 ≥ λ, ±y = 1 λk /(k + 1), proving that λ1k = 0. and therefore λ1 = 0. In view of the dual constraint, it follows that λs = 0.
3.5 Notes For complements on Sect. 3.1 (measure theory) we refer to e.g. Malliavin [77]. The integral functionals discussed in Sect. 3.2 were studied in Rockafellar [96, 98, 99]; see also Castaing and Valadier [33], Aubin and Frankowska [12]. The proof of the Shapley–Folkman theorem is taken from Zhou [127]. We use it to prove the convexity of integrals of multimappings. See Tardella [119] and its references on the Lyapunov theorem. Maréchal [79] introduced useful generalizations of the perspective function.
Chapter 4
Risk Measures
Summary Minimizing an expectation gives little control of the risk of a reward that is far from the expected value. So, it is useful to design functionals whose minimization will allow one to make a tradeoff between the risk and expected value. This chapter gives a concise introduction to the corresponding theory of risk measures. After an introduction to utility functions, the monetary measures of risk are introduced and connected to their acceptation sets. Then the case of deviation and semi-deviation, as well as the (conditional) value at risk, are discussed.
4.1 Introduction When minimizing an expectation, we miss the possibility of large variance of the cost, leading to high risk of a poor result. So it may be wise to modify the cost function in order to reduce the associated risk. We present in this chapter some tools which allow us to do this.
4.2 Utility Functions 4.2.1 Framework ¯ with connected domain Definition 4.1 We call a nondecreasing function u : R → R, denoted by D(u), a disutility function. Note that classical economic theory deals with gain maximization and (often concave) utility functions. However, since we choose to analyze minimization problems, we will use disutility functions (which will be the opposite of the utility functions).
© Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_4
165
166
4 Risk Measures
Definition 4.2 Let (Ω, F , μ) be a probability space and let s ∈ [1, ∞]. The prefer¯ ence function associated with the disutility function u is the function U : L s (Ω) → R defined by U (y) := E[u(y)], (4.1) with domain D(U ) = {x ∈ L s (Ω); u(x) is measurable and E[max(u(x), 0)] < +∞}.
(4.2)
If x and y belong to L s (Ω) (representing losses) we say that x is preferred to y if U (x) ≤ U (y). Definition 4.3 We say that the preference function U is risk-adverse if the disutility function u is proper, l.s.c. convex. Let U be risk-averse. Then it is convex, and u has an affine minorant, say ay + b, so that U has the affine minorant aEy + b and is therefore proper (its domain is nonempty since it contains the constant functions with value in dom(u)). By Fatou’s Lemma 3.41 we deduce that U is proper l.s.c. convex. Let y ∈ D(U ). By Jensen’s inequality, we have that u[E(y)] ≤ E[u(y)]. (4.3) This expresses the preference for getting the mean value of a random variable rather than the variable itself. Definition 4.4 A certainty equivalent price (also called “utility equivalence price”) of y ∈ D(U ) is defined as an amount α ∈ R such that u[α] = E[u(y)].
(4.4)
Since u is nondecreasing, E[u(y)] belongs to the image of u, and so the set of certainty equivalent prices is a nonempty interval. If u is increasing, it is a singleton, equal to u −1 (E[u(y)]), denoted by ce(y). Remark 4.5 If U is risk-averse and y ∈ D(U ), in view of (4.3), we always have ce(y) ≥ E(y), and we can interpret ce(y) as the “fair price” of the random variable y. Example 4.6 The exponential disutility function u(x) = e x is risk-averse, and for all a ∈ R, we have U (y + a) = E[u(y + a)] = ea E[u(y)] = ea U (y),
(4.5)
ce(y + a) = log(ea U (y)) = a + log(U (y)) = a + ce(y).
(4.6)
so that
So, in the case of the exponential utility function, the certainty equivalent price satisfies the relation of translation invariance: ce(y + a) = a + ce(y).
4.2 Utility Functions
167
4.2.2 Optimized Utility We now interpret y as the gain of a portfolio that can by combined with other random variables, called free assets. If a financial asset z, an element of L s (Ω), has price pz on the market, then the asset z − pz has a zero price. These market prices should not be confused with utility indifference prices that apply to assets that are not priced in the market. So if z 1 , . . . , z n are zero value assets, for any θ ∈ Rn , we may choose to have the portfolio (4.7) y(θ ) = y + θ1 z 1 + · · · + θn z n . We assume that there is no constraint on the decision variables θ . Therefore the investor minimizes its disutility by solving the problem Minn U [y(θ )]. θ∈R
(4.8)
Assuming that the above function of θ is differentiable, and that the rule for differentiating the argument of the sum holds, we see that the optimality condition of this problem is ∂U [y(θ )] = Eμ [u (y(θ ))z i ], i = 1, . . . , n. (4.9) 0= ∂θi Assume that u (·) is positive everywhere. Let the random variable ηθ be defined by ηθ =
u (y(θ )) . Eμ [u (y(θ ))]
(4.10)
Being positive and with unit expectation, η is the density of the equivalent probability measure μθ such that dμθ = ηθ μ. We may write the optimality condition (4.9) as 0 = Eμθ [z i ].
(4.11)
In other words, optimal portfolios are those for which the financial assets have null expectation under their associated probability μθ . In such a case, we say that μθ is a neutral risk probability. Remark 4.7 If short positions are forbidden, meaning that we have the constraint θ ≥ 0, then the optimality conditions may be expressed as Eμθ [z i ] ≥ 0; θ ≥ 0; θi Eμθ [z i ] = 0, i = 1, . . . , n.
(4.12)
In particular, all assets in the optimal portfolio are risk neutral. Remark 4.8 Given a nonempty closed convex subset K of Rn , we can apply the Fenchel duality setting (Theorem 1.113) to the problem Min U [y(θ )], θ∈K
(4.13)
168
4 Risk Measures
with X = Rn , f = I K , Y = L s (Ω), x = θ , Aθ = θ1 y1 + · · · + θn yn , F = U . The stability condition is 0 ∈ int (dom(U ) − AK ) . (4.14) This will hold if s = 1 and u satisfies a linear growth condition, since in this case dom(U ) = Y . Since f ∗ = σ K , and U ∗ (·) = E(u ∗ (·)) by Proposition 3.80, the expression of the dual problem is, assuming s ∈ [1, ∞) and 1/s + 1/s = 1: Max E(y ∗ y − u ∗ (y ∗ )) − σ K (−A y ∗ ).
y ∗ ∈L s
(4.15)
Since (A y ∗ ) I = y ∗ , yi s (duality product in L s (Ω)) for i = 1 to n, and ∂ I K = N K , the optimality condition at a solution θ¯ is, setting y¯ := y(θ¯ ): ⎞ ⎛ ∗
y , y1 s ⎟ ⎜ .. ∗ N K (θ¯ ) + ⎝ (4.16) ⎠ 0; y ∈ ∂u( y¯ ) a.s. .
y ∗ , yn s
In particular, if K = Rn this means that y ∗ , yi s = 0, for i = 1 to n, for some y ∗ ∈ L s (Ω), y ∗ ∈ ∂u( y¯ ) a.s. If y ∗ is a.s. positive, then y¯ ∗ := y ∗ /Eμ y ∗ is well-defined and is an equivalent probability measure, under which the assets have null expectation.
4.3 Monetary Measures of Risk 4.3.1 General Properties We now give an axiomatic approach to risk measures associated with estimates of incomes, and explicit expressions of some of these risk measures, in the form of a maximum of mean values. Let Ω be the set of events. An (uncertain) outcome (opposite of income) is a function x : Ω → R; x(ω) is the actual outcome obtained if event ω occurs. Outcome functions are assumed to belong to a Banach space X , containing constant functions. The space X is endowed with the order relation for functions of ω: if x, y belong to X , then x ≤ y if x(ω) ≤ y(ω) (either everywhere or a.e.). Definition 4.9 A mapping ρ : X → R is called a monetary measure of risk (MMR) if, for all x and y in X , the following holds: Monotonicity: if x ≥ y, then ρ(x) ≥ ρ(y), i.e., ρ is nondecreasing. Translation invariance: if a ∈ R, then ρ(x + a) = ρ(x) + a. Lemma 4.10 (i) The set of monetary measures of risk is convex, and invariant under addition of a constant et translation.1 (ii) Let f i , i ∈ I , be a family of MMRs. If the 1 Change
of ρ(x) into ρ(x + a), with a ∈ R.
4.3 Monetary Measures of Risk
169
supremum (resp. infimum) is everywhere finite, then it is an MMR. (iii) A monetary measure of risk is non-expansive (i.e., Lipschitz continuous with constant at most 1) with respect to the supremum norm: |ρ(x) − ρ(y)| ≤ sup |x(ω) − y(ω)|.
(4.17)
ω
Proof The proof of (i)–(ii) being immediate, it suffices to prove (iii). Let x and y be in X , and M := supω |x(ω) − y(ω)|. Then x ≥ y − M. By monotonicity and translation invariance, ρ(x) ≥ ρ(y − M) = ρ(y) − M. Exchanging x and y, we obtain the converse inequality. We recall (see Sect. 1.3.4, Chap. 1) that the infimal convolution of a family f i of extended real-valued functions over X , i ∈ I finite, is defined as
f i (xi ); xi = x . (4.18) (i∈I f i ) (x) := inf i∈I
i∈I
Lemma 4.11 The infimal convolution of a finite family of monetary measures of risk is, whenever it is finite-valued, a monetary measure of risk. Proof Let the f i be extended real-valued functions over X , for i = 1 to n. Then i∈I f i (x) =
inf
x1 ,...,xn−1
f i (xi ) + f n x −
1≤i≤n−1
xi
.
(4.19)
1≤i≤n−1
By Lemma 4.10(i), each term in the “inf” is an MMR. We conclude by Lemma 4.10(ii).
4.3.2 Convex Monetary Measures of Risk We denote by S the following set: S := {Q ∈ X ∗ ; Q ≥ 0, Q, 1 = 1},
for all x and y in X.
(4.20)
In some cases S will have the interpretation of probability measures. We have established in (1.291) that the Fenchel conjugate of the infimal convolution is the sum of conjugates. / Lemma 4.12 (i) If ρ is an MMR (possibly nonconvex), then ρ ∗ (Q) = +∞ if Q ∈ S. (ii) The function ρ : X → R is a convex l.s.c. MMR iff it has finite values and satisfies: ρ(x) = sup{ Q, x − ρ ∗ (Q); Q ∈ S }.
(4.21)
170
4 Risk Measures
Proof (i) If Q 0, there exists a y ≥ 0 such that Q, y < 0. Let x ∈ X . Then ρ ∗ (Q) ≥ Q, x − y − ρ(x − y) ≥ Q, x − Q, y − ρ(x).
(4.22)
Taking the supremum over x, we obtain ρ ∗ (Q) ≥ ρ ∗ (Q) − Q, y. Since ρ ∗ (Q) > −∞ and Q, y < 0, this implies ρ ∗ (Q) = +∞. If on the other hand Q, 1 = 1, by translation invariance, we get ρ ∗ (Q) ≥ supα∈R { Q, α1 − ρ(α1)} = supα∈R {α( Q, 1 − 1) − ρ(0)} = +∞.
(4.23)
(ii) If ρ : X → R is an l.s.c. convex MMR, it is equal to its biconjugate, and so, by (i), (4.21) holds. Conversely, (4.21) expresses that ρ is a supremum of MMRs. Having finite values, it is an MMR.
4.3.3 Acceptation Sets A monetary measure of risk ρ is characterized by its associated zero sublevel set, called in this setting an acceptation set: Aρ := {x ∈ X ; ρ(x) ≤ 0}.
(4.24)
Indeed, by the translation invariance property, we have that ρ(x) = min{α ∈ R; x − α1 ∈ Aρ }.
(4.25)
In other words, ρ(x) is the smallest constant reduction of losses that allows one to get a nonpositive risk. Acceptation sets satisfy
(i) A − X + ⊂ A, (ii) For all x ∈ X , ρ(x) := min{α ∈ R; x − α1 ∈ A} is finite.
(4.26)
Conversely, to a set A satisfying (4.26) is associated an MMR ρ A defined by ρ A (x) = inf{α ∈ R; x − α1 ∈ A}.
(4.27)
Lemma 4.13 A monetary measure of risk ρ is convex iff its associated acceptance set is convex. Proof If ρ is a convex MMR, then Aρ = ρ −1 (R− ) is obviously convex. Conversely, assume that Aρ is convex. Let x1 and x2 belong to X . Then xi − ρ(xi ) ∈ Aρ , i = 1, 2. Since Aρ is convex, for any γ ∈ [0, 1], we have that
4.3 Monetary Measures of Risk
171
γ (x1 − ρ(x1 )) + (1 − γ )(x2 − ρ(x2 )) ∈ Aρ .
(4.28)
Set x := γ x1 + (1 − γ )x2 . Then x − γρ(x1 ) − (1 − γ )ρ(x2 ) ∈ Aρ which, in view of the definition of Aρ , implies, ρ(x) ≤ γρ(x1 ) + (1 − γ )ρ(x2 ) as was to be proved. In order to obtain lower estimates of the value of optimization problems associated with MMRs, it is useful to characterize the greatest convex minorant of an MMR. The notion of convex closure for sets and functions was introduced in Definition 1.45. Lemma 4.14 Let ρ be an MMR, with acceptation set A. Assume that conv(A) is not the entire space. Then conv(ρ) is a monetary measure of risk whose acceptation set is conv(A). Proof By the Hahn–Banach theorem, since conv(A) = X , there exist (Q, α) ∈ X ∗ × R, with Q = 0, such that Q, y ≤ α for all y ∈ A. For any y ∈ A and z ∈ X + , y − z ∈ A, and hence, Q, y − z ≤ α, proving that Q ≥ 0. Next, given x ∈ X , we have that y := x − ρ(x)1 ∈ A, and hence,
Q, 1ρ(x) ≥ Q, x − α.
(4.29)
If Q, 1 = 0, then Q, x ≤ α for all x ∈ X , which cannot hold since Q ≥ 0 and Q = 0. So Q, 1 > 0, and we may assume that Q, 1 = 1. It then follows from (4.29) that Q, x − α is a minorant of ρ. Since ρ has an affine minorant, by the Fenchel–Moreau–Rockafellar theorem 1.46, conv(ρ) is equal to ρ ∗∗ . Since ρ ∗∗ ≤ ρ, we have that ρ ∗∗ is everywhere finite. By Lemma 4.12(i), ρ ∗∗ is a supremum of the form (4.21). We conclude by Lemma 4.12(ii) that ρ ∗∗ = conv(ρ) is an l.s.c. convex MMR such that conv(ρ) ≤ ρ, and so A ⊂ Aconv(ρ) . Since Aconv(ρ) is closed and convex, this implies conv(A) ⊂ Aconv(ρ) .
(4.30)
We now show the converse inclusion by checking that conv(A) satisfies the axioms of an acceptation set. Condition (4.26)(i) is a consequence of the one satisfied by A, and for (4.26)(ii), set r (x) := inf{γ ∈ R; x − γ 1 ∈ conv(A)}.
(4.31)
Since A ⊂ conv(A), we have that r (x) ≤ ρ(x). The affine minorant (x) := Q, x − α of ρ(x) is such that (x) ≤ 0 for all x ∈ A, and hence, for all x ∈ conv(A). So, x − γ 1 ∈ conv(A) implies α ≥ Q, x − γ 1 = Q, x − γ , i.e., and so γ ≥
Q, x − α. Therefore, the infimum in (4.31) is finite and, as conv(A) is closed, is attained. The associated MMR r is an l.s.c. convex minorant of ρ, and so r ≤ conv(ρ). But, since the mapping ρ → Aρ is nonincreasing, the converse inclusion in (4.31) holds. The conclusion follows.
172
4 Risk Measures
4.3.4 Risk Trading This model will illustrate the above concepts. It involves two agents, the issuer A and the buyer B. An asset F is to be sold to the buyer at a price π to be determined. Initially, A and B have outcome functions X and Y , in the Banach space X , and assess risk with risk measures ρ A and ρ B . The buyer will find the transaction advantageous if ρ B (Y + F + π ) ≤ ρ B (Y ).
(4.32)
In view of the translation invariance property, the best (highest) price is π(F) := ρ B (Y ) − ρ B (Y + F). The financial product F minimizing the risk of the issuer in a class F is then the solution of Min ρ A (X − F − π(F)); π(F) := ρ B (Y ) − ρ B (Y + F). F∈F
(4.33)
Using again the translation invariance property, we obtain the equivalent problem Min ρ A (X − F) + ρ B (Y + F) − ρ B (Y ).
(4.34)
F∈F
If F = X , using the identity inf {ρ A (X − F) + ρ B (Y + F)} = inf {ρ A (X + Y − G) + ρ B (G)},
F∈X
G∈X
(4.35)
we recognize an inf convolution: the above infimum is ρ A ρ B (X + Y ) − ρ B (Y ), while the gain of the issuer is ρ A (X ) + ρ B (Y ) − ρ A ρ B (X + Y ).
4.3.5 Deviation and Semideviation Let (Ω, F , μ) be some probability space. Let p ∈ [1, ∞), X = L p (Ω). The deviation in L p (Ω) is Ψ p (x) :=
1/ p |x(ω) − E(x)| dμ(ω) p
Ω
.
(4.36)
This is a composition of the “centering” continuous mapping Ax = x − E(x)1 with the L p norm, and is a positively homogeneous continuous convex function. Since the subdifferential of a norm at 0 is the closed dual unit ball, ∂Ψ p (0) = A Bq , with Bq the unit ball of L q (Ω), 1/ p + 1/q = 1. For y ∈ L q (Ω), we have A y = y − E(y)1, and so, by Lemma 1.66:
4.3 Monetary Measures of Risk
173
∂Ψ p (x) = z = y − E(y)1; y L q (Ω) ≤ 1; z(ω) · x(ω)dμ(ω) = Ψ p (x) . Ω
(4.37) When p = 1, q = +∞ and for all y ∈ B∞ , we have that Ey ≤ 1, so if z ∈ ∂Ψ1 (0), z ≥ −2 a.s. The function ρ1 (x) := E(x) + c |x(ω) − E(x)|dμ(ω) (4.38) Ω
is, for all c ≥ 0, convex and continuous, translation invariant, and satisfies ∂ρ1 (0) = 1 + c{y − E(y); y∞ ≤ 1}.
(4.39)
If c ∈ [0, 1/2], any z ∈ ∂ρ1 (0) is nonnegative and has unit expectation. We deduce that: Lemma 4.15 For c ∈ [0, 1/2], the function ρ1 (x) defined in (4.38) is a convex MMR.
4.3.5.1
Semi-deviation
Consider now the function Φ p (x) :=
1/ p p
Ω
|x(ω) − E(x)|+ dμ(ω)
.
(4.40)
This is a composition of the same “centering” mapping Ax = x − E(x)1 with the function x+ p . Let us show that the latter is convex: it is positively homogeneous, and since (x + y)+ ≤ x+ + y+ , we have (x + y)+ p ≤ x+ + y+ p ≤ x+ p + y+ p ,
(4.41)
i.e., x+ p is sublinear.2 Now a sublinear, positively homogeneous function is convex.3 Lemma 4.16 The subdifferential at 0 of x+ p is (Bq )+ , the set of nonnegative elements of Bq . Proof Since x → x+ p is nondecreasing and non-expansive, the elements of its subdifferential are nonnegative and contained in the closed dual unit ball, i.e., ∂x+ p (0) ⊂ (Bq )+ . Conversely, if y ∈ (Bq )+ , then for all x in X , we have x+ p ≥
q, x+ ≥ 0, and so, q ∈ ∂x+ p (0). function f is sublinear if f (+y) ≤ f (x) + f (y), for all x and y. if f is sublinear and positively homogeneous, for α ∈]0, 1[, we have f (αx + (1 − α)y) ≤ f (αx) + f ((1 − α)y) = α f (x) + (1 − α) f (y).
2A
3 Since,
174
4 Risk Measures
By the above discussion, ∂Φ p (0) = {y − E(y); y ∈ (Bq )+ }.
(4.42)
As in the case of the deviation function, since E(y) ≤ 1 when y ∈ (Bq )+ the subdifferential of Φ p is a.s. greater than or equal to −1. We deduce the following result: Lemma 4.17 For p ∈ [1, ∞) and c ∈ [0, 1], the following function is a convex MMR: (4.43) ρˆ p (x) := E(x) + cΦ p (x). Remark 4.18 The function ρ p+ is of practical interest since it penalizes losses, and not gains, w.r.t. the average revenue.
4.3.6 Value at Risk and CVaR 4.3.6.1
Value at Risk
Risk models often involve a constraint on the probability that losses are no more than a given level. Denote by (4.44) H X (a) := P[X ≤ a] the cumulative distribution function (CDF) of the real random variable X . This is a nondecreasing function with limits 0 at −∞, and 1 at +∞, which is right continuous. Setting H X (a − ) := limb↑a H X (b), we have that H X (a − ) = P[X < a]; P(X = a) = H X (a) − H X (a − ).
(4.45)
Given α ∈]0, 1[, we call any number a ∈ R such that P[X < a] ≤ α ≤ P[X ≤ a]
(4.46)
an α quantile. Having in view the minimization of losses, we define the value at risk of level α ∈]0, 1[ as VaRα (x) := min{a; H X (a) ≥ 1 − α} = min{a ∈ R; P[X > a] ≤ α}.
(4.47)
A constraint of the type VaRα (x) ≤ a means that the probability of a loss greater than a is no more than α. Obviously, VaRα (x) is an MMR. Its acceptation set is
(4.48)
4.3 Monetary Measures of Risk
175
A V a R,α := {X ; P[X > 0] ≤ α}.
(4.49)
Since the acceptation set is nonconvex, the value at risk is also nonconvex.
4.3.6.2
Conditional Value at Risk
Consider an optimization problem of the form: Min F(X ); VaRα (X ) ≤ 0,
X ∈X
(4.50)
where X is a Banach space. Let us see how to compute a convex function G(X ) such that G(X ) > 0 if VaRα (X ) > 0; the related problem Min F(X ); G(X ) ≤ 0
X ∈X
(4.51)
might be easier to solve, and its value will provide an upper bound of the one of (4.50). Observe that, for any γ > 0: P(X > 0) = E1{X >0} ≤ E[1 + γ X ]+ = γ E[γ −1 + X ]+ .
(4.52)
Dividing by γ > 0, we deduce that VaRα (X ) ≤ 0
⇒
inf {E[γ −1 + X ]+ − α/γ } ≤ 0.
γ >0
(4.53)
Setting δ = −1/γ and dividing by α, we obtain the equivalent relation VaRα (X ) ≤ 0
⇒
inf {δ + α −1 E[X − δ]+ } ≤ 0.
δ 0 a.s. Clearly Iϕ (P, P) = 0, and for a ∈ ∂ϕ(1): Iϕ (Q, P) = ≥a
Ω
ϕ(1 + (q(ω) − p(ω))/ p(ω)) p(ω)dμ(ω)
Ω
q(ω) − p(ω) p(ω)dμ(ω) = 0, p(ω)
(5.4)
since p and q are densities. In addition, since ϕ is strictly convex, equality holds iff q(ω) = p(ω) a.s. Taking ϕ = − log we recover, up to a constant, the (opposite of the) above function Φ.
5.2 Convergence in Law and Related Asymptotics In this section we will discuss random variables with image in a metric space. So, (Y, ρ) will be a metric space and the associated distance. An example that will be considered in applications is that of the space of continuous functions over a compact set.
5.2.1 Probabilities over Metric Spaces As a σ -field over Y we take the Borelian field (generated by open sets; its elements are called the Borelian subsets).
5.2 Convergence in Law and Related Asymptotics
179
Definition 5.3 We say that the probability measure P over Y is regular if any Borelian subset A of Y is such that, For any ε > 0, there exist F, G resp. closed and open subsets of Y (5.5) such that F ⊂ A ⊂ G and P(G \ F) < ε. Lemma 5.4 Any probability measure over a metric space is regular. Proof We follow [20, Ch. 1]. If A is closed, take F = A and for some δ > 0, G := G δ , where G δ := {y ∈ Y ; ρ(y, A) < δ}. Then P(G δ \ A) = E1{0 0 such that Pk (|X k |∼ > cε ) ≤ ε.
(5.15)
If X is an r.v. with value in Y , we have2 for all ε > 0, there exists a κε > 0 such that μ(|X |∼ > κε ) ≤ 21 ε.
(5.16)
Lemma 5.11 Let X k be a sequence of r.v.s with image in the metric space Y , converging in law to X . Then X k is bounded in probability. Proof Let ε > 0, κε be given by (5.16), and f : Y → R be continuous with image in [0, 1], with value 0 if |y|∼ ≤ κε , and 1 if |y|∼ ≥ κε + 1. Then Pk (|X k |∼ > κε + 1) ≤ Ek f (X k ) → E f (X ) ≤ μ(|X |∼ > κε ) ≤ 21 ε. We get the conclusion with cε := κε + 1.
(5.17)
Definition 5.12 A function f : Y → R is said to be uniformly continuous if for all ε > 0, there exists an α > 0 such that | f (y1 ) − f (y2 )| ≤ ε when ρ(y1 , y2 ) ≤ α. The function is said to be Lipschitz with constant L f if | f (y1 ) − f (y2 )| ≤ L f ρ(y1 , y2 ), for all y1 , y2 in Y . Definition 5.13 Let f be bounded Y → R. Given λ > 0, its Lipschitz regularisation is defined by f λ (y) := inf
z∈Y
1 f (z) + ρ(y, z) , for all y ∈ Y. λ
(5.18)
We recognize the natural extension to a metric space of an infimal convolution. We have in particular inf f ≤ f λ (y) ≤ f (y), for all y ∈ Y . Lemma 5.14 Let f be bounded and continuous Y → R, with Lipschitz regularisation f λ (y). Then (i) f λ (y) is Lipschitz with constant 1/λ, (ii) we have f λ (y) ↑ f (y) when λ ↓ 0, for all y ∈ Y . 1 The
definition is independent of y0 . In the applications Y will be a Banach space and we will take y0 = 0 so that | · |∼ will be equal to the norm of Y . 2 Indeed, the family A := {ω ∈ Ω; |X | > n} being nonincreasing with empty intersection, n ∼ μ(An ) ↓ 0 by (3.14).
182
5 Sampling and Optimizing
Proof Using the majoration of differences of infima by the supremum of differences, and the triangle inequality, we get
1 1 sup ρ(y , z) − ρ(y, z) ≤ ρ(y , y). λ z∈Y λ
f λ (y ) − f λ (y) ≤
(5.19)
By symmetry we deduce that f λ is Lipschitz with constant 1/λ. (ii) By the definition, f λ (y) ≤ f (y) and f λ (y) increases when λ ↓ 0. Fix y ∈ Y . Since f is continuous in y, for all ε > 0, there exists an α > 0 such that | f (z) − f (y)| ≤ ε when ρ(z, y) ≤ α. So,
f λ (y) = min inf ρ(z,y)≤α f (z) + λ1 ρ(z, y) , inf ρ(z,y)>α f (z) + λ1 ρ(z, y) ≥ min( f (y) − ε, inf f + α/λ), and hence, lim inf λ↓0 f λ (y) ≥ f (y) − ε. The conclusion follows.
(5.20)
Lemma 5.15 The convergence in law of X k to X holds iff Ek f (X k ) → E f (X ), for all f : Y → R Lipschitz and bounded.
(5.21)
Proof The condition is obviously necessary; let us show that it is sufficient. So, let (5.21) be satisfied, and let f : Y → R be continuous and bounded. By symmetry, it suffices to show that lim inf k Ek f (X k ) ≥ E f (X ). The Lipschitz regularization f λ of f being Lipschitz and bounded, it satisfies Ek f λ (X k ) → E f λ (X ).
(5.22)
By monotone convergence and in view of Lemma 5.14(ii), we have that for all ε > 0, there exists a λε such that E f λ (X ) ≥ E f (X ) − ε if λ < λε .
(5.23)
Using f λ (y) ≤ f (y), we get when λ < λε : lim inf Ek f (X k ) ≥ lim inf Ek f λ (X k ) = E f λ (X ) ≥ E f (X ) − ε, k
as was to be shown.
k
(5.24)
Corollary 5.16 The convergence of a random variable over a given probability space, either a.s., or in probability, implies the convergence in law. Proof By Theorem 3.28(ii), the convergence a.s. of an r.v. on a probability space implies the convergence in probability. It suffices therefore to consider the case of a sequence of r.v.s X k over (Ω, F , P) converging to X in probability. Let f be Lipschitz and bounded, with Lipschitz constant L. Then
5.2 Convergence in Law and Related Asymptotics
183
|E( f (X k ) − f (X ))| = E1{|X k −X |>ε} | f (X k ) − f (X )| +E1{|X k −X |≤ε} | f (X k ) − f (X )|
(5.25)
≤ 2 f ∞ meas({|X k − X | > ε}) + εL ,
converges to 0. We conclude by the previous lemma.
Definition 5.17 We say that f : Y → R has bounded support if f (y) = 0 when |y|∼ is large enough (i.e., when f is zero outside a set of finite diameter). Actually we can use as test functions Lipschitz functions with bounded support: L
Lemma 5.18 Let X k be bounded in probability. Then X k → X iff
Ek f (X k ) → E f (X ), for all f : Y → R Lipschitz with bounded support.
(5.26)
Proof It suffices to check that (5.26) implies the convergence in law. Let f be Lipschitz and bounded. For M > 0, let ϕ M be Lipschitz R → [0, 1], with value 1 over [0, M] and 0 over [M + 1, ∞[. By dominated convergence, lim Eϕ M (|X |∼ ) = E1 = 1.
M↑∞
(5.27)
Fix ε > 0. Let Mε be such that Eϕ Mε (|X |∼ ) ≥ 1 − 21 ε. For k large enough, by (5.26), we have that Ek ϕ Mε (|X k |∼ ) ≥ 1 − ε. Define ψε (t) := 1 − ϕ Mε (t). Then Eψε (|X |∼ ) ≤ 21 ε and, for large enough k, Ek ψε (|X k |∼ ) ≤ ε. We have then Ek f (X k ) = Ek f (X k )ϕ Mε (X k ) + Ek f (X k )ψε (X k ).
(5.28)
Using Ek f (X k )ϕ Mε (|X k |∼ ) → E f (X )ϕ Mε (|X |∼ ) and |Ek f (X k )ψε (|X k |∼ )| ≤ f ∞ Eψε (|X k |∼ ) ≤ ε f ∞ ,
(5.29)
we get with (5.28) lim inf Ek f (X k ) ≥ E f (X )ϕ Mε (|X |∼ ) − ε f ∞ . k
(5.30)
By the monotone convergence theorem, E f (X )ϕ Mε (|X |∼ ) → E f (X ) when ε ↓ 0, and so, lim inf k Ek f (X k ) ≥ E f (X ). Changing f into − f we obtain the converse inequality. The conclusion follows. Definition 5.19 Let yk be a sequence of r.v.s with image in Y . One says that yk converges in probability to y¯ ∈ Y (deterministic) if it converges in probability to the constant function with value y¯ over Y , i.e., if Pk {ω ∈ Ω; ρ(yk (ω), y¯ ) > ε} → 0, for all ε > 0.
(5.31)
184
5 Sampling and Optimizing
In particular, if yk is another sequence of r.v.s over the same probability spaces (Ωk , Fk , Pk ) as yk , with image in the separable3 metric space Y , one says that (the sequence of r.v.s Ωk × Ωk → R) ρ(yk , yk ) converges in probability to 0 if Pk {ρ(yk , yk ) > ε} → 0, for all ε > 0. Lemma 5.20 The convergence in probability of yk to y¯ ∈ Y is equivalent to the convergence in law of yk to the Dirac measure at y¯ . Proof (a) If yk converges in probability to y¯ ∈ Y , for all f Lipschitz and bounded, and ε > 0, we have Ek f (yk ) = Ek f (yk )1{ρ(yk , y¯ )≤ε} + Ek f (yk )1{ρ(yk , y¯ )>ε} ≥ Ek ( f ( y¯ ) − εL f ) + o( f ∞ ),
(5.32)
so that lim inf k Ek f (yk ) ≥ f ( y¯ ) − εL f . By symmetry we deduce that Ek f (yk ) → f ( y¯ ) and so, yk converges in law to the Dirac measure at y¯ . (b) Conversely, if yk converges in law to the Dirac measure at y¯ , taking f (y) := min(1, ρ(y, y¯ )), we get for all ε ∈ (0, 1): 0 = lim Ek f (yk ) − f ( y¯ ) = lim Ek f (yk ) ≥ ε lim sup Pk {ρ(yk , y¯ ) ≥ ε}. k
k
(5.33)
k
The conclusion follows. yk
Proposition 5.21 Let yk and be two sequences of r.v with image in the separable metric space Y , such that yk and yk have the same probability space (Ωk , Fk , Pk ), and ρ(yk , yk ) → 0 in probability. Then (i) We have that Ek [ f (yk ) − f (yk )] → 0, for all f Lipschitz and bounded. L
L
(ii) If yk → y¯ , where y¯ is an r.v., then yk → y¯ . Proof (i) Let f be Lipschitz and bounded. Then Ek [ f (yk ) − f (yk )] = Ek ( f (yk ) − f (yk ))1{ρ(yk ,yk )>ε} +Ek ( f (yk ) − f (yk ))1{ρ(yk ,yk )≤ε} ≥ −2 f ∞ Pk [ρ(yk , yk ) > ε] − εL f ,
(5.34)
and so lim inf k Ek [ f (yk ) − f (yk )] ≥ −εL f , which by symmetry implies Ek [ f (yk ) − f (yk )] → 0 as was to be shown. (ii) A consequence of (i) and of Lemma 5.15.
Remark 5.22 We will apply the proposition in the case when yk is a constant sequence equal to some r.v. y¯ . We have proved that, if ρ( y¯ , yk ) converges in probability to 0, then yk converges in law to y¯ . 3 The
II].
separability of Y ensures that ρ(yk (ω), yk (ω)) is measurable, see Billingsley [20, Appendix
5.2 Convergence in Law and Related Asymptotics
185
We recall the Skorokhod–Dudley representation theorem [118]; see [45, Thm. 11.7.2] for a proof. Theorem 5.23 Let y k be a sequence of r.v.s over (Ωk , Fk , Pk ), with values in a separable Banach space Y , converging in law to a probability P. Then there exists a probability space (Ω, F , μ) and a sequence yˆ k of r.v.s over (Ω, F , μ) with values L
in Y , such that yˆ k ∼ y k (and therefore yˆ k converges in law to P), and yˆ k converges a.s. (and therefore also, by Theorem 3.28, in probability).
5.2.3 Central Limit Theorems We first recall the classical result, see e.g. [20]. Theorem 5.24 (Central limit) Let X be an r.v. with values in Rm and finite second moment, expectation X¯ , and covariance matrix V of size m × m. Set X N := N −1 (X 1 + · · · + X N ), where the X i are independent with the law of X . Then N 1/2 (X N − X¯ ) converges in law to the Gaussian of expectation 0 and variance V . In what follows we will consider samples of functions to be minimized. So we need an infinite-dimensional version of the previous results. Definition 5.25 Let y and z be two r.v.s over the probability space (Ω, F , μ) with image in a Banach space Y . We assume that y and z have finite second moment, and denote by y¯ , z¯ their expectations. For any pair (g, h) in Y ∗ × Y ∗ , we define the covariance of (y, z) along (g, h) by cov[y, z](g, h) := E [g, y − y¯ h, z − z¯ ] .
(5.35)
Note that the functions (y, z) → cov[y, z](g, h) and (g, h) → cov[y, z](g, h) are bilinear and continuous, from L 2 (Ω, Y )2 and Y ∗ × Y ∗ to R resp. Set var[y](g) := cov[y, y](g, g).
(5.36)
Definition 5.26 We say that a measure μ over Y is Gaussian if, for all nonzero y ∗ ∈ Y ∗ , the following measure over R is Gaussian: μ[y ∗ ](B) := μ({y ∈ Y ; y ∗ , y ∈ B}), for all Borelian B ⊂ R.
(5.37)
Consider a probability space (Ω, F , μ), a compact space X ⊂ Rn , and a Carathéodory function f : Ω × Rn → R p , i.e., f (ω, x) is continuous w.r.t. x a.s., and
186
5 Sampling and Optimizing
measurable in ω for all x. We assume that f is Lipschitz (in x) with a square integrable Lipschitz constant, in the sense that | f (ω, x ) − f (ω, x)| ≤ a(ω)|x − x|, for all x and x in X,
(5.38)
with a(ω) ∈ R+ measurable and of finite second moment: Ea(ω)2 < ∞.
(5.39)
We assume the existence of a finite second moment for a particular point x0 ∈ X : E f (ω, x0 ))2 < ∞,
(5.40)
which combined with the previous hypotheses implies the existence of a finite second moment of f (·, x) for any x ∈ X . Then ω → f (ω, ·) is an r.v. with image in the Banach space Y = Cb (X ) p , with expectation denoted by f¯(x). We denote the sample approximation of f¯ by N 1 f (ωi , x). fˆN (x) := N i=1
(5.41)
We next state a Functional Central Limit Theorem (FCLT; functional here means infinite-dimensional).
√ Theorem 5.27 If (5.38)–(5.40) holds, then N fˆN (x) − f¯(x) converges in law to the Gaussian of covariance equal to that of f . Proof See Araujo and Giné [8, Cor. 7.17] for the proof of this difficult result.
5.2.4 Delta Theorems 5.2.4.1
The First-Order Delta theorem
We now establish some differential calculus rules for r.v.s converging in law. Theorem 5.28 (Delta theorem) Let Yk be a sequence of r.v.s with values in a separable Banach space Y containing η, τk ↑ ∞, and Z an r.v. with values in Y , such that Z k := τk (Yk − η) converges in law to Z . Let G : Y → W , where W is a Banach space, be differentiable at η. Then τk (G(Yk ) − G(η)) converges in law to G (η)Z . Proof In view of the representation Theorem 5.23, we may suppose that the Yk are r.v.s over the same probability space (Ω, F , P) and that Z k → Z a.s. Since G is differentiable at η,
5.2 Convergence in Law and Related Asymptotics
τk (G(Yk ) − G(η)) → G (η)Z a.s.
187
(5.42)
We conclude by applying Corollary 5.16 to the above expression. In the applications we wish to expand the minimum value of an expectation function. Since the minimum is not a differentiable function, we need to extend the Delta theorem. Definition 5.29 Let X be a Banach space, K ⊂ X , and x¯ ∈ K . We call the set ¯ := {h ∈ X ; there exists tk ↓ 0, xk ∈ K ; (xk − x)/t ¯ k → h} TK (x)
(5.43)
the (tangent) cone of Bouligand to K at x. ¯ Note that, if K is convex, this set coincides with the tangent cone in the sense of convex analysis (Definition 1.80). Definition 5.30 Let X and W be two Banach spaces, K ⊂ X , and G : K → W . One says that G is Hadamard differentiable at x¯ ∈ K , tangentially to K , in the direction ¯ if, for any sequence (tk , xk ) associated with Definition 5.29, we have h ∈ TK (x) ¯ that (G(xk ) − G(x))/t k has a limit, independent of the particular sequence (tk , x k ), ¯ one says that G is Hadamard denoted by G (x, h). If this holds for all h ∈ TK (x), differentiable at x¯ tangentially to K . When K = Y , one says that G is Hadamard differentiable at x. ¯ Lemma 5.31 If G is the restriction of a Lipschitz mapping X → Y , with directional derivatives at x, ¯ then it is Hadamard differentiable at x. ¯ ¯ k → h, Proof Indeed, let G have Lipschitz constant L. When tk ↓ 0 and (xk − x)/t we have lim k
G(xk ) − G(x) ¯ G(x¯ + tk h) − G(x) ¯ G(xk ) − G(x¯ + tk h) = lim + lim . k k tk tk tk (5.44)
Since G(xk ) − G(x¯ + tk h) ≤ Lxk − (x¯ + tk h) = o(tk ), the limit of the r.h.s. of (5.44) is G (x, h). The result follows. We next introduce the “Hadamard” version of the Delta theorem.
(5.45)
Theorem 5.32 (Hadamard Delta Theorem) Let Y and W be Banach spaces, with Y separable, K a subset of Y , G : K → W Hadamard differentiable at η ∈ K tangentially to K , and Yk a sequence of r.v.s with values in K . Let τk ↑ ∞, and Z an r.v. with values in Y , such that Z k := τk (Yk − η) converges in law to Z . Then τk (G(Yk ) − G(η)) converges in law to G (η, Z ). Proof The proof is similar to that of Theorem 5.28, replacing (5.42) with τk (G(Yk ) − G(η)) → G (η, Z ) a.s.
(5.46)
188
5.2.4.2
5 Sampling and Optimizing
The Second-Order Delta Theorem
Definition 5.33 Let X and W be two Banach spaces, K ⊂ X , and G : K → W ¯ be Hadamard differentiable at x¯ ∈ K , tangentially to K , in direction h ∈ TK (x), ¯ h). One says that G is second-order with directional derivative denoted by G (x, ¯ if Hadamard differentiable at x¯ ∈ K , tangentially to K , in the direction h ∈ TK (x) for any sequence (tk , xk ) associated with Definition 5.29, we have the existence of G (x, ¯ h) := lim k
G(xk ) − G(x) ¯ − G (x, ¯ xk − x) ¯ , 1 2 t 2 k
(5.47)
¯ the limit being independent of the sequence (tk , xk ). If this holds for all h ∈ TK (x), one says that G is second-order Hadamard differentiable at x¯ tangentially to K . When K = Y , one says that G is second-order Hadamard differentiable at x. ¯ Observe that, if G is of class C 2 , then ¯ h) = D 2 G(x)(h, ¯ h). G (x,
(5.48)
Theorem 5.34 (Second-order Hadamard Delta Theorem) Let Y and W be Banach spaces, with Y separable, K a subset of Y , G : K → W second-order Hadamard differentiable at η ∈ K tangentially to K , and Yk a sequence of r.v.s with values in K . Let τk ↑ ∞, and Z an r.v. with values in Y , such that Z k := τk (Yk − η) converges in law to Z . Then we have the convergence in law L
2τk2 (G(Yk ) − G(η) − G (η, Yk − η)) → G (η, Z ). Proof The arguments are similar to those of the first-order delta theorem.
(5.49)
5.2.5 Solving Equations 5.2.5.1
Taylor Expansion of the Solution of an Equation
Let Z be an open subset of Rn , with closure denoted by Z¯ , and Lipschitz boundary ∂ Z (meaning that locally, up to a diffeomorphism, Z coincides with the set {z ∈ Rn ; z n ≤ f (z 1 , . . . , z n−1 )} for some Lipschitz function f ). If ϕ is a C p function over Z with image in Rn and uniformly continuous derivatives up to order p, we extend these derivatives over ∂ Z by continuity. We denote by Φ p the space of such C p function over Z¯ . We can identify Φ 1 with a closed subspace of Cb ( Z¯ )n+1 , and similarly identify Φ p with a closed subspace of Cb ( Z¯ )n( p) , for some n( p). For ϕ ∈ Φ p , p ≥ 1, and z ∈ Z , consider the equation F(ϕ, z) := ϕ(z) = 0.
(5.50)
5.2 Convergence in Law and Related Asymptotics
189
Clearly F is for given z a linear continuous function of ϕ ∈ Φ p , with derivative D F(ϕ, z)(ψ, ζ ) = ψ(z) + ϕ (z)ζ.
(5.51)
This derivative being a continuous function of (ϕ, z), F is of class C 1 . Assume next that ϕ¯ has a root z¯ in the interior of Z , and that ϕ¯ (¯z ) is invertible. By the implicit function theorem we have that, locally, ϕ(z) = 0 iff z = G(ϕ) for some C 1 function G : Φ p → Z . So we have that ϕ(G(ϕ)) = 0. Computing the derivative ¯ we obtain of ϕ(G(ϕ)) = 0 in direction ψ ∈ Φ p , at ϕ, ¯ = 0. ψ(¯z ) + ϕ¯ (¯z )G (ϕ)ψ
(5.52)
Since ϕ¯ (¯z ) is invertible, we obtain the expression of the derivative of G at ϕ¯ as ¯ = −ϕ¯ (¯z )−1 ψ(¯z ). G (ϕ)ψ
(5.53)
We can write this for a neighbouring function ϕ as ϕ (G(ϕ))G (ϕ)ψ + ψ(G(ϕ)) = 0.
(5.54)
Differentiating the above expression wrt ϕ in the direction of ψ we obtain ψ (z)G (ϕ)ψ + ϕ (z)(G (ϕ)ψ)2 + ϕ (z)G (ϕ)(ψ)2 + ψ (z)G (ϕ)ψ = 0. (5.55) Since ϕ (z) is invertible this provides an expression for G (ϕ)(ψ)2 . The first and last term are identical and we can eliminate G (ϕ)ψ = −ϕ (z)−1 ψ(z).
(5.56)
So, 2 ¯ = ϕ¯ (¯z )−1 2ψ (¯z )ϕ (¯z )−1 ψ(¯z ) − ϕ (¯z )(ϕ (¯z )−1 ψ(¯z ))2 . G (ϕ)(ψ)
5.2.5.2
(5.57)
Stochastic Equations
Let f (ω, x) be a Carathéodory function Ω × Rn → Rn , a.s. of class C 2 w.r.t. x. Denote by D f (ω, x) and D 2 f (ω, x) the corresponding derivatives. Assume that f , D f (ω, x) and D 2 f (ω, x) are square integrable and Lipschitz in x with a square integrable Lipschitz constant (hypotheses (5.38)–(5.40)). Setting f¯(x) := E f (·, x), consider the equation f¯(x) = 0. (5.58) We assume that it has a regular root x, ¯ i.e., f¯(x) ¯ = 0 and D f¯(x) ¯ is invertible. There ¯ ¯ x, exists an ε > 0 such that f has no other root in B( ¯ ε), and D f¯ is uniformly
190
5 Sampling and Optimizing
¯ x, invertible over B( ¯ ε). Let Y denote the separable Banach space of C 1 functions ¯ x, over B( ¯ ε) with image in Rn , endowed with the natural norm yY := max |y(x)| + max |Dy(x)|. x
x
(5.59)
If g ∈ Y is close enough to the restriction of f¯ to Y , it has a unique solution denoted by χ (g). Otherwise we set χ (g) equal to a given x0 ∈ Rn . The sampling approximation is (5.60) fˆN (x) = 0. We set xˆ N := χ ( fˆN ). Theorem 5.35 Let f , x¯ be as above and Z (x) ¯ denote the covariance of f (·, x). ¯ Then L N 1/2 (xˆ N − x) ¯ → − f¯ (x) ¯ −1 Z (x), ¯ (5.61) ¯ −1 Z (x)) ¯ converges in law to and 2N (xˆ N − x¯ + f¯ (x) ¯ −1 2Z (x) ¯ f¯ (x) ¯ −1 Z (x) ¯ + f¯ (x)( ¯ f¯ (x) ¯ −1 Z (x)) ¯ 2 . f¯ (x)
(5.62)
5.3 Error Estimates In this section, given a probability space (Ω, F , P), we assume that X is a compact subset of Rn , and
(5.63)
We also assume that f has finite second moments and denote its expectation by f¯(x).
5.3.1 The Empirical Distribution When optimizing an expectation, it frequently occurs that the law μ is not known, but nevertheless it is possible to get a sample of realizations that follows the law of μ. Given an integer N > 0, we obtain an empirical distribution μˆ n that associates the probability 1/N with each element ω1 , . . . , ω N of the sample (or rather gives probability p/N in the case of p identical realizations) and zero to the others. This empirical distribution is an r.v. We recall that we denote by N 1 ˆ f (ωi , x) f N (x) = N i=1
(5.64)
5.3 Error Estimates
191
the mean value of the empirical distribution, also called the standard estimator of the mean value. We recall that this estimator is unbiased, since N 1 E fˆN (x) = E f (ωi , x) = f¯(x). N i=1
(5.65)
The estimation error is fˆN (x) − f¯(x), with variance N
2 2 1
1 E fˆN (x) − f¯(x) = 2 E f (ωi , x) − f¯(x) = V ( f, x). N i=1 N
(5.66)
So, the standard deviation (square root of the variance) of fˆN (x) is N −1/2 V ( f, x)1/2 . We recall the classical estimator of the variance. Lemma 5.36 A convergent, unbiased estimator of V ( f, x) is Vˆ ( f, x) :=
2 1 f (ωi , x) − fˆN (x) . N − 1 i=1 N
(5.67)
Proof Omitting the dependence on x and assuming w.l.o.g. that f¯ = 0, we get (N − 1)Vˆ ( f ) :=
N i=1
f (ωi )2 − 2 fˆN
N
f (ωi ) + N fˆN2 =
i=1
N
f (ωi )2 − N fˆN2 ,
i=1
and so (N − 1)EVˆ ( f ) = N V ( f ) − V ( f ); the result follows.
(5.68)
Remark 5.37 It follows that the naive estimator below has a negative bias of order 1/N : N
2 1 ˜ V ( f, x) := f (ωi , x) − fˆN (x) . (5.69) N i=1
5.3.2 Minimizing over a Sample The problem of minimizing the expectation of f : Min f¯(x) x∈X
(P)
can be approximated by the problem of minimizing the standard estimate of the mean value:
192
5 Sampling and Optimizing
Min fˆN (x) x∈X
( PˆN ).
Lemma 5.38 The function N → E val( PˆN ) is nondecreasing, and satisfies
lim E val( PˆN ) ≤ val(P). N
(5.70)
Proof (a) We first show that E val( PˆN ) ≤ val(P). Since f¯(x) = E fˆN (x), this is equivalent to ˆ ˆ (5.71) inf E f N (x) ≥ E inf f N (x) , x∈X
x∈X
which is a consequence of Jensen’s inequality,
the infimum being a concave function. ˆ (b) Let us check that v N := E val( PN ) is nondecreasing. Indeed, by Jensen’s inequality again: N +1
1 1 E inf f (ω j , x) x∈X N +1 N j=i i=1 N +1
1 1 E inf ≥ f (ω j , x) x∈X N N +1 i=1 j=i N +1
1 1 E inf = f (ω j , x) = v N , N + 1 i=1 x∈X N j=i
v N +1 =
(5.72)
as was to be shown.
By the above lemma, val( PˆN ) is an estimate of val(P) with a nonpositive bias. Remark 5.39 As an illustration of Lemma 5.38, consider the unbiased estimate Vˆ ( f, x) of the variance, defined in (5.67). Alternatively we could solve the problem Min e∈R
n 1 ( f (ωi ) − e)2 , N i=1
whose solution is e = fˆN . The value of this problem is the estimator V˜ ( f, x) = (N − 1)N −1 Vˆ ( f, x), which as we have seen has a negative bias.
5.3 Error Estimates
193
5.3.3 Uniform Convergence of Values Set g(ω) := max | f (ω, x)|. Since X is compact, it contains a dense sequence x k , x
and since f (ω, x) ∈ Cb (X ) a.s., we have that g(ω) = max | f (ω, x k )| a.s., proving k
that g is measurable. Theorem 5.40 Let (5.63) hold. If g is integrable, then fˆN (x) → f¯(x) uniformly, with probability (w.p.) 1. Proof Since g is integrable, by the dominated convergence theorem, for any x ∈ X , we have that (i) f (·, x) is integrable, and so, f¯(x) is real-valued, and (ii) if x j → x ˆ i.e., f¯ is continuous. in X , then f¯(x j ) → f¯(x), Define, for x and x in X and ω ∈ Ω: h(ω, x ) := | f (ω, x) − f (ω, x )|.
(5.73)
This function is continuous w.r.t. (x, x ), and is dominated by 2g(ω). By arguments ¯ similar to the previous ones, its expectation h(x, x ) is a continuous function, with zero value when x = x . Since a continuous function over a compact set is uniformly ¯ x ) < ε when |x − continuous, for all ε > 0, there exists an αε > 0 such that h(x, x| ≤ αε . In addition, lim N
1 ˆ f N (x) − fˆN (x ) ≤ lim N N x ∈B(x,αε ) sup
sup
N
x ∈B(x,αε ) i=1
h(ω, xi ) ≤ ε w.p. 1,
(5.74) where the first inequality uses the triangle inequality, and the second one uses the separability of B(x, αε ) to ensure the measurability of supx ∈B(x,αε ) h(ω, x ), and the law of large numbers. Covering the compact set X by finitely many open balls with radius αε and center x k , k = 1 to K ε , and using fˆN (x k ) → f¯(x k ) w.p. 1, we get, for x ∈ B(x k , αε ): lim sup fˆN (x) − f¯(x) ≤ lim sup fˆN (x) − fˆN (x k ) + lim sup f¯(x k ) − f¯(x) . N
N
N
(5.75) The first limit in the r.h.s. is w.p. 1 no more than ε by (5.74), and the second one can be made arbitrarily small by taking αε small enough. The conclusion follows. Corollary 5.41 Under the hypotheses of Theorem 5.40, val( Pˆn ) → val(P) with probability 1. Proof Indeed, the theorem ensures w.p. 1 the uniform convergence of the cost function, and the function f → min x∈X f (x) is continuous over Cb (X ).
194
5 Sampling and Optimizing
5.3.4 The Asymptotic Law Let f : X → R. We set S( f ) := {x¯ ∈ X ; f (x) ¯ = inf f (x)}. x∈X
(5.76)
The next proposition is due to Danskin [38]. Proposition 5.42 The map min : Cb (X ) → R, that to f ∈ Cb (ω) associates the value min x∈X f (x), is Hadamard differentiable, and its derivative at f in direction g ∈ Cb (X ) is (5.77) min ( f, g) = min x∈S( f ) g(x). Proof Since |min( f ) − min( f )| ≤ supx | f (x) − f (x)|, we have that min(·) is Lipschitz. By Lemma 5.31, it suffices to check that it has directional derivatives satisfying (5.77). Let f and g belong to Cb (X ). We have, for ε > 0: min( f + εg) ≤ min ( f (x) + εg(x)) = min( f ) + ε min g(x). x∈S( f )
x∈S( f )
(5.78)
On the other hand, let εk ↓ 0, and x k ∈ S( f + εk g). Extracting a subsequence if ¯ Passing to the limit in the relation necessary, we may assume that x k → x. f (x k ) + εk g(x k ) ≤ f (x) + εk g(x), for all x ∈ X,
(5.79)
we deduce that x¯ ∈ S( f ). By continuity of g, we have ¯ + o(εk ) min( f + εk g) = f (x k ) + εk g(x k ) = f (x k ) + εk g(x) ≥ min( f ) + εk min g(x) + o(εk ),
(5.80)
x∈S( f )
which combined with (5.78) implies the conclusion.
Theorem 5.43 Let f (ω, x) satisfy (5.38)–(5.40), and denote by Z (x) the Gaussian with variance equal to that of f (ω, x). Then N 1/2 (val( PˆN ) − val(P)) converges in law to min x∈S( f¯)) Z (x). Proof By Theorem 5.27, N 1/2 ( fˆN (x) − f¯(x)) converges in law to Z . We conclude by combining Proposition 5.42 and the Hadamard Delta Theorem 5.32. Remark 5.44 The asymptotic law of N 1/2 (val( PˆN ) − val(P)), when the minimum of f¯ over X is not unique, is therefore in general not Gaussian. Example 5.45 Let ω √ be a standard Gaussian variable, X = {1, 2},√ f (ω, 1) = ω, f (ω, 2) = 0. Then N fˆN (1) is a standard Gaussian variable, and N fˆN (2) = 0. ˆ So the √ law of ( PN ) is min(0, Z 1 ), where Z 1 is a standard Gaussian variable. We have ˆ that N f N (x) converges in law to Z := (Z 1 , 0) so that, as follows from the above
5.3 Error Estimates
195
theorem, min x fˆN (x) converges in law (since in fact here the law is constant over the sequence) to min(0, Z 1 ).
5.3.5 Expectation Constraints Problems with expectation type constraints need a more involved analysis. We restrict ourselves to the convex setting, which is the only one that is well understood.
5.3.5.1
Marginal Analysis of Convex Problems
Let X be a compact subset of Rn and ( f, G) be continuous functions from X to R and R p resp. The associated optimization problem is Min f (x); G(x) ≤ 0, x ∈ X. x
(P f,G )
Denote by val( f, G) its value, and by L[ f, G](x, λ) := f (x) + λ · G(x) its Lagrangian. The dual problem is Maxp inf L[ f, G](x, λ), λ∈R+ x∈X
(D f,G )
with solution set denoted by Λ( f, G). We know that, if the duality gap is zero, then ¯ is a primal-dual solution4 iff (x, ¯ λ) ¯ = 0. x¯ ∈ argmin L[ f, G](x, λ¯ ); λ¯ ≥ 0; λ¯ · G(x)
(5.81)
x∈X
One easily checks that the stability condition (1.170) of the duality theory holds iff
There exists β f,G > 0, and x f,G ∈ X, such that G i (x f,G ) < −β f,G , i = 1, . . . , p.
(5.82)
The following result is consequence of Proposition 1.98: Lemma 5.46 Assume that X ⊂ Rn is convex and compact, f and G i , i = 1 to p, are continuous and convex functions over Rn , and the stability condition (5.82) holds. Then (P f,G ) and (D f,G ) have the same value, the sets S( f, G) and Λ( f, G) are compact and nonempty, and the primal-dual solutions are characterized by (5.81). Denote by Cconv (X ) the set of restrictions to X of continuous convex functions over Rn . The reference functional space is Cb (X ). The following result takes its origin in Gol’shtein [54]: 4 This
means that x¯ is a solution of (P f,G ), and λ¯ is a solution of (D f,G ).
196
5 Sampling and Optimizing
Theorem 5.47 Assume that X ⊂ Rn is convex and compact, f and G i , i = 1 to p, are restrictions of continuous convex functions over Rn , and the stability condition (5.82) holds. Then val(·, ·) is Hadamard differentiable at ( f, G) tangentially to Cconv (X ), and the expression of its derivative in the direction (φ, Ψ ) tangent to Cconv (X ) at ( f, G) is val ( f, G)(φ, Ψ ) = min
max L[φ, Ψ ](x, λ).
x∈S( f,G) λ∈Λ( f,G)
(5.83)
In addition, let a sequence in Cconv (X ) be of the form ( f + εk φ k , G + εk Ψ k ), with (φ k , Ψ k ) → (φ, Ψ ) uniformly and εk ↓ 0. Then any limit point of S( f + εk φ k , G + εk Ψ k ), belong to the set argmin max L[φ, Ψ ](x, λ).
x∈S( f,G) λ∈Λ( f,G)
(5.84)
Proof (a) Let (φ k , Ψik ), i = 1 to p, belong to Cb (Rn ), and be such that ( f + εk φik , G + εk Ψik ), i = 1 to p, are continuous convex functions over Rn , and (φ k , Ψ k ) converge uniformly to (φ, Ψ ) over X . Set vk := val( f + εk φ k , G + εk Ψ k ).
(5.85)
Let x¯ ∈ S( f, G), γ ∈ (0, 1) and set x γ := γ x f,G + (1 − γ )x¯ with x f,G defined in (5.82). Then x γ ∈ X and ¯ < −γβ f,G . G i (x γ ) ≤ γ G i (x f,G ) + (1 − γ )G i (x)
(5.86)
For k large enough, we have that x γ ∈ F( f + εk φ k , G + εk Ψ k ). Let x k ∈ S( f + εk φ k , G + εk Ψ k ). Then lim sup f (xk ) = lim sup vk ≤ inf lim( f + εk φ k )(x γ ) = val( f, G). k
k
γ ∈(0,1) k
(5.87)
(b) Again, let x k ∈ S( f + εk φ k , G + εk Ψ k ). For all x ∈ S( f, G) and λ ∈ Λ( f, G), we have that: vk = ( f + εk φ k )(x k ) ≥ ( f + εk φ k )(x k ) + λ · (G + εk Ψ k )(x k ) = L[ f, G](x k , λ) + εk L[φ k , Ψ k ](x k , λ) ≥ val( f, G) + εk L[φ k , Ψ k ](x k , λ).
(5.88)
We used here the complementarity conditions, the minimality of the Lagrangian L[ f, G](·, λ) at S( f, G), and the fact that L[ f, G](x, λ) = val( f, G) when x ∈ S( f, G). By (5.87), (5.88), vk → val( f, G), and since vk = f (x k ) + o(1), and G(x k )+ = O(ε), any limit point xˆ of x k belongs to S( f, G), and we get by (5.88), extracting a subsequence if necessary:
5.3 Error Estimates
197
vk ≥ val( f, G) + εk L[φ, Ψ ](x, ˆ λ) + εk o(1 + |λ|).
(5.89)
Maximizing w.r.t. λ in the compact set Λ( f, G), we get ˆ λ) + o(εk ). vk ≥ val( f, G) + εk max L[φ, Ψ ](x, λ∈Λ( f,G)
(5.90)
Minimizing then w.r.t. xˆ ∈ S( f, G), we obtain vk ≥ val( f, G) + εk min
max L[φ, Ψ ](x, λ) + o(εk ).
x∈S( f,G) λ∈Λ( f,G)
(5.91)
(c) Again, let x k ∈ S( f + εk φ k , G + εk Ψ k ). Fix x¯ ∈ S( f, G). The stability condition (5.82) implies that Λk := Λ(φ k , Ψ k ) is uniformly bounded for k large enough. Let ¯ and λk ∈ Λk . Extracting a subsequence if necessary, we may assume that λk → λ, one shows easily that λ¯ ∈ Λ( f, G). We get vk = f (x k ) + εk φ k (x k ) = min L[ f + εk φ k , G + εk Ψ k ](x, λk ) x∈X
≤ L[ f + εk φ k , G + εk Ψ k ](x, ¯ λk ) k k ¯ λk ) = L[ f, G](x, ¯ λ ) + εk L[φ , Ψ k ](x, ¯ λ¯ ) + o(εk ). ≤ val( f, G) + εk L[φ, Ψ ](x,
(5.92)
¯ λ¯ ) = val( f, G), The second inequality uses the relation L[ f, G](x, ¯ λk ) ≤ L[ f, G](x, ¯ a consequence of the fact that λ is a dual solution. Since Λ( f, G) is bounded, we get ¯ λ) + o(εk ), vk ≤ val( f, G) + εk max L[φ, Ψ ](x, λ∈Λ( f,G)
(5.93)
and minimizing w.r.t. x¯ ∈ S( f, G) we obtain the converse inequality of (5.91), implying (5.83). Finally the property about limit points of primal solutions follows from (5.83) and (5.90).
5.3.5.2
Application to Expectation Constraints
Let f (ω, x) : Ω × Rn → R and G(ω, x) : Ω × Rn → R p . Assume that f and G i , i = 1 to p, are convex w.r.t. x a.s., and measurable in ω for all x (Carathéodory conditions for convex functions), and satisfy (5.38)–(5.40). Their expectations f¯(x) = E f (x, ·) and G¯ i (x) = EG i (x, ·) are therefore convex and continuous. Let us consider the convex problem ¯ ≤ 0, x ∈ X, Min f¯(x); G(x) x
(P f¯,G¯ )
198
5 Sampling and Optimizing
with X a compact and convex subset of Rn . The sample approximation of this problem is (P ˆ ˆ ) Min fˆN (x); Gˆ N (x) ≤ 0, x ∈ X, f N ,G N
x
where fˆN is the empirical estimate (5.41), and the same convention for Gˆ N (with the same sample). We need the qualification condition There exists β > 0, and x 0 ∈ X such that G¯ i (x 0 ) < −β, i = 1 to p.
(5.94)
¯ of solutions of (P f¯,G¯ ) is a convex and compact subset of X . We The set S( f¯, G) ¯ ¯ recall that we denote by L[ f¯, G](x, λ) := f¯(x) + λ · G(x) the Lagrangian and by ¯ ¯ Λ( f , G) the set of Lagrange multipliers, solutions of the dual problem ¯ λ). Maxp inf L[ f¯, G](x,
(D f¯,G¯ )
λ∈R+ x∈X
Theorem 5.48 Let f (ω, x) and G(ω, x) satisfy (5.38)–(5.40), and let the qualification condition (5.94) hold. Let (Z f¯ , Z G¯ ) denote the components of the Gaussian ¯ Let Z i variable with image in Cb (X ) p+1 , of covariance equal to that of ( f¯, G). ¯ denote the component associated
with G i . Then we have the convergence in law of 1/2 N val(P fˆN ,Gˆ N ) − val(P f¯,G¯ ) towards min
max
¯ λ∈Λ( f¯,G) ¯ x∈S( f¯,G))
Z f¯ (x) +
p
λi Z i (x) .
(5.95)
i=1
Proof By (5.38)–(5.40) and Theorem 5.27, ( fˆN , Gˆ N ) converges in law towards (Z f¯ , Z G¯ ). We conclude by combining Theorem 5.47 and the Hadamard Delta Theorem 5.32.
5.4 Large Deviations In this section we briefly recall the starting point of the theory of large deviations, and show how to apply this theory to stochastic optimization problems.
5.4.1 The Principle of Large Deviations Let X 1 , . . . , X N be independent r.v.s with law equal to that of X . Set Z N := N −1 (X 1 + · · · + X N ). For all a ∈ R and t > 0, we have that
5.4 Large Deviations
199
P(Z N ≥ a) = E[1 Z N ≥a ] ≤ e−ta E[et Z N 1 Z N ≥a ] ≤ e−ta E[et Z N ].
(5.96)
Denote by M(t) := E[et X ] the moment-generating function. We know that N E[et X i /N ] = M(t/N ) N . E[et Z N ] = Πi=1
(5.97)
Denote by L M(t) := log M(t) the logarithmic moment-generating function. Then 1 t log P(Z N ≥ a) ≤ − a + L M(t/N ). N N
(5.98)
Minimizing over t > 0, we obtain
where
1 log P(Z N ≥ a) ≤ −I + (a), N
(5.99)
I + (a) := sup {aτ − L M(τ )} .
(5.100)
τ >0
This definition is close to that of the Fenchel transform of the logarithmic momentgenerating function, also called the rate function: I (a) := sup {aτ − L M(τ )} . τ
(5.101)
We have of course I + (a) ≤ I (a). The interesting case is when a > E(X ); then the probability of Z N ≥ a tends to zero as N ↑ ∞. We will see then that I + (a) = I (a) under weak hypotheses, and this gives the following large deviations estimate: Theorem 5.49 (Cramér’s theorem) Let a > E(X ). If M(τ ) has a finite value in [−τ, τ ] for some τ > 0, then I + (a) = I (a), and so with (5.99): P(Z N ≥ a) ≤ e−N I (a) .
(5.102)
Proof We have that M(0) = 1. Set τ := 21 τ . Let t ∈ (−τ , τ ). By the mean value theorem, et X (ω) − 1 = t X (ω)eθ X (ω) for some θ = θ (ω) ∈ (−τ , τ ), and since τ |X (ω)| ≤ eτ |X (ω)| : |et X (ω) − 1| 1 1 ≤ τ |X (ω)|eτ |X (ω)| ≤ eτ |X (ω)| . t τ τ
(5.103)
Since M(τ ) has a finite value in [−τ, τ ], the r.h.s. has an expectation majorized by (M(τ ) + M(−τ ))/τ . By the dominated convergence theorem, (M(t) − M(0))/t has when t ↓ 0 a limit equal to M (0+ ) = E(X ). Consequently, L M (0) = M (0)/M(0) = E(X ).
(5.104)
200
5 Sampling and Optimizing
Since L M is convex,5 and so aτ − L M(τ ) is concave, and has a derivative a − E(X ) > 0 at τ = 0, the supremum in (5.101) is attained for τ > 0. The result follows.
5.4.2 Error Estimates in Stochastic Programming Let us come back to the stochastic optimization problem (P) and its sampled version ( PˆN ) of Sect. 5.3.2. Assume for the sake of simplicity that (P) has at least one solution x, ¯ and that the moment-generating function M(t) is finitely-valued for t > 0 small enough. By the large deviations principle, for all a > f¯(x), ¯ we have, denoting by Ix the rate function associated with f (ω, x): ¯ ≥ a) ≤ e−N Ix¯ (a) . P(val( PˆN ) ≥ a) ≤ P( fˆN (x)
(5.105)
So, the value of the sampled problem has an “exponentially weak” probability of being more than f¯(x) ¯ plus a given positive amount. Remark 5.50 When minimizing over a finite set, it follows by similar arguments that the value of the sampled problem has an “exponentially weak” probability of being less than f¯(x) ¯ minus a given positive amount.
5.5 Notes The state of the art on the subject of the chapter is given in Ruszczynski and Shapiro [106], and Shapiro et al. [114]. The Hadamard Delta Theorems 5.32 and 5.43 are due to Shapiro [111]. Theorem 5.47 is also due to Shapiro [112]. Linderoth et al. [75] made extensive numerical tests to obtain statistical estimates of the value function for simple recourse problems.
5 It suffices to check this in the case of a finite sum. Let L M(t) = log( n t xi with the p posi i=1 pi e ), n n 2 t xi − −1 t x −1 i itive of sum one. Then L M (t) = M(t) p x e and L M (t) = M(t) i i i=1 i=1 pi x i e n pi xi et xi )2 . We conclude by the Cauchy–Schwarz inequality. M(t)−2 ( i=1
Chapter 6
Dynamic Stochastic Optimization
Summary Dynamic stochastic optimization problems have the following information constraint: each decision must be a function of the available information at the corresponding time. This can be expressed as a linear constraint involving conditional expectations. This chapter develops the corresponding theory for convex problems with full observation of the state. The resulting optimality system involves a backward costate equation, the control variable being a point of minimum of some Hamiltonian function.
6.1 Conditional Expectation 6.1.1 Functional Dependency Quite often a decision needs to be a function of certain signals, or outputs of the system. Mathematically this means that, given two functions X (the signal) and Y (the decision) over the set Ω of events, we need to take Y = g(X ) for some function g. As Lemma 3.15 shows, in the framework of finite-dimensional measurable functions, this (nonlinear) constraint can be expressed as the (linear) constraint that Y belongs to the σ -algebra generated by X . So in the sequel we will study optimization problems with the measurability constraint to belong to a certain sub σ -algebra.
6.1.2 Construction of the Conditional Expectation Let (Ω, F , P) be a probability space, and let G be a sub σ -algebra of F . For s ∈ [1, ∞], we write L s (F ) := L s (Ω, F );
HF := L 2 (F )m ,
© Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_6
(6.1) 201
202
6 Dynamic Stochastic Optimization
with a similar convention for G . The scalar product in HF is denoted by (X, X )F := E(X · X ), for all X, X in HF .
(6.2)
Both HF and HG are Hilbert spaces, and the norm on HG is induced by the norm of HF . It follows that HG is a closed subspace of HF . The (orthogonal) projection operator from HF onto HG is called the conditional expectation (over G ) and usually denoted by E[·|G ]; but this notation is often too heavy and so it is convenient to write PG instead. So, if X ∈ HF , its projection Y onto HG is such that Y = PG X = E[X |G ].
(6.3)
The mapping PG is obviously linear. Consequently, if α1 and α2 belong to Rm , then PG (α1 · X 1 + α2 · X 2 ) = α1 · PG X 1 + α2 · PG X 2 .
(6.4)
Also, PG is non-expansive: PG X ≤ X , and therefore continuous, and it operates componentwise, i.e., Yi = PG X i , for i = 1 to m. Clearly PG X = X iff X ∈ HG . For any a ∈ Rm , we have that, identifying a constant with the constant function of HF having the same value: PG (a + X ) = a + PG X.
(6.5)
We give some additional properties of the conditional expectation in the L 2 setting. We define the componentwise product of random variables Z , Z with values in Rm by (Z Z )i (ω) := z i (ω)z i (ω), for i = 1 to m and ω ∈ Ω. Lemma 6.1 Let X ∈ HF and Y = PG X . Then (i) Y is characterized by the following relations Y ∈ HG
and E(Y · Z ) = E(X · Z ), for all Z ∈ HG .
(6.6)
(ii) We have that PG (Z X ) = Z PG X = Z Y, for all Z ∈ L ∞ (G )m .
(6.7)
(iii) For any X and X in HF , with m = 1, we have that X ≤ X
⇒
PG X ≤ PG X .
(6.8)
Proof (i) The expectations in (6.6) are the scalar products in HF of Z with Y and Z . So we can rewrite this equation as (X − Y, Z )F = 0, for all Z ∈ HG , which is the characterization of the projection onto a subspace. (ii) Set Y Z := PG (Z X ). By point (i), Y Z is characterized by
6.1 Conditional Expectation
203
E(Y Z · Z ) = E(X Z · Z ), for all Z ∈ HG .
(6.9)
Now Z Z ∈ HG , and so by point (i): E(X Z · Z ) = (X, Z Z )F = (Y, Z Z )F = (Y Z , Z )F .
(6.10)
Since Y Z ∈ HG , the result follows with (i). (iii) By linearity it suffices to check that if X ≥ 0, then Y ≥ 0. Indeed, Y+ (the positive part of Y taken a.s.) clearly satisfies Y+ ∈ HG and X − Y+ 2 ≤ X − Y 2 . Since Y is the orthogonal projection of X onto HG , this implies Y = Y+ and the result follows. Taking Z in (6.6) constant, we get EPG X = EX, for all X ∈ HF .
(6.11)
We now present the conditional Jensen’s inequality and some of its consequences. Lemma 6.2 Let X ∈ HF , Y = PG X , and ϕ be a proper l.s.c. convex function over Rm . Then (i) The following conditional Jensen inequality holds: ϕ(Y ) ≤ PG (ϕ(X )) a.s. on Ω.
(6.12)
(ii) Let K be a nonempty, closed convex subset of Rm . Then X (ω) ∈ K a.s.
⇒
Y (ω) ∈ K a.s.
(6.13)
(iii) We have the integral Jensen inequality (the expectations having values in R ∪ {+∞}): Eϕ(Y ) ≤ Eϕ(X ). (6.14) (iv) For any s ∈ [1, ∞], we have that PG X s ≤ X s , a.s. on Ω.
(6.15)
Proof (i) Since ϕ is proper l.s.c. continuous and convex, we have that ϕ is a supremum of its affine minorants, i.e., there exists an A ⊂ Rm × R such that, for all x ∈ Rm : ϕ(x) = sup{a · x + b; (a, b) ∈ A}.
(6.16)
In view of (6.4), (6.5) and (6.8), we have that for any (a, b) ∈ A: a · Y + b = a · PG X + b = PG [a · X + b] ≤ PG (ϕ(X )) .
(6.17)
204
6 Dynamic Stochastic Optimization
Maximizing the l.h.s. over (a, b) ∈ A, we get the desired result. (ii) Take ϕ = I K , the indicatrix function of K , in (6.12). The r.h.s. is equal to 0, and hence the l.h.s. is nonpositive. The conclusion follows. (iii) Take expectations on both sides of (6.12), noting that since ϕ has an affine minorant, the expectations are well-defined with value in R ∪ {+∞}), and use (6.11). (iv) For s ∈ [1, ∞), apply point (iii) with ϕ(x) = |x|s . For s = ∞, apply point (ii) ¯ with K = B(0, X ∞ ). We next show how to extend the conditional expectation from HF to the larger space L 1 (F )m . By (6.15) we already know that, for all X ∈ HF : PG X 1 ≤ X 1 ,
(6.18)
and consequently PG X has a unique continuous extension to L 1 (F )m , denoted in the same way. Remark 6.3 (i) If G is the σ -algebra generated by a random variable, say g : Ω → Rq , then we write the conditional expectation of the random variable X in the form E[X |g]. As we have seen, then, E[X |g](ω) = h(g(ω)) a.e. for some Borelian function h. (ii) We define the conditional expectation of X when g(ω) = a as E[X |g = a] := h(a) if g −1 (a) has a positive probability, 0 otherwise.
(6.19)
Lemma 6.4 (i) Relation (6.18) is also satisfied by the extension of the conditional probability to L 1 (F )m . (ii) The latter satisfies (6.4), (6.5), (6.7), (6.8), and (6.12)–(6.15). If X ∈ L 1 (F )m , then Y = PG X is characterized by the relation Y ∈ HG
and E(Y · Z ) = E(X · Z ), for all Z ∈ L ∞ (G )m .
(6.20)
Proof (i) That (6.18) holds for all X ∈ HF follows from Lemma 6.2(iv) with s = 1. Given X ∈ L 1 (G )m , and k ∈ N, k = 0, consider the truncation X k (ω) := 0 if |X (ω)| > k, and X (ω) otherwise.
(6.21)
Then X k belongs to HG , and is a Cauchy sequence converging to X in L 1 (G )m . Thanks to (6.18) (for X ∈ HF ), Y k := PG X k is a Cauchy sequence in L 1 (G )m , and so has in this space a limit Y , which by the definition of an extended operator satisfies Y = PG X . (ii) We leave the proofs (based again on the sequences X k and Y k ) as an exercise. For s ∈ [1, ∞], we denote by s its conjugate number, such that 1/s + 1/s = 1, and set Us := L s (F )m . We denote by Ps the restriction of the conditional expectation (over U1 ) to Us , and view it as an element of L(Us ).
6.1 Conditional Expectation
205
Lemma 6.5 (i) Let s ∈ [1, ∞] and u ∈ Us . Then Ps u is characterized by Ω
Z (ω) · (Ps u)(ω)dω =
Ω
Z (ω) · u(ω)dω, for all Z ∈ L s (G )m .
(6.22)
(ii) For any s ∈ [1, ∞), we have that Ps = Ps .
u = P1 u. (iii) Let u ∈ U 1 . Then P∞ Proof (i) By Lemma 6.4(ii), Ps u is characterized by the equality in (6.22), for all Z ∈ U∞ . So we only have to prove that (6.22) holds for s ∈ (1, ∞]. Let v ∈ Us . The componentwise truncated sequence: vik (ω) := max(−k, min(k, vi (ω))), i = 1 to m,
(6.23)
belongs to L s (G )m and converges to v in Us . By (6.20) we deduce that (the duality product being for the Us space): v, Ps u = limvk , Ps u = limvk , u = v, u. k
(6.24)
k
Point (i) follows. (ii) Let s ∈ [1, ∞) and (u, v) ∈ Us × Us . Since Ps v ∈ L s (G )m and Ps u ∈ L s (G )m , we have by point (i) that Ps v, u = Ps v, Ps u = v, Ps u,
(6.25)
proving that Ps = Ps . ∗ ), (iii) By the same arguments, when v ∈ U∞ and u ∈ U1 (which is a subset of U∞
we have that P∞ v = P1 v. An obvious consequence of Lemma 6.5(i) is the following corollary:
Corollary 6.6 Let s ∈ [1, ∞] and u ∈ Us . Let E be a subset of L s (G ), whose spanned vector space is dense in L s (G ). Then Ps u is characterized by Ω
Z (ω) · (Ps u)(ω)dω =
Ω
Z (ω) · u(ω)dω, for all Z ∈ E.
(6.26)
This holds in particular when taking for E the set of characteristic functions of G -measurable subspaces. We recall that, by Lemma 3.83, an element v∗ ∈ L ∞ (F )∗ can be decomposed in a unique way as v∗ = v1 + vs , where v1 ∈ L 1 (F ) and vs is a singular multiplier. Definition 6.7 The conditional expectation of v∗ ∈ L ∞ (F )∗ is defined by
∗ v . E[v∗ |G ] := P∞
(6.27)
206
6 Dynamic Stochastic Optimization
Remark 6.8 (i) In view of Lemma 6.5(iii), when v∗ ∈ L 1 (F ), we recover the usual conditional expectation. (ii) By the same lemma, for all s ∈ [1, ∞], Ps is a conditional expectation (but of
= P1 ). course P∞
6.1.3 The Conditional Expectation of Non-integrable Functions Let (Ω, F , P) be, as before, a probability space, and let G be a sub σ -algebra of F . Denote by L 0 (Ω, F ) the set of measurable functions w.r.t. the σ -algebra F , and by L 0+ (Ω, F ) the set of such measurable functions that are nonnegative a.s. To f ∈ L 0+ (Ω, F ) we associate the sequence of truncated functions f k , k ∈ N, such that f k (ω) := min( f (ω), k) and their conditional expectation gk := E[ f k , G ]. The latter are well-defined since f k ∈ L ∞ (Ω, F ). The conditional expectation being a nondecreasing mapping, the sequence gk is nondecreasing and converges a.s. to some g ∈ L 0+ (Ω, F ). We say that g is the conditional expectation of f and write g = E[ f, G ]. More generally, if f ∈ L 0 (Ω, F ) is such that f ≥ h a.s. for some h in L 1 (Ω, F ), we can define E[ f, G ] as the limit a.s. of the nondecreasing sequence E[ f k , G ]. Lemma 6.9 Let f ∈ L 0+ (Ω, F ). Then g = E[ f, G ] satisfies E[ f · z] = E[g · z], for all z ∈ L ∞ + (Ω, G ).
(6.28)
Proof Let z ∈ L ∞ + (Ω, G ). Using the monotone convergence Theorem 3.34 twice, and the fact that gk = E[ f k |G ], we get E[ f · z] = lim E[ f k · z] = lim E[gk · z] = E[g · z].
(6.29)
If follows that g = E[ f |G ] satisfies (6.28). The conclusion follows.
k
k
6.1.4 Computation in Some Simple Cases Let s ∈ [1, ∞]. The two extreme cases are: when G is the trivial σ -algebra, the conditional expectation coincides with the expectation; when G = F , the conditional expectation is the identity operator in L s (F ). Example 6.10 Let G ⊂ Ω be an atom of G . Then g = PG f has over G the value E( f 1G )/P(G) (the mean value of f over G). Indeed, it suffices to get the result when f is scalar, and to take Z = 1G in (6.22).
6.1 Conditional Expectation
207
Example 6.11 Let (Ω1 , F1 ) and (Ω2 , F2 ) be measurable spaces, and let F be the product σ -algebra (the one generated by F1 × F2 ). Set Ω := Ω1 × Ω2 , and let P be a probability measure on (Ω, F ). Set G := F1 × {Ω2 , ∅}. The associated random functions are those that do not depend on ω2 . Then, roughly speaking, Y := E[X |G ] is obtained by averaging for each ω1 ∈ Ω1 the value of X (ω1 , ·). More precise statements follow. Example 6.12 In the framework of Example 6.11, assume that Ω1 and Ω2 are finite sets, say equal to {1, . . . , p} and {1, . . . , q} resp., with elements denoted by i and j; let pi j be the probability of (i, j). Taking for Z the characteristic function 1{i0 } (i, j), for any i 0 ∈ {1, . . . , p}, in (6.22), we deduce that Y (i) =
j∈Ω2
pi j X (i, j)
j∈Ω2
pi j
, for all i ∈ Ω1 .
(6.30)
Example 6.13 (Independent noises) In the framework of Example 6.11, let P be the product of the probability P1 over (Ω1 , F1 ) and P2 over (Ω2 , F2 ), so that ω1 and ω2 are independent. Then Y := E[X |G ] is given by, a.s.: Y (ω1 ) =
Ω2
X (ω1 , ω2 )dP2 (ω2 ).
(6.31)
Remark 6.14 More general expressions can be obtained using the disintegration theorem [40, Chap. III]. In most applications we have (reformulating the model if necessary) independent noises.
6.1.5 Convergence Theorems The main convergence theorems of integration theory have their counterparts for conditional expectations. Theorem 6.15 (Monotone convergence) Let f k be a nondecreasing sequence of L 1 (F ), with limit a.s. f ∈ L 1 (F ). Set gk := E[ f k |G ] and g := E[ f |G ]. Then gk is nondecreasing, and converges to g both a.s. and in L 1 (G ). Proof Since f k is nondecreasing, by (6.8) (which is valid in L 1 (F )) so is gk , and ˆ such that gk ≤ gˆ ≤ g. By domhence, gk → gˆ a.s. for some measurable function g, inated convergence, gˆ is integrable. Let A ∈ G with characteristic function z = 1 A . Using the monotone convergence Theorem 3.34 twice, we get: E(z g) ˆ = lim E(zgk ) = lim E(z f k ) = E(z f ) = E(zg). k
k
(6.32)
We deduce by Corollary 6.6 that gˆ = g, and therefore gk → g in L 1 (G ) by monotone convergence.
208
6 Dynamic Stochastic Optimization
Theorem 6.16 (Lebesgue dominated convergence) Let the sequence f k of L 1 (F ) converge a.e. to f , and be dominated by h ∈ L 1 (F ), in the sense that | f k (ω)| ≤ h(ω) a.s. Set gk := E[ f k |G ] and g := E[ f |G ]. Then g ∈ L 1 (G ), and gk → g in L 1 (G ). Proof By the Lebesgue dominated convergence Theorem 3.38, f k → f in L 1 (F ), and by Lemma 6.4(i) the conditional expectation is a continuous operator in L 1 (F ). The conclusion follows. Lemma 6.17 (Fatou’s lemma) Let f k be a sequence in L 1 (F ), with f k ≥ h, where h is an integrable function. Let f := lim inf k f k be integrable, and set gk := E[ f k |G ], g := E[ f |G ]. Then (6.33) g ≤ lim inf gk a.s. k
Proof Set fˆk := inf{ f ; ≥ k}, and gˆ k := E[ fˆk |G ]. Then fˆk is nondecreasing and converges a.s. to f . Since h ≤ fˆk ≤ f k , fˆk is integrable. By the monotone convergence Theorem 6.15, gˆ k ↑ g a.s. Since fˆk ≤ f k , we have that gˆ k ≤ gk . The conclusion follows.
6.1.6 Conditional Variance Definition 6.18 Let G be a sub σ -algebra of some σ -algebra F , X ∈ L 2 (F ), and Y = EG X . We call the G measurable function var G X := EG (X − Y )(X − Y )
(6.34)
the conditional variance of X . Lemma 6.19 Let F , G , X and Y be as in the previous definition, with X ∈ L 2 (F ). Then we have the law of total variance var X = Evar G X + varY.
(6.35)
Proof We may assume that EX = EY = 0 and it is enough to prove the result when X is a scalar. Then var X = EX 2 = E(X − Y + Y )2 = E(X − Y )2 + 2E[(X − Y )Y ] + varY. (6.36) Now E(X − Y )2 = EEG (X − Y )2 = Evar G X
(6.37)
E[X Y ] = EEG [X Y ] = E(Y EG [X ]) = EY 2
(6.38)
and so that E[(X − Y )Y ] = 0. The result follows.
6.1 Conditional Expectation
209
Remark 6.20 The law of total variance (6.35) can be interpreted as the decomposition of the variance as the sum of the term varY explained, or predicted by G , and of the unexplained, or unpredicted term Evar G X .
6.1.7 Compatibility with a Subspace In this subsection, instead of a measurability constraint, we consider the more general case of a Banach space U with a closed subspace V . This abstract setting simplifies the discussion and allows us to apply the results to more general frameworks (dynamic case). We assume the existence of a projector P from U onto V , i.e., P ∈ L(U ) and
Pu ∈ V, for all u ∈ U, Pu = u, for all u ∈ V.
(6.39)
Note that P := I − P is itself a projector on the closed subspace V := Im(P ) = Ker P.
(6.40)
Any u ∈ U can be decomposed in a unique way as u = u + u , with u ∈ V and u ∈ V . Also, the transpose operator P can be interpreted as the restriction of linear forms over the subspace V . In the applications, u ∈ V might represent a measurability constraint, and P would then be the corresponding conditional expectation. Remember that then, P is also a conditional expectation. Given K ⊂ U , nonempty, closed and convex, we set KV := K ∩ V . Definition 6.21 We say that K is compatible with P if PK ⊂ K , i.e., if any u ∈ K is such that Pu ∈ K . Remark 6.22 By Remark 1.17, there exists an E ⊂ U ∗ × R such that K = {u ∈ U ; u ∗ , u ≤ b, for all (u ∗ , b) ∈ E}.
(6.41)
Therefore, K is compatible whenever for all (u ∗ , b) ∈ E, we have that, u ∗ , u = u ∗ , Pu for all u ∈ U , or equivalently, if u ∗ = P u ∗ , for all (u ∗ , b) ∈ E.
(6.42)
We emphasize the fact that we consider here KV as a subset of U (and not of V ). Therefore the normal cone to KV at u ∈ KV (of which the lemma below gives an expression) is considered as a subset of U ∗ (and not of V ∗ ). Of course V ⊥ denotes the orthogonal of V in U ∗ . Lemma 6.23 (i) We have that Ker P = V ⊥ . (ii) Let K be compatible with P. Then, for all u ∈ KV :
210
6 Dynamic Stochastic Optimization
NK V (u) = NK (u) + V ⊥ .
(6.43)
Proof (i) Let u ∗ ∈ U ∗ and u ∈ U . Then P u ∗ , u = u ∗ , Pu. Since the range of P is V , the result follows. (ii) We have the trivial inclusion for normal cones of an intersection: NK (u) and V ⊥ being elements of NK V , and the latter being a cone, the inclusion NK V (u) ⊃ NK (u) + V ⊥ follows. We next show the converse inclusion. Let u ∈ KV and u ∗ ∈ NK V (u). Given v ∈ K , define v1 := Pv. By the definition of compatibility, v1 ∈ KV , and so 0 ≥ u ∗ , v1 − u = u ∗ , P(v − u) = P u ∗ , v − u,
(6.44)
proving that P u ∗ ∈ NK (u). So it suffices to prove that u ∗ − P u ∗ ∈ V ⊥ . Indeed, if v ∈ V then we have that u ∗ − P u ∗ , v = u ∗ , v − Pv = 0. The result follows. Remark 6.24 We proved in Lemma 1.124 the following geometric calculus rule: the normal cone of an intersection of closed convex sets is the sum of normal cones to these sets, provided that the qualification condition 0 ∈ int(K 1 − K 2 ) holds. In the above lemma we obtained the geometric calculus rule without the qualification condition. An easy application of the above result, that we state for future reference, is as follows. Consider the problem Min F(u);
u∈K V
y[u] := A u ∈ K Y ,
(6.45)
where F is a continuous convex function over U , Y is another Banach space, A ∈ L(U, Y ), and K Y is a closed convex subset of Y . Assume that the following qualification condition holds, where B stands for the unit open ball: ε B ⊂ A KV − K Y , for some ε > 0.
(6.46)
Proposition 6.25 Let u¯ ∈ F(6.45) satisfy (6.46), and set y¯ = A u. ¯ Then u¯ is a solu¯ and q ∈ NK (u) ¯ such that tion of (6.45) iff there exists y ∗ ∈ N K Y ( y¯ ), u ∗ ∈ ∂ F(u) P A y ∗ + u ∗ + q = 0.
(6.47)
Proof In view of the qualification condition (6.46), by the subdifferential calculus rules (Lemma 1.120), we have that u¯ ∈ S(6.45) iff there exists y ∗ ∈ N K Y ( y¯ ), u ∗ ∈ ¯ such that ∂ F(u) ¯ and q1 ∈ NK V (u) A y ∗ + u ∗ + q1 0.
(6.48)
6.1 Conditional Expectation
211
By Lemma 6.23(ii) this is equivalent to A y ∗ + u ∗ + q ∈ V ⊥ , for some q ∈ NK (u), ¯ and we conclude by Lemma 6.23(i). Remark 6.26 (i) Let Ki , i ∈ I , be nonempty closed convex subsets of U , compatible with the subspace V . Then K := ∩i∈I Ki is a closed convex subset of U , which (if nonempty) is obviously compatible with V . (ii) If in addition I is finite and the following condition for normal cones holds: for all u ∈ K ∩ V , we have that: NK i (u), (6.49) NK (u) = i∈I
then by Proposition 6.25, if u¯ ∈ F(6.45) satisfies (6.46) and y¯ = A u, ¯ then u¯ ∈ S(6.45) iff there exists y ∗ ∈ N K Y ( y¯ ), u ∗ ∈ ∂ F(u) ¯ and qi ∈ NK i (u), ¯ for all i ∈ I , such that
∗ ∗ P qi = 0. A y +u + (6.50) i∈I
Example 6.27 (Product structure) In the applications to stochastic programming, we have a discrete set of times T = 0, . . . , T , and (note that the index of control variables runs from 0 to T − 1, and that of state variables from 1 to T ): U=
T
−1
Ut ; K =
t=0
T
−1
Kt ; Y =
t=0
T
Yt ; K
t=1
Y
=
T
Kt Y ,
(6.51)
t=1
where Kt is a nonempty, closed convex subset of a Banach space Ut , Kt Y is a nonempty, closed convex subset of a Banach space Yt , and Pt ∈ L(Ut ) is a projection onto a closed subspace Vt of Ut . We may write yτ [u] =
T −1
Aτ t u t , τ = 1, . . . , T, where Aτ t ∈ L(Ut , Yτ ).
(6.52)
t=0
If the qualification condition (6.46) holds, a solution u¯ will be characterized by the existence of ¯ u ∗ ∈ ∂ F(u);
yt∗ ∈ NK t Y ( y¯t ); qt ∈ NK t (u¯ t ), t = 1, . . . , T,
such that Pt
(6.53)
Aτ t yτ∗
+
u ∗t
+ qt
= 0, t = 0, . . . , T − 1.
τ ∈T
Remark 6.28 By Lemma 6.23(ii), (6.54) is equivalent to
(6.54)
212
6 Dynamic Stochastic Optimization
Aτ t yτ∗ + u ∗t + NK t 0, t = 0, . . . , T − 1.
(6.55)
τ ∈T
So (by the sudifferential calculus rule for a sum) it is also equivalent to the fact that u¯ is solution of the problem min F(u) + u
T −1 T yτ∗ , Aτ t u t ; u t ∈ Kt ∩ Vt , t = 0, . . . , T − 1.
(6.56)
t=0 τ =1
6.1.8 Compatibility with Measurability Constraints We apply the results of the previous section in the case of measurability constraints, i.e., (Ω, F , μ) is a probability space, and G is a σ -algebra included in F . For some s ∈ [1, ∞], we assume that U = L s (F )m and V = L s (G )m . We recall that Ps denotes the conditional expectation operator in L s (F )m . Definition 6.29 Let K be a closed convex subset of L s (F )m , for some s ∈ [1, ∞]. We say that K is compatible with G if PG K ⊂ K , i.e., if any x ∈ K is such that PG x ∈ K . Proposition 6.30 Let u¯ ∈ F(6.45) satisfy the qualification condition (6.46); set y¯ = ¯ q ∈ NK (u) ¯ such A u. ¯ Then iff there exists y ∗ ∈ N K Y ( y¯ ), u ∗ ∈ ∂ F(u) u¯ ∈ ∗S (6.45)
∗ that Ps A y + u + q = 0, or equivalently, E[A y ∗ + u ∗ + q|, G ] = 0. Proof Immediate Definition 6.7.
consequence
of
Proposition 6.25,
(6.57) Lemma 6.5(ii)
and
We next present some examples of compatible constraints. Definition 6.31 Let K be a closed convex subset of L s (F )m , for some s ∈ [1, ∞]. (i) We say that K defines a Jensen type constraint if, for some proper l.s.c. convex function ϕ over Rm , we have that K = {x ∈ L s (F )m ; ϕ(x(ω)) ≤ 0 a.s.}.
(6.58)
(ii) We say that K defines an integral Jensen type constraint if, for some proper l.s.c. convex function ϕ over Rm , we have that K = {x ∈ L s (F )m ; Eϕ(x(·)) ≤ 0}.
(6.59)
(iii) We say that K defines a local constant constraint if there exists a nonempty closed convex subset K of Rm such that
6.1 Conditional Expectation
K = {x ∈ L s (F )m ; x(ω) ∈ K a.e.}.
213
(6.60)
Clearly a local constant constraint is a special case of a Jensen type constraint with ϕ = I K . Lemma 6.32 A Jensen (resp. integral Jensen) type constraint is compatible with G measurability. Proof Immediate consequence of the Jensen and integral Jensen inequalities (6.12) and (6.14). We next give some generalizations of the previous examples. Definition 6.33 Consider a measurable function ϕ : Ω × Rm → R ∪ {+∞} of the form (6.61) ϕ(ω, u) := sup{ai (ω) · u + bi (ω), i ∈ I }, where I is a countable set and the (ai , bi )i∈I are F -measurable and essentially bounded. We say that ϕ is an F -adapted function, and that it is G -adapted if in addition any (ai , bi ), for i ∈ I , is G -measurable. Definition 6.34 Let K be a nonempty, closed convex subset of L s (F )m , for some s ∈ [1, ∞]. (i) We say that K defines a generalized Jensen type constraint if, for some function ϕ satisfying (6.61), G -adapted, we have that K = {u ∈ L s (F )m ; ϕ(ω, u(ω)) ≤ 0 a.e.}.
(6.62)
(ii) We say that K defines a generalized integral Jensen type constraint if, for some function ϕ satisfying (6.61), G -adapted, we have that K = {u ∈ L s (F )m ; Eϕ(·, u(·)) ≤ 0}.
(6.63)
Lemma 6.35 All constraints of the previous type are G -compatible. Proof It is enough to discuss case (i). If u ∈ K , then for all i ∈ I , g(ω) := ai (ω)u(ω) + b(ω) ≤ 0. Since the conditional expectation is nondecreasing, and ai , bi are G -measurable, we deduce that ai (ω)E[u|G ](ω) + bi (ω) = E[g|G ](ω) ≤ 0. The conclusion follows by taking the supremum over i ∈ I .
(6.64)
6.1.9 No Recourse The problem without recourse is a particular case of the previous theory, when G = {∅, Ω} is the trivial σ -algebra. Then the conditional expectation in L s (F )
214
6 Dynamic Stochastic Optimization
coincides with the expectation when s ∈ [1, ∞). If u ∗ ∈ L ∞ (F )m , its conditional expectation, denoted by Eu ∗ , is the element of Rm defined by (Eu ∗ )i = Eu i∗ = u i∗ , 1.
(6.65)
A very simple example illustrates the fact that, in the presence of constraints to be satisfied a.e., the multipliers in the dual of L ∞ typically have singular parts. Example 6.36 Let u ∈ R+ represent a number of items to be ordered at price p0 , and sold at price p1 > p0 . The stochastic demand is ω, with uniform law in Ω = [dm , d M ] with 0 < dm < d M . However, all bought items must be sold. The optimal decision is therefore u¯ = dm . The mathematical formulation of the optimization problem is, setting p := p1 − p0 : Min − pu; u≥0
y[u](ω) := u − ω ≤ 0 a.s.
(6.66)
Set y¯ (ω) := u¯ − ω. Taking Y = L ∞ (Ω) as constraint space, and observing that the constraint is qualified, we obtain the existence of a multiplier λ such that λ ∈ NY− ( y¯ ); − p + λ, 1 = 0.
(6.67)
For any ε > 0 and y ∈ Y with zero value on (dm , dm + ε), there exists a ρ > 0 such that y¯ ± ρy ∈ Y− . Since λ ∈ NY− ( y¯ ), and so λ, y¯ = 0, it follows that λ, y = 0. We have proved that λ is equal to its singular part; note that it is nonzero in view of (6.67), since p > 0.
6.2 Dynamic Stochastic Programming 6.2.1 Dynamic Uncertainty Random variables such as prices, temperatures, etc. that depend on time are modelled as series, say yt ∈ Rn with t ∈ Z. Quite often the yt are not independent variables, and we can express them as function of past values: yt = Ψ (yt−1 , . . . , yt−q ) + Φ(yt−1 , . . . , yt−q )et ,
(6.68)
where the random variables et ∈ Rm , called innovations, are “white noise”, i.e., i.i.d. with zero mean and unit variance. A simple example is the one of autoregressive (AR) models ˆ t, (6.69) yt = a1 yt−1 + · · · + aq yt−q + Φe
6.2 Dynamic Stochastic Programming
215
where the ai are n × n matrices and Φˆ is a given matrix; this model of order q is also called ARq. Then the vector Yt := (yt , yt−1 , . . . , yt−q+1 ) has the first-order dynamics ⎞ ⎛ ⎛ ⎞ a1 a2 · · · aq−1 aq Φˆ ⎟ ⎜1 0 ⎜0⎟ ⎟ ⎜ ⎜ ⎟ Yt+1 = ⎜ (6.70) ⎟ Yt + ⎜ . ⎟ et . .. ⎠ ⎝ ⎝ .. ⎠ . 0
··· 1
0
0
So this type of model is suitable for our framework. For more on AR models and their nonlinear extensions, we refer to [55].
6.2.2 Abstract Optimality Conditions We start with the general setting of an abstract problem in product form of Example 6.27. We call u the control, and y the state, and assume that the control to state mapping is defined by the state equation yt+1 = At yt + Bt u t + dt , t = 0, . . . , T − 1;
y0 ∈ Y0 given,
(6.71)
with At ∈ L(Yt , Yt+1 ), Bt ∈ L(Ut , Yt+1 ), dt ∈ Yt+1 , and solution denoted by y[u], and that the cost function has the following form: F(u) = J (u, y[u]), with J (u, y) :=
T −1
t (u t , yt ) + ϕ(yT ).
(6.72)
t=0
Here t and ϕ are continuous convex functions over Ut × Yt , for t = 0 to T − 1, and over Y N , resp. The linearized state equation is z t+1 = At z t + Bt vt , t = 0, . . . , T − 1; z 0 = 0.
(6.73)
We first give a means to express the subdifferential of F, using the adjoint state (or costate) approach. Definition 6.37 Set P := Y1∗ × · · · × YT∗ as costate space. Let u¯ ∈ U have associated state y¯ := y[u]. ¯ The costate p ∈ P (i.e., pt ∈ Yt∗ , t = 1 to T ) associated with ∗ ∗ ∗ u, ¯ y ∈ Y and w ∈ Y ∗ (we distinguish these two dual variables since they will play different roles) is defined as the solution of the backward equation ( pt is computed by backward induction)
pt = yt∗ + wt∗ + A t pt+1 , t = 1, . . . , T − 1; pT = yT∗ + wT∗ .
(6.74)
216
6 Dynamic Stochastic Optimization
We note the useful identity, where (v, z) satisfies the linearized state equation (6.73): ⎧ T T −1 ⎪ ⎪ ∗ ∗ ⎪ ⎪ y + w , z = p , z + pt − A t T T ⎪ t t t pt+1 , z t ⎪ ⎪ t=1 ⎪ t=1 ⎪ ⎪ T T −1 ⎪ ⎪ ⎪ ⎪ p , z − pt+1 , At z t = ⎪ t t ⎨ t=1 t=0 (6.75) T T −1 ⎪ ⎪ ⎪ ⎪ pt , z t + pt+1 , Bt vt − z t+1 = ⎪ ⎪ ⎪ ⎪ t=1 t=0 ⎪ ⎪ T −1 ⎪ ⎪ ⎪ ⎪ Bt pt+1 , vt . = ⎪ ⎩ t=0
¯ y¯ ) iff v0∗ ∈ ∂0 (u¯ 0 , y¯0 ), (vt∗ , yt∗ ) ∈ Note that (v∗ , y ∗ ) ∈ U ∗ × Y ∗ belongs to ∂ J (u, ∗ ∂t (u¯ t , y¯t ), for t = 1 to T − 1, and yT ∈ ∂ϕ( y¯T ). Lemma 6.38 We have that u ∗ ∈ ∂ F(u) ¯ iff there exists (v∗ , y ∗ ) ∈ ∂ J (u, ¯ y¯ ) such that the costate p associated with y ∗ and w∗ = 0 satisfies u ∗t = vt∗ + Bt p¯ t+1 , t = 0, . . . , T − 1.
(6.76)
Proof We have that the state satisfies y[u] = A u + d for some linear continuous operator A and some d in an appropriate space. Since F(u) = J (u, y[u]), by the ¯ iff u ∗ = subdifferential calculus rules in Lemma 1.120, we have that u ∗ ∈ ∂ F(u) ∗
∗ ∗ ∗ ¯ y¯ ), or equivalently, if v + A y for some (v , y ) ∈ ∂ J (u, T −1 T −1 T u ∗t , vt = vt∗ , vt + yt∗ , z t . t=0
t=0
(6.77)
t=1
We conclude by (6.75), where here wt∗ = 0 for all t.
We are now in a position to state the optimality conditions. Theorem 6.39 Let u¯ be feasible, with associated state y¯ . Assume that the qualification condition (6.46) holds, and that the constraints that u¯ t belongs to Kt are compatible with the projector Pt , for t = 0 to T − 1. Then u¯ is a solution of the abstract optimal control problem (6.45) iff there exists yT∗ ∈ ∂ϕ( y¯T ), and ∗ Y (y ∈ N K t+1 ¯t+1 ), qt ∈ NK t (u¯ t ), t = 0, . . . , T − 1, (vt∗ , yt∗ ) ∈ ∂t (u¯ t , y¯t ), wt+1 (6.78) such that the costate p¯ ∈ P, a solution of (6.74), satisfies
Pt vt∗ + Bt pt+1 + qt = 0, t = 0, . . . , T − 1.
(6.79)
6.2 Dynamic Stochastic Programming
Proof Immediate consequence of Example 6.27 and Lemma 6.38.
217
Remark 6.40 Similarly to Remark 6.28 we can observe that (6.79) is equivalent to the fact that for t = 0 to T − 1, u¯ t minimizes u → t (u, y¯t ) + pt+1 , Bt u over Kt .
6.2.3 The Growing Information Framework We now particularize the previous setting by assuming that the spaces Ut and Yt do not depend on t, so we may denote them as U0 , Y0 , and that, if y = y[u] with u t ∈ Vt for all t, then yt belongs to some closed subspace Z t of Y0 , with which is associated a projector Q t . We assume that the operators Pt ∈ L(U0 ) and Q t ∈ L(Y0 ) (which in our stochastic programming applications correspond to some conditional expectations) satisfy PT = I , Q T = I as well as the following identities: Pt = Pt Pτ = Pτ Pt ; and
Q t = Q t Q τ = Q τ Q t , t = 0, . . . , τ − 1,
(6.80)
Q t+1 At = At Q t+1 ; t = 0, . . . , T − 1,
(6.81)
Bt = Bt Q Pt+1 t+1 , t = 0, . . . , T − 2.
(6.82)
Note that (6.80) implies that the sequences of spaces Vt and Z t are nondecreasing. We introduce the adapted costate p¯ t = Q t pt , t = 1, . . . , T.
(6.83)
Remark 6.41 By Remark 6.8, the transpose of conditional expectations are conditional expectations (in a generalized sense for L ∞ norms), so that (at least in the case of L s spaces for s ∈ [1, ∞)), in the stochastic optimization applications, p¯ t will be adapted. This justifies the terminology of adapted costate. Lemma 6.42 Under the assumptions of Lemma 6.38, if (6.80)–(6.82) hold, then the following adapted costate equation holds
∗ ∗
p¯ t = Q ¯ t+1 , t = 1, . . . , T − 1; t yt + wt + At p p¯ T = yT∗ + wT∗ ,
(6.84)
as well as (6.78) and Pt vt∗ + Bt p¯ t+1 + qt = 0, t = 0, . . . , T − 1.
(6.85)
Proof Applying (6.80)–(6.81) several times, we have that
¯ t+1 . Q t At pt+1 = Q t Q t+1 At pt+1 = Q t At p
(6.86)
218
6 Dynamic Stochastic Optimization
Multiplying by Q t on both sides of the costate equation (6.74), we get (6.84). Now (6.85) holds for t = T − 1 since p¯ T = pT . By (6.81)–(6.82),
Bt = Pt Bt Q Pt Bt = Pt Pt+1 t+1 ,
we get Pt Bt pt+1 = Pt Bt p¯ t+1 ; (6.85) then follows from (6.79).
(6.87)
Remark 6.43 Similarly to Remark 6.40 we observe that (6.79) is equivalent to the fact that for t = 0 to T − 1, u¯ t minimizes u → t (u, y¯t ) + p¯ t+1 , Bt u over Kt .
6.2.4 The Standard full Information Framework We now apply the previous ‘abstract’ framework to stochastic programming problems. We consider a nondecreasing sequence F0 , . . . , FT of σ -algebras, included in F , such that FT = F , called a filtration. Roughly speaking, Ft represents the information available at time t, when taking the decision u t . Definition 6.44 We say that a measurable mapping (with values in a Banach space) u = (u 0 , . . . , u T −1 ) is adapted to the filtration if u t is Ft measurable for t = 0 to T − 1. We also call the fact that u needs to be adapted a nonanticipativity constraint. In the sequel we assume that it holds. The function spaces are, for s ∈ [1, ∞]: Ut := L s (F )m ; Vt := L s (Ft )m ; Yt := L s (F )n ; Z t := L s (Ft )n .
(6.88)
We also assume that, for t = 0 to T − 1, At ∈ L(Y0 , Y0 ) and Bt ∈ L(U0 , Y0 ) satisfy At ∈ L(Z t , Z t+1 );
Bt ∈ L(Vt , Z t+1 ); dt ∈ Z t+1 , t = 0, . . . , T − 1. (6.89)
Later we will see examples of operators At and Bt . The state equation is
yt+1 (ω) = (At yt ) (ω) + (Bt u t ) (ω) + dt (ω), t = 0, . . . , T − 1; a.s., with y0 ∈ Z 0 given,
(6.90)
we have indeed that yt ∈ Z t , for t = 0 to T . We assume next that the cost function is an expectation with the property of additivity w.r.t. time, i.e.,
t (u t , yt ) = Eˆt (ω, u t (ω), yt (ω)), t = 0, . . . , T − 1, ˆ yT (ω)), ϕ(yT ) = Eϕ(ω,
(6.91)
ˆ where the functions (ω, ·, ·) and ϕ(ω, ˆ ·) are a.s. convex functions. Under technical conditions seen in Sect. 3.2 of Chap. 3, we have that, for t = 0 to T − 1:
6.2 Dynamic Stochastic Programming
219
∂t (u t , yt ) = {(vt∗ , yt∗ ) ∈ Ut∗ × Yt∗ ; (vt∗ (ω), yt∗ (ω)) ∈ ∂ ˆt (ω, u t (ω), yt (ω)) a.s.}, (6.92) ˆ yT (ω)) a.s.}. (6.93) ∂ϕ(yT ) = {yT∗ ∈ YT∗ ; yT∗ (ω) ∈ ∂ ϕ(ω, We may denote the conditional expectation over Ft by Et . Noticing that the operators Pt and Q t as well as their adjoints are conditional expectations over Ft , we may write the adapted costate equation (6.84) in the following form:
¯ t+1 , t = 1, . . . , T − 1; p¯ t = Et yt∗ + wt∗ + A t p p¯ T = yT∗ + wT∗ ,
(6.94)
and the optimality condition (6.85) in the form Et vt∗ + Bt p¯ t+1 + qt = 0, t = 0, . . . , T − 1.
(6.95)
6.2.5 Independent Noises We assume here, as is often the case in applications, that we can write ω = (ω0 , . . . , ωT ) with ωt independent variables, each over some probability space (Ωˆ t , Fˆt , Pt ), and the decision u t is a function of (the past information) (ω0 , . . . , ωt ). Then the filtration is such that Ft is the set of measurable functions of (ω0 , . . . , ωt ). We have seen in Example 6.13 how to compute conditional expectations in the case of independent noises. So we can write for t = 0 to T − 1: u t = u t (ω0 , . . . , ωt ), yt+1 = yt+1 (ω0 , . . . , ωt+1 ), pt+1 = pt+1 (ω0 , . . . , ωt+1 ), (6.96) etc. and the conditional expectation from Φ, Ft+1 -measurable, to Ft , is a.s. Et Φ(ω0 , . . . , ωt ) =
Ωt+1
Φ(ω0 , . . . , ωt+1 )dPt+1 (ωt+1 ).
(6.97)
Remark 6.45 In practice it is not easy to deal with functions of several variables. Storing them, or computing conditional expectations becomes very expensive when the dimension increases. The optimality conditions are nevertheless of interest for studying theoretical properties (such as sensitivity analysis).
6.2.6 Elementary Examples We may define operators At and Bt in the following way. If Aˆ t is an n × n matrix, set (6.98) (At yt ) (ω) := Aˆ t yt (ω).
220
6 Dynamic Stochastic Optimization
More generally, if Aˆ t is an n × n matrix that is Ft+1 -measurable and essentially bounded, set (6.99) (At yt ) (ω) := Aˆ t (ω)yt (ω). This case of a local operator is quite common in practice. Assuming that Bt has the same structure and identifying the operators At and Aˆ t , Bt and Bˆ t , we can express the optimality conditions in the following form:
yt+1 (ω) = Aˆ t (ω)yt (ω) + Bˆ t (ω)u t (ω) + dt (ω) a.s., t = 0, . . . , T − 1; y0 ∈ Z 0 given, (6.100) ∗ ∗
∗ ∗ ˆ p¯ t = Et yt + wt + At p¯ t+1 , t = 1, . . . , T ; p¯ T = yT + wT . (6.101) Et vt∗ + Bˆ t p¯ t+1 + qt = 0, t = 0, . . . , T − 1.
(6.102)
6.2.7 Application to the Turbining Problem 6.2.7.1
Framework
Let yt ∈ [ym , y M ] denote the amount of water at a dam at the beginning of day t. We can turbine an amount u t ∈ [u m , u M ], and spill an amount st ≥ 0. The natural increment of water is bt ≥ 0. So the dynamics is yt+1 = yt + bt − u t − st , t = 0, . . . , T − 1.
(6.103)
Each day we have to fix u t and vt . So we have the constraints yt+1 ∈ [ym , y M ]; u t ∈ [u m , u M ]; st ≥ 0, t = 0, . . . , T − 1.
(6.104)
The price of the electricity market is ct ≥ 0, t = 0 to T − 1. The total revenue, to be maximized, is T −1 ct u t + C T yT , (6.105) t=0
where C T ≥ 0 is an estimation of the water price at final time.
6.2.7.2
A Deterministic Model
In a deterministic version of this problem, where bt and ct are known for all time t, the problem of maximizing the revenue can be written as
6.2 Dynamic Stochastic Programming
Min − u,s
T −1
221
ct u t − C T yT s.t. (6.103)-(6.104).
(6.106)
t=0
Denoting by pt ∈ R the costate we obtain the costate equation pt = wt∗ + pt+1 , t = 1, . . . , T − 1; where
pT = wT∗ − C T ,
⎧ ∗ ⎨ wt ≤ 0 if yt = ym , w∗ = 0 if yt ∈ (ym , y M ), ⎩ t∗ wt ≥ 0 if yt = y M .
(6.107)
(6.108)
Eliminating w we can also write ⎧ ⎧ ⎨ pT ≤ −C T if yT = ym , ⎨ pt ≤ pt+1 if yt = ym , pt = pt+1 if yt ∈ (ym , y M ), pT = −C T if yT ∈ (ym , y M ), ⎩ ⎩ pt ≥ pt+1 if yt = y M , pT ≥ −C T if yT = y M .
(6.109)
Similarly to Remark 6.43 we can observe that for t = 0 to T − 1, u¯ t minimizes v → −(ct + pt )v over [u m , u M ], and therefore setting pˆ t := − pt :
u t = u m if pˆ t < ct , u t = u M if pˆ t > ct .
(6.110)
We can interpret pˆ t as the marginal value of storing, called in this context the water price. If the market price ct is strictly smaller (resp. strictly greater) than the water value, then one should store (resp. turbine) as much as possible. Observe that the water value decreases (resp. increases) when the storage attains the minimum (resp. maximum) value. For the spilling variable s the policy is to take st = 0 as long as the water value is positive, and st ≥ 0 otherwise (with a value compatible with the constraint yt+1 ≤ y M ). This is in agreement with the following observation. If during some time interval the inflows are important, it may be worth turbining even if the market price is low. So the water price should be small, and possibly become greater after. Exercise 6.46 If ym = −∞ and y M = +∞, show that the optimal strategies are to take u t = u m if ct < C T , and u t = u M if ct > C T , and u t ∈ [u m , u M ] otherwise.
6.2.7.3
Stochastic Model
We may assume that randomness occurs only in the variables bt and ct . Here we will assume that (6.111) bt is deterministic; ym = 0; y M = +∞,
222
6 Dynamic Stochastic Optimization
so that no spilling occurs. We also assume that ct ∈ L 1 (Ft ) for t = 0 to T − 1, and C T ∈ L 1 (F ). We choose the function spaces Vt = Z t = L ∞ (Ft ). The cost function is −E
T −1
(6.112)
ct , u t + C T , yT .
(6.113)
t=0
Also, we have that Kt := {u ∈ Vt ; u m ≤ u(ω) ≤ u M a.s.}, Kt Y = (Z t )+ := {y ∈ Z t ; y(ω) ≥ 0 a.s.},
(6.114) and so, by Exercise 1.82, for any y0 ∈ (Z t )+ : NK t Y = {w∗ ∈ (Z t∗ )− ; w∗ , y0 = 0}.
(6.115)
The adapted costate equation is p¯ t = Et wt∗ + p¯ t+1 , t = 1, . . . , T − 1;
p¯ T = wT∗ − C T ,
(6.116)
where wt∗ ∈ NK t Y , for t = 1 to T , and Et (−ct − p¯ t+1 + qt ) = 0; qt ∈ NK t (u¯ t ); t = 0, . . . , T − 1.
(6.117)
Since the conditional expectation is a nondecreasing operator, by (6.116), the adapted costate is itself nondecreasing. Set c¯t := −Et (ct + p¯ t+1 ) .
(6.118)
c¯t , v − u¯ t ≥ 0, for all v ∈ Kt .
(6.119)
The relation (6.117) implies
6.3 Notes The discussion of conditional expectation is classical, see e.g. Malliavin [77], and Dellacherie and Meyer [40]. For more on first-order optimality conditions, see Rockafellar and Wets [103, 104], Wets [124] and Dallagi [37] for the numerical aspects.
Chapter 7
Markov Decision Processes
Summary This chapter considers the problem of minimizing the expectation of a reward for a controlled Markov chain process, either with a finite horizon, or an infinite one for which the reward has discounted values, including the cases of exit times and stopping decisions. The value and policy (Howard) iterations are compared. Extensions of these results are provided for problems with expectations constraints, partial observation, and for the ergodic case, limit in some sense of large horizon problems with undiscounted cost.
7.1 Controlled Markov Chains 7.1.1 Markov Chains 7.1.1.1
The Probability Setting
We consider a state space S , equal to either {1, . . . , m}, with m ∈ N, or to N∗ = {1, 2, . . .}, and a time index k ∈ {0, . . . , N } where N ∈ N∗ is called the horizon. For k ∈ {0, . . . , N }, we denote by x a process (i.e. a random function of time) with values in S , for = k (the starting time of the process) to N . A Markov chain is a process whose transition from state i at time to state j at time + 1 (for = k to N − 1) happens with a given probability Mij , independently of the values taken by the process for times less than . Obviously Mij ≥ 0 and j∈S Mi j = 1. The Markov chain framework can be put in the setting of probability spaces in the following way. Let X k be the class of processes starting at time k, and X ik := {x ∈ X k ; x k = i},
for all i ∈ S .
(7.1)
Any element of X ik has the representation x = (i, x k+1 , . . . , x N ). Let the set of events (denoted by Ω in probability theory) be X ik , with σ -field P(X ik ). We denote by Pik © Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_7
223
224
7 Markov Decision Processes
the probability defined as follows. Since X ik is a countable set, the probability of A ⊂ X ik is the sum of probabilities of elements of A, the latter being defined by N −1 Pik (x) := Mikx k+1 . . . MxNN−1 −1 x N = Π=k M x x +1 ,
for all x in X ik .
(7.2)
In the sequel we will often use the more intuitive notation P((x k , . . . , x N ) | x k = i) := Pik (x),
(7.3)
which remains meaningful for a process starting at a time possibly less than k. In the next lemma, we check that, given the knowledge of the state at some time < N , the additional knowledge of past states (for times up to k − 1) is useless for the estimation of x +1 (and so, by induction for x j , j > + 1). Lemma 7.1 Given times 0 ≤ k < < N , A ⊂ S −k , and q ∈ S , set Aq := {x ∈ X k ; (x k , . . . , x −1 ) ∈ A; x = q}.
(7.4)
Assume that Aq has a positive probability. Then Pik (x +1 = j | x ∈ Aq } = Pik (x +1 = j | x = q) = Mqk j .
(7.5)
Proof We have that
P(x +1 = j and x ∈ Aq ) = Mq j
k Mxkk x k+1 . . . Mx−1 −1 q = Mq j P(x ∈ A q ).
(x k ,...,x )∈A
(7.6) Therefore by the Bayes rule P(x +1 = j | x ∈ Aqk ) =
P(x +1 = j and x ∈ Aq ) P(x ∈ Aq )
= Mqk j ,
(7.7)
as was to be shown.
7.1.1.2
Transition Operators
We can view M k = {Mikj }i, j∈S ×S as a possibly ‘infinite matrix’ with a (nonnegative) element Mikj in row i and column j, the sum over each row being equal to 1. We call such a ‘matrix’ having these two properties a transition operator. If S is finite, a transition operator M reduces to a stochastic matrix (a matrix with nonnegative elements whose sum over each row is 1). We have the following calculus rules that extend the usual matrix calculus: products between transition operators, and the product of a transition operator with a
7.1 Controlled Markov Chains
225
horizontal vector on the left, or a vertical vector on the right, under appropriate conditions on these vectors. More precisely, let 1 and ∞ , respectively, denote the space of summable and bounded sequences, whose elements are represented as horizontal (for 1 ) and vertical (for ∞ ) vectors. These spaces are resp. endowed with the norms
π 1 :=
|πi |; v∞ := sup |vi |.
(7.8)
i∈S
i∈S
We recall that ∞ is the topological dual (the set of continuous linear forms) of 1 . We denote the duality pairing by π v :=
πi vi ,
for all π ∈ 1 and v ∈ ∞ .
(7.9)
i∈S
This is in accordance with the rules for products of vectors in the case of a finite state space. Let π ∈ 1 , v ∈ ∞ , and M be a transition operator. We define the products π M ∈ 1 and Mv ∈ ∞ by (π M) j :=
πi Mi j ; (Mv)i :=
i∈S
Mi j v j ,
for all i, j in S .
(7.10)
j∈S
We easily check that π → π M and v → Mv are non-expansive, i.e., π M1 ≤ π 1 ; Mv∞ ≤ v∞ .
(7.11)
In addition, for all v ∈ ∞ : inf vi ≤ inf (Mv)i ≤ sup(Mv)i ≤ sup vi . i
i
i
(7.12)
i
If M 1 and M 2 are two transition operators, their product M 1 M 2 is defined as (M 1 M 2 )i j :=
1 Miq Mq2j ,
for all i, j in S .
(7.13)
q∈S
It is easy to check that the product of two transition operators is a transition operator. We interpret 1 P := π ∈ ; πi ≥ 0, i ∈ S ; πi = 1 (7.14) i∈S
as a set of probability laws over S , and ∞ as a values space. The (left) product of a probability law π with a transition operator is a probability law, and we can interpret the pairing (7.9) as the expectation of v under the probability law π . One can interpret
226
7 Markov Decision Processes
the ith row of M k as the probability law of x k+1 , knowing that the process x satisfies xk = i ∈ S . Let x ∈ X k , the class of processes starting at time k. It may happen that the initial state x k is unknown, but has a known probability law π k ; we then write x k ∼ π k . Then we may define the event set as X k and the probability of x ∈ X k as Pkπ k (x) := πxkk Mx k x k+1 . . . Mx N −1 x N .
(7.15)
Pkπ k (x) := πxkk Pkx k (x),
(7.16)
We note that and that for > k, the probability law of x k+1 , i.e. π := P(x | x k ∼ π k ), satisfies the forward Kolmogorov equation π +1 = j
πi P[x +1 = j, x = i] =
i
πi Mi, j , for = k to N − 1,
(7.17)
i
or equivalently M q , for = k to N − 1 . π +1 = π M = π k Πq=k
7.1.1.3
(7.18)
Cost Processes
We define a Markov cost process by associating with a Markov chain process {x k } the cost function {cik }, i ∈ S , k ∈ N. We assume that ck := {cik }i∈S belongs to ∞ , which means that the costs are uniformly bounded in space. We represent ck as a vertical vector. Recalling the notion of conditional expectation for a given value of a random variable (Remark 6.3), define the value associated with c and the Markov chain starting at time k with state i and horizon N ≥ k as Vik
:= E
N
cx
|x =i . k
(7.19)
=k
The above conditional expectation is well-defined, since c is bounded. The probabilities π being defined by (7.18), we have that Vik = cik +
N
π c .
(7.20)
=k+1
Denote by e j the probability concentrated at state i, i.e., the element of 1 with all components equal to 0, except for the jth one equal to 1.
7.1 Controlled Markov Chains
227
Proposition 7.2 For all k = 0, . . . , N , the value function V k belongs to ∞ , and is the solution of the backwards Kolmogorov equation
V k = ck + M k V k+1 , k = 0, . . . , N − 1, V N = cN .
(7.21)
Proof That V N = c N is obvious. Now let k ∈ {0, . . . , N − 1}. Then Vik
=
cik
+
P[x
k+1
= j | x = i] k
E[cx | x k+1 = j].
(7.22)
=k+1
j∈S
Now P[x k+1 = j | x k = i] = Mikj and clusion follows.
7.1.1.4
N
N
=k+1
E[cx | x k+1 = j] = V jk+1 . The con-
Discounted Problems with Infinite Horizon
In the case of an infinite horizon, the probability space can be defined by Kolmogorov’s extension of finite horizon probabilities, see Theorem 3.24. We first consider a problem with discount rate β ∈ (0, 1) and non-autonomous data, i.e., ck and M k depend on the time k. We assume that c∞ := sup ck ∞ < ∞.
(7.23)
k∈N
The associated value function, starting at state i and time k, is defined by := (1 − β)E
∞
β −k cx |x k
=i .
(7.24)
β −k c ∞ ≤ c∞ .
(7.25)
V k = (1 − β)ck + β M k V k+1 , k ∈ N.
(7.26)
Vik
=k
It is well-defined and belongs to ∞ , since |Vik | ≤ (1 − β)
∞ =k
Lemma 7.3 We have that
Proof In view of (7.18), and since (ei M k ) j = Mikj it follows that
228
7 Markov Decision Processes
Vik −k = cik + ei ∞ M k . . . M −1 c =k+1 β 1−β (7.27) ∞ −k k+1 −1 , = cik + j∈S Mikj βck+1 + e β M . . . M c =k+2 j j so that
Vik Mikj E = cik + 1−β j∈S
∞
β
−k
cx |x
k+1
= j ,
(7.28)
=k+1
and the above expectation is nothing else than βV jk+1 /(1 − β). The conclusion follows. Remark 7.4 Lemma 7.3 allows us to compute V k given V k+1 . In practice, we can compute an approximation of V k given a horizon N > 0, setting ck,N = ck if k < N , and ck,N = 0 otherwise.
(7.29)
The corresponding expectation Vik,N
:= (1 − β)E
N −1
k β −k cx,N |x
=i
(7.30)
=k
is the value function of a problem with finite horizon N and therefore can be computed by induction, starting from V N ,N = 0. We have the error estimate V k,N − V k ∞ ≤ (1 − β)
β −k c ∞ ≤ β N −k c∞ .
(7.31)
≥N
Remark 7.5 In the autonomous case, i.e., when (ck , M k ) does not depend on time, and is then denoted as (c, M), it is easily checked that V k actually does not depend on k, and is therefore denoted by V . Then Lemma 7.3 tells us that V satisfies V = (1 − β)c + β M V.
(7.32)
Since M is non-expansive in ∞ , V → (1 − β)c + β M V is a contraction with coefficient β. By the Banach–Picard theorem, (7.32) has a unique solution. As observed in Remark 7.4, applying the above contraction mapping N times, starting from the zero value function, is equivalent to compute the value function V N of the corresponding problem with horizon N and zero terminal cost, and we have as in (7.31): (7.33) V N − V ∞ ≤ β N c∞ . Remark 7.6 We often have periodic data (think of seasonal effects in economic modelling), i.e., (ck , M k ) = (ck+K , M k+K ), where the positive integer K is called
7.1 Controlled Markov Chains
229
the period. It is easily checked that V k is periodic with period K , so it suffices to compute (V 1 , . . . , V K ). It then follows from (7.26) that, with obvious notations: ⎛
⎞ ⎛ 1⎞ ⎛ 1 ⎞⎛ 2 ⎞ 0 V1 c M V 1⎜ . ⎟ ⎜ ⎟ ⎜ ⎟ ⎜ . . ⎟ . .. ⎝ .. ⎠ = (1 − β) ⎝ .. ⎠ + β ⎝ ⎠ ⎝ .. ⎠ , β 0 MK VK cK V K +1
(7.34)
with V K +1 := V 1 . We see that (V 1 , . . . , V K ) is a solution of a contracting fixedpoint equation, of the same nature as the one obtained in the autonomous case (but K times larger).
7.1.2 The Dynamic Programming Principle Consider now a Markov chain whose transition probabilities Mikj (u) depend on a control variable u ∈ Uik , where Uik is an arbitrary set depending on the time k and state i ∈ S . We have costs depending on the control and state: cik (u) : Uik → R, and final values ϕ ∈ ∞ , such that c∞ := sup |cik (u)| < ∞.
(7.35)
k,i,u
Let Φ k denote the set of feedback mappings (at time k), that to each i ∈ S associates u i ∈ Uik . Given a horizon N > k, we choose a feedback policy, i.e., an element u of the set (7.36) Φ (0,N −1) := Φ 0 × · · · × Φ N −1 , that to each i ∈ S and k ∈ {0, . . . , N − 1} associates an element u ik of Uik . We denote by M k (u k ) the transition operator with generic term Mikj (u ik ), and by Pu and Eu the associated probability and expectation. From the discussion of the uncontrolled case it follows that with the feedback policy u are associated the values Vik (u)
:= E
u
N −1
cx (u x )
+ ϕx N |x = i , k ∈ N, i ∈ S . k
(7.37)
=k
By our previous results, these values are characterized by the relations
V k (u) = ck (u) + M k (u)V k+1 (u), k = 0, . . . , N − 1; V N (u) = ϕ.
(7.38)
Here, by the short notation ck (u), we mean the function of i ∈ S with value cik (u i ). Also, the following holds:
230
7 Markov Decision Processes
V k (u)∞ ≤
N −1
ck ∞ + ϕ∞ , k = 0, . . . , N .
(7.39)
=k
The (minimal) value is defined by Vik :=
inf
u∈Φ (0,N −1)
Vik (u), i ∈ S ; k = 0, . . . , N .
(7.40)
ck ∞ + ϕ∞ , k = 0, . . . , N .
(7.41)
In view of (7.39), we have that V k ∞ ≤
N −1 =k
Given ε ≥ 0 and k ∈ {0, . . . , N − 1}, we define the set Φ k,ε of ε-optimal feedback policies at time k, as Φ k,ε =
⎧ ⎨ ⎩
⎧ ⎨
uˆ ∈ Φ k ; uˆ i ∈ ε-argmin cik (u i ) + u∈U k ⎩ i
j
⎫ ⎫ ⎬ ⎬ Mikj (u i )V jk+1 , for all i ∈ S . ⎭ ⎭
(7.42) By ε-argminu∈Uik , we mean the set of points where the infimum is attained up to ε, that is, in the present setting, the set of uˆ i ∈ Uik such that cik (uˆ i ) +
j
Mikj (uˆ i )V jk+1 ≤ ε + inf
u∈Uik
⎧ ⎨ ⎩
cik (u) +
Mikj (u)V jk+1
j
⎫ ⎬ ⎭
.
(7.43)
Note that this set may be empty if ε = 0. Consider the dynamic programming equation: find (v = v0 , . . . , v N ) ∈ (∞ ) N +1 such that ⎧ ⎧ ⎫ ⎪ ⎨ ⎬ ⎪ ⎨ k vi = inf cik (u) + Mikj (u)vk+1 , i ∈ S , k = 0, . . . , N − 1, j (7.44) ⎭ u∈Uik ⎩ j ⎪ ⎪ ⎩ N v = ϕ. Proposition 7.7 The (minimal) value function V k is the unique solution of the dynamic programming equation. If the policy u¯ is such that for some εk ≥ 0, u¯ k ∈ U¯ k,εk for all k, then Vik ≤ V k (u) ¯ ≤ Vik + ε¯ k , ε¯ k :=
N −1 =k
ε , k = 0, . . . , N − 1.
(7.45)
7.1 Controlled Markov Chains
231
In particular, if the above relation holds with εk = 0 for all k, then the policy u is optimal in the sense that Vik = Vik (u), for all k = 0, . . . , N − 1, and i ∈ S . Proof By (backward) induction, we easily obtain that the dynamic programming principle (7.44) has a unique solution v in (∞ ) N +1 that satisfies the estimate (7.41). ¯ This holds (with equality) for k = N , Given a policy u, ¯ we claim that vk ≤ V k (u). and if it holds at time k + 1, then the claim follows by induction, since vik = inf u∈Uik {cik (u) + j Mikj (u)vk+1 } ≤ cik (u¯ ik ) + j Mikj (u¯ ik )vk+1 ≤ cik (u¯ ik ) + j Mikj (u¯ ik )V jk+1 (u) ¯ = Vik (u). ¯
(7.46)
Minimizing over u¯ we obtain that vk ≤ V k . We next prove the second inequality in (7.45) with v in lieu of V . It obviously holds when k = N , and if it does at time k + 1, then Vik (u) ¯ = cik (u¯ ik ) +
Mikj (u¯ ik )V jk+1 (u) ¯ k ≤ εk + inf u∈Uik {ci (u) + j Mikj (u)V jk+1 (u)} ¯ k+1 k k ≤ ε¯ k + inf u∈Uik {ci (u) + j Mi j (u)v j } = ε¯ k + vik . j
(7.47)
¯ ≤ ε¯ k + vik . Since ε¯ k can be taken arbitrarSo, we have proved that vk ≤ V k ≤ V k (u) k k ily small, v = V for all k, and the conclusion follows.
7.1.3 Infinite Horizon Problems 7.1.3.1
Main Result
In this section, we assume that the data are autonomous: the cost function, transition operator and control sets do not depend on time, and we have a discount coefficient β ∈ (0, 1). The following theorem characterizes the optimal policies, and shows in particular that we can limit ourself to autonomous (not depending on time) feedback policies Φ that with each i ∈ S associate to an element u i of Ui . Sometimes we will use the following hypothesis:
For all i and j in S , Ui is metric compact and the functions ci (u) and Mi j (u) are continuous.
(7.48)
Set, for all i ∈ S : Vi (u) := (1 − β)E
u
∞ k=0
β cx k (u xk )|x = i . k
0
(7.49)
232
7 Markov Decision Processes
Given the discount factor β ∈]0, 1[, the (minimal) value function is defined by Vi := inf Vi (u), i ∈ S .
(7.50)
u∈Φ
Theorem 7.8 (i) The value function is the unique solution of the dynamic programming equation: find v ∈ ∞ such that vi = inf
u∈Ui
⎧ ⎨ ⎩
(1 − β)ci (u) + β
Mi j (u)v j
j
⎫ ⎬ ⎭
, i ∈ S.
(7.51)
(ii) Given ε ≥ 0, let u ∈ Φ be an autonomous policy and V (u) ∈ ∞ be the associated value, the unique solution of V (u) = (1 − β)c(u) + β M(u)V (u).
(7.52)
Assume that, for all i ∈ S , ⎛ ˜ +β Vi (u) ≤ inf ⎝(1 − β)ci (u) u∈U ˜ i
⎞ Mi j (u)V ˜ j (u)⎠ + ε.
(7.53)
j
Set ε := (1 − β)−1 ε. Then the policy u is ε suboptimal, in the sense that the associated value V (u) satisfies Vi (u) ≤ Vi + ε ,
for all i ∈ S .
(7.54)
(iii) Let (7.48) hold. Then there exists (at least) an optimal policy. We recall that inf a(u) − inf b(u) ≤ sup |a(u) − b(u)|, u∈U u∈U
(7.55)
u∈U
and define the Bellman operator T : ∞ → ∞ as (T w)i := inf
u∈Ui
⎧ ⎨ ⎩
(1 − β)ci (u) + β
j
Mi j (u)w j
⎫ ⎬ ⎭
.
(7.56)
Proof (a) Let us show first that (7.51) has a unique solution. This equation is of the form v = T v. Since T w∞ ≤ (1 − β)c∞ + βw∞ , the operator T indeed maps ∞ into itself. Given w and w in ∞ , using (7.55) and the fact that M(u) is a transition operator, we get
7.1 Controlled Markov Chains
233
m m 1 Mi j (u)(w − w) j ≤ sup (T w )i − (T w)i ≤ sup Mi j (u)w − w∞ β u∈Ui j=1 u∈Ui j=1
and the r.h.s. is equal to w − w∞ . So, T is a contraction with coefficient β and, by the Banach–Picard theorem, has a unique solution denoted by v. We next prove that v is equal to the minimal value V . (b) Let u ∈ Φ be a policy, with associated value V (u). Since v ≤ (1 − β)c(u) + β M(u)v,
(7.57)
we deduce using (7.52) that v − V (u) ≤ β M(u)(v − V (u)). Lemma 7.9 below ensures that v ≤ V (u). Since this holds for all policies, we also have v ≤ V . (c) If (7.53) is satisfied, using (7.55) we get Vi (u) − vi ≤ ε + sup β u∈U ˜ i
Mi j (u)(V ˜ j (u) − v j ) ≤ ε + β sup(V (u) − v). (7.58)
j∈S
Taking the supremum in i, we deduce that sup(V (u) − v) ≤ ε . Since v ≤ V (u) for any u ∈ Φ, we deduce (7.54), whence (ii). (d) It follows from (ii) that a policy satisfying the dynamic programming equation (7.51) is optimal. Such a policy exists whenever (7.48) holds. Points (i) and (iii) follow. Lemma 7.9 Let M be a transition operator, β ∈]0, 1[, ε ≥ 0 and w ∈ ∞ satisfy w ≤ ε1 + β Mw. Then w ≤ (1 − β)−1 ε1. Proof We have Mw ≤ (sup w)1 since M is a transition operator, and so w ≤ (ε + β sup w)1. Therefore, sup w ≤ ε + β sup w, whence the conclusion. Definition 7.10 We say that the sequence {u q } of autonomous feedback policies q simply converges to u¯ ∈ Φ if u i → u¯ i , for all i ∈ S . We define in the same way the 1 simple convergence in and ∞ . Lemma 7.11 Let {u q } simply converge to u¯ in Φ. Then the associated value sequence ¯ V (u q ) simply converges to V (u). Proof Since V (u q ) is bounded in ∞ , by a diagonalizing argument, there exists a subsequence of V (u q ) that simply converges to some V¯ ∈ ∞ . We will show that ¯ V¯ = V (u). ¯ It easily follows then that the sequence V (u q ) simply converges to V (u). So, extracting a subsequence if necessary, we may assume that V (u q ) simply converges to V¯ ∈ ∞ . Fix ε ∈ (0, 1) and i ∈ S . There exists a partition (I, J ) of S such that Mi j (u) ¯ ≥ 1 − 21 ε. (7.59) I has a finite cardinality and j∈I
Since I is finite and u q simply converges to u, ¯ for q large enough, we have that q j∈I Mi j (u ) ≥ 1 − ε, and so
234
7 Markov Decision Processes
Mi j (u) ¯ ≤ ε;
j∈J
Mi j (u q ) ≤ ε.
(7.60)
j∈J
¯ Since I is finite, we have that Set, for i ∈ S , Δi := lim supq |Vi (u q ) − Vi (u)|. q q Δi = lim sup (1 − β)(ci (u i ) − ci (u¯ i )) + β (Mi j (u i )V (u q ) j − Mi j (u¯ i )V j (u)) ¯ q j ≤ β lim sup (Mi j (u q )V (u q ) j − Mi j (u)V ¯ j (u)) ¯ + ε(V (u q )∞ + V (u) ¯ ∞) q j∈I ≤ ε(V q ∞ + V (u) ¯ ∞ ).
Since we may take ε arbitrarily small, the result follows.
Remark 7.12 By similar arguments it can be shown that, in a finite horizon setting, ¯ if a sequence {u q } of feedback policies simply converges to the feedback policy u, ¯ then the associated values V (u q ) simply converge to V (u).
7.1.3.2
Characterization of Optimal Policies
We now want to characterize optimal policies when starting from a given point, say i ∈ S . That is, a policy u ∈ Φ such that the associated value satisfies Vi (u) = Vi . Definition 7.13 Consider an autonomous Markov chain with transition operator M. Let i ∈ S . We say that j ∈ S is q-steps accessible from i (with q ≥ 1) if a Markov chain starting at state i and time 0 has a nonzero probability of having its state equal to j at time q. We say that j is accessible from i if it is n-steps accessible for some n ≥ 1. The union of such j is called the accessible set from state i. Let here M q denote the q times product of M. It is easily checked by induction q that Mi j > 0 iff the Markov chain starting at i at time 0 has a positive probability of being equal to j at time q. Therefore the accessible set is ∞ { j ∈ S ; Mi j > 0}. Si = ∪q=1 q
(7.61)
In the case of a controlled Markov chain, we denote by Si (u) the accessible set when starting from i, with the policy u ∈ Φ. Set Sˆi (u) := {i} ∪ Si (u). Theorem 7.14 A policy u ∈ Φ is optimal, when starting from i 0 ∈ S , iff it satisfies the dynamic programming equation over Sˆi0 (u), i.e., ⎧ ⎨
u i ∈ argmin (1 − β)ci (v) + β v∈Ui ⎩
j
Mi j (v)V j
⎫ ⎬ ⎭
,
for all i ∈ Sˆi0 (u).
(7.62)
7.1 Controlled Markov Chains
235
Proof Let i ∈ S be such that Vi (u) = Vi . Then ⎛ (1 − β) ⎝ci (u i ) + β
⎞
⎛
Mi j (u i )⎠ V j (u) = Vi = inf (1 − β) ⎝ci (v) + β v∈Ui
j
⎞ Mi j (v)V j ⎠ .
j
(7.63) Since V j ≤ V j (u) this holds iff V j (u) = V j whenever Mi j (u) = 0. The result then follows by induction, starting with i = i 0 .
7.1.4 Numerical Algorithms 7.1.4.1
Value Iteration
In the case of autonomous infinite horizon problems, the simplest method for solving the dynamic programming principle (7.51) is the value iteration algorithm: compute the sequence vq in ∞ , for q ∈ N, the solution of q+1
vi
= inf
u∈Ui
⎧ ⎨ ⎩
(1 − β)ci (u) + β
q
Mi j (u)v j
j
⎫ ⎬ ⎭
, i ∈ S , q ∈ N.
(7.64)
We initialize the sequence with an arbitrary element v0 of ∞ . The sequence vq is not to be confused with the values vk used in the case of finite horizon. Observe that (7.64) coincides with the formula for computing the value of finite horizon problems (up to the fact that here we increase the index q instead of decreasing it). It easily follows that vq is the value function of the following finite horizon, discounted problem q Vi (u)
:= (1 − β) min E
u
u∈Φ (0,q−1)
q−1
β cx (u ) +
β q vx0q |x 0
= i , k ∈ N, i ∈ S ,
=0
where the set Φ (0,N −1) of feedback policies was defined in (7.36).
(7.65)
Proposition 7.15 The value iteration algorithm converges to the unique solution V of (7.51), and we have vq − V ∞ ≤ β q v0 − V ∞ ,
for all q ∈ N.
(7.66)
Proof We showed in the proof of Theorem 7.8 that the Bellman operator T , defined in (7.56), is a contraction with ratio β in the uniform norm. We conclude by the Banach–Picard theorem. Remark 7.16 When taking v0 = 0 we obtain the explicit estimate of distance to the solution:
236
7 Markov Decision Processes
vq − V ∞ ≤ β q V ∞ ≤ β q c∞ ,
for all q ∈ N.
(7.67)
Remark 7.17 Observe that vq+1 is a nondecreasing function of vq . So if v1 ≤ v0 , we obtain by an induction argument that vq is a nonincreasing sequence. This is the case in particular if v0j ≥ supi ci , for all j ∈ S . Similarly, if v0j ≤ inf i ci , for all j ∈ S , then vq is nondecreasing.
7.1.4.2
Policy Iteration
When β is close to 1, the value iteration algorithm can be very slow. A possible alternative is the policy iterations, or Howard algorithm. Roughly speaking, the idea is, for a given policy, to compute the associated value, and then to update the policy by computing the argument of the minimum in the dynamic programming operator. We assume that the compactness hypothesis (7.48) holds. Each iteration of the algorithm has two steps: Algorithm 7.18 (Howard algorithm) 1. Initialization: choose a policy u 0 ∈ Φ; set q := 0. 2. Compute the value function vq associated with the policy u q ∈ Φ, i.e., the solution of the linear equation vq = (1 − β)c(u q ) + β M(u q )vq .
(7.68)
3. Compute a policy u q+1 ∈ Φ, a solution of q+1
ui
⎧ ⎫ ⎨ ⎬ q ∈ arg min (1 − β)ci (u) + β Mi j (u)v j , u∈Ui ⎩ ⎭
for all i ∈ S . (7.69)
j
4. q := q + 1; go to step 2. Denote by V the value function, the unique solution of the dynamic programming principle (7.51). Proposition 7.19 Let (7.48) hold. Then the Howard algorithm is well-defined. The sequence vq is nonincreasing and satisfies vq+1 − V ∞ ≤ βvq − V ∞ ,
for all q ∈ N.
(7.70)
In addition, denote by v¯ q+1 the value obtained by applying the value iteration to vq . Then vq+1 ≤ v¯ q+1 . Proof The linear system (7.68) has a unique solution in ∞ , since it is a fixed point equation of a contraction. In view of (7.48), the minimum in the second step is attained. The sequence vq is bounded in ∞ since we have
7.1 Controlled Markov Chains
237
vq ∞ ≤ (1 − β)c(u q )∞ + βM(u q )vq ∞ ≤ (1 − β)c(u q )∞ + βvq ∞ , (7.71) and therefore vq ∞ ≤ c∞ . Relations (7.68) and (7.69) imply (1 − β)c(u q+1 ) + β M(u q+1 )vq ≤ (1 − β)c(u q ) + β M(u q )vq ,
(7.72)
whence vq+1 − vq = (1 − β)(c(u q+1 ) − c(u q )) + β(M(u q+1 )vq+1 − M(u q )vq ) ≤ β M(u q+1 )(vq+1 − vq ), and so vq+1 − vq ≤ 0 by Lemma 7.9. By Proposition 7.15, ¯vq+1 − V ∞ ≤ βvq − V ∞ . Since V ≤ vq+1 , it is enough to establish that vq+1 ≤ v¯ q+1 . Indeed, we get after cancellation that vq+1 − v¯ q+1 = β M(u q+1 )(vq+1 − vq ) ≤ 0, since M(u q+1 ) has nonnegative elements and vq+1 ≤ vq . The conclusion follows. Remark 7.20 The previous proof shows that the policy iterations converge at least as rapidly as the value iteration. However, each iteration needs to solve a linear system. This can be expensive, especially if the transition operators are not sparse. Remark 7.21 The contraction constant β is optimal for the Howard algorithm, as Example 7.22 shows. In addition, in this example the sequence computed by the value and Howard algorithms coincide. So, in general, the Howard algorithm does not converges more rapidly than the value iteration. Example 7.22 Here is a variant of an example due to Tsitsiklis, see Santos and Rust [109], showing that the Howard algorithm does not necessarily converge faster than the value iteration algorithm. Let S = N, and for all i ∈ N, i = 0, Ui = {0, 1}. The decision 0 (resp. 1) represents a (deterministic) move from state i to state i − 1 (resp. to itself). The only possible decision at state 0 is to remain there. The cost is 1 at any state i = 0, and 0 at state 0. So the optimal policy is to choose u = 0 when i = 0. The optimal value is V0 = 0 and for i > 0, Vi = (1 − β)(1 + β + · · · + β i−1 ) = 1 − β i .
(7.73)
We choose to initialize Howard’s algorithm with the policy u = 1 for any i > 0. So vi0 = 0 if i = 0, vi0 = 1 otherwise.
(7.74)
v0 − V ∞ = v10 − V1 = 1 − (1 − β) = β.
(7.75)
We then have
238
7 Markov Decision Processes
At each iteration q of the algorithm the decision at state q changes from 1 to 0, and this is the only change, so that q
q
vi = Vi , 0 ≤ i ≤ q; vi = 1, i > q.
(7.76)
It follows that q
vq − V ∞ = vq+1 − Vq+1 = 1 − (1 − β q+1 ) = β q+1 = β q v0 − V ∞ . (7.77) 7.1.4.3
Modified Policy Iteration Algorithms
The idea is to replace, in Howard’s algorithm, the linear system resolution with finitely many value iteration-like steps, where the decision is freezed. Algorithm 7.23 (Modified policy iteration algorithm) 1. Initialization: choose a policy u 0 ∈ Φ, an initial value estimate v−1 ∈ ∞ , and m ∈ N∗ . Set q := 0. 2. Set vq,0 := vq−1 . Compute vq,k , k = 1 to m, as the solution of ‘freezed value iteration steps’ as follows: vq,k := (1 − β)c(u q ) + β M(u q )vq,k−1 .
(7.78)
3. Set vq := vq,m and compute the policy u q+1 ∈ Φ, a solution of q+1
ui
⎧ ⎫ ⎨ ⎬ q ∈ arg min (1 − β)ci (u) + β Mi j (u)v j , u∈Ui ⎩ ⎭
for all i ∈ S . (7.79)
j
4. q := q + 1; go to step 2. Remark 7.24 (i) If m = 1 we recover the value iteration algorithm. (ii) Denote by vˆ q the value associated with the policy u q . The convergence analysis of the freezed value iteration steps is similar to that of the ‘classical’ value iterations. We deduce that (7.80) vq,m − vˆ q ∞ ≤ β m vq−1 − vˆ q ∞ . So, informally speaking, when m is large, the sequence vq should not be too different from the one computed by Howard’s algorithm. The gain is that the freezed value iteration steps are generally much faster than the corresponding classical value iterations.
7.1 Controlled Markov Chains
239
7.1.5 Exit Time Problems Let Sˆ be a subset of S , and consider an autonomous controlled Markov chain model, the control sets Ui being metric and compact, and the transition operator M and cost ci (u) being continuous. Let τ be the first exit time of Sˆ of the Markov chain starting at i ∈ Sˆ , at time zero: / Sˆ }. τ := min{k ∈ N; x k ∈
(7.81)
Note that τ = ∞ if no exit occurs. We consider the value function, for i ∈ S : Vi := (1 − β) inf E
u
u∈Φ
τ −1
τ
β cx k (u x k ) + β ϕx τ |x = i . k
0
(7.82)
k=0
Remark 7.25 If the ci are set to zero and ϕi = 1 (resp. ϕi = −1) for all i ∈ S \ Sˆ, we see that the problem consists, roughly speaking, in maximizing (resp. minimizing) a discounted value of the exit time. It appears that exit problems reduce to the standard one by adding a final state say i f to the state space, which becomes S := S ∪ {i f }. The decision sets are for i ∈ S : Ui if i ∈ Sˆ , (7.83) Ui = {0} otherwise. The associated transition operators are defined by Mi j (u)
=
Mi j (u) for i ∈ Sˆ , u ∈ Ui , δ ji f if i ∈ / Sˆ .
(7.84)
In other words, for any i ∈ S \ Sˆ , the only possible transition is to the final state i f , and when in i f the process remains there. The costs are ⎧ ⎨ ci j (u) if i ∈ Sˆ , ci (u) = ϕi if i ∈ S \ Sˆ , ⎩ 0 if i = i f .
(7.85)
Proposition 7.26 Let supu∈U |ci (u)| be finite and ϕ bounded. Then the value function of the exit time problem is the unique solution of the dynamic programming equation ⎧ ⎫ ⎧ ⎨ ⎬ ⎪ ⎪ ⎨ v = inf (1 − β)c (u) + β Mi j (u)v j , i ∈ Sˆ, i i u∈Ui ⎩ ⎭ j ⎪ ⎪ ⎩ i ∈ S \ Sˆ . vi = (1 − β)ϕi ,
(7.86)
240
7 Markov Decision Processes
Proof This is a consequence of our previous results. We have just shown that exit time problems can be rewritten as standard controlled Markov chain problems, and the value at the state i f is zero. Writing the corresponding dynamic programming equation, for i ∈ Sˆ we get the first row in (7.86), and otherwise we get vi = (1 − β)ϕi + βvi f , i ∈ S \ Sˆ; vi f = βvi f . Therefore vi f = 0 and then (7.86) follows.
(7.87)
Remark 7.27 The value iteration algorithm (rewriting the exit problem as a standard one), when starting with initial values such that vi0 = (1 − β)ϕi , i ∈ S \ Sˆ ; vi0f = 0, satisfies
vi = (1 − β)ϕi , i ∈ S \ Sˆ ; vi f = 0, q
q
for all q ∈ N.
(7.88)
(7.89)
So we can express it in the form, for q ∈ N: ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩
q+1
vi
q+1 vi
= inf
u∈Ui
⎧ ⎨ ⎩
(1 − β)ci (u) + β
q
Mi j (u)v j
j
⎫ ⎬ ⎭
, i ∈ Sˆ ,
(7.90)
i ∈ S \ Sˆ .
= (1 − β)ϕi ,
Exercise 7.28 Extend the policy iteration algorithm to the present setting, the sequence of values satisfying (7.89).
7.1.6 Problems with Stopping Decisions 7.1.6.1
Setting
We now study an extension of the previous framework, with the additional possibility of a stopping decision at any state i ∈ S with cost ψi in R ∪ {+∞} (in fact, the possibly infinite value restricts the possibility of stopping to the states with a finite value of ψ). We assume that Ψ has a finite infimum. Let M(u) be the transition operator of the controlled Markov chain. We assume that (7.48) holds. Given Sˆ ⊂ S , we denote by τ the first exit time of Sˆ , and consider the additional decision θ , called the stopping time (a function of i ∈ S ). Set χθ 0 and η ∈ (0, 1),
n−1 n−1 1 1 C 1 − ηn k k . w − w¯ = (w − w) ¯ ≤ n n n 1−η k=0 k=0
(7.184)
(7.185)
Taking the example of a constant sequence (except for the first term) we see that the convergence in the Cesaro sense is typically at best at speed 1/n. Coming back to Markov chains, in view of (7.180) and (7.183), we obviously have C C ¯ then π n → ¯ ¯ and V¯ n → Sc. π¯ = π 0 S, (7.186) If S n → S, Example 7.43 Consider an uncontrolled ! Markov chain with S = {1, 2} and M equal 01 to the permutation matrix M := . Then M n is equal to M if n is odd, and equal 10 to the identity otherwise. So, M n and π n have no limit. However, we have the Cesaro limits ! C n C 1 1 1 ; π n → 21 21 , V¯ N → 21 (c1 + c2 )1. (7.187) M →2 11
7.3.2 Transient and Recurrent States With a transition matrix M we associate the graph G M in which the set of nodes (or vertices) is the state set S , and there is an edge (a directed arc) between vertices i and j iff Mi j > 0.
7.3 Ergodic Markov Chains
257
Definition 7.44 An (n-step) walk in G M is an ordered string of nodes (i 0 , . . . , i n ), n ≥ 1, such that there exists an arc from i k to i k+1 , for k = 0 to n − 1. A path is a walk in which no node is repeated. A cycle is a walk with the same initial and final node, and no other repeated node. It is easily checked that there exists an n-step walk from i to j iff Minj > 0, i.e., if j is n-step accessible from i (Definition 7.13). We say that two states i, j communicate if each of them is accessible from the other. This is an equivalence relation whose classes are called communication classes or just classes. A recurrent class is a class that contains all states that are accessible from any of its elements. Once the state enters this class, it stays in it forever. A transient class is a class that is not recurrent. A state is transient (resp. recurrent) if it belongs to a transient (resp. recurrent) class. Definition 7.45 The class graph is the graph whose nodes are the communication classes, with a directed arc between two classes C , C iff C = C , and Mi j > 0, for some i ∈ C and j ∈ C . Observe that the class graph is acyclic (it contains no cycle) so that each maximal path ends in a recurrent class. In particular, there exists at least one recurrent class.
7.3.2.1
Invariant Probabilities
Recall that a probability π is invariant iff π = π M, i.e., if it is a left eigenvector of M with eigenvalue 1. If M is the identity operator, any probability law is invariant. Therefore, invariant probabilities are in general nonunique. We call the support of a probability law the set of states over which it is nonzero, and if B ⊂ S , we set π(B) := i∈B πi . Lemma 7.46 Let π be an invariant probability law. Then for all i ∈ S , πi = 0 whenever i is transient. Proof Let π be an invariant probability law. Let T (resp. R) denote the set of transient (resp. recurrent) states. Then Minj = 0 if i ∈ R and j ∈ T , for any n ≥ 1, and so, π(T ) =
j∈T i∈S
πi Minj =
j∈T i∈T
πi Minj =
i∈T
πi
Minj .
(7.188)
j∈T
This implies that if πi = 0, then j∈T Minj = 1 for all n ≥ 1, meaning that all accessible states from i are transient, contradicting the fact that (as is easily established) some recurrent states must be accessible from any transient state. Lemma 7.47 Let C denote the square submatrix of M corresponding to row and columns associated with transient states. Then C n → 0 at a linear rate. Proof For large enough n, the probability that the Markov chain starting at any transient state i ∈ T is in a recurrent state is positive, say greater than ε > 0 when
258
7 Markov Decision Processes
n > n 0 . Let n > n 0 . By the previous discussion, a strict contraction in ∞ since, for any v ∈ ∞ : C n v∞ ≤ max i∈T
j∈T
j∈T
Cinj < 1 − ε. But then C n is
⎛ ⎞ Cinj |v j | ≤ max ⎝ Cinj ⎠ v∞ ≤ (1 − ε)v∞ i∈T
(7.189)
j∈T
and since C is non-expansive, for m = qn + r , 0 ≤ r < n, we have that C m v∞ ≤ (1 − ε)q v∞ ≤ (1 − ε)m/n−1 v∞ . The conclusion follows.
(7.190)
Remark 7.48 Let π be an invariant probability and R be a recurrent class. Then either R is included in the support of π , or π vanishes over R. Indeed, if i, j belong to R and πi > 0, for some n, Minj > 0, and since π = π M n , π j ≥ πi Minj > 0. 7.3.2.2
Regular Transition Matrices
We start by discussing the contraction property of operators. Lemma 7.49 Let M be a transition operator and y ∈ ∞ . Set z := M y. Then (i) the following holds: min(y) ≤ min(z); max(z) ≤ max(y). (7.191) The first (resp. second) equality occurs iff for some i ∈ S , Mi j = 0 for any j ∈ S such that y j > min(y) (resp. y j < max(y)), and (ii) if M is a transition matrix such that ε := mini, j Mi j is positive, then for any y ∈ ∞ , M n y converges to a constant vector and max(M n y) − min(M n y) ≤ (1 − 2ε)n (max(y) − min(y)).
(7.192)
Proof (i) Immediate. (ii) Given y ∈ Rm , set z := M y, a := min(y), b := max(y), attained at indexes i 1 and m be such that yi 1 = a and yi = b otherwise. i 2 resp., and ε := mini, j Mi j . Let y ∈ R Since M ≥ 0, j Mi j = 1 and y ≤ y , we have that for all i ∈ S : z i = (M y)i ≤ (M y )i = Mii1 a + (1 − Mii1 )b ≤ εa + (1 − ε)b.
(7.193)
Similarly, let y ∈ Rm be such that yi 2 = b and yi = a otherwise. Then z i = (M y)i ≥ (M y )i = Mii2 b + (1 − Mii2 )a ≥ εb + (1 − ε)a.
(7.194)
Setting N (y) := max(y) − min(y), we obtain that N (M y) ≤ (1 − 2ε)N (y), and conclude by an induction argument.
7.3 Ergodic Markov Chains
259
We say that M is a regular transition matrix if there exists an n 0 such that M n 0 has no zero components. It is easily checked then that M n has no zero components for all n > n 0 . If M has a unique invariant probability law π¯ , we denote by M¯ the matrix whose rows are all equal to π¯ . Lemma 7.50 A regular transition matrix M has a unique invariant probability law π¯ , which is the unique left eigenvector of M, having an eigenvalue of modulus greater than or equal to 1. Also, M n → M¯ at a linear rate, in the sense that, for some C > 0 and η ∈ (0, 1): ¯ ≤ Cηn . (7.195) M n − M Proof (a) Let y ∈ Rm . By Lemma 7.49(i) y n = M n y is such that min(y n ) is nondecreasing and max(y n ) is nonincreasing. By Lemma 7.49(ii), max(y n 0 j ) − min(y n 0 j ) → 0 at a linear rate. Combining the two results we see that y n converges towards a constant vector. Taking y equal to column j of M, so that y n equals column j of M n , we deduce that M n converges to some transition matrix M¯ whose column i is of the form π¯ i 1, with 0 ≤ min(π¯ ) ≤ max(π¯ ) ≤ 1. Since M¯ is a transition matrix, π¯ has sum 1 and is therefore a probability law. For any horizontal vector z, we have that zi . (7.196) lim z M n = z M¯ = (π¯ 1 , . . . , π¯ m ) n
i
Let z be a left eigenvector of M with eigenvalue λ ∈ C. If |λ| > 1 then z M n = λn z diverges, which contradicts (7.196). If λ = eiθ then z M n = eniθ z. By (7.196) this implies that θ = 0 (modulo 2π ) and that z is colinear to π¯ . So, π¯ is the unique left ¯ with an eigenvalue of modulus at least one. In particular, π¯ is the eigenvector of M, unique invariant probability law of M. Example 7.51 A transition matrix ! may have nonzero eigenvalues of modulus less 10 than 1. For instance, M = 1 1 has eigenvalues 1 and 21 . 2 2
Remark 7.52 Let M be a regular transition matrix. The orthogonal space to the invariant probability π¯ is π¯ ⊥ = {y ∈ ∞ ; π¯ y = 0}.
(7.197)
For y ∈ ∞ , we have the unique decomposition y = α1 + z, z ∈ π¯ ⊥ , with α = π¯ y. Since M¯ y = α1, it follows from (7.195) that, for some positive η < 1, |M n z| = O(ηn ), and therefore: (7.198) M n y = α1 + O(ηn ). We recover the fact that 1 is a simple eigenvalue of M (the associated eigenspace has dimension 1) and that the other eigenvalues have modulus less than one. If follows that if π 0 is any probability law for x 0 , then for some C > 0: |π 0 M n − π¯ | ≤ C ηn .
(7.199)
260
7.3.2.3
7 Markov Decision Processes
Single Class Transition Matrices
If the transition matrix M has a single class (which is therefore recurrent), we cannot hope M n to converge in general (see Example 7.43), but we still have the following result. Lemma 7.53 A single class transition matrix has a unique invariant probability. Proof For ε ∈ (0, 1), set Mε := ε I + (1 − ε)M, where M is a single class transition matrix. Then by the binomial formula for commutative matrices ! n n 1− p ε (1 − ε) p M p (M ) = p ε n
(7.200)
p=0
has, for n large enough, only positive elements. By Lemma 7.50, M ε has a unique invariant probability. Since it is easily checked that M ε and M have the same invariant probabilities, the conclusion follows. Lemma 7.54 A single class transition matrix M (with therefore a unique invariant probability π¯ ), is such that for any probability π 0 , π n := π 0 Sn converges to π¯ , and Sn converges to ⎛ ⎞ .. .. . ⎟ ⎜ .1 m⎟ · · · π ¯ π ¯ S¯ = ⎜ (7.201) ⎝ ⎠. .. .. . . Proof Since Sn is a transition matrix, π n is a sequence of probabilities. Since Sn (I − M) = (I − M n+1 )/(n + 1) → 0, we have that π n (I − M) → 0. So, any ¯ limit-point of π n is an invariant probability, and is by Lemma 7.53 equal to π. This implies that π n → π¯ . So, any limit point S¯ of (the bounded sequence) S n is such that π 0 S¯ = π¯ , for any initial probability π 0 , from which (7.201) follows.
7.3.2.4
General Case
More generally after some permutation of the state indexes we may write the transition matrix in the form ! A 0 M= , (7.202) BC where A is a p × p matrix, p ≤ m, the first p state being recurrent, the other being transient. The matrix A is block diagonal, each block corresponding to a recurrent class. Then ! An 0 , (7.203) Mn = Bn C n
7.3 Ergodic Markov Chains
261
with B1 := B and Bn+1 = Bn A + C n B, so that by induction Bn+1 = B An + C B An−1 + · · · + C n B =
n
C i B An−i .
(7.204)
i=0
We set (we will check in the lema below that (I − C) is invertible): ! A¯ 0 , B¯ 0
M¯ :=
¯ where B¯ := (I − C)−1 B A.
(7.205)
k Lemma 7.55 (i) The matrix (I − C) is invertible, and (I − C)−1 = ∞ k=0 C . (ii) n n ¯ If all recurrent classes are regular, then M → M. (iii) Otherwise, M converges to M¯ in the Cesaro sense. k Proof (i) By Lemma 7.47, C n → 0 geometrically, so that C := ∞ k=0 C is welldefined, and
(I − C)C = lim(I − C) q
q
C k = lim(I − C q+1 ) = I.
k=0
q
(7.206)
Point (i) follows. ¯ a block(ii) Assume that all recurrent classes are regular. By Lemma 7.50, An → A, diagonal matrix each block of which has identical rows equal to the unique invariant probability of the corresponding block of A, as in (7.201). It remains to prove that ¯ Since C n → 0 at a linear rate, An is bounded and converges to A, ¯ this folBn → B. lows from the dominated convergence theorem in the space of summable sequences. Point (ii) follows. (iii) General case. Setting ⎧ ⎪ ⎨ A¯ n =
1 (I + A + · · · + An ), n+1 ⎪ ⎩ C¯ n = 1 (I + C + · · · + C n ), n+1 and using (7.204) we obtain that for some S n is A¯ n Sn = Bn By Lemma 7.54, a given recurrent C¯ n → 0 since C n
(7.207)
Bn , the expression of the average sum ! 0 . (7.208) C¯ n
A¯ n → A¯ (as before, the block diagonal matrix whose rows for class are equal to the corresponding invariant probability) and → 0 at a linear rate. By (7.204),
262
7 Markov Decision Processes
Bn =
n 1 1 i (B1 + · · · + Bn ) = C B Aj, n+1 n + 1 q=0 i+ j=q
(7.209)
1 i j i n − i + 1 ¯ n−i C B A = C B A . n + 1 i=0 n+1 j=0 i=0
(7.210)
and therefore n
Bn =
n−i
n
¯ The conclusion Again by a dominated convergence argument, Bn → (I − C)−1 B A. follows.
7.3.2.5
A Linear System for the Average Return
We assume that the Markov chain has a unique invariant probability law π¯ (i.e., there is a unique recurrent class). Set W := M − I , and consider the linear equation c + W V = η1,
(7.211)
where c ∈ ∞ is given, so that the solution space is (V, η) ∈ ∞ × R. Observe that, if (V, η) is a solution, then for all α ∈ R, setting V := V + α1, we have that (V , η) is another solution, so that we redefine the solution space as (∞ /R) × R. Being an invariant probability law, π¯ belongs to the left kernel of W . Multiplying (7.211) on the left by π¯ , we deduce that η = π¯ c (7.212) is the average cost. Solving the linear system (7.211) therefore gives a way to compute the average cost without computing the invariant probability law. Lemma 7.56 Equation (7.211) has a unique solution in (∞ /R) × R. Proof Since the setting is finite-dimensional, it suffices to check that the only solutions for c = 0 are when η = 0 and V is constant. That η = 0 follows from (7.212). Now let V attain its maximum at i 0 . Then Mi 0 j V j ≤ Mi0 j Vi0 = Vi0 , (7.213) Vi0 = (M V )i0 = j
j
where we used the fact that M is a stochastic matrix. The equality means that we have that V j = Vi0 whenever Mi0 j = 0, i.e., when j is 1-step accessible from i 0 . By induction, we deduce that this holds for any state accessible from i 0 , and in particular for any element of the recurrent class. We have a similar result by considering a state where V attains its minimum. Therefore the minimum of V is equal to its maximum. The result follows.
7.3 Ergodic Markov Chains
7.3.2.6
263
More on Average Return
The linear equation (7.211), taking into account the decomposition (7.202) of the transition matrix M, is of the form η1 = c + (A − I )V , (7.214) η1 = c + BV + (C − I )V , where c refers to the subvector of c with corresponding components for the recurrent class, etc. Since A is a transition matrix with a unique recurrent class, the first row has a unique solution that determines (η, V ). Since, by Lemma 7.55(i), (C − I ) is invertible, the second row determines the value of V , given (η, V ). Observe that, in order to have the correct value of η, it is enough to solve the first block, i.e., we can ignore the transient states.
7.3.2.7
A Link with Finite Horizon Problems
Let (η, V ) be a solution of the average return Eq. (7.211). Then 1 V¯ n = S n−1 c = S n−1 η1 − S n−1 (M − I )V = η1 − (M n − I )V. n
(7.215)
Remark 7.57 If M is a regular transition matrix, by Remark 7.52, 1 is a simple eigenvalue. Take for V (defined in ∞ /R) the representative Vˆ that is a combination of vectors of other eigenspaces, whose eigenvalues have modulus less that 1, so that M n Vˆ ≤ cγ n for some c > 0, γ < 1. So, (7.215) gives the following expansion of V n = n V¯ n : (7.216) V n − nη1 − Vˆ = M n Vˆ ≤ cγ n .
7.3.3 Ergodic Dynamic Programming We next consider the problem of minimizing the average cost.
7.3.3.1
Controlled Markov Chains
We still assume that S is finite, and that For each u ∈ Φ, the Markov chain M(u) has a unique recurrent class S (u) and therefore a unique invariant probability law π(u).
(7.217)
264
7 Markov Decision Processes
The minimum ergodic cost problem can then be defined as Min
u∈Φ,π∈P
π c(u); π M(u) = π.
(7.218)
In view of (7.217), each feasible pair (u, π ) is such that π = π(u). By the previous section, an equivalent problem is Min η; η1 + V = c(u) + M(u)V.
u∈Φ,V,η
(7.219)
We connect this to the following ergodic dynamic programming principle: η + Vi = min (c(u) + M(u)V )i , u∈Ui
for all i ∈ S .
(7.220)
For u ∈ Φ, we set W (u) := M(u) − I . Theorem 7.58 Let u¯ satisfy the ergodic dynamic programming principle (7.220). Then (i) u¯ is a solution of the ergodic problem (7.219). (ii) If uˆ is another solution of (7.220), writing c¯ = c(u), ¯ cˆ = c(u), ˆ etc., then V¯ − Vˆ is maximal and constant over S (u). ˆ If in addition S (u) ¯ ∩ S (u) ˆ = ∅, then V¯ − Vˆ is constant. Proof (i) Let (u, ¯ V¯ , η) ¯ satisfy (7.220) and let uˆ ∈ Φ. Then η1 ¯ = c¯ + W¯ V¯ ≤ cˆ + Wˆ V¯ = Wˆ (V¯ − Vˆ ) + cˆ + Wˆ Vˆ = Wˆ (V¯ − Vˆ ) + η1. ˆ (7.221) Multiplying on the left by the invariant probability law πˆ for the policy u, ˆ since πˆ Wˆ = 0, we obtain that η¯ ≤ η. ˆ Point (i) follows. (ii) Let (u, ˆ Vˆ , η) ˆ be another solution of (7.220). Set δV := V¯ − Vˆ . Since η¯ = ηˆ by point (i), (7.221) implies that Wˆ δV ≥ 0, i.e. ˆ δV ≤ MδV.
(7.222)
We deduce that if δV attains its maximum at state i, then the maximum is also attained at each 1-step accessible state from i, and also by induction at any accessible state from i (for the policy u). ˆ This implies that δV is maximal and constant over S (u). ˆ Exchanging the roles of uˆ and u, ¯ we obtain that δV is minimal and constant over S (u). ¯ So, if S (u) ¯ ∩ S (u) ˆ = ∅, then δV is constant. We assume next that the Ui are compact, c and M are continuous, and (7.217) holds; then, by Lemma 7.56, the following Howard policy iteration algorithm is well-defined (compare to the Howard Algorithm 7.18 for the discounted case):
7.3 Ergodic Markov Chains
265
Algorithm 7.59 (Ergodic Howard algorithm) 1. Initialization: choose a policy u 0 ∈ Φ; q := 0. 2. Compute a solution (V q , ηq ) in ∞ × R of the linear equation ηq 1 + V q = c(u q ) + M(u q )V q .
(7.223)
3. Compute a policy u q+1 , a solution of q+1
ui
⎧ ⎨
∈ arg min ci (u) + u∈Ui ⎩
q
Mi j (u)V j
j
⎫ ⎬ ⎭
,
for all i ∈ S .
(7.224)
4. Go to step 2. Next, consider the following single class hypothesis for any strategy, stronger than (7.217): For each u ∈ Φ, S = S (u). (7.225) Theorem 7.60 (i) The sequence computed by the Howard algorithm is such that ηq is nonincreasing. (ii) If (7.225) holds, then any limit point of (u q , V q , ηq ) satisfies the ergodic dynamic programming principle and is therefore an optimal policy in view of Theorem 7.58. Proof (i) Setting cq := c(u q ), etc., we have that W q+1 V q+1 + cq+1 − ηq+1 1 = 0 = W q V q + cq − ηq 1,
(7.226)
and therefore (W q+1 − W q )V q + cq+1 − cq = W q+1 (V q − V q+1 ) + (ηq+1 − ηq )1.
(7.227)
By the definition of the Howard algorithm, the l.h.s., denoted by ξ q , has nonpositive values. Multiplying on the left by π q+1 ≥ 0, since π q+1 W q+1 = 0, we obtain that 0 ≥ π q+1 ξ q = ηq+1 − ηq .
(7.228)
So, ηq is nonincreasing. (ii) Being bounded, ηq converges to some η¯ ∈ R. Take a subsequence for which ¯ u), ˆ with similar conventions for costs, probabilities, etc. Passing (u q , u q+1 ) → (u, to the limit in the relation cq+1 + W q+1 V q ≤ c(u) + W (u)V q , we obtain that
cˆ + Wˆ V¯ ≤ c(u) + W (u)V¯ ,
for all u ∈ Φ,
for all u ∈ Φ.
(7.229)
(7.230)
266
7 Markov Decision Processes
Taking u = u¯ and adding the relation c¯ + W¯ V¯ = η1 ¯ = η1 ˆ = cˆ + Wˆ Vˆ ,
(7.231)
we deduce that ζ := Wˆ (V¯ − Vˆ ) ≤ 0. Since πˆ Wˆ = 0 we have that πˆ ζ = 0. Since πˆ ≥ 0 and ζ ≤ 0 it follows that ζ vanishes over the support of πˆ , which is the recurrent class, say Sˆ, associated with the policy u. ˆ So, for any i ∈ Sˆ we get, using ζi = 0 and (7.230): (c¯ + W¯ V¯ )i = η¯ = ηˆ = (cˆ + Wˆ Vˆ )i = (cˆ + Wˆ V¯ )i ≤ (c(u) + W (u)V¯ )i . (7.232) With (7.225) we conclude that (η, ¯ V¯ ) satisfies the ergodic dynamic programming principle (7.220). The conclusion follows. Remark 7.61 Any solution of the ergodic dynamic programming principle (7.220) provides a stationary sequence for Howard’s algorithm. So, if the single class hypothesis (7.225) holds, by the above two theorems, a policy is optimal iff it satisfies the ergodic dynamic programming principle.
7.4 Notes For partially observed processes, see Monahan [81]. For more on ergodic Markov chains, see Arapostathis et al. [7] and Hsu et al. [61]. On the superlinear convergence of Howard’s type algorithms, see Bokanowski, Maroso and Zidani [22], and Santos and Rust [109]. For further reading we refer to the books by Bersekas [19], Altman [5] (especially about expectation constraints), Puterman [91], and for continuous state spaces to Hernández-Lerma, and Lasserre [56, 57]. The link with discretization of continuous time processes is discussed in Kushner and Dupuis [67]. On the modified policy iteration algorithms for discounted Markov decision problems, see Puterman and Shin [92].
Chapter 8
Algorithms
Summary In the case of convex, dynamical stochastic optimization problems, the Bellman functions, being convex, can be approximated as finite suprema of affine functions. Starting with static and deterministic problems, it is shown how this leads to the effective stochastic dual dynamic programming algorithm. The second part of the chapter is devoted to the promising approach of linear decision rules, which allows one to obtain upper and lower bounds of the value functions of stochastic optimization problems.
8.1 Stochastic Dual Dynamic Programming (SDDP) In this section we will study the case of convex dynamic problems, whose convex Bellman values can be approximated by a collection of affine minorants. We start with the static case.
8.1.1 Static Case: Kelley’s Algorithm Consider the problem Min f (x), x∈X
(8.1)
where X is a convex, compact subset of Rn and f : Rn → R is convex. Given sequences x k in X and y k ∈ ∂ f (x k ), k ∈ N, we define the sequence of functions ϕk : Rn → R by (8.2) ϕk (x) := max f (x i ) + y i · (x − x i ) . 0≤i≤k
In view of the definition of a subgradient, we have that ϕk (x) ≤ f (x), for all x ∈ Rn . © Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_8
(8.3) 267
268
8 Algorithms
Setting ak := f (x k ) − y k · x k we note that ϕk (x) := max ai + y i · x . 0≤i≤k
(8.4)
So, computing ϕk (x) can be done by storing only (k + 1) vectors of Rn+1 , instead of 2(k + 1) vectors of Rn , as would suggest the definition of ϕk . Lemma 8.1 We have that f (x k ) − ϕk (x k ) → 0, and if x¯ is a limit-point of x k , then ¯ → f (x). ¯ ϕk (x) Proof (a) Let L be a Lipschitz constant of f over a bounded neighbourhood of X . ¯ Then ∂ f (x) ⊂ B(0, L), for all x ∈ X . Being a maximum of Lipschitz functions with constant L, ϕk is itself Lipschitz with constant L. Let x¯ be a limit-point of x k . For ¯ < ε. Since ϕk and f are Lipschitz with any ε > 0 there exists a kε such that |x kε − x| constant L, we get ϕk (x) ¯ ≥ f (x kε ) + y kε · (x¯ − x kε ) ≥ f (x kε ) − L|x kε − x| ¯ ≥ f (x) ¯ − 2Lε, for all k > kε .
(8.5) Since ϕk (x) ¯ ≤ f (x), ¯ it follows that ϕk (x) ¯ → f (x). ¯ (b) If for a subsequence f (x ki ) − ϕki (x ki ) → 0, since ϕk is Lipschitz with constant L not depending on k, it converges uniformly and for some limit-point x¯ of x ki : ¯ − ϕki (x)), ¯ 0 < lim( f (x ki ) − ϕki (x ki )) = lim( f (x) i
i
which gives a contradiction with (a). The conclusion follows.
(8.6)
The cutting plane (or Kelley) algorithm is as follows: Algorithm 8.2 (Cutting plane) 1. Data: x 0 ∈ X , ε ≥ 0. Set k := 0. 2. Compute x k+1 ∈ X such that ϕk (x k+1 ) ≤ ϕk (x), for all x ∈ X.
(8.7)
3. If f (x k+1 ) − ϕk (x k+1 ) ≤ ε, return xˆ := x k+1 . Otherwise, set k := k + 1 and go to step 2. Note that computing x k+1 , when X is a polyhedron, means solving a linear program. By (8.3), we have that ϕk (x k+1 ) = min ϕk (x) ≤ min f (x) ≤ f (x k+1 ). x∈X
x∈X
(8.8)
Set εk+1 := f (x k+1 ) − ϕk (x k+1 ). It follows that for k ≥ 1, x k is an εk -solution of (8.1), in the sense that
8.1 Stochastic Dual Dynamic Programming (SDDP)
269
f (x k+1 ) − min f (x) ≤ εk . x∈X
(8.9)
In particular, when the algorithm stops the return point xˆ is an ε solution. Lemma 8.3 If ε > 0, the algorithm stops after finitely many iterations. If ε = 0, either it stops after finitely many iterations, or ϕk (x k+1 ) → min X f , and any limitpoint of x k is a solution of (8.1). Proof It suffices to study the case when ε = 0 and the algorithm does not stop. Let x¯ be a limit-point of x k . For the associated subsequence x ki , since ϕk is Lipschitz and nondecreasing as a function of k, the value of its minimum converges. We conclude using Lemma 8.1, (8.8) and ¯ = f (x). ¯ min f (x) ≥ lim ϕk (x k+1 ) = lim ϕki −1 (x ki ) = lim ϕki −1 (x) x∈X
k
i
i
(8.10)
8.1.2 Deterministic Dual Dynamic Programming 8.1.2.1
Principle
Consider now the problem (P) of minimizing J (u, y) :=
N −1
t (u t , yt ) + N (y N ),
(8.11)
t=0
subject to the state equation and control constraints yt+1 = At yt + Bt u t , t = 0, . . . , N − 1; y0 = y 0 ,
(8.12)
u t ∈ Ut , t = 0, . . . , N − 1.
(8.13)
Here At and Bt are matrices of size n × n and n × m resp., (yt , u t ) ∈ Rn × Rm , the initial condition y 0 is given, the Ut are convex and compact subsets of Rm , and the functions t for 0 ≤ t ≤ N − 1, and N , are convex and Lipschitz. We say that (u, y) is a feasible trajectory if it satisfies (8.12)–(8.13). We denote by y[u] the state associated with control u and denote the reduced cost by F(u) := J (u, y[u]). The Bellman values are such that, for τ = 0 to N − 1 and x ∈ Rn :
(8.14)
270
8 Algorithms
v N = N ; vτ (x) :=
min
N −1
u τ ,...,u N −1
t (u t , yt ) + N (y N ); yτ = x ,
(8.15)
t=τ
where the minimization is over the control variables satisfying the control constraints (8.13). Then the following dynamic programming principle holds: vτ (x) = min (τ (u, x) + vτ +1 (Aτ x + Bτ u)) . u∈Uτ
(8.16)
Since the data are Lipschitz, so are the Bellman values vτ , with constant, say, L. The algorithm is as follows. At iteration k, we have a convex minorant ϕtk of vt , which is therefore necessarily Lipschitz with constant at most L, and a nondecreasing function of k. The initialization with k = 0 is usually done by taking ϕt0 equal to a large negative number. We first perform the forward step: this means computing a feasible trajectory (u k , y k ) such that u kt is a solution of the approximate dynamic k : programming strategy, where vt+1 is replaced by ϕt+1 k u kt ∈ argmin t (u, ytk ) + ϕt+1 (At ytk + Bt u) , t = 0, . . . , N − 1.
(8.17)
u∈Ut
This step is forward in the sense that we first compute u k0 , then u k1 , etc. We then see how to perform the backward step, which consists in computing an improved minorant of vt , i.e., ϕtk+1 such that ϕtk ≤ ϕtk+1 ≤ vt .
(8.18)
ϕ0k (y 0 ) ≤ v0 (y 0 ) ≤ F(u k ).
(8.19)
Let us note that So we have that u k is an εk -solution with εk := F(u k ) − ϕ0k (y 0 ). 8.1.2.2
(8.20)
Backward Step and Convergence
We will improve the minorant of the Bellman function by applying the subdifferential calculus rule in Lemma 1.120 to the forward step (8.17), combined with Theorem 1.117. Since t and ϕt+1 are continuous, the latter in view of (8.17), for t = 0 to N − 1, there exists k k k k k , r yt ) ∈ ∂t (u kt , ytk ); h kt ∈ NUt (u kt ); qt+1 ∈ ∂ϕt+1 (yt+1 ), rtk = (rut
such that
k k + Bt qt+1 + h kt = 0, t = 0, . . . , N − 1. rut
(8.21)
(8.22)
8.1 Stochastic Dual Dynamic Programming (SDDP)
271
Relation (8.21) means that for all u ∈ Ut , y and y in Rn : ⎧ k k k k k k ⎪ ⎨ t (u, y) ≥ t (u t , yt ) + rut · (u − u t ) + r yt · (y − yt ), k k k k k (y ) ≥ ϕt+1 (yt+1 ) + qt+1 · (y − yt+1 ), ϕt+1 ⎪ ⎩ k k 0 ≥ h t · (u − u t ).
(8.23)
Summing these relation when y = At y + Bt u, so that k k k k k k · (y − yt+1 ) = (A qt+1 t qt+1 ) · (y − yt ) + (Bt qt+1 ) · (u − u t ),
(8.24)
and using (8.22), we obtain that k (A y + B u) ≥ (u k , y k ) + ϕ k (y k ) + r k + A q k k t (u, y) + ϕt+1 t t t t t yt t t+1 · (y − yt ). t+1 t+1
(8.25) Minimizing the l.h.s. over u ∈ Ut we obtain an affine minorant of the value function vt . Therefore, the above r.h.s. is itself an affine minorant of the value function vt . So, we can update ϕtk as follows: k k k k · (y − ytk ) . (yt+1 ) + r yt + A qt+1 ϕtk+1 (y) := max ϕtk (y), t (u kt , ytk ) + ϕt+1 (8.26) We also update ϕ Nk as follows: ϕ Nk+1 (y) := max ϕ Nk (y), N (y Nk ) + r Nk · (y − y Nk ) , where r Nk ∈ ∂ N (y Nk ). (8.27) The updates of the ϕtk can be performed in parallel or in any order, and is anyway very fast. We see that the costly step of the algorithm is the forward one. Since ϕtk is nondecreasing and upper bounded by vt , it has a limit denoted by ϕ¯t . Lemma 8.4 We have that ϕ0k (y 0 ) → v0 (y 0 ). More generally, vt (ytk+1 ) − ϕtk (ytk+1 ) → 0, for t = 0 to N ,
(8.28)
and any limit-point of u k is a solution of (P). Proof (a) We claim, using a backward induction argument, that (8.28) holds. For t = N this follows from Lemma 8.1. Let it hold for t + 1, with 0 ≤ t ≤ N − 1. It suffices to check the result for a subsequence ki such that y ki +1 is convergent. Since the data are Lipschitz and the minorants ϕk are Lipschitz with constant L, k k + A ctk := r yt t qt+1 is bounded. Given ε > 0, for large enough i, by the induction hypothesis, since ϕtk is nondecreasing w.r.t. k, for j > i, we have using (8.26) that
272
8 Algorithms
ϕt j (yt j ) ≥ ϕtki +1 (yt j ) k
k
k
k
ki ≥ t (u kt i , ytki ) + ϕt+1 (Ay ki + Bu ki ) + ctk · (yt j − ytki )) k
≥ t (u kt i , ytki ) + vt+1 (Ay ki + Bu ki ) − ε − |ctk | |yt j − ytki | ≥ vt (ytki ) − ε − k
k |ctk | |yt j
(8.29)
− ytki | k
≥ vt (yt j ) − ε − (L + |ctk | ) |yt j − ytki |. k k Since |ctk | is bounded, this implies lim inf j ϕt j (yt j ) − vt (ytki ) ≥ 0. Since ϕtk is a minorant of vt , the claim follows. (b) We must prove that any limit-point of u k is a solution of (P). Indeed we have that for all u t ∈ Ut , in view of step (a): k k t (u kt , ytk ) + ϕt+1 (At ytk + Bt u kt ) ≤ t (u t , ytk ) + ϕt+1 (At ytk + Bt u t ) k ≤ t (u t , yt ) + vt+1 (At ytk + Bt u t ) + o(1). (8.30) Making k ↑ ∞ we get that
t (u¯ t , y¯t ) + ϕ¯t+1 ( y¯t+1 ) ≤ t (u t , y¯t ) + vt+1 (At y¯t + Bt u t ).
(8.31)
By point (a), ϕ¯t+1 ( y¯t+1 ) = vt+1 ( y¯t+1 ). Minimizing the r.h.s. over u t ∈ Ut we get that in view of the dynamic programming principle t (u¯ t , y¯t ) + vt+1 ( y¯t+1 ) ≤ vt ( y¯t ). So, u¯ satisfies the DPP and is therefore optimal.
(8.32)
8.1.3 Stochastic Case 8.1.3.1
Principle
For the sake of simplicity we assume that Ω = Ω0N +1 , and that ω = (ω0 , . . . , ω N ) with all components independent, of the same law. Additionally Ω0 = {1, . . . , M} and the event i has probability pi . We say that a random variable is Ft -measurable if it depends on (ω0 , . . . , ωt−1 ). We consider adapted policies: u t (and therefore also yt ) is Ft -measurable, for t = 0 to N − 1. We denote by y[u] the state associated with control u, the adapted solution of yt+1 = At yt + Bt u t + et (ωt ), t = 0, . . . , N − 1. The cost function is, given (u, y) adapted and a.s. bounded
(8.33)
8.1 Stochastic Dual Dynamic Programming (SDDP)
J (u, y) := E
N −1
273
t (u t , yt , ωt ) + N (y N , ω N ) .
(8.34)
t=0
We assume that the functions entering into the cost are Lipschitz and convex w.r.t. (u, y). Denote the reduced cost by F(u) := J (u, y[u]).
(8.35)
The problem is to minimize the reduced cost satisfying the control constraints: Min F(u); u adapted, u t ∈ Ut a.s., 0 ≤ t ≤ N − 1. u
(8.36)
The Bellman values are, for τ = 0, . . . , N − 1 and x ∈ Rn , solutions of: v N = E N ; vτ (x) :=
min
u τ ,...,u N −1
E
N −1
t (u t , yt , ωt ) + N (y N , ω N ) | yτ = x ,
t=τ
(8.37) where the minimization is over the feasible adapted policies (feasible in the sense that they satisfy the above control constraints). The dynamic principle reads vt (y) = min Et (t (u, yt , ωt ) + vt+1 (At y + Bt u + et (ωt ))) , u∈Ut
(8.38)
or equivalently writing i = ωt , ei := e(i): vt (y) = min u∈Ut
M
pi (t (u t , yt , i) + vt+1 (At y + Bt u + ei )) .
(8.39)
i=1
The SDDP algorithm will compute a nondecreasing sequence of minorants ϕt of vt . We can then compute a trajectory based on the approximate dynamic principle, i.e., (u k , y k ) such that for a given realization of ω: u kt ∈ argmin u∈Ut
M
k pi t (u, ytk ) + ϕt+1 (At ytk + Bt u + ei ) , t = 0, . . . , N − 1,
i=1
(8.40) k according to (8.33). Assuming that in the case of multiple and then compute yt+1 minima we choose one of them following a rule such as choosing the solution of minimum norm, this determines an adapted policy. Computing trajectories when choosing i with probability pi , this procedure then appears as a Monte Carlo type computation for estimating the reduced cost F(u k ) associated with the adapted policy u k . We have that (8.41) ϕ0k (y 0 ) ≤ v0 (y 0 ) ≤ F(u k ).
274
8 Algorithms
So, provided we have a statistical procedure implying that for some ε > 0 and ak ∈ R F(u k ) ≤ ak with probability 1 − ε,
(8.42)
F(u k ) − v0 (y 0 ) ≤ a k − ϕ0k (y 0 ) with probability 1 − ε.
(8.43)
we deduce the estimate
8.1.3.2
Backward Step and Convergence
We next provide an extension of the backward step of the deterministic case. We apply the subdifferential calculus rule in Lemma 1.120 to the forward step (8.40). Since t and ϕt+1 are continuous functions of (u, y) and y resp., setting k := At y k + Bt u k + ei , yi,t+1
(8.44)
there exists for t = 0 to N − 1: k h t ∈ NUt (u kt ); k k k k , rikyt ) ∈ ∂t (u kt , ytk , i); qi,t+1 ∈ ∂ϕt+1 (yi,t+1 ), i = 1, . . . , M, ritk = (riut (8.45) such that M k k = 0, t = 0, . . . , N − 1. (8.46) pi riut + Bt qi,t+1 h kt + i=1
Relations (8.45) means that for all u ∈ Ut , y and y in Rn : ⎧ k k k k k k ⎪ ⎨ t (u, y, i) ≥ t (u t , yt , i) + riut · (u − u t ) + ri yt · (y − yt ), k k k k k ϕt+1 (y ) ≥ ϕt+1 (yi,t+1 ) + qi,t+1 · (y − yi,t+1 ), ⎪ ⎩ k k 0 ≥ h t · (u − u t ).
(8.47)
Summing these relation (with weights pi for the two first) when y = At y + Bt u + ei and using (8.46) we obtain that M
k pi t (u, y, i) + ϕt+1 (At y + Bt u + ei ) ≥ atk + btk · (y − ytk ),
(8.48)
i=1
where
⎧ M k k ⎨ atk := i=1 pi t (u kt , ytk , i) + ϕt+1 (yi,t+1 ) , M k ⎩ btk := i=1 . pi rikyt + A qi,t+1
(8.49)
8.1 Stochastic Dual Dynamic Programming (SDDP)
275
Minimizing the l.h.s. of (8.48) over u ∈ Ut , we see that the above r.h.s. gives an affine minorant of the value function vt , so that we can update ϕtk as follows: ϕtk+1 (y) := max ϕtk (y), atk + btk · (y − ytk ) .
(8.50)
We also update ϕ Nk as follows:
ϕ Nk+1 (y) := max ϕ Nk (y), N (y Nk ) +
M
k pi ri N · (y − y Nk ) , rikN ∈ ∂ N (y Nk , i).
i=1
(8.51) Note that we can perform the update of the ϕtk in parallel or in any order. Lemma 8.5 We have that ϕ0k (y 0 ) → v0 (y 0 ). More generally, vt (ytk+1 ) − ϕtk (ytk+1 ) converges to 0, for t = 0 to N . Proof We show by backward induction that vt (ytk+1 ) − ϕtk (ytk+1 ) → 0, for t = 0 to N . For t = N this follows from Lemma 8.1. Let it hold for t + 1, with 0 ≤ t ≤ N − 1. Let k j be a subsequence such that u k j +1 → u¯ (in the space of adapted strategies). Let k := k j , k := k j+1 , u := u k , etc. Then given ε > 0, for large enough j, by the induction hypothesis ϕtk+1 (ytk ) = ≥ ≥
M
k k k k k i=1 pi t (u t , yt , i) + ϕt+1 (Ay + Bu + ei ) M k k k k i=1 pi t (u t , yt , i) + vt+1 (Ay + Bu + ei ) vt (ytk ) − ε.
−ε
(8.52)
The conclusion follows.
For a discussion of the SDDP approach we refer to the notes at the end of this chapter.
8.2 Introduction to Linear Decision Rules 8.2.1 About the Frobenius Norm We recall that the Frobenius scalar product between two matrices A, B of same size is Ai, j Bi, j = trace(AB ). (8.53) A, B F = i, j
Note that, if A, B, C are matrices such that AB and C have the same dimension, then we have the “transposition rule” AB, C F = trace(ABC ) = trace(A(C B ) ) = A, C B F .
(8.54)
276
8 Algorithms
8.2.2 Setting Let (Ω, F , P) be a probability space. Consider the problem Min
x∈L 2 (Ω)n
E c(ω) · x(ω);
Ax(ω) ≤ b(ω) a.s.
(8.55)
Here A, a p × n matrix, b(ω) ∈ L 2 (Ω) p and c(ω) ∈ L 2 (Ω)n are given. We assume that the probability has support over the closed set Ω ⊂ Rn ω and that for some matrices B and C of appropriate dimension: c(ω) = Cω; b(ω) = Bω; E|ω|2 < ∞.
(8.56)
In addition we decide to take a linear decision rule, i.e. for some X ∈ Rn×n ω : x(ω) = X ω a.s. on Ω.
(8.57)
Remark 8.6 We may assume that ω1 = 1 a.s. on Ω,
(8.58)
so that these linear decision rules are in fact affine decision rules on ω2 , . . . , ωn ω . Denoting by (AX − B)i the ith row of AX − B, the resulting problem reads: Min E(Cω) · (X ω); (AX − B)i ω ≤ 0 a.s., i = 1, . . . , p. X
(8.59)
Denoting the second moment of ω by M := Eωω , we get by (8.54) that E(Cω) · (X ω) = Eω C X ω = E ωω , C X F = Eωω , X C F = trace (MC X ),
(8.60)
so that (8.59) can be reformulated as Min trace (MC X ); (AX − B)i ∈ Ω − , i = 1, . . . , p. X
(8.61)
This is a linear problem in X , which might be tractable if Ω − has a nice structure.
8.2.3 Linear Programming Reformulation Let (z, h) ∈ Rn z × Rn h . Assume that, for some matrices W , Z of appropriate size: Ω = {ω ∈ Rn ω ; W ω + Z z ≥ h}.
(8.62)
8.2 Introduction to Linear Decision Rules
277
Let y ∈ Rn ω . That y ∈ Ω − means that v(y) ≥ 0, where v(y) := inf {−y · ω; W ω + Z z ≥ h}. ω,z
(8.63)
So, v(y) is the value of a feasible linear program (we assume of course that Ω is nonempty) whose Lagrangian function is − y · ω + λ · (h − W ω − Z z) = −(y + W λ) · ω − (Z λ) · z + λ · h.
(8.64)
Therefore, the dual problem has the same value as the primal one, i.e., v(y) = sup{λ · h; y + W λ = 0; Z λ = 0}. λ≥0
(8.65)
In addition, both the primal and dual problem have solutions if v(y) is finite. So, v(y) ≥ 0 iff λ · h ≥ 0, for some λ satisfying the constraints in (8.65), which may be expressed in the form λ W + y = 0; λ Z = 0; λ ≥ 0.
(8.66)
Taking for y the rows of AX − B, and denoting by Λ the matrix whose rows are the transpose of the corresponding λ, we obtain an equivalent linear programming reformulation of problem (8.62): Min trace (MC X ); X,Λ
AX + ΛW = B; ΛZ = 0; Λh ≥ 0; Λ ≥ 0. (8.67)
So, we have proved that Lemma 8.7 Let Ω be of the form (8.62). Then the value of the linear programming problem (8.67) is an upper bound of the value of the original problem (8.55).
8.2.4 Linear Conic Reformulation We next generalize the previous analysis by considering the setting of linear conical optimization, see Chap. 1, Sect. 1.3.2. Assume that for some z ∈ Rn z , h ∈ Rn h , W and Z matrices of appropriate dimensions, and some (finite-dimensional) closed convex cone K : (8.68) Ω = {ω ∈ Rn ω ; W ω + Z z − h ∈ K }. That y ∈ Ω − means that v(y) ≥ 0, where v(y) := inf {−y · ω; W ω + Z z − h ∈ K }. ω,z
(8.69)
278
8 Algorithms
Remember that the infimum is not necessarily attained, even if the value is finite. Assume that the above problem is qualified, i.e., for some ε > 0: ε BY ⊂ K + h + Im(W ) + Im(Z ).
(8.70)
Expressing the dual using K + rather than K − , by Corollary 1.144, we have that either v(y) = −∞, or v(y) = max+ {λ · h; y + W λ = 0; Z λ = 0}. λ∈K
(8.71)
So, v(y) ≥ 0 iff λ · h ≥ 0, for some dual feasible λ. The dual constraints may be expressed in the form λ W + y = 0; λ Z = 0; λ ∈ K + .
(8.72)
Taking for y the rows of AX − B, and denoting by Λ the matrix whose rows are the transpose of the corresponding λ, we obtain an equivalent conic reformulation of (8.62), where Λi denotes the ith row of the matrix Λ: Min trace (MC X ); AX + ΛW = B; ΛZ = 0; X,Λ
Λh ≥ 0; Λi ∈ K + , i = 1, . . . , n ω .
(8.73)
We have proved that Lemma 8.8 Let Ω be of the form (8.68), and satisfy the qualification condition (8.70). Then the value of (8.73) is an upper bound of the value of the original problem (8.55).
8.2.5 Dual Bounds in a Conic Setting 8.2.5.1
Derivation of the Dual Bound
We are now looking for lower bounds of the value of the original stochastic optimization problem (8.55), when Ω is of the form (8.68). Denoting by v P the value of (8.68), which we may express as vP =
inf
p
sup
x∈L 2 (Ω)n ,s∈L 2 (Ω)+ y∈L 2 (Ω) p
E (c(ω) · x(ω) + y(ω) · (Ax(ω) + s(ω) − b(ω))) ,
(8.74) we get a lower bound by restricting, in the above expression, y to some subspace say Y of L 2 (Ω) p : so v P ≥ vY , where
8.2 Introduction to Linear Decision Rules
vY :=
inf
279
sup E (c(ω) · x(ω) + y(ω) · (Ax(ω) + s(ω) − b(ω))) .
p
x∈L 2 (Ω)n ,s∈L 2 (Ω)+ y∈Y
(8.75) Note that vY :=
inf
p
x∈L 2 (Ω)n ,s∈L 2 (Ω)+
Ec(ω) · x(ω);
Ax(ω) + s(ω) − b(ω) ∈ Y ⊥ .
(8.76)
Consider the particular case of a linear multiplier rule: Y = {y ∈ L 2 (Ω) p ; y(ω) = Y ω for some matrix Y }.
(8.77)
Then vY =
inf
p x∈L 2 (Ω)n ,s∈L 2 (Ω)+
sup E c(ω) · x(ω) + ω Y (Ax(ω) + s(ω) − b(ω)) . Y
(8.78) Set e(ω) := Ax(ω) + s(ω) − b(ω). Then by the transposition rule (8.54): Eω Y e(ω) = Ee(ω) Y ω = E Y, e(ω)ω F = Y, Ee(ω)ω F ,
(8.79)
and therefore e(·) ∈ Y ⊥ iff Ee(ω)ω = 0. It follows that vY =
inf
p
x∈L 2 (Ω)n ,s∈L 2 (Ω)+
Ec(ω) · x(ω); E(Ax(ω) + s(ω) − b(ω))ω = 0. (8.80)
We next discuss the second-order moment of ω. Lemma 8.9 We have that M = Eωω is full rank iff Ω spans Rn ω . Proof Since M is symmetric and semidefinite, it is not of full rank iff there exists some nonzero g ∈ Rn ω so that 0 = g Mg = Eg ωω g = E(ω g)2 .
(8.81)
So, M is not of full rank iff ω lies in the orthogonal of some nonzero vector g ∈ Rn ω . The conclusion follows. In the sequel we assume that M = Eωω is full rank.
(8.82)
So, the matrices X , S, B are uniquely defined by the relations below: X M = Ex(ω)ω ; S M = Es(ω)ω ;
B M = Eb(ω)ω .
(8.83)
280
8 Algorithms
On the other hand, given any n × n ω matrix X , we have that x(ω) = X ω is such that the above first relation holds. Assume in the sequel that c(ω) = Cω. Using (8.54) and the symmetry of M, we get that Ec(ω) · x(ω) = Ex(ω) Cω = C, Ex(ω)ω = C, X M = M X , C = M, C X = M, X C = trace(MC X ). (8.84) So, we can express vY in terms of X rather than x. It follows that vY =
inf
p
X,S,s∈L 2 (Ω)+
trace(MC X ); S M = Es(ω)ω ; (AX + S − B)M = 0. (8.85)
Since M is invertible, we deduce that vY =
inf
p
X,S,s∈L 2 (Ω)+
trace(MC X ); AX + S = B; S M = Es(ω)ω .
(8.86)
The above problem is still not tractable, but we will see that it has the following tractable relaxation: vY1 = inf trace(MC X ); AX + S = B; (W − he1 )M S + Z Γ ∈ K p . X,S,Γ
(8.87) The last inclusion relation means that each column of (W − he1 )M S + Z Γ belongs to K . We need to assume that
There exists a measurable mapping Ω → Rn z , ω → z(ω) such that W ω + Z z(ω) ≥ h and for some c > 0 : |z(ω)| ≤ c(1 + |ω|) a.s.
(8.88)
Lemma 8.10 Let (8.58) and (8.88) hold. Then vY1 ≤ vY ≤ v P . Proof The second inequality follows from the previous arguments. We next prove the first one. Since vY1 and vY have the same cost function, it suffices to check that if (X, S, s) satisfies the constraints in (8.86), then (X, S, Γ ) satisfies the constraints p in (8.87), for some Γ . Indeed, let s ∈ L 2 (Ω)+ and S be such that S M = Es(ω)ω . Since, by (8.58), any element of Ω has a first component equal to 1: (W − he1 )M S = (W − he1 )Eωs(ω) = E(W ω − h)s(ω) .
(8.89)
Set Γ := Ez(ω)s(ω) ; note that this expectation is finite since E|ω|2 < ∞, in view of (8.88). By the above display, (W − he1 )M S + Z Γ = E(W ω − h + Z z(ω))s(ω) .
(8.90)
The jth column of the r.h.s. matrix is E(W ω − h + Z z(ω))s j (ω). Since W ω − h + Z z(ω) ∈ K a.e., and K is a closed convex cone, it belongs to K . The conclusion follows.
8.2 Introduction to Linear Decision Rules
281
Remark 8.11 (i) The derivation of this dual bound did not assume any qualification condition. (ii) For a refined analysis of the lower bound, in the case when K is the set of nonnegative vectors, see [66].
8.3 Notes Kelley’s [63] algorithm 8.1 for minimizing a convex function over a set X essentially requires us to solve a linear programming problem at each step, if X is a polyhedron. Various improvements, involving the quadratic penalization of the displacement and therefore the resolution of convex quadratic programs, are described in Bonnans et al. [24]. The SDDP algorithm, due to Pereira and Pinto [86], can be seen as an extension of the Benders decomposition [18]. Shapiro [113] analyzed the convergence of such an algorithm for problems with potentially infinitely many scenarios, and considered the case of a risk averse formulation, based on the conditional value at risk. See also Girardeau et al. [52]. In the case of a random noise process with memory, a possibility is to approximate it by a Markov chain, obtained by a quantization method, and to apply the SDDP approach to the resulting dynamic programming formulation. This applies more generally when the value functions are convex w.r.t. some variables only, see Bonnans et al. [23]. The SDDP approach can also provide useful bounds in the case of problems with integer constraints, see Zou, Ahmed and Sun [128]. In the presentation of linear decision rules we follow Georghiou et al. [51, 66]. The primal upper bound (8.73) can be computed by efficient algorithms when K is the product of polyhedral cones, second-order cones, and cones of semidefinite symmetric matrices. See e.g. Nesterov and Nemirovski [85]. For other aspects of linear decision rules, in connection with robust optimization (for which a reference book is [16]), see Ben-Tal et al. [14].
Chapter 9
Generalized Convexity and Transportation Theory
Summary This chapter first presents the generalization of convexity theory when replacing duality products with general coupling functions on arbitrary sets. The notions of Fenchel conjugates, cyclical monotonicity and duality of optimization problems, have a natural extension to this setting, in which the augmented Lagrangian approach has a natural interpretation. Convex functions over measure spaces, constructed as Fenchel conjugates of integral functions of continuous functions, are shown to be sometimes equal to some integral of a function of their density. This is used in the presentation of optimal transportation theory over compact sets, and the associated penalized problems. The chapter ends with a discussion of the multi-transport setting.
9.1 Generalized Convexity 9.1.1 Generalized Fenchel Conjugates Let X and Y be arbitrary sets and κ : X × Y → R, called a coupling between X and Y (and then X , Y are called in this context coupled spaces). The κ-Fenchel conjugate of ϕ : X → R is ϕ κ : Y → R, defined by ϕ κ (y) := sup (κ(x, y) − ϕ(x)) .
(9.1)
x∈X
We have the κ-Fenchel–Young inequality ϕ κ (y) ≥ κ(x, y) − ϕ(x), for all x ∈ X and y ∈ Y.
(9.2)
If ϕ has a finite value at x ∈ X , we define the κ-subdifferential of ϕ at x ∈ X as ∂κ ϕ(x) := {y ∈ Y ; ϕ(x) + ϕ κ (y) = κ(x, y)}. © Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_9
(9.3) 283
284
9 Generalized Convexity and Transportation Theory
So y ∈ ∂κ ϕ(x) iff equality holds in the κ-Fenchel–Young inequality. Recalling the definition of ϕ κ , we see that y ∈ ∂κ ϕ(x) iff ϕ(x) is finite and the following κsubdifferential inequality holds: ϕ(x ) ≥ ϕ(x) + κ(x , y) − κ(x, y), for all x ∈ X.
(9.4)
We call a κ-minorant of ϕ any function over X of the form x → κ(x, y) − β, for some (y, β) ∈ Y × R, that is a minorant of ϕ, i.e., such that β ≥ κ(x, y) − ϕ(x), for all x ∈ X.
(9.5)
Clearly, this holds iff β ≥ ϕ κ (y). In other words, for any given y ∈ Y , if ϕ κ (y) is finite, then x → κ(x, y) − ϕ κ (y) is the ‘best’ κ-minorant of the form κ(x, y) − β, for some β ∈ R. If ϕ κ (y) = ∞, there is no such minorant. Finally, ϕ κ (y) = −∞ means that ϕ(x) = ∞, for any x ∈ X . Since X and Y play symmetric roles we have similar notions for ψ : Y → R. For instance, the κ-Fenchel conjugate of ψ is the function ψ κ : X → R defined by ψ κ (x) := sup (κ(x, y) − ψ(y)) .
(9.6)
y∈Y
We can define the κ-biconjugate of ϕ : X → R as the conjugate of its conjugate, i.e. the function X → R defined by ϕ κκ (x) := sup (κ(x, y) − ϕ κ (y)) .
(9.7)
y∈Y
In view of (9.5), the κ-biconjugate is the supremum of κ-minorants, and therefore is itself a minorant of ϕ, that is, ϕ κκ (x) ≤ ϕ(x), for all x ∈ X.
(9.8)
We say that ϕ : X → R is κ-convex if it is the κ-conjugate of some function ψ : Y → R. The following holds: Lemma 9.1 (i) The biconjugate of ϕ is the greatest κ-convex function dominated by ϕ. That is, if f : X → R is κ-convex and f (x) ≤ ϕ(x) for all x ∈ X , then f (x) ≤ ϕ κκ (x) for all x ∈ X . (ii) A function is κ-convex iff it is equal to its biconjugate. (iii) A supremum of κ-convex functions is κ-convex. Proof (i) Let f be a κ-convex minorant of ϕ. Then f = ψ κ for some ψ : Y → R, and then ϕ(x) + ψ(y) ≥ f (x) + ψ(y) ≥ κ(x, y), (9.9)
9.1 Generalized Convexity
so that
285
ψ(y) ≥ sup(κ(x, y) − ϕ(x)) = ϕ κ (y)
(9.10)
x∈X
and therefore (since the κ-Fenchel conjugate is obviously decreasing) ϕ κκ (x) ≥ f (x). (ii) Direct consequence of (i). ¯ be κ-convex for i ∈ I , and set ϕ(x) := supi∈I ϕi (x). Since (iii) Let the ϕi : X → R each ϕi is equal to its biconjugate, we have that ϕ(x) = sup sup(κ(x, y) − ϕiκ (y)) = sup(κ(x, y) − inf ϕiκ (y)), i∈I y∈Y
y∈Y
i∈I
which shows that ϕ is a κ-conjugate, and is therefore κ-convex.
(9.11)
Remark 9.2 If the subdifferential of ϕ at x¯ ∈ X contains y¯ , then ϕ κκ (x) ¯ ≥ κ(x, ¯ y¯ ) − ϕ κ ( y¯ ) = ϕ(x). ¯
(9.12)
ϕ κκ (x) ¯ = ϕ(x). ¯
(9.13)
In other words, ¯ = ∅ ∂κ ϕ(x)
⇒
Lemma 9.3 Let ϕ : X → R. Then (i) ϕ and its biconjugate have the same conjugate. ¯ = ∂κ ϕ κκ (x). ¯ (ii) Let ϕ be equal to ϕ κκ at x¯ ∈ X . Then ∂κ ϕ(x) Proof (i) Since the biconjugate is the supremum of κ minorants, a function and its biconjugate have the same κ-minorants, and hence, the same κ-conjugate. (ii) The κ-subdifferential of the biconjugate of ϕ at any x ∈ X satisfies, in view of (i): (9.14) ∂κ ϕ κκ (x) := {y ∈ Y ; ϕ κκ (x) + ϕ κ (y) = κ(x, y)}. So when ϕ and its biconjugate have the same value at some point they also have the same κ-subdifferential. Remark 9.4 When X is a Banach space, Y is its dual, and κ(x, y) = y, x is the usual duality product, we will speak of usual convexity, and then we recover the usual Fenchel transform. Note, however, the difference in the definition of convex functions.
9.1.2 Cyclical Monotonicity We say that the set Γ ⊂ X × Y is κ-cyclically monotone if for any positive N ∈ N and finite sequence (x1 , y1 ), . . . , (x N , y N ) in Γ , setting x N +1 := x1 , the following holds:
286
9 Generalized Convexity and Transportation Theory N
κ(xi , yi ) ≥
i=1
N
κ(xi+1 , yi ).
(9.15)
i=1
Lemma 9.5 We have that Γ is κ cyclically monotone iff there exists a κ-convex function ϕ over X , such that y ∈ ∂κ ϕ(x), for all (x, y) ∈ Γ.
(9.16)
Proof (i) If some κ-convex function ϕ over X satisfies (9.16) then, by the κ-Fenchel– Young inequality: κ(xi , yi ) = ϕ(xi ) + ϕ κ (yi ), (9.17) −κ(xi+1 , yi ) ≥ −ϕ(xi+1 ) − ϕ κ (yi ). Summing these inequalities for i = 1 to N , we get (9.15). (ii) Conversely, let (9.15) hold. Fix (x1 , y1 ) ∈ Γ and, for x ∈ X , set ϕ(x) := sup
N
(κ(xi+1 , yi ) − κ(xi , yi )).
(9.18)
i=1
The supremum is w.r.t. to all nonzero N ∈ N, and to all (xi , yi ) in Γ , i = 2 to N , with x N +1 equal to x. Then ϕ is κ-convex since we may express it as ϕ(x) := sup
y N ∈Y
κ(x, y N ) + sup −κ(x N , y N ) +
N −1
(κ(xi+1 , yi ) − κ(xi , yi ))
,
i=1
(9.19) the second supremum being w.r.t. to all nonzero N ∈ N, and to all (xi , yi ) in Γ , i = 2 to N , with y N = y given. Observe that ϕ(x) > −∞ for all x ∈ X . By cyclical monotonicity, ϕ(x1 ) ≤ 0. Taking N = 2 and (x2 , y2 ) = (x1 , y1 ), we obtain the converse inequality; it follows that ϕ(x1 ) = 0. ¯ Next, let (x, ¯ y¯ ) ∈ Γ . We must prove that ϕ(x) ¯ is finite, and that y¯ ∈ ∂κ ϕ(x). ¯ y¯ ) and x N +2 := x, we get that, by the definition of ϕ: Setting (x N +1 , y N +1 ) := (x, ϕ(x) ≥
N +1
N
i=1
i=1
(κ(xi+1 , yi ) − κ(xi , yi )) = κ(x, y¯ ) − κ(x, ¯ y¯ ) +
(κ(xi+1 , yi ) − κ(xi , yi )).
(9.20) Maximizing over the last sum it follows that ϕ(x) ≥ κ(x, y¯ ) − κ(x, ¯ y¯ ) + ϕ(x). ¯
(9.21)
¯ < ∞. It follows that ϕ(x) ¯ is finite, and by the Taking x = x1 we deduce that ϕ(x) ¯ The conclusion follows. above display, y¯ ∈ ∂κ ϕ(x).
9.1 Generalized Convexity
287
9.1.3 Duality Consider a family of optimization problems of the form Min ϕ(x, y) − κ X (x, x ). x∈X
(Py )
Here we have arbitrary sets X , X , Y , Y , y ∈ Y , x ∈ X , and coupling functions κ X , κY between (X, X ) and (Y, Y ) resp. The product spaces (X, Y ) and (X , Y ) are endowed with the product coupling κ(x, y; x , y ) := κ X (x, x ) + κY (y, y ).
(9.22)
We denote the value function (for fixed x ) of problem (Py ) by v(y) := inf ϕ(x, y) − κ X (x, x ) . x∈X
(9.23)
Its κY conjugate, denoted by vκ , is vκ (y ) :=
sup
(x,y)∈X ×Y
κ X (x, x ) + κY (y, y ) − ϕ(x, y) = ϕ κ (x , y ).
(9.24)
So, its biconjugate is vκκ (y) := sup κY (y, y ) − ϕ κ (x , y ) . y ∈Y
(9.25)
This leads us to define the dual problem as κ κ (y, y ) − ϕ (x , y ) . Max Y y ∈Y
(D y )
Our previous results on generalized convexity (in particular Lemma 9.3) lead to the following weak duality result: Theorem 9.6 We have that val(D y ) = vκκ (y) ≤ v(y) = val(Py ), S(D y ) = ∂κ vκκ (y), ∂κ v(y) = ∅ ⇒ ∂κ v(y) = S(D y ).
(9.26) (9.27) (9.28)
9.1.4 Augmented Lagrangian We continue in the previous setting, in the case when X is again an arbitrary set, Y is a Banach space and the family of optimization problems is
288
9 Generalized Convexity and Transportation Theory
Min f (x) − κ X (x, x ); g(x) + y ∈ K , x∈X
(Py )
with g : X → Y and K a closed convex subset of Y . We introduce a penalty function ¯ and a penalty parameter r > 0. The penalized problem is P:Y →R Min f (x) − κ X (x, x ) + r P(y); g(x) + y ∈ K . x∈X
(Pr,y )
Its value, denoted by vr , satisfies vr (y) = inf f (x) − κ X (x, x ) + r P(y); g(x) + y ∈ K = v(y) + r P(y). x (9.29) In the sequel we assume that Y is the (topological) dual of Y , and we consider two types of dualization: (a) Dualization of the previous penalized problem using the standard coupling whose expression is κY (y, y ∗ ) := y ∗ , y . We then have, writing y = z − g(x), with z ∈ K : vrκ (y ∗ ) = sup{κ X (x, x ) − f (x) + y ∗ , y − r P(y); g(x) + y ∈ K } x,y
= sup{κ X (x, x ) − f (x) − y ∗ , g(x) } + sup{ y ∗ , z − r P(z − g(x))}. x
z∈K
(9.30) Define the augmented Lagrangian L r (x, y ∗ ) := f (x) + inf {r P(z − g(x)) + y ∗ , g(x) − z }. z∈K
(9.31)
We have shown that vrκ (y ∗ ) = sup{κ X (x, x ) − L r (x, y ∗ )}.
(9.32)
x
So, the dual problem is nothing but Max y ∗ , y + inf {L r (x, y ∗ ) − κ X (x, x )}.
y ∗ ∈Y ∗
x
(Dr,y )
(b) Dualization of the original problem (Py ), with value v(y), using the coupling between Y and Y defined by κˆ Y (y, y ∗ ) := y ∗ , y − r P(y).
(9.33)
We denote the κ-conjugate ˆ of v(y) by vˆ κ . Then vˆ κ (y ∗ ) = sup{κ X (x, x ) − f (x) + y ∗ , y − r P(y)); g(x) + y ∈ K }. x,y
(9.34)
9.1 Generalized Convexity
289
Therefore we get the same value function: vˆ κ (y ∗ ) = sup{κ X (x, x ) − L r (x, y ∗ )} = vrκ (y ∗ ).
(9.35)
x
Definition 9.7 We say that y ∗ ∈ Y ∗ is an augmented Lagrange multiplier of the unperturbed problem (y = 0) if it belongs to ∂vr (0); that is, if v(0) = val(Dr,0 ) and y ∗ ∈ S(Dr,0 ). Note that y ∗ ∈ Y ∗ is an augmented Lagrange multiplier iff v(0) = inf {L r (x, y ∗ ) − κ X (x, x )}.
(9.36)
x
Remark 9.8 Observe that, when P(0) = 0, in cases (a) and (b), the duality gap is the same for the unperturbed problem y = 0. So, the augmented Lagrangian approach can be seen as a generalized convexity approach on the original problem with the nonstandard coupling y ∗ , y − r P(y). Example 9.9 The classical example is when Y is a Hilbert space identified with its dual, and P(y) = 21 y2 . Then the penalty term in the augmented Lagrangian is inf
z∈K
1
r z − g(x)2 + y ∗ , g(x) − z = inf 2
z∈K
1 r z 2
− g(x) −
1 1 ∗ 2 y − 2 y ∗ 2 , r r (9.37)
and therefore the augmented Lagrangian is L r (x, y ∗ ) := f (x) + r dist K
g(x) +
1 ∗ y r
2 −
1 ∗ 2 y . r2
(9.38)
The case of finitely many inequality constraints corresponds to the case when Y = Rm is endowed with the Euclidean norm and K = Rm − . The expression of the augmented Lagrangian is then ∗
L r (x, y ) := f (x) + r
m i=1
1 g(x) + y ∗ r
2 +
−
1 ∗ 2 y . r2
(9.39)
9.2 Convex Functions of Measures In various applications we need to minimize some nonlinear functions of measures, involving for instance some entropic regularization terms as we will see later in the context of optimal transportation problems. We will see how to construct some convex functions of measures, as Fenchel conjugates of integrals of convex functions of continuous functions.
290
9 Generalized Convexity and Transportation Theory
9.2.1 A First Result Let Ω be a compact subset of Rn , and C(Ω) denote the set of continuous functions over Ω. Let p ∈ N be nonzero and set X := C(Ω) p , whose elements are viewed as continuous functions over Ω with value in R p , and norm ϕ X := max |ϕ(ω)|. ω∈Ω
(9.40)
¯ l.s.c. convex and proper, let F : X → R ¯ be defined by Given f : R p → R,
F(ϕ) :=
Ω
f (ϕ(ω))dω.
(9.41)
Lemma 9.10 The functional F is convex, l.s.c. proper over X . Proof By Theorem 1.44, f has an affine minorant, and therefore F is well-defined, with value in (−∞, +∞]. The convexity of F is obvious. Taking ϕ to be constant, equal to an element of the domain of f , we obtain that F is proper. Finally, we prove that F is l.s.c. It is enough to consider a sequence ϕk → ϕ in X such that there exists limk F(ϕk ) < ∞. For any measurable function a ∈ L 1 (Ω) p , we have that
limk F(ϕk ) ≥ lim inf (a(ω) · ϕk (ω) − f ∗ (a(ω)))dω
k Ω = (a(ω) · ϕ(ω) − f ∗ (a(ω)))dω.
(9.42)
Ω
Indeed a(ω) · x − f ∗ (a(ω)) ≤ f (x), so that the above inequality holds (since f ∗ has an affine minorant, the integral has value in [−∞; ∞)), and the equality is obvious since ϕk → ϕ in X . By Proposition 3.74, the supremum over a(·) of the r.h.s. is precisely F(ϕ). The result follows. Remark 9.11 If dom( f ) = R p then dom(F) = X , and F is bounded over bounded sets, so that it is continuous. Recall the Definition 5.3 of regular measures. The dual of C(Ω) is M(Ω), the set of finite Borel regular measures over Ω; see [77, Chap. II, Sect. 5]. The Fenchel ¯ defined by conjugate of F is F ∗ : M(Ω) p → R
∗
F (μ) := sup μ, ϕ X − ϕ∈X
Ω
f (ϕ(ω))dω.
(9.43)
Here, denoting by μi , 1 ≤ i ≤ p, the components of the vector measure μ: μ, ϕ X :=
p i=1
Ω
ϕi (ω)dμi (ω).
(9.44)
9.2 Convex Functions of Measures
291
p
Let L 1μ := Πi=1 L 1μi (Ω) denote the set of integrable functions for the measure μ. ¯ has superlinear growth if, for all k > 0, Definition 9.12 We say that h = R p → R h(y)/|y| > k when |y| > rk , for some rk > 0. Lemma 9.13 We have that f ∗ has superlinear growth iff dom( f ) = R p . Proof Let ck := sup{ f (x); |x| ≤ k}. Then f ∗ (y) = sup{x · y − f (x); |x| ≤ k} ≥ sup{x · y − ck ; |x| ≤ k} = sup(k|y| − ck ). k,x
k,x
k
(9.45) If dom( f ) = R p , by Corollary 1.58, f is continuous, so that ck is finite, for all k, and then by the above display, f ∗ has superlinear growth. Conversely, let f ∗ have superlinear growth. Set gk (x) = sup {x · y − f ∗ (y)}; h k (x) = sup {x · y − f ∗ (y)}. |y|≤rk
|y|>rk
(9.46)
Then f (x) = max(gk (x), h k (x)), and when |x| < k: h k (x) ≤ sup {x · y − k|y|} ≤ sup (|x| − k)|y| = 0. |y|>rk
|y|>rk
(9.47)
On the other hand, f ∗ (y) has an affine minorant, say a · y − b, so that gk (x) ≤ sup {x · y − a · y + b} ≤ rk |x − a| + b. |y|≤rk
Therefore f (x) < ∞.
(9.48)
By the Lebesgue decomposition theorem, any μ ∈ X ∗ can be decomposed in a unique way as μ = μs + μa , where μs is the singular part and μa is the absolutely continuous part, see [105, Chap. 11]. We identify μa ∈ L 1 (Ω) p with its density w.r.t. the Lebesgue measure. Lemma 9.14 Let f ∗ have superlinear growth. Then F ∗ (μ) =
⎧ ⎨ ⎩
∞ Ω
if μs = 0,
f ∗ (μa (ω))dω otherwise.
(9.49)
Proof (i) If μs = 0, there exists a measurable subset E of Ω, of null measure, such that μ(E) = 0 (note that μ(E) ∈ R p ), say μ1 (E) > 0, where μ1 is the first component of μ. Since μ1 is regular, there exists a compact K ⊂ E such that μ1 (K ) > 0. Given ε ∈ (0, 1), set ϕε (ω) := c(1 − d K (ω)/ε)+ for some c > 0. By the dominated convergence theorem, ϕε converges in L 1μ1 to c1 K , so that
292
9 Generalized Convexity and Transportation Theory
μ1 , ϕε → c μ1 , 1 K = cμ1 (K ).
(9.50)
We next identify ϕε with the element of C(Ω) p with first component ϕε and the other components equal to zero. By Lemma 9.13, f is Lipschitz on bounded sets, and so, by the dominated convergence theorem, F(ϕε ) → 0. Therefore, F ∗ (μ) ≥ limε ( μ1 , ϕε − F(ϕε )) = cμ1 (K ). Letting c ↑ ∞ we deduce that F ∗ (μ) = +∞. (ii) Let μs = 0. Then
f ∗ (μa (ω))dω, (9.51) F ∗ (μ) = sup (μa (ω) · ϕ(ω) − f (ϕ(ω)))dω ≤ ϕ∈X
Ω
Ω
where in the last inequality we use the Fenchel–Young inequality. We next prove the converse inequality. Set b(ω, v) := μa (ω)v − f (v). Let ak be a dense sequence in dom f and let ϕk ∈ L ∞ (Ω) be inductively defined by ϕ0 (ω) = a0 and ϕk (ω) =
if b(ω, ak ) > b(ω, ϕk−1 (ω)), ak ϕk−1 (ω) otherwise.
(9.52)
Then b(ω, ϕk (ω)) → f ∗ (μa (ω)) a.e. and, by the monotone convergence theorem, ∗ Ω b(ω, ϕk (ω))dω → Ω f (μa (ω))dω. We cannot conclude the result from this since the ϕk are not continuous. So, given ε > 0, fix k such that
Ω
b(ω, ϕk (ω))dω >
⎧ ⎨ Ω
∗
⎩ 1/ε
f (μa (ω))dω − ε if
f ∗ (μa (ω)) < ∞,
Ω
(9.53)
otherwise.
Given M > 0, denote the truncation of μa by μaM (ω) := max(−M, min(M, μa (ω))).
(9.54)
Fix M > 0 such that μaM − μa L 1 (Ω) < ε. Extend ϕk over R p by 0 and let η : R p → R+ be of class C ∞ with integral 1 and support in the unit ball. Set for α > 0, ηα (x) := α −n η(x/α), and ϕˆα := ϕk ∗ ηα (convolution product). By Jensen’s inequality,
Rn
f (ϕˆ α (ω))dω ≤
Rn
( f (ϕk ) ∗ ηα )(ω)dω =
By a dominated convergence argument we obtain that for α > 0 small enough, by the above inequality:
f (ϕk (ω))dω.
Rn \Ω
(9.55)
f (ϕˆα (ω))dω → 0. So,
Ω
Ω
f (ϕˆ α (ω))dω ≤
Ω
f (ϕk (ω))dω + ε.
Also, since μaM − μa L 1 (Ω) < ε, for small enough α:
(9.56)
9.2 Convex Functions of Measures
293
| μ, ϕˆα − ϕk | ≤ | μ − μ M , ϕˆα − ϕk | + | μ M , ϕˆα − ϕk | ≤ εϕˆα − ϕk ∞ + Mϕˆ α − ϕk L 1 (Ω) ≤ ε(2ϕk ∞ + 1).
(9.57)
In the last inequality we use ϕˆα ∞ ≤ ϕk ∞ and ϕˆα → ϕk in L 1 (Ω). We conclude by combining the previous inequality with (9.53) and (9.56).
9.2.2 A Second Result Let g(ω, x) : Ω × R p → R be a continuous function, convex w.r.t. x. Define G : X → R by
g(ω, ϕ(ω))dω. (9.58) G(ϕ) := Ω
Clearly G is convex and bounded over bounded sets. So, it is continuous, with conjugate
g(ω, ϕ(ω))dω. (9.59) G ∗ (μ) := sup μ, ϕ X − ϕ∈X
Ω
We denote by g ∗ the Fenchel conjugate of g w.r.t. its second variable. Lemma 9.15 We have that G ∗ (μ) =
⎧ ⎨ ⎩
Ω
∞
if μs = 0,
g ∗ (ω, μa (ω))dω otherwise.
(9.60)
Proof This is an easy variant of the proof of Lemma 9.14. Let us just mention that, while Jensen’s inequality in (9.55) cannot be easily extended, we get directly the analogous to (9.56), namely
Ω
g(ω, ϕˆ α (ω))dω ≤
Ω
g(ω, ϕ(ω))dω + ε
(9.61)
by the dominated convergence theorem, since ϕˆα is bounded in L ∞ (Ω) p and con verges to ϕ in L 1 (Ω) p .
9.3 Transportation Theory We next analyze in a simple way the Kantorovich duality that extends the classical Monge problem.
294
9 Generalized Convexity and Transportation Theory
9.3.1 The Compact Framework Let x be a compact subset of Rn , and c(x) denote the space of continuous functions over x, endowed with the uniform norm ϕx := max{|ϕ(x)|; x ∈ x}.
(9.62)
This is a Banach space, with dual denoted by m(x). We say that η ∈ m(x) is nonnegative, and we write η ≥ 0, if η, ϕ c(x) ≥ 0, for any nonnegative ϕ. We denote by m + (x) the positive cone (set of nonnegative elements) of m(x). It is known that m(x) is the space of finite Borel measures over x, see [77, Chap. 2]. Given a compact subset y of R p , set z := x × y (this is a compact subset of Rn+ p ). To ϕ ∈ c(x) we associate bx ϕ ∈ c(z) defined by (bx ϕ) (x, y) = ϕ(x), for all (x, y) ∈ z.
(9.63)
We define in the same way b y ψ, where ψ ∈ c(y), by b y ψ (x, y) = ψ(y), for all (x, y) ∈ z. One easily checks that bx (as well as b y ) is isometric: bx ϕc(z) = ϕc(x) . So, we call bx (resp. b y ) the canonical injection from c(x) (resp. c(y)) into c(z). Let μ ∈ m(z). We call the element μ| X of M(X ) defined by μ| X , ϕ C(X ) = μ, B X ϕ C(Z ) , for all ϕ ∈ C(X )
(9.64)
the marginal of μ over X . The marginal mapping μ → μ| X is nothing but the transpose of the canonical injection from C(X ) into C(Z ), and is non-expansive in the sense that (9.65) μ| X M(X ) ≤ μ M(Z ) . Let 1 X have value 1 over X . The marginals are related by the compatibility relation μ| X , 1 X C(X ) = μ|Y , 1Y C(Y ) = μ, 1 Z C(Z ) .
(9.66)
By P(X ) we denote the set of Borel probabilities over X , i.e., P(X ) := {η ∈ M+ (X ); η, 1 X = 1}.
(9.67)
We fix (η, ν) ∈ P(X ) × P(Y ), and c(x, y) ∈ C(Z ). Consider the Kantorovich problem Min − η, ϕ X − ν, ψ C(Y ) ; ϕ(x) + ψ(y) − c(x, y) ≤ 0, for all (x, y) ∈ Z .
ϕ∈C(X ) ψ∈C(Y )
(9.68) This is a convex problem, whose Lagrangian L : C(X ) × C(Y ) × M(Z ) → R is
9.3 Transportation Theory
295
L (ϕ, ψ, μ) := − η, ϕ C(X ) − ν, ψ C(Y ) + μ, B X ϕ(x) + BY ψ(y) − c C(Z ) = μ| X − η, ϕ C(X ) + μ|Y − ν, ψ C(Y ) − μ, c C(Z ) . (9.69) Observe that −∞ if μ| X = η or μ|Y = ν, (9.70) inf L (ϕ, ψ, μ) = − μ, c C(Z ) otherwise. ϕ∈C(X ) ψ∈C(Y )
So, the dual problem is Max − μ, c C(Z ) ; μ| X = η; μ|Y = ν.
μ∈M+ (Z )
(9.71)
Proposition 9.16 Problems (9.71) and (9.68) have the same finite value, and both have a nonempty set of solutions. Proof (a) The dual problem (9.71) is feasible (take for μ the product of η and ν) and the primal problem (9.68) is qualified: there exists a pair (ϕ0 , ψ0 ) in C(X ) × C(Y ) such that c(x, y) − ϕ0 (x) − ψ0 (y) is uniformly positive. By general results of convex duality theory, problems (9.68) and (9.71) have the same finite value, and (9.71) has a nonempty and bounded set of solutions. (b) It remains to prove that (9.68) has a nonempty set of solutions. Let (ϕk , ψk ) be a minimizing sequence. Set
ψk (y) := min{c(x, y) − ϕk (x); x ∈ X }, ϕk (x) := min{c(x, y) − ψk (y); y ∈ Y }.
(9.72)
It is easily checked that these two functions are continuous, and satisfy the primal constraint as well as the inequality (ϕk , ψk ) ≥ (ϕk , ψk ), implying that the associated cost is smaller than the one for (ϕk , ψk ); so, (ϕk , ψk ) is another minimizing sequence. In addition, (ϕk , ψk ) has a continuity modulus not greater than the one of c (in short, it has a c-continuity modulus), since a finitely-valued infimum of functions with ccontinuity modulus has c-continuity modulus. Since we can always add a constant to ϕk and subtract it from ψk we get the existence of a minimizing sequence (ϕk , ψk ) with c-continuity modulus, and such that ϕk (x0 ) = 0. It easily follows that (ϕk , ψk ) is bounded in C(Y ) and C(X ) resp. By the Ascoli–Arzela theorem, there exists a subsequence in C(X ) × C(Y ) converging to some (ϕ, ψ). Passing to the limit in the cost function and constraints of (9.68) we obtain that (ϕ, ψ) is a solution to this problem. Remark 9.17 The primal solution (ϕ, ψ) constructed in the above proof satisfies
ψ(y) = min{c(x, y) − ϕ(x); x ∈ X }, ϕ(x) := min{c(x, y) − ψ(y); y ∈ Y }.
(9.73)
296
9 Generalized Convexity and Transportation Theory
Setting κ(x, y) := −c(x, y), the above relations can be interpreted as κ-conjugates in the sense of (9.1): (9.74) − ψ = (−ϕ)κ ; −ϕ = (−ψ)κ .
9.3.2 Optimal Transportation Maps Let (ϕ, ψ) and μ be primal and dual feasible, resp. The difference of associated costs is, since η and ν are the marginals of μ: μ, c C(Z ) − η, ϕ C(X ) − ν, ψ C(Y ) = μ, c(x, y) − ϕ(x) − ψ(y) C(Z ) . (9.75) As expected it is nonnegative, and (since the primal and dual problem have the same value), (ϕ, ψ) and μ are primal and dual solutions, resp., iff the above r.h.s. is equal to zero, meaning that c(x, y) = ϕ(x) + ψ(y) over the support of μ, which we denote by Γ . Let (x, ¯ y¯ ) ∈ Γ . Then c(x, ¯ y¯ ) − ϕ(x) ¯ = ψ( y¯ ) ≤ c(x, y¯ ) − ϕ(x), for all x ∈ X.
(9.76)
In the sequel we assume that X and Y are the closure of their interior, and that c(·, ·) is of class C 1 . By the above remark, we may assume that ϕ and ψ satisfy (9.74) and therefore are Lipschitz. By Rademacher’s theorem, see [6, Thm. 2.14], ϕ is a.e. differentiable over int(X ). If x¯ ∈ int(X ) and ϕ(x) is differentiable at x, ¯ (9.76) implies that ¯ y¯ ) a.e. ∇ϕ(x) ¯ = ∇x c(x,
(9.77)
¯ = x¯ − y¯ . Therefore Example 9.18 Take c(x, y) = 21 |x − y|2 . We obtain that ∇ϕ(x) y¯ = T (x), ¯ where T (x) := x − ∇ϕ(x), a.e.,
(9.78)
so that the support of μ is contained in the graph of the transportation map T (x). If η has a density, we can identify μ with this transportation map. In addition, since (9.74) holds for (ϕ, ψ) we have that ϕˆ := −ϕ satisfies ϕ(x) ˆ = max{−c(x, y) + ψ(y)} = − 21 |x|2 + max x · y − 21 |y|2 + ψ(y) . y∈Y
y∈Y
(9.79) The last maximum of affine functions of x, say F(x), is a convex function of x. We deduce that ϕ(x) = 21 |x|2 − F(x), with F convex. We have proved that
If c(x, y) = 21 |x − y|2 , then the transportation plan is a.e. of the form T (x) = ∇ F(x) a.e., where F is a convex function.
(9.80)
9.3 Transportation Theory
297
Example 9.19 More generally assume that c(x, y) = f (x − y) with f convex and Lipschitz. Then (9.76) implies that ∇ϕ(x) ¯ ∈ ∂ f (x¯ − y¯ ), a.e.,
(9.81)
where by ∂ f we denote the subdifferential. If ∂ f is injective, then we have that ¯ x¯ − y¯ a.e., ∂ f −1 (∇ϕ(x))
(9.82)
¯ with now T f (x) := x − ∂ f −1 (∇ϕ(x)) a.e. y¯ = T f (x),
(9.83)
meaning that
So, if η has a density, we can identify μ with the transportation map T f .
9.3.3 Penalty Approximations 9.3.3.1
Duality
The dual problem was set in (9.68). We assume that η and ν have densities.
(9.84)
¯ for the nonnegativity of the measure, of the Consider a penalty function e : R → R following type:
e is proper l.s.c. convex with superlinear growth, (0, ∞) ⊂ dom(e) ⊂ [0, ∞).
(9.85)
A typical example is the entropy penalty e(s) ˆ := s(log s − 1),
(9.86)
with e(0) ˆ := 0 and domain [0, ∞). The penalty term is defined as P(μ) =
⎧ ⎨ ⎩
∞ Ω
if μs = 0,
e(μa (ω))dω otherwise.
(9.87)
A penalized version of the dual problem, with ε > 0, is (remember that Z := X × Y ): Max − μ, c C(Z ) − ε P(μ); μ|C(X ) = η; μ|Y = ν.
μ∈M+ (Z )
(9.88)
298
9 Generalized Convexity and Transportation Theory
Set
f (μ)=ε
P(μ(x, y))d(x, y); F(ν) = I{0} (ν), Aμ = −(μ| X , μ|Y ); y = (η, ν). Z
(9.89) Here A is from M(Z ) into M(X ) × M(Y ). The penalized dual problem can be written in the form (9.90) Min f (μ) + μ, c C(Z ) + F(Aμ + y). μ∈M(Z )
We can compute the dual to this problem in dual spaces, as explained in Chap. 1, Sect. 1.2.1.2. While the problem is in a dual space setting, the computations are similar to those in the standard Fenchel duality framework, so that the ‘bidual’ expressed as a minimization has expression Min f ∗ (−c − A (ϕ, ψ)) + F ∗ (ϕ, ψ) − η, ϕ − ν, ψ . ϕ,ψ
(9.91)
Now F ∗ is the null function, and (ε f )∗ = ε f ∗ (·/ε). Lemma 9.20 We have that f ∗ has finite values and, for every c ∈ C(Z ): ∗
e∗ (c(x, y))d(x, y).
P (c) =
(9.92)
z
Proof Let fˆ(c) denote the above r.h.s. Since e is l.s.c. proper convex, it is the Fenchel conjugate of e∗ . So, by Lemma 9.13, since e has superlinear growth, e∗ is finitevalued and bounded over bounded sets. So, fˆ(c) is a continuous convex function and, by Lemma 9.14, its conjugate is P(μ). Since fˆ(c) is equal to its biconjugate, the conclusion follows. Consider the problem
P∗
Min − η, ϕ C(X ) − ν, ψ C(Y ) + ε ϕ,ψ
Z
ϕ(x) + ψ(y) − c(x, y) d(x, y). ε (9.93)
Proposition 9.21 The penalized problem (9.88) is the dual of problem (9.93). Proof Apply the Fenchel duality theory, taking into account Lemma 9.20.
Remark 9.22 Since the primal penalized problem is qualified (in the case of the usual penalties given above) its dual has a nonempty and bounded set of solutions. The semiprimal problem consists in minimizing the primal cost w.r.t. ϕ only. The primal cost can be expressed as
9.3 Transportation Theory
299
ε (P ∗ ((ϕ(x) + ψ(y) − c(x, y))/ε)dy − η(x)ϕ(x)) dx − ν, ψ C(Y ) . X
Y
(9.94)
So the first-order condition for minimizing w.r.t. ϕ is that
(D P ∗ ((ϕ(x) + ψ(y) − c(x, y))/ε)dy = η(x).
(9.95)
Y
Example 9.23 Entropy penalty: then P ∗ (s) = D P ∗ (s) = es and (9.95) reduces to
(exp((ψ(y) − c(x, y))/ε)dy = η(x).
exp(ϕ(x)/ε)
(9.96)
Y
Since the l.h.s. of (9.96) is a positive and continuous function of x, (9.96) has a solution iff η is absolutely continuous, with positive and continuous density (since Ω is compact, this implies that the density has a positive minimum), and the solution is ϕ such that ϕ(x) + log ε
(exp((ψ(y) − c(x, y))/ε)dy
= log η(x).
(9.97)
Y
Substituting into (9.94), we obtain the expression of the semiprimal cost:
ε
log X
Y
(exp((ψ(y) − c(x, y))/ε)dy η(x)dx − ν, ψ C(Y )
+ ε − ε η(x) log(η(x))dx.
(9.98)
X
9.3.4 Barycenters 9.3.4.1
The Multi-transport Setting
Let X and Yk , for k = 1 to K , be compact subsets of Rn , Z k := X × Yk , ck ∈ C(Z k ), ν k ∈ P(Yk ). Consider the problem in dual spaces Maxμ,η −
K k=1
μk , ck C(Z k ) ;μk ∈ M+ (Z k ), k = 1, . . . , K ; η ∈ M(X ); μk| X = η; μk|Y = ν k .
(9.99) In some cases these problems can be interpreted as the computation of barycenters, see the references at the end of the chapter. Note that, by the above constraints, η ∈ P(X ). We easily check that problem (9.99) is the dual of
300
9 Generalized Convexity and Transportation Theory
Min − ϕ,ψ
K ν k , ψ k C(Yk ) ;ϕ k (x) + ψ k (y) − ck (x, y) ≤ 0, k=1
for all x ∈ X and y ∈ Yk , K ϕ k (x) = 0, for all x ∈ X,
(9.100)
k=1
ϕ k ∈ C(X ); ψ k ∈ C(Yk ), k = 1, . . . , K . Proposition 9.24 Problems (9.99) and (9.100) have the same finite value, and both have a nonempty set of solutions. Proof (a) The dual problem (9.99) is feasible (take for μk the product of η and ν k ) and the primal problem (9.100) is qualified: for k = 1 to K , there exists pairs (ϕ0k , ψ0k ) in C(X ) × C(Yk ) such that ck (x, y) − ϕ0k (x) − ψ0k (y) is uniformly positive. By general results of convex duality theory, problems (9.99) and (9.100) have the same finite value, and (9.99) has a nonempty and bounded set of solutions. (b) It remains to show that the primal problem (9.100) has solutions. We adapt the ideas in the proof of Proposition 9.16. Let (ϕ j , ψ j ) be a minimizing sequence. Set ⎧ k ⎪ ψˆ j (y) := min{ck (x, y) − ϕ kj (x); x ∈ X }, k = 1, . . . , K , ⎪ ⎪ ⎪ ⎪ ⎨ ϕˆ k (x) := min{ck (x, y) − ψˆ k (y); y ∈ Yk }, k = 1, . . . , K − 1, j j K −1 ⎪ ⎪ ⎪ K ⎪ ϕˆ kj (x). ⎪ ⎩ ϕˆ j (x) := −
(9.101)
k=1
Then ϕˆ kj ≥ ϕ kj , for k = 1 to K − 1, so that ϕˆ Kj ≤ ϕ Kj . It follows that (ϕˆ j , ψˆ j ) is feasible. The associated cost is not greater than the one for (ϕ j , ψ j ), since ψˆ kj (y) ≥ ψ kj for all k. Therefore (ϕˆ j , ψˆ j ) is a minimizing sequence which, in addition, has a uniform continuity modulus. Changing ϕˆ k (x) into ϕˆ k (x) − ϕˆ k (x0 ) if necessary, we get that ϕ k (x0 ) = 0 for all k (the sum of the ϕˆ k is still equal to 0, and this operation leaves the cost invariant). We have constructed a bounded minimizing sequence with uniform continuity modulus, and conclude by the Ascoli–Arzela theorem.
9.3.4.2
Penalization
As in the case of a standard transport problem we start from a penalty approximation of the dual formulation, that is, we approximate (9.100) by Max − μ,η
K μk , ck Z k + ε k=1
Zk
P(μk (x, y))d(x, y) ; μk| X = η; μk|Y = ν k ; μk ∈ M+ (Z k ), k = 1, . . . , K ; η ∈ M(X ). (9.102)
9.3 Transportation Theory
301
Computing the ‘bidual’ problem we again recognize the Fenchel duality framework with ⎧ K ⎪ ⎨ P(μk (x, y))d(x, y) f (μ, η) = f 1 (μ) + f 2 (η); f 1 (μ) = ε (9.103) Z k k=1 ⎪ ⎩ f 2 (η) = 0; F = I{0} ; A(μ, η) = (η − μ| X , −μ|Y ); y = (0, ν). We find that f 1∗ can be computed as in Sect. 9.3.3, and f 2∗ is the indicatrix of 0, so that the primal (or bidual) problem is Min ϕ,ψ
K − ηk , ϕ k C(X ) − ν k , ψ k C(Yk ) + ε k=1
K k=1
P∗ Zk
ϕ k (x) + ψ k (y) − ck (x, y) d(x, y) ; ε
ϕ k = 0; ϕ k ∈ C(X ); ψ k ∈ C(Yk ), k = 1, . . . , K .
(9.104)
9.4 Notes Brøndsted [30] and Dolecki and Kurcyusz [44] are early references for generalized convexity. The augmented Lagrangian approach was introduced by Powell [88] and Hestenes [58], and linked to the dual proximal algorithm in Rockafellar [101]. For its application to infinite-dimensional problems, see Fortin and Glowinski [50]. Convex functions of measures are discussed in Demengel and Temam [41, 42]. On transportation theory, see the monographs by Villani [122] and Santambrogio [108]. The link (9.80) between a transportation map and the derivative of a convex function is known as Brenier’s theorem [27]. Augmented Lagrangians are a useful numerical tool for solving optimal transport problems, see Benamou and Carlier [17]. Cuturi [35] introduced the entropic penalty, and showed that the resulting problem can be efficiently solved thanks to Sinkhorn’s algorithm [116] (for computing matrices with prescribed row and column sums). Barycenters in the optimal transport framework were introduced in Carlier and Ekeland [31]. See also Agueh and Carlier [1]. It gives a powerful tool for clustering, see Cuturi and Doucet [36].
References
1. Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43(2), 904–924 (2011) 2. Akhiezer, N.I.: The Classical Moment Problem. Hafner Publishing Co., New York (1965) 3. Aliprantis, C.D., Border, K.C.: Infinite dimensional analysis, 3rd edn. Springer, Berlin (2006) 4. Alizadeh, F., Goldfarb, D.: Second-order cone programming. Math. Program. 95(1, Ser. B), 3–51 (2003) 5. Altman, E.: Constrained Markov Decision Processes. Stochastic Modeling. Chapman & Hall/CRC, Boca Raton (1999) 6. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity Problems. Oxford Mathematical Monographs. The Clarendon Press, Oxford University Press, New York (2000) 7. Arapostathis, A., Borkar, V.S., Fernández-Gaucherand, E., Ghosh, M.K., Marcus, S.I.: Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control Optim. 31(2), 282–344 (1993) 8. Araujo, A., Giné, E.: The Central Limit Theorem for Real and Banach Valued Random Variables. Wiley, New York (1980) 9. Artzner, P., Delbaen, F., Eber, J.M., Heath, D.: Coherent measures of risk. Math. Financ. 9(3), 203–228 (1999) 10. Attouch, H., Brézis, H.: Duality for the sum of convex functions in general Banach spaces. In: Barroso, J.A. (ed) Aspects of Mathematics and its Applications, pp. 125–133 (1986) 11. Aubin, J.-P., Ekeland, I.: Estimates of the duality gap in nonconvex optimization. Math. Oper. Res. 1(3), 225–245 (1976) 12. Aubin, J.P., Frankowska, H.: Set-Valued Analysis. Birkhäuser, Boston (1990) 13. Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-JacobiBellman Equations. Birkhäuser, Boston (1997) 14. Ben-Tal, A., Golany, B., Nemirovski, A.: Vial, J-Ph: Supplier-retailer flexible commitments contracts: a robust optimization approach. Manuf. Serv. Oper. Manag. 7(3), 248–273 (2005) 15. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, MPS/SIAM Series on Optimization (2001) 16. Ben-Tal, Aharon: Ghaoui, El: Laurent, Nemirovski, Arkadi: Robust Optimization. Princeton University Press, Princeton (2009) 17. Benamou, J.-D., Carlier, G.: Augmented Lagrangian methods for transport optimization, mean field games and degenerate elliptic equations. J. Optim. Theory Appl. 167(1), 1–26 (2015) 18. Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems. Numer. Math. 4, 238–252 (1962) 19. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 2nd edn., vol I & II. Athena Scientific, Belmont (2000, 2001) © Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2
303
304
References
20. Billingsley, P.: Convergence of Probability Measures, 2nd edn. Wiley Inc., New York (1999) 21. Birge, J.R., Louveaux, F.: Introduction to Stochastic Programming. Springer, New York (1997) 22. Bokanowski, O., Maroso, S., Zidani, H.: Some convergence results for Howard’s algorithm. SIAM J. Numer. Anal. 47(4), 3001–3026 (2009) 23. Bonnans, J.F., Cen, Z.: Christel, Th: Energy contracts management by stochastic programming techniques. Ann. Oper. Res. 200, 199–222 (2012) 24. Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization: Theoretical and Numerical Aspects, 2nd edn. Universitext. Springer, Berlin (2006) 25. Bonnans, J.F., Ramírez, H.: Perturbation analysis of second-order cone programming problems. Math. Program. 104(2–3, Ser. B), 205–227 (2005) 26. Bonnans, J.F., Shapiro, A.: Perturbation Analysis Of Optimization Problems. Springer, New York (2000) 27. Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Commun. Pure Appl. Math. 44(4), 375–417 (1991) 28. Brézis, H.: Functional Analysis. Sobolev Spaces and Partial Differential Equations. Springer, New York (2011) 29. Brézis, H., Lieb, E.: A relation between pointwise convergence of functions and convergence of functionals. Proc. Am. Math. Soc. 88(3), 486–490 (1983) 30. Brøndsted, A.: Convexification of conjugate functions. Math. Scand. 36, 131–136 (1975) 31. Carlier, G., Ekeland, I.: Matching for teams. Econ. Theory 42(2), 397–418 (2010) 32. Carpentier, P., Chancelier, J-Ph, Cohen, G., De Lara, M.: Stochastic Multi-stage Optimization. Springer, Berlin (2015) 33. Castaing, C., Valadier, M.: Convex Analysis and Measurable Multifunctions. Lecture Notes in Mathematics, vol. 580. Springer, Berlin (1977) 34. Csiszár, I.: Information-type measures of difference of probability distributions and indirect observations. Stud. Sci. Math. Hungar. 2, 299–318 (1967) 35. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transportation. In: Neural Information Processing Conference Proceedings, pp. 2292–2300 (2013) 36. Cuturi, M., Doucet, A.: Fast computation of Wasserstein barycenters. In: Xing, E.P., Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 32, pp. 685–693, Bejing, China (2014) 37. Dallagi, A.: Méthodes particulaires en commande optimale stochastique. Ph.D. thesis, Université Paris I (2007) 38. Danskin, J.M.: The Theory of Max-Min and Its Applications to Weapons Allocation Problems. Springer, New York (1967) 39. Decarreau, A., Hilhorst, D., Lemaréchal, C., Navaza, J.: Dual methods in entropy maximization. Application to some problems in crystallography. SIAM J. Optim. 2(2), 173–197 (1992) 40. Dellacherie, C., Meyer, P.-A.: Probabilities and potential. North-Holland Mathematics Studies, vol. 29. North-Holland Publishing Co., Amsterdam (1978) 41. Demengel, F., Temam, R.: Convex functions of a measure and applications. Indiana Univ. Math. J. 33(5), 673–709 (1984) 42. Demengel, F., Temam, R.: Convex function of a measure: the unbounded case. FERMAT days 85: mathematics for optimization (Toulouse, 1985). North-Holland Mathematics Studies, vol. 129, pp. 103–134. North-Holland, Amsterdam (1986) 43. Dentcheva, D., Ruszczy´nski, A.: Common mathematical foundations of expected utility and dual utility theories. SIAM J. Optim. 23(1), 381–405 (2013) 44. Dolecki, S., Kurcyusz, S.: On Φ-convexity in extremal problems. SIAM J. Control Optim. 16(2), 277–300 (1978) 45. Dudley, R.M.: Real Analysis and Probability. Cambridge University Press, Cambridge (2002). Revised reprint of the 1989 original 46. Ekeland, I., Temam, R., Convex Analysis and Variational Problems. Studies in Mathematics and its Applications, vol. 1. North-Holland, Amsterdam (1976). French edition: Analyse convexe et problèmes variationnels. Dunod, Paris (1974) 47. Fenchel, W.: On conjugate convex functions. Can. J. Math. 1, 73–77 (1949)
References
305
48. Fenchel, W.: Convex Cones and Functions. Lecture Notes. Princeton University, Princeton (1953) 49. Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time. de Gruyter Studies in Mathematics, vol. 27. Walter de Gruyter & Co., Berlin (2002) 50. Fortin, M., Glowinski, R.: Augmented Lagrangian Methods. North-Holland, Amsterdam (1983) 51. Georghiou, A., Wiesemann, W., Kuhn, D.: Generalized decision rule approximations for stochastic programming via liftings. Math. Program. 152(1-2, Ser. A), 301–338 (2015) 52. Girardeau, P., Leclere, V., Philpott, A.B.: On the convergence of decomposition methods for multistage stochastic convex programs. Math. Oper. Res. 40(1), 130–145 (2015) 53. Goberna, M.A., Lopez, M.A.: Linear Semi-infinite Optimization. Wiley Series in Mathematical Methods in Practice, vol. 2. Wiley, Chichester (1998) 54. Gol’shtein, E.G.: Theory of Convex Programming. Translations of Mathematical Monographs, vol. 36. American Mathematical Society, Providence (1972) 55. Gouriéroux, C.: ARCH Models and Financial Applications. Springer, New York (1997) 56. Hernández-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes. Springer, New York (1996) 57. Hernández-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Processes. Springer, New York (1999) 58. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4, 303–320 (1969) 59. Hoffman, A.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bureau Stand., Sect. B, Math. Sci. 49, 263–265 (1952) 60. Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge (2013) 61. Hsu, S.-P., Chuang, D.-M., Arapostathis, A.: On the existence of stationary optimal policies for partially observed MDPs under the long-run average cost criterion. Syst. Control Lett. 55(2), 165–173 (2006) 62. Kall, P., Wallace, S.W.: Stochastic Programming. Wiley, Chichester (1994) 63. Kelley, J.E.: The cutting plane method for solving convex programs. J. Soc. Indust. Appl. Math. 8, 703–712 (1960) 64. Komiya, H.: Elementary proof for Sion’s minimax theorem. Kodai Math. J. 11(1), 5–7 (1988) 65. Krein, M., Milman, D.: On extreme points of regular convex sets. Studia Math. 9, 133–138 (1940) 66. Kuhn, D., Wiesemann, W., Georghiou, A.: Primal and dual linear decision rules in stochastic and robust optimization. Math. Program. 130(1, Ser. A), 177–209 (2011) 67. Kushner, H.J., Dupuis, P.G.: Numerical Methods for Stochastic Control Problems in Continuous Time. Applications of Mathematics, vol. 24, 2nd edn. Springer, New York (2001) 68. Lang, S.: Real and Functional Analysis, 3rd edn. Springer, New York (1993) 69. Lasserre, J.B.: Semidefinite programming versus LP relaxations for polynomial programming. Math. Oper. Res. 27, 347–360 (2002) 70. Lemaréchal, C., Oustry, F.: Semidefinite relaxations and Lagrangian duality with application to combinatorial optimization. Rapport de Recherche INRIA 3710, (1999) 71. Lewis, A.: The mathematics of eigenvalue optimization. Math. Programm. 97, 155–176 (2003) 72. Lewis, A.S.: The convex analysis of unitarily invariant matrix functions. J. Convex Anal. 2(1–2), 173–183 (1995) 73. Lewis, A.S., Overton, M.L.: Eigenvalue optimization. In: Acta numerica, 1996, pp. 149–190. Cambridge University Press, Cambridge (1996) 74. Liapounoff, A.: Sur les fonctions-vecteurs complètement additives. Bull. Acad. Sci. URSS. Sér. Math. [Izvestia Akad. Nauk SSSR] 4, 465–478 (1940) 75. Linderoth, J.T., Shapiro, A., Wright, S.: The empirical behavior of sampling methods for stochastic programming. Technical Report 02-01, Computer Science Department, University of Wisconsin-Madison (2002) 76. Lobo, M.S., Vandenberghe, L., Boyd, S., Lebret, H.: Applications of second-order cone programming. Linear Algebra Appl. 284, 193–228 (1998)
306
References
77. Malliavin, P.: Integration and Probability. Springer, New York (1995). French edition: Masson, Paris (1982) 78. Mandelbrojt, S.: Sur les fonctions convexes. C. R. Acad. Sci., Paris 209, 977–978 (1939) 79. Maréchal, P.: On the convexity of the multiplicative potential and penalty functions and related topics. Math. Program. 89(3, Ser. A), 505–516 (2001) 80. Modica, L.: The gradient theory of phase transitions and the minimal interface criterion. Arch. Ration. Mech. Anal. 98(2), 123–142 (1987) 81. Monahan, G.E.: A survey of partially observable Markov decision processes: theory, models, and algorithms. Manag. Sci. 28(1), 1–16 (1982) 82. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93, 273–299 (1965) 83. Moreau, J.-J.: Fonctionnelles convexes. In: Leray, J. (ed.) Séminaire sur les équations aux dérivées partielles, vol. 2, pp. 1–108. Collège de France (1966/1967). www.numdam.org 84. Moreau, J.-J.: Inf-convolution, sous-additivité, convexité des fonctions numériques. J. Math. Pures Appl. 9(49), 109–154 (1970) 85. Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming. Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1994) 86. Pereira, M.V.F., Pinto, L.M.V.G.: Multi-stage stochastic optimization applied to energy planning. Math. Program. 52(2, Ser. B), 359–375 (1991) 87. Pontryagin, L.S., Boltyanski˘ı, V.G., Gamkrelidze, R.V., Mishchenko, E.F.: The Mathematical Theory of Optimal Processes. Gordon & Breach Science Publishers, New York (1986). Reprint of the 1962 English translation 88. Powell, M.J.D.: A method for nonlinear constraints in minimization problems. In: Fletcher, R. (ed.) Optimization, pp. 283–298. Academic, New York (1969) 89. Powell, M.J.D.: Approximation Theory and Methods. Cambridge University Press, Cambridge (1981) 90. Pulleyblank, W.R.: Polyhedral combinatorics. In: Nemhauser, G.L., et al. (eds.) Optimization. Elsevier, Amsterdam (1989) 91. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. WileyInc, New York (1994) 92. Puterman, M.L., Shin, M.C.: Modified policy iteration algorithms for discounted Markov decision problems. Manag. Sci. 24(11), 1127–1137 (1978) 93. Rockafellar, R.T.: Duality theorems for convex functions. Bull. Am. Math. Soc. 70, 189–192 (1964) 94. Rockafellar, R.T.: Extension of Fenchel’s duality theorem for convex functions. Duke Math. J. 33, 81–90 (1966) 95. Rockafellar, R.T.: Extension of Fenchel’s duality theorem for convex functions. Duke Math. J. 33, 81–89 (1966) 96. Rockafellar, R.T.: Integrals which are convex functionals. Pacif. J. Math. 24, 525–539 (1968) 97. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970) 98. Rockafellar, R.T.: Convex integral functionals and duality. In: Contributions to Nonlinear Functional Analysis (Proc. Sympos., Math. Res. Center, University of Wisconsin, Madison, Wisconsin, 1971), pp. 215–236. Academic, New York (1971) 99. Rockafellar, R.T.: Integrals which are convex functionals. II. Pacif. J. Math. 39, 439–469 (1971) 100. Rockafellar, R.T.:. Conjugate Duality and Optimization. Regional Conference Series in Applied Mathematics, vol. 16. SIAM, Philadelphia (1974) 101. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Math. Oper. Res. 1, 97–116 (1976) 102. Rockafellar, R.T.: Integral functionals, normal integrands and measurable selections. In: Nonlinear Operators and the Calculus of Variations (Summer School, Univ. Libre Bruxelles, Brussels, 1975). Lecture Notes in Mathematics, vol. 543, pp. 157–207. Springer, Berlin (1976) 103. Rockafellar, R.T., Wets, R.J.-B.: Stochastic convex programming: basic duality. Pacif. J. Math. 62(1), 173–195 (1976)
References
307
104. Rockafellar, R.T., Wets, R.J.-B.: Stochastic convex programming: singular multipliers and extended duality singular multipliers and duality. Pacif. J. Math. 62(2), 507–522 (1976) 105. Royden, H.L.: Real Analysis, 3rd edn. Macmillan Publishing Company, New York (1988) 106. Ruszczynski, A., Shapiro, A. (eds.): Stochastic Programming. Handbook in Operations Research and Management, vol. 10. Elsevier, Amsterdam (2003) 107. Ruszczy´nski, A., Shapiro, A.: Conditional risk mappings. Math. Oper. Res. 31(3), 544–561 (2006) 108. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser (2015) 109. Santos, M.S., Rust, J.: Convergence properties of policy iteration. SIAM J. Control Optim. 42(6), 2094–2115 (electronic) (2004) 110. Schrijver, A.: Theory of Linear and Integer Programming. Wiley, New Jersey (1986) 111. Shapiro, A., Asymptotic analysis of stochastic programs. Ann. Oper. Res., 30(1–4):169–186 (1991). Stochastic programming, Part I (Ann Arbor, MI, 1989) 112. Shapiro, A.: Asymptotics of minimax stochastic programs. Stat. Probab. Lett. 78(2), 150–157 (2008) 113. Shapiro, A.: Analysis of stochastic dual dynamic programming method. Eur. J. Oper. Res. 209(1), 63–72 (2011) 114. Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modelling and Theory, 2nd edn. SIAM (2014) 115. Shiryaev, A.N.: Probability. Graduate Texts in Mathematics, vol. 95, 2nd edn. Springer, New York (1996). Translated from the first (1980) Russian edition by R.P. Boas 116. Sinkhorn, R.: Diagonal equivalence to matrices with prescribed row and column sums. Am. Math. Mon. 74(4), 402–405 (1967) 117. Sion, M.: On general minimax theorems. Pacif. J. Math. 8, 171–176 (1958) 118. Skorohod, A.V.: Limit theorems for stochastic processes. Teor. Veroyatnost. i Primenen. 1, 289–319 (1956) 119. Tardella, F.: A new proof of the Lyapunov convexity theorem. SIAM J. Control Optim. 28(2), 478–481 (1990) 120. Tibshirani, R.: Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat. Soc. Ser. B 73, Part 3, 273–282 (2011) 121. Villani, C.: Intégration et analyse de Fourier. ENS Lyon (2007). Revised in 2010 122. Villani, C.: Optimal Transport. Old and New. Springer, Berlin (2009) 123. Wallace, S.W., Ziemba, W.T. (eds.): Aplications of Stochastic Programming. MPS/SIAM Series Optimization, vol. 5. SIAM, Philadelphia (2005) 124. Wets, R.J.-B.: Stochastic programs with fixed recourse: the equivalent deterministic program. SIAM Rev. 16, 309–339 (1974) 125. Wolkowicz, H., Saigal, R., Vandenberghe, L. (eds.): Handbook of Semidefinite Programming. Kluwer Academic Publishers, Boston (2000) 126. Yosida, K., Hewitt, E.: Finitely additive measures. Trans. Am. Math. Soc. 72, 46–66 (1952) 127. Zhou, L.: A simple proof of the Shapley-Folkman theorem. Econom. Theory 3(2), 371–372 (1993) 128. Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Math. Program. (2018)
Index
A Acceptation set, 170 Accessible set, 234 Adjoint state, 215 Algebra, 117 Algorithm cutting plane, 268 Kelley, 268 Application measurable, 119 Approximation Moreau–Yosida, 38 Approximation in the sense of Chebyshev, 101
B Backward equation, 215 Biconjugate, 19 Bidual, 89 Bochner, 140 Borel σ -algebra, 118 Borel–Cantelli, 125 Borelian function, 121 Bounded in probability, 181
C Calmness, 51 Carathéodory, 142 Carathéodory theorem, 123 Castaing representation, 143 Cauchy sequence, 5 Class recurrent, 257 transient, 257
Compatibility, 209 Compatible multimapping, 144 Conditional expectation, 202 variance, 208 Cone normal, 31 recession, 41 tangent, 31 Cone of nonincreasing vectors, 80 Conjugate, 17, 21 Contact set, 96 Convergence in law, 180 in measure, 127 in probability, 127 narrow, 180 simple, 120, 233 Convex function, 3 set, 3 Convex closure, 20 Core, 26 Costate, 215 adapted, 217 Countable additivity, 122 Covariance, 185 Cycle, 257 Cylinder, 125
D Distribution empirical, 190 Disutility, 165 Domain, 1
© Springer Nature Switzerland AG 2019 J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2
309
310 Dominated convergence, 132, 137 generalized, 133 Doob–Dynkin, 121 Dynamic programming, 229, 230, 232, 239, 270 ergodic, 264 E Egoroff, 127 Entropy, 36 Epigraph, 4 Estimate of lack of convexity, 69 Exchange property, 142 Exhaustion sequence, 122 Expansive (non), 54 Extremal loading, 114 F Fatou, 133 Fenchel duality, 42 subdifferential formula, 46 Fenchel conjugate, 17 Fenchel–Young inequality, 17 Filtration, 218 Floor approximation, 121 Function moment-generating, 199 perspective, 63 rate, 199 recession, 62 simple, 120 spectral, 80 step, 120 symmetric, 80 G Gâteaux differentiable, 29 Gauge function, 7 Graph class, 257 H Hadamard differentiability, 187, 188 Half-space, 6 Hamiltonian, 250 Hyperplane, 6 I Indicatrix function, 2
Index Inequality Tchebycheff, 129 Infimal convolution, 60 Innovations, 214 Integrable selection, 156 Interpolation Lagrange, 104 Iteration Howard, 236 policy, 236 values, 235 K Kolmogorov equation, 226 L Lagrangian, 65 duality, 33 standard, 38 Legendre transform, 23 Lemma Schur, 78 Schur (generalized), 88 Limit Cesaro, 256 Linear rate, 256 Local solution, 72 Log-likelihood, 177 M Markov chain, 223 Maximum likelihood, 177 Measure, 122 completed, 125 Gaussian, 185 non-atomic, 156 positive, 97 probability, 122 with finite support, 97 Metric, 2 Minimizing sequence, 2 Minkowski sum, 7 Moments of distributions, 111 Monotone convergence, 131 Monotonicity, 168 Multiplier Lagrange with finite support, 97 N Neutral risk probability, 167 Nonanticipativity constraint, 218
Index Norm differentiable, 29 Frobenius, 75 Normal integrand, 146 O Oblique hyperplane, 19 Optimality condition, 35 Optimization semi-infinite, 95 P Path, 257 Point of interpolation, 104 Policy feedback, 229 open-loop, 248 Polyhedron, 57 Polynomial nonnegative, 106 of Chebyshev, 103 Pontryagin’s principle, 251 Positively homogeneous, 4 Preference, 166 Probability invariant, 256 Programmation semi-infinite, 95 Programming positive semidefinite, 77 positive semidefinite linear, 77 Projector, 209 R Recourse, 153 Reference (of a polynomial), 102 Regular probability, 179 Regularisation Lipschitz, 181 Regularity constraints, 96 Relative interior, 9 Risk-adverse, 166 Rotationally invariant, 80 S Sample approximation, 186 SDP relaxation, 87 Second-order cone, 91 Separable, 140 Separation, 6
311 Set feasible, 1 negligible, 124 solution, 2 σ -algebra, 117 complete, 124 Lebesgue, 125 product, 118 trivial, 117 Skorokhod–Dudley representation theorem, 185 Space measurable, 117 measure, 122 normed, 5 probability, 122 separable, 122 Stability, 36 State accessible, 257 communicating, 257 State equation, 215 linearized, 215 Subadditive, 4 Subdifferential, 22 partial, 44 Support, 97 function, 22 of a measure, 56 T Time exit, 239 stopping, 240 Trajectory, 273 Transition matrix (regular), 259 operator, 224 Translation invariance, 168 U Uniformly integrability, 134 V Value, 1 Value at risk, 174 Vertical hyperplane, 19 Vitali, 134 W Walk, 257