Texts and Readings in Mathematics Volume 83
Advisory Editor: C. S. Seshadri, Chennai Mathematical Institute, Chennai, India
Managing Editor: Rajendra Bhatia, Ashoka University, Sonepat, Haryana, India
Editorial Board: Manindra Agrawal, Indian Institute of Technology, Kanpur, India; V. Balaji, Chennai Mathematical Institute, Chennai, India; R. B. Bapat, Indian Statistical Institute, New Delhi, India; V. S. Borkar, Indian Institute of Technology, Mumbai, India; Apoorva Khare, Indian Institute of Science, Bangalore, India; T. R. Ramadas, Chennai Mathematical Institute, Chennai, India; V. Srinivas, Tata Institute of Fundamental Research, Mumbai, India
Technical Editor: P. Vanchinathan, Vellore Institute of Technology, Chennai, India
The Texts and Readings in Mathematics series publishes high-quality textbooks, research-level monographs, lecture notes and contributed volumes. Undergraduate and graduate students of mathematics, research scholars and teachers would find this book series useful. The volumes are carefully written as teaching aids and highlight characteristic features of the theory. Books in this series are co-published with Hindustan Book Agency, New Delhi, India.
Vivek S. Borkar · K. S. Mallikarjuna Rao
Elementary Convexity with Optimization
Vivek S. Borkar Department of Electrical Engineering Indian Institute of Technology Bombay Mumbai, Maharashtra, India
K. S. Mallikarjuna Rao Industrial Engineering and Operations Research (IEOR) Indian Institute of Technology Bombay Mumbai, Maharashtra, India
ISSN 2366-8717  ISSN 2366-8725 (electronic)
Texts and Readings in Mathematics
ISBN 978-981-99-1651-1  ISBN 978-981-99-1652-8 (eBook)
https://doi.org/10.1007/978-981-99-1652-8

Jointly published with Hindustan Book Agency. The print edition is not for sale in India. Customers from India please order the print book from: Hindustan Book Agency. ISBN of the Co-Publisher's edition: 978-819-57-8291-8

Mathematics Subject Classification: 46N10, 47N10

© Hindustan Book Agency 2023
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
The authors dedicate this book to the memory of Prof. Pravin P. Varaiya (29 October 1940 – 10 June 2022), a polymath and a gentleman, who left an indelible mark on several areas of control and communications and was a mentor and role model for many generations of young researchers in these areas.
Photo courtesy: Dileep Kalathil
Preface
The somewhat quaint title of this book is intentional. It is not a textbook of either convex analysis or optimization, but hovers somewhere on the boundary of the two. It is an idiosyncratic and eclectic collection of facts and factoids loosely strung together in a coherent story. The omissions are many. For example, many results from convexity that would be considered 'staple diet' are missing, because the emphasis is indeed on aspects of convexity central to optimization. Even then, the elegant results surrounding convex polytopes, barring a few, are missing. These are indeed staple diet for discrete optimization and linear programming, but the emphasis here is on what one commonly calls 'nonlinear programming'. Needless to say, the choice here is dictated by our personal taste, and many may wonder why some of the results that we chose to present are all that 'central'. They are there either because we have often found them useful in our own professional lives and feel that they deserve more attention than what they usually get, or because we feel that they 'round off' the collage we are trying to build in an elegant manner, pointing in the process to possibilities often left out of standard texts and courses. Other than that, what we offer is many nice short and/or elementary proofs, of course not always our own. We also hope to convey a viewpoint that continuous optimization is simply 'applied real analysis', just as queuing theory is applied probability.
A conspicuous omission is an extensive treatment of optimization algorithms, which any formal textbook worth its salt would have. It indeed deserves a book-length treatment of its own. We did not think we could pay more than lip service to it in a book this size and saw no point in such tokenism. For what it's worth, we have a somewhat novel selection and organization of material here for the theoretical aspects, and while we do not offer anything comparable for algorithms, we have attempted a fairly comprehensive bird's eye view of them at an intuitive level, garnished with just enough technical details. It certainly should serve as a good launchpad for the interested reader to take off from. For details, there are excellent accounts already around (see the bibliographical note in 'Epilogue'), which makes it a totally pointless exercise to replicate that whole universe here in a miniature version.
That said, we are not claiming total originality of organization either, only sufficient originality to make the exercise worthwhile. In particular, the influence of Luenberger [1] will be clear to the discerning eye. We are only two among a large community of researchers whose view of optimization theory was shaped by this book; see, e.g., the preface to [2].
The full story is as follows. The book evolved over several years, beginning with a course VB gave in what is now the TIFR Centre for Applicable Mathematics in Bengaluru. It has since been taught in the Indian Institute of Science in Bengaluru, the Tata Institute of Fundamental Research in Mumbai, and now at the Indian Institute of Technology Bombay in Mumbai. The content has evolved a lot over the years, as did VB's own feel for the material. At some point, VB started feeling that the course was sufficiently different that it was worth penning it down. One trigger for this was a friend and academic visitor, Mokshay Madiman of the University of Delaware, who sat in and liked it. We must add, however, that this material provided for about half the course, that too in a highly trimmed avatar. In particular, the course, rather the first half of it, skipped some of the more exotic or difficult material altogether or sketched it only in outline. The other half of the course was on algorithms, which VB taught from some of the existing classics mentioned later in the bibliographic note in 'Epilogue'. It did, however, broadly follow the structure of Chap. 6 here.
Another important influence has been the little pedagogical note of McShane [3]. Quoting an opinion expressed by L. C. Young that basic constrained optimization should make it to the advanced calculus courses of the future, McShane gave a proof of the Fritz John-Karush-Kuhn-Tucker conditions using only the Bolzano-Weierstrass and Weierstrass theorems. VB was so taken by the idea that he made a sport out of going as far as possible in the course (algorithms included) using these alone. That aspect colours this book as well. For us, the future that Young envisaged is already here!
Once the book project was initiated, KSMR was roped in by VB initially as a consultant, in order to draw upon his vast knowledge base, particularly about little known but very clever new proofs of well-known facts (including, believe it or not, something as commonplace as the contraction mapping theorem!). His role rapidly attained such proportions that he simply had to be a coauthor. Thereafter we launched on this pleasant exercise of putting together sleek treatments of known facts along with some not so central facts worth knowing regardless, either for their sheer elegance or because they add to the overall intuition needed in optimization, or because they serve as tantalizing pointers to what comes next, in particular in infinite-dimensional optimization.
VB's son Aseem, an ardent critic of his teaching skills—rather the lack thereof¹—was obliged to compensate for them by taking copious notes in class, aided further by in-class audio recordings that he took. These served as a template for this book. There are also many students and teaching assistants over the years who pointed out errors or gaps in the arguments, or assisted ably otherwise. The course having been taught over a few decades, they are many and only the recent ones are fresh in the memory, so VB refrains from mentioning any names. (Of the students who caught VB on the wrong foot in 'real time', a notable one is now our colleague.)
The plan of the book is as follows. The first two chapters have a hefty dose of assorted results about continuous functions (Chap. 1) and differentiability (Chap. 2) from advanced calculus and elementary real analysis. These cover most of what we have found useful in our professional lives as optimization/control theorists. En route, we prove basic results in optimization theory, such as existence of optima in Chap. 1 (Weierstrass theorem and its extensions) and necessary or sufficient conditions for local optimality in Chap. 2. Chapters 3–5 deal with convexity. Chapter 3 deals with properties of convex sets and Chap. 4 does likewise with convex functions. Chapter 5 builds upon the two to develop the basic theory of convex optimization. Chapter 6 is, as already mentioned, a brave effort to give a panoramic view of optimization algorithms, old and new. We do not claim that it is exhaustive; in fact it is far from it. Thus many readers may find a favourite topic missing. The attempt has been to make it sufficiently representative so as to give the reader a good feel for the field. For this chapter, we have benefited enormously from the classic texts of Bertsekas [4], Fletcher [5] and Luenberger and Ye [6]. Finally, the last chapter, aptly titled 'Epilogue', concludes with pointers to more advanced areas that build upon what's here, and a bibliographical note.
That leaves the pleasant task of thanking our non-academic contributors, our families, who were supportive as always, something that was much needed during a large part of this exercise that was carried out under the cloud of the coronavirus scare. We also thank the relatively safe haven of the Indian Institute of Technology Bombay campus that eased the process by no small means.

Mumbai, India
December 2022

¹ To quote him, 'Teachers sometimes talk very fast or write (on board) very fast, you do both at the same time.'
Vivek S. Borkar K. S. Mallikarjuna Rao
References

1. Luenberger, D.G.: Optimization by Vector Space Methods. Wiley, New York-London-Sydney (1969)
2. Anderson, E.J., Nash, P.: Linear Programming in Infinite-Dimensional Spaces. Wiley-Interscience Series in Discrete Mathematics and Optimization. Wiley, Chichester (1987)
3. McShane, E.J.: The Lagrange multiplier rule. Amer. Math. Monthly 80, 922–925 (1973)
4. Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA (2016)
5. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley-Interscience, New York (2001)
6. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming, International Series in Operations Research and Management Science, vol. 228, 4th edn. Springer, Cham (2016)
Contents
1 Continuity and Existence of Optima
  1.1 Some Basic Real Analysis
  1.2 Bolzano-Weierstrass Theorem
  1.3 Existence of Optima
  1.4 More on Continuous Functions
  1.5 Exercises
  References

2 Differentiability and Local Optimality
  2.1 Introduction
  2.2 Notions of Differentiability
  2.3 Conditions for Local Optimality
  2.4 Danskin's Theorem
  2.5 Parametric Monotonicity of Optimizers
  2.6 Ekeland Variational Principle
  2.7 Mountain Pass Theorem
  2.8 Exercises
  References

3 Convex Sets
  3.1 Introduction
  3.2 The Minimum Distance Problem
  3.3 Separation Theorems
  3.4 Extreme Points
  3.5 The Shapley-Folkman Theorem
  3.6 Helly's Theorem
  3.7 Brouwer Fixed Point Theorem
  3.8 Proof of Theorem 3.1
  3.9 Exercises
  References

4 Convex Functions
  4.1 Basic Properties
  4.2 Continuity
  4.3 Differentiability
  4.4 An Approximation Theorem
  4.5 Convex Extensions
  4.6 Further Properties of Gradients of Convex Functions
  4.7 Exercises
  References

5 Convex Optimization
  5.1 Introduction
  5.2 Legendre Transform and Fenchel Duality
  5.3 The Lagrange Multiplier Rule
  5.4 The Arrow-Barankin-Blackwell Theorem
  5.5 Linear Programming
  5.6 Applications to Game Theory
    5.6.1 Min–Max Theorem
    5.6.2 Existence of Nash Equilibria
  5.7 Exercises
  References

6 Optimization Algorithms: An Overview
  6.1 Preliminaries
  6.2 Line Search
  6.3 Algorithms for Unconstrained Optimization
  6.4 Algorithms for Constrained Optimization
  6.5 Special Topics
  6.6 Other Directions
  6.7 Exercises
  References

7 Epilogue
  7.1 What Lies Beyond
  7.2 Bibliographical Note
  References

Index
Chapter 1
Continuity and Existence of Optima
1.1 Some Basic Real Analysis

Optimization theory in finite dimensional spaces may be viewed as 'applied real analysis', since it depends on the analytic tools of the latter discipline for most of its foundations.¹ With this in mind, we begin here with a bare bones treatment of some basic concepts in real analysis. This is truly bare bones in the sense that our aim here is to be 'elementary', i.e., depend only on a few basic mathematical tools, rather than 'simple', i.e., use sophisticated mathematical tools to give concise proofs. In fact, we shall highlight a single key result in elementary real analysis, the Bolzano-Weierstrass theorem, and go a long way with that and very little else. But before we get there, we need to introduce some basic objects for analysis in finite dimensional Euclidean spaces.
We denote by Rd the d-dimensional Euclidean space, i.e., the vector space of d-dimensional real vectors, for an integer d ≥ 1. Our target will be to think in terms of sequences in Rd, say, x1, x2, . . ., where xi ∈ Rd for all i. We write this sequence as {xn} for brevity. One motivation for doing so is immediate. The ultimate goal of optimization theory is to provide a suitable framework to think about algorithms for optimization. But what is an algorithm? In the present context, we shall take as a working definition a rule that starts with an initial guess for the solution of the problem at hand, say x0 ∈ Rd, and updates it iteratively to x1, x2, . . ., with xn+1 a (possibly random) function of xn alone. Thus we have a sequence {xn}. This is a strong enough motivation to study the mathematics of sequences in Rd, but there are other reasons as well: even non-algorithmic issues (such as existence proofs that we discuss later) require thinking in terms of sequences. In short, sequences of real vectors are among the central objects of optimization theory in Rd, so we shall begin with a study of these.
¹ This applies to optimization over finite dimensional vectors, often called linear and nonlinear programming, and not to other domains of optimization such as combinatorial optimization, which deals with discrete structures.
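To make the working definition of an algorithm above concrete, here is a minimal Python sketch (an illustration of ours, not from the text): fixed-step gradient descent on a simple quadratic. The target point c and the step size are arbitrary choices for the demo.

```python
# Illustrative sketch (not from the text): an "algorithm" as an update rule
# x_{n+1} = F(x_n); here F is one fixed-step gradient descent step for
# f(x) = ||x - c||^2, whose gradient is 2(x - c).
c = [1.0, -2.0]          # the minimizer of f, chosen for the demo
step = 0.1               # fixed step size, also a demo choice

def F(x):
    return [xi - step * 2.0 * (xi - ci) for xi, ci in zip(x, c)]

x = [0.0, 0.0]           # initial guess x_0
for n in range(50):      # generates the sequence x_1, x_2, ...
    x = F(x)
print(x)                 # approaches c
```

Each iterate depends on the previous one alone, so the object of study is precisely a sequence {xn} in Rd.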
Recall the definition of a norm on Rd: a map x ∈ Rd → ‖x‖ ∈ [0, ∞) is said to be a norm if (i) ‖x‖ ≥ 0 ∀ x ∈ Rd, with ‖x‖ = 0 if and only if x = θ := the vector of all zeros, (ii) ‖αx‖ = |α|‖x‖ for α ∈ R, x ∈ Rd, and, (iii) the 'triangle inequality' holds, i.e., for x, y ∈ Rd, ‖x + y‖ ≤ ‖x‖ + ‖y‖. Unless mentioned otherwise, we shall use the Euclidean norm

‖x‖ := ( ∑_{i=1}^{d} |x(i)|² )^{1/2}
for x = [x(1), . . . , x(d)] ∈ Rd and the corresponding notion of distance between x and y in Rd given by ‖x − y‖. We say that a sequence {xn} ⊂ Rd converges to x* ∈ Rd, written as xn → x*, if ‖xn − x*‖ → 0. The point x* is then said to be a limit of the sequence xn, n ≥ 1. Another way of saying the same thing is as follows: For r > 0, let Br(x) := {y : ‖y − x‖ < r} denote the so-called open ball of radius r centred at x. Then xn → x* if for any r > 0, howsoever small, xn ∈ Br(x*) eventually. More formally, for all (written as ∀) r > 0, there exists (written as ∃) an n0 ≥ 1 such that n ≥ n0 implies (written as =⇒) xn ∈ Br(x*). Symbolically,

∀ r > 0, ∃ n0 ≥ 1 such that n ≥ n0 =⇒ xn ∈ Br(x*).

This also serves the purpose of introducing the notation ∀, ∃, =⇒ which we use often.
Suppose that for every x ∈ A ⊂ Rd, Br(x) ⊂ A for some r > 0 that can depend on x. In other words, all points sufficiently close to x should also be in A for any x in A. If so, we say that A is an open set. A seemingly more general (and perhaps more appealing) definition of xn → x* then would be: for any open set A ⊂ Rd such that x* ∈ A, ∃ n0 ≥ 1 such that n ≥ n0 =⇒ xn ∈ A. The two definitions are in fact equivalent, as is easily verified.
The notion of open set is very fundamental in mathematics. In particular, the 'open ball' Br(x) introduced above is open. So are the whole space and the empty set, which correspond to the limiting cases r = ∞ and r = 0 respectively. There are two properties² of open sets of immediate concern to us that are rather simple:
² These are in fact defining properties of abstract open sets in point set topology.
(1) Finite intersections of open sets are open.
To see this, take A1, . . . , Am open and suppose A := ∩_{i=1}^{m} Ai ≠ φ, the empty set. (If it is empty, there is nothing to prove, as the claim is vacuously true.) Then for x ∈ A, x ∈ Ai, hence ∃ ri > 0 such that B_{ri}(x) ⊂ Ai, 1 ≤ i ≤ m. But then for r := min{r1, . . . , rm} > 0, Br(x) ⊂ A, proving that A is open.
(2) Arbitrary unions of open sets are open.
To see this, let Aα, α ∈ I, be an arbitrary collection of open sets indexed by (a possibly uncountable) index set I. Then for x ∈ A := ∪_{α∈I} Aα, x ∈ Aα0 for some choice of α0 ∈ I, leading to: ∃ r > 0 such that Br(x) ⊂ Aα0 =⇒ Br(x) ⊂ A. Thus A is open.
Note that the closed interval [0, 1] is not open and can be written as

[0, 1] = ∩_{n=1}^{∞} ( −1/n, 1 + 1/n ),

a countable intersection of open sets. Thus (1) above cannot be relaxed.
Our definition of open sets was in terms of open balls, whose definition depends on the specific norm used. Suppose two different norms ‖·‖ and ‖·‖′ on Rd lead to the same family of open sets. Then they are said to be compatible or equivalent. In this case, one can show that there exist constants b, c > 0 such that

b‖x‖′ ≤ ‖x‖ ≤ c‖x‖′.   (1.1)
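As a quick numerical sanity check of (1.1), the following sketch (ours, not from the text) compares the Euclidean norm with the sup norm ‖x‖∞ := maxᵢ |x(i)| on random vectors; b = 1 and c = √d are one admissible choice of constants for this pair of norms.

```python
# Check b*||x||_inf <= ||x||_2 <= c*||x||_inf with b = 1, c = sqrt(d),
# on random vectors (an illustrative sketch, not from the text).
import math, random

d = 5
for _ in range(1000):
    x = [random.uniform(-10.0, 10.0) for _ in range(d)]
    sup_norm = max(abs(xi) for xi in x)
    euc_norm = math.sqrt(sum(xi * xi for xi in x))
    assert sup_norm <= euc_norm + 1e-12                  # b = 1
    assert euc_norm <= math.sqrt(d) * sup_norm + 1e-12   # c = sqrt(d)
print("(1.1) verified on all samples with b = 1, c = sqrt(d)")
```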
An important notion related to open sets is that of denseness. A set A is dense in Rd if any open set in Rd contains an element of A. In other words, given any point x in Rd and an ε > 0, we can find a y ∈ A such that ‖x − y‖ < ε. More generally, a set A is dense in a set B if A ⊂ B and for any x ∈ B and any ε > 0, we can find a y ∈ A such that ‖x − y‖ < ε. Equivalently, for any x ∈ B\A, there exists a sequence {xn} ⊂ A such that xn → x. Thus, for example, rationals are dense in R.
A related and equally important notion is that of a closed set. A set B ⊂ Rd is said to be closed if its complement Bᶜ is open. But there is another equivalent definition which is often more useful. We say that a set B is closed if, whenever a sequence {xn} ⊂ B satisfies xn → x* (say), then x* ∈ B. The equivalence is easy to establish:
(1) Suppose A := Bᶜ is open and {xn} ⊂ B satisfies xn → x*. If x* ∉ B, then x* ∈ A and since A is open, ∃ r > 0 such that Br(x*) ⊂ A, implying xn ∉ Br(x*), i.e., ‖xn − x*‖ ≥ r ∀n, a contradiction. Hence x* ∈ B as desired.
(2) Suppose B has the property that {xn} ⊂ B, xn → x* =⇒ x* ∈ B, but B is not closed. Then A := Bᶜ is not open, hence ∃ x ∈ A such that ∀ r > 0, Br(x) ∩ B ≠ φ. Pick xn ∈ B_{1/n}(x) ∩ B, n ≥ 1. Then xn ∈ B, xn → x, but x ∈ A =⇒ x ∉ B, a contradiction. Thus B must be closed.
From properties of open sets stated above, taking complements leads to the observation that finite unions and arbitrary intersections of closed sets are closed. Together, these imply that for any set D ⊂ Rd, the largest open set contained in D (equivalently, the union of open sets contained in D) and the smallest closed set containing D (equivalently, the intersection of closed sets containing D) are well-defined notions.
These are called respectively the interior of D, denoted int(D), and the closure of D, denoted D̄. Clearly, int(D) ⊆ D ⊆ D̄ and D ≠ φ =⇒ D̄ ≠ φ (as it contains D). However, it is possible to have int(D) = φ even when D ≠ φ; consider, e.g., the set D := {(x, y) ∈ R² : 0 ≤ x ≤ 1, y = 0}. There is no way one can put a ball Br((x, y)) of radius r > 0, no matter how small, around some (x, y) ∈ D such that Br((x, y)) ⊂ D. The set ∂D := D̄\int(D) (= D̄ ∩ (int(D))ᶜ) is called the boundary of D. Clearly, the following hold:
(∗) ∂D is always a closed set, a closed set contains its boundary, an open set is disjoint from its boundary, and the closure of a set is the union of its interior and its boundary. Also, a closed set with empty interior is its own boundary.
Finally, note that the space Rd satisfies the definitions of both open and closed set and is therefore both. By complementation, so is the empty set φ. For both, the boundary is the empty set. For any other subset of Rd, the boundary will be necessarily nonempty. With this terminology settled, we shall call B̄r(x) := {y ∈ Rd : ‖y − x‖ ≤ r} the closed ball of radius r centred at x and distinguish it from the open ball Br(x) defined above.
We briefly touch upon "relative" counterparts of some of the concepts above. (In fact, the notion is much more general.) For C ⊂ Rd, A ⊂ C is relatively open in C if it is the intersection of C with an open set in Rd. Closed sets, interior, closure, boundary, etc. relative to C are defined correspondingly. In what follows, we have several occasions to think in these terms for C = a hyperplane H in Rd, i.e., a set of the type {x ∈ Rd : ⟨x − x0, n⟩ = 0}. H is characterized by any point x0 ∈ Rd that lies on it, and the unit vector ±n that is normal to it. It is a translate of the (d − 1)-dimensional subspace of Rd that is orthogonal to n, translated by the vector x0.
Finally, we summarize some results surrounding the notion of continuity. Recall that a function f : C ⊂ Rd → Rm is continuous at x ∈ C if given any ε > 0, we can find a δ > 0 such that y ∈ C, ‖y − x‖ < δ, implies ‖f(x) − f(y)‖ < ε. It is a continuous function if it is continuous at all x in C. An equivalent definition is:
(†) f⁻¹(A) := {x ∈ C : f(x) ∈ A} is relatively open in C for any open set A ⊂ Rm.
To show that our original definition implies (†), let y ∈ A with A open. Then f⁻¹({y}) is either empty, in which case there is nothing to show, or not. If not, let Bε(y) ⊂ A for some ε > 0. Let x ∈ C, f(x) = y. Then there exists a δ > 0 such that x′ ∈ Bδ(x) ∩ C =⇒ f(x′) ∈ A, implying (†). Conversely, let (†) hold. Let f(x) = y. Then for ε > 0, f⁻¹(Bε(y)) is relatively open and therefore contains Bδ(x) ∩ C for some δ > 0, implying the first definition. This establishes the equivalence of the two. A third equivalent definition is:
(††) xn → x in C implies f(xn) → f(x).
The proof of the equivalence of (††) with the other two definitions is left as an exercise. Note that (†) is also equivalent to the identical statement with 'open' replaced by 'closed'. This is because f⁻¹(Aᶜ) = (f⁻¹(A))ᶜ for any A ⊂ Rm; in particular, if A, B are disjoint, then f⁻¹(A), f⁻¹(B) are disjoint.
A stronger notion than continuity is that of uniform continuity. A function f : C ⊂ Rd → R is uniformly continuous if for any ε > 0, we can find a δ > 0 such that if x, y ∈ C, ‖x − y‖ < δ, then |f(x) − f(y)| < ε. The difference with the definition of continuity is that the δ is now independent of x. Some useful facts about uniform continuity are:
(U1) If C is closed and bounded, a continuous f : C → R is uniformly continuous. Closedness is essential here. For example, f(x) = x is uniformly continuous on (0, 1), but g(x) = 1/x is not (check this).
(U2) If C is not closed and f : C → R is uniformly continuous, then f extends to a unique continuous f̃ : C̄ → R, i.e., there exists a unique (uniformly) continuous f̃ : C̄ → R such that f̃(x) = f(x) for x ∈ C. This statement is false without uniform continuity; consider, e.g., g(x) = sin(1/x), x ∈ (0, 1].
1.2 Bolzano-Weierstrass Theorem

Armed with the above preliminaries, we now look at the existence issue in optimization theory. We recall the following terminology: For A ⊂ R, define the 'least upper bound of A', written as l.u.b.(A), and the 'greatest lower bound of A', written as g.l.b.(A), as follows:

l.u.b.(A) := min{z : z ≥ x ∀x ∈ A},  g.l.b.(A) := max{z : z ≤ x ∀x ∈ A}.

Two remarks are in order here:
1. If in the first (resp., the second) case, the set on the right hand side is empty, we take the quantity to be +∞ (resp., −∞).
2. We take it as a given that both these quantities are well defined as elements of R when the corresponding set on the right hand side is nonempty. The existence of these is not as innocuous as it may seem and is closely related to the axiomatics for the real number system, see, e.g., Rudin [8].
An alternative notation for l.u.b.(A) (resp., g.l.b.(A)) is sup(A) (resp., inf(A)), called the supremum (resp., infimum) of A. For functions f : C ⊂ Rd → R, one correspondingly defines the infimum of f on C, given by inf_{x∈C} f(x) := inf{y : y = f(x) for some x ∈ C} and sometimes abbreviated to inf_C f. Likewise we define the supremum of f on C as sup_{x∈C} f(x) = sup_C f := sup{y : y = f(x) for some x ∈ C}. Note that even when inf_C f > −∞, it need not be the case that there exists an x* ∈ C such that f(x*) = inf_C f. For example, for C = R and f(x) = e⁻ˣ, inf_C f = 0, but f(x) > 0 ∀ x.
In case there exists an x* ∈ C with f(x*) = inf_C f, we call the infimum the minimum and x* a minimizer of f in C. The set of such minimizers will be denoted by argmin_C f or simply argmin(f) when C is clear from the context. Analogously, if for some x̂ ∈ C one has f(x̂) = sup_C f, we call the supremum the maximum of f on C and x̂ a maximizer, with the set of maximizers being denoted by argmax_C f or argmax(f) when C is implicit. A local minimum of f is its minimum (as opposed to a minimizer) in some open set O ⊂ C. By abuse of terminology, a point x ∈ C is said to be a local minimum of f in C if f attains at x its infimum on some open neighborhood of x. A similar definition applies to a local maximum. A minimizer, resp. maximizer, over the entire domain C will be referred to as a global minimum, resp. maximum, when a distinction is to be made from its local counterpart.
Say that a sequence xn, n ≥ 1, in R is monotone if either xn ≤ xn+1 ∀n, in which case we say it is monotone increasing, or xn ≥ xn+1 ∀n, in which case we say it is monotone decreasing. We say that it is strictly increasing, resp. strictly decreasing, if the inequality is strict ∀n.³ Here and later, we say that a set A is bounded if there exists a K, 0 < K < ∞, such that ‖x‖ ≤ K ∀ x ∈ A, equivalently, A ⊂ B_K(θ), where θ ∈ Rd is the vector of all zeros. One of the simple but incredibly useful facts about real sequences is the following.

Monotone bounded sequences in R converge.

Without loss of generality, we may verify this for the first case, i.e., xn ≤ xn+1 ∀n and |xn| ≤ K < ∞. Let x* := l.u.b.{xn}. Then K ≥ x* ≥ xn ∀n. For any η > 0, if xn ∈ (x* − η, x*], then xm ∈ (x* − η, x*] ∀ m ≥ n. Suppose there exists an η > 0 such that xn ∉ (x* − η, x*] ∀n. Then l.u.b.{xn} ≤ x* − η, implying x* ≠ l.u.b.{xn}, a contradiction. Hence xn ∈ (x* − η, x*] from some n on. It follows that xn → x*, proving the claim. When xn → x* and {xn} is monotone increasing or decreasing, one writes xn ↑ x*, resp. xn ↓ x*.
An important consequence of this is the following cornerstone of real analysis. Recall that a subsequence of a sequence {xn} is a sequence {xn(k)} where {n(k)} ⊂ {n}.

Theorem 1.1 (Bolzano-Weierstrass theorem) Every bounded sequence in Rd has a convergent subsequence.

Proof Let d = 1 and {xn} ⊂ [a, b] for some −∞ < a < b < ∞. Let a1 = a, b1 = b and repeat the following steps: At step n, given the interval [an, bn], pick one of the half-intervals [an, (an + bn)/2], [(an + bn)/2, bn] so that it contains infinitely many xn's.
³ Some authors use increasing/decreasing in place of strictly increasing or decreasing, and non-decreasing/non-increasing for what we have called increasing/decreasing.
This is always possible since otherwise there would be only finitely many xn's, which is not the case. (Recall here that we treat xn, xm as distinct entities for m ≠ n even when xm = xn, i.e., they have the same numerical value.) Relabel the chosen interval as [an+1, bn+1]. Then an+1 ≥ an, bn+1 ≤ bn. Furthermore, bn ≥ an with |bn − an| = 2^{−(n−1)}(b − a) → 0. Hence the monotone sequences {an}, {bn} converge to the common limit x* := l.u.b.({an}) = g.l.b.({bn}) = a + c(b − a), where c has the binary expansion 0.c1c2c3 . . . with ci = 1 if ai < ai+1, and = 0 otherwise. Pick xn(k) ∈ [ak, bk] such that n(k) > n(j), j < k. This is always possible since the interval [ak, bk] contains infinitely many xn's. Then xn(k) → x*.
For arbitrary d > 1, use an induction argument as follows. Suppose the claim holds for some d ≥ 1. For a bounded {xn} ⊂ Rd+1, write xn = [x̃n, x̌n] with x̃n ∈ Rd, x̌n ∈ R. Using the induction hypothesis, pick {n(k)} ⊂ {n} such that x̃n(k) → some x̃*. Then pick {k(m)} ⊂ {k} such that x̌n(k(m)) → x̌*. Then xn(k(m)) → x* := [x̃*, x̌*].
We shall see several important consequences of this result throughout this book. It also motivates the following generalization of the notion of 'limit' above. We say that x* is a limit point of {xn} if there exists a subsequence {xn(k)} of {xn} such that xn(k) → x*. The Bolzano-Weierstrass theorem then says that every bounded sequence in Rd has a limit point. One can in general speak of a limit point x* of a set A ⊂ Rd if there exists a sequence {xn} ⊂ A such that xn → x*. On the other hand, we can say that a sequence {xn} converges to a set A if inf_{y∈A} ‖xn − y‖ → 0 as n ↑ ∞. This is equivalent to saying that the set of limit points of {xn} is contained in Ā. It is also easy to see that the set of limit points of a sequence is closed, simply because a limit point of limit points is also a limit point! (Check this.) The set of limit points, however, can be empty, e.g., for the sequence xn = n, n ≥ 1.
Here is one basic fact of real analysis that can be derived from the Bolzano-Weierstrass theorem. It requires the notion of a Cauchy sequence. A sequence {xn} ⊂ Rd is said to be Cauchy if lim_{m,n↑∞} ‖xm − xn‖ = 0. A priori, a Cauchy sequence can have at most one limit point (check this!). The following result then shows that it has to be exactly one.

Theorem 1.2 If {xn} ⊂ Rd is Cauchy, then xn → x* for some x*.

Proof Given ε > 0, we can find N large such that ‖xm − x_N‖ < ε for m ≥ N, which is possible by the Cauchy property. Hence {xn} is bounded and by the Bolzano-Weierstrass theorem, we can find a subsequence {xn(k)} such that xn(k) → some x*. Then by the Cauchy property, ‖xm − x*‖ ≤ ‖xm − xn(k)‖ + ‖xn(k) − x*‖ → 0 by letting m, k ↑ ∞.
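The halving scheme in the proof of Theorem 1.1 can be run numerically. The sketch below (ours, for d = 1) applies it to the bounded sequence xn = (−1)ⁿ + 1/n; keeping 'the half with more of the remaining terms' is a finite-sample stand-in for 'a half-interval containing infinitely many xn's'.

```python
# Interval halving from the proof of Theorem 1.1 (d = 1), our sketch.
N = 100000
x = [(-1) ** n + 1.0 / n for n in range(1, N + 1)]   # bounded: lies in [-1, 2]

a, b = -1.0, 2.0
idx = list(range(N))                 # indices of terms still in [a, b]
for _ in range(20):                  # 20 halvings
    mid = (a + b) / 2.0
    left = [i for i in idx if x[i] <= mid]
    right = [i for i in idx if x[i] > mid]
    if len(left) >= len(right):      # keep a half still containing many terms
        idx, b = left, mid
    else:
        idx, a = right, mid

print((a + b) / 2.0)   # close to -1, a limit point x* of the sequence
```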
1.3 Existence of Optima

One of the main application areas of the Bolzano-Weierstrass theorem is the existence theory in optimization. An immediate consequence thereof in fact is the following key existence theorem.⁴ Many other advanced existence theorems in optimization and control use the same proof idea in spirit.

Theorem 1.3 (Weierstrass theorem) If A ⊂ Rd is closed and bounded, and f : A → R is continuous, then f attains its maximum and minimum on A, i.e., ∃ x, x′ ∈ A satisfying f(x) ≤ f(y) ≤ f(x′) ∀ y ∈ A.

Proof Let α := inf_{x∈A} f(x). Then ∃ {xn} ⊂ A such that f(xn) ↓ α, otherwise there would have been some ε > 0 such that f(x) ≥ α + ε ∀ x ∈ A, in which case inf_{x∈A} f(x) ≥ α + ε > α, a contradiction. By the Bolzano-Weierstrass theorem, we can find xn(k) → x* for some x* ∈ A. By continuity, f(xn(k)) → f(x*) = α. Proof for the maximum is similar.

There are further variants of this result. One of the very important ones allows us to drop the requirement of boundedness of A:

Theorem 1.4 Suppose C ⊂ Rd is closed and f : C → R is continuous and satisfies

lim_{‖x‖↑∞} f(x) = ∞.   (‡)
Then f attains its minimum on C.

Proof As before, take {xn} ⊂ C such that f(xn) ↓ α. If α = −∞, we must have ‖xn‖ → ∞, because the alternative is prevented by the previous theorem. Thus ‖xn‖ → ∞ and f(xn) → −∞, contradicting (‡). So we must have α > −∞. Then by (‡) again, {xn} is bounded. Now argue as in the proof of the Weierstrass theorem to conclude.

A function f satisfying (‡) is said to be coercive. A similar argument shows that a continuous f : C → R attains its maximum if lim_{‖x‖↑∞} f(x) = −∞.
A further variant, also very useful, relaxes the continuity assumption on f. For {xn} ⊂ R, define

lim inf_{n↑∞} xn := lim_{n↑∞} inf_{m≥n} xm = sup_n inf_{m≥n} xm,   (1.2)

lim sup_{n↑∞} xn := lim_{n↑∞} sup_{m≥n} xm = inf_n sup_{m≥n} xm.   (1.3)
Note that the sequence inf_{m≥n} xm, n ≥ 1, is monotone increasing because the infimum is being taken on successively smaller sets. Thus its limit is well defined, with ±∞ as a possible value if this infimum is always ±∞. The aforementioned monotone increasing property also establishes the equality of the last two expressions in (1.2). Similar statements apply to (1.3) with 'monotone increasing' replaced by 'monotone decreasing' in the discussion above. Clearly, if lim inf_{n↑∞} xn = lim sup_{n↑∞} xn = x* (say), then x* = lim_{n↑∞} xn. In general, lim sup_{n↑∞} f(xn) ≥ lim inf_{n↑∞} f(xn), which is obvious from the definitions.
Say that f : Rd → R is lower semicontinuous (l.s.c. for short) if

lim inf_{n↑∞} f(xn) ≥ f(x)

whenever xn → x in Rd. Similarly, f : Rd → R is upper semicontinuous (u.s.c. for short) if

lim sup_{n↑∞} f(xn) ≤ f(x)

whenever xn → x in Rd. What this means in the former case is that at any point of discontinuity x̃, all limit points of f(x) as x → x̃ are ≥ f(x̃). In the latter case, one reverses this inequality. For example, consider f : [0, 1] → R defined by: f(x) = e⁻ˣ for 0 ≤ x < 1 and f(1) = 0. This f is l.s.c. If instead we take f(1) = 1, it is u.s.c. For a more elaborate example, consider f : [0, 1] → R given by f(0) = −1, f(x) = sin(1/x) for 0 < x ≤ 1. This again is l.s.c., but if we change to f(0) = 1, it is u.s.c. If f(0) = 0, it is neither, as the set of limit points of sin(1/x) as x ↓ 0 is [−1, 1] (see Fig. 1.1). Clearly, if f is l.s.c., −f is u.s.c. and vice versa, and a function that is both l.s.c. and u.s.c. will be continuous.

⁴ The result is often stated with 'closed and bounded' replaced by 'compact', a notion we have not defined. The two are equivalent for Rd.
Fig. 1.1 Graph of sin(1/x)
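A quick numerical look (ours) at the l.s.c. example above: with f(0) = −1 and f(x) = sin(1/x) for 0 < x ≤ 1, the values along any sequence xn → 0 never fall below −1 = f(0), so lim inf_{n↑∞} f(xn) ≥ f(0); with the choice f(0) = 1, the analogous check would fail.

```python
# Approximate check of lower semicontinuity at 0 for the example above.
import math

def f(x):
    return -1.0 if x == 0.0 else math.sin(1.0 / x)

xs = [1.0 / n for n in range(1, 200001)]      # a sequence x_n -> 0
tail_min = min(f(x) for x in xs[100000:])     # proxy for lim inf f(x_n)
print(tail_min, ">=", f(0.0))                 # indeed tail_min >= -1 = f(0)
```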
From the very nature of these definitions, one expects 'one-sided' counterparts of the Weierstrass theorem to hold for such functions. This is indeed the case, as we see below.

Theorem 1.5 A lower semicontinuous function on a closed bounded set C attains its minimum. An upper semicontinuous function on a closed bounded set C attains its maximum.

Proof Consider the first case, i.e., an l.s.c. f. Argue as in the proof of the Weierstrass theorem to get a sequence {xn} ⊂ C such that f(xn) ↓ α := inf_C f and, without loss of generality (i.e., by dropping to a subsequence guaranteed by the Bolzano-Weierstrass theorem if necessary), let xn → x. Then by the lower semicontinuity of f,

α = lim_{n↑∞} f(xn) ≥ f(x) ≥ α =⇒ f(x) = α.
Proof for the maximum of a u.s.c. function is similar.
This is often useful because of the following facts.

Theorem 1.6 If f is a pointwise supremum of continuous (more generally, l.s.c.) functions, it is l.s.c. Likewise, if it is the pointwise infimum of continuous (more generally, u.s.c.) functions, it is u.s.c.

Proof Suppose A ⊂ Rd, fα : A → R for α ∈ some index set I, are continuous and f(x) := sup_α fα(x) < ∞ ∀ x ∈ A. Let xn → x in A. Then for any α,

lim inf_{n↑∞} f(xn) ≥ lim inf_{n↑∞} fα(xn) = lim_{n↑∞} fα(xn) = fα(x),   (1.4)

where the inequality follows from the definition of f and the equalities from the continuity of fα. Taking supremum over α in the rightmost expression, the claim follows. If the fα are only l.s.c., then (1.4) gets replaced by

lim inf_{n↑∞} f(xn) ≥ lim inf_{n↑∞} fα(xn) ≥ fα(x)

and the result follows as before. The second claim is proved similarly.
Theorem 1.6 is useful in 'worst case optimization', e.g., minimizing the maximum loss (where the maximum is over parameters not in our control), or maximizing the minimum gain. The theorem reduces these to, respectively, minimizing an l.s.c. function or maximizing a u.s.c. function when the loss, resp. gain, is continuous. One also has the natural counterparts of Theorem 1.5 for l.s.c./u.s.c. functions:

Theorem 1.7 A lower semicontinuous function f on a closed set C satisfying the condition lim_{‖x‖↑∞} f(x) = ∞ attains its minimum. An upper semicontinuous function g on a closed set C satisfying lim_{‖x‖↑∞} g(x) = −∞ attains its maximum.
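As an illustration of such worst case optimization, here is a small grid-based sketch (ours; the loss f below is an arbitrary continuous choice): g(x) := max_y f(x, y) is a pointwise maximum of continuous functions, and we minimize it over a grid on a closed bounded set.

```python
# Minimax on a grid: minimize g(x) = max over y of f(x, y). Our sketch.
import math

def f(x, y):
    return (x - y) ** 2 + 0.1 * math.sin(5.0 * y)   # an arbitrary loss

xs = [i / 100.0 for i in range(101)]                # grid on [0, 1] for x
ys = [j / 100.0 for j in range(101)]                # grid on [0, 1] for y

def g(x):
    return max(f(x, y) for y in ys)                 # the "worst case" loss

x_star = min(xs, key=g)
print(x_star, g(x_star))    # approximate minimizer of the worst-case loss
```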
1.4 More on Continuous Functions

In this section we collect some useful facts about continuous functions. Given functions fn, f : C ⊂ Rd → R, we say that fn → f pointwise if, as n ↑ ∞, fn(x) → f(x) ∀ x ∈ C, and uniformly if sup_{x∈C} |fn(x) − f(x)| → 0.

Theorem 1.8 If continuous functions fn : C ⊂ Rd → R, n ≥ 1, converge uniformly to an f : C → R as n ↑ ∞, then f is continuous.

Proof Let z ∈ C and fix ε > 0. Pick n sufficiently large so that sup_{x∈C} |fn(x) − f(x)| < ε/3. Pick a δ > 0 such that |fn(y) − fn(z)| < ε/3 whenever y ∈ C and ‖y − z‖ < δ. Then for such y,

|f(y) − f(z)| ≤ |f(y) − fn(y)| + |fn(y) − fn(z)| + |fn(z) − f(z)| < ε.

The claim follows.
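The uniformity hypothesis in Theorem 1.8 cannot be dropped. As a preview of the first example below (fn(x) = xⁿ on [0, 1]), the following sketch (ours) estimates the sup distance of fn from its discontinuous pointwise limit on a grid; it stays bounded away from 0.

```python
# f_n(x) = x^n converges pointwise but not uniformly on [0, 1]. Our sketch.
xs = [i / 1000.0 for i in range(1001)]     # grid on [0, 1]

def f_limit(x):                            # the pointwise limit
    return 1.0 if x >= 1.0 else 0.0

for n in (1, 5, 25, 125):
    sup_dist = max(abs(x ** n - f_limit(x)) for x in xs)
    print(n, sup_dist)   # stays near 1 (the true sup over [0, 1) is 1 for all n)
```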
Examples of continuous functions converging to a function that is, resp., l.s.c., u.s.c. or neither are easy to construct. Consider

fn(x) = 0 for x ≤ 0,  = xⁿ for 0 ≤ x ≤ 1,  = 1 for x ≥ 1.   (1.5)

Then as n ↑ ∞, fn(·) decreases pointwise to f(x) = 1 for x ≥ 1 and = 0 for x < 1, which is discontinuous at x = 1, but is u.s.c. (see Fig. 1.2). If we make fn(x) = x^{1/n} for 0 ≤ x ≤ 1 instead, keeping it = 0 (resp., = 1) for x ≤ 0 (resp., x ≥ 1) as before, we have fn(·) increasing pointwise to f̂(x) = 0 for x ≤ 0 and = 1 for x > 0, which is discontinuous at x = 0, but is l.s.c. (see Fig. 1.3). Finally,

fn(x) := e^{−nx}/(1 + e^{−nx})

converges pointwise to f(x) = 1 for x < 0, = 1/2 at x = 0, and = 0 for x > 0, which is discontinuous at 0 and neither u.s.c. nor l.s.c. (see Fig. 1.4). See Gordon [6] for an interesting extension of the above result.
The following result is often useful.

Theorem 1.9 Any continuous f : Rd → R is approximated uniformly on any closed bounded set C ⊂ Rd by polynomials. If it is k times continuously differentiable (i.e.,
differentiable k times with continuous derivatives up to kth order), both the function and its derivatives up to order k or less can be approximated uniformly by a polynomial and its corresponding derivatives.
Proof We shall fix a closed bounded set C ⊂ Rd and, by modifying f if necessary outside an open ball containing C, assume that f is bounded. Also, by suitable scaling if required, assume that

sup_{x,y∈C} ‖x − y‖ < 1.   (1.6)

(Check that this is legitimate.) Let {φn} be a sequence of non-negative functions satisfying the conditions

∫_{Rd} φn(x) dx = 1,   (1.7)

lim_{n→∞} ∫_{‖x‖≥ε} φn(x) dx = 0 ∀ ε > 0.   (1.8)

First we show that the convolution of f and φn,

f ∗ φn(x) := ∫_{Rd} f(x − y) φn(y) dy = ∫_{Rd} φn(x − y) f(y) dy,

converges to f(x) as n → ∞, uniformly on closed bounded subsets. Indeed,
|f ∗ φn(x) − f(x)| ≤ ∫_{‖y‖≤ε} φn(y) |f(x − y) − f(x)| dy + ∫_{‖y‖>ε} φn(y) |f(x − y) − f(x)| dy
  ≤ sup_{‖z‖≤ε} |f(x + z) − f(x)| + 2 sup_{z∈Rd} |f(z)| ∫_{‖y‖>ε} φn(y) dy.

Thus,

sup_{x∈C} |f ∗ φn(x) − f(x)| ≤ sup_{‖z‖≤ε, x∈C} |f(x + z) − f(x)| + 2 sup_{z∈Rd} |f(z)| ∫_{‖y‖>ε} φn(y) dy,

and therefore,

lim sup_{n→∞} sup_{x∈C} |f ∗ φn(x) − f(x)| ≤ sup_{‖z‖≤ε, x∈C} |f(x + z) − f(x)|.
Since ε > 0 is arbitrary and f is uniformly continuous on C, we conclude that f ∗ φn converges uniformly to f on C.
Next we restrict to φn that are zero outside a closed ball and continuously differentiable in its interior. We claim that

∇(f ∗ φn)(x) = f ∗ ∇φn(x).   (1.9)

Let h ∈ Rd with ‖h‖ < 1. For all y ∈ Rd, we have

|φn(x + h − y) − φn(x − y) − h · ∇φn(x − y)| = |∫₀¹ h · {∇φn(x + sh − y) − ∇φn(x − y)} ds| → 0 as h → 0.

This follows from the uniform continuity of ∇φn. Moreover, note that the right hand side is zero for all y ∈ Rd outside a closed bounded set. Using the fact that

f ∗ φn(x) = ∫_{Rd} f(y) φn(x − y) dy,

we have

|f ∗ φn(x + h) − f ∗ φn(x) − h · (f ∗ ∇φn)(x)| ≤ η(h) ∫_K |f(y)| dy

for some η(h) → 0 as h → 0 and a suitably chosen closed bounded set K. This proves our claim. In an analogous manner, we can show that the convolution is as many times continuously differentiable as φn is, even if f is not smooth.
To complete the proof of the theorem, we choose φn appropriately so that f ∗ φn and its derivatives are polynomials. Consider

pn(r) = (1/cn)(1 − r²)ⁿ for |r| ≤ 1, and pn(r) = 0 otherwise,

where cn is a constant such that the integral of pn is 1. Let φn(x) = pn(x1) pn(x2) · · · pn(xd). We leave it as an exercise to show that (1.7), (1.8) hold for this choice of {φn}. It is also not hard to verify that f ∗ φn and its derivatives are polynomials on B1(θ), completing in particular the proof of the first part. The uniform convergence of derivatives follows analogously, using the fact that (1.9) ensures k times continuous differentiability of f ∗ φn on C for all k ≤ n (check this).
The first statement of the above theorem is known as the Weierstrass approximation theorem.⁵ The next result is a trivial special case of a much more general result known as Urysohn's lemma. Let A, B ⊂ Rd be disjoint closed sets and a, b ∈ R with b > a.

Lemma 1.1 There exists a continuous f : Rd → [a, b] such that f(x) = a for x ∈ A and f(x) = b for x ∈ B.

Proof For any closed set C ⊂ Rd, define dC(x) := min_{y∈C} ‖x − y‖, which is well defined by the Weierstrass theorem. Furthermore, A, B being disjoint, dA(x) + dB(x) ≠ 0 ∀x. Then the function

f(x) := (a dB(x) + b dA(x)) / (dA(x) + dB(x))

satisfies the requirement.
⁵ A related result known as the Müntz-Szász theorem says that a continuous function can be uniformly approximated by linear combinations of {1, x^{λ1}, x^{λ2}, . . .}, λi ∈ (0, ∞), if and only if ∑_{i=1}^{∞} 1/λi = ∞. See [1].
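The kernel construction in the proof of Theorem 1.9 can also be run numerically. The following sketch (ours; d = 1, f(x) = |x| and Riemann sums standing in for the integrals are simplifying choices) convolves f with pn and watches the uniform error on a small interval shrink as n grows.

```python
# Landau-kernel smoothing from the proof of Theorem 1.9 (d = 1), our sketch.
ys = [k / 1000.0 for k in range(-1000, 1001)]      # grid on [-1, 1]
dy = 1.0 / 1000.0

f = abs                                            # function to approximate
xs = [k / 100.0 for k in range(-40, 41)]           # test points in [-0.4, 0.4]
for n in (4, 16, 64):
    c_n = sum((1.0 - y * y) ** n for y in ys) * dy   # normalizer: integral 1

    def conv(x, n=n, c_n=c_n):                       # (f * p_n)(x)
        return sum(f(x - y) * (1.0 - y * y) ** n / c_n for y in ys) * dy

    err = max(abs(conv(x) - f(x)) for x in xs)
    print(n, err)        # the uniform error on the test interval decreases
```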
Urysohn's lemma facilitates the following important result.

Theorem 1.10 (Tietze extension theorem) Let C ⊂ Rd be closed and f : C → [a, b], b > a, be continuous. Then f extends continuously to a function f* : Rd → [a, b].

Proof By scaling f if necessary, let a = −1, b = 1. Let A(1) := f⁻¹([1/3, 1]), B(1) := f⁻¹([−1, −1/3]). These are closed and disjoint, so (if nonempty) there exists f1 : Rd → [−1/3, 1/3] such that f1(x) = 1/3 on A(1) and = −1/3 on B(1). Then max_{x∈C} |f(x) − f1(x)| ≤ 2/3. Consider f − f1 : C → [−2/3, 2/3] and let A(2) := (f − f1)⁻¹([2/9, 2/3]), B(2) := (f − f1)⁻¹([−2/3, −2/9]). Then pick f2 such that max_{x∈C} |f(x) − f1(x) − f2(x)| ≤ 4/9. Iterating this construction, we have continuous functions

fn : Rd → [−2^{n−1}/3ⁿ, 2^{n−1}/3ⁿ], n ≥ 1,

satisfying

|fn(·)| ≤ 2^{n−1}/3ⁿ,   (1.10)

max_{x∈C} |f(x) − ∑_{i=1}^{n} fi(x)| ≤ (2/3)ⁿ.   (1.11)

By (1.10), ∑_{i=1}^{n} |fi(x)| is absolutely summable for each x with sum ≤ 1. Hence ∑_{i=1}^{n} fi(x) converges pointwise to an f* : Rd → [−1, 1]. Since

sup_{x∈Rd} |f*(x) − ∑_{i=1}^{n} fi(x)| ≤ lim sup_{n≤m↑∞} sup_{x∈Rd} |∑_{i=1}^{m} fi(x) − ∑_{i=1}^{n} fi(x)| ≤ ∑_{m=n}^{∞} (2/3)ᵐ → 0 as n ↑ ∞,

∑_{i=1}^{n} fi converges uniformly to f*, whence f* is continuous. From (1.11), we see that f* = f on C. This completes the proof.
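The distance-function formula in the proof of Lemma 1.1 is easy to compute directly. Here is a rendering (ours) in the plane, with finite point sets standing in for the closed sets A and B.

```python
# Urysohn-type function f = (a*d_B + b*d_A)/(d_A + d_B), our sketch.
import math

A = [(0.0, 0.0), (0.0, 1.0)]           # f should equal a on A
B = [(3.0, 0.0), (3.0, 1.0)]           # f should equal b on B
a, b = 0.0, 1.0

def dist(x, S):                        # d_S(x) = min over y in S of ||x - y||
    return min(math.dist(x, y) for y in S)

def f(x):
    dA, dB = dist(x, A), dist(x, B)
    return (a * dB + b * dA) / (dA + dB)

print(f((0.0, 0.0)), f((3.0, 1.0)), f((1.5, 0.5)))   # a, b, in between
```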
The next result is a standard workhorse for proving convergence of many algorithms. Say that a map f : Rd → Rd is a contraction if there exists an α ∈ (0, 1) such that

‖f(x) − f(y)‖ ≤ α‖x − y‖ ∀ x, y ∈ Rd.   (1.12)
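Before the formal statement (Theorem 1.11 below), here is a quick numerical preview (ours): f(x) = cos x is a contraction on [0, 1], since it maps [0, 1] into itself and |f′(x)| = |sin x| ≤ sin 1 < 1 there, and iterating it converges to its unique fixed point.

```python
# Fixed point iteration x_{n+1} = cos(x_n), our sketch of (1.12)-(1.13).
import math

x = 0.5                      # any starting point x_0 in [0, 1]
for n in range(100):
    x = math.cos(x)          # one application of the contraction
print(x, math.cos(x))        # x* with cos(x*) = x*, about 0.7390851
```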
In words, (1.12) says that an application of f shrinks the distance between any pair of distinct points to at most α times its previous value. For example, for d = 1, any continuously differentiable function with its derivative bounded in absolute value by α will be a contraction by the mean value theorem. In fact a similar statement holds for d > 1 if the standard matrix norm of the Jacobian matrix of f is bounded from above by α. Note that in particular, (1.12) implies that f is uniformly continuous.
The theorem below requires the notion of a fixed point of a map Rd → Rd. The terminology itself suggests the definition. For any f : Rd → Rd, a point x ∈ Rd is said to be a fixed point of f if f(x) = x. In other words, application of f to x leaves it unchanged. For example, for d = 1, for f(x) = x/3, x = 0 is the only fixed point, whereas for f(x) = x, all points are fixed points. At the other extreme, f(x) = x + 1 has no fixed points, which shows that the condition α < 1 in the following theorem cannot be relaxed to α ≤ 1. The following simple proof is due to Palais [7].

Theorem 1.11 (Banach contraction mapping theorem) If f : Rd → Rd satisfies (1.12) for some 0 < α < 1, then f has a unique fixed point x* and the iteration

xn+1 = f(xn), n ≥ 0,   (1.13)

with any choice of x0 ∈ Rd, converges to x*.

Proof For x, y ∈ Rd, we have

‖x − y‖ ≤ ‖x − f(x)‖ + ‖y − f(y)‖ + ‖f(x) − f(y)‖ ≤ ‖x − f(x)‖ + ‖y − f(y)‖ + α‖x − y‖
=⇒ ‖x − y‖ ≤ (1/(1 − α)) (‖x − f(x)‖ + ‖y − f(y)‖).

Hence if x, y above are fixed points of f, we must have x = y. That is, the fixed point, if it exists, is unique. For x0 ∈ Rd, let xn := fⁿ(x0) := f ◦ f ◦ · · · ◦ f(x0) (n times) for n ≥ 1. Then for m, n ≥ 1,

‖fⁿ(x0) − fᵐ(x0)‖ ≤ (1/(1 − α)) (‖fⁿ(x0) − fⁿ⁺¹(x0)‖ + ‖fᵐ(x0) − fᵐ⁺¹(x0)‖) ≤ ((αⁿ + αᵐ)/(1 − α)) ‖x0 − f(x0)‖ → 0

as n, m ↑ ∞. Thus {xn} is Cauchy and therefore converges to (say) x*. Letting n ↑ ∞ in (1.13), we have x* = f(x*), i.e., x* is the fixed point of f. See Daskalakis et al. [4] for a 'converse' statement.
Our final result of this section requires the notion of equicontinuity. A family A of real-valued functions on C ⊂ Rd is said to be equicontinuous at x ∈ C if, given ε > 0, we can find a δ > 0 such that
‖y − x‖ < δ, y ∈ C, implies |f(x) − f(y)| < ε for all f ∈ A. A is equicontinuous if it is so at all x ∈ C.

Theorem 1.12 (Arzela-Ascoli theorem) Let C ⊂ Rd be closed and bounded. If a family A of maps C → R is equicontinuous and satisfies

sup_{f∈A} |f(x0)| < ∞   (1.14)

for some x0 ∈ C, then every sequence {fn} ⊂ A has a further subsequence that converges uniformly to some continuous f : C → R. The converse also holds.

Remark 1.1 If (1.14) holds for some x0 ∈ C, it holds for all x0 ∈ C by equicontinuity. (Check this.)

Proof Let C0 := {x1, x2, . . .} be a countable dense subset of C.⁶ Thus given any x ∈ C\C0, we can find a sequence {xn} ⊂ C0 such that xn → x. By a diagonal argument, pick a subsequence of {fn}, denoted by {fn} again by abuse of notation, which converges pointwise in R∞ (:= the countably infinite product of copies of R) on C0. The 'diagonal argument' goes as follows. Let C1 := the subsequence of C0 such that {fn(x1)} converges along C1 as n ↑ ∞. This is possible by the Bolzano-Weierstrass theorem. Inductively, pick a subsequence Ci+1 of Ci along which {fn(xi+1)} converges as n ↑ ∞. Let C* be the subsequence of C0 whose kth element is the kth element of Ck. Then it is easily verified that along the 'diagonal subsequence' C*, denoted by {xi} again by abuse of notation, {fn(xi)} converges as n ↑ ∞ for all i ≥ 1.
Fix ε > 0. Let x, y ∈ C0 be such that ‖x − y‖ < δ for a δ > 0 chosen such that |fn(x′) − fn(y′)| < ε ∀n, ∀ x′, y′ ∈ C0 satisfying ‖x′ − y′‖ < δ. This is possible by equicontinuity. Then choose n large enough so that |fn(x) − f(x)|, |fn(y) − f(y)| < ε. Then

|f(x) − f(y)| ≤ |f(x) − fn(x)| + |f(y) − fn(y)| + |fn(x) − fn(y)| < 3ε.

Hence f : C0 → R is uniformly continuous and extends to a unique uniformly continuous f : C → R. We claim that fn → f uniformly on C. If not, there exist ε > 0 and {xn} ⊂ C such that |fn(xn) − f(xn)| > ε along a subsequence of {fn}, denoted as {fn} again by abuse of notation. By the Bolzano-Weierstrass theorem, by dropping to a further subsequence if necessary, let xn → x ∈ C. Then by the triangle inequality,

|fn(xn) − fn(x)| ≥ |fn(xn) − f(xn)| − |f(xn) − f(x)|.

As n ↑ ∞, the second term on the right goes to zero by continuity of f. The term on the left goes to zero by equicontinuity. The first term on the right is > ε by hypothesis, leading to a contradiction. Thus fn → f uniformly.
For the converse, if (1.14) does not hold at some x0 ∈ C, then there exists {fn} ⊂ A such that |fn(x0)| → ∞. But then {fn} cannot have a subsequence that converges at x0, a contradiction. Thus (1.14) holds for all x0 ∈ C. If A is not equicontinuous at some x ∈ C, then there exist ε > 0 and {fn} ⊂ A, {xn} ⊂ C such that xn → x but |fn(xn) − fn(x)| ≥ ε. Let fn → f uniformly along a subsequence, denoted again by {fn} by abuse of notation, for some continuous f : C → R. Then lim sup_{n→∞} |f(xn) − f(x)| ≥ ε, which contradicts the continuity of f in view of xn → x. This completes the proof.

⁶ The existence of such a set is a topological fact that we take for granted. For simple C such as, e.g., closed balls, C0 can be taken to be the set of points in C with rational coordinates.
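To see the role of equicontinuity in Theorem 1.12, consider the family fn(x) = sin(nx) on [0, 1], which is not equicontinuous at x = 0; the sketch below (ours) exhibits points xn → 0 with |fn(xn) − fn(0)| = 1, consistent with the converse part of the theorem.

```python
# A family that is NOT equicontinuous at 0: f_n(x) = sin(nx). Our sketch.
import math

for n in (10, 100, 1000, 10000):
    x_n = math.pi / (2.0 * n)                       # x_n -> 0
    gap = abs(math.sin(n * x_n) - math.sin(0.0))    # |f_n(x_n) - f_n(0)|
    print(n, x_n, gap)                              # the gap stays at 1.0
```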
1.5 Exercises

1.1 Show that any closed set in Rd can be written as an intersection of countably many open sets.
1.2 Let f : Rd → Rm be continuous and B ⊂ Rd closed and bounded. Show that the image set {f(x) : x ∈ B} of B is closed and bounded.
1.3 Show that the claim (1.1) is symmetric in the two norms modulo the choice of b, c. Give a direct proof that the norm ‖x‖∞ := maxᵢ |xi| ∀ x = [x1, . . . , xd] ∈ Rd is equivalent to the Euclidean norm and find the best constants b, c.
1.4 Let C denote the set of limit points of xn ∈ Rd as n ↑ ∞, assumed non-empty. Show that C is closed.
1.5 Prove (∗) in Sect. 1.1.
1.6 Prove (U1), (U2).
1.7 Let Cn ⊂ Rd, n ≥ 1, be a sequence of nonempty closed bounded sets satisfying Cn+1 ⊂ Cn ∀n ≥ 1. Show that ∩_{n=1}^{∞} Cn ≠ φ.
1.8 Suppose fn : C ⊂ Rd → R with C closed and bounded, are continuous functions for n = 1, 2, . . . , ∞, and fn(x) monotonically increases (resp., decreases) to f∞(x) for all x ∈ C as n ↑ ∞. Show that fn → f∞ uniformly. (This is Dini's theorem [5].)
1.9 Show that the two definitions of lim inf and lim sup in (1.2), (1.3) respectively, are equivalent.
1.10 Let f : C ⊂ Rd → R be lower semicontinuous. Show that argmin_C f is a closed set if C is closed.
1.11 Let f : C × Rm → R be lower semicontinuous, where C is closed and bounded. Let M(y) := {x ∈ C : f(x, y) = min f(·, y)}. Show that the set {(x, y) ∈ C × Rm : x ∈ M(y)} is closed (i.e., yn → y, xn ∈ M(yn) and xn → x in C imply x ∈ M(y)). If C = an interval [a, b], show that the map y → the smallest element of M(y) is l.s.c.
1.12 For f, g : C ⊂ Rd → R, show that inf_C(f + g) ≥ inf_C f + inf_C g and sup_C(f + g) ≤ sup_C f + sup_C g.
1.13 Given f, g : C ⊂ Rd → R, show that |inf_C f − inf_C g|, |sup_C f − sup_C g| ≤ sup_C |f − g|.
1.14 Let C ⊂ Rd, D ⊂ Rs be closed bounded sets and f : C × D → R a continuous function. Show that x ∈ C → min_{y∈D} f(x, y), resp. max_{y∈D} f(x, y), are continuous functions.
1.15 Let f : C ⊂ Rd → R be lower semicontinuous and f(x) > 0 for every x ∈ C. Show that 1/f is upper semicontinuous.
1.16 Let f : Rd → R be lower semicontinuous and bounded from below, and consider fn(x) = inf{f(y) + n‖x − y‖ : y ∈ Rd}, n ≥ 1. Show that fn is continuous, f1 ≤ f2 ≤ · · · and fn → f as n → ∞. Conclude that f is the pointwise supremum of all continuous g such that g(·) ≤ f(·) pointwise. (This result is due to Baire [2].)
1.17 Let f : R → R be a continuous function and let fn(x) = f(x + 1/n).
  1. Show that fn → f uniformly over any closed and bounded interval.
  2. Show that the convergence need not be uniform over R.
1.18 For b > a, let f : [a, b] → R be continuous and let {pn} be a sequence of polynomials such that max_{a≤x≤b} |f(x) − pn(x)| → 0 as n ↑ ∞ as in the Weierstrass approximation theorem. Show that if f is not a polynomial, then dn → ∞, where dn is the degree of the polynomial pn.
1.19 Let B = {x ∈ Rd : ‖x − z‖ < ε} for a fixed z ∈ Rd and ε > 0. Let f : B → Rd be a function such that ‖f(x) − f(y)‖ ≤ α‖x − y‖ ∀x, y ∈ B, with a constant α < 1. Furthermore assume that ‖z − f(z)‖ < (1 − α)ε. Show that f has a fixed point in B. (This is taken from [3].)
1.20 Prove the 'Remark' following the statement of Theorem 1.12.
References 1. Almira, J.M.: Müntz type theorems. I Surv. Approx. Theory 3, 152–194 (2007) 2. Baire, R.: Leçons sur les fonctions discontinues. Les Grands Classiques Gauthier-Villars [Gauthier-Villars Great Classics]. Éditions Jacques Gabay, Sceaux (1995). Reprint of the 1905 original
References
21
3. Brooks, R.M., Schmitt, K.: The contraction mapping principle and some applications. In: Electronic Journal of Differential Equations. Monograph, vol. 9. Texas State University—San Marcos, Department of Mathematics, San Marcos, TX (2009) 4. Daskalakis, C., Tzamos, C., Zampetakis, M.: A converse to Banach’s fixed point theorem and its CLS-completeness. In: STOC’18—Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pp. 44–50. ACM, New York (2018) 5. Dini, U.: Fondamenti per la teorica delle funzioni di variabili reali. T. Nistri, Pisa (1878) 6. Gordon, R.A.: When is a limit function continuous? Math. Mag. 71(4), 306–308 (1998) 7. Palais, R.S.: A simple proof of the Banach contraction principle. J. Fixed Point Theory Appl. 2(2), 221–223 (2007) 8. Rudin, W.: Principles of Mathematical Analysis, 3rd edn. International Series in Pure and Applied Mathematics. McGraw-Hill Book Co., New York-Auckland-Düsseldorf (1976)
Chapter 2
Differentiability and Local Optimality
2.1 Introduction In this chapter we give an overview of a variety of facts concerning optimality of a point relative to a neighborhood of it, in terms of ‘local’ objects such as derivatives. We first introduce the different notions of derivatives and beginning with some familiar conditions for optimality from calculus, build up various generalizations thereof. Throughout this chapter and the rest of the book, we use the standard order notaf (x) → 0 as g(x) → 0 tion: for f : Rd → R and g :→ [0, ∞), f (x) = o(g(x)) if g(x) or ∞, as applicable. In the latter case, x, g(x) can be, either or both, discrete valued f (x) ≤ K for a constant (e.g., integers). Similarly, f (x) = O(g(x)) if lim supg(x)↓0 g(x) K < ∞, likewise for g(x) ↑ ∞.
2.2 Notions of Differentiability We first recall some notions of differentiability. Let f : Rd → R. Given x ∈ Rd and a unit vector h ∈ Rd , f is said to be differentiable in the direction h if the limit f (x; h) := lim
0 0 in case of the directional derivative), but at a rate that can depend on h. Clearly, Gâteaux differentiability along h implies directional differentiability along ±h with the values of the directional derivatives along h, −h matching in magnitude and opposite in sign. Both definitions can be extended to f : Rd → Rm , m ≥ 1, componentwise. The strongest notion of differentiability we shall need is that of Fréchet differentiability. Say that f : Rd → Rm is Fréchet differentiable if there exists a linear map Dx f : Rd → Rm (called the Fréchet derivative) such that f (x + h) − f (x) →0 sup − D f (h) x h=1 as → 0. This is equivalent to the statement f (x + h) = f (x) + Dx f (h) + o(), implying that the map h → f (x) + Dx f (h) serves as the linear approximation to f f (x) at point x. The difference with Gâteaux differentiability is that the ratio f (x+h)− d converges uniformly in h ∈ R , h = 1. The definition of Fréchet differentiability of f automatically implies that f is Gâteaux differentiable in all directions. The converse is not true, as the following example from [2] shows. Define f : R2 → R by: for θ := (0, 0) ∈ R2 , f (x, y) =
0 x5 (x−y)2 +x 4
if (x, y) = θ, if (x, y) = θ.
Then given any unit normal n = (a, b) (say) in R2 and 0 = α ∈ R, α2a5 f (αa, αb) − f (θ ) = . α (a − b)2 + α 2 a 4 As α → 0, this converges to 0 if a = b, f (θ ; n) = a if a = b.
Thus it is Gâteaux differentiable in all directions. But it is not Fréchet differentiable. If it were, we would be able to find scalars p, q such that for all unit normals (a, b), ( p + q)a = a when a = b and pa + qb = 0 when a = b. It is easy to verify that this cannot hold for all unit normals (a, b) at the same time. Hence f is not Fréchet differentiable.
2.3 Conditions for Local Optimality
25
If m = 1, i.e., f : Rd → R, then Dx f : Rd → R is linear and therefore of the form Dx f (y) = a, y ∀y ∈ Rd for a unique a ∈ Rd . We call this a ∈ Rd as the gradient of f at x, and denote it as ∇ f (x). More generally, for m ≥ 2 and f = ( f 1 , . . . , f m ), Dx f (y) = A x y ∀y ∈ Rd is a linear map for each fixed x for some d × m matrix A x , given by: the (i, j)th element of A x is ∂ f i /∂ x j (x) for 1 ≤ i ≤ m, 1 ≤ j ≤ d. This matrix is known as the Jacobian matrix of f and by slight abuse of notation, denoted again as Dx f . For m = d, it is a square matrix. Note that the gradient is a d-vector, so x → ∇ f (x) is a map R d → Rd . The Jacobian matrix of this map is called the Hessian, denoted by ∇ 2 f (x), and is a d × d matrix whose (i, j)th element 2 2 2 is ∂ x∂i ∂fx j (x). Since these cross-derivatives are symmetric in i, j, i.e., ∂ x∂i ∂fx j = ∂ x∂ j ∂fxi (a fact known from calculus), this matrix is always symmetric. We also have the classical first order and second order Taylor formulas: for x0 ∈ Rd , f (x) = f (x0 ) + ∇ f (x0 ), x − x0 + o(x − x0 ), 1 f (x) = f (x0 ) + ∇ f (x0 ), x − x0 + (x − x0 )T ∇ 2 f (x0 )(x − x0 ) 2 + o(x − x0 2 ).
(2.1)
(2.2)
These yield in particular local approximations of f in terms of affine (i.e., linear plus a constant), resp., quadratic functions. If the first (resp., the first two) derivative(s), i.e., the gradient (resp., the gradient and the Hessian) exist at all points in the domain of interest and are continuous, the function is said to be continuously differentiable (resp., twice continuously differentiable). In these cases, we have the more exact forms: 1 f (x) = f (x0 ) +
∇ f ((1 − t)x0 + t x), x − x0 dt,
(2.3)
0
f (x) = f (x0 ) + ∇ f (x0 ), x − x0 1 + 2
1 (x − x0 )T ∇ 2 f ((1 − t)x0 + t x)(x − x0 ) dt.
(2.4)
0
2.3 Conditions for Local Optimality Recall that given f : C ⊂ Rd → R, a point x0 ∈ C is a local minimum for f if there exists > 0 such that x − x0 < , x ∈ C implies f (x) ≥ f (x0 ). It is a strict local minimum if this inequality is strict for all such x = x0 . Suppose f is continuous and has a directional derivative in the direction of a unit vector n. If this derivative were strictly negative, then by (2.1), f (x0 + n) < f (x0 ) for arbitrarily small > 0, so x0 could not have been a local minimum. Thus the
26
2 Differentiability and Local Optimality
directional derivative at x0 in any direction must be non-negative. A similar argument shows that if all directional derivatives are strictly positive, then x0 must be a local minimum. If f is Gâteaux differentiable along the linear span of n, then applying this to the directions ±n, it follows that for a local minimum, the Gâteaux derivative must be zero along any direction. If f is Fréchet differentiable, then likewise the Fréchet derivative, i.e., the gradient, must vanish at x0 . Thus we have the following theorem. Theorem 2.1 (Fermat) Let f be either Gâteaux or Fréchet differentiable and let x ∗ be a local minimum. Then Dx ∗ f (h), resp. ∇ f (x ∗ ) = θ (∀ h ∈ Rd in the former case). Clearly the converse does not hold: a point of zero gradient, also called a critical point, can correspond to a local maximum, a saddle point (i.e., a point at which for some lines passing through it, the function attains its local maximum while along other lines passing through it, it attains its local minimum), a point of inflection (point on a curve where the derivative of its slope changes sign), etc. See Fig. 2.1. Thus we need to use information about higher derivatives in order to narrow down the options. Suppose f is twice Fréchet differentiable. Then from (2.2), it follows that if x0 is a local minimum, then ∇ 2 f (x0 ) must be positive semidefinite and if ∇ 2 f (x0 ) is positive definite, then x0 is a strict local minimum. Summarizing, we have the following: Theorem 2.2 If x0 is a strict local minimum of f , ∇ 2 f (x0 ) is positive semidefinite. If x0 satisfies: ∇ f (x0 ) = θ and ∇ 2 f (x0 ) is positive definite, then x0 is a strict local minimum. The converse is also true for the latter claim. That it is not so for the former is evident from the example f : x ∈ R → x 3 ∈ R, x0 = 0. Then f (x0 ) = f
(x0 ) = 0, but x0 is not a local minimum. This takes care of unconstrained minima. It does not work if the minimization is performed on a subset of Rd because for points on the boundary of this set, not
Point of Local Maximum Point of Inflection Point of Local Minimum
(a) Fig. 2.1
(b)
2.3 Conditions for Local Optimality
27
all directions will be available. This leads to the celebrated ‘Karush-Kuhn-Tucker’ (KKT) conditions. We give an elementary proof due to McShane [11]. Let k, s be non-negative integers and let f ; g1 , . . . , gk ; h 1 , . . . , h s : C → R for some open set C ⊂ Rd be continuous and continuously differentiable. Theorem 2.3 Let x0 ∈ C satisfy the equality and inequality constraints: gi (x) = 0, h r (x) ≤ 0 ∀i, r.
(2.5)
for x = x0 and f (x0 ) ≤ f (x) ∀x ∈ C satisfying (2.5). Then there exist scalars λ0 , λ1 , . . . , λk , and μ1 , . . . , μs such that: ∂gi ∂h i ∂f f (x0 ) + λi (x0 ) + μi (x0 ) = 0 ∀ j. ∂x j ∂x j ∂x j i=1 i=1 k
λ0
s
(2.6)
Furthermore, (i) λ0 ≥ 0 and μr ≥ 0 ∀r ; (ii) (Complementary slackness condition:) h r (x0 ) < 0 =⇒ μr = 0 for 1 ≤ r ≤ s; (iii) if ∇gi (x0 ), 1 ≤ i ≤ k, and those ∇h r (x0 ), r, 1 ≤ r ≤ s, for which h r (x0 ) = 0 (i.e., the constraint is active or binding) are linearly independent, then λ0 = 1 without loss of generality. Proof Without any loss of generality, we may take x0 = θ := the zero vector, f (x0 ) = 0, and for some ≤ s, h i (x0 ) = 0, 1 ≤ i ≤ ; h i (x0 ) < 0, < i ≤ s. Since C is open, we can pick ∗ > 0 such that B ∗ (θ ) ⊂ C and h i (x) < 0 ∀ i > , x ∈ B ∗ (θ ). Pick 0 < < ∗ . We now replace the constrained optimization problem by an unconstrained optimization problem by incorporating an additional term in the objective function to penalize the violation of constraints. For this purpose, first we change f (x) to f (x) + x2 , which also has a local minimum at θ and this local minimum is a strict local minimum, i.e., there is no other local minimum in a sufficiently small neighbourhood thereof. Then we add the aforementioned penalty and define F : C → R by ⎞ ⎛ k F(x) = f (x) + x2 + N ⎝ gi (x)2 + h +j (x)2 ⎠ i=1
j=1
for some N ≥ 1, where h +j (x) := max(h j (x), 0). The third term above puts a penalty of N gi (x)2 on the violation of the constraint gi (x) = 0 and a penalty N (h +j (x))2 on the violation of the constraint h j (x) ≤ 0. Claim: ∀ ∈ (0, ∗ ), there exists an N = N () ≥ 1 such that
28
2 Differentiability and Local Optimality
⎞ ⎛ k f (x) + x2 + N ⎝ gi (x)2 + (h +j (x))2 ⎠ > 0 ∀ x ∈ ∂ B (θ ). i=1
(2.7)
j=1
Proof of the above claim: If the claim were not true, there exist ∈ (0, ∗ ), Nm ↑ ∞ and xm ∈ ∂ B (θ ), m ≥ 1, such that f (xm ) + xm ≤ −Nm 2
k
gi (xm ) + 2
i=1
h i+ (xm )2
∀ m ≥ 1.
(2.8)
i=1
By the Bolzano-Weierstrass theorem, we may drop to a subsequence if necessary so that xm → some point x ∗ ∈ ∂ B (θ ). By continuity, f (xm ) → f (x ∗ ). Dividing both sides of (2.8) by −Nm and letting m ↑ 0, we get k
gi (x ∗ )2 +
i=1
(h +j (x ∗ ))2 = 0,
j=1
so that x ∗ satisfies (2.5). Hence f (x ∗ ) ≥ f (x0 ) = 0. But by (2.8), f (xm ) ≤ − 2 , implying f (x ∗ ) ≤ − 2 , a contradiction. So (2.7) must hold. ˆ ≤ F(x0 ) = Returning to the main proof, let xˆ := arg min x∈B (θ) F(x). Then F(x) ˆ = θ, the zero vector. That 0, so by (2.7), xˆ ∈ / ∂ B (θ ), i.e., xˆ ∈ B (θ ), with ∇ F(x) is,
∂f ∂gi ∂h i (x) ˆ + 2 xˆ j + 2N gi (x) ˆ (x) ˆ + 2N h i+ (x) ˆ (x) ˆ = 0 ∀ j. ∂x j ∂ x ∂ xj j i=1 i=1 k
(2.9)
Choose a sequence m ↓ 0 and apply the above for ∗ = m . For each m, denote the corresponding x, ˆ N as xˆ m , N m and rewrite the corresponding Eq. (2.9) as λm 0
k s 2 xˆ mj ∂f ∂gi m ∂h i m + f (xˆ m ) + λim (xˆ ) + μim (xˆ ) = 0 ∀ j, ∂x j Z m i=1 ∂x j ∂x j i=1
(2.10)
by dividing through by Z m , where m m + m Z m := [1, 2N m g1 (xˆ m ), . . . , 2N m gk (xˆ m ), 2N m h + 1 ( xˆ ), . . . , 2N h ( xˆ ), 0, . . . , 0] m m m is a (1 + k + s)-dimensional vector, and U m := [λm 0 , . . . , λk , μ1 , . . . , μs ] is the m m m unit vector in the direction of Z . In particular, λ0 > 0, μi ≥ 0 ∀ i ≤ and μim = 0 ∀ i > . By the Bolzano-Weierstrass theorem, as m ↑ ∞, by dropping to a subsequence if necessary, U m → U := a unit vector [λ0 , . . . , λk , μ1 , . . . , μs ] with λ0 ≥ 0, μi ≥ 0 for i ≤ and μi = 0 for i > . Also, xˆ m → x0 . By continuity of the functions concerned, we get (2.6). If ∇gi (x0 ), 1 ≤ i ≤ k, and those ∇h r (x0 ) for
2.4 Danskin’s Theorem
29
all r, 1 ≤ r ≤ s, for which h r (x0 ) = 0, are linearly independent, then λ0 = 0 would lead to a contradiction. So λ0 > 0. Dividing the equation through by λ0 , we may set λ0 = 1. Condition (2.6) is known as the Fritz John condition and with λ0 = 1, as the Karush-Kuhn-Tucker (KKT) condition. This is the ‘first order necessary condition’ for constrained optimization and if all gi (·), h j (·) ≡ 0, it reduces to the corresponding first order condition for unconstrained optimization, viz. that ∇ f (x0 ) = θ . It is also possible to give second order conditions, see, e.g., [12]. The numbers λ0 , λ1 , . . . , λk , μ1 , . . . , μs are known as multipliers. If λ0 = 0, they are known as Lagrange multipliers and if λ0 = 0, they are known as abnormal multipliers. See the book [6] for more details.
2.4 Danskin’s Theorem In many applications of optimization, one optimizes over the worst case behaviour of a system, such as minimizing the maximum possible loss, or maximizing the minimum possible profit. This leads to, e.g., minimization of a function of the type g(x) = max y f (x, y) where y is some parameter. This requires us to compute appropriate derivatives of g in terms of those of f . Suppose f is continuously differentiable and the maximum above is attained at a unique y ∗ (x) where the map x → y ∗ (x) is also continuously differentiable. Then by the chain rule of differentiation, ∂ ∂g (x) = max f (x, y) ∂ xi ∂ xi y ∂ f (x, y ∗ (x)) = ∂ xi ∂f ∂ y ∗j (x) ∂f (x, y) ∗ + (x, y) ∗ = y=y (x) y=y (x) ∂ x i ∂ xi ∂yj j ∂f = (x, y) ∗ , y=y (x) ∂ xi
(2.11)
where the second term in (2.11) is zero because y ∗ (x) maximizes f (x, ·). Thus the gradient of g is the partial gradient of f restricted to the x variable, evaluated at arg max f (x, ·). This is the vanilla version of Danskin’s theorem [4]. It is also called ‘envelope theorem’, particularly in economics where it is used, e.g., for ‘comparative statics’ [10]. The full version is given below. The notation is as follows. The function f : C × D → R, where C ⊂ Rd is open and D ⊂ Rm is closed bounded, is continuous, and its partial gradient with respect to the x variable alone denoted by
∇ x f (x) :=
∂f ∂f (x, y), . . . , (x, y) ∈ Rd ∂ x1 ∂ xd
30
2 Differentiability and Local Optimality
is assumed to be continuous. Let g(x) := max y∈D f (x, y). Here the maximum is attained on a non-empty closed bounded set M(x) ⊂ D because D is closed and bounded. (See Exercise 1.5 of Chap. 1.) Theorem 2.4 (Danskin) The map g : Rd → R has a directional derivative in every direction, given by (2.12) g (x; n) = max ∇ x f (x, y), n y∈M(x)
for every unit vector n ∈ Rd . Proof Recall from Exercise 1.14 of Chap. 1 that g is continuous. Let x0 ∈ C, n a unit vector in Rd , and xn = x0 + n n → x0 for some 0 < n ↓ 0 in C. Let yn ∈ argmax f (xn , ·), n ≥ 0. Then g(xn ) − g(x0 ) f (xn , yn ) − f (x0 , y0 ) = n n f (xn , yn ) − f (xn , y0 ) f (xn , y0 ) − f (x0 , y0 ) = + n n f (xn , y0 ) − f (x0 , y0 ) ≥ n = ∇ x f (x0 + αn n, y0 ), n with αn ∈ [0, n ], n ≥ 1, where we have used the fact f (xn , yn ) ≥ f (xn , y0 ) for the inequality and the mean value theorem for the last equality. Hence lim inf n↑∞
g(xn ) − g(x0 ) n
implying
lim inf n↑∞
≥ ∇ x f (x0 , y), n ∀ y ∈ M(x0 ),
g(xn ) − g(x0 ) n
≥ max ∇ x f (x0 , y), n. y∈M(x0 )
(2.13)
Similarly, g(xn ) − g(x0 ) f (xn , yn ) − f (x0 , y0 ) = n n f (xn , yn ) − f (x0 , yn ) f (x0 , yn ) − f (x0 , y0 ) = + n n f (xn , yn ) − f (x0 , yn ) ≤ n = ∇ x f (x0 + βn n, yn ), n, for some βn ∈ [0, n ] ∀n ≥ 1. In view of Exercise 1.11 of Chap. 1, this leads to
2.5 Parametric Monotonicity of Optimizers
lim sup n↑∞
g(xn ) − g(x0 ) n
31
≤ max ∇ x f (x0 , y), n. y∈M(x0 )
Combining (2.13), (2.14), the claim follows.
(2.14)
2.5 Parametric Monotonicity of Optimizers Consider a twice continuously differentiable f : R2 → R such that the map f (·, y) has a unique minimum at x ∗ (y) for each fixed y ∈ R. Then ∂f (x, y) ∗ = 0. x=x (y) ∂x Suppose
∂2 f < 0 ∀ x, y. ∂ x∂ y
(2.15)
Then as y increases, ∂∂ xf (·) decreases, so the point x ∗ (y) at which it crosses zero will increase. In other words, (2.15) implies that the minimizer of f in the first variable is an increasing function of the second. We give below a result which generalizes this intuition. For x, y ∈ Rd , we shall use the compact notation x ∧ y := the vector formed by taking the componentwise minimum of x, y, i.e., its ith component is min(xi , yi ) for 1 ≤ i ≤ d. Similarly, x ∨ y denotes the componentwise maximum. A function f : Rd → R is said to be submodular if f (x) + f (y) ≥ f (x ∧ y) + f (x ∨ y) ∀ x, y ∈ Rd , and supermodular if the inequality above is reversed. It is strictly submodular or supermodular if the corresponding inequality is strict whenever x, y are not comparable. (If they are, equality holds, as can be easily verified.) It is easy to see that if f : Rd × Rm → R is submodular (resp., supermodular), then so is f (·, y) or f (x, ·), for each fixed y (resp., x). We say that f : Rd × Rm → R satisfies increasing, resp., decreasing differences property if x ≥ x , y ≥ y implies f (x, y) − f (x , y) ≥ (resp., ≤) f (x, y ) − f (x y ). Note that the ‘≤’ case generalizes (2.15). Strictly increasing or decreasing differences are defined analogously. We then have the following relationship between the two notions. Lemma 2.1 If f : Rd × Rm → R is submodular (resp., supermodular), then it has decreasing (resp., increasing) differences.
32
2 Differentiability and Local Optimality
Proof We shall prove only the first case, the second being proved analogously. Let z = (x, y), z = (x , y ) with x ≥ x , y ≥ y . Let u = (x, y ), v = (x , y). Then u ∨ v = z, u ∧ v = z . By submodularity of f , f (u) + f (v) ≥ f (z) + f (z ) =⇒ f (v) − f (z ) ≥ f (z) − f (u) =⇒ f (x , y) − f (x , y ) ≥ f (x, y) − f (x, y ), which is the decreasing differences property. The claim for supermodular f is proved analogously. Next consider f : C × D ⊂ Rd × Rm → R where C is closed and bounded, and C, D are closed under ∨, ∧ (i.e., x, y ∈ C =⇒ x ∨ y, x ∧ y ∈ C and similarly for D). For y ∈ D, define M(y) := {x ∈ C : f (x, y) = min x ∈Rd f (x , y)}. Theorem 2.5 Let f be continuous. If f satisfies decreasing differences and f (·, y) is submodular for each y ∈ D, then (i) ∀y, M(y) is non-empty closed and bounded, (ii) M(y) satisfies: x, x ∈ M(y) =⇒ x ∧ x , x ∨ x ∈ M(y), (iii) ∀y, M(y) contains a minimal element x∗ (y), resp. a maximal element x ∗ (y) (i.e., there is no x ∈ M(y) such that x = x∗ (y) and x ≤ x∗ (y), or x = x ∗ (y) and x ≥ x ∗ (y)). (iv) y → x∗ (y) is non-increasing, (v) if f satisfies strict decreasing differences, then y ≥ y , y = y =⇒ ∀ x ∈ M(y), x ∈ M(y ), x ≥ x . Proof The first claim follows from Theorem 1.3 and Exercise 1.11 of Chap. 1. Let / M(y), we have x, x ∈ M(y), x = x . If x ∧ x ∈ f (x ∧ x , y) > f (x, y) = f (x , y). By submodularity of f (·, y), f (x ∨ x , y) + f (x ∧ x , y) ≤ f (x, y) + f (x , y) < f (x, y) + f (x ∧ x , y) which implies
f (x ∨ x , y) < f (x, y) = f (x , y),
a contradiction. Thus x ∧ x ∈ M(y). A similar argument shows that x ∨ x ∈ M(y). This proves the second claim. The third claim is a standard fact from lattice theory [1]. Let y > y and let x ∈ M(y), x ∈ M(y ). Then
2.6 Ekeland Variational Principle
33
0 ≥ f (x, y) − f (x ∨ x , y) ≥ f (x ∧ x , y) − f (x , y) ≥ f (x ∧ x , y ) − f (x , y ) ≥ 0,
(2.16)
so equality holds throughout. Here the first inequality follows from x ∈ M(y) and the definition of M(y), the second by the submodularity of f (·, y), the third by the decreasing differences property and the fourth by the fact that x ∈ M(y ). If x = x∗ (y) and x = x∗ (y ), then the equality in the above chain implies that x ∧ x is also optimal for y. If x > x , x ∧ x < x and x cannot be the minimal element in M(y). So x ≤ x , which proves the fourth claim. For the last claim, let x1 ∈ M(y1 ), x2 ∈ M(y2 ) be arbitrary. If x1 < x2 , y1 < y2 , then x1 ∧ x2 < x2 , x1 ∨ x2 > x1 , so by the property of strictly decreasing differences, f (x2 , y2 ) − f (x1 ∧ x2 , y2 ) < f (x2 , y1 ) − f (x1 ∧ x2 , y1 ), implying the strict inequality in (2.16), a contradiction. This completes the proof. Other important results in this context are a converse to Lemma 2.1 and its important consequence, the intuitively expected fact (see the preamble to this section) that for twice continuously differentiable f , an equivalent condition for submodularity is that ∂2 f (x) ≤ 0. ∂ xi ∂ x j See Sect. 10.4 of [12] for the proofs of these.
2.6 Ekeland Variational Principle In this section we give a finite dimensional version of one of the landmark results in this broad subject area, which has found many applications in optimization, control and variational formulations of partial differential equations. This is the Ekeland variational principle [8]. Our treatment follows that of [5]. Theorem 2.6 (Ekeland variational principle) Let f : C ⊂ Rd → R, C open, be lower semicontinuous and bounded from below, and > 0. Let x ∗ ∈ C satisfy f (x ∗ ) ≤ inf f (x) + . x∈C
Then for any λ > 0, there exists xλ ∈ C such that
(2.17)
34
2 Differentiability and Local Optimality
f (xλ ) ≤ f (x ∗ ),
(2.18)
∗
xλ − x ≤ λ,
(2.19)
f (xλ ) < f (x) + x − xλ ∀ x = xλ . λ
(2.20)
Proof Define an order ≺ on C by x ≺ y ⇐⇒ f (x) ≤ f (y) −
x − y, x, y ∈ C. λ
It is easy to check that this relation is reflexive (x ≺ x), antisymmetric (x ≺ y, y ≺ x =⇒ x = y) and transitive (x ≺ y ≺ z =⇒ x ≺ z). Define inductively: z1 = x ∗, S1 = {x ∈ C : x ≺ z 1 }, , z 2 ∈ y ∈ S1 : f (y) ≤ inf f (x) + x∈S1 4 .. . Sn = {x ∈ C : x ≺ z n }, z n+1 ∈ y ∈ Sn : f (y) ≤ inf f (x) + n+1 . x∈Sn 2 For each n, z n ∈ Sn and hence these sets are non-empty. Moreover, they are closed by the lower semicontinuity of f . Thus {Sn } is a family of non-empty closed sets satisfying Sn+1 ⊂ Sn . Also, by the foregoing, x ∈ Sn implies x ≺ z n , i.e., x − z n , λ f (z n ) ≤ f (x) + n . 2 f (x) ≤ f (z n ) −
Thus x − z n ≤ λ2−n → 0. Hence the diameter of Sn goes to zero. By this and Exercise 1.7 of Chap. 1, ∩n Sn is a singleton, say {xˆλ }. Then xˆλ ∈ S1 , so it satisfies (2.18). If x ≺ xλ , x ∈ ∩n Sn , so x = xˆλ . So x = xˆλ implies, by our definition of the order ‘≺’, that (2.20) holds. Finally, x ∗ − z n ≤
n−1 i=1
z i − z i+1 ≤ λ
n−1
2−i .
i=1
Letting n ↑ ∞, (2.19) follows. See [7] for several applications of this result, including some to optimization.
2.7 Mountain Pass Theorem
35
2.7 Mountain Pass Theorem We now state and prove a result known as the Mountain Pass Theorem which has played a major role in nonlinear functional analysis in infinite dimensions. The intuition is that, if you consider all paths between two strict local minima (‘bottoms of valleys’) and consider the minimum of the points of maximum elevation on them, that point will be a critical point, typically a saddle (the highest point on a mountain pass). We make this precise below. The proof is from [9] which in turn is an adaptation of the proof from [3]. Recall that a closed bounded set is connected if it cannot be written as a disjoint union of two closed bounded sets.1 Theorem 2.7 (The Mountain Pass Theorem) Let f : Rd → R be a continuously differentiable function satisfying limx↑∞ f (x) = ∞, with at least two strict local minima x1 , x2 . Then it also has a third critical point x ∗ characterized by f (x ∗ ) = inf max f (x) C∈C x∈C
where C := {C ⊂ Rd : C is closed, bounded and connected, and x1 , x2 ∈ C}. Proof For C ∈ C, let xC ∈ arg maxx∈C f (x) and αC := f (xC ). Since x1 , x2 are strict local minima, αC = f (xC ) > max( f (x1 ), f (x2 )). Let α ∗ = inf max f (x) = inf αC ≥ max( f (x1 ), f (x2 )). C∈C x∈C
C∈C
(2.21)
Then there exist Cn ∈ C, n ≥ 1, such that αCn → α ∗ ≥ max( f (x1 ), f (x2 )).
(2.22)
Define C ∗ := ∩m≥1 ∪i≥m Ci . Then C ∗ is closed bounded and connected (Check this! This requires using coercivity.), therefore in C. By Bolzano-Weierstrass theorem, there exists an x ∗ ∈ C ∗ such than xˆn := arg maxCn f → x ∗ along a subsequence. By continuity, f (x ∗ ) = lim f (xCn ) = lim αCn = α ∗ . n↑∞
n↑∞
On the other hand, for any zˆ ∈ C ∗ , there exist {n(k)} ⊂ {n} and z(n(k)) ∈ Cn(k) , k ≥ 1, such that z(n(k)) → zˆ . By continuity, f (z(n(k))) ≤ f (xCn(k) ) =⇒ f (ˆz ) ≤ f (x ∗ ), implying x ∗ = xC ∗ . This implies in particular, from the definition of xC ∗ and the fact that x1 , x2 are strict local minima, that the inequality in (2.21) is strict. Define the set 1
The definition of a connected set is more general, but this will suffice for the present purposes.
36
2 Differentiability and Local Optimality
D := {x ∈ C ∗ : f (x) = xC ∗ = α ∗ }. Clearly D is closed and bounded. We claim that D contains a critical point. Suppose not. Then there exists an ν > 0 such that ∇ f (x) > ν ∀ x ∈ D. (Check this.) Then by continuity, for > 0 sufficiently small, D := {x ∈ Rd : inf y − x < } y∈D
/ D is an open neighborhood of D satisfying ∇ f (x) > ν2 . Note that x1 , x2 ∈ d because they are critical points. Let ρ : R → [0, 1] be a continuously differentiable function such that ρ(x) = 1, x ∈ D, and x = 0, x ∈ / D . Define η : Rd × R → Rd by η(x, t) := x − tρ(x)∇ f (x). Then η is continuously differentiable and d f (η(x, t)) = −ρ(x)∇ f (x)2 . t=0 dt Then for T > 0 sufficiently small, ρ(x) d f (η(x, t)) ≤ − ∇ f (x)2 ∀ t ∈ [0, T ]. dt 2 For x ∈ C ∗ , this implies T f (η(x, T )) = f (x) + =⇒ f (η(x, T ))
T d f (η(x, t))dt ≤ f (x) − ρ(x)∇ f (x)2 dt 2
0
< α ∗ if x ∈ / D, T ν2 ∗ ≤ α − 2 if x ∈ D.
Here the penultimate inequality follows because α ∗ = xC ∗ . It follows that for ˇ C := {η(x, T ) : x ∈ C ∗ }, (2.23) max f (x) < α ∗ . x∈Cˇ
By the continuity of η, Cˇ is closed bounded and connected and contains xi = η(xi , T ), i = 1, 2, because x1 , x2 ∈ C ∗ . Hence Cˇ ∈ C, which, by (2.23), contradicts the definition (2.21) of α ∗ . Therefore C ∗ must contain a critical point.
2.8 Exercises
37
One can go a little further and show that the critical point thus obtained cannot be a local minimum, see [9] for details. The infinite dimensional counterpart of this result requires additional technical conditions.
2.8 Exercises 2.1 Give an example of a continuous function f : [0, 1] → R that has infinitely many local maxima and minima. 2.2 Let g : Rd → Rm , h : Rm → Rs be continuously differentiable and define f : Rd → Rs by f (x) := h(g(x)). Show that f is continuously differentiable and derive an expression for D f x (·) in terms of Dgx (·) and Dh y (·), x ∈ Rd , y ∈ Rm (the chain rule for Fréchet derivatives). 2.3 Let f : Rd → R be thrice continuously differentiable. Write down the third Fréchet derivative of f . 2.4 Let F : Rd → Rd be monotone in the sense that
F(x) − F(y), x − y > 0 for x = y ∈ Rd .
2.5
2.6
2.7
2.8
2.9
2.10
2
Suppose f : Rd → Rd is continuously differentiable. Show that it is monotone if and only if 21 D f x (·) + D Fx (·)T is positive definite for all x. (Euler’s homogeneous function theorem) A function f : Rd → R is said to be positively homogeneous of degree r, r ≥ 0, if f (ax) = a r f (x) for any a > 0. If f : Rd → R is continuously differentiable, show that it is positively homogeneous of degree r if and only if x, ∇ f (x) = r f (x) ∀x. Let f : Rd → R be twice continuously differentiable. Suppose x0 is a critical point of f such that ∇ 2 f (x0 ) is non-singular and its trace is zero. Show that x0 is a saddle point of f . Let f : Rd → R be twice continuously differentiable with x0 ∈ Rd a critical point of f . If ∇ 2 f (x0 ) is non-singular, show that there is no other critical point in a sufficiently small neighbourhood of x0 . Let f : R → R be a continuously differentiable function. Let x0 ∈ R be a unique point such that f (x0 ) = 0. If f
(x0 ) < 0, then show that x0 is the global minimum of the function f . Let f : R2 → R be given by f (x, y) = x 2 + y 2 (1 − x)3 . Show that the function f has a unique critical point, which is a local minimum. However, it is not a global minimum. If f : Rd → R is continuously differentiable and bounded from below, show that we can find a sequence {xn } ⊂ Rd such that f (xn ) → inf x f (x) and ∇ f (xn ) → 0. (Hint: Use Ekeland’s variational principle.2 )
This is not necessarily true in infinite dimensions and needs to be stated as an assumption, called the Palais-Smale condition, e.g., for proving the Mountain Pass Theorem.
38
2 Differentiability and Local Optimality
References 1. Birkhoff, G.: Lattice Theory, 3rd edn. American Mathematical Society Colloquium Publications, vol. 25. American Mathematical Society, Providence, R.I. (1979) 2. Céa, J.: Lectures on Optimization—Theory and Algorithms. Tata Institute of Fundamental Research Lectures on Mathematics and Physics, vol. 53. Tata Institute of Fundamental Research, Bombay (1978) 3. Courant, R.: Dirichlet’s Principle, Conformal Mapping, and Minimal Surfaces. Springer, New York (1977). With an appendix by M. Schiffer, Reprint of the 1950 original 4. Danskin, J.M.: The Theory of Max-Min and Its Application to Weapons Allocation Problems. Econometrics and Operations Research. Springer, New York (1967) 5. de Figueiredo, D.G.: Lectures on the Ekeland Variational Principle with Applications and Detours. Tata Institute of Fundamental Research Lectures on Mathematics and Physics, vol. 81. Published for the Tata Institute of Fundamental Research, Bombay; by Springer, Berlin (1989) 6. Dhara, A., Dutta, J.: Optimality Conditions in Convex Optimization. CRC Press, Boca Raton, FL (2012) 7. Ekeland, I.: Nonconvex minimization problems. Bull. Am. Math. Soc. (N.S.) 1(3), 443–474 (1979) 8. Ekeland, I.: On the variational principle. J. Math. Anal. Appl. 47, 324–353 (1974) 9. Jabri, Y.: The Mountain Pass Theorem: Variants, Generalizations and Some Applications. Encyclopedia of Mathematics and Its Applications, vol. 95. Cambridge University Press, Cambridge (2003) 10. Mas-Colell, A., Whinston, M.D., Green, J.R.: Microeconomic Theory. Oxford University Press, Oxford (1995) 11. McShane, E.J.: The Langrange multiplier rule. Am. Math. Monthly 80, 922–925 (1973) 12. Sundaram, R.K.: A First Course in Optimization Theory. Cambridge University Press, Cambridge (1996)
Chapter 3
Convex Sets
3.1 Introduction Recall that a convex set C ⊂ Rd is a set such that any line segment joining two distinct points in C lies entirely in C, i.e., x, y ∈ C, 0 ≤ α ≤ 1 implies αx + (1 − α)y ∈ C. Figure 3.1 gives examples of a convex set and a non-convex set respectively, in the plane. An equivalent definition isas follows. For any n ≥ 2, xi ∈ C, 1 ≤ i ≤ n, and αi ∈ [0, 1], 1 ≤ i ≤ n, with i αi = 1, we have i αi xi ∈ C. Clearly for n = 2, this reduces to the previous definition and therefore subsumes it. To show that they are equivalent, we use induction to prove that the earlier definition implies the new one. Indeed, it holds for n = 2. Suppose it holds for all m≤ n for some n ≥ 2. n+1 αi = 1. Then for Let xi ∈ C be distinct and αi ∈ [0, 1], 1 ≤ i ≤ n + 1, with i=1 (= α ) < 1. Define βi = some i (say, for i = 1 without loss of generality), 0 < α i 1 αi+1 , 1 ≤ i ≤ n. Then β ∈ [0, 1] with β = 1. Hence by induction hypothesis, i i i 1−α1 i βi x i+1 ∈ C. Then n+1 i=1
αi xi = α1 x1 + (1 − α1 )
n
βi xi+1 ∈ C,
i=1
again by the induction hypothesis, and we are done. If x = i αi xi for {αi } as above, we say that x is a convex combination of the xi ’s. We say that it is a strict convex combination if αi ∈ (0, 1) ∀i. If C is closed, a simpler requirement suffices: C is convex if x, y ∈ C implies 1 (x + y) ∈ C. 2
(3.1)
By iterating this condition, it automatically implies that αx + (1 − α)y ∈ C for all dyadic rationals in [0, 1], i.e., rationals that can be written as 2mn for some 0 ≤ m ≤ 2n , n ≥ 1. Dyadic rationals are dense in R, i.e., given any z ∈ R, there exists a © Hindustan Book Agency 2023 V. S. Borkar and K. S. M. Rao, Elementary Convexity with Optimization, Texts and Readings in Mathematics 83, https://doi.org/10.1007/978-981-99-1652-8_3
39
40
3 Convex Sets
Fig. 3.1
sequence of dyadic rationals z n ∈ R such that z n → z. (Consider, e.g., the truncated binary expansion of x.) The claim then follows for all α ∈ [0, 1] by continuity of the linear map α → αx + (1 − α)y for fixed x, y. On the other hand, the set of dyadic rationals itself (or simply the set of rationals Q) satisfies condition (3.1), but is not convex. So the condition that C be closed is essential. Some simple properties of convex sets that are easily verified are as follows: (P1) A convex set is connected, i.e., it cannot be written as the union of two sets that have disjoint open neighborhoods. (P2) Intersection of an arbitrary collection of convex sets is convex when non-empty. (P3) Union of two convex sets need not be convex. (P4) Interior and closure of a convex set are convex. (P5) Image of a convex set under a linear (more generally, affine, i.e., linear plus a constant) transformation is convex. In fact, affine functions are the only functions that transform convex sets to convex sets when the function is assumed to be continuous and injective, and moreover, the dimension of the domain is bigger than one. For continuous functions from R to R, the image of any interval under any continuous function is always an interval, thanks to the intermediate value theorem. Theorem 3.1 (Knecht and Vanderwerff [11]) Let d ≥ 2 and f : Rd → Rm be a continuous injective function such that f maps convex sets to convex sets. Then f is affine. The proof of this result is given in the last section. Given an arbitrary set A ⊂ Rd , we define its convex hull, denoted co(A), as the smallest convex set containing A, alternatively as the set of convex combinations of points in A. We also define its closed convex hull to be the smallest closed convex set containing A, or the intersection of closed convex sets containing A, or, equivalently, as the closure of its convex hull, and denote it as co(A). We begin in the next section with the simplest but archetypical convex optimization problem, that of finding the nearest point in a closed convex set from a given point outside it. This serves as a building block of increasingly more complicated scenarios, as we shall soon see.
3.2 The Minimum Distance Problem
41
3.2 The Minimum Distance Problem Let C ⊂ Rd be a closed convex set C and let x ∈ / C. Consider the problem of minimizing x − y for y ∈ C. Theorem 3.2 There is a unique x ∗ ∈ C such that x − x ∗ = min ||x − y . y∈C
Proof By triangle inequality, | x − y − x − z | ≤ y − z , so the map y → x − y is continuous. By triangle inequality again, x − y ≥ y − x , so lim y ↑∞ x − y = ∞. The existence of a minimizer x ∗ now follows by Theorem 1.4. If xˆ = x ∗ is another minimizer, then the triangle formed by x, x ∗ , xˆ is an isosceles triangle with the line segment x ∗ ↔ xˆ as its base, which will lie entirely in C by its convexity. Then by elementary plane geometry, the mid-point thereof, which ˆ a contradiction. Hence x ∗ is in C, is at a strictly smaller distance from x than x ∗ , x, is the unique minimizer of the map y → y − x . There is more to this. If C is a set such that the theorem holds true, then the set C is called a Chebyshev set. Every Chebyshev set is convex in finite dimensions (see [8, Chap. 12]). The x ∗ above is called the projection of x on C. Note that this reduces to the classical notion of projection if C is an affine space (i.e., translate of a subspace, in other words a set of the form {a + x : n, x = 0} for some a, n ∈ Rd with n = 1). An immediate generalization is as follows. Theorem 3.3 Let C, D be disjoint closed convex sets in Rd with C bounded. Then there exist x ∗ ∈ C, y ∗ ∈ D, such that 0 < x ∗ − y ∗ = min x − y . x∈C,y∈D
Proof Consider the map (x, y) ∈ C × D → x − y . Again, it is easy to see that this is continuous. Furthermore, if D is also bounded, the claim is immediate from Theorem 1.3. If not, (x, y) ↑ ∞ ⇐⇒ y ↑ ∞ because C is bounded, and thus (x, y) ↑ ∞ implies x − y ↑ ∞ as in the proof of Theorem 3.2. The existence of minimizing pair (x ∗ , y ∗ ) now follows from Theorem 1.4. Since C, D are disjoint, x ∗ − y ∗ > 0. No uniqueness can be claimed. Consider, e.g., two disjoint rectangles as in Fig. 3.2 with two sides parallel. There are uncountably many points which attain the minimum distance between the two. Also, the boundedness condition on C cannot be dropped. Consider, e.g., 1 1 , D := (x, y) : x > 0, y ≥ . C := (x, y) : x < 0, y ≥ − x x
42
3 Convex Sets
Fig. 3.2
Fig. 3.3
Fig. 3.4
Then the two sets are convex and disjoint, but inf x∈C,y∈D x − y = 0 (see Fig. 3.3). The infimum is not attained at any (x, y) ∈ C × D. We next take up some important ramifications of Theorem 3.2. The first is a geometric characterization of x ∗ (see Fig. 3.4.) Theorem 3.4 For a closed convex C ⊂ Rd and x ∈ / C, the point x ∗ := arg min x − y y∈C
3.3 Separation Theorems
is characterized by:
43
∀ y ∈ C, y − x ∗ , x − x ∗ ≤ 0.
(3.2)
That is, the angle between the line segments x ↔ x ∗ and y ↔ x ∗ is at least 90◦ . Proof Suppose (3.2) does not hold, i.e., the angle between the two line segments is acute. For any point x on the line segment joining x with x ∗ , we will have x − x ∗ = min y ∈C x − y (check this), so without any loss of generality, we may take x − x ∗ < y − x ∗ . Drop a perpendicular from x to the line joining x ∗ and y. Then its foot (say) z must fall between x ∗ and y. But then z ∈ C by convexity of C and is strictly closer to x than x ∗ , a contradiction. So (3.2) holds. Conversely, let (3.2) hold for some x ∗ and let xˆ := arg min y∈C y − x with xˆ = x ∗ . Then by the preceding argument, the line segments x ↔ xˆ and x ∗ ↔ xˆ make an angle of at least 90◦ . Hence the line segments x ↔ x ∗ and x ∗ ↔ xˆ must make an acute angle, contradicting (3.2). This establishes the converse.
3.3 Separation Theorems Recall that a hyperplane H := {x ∈ Rd : x − x0 , n = 0} in Rd is characterized by its normal vector n, taken to be a unit vector without loss of generality, and any point on it, in particular x0 . (Note that −n will serve equally well.) H defines two closed half spaces L H := {x ∈ Rd : x − x0 , n ≤ 0} and U H := {x ∈ Rd : x − x0 , n ≥ 0} that intersect in H . It also defines two open half spaces defined analogously with ≤, ≥ in the above definitions replaced respectively by . We say that H separates sets A, B ⊂ Rd if A ⊂ L H , B ⊂ U H , or vice versa. We then have the following important consequences of the foregoing (Fig. 3.5). Theorem 3.5 (i) If C is closed convex and x ∈ / C, then there exists a hyperplane H separating the two. (ii) If C, D are disjoint closed convex sets, there exists a hyperplane separating the two. Proof In the framework of Theorem 3.4, let n := (x − x ∗ )/ x − x ∗ and x0 := x ∗ . Then H defined as above separates x and C by virtue of Theorem 3.4. This establishes (i). For (ii), recall the framework of Theorem 3.3. First consider a bounded C. Take n := (y ∗ − x ∗ )/ y ∗ − x ∗ and x0 = x ∗ and apply Theorem 3.4 to both / C, x ∗ = arg minx∈C x − y ∗ , and to (x ∗ , D) with x ∗ ∈ / D, y ∗ = (y ∗ , C) with y ∗ ∈ ∗ arg min y∈D y − x , to generate two hyperplanes, H (1) := {x : x − x ∗ , n = 0} and H (2) := {y : y − y ∗ , n = 0}.
44
3 Convex Sets
D
C
x
Fig. 3.5
H (2)
Fig. 3.6
C
D
H (1) Then H (1), H (2) are parallel hyperplanes. It is seen that C ⊂ L H (1) and H (2), D ⊂ U H (1) . Similarly, H (1), C ⊂ L H (2) and D ⊂ U H (2) . Thus both H (1), H (2) separate C, D (see Fig. 3.6). More generally, if both C, D are unbounded, replace C by C N := C ∩ {x : x ≤ N }, N ≥ 1. Then by the foregoing, there exists a hyperplane HN separating C N , D. Let x ∈ C, y ∈ D and L := the line segment joining the two. Denote by x N the point where HN intersects L. Since a hyperplane can be characterized by a unit vector normal to it and a point it contains, we have HN = {x : x − x N , n N = 0}, where n N = 1 and x N ∈ L. Suppose y − x N , n N ≤ 0 ≤ z − x N , n N ∀ y ∈ C N , z ∈ D. Since both {x N , N ≥ 1} and {n N , N ≥ 1} are bounded sequences, the BolzanoWeierstrass theorem allows us to pick a convergent subsequence of (x N , n N ), N ≥ 1, ˆ Passing to the limit in the above pair of along which it converges to (say) (x, ˆ n). inequalities along this subsequence, we conclude that the hyperplane
3.3 Separation Theorems
45
Fig. 3.7
C
D
ˆ = 0} H∞ := {x : x − x, ˆ n
separates C, D.
The above proof already contains the germ of the following important observation. Let C, D be as above with C bounded. Define separation between two parallel hyperplanes H, H as min x∈H, y∈H x − y , which equals min x∈H x − yˆ , resp. min y∈H xˆ − y for any choice of yˆ ∈ H , resp., xˆ ∈ H . Corollary 3.1 The minimum distance between C and D equals the maximum separation between pairs of parallel hyperplanes separating the two. Proof It is clear that the minimum distance between C, D equals the separation between H (1), H (2). So it remains to show that the latter is indeed the maximum separation between pairs of parallel hyperplanes separating C, D. Let H , H be another pair of parallel hyperplanes separating C, D. Convince yourself that they ˜ y˜ resp. (see Fig. 3.6). Then x ∗ − must intersect the line segment x ∗ ↔ y ∗ in, say, x, ∗ y ≥ x˜ − y˜ ≥ the separation between H , H by the definition thereof. The claim follows (Fig. 3.7). This is a typical ‘duality’ result: we have identified a minimization problem, viz. that of minimizing the distance between two disjoint closed convex sets with nonzero distance from each other, with a ‘dual’ maximization problem, viz. that of maximizing the separation between pairs of parallel hyperplanes constrained to pass through the gap between these sets. Next we push the above theorem a little further by allowing the sets to touch at the boundary. We shall need the following as a first step. Theorem 3.6 If C is closed convex and x ∗ ∈ ∂C, then there exists a hyperplane H passing through x ∗ such that C ⊂ L H and int(C) ⊂ int(LH ). Proof Let xn ∈ / C, n ≥ 1, such that xn → x ∗ . Let yn := arg min z − xn , n ≥ 1. z∈C
Since xn ↔ yn ↔ x ∗ is a triangle with the line segments xn − yn , yn − x ∗ meeting at an angle ≥ 90◦ , 0 < xn − yn ≤ xn − x ∗ → 0, implying yn → x ∗ . Let
46
3 Convex Sets
Fig. 3.8
m nm := xxmm −y , m ≥ 1. Then by the Bolzano-Weierstrass theorem, nm → n∗ along −ym a subsequence, for some unit vector n∗ . Since
xn − yn , z − yn ≤ 0 ∀ z ∈ C, n ≥ 1, we have nm , z − ym ≤ 0 ∀ z ∈ C, m ≥ 1, Passing to the limit as m ↑ ∞ along an appropriate subsequence, we have n∗ , z − x ∗ ≤ 0 ∀ z ∈ C. Thus H := {x : n∗ , x − x ∗ = 0} satisfies the requirement.
The hyperplane H above passes through the point x ∗ ∈ ∂C and contains C entirely in one of the half spaces it generates. Such a hyperplane is said to be a support hyperplane of C at x ∗ . Support hyperplane at an x ∈ ∂C may not be unique, see, e.g., Fig. 3.8. The figure suggests that uniqueness of the support hyperplane at a boundary point is in some way related to the ‘smoothness’ of the boundary at that point. This is indeed the case: if, for example, the boundary ∂C in a neighborhood of x is of the form f (x) = a constant for some continuously differentiable f , then ±∇ f (x) is the only possible choice for n, leading to a unique H . It is also possible that multiple points on the boundary lead to the same supporting hyperplane, e.g., the points on any one side of a rectangle. We next consider separation of two disjoint convex sets. Let C, D be closed convex sets such that int(C) = φ, D ∩ int(C) = φ and D ∩ C = φ. That is, D intersects C, but not its interior, implying that there exists a point r ∈ ∂C ∩ D. Theorem 3.7 There exists a hyperplane H such that C ⊂ L H and D ⊂ U H (i.e., H separates C and D). Proof (Sketch) Fix a ∈ int(C), which is possible because int(C) = φ. Extend the line segment from a to r to a point b such that b = a + (1 + δ)(r − a) for some δ > 0 (see Fig. 3.9).
3.3 Separation Theorems
47
Fig. 3.9
r D (δ)
D
b
a
C
Define D (δ) := {x : x = z + δ(r − a) for some z ∈ D}, i.e., the set D shifted by an amount δ r − a in the direction (r − a)/ r − a . We claim that D (δ) ∩ C = φ. Suppose not. Then there exists a point c ∈ C ∩ D (δ). Let d := c − δ(r − a). By the definition of D (δ), d ∈ D. Consider the quadrilateral a ↔ r ↔ c ↔ d ↔ a. (It is clear that this will be a quadrilateral because the line segments a ↔ r and d ↔ c are parallel.) Let e denote the point of intersection of its diagonals. Since a, c ∈ C and r, d ∈ D, e ∈ C ∩ D. Let B ⊂ int(C) be an open ball centred at a. Consider the set Q := co(B ∪ {c}). Since c ∈ C, B ⊂ C and C is closed and convex, we must have Q ⊂ C and int(Q) ⊂ int(C). It is easily seen that e ∈ int(Q). Hence e ∈ int(C), contradicting the hypothesis that D∩int(C) = φ. Therefore D (δ) ∩ C = φ. By Theorem 3.5, there exists a hyperplane H (δ) separating C and D (δ), say with C ⊂ L H (δ) and D (δ) ⊂ U H (δ) . Let δ ↓ 0 along a subsequence (say) δn , n ≥ 1. Since D (δn ) → D as δn ↓ 0 and C ∩ D = φ, we can find points xn ∈ H (δn ) such that xn → xˆ ∈ C ∩ D. Write H (δm ) as {x : x − xm , nm = 0} for a suitable sequence nm , m ≥ 1, of unit vectors. Using Bolzano-Weierstrass theorem, we can extract a subsequence of (xm , nm ), m ≥ 1, along which xm → xˆ and nm → n for some unit vector n. Since C ⊂ L H (δm ) , D (δm ) ⊂ U H (δm ) ∀m, passing to the limit in the inequalities x − xm , nm ≤ 0 ≤ y − xm , nm ∀ x ∈ C, y ∈ D (δm ), m ≥ 1, we have x − x, ˆ n ≤ 0 ≤ y − x, ˆ n ∀ x ∈ C, y ∈ D. It follows that H := {x : x − x, ˆ n = 0} satisfies the requirement.
The condition that one of the two convex sets have a non-empty interior cannot be relaxed. Consider, for example, the line segments a ↔ b and c ↔ d in Fig. 3.10 which meet at a single point. Viewed as subsets of the ambient two-dimensional plane, they are both closed and convex with empty interiors. It is easy to see that no hyperplane (≈ a line in this case) can separate them.
48
3 Convex Sets
Fig. 3.10
d
b
a
c
3.4 Extreme Points A point x ∈ a convex set C ⊂ Rd is said to be its extreme point if it cannot be expressed as a strict convex combination of two distinct points in C. That is, x = αy + (1 − α)z, y, z ∈ C, α ∈ (0, 1) =⇒ x = y = z. Clearly, x ∈ ∂C, because if not, there is an open ball centred at x and contained in C, which clearly contradicts the definition of an extreme point. Lemma 3.1 A closed bounded convex set C ⊂ Rd has at least one extreme point. Proof Let 1 , . . . , d be linearly independent vectors in Rd . Let C0 := C and for 0 ≤ i < d define recursively Ci+1 := {x ∈ Ci : i , x = mini , y}. y∈Ci
Then Ci+1 ⊂ Ci ∀i and the Ci ’s are closed bounded and convex. If x, y ∈ Cd , then i , x = i , y ∀i, implying x = y by our choice of {i }. Thus Cd is a singleton, say {x ∗ }. We claim that x ∗ is an extreme point of C. Suppose not. Choose y = z ∈ C such that x ∗ = 21 (y + z). Then l1 , x ∗ =
1 1 l1 , y + l1 , z 2 2
and hence l1 , y = l1 , z = l1 , x ∗ , implying that y, z ∈ C1 . Repeat this argument for C1 , C2 , · · · , using induction to conclude that y, z ∈ Ci for all i. Hence y, z ∈ Cd = {x ∗ }. The claim follows. In the following, we provide a characterization result for the existence of extreme points of closed and convex sets. Theorem 3.8 A closed and convex set has an extreme point if and only if it contains no lines. Proof Let C ⊂ Rd be closed and convex. Suppose C contains a line {x¯ + th : t ∈ R} starting from point x¯ ∈ Rd in the direction h ∈ Rd . Then for any t ∈ R, x + th ∈ C for each x ∈ C. Indeed,
3.4 Extreme Points
49
t ∈C x + th = lim (1 − )x + x¯ + h →0 since C is convex and closed. Therefore no point of C can be an extreme point. Next, assume that C contains no lines. We use induction to show that C has an extreme point. If C is a closed and convex subset of R having no lines, then it is a closed interval bounded from below or/and above. Hence it has an extreme point. Assuming that the statement is true for dimensions strictly less than d, consider a closed convex set C ⊆ Rd . Since C contains no lines, it has a boundary point say x. ¯ Let H be the supporting hyperplane of C at x. ¯ Now H ∩ C lies in (d − 1)dimensional space and has no lines. Therefore H ∩ C has an extreme point by the induction hypothesis. It is an easy exercise to see that this is also an extreme point of C. (In fact this is formally proved in Lemma 3.2 below.) In several applications, we encounter the following set K := {x ∈ Rd : Ax ≤ b}
(3.3)
where A is an m × d matrix and b ∈ Rm . In the following result, we discuss the structure of extreme points of K . Theorem 3.9 Let x ∈ K . Then x is an extreme point of K if and only if some d inequalities corresponding to d linearly independent rows of the system Ax ≤ b are equalities i.e., ai , x = bi for i corresponding to those linearly independent rows. Proof Let x ∈ K be an extreme point. Let I = {i ∈ {1, 2, · · · , m} : ai , x = bi } Let F = {ai : i ∈ I }. We need to show that F contains d linearly independent vectors. This is equivalent to showing that Span(F) = Rd . / I, Suppose Span(F) Rd . Therefore F ⊥ = {θ }. Choose h = θ in F ⊥ . For i ∈ ai , x ± h = ai , x ± ai , h = ai , x ≤ b for > 0 sufficiently small. Therefore x ± h ∈ K for such . But x = 21 (x + h) + 1 (x − h), contradicting the fact that x is an extreme point. Thus Span(F) = Rd . 2 Now suppose x ∈ K be such that d of the inequalities (corresponding to d linearly independent rows) of the system Ax ≤ b are equalities i.e., ai , x = bi for i corresponding to those linearly independent rows. If x is not extreme point, then there exists h ∈ Rd such that x ± h ∈ K . Therefore bi ± ai , h = ai , x ± h ≤ bi This is possible if and only if h = θ and hence x is extreme point. The following corollary is immediate.
50
3 Convex Sets
Fig. 3.11
Corollary 3.2 Set of extreme points of K is finite. We first state some simple facts about extreme points. (Q1) Every boundary point may not be an extreme point, e.g., the points on the boundary of a rectangle other than its corners are not extreme points. (Q2) A convex set may not have an extreme point if it is not closed (e.g., the open ball in Rd ). (Q3) If x ∈ ∂C is not an extreme point and y, z ∈ C are such that x = αy + (1 − α)z for some α ∈ (0, 1), then y, z ∈ ∂C. To see this, suppose y ∈ int(C). Then there exists an open ball B centred at y such that B ⊂ int(C). Then x ∈ int(co(B ∪ {z})) ⊂ int(C), a contradiction. (Q4) An extreme point may have a unique supporting hyperplane that contains other, non-extreme points, see, e.g., the point (0, 0) in Fig. 3.11. We denote by e(C) the set of extreme points of C. The next result is a special case of the more general Krein-Milman theorem of functional analysis. Theorem 3.10 (Krein-Milman theorem in finite dimension) A closed bounded convex set C is the closed convex hull of e(C). Proof Clearly, co(e(C)) ⊂ C. If the claim is not true, we can pick x ∗ ∈ C ∩ (co(e(C)))c . As in the proof of Theorem 3.5 (i), we can construct a support hyperplane H = {x : x − x0 , n = 0} (say) of co(e(C)) such that e(C) ⊂ L H and x ∗ ∈ int(U H ). Then x ∗ − x0 , n > 0. Repeat the procedure in the proof of Lemma 3.1 with 1 := −n to obtain xˆ ∈ e(C) with xˆ − x0 , n > 0, a contradiction. The claim follows. Theorem 3.11 The set of extreme points of a closed bounded convex set C ⊂ Rd can be written as a countable intersection of relatively open sets in C.
3.4 Extreme Points
51
Proof Let C be as above and e(C) the set of its extreme points. Consider 1 1 Cn = x ∈ C : x = (y + z) and y − z ≥ , y, z ∈ K . 2 n
Then each Cn is closed. Also, K := C \ e(C) = ∪∞ n=1 C n . Thus e(C) = C \ K = C ∩ (∩n Cnc ). The claim follows. Theorem 3.12 below plays an important role in convex analysis and convex optimization in finite dimensions. We first establish a technical lemma. A version of this fact is already used in Theorem 3.8. Lemma 3.2 Let C ⊂ Rd be a closed bounded convex set and H a support hyperplane of C such that C ⊂ L H (say). Then C1 := C ∩ H is closed bounded and convex, and e(C1 ) ⊂ e(C). Proof The first claim follows easily from the facts that C is closed bounded convex and H is closed convex. Suppose x ∈ e(C1 )\e(C). Then x = αy + (1 − α)z for some α ∈ (0, 1) and y = z in ∂C by virtue of (Q3) above. Clearly, at least one of y, z, say y, is not in C1 . Then it is in ∂C\H ⊂ int(L H ). Then z ∈ int(U H ) ∩ ∂C, / e(C1 ), a which is empty. Hence both y, z ∈ H , therefore y, z ∈ C1 . But then x ∈ contradiction. The claim follows. Theorem 3.12 (Caratheodory’s theorem) Every x ∈ a closed bounded convex set C ⊂ Rd can be written as a convex combination of m points in e(C) for some m ≤ d + 1. Proof For d = 1, C is a closed bounded interval, so the claim holds. We prove the general claim by induction. Suppose the claim is true for some d ≥ 1. Let C ∈ Rd+1 with non-empty interior. If x ∈ e(C), the claim holds with m = 1. Suppose x∈ / e(C). Take e1 ∈ e(C) and extend the line segment from e1 to x till the point b1 := e1 + a(x − e1 ) for 1 ≤ a := the maximum such number for which b1 thus defined is still in C. Then b1 ∈ ∂C. (Note that b1 = x, i.e., a = 1, is possible.) Let H1 be a support hyperplane of C at b1 such that C ⊂ L H1 . Let C1 = C ∩ H1 , which too is closed bounded and convex. Then by the induction hypothesis, b1 is a convex combination of at most (d + 1) points in e(C1 ) and e(C1 ) ⊂ e(C) by Lemma 3.2. Hence x is a convex combination of at most (d + 2) points of e(C). The claim follows by induction. If C is bounded and e(C) is finite, C is said to be a convex polytope and the points in e(C) its ‘corners’. It is easy to see that C = co(e(C)) = co(e(C)). (Check this.) Suppose e(C) = {x1 , . . . , xm } for some m ≥ 1. Define ei = xi+1 − x1 , 1 ≤ i ≤ m − 1. Then at most d of the vectors {ei } can be linearly independent. In particular, if m > d + 1, they are not. Then for each x ∈ C, x − x1 can be written as a convex combination of ei , 1 < i ≤ m, in more than one way. This means that x can be written as a convex combination of {xi } in a non-unique manner. On the other hand, if the {ei } are linearly independent, these representations are unique. In such a case, C
52
3 Convex Sets
is said to be a simplex. A simplex in R^d with non-empty interior (equivalently, one for which the linear span of the (x_i − x_1)'s for i > 1 as defined above has dimension d) is said to be a d-dimensional simplex or simply a d-simplex. For example, a point, a closed interval, a closed triangle and a closed tetrahedron are, respectively, zero-, one-, two- and three-dimensional simplices. Some of the easily established properties of simplices are as follows. Let S be a d-simplex with e(S) = {x_1, ..., x_{d+1}}.
(S1) For any m with 1 ≤ m ≤ d, and distinct i_1, ..., i_m in {1, ..., d + 1}, co({x_{i_1}, ..., x_{i_m}}) = \overline{co}({x_{i_1}, ..., x_{i_m}}) ⊂ ∂S and is an (m − 1)-simplex. These are called ((m − 1)-dimensional) faces of S.
(S2) ∂S is the union of d + 1 distinct (d − 1)-simplices whose intersections are (d − 2)-simplices in their respective boundaries.
(S3) The intersection of S with a hyperplane H not intersecting its interior is one of its faces. On the other hand, if H intersects int(S), then S ∩ H is a polytope whose relative interior is int(S) ∩ H.
See [13] for explicit expressions for the weights of convex combinations in polytopes, motivated by applications to control engineering. Convex polytopes are precisely the sets of the form (3.3), see Exercise 3.15.
We conclude with a result that is sometimes useful in constrained optimization. Let C ⊂ R^d be closed bounded and convex, and let

  H_i := {x ∈ R^d : ⟨r_i, x⟩ ≤ c_i}, 1 ≤ i ≤ m,

be closed half spaces in R^d for some m, 1 ≤ m ≤ d. Let H := ∩_{i=1}^m H_i and C* := C ∩ H. Then C* is also closed bounded and convex. We shall denote by

  H̃_i := {x ∈ R^d : ⟨r_i, x⟩ = c_i}, 1 ≤ i ≤ m,

the hyperplanes that form the boundaries of the H_i's, and define H̃ := ∂C* ∩ (∪_i H̃_i), i.e., the part of the boundary of C* that is in one or more of the H̃_i's.
Theorem 3.13 (Dubins [9]) Every x ∈ e(C*) can be written as a convex combination of at most m + 1 elements of e(C).
Proof (Sketch) If not, x can be written as a strict convex combination of k, m + 1 < k ≤ d + 1, elements of e(C), say x_1, ..., x_k, and no fewer. Then these x_i's must form a (k − 1)-simplex S ⊂ C containing x in its interior, otherwise one could have written x as a convex combination of k − 1 or fewer extreme points of S, hence of C. Clearly, x ∉ e(C). Consider a closed ball B ⊂ int(S) centred at x. The intersection of B with ℓ, 0 ≤ ℓ ≤ m, hyperplanes passing through x will be a disc B̃ centred at x such that the dimension of span{y − x : y ∈ B̃} is at least 1. So x cannot be in e(C*), a contradiction. This proves the claim.
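Theorem 3.12 also has a computational face for polytopes: finding the weights amounts to a small feasibility linear program, and a basic solution of that program automatically uses at most d + 1 extreme points. The following is a minimal numerical sketch along these lines, assuming numpy and scipy are available; the function name caratheodory_weights is ours, and the sparsity claim relies on the solver returning a basic (vertex) solution, which simplex-type methods typically do.

    import numpy as np
    from scipy.optimize import linprog

    def caratheodory_weights(x, extreme_points):
        # Weights lam >= 0 with sum(lam) = 1 and sum_i lam_i v_i = x.
        # A basic solution of this feasibility LP has at most d + 1
        # positive entries, matching the bound in Theorem 3.12.
        V = np.asarray(extreme_points, dtype=float)      # shape (m, d)
        m, d = V.shape
        A_eq = np.vstack([V.T, np.ones((1, m))])         # V^T lam = x, 1^T lam = 1
        b_eq = np.concatenate([np.asarray(x, dtype=float), [1.0]])
        res = linprog(c=np.zeros(m), A_eq=A_eq, b_eq=b_eq,
                      bounds=[(0, None)] * m, method="highs")
        assert res.success, "x is not in the convex hull of the given points"
        return res.x

    # The centre of the unit square in R^2 uses at most d + 1 = 3 corners.
    corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
    lam = caratheodory_weights((0.5, 0.5), corners)
    print(np.round(lam, 6), "positive weights:", int(np.sum(lam > 1e-9)))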
3.5 The Shapley-Folkman Theorem
We next present an interesting result of Shapley and Folkman that has found important applications, e.g., for estimating the so-called duality gap for non-convex problems [2] and convergence to a convex set of normalized Minkowski sums of independent and identically distributed random closed bounded sets [1]. (The Minkowski sum of two sets A, B ⊂ R^d is the set A + B := {x + y : x ∈ A, y ∈ B}.) Left unpublished by the authors themselves, various proofs of this result appeared subsequently, beginning with Starr [14], to whom it was communicated by Shapley and Folkman in a private correspondence. Our treatment follows [15].
Theorem 3.14 (Shapley-Folkman) Let S_i ⊂ R^d, 1 ≤ i ≤ n, and let

  S := \sum_{i=1}^n S_i := \Big\{ \sum_{i=1}^n x_i : x_i ∈ S_i, 1 ≤ i ≤ n \Big\}

be their Minkowski sum. Then each x ∈ co(S) can be written as x = \sum_{i=1}^n x_i where x_i ∈ co(S_i) ∀i and x_i ∈ S_i for at least (n − d) indices i.
Proof We use the easily proved fact that if q, q_1, ..., q_r ∈ R^d with r > d and q = \sum_{i=1}^r a_i q_i for some a_i ≥ 0, then q = \sum_{m=1}^s b_m q_{i_m} for some 1 ≤ i_m ≤ r, 1 ≤ s ≤ d, and b_m ≥ 0 ∀m. Let x = \sum_{i=1}^n y_i ∈ co(S) with y_i ∈ co(S_i) ∀i (see Exercise 3.11). Let also y_i = \sum_{j=1}^{\ell_i} a_{ij} y_{ij} with a_{ij} > 0 ∀i, j, \sum_{j=1}^{\ell_i} a_{ij} = 1, and y_{ij} ∈ S_i. Define the following vectors in R^{d+n}:

  z = [x^T : 1, 1, ..., 1]^T,
  w_{1j} = [y_{1j}^T : 1, 0, ..., 0]^T,
  ...
  w_{nj} = [y_{nj}^T : 0, 0, ..., 1]^T.

By construction,

  z = \sum_{i=1}^n \sum_{j=1}^{\ell_i} a_{ij} w_{ij}.

By the fact mentioned at the beginning of this proof,

  z = \sum_{i=1}^n \sum_{j=1}^{\ell_i} b_{ij} w_{ij}

where b_{ij} ≥ 0 ∀i, j and at most (d + n) of the b_{ij}'s are strictly positive. In particular, from the definitions of z, {w_{ij}}, we have x = \sum_{i=1}^n \sum_{j=1}^{\ell_i} b_{ij} y_{ij}. Letting x_i := \sum_{j=1}^{\ell_i} b_{ij} y_{ij}, we have x = \sum_{i=1}^n x_i with x_i ∈ co(S_i) ∀i. Since there are at most (d + n) of the b_{ij}'s that are > 0 and at least one b_{ij} > 0 for each i, it follows that there are at most d values of i for which b_{ij} > 0 for two or more j's. That is, x_i ∈ S_i for at least (n − d) values of i.
See [10] for approximate versions of the Caratheodory and Shapley-Folkman theorems.
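To see the theorem at work in the simplest case, take d = 1, n = 4 and S_i = {0, 1} for each i. Then S = {0, 1, 2, 3, 4} and co(S) = [0, 4], while co(S_i) = [0, 1]. The point x = 2.5 ∈ co(S) can be written as 1 + 1 + 0.5 + 0, where only the summand 0.5 fails to lie in its S_i; at least n − d = 3 of the summands lie in their respective S_i's, as the theorem guarantees.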
3.6 Helly's Theorem
Helly's theorem is a major result in convex analysis that has led to many interesting spin-offs and applications [4, 7]. It is closely related to Caratheodory's theorem; in fact, they can be derived from each other [12]. Our proof follows [12]. We begin with a lemma.
Lemma 3.3 (Radon) Let A := {x_1, ..., x_m} ⊂ R^d where m ≥ d + 2. Then there exist A_1, A_2 ⊂ A such that A = A_1 ∪ A_2, A_1 ∩ A_2 = φ and co(A_1) ∩ co(A_2) ≠ φ.
Proof Since m ≥ d + 2, there exist a_i ∈ R, 2 ≤ i ≤ m, not all zero, such that \sum_{i=2}^m a_i(x_i − x_1) = θ. Hence there exist b_i, 1 ≤ i ≤ m, not all zero, such that \sum_{i=1}^m b_i x_i = θ and \sum_{i=1}^m b_i = 0. Define I_1 := {i : b_i ≥ 0} and I_2 := {i : b_i < 0}. Then I_1, I_2 ≠ φ. Let B := \sum_{i∈I_1} b_i and A_k := {x_i, i ∈ I_k} for k = 1, 2. Then

  \sum_{i∈I_1} \frac{b_i}{B} x_i = \sum_{i∈I_2} \Big(−\frac{b_i}{B}\Big) x_i ∈ co(A_1) ∩ co(A_2).

The claim follows.
Theorem 3.15 (Helly) Let C_i ⊂ R^d, 1 ≤ i ≤ m, be convex, with m ≥ d + 1. If ∩_{i∈I} C_i ≠ φ for every I ⊂ {1, ..., m} with |I| = d + 1, then ∩_i C_i ≠ φ.
Proof We prove the claim by induction on the number of sets. It is obvious for m = d + 1. Suppose it holds for some m ≥ d + 1, and consider a collection of (m + 1) convex sets, say C_1, ..., C_{m+1}, any (d + 1) of which have a common point. Let D_i := ∩_{j≠i} C_j, 1 ≤ i ≤ m + 1. By the induction hypothesis, D_i ≠ φ ∀i. Pick x_i ∈ D_i, 1 ≤ i ≤ m + 1, and define W := {x_1, ..., x_{m+1}}. Then by the above lemma, we can find W_1 = {x_i : i ∈ I_1}, W_2 = {x_i : i ∈ I_2} ⊂ W for suitable I_1, I_2, such that W_1 ∪ W_2 = W and co(W_1) ∩ co(W_2) ≠ φ. Pick x* ∈ co(W_1) ∩ co(W_2). Clearly I_1 ∩ I_2 = φ, and hence i ∈ I_1, j ∈ I_2 implies D_i ⊂ C_j and D_j ⊂ C_i. Let i ∈ I_1. Then W_2 ⊂ C_i, and since C_i is convex, co(W_2) ⊂ C_i; therefore x* ∈ C_i. A similar argument shows that x* ∈ C_i for any i ∈ I_2, proving the claim.
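For d = 1 the theorem is transparent: convex sets in R are intervals, and the hypothesis says that any two of the intervals [a_i, b_i] intersect, i.e., a_i ≤ b_j for all i, j. Then max_i a_i ≤ min_j b_j, and any point of [max_i a_i, min_j b_j] lies in every interval.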
3.7 Brouwer Fixed Point Theorem
We now prove another important fixed point theorem, viz., Brouwer's fixed point theorem. (See [6] for an excellent text on fixed point theorems.) We begin with some preliminaries. For simplicity, in what follows, we take B := the closed unit ball in R^d centred at the origin, though the proofs extend easily to a closed bounded convex set B.
Lemma 3.4 Let φ : R^d → R^d be continuously differentiable with φ(x) = x when ‖x‖ ≥ 1. Then φ is onto.
Proof If not, there exists y* ∈ B such that φ(x) ≠ y* ∀x. Then there exists an open ball B_ε of radius ε, 1 > ε > 0, centred at y* such that ∀y ∈ B_ε, φ(x) ≠ y ∀x: if not, we could find x_n ∈ B, n ≥ 1, with φ(x_n) → y*. By the Bolzano-Weierstrass theorem, x_n → some x* ∈ B along a subsequence. Then by continuity, φ(x*) = y*, a contradiction. Hence B_ε as above exists. Consider a continuously differentiable f : R^d → [0, ∞), not identically zero and vanishing outside B_ε. Then by the change of variables formula from calculus,

  0 ≠ \int f(y)\,dy = \int f(φ(x))\,|\det(D_xφ(x))|\,dx = 0,

a contradiction. The claim follows.
Corollary 3.3 If φ : B → R^d is continuous with φ(x) = x on ∂B, then ∀y ∈ B, y = φ(x) for some x ∈ B.
Proof Extend φ to R^d by setting φ(x) = x for ‖x‖ > 1. For n ≥ 1, we can construct continuously differentiable φ_n : R^d → R^d such that φ_n(x) = x for ‖x‖ ≥ 1 and φ_n(x) → φ(x) uniformly on B (check this). Let y ∈ B. By Lemma 3.4, there exist x_n ∈ B such that φ_n(x_n) = y. By the Bolzano-Weierstrass theorem, x_n → some x along a subsequence, whence by the continuity of φ and the uniform convergence of φ_n to φ, y = φ(x), proving the claim.
Corollary 3.4 (No retract theorem) There is no continuous map φ : B → ∂B for which φ(x) = x for x ∈ ∂B.
Proof The existence of such a map would contradict the preceding corollary, implying the claim.
Theorem 3.16 (Brouwer fixed point theorem) Any continuous function φ : B → B has a fixed point.
Proof If not, let ψ(x) denote the point where the ray from φ(x) through x meets ∂B. Then ψ : B → ∂B is continuous with ψ(x) = x on ∂B, contradicting Corollary 3.4. The claim follows.
A very elementary and ingenious proof of Brouwer's fixed point theorem appears in [5]. We have not included it here because it calls upon some (very basic) ideas from point set topology that have not been developed here to an adequate extent.
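The construction of ψ in the proof of Theorem 3.16 is explicit enough to compute. Here is a minimal numerical sketch (numpy assumed; the map φ and the names are illustrative, and x ≠ φ(x) is assumed, as in the proof): the nonnegative root of ‖x + t(x − φ(x))‖ = 1 locates where the ray from φ(x) through x exits B, and ψ(x) = x for x ∈ ∂B.

    import numpy as np

    def psi(x, phi_x):
        # Point where the ray from phi(x) through x meets the unit sphere:
        # the nonnegative root t of |x + t v|^2 = 1 with v = x - phi(x).
        v = x - phi_x                                   # assumes x != phi(x)
        a, b, c = v @ v, x @ v, x @ x - 1.0
        t = (-b + np.sqrt(b * b - a * c)) / a
        return x + t * v

    phi = lambda x: 0.5 * np.array([-x[1], x[0]])       # a continuous map B -> B
    x = np.array([0.6, -0.2])
    print(np.linalg.norm(psi(x, phi(x))))               # 1.0: lands on the sphere
    x = np.array([0.0, 1.0])                            # boundary point
    print(psi(x, phi(x)))                               # equals x, as the proof needs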
3.8 Proof of Theorem 3.1
This section is devoted to the proof of Theorem 3.1. Without loss of generality, we assume f(0) = 0 and show that f is linear. In the proof, we use the notation [x, y] to denote the line segment joining the two vectors x and y; if the vectors are in R, this reduces to the closed interval. We also use the notation [r, s]x to denote the set {ux : u ∈ [r, s]}.
Step 1: For each y ∈ R^d, f(R_+ y) ⊆ R_+ f(y).
Suppose that this is not true. Then there exists α ∈ R_+ such that f(αy) ∉ R_+ f(y). Let β > max{α, 1} and consider f([0, β]y). Since 0 = f(0) ∈ f([0, β]y), f(βy) ∈ f([0, β]y) and f([0, β]y) is convex (from the hypothesis), the line segment joining 0 and f(βy) lies in f([0, β]y), that is, [0, f(βy)] ⊆ f([0, β]y). The sets f([0, α]y) and f([α, β]y), being images of closed and bounded sets, are closed. Moreover, their intersection contains the single point f(αy), which is not in R_+ f(y) by our hypothesis. At the same time,

  [0, f(βy)] = {f([0, α]y) ∩ [0, f(βy)]} ∪ {f([α, β]y) ∩ [0, f(βy)]}

is a union of two disjoint closed sets. This contradicts the fact that [0, f(βy)] is connected. This completes Step 1.
Step 2: f maps lines to lines.
Clearly f(R_− y) = f(R_+(−y)) ⊆ R_+ f(−y). Hence
  f(Ry) ⊆ R_+ f(−y) ∪ R_+ f(y).
Since f is an injection, R_+ f(−y) ∩ R_+ f(y) = {f(0)}. As f(Ry) is convex, the only possibility is that R_+ f(−y) ⊆ R_− f(y) as well as R_+ f(y) ⊆ R_− f(−y). This follows because the line segment [f(−y), f(y)] ⊆ f(Ry). Thus f(Ry) ⊆ R f(y). Consequently, there is an α ∈ R such that f(−y) = α f(y). Now, for each x, y ∈ R^d,

  f(x + R_+ y) ⊆ f(x) + R_+(f(x + y) − f(x)).

This follows by applying Step 1 to the function g(y) := f(x + y) − f(x). Along the same lines as above, we can show that

  f(x + Ry) ⊆ f(x) + R(f(x + y) − f(x))

for each x, y ∈ R^d.
Step 3: If x and y are linearly independent, then f(x) and f(y) are linearly independent.
Suppose f(x) and f(y) are not linearly independent. Then f(x) = λ f(y) for some λ ≠ 0. This implies that f(x) ∈ R f(y), and hence

  [0, f(x)] ∩ \Big[0, \frac{1}{λ} f(x)\Big] ⊆ f(Rx) ∩ f(Ry),

which is a singleton (because f is injective and x, y are linearly independent), whereas the left hand side is an interval, a contradiction. Therefore f(x) and f(y) are linearly independent.
Step 4: For x, y ∈ R^d, f(Rx + Ry) ⊆ R f(x) + R f(y).
Let z = 2(y − x). Then we have, from Step 2,

  f(2x + Rz) ⊆ f(2x) + R(f(2y) − f(2x)).

In particular,

  f(x + y) = f\Big(2x + \frac{1}{2}z\Big) ∈ f(2x) + R(f(2y) − f(2x)).

Using Step 2 once again, we see that f(x + y) ∈ R f(x) + R f(y). From this, it immediately follows that f(Rx + Ry) ⊆ R f(x) + R f(y).
Step 5: f is linear.
Let x, y ∈ R^d be linearly independent. From Step 4, there exist a, b ∈ R such that f(x + y) = a f(x) + b f(y). Since x and y are linearly independent, f(x) and f(y) are linearly independent from Step 3. Therefore both a and b are nonzero. We will show that a = 1 and b = 1. Note that f(x + y) − f(x) = (a − 1) f(x) + b f(y) and b ≠ 0. Let λ > 0. Consider f(x + λy). From Step 2, there exists s_λ > 0 such that

  f(x + λy) = f(x) + s_λ[f(x + y) − f(x)] = (1 + (a − 1)s_λ) f(x) + b s_λ f(y).

It follows from Step 1 that there exists t_λ > 0 such that

  f(x + λy) = t_λ f\Big(\frac{x}{λ} + y\Big).
Thus

  f\Big(\frac{x}{λ} + y\Big) = \frac{1 + (a − 1)s_λ}{t_λ} f(x) + \frac{b s_λ}{t_λ} f(y).

Letting λ → ∞, f(x/λ + y) → f(y), and hence

  \frac{1}{t_λ}(1 + (a − 1)s_λ) → 0 and b\frac{s_λ}{t_λ} → 1.

Therefore \frac{1}{t_λ} → −\frac{(a−1)}{b}. Since t_λ is positive, (a − 1) and b have opposite signs. Moreover, b is positive (nonzero as well) from the second limit, and hence (a − 1) is non-positive.
Next consider f(x − λy). Proceeding as above, there exist u_λ > 0, v_λ > 0 such that

  f(x − λy) = f(x) − u_λ[f(x + y) − f(x)] = (1 − (a − 1)u_λ) f(x) − b u_λ f(y)

as well as
  f(x − λy) = −v_λ f\Big(−\frac{x}{λ} + y\Big).
Arguing as above, we have

  \frac{1 − (a − 1)u_λ}{v_λ} → 0 and b\frac{u_λ}{v_λ} → 1

as λ → ∞, which implies that \frac{1}{v_λ} → \frac{(a−1)}{b}. The positivity of v_λ will then imply that (a − 1) and b have the same sign. Since b is positive (this also follows from the second limit), (a − 1) must be non-negative. Thus (a − 1) is both non-negative and non-positive. This is possible if and only if a = 1. In a similar manner (or by interchanging the roles of x and y), we can show that b = 1. Thus f(x + y) = f(x) + f(y) provided x and y are linearly independent.
Now let x ∈ R^d and y be any vector linearly independent of x. Then, for any ε > 0, x and x + εy are linearly independent, and therefore f(x + (x + εy)) = f(x) + f(x + εy). Letting ε → 0, we see that f(2x) = 2f(x). Using induction and density arguments, we can show that f(αx) = α f(x), implying that f preserves scalar multiplication. Hence f is linear, completing the proof of the theorem.
3.9 Exercises
3.1 If C, D ⊂ R^d are convex, show that so is C + D := {x + y : x ∈ C, y ∈ D} ⊂ R^d.
3.2 If C ⊂ R^n, D ⊂ R^d are convex, show that so is C × D := {z = (x, y) : x ∈ C, y ∈ D} ⊂ R^{n+d}.
3.3 Prove (P1)–(P5).
3.4 Show that the map x ∈ R^d → its projection to a subspace S of R^d is linear and, in particular, maps convex sets to convex sets.
3.5 Prove the following: if C ⊂ R^d is open, then its convex hull is also open. If C is closed and bounded, then so is its convex hull.
3.6 Show that the equivalent definitions of convex hull, resp. closed convex hull, given in Sect. 3.1 are indeed equivalent.
3.7 Show that a closed convex set is the intersection of all half spaces containing it.
3.8 Prove (S1)–(S3).
3.9 Let x_1, x_2 ∈ R^d\C for a closed bounded convex set C ⊂ R^d, and x̂_i = arg min_{y∈C} ‖x_i − y‖, i = 1, 2. Show that ‖x̂_1 − x̂_2‖ ≤ ‖x_1 − x_2‖. Show that the claim is false if we drop the convexity assumption on C.
3.10 Let C_i ⊂ R^d, 1 ≤ i ≤ m, be closed bounded convex sets with C := ∩_i C_i ≠ φ. Let x ∉ ∪_i C_i. Define the sequence {x_n} by: x_0 = x and x_{i+1} := the projection of x_i to C_{(i mod m)+1}. That is, we generate the sequence by successive projections on the C_i's in a round robin fashion, beginning with x (see the numerical sketch following this exercise).
(a) Show that for y ∈ C, ‖x_n − y‖ decreases monotonically to some c ≥ 0.
(b) Show that x_n → a point x̌ in C. (Hint: Use contradiction.)
(c) If the C_i's, C are affine spaces, show that x̌ is the projection of x on C.
(d) Show that the claim (c) above need not be true if the C_i's are not affine spaces.
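A minimal numerical sketch of the round-robin scheme of Exercise 3.10 for two overlapping discs (numpy assumed; the helper name and the specific sets are ours):

    import numpy as np

    def project_ball(x, centre, r):
        # Euclidean projection onto the closed ball of radius r about centre.
        d = x - centre
        n = np.linalg.norm(d)
        return x if n <= r else centre + (r / n) * d

    # Round-robin projections as in Exercise 3.10, for two overlapping discs.
    balls = [(np.array([0.0, 0.0]), 1.0), (np.array([1.5, 0.0]), 1.0)]
    x = np.array([4.0, 3.0])
    for k in range(200):
        centre, r = balls[k % len(balls)]
        x = project_ball(x, centre, r)
    print(x)    # a point of the intersection of the two discs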
3.11 Let A_1, A_2, ..., A_n ⊆ R^d be convex. Show that the convex hull of the Minkowski sum A_1 + A_2 + ··· + A_n is equal to the Minkowski sum of the convex hulls of A_1, A_2, ..., A_n.
3.12 Show that the set of d × d symmetric positive definite matrices whose eigenvalues lie in the interval [a, b] for some 0 < a < b < ∞ is a closed bounded convex subset of R^{d×d}. Which of these properties fail (if at all) when we allow a = 0 or b = ∞?
3.13 Show that the set of d × d stochastic matrices (i.e., non-negative matrices with row sums = 1) is a closed bounded convex subset of R^{d×d}. What are its extreme points?
3.14 (i) Show that the set C of d × d doubly stochastic matrices (i.e., non-negative matrices whose row and column sums are = 1) is a closed bounded convex subset of R^{d×d}. Show that permutation matrices (i.e., matrices with entries in {0, 1} such that each row or column has a 1 in exactly one place) are extreme points of C.
(ii) The Birkhoff-von Neumann theorem [3] says that the converse of the above also holds. Using this, prove that a d × d doubly stochastic matrix can be written as a convex combination of at most d² − 2d + 2 permutation matrices. (A numerical sketch of such a decomposition follows these exercises.)
3.15 Let K := {x ∈ R^d : Ax ≤ b} for some A ∈ R^{m×d}, b ∈ R^m be non-empty and bounded. Show that it is a convex polytope and, conversely, every convex polytope is of this form.
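The decomposition in Exercise 3.14(ii) can be computed greedily: repeatedly extract a permutation supported on the positive entries (one exists by the Birkhoff-von Neumann theorem), weighted by the smallest entry it uses. A minimal sketch, assuming numpy and scipy are available; the function name is ours and floating-point tolerances are handled crudely:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def birkhoff(A, tol=1e-9):
        # Greedy decomposition of a doubly stochastic A into
        # (weight, permutation) pairs; each step zeroes at least one
        # entry, so the loop runs at most d^2 - 2d + 2 times.
        A = np.array(A, dtype=float)
        terms = []
        while A.max() > tol:
            # A permutation supported on the positive entries exists by
            # the Birkhoff-von Neumann theorem; find one via matching.
            rows, cols = linear_sum_assignment((A > tol).astype(float),
                                               maximize=True)
            w = A[rows, cols].min()
            terms.append((w, cols.copy()))
            A[rows, cols] -= w
        return terms

    A = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
    for w, perm in birkhoff(A):
        print(w, perm)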
References
1. Artstein, Z., Vitale, R.A.: A strong law of large numbers for random compact sets. Ann. Probab. 3(5), 879–882 (1975)
2. Aubin, J.P., Ekeland, I.: Estimates of the duality gap in nonconvex optimization. Math. Oper. Res. 1(3), 225–245 (1976)
3. Bapat, R.B., Raghavan, T.E.S.: Nonnegative Matrices and Applications. Encyclopedia of Mathematics and Its Applications, vol. 64. Cambridge University Press, Cambridge (1997)
4. Bárány, I., Kalai, G.: Helly-type problems. Bull. Am. Math. Soc. 59, 471–502 (2022)
5. Ben-El-Mechaieh, H., Mechaiekh, Y.A.: An elementary proof of the Brouwer's fixed point theorem. Arab. J. Math. 11, 179–188 (2022)
6. Border, K.C.: Fixed Point Theorems with Applications to Economics and Game Theory. Cambridge University Press, Cambridge, UK (1989)
7. de Loera, J.A., Goaoc, X., Meunier, F., Mustafa, N.H.: The discrete yet ubiquitous theorems of Caratheodory, Helly, Sperner, Tucker, and Tverberg. Bull. Am. Math. Soc. 5, 415–511 (2019)
8. Deutsch, F.: Best Approximation in Inner Product Spaces. CMS Books in Mathematics [Ouvrages de Mathématiques de la SMC], vol. 7. Springer, New York (2001)
9. Dubins, L.E.: On extreme points of convex sets. J. Math. Anal. Appl. 5, 237–244 (1962)
10. Kerdreux, T., d'Aspremont, A., Colin, I.: An approximate Shapley-Folkman theorem (2020). arXiv:1712.08559
11. Knecht, A., Vanderwerff, J.: Legendre functions whose gradients map convex sets to convex sets. In: Computational and Analytical Mathematics. Springer Proceedings in Mathematics and Statistics, vol. 50, pp. 455–462. Springer, New York (2013)
12. Robinson, A.: Helly's theorem and its equivalences via convex analysis. University Honours theses, Portland State University, vol. 67 (2014)
13. Schürmann, B., El-Guindy, A., Althoff, M.: Closed-form expressions of convex combinations. In: Proceedings of the American Control Conference, Boston, pp. 2795–2801 (2016)
14. Starr, R.M.: Quasi-equilibria in markets with non-convex preferences. Econometrica 37(1), 25–38 (1969)
15. Zhou, L.: A simple proof of the Shapley-Folkman theorem. Econom. Theory 3(2), 371–372 (1993)
Chapter 4
Convex Functions
4.1 Basic Properties
This chapter is devoted to convex functions, the rock star of optimization theory. In this section, we recall their key properties that matter for convex optimization. Recall that a function f : C → R for a convex set C ⊂ R^d is said to be convex if for any λ ∈ [0, 1] and x, y ∈ C, we have

  f(λx + (1 − λ)y) ≤ λ f(x) + (1 − λ) f(y).   (4.1)

Note that the definition itself requires λx + (1 − λ)y to be in the domain of definition of f; hence the convexity requirement on C. Just as we did for convex sets, we can give a seemingly more general but equivalent definition: for C as above, f : C → R is convex if for n ≥ 2; λ_i, 1 ≤ i ≤ n, in [0, 1] with \sum_i λ_i = 1; and x_1, ..., x_n ∈ C, we have

  f\Big(\sum_{i=1}^n λ_i x_i\Big) ≤ \sum_{i=1}^n λ_i f(x_i).   (4.2)
Clearly, (4.2) reduces to (4.1) for n = 2. The reverse implication follows by induction. Suppose (4.2) holds for 2 ≤ m < n. Assuming without loss of generality that λ1 ∈ (0, 1), we have
Fig. 4.1
  f\Big(\sum_{i=1}^n λ_i x_i\Big) = f\Big(λ_1 x_1 + (1 − λ_1)\sum_{i=2}^n \frac{λ_i}{1 − λ_1} x_i\Big)
    ≤ λ_1 f(x_1) + (1 − λ_1) f\Big(\sum_{i=2}^n \frac{λ_i}{1 − λ_1} x_i\Big)
    ≤ λ_1 f(x_1) + (1 − λ_1)\sum_{i=2}^n \frac{λ_i}{1 − λ_1} f(x_i)
    = \sum_{i=1}^n λ_i f(x_i).
Here the first and second inequalities follow from the induction hypothesis for m = 2 and m = n − 1, resp.
A convex function f is said to be strictly convex if the inequalities in (4.1) or (4.2) are required to be strict whenever x ≠ y in (4.1), equivalently, whenever not all x_i are identical in (4.2). A visual interpretation of these definitions is that f is convex if any line segment joining two distinct points on its graph lies on or above the graph, and strictly convex if it lies strictly above the graph except for its end points (see Fig. 4.1). Linear functions, which have the form f(x) = ⟨a, x⟩ for some a ∈ R^d, and affine functions, which have the form f(x) = ⟨a, x⟩ + b for a ∈ R^d, b ∈ R, are trivially convex, but not strictly convex.
The following properties of convex functions follow immediately from the definition. We assume that multiple convex functions occurring in them are defined on a common convex domain C. More generally, one would take the intersection of their domains, which would also be convex if non-empty.
(C1) Positive linear combinations of convex functions are convex, i.e., if {f_i, 1 ≤ i ≤ n} are convex and a_1, ..., a_n ≥ 0, then \sum_{i=1}^n a_i f_i is convex. It will be strictly convex if at least one f_i for which a_i > 0 is. The former claim follows quite simply as follows. Let x, y ∈ C, λ ∈ [0, 1]. Then

  \sum_{i=1}^n a_i f_i(λx + (1 − λ)y) ≤ \sum_{i=1}^n a_i(λ f_i(x) + (1 − λ) f_i(y)) = λ\sum_{i=1}^n a_i f_i(x) + (1 − λ)\sum_{i=1}^n a_i f_i(y).
For the latter claim, note that the inequality will be strict if x ≠ y and some f_i for which a_i > 0 is strictly convex.
(C2) Pointwise limits of convex functions are convex, i.e., if f_n, n ≥ 1, are convex and f(x) = lim_{n↑∞} f_n(x) exists for all x ∈ C, then f : C → R is convex. This is seen simply by passing to the limit as n ↑ ∞ in the inequality (4.1) for f_n, i.e., in

  f_n(λx + (1 − λ)y) ≤ λ f_n(x) + (1 − λ) f_n(y),

to obtain (4.1) for the limiting f. Note that even if all f_n are strictly convex, f need not be. For example, f_n(x) = x + x²/n, x ∈ R, is strictly convex, but f(x) = lim_{n↑∞} f_n(x) = x, x ∈ R, is not.
(C3) Pointwise maxima or suprema of convex functions are convex. That is, max_α f_α(·), sup_α f_α(·), where α is any index taking values in a prescribed set, are convex when well defined (i.e., pointwise finite). This is proved as follows: for x, y ∈ C and λ ∈ [0, 1],

  sup_α f_α(λx + (1 − λ)y) ≤ sup_α(λ f_α(x) + (1 − λ) f_α(y)) ≤ λ sup_α f_α(x) + (1 − λ) sup_α f_α(y).
(C4) Pointwise minima of convex functions need not be convex, e.g., f(x) = min(x², (x − 1)²) for x ∈ R. However, if f : C × D → R is convex for convex sets C ⊂ R^d, D ⊂ R^m, and g(x) = inf_{y∈D} f(x, y) > −∞ ∀ x, then g is convex. This is because for given x, x′ ∈ C and any y, y′ ∈ D, for any λ ∈ [0, 1],

  g(λx + (1 − λ)x′) ≤ f(λx + (1 − λ)x′, λy + (1 − λ)y′) ≤ λ f(x, y) + (1 − λ) f(x′, y′).

Taking the infimum over y, y′ on the right, we get g(λx + (1 − λ)x′) ≤ λ g(x) + (1 − λ) g(x′).
(C5) If f_1, ..., f_m are convex and g : R^m → R is convex and componentwise increasing, then h(·) = g(f_1(·), ..., f_m(·)) : C → R is convex: for x, y ∈ C and λ ∈ [0, 1],

  h(λx + (1 − λ)y) = g(f_1(λx + (1 − λ)y), ..., f_m(λx + (1 − λ)y))
    ≤ g(λ f_1(x) + (1 − λ) f_1(y), ..., λ f_m(x) + (1 − λ) f_m(y))
    ≤ λ g(f_1(x), ..., f_m(x)) + (1 − λ) g(f_1(y), ..., f_m(y)).

Note that the first inequality uses the monotone increase (componentwise) of g. In fact the claim is false without it: consider g(x) = e^{−x}, f(x) = x² and h(x) = g(f(x)) = e^{−x²} for x ∈ R.
(C6) Let {f_α} be a family of convex functions indexed by a parameter α ∈ R^m (more generally, a subset thereof) and ϕ : R^m → [0, ∞). Suppose f*(x) := \int f_α(x)ϕ(α)\,dα is well defined as a Riemann integral for all x. Then f* is convex. (This also extends to the case when the integral above is a Lebesgue integral.)
(C7) If f is componentwise convex, it need not be jointly convex, i.e., if f(x, ·), f(·, y) are convex, f(·, ·) need not be. For example, consider f : R² → R given by f(x, y) = xy (a numerical check follows this list).
(C8) Let f : R^d → R be convex, b ∈ R^d and A a d × m matrix. Then x ∈ R^m → f(Ax + b) ∈ R is convex. The proof is easy.
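The counterexample in (C7) can be checked numerically (a sketch of ours, assuming numpy): the Hessian of f(x, y) = xy is constant with eigenvalues ±1, and (4.1) fails along a suitable segment.

    import numpy as np

    # (C7) revisited: f(x, y) = x*y is convex (indeed linear) in each
    # variable separately, but its Hessian has eigenvalues +1 and -1.
    H = np.array([[0.0, 1.0], [1.0, 0.0]])
    print(np.linalg.eigvalsh(H))                        # [-1.  1.]

    # A direct violation of (4.1): the value at the midpoint exceeds
    # the average of the endpoint values.
    f = lambda p: p[0] * p[1]
    x, y = np.array([1.0, -1.0]), np.array([-1.0, 1.0])
    print(f(0.5 * (x + y)), 0.5 * (f(x) + f(y)))        # 0.0 vs -1.0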
4.2 Continuity
One property of convex functions that is of great value in optimization is that they are continuous in the interior of their domain of definition. In order to prove this, we introduce another important concept, which ties up the theory of convex functions with the theory of convex sets, viz., the epigraph. The epigraph of a function f : C → R for a convex C ⊂ R^d, denoted epi(f), is defined as the set of points on or above the graph of f, i.e., the set {(x, y) : y ≥ f(x)} (see Fig. 4.2).
Fig. 4.2 Epigraph
Lemma 4.1 f is a convex function if and only if epi(f) is a convex set.
Proof Recall that for convex f, any line segment joining two points on the graph of f lies on or above the graph; hence, so does any line segment joining two points that lie on or above the graph of f. Thus epi(f) is convex. Conversely, suppose epi(f) is convex. Then any line segment joining two points on the graph of f is in epi(f), and hence lies on or above the graph of f by the definition of the epigraph, so f is convex.
The notion of an epigraph allows us to translate many problems concerning convex functions to those concerning convex sets, often rendering them easier. For example, (C3) above follows from the fact that epi(sup_α f_α) is the intersection of {epi(f_α)} and intersections of convex sets are convex. Likewise, (C4) is seen to follow from the
fact that epi(g) is the projection of epi(f) on the x-space and projections of convex sets are convex.
Let f : C → R be convex for a convex C ⊂ R^d with non-empty interior.
Theorem 4.1 f is continuous at any x_0 ∈ int(C).
Proof Consider epi(f) ⊂ R^{d+1} = R^d × R with R^d and R on the right hand side identified with, resp., the domain and range of f. Let H be a support hyperplane of epi(f) ⊂ R^{d+1} at (x_0, f(x_0)) and n its normal pointing into epi(f) (see Fig. 4.3).
Fig. 4.3
Change coordinates of R^{d+1} to those that map (x_0, f(x_0)) to the origin of R^{d+1}, the support hyperplane of epi(f) at (x_0, f(x_0)) to the subspace corresponding to the first d components thereof (i.e., a copy of R^d), and span(n) to the span of the last coordinate vector thereof (i.e., a copy of R). Recall that linear transformations preserve convexity. Thus without loss of generality, we may take x_0 = θ := the zero vector in R^d and f(x) ≥ f(θ) = 0 ∀ x ∈ C. The point x_0 = θ remains in the interior of the domain C of f after this transformation. Let x ≠ θ, x ∈ U := the unit cube in R^d centred at θ. Consider the line segment joining θ to x, extended till it meets ∂U at a unique point y. Then y = \sum_i α_i x_i, where the x_i's are the corners of U and α_i ≥ 0 ∀i with \sum_i α_i = 1. Also, x = εy for some ε ∈ (0, 1] and satisfies
  0 ≤ f(x) = f(εy) = f(εy + (1 − ε)θ) ≤ ε f(y) + (1 − ε) f(θ) = ε f(y)
    = ε f\Big(\sum_i α_i x_i\Big) ≤ ε \sum_i α_i f(x_i) ≤ ε \max_i f(x_i).

Thus for any δ > 0, we can find ε > 0 such that f(x) < δ when ‖x‖_∞ < δ / max_i f(x_i). The claim follows.
The result is not in general true if we replace int(C) by C. Consider, e.g., the function f : [0, 1] → R defined by f(x) = x² for x ∈ (0, 1] and f(0) = 1. This satisfies the definition of convexity, but is not continuous at 0.
In case f : C → R is continuous, epi(f) is closed: y_n ≥ f(x_n), x_n → x in C, and y_n → y together imply y ≥ f(x) by continuity, so epi(f) is closed. The converse is not true, as the following ingenious example from Luenberger [6] shows. Let C ⊂ R² be given by {(x, y) : (x − 1)² + y² ≤ 1}. Define f((0, 0)) = 0, f(x) = 1 for x ∈ C̃ := ∂C\{(0, 0)}. For x ∈ int(C), define f(x) by linear interpolation between the value 0 at (0, 0) and the value 1 at the unique point in C̃ such that x lies on the line segment joining the two. (Imagine an ice cream cone with a slit along a line from the bottom point to a point on the rim, held so that this slit is vertical.) This function is lower semicontinuous, but not continuous at (0, 0). Nevertheless its epigraph is closed. Another example is given in Exercise 4.13.
4.3 Differentiability Next we explore the differentiability properties of convex functions. Let f : R → R be convex. Consider z < q < r < y in R and the corresponding points a := (z, f (z)), b := (q, f (q)), c := (r, f (r )), d := (y, f (y)) on its graph. Then by convexity of f , the points b, c lie on or below the line segment a–d (see Fig. 4.4). This simple fact takes us a long way in deducing differentiability properties of convex functions.
Fig. 4.4
At x ∈ R, define, resp., the right derivative f_+(x) and the left derivative f_−(x) as

  f_+(x) := \lim_{ε↓0} \frac{f(x + ε) − f(x)}{ε},   (4.3)
  f_−(x) := \lim_{ε↓0} \frac{f(x) − f(x − ε)}{ε}.   (4.4)

By the foregoing observation on secant slopes, the difference quotient in (4.3) is monotonically non-increasing and that in (4.4) monotonically non-decreasing as ε ↓ 0, so both limits exist and f_−(x) ≤ f_+(x). Moreover, f_+ and f_− are non-decreasing in x, so they can differ at most on a countable set; that is, f is differentiable outside a countable set. Being monotone, f_+ is in turn differentiable almost everywhere (this paragraph requires a nominal familiarity with Lebesgue measure), so f is twice differentiable almost everywhere. Also, the absolute value of the right or left derivative at points x in a bounded set B is bounded
from above. In fact, |f(x) − f(y)| ≤ M(B)|x − y| for x, y ∈ B, where M(B) := the maximum absolute value of the right or left derivative on B; for convex f : R^d → R, d ≥ 1, the claim holds true more generally with M(B) := d × the maximum absolute value of the right or left directional derivative along any coordinate axis. In other words, f is locally Lipschitz. In fact, a careful look at the proof of Theorem 4.1 shows that this is already proved there (check this). A celebrated theorem of Rademacher states that a Lipschitz function, in particular f, is differentiable almost everywhere. This is an alternative route to almost everywhere differentiability. We summarize these results as follows.
Theorem 4.2 A convex function f : R^d → R is locally Lipschitz and is twice differentiable almost everywhere.
One can say more:
Lemma 4.2 1. Let f : C → R be convex, where C ⊂ R^d is convex and open. Then for any x ∈ C where f is differentiable,

  f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ ∀ y ∈ C.   (4.5)
Conversely, if f is continuously differentiable and this inequality holds for all x, y ∈ C, then f is convex.
2. If f above is twice continuously differentiable at x ∈ C, ∇²f(x) is positive semidefinite. Conversely, if ∇²f is positive definite in C, f must be convex.
Proof The graph of a convex function f : R^d → R is a d-dimensional surface in R^{d+1} given by the points (x, f(x)), x ∈ R^d, i.e., the surface described by the equation

  g(x, y) := f(x) − y = 0.

The normal to this surface at (x*, f(x*)) is given by ∇g(x*, f(x*)) = [∇f(x*) : −1]^T, because ∂g/∂x_i = ∂f/∂x_i and ∂g/∂y = −1. The support hyperplane to the epigraph of f at x* passes through the point (x*, f(x*)) and has the normal ∇g(x*, f(x*)) above. Therefore it is given by

  ⟨(x, y) − (x*, f(x*)), (∇f(x*), −1)⟩ = 0,

i.e.,

  y = f(x*) + ⟨∇f(x*), x − x*⟩.

Since the graph of f lies above this hyperplane, we have

  f(z) ≥ f(x*) + ⟨∇f(x*), z − x*⟩ ∀ z.
Conversely, if (4.5) holds for all x, y ∈ C, then

  f(x) = sup{ f(x_0) + ⟨∇f(x_0), x − x_0⟩ : x_0 ∈ C },

and as a pointwise supremum of affine functions, f must be convex.
Now suppose that the f above is twice differentiable at x ∈ C. Then for any unit vector d, the function α ∈ R → g(α) := f(x + αd) is twice differentiable, and as already observed,

  \frac{d²g}{dα²}\Big|_{α=0} = d^T ∇²f(x) d ≥ 0.

Since this holds for any unit vector d, ∇²f(x) is positive semidefinite. On the other hand, suppose f is twice continuously differentiable everywhere and ∇²f is positive definite. Using this fact in (2.4), we have

  f(x) ≥ f(x_0) + ⟨∇f(x_0), x − x_0⟩

for all x and x_0, implying (4.5). Part 1 then completes the proof.
The gradient of a convex function says a lot about the function itself. For example, if f : R^d → R is twice continuously differentiable, bounded from below, and ‖∇f‖² is convex, then f is convex. Furthermore, if g is another function with these properties and ∇f(x) = ∇g(x) ∀x, then f = g modulo an additive constant. See [1] for these and related results.
We next show the interesting fact that a convex function that is Gâteaux differentiable along coordinate directions at a point (i.e., all its partial derivatives are well defined there) is automatically Fréchet differentiable there. The following proof is from [7].
Theorem 4.3 Let f : U → R be a convex function, where U ⊆ R^d is open. Suppose z ∈ U is such that the partial derivatives ∂f/∂x_1(z), ..., ∂f/∂x_d(z) exist. Then f is Fréchet differentiable at z.
Proof Choose r > 0 such that U contains the open ball B_r(z). Consider the function

  g(u) = f(z + u) − f(z) − \sum_{i=1}^d \frac{∂f}{∂x_i}(z)u_i

for u ∈ B_r(θ). To prove the theorem, we need to show that

  \lim_{r↓0} \sup_{θ≠u∈B_r(θ)} \frac{|g(u)|}{‖u‖} = 0.   (4.6)
The function g is clearly convex. Therefore

  0 = g(θ) = g\Big(\frac{1}{2}u + \frac{1}{2}(−u)\Big) ≤ \frac{1}{2}g(u) + \frac{1}{2}g(−u),

implying −g(−u) ≤ g(u) ∀ u ∈ B_r(θ). Let u = \sum_{i=1}^d u_i e_i be such that ‖u‖ < r/d, where the e_i are the unit coordinate vectors. Then ‖du_ie_i‖ < r and, for ‖x‖_∞ := max_i |x_i|,

  g(u) = g\Big(\frac{1}{d}\sum_{i=1}^d du_ie_i\Big) ≤ \frac{1}{d}\sum_{i=1}^d g(du_ie_i) = \sum_{i:u_i≠0} u_i\,\frac{g(du_ie_i)}{du_i} ≤ ‖u‖_∞ \sum_{i:u_i≠0} \Big|\frac{g(du_ie_i)}{du_i}\Big|.

In a similar fashion,

  g(−u) ≤ ‖u‖_∞ \sum_{i:u_i≠0} \Big|\frac{g(−du_ie_i)}{du_i}\Big|.

Thus we have

  −‖u‖_∞ \sum_{i:u_i≠0} \Big|\frac{g(−du_ie_i)}{du_i}\Big| ≤ −g(−u) ≤ g(u) ≤ ‖u‖_∞ \sum_{i:u_i≠0} \Big|\frac{g(du_ie_i)}{du_i}\Big|.

Since all the partial derivatives of g at θ exist and equal zero by the choice of g, we have g(du_ie_i)/(du_i), g(−du_ie_i)/(du_i) → 0 uniformly as u_i → 0. Using this and the equivalence of norms on R^d in the above inequality, (4.6) follows, implying the claim.
At points where a convex function is not differentiable, the notion of a subgradient proves useful. Recall that if a convex function f : C → R for an open C ⊂ R^d is differentiable at x, then f(y) − f(x) ≥ ⟨∇f(x), y − x⟩. If f is not differentiable at x, define ξ ∈ R^d to be a subgradient of f at x if

  f(y) − f(x) ≥ ⟨ξ, y − x⟩.   (4.7)
This definition does not require f to be convex, but the existence of at least one such ξ does follow from convexity. This is apparent from the geometric interpretation of (4.7), viz., that the hyperplane through (x, f(x)) with normal vector [ξ, −1] ∈ R^{d+1} is a supporting hyperplane of epi(f). We confine ourselves to convex f. It also follows from (4.7) that the set of subgradients at x, called the subdifferential, is a closed convex set. The subdifferential of f at x is denoted by ∂f(x). A few properties of subdifferentials are immediate from the definition, e.g., for f, g : C → R with C open,
(G1) ∂(λf)(x) = λ∂f(x) ∀ x ∈ C, λ ≥ 0.
(G2) ∂(f + g) ⊂ ∂f + ∂g for all x ∈ C.
(G3) If f is Gâteaux differentiable at x, ∂f(x) = {the Gâteaux derivative of f at x}. In particular, if it is Fréchet differentiable, ∂f(x) = {∇f(x)}. The converse also holds.
(G4) If g(t) := f(x + tv) for t ∈ R, then ∂g(t) ⊂ {⟨ξ, v⟩ : ξ ∈ ∂f(x + tv)}.
(G5) For x ≠ y, f(y) − f(x) ∈ {⟨ξ, y − x⟩ : ξ ∈ ∂f(z), z ∈ [x, y]}.
(G6) A point x* ∈ int(C) is a local minimum of f if θ ∈ ∂f(x*). The reverse implication is true, e.g., if f is convex.
See [3] for more on subdifferential calculus and its applications.
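As a small illustration of (4.7) (our own numerical check, assuming numpy): for f(x) = |x|, the subdifferential at the origin is ∂f(0) = [−1, 1], and the defining inequality can be tested directly on a grid.

    import numpy as np

    f = lambda t: np.abs(t)
    ys = np.linspace(-2.0, 2.0, 401)
    # Every xi in [-1, 1] satisfies (4.7) at x = 0: f(y) - f(0) >= xi * y.
    for xi in (-1.0, -0.3, 0.0, 0.7, 1.0):
        assert np.all(f(ys) - f(0.0) >= xi * ys)
    # Any xi outside [-1, 1] violates (4.7) for some y.
    assert not np.all(f(ys) - f(0.0) >= 1.5 * ys)
    print("subdifferential of |.| at 0 is [-1, 1]")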
4.4 An Approximation Theorem
Recall the Weierstrass approximation theorem (Theorem 1.9): if A is a closed and bounded set and f : A → R is continuous, then given any ε > 0, there exists a multivariate polynomial p_ε : A → R such that max_{x∈A} |f(x) − p_ε(x)| < ε. For convex f, we can say more, viz., these polynomials can be taken to be convex.
Theorem 4.4 A convex f : R^d → R can be uniformly approximated on a closed bounded convex set C ⊂ R^d by convex polynomials on C.
Proof Recall the polynomials φ_n considered in the proof of Theorem 1.9. Since f is convex, the convolution

  f ∗ φ_n(x) = \int_{R^d} f(x − y)φ_n(y)\,dy

is convex by the property (C6). Hence the theorem follows.
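In one dimension, a concrete convexity-preserving approximation (a classical construction, distinct from the φ_n above) is given by the Bernstein polynomials: the degree-n Bernstein polynomial of a convex f on [0, 1] is again convex and converges to f uniformly. A minimal sketch, assuming numpy:

    import numpy as np
    from math import comb

    def bernstein(f, n):
        # Degree-n Bernstein polynomial of f on [0, 1]; for convex f it
        # is again convex and converges to f uniformly as n grows.
        coef = [f(k / n) for k in range(n + 1)]
        return lambda x: sum(c * comb(n, k) * x**k * (1.0 - x)**(n - k)
                             for k, c in enumerate(coef))

    f = lambda x: abs(x - 0.5)          # convex, with a kink at 0.5
    p = bernstein(f, 40)
    xs = np.linspace(0.0, 1.0, 201)
    print(max(abs(f(x) - p(x)) for x in xs))            # uniform error
    grid = xs[::20]
    assert all(p(0.5 * (a + b)) <= 0.5 * (p(a) + p(b)) + 1e-12
               for a in grid for b in grid)             # midpoint convexity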
4.5 Convex Extensions
It is tempting to say that a convex f : C ⊂ R^d → R for a closed convex C has a convex extension to the whole of R^d, along the lines of the Tietze extension theorem in Chap. 2. But this may not be the case. For example, f(x) = x log x and g(x) = −√x as maps [0, 1] → R do not extend to the whole space (check this). Also, while many convex optimization problems in practice that minimize a convex function f over a convex set C ⊂ R^d have f a priori specified as the restriction of some convex f̄ : R^d → R to C, this may not always be the case, so it is worthwhile to know when
this is possible. The following theorem, a special case of the result of Dragomirescu and Ivan [4], gives such a result. The proof is adapted from [10]. We need the following simple observation.
Lemma 4.3 Let x be a point on the line joining y, z ∈ R^d, y ≠ z, such that x is not a convex combination of y, z. Then x can be written as

  x = λy + (1 − λ)z, λ > 1,   (4.8)

if y is a convex combination of x, z. A symmetric statement holds if z is a convex combination of x, y.
Theorem 4.5 A convex function f : C ⊂ R^d → R for a convex and bounded C can be extended to a convex function R^d → R if and only if it is Lipschitz on C.
Proof The 'only if' part is already proved in Sect. 4.3 above. We prove the 'if' part. Suppose f : C → R is convex and Lipschitz, with C ⊂ R^d convex and bounded. For x ∈ R^d, define

  f̃(x) := sup{λ f(y) + (1 − λ) f(z) : x = λy + (1 − λ)z, y, z ∈ C, λ ≥ 1}.

Since f is Lipschitz on C, there exists an L > 0 such that |f(x) − f(y)| ≤ L‖x − y‖ ∀ x, y ∈ C. Then ∀ λ > 1 and x ∈ R^d, y, z ∈ C satisfying (4.8),

  |λ f(y) + (1 − λ) f(z)| ≤ λ|f(y) − f(z)| + |f(z)| ≤ λL‖y − z‖ + |f(z)| = L‖x − z‖ + |f(z)|.

Since C is bounded, the quantity on the right is bounded for each fixed x, so f̃(x) < ∞ ∀x. For x ∈ C, we must have λ = 1 (check this), so that f̃(x) = f(x). For x_1, x_2 ∈ R^d, α ∈ [0, 1], we have

  f̃(αx_1 + (1 − α)x_2) = sup{λ f(y) + (1 − λ) f(z) : αx_1 + (1 − α)x_2 = λy + (1 − λ)z, y, z ∈ C, λ ≥ 1}
    ≤ sup{λ f(y) + (1 − λ) f(z) : x_1 = λy_1 + (1 − λ)z_1, x_2 = λy_2 + (1 − λ)z_2,
        y = αy_1 + (1 − α)y_2, z = αz_1 + (1 − α)z_2, y_1, y_2, z_1, z_2 ∈ C, λ ≥ 1}
    ≤ α sup{λ f(y_1) + (1 − λ) f(z_1) : x_1 = λy_1 + (1 − λ)z_1, y_1, z_1 ∈ C, λ ≥ 1}
      + (1 − α) sup{λ f(y_2) + (1 − λ) f(z_2) : x_2 = λy_2 + (1 − λ)z_2, y_2, z_2 ∈ C, λ ≥ 1}
    = α f̃(x_1) + (1 − α) f̃(x_2).
Hence f˜ is convex.
So far we have always defined a convex function on a convex set, because the definition itself requires that if any two points are in the domain of definition of the function, any convex combination of them should be so for the function to be defined there in the first place. This, however, does not prevent us from tweaking the definition to include non-convex domains as follows: say that f : D ⊂ R^d → R is convex if for any x, y ∈ D and α ∈ [0, 1], if αx + (1 − α)y ∈ D, then f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y). This then raises the question: can f then be extended to a convex function on co(D)? This is answered in the affirmative by the following result from Peters and Wakker [8]. We consider only the case of bounded D, f for ease of exposition.
Theorem 4.6 A bounded convex f : D ⊂ R^d → R for a bounded D ⊂ R^d extends to a convex f̃ : co(D) → R.
Proof Define

  f̃(x) = inf\Big\{\sum_{j=1}^k p_j f(x_j) : k ≥ 1; x = \sum_{i=1}^k p_i x_i; x_1, ..., x_k ∈ D; p_1, ..., p_k ∈ [0, 1]; \sum_j p_j = 1\Big\}.   (4.9)
For x ∈ D, we have f̃(x) ≥ f(x) by the convexity of f. Taking k = 1 above, we also have f̃(x) ≤ f(x), so f̃(x) = f(x) for x ∈ D. Let x = \sum_{i=1}^m q_i y_i for m > 1, y_i ∈ co(D) ∀i, and q_i > 0 ∀i with \sum_{i=1}^m q_i = 1. Suppose y_i = \sum_{j=1}^{n_i} p_{ij}x_{ij}, where x_{ij} ∈ D ∀i, j, and p_{ij} ∈ [0, 1] are such that \sum_j p_{ij} = 1. From (4.9),

  f̃(x) ≤ \sum_{i=1}^m q_i \sum_{j=1}^{n_i} p_{ij} f(x_{ij}).

Furthermore, from (4.9), for any ε > 0, we can pick {p_{ij}, x_{ij}} above so that \sum_{j=1}^{n_i} p_{ij} f(x_{ij}) ≤ f̃(y_i) + ε. Then combining this with the preceding inequality, we have

  f̃(x) ≤ \sum_{i=1}^m q_i f̃(y_i) + ε.

Since ε > 0 was arbitrary, the convexity of f̃ follows, completing the proof.
4.6 Further Properties of Gradients of Convex Functions
Given a convex function f : R^d → R, we see in Exercise 4.6 that ⟨∇f(x) − ∇f(y), x − y⟩ ≥ 0 for each x ≠ y in R^d. This says that gradients of convex functions are 'monotone', this being one generalization of the notion of monotonicity to R^d, d > 1 (and to infinite-dimensional spaces). In fact, the gradient of a convex function is more than that. It is 'maximal monotone' in the sense that, if

  ⟨∇f(x) − q, x − y⟩ ≥ 0   (4.10)

for each x ∈ R^d, then ∇f(y) = q. To prove this, we require the following lemma.
Lemma 4.4 Let f : R^d → R be a convex and differentiable function such that ⟨∇f(x), x⟩ ≥ 0 for each x ∈ R^d. Then the zero vector θ is a minimizer of f.
Proof Consider g(x) = f(x) + ½‖x‖². Since f is the pointwise supremum of affine functions whose graphs lie below the graph of f (Exercise 4.5(a)), there exists a vector a and a real number r such that f(x) ≥ ⟨a, x⟩ + r for each x ∈ R^d. Hence

  g(x) ≥ ½‖x‖² + ⟨a, x⟩ + r,

which implies that g is coercive. Hence g attains its minimum at some x*, implying ∇f(x*) + x* = θ. Therefore

  0 = ⟨∇f(x*) + x*, x*⟩ ≥ ‖x*‖²,

and hence x* = θ. Therefore ∇f(θ) = θ, implying the claim in view of the convexity of f.
We next establish the maximal monotonicity of gradients of convex functions.
Theorem 4.7 (Minty and Rockafellar) Let f : R^d → R be a differentiable convex function such that

  ⟨∇f(x) − q, x − y⟩ ≥ 0   (4.11)

for each x ∈ R^d. Then ∇f(y) = q.
Proof Consider f̃(x) := f(x + y) − ⟨q, x⟩ for the fixed y ∈ R^d. Then f̃ is convex and ∇f̃(x) = ∇f(x + y) − q. Now (4.11) implies that ⟨∇f̃(x), x⟩ ≥ 0 for each x ∈ R^d. By Lemma 4.4, ∇f̃(θ) = θ and hence ∇f(y) = q.
An interesting consequence of maximal monotonicity is that differentiable convex functions are continuously differentiable.
Theorem 4.8 Let f : R^d → R be convex and differentiable at every point x ∈ R^d. Then f is continuously differentiable.
Proof Let y_n → y in R^d and let x ∈ R^d. From the monotonicity of the gradient, we have

  ⟨∇f(y_n) − ∇f(x), y_n − x⟩ ≥ 0

for each n. Since y_n → y, {∇f(y_n)} is bounded because of the local Lipschitz continuity of f. It follows by the Bolzano-Weierstrass theorem that {∇f(y_n)} has a convergent subsequence. Let q be a limit point thereof. Letting n → ∞ along this subsequence, we get

  ⟨q − ∇f(x), y − x⟩ ≥ 0.

Since x was arbitrary, it follows from Theorem 4.7 that q = ∇f(y). It then follows that ∇f(y_n) → ∇f(y). Thus f is continuously differentiable.
Our next result, which is a property shared by maximal monotone operators, has enormous applications in functional analysis, partial differential equations and approximation theory.
Theorem 4.9 (Minty's surjectivity theorem) Let f : R^d → R be a continuously differentiable convex function. Then for each u ∈ R^d, the equation x + ∇f(x) = u has a unique solution (i.e., x → x + ∇f(x) is an invertible map).
Proof Let u ∈ R^d and consider g : R^d → R defined by g(x) = f(x) + ½‖x − u‖². Since f is convex, it is not hard to verify that g is strictly convex and coercive (see, e.g., the proof of Lemma 4.4). Hence g attains its unique minimum at some x*. Since f is differentiable, it follows that

  ∇f(x*) + x* − u = θ,

implying the claim.
This result is closely connected with the Moreau envelope and proximal operator applied to a convex function f. The Moreau envelope f_μ of f is defined by

  f_μ(x) = inf_{y∈R^d} \Big\{ f(y) + \frac{1}{2μ}‖x − y‖² \Big\}

and the proximal operator is defined as

  prox_{μf}(x) = arg min_{y∈R^d} \Big\{ f(y) + \frac{1}{2μ}‖x − y‖² \Big\}.

Then for μ = 1,

  x = prox_f(u) iff x + ∇f(x) = u.

The Moreau envelope and proximal operator have several interesting properties. One of the main properties is the regularizing effect: if f is Lipschitz continuous, f_μ → f uniformly as μ → 0. Another interesting property is that

  prox_{μf}(x) + prox_{μf*}(x) = x and f_μ(x) + f*_μ(x) = ½‖x‖².

See [2, 9] for these and many other interesting details. These concepts will feature again in Chap. 6 when we study 'proximal methods'.
Fig. 4.5
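As a concrete instance (a standard computation, not from the text; numpy assumed), the proximal operator of the absolute value is the soft-thresholding map, and its Moreau envelope is the Huber-type smoothing of |x|; the uniform gap is exactly μ/2, illustrating f_μ → f.

    import numpy as np

    def prox_abs(x, mu):
        # prox_{mu|.|}(x): soft thresholding, the proximal map of |.|.
        return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

    def moreau_abs(x, mu):
        # f_mu(x) = f(p) + |x - p|^2 / (2 mu) with p = prox_{mu f}(x);
        # for f = |.| this is the Huber-type smoothing of |x|.
        p = prox_abs(x, mu)
        return np.abs(p) + (x - p) ** 2 / (2.0 * mu)

    xs = np.linspace(-2.0, 2.0, 401)
    for mu in (1.0, 0.1, 0.01):
        print(mu, np.max(np.abs(moreau_abs(xs, mu) - np.abs(xs))))  # = mu/2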
4.7 Exercises
4.1 Prove (C6).
4.2 Let f : R^d → R be a convex function. Show that the Gâteaux directional derivative f′(x; h) is a convex function in h. In particular,

  f′(x; h_1 + h_2) ≤ f′(x; h_1) + f′(x; h_2)
for all h_1, h_2 ∈ R^d. Furthermore, if it is linear in h, then f is differentiable at x.
4.3 Let f : R^d → R be a bounded convex function. Show that f is constant.
4.4 Let f : R^d → R be a convex function. Suppose there is an α ∈ R such that {x : f(x) ≤ α} is non-empty, closed and bounded. Show that for each β ∈ R, the level set {x : f(x) ≤ β} is closed and bounded.
4.5 Recall that affine functions R^d → R are functions of the type x → a^T x + b for some a ∈ R^d, b ∈ R.
(a) Show that a convex function f : R^d → R is the upper envelope of the affine functions whose graphs lie below that of f, i.e.,

  f(x) = sup_{a∈R^d, b∈R} {a^T x + b : a^T y + b ≤ f(y) ∀y}.
(b) Prove Jensen's inequality: for f as above and ϕ : R^d → [0, ∞) with \int ϕ(x)dx = 1,

  \int f(x)ϕ(x)dx ≥ f\Big(\int xϕ(x)dx\Big),

where the second integral is componentwise.
4.6 (Monotonicity of the gradient) For a continuously differentiable and convex f : C ⊂ R^d → R with C convex, show that for x, y ∈ C, ⟨∇f(y) − ∇f(x), y − x⟩ ≥ 0. Show that this can be generalized to subgradients, if we drop the differentiability condition, as:

  ⟨ζ − ξ, y − x⟩ ≥ 0 ∀ ζ ∈ ∂f(y), ξ ∈ ∂f(x).   (4.12)
4.7 Let f : R^d → R be twice continuously differentiable. Show that on any closed bounded convex set C ⊂ R^d, f can be written as a difference of two convex functions.
4.8 Let f : R^n → R be lower semicontinuous. Assume that for each x, y ∈ R^n, there exists α ∈ (0, 1) such that f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y). Show that f is convex.
4.9 Show that the lower semicontinuity of f is necessary in the above exercise. Hint: Consider

  f(x) = 1 if x ≤ 0,  f(x) = 1/4 + x² if x > 0.
4.10 Give an example of a convex function [0, 1] → R that has infinitely many points of non-differentiability.
4.11 Let A_0, A_1, ..., A_d be m × m symmetric matrices and consider f : R^d → R where f(x) is the maximum eigenvalue of the matrix A_0 + x(1)A_1 + ··· + x(d)A_d. Show that f is a convex function. (A numerical check follows these exercises.)
4.12 For a convex C ⊂ R^d, let f_n : C → R, n ≥ 1, be a sequence of convex functions such that f(x) := lim sup_{n↑∞} f_n(x) < ∞ for x ∈ C. Show that f : C → R is convex. Will the result hold if 'convex' is replaced by 'strictly convex' in both places?
4.13 Consider the real-valued function on the convex set C := {(x, y) : y ≥ x²} defined as f(x, y) = x²/y, with f(0, 0) = 0 (see Fig. 4.5). Show that it is convex and lower semicontinuous but not continuous, whereas its epigraph is closed.
4.14 A convex function f : C ⊂ R → R, C convex, is said to be self-concordant if

  \Big|\frac{d³f}{dx³}(x)\Big| ≤ 2\Big(\frac{d²f}{dx²}(x)\Big)^{3/2}.

For C ⊂ R^d convex, d ≥ 2, f : C → R is said to be self-concordant if it is so along every line segment contained in C.
1. Show that self-concordance is preserved under positive linear combinations of functions and affine transformations on R^d.
2. Show that f(x) : x ∈ C → −\sum_{i=1}^N log(b_i − a_i^T x) ∈ R is self-concordant on {x : b_i − a_i^T x > 0 ∀i}, where b_i ∈ R, a_i ∈ R^d ∀i.
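A quick numerical check of Exercise 4.11 (an illustration of ours, assuming numpy): sample random symmetric matrices and test midpoint convexity of the maximum-eigenvalue function.

    import numpy as np

    rng = np.random.default_rng(0)
    sym = lambda M: 0.5 * (M + M.T)
    A = [sym(rng.standard_normal((4, 4))) for _ in range(3)]   # A0, A1, A2

    # f(x) = lambda_max(A0 + x(1) A1 + x(2) A2), claimed convex in Ex. 4.11.
    f = lambda x: np.linalg.eigvalsh(A[0] + x[0] * A[1] + x[1] * A[2])[-1]

    for _ in range(1000):               # midpoint convexity on random pairs
        x, y = rng.standard_normal(2), rng.standard_normal(2)
        assert f(0.5 * (x + y)) <= 0.5 * (f(x) + f(y)) + 1e-10
    print("midpoint convexity held on all sampled pairs")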
References
1. Boulmezaoud, T.Z., Cieutat, P., Daniilidis, A.: Gradient flows, second-order gradient systems and convexity. SIAM J. Optim. 28(3), 2049–2066 (2018)
2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
3. Clarke, F.: Functional Analysis, Calculus of Variations and Optimal Control. Graduate Texts in Mathematics, vol. 264. Springer, London (2013)
4. Dragomirescu, M., Ivan, C.: The smallest convex extension of a convex function. Optimization 24(3–4), 193–206 (1992)
5. Gruber, P.M.: Convex and Discrete Geometry. Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 336. Springer, Berlin (2007)
6. Luenberger, D.G.: Optimization by Vector Space Methods. Wiley, New York (1969)
7. Niculescu, C.P., Persson, L.E.: Convex Functions and Their Applications: A Contemporary Approach, 2nd edn. CMS Books in Mathematics/Ouvrages de Mathématiques de la SMC. Springer, Cham (2018)
8. Peters, H.J.M., Wakker, P.P.: Convex functions on nonconvex domains. Econ. Lett. 22(2–3), 251–255 (1986)
9. Rockafellar, R.T.: Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ (1997). Reprint of the 1970 original, Princeton Paperbacks
10. Yan, M.: Extension of convex function. J. Convex Anal. 21(4), 965–987 (2014)
Chapter 5
Convex Optimization
5.1 Introduction
Convex optimization or convex programming refers to the problem of minimizing convex functions over convex sets. Observe that we have been careful to say only minimization. Maximization of convex functions is a different kettle of fish altogether; those problems can be extremely hard. Minimization of convex functions on convex sets, on the other hand, has a large body of clean and elegant theory as well as a plethora of algorithms. These include several that are tailor-made for important special subclasses, such as linear and semidefinite programming. A symmetric statement can be made regarding maximization of concave functions, which is equivalent to minimization of convex functions.
Why bother about such a special class of problems? One reason is that many optimization problems in practice are naturally of this form, such as energy minimization and cost minimization. Likewise, many problems of maximizing profit or utility in economics maximize concave functions because of 'the law of diminishing returns'. More importantly, recall that near a local minimum of a twice continuously differentiable function, the Hessian is positive semidefinite and often positive definite. The latter implies local convexity. Thus close to a local minimum, convex optimization is useful. It is not an exaggeration to say that the theory and practice of minimizing non-convex functions borrow a lot from and build upon the theory of minimizing convex functions.
The reason why convex optimization is more amenable than its non-convex counterpart already becomes apparent with some simple initial observations about the problem. Thus consider a convex continuous f : C ⊂ R^d → R where C is closed and convex. Then we have:
(C1) If C is bounded as well, f attains a minimum. This is immediate from the Weierstrass theorem. If C is not bounded, it may not do so, e.g., the map x → e^{−x} does not attain a minimum in R (C unbounded). Nor does f : [0, 1] → R defined by f(x) = x² for x > 0, f(0) = 1 (f not continuous).
(C2) Local minima of f are also its global minima. To see this, suppose that there are two distinct points x, y ∈ C with f (x) < f (y) and y is a local minimum of f . Consider the graph of f restricted to the line segment x ↔ y, represented as αx + (1 − α)y, α ∈ [0, 1]. That is, we look at the function α ∈ [0, 1] → g(α) = f (αx + (1 − α)y). Then g(·) is convex with a local minimum at the boundary point α = 0, and g(1) < g(0). By the convexity of g(·), its graph lies on or below the line segment joining (0, g(0)) with (1, g(1)). At every point (a, b) on this line segment except (0, f (y)), b < f (y) (because f (x) < f (y)). But the interval [0, 1] has points arbitrarily close to 0 and to the right of it, where the value of g then is strictly below g(0), contradicting the fact that y is a local minimum. Thus y must be a global minimum. (C3) The set of (necessarily global) minima of f is a closed convex set. The set of minima of a continuous function is in any case closed, so we need to prove only the convexity. If x, y ∈ C are two minima, the graph of f restricted to the line segment x ↔ y must lie on or below the line segment joining f (x), f (y), i.e., a horizontal line segment. Clearly the two must coincide; otherwise we have a point on this line segment where g(α) as defined in (C2) has a strictly lower value than g(0), contradicting the local minimality of f (y). This proves the claim. Perhaps the most elegant aspect of convex optimization is the duality theory, a whiff of which we already got in Corollary 3.1. We take this up next.
5.2 Legendre Transform and Fenchel Duality
Let f : C ⊂ R^d → R be convex, where C ⊂ R^d is closed and convex. Define the Legendre transform or conjugate convex function of f to be the function

  y → f*(y) := sup_{x∈C}{⟨x, y⟩ − f(x)}.

Note that this can take the value +∞. (See [2] for a lucid exposition of the Fenchel dual.) While one can build a theory around this definition by considering extended real-valued functions (i.e., functions that are allowed to take values ±∞), we shall take the simpler route of taking f* to be defined on the set C* := {y ∈ R^d : f*(y) < ∞}. Note also that the definition did not require f to be convex. Indeed, one can build a more general theory without this assumption, but for the sake of simplicity, we shall stick to the case when f is convex. We shall begin by proving some simple but useful properties. A few more are found in the exercises. But before doing so, we give some geometric intuition for this definition. Recall that {x ∈ R^d : ⟨x, y⟩ = 0} defines a (d − 1)-dimensional subspace of R^d passing through the origin and normal
Fig. 5.1 Legendre transform
to the vector y. Thus f*(y) for fixed y is the maximum extent to which the graph of this hyperplane lies above the graph of f, or equivalently, the extent to which the graph of this hyperplane must be lowered till it barely touches the graph of f—see Fig. 5.1a. Note that the graph of the hyperplane may have to be raised rather than lowered for it to barely touch the graph of f, see Fig. 5.1b. We take raising by an amount to be the same as lowering by the negative of that amount.
To begin with, the fact that f* is a pointwise supremum of affine functions shows that it is convex. Secondly, it is clear that

  f(x) + f*(y) ≥ ⟨x, y⟩.   (5.1)
This is known as the Fenchel-Young inequality. An immediate consequence of it is the following. Observe that the domain of definition of (f*)* will necessarily contain C. We then have:
Theorem 5.1 (f*)*(x) = f(x), x ∈ C.
Proof Since (cf. Exercise 4.5(a))

  f(x) = sup{g(x) : g(·) ≤ f(·), g is affine},   (5.2)

we have, by (5.1),

  (f*)*(x) = sup_{y∈C*}(⟨y, x⟩ − f*(y)) ≤ f(x).

Let x → ⟨x, y⟩ − c be an affine map whose graph lies below that of f, i.e.,

  ⟨x, y⟩ − c ≤ f(x) ∀ x ∈ C.   (5.3)

Then c ≥ ⟨x, y⟩ − f(x) ∀ x ∈ C, implying

  c ≥ sup_{x∈C}(⟨x, y⟩ − f(x)) = f*(y).

Hence

  ⟨x, y⟩ − c ≤ ⟨x, y⟩ − f*(y).

Taking the supremum on both sides over the set Y of all y for which (5.3) holds for some c, and using (5.2), we have

  f(x) ≤ sup_{y∈Y}(⟨x, y⟩ − f*(y)) ≤ (f*)*(x)

for x ∈ C. This completes the proof.
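As a quick illustration (a standard computation, independent of the text's examples), take d = 1, C = R and f(x) = |x|^p/p for some p > 1. Then

  f*(y) = sup_x (xy − |x|^p/p) = |y|^q/q, where 1/p + 1/q = 1,

the supremum being attained where |x|^{p−1} = |y| with x and y of the same sign. Thus (5.1) specializes to Young's inequality xy ≤ |x|^p/p + |y|^q/q, and for p = 2 the function f(x) = x²/2 is its own conjugate, making (f*)* = f evident in this case.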
The function (f*)* is sometimes written as f** and is called the dual conjugate of f. Another useful fact is the following, which connects the foregoing to the notion of the subdifferential ∂f.
Theorem 5.2 ∂f(x) = {y ∈ C* : f*(y) = ⟨x, y⟩ − f(x)}, i.e., the subdifferential of f at x is precisely the set of y for which equality in the Fenchel-Young inequality is attained.
Proof This follows from the following chain of equivalent statements:

  y ∈ ∂f(x) ⟺ f(z) ≥ f(x) + ⟨z − x, y⟩ ∀z
           ⟺ ⟨x, y⟩ − f(x) ≥ ⟨z, y⟩ − f(z) ∀z
           ⟺ ⟨x, y⟩ − f(x) = sup_z(⟨z, y⟩ − f(z))
           ⟺ f*(y) = ⟨x, y⟩ − f(x),

which completes the proof.
One consequence of this is that the points x where ∂f(x) is not a singleton (in particular, where f is not differentiable) are the points where multiple y (viz., normals of the hyperplanes supporting the epigraph of f at x) have the same value of f*(y). Clearly, this is a closed convex set. Since (f*)* = f, this means that the supporting hyperplane of the epigraph of f* with normal x touches the graph of f* at a closed convex set that is not a singleton, i.e., a 'flat patch'—see Fig. 5.2.
Fig. 5.2
Given the symmetry of the situation, we conclude that non-differentiability of f corresponds to a flat patch for the graph of f* and vice versa. This has implications in thermodynamics, where transitions between different phases correspond to such 'flat patches' and lead to non-differentiability of 'dual' quantities, see [11].
We now prove a celebrated result in this domain, the Fenchel duality theorem. It establishes a duality principle between a minimization problem and a maximization problem in the spirit of what Corollary 3.1 did for convex sets. Before giving a statement of the theorem, we need to introduce for concave functions the counterparts
of some of the notions defined thus far for convex functions. Recall that a function g : D → R, for D ⊂ R^d closed convex, is concave if for any λ ∈ [0, 1] and x, y ∈ D,

  λg(x) + (1 − λ)g(y) ≤ g(λx + (1 − λ)y),

i.e., the line segment joining (x, g(x)) and (y, g(y)) lies below the graph of g. This in particular means that the hypograph of g, denoted by hypo(g) and defined as {(x, r) ∈ D × R : r ≤ g(x)}, is convex. The counterparts of our results for convex f and its epigraph hold, e.g., continuity of g implies that hypo(g) is closed. The conjugate concave function for g is defined as

  g*(y) := inf_{x∈D}(⟨x, y⟩ − g(x)),

viewed as a real-valued concave function on D* := {y ∈ R^d : g*(y) > −∞}. The corresponding Fenchel-Young inequality then is: g(x) + g*(y) ≤ ⟨x, y⟩. The Fenchel duality theorem then says that the minimum of f − g on C ∩ D, which is also the minimum vertical separation between the epigraph of f and the hypograph of g, equals the maximum vertical separation between g* and f* on C* ∩ D*, which is the maximum vertical separation between pairs of parallel hyperplanes separating epi(f) and hypo(g).
Theorem 5.3 (Fenchel duality theorem) For f, g, C, D as above, assume that
(†) int(C) ∩ int(D) ≠ φ.
Then C* ∩ D* ≠ φ and

  inf_{x∈C∩D}(f(x) − g(x)) = max_{y∈C*∩D*}(g*(y) − f*(y)).
Proof That C* ∩ D* ≠ φ is easy to show (check this). By the Fenchel-Young inequalities for convex and concave functions,

  f(x) + f*(y) ≥ ⟨x, y⟩ ≥ g(x) + g*(y)

for x ∈ C ∩ D, y ∈ C* ∩ D*. Thus f(x) − g(x) ≥ g*(y) − f*(y), leading to

  m := inf_{x∈C∩D}(f(x) − g(x)) ≥ sup_{y∈C*∩D*}(g*(y) − f*(y)).

If m = −∞, there is nothing to prove, so let m > −∞ and let A, B denote, resp., epi(f) and hypo(g + m). By Theorem 3.7, there is a hyperplane H in R^d × R separating A and B, with A ⊂ U_H, B ⊂ L_H. In view of (†), it follows that H is not vertical. Suppose H := {(x, r) : ⟨x, z⟩ + ra = α}. Then without loss of generality, a > 0 and

  ⟨x, z⟩ + a f(x) ≥ α ≥ ⟨x, z⟩ + a(g(x) + m).

Hence for all x ∈ C ∩ D,

  f(x) ≥ ⟨x, −z/a⟩ + α/a ≥ g(x) + m.

Thus

  f*(−z/a) ≤ −α/a,  g*(−z/a) ≥ −α/a + m.

But then

  m ≤ g*(−z/a) − f*(−z/a) ≤ sup_{y∈C*∩D*}(g*(y) − f*(y)) ≤ m,
implying equality throughout. The claim follows.
One important case that does not completely fit the above is that of minimizing a convex f on a closed convex C. One takes g to be the extended real-valued concave function defined by g(x) = 0 for x ∈ C and = −∞ for x ∈ / C. One can apply a more general version of the Fenchel duality theorem in order to obtain the dual problem, but we do not deal with this here since we have confined ourselves to real-valued functions alone. Next we consider the Legendre transform of a continuous but possibly non-convex f : Rd → R. f ∗ , f ∗∗ , and their domains are defined as before and are convex even when f is not. Following the arguments in Theorem 5.1, it is not hard to verify that f ∗∗ is the convex minorant of f , i.e., the largest convex function ≤ f , given by the pointwise supremum of all affine functions whose graph lies below that of f . In fact f = f ∗∗ if and only if f is convex (check this). An important result in the non-convex case is the following. Theorem 5.4 (Hiriart-Urruty [4]) Let f : Rd → R be differentiable function. Then a point x¯ is a global minimum of f if and only if ∇ f (x) ¯ = 0 and f ∗∗ (x) ¯ = f (x). ¯
5.3 The Lagrange Multiplier Rule
85
Proof Let x¯ be a global minimum of f . Then ∇ f (x) ¯ = 0. To prove the second condition, note that the affine function a(x) ≡ f (x) ¯ is below f (x) and has the pointwise maximum at x. ¯ From the characterization of f ∗∗ as the convex minorant of f , it follows that f (x) ¯ = f ∗∗ (x). ¯ ¯ = f (x). ¯ Recall our notation for Conversely let x¯ be a critical point and f ∗∗ (x) directional derivatives from Chap. 1. Then 1 1 ∗∗ f (x¯ + h) − f ∗∗ (x) f (x¯ + h) − f (x) ¯ → f (x; ¯ ≤ ¯ h) = ∇ f (x), ¯ h = 0
as ↓ 0. Thus we have ( f ∗∗ ) (x; ¯ h) ≤ 0 ∀ h ∈ Rd . Since the directional derivative from the right dominates that from the left in any given direction we have ¯ −h) ≤ ( f ∗∗ ) (x; ¯ h) −( f ∗∗ ) (x; ¯ −h) ≥ 0 for all h ∈ Rd . Thus ( f ∗∗ ) (x; ¯ −h) = 0. Simwhich implies that ( f ∗∗ ) (x; ∗∗ ¯ h) = 0. Thus f ∗∗ is Gâteaux differentiable at x¯ in any direction h ilarly, ( f ) (x; with the Gâteaux derivative = 0. By Theorem 4.3, it is Fréchet differentiable at x¯ ¯ Consequently, f has and its gradient is zero. Hence f ∗∗ has a global minimum at x. a global minimum at x. ¯
5.3 The Lagrange Multiplier Rule The next result we prove is the crown jewel of convex optimization, the Largrange multiplier rule. This addresses the archetypical convex optimization problem: Minimize a continuous convex f : C → R on a closed convex C ⊂ Rd , subject to constraints (5.4) gi (x) ≤ 0, 1 ≤ i ≤ m, where the gi : C → R are convex and continuous. In particular, the set of ‘feasible m := C ∩ (∩i=1 {x : gi (x) ≤ 0}) is closed convex. The value 0 on the right points’ C hand side of (5.4) is not restrictive. Any constant will do, as it can be absorbed in gi to bring the inequality to the above standard form (i.e., replace gi (x) ≤ c by gˆ i (x) ≤ 0 for gˆi (·) = gi (·) − c). We write the constraint in a compact form as g(x) ≤ θ, where g(x) := [g1 (x), . . . , gm (x)]T . The inequality/equality symbols between vectors here and in what follows are componentwise. Consider the set D := {(y, z) ∈ Rm × R : there exists an x ∈ Rd such that y ≥ g(x), z ≥ f (x)}. Then it is easy to see that D is convex (check this) (In fact, it is the projection of the epigraph of the componentwise convex map x → (g(x), f (x)) ∈ Rm × R, viewed as a subset of C × (Rm × R), onto (Rm × R).) (see Fig. 5.3).
86
5 Convex Optimization f (x)
g(x)
Fig. 5.3
Fig. 5.4 Change of coordinates for Lagrange multiplier
The unconstrained minimum of f would correspond to the lowest point in this figure, marked by a solid circle in the figure. But the ‘feasible set’, i.e., the set of points satisfying the constraint, is the subset D of D given by D := {(y, z) ∈ D : y ≤ θ }, shaded in the figure. The constrained minimum is then the point marked by a box in Fig. 5.3. Let the supporting hyperplane H at this point and its inward normal at this point be as shown on the left in Fig. 5.4a. By a change of coordinate that takes H to the horizontal (m-dimensional) axes and to the vertical axis, the picture changes to what we have on the right hand side in Fig. 5.4b. In this coordinate system, the constrained minimizer becomes an unconstrained one. This turns out to be equivalent to minimizing f (·) + T g(·). This is the Lagrange multiplier rule. We make this precise below, following a technical lemma which says that the ‘southwest frontier’ of D is indeed the graph of a convex function. The idea here is to move the vertical line Vc := {(z, y) : z = c}, c ∈ Rm , from left to right by increasing c, and track the minimum of f over the set Dc := D ∩ {(z, y) : z ≤ c}. Clearly, D = Dθ .
5.3 The Lagrange Multiplier Rule
87
Lemma 5.1 The function w(z) := inf{ f (x) : g(x) ≤ z} : Rm → R is convex. Proof Let α ∈ [0, 1]. Then for z 1 , z 2 ∈ Rm , w(αz 1 + (1 − α)z 2 ) = inf { f (x) : g(x) ≤ αz 1 + (1 − α)z 2 } x∈C
≤ inf { f (x) : x = αx1 + (1 − α)x2 , x1 , x2 ∈ C; g(x1 ) ≤ z 1 , g(x2 ) ≤ z 2 } x∈C
≤ inf {α f (x1 ) + (1 − α) f (x2 ) : x1 , x2 ∈ C; g(x1 ) ≤ z 1 , g(x2 ) ≤ z 2 } x∈C
= α inf { f (x) : g(x) ≤ z 1 } + (1 − α) inf { f (x) : g(x) ≤ z 2 } x∈C
x∈C
= αw(z 1 ) + (1 − α)w(z 2 ). Here the first inequality follows from the fact that the infimum over a potentially smaller set is at least as large or larger, and the componentwise convexity of g. The second inequality follows from the convexity of f . We shall assume that μ0 := inf x∈C f (x) is finite. Then we have: Theorem 5.5 (Lagrange multiplier rule) Suppose ∃ x1 satisfying g(x1 ) < θ . Then ∃ ≥ θ such that (5.5) μ0 = inf ( f (x) + T g(x)). x∈C
then x ∗ minimizes f on Furthermore, if this infimum is attained at some x ∗ ∈ C, C and (5.6) , g(x ∗ ) = 0. Proof Let B := {(z, y) : y ≤ μ0 , z ≤ θ } ⊂ Rm × R. Then (x1 , μ0 − ) ∈ int(B) for a suitable , so that int(B) = φ. On the other hand, it is clear that int(B) ∩ D = φ. Hence there exists a supporting hyperplane H = {(z, y) ∈ Rm × R : T z + βy = γ } for suitable ∈ Rm , β, γ ∈ R, such that D ⊂ U H , B ⊂ L H . That is, ∀ (z 1 , y1 ) ∈ D, (z 2 , y2 ) ∈ B, we have βy1 + T z 1 ≥ γ ≥ βy2 + T z 2 . From the definition of B, it is easy to see that ≥ θ and β ≥ 0 (check this). Since (θ, μ0 ) ∈ B, T z + βy ≥ βμ0 ∀ (z, y) ∈ D. If β = 0, then in particular = θ , because (, β) is a nonzero vector. Also, T z ≥ 0 in the above. In particular, for z = g(x1 ), T g(x1 ) ≥ 0. But g(x1 ) < θ, θ =
88
5 Convex Optimization
Fig. 5.5 Geometric interpretation of Lagrange multipliers
∇g2 (x∗ )
−∇f (x∗ ) ∇g (x∗ ) 1
g1 (x) ≤ θ
g2 (x) ≤ θ x∗ f (x) = k
≥ θ together lead to T g(x1 ) < 0, a contradiction. Thus we must have β > 0. Without loss of generality, we take β = 1. Since (θ, μ0 ) ∈ ∂ B ∩ ∂ D, μ0 = inf (y + T z) (y,z)∈D
≤ inf ( f (x) + T g(x)) x∈C
≤ inf ( f (x) + T g(x)) x∈C
≤ inf f (x) x∈C
= μ0 . Here the first two inequalities follow because the infimum is over successively smaller sets and the third inequality follows because T g(x) ≤ 0. Clearly, the equality must hold throughout, implying (5.5), (5.6). Figure 5.5 gives a geometric intuition for the Lagrange multiplier theorem, which says that at the constrained minimum, the negative gradient of f is a positive linear combination of the outward normals to the active constraint surfaces. The condition ‘∃ x1 such that g(x1 ) < θ ’ is called the Slater condition and the Lagrange multiplier. Equation (5.6) is known as the complementarity condition. The ‘KKT conditions’ of Chap. 2 are now recognized as a local version of the Lagrange multiplier rule, where the Slater condition already made an appearance, essentially for the same purpose. We did not require convexity there, but that is not surprising because near a local minimum, the function is convex. We next prove another important property of the Lagrange multipliers, the saddle point property. Given C ⊂ Rd , D ⊂ Rm , and a map f : C × D → R, a point (x ∗ , y ∗ ) ∈ C × D is said to be a saddle point of f if f (x ∗ , y) ≤ f (x ∗ , y ∗ ) ≤ f (x ∗ , y) for all (x, y) in a relatively open neighbourhood of (x ∗ , y ∗ ) in C × D. Suppose the infimum in (5.5) is attained at x0 ∈ C. Define the ‘Lagrangian’ L(x, y) = f (x) + y T g(x).
5.4 The Arrow-Barankin-Blackwell Theorem
89
Theorem 5.6 (Saddle point property) The Lagrangian L(·, ·) has a saddle point at (x0 , ), i.e., L(x0 , y) ≤ L(x0 , ) ≤ L(x, ) ∀x ∈ C, θ ≤ y ∈ Rm . Proof Note that for x ∈ C, by the definitions of μ0 , x0 , μ0 = f (x0 ) + T g(x0 ) ≤ f (x) + T g(x). That is, L(x0 , ) ≤ L(x, ). On the other hand, for y ≥ θ in Rm , L(x0 , y) − L(x0 , ) = y T g(x0 ) − T g(x0 ) = y T g(x0 ) ≤ 0, where the second equality follows from (5.6) and the inequality from the facts y ≥ θ and g(x0 ) ≤ θ . The claim follows. The minimization of x → L(x, ) over x ∈ C and the maximization of y → L(x0 , y) over y ≥ θ are viewed as being dual to each other. This is the so-called Lagrangian duality. The two notions of duality, Fenchel and Lagrange, are not unrelated, see [5].
5.4 The Arrow-Barankin-Blackwell Theorem In this section we give a version of the celebrated Arrow-Blackwell-Barankin theorem [1]. The proof is adapted from [8]. The result is of great significance in multiobjective optimization. Let C be a closed convex set in (Rd )+ := {x = [x1 , . . . , xd ] ∈ Rd : xi > 0 ∀i}. (More generally, C ⊂ a translate of (Rd )+ will do, as will become apparent soon.) A point x = [x1 , . . . , xd ] in C is said to be a Pareto point if one has: y = [y1 , . . . , yd ] ∈ C satisfies yi ≤ xi ∀i if and only if x = y. Let P(C) denote the set of Pareto points of C. For = [1 , . . . , d ] ∈ (Rd )+ , x := argmin y∈C , y is clearly a Pareto point. The converse is not always true, consider, e.g., the point marked by a ‘∗’ in Fig. 5.6. An ‘almost’ converse, however, does hold, as the following result shows. Theorem 5.7 (Arrow-Barankin-Blackwell) Let Q(C) := {x ∈ C : ∃ ∈ (Rd )+ s.t. , x = min, y }. y∈C
Then Q(C) is dense in P(C).
90
5 Convex Optimization
Fig. 5.6
∗
Proof Let x ∈ P(C) and B an open ball centred at x. Then it is easy to see that ¯ Thus we may take C to be bounded without loss of generality. For x ∈ P(C ∩ B). k ≥ 1, define, for K := max y∈C y, 1 Ck := y ∈ C : yi ≤ xi + ∀i k 1 , vk (y) := max yi − xi − 1≤i≤d k wk (y) :=
d j=1
yj , d(k + 1)K
u k (y) := wk (y) + vk (y), r k := arg min u k (y). y∈Ck
By construction, u k (·) is convex in y. Also, wk (y) ≤ Consider the set Ak := {y ∈ Rd : u k (y) < u k (r k )},
1 k+1
0 and any unit coordinate vector e. Using this, we have k , r k − αe < k , r k which leads to k ∈ (Rd )+ . Without loss of generality we can assume that the componentwise sum of k is 1. Since wk (r k ) ≥ 0, we have rik − xi −
1 ≤ wk (r k ) + vk (r k ) = u k (rk ) k 1 1 1 0 sufficiently small such that z˜ := λz + (1 − λ)r k ∈ Ck . The right hand side inequality in (5.7) is valid for z˜ . Therefore k , r k ≤ k , λz + (1 − λ)r k After rearrangement, we can note that the right hand side inequality in (5.7) is true for all z ∈ C. That is, the hyperplane that separates Ck , Ak also separates C, Ak . The right hand side inequality in (5.7) then implies that r k ∈ P(C). Since r k , x ∈ Ck ∀ k and r k , x ∈ P(C) ∀ k, it follows that r k → x as k ↑ ∞. This proves the claim. Letting k → along a subsequence, we also have , x ≤ , z ∀ z ∈ C for some ∈ (Rd )+ . If C is a convex polytope, one can say more. Corollary 5.1 If C above is a polytope, then Q(C) = P(C). Proof (Sketch) From the definition of P(C), it is easily seen that any support hyperplane of C at a point x in P(C) will have an inward unit normal = [1 , . . . , d ] satisfying i ≥ 0 ∀i. Suppose i = 0 for some i. Without loss of generality, relabel the ˆ ×R ˇ indices such that i > 0 for 1 ≤ i ≤ m < d and = 0 otherwise. Write Rd = R ˇ If it is in ˆ R ˇ are copies of Rm , Rd−m resp. Then x ∈ some face of C in R. where R, the relative interior of this face or any sub-face of dimension ≥ 1 thereof, it could not have been in P(C), because then one of its coordinates can be reduced while still remaining in C, without having to change other coordinates. This contradicts the choice x ∈ P(C). Thus x must be a corner, whence a small perturbation of leads to a support hyperplane at x with normal ∈ (Rd )+ . Hence x ∈ Q(C). To see that the claims of Theorem 5.7 cannot be improved to those of Corollary 5.1 in general, consider C ⊂ R2 defined as follows (see Fig. 5.7): B := {(x, y) ∈ R2 : (x − 1)2 + (y − 1)2 ≤ 1, 0 ≤ x, y ≤ 1}, C := {(x, y) ∈ R2 : x ≥ 0, y ≥ 1} ∪ {(x, y) ∈ R2 : x ≥ 1, y ≥ 0} ∪ B. Then (0, 1), (1, 0) ∈ P(C)\Q(C). More generally, one considers a continuous U : (Rd )+ → R+ that is strictly increasing componentwise. Then any minimizer of U (·) on C will be in P(C). The map x → , x with ∈ (Rd )+ is a special case of this. For the ‘maximization’ counterpart of the above results, such U are called utility functions and are usually taken to be concave increasing because of underlying economic principles such as ‘the law of diminishing returns’.
92
5 Convex Optimization
y
Fig. 5.7
(0,1)
(1,0)
x
5.5 Linear Programming We consider linear program in standard form ⎧ inf c, x ⎪ ⎪ ⎨ x∈Rd s.t. Ax = b, ⎪ ⎪ ⎩ x ≥ θ,
(LP)
d where A is m × d matrix, b ∈ Rm . Let F = {x ∈ R+ : Ax = b} be the feasible set. We denote by ai , the ith row of the matrix A. The apparently more general problem where the constraint Ax = b is replaced by Ax ≤ b componentwise can be reduced to the above form by introducing additional variables known as slack variables. The following result is crucial in the simplex method, one of the major algorithms for linear programming.
Theorem 5.8 If F has an extreme point and the linear programming problem (LP) has an optimal solution, then the LP has an optimal solution which is an extreme point of F . Proof Let S denote the set of optimal solutions of (LP). Then it is easy to see that S is convex and closed. Since F has an extreme point, Theorem 3.8 ensures that F and hence S contains no line. From the same theorem, S has an extreme point x ∗ . Let y, z ∈ F be such that x ∗ = λy + (1 − λ)z for some λ ∈ (0, 1). Then inf c, x = c, x ∗ = λc, y + (1 − λ)c, z .
x∈F
Therefore necessarily c, y = c, z = c, x ∗ and hence y, z ∈ S. This implies that x ∗ = y = z, proving that x ∗ is an extreme point of F . While we do not get into the details of the simplex algorithm,1 its convex analytic underpinnings are worth note. The above result justifies a systematic search among 1
See the ‘Bibliographical note’ in Chap. 7 for this.
5.5 Linear Programming
93
the extreme points of the feasible set, which is what this algorithm does. Theorem 3.9, which characterizes the extreme points of convex sets defined by linear inequalities, plays a key role in how this is done. Corresponding to (LP), which we call the Primal problem, we have a Dual problem given by ⎧ ⎨ sup y, b y∈Rm (Dual) ⎩ s.t. y T A ≤ cT . Let α = inf x∈F c, x and β = sup{y:y T A≤cT } y, b . Theorem 5.9 (Weak duality) Let x be a feasible solution to Primal and y be a feasible solution to Dual. Then y, b ≤ c, x . In particular, it is always true that β ≤ α. Proof We have Ax = b and y T A ≤ cT . Hence y, b = y, Ax = y T A, x ≤ c, x .
This completes the proof.
From the weak duality, we see that if the primal problem is unbounded (i.e., α = −∞), then the dual problem is infeasible (i.e., β = −∞) . Similarly if the dual problem is unbounded (i.e., β = ∞), then the primal problem is infeasible (i.e., α = ∞). Theorem 5.10 (Strong duality) If one of the Primal or Dual is feasible and bounded, then α = β. Proof Suppose that Primal is feasible and bounded. Then the Dual is feasible and bounded as well. Similarly, if Dual is feasible and bounded, so is the Primal. Consider ˜ : x ≥ θ} K = { Ax b A . Then, for any > 0, we have ∈ / K . Using separation α− cT theorem (Theorem 3.7), we can find y ∈ Rm and η ∈ R such that where A˜ =
y , b + η(α − ) > 0 ≥ y , Ax + ηc, x for every > 0 and x ≥ θ . Since there exists an x ≥ θ such that Ax = b (by the feasibility of Primal), η = 0. At the same time, the arbitrariness of > 0 will ensure that η < 0 from the first inequality. Replacing y by y := − η1 y , we now have
94
5 Convex Optimization
y, b > α − for every > 0. Also, for every x ≥ θ , y, Ax ≤ c, x . Since x ≥ θ is arbitrary, this implies y T A ≤ cT . Thus y is feasible for Dual and hence β ≥ α − for each > 0, so β ≥ α. This proves the strong duality. It is worthwhile to motivate this using the framework of the Lagrange multiplier rule for some additional insights. Assume the Primal to be feasible and bounded. Writing Ax = b as a combination of two inequalities Ax ≤ b and −Ax ≤ −b, Primal is equivalent to unconstrained minimization of cT x + (λ∗1 )T (b − Ax) + (λ∗2 )T (−b + Ax) + (μ∗ )T x where λ∗1 , λ∗2 ≥ θ in Rm , μ∗ ≥ θ in Rd are the Lagrange multipliers, with μ∗ corresponding to the non-negativity constraints. (Note that strictly speaking, Theorem 5.5 does not permit this, because the Slater condition is violated.) Writing λ∗ := λ∗1 − λ∗2 ∈ Rm , this can be rewritten as cT x + (λ∗ )T (b − Ax) + (μ∗ )T x. By the saddle point property, if x ∗ is the optimal solution, then cT x ∗ + λT (b − Ax ∗ ) + μT x ∗ ≤ cT x ∗ + (λ∗ )T (b − Ax ∗ ) + (μ∗ )T x ∗ ≤ cT x + (λ∗ )T (b − Ax) + (μ∗ )T x, ∀ λ ∈ Rm , μ, x ∈ Rd . Note the symmetric roles played by λ, x. Replacing λ by y, the maximization problem in the left inequality is recognized as being equivalent to Dual with x ∗ playing the role of the corresponding Lagrange multipliers.
5.6 Applications to Game Theory In this section, we first establish two key results of game theory that have of convex analysis and convex optimization flavour. The first is the celebrated min-max theorem that has a long history dating from the original result by von Neumann [10] to the very general version by Sion [9]. It addresses the so-called two person zero sum game, wherein two players play a game with an objective function that depends on their actions, which one of them is trying to maximize while the other one is trying to minimize. The theorem then says that the worst case profit for one equals the worst case loss of the other under suitable convexity hypotheses. The second result addresses more general non-cooperative games where the celebrated ‘Nash equilibrium’ introduced by John Nash is the most natural equilibrium concept. We give a simple proof of the existence of Nash equilibria for individually convex costs using Brouwer fixed point theorem, due to Geanakoplos [3].
5.6 Applications to Game Theory
95
5.6.1 Min–Max Theorem Let f : C × D → R denote the payoff function. Recall that a pair (x ∗ , y ∗ ) is said to be a saddle point for f if f (x ∗ , y) ≤ f (x ∗ , y ∗ ) ≤ f (x, y ∗ ) ∀x ∈ C, y ∈ D. If a saddle point exists, then taking minimum over x ∈ C on RHS and maximum over y ∈ D on LHS in the above, max f (x ∗ , y) ≤ f (x ∗ , y ∗ ) ≤ min f (x, y ∗ ) y
x
=⇒ min max f (x, y) ≤ max min f (x, y). x
y
y
x
Since min max f (x, y) ≥ max min f (x, y) x
y
y
x
in any case, we have min max f (x, y) = max min f (x, y). x
y
y
x
Thus the key step in establishing this equality is the existence of a saddle point. Theorem 5.11 (Min–max theorem) Let C, D closed bounded convex subsets of Rd , Rn resp., f (x, y) : (x, y) ∈ C × D → R continuous, f (·, y) convex ∀ y ∈ D, f (x, ·) concave ∀ x ∈ C. Then min max f (x, y) = max min f (x, y). x
y
y
x
Proof To begin with, assume that f (·, y) is strictly convex for each y and f (x, ·) is ˜ ∗ ), y ∗ ), where x(y) ˜ := strictly concave for each x. Let max y min x f (x, y) = f (x(y ∗ ∗ ˜ ). Then argmin f (·, y). Let x = x(y f (x ∗ , y ∗ ) ≤ f (x, y ∗ ) ∀x ∈ C. Let m(y) = min x f (x, y). Then m is concave continuous and y ∗ = argmax m(·), so for any y ∈ D and λ ∈ (0, 1),
96
5 Convex Optimization
f (x ∗ , y ∗ ) = m(y ∗ ) ≥ m((1 − λ)y ∗ + λy) = f (x((1 ˜ − λ)y ∗ + λy), (1 − λ)y ∗ + λy) ˜ − λ)y ∗ + λy), y) ≥ (1 − λ) f (x((1 ˜ − λ)y ∗ + λy), y ∗ ) + λ f (x((1 ≥ (1 − λ) f (x ∗ , y ∗ ) + λ f (x((1 ˜ − λ)y ∗ + λy), y) Rearranging terms, we get ˜ − λ)y ∗ + λy), y). λ f (x ∗ , y ∗ ) ≥ λ f (x((1 Dividing both sides by λ > 0 and letting λ → 0, we have ˜ − λ)y ∗ + λy), y) f (x ∗ , y ∗ ) ≥ lim f (x((1 λ↓0
= f (x ∗ , y).
(5.8)
Hence (x ∗ , y ∗ ) is a saddle point, implying the min-max theorem. Now consider the case when the convexity of f (·, y) or concavity of f (x, ·) above is not strict. Define f : C × D → R by f (x, y) := f (x, y) + (x2 − y2 ). Then f is strictly convex in x for fixed y and strictly concave in y for fixed x. Therefore the foregoing applies and guarantees the existence of a saddle point equilibrium (x , y ) satisfying f (x , y) + (x 2 − y2 ) ≤ f (x , y ) + (x 2 − y 2 ) ≤ f (x, y ) + (x2 − y 2 ) for all x ∈ C and y ∈ D. Since C and D are closed and bounded, (x , y ) converges to (x ∗ , y ∗ ) along a subsequence by the Bolzano-Weierstrass theorem. Letting → 0 along this subsequence in the above inequality, we obtain f (x ∗ , y) ≤ f (x ∗ , y ∗ ) ≤ f (x, y ∗ ) for all x ∈ C and y ∈ D, implying the existence of a saddle point equilibrium. Hence min max f (x, y) = max min f (x, y), x
y
y
x
proving the claim.
The convexity/concavity assumption, however, is crucial. Consider the following simple example. Let f : [0, 1] × [0, 1] → R given by f (x, y) = (x − y)2 . Then f (·, ·) is convex in x variable and convex (instead of concave) in y variable. The game does not have a saddle point equilibrium. In fact (see the Fig. 5.8), min max f (x, y) = x
y
1 = 0 = max min f (x, y). y x 4
5.6 Applications to Game Theory
97
Fig. 5.8 Graph of f (x, y) = (x − y)2
5.6.2 Existence of Nash Equilibria We now move on to the N -person non-cooperative game with payoff function f = [ f 1 , . . . , f N ]T : C =
N
Ci ⊂ R
N i=1
di
→ R N ,
i=1
where Ci ⊂ Rdi , di ≥ 1, is closed bounded and convex. For xi ∈ Ci , 1 ≤ i ≤ N (written as a row vector) , we denote by x−i the vector x = [x1 : . . . : xi−1 : xi+1 : . . . : x N ] and write x = [xi : x−i ] for any i, 1 ≤ i ≤ N , by abuse of notation. We assume that f is componentwise convex, i.e., xi → f i ([xi , x−i ]) is convex and continuous for each i and each fixed value of x−i . The objective of the ith player is to minimize the latter function for given x−i . The catch is that every player does so at the same time independently of the others, i.e., non-cooperatively. The natural solution concept then is that of Nash equilibrium [6, 7]. We call x ∗ = [x1∗ : . . . : x N∗ ], xi∗ ∈ Rdi ∀i, ∗ ]). In a Nash equilibrium if for each i, xi∗ minimizes the map x ∈ Rdi → f i ([x, x−i words, it is the action profile for which no player has any incentive to unilaterally change his action when the actions of all other players remain frozen. For simplicity, we consider N = 2. The same proof works for N > 2 with additional notational overhead. Assume f 1 is convex in x for each fixed y and f 2 is convex in y for each fixed x. Also assume that both f 1 and f 2 are jointly continuous. We use the notation introduced at the beginning of this section. Theorem 5.12 (Existence of Nash equilibrium) There exists a Nash equilibrium x ∗ . Proof Fix (x, ¯ y¯ ) ∈ S1 × S2 and define φ = (φ1 , φ2 ) : S1 × S2 → S1 × S2 by
98
5 Convex Optimization
φ1 (x, ¯ y¯ ) = arg min{ f 1 (x, y¯ ) + |x − x| ¯ 2} x∈S1
and ¯ y¯ ) = arg min{ f 2 (x, ¯ y) + |y − y¯ |2 }. φ2 (x, y∈S2
Clearly φ1 and φ2 are single valued and hence φ := [φ1 : φ2 ] : S1 × S2 → S1 × S2 defines a continuous function. Now Brouwer fixed point theorem guarantees the existence of (x ∗ , y ∗ ) ∈ S1 × S2 such that φ(x ∗ , y ∗ ) = (x ∗ , y ∗ ). Therefore, f 1 (x ∗ , y ∗ ) = min f 1 (x, y ∗ ) + |x − x ∗ |2 x∈S1
f 2 (x ∗ , y ∗ ) = min f 2 (x, y ∗ ) + |y − y ∗ |2 .
and
y∈S2
Fix x ∈ S1 . Let λ ∈ (0, 1) and consider λx + (1 − λ)x ∗ ∈ S1 . By definition of x ∗ and y ∗ , we have f 1 (x ∗ , y ∗ ) ≤ f 1 (λx + (1 − λ)x ∗ , y ∗ ) + |λx + (1 − λ)x ∗ − x ∗ |2 = f 1 (λx + (1 − λ)x ∗ , y ∗ ) + λ2 |x − x ∗ |2 ≤ λ f 1 (x, y ∗ ) + (1 − λ) f 1 (x ∗ , y ∗ ) + λ2 |x − x ∗ |2 . Rearranging, we get λ f 1 (x ∗ , y ∗ ) ≤ λ f 1 (x, y ∗ ) + λ2 |x − x ∗ |2 Dividing both sides by λ > 0 and then letting λ → 0, we obtain f 1 (x ∗ , y ∗ ) ≤ f 1 (x, y ∗ ). Since x ∈ S1 is arbitrary, we get the optimality of x ∗ . In a similar fashion, we can show the optimality of y ∗ . Therefore (x ∗ , y ∗ ) is a Nash equilibrium.
5.7 Exercises 5.1 Show that a convex function f : C → R for a closed bounded convex C ⊂ Rd attains its maximum at an extreme point of C. 5.2 For a continuous function f : C → R where C ⊂ Rd is closed convex and bounded, define its convex minorant as the convex function f m : C → R satisfying f m (x) ≤ f (x) ∀ x ∈ C such that for any convex g : C → R satisfying g(x) ≤ f (x) ∀ x ∈ C, g(x) ≤ f m (x) ∀ x ∈ C. Show that:
5.7 Exercises
99
1. f m satisfying the above exists and is unique, 2. epi( f m ) = co(epi( f )), and, 3. min x∈C f (x) = min x∈C f m (x). 5.3 Compute f ∗ for f : R → R given by f (x) = 21 x2 . 5.4 Let { f α } be an arbitrary family of convex functions. 1. If fˇ(·) := inf α f α (·) > −∞, show that ( fˇ)∗ = supα f α∗ . 2. If fˆ(·) := supα f α (·) < ∞, show that ( fˆ)∗ ≤ inf α f α∗ , and that the inequality cannot be improved to equality in general. 5.5 Show that for convex f and a scalar a > 0, (a f )∗ = a f ∗ ax . 5.6 Let f, g : Rd → R be convex functions. The inf-convolution (or epi-sum) of f and g is defined as f g(x) := inf{ f (y) + g(x − y) : y ∈ Rd }. Show that f g is convex and ( f g)∗ = f ∗ + g ∗ . 5.7 Let C ⊂ Rd be a convex set. Its support function SC is defined as SC (x) := sup y∈C x, y . Show that SC∗ (x) = 0IC (x) + (−∞)IC c (x) where I B (·) is the indicator function of the set B, defined as I B (x) = 1 for x ∈ B, = 0 if not. 5.8 Let S := {x = [x1 , . . . , xd ] ∈ Rd : xi ≥ 0 ∀i, i xi = 1} denote the simplex of d-dimensional probability vectors. Minimize over S the ‘average energy’ prescribed Hi , 1 ≤ i ≤ d, subject to the constraint on ‘entropy’ i x i Hi for given by − i xi log xi ≤ C for a prescribed C > 0. 5.9 Consider the optimization problem
min e−x x
s.t. x 2 /y ≤ 0, y > 0. Show that this convex optimization problem has a duality gap with its Lagrangian dual and the Slater condition is violated. 5.10 Derive Theorem 5.11 from Theorem 5.12. 5.11 Let the sets {Ci } in Theorem 5.12 be finite sets and let P(Ci ) for each i denote the simplex of probability vectors indexed by Ci . An element of P(Ci ) corresponds to a randomized decision for agent i, called a mixed strategy. Show that a Nash equilibrium among mixed strategies always exists, though one in ‘pure’ strategies corresponding to deterministic choices from the respective Ci ’s, may not. 5.12 Show that if a continuously differentiable convex function f : C → R for a closed convex set C ⊂ Rd attains its minimum at x ∗ ∈ C, then x ∗ satisfies the ‘variational inequality’ x − x ∗ , ∇ f (x ∗ ) ≥ 0 ∀ x ∈ C.
100
5 Convex Optimization
References 1. Arrow, K.J., Barankin, E.W., Blackwell, D.: Admissible points of convex sets. In: Contributions to the Theory of Games, vol. 2. Annals of Mathematics Studies, no. 28, pp. 87–91. Princeton University Press, Princeton, NJ (1953) 2. Bauschke, H.H., Lucet, Y.: What is a Fenchel conjugate? Not. Am. Math. Soc. 59, 44–46 (2012) 3. Geanakoplos, J.: Nash and Walras equilibrium via Brouwer. Econom. Theory 21(2–3), 585–603 (2003). Symposium in Honor of Mordecai Kurz (Stanford, CA, 2002) 4. Hiriart-Urruty, J.B.: When is a point x satisfying ∇ f (x) = 0 a global minimum of f ? Am. Math. Monthly 93(7), 556–558 (1986) 5. Magnanti, T.L.: Fenchel and Lagrange duality are equivalent. Math. Program. 7, 253–258 (1974) 6. Nash, J.: Equilibrium points in n-person games. Proc. Nat. Acad. Sci. U.S.A. 36, 48–49 (1950) 7. Nash, J.: Non-cooperative games. Ann. Math. 54, 286–295 (1951) 8. Peleg, B.: Topological properties of the efficient point set. Proc. Am. Math. Soc. 35, 531–536 (1972) 9. Sion, M.: On general minimax theorems. Pac. J. Math. 8, 171–176 (1958) 10. von Neumann, J.: Zur Theorie der Gesellschaftsspiele. Math. Ann. 100(1), 295–320 (1928) 11. Wightman, A.S.: Introduction. Convexity and the notion of equilibrium state in thermodynamics and statistical mechanics. In: Convexity in the Theory of Lattice Gases (R.B. Israel), pp. ix– lxxxvi. Princeton University Press (2015)
Chapter 6
Optimization Algorithms: An Overview
6.1 Preliminaries Optimization algorithms is a vast research area in its own right, with multiple strands. In this chapter we do not attempt anything close to a comprehensive overview, but limit ourselves to giving just a taste of the subject in broad strokes. The ‘bibliographical note’ at the end of this book gives pointers to more specialized texts with in-depth accounts of various sub-topics. We begin with a discussion of some general underpinnings of the subject. An algorithm, rather a recursive algorithm (which will be our main concern here), sequentially generates a sequence {xn } of putative candidates for the optimum. Thus xn+1 is derived from xn for each n according to a prescribed rule that defines the algorithm, beginning with a prescribed initial condition x0 . The most common format for an optimization algorithm is a recursion of the type ˆ n ), n ≥ 0, xn+1 = F(x
(6.1)
for a suitable Fˆ : Rd → Rd . But this is not exhaustive, e.g., there can be explicit time ˆ n ). Also, there are cases ˆ i.e., we can have F(X ˆ n , n) instead of F(X dependence in F, when we have an implicit prescription n , xn+1 ) = θ, n ≥ 0, F(x
(6.2)
: Rd × Rd → Rm , m ≥ 1. The idea is to ‘solve’ this for xn+1 with xn for some F as a given parameter. Such schemes are relatively few, so our primary focus will be on (6.1). We also do not consider, except towards the end and that too in passing, schemes wherein Fˆ is random. One usually specializes further to iterations of the type xn+1 = xn + a(n)F(xn ), n ≥ 0, © Hindustan Book Agency 2023 V. S. Borkar and K. S. M. Rao, Elementary Convexity with Optimization, Texts and Readings in Mathematics 83, https://doi.org/10.1007/978-981-99-1652-8_6
(6.3)
101
102
6 Optimization Algorithms: An Overview
where a(n) > 0 is a judiciously chosen sequence of stepsizes, and F : Rd → Rd is a prescribed function. This renders (6.3) incremental, i.e., we make only a small change in xn at time n in order to obtain xn+1 . This often improves the performance of the algorithm and, more often than not, is essential for its correct functioning. The quantity F(xn ) by definition uses information of the function f : Rd → R (or more generally, f : C ⊂ Rd → R as the case may be). This can include for example f (xn ), ∇ f (xn ), ∇ 2 f (xn ), or even the higher order derivatives. Often one uses approximations thereof, e.g., ∂f f (x + δei ) − f (xi ) , 1 ≤ i ≤ d, (x) ≈ ∂ xi δ where ei is the unit vector in the ith coordinate direction and 0 < δ ≈ 0. Then we are using information about f in a small neighborhood of xn and not just at xn . In any case, the algorithm is local in the sense that it uses information only at, or in a very small neighborhood of, the current iterate. The other noteworthy feature is that it is deterministic, i.e., xn+1 is prescribed in terms of xn by a purely deterministic rule, there is no randomness involved. As a general rule, for non-convex f , any deterministic and local algorithm will yield only local optima at best. Thus general purpose global optimization algorithms have to, and do, give up on either locality or determinism (i.e., use some form of randomization) or both. We return to this theme later. Given the format (6.3), one can mentally split it into two components. The first ˆ n ). This choice is to choose which direction to move in, which is specified by F(x dictates the various classes of algorithms. The second component is to decide how far to move in that direction, viz., the choice of the stepsize a(n). This aspect is handled by approximately solving the scalar, and therefore easier optimization problem: (P0 ) Minimize a → f (xn + a F(xn )) over a suitable subset of R+ . In the least, one tries to solve it only approximately, or even make an intelligent move that lowers this objective function adequately. In other words, we solve approximately a succession of scalar problems. The move from one such problem to the next is dictated in a precise sense by the rule embedded in the function F. The scalar optimization component, called line search, tends to be more or less common across different classes of algorithms. We briefly mention some representative schemes for it in the next section.
6.2 Line Search The most naive scheme one can think of in the above context is exact line minimization, i.e., solve exactly the problem (P0 ). This may not be easy or fast, particularly in the non-convex case. Also, it represents a greedy heuristic, a priori it is not obvious
6.2 Line Search
103
in general that this is the best thing to do. Therefore it is not often used. Nevertheless, theoretical analyses of algorithms often assume that it is operative, for sake of analytical ease. The more common approach is to restrict the line search to an interval I = [0, r ] for a suitable r < ∞. Further, one does not perform an exact minimization on I , but seeks a point therein that gives adequate decrease of f . Some classical schemes are based on simple ‘search and compare’ ideas that successively generate subintervals to search on, each formed by subdividing the previous candidate interval into two according to some rule. One important consideration is that each subdivision should maintain a decent proportion among the lengths of the new intervals generated. Some classical methods for this are the binary search (subdivide into equal parts), Fibonacci search (subdivide according to the ratios of successive Fibonacci numbers generated by the recursion Z n+1 = Z n + Z n−1 , n ≥ 2, with Z 0 = Z 1 = 1) and the golden section method (where the ratio is fixed at the so-called ‘golden ratio’ which can be arrived at as a limit of ratios of successive Fibonacci numbers). The latter two have some theoretical advantages over simple binary search. Even with this, the iteration may not converge and keep oscillating unless explicit care is taken to ensure that a minimum decrease in the value of f is achieved at each step, see, e.g., Fig. 6.1 (adapted from [6], which also gives a concrete example of such a scenario). Intuitively, in most algorithms we need F(x) to be ‘gradient-like’, i.e.,
∇ f (x), F(x) < 0,
(6.4)
so that for sufficiently small a(n), we have, by the Taylor formula, f (xn+1 ) ≈ f (xn ) + a(n) ∇ f (x), F(x) + o(a(n)) < f (xn ) + o(a(n)).
Fig. 6.1
(6.5) (6.6)
104
6 Optimization Algorithms: An Overview
But the left hand side of (6.4) can still be ≈ 0 even for non-negligible values of ∇ f (xn ), F(xn ), unless the angle between the two is bounded away from ± π2 . Thus what one needs is in fact the additional condition
∇ f (xn ), F(xn ) > δn ,
(6.7)
for judiciously chosen δn > 0. Such conditions due to Wolfe, Goldstein, etc., lead to more principled search schemes. We describe next one popular variant known as the Armijo rule [3]. The two conditions we impose are: sup F(xn ) < ∞, k
sup ∇ f (xn ), F(xn ) < 0. k
The Armijo rule is as follows. Fix I = [0, r ] as above and choose constants 0 < α, β < 1. Set a(n) := β m(n) r , where m(n) = min{m ≥ 0 : f (xn ) − f (xn + β m r F(xn )) ≥ −αβ m r ∇ f (xn ), F(xn )}. Then one has: Theorem 6.1 If xn → x ∗ along a subsequence, then ∇ f (x ∗ ) = θ . A verbal description of this scheme is as follows. Consider the half line L through xn in the direction F(xn ) and mark a point xn that is r > 0 away from xn on it. Consider the interval I along L with end points xn , xn . Consider also the half line L ∗ that lies below L with a negative slope smaller in magnitude than the projected negative gradient along L, by a factor α < 1. If at xn , the graph of f lies below L ∗ , set xn+1 = xn . If not, replace r by βr (i.e., shrink I by a factor β keeping the end point xn fixed). Repeat the procedure till termination. The key idea is that we pick the distance to move along F(xn ) so that there is adequate decrease in the function value relative to the distance moved, viz., at least a certain fraction (α to be precise) of what one would get by traveling along the negative directional derivative along that direction. The process terminates at a finite m because, by our choice of α < 1, the desired decrease is always possible sufficiently close to xn . See [6] for a detailed proof of the above theorem. For an overview of classical line search methods, see [8, 31], among others. In conclusion, we also point out that the simplest scheme is to use a pre-determined sequence {a(n)}, which includes the even simpler possibility of a(n) ≡ a > 0, i.e., a constant stepsize. This has the advantage of simplicity, which matters a lot when the algorithm is hardwired into a chip. The constant stepsize is also a popular scheme when tracking the optimum of a function varying slowly with time, a theme out of scope of this book. What one loses is the fact that a priori in absence of any analytic insights, these
6.3 Algorithms for Unconstrained Optimization
105
stepsizes are usually chosen to be small, which makes the algorithm slow. Also, it clearly gives up on the option of leveraging line optimization, which may cause further degradation of quality. Nevertheless they remain popular. For non-constant stepsizes, however, one caveat is in order. It helps here to view the algorithms as a discretization (or Euler scheme) for the differential equation x(t) ˙ = F(x(t)), t ≥ 0.
(6.8)
The idea is that the discrete algorithm should track the asymptotic behaviour of (6.8). Interpreting {a(n)} as discrete time steps, this means that we need n a(n) = ∞ in order that the entire time axis be covered. Otherwise the convergence observed will be due to rapid decrease of {a(n)} rather than the actual minimization taking place, effectively emulating (6.8) only on a finite time interval. Therefore one enforces the above condition on {a(n)}. The relationship with (6.8) also uncovers an advantage of slowly decreasing {a(n)} (i.e., a(n) → 0 with n a(n) = ∞) near the desired limit (ideally, the global minimum), viz., that the tracking of (6.8), becomes more and more exact, leading to an asymptotically vanishing error. Thus slowly decreasing stepsizes capture a trade-off between rapid initial ‘exploration’ with larger stepsizes and eventual ‘exploitation’ of the local landscape near the desired limit with small stepsizes approaching zero.
6.3 Algorithms for Unconstrained Optimization In this section, we introduce the classical algorithms for unconstrained optimization in outline. We begin with gradient descent, also known as steepest descent. 1. Gradient descent For continuously differentiable f : Rd → R, the choice F(x) = −∇ f (x) yields, by virtue of (6.5)–(6.6), a strict decrease of f (xn ) as long as ∇ f (xn ) = O(1). Thus if a(n) = O(a) (say) as n → ∞ for some small a > 0, then lim sup ∇ f (xn ) = O(a),
(6.9)
n↑∞
implying that xn → a set of ‘nearly critical points’. For reasonable conditions on f typically satisfied by functions one encounters in practice, we can strengthen this to ‘xn → a small neighbourhood of a critical point’. One such condition, e.g., is that the Hessian be non-singular at the critical points, which in particular implies that they are isolated, i.e., each one has no other critical point in a sufficiently small neighborhood. This can be verified by looking at the second order Taylor formula.
106
6 Optimization Algorithms: An Overview
This is the gradient descent or steepest descent algorithm. The latter terminology stems from the fact that, subject to a bound on F(x) , the best choice of F(x) in (6.4), i.e., one that makes the left hand side the most negative, would indeed be to set F(x) ∝ −∇ f (x). A priori, if ∇ f (xn ) = θ for any n, the algorithm stops there, whence we cannot rule out convergence to critical points that are not local minima. However, these are unstable equilibria for the iteration, because a small numerical error will dislodge it from such a point, making it descend further. One can also ensure this by adding a small amount of noise, but that is rarely necessary. Here it helps to compare the algorithm with the continuous time gradient descent x(t) ˙ = −∇ f (x(t)), t ≥ 0, of which it can be viewed as a discretization. Isolated local minima are stable equilibria for this. In general, however, stability is not guaranteed unless f is real analytic, i.e., has a Taylor series expansion. In particular, only infinite differentiability is not enough, see [1]. Note also that for a(n) → 0, exact convergence can be usually claimed because the ‘a’ in (6.9) can be made arbitrarily small. Exact convergence is also true for schemes such as Armijo or exact line minimization. If local minima are not isolated, one can get convergence to a set thereof, but not necessarily to a single point therein, even to one that possibly depends on the initialization of the iteration, unless additional conditions are imposed. See [1] for an example where countably infinite local minima accumulate to prevent such a ‘point convergence’. But such situations are pathological and do not commonly occur in practice. Nevertheless it is good to be alert to such possibilities. Convergence rates for gradient descent under a variety of conditions on f have been analyzed. Here we give a simple such result for the special case when f : Rd → R is twice continuously differentiable and x ∗ is its local minimum satisfying ∇ 2 f (x ∗ ) > 0 (i.e., is strictly positive definite). Then we can find r > 0 sufficiently small so that for x ∈ B¯ r (x ∗ ) := {y : y − x ∗ ≤ r }, ∇ 2 f (x) > 0 by continuity. In particular its eigenvalues are real (because it is symmetric) and positive. Let λmin (x) > 0 denote the least eigenvalue of ∇ 2 f (x) and let λ0 := min x∈ B¯ r (x ∗ ) λmin (x) > 0. Let a ∈ (0, λ0 ) and consider the gradient descent scheme with a(n) ≡ a. Suppose xn 0 ∈ B¯ r (x ∗ ). Since ∇ f (x ∗ ) = θ , Taylor’s formula for n ≥ n 0 leads to xn+1 − x ∗ = xn − x ∗ − a(∇ f (xn ) − ∇ f (x ∗ )) = (I − a∇ 2 f (x˜n ))(xn − x ∗ ) (for some x˜n ∈ B¯ r (x ∗ )) =⇒ xn+1 − x ∗ ≤ I − a∇ 2 f (x˜n ) xn − x ∗ = (1 − aλmin (x˜n )) xn − x ∗ ≤ (1 − aλ0 ) xn − x ∗ , aλ0 ∈ (0, 1).
6.3 Algorithms for Unconstrained Optimization
Hence
107
xn − x ∗ ≤ (1 − aλ0 )(n−n 0 ) xn 0 +1 − xn 0 → 0.
Suppose we use instead general stepsizes {a(n)} with argument leads to xn − x ∗ ≤ e−λ0
n
m=n 0 +1
a(m)
n
a(n) = ∞, a similar
x(n 0 + 1) − x(n 0 ) → 0
(6.10)
as n ↑ ∞. It is more common to look for the rate of decrease of f (xn ) − min y f (y) rather than of xn − x ∗ . See [36] for a detailed analysis along those lines. Gradient descent is extremely popular because of its simplicity, which makes it scale well with dimensions in the very high dimensional regime, as compared to the more sophisticated methods we describe below. Nevertheless, it is not without its drawbacks. It does not escape even ‘shallow’ local minima. Furthermore, it slows down as it approaches the local minimum because the gradient gets smaller in magnitude. More generally, it is slow on part of the domain of f where gradients are small. These regions can be large, e.g., in training algorithms for sigmoidal neural networks. Also, while we argued that the scheme will generally escape critical points other than local minima, it can do so very slowly precisely because the gradients are small in the neighbourhood of such points. The biggest problem, however, is that it performs miserably on ‘pinched’ landscapes, i.e., neighborhoods of local minima where the rise away from the local minimum is very slow in some directions and very steep in others, see, e.g., Fig. 6.2. As shown in this figure, gradient descent will zigzag ‘across’ the narrow alley rather than follow the obviously better option of going down ‘along’ it. Luckily, a simple fix works, which we describe in the next bullet. We conclude our discussion of gradient descent by pointing out that the exact gradient is often not available explicitly and is estimated from function evaluations, sometimes noisy, by using finite difference approximations (one sided or symmetric two sided) along coordinate directions, or sometimes along random directions, and so on. One can use one-sided finite difference approximation:
Fig. 6.2
108
6 Optimization Algorithms: An Overview
∂f f (x + δei ) − f (x) , (x) ≈ ∂ xi δ or the two-sided one: ∂f f (x + δei ) − f (x − δei ) (x) ≈ ∂ xi 2δ for a small δ > 0, ei being the ith unit coordinate vector for 1 ≤ i ≤ d. From Taylor’s formula, one sees that the latter has smaller (O(δ 2 )) error compared to the former, where it is O(δ). However, the former requires d + 1 function evaluations, whereas the latter requires 2d. One recent innovation takes a cue from (6.5) to estimate ∇ f (x) by ∇ f (x) ≈ argmin y
m
2 k( xi − x )( f (xi ) − f (x) − y, xi − x)
i=1
where {xi , 1 ≤ i ≤ m} are points in a suitable neighborhood of x where, along with x itself, f is measured, and k(·) is a non-negative kernel which decays with increasing values of its argument [33]. This is a simple quadratic minimization problem motivated by the first order Taylor formula. 2. Momentum methods The idea here is to add to the gradient descent a term proportional to the previous move, i.e., (6.11) xn+1 = xn − a(n)∇ f (xn ) + b(n)(xn − xn−1 ). Usually b(n) < a(n). For simplicity, let a(n) ≡ a, b(n) ≡ b with b < a. Setting yn = xn − xn−1 , (6.11) can be thought of as a discretization of the differential equation x(t) ˙ = y(t), x(t) ˙ = −αy(t) − ∇ f (x(t)),
(6.12)
where α := (1 − b/a) > 0. This is recognized as Newtonian dynamics in a ‘potential field’ f , with damping. Think of a ball rolling down a hilly landscape. Thanks to its momentum (hence the rubric ‘momentum method’) it will quickly get out of any local maxima, points of inflection, or saddle points it encounters on its way, not to mention shallow local minima (see Fig. 6.3). It will also do a better job of moving down a ‘flat landscape’ compared to steepest descent. It will quickly approach the bottom of the valley (≈ local minimum) that it will eventually settle to, much faster than gradient dynamics, but will oscillate locally before friction dissipates its energy and it stops. One expects the algorithm to reflect this behaviour and it does.
6.3 Algorithms for Unconstrained Optimization
109
Fig. 6.3
Equation (6.12) is not the most general scenario; in particular it is possible to have time dependent coefficients. In fact the leading algorithm in high dimensions is a specific variant of momentum method known as Nesterov’s accelerated gradient method [35], whose o.d.e. counterpart has this feature [42]. It has been analyzed quite elegantly in terms of the physics behind it [44]. Another viewpoint which explains why the momentum method avoids zigzagging along a narrow downhill wedge is the fact that starting from rest, (6.12) leads to t y(t) = − e−(t−s) ∇ f (x(s))ds, 0
i.e., it sees the exponentially averaged or ‘smoothened’ gradient that suppresses the zigzagging behavior alluded to above. 3. Conjugate gradient This is one of the classical methods that often perform better than gradient descent for moderate dimensions. It was originally designed for quadratic functions, but is used more generally. We first discuss a more general scheme called ‘conjugate directions’ method. Let f (x) = 21 x T Qx − bT x, Q symmetric positive definite. Nonzero vectors {d(i)} ⊂ Rd are said to be conjugate with respect to Q if d(i)T Qd( j) = 0 for i = j, i.e., they are orthogonal with respect to the inner product x, y Q := x T Qy. Let · Q denote the corresponding norm. Clearly there are at most d such mutually conjugate vectors. Consider the algorithm xn+1 = xn + a(n)d(n), n ≥ 0,
(6.13)
with d(k), d( j) Q = 0, k = j. Assume exact line minimization. Then a(k) := argmina f (x(k) + ad(k)) is seen to equal a(k) =
d(k)T (b − Qx(k)) . d(k)T Qd(k)
Direct calculation also shows that (check this)
(6.14)
110
6 Optimization Algorithms: An Overview
∇ f (xn )T d(i) = 0 ∀ i < n.
(6.15)
xn+1 = Sn := argmin{x∈span(d(i),0≤i≤n)} f (x).
(6.16)
This implies that
Therefore the iteration converges to argmin( f ) in at most d steps. The conjugate gradient algorithm corresponds to the special case when the d(i)’s are obtained by successively applying Gram-Schmidt orthogonalization to −∇ f (xn ), n ≥ 0 with respect to the inner product ·, · Q . By (6.16), we also have ∇ f (xn+1 ) ⊥ Sn = span{∇ f (xk ), 0 ≤ k ≤ n}. From this, one can easily show that
∇ f (xn ), d( j) Q = 0, j < k − 2, ∇ f (xk ) 2 = , j = k − 1, a(k − 1) 1 d( j)T (∇ f (x j+1 ) − ∇ f (x j )). d( j) 2Q = a( j)
(6.17) (6.18) (6.19)
Substituting (6.17)–(6.19) in the Gram-Schmidt formula given by k−1 g(k)T Qd( j) d(k) = −∇ f (xk ) + d( j), d( j)T Qd( j) j=0
we get a recursion for {d(n)} given by d(n) = −∇ f (xn )) + b(n)d(n − 1), n ≥ 1,
(6.20)
where the stepsize b(n) :=
∇ f (xn ) 2 d(n − 1)T (∇ f (xn ) − ∇ f (xn−1 ))
simplifies after some algebra (check this) to the Fletcher-Reeves formula b(n) =
∇ f (xn ) 2 . ∇ f (xn−1 ) 2
(6.21)
As mentioned above, the conjugate gradient method (6.13), (6.20) is also applied for non-quadratic functions, where often an equivalent (for quadratic case) expression
∇ f (xn ), ∇ f (xn ) − ∇ f (xn−1 ) , (6.22) b(n) = ∇ f (xn−1 ) 2
6.3 Algorithms for Unconstrained Optimization
111
called the Polak-Ribiere formula, is used because it helps avoid ‘jamming’ of the iterates. The choice of a(n) is made by line minimization. That said, application of conjugate gradient to non-quadratic non-convex function needs some care, and tweaks such as reverting occasionally to steepest descent are commonly used. 4. Newton and quasi-Newton methods One way to work around ‘pinched landscapes’ is to rescale the space locally so that they are no longer ‘pinched’. Assume that f is twice continuously differentiable with a non-singular Hessian. In view of the second order Taylor expansion of f at x, it makes sense to use ∇ 2 f (x) for this purpose. In fact, if we choose F(xn ) in the formula f (xn+1 ) − f (xn ) = a(n) ∇ f (xn ), F(xn ) a(n)2 F(xn )T ∇ 2 f (xn )F(xn ) + o(a(n)2 ), + 2 so as to minimize the right hand side sans the o(a(n)2 ) term (in other words, minimize a quadratic approximation to the left hand side), we get precisely the Newton scheme −1 ∇ f (xn ), n ≥ 0. xn+1 = xn − a(n) ∇ 2 f (xn ) We then have f (xn+1 ) = f (xn ) −
a(n)
∇ f (xn ), (∇ 2 f (xn ))−1 ∇ f (xn ) + o(a(n)2 ). 2
If f is convex, (∇ 2 f (xn ))−1 is positive definite and the right hand side is strictly negative till the point where the o(a(n)2 ) term matters, i.e., till the point where the gradient is o(a(n)) (assuming that ∇ 2 f (xn )−1 remains uniformly positive definite). See [38] for a nice exposition of the method. A remarkable result of Smale [40] for the differential equation counterpart suggests that convexity is not crucial for convergence to a critical point. We sketch here a bare bones version of his argument. The differential equation counterpart is −1 ∇ f (x(t)), t ≥ 0. x(t) ˙ = − ∇ 2 f (x(t))
(6.23)
Suppose the set E of equilibria of this equation, i.e., the critical points of f , can be enclosed in a bounded region D with a smooth boundary, such that the vector field h(x) := −(∇ 2 f (x))−1 ∇ f (x) is pointing inwards on the boundary ∂ D of D (see Fig. 6.4). (x) . Some simAway from E, ∇ f (x) > 0 and we can define g(x) := ∇∇ ff (x) ple calculus shows that g(x) = constant is precisely a trajectory of (6.23), parametrized so that the time is increasing as you move away from ∂ D. Suppose
112
6 Optimization Algorithms: An Overview
Fig. 6.4
D
∂D
E
we start one such trajectory on a boundary point and try to extend it inwards into the interior of D. Since g maps Rd into a (d − 1)-dimensional object, viz., the unit d-sphere Sd , we expect the inverse image of any point in Sd under g to be a one-dimensional curve. Under certain technical hypotheses, the implicit function theorem of differential topology allows us to conclude that indeed in a sufficiently small neighbourhood of any point in D\E, there exists a unique such curve. Returning to the curve we initiated on the boundary, this eliminates many possibilities. For example, the curve cannot intersect itself, in fact cannot come arbitrarily close to itself. It cannot end abruptly at a point in D\E because the theorem allows us to extend it further. On the other hand, it cannot turn around and exit D because h is pointing inwards at the boundary. Thus it has to converge to E. There is, however, one catch, viz., the ‘technical conditions’ that allowed us to invoke the implicit function theorem. Luckily, another important theorem of differential topology, the Sard’s theorem, allows us to claim that the above works for ‘almost all’ initial conditions in a measure theoretic sense. This observation of Smale has led to much research on what is now known as the global Newton method and related dynamics [23]. The sleek theory apart, the method is not without its hassles. Recall that often the gradient needs to be estimated and typical schemes for that have numerical issues such as the small divisor problem. This problem gets worse with the Hessian. Furthermore, matrix inversion, which here can be replaced by a linear system solver as a subroutine, is another computational overhead. This has prompted approximate Newton methods, known as quasi-Newton methods. An excellent account of the development of these ideas appear in the classic text [17] by one of the pioneers. We describe some key details below. The idea is to keep track of a running recursive estimate of (∇ 2 f (xn ))−1 . This has to be done in a manner that ensures positive definiteness and the so-called
6.3 Algorithms for Unconstrained Optimization
113
quasi-Newton condition that ensures that they indeed mimic the inverse Hessian. The latter condition is derived as follows. We have ∇ f (xn+1 ) − ∇ f (xn ) ≈ ∇ 2 f (xn+1 )(xn+1 − xn ). Thus the approximation H (n + 1) of (∇ 2 f (xn+1 ))−1 should satisfy the quasiNewton condition xn+1 − xn = H (n + 1)(∇ f (xn+1 ) − ∇ f (xn )). The simplest thing would be to go for a rank one update for a symmetric positive definite A (= H (n)), i.e., from A to A + uu T for a suitable u. In fact some early schemes took this route, but were plagued by numerical issues such as small divisors or loss of positive definiteness. Hence one goes for rank two updates, i.e., from A to A + uu T + vvT for suitable u, v. We sketch below the key points in the derivation of quasi-Newton methods, see, e.g., [17] for a more detailed treatment. One ends up with not one, but a whole family of algorithms called the Broyden family. Introduce the notation s(n) := xn+1 − xn , y(n) := ∇ f (xn+1 ) − ∇ f (xn ), H (n)y(n) s(n) − . w(n) := y(n)T s(n) y(n)T H (n)y(n) The quasi-Newton condition can be written as s(n) = H (n + 1)y(n). Taking cue from the foregoing, consider the rank two update H (n + 1) = H (n) + auu T + bvvT for suitable scalars a, b, and vectors u, v. Then by the quasi-Newton condition, s(n) = H (n)y(n) + auu T y(n) + bvvT y(n). The choice u = s(n), v = H (n)y(n) and a, b chosen so that au T y(n) = 1 and bvT y(n) = −1 leads to H (n + 1) = H (n) +
H (n)y(n)y(n)T H (n) s(n)s(n)T − . s(n)T y(n) y(n)T H (n)y(n)
114
6 Optimization Algorithms: An Overview
This is the original quasi-Newton method, the Davidon-Fletcher-Powell method we revisit below. The general algorithm is xn+1 = xn − a(n)H (n)∇ f (xn ), n ≥ 0,
(6.24)
coupled with the iterate for {H (n)} given by H (n + 1) = H (n) −
1 H (n)y(n)y(n)T H (n) y(n)T H (n)y(n)
1 s(n)s(n)T y(n)T s(n) +φ(n)(y(n)T H (n)y(n))w(n)w(n)T .
+
(6.25)
This family of algorithms is known as the Broyden family. Here φ(n) is a flexible parameter that can be chosen depending on y(n), s(n) and B(n). The case φ(n) ≡ 0 yields the Davidon-Fletcher-Powell method, one of the earliest quasiNewton algorithms. The choice φ(n) ≡ 1 on the other hand leads to the popular BFGS method, for Broyden-Fletcher-Goldfarb-Shanno, who discovered it independently. Note that w(n) ⊥ y(n), hence any quantity proportional to w(n)w(n)T can be added on the right hand side without affecting s(n), thanks to the quasiNewton condition. Nevertheless, the last term on the right can affect the actual convergence behaviour of the algorithm and gives practitioners an additional handle to tweak. We now check the correctness of the above scheme. Suppose H (n) > 0 and a(n) is chosen so that for d(n) := −H (n)∇ f (xn ),
∇ f (x(n)), d(n) < ∇ f (x(n + 1)), d(n).
(6.26)
This ensures that H (n + 1) is positive definite, as shown below. If ∇ f (x(n)) = θ , then (6.26) can be ensured by doing a line search till | ∇ f (x(n)), d(n)| > | ∇ f (x(n + 1)), d(n)|, e.g., by line minimization for which the right hand side is in fact zero. Then a(n) > 0, y(n) = θ , and s(n)T y(n) = a(n)d(n)T (∇ f (x(n + 1)) − ∇ f (x(n))) > 0,
(6.27)
so all denominators in (6.25) are nonzero and H(n + 1) is well defined. For z ≠ θ, let a := H(n)^{1/2}z and b := H(n)^{1/2}y(n). Then
z^T H(n + 1)z = z^T H(n)z + (z^T s(n))²/(y(n)^T s(n)) − (y(n)^T H(n)z)²/(y(n)^T H(n)y(n)) + φ(n)(y(n)^T H(n)y(n))(z^T w(n))²
= (‖a‖²‖b‖² − (a^T b)²)/‖b‖² + (z^T s(n))²/(y(n)^T s(n)) + φ(n)(y(n)^T H(n)y(n))(w(n)^T z)².
(6.28)
All terms are non-negative. Moreover, if the first two terms on the right vanished, we would have ‖a‖²‖b‖² = (a^T b)² ⟹ a = λb ⟹ z = λy(n), together with z^T s(n) = 0 ⟹ λ s(n)^T y(n) = 0 ⟹ λ = 0 ⟹ z = θ. But s(n)^T y(n) > 0 and z ≠ θ, so at least one of the first two terms on the RHS of (6.28) is > 0. This proves the positive definiteness of H(n + 1). A 'dual' scheme for updating candidate Hessians rather than Hessian inverses can be derived and corresponds to (6.25) with s(n) and y(n) interchanged. Of course, this requires matrix inversion which needs O(d³) computation, but here it can be reduced to O(d²) by using the Cholesky decomposition [21]. On the flip side, the gain over the first scheme is in better avoidance of the loss of positive definiteness due to numerical errors [19]. We briefly sketch a related scheme known as the Gauss-Newton method. This minimizes the square of the linear approximation f(x) ≈ f(x(k)) + ∇f(x(k))·(x − x(k)), leading to, 'in principle',

x(k + 1) = x(k) − a(k)(∇f(x(k))∇f(x(k))^T)^{−1}∇f(x(k)) f(x(k)).

Of course, the rank 1 matrix ∇f(x(k))∇f(x(k))^T is not invertible. To avoid this non-invertibility issue, we can use

x(k + 1) = x(k) − a(k)(∇f(x(k))∇f(x(k))^T + Δ)^{−1}∇f(x(k)) f(x(k)),

where Δ is a small positive definite matrix. For Δ = νI, this is the Levenberg-Marquardt method, popular in the neural networks literature. This is particularly useful when f = Σ_{i=1}^m f_i², where the inverse computation can be simplified by using the Sherman-Morrison-Woodbury formula [21].
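To make the above concrete, here is a minimal numerical sketch of the Broyden family in Python. The code is not from the text: the test function, the backtracking line search, the tolerance, and the curvature safeguard are all illustrative choices. The routine broyden_family_step implements (6.25); φ = 0 gives Davidon-Fletcher-Powell and φ = 1 gives BFGS.

```python
import numpy as np

def broyden_family_step(H, s, y, phi):
    """One inverse-Hessian update of the Broyden family, Eq. (6.25).

    s = x_{n+1} - x_n, y = grad f(x_{n+1}) - grad f(x_n);
    phi = 0 gives Davidon-Fletcher-Powell, phi = 1 gives BFGS.
    """
    Hy = H @ y
    yHy = y @ Hy
    sy = s @ y
    w = s / sy - Hy / yHy
    return H - np.outer(Hy, Hy) / yHy + np.outer(s, s) / sy \
             + phi * yHy * np.outer(w, w)

def quasi_newton(f, grad, x0, phi=1.0, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))               # initial inverse-Hessian estimate
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g                   # quasi-Newton direction
        a, fx = 1.0, f(x)            # crude Armijo backtracking line search
        while f(x + a * d) > fx + 1e-4 * a * (g @ d) and a > 1e-12:
            a *= 0.5
        x_new = x + a * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        if s @ y > 1e-12:            # update only if s^T y > 0, cf. (6.27)
            H = broyden_family_step(H, s, y, phi)
        x, g = x_new, g_new
    return x

# Illustrative use on a strictly convex quadratic:
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ Q @ x
grad = lambda x: Q @ x
print(quasi_newton(f, grad, [1.0, 1.0]))   # converges to the minimizer theta
```

A Wolfe-type line search enforcing (6.26) would guarantee s(n)^T y(n) > 0, as the argument above requires, rather than merely testing for it as this sketch does.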
6.4 Algorithms for Constrained Optimization

Here we give a short summary of classical algorithms for optimization under constraints. The standard constrained optimization problem is: Problem (P): Minimize f(x) on C := {x : g_i(x) ≤ 0, 1 ≤ i ≤ M}.
1. Linear programming and related problems: Linear programming is the special subclass where both the objective function and the constraints are linear. That is, the problem is: Minimize c^T x subject to Ax ≤ b, x ≥ θ. Here x, c ∈ R^d, b ∈ R^m, A ∈ R^{m×d} for d, m ≥ 1. By introducing the so-called 'slack variables', the inequality above can be replaced by an equality. The dual problem is the linear program: Maximize b^T y subject to A^T y ≤ c, y ≥ θ. Under reasonable conditions, we can ensure feasibility, absence of duality gap, etc., see Sect. 5.5. The classical algorithm for linear programming is the simplex method of Dantzig, which does a recursive search on corners (extreme points) of the polytope of feasible solutions. As an ingenious example in [28] showed, this is not a polynomial time algorithm, but subsequent work in [41] showed that the examples where it is not are non-generic in a precise sense, so that it can be and is used in practice with complete faith. See [2] for a precursor and [43] for some subsequent developments. Khachiyan [27] was the first to give a polynomial time algorithm, but it was not very useful in practice. The next big breakthrough came with the work of Karmarkar [26], who gave another polynomial time algorithm which also works well in practice. The ideas that went into it led to a lot of development in interior point methods of convex optimization, called thus because they approach the optimum from the interior of the feasible set. There are several excellent texts on linear programming, some of them being [7, 13, 14, 25, 32]. See [12] for an interesting historical perspective. There have been a few subsequent success stories. A major one is semidefinite programming. Let S ⊂ R^{d×d} denote the space of symmetric matrices endowed with an inner product defined as follows. For A = [[a_ij]], B = [[b_ij]], define
⟨A, B⟩ := tr(A^T B) = ∑_{i,j} a_{ij} b_{ij}.

For A ∈ S, let A > 0 ⟺ A is positive definite. The optimization problem then is:

Minimize ⟨C, X⟩ subject to ⟨A_i, X⟩ = b_i, 1 ≤ i ≤ m; X > 0.

Here C, A_i ∈ S ∀i. Several efficient algorithms have been developed for this class of problems and it has found many applications.
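As a concrete aside (not from the text), a small instance of the primal linear program above can be handed directly to an off-the-shelf solver. The sketch below uses SciPy's linprog routine; the data c, A, b form an arbitrary illustrative instance, not one taken from the book.

```python
import numpy as np
from scipy.optimize import linprog

# Minimize c^T x subject to Ax <= b, x >= theta (componentwise).
c = np.array([-1.0, -2.0])                 # illustrative cost vector
A = np.array([[1.0, 1.0],
              [1.0, 3.0]])                 # illustrative constraint matrix
b = np.array([4.0, 6.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None)] * 2)
print(res.x, res.fun)   # an optimal extreme point of the polytope, and c^T x
```

Consistent with the simplex picture above, the reported optimum sits at a vertex of the feasible polytope.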
Another important class is second order cone programming, which deals with problems of the type:

Minimize c^T x subject to ‖A_i x + b_i‖ ≤ d_i^T x + r_i, 1 ≤ i ≤ m, Ax = h.

Here x, c ∈ R^d, A_i ∈ R^{k_i×d}, b_i ∈ R^{k_i} ∀i, A ∈ R^{s×d}, h ∈ R^s. See [9] for introductory accounts of semidefinite and second order cone programming.
2. Penalty functions: In this method, instead of (P) one minimizes without constraints the function f_λ(x) := f(x) + λF(x) with the 'penalty function' F chosen such that as λ → ∞, λF(x) → 0 on C and → ∞ on C^c, uniformly on closed bounded sets. So if x*(λ) is a minimizer of f_λ, then any limit point of x*(λ) as λ ↑ ∞ should be a solution to (P). That is, one solves an unconstrained optimization problem whose solution approximates that of the original problem. Typically the optimum of (P) is on the boundary ∂C, but x_λ ∉ C. Thus, if one solves the problem for a convenient λ, x_λ is not feasible and one typically picks the point nearest to it in C as a good solution. Under convenient hypotheses of convexity, etc., this can be shown to give a near-optimal solution. Usually F is chosen so that it is continuously differentiable and there is at least one minimizer of f_λ (e.g., when lim_{‖x‖↑∞} F(x) = ∞ faster than |f(x)|). One popular choice is F(x) = ∑_i (g_i^+(x))². Recall that this idea was already used in the proof of Theorem 2.3.
3. Barrier functions: This follows the philosophy of replacing the original problem by unconstrained optimization problems whose solution approximates that of the original problem 'from inside', i.e., from int(C). The idea is to minimize f_μ(x) := f(x) + μF(x) on int(C) with a 'barrier function' F : int(C) → R satisfying F(x) → ∞ as x → ∂C and f_μ → f on int(C) as μ ↓ 0. Other requirements for F are analogous to the above. One popular example is the 'logarithmic barrier' F(x) = ∑_i log(1/(−g_i(x))). Barrier function methods form an important class of the interior point methods for convex optimization. Here one typically minimizes t f(·) + F(·) with t > 0, which is equivalent to the above. Given the known fast convergence rate of the Newton method near a minimum, a Newton scheme is typically used for quickly finding an approximate minimum, followed by an update of t. The choices of how far the Newton scheme is to be run and how t is incremented are cleverly made so as to get good theoretical guarantees.
4. Primal-dual methods: These schemes are based on the Lagrange multiplier theorem, the idea being to find the saddle point in (x, λ) of f(x) + λ^T g(x). For example, consider the 'gradient descent-ascent'
x(n + 1) = x(n) − a(n)( ∇f(x(n)) + ∑_i λ_i(n)∇g_i(x(n)) ),
λ(n + 1) = [λ(n) + a(n)g(x(n))]⁺,

where [·]⁺ applies componentwise. The differential equation counterpart is

ẋ(t) = −∇(f + λ(t)^T g)(x(t)),  λ̇(t) = g(x(t)),

constrained to remain in R^d × (R⁺)^M. For strictly convex f, g, let (x*, λ*) be the (unique) saddle point. Then ‖x − x*‖² + ‖λ − λ*‖² serves as a Liapunov function. In general, this applies only locally. The discrete algorithm behaves similarly for small {a(n)} in the sense that it will converge to a small neighbourhood of the saddle point.
5. Projected gradient: The projection on the hyperplane {x : y + Ax = θ} is given by Π(x) := (I − A^T(AA^T)^{−1}A)x. Suppose h(x) = θ denotes the active constraints at the nth iterate. Then {y : h(x) + Dh(x)(y − x) = θ} is the tangent plane at x and the scheme is
x(n + 1) = Π( x(n) − a(n)(I − Dh(x(n))^T(Dh(x(n))Dh(x(n))^T)^{−1}Dh(x(n)))∇f(x(n)) ),

where Π is an operation to pull the iterate back to the feasible region. This is given by: Π(y) = y + Dh(x(n))^T α such that h(Π(y)) = θ. This α can be found by Newton's method for solving nonlinear equations. One needs to choose a(n) small enough so that this is possible.
6. Reduced gradient: This belongs to a class of algorithms collectively known as 'feasible direction methods' wherein at each step one seeks a direction that leads to a solution a priori ascertained to be feasible. This is in contrast with projected gradient, where one first follows a step as in unconstrained optimization and then projects the outcome to the feasible set. Consider linear constraints Ax = b where A = [B : C] for B square and non-singular. Writing x = (y, z) to conform with this, we can treat z as independent variables and y = B^{−1}(b − Cz) as dependent variables. Then we can consider the reduced gradient

∇_z f(y, z) − ∇_y f(y, z)B^{−1}C.

This suggests the following scheme. Write x(n) = (y(n), z(n)) where y(n), z(n) are, resp., dependent and independent variables. Then, with D_r(···) denoting
the Jacobian matrix w.r.t. the variable r and h(x) = θ the active constraints at the nth iterate,

z(n + 1) = z(n) − a(n)( ∇_z f(y(n), z(n)) − ∇_y f(y(n), z(n)) D_y h(y(n), z(n))^{−1} D_z h(y(n), z(n)) ),
y(n + 1) = y(n) − D_y h(y(n), z(n))^{−1} D_z h(y(n), z(n)) (z(n + 1) − z(n)).

As before, we pull back to a feasible point using a Newton iteration and choose a(n) appropriately to facilitate this. This pullback is due to discretization errors and is not intrinsic to the algorithm itself, unlike in the gradient projection algorithm.
7. Conditional gradient or Frank-Wolfe method: This is another popular feasible directions method (a numerical sketch is given at the end of this section). In this, for the problem of minimizing f subject to a constraint set C, one finds at the (n + 1)st iterate the direction d(n + 1) maximally aligned with −∇f(x(n)) subject to the constraints, i.e.,

d(n + 1) := argmax_{z∈C} ⟨−∇f(x(n)), z⟩.

The iterate then is x(n + 1) = (1 − a(n))x(n) + a(n)d(n + 1), where the stepsize {a(n)} satisfies ∑_n a(n) = ∞. For convex and twice continuously differentiable f, let x* be a minimizer of f on C. Assume that C is a closed and bounded convex set. Then x(n) ∈ C ⟹ x(n + 1) ∈ C. Also, we have f(x(n + 1)) − f(x(n)) = ⟨∇f(x(n)), x(n + 1) − x(n)⟩ + ζ(n), where |ζ(n)| ≤ Ka(n)² for a suitable constant K > 0. Hence

f(x(n + 1)) − f(x*) ≤ f(x(n)) − f(x*) + ⟨∇f(x(n)), x(n + 1) − x(n)⟩ + Ka(n)²
= f(x(n)) − f(x*) + a(n)⟨∇f(x(n)), d(n + 1) − x(n)⟩ + Ka(n)²
= f(x(n)) − f(x*) + a(n) min_{y∈C} ⟨∇f(x(n)), y − x(n)⟩ + Ka(n)²
= f(x(n)) − f(x*) − a(n) max_{y∈C} ⟨∇f(x(n)), x(n) − y⟩ + Ka(n)²
≤ f(x(n)) − f(x*) − a(n)⟨∇f(x(n)), x(n) − x*⟩ + Ka(n)²
≤ f(x(n)) − f(x*) − a(n)(f(x(n)) − f(x*)) + Ka(n)²  (by convexity of f)
= (1 − a(n))(f(x(n)) − f(x*)) + Ka(n)².

Iterating and using the inequality 1 − x ≤ e^{−x},
f(x(n + 1)) − f(x*) ≤ ∏_{m=0}^{n} (1 − a(m)) (f(x(0)) − f(x*)) + K ∑_{k=1}^{n} ∏_{m=k}^{n} (1 − a(m + 1)) a(k)²

≤ e^{−∑_{m=0}^{n} a(m)} (f(x(0)) − f(x*)) + K ∑_{k=1}^{n} e^{−∑_{m=k}^{n−1} a(m + 1)} a(k)² → 0
as n ↑ ∞ (check this). This scheme has the advantage of avoiding projections and similar operations. It has issues such as zigzagging of iterates. Variants to improve the convergence behavior have been proposed, see, e.g., [24] for a recent contribution and [10] for a comprehensive survey.
8. Cutting plane method: Consider convex f, g. Cutting plane methods approximate the problem by a succession of linear programs. We describe here a standard variant due to Kelley. It suffices to consider a linear objective function of the form f(x) = c^T x, since we can equivalently consider the problem of minimizing r ∈ R subject to f(x) − r ≤ 0, g(x) ≤ θ. Suppose the minimum is attained. Start with an initial polytope P_1 containing C. The scheme then is as follows: at step n, given a polytope P_n containing C, do the following.
a. Find w(n) := argmin_{x∈P_n} c^T x. If g(w(n)) ≤ θ, stop. If not, go to step (b).
b. Let i*(n) := argmax_i g_i(w(n)). Then g_{i*(n)}(w(n)) > 0. Set

P_{n+1} = P_n ∩ {x : g_{i*(n)}(w(n)) + ∇g_{i*(n)}(w(n))(x − w(n)) ≤ 0}.

Repeat till convergence. For x = w(n), the LHS of the inequality above equals g_{i*(n)}(w(n)) > 0, hence w(n) ∉ P_{n+1}. Also, for x ∈ C, 0 ≥ g_{i*(n)}(x) ≥ g_{i*(n)}(w(n)) + ∇g_{i*(n)}(w(n))(x − w(n)) by the convexity of g_{i*(n)}(·), implying C ⊂ P_{n+1}. Suppose w(n(k)) → w* where n(k) are such that i*(n(k)) ≡ î (say). Then ∇g_î(w(n(k))) → ∇g_î(w*), implying sup_k ‖∇g_î(w(n(k)))‖ < ∞. Then
g_î(w(n(k + m))) ≤ −∇g_î(w(n(k)))(w(n(k + m)) − w(n(k)))
⟹ g_î(w(n(k + m))) ≤ ‖∇g_î(w(n(k)))‖ ‖w(n(k + m)) − w(n(k))‖
⟹ g_î(w*) ≤ 0 (on letting k ↑ ∞) ⟹ g_i(w*) ≤ 0 ∀i.
Thus w∗ is feasible. Optimality follows easily by a limiting argument. The key idea here is to approximate the original problem by linear programs ‘from the outside’, i.e., by minimizing the (linear) objective function successively over a nested sequence of convex polytopes that contain the feasible set and shrink to it in the limit. In fact even when the feasible set is not convex (but is closed and bounded), the linearity of the objective function ensures an optimum at an extreme point of its convex hull, which is what gets approximated by the shrinking sequence of polytopes. See [5] for an extensive treatise on algorithms for constrained optimization.
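The following is the numerical sketch of the conditional gradient (Frank-Wolfe) method promised in item 7 above. It is not from the text: the feasible set is taken to be the probability simplex, for which the linear minimization step has a closed form (put all mass on the coordinate with the smallest partial derivative), and the objective, the stepsize a(n) = 2/(n + 2) (a common choice satisfying ∑_n a(n) = ∞), and the iteration count are illustrative.

```python
import numpy as np

def frank_wolfe(grad, x0, n_iter=2000):
    """Conditional gradient over the probability simplex C."""
    x = np.asarray(x0, dtype=float)
    for n in range(n_iter):
        g = grad(x)
        # Linear minimization oracle: over the simplex, argmin_y <g, y>
        # is attained at the vertex e_i with the smallest g_i.
        d = np.zeros_like(x)
        d[np.argmin(g)] = 1.0
        a = 2.0 / (n + 2)                 # stepsize; x stays in C by convexity
        x = (1 - a) * x + a * d
    return x

# Illustrative instance: minimize 0.5*||x - p||^2 over the simplex.
p = np.array([0.2, -0.1, 0.6])
grad = lambda x: x - p
print(frank_wolfe(grad, np.ones(3) / 3))  # approx. projection of p onto C
```

Note that, exactly as in the convergence argument above, each iterate is a convex combination of points of C and hence remains feasible without any projection.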
6.5 Special Topics

The foregoing does not exhaust the rich repertoire of clever algorithms researchers have built over the years. Here we collect some representatives of useful tweaks as well as some special algorithms that do not fit the broad categories above.
1. Two timescale schemes: This is the coupled iteration

x(k + 1) = x(k) + a(k)h(x(k), y(k)),
y(k + 1) = y(k) + b(k)g(x(k), y(k)),

with h, g Lipschitz and a(k), b(k) > 0 satisfying: a(k), b(k) → 0, ∑_k a(k) = ∑_k b(k) = ∞, and b(k) = o(a(k)). The latter condition ensures that the second iteration moves on a slower timescale and is approximately constant (to be precise, very slowly varying) on the timescale of the first iteration. Thus the first iteration tracks the o.d.e. ẋ(t) = h(x(t), y) for y ≈ y(t). Suppose this x(t) converges to a unique λ(y) where λ(·) is Lipschitz. This implies that x(k) − λ(y(k)) → 0 asymptotically. Hence the second iteration tracks the o.d.e. ẏ(t) = g(λ(y(t)), y(t)).
Suppose this converges to some y*. Then the coupled iterates converge to (λ(y*), y*). In many algorithms, including those that follow, this can be used to replace a subroutine by a concurrent iteration, albeit on a faster timescale.
2. Convex-concave procedure: This is also called 'Difference of Convex' (DC) programming. Let f : R^d → R be twice continuously differentiable with the eigenvalues of ∇²f(x) bounded in absolute value by some K < ∞ for all x. Then f(x) = (f(x) + K‖x‖²) + (−K‖x‖²), i.e., it can be written as a sum of a convex and a concave function. Consider the problem of minimizing f written as h + g where h, g are continuously differentiable and h, −g are convex. The convex-concave procedure [46] does the iteration: solve ∇h(x(t + 1)) = −∇g(x(t)) for x(t + 1) given x(t). Some useful facts that justify the scheme are as follows.
a. By convexity, h(x(t)) ≥ h(x(t + 1)) + ∇h(x(t + 1))·(x(t) − x(t + 1)), so that

h(x(t + 1)) ≤ h(x(t)) − ∇h(x(t + 1))·(x(t) − x(t + 1))
= h(x(t)) + ∇g(x(t))·(x(t) − x(t + 1))
= h(x(t)) − ∇g(x(t))·(x(t + 1) − x(t)).

Likewise, by concavity, g(x(t + 1)) ≤ g(x(t)) + ∇g(x(t))·(x(t + 1) − x(t)). Adding the two, we get f(x(t + 1)) ≤ f(x(t)). Note that this is an implicit scheme as in (6.2).
b. Consider minimizing

F(x) = h(x) + ∑_{i=1}^{d} x_i (∂g/∂x_i)(x(t))

on the given constraint set. The minimizer z satisfies ∇h(z) = −∇g(x(t)). Set x(t + 1) = z. Note that minimizing F is equivalent to minimizing h + the linearization of g at x(t). See [30] for some recent extensions.
3. Homotopy method: Find F : R^d × [0, 1] → R such that F(·, 0) = a prescribed f_0(·) : R^d → R with
a unique minimum and F(·, 1) = f (·). Find the minimizer x0 of f 0 and then track the solution x ∗ (t), t ∈ [0, 1] of ∇ x F(x, t) = θ starting with initial condition x ∗ (0) = x0 , as t → 1. Based on ‘parametric Sard’s theorem’, this ensures a continuous curve to a local critical point of f . See [18] for an introduction. 4. Proximal methods: Recall from Chap. 4 the ‘prox’ operator defined by
prox_{λf}(u) := argmin_x { f(x) + (1/(2λ))‖x − u‖² }.
The proximal algorithm then is
x_{n+1} = prox_{a(n)f}(x_n) := argmin_x { f(x) + (1/(2a(n)))‖x − x_n‖² }.
(6.29)
This favours points close to xn , effectively making it an incremental subgradient descent as the following interpretation shows: The minimum condition yields
θ ∈ ∂( f(·) + (1/(2a(n)))‖· − x_n‖² )(x_{n+1}) = ∂f(x_{n+1}) + (1/a(n))(x_{n+1} − x_n),
i.e., x_{n+1} ∈ x_n − a(n)∂f(x_{n+1}). One requires ∑_n a(n) = ∞. If the subgradient is in fact a gradient, the above corresponds to the 'backward Euler method' for solving the differential equation ẋ(t) = −∇f(x(t)). Note that, expressed in this manner, it appears as an implicit scheme in the sense of (6.2). It is easy to see that a fixed point of prox_{af}(·) will be a minimizer of f.¹ Thus if the iteration converges, it will converge to a minimizer. Another viewpoint is as follows. Let
f_λ(u) := inf_x { f(x) + (1/(2λ))‖x − u‖² }.
By the definition of prox_{λf}(x),

f_λ(x) = f(prox_{λf}(x)) + (1/(2λ))‖x − prox_{λf}(x)‖².
By Danskin's theorem, for x = x_n and λ = a(n), the foregoing leads to
¹ 'Generically', to be precise, ruling out other critical points using the same arguments that we used for gradient descent.
∇f_{a(n)}(x_n) = (1/a(n))(x_n − prox_{a(n)f}(x_n)) = (1/a(n))(x_n − x_{n+1}),
i.e., x_{n+1} = x_n − a(n)∇f_{a(n)}(x_n). This exhibits the algorithm as a gradient descent, but for a different function derived from f. This formulation is not implicit. The proximal gradient method deals with the minimization of functions of the form x ∈ R^d → f(x) + g(x) where f, g : R^d → R (resp., R ∪ {∞}), f is differentiable, and lim_{‖x‖↑∞} f(x) (resp., lim_{‖x‖↑∞} g(x)) = ∞. The algorithm is given by

x_{n+1} = prox_{a(n)g}( x_n − a(n)∇f(x_n) ).

Since we have allowed +∞ as a value of g, we can take g(x) = 0 for x in a closed convex set C and ∞ otherwise, thereby incorporating a convex constraint x ∈ C. (This is an example where f has much better regularity properties than g, which is a common situation in applications of this scheme.) Explicit expressions are available for several prox maps of practical importance, e.g., the so-called shrinkage operators, soft-max operators, and a variety of projection operators, see [37] for details. This monograph also points out many connections of proximal methods with other schemes and with the mathematics of the 'inf-convolution' operator f □ g := inf_x ( f(x) + g(y − x) ).
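As an illustration of the preceding paragraph (not from the text), the following sketch applies the proximal gradient method to f(x) = ½‖Ax − b‖² and g(x) = κ‖x‖₁, for which prox_{λg} is exactly the soft-thresholding ('shrinkage') operator mentioned above. The data A, b and the parameter κ are illustrative, and the stepsize 1/‖A‖² is a standard conservative choice.

```python
import numpy as np

def soft_threshold(v, t):
    """prox of t*||.||_1: componentwise shrinkage towards zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient(A, b, kappa, n_iter=500):
    """Minimize 0.5*||Ax - b||^2 + kappa*||x||_1."""
    x = np.zeros(A.shape[1])
    a = 1.0 / np.linalg.norm(A, 2) ** 2   # stepsize 1/L, L a Lipschitz constant
    for _ in range(n_iter):
        grad_f = A.T @ (A @ x - b)        # gradient of the smooth part f
        x = soft_threshold(x - a * grad_f, a * kappa)
    return x

# Illustrative data with a sparse ground truth:
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
x_true = np.array([1.0, 0.0, 0.0, -2.0, 0.0])
b = A @ x_true
print(proximal_gradient(A, b, kappa=0.1))  # approximately sparse, near x_true
```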
The proximal method described above will face the usual issues with 'pinched landscapes' that we encountered with gradient descent, and the fix is similar to what we did in passing over to the Newton scheme. Thus we modify the definition of the prox operator above by replacing the term ½‖x − y‖² therein by a more general discrepancy measure D(x; y). One popular choice is the Bregman divergence

f(y) − f(x) − ⟨∇f(x), y − x⟩ ≈ ½(y − x)^T ∇²f(x)(y − x).
This leads to the popular mirror descent algorithm [34].
5. Separable problems: Consider minimizing ∑_{i=1}^n f_i(x_i) subject to ∑_i g_i(x_i) ≤ θ. With the Lagrange multiplier λ known, this splits into separate problems: minimize f_i(x_i) + λ^T g_i(x_i) for each i. This suggests the dual ascent scheme (a numerical sketch is given at the end of this section)

x_i*(k) = argmin( f_i(·) + λ(k)^T g_i(·) ) ∀i,
λ(k + 1) = λ(k) + a(k) ∑_i g_i(x_i*(k)).
This observation goes back to [4] in the context of social welfare economics and has been used extensively in optimization and control. Consider linear constraints: Minimize ∑_i f_i(x_i) subject to Ax = b. This is equivalent to minimizing ∑_i f_i(x_i) + (ρ/2)‖Ax − b‖² subject to Ax = b, and the Lagrangian is

∑_i f_i(x_i) + λ^T(Ax − b) + (ρ/2)‖Ax − b‖².

This is called the Augmented Lagrangian and has some computational advantage due to the strictly convex penalty term. The problem is not separable, but the primal minimization can be done one variable at a time. This is the Alternating Directions Method of Multipliers (ADMM). Problems of minimizing ∑_i f_i(z) can be cast into separable form by minimizing ∑_i f_i(x_i) with the additional constraints x_i = x_1 ∀ i ≠ 1.
6. Majorization-minimization (MM) and trust region algorithms: These are based on locally dominating the objective function by a convenient convex (e.g., quadratic) function near the current iterate and then minimizing the latter to get the next iterate. This is particularly popular in certain statistical applications, see [22]. A somewhat related philosophy is that of trust region methods [45]. These use a local approximation in a pre-specified 'trust region' and take a minimization step for this approximation according to a pre-specified stepsize. Depending on the net decrease in the objective function, the trust region is updated for the next iterate.
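Here is the sketch of the dual ascent scheme of item 5 referred to above. It is not from the text: the separable objective ∑_i ½(x_i − c_i)² with the single coupling constraint ∑_i x_i ≤ B is an illustrative toy problem for which the inner minimization has the closed form x_i*(k) = c_i − λ(k), and the stepsize a(k) = 1/k is one admissible choice.

```python
import numpy as np

c = np.array([3.0, 1.0, 2.0])   # illustrative targets c_i
B = 4.0                          # illustrative budget: sum(x) <= B

lam = 0.0
x = c.copy()
for k in range(1, 5001):
    x = c - lam                              # x_i*(k): closed-form inner argmin
    a = 1.0 / k                              # stepsize with sum a(k) = infinity
    lam = max(lam + a * (x.sum() - B), 0.0)  # projected dual ascent on lambda

print(x, lam)   # since sum(c) > B, the constraint is active and lam > 0
```

The point of the decomposition is that the x-update splits across coordinates (here, across the components of c), so it could equally well be carried out by independent agents who only share the common multiplier λ.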
6.6 Other Directions

1. Advanced themes in algorithms: With large-scale optimization, several other research areas open up, such as decomposition techniques (the Dantzig-Wolfe decomposition in linear programming being one famous example) and low dimensional approximations. Many other research issues in algorithms are thrown up by large-scale optimization, such as parallel or distributed algorithms. The latter involve computation over many communicating processors which exchange information, and come with their own baggage of problems such as asynchrony (processors work on different clocks), stragglers (some processors are much slower than others), communication constraints (processors communicate over channels with limited capacity, leading to limitations on the rate at which information can be transmitted, delays, packet losses, etc.), and so on. Each of these is a fertile research area.
2. Other classes of algorithms: Our focus so far has been entirely on classical 'analytical' algorithms that use analytical information about the function being optimized, such as its derivatives.
There are many 'derivative-free' algorithms such as deterministic or random search methods. An important class of problems that has been the focus of much research activity is that of biologically inspired algorithms. One important subclass is that of evolutionary algorithms, which borrow paradigms from the theory of evolution, the most famous example being the genetic algorithm. Other biologically motivated algorithms are the ant colony algorithm, the particle swarm algorithm, and so on. A general feature of many of these is that they can be viewed as a large family of simultaneously running dynamics (or algorithms) with interaction that reinforces in some way the desired behavior. We have mentioned earlier in this book that the classical analytic schemes are local and deterministic and therefore get stuck in local optima. The 'population algorithms' mentioned above give up on locality and sometimes on determinism as well, incorporating randomness. See [15, 16, 20] for a few representative strands in this line of research. The famous simulated annealing algorithm is inspired by statistical physics rather than biology and adds slowly decreasing noise to gradient descent so that it asymptotically concentrates on global minima in a probabilistic sense. The noise ensures occasional 'uphill' moves that ensure eventual escape from non-optimal local minima. The convergence claim is 'in probability', i.e., the probability of being in a neighborhood of the set of global minima goes to one. This does not ensure 'almost sure' or sample pathwise convergence with probability one. In fact, if the 'valley' near the global minima is shallow, it may not hold: the iterates can make rare excursions away from global minima. The algorithm is slow in practice in its pure, theoretically sound form, and often it is tweaked suitably or used in combination with other algorithms. There are also several other random search schemes with varying degrees of theoretical justification [47]. A lesser known fact is that an elegant probabilistic analysis of genetic algorithms by Cerf [11] shows that the algorithm concentrates on global minima in the 'small noise limit' under suitable conditions. See [15, 20, 39] for some representative texts in this domain. See [29] for an extensive treatment of simulated annealing.
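To fix ideas, here is a minimal sketch, not from the text, of the kind of noisy gradient descent just described: Gaussian noise with a slowly decreasing 'temperature' is added to each gradient step, allowing the occasional uphill moves that let the iterates escape a non-optimal local minimum. The double-well objective, the logarithmic cooling schedule, and all parameters are illustrative choices and carry no theoretical guarantees as written.

```python
import numpy as np

def annealed_gradient_descent(grad, x0, n_iter=20000, seed=0):
    rng = np.random.default_rng(seed)
    x = float(x0)
    a = 0.01                                   # small constant stepsize
    for n in range(1, n_iter + 1):
        T = 1.0 / np.log(n + 1)                # slowly decreasing temperature
        noise = np.sqrt(2.0 * a * T) * rng.standard_normal()
        x += -a * grad(x) + noise              # noisy, 'uphill'-capable step
    return x

# Double well f(x) = x^4 - 3x^2 - x: local minimum near x = -1.15,
# global minimum near x = 1.3.
grad = lambda x: 4 * x**3 - 6 * x - 1
print(annealed_gradient_descent(grad, x0=-1.0))  # typically near the global well
```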
6.7 Exercises

6.1 For the algorithm x(n + 1) = x(n) + a(n)d(n), n ≥ 0, for minimizing a continuously differentiable function f : R^d → R, show that the choice d(n) ∝ −∇f(x(n)) maximizes f(x(n)) − f(x(n + 1)) subject to the constraint ‖d(n)‖ = 1, modulo o(a(n)) terms.
6.2 Supply all the missing details in the analysis of the conjugate gradient method.
6.3 For a twice continuously differentiable f : R^d → R, there is an isolated local minimum at x* with ∇²f(x*) > 0 and no other critical point in O for some
open neighbourhood O of x ∗ . Consider the approximate gradient descent
x_i(n + 1) = x_i(n) − a ( f(x(n) + δe_i) − f(x(n)) ) / δ,
where a > 0, δ > 0 is small, and e_i := the unit vector in the ith coordinate direction. For a > 0 sufficiently small, if x(n_0) ∈ O for some n_0 ≥ 1, show that x(n) → an open ball centred at x* with radius a × O(δ).
6.4 Consider the map x ∈ R^d → f(x) := ½x^T Qx + b^T x + c for some symmetric non-singular Q ∈ R^{d×d}, b ∈ R^d, c ∈ R, where Q is neither positive nor negative definite. Consider the Newton scheme x(n + 1) = x(n) − a(n)∇²f(x(n))^{−1}∇f(x(n)), n ≥ 0. Show that x(n) → a saddle point.
6.5 Consider the minimization of a strictly convex function f : R^d → R over a set C := {x : g_i(x) ≤ 0, 1 ≤ i ≤ m}, where the g_i's are convex and satisfy lim_{‖x‖↑∞} g_i(x) ↑ ∞ ∀i. Suppose that we minimize instead the function f_N(x) := f(x) + N ∑_i (g_i^+(x))² : R^d → R and find its minimum x*_N for N ≥ 1. Show that as N ↑ ∞, x*_N → x* := the minimizer of f on C.
6.6 Write down explicitly the Newton scheme to solve h(y + Dh(x(n))^T α) = θ for α in the projected gradient scheme.
6.7 A non-convex continuous function f : R^d → R with isolated critical points is to be minimized over the unit closed ball B centred at the origin in R^d. We run gradient descent with decreasing stepsizes a(n) > 0 satisfying ∑_n a(n) = ∞ a total of N times with initial conditions chosen independently according to the uniform distribution on B, and take the best answer. Show that the probability of not finding the global minimum by this method decreases exponentially with increasing N.²
6.8 For a continuous f : (x, y) ∈ C × D → R with C ⊂ R^d, D ⊂ R^k closed and bounded, consider the 'alternating minimization' scheme given as follows. Start with (x_0, y_0) ∈ C × D and at step n ≥ 0, set x_{n+1} := a minimizer of f(·, y_n) and y_{n+1} = y_n for even n, and x_{n+1} = x_n and y_{n+1} := a minimizer of f(x_n, ·) for odd n. Show that this will converge to the set of local minima of f.
6.9 Let f : R^d → R be convex and continuously differentiable. The proximal gradient scheme is given by x(n + 1) = prox_g( x(n) − a(n)∇f(x(n)) ), where 0 < a(n) → 0 with ∑_n a(n) = ∞, and g is an extended real-valued function (i.e., with ±∞ as possible values). Let C ⊂ R^d be closed, bounded and convex. Show that for g given by:
² This problem requires familiarity with some basic probability theory. This method is known as 'multi-start'.
g(x) = 0 for x ∈ C, ∞ otherwise,
this scheme is equivalent to the projected gradient scheme.
6.10 Consider the problem of minimizing a continuous function f : R^d → R satisfying lim_{‖x‖↑∞} f(x) = ∞. Suppose for T > 0, C_T := ∫ e^{−f(x)/T} dx < ∞. Define the probability density

p_T(x) := C_T^{−1} e^{−f(x)/T}, x ∈ R^d.
Show that as T ↓ 0, pT (·) concentrates on the set of global minima of f . (This is the theoretical basis of the simulated annealing algorithm.)
References

1. Absil, P.A., Mahony, R., Sepulchre, R.: Optimization Algorithms on Matrix Manifolds. Princeton University Press, Princeton, NJ (2008) 2. Adler, I., Karp, R.M., Shamir, R.: A simplex variant solving an m × d linear program in O(min(m², d²)) expected number of pivot steps. J. Complex. 3(4), 372–387 (1987) 3. Armijo, L.: Minimization of functions having Lipschitz continuous first partial derivatives. Pac. J. Math. 16, 1–3 (1966) 4. Arrow, K.J., Hurwicz, L., Uzawa, H.: Studies in Linear and Non-linear Programming. Stanford Mathematical Studies in the Social Sciences, II. Stanford University Press, Stanford, CA (1958). With contributions by H. B. Chenery, S. M. Johnson, S. Karlin, T. Marschak, R. M. Solow 5. Bertsekas, D.P.: Constrained Optimization and Lagrange Multiplier Methods. Computer Science and Applied Mathematics. Academic Press, Inc. [Harcourt Brace Jovanovich, Publishers], New York (1982) 6. Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA (2016) 7. Bertsimas, D., Tsitsiklis, J.: Introduction to Linear Optimization, 1st edn. Athena Scientific (1997) 8. Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.A.: Numerical Optimization, 2nd edn. Universitext. Springer, Berlin (2006) 9. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 10. Braun, G., Carderera, A., Combettes, C.W., Hassani, H., Karbasi, A., Mokhtari, A., Pokutta, S.: Conditional gradient methods (2022). arXiv:2211.14103 11. Cerf, R.: Asymptotic convergence of genetic algorithms. Adv. Appl. Probab. 30(2), 521–550 (1998) 12. Chakraborty, A., Chandru, V., Rao, M.R.: A linear programming primer: from Fourier to Karmarkar. Ann. Oper. Res. 287(2), 593–616 (2020) 13. Dantzig, G.B., Thapa, M.N.: Linear Programming. 1 Introduction. Springer Series in Operations Research. Springer, New York (1997) 14. Dantzig, G.B., Thapa, M.N.: Linear Programming. 2 Theory and Extensions. Springer Series in Operations Research. Springer, New York (2003) 15. De Jong, K.A.: Evolutionary Computation: A Unified Approach. MIT Press, Cambridge, MA, USA (2016)
16. Dorigo, M., Stützle, T.: Ant Colony Optimization. MIT Press, Cambridge, MA, USA (2004) 17. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley-Interscience, New York (2001) 18. Garcia, C.B., Zangwill, W.J.: Pathways to Solutions, Fixed Points and Equilibria. Prentice-Hall, Englewood Cliffs, NJ (1981) 19. Gill, P.E., Murray, W., Wright, M.H.: Practical Optimization. Society for Industrial and Applied Mathematics, Philadelphia, USA (2019) 20. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., USA (1989) 21. Golub, G.H., Van Loan, C.F.: Matrix Computations, 4th edn. Johns Hopkins University Press (2013) 22. Hunter, D.R., Lange, K.: A tutorial on MM algorithms. Am. Statist. 58(1), 30–37 (2004) 23. Jongen, H.T., Jonker, P., Twilt, F.: Nonlinear Optimization in Finite Dimensions: Morse Theory, Chebyshev Approximation, Transversality, Flows, Parametric Aspects, Nonconvex Optimization and Its Applications, vol. 47. Kluwer Academic Publishers, Dordrecht (2000) 24. Lacoste-Julien, S., Jaggi, M.: On the global linear convergence of Frank-Wolfe optimization variants. In: Advances in Neural Information Processing Systems, vol. 28 (2015) 25. Karloff, H.: Linear Programming. Progress in Theoretical Computer Science. Birkhäuser Boston Inc., Boston, MA (1991) 26. Karmarkar, N.: A new polynomial-time algorithm for linear programming. Combinatorica 4(4), 373–395 (1984) 27. Khachiyan, L.G.: A polynomial algorithm in linear programming. Sov. Math. Dokl. 20, 191–194 (1979) 28. Klee, V., Minty, G.J.: How good is the simplex algorithm? In: Inequalities, III. Proceedings of the Third Symposium, University of California, Los Angeles, CA, 1969; dedicated to the memory of Theodore S. Motzkin, pp. 159–175 (1972) 29. van Laarhoven, P.J.M., Aarts, E.H.L.: Simulated Annealing: Theory and Applications. Mathematics and its Applications, vol. 37. D. Reidel Publishing Co., Dordrecht (1987) 30. Lipp, T., Boyd, S.: Variations and extension of the convex-concave procedure. Optim. Eng. 17(2), 263–287 (2016) 31. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. International Series in Operations Research & Management Science, 4th edn, vol. 228. Springer, Cham (2016) 32. Matoušek, J., Gärtner, B.: Understanding and Using Linear Programming. Springer, Berlin (2007) 33. Mukherjee, S., Wu, Q., Zhou, D.X.: Learning gradients on manifolds. Bernoulli 16(1), 181–207 (2010) 34. Nemirovsky, A.S., Yudin, D.B.A.: Problem Complexity and Method Efficiency in Optimization. Wiley-Interscience Series in Discrete Mathematics. Wiley, New York (1983) 35. Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate O(1/k²). Sov. Math. Dokl. 27(2), 372–376 (1983) 36. Nesterov, Y.E.: Lectures on Convex Optimization. Springer Optimization and Its Applications, vol. 137. Springer, Cham (2018). Second edition of [MR2142598] 37. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014) 38. Polyak, B.T.: Newton's method and its use in optimization. Eur. J. Oper. Res. 181, 1086–1096 (2007) 39. Simon, D.: Evolutionary Optimization Algorithms. Wiley, Hoboken, NJ (2013) 40. Smale, S.: A convergent process of price adjustment and global Newton methods. J. Math. Econ. 3(2), 107–120 (1976) 41. Spielman, D.A., Teng, S.H.: Smoothed analysis of algorithms: why the simplex algorithm usually takes polynomial time. J. ACM 51(3), 385–463 (2004) 42. Su, W., Boyd, S., Candès, E.J.: A differential equation for modeling Nesterov's accelerated gradient method: theory and insights. J. Mach. Learn. Res. 17(1), 5312–5354 (2016) 43. Vershynin, R.: Beyond Hirsch conjecture: walks on random polytopes and smoothed complexity of the simplex method. SIAM J. Comput. 39(2), 646–678 (2009)
44. Wibisono, A., Wilson, A.C., Jordan, M.I.: A variational perspective on accelerated methods in optimization. Proc. Natl. Acad. Sci. U.S.A. 113(47), E7351–E7358 (2016) 45. Yuan, Y.X.: A review of trust region algorithms for optimization. In: Proceedings of the Fourth International Congress on Industrial and Applied Mathematics, vol. 99, pp. 271–282 (2000) 46. Yuille, A., Rangarajan, A.: The convex-concave procedure (CCCP). In: Advances in Neural Information Processing Systems, vol. 14 (2001) 47. Zhigljavsky, A.A.: The Theory of Global Random Search. Springer, Berlin (1991)
Chapter 7
Epilogue
7.1 What Lies Beyond

Here we try to sketch briefly many challenging directions one can take from here. These are, however, only 'teasers', without any significant detail.
1. Infinite-dimensional optimization: Convex analysis and optimization get harder and more interesting in infinite dimensions, because many things true in finite dimensions stop being so, among them continuity of convex functions on the interior of their domains, existence of a minimum of a continuous function on a closed and bounded set, etc. The problem is not purely academic. We shall mention two important sources of infinite-dimensional optimization. The first is control theory, or optimization over possible trajectories of a controlled dynamics so as to minimize a suitable performance measure. One example is to drive a mechanical system from one configuration to another with least expenditure of energy; another is to drive a projectile from one position to another in minimum time. These become optimization problems on spaces of trajectories or function spaces. Another major source of such problems comes from the variational problems in mechanics and other branches of physics. For example, solutions of partial differential equations arising in potential theory can be cast as minimizers (or more generally, critical points) of suitable 'energy functions'. The numerical schemes for their solution often go through finite dimensional approximations and call upon numerical algorithms we have seen in this book. See, e.g., [15, 23].
2. High dimensional optimization: Optimization problems in high but not infinite dimensions are also distinctive enough for a variety of reasons. These have become immensely important in recent years due to increased use of high dimensional data. One important manner in which they differ from the traditional problems is that many classical algorithms do not scale well with dimension. Thus, for example, simple steepest descent and its momentum variants are more popular than conjugate gradient or
quasi-Newton methods which would have been preferred for moderate dimensions. Another important aspect is that with high dimension, new structural regularities emerge, as also new considerations. For example, most of the volume of a high dimensional convex body is near its boundary. Some 'friendly' structure also emerges in high dimension, e.g., Dvoretzky's theorem, which says that low dimensional slices of a centre-symmetric convex body in high dimensions are approximately elliptical in a precise sense, or the Johnson-Lindenstrauss lemma, which says that projections of a collection of points in a high dimensional space to a random low dimensional subspace approximately preserve pairwise distances. Such facts can be and have been put to good algorithmic use.
3. Discrete optimization: This is another facet of optimization which we have given a complete miss, and for good reason. It deserves a separate treatment both because the flavour is different and because the body of results is humongous to the extent that it has an identity as an essentially distinct field. This is not to say that there is no overlap, in fact it generously borrows from continuous optimization, e.g., by considering continuous relaxations of discrete problems. On the other hand, some notions such as submodularity, whose 'home base' is discrete optimization, find applications in continuous optimization. The interplay is quite rich and is in fact unavoidable in problems that simultaneously involve both, such as mixed integer programming. See, e.g., [21, 27, 31] for combinatorial optimization in general and [13, 32] for integer programming.
4. Multiobjective and multiagent problems: These are by now classical areas that build upon basic optimization theory. The latter in particular leads to team and game theories, depending on whether the objective is common to all agents or not. There is an ocean of exciting work in these which is also of immense practical importance in areas ranging from economic decisions to communication networks.
7.2 Bibliographical Note

This short postscript is intended to give pointers to sources where a lot more can be found. The recommendations are loosely grouped thematically, though some straddle more than one category.
1. Books for non-specialists: Convex analysis and optimization are not blessed with much effort towards popularizing them among readers who are non-specialists, but there is at least one very good book [34]. For beginning students or non-specialists seeking a quick overview, the lovely little book [36] walks the reader from basic optimization theory to optimal control in under two hundred pages to give a bird's eye view of this entire arena.
2. Convex analysis: A great book for beginners is [35]. The classic [30] is a more or less complete coverage of convex analysis in finite dimensions. See [17, 18] for more advanced treatments. 3. Convex optimization: For convex optimization, again [35] is very good for beginners. Some standard texts are [2, 9, 25]. A distinctive and unique treatment appears in [6]. The book [4] contains some advanced material. For an extremely general treatment of optimality conditions for convex optimization without differentiability conditions and much else, see [14]. 4. Algorithms, general texts: Two excellent texts for algorithms are [5, 7]. Some introductory texts are [1, 8, 12, 16, 22, 26]. 5. Algorithms, special classes: There are several texts devoted to one or a few related theme(s) in algorithms. A sampler is as follows: a. For distributed optimization, see the classic [3]. The survey [24] covers some recent developments. b. Interior point methods, an important class of algorithms for convex optimization, saw a major surge in interest after Karmarkar's discovery of a polynomial time interior point method for linear programming. See [9, 29] for a theoretical introduction. c. There are many recent surveys on special classes of algorithms that have come into prominence because of applications to machine learning. Some are: proximal algorithms [7, 28], Alternating Direction Method of Multipliers (ADMM) [10], special algorithms such as alternating minimization for special classes of non-convex problems [20], and online optimization where data arrives one at a time sequentially [19, 33]. See also [11] for a compact introduction to convex optimization for machine learning researchers and [37] for a comprehensive text on optimization for machine learning.
References 1. Bazaraa, M.S., Sherali, H.D., Shetty, C.M.: Nonlinear Programming: Theory and Algorithms, 3rd edn. Wiley-Interscience, Hoboken, NJ (2006) 2. Ben-Tal, A., Nemirovski, A.: Lectures on Modern Convex Optimization. MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA (2001) 3. Bertsekas, D.P., Tsitsiklis, J.N.: Parallel and Distributed Computation: Numerical Methods. Athena Scientific, Belmont, MA (2014). Originally published by Prentice-Hall, Inc. in 1989. Includes corrections (1997) 4. Bertsekas, D.P.: Convex Analysis and Optimization. Athena Scientific, Belmont, MA (2003). With Angelia Nedi´c and Asuman E. Ozdaglar 5. Bertsekas, D.P.: Nonlinear Programming, 3rd edn. Athena Scientific Optimization and Computation Series. Athena Scientific, Belmont, MA (2016) 6. Bertsekas, D.P.: Convex Optimization Theory. Athena Scientific, Nashua, NH (2009) 7. Bertsekas, D.P.: Convex Optimization Algorithms. Athena Scientific, Belmont, MA (2015)
8. Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.A.: Numerical Optimization, 2nd edn. Universitext. Springer, Berlin (2006) 9. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004) 10. Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011) 11. Bubeck, S.: Convex optimization: algorithms and complexity. Found. Trends Mach. Learn. 8(3–4), 231–357 (2015) 12. Chong, E.K., Zak, S.H.: An Introduction to Optimization, 4th edn. Wiley, Hoboken, NJ (2013) 13. Conforti, M., Cornuéjols, G., Zambelli, G.: Integer Programming. Graduate Texts in Mathematics, vol. 271. Springer, Cham (2014) 14. Dhara, A., Dutta, J.: Optimality Conditions in Convex Optimization. CRC Press, Boca Raton, FL (2012) 15. Ekeland, I., Témam, R.: Convex Analysis and Variational Problems. Classics in Applied Mathematics, English edn, vol. 28. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA (1999). Translated from the French 16. Fletcher, R.: Practical Methods of Optimization, 2nd edn. Wiley-Interscience, New York (2001) 17. Gruber, P.M.: Convex and Discrete Geometry. Grundlehren der mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 336. Springer, Berlin (2007) 18. Güler, O.: Foundations of Optimization. Graduate Texts in Mathematics, vol. 258. Springer, New York (2010) 19. Hazan, E.: Introduction to online convex optimization. Found. Trends Optim. 2(3–4), 157–325 (2016) 20. Jain, P., Kar, P.: Non-convex optimization for machine learning. Found. Trends Mach. Learn. 10(3–4), 142–336 (2017) 21. Korte, B., Vygen, J.: Combinatorial Optimization: Theory and Algorithms, 6th edn. Algorithms and Combinatorics, vol. 21. Springer, Berlin (2018) 22. Luenberger, D.G., Ye, Y.: Linear and Nonlinear Programming. International Series in Operations Research & Management Science, 4th edn, vol. 228. Springer, Cham (2016) 23. Luenberger, D.G.: Optimization by Vector Space Methods. Wiley, New York (1969) 24. Nedi´c, A.: Convergence rate of distributed averaging dynamics and optimization in networks. Found. Trends Syst. Control 2(1), 1–100 (2015) 25. Nesterov, Y.E.: Lectures on Convex Optimization. Springer Optimization and Its Applications, vol. 137. Springer, Cham (2018). Second edition of [MR2142598] 26. Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. Springer Series in Operations Research and Financial Engineering. Springer, New York (2006) 27. Papadimitriou, C.H., Steiglitz, K.: Combinatorial Optimization: Algorithms and Complexity. Dover Publications, Inc., Mineola, NY (1998). Corrected reprint of the 1982 original 28. Parikh, N., Boyd, S.: Proximal algorithms. Found. Trends Optim. 1(3), 127–239 (2014) 29. Renegar, J.: A Mathematical View of Interior-Point Methods in Convex Optimization. MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA (2001) 30. Rockafellar, R.T.: Convex Analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ (1997). Reprint of the 1970 original, Princeton Paperbacks 31. Schrijver, A.: Combinatorial Optimization: Polyhedra and Efficiency, vol. A-C. Algorithms and Combinatorics. Springer, Berlin (2003) 32. Schrijver, A.: Theory of Linear and Integer Programming. Wiley-Interscience Series in Discrete Mathematics. 
Wiley, Chichester (1986) 33. Shalev-Shwartz, S.: Online learning and online convex optimization. Found. Trends Mach. Learn. 4(2), 107–194 (2012) 34. Tikhomirov, V.M.: Stories About Maxima and Minima. Mathematical World, vol. 1. American Mathematical Society, Providence, RI; Mathematical Association of America, Washington, DC (1990). Translated from the 1986 Russian original by Abe Shenitzer
35. van Tiel, J.: Convex Analysis: An Introductory Text. Wiley, New York (1984) 36. Varaiya, P.P.: Notes on Optimization. Van Nostrand Reinhold (1972) 37. Wright, S.J., Recht, B.: Optimization for Data Analysis. Cambridge University Press, Cambridge, UK (2022)
Index
A Alternating Directions Method of Multipliers (ADMM), 125 Approximate gradient, 107 Arrow-Barankin-Blackwell theorem, 89 Arzela-Ascoli theorem, 18 Augmented Lagrangian, 125
B Banach contraction mapping theorem, 17 Barrier functions, 117 Birkhoff-von Neumann theorem, 60 Bolzano-Weierstrass theorem, 6 Boundary, 4 Brouwer fixed point theorem, 55
C Caratheodory’s theorem, 51 Cauchy sequence, 7 Closed ball, 4 Closed set, 3 Coercive, 8 Conditional gradient method, 119 Conjugate concave function, 83 Conjugate directions, 109 Conjugate gradient algorithm, 110 Continuity, 4 uniform, 5 Contraction, 16 Convex-concave procedure, 122 Convex conjugate function, 80 Convex function, 61 Convex set, 39 Cutting plane method, 120
D Danskin's theorem, 30 Dense set, 3 Derivative directional, 23 Fréchet, 24 Gâteaux, 23 Difference of Convex (DC) programming, 122 Dubins theorem, 52 E Epigraph, 64 Euler's theorem, 37 F Fenchel duality theorem, 83 Fenchel-Young inequality, 81 Fermat's theorem, 26 Fixed point, 17 Frank-Wolfe method, 119 Fritz John condition, 29 G Gradient, 25 Gradient descent, 106 Greatest lower bound (g.l.b.), 5 H Half spaces, 43 Helly's theorem, 54 Hessian, 25 Homotopy method, 122
Hyperplane, 43 support, 46 Hypograph, 83
I Inf-convolution, 99 Infimum, 5
J Jacobian matrix, 25 Jensen's inequality, 77
K Karush-Kuhn-Tucker (KKT) conditions, 27 Krein-Milman theorem, 50
L Lagrange multiplier rule, 87 Lagrangian, 88 Least upper bound (l.u.b.), 5 Legendre transform, 80 Limit point, 7 Linear programming, 92 algorithms, 116 strong duality, 93 weak duality, 93 Line search, 102 Armijo, 104 binary, 103 Fibonacci, 103 golden section, 103 Lipschitz function, 67 locally, 67 Lower semicontinuous (l.s.c.), 9
M Majorization-Minimization (MM) method, 125 Maximum, 6 global, 6 local, 6 Minimum, 6 global, 6 local, 6 Minimum distance problem, 41 Min-max theorem, 95 Minty's surjectivity theorem, 75 Mirror descent, 124 Momentum methods, 108 Monotone sequence, 6 Moreau envelope, 75 Müntz-Szász theorem, 15
N Nash equilibrium, 97 existence, 97 Newton method, 111
O Open ball, 2 Open set, 2 Optimization algorithm, 101 deterministic, 102 implicit, 101 incremental, 102 local, 102
P Parametric monotonicity, 32 Penalty function method, 117 Point of inflection, 26 Primal-dual methods, 117 Projected gradient method, 118 Projection, 41 Proximal algorithm, 123 proximal gradient method, 124 Prox operator, 123
Q Quasi-Newton methods, 113 BFGS, 114 Broyden family, 114 Davidon-Fletcher-Powell, 114
R Radon’s theorem, 54 Reduced gradient methods, 118
S Saddle point, 26 Saddle point property, 89 Second order cone programming, 117 Semidefinite programming, 116 Separable problems, 124 Separation theorem, 43 Shapley-Folkman theorem, 53 Steepest descent, 106
Stepsize constant, 104 decreasing, 105 Subgradient, 70 Submodular function, 31 Supermodular function, 31 Supremum, 5
T Tietze extension theorem, 16
Trust region methods, 125 Two timescale iterations, 121
U Upper semicontinuous (u.s.c.), 9
W Weierstrass approximation theorem, 15 Weierstrass theorem, 8