ISNM International Series of Numerical Mathematics Volume 170 Managing Editors G. Leugering, Erlangen, Germany M. Hintermueller, Berlin, Germany Associate Editors V. N. Starovoitov, Novosibirsk, Russia N. Kenmochi, Kita-ku, Kyoto, Japan R. W. Hoppe, Houston, USA Z. Chen, Beijing, China Honorary Editor K.-H. Hoffmann, Garching, Germany
More information about this series at http://www.springer.com/series/4819
Seyedehsomayeh Hosseini • Boris S. Mordukhovich • André Uschmajew Editors
Nonsmooth Optimization and Its Applications
Editors Seyedehsomayeh Hosseini IAV GmbH Gifhorn, Germany
Boris S. Mordukhovich Department of Mathematics Wayne State University Detroit MI, USA
André Uschmajew Max Planck Institute for Mathematics in the Sciences Leipzig, Germany
ISSN 0373-3149 ISSN 2296-6072 (electronic) International Series of Numerical Mathematics ISBN 978-3-030-11369-8 ISBN 978-3-030-11370-4 (eBook) https://doi.org/10.1007/978-3-030-11370-4 Library of Congress Control Number: 2019933570 Mathematics Subject Classification (2010): 49J52, 49J53, 58K05, 65K10, 90C26, 90C30, 90C52 © Springer Nature Switzerland AG 2019 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This book is published under the imprint Birkhäuser, www.birkhauser-science.com by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
This volume is an outcome of the workshop on nonsmooth optimization and applications held at and financed by the Hausdorff Center for Mathematics in Bonn from May 15 to 19, 2017. During the workshop, senior mathematicians and young scientists presented their latest work. The six articles in this book have been contributed by workshop speakers and their collaborators. They give insights into current research activities in constrained and unconstrained nonsmooth optimization.

The first three chapters deal with topics of constrained nonsmooth optimization. Chapter “A Collection of Nonsmooth Riemannian Optimization Problems” presents nine examples of nonsmooth optimization problems on Riemannian manifolds that are relevant in numerous applications. Chapter “An Approximate ADMM for Solving Linearly Constrained Nonsmooth Optimization Problems with Two Blocks of Variables” is devoted to an approximate alternating direction method of multipliers for solving constrained nonsmooth optimization problems with two blocks of variables subject to linear constraints. In chapter “Tangent and Normal Cones for Low-Rank Matrices”, tangent cones and normal cones to varieties of low-rank matrices are studied, which is useful for low-rank optimization problems. Chapters “Subdifferential Enlargements and Continuity Properties of the VU-Decomposition in Convex Optimization” and “Proximal Mappings and Moreau Envelopes of Single-Variable Convex Piecewise Cubic Functions and Multivariable Gauge Functions” are concerned with ideas for relating a nonsmooth problem to a smooth problem and then solving the smooth counterpart. Specifically, chapter “Subdifferential Enlargements and Continuity Properties of the VU-Decomposition in Convex Optimization” is concerned with the concept of ε-VU-objects for nonsmooth convex functions, which is closely related to partly smooth functions, based on an enlargement of the subdifferential. Chapter “Proximal Mappings and Moreau Envelopes of Single-Variable Convex Piecewise Cubic Functions and Multivariable Gauge Functions” deals with the Moreau envelope for finite-dimensional, proper, lower semicontinuous, convex functions. The Moreau envelope offers the benefit of smoothing a nonsmooth objective function while maintaining the same minimum and minimizers. Finally, chapter “Newton-Like Dynamics Associated to Nonconvex Optimization Problems” is devoted to the relation of a dynamical system to the critical points of a nonsmooth function.

We wish to thank all the contributors to the workshop for making it a successful, inspiring, and memorable event. The financial and administrative support of the Hausdorff Center for Mathematics is gratefully acknowledged.

Gifhorn, Germany    Seyedehsomayeh Hosseini
Detroit, MI, USA    Boris S. Mordukhovich
Leipzig, Germany    André Uschmajew
Contents
A Collection of Nonsmooth Riemannian Optimization Problems
P.-A. Absil and S. Hosseini

An Approximate ADMM for Solving Linearly Constrained Nonsmooth Optimization Problems with Two Blocks of Variables
Adil M. Bagirov, Sona Taheri, Fusheng Bai, and Zhiyou Wu

Tangent and Normal Cones for Low-Rank Matrices
Seyedehsomayeh Hosseini, D. Russell Luke, and André Uschmajew

Subdifferential Enlargements and Continuity Properties of the VU-Decomposition in Convex Optimization
Shuai Liu, Claudia Sagastizábal, and Mikhail Solodov

Proximal Mappings and Moreau Envelopes of Single-Variable Convex Piecewise Cubic Functions and Multivariable Gauge Functions
C. Planiden and X. Wang

Newton-Like Dynamics Associated to Nonconvex Optimization Problems
Radu Ioan Boț and Ernö Robert Csetnek
A Collection of Nonsmooth Riemannian Optimization Problems

P.-A. Absil and S. Hosseini

Abstract Nonsmooth Riemannian optimization is still a scarcely explored subfield of optimization theory that concerns the general problem of minimizing (or maximizing), over a domain endowed with a manifold structure, a real-valued function that is not everywhere differentiable. The purpose of this paper is to illustrate, by means of nine concrete examples, that nonsmooth Riemannian optimization finds numerous applications in engineering and the sciences.

Keywords Nonsmooth optimization · Riemannian optimization · Optimization on manifolds

Mathematics Subject Classification (2000) Primary 49J52; Secondary 90C56

1 Introduction

Optimization on manifolds, also termed Riemannian optimization, concerns optimizing a real-valued function defined on a nonlinear search space endowed with a manifold structure. Manifolds that frequently arise in applications include the Stiefel manifold of orthonormal p-frames in R^n, the Grassmann manifold of p-planes in R^n, and the manifold of matrices of fixed rank and size. The area is motivated by numerous applications, notably in machine learning, but also in computer vision, imaging sciences, mechanics, physics, chemistry, and genetics.
P.-A. Absil () ICTEAM Institute, University of Louvain, Louvain-la-Neuve, Belgium e-mail: [email protected] S. Hosseini Hausdorff Center for Mathematics and Institute for Numerical Simulation, University of Bonn, Bonn, Germany e-mail: [email protected] © Springer Nature Switzerland AG 2019 S. Hosseini et al. (eds.), Nonsmooth Optimization and Its Applications, International Series of Numerical Mathematics 170, https://doi.org/10.1007/978-3-030-11370-4_1
A way to keep abreast of the rapidly evolving field of applications is to track papers that refer to general-purpose Riemannian optimization toolboxes such as Manopt [10], Pymanopt [56], and ROPTLIB [34].

In considering an optimization problem on a Riemannian manifold, fundamental challenges occur due to the nonlinear nature of the search space. As we now explain, several approaches that may seem natural are inapplicable or can be deemed unsatisfactory.

As a first approach, since manifolds are “locally Euclidean,” it is tempting to resort to unconstrained optimization techniques. However, methods based on the linear space structure—that is, virtually all classical methods in optimization—are not directly applicable on an abstract manifold.

As a second approach, one could resort to embedding theorems (namely, the Whitney embedding theorem or the Nash isometric embedding theorem) that guarantee that any manifold can be viewed as a subset of a Euclidean space. However, handling such a subset may be impractical. This is the case, for example, of the well-known Grassmann manifold of p-planes in R^n; any p-plane in R^n can be uniquely represented by the orthogonal projection onto it, and thus as an element of the n²-dimensional Euclidean space of linear transformations of R^n, but using instead a basis of the p-plane (as in, e.g., [2, §3.4.4]) may be considered more wieldy. On the other hand, some manifolds—the unit sphere in R^n being a simple example—naturally come to us as a subset of a Euclidean space. In that situation, the Riemannian optimization problem falls in the realm of constrained optimization. However, in contrast with classical constrained optimization methods such as augmented Lagrangian or sequential quadratic programming, Riemannian optimization methods have the often desirable property of being feasible, namely their iterates belong to the admissible subset. Projected gradient (see, e.g., [6]), which is a classical feasible optimization method, bears some similarity with Riemannian steepest descent (see, e.g., [2, §4.2]), but a major difference is that projected gradient requires the feasible set to be convex, and another difference is that Riemannian steepest descent first projects the gradient onto the tangent space, which can be considerably more efficient (see [61, §2.1]).

As a third approach, one could exploit the fact that, by definition, a manifold is a set covered by coordinate patches (or charts) that overlap smoothly. Therefore one might ask whether it is possible to simply work in a chart domain and use classical linear algorithms to solve optimization problems on manifolds efficiently. Unfortunately, such an approach seldom leads to wieldy algorithms, because of the following reasons. First of all, symmetries of the underlying Riemannian manifold will in general not be respected by such algorithms. Second, while the existence of charts is inherent to the notion of manifold, they may not be readily available or they may be computationally impractical. Third, localizing to a chart inevitably leads to distortions in the metric, which may cause slower convergence. Finally, for a chart-based algorithm to have access to the whole manifold, chart transition mechanisms need to be devised, and such mechanisms may prove to be intricate (see [52] for an example).

For these reasons, together with the wealth of applications, Riemannian optimization has become a thriving area of research in the past few decades.
Early attempts to adapt standard optimization methods to smooth problems on complete Riemannian manifolds were presented by Gabay in 1982, who introduced
the steepest descent, Newton, and quasi-Newton algorithms on manifold settings. Furthermore, he provided the global and local convergence properties of the presented algorithms [22]. Udriste also presented a steepest descent and a Newton algorithm for smooth objective functions on complete Riemannian manifolds and proved their (linear) convergence under the assumption of exact line search; see [58]. Fairly recently, other versions of smooth optimization algorithms such as the steepest descent method, Newton’s method, trust-region methods, and conjugate gradient methods have been extended to solving smooth optimization problems on complete Riemannian manifolds (see, e.g., [2, 3, 35, 50]). These methods can be found in the general-purpose Riemannian optimization toolboxes mentioned above. Regardless of the applications, a conceptual appeal of Riemannian optimization is that manifolds are arguably the natural generalization of Rn as the search space of smooth optimization problems: the notion of smoothness remains well defined for a real-valued function on an abstract manifold, and the notion of steepest descent direction follows from the Riemannian metric. In view of this, the concept of nonsmooth Riemannian optimization may seem paradoxical. However, it turns out that several important problems can be phrased as optimizing a nonsmooth function over a (smooth) Riemannian manifold. The purpose of this paper is to give an overview of a few such problems that fit in the framework of nonsmooth Riemannian optimization (NRO). Our aim is to give just enough information for the reader to get a clear idea of what the problems are about and how they fit in the NRO framework, while referring to the original work for details. The applications covered in this paper are not meant to form an exhaustive list. For example, Dirr et al. [17, §3.4] present an application in grasping force optimization that we do not cover here, and our choice of applications is also complementary to [41] where experimental results are shown for compressed modes, functional correspondence, and robust Euclidean embedding. Though NRO is a fairly recent and still scarcely explored area, a few generalpurpose NRO methods are currently available. First, derivative-free techniques on manifolds (but not necessarily their convergence analysis) are readily applicable to NRO. A few derivative-free schemes such as Powell’s method, direct local search methods, and particle swarm optimization algorithms are available on complete Riemannian manifolds; see [8, 14, 19]. Second, the NRO problems of interest have a locally Lipschitz—hence almost everywhere differentiable—objective function, and several methods exist that exploit this fact, such as gradient sampling [31, 32, §7.2], an ε-subdifferential algorithm [28], a trust region algorithm [27], nonsmooth BFGS methods [30, 32, §7.3], and a manifold ADMM [41]. Whereas most of these methods are fairly straightforward generalizations to manifolds of existing nonsmooth optimization methods in Rn , their analysis is often considerably more challenging, as the cited literature shows. As for the nonsmooth Riemannian BFGS, it closely resembles the smooth Riemannian BFGS [33, 48] much as the nonsmooth Euclidean BFGS resembles the smooth Euclidean BFGS; however, even in the Euclidean case, the favorable behavior of the nonsmooth BFGS remains somewhat mysterious [44]. Therefore, in [30] a new version of nonsmooth BFGS method has
been presented; moreover, it has been shown through several experiments that, in terms of success rate and computational time, the new method is more robust and faster than the nonsmooth Riemannian BFGS method presented in [32, §7.3]. In each section, we also give pointers to existing techniques to handle the specific problem considered.
2 Sparse PCA

Consider a data matrix A = [a1 . . . an] ∈ R^{m×n}, where m and n are the number of variables and observations, respectively. For example, in the case of gene expression data, Aij may represent the expression level of gene i in experiment (or microarray) j. Principal component analysis (PCA) relates to singular value decomposition (SVD) [25]. An SVD of A consists in expressing A as the matrix product U Σ V^T, where p = min{m, n}, U ∈ R^{m×p} is orthonormal (U^T U = I), V ∈ R^{n×p} is also orthonormal, and Σ ∈ R^{p×p} is a diagonal matrix with nonnegative diagonal entries sorted in decreasing order. The lth column of U, denoted by ul, is then the loading vector of the lth principal component of A. In the gene expression application, the columns of U have been called eigenarrays and the columns of V, eigengenes [4].

For simplicity, let us focus on the first loading vector, u1, which has the remarkable property of giving the direction of the best fitting line in the least squares sense to the data points a1, . . . , an, a concept that dates to Pearson [46]. Equivalently, u1 is a solution of the optimization problem max_{u∈R^m, ||u||=1} u^T A A^T u, which expresses that the columns of A are largest (in the mean square sense) along the direction of u1. However, a possible drawback of u1 is that its entries are typically nonzero. This may make the principal component—the projection onto the direction of u1—unpleasantly heavy to compute, and hamper the interpretability of u1 [57]. Addressing this drawback leads to seeking a new u that strikes a balance between the objectives of making u^T A A^T u large and keeping small the cardinality ||u||_0, i.e., the number of nonzero elements of u. Such a task can be approached in three ways: the Ivanov approach [36] minimizes the data attachment term (here u^T A A^T u) under a constraint on the regularizer (here ||u||_0); the Morozov approach [45] minimizes the regularizer under a constraint on the data attachment term; the Tikhonov approach [55] mixes the two terms in the objective function. Let us focus on the last one, which yields the formulation

max_{u∈R^m}   u^T A A^T u − ρ ||u||_0
subject to    u^T u = 1,                                  (2.1)

with the parameter ρ ≥ 0. In problem (2.1), the objective function is not only nonsmooth; it is discontinuous, in view of the ||u||_0 term. Optimizing a discontinuous function is an unpleasant task, hence the ||u||_0 term is often relaxed to the sparsity-inducing (see, e.g., [18, 54]) ℓ1 norm ||u||_1 := ∑_{i=1}^m |u_i|, yielding the surrogate problem

max_{u∈R^m}   u^T A A^T u − ρ ||u||_1
subject to    u^T u = 1,                                  (2.2)

a continuous but nonsmooth optimization problem on the unit sphere. Since the unit sphere is a submanifold of R^m, (2.2) constitutes a nonsmooth Riemannian optimization problem. Techniques to handle this problem are discussed in [15, 23, 37–39, 64] and the references therein.
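To make the discussion concrete, here is a minimal sketch (not taken from the cited references) of one natural way to attack (2.2): a Riemannian subgradient method on the unit sphere that projects a Euclidean subgradient onto the tangent space and retracts by normalization. All function and variable names are illustrative.

```python
import numpy as np

def sparse_pca_sphere(A, rho, iters=2000, step0=0.1, seed=0):
    """Riemannian subgradient sketch for (2.2): minimize -u'AA'u + rho*||u||_1 on the unit sphere."""
    rng = np.random.default_rng(seed)
    S = A @ A.T                                   # m-by-m matrix appearing in the objective
    u = rng.standard_normal(A.shape[0])
    u /= np.linalg.norm(u)
    for k in range(1, iters + 1):
        g = -2.0 * S @ u + rho * np.sign(u)       # a Euclidean subgradient of the minimized objective
        g_tan = g - (u @ g) * u                   # project onto the tangent space of the sphere at u
        u = u - (step0 / k) * g_tan               # subgradient step with diminishing step size
        u /= np.linalg.norm(u)                    # retraction: normalize back to the sphere
    return u

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((20, 50))
    print(np.round(sparse_pca_sphere(A, rho=5.0), 3))  # larger rho drives more entries toward zero
```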
3 Secant-Based Dimensionality Reduction

The purpose in this application is to pick the “easiest to invert” projection of a high-dimensional system. Let S be a subset of R^n, which can be, e.g., the orbit of a dynamical system, and let

Σ := { (x − y)/||x − y|| : x, y ∈ S, x ≠ y }

be its set of unit secants. We want to find a p-dimensional subspace U of R^n on which to orthogonally project the data. The projection is injective if Σ and the orthogonal complement of U have an empty intersection. In order to make the projection “as safely injective as possible,” we look for the subspace U in the Grassmann manifold Gr(n, p) of p-planes in R^n that maximizes

f(U) = min_{s∈Σ} ||π_U s||,

where π_U denotes the orthogonal projection onto U and || · || denotes the Euclidean norm. In practice, S comes to us as a finite set, e.g., the samples of a numerical integration of the dynamical system. The problem therefore consists in maximizing over the Grassmann manifold Gr(n, p) an objective function defined as the pointwise minimum of a finite collection of smooth functions. This is thus a nonsmooth optimization problem on a manifold. This problem was addressed in [11] in the surrogate smooth form

min_{U∈Gr(n,p)}  ∑_{s∈Σ}  1/||π_U s||.
The only existing work that addresses the nonsmooth problem directly appears to be the very recent paper [20] where a direct search method on the Grassmann manifold is advocated.
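For a finite sample of S, both the secant set Σ and the nonsmooth objective f(U) = min_{s∈Σ} ||π_U s|| are easy to evaluate once U is represented by a matrix Q with orthonormal columns, since then ||π_U s|| = ||Q^T s||. The following sketch (illustrative only; names are hypothetical) shows such an evaluation, which is the basic building block for the methods mentioned above.

```python
import numpy as np

def unit_secants(samples):
    """All normalized pairwise differences of the rows of `samples` (the set Sigma)."""
    secants = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            d = samples[i] - samples[j]
            nrm = np.linalg.norm(d)
            if nrm > 0:
                secants.append(d / nrm)
    return np.array(secants)

def f_secant(Q, secants):
    """f(U) = min over s in Sigma of ||pi_U s||, with U spanned by the orthonormal columns of Q."""
    # ||Q Q^T s|| = ||Q^T s|| because Q has orthonormal columns
    return np.min(np.linalg.norm(secants @ Q, axis=1))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.standard_normal((50, 6))              # a toy "orbit" in R^6
    Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))    # a random 2-plane in Gr(6, 2)
    print(f_secant(Q, unit_secants(samples)))
```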
4 Economic Load Dispatch

The economic load dispatch problem consists in finding the most cost-effective repartition of power generation between production units in order to satisfy the demand, while accounting for transmission losses and keeping each unit within its allowed operating zone. Letting p = [p1 . . . pn]^T denote the power output of the n available units, a popular load dispatch model is

min_{p∈R^n}  f_T(p) := ∑_{i=1}^{n} ( a_i p_i² + b_i p_i + c_i + |d_i sin(e_i (p_i^min − p_i))| )        (4.1a)
subject to   p_i^min ≤ p_i ≤ p_i^max,  i = 1, . . . , n,                                               (4.1b)
             ∑_{i=1}^{n} p_i = p_D + p_L(p),                                                           (4.1c)

where p_D is the power demand (MW) and p_L(p) stands for the power loss (MW) expressed as p_L(p) = p^T B p + p^T b_0 + b_00. Coefficients B_ij, b_i0, b_00 are the transmission loss coefficients (B-coefficients) given by the elements of the square matrix B of size n × n, the vector b_0 of length n, and the constant b_00, respectively. The matrix B is symmetric positive-definite, hence p_L(p) is a convex quadratic function of p, and thus the set of points that meet the power balance constraint (4.1c) forms an ellipsoid, i.e., a submanifold of R^n. The feasible set of (4.1) is thus the intersection of an ellipsoid and an axis-aligned box. The objective function (4.1a) follows a frequently encountered formulation where the contribution of each unit is made of a quadratic term and a rectified sine that models the so-called valve-point effect [60, 62]. The rectified term makes the objective function nonsmooth and turns (4.1) into a nonsmooth optimization problem on a Riemannian manifold with additional box constraints. The geometry of the problem was taken into account in [9], where a Riemannian subgradient approach is proposed.
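The nonsmooth ingredients of (4.1) are straightforward to code. The sketch below (with made-up coefficient arrays; it is not the implementation of [9]) evaluates the valve-point cost (4.1a), the loss p_L(p), and the residual of the balance constraint (4.1c), which could then be fed to, e.g., a subgradient scheme that respects the ellipsoidal constraint.

```python
import numpy as np

def dispatch_cost(p, a, b, c, d, e, p_min):
    """Valve-point cost (4.1a): sum_i a_i p_i^2 + b_i p_i + c_i + |d_i sin(e_i (p_i^min - p_i))|."""
    return np.sum(a * p**2 + b * p + c + np.abs(d * np.sin(e * (p_min - p))))

def power_loss(p, B, b0, b00):
    """Transmission loss p_L(p) = p^T B p + p^T b0 + b00 built from the B-coefficients."""
    return p @ B @ p + p @ b0 + b00

def balance_residual(p, B, b0, b00, p_D):
    """Residual of the power balance constraint (4.1c): sum_i p_i - p_D - p_L(p)."""
    return np.sum(p) - p_D - power_loss(p, B, b0, b00)
```

A feasible point satisfies balance_residual(p, ...) = 0 together with the box constraints (4.1b).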
5 Range-Based Independent Component Analysis Independent component analysis (ICA) is the problem of extracting maximally independent linear combinations of given signals. An ICA algorithm typically consists of (1) a contrast function which measures the dependence between signals and (2) an optimization method to minimize the contrast function. Since contrast functions are in principle invariant by scaling (the “measure of dependence” between signals should not be affected by scaling), it is common to break this indeterminacy by imposing a normalization yielding, e.g., the oblique manifold OB(D) of D × D matrices with unit-norm columns [1, 40, 53].
Letting matrix M ∈ R^{D×N} contain the D signals of length N to be unmixed, a contrast function introduced in [59] is

f(X; M) := ∑_{d=1}^{D} log R(x_d^T M) − log |det X|,                                  (5.1)

where X ∈ OB(D) is the candidate unmixing matrix (yielding the candidate unmixed signals found in the rows of X^T M) and R returns the range of its argument. Vrins et al. [59] propose to use a robust estimator of the range defined as

R([b1, b2, . . . , bN]) := (1/m) ∑_{r=1}^{m} R_r([b1, b2, . . . , bN]),

where R_r([b1, b2, . . . , bN]) := b_(N−r+1) − b_(r) and b_(i) stands for the ith element of [b1, b2, . . . , bN] by increasing values. We have thus obtained the problem

min_X   f(X; M)
subject to   X ∈ OB(D),                                  (5.2)
which is a nonsmooth optimization problem over the oblique manifold. This problem was tackled by a derivative-free optimization method in [52] and by a nonsmooth quasi-Newton method in [53]. The latter is a Riemannian generalization (see also [32, Ch. 7]) of the nonsmooth BFGS method of Lewis and Overton [44].
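The robust range estimator and the contrast (5.1) translate directly into code. The sketch below is only an illustration (function names are ours); X is assumed to have unit-norm columns, x_d denotes its d-th column, and m is the number of order statistics averaged by the estimator.

```python
import numpy as np

def robust_range(b, m):
    """Robust range estimator: average of b_(N-r+1) - b_(r) for r = 1..m (order statistics of b)."""
    s = np.sort(np.asarray(b))
    N = len(s)
    return np.mean([s[N - r] - s[r - 1] for r in range(1, m + 1)])

def contrast(X, M, m=5):
    """Range-based ICA contrast (5.1): sum_d log R(x_d^T M) - log|det X|."""
    total = 0.0
    for d in range(X.shape[1]):
        total += np.log(robust_range(X[:, d] @ M, m))
    return total - np.log(abs(np.linalg.det(X)))
```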
6 Sphere Packing on Manifolds

Let M be an n-dimensional Riemannian manifold equipped with a Riemannian distance dist, and let B(P, r) denote the ball with respect to this distance in M centered at P with radius r. The sphere packing problem aims at finding m points P1, . . . , Pm in M such that

max{r : B(Pi, r) ∩ B(Pj, r) = ∅  ∀ i < j},                                  (6.1)

is maximized. This problem (6.1) is equivalent to maximizing the following nonsmooth function,

F(P1, . . . , Pm) := min_{1≤i<j≤m} dist(Pi, Pj).
An Approximate ADMM for Nonsmooth Optimization

The open ball centered at x ∈ R^n with radius δ > 0 is denoted by Bδ(x) = {y ∈ R^n : ||y − x|| < δ}. “conv” denotes the convex hull of a set.

The subdifferential ∂F(x) of a convex function F : R^n → R at a point x ∈ R^n is defined by [1, 5]:

∂F(x) = {ξ ∈ R^n : F(y) − F(x) ≥ ⟨ξ, y − x⟩ ∀y ∈ R^n},

and its ε-subdifferential by

∂_ε F(x) = {ξ ∈ R^n : F(y) − F(x) ≥ ⟨ξ, y − x⟩ − ε ∀y ∈ R^n}.

Each vector ξ ∈ ∂F(x) is called a subgradient. A point x* ∈ R^n is stationary iff 0_n ∈ ∂F(x*). Stationarity is necessary and sufficient for global optimality.
2 Spherical Subdifferentials

Let F : R^n → R be a convex function and τ ≥ 0 be a given number. Denote by S_1^n = {d ∈ R^n : ||d|| = 1} the unit sphere in R^n.

Definition 2.1 A set

D_τ F(x) = conv{ξ ∈ R^n : ξ ∈ ∂F(x + τd) for some d ∈ S_1^n}                                  (2.1)

is called a τ-spherical subdifferential of the function F at a point x ∈ R^n.

It is obvious that D_0 F(x) = ∂F(x) for any x ∈ R^n.

Definition 2.2 Let the convex functions F1 and F2 be defined on R^n. The directional sum of their τ-spherical subdifferentials at a point x ∈ R^n is:

D_τ F1(x) (+) D_τ F2(x) = conv{ξ ∈ R^n : ∃ d ∈ S_1^n s.t. ξ = ξ1 + ξ2, ξ1 ∈ ∂F1(x + τd), ξ2 ∈ ∂F2(x + τd)}.                                  (2.2)

Proposition 2.3 For a given τ ≥ 0 the τ-spherical subdifferential of the function F at a point x ∈ R^n is a compact set.

Proof Boundedness of the subdifferential mapping x → ∂F(x) on a bounded set implies that for a given τ ≥ 0 the set D_τ F(x) is bounded. Take any sequence {ξ_k} ⊂ D_τ F(x) and assume that ξ_k → ξ̄ as k → ∞. Since ξ_k ∈ D_τ F(x) there exists d_k ∈ S_1^n such that ξ_k ∈ ∂F(x + τd_k). The sequence {d_k} belongs to the compact set S_1^n and, therefore, has at least one limit point. For the sake of simplicity, assume that d_k → d̄. It is clear that d̄ ∈ S_1^n. Then the upper semicontinuity of the subdifferential mapping implies that ξ̄ ∈ ∂F(x + τd̄). Therefore, ξ̄ ∈ D_τ F(x), that is, the set D_τ F(x) is closed. This completes the proof.
Proposition 2.4 At a point x ∈ R^n for any ε > 0 there exists τ_ε > 0 such that D_τ F(x) ⊂ ∂F(x) + B_ε(0_n) for all τ ∈ (0, τ_ε).

Proof The proof follows from the upper semicontinuity of the subdifferential mapping x → ∂F(x).

Let x ∈ R^n and

ε ≡ ε(x, τ) = sup{F(x) − F(x + τd) + τ⟨ξ, d⟩ : d ∈ S_1^n, ξ ∈ ∂F(x + τd)}.                                  (2.3)

The convexity of the function F implies that ε ≥ 0.

Proposition 2.5 For the τ-spherical subdifferential D_τ F(x) the following holds: D_τ F(x) ⊂ ∂_ε F(x), where ε is defined by (2.3).

Proof Take any ξ ∈ D_τ F(x). Then there exists d ∈ S_1^n such that ξ ∈ ∂F(x + τd). The convexity of the function F implies that ξ ∈ ∂_{ε_d} F(x) for all ξ ∈ ∂F(x + τd), where ε_d = F(x) − F(x + τd) + τ⟨ξ, d⟩ ≥ 0. Since ε_d ≤ ε it follows that ∂_{ε_d} F(x) ⊆ ∂_ε F(x). This means that ξ ∈ ∂_ε F(x).

Let δ > 0 be a given number.

Definition 2.6 A point x ∈ R^n is called a (τ, δ)-stationary point of the function F if

0_n ∈ D_τ F(x) + B_δ(0_n).                                  (2.4)
Notice that if τ = δ = 0 then x is a stationary point of the function F . Proposition 2.7 Let {xk } be a sequence of (τk , δk )-stationary points of the convex function F such that xk → x, ¯ τk ↓ 0 and δk ↓ 0 as k → ∞. Then x¯ is a stationary point of the function F . Proof Since xk is the (τk , δk )-stationary point we get that 0n ∈ Dτk F (xk )+Bδk (0n ). Proposition 2.4 and the upper semicontinuity of the subdifferential mapping imply that for any ε > 0 there exists k1ε > 0 such that Dτk F (xk ) ⊂ ∂F (xk ) + Bε (0n ) for all k > k1ε . This means that 0n ∈ ∂F (xk ) + Bε+δk (0n ) for k > k1ε . Applying upper semicontinuity of the subdifferential x → ∂F (x) we obtain that for ε > 0
there exists k_{2ε} > k_{1ε} such that ∂F(x_k) ⊂ ∂F(x̄) + B_ε(0_n) for all k > k_{2ε}. Then we get that 0_n ∈ ∂F(x̄) + B_{2ε+δ_k}(0_n), k > k_{2ε}. Since ε is arbitrary and δ_k ↓ 0 as k → ∞ we conclude that 0_n ∈ ∂F(x̄).

Proposition 2.8 Let x ∈ R^n be a (τ, δ)-stationary point of the function F. Then

F(x) − F(z) ≤ ε + δ||x − z||  ∀z ∈ R^n,

where ε is defined by (2.3) for given τ > 0 at x.

Proof Since x is a (τ, δ)-stationary point of F it follows from Proposition 2.5 that 0_n ∈ ∂_ε F(x) + B_δ(0_n). This means that there exists ξ ∈ ∂_ε F(x) such that ||ξ|| ≤ δ. According to the definition of the ε-subdifferential we get:

F(z) − F(x) ≥ ⟨ξ, z − x⟩ − ε  ∀z ∈ R^n.

Therefore:

F(x) − F(z) ≤ ⟨ξ, x − z⟩ + ε ≤ ||ξ|| ||x − z|| + ε ≤ δ||x − z|| + ε.
This completes the proof.
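Since the τ-spherical subdifferential involves all unit directions, it can only be worked with through finitely many of its elements; the algorithms of the following sections do exactly that. Purely as a numerical illustration of the definitions (not part of the method), the sketch below samples subgradients ξ ∈ ∂F(x + τd) for random d ∈ S_1^n and returns a sampled lower estimate of the quantity ε(x, τ) in (2.3).

```python
import numpy as np

def sample_spherical_subgradients(subgrad, x, tau, num_dirs=200, seed=0):
    """Subgradients xi in dF(x + tau*d) for randomly sampled unit directions d: a finite
    subset of the tau-spherical subdifferential D_tau F(x) before taking the convex hull."""
    rng = np.random.default_rng(seed)
    xis, dirs = [], []
    for _ in range(num_dirs):
        d = rng.standard_normal(x.size)
        d /= np.linalg.norm(d)
        xis.append(subgrad(x + tau * d))
        dirs.append(d)
    return np.array(xis), np.array(dirs)

def epsilon_lower_bound(F, subgrad, x, tau, num_dirs=200, seed=0):
    """Sampled lower bound on eps(x, tau) = sup_d { F(x) - F(x + tau*d) + tau*<xi, d> } in (2.3)."""
    xis, dirs = sample_spherical_subgradients(subgrad, x, tau, num_dirs, seed)
    return max(F(x) - F(x + tau * d) + tau * (xi @ d) for xi, d in zip(xis, dirs))

# Example with F(x) = ||x||_1, for which sign(x) is a valid subgradient.
if __name__ == "__main__":
    F = lambda z: np.sum(np.abs(z))
    subgrad = lambda z: np.sign(z)
    print(epsilon_lower_bound(F, subgrad, x=np.array([1.0, -0.5, 0.0]), tau=0.3))
```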
3 An Approximate Alternating Direction of Multipliers Method

Introduce the following function:

Φ(x, y) = (1/2) ||Ax + Cy − b||²,  (x, y) ∈ R^n × R^m.

The augmented Lagrangian for Problem (1.1) is defined as [14]:

L_β(x, y, λ) = f1(x) + f2(y) + ⟨λ, Ax + Cy − b⟩ + β Φ(x, y),

where λ ∈ R^p are Lagrange multipliers and β > 0 is a penalty parameter. The gradient of the function Φ at (x, y) ∈ R^n × R^m is:

∇Φ(x, y) = (∇_x Φ(x, y), ∇_y Φ(x, y)),
where ∇_x Φ(x, y) = A^T(Ax + Cy − b), ∇_y Φ(x, y) = C^T(Ax + Cy − b).

The classic ADMM for solving (1.1) begins with an arbitrary z_0 = (x_0, y_0, λ_0) and consists of the following steps at the k-th iteration, k ≥ 0:

1. For given y_k and λ_k obtain x̄_k by

   x̄_k = argmin { f1(x) + ⟨λ_k, Ax + Cy_k − b⟩ + β Φ(x, y_k) : x ∈ R^n };                                  (3.1)

2. For given x̄_k and λ_k compute ȳ_k by

   ȳ_k = argmin { f2(y) + ⟨λ_k, Ax̄_k + Cy − b⟩ + β Φ(x̄_k, y) : y ∈ R^m };                                  (3.2)

3. For given x̄_k and ȳ_k determine λ̄_k as:

   λ̄_k = λ_k + β(Ax̄_k + Cȳ_k − b);                                  (3.3)

4. Update z_k by z_{k+1} as follows:

   z_{k+1} = (x̄_k, ȳ_k, λ̄_k).                                  (3.4)
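To see the scheme (3.1)–(3.4) in action, consider the special case f1(x) = ||x||_1, f2(y) = (1/2)||y − c||², A = I, C = −I, b = 0, chosen here only because both subproblems then have closed-form solutions; this toy instance is ours, not from the chapter. The classic iteration becomes the soft-thresholding loop sketched below.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Closed-form solution of min_x ||x||_1 + (1/(2*kappa))*||x - v||^2."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def classic_admm_toy(c, beta=1.0, iters=100):
    """Classic ADMM (3.1)-(3.4) for min ||x||_1 + 0.5*||y - c||^2  s.t.  x - y = 0."""
    n = c.size
    x = np.zeros(n); y = np.zeros(n); lam = np.zeros(n)
    for _ in range(iters):
        # (3.1): x-update, argmin ||x||_1 + <lam, x - y> + (beta/2)*||x - y||^2
        x = soft_threshold(y - lam / beta, 1.0 / beta)
        # (3.2): y-update, argmin 0.5*||y - c||^2 + <lam, x - y> + (beta/2)*||x - y||^2
        y = (c + lam + beta * x) / (1.0 + beta)
        # (3.3): multiplier update with residual Ax + Cy - b = x - y
        lam = lam + beta * (x - y)
    return x, y, lam

if __name__ == "__main__":
    x, y, lam = classic_admm_toy(np.array([3.0, 0.4, -2.0, 0.1]))
    print(x)  # entries of c with |c_i| <= 1 are driven to zero
```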
For a given τ > 0 the τ-spherical subdifferential of the function f1 at x ∈ R^n is defined as:

D_τ f1(x) = conv{ξ ∈ R^n : ξ ∈ ∂f1(x + τd), d ∈ S_1^n},                                  (3.5)

and this subdifferential for the function f2 at y ∈ R^m is given by

D_τ f2(y) = conv{η ∈ R^m : η ∈ ∂f2(y + τd), d ∈ S_1^m}.                                  (3.6)

The τ-spherical subdifferential of the function Φ at a point (x, y) ∈ R^n × R^m is:

D_τ Φ(x, y) = (D_{τx} Φ(x, y), D_{τy} Φ(x, y)),

where

D_{τx} Φ(x, y) = {v ∈ R^n : v = A^T(A(x + τd) + Cy − b), d ∈ S_1^n},

and

D_{τy} Φ(x, y) = {w ∈ R^m : w = C^T(Ax + C(y + τd) − b), d ∈ S_1^m}.
Define the following two numbers: M1 = ||A^T A||, M2 = ||C^T C||. Here || · || is the Frobenius norm of a matrix.

Proposition 3.1 For the function Φ at a point (x, y) ∈ R^n × R^m the following inclusions hold:

D_{τx} Φ(x, y) ⊂ {∇_x Φ(x, y)} + B_{M1 τ}(0_n),                                  (3.7)

D_{τy} Φ(x, y) ⊂ {∇_y Φ(x, y)} + B_{M2 τ}(0_m).                                  (3.8)
Proof Take any v ∈ Dτ x (x, y). Then there exists d ∈ S1n such that v = AT (A(x + τ d) + Cy − b). It is clear that v − ∇x (x, y) ≤ M1 τ. The similar result can be obtained for the set Dτy (x, y). Applying the “directional sum rule" defined in Definition 2.2 we can see that the τ -spherical subdifferential of the function Lβ at a point x ∈ Rn for fixed y ∈ Rm and λ ∈ Rp is given as: Dτ x Lβ (x, y, λ) = Dτ f1 (x) (+) AT λ (+) βDτ x (x, y), and this subdifferential at a point y ∈ Rm for fixed x ∈ Rn and λ ∈ Rp is: Dτy Lβ (x, y, λ) = Dτ f2 (y) (+) C T λ (+) βDτy (x, y). Define the sets: Dˆ τ x (x, y, λ) = Dτ f1 (x) + AT λ + β∇x (x, y), Dˆ τy (x, y, λ) = Dτ f2 (y) + C T λ + β∇y (x, y). The following proposition is easily obtained from Proposition 3.1. Proposition 3.2 For any x ∈ Rn , y ∈ Rm , λ ∈ Rp we have Dτ x Lβ (x, y, λ) ⊂ Dˆ τ x (x, y, λ) + BβM1 τ (0n ),
(3.9)
Dτy Lβ (x, y, λ) ⊂ Dˆ τy (x, y, λ) + BβM2 τ (0m ).
(3.10)
Using sets Dτ x Lβ and Dτy Lβ we introduce the new version of the ADMM, called SGBUN (subgradient bundle) method. Assume that sequences {τk } and {δk }
are given such that τ_k, δ_k ↓ 0 as k → ∞. Let β > 0 be a given number and z_0 = (x_0, y_0, λ_0) be a starting point. Then the SGBUN method proceeds as follows:

1. For given y_k and λ_k obtain x̄_k such that:

   0_n ∈ D_{τ_k x} L_β(x̄_k, y_k, λ_k) + B_{δ_k}(0_n);                                  (3.11)

2. For given x̄_k and λ_k compute ȳ_k satisfying the following condition:

   0_m ∈ D_{τ_k y} L_β(x̄_k, ȳ_k, λ_k) + B_{δ_k}(0_m);                                  (3.12)

3. For given x̄_k and ȳ_k determine λ̄_k as:

   λ̄_k = λ_k + β(Ax̄_k + Cȳ_k − b);                                  (3.13)

4. Update z_k by z_{k+1} as:

   z_{k+1} = (x̄_k, ȳ_k, λ̄_k).                                  (3.14)
One can see that, unlike in the classic ADMM, in the SGBUN method the subproblems are solved approximately. In the next section we develop an algorithm for finding the points x̄_k and ȳ_k in Steps 1 and 2, respectively, and prove that this algorithm is finite convergent. This means that the SGBUN method is well defined.
4 An Algorithm for Finding (τ, δ)-Stationary Points

In this section, we introduce an algorithm for finding (τ, δ)-stationary points of the function L_β(x, y, λ) with respect to x ∈ R^n for a fixed (y, λ) ∈ R^m × R^p. An algorithm for finding such points with respect to y ∈ R^m for a fixed (x, λ) ∈ R^n × R^p can be designed in a similar way.

Let F1(x) = L_β(x, y, λ), x ∈ R^n, where y ∈ R^m, λ ∈ R^p are fixed. For a given τ > 0 the τ-spherical subdifferential of the function F1 at a point x ∈ R^n is: D_τ F1(x) = D_{τx} L_β(x, y, λ). The convexity of the function F1 implies that

F1(x + τd) − F1(x) ≤ τ max_{ξ ∈ D_τ F1(x)} ⟨ξ, d⟩.                                  (4.1)
Proposition 4.1 Assume that x is not a (τ, δ)-stationary point of the function F1. Then for the direction d̄ = −ξ̄/||ξ̄||, where ξ̄ = argmin{||ξ|| : ξ ∈ D_τ F1(x)}, one has F1(x + τd̄) − F1(x) ≤ −τ||ξ̄||.

Proof Since x is not (τ, δ)-stationary it follows that 0_n ∉ D_τ F1(x) + B_δ(0_n). This implies that ||ξ̄|| ≥ δ. Applying the necessary condition for a minimum we have:

⟨ξ, ξ̄⟩ ≥ ||ξ̄||²  ∀ξ ∈ D_τ F1(x).

Dividing both sides by −||ξ̄||, we get:

⟨ξ, d̄⟩ ≤ −||ξ̄||  ∀ξ ∈ D_τ F1(x).

Then the proof follows from (4.1).

Since ||ξ̄|| ≥ δ we get that F1(x + τd̄) − F1(x) ≤ −τδ. This means that if x is not a (τ, δ)-stationary point then the set D_τ F1(x) can be used to find a direction of sufficient decrease. However, the calculation of the set D_τ F1(x) is not an easy task. Next, we design an algorithm which requires only a few elements from D_τ F1(x) to find directions of sufficient decrease.

Algorithm 1 Computation of descent directions
1: (Initialization). Select a constant c ∈ (0, 1) and a direction d1 ∈ S_1^n. Compute v1 ∈ ∂F1(x + τd1) and an element ξ1 = v1 + A^T λ + βA^T(A(x + τd1) + Cy − b). Set U1 := {ξ1} and k := 1.
2: (Computation of a least distance). Find ū_k as the solution to the following quadratic programming problem:

   minimize (1/2)||u||² subject to u ∈ U_k.                                  (4.2)

3: (Stopping criterion). If

   ||ū_k|| ≤ δ                                  (4.3)

   then stop. The point x is a (τ, δ)-stationary point.
4: (Computation of a search direction). Compute the search direction d_{k+1} = −||ū_k||^{−1} ū_k. If

   F1(x + τd_{k+1}) − F1(x) ≤ −cτ||ū_k||                                  (4.4)

   then stop. The direction of sufficient decrease has been found.
5: (Computation of a new subgradient). Compute the subgradient v_{k+1} ∈ ∂F1(x + τd_{k+1}) and a new element ξ_{k+1} = v_{k+1} + A^T λ + βA^T(A(x + τd_{k+1}) + Cy − b). Update the set of subgradients U_{k+1} = conv(U_k ∪ {ξ_{k+1}}). Set k := k + 1 and go to Step 2.
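A compact transcription of Algorithm 1 might look as follows. This is a sketch under our own choices: F1 and subgrad_F1 stand for x ↦ L_β(x, y, λ) with (y, λ) fixed and a routine returning one element of its subdifferential (already containing the terms A^T λ + β∇_x Φ of Steps 1 and 5), and the least-norm element in (4.2) is computed by a small quadratic program over the unit simplex solved with scipy.

```python
import numpy as np
from scipy.optimize import minimize

def min_norm_in_hull(U):
    """Least-norm element of conv{rows of U}: min ||U^T w||^2 over the unit simplex (problem (4.2))."""
    k = U.shape[0]
    G = U @ U.T
    res = minimize(lambda w: w @ G @ w, np.full(k, 1.0 / k),
                   jac=lambda w: 2.0 * G @ w,
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{'type': 'eq', 'fun': lambda w: w.sum() - 1.0}],
                   method='SLSQP')
    return res.x @ U

def descent_direction(F1, subgrad_F1, x, tau, delta, c=0.2, max_iter=500, seed=0):
    """Algorithm 1 sketch: returns ('stationary', None, u) or ('descent', d, u)."""
    rng = np.random.default_rng(seed)
    d = rng.standard_normal(x.size); d /= np.linalg.norm(d)            # Step 1
    U = [subgrad_F1(x + tau * d)]
    for _ in range(max_iter):                                          # safety cap only; termination is finite
        u = min_norm_in_hull(np.array(U))                              # Step 2, problem (4.2)
        if np.linalg.norm(u) <= delta:                                  # Step 3, criterion (4.3)
            return 'stationary', None, u
        d = -u / np.linalg.norm(u)                                      # Step 4
        if F1(x + tau * d) - F1(x) <= -c * tau * np.linalg.norm(u):     # criterion (4.4)
            return 'descent', d, u
        U.append(subgrad_F1(x + tau * d))                               # Step 5
    return 'stationary', None, u
```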
Proposition 4.2 Algorithm 1 is finite convergent. Proof Algorithm 1 has two stopping criteria: conditions (4.3) and (4.4). We prove that one of these conditions will be satisfied after finite number of steps. If none of these conditions is satisfied, then the new subgradient ξk+1 computed in Step 5 does not belong to the set Uk . Since u¯ k is the solution to (4.2) according to the necessary and sufficient condition for a minimum we have: u¯ k , u ≥ u¯ k 2 ∀u ∈ Uk . Dividing both sides by −u¯ k we get: u, dk+1 ≤ −u¯ k ∀u ∈ Uk .
(4.5)
Since the condition (4.4) is not satisfied it follows that: F1 (x + τ dk+1 ) − F1 (x) > −cτ u¯ k . Furthermore, the subgradient inequality for convex functions implies that: F1 (x + τ dk+1 ) − F1 (x) ≤ τ ξk+1 , dk+1 . Then we get: ξk+1 , dk+1 > −cu¯ k .
(4.6)
Since c ∈ (0, 1), (4.5) and (4.6) imply that ξk+1 ∈ / Uk . This means that if none of stopping criteria is satisfied, then the new subgradient allows us to significantly improve the approximation of the set Dτ F1 (x). In order to show that Algorithm 1 is finite convergent it is sufficient to prove that the condition (4.3) will be satisfied after finite number of iterations if the condition (4.4) never satisfies. For any t ∈ (0, 1), tξk+1 + (1 − t)u¯ k ∈ Uk+1 , we have: u¯ k+1 2 ≤ u¯ k + t (ξk+1 − u¯ k )2 = u¯ k 2 + 2t u¯ k , ξk+1 − u¯ k + t 2 ξk+1 − u¯ k 2 . Since the subdifferential ∂F1 (x) is bounded over a bounded set there exists K > 0 such that ξ ≤ K for all ξ ∈ Dτ F1 (x). Then ξk+1 − u¯ k ≤ 2K. From (4.6) we have ξk+1 , u¯ k ≤ cu¯ k 2 which implies that: u¯ k+1 2 ≤ u¯ k 2 + 2t u¯ k , ξk+1 − u¯ k + 4t 2 K 2 ≤ u¯ k 2 − 2t (1 − c)u¯ k 2 + 4t 2 K 2 .
Select

t = ((1 − c)||ū_k||²) / (4K²).

It is obvious that t ∈ (0, 1). Then we have:

||ū_{k+1}||² ≤ ||ū_k||² − ((1 − c)²||ū_k||⁴) / (4K²).

If ||ū_k|| > δ for any k > 0, then we get:

||ū_{k+1}||² ≤ ||ū_k||² − ((1 − c)²δ⁴) / (4K²).

Writing this inequality for all k > 0 and summing them up we have:

||ū_{k+1}||² ≤ ||ū_1||² − k((1 − c)²δ⁴) / (4K²).

If the stopping criterion (4.3) never happens, then ||ū_k||² → −∞ as k → ∞, which is a contradiction. This means that this criterion should be satisfied after a finite number of iterations. Note that ||ū_1|| ≤ K. Since K² − k(1 − c)²δ⁴/(4K²) ≥ 0 we have k ≤ 4K⁴/((1 − c)²δ⁴). Then it is obvious that the maximum number k_max of iterations to reach this condition is given by the number:

k_max = (4/(1 − c)²)(K/δ)⁴.

Next, we design an algorithm for finding (τ, δ)-stationary points of the function F1 for given τ > 0 and δ > 0.

Proposition 4.3 Assume that F1* = inf{F1(x) : x ∈ R^n} > −∞. Then Algorithm 2 finds a (τ, δ)-stationary point of the function F1 after finitely many iterations.

Proof Assume the contrary. Then the sequence {x_k} generated by Algorithm 2 is infinite. This means that at Step 2, Algorithm 1 will always generate a subgradient ū_k satisfying the condition ||ū_k|| > δ and also a descent direction d̄_k satisfying the condition: F1(x_k + τd̄_k) − F1(x_k) ≤ −c1τ||ū_k||. Since c2 ≤ c1 it follows that: F1(x_k + τd̄_k) − F1(x_k) ≤ −c2τ||ū_k||.
Algorithm 2 Finding (τ, δ)-stationary points
1: (Initialization). Select a starting point x1, constants c1 ∈ (0, 1) and c2 ∈ (0, c1]. Set k := 1.
2: (Finding search directions). Apply Algorithm 1 at the point x_k using c = c1. This algorithm terminates after a finite number of steps either satisfying the condition (4.3) or finding the descent direction d̄_k ∈ S_1^n and the subgradient ū_k ∈ D_τ F1(x_k) satisfying the condition (4.4).
3: (Stopping criterion). If the condition (4.3) is satisfied then the algorithm terminates and the point x_k is a (τ, δ)-stationary point.
4: (Line search). If the condition (4.4) is satisfied then find the step-length α_k > 0 such that

   α_k = sup{α > 0 : F1(x_k + αd̄_k) − F1(x_k) ≤ −c2α||ū_k||}.                                  (4.7)

5: (Updating iteration). Set x_{k+1} := x_k + α_k d̄_k, k := k + 1 and go to Step 2.
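Continuing the previous sketch, Algorithm 2 is an outer loop around the descent-direction search; the supremum in (4.7) is approximated below by a simple doubling line search started at α = τ, which preserves the property α_k ≥ τ used in the convergence argument. Again this is an illustrative sketch (reusing descent_direction from the sketch after Algorithm 1), not the authors' Fortran implementation.

```python
import numpy as np

def find_tau_delta_stationary(F1, subgrad_F1, x0, tau, delta, c1=0.2, c2=0.1, max_outer=10000):
    """Algorithm 2 sketch: iterate x_{k+1} = x_k + alpha_k d_k until (tau, delta)-stationarity."""
    x = x0.copy()
    for _ in range(max_outer):
        status, d, u = descent_direction(F1, subgrad_F1, x, tau, delta, c=c1)   # Steps 2-3
        if status == 'stationary':
            return x
        norm_u = np.linalg.norm(u)
        alpha = tau                                                             # Step 4: approximate (4.7)
        while F1(x + 2.0 * alpha * d) - F1(x) <= -c2 * 2.0 * alpha * norm_u:
            alpha *= 2.0
        x = x + alpha * d                                                       # Step 5
    return x
```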
Therefore, in Step 4 of Algorithm 2 the step-length α_k ≥ τ > 0. This implies that:

F1(x_k + α_k d̄_k) − F1(x_k) ≤ −c2τ||ū_k||,

or F1(x_{k+1}) − F1(x_k) ≤ −c2τ||ū_k|| < −c2τδ. Then we get:

F1(x_{k+1}) < F1(x1) − kc2τδ.                                  (4.8)

Since the sequence {x_k} is infinite, F1(x_k) → −∞ as k → ∞. This contradicts the assumption of the proposition and, therefore, the sequence {x_k} is finite. From the definition of F1* and the inequality (4.8) we obtain that F1* < F1(x1) − kc2τδ and, therefore, the maximum number k_max of iterations to find a (τ, δ)-stationary point of the function F1 is

k_max = (F1(x1) − F1*) / (c2τδ).
The proof is complete.
5 Convergence of SGBUN Method

In this section, we study the convergence of the SGBUN method. Our convergence analysis is similar to that used in [3]. The Lagrange function for the problem (1.1) is:

L0(x, y, λ) = f1(x) + f2(y) + ⟨λ, Ax + Cy − b⟩,  (x, y, λ) ∈ R^n × R^m × R^p.
The following assumptions on Problem (1.1) are used to prove the convergence of the SGBUN method:

Assumption 5.1 The functions f1, f2 : R^n → R are Lipschitz, proper convex.

Assumption 5.2 The Lagrangian L0 has a saddle point, that is, there exists (x*, y*, λ*), not necessarily unique, such that:

L0(x*, y*, λ) ≤ L0(x*, y*, λ*) ≤ L0(x, y, λ*)  ∀ x ∈ R^n, y ∈ R^m, λ ∈ R^p.

Let (x*, y*, λ*) be a saddle point for L0. For given x_k ∈ R^n, y_k ∈ R^m and λ_k ∈ R^p define

V_k = (1/β)||λ_k − λ*||² + β||C(y_k − y*)||²,                                  (5.1)
and pk = f1 (xk ) + f2 (yk ), p∗ = f1 (x ∗ ) + f2 (y ∗ ), rk = Axk + Cyk − b, sk = xk − x ∗ , tk = yk − y ∗ . The following proposition was proved in [3]. Proposition 5.3 The sequence {pk } satisfies the following inequality for all k > 0: p∗ − pk+1 ≤ λ∗ , rk+1 .
(5.2)
Consider the following functions: ϕk (x) = f1 (x) + λk+1 − βC(yk+1 − yk ), Ax , and ψk (y) = f2 (y) + λk+1 , Cy . Proposition 5.4 The following inequality holds for all k > 0: pk+1 − p∗ ≤ − λk+1 , rk+1 − β C(yk+1 − yk ), −rk+1 + C(yk+1 − y ∗ ) +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k . (5.3) Here ε1k and ε2k are defined by applying the formula (2.3) to the function ϕk at the point xk+1 and to the function ψk at the point yk+1 , respectively.
Proof Since xk+1 is a (τk , δk )-stationary point of the function Lβ (x, yk , λk ) it follows that: 0n ∈ Dτk x Lβ (xk+1 , yk , λk ) + Bδk (0n ). Applying (3.9) we get that Dτk x Lβ (xk+1 , yk , λk ) ⊂ Dτk f1 (xk+1 ) + AT λk + β∇x (xk+1 , yk ) + BβM1 τk (0n ). Therefore, 0n ∈ Dτk f1 (xk+1 ) + AT λk + βAT (Axk+1 + Cyk − b) + BβM1 τk +δk (0n ). Since λk+1 = λk + βrk+1 we have 0n ∈ Dτ f1 (xk+1 ) + AT (λk+1 − βC(yk+1 − yk )) + BβM1 τk +δk (0n ).
(5.4)
It is obvious that the τ -spherical subdifferential of the function ϕk at x ∈ Rn is: Dτ ϕk (x) = Dτ f1 (x) + AT (λk+1 − βC(yk+1 − yk )). Then it follows from (5.4) that 0n ∈ Dτ ϕk (xk+1 ) + BβM1 τk +δk (0n ) and therefore, xk+1 is a (τk , βM1 τk + δk )-stationary point of the function ϕk . In a same manner we can prove that the point yk+1 is a (τk , βM2 τk +δk )-stationary point of the function ψk . Applying Proposition 2.8 to the convex functions ϕk and ψk we have f1 (xk+1 ) + λk+1 − βC(yk+1 − yk ), Axk+1 ≤ f1 (x ∗ ) + λk+1 − βC(yk+1 − yk ), Ax ∗ +ε1k + (βM1 τk + δk )sk+1 , and f2 (yk+1 ) + λk+1 , Cyk+1 ≤ f2 (y ∗ ) + λk+1 , Cy ∗ + ε2k + (βM2 τk + δk )tk+1 . Adding these two inequalities, we obtain f1 (xk+1 ) + f2 (yk+1 ) + λk+1 , Axk+1 + Cyk+1 − β C(yk+1 − yk ), Axk+1 ≤ f1 (x ∗ ) + f2 (y ∗ ) + λk+1 , Ax ∗ + Cy ∗ − β C(yk+1 − yk ), Ax ∗ +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k ,
which implies that (taking into account that Ax ∗ + Cy ∗ = b): pk+1 − p∗ ≤ − λk+1 , rk+1 + b + β C(yk+1 − yk ), A(xk+1 − x ∗ ) + λk+1 , b +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k = − λk+1 , rk+1 + β C(yk+1 − yk ), A(xk+1 − x ∗ ) +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k = − λk+1 , rk+1 +β C(yk+1 − yk ), Axk+1 + Cyk+1 − b − Cyk+1 + b − Ax ∗ +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k = − λk+1 , rk+1 − β C(yk+1 − yk ), −rk+1 + C(yk+1 − y ∗ ) +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k . This completes the proof.
Proposition 5.5 For all k > 0, we have: V k+1 ≤ V k − βrk+1 2 − βC(yk+1 − yk )2 + εk ,
(5.5)
where εk = 2δk (sk+1 + 2tk+1 ) + 2δk−1 tk + 2βτk (M1 sk+1 + 2M2 tk+1 ) +2βM2 τk−1 + 2(ε1k + 2ε2k + ε2,k−1).
(5.6)
Proof From (5.2) and (5.3), we get: − λk+1 − λ∗ , rk+1 − β C(yk+1 − yk ), −rk+1 + C(yk+1 − y ∗ ) +δk (sk+1 + tk+1 ) + βτk (M1 sk+1 + M2 tk+1 ) + ε1k + ε2k ≥ 0, which can be rewritten as: 2 λk+1 − λ∗ , rk+1 − 2β C(yk+1 − yk ), rk+1 +2β C(yk+1 − yk ), C(yk+1 − y ∗ ) − 2δk (sk+1 + tk+1 ) −2βτk (M1 sk+1 + M2 tk+1 ) − 2(ε1k + ε2k ) ≤ 0.
(5.7)
Since λ_{k+1} = λ_k + βr_{k+1} we have:

2⟨λ_{k+1} − λ*, r_{k+1}⟩ = 2⟨λ_k − λ*, r_{k+1}⟩ + 2β||r_{k+1}||²
  = (2/β)⟨λ_k − λ*, λ_{k+1} − λ_k⟩ + (1/β)||λ_{k+1} − λ_k||² + β||r_{k+1}||²
  = (1/β)(||λ_{k+1} − λ*||² − ||λ_k − λ*||²) + β||r_{k+1}||².                                  (5.8)
From (5.1), (5.7), and (5.8), we obtain:
V k+1 − V k − β C(yk+1 − y ∗ )2 − C(yk − y ∗ )2 +βrk+1 2 − 2β C(yk+1 − yk ), rk+1 + 2β C(yk+1 − yk ), C(yk+1 − y ∗ ) −2δk (sk+1 + tk+1 ) − 2βτk (M1 sk+1 + M2 tk+1 ) − 2(ε1k + ε2k ) ≤ 0. (5.9) In addition, by replacing yk+1 − y ∗ = (yk+1 − yk ) + (yk − y ∗ ) in the term β C(yk+1 − yk ), C(yk+1 − y ∗ ) we have
V k+1 − V k − β C(yk+1 − y ∗ )2 − C(yk − y ∗ )2 +βrk+1 − C(yk+1 − yk )2 +βC(yk+1 − yk )2 + 2β C(yk+1 − yk ), C(yk − y ∗ ) . −2δk (sk+1 + tk+1 ) − 2βτk (M1 sk+1 + M2 tk+1 ) −2(ε1k + ε2k ) ≤ 0.
(5.10)
Furthermore, by substituting yk+1 − yk = (yk+1 − y ∗ ) − (yk − y ∗ ) in the terms βC(yk+1 − yk )2 + 2β C(yk+1 − yk ), C(yk − y ∗ ) , we get V k+1 − V k + βrk+1 − C(yk+1 − yk )2 −2δk (sk+1 + tk+1 ) − 2βτk (M1 sk+1 + M2 tk+1 ) − 2(ε1k + ε2k ) ≤ 0, (5.11) which can be rewritten as: V k − V k+1 ≥ βrk+1 − C(yk+1 − yk )2 −2δk (sk+1 + tk+1 ) − 2βτk (M1 sk+1 + M2 tk+1 ) − 2(ε1k + ε2k ). (5.12) Recall that the point yk+1 is a (τk , βM2 τk + δk )-stationary point of the function ψk (y) = f2 (y) + λk+1 , Cy and yk is a (τk−1 , βM2 τk−1 + δk−1 )-stationary point of the function ψk−1 (y) = f2 (y) + λk , Cy . It follows from Proposition 2.8 that f2 (yk+1 ) + λk+1 , Cyk+1 ≤ f2 (yk ) + λk+1 , Cyk + (βM2 τk + δk )tk+1 + ε2k ,
and f2 (yk ) + λk , Cyk ≤ f2 (yk+1 ) + λk , Cyk+1 + (βM2 τk−1 + δk−1 )tk + ε2,k−1. Adding these two inequalities we get: λk+1 −λk , C(yk+1 −yk ) ≤ (βM2 τk +δk )tk+1 +(βM2 τk−1 +δk−1 )tk +ε2k +ε2,k−1 , or β rk+1 , C(yk+1 − yk ) ≤ (βM2 τk + δk )tk+1 + (βM2 τk−1 + δk−1 )tk + ε2k + ε2,k−1 . Then from (5.12) we have: V k − V k+1 ≥ βrk+1 2 + βC(yk+1 − yk )2 −2(βM2 τk + δk )tk+1 − 2(βM2 τk−1 + δk−1 )tk − 2(ε2k + ε2,k−1 ) −2δk (sk+1 + tk+1 ) − 2βτk (M1 sk+1 + M2 tk+1 ) − 2(ε1k + ε2k ), or V k − V k+1 ≥ βrk+1 2 + βC(yk+1 − yk )2 − 2δk (sk+1 + 2tk+1 ) − 2δk−1 tk −2βτk (M1 sk+1 + 2M2 tk+1 ) − 2βM2 τk−1 −2(ε1k + 2ε2k + ε2,k−1 ). This completes the proof.
In order to get the convergence of the SGBUN method we need one additional assumption. Let Xτ δ (y, λ, β) be a set of (τ, δ)-stationary points of the function Lβ (x, y, λ) on Rn with respect to x and Yτ δ (x, λ, β) be a set of (τ, δ)-stationary points of the function Lβ (x, y, λ) on Rm with respect to y. Assumption 5.6 There exist τ0 > 0, δ0 > 0 and β0 > 0 such that for any τ ∈ (0, τ0 ), δ ∈ (0, δ0 ) and β > β0 1. the set Xτ δ (y, λ, β) is bounded for all y ∈ Rm and λ ∈ Rp ; 2. the set Yτ δ (x, λ, β) is bounded for all x ∈ Rn and λ ∈ Rp . Proposition 2.7 shows that this assumption is equivalent to the assumption that the set of stationary points of the function Lβ (x, y, λ) with respect to x (y) is bounded for all y ∈ Rm and λ ∈ Rp (x ∈ Rn and λ ∈ Rp ). This latter assumption holds, for example, when the number β is sufficiently large and functions f1 and f2 are coercive. Proposition 5.7 Let {εk } be a sequence defined by (5.6) and Assumption 5.6 holds. Then εk → 0 as k → ∞.
Proof Since Assumption 5.6 holds the sequences {sk } and {tk } are bounded. Then the first four terms in the definition of the sequence {εk } converge to 0 as k → ∞. Furthermore, boundedness of the subdifferentials over bounded sets and the continuity of the functions f1 and f2 imply that ε1k , ε2k → 0 as k → ∞ and therefore, the fifth term in (5.6) also converges to 0. Remark 5.8 It follows from the expression (5.6) and also the continuity of the functions f1 and f2 that the sequence εk depends on the choice of the sequences {τk } and {δk }. These sequences can be chosen so that M=
∑_{k=1}^{∞} ε_k < +∞.                                  (5.13)
The following useful proposition will be used to prove the convergence of the proposed algorithm. Proposition 5.9 Let {vk }, {bk }, {θk } be sequences such that vk ≥ 0, bk ≥ 0, θk ≥ 0 for all k ≥ 1, θk → 0 as k → ∞ and vk+1 ≤ vk − bk + θk , k = 1, 2, . . . . Then bk → 0 as k → ∞. Proof Assume the contrary, that is bk → b0 > 0 as k → ∞. This means that there exists k1 > 1 such that bk > b0 /2 for all k > k1 . Since θk → 0 as k → ∞ there exists k2 > k1 such that θk < b0 /4 for all k > k2 . Therefore, vk+1 ≤ vk − b0 /4, ∀ k > k2 and vk − vk2 +1 ≤ −(k − k2 − 1)b0/4 ∀ k > k2 + 1. This implies that vk → −∞ as k → ∞ which is a contradiction.
Proposition 5.10 Suppose that Assumptions 5.1, 5.2, and 5.6 hold for Problem (1.1) and the sequences {τk } and {δk } are chosen such that the sequence {εk }, defined in (5.6), satisfies the condition (5.13). Then rk → 0 and pk → p∗ as k → ∞. Proof According to Assumptions 5.1 and 5.2 Propositions 5.3, 5.4, and 5.5 are true. Moreover, Assumption 5.6 implies that sequences {xk } and {yk } belong to bounded sets, and therefore, the sequences {sk } and {tk } are bounded. It follows from Proposition 5.9 and (5.5) that
lim_{k→∞} ( ||r_{k+1}||² + ||C(y_{k+1} − y_k)||² ) = 0
and therefore, rk → 0 as k → ∞. Applying (5.5) and the condition (5.13) we get that 0 ≤ Vk ≤ V0 +
∑_{j=1}^{k−1} ε_j ≤ V_0 + M.
This means that the sequence {V k } is bounded, and it follows from its definition that the sequence {λk } is also bounded. This and also Propositions 5.3 and 5.4 imply that pk → p∗ as k → ∞.
6 Computational Results

In this section, we present computational results on the performance of the proposed SGBUN method using 10 academic test problems. Since there are no test problems for ADMM algorithms in the literature, we design new test problems. The detailed descriptions of these problems are given in the Appendix. We compare the SGBUN method with the proximal bundle method for solving general unconstrained nonsmooth optimization problems (PBUN) and also with the proximal bundle method for solving general nonsmooth optimization problems subject to linear constraints (PBUNL). The PBUN method is applied to solve the unconstrained subproblems (3.1) and (3.2) in the first two steps of the ADMM. The PBUNL method is directly applied to solve (1.1) as a nonsmooth optimization problem subject to linear constraints. The implementation of these methods is discussed in [17].

In the implementation of SGBUN, we set the initial value of the penalty parameter β to 0.1. Then this parameter is increased using the formula β = 10 × β if the constraint violation is more than 10⁻⁸. However, this increase stops when the parameter β is greater than 10³. If the constraint violation is less than 10⁻⁸, then the algorithm stops and the solution obtained with this value of β is accepted as a final solution. The same rule for updating the parameter β is also used in the implementation of the PBUN method.

All methods were implemented in Fortran 95, compiled using the g95 compiler, and the calculations were carried out on a 2.90 GHz Intel Core i5-3470S machine with 8 GB of RAM. We compare the efficiency of the methods in terms of the number of function and subgradient evaluations. For most of these problems, the CPU time used by all three methods is close to 0 and, therefore, we do not include CPU times. Results with the starting point given in the description of each test problem and with 20 randomly generated starting points are presented separately. In the former case we present results in Table 1, and in the latter case we analyze the results using the performance profiles given in Fig. 1. For details of performance profiles, see [6].
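The β-update rule described above amounts to the following small routine (a sketch of the logic only; the violation measure ||Ax̄_k + Cȳ_k − b|| and all names are ours).

```python
def update_beta(beta, violation, tol=1e-8, factor=10.0, beta_max=1e3):
    """Penalty schedule used in the experiments: multiply beta by 10 while the constraint
    violation exceeds 1e-8, stop increasing once beta exceeds 10^3, and stop the whole
    run as soon as the violation drops below the tolerance."""
    if violation <= tol:
        return beta, True          # accept the current solution and terminate
    if beta <= beta_max:
        beta *= factor
    return beta, False
```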
Table 1 Results using 10 test problems with a single starting point (f = objective value obtained, nf = number of function evaluations, ng = number of subgradient evaluations)

| Problem | SGBUN f | SGBUN nf | SGBUN ng | PBUN f | PBUN nf | PBUN ng | PBUNL f | PBUNL nf | PBUNL ng |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 4.0833 | 4509 | 3350 | 4.0833 | 1645 | 1644 | 4.0833 | 9 | 9 |
| 2 | 6.1664 | 12151 | 2716 | 6.1664 | 1370 | 1369 | 6.1664 | 18 | 18 |
| 3 | 38.0000 | 1700 | 1212 | 38.0000 | 741 | 740 | 38.0000 | 10 | 10 |
| 4 | 163.0000 | 5160 | 3147 | 163.0000 | 1123 | 1122 | 163.0000 | 13 | 13 |
| 5 | 13.6666 | 37443 | 29623 | 13.6666 | 25200 | 25199 | 20.3816 | 196 | 196 |
| 6 | 13.0000 | 2448 | 1601 | 13.0000 | 18751 | 18750 | 13.0000 | 16 | 16 |
| 7 | 0.5000 | 5006 | 3622 | 0.5000 | 8495 | 8494 | 0.5000 | 23 | 23 |
| 8 | 5.6000 | 3620 | 2559 | 5.6000 | 2015 | 2014 | 5.6000 | 31 | 31 |
| 9 | 1.4432 | 13959 | 10363 | 1.4432 | 72584 | 72580 | 1.4432 | 94 | 94 |
| 10 | 28.7031 | 2364 | 1798 | 28.7031 | 1095 | 1094 | 28.7031 | 72 | 72 |
Fig. 1 Performance profiles of methods using 10 test problems with 20 random starting points. (a) No.func.eval, SGBUN vs PBUN. (b) No. subgrad. eval, SGBUN vs PBUN. (c) No.func.eval, SGBUN vs PBUNL. (d) No. subgrad. eval, SGBUN vs PBUNL. (e) No. func. eval, PBUNL vs PBUN. (f) No. subgrad. eval, PBUNL vs PBUN
The results presented in Table 1 show that PBUNL is the most efficient method. It requires significantly less computational effort than the other two methods. However, this method fails to find the solution of Problem 5. This is due to the choice of the starting point, which is far from the solution. The SGBUN and PBUN methods find solutions in all test problems. These methods require significantly more computational effort than the PBUNL method. This is due to the fact that these two methods are applied repeatedly at each iteration of the ADMM scheme. The numbers of function and subgradient evaluations used by the PBUN and SGBUN methods are comparable.

The performance profiles presented in Fig. 1 demonstrate that the SGBUN and PBUN methods are comparable in terms of both efficiency and robustness. We can see that SGBUN requires more function evaluations than PBUN; however, the former method uses fewer subgradient evaluations than the latter. This is due to the fact that the SGBUN method uses significantly more function evaluations in the line search than the PBUN method. These two methods are more robust than the PBUNL method, even though the latter requires fewer function and subgradient evaluations and is more efficient.
7 Conclusions In this chapter, we introduced a method for solving convex programming problems with two blocks of variables subject to linear constraints. The new method is based on the alternating direction method of multipliers and uses the L2 penalty functions. In the proposed method subproblems are solved approximately and a finite convergent algorithm is designed to solve these subproblems. The convergence of the proposed method is studied. We designed new problems for testing ADMM methods. Using these problems we compared the performance of the proposed method with that of the proximal bundle method for general unconstrained convex problems and the proximal bundle method for linearly constrained nonsmooth optimization problems. The proximal bundle method for linearly constrained nonsmooth optimization problems is more efficient than the other two methods, however, this method is less robust. Results show that the proposed method is a good alternative to the proximal bundle method in the ADMM scheme for solving convex programming problems with two blocks of variables subject to linear constraints. Acknowledgements The authors would like to thank the anonymous referee for valuable comments that helped to improve the quality of this paper. This research was started when Dr. A.M. Bagirov visited Chongqing Normal University and the visit was supported by this university. The research by A.M. Bagirov and S.Taheri was also supported by Australian Research Council’s Discovery Projects funding scheme (Project No. DP190100580).
Appendix

This appendix contains all test problems used in numerical experiments. In these problems the objective functions are presented as: F(x, y) = f1(x) + f2(y). Therefore, in the description of test problems only the functions f1 and f2 are presented, and the following notations are used:

• (x⁰, y⁰) ∈ R^n × R^m —starting point;
• (x*, y*) ∈ R^n × R^m —known best solution;
• f* —known best value.

Problem 1
Dimension: 3 (2,1),
Component functions:
f1(x) = max_{i=1,2,3} f^i(x),  f2(y) = max_{i=4,5,6} f^i(y),
f^1(x) = x1² + x2⁴,  f^2(x) = (2 − x1)² + (2 − x2)²,  f^3(x) = exp(x2 − x1),
f^4(y) = y1²,  f^5(y) = (2 − y1)²,  f^6(y) = exp(−y1),
Linear constraints: −x1 − x2 + y1 = −0.5,
Starting point: (x⁰, y⁰) = (2, 2, 2)^T,
Optimum point: (x*, y*) = (0.8333333, 0.8333333, 1.1666667)^T,
Optimum value: f* = 4.0833333.

Problem 2
Dimension: 4 (2,2),
Component functions:
f1(x) = max{f11(x), f12(x), f13(x)},  f2(y) = max{f21(y), f22(y), f23(y)},
f11(x) = x1⁴ + x2²,  f12(x) = (2 − x1)² + (2 − x2)²,  f13(x) = 2e^{−x1+x2},
f21(y) = y1² − 2y1 + y2² − 4y2 + 4,  f22(y) = 2y1² − 5y1 + y2² − 2y2 + 4,  f23(y) = y1² + 2y2² − 4y2 + 1,
Linear constraints: x1 + x2 − y1 + y2 = 0,  −3x1 + 5x2 − y1 − 2y2 = 0,
Starting point: (x⁰, y⁰) = (2, 2, 2, 2)^T,
Optimum point: (x*, y*) = (0.4964617, 0.6982659, 1.4638, 0.2690723)^T,
Optimum value: f* = 6.1663599.
40
A. M. Bagirov et al.
Problem 3 Dimension: 4 (2,2), Component functions: f1 (x) = |x1 − 1| + 200 max{0, |x1 | − x2 }, f2 (y) = 100(|y1| − y2 ), Linear constraints: x1 − x2 + 2y1 − 3y2 = 1, −x1 + 2x2 − 2y1 − y2 = 6, Starting point: (x 0 , y 0 ) = (−1.2, 1, 1, 1)T , Optimum point: (x ∗ , y ∗ ) = (5.6666667, 5.6666667, 0, −0.3333333)T , Optimum value: f ∗ = 38. Problem 4 Dimension: 10 (3,7), Component functions: f1 (x) = 9|x1 | + |x2 | + |x3|, 7 f2 (y) = i|yi |, i=1
Linear constraints: x1 + x2 − y1 = 1, x1 − y2 = −10, x1 + y3 = 10, x2 − y4 = 1, x2 + y5 = 10, x3 − y6 = −10, x3 + y7 = 1, Starting point: x^0 = (1, 1, 1)^T, y^0 = (2, . . . , 2)^T, Optimum point: x^* = (0, 1, 0.0000001)^T, y^* = (0, 10, 10, 0, 9, 10, 1)^T, Optimum value: f^* = 163. Problem 5 Dimension: 10 (3,7), Component functions: f1(x) = |x1 + 3x2 + x3| + 4|x1 − x2|, f2(y) = 10 Σ_{i=1}^{4} max{0, −yi} + Σ_{i=1}^{7} i|yi|, Linear constraints: −x1 + 6x2 + 4x3 − y1 = 3, x1 + x2 + x3 = 1, x1 + x2 − y5 = 0, x2 + x3 − y6 = 0, x1 + x3 − y7 = 0,
Starting point: x^0 = (1, −2, 3)^T, y^0 = (2, . . . , 2)^T, Optimum point: x^* = (0.3333408, 0.3333408, 0.3333237)^T, y^* = (0, 0, 0, 0, 0.6666869, 0.6666638, 0.6666638)^T, Optimum value: f^* = 13.6673059. Problem 6 Dimension: 10 (3,7), Component functions: f1(x) = 9 + 8x1 + 6x2 + 4x3 + 2|x1| + 2|x2| + |x3| + |x1 + x2| + |x1 + x3|, f2(y) = 2|y1| + |y2| + |y3| + 10 Σ_{i=1}^{4} max{0, −y_{i+3}}, Linear constraints: x1 + x2 + 2x3 + y4 = 3, 4x1 − 2x2 − x3 − 3y1 + y2 − y3 = 6, x1 + x2 − y5 = 0, x2 + x3 − y6 = 0, x1 + x3 − y7 = 0, Starting point: x^0 = (−1, 2, −3)^T, y^0 = (1, . . . , 1)^T, Optimum point: x^* = (0, 0, 0)^T, y^* = (−2, 0, 0, 3, 0, 0, 0)^T, Optimum value: f^* = 13. Problem 7 Dimension: 12 (7,5), Component functions: f1(x) = Σ_{i=1}^{6} |xi − x_{i+1}|, f2(y) = 10 Σ_{j=1}^{5} max{0, yj},
Linear constraints: x1 + x2 + 2x3 − y1 = 3, x2 − x3 + x4 − 3y2 = 1, x1 + x5 − y1 − y3 = 1, x1 + x6 − y1 − y4 = 2, x1 + x7 − y1 − y5 = 1, Starting point: x 0 = (1, 2, 3, 4, −5, −6, −7)T , y 0 = (1, −2, 3, −4, 5)T , Optimum point: x ∗ = (0.75, 0.75, 0.75, 0.3381465, 0.25, 0.25, 0.25)T , y ∗ = (0, 0.2206178, 0, −1, 0)T , Optimum value: f (x ∗ , y ∗ ) = 0.5.
Problem 8 Dimension: 14 (8,6), Component functions: f1(x) = Σ_{i=1}^{8} max{0, xi}, f2(y) = |y1| + 2|y2| + 2|y3| + |y4| + |y5| + 5|y6| + y2 − y5 − 4y6, Linear constraints: x1 + 2x2 + x3 + x4 + y1 = 5, 3x1 + x2 + 2x3 − x4 + y2 = 4, x5 + 2x6 + x7 + x8 + y3 = 5, 3x5 + x6 + 2x7 − x8 + y4 = 4, x2 + 4x3 − y5 = 1.5, Starting point: x^0 = (1, . . . , 1)^T, y^0 = (2, . . . , 2)^T, Optimum point: x^* = (0.6, 2.2, 0, 0, 0.6, 2.2, 0, 0)^T, y^* = (0, 0, 0, 0, 0.7, 0)^T, Optimum value: f^* = 5.6.
Problem 9 Dimension: 17 (10,7), Component functions: f1(x) = |x1| + |x2| + |x1 + x2| − x1 − 2x2 + |x3 − 10| + 4|x4 − 5| + |x5 − 3| + 2|x6 − 1| + 5|x7| + 7|x8 − 11| + 2|x9 − 10| + |x10 − 7|, f2(y) = |y1| + |y2| + 10 Σ_{i=3}^{6} max{0, yi − y_{i+1}}, Linear constraints: 4x1 + 5x2 − 3x7 + 9x8 + 2y1 + y3 + y4 = 105, 10x1 − 8x2 − 17x7 + 2x8 + y2 + y4 + 5y5 = 0, 8x1 − 2x2 − 5x9 + 2x10 − 3y5 − y6 = −12, 3x1 − 6x2 − 12x9 + 7x10 − 2y6 − y7 = −96, Starting point: x^0 = (2, . . . , 2)^T, y^0 = (1, . . . , 1)^T, Optimum point: x^* = (1.4431818, 1.6136364, 10, 5, 3, 1, 0, 11, 10, 7)^T, y^* = (0, 0, −3.9204544, −3.9204544, −3.9204544, −3.9204544, 27.4886356)^T, Optimum value: f^* = 1.4431818.
Problem 10 Dimension: 30 (15,15), Component functions: f1(x) = 15 max_{i=1,...,15} |xi|, f2(y) = Σ_{i=1}^{15} |yi|,
Linear constraints: xi + 2xi+1 − 3xi+2 − yi − 2yi+2 = i − 1, i = 1, . . . , 13, Starting point: x 0 = (x1 , . . . , x15 )T , xi = i, i ≤ 7, xi = −i, i > 7, y 0 = (y1 , . . . , y15 )T , yi = −i, i ≤ 7, yi = i, i > 7, Optimum point: x ∗ = (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)T , y ∗ = (0, 0, 0, −0.5, −1, −1.25, −1.5, −1.88, −2.25, −2.56, −2.86, −3.22, −3.56, −3.89, −4.22)T , Optimum value: f ∗ = 28.703125.
Tangent and Normal Cones for Low-Rank Matrices Seyedehsomayeh Hosseini, D. Russell Luke, and André Uschmajew
Abstract In (D. R. Luke, J. Math. Imaging Vision, 47 (2013), 231–238) the structure of the Mordukhovich normal cone to varieties of low-rank matrices at rank-deficient points has been determined. A simplified proof of that result is presented here. As a corollary we obtain the corresponding Clarke normal cone. The results are put into the context of first-order optimality conditions for low-rank matrix optimization problems. Keywords Matrix optimization · Low rank constraint · Optimality conditions Mathematics Subject Classification (2000) Primary 15B99, 49J52; Secondary 65K10
1 Introduction In continuous optimization problems, necessary conditions for local minima relate the first-order geometries of a constraint set M and a sublevel set of a cost function f with each other, in order to give a rigorous meaning to the intuition that descent directions of f should point “away” from M at such a point. If M is a smooth
S. Hosseini Hausdorff Center for Mathematics & Institute for Numerical Simulation, University of Bonn, Bonn, Germany e-mail: [email protected] D. R. Luke Institute for Numerical and Applied Mathematics, University of Göttingen, Göttingen, Germany e-mail: [email protected] A. Uschmajew () Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany e-mail: [email protected] © Springer Nature Switzerland AG 2019 S. Hosseini et al. (eds.), Nonsmooth Optimization and Its Applications, International Series of Numerical Mathematics 170, https://doi.org/10.1007/978-3-030-11370-4_3
submanifold in R^n and f is continuously differentiable in a neighborhood of x ∈ M, then a necessary condition for x being a local minimum of f on M is that the anti-gradient −∇f(x) belongs to the normal space, that is, is orthogonal to the tangent space T_M(x) at x: −∇f(x) ∈ N_M(x) := (T_M(x))^⊥. When the set M is just closed, but not necessarily smooth, the normal space in this optimality condition has to be replaced by the polar of the Bouligand tangent cone T^B_M(x) [4], which we will call the Bouligand normal cone: −∇f(x) ∈ N^B_M(x) := (T^B_M(x))°. See Eq. (2.4) for the general definition of T^B_M(x). More generally, when the function f is just locally Lipschitz continuous, but not necessarily differentiable at a local minimum, generalized derivatives and subgradients need to be considered. First-order optimality conditions then take the form
0 ∈ NM (x) + ∂f (x),
(1.1)
where N_M(x) is a certain closed convex cone of "normal directions," and ∂f(x) is the corresponding subdifferential containing all subgradients at x. Well-known normal cones are the Clarke normal cone N^C_M(x) and the Mordukhovich normal cone N^M_M(x); see Sect. 3. Intuitively, the necessary first-order condition (1.1) becomes stronger the narrower the considered normal cone N_M(x) is, and hence provides more information about points satisfying it (such points are called critical points). Concerning the normal cones mentioned so far, it holds that
N^B_M(x) ⊆ N^M_M(x) ⊆ N^C_M(x). (1.2)
The aim of the present work is to determine the corresponding normal cones for the set M = M≤k of matrices of rank at most k, and to show that both inclusions in (1.2) are strict at all singular points of M≤k , that is, at matrices of rank strictly less than k (at matrices of rank k all three cones are equal). This has interesting implications to necessary optimality conditions in matrix optimization problems with low-rank constraints, as shall be discussed at the end of this note.
2 Preliminaries We consider the linear space RM×N of M × N matrices. The Euclidean structure in this space is provided by the Frobenius inner product. In the following, k is a fixed integer that satisfies k ≤ kmax := min(M, N). Given a real valued function f on RM×N , we are then interested in first-order optimality conditions for the minimization problem min f (X),
X ∈ M≤k ,
(2.1)
where M≤k = {X ∈ RM×N : rank(X) ≤ k} is the set of matrices of rank at most k. This set is a real algebraic variety and closed due to lower-semicontinuity of matrix rank. The characterization of optimality conditions for (2.1) becomes nontrivial in particular at the singular points of the variety M≤k , which are the matrices of rank strictly less than k. Therefore concepts from nonsmooth analysis will have to be used. It will be frequently convenient to view matrices as elements of the tensor product of their column and row spaces. By this we mean that a matrix X admits a decomposition X = U ΣV T s.t. range(U ) ⊆ U , range(V ) ⊆ V
if and only if
X ∈ U ⊗ V. (2.2)
Moreover, the smallest dimensions that U and V can have are both equal to the rank of X. In this case, the choice of these spaces is unique, namely U = range(X) has to be the column space, and V = range(X^T) the row space of X. We recall that a rank-k matrix X admits a singular value decomposition (SVD), which is a particular decomposition of the form (2.2) in which U = [u1, . . . , uk] and V = [v1, . . . , vk] are orthonormal bases for the column and row spaces of X, respectively, and Σ is a diagonal matrix with decreasing positive diagonal entries: X = Σ_{r=1}^{k} σr ur vr^T, σ1 ≥ σ2 ≥ · · · ≥ σk > 0. The rank-one matrices ur vr^T in this sum are then mutually orthogonal in the Frobenius inner product. A remarkable fact is that a truncation Σ_{r=1}^{s} σr ur vr^T of an SVD to rank s < k yields a best approximation to X in the set M≤s in Frobenius norm. This metric projection is unique if and only if σs > σs+1.
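As an illustration of the truncation statement above, the following NumPy sketch (ours, with purely illustrative names) computes the metric projection of a matrix onto M≤s by keeping the s leading singular triplets.

```python
import numpy as np

def project_rank_at_most(X, s):
    """Best Frobenius-norm approximation of X in M_{<=s}: keep s leading singular triplets."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :s] * sigma[:s]) @ Vt[:s, :]

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 5))
Xs = project_rank_at_most(X, 2)
print(np.linalg.matrix_rank(Xs))        # 2
# the approximation error equals the norm of the discarded singular values
sigma = np.linalg.svd(X, compute_uv=False)
print(np.linalg.norm(X - Xs), np.linalg.norm(sigma[2:]))
```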
2.1 Tangent and Normal Space of a Fixed Rank Manifold The real algebraic variety M≤k stratifies into the smooth submanifolds Ms = {X ∈ RM×N : rank(X) = s} of matrices of rank exactly s. Being a smooth manifold, every Ms is prox-regular in a neighborhood of each of its points; see [1]. Hence all usual notions of tangent cones coincide with the tangent space. Moreover, all notions of normal cones coincide with the normal space [3]. The tangent spaces to Ms are well known; see, e.g., [5, Ex. 14.16] or [6, Prop. 4.1]. Theorem 2.1 Let X have rank s, column space U, and row space V. The tangent space to the manifold Ms at X admits the orthogonal decomposition TMs (X) = (U ⊗ V) ⊕ (U ⊗ V ⊥ ) ⊕ (U ⊥ ⊗ V). The normal space is NMs (X) = [TMs (X)]⊥ = U ⊥ ⊗ V ⊥ .
(2.3)
It is interesting to note that TMs (X) contains matrices of rank at most 2s, while the maximum rank in NMs (X) is kmax − s. Also note that X ∈ TMs (X) (which already follows from the fact that Ms is a cone). In applications s may be much smaller than kmax , and in this context it can be an important fact that orthogonal projections on TMs (X) and NMs (X) can be computed using only projections on the s-dimensional spaces U and V as follows: PTMs (X) (Z) = PU Z + ZPV − PU ZPV , and PNMs (X) (Z) = PU ⊥ ZPV ⊥ = Z − PTMs (X) (Z).
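The projection formulas above require only the two s-dimensional projectors onto U and V. A small NumPy sketch (ours; the function name and the tolerance-based rank detection are illustrative assumptions) could look as follows.

```python
import numpy as np

def tangent_normal_projections(X, Z, tol=1e-12):
    """Projections of Z onto the tangent and normal space of M_s at X (s = rank of X),
    using only the projectors onto the column space U and row space V of X."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    s = int(np.sum(sigma > tol))
    PU = U[:, :s] @ U[:, :s].T          # projector onto the column space of X
    PV = Vt[:s, :].T @ Vt[:s, :]        # projector onto the row space of X
    PT = PU @ Z + Z @ PV - PU @ Z @ PV  # P_{T_{M_s}(X)}(Z)
    PN = Z - PT                         # P_{N_{M_s}(X)}(Z) = P_{U^perp} Z P_{V^perp}
    return PT, PN
```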
2.2 Bouligand Tangent and Normal Cone to M≤k The general definition of the Bouligand tangent cone to a closed set M ⊆ R^N is as follows:
T^B_M(x) = {ξ ∈ R^N : ∃(xn) ⊆ M, (an) ⊆ R+ s.t. xn → x, an(xn − x) → ξ}. (2.4)
In the context of low-rank optimization, the Bouligand tangent cone to M≤k has been derived from this definition in [9]. In [2], an essentially equivalent definition based on derivatives of analytic curves has been used. The result is also well known in algebraic geometry [5, Ex. 20.5]. Theorem 2.2 Let X ∈ M≤k have rank s ≤ k. The Bouligand tangent cone to M≤k at X is
T^B_{M≤k}(X) = T_{Ms}(X) ⊕ {Y ∈ N_{Ms}(X) : rank(Y) ≤ k − s}. (2.5)
In fact, since the orthogonal projection Z → P_{U⊥} Z P_{V⊥} onto N_{Ms}(X) does not increase the rank of a matrix Z, it can be easily deduced that the orthogonal sum in the above formula can be replaced by an ordinary sum:
T^B_{M≤k}(X) = T_{Ms}(X) + M≤k−s. (2.6)
Since T_{Ms}(X) contains matrices of rank at most 2s only, an interesting consequence is that T^B_{M≤k}(X) = M≤k−s if s = rank(X) ≤ k/3.
When s < k, it follows from (2.6) that an element in the polar cone of T^B_{M≤k}(X) needs to be orthogonal to any matrix in R^{M×N} of rank at most k − s. Obviously, only the zero matrix fulfills this. Corollary 2.3 Let X have rank s < k. The Bouligand normal cone (defined as the polar of the Bouligand tangent cone) to M≤k at X is N^B_{M≤k}(X) = {0}.
3 Clarke and Mordukhovich Normal Cones Let PM denote the (set-valued) metric projection (in the Euclidean norm ∥·∥) onto a closed subset M ⊆ R^n. Following [8, Theorem 1.6], the Mordukhovich normal cone N^M_M(x) to M at x can be defined as follows:
N^M_M(x) = {η ∈ R^n : there exist (xi) ⊂ R^n and (ηi) ⊂ R^n such that xi → x, ηi → η, and ηi ∈ cone(xi − PM(xi)) for all i ∈ N}. (3.1)
The elements of N^M_M(x) are called basic normal, limiting normal, or simply normal vectors.
It is proved in [8, Theorem 3.57] that the Clarke normal cone can be obtained from the Mordukhovich normal cone as its closed convex hull:
N^C_M(x) = cl conv N^M_M(x). (3.2)
We now consider the Mordukhovich normal cones to M≤k. The following result is essentially due to Luke [7].¹ Theorem 3.1 Let X ∈ M≤k have rank s ≤ k. The Mordukhovich normal cone to the closed variety M≤k at X is
N^M_{M≤k}(X) = {Y ∈ N_{Ms}(X) : rank(Y) ≤ kmax − k}. (3.3)
Proof The result is clear if k = kmax, since in this case M≤kmax = R^{M×N}. Hence we consider k < kmax. As before, let U and V denote the column and row space of X, respectively. Recall that N_{Ms}(X) = U⊥ ⊗ V⊥. Denoting the set on the right side of (3.3) by W, we first show N^M_{M≤k}(X) ⊇ W.
Let Y ∈ W , then by (2.2) there exist subspaces U1 ⊆ U ⊥ and V1 ⊆ V ⊥ , both of dimension kmax − k, such that Y ∈ U1 ⊗ V1 . Since U ⊥ and V ⊥ are both at least of dimension kmax − s, there exist subspaces U0 ⊆ U ⊥ ∩ U1⊥ and V0 ⊆ V ⊥ ∩ V1⊥ of dimension k −s. Pick Z ∈ U0 ⊗V0 with rank(Z) = k −s, and consider the sequence Xi := X + i −1/2Z + i −1 Y → X
(i → ∞).
Using SVD and rank(Z) = k − s, the mutual orthogonality of the column resp. row spaces of X, Z, and Y implies that for large enough i ∈ N the best approximation of rank at most k is P_{M≤k}(Xi) = X + i^{−1/2} Z. Hence Y ∈ cone(Xi − P_{M≤k}(Xi)) for all such i, which by definition (3.1) proves Y ∈ N^M_{M≤k}(X). To prove N^M_{M≤k}(X) ⊆ W, consider sequences Xi → X and Yi → Y satisfying Yi = αi(Xi − Zi) with Zi ∈ P_{M≤k}(Xi). Note that Zi → X, since ∥Zi − X∥ ≤ ∥Zi − Xi∥ + ∥Xi − X∥ ≤ 2∥X − Xi∥ (the second inequality follows from X ∈ M≤k). We have to show Y ∈ W. If Y = 0, this is clear. If Y ≠ 0, it must hold Yi ≠ 0 and hence rank(Xi) > k for large enough i. As Zi is a truncated SVD of Xi, we hence can state that
rank(Yi ) = rank(Xi − Zi ) = rank(Xi ) − rank(Zi ) = rank(Xi ) − k ≤ kmax − k
1 Some inaccuracies in the statement of Theorem 3.1 in [7] are corrected here. Also, the “⊆” part is proven by a more direct argument compared to [7].
for i large enough. From the lower semicontinuity of the rank function it now follows that rank(Y) ≤ kmax − k. Another consequence of the fact that Zi is a truncated SVD of Xi is that Zi Yi^T = 0 and Zi^T Yi = 0. In the limit we get XY^T = 0 and X^T Y = 0, which is equivalent to Y ∈ U⊥ ⊗ V⊥ by (2.3). In summary, we have shown Y ∈ W. As a corollary, we obtain the Clarke normal cone via (3.2). Corollary 3.2 Let X ∈ M≤k have rank s ≤ k. The Clarke normal cone to the closed variety M≤k at X is N^C_{M≤k}(X) = N_{Ms}(X).
Further, we observe that the Mordukhovich normal cone provides the "missing" part in the Bouligand tangent cone to fill up the whole space. Corollary 3.3 Let X have rank s ≤ k. Every matrix Z ∈ R^{M×N} admits an orthogonal decomposition
Z = Z1 + Z2, Z1 ∈ T^B_{M≤k}(X), Z2 ∈ N^M_{M≤k}(X). (3.4)
Proof By Theorem 2.1, the normal space N_{Ms}(X) contains matrices of rank at most kmax − s. Using SVD, every such matrix can be orthogonally decomposed into a matrix Z1 of rank at most k − s, implying Z1 ∈ T^B_{M≤k}(X) by (2.6), and a matrix Z2 of rank at most kmax − k. Due to the tensor product structure of N_{Ms}(X), we in fact have Z1, Z2 ∈ N_{Ms}(X), implying in particular Z2 ∈ N^M_{M≤k}(X).
In this sense, the Mordukhovich normal cone can be seen as an appropriate "nonlinear orthogonal complement" to the Bouligand tangent cone. However, one should be aware that N^M_{M≤k}(X) is only complementary to T^B_{M≤k}(X) if rank(X) = k. Otherwise, their intersection contains all matrices in N_{Ms}(X) of rank at most min(kmax − k, k − s). Also, the orthogonal decomposition (3.4) is then not unique.
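A possible numerical realization of the decomposition (3.4), following the proof above, is sketched below in NumPy (our own illustrative code, not from the paper): split Z into its T_{Ms}(X) and N_{Ms}(X) parts and then, inside the normal space, keep at most k − s singular triplets in Z1.

```python
import numpy as np

def split_tangent_normal_cone(X, Z, k, tol=1e-12):
    """Decomposition (3.4): Z = Z1 + Z2 with Z1 in T^B_{M<=k}(X), Z2 in N^M_{M<=k}(X)."""
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    s = int(np.sum(sigma > tol))
    PU = U[:, :s] @ U[:, :s].T
    PV = Vt[:s, :].T @ Vt[:s, :]
    ZT = PU @ Z + Z @ PV - PU @ Z @ PV       # part in T_{M_s}(X)
    ZN = Z - ZT                              # part in N_{M_s}(X) = U^perp (x) V^perp
    # inside the normal space keep rank <= k - s for Z1; the remainder goes to Z2
    Un, sn, Vnt = np.linalg.svd(ZN, full_matrices=False)
    r = max(k - s, 0)
    ZN_low = (Un[:, :r] * sn[:r]) @ Vnt[:r, :]
    Z1 = ZT + ZN_low                         # in T^B_{M<=k}(X) by (2.5)/(2.6)
    Z2 = ZN - ZN_low                         # rank <= kmax - k, hence in (3.3)
    return Z1, Z2
```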
4 Implications to Necessary Optimality Conditions The differences between the necessary optimality conditions for the low-rank optimization problem (2.1) arising from different choices of normal cones are best illustrated for the case that the function f is continuously differentiable in a neighborhood of M≤k . Further, we consider only rank deficient critical points X ∈ M≤k with rank(X) = s < k, since otherwise all normal cones are the same and equal to NMk (X).
We first consider first-order optimality in the sense of Clarke. By Corollary 3.2, a matrix X ∈ Ms with s < k will be a Clarke critical point of f on M≤k if
−∇f(X) ∈ N^C_{M≤k}(X) = N_{Ms}(X).
Hence this condition only states that X is in particular a critical point on the smooth stratum Ms, but does not provide any further information. In this sense, the Clarke normal cone at a rank-deficient point is "blind" to the fact that optimization is performed on M≤k and that potentially a higher rank could be used. For example, if we imagine an optimization algorithm on M≤k that is initialized with a point X ∈ Ms that is optimal on Ms, then it will terminate immediately without further improvement, if the Clarke normal cone is used for checking optimality. Therefore, the Clarke normal cone is not the most suitable choice for optimization problems on the variety M≤k. At the opposite extreme, by Corollary 2.3, a matrix X ∈ Ms with s < k will be a critical point for (2.1) in the sense of Bouligand only if
−∇f(X) ∈ N^B_{M≤k}(X) = {0}.
The optimality condition is thus the same as in the unconstrained case [9]. As a consequence, if there do not exist points X ∈ M≤k with ∇f(X) = 0, then all Bouligand critical points of the problem (2.1), if there exist any, have rank k. In this case, even if f(X) is minimal on a stratum Ms, s < k, it is always possible to decrease the value of f on M≤k in a neighborhood of X by increasing the rank. Indeed, the Euclidean projection of −∇f(X) onto the Bouligand tangent cone T^B_{M≤k}(X) provides a descent direction; cf. the discussion in [9]. Among the considered normal cones, the Mordukhovich normal cone is not convex. Its role at rank-deficient points X ∈ Ms, s < k, is intermediate. On the one hand, the necessary condition
−∇f(X) ∈ N^M_{M≤k}(X) ⊆ N_{Ms}(X) (4.1)
implies, as for all other cones, that X is in particular critical on the smooth stratum Ms. The difference to the Clarke normal cone is that rank(∇f(X)) ≤ kmax − k. In particular, ∇f(X) = 0 if k = kmax. On the other hand, if ∇f(X) ≠ 0 (which implies k < kmax), then an orthogonal decomposition −∇f(X) = Z1 + Z2 where both matrices Z1 and Z2 are in N_{Ms}(X) and rank(Z1) = min(kmax − k, k − s, rank(∇f(X))) is possible. Note that Z1 ∈ T^B_{M≤k}(X), that is, −∇f(X) still contains a tangential component with respect to the variety M≤k. From an optimization viewpoint this means that in theory one should not be satisfied with such a point. Still, the optimality condition (4.1) appears to be the most natural for optimization
methods that try to drive the projected gradient to zero. In practice, such a method aims at producing a sequence (Xi) on the smooth part Mk of M≤k such that P_{T_{Mk}(Xi)} ∇f(Xi) → 0. Hence, if Xi → X with rank(X) = s ≤ k, then P_{N_{Mk}(Xi)}(−∇f(Xi)) will have the same limit as −∇f(Xi), namely −∇f(X) (by our continuity assumption). It then follows from the general limiting properties of Mordukhovich normal cones [8, Theorem 1.6] that −∇f(X) ∈ N^M_{M≤k}(X). However, for the case of low-rank matrices one can see more directly that −∇f(X) belongs to the right-hand side of (3.3). First, taking Yi = P_{N_{Mk}(Xi)}(−∇f(Xi)) we have Xi Yi^T = 0 and Xi^T Yi = 0 due to (2.3), which in the limit implies −∇f(X) ∈ N_{Ms}(X). Second, by (2.3), P_{N_{Mk}(Xi)}(−∇f(Xi)) has rank at most kmax − k, so the same is true for the limit. In this sense, the rank condition rank(Y) ≤ kmax − k in the Mordukhovich normal cone inherits the information that optimization was performed on the smooth part Mk. Acknowledgements We thank B. Kutschan for bringing Harris' book [5] as a reference for the tangent cone T^B_{M≤k} to our attention, and for pointing out that formula (2.5) is equivalent to (2.6).
References 1. F. Bernard, L. Thibault, and N. Zlateva, Prox-regular sets and epigraphs in uniformly convex Banach spaces: various regularities and other properties, Trans. Amer. Math. Soc., 363 (2011), 2211–2247. 2. T. P. Cason, P.-A. Absil, and P. Van Dooren, Iterative methods for low rank approximation of graph similarity matrices, Linear Algebra Appl., 438 (2013), 1863–1882. 3. D. Drusvyatskiy and A. S. Lewis, Optimality, identifiability, and sensitivity, Math. Program., 147 (2014), 467–498. 4. M. Guignard, Generalized Kuhn–Tucker conditions for mathematical programming problems in a Banach space, SIAM J. Control, 7 (1969), 232–241. 5. J. Harris, Algebraic geometry. A First Course, Springer-Verlag, New York, 1992. 6. U. Helmke and M. A. Shayman, Critical points of matrix least squares distance functions, Linear Algebra Appl., 215 (1995), 1–19. 7. D. R. Luke, Prox-regularity of rank constraints sets and implications for algorithms, J. Math. Imaging Vision, 47 (2013), 231–238. 8. B. S. Mordukhovich, Variational analysis and generalized differentiation. I, Springer-Verlag, Berlin, 2006. 9. R. Schneider and A. Uschmajew, Convergence results for projected line-search methods on varieties of low-rank matrices via Łojasiewicz inequality, SIAM J. Optim., 25 (2015), 622–646.
Subdifferential Enlargements and Continuity Properties of the VU -Decomposition in Convex Optimization Shuai Liu, Claudia Sagastizábal, and Mikhail Solodov
Abstract We review the concept of VU-decomposition of nonsmooth convex functions, which is closely related to the notion of partly smooth functions. As the VU-decomposition depends on the subdifferential at the given point, the associated objects lack suitable continuity properties (because the subdifferential lacks them), which poses an additional challenge to the already difficult task of constructing superlinearly convergent algorithms for nonsmooth optimization. We thus introduce certain ε-VU-objects, based on an abstract enlargement of the subdifferential, which have better continuity properties. We note that the standard ε-subdifferential belongs to the introduced family of enlargements, but we argue that this is actually not the most appropriate choice from the algorithmic point of view. Specifically, strictly smaller enlargements are desirable, as well as enlargements tailored to the specific structure of the function (when there is such structure). Various illustrative examples are given. Keywords Subdifferential · Enlargements · VU-decomposition · Sublinear functions · Partial smoothness Mathematics Subject Classification (2000) Primary 49J52; Secondary 49J53, 65K10
S. Liu · C. Sagastizábal () IMECC-UNICAMP, Campinas, SP, Brazil e-mail: [email protected]; [email protected] M. Solodov Instituto de Matemática Pura e Aplicada, Rio de Janeiro, RJ, Brazil e-mail: [email protected] © Springer Nature Switzerland AG 2019 S. Hosseini et al. (eds.), Nonsmooth Optimization and Its Applications, International Series of Numerical Mathematics 170, https://doi.org/10.1007/978-3-030-11370-4_4
1 Introduction Designing superlinearly convergent algorithms for nonsmooth convex optimization has been an important challenge over the last 30 years or so [22]. The elusive fast convergence in nonsmooth settings remained out of reach until it was noticed in [13] that for a nonsmooth convex function f : Rn → R, second-order expansions do exist, when restricted to a certain special subspace. More precisely, near a given point x, ¯ while the function has kinks on the subspace spanned by its subdifferential ∂f (x), ¯ on the orthogonal complement of this subspace the graph of f appears “U”-shaped (in other words, smooth). Because this subspace concentrates the smoothness of f , it was called the U-subspace in [14]. Likewise, its orthogonal complement reflects all the nonsmoothness of f about x, ¯ and so it was called the V-subspace, V(x) ¯ := aff(∂f (x) ¯ − g) where g ∈ ∂f (x) ¯ and aff stands for the affine hull of a set. From the algorithmic perspective, the importance of having a second-order expansion for f lies in the possibility of making a fast U-Newton-move, thus enabling superlinear convergence. The VU-space decomposition gave rise to conceptual methods that were made implementable for problems with good structural properties, such as max-eigenvalue minimization [27] and primal–dual gradient structured functions [23]. The ability of bundle methods to identify nonsmoothness structure for max-functions was demonstrated in [5]. An interesting connection with sequential quadratic programming methods was revealed in [26]. The VU-theory was applied in [21] to solve stochastic programming problems, and also in [11] and [12] to design fast algorithms for nonconvex maximum eigenvalue problems. More recently, for finite max-functions, [6] shows that it is possible to construct VU-approximations in a derivative-free setting; see also [8]. The notion of partial smoothness, introduced in [15] for nonconvex nonsmooth functions, includes the VU-decomposition concepts as a particular case. A partly smooth function appears as being smooth when moving along certain manifold of activity, whose tangent subspace is the U-subspace. The question of finding the activity manifold is related to the problem of identifying active constraints in nonlinear programming [16], and has been explored in [4, 7, 19, 20]. Although partial smoothness and VU-decomposition appear as very powerful theoretical tools, they have not yet been exploited up to their full potential in algorithmic developments. Recall that bundle methods (which are some of the most practical and efficient tools for the nonsmooth setting) are designed as black-box (first-order) methods, working with an oracle which provides functional values and only one subgradient for any given point. From a numerical perspective, properly identifying the VU-subspaces (or the activity manifold) on the basis of such scarce/limited information is very hard. However, in our understanding, there is also another important reason for the difficulties, and it is inherent to the VU-concepts (or partly smooth concepts) themselves. Specifically, an issue is the lack of suitable continuity of VU-objects when they are seen as set-valued mappings of x. Roughly speaking, if the function happens to have a kink of nondifferentiability at one
iterate x k , the corresponding V k -subspace should have some positive dimension; whereas if at x k+1 the function is differentiable, the V k+1 -subspace shrinks to zero. We emphasize that this may occur even for two iterates arbitrarily close to each other. Such oscillations are undesirable in any algorithm, as it becomes prone to erratic/unstable behavior. Motivated by the observations above, in this work we introduce ε-VU-objects that have better continuity properties, when considered as multifunctions of x and ε. Note that a natural, and straightforward, proposal would be to consider the lineality space of the ε-subdifferential of f at x, Vε (x) := aff(∂ε f (x) ¯ − g), where g ∈ ∂ε f (x). ¯ However, we shall argue that this may not necessarily be the most appropriate choice. To take advantage of possible structural properties of f , as well as to allow more options and flexibility, we shall introduce a certain abstract enlargement, denoted by δε f (x), ¯ for which the ε-subdifferential is one but not the only possible choice. Among other things, we allow enlargements of ∂f (x) ¯ which can be smaller than ∂ε f (x). ¯ We believe this can be desirable for a number of reasons, ranging from tighter approximation to easier computation when compared to the ε-subdifferential. Indeed, we actually prefer the enlargement δε f (x) ¯ to be strictly contained in ∂ε f (x). ¯ This is because we often find ∂ε f (x) ¯ to be too big, causing Vε (x) ¯ too “fat,” and consequently Uε (x) ¯ too “thin.” We conclude this discussion by noting that there is another well-known candidate for enlargement of the subdifferential, namely the so-called ε-enlargement of the maximally monotone operator (the subdifferential is maximally monotone), see [1]. However, this enlargement is even larger than the ε-subdifferential. This work is organized as follows. Section 2 gives an overview of important concepts regarding the VU-theory and partial smoothness. Section 3 is devoted to the abstract enlargement of the subdifferential, and studies under which conditions the resulting ε-VU-subspaces are outer and/or inner semicontinuous multifunctions of (x, ε). Section 4 illustrates how the abstract enlargement is more versatile than the ε-subdifferential, and how it can be used to fully exploit known structure of a function in a model example. This section finishes showing the impact that the chosen enlargement and the corresponding ε-VU-subspaces have in an algorithmic framework. Sections 5 and 6 consider enlargements suitable for functions defined as the pointwise maximum of convex functions and sublinear functions, respectively. The work concludes with final remarks and comments on future research.
2 A Fast Overview of VU -Theory We now recall the essential notions of the VU-theory, which is at the heart of developing superlinearly convergent variants of bundle methods [23].
2.1 The U-Lagrangian Given a convex function f : R^n → R and a point x̄ ∈ R^n, let g be any subgradient in ∂f(x̄). The VU-decomposition [14] of R^n at x̄ is defined by
V(x̄) = span(∂f(x̄) − g) = aff(∂f(x̄)) − g,  U(x̄) = V(x̄)^⊥. (2.1)
Given any g° ∈ ri ∂f(x̄), the interior of ∂f(x̄) relative to its affine hull, it is known that the subspaces in question can also be characterized as
U(x̄) = {w ∈ R^n : f'(x̄; w) = −f'(x̄; −w)} = N_{∂f(x̄)}(g°),  V(x̄) = T_{∂f(x̄)}(g°),
where f'(x; d) is the directional derivative of f at x in the direction d; N_D(z) stands for the normal cone and T_D(z) for the tangent cone, respectively, to the convex set D at z ∈ D. As R^n = U(x̄) ⊕ V(x̄), each x ∈ R^n can be decomposed into x = x_{U(x̄)} ⊕ x_{V(x̄)}. The U-Lagrangian of f depends on the V(x̄)-component of a given subgradient ḡ ∈ ∂f(x̄):
U(x̄) ∋ u ↦ L_U(u; ḡ_{V(x̄)}) := inf_{v ∈ V(x̄)} { f(x̄ + u ⊕ v) − ⟨ḡ_{V(x̄)}, v⟩_{V(x̄)} }.
Each U-Lagrangian is a proper closed convex function that is differentiable at u = 0 with U-gradient given by ∇L_U(0; ḡ_{V(x̄)}) = ḡ_{U(x̄)} = P_{U(x̄)}(∂f(x̄)), the projection of ∂f(x̄) on U(x̄). When f has the primal–dual gradient structure at x̄, L_U is twice continuously differentiable around 0 and f has a second-order expansion in u along the smooth trajectory corresponding to U.
2.2 Fast Tracks A conceptual superlinearly convergent VU-algorithm makes a minimizing step in the V-subspace, followed by a U-Newton step along the direction of the U-gradient. To make this idea implementable, a fundamental result is [24, Thm 5.2]. It states that, at least locally, V-steps can be replaced by proximal steps. Proximal steps, in turn, can be approximated (with any desired precision) by bundle methods. Let p(x) denote the proximal point of f at x, which also depends on a prox-parameter μ > 0 (although not reflected in our notation); i.e., p(x) is the unique minimizer of f(y) + (μ/2)|y − x|^2. Then, by Mifflin and Sagastizábal [25, Cor. 4.3], to compute the U-gradient it suffices to find s(x), the element of minimum norm in ∂f(p(x)). Under reasonable assumptions, the primal–dual pair (p(x), s(x)) is a fast track leading to (x̄, 0), a minimizer and the null subgradient.
Since bundle methods can be used to approximate proximal steps, see, e.g., [3], all the VU-related quantities can be approximated by suitably combining the ingredients above. Specifically, let ϕ denote a convex piecewise linear model of f, available at the current iteration. Then the bundle method quadratic programming problem solution
d̂ := argmin_d { ϕ(x + d) + (1/2) μ |d|^2 }
yields an approximation p̂ = x + d̂ for the proximal point p(x). Once p̂ is known, since ϕ is a simple max-function, the full subdifferential ∂ϕ(p̂) is readily available. It is then possible to compute its null space, and its basis matrix U, to approximate U(p̂). Also, it is possible to obtain an approximation for s(x), by solving a second quadratic programming problem
ŝ := argmin_s { |s|^2 : s ∈ ∂ϕ(x + d̂) }.
The U-step is given by d_U := U H^{−1} U^T ŝ, for a positive definite matrix H gathering second-order information of f about p̂. However, since we are dealing with approximations, the V and U-steps yielding the next iterate, x^+ = p̂ − d_U, are done only when the approximation is deemed good enough; specifically, when
f(p̂) − ϕ(p̂) ≤ (m/(2μ)) |ŝ|^2 (2.2)
for a given parameter m ∈ (0, 1). The steps above are the main ingredients of the globally and locally superlinearly convergent VU-algorithm developed in [23] (barring a simple line search that we omit here, to simplify the exposition).
2.3 Smooth Manifolds and Partial Smoothness A concept closely related to VU-theory is the one of partial smoothness, relative to certain smooth manifold. In [18], a set M ⊂ Rn is said to be a C k -smooth manifold of codimension m around x¯ ∈ M if there is an open set Q ⊂ Rn such that M ∩ Q = {x ∈ Q : φi (x) = 0, i = 1, · · · , m} , where φi are C k functions (i.e., k times continuously differentiable) with the set {∇φi (x) ¯ , i = 1, . . . , m} being linearly independent. The normal (orthogonal to
tangent) subspace to M at x̄ is given by N_M(x̄) = span{∇φi(x̄), i = 1, · · · , m}. A proper closed convex function f is said to be partly smooth at x̄, relative to M, a C^2-manifold around x̄, if ∂f(x̄) ≠ ∅, and
1. Smoothness: f restricted to M is C^2 around x̄;
2. Normals are parallel to subdifferential: N_M(x̄) = V(x̄);
3. Continuity: ∂f is continuous at x̄ relative to M.
The function f is said to be partly smooth relative to the manifold M if f is partly smooth at each point in M, relative to M. In [15] it is shown that if a function is partly smooth relative to a manifold, then the manifold contains the primal component of the fast track (proposed in the VU-decomposition). Often the manifold can be defined from the functional structure. For example, given a nonempty finite index set I and any C^1-convex functions fi, i ∈ I, the max-function f(x) := max_{i∈I} fi(x) is partly smooth at x̄ relative to the manifold M_x̄ = {x : I(x) = I(x̄)}, where
I(x) := {i ∈ I : fi(x) = f(x)} (2.3)
is the activity set, provided the set of active gradients {∇fi (x) ¯ : i ∈ I (x)} ¯ is linearly independent [15, Cor. 4.8]. Regarding the VU-algorithm, the smooth manifold appears in connection with the sequence of V-steps: finding an acceptable pˆ can be seen as a succession of corrector steps, bringing the iterate x close to the smooth manifold. Pursuing further with this geometrical interpretation, once (2.2) holds, the U-step represents a predictor step, taken along a direction tangent to the manifold. In this respect, approximating the U-basis amounts to essentially approximating the smooth manifold.
3 Approximating the V -Subspace We start with the following considerations. The superlinear convergence rate of the VU-algorithm in [23] relies on the quality of the approximate U-matrices, particularly with respect to their ability to asymptotically span the true U(x)¯ subspace.
Similarly to identification of active constraints in nonlinear programming, determining those bases can prove difficult. As mentioned, the U-basis is related to the manifold of partial smoothness, which in turn depends on a certain activity set. For the max-function example, in particular, this requires correctly identifying the activity index I(x̄) (without knowing x̄, of course). In fact, even for any given x, identifying reliably all the values in {f1(x), f2(x), f3(x), . . . , fm(x)} that are equal is already tricky from the numerical point of view (what does "equal" really mean for two non-integer numbers in a computer?). Another example of the aforementioned difficulty is given by the max-eigenvalue function, where one would need to find the exact multiplicity of the maximum eigenvalue of a matrix. Indeed, let S^n denote the Euclidean space of the n-by-n real symmetric matrices, endowed with the Frobenius norm and inner product. The eigenvalues of X ∈ S^n are denoted by λ1(X) ≥ λ2(X) ≥ · · · ≥ λn(X) (listed in decreasing order by multiplicity) and Ei(X) is the eigenspace associated with λi(X). If E1(X̄) is m-dimensional, then f(X) := λ1(X) is partly smooth at X̄, relative to the manifold (cf. [15, Ex. 3.6]) M_X̄ = {X ∈ S^n : λ1(X) has multiplicity m} (1 ≤ m ≤ n).
Exact identifications in this setting, within some realistic algorithmic process, are extremely hard. As already mentioned above, in addition to possible difficulties with exact identifications, there is also the issue of (lack of) continuity of the resulting objects. We discuss this in more detail next.
3.1 Semicontinuity Notions In order to understand the phenomenon of instability, and to propose a solution, we need to analyze the continuity properties of the VU-subspaces, when considered as a (multi)function of x. Recall that for a set-valued mapping S : R^n ⇒ R^m,
– S is inner semicontinuous (isc) at x̄ if liminf_{x^k → x̄} S(x^k) ⊃ S(x̄), i.e., given s̄ ∈ S(x̄) and any sequence {x^k} → x̄ there exists a selection s : x → S(x) such that s(x^k) → s̄ as k → ∞.
– S is outer semicontinuous (osc) at x̄ if limsup_{x^k → x̄} S(x^k) ⊂ S(x̄), i.e., given any sequence {x^k} → x̄, for each convergent selection {s^k ∈ S(x^k)} → s̄ as k → ∞ it holds that s̄ ∈ S(x̄).
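As a tiny concrete illustration of these two notions (our own example, not from the text), the subdifferential of |·| is osc but not isc at the origin: along any sequence of nonzero points the only available selection is the corresponding sign, so no selection can converge to, say, −1 ∈ ∂|·|(0).

```python
def subdiff_abs(x):
    """Subdifferential of |.| as an interval [lo, hi]."""
    if x == 0.0:
        return (-1.0, 1.0)
    return (1.0, 1.0) if x > 0.0 else (-1.0, -1.0)

for k in range(1, 5):
    print(subdiff_abs(1.0 / k))   # always the singleton {1}; no selection can tend to -1
print(subdiff_abs(0.0))           # the whole interval [-1, 1]: osc holds, isc fails at 0
```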
Proposition 3.1 Consider a set-valued mapping S : R^n ⇒ R^m and the multifunction resulting from applying the affine hull operation:
A : x ↦ aff S(x) := { Σ_{i=1}^{p} λi si : si ∈ S(x), Σ_{i=1}^{p} λi = 1, λi ∈ R, and p ∈ N }.
If S is isc at x̄, then so is A.
Proof Any fixed ā ∈ A(x̄) is of the form ā = Σ_{i=1}^{p} λi s̄i, with s̄i ∈ S(x̄), Σ_{i=1}^{p} λi = 1, λi ∈ R, and p ∈ N. For each i = 1, . . . , p, since s̄i ∈ S(x̄), the isc of S at x̄ ensures that for any sequence x^k → x̄ there exists a selection si(x) such that si^k = si(x^k) converges to s̄i. The isc of A follows, because the sequence a^k := Σ_{i=1}^{p} λi si^k ∈ A(x^k) defines a selection converging to Σ_{i=1}^{p} λi s̄i = ā.
As explained below, for set-valued mappings, composing with the affine hull does not preserve outer semicontinuity. Remark 3.2 (On Outer Semicontinuity of Affine Hulls) A set-valued mapping S : R^n ⇒ R^m is osc everywhere if and only if its graph gph S := {(x, u) : u ∈ S(x)} is closed (cf. [29, Thm. 5.7]). However, given an osc mapping S(x), its affine hull mapping aff S(x) does not necessarily have closed graph. For example, consider a mapping S : R ⇒ R with graph shown in Fig. 1a. Although it has a closed graph, the graph of aff S(x) is not closed. Moreover, even an osc maximally monotone mapping does not necessarily have an osc affine hull mapping. Consider the function f : [−1, 1] → R given by
f(x) = (1/k) x if x ∈ (1/(k+1), 1/k] for all k ∈ N, and f(0) = 0.
Fig. 1 S(x) is osc everywhere but aff(S(x)) is not osc at one point. (a) The graph of S(x). (b) The graph of aff(S(x))
Define
S(x) := ∂f(x) = { 1/k if x ∈ (1/(k+1), 1/k); [1/(k+1), 1/k] if x = 1/(k+1) }.
Then S(0) = {0}, S is maximally monotone, and
aff S(x) = { 1/k if x ∈ (1/(k+1), 1/k); R if x = 1/(k+1) }.
There exists a sequence x^k := 1/(k+1) such that, as k → +∞, we have x^k → 0 but aff S(x^k) → R ⊄ aff S(0) = {0}.
3.2 Subdifferential Enlargements The V-subspace at x is defined by taking the affine hull of (∂f (x) − g), with any g ∈ ∂f (x). By Proposition 3.1 and Remark 3.2, the inner semicontinuity of the operator S is inherited by its affine hull, but not its outer semicontinuity. We face a paradoxical situation because, as a multifunction, the subdifferential is osc but not isc; cf. [10, VI.6.2]. Under these circumstances, the chances for the V-subspace to enjoy some continuity property on x are slim. In Sect. 4 below we illustrate with a simple example that the lack of continuity of the concept can lead to drastic changes from one point to another: in that example, the V-subspaces shrink from the whole space to the origin. The corresponding oscillations in the algorithmic process have the undesirable result of slowing down the convergence speed. In order to stabilize the erratic behavior of the V-subspaces, we define approximations, denoted by Vε below, that are given by taking the affine hull of a set-valued mapping with better continuity properties than the subdifferential. Thanks to the “enlarging” parameter ε, the ε-subdifferential from Convex Analysis is both inner and outer semicontinuous as a set-valued mapping of ε > 0 and x [10, §XI.4.1]. One could then consider replacing ∂f (x) by the ε-subdifferential in the definition of V(x) given in (2.1). However, for a number of reasons that will become clear in the subsequent sections, we shall introduce an abstract set-valued function (enlargement), denoted by δε f (x). The abstract enlargement must satisfy a certain minimal set of conditions, which we identified as relevant for our purposes. These conditions are shown below to hold for the usual ε-subdifferential as well, i.e., the abstract concept includes this important enlargement as a particular case. But it also allows for other enlargements, in particular smaller than the ε-subdifferential, as well as those that explicitly use the structure of the function when it is available.
Given a convex function f : Rn → R, and an abstract enlargement of ∂f (x), the corresponding Vε Uε -decomposition is defined as follows: Vε (x) := aff (δε f (x) − s) , s ∈ δε f (x),
Uε (x) := Vε (x)⊥ .
(3.1)
We require δε f (x) to be convex and closed, and to satisfy the following “sandwich” inclusions for all ε ≥ 0: ∂f (x) ⊆ δε f (x) ⊆ ∂ε f (x) .
(3.2)
Proposition 3.3 The following holds for the abstract enlargement δε f(x) and the corresponding subspaces defined in (3.1):
(i) Uε(x) = N_{δε f(x)}(s°), for all s° ∈ ri δε f(x).
(ii) The V-subspace is enlarged and the U-subspace is shrunk: V(x) ⊆ Vε(x) and U(x) ⊇ Uε(x).
(iii) If the enlargement satisfies (3.2), then it is an outer semicontinuous multifunction of x and ε and, in particular, limsup_{x^k → x, ε → 0} δε f(x^k) = ∂f(x). (3.3)
(iv) If the enlargement is isc at x̄ and ε̄ > 0, so is Vε(x).
(v) If the enlargement satisfies (3.2) and is isc, then it is also continuous.
(vi) If the enlargement satisfies (3.2), is isc, and lim_{(x,ε)→(x̄,ε̄), ε≥0} Vε(x) = Vε̄(x̄), then lim_{(x,ε)→(x̄,ε̄), ε≥0} Uε(x) = Uε̄(x̄).
Proof Item (i) is obtained by using the same arguments as those in [14, Proposition 2.2]. Item (ii) is straightforward from (3.2), whose inclusions are preserved when taking the affine hull of the sets, which yield the corresponding V spaces. Regarding item (iii), both ∂f and (x, ε) ↦ ∂ε f(x) are outer semicontinuous multifunctions. In view of (3.2), (x, ε) ↦ δε f(x) is also osc and (3.3) holds. Item (iv) is just Proposition 3.1, written with S(x) = δε f(x), while item (v) follows from item (iii). To show item (vi), let s̄ be an arbitrary element in δε f(x̄) and s1, · · · , sm be all the elements in δε f(x̄) such that {si − s̄}_{i=1}^{m} is a set that contains the maximal number of linearly independent vectors. Define the n × n matrix A^ε_x := [s1 − s̄, s2 − s̄, · · · , sm − s̄, 0_{n×(n−m)}].
By definition, Vε(x) = { Σ_{i=1}^{m} αi (si − s̄) : αi ∈ R, i = 1, · · · , m } = Range(A^ε_x) and then Uε(x) = ker(A^ε_x). The convergence of Vε(x) corresponds to the convergence of A^ε_x. Applying [29, Thm. 4.32], we have that the kernel of A^ε_x, i.e., Uε(x), also converges.
Item (iii) in the above proposition shows that outer semicontinuity of the relaxed V-space can be obtained if the sandwich inclusion (3.2) holds. The enlargements proposed for several classes of functions in the next sections will all satisfy that important relation.
4 Impact of the Chosen Enlargement We now illustrate the different choices that can be made for the abstract enlargement using the following model function:
h(x1, x2) = h1(x1) + h2(x2), where h1(x1) = |x1| and h2(x2) = (1/2) x2^2. (4.1)
Note that h is separable, h1 is nonsmooth, and h2 is smooth. This function is the simplest instance of the “half-and-half” functions created by Lewis and Overton to analyze BFGS behavior in a nonsmooth setting [17, Sec. 5.5].
4.1 The ε-Subdifferential Enlargement The (smoothness and) second-order behavior of the function h is all concentrated in the x2-component, where the function has constant curvature, equal to 1. The subdifferential is easy to compute:
∂h(x) = ∂h1(x1) × {x2}, with ∂h1(x1) = {−1} if x1 < 0, [−1, 1] if x1 = 0, {1} if x1 > 0.
As a result, the V and U subspaces are: if x1 ≠ 0 then V(x1, x2) = {(0, 0)} and U(x1, x2) = R^2, while V(0, 0) = R × {0} and U(0, 0) = {0} × R. (4.2)
To compute the ε-subdifferential, first combine Remark 3.1.5 and Example 1.2.2 from [10, Ch. XI] to obtain that
∂ε h(x) = ∪_{ε1 ∈ [0,ε]} ∂_{ε1} h1(x1) × ∂_{ε−ε1} h2(x2) = ∪_{ε1 ∈ [0,ε]} ∂_{ε1} h1(x1) × { x2 + ξ : (1/2) ξ^2 ≤ ε − ε1 }, (4.3)
and use the well-known formula below:
∂_{ε1} h1(x1) = [−1, −1 − ε1/x1] if x1 < −ε1/2, [−1, 1] if −ε1/2 ≤ x1 ≤ ε1/2, [1 − ε1/x1, 1] if x1 > ε1/2. (4.4)
The expressions (4.3)–(4.4) put in evidence the following drawback of the ε-subdifferential in our setting: it enlarges ∂h(x) also at points where h is actually smooth. Such a "fattening" is detrimental/harmful from the VU-decomposition perspective: if in our example we were to take δε h(x) = ∂ε h(x), we would end up with the following, extreme, decomposition in (3.1): Vε(x) = R^2 and Uε(x) = {0} for all x, which would make it impossible to make the desirable Newton-like U-steps.
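For later reference, formula (4.4) is easy to implement; the following short Python sketch (ours, returning the interval endpoints; the function name is illustrative) does so and shows the "fattening" at points where h1 is smooth.

```python
def eps_subdiff_abs(x1, eps):
    """Formula (4.4): the eps-subdifferential of h1 = |.|, as an interval [lo, hi]."""
    if x1 < -eps / 2.0:
        return (-1.0, -1.0 - eps / x1)   # note: -eps/x1 > 0 here, so the upper end exceeds -1
    if x1 > eps / 2.0:
        return (1.0 - eps / x1, 1.0)
    return (-1.0, 1.0)

print(eps_subdiff_abs(-2.0, 0.5))   # (-1.0, -0.75): already enlarged where h1 is smooth
print(eps_subdiff_abs(0.1, 0.5))    # (-1.0, 1.0): the full interval near the kink
```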
4.2 A Separable Enlargement Since often we only need enlargements near kinks, we can use the knowledge that h = h1 + h2 , with h2 being smooth. This yields a more suitable enlargement, corresponding to the particular set resulting from taking ε1 = ε in the union in (4.3): δε h(x1 , x2 ) := ∂ε h1 (x1 ) × {∇h2 (x2 )} .
(4.5)
In view of (4.2), condition (3.2) holds, and the enlargement is osc, by item (iii) in Proposition 3.3. Moreover, since (3.2) holds and δε h has one component that is a continuous enlargement, and the other component is single-valued, the enlargement is continuous, by (v) in Proposition 3.3. Finally, item (vi) also holds, since for all x ∈ R2 ,
Vε(x) = aff(δε h(x) − g0) = R × {0} and, hence, Uε(x) = {0} × R. (4.6)
When compared with the original (exact) VU-decomposition in (4.2), we have gained not only stability in the decomposition (the subspaces are the same for all x), but also continuity in the Vε -subspace.
4.3 A Not-So-Large Enlargement A small enlargement makes the Uε-subspace larger, and this somehow favors a faster speed of convergence, via the U-steps. Since considering the large set ∂ε h(x) also led to an unsuitable space decomposition, looking for options that are smaller may be of interest. We analyze a choice of this type for our example, noting that in Sect. 6 we shall consider a general definition for this set, suitable for any sublinear function, not only for h1(x1) = |x1|. As a geometrical motivation to our approach, consider the enlargements of ∂h1 given in Fig. 2. For fixed ε, the graph of the ε-subdifferential (4.4) is shown on the left. The simpler (polyhedral) multifunction shown on the right is the graph of δε h1(x). We observe that the enlargement δε h1 remains a singleton at points far from the origin. This is consistent with the fact that the function h1 is smooth in that region. By contrast, the ε-subdifferential expands ∂h1 at all the points where h1 is differentiable. To see the impact of this difference in the resulting V-approximations, consider the enlargement δε h(x1, x2) := δε h1(x1) × {∇h2(x2)}, where we define
δε h1(x1) := co {s ∈ ext ∂h1(0) : ⟨s, x1⟩ ≥ h1(x1) − ε} = {−1} if x1 < −ε/2, [−1, 1] if −ε/2 ≤ x1 ≤ ε/2, {1} if x1 > ε/2. (4.7)
Fig. 2 Two enlargements for ∂|x|
In the above, ext S stands for the extreme points of the convex set S, co D denotes the convex hull of the set D, and S̄ is the closure of the set S. From (4.7), (4.4), and (4.3) we see that δε h(x1, x2) satisfies condition (3.2). Hence, by Proposition 3.3, it is osc. The enlargement is isc everywhere except when |x1| = ε/2. The resulting space decomposition is
Vε(x) = R × {0} and Uε(x) = {0} × R if −ε/2 ≤ x1 ≤ ε/2; Vε(x) = {(0, 0)} and Uε(x) = R^2 otherwise. (4.8)
We can verify that Vε(x) and Uε(x) are continuous everywhere except when |x1| = ε/2. With the separable enlargement, Vε is always R × {0}. By contrast, with the smaller enlargement, Vε is maintained as the null subspace at points not close to the origin. From the VU-optimization point of view, this enlargement appears to be the best one.
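A small Python sketch (ours; names are illustrative) of the enlargement (4.7) and of the resulting classification (4.8) makes the comparison concrete: this enlargement detects the kink only within an ε/2-band around the origin.

```python
def delta_eps_abs(x1, eps):
    """Enlargement (4.7) for h1 = |.|: hull of the extreme subgradients s with s*x1 >= |x1| - eps."""
    active = [s for s in (-1.0, 1.0) if s * x1 >= abs(x1) - eps]
    return (min(active), max(active))

def veps_ueps(x1, eps):
    lo, hi = delta_eps_abs(x1, eps)
    if lo < hi:   # kink detected: this happens exactly when |x1| <= eps/2, cf. (4.8)
        return "V_eps = R x {0},  U_eps = {0} x R"
    return "V_eps = {(0,0)},  U_eps = R^2"

print(veps_ueps(0.1, 0.5))   # within the eps/2-band around the kink
print(veps_ueps(2.0, 0.5))   # far from the kink: the singleton case
```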
4.4 The Enlargement as a Stabilization Device of the VU-Scheme For our simple example and the enlargement given by (4.5), we next show how the approximated subspaces and the level of information that is made available by the oracle impact the calculation of the primal and dual iterates (p̂, ŝ) approximating the fast track pair (p(x), s(x)), as well as the determination of the U-subspace spanning matrices.
4.4.1 Exact VU-Approach Suppose the oracle delivers the full subdifferential of h1 at any given point. In this case the V-step can be computed exactly, componentwise:
p1(x1) = x1 + 1/μ if x1 < −1/μ, 0 if −1/μ ≤ x1 ≤ 1/μ, x1 − 1/μ if x1 > 1/μ, and p2(x2) = μ x2 / (1 + μ). (4.9)
The shortest element in the subdifferential at p(x) is also straightforward:
s1(x1) = −1 if x1 < −1/μ, 0 if −1/μ ≤ x1 ≤ 1/μ, 1 if x1 > 1/μ, and s2(x2) = p2(x2).
The matrices spanning the respective U(p(x))-subspaces are U = [0 1]^T and U = I, the 2 × 2 identity matrix. Finally, the "exact Hessian" is given by
H = 1 when V(p(x)) = R × {0}, and H = [σ 0; 0 1] when V(p(x)) = {(0, 0)}.
The (small) parameter σ > 0 above is introduced to ensure positive-definiteness in the Newton system giving the U-direction that corrects the V-step; recall that H d = −U^T ŝ. Since the next iterate is x^+ = p̂ + U d, it follows that
x1^+ = x1 + 1/μ + 1/σ if x1 < −1/μ, 0 if −1/μ ≤ x1 ≤ 1/μ, x1 − 1/μ − 1/σ if x1 > 1/μ, and x2^+ = 0.
The problem under consideration is min h(x), which is solved by x¯ = (0, 0). The exact VU-scheme above finds x¯2 in one iteration for any initial point x2 . The first component, x¯ 1 , would be found in one iteration if x1 was taken sufficiently close to the solution (|x1 | ≤ 1/μ). Otherwise, assuming μ is kept fixed along iterations, convergence speed is driven by the choice of σ . This is an artificial parameter, which should be zero to reflect the second order structure of h, but was set to a positive (preferably small) value to make H positive definite. For instance, when x1 < −1/μ, if σ is such that −2/μ ≤ x1 + 1/σ ≤ 0, then x1+ ∈ [−1/μ, 1/μ] and x1++ = 0. However, if σ is too small and x1 + 1/σ > 0, this makes x1+ > 1/μ and x1++ = x1 . This undesirable oscillation comes from the lack of continuity of the VU-subspaces with respect to the variable x. The ε-objects are meant to help in this sense, as shown next.
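The closed forms above can be assembled into a tiny Python sketch of one exact VU iteration for h (our own code, with illustrative names); it also reproduces the oscillation caused by an unsuitable choice of σ.

```python
def exact_vu_step(x1, x2, mu, sigma):
    """One exact V-step (4.9) followed by the U-step described above (our sketch)."""
    # V-step: proximal point p(x) and minimum-norm subgradient s at p(x)
    if x1 < -1.0 / mu:
        p1, s1 = x1 + 1.0 / mu, -1.0
    elif x1 > 1.0 / mu:
        p1, s1 = x1 - 1.0 / mu, 1.0
    else:
        p1, s1 = 0.0, 0.0
    p2 = mu * x2 / (1.0 + mu)
    s2 = p2
    # U-step: x+ = p - U H^{-1} U^T s
    if p1 == 0.0:
        # V(p) = R x {0}: U-basis is (0,1)^T and H = 1, so the x2-component is corrected exactly
        return 0.0, p2 - s2
    # V(p) = {(0,0)}: U-basis is I and H = diag(sigma, 1)
    return p1 - s1 / sigma, p2 - s2

print(exact_vu_step(-3.0, 5.0, mu=1.0, sigma=0.5))  # (0.0, 0.0): sigma well chosen
print(exact_vu_step(-3.0, 5.0, mu=1.0, sigma=0.2))  # (3.0, 0.0): overshoot past the kink (oscillation)
```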
4.4.2 Exact V-Step with Uε-Step For less simple functions, we do not have access to the full subdifferential of h at all points. Hence, neither p(x) nor s(x), nor any of the objects defined at each iteration are computable in an exact manner. We introduce approximations into those objects gradually, in a manner that reveals the utility of the proposed enlargements. Suppose we do know the full enlargement δε h(p(x)) and also p(x), but cannot compute explicitly s(x). Since p̂ = p(x) amounts to using h itself as the model ϕ to estimate the V-step, the relation in (2.2) holds for any choice of ŝ; in particular, if we project 0 onto the enlargement δε h(p(x)) to compute ŝ. Plugging the prox-expression (4.9) in (4.5), we see that
ŝ1(x1) = −1 − εμ/(μ x1 + 1) if x1 < −1/μ − ε, 0 if −1/μ − ε ≤ x1 ≤ 1/μ + ε, 1 − εμ/(μ x1 − 1) if x1 > 1/μ + ε, and ŝ2(x2) = p2(x2) = μ x2 / (1 + μ).
Because the Vε Uε-subspaces in (4.6) are "stable" with respect to x, we can take the same matrix H = 1 and U = [0 1]^T for all x1 (without distinguishing the three cases, as above). The U-step uses the direction d1 = 0 and d2 = p2(x2), which gives the update
x^+ = p(x) − (0, p2(x2))^T = (p1(x1), 0)^T.
As before, x̄2 is found in one iteration. Regarding x̄1, the next iteration is
x1^+ = x1 + 1/μ if x1 < −1/μ, 0 if −1/μ ≤ x1 ≤ 1/μ, x1 − 1/μ if x1 > 1/μ.
If μ is kept fixed, it takes (the integer value of) |μ x1| iterations to find x̄1. This termination result is consistent with the VU-superlinear convergence feature, since the result stating that proximal points are on the fast track requires that μ(x − x̄) → 0 as x → x̄.
Continuity of the VU -Decomposition
71
4.4.3 Approximating Both Steps Suppose now the V-step yielding p1 (x1 ) is not computable because we cannot solve exactly the (implicit) inclusion μ(x1 − p1 (x1 )) ∈ ∂h1 (p1 (x1 )) . Assuming that we do have access to the full ε-subdifferential, we consider two options, both using the ε-subdifferential enlargement as a replacement of the subdifferential. 4.4.4 Implicit Vε Step In this first option, we solve the implicit inclusion μ(x1 − pˆ1 ) ∈ δε h1 (pˆ 1 ) . Through some algebraic calculations, we get ⎧% ' ( &
2 ⎪ ⎪ 1 1 1 4ε 1 ⎪ x1 + μ + μ , x1 + μ , ⎪ ⎪ 2 x1 + μ − ⎪ ⎪ ⎨ pˆ1 ∈ x1 − μ1 , x1 + μ1 , ⎪ % ' (&
⎪ ⎪ 2 ⎪ ⎪ 1 1 1 1 4ε ⎪ x1 − μ + μ , ⎪ ⎩ x1 − μ , 2 x1 − μ + if
≥
ε 2
1 μ;
if x1 ≤ if
1 μ
1 μ
−
ε 2
if x1 ≥
−
ε 2
≤ x1 ≤
ε 2
−
−
ε 2
1 μ
1 μ
(4.10) and otherwise,
( &
⎧% ' 2 ⎪ 1 1 1 4ε 1 ⎪ ⎪ x1 + μ + μ , x1 + μ , ⎪ 2 x1 + μ − ⎪ ⎪ ⎨ pˆ 1 = [B1 , B2 ] , % ' (& ⎪
⎪ 2 ⎪ ⎪ ⎪ x − 1, 1 x − 1 + ⎪ x1 − μ1 + 4ε , ⎩ 1 μ 2 1 μ μ
if x1 ≤ if
ε 2
−
ε 2
1 μ
if x1 ≥
1 μ
−
1 μ
≤ x1 ≤ −
1 μ
−
ε 2
ε 2
(4.11) ' where B1
x1 −
1 μ
= 2
+
1 2 4ε μ
x1 + (
1 μ
−
x1 +
1 μ
2
( +
4ε μ
' and B2
=
1 2
x1 −
1 μ
+
.
By the expressions in (4.10) and (4.11), the solution pˆ1 is not unique but an interval. For this reason, we need to append the inclusion with a selection
72
S. Liu et al.
mechanism, which takes pˆ1 as the closest solution to x1 . The motivation behind this rule is to approximately find the fixed point of the proximal mapping. The unique solution for both (4.10) and (4.11) is: ⎧ ' ⎪ 1 ⎪ ⎪ ⎪ 2 x1 + ⎪ ⎪ ⎨ pˆ 1 = x1 , ' ⎪ ⎪ ⎪ ⎪ ⎪ 1 ⎪ ⎩ 2 x1 −
1 μ
1 μ
−
+
x1 +
x1 −
1 μ
1 μ
2
2
( +
, if x1 ≤ −ε
4ε μ
if − ε ≤ x1 ≤ ε
( +
4ε μ
(4.12)
, if x1 ≥ ε
4.4.5 Explicit Vε Step We can also solve the explicit inclusion μ(x1 − pˆ 1 ) ∈ δε h1 (x1 ), to obtain ⎧ ε ⎪ , x1 + μ1 , ⎪ x1 + μ1 + xμ ⎪ ⎨ pˆ1 = x1 − μ1 , x1 + μ1 , ⎪ ⎪ ⎪ ⎩ x1 − 1 , x1 − 1 + ε , μ μ x1 μ
if x1 ≤ − 2ε if −
ε 2
≤ x1 ≤
if x1 ≥
ε 2
(4.13)
ε 2
and define the resulting update x + . Once again, we see from (4.13) that the solution pˆ 1 is not unique but an interval. Applying the same selection mechanism, ⎧ ⎪ ⎪ ⎨x 1 + pˆ1 = x1 , ⎪ ⎪ ⎩x − 1
1 μ
+
ε x1 μ ,
1 μ
+
ε x1 μ ,
if x1 ≤ −ε if − ε ≤ x1 ≤ ε
(4.14)
if x1 ≥ ε
4.4.6 Iteration Update for All the Cases The of the Newton direction is the same as in Sect. 4.4.2, i.e., d = approximation 0 . Consequently, in all cases we have the same update x1+ = p1 (x1 ) and p2 (x2 ) x2+ = 0. Consider the expressions of pˆ 1 in (4.12) and (4.14). Simple algebraic calculations can verify that pˆ 1 becomes closer to 0 compared with x1 except in two
Continuity of the VU -Decomposition
73
cases where pˆ 1 = x1 when |x| ≤ ε. When this happens, we can decrease ε in our algorithm so that the sequence of updates x1+ eventually converges to 0. For several classes of functions with special structure we now consider different subdifferential enlargements and the corresponding Vε Uε objects.
5 Maximum of Convex Functions When defining our enlargements, a very important principle we follow is that their computation must also be practically less difficult than computing the εsubdifferential. Furthermore, we prefer the enlargement δε f to be strictly contained in ∂ε f because often we find ∂ε f is too big, causing Vε too “fat” and then Uε too “thin.”
5.1 ε-Activity Sets We start with a result examining the persistence of the set of active indices (2.3), which characterizes the manifold of partial smoothness of a max-function. We use the notation IS for the indicator function of the set S, i.e., IS (x) = 0 if x ∈ S and it is +∞ otherwise. Proposition 5.1 Consider a max-function of the form f (x) := maxi∈I fi (x) + IC (x) where I is a non-empty finite index set, C is a subset of Rn , and fi : Rn → R is continuous for all i ∈ I . For ε ≥ 0, let Iε (x) := {i ∈ I : fi (x) ≥ f (x) − ε}
(5.1)
and denote I (x) := I0 (x). The following holds. ¯ for all x ∈ B(x, ¯ ε¯ ) (i) For any x¯ ∈ C, there exists ε¯ ≥ 0 such that Iε (x) ⊂ I (x) and ε ∈ [0, ε¯ ]. (ii) If, in particular, x¯ ∈ int C and ε = 0, then the inclusion holds as an equality. Proof Given x¯ ∈ C, to show item (i) we prove that there exists r ≥ 0 such that Iε (x) ⊂ I (x) ¯ for all x ∈ B(x, ¯ r) and ε ∈ [0, r]. For any i ∈ I \ I (x), ¯ the number ci := f (x) ¯ − fi (x) ¯ is positive as f is finite at x. ¯ Define gi (x) := maxi∈I fi (x) − fi (x) − c2i . Then gi (x) ¯ = maxi∈I fi (x) ¯ − fi (x) ¯ − c2i > 0. The continuity of gi yields a neighborhood B(x, ¯ γi ) such that gi (x) > 0 for any x ∈ B(x, ¯ γi ). Consequently, fi (x) < maxi∈I fi (x) − c2i ≤ f (x) − c2i and i ∈ I \ I ci (x). Take 2 ci r := mini∈I \I (x) ¯ r) × [0, r]. ¯ 2 , γi , then i ∈ I \ Iε (x) for all (x, ε) ∈ B (x, Therefore, Iε (x) ⊂ I (x) ¯ for all x ∈ B(x, ¯ r) and ε ∈ [0, r], as stated.
74
S. Liu et al.
Regarding item (ii), for any x¯ ∈ int C, there exists γ > 0 such that B(x, ¯ γ ) ⊂ C. Suppose for contradiction that for any t ∈ (0, +∞) there exist w ∈ B(x, ¯ t) and ε ∈ (0, t) such that Iε (w) = I (x). ¯ Let β := min {γ , r}, y ∈ B(x, ¯ β) and ε ∈ (0, β) such that Iε (y) = I (x). ¯ Then it can only be that Iε (y) I (x). ¯ Let j be an index in I (x)\I ¯ ε (y). Then fj (y) < f (y)−ε = maxi∈I fi (y)−ε. By continuity, there exists B(y, α) such that fj (x) < maxi∈I fi (y) − ε for all x ∈ B(y, α). It then follows that fj (x) < f (x) − ε for all x ∈ B(y, α) ∩ B(x¯ , β). And hence j ∈ Iε (x). Consider the point w ∈ B(y, α) which is on the line segment {λx¯ + (1 − λ)y : λ ∈ [0, 1]} and has the shortest distance to x. ¯ By continuity there exists another neighborhood B(w, p) such that j ∈ Iε (x) for all x ∈ B(w, p). This process can be done repetitively in a finite number of times until the existence of a ball containing x¯ such that j ∈ Iε (x). ¯ This is impossible because we have j ∈ I (x) ¯ ⊂ Iε (x). ¯ ! The relaxed activity sets are used below to define suitable enlargements, first for polyhedral functions, and afterwards for general max-functions.
5.2 Polyhedral Functions: Enlargement and Vε Uε -Subspaces We start our development with the important class of polyhedral functions f : Rn → R: ! f(x) := maxi∈I a i , x + bi , f (x) = f(x) + ID (x), for (5.2) D := x ∈ Rn : cj , x ≤ d j , j ∈ J where the index sets are finite with I = ∅. Because the function f(x) has full domain, ∂f (x) = ∂f(x) + ND (x) and, hence, ⎫ αi = 1, ⎪ ⎪ ⎬ i∈I (x) i j . αi a + βj c ∂f (x) = ⎪ ⎪ ⎪ αi ≥ 0 (i ∈ I (x)) , ⎪ ⎭ ⎩i∈I (x) j ∈J (x) β ≥ 0 (j ∈ J (x)) j ⎧ ⎪ ⎪ ⎨
It can be seen in [15, Ex. 3.4] that ⎧ ⎫ ⎨ ⎬ V(x) = αi a i + βj c j : αi = 0, βj ∈ R , ⎩ ⎭ i∈I (x)
j ∈J (x)
(5.3)
(5.4)
i∈I (x)
The “active” index sets above are , , I (x) = i ∈ I : a i , x + b i = f (x) and J (x) = j ∈ J : cj , x = d j . (5.5)
Continuity of the VU -Decomposition
75
To derive the expression of the ε-subdifferential, we combine once again results from [9, Ch. XI, Sec.3.14]. First, by Theorem 3.1.1 therein, for any εf ≥ 0 ∂ε f (x) =
#
∂εf f(x) + ∂ε−εf ID (x) .
εf ∈[0,ε]
Second, as shown in Example 3.5.3 and letting ei := f(x) − a i , x − bi ≥ 0 for i ∈ I, i (5.6) ∂εf f(x) = αi a : αi ≥ 0 , αi = 1, αi ei ≤ εf . i∈I
i∈I
i∈I
Third, by Example 3.1.4, ∂ε−εf ID (x) = ND,ε−εf (x) the set of approximate normal elements. Finally, by Example 3.1.3, letting Ej := d j − cj , x ≥ 0 for j ∈ J , ∂ε−εf ID (x) =
⎧ ⎨ ⎩
βj c j : βj ≥ 0 ,
j ∈J
βj Ej ≤ ε − ε f
j ∈J
⎫ ⎬ ⎭
,
Combining all the expressions above, ∂ε f (x) = εf ∈ [0, ε] ⎧ ⎫ ⎪ αi ei ≤ εf αi ≥ 0 αi = 1 ⎪ ⎪ ⎪ ⎨ ⎬ i∈I i∈I i j . α a + β c i j i∈I j ∈J ⎪ ⎪ βj Ej ≤ ε − ε f βj ≥ 0 ⎪ ⎪ ⎩ ⎭
(5.7)
j ∈J
Like with the one-dimensional example in Sect. 4, we could choose as enlargement the set in the union (5.7) that corresponds to εf = ε, i.e., δε f (x) = ∂ε f(x) + ∂0 ID (x) = ∂ε f(x) + ND (x) . This enlargement satisfies the sandwich inclusion (3.2), but it is not continuous everywhere. Another, similar, option is to take δε f (x) = ∂ ε2 f(x) + ND, 2ε (x) , which is actually continuous everywhere except when ε = 0. However, since we prefer smaller enlargements, we consider instead the εactivity index set from (5.1), which in this setting has the form , Iε (x) = i ∈ I : f i (x) = a i , x + b i ≥ f (x) − ε ,
(5.8)
76
S. Liu et al.
and define the associated enlargement ⎫ αi = 1 αi ≥ 0 ⎬ αi a i + βj cj i∈Iε (x) , δε f (x) = ⎭ ⎩ i∈Iε (x) j ∈J (x) βj ≥ 0 ⎧ ⎨
(5.9)
which satisfies (3.2). The corresponding subspace (3.1) can be explicitly represented as follows (see [15]): ⎧ ⎫ ⎨ ⎬ Vε (x) = αi a i + βj c j : αi = 0, βj ∈ R . (5.10) ⎩ ⎭ i∈Iε (x)
j ∈J (x)
i∈Iε (x)
Applying Proposition 5.1, we see that, for any x¯ ∈ D, the relaxed activity set ¯ when x is sufficiently close to x¯ and ε is small enough, satisfies Iε (x) ⊂ I (x) where the inclusion holds as an equation if x¯ ∈ int D. In view of the expressions (5.9) and (5.3), the inclusion δε f (x) ⊂ ∂f (x) ¯ always holds, and it becomes an identity if x¯ ∈ int D. Hence, by (5.10), for all x¯ ∈ D when (x, ε) are sufficiently close to (x, ¯ 0), Vε (x) ⊂ V(x) ¯ and Uε (x) ⊃ U(x) ¯ , and Vε (x) = V(x) ¯ and lim inf(x,ε)→(x,0) Uε (x) = U(x) ¯ . lim sup(x,ε)→(x,0) ¯ ¯ ε≥0
ε≥0
(5.11)
While, if x¯ ∈ int D, when (x, ε) are sufficiently close to (x, ¯ 0), Vε (x) = V(x) ¯ and Uε (x) = U(x) ¯ ,
(5.12)
and Vε (x) = V(x) ¯ and lim(x,ε)→(x,0) Uε (x) = U(x) ¯ . lim(x,ε)→(x,0) ¯ ¯ ε≥0
ε≥0
(5.13)
5.3 Polyhedral Functions: Manifold and Manifold Relaxation In [15, Ex. 3.4] it is shown that any polyhedral function is partly smooth at any x, ¯ relative to Mx¯ := x ∈ Rn : I (x) = I (x) ¯ and J (x) = J (x) ¯ .
(5.14)
Continuity of the VU -Decomposition
77
We now give various equivalent characterizations for the manifold of partial smoothness. The last one is particularly suitable for defining a manifold relaxation based on the Uε -subspace. The normals parallel to subdifferential property of partly smooth functions is equivalent to TM (x) = U(x). This condition obviously holds if M is just the affine subspace x + U(x). Particularly, this is the case for polyhedral functions locally. Proposition 5.2 For polyhedral functions, we have Mx¯ ⊂ x¯ + U(x). ¯ Moreover, there exists ε > 0 such that Mx¯ ∩ B (x, ¯ ε) = x¯ + (U(x) ¯ ∩ B (0, ε)). Proof For any x ∈ Mx¯ we show x − x¯ ∈ V(x) ¯ ⊥ . From the subdifferential characterization (5.3), we see that ∂f (x) = ∂f (x) ¯ and therefore V(x) = V(x). ¯ From the definition of Mx¯ , (5.4), and (5.5) we have x − x, ¯ v = 0 for any v ∈ V(x), and the inclusion Mx¯ ⊂ x¯ + U(x) ¯ follows. To prove the next statement, we only need to show I (x¯ + d) = I (x) ¯ and J (x¯ + d) = J (x) ¯ for any d ∈ U(x) ¯ ∩ B(0, ε) and for some ε. From the continuity of the functions defining f , there exists a neighborhood B(x, ¯ ε) such that I (x¯ + d) ⊂ I (x) ¯ and J (x¯ + d) ⊂ J (x) ¯ for any x¯ + d ∈ B(x, ¯ ε). It suffices to show other way around. As d ∈ V(x) ¯- ⊥ , in view of (5.4) (written with x = x), ¯ we have ,the i + j , d = 0 for all α such that α a β c α = 0 and for i i∈I (x) ¯ i j ∈J (x) ¯ j i∈I (x) i i k j ¯ and c , d = 0 for all all βj ∈ R. It follows that a , d = a , d for all i, k ∈ I (x) i ¯ j ∈ J (x). ¯ Consequently, f (x) ¯ + a , d is not dependent on the choice of i ∈ I (x). As I ( x ¯ + d) ⊂ I ( x), ¯ there must be an index t ∈ I ( x) ¯ ∩ I ( x ¯ + d). Therefore t a , x¯ + bt + a t , d = f (x¯ + d) = f (x) ¯ + a t , d . This shows that all i ∈ I (x) ¯ are also elements of I (x¯ +d). For any j ∈ J (x), ¯ it follows that cj , x¯ + d = cj , x¯ and thus j ∈ J (x¯ + d). Consequently, J (x¯ + d) ⊂ J (x) ¯ and the proof is finished. ! The manifold (5.14) is characterized by the active indices. In view of the relations between I (x) and Iε (x) (cf. (5.8)), one may consider the following relaxation: Mεx := y ∈ Rn : Iε (y) = Iε (x), J (y) = J (x) . However, we do not know whether Mx is contained in Mεx or not. We therefore consider another option for relaxing the smooth manifold Mx . Define Mεx := x + Uε (x), where Uε (x) is defined based on the enlargement δε f (x) from (5.9). Then TMεx (x) = Uε (x) as Uε (x) is a subspace contained in U(x). Furthermore, in view of (5.11) and Proposition 5.2, there exists ε > 0 such that lim inf Mεx ∩ B(x, ¯ ε ) = Mx¯ ∩ B(x, ¯ ε ) for all x¯ ∈ D,
(x,ε)→(x,0) ¯
ε≥0
78
S. Liu et al.
and lim
(x,ε)→(x,0) ¯
Mεx ∩ B(x, ¯ ε ) = Mx¯ ∩ B(x, ¯ ε ) for all x¯ ∈ int D,
ε≥0
confirming the fact that Mεx enlarges the manifold Mx . 5.3.1 Illustration on a Simple Example Consider the function f (x) = max1≤i≤n xi . It is known that ∂f (x) = {α ∈ n : α, x ≥ f (x)} = {α ∈ n : αi = 0 if xi < f (x)} , where n is the unit simplex. Let I (x) = {i ∈ I : xi = f (x)} and Iε (x) = {i ∈ I : xi ≥ f (x) − ε} .
(5.15)
Then !
$
δε f (x) = co ei : xi − max xi + ε ≥ 0 = co {ei : i ∈ Iε (x)} , i∈I
where Iε is defined in (5.15). For comparison, ! $ ∂ε f (x) = s ∈ n : s, x ≥ max xi − ε i∈I
= s ∈ n : si xi − max xi + ε ≥ 0 . i∈I
i∈I
Consider the case of R2 . Then ⎧ ⎪ ⎪ ⎨{1} , Iε (x) = {1, 2} , ⎪ ⎪ ⎩{2} ,
if x1 > x2 + ε
⎧' ( ⎪ 1 ⎪ ⎪ , ⎪ ⎪ ⎪ ⎪ 0 ⎨
if x1 > x2 + ε
δε f (x) =
2 , ' ( ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ 0 ⎪ ⎩ 1
if x2 − ε ≤ x1 ≤ x2 + ε if x1 < x2 − ε
if x2 − ε ≤ x1 ≤ x2 + ε if x1 < x2 − ε
(5.16)
Continuity of the VU -Decomposition
79
1
1
Fig. 3 δε f (x) ¯ and ∂ε f (x) ¯
We see that δε f (x) is only a non-singleton if x1 and x2 are close enough. On the other hand, ∂ε f (x) as the intersection of a two-dimensional simplex and some half space is always non-singleton. Consider the two-dimensional case, when x¯ = (2, 3) ∈ R2 and ε = 0.6. From (5.16) we see δε f (x) ¯ is just the point (0, 1). The set ∂ε f (x) ¯ is shown in Fig. 3 as the solid blue line. This means that, at points where x1 and x2 are not “too close,” the dimension of the enlarged subspace, Vε (x), based on δε f (x), is smaller than the subspace that would be obtained using ∂ε f (x). In this case, Mx¯ = {x : x2 > x1 }, Vε (x) = {x : x2 = −x1 } if x2 −ε ≤ x1 ≤ x2 + ε and otherwise V (x) = {0}, and Mx = x + {x : x2 = x1 } if x2 − ε ≤ x1 ≤ x2 + ε and otherwise Mx = R2 .
5.4 General Max-Functions: Enlargement and Vε Uε -Subspaces Consider f (x) := max fi (x) where each fi (x) is convex and is C 1 . The subdiffereni∈I
tial is ∂f (x) = co {∇fi (x) : i ∈ I (x)}, while the expression for the ε-subdifferential is similar to the one in (5.6), written with ei := f (x) − f i (x) ≥ 0 for i ∈ I . To define a smaller enlargement, consider the ε-activity set in (5.1), so that δε f (x) := co {∇fi (x) : i ∈ Iε (x)} satisfies the sandwich inclusion (3.2). Moreover, since the max-function is now defined in the whole domain, Proposition 5.1(ii) applies with C = Rn and, hence, all the relations in (5.12) and (5.13) hold for this enlargement.
80
S. Liu et al.
Applying Proposition 3.3, we get the osc of δε f (x). Regarding its inner semicontinuity, by Proposition 5.1, given any x¯ when x and ε are sufficiently close to x¯ and 0, there is always Iε (x) = I (x). ¯ Obviously the set-valued mapping (x, ε) → {∇fi (x) : i ∈ I (x)} ¯ is isc. It follows from [29, Thm. 5.9] that δε f (x) is isc at (x, ¯ 0). Consequently, δε f (x) is continuous at (x, ¯ 0). The approximations of VU spaces can also be expressed as follows: Vε (x) = span ∇fi (x) − ∇fi◦ (x) : i ∈ Iε (x) , where i◦ is any fixed index in Iε (x), and Uε (x) = y ∈ Rn : ∇fi (x), y = ∇fi◦ , y , i ∈ Iε (x) . Let (x, ) be sufficiently close to (x, ¯ 0) and be fixed. Then V (x) and U (x) can be, respectively, considered as the range and kernel of a matrix Aεx whose columns are of the form ∇fi (x)−∇fi◦ (x) for i ∈ Iε (x) = I (x) ¯ by Proposition 5.1, and the number of its columns is the constant |I (x)|. ¯ As (x, ε) → (x, ¯ 0), due to the continuity of ∇fi the sequence of matrices Aεx converges to another matrix Ax¯ whose columns are of the form ∇fi (x) ¯ − ∇fi◦ (x) ¯ for i ∈ I (x). ¯ It follows from [29, Thm. 4.32] that the range and kernel of Aεx converges correspondingly. Consequently, V (x) and U (x) as set-valued mappings are both continuous at (x, ¯ 0) for all x. ¯
6 Sublinear Functions We next turn our attention to H, the class of proper closed sublinear functions from Rn to R. Sublinear functions are positively homogeneous and convex; they constitute a large family including norms, quadratic seminorms, gauges of closed convex sets containing 0 in their interior, infimal convolutions of sublinear functions, and support functions of bounded sets. The class also comprises perspective functions, that are systematically studied in [2], in particular to show they provide constructive means to model general lower semicontinuous convex functions.
6.1 Enlargement and Vε Uε -Subspaces As shown in [9, Ch. V], sublinear functions enjoy remarkable properties. In particular, see [9, Ch. VI Remark 1.2.3], they are support functions (of the subdifferential at zero): h(x) = sup x, s . s∈∂h(0)
(6.1)
Continuity of the VU -Decomposition
81
Hence, the subdifferential is given by ∂h(x) = {s ∈ ∂h(0) : h(x) = s, x } ,
(6.2)
see [29, Cor. 8.25]. Moreover, as shown in [9, Ch.VI Example 1.2.5], it holds that ∂ε h(x) = {s ∈ ∂h(0) : s, x ≥ h(x) − ε} ,
(6.3)
from which we define the enlargement in (6.4), generalizing the one introduced in Sect. 4.3 for the model function (4.1). Proposition 6.1 Given ε ≥ 0, consider the enlargement given by δε h(x) := co {s ∈ ext ∂h(0) : s, x ≥ h(x) − ε} .
(6.4)
If h ∈ H is finite valued, the following relation holds (which is stronger than the “sandwich” relation (3.2)): ∂h(x) ⊂ δ0 h(x) ⊂ δε h(x) ⊂ ∂ε h(x) . Proof The rightmost inclusion is straightforward, because ext ∂h(0) ⊂ h(0) and the set in (6.3) is closed and convex: δε h(x) ⊂ ∂ε h(x). Together with the fact that δ0 h(x) ⊂ δε h(x), by definition of the enlargement, we only need to show that ∂h(x) ⊂ δ0 h(x). In view of (6.1), for the support function h(x) to be finite everywhere, the set ∂h(0) must be bounded. Plugging the identity ∂h(0) = co ext ∂h(0) from [28, Corollary 18.5.1] in (6.2) yields ∂h(x) = {s ∈ co ext ∂h(0) : s, x = h(x)} . Therefore, for any s∈ ∂h(x), there exist t ∈ N, si ∈ ext ∂h(0), αi ≥ 0, i = 1, · · · , t, such that ti=1 αi = 1, s = ti=1 αi si and s, x = ti=1 αi si , x = h(x). Without loss of generality we can assume αi > 0 for i = 1, · · · , t. Now to show s ∈ δ0 h(x), by definition, we only need to show si , x = h(x) for all i = 1, · · · , t. Suppose for contradiction that there exists an index j such that sj , x = h(x). Then we must have sj , x < h(x) because h(x) = sups ∈∂h(0) s , x and t , x = αj sj , x + sj ∈ ext ∂h(0) ⊆ ∂h(0). Consequently, s, x = i=1 αi si t , x < αj h(x) + ti=1,i=j αi si , x ≤ αj h(x) + ti=1,i=j αi h(x) = i=1,i=j s i αj h(x) + 1 − αj h(x) = h(x), contradicting the fact that s, x = h(x). !
82
S. Liu et al.
6.2 A Variety of Sublinear Functions Since the sandwich inclusion in (3.2) holds, Proposition 3.3 ensures that the enlargement (6.4) is outer semicontinuous and satisfies (3.3). Inner continuity may hold locally in some particular cases, as shown below.
6.2.1 The Absolute Value Function For h(x) = |x| = σ[−1,1] (x), we have that ∂h(0) = [−1, 1] and ext ∂h(0) = {−1, 1}. It is not difficult to derive ⎧ ⎪ if x < −ε/2, ⎪ ⎨{−1} δε h(x) = [−1, 1] if − ε/2 ≤ x ≤ ε/2, ⎪ ⎪ ⎩{1} if x > ε/2, which coincides with the right graph in Fig. 2. This enlargement is isc everywhere except when |x| = 2ε . For any pair (x, ¯ ε¯ ) ¯ r) × such that |x| ¯ < 2ε¯ , there exists r > 0 such that |x| < ε2 for all (x, ε) ∈ B(x, [max {¯ε − r, 0} , ε¯ + r]. Consequently, δε h(x) remains a constant over B(x, ¯ r) × [max {¯ε − r, 0} , ε¯ + r] and is continuous there. Same arguments apply if x¯ > ε2¯ or ¯ = 2ε¯ . Then for all r > 0 there exists x¯ < − 2ε¯ . Now consider the pair such that |x| (x, ε) ∈ B(x, ¯ r) × [max {¯ε − r, 0} , ε¯ + r] such that |x| > ε2 . According to [29, Theorem 4.10], δε h(x) is isc at (x, ¯ ε¯ ) relative to {(x, ε) : ε ≥ 0} if and only if for every ρ > 0 and α > 0 there is B(x, ¯ r) × [max {¯ε − r, 0} , ε¯ + r] such that [−1, 1] ∩ ρB ⊂ δε h(x) + αB for all (x, ε) ∈ B(x, ¯ r) × [max {¯ε − r, 0} , ε¯ + r], where B is the unit ball. Without loss of generality, take (x , ε ) ∈ B(x, ¯ r) × [max {¯ε − r, 0} , ε¯ + r] ε with x > 2 and thus δε h(x ) = {1}, and ρ = α = 1. Then [−1, 1] ⊂ {1} + [−1, 1] = [0, 2], which is impossible.
6.2.2 Finite Valued Sublinear Polyhedral Functions The absolute-value function is a very simple illustration of a polyhedral function that is sublinear and finite everywhere. The enlargement defined in Sect. 5 for general polyhedral functions can be particularized to the finite-valued sublinear case. For a polyhedral function to be sublinear, all the vectors bi and d j in (5.2) must be null, and for the function to be finite valued, the set J must be empty. As a result, given a nonempty finite index set I , h(x) = h(x)
for
h(x) = max i∈I
,
ai , x
- .
Continuity of the VU -Decomposition
83
Setting bi = 0 and J = ∅ in (5.5) gives the expressions for the active and almost active index sets I (x) and Iε (x). Likewise for the errors ei = h(x) = a i , x used in the ε-subdifferential expression (5.7). As shown in Sect. 5.2, the enlargement (5.9) resulting from Iε (x) is osc. For finite-valued polyhedral functions in H, we have therefore two enlargements: δε h(x) from (6.4), and the one from (5.9), given by δε(5.9) (x) = co a i , i ∈ Iε (x) . This enlargement is larger, because δε h(x) = co a i , i ∈ Iε (x) ∩ I ∗ , with I ∗ = i ∈ I : a i ∈ ext ∂h(0) . Together with the sandwich inclusion shown in Proposition 6.1, we have that ∂h(x) ⊂ δε h(x) ⊂ δε(5.9) h(x) . Since, in addition, Proposition 5.1(ii) applies for all x, ¯ there exists ε¯ > 0 such that for all x ∈ B(x, ¯ ε¯ ) and all ε ∈ [0, ε¯ ] δε(5.9) h(x) = ∂h(x) . As a result, locally both enlargements coincide and are continuous as multifunctions of (x, ε). Polyhedral norms are finite-valued sublinear polyhedral functions: h1 (x) := x1
and h∞ (x) := x∞ .
The respective ε-subdifferentials are ∂ε h1 (x) = {s ∈ Rn : s∞ ≤ 1, s, x ≥ x1 − ε} and ∂ε h∞ (x) = {s ∈ Rn : s1 ≤ 1, s, x ≥ x∞ − ε} . While the enlargements (6.4) are δε h1 (x) = co s ∈ Rn : si = ±1, s, x ≥ x1 − ε
and
δε h∞ (x) = co s ∈ Rn : si = ±1, sj = 0, j = i, 1 ≤ i ≤ n, s, x ≥ x∞ − ε .
Figure 4 shows the enlargements for these functions, the δε enlargement represented by the red line, and the ε-subdifferential by the gray area.
84
S. Liu et al.
(a)
(b)
s2
s2 1
1 s1 −1
s1
1
−1
−1
1 −1
Fig. 4 The ε-subdifferential and the δε enlargement for h1 and h∞ . (a) δε h1 (x) ¯ and ∂ε h1 (x) ¯ at ¯ and ∂ε h∞ (x) ¯ at x¯ = (−2, 2) x¯ = (−2, 0). (b) δε h∞ (x)
6.2.3 The Maximum Eigenvalue Function The maximum eigenvalue function λ1 (X) =
max
{q∈Rn ,q=1}
q Xq
is also sublinear. In [27] a special enlargement was introduced to develop a secondorder bundle method to minimize the composition with an affine function. The ε-subdifferential is given by ∂ε λ1 (X) = t t t αi di di : di = 1, αi ≥ 0, αi = 1, αi di Xdi ≥ λ1 (X) − ε . i=1
i=1
i=1
The expression for the subdifferential is obtained taking ε = 0. In a manner similar to (5.1), the author of [27] considers an “activity” set including the ε-largest eigenvalues Iε (X) := {i ∈ {1, . . . , n} : λi (X) > λ1 (X) − ε}, whose eigenspace is the product of the “almost-active” eigenspaces: Eε (X) := ⊕i∈Iε (X) Ei (X) . With these tools, the proposed enlargement has the expression δε[27] λ1 (X) := co dd : d = 1, d ∈ Eε (X) .
Continuity of the VU -Decomposition
85
Although not apparent at first sight, by taking advantage of the structure of λ1 , this enlargement is much easier to compute than the ε-subdifferential. As shown in [27, Prop. 1], the enlargement satisfies (3.2) and is continuous, thanks to the nice analytical property in [27, Prop. 9], stating that, for some matrix Xε , δε[27] λ1 (X) = ∂λ1 (Xε )
and, therefore,
Vε[27] (X) = V (Xε ) .
The enlargement from (6.4) has the expression δε λ1 (X) = co dd : d = 1, d Xd ≥ λ1 (X) − ε , which gives a larger set than δε[27] λ1 (X) . To see this, take d ∈ Eε (X) such that d = 1. As any matrix X ∈ Sn is diagonalizable, there exist eigenvectors r¯ε dj ∈ Ej (X), j = 1, · · · , r¯ε such that d = j =1 dj and dj , dk = 0 for j = k, where r¯ε is the number the first rε eigenvalues among
of distinct
r¯ε r¯ε r¯ε = eigenvalues of X. Then d Xd = X j =1 dj j =1 dj j =1 dj Xdj . As λj , j = 1, · · · r¯ε are distinct eigenvalues satisfying λj ≥ λ1 − ε, it follows that r¯ε 2 [27] d Xd ≥ j =1 dj (λ1 − ε) = λ1 − ε. Therefore, δε λ1 (X) ⊂ δε λ1 (X), as stated.
7 Concluding Remarks In Sect. 3, we considered a simple model function (4.1), and different versions of Vε Uε -subspaces, defined using different options for enlargements of the subdifferential. We argued that the usual ε-subdifferential is not the best choice for the task. In particular, comparing (4.6) and (4.8), we see that the latter is a better approximation from the point of view of VU-decomposition, because at points not near the origin, the function is smooth and it makes sense to let Uε (x) = R2 and Vε (x) = {(0, 0)}, as in (4.8). In fact, the situation is similar for the so-called half-and-half function, extending (4.1) to Rn , with n even. Specifically, consider ! f (x) := f1 (x) + f2 (x) with
√ f1 (x) = x Ax f2 (x) = x Bx .
The matrix A has all the elements zero, except for ones on the diagonal at odd numbered locations (A(i, i) = 1 for i odd). The matrix B is diagonal with elements B(i, i) = 1/i for all i. The minimizer of this partly smooth convex function is at x¯ = 0, where the V and U subspaces both have dimension n/2 (hence, the name “half-and-half”).
86
S. Liu et al.
Directly computing the ε-subdifferential of f is not easy. But we can exploit the structure of this function and define our enlargement to be δε f (x) = δε f1 (x) + ∇f2 (x) . Since f1 is sublinear, the enlargement could be ∂ε f1 (x) or the one defined in (6.4). The possibility of choosing different enlargements, that was illustrated in this work on some structured classes of functions, can be further extended to a much larger group, by means of composition. Namely, suppose a function can be expressed as f (x) = g (F (x)) where g : Rn → R is a proper closed convex function and F : Rn → Rm a smooth mapping. If we have a good enlargement δε g(·), then we can simply define δε f (x) ¯ = ∇F (x) ¯ δε g (F (x)) ¯ . The corresponding enlargement of the V-space is ¯ = ∇F (x) ¯ Vεg (F (x)) Vεf (x) ¯ . Exploiting those constructions in an implementable algorithmic framework is a subject of our current and future work. Acknowledgements The authors would like to thank Alfredo N. Iusem for raising the two counterexamples given in Remark 3.2, and Regina Burachik for her help in the proof of Proposition 3.1. The first author was supported by CNPq grant 401119/2014-9. The second author was partially supported by CNPq Grant 303905/2015-8 and FAPERJ Grant 203.052/2016. The third author was supported in part by CNPq Grant 303724/2015-3 and by FAPERJ Grant 203.052/2016.
References 1. R. S. Burachik and A. N. Iusem. Set-Valued Mappings and Enlargements of Monotone Operators. New York: Springer–Verlag, 2008. 2. P. L. Combettes. “Perspective functions: Properties, constructions, and examples”. Set-Valued and Variational Analysis (2016), pp. 1–18. 3. R. Correa and C. Lemaréchal. “Convergence of some algorithms for convex minimization”. Mathematical Programming 62.1–3 (1993), pp. 261–275. 4. A. Daniilidis, D. Drusvyatskiy, and A. S. Lewis. “Orthogonal invariance and identifiability”. SIAM Journal on Matrix Analysis and Applications 35.2 (2014), pp. 580–598. 5. A. Daniilidis, C. Sagastizábal, and M. Solodov. “Identifying structure of nonsmooth convex functions by the bundle technique”. SIAM Journal on Optimization 20.2 (2009), pp. 820–840. 6. W. Hare. “Numerical Analysis of VU-Decomposition, U-Gradient, and U-Hessian Approximations”. SIAM Journal on Optimization 24.4 (2014), pp. 1890–1913. 7. W. L. Hare and A. S. Lewis. “Identifying active constraints via partial smoothness and proxregularity”. Journal of Convex Analysis 11.2 (2004), pp. 251–266.
Continuity of the VU -Decomposition
87
8. W. Hare, C. Sagastizábal, and M. Solodov. “A proximal bundle method for nonsmooth nonconvex functions with inexact information”. Computational Optimization and Applications 63.1 (2016), pp. 1–28. 9. J. B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I: Fundamentals. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 1996. 10. J.-B. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms II. Advanced Theory and Bundle Methods. Vol. 306. Grundlehren der mathematischen Wissenschaften. Springer Berlin Heidelberg, 1993. 11. M. Huang, L.-P. Pang, and Z.-Q. Xia. “The space decomposition theory for a class of eigenvalue optimizations”. Computational Optimization and Applications 58.2 (June 1, 2014), pp. 423–454. 12. M. Huang et al. “A space decomposition scheme for maximum eigenvalue functions and its applications”. Mathematical Methods of Operations Research 85.3 (June 1, 2017), pp. 453–490. 13. C. Lemaréchal and C. Sagastizábal. “Practical Aspects of the Moreau–Yosida Regularization: Theoretical Preliminaries”. SIAM Journal on Optimization 7.2 (1997), pp. 367–385. 14. C. Lemaréchal, F. Oustry, and C. Sagastizábal. “The U-Lagrangian of a convex function”. Transactions of the American Mathematical Society 352.2 (2000), pp. 711–729. 15. A. S. Lewis. “Active Sets, Nonsmoothness, and Sensitivity”. SIAM Journal on Optimization 13.3 (Jan. 2002), pp. 702–725. 16. A. S. Lewis and S. J. Wright. “Identifying Activity”. SIAM Journal on Optimization 21.2 (2011), pp. 597–614. 17. A. S. Lewis and M. L. Overton. “Nonsmooth Optimization via BFGS”. Optimization Online 2172 (2008). 18. A. S. Lewis and S. Zhang. “Partial smoothness, tilt stability, and generalized Hessians”. SIAM Journal on Optimization 23.1 (2013), pp. 74–94. 19. J. Liang, J. Fadili, and G. Peyré. “Activity Identification and Local Linear Convergence of Forward–Backward-type Methods”. SIAM Journal on Optimization 27.1 (2017), pp. 408–437. 20. J. Liang, J. Fadili, and G. Peyré. “Local Convergence Properties of Douglas–Rachford and Alternating Direction Method of Multipliers”. Journal of Optimization Theory and Applications 172.3 (Mar. 1, 2017), pp. 874–913. 21. Y. Lu et al. “Stochastic Methods Based on-Decomposition Methods for Stochastic Convex Minimax Problems”. Mathematical Problems in Engineering 2014 (2014). 22. R. Mifflin and C. Sagastizábal. “Optimization Stories”. In: vol. Extra Volume ISMP 2012, ed. by M. Grötschel. DOCUMENTA MATHEMATICA, 2012. Chap. A Science Fiction Story in Nonsmooth Optimization Originating at IIASA, p. 460. 23. R. Mifflin and C. Sagastizábal. “A VU-algorithm for convex minimization”. Mathematical programming 104.2-3 (2005), pp. 583–608. 24. R. Mifflin and C. Sagastizábal. “Proximal points are on the fast track”. Journal of Convex Analysis 9.2 (2002), pp. 563–580. 25. R. Mifflin and C. Sagastizábal. “UV-smoothness and proximal point results for some nonconvex functions”. Optimization Methods and Software 19.5 (2004), pp. 463–478. 26. S. A. Miller and J. Malick. “Newton methods for nonsmooth convex minimization: connections among-Lagrangian, Riemannian Newton and SQP methods”. Mathematical programming 104.2-3 (2005), pp. 609–633. 27. F. Oustry “A second-order bundle method to minimize the maximum eigenvalue function”. Mathematical Programming, Series B 89.1 (2000), pp. 1–33. 28. R. T. Rockafellar. Convex analysis. Princeton Landmarks in Mathematics. Princeton University Press, Princeton, NJ, 1997, pp. xviii+451. 29. R. T. 
Rockafellar and R. J.-B. Wets. Variational Analysis: Grundlehren Der Mathematischen Wissenschaften. Vol. 317. Springer, 1998.
Proximal Mappings and Moreau Envelopes of Single-Variable Convex Piecewise Cubic Functions and Multivariable Gauge Functions C. Planiden and X. Wang
Abstract This work presents a collection of useful properties of the Moreau envelope for finite-dimensional, proper, lower semicontinuous, convex functions. In particular, gauge functions and piecewise cubic functions are investigated and their Moreau envelopes categorized. Characterizations of convex Moreau envelopes are established; topics include strict convexity, strong convexity, and Lipschitz continuity. Keywords Gauge function · Lipschitz continuous · Moreau envelope · Piecewise cubic function · Proximal mapping · Strictly convex · Strongly convex Mathematics Subject Classification (2000) Primary 49J53, 52A41; Secondary 49J50, 26C05
1 Introduction The Moreau envelope er f was introduced in its original form by Jean-Jacques Moreau in the mid-1960s [21]. It is an infimal convolution of two functions f and qr , where r > 0 and qr = 2r · 2 . The Moreau envelope offers many benefits in optimization, such as the smoothing of the nonsmooth objective function f [20, 21] while maintaining the same minimum and minimizers of f in the case where f is proper, lower semicontinuous (lsc), and convex [25, 28]. Also in the
C. Planiden Mathematics and Applied Statistics, University of Wollongong, Wollongong, NSW, Australia e-mail: [email protected] X. Wang () Mathematics, University of British Columbia Okanagan, Kelowna, BC, Canada e-mail: [email protected] © Springer Nature Switzerland AG 2019 S. Hosseini et al. (eds.), Nonsmooth Optimization and Its Applications, International Series of Numerical Mathematics 170, https://doi.org/10.1007/978-3-030-11370-4_5
89
90
C. Planiden and X. Wang
convex setting, er f is differentiable and its gradient has an explicit representation, even when f itself is not differentiable [25]. As a result, much research has been done on properties of the Moreau envelope, including differentiability [6, 12, 23], regularization [4, 11, 13, 15, 16, 19], and convergence of the related proximal-point algorithms for finding a minimizer [1, 5, 9, 27, 29]. In this work, we continue the development of convex Moreau envelope theory. We endeavor to show the advantages that this form of regularization has to offer and make comparisons to the Pasch-Hausdorff envelope. Most of the focus is on the set of convex Moreau envelopes; we work to establish characterizations about when a function is a Moreau envelope of a proper, lsc, convex function. We also consider the differentiability properties of er f, annotating the characteristics of the proximal mapping and the Moreau envelope for C k functions. The main contributions of this paper are the analysis of the proximal mapping and Moreau envelope of two particular families of convex functions: piecewise cubic functions and gauge functions. Explicit formulae for, and method of calculation of, the proximal mapping and Moreau envelope for any single-variable convex piecewise-cubic function are given. The Moreau envelope is used to smooth the ridges of nondifferentiability (except the kernel) of any gauge function, while maintaining its status as a gauge function. The special case of norm functions is analyzed as well; the Moreau envelope is used to convert any norm function into one that is smooth everywhere expect at the origin. For both piecewise-cubic functions and gauge functions, several explicit examples with illustrations are included. To the best of our knowledge, the closed forms of Moreau envelopes of many examples given here have not been realized until now. The piecewise cubic work extends the results for piecewise linear-quadratic functions found in [2, 7, 17], and the study of Moreau envelopes of gauge functions is new. The remainder of this paper is organized as follows. Section 2 contains notation, definitions, and facts that are used throughout. In Sect. 3, several known results about the Moreau envelope are collected first, then new results on the set of convex Moreau envelopes are presented. We provide an upper bound for the difference f − er f when f is a Lipschitz continuous function. We establish several characterizations of the Moreau envelope of f convex, based on strict convexity of f, strong convexity of the Fenchel conjugate (er f )∗ , and Lipschitz continuity of ∇er f. We discuss the differentiability of er f, proving that f ∈ C k ⇒ er f ∈ C k . Then we focus on explicit expressions for the Moreau envelope and the proximal mapping for convex piecewise functions on R, which sets the stage for the section that follows. Section 4 concentrates on the set of convex piecewise-cubic functions on R and their Moreau envelopes. We lay out the piecewise domain of er f for f piecewise-cubic and present a theorem that states the proximal mapping and Moreau envelope. Section 5 deals with the smoothing of an arbitrary gauge function by way of . the Moreau envelope. It is shown that given a gauge function f, the function er (f 2 ) is also a gauge function and is differentiable everywhere except on the.kernel. A corollary about norm functions follows: if f is a norm function, then er (f 2 ) is a norm function that is differentiable everywhere except at the
Proximal Mappings and Moreau Envelopes
91
origin. Several examples and illustrations are provided in this section. Section 6 summarizes the results of this work.
2 Preliminaries 2.1 Notation All functions in this work are on Rn , Euclidean space equipped defined √ with inner n product defined x, y = x y and induced norm x = x, x . The i=1 i i n extended real line R ∪{∞} is denoted R. We use 0 (R ) to represent the set of proper, convex, lower semicontinuous (lsc) functions on Rn . The identity operator is denoted Id . We use NC (x) to represent the normal cone to C at x, as defined in [25]. The domain and the range of an operator A are denoted dom A and ran A, p e respectively. Pointwise convergence is denoted →, epiconvergence → .
2.2 Definitions and Facts In this section, we collect some definitions and facts that we need for proof of the main results. Definition 2.1 The graph of an operator A : Rn ⇒ Rn is defined gra A = {(x, x ∗ ) : x ∗ ∈ Ax}. Its inverse A−1 : Rn ⇒ Rn is defined by the graph gra A−1 = {(x ∗ , x) : x ∗ ∈ Ax}. Definition 2.2 For any function f : Rn → R, the Fenchel conjugate of f is denoted f ∗ : Rn → R and defined by f ∗ (x ∗ ) = sup [ x ∗ , x − f (x)]. x∈Rn
Definition 2.3 For a proper, lsc function f : Rn → R, the Moreau envelope of f is denoted er f and defined by r er f (x) = infn f (y) + y − x2 . 2 y∈R
92
C. Planiden and X. Wang
The vector x is called the prox-centre and the scalar r ≥ 0 is called the proxparameter. The associated proximal mapping is the set of all points at which the above infimum is attained, denoted Pr f : r Pr f (x) = argmin f (y) + y − x2 . 2 y∈Rn Definition 2.4 A function f ∈ 0 (Rn ) is σ -strongly convex if there exists a modulus σ > 0 such that f − σ2 · 2 is convex. Equivalently, f is σ -strongly convex if there exists σ > 0 such that for all λ ∈ (0, 1) and for all x, y ∈ Rn , f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y) −
σ λ(1 − λ)x − y2 . 2
Definition 2.5 A function f ∈ 0 (Rn ) is strictly convex if for all x, y ∈ dom f, x = y and all λ ∈ (0, 1), f (λx + (1 − λ)y) < λf (x) + (1 − λ)f (y). Definition 2.6 A function f ∈ 0 (Rn ) is essentially strictly convex if f is strictly convex on every convex subset of dom ∂f. Next, we have some facts about the Moreau envelope, including differentiability, upper and lower bounds, pointwise convergence characterization, linear translation, and evenness. Fact 2.7 (Inverse Function Theorem [8, Theorem 5.2.3]) Let f : U → Rn be C k on the open set U ⊆ Rn . If at some point the Jacobian of f is invertible, then there exist V ⊆ Rn open and g : V → Rn of class C k such that (i) v0 = f (u0 ) ∈ V and g(v0 ) = u0 ; (ii) U0 = g(V ) is open and contained in U ; (iii) f (g(v)) = v ∀v ∈ V . Thus, f : U0 → V is a bijection and has inverse g : V → U0 of class C k . Fact 2.8 ([3, Proposition 12.9], [25, Theorem 1.25]) Let f ∈ 0 (Rn ). Then for all x ∈ Rn , (i) inf f ≤ er f (x) ≤ f (x), (ii) lim er f (x) = f (x), and r%∞
(iii) lim er f (x) = inf f. r&0
Fact 2.9 ([25, Theorem 7.37]) Let {f ν }ν∈N ⊆ 0 (Rn ) and f ∈ 0 (Rn ). Then p e f ν → f if and only if er f ν → er f . Moreover, the pointwise convergence of er f ν to er f is uniform on all bounded subsets of Rn , hence yields epi-convergence to er f as well.
Proximal Mappings and Moreau Envelopes
93
Fact 2.10 ([10, Lemma 2.2]) Let f : Rn → R be proper lsc, and g(x) = f (x) − a x for some a ∈ Rn . Then
a 1 − a x − a a. er g(x) = er f x + r 2r Lemma 2.11 Let f : Rn → R be an even function. Then er f is an even function. Proof Let f (−x) = f (x). Then r (er f )(−x) = infn f (z) + z − (−x)2 . 2 z∈R Let z = −y. Then r (er f )(−x) = inf n f (−y) + − y − (−x)2 2 −y∈R r = inf n f (y) + y − x2 2 −y∈R r = infn f (y) + y − x2 2 y∈R = (er f )(x).
!
3 Properties of the Moreau Envelope of Convex Functions In this section, we present results on bounds and differentiability, and follow up with characterizations that involve strict convexity, strong convexity, and Lipschitz continuity. These results are the setup for the two sections that follow, where we explore more specific families of functions.
3.1 The Set of Convex Moreau Envelopes We begin by providing several properties of Moreau envelopes of proper, lsc, convex functions. We show that the set of all such envelopes is closed and convex, and we give a bound for f − er f when f is Lipschitz continuous. The facts in this section are already known in the literature, but they are scattered among several articles and books, so it is convenient to have them all in one collection. Fact 3.1 ([26, Theorem 3.18]) Let f : Rn → R be proper and lsc. Then er f = f for some r > 0 if and only if f is a constant function. Fact 3.2 ([22, Theorem 3.1]) The set er (0 (Rn )) is a convex set in 0 (Rn ).
94
C. Planiden and X. Wang
Fact 3.3 ([22, Theorem 3.2]) The set er (0 (Rn )) is closed under pointwise convergence. Proposition 3.4 Let f ∈ 0 (Rn ) be L-Lipschitz. Then for all x ∈ dom f and any r > 0, 0 ≤ f (x) − er f (x) ≤
L2 . 2r
Proof We have 0 ≤ f (x) − er f (x) ∀x ∈ dom f by Rockafellar and Wets [25, Theorem 1.25]. Then r f (x) − er f (x) = f (x) − f (Pr f (x)) + x − Pr f (x)2 2 r ≤ Lx − Pr f (x) − x − Pr f (x)2 2 r 2 = Lt − t , 2 where t = x − Pr f (x). This is a concave quadratic function whose maximizer is L/r. Thus, r L r Lt − t 2 ≤ L − 2 r 2 =
2 L r
L2 . 2r
!
The following example demonstrates that for an affine function, the bound in Proposition 3.4 is tight. Example 3.5 Let f : Rn → R, f (x) = a, x + b, a ∈ Rn , b ∈ R . Then f − er f =
a2 . 2r
Proof We have r er f (x) = infn a, y + b + y − x2 = infn g(y). 2 y∈R y∈R Setting g (y) = 0 to find critical points yields y = x − a/r. Substituting into g(y), we have 2 , a a- r
a2 + x − − x = a, x + b − . er f (x) = a, x − r 2 r 2r
Proximal Mappings and Moreau Envelopes
95
Thus, f − er f =
a2 , 2r
where a is the Lipschitz constant of the affine function f.
!
The next theorem is a characterization of when a convex function and its Moreau envelope differ only by a constant: when the function is affine. Theorem 3.6 Let f ∈ 0 (Rn ), r > 0. Then f = er f + c for some c ∈ R if and only if f is an affine function. Proof (⇐) This is the result of Example 3.5. (⇒) Suppose that f = er f + c. Taking the Fenchel conjugate of both sides and rearranging, we have f∗ +
1 · 2 = f ∗ + c. 2r
(3.1)
Let x0 be such that f ∗ (x0 ) < ∞. Then by (3.1) we have f ∗ (x0 ) +
1 x0 2 = f ∗ (x0 ) + c 2r 1 c = x0 2 . 2r
(3.2)
Now suppose there exists x1 = x0 such that f ∗ (x1 ) < ∞. Then co{x0 , x1 } ⊆ 1 dom f ∗ , thus, 2r tx0 + (1 − t)x1 2 = c for all t ∈ [0, 1]. Substituting (3.2), we have t 2 x0 2 + 2t (1 − t) x0 , x1 + (1 − t)2 x1 2 = x0 2 (t 2 − 1)x0 2 − 2t (t − 1) x0 , x1 + (t − 1)2 x1 2 = 0 (t + 1)x0 2 − 2t x0 , x1 + (t − 1)x1 2 = 0.
(3.3)
Note that if one of x0 , x1 equals zero, then (3.3) implies that the other one equals zero, a contradiction to x0 = x1 . Hence, x0 = 0 and x1 = 0. Since the left-hand side of (3.3) is a smooth function of t for t ∈ (0, 1), we take the derivative of (3.3) with respect to t and obtain x0 2 − 2 x0 , x1 + x1 2 = 0 x0 − x1 2 = 0 x0 = x1 ,
96
C. Planiden and X. Wang
a contradiction. Hence, dom f ∗ = {x0 }. Therefore, f ∗ = ι{x0 } + k for some k ∈ R, and we have f (x) = x, x0 − k. !
3.2 Characterizations of the Moreau Envelope Now we show the ways in which er f can be characterized in terms of f when f has a certain structure. We consider the properties of strict convexity, strong convexity, and Lipschitz continuity. Theorem 3.7 Let f ∈ 0 (Rn ). Then f is essentially strictly convex if and only if er f is strictly convex. Proof Let f be essentially strictly convex. By Rockafellar [24, Theorem 26.3], 1 we have that f ∗ is essentially smooth, as is (er f )∗ = f ∗ + 2r · 2 . Applying [24, Theorem 26.3] gives us that er f is essentially strictly convex. Since Moreau envelopes of convex functions are convex and full-domain, this essentially strict convexity is equivalent to strict convexity. Therefore, er f is strictly convex. Conversely, assuming that er f is strictly convex, the previous statements in reverse order allow us to conclude that f is essentially strictly convex. ! Theorem 3.8 Let g ∈ 0 (Rn ). Then f = er g if and only if f ∗ is strongly convex with modulus 1/r. Proof (⇒) Suppose f = er g. Making use of the Fenchel conjugate, we have f = er g 1 · 2 2r 1 g ∗ = f ∗ − · 2 . 2r
f ∗ = g∗ +
Since the conjugate of g ∈ 0 (Rn ) is again a function in 0 (Rn ), we have that 1 f ∗ − 2r · 2 is in 0 (Rn ), which means that f ∗ is strongly convex with modulus 1/r. 1 (⇐) Suppose f ∗ is strongly convex with modulus 1/r. Then f ∗ − 2r · 2 = g ∗ for n ∗ some g in 0 (R ), and we have 1 · 2 2r ∗
r · 2 . = g∗ + 2
f ∗ = g∗ +
Proximal Mappings and Moreau Envelopes
97
Taking the Fenchel conjugate of both sides, and invoking [3, Theorem 16.4], we have (using as the infimal convolution operator) ∗ ∗
r f = g∗ + · 2 2 r ∗∗ · 2 =g 2 r =g · 2 2 = er g.
!
Fact 3.9 ([3, Corollary 18.18]) Let g ∈ 0 (Rn ). Then g = er f for some f ∈ 0 (Rn ) if and only if ∇g is r-Lipschitz. Fact 3.10 ([22, Lemma 2.3]) Let r > 0. The function f ∈ 0 (Rn ) is r-strongly r convex if and only if e1 f is r+1 -strongly convex. For strongly convex functions, a result that resembles the combination of Facts 3.9 and 3.10 is found in [16]. This is a reciprocal result, in that it is not er f that is found to have a Lipschitz gradient as in Fact 3.9, but (e1 f )∗ .1 Fact 3.11 ([16, Theorem 2.2]) Let f be a finite-valued convex function. For a n , define ·, · 2 symmetric linear operator M ∈ S++ M = M·, · , · M = ·, · M and ! $ 1 F (x) = infn f (y) + y − x2M . 2 y∈R Then the following are equivalent: (i) (ii) (iii) (iv)
f is k1 -strongly convex; ∇f ∗ is k-Lipschitz; ∇F ∗ is K-Lipschitz; F is K1 -strongly convex;
for some K such that k − 1/λ ≤ K ≤ k + 1/λ, where λ is the minimum eigenvalue of M.
3.3 Differentiability of the Moreau Envelope It is well known that er f is differentiable if f ∈ 0 (Rn ); see [25]. In this section, we study differentiability of er f when f enjoys higher-order differentiability. 1 Thank
you to the anonymous referee for providing this reference.
98
C. Planiden and X. Wang
Theorem 3.12 Let f ∈ 0 (Rn ) and f ∈ C k . Then er f ∈ C k . Proof If k = 1, the proof is that of [25, Proposition 13.37]. Assume k > 1. Since f ∈ 0 (Rn ), by Rockafellar and Wets [25, Theorem 2.26] we have that −1 1 , ∇er f = r Id −r Id + ∇f r
(3.4)
−1 is unique for each x ∈ dom f. Let y = and that Pr f = Id + 1r ∇f
−1 Id + 1r ∇f (x). Then x = y + 1r ∇f (y) =: g(y), and for any y0 ∈ dom f we have 1 ∇g(y0 ) = Id + ∇ 2 f (y0 ), r where ∇ 2 f (y0 ) ∈ Rn×n exists (since f ∈ C 2 ) and is positive semidefinite. This gives us that ∇g ∈ C k−2 , so that g ∈ C k−1 . Then by Fact 2.7, we have that g −1 = Pr f ∈ C k−1 . Thus, by (3.4) we have that ∇er f ∈ C k−1 . Therefore, er f ∈ C k . !
3.4 Moreau Envelopes of Piecewise Differentiable Functions When a function is piecewise differentiable, using Minty’s surjective theorem, we can provide a closed analytical form for its Moreau envelope. This section is the setup for the main result of Sect. 4, in which Theorem 4.9 gives the explicit expression of the Moreau envelope for a piecewise cubic function on R . Proposition 3.13 Let f1 , f2 : R → R be convex and differentiable on the whole of R such that f1 (x), if x ≤ x0 f (x) = f2 (x), if x ≥ x0 is convex. Then ⎧ 1 ⎪ ⎪ ⎨Pr f1 (x), if x < x0 + r f1 (x0 ), Pr f (x) = x0 , if x0 + 1r f1 (x) ≤ x ≤ x0 + 1r f2 (x0 ), ⎪ ⎪ ⎩P f (x), if x > x + 1 f (x ), r 2 0 r 2 0 ⎧ ⎪ if x < x0 + 1r f1 (x0 ), ⎪er f1 (x), ⎨ er f (x) = f1 (x0 ) + 2r (x0 − x)2 , if x0 + 1r f1 (x0 ) ≤ x ≤ x0 + 1r f2 (x0 ), ⎪ ⎪ ⎩e f (x), if x > x + 1 f (x ). r 2
0
r 2
0
Proximal Mappings and Moreau Envelopes
99
Proof First observe that since f is convex, f1 (x0 ) = f2 (x0 ) and f1 (x0 ) ≤ f2 (x0 ). Hence, f is continuous, and the regions x < x0 + 1r f1 (x0 ) and x > x0 + 1r f2 (x0 ) cannot overlap. We split the Moreau envelope as follows: /
0 r r 2 2 . er f (x) = min inf f1 (y) + (y − x) , inf f2 (y) + (y − x) y 0
and
er f (x) =
⎧ ⎪ −5x − 25 ⎪ 2r − 2, ⎪ ⎪ ⎪ r 2 ⎪ ⎪ ⎨ 2 (x + 1) + 3,
r 2 r+2 (x − 1) − 1, ⎪ ⎪ r 2 ⎪ ⎪ ⎪2x , √ ⎪ ⎪ ⎩ r 3 −r(r+12x) r 2 +12rx+18r 2 x+54rx 2 108
if x < −1 − 5r , if − 1 − if − 1 − if −
2 r
5 r 4 r
≤ x ≤ −1 − 4r , < x < − 2r ,
≤ x ≤ 0,
, if x > 0.
Proof The proof is a matter of applying Corollary 3.14 with x1 = −1, x2 = 0, f0 (x) = −5x − 2, f1 (x) = (x − 1)2 − 1 and f2 (x) = x 3 . The algebra and calculus are elementary and are left to the reader as an exercise. ! Figure 1 presents f and er f for several values of r. Fig. 1 The functions f (black) and er f for r = 1 (red), 5 (green), and 20 (blue)
Proximal Mappings and Moreau Envelopes
103
Theorem 3.16 Let f ∈ 0 (R) be differentiable on [a, b]. Define g ∈ 0 (R) by g(x) =
f (x), if a ≤ x ≤ b, ∞,
otherwise.
Then
Pr g(x) =
and
er g(x) =
⎧ ⎪ ⎪a, ⎨
if x ≤ a + 1r f (a),
Pr f (x), if a + 1r f (a) < x < b + 1r f (b), ⎪ ⎪ ⎩b, if b + 1r f (b) ≤ x,
⎧ r 1 2 ⎪ ⎪ ⎨f (a) + 2 (a − x) , if x ≤ a + r f (a), er f (x), ⎪ ⎪ ⎩f (b) + r (b − x)2 , 2
if a + 1r f (a) < x < b + 1r f (b), if b + 1r f (b) ≤ x.
−1 [3, Example 23.3]. Find the Proof We use the fact that Pr g = Id + 1r ∂g subdifferential of g : ⎧ ⎪ if a < x < b, ⎪ ⎪f (x), ⎪ ⎨f (a) + R , if x = a, − ∂g(x) = ⎪f (b) + R+ , if x = b, ⎪ ⎪ ⎪ ⎩ ∅, otherwise. Multiplying by
1 r
and adding the identity function, we obtain
⎧ ⎪ x + 1r f (x), ⎪ ⎪ ⎪ ⎨ 1 a + 1r f (a) + R− , x + ∂g(x) = ⎪ r b + 1r f (b) + R+ , ⎪ ⎪ ⎪ ⎩ ∅,
if a < x < b, if x = a, if x = b, otherwise.
−1 Now applying the identity Pr g(x) = Id + 1r ∂g (x), we find
Pr g(x) =
⎧ 1 1 ⎪ ⎪ ⎨Pr f (x), if a + r f (a) < x < b + r f (b), a, ⎪ ⎪ ⎩b,
if a + 1r f (a) ≥ x, if b + 1r f (b) ≤ x.
!
104
C. Planiden and X. Wang
Fig. 2 The functions g (black) and e1 g (red)
Example 3.17 Define g(x) =
x,
if − 1 ≤ x ≤ 2,
∞, otherwise.
Then by Theorem 3.16 (see Fig. 2), ⎧ ⎪ ⎪−1 + 2r (−1 − x)2 , ⎨ 1 er g(x) = x − 2r , ⎪ ⎪ ⎩2 + r (2 − x)2 , 2
if x ≤ −1 + 1r , if − 1 +
1 2
< x ≤ 2 + 1r ,
if x > 2 + 1r .
4 The Moreau Envelope of Piecewise Cubic Functions In this section, we concentrate our efforts on the class of univariate, piecewise cubic functions.
4.1 Motivation Piecewise polynomial functions are of great interest in current research because they are commonly used in mathematical modelling, and thus in many optimization algorithms that require a relatively simple approximation function. Convex
Proximal Mappings and Moreau Envelopes
105
piecewise functions in general, and their Moreau envelopes, are explored in [18, 19] and similar works. Properties of piecewise linear-quadratic (PLQ) functions in particular, and their Moreau envelopes, are developed in [2, 7, 17] and others. The new theory of piecewise cubic functions found in this section will enable the expansion of such works to polynomials of one degree higher, and any result developed here reverts to the piecewise linear-quadratic case by setting the cubic coefficients to zero. Matters such as interpolation for discrete transforms, closedness under Moreau envelope, and efficiency of Moreau envelope algorithms that are analyzed in [17] for PLQ functions can now be extended to the piecewise cubic case, as can the PLQ Toolbox software found in [17, §7]. Indeed, it is our intention that many applications and algorithms that currently use PLQ functions as their basis will become applicable to a broader range of useful situations due to expansion to the piecewise-cubic setting.
4.2 Convexity We begin with the definition and a lemma that characterizes when a piecewise cubic function is convex. Definition 4.1 A function f : R → R is called piecewise cubic if dom f can be represented as the union of finitely many closed intervals, relative to each of which f (x) is given by an expression of the form ax 3 + bx 2 + cx + d with a, b, c, d ∈ R . Proposition 4.2 If a function f : R → R is piecewise cubic, then dom f is closed and f is continuous relative to dom f, hence lsc on R . Proof The proof is the same as that of [25, Proposition 10.21].
!
Lemma 4.3 For i = 1, 2, . . . , m, let fi be a cubic, full-domain function on R, fi (x) = ai x 3 + bi x 2 + ci x + di . For i = 1, 2, . . . , m − 1, let {xi } be in increasing order, x1 < x2 < · · · < xm−1 , such that fi (xi ) = fi+1 (xi ). Define the subdomains D1 = (−∞, x1 ], D2 = [x1 , x2 ], . . . , Dm−1 = [xm−2 , xm−1 ], Dm = [xm−1 , ∞).
106
C. Planiden and X. Wang
Then the function f defined by
f (x) =
⎧ ⎪ f1 (x), ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎨f2 (x),
if x ∈ D1 , if x ∈ D2 , .. .
⎪ ⎪ ⎪ ⎪ ⎪ ⎪fm−1 (x), if x ∈ Dm−1 , ⎪ ⎩ if x ∈ Dm fm (x),
is a continuous, piecewise cubic function. Moreover, f is convex if and only if (i) fi is convex on Di for each i, and (x ) for each i < m. (ii) fi (xi ) ≤ fi+1 i Proof By Proposition 4.2, f is a continuous, piecewise cubic function. (x ) for each (⇐) Suppose that each fi is convex on Di and that fi (xi ) ≤ fi+1 i i < m. Since fi is convex and smooth on int Di for each i, we have that for each i : (a) fi is monotone on int Di , (b) fi (xi ) = sup fi (x) (by point (a), and because fi is polynomial fi is x∈int Di
continuous, fi is an increasing function), and (x ) = (x) (by point (a) and continuity of f ). (c) fi+1 inf fi+1 i i x∈int Di+1
(x ) : Then at each xi , the subdifferential of f is the convex hull of fi (xi ) and fi+1 i (x )]. (d) ∂f (xi ) = [fi (xi ), fi+1 i
Points (a), (b), (c), and (d) above give us that ∂f is monotone over its domain. Therefore, f is convex. (⇒) Suppose that f is convex. It is clear that if fi is not convex on Di for some i, then f is not convex and we have a contradiction. Hence, fi is convex on Di for each (x ) < f (x ) i, and point (i) is true. Suppose for eventual contradiction that fi+1 i i i for some i < m. Since point (i) is true, point (a) and hence point (b) are also true. Thus, since fi is a continuous function on Di , there exists x ∈ int Di such that (x ). Since x < x , we have that ∂f is not monotone. Hence, f is not fi (x) > fi+1 i i (x ) for all i < m. convex, a contradiction. Therefore, fi (xi ) ≤ fi+1 ! i
4.3 Examples It will be helpful to see how the Moreau envelopes of certain piecewise cubic functions behave graphically. Visualizing a few simple functions and their Moreau envelopes points the way to the main results in the next section.
Proximal Mappings and Moreau Envelopes
107
Example 4.4 Let x1 = −1, x2 = 1. Define f0 (x) = −2x 3 +2x 2 +2x+3, f1 (x) = x 3 +3x 2 −x+2, f2 (x) = 3x 3 +2x 2 +2x−2, ⎧ ⎪ ⎪f0 (x), if x < x1 ⎨ f (x) = f1 (x), if x1 ≤ x < x2 , ⎪ ⎪ ⎩f (x), if x ≤ x. 2
2
It is left to the reader to verify that f is convex. Notice that x1 and x2 are points of nondifferentiability. We find that 1 8 x1 + f0 (x1 ) = −1 − , r r 1 8 x2 + f1 (x2 ) = 1 + , r r
1 4 x1 + f1 (x1 ) = −1 − , r r 1 15 x2 + f2 (x2 ) = 1 + . r r
Then according to Corollary 3.14, we have
Pr f (x) =
⎧ ⎪ p1 , if x < −1 − 8r , ⎪ ⎪ ⎪ ⎪ 8 4 ⎪ ⎪ ⎨x1 , if − 1 − r ≤ x ≤ −1 − r , p2 , if − 1 − 4r < x < 1 + 8r , ⎪ ⎪ ⎪ 8 15 ⎪ ⎪ ⎪x2 , if 1 + r ≤ x ≤ 1 + r , ⎪ ⎩p , if 1 + 15 < x, 3 r
where .
(4 + r)2 + 24(2 − rx) , 12 . −(6 + r) + (6 + r)2 + 12(1 + rx) p2 = , 6 . −(4 + r) + (4 + r)2 − 36(2 − rx) . p3 = 18 p1 =
(4 + r) −
Remark 4.5 Note that in finding the proximal points of convex cubic functions, setting the derivative of the infimand of the Moreau envelope expression equal to zero and solving yields two points (positive and negative square root). However, the proximal mapping is strictly monotone and only one of the two points will be in the appropriate domain. The method for choosing the correct proximal point is laid out in Proposition 4.10. As an illustration of Remark 4.5, consider our choice of p2 above. The counterpart of p2 has a negative square root, but p2 is correct as given. It is easy to see that
108
C. Planiden and X. Wang
Fig. 3 The functions f (x) (black) and er f (x) for r = 1, 10, 50, 100
p2 ∈ [x1 , x2 ], by noting that p2 (x) is an increasing function of x and observing that p2 (−1−4/r) = x1 and p2 (1+8/r) = x2 . The proper choices of p1 and p3 are made in a similar manner. This method is presented in general form in Proposition 4.10. Then we have (see Fig. 3)
er f (x) =
⎧ r 8 3 2 2 ⎪ ⎪ ⎪−2p1 + 2p1 + 2p1 + 3 + 2 (p1 − x) , if x < −1 − r , ⎪ ⎪ ⎪ if − 1 − 8r ≤ x ≤ −1 − 4r , ⎪5 + 2r (−1 − x)2 , ⎨ p23 + 3p22 − p2 + 2 + 2r (p2 − x)2 , ⎪ ⎪ ⎪ r 2 ⎪ ⎪ ⎪5 + 2 (1 − x) , ⎪ ⎩3p3 + 2p2 + 2p − 2 + r (p − x)2 , 3 3 3 2 3
Example 4.6 Let f : R → R, f (x) =
if − 1 − if 1 + if 1 +
x 2,
if x ≤ 0,
x 3,
if x > 0.
4 8 r < x < 1+ r, 8 15 r ≤x ≤1+ r , 15 r < x.
Then Pr f (x) =
⎧ ⎨
rx r+2 ,√ ⎩ −r+ r 2 +12rx , 6
if x ≤ 0, if x > 0,
and ⎧ 2 rx ⎪ , ⎨ r+2 3 √ er f (x) = −r+ r 2 +12rx ⎪ + ⎩ 6
r 2
√
−r+
r 2 +12rx 6
if x ≤ 0,
2 −x
,
if x > 0.
Proximal Mappings and Moreau Envelopes
109
Proof The Moreau envelope is r er f (x) = inf f (y) + (y − x)2 y∈R 2 / 0 r r = min inf y 2 + (y − x)2 , inf y 3 + (y − x)2 . y≤0 y>0 2 2 (i) Let x ≤ 0. Then, with the restriction y > 0, y 3 + 2r (y − x)2 is minimized at y = 0, so that r r inf y 3 + (y − x)2 = x 2 . y>0 2 2 For the other infimum, setting the derivative of its argument equal to zero yields rx a minimizer of y = r+2 , so that rx 2 r inf y 2 + (y − x)2 = = er f (x). y≤0 2 r +2 (ii) Let x > 0. Then, with the restriction y ≤ 0, y 2 + 2r (y − x)2 is minimized at y = 0, so that r r inf y 2 + (y − x)2 = x 2 . y≤0 2 2 For the other infimum, setting the derivative of its argument equal to zero yields √ a minimizer of y =
−r+
r 2 +12rx 6
(see Remark 4.5), so that
r inf y 3 + (y − x)2 = y>0 2
'
−r +
r + 2
'
(3 √ r 2 + 12rx 6
−r +
(4.1)
(2 √ r 2 + 12rx −x , 6
which is less than 2r x 2 for all x > 0. This can be seen by subtracting 2r x 2 from the right-hand side of (4.1), and using calculus to show that the maximum of the resulting function is zero. The statement of the example follows. ! Figure 4 illustrates the result for r = 1. This result is perhaps surprising at first glance, since we know that as r % ∞ we must have er f % f. This leads us to suspect that the Moreau envelope on the cubic portion of the function will be a cubic
110
C. Planiden and X. Wang
Fig. 4 f (x) (black), e1 f (x) (red)
function, but the highest power of x in the Moreau envelope is 2. The following proves that this envelope does indeed converge to x 3 . We have ⎡' (3 ' (2 ⎤ . . 2 + 12rx 2 + 12rx −r + r r 1 −r + lim ⎣ + −x ⎦ 6 2 6 r%∞ . −4r 3 − 36r 2 x + 6r 2 + 72rx + 108x 2 + (4r 2 + 12rx − 6r − 36x) r 2 + 12rx = lim 216 r%∞ = lim
2x 3 (4r 3 − 6r 2 − 27x)
r%∞ 4r 3 + 36r 2 x − 6r 2 − 72rx − 108x 2 + (4r 2 + 12rx − 6r − 36x)
2x 3 4 − 6r − 27x 3 r = lim 5 2 r%∞ 36x 6 72x 108x 12x 12x 4 + r − r − 2 − 3 + 4 + r − 6r − 36x 1 + r r r r2 =
2x 3 · 4 √ = x3 . 4+4 1
Example 4.7 Let f : R → R, f (x) = |x|3 . Then ⎧ √ ⎨ r− r 2 −12rx , √6 Pr f (x) = ⎩ −r+ r 2 +12rx , 6
if x < 0, if x ≥ 0,
.
r 2 + 12rx
Proximal Mappings and Moreau Envelopes
111
and ⎧ 3 √ ⎪ ⎪ −r+ r 2 −12rx + ⎨ 6 er f (x) = 3 √ ⎪ 2 ⎪ ⎩ −r+ r +12rx + 6
r 2 r 2
√
r−
r 2 −12rx 6
√
−r+
2 −x
r 2 +12rx 6
if x < 0,
, 2
−x
,
if x ≥ 0.
Proof The Moreau envelope is r er f (x) = inf f (y) + (y − x)2 y∈R 2 / 0 r r = min inf −y 3 + (y − x)2 , inf y 3 + (y − x)2 . y≤0 y>0 2 2 By an argument identical to that of the previous example, we find that for x ≥ 0, ' er f (x) =
−r +
(3 ' (2 √ √ r 2 + 12rx r −r + r 2 + 12rx −x . + 6 2 6
Then by Lemma 2.11, we conclude the statement of the example.
!
Example 4.8 Let f : R → R, f (x) = |x|3 + ax. Then ⎧ √ ⎨ r− r 2 −12(rx−a) , if x < ar , √ 6 Pr f (x) = ⎩ −r+ r 2 +12(rx−a) , if x ≥ a , 6 r and ⎧ 3 2 √ √ ⎪ −r+ r 2 −12(rx−a) −r+ r 2 −12(rx−a) ⎪ r a ⎪ + + x − ⎪ 6 2 6 r ⎪ ⎪ ⎪ ⎪ ⎨ a2 a +ax − 2r , if x < r er f (x) = 3 2 √ √ ⎪ −r+ r 2 +12(xr−a) −r+ r 2 +12(xr−a) ⎪ r a ⎪ + − x + ⎪ 6 2 6 r ⎪ ⎪ ⎪ ⎪ 2 ⎩ +ax − a2r , if x ≥ ar . Proof The proof is found by applying Fact 2.10 to Example 4.7.
!
4.4 Main Result The examples of the previous section suggest a theorem for the case of a general convex cubic function on R. The theorem is the following.
112
C. Planiden and X. Wang
Theorem 4.9 Let f : R → R, f (x) = a|x|3 +bx 2 +cx +d, with a, b ≥ 0. Define . (r + 2b)2 − 12a(rx − c) , p1 = 6a . −r − 2b + (r + 2b)2 + 12a(rx − c) . p2 = 6a r + 2b −
Then the proximal mapping and Moreau envelope of f are Pr f (x) =
er f (x) =
p1 , if x < cr , p2 , if x ≥ cr , −ap13 + bp12 + d − p1 (rx − c) + 2r (p12 + x 2 ),
if x < cr ,
ap23 + bp22 + d − p2 (rx − c) + 2r (p22 + x 2 ),
if x ≥ rc .
Proof We first consider g(x) = a|x|3 +bx 2 +d, and we use Lemma 2.10 to account for the cx term later. By the same method as in Example 4.6, for x < 0 we find that q1 = Pr g(x) =
r + 2b −
. (r + 2b)2 − 12arx 6a
and r er g(x) = −aq13 + bq12 + d + (q1 − x)2 . 2 Then by Lemma 2.11, for x ≥ 0 we have that q2 = Pr g(x) =
−r − 2b +
.
(r + 2b)2 + 12arx 6a
and r er g(x) = aq23 + bq22 + d + (q2 − x)2 . 2 Finally, Lemma 2.10 gives us that er f (x) = er g x − cr + cx − the proximal mapping and Moreau envelope that we seek.
c2 2r ,
which yields !
Now we present the application of Corollary 3.14 to convex piecewise cubic functions. First, we deal with the issue mentioned in Remark 4.5: making the proper choice of proximal point for a cubic piece.
Proximal Mappings and Moreau Envelopes
113
Proposition 4.10 Let f : R → R be a convex piecewise cubic function (see Lemma 4.3), with each piece fi defined by fi (x) = ai x 3 + bi x 2 + ci x + di , ∀x ∈ R . Then on each subdomain Si = xi + 1r fi (xi ), xi+1 + 1r fi (xi+1 ) (and setting x0 = −∞ and xm+1 = ∞), the proximal point of fi is pi =
−(2bi + r) +
. (2bi + r)2 − 12ai (ci − rx) . 6ai
Proof Recall from Lemma 4.3 that dom fi = R for each i. For fi , the proximal mapping is r Pr fi (x) = argmin ai y 3 + bi y 2 + ci y + di + (y − x)2 . 2 y∈R Setting the derivative of the infimand equal to zero yields the potential proximal points: 3ai y 2 + 2bi y + ci + ry − rx = 0 = 3ai y 2 + (2bi + r)y + (ci − rx), y=
−(2bi + r) ±
. (2bi + r)2 − 12ai (ci − rx) . 6ai
(4.2)
Notice that any x ∈ Si can be written as 1 x = x˜ + (3ai x˜ 2 + 2bi x˜ + ci ) for some x˜ ∈ [xi , xi+1 ]. r Substituting into (4.2) yields y=
−(2bi + r) ± |2bi + r + 6ai x| ˜ 6ai
(4.3)
Since fi is convex on [xi , xi+1 ], the second derivative is nonnegative: 6ai x+2b ˜ i ≥0 for all x˜ ∈ [xi , xi+1 ]. Thus, |2bi + r + 6ai x| ˜ = 2bi + r + 6ai x˜ and the two points of (4.3) are pi =
−(2bi + r) + 2bi + r + 6ai x˜ = x, ˜ 6ai
pj =
−(2bi + r) − 2bi − r − 6ai x˜ 2bi + r = −x˜ − . 6ai 3ai
114
C. Planiden and X. Wang
Therefore, pi is the proximal point, since it lies in [xi , xi+1 ]. This corresponds to the positive square root of (4.2), which gives us the statement of the proposition. ! Corollary 4.11 Let f : R → R be a convex piecewise cubic function:
f (x) =
⎧ ⎪ f0 (x), ⎪ ⎪ ⎪ ⎪ ⎨f1 (x),
if x ≤ x1 , if x1 ≤ x ≤ x2 , .. .
⎪ ⎪ ⎪ ⎪ ⎪ ⎩f (x), if x ≤ x, m m
where fi (x) = ai x 3 + bi x 2 + ci x + di ,
a i , b i , ci , d i ∈ R .
For each i ∈ {0, 1, . . . , m}, define pi =
−(2bi + r) +
. (2bi + r)2 − 12ai (ci − rx) . 6ai
Partition dom f as follows: 1 2 S0 = −∞, x1 + (3a0x1 + 2b0 x1 + c0 ) , r / 0 1 1 2 2 S1 = x1 + (3a0 x1 + 2b0x1 + c0 ), x1 + (3a1 x1 + 2b1 x1 + c1 ) , r r 1 1 S2 = x1 + (3a1 x12 + 2b1 x1 + c1 ), x2 + (3a1 x22 + 2b1x2 + c1 ) , r r .. .
1 2 + 2bm xm + cm ), ∞ . S2m = xm + (3am xm r Then the proximal mapping and Moreau envelope of f are ⎧ ⎪ p , ⎪ ⎪ 0 ⎪ ⎪ ⎪ ⎪x 1 , ⎨ Pr f (x) = p1 , ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎩ pm ,
⎧ ⎪ f (p ) + 2r (p0 − x)2 , ⎪ ⎪ 0 0 ⎪ ⎪ r 2 ⎪ if x ∈ S1 , ⎪f1 (x1 ) + 2 (x1 − x) , ⎨ if x ∈ S2 , er f (x) = f1 (p1 ) + 2r (p1 − x)2 , ⎪ ⎪ .. ⎪ ⎪ ⎪ . ⎪ ⎪ ⎩ fm (pm ) + 2r (pm − x)2 , if x ∈ S2m , if x ∈ S0 ,
if x ∈ S0 , if x ∈ S1 , if x ∈ S2 , .. . if x ∈ S2m .
Proximal Mappings and Moreau Envelopes
115
Algorithm 1 : A routine for graphing the Moreau envelope of a convex piecewise cubic function Step 0. Input coefficients of fi , intersection points, prox-parameter r, lower and upper bounds for the graph. Step 1. Find fi for each i. Step 2. Use fi and xi to define the subdomains Si of er f as found in Corollary 4.11. Step 3. On each Si , use Proposition 4.10 to find the proximal point pi . Step 4. Find er f (x) = fi (p) + 2r (p − x)2 for x ∈ Si , where p is pi for i even and xi for i odd. Step 5. Plot f and er f on the same axes.
Algorithm 1 below is a block of pseudocode that accepts as input a set of m cubic functions {f1 , . . . , fm } and m − 1 intersection points {x1 , . . . , xm−1 } that form the convex piecewise cubic function f, calculates er f , and plots f and er f together.
5 Smoothing a Gauge Function via the Moreau Envelope In this section, we focus on the idea of smoothing a gauge function. Gauge functions are proper, lsc, and convex, but many gauge functions have ridges of nondifferentiability that can be regularized by way of the Moreau envelope. The main result of this section is a method of smoothing a gauge function that yields another gauge function that is differentiable everywhere except on the kernel, as we shall see in Theorem 5.6. A special case of a gauge function is a norm function; Corollary 5.7 applies Theorem 5.6 to an arbitrary norm function, resulting in another norm function that is smooth everywhere except at the origin. To the best of our knowledge, this smoothing of gauge functions and norm functions is a new development in convex optimization. It is our hope that this new theory will be of interest and of some practical use to the readers of this paper.
5.1 Definitions We begin with some definitions that are used only in this section. Definition 5.1 For x ∈ R, the sign function sgn(x) is defined ⎧ ⎪ if x > 0, ⎪ ⎨1, sgn(x) = 0, if x = 0, ⎪ ⎪ ⎩−1, if x < 0. Definition 5.2 A function k on Rn is a gauge if k is a nonnegative, positively homogeneous, convex function such that k(0) = 0.
116
C. Planiden and X. Wang
Definition 5.3 A function f on Rn is gauge-like if f (0) = inf f and the lower level sets {x : f (x) ≤ α}, f (0) < α < ∞ are all proportional, i.e. they can all be expressed as positive scalar multiples of a single set. Note that any norm function is a closed gauge. Theorem 5.4 below gives us a way to construct gauge-like functions that are not necessarily gauges. Theorem 5.4 ([24, Theorem 15.3]) A function f is a gauge-like closed proper convex function if and only if it can be expressed in the form f (x) = g(k(x)), where k is a closed gauge and g is a nonconstant nondecreasing lsc convex function on [0, ∞] such that g(y) is finite for some y > 0 and g(∞) = ∞. If f is gauge-like, then f ∗ is gauge-like as well. Example 5.5 Let k : R → R, k(x) = |x| and g : [0, ∞] → R, g(x) = x + 1. Then by Theorem 5.4, we have that f (x) = g(k(x)) = |x| + 1 is gauge-like, and so is f ∗ (y) =
−1, if − 1 ≤ y ≤ 1, ∞,
otherwise.
5.2 Main Result and Illustrations The following theorem and corollary are the main results of this section. Then some typical norm functions on R2 are showcased: the ∞-norm and the 1 -norm. Finally, by way of counterexample we demonstrate that the Moreau envelope is ideal for the smoothing effect of Theorem 5.6 and other regularizations may not be; the PaschHausdorff envelope is shown not to have the desired effect. Theorem 5.6 Let f : Rn → R be a gauge function. Define gr (x) = [er (f 2 )](x) √ and hr = gr . Then hr is a gauge function, differentiable except on {x : f (x) = 0}, and lim hr = f. r%∞
Proof By Rockafellar and Wets [25, Theorem 1.25], limr%∞ gr = f 2 . So we have limr%∞ hr = |f |, which is simply f since f (x) ≥ 0 for all x. Since
Proximal Mappings and Moreau Envelopes
117
f is nonnegative and convex, f 2 is proper, lsc, and convex. By Rockafellar and Wets [25, Theorem 2.26] we have that gr is convex and continuously differentiable everywhere, the gradient being ∇gr (x) = r[x − Pr f 2 (x)]. Then by the chain rule, we have that ∇hr (x) =
1 1 [gr (x)]− 2 r[x − Pr f 2 (x)], provided that gr (x) = 0. 2
Since inf gr = inf f 2 = 0 and argmin gr = argmin f 2 , we have gr (x) = 0 if and only if f 2 (x) = 0, i.e., f (x) = 0. To see that hr is a gauge function, we have r gr (αx) = [er (f 2 )](αx) = infn f 2 (y) + y − αx2 2 y∈R ! $ 6 6 2 α r 6y 62 = infn f 2 (y) + 6 − x6 2 α y∈R ! 62 $ r6 6 6y 2 2 y = α yinf f + 6 − x6 n α 2 α α ∈R r = α 2 infn f 2 (y) ˜ + y˜ − x2 = α 2 gr (x). 2 y∈R ˜ Thus, gr is positively homogeneous of degree two. Hence, by Rockafellar [24, Corollary 15.3.1], there exists a closed gauge function k such that gr (x) = 12 k 2 (x). Then
. 1 2 1 hr (x) = gr (x) = k (x) = √ k(x) 2 2 and we have that hr is a gauge function.
!
Corollary 5.7 Let f : Rn → R, f (x) = x∗ be an arbitrary norm function. √ Define gr (x) = [er (f 2 )](x) and hr = gr . Then hr is a norm, hr is differentiable everywhere except at the origin, and lim hr = f. r%∞
Proof By Theorem 5.6, we have that limr%∞ hr = f, hr is differentiable everywhere except at the origin, hr is nonnegative and positively homogeneous. To see that hr is a norm, it remains to show that (i) hr (x) = 0 ⇒ x = 0 and (ii) hr (x + y) ≤ hr (x) + hr (y) for all x, y ∈ Rn .
118
C. Planiden and X. Wang
(i) Suppose that hr (x) = 0. Then 5 [er (f 2 )](x) = 0, r infn y2∗ + y − x2 = 0, 2 y∈R y ˜ 2∗ + y˜ − x2 = 0 for some y, ˜ (since · 2∗ is strongly convex) y ˜ ∗ = −y˜ − x ⇒ y˜ = 0 ⇒ x = 0. (ii) We have that hr is convex, since it is a gauge function. Therefore, by Rockafellar [24, Theorem 4.7], the triangle inequality holds. ! Now we present some examples on R2 , to illustrate the method of Theorem 5.6. Example 5.8 Let f : R2 →√R, f (x, y) = max(|x|, |y|). Define gr (x, y) = [er (f 2 )](x, y) and hr (x, y) = gr (x, y). Then, with R2 partitioned as !
$ ! $ r r r r = (x, y) : − x≤y≤ x ∪ (x, y) : x≤y≤− x , r +2 r +2 r +2 r +2 ! $ ! $ r r r r R2r = (x, y) : − y≤x≤ y ∪ (x, y) : y≤x≤− y , r +2 r +2 r +2 r +2 ! $ ! $ r r +2 r +2 r R3r = (x, y) : y≤x≤ y ∪ (x, y) : y≤x≤ y , r +2 r r r +2 ! $ ! $ r r +2 r +2 r r y≤x≤− y ∪ (x, y) : − y≤x≤− y , R4 = (x, y) : − r +2 r r r +2
R1r
we have ⎧
rx ⎪ , y , ⎪ ⎪ ⎪
r+2 ⎪ ⎪ ry ⎨ x, r+2 , Pr hr (x, y) = r(x+y) r(x+y) ⎪ , , ⎪ ⎪ ⎪ 2(r+1) 2(r+1) ⎪ ⎪ r(x−y) −r(x−y) ⎩ , 2(r+1) , 2(r+1) ⎧5 r ⎪ ⎪ r+2 |x|, ⎪ ⎪ 5 ⎪ ⎪ ⎨ r |y|, r+2 hr (x, y) = 5 2 r (x−y)2 +2r(x 2 +y 2 ) ⎪ ⎪ , ⎪ 4(r+1) ⎪ 5 ⎪ ⎪ ⎩ r 2 (x+y)2+2r(x 2+y 2 ) , 4(r+1)
and lim hr = f, lim hr = 0. r%∞
r&0
if (x, y) ∈ R1r , if (x, y) ∈ R2r , if (x, y) ∈ R3r , if (x, y) ∈ R4r , if (x, y) ∈ R1r , if (x, y) ∈ R2r , if (x, y) ∈ R3r , if (x, y) ∈ R4r ,
Proximal Mappings and Moreau Envelopes
119
Fig. 5 The four regions of the piecewise function h1 (x, y)
Proof Figure 5 shows the partitioning of R2 in the case r = 1; for other values of r the partition is of similar form. We have [er (f 2 )](x, ¯ y) ¯ r [max{|x|, |y|}]2 + (x − x) ¯ 2 + (y − y) ¯ 2 = inf 2 (x,y)∈R2 / r = min inf x 2 + (x − x) ¯ 2 + (y − y) ¯ 2 , |x|≥|y| 2 0 r 2 2 2 (x − x) ¯ + (y − y) inf y + ¯ . |x| |y|, the x = y portion, and the x = −y portion. We denote these three infima as I|x|>|y| , Ix=y , and Ix=−y . Similarly, we split Iy into I|x||y| , we set of its argument equal to zero and find a proximal point of (x, y) =
the gradient r x¯ r+2 , y¯ , which yields an infimum of I|x|>|y| =
r x¯ 2 . r+2
r x¯ This is the result for |x| > |y|, or in other words for r+2 ¯ (region R1r ). In a > |y| moment we compare this result to Ix=y and Ix=−y ; Ix is the minimum of the three.
120
C. Planiden and X. Wang
r y¯ By a symmetric process, considering I|x| |y|, then we have I|x|>|y| < I|x||y| , Ix=y , Ix=−y ) on R1r and min(I|x||y| on R1r . By identical arguments, one can show that Ix=−y ≥ I|x|>|y| on R1r , and that Ix=y ≥ I|x| 0 and where x, y < 0, which is R3r . Similarly, gr (x, y) = Ix=−y outside of R1r ∪ R2r where x > 0, y < 0 and where x < 0, y > 0, which is R4r . Therefore, the proximal mapping of gr is ⎧
rx ⎪ ,y , ⎪ r+2 ⎪ ⎪
⎪ ⎪ ⎨ x, ry , r+2 Pr gr (x, y) = r(x+y) r(x+y) ⎪ , , ⎪ ⎪ ⎪
2(r+1) 2(r+1) ⎪ ⎪ r(x−y) −r(x−y) ⎩ , 2(r+1) , 2(r+1)
if (x, y) ∈ R1r , if (x, y) ∈ R2r , if (x, y) ∈ R3r , if (x, y) ∈ R4r .
Applying to (5.1), we find that ⎧ 2 rx ⎪ , ⎪ ⎪ r+22 ⎪ ⎨ ry , gr (x, y) = r+2 2 2 2 2 ⎪ r (x−y) +2r(x +y ) , ⎪ 4(r+1) ⎪ ⎪ ⎩ r 2 (x+y)2+2r(x 2+y 2 ) , 4(r+1)
if (x, y) ∈ R1r , if (x, y) ∈ R2r , if (x, y) ∈ R3r , if (x, y) ∈ R4r .
√ Finally, hr = gr has the same proximal mapping as gr , so hr (x, y) is as stated in the example (Fig. 6). Now let us take a look at what happens to hr when r % ∞. By Theorem 5.6, we expect to recover f. Taking the limit of R1r , we have /! $ rx −r x≤y≤ (x, y) : r%∞ r+2 r +2 $0 ! −rx rx ≤y≤ , ∪ (x, y) : r+2 r+2
lim R1r = lim
r%∞
= {(x, y) : −x ≤ y ≤ x} ∪ {(x, y) : x ≤ y ≤ −x} = {(x, y) : |x| ≥ |y|}. Similarly, we find that lim R2r = {(x, y) : |x| ≤ |y|},
r%∞
lim R3r = {(x, y) : x = y},
r%∞
lim R4r = {(x, y) : x = −y}.
r%∞
122
C. Planiden and X. Wang
Fig. 6 The function h1 (x, y)
Since R3r and R4r are now contained in R1r , we need consider the limit of hr over R1r and R2r only. Therefore, ⎧ 5 r ⎪ |x|, if (x, y) ∈ R1r , ⎨ lim r+2 r%∞ 5 lim hr (x, y) = r ⎪ r%∞ |y|, if (x, y) ∈ R2r , ⎩ lim r+2 r%∞
=
|x|, if |x| ≥ |y|, |y|, if |x| ≤ |y|,
= max{|x|, |y|} = f (x, y). If, on the other hand, we take the limit as r goes down to zero, then it is R3r and R4r that become all of R2 , with lim R3r = {(x, y) : x, y ≥ 0} ∪ {(x, y) : x, y ≤ 0},
r&0
lim R4r = {(x, y) : x ≥ 0, y ≤ 0} ∪ {(x, y) : x ≤ 0, y ≥ 0}.
r&0
Proximal Mappings and Moreau Envelopes
123
Fig. 7 The function hr (x, y) from r = 0.01 (gray) to r = 5 (blue)
Then R1r and R2r are contained in R3r , and the limit of hr is 5 ⎧ r 2 (x−y)2 +2r(x 2 +y 2 ) ⎪ , ⎨ lim 4(r+1) r&0 5 lim hr (x, y) = 2 2 +2r(x 2 +y 2 ) ⎪ r&0 ⎩ lim r (x+y)4(r+1) , r&0
⎧5 ⎨ 0, = 54 ⎩ 0, 4 = 0.
if (x, y) ∈ R3r , if (x, y) ∈ R4r
if x, y ≥ 0 or x, y ≤ 0, if x ≥ 0, y ≤ 0 or x ≤ 0, y ≥ 0 !
Figure 7 shows the graphs of hr for several values of r, and demonstrates the effect of r & 0 and r % ∞. Example 5.9 Let f : R2 → R, f (x, y) = |x| + |y|. Define gr (x, y) = √ [er (f 2 )](x, y) and hr = gr . Then, with R2 partitioned as ! R1r = (x, y) :
$ ! $ 2 r +2 r +2 2 y≤x≤ y ∪ (x, y) : y≤x≤ y , r +2 2 2 r +2 ! $ ! $ 2 r +2 r +2 2 r R2 = (x, y) : − y≤x≤− y ∪ (x, y) : − y≤x≤− y , r +2 2 2 r +2 ! $ ! $ r +2 r +2 r +2 r +2 y ≤ x, − y ≤ x ∪ (x, y) : y ≥ x, − y≥x R3r = (x, y) : 2 2 2 2 ! $ ! $ r +2 r +2 r +2 r +2 R4r = (x, y) : x ≥ y, − x ≥ y ∪ (x, y) : x ≤ y, − x≤y , 2 2 2 2
124
C. Planiden and X. Wang
we have ⎧
(r+2)x−2y −2x+(r+2)y ⎪ , ⎪ ⎪ r+4 , ⎪ r+4 ⎪ ⎪ (r+2)x+2y 2x+(r+2)y ⎨ , r+4 , r+4 Pr hr (x, y) =
rx ⎪ ,0 , ⎪ ⎪ ⎪ r+2 ⎪ ⎪ ry ⎩ 0, r+2 , ⎧5 r ⎪ ⎪ r+4 |x + y|, ⎪ ⎪ 5 ⎪ ⎪ ⎨ r |x − y|, r+4 hr (x, y) = 5 2 2rx +r(r+2)y 2 ⎪ ⎪ , ⎪ 2(r+2) ⎪ 5 ⎪ ⎪ ⎩ r(r+2)x 2+2ry 2 , 2(r+2)
if (x, y) ∈ R1r , if (x, y) ∈ R2r , if (x, y) ∈ R3r , if (x, y) ∈ R4r ,
if (x, y) ∈ R1r , if (x, y) ∈ R2r , if (x, y) ∈ R3r , if (x, y) ∈ R4r ,
and lim hr = f, lim hr = 0. r%∞
r&0
Proof Using the same notation introduced in the previous example, it is convenient to split the infimum expression as follows: gr (x, ¯ y) ¯ =
inf
(x,y)∈R2
(|x| + |y|)2 +
r (x − x) ¯ 2 + (y − y) ¯ 2 2
= min Ix,y>0 , Ix,y0,y 0. Then gr is a gauge, and ∂gr (x) = ∂f (p) ∩ r∂p − x, where p is in the proximal mapping of f at x. Moreover, when f is an arbitrary norm, gr is a norm. However, in that case gr is not necessarily differentiable everywhere except at the origin, as is the case of Corollary 5.7 where the Moreau envelope is used. Proof To prove that gr is a gauge, we must show that (i) gr (x) ≥ 0 for all x ∈ Rn , and x = 0 ⇒ gr (x) = 0, (ii) gr (αx) = αgr (x) for all α > 0, and (iii) gr is convex. (i) Since min f (y) = 0 and min ry − x = 0, we have inf {f (y) + ry − x} ≥ 0 ∀x ∈ Rn .
y∈Rn
We have gr (0) = infn (f (y) + ry − 0) = 0, since both terms of the infimum y∈R
are minimized at y = 0. Hence, x = 0 ⇒ gr (x) = 0. (ii) Let α > 0. Then, with y˜ = y/α, gr (αx) = infn {f (y) + ry − αx} y∈R
6y 6 y 6 6 α f + r 6 − x6 n α α ∈R α
= yinf
= α infn {f (y) ˜ + ry˜ − x} = αgr (x). y∈R ˜
(iii) Since (x, y) → f (y) + ry − x is convex, the marginal function gr is convex by Bauschke and Combettes [3, Proposition 8.26]. Therefore, gr is a gauge. The expression for ∂gr comes from [3, Proposition 16.48]. Now let f be a norm. To show that gr is a norm, we must show that (iv) gr (x) = 0 ⇒ x = 0, (v) gr (−x) = gr (x) for x ∈ Rn , and (vi) gr (x + y) ≤ gr (x) + gr (y) for all x, y ∈ Rn . (iv) Let gr (x) = 0. Then there exists {yk }∞ k=1 such that f (yk ) + ryk − x → 0.
(5.2)
As 0 ≤ f (yk ) ≤ f (yk ) + ryk − x → 0, by the Squeeze Theorem we have f (yk ) → 0, and since f is a norm, yk → 0. Then by (5.2) together with f (yk ) → 0, we have yk − x → 0, i.e. yk → x. Therefore, x = 0.
128
C. Planiden and X. Wang
(v) As in Lemma 2.11, one can show that the Pasch-Hausdorff envelope of an even function is even. (vi) By Rockafellar [24, Theorem 4.7], it suffices that gr is convex. Therefore, gr is a norm. To show that gr is not necessarily smooth everywhere except at one point, we consider a particular example. On R2 , define f1 (x) = |x1 | + |x2 |, f2 (x) =
√ 5 2 x12 + x22 .
Then g√2 (x) = infn {f1 (y) + f2 (x − y)} . y∈R
√ It is elementary to show that f1 is 2-Lipschitz, so by Fact 5.10, we have that g√2 ≡ f1 . Hence, g√2 (x) = |x1 | + |x2|, which is not smooth along the lines x1 = 0 and x2 = 0. ! Remark 5.12 Further work in this area could be done by replacing q(x − y) = 1 2 2 x − y by a general distance function, for example the Bregman distance kernel: D(x, y) =
f (y) − f (x) − ∇f (x), y − x , if y ∈ dom f, x ∈ int dom f, ∞,
otherwise.
See [6, 14] for details on the Moreau envelope using the Bregman distance.
6 Conclusion We established characterizations of Moreau envelopes: er f is strictly convex if and only if f is essentially strictly convex, and f = er g with g ∈ 0 (Rn ) if and only if f ∗ is strongly convex with modulus 1/r. We saw differentiability properties of convex Moreau envelopes and used them to establish an explicit expression for the Moreau envelope of a piecewise cubic function. Finally, we presented a method for smoothing an arbitrary gauge function by applying the Moreau envelope, resulting in another norm function that is differentiable everywhere except on the kernel. A special application to an arbitrary norm function is presented. Acknowledgements The authors thank the anonymous referee for the many useful comments and suggestions made to improve this manuscript. Chayne Planiden was supported by UBC University Graduate Fellowship and by Natural Sciences and Engineering Research Council of Canada. Xianfu Wang was partially supported by a Natural Sciences and Engineering Research Council of Canada Discovery Grant.
Proximal Mappings and Moreau Envelopes
129
References 1. F. Aragón, A. Dontchev, and M. Geoffroy. Convergence of the proximal point method for metrically regular mappings. In CSVAA 2004, volume 17 of ESAIM Proc., pages 1–8. EDP Sci., Les Ulis, 2007. 2. A. Bajaj, W. Hare, and Y. Lucet. Visualization of the ε-subdifferential of piecewise linear– quadratic functions. Comput. Optim. Appl., 67(2):421–442, 2017. 3. H. Bauschke and P. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, New York, 2011. 4. T. Bayen and A. Rapaport. About Moreau–Yosida regularization of the minimal time crisis problem. J. Convex Anal., 23(1):263–290, 2016. 5. G. Bento and J. Cruz. Finite termination of the proximal point method for convex functions on Hadamard manifolds. Optimization, 63(9):1281–1288, 2014. 6. Y. Chen, C. Kan, and W. Song. The Moreau envelope function and proximal mapping with respect to the Bregman distances in Banach spaces. Vietnam J. Math., 40(2–3):181–199, 2012. 7. B. Gardiner and Y. Lucet. Convex hull algorithms for piecewise linear-quadratic functions in computational convex analysis. Set-Valued Var. Anal., 18(3–4):467–482, 2010. 8. R. Hamilton. The inverse function theorem of Nash and Moser. Bull. Amer. Math. Soc., 7(1):65–222, 1982. 9. W. Hare and Y. Lucet. Derivative-free optimization via proximal point methods. J. Optim. Theory Appl., 160(1):204–220, 2014. 10. W. Hare and R. Poliquin. Prox-regularity and stability of the proximal mapping. J. Convex Anal., 14(3):589–606, 2007. 11. M. Hintermüller and M. Hinze. Moreau–Yosida regularization in state constrained elliptic control problems: error estimates and parameter adjustment. SIAM J. Numer. Anal., 47(3):1666–1683, 2009. 12. A. Jourani, L. Thibault, and D. Zagrodny. Differential properties of the Moreau envelope. J. Funct. Anal., 266(3):1185–1237, 2014. 13. A. Jourani and E. Vilches. Moreau–Yosida regularization of state-dependent sweeping processes with nonregular sets. J. Optim. Theory Appl., 173(1):91–116, 2017. 14. C. Kan and W. Song. The Moreau envelope function and proximal mapping in the sense of the Bregman distance. Nonlinear Anal., 75(3):1385–1399, 2012. 15. M. Keuthen and M. Ulbrich. Moreau–Yosida regularization in shape optimization with geometric constraints. Comput. Optim. Appl., 62(1):181–216, 2015. 16. C. Lemaréchal and C. Sagastizábal. Practical aspects of the Moreau–Yosida regularization: theoretical preliminaries. SIAM J. Optim., 7(2):367–385, 1997. 17. Y. Lucet, H. Bauschke, and M. Trienis. The piecewise linear-quadratic model for computational convex analysis. Comput. Optim. Appl., 43(1):95–118, 2009. 18. F. Meng and Y. Hao. Piecewise smoothness for Moreau–Yosida approximation to a piecewise C 2 convex function. Adv. Math., 30(4):354–358, 2001. 19. R. Mifflin, L. Qi, and D. Sun. Properties of the Moreau–Yosida regularization of a piecewise C 2 convex function. Math. Program., 84(2):269–281, 1999. 20. J.-J. Moreau. Propriétés des applications “prox”. C. R. Acad. Sci. Paris, 256:1069–1071, 1963. 21. J.-J. Moreau. Proximité et dualité dans un espace Hilbertien. Bull. Soc. Math. France, 93:273– 299, 1965. 22. C. Planiden and X. Wang. Strongly convex functions, Moreau envelopes and the generic nature of convex functions with strong minimizers. SIAM J. Optim., 26(2):1341–1364, 2016. 23. R. Poliquin and R. Rockafellar. Generalized Hessian properties of regularized nonsmooth functions. SIAM J. Optim., 6(4):1121–1137, 1996. 24. R. Rockafellar. Convex Analysis. Princeton Landmarks in Mathematics. 
Princeton University Press, Princeton, NJ, 1997. 25. R. Rockafellar and J.-B. Wets. Variational Analysis. Springer-Verlag, Berlin, 1998.
130
C. Planiden and X. Wang
26. X. Wang. On Chebyshev functions and Klee functions. J. Math. Anal. Appl., 368(1):293–310, 2010. 27. H. Xiao and X. Zeng. A proximal point method for the sum of maximal monotone operators. Math. Methods Appl. Sci., 37(17):2638–2650, 2014. 28. K. Yosida. Functional Analysis. Die Grundlehren der Mathematischen Wissenschaften. Springer-Verlag, Berlin, 1965. 29. A. Zaslavski. Convergence of a proximal point method in the presence of computational errors in Hilbert spaces. SIAM J. Optim., 20(5):2413–2421, 2010.
Newton-Like Dynamics Associated to Nonconvex Optimization Problems Radu Ioan Bo¸t and Ernö Robert Csetnek
Abstract We consider the dynamical system !
v(t) ∈ ∂φ(x(t)) λx(t) ˙ + v(t) ˙ + v(t) + ∇ψ(x(t)) = 0,
where φ : Rn → R ∪ {+∞} is a proper, convex, and lower semicontinuous function, ψ : Rn → R is a (possibly nonconvex) smooth function, and λ > 0 is a parameter which controls the velocity. We show that the set of limit points of the trajectory x is contained in the set of critical points of the objective function φ + ψ, which is here seen as the set of the zeros of its limiting subdifferential. If the objective function is smooth and satisfies the Kurdyka-Łojasiewicz property, then we can prove convergence of the whole trajectory x to a critical point. Furthermore, convergence rates for the orbits are obtained in terms of the Łojasiewicz exponent of the objective function, provided the latter satisfies the Łojasiewicz property. Keywords Dynamical systems · Newton-like methods · Lyapunov analysis · Nonsmooth optimization · Limiting subdifferential · Kurdyka-Łojasiewicz property Mathematics Subject Classification (2000) 34G25, 47J25, 47H05, 90C26, 90C30, 65K10
R. I. Bo¸t () · E. R. Csetnek Faculty of Mathematics, University of Vienna, Vienna, Austria e-mail: [email protected]; [email protected] © Springer Nature Switzerland AG 2019 S. Hosseini et al. (eds.), Nonsmooth Optimization and Its Applications, International Series of Numerical Mathematics 170, https://doi.org/10.1007/978-3-030-11370-4_6
131
132
R. I. Bo¸t and E. R. Csetnek
1 Introduction and Preliminaries The dynamical system !
v(t) ∈ T (x(t)) λ(t)x(t) ˙ + v(t) ˙ + v(t) = 0,
(1.1)
where λ : [0, +∞) → [0, +∞) and T : Rn ⇒ Rn is a (set-valued) maximally monotone operator, has been introduced and investigated in [9] as a continuous version of Newton and Levenberg-Marquardt-type algorithms. It has been shown that under mild conditions on λ the trajectory x(t) converges weakly to a zero of the operator T , while v(t) converges to zero as t → +∞. These investigations have been continued in [2] in the context of solving optimization problems of the form inf {φ(x) + ψ(x)},
x∈Rn
(1.2)
where φ : Rn → R ∪ {+∞} is a proper, convex, and lower semicontinuous function and ψ : Rn → R is a convex and differentiable function with locally Lipschitzcontinuous gradient. More precisely, problem (1.2) has been approached via the dynamical system !
v(t) ∈ ∂φ(x(t)) λ(t)x(t) ˙ + v(t) ˙ + v(t) + ∇ψ(x(t)) = 0,
(1.3)
where ∂φ is the convex subdifferential of φ. It has been shown in [2] that if the set of minimizers of (1.2) is nonempty and some mild conditions on the damping function λ are satisfied, then the trajectory x(t) converges to a minimizer of (1.2) as t → +∞. Further investigations on dynamical systems of similar type have been reported in [1] and [20]. The aim of this paper is to perform an asymptotic analysis of the dynamical system (1.3) in the absence of the convexity of ψ, for constant damping function λ and by assuming that the objective function of (1.2) satisfies the KurdykaŁojasiewicz property, in other words is a KL function. To the class of KL functions belong semialgebraic, real subanalytic, uniformly convex, and convex functions satisfying a growth condition. The convergence analysis relies on methods of real algebraic geometry introduced by Łojasiewicz [28] and Kurdyka [27] and developed recently in the nonsmooth setting by Attouch et al. [6] and Bolte et al. [15]. Optimization problems involving KL functions have attracted the interest of the community since the works of Łojasiewicz [28], Simon [32], and Haraux and Jendoubi [25]. The most important contributions of the last years in the field include the works of Alvarez et al. [3, Section 4] and Bolte et al. [11, Section 4]. Ever since the interest in this topic increased continuously (see [4–6, 14, 15, 17– 19, 22, 23, 26, 30]).
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
133
In the first part of the paper we show that the set of limit points of the trajectory x generated by (1.3) is entirely contained in the set of critical points of the objective function φ +ψ, which is seen as the set of zeros of its limiting subdifferential. Under some supplementary conditions, including the Kurdyka-Łojasiewicz property, we prove the convergence of the trajectory x to a critical point of φ + ψ. Furthermore, convergence rates for the orbits are obtained in terms of the Łojasiewicz exponent of the objective function, provided the latter satisfies the Łojasiewicz property. In the following we recall some notions and results which are needed throughout the paper. We consider on Rn the Euclidean scalar product and the corresponding norm denoted by ·, · and · , respectively. The domain of the function f : Rn → R ∪ {+∞} is defined by dom f = {x ∈ n R : f (x) < +∞} and we say that f is proper, if it has a nonempty domain. For the following generalized subdifferential notions and their basic properties we refer to [16, 29, 31]. Let f : Rn → R ∪ {+∞} be a proper and lower semicontinuous function. The Fréchet (viscosity) subdifferential of f at x ∈ dom f is the set $ ! v, y − x ˆ (x) = v ∈ Rn : lim inf f (y) − f (x) − ≥0 . ∂f y→x y − x ˆ (x) := ∅. The limiting (Mordukhovich) subdifferential is If x ∈ / dom f , we set ∂f defined at x ∈ dom f by ∂L f (x) = {v ∈ Rn : ∃xk → x, f (xk ) → f (x)and ˆ (xk ), vk → v as k → +∞}, ∃vk ∈ ∂f ˆ (x) ⊆ ∂L f (x) for each while for x ∈ / dom f , we set ∂L f (x) := ∅. Obviously, ∂f n x∈R . When f is convex, these subdifferential notions coincide with the convex ˆ (x) = ∂L f (x) = ∂f (x) = {v ∈ Rn : f (y) ≥ f (x) + subdifferential, thus ∂f n v, y − x ∀y ∈ R } for all x ∈ Rn . The following closedness criterion of the graph of the limiting subdifferential will be used in the convergence analysis: if (xk )k∈N and (vk )k∈N are sequences in Rn such that vk ∈ ∂L f (xk ) for all k ∈ N, (xk , vk ) → (x, v) and f (xk ) → f (x) as k → +∞, then v ∈ ∂L f (x). The Fermat rule reads in this nonsmooth setting as follows: if x ∈ Rn is a local minimizer of f , then 0 ∈ ∂L f (x). We denote by crit(f ) = {x ∈ Rn : 0 ∈ ∂L f (x)} the set of (limiting)-critical points of f . When f is continuously differentiable around x ∈ Rn we have ∂L f (x) = {∇f (x)}. We will also make use of the following subdifferential sum rule: if f : Rn → R ∪ {+∞} is proper and lower semicontinuous and h : Rn → R is
134
R. I. Bo¸t and E. R. Csetnek
a continuously differentiable function, then ∂L (f + h)(x) = ∂L f (x) + ∇h(x) for all x ∈ Rn . Further, we recall the notion of a locally absolutely continuous function and state two of its basic properties. Definition 1.1 (See [2, 9]) A function x : [0, +∞) → Rn is said to be locally absolutely continuous, if it is absolutely continuous on every interval [0, T ] for T > 0. Remark 1.2 (a) An absolutely continuous function is differentiable almost everywhere, its derivative coincides with its distributional derivative almost everywhere and one can recover the function from its derivative x˙ = y by integration. (b) If x : [0, T ] → Rn is absolutely continuous for T > 0 and B : Rn → Rn is L-Lipschitz continuous for L ≥ 0, then the function z = B ◦ x is absolutely continuous, too. Moreover, z is differentiable almost everywhere on [0, T ] and the inequality ˙z(t) ≤ Lx(t) ˙ holds for almost every t ∈ [0, T ]. The following two results, which can be interpreted as continuous versions of the quasi-Fejér monotonicity for sequences, will play an important role in the asymptotic analysis of the trajectories of the dynamical system (1.3). For their proofs, we refer the reader to [2, Lemma 5.1] and [2, Lemma 5.2], respectively. Lemma 1.3 Suppose that F : [0, +∞) → R is locally absolutely continuous and bounded from below and that there exists G ∈ L1 ([0, +∞)) such that for almost every t ∈ [0, +∞) d F (t) ≤ G(t). dt Then there exists limt →∞ F (t) ∈ R. Lemma 1.4 If 1 ≤ p < ∞, 1 ≤ r ≤ ∞, F : [0, +∞) → [0, +∞) is locally absolutely continuous, F ∈ Lp ([0, +∞)), G : [0, +∞) → R, G ∈ Lr ([0, +∞)) and for almost every t ∈ [0, +∞) d F (t) ≤ G(t), dt then limt →+∞ F (t) = 0. The following result, which is due to Brézis ([21, Lemme 3.3, p. 73]; see also [7, Lemma 3.2]), provides an expression for the derivative of the composition of convex functions with absolutely continuous trajectories. Lemma 1.5 Let f : Rn → R ∪ {+∞} be a proper, convex and lower semicontinuous function. Let x ∈ L2 ([0, T ], Rn ) be absolutely continuous such that x˙ ∈ L2 ([0, T ], Rn ) and x(t) ∈ dom f for almost every t ∈ [0, T ]. Assume that there
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
135
exists ξ ∈ L2 ([0, T ], Rn ) such that ξ(t) ∈ ∂f (x(t)) for almost every t ∈ [0, T ]. Then the function t → f (x(t)) is absolutely continuous and for almost every t such that x(t) ∈ dom ∂f we have d f (x(t)) = x(t), ˙ h ∀h ∈ ∂f (x(t)). dt
2 Asymptotic Analysis In this paper we investigate the dynamical system ⎧ ⎨ v(t) ∈ ∂φ(x(t)) λx(t) ˙ + v(t) ˙ + v(t) + ∇ψ(x(t)) = 0 ⎩ x(0) = x0 , v(0) = v0 ∈ ∂φ(x0 ),
(2.1)
where x0 , v0 ∈ Rn and λ > 0. We assume that φ : Rn → R ∪ {+∞} is proper, convex, and lower semicontinuous and ψ : Rn → R is possibly nonconvex and Fréchet differentiable with L-Lipschitz continuous gradient, for L > 0; in other words, ∇ψ(x) − ∇ψ(y) ≤ Lx − y for all x, y ∈ Rn . In the following we specify what we understand under a solution of the dynamical system (2.1). Definition 2.1 Let x0 , v0 ∈ Rn and λ > 0 be such that v0 ∈ ∂φ(x0 ). We say that the pair (x, v) is a strong global solution of (2.1) if the following properties are satisfied: (i) (ii) (iii) (iv)
x, v : [0, +∞) → Rn are locally absolutely continuous functions; v(t) ∈ ∂φ(x(t)) for every t ∈ [0, +∞); λx(t) ˙ + v(t) ˙ + v(t) + ∇ψ(x(t)) = 0 for almost every t ∈ [0, +∞); x(0) = x0 , v(0) = v0 .
The existence and uniqueness of the trajectories generated by (2.1) has been investigated in [2]. A careful look at the proofs in [2] reveals the fact that the convexity of ψ is not used in the mentioned results on the existence, but the Lipschitz-continuity of its gradient. We start our convergence analysis with the following technical result. Lemma 2.2 Let x0 , v0 ∈ Rn and λ > 0 be such that v0 ∈ ∂φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (2.1). Then the following statements are true: (i) x(t), ˙ v(t) ˙ ≥ 0 for almost every t ∈ [0, +∞); d φ(x(t)) = x(t), ˙ v(t) for almost every t ∈ [0, +∞). (ii) dt
136
R. I. Bo¸t and E. R. Csetnek
Proof (i) See [9, Proposition 3.1]. The proof relies on the first relation in (2.1) and the monotonicity of the convex subdifferential. (ii) The proof makes use of Lemma 1.5. This relation has been already stated in [2, relation (51)] without making use in its proof of the convexity of ψ. ! Lemma 2.3 Let x0 , v0 ∈ Rn and λ > 0 be such that v0 ∈ ∂φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (2.1). Suppose that φ + ψ is bounded from below. Then the following statements are true: d 2 + x(t), (i) dt (φ + ψ)(x(t)) + λx(t) ˙ ˙ v(t) ˙ = 0 for almost every t ≥ 0; 2 n (ii) x, ˙ v, ˙ v + ∇ψ(x) ∈ L ([0, +∞); R ), x(·), ˙ v(·) ˙ ∈ L1 ([0, +∞); R) and limt →+∞ x(t) ˙ = limt →+∞ v(t) ˙ = limt →+∞ v(t) + ∇ψ(x(t)) = 0; (iii) the function t → (φ + ψ)(x(t)) is decreasing and ∃ limt →+∞ (φ + ψ) x(t) ∈ R.
Proof (i) The statement follows by inner multiplying the both sides of the second relation in (2.1) by x(t) ˙ and by taking afterwards into consideration Lemma 2.2(ii). (ii) After integrating the relation (i) and by taking into account that φ + ψ is bounded from below, we easily derive x˙ ∈ L2 ([0, +∞); Rn ) and x(·), ˙ v(·) ˙ ∈ L1 ([0, +∞); R) (see also Lemma 2.2(i)). Further, by using the second relation in (2.1), Remark 1.2(b), and Lemma 2.2(i), we obtain for almost every t ≥ 0: d 1 v(t) + ∇ψ(x(t))2 dt 2 9 8 d = v(t) ˙ + ∇ψ(x(t)), v(t) + ∇ψ(x(t)) dt 8 9 d = v(t) ˙ + ∇ψ(x(t)), −λx(t) ˙ − v(t) ˙ dt 9 8 d 2 = −λ v(t), ˙ x(t) ˙ −v(t) ˙ ∇ψ(x(t)), x(t) ˙ −λ dt 8 9 d − ∇ψ(x(t)), v(t) ˙ dt 9 8 d 2 ≤ −v(t) ˙ ∇ψ(x(t)), x(t) ˙ −λ dt 8 9 d − ∇ψ(x(t)), v(t) ˙ dt 2 2 ≤ −v(t) ˙ + λLx(t) ˙ + Lx(t) ˙ · v(t) ˙ 1 2 2 2 2 ˙ ≤ −v(t) ˙ + λLx(t) ˙ + L2 x(t) ˙ + v(t) , 4
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
137
hence d dt
1 3 2 2 2 v(t) + ∇ψ(x(t)) + v(t) ˙ ≤ L(λ + L)x(t) ˙ . 2 4
(2.2)
Since x˙ ∈ L2 ([0, +∞); Rn ), by a simple integration argument we obtain v˙ ∈ L2 ([0, +∞); Rn ). Considering the second equation in (2.1), we further obtain that v + ∇ψ(x) ∈ L2 ([0, +∞); Rn ). This fact combined with Lemma 1.4 and (2.2) implies that limt →+∞ v(t) + ∇ψ(x(t)) = 0. From the second equation in (2.1) we obtain ˙ + v(t) ˙ = 0. lim λx(t)
t →+∞
(2.3)
Further, from Lemma 2.2(i) we have for almost every t ≥ 0 2 2 2 2 ≤ λ2 x(t) ˙ + 2λ x(t), ˙ v(t) ˙ + v(t) ˙ = λx(t) ˙ + v(t) ˙ , v(t) ˙
˙ = 0. Combining this with (2.3) we hence from (2.3) we get limt →+∞ v(t) conclude that limt →+∞ x(t) ˙ = 0. (iii) From (i) and Lemma 2.2(i) it follows that d (φ + ψ)(x(t)) ≤ 0 dt for almost every t ≥ 0. The conclusion follows by applying Lemma 1.3.
(2.4) !
Lemma 2.4 Let x0 , v0 ∈ Rn , and λ > 0 be such that v0 ∈ ∂φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (2.1). Suppose that φ + ψ is bounded from below. Let (tk )k∈N be a sequence such that tk → +∞ and x(tk ) → x ∈ Rn as k → +∞. Then 0 ∈ ∂L (φ + ψ)(x). Proof From the first relation in (2.1) and the subdifferential sum rule of the limiting subdifferential we derive for any k ∈ N v(tk ) + ∇ψ(x(tk )) ∈ ∂φ(x(tk )) + ∇ψ(x(tk )) = ∂L (φ + ψ)(x(tk )).
(2.5)
Further, we have x(tk ) → x as k → +∞
(2.6)
v(tk ) + ∇ψ(x(tk )) → 0 as k → +∞.
(2.7)
and (see Lemma 2.3(ii))
138
R. I. Bo¸t and E. R. Csetnek
According to the closedness property of the limiting subdifferential, the proof is complete as soon as we show that (φ + ψ)(x(tk )) → (φ + ψ)(x) as k → +∞.
(2.8)
From (2.6), (2.7) and the continuity of ∇ψ we get v(tk ) → −∇ψ(x) as k → +∞.
(2.9)
Further, since v(tk ) ∈ ∂φ(x(tk )), we have φ(x) ≥ φ(x(tk )) + v(tk ), x − x(tk ) ∀k ∈ N. Combining this with (2.6) and (2.9) we derive lim sup φ(x(tk )) ≤ φ(x). k→+∞
A direct consequence of the lower semicontinuity of φ is the relation lim φ(x(tk )) = φ(x),
k→+∞
which combined with (2.6) and the continuity of ψ yields (2.8).
!
We define the limit set of x as ω(x) := {x ∈ Rn : ∃tk → +∞ such that x(tk ) → x as k → +∞}. We use also the distance function to a set, defined for A ⊆ Rn as dist(x, A) = infy∈A x − y for all x ∈ Rn . Lemma 2.5 Let x0 , v0 ∈ Rn and λ > 0 be such that v0 ∈ ∂φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (2.1). Suppose that φ + ψ is bounded from below and x is bounded. Then the following statements are true: (i) (ii) (iii) (iv)
ω(x) ⊆ crit(φ + ψ); ω(x) is nonempty, compact, and connected; limt →+∞ dist x(t), ω(x) = 0; φ + ψ is finite and constant on ω(x).
Proof Statement (i) is a direct consequence of Lemma 2.4. Statement (ii) is a classical result from [24]. We also refer the reader to the proof of Theorem 4.1 in [3], where it is shown that the properties of ω(x) of being nonempty, compact, and connected hold for bounded trajectories fulfilling limt →+∞ x(t) ˙ = 0. Statement (iii) follows immediately since ω(x) is nonempty.
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
139
(iv) According to Lemma (2.3)(iii), there exists limt →+∞ (φ + ψ) x(t) ∈ R. Let us denote by l ∈ R this limit. Take x ∈ ω(x). Then there exists tk → +∞ such that x(tk ) → x as k → +∞. From the proof of Lemma 2.4 we have that (φ + ψ)(x(tk )) → (φ + ψ)(x) as k → +∞, hence (φ + ψ)(x) = l. ! Remark 2.6 Suppose that φ + ψ is coercive, in other words, lim
u→+∞
(φ + ψ)(u) = +∞.
Let x0 , v0 ∈ Rn and λ > 0 be such that v0 ∈ ∂φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (2.1). Then φ + ψ is bounded from below and x is bounded. Indeed, since φ + ψ is a proper, lower semicontinuous, and coercive function, it follows that infu∈Rn [φ(u)+ψ(u)] is finite and the infimum is attained. Hence φ +ψ is bounded from below. On the other hand, from Lemma 2.3(iii) it follows (φ + ψ)(x(T )) ≤ (φ + ψ)(x0 ) ∀T ≥ 0. Since φ + ψ is coercive, the lower level sets of φ + ψ are bounded, hence the above inequality yields that x is bounded. Notice that in this case v is bounded too, due to the relation limt →+∞ v(t) + ∇ψ(x(t)) = 0 (Lemma 2.3(ii)) and the Lipschitz continuity of ∇ψ.
3 Convergence of the Trajectory When the Objective Function Satisfies the Kurdyka-Łojasiewicz Property In order to enforce the convergence of the whole trajectory x(t) to a critical point of the objective function as t → +∞ more involved analytic features of the functions have to be considered. A crucial role in the asymptotic analysis of the dynamical system (2.1) is played by the class of functions satisfying the Kurdyka-Łojasiewicz property. For η ∈ (0, +∞], we denote by η the class of concave and continuous functions ϕ : [0, η) → [0, +∞) such that ϕ(0) = 0, ϕ is continuously differentiable on (0, η), continuous at 0 and ϕ (s) > 0 for all s ∈ (0, η). Definition 3.1 (Kurdyka-Łojasiewicz Property) Let f : Rn → R ∪ {+∞} be a proper and lower semicontinuous function. We say that f satisfies the KurdykaŁojasiewicz (KL) property at x ∈ dom ∂L f = {x ∈ Rn : ∂L f (x) = ∅}, if there exist η ∈ (0, +∞], a neighborhood U of x, and a function ϕ ∈ η such that for all x in the intersection U ∩ {x ∈ Rn : f (x) < f (x) < f (x) + η}
140
R. I. Bo¸t and E. R. Csetnek
the following inequality holds ϕ (f (x) − f (x)) dist(0, ∂L f (x)) ≥ 1. If f satisfies the KL property at each point in dom ∂L f , then f is called KL function. The origins of this notion go back to the pioneering work of Łojasiewicz [28], where it is proved that for a real-analytic function f : Rn → R and a critical point x ∈ Rn (that is ∇f (x) = 0), there exists θ ∈ [1/2, 1) such that the function |f − f (x)|θ ∇f −1 is bounded around x. This corresponds to the situation when ϕ(s) = Cs 1−θ for C > 0. The result of Łojasiewicz allows the interpretation of the KL property as a re-parametrization of the function values in order to avoid flatness around the critical points. Kurdyka [27] extended this property to differentiable functions definable in o-minimal structures. Further extensions to the nonsmooth setting can be found in [5, 11–13]. One of the remarkable properties of the KL functions is their ubiquity in applications (see [15]). We refer the reader to [4–6, 11–13, 15] and the references therein for more properties of the KL functions and illustrating examples. In the analysis below the following uniform KL property given in [15, Lemma 6] will be used. Lemma 3.2 Let ⊆ Rn be a compact set and let f : Rn → R ∪ {+∞} be a proper and lower semicontinuous function. Assume that f is constant on and that it satisfies the KL property at each point of . Then there exist ε, η > 0 and ϕ ∈ η such that for all x ∈ and all x in the intersection {x ∈ Rn : dist(x, ) < ε} ∩ {x ∈ Rn : f (x) < f (x) < f (x) + η}
(3.1)
the inequality ϕ (f (x) − f (x)) dist(0, ∂L f (x)) ≥ 1.
(3.2)
holds. Due to some reasons outlined in Remark 3.6 below, we prove the convergence of the trajectory x(t) generated by (2.1) as t → +∞ under the assumption that φ : Rn → R is convex and differentiable with ρ −1 -Lipschitz continuous gradient for ρ > 0. In these circumstances the dynamical system (2.1) reads ⎧ ⎨ v(t) = ∇φ(x(t)) λx(t) ˙ + v(t) ˙ + ∇φ(x(t)) + ∇ψ(x(t)) = 0 ⎩ x(0) = x0 , v(0) = v0 = ∇φ(x0 ), where x0 , v0 ∈ Rn and λ > 0.
(3.3)
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
141
Remark 3.3 We notice that we do not require second order assumptions for φ. However, we want to notice that if φ is a twice continuously differentiable function, then the dynamical system (3.3) can be equivalently written as !
˙ + ∇φ(x(t)) + ∇ψ(x(t)) = 0 λx(t) ˙ + ∇ 2 φ(x(t))(x(t)) x(0) = x0 , v(0) = v0 = ∇φ(x0 ),
(3.4)
where x0 , v0 ∈ Rn and λ > 0. This is a differential equation with a Hessian-driven damping term. We refer the reader to [3] and [8] for more insights into dynamical systems with Hessian-driven damping terms and for motivations for considering them. Moreover, as in [8], the driving forces have been split as ∇φ +∇ψ, where ∇ψ stands for classical smooth driving forces and ∇φ incorporates the contact forces. In this context, an improved version of Lemma 2.2(i) can be stated. Lemma 3.4 Let x0 , v0 ∈ Rn and λ > 0 be such that v0 = ∇φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (3.3). Then: 2 x(t), ˙ v(t) ˙ ≥ ρv(t) ˙ for almost every t ∈ [0, +∞).
(3.5)
Proof Take an arbitrary δ > 0. For t ≥ 0 we have v(t + δ) − v(t), x(t + δ) − x(t) = ∇φ(x(t + δ)) − ∇φ(x(t)), x(t + δ) − x(t) ≥ ρ∇φ(x(t + δ)) − ∇φ(x(t))2 = ρv(t + δ) − v(t)2 ,
(3.6)
where the inequality follows from the Baillon-Haddad Theorem [10, Corollary 18.16]. The conclusion follows by dividing (3.6) by δ 2 and by taking the limit as δ converges to zero from above. ! We are now in the position to prove the convergence of the trajectories generated by (3.3). Theorem 3.5 Let x0 , v0 ∈ Rn and λ > 0 be such that v0 = ∇φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (3.3). Suppose that φ + ψ is a KL function which is bounded from below and x is bounded. Then the following statements are true: (i) x, ˙ v, ˙ ∇φ(x) + ∇ψ(x) ∈ L2 ([0, +∞); Rn ), x(·), v(·) ˙ ∈ L1 ([0, +∞); ˙ R) and limt →+∞ x(t) ˙ = limt →+∞ v(t) ˙ = limt →+∞ ∇φ(x(t)) + ∇ψ(x(t)) = 0; (ii) there exists x ∈ crit(φ + ψ) (that is ∇(φ + ψ)(x) = 0) such that limt →+∞ x(t) = x.
142
R. I. Bo¸t and E. R. Csetnek
Proof According to Lemma 2.5, we can choose an element x ∈ crit(φ + ψ) (that is, ∇(φ + ψ)(x) = 0) such that x ∈ ω(x). According to Lemma 2.3(iii), the proof of Lemma 2.4 and the proof of Lemma 2.5(iv), we have lim (φ + ψ)(x(t)) = (φ + ψ)(x).
t →+∞
We consider the following two cases. I. There exists t ≥ 0 such that (φ + ψ)(x(t)) = (φ + ψ)(x). From Lemma 2.3(iii) we obtain for every t ≥ t that (φ + ψ)(x(t)) ≤ (φ + ψ)(x(t)) = (φ + ψ)(x) Thus (φ + ψ)(x(t)) = (φ + ψ)(x) for every t ≥ t. According to Lemma 2.3(i) and (3.5), it follows that x(t) ˙ = v(t) ˙ = 0 for almost every t ∈ [t, +∞), hence x and v are constant on [t, +∞) and the conclusion follows. II. For every t ≥ 0 it holds (φ + ψ)(x(t)) > (φ + ψ)(x). Take := ω(x). By using Lemma 2.5(ii), (iv) and the fact that φ + ψ is a KL function, by Lemma 3.2, there exist positive numbers and η and a concave function ϕ ∈ η such that for all u belonging to the intersection {u ∈ Rn : dist(u, ) < } ∩ u ∈ Rn : (φ + ψ)(x) < (φ + ψ)(u) < (φ + ψ)(x) + η ,
(3.7)
one has
ϕ (φ + ψ)(u) − (φ + ψ)(x) · ∇φ(u) + ∇ψ(u) ≥ 1.
(3.8)
Let t1 ≥ 0 be such that (φ + ψ)(x(t)) < (φ + ψ)(x) + η for all t ≥ t1 . Since limt →+∞ dist x(t), = 0 (see Lemma2.5(iii)), there exists t2 ≥ 0 such that for all t ≥ t2 the inequality dist x(t), < holds. Hence for all t ≥ T := max{t1 , t2 }, x(t) belongs to the intersection in (3.7). Thus, according to (3.8), for every t ≥ T we have
ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) · ∇φ(x(t)) + ∇ψ(x(t)) ≥ 1.
(3.9)
From the second equation in (3.3) we obtain for almost every t ∈ [T , +∞)
(λx(t) ˙ + v(t)) ˙ · ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) ≥ 1.
(3.10)
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
143
By using Lemma 2.3(i), that ϕ > 0 and d
ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) dt
d (φ + ψ)(x(t)), = ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) dt we further deduce that for almost every t ∈ [T , +∞) it holds 2 + x(t), d
λx(t) ˙ ˙ v(t) ˙ ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) ≤ − . dt λx(t) ˙ + v(t) ˙
(3.11)
We invoke now Lemma 3.5 and obtain 2 + ρv(t) 2 d
˙ λx(t) ˙ ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) ≤ − . dt λx(t) ˙ + v(t) ˙
(3.12)
Let α > 0 (not depending on t) be such that −
2 + ρv(t) 2 λx(t) ˙ ˙ ≤ −αx(t) ˙ − αv(t) ˙ ∀t ≥ 0. λx(t) ˙ + v(t) ˙
(3.13)
One can, for instance, choose α > 0 such that 2α max(λ, 1) ≤ min(λ, ρ). From (3.12) we derive the inequality d
ϕ (φ + ψ)(x(t)) − (φ + ψ)(x) ≤ −αx(t) ˙ − αv(t), ˙ dt
(3.14)
which holds for almost every t ≥ T . Since ϕ is bounded from below, by integration it follows x, ˙ v˙ ∈ L1 ([0, +∞); Rn ). From here we obtain that limt →+∞ x(t) exists and the conclusion follows from the results obtained in the previous section. ! Remark 3.6 Taking a closer look at the above proof, one can notice that the inequality (3.11) can be obtained also when φ : Rn → R ∪ {+∞} is a (possibly nonsmooth) proper, convex, and lower semicontinuous function. Though, in order to conclude that x˙ ∈ L1 ([0, +∞); Rn ) the inequality obtained in Lemma 2.2(i) is not enough. The improved version stated in Lemma 3.4 is crucial in the convergence analysis. If one attempts to obtain in the nonsmooth setting the inequality stated in Lemma 3.4, from the proof of Lemma 3.4 it becomes clear that one would need the inequality ξ1∗ − ξ2∗ , x1 − x2 ≥ ρξ1∗ − ξ2∗ 2
144
R. I. Bo¸t and E. R. Csetnek
for all (x1 , x2 ) ∈ Rn × Rn and all (ξ1∗ , ξ2∗ ) ∈ Rn × Rn such that ξ1∗ ∈ ∂φ(x1 ) and ξ2∗ ∈ ∂φ(x2 ). This is nothing else than (see, for example, [10]) ξ1∗ − ξ2∗ , x1 − x2 ≥ ρξ1∗ − ξ2∗ 2 for all (x1 , x2 ) ∈ Rn × Rn and all (ξ1∗ , ξ2∗ ) ∈ Rn × Rn such that x1 ∈ ∂φ ∗ (ξ1∗ ) and x2 ∈ ∂φ ∗ (ξ2∗ ). Here φ ∗ : Rn → R denotes the Fenchel conjugate of φ, defined for all x ∗ ∈ Rn by φ ∗ (x ∗ ) = supx∈Rn { x ∗ , x − φ(x)}. The latter inequality is equivalent to ∂φ ∗ is ρ-strongly monotone, which is further equivalent (see [33, Theorem 3.5.10] or [10]) to φ ∗ is is strongly convex. This is the same with asking that φ is differentiable on the whole Rn with Lipschitz-continuous gradient (see [10, Theorem 18.15]). In conclusion, the smooth setting provides the necessary prerequisites for obtaining the result in Lemma 3.4 and, finally, Theorem 3.5.
4 Convergence Rates In this subsection we investigate the convergence rates of the trajectories(x(t), v(t)) generated by the dynamical system (3.3) as t → +∞. When solving optimization problems involving KL functions, convergence rates have been proved to depend on the so-called Łojasiewicz exponent (see [4, 11, 23, 28]). The main result of this subsection refers to the KL functions which satisfy Definition 3.1 for ϕ(s) = Cs 1−θ , where C > 0 and θ ∈ (0, 1). We recall the following definition considered in [4]. Definition 4.1 Let f : Rn → R ∪ {+∞} be a proper and lower semicontinuous function. The function f is said to have the Łojasiewicz property, if for every x ∈ crit f there exist C, ε > 0 and θ ∈ (0, 1) such that |f (x)−f (x)|θ ≤ Cx ∗ for every x fulfilling x −x < ε and every x ∗ ∈ ∂L f (x). (4.1) According to [5, Lemma 2.1 and Remark 3.2(b)], the KL property is automatically satisfied at any noncritical point, fact which motivates the restriction to critical points in the above definition. The real number θ in the above definition is called Łojasiewicz exponent of the function f at the critical point x. The convergence rates obtained in the following theorem are in the spirit of [11] and [4]. Theorem 4.2 Let x0 , v0 ∈ Rn , and λ > 0 be such that v0 = ∇φ(x0 ). Let (x, v) : [0, +∞) → Rn × Rn be the unique strong global solution of the dynamical system (3.3). Suppose that x is bounded and φ + ψ is a function which is bounded from below and satisfies Definition 3.1 for ϕ(s) = Cs 1−θ , where C > 0 and θ ∈ (0, 1). Then there exists x ∈ crit(φ + ψ) (that is ∇(φ + ψ)(x) = 0) such that limt →+∞ x(t) = x and limt →+∞ v(t) = ∇φ(x) = −∇ψ(x). Let θ be the Łojasiewicz exponent of φ + ψ at x, according to Definition 4.1. Then there exist
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
145
a1 , b1 , a2 , b2 > 0 and t0 ≥ 0 such that for every t ≥ t0 the following statements are true: (i) if θ ∈ (0, 12 ), then x and v converge in finite time; (ii) if θ = 12 , then x(t) − x + v(t) − ∇φ(x) ≤ a1 exp(−b1 t); (iii) if θ ∈
( 12 , 1),
then x(t) − x + v(t) − ∇φ(x) ≤ (a2 t + b2 )
1−θ − 2θ −1
.
Proof According to the proof of Theorem 3.5, x, ˙ v˙ ∈ and there exists x ∈ crit(φ + ψ), in other words ∇(φ + ψ)(x) = 0, such that limt →+∞ x(t) = x and limt →+∞ v(t) = ∇φ(x) = −∇ψ(x). Let θ be the Łojasiewicz exponent of φ + ψ at x, according to Definition 4.1. We define σ : [0, +∞) → [0, +∞) by (see also [11]) L1 ([0, +∞); Rn )
:
+∞
σ (t) =
:
+∞
x(s)ds ˙ +
t
v(s)ds ˙ for all t ≥ 0.
t
It is immediate that :
+∞
x(t) − x ≤
x(s)ds ˙ ∀t ≥ 0.
(4.2)
t
Indeed, this follows by noticing that for T ≥ t 6 : 6 x − x(t) − x = 6 x(T ) − 6
T t
:
6 6 6 x(s)ds ˙ 6 T
≤ x(T ) − x +
x(s)ds, ˙
t
and by letting afterwards T → +∞. Similarly, we have :
+∞
v(t) − ∇φ(x) ≤
v(s)ds ˙ ∀t ≥ 0.
(4.3)
t
From (4.2) and (4.3) we derive x(t) − x + v(t) − ∇φ(x) ≤ σ (t) ∀t ≥ 0.
(4.4)
We assume that for every t ≥ 0 we have (φ + ψ)(x(t)) > (φ + ψ)(x). As seen in the proof of Theorem 3.5 otherwise the conclusion follows automatically. Furthermore, by invoking again the proof of Theorem 3.5 , there exist ε > 0, t0 ≥ 0 and α > 0 such that for almost every t ≥ t0 (see (3.14)) αx(t) ˙ + αv(t) ˙ +
1−θ d (φ + ψ)(x(t)) − (φ + ψ)(x) ≤0 dt
(4.5)
146
R. I. Bo¸t and E. R. Csetnek
and x(t) − x < ε. We derive by integration for T ≥ t ≥ t0 : α t
T
:
T
x(s)ds ˙ +α
1−θ v(s)ds ˙ + (φ + ψ)(x(T )) − (φ + ψ)(x)
t
≤ (φ + ψ)(x(t)) − (φ + ψ)(x)
1−θ ,
hence 1−θ ασ (t) ≤ (φ + ψ)(x(t)) − (φ + ψ)(x) ∀t ≥ t0 .
(4.6)
Since θ is the Łojasiewicz exponent of φ + ψ at x, we have |(φ + ψ)(x(t)) − (φ + ψ)(x)|θ ≤ C∇(φ + ψ)(x(t)) for every t ≥ t0 . From the second relation in (3.3) we derive for almost every t ∈ [t0 , +∞) |(φ + ψ)(x(t)) − (φ + ψ)(x)|θ ≤ Cλx(t) ˙ + Cv(t), ˙ which combined with (4.6) yields 1−θ 1−θ 1−θ θ ≤ (C max(λ, 1)) θ · (x(t) θ . ˙ + v(t)) ˙ ασ (t) ≤ Cλx(t) ˙ + Cv(t) ˙ (4.7) Since σ˙ (t) = −x(t) ˙ − v(t), ˙
(4.8)
we conclude that there exists α > 0 such that for almost every t ∈ [t0 , +∞) θ σ˙ (t) ≤ −α σ (t) 1−θ . If θ = 12 , then σ˙ (t) ≤ −α σ (t)
(4.9)
Newton-Like Dynamics Associated to Nonconvex Optimization Problems
147
for almost every $t \in [t_0, +\infty)$. By multiplying with $\exp(\alpha' t)$ and integrating afterwards from $t_0$ to $t$, it follows that there exist $a_1, b_1 > 0$ such that
$$\sigma(t) \le a_1 \exp(-b_1 t) \quad \forall t \ge t_0,$$
and the conclusion of (ii) is immediate from (4.4).

Assume now that $0 < \theta < \tfrac{1}{2}$. We obtain from (4.9)
$$\frac{d}{dt}\Big[\sigma(t)^{\frac{1-2\theta}{1-\theta}}\Big] \le -\alpha'\, \frac{1-2\theta}{1-\theta}$$
for almost every $t \in [t_0, +\infty)$. By integration we obtain
$$\sigma(t)^{\frac{1-2\theta}{1-\theta}} \le -\tilde{\alpha}\, t + \beta \quad \forall t \ge t_0,$$
where $\tilde{\alpha} > 0$. Thus there exists $T \ge 0$ such that $\sigma(t) \le 0$ for all $t \ge T$, which implies that $x$ and $v$ are constant on $[T, +\infty)$; this proves statement (i).

Finally, suppose that $\tfrac{1}{2} < \theta < 1$. We obtain from (4.9)
$$\frac{d}{dt}\Big[\sigma(t)^{\frac{1-2\theta}{1-\theta}}\Big] \ge \alpha'\, \frac{2\theta-1}{1-\theta}$$
for almost every $t \in [t_0, +\infty)$. By integration we derive
$$\sigma(t) \le (a_2 t + b_2)^{-\frac{1-\theta}{2\theta-1}} \quad \forall t \ge t_0,$$
where $a_2, b_2 > 0$. Statement (iii) follows from (4.4).
□
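The trichotomy (i)–(iii) is already visible on the scalar model obtained by taking inequality (4.9) with equality, $\dot{\sigma} = -\alpha\,\sigma^{\theta/(1-\theta)}$. The following sketch is an illustration added here, not part of the chapter; the function name and the parameter values are chosen freely, and only NumPy is assumed. It evaluates the closed-form solution of this model equation for three representative exponents and reproduces finite-time extinction, exponential decay, and the polynomial rate $t^{-(1-\theta)/(2\theta-1)}$ appearing in Theorem 4.2.

```python
import numpy as np

def sigma_model(t, theta, alpha=1.0, sigma0=1.0):
    """Closed-form solution of the model ODE sigma' = -alpha * sigma**(theta/(1-theta)),
    sigma(0) = sigma0, i.e. inequality (4.9) taken with equality (worst case)."""
    t = np.asarray(t, dtype=float)
    if np.isclose(theta, 0.5):
        # theta = 1/2: linear ODE, exponential decay (case (ii))
        return sigma0 * np.exp(-alpha * t)
    p = (1.0 - 2.0 * theta) / (1.0 - theta)   # exponent used in the proof
    val = sigma0**p - alpha * p * t           # since d/dt sigma**p = -alpha * p
    if theta < 0.5:
        # p > 0: sigma reaches zero in finite time and stays there (case (i))
        return np.maximum(val, 0.0)**(1.0 / p)
    # theta > 1/2: p < 0, so val grows linearly in t and sigma decays like
    # t**(-(1 - theta) / (2 * theta - 1))  (case (iii))
    return val**(1.0 / p)

ts = np.linspace(0.0, 10.0, 6)
for th in (0.3, 0.5, 0.7):
    print(f"theta = {th}:", np.round(sigma_model(ts, th), 4))
```

For $\theta = 0.3$ the printed values vanish after a finite time, for $\theta = 0.5$ they decay like $\exp(-t)$, and for $\theta = 0.7$ only like $(1 + \tfrac{4}{3}t)^{-3/4}$, in accordance with cases (i)–(iii).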
Acknowledgements The authors are thankful to an anonymous referee for carefully reading the paper and for the suggestions made. The research of RIB has been partially supported by FWF (Austrian Science Fund), project I 2419-N32. The research of ERC has been supported by FWF (Austrian Science Fund), project P 29809-N32.
References

1. B. Abbas, An asymptotic viscosity selection result for the regularized Newton dynamic, arXiv:1504.07793v1, 2015.
2. B. Abbas, H. Attouch, B.F. Svaiter, Newton-like dynamics and forward-backward methods for structured monotone inclusions in Hilbert spaces, Journal of Optimization Theory and its Applications 161(2) (2014), 331–360.
3. F. Alvarez, H. Attouch, J. Bolte, P. Redont, A second-order gradient-like dissipative dynamical system with Hessian-driven damping. Application to optimization and mechanics, Journal de Mathématiques Pures et Appliquées 81(8) (2002), 747–779.
4. H. Attouch, J. Bolte, On the convergence of the proximal algorithm for nonsmooth functions involving analytic features, Mathematical Programming 116(1–2) (2009), 5–16.
5. H. Attouch, J. Bolte, P. Redont, A. Soubeyran, Proximal alternating minimization and projection methods for nonconvex problems: an approach based on the Kurdyka-Łojasiewicz inequality, Mathematics of Operations Research 35(2) (2010), 438–457.
6. H. Attouch, J. Bolte, B.F. Svaiter, Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward-backward splitting, and regularized Gauss-Seidel methods, Mathematical Programming 137(1–2) (2013), 91–129.
7. H. Attouch, M.-O. Czarnecki, Asymptotic behavior of coupled dynamical systems with multiscale aspects, Journal of Differential Equations 248(6) (2010), 1315–1344.
8. H. Attouch, P.-E. Maingé, P. Redont, A second-order differential system with Hessian-driven damping; application to non-elastic shock laws, Differential Equations and Applications 4(1) (2012), 27–65.
9. H. Attouch, B.F. Svaiter, A continuous dynamical Newton-like approach to solving monotone inclusions, SIAM Journal on Control and Optimization 49(2) (2011), 574–598.
10. H.H. Bauschke, P.L. Combettes, Convex Analysis and Monotone Operator Theory in Hilbert Spaces, CMS Books in Mathematics, Springer, New York, 2011.
11. J. Bolte, A. Daniilidis, A. Lewis, The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems, SIAM Journal on Optimization 17(4) (2006), 1205–1223.
12. J. Bolte, A. Daniilidis, A. Lewis, M. Shiota, Clarke subgradients of stratifiable functions, SIAM Journal on Optimization 18(2) (2007), 556–572.
13. J. Bolte, A. Daniilidis, O. Ley, L. Mazet, Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity, Transactions of the American Mathematical Society 362(6) (2010), 3319–3363.
14. J. Bolte, T.P. Nguyen, J. Peypouquet, B.W. Suter, From error bounds to the complexity of first-order descent methods for convex functions, Mathematical Programming 165(2) (2017), 471–507.
15. J. Bolte, S. Sabach, M. Teboulle, Proximal alternating linearized minimization for nonconvex and nonsmooth problems, Mathematical Programming 146(1–2) (2014), 459–494.
16. J.M. Borwein, Q.J. Zhu, Techniques of Variational Analysis, Springer, New York, 2005.
17. R.I. Boţ, E.R. Csetnek, S. László, An inertial forward-backward algorithm for the minimization of the sum of two nonconvex functions, EURO Journal on Computational Optimization 4 (2016), 3–25.
18. R.I. Boţ, E.R. Csetnek, Approaching nonsmooth nonconvex optimization problems through first order dynamical systems with hidden acceleration and Hessian driven damping terms, Set-Valued and Variational Analysis 26(2) (2018), 227–245.
19. R.I. Boţ, E.R. Csetnek, A forward-backward dynamical approach to the minimization of the sum of a nonsmooth convex with a smooth nonconvex function, ESAIM: Control, Optimisation and Calculus of Variations 24(2) (2018), 463–477.
20. R.I. Boţ, E.R. Csetnek, Levenberg-Marquardt dynamics associated to variational inequalities, Set-Valued and Variational Analysis 25 (2017), 569–589.
21. H. Brézis, Opérateurs maximaux monotones et semi-groupes de contractions dans les espaces de Hilbert, North-Holland Mathematics Studies No. 5, Notas de Matemática (50), North-Holland/Elsevier, New York, 1973.
22. E. Chouzenoux, J.-C. Pesquet, A. Repetti, Variable metric forward-backward algorithm for minimizing the sum of a differentiable function and a convex function, Journal of Optimization Theory and its Applications 162(1) (2014), 107–132.
23. P. Frankel, G. Garrigos, J. Peypouquet, Splitting methods with variable metric for Kurdyka-Łojasiewicz functions and general convergence rates, Journal of Optimization Theory and its Applications 165(3) (2015), 874–900.
24. A. Haraux, Systèmes Dynamiques Dissipatifs et Applications, Recherches en Mathématiques Appliquées 17, Masson, Paris, 1991.
25. A. Haraux, M. Jendoubi, Convergence of solutions of second-order gradient-like systems with analytic nonlinearities, Journal of Differential Equations 144(2) (1998), 313–320.
26. R. Hesse, D.R. Luke, S. Sabach, M.K. Tam, Proximal heterogeneous block input-output method and application to blind ptychographic diffraction imaging, SIAM Journal on Imaging Sciences 8(1) (2015), 426–457.
27. K. Kurdyka, On gradients of functions definable in o-minimal structures, Annales de l'institut Fourier (Grenoble) 48(3) (1998), 769–783.
28. S. Łojasiewicz, Une propriété topologique des sous-ensembles analytiques réels, Les Équations aux Dérivées Partielles, Éditions du Centre National de la Recherche Scientifique, Paris, 87–89, 1963.
29. B. Mordukhovich, Variational Analysis and Generalized Differentiation, I: Basic Theory, II: Applications, Springer-Verlag, Berlin, 2006.
30. P. Ochs, Y. Chen, T. Brox, T. Pock, iPiano: Inertial proximal algorithm for non-convex optimization, SIAM Journal on Imaging Sciences 7(2) (2014), 1388–1419.
31. R.T. Rockafellar, R.J.-B. Wets, Variational Analysis, Fundamental Principles of Mathematical Sciences 317, Springer-Verlag, Berlin, 1998.
32. L. Simon, Asymptotics for a class of nonlinear evolution equations, with applications to geometric problems, Annals of Mathematics 118(2) (1983), 525–571.
33. C. Zălinescu, Convex Analysis in General Vector Spaces, World Scientific, Singapore, 2002.