English Pages 144 Year 2021
EMS Textbooks in Mathematics
Alessio Figalli Federico Glaudo
An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows
EMS Textbooks in Mathematics The EMS Textbooks in Mathematics is a series of books aimed at students or professional mathematicians seeking an introduction into a particular field. The individual volumes are intended not only to provide relevant techniques, results, and applications, but also to afford insight into the motivations and ideas behind the theory. Suitably designed exercises help to master the subject and prepare the reader for the study of more advanced and specialized literature.
Previously published in this series: Peter Kunkel and Volker Mehrmann, Differential-Algebraic Equations Markus Stroppel, Locally Compact Groups Dorothee D. Haroske and Hans Triebel, Distributions, Sobolev Spaces, Elliptic Equations Thomas Timmermann, An Invitation to Quantum Groups and Duality Oleg Bogopolski, Introduction to Group Theory Marek Jarnicki and Peter Pflug, First Steps in Several Complex Variables: Reinhardt Domains Tammo tom Dieck, Algebraic Topology Mauro C. Beltrametti et al., Lectures on Curves, Surfaces and Projective Varieties Wolfgang Woess, Denumerable Markov Chains Eduard Zehnder, Lectures on Dynamical Systems Andrzej Skowroński and Kunio Yamagata, Frobenius Algebras I Piotr W. Nowak and Guoliang Yu, Large Scale Geometry Joaquim Bruna and Juliá Cufí, Complex Analysis Eduardo Casas-Alvero, Analytic Projective Geometry Fabrice Baudoin, Diffusion Processes and Stochastic Calculus Olivier Lablée, Spectral Theory in Riemannian Geometry Dietmar A. Salamon, Measure and Integration Andrzej Skowroński and Kunio Yamagata, Frobenius Algebras II Jørn Justesen and Tom Høholdt, A Course In Error-Correcting Codes (2nd ed.) Bogdan Nica, A Brief Introduction to Spectral Graph Theory Timothée Marquis, An Introduction to Kac–Moody Groups over Fields
Authors: Alessio Figalli Department of Mathematics ETH Zürich Rämistrasse 101 8092 Zürich, Switzerland E-mail: [email protected]
Federico Glaudo Department of Mathematics ETH Zürich Rämistrasse 101 8092 Zürich, Switzerland E-mail: [email protected]
2020 Mathematics Subject Classification: 49Q22; 60B05, 28A33, 35A15, 35Q35, 49N15, 28A50
Keywords: optimal transport, Wasserstein distance, duality, gradient flows, measure theory, displacement convexity
ISBN 978-3-98547-010-5 Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de. Published by EMS Press, an imprint of the European Mathematical Society – EMS – Publishing House GmbH Institut für Mathematik Technische Universität Berlin Straße des 17. Juni 136 10623 Berlin, Germany https://ems.press © 2021 European Mathematical Society Typeset using the authors’ LaTeX sources: Alison Durham, Manchester, UK Printing and binding: Beltz Bad Langensalza GmbH, Bad Langensalza, Germany ♾ Printed on acid free paper 987654321
Contents

1 Introduction
  1.1 Historical overview
  1.2 Push-forward of measures
  1.3 Basics of Riemannian geometry
  1.4 Transport maps
  1.5 An application to isoperimetric inequalities
  1.6 A Jacobian equation for transport maps
2 Optimal transport
  2.1 Preliminaries in measure theory
  2.2 Monge vs. Kantorovich
  2.3 Existence of an optimal coupling
  2.4 c-cyclical monotonicity
  2.5 The case $c(x,y) = |x-y|^2/2$ on $X = Y = \mathbb{R}^d$
  2.6 General cost functions: Kantorovich duality
  2.7 General cost functions: Existence and uniqueness of optimal transport maps
3 Wasserstein distances and gradient flows
  3.1 p-Wasserstein distances and geodesics
  3.2 An informal introduction to gradient flows in Hilbert spaces
  3.3 Heat equation and optimal transport: The JKO scheme
4 Differential viewpoint of optimal transport
  4.1 The continuity equation and Benamou–Brenier formula
  4.2 Otto’s calculus: From Benamou–Brenier to a Riemannian structure
  4.3 Displacement convexity
  4.4 An excursion into the linear Fokker–Planck equation
5 Further reading
  5.1 Functional and geometric inequalities
  5.2 Probability
  5.3 Multi-marginal optimal transport
  5.4 Gradient flows
  5.5 Regularity theory
  5.6 Computational aspects
  5.7 From $\mathbb{R}^d$ to Riemannian manifolds and beyond: CD spaces
A Exercises on optimal transport (with solutions)
B Disintegrating the disintegration theorem
References
Chapter 1
Introduction

In this introductory chapter we first give a brief historical review of optimal transport, then we recall some basic definitions and facts from measure theory and Riemannian geometry, and finally we present three examples of (not necessarily optimal) transport maps, with an application to the Euclidean isoperimetric inequality.
1.1 Historical overview

1781 – Monge. In his celebrated work, Gaspard Monge introduced the concept of transport maps starting from the following practical question: Assume one extracts soil from the ground to build fortifications. What is the cheapest possible way to transport the soil? To formulate this question rigorously, one needs to specify the transportation cost, namely how much one pays to move a unit of mass from a point $x$ to a point $y$. In Monge’s case, the ambient space was $\mathbb{R}^3$, and the cost was the Euclidean distance $c(x,y) := |x-y|$.

1940s – Kantorovich. After 150 years, Leonid Kantorovich revisited Monge’s problem from a different viewpoint. To explain this, consider $N$ bakeries located at positions $(x_i)_{i=1,\dots,N}$ and $M$ coffee shops located at $(y_j)_{j=1,\dots,M}$. Assume that the $i$th bakery produces an amount $\alpha_i \ge 0$ of bread and that the $j$th coffee shop needs an amount $\beta_j \ge 0$. Also, assume that the total production equals the total demand, and normalize both to be equal to 1: in other words, $\sum_i \alpha_i = \sum_j \beta_j = 1$.

In Monge’s formulation, the transport is deterministic: the mass located at $x$ can be sent to a unique destination $T(x)$. Unfortunately this formulation is incompatible with the problem above, since one bakery may supply bread to multiple coffee shops, and one coffee shop may buy bread from multiple bakeries. For this reason Kantorovich introduced a new formulation: given $c(x_i, y_j)$, the cost to move one unit of mass from $x_i$ to $y_j$, he looked for matrices $(\gamma_{ij})_{i=1,\dots,N;\ j=1,\dots,M}$ such that

(a) $\gamma_{ij} \ge 0$ (the amount of bread going from $x_i$ to $y_j$ is a nonnegative quantity);
(b) for all $i$, $\alpha_i = \sum_{j=1}^{M} \gamma_{ij}$ (the total amount of bread sent to the different coffee shops is equal to the production);
(c) for all $j$, $\beta_j = \sum_{i=1}^{N} \gamma_{ij}$ (the total amount of bread bought from the different bakeries is equal to the demand);
(d) $\gamma_{ij}$ minimize the cost $\sum_{i=1}^{N} \sum_{j=1}^{M} \gamma_{ij}\, c(x_i, y_j)$ (the total transportation cost is minimized).

It is interesting to observe that constraint (a) is convex, constraints (b) and (c) are linear, and the objective function in (d) is also linear (all with respect to $\gamma_{ij}$). In other words, Kantorovich’s formulation corresponds to minimizing a linear function with convex/linear constraints.

Applications. Optimal transport has been a topic of high interest in the last 30 years due to its connection to several areas of mathematics. The properties and the applications of optimal transport depend heavily on the choice of the cost function $c(x,y)$, representing the cost of moving a unit of mass from $x$ to $y$. Let us mention some important choices:
• $c(x,y) = |x-y|^2$ in $\mathbb{R}^d$: connected to Euler equations, isoperimetric and Sobolev inequalities, evolution PDEs such as $\partial_t u = \Delta u$, $\partial_t u = \Delta(u^m)$, and $\partial_t u = \operatorname{div}(\nabla W * u\, u)$.

• $c(x,y) = |x-y|$ in $\mathbb{R}^d$: appears in probability and kinetic theory.

• $c(x,y) = d(x,y)^2$ on a Riemannian manifold, with $d(\cdot,\cdot)$ denoting the Riemannian distance: has connections and applications to the study of Ricci curvature.

• $c(x,y) = -\log(|x-y|)$ on the sphere $\mathbb{S}^2 \subset \mathbb{R}^3$: solving the optimal transport problem between two densities on the sphere produces a solution to the associated reflector antenna problem of how to construct an antenna (which is a reflecting surface) in such a way that a light coming from the origin with a given density (in the space of directions, which is parametrized by $\mathbb{S}^2$) is reflected into another given density (again, in the space of directions).

In this book we mostly focus on the Euclidean quadratic cost $|x-y|^2$, and we will give references for further applications in Chapter 5.
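To make Kantorovich's discrete formulation from Section 1.1 concrete, here is a minimal Python sketch for the smallest nontrivial case, $N = M = 2$. The function name `solve_2x2` and the numerical data are illustrative assumptions, not taken from the book. With two bakeries and two coffee shops, constraints (b) and (c) leave a single free parameter $t = \gamma_{11}$, constraint (a) confines it to an interval, and the linear cost in (d) is minimized at an endpoint of that interval.

```python
def solve_2x2(alpha, beta, cost):
    """Minimize sum_ij gamma_ij * cost[i][j] over 2x2 couplings of alpha and beta.

    alpha, beta: pairs of nonnegative numbers, each summing to 1.
    cost: 2x2 nested list with cost[i][j] = c(x_i, y_j).
    """
    a1, a2 = alpha
    b1, _ = beta
    # Constraint (a) forces gamma_11 = t to lie in [lo, hi].
    lo = max(0.0, b1 - a2)
    hi = min(a1, b1)

    def plan(t):
        # Constraints (b) and (c) determine the other three entries from t.
        return [[t, a1 - t], [b1 - t, a2 - b1 + t]]

    def total_cost(gamma):
        return sum(gamma[i][j] * cost[i][j] for i in range(2) for j in range(2))

    # A linear function on an interval attains its minimum at an endpoint.
    best = min((plan(lo), plan(hi)), key=total_cost)
    return best, total_cost(best)


# Hypothetical data: bakeries and coffee shops at positions 0 and 1, cost |x - y|.
best_plan, best_cost = solve_2x2((0.5, 0.5), (0.25, 0.75), [[0.0, 1.0], [1.0, 0.0]])
print(best_plan, best_cost)
```

In this example the first bakery splits its production between both coffee shops, which is exactly the situation that a deterministic map $T$ cannot describe.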
1.2 Push-forward of measures

For simplicity, throughout this book we will always work on locally compact, separable, and complete metric spaces, which will usually be denoted by $X$ (the space where the source measure lives) and $Y$ (the space where the target measure lives). These assumptions are not optimal but simplify some of the proofs in the next chapter (see also Remark 2.1.1). Still, readers not interested in such a level of generality can always think that $X = Y = \mathbb{R}^d$.

Remark 1.2.1. All measures under consideration are Borel measures, and all maps are Borel (i.e., if $S\colon X \to Y$, then $S^{-1}(A)$ is Borel for all $A \subset Y$ Borel). The set of probability measures over a space $X$ will be denoted by $\mathcal{P}(X)$, and the class of Borel-measurable sets by $\mathcal{B}(X)$. Also, $\mathbf{1}_A$ denotes the indicator function of a set:
\[
\mathbf{1}_A(x) := \begin{cases} 1 & \text{if } x \in A, \\ 0 & \text{if } x \notin A. \end{cases}
\]

Definition 1.2.2. Take a map $T\colon X \to Y$ and a probability measure $\mu \in \mathcal{P}(X)$. We define the image measure (or push-forward measure) $T_\#\mu \in \mathcal{P}(Y)$ as
\[
(T_\#\mu)(A) := \mu(T^{-1}(A)) \quad \text{for any } A \in \mathcal{B}(Y).
\]
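Lemma 1.2.3. $T_\#\mu$ is a probability measure on $Y$.

Proof. The proof consists in checking that $T_\#\mu$ is nonnegative, has total mass 1, gives no mass to the empty set, and is $\sigma$-additive on disjoint sets:
\[
\begin{aligned}
(T_\#\mu)(\emptyset) &= \mu(T^{-1}(\emptyset)) = \mu(\emptyset) = 0, \\
(T_\#\mu)(Y) &= \mu(T^{-1}(Y)) = \mu(X) = 1, \\
(T_\#\mu)(A) &= \mu(T^{-1}(A)) \ge 0 \quad \text{for all } A \in \mathcal{B}(Y).
\end{aligned}
\]
Let $(A_i)_{i \in I}$ be a countable family of disjoint Borel subsets of $Y$. We claim first that the sets $(T^{-1}(A_i))_{i \in I}$ are disjoint. Indeed, if that was not the case and $x \in T^{-1}(A_i) \cap T^{-1}(A_j)$ for some $i \ne j$, then $T(x) \in A_i \cap A_j$, which is a contradiction. Thanks to this fact, using that $\mu$ is a measure (and thus $\sigma$-additive on disjoint sets) we get
\[
T_\#\mu\Big(\bigcup_{i \in I} A_i\Big) = \mu\Big(T^{-1}\Big(\bigcup_{i \in I} A_i\Big)\Big) = \mu\Big(\bigcup_{i \in I} T^{-1}(A_i)\Big) = \sum_{i \in I} \mu(T^{-1}(A_i)) = \sum_{i \in I} T_\#\mu(A_i).
\]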
Remark 1.2.4. One might also be tempted to define the “pull-back measure” $S^\#\nu(E) := \nu(S(E))$ for $S\colon X \to Y$ and $\nu \in \mathcal{P}(Y)$. However, this construction does not work in general. Indeed, since the image of two disjoint sets might coincide (consider for instance the case when $S$ is a constant map), $S^\#\nu$ may not be additive on disjoint sets.

Lemma 1.2.5. Let $T\colon X \to Y$, $\mu \in \mathcal{P}(X)$, and $\nu \in \mathcal{P}(Y)$. Then $\nu = T_\#\mu$ if and only if, for any $\varphi\colon Y \to \mathbb{R}$ Borel and bounded, we have
\[
\int_Y \varphi(y)\, d\nu(y) = \int_X \varphi(T(x))\, d\mu(x). \tag{1.1}
\]
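Proof. The implication (1.1) $\Rightarrow$ $\nu = T_\#\mu$ follows by choosing $\varphi = \mathbf{1}_A$ with $A \in \mathcal{B}(Y)$. We now focus on the other implication. For any Borel subset $A \subset Y$, it holds that
\[
\int_Y \mathbf{1}_A\, d\nu = \nu(A) = \mu(T^{-1}(A)) = \int_X \mathbf{1}_{T^{-1}(A)}\, d\mu = \int_X \mathbf{1}_A \circ T\, d\mu.
\]
Thus, by linearity of the integral, we immediately deduce
\[
\int_Y \varphi\, d\nu = \int_X \varphi \circ T\, d\mu
\]
for any simple function $\varphi\colon Y \to \mathbb{R}$, i.e., for any $\varphi$ of the form $\sum_{i \in I} \lambda_i \mathbf{1}_{A_i}$ where $I$ is a finite set, $(A_i)_{i \in I}$ are Borel subsets, and $(\lambda_i)_{i \in I}$ are real values. In order to deduce the desired result, fix a bounded Borel function $\varphi\colon Y \to \mathbb{R}$. Since any bounded Borel function can be approximated uniformly by simple functions,¹ there is a sequence of simple functions $(\varphi_k)_{k \in \mathbb{N}}$ such that $\|\varphi_k - \varphi\|_\infty \to 0$ as $k \to \infty$. Therefore we have
\[
\int_Y \varphi\, d\nu = \lim_{k \to \infty} \int_Y \varphi_k\, d\nu = \lim_{k \to \infty} \int_X \varphi_k \circ T\, d\mu = \int_X \varphi \circ T\, d\mu,
\]
which is the desired identity.

An immediate consequence of the previous lemma is the following:

Corollary 1.2.6. For any function $\varphi\colon Y \to \mathbb{R}$ Borel and bounded it holds that
\[
\int_Y \varphi\, d(T_\#\mu) = \int_X \varphi \circ T\, d\mu.
\]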
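The next lemma shows the relation between composition and push-forward.

Lemma 1.2.7. Let $T\colon X \to Y$ and $S\colon Y \to Z$ be measurable; then
\[
(S \circ T)_\#\mu = S_\#(T_\#\mu).
\]

Proof. Thanks to Corollary 1.2.6, for any $\varphi\colon Z \to \mathbb{R}$ Borel and bounded we have
\[
\begin{aligned}
\int_Z \varphi\, d(S \circ T)_\#\mu &= \int_X \varphi \circ (S \circ T)\, d\mu = \int_X (\varphi \circ S) \circ T\, d\mu \\
&= \int_Y \varphi \circ S\, d(T_\#\mu) \\
&= \int_Z \varphi\, dS_\#(T_\#\mu).
\end{aligned}
\]
The result follows from Lemma 1.2.5.

¹ To prove this, given $\varphi\colon Y \to \mathbb{R}$ a bounded Borel function, fix $\varepsilon > 0$ and for any $i \in \mathbb{Z}$ consider the set $A_i := \{\varepsilon i \le \varphi < \varepsilon(i+1)\}$. Then define $\varphi_\varepsilon := \sum_{i \in \mathbb{Z}} \varepsilon i\, \mathbf{1}_{A_i}$. Since $\varphi$ is bounded we have $A_i = \emptyset$ for $|i| \gg 1$, hence $\varphi_\varepsilon$ is a simple function. Also
\[
\|\varphi - \varphi_\varepsilon\|_{L^\infty} = \max_{i \in \mathbb{Z}} \|\varphi - \varphi_\varepsilon\|_{L^\infty(A_i)} \le \varepsilon.
\]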
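For finitely supported measures the statements of Section 1.2 can be checked by direct computation. The following Python sketch is only an illustration: the helper names `push_forward` and `integrate`, the dictionary representation of a measure, and the sample data are assumptions made here, not notation from the book. It computes $T_\#\mu$ from Definition 1.2.2 and then verifies the change of variables identity of Corollary 1.2.6 and the composition rule of Lemma 1.2.7.

```python
from collections import defaultdict


def push_forward(T, mu):
    """Return T_# mu for a finitely supported measure mu given as {point: mass}."""
    nu = defaultdict(float)
    for x, mass in mu.items():
        # All the mass sitting at x is moved to T(x); masses landing together add up.
        nu[T(x)] += mass
    return dict(nu)


def integrate(phi, mu):
    """Integral of phi against the finitely supported measure mu."""
    return sum(phi(x) * mass for x, mass in mu.items())


# Hypothetical example data.
mu = {-1: 0.25, 0: 0.25, 1: 0.5}
T = lambda x: x * x        # not injective: -1 and 1 are both sent to 1
S = lambda y: y + 3
phi = lambda y: 2 * y + 1

nu = push_forward(T, mu)

# Corollary 1.2.6: integrating phi against T_# mu equals integrating phi∘T against mu.
lhs = integrate(phi, nu)
rhs = integrate(lambda x: phi(T(x)), mu)

# Lemma 1.2.7: (S∘T)_# mu = S_#(T_# mu).
composed = push_forward(lambda x: S(T(x)), mu)
iterated = push_forward(S, nu)

print(nu, lhs, rhs, composed == iterated)
```

The map $T(x) = x^2$ is chosen on purpose to be non-injective: the push-forward is still well defined, whereas the "pull-back" of Remark 1.2.4 would not be.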
1.3 Basics of Riemannian geometry

Even though we are not going to work with Riemannian manifolds, some of the results we present (namely Arnold’s theorem, geodesics in the Wasserstein space, and the differential structure of the Wasserstein space) are heavily inspired by classical concepts in Riemannian geometry. Hence, we provide a very short introduction to the subject, with an emphasis on those facts and structures that may help readers to fully appreciate the content of this book.

First, for embedded submanifolds, we recall the definitions of tangent space, Riemannian distance, (minimizing) geodesic, and gradient. Then we briefly explain how these definitions can be generalized to the (more abstract) case of a (not necessarily embedded) Riemannian manifold.

Our presentation of the subject is quick and superficial, but should be sufficient to understand the related topics in this book. This material, and much more, may be found in any introductory text on Riemannian geometry (see, for example, [31, 47, 56, 62]). Readers with some experience in the subject may skip this section.

Embedded submanifolds. Let $M$ be a compact $d$-dimensional smooth manifold embedded in $\mathbb{R}^D$. We are going to show how the Euclidean scalar product of the ambient space $\mathbb{R}^D$ induces a distance (the Riemannian distance) on $M$, and how this gives rise to a number of related concepts (gradients, minimizing geodesics, and geodesics). In what follows, we implicitly assume that all curves are $C^1$.

Let us begin with the definition of tangent space. Notice that, for its definition, we are not going to use the Euclidean scalar product of the ambient space.

Definition 1.3.1 (Tangent space). Given a point $p \in M$, the tangent space $T_pM \subset \mathbb{R}^D$ of $M$ at $p$ is defined as
\[
T_pM := \{\dot\gamma(0) \mid \gamma\colon (-1,1) \to M,\ \gamma(0) = p\}.
\]
Intuitively, the tangent space contains all the directions tangent to $M$ at $p$. One can show that $T_pM$ is a $d$-dimensional subspace of $\mathbb{R}^D$.

We now give the definition of gradient of a function, which is a convenient representation of its differential.

Definition 1.3.2 (Gradient). Let $F\colon M \to \mathbb{R}$ be a smooth function. Its gradient $\nabla F\colon M \to \mathbb{R}^D$ is defined as the unique tangent vector field on $M$, that is, $\nabla F(x) \in T_xM$ for all $x \in M$, such that the following holds: for any curve $\gamma\colon (-1,1) \to M$,
\[
\langle \nabla F(\gamma(0)), \dot\gamma(0) \rangle = \frac{d}{dt}\Big|_{t=0} F(\gamma(t)).
\]
For the definition of the gradient we are using that the Euclidean scalar product endows the tangent spaces with a scalar product (i.e., the restriction of the ambient scalar product).
Given a curve $\gamma\colon [a,b] \to M$, its length is given by the formula
\[
\int_a^b |\dot\gamma(t)|\, dt.
\]
Notice that the length of a curve is invariant under reparametrization. Notice also that, to define the length of a curve, we need to compute the Euclidean norm only of vectors tangent to $M$. Once one knows how to measure the length of a curve, the following definition of (Riemannian) distance is fairly natural.

Definition 1.3.3 (Riemannian distance). Given two points $x, y \in M$, their Riemannian distance $d_M(x,y)$ is defined as
\[
d_M(x,y) := \inf\Big\{ \int_a^b |\dot\gamma(t)|\, dt \ \Big|\ \gamma\colon [a,b] \to M,\ \gamma(a) = x,\ \gamma(b) = y \Big\}.
\]
The Riemannian distance is indeed a distance on $M$, that is, it satisfies the triangle inequality (besides $d_M(x,y) = d_M(y,x)$, and $d_M(x,y) = 0$ if and only if $x = y$). Since any curve can be reparametrized to have constant speed, one can show that an equivalent definition of the Riemannian distance is given by
\[
d_M(x,y)^2 = \inf\Big\{ \int_0^1 |\dot\gamma(t)|^2\, dt \ \Big|\ \gamma\colon [0,1] \to M,\ \gamma(0) = x,\ \gamma(1) = y \Big\}. \tag{1.2}
\]
It turns out that there is always a (not necessarily unique) curve achieving the infimum in the definition of the Riemannian distance (this follows from the compactness of $M$ or, more generally, from its completeness).

Definition 1.3.4 (Minimizing geodesic). A curve $\gamma\colon [a,b] \to M$ with constant speed (i.e., $|\dot\gamma|$ is constant) such that $\gamma(a) = x$, $\gamma(b) = y$, and whose length is equal to $d_M(x,y)$, is called a minimizing geodesic.

The restriction of a minimizing geodesic on a smaller interval is still a minimizing geodesic. Moreover, any minimizing geodesic is smooth. One may think of minimizing geodesics as “straight lines in a curved space.” Indeed, since a minimizing geodesic has constant speed and achieves the minimum also in (1.2), it can be proven (with a variational argument, as a consequence of the minimality) that
\[
\ddot\gamma(t) \perp T_{\gamma(t)}M \tag{1.3}
\]
for all $t \in [0,1]$. In other words, apart from the distortion induced by $M$, minimizing geodesics go “as straight as possible.”

Definition 1.3.5 (Geodesic). A (not necessarily minimizing) geodesic is a curve $\gamma\colon [a,b] \to M$ that satisfies (1.3).
It can be readily checked that a geodesic has constant speed; indeed
\[
\frac{d}{dt} |\dot\gamma|^2 = 2\langle \ddot\gamma, \dot\gamma \rangle = 0,
\]
where we have used that $\ddot\gamma \perp T_\gamma M \ni \dot\gamma$. Moreover, any geodesic is locally minimizing. More precisely, if $\gamma\colon [a,b] \to M$ satisfies (1.3), then for any $t_0 \in (a,b)$ there is $\varepsilon > 0$ such that $\gamma$ restricted to $[t_0 - \varepsilon, t_0 + \varepsilon]$ is a minimizing geodesic.

Abstract Riemannian manifolds. In the previous paragraph we described how a submanifold of $\mathbb{R}^D$ inherits a number of structures (tangent space, gradient, distance, geodesics) from the ambient space. Let us briefly explain what is necessary for an abstract manifold to have such structures.

Given a compact $d$-dimensional smooth manifold $M$, there is an intrinsic definition of tangent space $T_pM$ (as an appropriate quotient of the curves through $p$, where two curves are identified if “they have the same derivative at $p$”). To proceed further and talk about gradients, lengths, etc., we need to endow our manifold $M$ with an additional structure, that is, a Riemannian metric. A Riemannian metric is a (symmetric and positive definite) scalar product $g_x\colon T_xM \times T_xM \to \mathbb{R}$, defined on each tangent space, that varies continuously with respect to $x \in M$. If $M$ is endowed with a Riemannian metric $g = (g_x)_{x \in M}$, we say that $(M, g)$ is a Riemannian manifold.

On a Riemannian manifold, all the definitions given previously (gradient, length, Riemannian distance, and minimizing geodesic) make perfect sense (for example, the length of a curve is $\int_a^b g_\gamma(\dot\gamma, \dot\gamma)^{1/2}\, dt$), and all the facts we have stated remain true.

It is more delicate to generalize (1.3) to this more abstract setting, and thus to define what a (not necessarily minimizing) geodesic is. We prefer not to delve into this topic, as it goes beyond the basic understanding of Riemannian geometry that is necessary to appreciate the rest of this book.
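As an illustration of the embedded case, consider the unit sphere $\mathbb{S}^2 \subset \mathbb{R}^3$, where $T_x\mathbb{S}^2 = x^\perp$ and a great circle $\gamma(t) = \cos(t)\,p + \sin(t)\,q$ (with $p, q$ orthonormal) satisfies $\ddot\gamma = -\gamma$, so that (1.3) holds. The Python sketch below is a numerical sanity check under these assumptions; the function names and the chosen points are hypothetical. It approximates the length of a great-circle arc by an inscribed polygonal line, compares it with the angle between the endpoints, and checks that a finite-difference acceleration has no tangential component.

```python
import math


def great_circle(p, q, t):
    """gamma(t) = cos(t) p + sin(t) q, for orthonormal p, q in R^3."""
    return tuple(math.cos(t) * pi + math.sin(t) * qi for pi, qi in zip(p, q))


def polyline_length(curve, a, b, n=2000):
    """Approximate the length of the curve on [a, b] by an inscribed polygonal line."""
    pts = [curve(a + (b - a) * k / n) for k in range(n + 1)]
    return sum(math.dist(pts[k], pts[k + 1]) for k in range(n))


p, q = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
gamma = lambda t: great_circle(p, q, t)
theta = 1.0  # the arc joins gamma(0) to gamma(theta)

length = polyline_length(gamma, 0.0, theta)
x, y = gamma(0.0), gamma(theta)
angle = math.acos(sum(xi * yi for xi, yi in zip(x, y)))

# Second-order finite difference for the acceleration at t0.
t0, h = 0.3, 1e-3
acc = tuple((u - 2 * v + w) / h**2
            for u, v, w in zip(gamma(t0 + h), gamma(t0), gamma(t0 - h)))
# Velocity at t0: -sin(t0) p + cos(t0) q, which is tangent to the sphere at gamma(t0).
vel = tuple(-math.sin(t0) * pi + math.cos(t0) * qi for pi, qi in zip(p, q))
tangential = sum(ai * vi for ai, vi in zip(acc, vel))

print(length, angle, tangential)
```

The computed length agrees with the angle between the endpoints up to discretization error, consistent with the arc being a minimizing geodesic for endpoints that are not antipodal.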
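1.4 Transport maps

Definition 1.4.1. Given $\mu \in \mathcal{P}(X)$ and $\nu \in \mathcal{P}(Y)$, a map $T\colon X \to Y$ is called a transport map from $\mu$ to $\nu$ if $T_\#\mu = \nu$.

Remark 1.4.2. Given $\mu$ and $\nu$, the set $\{T \mid T_\#\mu = \nu\}$ may be empty. For instance, given $\mu = \delta_{x_0}$ with $x_0 \in X$ and a map $T\colon X \to Y$, we have
\[
\int_Y \varphi(y)\, d(T_\#\mu)(y) = \int_X \varphi \circ T(x)\, d\mu(x) = \varphi(T(x_0)) \quad \forall\, \varphi\colon Y \to \mathbb{R}
\quad \Longrightarrow \quad T_\#\mu = \delta_{T(x_0)}.
\]
Hence, unless $\nu$ is a Dirac delta, for any map $T$ we have $T_\#\mu \ne \nu$ and the set $\{T \mid T_\#\mu = \nu\}$ is empty.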
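Definition 1.4.3. We call $\gamma \in \mathcal{P}(X \times Y)$ a coupling² of $\mu$ and $\nu$ if
\[
(\pi_X)_\#\gamma = \mu \quad \text{and} \quad (\pi_Y)_\#\gamma = \nu,
\]
where
\[
\pi_X(x,y) = x, \quad \pi_Y(x,y) = y \quad \forall\, (x,y) \in X \times Y.
\]
This is equivalent to requiring that
\[
\int_{X \times Y} \varphi(x)\, d\gamma(x,y) = \int_{X \times Y} \varphi \circ \pi_X(x,y)\, d\gamma(x,y) = \int_X \varphi(x)\, d\mu(x)
\]
for all $\varphi\colon X \to \mathbb{R}$ Borel and bounded, and
\[
\int_{X \times Y} \psi(y)\, d\gamma(x,y) = \int_{X \times Y} \psi \circ \pi_Y(x,y)\, d\gamma(x,y) = \int_Y \psi(y)\, d\nu(y)
\]
for all $\psi\colon Y \to \mathbb{R}$ Borel and bounded. We denote by $\Gamma(\mu,\nu)$ the set of couplings of $\mu$ and $\nu$.

Remark 1.4.4. Given $\mu$ and $\nu$, the set $\Gamma(\mu,\nu)$ is always nonempty. Indeed the product measure $\gamma = \mu \otimes \nu$ (defined by $\int \psi(x,y)\, d\gamma(x,y) = \int\!\!\int \psi(x,y)\, d\mu(x)\, d\nu(y)$ for every $\psi\colon X \times Y \to \mathbb{R}$) is a coupling:
\[
\begin{aligned}
\int_{X \times Y} \varphi(x)\, d\mu(x)\, d\nu(y) &= \int_Y d\nu(y) \int_X \varphi(x)\, d\mu(x) = 1 \cdot \int_X \varphi(x)\, d\mu(x) = \int_X \varphi(x)\, d\mu(x), \\
\int_{X \times Y} \psi(y)\, d\mu(x)\, d\nu(y) &= \int_X d\mu(x) \int_Y \psi(y)\, d\nu(y) = 1 \cdot \int_Y \psi(y)\, d\nu(y) = \int_Y \psi(y)\, d\nu(y).
\end{aligned}
\]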
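Remark 1.4.5 (Transport map vs. coupling). Let $T\colon X \to Y$ satisfy $T_\#\mu = \nu$. Consider the map $\mathrm{Id} \times T\colon X \to X \times Y$, i.e., $x \mapsto (x, T(x))$, and define
\[
\gamma_T := (\mathrm{Id} \times T)_\#\mu \in \mathcal{P}(X \times Y).
\]
We claim that $\gamma_T \in \Gamma(\mu,\nu)$. Indeed, recalling Lemma 1.2.7 we have
\[
\begin{aligned}
(\pi_X)_\#\gamma_T &= (\pi_X)_\#(\mathrm{Id} \times T)_\#\mu = (\pi_X \circ (\mathrm{Id} \times T))_\#\mu = \mathrm{Id}_\#\mu = \mu, \\
(\pi_Y)_\#\gamma_T &= (\pi_Y)_\#(\mathrm{Id} \times T)_\#\mu = (\pi_Y \circ (\mathrm{Id} \times T))_\#\mu = T_\#\mu = \nu.
\end{aligned}
\]
This proves that any transport map $T$ induces a coupling $\gamma_T$.

² The terminology “coupling” is common in probability. However, in optimal transport theory one often uses the expression transport plan in place of coupling.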
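The two constructions of couplings discussed in Section 1.4, the product measure $\mu \otimes \nu$ and the coupling $(\mathrm{Id} \times T)_\#\mu$ induced by a transport map, can be checked on finitely supported measures. The Python sketch below is illustrative only: the helper names, the dictionary encoding of measures, and the data are assumptions made for this example.

```python
def marginals(gamma):
    """Marginals ((pi_X)_# gamma, (pi_Y)_# gamma) of gamma given as {(x, y): mass}."""
    mx, my = {}, {}
    for (x, y), mass in gamma.items():
        mx[x] = mx.get(x, 0.0) + mass
        my[y] = my.get(y, 0.0) + mass
    return mx, my


def product_coupling(mu, nu):
    """The product measure mu ⊗ nu, which is always a coupling."""
    return {(x, y): a * b for x, a in mu.items() for y, b in nu.items()}


def coupling_from_map(T, mu):
    """The coupling (Id x T)_# mu induced by a map T; it is concentrated on the graph of T."""
    gamma = {}
    for x, mass in mu.items():
        key = (x, T(x))
        gamma[key] = gamma.get(key, 0.0) + mass
    return gamma


# Hypothetical example: T(x) = 2x transports mu onto nu.
mu = {0: 0.5, 1: 0.5}
nu = {0: 0.5, 2: 0.5}
T = lambda x: 2 * x

gamma_prod = product_coupling(mu, nu)
gamma_T = coupling_from_map(T, mu)

print(marginals(gamma_prod) == (mu, nu), marginals(gamma_T) == (mu, nu))
print(len(gamma_prod), len(gamma_T))
```

Both couplings have the same marginals, but the product coupling spreads the mass of each point of $X$ over all of the support of $\nu$, while $\gamma_T$ charges only the pairs $(x, T(x))$.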
1.4.1 Examples of transport maps

We now discuss three examples of transport maps: measurable transport, one-dimensional monotone rearrangement, and the Knothe map.

Measurable transport. The following result can be found in [15, Thm. 11.25]:

Theorem 1.4.6. Let $\mu \in \mathcal{P}(X)$ be a probability measure such that $\mu$ has no atoms (i.e., $\mu(\{x\}) = 0$ for any $x \in X$). Then there exists $T_\mu\colon X \to \mathbb{R}$ such that $T_\mu$ is injective $\mu$-a.e. and
\[
(T_\mu)_\#\mu = dx|_{[0,1]}.
\]
Moreover, $T_\mu^{-1}\colon [0,1] \to X$ exists Lebesgue-a.e. and $(T_\mu^{-1})_\# dx = \mu$.

In other words, given $\mu \in \mathcal{P}(X)$ and $\nu \in \mathcal{P}(Y)$ without atoms, this abstract theorem tells us that we can always transport one onto the other by simply considering $T_\nu^{-1} \circ T_\mu$ (this is a transport map from $\mu$ to $\nu$) or $T_\mu^{-1} \circ T_\nu = (T_\nu^{-1} \circ T_\mu)^{-1}$ (this is a transport map from $\nu$ to $\mu$). Unfortunately these maps have no structure, so they are of little interest in concrete applications in analysis/geometry. Indeed, as we will see in this book, a very important feature of optimal transport maps is their structural properties (for instance, optimal maps for the quadratic cost are gradients of convex functions; see Theorem 2.5.10).

Monotone rearrangement. Given $\mu, \nu \in \mathcal{P}(\mathbb{R})$, set
\[
F(x) := \int_{-\infty}^{x} d\mu(t), \qquad G(y) := \int_{-\infty}^{y} d\nu(t).
\]
Note that these maps are not well defined at points where measures have atoms, since one needs to decide whether the mass of the atom is included in the value of the integral or not. We adopt the convention that the masses of the atoms are included, so that both maps are continuous from the right. More precisely, we set
\[
F(x) := \lim_{\varepsilon \to 0^+} \int_{-\infty}^{x+\varepsilon} d\mu(t) = \mu((-\infty, x]), \qquad
G(y) := \lim_{\varepsilon \to 0^+} \int_{-\infty}^{y+\varepsilon} d\nu(t) = \nu((-\infty, y]).
\]
Note that $F$ and $G$ are nondecreasing. If $G$ was strictly increasing, it would be injective and we could naturally consider its inverse $G^{-1}$. However, $G$ may be constant in some regions, so we need to define a “pseudo-inverse” as follows:
\[
G^{-1}(y) := \inf\{t \in \mathbb{R} \mid G(t) > y\}.
\]
Note that also $G^{-1}$ is continuous from the right. With these definitions, we define the nondecreasing map $T := G^{-1} \circ F\colon \mathbb{R} \to \mathbb{R}$ and we want to prove that it transports $\mu$ to $\nu$. Of course this cannot be true in general, since the set of transport maps may be empty (recall Remark 1.4.2). The following result shows that $T$ is indeed a transport map if $\mu$ has no atoms:

Theorem 1.4.7. If $\mu$ has no atoms, then $T_\#\mu = \nu$.

To prove this theorem, we need some preliminary results.

Lemma 1.4.8. If $\mu$ has no atoms, then for all $t \in [0,1]$ we have
\[
\mu(F^{-1}([0,t])) = t.
\]

Proof. The statement is easily seen to be true for $t = 0$ and $t = 1$. Also, since $\mu$ has no atoms,
\[
|F(t_k) - F(t)| = \Big| \int_t^{t_k} d\mu \Big| \xrightarrow[t_k \to t]{} 0 \quad \forall\, t \in \mathbb{R},
\]
thus $F \in C^0(\mathbb{R}, \mathbb{R})$. Since $F(t) \to 0$ as $t \to -\infty$ and $F(t) \to 1$ as $t \to +\infty$, by the intermediate value theorem it follows that $F$ is surjective on $(0,1)$. Given $t \in (0,1)$, consider the largest value $x \in \mathbb{R}$ such that $F(x) = t$ (this point exists by the continuity of $F$). With this choice of $x$, we have
\[
\mu(F^{-1}([0,t])) = \int_{F^{-1}([0,t])} d\mu = \int_{-\infty}^{x} d\mu = t,
\]
as desired.

Corollary 1.4.9. If $\mu$ has no atoms, then for all $t \in [0,1]$ we have
\[
\mu(F^{-1}([0,t))) = t.
\]

Proof. We apply Lemma 1.4.8 to the intervals $[0,t]$ and $[0,t-\varepsilon]$ with $\varepsilon > 0$:
\[
t = \mu(F^{-1}([0,t])) \ge \mu(F^{-1}([0,t))) \ge \mu(F^{-1}([0,t-\varepsilon])) = t - \varepsilon \xrightarrow[\varepsilon \to 0^+]{} t.
\]
Proof of Theorem 1.4.7. We split the proof into five steps.

(1) Let $A = (-\infty, a]$ with $a \in \mathbb{R}$. Applying Corollary 1.4.9, we have
\[
T_\#\mu(A) = \mu(T^{-1}(A)) = \mu(F^{-1} \circ G((-\infty, a])) = \mu(F^{-1}([0, G(a)))) = G(a) = \nu((-\infty, a]) = \nu(A).
\]

(2) Let $A = (a, b] = (-\infty, b] \setminus (-\infty, a]$. Applying step (1) we have
\[
T_\#\mu(A) = T_\#\mu((-\infty, b]) - T_\#\mu((-\infty, a]) = \nu((-\infty, b]) - \nu((-\infty, a]) = \nu(A).
\]

(3) Let $A = (a, b)$, and consider $A_\varepsilon := (a, b - \varepsilon]$. Thanks to step (2) and monotone convergence we have
\[
\nu(A) \nwarrow \nu(A_\varepsilon) = T_\#\mu(A_\varepsilon) \nearrow T_\#\mu(A) \quad \text{as } \varepsilon \to 0^+.
\]

(4) Let $A \subset \mathbb{R}$ be an open set. We can write $A = \bigcup_{i \in I} (a_i, b_i)$ with $((a_i, b_i))_{i \in I}$ disjoint and countable. Thus, by step (3) we get
\[
\nu(A) = \sum_{i \in I} \nu((a_i, b_i)) = \sum_{i \in I} T_\#\mu((a_i, b_i)) = T_\#\mu(A).
\]

(5) Since open sets are generators of the Borel $\sigma$-algebra, step (4) proves that $T_\#\mu = \nu$.
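The monotone rearrangement $T = G^{-1} \circ F$ can be evaluated numerically. The sketch below is a hedged illustration, not the book's construction verbatim: it assumes $\mu$ is the uniform measure on $[0,1]$ and $\nu$ is the exponential distribution with rate 1 (both chosen here only as an example), and it computes the pseudo-inverse $\inf\{t \mid G(t) > s\}$ by bisection on a bracket $[-50, 50]$ that is an arbitrary choice suitable for this $\nu$. For these measures the closed form is $T(x) = -\log(1-x)$, which the code uses as a cross-check.

```python
import math


def F(x):
    """Cumulative distribution function of mu = uniform measure on [0, 1]."""
    return min(max(x, 0.0), 1.0)


def G(y):
    """Cumulative distribution function of nu = exponential distribution with rate 1."""
    return 1.0 - math.exp(-y) if y > 0 else 0.0


def pseudo_inverse(cdf, s, lo=-50.0, hi=50.0, iters=100):
    """Bisection for inf{t : cdf(t) > s}, assuming cdf(lo) <= s < cdf(hi)."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cdf(mid) > s:
            hi = mid
        else:
            lo = mid
    return hi


def T(x):
    """Monotone rearrangement T = G^{-1} ∘ F."""
    return pseudo_inverse(G, F(x))


# Push a midpoint discretization of mu through T and compare with nu on A = (-inf, a].
n = 2000
images = [T((k + 0.5) / n) for k in range(n)]
a = 1.0
empirical = sum(1 for y in images if y <= a) / n

print(T(0.5), -math.log(0.5), empirical, G(a))
```

The fraction of transported sample points falling in $(-\infty, a]$ approximates $\nu((-\infty, a]) = G(a)$, which is step (1) of the proof above in discretized form.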
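Knothe map. We are going to build a transport map, known as the Knothe map [53], that is a multidimensional generalization of monotone rearrangement. First we need to state the disintegration theorem (for a proof of this result, see Appendix B).

Theorem 1.4.10 (Disintegration theorem). Let $\mu \in \mathcal{P}(\mathbb{R}^2)$ and set $\mu_1 := (\pi_1)_\#\mu \in \mathcal{P}(\mathbb{R})$, where $\pi_1\colon \mathbb{R}^2 \to \mathbb{R}$ is defined as $\pi_1(x_1, x_2) := x_1$. Then there exists a family of probability measures $(\mu_{x_1})_{x_1 \in \mathbb{R}} \subset \mathcal{P}(\mathbb{R})$ such that
\[
\mu(dx_1, dx_2) = \mu_{x_1}(dx_2) \otimes \mu_1(dx_1);
\]
that is, for any $\varphi\colon \mathbb{R}^2 \to \mathbb{R}$ continuous and bounded, we have
\[
\int_{\mathbb{R}^2} \varphi(x_1, x_2)\, d\mu(x_1, x_2) = \int_{\mathbb{R}} \int_{\mathbb{R}} \varphi(x_1, x_2)\, d\mu_{x_1}(x_2)\, d\mu_1(x_1).
\]
Moreover, the measures $\mu_{x_1}$ are unique $\mu_1$-a.e.

Example 1.4.11. Let $\mu = f(x_1, x_2)\, dx_1\, dx_2$ with $\int_{\mathbb{R}^2} f\, dx_1\, dx_2 = 1$, and set
\[
\mu_1 := (\pi_1)_\#\mu, \qquad F_1(x_1) := \int_{\mathbb{R}} f(x_1, x_2)\, dx_2.
\]
We claim that $\mu_1 = F_1\, dx_1$. Indeed, given any test function $\varphi\colon \mathbb{R} \to \mathbb{R}$,
\[
\begin{aligned}
\int_{\mathbb{R}} \varphi(x_1)\, d\mu_1(x_1) &= \int_{\mathbb{R}^2} \varphi(x_1)\, d\mu(x_1, x_2) = \int_{\mathbb{R}^2} \varphi(x_1) f(x_1, x_2)\, dx_1\, dx_2 \\
&\overset{\text{Fubini}}{=} \int_{\mathbb{R}} \varphi(x_1) \Big( \int_{\mathbb{R}} f(x_1, x_2)\, dx_2 \Big)\, dx_1 \\
&= \int_{\mathbb{R}} \varphi(x_1) F_1(x_1)\, dx_1,
\end{aligned}
\]
as desired.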
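Also, let $\mu_{x_1}(dx_2)$ be the disintegration provided by the previous theorem. Then
\[
\begin{aligned}
\int_{\mathbb{R}} \int_{\mathbb{R}} \varphi(x_1, x_2)\, d\mu_{x_1}(x_2)\, d\mu_1(x_1) &= \int_{\mathbb{R}^2} \varphi(x_1, x_2)\, d\mu(x_1, x_2) \\
&= \int_{\mathbb{R}^2} \varphi(x_1, x_2) f(x_1, x_2)\, dx_1\, dx_2 \\
&= \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} \varphi(x_1, x_2) \frac{f(x_1, x_2)}{F_1(x_1)}\, dx_2 \Big) F_1(x_1)\, dx_1.
\end{aligned}
\]
Hence, by uniqueness of the disintegration we deduce that
\[
\mu_{x_1}(dx_2) = \frac{f(x_1, x_2)}{F_1(x_1)}\, dx_2, \quad \mu_1\text{-a.e.}
\]
Note that $\mu_{x_1}$ are indeed probability measures:
\[
\int_{\mathbb{R}} d\mu_{x_1}(x_2) = \frac{1}{F_1(x_1)} \int_{\mathbb{R}} f(x_1, x_2)\, dx_2 = \frac{1}{F_1(x_1)} F_1(x_1) = 1.
\]

Remark 1.4.12 (An absolutely continuous measure lives where its density is positive). Note that $F_1 > 0$ $\mu_1$-a.e. Indeed,
\[
\int_{\{F_1 = 0\}} d\mu_1 = \int_{\{F_1 = 0\}} F_1\, dx_1 = \int_{\{F_1 = 0\}} 0\, dx_1 = 0.
\]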
Construction of a Knothe map. Take two absolutely continuous measures on $\mathbb{R}^2$, namely
\[
\begin{aligned}
\mu(x_1, x_2) &= f(x_1, x_2)\, dx_1\, dx_2 = \frac{f(x_1, x_2)}{F_1(x_1)}\, dx_2 \otimes F_1(x_1)\, dx_1, \\
\nu(y_1, y_2) &= g(y_1, y_2)\, dy_1\, dy_2 = \frac{g(y_1, y_2)}{G_1(y_1)}\, dy_2 \otimes G_1(y_1)\, dy_1,
\end{aligned}
\]
where
\[
F_1(x_1) = \int_{\mathbb{R}} f(x_1, x_2)\, dx_2 \quad \text{and} \quad G_1(y_1) = \int_{\mathbb{R}} g(y_1, y_2)\, dy_2.
\]
Using Theorem 1.4.7, monotone rearrangement provides us with a map $T_1\colon \mathbb{R} \to \mathbb{R}$ such that $T_{1\#}(F_1\, dx_1) = G_1\, dy_1$. Then, for $F_1\, dx_1$-a.e. $x_1 \in \mathbb{R}$, we consider the monotone rearrangement $T_2(x_1, \cdot)\colon \mathbb{R} \to \mathbb{R}$ such that
\[
T_2(x_1, \cdot)_\# \Big( \frac{f(x_1, \cdot)}{F_1(x_1)}\, dx_2 \Big) = \frac{g(T_1(x_1), \cdot)}{G_1(T_1(x_1))}\, dy_2. \tag{1.4}
\]
In other words, for each fixed $x_1$, $T_2(x_1, \cdot)$ is a map that sends the disintegration of $\mu$ at the point $x_1$ onto the disintegration of $\nu$ at the point $T_1(x_1)$.
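Theorem 1.4.13. The Knothe map $T(x_1, x_2) := (T_1(x_1), T_2(x_1, x_2))$ transports $\mu$ to $\nu$.

Proof. For $\varphi\colon \mathbb{R}^2 \to \mathbb{R}$ Borel and bounded, we have
\[
\begin{aligned}
\int_{\mathbb{R}^2} \varphi(y_1, y_2) g(y_1, y_2)\, dy_1\, dy_2
&= \int_{\mathbb{R}} \underbrace{\Big( \int_{\mathbb{R}} \varphi(y_1, y_2) \frac{g(y_1, y_2)}{G_1(y_1)}\, dy_2 \Big)}_{=:\ \Psi(y_1)} G_1(y_1)\, dy_1 \\
&= \int_{\mathbb{R}} \Psi(T_1(x_1)) F_1(x_1)\, dx_1 \qquad \text{(since } (T_1)_\#(F_1\, dx_1) = G_1\, dy_1\text{)} \\
&= \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} \varphi(T_1(x_1), y_2) \frac{g(T_1(x_1), y_2)}{G_1(T_1(x_1))}\, dy_2 \Big) F_1(x_1)\, dx_1 \\
&\overset{(1.4)}{=} \int_{\mathbb{R}} \Big( \int_{\mathbb{R}} \varphi(T_1(x_1), T_2(x_1, x_2)) \frac{f(x_1, x_2)}{F_1(x_1)}\, dx_2 \Big) F_1(x_1)\, dx_1 \\
&= \int_{\mathbb{R}} \int_{\mathbb{R}} \varphi(T_1(x_1), T_2(x_1, x_2)) f(x_1, x_2)\, dx_2\, dx_1 \\
&= \int_{\mathbb{R}^2} (\varphi \circ T)(x_1, x_2)\, d\mu(x_1, x_2).
\end{aligned}
\]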
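To see the construction of the Knothe map at work, here is a small worked example in Python. Everything in it is an assumption chosen for illustration: the source density is $f = 1$ on the unit square, the target density is $g(y_1, y_2) = y_1 + y_2$ on the unit square, and the closed forms for $T_1$ and $T_2$ are obtained by hand by inverting the corresponding cumulative distribution functions ($G_1(y_1) = y_1 + \tfrac12$, and conditional density $(y_1 + y_2)/(y_1 + \tfrac12)$). The code then checks the identity $\int \varphi \circ T\, d\mu = \int \varphi\, d\nu$ of Theorem 1.4.13 with a midpoint rule for two test functions.

```python
import math


def T1(x1):
    """Monotone rearrangement of dx1 on [0,1] onto (y1 + 1/2) dy1 on [0,1].

    Solves y1**2 / 2 + y1 / 2 = x1 for y1 in [0, 1].
    """
    return 0.5 * (-1.0 + math.sqrt(1.0 + 8.0 * x1))


def T2(x1, x2):
    """Monotone rearrangement between the conditional measures at x1 and at T1(x1).

    Solves (y1 * y2 + y2**2 / 2) / (y1 + 1/2) = x2 for y2 in [0, 1], with y1 = T1(x1).
    """
    y1 = T1(x1)
    return -y1 + math.sqrt(y1 * y1 + x2 * (2.0 * y1 + 1.0))


def knothe(x1, x2):
    return T1(x1), T2(x1, x2)


def integral_against_mu(phi, n=200):
    """Midpoint-rule approximation of the integral of phi∘T over the unit square."""
    total = 0.0
    for i in range(n):
        for j in range(n):
            y1, y2 = knothe((i + 0.5) / n, (j + 0.5) / n)
            total += phi(y1, y2)
    return total / (n * n)


# Exact integrals against nu = (y1 + y2) dy1 dy2 on the unit square, computed by hand:
# the integral of y1 * y2 is 1/3 and the integral of y1 is 7/12.
approx_product = integral_against_mu(lambda y1, y2: y1 * y2)
approx_first = integral_against_mu(lambda y1, y2: y1)
print(approx_product, 1 / 3, approx_first, 7 / 12)
```

Note that `T1` depends only on `x1`, which is the triangular structure exploited in the remarks that follow.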
Remark 1.4.14. Since monotone rearrangement is an increasing function, we have (under the assumption that the map $T(x_1, x_2) = (T_1(x_1), T_2(x_1, x_2))$ is smooth)
\[
\nabla T = \begin{pmatrix} \partial_1 T_1 \ge 0 & * \\ 0 & \partial_2 T_2 \ge 0 \end{pmatrix}.
\]

One can use the previous construction of the Knothe map in $\mathbb{R}^2$ and iterate it to obtain a Knothe map on $\mathbb{R}^d$. Let
\[
\mu(x_1, \dots, x_d) = f(x_1, \dots, x_d)\, dx_1 \cdots dx_d, \qquad \nu(y_1, \dots, y_d) = g(y_1, \dots, y_d)\, dy_1 \cdots dy_d
\]
be two absolutely continuous measures. Using monotone rearrangement we get a map $T_1\colon \mathbb{R} \to \mathbb{R}$ such that $T_{1\#}(F_1\, dx_1) = G_1\, dy_1$, where $F_1(x_1) = \int f\, dx_2 \dots dx_d$ and $G_1(y_1) = \int g\, dy_2 \dots dy_d$. Also, the analogues of Theorem 1.4.10 and Example 1.4.11 in $\mathbb{R}^d$ yield probability measures on $\mathbb{R}^{d-1}$ given by
\[
\mu_{x_1}(x_2, \dots, x_d) = \frac{f(x_1, x_2, \dots, x_d)}{F_1(x_1)}\, dx_2 \cdots dx_d
\]
and
\[
\nu_{y_1}(y_2, \dots, y_d) = \frac{g(y_1, y_2, \dots, y_d)}{G_1(y_1)}\, dy_2 \cdots dy_d,
\]
such that $\mu = \mu_{x_1} \otimes F_1\, dx_1$ and $\nu = \nu_{y_1} \otimes G_1\, dy_1$.

By induction on the dimension, there exists a Knothe map $T_{x_1}\colon \mathbb{R}^{d-1} \to \mathbb{R}^{d-1}$ sending $\mu_{x_1}$ onto $\nu_{T_1(x_1)}$, and then we obtain a Knothe map in $\mathbb{R}^d$ as
\[
T(x_1, \dots, x_d) := (T_1(x_1), T_{x_1}(x_2, \dots, x_d)).
\]

Remark 1.4.15. Suppose again that the map $T$ is smooth. Then
\[
\nabla T = \begin{pmatrix}
\partial_1 T_1 & * & \cdots & * \\
0 & \partial_2 T_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & * \\
0 & \cdots & 0 & \partial_d T_d
\end{pmatrix},
\qquad \partial_i T_i \ge 0 \ \text{ for } i = 1, \dots, d.
\]
Note that this is an upper triangular matrix and that all the values on the diagonal are nonnegative. This will be important for the next section.

Remark 1.4.16. Although we call it the Knothe map, the map itself is by no means unique. Indeed, by fixing a basis in $\mathbb{R}^d$ but changing the order of integration, one obtains a different Knothe map. Even more, changing the basis of $\mathbb{R}^d$ yields in general a different map.
1.5 An application to isoperimetric inequalities

The following is the classical (sharp) isoperimetric inequality in $\mathbb{R}^d$.

Theorem 1.5.1. Let $E \subset \mathbb{R}^d$ be a bounded set with smooth boundary. Then
\[
\operatorname{Area}(\partial E) \ge d\, |B_1|^{\frac{1}{d}}\, |E|^{\frac{d-1}{d}},
\]
where $|B_1|$ is the volume of the unit ball.

To prove this result, let $|E|$ denote the Lebesgue measure of $E$ and consider the probability measures $\mu = \frac{\mathbf{1}_E}{|E|}\, dx$ and $\nu = \frac{\mathbf{1}_{B_1}}{|B_1|}\, dy$.

Proposition 1.5.2. Let $T$ be a Knothe map from $\mu$ to $\nu$, and assume it to be smooth.³ Then,

(a) for any $x \in E$, it holds that $|T(x)| \le 1$;
(b) $\det \nabla T = \frac{|B_1|}{|E|}$ in $E$;
(c) $\operatorname{div} T \ge d\, (\det \nabla T)^{\frac{1}{d}}$.

³ The smoothness assumption can be dropped with some fine analytic arguments. To obtain a rigorous proof one can also work with the optimal transport map (instead of the Knothe map) and use the theory of functions with bounded variation, as done in [44].
Proof. We prove the three properties. (a) If x 2 E, then T .x/ 2 B1 and thus jT .x/j 1. (b) Let A B1 , so that T
1
.A/ E. Since T# D , we have Z dx 1 .A/ D .T .A// D : T 1 .A/ jEj
On the other hand, by the change of variable formulas, setting y D T .x/ we have dy D jdet rT j dx, therefore Z Z dy 1 .A/ D D j det rT .x/j dx: 1 jB j jB 1 1j A T .A/ Furthermore, since rT is upper triangular and its diagonal elements are nonnegative (see Remark 1.4.15), it follows that det rT 0, hence Z Z dx 1 D .A/ D det rT .x/ dx: T 1 .A/ jEj T 1 .A/ jB1 j Since A B1 is arbitrary, we obtain det rT 1 D jB1 j jEj
inside E:
(c) Note that, since the matrix rT is upper triangular (see Remark 1.4.15), its determinant is given by the product of its diagonal elements. Hence div T .x/ D
d X
@i Ti .x/ D d
i D1
d 1X @i Ti .x/ d i D1
d
Y d
d1 @i Ti .x/
1 D d det rT .x/ d ;
i D1
where the inequality follows from the fact that the arithmetic mean of the nonnegative numbers @i Ti .x/ is greater than the geometric one. Proof of Theorem 1.5.1. Thanks to properties (a), (b), (c) in Proposition 1.5.2, denoting by E the outer unit normal to @E and by d the surface measure on @E, we have Z Z Z Z (a) Area.@E/ D 1 d jT j d T E d D div T dx @E @E @E E Z Z (c) d1 jB1 j d1 1 d 1 (b) d det rT dx D d dx D d jB1 j d jEj d ; E E jEj where the equality marked with follows from the Stokes theorem.
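As a quick sanity check of Theorem 1.5.1, the following minimal sketch compares the two sides of the inequality for balls (where equality is expected) and for cubes (where the inequality is strict). The helper names `unit_ball_volume`, `isoperimetric_lower_bound`, `ball_data`, and `cube_data` are illustrative choices and do not come from the text; the closed formula for $|B_1|$ via the Gamma function is the standard one.

```python
import math


def unit_ball_volume(d):
    """Lebesgue measure |B_1| of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def isoperimetric_lower_bound(volume, d):
    """Right-hand side of Theorem 1.5.1: d |B_1|^(1/d) |E|^((d-1)/d)."""
    return d * unit_ball_volume(d) ** (1 / d) * volume ** ((d - 1) / d)


def ball_data(radius, d):
    """(surface area, volume) of a ball of the given radius in R^d."""
    volume = unit_ball_volume(d) * radius ** d
    area = d * unit_ball_volume(d) * radius ** (d - 1)
    return area, volume


def cube_data(side, d):
    """(surface area, volume) of a cube of the given side length in R^d."""
    return 2 * d * side ** (d - 1), side ** d


if __name__ == "__main__":
    for d in range(2, 7):
        ball_area, ball_volume = ball_data(1.5, d)
        cube_area, cube_volume = cube_data(1.0, d)
        print(
            d,
            ball_area - isoperimetric_lower_bound(ball_volume, d),  # close to 0
            cube_area - isoperimetric_lower_bound(cube_volume, d),  # positive
        )
```

The cube is only a piecewise smooth example, so it falls slightly outside the smooth-boundary hypothesis of the theorem; it is included here purely as an illustration.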
1.6 A Jacobian equation for transport maps
Let $T : \mathbb{R}^d \to \mathbb{R}^d$ be a smooth diffeomorphism with $\det \nabla T > 0$, and assume that $T_\#(f\, dx) = g\, dy$, where $f$ and $g$ are probability densities. First of all, by the definition of the push-forward measure, for any bounded Borel function $\zeta : \mathbb{R}^d \to \mathbb{R}$ we have
\[
\int_{\mathbb{R}^d} \zeta(y)\, g(y)\, dy = \int_{\mathbb{R}^d} \zeta(T(x))\, f(x)\, dx.
\]
On the other hand, using the change of variables $y = T(x)$ we have $dy = \det \nabla T(x)\, dx$, and therefore
\[
\int_{\mathbb{R}^d} \zeta(y)\, g(y)\, dy = \int_{\mathbb{R}^d} \zeta(T(x))\, g(T(x)) \det \nabla T(x)\, dx.
\]
Comparing the two equations above, since $\zeta$ is arbitrary we deduce that $T$ satisfies
\[
g(T(x)) \det \nabla T(x) = f(x).
\]
Note that the transport maps we are going to construct in the next chapters (and also the Knothe map we have just studied) are not smooth diffeomorphisms in general, so proving the validity (in a suitable sense) of this Jacobian equation requires some additional work.
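The Jacobian equation can be checked by hand in a simple one-dimensional case. The sketch below assumes, purely for illustration, the affine map $T(x) = Ax + B$ with $A > 0$, which pushes the standard Gaussian density $f$ forward to the Gaussian density $g$ with mean $B$ and standard deviation $A$; the names `A`, `B`, `gaussian_pdf`, and `jacobian_residual` are hypothetical.

```python
import math

# Hypothetical parameters of the affine map T(x) = A * x + B (A > 0).
A, B = 2.0, 1.0


def gaussian_pdf(x, mean, std):
    """Density of the one-dimensional Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def T(x):
    return A * x + B


def dT(x):
    # In dimension one, det(grad T) is just the derivative T'(x).
    return A


def f(x):
    return gaussian_pdf(x, 0.0, 1.0)


def g(y):
    return gaussian_pdf(y, B, A)


def jacobian_residual(x):
    """g(T(x)) * det(grad T(x)) - f(x); zero when the Jacobian equation holds."""
    return g(T(x)) * dT(x) - f(x)


if __name__ == "__main__":
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        print(x, jacobian_residual(x))
```

The residual vanishes up to floating-point rounding, which is consistent with $T_\#(f\,dx) = g\,dy$ for this particular pair of densities.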
Chapter 2
Optimal transport This chapter contains what is usually considered to be the core of optimal transport theory: the solution of Kantorovich’s problem for general costs (i.e., the existence of an optimal transport plan), the duality theory, and the solution of Monge’s problem (i.e., the existence of an optimal transport map) for suitable costs. Our proof of the duality follows the path that goes through the concept of cyclical monotonicity and the characterization of cyclically monotone sets as graphs. This approach is very specific to the optimal transport problem and does not apply to other convex problems. In Remark 2.6.7 we will briefly outline a different approach, which is actually a fairly standard argument from convex analysis, that works in wider generality at the cost of being more obscure. We will also present a couple of classical applications of the theory: polar decomposition and an application to the Euler equations of fluid dynamics. In order to pursue our plan we will need some preliminaries in measure theory; hence we will devote the first section to these preliminaries.
2.1 Preliminaries in measure theory
In this section, $X$ will be a locally compact, separable, and complete metric space. Again, the model case is $X = \mathbb{R}^d$. Every measure here will be in $\mathcal{P}(X)$ (i.e., a probability measure).
Remark 2.1.1. The assumptions in this book are far from being sharp, as our goal is to emphasize the main ideas of the theory. In particular, the existence of optimal transport plans (Theorem 2.3.2) and the duality theorem (Theorem 2.6.5) hold in arbitrary separable metric spaces. Interested readers may look at [6, Chaps. 5.1–5.4 and 6.1].
Remark 2.1.2. By the Riesz representation theorem (see [15, Thm. 7.7]) we have the following equalities (recall that, given a Banach space $E$, the notation $E^*$ denotes its dual):
\[
\mathcal{M}(X) := \{\text{finite signed measures on } X\} \cong C_c(X)^* \cong C_0(X)^*,
\]
where $C_c(X) := \{\text{continuous compactly supported functions}\}$ and $C_0(X) := \{\text{continuous functions vanishing at } \infty\}$.
Remark 2.1.3. Note that $C_c(X)$ is not complete (with respect to uniform convergence) if $X$ is not compact. For example, for $X = \mathbb{R}$, if $\chi_n : \mathbb{R} \to [0, 1]$ are continuous functions such that $\chi_n(x) = 1$ for $x \in [-n, n]$ and $\chi_n(x) = 0$ for $x \notin [-n-1, n+1]$, then the sequence of functions $f_n(x) := \frac{\chi_n(x)}{1 + x^2}$ converges uniformly towards $f(x) = \frac{1}{1 + x^2} \notin C_c(\mathbb{R})$.
Let $(\mu_k)_{k \in \mathbb{N}}$ be a sequence of probability measures. Then $\mu_k(X) = 1$ and therefore the whole sequence $(\mu_k)_{k \in \mathbb{N}}$ is uniformly bounded in $\mathcal{M}(X)$. Thus, thanks to the Banach–Alaoglu theorem, there exists a subsequence $(\mu_{k_j})_{j \in \mathbb{N}}$ that weakly-$*$ converges to a measure $\mu \in \mathcal{M}(X)$:
\[
\mu_{k_j} \overset{*}{\rightharpoonup} \mu \in \mathcal{M}(X), \quad \text{i.e.,} \quad \int_X \varphi\, d\mu_{k_j} \to \int_X \varphi\, d\mu \quad \text{for any } \varphi \in C_c(X).
\]
Note that since $\mu_k \ge 0$ (by assumption) we have $\mu \ge 0$. On the other hand, even if the $\mu_k$ are all probability measures, $\mu$ may not be a probability measure, as shown in the following example.
Example 2.1.4. Let $X = \mathbb{R}$ and $\mu_k = \delta_k$ for $k \in \mathbb{Z}$. Then, for any $\varphi \in C_c(\mathbb{R})$,
\[
\int_{\mathbb{R}} \varphi\, d\mu_k = \varphi(k) \to 0 \quad \text{as } k \to \infty.
\]
Hence $\mu_k \overset{*}{\rightharpoonup} 0$. This shows that, in general, the weak-$*$ limit of probability measures may not be a probability measure. To resolve that issue, we need to introduce a stronger notion of convergence.
Definition 2.1.5. Let $C_b(X)$ be the set of continuous bounded functions. We say that $\mu_k$ converges to $\mu$ narrowly if
\[
\int_X \varphi\, d\mu_k \to \int_X \varphi\, d\mu \quad \text{for any } \varphi \in C_b(X).
\]
We denote this convergence by $\mu_k \rightharpoonup \mu$.
Remark 2.1.6. Narrow convergence is particularly useful in our context as it guarantees that limits of probability measures are still probability measures. Indeed, assume that $\mu_k \in \mathcal{P}(X)$ and $\mu_k \rightharpoonup \mu$. Then, taking $\varphi \equiv 1$ yields
\[
1 = \mu_k(X) = \int_X 1\, d\mu_k \to \int_X 1\, d\mu = \mu(X).
\]
Hence $\mu \in \mathcal{P}(X)$.
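The difference between the two notions of convergence in Example 2.1.4 can be seen numerically. In the minimal sketch below, the sequence $\mu_k = \delta_k$ is tested against a compactly supported function and against a bounded one; the names `integrate_dirac`, `tent`, and `one` are illustrative and not taken from the text.

```python
def integrate_dirac(phi, point):
    """Integral of phi against the Dirac mass at `point`."""
    return phi(point)


def tent(x):
    """A continuous function with compact support [-1, 1]."""
    return max(0.0, 1.0 - abs(x))


def one(x):
    """The constant function 1, which is bounded but not compactly supported."""
    return 1.0


# Integrals of the two test functions against mu_k = delta_k for k = 0, ..., 5.
compact_values = [integrate_dirac(tent, k) for k in range(6)]
bounded_values = [integrate_dirac(one, k) for k in range(6)]

if __name__ == "__main__":
    print(compact_values)  # tends to 0: the weak-* limit is the zero measure
    print(bounded_values)  # stays equal to 1: the mass escapes to infinity
```

Against `tent` the integrals vanish as soon as $k \ge 1$, while against the constant `one` they remain equal to $1$, so the sequence cannot converge narrowly to the zero measure.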
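Example 2.1.7. Take $X = \mathbb{R}^d$ and $\mu_k = \left(1 - \frac{1}{k}\right) \delta_0 + \frac{1}{k} \delta_{x_k}$ for some $x_k \in \mathbb{R}^d$. Then, if $\varphi \in C_b(\mathbb{R}^d)$, we have
\[
\int_{\mathbb{R}^d} \varphi\, d\mu_k = \left(1 - \frac{1}{k}\right) \varphi(0) + \frac{1}{k} \varphi(x_k) \to \varphi(0) \quad \text{as } k \to \infty,
\]
so $\mu_k \rightharpoonup \mu := \delta_0$.
The difference with respect to the case when $\mu_k \overset{*}{\rightharpoonup} \mu$ is that in weak-$*$ convergence, some mass of $\mu_k$ may escape to $\infty$. To avoid this, one needs to guarantee that almost all the mass of $\mu_k$ remains in a fixed compact set. This motivates the following:
Definition 2.1.8. Let $\mathcal{A} \subset \mathcal{P}(X)$ be a family of probability measures. We say that $\mathcal{A}$ is tight if for any $\varepsilon > 0$ there exists a compact set $K_\varepsilon \subset X$ such that $\mu(X \setminus K_\varepsilon) \le \varepsilon$ for any $\mu \in \mathcal{A}$.
We are going to see that the tightness of a family is equivalent to its compactness with respect to the narrow topology. But before proving such a result, let us present the following fundamental lemma regarding the exhaustion of a measure by compact sets and compactly supported functions.
Lemma 2.1.9. Given a probability measure $\mu \in \mathcal{P}(X)$, the following statements hold:
(a) For any $\varepsilon > 0$, there is a compact set $K_\varepsilon \subset X$ such that $\mu(K_\varepsilon) \ge 1 - \varepsilon$.
(b) For any $\varepsilon > 0$, there is $\chi_\varepsilon \in C_c(X)$ with $0 \le \chi_\varepsilon \le 1$ such that $\int \chi_\varepsilon\, d\mu \ge 1 - \varepsilon$.
Proof. (a) Since $X$ is separable, there is a countable sequence of points $(x_n)_{n \in \mathbb{N}}$ that is dense in $X$. Hence, for any $r > 0$, we have
\[
\bigcup_{n \in \mathbb{N}} \overline{B}(x_n, r) = X.
\]
Therefore, given $\varepsilon > 0$, for any $k \in \mathbb{N}$ there exists $n_{k,\varepsilon} \in \mathbb{N}$ such that
\[
\mu\Big( \bigcup_{1 \le n \le n_{k,\varepsilon}} \overline{B}(x_n, k^{-1}) \Big) \ge 1 - \frac{\varepsilon}{2^k}. \tag{2.1}
\]
Let us consider the subset $K_\varepsilon \subset X$ defined as
\[
K_\varepsilon := \bigcap_{k \in \mathbb{N}} \bigcup_{1 \le n \le n_{k,\varepsilon}} \overline{B}(x_n, k^{-1}). \tag{2.2}
\]
Being the intersection of finite unions of closed balls, $K_\varepsilon$ is closed. Also, by construction, the set $K_\varepsilon$ is totally bounded. Hence, since $X$ is complete, we deduce that $K_\varepsilon$ is compact [74, Thm. 39.9]. Finally, (2.1) implies that
\[
\mu(X \setminus K_\varepsilon) \le \sum_{k \in \mathbb{N}} \mu\Big( X \setminus \bigcup_{1 \le n \le n_{k,\varepsilon}} \overline{B}(x_n, k^{-1}) \Big) \le \sum_{k \in \mathbb{N}} \frac{\varepsilon}{2^k} = \varepsilon,
\]
thus $\mu(K_\varepsilon) \ge 1 - \varepsilon$, as desired.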
(b) Let $K_\varepsilon$ be the compact set provided by the previous step. Since $X$ is locally compact, there exists a compact set $H_\varepsilon$ such that $K_\varepsilon \subset \mathring{H}_\varepsilon$. Thus, the Tietze extension theorem [74, 15.8] guarantees the existence of a function $\chi_\varepsilon \in C(X)$ such that $\chi_\varepsilon \equiv 1$ in $K_\varepsilon$, $\chi_\varepsilon \equiv 0$ in $X \setminus H_\varepsilon$, and $0 \le \chi_\varepsilon \le 1$. This function satisfies all the requirements in the statement: it is supported in the compact set $H_\varepsilon$, and $\int \chi_\varepsilon\, d\mu \ge \mu(K_\varepsilon) \ge 1 - \varepsilon$.
Remark 2.1.10. Notice that Lemma 2.1.9(a) implies that the singleton $\{\mu\}$ constitutes a tight family.
We are now ready to prove that tightness is a necessary and sufficient condition for compactness with respect to narrow convergence.
Theorem 2.1.11 (Prokhorov). A family $\mathcal{A} \subset \mathcal{P}(X)$ is tight if and only if $\mathcal{A}$ is relatively compact for narrow convergence, i.e., for any sequence $(\mu_k)_{k \in \mathbb{N}} \subset \mathcal{A}$ there exist a subsequence $(\mu_{k_j})_{j \in \mathbb{N}}$ and a probability measure $\mu \in \mathcal{P}(X)$ such that $\mu_{k_j} \rightharpoonup \mu$.
Proof. We prove only the implication “tightness implies compactness”; for the other implication, we refer interested readers to the proof of [17, Thm. 8.6.2]. Since the family is tight, there is a sequence of compact sets $(K_n)_{n \in \mathbb{N}}$ such that
\[
\mu(X \setminus K_n) \le n^{-1} \quad \forall\, \mu \in \mathcal{A}. \tag{2.3}
\]
Since the space $X$ is locally compact, up to enlarging inductively each compact set, we may assume $K_n \subset \mathring{K}_{n+1}$ for any $n \in \mathbb{N}$. Given a sequence $(\mu_k)_{k \in \mathbb{N}} \subset \mathcal{A}$, by the Banach–Alaoglu theorem the restricted measures $\mu_k|_{K_n}$⁴ converge weakly-$*$, up to subsequence, to a measure $\mu^{(n)} \in \mathcal{M}(X)$. Therefore, by a diagonal argument, there exists a subsequence $\{\mu_{k_j}\}_{j \in \mathbb{N}}$ such that
\[
\mu_{k_j}|_{K_n} \overset{*}{\rightharpoonup} \mu^{(n)} \in \mathcal{M}(X) \quad \forall\, n \in \mathbb{N}. \tag{2.4}
\]
Note that $\mu^{(n)}$ vanishes outside $K_n$ and $\mu^{(n)}(X \setminus K_m) \le m^{-1}$ for any $m \in \mathbb{N}$ (recall (2.3)).
⁴ That is, $\mu_k|_{K_n}(E) := \mu_k(E \cap K_n)$ for any Borel set $E \subset X$.
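Testing (2.4) against functions compactly supported in $\mathring{K}_n$, we deduce that $\mu^{(n+1)}|_{\mathring{K}_n} = \mu^{(n)}|_{\mathring{K}_n}$. By construction we have $\mu^{(n+1)} \ge \mu^{(n)}$ and thus
\[
\hat{\mu} := \sup_{n \in \mathbb{N}} \mu^{(n)}
\]
is a well-defined measure satisfying $\hat{\mu}(X \setminus K_n) \le n^{-1}$ and $\hat{\mu}|_{\mathring{K}_n} = \mu^{(n)}|_{\mathring{K}_n}$ for every $n \in \mathbb{N}$. Therefore, recalling (2.4), we have
\[
\mu_{k_j}|_{\mathring{K}_n} \overset{*}{\rightharpoonup} \hat{\mu}|_{\mathring{K}_n} \quad \forall\, n \in \mathbb{N}.
\]
Since $\mathring{K}_n$ is a subset of a compact set (i.e., $K_n$), the latter weak-$*$ convergence is equivalent to the narrow convergence
\[
\mu_{k_j}|_{\mathring{K}_n} \rightharpoonup \hat{\mu}|_{\mathring{K}_n} \quad \forall\, n \in \mathbb{N}. \tag{2.5}
\]
Thus, recalling that $\mu_{k_j}(X \setminus K_n) \le n^{-1}$ (by (2.3)), $\hat{\mu}(X \setminus K_n) \le n^{-1}$, and $K_{n-1} \subset \mathring{K}_n$, it follows from (2.5) that, for any $\varphi \in C_b(X)$,
\[
\begin{aligned}
\limsup_{j \to \infty} \left| \int_X \varphi\, d\mu_{k_j} - \int_X \varphi\, d\hat{\mu} \right|
&\le \limsup_{n \to \infty} \limsup_{j \to \infty} \left( \left| \int_{X \setminus \mathring{K}_n} \varphi\, d\mu_{k_j} \right| + \left| \int_{X \setminus \mathring{K}_n} \varphi\, d\hat{\mu} \right| + \left| \int_X \varphi\, d\mu_{k_j}|_{\mathring{K}_n} - \int_X \varphi\, d\hat{\mu}|_{\mathring{K}_n} \right| \right) \\
&\le \limsup_{n \to \infty} \left( \|\varphi\|_\infty (n-1)^{-1} + \|\varphi\|_\infty (n-1)^{-1} \right) = 0.
\end{aligned}
\]
Since $\varphi$ is arbitrary, we have shown that $\mu_{k_j} \rightharpoonup \hat{\mu}$ narrowly and in particular $\hat{\mu} \in \mathcal{P}(X)$.
The next result shows the lower semicontinuity of the map $\mu \mapsto \int \varphi\, d\mu$ under weak-$*$ convergence, whenever the integrand $\varphi$ is nonnegative and lower semicontinuous. Since the narrow topology is stronger than the weak-$*$ topology (i.e., narrow convergence implies weak-$*$ convergence), this result also implies the lower semicontinuity of the map $\mu \mapsto \int \varphi\, d\mu$ under narrow convergence.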
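Lemma 2.1.12 (Lower semicontinuity of integrals). Let $\mu_k \overset{*}{\rightharpoonup} \mu$, and let $\varphi : X \to [0, +\infty]$ be a lower-semicontinuous function.⁵ Then
\[
\liminf_{k \to \infty} \int_X \varphi\, d\mu_k \ge \int_X \varphi\, d\mu.
\]
⁵ Recall that a function $\varphi$ is lower semicontinuous if $\liminf_{k \to \infty} \varphi(x_k) \ge \varphi(x)$ as $x_k \to x$.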
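Proof. If $\varphi \equiv +\infty$ then the statement is trivial, hence we assume that this is not the case. Given $\lambda \ge 0$, define
\[
\varphi_\lambda(x) := \inf_{y \in X} \big\{ \varphi(y) + \lambda\, d(x, y) \big\},
\]
where $d : X \times X \to \mathbb{R}$ denotes the distance function on $X$. We claim that the functions $\varphi_\lambda$ satisfy the following properties:
(a) If $\lambda < \lambda'$ then $\varphi_\lambda \le \varphi_{\lambda'} \le \varphi$.
(b) $\varphi_\lambda$ is $\lambda$-Lipschitz.
(c) For each $x \in X$ it holds that $\varphi_\lambda(x) \nearrow \varphi(x)$ as $\lambda \to \infty$.
Let us prove the mentioned properties.
(a) For any $y \in X$ we have $\varphi_\lambda(x) \le \varphi(y) + \lambda\, d(x, y) \le \varphi(y) + \lambda'\, d(x, y)$. Taking the infimum over $y \in X$ shows that $\varphi_\lambda \le \varphi_{\lambda'}$. Also, taking $y = x$ in the definition of $\varphi_{\lambda'}$ proves that $\varphi_{\lambda'} \le \varphi$.
(b) Let $x, x' \in X$. Then, by the triangle inequality,
\[
\varphi_\lambda(x') \le \varphi(y) + \lambda\, d(x', y) \le \varphi(y) + \lambda\, d(x, y) + \lambda\, d(x, x') \quad \forall\, y \in X.
\]
Taking the infimum over $y$ yields $\varphi_\lambda(x') \le \varphi_\lambda(x) + \lambda\, d(x, x')$. Since the argument is symmetric in $x$ and $x'$, this proves that
\[
|\varphi_\lambda(x) - \varphi_\lambda(x')| \le \lambda\, d(x, x').
\]
(c) Fix $x \in X$. Since $\varphi$ is lower semicontinuous, for all $\varepsilon > 0$ there exists a $\delta > 0$ such that $\varphi(y) \ge \min\big\{ \varphi(x) - \varepsilon, \frac{1}{\varepsilon} \big\}$ for all $y \in X$ with $d(x, y) \le \delta$.⁶ Thus, recalling that $\varphi \ge 0$, we have
\[
\varphi(y) + \lambda\, d(x, y) \ge
\begin{cases}
\min\big\{ \varphi(x) - \varepsilon, \frac{1}{\varepsilon} \big\} & \text{for } d(x, y) \le \delta, \\
\lambda \delta & \text{for } d(x, y) \ge \delta.
\end{cases}
\]
Taking the infimum over $y$ gives $\varphi_\lambda(x) \ge \min\big\{ \varphi(x) - \varepsilon, \frac{1}{\varepsilon}, \lambda \delta \big\}$, so $\liminf_{\lambda \to \infty} \varphi_\lambda(x) \ge \min\big\{ \varphi(x) - \varepsilon, \frac{1}{\varepsilon} \big\}$. Since $\varepsilon > 0$ is arbitrary and $\varphi_\lambda \le \varphi$ is nondecreasing in $\lambda$ by (a), property (c) follows.
⁶ The minimum with $\frac{1}{\varepsilon}$ only matters when $\varphi(x) = +\infty$: in that case, lower semicontinuity means that for all $\varepsilon > 0$ there exists $\delta > 0$ such that $\varphi(y) \ge \frac{1}{\varepsilon}$ for $d(x, y) \le \delta$.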
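Note that since $\varphi \ge 0$, we have $\varphi_\lambda \ge 0$. Consider the family of compactly supported functions $(\chi_\varepsilon)_{\varepsilon > 0}$ constructed in Lemma 2.1.9. We may assume that $\chi_{\frac{1}{i}} \le \chi_{\frac{1}{i+1}}$ for any $i \in \mathbb{N}$, and therefore $\chi_{\frac{1}{i}} \nearrow 1$ $\mu$-a.e. as $i \to \infty$. Then we define
\[
\psi_i(x) := \varphi_i(x)\, \chi_{\frac{1}{i}}(x).
\]
Note that $\psi_i$ is continuous and compactly supported. Thus, given that $\psi_i \le \varphi$ (by property (a) above, since $\psi_i \le \varphi_i$), by the weak-$*$ convergence of $\mu_k$ to $\mu$ we get
\[
\int_X \psi_i\, d\mu = \lim_{k \to \infty} \int_X \psi_i\, d\mu_k \le \liminf_{k \to \infty} \int_X \varphi\, d\mu_k.
\]
Since $\psi_i \nearrow \varphi$ $\mu$-a.e., we conclude by monotone convergence:
\[
\int_X \varphi\, d\mu = \lim_{i \to \infty} \int_X \psi_i\, d\mu \le \liminf_{k \to \infty} \int_X \varphi\, d\mu_k.
\]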
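The regularization $\varphi_\lambda(x) = \inf_y \{\varphi(y) + \lambda\, d(x, y)\}$ used in the proof of Lemma 2.1.12 can be explored on a finite metric space. The sketch below assumes, for illustration only, that $X$ is a uniform grid in $[-2, 2]$ with the usual distance and that $\varphi$ is the lower-semicontinuous step function equal to $0$ on $(-\infty, 0]$ and $1$ elsewhere; `inf_convolution`, `points`, and `phi_values` are hypothetical names.

```python
def inf_convolution(phi_values, points, lam):
    """Values of phi_lam(x) = min_y [phi(y) + lam * |x - y|] on the grid `points`."""
    return [
        min(p + lam * abs(x - y) for p, y in zip(phi_values, points))
        for x in points
    ]


# Hypothetical finite metric space: a uniform grid in [-2, 2] with step 0.02.
points = [i / 50 - 2 for i in range(201)]
# A lower-semicontinuous step function: 0 on (-inf, 0], 1 on (0, +inf).
phi_values = [0.0 if x <= 0 else 1.0 for x in points]

if __name__ == "__main__":
    for lam in (1.0, 10.0, 100.0):
        approx = inf_convolution(phi_values, points, lam)
        gap = max(p - a for p, a in zip(phi_values, approx))
        print(lam, gap)  # the gap phi - phi_lam shrinks as lam grows
```

On this grid one can observe properties (a), (b), and (c) directly: the values increase with $\lambda$, never exceed $\varphi$, and have grid Lipschitz constant at most $\lambda$.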
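In the next lemma we show that if a sequence of probability measures converges weakly-$*$ to a probability measure (so, we are assuming that the limit still has mass 1), then in fact the convergence is narrow.
Lemma 2.1.13 (Weak-$*$ convergence + mass conservation = narrow convergence). Let $(\mu_k)_{k \in \mathbb{N}} \subset \mathcal{P}(X)$, and assume that $\mu_k \overset{*}{\rightharpoonup} \mu$ for some $\mu \in \mathcal{P}(X)$. Then $\mu_k \rightharpoonup \mu$.
Proof. Fix $\varphi \in C_b(X)$. Let $L > 0$ be a real number such that $-L \le \varphi \le L$ at all points. Since $L - \varphi \ge 0$ and $L + \varphi \ge 0$, thanks to Lemma 2.1.12 (applied to $L - \varphi$ and to $L + \varphi$) we have
\[
\limsup_{k \to \infty} \int_X (\varphi - L)\, d\mu_k \le \int_X (\varphi - L)\, d\mu, \qquad
\liminf_{k \to \infty} \int_X (\varphi + L)\, d\mu_k \ge \int_X (\varphi + L)\, d\mu.
\]
Hence, exploiting the conservation of mass (i.e., $\mu_k(X) = \mu(X) = 1$), we get
\[
\begin{aligned}
\int_X \varphi\, d\mu &= L + \int_X (\varphi - L)\, d\mu \ge L + \limsup_{k \to \infty} \int_X (\varphi - L)\, d\mu_k \\
&= \limsup_{k \to \infty} \int_X \varphi\, d\mu_k \ge \liminf_{k \to \infty} \int_X \varphi\, d\mu_k \\
&= -L + \liminf_{k \to \infty} \int_X (\varphi + L)\, d\mu_k \ge -L + \int_X (\varphi + L)\, d\mu = \int_X \varphi\, d\mu,
\end{aligned}
\]
which implies $\int_X \varphi\, d\mu_k \to \int_X \varphi\, d\mu$. Since $\varphi$ can be chosen arbitrarily, the desired narrow convergence follows.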
2.2 Monge vs. Kantorovich
Fix $\mu \in \mathcal{P}(X)$, $\nu \in \mathcal{P}(Y)$, and $c : X \times Y \to [0, +\infty]$ lower semicontinuous. The Monge and Kantorovich problems can be stated as follows (recall Definition 1.4.3):
\[
C_M(\mu, \nu) := \inf \left\{ \int_X c(x, T(x))\, d\mu(x) \ \middle|\ T_\# \mu = \nu \right\}, \tag{M}
\]
\[
C_K(\mu, \nu) := \inf \left\{ \int_{X \times Y} c(x, y)\, d\gamma(x, y) \ \middle|\ \gamma \in \Gamma(\mu, \nu) \right\}. \tag{K}
\]
In other words, Monge’s problem (M) consists in minimizing the transportation cost among all transport maps, while Kantorovich’s problem (K) consists in minimizing the transportation cost among all couplings.
Remark 2.2.1. Recall that if $T_\# \mu = \nu$, then $\gamma_T := (\mathrm{Id} \times T)_\# \mu \in \Gamma(\mu, \nu)$. Also
\[
\int_X c(x, T(x))\, d\mu(x) = \int_X c \circ (\mathrm{Id} \times T)(x)\, d\mu(x) = \int_{X \times Y} c(x, y)\, d\gamma_T(x, y).
\]
In other words, any transport map $T$ induces a coupling $\gamma_T$ with the same cost. Thanks to this fact, we deduce that
\[
C_M(\mu, \nu) \ge C_K(\mu, \nu).
\]
Remark 2.2.2. Let $\gamma \in \Gamma(\mu, \nu)$ and assume that $\gamma = (\mathrm{Id} \times S)_\# \mu$ for some map $S : X \to Y$. Then
\[
\nu = (\pi_Y)_\# \gamma = (\pi_Y)_\# (\mathrm{Id} \times S)_\# \mu = (\pi_Y \circ (\mathrm{Id} \times S))_\# \mu = S_\# \mu,
\]
thus $S$ is a transport map from $\mu$ to $\nu$. In other words, if we have a coupling with the structure of a graph, this yields a transport map.
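For finitely supported measures, the observation of Remark 2.2.1 is easy to verify directly. In the sketch below a discrete measure is represented as a dictionary mapping points to masses; this representation, the helper names, and the sample data `mu` and `T` are assumptions made for illustration.

```python
from collections import defaultdict


def push_forward(mu, T):
    """Push-forward T_# mu of a finitely supported measure mu = {point: mass}."""
    nu = defaultdict(float)
    for x, mass in mu.items():
        nu[T(x)] += mass
    return dict(nu)


def coupling_from_map(mu, T):
    """The coupling gamma_T = (Id x T)_# mu, stored as {(x, y): mass}."""
    gamma = defaultdict(float)
    for x, mass in mu.items():
        gamma[(x, T(x))] += mass
    return dict(gamma)


def marginals(gamma):
    """First and second marginals of a discrete coupling."""
    first, second = defaultdict(float), defaultdict(float)
    for (x, y), mass in gamma.items():
        first[x] += mass
        second[y] += mass
    return dict(first), dict(second)


def monge_cost(mu, T, cost):
    return sum(mass * cost(x, T(x)) for x, mass in mu.items())


def kantorovich_cost(gamma, cost):
    return sum(mass * cost(x, y) for (x, y), mass in gamma.items())


if __name__ == "__main__":
    mu = {0.0: 0.25, 1.0: 0.25, 2.0: 0.5}  # hypothetical source measure
    T = lambda x: abs(x - 1.0)             # hypothetical (non-injective) map
    cost = lambda x, y: (x - y) ** 2
    gamma_T = coupling_from_map(mu, T)
    print(marginals(gamma_T), push_forward(mu, T))
    print(monge_cost(mu, T, cost), kantorovich_cost(gamma_T, cost))
```

The two marginals of `gamma_T` are `mu` and `push_forward(mu, T)`, and the two costs coincide, which is the discrete version of the identity used to deduce $C_M \ge C_K$.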
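2.3 Existence of an optimal coupling
Lemma 2.3.1. The set $\Gamma(\mu, \nu) \subset \mathcal{P}(X \times Y)$ is tight and closed under narrow convergence.
Proof. We split the proof into two steps: we first prove tightness and then closedness.
• $\Gamma(\mu, \nu)$ is tight. Thanks to Lemma 2.1.9, for all $\varepsilon > 0$, there exists a compact set $K_\varepsilon \subset X$ such that $\mu(X \setminus K_\varepsilon) \le \frac{\varepsilon}{2}$. Analogously for $\nu$, there exists a compact set $\tilde{K}_\varepsilon \subset Y$ such that $\nu(Y \setminus \tilde{K}_\varepsilon) \le \frac{\varepsilon}{2}$. Define the compact set $\bar{K}_\varepsilon := K_\varepsilon \times \tilde{K}_\varepsilon \subset X \times Y$. Then, for any $\gamma \in \Gamma(\mu, \nu)$, we have
\[
\begin{aligned}
\gamma\big( (X \times Y) \setminus \bar{K}_\varepsilon \big)
&= \gamma\big( ((X \setminus K_\varepsilon) \times Y) \cup (X \times (Y \setminus \tilde{K}_\varepsilon)) \big) \\
&\le \gamma\big( (X \setminus K_\varepsilon) \times Y \big) + \gamma\big( X \times (Y \setminus \tilde{K}_\varepsilon) \big) \\
&= \int_{X \times Y} 1_{X \setminus K_\varepsilon}(x)\, d\gamma(x, y) + \int_{X \times Y} 1_{Y \setminus \tilde{K}_\varepsilon}(y)\, d\gamma(x, y) \\
&= \int_X 1_{X \setminus K_\varepsilon}(x)\, d\mu(x) + \int_Y 1_{Y \setminus \tilde{K}_\varepsilon}(y)\, d\nu(y) \\
&= \mu(X \setminus K_\varepsilon) + \nu(Y \setminus \tilde{K}_\varepsilon) \le \frac{\varepsilon}{2} + \frac{\varepsilon}{2} = \varepsilon.
\end{aligned}
\]
Thus $\Gamma(\mu, \nu)$ is tight.
• $\Gamma(\mu, \nu)$ is closed under narrow convergence. Take a sequence $(\gamma_k)_{k \in \mathbb{N}} \subset \Gamma(\mu, \nu)$, and assume that $\gamma_k \rightharpoonup \gamma \in \mathcal{P}(X \times Y)$. Then, for any $\varphi \in C_b(X)$ we have
\[
\int_X \varphi(x)\, d\mu(x) = \int_{X \times Y} \varphi(x)\, d\gamma_k(x, y) \to \int_{X \times Y} \varphi(x)\, d\gamma(x, y),
\]
hence $(\pi_X)_\# \gamma = \mu$. Analogously $(\pi_Y)_\# \gamma = \nu$.
Theorem 2.3.2. Let $c : X \times Y \to [0, +\infty]$ be lower semicontinuous, $\mu \in \mathcal{P}(X)$, and $\nu \in \mathcal{P}(Y)$. Then there exists a coupling $\bar{\gamma} \in \Gamma(\mu, \nu)$ that is a minimizer for (K).
Proof. Without loss of generality we may assume $\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c\, d\gamma < +\infty$ (if $\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c\, d\gamma = +\infty$, then the statement is trivial since every $\gamma \in \Gamma(\mu, \nu)$ is a minimizer). Let us define $\alpha := \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c\, d\gamma$. Let $(\gamma_k)_{k \in \mathbb{N}} \subset \Gamma(\mu, \nu)$ be a minimizing sequence, namely
\[
\int_{X \times Y} c\, d\gamma_k \to \alpha \quad \text{as } k \to \infty.
\]
Since $\{\gamma_k\} \subset \Gamma(\mu, \nu)$ is tight, by Theorem 2.1.11 there exists a subsequence $(\gamma_{k_j})_{j \in \mathbb{N}}$ such that $\gamma_{k_j} \rightharpoonup \bar{\gamma}$. Since $c$ is nonnegative and lower semicontinuous, it follows from Lemma 2.1.12 that
\[
\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c\, d\gamma = \alpha = \liminf_{j \to \infty} \int_{X \times Y} c\, d\gamma_{k_j} \ge \int_{X \times Y} c\, d\bar{\gamma}.
\]
Since $\bar{\gamma} \in \Gamma(\mu, \nu)$ (thanks to Lemma 2.3.1), we clearly have $\int_{X \times Y} c\, d\bar{\gamma} \ge \alpha$. This proves that $\int_{X \times Y} c\, d\bar{\gamma} = \alpha$, thus $\bar{\gamma}$ is a minimizer.
In other words, under very general assumptions on the cost function, an optimal coupling always exists.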
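Remark 2.3.3. Note that, up to replacing $c(x, y)$ with $c(x, y) + C$ for some constant $C \in \mathbb{R}$, all results proven in this book still hold for costs $c$ bounded from below.
At this point, it is very natural to consider the following two questions:
• Is the minimizer unique?
• Is it given by a transport map?
In order to get some intuition on these two important questions, let us consider two examples.
Example 2.3.4. Let $\mu = \delta_{x_0}$ and $\nu = \frac{1}{2} \delta_{y_0} + \frac{1}{2} \delta_{y_1}$. Then there exists a unique element in $\Gamma(\mu, \nu)$, given by the coupling $\gamma := \frac{1}{2} \delta_{(x_0, y_0)} + \frac{1}{2} \delta_{(x_0, y_1)}$. So the minimizer is unique (for every cost), but it is not induced by a transport map.
Example 2.3.5. Let $X = Y = \mathbb{R}^2$, let $c(x, y) = |x - y|^2$, consider the points in $\mathbb{R}^2$ given by
\[
x_1 := (0, 0), \quad x_2 := (1, 1), \quad y_1 := (1, 0), \quad y_2 := (0, 1),
\]
and define the measures
\[
\mu = \tfrac{1}{2} \delta_{x_1} + \tfrac{1}{2} \delta_{x_2}, \qquad \nu = \tfrac{1}{2} \delta_{y_1} + \tfrac{1}{2} \delta_{y_2}.
\]
In this case the set of all couplings from $\mu$ to $\nu$ is obtained by sending an amount $\alpha \in \left[0, \frac{1}{2}\right]$ from $x_1$ to $y_1$, the remaining amount $\frac{1}{2} - \alpha$ from $x_1$ to $y_2$, then an amount $\alpha$ from $x_2$ to $y_2$, and finally the remaining amount $\frac{1}{2} - \alpha$ from $x_2$ to $y_1$. In other words, the set $\Gamma(\mu, \nu)$ is given by
\[
\gamma_\alpha = \alpha\, \delta_{(x_1, y_1)} + \left( \tfrac{1}{2} - \alpha \right) \delta_{(x_1, y_2)} + \left( \tfrac{1}{2} - \alpha \right) \delta_{(x_2, y_1)} + \alpha\, \delta_{(x_2, y_2)}, \qquad \alpha \in \left[ 0, \tfrac{1}{2} \right].
\]
Note that, for all $\alpha \in \left[0, \frac{1}{2}\right]$,
\[
\int_{X \times Y} c\, d\gamma_\alpha = \alpha |x_1 - y_1|^2 + \left( \tfrac{1}{2} - \alpha \right) |x_1 - y_2|^2 + \left( \tfrac{1}{2} - \alpha \right) |x_2 - y_1|^2 + \alpha |x_2 - y_2|^2 = 1.
\]
Hence all couplings $\gamma_\alpha$ are optimal, ruling out the uniqueness of the optimal plan without further assumptions.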
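The computation in Example 2.3.5 can be reproduced in a few lines. The sketch below builds the coupling $\gamma_\alpha$ for several values of $\alpha$ and evaluates its cost; the function names `squared_distance`, `coupling`, and `total_cost` are illustrative.

```python
def squared_distance(p, q):
    """Quadratic cost c(x, y) = |x - y|^2 in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


# The four points of Example 2.3.5.
x1, x2 = (0, 0), (1, 1)
y1, y2 = (1, 0), (0, 1)


def coupling(alpha):
    """The coupling gamma_alpha, stored as {(x, y): mass}, for alpha in [0, 1/2]."""
    return {
        (x1, y1): alpha,
        (x1, y2): 0.5 - alpha,
        (x2, y1): 0.5 - alpha,
        (x2, y2): alpha,
    }


def total_cost(gamma):
    """Integral of the cost against a discrete coupling."""
    return sum(mass * squared_distance(x, y) for (x, y), mass in gamma.items())


costs = [total_cost(coupling(alpha)) for alpha in (0.0, 0.125, 0.25, 0.5)]

if __name__ == "__main__":
    print(costs)  # every coupling gamma_alpha has the same cost
```

Since each of the four pairs of points is at squared distance $1$, the cost equals the total mass of $\gamma_\alpha$, independently of $\alpha$.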
2.4 c-cyclical monotonicity
Let us recall the definition of the support of a measure.
Definition 2.4.1. Given a measure $\mu \in \mathcal{M}(X)$, its support is defined as
\[
\operatorname{supp}(\mu) := \{ x \in X \mid \forall\, \varepsilon > 0 : \mu(B_\varepsilon(x)) > 0 \}.
\]
We want to investigate the properties of the support of an optimal coupling. Given an optimal coupling $\bar{\gamma} \in \Gamma(\mu, \nu)$ with finite cost (i.e., $\int_{X \times Y} c\, d\bar{\gamma} = \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c\, d\gamma < +\infty$), its support is
\[
\operatorname{supp}(\bar{\gamma}) = \{ (x, y) \in X \times Y \mid \forall\, \varepsilon > 0 : \bar{\gamma}(B_\varepsilon(x) \times B_\varepsilon(y)) > 0 \}.
\]
So, essentially, a pair of points $(x, y)$ belongs to the support if some mass goes from $x$ to $y$. To understand how to exploit the optimality of $\bar{\gamma}$, suppose for instance that $(x_i, y_i)_{i = 1, 2, 3} \in \operatorname{supp}(\bar{\gamma})$. This means that $\bar{\gamma}$ sends mass from $x_i$ to $y_i$. Now, consider another transport plan that takes the mass from $x_2$ to $y_1$, from $x_3$ to $y_2$, and from $x_1$ to $y_3$. Since $\bar{\gamma}$ is optimal, this “re-shuffling” cannot decrease the cost, i.e.,
\[
\sum_{i=1}^3 c(x_{i+1}, y_i) \ge \sum_{i=1}^3 c(x_i, y_i),
\]
where we set $x_4 \equiv x_1$.⁷ Since this property needs to hold for any collection of points in the support of $\bar{\gamma}$, this motivates the following definition:
Definition 2.4.2. A set $\Lambda \subset X \times Y$ is said to be c-cyclically monotone if for any finite sequence $(x_i, y_i)_{i = 1, \dots, N} \subset \Lambda$, the following holds:
\[
\sum_{i=1}^N c(x_i, y_i) \le \sum_{i=1}^N c(x_{i+1}, y_i),
\]
where $x_{N+1} \equiv x_1$.
The above discussion suggests that optimality implies c-cyclical monotonicity. This is indeed the case:
Theorem 2.4.3. Let $\bar{\gamma}$ be optimal and $c : X \times Y \to \mathbb{R}$ continuous. Then $\operatorname{supp}(\bar{\gamma})$ is c-cyclically monotone.
Remark 2.4.4. We will show later that the above statement is an if and only if; see Theorem 2.6.2.
Proof. By contradiction, suppose $\operatorname{supp}(\bar{\gamma})$ is not c-cyclically monotone. Then there exist $\eta > 0$ and $N$ pairs of points $(x_1, y_1), \dots, (x_N, y_N) \in \operatorname{supp}(\bar{\gamma})$ such that
\[
\sum_{i=1}^N c(x_i, y_i) \ge \sum_{i=1}^N c(x_{i+1}, y_i) + \eta. \tag{2.6}
\]
⁷ Of course this argument is not rigorous since the points $(x_i, y_i)$ may have zero mass for $\bar{\gamma}$. As we will see later, however, one can make this argument rigorous by considering some small neighborhoods of $(x_i, y_i)$.
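Since $c$ is continuous, there exist open neighborhoods $x_i \in U_i \subset X$ and $y_i \in V_i \subset Y$ such that
\[
|c(x, y) - c(x_i, y_i)| \le \frac{\eta}{4N} \quad \forall\, (x, y) \in U_i \times V_i \tag{2.7}
\]
and
\[
|c(x, y) - c(x_{i+1}, y_i)| \le \frac{\eta}{4N} \quad \forall\, (x, y) \in U_{i+1} \times V_i. \tag{2.8}
\]
Set $\varepsilon_i := \bar{\gamma}(U_i \times V_i)$. Note that all $\varepsilon_i$ are positive, since the points $(x_i, y_i)$ belong to the support of $\bar{\gamma}$. Now set $\varepsilon := \min_{i = 1, \dots, N} \varepsilon_i$ and⁸ $\gamma_i := \frac{\bar{\gamma}|_{U_i \times V_i}}{\varepsilon_i} \in \mathcal{P}(X \times Y)$. Then we define the measures $\mu_i := (\pi_X)_\# \gamma_i \in \mathcal{P}(X)$ and $\nu_i := (\pi_Y)_\# \gamma_i \in \mathcal{P}(Y)$, and we set
\[
\gamma' := \bar{\gamma} - \frac{\varepsilon}{N} \sum_{i=1}^N \gamma_i + \frac{\varepsilon}{N} \sum_{i=1}^N \mu_{i+1} \otimes \nu_i.
\]
⁸ Here we are using the notation $\bar{\gamma}|_A$ to denote the restriction of the measure $\bar{\gamma}$ to the set $A$: namely, for any Borel set $E \subset X \times Y$, $\bar{\gamma}|_A(E) := \bar{\gamma}(A \cap E)$.
Let us show that $\gamma' \ge 0$: Since $\varepsilon \le \varepsilon_i$ we have
\[
\bar{\gamma} - \frac{\varepsilon}{N} \sum_{i=1}^N \gamma_i = \frac{1}{N} \sum_{i=1}^N \left( \bar{\gamma} - \frac{\varepsilon}{\varepsilon_i}\, \bar{\gamma}|_{U_i \times V_i} \right) \ge \frac{1}{N} \sum_{i=1}^N \left( \bar{\gamma} - \bar{\gamma}|_{U_i \times V_i} \right) \ge 0.
\]
Let us also check that $\gamma' \in \Gamma(\mu, \nu)$: Since $(\pi_X)_\# \bar{\gamma} = \mu$, $(\pi_X)_\# \gamma_i = \mu_i$, and $(\pi_X)_\# (\mu_{i+1} \otimes \nu_i) = \mu_{i+1}$, we have
\[
(\pi_X)_\# \gamma' = \mu - \frac{\varepsilon}{N} \sum_{i=1}^N \mu_i + \frac{\varepsilon}{N} \sum_{i=1}^N \mu_{i+1} = \mu.
\]
Analogously $(\pi_Y)_\# \gamma' = \nu$.
It remains only to prove $\int_{X \times Y} c\, d\gamma' < \int_{X \times Y} c\, d\bar{\gamma}$ because this yields the sought contradiction, since $\bar{\gamma}$ was assumed to be optimal. Note that, since $\mu_i \in \mathcal{P}(X)$ is supported inside $U_i$ and $\nu_i \in \mathcal{P}(Y)$ is supported inside $V_i$, it follows from (2.8) that
\[
\int_{X \times Y} c(x, y)\, d(\mu_{i+1} \otimes \nu_i) = \int_{U_{i+1} \times V_i} c(x, y)\, d(\mu_{i+1} \otimes \nu_i) \le \int_{U_{i+1} \times V_i} \left( c(x_{i+1}, y_i) + \frac{\eta}{4N} \right) d(\mu_{i+1} \otimes \nu_i) = c(x_{i+1}, y_i) + \frac{\eta}{4N}.
\]
Analogously, since $\gamma_i \in \mathcal{P}(X \times Y)$ is supported inside $U_i \times V_i$, it follows from (2.7) that
\[
\int_{X \times Y} c\, d\gamma_i = \int_{U_i \times V_i} c\, d\gamma_i \ge \int_{U_i \times V_i} \left( c(x_i, y_i) - \frac{\eta}{4N} \right) d\gamma_i = c(x_i, y_i) - \frac{\eta}{4N}.
\]
Then, recalling (2.6), we get
\[
\begin{aligned}
\int_{X \times Y} c\, d\bar{\gamma} - \int_{X \times Y} c\, d\gamma'
&= \frac{\varepsilon}{N} \sum_{i=1}^N \left( \int_{X \times Y} c\, d\gamma_i - \int_{X \times Y} c\, d(\mu_{i+1} \otimes \nu_i) \right) \\
&\ge \frac{\varepsilon}{N} \sum_{i=1}^N \left( c(x_i, y_i) - \frac{\eta}{4N} - c(x_{i+1}, y_i) - \frac{\eta}{4N} \right) \\
&= \frac{\varepsilon}{N} \sum_{i=1}^N \big( c(x_i, y_i) - c(x_{i+1}, y_i) \big) - \frac{\varepsilon \eta}{2N}
\ge \frac{\varepsilon \eta}{N} - \frac{\varepsilon \eta}{2N} = \frac{\varepsilon \eta}{2N} > 0,
\end{aligned}
\]
a contradiction that concludes the proof.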
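For a finite set $\Lambda$, the condition of Definition 2.4.2 can be tested by brute force. The sketch below compares the cost of the given pairing with the cost of every re-shuffling of the $x$-coordinates; this suffices for a finite set because any permutation splits into disjoint cycles and any cyclic sequence with repeated pairs splits into shorter ones. The function name `is_c_cyclically_monotone` and the tolerance `tol` are illustrative choices, and the approach is only practical for a handful of points.

```python
from itertools import permutations


def is_c_cyclically_monotone(pairs, cost, tol=1e-12):
    """Brute-force check of c-cyclical monotonicity for a finite list of pairs.

    `pairs` is a list of (x_i, y_i); every permutation sigma of the indices is
    tested against sum_i c(x_i, y_i) <= sum_i c(x_sigma(i), y_i).
    """
    n = len(pairs)
    base = sum(cost(x, y) for x, y in pairs)
    for sigma in permutations(range(n)):
        shuffled = sum(cost(pairs[sigma[i]][0], pairs[i][1]) for i in range(n))
        if base > shuffled + tol:
            return False
    return True


if __name__ == "__main__":
    quadratic = lambda x, y: (x - y) ** 2
    increasing = [(0.0, 0.0), (1.0, 2.0), (2.0, 5.0)]  # monotone pairing
    crossing = [(0.0, 5.0), (1.0, 2.0), (2.0, 0.0)]    # order-reversing pairing
    print(is_c_cyclically_monotone(increasing, quadratic))
    print(is_c_cyclically_monotone(crossing, quadratic))
```

With the quadratic cost on the real line, the increasing pairing passes the test while the order-reversing one fails it, in line with the intuition that crossing transport routes can be improved by re-shuffling.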
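2.5 The case $c(x, y) = \frac{|x - y|^2}{2}$ on $X = Y = \mathbb{R}^d$
Let $X = Y = \mathbb{R}^d$ and $c(x, y) = \frac{|x - y|^2}{2}$. Also, assume that $\int_{\mathbb{R}^d} |x|^2\, d\mu + \int_{\mathbb{R}^d} |y|^2\, d\nu < +\infty$. Let $\gamma \in \Gamma(\mu, \nu)$; then
\[
\begin{aligned}
\int_{\mathbb{R}^d \times \mathbb{R}^d} \frac{|x - y|^2}{2}\, d\gamma(x, y)
&= \int_{\mathbb{R}^d \times \mathbb{R}^d} \left( \frac{|x|^2}{2} + \frac{|y|^2}{2} - x \cdot y \right) d\gamma \\
&= \int_{\mathbb{R}^d} \frac{|x|^2}{2}\, d\mu + \int_{\mathbb{R}^d} \frac{|y|^2}{2}\, d\nu + \int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\gamma.
\end{aligned} \tag{2.9}
\]
Since the first two terms in the last expression are independent of $\gamma$, we deduce that $\gamma$ is optimal for the cost $c(x, y) = \frac{|x - y|^2}{2}$ if and only if it is optimal for the cost $c(x, y) = -x \cdot y$. Hence, in the next section we will work with the cost function $c(x, y) = -x \cdot y$, as it simplifies several definitions and computations.
2.5.1 Cyclical monotonicity and Rockafellar’s theorem
In the case $c(x, y) = -x \cdot y$, the condition
\[
\sum_{i=1}^N c(x_i, y_i) \le \sum_{i=1}^N c(x_{i+1}, y_i)
\]
is equivalent to
\[
\sum_{i=1}^N \langle y_i, x_{i+1} - x_i \rangle \le 0,
\]
where $\langle \cdot, \cdot \rangle$ is the canonical scalar product⁹ on $\mathbb{R}^d$, and by convention $x_{N+1} \equiv x_1$. Any subset of $\mathbb{R}^d \times \mathbb{R}^d$ satisfying this last property (for any family of points $(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)$ contained in such a set) is called a cyclically monotone set.
⁹ Throughout this book, we will use the notations $\langle \cdot, \cdot \rangle$ and $\cdot$ interchangeably.
The goal of this section is to characterize cyclical monotonicity in terms of the subdifferential of convex functions. We first recall the definition and its relation with the gradient when the function is differentiable.
Definition 2.5.1. Given $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ convex, we define the subdifferential of $\varphi$ as
\[
\partial \varphi(x) := \big\{ y \in \mathbb{R}^d \mid \forall\, z \in \mathbb{R}^d : \varphi(z) \ge \varphi(x) + \langle y, z - x \rangle \big\}.
\]
We also define $\partial \varphi := \bigcup_{x \in \mathbb{R}^d} \{x\} \times \partial \varphi(x) \subset \mathbb{R}^d \times \mathbb{R}^d$.
Lemma 2.5.2. Let $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ be a convex function that is differentiable at $\bar{x} \in \mathbb{R}^d$. It holds that $\partial \varphi(\bar{x}) = \{\nabla \varphi(\bar{x})\}$.
Proof. Without loss of generality we may assume $\bar{x} = 0$ and $\varphi(\bar{x}) = 0$. First we show $\nabla \varphi(0) \in \partial \varphi(0)$ and then we prove that $\nabla \varphi(0)$ is the only element of $\partial \varphi(0)$.
Fix $x \in \mathbb{R}^d$. By convexity and differentiability at $0$, we know that the map $t \mapsto \frac{\varphi(tx)}{t}$ is nondecreasing on $(0, +\infty)$ and its limit as $t \to 0^+$ is $\langle \nabla \varphi(0), x \rangle$. In particular, we deduce (comparing the limit as $t \to 0^+$ with the value at $t = 1$)
\[
\varphi(x) \ge \langle \nabla \varphi(0), x \rangle.
\]
Since the latter inequality holds for all $x \in \mathbb{R}^d$ and $\varphi(0) = 0$, we have shown $\nabla \varphi(0) \in \partial \varphi(0)$.
Now take any element $y \in \partial \varphi(0)$. Since $\varphi$ is differentiable at $0$, for any $x \in \mathbb{R}^d$, we have
\[
\langle \nabla \varphi(0), x \rangle + o(|x|) = \varphi(x) \ge \langle y, x \rangle
\ \Longrightarrow\ \langle \nabla \varphi(0) - y, x \rangle \ge o(|x|)
\ \Longrightarrow\ y = \nabla \varphi(0).
\]
Theorem 2.5.3 (Rockafellar). A set $S \subset \mathbb{R}^d \times \mathbb{R}^d$ is cyclically monotone if and only if there exists a convex function $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ such that $S \subset \partial \varphi$.
Proof. First we show that if such a convex function exists then the set $S$ has to be cyclically monotone. The converse implication will then be proven by constructing $\varphi$ explicitly.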
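($\Leftarrow$) Assume that $S \subset \partial \varphi$ and take a finite set of points $(x_i, y_i)_{i = 1, \dots, N} \subset S \subset \partial \varphi$. Then for each $i$ we have $y_i \in \partial \varphi(x_i)$, and therefore
\[
\varphi(z) \ge \varphi(x_i) + \langle y_i, z - x_i \rangle \quad \text{for any } z \in \mathbb{R}^d.
\]
In particular, choosing $z = x_{i+1}$ we obtain
\[
\varphi(x_{i+1}) \ge \varphi(x_i) + \langle y_i, x_{i+1} - x_i \rangle,
\]
and summing over $i$ (where we adopt the convention $N + 1 \equiv 1$) yields
\[
\sum_{i=1}^N \varphi(x_{i+1}) \ge \sum_{i=1}^N \varphi(x_i) + \sum_{i=1}^N \langle y_i, x_{i+1} - x_i \rangle.
\]
Since the two sums containing $\varphi$ are equal, this implies that
\[
0 \ge \sum_{i=1}^N \langle y_i, x_{i+1} - x_i \rangle.
\]
($\Rightarrow$) Fix $(x_0, y_0) \in S$. If $\varphi$ is a convex function so that $S \subset \partial \varphi$ and $\varphi(x_0) = 0$, then one can show by induction on $N \ge 1$ that
\[
\varphi(x) \ge \langle y_N, x - x_N \rangle + \langle y_{N-1}, x_N - x_{N-1} \rangle + \cdots + \langle y_0, x_1 - x_0 \rangle
\]
for any sequence $(x_i, y_i)_{i = 1, \dots, N} \subset S$. Therefore, it is natural to consider the minimal function with such a property, that is,
\[
\varphi(x) := \sup_{N \ge 1} \Big\{ \langle y_N, x - x_N \rangle + \langle y_{N-1}, x_N - x_{N-1} \rangle + \cdots + \langle y_0, x_1 - x_0 \rangle \ \Big|\ (x_i, y_i)_{i = 1, \dots, N} \subset S \Big\}.
\]
Note the following:
(a) $\varphi$ is a supremum of affine functions, thus it is convex.
(b) Choosing $N = 1$ and $(x_1, y_1) = (x_0, y_0)$ yields $\varphi(x) \ge \langle y_0, x - x_0 \rangle$, and in particular $\varphi(x_0) \ge 0$.
(c) For any $(x_i, y_i)_{i = 1, \dots, N} \subset S$, because of cyclical monotonicity we have
\[
\langle y_N, x_0 - x_N \rangle + \cdots + \langle y_0, x_1 - x_0 \rangle \le 0.
\]
Hence $\varphi(x_0) \le 0$, which combined with (b) implies that $\varphi(x_0) = 0$. In particular $\varphi \not\equiv +\infty$.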
We now prove that $S \subset \partial \varphi$. Take $(\bar{x}, \bar{y}) \in S$ and let $\alpha < \varphi(\bar{x})$. Then, by the definition of $\varphi$ there exist $N \ge 1$ and a sequence $(x_i, y_i)_{i = 1, \dots, N} \subset S$ such that
\[
\langle y_N, \bar{x} - x_N \rangle + \cdots + \langle y_0, x_1 - x_0 \rangle \ge \alpha. \tag{2.10}
\]
Now consider the sequence $(x_i, y_i)_{i = 1, \dots, N+1}$ obtained by taking $(x_{N+1}, y_{N+1}) = (\bar{x}, \bar{y})$. Since this new sequence is admissible in the definition of $\varphi$, using (2.10) and the identities $x_{N+1} = \bar{x}$, $y_{N+1} = \bar{y}$ we deduce that, for any $z \in \mathbb{R}^d$,
\[
\varphi(z) \ge \langle y_{N+1}, z - x_{N+1} \rangle + \langle y_N, x_{N+1} - x_N \rangle + \cdots + \langle y_0, x_1 - x_0 \rangle \ge \langle \bar{y}, z - \bar{x} \rangle + \alpha.
\]
Letting $\alpha \to \varphi(\bar{x})$, this shows that $\varphi(z) \ge \langle \bar{y}, z - \bar{x} \rangle + \varphi(\bar{x})$ for all $z \in \mathbb{R}^d$, thus $\bar{y} \in \partial \varphi(\bar{x})$ (or equivalently $(\bar{x}, \bar{y}) \in \partial \varphi$), as desired.
2.5.2 Kantorovich duality
With the use of the Legendre transform, we now want to find a dual problem to Kantorovich’s problem. We will do this in a constructive way. However, readers familiar with convex optimization will not be surprised: since Kantorovich’s problem is a linear minimization with convex constraints, it admits a dual problem by “abstract convex analysis” (see Remark 2.6.7).
Definition 2.5.4. Given $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ convex (with $\varphi \not\equiv +\infty$), one defines the Legendre transform of $\varphi$, $\varphi^* : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$, as
\[
\varphi^*(y) := \sup_{x \in \mathbb{R}^d} \{ x \cdot y - \varphi(x) \}.
\]
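To get a feeling for Definition 2.5.4, one can approximate the Legendre transform in dimension one by restricting the supremum to a finite grid. The sketch below does this for $\varphi(x) = \frac{x^2}{2}$, whose Legendre transform is $\varphi^*(y) = \frac{y^2}{2}$; the grid, its range $[-5, 5]$, and the name `legendre_transform` are assumptions made for illustration, and the result is only reliable for $y$ whose maximizer lies inside the grid.

```python
def legendre_transform(phi, grid, y):
    """Grid approximation of phi*(y) = sup_x [x * y - phi(x)] in dimension one."""
    return max(x * y - phi(x) for x in grid)


def phi(x):
    """The convex function phi(x) = x^2 / 2, which is its own Legendre transform."""
    return 0.5 * x * x


# Hypothetical discretization of [-5, 5] with step 0.01.
grid = [i / 100 for i in range(-500, 501)]

if __name__ == "__main__":
    for y in (-2.0, 0.0, 1.0, 3.0):
        print(y, legendre_transform(phi, grid, y), 0.5 * y * y)
```

By construction, the approximate transform satisfies $\varphi(x) + \varphi^*(y) \ge x \cdot y$ for every grid point $x$, with equality when $x = y$ lies on the grid.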
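Proposition 2.5.5. The following properties hold:
(a) $\varphi(x) + \varphi^*(y) \ge x \cdot y$ for all $x, y \in \mathbb{R}^d$;
(b) $\varphi(x) + \varphi^*(y) = x \cdot y$ if and only if $y \in \partial \varphi(x)$.
Proof. As we will see, both properties follow easily from our definitions.
(a) For any $x \in \mathbb{R}^d$, it follows by the definition of $\varphi^*$ that $\varphi^*(y) \ge x \cdot y - \varphi(x)$, or equivalently $\varphi^*(y) + \varphi(x) \ge x \cdot y$.
(b) ($\Rightarrow$) Assume that $\varphi(x) + \varphi^*(y) = x \cdot y$. By (a) we have
\[
\varphi^*(y) \ge z \cdot y - \varphi(z) \quad \forall\, z \in \mathbb{R}^d.
\]
Since $x \cdot y - \varphi(x) = \varphi^*(y)$, this implies that
\[
\varphi(z) \ge \varphi(x) + \langle y, z - x \rangle \quad \forall\, z \in \mathbb{R}^d,
\]
thus $y \in \partial \varphi(x)$.
($\Leftarrow$) If $y \in \partial \varphi(x)$, then for any $z \in \mathbb{R}^d$ we have $\varphi(z) \ge \varphi(x) + \langle y, z - x \rangle$, or equivalently
\[
x \cdot y - \varphi(x) \ge z \cdot y - \varphi(z) \quad \forall\, z \in \mathbb{R}^d.
\]
By taking the supremum over $z \in \mathbb{R}^d$ we get
\[
x \cdot y - \varphi(x) \ge \varphi^*(y),
\]
and by (a) we obtain equality.
In the next theorem, we prove the so-called Kantorovich duality. Note that the existence of an optimal coupling for the cost function $c(x, y) = -x \cdot y$ does not immediately follow from our previous results, since we only proved existence of an optimal coupling for nonnegative cost functions. However, we can use that the cost $c(x, y) = -x \cdot y$ is equivalent to the cost $c'(x, y) := \frac{|x - y|^2}{2}$ provided that $\int |x|^2\, d\mu + \int |y|^2\, d\nu < +\infty$ (see (2.9)). In addition, noticing that
\[
\int_{\mathbb{R}^d \times \mathbb{R}^d} \frac{|x - y|^2}{2}\, d(\mu \otimes \nu)(x, y) \le \int_{\mathbb{R}^d \times \mathbb{R}^d} \big( |x|^2 + |y|^2 \big)\, d(\mu \otimes \nu)(x, y) = \int_{\mathbb{R}^d} |x|^2\, d\mu(x) + \int_{\mathbb{R}^d} |y|^2\, d\nu(y) < +\infty,
\]
it follows that $\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times Y} c'\, d\gamma < +\infty$ (recall that $\mu \otimes \nu \in \Gamma(\mu, \nu)$). Hence, we can apply Theorem 2.3.2 to obtain the existence of an optimal coupling for the cost $c'$, and then use that this coupling is also optimal for our cost $c$.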
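Theorem 2.5.6 (Kantorovich duality). Assume that
\[
\int_{\mathbb{R}^d} |x|^2\, d\mu(x) + \int_{\mathbb{R}^d} |y|^2\, d\nu(y) < +\infty.
\]
Then, with $\gamma$ ranging over $\Gamma(\mu, \nu)$ and $\varphi, \psi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ ranging over measurable functions, it holds that
\[
\min_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\gamma(x, y) = \max_{\varphi(x) + \psi(y) \ge x \cdot y} \int_{\mathbb{R}^d} -\varphi(x)\, d\mu(x) + \int_{\mathbb{R}^d} -\psi(y)\, d\nu(y).
\]
Also, in the max above one can choose $\psi = \varphi^*$, and it holds that
\[
\min_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\gamma(x, y) = \max_{\varphi \text{ convex}} \int_{\mathbb{R}^d} -\varphi(x)\, d\mu(x) + \int_{\mathbb{R}^d} -\varphi^*(y)\, d\nu(y).
\]
Proof. Consider $\varphi, \psi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ such that
\[
\varphi(x) + \psi(y) \ge x \cdot y \quad \forall\, x, y \in \mathbb{R}^d.
\]
Integrating this inequality with respect to an arbitrary coupling $\gamma \in \Gamma(\mu, \nu)$ yields¹⁰
\[
\begin{aligned}
\int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\gamma(x, y)
&\le \int_{\mathbb{R}^d \times \mathbb{R}^d} -\varphi(x)\, d\gamma(x, y) + \int_{\mathbb{R}^d \times \mathbb{R}^d} -\psi(y)\, d\gamma(x, y) \\
&= \int_{\mathbb{R}^d} -\varphi(x)\, d\mu(x) + \int_{\mathbb{R}^d} -\psi(y)\, d\nu(y).
\end{aligned} \tag{2.11}
\]
Note that the left-hand side does not depend on $\varphi$ and $\psi$, and the right-hand side does not depend on $\gamma$. Thus
\[
\begin{aligned}
\inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\gamma(x, y)
&\le \sup_{\varphi(x) + \psi(y) \ge x \cdot y} \int_{\mathbb{R}^d} -\varphi(x)\, d\mu(x) + \int_{\mathbb{R}^d} -\psi(y)\, d\nu(y) \\
&\le \sup_{\varphi \text{ convex}} \int_{\mathbb{R}^d} -\varphi(x)\, d\mu(x) + \int_{\mathbb{R}^d} -\varphi^*(y)\, d\nu(y),
\end{aligned} \tag{2.12}
\]
where the second inequality follows from Proposition 2.5.5(a).
¹⁰ Here there is a subtle point: to apply Fubini’s theorem and say $\int_{\mathbb{R}^d \times \mathbb{R}^d} \varphi(x)\, d\gamma(x, y) = \int_{\mathbb{R}^d} \varphi(x)\, d\mu(x)$, one would need to make sure that $\varphi$ is integrable (and analogously for $\psi$). Hence, to justify this identity, we argue as follows: since $\varphi(x) + \psi(y) \ge x \cdot y \ge -\frac{|x|^2}{2} - \frac{|y|^2}{2}$, it means that
\[
\left( \varphi(x) + \frac{|x|^2}{2} \right) + \left( \psi(y) + \frac{|y|^2}{2} \right) \ge 0 \quad \forall\, x, y \in \mathbb{R}^d.
\]
Choosing two points $x_0, y_0 \in \mathbb{R}^d$ where $\varphi$ and $\psi$ are respectively finite (if these points do not exist it means that $\varphi \equiv +\infty$ or $\psi \equiv +\infty$, and then (2.11) is trivially true), this implies that
\[
\Phi(x) := \varphi(x) + \frac{|x|^2}{2} \ge -C_0 := -\psi(y_0) - \frac{|y_0|^2}{2} \quad \forall\, x,
\qquad
\Psi(y) := \psi(y) + \frac{|y|^2}{2} \ge -C_1 := -\varphi(x_0) - \frac{|x_0|^2}{2} \quad \forall\, y.
\]
Now, since $\Phi + C_0$ and $\Psi + C_1$ are nonnegative, we can monotonically approximate them with the Borel and bounded functions $\Phi_k := \min\{\Phi + C_0, k\}$ and $\Psi_k := \min\{\Psi + C_1, k\}$, $k \in \mathbb{N}$. Then, applying the definition of coupling to $\Phi_k$ and $\Psi_k$ (see Definition 1.4.3) and letting $k \to \infty$, by monotone convergence we get
\[
\int_{\mathbb{R}^d \times \mathbb{R}^d} \big( \Phi(x) + C_0 \big)\, d\gamma(x, y) = \int_{\mathbb{R}^d} \big( \Phi(x) + C_0 \big)\, d\mu(x),
\qquad
\int_{\mathbb{R}^d \times \mathbb{R}^d} \big( \Psi(y) + C_1 \big)\, d\gamma(x, y) = \int_{\mathbb{R}^d} \big( \Psi(y) + C_1 \big)\, d\nu(y).
\]
(Footnote 10 continues below.)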
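On the other hand, let $\bar{\gamma} \in \Gamma(\mu, \nu)$ be optimal. Theorem 2.4.3 implies that $\operatorname{supp}(\bar{\gamma})$ is cyclically monotone, and so Theorem 2.5.3 yields the existence of a convex map $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ such that $\operatorname{supp}(\bar{\gamma}) \subset \partial \varphi$, that is, $y \in \partial \varphi(x)$ for any $(x, y) \in \operatorname{supp}(\bar{\gamma})$. Thanks to Proposition 2.5.5, this implies that $\varphi(x) + \varphi^*(y) = x \cdot y$ for $\bar{\gamma}$-a.e. $(x, y)$. Thus we have
\[
\int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\bar{\gamma}(x, y) = \int_{\mathbb{R}^d \times \mathbb{R}^d} -\varphi(x)\, d\bar{\gamma}(x, y) + \int_{\mathbb{R}^d \times \mathbb{R}^d} -\varphi^*(y)\, d\bar{\gamma}(x, y) = \int_{\mathbb{R}^d} -\varphi(x)\, d\mu(x) + \int_{\mathbb{R}^d} -\varphi^*(y)\, d\nu(y).
\]
Hence the triple $(\bar{\gamma}, \varphi, \varphi^*)$ gives equality in equation (2.11), so all the inequalities in (2.12) are equalities and the infimum and suprema are attained.
Remark 2.5.7. In the proof above, the optimality of $\bar{\gamma}$ is only used to deduce that $\operatorname{supp}(\bar{\gamma}) \subset \partial \varphi$ for some convex function $\varphi$. Hence, the proof actually shows that if $\operatorname{supp}(\bar{\gamma}) \subset \partial \varphi$ with $\varphi$ convex, then
\[
\int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\bar{\gamma} = \int_{\mathbb{R}^d} -\varphi\, d\mu + \int_{\mathbb{R}^d} -\varphi^*\, d\nu.
\]
Since the right-hand side is bounded from above by $\inf_{\gamma \in \Gamma(\mu, \nu)} \int -x \cdot y\, d\gamma$ (thanks to (2.11)), we conclude that
\[
\int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\bar{\gamma} \le \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} -x \cdot y\, d\gamma,
\]
thus $\bar{\gamma}$ is optimal. So we proved the implication
\[
\operatorname{supp}(\bar{\gamma}) \subset \partial \varphi \text{ with } \varphi \text{ convex} \ \Longrightarrow\ \bar{\gamma} \text{ is optimal}.
\]
As a consequence of this remark, together with Theorems 2.4.3 and 2.5.3, we obtain the following:
Corollary 2.5.8. Let $c(x, y) = \frac{|x - y|^2}{2}$ (or equivalently $c(x, y) = -x \cdot y$). The following are equivalent:
• $\bar{\gamma}$ is optimal;
• $\operatorname{supp}(\bar{\gamma})$ is cyclically monotone;
• there exists a convex map $\varphi : \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ such that $\operatorname{supp}(\bar{\gamma}) \subset \partial \varphi$.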
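In dimension one, Corollary 2.5.8 suggests that an optimal matching for the quadratic cost should be monotone, since a nondecreasing map is the derivative of a convex function. The sketch below compares, for a small hypothetical data set with equal masses, the matching obtained by sorting with the best matching found by brute force over permutations. It restricts attention to couplings induced by permutations, and the names `matching_cost`, `brute_force_optimal`, and `monotone_matching` are illustrative.

```python
from itertools import permutations


def matching_cost(xs, ys, sigma):
    """Average of |x_i - y_sigma(i)|^2 / 2, i.e. the cost of the induced coupling."""
    return sum(0.5 * (x - ys[j]) ** 2 for x, j in zip(xs, sigma)) / len(xs)


def brute_force_optimal(xs, ys):
    """Best permutation found by exhaustive search (only feasible for tiny inputs)."""
    return min(permutations(range(len(ys))), key=lambda s: matching_cost(xs, ys, s))


def monotone_matching(xs, ys):
    """Match the k-th smallest point of xs with the k-th smallest point of ys."""
    order_x = sorted(range(len(xs)), key=xs.__getitem__)
    order_y = sorted(range(len(ys)), key=ys.__getitem__)
    sigma = [0] * len(xs)
    for i, j in zip(order_x, order_y):
        sigma[i] = j
    return tuple(sigma)


if __name__ == "__main__":
    xs = [0.3, -1.0, 2.0, 0.9]  # hypothetical source points, mass 1/4 each
    ys = [1.5, -0.2, 0.4, 3.0]  # hypothetical target points, mass 1/4 each
    best = brute_force_optimal(xs, ys)
    mono = monotone_matching(xs, ys)
    print(best, matching_cost(xs, ys, best))
    print(mono, matching_cost(xs, ys, mono))
```

For these data the monotone matching attains the minimal cost found by the exhaustive search, as the equivalence between optimality and cyclical monotonicity predicts.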
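(Footnote 10, continued.) Finally, since $\int |x|^2\, d\gamma = \int |x|^2\, d\mu < +\infty$ and $\int |y|^2\, d\gamma = \int |y|^2\, d\nu < +\infty$, we can subtract $\frac{|x|^2}{2} + C_0$ (resp. $\frac{|y|^2}{2} + C_1$) from the equations obtained by monotone convergence to deduce that
\[
\int_{\mathbb{R}^d \times \mathbb{R}^d} \varphi(x)\, d\gamma(x, y) = \int_{\mathbb{R}^d} \varphi(x)\, d\mu(x),
\qquad
\int_{\mathbb{R}^d \times \mathbb{R}^d} \psi(y)\, d\gamma(x, y) = \int_{\mathbb{R}^d} \psi(y)\, d\nu(y).
\]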
Remark 2.5.9. These equivalences are particularly useful when proving that a certain transport map is optimal. Indeed, given a transport map $T$ from $\mu$ to $\nu$, $\gamma_T := (\mathrm{Id} \times T)_\# \mu$ is optimal if and only if there exists a convex map $\varphi$ such that $\operatorname{supp}(\gamma_T) \subset \partial\varphi$ (recall Remark 2.2.1). Recalling the definition of $\partial\varphi$, this is equivalent to asking that
$$T(x) \in \partial\varphi(x) \quad \text{for } \mu\text{-a.e. } x. \tag{2.13}$$
In particular, given a convex function $\varphi$ and a measure $\mu \in \mathcal{P}(\mathbb{R}^d)$, assume that $\varphi$ is differentiable $\mu$-a.e. Then the map $T := \nabla\varphi$ is well defined $\mu$-a.e., and we can consider the measure $\nu := (\nabla\varphi)_\# \mu$. Since $\partial\varphi(x) = \{\nabla\varphi(x)\}$ at every differentiability point of $\varphi$ (see Lemma 2.5.2), the above optimality condition (2.13) is trivially satisfied and therefore $\nabla\varphi$ is an optimal map from $\mu$ onto $\nu = (\nabla\varphi)_\# \mu$.

2.5.3 Brenier's theorem

We are now ready to state and prove a cornerstone of optimal transport theory [20].

Theorem 2.5.10 (Brenier's theorem). Let $X = Y = \mathbb{R}^d$ and $c(x,y) = \frac{|x-y|^2}{2}$ (or equivalently $c(x,y) = -x \cdot y$). Suppose that
$$\int_{\mathbb{R}^d} |x|^2\, d\mu + \int_{\mathbb{R}^d} |y|^2\, d\nu < +\infty$$
and that $\mu \ll dx$ (i.e., $\mu$ is absolutely continuous with respect to the Lebesgue measure). Then there exists a unique optimal plan $\bar\gamma$. In addition, $\bar\gamma = (\mathrm{Id} \times T)_\# \mu$ and $T = \nabla\varphi$ for some convex function $\varphi$.

Proof. The proof takes four steps: steps (1)–(3) for the existence, and step (4) for the uniqueness.
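(1) Note that the cost $c(x,y) = \frac{|x-y|^2}{2}$ is nonnegative and continuous. Also, taking $\mu \otimes \nu \in \Gamma(\mu,\nu)$ as a coupling, we obtain
$$\int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d(\mu \otimes \nu) \le 2 \int_{\mathbb{R}^d \times \mathbb{R}^d} \big(|x|^2 + |y|^2\big)\, d(\mu \otimes \nu) = 2 \int_{\mathbb{R}^d} |x|^2\, d\mu + 2 \int_{\mathbb{R}^d} |y|^2\, d\nu < +\infty.$$
Thus Theorem 2.3.2 ensures the existence of a nontrivial optimal transport plan $\bar\gamma$.

(2) Since $\bar\gamma$ is optimal, we know from Corollary 2.5.8 that $\operatorname{supp}(\bar\gamma) \subset \partial\varphi$ for some convex $\varphi\colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$. Also, by Proposition 2.5.5,
$$\varphi(x) + \varphi^*(y) = x \cdot y \quad \text{on } \partial\varphi,$$
where $\varphi^*(y) := \sup_{z \in \mathbb{R}^d} \{z \cdot y - \varphi(z)\}$. Therefore
$$\varphi(x) + \varphi^*(y) = x \cdot y \quad \text{on } \operatorname{supp}(\bar\gamma).$$
In particular $(\varphi(x), \varphi^*(y))$ is finite for $\bar\gamma$-a.e. $(x,y)$, and thus $\varphi(x)$ is finite at $\mu$-a.e. point $x$. Since $\mu \ll dx$, and convex functions are differentiable almost everywhere on the region where they are finite (this follows by Alexandrov's theorem; see [72, Thm. 14.25]), we deduce that $\varphi$ is differentiable $\mu$-a.e.

(3) Let $A \subset \mathbb{R}^d$, with $\mu(A) = 0$, be such that $\varphi$ is differentiable everywhere in $\mathbb{R}^d \setminus A$. Fix $\bar x \in \mathbb{R}^d \setminus A$ and suppose that $(\bar x, \bar y) \in \operatorname{supp}(\bar\gamma) \subset \partial\varphi$. Since $\bar y \in \partial\varphi(\bar x)$ and $\varphi$ is differentiable at $\bar x$, by Lemma 2.5.2 we get $\bar y = \nabla\varphi(\bar x)$. Thus, we have
$$\operatorname{supp}(\bar\gamma) \cap \big[(\mathbb{R}^d \setminus A) \times \mathbb{R}^d\big] \subset \operatorname{graph}(\nabla\varphi).$$
Since $\bar\gamma(A \times \mathbb{R}^d) = \mu(A) = 0$, this proves that
$$(x,y) = (x, \nabla\varphi(x)) \quad \bar\gamma\text{-a.e.} \tag{2.14}$$
Thus, for any function $F \in C_b(\mathbb{R}^d \times \mathbb{R}^d)$ we have
$$\int_{\mathbb{R}^d \times \mathbb{R}^d} F(x,y)\, d\bar\gamma(x,y) \overset{(2.14)}{=} \int_{\mathbb{R}^d \times \mathbb{R}^d} F(x, \nabla\varphi(x))\, d\bar\gamma(x,y) = \int_{\mathbb{R}^d} F(x, \nabla\varphi(x))\, d\mu(x) = \int_{\mathbb{R}^d \times \mathbb{R}^d} F(x,y)\, d\big((\mathrm{Id} \times \nabla\varphi)_\# \mu\big)(x,y),$$
hence $\bar\gamma = (\mathrm{Id} \times \nabla\varphi)_\# \mu$, as desired.

(4) We now prove uniqueness. Assume that $\bar\gamma_1$ and $\bar\gamma_2$ are optimal. By linearity of the problem (and convexity of the constraints), $\frac{\bar\gamma_1 + \bar\gamma_2}{2}$ is also optimal; indeed,
$$\int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\Big(\frac{\bar\gamma_1 + \bar\gamma_2}{2}\Big) = \frac12 \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\bar\gamma_1 + \frac12 \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\bar\gamma_2$$
and, for any $\psi \in C_b(\mathbb{R}^d)$, it holds that
$$\int_{\mathbb{R}^d \times \mathbb{R}^d} \psi(x)\, d\Big(\frac{\bar\gamma_1 + \bar\gamma_2}{2}\Big) = \frac12 \int_{\mathbb{R}^d \times \mathbb{R}^d} \psi(x)\, d\bar\gamma_1 + \frac12 \int_{\mathbb{R}^d \times \mathbb{R}^d} \psi(x)\, d\bar\gamma_2 = \int_{\mathbb{R}^d} \psi\, d\mu,$$
thus $(\pi_X)_\# \big(\frac{\bar\gamma_1 + \bar\gamma_2}{2}\big) = \mu$. Analogously $(\pi_Y)_\# \big(\frac{\bar\gamma_1 + \bar\gamma_2}{2}\big) = \nu$.

Hence, by steps (2) and (3) applied to $\bar\gamma_1$, $\bar\gamma_2$, $\frac{\bar\gamma_1 + \bar\gamma_2}{2}$, there exist three convex functions $\varphi_1$, $\varphi_2$, $\bar\varphi$ such that
(a) $\bar\gamma_1 = (\mathrm{Id} \times \nabla\varphi_1)_\# \mu$, thus $(x,y) = (x, \nabla\varphi_1(x))$ $\bar\gamma_1$-a.e.;
(b) $\bar\gamma_2 = (\mathrm{Id} \times \nabla\varphi_2)_\# \mu$, thus $(x,y) = (x, \nabla\varphi_2(x))$ $\bar\gamma_2$-a.e.;
(c) $\frac{\bar\gamma_1 + \bar\gamma_2}{2} = (\mathrm{Id} \times \nabla\bar\varphi)_\# \mu$, thus $(x,y) = (x, \nabla\bar\varphi(x))$ $\frac{\bar\gamma_1 + \bar\gamma_2}{2}$-a.e.

In particular, it follows by (c) that $(x,y) = (x, \nabla\bar\varphi(x))$ holds $\bar\gamma_1$-a.e., which combined with (a) yields
$$(x, \nabla\varphi_1(x)) = (x, \nabla\bar\varphi(x)) \quad \bar\gamma_1\text{-a.e.} \quad\Longrightarrow\quad \nabla\varphi_1(x) = \nabla\bar\varphi(x) \quad \mu\text{-a.e.},$$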
where the implication follows from the fact that there is no dependence on $y$ (so a relation true $\bar\gamma_1$-a.e. is also true $\mu$-a.e.). Analogously, combining (b) and (c), we deduce that $\nabla\varphi_2(x) = \nabla\bar\varphi(x)$ holds for $\mu$-a.e. $x \in \mathbb{R}^d$. Thus $\nabla\varphi_1 = \nabla\varphi_2$ $\mu$-a.e., and therefore $\bar\gamma_1 = \bar\gamma_2$, as desired.

Remark 2.5.11. Let us describe how an appropriate limit of Brenier maps yields the Knothe map constructed in Section 1.4.1. This result was first proven in [28]. Let $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ be two probability measures absolutely continuous with respect to Lebesgue and with finite second moments $\int |x|^2\, d\mu(x) + \int |x|^2\, d\nu(x) < \infty$. Thanks to Theorem 2.5.10, we know that, for any $\varepsilon > 0$, there is a unique optimal transport map $T_\varepsilon\colon \mathbb{R}^d \to \mathbb{R}^d$ between $\mu$ and $\nu$ with respect to the cost
$$c_\varepsilon(x,y) = (x_1 - y_1)^2 + \varepsilon (x_2 - y_2)^2 + \varepsilon^2 (x_3 - y_3)^2 + \cdots + \varepsilon^{d-1} (x_d - y_d)^2.$$
Notice that we can apply Theorem 2.5.10 because the cost $c_\varepsilon$ is equivalent to the standard quadratic cost up to a linear rescaling. Intuitively, as $\varepsilon \to 0$, the relative weight of the $d$ coordinates becomes incomparable: the distance on the first coordinate is much more important than the distance on the second coordinate, the distance on the second coordinate is much more important than the distance on the third coordinate, etc. It turns out that the Brenier maps $T_\varepsilon$ converge to the Knothe map $T\colon \mathbb{R}^d \to \mathbb{R}^d$ between $\mu$ and $\nu$. More precisely, $T_\varepsilon \to T$ in $L^2(\mathbb{R}^d)$ as $\varepsilon \to 0$.

Corollary 2.5.12. Under the assumptions of Brenier's theorem (Theorem 2.5.10),
(a) there exists a unique optimal transport map $T\colon \mathbb{R}^d \to \mathbb{R}^d$ such that $T_\# \mu = \nu$; also, $T = \nabla\varphi$ with $\varphi\colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ convex;
(b) if $T_\# \mu = \nu$ and $T = \nabla\varphi$ $\mu$-a.e. for some $\varphi\colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ convex, then $T$ is the unique optimal transport map.

Proof. As we will see, the proof is an immediate consequence of the previous results.

(a) First of all, recall that the infimum in Monge's problem is bounded from below by the infimum in Kantorovich's problem (see Remark 2.2.1):
$$\inf_{T_\# \mu = \nu} \int_{\mathbb{R}^d} |x - T(x)|^2\, d\mu \ge \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\gamma(x,y).$$
Let $\varphi$ be the convex function provided by Theorem 2.5.10, and set $\bar\gamma := (\mathrm{Id} \times \nabla\varphi)_\# \mu$. With this choice we have
$$\int_{\mathbb{R}^d} |x - \nabla\varphi(x)|^2\, d\mu = \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\bar\gamma(x,y) = \min_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\gamma(x,y),$$
so $T = \nabla\varphi$ is optimal. We now show that the solution to the Monge problem is unique. Let $T_1$ and $T_2$ be optimal for Monge. Then it follows by the discussion above that $\gamma_1 = (\mathrm{Id} \times T_1)_\# \mu$ and $\gamma_2 = (\mathrm{Id} \times T_2)_\# \mu$ are optimal for Kantorovich. Because $\gamma_1 = \gamma_2$ (by Theorem 2.5.10), we conclude that $T_1 = T_2$ $\mu$-a.e.

(b) The optimality of $T$ follows directly from Remark 2.5.9, while the uniqueness follows from the first part of this corollary.

We conclude this section by proving that, whenever $\mu$ and $\nu$ are both absolutely continuous, the optimal transport from $\mu$ to $\nu$ is invertible, and its inverse is given by the optimal transport map from $\nu$ to $\mu$.

Corollary 2.5.13. Under the assumptions of Brenier's theorem (Theorem 2.5.10), assume also that $\nu \ll dx$. Let $\nabla\varphi$ be the optimal transport map from $\mu$ to $\nu$, and let $\nabla\psi$ be the optimal transport map from $\nu$ to $\mu$. Then $\nabla\varphi \circ \nabla\psi = \mathrm{Id}$ $\nu$-a.e. and $\nabla\psi \circ \nabla\varphi = \mathrm{Id}$ $\mu$-a.e. (i.e., informally, the two maps are inverses of each other).

Proof. By Brenier's theorem (Theorem 2.5.10), we have two convex maps $\varphi$ and $\psi$ such that
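• $\nabla\varphi$ is an optimal transport map from $\mu$ to $\nu$;
• $\nabla\psi$ is an optimal transport map from $\nu$ to $\mu$.

Hence
$$\int_{\mathbb{R}^d} |x - \nabla\varphi(x)|^2\, d\mu = \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\big((\mathrm{Id} \times \nabla\varphi)_\# \mu\big) = \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\gamma$$
and (since the cost is symmetric in $x$ and $y$)
$$\int_{\mathbb{R}^d} |\nabla\psi(y) - y|^2\, d\nu = \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\big((\nabla\psi \times \mathrm{Id})_\# \nu\big) = \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x-y|^2\, d\gamma.$$
This implies that $(\mathrm{Id} \times \nabla\varphi)_\# \mu$ and $(\nabla\psi \times \mathrm{Id})_\# \nu$ are both optimal, so they are equal. Thus, for any test function $F\colon \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$, we have
$$\int_{\mathbb{R}^d} F(x, \nabla\varphi(x))\, d\mu(x) = \int_{\mathbb{R}^d \times \mathbb{R}^d} F(x,y)\, d\big((\mathrm{Id} \times \nabla\varphi)_\# \mu\big)(x,y) = \int_{\mathbb{R}^d \times \mathbb{R}^d} F(x,y)\, d\big((\nabla\psi \times \mathrm{Id})_\# \nu\big)(x,y) = \int_{\mathbb{R}^d} F(\nabla\psi(y), y)\, d\nu(y).$$
Choosing $F(x,y) = |x - \nabla\psi(y)|^2$, this gives
$$\int_{\mathbb{R}^d} |x - \nabla\psi(\nabla\varphi(x))|^2\, d\mu(x) = \int_{\mathbb{R}^d} |\nabla\psi(y) - \nabla\psi(y)|^2\, d\nu(y) = 0,$$
thus $\nabla\psi \circ \nabla\varphi = \mathrm{Id}$ $\mu$-a.e. Similarly, choosing $F(x,y) = |\nabla\varphi(x) - y|^2$, we get $\nabla\varphi \circ \nabla\psi = \mathrm{Id}$ $\nu$-a.e.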
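Corollary 2.5.13 can be illustrated with one-dimensional Gaussians, where the optimal map for the quadratic cost is the nondecreasing affine map matching means and standard deviations (the derivative of a convex quadratic). The following is a minimal sketch under that assumption; the helper name `brenier_map_gaussian` and the chosen parameters are hypothetical and serve only to check numerically that the two maps invert each other.

```python
from statistics import NormalDist

def brenier_map_gaussian(src, dst):
    """Nondecreasing affine map pushing src = N(m0, s0^2) to dst = N(m1, s1^2)."""
    slope = dst.stdev / src.stdev  # positive, so the map is the derivative of a convex quadratic
    return lambda x: dst.mean + slope * (x - src.mean)

mu = NormalDist(mu=-1.0, sigma=0.5)
nu = NormalDist(mu=2.0, sigma=1.5)

grad_phi = brenier_map_gaussian(mu, nu)  # plays the role of grad(phi): mu -> nu
grad_psi = brenier_map_gaussian(nu, mu)  # plays the role of grad(psi): nu -> mu

points = [-3.0, -1.0, 0.0, 0.7, 4.2]
round_trip_mu = [grad_psi(grad_phi(x)) for x in points]
round_trip_nu = [grad_phi(grad_psi(y)) for y in points]
```

Both round trips return the starting points up to rounding, and `grad_phi` sends each quantile of `mu` to the corresponding quantile of `nu`, which is how one sees that it pushes `mu` forward to `nu`.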
Remark 2.5.14. Using the definition of the Legendre transform, one can show that $y \in \partial\varphi(x) \Leftrightarrow x \in \partial\varphi^*(y)$ and that the map $\psi$ provided by the previous corollary actually coincides with $\varphi^*$. Since this will never be used in this book, we will not prove it, but interested readers are encouraged to try to prove this fact.

2.5.4 An application to Euler equations

Let $\Omega \subset \mathbb{R}^d$ be a bounded open set with smooth boundary, and let $\nu$ be the outer unit normal to $\partial\Omega$. The Euler equations describe the evolution on a time interval $[0,T]$ of the velocity $v = v(t,x) \in \mathbb{R}^d$ of an incompressible fluid. They are given by the following system:
$$\begin{cases} \partial_t v + (v \cdot \nabla) v + \nabla p = 0 & \text{in } \Omega \text{ (Euler equation)}; \\ \operatorname{div}(v) = 0 & \text{in } \Omega \text{ (incompressibility condition)}; \\ v \cdot \nu = 0 & \text{on } \partial\Omega \text{ (no-flux condition)}. \end{cases}$$
Here $p = p(t,x) \in \mathbb{R}$ denotes the pressure of the fluid at time $t$ and position $x$. The notation $v \cdot \nabla$ denotes the differential operator $\sum_{j=1}^d v^j \partial_{x_j}$. Hence, in coordinates $v = (v^1, \ldots, v^d)$, one reads the Euler equation as
$$\partial_t v^i + \sum_{j=1}^d v^j \partial_{x_j} v^i + \partial_{x_i} p = 0 \qquad \forall\, i = 1, \ldots, d.$$
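If $v$ is smooth, then
$$\begin{aligned}
\frac{d}{dt} \int_\Omega |v(t)|^2\, dx &= \frac{d}{dt} \int_\Omega \sum_{i=1}^d \big(v^i(t)\big)^2\, dx = 2 \int_\Omega \sum_{i=1}^d v^i\, \partial_t v^i\, dx \\
&= -2 \int_\Omega \sum_{i,j=1}^d v^i v^j \partial_{x_j} v^i\, dx - 2 \int_\Omega \sum_{i=1}^d v^i\, \partial_{x_i} p\, dx \\
&= -\int_\Omega \sum_{j=1}^d v^j \partial_{x_j} \Big( \sum_{i=1}^d (v^i)^2 \Big)\, dx - 2 \int_\Omega \sum_{i=1}^d v^i\, \partial_{x_i} p\, dx \\
&= -\int_{\partial\Omega} \sum_{j=1}^d v^j \nu^j \sum_{i=1}^d (v^i)^2 + \int_\Omega \sum_{j=1}^d \partial_{x_j} v^j \sum_{i=1}^d (v^i)^2\, dx - 2 \int_{\partial\Omega} \sum_{i=1}^d v^i \nu^i\, p + 2 \int_\Omega \sum_{i=1}^d \partial_{x_i} v^i\, p\, dx \\
&= -\int_{\partial\Omega} (v \cdot \nu)\, |v|^2 + \int_\Omega \operatorname{div}(v)\, |v|^2\, dx - 2 \int_{\partial\Omega} (v \cdot \nu)\, p + 2 \int_\Omega \operatorname{div}(v)\, p\, dx = 0,
\end{aligned}$$
where we used the no-flux and incompressibility conditions.

Also, if $v$ is smooth, we can define its flow $g\colon [0,T] \times \Omega \to \Omega$ as
$$\begin{cases} \partial_t g(t,x) = v(t, g(t,x)); \\ g(0,x) = x. \end{cases}$$
Note that $g(t,\cdot)$ is a map from $\Omega$ to $\Omega$, since (thanks to the no-flux condition) the curve $t \mapsto g(t,x)$ never exits $\Omega$. Also, differentiating the ODE for $g$ with respect to $x$, we get
$$\partial_t \nabla_x g = \nabla_x \big( v(t, g(t,x)) \big) = \nabla_x v(t, g(t,x)) \cdot \nabla_x g(t,x)$$
(note that $\nabla_x v$ and $\nabla_x g$ are $d \times d$ matrices, and $\nabla_x v(t, g(t,x)) \cdot \nabla_x g(t,x)$ denotes their product, which is still a $d \times d$ matrix). This implies that
$$\nabla_x g(t + \varepsilon, x) = \nabla_x g(t,x) + \varepsilon\, \nabla_x v(t, g(t,x)) \cdot \nabla_x g(t,x) + o(\varepsilon).$$
Then, since $\det(AB) = \det(A) \det(B)$ and $\det(\mathrm{Id} + \varepsilon A) = 1 + \varepsilon \operatorname{tr}(A) + o(\varepsilon)$,
$$\begin{aligned}
\frac{d}{dt} \det(\nabla_x g(t,x)) &= \lim_{\varepsilon \to 0} \frac{\det(\nabla_x g(t + \varepsilon, x)) - \det(\nabla_x g(t,x))}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{\det\big( \nabla_x g(t,x) + \varepsilon\, \nabla_x v(t, g(t,x)) \cdot \nabla_x g(t,x) + o(\varepsilon) \big) - \det(\nabla_x g(t,x))}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{\det\big( \nabla_x g(t,x) + \varepsilon\, \nabla_x v(t, g(t,x)) \cdot \nabla_x g(t,x) \big) - \det(\nabla_x g(t,x)) + o(\varepsilon)}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{\det\big( \mathrm{Id} + \varepsilon\, \nabla_x v(t, g(t,x)) \big) \det(\nabla_x g(t,x)) - \det(\nabla_x g(t,x))}{\varepsilon} \\
&= \lim_{\varepsilon \to 0} \frac{\big( 1 + \varepsilon \operatorname{tr}(\nabla_x v)(t, g(t,x)) - 1 \big) \det(\nabla_x g(t,x))}{\varepsilon} \\
&= \operatorname{tr}(\nabla_x v)(t, g(t,x)) \det(\nabla_x g(t,x)) = \operatorname{div}(v)(t, g(t,x)) \det(\nabla_x g(t,x)) = 0,
\end{aligned} \tag{2.15}$$
where the last equality follows from the incompressibility condition. Hence, since $\nabla_x g(0,x) = \mathrm{Id}$, we deduce that $\det(\nabla_x g(t,x)) \equiv 1$.

Now, if we differentiate the equation for $g = (g^1, \ldots, g^d)$ in time, using the Euler equations we get
$$\partial_{tt} g^i(t,x) = \partial_t \big( v^i(t, g(t,x)) \big) = \partial_t v^i(t,g) + \nabla v^i(t,g) \cdot \partial_t g = \partial_t v^i(t,g) + (v(t,g) \cdot \nabla) v^i(t,g) = -\partial_{x_i} p(t,g).$$
Thus the Euler equations are equivalent to the following system for a curve $t \mapsto g(t)$ of smooth diffeomorphisms of $\Omega$:
$$\begin{cases} \partial_{tt} g = -\nabla p(t,g) & \text{(Euler equation} \to \text{2nd order ODE for } g); \\ \det \nabla_x g = 1 & \text{(incompressibility} \to g \text{ preserves the Lebesgue measure)}; \\ g(0,x) = x & \text{(initial condition)}; \end{cases} \tag{2.16}$$
where $p(t)\colon \Omega \to \mathbb{R}$ is some function that represents the pressure.

Arnold's theorem. It was observed by Arnold in the 1960s [10] that, at least formally, the Euler equations for fluids can be seen as the equation of a geodesic curve in an appropriate infinite-dimensional manifold. Readers can find the definition of a geodesic, and a brief presentation of the concepts necessary to appreciate the following theorem, in Section 1.3.

Theorem 2.5.15 (Arnold's theorem). The Euler equations are equivalent to the geodesic equation on the manifold $\mathrm{SDiff}(\Omega) \subset L^2(\Omega; \mathbb{R}^d)$ defined as
$$\mathrm{SDiff}(\Omega) := \{ h\colon \Omega \to \Omega \mid h \text{ a measure-preserving and orientation-preserving diffeomorphism} \}.$$

Proof. First of all, we need to identify the tangent space of $\mathrm{SDiff}(\Omega)$.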
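Given $\bar h \in \mathrm{SDiff}(\Omega)$, let $t \mapsto h(t) \in \mathrm{SDiff}(\Omega)$ be a smooth curve of maps in $\mathrm{SDiff}(\Omega)$ with $h(0) = \bar h$, and set $w(t) := \partial_t h(t)$. By definition of tangent space, $w(t) \in T_{h(t)} \mathrm{SDiff}(\Omega)$. Since $h(t)$ is a diffeomorphism of $\Omega$, it maps $\partial\Omega$ onto itself, and therefore $w(t) = \partial_t h(t)$ must be tangent to the boundary. Define $\tilde w(t) := w(t) \circ h^{-1}(t)$ so that $\partial_t h(t) = \tilde w(t, h(t))$, and note that $\tilde w(t)$ is also tangent to $\partial\Omega$. Since $\det \nabla_x h(t,x) \equiv 1$ (because $h(t) \in \mathrm{SDiff}(\Omega)$), by the computations in (2.15) we have
$$0 = \frac{d}{dt} \det \nabla_x h(t,x) = \operatorname{div}(\tilde w)(t, h(t,x)) \underbrace{\det \nabla_x h(t,x)}_{\equiv 1} \quad\Longrightarrow\quad \operatorname{div}(\tilde w) \equiv 0.$$
Thus, taking $t = 0$, we deduce that
$$T_{\bar h} \mathrm{SDiff}(\Omega) \subset \{ w \mid \operatorname{div}(w \circ \bar h^{-1}) = 0,\ w \cdot \nu|_{\partial\Omega} = 0 \} = \{ \tilde w \circ \bar h \mid \operatorname{div}(\tilde w) = 0,\ \tilde w \cdot \nu|_{\partial\Omega} = 0 \}.$$
Vice versa, given a vector field $\tilde w\colon \Omega \to \mathbb{R}^d$ with $\operatorname{div}(\tilde w) = 0$ and $\tilde w \cdot \nu|_{\partial\Omega} = 0$, we solve
$$\begin{cases} \partial_t h(t,x) = \tilde w(h(t,x)); \\ h(0,x) = \bar h(x), \end{cases}$$
and using the same computation as in (2.15) we find that $\frac{d}{dt} \det \nabla h = 0$. Thus $h(t)\colon \Omega \to \Omega$ is a curve in $\mathrm{SDiff}(\Omega)$, and in particular $\partial_t h(0) = \tilde w \circ \bar h$ is an element of the tangent space of $\mathrm{SDiff}(\Omega)$ at $\bar h$. Hence, we proved that, for any element $\bar h \in \mathrm{SDiff}(\Omega)$,
$$T_{\bar h} \mathrm{SDiff}(\Omega) = \{ \tilde w \circ \bar h \mid \operatorname{div}(\tilde w) = 0,\ \tilde w \cdot \nu|_{\partial\Omega} = 0 \}.$$
Let us observe the following:

(a) For any measure-preserving map $h \in \mathrm{SDiff}(\Omega)$, and any $f_1, f_2\colon \Omega \to \mathbb{R}^d$, we have
$$\langle f_1 \circ h, f_2 \circ h \rangle_{L^2} = \int_\Omega f_1 \circ h(x) \cdot f_2 \circ h(x)\, dx = \int_\Omega f_1(x) \cdot f_2(x)\, dx = \langle f_1, f_2 \rangle_{L^2},$$
where in the second equality we used that $h \in \mathrm{SDiff}(\Omega)$ (and therefore $h_\# dx = dx$).

(b) Every vector field in $L^2(\Omega; \mathbb{R}^d)$ can be written as the sum of a gradient and a divergence-free vector field, that is,
$$L^2(\Omega; \mathbb{R}^d) = \{ w\colon \Omega \to \mathbb{R}^d \mid \operatorname{div}(w) = 0 \text{ and } w \cdot \nu|_{\partial\Omega} = 0 \} \oplus \{ \nabla q \mid q\colon \Omega \to \mathbb{R} \}. \tag{2.17}$$
Note that this decomposition is orthogonal. Indeed,
$$\langle w, \nabla q \rangle_{L^2} = \int_\Omega w \cdot \nabla q\, dx = \int_{\partial\Omega} \underbrace{w \cdot \nu}_{=0}\, q - \int_\Omega \underbrace{\operatorname{div}(w)}_{=0}\, q\, dx = 0.$$
This is known as the Helmholtz decomposition.

Combining (a) and (b) yields that, for any $h \in \mathrm{SDiff}(\Omega)$, we can decompose $L^2(\Omega; \mathbb{R}^d)$ as
$$L^2(\Omega; \mathbb{R}^d) = \underbrace{\{ w \circ h\colon \Omega \to \mathbb{R}^d \mid \operatorname{div}(w) = 0 \text{ and } w \cdot \nu|_{\partial\Omega} = 0 \}}_{= T_h \mathrm{SDiff}(\Omega)} \oplus \{ \nabla q \circ h \mid q\colon \Omega \to \mathbb{R} \}.$$
Since this decomposition is orthogonal in $L^2(\Omega; \mathbb{R}^d)$, we conclude that, given $h \in \mathrm{SDiff}(\Omega)$,
$$\big( T_h \mathrm{SDiff}(\Omega) \big)^\perp = \{ \nabla q \circ h \mid q\colon \Omega \to \mathbb{R} \}.$$
Hence, thanks to this characterization and recalling Definition 1.3.5, given a curve $t \mapsto g(t) \in \mathrm{SDiff}(\Omega)$, the following are equivalent:
• $t \mapsto g(t)$ is a geodesic;
• $\partial_{tt} g \perp T_g \mathrm{SDiff}(\Omega)$;
• $\partial_{tt} g(t,x) = \nabla q(t, g(t,x))$ for some function $q(t)\colon \Omega \to \mathbb{R}$.

Recalling (2.16), this proves the result taking $p(t) := -q(t)$.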
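The determinant computation (2.15), which underlies both the system (2.16) and the description of the tangent space, can be checked numerically in two dimensions. The sketch below is an illustration only: it assumes a time-independent velocity field, a classical fourth-order Runge–Kutta integrator, and a finite-difference Jacobian; the function names and the two sample fields are hypothetical. The first field is divergence-free and tangent to the boundary of the square $(0,\pi)^2$ (whose boundary is only piecewise smooth); the second has constant divergence $0.5$.

```python
import math

def rk4_flow(field, point, t_final, steps=200):
    """Approximate g(t_final, point) for the autonomous ODE dg/dt = field(g)."""
    x, y = point
    dt = t_final / steps
    for _ in range(steps):
        k1 = field(x, y)
        k2 = field(x + 0.5 * dt * k1[0], y + 0.5 * dt * k1[1])
        k3 = field(x + 0.5 * dt * k2[0], y + 0.5 * dt * k2[1])
        k4 = field(x + dt * k3[0], y + dt * k3[1])
        x += dt * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        y += dt * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return x, y

def jacobian_det(field, point, t_final, h=1e-5):
    """Determinant of the spatial Jacobian of the flow map, by central differences."""
    x, y = point
    gx_p = rk4_flow(field, (x + h, y), t_final)
    gx_m = rk4_flow(field, (x - h, y), t_final)
    gy_p = rk4_flow(field, (x, y + h), t_final)
    gy_m = rk4_flow(field, (x, y - h), t_final)
    a = (gx_p[0] - gx_m[0]) / (2 * h)  # d g1 / d x
    c = (gx_p[1] - gx_m[1]) / (2 * h)  # d g2 / d x
    b = (gy_p[0] - gy_m[0]) / (2 * h)  # d g1 / d y
    d = (gy_p[1] - gy_m[1]) / (2 * h)  # d g2 / d y
    return a * d - b * c

def incompressible(x, y):
    # v = (dH/dy, -dH/dx) with H = sin(x) sin(y), hence div(v) = 0
    return math.sin(x) * math.cos(y), -math.cos(x) * math.sin(y)

def compressible(x, y):
    # div = 0.5 everywhere, so the determinant should grow like exp(0.5 t)
    return 0.5 * x, 0.0

det_incompressible = jacobian_det(incompressible, (1.0, 2.0), t_final=1.0)
det_compressible = jacobian_det(compressible, (1.0, 2.0), t_final=1.0)
```

Up to discretization error, the first determinant stays equal to $1$, as (2.15) predicts, while the second follows $\exp(0.5\,t)$, which is what the identity $\frac{d}{dt} \det \nabla_x g = \operatorname{div}(v)\, \det \nabla_x g$ gives for a field of constant divergence.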
A connection between Arnold's and Brenier's theorems. Thanks to Arnold's theorem, we know that the incompressible Euler equations correspond to the geodesic equations in the space $\mathrm{SDiff}(\Omega)$. We now recall that minimizing geodesics on manifolds can be found by considering the minimization problem (1.2). Thus, to find minimizing geodesics in $\mathrm{SDiff}(\Omega)$, one could consider the minimization problem
$$\inf \bigg\{ \int_0^1 \int_\Omega |\partial_t g(t,x)|^2\, dx\, dt \;\bigg|\; g(t) \in \mathrm{SDiff}(\Omega),\ g(0) = g_0,\ g(1) = g_1 \bigg\},$$
where $g_0, g_1 \in \mathrm{SDiff}(\Omega)$ are prescribed. This minimization problem is very challenging and actually minimizers may fail to exist (see for instance [35]). Thus, we consider a simpler version of the problem. Namely, instead of searching for the minimizing geodesic from $g_0$ to $g_1$, we look for an approximate midpoint between them.

Recall that, given a $d$-dimensional manifold $M \subset \mathbb{R}^D$, and two points $x_0, x_1 \in M$, a good approximation of the midpoint between them is found by considering the Euclidean midpoint $\frac{x_0 + x_1}{2}$ (note that this point may not belong to $M$) and then finding the closest point to $\frac{x_0 + x_1}{2}$ on $M$, that is, $\operatorname{proj}_M \big( \frac{x_0 + x_1}{2} \big)$. By analogy, given $g_0, g_1 \in \mathrm{SDiff}(\Omega)$, one looks for the closest function (with respect to the $L^2$-norm) in $\mathrm{SDiff}(\Omega)$ to $\frac{g_0 + g_1}{2}$. Thus, we want to study the map
$$\operatorname{proj}_{\mathrm{SDiff}}\colon L^2(\Omega; \mathbb{R}^d) \to \mathrm{SDiff}(\Omega), \qquad \frac{g_0 + g_1}{2} \mapsto \operatorname{proj}_{\mathrm{SDiff}} \Big( \frac{g_0 + g_1}{2} \Big).$$
Even this simpler problem is far from trivial, the main difficulty being that $\mathrm{SDiff}(\Omega)$ is neither convex nor closed in $L^2(\Omega; \mathbb{R}^d)$. So, as a first relaxation of the problem, one might want to consider the $L^2$-closure of $\mathrm{SDiff}(\Omega)$. This closure is characterized in the next (nontrivial) result due to Brenier and Gangbo [21].

Theorem 2.5.16. Let $\Omega \subset \mathbb{R}^d$ be a bounded set with Lipschitz boundary, and let $d \ge 2$. Then
$$\overline{\mathrm{SDiff}(\Omega)}^{L^2} = S(\Omega) := \{ s\colon \Omega \to \Omega \mid s_\# dx = dx \}.$$

The next result gives a sufficient condition for the existence and uniqueness of the projection of a map $h \in L^2(\Omega; \mathbb{R}^d)$ in $S(\Omega)$.

Theorem 2.5.17 ([20]). Let $h \in L^2(\Omega; \mathbb{R}^d)$ satisfy $h_\#(dx|_\Omega) \ll dx$. Then,
(a) there exists a unique projection $\bar s$ onto $S(\Omega)$ (i.e., for any $s \in S(\Omega)$ it holds that $\|h - \bar s\|_{L^2(\Omega)} \le \|h - s\|_{L^2(\Omega)}$);
(b) there exists a convex function $\psi$ such that $h = \nabla\psi \circ \bar s$ (this formula is called polar decomposition; see Remark 2.5.18).

Proof. Let us prove the two statements independently.

(a) First we prove the existence of the projection and then its uniqueness. Take $h\colon \Omega \to \mathbb{R}^d$ and define $\mu := h_\#(dx|_\Omega) \ll dx$. Note that
$$\int_{\mathbb{R}^d} d\mu = \int_{h(\Omega)} d\mu = \int_\Omega dx = |\Omega|.$$
So, although Brenier's theorem (Theorem 2.5.10) holds for probability measures, up to multiplying both $\mu$ and $dx|_\Omega$ by $\frac{1}{|\Omega|}$, we can apply it in this context also. Thus, by Corollary 2.5.13, there exist convex functions $\varphi, \psi\colon \mathbb{R}^d \to \mathbb{R}$ such that $\nabla\varphi$ and $\nabla\psi$ are optimal from $\mu$ to $dx|_\Omega$ and vice versa. Let $\bar s := \nabla\varphi \circ h\colon \Omega \to \Omega$. Then, by the optimality of $\nabla\varphi$,
$$\int_\Omega |\bar s(x) - h(x)|^2\, dx = \int_\Omega |\nabla\varphi \circ h - h|^2\, dx = \int_{\mathbb{R}^d} |\nabla\varphi - \mathrm{Id}|^2\, d\mu = \min_{\gamma \in \Gamma(\mu, dx|_\Omega)} \int_{\mathbb{R}^d \times \Omega} |x-y|^2\, d\gamma. \tag{2.18}$$
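We now observe that if $s \in S(\Omega)$ then $\gamma_s := (h \times s)_\#(dx|_\Omega)$ belongs to $\Gamma(\mu, dx|_\Omega)$. Indeed, $(\pi_X)_\# \gamma_s = h_\#(dx|_\Omega) = \mu$ and $(\pi_Y)_\# \gamma_s = s_\#(dx|_\Omega) = dx|_\Omega$. This implies that
$$\min_{\gamma \in \Gamma(\mu, dx|_\Omega)} \int_{\mathbb{R}^d \times \Omega} |x-y|^2\, d\gamma \le \min_{s \in S(\Omega)} \int_{\mathbb{R}^d \times \Omega} |x-y|^2\, d\gamma_s = \min_{s \in S(\Omega)} \int_\Omega |h(x) - s(x)|^2\, dx,$$
that combined with (2.18) yields
$$\int_\Omega |h(x) - \bar s(x)|^2\, dx \le \min_{s \in S(\Omega)} \int_\Omega |h(x) - s(x)|^2\, dx,$$
thus $\bar s$ is a projection.

Suppose that $\hat s$ is another projection. Then by the previous step it follows that $\gamma_{\bar s}$ and $\gamma_{\hat s}$ are both optimal couplings. Thus, by uniqueness (see Theorem 2.5.10) the transport plans are equal, therefore
$$\int_\Omega F(h(x), \hat s(x))\, dx = \int_\Omega F(h(x), \bar s(x))\, dx \qquad \forall\, F \in C_b(\mathbb{R}^d \times \mathbb{R}^d).$$
Choosing $F(x,y) = |\nabla\varphi(x) - y|^2$ and recalling that $\bar s = \nabla\varphi \circ h$, we conclude that
$$0 = \int_\Omega |\nabla\varphi \circ h - \bar s|^2\, dx = \int_\Omega |\bar s - \hat s|^2\, dx,$$
hence $\hat s = \bar s$, as desired.

(b) This follows from the fact that $\bar s = \nabla\varphi \circ h$ and $\nabla\psi = (\nabla\varphi)^{-1}$ (see Corollary 2.5.13).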
Remark 2.5.18. The polar decomposition can be seen (at least formally) as a generalization of some well-known results:

(a) Any matrix $M \in \mathbb{R}^{d \times d}$ can be decomposed as $S \cdot O$, with $S$ symmetric and positive semidefinite and $O$ orthogonal. To see this, take $h(x) = Mx$. Then $h = \nabla\varphi \circ \bar s$, with $\varphi(x) = \frac12 \langle x, Sx \rangle$ and $\bar s(x) = Ox$.

(b) Consider a smooth vector field $w\colon \mathbb{R}^d \to \mathbb{R}^d$, and let $h_t(x) := h(t,x)$ be the flow of $w$:
$$\begin{cases} \partial_t h(t,x) = w(h(t,x)); \\ h(0,x) = x. \end{cases}$$
Then
$$h_\varepsilon(x) = h_0(x) + \partial_t h_t(x)|_{t=0}\, \varepsilon + o(\varepsilon) = x + \varepsilon\, w(x) + o(\varepsilon).$$
Also, the polar decomposition of $h_\varepsilon$ yields
$$h_\varepsilon = \nabla\psi_\varepsilon \circ s_\varepsilon.$$
At least formally, since $h_\varepsilon(x)$ is a perturbation of $x$, it looks natural also to assume that $\nabla\psi_\varepsilon$ and $s_\varepsilon$ are perturbations of the identity map. More precisely, we suppose that
$$\psi_\varepsilon(x) = \frac{|x|^2}{2} + \varepsilon\, q(x) + o(\varepsilon), \qquad s_\varepsilon(x) = x + \varepsilon\, u(x) + o(\varepsilon).$$
Also, since $\det \nabla s_\varepsilon = \det(\mathrm{Id} + \varepsilon \nabla u + o(\varepsilon)) = 1 + \varepsilon \operatorname{div}(u) + o(\varepsilon)$ and $s_\varepsilon$ is measure preserving (hence $1 \equiv \det \nabla s_\varepsilon$), we deduce that $\operatorname{div}(u) = 0$. Hence, combining all these equations, we get
$$x + \varepsilon\, w(x) + o(\varepsilon) = h_\varepsilon = \nabla\psi_\varepsilon \circ s_\varepsilon = (x + \varepsilon \nabla q(x)) \circ (x + \varepsilon u(x)) + o(\varepsilon) = x + \varepsilon (u(x) + \nabla q(x)) + o(\varepsilon),$$
therefore $w = u + \nabla q$. In other words, this formally shows that any vector field $w$ can be written as the sum of a divergence-free vector field and a gradient, which is nothing but the Helmholtz decomposition (2.17). Thus, essentially, Helmholtz decomposition is the infinitesimal version of the polar decomposition.
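Item (a) of Remark 2.5.18 can be made concrete for $2 \times 2$ matrices. The sketch below is an illustration under an extra assumption that is not part of the remark: it requires $\det M > 0$, in which case the orthogonal factor is a rotation given by a closed formula. The function names `polar_2x2` and `matmul` are hypothetical.

```python
import math

def polar_2x2(m):
    """Return (S, O) with m = S O, S symmetric positive definite, O a rotation.

    Closed form valid only under the assumption det(m) > 0.
    """
    (a, b), (c, d) = m
    if a * d - b * c <= 0:
        raise ValueError("this closed form assumes det(M) > 0")
    n = math.hypot(a + d, c - b)  # equals sqrt(|M|^2 + 2 det M) > 0
    cos_t, sin_t = (a + d) / n, (c - b) / n
    o = [[cos_t, -sin_t], [sin_t, cos_t]]
    # S = M O^T, which is symmetric for this choice of angle
    s = [[a * cos_t - b * sin_t, a * sin_t + b * cos_t],
         [c * cos_t - d * sin_t, c * sin_t + d * cos_t]]
    return s, o

def matmul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

M = [[2.0, 1.0], [0.5, 3.0]]
S, O = polar_2x2(M)
reconstructed = matmul(S, O)
```

With $\varphi(x) = \frac12 \langle x, Sx \rangle$ and $\bar s(x) = Ox$, one has $\nabla\varphi(\bar s(x)) = S O x = M x$, which is the polar decomposition of the linear map $h(x) = Mx$ in the sense of Theorem 2.5.17.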
2.6 General cost functions: Kantorovich duality

The goal of this section is to repeat, in the case of general costs, what we did in the previous sections for the case $c(x,y) = -x \cdot y$ on $X = Y = \mathbb{R}^d$. As we will see, some proofs are essentially identical provided that one introduces the correct definitions.

2.6.1 c-convexity and c-cyclical monotonicity

First we need a suitable analogue of the notion of convexity. Note that a possible way to define convex functions is as the suprema of affine functions. Namely, a function $\varphi\colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$ is convex if
$$\varphi(x) = \sup_{y \in \mathbb{R}^d} \{ x \cdot y + \lambda_y \}$$
for some choice of values $\{\lambda_y\}_{y \in \mathbb{R}^d}$ with $\lambda_y \in \mathbb{R} \cup \{-\infty\}$.$^{11}$ Having in mind that before $-x \cdot y = c(x,y)$, this suggests the following general definition (cf. Definition 2.5.1):

Definition 2.6.1. Given $X$ and $Y$ metric spaces, $c\colon X \times Y \to \mathbb{R}$, and $\varphi\colon X \to \mathbb{R} \cup \{+\infty\}$, we say that $\varphi$ is $c$-convex if
$$\varphi(x) = \sup_{y \in Y} \{ -c(x,y) + \lambda_y \}$$
for some $\{\lambda_y\}_{y \in Y} \subset \mathbb{R} \cup \{-\infty\}$. Then, for any $x \in X$, we define the $c$-subdifferential as
$$\partial_c \varphi(x) := \{ y \in Y \mid \varphi(z) \ge -c(z,y) + c(x,y) + \varphi(x) \quad \forall\, z \in X \}.$$
Also, we define $\partial_c \varphi := \bigcup_{x \in X} \{x\} \times \partial_c \varphi(x) \subset X \times Y$.

The following is an analogue of Theorem 2.5.3.

Theorem 2.6.2. A set $S \subset X \times Y$ is $c$-cyclically monotone if and only if there exists a $c$-convex function $\varphi$ such that $S \subset \partial_c \varphi$.

Proof. The proof is essentially the same as that of Theorem 2.5.3, provided one replaces $-x \cdot y$ with $c(x,y)$. We write the details for just one implication.

($\Leftarrow$) Let $(x_i, y_i)_{i=1,\ldots,N} \subset S \subset \partial_c \varphi$. Then
$$\varphi(z) \ge \varphi(x_i) - c(z, y_i) + c(x_i, y_i) \qquad \forall\, z \in X.$$
Choosing $z = x_{i+1}$ (with the convention $x_{N+1} = x_1$) and summing over $i$, we obtain
$$\sum_{i=1}^N \big( -c(x_{i+1}, y_i) + c(x_i, y_i) \big) \le 0.$$

($\Rightarrow$) It suffices to define
$$\varphi(x) := \sup_{N \ge 1} \sup \big\{ -c(x, y_N) + c(x_N, y_N) - c(x_N, y_{N-1}) + \cdots + c(x_0, y_0) \;\big|\; (x_i, y_i)_{i=1,\ldots,N} \subset S \big\}$$
and repeat the proof of Theorem 2.5.3 (see [72, 77–78] for the details).

$^{11}$ If one wants to avoid setting $\lambda_y = -\infty$ for some $y$, one can instead say that a function $\varphi$ is convex if there exist $A \subset \mathbb{R}^d$ and a family $\{\lambda_y\}_{y \in A} \subset \mathbb{R}$ such that
$$\varphi(x) = \sup_{y \in A} \{ x \cdot y + \lambda_y \}.$$
In other words, $A$ corresponds to the set $\{\lambda_y > -\infty\}$.
2.6.2 A general Kantorovich duality

In this subsection, $X$ and $Y$ are locally compact, separable, and complete metric spaces.

Definition 2.6.3. Given a $c$-convex function $\varphi\colon X \to \mathbb{R} \cup \{+\infty\}$, we define its $c$-Legendre transform $\varphi^c\colon Y \to \mathbb{R} \cup \{+\infty\}$ as
$$\varphi^c(y) := \sup_{x \in X} \{ -c(x,y) - \varphi(x) \}.$$

As in the case of the classical Legendre transform, $\varphi$ and $\varphi^c$ are related by several interesting properties.

Proposition 2.6.4. The following properties hold:
(a) $\varphi(x) + \varphi^c(y) + c(x,y) \ge 0$ for all $x \in X$, $y \in Y$;
(b) $\varphi(x) + \varphi^c(y) + c(x,y) = 0$ if and only if $y \in \partial_c \varphi(x)$.

Proof. The proof is identical to that of Proposition 2.5.5 and is left to interested readers.

One then obtains the following general duality result (also recall Remark 2.3.3).

Theorem 2.6.5 (Kantorovich duality: General case). Let $c \in C^0(X \times Y)$ be bounded from below, and assume that $\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} c\, d\gamma < +\infty$. Then
$$\min_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} c\, d\gamma = \max_{\varphi(x) + \psi(y) + c(x,y) \ge 0} \int_X -\varphi\, d\mu + \int_Y -\psi\, d\nu.$$
Also, in the max above, one can choose $\psi = \varphi^c$. Therefore
$$\min_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} c\, d\gamma = \max_{\varphi\ c\text{-convex}} \int_X -\varphi\, d\mu + \int_Y -\varphi^c\, d\nu.$$

Proof. Again, the steps are the same as in the proof of Theorem 2.5.6, just replacing convexity with $c$-convexity, subdifferential with $c$-subdifferential, etc. We refer readers to [72, 78–79] for the details.

In the following statement we consider the important case "cost = distance" in a metric space and we show how the general theory becomes simpler in this setting.

Proposition 2.6.6. Let $(X, d)$ be a metric space, let $Y = X$, and let the cost be the distance $c(x,y) = d(x,y)$. The following hold:
(a) A function $\varphi\colon X \to \mathbb{R} \cup \{+\infty\}$ is $c$-convex if and only if it is 1-Lipschitz, i.e.,
$$|\varphi(x) - \varphi(y)| \le d(x,y) \qquad \forall\, x, y \in X.$$
(b) If $\varphi$ is $c$-convex, then
$$y \in \partial_c \varphi(x) \quad\Longrightarrow\quad \varphi(y) - \varphi(x) = d(x,y).$$
(c) If $\varphi$ is $c$-convex, then $\varphi^c = -\varphi$.
(d) Given two probability measures $\mu, \nu \in \mathcal{P}(X)$ so that the transport cost is finite, that is, $\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} d(x,y)\, d\gamma < +\infty$, we have
$$\min_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} d(x,y)\, d\gamma(x,y) = \max_{\varphi\ 1\text{-Lipschitz}} \int_X \varphi\, d\nu - \int_X \varphi\, d\mu.$$
Proof. We prove the facts in the order they appear in the statement.

(a) If $\varphi$ is 1-Lipschitz then $\varphi(x) \ge \varphi(y) - d(x,y)$ with equality for $y = x$, hence
$$\varphi(x) = \sup_{y \in X} \{ \varphi(y) - d(x,y) \}, \tag{2.19}$$
which shows that $\varphi$ is $c$-convex (with the choice $\lambda_y = \varphi(y)$).

Vice versa, assume that $\varphi$ is $c$-convex and fix $x, y \in X$. By definition of $c$-convexity, for any $\varepsilon > 0$ there exists $z \in X$ such that
$$\varphi(x) \le -d(x,z) + \lambda_z + \varepsilon.$$
Moreover, once again by definition of $c$-convexity, we have
$$\varphi(y) \ge -d(y,z) + \lambda_z.$$
Combining these two inequalities and applying the triangle inequality, we obtain
$$\varphi(x) - \varphi(y) \le d(y,z) - d(x,z) + \varepsilon \le d(x,y) + \varepsilon.$$
Since $\varepsilon > 0$ can be chosen arbitrarily small, and the roles of $x$ and $y$ can be exchanged, this latter inequality implies that $\varphi$ is 1-Lipschitz.

(b) Note that
$$y \in \partial_c \varphi(x) \quad\Longrightarrow\quad \varphi(z) \ge -d(z,y) + d(x,y) + \varphi(x) \qquad \forall\, z.$$
Choosing $z = y$ we get $\varphi(y) \ge d(x,y) + \varphi(x)$. Since the converse inequality follows by (a), this proves (b).

(c) Thanks to (a), $\varphi$ is 1-Lipschitz and so is $-\varphi$. Hence we may apply (2.19) with $-\varphi$ instead of $\varphi$ and deduce $-\varphi = \varphi^c$ as desired.

(d) Thanks to Theorem 2.6.5,
$$\min_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} d(x,y)\, d\gamma = \max_{\varphi\ c\text{-convex}} \int_X -\varphi\, d\mu + \int_X -\varphi^c\, d\nu.$$
Since $\varphi^c = -\varphi$ (by (c)) and $c$-convexity is equivalent to 1-Lipschitzianity (by (a)), this implies that
$$\min_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times X} d(x,y)\, d\gamma = \max_{\varphi\ 1\text{-Lipschitz}} \int_X \varphi\, d\nu - \int_X \varphi\, d\mu.$$
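Parts (a) and (c) of Proposition 2.6.6 are easy to test on a finite metric space. The sketch below is only an illustration: the point set, the distance $d(x,y) = |x-y|$ on the real line, and the helper names `c_transform` and `is_one_lipschitz` are assumptions made for the example. It computes $\varphi^c$ by brute force and compares it with $-\varphi$, first for a 1-Lipschitz function and then for a function that is too steep.

```python
def c_transform(phi, dist, points):
    """phi^c(y) = max_x { -d(x, y) - phi(x) } over a finite point set."""
    return {y: max(-dist(x, y) - phi[x] for x in points) for y in points}

def is_one_lipschitz(phi, dist, points):
    """Check |phi(x) - phi(y)| <= d(x, y) for all pairs of points."""
    return all(abs(phi[x] - phi[y]) <= dist(x, y) + 1e-12
               for x in points for y in points)

points = [0.0, 0.5, 1.25, 3.0]

def dist(x, y):
    return abs(x - y)

# A 1-Lipschitz function: by part (c), its c-transform should equal -phi.
phi = {x: abs(x - 1.0) for x in points}
phi_c = c_transform(phi, dist, points)

# Formula (2.19): phi(x) = max_y { phi(y) - d(x, y) }.
sup_formula = {x: max(phi[y] - dist(x, y) for y in points) for x in points}

# A function with slope 3: not 1-Lipschitz, and its c-transform differs from -phi.
steep = {x: 3.0 * abs(x - 1.0) for x in points}
steep_c = c_transform(steep, dist, points)
```

For the 1-Lipschitz function the identities $\varphi^c = -\varphi$ and (2.19) hold at every point, while for the steep function $\varphi^c(3) = -2.5$ differs from $-\varphi(3) = -6$, which is consistent with the fact that such a function is not $c$-convex.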
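Remark 2.6.7. A popular alternative approach to Kantorovich duality is based on general abstract results in convex analysis, and goes as follows:
$$\begin{aligned}
\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} c(x,y)\, d\gamma(x,y)
&\overset{\heartsuit}{=} \inf_{\gamma \ge 0} \sup_{\varphi, \psi} \bigg\{ \int_{X \times Y} c(x,y)\, d\gamma(x,y) + \underbrace{\int_{X \times Y} \varphi(x)\, d\gamma(x,y) - \int_X \varphi(x)\, d\mu(x)}_{\text{Lagrange multiplier for } (\pi_X)_\# \gamma = \mu} \\
&\qquad\qquad\qquad + \underbrace{\int_{X \times Y} \psi(y)\, d\gamma(x,y) - \int_Y \psi(y)\, d\nu(y)}_{\text{Lagrange multiplier for } (\pi_Y)_\# \gamma = \nu} \bigg\} \\
&\overset{\clubsuit}{=} \sup_{\varphi, \psi} \inf_{\gamma \ge 0} \bigg\{ \int_X -\varphi\, d\mu + \int_Y -\psi\, d\nu + \int_{X \times Y} \big( c(x,y) + \varphi(x) + \psi(y) \big)\, d\gamma \bigg\} \\
&= \sup_{\varphi, \psi} \bigg\{ \int_X -\varphi\, d\mu + \int_Y -\psi\, d\nu + \inf_{\gamma \ge 0} \int_{X \times Y} \big( c(x,y) + \varphi(x) + \psi(y) \big)\, d\gamma \bigg\} \\
&\overset{\diamondsuit}{=} \sup_{\varphi(x) + \psi(y) + c(x,y) \ge 0} \int_X -\varphi\, d\mu + \int_Y -\psi\, d\nu,
\end{aligned}$$
where the equalities are justified as follows:

($\heartsuit$) We do not require $\gamma$ to be a probability anymore, and we also drop the coupling constraint; only the sign constraint $\gamma \ge 0$ remains. The other constraints are "hidden" in the Lagrange multipliers. Indeed the supremum over $\varphi$ is $+\infty$ if $(\pi_X)_\# \gamma \ne \mu$ (resp. the supremum over $\psi$ is $+\infty$ if $(\pi_Y)_\# \gamma \ne \nu$). Note also that once $(\pi_X)_\# \gamma = \mu$ (or $(\pi_Y)_\# \gamma = \nu$) then $\int 1\, d\gamma = \int 1\, d\mu = 1$, which implies that $\gamma \in \mathcal{P}(X \times Y)$.

($\clubsuit$) We used [71, Thm. 1.9] to exchange inf and sup.

($\diamondsuit$) We have the following two possible situations:
• If $c(x,y) + \varphi(x) + \psi(y) \ge 0$ for any $(x,y)$, then $\inf_{\gamma \ge 0} \int (\ldots)\, d\gamma = 0$ (take $\gamma \equiv 0$).
• If there exists $(\bar x, \bar y)$ such that $c(\bar x, \bar y) + \varphi(\bar x) + \psi(\bar y) < 0$, then take $\gamma = M \delta_{(\bar x, \bar y)}$ and let $M \to +\infty$. So, unless $c(x,y) + \varphi(x) + \psi(y) \ge 0$ for any $(x,y)$, the infimum over $\gamma$ is $-\infty$.

We refer readers to [71, Chap. 1] for a detailed discussion of this approach.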
We conclude this section by noticing that, as a consequence of the previous results, we obtain the following corollary (cf. Corollary 2.5.8):

Corollary 2.6.8. Let $c \in C^0(X \times Y)$ be bounded from below, and assume that $\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} c\, d\gamma < +\infty$. For a coupling $\bar\gamma \in \Gamma(\mu,\nu)$, the following statements are equivalent:
(a) $\bar\gamma$ is optimal;
(b) $\operatorname{supp}(\bar\gamma)$ is $c$-cyclically monotone;
(c) there exists a $c$-convex function $\varphi\colon X \to \mathbb{R} \cup \{+\infty\}$ such that $\operatorname{supp}(\bar\gamma) \subset \partial_c \varphi$.

2.7 General cost functions: Existence and uniqueness of optimal transport maps

Thanks to the previous results, we can now mimic the proof of Brenier's theorem (Theorem 2.5.10) to prove the existence and uniqueness of optimal transport maps. Since the proof of Theorem 2.5.10 involves taking derivatives, one needs $X$ to have a differentiable structure. For simplicity we will prove the result when $X = Y = \mathbb{R}^d$, but the same argument can be generalized to the case when $X$ is an arbitrary Riemannian manifold and $Y$ has no differentiable structure. Also, to simplify the argument, we will assume that $\operatorname{supp}(\nu)$ is compact. For more general statements, we refer to [72, Chaps. 9–10].

Theorem 2.7.1. Let $X = Y = \mathbb{R}^d$, $\mu \ll dx$, and $\operatorname{supp}(\nu)$ be compact. Let $c \in C^0(X \times Y)$ be bounded from below, and assume that $\inf_{\gamma \in \Gamma(\mu,\nu)} \int_{X \times Y} c\, d\gamma < +\infty$. Also, suppose that
• for every $y \in \operatorname{supp}(\nu)$, the map $\mathbb{R}^d \ni x \mapsto c(x,y)$ is differentiable;
• for every $x \in \mathbb{R}^d$, the map $\operatorname{supp}(\nu) \ni y \mapsto \nabla_x c(x,y) \in \mathbb{R}^d$ is injective;
• for every $y \in \operatorname{supp}(\nu)$ and $R > 0$, $|\nabla_x c(x,y)| \le C_R$ for every $x \in B_R$ (where $C_R$ depends only on $R$ and not on $y$).

Then there exists a unique optimal coupling $\bar\gamma$, with $\bar\gamma = (\mathrm{Id} \times T)_\# \mu$ and $T$ satisfying
$$\nabla_x c(x,y)|_{y = T(x)} + \nabla\varphi(x) = \nabla_x c(x, T(x)) + \nabla\varphi(x) = 0,$$
for some $c$-convex function $\varphi\colon \mathbb{R}^d \to \mathbb{R} \cup \{+\infty\}$.

Remark 2.7.2. For $c(x,y) = -x \cdot y$ we have $\nabla_x c(x,y) = -y$, thus the map $y \mapsto \nabla_x c(x,y) = -y$ is injective. Also, $c$-convex functions are the same as convex functions, and $\nabla\varphi(x) + \nabla_x c(x, T(x)) = 0$ implies that
$$\nabla\varphi(x) - T(x) = 0,$$
therefore $T = \nabla\varphi$. Hence, Theorem 2.7.1 covers Theorem 2.5.10 (the only extra assumption is that now $\operatorname{supp}(\nu)$ is assumed to be compact).

Proof. Let $\bar\gamma$ be optimal, let $\varphi$ be as in Corollary 2.6.8(c), and define $\varphi^c$ as in Definition 2.6.3. Note that, as a consequence of Corollary 2.6.8(c) and Proposition 2.6.4(b), it follows that
$$\varphi(x) + \varphi^c(y) + c(x,y) = 0 \qquad \forall\, (x,y) \in \operatorname{supp}(\bar\gamma). \tag{2.20}$$
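In the proof of Theorem 2.5.10 we used that convex functions are differentiable almost everywhere. Here, in order to obtain the almost everywhere differentiability of $\varphi$, we would like to show that it is locally Lipschitz. In general this is not clear for $\varphi$ itself, but we can show that we can replace it with another function $\tilde\varphi$ that is locally Lipschitz. Indeed, define
$$\tilde\varphi(x) := \sup_{y \in \operatorname{supp}(\nu)} \{ -c(x,y) - \varphi^c(y) \} = \sup_{y \in Y} \{ -c(x,y) + \lambda_y \}, \qquad \lambda_y := \begin{cases} -\varphi^c(y) & \text{if } y \in \operatorname{supp}(\nu); \\ -\infty & \text{if } y \notin \operatorname{supp}(\nu). \end{cases}$$
Note that $\tilde\varphi$ is $c$-convex. Also, since $-c(x,y) - \varphi^c(y) \le \varphi(x)$ for every $x, y$ (see Proposition 2.6.4(a)), we have $\tilde\varphi \le \varphi$. On the other hand, it follows immediately from the definition of $\tilde\varphi$ that
$$\tilde\varphi(x) + \varphi^c(y) + c(x,y) \ge 0 \qquad \forall\, (x,y) \in \mathbb{R}^d \times \operatorname{supp}(\nu). \tag{2.21}$$
Hence, since $\operatorname{supp}(\bar\gamma) \subset \mathbb{R}^d \times \operatorname{supp}(\nu)$ (because $(\pi_Y)_\# \bar\gamma = \nu$) and using (2.20), it follows that
$$0 \le \tilde\varphi(x) + \varphi^c(y) + c(x,y) \le \varphi(x) + \varphi^c(y) + c(x,y) = 0 \qquad \forall\, (x,y) \in \operatorname{supp}(\bar\gamma),$$
thus
$$\tilde\varphi(x) + \varphi^c(y) + c(x,y) = 0 \qquad \forall\, (x,y) \in \operatorname{supp}(\bar\gamma). \tag{2.22}$$
We now claim that $\tilde\varphi$ is locally Lipschitz. Indeed, for each $y \in \operatorname{supp}(\nu)$, consider the map
$$\mathbb{R}^d \ni x \mapsto -c(x,y) - \varphi^c(y).$$
The gradient is given by $-\nabla_x c(x,y)$, which (by our assumption) is uniformly bounded by $C_R$ for $x \in B_R$. Thus, for any $R > 0$ the maps
$$B_R \ni x \mapsto -c(x,y) - \varphi^c(y)$$
are $C_R$-Lipschitz, and therefore also the map $\tilde\varphi$ (being their supremum) is $C_R$-Lipschitz inside $B_R$ for any $R > 0$. This proves the claim.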
Since locally Lipschitz maps are differentiable almost everywhere, there exists a set $A$, with $|A| = 0$, such that $\tilde\varphi$ is differentiable on $\mathbb{R}^d \setminus A$. Also, since $\mu \ll dx$, we have $\mu(A) = 0$. Now, fix $(x,y) \in \operatorname{supp}(\bar\gamma)$ with $x \notin A$. Then it follows from (2.21) and (2.22) that the function
$$z \mapsto \tilde\varphi(z) + \varphi^c(y) + c(z,y)$$
attains its minimum at $z = x$, therefore
$$\nabla\tilde\varphi(x) + \nabla_x c(x,y) = 0.$$
Since $y \mapsto \nabla_x c(x,y)$ is injective, the equation above has at most one solution. Thus $y$ is uniquely determined in terms of $x$, and we call this unique point $T(x)$. Hence, we proved that $\operatorname{supp}(\bar\gamma) \cap [(\mathbb{R}^d \setminus A) \times \operatorname{supp}(\nu)] \subset \operatorname{graph}(T)$. As in the proof of Theorem 2.5.10, since $\bar\gamma(A \times \operatorname{supp}(\nu)) \le \bar\gamma(A \times \mathbb{R}^d) = \mu(A) = 0$ we conclude that $\bar\gamma = (\mathrm{Id} \times T)_\# \mu$.

Finally, uniqueness follows by the same argument as in step (4) of the proof of Theorem 2.5.10. More precisely, if $\gamma_1$ and $\gamma_2$ are optimal then so is $\frac{\gamma_1 + \gamma_2}{2}$. Then
$$\operatorname{graph}(T_1) \cup \operatorname{graph}(T_2) \overset{\text{a.e.}}{=} \operatorname{supp}(\gamma_1) \cup \operatorname{supp}(\gamma_2) = \operatorname{supp} \Big( \frac{\gamma_1 + \gamma_2}{2} \Big) \overset{\text{a.e.}}{=} \operatorname{graph}(\bar T)$$
for some map $\bar T$, and this is only possible if $T_1 = T_2$ $\mu$-a.e.

Example 2.7.3. Let $c(x,y) = |x-y|^p$, with $p > 1$. We want to show that Theorem 2.7.1 applies. We only prove that the map $y \mapsto \nabla_x c(x,y)$ is injective, since the other assumptions on $c$ are easily checked. To show this, fix $x, v \in \mathbb{R}^d$ and assume that $v = \nabla_x c(x,y) = p |x-y|^{p-2} (x-y)$. Since $|x-y|^{p-2}$ is positive, we deduce that the vectors $v$ and $(x-y)$ are parallel and point in the same direction, hence
$$\frac{x-y}{|x-y|} = \frac{v}{|v|}.$$
We also know that
$$|v| = p |x-y|^{p-1} \quad\Longrightarrow\quad |x-y| = \Big( \frac{|v|}{p} \Big)^{\frac{1}{p-1}}.$$
Combining these two facts, we deduce that
$$x - y = \frac{v}{|v|} |x-y| = \frac{v}{|v|} \Big( \frac{|v|}{p} \Big)^{\frac{1}{p-1}},$$
and therefore $y = x - \frac{v}{|v|} \big( \frac{|v|}{p} \big)^{\frac{1}{p-1}}$, which proves that $y$ is unique.
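The inversion formula of Example 2.7.3 can be checked directly. The sketch below is an illustration with hypothetical function names: `grad_x_cost` evaluates $v = \nabla_x c(x,y)$ for $c(x,y) = |x-y|^p$ (assuming $x \ne y$), and `recover_target` applies $y = x - \frac{v}{|v|} \big( \frac{|v|}{p} \big)^{\frac{1}{p-1}}$ (assuming $v \ne 0$ and $p > 1$).

```python
import math

def grad_x_cost(x, y, p):
    """grad_x |x - y|^p = p |x - y|^(p-2) (x - y), assuming x != y."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    r = math.sqrt(sum(t * t for t in diff))
    return [p * r ** (p - 2) * t for t in diff]

def recover_target(x, v, p):
    """Invert v = grad_x c(x, y) for p > 1 and v != 0, following Example 2.7.3."""
    norm_v = math.sqrt(sum(t * t for t in v))
    radius = (norm_v / p) ** (1.0 / (p - 1))  # this is |x - y|
    return [xi - radius * vi / norm_v for xi, vi in zip(x, v)]

# Hypothetical sample points in R^3, tested for two exponents p > 1.
x = [1.0, -2.0, 0.5]
y = [0.2, 1.0, 3.0]
recovered = {p: recover_target(x, grad_x_cost(x, y, p), p) for p in (1.5, 3.0)}
```

For both exponents the recovered point agrees with the original $y$ up to rounding, which is the injectivity of $y \mapsto \nabla_x c(x,y)$ in action.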
General cost functions: Existence and uniqueness of optimal maps
55
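Remark 2.7.4. We note that, for $p = 1$, the reasoning above fails. Indeed, given $x, v \in \mathbb{R}^d$, the relation
$$v = \nabla_x c(x, y) = \frac{x - y}{|x - y|}$$
implies that necessarily $|v| = 1$, and under this condition the relation is satisfied by every $y = x - tv$ with $t > 0$, which shows that $y$ is not unique. In fact, the previous theorem is false for the cost $c(x, y) = |x - y|$. To see this, consider $d = 1$ and take the measures $\mu = dx|_{[0,1]}$ and $\nu = dx|_{[1,2]}$. As shown in [71, Rem. 2.19(iii)] or [64, Prop. 2.7], the optimal transportation cost between these two densities is given by
$$\int_{\mathbb{R}} |F_\mu(x) - F_\nu(x)|\,dx, \qquad\text{where } F_\lambda(x) := \lambda\bigl((-\infty, x]\bigr).$$
In this particular case, one can check that $\int_{\mathbb{R}} |F_\mu(x) - F_\nu(x)|\,dx = 1$. Noticing that $T_1(x) := x + 1$ and $T_2(x) := 2 - x$ satisfy
$$(T_i)_\# \mu = \nu, \qquad \int_{\mathbb{R}} |T_i(x) - x|\,d\mu(x) = 1, \qquad i = 1, 2,$$
we deduce that both maps $T_1$ and $T_2$ are optimal, so we have no uniqueness. In addition, if we define $\gamma_i := (\mathrm{Id} \times T_i)_\# \mu$ ($i = 1, 2$), then $\frac{\gamma_1 + \gamma_2}{2}$ is optimal and it is not induced by a graph.

Remark 2.7.5. Consider again the cost $c(x, y) = |x - y|$ on $\mathbb{R} \times \mathbb{R}$, and let $\mu, \nu \in \mathcal{P}(\mathbb{R})$. Assume that
$$x \le y \qquad \forall\, x \in \operatorname{supp}(\mu),\ y \in \operatorname{supp}(\nu).$$
Then for any coupling $\gamma$ (resp., any transport map $T$) we have
$$x \le y \quad \forall\,(x, y) \in \operatorname{supp}(\gamma) \qquad \bigl(\text{resp. } x \le T(x) \quad \forall\, x \in \operatorname{supp}(\mu)\bigr).$$
Hence
$$\int_{\mathbb{R}\times\mathbb{R}} |y - x|\,d\gamma = \int_{\mathbb{R}\times\mathbb{R}} (y - x)\,d\gamma = \int_{\mathbb{R}\times\mathbb{R}} y\,d\gamma - \int_{\mathbb{R}\times\mathbb{R}} x\,d\gamma = \int_{\mathbb{R}} y\,d\nu - \int_{\mathbb{R}} x\,d\mu$$
and analogously
$$\int_{\mathbb{R}} |T(x) - x|\,d\mu = \int_{\mathbb{R}} (T(x) - x)\,d\mu = \int_{\mathbb{R}} T(x)\,d\mu - \int_{\mathbb{R}} x\,d\mu = \int_{\mathbb{R}} y\,d\nu - \int_{\mathbb{R}} x\,d\mu.$$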
In other words, the cost is independent of the coupling or of the transport map, and therefore every coupling/map is optimal.
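The non-uniqueness in Remark 2.7.4 can be checked numerically. The following is a minimal sketch, assuming a plain midpoint quadrature rule (the helper `midpoint_integral` and the function names are illustrative, not part of the text): it evaluates the cost of the translation $T_1$, the cost of the reflection $T_2$, and the cumulative-distribution formula for the optimal cost.

```python
def midpoint_integral(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def T1(x):
    """Translation: sends [0, 1] onto [1, 2]."""
    return x + 1.0


def T2(x):
    """Reflection: also sends [0, 1] onto [1, 2]."""
    return 2.0 - x


def F_mu(x):
    """Cumulative distribution function of the uniform measure on [0, 1]."""
    return min(max(x, 0.0), 1.0)


def F_nu(x):
    """Cumulative distribution function of the uniform measure on [1, 2]."""
    return min(max(x - 1.0, 0.0), 1.0)


# Transport costs for c(x, y) = |x - y|, integrating against mu = dx on [0, 1].
cost_T1 = midpoint_integral(lambda x: abs(T1(x) - x), 0.0, 1.0)
cost_T2 = midpoint_integral(lambda x: abs(T2(x) - x), 0.0, 1.0)
# Optimal cost via the cumulative distribution functions.
cdf_cost = midpoint_integral(lambda x: abs(F_mu(x) - F_nu(x)), 0.0, 2.0)

print(cost_T1, cost_T2, cdf_cost)
```

All three numbers agree up to rounding, which is consistent with both maps being optimal for $p = 1$.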
Chapter 3
Wasserstein distances and gradient flows

The goal of this chapter is to show a surprising connection between optimal transport, gradient flows, and PDEs. More precisely, after introducing Wasserstein distances, we will first give a brief general introduction to gradient flows in Hilbert spaces. Then, following the seminal approach of Jordan, Kinderlehrer, and Otto [51] (which is now known as the JKO scheme), we are going to prove that the gradient flow of the entropy functional in the Wasserstein space coincides with the heat equation. Our general treatment of gradient flows gives only a glimpse of the theory. We suggest the expository paper [65] for an introduction to the subject and the monograph [6] for a thorough study of gradient flows, both in the Hilbertian setting and in the vastly more general metric setting.
3.1 p-Wasserstein distances and geodesics

We are going to introduce the space of measures with finite $p$-moment and then a distance on this space induced by optimal transport.

Definition 3.1.1. Let $(X, d)$ be a locally compact and separable metric space. Given $1 \le p < \infty$, let
$$\mathcal{P}_p(X) := \Bigl\{ \mu \in \mathcal{P}(X) \;\Big|\; \int_X d(x, x_0)^p\,d\mu(x) < +\infty \text{ for some } x_0 \in X \Bigr\} \tag{3.1}$$
be the set of probability measures with finite $p$-moment.

Remark 3.1.2. Given $x_1 \in X$, by the triangle inequality and the convexity of $\mathbb{R}^+ \ni s \mapsto s^p$ we get$^{12}$
$$d(x, x_1)^p \le \bigl(d(x, x_0) + d(x_0, x_1)\bigr)^p \le 2^{p-1}\bigl(d(x, x_0)^p + d(x_0, x_1)^p\bigr).$$
Hence, if $\mu \in \mathcal{P}(X)$ satisfies $\int_X d(x, x_0)^p\,d\mu(x) < +\infty$, then also $\int_X d(x, x_1)^p\,d\mu(x)$ is finite. This means that the definition of $\mathcal{P}_p(X)$ is independent of the base point $x_0$.

Definition 3.1.3. Given $\mu, \nu \in \mathcal{P}_p(X)$, we define their $p$-Wasserstein distance as
$$W_p(\mu, \nu) := \Bigl( \inf_{\gamma \in \Gamma(\mu, \nu)} \int_{X \times X} d(x, y)^p\,d\gamma(x, y) \Bigr)^{\frac1p}.$$

$^{12}$ By convexity, given $a, b \ge 0$ we have $\bigl(\frac{a + b}{2}\bigr)^p \le \frac{a^p + b^p}{2}$, or equivalently $(a + b)^p \le 2^{p-1}(a^p + b^p)$.
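For a concrete feel of Definition 3.1.3, here is a small sketch that evaluates $W_p$ between uniform measures supported on the same number of points of $\mathbb{R}^2$. It relies on an assumption not stated in the text: for two uniform measures on $n$ atoms each, the infimum over couplings is attained at a permutation (a consequence of the Birkhoff–von Neumann theorem), so a brute-force search over permutations suffices for tiny $n$. The function name `wasserstein_uniform` and the sample points are illustrative.

```python
import math
from itertools import permutations


def wasserstein_uniform(xs, ys, p=2):
    """W_p between the uniform measures on the point lists xs and ys.

    Brute force over all permutations, so only usable for a handful of points.
    """
    n = len(xs)
    if n != len(ys):
        raise ValueError("both measures need the same number of atoms")
    best = min(
        sum(math.dist(xs[i], ys[sigma[i]]) ** p for i in range(n))
        for sigma in permutations(range(n))
    )
    # Each atom carries mass 1/n.
    return (best / n) ** (1.0 / p)


mu1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
mu2 = [(2.0, 0.0), (2.0, 1.0), (3.0, 1.0)]
mu3 = [(0.0, 3.0), (1.0, 3.0), (1.0, 4.0)]

w12 = wasserstein_uniform(mu1, mu2)
w23 = wasserstein_uniform(mu2, mu3)
w13 = wasserstein_uniform(mu1, mu3)
print(w12, w23, w13)
```

The factorial cost of the search is the price of keeping the sketch dependency-free; for larger problems one would use a linear-programming or assignment solver instead.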
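Remark 3.1.4. If $\mu, \nu \in \mathcal{P}_p(X)$ then for all $\gamma \in \Gamma(\mu, \nu)$ it holds that
$$\int_{X \times X} d(x, y)^p\,d\gamma \le 2^{p-1}\int_{X \times X} \bigl(d(x, x_0)^p + d(x_0, y)^p\bigr)\,d\gamma = 2^{p-1}\Bigl(\int_X d(x, x_0)^p\,d\mu + \int_X d(y, x_0)^p\,d\nu\Bigr) < \infty.$$
Hence $W_p$ is finite on $\mathcal{P}_p(X) \times \mathcal{P}_p(X)$.

To justify the terminology "$p$-Wasserstein distance," we now prove the following result.

Theorem 3.1.5. $W_p$ is a distance on the space $\mathcal{P}_p(X)$.

Proof. As we will see, the most delicate part of the proof consists in proving the triangle inequality.

If $W_p(\mu, \nu) = 0$, then (thanks to Theorem 2.3.2) there exists $\bar\gamma \in \Gamma(\mu, \nu)$ such that
$$\int_{X \times X} d(x, y)^p\,d\bar\gamma(x, y) = 0.$$
Thus $x = y$ $\bar\gamma$-a.e., which means that $\bar\gamma$ is concentrated on the graph of the identity map. Therefore $\bar\gamma = (\mathrm{Id} \times \mathrm{Id})_\# \mu$, which yields $\nu = (\pi_2)_\# \bar\gamma = \mu$.

We now prove that $W_p$ is symmetric. Indeed, given $\gamma \in \Gamma(\mu, \nu)$ optimal, define $\tilde\gamma := S_\# \gamma$, with $S(x, y) := (y, x)$. Then $\tilde\gamma \in \Gamma(\nu, \mu)$ and therefore, since $d(x, y) = d(y, x)$, we get
$$W_p(\nu, \mu)^p \le \int_{X \times X} d(x, y)^p\,d\tilde\gamma = \int_{X \times X} d(x, y)^p\,d\gamma = W_p(\mu, \nu)^p.$$
Exchanging the roles of $\mu$ and $\nu$ proves that $W_p(\mu, \nu) = W_p(\nu, \mu)$, as desired.

We now prove the triangle inequality. Let $\mu_1, \mu_2, \mu_3 \in \mathcal{P}_p(X)$, and let $\gamma_{12} \in \Gamma(\mu_1, \mu_2)$ and $\gamma_{23} \in \Gamma(\mu_2, \mu_3)$ be optimal couplings. Applying the disintegration theorem (recall Theorem 1.4.10) with respect to the variable $x_2$, we can write
$$\gamma_{12}(dx_1, dx_2) = \gamma_{12,x_2}(dx_1) \otimes \mu_2(dx_2) \quad\text{and}\quad \gamma_{23}(dx_2, dx_3) = \gamma_{23,x_2}(dx_3) \otimes \mu_2(dx_2).$$
Consider the measure $\tilde\gamma \in \mathcal{P}(X \times X \times X)$ given by
$$\tilde\gamma(dx_1, dx_2, dx_3) := \gamma_{12,x_2}(dx_1) \otimes \gamma_{23,x_2}(dx_3) \otimes \mu_2(dx_2).$$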
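This measure has the property that
$$\int_{X \times X \times X} \varphi(x_1, x_2)\,d\tilde\gamma(x_1, x_2, x_3) = \int_X \int_{X \times X} \varphi(x_1, x_2)\,\gamma_{12,x_2}(dx_1)\,\gamma_{23,x_2}(dx_3)\,d\mu_2(x_2) = \int_{X \times X} \varphi(x_1, x_2)\,d\gamma_{12}(x_1, x_2).$$
Similarly
$$\int_{X \times X \times X} \varphi(x_2, x_3)\,d\tilde\gamma(x_1, x_2, x_3) = \int_{X \times X} \varphi(x_2, x_3)\,d\gamma_{23}(x_2, x_3).$$
In other words, the measure $\tilde\gamma$ allows us to think of the couplings $\gamma_{12}$ and $\gamma_{23}$ as if they lived in a common space $X \times X \times X$, with $\gamma_{12}$ not depending on the third variable and $\gamma_{23}$ not depending on the first variable.$^{13}$ Note that we have
$$\int_{X \times X \times X} \psi(x_1)\,d\tilde\gamma(x_1, x_2, x_3) = \int_{X \times X} \psi(x_1)\,d\gamma_{12}(x_1, x_2) = \int_X \psi(x_1)\,d\mu_1(x_1) \tag{3.2}$$
and similarly
$$\int_{X \times X \times X} \psi(x_3)\,d\tilde\gamma(x_1, x_2, x_3) = \int_X \psi(x_3)\,d\mu_3(x_3). \tag{3.3}$$
Set $\bar\gamma_{13} := \int_X \tilde\gamma(x_1, dx_2, x_3)$, i.e., integrate $x_2$ out. Then, since
$$\int_{X \times X} \varphi(x_1, x_3)\,d\bar\gamma_{13} = \int_{X \times X \times X} \varphi(x_1, x_3)\,d\tilde\gamma(x_1, x_2, x_3),$$
it follows from (3.2) and (3.3) that $\bar\gamma_{13} \in \Gamma(\mu_1, \mu_3)$. Thus, by the triangle inequality in $L^p(X \times X \times X, \tilde\gamma)$ we have
$$\begin{aligned}
W_p(\mu_1, \mu_3) &\le \Bigl(\int_{X \times X} d(x_1, x_3)^p\,d\bar\gamma_{13}(x_1, x_3)\Bigr)^{\frac1p} = \Bigl(\int_{X \times X \times X} d(x_1, x_3)^p\,d\tilde\gamma(x_1, x_2, x_3)\Bigr)^{\frac1p} \\
&= \|d(x_1, x_3)\|_{L^p(\tilde\gamma)} \le \|d(x_1, x_2) + d(x_2, x_3)\|_{L^p(\tilde\gamma)} \\
&\le \|d(x_1, x_2)\|_{L^p(\tilde\gamma)} + \|d(x_2, x_3)\|_{L^p(\tilde\gamma)} \\
&= \|d(x_1, x_2)\|_{L^p(\gamma_{12})} + \|d(x_2, x_3)\|_{L^p(\gamma_{23})} = W_p(\mu_1, \mu_2) + W_p(\mu_2, \mu_3),
\end{aligned}$$
where the last equality follows from the optimality of $\gamma_{12}$ and $\gamma_{23}$. This concludes the proof.

$^{13}$ This whole construction above is sometimes called the gluing lemma.

Being a distance, $W_p$ induces a topology on $\mathcal{P}_p(X)$. In the next theorem we show the connection between the Wasserstein topology and the weak-$*$ topology.

Theorem 3.1.6. Fix an exponent $1 \le p < \infty$ and a base point $x_0 \in X$. Let $(\mu_n)_{n \in \mathbb{N}} \subset \mathcal{P}_p(X)$ be a sequence of probability measures, and let $\mu \in \mathcal{P}_p(X)$. The following statements are equivalent:
(a) $\mu_n \rightharpoonup^* \mu$ and $\int_X d(x_0, x)^p\,d\mu_n \to \int_X d(x_0, x)^p\,d\mu$.
(b) $W_p(\mu_n, \mu) \to 0$.

Proof. We prove the two implications independently.

(a) $\Rightarrow$ (b) Fix $\delta > 0$, and define
$$M_n := \int_X \bigl(1 + d(x_0, x)^p\bigr)\,d\mu_n(x), \qquad M := \int_X \bigl(1 + d(x_0, x)^p\bigr)\,d\mu(x).$$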
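Since $\mu_n$ and $\mu$ are probability measures and $\int_X d(x_0, x)^p\,d\mu_n \to \int_X d(x_0, x)^p\,d\mu$, it follows that $M_n \to M$ as $n \to \infty$. Define the probability measures
$$\nu_n := \frac{1}{M_n}\bigl(1 + d(x_0, x)^p\bigr)\mu_n, \qquad \nu := \frac{1}{M}\bigl(1 + d(x_0, x)^p\bigr)\mu.$$
Since $\mu_n \rightharpoonup^* \mu$ and $x \mapsto 1 + d(x_0, x)^p$ is a continuous function, we easily deduce that $\nu_n \rightharpoonup^* \nu$.$^{14}$ Therefore, Lemma 2.1.13 implies that $\nu_n$ converges narrowly to $\nu$ as $n \to \infty$. Thus, by Theorem 2.1.11, we can find a compact set $K \subset X$ such that, for all $n \in \mathbb{N}$,
$$\frac{1}{M_n}\int_{X \setminus K} \bigl(1 + d(x_0, x)^p\bigr)\,d\mu_n(x) \le \delta, \qquad \frac{1}{M}\int_{X \setminus K} \bigl(1 + d(x_0, x)^p\bigr)\,d\mu(x) \le \delta.$$

$^{14}$ Indeed, given $\varphi \in C_c(X)$, the function $X \ni x \mapsto \bigl(1 + d(x_0, x)^p\bigr)\varphi(x)$ is continuous and compactly supported. Hence, since $\mu_n \rightharpoonup^* \mu$ and $M_n \to M$, it follows that
$$\int_X \varphi\,d\nu_n = \frac{1}{M_n}\int_X \bigl(1 + d(x_0, x)^p\bigr)\varphi(x)\,d\mu_n(x) \to \frac{1}{M}\int_X \bigl(1 + d(x_0, x)^p\bigr)\varphi(x)\,d\mu(x) = \int_X \varphi\,d\nu \qquad\text{as } n \to \infty.$$
Since $\varphi \in C_c(X)$ is arbitrary, this proves that $\nu_n \rightharpoonup^* \nu$.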
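Recalling that $M_n \to M$ as $n \to \infty$, this implies in particular that
$$\int_{X \setminus K} \bigl(1 + d(x_0, x)^p\bigr)\,d\mu_n(x) + \int_{X \setminus K} \bigl(1 + d(x_0, x)^p\bigr)\,d\mu(x) \le 3M\delta \qquad \forall\, n \gg 1. \tag{3.4}$$
Now, since by assumption the space $X$ is locally compact, we can find a finite family of nonnegative functions $(\varphi_i)_{i \in I} \subset C_c(X)$ such that
$$\sum_{i \in I} \varphi_i(x) = 1 \quad \forall\, x \in K, \qquad \sum_{i \in I} \varphi_i \le 1 \text{ in } X, \qquad \operatorname{diam}(\operatorname{supp}(\varphi_i)) \le \delta \quad \forall\, i \in I. \tag{3.5}$$
Set
$$\Lambda_{n,i} := \int_X \varphi_i\,d\mu_n, \qquad \Lambda_i := \int_X \varphi_i\,d\mu, \qquad \lambda_{n,i} := \min\bigl\{\Lambda_{n,i}, \Lambda_i\bigr\},$$
and define the measures
$$\alpha_{n,i} := \frac{\lambda_{n,i}}{\Lambda_{n,i}}\,\varphi_i\,\mu_n, \qquad \beta_{n,i} := \frac{\lambda_{n,i}}{\Lambda_i}\,\varphi_i\,\mu, \qquad \alpha_n := \mu_n - \sum_{i \in I} \alpha_{n,i}, \qquad \beta_n := \mu - \sum_{i \in I} \beta_{n,i}.$$
Note that $\alpha_{n,i}(X) = \beta_{n,i}(X) = \lambda_{n,i}$ and $\alpha_n(X) = \beta_n(X) = 1 - \sum_{i \in I} \lambda_{n,i}$. Then we define $\gamma_n \in \mathcal{P}(X \times X)$ as
$$\gamma_n := \sum_{i \in I} \frac{\alpha_{n,i} \otimes \beta_{n,i}}{\lambda_{n,i}} + \frac{\alpha_n \otimes \beta_n}{1 - \sum_{i \in I} \lambda_{n,i}}.$$
One can easily check that $\gamma_n$ is a transport plan from $\mu_n$ to $\mu$, i.e., $\gamma_n \in \Gamma(\mu_n, \mu)$. Also, since $\operatorname{diam}(\operatorname{supp}(\varphi_i)) \le \delta$,
$$\int_{X \times X} d(x, y)^p\,d\,\frac{\alpha_{n,i} \otimes \beta_{n,i}}{\lambda_{n,i}} = \int_{\operatorname{supp}(\varphi_i) \times \operatorname{supp}(\varphi_i)} d(x, y)^p\,d\,\frac{\alpha_{n,i} \otimes \beta_{n,i}}{\lambda_{n,i}} \le \lambda_{n,i}\,\delta^p. \tag{3.6}$$
Recalling that $\mu_n \rightharpoonup^* \mu$, we also have $\lambda_{n,i} \to \Lambda_i = \int \varphi_i\,d\mu$ as $n \to \infty$. Therefore, recalling (3.5) it follows from (3.4) that
$$\Bigl|1 - \sum_{i \in I} \lambda_{n,i}\Bigr| \le 4M\delta \quad\text{for } n \gg 1, \qquad \alpha_n(K) + \beta_n(K) \to 0 \quad\text{as } n \to \infty.$$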
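We now observe that
$$\int_{X \times X} d(x, y)^p\,d\,\frac{\alpha_n \otimes \beta_n}{1 - \sum_{i \in I} \lambda_{n,i}} \le 2^p \int_{X \times X} \bigl(d(x_0, x)^p + d(x_0, y)^p\bigr)\,d\,\frac{\alpha_n \otimes \beta_n}{1 - \sum_{i \in I} \lambda_{n,i}} = 2^p\Bigl(\int_X d(x_0, x)^p\,d\alpha_n(x) + \int_X d(x_0, x)^p\,d\beta_n(x)\Bigr). \tag{3.7}$$
Since $\alpha_n \le \mu_n$, $\beta_n \le \mu$, and $\alpha_n(K) \to 0$, $\beta_n(K) \to 0$, (3.4) implies that
$$\int_X d(x_0, x)^p\,d\alpha_n(x) + \int_X d(x_0, x)^p\,d\beta_n(x) \le 4M\delta \qquad\text{for } n \gg 1. \tag{3.8}$$
Hence, combining (3.6), (3.7), and (3.8), we finally deduce that
$$W_p(\mu_n, \mu)^p \le \int_{X \times X} d(x, y)^p\,d\gamma_n(x, y) \le \delta^p + 4M\,2^p\,\delta \qquad \forall\, n \gg 1.$$
Since $\delta > 0$ can be chosen arbitrarily small, this proves that $W_p(\mu_n, \mu) \to 0$ as $n \to \infty$.

(b) $\Rightarrow$ (a) Let $\gamma_n \in \Gamma(\mu_n, \mu)$ be an optimal transport plan with respect to the cost $c(x, y) = d(x, y)^p$. Applying the triangle inequality for the Wasserstein distance (recall Theorem 3.1.5) and using that $W_p(\mu_n, \mu) \to 0$, we have
$$\int_X d(x_0, x)^p\,d\mu_n = W_p(\delta_{x_0}, \mu_n)^p \to W_p(\delta_{x_0}, \mu)^p = \int_X d(x_0, x)^p\,d\mu.$$
It remains to show that $\mu_n \rightharpoonup^* \mu$. Let $\varphi \in C_c(X)$ be a compactly supported function, and let $\omega \colon [0, \infty) \to [0, \infty)$ be its modulus of continuity (i.e., $|\varphi(x) - \varphi(y)| \le \omega(d(x, y))$ for all $x, y \in X$). Given $\delta > 0$, we have
$$\begin{aligned}
\Bigl|\int_X \varphi\,d\mu_n - \int_X \varphi\,d\mu\Bigr| &\le \int_{X \times X} |\varphi(x) - \varphi(y)|\,d\gamma_n(x, y) \\
&\le \int_{\{d(x,y) \le \delta\}} \omega(\delta)\,d\gamma_n(x, y) + 2\|\varphi\|_\infty \int_{\{d(x,y) > \delta\}} d\gamma_n(x, y) \\
&\le \omega(\delta) + 2\|\varphi\|_\infty \int_{\{d(x,y) > \delta\}} \frac{d(x, y)^p}{\delta^p}\,d\gamma_n(x, y) \\
&\le \omega(\delta) + \frac{2\|\varphi\|_\infty}{\delta^p}\int_{X \times X} d(x, y)^p\,d\gamma_n(x, y) = \omega(\delta) + \frac{2\|\varphi\|_\infty}{\delta^p}\,W_p(\mu_n, \mu)^p.
\end{aligned}$$
By first letting $n \to \infty$ and then $\delta \to 0$, the last inequality implies that $\int_X \varphi\,d\mu_n \to \int_X \varphi\,d\mu$, concluding the proof.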
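The role of the moment condition in Theorem 3.1.6(a) can be seen on a standard example, sketched below under the assumption that one works on $X = \mathbb{R}$ with $\mu_n = (1 - \frac1n)\delta_0 + \frac1n\delta_n$ and $\mu = \delta_0$ (this example and all names in the code are illustrative, not taken from the text). The only coupling between a measure and a Dirac mass is the product measure, so $W_p(\mu_n, \delta_0)^p = \int |x|^p\,d\mu_n$ can be computed directly.

```python
def mu_n(n):
    """Atoms (position, weight) of (1 - 1/n) * delta_0 + (1/n) * delta_n."""
    return [(0.0, 1.0 - 1.0 / n), (float(n), 1.0 / n)]


def integrate(phi, atoms):
    """Integral of phi against a finitely supported measure."""
    return sum(weight * phi(x) for x, weight in atoms)


def wasserstein_to_dirac0(atoms, p):
    """W_p to delta_0: the unique coupling is the product measure."""
    return integrate(lambda x: abs(x) ** p, atoms) ** (1.0 / p)


def hat(x):
    """A continuous, compactly supported test function with hat(0) = 1."""
    return max(0.0, 1.0 - abs(x))


for n in (10, 100, 1000):
    atoms = mu_n(n)
    print(
        n,
        integrate(hat, atoms),            # tends to hat(0) = 1
        wasserstein_to_dirac0(atoms, 1),  # stays equal to 1
        wasserstein_to_dirac0(atoms, 2),  # grows like sqrt(n)
    )
```

Testing against a single function does not prove weak-$*$ convergence, but it illustrates the mechanism: a vanishing amount of mass escaping to infinity is invisible to compactly supported test functions while keeping the $p$-moment, and hence $W_p$, away from its limit value.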
Theorem 3.1.6 is particularly useful when the ambient space $X$ is compact (or, equivalently, when all measures $\mu_n \in \mathcal{P}(X)$ live inside a fixed compact set). Indeed, since in this case the function $d(x_0, \cdot)^p$ has compact support (because the whole space is compact), the convergence $\int_X d(x_0, x)^p\,d\mu_n \to \int_X d(x_0, x)^p\,d\mu$ is a consequence of the weak-$*$ convergence of $\mu_n$ to $\mu$. Hence we immediately deduce that, on compact sets, Wasserstein convergence is equivalent to weak-$*$ convergence.

Corollary 3.1.7. Let $X$ be compact, $p \ge 1$, $(\mu_n)_{n \in \mathbb{N}} \subset \mathcal{P}_p(X)$ a sequence of probability measures, and $\mu \in \mathcal{P}_p(X)$. Then
$$\mu_n \rightharpoonup^* \mu \quad\Longleftrightarrow\quad W_p(\mu_n, \mu) \to 0.$$

3.1.1 Construction of geodesics

Let $X = \mathbb{R}^d$ and let $\gamma \in \Gamma(\mu, \nu)$ be an optimal coupling for $W_p$. Set $\pi_t(x, y) := (1 - t)x + ty$, so that
$$(\pi_0)_\# \gamma = \mu, \qquad (\pi_1)_\# \gamma = \nu.$$
Define $\mu_t := (\pi_t)_\# \gamma$ and let $\gamma_{s,t} := (\pi_s, \pi_t)_\# \gamma \in \Gamma(\mu_s, \mu_t)$. Then
$$\begin{aligned}
W_p(\mu_s, \mu_t) &\le \Bigl(\int_{X \times X} |z - z'|^p\,d\gamma_{s,t}(z, z')\Bigr)^{\frac1p} = \Bigl(\int_{X \times X} |\pi_s(x, y) - \pi_t(x, y)|^p\,d\gamma(x, y)\Bigr)^{\frac1p} \\
&= |t - s|\Bigl(\int_{X \times X} |x - y|^p\,d\gamma\Bigr)^{\frac1p} = |t - s|\,W_p(\mu_0, \mu_1).
\end{aligned}$$
Applying this bound on the intervals $[0, s]$, $[s, t]$, and $[t, 1]$, we get
$$W_p(\mu_0, \mu_s) + W_p(\mu_s, \mu_t) + W_p(\mu_t, \mu_1) \le \bigl(s + (t - s) + (1 - t)\bigr)\,W_p(\mu_0, \mu_1) = W_p(\mu_0, \mu_1).$$
Note that the converse inequality always holds, by the triangle inequality. Hence, all inequalities are equalities and we deduce that
$$W_p(\mu_s, \mu_t) = |t - s|\,W_p(\mu_0, \mu_1) \qquad \forall\, 0 \le s, t \le 1. \tag{3.9}$$
Definition 3.1.8. A curve of measures $(\mu_t)_{t \in [0,1]} \subset \mathcal{P}_p(\mathbb{R}^d)$ is said to be a constant speed geodesic if (3.9) holds.

Remark 3.1.9. Notice that, on a Riemannian manifold, a minimizing geodesic (as defined in Section 1.3) satisfies (3.9) with $W_p$ replaced by the Riemannian distance. The converse implication is also true: if a curve on a Riemannian manifold satisfies (3.9) (with $W_p$ replaced by the Riemannian distance) then the curve is a minimizing geodesic.

It follows from the discussion above that any optimal coupling $\gamma$ induces a geodesic via the formula $\mu_t := (\pi_t)_\# \gamma$. Note that, in the particular case when the coupling $\gamma = (\mathrm{Id} \times T)_\# \mu$ is induced by a map, the geodesic $\mu_t$ takes the form
$$\mu_t = (\pi_t)_\# (\mathrm{Id} \times T)_\# \mu = \bigl(\pi_t \circ (\mathrm{Id} \times T)\bigr)_\# \mu = (T_t)_\# \mu,$$
where $T_t(x) := (1 - t)x + tT(x)$ is the linear interpolation between the identity map and the transport map $T$. One can also show (when $X = \mathbb{R}^d$) that any constant-speed geodesic curve $(\mu_t)_{t \in [0,1]}$ (with respect to the distance $W_p$) is induced by an optimal plan $\gamma \in \Gamma(\mu_0, \mu_1)$, i.e., $\mu_t = (\pi_t)_\# \gamma$ (see [72, Cor. 7.22]).
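The constant-speed property (3.9) is easy to observe in one dimension. The sketch below assumes uniform empirical measures on $\mathbb{R}$ with the same number of atoms, for which the monotone (sorted) coupling is optimal for every $p \ge 1$; this fact, the helper names `wp_1d` and `interpolate`, and the sample data are assumptions of the illustration, not statements from the text.

```python
def wp_1d(xs, ys, p):
    """W_p between uniform empirical measures on R with equally many atoms.

    Uses the monotone coupling: the i-th smallest atom of xs is matched
    with the i-th smallest atom of ys.
    """
    xs, ys = sorted(xs), sorted(ys)
    return (sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)) ** (1.0 / p)


def interpolate(xs, ys, t):
    """Atoms of mu_t = (pi_t)_# gamma for the monotone coupling gamma."""
    return [(1.0 - t) * a + t * b for a, b in zip(sorted(xs), sorted(ys))]


x = [0.0, 0.3, 1.0, 2.5]
y = [4.0, 1.5, 6.0, 3.2]
p = 2
w01 = wp_1d(x, y, p)

for s, t in [(0.0, 0.5), (0.25, 0.75), (0.1, 1.0)]:
    lhs = wp_1d(interpolate(x, y, s), interpolate(x, y, t), p)
    rhs = abs(t - s) * w01
    print(s, t, lhs, rhs)
```

In each printed row the two distances coincide up to rounding, matching the identity $W_p(\mu_s, \mu_t) = |t - s|\,W_p(\mu_0, \mu_1)$.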
3.2 An informal introduction to gradient flows in Hilbert spaces

Let $H$ be a Hilbert space (think, as a first example, $H = \mathbb{R}^d$) and let $\varphi \colon H \to \mathbb{R}$ be of class $C^1$. Given $x_0 \in H$, the gradient flow of $\varphi$ starting at $x_0$ is given by the ordinary differential equation
$$\begin{cases} x(0) = x_0, \\ \dot x(t) = -\nabla\varphi(x(t)). \end{cases} \tag{3.10}$$
Note that, for a solution $x(t)$ of the gradient flow, it holds that
$$\frac{d}{dt}\varphi(x(t)) = \nabla\varphi(x(t)) \cdot \dot x(t) = -|\nabla\varphi|^2(x(t)) \le 0. \tag{3.11}$$
Thus, $\varphi$ decreases along the curve $x(t)$; moreover, we have $\frac{d}{dt}\varphi(x(t)) = 0$ if and only if $|\nabla\varphi|(x(t)) = 0$, i.e., $x(t)$ is a critical point of $\varphi$.

In particular, if $\varphi$ has a unique stationary point that coincides with the global minimizer (this is for instance the case if $\varphi$ is strictly convex), then one expects $x(t)$ to converge to the minimizer as $t \to +\infty$.

Remark 3.2.1. To define a gradient flow, one needs a scalar product (exactly as in the definition of the gradient of a function on a manifold; see Definition 1.3.2). Indeed, as a general fact, given a function $\varphi \colon H \to \mathbb{R}$ one defines its differential $d\varphi(x) \colon H \to \mathbb{R}$ as
$$d\varphi(x)[v] = \lim_{\varepsilon \to 0} \frac{\varphi(x + \varepsilon v) - \varphi(x)}{\varepsilon}.$$
If $\varphi$ is sufficiently regular, the map $d\varphi(x) \colon H \to \mathbb{R}$ is linear and continuous, which means that $d\varphi(x) \in H^*$ (the dual space of $H$). On the other hand, if $t \mapsto x(t) \in H$ is a curve, then
$$\dot x(t) = \lim_{\varepsilon \to 0} \frac{x(t + \varepsilon) - x(t)}{\varepsilon} \in H.$$
So $\dot x(t) \in H$ and $d\varphi(x(t)) \in H^*$ live in different spaces. To define a gradient flow, we need a way to identify $H$ and $H^*$. This can be done if we introduce a scalar product. Indeed, if $\langle \cdot, \cdot \rangle$ is a scalar product on $H \times H$, we can define the gradient of $\varphi$ at $x$ as the unique element of $H$ such that
$$\langle \nabla\varphi(x), v \rangle := d\varphi(x)[v] \qquad \forall\, v \in H.$$
In other words, the scalar product allows us to identify the gradient and the differential, and thanks to this identification we can now make sense of $\dot x(t) = -\nabla\varphi(x(t))$.

Now the first question is how one constructs a solution to (3.10). If $\nabla\varphi$ is Lipschitz continuous, one can simply rely on the Picard–Lindelöf theorem (see [70, Thm. 2.2]). Actually, even if $\nabla\varphi$ is only continuous, one could rely on the Peano theorem (see [70, Thm. 2.19]) to get existence of a solution. Unfortunately, as we will see, in most situations of interest $\nabla\varphi$ is not continuous. So, even the assumption of $C^1$ regularity is too strong; for the time being we keep this assumption just to emphasize the ideas, but later we will remove it.

A classical way to construct solutions of (3.10) is by discretizing the ODE in time, via the so-called implicit Euler scheme. More precisely, with a small fixed time step $\tau > 0$, we discretize the time derivative $\dot x(t)$ as $\frac{x(t + \tau) - x(t)}{\tau}$, so that our ODE becomes
$$\frac{x(t + \tau) - x(t)}{\tau} = -\nabla\varphi(y)$$
for a suitable choice of the point $y$. A natural idea would be to choose $y = x(t)$ (as in the explicit Euler scheme), but for our purposes the choice $y = x(t + \tau)$ (as in the implicit Euler scheme) works better. Thus, given $x(t)$, one looks for a point $x(t + \tau) \in H$ solving the relation
$$\frac{x(t + \tau) - x(t)}{\tau} = -\nabla\varphi(x(t + \tau)).$$
With this idea in mind, we set $x_0^\tau = x_0$. Then, given $k \ge 0$ and $x_k^\tau$, we want to find $x_{k+1}^\tau$ by solving
$$\frac{x_{k+1}^\tau - x_k^\tau}{\tau} = -\nabla\varphi(x_{k+1}^\tau),$$
or equivalently
$$\nabla_x\Bigl(\frac{\|x - x_k^\tau\|^2}{2\tau} + \varphi(x)\Bigr)\Big|_{x = x_{k+1}^\tau} = \frac{x_{k+1}^\tau - x_k^\tau}{\tau} + \nabla\varphi(x_{k+1}^\tau) = 0,$$
where $\|\cdot\|$ denotes the norm induced by the scalar product introduced before. In other words, $x_{k+1}^\tau$ is a critical point of the function $\psi_k^\tau(x) := \frac{\|x - x_k^\tau\|^2}{2\tau} + \varphi(x)$. Therefore, a natural way to construct $x_{k+1}^\tau$ is by looking for a global minimizer of $\psi_k^\tau$.

As mentioned above, the $C^1$ assumption on $\varphi$ is generally too strong. So, let us assume instead that $\varphi \colon H \to \mathbb{R} \cup \{+\infty\}$ is convex and lower semicontinuous, and recall the definition of subdifferential introduced in Definition 2.5.1. Then we define a generalized gradient flow in the following way:
Definition 3.2.2. An absolutely continuous curve$^{15}$ $x \colon [0, +\infty) \to H$ is a gradient flow for the convex and lower-semicontinuous function $\varphi$ with initial point $x_0 \in H$ if
$$\begin{cases} x(0) = x_0, \\ \dot x(t) \in -\partial\varphi(x(t)) \quad\text{for almost every } t > 0. \end{cases} \tag{GF}$$
Proceeding by analogy with what we did before, for $\varphi$ convex and lower semicontinuous we can still repeat the construction of discrete solutions via the implicit Euler scheme: we set $x_0^\tau = x_0$, and given $k \ge 0$ and $x_k^\tau$ we look for a point $x_{k+1}^\tau$ satisfying
$$\frac{x_{k+1}^\tau - x_k^\tau}{\tau} \in -\partial\varphi(x_{k+1}^\tau).$$
One can check that this relation is equivalent to
$$0 \in \frac{x_{k+1}^\tau - x_k^\tau}{\tau} + \partial\varphi(x_{k+1}^\tau) =: \partial\psi_k^\tau(x_{k+1}^\tau), \qquad \psi_k^\tau(x) := \frac{\|x - x_k^\tau\|^2}{2\tau} + \varphi(x).$$
Note that $0 \in \partial\psi_k^\tau(x_{k+1}^\tau)$ is equivalent to saying that $x_{k+1}^\tau$ is a global minimizer of $\psi_k^\tau$ (this follows immediately from Definition 2.5.1). Hence, given $x_k^\tau$, one finds $x_{k+1}^\tau$ by minimizing $x \mapsto \psi_k^\tau(x)$. It is not difficult to prove that a minimizer exists,$^{16}$ so we can construct the sequence $(x_k^\tau)_{k \ge 0}$. Then, setting $x^\tau(0) := x_0$ and $x^\tau(t) := x_k^\tau$ for $t \in ((k - 1)\tau, k\tau]$, one obtains a curve $t \mapsto x^\tau(t)$ that should be almost a solution to (GF). Then, the main challenge is to let $\tau \to 0$ and prove that there exists a limit curve $x(t)$ that indeed solves (GF). We will not discuss this here, and we refer to [5, Sect. 3.1] and the references therein for more details.
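The implicit Euler step described above can be carried out by hand in the simplest non-smooth case. The sketch below assumes $H = \mathbb{R}$ and $\varphi(x) = |x|$ (a choice made for illustration, not taken from the text). For this $\varphi$ the minimizer of $\psi_k^\tau$ is given by the soft-thresholding formula, and the gradient flow starting from $x_0 > 0$ is $x(t) = \max(x_0 - t, 0)$, since $\dot x = -1$ while $x > 0$ and $0 \in \partial\varphi(0)$ lets the curve rest at the origin.

```python
def prox_abs(x_k, tau):
    """Minimizer of x -> (x - x_k)**2 / (2 * tau) + |x| (soft-thresholding)."""
    if x_k > tau:
        return x_k - tau
    if x_k < -tau:
        return x_k + tau
    return 0.0


def implicit_euler(x0, tau, steps):
    """Iterates x_0, x_1, ..., x_steps of the implicit Euler scheme for |x|."""
    xs = [x0]
    for _ in range(steps):
        xs.append(prox_abs(xs[-1], tau))
    return xs


# tau is a power of two, so the arithmetic below is exact in floating point.
x0, tau, steps = 1.0, 0.125, 16
xs = implicit_euler(x0, tau, steps)
exact = [max(x0 - k * tau, 0.0) for k in range(steps + 1)]
print(xs)
print(exact)
```

Here the discrete iterates coincide with the exact flow at the grid times $k\tau$, and the scheme stops at the minimizer instead of oscillating around it, which is one reason the implicit choice $y = x(t + \tau)$ is convenient for non-smooth convex functions.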
$^{15}$ An absolutely continuous curve is a continuous curve that is differentiable almost everywhere, its derivative satisfies $|\dot x(t)| \in L^1_{\mathrm{loc}}([0, +\infty))$, and the fundamental theorem of calculus holds:
$$x(t) - x(s) = \int_s^t \dot x(\tau)\,d\tau \qquad \forall\, s, t \in [0, +\infty).$$
We refer to [6, Sect. 1.1] for a general introduction to absolutely continuous curves.

$^{16}$ Actually, in this case one can prove that there exists a unique minimizer. Indeed, to prove this, fix $x_0 \in H$ a point where the subdifferential of $\varphi$ is nonempty, and fix $p_0 \in \partial\varphi(x_0)$. Then
$$\varphi(x) \ge \varphi(x_0) + \langle p_0, x - x_0 \rangle \ge \varphi(x_0) - \|p_0\|\bigl(\|x\| + \|x_0\|\bigr) = -A\|x\| - B \qquad \forall\, x \in H,$$
where $A := \|p_0\|$ and $B := \|p_0\|\,\|x_0\| - \varphi(x_0)$. Hence, recalling the definition of $\psi_k^\tau$, this proves that
$$\lim_{\|x\| \to \infty} \psi_k^\tau(x) \ge \lim_{\|x\| \to \infty} \Bigl(\frac{\|x - x_k^\tau\|^2}{2\tau} - A\|x\| - B\Bigr) = +\infty.$$
Thus, if $x_j$ is a minimizing sequence of $\psi_k^\tau$ (i.e., $\psi_k^\tau(x_j) \to \inf_H \psi_k^\tau$ as $j \to \infty$), it follows from the equation above that $\|x_j\|$ cannot go to infinity. This means that $x_j$ is a bounded sequence in the Hilbert space $H$, so by the Banach–Alaoglu theorem it has a subsequence $x_{j_\ell}$ that converges weakly to some point $\bar x$. Note now that $\psi_k^\tau$ is a lower-semicontinuous convex function. Also, for convex functions, lower semicontinuity with respect to strong convergence is equivalent to lower semicontinuity with respect to weak convergence (see for instance [23, Cor. 3.9]). Hence
$$\psi_k^\tau(\bar x) \le \liminf_{\ell \to \infty} \psi_k^\tau(x_{j_\ell}) = \inf_H \psi_k^\tau,$$
which proves that $\bar x$ is a minimizer. Note also that, since $\psi_k^\tau$ is uniformly convex (being the sum of the convex function $\varphi$ and a uniformly convex function), the minimizer is unique: indeed, if $\bar x_1$ and $\bar x_2$ are minimizers then
$$\psi_k^\tau\Bigl(\frac{\bar x_1 + \bar x_2}{2}\Bigr) \le \frac{\psi_k^\tau(\bar x_1)}{2} + \frac{\psi_k^\tau(\bar x_2)}{2} = \frac{\inf_H \psi_k^\tau}{2} + \frac{\inf_H \psi_k^\tau}{2} = \inf_H \psi_k^\tau,$$
so equality holds in the first inequality, and therefore $\bar x_1 = \bar x_2$.

Remark 3.2.3 (Uniqueness and stability). Let $\varphi$ be a convex function, and let $x(t)$, $y(t)$ be solutions of (GF) with initial conditions $x_0$ and $y_0$ respectively. If $\varphi$ is of class $C^1$ then
$$\frac{d}{dt}\frac{\|x(t) - y(t)\|^2}{2} = \langle x(t) - y(t), \dot x(t) - \dot y(t) \rangle = -\langle x(t) - y(t), \nabla\varphi(x(t)) - \nabla\varphi(y(t)) \rangle \le 0,$$
where the last inequality follows from the convexity of $\varphi$. More generally, if $\varphi$ is convex but not necessarily $C^1$, we have
$$\dot x(t) = -p(t) \quad\text{and}\quad \dot y(t) = -q(t), \qquad p(t) \in \partial\varphi(x(t)),\ q(t) \in \partial\varphi(y(t)),$$
and therefore
$$\frac{d}{dt}\frac{\|x(t) - y(t)\|^2}{2} = \langle x(t) - y(t), \dot x(t) - \dot y(t) \rangle = -\langle x(t) - y(t), p(t) - q(t) \rangle \le 0,$$
where the last inequality follows from the monotonicity of the subdifferential of convex functions (this is just a particular case of the cyclical monotonicity of the subdifferential of a convex function in the case $N = 2$; see Theorem 2.5.3). In particular, in both cases the gradient flow is unique. Even more, if the initial conditions $x_0$ and $y_0$ are close, then $x(t)$ and $y(t)$ remain uniformly close for all times.
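Example 3.2.4. Let $H = L^2(\mathbb{R}^d)$ and
$$\varphi(u) := \begin{cases} \dfrac12\displaystyle\int_{\mathbb{R}^d} |\nabla u|^2\,dx & \text{if } u \in W^{1,2}(\mathbb{R}^d), \\[2mm] +\infty & \text{otherwise.} \end{cases}$$
[…]

Then the equation above takes the form
$$\int_{\mathbb{R}^d} \frac{|\nabla u|^2}{2}\,dx \le \int_{\mathbb{R}^d} \frac{|\nabla(u + \varepsilon w)|^2}{2}\,dx - \varepsilon\int_{\mathbb{R}^d} p\,w\,dx.$$
Rearranging the terms and dividing by $\varepsilon$ yields
$$\frac{\varepsilon}{2}\int_{\mathbb{R}^d} |\nabla w|^2\,dx \ge -\int_{\mathbb{R}^d} \nabla u \cdot \nabla w\,dx + \int_{\mathbb{R}^d} p\,w\,dx,$$
so by letting $\varepsilon \to 0$ we obtain
$$\int_{\mathbb{R}^d} \nabla u \cdot \nabla w\,dx \ge \int_{\mathbb{R}^d} p\,w\,dx \qquad \forall\, w \in W^{1,2}(\mathbb{R}^d).$$
Replacing $w$ with $-w$ in the inequality above, we conclude that
$$-\int_{\mathbb{R}^d} \underbrace{\Delta u}_{\text{as a distribution}}\,w = \int_{\mathbb{R}^d} \nabla u \cdot \nabla w\,dx = \int_{\mathbb{R}^d} p\,w\,dx \qquad \forall\, w \in W^{1,2}(\mathbb{R}^d),$$
i.e., $-\Delta u = p \in L^2(\mathbb{R}^d)$.

($\Leftarrow$) Assume that the distributional Laplacian $\Delta u$ belongs to $L^2(\mathbb{R}^d)$. By definition of $\varphi$, for any $w \in W^{1,2}(\mathbb{R}^d)$ we have
$$\varphi(u + w) - \varphi(u) = \int_{\mathbb{R}^d} \nabla u \cdot \nabla w\,dx + \frac12\int_{\mathbb{R}^d} |\nabla w|^2\,dx \ge \int_{\mathbb{R}^d} \nabla u \cdot \nabla w\,dx = -\int_{\mathbb{R}^d} \Delta u\,w\,dx.$$
On the other hand, if $w \notin W^{1,2}(\mathbb{R}^d)$ then trivially
$$\varphi(u + w) = +\infty \ge \varphi(u) - \int_{\mathbb{R}^d} \Delta u\,w\,dx.$$
Thus $-\Delta u \in \partial\varphi(u)$.

As a consequence of this discussion, we obtain the following:

Corollary 3.2.5 (Heat equation as gradient flow). Let $H = L^2(\mathbb{R}^d)$ and consider the Dirichlet energy functional […]

3.3 Heat equation and optimal transport: The JKO scheme

[…] $\tau > 0$, set $\rho_0^\tau := \rho_0$, and given $\rho_k^\tau$ we define $\rho_{k+1}^\tau$ as the minimizer of
$$\rho \mapsto \frac{W_2^2(\rho, \rho_k^\tau)}{2\tau} + \int_\Omega \rho\log(\rho)\,dx. \tag{3.12}$$
The goal of this section is to show that, as $\tau \to 0$, the scheme converges to the solution of the heat equation. We begin by proving the existence of discrete solutions.$^{18}$

Lemma 3.3.2. For any $k \ge 0$, $\rho_{k+1}^\tau$ exists (i.e., the functional in (3.12) has a minimum).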
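Proof. Fix $k \ge 0$, and take $(\rho_m)_{m \in \mathbb{N}} \subset \mathcal{P}(\Omega)$ a minimizing sequence, that is,
$$\frac{W_2^2(\rho_m, \rho_k^\tau)}{2\tau} + \int_\Omega \rho_m\log(\rho_m)\,dx \to \inf_{\rho \in \mathcal{P}(\Omega)} \Bigl\{\frac{W_2^2(\rho, \rho_k^\tau)}{2\tau} + \int_\Omega \rho\log(\rho)\,dx\Bigr\}.$$
For all $M \in \mathbb{N}$ the sequence $\{\rho_m \wedge M\}_{m \in \mathbb{N}}$ is bounded in $L^\infty(\Omega)$, thus by the Banach–Alaoglu theorem it is weakly-$*$ compact in $L^\infty$. Hence, by a diagonal argument, we can find a subsequence $m_\ell$ independent of $M$ such that $\rho_{m_\ell} \wedge M \rightharpoonup^* \rho^M$ in $L^\infty(\Omega)$ for each $M \in \mathbb{N}$. Also, since $s\log(s) + 1 \ge 0$ for all $s \ge 0$, we can bound
$$\begin{aligned}
\int_\Omega |\rho_{m_\ell} - \rho_{m_\ell} \wedge M|\,dx &= \int_{\{\rho_{m_\ell} \ge M\}} (\rho_{m_\ell} - M)\,dx \le \frac{1}{\log(M)}\int_{\Omega \cap \{\rho_{m_\ell} \ge M\}} \rho_{m_\ell}\log(\rho_{m_\ell})\,dx \\
&\le \frac{1}{\log(M)}\int_{\Omega \cap \{\rho_{m_\ell} \ge M\}} \bigl(\rho_{m_\ell}\log(\rho_{m_\ell}) + 1\bigr)\,dx \le \frac{1}{\log(M)}\int_\Omega \bigl(\rho_{m_\ell}\log(\rho_{m_\ell}) + 1\bigr)\,dx \le \frac{C}{\log(M)},
\end{aligned}$$
where the last bound follows from the fact that $\rho_{m_\ell}$ is a minimizing sequence (hence $\int \rho_{m_\ell}\log(\rho_{m_\ell})$ is uniformly bounded) and $\Omega$ is bounded (hence it has finite volume). Set $\rho^\infty := \sup_M \rho^M$. We know that

(i) $\rho_{m_\ell} \wedge M \rightharpoonup^* \rho^M$ in $L^\infty(\Omega)$ as $\ell \to \infty$;
(ii) $\rho^M \to \rho^\infty$ in $L^1(\Omega)$ (by monotone convergence);
(iii) $\|\rho_{m_\ell} \wedge M - \rho_{m_\ell}\|_{L^1} \le \dfrac{C}{\log(M)}$.

Hence, thanks to the first two properties, we can find a sequence of indices $(m_{\ell_M})_{M \in \mathbb{N}}$, with $m_{\ell_M} \to \infty$, such that $\rho_{m_{\ell_M}} \wedge M \rightharpoonup \rho^\infty$ in $L^1(\Omega)$ as $M \to \infty$. Also, thanks to the third property, $\rho_{m_{\ell_M}} \rightharpoonup \rho^\infty$ in $L^1(\Omega)$.

We now want to show that $\rho^\infty$ is still a probability density. Note that this is not obvious, since some mass may have "escaped" from $\Omega$. To prove this, set $N_\varepsilon := \{x \in \Omega \mid \operatorname{dist}(x, \partial\Omega) < \varepsilon\}$. Since $|N_\varepsilon| \le C\varepsilon$, for $L := \frac{1}{\varepsilon|\log(\varepsilon)|}$ we have
$$\int_{N_\varepsilon} \rho_m = \int_{N_\varepsilon \cap \{\rho_m \le L\}} \rho_m + \int_{N_\varepsilon \cap \{\rho_m \ge L\}} \rho_m \le L|N_\varepsilon| + \int_{N_\varepsilon \cap \{\rho_m \ge L\}} \rho_m\,\frac{\log(\rho_m)}{\log(L)} \le C\varepsilon L + \frac{C}{\log(L)} \le \frac{C}{|\log(\varepsilon)|} \qquad \forall\, m \in \mathbb{N},$$
so in particular
$$\int_{\Omega \setminus N_\varepsilon} \rho_{m_{\ell_M}} \ge 1 - \frac{C}{|\log(\varepsilon)|}$$
and therefore
$$\int_{\Omega \setminus N_\varepsilon} \rho^\infty \ge 1 - \frac{C}{|\log(\varepsilon)|}.$$
Letting $\varepsilon \to 0$, we conclude that $\rho^\infty$ is a probability density. In particular, it follows by Lemma 2.1.13 that the convergence of $\rho_{m_{\ell_M}}$ to $\rho^\infty$ is also narrow. Hence, thanks to Theorem 2.1.11, the family $\{\rho_{m_{\ell_M}}\}_{M \in \mathbb{N}}$ is tight.

$^{18}$ Readers familiar with the Dunford–Pettis theorem will find the proof longer than needed. However, we have decided to present a more elementary proof based only on the weak-$*$ compactness of $L^\infty$.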
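We now observe that, since $[0, \infty) \ni s \mapsto s\log(s)$ is convex, [4, Thm. 5.2] implies that$^{19}$
$$\int_\Omega \rho^\infty\log(\rho^\infty) \le \liminf_{M \to \infty} \int_\Omega \rho_{m_{\ell_M}}\log(\rho_{m_{\ell_M}}). \tag{3.13}$$
We now want to study the behavior of $W_2^2(\rho_{m_{\ell_M}}, \rho_k^\tau)$ as $M \to \infty$. Let $\gamma_M \in \Gamma(\rho_{m_{\ell_M}}, \rho_k^\tau)$ be an optimal coupling. Then, since the family $\{\rho_{m_{\ell_M}}\}_{M \in \mathbb{N}}$ is tight (by the previous discussion), the proof of Lemma 2.3.1 shows that $\gamma_M$ is also tight. Hence, up to taking a subsequence, $\gamma_M \rightharpoonup \gamma_\infty$ with
$$(\pi_1)_\# \gamma_\infty = \rho^\infty, \qquad (\pi_2)_\# \gamma_\infty = \rho_k^\tau,$$
thus $\gamma_\infty \in \Gamma(\rho^\infty, \rho_k^\tau)$. Note also that, since $|x - y|^2$ is continuous and bounded on $\Omega \times \Omega$,
$$W_2^2(\rho_{m_{\ell_M}}, \rho_k^\tau) = \int_{\Omega \times \Omega} |x - y|^2\,d\gamma_M \to \int_{\Omega \times \Omega} |x - y|^2\,d\gamma_\infty \ge W_2^2(\rho^\infty, \rho_k^\tau).$$
Hence, combining the lower semicontinuity of $\int \rho_{m_{\ell_M}}\log(\rho_{m_{\ell_M}})$ with the equation above, we get
$$\liminf_{M \to \infty} \Bigl(\frac{W_2^2(\rho_{m_{\ell_M}}, \rho_k^\tau)}{2\tau} + \int_\Omega \rho_{m_{\ell_M}}\log(\rho_{m_{\ell_M}})\Bigr) \ge \frac{W_2^2(\rho^\infty, \rho_k^\tau)}{2\tau} + \int_\Omega \rho^\infty\log(\rho^\infty).$$
Since $\rho_m$ was a minimizing sequence, this proves that $\rho^\infty$ is a minimizer. Hence, we define $\rho_{k+1}^\tau := \rho^\infty$.

Next, since $\rho_{k+1}^\tau$ minimizes the functional (3.12), we expect it to satisfy some kind of minimality equation. This is the purpose of the next result.

$^{19}$ A simple way to prove (3.13) is the following: note that, for each $s \ge 0$, it holds that
$$s\log s \ge s(\lambda + 1) - e^\lambda \qquad \forall\, \lambda \in \mathbb{R},$$
with equality for $\lambda = \log(s)$. Hence, given any continuous function $\zeta(x)$, we have
$$\liminf_{M \to \infty} \int_\Omega \rho_{m_{\ell_M}}\log(\rho_{m_{\ell_M}}) \ge \liminf_{M \to \infty} \int_\Omega \Bigl(\rho_{m_{\ell_M}}(x)\bigl(\zeta(x) + 1\bigr) - e^{\zeta(x)}\Bigr)\,dx = \int_\Omega \Bigl(\rho^\infty(x)\bigl(\zeta(x) + 1\bigr) - e^{\zeta(x)}\Bigr)\,dx,$$
where we applied the previous formula with $s = \rho_{m_{\ell_M}}(x)$ and $\lambda = \zeta(x)$, and the final equality follows from the narrow convergence of $\rho_{m_{\ell_M}}$ to $\rho^\infty$. Choosing $\{\zeta_k\}_{k \in \mathbb{N}}$ a sequence of functions converging to $\log(\rho^\infty)$, the result follows by applying the above formula to $\zeta = \zeta_k$ and letting $k \to \infty$.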
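The pointwise inequality behind footnote 19, $s\log s \ge s(\lambda + 1) - e^\lambda$ with equality at $\lambda = \log(s)$, can be sanity-checked numerically. The sketch below uses illustrative sample values and function names; it checks the inequality on a small grid and is not a proof.

```python
import math


def s_log_s(s):
    """The entropy integrand, extended by continuity with value 0 at s = 0."""
    return 0.0 if s == 0.0 else s * math.log(s)


def affine_lower_bound(s, lam):
    """Right-hand side s * (lam + 1) - exp(lam), affine in s for fixed lam."""
    return s * (lam + 1.0) - math.exp(lam)


s_values = [0.0, 0.1, 0.5, 1.0, 2.0, 7.5]
lam_values = [-3.0, -1.0, 0.0, 0.5, 2.0]

# Smallest value of (left side - right side) over the grid: should be >= 0.
worst_gap = min(
    s_log_s(s) - affine_lower_bound(s, lam)
    for s in s_values
    for lam in lam_values
)

# At lam = log(s) the two sides should coincide.
equality_gaps = [
    abs(s_log_s(s) - affine_lower_bound(s, math.log(s)))
    for s in s_values
    if s > 0.0
]

print(worst_gap, max(equality_gaps))
```

The point of the inequality is that, for each fixed $\lambda$, the right-hand side is affine in $s$, so integrals of it pass to the limit under narrow convergence; the convex integrand is then recovered as a supremum of such affine bounds.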
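Lemma 3.3.3. For any vector field $\xi \in C^\infty(\bar\Omega; \mathbb{R}^d)$ tangent to the boundary of $\Omega$, it holds that
$$\int_\Omega \rho_{k+1}^\tau \operatorname{div}(\xi)\,dx = \frac{1}{\tau}\int_\Omega \langle \xi \circ T_{k+1}, T_{k+1} - x \rangle\,\rho_k^\tau\,dx,$$
where $T_{k+1} \colon \Omega \to \Omega$ is the optimal map from $\rho_k^\tau$ to $\rho_{k+1}^\tau$.

Proof. In order to exploit the minimality of $\rho_{k+1}^\tau$, we want to perturb it. We do it as follows. Consider the flow of $\xi$:
$$\begin{cases} \dot\Phi(t, x) = \xi(\Phi(t, x)), \\ \Phi(0, x) = x. \end{cases}$$
Since $\xi$ is tangent to $\partial\Omega$, it follows that $\Phi(t) \colon \Omega \to \Omega$ is a diffeomorphism. So we can define $\rho_\varepsilon := \Phi(\varepsilon)_\# \rho_{k+1}^\tau \in \mathcal{P}(\Omega)$. It follows by Section 1.6 that
$$\rho_{k+1}^\tau(x) = \rho_\varepsilon(\Phi(\varepsilon, x))\,\det\nabla\Phi(\varepsilon, x);$$
therefore
$$\int_\Omega \rho_\varepsilon(y)\log(\rho_\varepsilon(y))\,dy = \int_\Omega \rho_{k+1}^\tau(x)\log\bigl(\rho_\varepsilon(\Phi(\varepsilon, x))\bigr)\,dx = \int_\Omega \rho_{k+1}^\tau(x)\log\Bigl(\frac{\rho_{k+1}^\tau(x)}{\det\nabla\Phi(\varepsilon, x)}\Bigr)\,dx.$$
Then a Taylor expansion gives (cf. (2.15))
$$\int_\Omega \rho_\varepsilon\log(\rho_\varepsilon) = \int_\Omega \rho_{k+1}^\tau\log(\rho_{k+1}^\tau) - \int_\Omega \rho_{k+1}^\tau\log\bigl(\underbrace{\det\nabla\Phi(\varepsilon, x)}_{1 + \varepsilon\operatorname{div}\xi + o(\varepsilon)}\bigr)\,dx = \int_\Omega \rho_{k+1}^\tau\log(\rho_{k+1}^\tau) - \varepsilon\int_\Omega \rho_{k+1}^\tau\operatorname{div}\xi\,dx + o(\varepsilon).$$
Now, given a coupling $\gamma \in \Gamma(\rho_{k+1}^\tau, \rho_k^\tau)$ optimal for the $W_2$-distance, define $\gamma_\varepsilon := (\Phi(\varepsilon, \cdot) \times \mathrm{Id})_\# \gamma$. Since
$$(\pi_1)_\# \gamma_\varepsilon = \Phi(\varepsilon, \cdot)_\# \rho_{k+1}^\tau = \rho_\varepsilon, \qquad (\pi_2)_\# \gamma_\varepsilon = (\pi_2)_\# \gamma = \rho_k^\tau,$$
and $\Phi(\varepsilon, x) = x + \varepsilon\xi(x) + o(\varepsilon)$, we get
$$W_2^2(\rho_\varepsilon, \rho_k^\tau) \le \int_{\Omega \times \Omega} |x - y|^2\,d\gamma_\varepsilon = \int_{\Omega \times \Omega} |\Phi(\varepsilon, x) - y|^2\,d\gamma = \int_{\Omega \times \Omega} \bigl(|x - y|^2 + 2\varepsilon\langle \xi(x), x - y \rangle + o(\varepsilon)\bigr)\,d\gamma.$$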
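Therefore, recalling that $\gamma$ is optimal from $\rho_{k+1}^\tau$ to $\rho_k^\tau$, we obtain
$$W_2^2(\rho_\varepsilon, \rho_k^\tau) \le W_2^2(\rho_{k+1}^\tau, \rho_k^\tau) + 2\varepsilon\int_{\Omega \times \Omega} \langle \xi(x), x - y \rangle\,d\gamma + o(\varepsilon).$$
Combining everything, we have proven that
$$\begin{aligned}
\frac{W_2^2(\rho_{k+1}^\tau, \rho_k^\tau)}{2\tau} + \int_\Omega \rho_{k+1}^\tau\log(\rho_{k+1}^\tau)\,dx &\le \frac{W_2^2(\rho_\varepsilon, \rho_k^\tau)}{2\tau} + \int_\Omega \rho_\varepsilon\log(\rho_\varepsilon)\,dx \\
&\le \frac{W_2^2(\rho_{k+1}^\tau, \rho_k^\tau)}{2\tau} + \int_\Omega \rho_{k+1}^\tau\log(\rho_{k+1}^\tau)\,dx + \varepsilon\underbrace{\Bigl(\frac{1}{\tau}\int_{\Omega \times \Omega} \langle \xi(x), x - y \rangle\,d\gamma - \int_\Omega \rho_{k+1}^\tau\operatorname{div}\xi\,dx\Bigr)}_{(\star)} + o(\varepsilon).
\end{aligned}$$
Hence, since $\varepsilon$ can be chosen both positive and negative, we see that the term $(\star)$ has to vanish. Therefore our optimality condition for $\rho_{k+1}^\tau$ reads
$$\int_\Omega \rho_{k+1}^\tau\operatorname{div}(\xi)\,dx - \frac{1}{\tau}\int_{\Omega \times \Omega} \langle \xi(x), x - y \rangle\,d\gamma = 0,$$
where $\gamma$ realizes the 2-Wasserstein distance between $\rho_{k+1}^\tau$ and $\rho_k^\tau$. To simplify the formula we apply Theorem 2.5.10 to deduce that the optimal plan $\gamma$ is unique and is induced by an optimal map $T_{k+1}$ from $\rho_k^\tau$ to $\rho_{k+1}^\tau$, namely $\gamma = (T_{k+1} \times \mathrm{Id})_\# \rho_k^\tau$. Thus
$$\int_{\Omega \times \Omega} \langle \xi(x), x - y \rangle\,d\gamma = \int_\Omega \langle \xi \circ T_{k+1}(x), T_{k+1}(x) - x \rangle\,\rho_k^\tau(x)\,dx,$$
and the optimality equation becomes$^{20}$
$$\int_\Omega \rho_{k+1}^\tau\operatorname{div}(\xi)\,dx - \frac{1}{\tau}\int_\Omega \langle \xi \circ T_{k+1}, T_{k+1} - x \rangle\,\rho_k^\tau\,dx = 0,$$
as wanted.

$^{20}$ Alternatively, one could have proceeded as follows: let $T_{k+1}$ be an optimal transport map from $\rho_k^\tau$ to $\rho_{k+1}^\tau$, and note that $\Phi_\varepsilon \circ T_{k+1}$ transports $\rho_k^\tau$ to $\rho_\varepsilon$. Hence,
$$\begin{aligned}
W_2^2(\rho_\varepsilon, \rho_k^\tau) &\le \int_\Omega |\Phi_\varepsilon \circ T_{k+1} - x|^2\,\rho_k^\tau\,dx = \int_\Omega |T_{k+1} + \varepsilon\,\xi \circ T_{k+1} - x|^2\,\rho_k^\tau\,dx + o(\varepsilon) \\
&= \underbrace{\int_\Omega |T_{k+1} - x|^2\,\rho_k^\tau\,dx}_{W_2^2(\rho_{k+1}^\tau, \rho_k^\tau)} + 2\varepsilon\int_\Omega \langle \xi \circ T_{k+1}, T_{k+1} - x \rangle\,\rho_k^\tau\,dx + o(\varepsilon).
\end{aligned}$$
Using this expression, one gets the desired formula for the optimality condition.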
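We are now ready to state and prove the main result of this section.

Theorem 3.3.4. Given $\tau > 0$, let $\rho^\tau \colon [0, \infty) \to \mathcal{P}(\Omega)$ be the curve of probability densities given by
$$\rho^\tau(t) := \begin{cases} \rho_0 & \text{for } t = 0, \\ \rho_k^\tau & \text{for } t \in ((k - 1)\tau, k\tau],\ k \ge 1. \end{cases} \tag{3.14}$$
Then there exists a curve of probability measures $\rho \in L^1_{\mathrm{loc}}([0, \infty) \times \Omega)$ such that, up to a subsequence in $\tau$, $\rho^\tau \rightharpoonup \rho$ weakly in $L^1_{\mathrm{loc}}([0, \infty) \times \Omega)$. Furthermore, $\rho$ satisfies the heat equation (in the distributional sense) with initial datum $\rho_0$ and zero Neumann boundary conditions.

Proof. By the minimality of $\rho_k^\tau$, we have
$$\frac{W_2^2(\rho_k^\tau, \rho_{k-1}^\tau)}{2\tau} + \int_\Omega \rho_k^\tau\log(\rho_k^\tau)\,dx \le \Bigl(\frac{W_2^2(\rho, \rho_{k-1}^\tau)}{2\tau} + \int_\Omega \rho\log(\rho)\,dx\Bigr)\Big|_{\rho = \rho_{k-1}^\tau} = \int_\Omega \rho_{k-1}^\tau\log(\rho_{k-1}^\tau)\,dx.$$
Thus, by taking the telescopic sum over $k = 1, \dots, k_0$, one gets
$$\underbrace{\sum_{k=1}^{k_0} \frac{W_2^2(\rho_k^\tau, \rho_{k-1}^\tau)}{2\tau}}_{\ge 0} + \int_\Omega \rho_{k_0}^\tau\log(\rho_{k_0}^\tau)\,dx \le \int_\Omega \rho_0\log(\rho_0)\,dx.$$
In particular, we deduce that the entropy $\int \rho_k^\tau\log(\rho_k^\tau)$ decreases in $k$. Therefore, recalling (3.14), we have
$$\int_\Omega \rho^\tau(t, x)\log(\rho^\tau(t, x))\,dx \le \int_\Omega \rho_0\log(\rho_0)\,dx \qquad \forall\, \tau > 0,\ t \ge 0. \tag{3.15}$$
Also, since $s\log s \ge -1$ on $[0, +\infty)$, for any $k_0 \ge 1$ it holds that
$$\sum_{k=1}^{k_0} \frac{W_2^2\bigl(\rho^\tau(k\tau), \rho^\tau((k - 1)\tau)\bigr)}{2\tau} \le \int_\Omega \rho_0\log(\rho_0)\,dx - \int_\Omega \rho_{k_0}^\tau\log(\rho_{k_0}^\tau)\,dx \le \int_\Omega \bigl(\rho_0\log(\rho_0) + 1\bigr)\,dx. \tag{3.16}$$
Furthermore, since $\int_\Omega \rho^\tau(t) = 1$, we have
$$\int_{t_1}^{t_2}\int_\Omega \rho^\tau(t, x)\,dx\,dt = t_2 - t_1 \qquad \forall\, 0 \le t_1 \le t_2. \tag{3.17}$$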
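As shown in the proof of Lemma 3.3.2, the bound (3.15) implies that the measures $\rho^\tau(t)$ cannot concentrate nor escape to the boundary of $\Omega$, uniformly in $t$. Hence, up to a subsequence, $\rho^\tau$ converges weakly in $L^1_{\mathrm{loc}}([0, \infty) \times \Omega)$ to a density $\rho(t, x)$, and by passing to the limit in (3.17) we deduce that $\int_\Omega \rho(t, x)\,dx = 1$ for almost every $t \ge 0$.

Now that we have shown the convergence of $\rho^\tau$, we want to show that $\rho$ satisfies the heat equation. The idea is to test the heat equation against a test function of the form $\psi(x)\zeta(t)$. So, first we fix $\psi \in C^\infty(\bar\Omega)$ such that $\frac{\partial\psi}{\partial\nu}\big|_{\partial\Omega} = 0$. Note that, by a Taylor expansion with integral remainder, one has
$$\psi(x) - \psi(y) = \langle \nabla\psi(y), x - y \rangle + \int_0^1 (1 - t)\,\nabla^2\psi\bigl(tx + (1 - t)y\bigr)[x - y, x - y]\,dt.$$
In particular, since $\int_0^1 (1 - t)\,dt = \frac12$,
$$|\psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle| \le \frac12\|\nabla^2\psi\|_\infty |x - y|^2,$$
from which it follows that
$$\int_\Omega \bigl|\langle \nabla\psi \circ T_k, T_k - x \rangle + \psi(x) - \psi(T_k)\bigr|\,\rho_{k-1}^\tau\,dx \le \frac12\|\nabla^2\psi\|_\infty \int_\Omega |T_k - x|^2\,\rho_{k-1}^\tau\,dx = \frac12\|\nabla^2\psi\|_\infty\,W_2^2(\rho_k^\tau, \rho_{k-1}^\tau).$$
Then, applying Lemma 3.3.3 with $\xi = \nabla\psi$ (note that, since $\frac{\partial\psi}{\partial\nu}\big|_{\partial\Omega} = 0$, $\nabla\psi$ is tangent to the boundary) we obtain
$$\Bigl|\int_\Omega \Delta\psi\,\rho_k^\tau\,dx - \frac{1}{\tau}\underbrace{\int_\Omega \bigl(\psi(T_k) - \psi(x)\bigr)\,\rho_{k-1}^\tau\,dx}_{= \int_\Omega \psi\,\rho_k^\tau - \int_\Omega \psi\,\rho_{k-1}^\tau}\Bigr| \le \frac12\|\nabla^2\psi\|_\infty\,\frac{W_2^2(\rho_k^\tau, \rho_{k-1}^\tau)}{\tau}. \tag{3.18}$$
We now take $\zeta \in C_c^\infty([0, +\infty))$, and we multiply (3.18) by $\tau\zeta((k - 1)\tau)$. Then, recalling (3.14), we get
$$\begin{aligned}
\Bigl|\int_\Omega \psi(x)\,\rho^\tau(k\tau, x)\,\zeta((k - 1)\tau)\,dx &- \int_\Omega \psi(x)\,\rho^\tau((k - 1)\tau, x)\,\zeta((k - 1)\tau)\,dx - \tau\int_\Omega \Delta\psi(x)\,\rho^\tau(k\tau, x)\,\zeta((k - 1)\tau)\,dx\Bigr| \\
&\le \frac12\|\nabla^2\psi\|_\infty\|\zeta\|_\infty\,W_2^2\bigl(\rho^\tau(k\tau), \rho^\tau((k - 1)\tau)\bigr).
\end{aligned}$$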
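Summing this bound over $k = 1, \dots, \infty$ yields
$$\begin{aligned}
\Bigl| -\zeta(0)\int_\Omega \psi(x)\rho_0(x)\,dx &+ \underbrace{\sum_{k=1}^\infty \int_\Omega \psi(x)\,\rho^\tau(k\tau, x)\bigl(\zeta((k - 1)\tau) - \zeta(k\tau)\bigr)\,dx}_{(\mathrm{I})} - \underbrace{\sum_{k=1}^\infty \tau\int_\Omega \Delta\psi(x)\,\rho^\tau(k\tau, x)\,\zeta((k - 1)\tau)\,dx}_{(\mathrm{II})} \Bigr| \\
&\le \frac12\|\nabla^2\psi\|_\infty\|\zeta\|_\infty \sum_{k=1}^\infty W_2^2\bigl(\rho^\tau(k\tau), \rho^\tau((k - 1)\tau)\bigr) \le C\tau,
\end{aligned}$$
where $C$ depends on $\psi$ and $\zeta$, and the last bound follows from (3.16). We now rewrite the terms (I) and (II) as follows. For term (I), using that $\rho^\tau(t) = \rho_k^\tau$ for $t \in ((k - 1)\tau, k\tau]$ and that $\zeta((k - 1)\tau) - \zeta(k\tau) = -\int_{(k-1)\tau}^{k\tau} \partial_t\zeta(t)\,dt$, we have
$$(\mathrm{I}) = -\sum_{k=1}^\infty \int_\Omega \psi(x)\,\rho_k^\tau(x)\int_{(k-1)\tau}^{k\tau} \partial_t\zeta(t)\,dt\,dx = -\sum_{k=1}^\infty \int_{(k-1)\tau}^{k\tau}\int_\Omega \psi(x)\,\rho^\tau(t, x)\,\partial_t\zeta(t)\,dx\,dt = -\int_0^\infty\int_\Omega \psi(x)\,\rho^\tau(t, x)\,\partial_t\zeta\,dx\,dt.$$
For term (II), since
$$\tau\,\zeta((k - 1)\tau) = \int_{(k-1)\tau}^{k\tau} \zeta((k - 1)\tau)\,dt = \int_{(k-1)\tau}^{k\tau} \zeta(t)\,dt + \int_{(k-1)\tau}^{k\tau} \underbrace{\bigl(\zeta((k - 1)\tau) - \zeta(t)\bigr)}_{|\cdot| \le \tau\|\partial_t\zeta\|_\infty}\,dt,$$
choosing $T > 0$ such that $\operatorname{supp}(\zeta) \subset [0, T]$ (so that only the intervals contained in $[0, T + \tau]$ contribute) we have
$$(\mathrm{II}) = \sum_{k=1}^\infty \int_{(k-1)\tau}^{k\tau}\int_\Omega \Delta\psi(x)\,\rho^\tau(t, x)\,\zeta(t)\,dx\,dt + O\Bigl(\tau\,\|\partial_t\zeta\|_\infty \int_0^{T+\tau}\int_\Omega |\Delta\psi(x)|\,\rho^\tau(t, x)\,dx\,dt\Bigr) = \int_0^\infty\int_\Omega \Delta\psi(x)\,\rho^\tau(t, x)\,\zeta(t)\,dx\,dt + O\Bigl(\tau\,\|\partial_t\zeta\|_\infty \int_0^{T+\tau}\int_\Omega |\Delta\psi(x)|\,\rho^\tau(t, x)\,dx\,dt\Bigr).$$
Therefore we have proven that
$$\Bigl|\zeta(0)\int_\Omega \psi(x)\rho_0(x)\,dx + \int_0^\infty\int_\Omega \psi(x)\,\rho^\tau(t, x)\,\partial_t\zeta(t)\,dx\,dt + \int_0^\infty\int_\Omega \Delta\psi(x)\,\rho^\tau(t, x)\,\zeta(t)\,dx\,dt\Bigr| \le C\tau + \tau\,\|\Delta\psi\|_\infty\|\partial_t\zeta\|_\infty \int_0^{T+\tau}\int_\Omega \rho^\tau(t, x)\,dx\,dt \to 0 \qquad\text{as } \tau \to 0,$$
where the last integral equals $T + \tau$ by (3.17). Hence, since $\rho^\tau \rightharpoonup \rho$ in $L^1_{\mathrm{loc}}([0, \infty) \times \Omega)$, we conclude that
$$\zeta(0)\int_\Omega \psi(x)\rho_0(x)\,dx + \int_0^\infty\int_\Omega \psi(x)\,\rho(t, x)\,\partial_t\zeta(t)\,dx\,dt + \int_0^\infty\int_\Omega \Delta\psi(x)\,\rho(t, x)\,\zeta(t)\,dx\,dt = 0 \tag{3.19}$$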
ˇ for any smooth satisfying @@ ˇ@ D 0. We claim that (3.19) corresponds to saying that solves, in ˇthe distributional /ˇ sense, the heat equation with Neumann boundary conditions21 @.t D 0 and with @ @ initial datum 0 . To prove the claim we first note that, integrating by parts in time, Z Z 1Z .x/.t; x/ @ t .t / dx dt .0/ .x/0 .x/ dx 0 Z Z ˇt D1 ˇ D .0/ .x/0 .x/ dx .x/.t; x/.t / dx ˇ t D0 Z C .x/ @ t .t/ dx „ƒ‚… as a distribution
Z D
.0/
.x/ 0 .x/
.0/ „ƒ‚…
in the trace sense
Z dx C
.x/
@ t .t / dx: „ƒ‚…
as a distribution
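$^{21}$ It is natural to expect that $\rho$ satisfies the Neumann boundary condition $\frac{\partial\rho(t)}{\partial\nu}\big|_{\partial\Omega} = 0$. Indeed, this condition corresponds to saying that the mass of $\rho$ cannot enter or leave $\Omega$, and this is coherent with the way the solution was constructed.

On the other hand, integrating by parts in space and using that $\frac{\partial\psi}{\partial\nu}\big|_{\partial\Omega} = 0$, we have
\[
\begin{aligned}
\int_0^\infty\!\!\int_\Omega \Delta\psi(x)\,\rho(t,x)\,\zeta(t)\,dx\,dt
&= \int_0^\infty\!\!\int_{\partial\Omega} \underbrace{\frac{\partial\psi}{\partial\nu}(x)}_{=0}\,\rho(t)\,\zeta(t)\,dt - \int_0^\infty\!\!\int_\Omega \nabla\psi(x)\cdot\underbrace{\nabla\rho}_{\text{as a distribution}}\zeta(t)\,dx\,dt \\
&= -\int_0^\infty\!\!\int_{\partial\Omega} \psi(x)\underbrace{\frac{\partial\rho(t)}{\partial\nu}}_{\text{as a distribution}}\zeta(t)\,dt + \int_0^\infty\!\!\int_\Omega \psi(x)\underbrace{\Delta\rho}_{\text{as a distribution}}\zeta(t)\,dx\,dt.
\end{aligned}
\]
Hence, in the sense of distributions, (3.19) is equivalent to
\[
0 = -\zeta(0)\int_\Omega \psi(x)\bigl(\rho_0(x) - \rho(0)\bigr)\,dx + \int_0^\infty\!\!\int_{\partial\Omega} \psi(x)\,\frac{\partial\rho(t)}{\partial\nu}\,\zeta(t)\,dt + \int_0^\infty\!\!\int_\Omega \psi(x)\,(\partial_t\rho - \Delta\rho)\,\zeta(t)\,dx\,dt. \tag{3.20}
\]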
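First choosing $\psi \in C_c^\infty(\Omega)$ and $\zeta \in C_c^\infty((0,+\infty))$, we get
\[
\int_0^\infty\!\!\int_\Omega \psi(x)\,(\partial_t\rho - \Delta\rho)\,\zeta(t)\,dx\,dt = 0,
\]
and so by the arbitrariness of $\psi$ and $\zeta$ we deduce that $\partial_t\rho - \Delta\rho = 0$ in the sense of distributions. Hence (3.20) becomes
\[
0 = -\zeta(0)\int_\Omega \psi(x)\bigl(\rho_0(x) - \rho(0)\bigr)\,dx + \int_0^\infty\!\!\int_{\partial\Omega} \frac{\partial\rho(t)}{\partial\nu}\,\psi(x)\,\zeta(t)\,dt. \tag{3.21}
\]
We now use (3.21) with $\zeta \in C_c^\infty((0,+\infty))$ and $\psi \in C^\infty(\overline\Omega)$ such that $\frac{\partial\psi}{\partial\nu}\big|_{\partial\Omega} = 0$ to get
\[
0 = \int_0^\infty\!\!\int_{\partial\Omega} \frac{\partial\rho(t)}{\partial\nu}\,\psi(x)\,\zeta(t)\,dt.
\]
Note that the constraint $\frac{\partial\psi}{\partial\nu}\big|_{\partial\Omega} = 0$ plays no role on the possible values of $\psi$ on $\partial\Omega$. In other words, $\psi|_{\partial\Omega}$ can be chosen arbitrarily, and so the equation above implies that $\frac{\partial\rho(t)}{\partial\nu}\big|_{\partial\Omega} = 0$. Finally, combining this information with (3.21), we deduce that
\[
0 = \zeta(0)\int_\Omega \psi(x)\bigl(\rho_0(x) - \rho(0)\bigr)\,dx \qquad \forall\, \psi \in C_c^\infty(\Omega),
\]
hence $\rho(0) = \rho_0$, as desired.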
Chapter 4
Differential viewpoint of optimal transport

The goal of this chapter is to introduce a differential structure on the space of probability measures, starting from the Benamou–Brenier formula and then introducing Otto’s formalism. This will allow us to interpret several important PDEs as gradient flows with respect to the 2-Wasserstein distance. In order to focus on the main ideas behind this important theory, most computations in this chapter will be formal.

In the previous chapter we saw that, considering the entropy functional $\rho \mapsto \int \rho\log(\rho)$ in the 2-Wasserstein space, the discrete Euler scheme for gradient flows produces solutions to the heat equation. It is natural to wonder whether we can say that, in some sense, the heat equation is the gradient flow of the entropy with respect to the $W_2$ metric. Moreover, one might ask whether a similar strategy could handle other evolution equations. In this chapter we give an answer to these questions by endowing the Wasserstein space with a differential structure. This makes it much easier to guess (and, with the right toolbox, prove) that the gradient flow of the entropy is the heat equation. Also, this will allow us to repeat the same strategy for other functionals/evolution equations.
4.1 The continuity equation and Benamou–Brenier formula

Let $\Omega \subset \mathbb{R}^d$ be a convex set ($\Omega = \mathbb{R}^d$ is admissible), let $\bar\rho_0 \in \mathcal{P}_2(\Omega)$ be a probability measure with finite second moments (recall (3.1)), and let $v\colon [0,T]\times\Omega \to \mathbb{R}^d$ be a smooth bounded vector field tangent to the boundary of $\Omega$. Let $X(t,x)$ denote the flow of $v$, namely
\[
\begin{cases}
\dot X(t,x) = v(t, X(t,x)),\\
X(0,x) = x,
\end{cases}
\]
and set $\rho_t = (X(t))_\#\bar\rho_0$. Note that, since $v$ is tangent to the boundary, the flow remains inside $\Omega$, hence $\rho_t \in \mathcal{P}_2(\Omega)$.$^{22}$

$^{22}$ Since $v_t$ is bounded, $|X(t,x) - x| \le Ct$, therefore
\[
\int_\Omega |x|^2 \rho_t(x)\,dx = \int_\Omega |X(t,x)|^2\,\bar\rho_0(x)\,dx \le 2\int_\Omega \bigl(|x|^2 + (Ct)^2\bigr)\,\bar\rho_0(x)\,dx.
\]
Thus, if $\bar\rho_0$ has finite second moments, the same holds for $\rho_t$.
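Lemma 4.1.1. Let $v_t(\cdot) := v(t,\cdot)$. The continuity equation
\[
\partial_t\rho_t + \operatorname{div}(v_t\rho_t) = 0 \tag{4.1}
\]
holds in the distributional sense.

Proof. Let $\psi \in C_c^\infty(\Omega)$, and consider the function $t \mapsto \int_\Omega \rho_t(x)\psi(x)\,dx$. Then, using the definitions of $X$ and $\rho_t$, we get
\[
\begin{aligned}
\int_\Omega \partial_t\rho_t(x)\,\psi(x)\,dx
&= \frac{d}{dt}\int_\Omega \rho_t(x)\,\psi(x)\,dx = \frac{d}{dt}\int_\Omega \psi(X(t,x))\,\bar\rho_0(x)\,dx \\
&= \int_\Omega \nabla\psi(X(t,x))\cdot\dot X(t,x)\,\bar\rho_0(x)\,dx \\
&= \int_\Omega \nabla\psi(X(t,x))\cdot v_t(X(t,x))\,\bar\rho_0(x)\,dx \\
&= \int_\Omega \nabla\psi(x)\cdot v_t(x)\,\rho_t(x)\,dx = -\int_\Omega \psi(x)\,\operatorname{div}(v_t\rho_t)\,dx.
\end{aligned}
\]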
Definition 4.1.2. Given a pair $(\rho_t, v_t)$ solving the continuity equation (4.1), with $v_t\cdot\nu|_{\partial\Omega} = 0$ (namely, $v_t$ is tangent to the boundary of $\Omega$), we define its action as
\[
A[\rho_t, v_t] := \int_0^1\!\!\int_\Omega |v_t(x)|^2 \rho_t(x)\,dx\,dt.
\]
The following remarkable formula, due to Benamou and Brenier [14], shows a link between the continuity equation and the $W_2$-distance.

Theorem 4.1.3 (Benamou–Brenier formula). Given two probability measures $\bar\rho_0, \bar\rho_1 \in \mathcal{P}_2(\Omega)$, it holds that
\[
W_2^2(\bar\rho_0, \bar\rho_1) = \inf\bigl\{ A[\rho_t, v_t] \;\big|\; \rho_0 = \bar\rho_0,\ \rho_1 = \bar\rho_1,\ \partial_t\rho_t + \operatorname{div}(v_t\rho_t) = 0,\ v_t\cdot\nu|_{\partial\Omega} = 0 \bigr\}.
\]

Proof. We give only a formal proof. Let $(\rho_t, v_t)$ be a couple “probability measure/smooth vector field” satisfying $\rho_0 = \bar\rho_0$, $\rho_1 = \bar\rho_1$, and $\partial_t\rho_t + \operatorname{div}(v_t\rho_t) = 0$. Let $X(t,x)$ denote the flow of $v_t$. By the uniqueness for the continuity equation,$^{23}$ the unique solution $\rho_t$ is the one constructed in Lemma 4.1.1, namely $\rho_t = (X(t))_\#\bar\rho_0$. In particular $X(1)_\#\bar\rho_0 = \bar\rho_1$, which implies that $X(1)$ is a transport map from $\bar\rho_0$ to $\bar\rho_1$. Then, by the definition of $X$ and Hölder's inequality, we get
\[
\begin{aligned}
A[\rho_t, v_t] &= \int_0^1\!\!\int_\Omega |v_t|^2\rho_t\,dx\,dt = \int_0^1\!\!\int_\Omega |v_t(X(t,x))|^2\,\bar\rho_0(x)\,dx\,dt \\
&= \int_0^1\!\!\int_\Omega |\dot X(t,x)|^2\,\bar\rho_0(x)\,dx\,dt = \int_\Omega \bar\rho_0(x)\int_0^1 |\dot X(t,x)|^2\,dt\,dx \\
&\ge \int_\Omega \bar\rho_0(x)\,\biggl|\int_0^1 \dot X(t,x)\,dt\biggr|^2 dx \\
&= \int_\Omega \bar\rho_0(x)\,|X(1,x) - x|^2\,dx \ge W_2^2(\bar\rho_0, \bar\rho_1).
\end{aligned} \tag{4.2}
\]
Hence, this proves that $W_2^2(\bar\rho_0, \bar\rho_1)$ is always less than or equal to the infimum appearing in the statement. To show equality, take $X(t,x) = x + t(T(x) - x)$, where $T = \nabla\varphi$ is optimal from $\bar\rho_0$ to $\bar\rho_1$, set $\rho_t := X(t)_\#\bar\rho_0$, and let $v_t$ be such that $\dot X(t) = v_t \circ X(t)$.$^{24}$ With this choice we have $T(x) - x = \dot X(t,x) = v_t(X(t,x))$, and looking at the computations above one can easily check that all inequalities in (4.2) become equalities, therefore
\[
A[\rho_t, v_t] = W_2^2(\bar\rho_0, \bar\rho_1).
\]

$^{23}$ The uniqueness for the continuity equation, at least for smooth vector fields, can be obtained exploiting the duality with the transport equation, as sketched in [1, p. 3, (2)].
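The two inequalities in (4.2) can be checked numerically in a toy setting. The sketch below is only an illustration under stated assumptions, not part of the proof: the measures are uniform empirical measures on the real line with the same number of atoms, for which the monotone (sorted) pairing is the optimal map for the quadratic cost, and the paths are of the hypothetical form $X(t,x) = x + s(t)(T(x) - x)$ with $s(0) = 0$, $s(1) = 1$. The names `w2_squared_1d`, `action`, and `s` are introduced here for illustration. The action is computed in Lagrangian form, as in the second line of (4.2).

```python
def w2_squared_1d(xs, ys):
    """W_2^2 between uniform empirical measures on R (monotone pairing)."""
    xs, ys = sorted(xs), sorted(ys)
    return sum((y - x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def action(xs, ys, s, n_steps=2000):
    """Action of the path X(t, x) = x + s(t) * (T(x) - x).

    T is the monotone map between the two empirical measures and s is an
    assumed time reparametrization with s(0) = 0 and s(1) = 1. The speed of
    each particle is s'(t) * |T(x) - x|; s'(t) is approximated by a finite
    difference on each time subinterval.
    """
    xs, ys = sorted(xs), sorted(ys)
    mean_sq_disp = sum((y - x) ** 2 for x, y in zip(xs, ys)) / len(xs)
    h = 1.0 / n_steps
    total = 0.0
    for k in range(n_steps):
        ds = (s((k + 1) * h) - s(k * h)) / h
        total += h * ds ** 2 * mean_sq_disp
    return total


xs = [0.0, 1.0, 2.0]
ys = [1.5, 3.0, 0.5]
w2sq = w2_squared_1d(xs, ys)
straight = action(xs, ys, lambda t: t)      # constant-speed interpolation
curved = action(xs, ys, lambda t: t * t)    # same endpoints, non-constant speed
```

With constant speed the action matches $W_2^2$ up to rounding, while the reparametrized path has strictly larger action, which is the Hölder step in (4.2) being strict.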
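4.2 Otto’s calculus: From Benamou–Brenier to a Riemannian structure

In [59], Otto generalized classical notions from Riemannian geometry (recall Section 1.3) to the Wasserstein space: the norm, the scalar product, and the gradient. We will not follow the line of thought of that paper precisely. Instead, we will use the Benamou–Brenier formula as a starting point for our reasoning. Thanks to the Benamou–Brenier formula (Theorem 4.1.3), we have
\[
\begin{aligned}
W_2^2(\bar\rho_0, \bar\rho_1)
&= \inf_{\rho_t, v_t}\biggl\{\int_0^1\!\!\int_\Omega |v_t|^2\rho_t\,dx\,dt \;\bigg|\; \partial_t\rho_t + \operatorname{div}(v_t\rho_t) = 0,\ v_t\cdot\nu|_{\partial\Omega} = 0,\ \rho_0 = \bar\rho_0,\ \rho_1 = \bar\rho_1\biggr\} \\
&= \inf_{\rho_t}\inf_{v_t}\biggl\{\int_0^1\!\!\int_\Omega |v_t|^2\rho_t\,dx\,dt \;\bigg|\; \partial_t\rho_t + \operatorname{div}(v_t\rho_t) = 0,\ v_t\cdot\nu|_{\partial\Omega} = 0,\ \rho_0 = \bar\rho_0,\ \rho_1 = \bar\rho_1\biggr\} \\
&= \inf_{\rho_t}\biggl\{\int_0^1 \underbrace{\inf_{v_t}\biggl\{\int_\Omega |v_t|^2\rho_t\,dx \;\bigg|\; \operatorname{div}(v_t\rho_t) = -\partial_t\rho_t,\ v_t\cdot\nu|_{\partial\Omega} = 0\biggr\}}_{=:\,\|\partial_t\rho_t\|^2_{\rho_t}}\,dt \;\bigg|\; \rho_0 = \bar\rho_0,\ \rho_1 = \bar\rho_1\biggr\},
\end{aligned}
\]

$^{24}$ This definition corresponds to saying that $v_t := \dot X \circ X(t)^{-1}$. To show that this makes sense, we need to show that $X(t)^{-1}$ exists. Note that
\[
\begin{aligned}
|X(t,x) - X(t,\tilde x)|\,|x - \tilde x| &\ge \langle X(t,x) - X(t,\tilde x),\, x - \tilde x\rangle \\
&= (1-t)\langle x - \tilde x,\, x - \tilde x\rangle + t\,\underbrace{\langle \nabla\varphi(x) - \nabla\varphi(\tilde x),\, x - \tilde x\rangle}_{\ge 0\ (\varphi \text{ convex})} \\
&\ge (1-t)\,|x - \tilde x|^2,
\end{aligned} \tag{4.3}
\]
thus $|X(t,x) - X(t,\tilde x)| \ge (1-t)|x - \tilde x|$. This implies that, for $t \in [0,1)$, $X(t)$ is injective and therefore $X(t)^{-1}$ exists. In addition, this also proves that
\[
|X(t)^{-1}(y) - X(t)^{-1}(\tilde y)| \le \frac{1}{1-t}\,|y - \tilde y|,
\]
so $X(t)^{-1}$ is also Lipschitz. Note that the injectivity may be false for $t = 1$. Indeed, if $\bar\rho_1 = \delta_{\bar x}$ then $T(x) = \bar x$ is constant and obviously not injective.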
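where in the last equality we used that, for each time $t$, given $\rho_t$ and $\partial_t\rho_t$, one can minimize with respect to all vector fields $v_t$ satisfying the constraint $\operatorname{div}(v_t\rho_t) = -\partial_t\rho_t$. In analogy with the formula for the Riemannian distance on a manifold (see Definition 1.3.3), it is natural to define the Wasserstein norm of the derivative $\partial_t\rho_t$ at $\rho_t$ as
\[
\|\partial_t\rho_t\|^2_{\rho_t} := \inf_{v_t}\biggl\{\int_\Omega |v_t|^2\rho_t\,dx \;\bigg|\; \operatorname{div}(v_t\rho_t) = -\partial_t\rho_t,\ v_t\cdot\nu|_{\partial\Omega} = 0\biggr\}. \tag{4.4}
\]
In other words the continuity equation gives, at each time $t$, a constraint on the divergence of $v_t\rho_t$, and we get the formula
\[
W_2^2(\bar\rho_0, \bar\rho_1) = \inf_{\rho_t}\biggl\{\int_0^1 \|\partial_t\rho_t\|^2_{\rho_t}\,dt \;\bigg|\; \rho_0 = \bar\rho_0,\ \rho_1 = \bar\rho_1\biggr\}.
\]
To find a better formula for the Wasserstein norm of $\partial_t\rho_t$ we want to understand the properties of the vector field $v_t$ that realizes the infimum (in definition (4.4)). Hence, given $\rho_t$ and $\partial_t\rho_t$, let $v_t$ be a minimizer, and let $w$ be a vector field such that $\operatorname{div}(w) \equiv 0$. Then, for every $\varepsilon > 0$, we have
\[
\operatorname{div}\Bigl(\Bigl(v_t + \varepsilon\frac{w}{\rho_t}\Bigr)\rho_t\Bigr) = -\partial_t\rho_t.
\]
Thus $v_t + \varepsilon\frac{w}{\rho_t}$ is an admissible vector field in the minimization problem (4.4), and so by minimality of $v_t$ we get
\[
\int_\Omega |v_t|^2\rho_t\,dx \le \int_\Omega \Bigl|v_t + \varepsilon\frac{w}{\rho_t}\Bigr|^2\rho_t\,dx = \int_\Omega |v_t|^2\rho_t\,dx + 2\varepsilon\int_\Omega \langle v_t, w\rangle\,dx + \varepsilon^2\int_\Omega \frac{|w|^2}{\rho_t}\,dx.
\]
Dividing by $\varepsilon$ and letting it go to zero (and then repeating the argument with $-w$ in place of $w$) yields
\[
\int_\Omega \langle v_t, w\rangle\,dx = 0
\]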
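for every $w$ such that $\operatorname{div}(w) \equiv 0$. By the Helmholtz decomposition (2.17), this implies that
\[
v_t \in \{w \mid \operatorname{div}(w) = 0\}^\perp = \{\nabla q \mid q\colon \Omega \to \mathbb{R}\}.
\]
Therefore, there exists a function $\psi_t$ such that $v_t = \nabla\psi_t$. Also, since $\operatorname{div}(v_t\rho_t) = -\partial_t\rho_t$ and $v_t\cdot\nu|_{\partial\Omega} = 0$, then $\psi_t$ is a solution of

8 0, we consider the minimization problem Z d min c C " log 1 d ; (5.11) d ˝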
2.;/ Rd Rd
d where cW Rd Rd ! Œ0; 1/ is the transport cost and d˝ represents the density of with respect to the product measure ˝ . The functional considered in (5.11) is the entropic regularization of the standard cost functional (indeed, we are adding an entropy term to regularize the functional), and it is not hard to see that this new formulation converges, as " ! 0, to the classical Kantorovich formulation (both the cost and the plan converge). It is important to notice that the regularized functional is strictly convex (whereas the original functional is linear with respect to the plan ) and therefore, broadly speaking, iterative algorithms have much more hope of finding the minimum. 1 Let us define the positive measure C" WD e " c.x;y/ ˝ , so that Z Z d d 1 d D " log d : c C " log d ˝ d C" Rd Rd
Then the minimization problem (5.11) boils down to minimizing the Kullback–Leibler divergence of $\gamma$ with respect to $C_\varepsilon$:
\[
\min_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}^d\times\mathbb{R}^d} \log\Bigl(\frac{d\gamma}{dC_\varepsilon}\Bigr)\,d\gamma. \tag{5.12}
\]
The minimization of the Kullback–Leibler divergence is a well-studied problem that admits a fully satisfactory computational solution provided by the Sinkhorn–Knopp algorithm [66, 67]. Let us briefly describe the scheme of the Sinkhorn–Knopp algorithm. The crucial observation is that the unique minimizer $\bar\gamma_\varepsilon$ of (5.12) can be written as
\[
\bar\gamma_\varepsilon = \bar\alpha(x)\,\bar\beta(y)\,C_\varepsilon,
\]
where $\bar\alpha, \bar\beta\colon \mathbb{R}^d \to [0,\infty)$ are two suitable functions. Hence, we have reduced the hard task of finding an optimal plan to the seemingly easier task of finding two optimal functions. The two functions $\bar\alpha$, $\bar\beta$ can be obtained as the limits of two sequences of functions $(\alpha_i)_{i\ge 0}$ and $(\beta_i)_{i\ge 0}$ constructed as follows. Let $\alpha_0 \equiv \beta_0 \equiv 1$, and for any $k \ge 1$ let $\alpha_k\colon \mathbb{R}^d \to [0,\infty)$ satisfy
\[
(\pi_1)_\#\bigl(\alpha_k(x)\,\beta_{k-1}(y)\,C_\varepsilon\bigr) = \mu, \qquad\text{i.e.,}\qquad \alpha_k := \frac{\mu}{(\pi_1)_\#\bigl(\beta_{k-1}(y)\,C_\varepsilon\bigr)},
\]
and let $\beta_k\colon \mathbb{R}^d \to [0,\infty)$ satisfy
\[
(\pi_2)_\#\bigl(\alpha_k(x)\,\beta_k(y)\,C_\varepsilon\bigr) = \nu, \qquad\text{i.e.,}\qquad \beta_k := \frac{\nu}{(\pi_2)_\#\bigl(\alpha_k(x)\,C_\varepsilon\bigr)}.
\]
In other words, we are alternating imposing the constraint on the first marginal and the constraint on the second marginal. Then, in the limit $k \to \infty$, one recovers the desired optimal functions $\bar\alpha$, $\bar\beta$. We suggest [63, Chap. 4] and the references therein for further details on entropic regularization and the Sinkhorn–Knopp algorithm.
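For finitely supported measures the alternating scheme above reduces to rescaling the rows and columns of a matrix. The following is a minimal sketch under that assumption, using only the Python standard library: `mu` and `nu` are lists of positive weights summing to 1, `cost[i][j]` plays the role of $c(x_i, y_j)$, and `K` is the discrete analogue of $C_\varepsilon$. The function name `sinkhorn`, the fixed iteration count `n_iter`, and the sample data are illustrative choices, not taken from the references.

```python
import math


def sinkhorn(mu, nu, cost, eps, n_iter=500):
    """Alternating marginal rescaling for discrete measures.

    Returns the matrix plan[i][j] = alpha[i] * K[i][j] * beta[j], where
    K[i][j] = exp(-cost[i][j] / eps) * mu[i] * nu[j].
    """
    m, n = len(mu), len(nu)
    K = [[math.exp(-cost[i][j] / eps) * mu[i] * nu[j] for j in range(n)]
         for i in range(m)]
    alpha = [1.0] * m
    beta = [1.0] * n
    for _ in range(n_iter):
        # Impose the first marginal: sum_j alpha[i] K[i][j] beta[j] = mu[i].
        alpha = [mu[i] / sum(K[i][j] * beta[j] for j in range(n))
                 for i in range(m)]
        # Impose the second marginal: sum_i alpha[i] K[i][j] beta[j] = nu[j].
        beta = [nu[j] / sum(K[i][j] * alpha[i] for i in range(m))
                for j in range(n)]
    return [[alpha[i] * K[i][j] * beta[j] for j in range(n)] for i in range(m)]


# Illustrative data: uniform weights on two small grids in [0, 1].
xs = [0.0, 1.0 / 3.0, 2.0 / 3.0, 1.0]
ys = [0.0, 0.25, 0.5, 0.75, 1.0]
mu = [0.25] * 4
nu = [0.2] * 5
cost = [[(x - y) ** 2 for y in ys] for x in xs]
plan = sinkhorn(mu, nu, cost, eps=0.5)
row_sums = [sum(row) for row in plan]
col_sums = [sum(plan[i][j] for i in range(4)) for j in range(5)]
```

After the last column rescaling the second marginal is matched up to rounding, and the first marginal is matched up to the convergence error of the iteration. For small values of `eps` the entries of `K` can underflow, in which case a log-domain variant is usually preferred.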
5.7 From $\mathbb{R}^d$ to Riemannian manifolds and beyond: CD spaces

The more advanced part of our study, i.e., the differential viewpoint of optimal transport and Otto’s calculus, took place entirely in (an open set of) Euclidean space. What if one considers probability measures on a Riemannian manifold $(M, g)$? One can repeat essentially verbatim the whole construction of Otto’s calculus on $\mathcal{P}_2(M)$. However, when computing the convexity properties of functionals along Wasserstein geodesics, the geometry of the manifold $M$ plays a crucial role. More precisely, the convexity of the functionals is affected by the Ricci curvature of $(M, g)$, and one can prove the following characterization (see [73]):
\[
\rho \mapsto F[\rho] = \int_M \rho(x)\log(\rho(x))\,d\mathrm{vol}(x) \ \text{ is } W_2\text{-convex if and only if } M \text{ has nonnegative Ricci curvature.}
\]
This shows that the Ricci curvature of a manifold is encoded in the convexity properties of the entropy functional. Since it is possible to define the entropy functional on a general metric measure space (i.e., a metric space endowed with a reference measure), one can say that, by definition, a metric measure space is nonnegatively Ricci curved if the entropy functional is $W_2$-convex. This definition was introduced in the seminal papers [57, 68, 69], and it has been the starting point for a very active area of research concerning the study of spaces with Ricci curvature bounded from below. We refer interested readers to the lecture notes [46] (see also [2, 5, 72] for an exhaustive discussion of this topic) and to the papers [7, 49] where the notion of RCD spaces is introduced (this is a subclass of the spaces considered in [57, 68, 69], similarly to how Riemannian manifolds are a subclass of Finsler manifolds).

Ricci curvature, Gromov–Hausdorff convergence, and optimal transport. Ricci lower bounds are very useful in many geometric applications, as they appear in inequalities relating gradients and measures such as Sobolev inequalities, heat kernel estimates, spectral gap, and diameter control. For instance, given a $d$-dimensional compact manifold $(M, g)$ with $\mathrm{Ric} \ge Kg$, the Bonnet–Myers theorem states
\[
K > 0 \ \Rightarrow\ \operatorname{diam}(M) \le \pi\sqrt{\frac{d-1}{K}},
\]
while the Sobolev inequality states
\[
\|f\|_{L^{d/(d-1)}(d\mathrm{vol})} \le C(d, K, \operatorname{diam}(M))\,\bigl(\|f\|_{L^1(d\mathrm{vol})} + \|\nabla f\|_{L^1(d\mathrm{vol})}\bigr) \qquad \forall\, f \in C^\infty(M).
\]
Due to their importance, it has long been an open question under which assumptions Ricci lower bounds are stable. Since the Ricci curvature is a nonlinear combination of
derivatives of the metric $g$ up to second order, it is clear that if a sequence of Riemannian manifolds $(M_k, g_k)$ converges (in charts) in the $C^2$-topology to a Riemannian manifold $(M, g)$, then the Ricci curvature converges. However, one of the successes of optimal transport theory has been to show that the stability of lower bounds holds under much weaker notions of convergence. More precisely, in a geometric context, a powerful weak notion of convergence is Gromov–Hausdorff convergence:

Definition 5.7.1. A sequence $(X_k, d_k)_{k\in\mathbb{N}}$ of compact length spaces is said to converge in the Gromov–Hausdorff topology to a metric space $(X, d)$ if there are functions $f_k\colon X_k \to X$ and positive numbers $\varepsilon_k \to 0$ such that $f_k$ is an $\varepsilon_k$-approximate isometry, i.e.,$^{38}$
\[
\begin{cases}
|d(f_k(x), f_k(y)) - d_k(x, y)| \le \varepsilon_k \quad \forall\, x, y \in X_k,\\
\operatorname{dist}_H(f_k(X_k), X) \le \varepsilon_k.
\end{cases}
\]
This notion of convergence is clearly very weak, as it does not involve any derivatives, and indeed also makes sense for metric spaces. In addition, it is particularly relevant in this context since Gromov proved a famous precompactness theorem (with respect to the Gromov–Hausdorff topology) for manifolds with Ricci curvature bounded below, and dimension and diameter bounded above. It turns out that optimal transport can be used to recast lower Ricci bounds. Indeed, recalling the definition of $\lambda$-convexity for a functional, introduced in Section 4.4, it has been proven in [57, 68, 69] that
\[
\rho \mapsto F[\rho] = \int_M \rho(x)\log(\rho(x))\,d\mathrm{vol}(x) \ \text{ is } K\text{-convex on } (\mathcal{P}_2(M), W_2)
\]
if and only if $M$ has Ricci curvature bounded from below by $K$.
In other words, optimal transport allows recasting lower Ricci bounds in terms of much more robust inequalities, and one can prove for instance the following geometric statement:

Theorem 5.7.2 ([57, 68, 69]). Let $(M_k, g_k, \mathrm{vol}_k)$ be a sequence of smooth compact Riemannian manifolds with $\mathrm{Ric}_{g_k} \ge K g_k$, converging in the measured Gromov–Hausdorff sense to a smooth compact Riemannian manifold $(M_\infty, g_\infty, \mathrm{vol}_\infty)$. Then $(M_\infty, g_\infty)$ satisfies $\mathrm{Ric}_{g_\infty} \ge K g_\infty$.

This is just a basic example of the so-called Lott–Sturm–Villani theory of metric measure spaces with Ricci curvature bounded below. In particular, by replacing the entropy $\int \rho\log(\rho)\,d\mathrm{vol}$ with the functional $-\int \rho^{1-1/N}\,d\mathrm{vol}$, one can also give a meaning to “a metric space has dimension bounded from above by $N$.” This has been the beginning of the theory of the so-called $\mathrm{CD}(K, N)$ spaces, namely those spaces that have Ricci curvature bounded from below by $K$ and dimension bounded from above by $N$. More recently, RCD spaces have been introduced in [7, 49]. These are CD spaces that are also “Riemannian,” namely, those for which the Wasserstein gradient flow of the entropy functional (that in $\mathbb{R}^d$ we have seen to be the heat equation; cf. Sections 3.3 and 4.2) is linear. We refer interested readers to the original papers for more details.

$^{38}$ Here, $\operatorname{dist}_H(A, B)$ denotes the Hausdorff distance between two sets $A, B \subset X$ measured with respect to $d$, namely
\[
\operatorname{dist}_H(A, B) = \max\Bigl\{\sup_{x\in A}\inf_{y\in B} d(x, y),\ \sup_{y\in B}\inf_{x\in A} d(x, y)\Bigr\}.
\]
Appendix A
Exercises on optimal transport (with solutions)

The goal of this appendix is to present a series of exercises on optimal transport that should help readers to familiarize themselves with the topic. Some of the exercises here are far from trivial and in many cases they are an opportunity for us to present aspects of the theory that we did not cover in the book. Readers can choose to challenge themselves and try to solve the exercises, but we believe that this set of exercises may be useful even for those who just want to read the solutions. In this set of exercises, an optimal transport map should be understood as an optimal transport plan that is also a map (so it must be optimal among all possible transport plans). The linear cost is $c(x, y) = |x - y|$, whereas the quadratic cost is $c(x, y) = \frac12 |x - y|^2$.

Exercise A.1 (Translations are optimal). Let $T\colon \mathbb{R}^d \to \mathbb{R}^d$ be the translation map $T(x) := x + x_0$. For any probability measure $\mu \in \mathcal{P}(\mathbb{R}^d)$, show that $T$ is an optimal transport map from $\mu$ to $T_\#\mu$ with respect to the quadratic cost.

Solution. Let us recall that the gradient of a convex function is always an optimal map (with respect to the quadratic cost) from a probability measure to its push-forward through such map (see Corollary 2.5.12). Hence, to prove the optimality of the translation map $T$, it is sufficient to check that it is the gradient of a convex function. Let $\varphi\colon \mathbb{R}^d \to \mathbb{R}$ be the convex function $\varphi(x) := \frac12 |x + x_0|^2$. Since $T = \nabla\varphi$, the optimality of $T$ follows.

Exercise A.2 (Homotheties are optimal). Let $T\colon \mathbb{R}^d \to \mathbb{R}^d$ be the homothety $T(x) := \lambda x$ where $\lambda > 0$. For any compactly supported probability measure $\mu \in \mathcal{P}(\mathbb{R}^d)$, show that $T$ is an optimal transport map from $\mu$ to $T_\#\mu$ with respect to the quadratic cost.

Solution. As explained in the solution of Exercise A.1, it is sufficient to show that the homothety $T$ is the gradient of a convex function. Let $\varphi\colon \mathbb{R}^d \to \mathbb{R}$ be the convex function $\varphi(x) := \frac{\lambda}{2}|x|^2$. Since $T = \nabla\varphi$, the optimality of $T$ follows.

Exercise A.3. Let $\mu := \frac{1}{\pi}\,dx|_{B(0,1)}$ be the uniform probability measure on $B(0,1) \subset \mathbb{R}^2$, and let $p_1 := (1, 0)$, $p_2 := (2, 0) \in \mathbb{R}^2$. Describe the optimal transport map between $\mu$ and $\frac12(\delta_{p_1} + \delta_{p_2})$ in the following two cases:
(a) when the cost is the quadratic cost $\frac12 |x - y|^2$;
(b) when the cost is the linear cost $|x - y|$.

Solution. Let us work in a more general setting. Let $\mu \in \mathcal{P}(\mathbb{R}^d)$ and $\nu = \frac12\delta_{p_1} + \frac12\delta_{p_2}$, where $p_1$, $p_2$ are two distinct points in $\mathbb{R}^d$. Let $\gamma \in \Gamma(\mu, \nu)$ be an optimal plan with respect to a certain cost $c\colon \mathbb{R}^d\times\mathbb{R}^d \to [0, +\infty)$ that we assume to be continuous.
From the fact that $\gamma$ is an admissible plan, we can deduce that it is supported on $\mathbb{R}^d \times \{p_1, p_2\}$. Moreover, being optimal, it must also be supported on a $c$-cyclically monotone set (see Corollary 2.6.8). Thus, there must be two measurable subsets $A_1, A_2 \subset \mathbb{R}^d$ such that $\gamma$ is supported on $A := A_1\times\{p_1\} \cup A_2\times\{p_2\}$ and $A$ is a $c$-cyclically monotone set. Since $\gamma$ is an admissible transport plan, for any measurable $S \subset \mathbb{R}^d$ it holds that
\[
\mu(S) = \gamma((A_1 \cap S)\times\{p_1\}) + \gamma((A_2 \cap S)\times\{p_2\}). \tag{A.1}
\]
Take $a_1 \in A_1$ and $a_2 \in A_2$. By the $c$-cyclical monotonicity of $A$, we deduce that
\[
c(a_1, p_1) + c(a_2, p_2) \le c(a_1, p_2) + c(a_2, p_1) \iff c(a_1, p_1) - c(a_1, p_2) \le c(a_2, p_1) - c(a_2, p_2).
\]
Therefore, denoting $w(x) := c(x, p_1) - c(x, p_2)$, we have shown $w(a_1) \le w(a_2)$. Since this holds for any $a_1 \in A_1$ and $a_2 \in A_2$, there must be a value $t_0 \in \mathbb{R}$ such that $A_1 \subset \{w \le t_0\}$ and $A_2 \subset \{w \ge t_0\}$. In order to get some additional information on the value of $t_0$, let us apply (A.1). Setting $S = \{w \le t_0\}$ we obtain
\[
\mu(\{w \le t_0\}) = \gamma(A_1\times\{p_1\}) + \gamma((A_2 \cap \{w \le t_0\})\times\{p_2\}) \ge \gamma(A_1\times\{p_1\}) = \nu(\{p_1\}) = \frac12.
\]
Differently, setting $S = \{w < t_0\}$, we get
\[
\mu(\{w < t_0\}) = \gamma((A_1 \cap \{w < t_0\})\times\{p_1\}) + \gamma(\emptyset\times\{p_2\}) = \gamma((A_1 \cap \{w < t_0\})\times\{p_1\}) \le \gamma(A_1\times\{p_1\}) = \nu(\{p_1\}) = \frac12.
\]
Combining the last two inequalities, we get
\[
\mu(\{w < t_0\}) \le \frac12 \le \mu(\{w \le t_0\}). \tag{A.2}
\]
If we assume that $\mu(\{w = t\}) = 0$ for every $t \in \mathbb{R}$, condition (A.2) uniquely identifies $t_0$. Moreover, under this additional assumption, it follows directly from our observations that $\gamma$ is unique and is induced by the map $T\colon \mathbb{R}^d \to \mathbb{R}^d$ defined as
\[
T(x) :=
\begin{cases}
p_1 & \text{if } w(x) \le t_0,\\
p_2 & \text{if } w(x) > t_0.
\end{cases} \tag{A.3}
\]
Now we go back to solving the statement of the exercise.
Case (a) Since $c(x, y) = \frac12 |x - y|^2$, it holds that
\[
w(x) = c(x, p_1) - c(x, p_2) = \langle x, p_2 - p_1\rangle + \frac12\bigl(|p_1|^2 - |p_2|^2\bigr) = x_1 - \frac32.
\]
Notice that, for any $t \in \mathbb{R}$, $\{w = t\}$ is a vertical line and therefore $\mu(\{w = t\}) = 0$. Hence the assumption necessary to deduce that the optimal transport map is (A.3) is satisfied. To conclude, it remains only to determine the value of $t_0$. Condition (A.2) tells us that the line $\{w = t_0\}$ must split the measure $\mu$ into two parts with equal mass, and therefore it must split the ball $B(0,1)$ into two parts with the same area. Therefore the optimal transport map between $\mu$ and $\nu$ is$^{39}$
\[
T(x) :=
\begin{cases}
p_1 & \text{if } x_1 \le 0,\\
p_2 & \text{if } x_1 > 0.
\end{cases}
\]
Case (b) In this case we have $c(x, y) = |x - y|$ and therefore
\[
w(x) = c(x, p_1) - c(x, p_2) = |x - p_1| - |x - p_2|.
\]
Notice that, for any $t \in \mathbb{R}$, $\{w = t\}$ is a hyperbola with foci $p_1$ and $p_2$ (or the empty set) and therefore $\mu(\{w = t\}) = 0$. Reasoning exactly as in part (a), we deduce that the optimal map from $\mu$ to $\nu$ is
\[
T(x) :=
\begin{cases}
p_1 & \text{if } |x - e_1| - |x - 2e_1| \le t_0,\\
p_2 & \text{if } |x - e_1| - |x - 2e_1| > t_0,
\end{cases}
\]
where $t_0 \in \mathbb{R}$ is the only value such that (A.2) is satisfied (so the optimal map sends the interior of a hyperbola into $p_1$ and the exterior into $p_2$). We content ourselves with this description of the optimal map, without trying to determine the value of $t_0$ explicitly.

Exercise A.4. For $i \in \{-1, 0, 1\}$, define $f_i\colon [0,1] \to \mathbb{R}^2$ as $f_i(t) := (i, t)$. Let $\mu := (f_0)_\#\mathscr{L}^1$, $\nu_{-1} := (f_{-1})_\#\mathscr{L}^1$, $\nu_1 := (f_1)_\#\mathscr{L}^1$, and $\nu := \frac12(\nu_{-1} + \nu_1)$. Notice that $\mu$ and $\nu$ are probability measures, $\mu$ is supported on a segment in $\mathbb{R}^2$, and $\nu$ is supported on two segments in $\mathbb{R}^2$. Show that, with respect to the quadratic cost $c(x, y) = |x - y|^2$, there is a unique optimal transport plan from $\mu$ to $\nu$, give a description of such a transport plan, and prove that it is not induced by a map.

$^{39}$ Alternatively, one could note that
\[
T(x) :=
\begin{cases}
p_1 & \text{if } x_1 \le 0,\\
p_2 & \text{if } x_1 > 0,
\end{cases}
\]
is the gradient of the convex function $\psi(x) := \max\{x_1, 2x_1\}$, hence it is the unique optimal transport map between $\mu$ and $T_\#\mu = \frac12(\delta_{p_1} + \delta_{p_2})$.
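Proof. Let $\gamma \in \Gamma(\mu, \nu)$ be any transport plan. For any $x \in \operatorname{supp}(\mu) = \{0\}\times[0,1]$ and $y \in \operatorname{supp}(\nu) = \{\pm 1\}\times[0,1]$ it holds that $|x - y| \ge 1$, thus
\[
\int_{\mathbb{R}^2\times\mathbb{R}^2} c(x, y)\,d\gamma(x, y) = \int_{\mathbb{R}^2\times\mathbb{R}^2} |x - y|^2\,d\gamma(x, y) \ge \int_{\mathbb{R}^2\times\mathbb{R}^2} 1\,d\gamma(x, y) = 1.
\]
We now construct a plan with cost exactly 1, which will therefore be optimal thanks to the observation above. For any $i \in \{-1, 1\}$, consider the map $T_i\colon \mathbb{R}^2 \to \mathbb{R}^2$ given by $T_i(x_1, x_2) = (x_1 + i, x_2)$. Note that $T_i \circ f_0 = f_i$; therefore
\[
(T_i)_\#\mu = (T_i)_\#(f_0)_\#\mathscr{L}^1 = (f_i)_\#\mathscr{L}^1 = \nu_i.
\]
Define $\bar\gamma := \frac12\bigl((\mathrm{Id}\times T_{-1})_\#\mu + (\mathrm{Id}\times T_1)_\#\mu\bigr)$, and let $\pi_1, \pi_2\colon \mathbb{R}^2\times\mathbb{R}^2 \to \mathbb{R}^2$ be the projections, respectively, on the first and second coordinates. For $i \in \{-1, 1\}$, we have
\[
(\pi_1)_\#(\mathrm{Id}\times T_i)_\#\mu = \mu, \qquad (\pi_2)_\#(\mathrm{Id}\times T_i)_\#\mu = \nu_i,
\]
hence one immediately deduces that $\bar\gamma \in \Gamma(\mu, \nu)$ (recall that $\nu = \frac12(\nu_{-1} + \nu_1)$). Also, the cost of this transport is 1 because $\bar\gamma$ is supported in $\operatorname{graph}(T_{-1}) \cup \operatorname{graph}(T_1)$, and for any $(x, y) \in \operatorname{graph}(T_i)$ it holds that $|x - y| = 1$.

We prove that $\bar\gamma$ is the unique optimal transport plan. Take any optimal transport plan $\gamma$. Thanks to our first observation, for $\gamma$-a.e. $(x, y)$ it holds that $|x - y| \le 1$, $x \in \operatorname{supp}(\mu)$, and $y \in \operatorname{supp}(\nu)$. One can check that these conditions imply that $\gamma$ must be supported inside $\operatorname{graph}(T_{-1}) \cup \operatorname{graph}(T_1)$. Hence, we have
\[
\tfrac12\nu_{-1} + \tfrac12\nu_1 = \nu = (\pi_2)_\#\gamma = (\pi_2)_\#\bigl(\gamma|_{\operatorname{graph}(T_{-1})}\bigr) + (\pi_2)_\#\bigl(\gamma|_{\operatorname{graph}(T_1)}\bigr)
\]
and therefore (considering the supports of the various measures)
\[
(\pi_2)_\#\bigl(\gamma|_{\operatorname{graph}(T_{-1})}\bigr) = \tfrac12\nu_{-1} \qquad\text{and}\qquad (\pi_2)_\#\bigl(\gamma|_{\operatorname{graph}(T_1)}\bigr) = \tfrac12\nu_1.
\]
Since $\pi_2$ is injective when restricted to $\operatorname{graph}(T_{-1})$ or $\operatorname{graph}(T_1)$, the latter identities uniquely characterize $\gamma|_{\operatorname{graph}(T_{-1})}$ and $\gamma|_{\operatorname{graph}(T_1)}$, and therefore there can be only one such $\gamma$ (and so $\gamma = \bar\gamma$).

Finally, let us prove that $\bar\gamma$ is not induced by a map. Let us assume by contradiction that $(\mathrm{Id}\times T)_\#\mu = \bar\gamma$. For almost every $t \in [0,1]$, it holds that $T((0, t)) = (\pm 1, t)$. For $i \in \{-1, 1\}$, let $E_i := \{t \in [0,1] \mid T((0, t)) = (i, t)\}$. Note that $\mathscr{L}^1(E_{-1} \cap E_1) = 0$ and that $E_{-1} \cup E_1$ contains almost every point in $[0,1]$. Hence, we have
\[
\nu = T_\#\mu = T_\#(f_0)_\#\mathscr{L}^1 = (T\circ f_0)_\#\bigl(\mathscr{L}^1|_{E_{-1}}\bigr) + (T\circ f_0)_\#\bigl(\mathscr{L}^1|_{E_1}\bigr) = (f_{-1})_\#\bigl(\mathscr{L}^1|_{E_{-1}}\bigr) + (f_1)_\#\bigl(\mathscr{L}^1|_{E_1}\bigr).
\]
However this is impossible, as the right-hand side cannot be equal to $\nu$.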
Exercise A.5 (Counterexamples). For any of the following statements, find two probability measures $\mu, \nu \in \mathcal{P}(\mathbb{R}^d)$ with compact support such that the statement holds (you can also choose the dimension $d \in \mathbb{N}$). Each of the statements should be treated independently.
(a) There is more than one$^{40}$ optimal transport map from $\mu$ to $\nu$ with respect to the linear cost $|x - y|$.
(b) There is more than one optimal transport map from $\mu$ to $\nu$ with respect to the quadratic cost $\frac12 |x - y|^2$.
(c) There is no optimal transport plan between $\mu$ and $\nu$ with respect to the cost $c(x, y) = \lfloor |x - y| \rfloor$ (the floor function$^{41}$ of the distance).
(d) There is an optimal transport map from $\mu$ to $\nu$ with respect to the linear cost, but not with respect to the quadratic cost.
(e) There is an optimal transport map from $\mu$ to $\nu$ with respect to the quadratic cost, but not with respect to the linear cost.

Hint: To solve (c), show that the infimum of the Kantorovich problem for $\mu = \mathscr{L}^1|_{[0,1]}$, $\nu = \mathscr{L}^1|_{[1,2]}$ is 0 but any transport plan has strictly positive cost. To solve (e), it might be useful to solve Exercise A.3 first.

$^{40}$ Uniqueness should be understood in the $\mu$-a.e. sense.
$^{41}$ Given $t \ge 0$, $\lfloor t \rfloor$ is the largest integer $n$ such that $n \le t$.

Solution. (a) Let $d = 1$ and $\mu := \frac12(\delta_0 + \delta_1)$, $\nu := \frac12(\delta_1 + \delta_2)$. Every transport map from $\mu$ to $\nu$ has the same cost (cf. Remark 2.7.5), thus the two maps $T_1, T_2\colon \{0, 1\} \to \{1, 2\}$ given by
\[
T_1(0) = 1,\quad T_1(1) = 2; \qquad T_2(0) = 2,\quad T_2(1) = 1
\]
both send $\mu$ to $\nu$ and are both optimal.

(b) Let $d = 2$ and let $p_1, p_2, p_3, p_4 \in \mathbb{R}^2$ be the four vertices of a square (so that $p_i$ and $p_{i+1}$ are adjacent for $i = 1, 2, 3$). Let $\mu := \frac12(\delta_{p_1} + \delta_{p_3})$ and $\nu := \frac12(\delta_{p_2} + \delta_{p_4})$. One can explicitly check that every transport plan from $\mu$ to $\nu$ has the same quadratic cost. Thus the two maps from $\mu$ to $\nu$ (namely, $T_1(p_1) = p_2$, $T_1(p_3) = p_4$, and $T_2(p_1) = p_4$, $T_2(p_3) = p_2$) are both optimal.

(c) Let $d = 1$ and $\mu := dx|_{[0,1]}$, $\nu := dy|_{[1,2]}$. We will show that the infimum of the Kantorovich problem is 0, but any transport plan has strictly positive cost. Given $\varepsilon > 0$, consider the map $T_\varepsilon\colon [0,1] \to [1,2]$ defined as
\[
T_\varepsilon(x) :=
\begin{cases}
x + 2 - \varepsilon & \text{if } 0 \le x \le \varepsilon,\\
x + 1 - \varepsilon & \text{if } \varepsilon < x \le 1.
\end{cases}
\]
One can check that $(T_\varepsilon)_\#\mu = \nu$. Also, it holds that
\[
\int_{[0,1]} c(x, T_\varepsilon(x))\,d\mu(x) = \int_0^\varepsilon \lfloor 2 - \varepsilon\rfloor\,dx + \int_\varepsilon^1 \lfloor 1 - \varepsilon\rfloor\,dx = \varepsilon.
\]
Since $\varepsilon$ can be chosen arbitrarily small, we have proven that the infimum of the Kantorovich problem is 0. Let us assume by contradiction that $\gamma \in \Gamma(\mu, \nu)$ achieves cost 0. Hence $\gamma$ must be supported on the set
\[
\bigl\{(x, y) \in \mathbb{R}^2 : c(x, y) = 0\bigr\} = \bigl\{(x, y) \in \mathbb{R}^2 \;\big|\; |x - y| < 1\bigr\}. \tag{A.4}
\]
Moreover, since $\gamma \in \Gamma(\mu, \nu)$, the plan $\gamma$ is supported on $[0,1]\times[1,2]$. Therefore, for any $0 < \ell < 1$, it holds that
\[
\begin{aligned}
\ell = \nu([2 - \ell, 2]) &= \gamma(\mathbb{R}\times[2 - \ell, 2]) = \gamma([0,1]\times[2 - \ell, 2]) \\
&\overset{\text{(A.4)}}{=} \gamma([1 - \ell, 1]\times[2 - \ell, 2]) = \gamma([1 - \ell, 1]\times[1, 2]) - \gamma([1 - \ell, 1]\times[1, 2 - \ell)) \\
&= \ell - \gamma([1 - \ell, 1]\times[1, 2 - \ell))
\end{aligned}
\]
and therefore $\gamma([1 - \ell, 1]\times[1, 2 - \ell)) = 0$. Since
\[
([0,1]\times[1,2]) \cap \bigl\{(x, y) \in \mathbb{R}^2 \;\big|\; |x - y| < 1\bigr\} \subset \bigcup_{\ell \in \mathbb{Q}\cap(0,1)} [1 - \ell, 1]\times[1, 2 - \ell),
\]
we reach a contradiction as we have proven that $\gamma \equiv 0$.

(d) Let $d = 1$ and $\mu := dx|_{[0,\frac12]} + \frac12\delta_1$, $\nu := dy|_{[\frac52, 3]} + \frac12\delta_2$. Since $\mu$ is supported on $\{x \le 2\}$, whereas $\nu$ is supported on $\{x \ge 2\}$, Remark 2.7.5 implies that any admissible plan has the same cost with respect to the linear cost, and this cost is given by
\[
\int_{\mathbb{R}} x\,d\nu - \int_{\mathbb{R}} x\,d\mu = \frac74.
\]
In particular, any admissible map is optimal with respect to the linear cost (and it is clear that there is at least one, take for instance $T(x) = x + \frac52$ for $x \in [0, \frac12]$ and $T(1) = 2$). On the other hand, the optimal map with respect to the quadratic cost (if it exists) must be nondecreasing (since, in one dimension, being the gradient of a convex function is equivalent to being nondecreasing). Let us assume by contradiction that there is an admissible transport map $T\colon \mathbb{R} \to \mathbb{R}$ that is nondecreasing. Since it is an admissible map, it must hold that $T(1) = 2$. The monotonicity then implies that $T(x) \le 2$ for any $x \le 1$ and thus $\nu = T_\#\mu$ is supported on $\{x \le 2\}$, which is a contradiction.

(e) Let $d = 2$ and $\nu := \frac12(\delta_{p_1} + \delta_{p_2})$, where $p_1 = (1, 0)$ and $p_2 = (2, 0)$. Let $A = \{x \in \mathbb{R}^2 \mid x\cdot e_1 < 0\}$ and $B = \{p \in \mathbb{R}^2 \mid |p - p_1| - |p - p_2| < \frac12\}$ (let us
remark that the exact form of A and B is not important, it is only important that they are respectively a half-plane and the interior of a hyperbola). Take p3 2 A \ @B;
N \ B; p4 2 .R2 n A/
N \ .R2 n B/; N p5 2 .R2 n A/
and define WD 12 ıp3 C 41 ıp4 C 14 ıp5 . Recalling the observations contained in the solution of Exercise A.3, we can say the following (we identify the mass that and give to a point with the point itself):
Any optimal transport plan with respect to the quadratic cost from to sends p3 into p1 and ¹p4 ; p5 º into p2 . In particular, there is an optimal transport map (because the map that does exactly as described is admissible).
Any optimal transport plan with respect to the linear cost from to sends p4 to p1 and p5 to p2 . This is an obstruction to the existence of an optimal transport map, as the mass of p3 (which is 12 ) should be split into two equal parts (but a map cannot split a delta).
In particular, there is an optimal transport map from to with respect to the quadratic cost, but not with respect to the linear cost. Exercise A.6 (Birkhoff-von Neumann theorem). An n n-matrix A 2 M.n; R/ with nonnegative entries is said to be P P a doubly stochastic matrix if niD1 Aij D 1 for any j D 1; : : : ; n and jnD1 Aij D 1 for any i D 1; : : : ; n,
a permutation matrix if there is a permutation W ¹1; : : : ; nº ! ¹1; : : : ; nº such that Ai.i/ D 1 and Aij D 0 if j 6D .i/.
Prove that any doubly stochastic matrix can be written as a finite convex combination of permutation matrices. Hint: Use Hall’s marriage theorem42 to show that, for any doubly stochastic matrix A, there is a permutation matrix P and a number > 0 such that Aij Pij for any 1 i; j n. Then prove the statement by induction on the number of nonzero entries of A (note that this number is at least n). Solution. Let us begin with the following lemma. Lemma. Given a doubly stochastic matrix A, there is a permutation 2 Sn such that Ai.i/ > 0 for any i D 1; : : : ; n. Proof. Let us construct a bipartite graph as follows: the graph consists of 2n vertices labeled ¹1r ; : : : ; nr º and ¹1c ; : : : ; nc º (the indexes r, c stand for row and column). Then we say that there is an edge between ir and jc if and only if Aij > 0. We denote 42
See the section Graph theoretic formulation in the Wikipedia entry for Hall’s marriage theorem.
Exercises on optimal transport
122
the presence of an edge by ir jc . The first step of the proof consists in showing that such a bipartite graph admits a perfect matching (i.e., there is a permutation W ¹1; : : : ; nº ! ¹1; : : : ; nº such that ir .i/c for any i D 1; : : : ; n). In order to do so, we want to apply Hall’s marriage theorem. Given a subset S ¹1; : : : ; nº, let T be the subset defined as T D ¹t 2 ¹1; : : : ; nº W sr tc for at least one s 2 Sº: Exploiting the fact that the matrix A is doubly stochastic and the definition of T , we obtain n n X n XX XX X XX #S D Asj D Ast Ai t D Ai t D #T: s2S j D1
i D1 t 2T
s2S t 2T
t2T i D1
Since we can choose $S$ arbitrarily, the inequality $\#S \le \#T$ is exactly the hypothesis necessary to apply Hall's marriage theorem and deduce the existence of a perfect matching. Hence, by definition of perfect matching, there is a permutation $\sigma$ such that $i_r \sim \sigma(i)_c$ for any $i = 1,\dots,n$. This last fact is equivalent to the desired statement.

We can now prove the statement of the theorem by induction on the number of nonzero entries of the matrix $A$. Since $A$ is doubly stochastic, it is easy to see that it must have at least $n$ nonzero entries. Moreover, if it has exactly $n$ nonzero entries, then it must already be a permutation matrix.

Let us assume that the number of nonzero entries of $A$ is $k > n$. Let $P^{\sigma}$ be the permutation matrix induced by the permutation $\sigma$,^43 whose existence is provided by the lemma. Let $\lambda > 0$ be the maximum value such that $\lambda P^{\sigma} \le A$ (the inequality must be understood entrywise, namely $\lambda P^{\sigma}_{ij} \le A_{ij}$ for all $i, j$). Notice that, since $A$ is doubly stochastic, each entry of $A$ is bounded by $1$ and therefore $\lambda \le 1$. Also, it must be the case that $\lambda < 1$, as otherwise $A$ would have exactly $n$ nonzero entries.

Let $A' := \frac{1}{1-\lambda}(A - \lambda P^{\sigma})$. Since $\lambda P^{\sigma} \le A$, all entries of $A'$ are nonnegative. Moreover, thanks to the choice of $\lambda$, the matrix $A'$ has at most $k-1$ nonzero entries. Finally, one can easily check that $A'$ is doubly stochastic. By the inductive hypothesis, we are able to write $A'$ as a convex combination of permutation matrices
\[ A' = \sum_{i\in I} \lambda_i P^{\sigma_i}, \qquad \lambda_i \ge 0, \qquad \sum_{i\in I} \lambda_i = 1, \]
where $I$ is a finite set of indices and $P^{\sigma_i}$ are permutation matrices (induced by the permutations $\sigma_i$). From the definition of $A'$, it follows that
\[ A = \lambda P^{\sigma} + \sum_{i\in I} \lambda_i (1-\lambda) P^{\sigma_i}, \]
thus $A$ is a convex combination of permutation matrices.

^43 That is, $P^{\sigma}_{i\sigma(i)} = 1$ for all $i$, and $P^{\sigma}_{ij} = 0$ if $j \ne \sigma(i)$.
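The inductive argument above is constructive, and it can be turned into a small program. The following is only a minimal sketch under stated assumptions: the function names `perfect_matching` and `birkhoff_decomposition` are hypothetical, the matching is found with a plain augmenting-path search (one standard way to realize Hall's theorem algorithmically, not a method taken from the text), and exact rational arithmetic is used so that entries become exactly zero. Instead of rescaling to $A'$ at each step, the sketch subtracts $\lambda P^{\sigma}$ from a residual matrix whose rows and columns all keep the same sum, which is equivalent up to normalization.

```python
from fractions import Fraction


def perfect_matching(A):
    """Return sigma (a list) with A[i][sigma[i]] > 0 for every row i.

    Uses an augmenting-path search on the bipartite graph whose edges
    are the positive entries of A.
    """
    n = len(A)
    col_owner = [None] * n  # col_owner[j] = row currently matched to column j

    def try_row(i, seen):
        for j in range(n):
            if A[i][j] > 0 and j not in seen:
                seen.add(j)
                if col_owner[j] is None or try_row(col_owner[j], seen):
                    col_owner[j] = i
                    return True
        return False

    for i in range(n):
        if not try_row(i, set()):
            raise ValueError("no perfect matching on the positive entries")
    sigma = [None] * n
    for j, i in enumerate(col_owner):
        sigma[i] = j
    return sigma


def birkhoff_decomposition(A):
    """Return a list of (weight, sigma) with A = sum of weight * P^sigma."""
    n = len(A)
    residual = [[Fraction(a) for a in row] for row in A]  # work on a copy
    remaining = Fraction(1)  # common row/column sum of the residual
    terms = []
    while remaining > 0:
        sigma = perfect_matching(residual)
        lam = min(residual[i][sigma[i]] for i in range(n))
        for i in range(n):
            residual[i][sigma[i]] -= lam  # at least one entry becomes zero
        terms.append((lam, tuple(sigma)))
        remaining -= lam
    return terms
```

Each iteration removes at least one nonzero entry, mirroring the induction on the number of nonzero entries, so the loop stops after at most as many steps as $A$ has nonzero entries.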
Exercise A.7 (Discrete optimal transport). Given two families $\{x_1,\dots,x_n\}$ and $\{y_1,\dots,y_n\}$ of points in $\mathbb R^d$, let $\mu := \frac1n\sum_{i=1}^n \delta_{x_i}$ and $\nu := \frac1n\sum_{i=1}^n \delta_{y_i}$. Prove that, for any choice of a continuous cost $c\colon \mathbb R^d\times\mathbb R^d\to\mathbb R$, there exists an optimal transport map from $\mu$ to $\nu$.

Hint: Use Exercise A.6 or Kantorovich duality.

We present two different solutions of this exercise: the first uses the Birkhoff–von Neumann theorem, whereas the second borrows some ideas from duality theory.

Solution. Note that $c$ is bounded below on the finite set $\{(x_i,y_j)\}_{1\le i,j\le n}$, so it follows by Theorem 2.3.2 and Remark 2.3.3 that there exists an optimal coupling $\gamma\in\Gamma(\mu,\nu)$. Let $A$ be the $n\times n$ matrix defined as $A_{ij} := \gamma(\{(x_i,y_j)\})$. From the marginal constraints on $\gamma$, it follows that $nA$ is a doubly stochastic matrix. Hence, applying Exercise A.6, we can express $nA$ as a convex combination of permutation matrices
\[ nA = \sum_{k\in I} \lambda_k P^{\sigma_k}, \]
where $I$ is a finite set of indices, $\lambda_k \ge 0$, $\sum_{k\in I}\lambda_k = 1$, and $P^{\sigma_k}$ are permutation matrices (induced by the permutations $\sigma_k$). Let us define the cost of an $n\times n$ matrix $B$ as
\[ C(B) := \sum_{i,j=1}^{n} B_{ij}\, c(x_i,y_j). \]
By definition, the cost $C$ is linear and it holds that
\[ \int_{\mathbb R^d\times\mathbb R^d} c(x,y)\,d\gamma(x,y) = C(A) = \frac1n\sum_{k\in I}\lambda_k C(P^{\sigma_k}) \ge \frac1n \min_{k\in I} C(P^{\sigma_k}). \]
Hence, there is a permutation $\sigma_k$ such that
\[ \frac1n\sum_{i=1}^{n} c(x_i,y_{\sigma_k(i)}) = \frac1n C(P^{\sigma_k}) \le \int_{\mathbb R^d\times\mathbb R^d} c(x,y)\,d\gamma(x,y), \]
and therefore the map $T\colon\mathbb R^d\to\mathbb R^d$ such that $T(x_i) = y_{\sigma_k(i)}$ is optimal.

Solution. This alternative solution is heavily inspired by [24]. Without loss of generality (up to relabeling the indices of the points $y_i$), we can assume that the trivial permutation has minimum cost among all permutations, that is
\[ \sum_{i=1}^{n} c(x_i,y_i) \le \sum_{i=1}^{n} c(x_i,y_{\sigma(i)}) \tag{A.5} \]
for any permutation $\sigma\colon\{1,\dots,n\}\to\{1,\dots,n\}$. Under this assumption, we want to prove that the map $T(x_i) := y_i$ is optimal in the sense that the coupling induced by it is optimal (in the Kantorovich sense). Thanks to Theorem 2.6.5,^44 it suffices to construct two functions $\varphi\colon\{x_1,\dots,x_n\}\to\mathbb R$ and $\psi\colon\{y_1,\dots,y_n\}\to\mathbb R$ such that
\[ \varphi(x_i) + \psi(y_j) + c(x_i,y_j) \ge 0 \quad\text{for any } 1\le i,j\le n, \tag{A.6} \]
\[ \sum_{i=1}^{n} c(x_i,y_i) = \sum_{i=1}^{n} \bigl(-\varphi(x_i) - \psi(y_i)\bigr). \tag{A.7} \]
Indeed, from these equations we get
\[ \int_{\mathbb R^d} c(x,T(x))\,d\mu = \frac1n\sum_{i=1}^{n} c(x_i,y_i) \overset{\text{(A.7)}}{=} \int_{\mathbb R^d} -\varphi\,d\mu + \int_{\mathbb R^d} -\psi\,d\nu \overset{\text{(A.6)}}{\le} \inf_{\gamma\in\Gamma(\mu,\nu)} \int_{\mathbb R^d\times\mathbb R^d} c\,d\gamma, \]
so the optimality of $T$ follows.

To prove (A.6)–(A.7), we claim that it suffices to construct a function $\varphi$ such that
\[ \varphi(x_j) - \varphi(x_i) \le c(x_i,y_j) - c(x_j,y_j) =: b_{ij} \quad\text{for any } 1\le i,j\le n. \tag{A.8} \]
Indeed, if the above bound holds, then (A.6)–(A.7) hold with the function $\psi$ defined as $\psi(y_i) := -c(x_i,y_i) - \varphi(x_i)$ for any $1\le i\le n$.

So it remains only to construct a function $\varphi$ such that (A.8) holds. To do this, let us consider the weighted oriented complete graph with vertices $\{1,\dots,n\}$ such that the weight of the edge $i\to j$ is $b_{ij}$, and denote by $d(i,j)$ the distance^45 between vertex $i$ and vertex $j$ (notice the similarity between this approach and the proof of Theorem 2.5.3).

Let us check that, for any $1\le i,j\le n$, it holds that $d(i,j) > -\infty$. Since the graph consists of finitely many points, one may note that the distance between two vertices can be $-\infty$ if and only if there is a simple loop^46 $i_1, i_2, \dots, i_k$ with negative length, that is,
\[ b_{i_1 i_2} + b_{i_2 i_3} + \dots + b_{i_k i_1} < 0. \tag{A.9} \]

^44 Actually, we only need to use the inequality
\[ \inf_{\gamma\in\Gamma(\mu,\nu)} \int_{X\times Y} c\,d\gamma \;\ge\; \sup_{\varphi(x)+\psi(y)+c(x,y)\ge 0} \int_X -\varphi\,d\mu + \int_Y -\psi\,d\nu, \]
which follows immediately from the marginal condition (see the proof of (2.12)).
^45 The distance between two vertices in a weighted graph is defined as the infimum of the sum of the weights of a path from the first vertex to the second one.
^46 A simple loop is a closed path that visits each vertex at most once.
To rule out this possibility, we have to use the optimality condition (A.5) (which we have not used until now). Let $\bar\sigma$ be the permutation such that $\bar\sigma(i_1) = i_2$, $\bar\sigma(i_2) = i_3$, ..., $\bar\sigma(i_k) = i_1$, and $\bar\sigma(i) = i$ for all other values of $i$. Applying (A.5) with $\sigma = \bar\sigma$, we get
\[ 0 \le \sum_{i=1}^{n} \bigl( c(x_i,y_{\bar\sigma(i)}) - c(x_{\bar\sigma(i)},y_{\bar\sigma(i)}) \bigr) = \sum_{i=1}^{n} b_{i\bar\sigma(i)} = b_{i_1 i_2} + b_{i_2 i_3} + \dots + b_{i_k i_1}, \]
which shows that (A.9) cannot hold. Hence, we have proven that the distance $d(i,j)$ is finite for every $1\le i,j\le n$.

We now observe that, even if this notion of distance on a graph might be negative and not symmetric, it still satisfies the triangle inequality. Therefore we have
\[ d(1,j) \le d(1,i) + d(i,j) \le d(1,i) + b_{ij} \qquad \forall\, i,j. \]
Hence, if we set $\varphi(x_i) := d(1,i)$ then the desired inequality (A.8) holds, concluding the proof.

Exercise A.8. Let $S\colon\mathbb R^d\to\mathbb R^d$ be the function $S(x) := -x$. Characterize the probabilities $\mu\in\mathcal P(\mathbb R^d)$ with compact support such that $S$ is an optimal transport map between $\mu$ and $S_\#\mu$ with respect to the quadratic cost.

Solution. Assume that $S$ is optimal from $\mu$ to $S_\#\mu$, and let $\gamma := (\mathrm{Id}\times S)_\#\mu$ be the associated coupling. Since $\gamma$ is optimal, Corollary 2.5.8 implies the existence of a cyclically monotone set $A\subseteq\mathbb R^d\times\mathbb R^d$ such that $\gamma$ is supported on $A$. Since $\gamma$ is also supported on $\operatorname{graph}(S)$, we can assume without loss of generality that $A\subseteq\operatorname{graph}(S)$. Take two points $x, y\in\mathbb R^d$ such that $(x,S(x)), (y,S(y))\in A$. By cyclical monotonicity, it holds that
\[ \tfrac12|x - S(x)|^2 + \tfrac12|y - S(y)|^2 \le \tfrac12|x - S(y)|^2 + \tfrac12|y - S(x)|^2. \]
Developing the squares and rearranging terms, this is equivalent to $|x-y|^2 \le 0$, thus $x = y$. Hence $A$ contains only one point $(x_0,S(x_0))$, and therefore $\mu = \delta_{x_0}$.

On the other hand, if $\mu$ is $\delta_{x_0}$ for some $x_0\in\mathbb R^d$, then $S$ is optimal from $\mu$ to $S_\#\mu = \delta_{S(x_0)}$ (since there is only one transport map).

Exercise A.9. Let $\mu,\nu\in\mathcal P(\mathbb R^d)$ be two compactly supported probability measures invariant under rotations (i.e., $\mu(L(E)) = \mu(E)$ and $\nu(L(E)) = \nu(E)$ for any Borel set $E$ and any orthogonal transformation $L\in O(d)$). Assume that $\mu\ll\mathcal L^d$, and let $T$ be the unique optimal transport map from $\mu$ to $\nu$ with respect to the quadratic cost (see Theorem 2.5.10). Show that $T$ can be written as $x\mapsto\tau(|x|)\frac{x}{|x|}$, where $\tau\colon[0,+\infty)\to[0,+\infty)$ is a nondecreasing function.
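Hint: The function $\tau$ is the monotone transport map between two suitable one-dimensional measures. Also, one may want to use the following lemma:^47

Lemma. Let $\mu_0,\mu_1\in\mathcal P(\mathbb R^d)$ be two rotationally invariant probability measures, and let $\Phi(x) := |x|$. If $\Phi_\#\mu_0 = \Phi_\#\mu_1$ then $\mu_0 = \mu_1$.

Solution. Let $\Phi\colon\mathbb R^d\to[0,+\infty)$ denote the norm, namely $\Phi(x) := |x|$, and define the one-dimensional compactly supported measures
\[ \tilde\mu := \Phi_\#\mu\in\mathcal P([0,+\infty)), \qquad \tilde\nu := \Phi_\#\nu\in\mathcal P([0,+\infty)). \]
From the identity $\Phi_\# dx = \omega_d r^{d-1}\,dr|_{[0,+\infty)}$,^48 where $\omega_d$ is the measure of the unit sphere in $\mathbb R^d$, and the fact that $\mu\ll\mathcal L^d$, it follows that
\[ \tilde\mu \ll \Phi_\#\mathcal L^d = \omega_d r^{d-1}\,dr|_{[0,+\infty)} \ll dr. \]
Hence, applying Theorem 2.5.10 from $\tilde\mu$ to $\tilde\nu$, there exists a convex function $\varphi\colon[0,+\infty)\to\mathbb R$ such that $\tau := \varphi'$ is the optimal transport map from $\tilde\mu$ to $\tilde\nu$. Notice that $\varphi$ is nondecreasing.

^47 For completeness, here is a proof of the lemma. Let $\eta\in C_c^\infty(\mathbb R^d)$ be a test function and let $\tilde\eta\colon[0,\infty)\to\mathbb R$ be defined as the spherical average
\[ \tilde\eta(r) := \frac{1}{\mathcal H^{d-1}(\mathbb S^{d-1})}\int_{\mathbb S^{d-1}} \eta(ry)\,d\mathcal H^{d-1}(y). \]
Let $m_H$ be the Haar measure of the compact Lie group $SO(d)$ of orthogonal transformations of $\mathbb R^d$ with determinant equal to $1$, normalized so that $m_H(SO(d)) = 1$. The Haar measure is such that, for any $x\in\mathbb R^d$, it holds that
\[ \int_{SO(d)} \eta(Q(x))\,dm_H(Q) = \tilde\eta(|x|). \]
Thus, for $i = 0, 1$, due to the rotational invariance of $\mu_i$, we have
\[ \int_{\mathbb R^d}\eta\,d\mu_i = \int_{SO(d)}\int_{\mathbb R^d}\eta\,dQ_\#\mu_i\,dm_H(Q) = \int_{\mathbb R^d}\int_{SO(d)}\eta(Q(x))\,dm_H(Q)\,d\mu_i(x) = \int_{\mathbb R^d}\tilde\eta(|x|)\,d\mu_i(x) = \int_0^\infty\tilde\eta\,d\Phi_\#\mu_i. \]
Because $\Phi_\#\mu_0 = \Phi_\#\mu_1$ by assumption, we deduce
\[ \int_{\mathbb R^d}\eta\,d\mu_0 = \int_{\mathbb R^d}\eta\,d\mu_1 \]
and therefore, since we can choose $\eta$ arbitrarily, $\mu_0 = \mu_1$ must hold.
^48 This follows from the fact that, in polar coordinates, $dx = r^{d-1}\,d\theta\,dr$, and applying the push-forward under the map $\Phi$ the measure $d\theta$ becomes constant, and equal to the volume of the unit sphere in $\mathbb R^d$.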
Set $\tilde T(x) := \tau(|x|)\frac{x}{|x|}$. We will prove that $\tilde T$ is the optimal transport map from $\mu$ to $\nu$, thus $\tilde T = T$.

Let us begin by checking that $\tilde T$ is the gradient of a convex function. Consider the function $\Psi\colon\mathbb R^d\to\mathbb R$ defined as $\Psi(x) := \varphi(|x|)$. Then
\[ \nabla\Psi(x) = \varphi'(|x|)\frac{x}{|x|} = \tau(|x|)\frac{x}{|x|} = \tilde T(x). \]
Moreover, for any $t\in[0,1]$ and $x,y\in\mathbb R^d$ it holds that
\[ \Psi(tx + (1-t)y) = \varphi(|tx + (1-t)y|) \le \varphi(t|x| + (1-t)|y|) \le t\varphi(|x|) + (1-t)\varphi(|y|) = t\Psi(x) + (1-t)\Psi(y), \]
where we have used that $\varphi$ is convex and nondecreasing. Hence, we have shown that $\tilde T$ is the gradient of a convex function.

Since the optimality of $\tilde T$ follows directly from the fact that it is the gradient of a convex function (see Remark 2.5.9), it remains only to prove $\tilde T_\#\mu = \nu$. Let us start by showing that $\tilde T_\#\mu$ is rotationally invariant and $\Phi_\#\tilde T_\#\mu = \Phi_\#\nu$. Note that, given $L\in O(d)$, it holds that
\[ L\circ\tilde T(x) = L\Bigl(\tau(|x|)\frac{x}{|x|}\Bigr) = \tau(|x|)\frac{Lx}{|x|} = \tau(|Lx|)\frac{Lx}{|Lx|} = \tilde T\circ Lx. \]
Thanks to this fact, and since $\mu$ is rotationally invariant, we get (also recall Lemma 1.2.7)
\[ L_\#\tilde T_\#\mu = (L\circ\tilde T)_\#\mu = (\tilde T\circ L)_\#\mu = \tilde T_\# L_\#\mu = \tilde T_\#\mu, \]
thus $\tilde T_\#\mu$ is rotationally invariant. Also, from the identity $\Phi\circ\tilde T = \tau\circ\Phi$, we deduce that
\[ \Phi_\#\tilde T_\#\mu = (\Phi\circ\tilde T)_\#\mu = (\tau\circ\Phi)_\#\mu = \tau_\#\tilde\mu = \tilde\nu = \Phi_\#\nu. \]
Hence, applying the lemma stated in the hint of this exercise to the probability measures $\tilde T_\#\mu$ and $\nu$, we conclude that $\tilde T_\#\mu = \nu$, as desired.

Exercise A.10 (Middle point). Given two probability measures $\mu,\nu\in\mathcal P(\mathbb R^d)$, let $C(\mu,\nu)$ be the infimum of the Kantorovich problem with respect to the quadratic cost
\[ C(\mu,\nu) := \inf_{\gamma\in\Gamma(\mu,\nu)} \int_{\mathbb R^d\times\mathbb R^d} \frac{|x-y|^2}{2}\,d\gamma(x,y). \]
Let $\mu_0,\mu_1\in\mathcal P(\mathbb R^d)$ be two probability measures with compact support. A probability measure $\mu_{1/2}$ is a middle point of $\mu_0$ and $\mu_1$ if $C(\mu_0,\mu_{1/2}) = C(\mu_{1/2},\mu_1) = \frac14 C(\mu_0,\mu_1)$.

(a) If $\mu_0 = \delta_{p_0}$ and $\mu_1 = \delta_{p_1}$, show that the middle point is unique and $\mu_{1/2} = \delta_{\frac{p_0+p_1}{2}}$.
(b) Prove that there is always at least one middle point.
(c) Find two probability measures $\mu_0$, $\mu_1$ such that they have more than one middle point.
(d) Show that if the optimal transport plan between $\mu_0$ and $\mu_1$ is unique, then there is a unique middle point.
(e) Prove that if $\mu_0$, $\mu_1$ are absolutely continuous with respect to the Lebesgue measure, then the middle point $\mu_{1/2}$ is unique and is absolutely continuous with respect to the Lebesgue measure.

Solution. Instead of directly attacking the various parts of the exercise, let us spend some time understanding the properties of middle points a little better. Let us begin with the following useful fact.

Lemma. For any $\nu\in\mathcal P(\mathbb R^d)$, it holds that
\[ C(\mu_0,\nu) + C(\nu,\mu_1) \ge \tfrac12 C(\mu_0,\mu_1). \]
Moreover, if equality holds, then there is an optimal plan $\gamma\in\Gamma(\mu_0,\mu_1)$ such that $\bigl(\frac{x+z}{2}\bigr)_\#\gamma = \nu$ (here, $x$, $z$ denote the first and second coordinates of $\mathbb R^d\times\mathbb R^d$).

Proof. Let $\gamma_0\in\Gamma(\mu_0,\nu)$ and $\gamma_1\in\Gamma(\nu,\mu_1)$ be two optimal plans from $\mu_0$ to $\nu$ and from $\nu$ to $\mu_1$, respectively. The gluing lemma (see the proof of Theorem 3.1.5) ensures the existence of $\tilde\gamma\in\mathcal P(\mathbb R^d\times\mathbb R^d\times\mathbb R^d)$ such that (here, $x$, $y$, $z$ denote the coordinates of $\mathbb R^d\times\mathbb R^d\times\mathbb R^d$)
\[ (x,y)_\#\tilde\gamma = \gamma_0 \quad\text{and}\quad (y,z)_\#\tilde\gamma = \gamma_1. \]
Let $\gamma := (x,z)_\#\tilde\gamma$. It follows directly from the properties of $\tilde\gamma$ that $\gamma$ is an admissible plan from $\mu_0$ to $\mu_1$. Therefore it holds that
\begin{align*}
C(\mu_0,\mu_1) &\le \frac12\int_{\mathbb R^d\times\mathbb R^d} |x-z|^2\,d\gamma(x,z) = \frac12\int_{\mathbb R^d\times\mathbb R^d\times\mathbb R^d} |x-z|^2\,d\tilde\gamma(x,y,z) \\
&\le \frac12\int_{\mathbb R^d\times\mathbb R^d\times\mathbb R^d} \bigl(2|x-y|^2 + 2|y-z|^2\bigr)\,d\tilde\gamma(x,y,z) \\
&= \int_{\mathbb R^d\times\mathbb R^d} |x-y|^2\,d\gamma_0(x,y) + \int_{\mathbb R^d\times\mathbb R^d} |y-z|^2\,d\gamma_1(y,z) \\
&= 2\bigl(C(\mu_0,\nu) + C(\nu,\mu_1)\bigr).
\end{align*}
If equality holds, then all the inequalities we have applied must be equalities. Hence (consider the first inequality of the chain) $\gamma$ has to be an optimal plan and (consider the second inequality) $|x-z|^2 = 2|x-y|^2 + 2|y-z|^2$ has to be true $\tilde\gamma$-a.e. The latter identity implies that $\tilde\gamma$-a.e. it holds that $y = \frac{x+z}{2}$, thus
\[ \nu = y_\#\tilde\gamma = \Bigl(\frac{x+z}{2}\Bigr)_\#\tilde\gamma = \Bigl(\frac{x+z}{2}\Bigr)_\#\gamma, \]
as desired.
Thanks to the lemma, we know that
\[ \mu_{1/2} \text{ is a middle point} \iff C(\mu_0,\mu_{1/2}) \le \tfrac14 C(\mu_0,\mu_1) \text{ and } C(\mu_{1/2},\mu_1) \le \tfrac14 C(\mu_0,\mu_1). \tag{A.10} \]
Indeed these two inequalities, together with
\[ C(\mu_0,\mu_{1/2}) + C(\mu_{1/2},\mu_1) \ge \tfrac12 C(\mu_0,\mu_1) \]
(this inequality follows from the triangle inequality for $W_2$, since $C = \frac12 W_2^2$), imply that $C(\mu_0,\mu_{1/2}) = C(\mu_{1/2},\mu_1) = \frac14 C(\mu_0,\mu_1)$.

Let us consider an optimal plan $\gamma\in\Gamma(\mu_0,\mu_1)$. We claim that $\mu_{1/2} := \bigl(\frac{x+z}{2}\bigr)_\#\gamma$ is a middle point. Indeed, since $\bigl(x,\frac{x+z}{2}\bigr)_\#\gamma$ is an admissible plan from $\mu_0$ to $\mu_{1/2}$, it holds that
\[ C(\mu_0,\mu_{1/2}) \le \frac12\int_{\mathbb R^d\times\mathbb R^d} \Bigl|x - \frac{x+z}{2}\Bigr|^2\,d\gamma(x,z) = \frac12\int_{\mathbb R^d\times\mathbb R^d} \frac14|x-z|^2\,d\gamma(x,z) = \frac14 C(\mu_0,\mu_1). \]
The same reasoning also yields $C(\mu_{1/2},\mu_1) \le \frac14 C(\mu_0,\mu_1)$ and therefore, thanks to (A.10), $\mu_{1/2}$ is a middle point.

Hence, given an optimal plan $\gamma$ we can produce a middle point via the formula $\bigl(\frac{x+z}{2}\bigr)_\#\gamma$. Vice versa, thanks to the lemma above, given a middle point $\mu_{1/2}$ there is an optimal plan $\gamma$ such that $\bigl(\frac{x+z}{2}\bigr)_\#\gamma = \mu_{1/2}$.^49 Now we are ready to tackle the statements of the exercise.

(a) Since there is a unique optimal plan (i.e., $\delta_{p_0}\otimes\delta_{p_1}$) there can be only one middle point and it must be $\bigl(\frac{x+z}{2}\bigr)_\#(\delta_{p_0}\otimes\delta_{p_1}) = \delta_{\frac{p_0+p_1}{2}}$.
(b) The existence of a middle point follows directly from the existence of an optimal plan.
(c) Consider the two probability measures constructed in the solution of Exercise A.5(b). Since every plan induces a middle point as explained above, one can check that the two mentioned probability measures are a good example.
(d) Let $\gamma\in\Gamma(\mu_0,\mu_1)$ be the unique optimal coupling. Then, thanks to the observations above, $\mu_{1/2} = \bigl(\frac{x+z}{2}\bigr)_\#\gamma$ has to be the unique middle point.
(e) Theorem 2.5.10 asserts that there is a unique optimal map between $\mu_0$ and $\mu_1$, which we denote $T\colon\mathbb R^d\to\mathbb R^d$. Thus, our observations imply that $\mu_{1/2} := \bigl(\frac{x+T(x)}{2}\bigr)_\#\mu_0$ is the unique middle point. Also, again by Theorem 2.5.10, it holds that $T = \nabla\varphi$ where $\varphi\colon\mathbb R^d\to\mathbb R$ is a convex function. Hence
\[ \frac{x+T(x)}{2} = \frac12\nabla\Bigl(\frac{|x|^2}{2} + \varphi\Bigr). \]
We now note that, repeating the estimates in (4.3) with $X := \frac{\mathrm{Id}+T}{2}$, we deduce that $X^{-1}$ is $2$-Lipschitz. In particular, for any Borel set $E\subseteq\mathbb R^d$ we have
\[ |X^{-1}(E)| \le \int_E |\det(\nabla X^{-1})|(y)\,dy \le 2^d|E|, \]
hence $\bigl(\frac{x+T(x)}{2}\bigr)_\# dx \ll dx$, and we conclude that
\[ \mu_{1/2} = \Bigl(\frac{\mathrm{Id}+T}{2}\Bigr)_\#\mu_0 \ll \Bigl(\frac{\mathrm{Id}+T}{2}\Bigr)_\# dx \ll dx. \]

^49 One may be tempted to deduce from these observations that the map between optimal plans and middle points is an isomorphism. In order to show it, one should check that if $\gamma,\gamma'\in\Gamma(\mu_0,\mu_1)$ are optimal and such that $\bigl(\frac{x+z}{2}\bigr)_\#\gamma = \bigl(\frac{x+z}{2}\bigr)_\#\gamma'$, then $\gamma = \gamma'$. Such a statement is true but not straightforward, and we will not prove it here.
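For uniform empirical measures with the same number of atoms, the construction of a middle point from an optimal plan can be checked numerically. The sketch below is an illustration only: the helper names `quad_cost` and `middle_point` are hypothetical, and the code relies on Exercise A.7 to replace the Kantorovich infimum by a brute-force minimum over permutations, so it is meant for a handful of points.

```python
from itertools import permutations


def quad_cost(xs, ys):
    """Return (C, perm) for uniform empirical measures on xs and ys.

    C is the minimum over permutations of (1/n) * sum |x_i - y_perm(i)|^2 / 2,
    which equals the Kantorovich infimum by Exercise A.7.
    """
    n = len(xs)
    best, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(
            sum((a - b) ** 2 for a, b in zip(xs[i], ys[perm[i]])) / 2
            for i in range(n)
        ) / n
        if best is None or total < best:
            best, best_perm = total, perm
    return best, best_perm


def middle_point(xs, ys):
    """Atoms of ((x + z) / 2)_# gamma for an optimal matching gamma."""
    _, perm = quad_cost(xs, ys)
    return [
        tuple((a + b) / 2 for a, b in zip(xs[i], ys[perm[i]]))
        for i in range(len(xs))
    ]
```

With `mid = middle_point(xs, ys)`, the values `quad_cost(xs, mid)[0]` and `quad_cost(mid, ys)[0]` should both agree, up to rounding, with one quarter of `quad_cost(xs, ys)[0]`.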
Exercise A.11. Consider $n$ red points $P_1,\dots,P_n$ and $n$ blue points $Q_1,\dots,Q_n$ on the plane. Assume that these $2n$ points are distinct and that no three points are collinear. Show that it is possible to connect each red point to a distinct blue point with a segment in such a way that these segments do not intersect each other. Namely, there exists a permutation $\sigma\colon\{1,\dots,n\}\to\{1,\dots,n\}$ such that the segment $P_iQ_{\sigma(i)}$ does not intersect the segment $P_jQ_{\sigma(j)}$ for any $i\ne j$.

Solution. Consider an optimal map $T$ (which exists, thanks to Exercise A.7) with respect to the linear cost $|x-y|$ between $\frac1n\sum_{i=1}^n\delta_{P_i}$ and $\frac1n\sum_{i=1}^n\delta_{Q_i}$, and let $\sigma$ be the permutation induced by this map, that is, $T(P_i) = Q_{\sigma(i)}$. We claim that $\sigma$ satisfies all the requirements.

Indeed, assume by contradiction that this is not the case. Hence there exist $i\ne j$ such that $P_iQ_{\sigma(i)}$ intersects $P_jQ_{\sigma(j)}$. For notational simplicity, let us denote $P := P_i$, $P' := P_j$, $Q := Q_{\sigma(i)}$, $Q' := Q_{\sigma(j)}$. Without loss of generality (since we can always translate all the points) we can assume that the intersection of the two segments is the origin $O = (0,0)$. Hence, there exist $\lambda,\lambda' > 0$ such that $Q = -\lambda P$, $Q' = -\lambda' P'$. As a consequence of the optimality of the map $T$ with respect to the linear cost, it holds that
\[ |P - Q| + |P' - Q'| \le |P - Q'| + |P' - Q| \]
and therefore the triangle inequality (which is strict, as we are assuming that no three points are collinear) implies that
\[ (1+\lambda)|P| + (1+\lambda')|P'| \le |P + \lambda' P'| + |P' + \lambda P| < |P| + \lambda'|P'| + |P'| + \lambda|P|, \]
a contradiction that proves the result.
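The solution suggests a direct, if inefficient, procedure: pick a matching of minimal total length and observe that it has no crossings. The following sketch illustrates this for small inputs; the function names are hypothetical, the minimum is found by brute force over all permutations rather than by an assignment algorithm, and the crossing test assumes the general-position hypothesis of the exercise.

```python
from itertools import permutations
from math import dist


def min_length_matching(reds, blues):
    """Permutation sigma minimizing the total length of segments reds[i]-blues[sigma[i]]."""
    n = len(reds)
    return min(
        permutations(range(n)),
        key=lambda s: sum(dist(reds[i], blues[s[i]]) for i in range(n)),
    )


def orient(a, b, c):
    """Twice the signed area of the triangle abc (positive if counterclockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p, q, r, s):
    """True if segments pq and rs cross, assuming no three of the points are collinear."""
    return (orient(p, q, r) * orient(p, q, s) < 0
            and orient(r, s, p) * orient(r, s, q) < 0)
```

For example, after `sigma = min_length_matching(reds, blues)`, one can loop over all pairs `i < j` and confirm that `segments_cross(reds[i], blues[sigma[i]], reds[j], blues[sigma[j]])` is never true.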
Appendix B
Disintegrating the disintegration theorem

In this appendix we give a sketch of the proof of the disintegration theorem (Theorem 1.4.10) in the form of guided exercises. We split the proof into a series of increasingly more general exercises with hints to solve them. We do not provide the solutions to these exercises, but interested readers may find a complete proof of the disintegration theorem in [4, Thm. 2.28].

Exercise B.1 (Easy disintegration). Let $\mu\in\mathcal M(\mathbb R^2)$ be a finite measure on $\mathbb R^2$ that is absolutely continuous with respect to the Lebesgue measure, with density $\rho\colon\mathbb R^2\to\mathbb R$. Let $\nu\in\mathcal M(\mathbb R)$ be the measure with density $\eta(x) := \int_{\mathbb R}\rho(x,y)\,dy$. For any $x\in\mathbb R$ such that $\eta(x)\ne 0$, let $\mu_x$ be the measure with density $\rho_x(y) := \frac{\rho(x,y)}{\eta(x)}$. If $\eta(x) = 0$, then simply set $\mu_x := 0$. Show that for any $g\in L^1(\mu)$ it holds that
\[ \int_{\mathbb R^2} g(x,y)\,d\mu(x,y) = \int_{\mathbb R}\int_{\mathbb R} g(x,y)\,d\mu_x(y)\,d\nu(x). \]

Exercise B.2 (Disintegration for product of compact spaces). Let $X$, $Y$ be two compact spaces and let $\mu\in\mathcal M(X\times Y)$ be a finite measure on the product $X\times Y$. Let $\nu := (\pi_1)_\#\mu$, where $\pi_1\colon X\times Y\to X$ is the projection on the first coordinate. Prove that there exists a family of probabilities $(\mu_x)_{x\in X}\subseteq\mathcal P(Y)$ such that
(a) for any Borel set $E\subseteq Y$, the map $x\mapsto\mu_x(E)$ is Borel;
(b) for any $g\in L^1(\mu)$ it holds that
\[ \int_{X\times Y} g(x,y)\,d\mu(x,y) = \int_X\int_Y g(x,y)\,d\mu_x(y)\,d\nu(x). \]
Moreover, if $(\mu_x)_{x\in X}$ and $(\tilde\mu_x)_{x\in X}$ are two families with the mentioned properties, then $\mu_x = \tilde\mu_x$ for $\nu$-a.e. $x\in X$.

Hint:
(1) Given $\psi\in C^0(Y)$, consider the map $A_\psi\colon L^1(X,\nu)\to\mathbb R$ given by the formula
\[ A_\psi(\eta) := \int_{X\times Y}\eta(x)\psi(y)\,d\mu(x,y) \qquad \forall\,\eta\in L^1(X,\nu). \]
Prove that the said map is linear and continuous, and therefore $A_\psi$ can be represented by a function in $L^\infty(X,\nu)$.
(2) Fix a countable dense subset $S\subseteq C^0(Y)$. Prove that, for $\nu$-a.e. $x\in X$, the map $\mu_x\colon S\to\mathbb R$ given by $\mu_x(\psi) := A_\psi(x)$ is linear and continuous. Therefore $\mu_x\in\mathcal P(Y)$. Show that such a family $(\mu_x)_{x\in X}$ satisfies (a).
(3) Show that (b) holds when $g(x,y) = \eta(x)\psi(y)$ with $\eta\in L^1(X,\nu)$ and $\psi\in S$. Show by approximation (since $S$ is dense in $C^0(Y)$) that (b) also holds when $g(x,y) = \eta(x)\psi(y)$ with $\eta\in L^1(X,\nu)$ and $\psi\in C^0(Y)$. Finally, again by approximation, show that this implies (b) for any $g\in L^1(\mu)$.

Exercise B.3 (Disintegration for product of Polish spaces). Show the statement of the previous exercise when $X$ and $Y$ are Polish spaces, i.e., they are complete and separable.

Hint: Use Lemma 2.1.9 to find a suitable exhaustion of $X$ and $Y$ by compact sets, so that one can apply the previous exercise.

Exercise B.4 (Disintegration for fibers of a map). Let $X$, $Y$ be two Polish spaces (complete and separable), let $h\colon Y\to X$ be a Borel map, and let $\mu\in\mathcal M(Y)$ be a finite measure on $Y$. Let $\nu := h_\#\mu$. Show that there exists a family of probabilities $(\mu_x)_{x\in X}\subseteq\mathcal P(Y)$ such that
• for any Borel set $E\subseteq Y$, the map $x\mapsto\mu_x(E)$ is Borel;
• for $\nu$-a.e. $x\in X$, the measure $\mu_x$ is supported on the fiber $h^{-1}(x)$;
• for any $g\in L^1(\mu)$, it holds that
\[ \int_Y g(y)\,d\mu(y) = \int_X\int_{h^{-1}(x)} g(y)\,d\mu_x(y)\,d\nu(x). \]
Moreover, if $(\mu_x)_{x\in X}$ and $(\tilde\mu_x)_{x\in X}$ are two families with the mentioned properties, then $\mu_x = \tilde\mu_x$ for $\nu$-a.e. $x\in X$.

Hint: Apply the previous exercise to the measure $(h\times\mathrm{Id})_\#\mu\in\mathcal M(X\times Y)$.
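On a finite product space the disintegration can be written down explicitly, which may help to build intuition before attempting the exercises. The sketch below is a discrete analogue only, not part of the proof: the function name `disintegrate` and the dictionary representation of measures are assumptions made for illustration, and exact rational weights are used so that the identity in (b) can be checked with equality.

```python
from fractions import Fraction


def disintegrate(mu):
    """Disintegrate a finite measure on a finite product X x Y.

    mu maps pairs (x, y) to nonnegative masses. Returns (nu, kernels), where
    nu[x] is the first marginal and kernels[x][y] = mu[(x, y)] / nu[x]
    whenever nu[x] != 0 (the kernel is left empty otherwise).
    """
    nu = {}
    for (x, _), mass in mu.items():
        nu[x] = nu.get(x, 0) + mass
    kernels = {x: {} for x in nu}
    for (x, y), mass in mu.items():
        if nu[x] != 0:
            kernels[x][y] = mass / nu[x]
    return nu, kernels
```

For any function `g`, the sum of `g(x, y) * mu[(x, y)]` over all pairs then coincides with the iterated sum of `nu[x] * g(x, y) * kernels[x][y]`, which is the discrete counterpart of property (b).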
References

[1] L. Ambrosio, Transport equation and Cauchy problem for non-smooth vector fields. In Calculus of variations and nonlinear partial differential equations, pp. 1–41, Lecture Notes in Math. 1927, Springer, Berlin, 2008
[2] L. Ambrosio, Calculus, heat flow and curvature-dimension bounds in metric measure spaces. In Proceedings of the International Congress of Mathematicians—Rio de Janeiro 2018. Vol. I. Plenary lectures, pp. 301–340, World Sci. Publ., Hackensack, NJ, 2018
[3] L. Ambrosio, E. Bruè, and D. Semola, Lectures on optimal transport. La Matematica per il 3+2 130, Springer International, 2021
[4] L. Ambrosio, N. Fusco, and D. Pallara, Functions of bounded variation and free discontinuity problems. Oxford Math. Monogr., Clarendon Press, Oxford University Press, New York, 2000
[5] L. Ambrosio and N. Gigli, A user’s guide to optimal transport. In Modelling and optimisation of flows on networks, pp. 1–155, Lecture Notes in Math. 2062, Springer, Heidelberg, 2013
[6] L. Ambrosio, N. Gigli, and G. Savaré, Gradient flows in metric spaces and in the space of probability measures. Second edn., Lectures in Mathematics ETH Zürich, Birkhäuser, Basel, 2008
[7] L. Ambrosio, N. Gigli, and G. Savaré, Metric measure spaces with Riemannian Ricci curvature bounded from below. Duke Math. J. 163 (2014), 1405–1490
[8] L. Ambrosio, F. Glaudo, and D. Trevisan, On the optimal map in the 2-dimensional random matching problem. Discrete Contin. Dyn. Syst. 39 (2019), 7291–7308
[9] L. Ambrosio, F. Stra, and D. Trevisan, A PDE approach to a 2-dimensional matching problem. Probab. Theory Related Fields 173 (2019), 433–477
[10] V. Arnold, Sur la géométrie différentielle des groupes de Lie de dimension infinie et ses applications à l’hydrodynamique des fluides parfaits. Ann. Inst. Fourier (Grenoble) 16 (1966), 319–361
[11] D. Bakry and M. Émery, Diffusions hypercontractives. In Séminaire de probabilités, XIX, 1983/84, pp. 177–206, Lecture Notes in Math. 1123, Springer, Berlin, 1985
[12] M. Beiglböck and N. Juillet, On a problem of optimal transport under marginal martingale constraints. Ann. Probab. 44 (2016), 42–106
[13] M. Beiglböck, M. Nutz, and N. Touzi, Complete duality for martingale optimal transport on the line. Ann. Probab. 45 (2017), 3038–3074
[14] J.-D. Benamou and Y. Brenier, A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math. 84 (2000), 375–393
[15] H. Bercovici, A. Brown, and C. Pearcy, Measure and integration. Springer, Cham, 2016
[16] S. Bobkov and M. Ledoux, One-dimensional empirical measures, order statistics, and Kantorovich transport distances. Mem. Amer. Math. Soc. 261 (2019), v+126
[17] V. I. Bogachev, Measure theory. Vol. I, II. Springer, Berlin, 2007
[18] V. I. Bogachev and A. V. Kolesnikov, The Monge-Kantorovich problem: Achievements, connections, and prospects. Uspekhi Mat. Nauk 67 (2012), 3–110
[19] F. Bolley and C. Villani, Weighted Csiszár-Kullback-Pinsker inequalities and applications to transportation inequalities. Ann. Fac. Sci. Toulouse Math. (6) 14 (2005), 331–352
[20] Y. Brenier, Décomposition polaire et réarrangement monotone des champs de vecteurs. C. R. Acad. Sci. Paris Sér. I Math. 305 (1987), 805–808
[21] Y. Brenier and W. Gangbo, Lp approximation of maps by diffeomorphisms. Calc. Var. Partial Differential Equations 16 (2003), 147–164
[22] H. Brézis, Monotonicity methods in Hilbert spaces and some applications to nonlinear partial differential equations. In Contributions to nonlinear functional analysis (Proc. Sympos., Math. Res. Center, Univ. Wisconsin, Madison, Wis., 1971), pp. 101–156, 1971
[23] H. Brezis, Functional analysis, Sobolev spaces and partial differential equations. Universitext, Springer, New York, 2011
[24] H. Brezis, Remarks on the Monge-Kantorovich problem in the discrete setting. C. R. Math. Acad. Sci. Paris 356 (2018), 207–213
[25] G. Buttazzo, L. De Pascale, and P. Gori-Giorgi, Optimal-transport formulation of electronic density-functional theory. Phys. Rev. A 85 (2012), 062502
[26] L. A. Caffarelli, Some regularity properties of solutions of Monge Ampère equation. Comm. Pure Appl. Math. 44 (1991), 965–969
[27] L. A. Caffarelli, The regularity of mappings with a convex potential. J. Amer. Math. Soc. 5 (1992), 99–104
[28] G. Carlier, A. Galichon, and F. Santambrogio, From Knothe’s transport to Brenier’s map and a continuation method for optimal transport. SIAM J. Math. Anal. 41 (2009/10), 2554–2576
[29] J.-A. Carrillo, Gradient flows: Qualitative properties & numerical schemes. Slides of workshop, 2014, https://www.ricam.oeaw.ac.at/specsem/specsem2014/school2/CharlaRICAM2014-1.pdf
[30] F. Cavalletti and A. Mondino, Sharp and rigid isoperimetric inequalities in metric-measure spaces with lower Ricci curvature bounds. Invent. Math. 208 (2017), 803–849
[31] I. Chavel, Riemannian geometry—A modern introduction. Cambridge Tracts in Math. 108, Cambridge University Press, Cambridge, 1993
[32] D. Cordero-Erausquin, Some applications of mass transport to Gaussian-type inequalities. Arch. Ration. Mech. Anal. 161 (2002), 257–269
[33] D. Cordero-Erausquin, R. J. McCann, and M. Schmuckenschläger, A Riemannian interpolation inequality à la Borell, Brascamp and Lieb. Invent. Math. 146 (2001), 219–257
[34] D. Cordero-Erausquin, B. Nazaret, and C. Villani, A mass-transportation approach to sharp Sobolev and Gagliardo-Nirenberg inequalities. Adv. Math. 182 (2004), 307–332
[35] S. Daneri and A. Figalli, Variational models for the incompressible Euler equations. In HCDTE lecture notes. Part II. Nonlinear hyperbolic PDEs, dispersive and transport equations, p. 51, AIMS Ser. Appl. Math. 7, Am. Inst. Math. Sci. (AIMS), Springfield, MO, 2013
[36] G. De Philippis and A. Figalli, The Monge-Ampère equation and its link to optimal transportation. Bull. Amer. Math. Soc. (N.S.) 51 (2014), 527–580
[37] E. del Barrio, E. Giné, and C. Matrán, Central limit theorems for the Wasserstein distance between the empirical and the true distributions. Ann. Probab. 27 (1999), 1009–1071
[38] E. del Barrio and J.-M. Loubes, Central limit theorems for empirical transportation cost in general dimension. Ann. Probab. 47 (2019), 926–951
[39] S. Di Marino, A. Gerolin, and L. Nenna, Optimal transportation theory with repulsive costs. In Topological optimization and optimal transport, pp. 204–256, Radon Ser. Comput. Appl. Math. 17, De Gruyter, Berlin, 2017
[40] A. Figalli, Stability in geometric and functional inequalities. In European Congress of Mathematics, pp. 585–599, Eur. Math. Soc., Zürich, 2013
[41] A. Figalli, The Monge-Ampère equation and its applications. Zurich Lect. Adv. Math., Eur. Math. Soc., Zürich, 2017
[42] A. Figalli, W. Gangbo, and T. Yolcu, A variational method for a class of parabolic PDEs. Ann. Sc. Norm. Super. Pisa Cl. Sci. (5) 10 (2011), 207–252
[43] A. Figalli and N. Gigli, A new transportation distance between non-negative measures, with applications to gradients flows with Dirichlet boundary conditions. J. Math. Pures Appl. (9) 94 (2010), 107–130
[44] A. Figalli, F. Maggi, and A. Pratelli, A mass transportation approach to quantitative isoperimetric inequalities. Invent. Math. 182 (2010), 167–211
[45] A. Figalli and C. Villani, Strong displacement convexity on Riemannian manifolds. Math. Z. 257 (2007), 251–259
[46] A. Figalli and C. Villani, Optimal transport and curvature. In Nonlinear PDE’s and applications, pp. 171–217, Lecture Notes in Math. 2028, Springer, Heidelberg, 2011
[47] S. Gallot, D. Hulin, and J. Lafontaine, Riemannian geometry. Third edn., Universitext, Springer, Berlin, 2004
[48] W. Gangbo and A. Święch, Optimal maps for the multidimensional Monge-Kantorovich problem. Comm. Pure Appl. Math. 51 (1998), 23–45
[49] N. Gigli, On the differential structure of metric measure spaces and applications. Mem. Amer. Math. Soc. 236 (2015), vi+91
[50] M. Goldman and D. Trevisan, Convergence of asymptotic costs for random euclidean matching problems. Probab. Math. Phys. 2 (2021), 121–142
[51] R. Jordan, D. Kinderlehrer, and F. Otto, The variational formulation of the Fokker-Planck equation. SIAM J. Math. Anal. 29 (1998), 1–17
[52] B. Klartag, Needle decompositions in Riemannian geometry. Mem. Amer. Math. Soc. 249 (2017), v+77
[53] H. Knothe, Contributions to the theory of convex bodies. Michigan Math. J. 4 (1957), 39–52
[54] H. W. Kuhn, The Hungarian method for the assignment problem. Naval Res. Logist. Quart. 2 (1955), 83–97
[55] M. Ledoux, The concentration of measure phenomenon. Mathematical Surveys and Monographs 89, American Mathematical Society, Providence, RI, 2001
[56] J. M. Lee, Riemannian manifolds. Grad. Texts in Math. 176, Springer, New York, 1997
[57] J. Lott and C. Villani, Ricci curvature for metric-measure spaces via optimal transport. Ann. of Math. (2) 169 (2009), 903–991
[58] R. J. McCann, A convexity principle for interacting gases. Adv. Math. 128 (1997), 153–179
[59] F. Otto, The geometry of dissipative evolution equations: the porous medium equation. Comm. Partial Differential Equations 26 (2001), 101–174
[60] B. Pass, Multi-marginal optimal transport: theory and applications. ESAIM Math. Model. Numer. Anal. 49 (2015), 1771–1790
[61] L. E. Payne and H. F. Weinberger, An optimal Poincaré inequality for convex domains. Arch. Rational Mech. Anal. 5 (1960), 286–292
[62] P. Petersen, Riemannian geometry. Second edn., Graduate Texts in Math. 171, Springer, New York, 2006
[63] G. Peyré and M. Cuturi, Computational optimal transport. Found. Trends Mach. Learn. 11 (2019), 355–602
[64] F. Santambrogio, Optimal transport for applied mathematicians. Progr. Nonlinear Differential Equations Appl. 87, Birkhäuser/Springer, Cham, 2015
[65] F. Santambrogio, {Euclidean, metric, and Wasserstein} gradient flows: an overview. Bull. Math. Sci. 7 (2017), 87–154
[66] R. Sinkhorn, A relationship between arbitrary positive matrices and doubly stochastic matrices. Ann. Math. Statist. 35 (1964), 876–879
[67] R. Sinkhorn and P. Knopp, Concerning nonnegative matrices and doubly stochastic matrices. Pacific J. Math. 21 (1967), 343–348
[68] K.-T. Sturm, On the geometry of metric measure spaces. I. Acta Math. 196 (2006), 65–131
[69] K.-T. Sturm, On the geometry of metric measure spaces. II. Acta Math. 196 (2006), 133–177
[70] G. Teschl, Ordinary differential equations and dynamical systems. Graduate Stud. Math. 140, American Mathematical Society, Providence, RI, 2012
[71] C. Villani, Topics in optimal transportation. Graduate Stud. Math. 58, American Mathematical Society, Providence, RI, 2003
[72] C. Villani, Optimal transport. Grundlehren Math. Wiss. 338, Springer, Berlin, 2009
[73] M.-K. von Renesse and K.-T. Sturm, Transport inequalities, gradient estimates, entropy, and Ricci curvature. Comm. Pure Appl. Math. 58 (2005), 923–940
[74] S. Willard, General topology. Addison-Wesley, Reading, Mass.-London-Don Mills, Ont., 1970
Alessio Figalli, Federico Glaudo
An Invitation to Optimal Transport, Wasserstein Distances, and Gradient Flows

This book provides a self-contained introduction to optimal transport, and it is intended as a starting point for any researcher who wants to enter into this beautiful subject. The presentation focuses on the essential topics of the theory: Kantorovich duality, existence and uniqueness of optimal transport maps, Wasserstein distances, the JKO scheme, Otto’s calculus, and Wasserstein gradient flows. At the end, a presentation of some selected applications of optimal transport is given. The book is suitable for a course at the graduate level, and also includes an appendix with a series of exercises along with their solutions.
https://ems.press ISBN 978-3-98547-010-5