Texts in Applied Mathematics Volume 76
Editors-in-Chief Anthony Bloch, University of Michigan, Ann Arbor, MI, USA Charles L. Epstein, University of Pennsylvania, Philadelphia, PA, USA Alain Goriely, University of Oxford, Oxford, UK Leslie Greengard, New York University, New York, NY, USA Series Editors J. Bell, Lawrence Berkeley National Laboratory, Berkeley, CA, USA R. Kohn, New York University, New York, NY, USA P. Newton, University of Southern California, Los Angeles, CA, USA C. Peskin, New York University, New York, NY, USA R. Pego, Carnegie Mellon University, Pittsburgh, PA, USA L. Ryzhik, Stanford University, Stanford, CA, USA A. Singer, Princeton University, Princeton, NJ, USA A. Stevens, University of Münster, Münster, Germany A. Stuart, University of Warwick, Coventry, UK T. Witelski, Duke University, Durham, NC, USA S. Wright, University of Wisconsin, Madison, WI, USA
The mathematization of all sciences, the fading of traditional scientific boundaries, the impact of computer technology, the growing importance of computer modelling and the necessity of scientific planning all create the need both in education and research for books that are introductory to and abreast of these developments. The aim of this series is to provide such textbooks in applied mathematics for the student scientist. Books should be well illustrated and have clear exposition and sound pedagogy. A large number of examples and exercises at varying levels are recommended. TAM publishes textbooks suitable for advanced undergraduate and beginning graduate courses, and complements the Applied Mathematical Sciences (AMS) series, which focuses on advanced textbooks and research-level monographs.
Onésimo Hernández-Lerma · Leonardo R. Laura-Guarachi · Saul Mendoza-Palacios · David González-Sánchez
An Introduction to Optimal Control Theory
The Dynamic Programming Approach
Onésimo Hernández-Lerma Department of Mathematics CINVESTAV-IPN Mexico City, Mexico
Leonardo R. Laura-Guarachi Escuela Superior de Economía Instituto Politécnico Nacional Mexico City, Mexico
Saul Mendoza-Palacios CIDE Mexico City, Mexico
David González-Sánchez Conacyt-Universidad de Sonora Hermosillo, Sonora, Mexico
ISSN 0939-2475 ISSN 2196-9949 (electronic) Texts in Applied Mathematics ISBN 978-3-031-21138-6 ISBN 978-3-031-21139-3 (eBook) https://doi.org/10.1007/978-3-031-21139-3 Mathematics Subject Classification: 49-01, 49L20, 60J25, 90C40, 93E20 © The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Max and Juli - OHL To Justina and Manuel - LRLG Dedicated to Edith, Paulina and Saul, - SMP To my family, - DGS
Preface

These lecture notes present material for an introductory course on optimal control theory including deterministic and stochastic systems, with discrete- and continuous-time parameter. The course is introductory in the sense that no previous knowledge of control theory is required.

There are several well-known techniques to study Optimal Control Problems (OCPs), but here we are mainly interested in the Dynamic Programming (DP) approach. This is a very general technique introduced by Richard E. Bellman (1920–1984) in the 1950s, which is applicable to a large class of optimization and control problems. (See the remark at the end of this Preface.)

The DP approach gives sufficient conditions for an OCP to have a solution. The main tool is the so-called Verification Theorem that can be summarized as follows. Given an Optimal Control Problem (OCP):

1) Write an associated equation, called the Dynamic Programming Equation (DPE).
2) If the DPE has a "suitable" solution, then this solution can be used in turn to solve the given OCP.

Actually, part (1) is straightforward. The main difficulty is part (2); that is, finding suitable solutions to the DPE can be a nontrivial matter. The DPE is known by several names, such as the optimality equation, the Bellman equation or, for continuous-time OCPs, the Hamilton–Jacobi–Bellman equation.
The notes begin in Chapter 1 with a general introduction to OCPs. This chapter presents basic notions, such as the objective functionals to be optimized, and the notion of control strategy or control policy. Some examples illustrate the classes of OCPs to be studied in the text.

Chapters 2 and 3 concern discrete-time systems. The former chapter considers the deterministic case in which the typical dynamic system is of the form

xt+1 = Ft (xt , at ) for t = 0, 1, . . . , T − 1,    (1)

with a given initial condition x0 , where T is called the planning horizon. In (1), xt and at denote the so-called state and control variables, respectively. The control actions at are taken according to a given control policy. Chapter 2 introduces finite (T < ∞) and infinite (T = ∞) horizon OCPs with objective functionals that are typically defined by finite or infinite sums, respectively. In the infinite-horizon case, we also introduce an asymptotic optimality criterion, namely, the long-run average cost.

The remaining chapters have essentially the same structure as Chapter 2, except for the dynamic model. In particular, in Chapter 3 we consider the stochastic analogue of (1), that is,

xt+1 = Ft (xt , at , ξt ), t = 0, 1, . . . , T − 1,    (2)

where xt and at are as in (1), and the ξt are random variables that represent random perturbations. It is explained that, depending on the situation that they represent, these perturbations form either a driving process or a random noise. In addition to the "system model" (2), in Chapter 3 we introduce the "Markov control model" in which we have a transition probability instead of a transition function Ft as in (2). The Markov control model was introduced by Bellman (1957b), who also coined the term "Markov decision process", and which today
is also known as a Markov Control Process (MCP). MCPs are especially useful in control problems (for instance, control of some queueing systems) in which we do not have an explicit system model as in (2). On the other hand, it is noted in the text that (1) and (2) are particular classes of MCPs.

The second half of these lectures concerns continuous-time OCPs. It begins in Chapter 4 with control problems in which the state process x(·) satisfies an ordinary differential equation (ODE)

ẋ(t) = F (t, x(t), a(t)) for t ∈ [0, T ].    (3)
We again consider finite- and (discounted and undiscounted) infinite-horizon problems in which the "summations" that appear in Chapter 2 are replaced by integrals. In several places of Chapter 4 it is emphasized that (3) defines, in fact, a certain family of continuous-time MCPs. The reason for doing this is that Chapter 5 introduces a general continuous-time MCP, in which the evolution of the state process is determined by an "infinitesimal generator" rather than a system function F as in (3). Since this fact is difficult to grasp from a conceptual viewpoint, immediately after introducing some standard dynamic programming facts, we show that the results in Chapter 4 are examples or particular cases of the ideas in Chapter 5. Finally, as another example of a continuous-time MCP, in Chapter 6 we study the control of diffusion processes that, for our present purposes, can be expressed as solutions of certain stochastic differential equations that generalize the ODEs in (3).

Each of the Chapters 2–6 presents examples to illustrate the main results. It also includes a section with exercises.

Remark: Why Dynamic Programming? In other words, if there are several well-known techniques to analyze optimal control problems (OCPs), why do we emphasize DP? The answer is simple: it has more advantages than any of the other approaches to OCPs. Let us explain this.
1. DP is applicable to essentially all classes of OCPs, including deterministic and stochastic problems, with discrete- or continuous-time parameter, finite- or infinite-dimensional state space (see Fabbri et al. (2017)), with discrete (that is, finite or countable spaces) or general metric state space, etc. (In many applications, such as control of queues or reinforcement learning (see e.g. Sutton and Barto (2018)), we work in finite state spaces.)

2. For discrete–time problems, DP does not require smoothness conditions, which is important in some applications, for instance in model predictive control (see e.g. Raković and Levine (2019)).

3. In DP, there are well-known approximation algorithms, such as the (recursive) value iteration algorithm or the (monotone) policy iteration algorithm, which are the basis for some applications, such as adaptive (or approximate) DP and stabilization problems in control theory.

4. In DP, we automatically obtain feedback (or closed-loop or Markov) optimal controls, in contrast to the open-loop controls obtained when using, for instance, the Maximum Principle.
Some of these advantages, and many others, are practically impossible to obtain when using other solution techniques for OCPs.
On the negative side, the main disadvantage is what Bellman (1957a, 1961) called the curse of dimensionality, which basically refers to the difficulty in solving the DPE when the dimension increases.
Mexico City, Mexico October 2022
Onésimo Hernández-Lerma Leonardo R. Laura-Guarachi Saul Mendoza-Palacios David González-Sánchez
Acknowledgments. The work by Onésimo Hernández-Lerma was partially supported by Consejo Nacional de Ciencia y Tecnología (CONACYT-México) grant CF-2019/263963. Moreover, the work by Leonardo R. Laura-Guarachi and Saul Mendoza-Palacios was partially supported by CONACYT-México under grant Ciencia Frontera 2019-87787.
Contents

Preface
1 Introduction: Optimal Control Problems
2 Discrete–Time Deterministic Systems
  2.1 The Dynamic Programming Equation
  2.2 The DP Equation and Related Topics
    2.2.1 Variants of the DP Equation
    2.2.2 The Minimum Principle
  2.3 Infinite–Horizon Problems
    2.3.1 Discounted Case
    2.3.2 The Minimum Principle
    2.3.3 The Weighted-Norm Approach
  2.4 Approximation Algorithms
    2.4.1 Value Iteration
    2.4.2 Policy Iteration
  2.5 Long–Run Average Cost Problems
    2.5.1 The AC Optimality Equation
    2.5.2 The Steady–State Approach
    2.5.3 The Vanishing Discount Approach
3 Discrete–Time Stochastic Control Systems
  3.1 Stochastic Control Models
  3.2 Markov Control Processes: Finite Horizon
  3.3 Conditions for the Existence of Measurable Minimizers
  3.4 Examples
  3.5 Infinite–Horizon Discounted Cost Problems
  3.6 Policy Iteration
  3.7 Long-Run Average Cost Problems
    3.7.1 The Average Cost Optimality Inequality
    3.7.2 The Average Cost Optimality Equation
    3.7.3 Examples
4 Continuous–Time Deterministic Systems
  4.1 The HJB Equation and Related Topics
    4.1.1 Finite–Horizon Problems: The HJB Equation
    4.1.2 A Minimum Principle from the HJB Equation
  4.2 The Discounted Case
  4.3 Infinite–Horizon Discounted Cost
  4.4 Long-Run Average Cost Problems
    4.4.1 The Average Cost Optimality Equation (ACOE)
    4.4.2 The Steady-State Approach
    4.4.3 The Vanishing Discount Approach
  4.5 The Policy Improvement Algorithm
    4.5.1 The PIA: Discounted Cost Problems
    4.5.2 The PIA: Average Cost Problems
5 Continuous–Time Markov Control Processes
  5.1 Markov Processes
  5.2 The Infinitesimal Generator
  5.3 Markov Control Processes
  5.4 The Dynamic Programming Approach
  5.5 Long–Run Average Cost Problems
    5.5.1 The Ergodicity Approach
    5.5.2 The Vanishing Discount Approach
6 Controlled Diffusion Processes
  6.1 Diffusion Processes
  6.2 Controlled Diffusion Processes
  6.3 Examples: Finite Horizon
  6.4 Examples: Discounted Costs
  6.5 Examples: Average Costs
Appendix A: Terminology and Notation
Appendix B: Existence of Measurable Minimizers
Appendix C: Markov Processes
Bibliography
Index
Chapter 1 Introduction: Optimal Control Problems
In a few words, in an optimal control problem (OCP) we are given a dynamical system that is "controllable" in the sense that its behavior depends on some parameters or components that we can choose within certain ranges. These components are called control actions. When we look at these control actions throughout the whole period of time in which the system is functioning, then they form control policies or strategies. On the other hand, we are also given an objective function or performance index that somehow measures the system's response to each control policy. The OCP is then to find a control policy that optimizes the given objective function.

More precisely, in an OCP we are given:

1. A "controllable" dynamical system, which typically, depending on the time parameter, can be a discrete–time system or a continuous-time system. For instance, in the former case, the dynamical system is of the form

xt+1 = Ft (xt , at , ξt ) for t = 0, 1, . . . , T − 1,    (1.1)

with some given initial state x0 . (In Chaps. 4–6 we consider continuous–time systems.) In (1.1), at each time t, xt denotes the state variable, with values in some state space X; at is the
control or action variable, with values in some action space A; and ξt is a disturbance or perturbation in a disturbance set S. In most applications of control theory, the spaces X, A, and S are subsets of finite-dimensional spaces. However, for technical reasons (to be briefly discussed in Remark 1.7), it is convenient to assume that they are general Borel spaces. Moreover, depending on the disturbances ξt , the system (1.1) is classified as deterministic, stochastic or uncertain. This is explained in Remark 1.2, below.

2. We are also given a set Π of admissible (or feasible) control policies or strategies, which are sequences π = {a0 , a1 , ...} with values at ∈ A.

3. Finally, we are given a real–valued function V on Π × X, which is the objective function or performance index. The function V can take many different forms. One of the most common is a total cost

V (π, x0 ) := Σ_{t=0}^{T−1} ct (xt , at ) + CT (xT ),    (1.2)
where π = {a0 , . . . , aT −1 } denotes the strategy being used, and x0 is the initial state for the system (1.1). The term ct (xt , at ), which is called the stage cost, denotes the cost incurred at time t given that xt is the state of the system and at is the applied control action. The so–called terminal cost function CT (·) in (1.2) depends on the terminal state xT only. In (1.2), T is called the OCP’s planning horizon and can be finite or infinite. The infinite–horizon case is obtained from (1.2) taking CT (·) ≡ 0 and letting T → ∞. With the above components, we can now state the OCP as follows: For each initial state x0 , optimize the objective function V (π, x0 ) over all π ∈ Π subject to (1.1). Here, depending on the context, “optimize” means either “minimize” (for instance, if V is a cost function) or “maximize” (if V is a reward or a utility function). Thus, if we are minimizing V , then the OCP would be: Find π ∗ ∈ Π such that
V (π∗ , x) = inf_{π∈Π} V (π, x) ∀ x0 = x,    (1.3)
subject to (1.1). If this holds, then π∗ is called an optimal policy or optimal solution to the OCP, and the function

V ∗ (x) := inf_{π∈Π} V (π, x) = V (π∗ , x) ∀ x ∈ X    (1.4)
is the OCP’s value function or minimum cost. If, on the other hand, V is a profit or reward function to be maximized, then in (1.3)–(1.4) we write “sup” in lieu of “inf”, and the value function V ∗ in (1.4) is called the OCP’s maximum utility or maximum reward, depending on the context. To ensure that the value function V ∗ (·) in (1.4) is finite–valued, we will suppose that the following assumption holds unless stated otherwise. Basic Assumption. (a) The cost functions ct and CT are nonnegative; (b) There is a strategy π ∈ Π such that V (π, x) < ∞ for all x ∈ X. The dynamic programming approach under the condition (a) in the Basic Assumption (or when the cost functions are bounded below) is known as the positive case. In the negative case the cost functions are nonpositive (or bounded above). In Sect. 2.3.3 we will introduce the so-called weighted-norm approach, which allows positive and negative costs but with a restricted growth rate (Assumption 2.33(c)). There are many conditions ensuring that V ∗ (·) is finite–valued, but our Basic Assumption greatly simplifies the mathematical presentation. For instance, if the condition (a) holds, and ct , CT are bounded above by a constant M , then the total cost in (1.2) satisfies that 0 ≤ V (π, x0 ) ≤ (T + 1)M
∀ π and x0 .
The main role of the Basic Assumption is that it helps to fix ideas, and we can move forward to other aspects of an OCP. Part (a) in the Basic Assumption guarantees that V (π, x) ≥ 0 for all π ∈ Π and x0 = x ∈ X, and so V ∗ (·) ≥ 0. On the other hand, (b) yields that V ∗ (x) < ∞ for all x ∈ X. (Part (a) can
be replaced by the apparently weaker condition: ct and CT are bounded below.)

Example 1.1 (A production–inventory system). Consider a production system in which the state variable xt ∈ R is the stock or inventory level of some product. A typical state equation for this system is

xt+1 = xt + at − ξt ∀ t = 0, 1, . . . ,    (1.5)

where the control or action variable at ≥ 0 is the amount to be ordered or produced (and immediately supplied) at the beginning of period t, and the disturbance ξt ≥ 0 is the product's demand. Observe that (1.5) is simply a balance–like equation. A negative stock, which occurs if ξt > xt + at in (1.5), is interpreted as a backlog that will be fulfilled as soon as the stock is replenished. However, if backlogging is not allowed, then the model (1.5) can be replaced with xt+1 = max{xt + at − ξt , 0}.

Usually, production systems have a finite capacity C, so that, at each period t, we must have xt + at ≤ C or at ≤ C − xt . Therefore, if the state is xt = x, then we have the control constraint

a ∈ A(x),
with A(x) = [0, C − x].
Thus, the action space is A = [0, ∞), which contains the control constraint set A(x) for every state x. In a production–inventory system the objective function V in (1.2) is usually interpreted as a total profit or revenue function. Hence, given the unit sale price (p), the unit production cost (c), and the unit holding (or maintenance) cost (h), then at each time t, instead of a cost ct in (1.2), we have a net revenue of the form rt (xt , at ) := pyt − cat − h(xt + at ), where yt = min{ξt , xt + at } is the sale during period t.
3
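To make the state equation (1.5) and the net revenue rt concrete, here is a minimal simulation sketch in Python. The capacity, prices, demand values, and the order-up-to rule are illustrative assumptions, not data from the example above.

    # Illustrative simulation of the production-inventory system (1.5).
    # All numerical values (C, p, c, h, demands) and the policy are assumptions.
    C, p, c, h = 10.0, 5.0, 2.0, 0.5       # capacity, sale price, production cost, holding cost
    demands = [3.0, 6.0, 4.0, 8.0, 2.0]    # a hypothetical demand (driving) process

    def order_up_to(x, target=7.0):
        # A simple feasible policy: order up to 'target', respecting a <= C - x.
        return max(0.0, min(target - x, C - x))

    x, total_revenue = 4.0, 0.0            # initial stock x0
    for t, xi in enumerate(demands):
        a = order_up_to(x)                 # control action a_t in A(x) = [0, C - x]
        y = min(xi, x + a)                 # sales y_t during period t
        r = p * y - c * a - h * (x + a)    # net revenue r_t(x_t, a_t)
        total_revenue += r
        x = x + a - xi                     # state equation (1.5), backlogging allowed
        print(f"t={t}: next stock={x:.1f}, order={a:.1f}, revenue={r:.2f}")
    print("total revenue:", round(total_revenue, 2))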
Remark 1.2. (a) The system (1.1) (or (1.5)) is said to be deterministic or stochastic or uncertain depending on whether the disturbances ξt are, respectively,
• given constants in (the disturbance set) S,
• S–valued random variables, or
• constants in S but with unknown values.

(b) If (1.1) is a stochastic system, the disturbances ξt are said to form a driving process if they have a concrete meaning, say, physical or economical. In contrast, if the disturbances are generic random variables (with no specific meaning), the process {ξt } is called a random noise. For instance, in Example 1.1, if the demand variables ξt are random, then they form a driving process. On the other hand, in models of economic growth or population growth the disturbances typically form a random noise. (See Example C.4 and Remark C.5 in Appendix C.)

(c) If (1.1) is a stochastic system, the cost in the right–hand side of (1.2) is random. In this case, (1.2) is replaced with the expected total cost

V (π, x0 ) := E[ Σ_{t=0}^{T−1} ct (xt , at ) + CT (xT ) ].    (1.6)
(d) Strategies. To introduce a control policy or strategy π = {at } it is important to specify the information available to the controller. In the simplest case, the control actions depend on the time parameter only; that is, at = g(t) for some function g. In this case, π is called an open–loop policy. If, on the other hand, at each time t, the control action at = g(t, xt ) depends on t and the current state xt , then π is said to be a closed–loop policy, also known as a feedback or Markov policy. In general, if the actions are of the form at = g(x0 , a0 , x1 , a1 , . . . , xt−1 , at−1 , xt ) so that, at each time t, the action depends on the whole history of the process up to time t, then π is called a nonanticipative or history–dependent policy. In these notes, to fix ideas, we will consider only open– loop and closed–loop (or Markov) policies unless stated otherwise. See Fig. 1.1 for a representation of a feedback or closed– loop scheme. 3
Fig. 1.1 A closed–loop scheme
Example 1.3 (A portfolio selection problem ≡ A consumption–investment problem). Let xt be an investor's wealth at time t = 0, 1, . . . . At each time t the investor decides how much to consume of his wealth, say ct , and the rest xt − ct is invested in a stock portfolio, which consists of

• a risk–free asset (e.g., bonds) with a fixed interest rate r, and
• a risky asset (e.g., stocks) with a random return rate ξt .

Note that {ξt } is a driving process in the sense of Remark 1.2(b). Thus the control variable is at = (ct , pt ) ∈ [0, xt ] × [0, 1] with ct = consumption, and pt = fraction of xt − ct to be invested in the risky asset, and 1 − pt is the fraction of xt − ct invested in the riskless asset. The corresponding dynamical model is

xt+1 = [(1 − pt )(1 + r) + pt ξt ](xt − ct ), t = 0, 1, . . . ,

with some given initial wealth x0 = x ≥ 0. Clearly, for the investment to be profitable, we impose the condition E(ξt ) ≥ r for all t = 0, 1, . . . . (Otherwise, if the interest rate r is greater than the expected return rate E(ξt ), then obviously the best decision would be to invest in the risk–free asset.) Hence we have a stochastic control system in which typically we wish to maximize a so-called expected utility of consumption

V (π, x) := E[ Σ_{t=0}^{T} β^t u(ct ) ], with T ≤ ∞,    (1.7)

where u(·) is a given utility function. Moreover, given the interest rate r > 0, the discount factor is β := (1 + r)−1 . 3

If T = ∞ in (1.7), then we obtain an infinite–horizon objective function called an infinite–horizon discounted utility. A quite
different infinite–horizon objective function is the long–run expected average cost defined as

V (π, x) := lim sup_{n→∞} n^{−1} E[ Σ_{t=0}^{n−1} c(xt , at ) ],    (1.8)
where, as usual, π = {at } is the control policy being used, x0 = x is the given initial state, and the lim sup is in order to minimize the long–run average cost in a worst case scenario. (From a mathematical viewpoint, taking the lim sup is convenient because it ensures that (1.8) is well defined, whereas the “limit” might not exist. Moreover, for theoretical reasons, it is more convenient to take lim sup rather than lim inf. We will come back to this point in the following chapters.) Observe that (1.8) is an asymptotic value that does not depend on the expected cost incurred in any finite number of stages. In fact, in many cases it is even independent of the initial state x, that is V (π, x) ≡ V (π) for all x. Long-run average cost problems appear in both deterministic and stochastic problems. See, for instance, Sects. 2.5 and 5.5 (or 6.5), respectively. Example 1.4 (A tracking problem). Consider the general, possibly stochastic system (1.1) with state space X ⊂ Rk and action set A ⊂ Rl . In addition, we are given a fixed state trajectory, say, {x∗t } in X, and a fixed action trajectory {a∗t } in A. Finally, consider the long–run expected average cost (1.8) with running cost c(xt , at ) := |xt − x∗t |2 + |at − a∗t |2
∀ t = 0, 1, ...,
which is essentially the distance from (xt , at ) to (x∗t , a∗t ). Hence, the corresponding OCP of minimizing V (π, x) over all π is called a tracking problem because the underlying idea is that the state– action pair (xt , at ) should “track” or “follow” or “pursue” or “stay as close as possible to” the given trajectories (x∗t , a∗t ). In particular, if a∗t ≡ 0 for all t, then we have a tracking problem with minimum fuel. Tracking problems are very common in engineering and economics. As an example, controlling the attitude of a satellite or keeping it as close as possible to a given orbit are tracking problems. (See Fig. 1.2.) For applications in economics, see Kendrick (2002).
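As a quick numerical illustration of the average-cost criterion (1.8) with a tracking-type running cost, the following sketch simulates a scalar system under an ad hoc stabilizing feedback and prints the running averages n^{−1} Σ c(xt , at ). The dynamics, the constant reference values, and the feedback gain are assumptions made only for this illustration.

    # Running averages of a tracking cost, illustrating the criterion (1.8).
    # Dynamics, references (x_ref, a_ref), and the feedback rule are made up.
    x_ref, a_ref = 1.0, 0.0
    x, total = 5.0, 0.0
    for t in range(1, 201):
        a = -0.5 * (x - x_ref)                   # ad hoc stabilizing feedback
        total += (x - x_ref) ** 2 + (a - a_ref) ** 2
        x = x_ref + 0.8 * (x - x_ref) + a        # hypothetical dynamics around x_ref
        if t % 50 == 0:
            print(t, total / t)                  # running average settles as t grows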
Fig. 1.2 Controlling a satellite’s attitude
If the state dynamics (1.1) in the tracking problem is stochastic (so that the perturbations ξt are random variables), then the process {ξt } would typically be a random noise. This is simply because there is no possibility of assigning a "physical" meaning to ξt . 3

Remark 1.5 (Partially observable systems). In the tracking problem of Example 1.4, suppose that the state xt of the system is the attitude of a satellite—see Fig. 1.2. Hence we have a particular case of a so-called partially observable system, which is so–named precisely because the state of the system is not observable. Mathematically speaking, one can model this class of systems as follows. The evolution of the state process xt is as in (1.1), where usually the disturbance {ξt } is a random noise. As already noted, the state xt is not observable, but there is an observation process yt behaving according to an equation of the form

yt = Gt (xt , ηt ) ∀ t = 0, 1, 2, ...,    (1.9)
where {ηt } is a random noise influencing the observation process.
Fig. 1.3 A partially observable control system
Using the observation process {yt } and a suitable filter (that is, some statistical device), we obtain a (statistical) estimator x̂t of the state xt , which is used by the controller to obtain a control of the form at = g(t, x̂t ). (See Fig. 1.3 for a graphical representation of a partially observable control system.) Under suitable assumptions, a partially observable system consisting of Eqs. (1.1) and (1.9) can be transformed into a completely observable system in which the original (unobservable) state xt is replaced by the estimator x̂t . In some particular cases, the estimator x̂t is a finite–dimensional vector, but in general it is a probability distribution. Hence, the "completely observable" system is an OCP with state variable x̂t in a space of probability measures! For details see, for instance, Bäuerle and Rieder (2011), Bertsekas and Shreve (1978) or Hernández-Lerma (1989). 3

We conclude this section with some comments on the continuous–time systems that we will study in Chaps. 4–6.

Remark 1.6 (Continuous–time control problems). For continuous–time control problems we again distinguish (as in the discrete–time case above) between deterministic and stochastic problems. In the former case, the state equation, which is the analogue of (1.1), is an ordinary differential equation

ẋ(t) = F (t, x(t), a(t)) for t ∈ [0, T ],    (1.10)
with some given initial state x(0) = x0 . In (1.10), for each t, x(t) and a(t) denote the state variable and the control action that
belong to appropriate spaces, say, X ⊂ Rn and A ⊂ Rm , respectively. Similarly, the cost functional (1.2) is replaced by an integral

V (π, x0 ) := ∫_0^T c(t, x(t), a(t)) dt + CT (x(T )),    (1.11)
where the instantaneous (or running) cost c(t, x, a) and the terminal cost CT (x) are given functions. The control variable a(·) in (1.10) and (1.11) depends on the information available to the controller. Here, to simplify the presentation, we only consider

• open–loop controls a(t) := g(t), where g : [0, T ] → A is a given (measurable) function; and
• closed–loop (or feedback or Markov) controls a(t) := g(t, x(t)), for some (measurable) function g : [0, T ] × X → A.

In the stochastic case, (1.10) is replaced by a stochastic differential equation

dx(t) = F (t, x(t), a(t))dt + G(t, x(t), a(t))dW (t),    (1.12)
with x(·) ∈ Rn , assuming that the state space is X ⊂ Rn , as in (1.10). Moreover, in (1.12), W (·) ∈ Rd is a d–dimensional Wiener process (also known as a Brownian motion), and G(t, x, a) is an n–by–d matrix function. In this case, the cost functional in (1.11) is a random variable and so we replace it by its expected value:

V (π, x0 ) := E[ ∫_0^T c(t, x(t), a(t)) dt + CT (x(T )) ].    (1.13)
The stochastic control problem is to minimize (1.13) subject to (1.12). 3 The remainder of these notes is organized as follows. Chapters 2 and 3 deal with discrete–time systems. Chapter 2 considers deterministic systems, and Chap. 3 is about the stochastic case. The remaining chapters deal with continuous-time problems. Chapter 4 concerns control of continuous–time deterministic systems in which the state dynamics is an ordinary differential
equation, as in (1.10). Chapter 5 introduces general continuous–time Markov control processes (MCPs), which include the deterministic systems in Chap. 4 and many other stochastic control systems. In this general framework we show that some aspects of stochastic control problems can be analyzed by exploiting the "Markovian nature" of the involved dynamic systems. As an application of the results for general MCPs we show how to recover some results for the deterministic systems in Chap. 4. (Here, deterministic systems are viewed as a class of, say, "degenerate" MCPs.) Finally, as another application of MCPs, in Chap. 6 we consider a class of controlled stochastic differential equations (1.12), also known as controlled diffusion processes.

Remark 1.7. Why should we use Borel spaces (see Appendix A)? To answer this question we should note that the state space X and the action set A of an OCP can be of a different "topological" nature. For instance, in Examples 1.1 and 1.3, X and A are sets in (finite–dimensional) Euclidean spaces R or R2 . However, in the control of some queueing systems, for instance, X and A are discrete spaces, that is, either finite or countably infinite sets. Indeed, in this case, the state variable xt is typically the "number of customers" (that is, a nonnegative integer) in the system at time t. Moreover, the state space X can be finite or infinite, depending on the system's "capacity", and the action set A can be finite or infinite. For example, in a control of admissions problem, A is a two-point set {a, b}, where, for each arriving customer, a := allow the customer to access the waiting line, and b := reject the customer. At the other extreme, X and/or A can be infinite–dimensional sets, as in the book by Fabbri et al. (2017). As an example, consider the partially observable system in Remark 1.5. If we transform this system into a completely observable one, in which the state variable is the estimator x̂t , then the new state space will be a set X̂ of probability measures! Clearly, if we consider separately each of these cases, the theory could be a little confusing. To avoid this situation, when dealing with a general OCP we will simply assume that X and A are Borel spaces; this includes all the cases mentioned above. (See Appendix A.)
On the other hand, the Borel space context requires a suitable concept of “measurability” of sets and functions. Here, however, to avoid being repetitious, throughout these lectures we assume that all sets and functions are Borel-measurable. In fact, this measurability assumption usually holds in our case, because we mostly consider standard, “well-behaved” settings in which functions are, for instance, continuous or differentiable and so forth, and similarly for sets—we mostly deal with nice sets, such as closed or open intervals. To conclude, we believe that to learn from these lecture notes it is not necessary to know Measure Theory. Nevertheless, if the reader wishes to learn about it, he/she can look at several introductory reader-friendly texts, such as Ash (1972), Bartle (1995), Bass (2020), ... 3
Chapter 2 Discrete–Time Deterministic Systems
In this chapter we consider the discrete–time system (1.1) in the so–called deterministic case (see Remark 1.2(a)), so that the disturbances ξt are supposed to be given constants in some space S. Since this information is irrelevant for our present purposes, we will omit the notation ξt so (1.1) becomes xt+1 = Ft (xt , at ) for t = 0, 1, . . . , T − 1,
(2.0.1)
with a given initial condition x0 . The state and control spaces X and A are given spaces. (Recall our assumption in Remark 1.7: In these lectures, all sets and functions are Borel measurable.) First, we consider the finite-horizon case, T < ∞. Hence, the optimal control problem (OCP) we are concerned with is to minimize the total cost in (1.2), that is,

V (π, x0 ) := Σ_{t=0}^{T−1} ct (xt , at ) + CT (xT )    (2.0.2)
over the set Π of all control policies (or strategies) π = {a0 , . . . , aT −1 } subject to (2.0.1). At each time t, the set of feasible actions is a set A(x) ⊂ A, which may depend on the current state x. (If necessary, the action set A(x) may also depend on the time t.) For this OCP an optimal policy and the corresponding value function are defined as in (1.3) and (1.4), respectively.
In the following Sects. 2.1 and 2.2.1 we introduce the dynamic programming approach to study the OCP (2.0.1)–(2.0.2). This approach gives sufficient conditions for the existence of an optimal control policy. In contrast, in Sect. 2.2.2 we introduce the minimum principle (also known as Pontryagin’s principle) that gives necessary conditions. In Sect. 2.3 we study an infinite-horizon problem, so T = ∞ in (2.0.1)–(2.0.2), with CT (·) ≡ 0. To this end we use both DP and the minimum principle in Sects. 2.3.1 and 2.3.2, respectively. Moreover, in Sect. 2.3.3 we introduce the weighted-norm approach that allows positive and negative cost functions. In Sect. 2.4 we consider again the infinite-horizon problem but now from the viewpoint of two approximation schemes, value iteration (VI) and policy iteration (PI). Both schemes are very useful in applications, and easily extended to stochastic problems (as in Chap. 3). We conclude this chapter with an analysis, in Sect. 2.5, of long-run average cost problems. These problems are very popular in the control of some queueing systems, and also in computer and telecommunications applications, in which we wish to optimize some asymptotic cost. Remark 2.1. Sometimes, without further comments, we will tacitly assume that the OCPs we are dealing with are consistent in the sense that they are well defined and admit optimal solutions. Of course, in all of the particular cases considered below we give conditions ensuring consistency. 3 Recall also the Basic Assumption at the beginning of Chap. 1 about the cost functions ct , CT being nonnegative, and V ∗ (·) < ∞.
2.1 The Dynamic Programming Equation

The dynamic programming (DP) technique is based on the following "principle of optimality" stated by Richard Bellman (1920–1984).
Lemma 2.2 (The principle of optimality (PO)). Let π∗ = {a∗0 , . . . , a∗T−1 } be an optimal strategy for the OCP (2.0.1)–(2.0.2), that is,

V (π∗ , x0 ) = min_π V (π, x0 ) ∀ x0 .

Let {x∗0 , . . . , x∗T } be the corresponding path obtained from (2.0.1), so x∗0 = x0 . Then, for any time s ∈ {0, . . . , T − 1}, the "truncated" strategy π∗s = {a∗s , . . . , a∗T−1 } from time s onward is an optimal strategy that leads the system (2.0.1) from the point x∗s to the point x∗T .

We will next sketch the proof of this lemma. The details are left as an exercise for the reader. (See Exercise 2.4.)

Sketch of the proof of Lemma 2.2. Arguing by contradiction, suppose that the lemma's conclusion does not hold; that is, for some 0 ≤ s < T − 1, the truncated policy π∗s is not optimal. Therefore, by our consistency assumption in Remark 2.1, there exists a policy π̂s := {âs , . . . , âT−1 } that is optimal in the interval [s, T − 1], so that Vs (π̂s , x∗s ) < Vs (π∗s , x∗s ), where

Vs (πs , x) := Σ_{t=s}^{T−1} ct (xt , at ) + CT (xT )    (2.1.1)

is the total cost from time s onward when using the truncated policy πs := {as , . . . , aT−1 }, given that xs = x. Now consider the combined policy π̃ = {ã0 , . . . , ãT−1 } defined as

ãt := a∗t if 0 ≤ t < s,   ãt := ât if s ≤ t ≤ T − 1.

Then, by definition of π̃, V0 (π̃, x0 ) < V0 (π∗ , x0 ), which contradicts the optimality of π∗ stated in the lemma. □

We will now show how to use Lemma 2.2 to obtain the dynamic programming equation (2.1.8)–(2.1.9) below. Consider the OCP (2.0.1)–(2.0.2) but only from time s onward, with initial state xs = x, that is, the cost function to be minimized is as in (2.1.1), i.e.,
Vs (πs , x) := Σ_{t=s}^{T−1} ct (xt , at ) + CT (xT ).
Let vs (x) be the corresponding minimal cost, that is,

vs (x) := inf_{πs} Vs (πs , x).    (2.1.2)
Moreover, since no control actions are applied at the terminal time T , we define

vT (x) := CT (x).    (2.1.3)

Hence, by Lemma 2.2, (2.1.2) becomes

vs (x) = Vs (π∗s , x)
       = Σ_{t=s}^{T−1} ct (x∗t , a∗t ) + CT (x∗T )
       = cs (x, a∗s ) + Vs+1 (π∗s+1 , x∗s+1 )
       = cs (x, a∗s ) + vs+1 (x∗s+1 ).

Therefore, by (2.0.1),

vs (x) = cs (x, a∗s ) + vs+1 (Fs (x, a∗s )).    (2.1.4)
On the other hand, by definition (2.1.2) of vs (x),

vs (x) ≤ cs (x, a) + vs+1 (Fs (x, a)) ∀ a ∈ A(x).    (2.1.5)
Finally, combining (2.1.4) and (2.1.5) we obtain that, for s ∈ {0, . . . , T − 1} and x ∈ X,

vs (x) = min_{a∈A(x)} [cs (x, a) + vs+1 (Fs (x, a))],    (2.1.6)

with terminal condition as in (2.1.3):

vT (x) := CT (x) ∀ x ∈ X.    (2.1.7)
Equation (2.1.6) with the terminal condition (2.1.7) is called the dynamic programming (DP) equation or the Bellman equation for the OCP (2.0.1)–(2.0.2). The DP equation is the
basis of the dynamic programming algorithm in the following theorem.

Theorem 2.3 (Dynamic programming theorem). Let J0 , . . . , JT be the functions defined "backward" (from s = T to s = 0) on X by

JT (x) := CT (x),    (2.1.8)

and for s = T − 1, T − 2, . . . , 0,

Js (x) := min_{a∈A(x)} [cs (x, a) + Js+1 (Fs (x, a))].    (2.1.9)
Suppose that for each s = 0, 1, . . . , T − 1, there exists a function a∗s : X → A that attains the minimum in the right-hand side of (2.1.9) for all x ∈ X. Then the Markov strategy π∗ = {a∗0 , . . . , a∗T−1 } is optimal and the value function coincides with J0 , i.e.,

inf_π V (π, x) = V (π∗ , x) = J0 (x) ∀ x ∈ X.    (2.1.10)

Actually, for each s = 0, . . . , T , Js coincides with the function in (2.1.2)–(2.1.3), i.e.,

vs (x) = Js (x) ∀ s = 0, 1, . . . , T, x ∈ X.    (2.1.11)
Proof. Let vT , vT −1 , ..., v0 be the functions defined by (2.1.2)– (2.1.3), and let π ∗ be the Markov strategy defined in the theorem. Then, for each s ∈ {0, 1, . . . , T }, the definition of vs yields vs (x) ≤ Vs (π ∗ , x) = Js (x) ∀ x ∈ X, where the second equality follows from (2.1.9), which can be expressed as Js (x) = cs (x, a∗s ) + Js+1 (x∗s+1 ). Hence, to complete the proof of (2.1.11), it remains to show that vs (x) ≥ Js (x)
(2.1.12)
for all s ∈ {0, 1, . . . , T } and x ∈ X. We will prove (2.1.12) using backward induction. Indeed, first note that, by (2.1.7) and (2.1.8), the equality holds in (2.1.12) when s = T . Now let us suppose that (2.1.12) holds for s + 1, i.e.,
vs+1 (x) ≥ Js+1 (x) ∀ x ∈ X.
(2.1.13)
Take an arbitrary policy π = {a0 , . . . , aT−1 }. Then, for any x ∈ X,

Vs (π, x) = cs (x, as ) + Vs+1 (π, Fs (x, as ))
          ≥ cs (x, as ) + vs+1 (Fs (x, as ))
          ≥ cs (x, as ) + Js+1 (Fs (x, as ))   [by (2.1.13)]
          ≥ min_{a∈A(x)} [cs (x, a) + Js+1 (Fs (x, a))]
          = Js (x).

Since this holds for any policy π, we obtain (2.1.12). □
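Theorem 2.3 is, in effect, an algorithm: compute JT , JT−1 , . . . , J0 by the backward recursion (2.1.8)–(2.1.9), recording a minimizer at each stage. The following Python sketch does this for a small finite model; the horizon, dynamics, and costs are arbitrary choices used only to show the mechanics.

    # Backward recursion (2.1.8)-(2.1.9) on a small finite model (all data made up).
    T = 4
    X = range(5)                            # states 0,...,4
    A = lambda x: range(3)                  # admissible actions A(x) = {0,1,2}
    F = lambda s, x, a: min(max(x + a - 1, 0), 4)   # hypothetical dynamics F_s(x,a)
    c = lambda s, x, a: x**2 + a                    # hypothetical stage cost c_s(x,a)
    C_T = lambda x: 2 * x                           # hypothetical terminal cost

    J = {T: {x: C_T(x) for x in X}}         # J_T(x) := C_T(x)
    policy = {}
    for s in range(T - 1, -1, -1):          # s = T-1, ..., 0
        J[s], policy[s] = {}, {}
        for x in X:
            # J_s(x) = min over a in A(x) of c_s(x,a) + J_{s+1}(F_s(x,a))
            best_a = min(A(x), key=lambda a: c(s, x, a) + J[s + 1][F(s, x, a)])
            policy[s][x] = best_a
            J[s][x] = c(s, x, best_a) + J[s + 1][F(s, x, best_a)]

    print("value function J_0:", J[0])      # equals the value function by (2.1.10)
    print("optimal first actions a_0*(x):", policy[0])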
In the following example we illustrate Theorem 2.3 with an LQ control problem (also known as a "linear regulator problem"), which consists of a Linear state equation with a Quadratic stage cost.

Example 2.4 (Discrete–time LQ system). Consider the linear system

xt+1 = αxt + βat ∀ t = 0, 1, . . . , T − 1; x0 given,

with nonzero coefficients α, β. The state and action spaces are X = A = R. The objective function (or performance index) is

V (π, x0 ) := Σ_{t=0}^{T−1} (qxt² + rat²) + qT xT²,

where r > 0 and q, qT ≥ 0. In this case, the dynamic programming equation (2.1.8)–(2.1.9) becomes

JT (x) := qT x²    (2.1.14)

and for s = T − 1, T − 2, . . . , 0:

Js (x) := min_a [qx² + ra² + Js+1 (αx + βa)].    (2.1.15)
This equation is solved by backward induction: substituting (2.1.14) into (2.1.15) we obtain
JT−1 (x) := min_a [qx² + ra² + qT (αx + βa)²].

Therefore,

JT−1 (x) := min_a [(q + qT α²)x² + (r + qT β²)a² + 2qT αβxa].

The right–hand side of this equation is minimized when

a∗T−1 (x) = GT−1 x, with GT−1 := −(r + qT β²)−1 qT αβ,

and the minimum is

JT−1 (x) = KT−1 x², with KT−1 := (r + qT β²)−1 qT rα² + q.

In general, using backward induction we can see that the optimal strategy π∗ = {a∗0 , . . . , a∗T−1 } is given by

a∗s (x) = Gs x, with Gs := −(r + Ks+1 β²)−1 Ks+1 αβ,    (2.1.16)

with coefficients Ks defined recursively by KT := qT and, for s = T − 1, . . . , 0,

Ks = (r + Ks+1 β²)−1 Ks+1 rα² + q.

Likewise, the optimal cost (2.1.11) from time s onward becomes

Js (x) = Ks x²  for s = 0, 1, . . . , T − 1.    (2.1.17)
In particular, with s = 0 we obtain the minimum cost in (2.1.10). 3

Remark 2.5 (The nonstationary vector LQ problem). For notational ease, we have considered above the scalar (or one-dimensional) LQ problem in the stationary case, in which the state and action spaces are X = A = R and the coefficients α, β, q, r are all time-invariant constants. However, essentially the same arguments (with obvious notational changes) are valid in the general nonstationary vector case with state and action spaces X = Rn and A = Rm , respectively, and time-varying matrix coefficients αt ∈ Rn×n , βt ∈ Rn×m , and quadratic stage costs

c(x, a) = x′qt x + a′rt a, t = 0, 1, ...
where "prime" (′) denotes transpose, and qt and rt are symmetric matrices, with qt nonnegative definite, and rt positive definite. (The latter means that, for all t, x ∈ Rn , and a ∈ Rm , we have x′qt x ≥ 0 and a′rt a > 0 for a ≠ 0.) In this case, (2.1.16)–(2.1.17) become

a∗s (x) = Gs x ∀ x ∈ Rn , with Gs = −(rs + β′s Ks+1 βs )−1 β′s Ks+1 αs ,

where KT = qT and, for s = T − 1, ..., 0,

Ks = α′s [Ks+1 − Ks+1 βs (rs + β′s Ks+1 βs )−1 β′s Ks+1 ]αs + qs .

With this value of Ks , the optimal (minimum) cost in (2.1.17) becomes Js (x) = x′Ks x for s = 0, 1, ..., T − 1. For further details see, for instance, Sect. 2.1 in Bertsekas (1987). 3
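For the scalar LQ problem of Example 2.4, the recursions (2.1.16)–(2.1.17) are easy to implement. The sketch below computes the gains Gs and coefficients Ks backward and then simulates the closed loop a∗s (x) = Gs x; the parameter values are arbitrary and only illustrate that the simulated cost matches K0 x0².

    # Scalar LQ recursion from Example 2.4; all parameter values are illustrative.
    alpha, beta, q, r, q_T, T, x0 = 1.1, 0.5, 1.0, 0.2, 1.0, 5, 3.0

    K = [0.0] * (T + 1)
    G = [0.0] * T
    K[T] = q_T
    for s in range(T - 1, -1, -1):                       # backward recursion for K_s, G_s
        K[s] = (r + K[s + 1] * beta**2) ** -1 * K[s + 1] * r * alpha**2 + q
        G[s] = -((r + K[s + 1] * beta**2) ** -1) * K[s + 1] * alpha * beta

    x, cost = x0, 0.0                                    # closed-loop simulation
    for s in range(T):
        a = G[s] * x                                     # optimal feedback a_s*(x) = G_s x
        cost += q * x**2 + r * a**2
        x = alpha * x + beta * a
    cost += q_T * x**2

    print("K_0 * x0^2 =", K[0] * x0**2)                  # minimum cost J_0(x0) by (2.1.17)
    print("simulated cost =", cost)                      # agrees up to rounding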
2.2 The DP Equation and Related Topics

This section is divided into two parts. First, in Sect. 2.2.1 we introduce some variants of the DP equation (2.1.8)–(2.1.9). Then, in Sect. 2.2.2, we consider the OCP (2.0.2)–(2.0.1) from the viewpoint of the minimum principle, which gives necessary conditions for optimality. The idea is to compare this principle with the DP approach.
2.2.1 Variants of the DP Equation

Discounted costs. Let us suppose that (2.0.2)–(2.0.1) are of the form

V (π, x0 ) = Σ_{t=0}^{T−1} α^t ct (xt , at ) + α^T CT (xT )

and
xt+1 = Ft (xt , at ) ∀ t = 0, 1, . . . , T − 1,

respectively, where α ∈ (0, 1) is a given discount factor. Then the DP equation (2.1.8)–(2.1.9) becomes

JT (x) = α^T CT (x),   Js (x) = min_a [α^s cs (x, a) + Js+1 (Fs (x, a))].

Now consider the change of variable Us (x) := α^{−s} Js (x). Then we obtain the so–called DP equation in the discounted case:

Us (x) = min_a [cs (x, a) + αUs+1 (Fs (x, a))] ∀ s = 0, . . . , T − 1,    (2.2.1)

with

UT (x) = CT (x).    (2.2.2)
Example 2.6 (Optimal advertising, Adukov et al. (2015)). Let us consider a market where a monopolistic firm is entering with a new product. The firm's market share at time t is xt , and its advertising expenditure rate is at . Suppose that the market share evolves according to the nonlinear system

xt+1 = (1 − δ)xt + ρat (1 − xt )^{1−σ} for t = 0, 1, . . . , T − 1,    (2.2.3)

where the state and control spaces are X = A = [0, 1], ρ ∈ (0, 1) is the effectiveness of advertising, δ ∈ [0, 1] is the rate at which consumers lose interest in the product, and σ ∈ [0, 1] is the nonlinearity parameter. In order to interpret the nonlinear part of (2.2.3), first note that from the Taylor expansion for (1 − x)^{1−σ} at x = 0, we have

(1 − x)^{1−σ} = (1 − x) + σx(1 − x) + σ²x² + ....

Therefore, the nonlinear term ρat (1 − xt )^{1−σ} can be approximated by the sum of a portion representing the new consumers due to the direct advertising, ρat (1 − xt ); and the new consumers due to the word-of-mouth advertising by active consumers xt , namely, σρat xt (1 − xt ).
Now, going back to our OCP, given an initial state x0 , we want to maximize the profit function

V (π, x0 ) = Σ_{t=0}^{T−1} α^t (pxt − c at^{1/σ}) + α^T pT xT ,

where p > 0 and pT ≥ 0 are potential revenues, and c > 0 is the advertising cost.

This problem can be solved explicitly by means of Eqs. (2.2.1)–(2.2.2) for maximization problems. That is, UT (x) := pT x, and for s = 0, 1, ..., T − 1:

Us (x) := max_a [ px − c a^{1/σ} + αUs+1 ( (1 − δ)x + ρa(1 − x)^{1−σ} ) ].

By backward induction, for s = T − 1, we have

UT−1 (x) = max_a [ px − c a^{1/σ} + αpT ( (1 − δ)x + ρa(1 − x)^{1−σ} ) ].

The maximum is attained at a∗T−1 (x) = [M (pT ) (1 − x)]^σ , where

M (τ ) := (ασρτ /c)^{1/(1−σ)} ,

and so

UT−1 (x) = N (pT )x + c ((1 − σ)/σ) M (pT ), with N (τ ) := p + α(1 − δ)τ − c ((1 − σ)/σ) M (τ ).

Continuing with the backward induction, we obtain that the optimal control policy π∗ = {a∗0 , . . . , a∗T−1 } is given by

a∗s (x) = [ M ( N^{T−s−1}(pT ) ) (1 − x) ]^σ , s = 0, 1, ..., T − 1,

where N^{T−s−1} is the (T − s − 1)th iterate of the function N . From (2.2.3), the corresponding state path is

x∗s+1 = (1 − δ)x∗s + ρ[M (N^{T−s−1}(pT ))]^σ (1 − x∗s ),
and the optimal benefit from time s onward turns out to be

Us (x) = N^{T−s}(pT ) x + c ((1 − σ)/σ) Σ_{i=0}^{T−s−1} α^{T−s−1−i} M ( N^{i}(pT ) )

for s = 0, 1, ..., T − 1.
3
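The closed-form solution of Example 2.6 is straightforward to evaluate numerically: compute the iterates N^k (pT ), the corresponding M values, and apply the feedback rule a∗s (x). In the sketch below the parameter values are invented for illustration only.

    # Evaluating the closed-form policy of Example 2.6 (all parameters made up).
    rho, delta, sigma = 0.4, 0.1, 0.5
    p, p_T, c, alpha, T, x0 = 2.0, 1.0, 1.5, 0.95, 6, 0.05

    M = lambda tau: (alpha * sigma * rho * tau / c) ** (1.0 / (1.0 - sigma))
    N = lambda tau: p + alpha * (1.0 - delta) * tau - c * (1.0 - sigma) / sigma * M(tau)

    def N_iter(tau, k):
        # k-th iterate N^k(tau) of the map N
        for _ in range(k):
            tau = N(tau)
        return tau

    x = x0
    for s in range(T):
        m = M(N_iter(p_T, T - s - 1))
        a = (m * (1.0 - x)) ** sigma                                   # a_s*(x)
        x = (1.0 - delta) * x + rho * a * (1.0 - x) ** (1.0 - sigma)   # dynamics (2.2.3)
        print(f"s={s}: advertising={a:.3f}, next market share={x:.3f}")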
Forward form of the DP equation. Consider the DP equation (2.1.8)–(2.1.9) and let Us := JT−s for s = 0, 1, . . . , T . Then the DP equation becomes

Us (x) = min_a {cs (x, a) + Us−1 (Fs (x, a))}    (2.2.4)

for s = 1, 2, . . . , T , with initial condition U0 (x) = CT (x). Moreover, if fT−s (x) minimizes the right–hand side of (2.1.9) at the stage T − s, then gs := fT−s minimizes (2.2.4) at the stage s, and π∗ = {gT , ..., g1 } is an optimal policy, that is, UT (x) = J0 (x) = V (π∗ , x).

In the discounted case, the forward form of (2.2.1)–(2.2.2) becomes

us (x) = min_{a∈A(x)} [cs (x, a) + αus−1 (Fs (x, a))]    (2.2.5)

for s = 1, 2, . . . , T , with u0 (x) = CT (x).
2.2.2 The Minimum Principle

Remark 2.7. (a) In this section, we suppose that the state and action spaces X and A are subsets of Rn and Rm , respectively. We also assume that the cost functions in (2.0.2) and the system functions in (2.0.1) are differentiable.

(b) Given two column vectors x, y of the same (finite) dimension, we write their inner product as x · y := Σ_i xi yi . Sometimes we also write x · y as ⟨x, y⟩. If x is a row vector and y a column vector, then we write their inner product simply as xy.
(c) Consider the OCP (2.0.1)–(2.0.2) and the Hamiltonian function Ht (x, a, ρ) := ct (x, a) + ρ · Ft (x, a),
t = 0, 1, ...
for (x, a, ρ) ∈ X × A × Rn . Under the differentiability assumption in (a), the Hamiltonian is also differentiable. 3

The conditions in Remark 2.7 are one of the key differences between the minimum principle (MP) and the dynamic programming (DP) approach. In the former, typically, the state and action spaces X and A are finite-dimensional and the functions in (2.0.1)–(2.0.2) require some differentiability condition. In contrast, in DP, X and A are general Borel spaces (that is, Borel subsets of complete and separable metric spaces), and the functions in (2.0.1)–(2.0.2) require some mild condition, for instance, piecewise continuity.

The minimum principle gives necessary conditions for optimality. Roughly, it states the following: If π∗ = {a∗t } is an optimal strategy and {x∗t } is the corresponding state path, then the pairs (x∗t , a∗t ), t = 0, 1, . . . , satisfy, for some vectors ρ0 , ρ1 , . . ., the conditions (2.2.6)–(2.2.8) below. In particular, (2.2.7) yields that the optimal controls a∗t minimize the Hamiltonian function Ht , which gives the name "the minimum principle". The minimum principle was originally developed for continuous-time problems (Gamkrelidze 1999). In this section we obtain the discrete–time minimum principle in the form of first–order necessary conditions.

Theorem 2.8 (The minimum principle). Let π∗ = {a∗0 , . . ., a∗T−1 } be an optimal strategy for the OCP (2.0.1)–(2.0.2) and {x∗0 , . . . , x∗T } the corresponding path. Then there exist vectors ρ1 , . . . , ρT in Rn such that

(a) for all t = 1, . . . , T − 1,

(∂ct/∂x)(x∗t , a∗t ) + ρt+1 (∂Ft/∂x)(x∗t , a∗t ) = ρt ,    (2.2.6)

(b) for all t = 0, . . . , T − 1,

(∂ct/∂a)(x∗t , a∗t ) + ρt+1 (∂Ft/∂a)(x∗t , a∗t ) = 0,    (2.2.7)

(c) we have the terminal condition (TC)

ρT = (∂CT/∂x)(x∗T ).    (2.2.8)
Proof. Consider the "cost-to-go function" from time s using π∗ given x̃s = x, i.e.,

Vs∗ (x) = Σ_{t=s}^{T−1} ct (x̃t , a∗t ) + CT (x̃T ),    (2.2.9)

where each x̃t is generated by (2.0.1) using π∗s . Note that Vs∗ is a differentiable function (see (2.2.10) below). Define

ρt := (∂Vt∗/∂x)(x∗t )

for all t = 1, . . . , T .

(a) Taking at = a∗t in (2.1.6), it follows that Vt∗ (x) = ct (x, a∗t ) + V∗t+1 (Ft (x, a∗t )) for all t = 1, . . . , T − 1. Differentiating and evaluating at x∗t , we obtain

(∂Vt∗/∂x)(x∗t ) = (∂ct/∂x)(x∗t , a∗t ) + (∂V∗t+1/∂x)(Ft (x∗t , a∗t )) (∂Ft/∂x)(x∗t , a∗t ),    (2.2.10)

which is (2.2.6).

(b) By the Principle of Optimality (Lemma 2.2) and (2.1.2), Vt∗ (x∗t ) = vt (x∗t ). This fact, by (2.1.6), implies

Vt∗ (x∗t ) = min_{a∈A(x)} [ ct (x∗t , a) + V∗t+1 (Ft (x∗t , a)) ],

for all t = 0, . . . , T − 1. Since a∗t minimizes the right–hand side, we obtain

(∂ct/∂a)(x∗t , a∗t ) + (∂V∗t+1/∂x)(Ft (x∗t , a∗t )) (∂Ft/∂a)(x∗t , a∗t ) = 0,

which is (2.2.7).

(c) Finally, by (2.2.9), we have VT∗ (x) = CT (x̃T ), so

ρT = (∂VT∗/∂x)(x∗T ) = (∂CT/∂x)(x∗T ). □

This completes the proof of the theorem.
Sufficient conditions can also be provided with additional convexity assumptions, as follows.

Theorem 2.9. Suppose that there exists a strategy π∗ = {a∗0 , . . . , a∗T−1 } and vectors ρ1 , . . . , ρT such that (2.2.6), (2.2.7) and (2.2.8) hold. Define h0 : X × A → R as

h0 (x, a) = c0 (x, a) + ρ1 · F0 (x, a),

and for t = 1, . . . , T − 1, ht : X × A → R as

ht (x, a) = ct (x, a) + ρt+1 · Ft (x, a) − ρt · x.

If ht is convex for all t = 1, . . . , T − 1 and CT is convex, then π∗ is optimal for the OCP (2.0.1)–(2.0.2).

Proof. Since (2.2.6) and (2.2.7) are the first order conditions of optimality for each ht , the convexity of ht implies that (x∗t , a∗t ) minimizes ht . Analogously, x∗T minimizes CT (x) − ρT · x. Let π = {a0 , . . . , aT−1 } be an arbitrary policy. We have

Σ_{t=0}^{T−1} ct (x∗t , a∗t ) + CT (x∗T )
  = Σ_{t=0}^{T−1} ht (x∗t , a∗t ) + Σ_{t=1}^{T−1} ρt · x∗t − Σ_{t=0}^{T−1} ρt+1 · x∗t+1 + CT (x∗T )
  ≤ Σ_{t=0}^{T−1} ht (xt , at ) − ρT · x∗T + CT (x∗T )   [telescoping, and since (x∗t , a∗t ) minimizes ht ]
  ≤ Σ_{t=0}^{T−1} ht (xt , at ) − ρT · xT + CT (xT )   [since x∗T minimizes CT (x) − ρT · x]
  = Σ_{t=0}^{T−1} ct (xt , at ) + CT (xT ).

Thus π∗ is optimal. □
Example 2.10 (An economic growth model). This is a model introduced by Brock and Mirman (1972). The state and control variables xt and at denote capital and consumption, respectively, at time t = 0, 1, . . . . The state and control spaces are X = A = [0, ∞), and the dynamics of the system is given by

xt+1 = c xt^θ − at for t = 0, 1, . . . , T − 1,    (2.2.11)

with θ ∈ (0, 1) and x0 given. Let A(x) := (0, cx^θ ], and consider the objective function or performance index

V (π, x0 ) = Σ_{t=0}^{T−1} α^t log at + α^T log xT^θ .    (2.2.12)
In this OCP, the economic interpretation is that the controller wishes to determine a consumption strategy {at } to maximize the total discounted utility (2.2.12), subject to the capital dynamics (2.2.11). The term c xt^θ in (2.2.11) represents the output as a function of the current capital xt and the technological parameter c; this output is distributed in consumption at and capital xt+1 for the next period.

To solve this problem we write the minimum principle equations (2.2.6)–(2.2.8) as follows:

ρt+1 cθ xt^{θ−1} = ρt for t = 1, ..., T − 1,    (2.2.13)

α^t/at − ρt+1 = 0 for t = 0, ..., T − 1,    (2.2.14)

with the terminal condition

ρT = θα^T/xT .    (2.2.15)
From (2.2.14) for t = T − 1 and (2.2.15),
2
28
aT −1 =
DISCRETE–TIME DETERMINISTIC SYSTEMS
cxθ − aT −1 αT −1 xT = = T −1 ρT θα θα
which implies aT −1
cxθT −1 = . 1 + θα
Again, from (2.2.13) and (2.2.14) for t = T − 2, aT −2 =
cxθT −2 − aT −2 αT −2 aT −1 xT −1 = = = , ρT −1 θα + (θα)2 θα + (θα)2 cθαxθ−1 T −1
yielding aT −2 =
cxθT −2 . 1 + θα + (θα)2
Continuation of this process backward in time yields the optimal control policy c[x∗t ]θ c(1 − θα) ∗ [x∗t ]θ (2.2.16) = at = 1 + θα + · · · + (θα)T −t 1 − (θα)T −t+1 for all t = 0, ..., T − 1.
3
Remark 2.11. Theorems 2.8 and 2.9 above, and also Theorem 2.29 below, are simplified versions of results by Dom´ınguez-Corella and Hern´andez-Lerma (2019). 3
2.3 Infinite–Horizon Problems In this section we consider infinite-horizon problems. First, in Sect. 2.3.1 we study the so-called stationary discounted case by means of the DP approach. Then in Sect. 2.3.2 we consider general nonstationary problems using the minimum principle (MP). Again, as in Sect. 2.2.2, the idea is to compare the DP and the MP techniques. Finally, in Sect. 2.3.3, we introduce the so-called weighted-norm approach.
2.3
INFINITE–HORIZON PROBLEMS
29
A relevant question is, why should we consider infinite horizon problems? Do they appear in “real” situations? Are they important? Detailed answers will require to refer to the approximation algorithms and the asymptotic or long-run averages in Sects. 2.4 and 2.5, respectively. Therefore, we defer this topic to the end of the chapter.
2.3.1 Discounted Case Instead of the OCP (2.0.1)–(2.0.2), consider the stationary discounted OCP: Minimize Vn (π, x) :=
n
αt c(xt , at )
(2.3.1)
t=0
over all policies π = {a0 , . . . , an } subject to xt+1 = F (xt , at ) ∀ t = 0, . . . , n − 1,
(2.3.2)
with a given initial state x0 = x. In (2.3.1), α ∈ (0, 1) is a given discount factor. Remark 2.12. The problem (2.3.1)–(2.3.2) is called stationary because the cost c(x, a) and the system function F (x, a) are time– invariant. Similarly, a Markov policy π = {at }, with at = g(t, xt ) for all t = 0, 1, . . . (see Remark 1.2(d)) is said to be stationary if g(t, x) ≡ g(x) for all t. In other words, the control actions at = g(xt ) depend on the time parameter t only through the state xt . Stationary control policies usually appear in infinite–horizon OCPs only. (See Corollary 2.22, for instance.) 3 In this section, we are interested in the infinite horizon OCP obtained from (2.3.1) by letting n → ∞; that is, the OCP now is to minimize ∞ V (π, x) := αt c(xt , at ) (2.3.3) t=0
subject to (2.3.2). Let
2
30
DISCRETE–TIME DETERMINISTIC SYSTEMS
V ∗ (x) := inf V (π, x) ∀ x ∈ X, π
(2.3.4)
which is called the α–discount value function. A policy π ∗ is said to be α–discount optimal if V (π ∗ , x) = V ∗ (x) ∀ x ∈ X. The infinite-horizon discounted OCP (2.3.2)–(2.3.3) was introduced by Blackwell (1965). Example 2.13 (Tracking problems). As noted in Example 1.4, tracking problems are a standard topic in some areas of economics and engineering, among other fields. The idea is as follows. We are given a nominal control path {a∗t , t = 0, 1, . . . } and a nominal state trajectory {x∗t , t = 0, 1, . . . }. The OCP is then to minimize the tracking cost V (π, x) :=
∞
αt [(xt − x∗t )2 + (at − a∗t )2 ]
(2.3.5)
t=0
subject to a state equation such as (2.3.2). In particular, if this state equation is linear, say, xt+1 = F xt + Gat , the tracking problem becomes an LQ (Linear–Quadratic) problem. Observe that, essentially, (2.3.5) is an 2 –distance from the state–control trajectory {(xt , at ), t = 0, 1, . . . } to the given nominal trajectory {(x∗t , a∗t ), t = 0, 1, 2, ...}, so the tracking problem is to keep the state–control trajectory as close as possible to the nominal trajectory. In particular, if a∗t ≡ 0 for all t = 0, 1, . . . , in engineering this is called a tracking problem with minimum fuel. Similar problems are dealt with in economics. 3 In this section we are interested, among other things, in providing solutions to the following problems. Problem 1. Let Vn∗ (x) := inf π Vn (π, x) be the minimal cost or value function for the n–stage OCP. Under what conditions does it hold that, as n → ∞, Vn∗ converges to V ∗ ? That is, when do we have
2.3
INFINITE–HORIZON PROBLEMS
31
lim Vn∗ (x) = V ∗ (x)
n→∞
(2.3.6)
for all x ∈ X? In (2.3.7), below, A(x) ⊂ A denotes the set of feasible control actions in the state x, for each x ∈ X. Problem 2. Under what conditions is V ∗ a solution of the dynamic programming equation, also known as the Bellman equation or α–optimality equation (α–OE) in the discounted– cost case, that is, V ∗ (x) = inf [c(x, a) + αV ∗ (F (x, a))] a∈A(x)
(2.3.7)
for all x ∈ X? To put it in other words, consider the so–called Bellman operator K defined as Kv(x) := inf [c(x, a) + αv(F (x, a))] a∈A(x)
(2.3.8)
for functions v in a certain family. Then we can restate Problem 2 as follows: when is V ∗ a fixed point of K? That is, when can we ensure that (2.3.9) V ∗ = KV ∗ holds? Remark 2.14. The fact that V ∗ satisfies (2.3.7) or (2.3.9) is not quite surprising. Indeed, for the α–discounted OCP (2.3.1)– (2.3.2), the DP equation (2.2.5) becomes ∗ Vn∗ (x) = min [c(x, a) + αVn−1 (F (x, a))] ∀ n = 1, 2, . . . a∈A(x)
(2.3.10) ≡ 0. Therefore, if in (2.3.10) we let n → ∞ and, in with addition, (2.3.6) holds, then we would expect to obtain (2.3.7) (if interchanging of the “lim” and “min” operations is allowed— see Lemma 2.15 below). On the other hand, note that, using the operator K in (2.3.8), we may express (2.3.10) as an iteration of K, that is, V0∗ (·)
∗ = K n V0∗ , Vn∗ = KVn−1
with V0∗ ≡ 0.
(2.3.11)
2
32
DISCRETE–TIME DETERMINISTIC SYSTEMS
Due to this fact, the functions Vn∗ are also known as value iteration (VI) functions. 3 Note that (2.3.7) and (2.3.9) always have the trivial solutions v ≡ ∞ and v ≡ −∞. We are interested, of course, in finite–valued functions. The following lemma gives conditions that allow to interchange limits and minima. Lemma 2.15. Let {gk } be a sequence of real–valued functions on a space Y converging to a function g. Each of the following conditions (a), (b), (c) ensures that lim inf gk (y) = inf lim gk (y) = inf g(y). k
y
y
k
y
(a) gk ↓ g, (b) gk converges uniformly to g; that is, for each > 0 there exists a number k( ) such that, for all y ∈ Y , |gk (y) − g(y)| < whenever k > k( ). (c) Y is a metric space, the functions gk are inf–compact (that is, for each k = 1, 2, . . . and r ∈ R, the set {y ∈ Y : gk (y) ≤ r} is compact), and, moreover, gk ↑ g. Proof. See Exercise 2.5.
2
We will next use some definitions and facts from Appendix B. Let K := {(x, a) ∈ X × A|a ∈ A(x)} (2.3.12) be the graph of the multifunction x → A(x). (See Definition B.1(b).) We denote by F the family of (measurable) functions— called selectors—f : X → A such that f (x) ∈ A(x) for all x ∈ X. (Note that a selector f ∈ F is simply a function from X to A whose graph (x, f (x)) is in K for all x.) Lemma 2.16. Let u : K → R be nonnegative and K–inf–compact, that is (as in Definition B.4(a2)), for every compact set X ⊂ X and every r ∈ R, the set {(x, a) ∈ G(X )|u(x, a) ≤ r}
2.3
INFINITE–HORIZON PROBLEMS
33
is compact, where G(X ) := {(x, a) ∈ X × A|a ∈ A(x)}. Then (a) There exists f ∈ F such that u∗ (x) := min u(x, a) = u(x, f (x)) ∀ x ∈ X, a∈A(x)
and u∗ is l.s.c. (b) If u, uk : K → R, for k = 1, 2, . . . , are bounded below and K– inf–compact, and uk ↑ u, then lim min uk (x, a) = min u(x, a)
k→∞ a∈A(x)
a∈A(x)
for all x ∈ X. Proof. (a) This part follows from Theorem B.9. It also follows from Theorem B.8, using the fact that (by Lemma B.6) K − inf-compactness ⇒ inf-compactness on K,
(2.3.13)
and also implies lower semi–continuity of u (by Theorem B.9). (b) Fix an arbitrary x ∈ X, and let g(a) := u(x, a) and gk (a) := uk (x, a). Then gk ↑ g and, by (2.3.13) again, the functions gk are inf–compact on A(x). Since x was arbitrary, (b) follows from Lemma 2.15 (c). 2 In the context of the infinite–horizon OCP (2.3.2)–(2.3.3), our Basic Assumption (in Chap. 1) states that: (a) the stage cost c : K → R is nonnegative, and (b) there exists a control policy π ∈ Π such that V (π, x) < ∞ for all x ∈ X. In addition to the Basic Assumption, in the remainder of this section we suppose the following. Assumption 2.17. (a) The cost function c ≥ 0 is K–inf–compact; (b) The system function F : K → X in (2.3.2) is continuous. We will denote by L(X) the family of l.s.c. functions on X, and by L+ (X) the subfamily of nonnegative functions in L(X). Observe that L+ (X) is a convex cone. (See Exercise 2.9.)
34
2
DISCRETE–TIME DETERMINISTIC SYSTEMS
Lemma 2.18. Suppose that Assumption 2.17 holds, and let L+ (X) be the family of nonnegative and l.s.c. functions on X. Then: (a) The Bellman operator K in (2.3.8) maps L+ (X) into itself; that is, if v is in L+ (X), then so is Kv. (b) If v is in L+ (X), then there exists a selector f ∈ F that attains the minimum in (2.3.8), i.e., Kv(x) = c(x, f (x)) + αv(F (x, f (x))) ∀ x ∈ X. Proof. Fix an arbitrary function v ∈ L+ (X), and define u(x, a) := c(x, a) + αv(F (x, a)) ∀ (x, a) ∈ K.
(2.3.14)
Both results (a) and (b) will follow from Lemma 2.16(a) if we show that the nonnegative function u is K–inf–compact. To this end, take an arbitrary compact set X ⊂ X and r ∈ R, and note that the set G(X )c := {(x, a) ∈ G(X )|c(x, a) ≤ r} is compact, because c is K–inf–compact (Assumption 2.17(a)). Moreover, G(X )c contains the set G(X )u := {(x, a) ∈ G(X )|u(x, a) ≤ r}. Therefore, to see that u is K–inf–compact, it suffices to show that G(X )u is closed, that is, if a sequence of elements (xk , ak ) ∈ G(X )u converges to (x, a), then (x, a) is in G(X )u . This, however, is obvious because the mapping (x, a) → u(x, a) is l.s.c. (Exercise 2.9). 2 Remark 2.19. Consider a selector f ∈ F. To simplify the notation we will write c(x, f (x)) and F (x, f (x)) as c(x, f ) and F (x, f ), respectively. Moreover, we will identify f with the stationary Markov control policy π = {at } such that at (xt ) = f (xt ) for all t = 0, 1, . . . . Similarly, if π = f , we will express the total α– discounted cost in (2.3.3) as V (f, x), that is, V (f, x) =
∞ t=0
αt c(xt , f )
(2.3.15)
2.3
INFINITE–HORIZON PROBLEMS
35
for all initial state x0 = x. Note that expanding the righ–hand side of (2.3.15) we obtain V (f, x) = c(x, f ) + α
∞
αt−1 c(xt , f ).
t=1
Thus, by (2.3.2), we can express (2.3.15) as V (f, x) = c(x, f ) + αV (f, F (x, f )) ∀ x ∈ X or, using (2.3.2), V (f, xt ) = c(xt , f ) + αV (f, xt+1 )
(2.3.16) 3
for all t = 0, 1, . . . .
We need the following “monotonicity property” of the Bellman operator K before going back to Problems 1 and 2. Lemma 2.20. Consider the Basic Assumption and Assumption 2.17. If v ∈ L+ (X) is such that v ≥ Kv, then (a) There exists f ∈ F such that v(x) ≥ V (f, x) for all x ∈ X; therefore, (b) v ≥ V ∗ . Proof. (a) If v ≥ Kv, then, by Lemma 2.18(b), there exists f ∈ F such that v(x) ≥ c(x, f ) + αv(F (x, f )) ∀ x ∈ X. Iteration of this inequality yields, for all n = 1, 2, ... and x ∈ X, v(x) ≥
n−1
αt c(xt , f ) + αn v(xn ).
t=0
Therefore, since v ≥ 0, letting n → ∞ we obtain v(x) ≥ V (f, x) ∀ x ∈ X. (b) Thus, since V (f, ·) ≥ V ∗ (·), we conclude that v ≥ V ∗ .
2
We are now ready to give conditions under which the answer to both Problems 1 and 2 is affirmative.
2
36
DISCRETE–TIME DETERMINISTIC SYSTEMS
Theorem 2.21. Consider the Basic Assumption and Assumption 2.17. Then (a) (2.3.6) holds, in fact Vn∗ ↑ V ∗ , and (b) V ∗ is the minimal solution of the α-optimality equation (2.3.7); that is, V ∗ = KV ∗ and if v ∈ L+ (X) also satisfies v = KV , then v ≥ V ∗ . (c) V ∗ is the “unique” solution of the DPE in the following sense: If v ∈ L+ (X) is such that v = Kv and, in addition, for any policy π = {at } and the corresponding state trajectory {xt } we have (2.3.17) lim αn v(xn ) = 0, n→∞
then v = V ∗ . (The condition (2.3.17) is sometimes called a “transversality condition”.) Proof. Since c ≥ 0, the definition of Vn∗ gives Vn∗ (x) ≤ Vn (π, x) ≤ V (π, x) for any policy π and x ∈ X, so Vn∗ (x) ≤ V ∗ (x) ∀ x ∈ X.
(2.3.18)
On the other hand, since the sequence {Vn∗ } is nondecreasing, Vn∗ ↑ v for some function v such that, by (2.3.18), v ≤ V ∗.
(2.3.19)
Furthermore, from (2.3.10) and Lemma 2.16(b), v is in L+ (X) and it satisfies (2.3.9), that is, ∗ v = lim Vn∗ = lim KVn−1 = Kv. n
n
It follows from Lemma 2.20 that v ≥ V ∗ . This fact and (2.3.19) give that v = V ∗ . This proves both (a) and (b). (c) If v = Kv, Lemma 2.20 gives that v ≥ V ∗ . To obtain the reverse inequality first note that v = Kv implies v(x) ≤ c(x, a) + αv(F (x, a)) ∀ (x, a) ∈ K.
2.3
INFINITE–HORIZON PROBLEMS
37
Therefore, for any policy π = {at } and the associated state trajectory xt , v(xt ) ≤ c(xt , at ) + αv(xt+1 ) ∀ t = 0, 1, . . . . Iteration of this inequality, given an arbitrary initial state x0 = x, gives v(x) ≤
n−1
αt c(xt , at ) + αn v(xn ) ∀ n = 1, 2, . . . ,
t=0
i.e., v(x) ≤ Vn−1 (π, x) + αn v(xn ). Finally, letting n → ∞, from (2.3.17) we obtain v(x) ≤ V (π, x). Thus, since π and x0 = x were arbitrary, it follows that v(·) ≤ V ∗ (·). This completes the proof of part (c). 2 As a consequence of Theorem 2.21(b) and Lemma 2.18(b) we obtain the existence of an optimal stationary policy for the infinite–horizon OCP (2.3.2)–(2.3.3), as follows. Corollary 2.22. Under the hypotheses of Theorem 2.21, there exists f ∗ ∈ F such that f ∗ (x) ∈ A(x) attains the minimum in the right–hand side of (2.3.7), that is, V ∗ (x) = c(x, f ∗ ) + αV ∗ (F (x, f ∗ )) ∀ x ∈ X,
(2.3.20)
and f ∗ is an optimal stationary policy. Proof. The existence of f ∗ as in (2.3.20) follows from Lemma 2.18(b). Moreover, as in (2.3.16), iteration of (2.3.20) gives that V ∗ (x) = V (f ∗ , x) for all x ∈ X. Therefore, f ∗ is α–discount optimal. 2 In the following examples we use the notation in (2.3.2)–(2.3.4). These examples show that before using DP results (such as Theorem 2.21) we should carefully verify their hypotheses. Example 2.23 (Bertsekas 1987, p. 212). Let X = [0, ∞), c(x, a) ≡ 0, and F (x, a) = x/α. Then, for any constant b, the function V (x) = bx, for x ∈ X, satisfies (2.3.7). Hence, the operator K in (2.3.8)–(2.3.9) has an infinite number of fixed points in this
38
2
DISCRETE–TIME DETERMINISTIC SYSTEMS
case. However, it has a unique fixed point in the class B(X) of real–valued bounded functions on X, namely, the zero function V ∗ (·) ≡ 0, which is the optimal cost function for this example. (See Remark 2.25 below, and also the Exercise 2.7.) 3 Example 2.24 (Bertsekas 1987, p. 215). Let X = R, A = A(x) = (0, 1] for all x ∈ X, c(x, a) = |x|, F (x, a) = α−1 ax. It can be verified that V ∗ (x) = |x| for all x ∈ X. Now let π be the stationary policy such that π(x) = 1 for all x ∈ X. Then the cost function v(·) = V (π, ·) in (2.3.3) is v(x) = ∞ if x = 0 and v(0) = 0, so π is not optimal because v(·) = V ∗ (·). Nevertheless, it can be verified that v is a fixed point of K, so it satisfies (2.3.9). 3 The Exercise 2.7 and other results below use the following wellknown fact. Remark 2.25 (Banach’s fixed point theorem). Let (X , ρ) be a complete metric space, and T : X → X a contraction mapping, that is, there exists a number β ∈ (0, 1) such that ρ(T u, T v) ≤ βρ(u, v) ∀ u, v ∈ X . Then T has a unique fixed point u∗ ∈ X , i.e., T u∗ = u∗ . Moreover, for any u ∈ X , T n u converges to the fixed point u∗ ; in fact, ρ(T n u, u∗ ) ≤ β n ρ(u, u∗ ) ∀ n = 0, 1, . . . , where T n := T (T n−1 ) for all n = 1, 2, ..., with T 0 :=identity.
3
2.3.2 The Minimum Principle Now we introduce the minimum principle for an infinite–horizon OCP. As in Sect. 2.2.2, these necessary conditions for optimality are stated under smoothness properties. We assume that the state space X is a subset of Rn , the action space A is a subset of Rm , and the cost functions ct : X × A → R as well as the system functions Ft : X × A → X are differentiable in the interior of X × A.
2.3
INFINITE–HORIZON PROBLEMS
39
Given an initial state x0 = x and the dynamics (2.0.1), let us consider the general performance function V (π, x) =
∞
ct (xt , at ),
t=0
for which we assume that V (π, x) > −∞ for each admissible policy π. We also suppose that there is a policy π such that V (π, x) < ∞ for every x. Under these conditions we first study the Gˆateaux differential of the real valued function V (·, x). Remark 2.26 (on notation). (a) Given τ = 0, 1, . . . , a policy π = {a0 , a1 , . . . , aτ −1 , aτ , aτ +1 , . . . }, and an action a ∈ A, we define the policy π−τ (a) := {a0 , a1 , . . . , aτ −1 , a, aτ +1 , . . . }, which is obtained from π by replacing the action aτ with a. π (a) (b) We will denote by xt −τ , t = 0, 1, . . ., the state path corresponding to π−τ (a). Note that π
−τ xt+1
(a)
:= Ft (xt , at ) if t < τ , := Fτ (xτ , a) if t = τ , π
:= Ft (xt −τ
(a)
, at ) if t > τ
for all t = 0, 1, .... (c) We use the following notation for the product of square matrices J1 , J2 , ... : t
Jk := Jτ · · · Jt
k=τ
:= I
if τ ≤ t,
if τ > t,
where I is the identity matrix.
3
Lemma 2.27. Fix an arbitrary τ in {0, 1, . . . } and y ∈ Rm (a row vector). Let ψ τ ,y be defined by
2
40
ψiτ ,y
DISCRETE–TIME DETERMINISTIC SYSTEMS
y for i = τ , := 0 for i =
τ.
Then, if it exists, the Gˆateaux differential of V (·, x) at π in the direction ψ τ ,y is dV (π; ψ τ ,y ) =
∂Fτ ∂cτ (xτ , aτ )y ∗ + λτ +1 (xτ , aτ )y ∗ , ∂a ∂a
where y ∗ denotes the transpose of y ∈ Rm , and λτ +1
∞ s−1 ∂cs ∂Fk := (xs , as ) (xk , ak ). ∂x ∂x s=τ +1 k=τ +1
Proof. For δ ∈ [0, 1) small enough so that aτ + δy is in an open neighborhood of aτ , consider the policy π + δψ τ ,y = π−τ (aτ + δy). Then V (π + δψ τ ,y , x) =
τ −1
cs (xs , as ) + cτ (xτ , aτ + δ y) +
s=0
∞
π
cs xs−τ
(aτ +δ y)
, as .
s=τ +1
Thus, the Gateaux differential of V at π in the direction ψ τ ,y is given by dV (π ; ψ τ ,y ) = =
=
d V (π + δψ τ ,y ) dδ δ =0
∞ d ∂ cτ π (a +δ y) cs xs−τ τ , as . (xτ , aτ )y ∗ + ∂a dδ δ =0 s=τ +1
∂ cτ (xτ , aτ )y ∗ ∂a ⎛ ⎞ s−1 ∞ ∂ Fk ∂ c ∂ Fτ s +⎝ (xs , as ) (xk , ak )⎠ (xτ , aτ )y ∗ , ∂x ∂x ∂a s=τ +1 k=τ +1
where the last equality is obtained applying the chain rule inductively. 2 From the above computation, we can see that the existence of the Gˆateaux differential of V (·, x) requires the following additional hypothesis.
2.3
INFINITE–HORIZON PROBLEMS
41
Assumption 2.28. Let π = {a0 , a1 , . . . } be a policy. For any τ = 0, 1, . . . , there is an open neighborhood Uτ of aτ such that the sequence of functions ρτ ,s : Uτ → R given by ρτ ,s (a) :=
s−1 ∂Fk π−τ (a) ∂cs π−τ (a) , as ) , ak ) (xs (xk ∂x ∂x k=τ +1
are such that the series
∞
s=τ +1
(2.3.21)
ρτ ,s (a) converges uniformly in Uτ .
Theorem 2.29. Let π = {at , t = 0, 1, . . . } be a policy satisfying the Assumption 2.28 and {xt , t = 0, 1, . . . } is the corresponding state trajectory. If π is an optimal policy for the OCP, then there n exists a sequence {λt }∞ t=1 in R such that: 1. for each t = 1, 2, . . ., ∂ct ∂Ft (xt , at ) + λt+1 (xt , at ) = λt , ∂x ∂x
(2.3.22)
2. for each t = 0, 1, 2, . . . , ∂ct ∂Ft (xt , at ) + λt+1 (xt , at ) = 0, ∂a ∂a
(2.3.23)
3. Transversality condition (TC): for each τ = 0, 1, 2, . . . , lim λt
t→∞
t−1 ∂Fk k=τ
∂x
(xk , ak ) = 0.
(2.3.24)
Proof. Define, for each t = 1, 2, . . . , λt :=
∞ ∂cs s=t
∂x
(xs , as )
s−1 ∂Fk k=t
∂x
(xk , ak ).
(2.3.25)
1. A direct calculation gives λt =
∞ ∂ cs s=t
∂x
(xs , as )
s−1 k=t
⎛
∂ Fk (x , a ) ∂x k k
⎞ ∞ s−1 ∂F ∂ Ft ∂ ct ∂ c s k = (xt , at ) + ⎝ (xs , as ) (x , a )⎠ (xt , at ) ∂x ∂x ∂x k k ∂x s=t+1
k=t+1
2
42
DISCRETE–TIME DETERMINISTIC SYSTEMS
∂ ct ∂ Ft (xt , at ) + λt+1 (xt , at ). ∂x ∂x
=
2. From Lemma 2.27, we have dV (π; ψ τ ,y ) =
∂cτ ∂Fτ (xτ , aτ )y ∗ + λτ +1 (xτ , aτ )y ∗ . ∂a ∂a
Recalling Theorem 1 from Luenberger (1969), p. 178, a necessary condition for π to be a minimizer of V is that dV (π; ψ τ ,y ) = 0 for all τ = 0, 1, 2, . . . and every y ∈ Rm . Hence ∂cτ ∂Fτ (xτ , aτ ) + λτ +1 (xτ , aτ ) = 0 for all τ = 0, 1, 2, . . . . ∂a ∂a 3. Notice that λt
t−1 k=τ
∂Fk (xk , ak ) = ∂x
t−1 s−1 ∞ ∂Fk ∂Fk ∂cs (xs , as ) (xk , ak ) (xk , ak ) ∂x ∂x ∂x s=t k=t
k=τ
s−1 ∞ ∂Fk ∂cs = (xs , as ) (xk , ak ). ∂x ∂x s=t k=τ
From Assumption 2.28, this series is convergent, and so its tail tends to zero as in (2.3.24). 2 Remark 2.30. Let us assume that in Theorem 2.29 the search of an optimal control is restricted to Markov policies π = {ft }, so that at = ft (xt ) for all t = 0, 1, . . . . If π is optimal for the OPC and {xt }∞ t=0 is the corresponding optimal trajectory xt+1 = Ft (xt , ft (xt )) for all t = 0, 1, . . . , then (2.3.22)–(2.3.24) are rewritten as follows: 1. for each t = 1, 2, . . ., ∂Ft ∂ct (xt , ft ) + λt+1 (xt , ft ) = λt , ∂x ∂x
(2.3.26)
2. for each t = 0, 1, 2, . . . , ∂ct ∂Ft (xt , ft ) + λt+1 (xt , ft ) = 0, ∂a ∂a
(2.3.27)
3. Transversality condition (TC): for each τ = 0, 1, 2, . . . ,
2.3
INFINITE–HORIZON PROBLEMS
lim λt
t→∞
43
t−1 ∂Fk k=τ
∂Fk ∂fk (xk , fk ) + (xk , fk ) (xk ) = 0. (2.3.28) ∂x ∂a ∂x
Moreover, for each t = 1, 2, . . . we redefine λt in (2.3.25) as ∞ s−1 s−1 ∂cs ∂fs ∂cs λt := Ak + Bk , (xs , fs ) (xs , fs ) (xs ) ∂x ∂a ∂x s=t k=t k=t (2.3.29) where ∂Fk (xk , fk ) + ∂x ∂Fk Bk := (xk , fk ) + ∂x Ak :=
∂Fk ∂fk (xk , fk ) (xk ), ∂a ∂x ∂Fk ∂fk (xk , fk ) (xk ). ∂a ∂x
The proof is similar to Theorem 2.29. For details see Dom´ınguezCorella and Hern´andez-Lerma (2019). 3 Example 2.31 (Brock–Mirman infinite-horizon model). We next consider an infinite-horizon version of the Brock and Mirman (1972) model in Example 2.10. Given an initial state x0 , the system evolves according to the dynamics (2.3.30) xt+1 = cxθt − at , t = 0, 1, 2, . . . , with θ ∈ (0, 1). We consider again the control constraint sets A(x) := (0, cxθ ]. The performance index, for a feasible policy π = {at , t = 0, 1, . . . }, is given by V (π, x0 ) =
∞
αt log(at ).
t=0
Thus the minimum principle conditions (2.3.26) and (2.3.27) become = λt , t = 1, 2, . . . , (2.3.31) λt+1 cθxθ−1 t αt − λt+1 = 0, at
t = 0, 1, . . . .
(2.3.32)
2
44
DISCRETE–TIME DETERMINISTIC SYSTEMS
This difference equations can be solved by the “guess and verify method” as in Chow (1997). To this end, consider a policy at := dcxθt for some real number d. Then combining (2.3.31) and (2.3.32) we obtain = λt = λt+1 cθxθ−1 t
αt αt cθxθ−1 θαt t cθxθ−1 = = . t at dxt dcxθt
This implies that λt+1 =
θαt+1 θαt+1 θαt+1 = = . dxt+1 d(cxθt − at ) d(cxθt − dcxθt ) t
α On the other hand, from (2.3.32), λt+1 = dcx θ . Equating both valt ues, we conclude that d = 1 − θα. Then an optimal policy is
a∗t = c(1 − θα)[x∗t ]θ
for
t = 0, 1, 2, . . . .
(2.3.33) 3
Remark 2.32. From (2.3.33), and in accordance with Corollary 2.22, the optimal stationary policy for the Brock and Mirman model for the infinite horizon case is f ∗ (x) = c(1 − θα)xθ and the value function is V ∗ (x) =
θα θ 1 log[c(1 − θα)] + log(x). log(cθα) + 1−α (1 − α)(1 − θα) 1 − θα
This function satisfies the DP equation (2.3.8). On the other hand, in the spirit of Theorem 2.21, we can deduce that the optimal policy (2.2.16) for the finite horizon case converges to the optimal policy (2.3.33) for the infinite horizon problem. 3
2.3.3 The Weighted-Norm Approach In Sect. 2.3.1 we studied the infinite-horizon discounted OCP (2.3.2)–(2.3.3) assuming that the stage cost c(x, a) is possibly unbounded, but nonnegative. This nonnegativity yields that the infinite series (2.3.3) is well defined (although it might be infinite). In general, however, considering costs c(x, a) with both positive
2.3
INFINITE–HORIZON PROBLEMS
45
and negative values may create technical complications. To avoid them, in this section we consider weighted norms, an approach introduced by Wessels (1977). This approach indeed allows c(x, a) to take positive and negative values, but its “growth” is restricted in a suitable sense (see Assumption 2.33(c) and Lemma 2.34(b)). The basis of this approach is in the following conditions. Assumption 2.33. For every x ∈ X: (a) the control constraint set A(x) is compact, and the set-valued mapping x → A(x) is continuous. (It suffices to assume that x → A(x) is u.s.c.) (b) For (x, a) ∈ K, with K in (2.3.12), the function (x, a) → F (x, a) is continuous, and (x, a) → c(x, a) is l.s.c. (c) There is a continuous function w(·) ≥ 1 on X, and positive constants c¯ and β ≥ 1 such that, for every x ∈ X, (c1) supa∈A(x) |c(x, a)| ≤ c¯w(x), and (c2) supa∈A(x) w(F (x, a)) ≤ βw(x), and αβ < 1. Assumption 2.33 is supposed to hold throughout this section. The function w in part (c) will be referred to as a weight function , but in the control literature is also known as a majorant, a bounding function or a gauge function. Part (c2) is called Wessels condition. As an example, if the stage cost c is bounded, that is, |c(x, a)| ≤ c¯ for some constant c¯ and all (x, a) ∈ K, then we may take w ≥ 1 as a bounded function. On the other hand, a common situation is when c satisfies a polynomial growth condition, in which case we may take w as w(x) := D(1 + |x|k ). Another common situation is the exponential growth case, with w(x) := Dek|x| . The proof of the following lemma is left to the reader (Exercise 2.12). Observe that parts (a) and (b) are a direct consequence of Assumption 2.33(c) and an induction argument. Part (c) in the lemma follows from (a)-(b) and the definition (2.3.3) of V (π, ·). Lemma 2.34. Let {(xt , at ), t = 0, 1, ...} be an arbitrary sequence in K for which (2.3.2) holds, that is, xt+1 = F (xt , at ) for all t = 0, 1, .... Then, for every initial state x0 = x and t = 0, 1, ..., (a) w(xt ) ≤ β t w(x); and
2
46
DISCRETE–TIME DETERMINISTIC SYSTEMS
(b) |c(xt , at )| ≤ c¯β t w(x). (c) For any control policy π = {at , t = 0, 1, ...} and x0 = x, V (π, x) ≤ c¯
w(x) , (1 − γ)
where γ := αβ. (By Assumption 2.33(c2), γ < 1.) In this section we will be working in the spaces Mw (X) and Lw (X) defined in the following lemma. Lemma 2.35. (a) The space Mw (X) of real-valued functions v on X with a finite w-norm, which is defined as vw := sup x∈X
|v(x)| , w(x)
(2.3.34)
is a Banach space. (b) The space Lw (X) := L(X) ∩ Mw (X) of l.s.c. functions in Mw (X) is a complete metric space with the metric induced by the w-norm, that is, dist(v, v ) := v − v w . Proof. Part (a) is straightforward and is left to the reader (Exercise 2.13). To prove (b), let vn be a sequence in Lw (X) that converges in the w-norm to a function v. By part (a), v belongs to Mw (X). Thus, to complete the proof of (b) it only remains to show that v is l.s.c. To this end, first observe that v(x) = [v(x) − vn (x)] + vn (x) ≥ −vn − vw w(x) + vn (x) for all x ∈ X and n = 0, 1, .... Now consider a sequence xk → x, and fix an arbitrary n. Then, since vn is l.s.c. and w is continuous (by Assumption 2.33(c)), lim inf v(xk ) ≥ −vn − vw w(x) + vn (x). k→∞
Finally, letting n tend to ∞ we obtain lim inf k v(xk ) ≥ v(x). That is, v is l.s.c. 2 In the following proposition we state the Blackwell conditions (Blackwell 1965) for an operator to be a contraction. They will
2.3
INFINITE–HORIZON PROBLEMS
47
be used below in connection with Banach’s fixed point theorem in Remark 2.25 in the case that the metric space X is Lw (X). Proposition 2.36. Let T : Lw (X) → Lw (X) be a mapping such that: (a) T is monotone, that is, if u and v are functions in Lw (X) and u ≤ v, then T u ≤ T v; and (b) there is a positive number δ < 1 such that, for every v ∈ Lw (X) and every constant k ∈ R, T (v + kw) ≤ T v + δkw. Then T is a contraction with modulus δ. Proof. For any two functions v, v in Lw (X), v ≤ v + |v − v | ≤ v + wv − v w . Therefore, by (a)-(b), with k = v − v w , T v − T v ≤ δwv − v w . Interchanging v and v , and then combining with the latter inequality we obtain |T v − T v | ≤ δwv − v w . This implies the desired result.
2
To state our main result, Theorem 2.38, we will use the following fact. Proposition 2.37. Let x → A(x) and w be as in Assumption 2.33. (It suffices to take x → A(x) u.s.c.). Suppose that v is a l.s.c. function on K and such that, for some constant k sup |v(x, a)| ≤ kw(x) ∀x ∈ X. a∈A(x)
Then there exists f ∈ F such that, for all x ∈ X, v ∗ (x) := inf v(x, a) = v(x, f (x)) a∈A(x)
and, moreover, v ∗ is in Lw (X) with w-norm v ∗ w ≤ k.
(2.3.35)
2
48
DISCRETE–TIME DETERMINISTIC SYSTEMS
Proof. The proposition follows from Theorem B.3(b) applied to 2 the nonnegative l.s.c. function v (x, a) := v(x, a) + kw(x). The following Theorem 2.38 is the main result of the weightednorm approach to the infinite-horizon discounted OCP (2.3.2)– (2.3.4). It gives the dynamic programming equation (2.3.7), which we already obtained in Theorem 2.21 when the stage cost c is nonnegative. In the present case, however, we also obtain the convergence estimate (2.3.37), which is impossible to obtain in Theorem 2.21. Theorem 2.38. Under the Assumption 2.33 the following holds: (a) The α-discount value function V ∗ is the unique solution in Lw (X) of the dynamic programming equation (2.3.7), i.e., for every x ∈ X, V ∗ (x) = inf [c(x, a) + αV ∗ (F (x, a))]. a∈A(x)
(2.3.36)
Moreover, for every n = 1, 2, ..., Vn∗ − V ∗ w ≤ c¯γ n /(1 − γ),
(2.3.37)
where Vn∗ is the VI function in (2.3.10)–(2.3.11), and the constants c¯ and γ := αβ < 1 come from Assumption 2.33. (b) There exists f ∗ ∈ F such that, for every x ∈ X, f ∗ (x) ∈ A(x) attains the minimum in the right-hand side of (2.3.36), i.e. (using the notation in Remark 2.19), V ∗ (x) = c(x, f ∗ ) + αV ∗ (F (x, f ∗ )),
(2.3.38)
and f ∗ is α-discount optimal. Proof. (a) To prove (2.3.36) we will show that, equivalently, V ∗ is the unique fixed-point in Lw (X) of the Bellman operator K in (2.3.8). To this end, we will prove that K is a contraction operator on the complete metric space Lw (X), so (2.3.36)–(2.3.37) will follow from Banach’s fixed-point theorem (Remark 2.25). First, we need to show that indeed K maps Lw (X) into itself. To do this, pick an arbitrary function v in Lw (X). Hence, since v is l.s.c. and (x, a) → F (x, a) is continuous (Assumption 2.33(b)), the
2.3
INFINITE–HORIZON PROBLEMS
49
function (x, a) → v(F (x, a)) is l.s.c on K. In addition, by Assumption 2.33(b), (x, a) → c(x, a) is l.s.c., so v (x, a) := c(x, a)+ αv(F (x, a)) is l.s.c. on K. On the other hand, by definition (2.3.34) of the w-norm and Assumption 2.33(c2), we have |v(F (x, a))| ≤ vw w(F (x, a)) ≤ βvw w(x). This inequality together with Assumption 2.33(c1) on c and Proposition 2.37 give that Kv(x) := inf a∈A(x) v (x, a) is in Lw (X). To conclude, K maps Lw (X) into itself. Now, to prove that K is a contraction operator on Lw (X) we will verify the Blackwell conditions in Proposition 2.36. First note that, obviously, K is monotone. On the other hand, for any function v ∈ Lw (X) and any constant k, Assumption 2.33(c2) gives K(v + kw)(x) = inf [c(x, a) + αv(F (x, a)) + αkw(F (x, a))] a
≤ Kv(x) + kγw(x) with γ := αβ < 1. Therefore, by Proposition 2.36, K is a contraction with modulus γ. It follows that K has a unique fixed point v ∗ in Lw (X), i.e., v ∗ = Kv ∗ or, more explicitly, v ∗ (x) = inf [c(x, a) + αv ∗ (F (x, a))] a∈A(x)
(2.3.39)
for all x ∈ X. To complete the proof of part (a) we will show that v ∗ = V ∗ , the OCP’s value function. First, by Proposition 2.37, there exists f ∈ F that minimizes the right-hand side of (2.3.39), i.e., v ∗ (x) = c(x, f ) + αv ∗ (F (x, f )) ∀x ∈ X. Iteration of this equality gives ∗
v (x) =
n−1
c(xt , f ) + αn v ∗ (xn )
(2.3.40)
t=0
for all x ∈ X and n = 1, 2, . . . . In addition, the last term αn v ∗ (xn ) tends to zero as n → ∞. In fact, for any function v in Lw (X), Lemma 2.34(a) yields
2
50
DISCRETE–TIME DETERMINISTIC SYSTEMS
αn |v(xn )| ≤ αn vw w(xn ) ≤ γ n vw w(x) → 0
(2.3.41)
as n → ∞. Thus, for every x ∈ X, (2.3.40) yields v ∗ (x)=V (f, x) ≥ V ∗ (x), i.e., (2.3.42) v ∗ (·) ≥ V ∗ (·). To obtain the reverse inequality, note that (2.3.39) gives that v ∗ (x) ≤ c(x, a) + αv ∗ (F (x, a)) ∀(x, a) ∈ K. Therefore, for any policy π = at and any initial state x0 = x ∈ X, v ∗ (xt ) ≤ c(xt , at ) + αv ∗ (xt+1 ), so, for every n = 1, 2, ..., v ∗ (x) ≤
n−1
c(xt , at ) + αn v ∗ (xn ).
t=0
Finally, letting n → ∞ and using (2.3.41) again we obtain v ∗ (x) ≤ V (π, x) ∀x ∈ X. Thus, since π was arbitrary, it follows that v ∗ (·) ≤ V ∗ (·). This inequality and (2.3.42) give that v ∗ = V ∗ . This fact together with Banach’s fixed-point theorem complete the proof of part (a). (b) This part follows from Proposition 2.37, as in (2.3.39)– (2.3.41). 2 In the following section we will present some applications of the weighted-norm approach.
2.4 Approximation Algorithms Solving a dynamic programming equation (DPE), such as (2.3.7), is in general a difficult task. Richard Bellman, in his book Bellman (1957a) coined the term “the curse of dimensionality” to refer to the fact that the difficulty in solving a DPE rapidly increases with the number of dimensions.
2.4
APPROXIMATION ALGORITHMS
51
On the other hand, there are two natural ways in which we can approximate the solution of a DPE, namely, the value iteration (VI) and the policy iteration (PI) algorithms. These algorithms are difficult to compare because their performances highly depend on the particular features of the OCPs being dealt with. In either case, however, they are the basis for several useful approaches in so-called adaptive dynamic programming and reinforcement learning to obtain approximate solutions to a DPE and analyze related issues, such as the “stabilizability property” of an optimal control. First, we will consider the VI algorithm (Sect. 2.4.1), and then the PI algorithm (Sect. 2.4.2). The VI approach is also known as the method of successive approximations. See Wessels (1977).
2.4.1 Value Iteration Let K be the Bellman operator in (2.3.8), and consider the VI functions ∗ Vn∗ = KVn−1 = K n V0∗ for n = 1, 2, ...
in (2.3.11), with V0∗ ≡ 0. The VI algorithm refers to Problem 1 in Sect. 2.2, that is, the convergence Vn∗ → V ∗ in (2.3.6). This convergence was already obtained in Theorems 2.21 and 2.38 under two different sets of assumptions. In particular, Theorem 2.21 requires the stage cost c(x, a) to be nonnegative, whereas Theorem 2.38 uses a weighted-norm approach. In this section we analyze additional properties of the VI algorithm, which concern the VI control policies defined as follows. Definition 2.39. A sequence πV I = {fn , n = 1, 2, ...} ⊂ F is called a VI policy if, for every n = 1, 2, ..., fn ∈ F minimizes the right-hand side of (2.3.10), i.e., ∗ (F (x, fn )) ∀x ∈ X. Vn∗ (x) = c(x, fn ) + αVn−1
(2.4.1)
We will denote by Fn ⊂ F the subfamily of selectors that satisfy (2.4.1). (Recall from Remark 2.14 that V0∗ (·) ≡ 0. Thus, for πV I to be a true control policy we may take f0 as any selector in F.)
2
52
DISCRETE–TIME DETERMINISTIC SYSTEMS
To analyze a VI policy we will use the discrepancy function D : K → R defined as D(x, a) := c(x, a) + αV ∗ (F (x, a)) − V ∗ (x)
(2.4.2)
for (x, a) ∈ K. By the DPE (2.3.7), D is nonnegative. Moreover, we can rewrite (2.3.7) as inf D(x, a) = 0 ∀x ∈ X.
a∈A(x)
(2.4.3)
Similarly, an equality such as (2.3.38) becomes D(x, f ∗ ) = 0 ∀x ∈ X,
(2.4.4)
where, as in the Remark 2.19, D(x, f ∗ ) ≡ D(x, f ∗ (x)). The name discrepancy function comes from the fact that, for any policy π = {at }, we can express the difference or “discrepancy” between V (π, ·) and V ∗ (·) in terms of D; in fact, for any initial state x0 = x, ∗
V (π, x) − V (x) =
∞
αt D(xt , at ).
(2.4.5)
t=0
(See Exercise 2.14) Results such as (2.4.4) and (2.4.5) motivate the following definition. Definition 2.40. A Markov policy π = {fn } is said to be asymptotically optimal (for the discounted cost criterion) if, for every x ∈ X, (2.4.6) D(x, fn ) → 0 as n → ∞. The concept of asymptotic optimality was introduced in adaptive Markov control processes (adaptive MCPs), which are MCPs that depend on unknown parameters, say, θ. In this case, at each time n, the controller computes an estimate θn of θ, and then he/she adapts his/her control action fn to this estimate. Thus (in view of (2.4.4)), (2.4.6) holds if fn is approximating in some sense an optimal control. Alternatively, asymptotic optimality can be used to analyze “approximations” to the OCP (2.3.2)–(2.3.3).
2.4
APPROXIMATION ALGORITHMS
53
(See, for instance, Sect. 4.6 in Hern´andez-Lerma and Lasserre (1996).) By (2.3.6), a VI policy is a natural candidate to be asymptotically optimal. This is not necessarily true, however, in the generality of Theorem 2.21. On the other hand, in the weighted-norm context of Theorem 2.38 we obtain the following nice result. Proposition 2.41. Suppose that Assumption 2.33 holds, and let πV I = {fn } be a VI policy. Then, for every x ∈ X and n = 1, 2, ..., 0 ≤ D(x, fn ) ≤ 2¯ cγ n w(x)/(1 − γ)
(2.4.7)
with c¯, w(·), and γ := αβ as in Assumption 2.33. Hence πV I is asymptotically optimal. Proof. From (2.4.2) and (2.4.1), D(x, fn ) = c(x, fn ) + αV ∗ (F (x, fn )) − V ∗ (x) ∗ = Vn∗ (x) − V ∗ (x) + α[V ∗ (F (x, fn )) − Vn−1 (F (x, fn ))] for all x ∈ X and n = 1, 2, .... Thus (2.4.7) follows from (2.3.37). 2 Note that (2.4.7) gives an a priori bound for D(x, fn ) in the sense that the right-hand does not depend on neither V ∗ nor Vn∗ . Now consider the following situation: fix an arbitrary n = 1, 2, ... and let fn ∈ Fn be the corresponding selector in (2.4.1). Furthermore, consider the stationary Markov policy πn = {gt } ⊂ F such that gt ≡ fn for all t = 0, 1, .... In other words, we apply the same control fn at every stage t = 0, 1, .... Then (2.4.8) below states that πn can be made “arbitrarily close to an optimal policy” if n is large enough. (See also Proposition 2.43 below.) Proposition 2.42. With πn as above, cγ n w(x)/(1 − γ) 0 ≤ V (πn , x) − V ∗ (x) ≤ 2¯
(2.4.8)
for all x ∈ X. Proof. Fix x ∈ X, and consider V (πn , x) − V ∗ (x) ≤ |V (πn , x) − Vn∗ (x)| + |Vn∗ (x) − V ∗ (x)|.
54
2
DISCRETE–TIME DETERMINISTIC SYSTEMS
By (2.3.37), the second term on the right satisfies that |Vn∗ (x) − V ∗ (x)| ≤ c¯γ n w(x)/(1 − γ).
(2.4.9)
Now, by definition of πn , V (πn , x)=c(x, fn )+αV (πn , F (x, fn )). Combining this equation with (2.4.1) we obtain ∗ (F (x, fn ))|. |V (πn , x) − Vn∗ (x)| ≤ α|V (πn , F (x, fn )) − Vn−1
Iteration of this inequality (and recalling that V0∗ (·) ≡ 0) gives |V (πn , x) − Vn∗ (x)| ≤ αn |V (πn , xn )| ≤ c¯αn w(xn )/(1 − γ) [by Lemma 2.34(c)] ≤ c¯γ n w(x)/(1 − γ) [by Lemma 2.34(a)]. This inequality and (2.4.9) give (2.4.8).
2
Results such as Propositions 2.41 or 2.42 obviously suggest that the VI selectors fn ∈ Fn might converge to an α-optimal control f ∗ ∈ F. This is not necessarily true, but we can ensure the following. Proposition 2.43. Suppose that the hypotheses of Theorem 2.38 are satisfied, and consider a VI policy πV I = {fn , n = 0, 1, . . . }. Then there exists f ∗ ∈ F such that fn converges to f ∗ in the sense of Sch¨al (1975); that is, for each x ∈ X, there is a sequence ni = ni (x) such that fni (x) → f ∗ (x) as i → ∞. The proof of Proposition 2.43 follows from Proposition B.12 with vn ≡ Vn∗ and v ∗ ≡ V ∗ in (2.4.1) and (2.3.38), respectively.
2.4.2 Policy Iteration The policy iteration (PI) algorithm is also known as Howard’s policy improvement method. The algorithm was introduced by Howard (1960) for a class of discrete-time Markov decision processes with finite state space and finite action sets. Nevertheless, it soon became evident that the PI algorithm could be extended to many classes of OCPs, including all those considered in these
2.4
APPROXIMATION ALGORITHMS
55
lectures. On the other hand, a key difference with respect to the VI algorithm is that PI gives a monotone sequence of functions converging to the optimal value function V ∗ (·), even if the stage costs c(x, a) take positive and negative values! The PI algorithm is based on the “monotonicity property” of the Bellman operator K (2.3.8) established in Lemma 2.20, under the assumptions of Theorem 2.21. In the context of Theorem 2.38 the same proof of Lemma 2.20 yields the following. Lemma 2.44. Suppose that the Assumption 2.33 holds. If v ∈ Lw (X) is such that v ≥ Kv, then (a) There exists f ∈ F such that v(x) ≥ V (f, x) for all x ∈ X; and, therefore, (b) v ≥ V ∗ . Now suppose that the conditions of Lemma 2.20 or Lemma 2.44 are satisfied. Consider an arbitrary selector g0 ∈ F, and let v0 (·) := V (g0 , ·) be the corresponding discounted cost. Then (as in the Remark 2.19) v0 (x) = c(x, g0 ) + αv0 (F (x, g0 )) ≥ inf [c(x, a) + αv0 (F (x, a))], a∈A(x)
(2.4.10)
so v0 ≥ Kv0 . Therefore, by Lemma 2.20(a) or Lemma 2.44(a), there exists g1 ∈ F such that v0 ≥ v1 , where v1 (x) := V (g1 , x) for all x ∈ X. Next, in (2.4.10) replace v0 , g0 by v1 , g1 and repeat the same argument to obtain g2 ∈ F and v2 (·) := V (g2 , ·), with v1 ≥ v2 . In general, the PI algorithm is as follows, with n = 0, 1, . . . (PI1 ) Given gn ∈ F, compute the discounted cost vn (·) ≡ V (gn , ·). Then, for all x ∈ X, vn (x) = c(x, gn ) + αvn (F (x, gn )) ≥ Kvn (x).
(2.4.11)
(PI2 ) Policy improvement: Find gn+1 ∈ F such that Kvn (x) = c(x, gn+1 ) + αvn (F (x, gn+1 )) ∀x ∈ X; so vn ≥ vn+1 , where vn+1 (·) ≡ V (gn+1 , ·). Replace n by n + 1 and go back to step (PI1 ).
56
2
DISCRETE–TIME DETERMINISTIC SYSTEMS
Theorem 2.45. Suppose that the hypotheses of Theorem 2.21 or Theorem 2.38 are satisfied, and let vn be as in the PI algorithm. Then: (a) If there exists n for which vn (x) = vn+1 (x) for all x ∈ X, then the function v ∗ (·) ≡ vn (·) satisfies the DPE v ∗ = Kv ∗ . Moreover, v ∗ = V ∗ and gn is an optimal control. (b) In general, as n → ∞, vn ↓ v ∗ , where v ∗ is a solution of the DPE, and v ∗ = V ∗ . Proof. (a) Let v ∗ (·) := vn (·) ≡ vn+1 (·). Then, by (PI1 ) and (PI2 ), v ∗ ≥ Kv ∗ ≥ v ∗ , so v ∗ satisfies the DPE. This completes the proof of part (a) under the assumptions of Theorem 2.38. Similarly, in Theorem 2.21 V ∗ is the minimal solution of the DPE. Therefore, if v ∗ = V ∗ , then, by (PI2 ), there exists a control that “improves” v∗ = vn+1 . This is a contradiction. (b) By construction, the functions vn form a nondecreasing sequence, which (by Lemma 2.20 or Lemma 2.44) is bounded below by V ∗ . Therefore, vn ↓ v ∗ for some function v ∗ ≥ V ∗ . Moreover, by Lemma 2.15(a) and (2.4.11), v ∗ ≥ Kv ∗ .
(2.4.12)
On the other hand, for all x ∈ X and n = 0, 1, ..., v ∗ (x) ≤ vn (x) = c(x, gn ) + αvn (F (x, gn )) ≤ c(x, gn ) + αvn−1 (F (x, gn )) = Kvn−1 (x). Hence, by definition of K in (2.3.8), v ∗ (x) ≤ c(x, a) + αvn−1 (F (x, a)) ∀ (x, a) ∈ K. As n → ∞, the latter inequality yields, v ∗ (x) ≤ c(x, a) + α v ∗ (F (x, a)), which in turn gives v ∗ ≤ Kv ∗ . This inequality and (2.4.12) give that v ∗ satisfies the DPE v ∗ = Kv ∗ . The last statement in part (b) is obtained as in (a). 2
2.4
APPROXIMATION ALGORITHMS
57
Note that Proposition 2.43 remains true if we replace “VI policy” by “PI policy”. For a more general statement of this proposition, see Theorem B.10 or Proposition B.12 in Appendix B. Example 2.46. Consider the LQ control problem consisting of the linear system xt+1 = δxt + ηat ,
t = 0, 1, ...,
(2.4.13)
with initial state x0 = x, nonzero coefficients δ, η, and a quadratic stage cost c(x, a) = qx2 + ra2
for x, a ∈ X = A = R
(2.4.14)
with q ≥ 0 and r > 0. Hence the OCP is to minimize the αdiscounted cost ∞ V (π, x) = αt c(xt , at ) (2.4.15) t=0
subject to (2.4.13). Given this OCP, develop: (a) the VI algorithm, and (b) a PI algorithm. (c) Solve the LQ problem (2.4.13)–(2.4.15) by means of the “guess and verify” approach, which is also known as “the method of undetermined coefficients”. Solution of (a). To simplify the notation, we will write the VI functions Vn∗ as vn . Hence, (2.3.10) becomes vn (x) = min [c(x, a) + αvn−1 (F (x, a))] a∈A(x)
(2.4.16)
for all x ∈ X and n = 1, 2, ..., with v0 (·) ≡ 0. Thus, for n = 1, (2.4.16) gives v1 (x) = min(qx2 + ra2 ) = qx2 a
∀x ∈ X,
and the minimum is attained at a∗ = f1 (x) = 0 for all x. Similarly, v2 (x) = min[qx2 + ra2 + αv1 (δx + ηa)] a
= min[qx2 + ra2 + αq(δx + ηa)2 ] a
(2.4.17)
2
58
DISCRETE–TIME DETERMINISTIC SYSTEMS
Computating the derivative of the right-hand side with respect to a, and then equating to 0, gives that the minimizer is f2 (x) = −(r + αqη 2 )−1 αqδη · x. Replacing a = f2 (·) in (2.4.17), v2 becomes the quadratic function v2 (x) = C2 x2 , with coefficient C2 :=
qr + (rδ 2 + qη 2 )αq . r + αqη 2
In general, by induction we can see that, for every x ∈ X, ∀n = 0, 1, ...,
(2.4.18)
fn (x) = −(r + SCn−1 )−1 αδηCn−1 · x
(2.4.19)
vn (x) = Cn x2 with C0 = 0, and VI controls
for all n = 1, 2, ..., with f0 ∈ F arbitrary (since v0 ≡ 0), and Cn =
P + QCn−1 r + SCn−1
∀n = 1, 2, . . . .
(2.4.20)
Here P := qr,
Q := (rδ 2 + qη 2 )α,
S := αη 2 .
By means of some technical arguments (which can be seen, for instance, in Dynkin and Yushkevich (1979), Sect. 2.11, or Hern´andez-Lerma and Lasserre (1996), Sect. 4.7) based on the fixed-point approach to (2.4.20), it can be seen that Cn → C as n → ∞, where C = z is the unique positive solution of the quadratic equation P + Qz z= , r + Sz which we may be rewrite as Sz 2 + (r − Q)z − P = 0. The unique positive solution is C = [−(r − Q) + ((r − Q)2 + 4P S)1/2 ]/2S.
(2.4.21)
2.4
APPROXIMATION ALGORITHMS
59
Finally, from Theorem 2.21 and (2.4.18) we conclude that the α-optimal discounted cost V ∗ is given by V ∗ (x) = lim vn (x) = Cx2 . n→∞
(2.4.22)
Moreover, from (2.4.19) and Proposition B.12(a) (in Appendix B), f ∗ (x) : = lim fn (x) n→∞
= −(r + SC)−1 αδηCx
(2.4.23)
is an α-optimal selector. Solution of (b). In the policy iteration (PI) algorithm, first, we take an arbitrary control g0 ∈ F and then we compute the corresponding discounted cost v0 (·) = V (g0 , ·). (See (2.4.10).) Since we are not given an indication about how to choose g0 , we may select it so that v0 is easy to compute. This is the case if we choose g0 (·) ≡ 0 (which is the same as f1 in part (a) above). Thus, with at = 0 for all t = 0, 1, ..., the LQ system (2.4.13)–(2.4.15) becomes xt+1 = δxt = δ t+1 x ∀t = 0, 1, ..., given the initial state x0 = x, and the stage cost c(x, a) = qx2 . Therefore, the corresponding α-discounted cost is v0 (x) = q
∞
αt x2t = D0 x2
t=0
with coefficient D0 := q/(1 − αδ 2 ), assuming that αδ 2 < 1. Having v0 , we then proceed to the policy improvement step. That is, we wish to find g1 ∈ F such that, for all x ∈ X, g1 (x) ∈ A attains the minimum in the right-hand side of the inequality v0 (x) ≥ min[c(x, a) + αv0 (F (x, a))] a
= min[qx2 + ra2 + αD0 (δx + ηa)2 ]. a
Now, the usual calculations (computing the derivative with respect to a, equating to 0, and so on) give that g1 (x) = −G1 x,
2
60
DISCRETE–TIME DETERMINISTIC SYSTEMS
with coefficient G1 :=
αδηD0 . r + αη 2 D0
From (2.4.15), the corresponding α-discounted cost v1 (x) = V (g1 , x) is v1 (x) = D1 x2 , where D1 :=
q + rG21 , 1 − α(δ − ηG1 )2
assuming that |α(δ − ηG1 )2 | < 1. In general, we obtain by induction that, for all x ∈ X and n = 0, 1, ..., gn (x) = −Gn x and vn (x) = Dn x2 ,
(2.4.24)
with coefficients G0 = 0, Dn =
q + rG2n 1 − α(δ − ηGn )2
and Gn+1 =
αδηDn r + αη 2 Dn
(2.4.25)
(2.4.26)
for n = 0, 1, .... Since the functions vn (x) = Dn x2 form a nonnegative nonincreasing sequence (Theorem 2.45), there is a nonnegative function v ∗ (x) = D∗ x2 such that vn (x) ↓ v ∗ (x) for all x ∈ X. In particular, as n → ∞, Dn → D∗ and, therefore, from (2.4.26), Gn → G∗ , with αδηD∗ r + αη 2 D∗ αδηD∗ = , r + SD∗
G∗ =
which is the same as the coefficient of (2.4.23), with C = D∗ . Finally, from (2.4.25) and (2.4.22) we conclude that v ∗ (·) = V ∗ (·) with C = D∗ . Solution of (c). We now wish to solve the infinite-horizon LQ problem (2.4.13)–(2.4.15) by the “guess and verify” approach. The idea is to “guess” that the solution of the DPE has a certain form
2.4
APPROXIMATION ALGORITHMS
61
and then we verify that this is indeed the case. In the present LQ problem, from the finite-horizon case en Example 2.4 we know that the optimal cost is a quadratic function—see (2.1.17). Since this is the case for any finite horizon T = 1, 2, ..., we immediately guess that the optimal cost in the infinite-horizon problem (2.4.13)–(2.4.15) is also of the form v(x) := Bx2
∀x ∈ X
for some constant B. To verify that this is correct, we consider the DPE (2.3.7) with V ∗ = v. Therefore, (2.3.7) becomes v(x) = min[qx2 + ra2 + αv(δx + ηa)] a∈A
or Bx2 = min[qx2 + ra2 + αB(δx + ηa)2 ]. a
(2.4.27)
The minimum in the right-hand side is attained at αδηB ·x r + αη 2 B αδηB =− · x, r + SB
f (x) = −
(2.4.28)
with S=αη 2 as in (2.4.19)–(2.4.20). Replacing a=f (x) in (2.4.27) and comparing both sides of the resulting equation, we conclude that v(x) = Bx2 indeed satisfies the DPE if B is the unique positive solution of the quadratic equation SB 2 + [r − α(rδ 2 + qη 2 )]B − qr = 0. Since this equation is the same as (2.4.21), we obtain that v(·) = V ∗ (·), the α-optimal discounted cost, and that f (·) in (2.4.28) is the optimal control. 3
2
62
DISCRETE–TIME DETERMINISTIC SYSTEMS
2.5 Long–Run Average Cost Problems In this section, we study undiscounted infinite horizon optimal control problems with the long–run average cost (AC). This optimality criterion was originally introduced by Bellman (1957b) for a class of Markov decision processes (as in Chap. 3, below). In this section we study some aspects of the AC criterion for discrete– time deterministic systems. In the following chapters we study AC optimality for other discrete– and continuous–time, deterministic and stochastic control systems. Given an initial condition x0 = x ∈ X and a policy π = {at }, let T −1 JT (π, x) := c(xt , at ). t=0
We wish to minimize the long-run average cost (AC) J(π, x) defined as 1 (2.5.1) J(π, x) := lim sup JT (π, x), T →∞ T subject to xt+1 = F (xt , at ), t = 0, 1, . . . .
(2.5.2)
The AC value function is J ∗ (x) := inf{J(π, x) : π ∈ Π}
(2.5.3)
and a control policy π ∗ is said to be average–cost optimal (AC– optimal) if J(π ∗ , x) = J ∗ (x) for all x ∈ X. To avoid trivial situations, we will assume that there is a policy π ∈ Π such that the mapping x → J(π, x) is finite-valued. There are several approaches to analyze the AC optimal control problem (2.5.1)–(2.5.3). The most common are (i) the AC optimality equation (ACOE), (ii) the steady state (or stationary state) approach, and (iii) the vanishing discount approach. We will briefly discuss each of them. (There is also an infinite-dimensional linear programming approach to study deterministic AC problems,
2.5
LONG–RUN AVERAGE COST PROBLEMS
63
but it is too technical to include it here. The interested reader may consult, for instance, Borkar et al. (2019.) or Chap. 11 in Hern´andez-Lerma and Lasserre (1999).) Remark 2.47. We will use the following notation: (a) Given a control policy π ∈ Π, we denote by {xπt , t = 0, 1, ...} the sequence defined by (2.5.2) when at is given by the policy π, that is, xπt+1 = F (xπt , at ) for all t = 0, 1, ... with some initial condition xπ0 = x0 . In particular if f ∈ F is an stationary policy with at = f (xt ), then xft+1 = F (xft , f ) for all t = 0, 1, . . . . (b) For a given function ξ : X → R, let Πξ be the family of control policies π ∈ Π such that, for every initial state x0 , 1 ξ(xπt ) → 0 as t → ∞. t
(2.5.4)
Similarly, we denote by Fξ the family of stationary policies f ∈ F that satisfies (2.5.4) for every initial state x0 . Note that if ξ is bounded, then (2.5.4) holds for all π ∈ Π and all f ∈ F; hence Πξ = Π, and Fξ = F. For special functions ξ, the relation (2.5.4) is a transversality-like condition, similar to (2.3.17) in Theorem 2.21(c). 3
2.5.1 The AC Optimality Equation A pair (j ∗ , l) consisting of a real number j ∗ ∈ R and a function l : X → R is called a solution to the average cost optimality equation (ACOE) if, for every x ∈ X, j ∗ + l(x) = inf [c(x, a) + l(F (x, a))]. a∈A(x)
(2.5.5)
It can be shown that if (j ∗ , l) is a solution to the ACOE, then j ∗ is unique. Moreover, it is obvious that if l(·) satisfies (2.5.5), then so does l(·) + k for any constant k. A solution (j ∗ , l) to the ACOE is also known as a canonical pair . If, in addition, f ∗ is a stationary policy that satisfies (2.5.6) below, then (j ∗ , l, f ∗ ) is called a canonical triplet.
2
64
DISCRETE–TIME DETERMINISTIC SYSTEMS
Theorem 2.48. Suppose that (j ∗ , l) is a solution to the ACOE (2.5.5), and let Πl be as in Remark 2.47(b) with ξ = l in (2.5.4). Then, for every initial state x0 = x, (a) j ∗ ≤ J(π, x) for all π ∈ Πl ; hence (b) j ∗ ≤ J ∗ (x) if Π = Πl . Moreover, suppose that there exists a policy f ∗ ∈ Fl such that f ∗ (x) ∈ A(x) attains the minimum in the right–hand side of (2.5.5), i.e., j ∗ + l(x) = c(x, f ∗ ) + l(F (x, f ∗ )) ∀ x ∈ X.
(2.5.6)
Then, for all x ∈ X, (c) j ∗ = J(f ∗ , x) ≤ J(π, x) for all π ∈ Πl ; hence (d) f ∗ is AC–optimal and J(f ∗ , ·) ≡ J ∗ (·) ≡ j ∗ if Πl = Π. Proof. (a) By (2.5.5), for every (x, a) ∈ K we have j ∗ + l(x) ≤ c(x, a) + l(F (x, a)).
(2.5.7)
Now consider an arbitrary policy π = {at } ∈ Πl , and let xt ≡ xπt , t = 0, 1, . . . , be the corresponding state trajectory for any given initial state xπ0 = x0 . Hence, by (2.5.7), j ∗ ≤ c(xt , at ) + l(xt+1 ) − l(xt ) ∀ t = 0, 1, . . . . Thus summation over t = 0, 1, ..., T − 1 gives T j ∗ ≤ JT (π, x) + l(xT ) − l(x0 ).
(2.5.8)
Finally, multiplying by 1/T both sides of this inequality and then letting T → ∞, (2.5.1) and (2.5.4) yield part (a). Part (b) follows from (a) if Πl = Π. (c) If f ∗ satisfies (2.5.6), then we have equality throughout (2.5.7)–(2.5.8), which yields the equality in (c). The inequality follows from (a). Finally, (d) is a consequence of (b) and (c). 2 As in Remark 2.47, if l is bounded, then Πl = Π, as required in parts (b) and (d) of Theorem 2.48. Arguments similar to those in the proof of Theorem 2.48 give other useful results, such as the following.
2.5
LONG–RUN AVERAGE COST PROBLEMS
65
Proposition 2.49. (a) Suppose that instead of (2.5.7), for some f ∈ F, we have j ∗ + l(x) ≥ c(x, f ) + l(F (x, f )) ∀ x ∈ X.
(2.5.9)
If limt→∞ l(xft )/t ≥ 0, then j ∗ ≥ J(f, x) for all x ∈ X. (b) If the inequality in (2.5.9) is reversed, i.e., j ∗ + l(x) ≤ c(x, f ) + l(F (x, f )) ∀ x ∈ X,
(2.5.10)
and limt→∞ l(xft )/t ≤ 0, then j ∗ ≤ J(f, x). The proof of Proposition 2.49 is left to the reader (Exercise 2.11). Corollary 2.50. (a) If (2.5.9) holds for some f ∈ F, then j ∗ + l(x) ≥ inf [c(x, a) + l(F (x, a))] ∀ x ∈ X. a∈A(x)
(b) If (2.5.10) holds for all f ∈ F, then j ∗ + l(x) ≤ inf [c(x, a) + l(F (x, a))] ∀ x ∈ X. a∈A(x)
Example 2.51 (The Brock–Mirman model). In the infinite– horizon Brock and Mirman economic growth model studied in Example 2.31, the system evolves according to xt+1 = cxθt − at ,
t = 0, 1, 2, . . . ,
with a given initial state x0 ∈ X and θ ∈ (0, 1). As in Examples 2.10 and 2.31, we assume that X = A = [0, ∞), and A(x) := (0, cxθ ]. The system function is F (x, a) = cxθ − a, and the stage reward (or utility) function is r(x, a) = log(a). Thus the performance index to be optimized is the long–run average reward (AR) T −1 1 J(π, x0 ) = lim inf log(at ). T →∞ T t=0
(2.5.11)
Concerning the “lim inf” in (2.5.11), instead of the “lim sup” in (2.5.1), see the paragraph and the Remark after expression (5.2.15) in Sect. 5.2.
2
66
DISCRETE–TIME DETERMINISTIC SYSTEMS
To find the canonical triplet (j ∗ , l, f ∗ ) that satisfies the ACOE j ∗ + l(x) = max [r(x, a) + l(F (x, a))] ∀x ∈ X, a∈A(x)
(2.5.12)
we consider a function l(x) of the form l(x) := b log(x), where b is an unknown parameter. Under this assumption, the right side cxθ of (2.5.12) reaches the maximum when a = 1+b , so we can rewrite (2.5.12) as θ cb cx ∗ θ j + b log(x) = log + b log x 1+b 1+b c cb = (1 + b)θ log(x) + log + b log . 1+b 1+b This last equation is satisfied if b = (1 + b)θ, which implies that θ . Therefore the canonical triplet (j ∗ , l, f ∗ ) is given by b= 1−θ j ∗ = log (c(1 − θ)) + θ log(x), 1−θ f ∗ (x) = c(1 − θ)xθ .
θ log (cθ) , 1−θ
l(x) =
(2.5.13) (2.5.14) (2.5.15)
It can be shown that, for every initial state x0 , ∗
1−θt
t
xft = (cθ) 1−θ xθ0 for t = 1, 2, . . . ∗
(2.5.16)
1
and xft → (cθ) 1−θ , which implies that f ∗ satisfies (2.5.4), that is f ∗ ∈ Fl . Thus by Theorem 2.48(d), j ∗ is the optimal average 3 reward and f ∗ is AR-optimal. The ACOE (2.5.5) provides a “complete solution” to the AC control problem in the sense that, in addition to providing a canonical triplet (j ∗ , l, f ∗ ), it also allows us to identify refinements of this triplet, such as “overtaking optimal” or “bias optimal” controls. However, if we are only interested in obtaining AC-optimal controls, it suffices to obtain an optimality inequality. This is explained in Sect. 2.5.3 below.
2.5
LONG–RUN AVERAGE COST PROBLEMS
67
2.5.2 The Steady–State Approach Let K be as in (2.3.12). A pair (x∗ , a∗ ) ∈ K is said to be a steady (or stationary) state-action pair for the system (2.5.2) if F (x∗ , a∗ ) = x∗ . If, in addition, (x∗ , a∗ ) solves the steady state problem minimize c(x, a) subject to F (x, a) = x,
(2.5.17)
then (x∗ , a∗ ) is said to be a minimum steady state-action pair for the AC control problem (2.5.1)–(2.5.2). Assumption 2.52. The OCP (2.5.1)–(2.5.2) satisfies: (a) There exists a minimum steady state-action pair (x∗ , a∗ ) ∈ K. (b) Dissipativity. The OCP (2.5.1)–(2.5.2) is dissipative, which means that there is a so–called storage function λ : X → R such that, for every (x, a) ∈ K, λ(x) − λ(F (x, a)) ≤ c(x, a) − c(x∗ , a∗ ).
(2.5.18)
(c) Stabilizability. Let Πλ be as in Remark 2.47, with λ as in (2.5.18). For each initial state x0 ∈ X, there exists a control policy π ¯ ∈ Πλ (which may depend on x0 ) such that the cor¯t ) converges to the miniresponding state–control path (¯ xt , a mum steady state pair (x∗ , a∗ ) in (a). Observe that, introducing the constant w∗ := c(x∗ , a∗ ), the dissipativity inequality (2.5.18) can be expressed as w∗ + λ(x) ≤ c(x, a) + λ(F (x, a)) ∀ (x, a) ∈ K,
(2.5.19)
which in turn gives w∗ + λ(x) ≤ inf [c(x, a) + λ(F (x, a))] ∀ x ∈ X. a∈A(x)
(2.5.20)
Consequently, in view of the similarity between (2.5.20) and the ACOE (2.5.5), one would expect some connection between the constants w∗ and v ∗ . In fact, the following theorem shows that w∗ satisfies conditions similar to (a)–(d) in Theorem 2.48. (See also Proposition 2.49(b).)
2
68
DISCRETE–TIME DETERMINISTIC SYSTEMS
Theorem 2.53. Suppose that Assumption 2.52 holds and the stage cost c : K → R is continuous. For the storage function λ in (2.5.18) let Πλ be as in Remark 2.47. Then, for all x ∈ X, π , x) ≤ J(π, x) for all π ∈ Πλ , where π ¯ satisfies (a) w∗ = J(¯ Assumption 2.52(c); hence (b) the AC value function satisfies that J ∗ (x) = c(x∗ , a∗ ), with (x∗ , a∗ ) as in Assumption 2.52(a), if Πλ = Π. Proof. (a) Let w∗ := c(x∗ , a∗ ) be as in (2.5.19). Let π = {at } be an arbitrary control policy in Πλ with corresponding state–action sequence (xt , at ). Then, from (2.5.19), w∗ ≤ c(xt , at ) + λ(xt+1 ) − λ(xt ) for all t = 0, 1, .... This yields, as in (2.5.7)–(2.5.8), T w∗ ≤ JT (π, x) + λ(xT ) − λ(x0 ) ∀ T = 1, 2, ..., so, by (2.5.4), w∗ ≤ J(π, x) for all x ∈ X.
(2.5.21)
On the other hand, by the stabilizability in Assumption 2.52(c), there is a policy π ¯ ∈ Πλ for which the state–action sequence ¯t ) converges to (x∗ , a∗ ). Therefore, since the stage cost c is (¯ xt , a ¯t ) converges to w∗ , which implies that J(¯ π , x) = continuous, c(¯ xt , a ∗ w for all x. This prove (a). ¯ is AC-optimal and (b) If Πλ = Π, then part (a) yields that π ∗ 2 also that the AC value function is J (·) ≡ w∗ . Example 2.54 (The Brock–Mirman model, cont’d.). In Example 2.51, for the system function F (x, a) = cxθ − a and the stage reward (or utility) function r(x, a) = log(a), it can be verified that the unique solution to the corresponding steady–state problem (2.5.17), namely, maximize: log(a) subject to cxθ − a = x, is given by
1 θ (x∗ , a∗ ) = (cθ) 1−θ , c(1 − θ)(cθ) 1−θ .
(2.5.22)
2.5
LONG–RUN AVERAGE COST PROBLEMS
69
Then using Theorem 2.53(a) θ log (cθ) ; 1−θ (2.5.23) compare this with (2.5.13). To get Assumption 2.52(b), note that F (x, a∗ ) − F (x, a) = a − a∗ . Thus, the strict concavity of the stage reward r(x, a) := log(a) gives J(¯ π , x) = r(x∗ , a∗ ) = log(a∗ ) = log (c(1 − θ)) +
∂r ∗ ∗ ∂r ∗ ∗ (x , a )(x − x∗ ) + (x , a )(a − a∗ ) ∂x ∂a 1 = ∗ (F (x, a∗ ) − F (x, a)). a
r(x, a) − r(x∗ , a∗ ) ≤
Moreover, the system function F (x, a∗ ) is also concave in x and ∂F (x∗ , a∗ ) = 1, so ∂x F (x, a∗ ) − F (x∗ , a∗ ) ≤
∂F ∗ ∗ (x , a )(x − x∗ ) = x − x∗ , ∂x
that is, F (x, a∗ ) ≤ x for all x ∈ X. Therefore, from the last two inequalities we get the corresponding dissipativity condition (2.5.18) r(x, a) − r(x∗ , a∗ ) ≤ λ(x) − λ(F (x, a)) with the storage function λ(x) := a1∗ x. Notice that λ is different from l given in (2.5.14), Example 2.51. Observe that for any initial state x0 ∈ X, the policy f ∗ in (2.5.15) with corresponding state-control path (xt , at ), where 1−θt
t
xt = (cθ) 1−θ xθ0 ,
at = c(1 − θ)(cθ)
θ−θt+1 1−θ
t+1
xθ0
,
satisfies the stabilizability Assumption 2.52(c), i.e., (xt , at ) converges to the optimal stationary pair (x∗ , a∗ ). π , x) ≥ J(π, x) for all Hence by Theorem 2.53, c(x∗ , a∗ ) = J(¯ 3 π ∈ Πλ and all x ∈ X. Example 2.55 (The Mitra-Wan forestry model). Consider a forestland covered by trees of the same species classified by ageclasses from 1 to n. After age n, trees have no economic value. The state space in this example (Mitra and Wan Jr 1985) can be
2
70
DISCRETE–TIME DETERMINISTIC SYSTEMS
identified with the n–simplex Δ := {x ∈ Rn :
n
xi = 1, xi ≥ 1, i = 1, ..., n}
i=1
where each coordinate xi denotes the proportion of land occupied by i-aged trees. Let xt = (x1,t , ..., xn,t ) ∈ Δ be the forest state at period t. By the end of the period the forester must decide to harvest a proportion of land in any age class, say at = (a1,t , ..., an,t ) with 0 ≤ ai,t ≤ xi,t , i = 1, ..., n. Because a tree has no economic value after age n, an,t = xn,t . Thus, for each x ∈ Δ, the admissible control set is A(x) = [0, x1 ] × · · · × [0, xn−1 ] × {xn }. Suppose that the forest evolves according to the dynamic model x1,t+1 = a1,t + · · · + an,t , xi+1,t+1 = xi,t − ai,t , i = 1, ..., n − 1,
(2.5.24) (2.5.25)
where (2.5.24) means that all harvested area at the end of period t must be sown by trees of age 1 at the beginning of period t + 1. On the other hand, (2.5.25) states that trees of age i that have not been harvested until the end of period t become trees of age i + 1 in period t + 1. For a planning horizon T , (2.5.24)–(2.5.25) can be written as a discrete-time linear control system xt+1 = f (xt , at ) := Axt + Bat where
⎛
⎞ 00.00 ⎜1 0 . 0 0⎟ ⎜ ⎟ ⎟ A := ⎜ ⎜0 1 . 0 0⎟ , ⎝. . . . .⎠ 00.10
for t = 0, 1, ..., T − 1, (2.5.26) ⎛
⎞ 1 1 . 1 1 ⎜ −1 0 . 0 0 ⎟ ⎜ ⎟ ⎟ and B := ⎜ ⎜ 0 −1 . 0 0 ⎟ . ⎝ . . . . .⎠ 0 0 . −1 0
(2.5.27)
Now, assume the timber production per unit area is related to the tree age-classes by the biomass vector ξ = (ξ1 , ξ2 , ..., ξn ) ∈ Rn ,
ξi ≥ 0, i = 1, 2, . . . , n,
2.5
LONG–RUN AVERAGE COST PROBLEMS
71
where ξi represents the amount of timber produced by i-aged trees occupying a unit of land. Hence, the total amount of timber collected at the end of period t is given by ξ · at = ξ1 a1,t + · · · + ξn an,t . Consider a timber price function p : [0, ∞) → [0, ∞), assumed to be increasing and concave. Given a forest state x and an admissible harvest control a, the stage income is r(x, a) := p(ξ · a). Therefore, the performance index to maximize is JT (π, x) :=
T −1
r(xt , at ).
(2.5.28)
t=0
It can be shown that the control system (2.5.26) has a set of stationary states given by the pairs (x, a) satisfying x1 ≥ x2 ≥ · · · ≥ xn and a1 = x1 − x2 , a2 = x2 − x3 , ..., an = xn . Moreover, for each age class i there is a pair of stationary state and control (xi , ai ), known as normal forest, defined as follows: the state is xi := (1/i, ..., 1/i, 0, ..., 0), where each of the first i coordinates are 1/i , and the remaining are 0; and the control is ai := (0, ..., 0, 1/i, 0, ..., 0), where 1/i is in the i–coordinate. We choose a normal forest (x∗ , a∗ ) such that
r(x∗ , a∗ ) = max p ξ · ai : i = 1, 2, ..., n . So, given the concavity of p, there is k ≥ 0 such that r(x, a) − r(x∗ , a∗ ) ≤ kξ · (a − a∗ ) for all a ∈ A(x). Letting N := (1, 2, ..., n) and γ := max{ξ · ai : i = 1, 2, ..., n}, we get the vector componentwise inequality ξ ≤ γN . Moreover, from a straightforward calculation we have N · [x − F (x, a)] = N · (a − a∗ ) for any x ∈ Δ and all a ∈ A(x). Therefore, r(x, a) − r(x∗ , a∗ ) ≤ kγN · [x − F (x, a)] for all x ∈ Δ, a ∈ A(x).
2
72
DISCRETE–TIME DETERMINISTIC SYSTEMS
Introducing the function λ : Δ → R, defined by λ(x) = kγN · x, and the value j ∗ := r(x∗ , a∗ ), we have the corresponding dissipative inequality (2.5.19), j ∗ + λ(x) ≥ r(x, a) + λ(F (x, a)) for all x ∈ Δ, a ∈ A(x). Thus in particular we conclude that (x∗ , a∗ ) is an optimal stationary state. Moreover, that optimal stationary state can be reached by a finite sequence of harvest plans from any initial stationary state. Hence, this example satisfies the Assumption 2.52, and the conditions of Theorem 2.53, so the optimal AC value function is ξi ∗ ∗ ∗ J (x) ≡ r(x , a ) = max p : i = 1, 2, ..., n . i 3
2.5.3 The Vanishing Discount Approach The so-called vanishing discount approach to AC-control problems is based on several connections between discounted cost problems and the average cost. The most straightforward is the following. Given an arbitrary control policy π = {at } and initial state x0 = x, consider the discounted cost V (π, x) in (2.3.3), which we now write as Vα (π, x) to make explicit the dependence on the discount factor α ∈ (0, 1), i.e., Vα (π, x) =
∞
αt c(xt , at ).
(2.5.29)
t=0
Now, let M be an arbitrary constant and inside the summation replace c(·, ·) with c(·, ·) ± M . Hence (2.5.29) becomes Vα (π, x) =
∞ t=0
αt [c(xt , at ) − M ] +
M , 1−α
which, multiplying both sides by 1 − α, we may express as
2.5
LONG–RUN AVERAGE COST PROBLEMS
(1 − α)Vα (π, x) = M + (1 − α)
73 ∞
αt [c(xt , at ) − M ].
t=0
In particular, taking M as the average cost J(π, x) it follows that (1 − α)Vα (π, x) = J(π, x) + (1 − α)
∞
αt [c(xt , at ) − J(π, x)].
t=0
(2.5.30) This equation obviously suggests that we can approximate J(π, x) by (1 − α)Vα (π, x) as α ↑ 1. A second connection between discounted cost problems and the average cost is provided by the Abelian theorem in Part (a) of the following lemma. Lemma 2.56. Let {ct } be a sequence bounded below, and consider the lower and upper limit averages (also known as Ces`aro limits) 1 ct , C := lim inf n→∞ n t=0 n−1
L
1 C := lim sup ct , n→∞ n t=0 n−1
U
and the lower and upper Abelian limits A := lim inf (1 − α) L
α↑1
∞ t=0
α ct , t
A := lim sup(1 − α) U
α↑1
∞
α t ct .
t=0
Then (a) C L ≤ AL ≤ AU ≤ C U . (b) If AL = AU , the equality holds in (a), i.e., C L = AL = AU = CU . For a proof of Lemma 2.56 see the references in Bishop et al. (2014) or Sznajder and Filar (1992). Part (b) in Lemma 2.56 is known as the Hardy-Littlewood Theorem. Consider now a control policy π = {at }, the corresponding state trajectory {xt }, and in Lemma 2.56 take ct := c(xt , at ). Then the third inequality in Lemma 2.56(a) gives lim sup(1 − α)Vα (π, x) ≤ J(π, x) α↑1
(2.5.31)
2
74
DISCRETE–TIME DETERMINISTIC SYSTEMS
for every initial state x0 = x, with Vα (π, x) and J(π, ·) as in (2.5.30). Moreover, if Vα∗ (·) ≡ V ∗ (·) denotes the α-discount value function in (2.3.4), then (2.5.31) yields lim sup(1 − α)Vα∗ (x) ≤ J(π, x). α↑1
In fact, since π in the latter inequality is arbitrary, we obtain from (2.5.3) that, for every x ∈ X, lim sup(1 − α)Vα∗ (x) ≤ J ∗ (x). α↑1
(2.5.32)
In words, (2.5.32) states that, for values of α close to 1, (1 − α)Vα∗ (·) is a lower bound for the average cost J ∗ (·). One more connection between discounted cost problems and the AC criterion is provided by the α-discount dynamic programming equation (2.3.7) that we will rewrite as Vα∗ (x) = inf [c(x, a) + αVα∗ (F (x, a))] a∈A(x)
(2.5.33)
Consider an arbitrary constant mα , which may depend on α ∈ (0, 1), and define hα (x) := Vα∗ (x) − mα ,
and ρ(α) := (1 − α)mα .
(2.5.34)
Some typical choices of the constant mα are mα := Vα∗ (¯ x), where x¯ ∈ X is an arbitrary (but fixed) state, and mα := inf x∈X Vα∗ (x), assuming of course that Vα∗ is bounded below. The first choice of x) = 0, so that we “fix” hα at x¯. The secmα is useful because hα (¯ ond choice is also useful because then hα is a nonnegative function. Either way, using (2.5.34) the DPE (2.5.33) becomes ρ(α) + hα (x) = inf [c(x, a) + αhα (F (x, a))]. a∈A(x)
(2.5.35)
Comparing this equation with (2.5.5), we might try to get conditions for the pair (ρ(α), hα ) in (2.5.35) to converge, as α ↑ 1, to a solution (j ∗ , l) of (2.5.5).
2.5
LONG–RUN AVERAGE COST PROBLEMS
75
We will next show by means of examples the feasibility of this approach. It should be noted however that, to the best of our knowledge, there are no general results on the “vanishing discount approach” for deterministic discrete-time systems such as (2.5.1)–(2.5.2). All the known results on discrete-time AC control problems refer to stochastic systems. See Remark 2.60. Example 2.57 (An LQ system, cont’d.). We consider again the α-discounted LQ control problem (2.4.13)–(2.4.15), with the αoptimal discounted cost Vα∗ (x) = C(α)x2 in (2.4.22). Here C(α) ≡ C is the unique positive solution of (2.4.21) with coefficients P, Q = Q(α), S = S(α) as in (2.4.20), i.e., P = qr,
Q(α) = (rδ 2 + qη 2 )α,
S(α) = αη 2 .
(2.5.36)
We suppose again the conditions in Example 2.46, but now we assume in addition that the coefficients in (2.4.13) are such that |δ| < 1 and δη > 0. Now, in (2.5.34) take mα := inf Vα∗ (x) = Vα∗ (¯ x) = 0, x∈X
with x¯ = 0. Then ρ(α) = 0 for every α ∈ (0, 1), and, as α ↑ 1, we obviously have that ρ(α) → j ∗ = 0 and hα (x) = Vα∗ (x) → l(x) := C(1)x2
∀x
where C(1) is the unique positive solution of the quadratic equation (2.4.21) when α = 1. We can also see that, as α ↑ 1, the α-optimal control fα (·) ≡ f ∗ (·) in (2.4.23) converges to g ∗ (x) := −θx for all x, where θ := (r + S(1)C(1))−1 δηC(1) and S(1) is given in (2.5.36) when α = 1. Summarizing, (j ∗ , l, g ∗ ) is a canonical triplet, as in (2.5.6), with minimum average cost j ∗ = 0. (Here, we are tacitly using the following fact: If we write at = −θxt in (2.4.13), then xt+1 = (δ − ηθ)xt is a stable system since the coefficient |δ − ηθ| < 1.) 3 Example 2.58 (The Brock-Mirman model, cont’d.). Consider the Brock-Mirman model in Example 2.31. In this α-discount
2
76
DISCRETE–TIME DETERMINISTIC SYSTEMS
problem, the performance index to be maximized for a given initial state x0 = x is Vα (π, x) =
∞
αt log(at ).
t=0
The optimal control policy of this problem is fα∗ (x) := c(1 − θα)xθ (see (2.3.33)) and the optimal state-control pair path (x∗t , a∗t ) for t = 1, 2, ... is given by 1−θt
x∗t = (cθα) 1−θ xθ
t
and
a∗t = c(1 − θα)(cθα)
θ−θt+1 1−θ
xθ
t+1
.
Moreover, from Remark 2.32 the corresponding α-discount value function is Vα∗ (x) =
θα θ 1 log(cθα) + log[c(1 − θα)] + log(x). 1−α (1 − α)(1 − θα) 1 − θα
Rearranging terms, we have (1 − α)Vα∗ (x) = log[c(1 − θα)] +
θ(1 − α) θα log(cθα) + log(x), 1 − θα 1 − θα
and therefore lim(1 − α)Vα∗ (x) = log(c(1 − θ)) + α↑1
θ log(cθ) = j ∗ 1−θ
∀x.
In other words, the constant function J ∗ (·) ≡ j ∗ is the average optimal value function which, of course, coincides with (2.5.13) and (2.5.23). Finally, notice that, as α ↑ 1, the optimal path (x∗t , a∗t ) con3 verges to the steady-state pair (x∗ , a∗ ) in (2.5.22). Remark 2.59 (The Arzela-Ascoli Theorem). Let Cb (X) be the space of real-valued continuous bounded functions on a metric space X, with the supremum norm f := supx∈X |f (x)|. The Arzela-Ascoli theorem characterizes compact subspaces of Cb (X). This result is used, for instance, in the vanishing discount approach, to decide whether a sequence (say, hαn (·) in (2.5.34)) in Cb (X) has a convergent subsequence. A precise statement is as follows.
2.5
LONG–RUN AVERAGE COST PROBLEMS
77
The Arzela-Ascoli Theorem. Suppose that X is a compact metric space and let fn be a sequence in Cb (X) such that (a) fn is bounded, that is, supn fn < ∞, and (b) fn is equicontinuous, that is, for each > 0 there exists δ > 0 such that sup |fn (x) − fn (y)| <
if |x − y| < δ.
n
Then fn has a subsequence converging to some function in Cb (X). For a proof of this theorem see, for example, Appendix H in Morimoto (2010). 3 Remark 2.60. As already noted in the paragraph before Example 2.57, as far as we can tell there are no general results on the “vanishing discount approach” to discrete-time deterministic AC problems. (For differential systems, see part (b) below.) In fact, the closest result we are aware of is the following theorem by Feinberg et al. (2012) for Markov decision processes (MDPs), which we introduce in Chap. 3. Note, however, that this theorem does not give the ACOE (2.5.5); it gives the optimality inequality (2.5.37) below, which is the same as the inequality in Corollary 2.50(a). (Vega-Amaya (2015) presents another proof of this theorem.) (a) For our deterministic AC control problem (2.5.1)–(2.5.3) the Feinberg et al. (2012) theorem can be stated as follows. Theorem (Feinberg et al. (2012)). Suppose that: (a) Assumption 2.17 holds, that is, the system function F (·, ·) is continuous, and the stage cost c(·, ·) is nonnegative and K-inf-compact; (b) there exists a policy π and an initial state x such that J(π, x) is finite, and, moreover, the function h(·) := lim inf hα (·) α↑1
is finite-valued, where hα (·) = Vα∗ (·) − mα is as in (2.5.34), with mα = inf x∈X Vα∗ (x).
2
78
DISCRETE–TIME DETERMINISTIC SYSTEMS
Then there exists a l.s.c. function l(·) on X and a stationary policy f ∈ F such that j ∗ + l(x) ≥ inf [c(x, a) + l(F (x, a))] a∈A(x)
= c(x, f ) + l(F (x, f )) ∀x ∈ X,
(2.5.37)
where j ∗ := lim supα↑1 mα is the optimal AC, and f is ACoptimal, that is, J(f, x) = J ∗ (x) = j ∗ for all x ∈ X. (b) AC control problems are mainly studied for stochastic systems, as in Chaps. 3, 5, and 6 below. For discrete-time deterministic systems the AC problems are practically unexplored, except perhaps for implicit results such as the theorem in part (a). Similarly, for differential systems (as in Chap. 4, below) AC problems have been studied in just a handful of papers such as Arisawa (1997) and Kawaguchi (2003). 3 For additional comments on deterministic AC control problems see Hern´andez-Lerma et al. (2023).
Exercises 2.1. Let X and Y be (nonempty) sets, and D ⊂ X × Y . Assume that the x–section D(x) := {y ∈ Y : (x, y) ∈ D} is nonempty for all x ∈ X, and similarly for the y–section D(y):={x ∈ X : (x, y) ∈ D} for all y ∈ Y . Let v be a real–valued function on D. Prove that: (a) sup v(x, y) = sup sup v(x, y) (x,y)∈D
x∈X y∈D(x)
= sup sup v(x, y). y∈Y x∈D(y)
(b) Show that (a) holds if “sup” is replaced by “inf”. Remark. Parts (a) and (b) in Exercise 2.1 are called “property of the repeated supremum” and “property of the repeated infimum”, respectively.
2.5
LONG–RUN AVERAGE COST PROBLEMS
79
2.2. Let g and h be real–valued functions on a set X. 1. If g and h are bounded from above, then (a1) supx g(x) − supx h(x) ≤ supx [g(x) − h(x)]; (a2) | supx g(x) − supx h(x)| ≤ supx |g(x) − h(x)|. 2. If g and h are bounded from below, then |inf x g(x) − inf x h(x)| ≤ supx |g(x) − h(x)|. 2.3. Let X and Y be convex subsets of R, and v a convex function on a convex set D ⊂ X × Y . Then v ∗ (x) := inf v(x, y) y∈D(x)
is convex, provided that v ∗ is finite–valued, where D(x) is the x– section defined in Exercise 2.1 above. If v is strictly convex, then so is v ∗ . 2.4. Prove Lemma 2.2—Bellman’s principle of optimality. 2.5. Prove Lemma 2.15. Hint. To prove part (a) note that, if gk ↓ g, then limk gk (y) = inf k gk (y) = g(y) for all y ∈ Y . In other words, limk gk = inf k gk . Now use Exercise 2.1(b) above. Part (b) in Lemma 2.15 follows from the definition of uniform convergence. Proving part (c) is a little complicated. The reader might wish to see the proof of Lemma 4.2.4 in Hern´andez-Lerma and Lasserre (1996). 2.6. Give an example of a function on a metric space that is l.s.c. but not inf–compact (as defined in Lemma 2.15(c)). 2.7. Let K be the operator in (2.3.8) defined on the complete metric space B(X) of measurable bounded functions v with the supremum norm v := supx∈X |v(x)|. Suppose that the cost function c(x, a) is bounded. Prove that: (a) K is a contraction on B(X); in fact, Kv − Kv ≤ αv − v for all v, v ∈ B(X), and (b) the α–discount value function V ∗ is in the space B(X) and it is the unique fixed–point of K (see (2.3.9)) in B(X).
80
2
DISCRETE–TIME DETERMINISTIC SYSTEMS
Hint. To prove (a) use Exercise 2.2, part 2. For (b), recall Remark 2.25. 2.8. Let v : K → R be K–inf–compact (see Definition B.4(a2)). Show that v is l.s.c. 2.9. Let L+ (X) be as in Lemma 2.18, that is, the family of nonnegative l.s.c. functions on X. Show that L+ (X) is a convex cone, so if u and v are in L+ (X) and k ≥ 0, then u + v and ku are also in L(X). 2.10. Let v ∈ L+ (X) and u be as in (2.3.14). Show that, under the Assumption 2.17, the functions (x, a) → c(x, a), v(F (x, a)), u(x, a) are all l.s.c. Hint. Use Exercise 2.8. 2.11. Prove the Proposition 2.49. 2.12. Prove Lemma 2.34. 2.13. Prove Lemma 2.35(a). 2.14. Prove (2.4.6) for any policy π = {at }. Hint. In (2.4.2), replace (x, a) ∈ K by (xt , at ) with t = 0, 1, . . . . Next multiply by αt both sides of (2.4.2), and then sum over all t = 0, 1, . . . . Finally, rearrange terms to obtain (2.4.6). 2.15. Let vn (n = 1, 2, ...) and v be functions on X such that vn ↑ v. Show that if the vn are convex or l.s.c. or monotone, then so is v, respectively. 2.16. Consider the time-varying control system (2.0.1)–(2.0.2) with state and action spaces X ⊂ Rn and A ⊂ Rm , respectively, with A compact. Assume, moreover, that the system function and the stage costs are linear in the state variable, that is, Ft (x, a) = F1 (t)x + F2 (a) and ct (x, a) = c1 (t) · x + c2 (a), and terminal cost CT (·) ≡ 0, where F2 (·) and c2 (·) are continuous functions.
2.5
LONG–RUN AVERAGE COST PROBLEMS
81
(a) Prove that the OCP has an optimal control which is independent of the state variable. Hint. Use the DP algorithm (2.1.8)–(2.1.9). (The result in (a) is due to Midler (1969).) (b) Show that, under the appropriate conditions (for instance, as in Theorem 2.21), the result in (a) also holds in the infinitehorizon stationary case (2.3.1)–(2.3.2). 2.17. Maximize V (π, x0 ) =
T −1 √
at x t
t=0
over all π = {at }, with at ∈ [0, 1], subject to xt+1 = ρ(1 − at )xt , t = 0, 1, ..., T − 1, where x0 > 0 and ρ > 0. Answer. a∗t =
1−ρ , 1 − ρT −t
Vt (x) =
x
1 − ρT −t , 1−ρ
t = 0, 1, ..., T − 1.
2.18. (A cake eating or nonrenewable-resource extraction problem.) Maximize T −1 a1−γ V (π, x0 ) = βt t 1−γ t=0 over all π = {at }, with at ∈ (0, xt ), subject to xt+1 = xt − at , t = 0, 1, ..., T − 1, where 0 < γ < 1 and x0 > 0. Answer.
a∗t = x
β t 1 − β (T −t)/γ 1 − β 1/γ
, t = 0, 1, ..., T − 1. , Vt (x) = x1−γ (T −t)/ γ 1−β (1 − γ ) 1 − β 1/γ
Chapter 3 Discrete–Time Stochastic Control Systems For the discrete–time deterministic systems studied in Chap. 2 there is essentially a unique dynamic model, namely, xt+1 = F (xt , at ) ∀ t = 0, 1, . . . ,
(3.0.1)
with a given initial condition x0 . In contrast, in the stochastic case there are two common dynamic models: the so–called system model, and the Markov control model.
3.1 Stochastic Control Models In the system model (SM), also known as the control model, the controlled system evolves according to a difference equation (similar to (3.0.1)) of the form xt+1 = F (xt , at , ξt ) ∀ t = 0, 1, . . . , T − 1,
(3.1.1)
for T ≤ ∞, with a given—possibly random—initial condition x0 . Here, the state and control variables xt , at have the same meaning as in (3.0.1); in particular, they take values in a state space X and an action set A, respectively, both assumed to be Borel spaces. Moreover, the ξt are independent random variables with values in a Borel space S, and they denote random perturbations. These perturbations (as in Remark 1.2(b)) can form a driving process © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hern´andez-Lerma et al., An Introduction to Optimal Control Theory, Texts in Applied Mathematics 76, https://doi.org/10.1007/978-3-031-21139-3 3
83
84
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
or a random noise. In the former case, the ξt have a physical or economic interpretation, whereas in the latter case they are arbitrary random variables. See Example C.4 and Remark C.5 in Appendix C. The Markov control model (MCM) approach can be traced back to the paper by Bellman (1957b) who, in addition, coined the term Markov decision process, which today is also known as a Markov control process. The difference between the MCM and (3.1.1) is that in a MCM the evolution of the system is not specified by a “system function” F (x, a, s) as in (3.1.1); rather it is specified by a stochastic kernel or transition probability—as in Definition C.2 in Appendix C. (An approach similar to Bellman’s was introduced by Shapley (1953) for stochastic games.) More precisely, a MCM is expressed in the form (X, A, {A(x) : x ∈ X}, Q, c),
(3.1.2)
where, as usual, X and A are Borel spaces denoting the state space and the control or action set, respectively. Moreover, for each x ∈ X, A(x) ∈ B(A) represents the set of feasible (or admissible) actions in the state x. Let K := {(x, a) ∈ X × A : a ∈ A(x)}
(3.1.3)
be the set of feasible state–action pairs, which is assumed to be a Borel subset of X × A. (In the terminology of Definition B.1, K is the graph of the multifunction x → A(x).) Then Q is a stochastic kernel on X given K that represents the transition probability from a state xt to xt+1 under a control action at ∈ A(xt ); that is, for each t = 0, 1, . . . , B ∈ B(X), and (x, a) ∈ K, Q(B|x, a) := Prob[xt+1 ∈ B|xt = x, at = a].
(3.1.4)
Since this holds for all t, we say that the kernel Q is stationary or time–homogeneous or time–invariant. Finally, c in (3.1.2) is a real– valued function on K that is used to define the OCP’s objective function, below. Remark 3.1. (a) Consider the SM (3.1.1), and suppose that the ξt are independent and identically distributed (i.i.d.) random
3.1
STOCHASTIC CONTROL MODELS
85
variables with a common distribution μ on S. Then, from (3.1.4), we can see that the stochastic kernel Q is given by Q(B|x, a) = Prob[F (x, a, ξ) ∈ B] = μ({s ∈ S : F (x, a, s) ∈ B}) = E[IB (F (x, a, ξ))]
(3.1.5)
where ξ represents a generic random variable with distribution μ, and IB denotes the indicator function of the set B, that is, 1 if x ∈ B, IB (x) := 0 otherwise . Hence, one can easily go from the SM (3.1.1) to the MCM (3.1.2). One can also go the other way around, from (3.1.2) to (3.1.1), but this is not very helpful because it is obtained by means of an “existence” (nonconstructive) proof. (See Gihman and Skorohod (1979), Sect. 1.1.) (b) Similarly, if we are given the deterministic system (3.0.1), then (3.1.4) gives Q(B|x, a) = IB [F (x, a)] or, equivalently, Q(B|x, a) = δF (x,a) (B),
(3.1.6)
where δF (x,a) denotes the Dirac (or point) measure concentrated at F (x, a), i.e., 1 if F (x, a) ∈ B, δF (x,a) (B) := 0 otherwise . 3 In view of Remark 3.1, in the following we will mainly (but not exclusively) work with the MCM (3.1.2). Another key difference between a SM and a MCM is that (3.1.1) gives explicitly the state and action processes {xt } and {at }, whereas in a MCM it is unclear, at the outset, how to obtain these processes from (3.1.2). We show next how this is done, but first we need to formalize the notion of “randomized policy” introduced in the Definition 3.2.
86
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
The remainder of this section is too technical. The reader might wish to skip it and go directly to Sect. 3.2. Consider the Markov control model (3.1.2) and, for each t = 0, 1, . . . , define the space Ht of admissible histories up to time t as H0 := X, and Ht := Kt × X = K × Ht−1
for t = 1, 2, . . . ,
(3.1.7)
where K is the set in (3.1.3). A generic element ht of Ht , which is called an admissible t–history, or simply a t–history, is a vector of the form (3.1.8) ht = (x0 , a0 , . . . , xt−1 , at−1 , xt ), with (xi , ai ) ∈ K for i = 0, . . . , t − 1, and xt ∈ X. Definition 3.2. A randomized control policy—more briefly, a control policy or simply a policy—is a sequence π = {πt , t = 0, 1, . . .} of stochastic kernels πt on the control set A given Ht satisfying the constraint πt (A(xt )|ht ) = 1 ∀ ht ∈ Ht , t = 0, 1, . . . .
(3.1.9)
The set of all policies is denoted by Π. Roughly, a policy π = {πt } may be interpreted as defining a sequence {at } of A–valued random variables, called actions (or controls), such that for every t–history ht as in (3.1.8) and t = 0, 1, . . . , the distribution of at is πt (·|ht ), which, by (3.1.9), is concentrated on A(xt ), the set of feasible actions in state xt . This interpretation of π is made rigorous in equation (3.1.10b), below. Remark 3.3. The canonical construction. Consider the MCM (3.1.2) and let (Ω, F) be the measurable space consisting of the (canonical) sample space Ω := (X × A)∞ , that is, the space of sequences ω = (x0 , a0 , x1 , a1 , . . . ) with xt in X and at in A for all t = 0, 1, . . . , and F is the corresponding product σ–algebra. The proyections (or coordinate variables) ω → xt and ω → at , from Ω to X and A, respectively, are called state and action variables. Observe that Ω contains the space H∞ , in (3.1.7), of admissible histories ω = (x0 , a0 , x1 , a1 , . . . ) with (xt , at ) ∈ K (that is, by (3.1.3), at ∈ A(xt )) for all t = 0, 1, . . . .
3.2
MARKOV CONTROL PROCESSES: FINITE HORIZON
87
Let π = {πt } be an arbitrary control policy and ν an arbitrary probability measure on X, referred to as the “initial distribution”. Then, by a theorem of C. Ionescu–Tulcea (Proposition C.8 and Remark C.9 in Appendix C), there exists a unique probability measure Pνπ on (Ω, F) which, by (3.1.9), is supported on H∞ , namely, Pνπ (H∞ ) = 1. Moreover, for all B ∈ B(X), C ∈ B(A), and ht ∈ Ht as in (3.1.8), t = 0, 1, . . ., we have: Pνπ (x0 ∈ B) = ν(B), Pνπ (at ∈ C|ht ) = πt (C|ht ), Pνπ (xt+1 ∈ B|ht , at ) = Q(B|xt , at ).
(3.1.10a) (3.1.10b) (3.1.10c) 3
Definition 3.4. The stochastic process (Ω, F, Pνπ , {xt }) is called a discrete–time Markov control process (or Markov decision process). The process {xt } in Definition 3.4 depends, of course, on the particular policy π being used and on the given initial distribution ν. Hence, strictly speaking, we should write, for instance, xπ,ν t instead of just xt . However, we shall keep the simpler notation xt for it will always be clear from the context what particular π and ν are being used. The expectation operator with respect to Pνπ is denoted by Eνπ . If ν is concentrated at the “initial state” x ∈ X, then we write Pνπ and Eνπ as Pxπ and Exπ , respectively.
3.2 Markov Control Processes: Finite Horizon In this section we consider the Markov control model (MCM) (X, A, {A(x) : x ∈ X}, Q, c)
(3.2.1)
introduced in (3.1.2)–(3.1.4), and the Markov control processes (MCP) in Definition 3.4. The optimal control problem (OCP)
3
88
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
we are concerned with is to minimize the finite–horizon objective function (or performance criterion) N −1 π J(π, x) := Ex c(xt , at ) + cN (xN ) , (3.2.2) t=0
which is a stochastic analogue of (2.0.1). Thus, letting J ∗ (x) := inf J(π, x), x ∈ X, π
(3.2.3)
be the corresponding value function or minimum cost function, the OCP we are dealing with is to find an optimal policy, that is, a policy π ∗ ∈ Π such that J(π ∗ , x) = J ∗ (x) ∀ x ∈ X.
(3.2.4)
To this end, we will prove below the stochastic version of the Dynamic Programming (DP) Theorem 2.3. Let π = {πt } be an arbitrary policy and, for every t = 0, 1, . . . , N, let Ct (π, x) be the “cost–to–go” or cost from time t onwards when using the policy π and given xt = x; that is, for t = 0, 1, . . . , N − 1, N −1 Ct (π, x) := E π c(xn , an ) + cN (xN )|xt = x (3.2.5) n=t
and CN (π, x) := E π [cN (xN )|xN = x] = cN (x).
(3.2.6)
In particular, from (3.2.2), J(π, x) = C0 (π, x).
(3.2.7)
Moreover, for t = 0, . . . , N , let Jt be the optimal cost from time t to N , that is, Jt (x) := inf Ct (π, x) ∀ x ∈ X, t = 0, . . . , N. π
(3.2.8)
The following lemma states that the functions Jt satisfy the DP equation (3.2.9) with the condition (3.2.10).
3.2
MARKOV CONTROL PROCESSES: FINITE HORIZON
89
Lemma 3.5. Suppose that for each t ∈ {0, . . . , N − 1} there is a policy π t for which the minimum is attained in (3.2.8) for all x ∈ X, that is Jt (x) = Ct (π t , x). Then, for each t = N − 1, N − 2, . . . , 0 and x ∈ X, Jt (x) = min [c(x, a) + Jt+1 (y)Q(dy|x, a)] (3.2.9) a∈A(x)
X
and JN (x) = cN (x).
(3.2.10)
Remark 3.6. Consider a system model (SM) as in (3.1.1), that is, xt+1 = F (xt , at , ξt ) ∀ t = 0, 1, . . . with S–valued i.i.d. random disturbances ξt with a common distribution G. Then, in analogy with the right–hand side of (2.1.9), the integral in (3.2.9) becomes E[Jt+1 (F (xt , at , ξt ))|xt = x, at = a] = E[Jt+1 (F (x, a, ξt ))] = Jt+1 (F (x, a, s))G(ds). In a general MCP, the integral in (3.2.9) can be expressed (by (3.1.10c)) as 3 E[Jt+1 (xt+1 )|xt = x, at = a] = Jt+1 (y)Q(dy|x, a). Proof of Lemma 3.5. From (3.2.6), the terminal condition (3.2.10) is obvious. Now let t = 0, 1, . . . , N − 1, and consider a policy π such that πt = f ∈ F and {πt+1 , . . . , πN −1 } is an optimal policy from time t + 1 onwards. Hence, by (3.2.5),
Ct (π , x) = c(x, f (x)) + E
π
N −1 n=t+1
c(xn , an ) + cN (xN )|xt = x, at = f (x)
= c(x, f (x)) + Jt+1 (y)Q(dy|x, f (x)) X ≥ min c(x, a) + Jt+1 (y)Q(dy|x, a) ; a∈A(x)
X
3
90
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
therefore,
Jt (x) ≥ min [c(x, a) + a∈A(x)
Jt+1 (y)Q(dy|x, a)]. X
To obtain the reverse inequality observe that, for any f ∈ F, (3.2.8) yields Jt+1 (y)Q(dy|x, f (x)). Jt (x) ≤ c(x, f (x)) + X
Since f ∈ F was arbitrary, the latter inequality gives that Jt (x) ≤ min [c(x, a) + Jt+1 (y)Q(dy|x, a)] ∀ x ∈ X. a∈A(x)
X
This completes the proof of (3.2.9).
2
Remark 3.7. To simplify the notation, if we have a function g(x, a) on K and use a selector f ∈ F, we will simply write g(x, f ) or gf (x) in lieu of g(x, f (x)). 3 From Lemma 3.5 we easily obtain the DP algorithm in the following theorem. Theorem 3.8. Let {J0 , . . . , JT } be the functions in (3.2.9)– (3.2.10). Suppose that, for each t = 0, 1, ..., N − 1, there is a selector ft ∈ F such that ft (x) ∈ A(x) attains the minimum in (3.2.9) for every x ∈ X, that is (using the notation in Remark 3.7), Jt+1 (y)Q(dy|x, ft ). (3.2.11) Jt (x) = c(x, ft ) + X
Then the deterministic Markov policy π ∗ = {f0 , . . . , fN −1 } is optimal, that is, it satisfies (3.2.4). Proof. This theorem follows directly from (3.2.11) and (3.2.8), which yield inf Ct (π, x) = Ct (π ∗ , x) π
for every t = 0, . . . , T and x ∈ X.
2
3.2
MARKOV CONTROL PROCESSES: FINITE HORIZON
91
Remark 3.9. (a) Lemma 3.5 and Theorem 3.8 hold, of course, if the cost function c and the stochastic kernel Q are time– varying, that is, ct and Qt for t = 0, 1, ... as in Theorem 2.3. (b) Our approach in this section to obtain the DP Theorem 3.8 is a little different from the approach followed in Sect. 2.1 to obtain Theorem 2.3. Indeed, here we introduced the functions Jt in (3.2.8) and then we showed that they satisfy the DP algorithm (3.2.9)–(3.2.10). In contrast, in Theorem 2.3 we started the other way around: first, we introduced the DP algorithm (2.1.8)–(2.1.9), and then we obtained (2.1.10)–(2.1.11). 3 Remark 3.10. Variants of the D.P. equation. (a) Nonstationary MCPs. Lemma 3.5 and Theorem 3.8 hold if the MCM (3.2.1) is nonstationary, that is, all of the components in (3.2.1) are time–varying, i.e, Xt , At , At (x), Qt , ct for t = 0, 1, . . . . For instance, (3.2.9) would be expressed as Jt+1 (y)Qt (dy|x, a) . Jt (x) = min ct (x, a) + a∈At (x)
Xt+1
Moreover, there is a well–known state–augmentation procedure to transform a nonstationary MCM into a stationary one. [See, for instance, Bertsekas and Shreve (1978), Guo et al. (2010), Hern´andez-Lerma (1989), Hinderer et al. (2016).] (b) Discounted costs. Given a discount factor α ∈ (0, 1), the performance criterion (3.2.2) becomes N −1 αt c(xt , at ) + αN cN (xN ) Exπ t=0
Then, using the same change of variable to obtain (2.2.1)– (2.2.2), the DP equation in Lemma 3.5 becomes Vt+1 (y)Q(dy|x, a) (3.2.12) Vt (x) = min c(x, a) + α a∈A(x)
X
3
92
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
for t = 0, 1, . . . , N − 1, and VN (x) = cN (x).
(3.2.13)
(c) System models. Consider the SM in Remark 3.6. Then (3.2.10) remains the same, but (3.2.9) becomes Jt (x) = min {c(x, a) + E[Jt+1 (F (x, a, ξ))]} a∈A(x) = min {c(x, a) + Jt+1 (F (x, a, s))G(ds)}. (3.2.14) a∈A(x)
S
In particular, in the discounted case (3.2.12), the expected value in the right–hand side of (3.2.14) is replaced with αE[Jt+1 (F (x, a, ξ))]. Finally, we can also rewrite (3.2.9)–(3.2.10) in a forward form, similar to the deterministic case in (2.2.4)–(2.2.5).
3.3 Conditions for the Existence of Measurable Minimizers For Theorem 3.8 to be useful we need to ensure the existence of measurable selectors (or “minimizers”) ft ∈ F as in (3.2.11). There are many ways of doing this—see, for instance, Bertsekas and Shreve (1978), B¨auerle and Rieder (2011), Hern´andez-Lerma and Lasserre (1996), Hinderer et al. (2016), to mention just a few references. Here we will use some results in Appendix B, below, to see how to ensure the existence of measurable minimizers. First, we consider how to adapt our MCM to Theorem B.3. Theorem 3.11. Consider the MCM (3.2.1), and a l.s.c. function u : X → R bounded below. Suppose that, for every x ∈ X, (a) The action set A(x) is compact; (b) The function a → c(x, a) is l.s.c. on A(x);
3.3
CONDITIONS FOR THE EXISTENCE . . .
93
(c) The function a → v (x, a) := X v(y)Q(dy|x, a) is l.s.c. on A(x) for each continuous and bounded function v on X. Then the function u∗ defined on X by ∗ u(y)Q(dy|x, a) u (x) := inf c(x, a) + a∈A(x)
(3.3.1)
X
is measurable and there exists a selector f ∈ F such that f (x) ∈ A(x) attains the minimum in (3.3.1) for every x ∈ X, i.e., ∗ u(y)Q(dy|x, f ). (3.3.2) u (x) = c(x, f ) + X
Proof. By the present hypotheses, the result will follow from Theorem B.3 provided that the integral in (3.3.1) is l.s.c. in a ∈ A(x) for every x ∈ X. Thus, we need to show that, for each x ∈ X, if {an } is a sequence in A(x) converging to a ∈ A(x), then u(y)Q(dy|x, an ) ≥ u(y)Q(dy|x, a). (3.3.3) lim inf n→∞
X
X
To this end, we will use that u is l.s.c. and bounded below and, therefore, by Proposition A.2, there is a sequence of continuous and bounded functions vk on X such that vk ↑ u. Hence, for every n and k, u(y)Q(dy|x, an ) ≥ vk (y)Q(dy|x, an ), X
X
and, letting n → ∞, the assumption (c) yields u(y)Q(dy|x, an ) ≥ vk (y)Q(dy|x, a). lim inf n
X
X
Finally, letting k → ∞, (3.3.3) follows.
2
Remark 3.12. (a) Given a real–valued function v on X, let v be as in Theorem 3.11(c), i.e., v (x, a) := v(y)Q(dy|x, a) ∀ (x, a) ∈ K. (3.3.4) X
3
94
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
The stochastic kernel (or transition probability) Q in (3.1.2) is said to be weakly continuous (or that it has the weak– Feller property) if the mapping (x, a) → v (x, a) is continuous on K for every continuous bounded function v. On the other hand, Q is called strongly continuous (or that it satisfies the strong–Feller property) if v is continuous for every measurable bounded function v. It should be noted that here we will be dealing with the weakly continuous case. Clearly, since every continuous function is measurable, strong continuity implies weak continuity, but not conversely. The following examples emphasize this fact. First, consider the system model (3.1.1), where the ξt are i.i.d. random variables with common distribution G on S. Then we can express (3.3.4) as v (x, a) = E[v(F (x, a, ξ))] = v(F (x, a, s))G(ds), (3.3.5) S
where ξ is a generic random variable with distribution G. Then, for an arbitrary measurable function v, the function v is not necessarily continuous, even if the map (x, a) → F (x, a, s) is continuous for every s ∈ S. (Take, for instance, v = IB the indicator function of a set B.) Now suppose that (3.1.1) is the additive noise system xt+1 = F (xt , at ) + ξt , t = 0, 1, ...
(3.3.6)
Moreover, the ξt are i.i.d. disturbances as in (3.3.5), but now they take values in X = S = Rd and, in addition, the distribution G has a continuous and bounded probability density g. Then, using the change of variable y = F (x, a) + s, (3.3.5) becomes v (x, a) = v(F (x, a) + s)g(s)ds = v(y)g(y − F (x, a))dy. S
S
(3.3.7) Therefore, assuming that the mapping (x, a) → F (x, a) is continuous, it follows that the function v in (3.3.7) is continuous, even if the bounded function v is just measurable! In other words, the additive noise system (3.3.6) is strongly continuous.
3.3
CONDITIONS FOR THE EXISTENCE . . .
95
(b) For the deterministic system (3.0.1), the transition probability Q turns out to be the Dirac measure (3.1.6) and so the function v in (3.3.4) becomes v (x, a) = v(F (x, a)). Therefore, Q is weakly continuous (in the sense of part (a)) if the system function F : K → X is continuous, as in Assumption 2.17. However, requiring that Q is strongly continuous (so that v is continuous for every measurable bounded function v) would be extremely restrictive. (c) As in the proof of (3.3.3) (replacing u with v), it can be seen that: If Q is weakly continuous and v : X → R is l.s.c. and bounded below, then the mapping (x, a) → v (x, a) in (a) is l.s.c. and bounded below on K. 3 Theorem 3.11 requires the action sets A(x) to be compact. This requirement is replaced, in the following theorem, by inf– compactness on K (see Definition B.4(a3)). Theorem 3.13. Consider the MCM (3.2.1) and let u : X → R be as in Theorem 3.11, that is, u is l.s.c. and bounded below. In addition, let us assume that (a) the cost function c is l.s.c., bounded below, and inf–compact on K; (b) the transition probability Q is weakly continuous (see Remark 3.12(a)). Then the conclusions of Theorem 3.11 hold again, that is, the function u∗ in (3.3.1) is measurable, and there exists a selector f ∈ F such that (3.3.2) holds for every x ∈ X. Moreover, for each x ∈ X, let Au (x) be the set of actions a∗ ∈ A(x) where the righthand side of (3.3.1) attains the minimum, i.e., ∗ ∗ u(y)Q(dy|x, a∗ ), u (x) = c(x, a ) + X
and suppose that the multifunction x → Au (x) is l.s.c. (see Definition B.4(b)). Then the function u∗ is l.s.c. on X.
96
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
Proof. See Exercise 3.1(a).
In Theorem 3.13, to conclude that u∗ is l.s.c., we assumed that the multifunction x → Au (x) is l.s.c. In the following Theorem 3.14, to obtain that u∗ is l.s.c., we replace the assumption on x → Au (x) by the condition that the stage cost c is K–inf–compact (see Lemma 2.16 or Definition B.4(a2)). Theorem 3.14. Consider the MCM (3.2.1), and let u : X → R be l.s.c. and bounded below. In addition, (a) the stage cost c : K → R is nonnegative and K–inf–compact; and (b) the transition probability Q is weakly continuous. Then the conclusions in Theorem 3.11 hold, that is, the function u∗ in (3.3.1) is measurable and there exists f ∈ F that satisfies (3.3.2). Moreover, u∗ is l.s.c.
Proof. See Exercise 3.1(b).
3.4 Examples Example 3.15 (Stochastic LQ systems) We now consider a stochastic version of the LQ system in Example 2.4. The state and action spaces are X = A = R and the dynamics is given by xt+1 = γxt + βat + ξt , t = 0, 1, . . . , N − 1,
(3.4.1)
with nonzero coefficients γ and β. The disturbances ξt are i.i.d. random variables, independent of the initial state x0 , with a common distribution G that has zero mean and finite variance, i.e., E(ξ) = 0 and σ 2 := E(ξ 2 ) < ∞,
(3.4.2)
where ξ is a generic real–valued random variable with distribution G. The performance criterion is as in (3.2.2), with one–stage costs c(x, a) = qx2 + ra2
and cN (x) = qN x2 ,
(3.4.3)
3.4
EXAMPLES
97
with nonnegative coefficients q and qN , and r > 0. Hence, from Lemma 3.5 and Remark 3.6, the D.P. equation (3.2.9)–(3.2.10) becomes Jt (x) = min[qx2 + ra2 + EJt+1 (γx + βa + ξ)] a
(3.4.4)
for t = N − 1, N − 2, ..., 0, and JN (x) = qN x2 .
(3.4.5)
Note that, from (3.4.5) and (3.4.2), EJN (γx + βa + ξ) = qN E(γx + βa + ξ)2 = qN (γ 2 x2 + σ 2 + 2γβxa + β 2 a2 ). Inserting this quantity in the right–hand side of (3.4.4) and minimizing over all a ∈ A we obtain the minimizer fN −1 ∈ F given by fN −1 (x) = GN −1 x, with GN −1 := −(r + qN β 2 )−1 qN γβ. Replacing this value of a = fN −1 (x) in (3.4.4) we obtain JN −1 (x) = KN −1 x2 + qN σ 2 with
∀x∈R
KN −1 := [1 − (r + qN β 2 )−1 qN β 2 ]qN γ 2 + q.
Similarly, we replace JN −1 in (3.4.4) to obtain the minimizer fN −2 ∈ F and the function JN −2 . In general, by backward induction, we obtain the optimal policy π ∗ = {f0 , . . . , fN −1 } with ft (x) = Gt x, with
where Gt := −(r + Kt+1 β 2 )−1 Kt+1 γβ, (3.4.6)
Kt = [1 − (r + Kt+1 β 2 )−1 Kt+1 β 2 ]Kt+1 γ 2 + q
for t = N − 1, ..., 1, 0. The optimal cost function from time t to N (see (3.2.8)), with JN in (3.4.5) and t = 0, 1, . . . , N − 1, is 2
Jt (x) = Kt x + σ
2
N n=t+1
Kn .
(3.4.7)
3
98
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
In particular, from (3.2.4) and (3.2.8), ∗
2
J (x) = J0 (x) = K0 x + σ
2
N
Kn
n=1
3
for every initial state x0 = x.
Example 3.16. Stochastic LQ system with discounted cost. Let us consider again the LQ system (3.4.1)–(3.4.3), except that now we use the α-discounted DP equation (3.2.12) with terminal cost VN (·) ≡ 0 in (3.2.13). The optimal cost is again of the form J0 (x) = K0 x2 + σ 2
N −1
Kt
t=1
with Kt given by a backward recursion from t = N − 1, ..., 1, 0 by Kt = [1 − (rαt + Kt=1 β 2 )−1 Kt+1 β 2 ]Kt+1 γ 2 + qαt with KN = 0. To complement these results, note that Sect. 3.5 in Hern´andez-Lerma and Lasserre (1996) as well as Sect. 2.1 in Bertsekas (1987) present the nonstationary vector case of the LQ system (3.4.1)–(3.4.2) in which the state and control variables xt and at are vectors in, say, Rn and Rm , respectively, and the coefficients γt , βt , qt , rt are matrices of suitable dimensions. On the other hand, observe that the optimal control ft in (3.4.6) is the same as in the deterministic case (2.1.16). This property is sometimes called the certainty–equivalence principle of LQ systems. This principle is shared by other stochastic systems. For an extension of the certainty–equivalence principle to a class of stochastic differential games see Josa-Fombellida and Rinc´on-Zapatero (2019). 3 Example 3.17 (A consumption–investment problem). At each time t = 0, 1, . . . , N − 1, an investor wishes to allocate his/her current wealth xt between investment (at ) and consumption (xt − at ). Hence, if xt = x, the corresponding investment or control set is A(x) = [0, x]. The wealth at time t + 1 is xt+1 = at ξt , t = 0, 1, . . . , N − 1,
3.4
EXAMPLES
99
that is, the wealth is proportional to the amount invested at time t, where ξt ≥ 0 denotes a “random interest rate”. These “disturbances” ξt are i.i.d. random variables, assumed to be independent of the initial wealth x0 . Moreover, to ensure that reinvestment is indeed profitable, we assume that the ξt have a (finite) mean m = E(ξt ) > 1. The investor’s problem is to find an investment strategy π = {a0 , a1 , . . . , aN −1 } that maximizes an expected total discounted utility from consumption defined as N −1 αt u(xt − at ) , J(π, x) := Exπ t=0
for every initial wealth x0 = x ≥ 0, where α is a discount factor, and u(x − a) is a given utility of consumption. We will take the state and action spaces as X = A = [0, ∞). Since we are now maximizing, the discounted D.P. equation (3.2.12)–(3.2.13) becomes, for every x ∈ X and t = N − 1, N − 2, . . . , 0, (3.4.8) JN (x) = 0, Jt (x) = max [u(x − a) + αEJt+1 (aξt )]. a∈A(x)
(3.4.9)
The solution to (3.4.8)–(3.4.9) depends, of course, on the utility function u. Here, we will consider two cases. Linear case. Suppose that u(x − a) = b · (x − a) for some b > 0. We will assume that αm > 1. In this case, the optimal investment policy is π ∗ = {f0 , . . . , fN −1 } with fN −1 (x) = 0, and ft (x) = x ∀ t = 0, 1, . . . , N − 2.
(3.4.10)
Moreover, for all x ∈ X, and t = N − 1, . . . , 0, Jt (x) = (mα)N −t−1 bx.
(3.4.11)
In particular, J ∗ (x) = J0 (x) = (mα)N −1 bx, for any initial state x0 = x. Nonlinear case. Assume that u(x − a) = (b/k)(x − a)k . Here b > 0 and 0 < k < 1. The function u(x) = (b/k)xk is called the
3
100
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
isoelastic utility function or power utility function. In this case, the solution of (3.4.8)–(3.4.9) is Jt (x) = (b/k)Dt xk
∀ t = N − 1, N − 2, . . . , 0,
(3.4.12)
whereas the optimal maximizers are fN −1 (x) = 0 and, for t = N − 2, . . . , 0, 1/(k−1) ft (x) = x/[1 + δDt+1 ], (3.4.13) with δ = (αEξ k )1/(k−1) , and Dt is given recursively by DN −1 = 1 and for t = N − 2, . . . , 0, 1/(k−1) k−1
Dt = δ k−1 Dt+1 /[1 + δDt+1
]
.
(3.4.14) 3
(See Exercise 3.4.)
3.5 Infinite–Horizon Discounted Cost Problems In this section we consider the Markov control model (MCM) (3.1.2): (X, A, {A(x) : x ∈ X}, Q, c). The objective function to be minimized is the infinite–horizon expected total discounted cost ∞ αt c(xt , at ) (3.5.1) V (π, x) := Exπ t=0
for every policy π ∈ Π, and every initial state x ∈ X, where α ∈ (0, 1) is a given discount factor. A policy π ∗ such that V (π ∗ , x) = inf V (π, x) =: V ∗ (x) ∀ x ∈ X π
(3.5.2)
is said to be α–discount optimal, and V ∗ is called the α–discount value function, also known as the α–discount optimal cost function. We can analyze the infinite-horizon discounted cost problem in the context of any of the Theorems 3.11, 3.13 or 3.14. Here,
3.5
INFINITE–HORIZON DISCOUNTED COST PROBLEMS
101
however, to fix ideas we will follow Theorem 3.13. (The analysis following Theorem 3.14 is quite similar.) The following assumption, which is supposed to hold throughout this section, includes some of the hypotheses of Theorem 3.13. (See also Theorem 3.13 or Remark 3.12(a) for the definition of weak continuity of Q.) Assumption 3.18. (a) The stage cost c : K → R is l.s.c., nonnegative, and inf-compact on K, that is (by Definition B.4 (a3)), for each r > 0, the set {a ∈ A(x)|c(x, a) ≤ r} ⊂ A is compact. (b) The multifunction x → Ac (x) is l.s.c. (see Definition B.4(b)), where, for every x ∈ X, Ac (x) stands for the set of control actions a∗ ∈ A(x) such that c(x, a∗ ) = mina∈A(x) c(x, a). (c) The transition law Q is weakly continuous. Observe that the LQ system in Examples 3.15 and 3.16 satisfies Assumption 3.18. In particular, since c(x, a) = qx2 + ra2 (see (3.4.3)) and there are no control constraints (i.e, A(x) = A = R for all x ∈ X) the multifunction x → Ac (x) in Assumption 3.18(b) is “constant”, that is, Ac (x) = {0} for all x ∈ X; hence, it is trivially l.s.c. Similarly, Q is weakly continuous because, from (3.4.1) and the Remark 3.12(a), the mapping v(y)Q(dy|x, a) = v(γx + βa + s)G(ds) (x, a) → X
R
is continuous for every continuous bounded function v. Assumption 3.18 can also be verified for the Example 3.17. Nevertheless, since the problem aims to maximize a utility, the reader might wish to make the “change of variable” c(x, a) := −u(x − a) to put Example 3.17 in the format of Assumption 3.18. In our present context, the Basic Assumption in Chap. 1 is that (a) the cost function c is nonnegative, and (b) the value function V ∗ is finite–valued. Part (a) is included in Assumption 3.18(a). Part (b) follows from the following Assumption 3.19.
3
102
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
Assumption 3.19. There exists a policy π such that V (π, x) < ∞ for each x ∈ X. We denote by Π0 the family of policies that satisfy Assumption 3.19. For instance, if the cost c is bounded, say 0 ≤ c ≤ M , then 0 ≤ V (π, x) ≤ M (1 − α) ∀ π ∈ Π, x ∈ X. Hence, in this case, Π0 = Π. In the remainder of this section we roughly follow the dynamic programming (DP) approach in Sect. 2.3.1 for deterministic infinite–horizon problems. In particular, we wish to prove that the α–discount value function V ∗ in (3.5.2) satisfies the DP equation—compare (2.3.7) and (3.5.5). To this end, we again consider the family L+ (X) of nonnegative l.s.c. functions on X, introduced in Lemma 2.18. (See also the Exercise 2.10.) For each u ∈ L+ (X), let T u : X → X be defined as u(y)Q(dy|x, a) . (3.5.3) T u(x) := inf c(x, a) + α a∈A(x)
X
One of the main results in this section is that the value function V ∗ is a solution of the α–discount DP equation (also known as the α–discount optimality equation) u(x) = inf c(x, a) + α u(y)Q(dy|x, a) . (3.5.4) a∈A(x)
X
By (3.5.3), a solution u ∈ L+ (X) to (3.5.4) is a fixed point of T , that is, T u = u. Remark 3.20. The following Theorem 3.21 is quite similar to Theorem 4.2.3 in Hern´andez-Lerma and Lasserre (1996) except for a key difference: In the latter reference, the transition law Q is supposed to be strongly continuous (see Remark 3.12(a)), whereas here Q is weakly continuous (Assumption 3.18(c)). This implies (by the Remark 3.12(b)) that Theorem 3.21 includes as a special case the “deterministic” results in Theorem 2.21 and Corollary 2.22. This fact is not true in the cited reference. 3
3.5
INFINITE–HORIZON DISCOUNTED COST PROBLEMS
103
Theorem 3.21. Suppose that Assumptions 3.18 and 3.19 hold. Then: (a) The α–discount value function V ∗ is the pointwise minimal solution of (3.5.4); that is, for every x ∈ X, ∗ ∗ V (y)Q(dy|x, a) , (3.5.5) V (x) = min c(x, a) + α a∈A(X)
X
equivalently V ∗ = T V ∗ , and, in addition, if u is another solution of (3.5.4), then u(x) ≥ V ∗ (x) for all x ∈ X. (b) There is a selector f∗ ∈ F such that f∗ (x) ∈ A(x) attains the minimum in (3.5.5), that is, ∗ V ∗ (y)Q(dy|x, f∗ ) ∀ x ∈ X, (3.5.6) V (x) = c(x, f∗ ) + α X
and the deterministic stationary policy f∗∞ = {f∗ , f∗ , . . . } is α–discount optimal; conversely, if f∗∞ = {f∗ , f∗ , . . .} is α– discount optimal, then f∗ ∈ F satisfies (3.5.6). (c) If π ∗ is a policy such that V (π ∗ , ·) satisfies (3.5.4) and, moreover, the condition lim αn Exπ V (π ∗ , x) = 0 ∀π ∈ Π0 and x ∈ X
n→∞
(3.5.7)
holds, then π ∗ is α–discount optimal. In other words, if (3.5.7) holds, then π ∗ is α–optimal if and only if V (π ∗ , ·) satisfies the α–discount D.P. equation. (d) If an α–discount optimal policy exists, then there exists one that is deterministic stationary. The conclusion of part (d) in Theorem 3.21 is important because it ensures that, even if we work in the space of all randomized, history–dependent policies (see the Remark 1.2(d) or the Remark 3.3), to solve the infinite–horizon discounted cost problem it suffices to consider deterministic stationary policies f ∞ = {f, f, ...}, with f ∈ F.
104
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
The proof of Theorem 3.21 requires some preliminary results that are important in themselves. The first one is that the operator T in (3.5.3) maps L+ (X) into itself. Lemma 3.22. If u is a function in L+ (X), then so is T u and, moreover, there exists a selector f ∈ F such that f (x) ∈ A(x) attains the minimum in the right-hand side of (3.5.3), that is, T u(x) = c(x, f ) + α u(y)Q(dy|x, f ) ∀ x ∈ X. (3.5.8) X
Proof. First, note that the hypotheses (a)-(b) in Theorem 3.13 are the same as the Assumptions 3.18(a),(c). Moreover, our Assumptions 3.18(a),(b),(c) yield that the multifunction x → Au (x) in Theorem 3.13 is l.s.c. Therefore, the conclusion in Lemma 3.22 follows from Theorem 3.13. 2 The expression (3.5.9) below is the “stochastic version” of (2.3.16). In fact, the arguments in the Remark 2.19 are a simplified version of the arguments in the proof of Lemma 3.23. Lemma 3.23. Given a selector f ∈ F, consider the deterministic stationary policy f ∞ = {f, f, . . . }. Then the α–discounted cost vf (x) := V (f ∞ , x) satisfies that, for every x ∈ X, vf (x) = c(x, f ) + α vf (y)Q(dy|x, f ). (3.5.9) X
Proof. Given f ∈ F, from (3.5.1) we obtain ∞ ∞ αt c(xt , f ) vf (x) = Exf t=0 =
∞ Exf
c(x0 , f ) + ∞
∞
t
α c(xt , f )
t=1
= c(x, f ) + αExf (Θ),
(3.5.10)
t−1 c(xt , f ). Then, by the properties of the conwhere Θ := ∞ t=1 α ditional expectation and the Markov–like property (3.1.10c), ∞ ∞ ∞ Exf (Θ) = Exf Exf (Θ|x0 , a0 , x1 )
3.5
INFINITE–HORIZON DISCOUNTED COST PROBLEMS
105
∞ ∞ = Exf Exf (Θ|x1 ) ∞ ∞ = Ef αt−1 c(xt , f )|x1 = y Q(dy|x, f ) X
t=1
vf (y)Q(dy|x, f ).
= X
The latter fact together with (3.5.10) gives (3.5.9). 2 Compare part (a) of the following lemma with Lemma 2.20. Lemma 3.24. (a) If u ∈ L+ (X) satisfies that u ≥ T u, then there exists a selector f ∈ F such that u(x) ≥ vf (x) for every x ∈ X. Hence u ≥ V ∗ . (b) Let u : X → R be a measurable function such that T u is well defined. In addition, suppose that u ≤ T u and lim αn Exπ [u(xn )] = 0 ∀ π ∈ Π0 and x ∈ X.
n→∞
(3.5.11)
Then u ≤ V ∗ . Proof. (a) Let u ∈ L+ (X) be such that u ≥ T u. Then, by Lemma 3.22, there exists f ∈ F for which, for every x ∈ X, u(x) ≥ c(x, f ) + α u(y)Q(dy|x, f ). Iteration of this inequality yields, for every n = 1, 2, . . . and x ∈ X, n−1 u(x) ≥ Exf αt c(xt , f ) + αn Exf u(xn ), Exf u(xn )
t=0
= u(y)Qn (dy|x, f ) and Qn (·|x, f ) is the n–step where transition probability of the Markov process {xt } when using the policy f ∞ . Therefore, since u is nonnegative, n−1 u(x) ≥ Exf αt c(xt , f ) , t=0
and letting n → ∞ we obtain u(·) ≥ vf (·).
3
106
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
(b) First, note that u ≤ T u and (3.5.4) imply that, for every x ∈ X and a ∈ A(x), u(x) ≤ c(x, a) + α u(y)Q(dy|x, a). (3.5.12) On the other hand, for arbitrary π ∈ Π and x ∈ X, the Markov– like property (3.1.10c) yields t+1 π t+1 Ex α u(xt+1 )|ht , at = α u(y)Q(dy|xt , at ) X
≥ αt [u(xt ) − c(xt , at )]
(by (3.5.12)),
so αt c(xt , at ) ≥ −Exπ [αt+1 u(xt+1 ) − αt u(xt )|ht , at ]. Thus, taking expectations Exπ and summing over t = 0, . . . , n − 1, we have n−1 αn c(xt , at ) ≥ u(x) − αn Exπ u(xn ) ∀ n. Exπ t=0
Finally, letting n → ∞ and using (3.5.11), we obtain that V (π, x) ≥ u(x) for all x ∈ X. Therefore, since π ∈ Π was arbi2 trary, it follows that V ∗ ≥ u. The following Lemma 3.25 extends Theorem 2.21 to the stochastic case. It shows the convergence to V ∗ of the value iteration (VI) functions {vn } defined as v0 ≡ 0, and vn := T vn−1 for all n = 1, 2, . . . , that is, vn (x) := min [c(x, a) + α vn−1 (y)Q(dy|x, a)] (3.5.13) a∈A(x)
X
for all x ∈ X. The lemma is a key result in the theory and applications of dynamic programming. Before stating it, however, note that the convergence of vn to V ∗ is to be expected, because using the forward form of the D.P. equation (3.2.9)–(3.2.10) with cN ≡ 0, we have that vn (x) = inf Vn (π, x), π
(3.5.14)
3.5
INFINITE–HORIZON DISCOUNTED COST PROBLEMS
107
where Vn is the n–stage α–discounted cost with zero terminal cost. Hence, since Vn (π, x) ↑ V (π, x) as n → ∞, one would expect that, interchanging “lim” and “inf”, vn (x) ↑ inf V (π, x) = V ∗ (x) ∀ x ∈ X. π
(3.5.15)
Lemma 3.25 confirms that this is indeed the case. Lemma 3.25 (Convergence of the α—VI functions). Under the Assumptions 3.18 and 3.19: (a) vn is in L+ (X) for every n, (b) the convergence in (3.5.15) holds, and (c) V ∗ satisfies the α–discount DP equation (3.5.5). Proof. Part (a) follows from the Assumption 3.18 (in particular, part (c)) and Lemma 3.22. (b)–(c) By (3.5.14) and the monotonicity of {vn }, there is a function v ≤ V ∗ such that vn (x) ↑ v(x) for all x ∈ X. On the other hand, from Lemma 2.15(c) we obtain that v satisfies the α– discount DP equation v = T v. Hence, by Lemma 3.24(a), v ≥ V ∗ . This yields (b) and (c). 2 We are finally ready for the proof of Theorem 3.21. Proof of Theorem 3.21. (a) The DP equation (3.5.5) follows from Lemma 3.25. The fact that V ∗ is the minimal solution of (3.5.5) is a consequence of Lemma 3.24(a). (b) The existence of a selector f∗ ∈ F satisfying (3.5.6) follows from Lemma 3.22. Further, from (3.5.6) and Lemma 3.22, f∗∞ is optimal. The converse is also obtained from Lemma 3.23. (c) If V (π ∗ , .) satisfies (3.5.5), then part (a) or Lemma 3.24(a) give that V (π ∗ , .) ≥ V ∗ (.). The reverse inequality follows from (3.5.7) and Lemma 3.24(b). (d) This part is a consequence of (a) and (b).
2
108
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
The convergence of the α–VI functions vn = T vn−1 , or vn = T v0 , also known as the method of successive approximations, is inspired in Banach’s fixed point theorem. (See Remark 2.25) Value iteration is one of the two most popular algorithms to solve a dynamic programming equation. The other most popular algorithm is the policy iteration (PI) algorithm, which we introduce in the next section. n
Example 3.26 (Exhaustible resource extraction). The optimal exhaustible resource extraction—also known as a cake-eating problem—is one of the most studied in environmental and resource economics; see for example Hung and Quyen (1994), or Long and Kemp (1984). In this example, we present a discrete version of the continuous model in Dasgupta and Heal (1974); see also Pindyck (1980) for a stochastic version. Consider an agent that exploits a certain nonrenewable resource. Let xt and at be the stock of the nonrenewable resource and the agent’s consumption at time t, respectively. The initial stock of the resource is x0 = x > 0 and the law of motion of xt is xt+1 = ξt (xt − at ) for t = 0, 1, 2, ...,
(3.5.16)
where {ξt }∞ t=0 is a sequence of i.i.d. binomial random variables such that
1 with probability p, (3.5.17) ξt = d with probability 1 − p, at each time step t, where 0 ≤ d < 1, and 0 < p < 1. The value 1 − d can be interpreted as the loss caused by bad natural conditions in the extraction. The state space and control spaces are, respectively, X = (0, x0 ], A = (0, x0 ], and A(x) = (0, x] for all x ∈ X. The agent’s OCP is to find a consumption trajectory {at }∞ t=1 that maximizes the following discounted utility function, with α ∈ (0, 1), ∞ π t (3.5.18) Ex α log at t=0
3.5
INFINITE–HORIZON DISCOUNTED COST PROBLEMS
109
subject to (3.5.16). The D.P. equation (3.5.4) becomes V (x) = max log(a) + αE[V (xt+1 )|xt = x, at = a] . a∈[0,x]
(3.5.19)
We next solve (3.5.19) using the method of undetermined coefficients. To this end, let V (x) := b1 + b2 log x,
(3.5.20)
where b1 and b2 are unknown parameters to be determined. Hence E[V (xt+1 )|xt = x, at = a] = b1 + b2 E[log ξt ] + b2 log(x − a), and (3.5.19) becomes V (x) = max log(a) + α (b1 + b2 E[log ξ0 ] + b2 log(x − a)) , a∈(0,x]
(3.5.21) which has a unique solution given by a=
x . 1 + αb2
Next, substituting this value of a in (3.5.21) and solving the system for b1 and b2 , we obtain α 1 E[log ξ0 ] + log(1 − α), 2 (1 − α) 1−α 1 b2 = . 1−α b1 =
With these values of b1 and b2 in (3.5.21), it follows that the optimal control and the corresponding state trajectory are f ∗ (x) = x(1 − α) and xt+1 = αξt xt ,
t = 0, 1, ...,
and the optimal discounted utility is V (x) =
α 1 1 E[log ξ0 ] + log(1 − α) + log x, 2 (1 − α) 1−α 1−α
where E[log ξ0 ] = (1 − p) log(d).
3
3
110
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
3.6 Policy Iteration Let g0 ∈ F be a selector such that the deterministic stationary policy g0∞ has a finite α–discounted cost w0 (x) := V (g0∞ , x) < ∞ ∀ x ∈ X. By Lemma 3.23,
(3.6.1)
w0 (x) = c(x, g0 ) + α
w0 (y)Q(dy|x, g0 ) ≥ min [c(x, a) + α w0 (y)Q(dy|x, a)]. X
a∈A(x)
X
Hence, by Lemma 3.22, there exists g1 ∈ F such that, for all x ∈ X, w0 (y)Q(dy|x, g1 ). w0 (x) ≥ c(x, g1 ) + α X
Iteration of the latter inequality, as in the proof of Lemma 3.24(a) gives, for all x ∈ X, w0 (x) ≥ w1 (x),
with w1 (x) := V (g1∞ , x).
(3.6.2)
Iteration of these arguments yields the PI algorithm: Step 1. (Initialization step.) Pick an arbitrary selector g0 ∈ F as in (3.6.1). Step 2. (Policy improvement.) Given gn ∈ F, compute the corresponding α–discounted cost wn (·) := V (gn∞ , ·) as in (3.5.9). Then, as in (3.6.2), find a selector gn+1 such that wn (x) ≥ c(x, gn+1 ) + α wn (y)Q(dy|x, gn+1 ) (3.6.3) X ∞ , ·). for all x ∈ X, so that wn (·) ≥ wn+1 (·) := V (gn+1
Step 3. (Checking optimality.) Is wn (x) = wn+1 (x) for all x ∈ X? If “yes”, then stop. If “no”, then go back to step 2 replacing n with n + 1.
3.7
LONG-RUN AVERAGE COST PROBLEMS
111
Thus, the PI algorithm gives a sequence of selectors gn ∈ F with corresponding α–discounted cost wn ∈ L+ (X) forming a nonincreasing sequence that satisfies the following. Theorem 3.27. (a) If there is an integer n such that wn (x) = wn+1 (x) for all x ∈ X, then w := wn is a solution to the α– discount DP equation w = T w. If, in addition, w satisfies the condition (3.5.7), that is, lim αt Exπ w(xt ) = 0
t→∞
(3.6.4)
for all π ∈ Π0 and x ∈ X, then w = V ∗ and gn∞ is α–discount optimal. (b) In general, as n → ∞, wn ↓ w, where w is a solution to the α–discount DP equation. Moreover, if w satisfies (3.6.4), then w = V ∗ and an α–optimal policy can be determined as in Theorem 3.21(b). Proof. (a) If there exists n such that wn = wn+1 =: w, the arguments leading to (3.6.2) give that w satisfies the α–discount DP equation w = T w. If, moreover, (3.6.4) holds, the desired conclusion follows from Theorem 3.21(c). (b) In general, using the arguments in the proof of Lemma 3.24(a), it can be seen that (3.6.3) gives wn ≥ T wn ≥ wn+1 . Hence, since the sequence {wn } is bounded below (wn ≥ 0 for all n), there exists w such that wn ↓ w and so, by Lemma 2.15(a), w ≥ T w ≥ w, i.e., w = T w. Therefore, if (3.6.4) holds, the desired conclusion follows from Theorem 3.21(c). 2
3.7 Long-Run Average Cost Problems In Sect. 2.5 we considered long-run average cost (AC) problems for discrete-time deterministic systems. We now briefly introduce AC problems for the stochastic system (3.1.1) and the Markov control model (MCM) (X, A, {A(x) : x ∈ X}, Q, c) in (3.1.2)–(3.1.4). This material requires a working knowledge of probability and
112
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
discrete-time stochastic processes, in particular, Markov chains with a general state space. Hence, to avoid excessive technicalities, most of the results in this section are stated without proof. Some references are provided in appropriate places. For each T = 1, 2, ..., x ∈ X and π = {at } ∈ Π, let T −1 JT (π, x) := Exπ [ c(xt , at )]
(3.7.1)
t=0
be the T -step expected cost when using the policy π, given the initial step x0 = x. The corresponding long-run expected average cost J(π, x) is defined as J(π, x) := lim sup T →∞
1 JT (π, x). T
(3.7.2)
Then, as usual, the AC value function is J ∗ (x) := inf{J(π, x) : π ∈ Π}, and a policy π ∗ is said to be AC-optimal if J(π ∗ , x) = J ∗ (x) for every initial state x. For stochastic control systems, in addition to the expected average cost in (3.7.2) one can study a pathwise average cost. In this case, the T -stage expected cost JT (π, x), T = 1, 2, ..., in (3.7.1)– (3.7.2) is replaced by the pathwise or random cost JT0 (π, x)
:=
T −1
c(xt , at ).
(3.7.3)
t=0
The analysis of this stochastic control problem is more technical than the expected case and it will not be considered in these notes. For the AC control problem in the deterministic case (2.5.1)– (2.5.3), we considered three approaches or techniques: (i) the average cost optimality equation (ACOE), (ii) the vanishing approach, and (iii) the steady state approach.
3.7
LONG-RUN AVERAGE COST PROBLEMS
113
In this section we combine the approaches (i) and (ii). (For the steady state approach (iii), which is based on the constrained optimization problem (2.5.17), there is no direct analogue in the stochastic case. In the latter case, it can be seen that the closest thing to (2.5.17) is an infinite-dimensional linear programming problem, such as, for instance, (M P1 ) in page 150 of Hern´andezLerma and Lasserre (1996).) Definition 3.28. A pair (j ∗ , l) consisting of a real number j ∗ and a real-valued function l on X is said to be: (a) a solution to the average cost optimality equation (ACOE) if, for every x ∈ X and t = 0, 1, ..., j ∗ + l(x) = inf [c(x, a) + E[l(xt+1 )|xt = x, at = a]] , a∈A(x)
(3.7.4) where
E[l(xt+1 )|xt = x, at = a] =
l(y)Q(dy|x, a)
(3.7.5)
X
denotes the conditional expectation of l(xt+1 ) given (xt , at ) = (x, a) for (x, a) ∈ K; and (b) a solution to the average cost optimality inequality (ACOI) if, for every x ∈ X and t = 0, 1, ..., the equality in (3.7.4) is replaced by the inequality ≥, that is (using (3.7.5)), ∗ j + l(x) ≥ inf [c(x, a) + l(y)Q(dy|x, a)]. (3.7.6) a∈A(x)
X
When dealing with the system model (3.1.1) with i.i.d. random disturbances ξt with distribution G, the expected value (or integral) in (3.7.4)–(3.7.6) becomes E[l(F (x, a, ξ)] = l(F (x, a, s))G(ds) (3.7.7) S
for all (x, a) ∈ K, where ξ is a generic random variable with distribution μ.
3
114
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
In the remainder of this section we proceed essentially as in Sect. 2.5.3 for deterministic systems. That is, we use the Abelian theorem in Lemma 2.56 and functions as in (2.5.34) to obtain a solution to the ACOI. (Note that inequalities such as (2.5.31) and (2.5.32) are valid in the present, stochastic case because they only depend on the fact that the sequence {ct } in Lemma 2.56 is bounded below.)
3.7.1 The Average Cost Optimality Inequality Throughout the remainder of this section we suppose that Assumptions 3.18 and 3.19 are satisfied. Consequently, for each α ∈ (0, 1), Theorem 3.21 ensures that the α-discount value function Vα satisfies the dynamic programming equation (3.5.5), i.e., Vα (x) = min [c(x, a) + α Vα (y)Q(dy|x, a)]. (3.7.8) a∈A(x)
X
Now, pick an arbitrary state x¯ and, as in (2.5.34), let x) mα := Vα (¯ and hα (x) := Vα (x) − mα and ρ(α) := (1 − α)mα .
(3.7.9)
In addition to Assumptions 3.18 and 3.19, we suppose the following. Assumption 3.29. There exists α0 ∈ (0, 1), positive constants N and M , and an u.s.c. function b(·) ≥ 1 on X such that, for every α ∈ [α0 , 1) and x ∈ X, (a) ρ(α) ≤ M , and (b) −N ≤ hα (x) ≤ b(x). Assumption 3.29 and many of its variants are well known in the study of MDPs with the AC criterion. The main ideas probably
3.7
LONG-RUN AVERAGE COST PROBLEMS
115
go back to Sennott (1986) for MDPs with a countable state space and finite action sets. By Assumption 3.29(a), ρ(α) has a limit point, say j ∗ ∈ [0, M ], as α ↑ 1. Hence, there is a sequence αn ↑ 1 such that ρ(αn ) → j ∗
(3.7.10)
as n → ∞. From (3.7.10) and (3.7.9) we obtain the following fact. Lemma 3.30. Let αn be as in (3.7.10). Then lim (1 − αn )Vαn (x) = j ∗
n→∞
(3.7.11)
for all x ∈ X. Proof. Write Vα (x) = hα (x) + mα . Multiply this equality by (1 − α) to obtain, from (3.7.9), |(1 − α)Vα (x) − j ∗ | ≤ |(1 − α)|hα (x)| + |ρ(α) − j ∗ | ≤ (1 − α) max{N, b(x)} + |ρ(α) − j ∗ |, with N and b(·) as in Assumption 3.29(b). Finally, replace α by 2 αn and then use (3.7.10) to obtain (3.7.11). Now, in (3.7.8) replace Vα (·) with hα (·) + mα . Then, by (3.7.9), we can express the α-discount DCOE (3.7.8) as ρ(α) + hα (x) = min [c(x, a) + α hα (y)Q(dy|x, a)]. (3.7.12) a∈A(x)
X
Moreover, by Theorem 3.21, for each α ∈ (0, 1) there exists fα ∈ F such that, for every x ∈ X, fα (x) ∈ A(x) attains the minimum in (3.7.12), i.e., hα (y)Q(dy|x, fα ). (3.7.13) ρ(α) + hα (x) = c(x, fα ) + α X
Finally, replacing α in (3.7.12)–(3.7.13) by αn in the proof of Lemma 3.30 and letting n → ∞, some technical arguments yield the ACOI in the following theorem.
116
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
Theorem 3.31. Suppose that Assumptions 3.18, 3.19 and 3.29 are satisfied and let j ∗ be as in (3.7.12). Then there exists a function h∗ ∈ L+ (X) and a stationary policy f ∗ ∈ F such that ∗ ∗ j + h (x) ≥ min [c(x, a) + h∗ (y)Q(dy|x, a)] a∈A(x) X h∗ (y)Q(dy|x, f ∗ ) ∀x ∈ X. = c(x, f ∗ ) + X
Moreover, f ∗ is AC-optimal and J ∗ (x) = J(f ∗ , x) = j ∗ for all x ∈ X. For the proof of Theorem 3.31 see Costa and Dufour (2012). Similar results appear in Feinberg et al. (2012) or Vega-Amaya (2015). These papers provide many earlier references.
3.7.2 The Average Cost Optimality Equation To obtain the ACOE starting from the ACOI in Theorem 3.31 we impose another assumption. Assumption 3.32. Let b(·) ≥ 1 be as in Assumption 3.29. We suppose that: (a) X b(y)Q(dy|x, f ) < ∞ for all x ∈ X and f ∈ F. (b) For each f ∈ F there exists a probability measure μf on X such that: (b1 ) X b(y)μf (dy) < ∞. (b2 ) As t → ∞, |Exf [v(xt )] − X v(y)μf (dy)| → 0 for each initial state x ∈ X and each real-valued function v on X such that supy∈X |v(y)|/b(y) < ∞. We now obtain the ACOE as follows. Theorem 3.33. Suppose that 3.18, 3.19, 3.29 and 3.32 are satisfied. Let j ∗ and f ∗ be as in Theorem 3.31, and μf ∗ as in Assumption 3.32(b). Then there exists a function h on X such that supy |h(y)|/b(y) < ∞ and, furthermore, for every x ∈ X,
3.7
LONG-RUN AVERAGE COST PROBLEMS
117
∗
j + h(x) = min [c(x, a) + h(y)Q(dy|x, a)] a∈A(x) X ∗ h(y)Q(dy|x, f ∗ ) (3.7.14) = c(x, f ) + X
for μf ∗ -almost all x ∈ X, that is, there exists a set C ⊂ X such that μf ∗ (C) = 0 and (3.7.14) holds for x in the complement of C. For the proof of Theorem 3.33 see Costa and Dufour (2012). Note, however, that the theorem’s conclusion is on the “weak” side in the sense that the ACOE (3.7.14) is valid “almost everywhere” only. To obtain the ACOE “everywhere”, that is, for all x ∈ X, requires strong conditions that we do not state here. On the other hand, the reader may consult Sect. 1 in Vega-Amaya (2018) for a review of results on the ACOE, as well as a presentation of the fixed-point approach, which we have not considered. For surveys of results up to the late 1990s see Chap. 5 and Chap. 11 in Hern´andez-Lerma and Lasserre (1996, 1999), respectively.
3.7.3 Examples Example 3.34 (An LQ average cost problem). Consider the discounted LQ problem in Exercise 3.10 below, assuming that the i.i.d. random variables ξt have a bounded continuous density. Thus, for each α ∈ (0, 1), the α-discount value function Vα and the optimal stationary policy fα are given, respectively, by
and
Vα (x) = Cα x2 + (1 − α)−1 Cα ασ 2
(3.7.15)
fα (x) = −(r + αβ 2 Cα )−1 αβγCα x,
(3.7.16)
where C ≡ Cα is the unique positive solution of the quadratic equation (3.7.17) F z 2 + Gz + qr = 0, with coefficients F ≡ Fα := αβ 2
and G ≡ Gα := r + (rγ 2 + qβ 2 )α.
(3.7.18)
118
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
Therefore, taking x¯ = 0, mα and (3.7.9) become mα = Vα (0) = (1 − α)−1 Cα ασ 2 , hα (x) = Vα (x) − mα = Cα x2 , (3.7.19) and (3.7.20) ρ(α) = (1 − α)mα = Cα ασ 2 . Moreover, a direct calculation shows that for all α sufficiently close to 1, say α0 < α < 1 for some α0 ∈ (0, 1), the positive number Cα is bounded above by L := (r + rγ 2 + qβ 2 )/β 2 α0 . This yields the Definition 3.28(a), i.e., ρ(α) ≤ M with M := Lσ 2 . Definition 3.28(b) is similarly verified, with N = 0 and b(x) := Lx2 . Finally, to obtain the ACOI in Theorem 3.31 take αn ∈ (0, 1) such that αn ↑ 1 as n → ∞. Then (as in (3.7.10) and (3.7.11)) we can see from (3.7.20) that ρ(αn ) → j ∗ := kσ 2
(3.7.21)
where k is the unique positive solution of (3.7.17)–(3.7.18) with α = 1. Likewise, from (3.7.19), as n → ∞ we obtain hαn (x) → h∗ (x) := kx2
(3.7.22)
for all x ∈ X. We can now verify that the pair (j ∗ , k ∗ (·)) is a solution to the ACOI (3.7.6). In fact, if in (3.7.16) we replace α by αn ↑ 1, then in the limit we obtain f ∗ (x) = −(r + kβ 2 )−1 βγkx ∀x ∈ X and (j ∗ , h∗ (·), f ∗ (·)) is a canonical triplet that satisfies the AC optimality equation (ACOE) (3.7.14) for all x ∈ X. (Note that we didn’t have to verify Assumption 3.32. This is because of the particular characteristics of LQ problems, but it is not the general case.) 3 Example 3.35 (The Brock-Mirman model with technological shocks). In the infinite-horizon Brock and Mirman model studied
3.7
LONG-RUN AVERAGE COST PROBLEMS
119
in Examples 2.10, 2.31 and 2.51, the system evolves according to xt+1 = ct xθt − at
for t = 0, 1, 2...
(3.7.23)
with initial condition x0 = x > 0 and θ ∈ (0, 1). Now, we suppose that the technological parameter {ct } behaves as a Markov process such that log(ct+1 ) = ρ log(ct ) + ξt
for t = 0, 1, 2...
(3.7.24)
with ρ > 0, initial condition c0 > 0, and where {ξt } is a collection of independent and identically distributed normal random variables with mean 0 and variance σ > 0. Under this hypothesis, (3.7.24) yields ct+1 = cρt eξt , where eξt has a log-normal distribution. This stochastic model was introduced by Brock and Mirman (1972, 1973). The state space and control spaces are, respectively, X × C = [0, ∞) × [0, ∞), A = (0, ∞), and A(x, c) = (0, cxθ ] for x > 0. Consider the objective function to be optimized as the long–run average reward (AR) T −1 1 π J(π, (x, c)) = lim inf Ex,c log(at ) . (3.7.25) T →∞ T t=0 This AR problem is, of course, analogous to the AC problem in (3.7.2), so the AR value function is J ∗ (x, c) := sup{J(π, (x, c)) : π ∈ Π}.
(3.7.26)
To find a canonical triplet (j ∗ , l, f ∗ ) that satisfies the average reward optimality equation (as in Definition 3.28(a) or Theorem 3.33), i.e., j ∗ + l(x, c) = max [log(a) + E[l(xt+1 , ct+1 )|(xt , ct ) = (x, c), at = a]] a∈A(x,c)
(3.7.27) for all (x, c) ∈ X × C, we consider a function l(x, c) of the form l(x, c) := b1 log(x) + b2 log(c), where b1 and b2 are unknown parameters to be determined.
3
120
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
In this case E[l(xt+1 , ct+1 )|(xt , ct ) = (x, c), at = a] = b1 log(cxθ − a) + b2 ρ log(c).
Hence, (3.7.27) reaches the maximum at a = express (3.7.27) as
cxθ , 1+b1
and so we can
j ∗ + b1 log(x) + b2 log(c) = b1 log(b1 ) − (1 + b1 ) log(1 + b1 ) + θ(1 + b1 ) log(x) + [1 + b1 + b2 ρ] log(c). This last equation is satisfied if j ∗ = b1 log(b1 ) − (1 + b1 ) log(1 + b1 ), b1 = θ(1 + b1 ), b2 = 1 + b1 + b2 ρ, which implies that b1 =
θ , 1−θ
b2 =
1 . (1 − θ)(1 − ρ)
Therefore the canonical triplet (j ∗ , l, f ∗ ) for the stochastic Brock-Mirman model (3.7.23)–(3.7.25) is given by θ θ 1 1 ∗ j = log − log , (3.7.28) 1−θ 1−θ 1−θ 1−θ θ log(x) log(c) l(x, c) = + , 1−θ (1 − θ)(1 − ρ) f ∗ (x, c) = (1 − θ)cxθ . 3 Example 3.36 (The stochastic Brock-Mirman model). We wish to solve again the stochastic average reward Brock-Mirman model in Example 3.35 (see (3.7.23)–(3.7.25)), but now using the vanishing discount approach. To this end, consider the α-optimal discounted utility Vα (x, c) in the Exercise 3.14. Then (1 − α)Vα (x, c) = (1 − α)B +
(1 − α)αθ 1−α log(x) + log(c), 1−θ 1 − αθ
3.7
LONG-RUN AVERAGE COST PROBLEMS
121
with B as in Exercise 3.14. Therefore, letting α ↑ 1 we obtain the optimal average reward J ∗ (x, c) = lim(1 − α)Vα (x, c) = j ∗ α↑1
for all (x, c), with j ∗ as in (3.7.28).
3
Exercises 3.1. (a) Prove Theorem 3.13. Hint. Use the proof of (3.3.3) or the Remark 3.12(c). Moreover, since c and u are bounded below, without loss of generality you can assume that they are nonnegative. Then use Theorem B.7. (b) Prove Theorem 3.14. Hint. Recall Remark 3.12(c) and Lemma 2.16(a). 3.2. Consider the MCM (3.1.2), the set K in (3.1.3), and a real– valued function v on K. (Recall the notation in Remark 3.7: v(x, f ) := v(x, f (x)).) Show that, for each x ∈ X, sup v(x, a) = sup v(x, f ). a∈A(x)
f ∈F
Hint. Fix an arbitrary x ∈ X. For any f ∈ F, v(x, f ) ≤ supa∈A(x) v(x, a). Hence supf ∈F v(x, f ) ≤ supa∈A(x) v(x, a). To obtain the reverse inequality, fix an arbitrary a ∈ A(x), and let h ∈ F be any selector. Define the mapping g : X → A as a if y = x, g(y) := h(y) if y = x. Then g is measurable and so it belongs to F. Therefore, v(x, a) = v(x, g) ≤ sup v(x, f ). f ∈F
Thus, supa v(x, a) ≤ supf v(x, f ). 3.3. Use backward induction to verify (3.4.6)–(3.4.7). 3.4. In Example 3.17, prove (3.4.10)–(3.4.11) and (3.4.12)– (3.4.13).
122
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
3.5. In the MCM (3.2.1), suppose that Q is weakly continuous (see Theorem 3.13(b)), and let u be a real–valued l.s.c. function on X that is bounded below. Show that, then, the mapping (x, a) → u(y)Q(dy|x, a) is l.s.c. on K. X 3.6. Consider a MCM in which the cost–per–stage c is bounded— see the paragraph after Assumption 6.2. Show that, then, V ∗ is the unique bounded solution to the α–discount DP equation. (The latter equation, however, may have several unbounded solutions— see, for instance, the following exercise.) 3.7. The following MCM by Sennott (1986) shows that the α– discount DP equation may have several unbounded solutions. Let X = {1, 2, . . . }, A = {1}, c(x, a) ≡ 0 for all x, a, and transition law Q({1}|x, 1) = 2x/3(2x − 1)
and Q({2x}|x, 1) = (4x − 3)/3(2x − 1)
for all x ∈ X. Let α = 3/4. Show that V ∗ (·) ≡ 0, but the identity function u(x) = x for all x ∈ X is also a solution to the DP equation (3.5.4). 3.8. (Blackwell 1965) Consider the MCM with components X = {0}, A = {1, 2, ...}, c(0, a) = 1/a and, of course, Q(0|0, a) = 1 for all a ∈ A. Note that the optimal cost function is V ∗ (0) = 0. (a) Is V ∗ a solution of the DP equation (3.5.4)? Explain. (b) Does it exist an optimal policy? Explain. 3.9. Consider a MCM with X = {1}, A(1) = {1, 2, . . . }, and c(1, a) = (1 + a)/a for all a. Show that V ∗ (1) = 1/(1 − α), but there is no optimal control policy; in other words, there is no π such that V (π, ·) = V ∗ (·). However, for any > 0, there exists an
–optimal policy—that is, for each > 0, there exists a policy π ≡ π such that V (π, x) ≤ V ∗ (x) + for all x ∈ X. 3.10. The infinite–horizon LQ discounted problem. Consider the discounted LQ problem in Example 3.16; that is, the system (3.4.1)–(3.4.3) with state and action spaces X = A = R, quadratic cost c(x, a) = qx2 + ra2 , and linear system equation
3.7
LONG-RUN AVERAGE COST PROBLEMS
xt+1 = γxt + βat + ξt
123
∀ t = 0, 1, . . . ,
where the random perturbations ξt are i.i.d. random variables with mean zero and finite variance σ 2 = E(ξ 2 ). The coefficients q and r are both positive, and γ · β = 0. Show that the α–discount optimal cost is, for every x ∈ X, V ∗ (x) = Cx2 + (1 − α)−1 Cασ 2 , and the α–optimal stationary policy is determined by f ∗ (x) = −(r + αβ 2 C)−1 αβγCx, where C = z1 is the unique positive solution of the quadratic equation F z 2 + Gz + qr = 0, with F = αβ 2 and G = r + (rγ 2 + qβ 2 )α. (Concerning the LQ discounted problem, note that some of the hypotheses in Theorem 3.21 were already verified in the paragraph after Assumption 3.18.) 3.11. Let π be a policy such that its cost function V (π, ·) ≡ u(·) is “almost” a solution to the α–discount D.P. equation (3.5.4) in the sense that, for some > 0, u(y)Q(dy|x, a) + (1 − α) u(x) ≤ c(x, a) + α X
for all x ∈ X and a ∈ A(x). In addition, suppose that u satisfies the condition (3.5.11). Show that π is an –optimal policy, that is, u(x) ≤ V ∗ (x) + for all x ∈ X. 3.12. Consider the MCM (X, A, {A(x)|x ∈ X}, Q, c) in (3.1.2), and let B(X) be as in Exercise 2.7. Suppose that the cost function c is bounded, say 0 ≤ c ≤ M , and for each u ∈ B(X), let T u be as in (3.5.3). Show that the operator T satisfies (a) and (b) in Exercise 2.7; that is, (a) T is a contraction on B(X), and (b) the value function V ∗ in (3.5.2) is the unique fixed point of T in B(X). Moreover, (c) the α–VI functions vn := T n 0 in (3.5.13) converge to V ∗ .
124
3
DISCRETE–TIME STOCHASTIC CONTROL SYSTEMS
3.13. (a) Consider the time-varying additive-noise system xt+1 = F1 (t)xt + F2 (t, at ) + ξt ,
t = 0, 1, ..., T − 1,
with state and action spaces X ⊂ Rn and A ⊂ Rm , respectively. Note that the system is linear in the state. The stage cost is of the same form, so the associated performance criterion is T −1 [c1 (t)xt + c2 (t, at )] . J(π, x) = Exπ t=0
Assume that A is compact, and the functions F2 (t, a) and c2 (t, a) are continuous in a ∈ A. Moreover, the random variables ξt (t = 0, ..., T − 1) have finite means, and they are independent and also independent of the initial state x0 . Prove that the OCP has an optimal control that is independent of the state variable. (b) Show that the conclusion in part (a) also holds in the timehomogeneous infinite-horizon discounted OCP with the system model (3.1.1) and discounted cost (3.5.1) with F (x, a, s) := F1 x + F2 (a) + s,
c(x, a) := c1 x + c2 (a),
respectively, under the Assumption 3.18. (Part (a) is due to Midler (1969). See Exercise 2.16 for a deterministic version of this exercise.) 3.14. An stochastic Brock-Mirman (1972) model. Consider a Brock-Mirman economic growth model as in Example 2.10 and Example 3.35 (3.7.29) xt+1 = ct xθt − at t = 0, 1, ... except that the “technological parameter” c is now a Markov process that evolves as log(ct+1 ) = ρ log(ct ) + ξt ,
t = 0, 1, ...
(3.7.30)
where the ξt are i.i.d. Gaussian (or normal) random variables with zero mean and standard deviation σ > 0. In (3.7.30) we assume that c0 is independent of the sequence {ξt }. As in Example 2.10,
3.7
LONG-RUN AVERAGE COST PROBLEMS
125
xt and at denote capital and consumption both with values in [0, ∞), but now the state variable is the pair (xt , ct ) with values in X × C with X = C = [0, ∞). Consider the discounted utility ∞ π αt log(at ) E(x,c) t=0
with initial condition (x0 , c0 ) = (x, c). Show that the optimal discounted utility Vα (x, c) is Vα (x, c) = B +
αθ 1 log(x) + log(c) 1−θ 1 − αθ
with
1 αθ αθ 1 1 B= log − log . 1−α 1−θ 1 − αθ 1 − αθ 1 − αθ
Hint. Solve the corresponding α-DP equation by means of the “guess and verify” approach (or method of undetermined coefficients) with a logarithmic function of the form v(x, c) = b1 + b2 log(x) + b3 log(c).
Chapter 4 Continuous–Time Deterministic Systems
We now consider a deterministic continuous–time optimal control problem (OCP) in which the state process x(·) evolves in the state space X := Rn according to an ordinary differential equation x(t) ˙ = F (t, x(t), a(t)) for t ∈ [0, T ],
(4.0.1)
for a given initial state x(0) = x0 ∈ X, and a given system function F : [0, T ] × X × A → X, where A is a separable metric space that stands for the action (or control) set. For the time being, in (4.0.1) we consider so-called open–loop controls a(·), that is, controls in the family A[0, T ] of piecewise continuous functions a(·) : [0, T ] → A. Remark 4.1. We assume that, for each control function a(·), (4.0.1) has a unique solution x(·). To this end, it suffices to assume that, for instance, F is a continuous function and has continuous first partial derivatives with respect to the components of x ∈ X. (For a more precise statement, see Assumption 4.2 below.) ♦
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hern´andez-Lerma et al., An Introduction to Optimal Control Theory, Texts in Applied Mathematics 76, https://doi.org/10.1007/978-3-031-21139-3 4
127
4
128
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
4.1 The HJB Equation and Related Topics 4.1.1 Finite–Horizon Problems: The HJB Equation The performance index or cost functional associated to (4.0.1) is T c(t, x(t), a(t))dt + C(x(T )), (4.1.1) J(a(·)) := 0
where c and C are given nonnegative functions, called the running or instantaneous cost and the terminal cost, respectively. Assuming that (4.0.1) has a unique solution and that (4.1.1) is well defined, the OCP we are concerned with is the following: OCP Minimize (4.1.1) over A[0, T ]. As in previous chapters, when using dynamic programming associated to an OCP we consider, for each time s ∈ [0, T ), an OCP from s to the terminal time T . That is, for each (s, y) ∈ [0, T ) × X, we consider the dynamic system (as in (4.0.1)) x(t) ˙ = F (t, x(t), a(t)) for t ∈ [s, T ], x(s) = y.
(4.1.2)
The controls in (4.1.2) are restricted to the interval [s, T ], so a(·) is in A[s, T ], the family of piecewise continuous functions a(·) from [s, T ] to A, and the performance index in (4.1.1) is replaced with T c(t, x(t), a(t))dt + C(x(T )). (4.1.3) J(s, y; a(·)) := s
This new OCP is called OCPsy . If (s, y) = (0, x0 ), then the problem OCPsy reduces, of course, to the original OCP. If a∗ (·) minimizes (4.1.3) over all a(·) ∈ A[s, T ] and x∗ (·) denotes the corresponding solution to (4.1.2), then we say that a∗ is an optimal control and (x∗ (·), a∗ (·)) is an optimal pair of problem OCPsy . To ensure that OCPsy is well defined for each initial condition (s, y) in [0, T ] × X we impose the following hypotheses.
4.1
THE HJB EQUATION AND RELATED TOPICS
129
Assumption 4.2. Let ϕ(t, x, a) be any of the functions F (t, x, a), c(t, x, a), C(t) in (4.0.1)–(4.1.1) (or (4.1.2)–(4.1.3)). The function ϕ is uniformly continuous and, moreover, there is a constant L such that (a) |ϕ(t, x, a) − ϕ(t, xˆ, a)| ≤ L|x − xˆ| ∀ t ∈ [0, T ], x, xˆ ∈ X, a ∈ A. (b) |ϕ(t, 0, a)| ≤ L ∀ (t, a) ∈ [0, T ] × A. Under this assumption, (4.1.2) has a unique solution x(·) ≡ x(·; s, y, a(·)) for every initial condition (s, y) ∈ [0, T ] × X and every control a(·) ∈ A[s, T ]. (See Exercise 4.4.) Moreover, (4.1.3) is well defined. Consider now the value function corresponding to OCPsy : V (s, y) :=
inf
a(·)∈A[s,T ]
J(s, y; a(·)) ∀ (s, y) ∈ [0, T ) × X,
(4.1.4)
V (T, y) = C(y) ∀ y ∈ X. The following theorem states that V satisfies Bellman’s principle of optimality (4.1.5). (See also Corollary 4.4.) Theorem 4.3. Suppose that Assumption 4.2 holds. Then, for any (s, y) ∈ [0, T ) × X and sˆ ∈ [s, T ], sˆ c(t, x(t), a(t))dt + V (ˆ s, x(ˆ s)) . (4.1.5) V (s, y) = inf a(·)∈A[s,T ]
s
Proof. Denote by V¯ (s, y) the right–hand side of (4.1.5). It is easy to see that V (s, y) ≤ V¯ (s, y). (4.1.6) Indeed, by definition (4.1.4), for any control a(·) ∈ A[s, T ] we have V (s, y) ≤ J(s, y; a(·)) sˆ = c(t, x(t), a(t))dt + J(ˆ s, x(ˆ s); a(·)) s
and then, taking the infimum over a(·) ∈ A[s, T ], (4.1.6) follows. To obtain the reverse inequality, fix an arbitrary > 0, and choose
4
130
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
a control a (·) ∈ A[s, T ] such that V (s, y) + ≥ J(s, y; a (·)) sˆ c(t, x(t), a(t))dt + V (ˆ s, x (ˆ s)) ≥ s
≥ V¯ (s, y), where x (·) = x(·; s, y, a ). The latter inequality and (4.1.6) give (4.1.5). Assuming the existence of optimal controls, Theorem 4.3 yields Bellman’s principle of optimality in the usual form of Lemma 2.2; namely, if the control a∗ (·) is optimal on [s, T ], with initial condition (s, y), then restricted to [ˆ s, T ], for any sˆ ∈ (s, T ), a∗ (·) is also s)). A more precise statement optimal with initial condition (ˆ s, x∗ (ˆ in terms of the value function V in (4.1.4) is the following. Corollary 4.4. Suppose that (x∗ (·), a∗ (·)) is an optimal pair of problem OCPsy . Then, for any sˆ ∈ (s, T ), V (ˆ s, x∗ (ˆ s)) = J(ˆ s, x∗ (ˆ s); a∗ (·)).
(4.1.7)
Conversely, if (x∗ (·), a∗ (·)) satisfies (4.1.7) for all 0 ≤ s < sˆ < T , then (4.1.5) holds. Proof. By the optimality of (x∗ (·), a∗ (·)), V (s, y) = J(s, y; a∗ (·)) sˆ = c(t, x∗ (t), a∗ (t))dt + J(ˆ s, x∗ (ˆ s); a∗ (·)) s sˆ ≥ c(t, x∗ (t), a∗ (t))dt + V (ˆ s, x∗ (ˆ s)) s
≥ V (s, y)
[by (4.1.5)].
That is, the latter inequality is in fact an equality and it yields (4.1.7). The converse is obvious. Corollary 4.4 gives a necessary condition for a∗ (·) to be optimal. For practical purposes, however, this is not very helpful because the corollary does not say how to find neither V nor a∗ (·). Never-
4.1
THE HJB EQUATION AND RELATED TOPICS
131
theless, proceeding as in Sect. 2.1, we can try to use the principle of optimality (Lemma 2.2) to find the dynamic programming (or Bellman) equation (in (2.1.6)–(2.1.7)). In our present case, the latter procedure shows that, under suitable hypotheses, (4.1.5) yields the DP equation (4.1.8)–(4.1.9) below, which in the continuous– time case is called the Hamilton–Jacobi–Bellman (HJB) equation associated to OCP. Remark 4.5. Given a real-valued function (t, x) → v(t, x) on (0, T ) × Rn we denote by vt the partial derivative of v with respect to t and by vx the gradient of v, that is, the (row) vector (vx1 , ..., vxn ) of partial derivatives. Further, C 1 (Y ) denotes the family of real-valued continuously differentiable functions on the space Y . ♦ Theorem 4.6. In addition to Assumption 4.2, suppose that the value function V : [0, T ] × X → R in (4.1.4) is continuously differentiable. Then: (a) V satisfies the first-order partial differential equation Vt + inf [c(t, x, a) + Vx · F (t, x, a)] = 0 ∀ (t, x) ∈ [0, T ) × X, a∈A
V (T, x) = C(x) ∀ x ∈ X.
(4.1.8) (4.1.9)
(b) Furthermore, if there exists a control function a∗ : [0, T ) × X → A such that a∗ (t, x) ∈ A attains the minimum in (4.1.8) for every (t, x) ∈ [0, T ) × X, then a∗ (t) := a∗ (t, x∗ (t)) is an optimal control. Before proving Theorem 4.6 we mention other equivalent forms of expressing the HJB equation (4.1.8). Remark 4.7. (a) The function within brackets in (4.1.8), that is, H(t, x, a, λ) := c(t, x, a) + λ · F (t, x, a),
with λ = Vx (4.1.10) is called the Hamiltonian associated to the OCP (4.0.1)– (4.1.1). In terms of the Hamiltonian, the HJB equation (4.1.8) becomes
132
4
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
Vt + inf H(t, x, a, Vx ) = 0 a∈A
(4.1.11)
for all (t, x) ∈ [0, T ) × X, with the terminal condition (4.1.9). We will use this form of the HJB equation to obtain necessary conditions for optimality in Sect. 4.1.2. (b) For each a ∈ A and each real–valued continuously differentiable function v on [0, T ] × X, let La v(t, x) := vt (t, x) + vx (t, x) · F (t, x, a).
(4.1.12)
Interpreting (4.0.1)–(4.1.1) as a Markov control problem , the operator La in (4.1.12) denotes the infinitesimal generator of the “controlled Markov process” x(·) in (4.0.1). (We will come back to this point in Chap. 5.) Moreover, we can rewrite the HJB equation (4.1.8) as inf [c(t, x, a) + La V (t, x)] = 0.
a∈A
(4.1.13)
(c) For future reference, note that (4.1.12) is the derivative of v(t, x(t)) with respect to t, given that (t, x(t), a(t)) = (t, x, a); indeed, by the chain rule and (4.1.2), d a L v(t, x) = v(t, x(t)) dt (t,x,a) = [vt (t, x(t)) + vx (t, x(t)) · x(t)] ˙ |(t,x,a) = vt (t, x) + vx (t, x) · F (t, x, a). (d) In the time-homogeneous (or time-invariant) case in which c(t, x, a) and F (t, x, a) in (4.1.1)–(4.1.2) are replaced by c(x, a) and F (x, a), respectively, the operator La in (4.1.2) becomes d La v(x) := vx (x) · F (x, a) = v(x(t)) dt (x,a) for v ∈ C 1 (X). Summarizing, in addition to the “explicit form” (4.1.8) of the HJB equation, we also have (4.1.11) and (4.1.13). ♦ In part (a) of the following proof we use the fact that, for ϕ = F or c,
4.1
THE HJB EQUATION AND RELATED TOPICS
lim sup |ϕ(t, y, a) − ϕ(s, y, a)| = 0. t↓s y∈X,a∈A
133
(4.1.14)
This follows from the uniform continuity of F and c in Assumption 4.2. Proof of Theorem 4.6. (a) Fix an arbitrary a ∈ A, and let x(·) be the solution of (4.0.1) when using the control a(·) ≡ a. Then, from (4.1.5), sˆ V (ˆ s, x(ˆ s)) − V (s, y) + c(t, x(t), a)dt ≥ 0. s
Multiplying both sides of this inequality by (ˆ s − s)−1 and then letting sˆ ↓ s we obtain c(s, y, a) + Vt (s, y) + Vx (s, y) · F (s, y, a) ≥ 0. Therefore, since a ∈ A was arbitrary, it follows that Vt (s, y) + inf [c(s, y, a) + Vx (s, y) · F (s, y, a)] ≥ 0. (4.1.15) a∈A
To obtain the reverse inequality, for any > 0 and sˆ ∈ [s, T ] such that sˆ − s > 0 is small enough, there exists a control a(·) ≡ a,ˆs (·) in A[s, T ] such that sˆ c(t, x(t), a(t))dt ≤ V (s, y) − V (ˆ s, x(ˆ s)) + (ˆ s − s). s
Therefore, since V is continuously differentiable, and from Remark 4.7(c), sˆ −1 V (ˆ s, x(ˆ s)) − V (s, y) + c(t, x(t), a(t))dt ≥ (ˆ s − s) s sˆ a(t) −1 L V (t, x(t)) + c(t, x(t), a(t)) dt = (ˆ s − s) s sˆ ≥ (ˆ s − s)−1 inf [c(t, x(t), a) + La V (t, x(t))] dt s a∈A
→ inf [c(s, y, a) + La V (s, y)] a∈A
as sˆ ↓ s,
4
134
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
by (4.1.14). The last inequality and (4.1.13) conclude the proof of part (a). (b) We will use the HJB equation in the form (4.1.13). Hence, for any a(·) ∈ A[0, T ] we have c(t, x(t), a(t)) + La(t) V (t, x(t)) ≥ 0
(4.1.16)
and so integration on [s, T ] yields (by the Remark 4.7(c)) T c(t, x(t), a(t))dt + V (T, x) − V (s, x) ≥ 0. (4.1.17) s
Therefore, by (4.1.9) and (4.1.3), V (s, x) ≤ J(s, x; a(·)).
(4.1.18)
Finally, let a∗ (t, x) be as in (b) and let x∗ (·) be the solution of (4.0.1) when using the control a∗ (t) := a∗ (t, x∗ (t)). Then, replacing the pair (x(·), a(·)) by (x∗ (·), a∗ (·)), we obtain equalities in (4.1.16)–(4.1.18); in particular, (4.1.18) becomes V (s, x) = J(s, x; a∗ (·)) ∀ (s, x) ∈ [0, T ] × X. Thus, a∗ (·) is an optimal control.
Theorem 4.6 is called a verification theorem. It is so–named because it gives the following “verification technique” to solve the OCP (4.0.1)–(4.1.1): Step 1. Solve the HJB equation (4.1.8)–(4.1.9) to find the value function V (t, x). Step 2. Find the minimizer a∗ (t, x) in (4.1.8). Step 3. Solve (4.0.1) with a = a∗ to obtain the optimal pair (x∗ (·), a∗ (·)). Clearly, step 1 is the most demanding from a technical viewpoint, because it hinges on the fact that (i) V is continuously differentiable, that is, V ∈ C 1 ([0, T ] × X), and (ii) the HJB equation admits a unique classical solution. Unfortunately, neither (i) nor (ii) are true in general. For instance, Example 2.4 in Yong and
4.1
THE HJB EQUATION AND RELATED TOPICS
135
Zhou (1999), pp. 163–164, is an innocent–looking OCP in which V does not satisfy (i). See also Exercise 4.4.
4.1.2 A Minimum Principle from the HJB Equation In Sects. 2.2.2 and 2.3.2, we respectively considered finite- and infinite-horizon versions of the Minimum Principle in discrete time. We now discuss an analogous principle in continuous time. Consider the OCP with dynamics (4.0.1) and cost functional (4.1.1). Suppose that Assumption 4.2 holds and the value function V is of class C 2 . Let (x∗ (·), a∗ (·)) be an optimal pair. Put λ(s) := Vx (s, x∗ (s))
0 ≤ s ≤ T.
Recall the Hamiltonian H(s, x, a, λ) = c(s, x, a) + λ · F (s, x, a) introduced in Remark 4.7(a). The minimum condition. We start by differentiating, with respect to s, both sides of the equality T ∗ c(t, x∗ (t), a∗ (t))dt + C(x∗ (T )) V (s, x (s)) = s
to obtain Vt (s, x∗ (s)) + Vx (s, x∗ (s)) · x˙ ∗ (s) = −c(s, x∗ (s), a∗ (s)) which is equivalent to − Vt (s, x∗ (s)) = c(s, x∗ (s), a∗ (s)) + Vx (s, x∗ (s)) · F (s, x∗ (s), a∗ (s)) = c(s, x∗ (s), a∗ (s)) + λ(s) · F (s, x∗ (s), a∗ (s)) = H(x, x∗ (s), a∗ (s), λ(s)) (4.1.19) for each s in [0, T ). On the other hand, by Theorem 4.6, V satisfies the HJB equation (4.1.8)–(4.1.9). In particular,
4
136
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
−Vt (s, x∗ (s)) = inf [c(s, x∗ (s), a) + λ(s) · F (s, x∗ (s), a)], a∈A
0 ≤ s < T.
From the latter equality and (4.1.19), we conclude that H(s, x∗ (s), a∗ (s), λ(s)) = min H(s, x∗ (s), a, λ(s)), a∈A
0 ≤ s < T.
(4.1.20) The adjoint equation. Let us now assume that, for each fixed s, x → inf H(s, x, a, Vx (s, x))
(4.1.21)
a∈A
is differentiable and its derivative at x∗ (s) equals cx (s, x∗ (s), a∗ (s)) + λ(s)Fx (s, x∗ (s), a∗ (s)) + Vxx (s, x∗ (s)) · F (s, x∗ (s), a∗ (s)). (4.1.22) Remark 4.8. Results about sufficient conditions for differentiability of mappings like (4.1.21) are usually called Envelope Theorems. An example of such sufficient conditions—as well as a formula that yields (4.1.22)—is given in Exercise 4.8. After taking the partial derivative with respect to x in both sides of (4.1.8) and evaluating at x = x∗ (s), we have −Vtx (s, x∗ (s)) = cx (s, x∗ (s), a∗ (s)) + λ(s)Fx (s, x∗ (s), a∗ (s)) +Vxx (s, x∗ (s)) · x˙ ∗ (s) for each s in [0, T ). Then cx (s, x∗ (s), a∗ (s)) + λ(s)Fx (s, x∗ (s), a∗ (s)) = −Vtx (s, x∗ ) − Vxx (s, x∗ (s)) · x˙ ∗ (s) = −
d Vx (s, x∗ (s)), dt
that is, − λ˙ (s) = cx (s, x∗ (s), a∗ (s)) + λ(s)Fx (s, x∗ (s), a∗ (s)),
0 ≤ s < T.
(4.1.23) Besides, the terminal condition λ(T ) = Cx (x∗ (T )) follows from the definition of λ and (4.1.9).
(4.1.24)
4.1
THE HJB EQUATION AND RELATED TOPICS
137
Finally, the so-called Minimum Principle can be stated as follows. The Minimum Principle. Under the assumptions made above, if (x∗ (·), a∗ (·)) minimizes the cost functional (4.1.1) subject to the dynamics (4.0.1), then there exists a function λ : [0, T ] → Rn that satisfies the minimum condition (4.1.20) and the adjoint equation (4.1.23)–(4.1.24). Remark 4.9. (a) The way we obtained the Minimum Principle is similar to that in Bertsekas (2005). For a general treatment about the relationship between the Minimum Principle and the HJB equation, see Yong and Zhou (1999). In the latter reference, the stochastic case is also considered. (b) If we consider maximization problems, then (4.1.20) is replaced by a maximum condition. This is one reason why the Minimum Principle is also known as Maximum Principle. Another name for the principle is Pontryagin’s Maximum Principle because the Soviet mathematician L. S. Pontryagin started the research on a class of OCPs and formulated a first version of the above equations. According to Gamkrelidze (1999), the complete formulation of the Maximum Principle as well as a full proof were accomplished however by Pontryagin and two of his close collaborators V. Boltyanski and R. V. Gamkrelidze. (c) In the optimal control literature, the adjoint equation (4.1.23) is also named costate equation whereas the variable λ is called adjoint or costate variable. In particular, in economics, λ can be interpreted as a shadow price; see, for instance, Sect. 2.2.4 in Sethi (2021). Likewise, the terminal condition (4.1.24) is also called transversality condition. ♦ The Minimum Principle provides only necessary conditions for optimality. However, the Minimum Principle along with some additional conditions are sufficient to find solutions in a class of OCPs. The following result is a simplified version of a theorem in Mangasarian (1966) (see Remark 4.11 below).
4
138
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
Theorem 4.10. Let x∗ (·), a∗ (·), and λ(·) satisfy (4.1.20), (4.1.23), and (4.1.24). Suppose the following (a) A and X are convex sets, (b) the functions C(·) and H(t, ·, ·, λ(t)) are convex and continuously differentiable for each t, and (c) a∗ (t) is an interior point of A for each t. Then (x∗ (·), a∗ (·)) minimizes the cost functional (4.1.1) subject to (4.0.1). Proof. Consider a control function a(·) ∈ A[0, T ], and let x(·) be the corresponding solution to (4.0.1). Put D :=
T
0
c(t, x∗ (t), a∗ (t)) dt + C(x∗ (T )) −
T
0
c(t, x(t), u(t)) dt − C(x(T )),
H∗ (t) := H(t, x∗ (t), a∗ (t), λ(t)),
0 ≤ t < T,
Hx∗ (t) Ha∗ (t)
∗
∗
0 ≤ t < T,
∗
∗
0 ≤ t < T,
:= Hx (t, x (t), a (t), λ(t)), := Ha (t, x (t), a (t), λ(t)),
and H(t) := H(t, x(t), a(t), λ(t)),
0 ≤ t < T.
By assumptions (b)–(c) and (4.1.23), H∗ (t) − H(t) ≤ Hx (t) · [x∗ (t) − x(t)] + Ha∗ · [a∗ (t) − a(t)] = Hx (t) · [x∗ (t) − x(t)] + 0 (4.1.25) ∗ ˙ = −λ(t) · [x (t) − x(t)] and C(x∗ (T )) − C(x(T )) ≤ Cx (x∗ (T )) · [x∗ (T ) − x(T )]. Then
4.1
THE HJB EQUATION AND RELATED TOPICS
139
T
[H∗ (t) − λ(t)x˙ ∗ (t)]dt + C(x∗ (T )) 0 T [H(t) − λ(t)x(t)]dt ˙ − C(x(T )) − 0 T T ∗ = [H (t) − H(t)]dt + λ(t)[x(t) ˙ − x˙ ∗ (t)]dt
D=
0
≤ 0
0
T
+C(x∗ (T )) − C(x(T )) ∗ ˙ λ(t)[x(t) − x (t)]dt +
T
λ(t)[x(t) ˙ − x˙ ∗ (t)]dt
0
+Cx (x∗ (T )) · [x∗ (T ) − x(T )] T
d ∗ dt+Cx (x∗ (T )) · [x∗ (T ) − x(T )] = λ(t)[x(t)−x (t)] dt 0 = λ(T )[x(T ) − x∗ (T )] − 0 + Cx (x∗ (T )) · [x∗ (T ) − x(T )] = 0. That is, D ≤ 0 so T ∗ ∗ ∗ c(t, x (t), a (t)) dt + C(x (T )) ≤ 0
which yields the required conclusion.
T
c(t, x(t), u(t)) dt + C(x(T )),
0
Remark 4.11. We point out that Theorem 4.10 is still valid without the assumption (c). Indeed, from (4.1.25), we notice that the inequality Ha∗ · [a∗ (t) − a(t)] ≤ 0 (instead of the equality) is enough to prove the theorem. The latter inequality holds under assumptions (a) and (b) of the theorem—see Proposition 2.1.1 in Borwein and Lewis (2006). ♦ Theorem 4.10 can be used to solve OCPs according to the following procedure: Step 1. From (4.1.20), find a∗ (as a function of x∗ and λ) that minimizes the Hamiltonian. Step 2. Substitute a∗ in (4.0.1) and (4.1.23) to obtain a system of two ordinary differential equations in the variables x∗ and λ.
140
4
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
Step 3. Determine x∗ and λ by using the initial condition x∗ (0) = x0 and the terminal condition (4.1.24). Step 4. Go to Step 1 and find the open-loop policy a∗ . Step 5. Verify the assumptions (a), (b), and (c) in Theorem 4.10. The following example illustrates the above procedure to find an optimal policy of a scalar LQ system, by means of the Minimum Principle. Example 4.12. Let X = A = R, γ, β ∈ R, r > 0, and q ≥ 0. Consider the linear system x(t) ˙ = γx(t) + βa(t),
x(0) = x0 ,
(4.1.26)
and the functional cost 1 T 2 r [qx (t) + a2 (t)]dt + x2 (T ). 2 0 2 For notational ease, we simply write a(t) and x(t) instead of a∗ (t) and x∗ (t). We observe that a(t) = −βλ(t)
(4.1.27)
minimizes the Hamiltonian 1 H(t, x(t), a, λ(t)) = [qx2 (t) + a2 ] + λ(t)[γx(t) + βa]. 2 Thus the dynamics (4.1.26) and the adjoint equation (4.1.23) become x(t) ˙ = γx(t) + βa(t), x(0) = x0 ˙ λ(t) = −qx(t) − γλ(t), λ(T ) = rx(T ). The solution (x(·), λ(·)) to this (linear and homogeneous) system of ordinary differential equations can be found by standard methods. Finally, notice that assumptions (a), (b), and (c) in Theorem 4.10 hold. Therefore, (4.1.27) gives an open–loop optimal policy. ♦ Remark 4.13. The Minimum Principle holds, with the corresponding changes, for other classes of OCPs. For instance, when
4.2
THE DISCOUNTED CASE
141
there is a constraint for the terminal state, say g(x(T )) = 0 or h(x(T )) ≥ 0; see Chap. 5 in Li and Yong (1995). Another class consists of OCPs with infinite horizon; see Halkin (1974).
4.2 The Discounted Case Consider the OCP at the beginning of this chapter, except that (4.1.1) is replaced with T e−rt c(t, x(t), a(t))dt + e−rT C(x(T )), (4.2.1) J(a(·)) := 0
where r > 0 is a given discount factor. In this case, the HJB equation (4.1.8)–(4.1.9) becomes Vt + inf [e−rt c(t, x, a) + Vx · F (t, x, a)] = 0 a∈A
for (t, x) ∈ [0, T ) × X, with terminal condition V (T, x)=e−rT C(x). With the change of variable v(t, x) := e−rt V (t, x), it can be seen that the HJB in the discounted case becomes vt + inf [c(t, x, a) + vx · F (t, x, a)] = rv a∈A
(4.2.2)
for all (t, x) ∈ [0, T ) × X, and v(T, x) = C(x) ∀ x ∈ X.
(4.2.3)
Note that if r = 0, then (4.2.1) and (4.2.2) reduce to (4.1.1) and (4.1.8), respectively. Similarly, (4.1.11) and (4.1.13) become vt + inf H(t, x, a, vx ) = rv
(4.2.4)
inf [c(t, x, a) + La v(t, x)] = rv
(4.2.5)
a∈A
and a∈A
for (t, x) ∈ [0, T ) × X, with the terminal condition (4.2.3).
♦
Example 4.14 (The discounted LQ case). In the general LQ problem (also known as the linear regulator problem) the state and
4
142
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
control spaces are general finite–dimensional spaces, say, X = Rn and A = Rm , respectively. Here, however, to simplify the presentation we shall consider the scalar case n = m = 1, so X = A = R. The state process is given by x(t) ˙ = γ(t)x(t) + β(t)a(t),
for t ∈ [0, T ],
(4.2.6)
with continuously differentiable coefficients γ(·) and β(·), and a given initial condition x(0) = x0 . The Eq. (4.2.6) is of course of the form (4.0.1) with F (t, x, a) = γ(t)x + β(t)a. The associated cost functional is a discounted cost, as in (4.2.1), with c(t, x, a) := Q(t)x2 + R(t)a2
and C(x) := x2
with coefficients Q(·) ≥ 0 and R(·) > 0. Hence, the HJB equation (4.2.2) becomes rv = inf [c(t, x, a) + vt + vx · F (t, x, a)] a
= Qx2 + vt + vx · γx + inf [Ra2 + vx · βa], a
(4.2.7)
with terminal condition v(T, x) = x2 . The minimum in (4.2.7) is attained at (4.2.8) a∗ (t, x) = −(2R(t))−1 β(t) · vx . Inserting this value of a∗ in (4.2.7) we obtain the partial differential equation Qx2 + vt + γxvx − (βvx )2 /4Q = rv
(4.2.9)
for (t, x) ∈ [0, T ) × X, and v(T, x) = x2 . The problem now is how to obtain a solution to (4.2.9). By the form of the LQ problem (or by analogy with the discrete-time case—see Example 2.4), we may try a solution of the form v(t, x) = k(t)x2 + h(t) for
0 ≤ t < T,
(4.2.10)
4.3
INFINITE–HORIZON DISCOUNTED COST
143
with continuously differentiable coefficients k(t), h(t), and k(·) ≥ 0. Note that the terminal condition v(T, x) = k(T )x2 + h(T ) = x2 gives that k(T ) = 1, h(T ) = 0. (4.2.11) 2 ˙ ˙ + h(t) and vx = 2k(t)x. Replacing these From (4.2.10), vt = k(t)x values in (4.2.9) we obtain the equation ˙ [k˙ + (2γ − r)k + (kβ)2 /R + Q]x2 + (h(t) − rh(t)) = 0, which in turn yields two ordinary differential equations: k˙ + (2γ − r)k + (kβ)2 /R + Q = 0,
(4.2.12)
˙ and the linear equation h(t) = rh(t). In view of the terminal condition h(T ) = 0 in (4.2.11), it can be seen that h(t) = 0 for all t ∈ [0, T ]. Therefore, the function v in (4.2.10) becomes v(t, x) = k(t)x2 , where k(t) is the solution of the Riccati equation (4.2.12). (The existence of this solution is ensured in our present context. See, for instance, Theorem 5.2 in Fleming and Rishel (1975), Sect. IV.5.) Finally, since vx = 2k(t)x, from (4.2.8) we obtain that the optimal control for the LQ problem is a∗ (t, x) = −R(t)−1 β(t)k(t)x, which is a linear function in the state x. ♦
4.3 Infinite–Horizon Discounted Cost We consider again the system (4.0.1) except that now it is defined for all t ≥ 0, i.e., x(t) ˙ = F (t, x(t), a(t)) for t ≥ 0,
(4.3.1)
with the same initial condition x(0) = x0 . Recall that the state space is X := Rn . Moreover, instead of the discounted cost (4.2.1) we consider the infinite–horizon discounted cost functional
4
144
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
∞
V (x0 ; a(·)) :=
e−rt c(t, x(t), a(t))dt.
0
We will assume that the running (or instantaneous) cost c(t, x, a) is nonnegative and, in addition, the set Ax0 of piecewise– continuous controls a(·) : [0, ∞) → A such that V (x0 ; a(·)) < ∞ is nonempty. In this context, the verification Theorem 4.6, concerning the HJB equation (4.2.2) (or (4.2.5)) becomes as follows. Theorem 4.15. Let v : [0, ∞) × X → R be a continuously differentiable function that satisfies the equation rv(s, x) = inf [c(s, x, a) + La v(s, x)] a∈A
(4.3.2)
for all (s, x) ∈ [0, ∞) × X, with La as in (4.1.12). Then: (a) v(s, x) ≤ V (s, x; a(·)) for each control a(·) ∈ Ax0 such that, as t → ∞, (4.3.3) e−rt v(t, x(t)) → 0, where
V (s, x; a(·)) :=
∞
e−r(t−s) c(t, x(t), a(t))dt.
s ∗
∗
(b) If a ≡ a (s, x) ∈ A attains the minimum in (4.3.2), i.e. ∗
rv(s, x) = c(s, x, a∗ ) + La v(s, x) ∀ (s, x),
(4.3.4)
then a∗ (·) ≡ a∗ (·, x(·)) is optimal within the class of controls in Ax0 that satisfy the condition (4.3.3). Proof. (a) By (4.3.2), rv(s, x) ≤ c(s, x, a) + La v(s, x) ∀ (s, x, a). Let u(s, x) := e−rs v(s, x) and note that, by (4.1.12),
(4.3.5)
4.3
INFINITE–HORIZON DISCOUNTED COST
145
La u(s, x) := us + ux · F (s, x, a) = e−rs [vs (s, x) − rv(s, x) + vx (s, x) · F (s, x, a)] = e−rs [La v(s, x) − rv(s, x)] ≥ −e−rs c(s, x, a), [by (4.3.5)] that is, La u(s, x) ≥ −e−rs c(s, x, a) for all (s, x, a). Integrating both sides of the latter inequality, and recalling the Remark 4.7(c), we obtain T u(T, x(T )) − u(s, x) ≥ e−rt c(t, x(t), a(t))dt. (4.3.6) s
Equivalently, by definition of u, T v(s, x) ≤ e−r(t−s) c(t, x(t), a(t))dt + e−r(T −s) v(T, x(T )). s
Finally, letting T → ∞, (4.3.3) yields the desired conclusion in part (a). (b) On the other hand, if (4.3.4) holds, we have equalities throught (4.3.5)–(4.3.6) with a = a∗ , and part (b) follows. Example 4.16. Consider the OCP: Minimize, over all nonnegative controls a(·) ∈ Ax0 , the objective function ∞ e−rt [x(t) + ka(t)2 ]dt V (x0 ; a(·)) = 0
subject to x(t) ˙ = p − a(t)x(t)1/2
∀ t > 0,
with k and x(0) = x0 both positive. Note that this is an autonomous or stationary or time–invariant OCP because both the instantaneous cost c(t, x, a) ≡ c(x, a) = x + ka2 and the state transition function F (t, x, a) ≡ F (x, a) = p − ax1/2 are independent of the time variable t, in which case the value function v depends only on the state variable x. Therefore, the HJB equation (4.3.2) becomes rv = inf [x + ka2 + vx · (p − ax1/2 )], a≥0
(4.3.7)
4
146
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
that attains its minimum at a∗ ≡ a∗ (x) = (2k)−1 vx · x1/2 . Substituting this value in (4.3.7) we obtain rv(x) = x + p · vx − (4k)−1 x · vx2 .
(4.3.8)
Guessing that a possible solution of this equation could be of the form v(x) = P x + Q, for some constants P, Q, substitution of this function v(x) in (4.3.8) gives that it indeed solves (4.3.8) provided that P and Q satisfy the equations rQ = pP and P 2 + 4krP − 4k = 0.
(4.3.9)
Hence v(x) = P x + Q solves (4.3.8) if Q = pP/r and P is the positive solution of (4.3.9). Further, since vx = P , the optimal ♦ control is a∗ (x) = (2k)−1 P x1/2 . Remark 4.17. For OCPs in an infinite horizon there are several optimality concepts, in addition to the discounted cost in Sect. 4.3. For instance, Carlson et al. (1991), Sect. 1.5, introduce the following concepts: (a) Strong optimality; (b) Overtaking optimality; (c) Weak overtaking optimality; (d) Finite optimality. They also show the following chain of implications: (a) ⇒ (b) ⇒ (c) ⇒ (d). ♦ Example 4.18 (An infinite-horizon discounted LQ problem). Consider an infinite-horizon scalar (i.e., X = A = R) LQ problem in which we wish to minimize ∞ e−rt [Qx2 (t) + Ra2 (t)]dt (4.3.10) V (x; a(·)) = 0
subject to x(t) ˙ = δx(t) + ηa(t),
t ≥ 0, x(0) = x.
(4.3.11)
The coefficients Q and R are both positive, and η = 0. Since the system (4.3.10)–(4.3.11) is time-homogeneous or time-invariant (in particular, all the coefficients are constant), the HJB equation (4.2.2) or (4.3.2) for v(x) becomes
rv = inf Qx2 + Ra2 + vx · (δx + ηa) . (4.3.12) a∈A
4.4
LONG-RUN AVERAGE COST PROBLEMS
147
As in similar LQ problems (see Examples 2.46 or 4.14, for instance), we seek a solution of (4.3.12) of the form v(x) = kx2 + h, where k > 0 and h are constants to be determined. With this value of v(·), (4.3.12) can be expressed as (rk − Q − 2δk)x2 = min{Ra2 + 2ηkxa}. a
The minimum is attained at the stationary Markov control f ∗ (x) = −R−1 ηkx ∀x ∈ R, and the value function is vr (x) = V (x, f ∗ ) = kx2 + h, where h = 0 and k is the positive solution of the quadratic equation η 2 k 2 + (r − 2δ)Rk − RQ = 0.
(4.3.13)
Finally, observe that f ∗ satisfies (4.3.3), i.e., e−rt v(x∗ (t)) → 0 as t → ∞. Therefore Theorem 4.15 gives that f ∗ is indeed optimal in the ♦ class Ax0 .
4.4 Long-Run Average Cost Problems Consider the infinite-horizon control system x(t) ˙ = F (x(t), a(t)), t ≥ 0, x(0) = x,
(4.4.1)
with state space X = Rn and control (or action) set A ⊂ Rm . We denote by A the family of piecewise–continuous control functions a(·) : [0, ∞) → A. For each control a(·) ∈ A and each T > 0, let T c(x(t), a(t))dt, JT (x, a(·)) := 0
(4.4.2)
148
4
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
where the running cost c is nonnegative. In this section we wish to minimize the long–run average cost (AC) J(x, a(·)) := lim sup T →∞
1 JT (x, a(·)) T
(4.4.3)
subject to (4.4.1). The AC-value function is J ∗ (x) := inf J(x, a(·)) a(·)
(4.4.4)
for all x ∈ X. As usual, a control function a∗ is said to be AC– optimal if J(x, a∗ (·)) = J ∗ (x) for all x ∈ X. We will assume the existence of a control function a(·) such that J(x, a(·)) < ∞ for every x ∈ X. This condition ensures that J ∗ (·) is a finite–valued function. The OCP (4.4.1)–(4.4.4) is, of course, a deterministic continuous–time version of the discrete–time AC problems in Sects. 2.5 and 3.7. Hence it is no surprise that some of the techniques for discrete–time problems are also applicable, with obvious changes, to the continuous–time case. These techniques include the average cost optimality equation (ACOE), the steady state approach, and the vanishing discount approach, which we introduce in the remainder of this section.
4.4.1 The Average Cost Optimality Equation (ACOE) In addition to the AC control problem (4.4.1)–(4.4.3), consider the operator La in the Remark 4.7(d), i.e, La v(x) = vx (x) · F (x, a)
(4.4.5)
for v ∈ C 1 (X). Note that, by (4.4.1) and the chain rule, La v(x) =
d v(x(t))|(x,a) . dt
(4.4.6)
4.4
LONG-RUN AVERAGE COST PROBLEMS
149
The ACOE approach to the AC problem is based on the so– called Poisson equation in the following lemma. Lemma 4.19. Let j ∈ R and h(·) ∈ C 1 (X) be such that the pair (j, h) satisfies the Poisson equation j = c(x, a) + La h(x) ∀ x, a.
(4.4.7)
Moreover, let a(·) ∈ A be a control function such that, as t → ∞, h(xa (t))/t → 0
(4.4.8)
for every initial state x(0) = x, where xa is the solution of (4.4.1) when using the control a(·). Then (a) j = J(x, a(·)) for all x ∈ X. (b) If the equality in (4.4.7) is replaced by ≥ (resp. ≤), then in (a) we have j ≥ J(x, a(·)) (resp. ≤) for all x. Proof. (a) For notational ease, we will write xa (·) as x(·). Then, by the Remark 4.7(d), the Poisson equation (4.4.7) yields j = c(x(t), a(t)) +
d h(x(t)) ∀ t ≥ 0. dt
It follows that, for all T > 0, T Tj = c(x(t), a(t))dt + h(x(T )) − h(x).
(4.4.9)
(4.4.10)
0
Multiplying both sides by 1/T and then letting T → ∞, from (4.4.8) and (4.4.3) we obtain part (a). (b) In (4.4.7) replace = with either ≥ or ≤. Then in (4.4.9)– (4.4.10) we obtain ≥ or ≤ , respectively, in lieu of = . Remark. Observe that Lemma 4.19 tacitly assumes the existence of a solution (j, h) to the Poisson equation (4.4.7). In other words, the lemma itself does not guarantee the existence of such a solution. If, however, that solution exists, then necessarily j is unique and h is unique up to additive constants. A similar result holds for the ACOE (4.4.12) below. (See Exercise 4.14.) ♦
4
150
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
From the Poisson equation (4.4.7) we obtain the average cost optimality equation (ACOE) in Theorem 4.20, below, where we use the following notation: Given a function h : X → R, we denote by Ah the family of controls a(·) ∈ A such that 1 h(xa (t)) → 0 as t → ∞ t
(4.4.11)
where xa is the solution of (4.4.1) when using the control a, for any initial state x(0) = x. The ACOE (4.4.12) is also known as the HJB (or the dynamic programming or simply the Bellman) equation for the AC control problem (4.4.1)–(4.4.3). Theorem 4.20. Let us assume that j ∈ R and h ∈ C 1 (X) form a solution to the ACOE j = inf [c(x, a) + La h(x)] ∀ x ∈ X. a∈A
(4.4.12)
Then, for every initial state x(0) = x, (a) j ≤ J(x, a(·)) for all a(·) ∈ Ah ; hence (b) j ≤ J ∗ (x) if A = Ah . In addition, let us suppose that there exists a control a∗ (·) ∈ Ah that attains the minimum in the right–hand side of (4.4.12), i.e., for every x ∈ X, a∗ (x) ∈ A is such that j = c(x, a∗ (x)) + La
∗ (x)
h(x) ∀ x ∈ X.
(4.4.13)
Then, for all x ∈ X, (c) j = J(x, a∗ (·)) ≤ J(x, a(·)) for all a(·) ∈ Ah ; hence (d) a∗ (·) is AC–optimal and J(·, a∗ (·)) ≡ J ∗ (·) ≡ j if A = Ah . Proof. If (4.4.12) holds, then j ≤ c(x, a) + La h(x) ∀ (x, a) ∈ X × A. Consequently, (a) follows from Lemma 4.19(b). On the other hand, if A = Ah , then (b) follows from (a).
4.4
LONG-RUN AVERAGE COST PROBLEMS
151
Suppose now that the Poisson equation (4.4.13) holds. Then Lemma 4.19(a) yields the equality in (c), whereas the inequality is obtained from (a). Finally, (d) follows from (c). Remark 4.21. If the pair (j, h) is a solution to the ACOE (4.4.12), then it is also called a canonical pair. Similarly, if (4.4.13) holds, then (j, h, a∗ ) is said to be a canonical triplet. Unfortunately, Theorem 4.20 does not say how to solve the ACOE, and all the known results (see, for instance, Arisawa (1997)) impose restrictive conditions such as compact state space and/or compact control set and/or bounded running cost. However, none of these conditions is satisfied in the following LQ example but still we do obtain a canonical triplet. ♦ Example 4.22. We again consider the scalar LQ system in Example 4.18 with state equation in (4.3.11), i.e., x(t) ˙ = δx(t) + ηa(t), t ≥ 0, x(0) = x, and running cost c(x, a) = Qx2 + Ra2 . In this case the ACOE (4.4.12) is j = inf [Qx2 + Ra2 + h (x) · (δx + ηa)]. a
(4.4.14)
In view of previous LQ examples, we conjecture that the function h is of the form h(x) = bx2 for some constant b > 0. To verify that this is indeed the case, we insert h in (4.4.14) and obtain that, for each x ∈ X, the minimum is attained at a∗ (x) = −Bx with B := R−1 bη.
(4.4.15)
Therefore, (4.4.14) can be expressed as j = (Q + 2bδ − R−1 b2 η 2 )x2
∀ x ∈ X.
This equation holds provided that j = 0 and b solves the quadratic equation R−1 η 2 b2 − 2δb − Q = 0. (4.4.16) To proceed further observe that, with a∗ as in (4.4.15), the state equation becomes
4
152
x(t) ˙ = Δx(t),
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
with Δ := R−1 (Rδ − bη 2 ).
Now, let us assume that b is such that Δ < 0. Then the corresponding state trajectory x∗ (t) = x0 eΔt → 0 as t → ∞
(4.4.17)
for every initial state x0 . Moreover, J(x, a∗ (·)) = 0 = j for all x0 = x. Finally, from (4.4.15)–(4.4.16) we conclude that, if b is the positive solution of (4.4.16), then (j, h(·), a∗ (·)) is a canonical triplet for the ACOE and, by Theorem 4.20, a∗ is AC–optimal in the class of controls Ah . It should be noted that the results in this “scalar” LQ example are valid in the general vector case, with state and control spaces X = Rn and A = Rm , respectively, except that now we require special concepts from linear systems theory, say “stabilizability”, “detectability”, and so forth. For details, see Theorem 5.4.4 in Davis (1977), for instance. ♦ Example 4.23 (Example 4.16 cont’d.). Let us consider the AC problem (4.4.1)–(4.4.3) with transition function F (x, a) and running cost c(x, a) as in the Example 4.16, i.e., and F (x, a)=p − ax1/2 ,
c(x, a)=x+ka2
with x(0) = x > 0, (4.4.18) where k and p are given positive constants. The OCP is to minimize the average cost J(x, a(·)) in (4.4.3) over all the controls a(·) ≥ 0 in A. In this case, the ACOE (4.4.12) becomes j = inf [x + ka2 + h (x) · (p − ax1/2 )], a≥0
(4.4.19)
where h (x) = dh(x)/dx. By comparison with the r-discount HJB equation (4.3.7), we will propose h(x) = Rx as a possible solution of (4.4.19), with the coefficient R to be determined. Replacing h(·) in (4.4.19) we obtain j = x + pR + inf [ka2 − Rx1/2 a], a
(4.4.20)
4.4
LONG-RUN AVERAGE COST PROBLEMS
153
which attains the minimum at a∗ (x) = (2k)−1 Rx1/2
∀x ≥ 0.
(4.4.21)
With this value of a = a∗ , (4.4.20) becomes j = pR + (1 − R2 /4k)x ∀x ≥ 0, which yields j = pR and R2 /4k = 1, i.e., R = 2k 1/2 . Hence, we already have a canonical triplet (j, h, a∗ ), that is, a triplet satisfying (4.4.13). Finally, to use Theorem 4.20 we will identify the family Ah of controls that satisfy (4.4.11). Let F (x, a) = p − ax1/2 be as in (4.4.18), and a(·) = a∗ (·) as in (4.4.21). Then the corresponding state trajectory x∗ (·) is the solution of the linear equation x(t) ˙ = −δx(t) + p,
with δ := R/2k = k −1/2 ,
for each initial x(0) = x0 > 0. Hence, for some constant C, x∗ (t) = Ce−δt + p/δ, which yields (4.4.11), that is, since h(x) = Rx, 1 h(x∗ (t)) → 0 as t → ∞. t
(4.4.22)
Therefore, by Theorem 4.20(c) we conclude that a∗ (·) in (4.4.21) is AC-optimal within the class of controls that satisfy (4.4.22). ♦
4.4.2 The Steady-State Approach Let us consider again the AC control problem (4.4.1)–(4.4.3). As in the discrete-time case (see (2.5.17)), the steady-state approach to the AC problem hinges on the existence of state-action pairs (¯ x, a ¯) which are “steady” in the sense that F (¯ x, a ¯) = 0. Let K ⊂ X × A be the set of all such pairs. Then a pair (¯ x, a ¯) in K is said to be a minimum steady pair if it is a solution of the constrained optimization problem:
154
4
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
minimize c(x, a) subject to F (x, a) = 0.
(4.4.23)
We denote by K∗ the family of minimum steady pairs. We wish to find conditions under which a minimum steady pair gives the ACvalue function J ∗ in (4.4.4). In particular, the following Theorem 4.24 mimics the discrete-time result in Theorem 2.53. Theorem 4.24. Suppose that the running cost c is continuous and, in addition, the AC control problem (4.4.1)–(4.4.3) is such that: (a1 ) There exists a minimum steady pair (x∗ , a∗ ) ∈ K∗ . (a2 ) The control system is dissipative with respect to the pair (x∗ , a∗ ) in (a1 ), in the following sense: There exists a realvalued function l ∈ C 1 (X) such that c(x∗ , a∗ ) ≤ c(x, a) + La l(x)
(4.4.24)
for all x ∈ X, a ∈ A. (a3 ) The control system is stabilizable, that is, there exists a control a ¯(·) ∈ Al such that, as t → ∞, (¯ x(t), a ¯(t)) → (x∗ , a∗ )
(4.4.25)
with Al as in (4.4.11). Then j ∗ := c(x∗ , a∗ ) is such that, for all x ∈ X, (b1 ) J(x, a ¯(·)) = j ∗ and, moreover, j ∗ ≤ J(x, a(·)) for all a(·) ∈ Al ; ¯(·) is AC-optimal and the AC-value function (b2 ) The control a ∗ ∗ is J (·) ≡ j if A = Al . Proof. (b1 ) Since c is continuous, (4.4.25) yields c(¯ x(t), a ¯(t)) → c(x∗ , a∗ ) =: j ∗ as t → ∞. This implies that J(·, a ¯(·)) ≡ j ∗ . Moreover, the inequal∗ ity j ≤ J(·, a(·)) follows from (4.4.24) and Lemma 4.19(b) (with the inequality ≤). On the other hand, if A = Al , then (b2 ) follows from (b1 ).
4.4
LONG-RUN AVERAGE COST PROBLEMS
155
Remark 4.25. (a) Under the hypotheses of Theorem 4.24, (x∗ , a∗ ) is a so-called minimum pair in the sense that j ∗ := c(x∗ , a∗ ) = inf inf J(x, a(·)). x a(·)
(b) Recalling (4.4.6), we can rewrite (4.4.24) as t l(x(t)) − l(x) + [c(x(s), a(s)) − c(x∗ , a∗ )]ds ≥ 0 0
for any solution (x(·), a(·)) of (4.4.1). (Reader beware: The definition of “dissipativity” is not standard; that is, different authors may use different definitions.) Example 4.26. Consider the LQ system in Example 4.22, in which the system function and the running cost are F (x, a) = δx + ηa and c(x, a) = Qx2 + Ra2 , respectively. Hence, (x, a) is a steady state-action pair if F (x, a) = 0, which holds if a = −δx/η. Replacing this value in c(x, a) we see that c(x, a) = (Q + Rδ 2 /η 2 )x2 . Therefore, we have the minimum steady pair (x∗ , a∗ ) = (0, 0), that is, the hypothesis (a1 ) in Theorem 4.24. The hypotheses (a2 ) and ♦ (a3 ) can be obtained from (4.4.14)–(4.4.17). Example 4.27. We continue Example 4.23. From (4.4.18), F (x, a) = 0 if a = px−1/2 . With this value of a, we obtain c(x, a) = x + ka2 = x + kp2 /x which is minimized at x = pk 1/2 > 0. Thus, we have the minimum steady pair (x∗ , a∗ ) = (pk 1/2 , p1/2 k −1/4 ), with corresponding minimum AC cost j ∗ = c(x∗ , a∗ ) = 2pk 1/2 as in Example 4.23. The latter example also yields the remaining parts of Theorem 4.24. ♦ Remark 4.28. In Example 4.29, below, we wish to maximize over A a long-run average reward (AR) defined as
4
156
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
1 JAR (x, a(·)) := lim inf T →∞ T
T
r(x(t), a(t))dt 0
subject to (4.4.1), where r(x, a) is the given running (or instantaneous) reward function. In this case, the ACOE (4.4.12) is replaced by the average reward optimality equation (AROE) ρ = sup[r(x, a) + La h(x)] ∀x ∈ X
(4.4.26)
a∈A
for some function h ∈ C 1 (X) and some constant ρ. Theorems 4.20 and 4.24 are modified accordingly. In particular, the existence of a minimum steady pair is replaced by a maximum steady pair (x∗ , a∗ ) that solves the constrained optimization problem: Maximize r(x, a) subject to F (x, a) = 0,
(4.4.27)
whereas the dissipativity inequality (4.4.24) is replaced by r(x∗ , a∗ ) ≥ r(x, a) + La h(x) for all x ∈ X, a ∈ A.
♦
Example 4.29 (Control of pollution accumulation). The control of pollution accumulation is a standard OCP in environmental economics since the years 1970s. Here we consider a special case of an application by Kawaguchi (2003) in which he wishes to obtain a consumption strategy a(·) that maximizes the long-run average welfare 1 T [U (a(t)) − D(x(t))]dt (4.4.28) JAR (a(·)) = lim inf T →∞ T 0 where x(t) is the stock of pollution at time t associated to a(·). Furthermore, U : A → [0, ∞) is a social utility function and D : X → [0, ∞) is a disutility function, with state and control sets A = X = [0, ∞). Both functions are assumed to be continuously differentiable functions and satisfying suitable concavity/convexity conditions. The stock of pollution evolves according to the equation
4.4
LONG-RUN AVERAGE COST PROBLEMS
x(t) ˙ = a(t) − ϕ(x(t)), with x(0) = x0 > 0
157
(4.4.29)
where ϕ(·) denotes the rate of pollution decay. Kawaguchi gives general conditions ensuring that the HJB equation associated to (4.4.28)–(4.4.29) has a “classical” solution. Here, however, to illustrate the steady state approach we will suppose that the utility and disutility functions and the pollution decay are of the form U (a) = 2a1/2 , D(x) = d1 x, ϕ(x) = d0 x
(4.4.30)
for some positive constants d0 , d1 . Hence, to obtain a maximum steady state-action pair first note that, from (4.4.29)–(4.4.30), F (x, a) = a − ϕ(x) = a − d0 x = 0 holds if a∗ = a∗ (x) = d0 x. With this value of a∗ the running reward r(x, a) := U (a) − D(x) becomes r(x, a) = 2(d0 x)1/2 − d1 x, which is maximized at x∗ = d0 /d21 . Therefore, we have obtained a maximum steady pair (x∗ , a∗ ) = (d0 /d21 , (d0 /d1 )2 ), as in the hypothesis (a1 ) of Theorem 4.24. Hence r∗ := r(x∗ , a∗ ) = U (a∗ ) − D(x∗ ) = d0 /d1 .
(4.4.31)
To verify the hypotheses (a2 )-(a3 ), let us try to obtain the AROE in (4.4.26), i.e., from (4.4.28)–(4.4.30), r∗ = sup[r(x, a) + h (x)F (x, a)] a>0
= sup[U (a) − D(x) + h (x)(a − d0 x)] a>0
= −(d1 + d0 h (x))x + sup[2a1/2 + h (x)a].
(4.4.32)
a>0
Clearly, the maximum at the right-hand side of (4.4.32) is attained at a∗ (x) = 1/(h (x))2 . (4.4.33) Finally, in (4.4.32) let us try a solution h(·) of the form h(x) = h0 x for a constant h0 to be determined. We then see that (4.4.32) indeed holds with h(x) = h0 x if h0 = d1 /d0 . Moreover, (4.4.33)
158
4
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
shows that the AR-optimal control is the constant a∗ (·)≡(d0 /d1 )2 , ♦ which is the same as a∗ in (4.4.31).
4.4.3 The Vanishing Discount Approach As in Sect. 2.5 for discrete-time OCPs, we can relate AC and discounted cost problems directly from the definition of a discounted cost functional. To this end, consider the problem of minimizing the discounted cost ∞ Vr (x, a(·)) = e−rt c(x(t), a(t))dt (4.4.34) 0
for a given discount factor r > 0, subject to the dynamics (4.4.1). Inside the integral in (4.4.34) replace the cost c(x, a) by c(x, a) ± M for some constant M . Then (4.4.34) can be expressed as ∞ M e−rt [c(x(t), a(t)) − M ]dt + ; Vr (x, a(·)) = r 0 that is,
rVr (x, a(·)) = M + r
∞
e−rt [c(x(t), a(t)) − M ]dt.
0
In particular, taking M as the average cost j(x) := J(x, a(·)), we obtain ∞ rVr (x, a(·)) = j(x) + r e−rt [c(x(t), a(t)) − j(x)]dt. (4.4.35) 0
This relation obviously suggests that, as r → 0+ , rVr (x, a(·)) → j(x)
(4.4.36)
provided that the rightmost term in (4.4.35) tends to 0 as r ↓ 0. An example of this situation is in the context of Theorem 4.24, as in the following Proposition 4.30. (The proof is left to the reader: Exercise 4.11.) Proposition 4.30. Suppose that c is continuous and that the hypotheses (a1 ) and (a3 ) of Theorem 4.24 hold; that is, there exists
4.4
LONG-RUN AVERAGE COST PROBLEMS
159
a minimum steady pair (x∗ , a∗ ) and a control a¯(·) that satisfy (4.4.25). Let j ∗ := c(x∗ , a∗ ). Then ¯(·)) = J(x, a ¯(·)) = j ∗ lim rVr (x, a r↓0
for every initial state x(0) = x. In general, (4.4.36) can be obtained by means of an Abelian theorem as in parts (b)–(c) of the next lemma, where we use essentially the same terminology as in Lemma 2.56. Lemma 4.31. For t ≥ 0, let ψ(t) be a nondecreasing continuous function with ψ(0) = 0. Define the upper and lower (Ces`aro) limits C L := lim inf ψ(t)/t, C U := lim sup ψ(t)/t, t→∞
t→∞
and the lower and upper Abelian limits ∞ L −rt e dψ(t), AU := lim sup r A := lim inf r r↓0
r↓0
0
∞
e−rt dψ(t).
0
Suppose that C U < ∞.
(4.4.37)
Then: ∞ ∞ (a) 0 e−rt dψ(t) = r 0 e−rt ψ(t)dt for every r > 0. (b) C L ≤ AL ≤ AU ≤ C U . (c) If the limit j := limt→∞ ψ(t)/t exists, then ∞ lim r e−rt dψ(t) = j. r↓0
0
(In other words, if C L = C U = j, then AL = AU = j.) Proof. Part (a) follows from the integration-by-parts formula t t −rs −rt e dψ(s) = e ψ(t) + r e−rs ψ(s)dt, (4.4.38) 0
0
and noting that (4.4.37) implies lim sup e−rt ψ(t) = 0. t→∞
(4.4.39)
4
160
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
To prove the third inequality in (b), we use (4.4.37) again and choose an arbitrary > 0 and let T = T () be such that sup ψ(s)/s ≤ C U + ∀t ≥ T. s≥t
Hence
r
t
2
−rs
e
t
se−rs [ψ(s)/s]ds T t U 2 ≤ (C + )r se−rs ds
ψ(s)ds = r
T
2
T
≤ C U + ,
t because r2 0 se−rs ds = 1 − re−rt (1/r + t) ≤ 1. Thus, for t ≥ T , (4.4.38) gives t T −rs −rt 2 r e dψ(s) ≤ re ψ(t) + r e−rs ψ(s)ds + C U + . 0
0
Letting t → ∞ and then r ↓ 0, from (4.4.39) we obtain the third inequality in (b), since was arbitrary. The first inequality is proved similarly, and the second one is obvious. Finally, Part (c) follows from (b). Lemma 4.31 is a well-known result in Laplace transform theory (see, for instance, Widder (1941), pp. 181–182). Let us now suppose that the instantaneous cost function c(x, a) is continuous and nonnegative. Let a(·) ∈ A be any given control function and x(·) the corresponding solution of (4.4.1). In addition, let t
ψ(t) :=
c(x(s), a(s))ds, t ≥ 0.
0
Then (4.4.2)–(4.4.3) yield that the average cost J(x, a(·)) can be expressed as J(x, a(·)) = lim sup ψ(t)/t t→∞
whereas the r-discounted cost becomes
4.4
LONG-RUN AVERAGE COST PROBLEMS
Vr (x, a(·)) : =
∞
0 ∞
=
161
e−rt c(x(t), a(t))dt e−rt dψ(t).
0
Hence, from the third inequality in Lemma 4.31(b), lim sup rVr (x, a(·)) ≤ J(x, a(·)). r↓0
Consequently, since the control function a(·) was arbitrary, we conclude the following. Proposition 4.32. The AC-value function J ∗ (·) and the rdiscount value function Vr (x) := inf a(·) Vr (x, a(·)) satisfy that lim sup rVr (x) ≤ J ∗ (x)
(4.4.40)
r↓0
for all x ∈ X. In words, (4.4.40) states that, for r sufficiently small, rVr (·) is a lower bound for J ∗ (·). We are actually interested in the “convergence”, in some sense, of rVr to some particular value of J ∗ as r ↓ 0. We next explain this. Suppose that the value function Vr is in C 1 (X) and satisfies the HJB equation (4.3.2) in the autonomous (or time-homogeneous) case, i.e., rVr (x) = min[c(x, a) + La Vr (x)], x ∈ X. a∈A
(4.4.41)
Now pick (and fix) a state x¯ ∈ X, and let mr := rVr (¯ x)
and
lr (x) := Vr (x) − Vr (¯ x)
(4.4.42)
for x ∈ X. Then we can rewrite (4.4.41) as mr + rlr (x) = min[c(x, a) + La lr (x)]. a∈A
Finally, the key step is to find a sequence rn ↓ 0 and a pair (j, l(·)) in R × C 1 (X) such, as n → ∞,
162
4
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
mrn → j,
lrn (·) → l(·),
(4.4.43)
and (j, l(·)) satisfies either the ACOE (4.4.12) or the AC optimality inequality (ACOI) j ≥ inf [c(x, a) + La l(x)]. a∈A
(4.4.44)
Hence, if there exists a control a∗ (·) ∈ Ah that attains the minimum in (4.4.44), i.e., j ≥ c(x, a∗ (x)) + La
∗ (x)
l(x) ∀x ∈ X,
(4.4.45)
then a∗ (·) is AC-optimal and j is the AC-value function, i.e., J ∗ (·) ≡ j. The good news is that the procedure (4.4.43) works in many particular AC control problems. The bad news however is that, to the best of our knowledge, there are no general results ensuring the existence of such a pair (j, l(·)). (The existing results require restrictive hypotheses. See Bardi and Capuzzo-Dolcetta (1997), Sect. VII.1, for instance.) We will next show some particular cases. Example 4.33 (Example 4.16 cont’d.). Consider the AC control problem (4.4.1)–(4.4.3) with system function and running cost as in Example 4.16, that is, F (x, a) = p − ax1/2 ,
c(x, a) = x + ka2
with k and x(0) = x0 both positive. In the r-discounted case, Example 4.16 shows that the OCP value function and the roptimal control are Vr (x) = P (r)x + Q(r) and a∗ (x) = (2k)−1 P (r)x1/2 , where Q(r) = pP (r)/r and and P = P (r) is the positive solution of (4.3.9). Hence, for any given (fixed) state x¯ ≥ 0 (4.4.42) gives mr = rVr (¯ x) = r[P (r)¯ x + Q(r)],
lr (x) = Vr (x) − Vr (¯ x) = P (r)(x − x ¯).
It follows that, as r ↓ 0, mr → pP (0) and lr (x) → P (0)(x − x¯).
4.5
THE POLICY IMPROVEMENT ALGORITHM
163
In other words, (4.4.43) holds with j = pP (0) and l(x) = P (0)(x − x¯), where P (0) = 2k 1/2 is the positive solution of (4.3.9). Moreover, the AC-optimal control is a∗ (x) = (2k)−1 P (0)x1/2 . ♦ Example 4.34 (Example 4.18 cont’d.). Consider again the rdiscounted LQ problem in Example 4.18, in which the optimal control is fr∗ (x) = −R−1 ηkx with R and η as in (4.3.10)–(4.3.11), and k = k(r) is the unique positive solution of (4.3.13). The corresponding value function is vr (x) = k(r)x2 . In (4.4.42) we can take an arbitrary state x¯. However, to simplify the presentation we take x¯ = 0. Therefore, mr = 0 and lr (x) = vr (x) = k(r)x2 → k(0)x2 as r ↓ 0, where k(0) = (RQ)1/2 /η is the positive solution of (4.3.13) with r = 0.
4.5 The Policy Improvement Algorithm We will now introduce the policy improvement (or policy iteration) algorithm (PIA) for the discounted and the average cost problems in Sects. 4.3 and 4.4. The general ideas are, of course, similar to the discrete-time cases in Sects. 2.4 and 3.6 for deterministic and stochastic problems, respectively. Namely, we wish to find a sequence of control functions an ∈ A such that, for each n, an+1 improves an in the sense that the cost vn+1 when using an+1 is “better” than the cost vn when using an , because vn+1 ≤ vn . Therefore, the cost functions vn form a monotone nonincreasing sequence.
4
164
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
4.5.1 The PIA: Discounted Cost Problems Given a discount factor r > 0, the OCP is to minimize the discounted cost ∞ e−rt c(x(t), a(t))dt (4.5.1) V (x, a(·)) := 0
subject to x(t) ˙ = F (x(t), a(t)),
t ≥ 0,
x(0) = x.
(4.5.2)
The running cost c is supposed to be nonnegative, and the controls are restricted to the class A∗SM ⊂ A of stationary Markov controls for which (4.5.3) e−rt V (x(t), a(·)) → 0 as t → ∞. Recall from Chap. 1 that a stationary Markov control (also known as a feedback or closed-loop control) is a function f (·) : X → A such that, at any time t ≥ 0, the control action is f (x) ∈ A if x(t) = x. We will denote by ASM the family of stationary Markov controls, and by A∗SM the subset of controls that satisfy (4.5.3). For notational convenience we will write Markov controls either as f (·) or a(·). Moreover, we will write c(x, f (x)) and F (x, f (x)) as c(x, f ) and F (x, f ), respectively. Before describing the PIA let us note the following. Initialization. Let f0 ∈ A∗SM be a control with discounted cost 0 v (·) := V (·, f0 ) ∈ C 1 (X), so that rv 0 (x) = c(x, f0 ) + vx0 (x)F (x, f0 )
(4.5.4)
for all x ∈ X. The latter equation yields rv 0 (x) ≥ inf [c(x, a) + vx0 (x)F (x, a)]. a∈A
(4.5.5)
Let us now assume that there exists f1 ∈ A∗SM that attains the minimum in (4.5.5); that is, for each x ∈ X, f1 (x) ∈ A is such that
4.5
THE POLICY IMPROVEMENT ALGORITHM
165
c(x, f1 ) + vx0 (x)F (x, f1 ) = min[c(x, a) + vx0 F (x, a)]. a∈A
Therefore, we can rewrite the inequality (4.5.5) as rv 0 (x) ≥ c(x, f1 ) + vx0 (x)F (x, f1 ) ∀x ∈ X.
(4.5.6)
A key step in the PIA is to show that (4.5.6) implies that f1 improves f0 in the sense that v 0 (·) ≥ v 1 (·), where v 1 (x) := V (x, f1 ). More precisely, we have the following. Lemma 4.35. For any two controls f ≡ f0 and g ≡ f1 in A∗SM that satisfy (4.5.4) and (4.5.6) we have V (x, f ) ≥ V (x, g) for all x ∈ X. Proof. Let u(t, x) := e−rt v 0 (x). Then, from (4.1.12), Lf1 u(t, x) = ut + ux · F (x, f1 ) = e−rt [vx0 (x) · F (x, f1 ) − rv 0 (x)] ≤ −e−rt c(x, f1 ). [by (4.5.6)] Thus, recalling from Remark 4.7(c) that La u(t, x) = du(t, x(t))/ dt|(x,a) , integration of both sides of the latter inequality from t = 0 to t = T gives T −rT 0 0 e v (x(T )) − v (x(0)) ≤ − e−rt c(x(t), f 1 )dt. 0
Finally, letting T → ∞ we see that −v 0 (x) ≤ −v 1 (x), which yields the desired result. Having the Initialization step and Lemma 4.35 we can proceed with the PIA as follows, where fn is a control in A∗SM with rdiscounted cost v n (·) := V (·, fn ) ∈ C 1 (X). (PI1 ) Given fn (n = 0, 1, . . .) compute the corresponding discounted cost v n (·) so, for every x ∈ X, rv n (x) = c(x, fn )) + vxn (x)F (x, fn ). Hence rv n (x) ≥ inf [c(x, a) + vxn (x)F (x, a)]. a∈A
(4.5.7)
4
166
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
(PI2 ) Policy improvement. Assume that there exists fn+1 ∈ such that, for every x ∈ X, fn+1 (x) ∈ A attains the minimum in (4.5.7), that is, A∗SM
c(x, fn+1 ) + vxn (x)F (x, fn+1 ) = min[c(x, a) + vxn (x)F (x, a)]. a∈A
(4.5.8) If v (·) ≡ v (·), then stop the algorithm because v is the optimal discounted cost (see Proposition 4.36(a) below). Otherwise, replace n by n + 1 and go back to (PI1 ). The existence of fn+1 as in (PI2 ) is ensured by well known results. See, for instance, Lemma 2.16(a) above or Theorems B.8 and B.9 in Appendix B. A first step on the convergence of the PIA is the following. n+1
n
n
Proposition 4.36. Let fn and v n (n = 0, 1, . . .) be as in (PI1 ) and (PI2 ). (a) If for some n we have v n (x) = v n+1 (x) for all x ∈ X, then v n (·) ≡ V (·) is the discounted value function, and fn and fn+1 are optimal controls. (b) In general, there exists a function v ≥ 0 such that, for every x ∈ X, v n (x) ↓ v(x). Proof. (a) If v n (·) = v n+1 (·), then in the right-hand side of (4.5.7) we can replace v n with v n+1 , which combined with (4.5.8) gives rv n (x) ≥ inf [c(x, a) + vxn+1 (x)F (x, a)] = rv n+1 (x) a
for all x ∈ X. Therefore, v(·) := v n (·) = v n+1 (·) satisfies the rdiscounted cost HJB equation rv(x) = min[c(x, a) + vx (x) · F (x, a)], a∈A
(4.5.9)
and fn and fn+1 are optimal controls. (b) This part is a consequence of the monotonicity of {v n }. Remark 4.37. Unfortunately, the conditions we have so far on the OCP (4.5.1)–(4.5.2) are not enough to guarantee two key steps in the PIA, namely:
4.5
THE POLICY IMPROVEMENT ALGORITHM
167
(I) Convergence to the value function; that is, it remains to show that the limiting function v in Proposition 4.36(b) is in C 1 (X) and that it satisfies the HJB equation (4.5.9). (II) Convergence of controls; that is, convergence of fn (or a subsequence thereof) to a control f ∈ A∗SM that attains the minimum in (4.5.9), and so it is optimal for (4.5.1)–(4.5.2). To deal with (I), there are three usual options: I1 . Impose suitable assumptions on the OCP, as in Doshi (1976a) or Jacka and Mijatovi´c (2017), for instance. Typically, the idea is to impose conditions ensuring that the Arzela-Ascoli Theorem in Remark 2.59 is applicable, and then one shows that v preserves some of the properties of v n , such as v ∈ C 1 (X). (As an example of this approach see Kawaguchi (2003)). In general, however, the assumptions are so restrictive that are not applicable to, for instance, our Example 4.38 below. (Indeed, the conditions on the PIA usually require compact state space X and/or compact control set A and/or bounded running cost c(x, a). None of these conditions, however, is satisfied in the Example 4.38.) I2 . Use a numerical approach, as in Alla et al. (2015) or Wei et al. (2020). I3 . The direct approach: Verify the convergence of v n to a function v ∈ C 1 (X), and then show that v indeed satisfies (4.5.9). (See Example 4.38.) Concerning (II), we can try a direct approach, as in I3 above, or (if possible) use a general result such as Theorem B.10 or Proposition B.12 in the Appendix B. ♦ Example 4.38. Consider the scalar LQ problem in Example 4.18, where ∞ e−rt [Qx2 (t) + Ra2 (t)]dt (4.5.10) V (x, a(·)) = 0
and x(t) ˙ = δx(t) + ηa(t),
t ≥ 0,
x(0) = x.
(4.5.11)
Recall that Example 4.18 requires η = 0 and R > 0. Let f0 be the linear Markov control f0 (x) := C0 x for some constant C0 . In this
4
168
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
case (4.5.11) becomes x(t) ˙ = D0 x(t),
with D0 := δ + ηC0 ,
so x(t) = xeD0 t for all t ≥ 0. Hence, with a(t) := f0 (x(t)) = C0 x(t) in (4.5.10), we see that v 0 (x) := V (x, f0 ) is given by ∞ 0 2 2 e−rt e2D0 t dt v (x) = (Q + RC0 )x 0
= F0 x2
with F0 :=
Q + RC02 r − 2D0
(4.5.12)
if r − 2D0 > 0, which is assumed hereafter. This implies that, in particular, e−rt v 0 (x(t)) = F0 x2 e−(r−2D0 )t → 0 as t → ∞, so f0 is indeed in the class of stationary Markov controls that satisfy (4.5.3). Moreover, v 0 is in C 1 (X) and it can be directly verified that (4.5.4) holds, i.e., rv 0 (x) = c(x, f0 ) + 2F0 x · F (x, f0 ) ∀x ∈ X. The latter equality yields that, for all x ∈ X, rv 0 (x) ≥ inf [c(x, a) + 2F0 x · (δx + ηa)], a
(4.5.13)
and the minimum at the right-hand side is attained at f1 (x) = C1 x with C1 := −F0 η/R. With this value of a = f1 (x) in (4.5.11) we obtain x(t) ˙ = D1 x(t) with D1 := δ + ηC1 , so x(t) = xeD1 t for all t ≥ 0. Similarly, from (4.5.10) we see that v 1 (x) := V (x, f1 ) is given by v 1 (x) = F1 x2 if r − 2D1 > 0.
with F1 :=
Q + RC12 r − 2D1
4.5
THE POLICY IMPROVEMENT ALGORITHM
169
In general, if for some n = 0, 1, . . . we are given that fn (x) := Cn x for some Cn , then x(t) ˙ = Dn x(t) with Dn := δ + ηCn , so x(t) = xeDn t for all t ≥ 0, and v n (x) := V (x, fn ) = Fn x2 ,
with Fn :=
Q + RCn2 r − 2Dn
provided that r − 2Dn > 0. The latter condition yields that fn satisfies (4.5.3) and, on the other hand, one can directly verify that (4.5.4) holds with v n and fn . Furthermore, as in (4.5.13), we can see that, for all x ∈ X, fn+1 (x) = Cn+1 x and v n+1 (x) = Fn+1 x2 with Cn+1 := −Fn η/R,
Dn+1 := δ + ηCn+1 ,
and Fn+1 :=
2 Q + RCn+1 r − 2Dn+1
(4.5.14)
(4.5.15)
if r − 2Dn+1 > 0. Now, in (4.5.15) replace Cn+1 and Dn+1 by their values in (4.5.14) to obtain that Fn+1 =
QR + (Fn η)2 . (r − 2δ)R − 2Fn η 2
(4.5.16)
Then a direct calculation gives that Fn+1 ≤ Fn for all n = 0, 1, . . .; that is, the sequence {Fn } is nonincreasing. Therefore, there exists a number k ≥ 0 such that Fn ↓ k. In fact, letting n → ∞ in (4.5.16) we see that k is the same as the positive solution of the quadratic equation (4.3.13). In other words, we conclude that the direct approach in I3 above works very nicely in this example, that is, the controls fn and the value functions v n converge to the values f ∗ and vr in Example 4.18. ♦
4
170
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
Remark 4.39. In Chap. 6, below, we will study stochastic differential equations (SDEs) of the form dx(t) = F (x(t), a(t))dt + σ(x(t))dW (t),
(4.5.17)
which of course is an “extension” of the deterministic equation (4.5.2). An obvious question is if results for (4.5.17), such as the PIA, are applicable to the deterministic case. The answer, in general, is negative! Indeed, one should be careful because some results for SDEs require σ(x) to be nonzero. For instance, a typical condition is that σ(0) may or may not be 0, but |σ(x)| > 0 for all x = 0. (See Eq. (6.4.19), for instance.) ♦
4.5.2 The PIA: Average Cost Problems We will now consider the long-run average cost (AC) problem in (4.4.1)–(4.4.3). Hence, we wish to minimize over a(·) ∈ A the AC defined as 1 T J(x, a(·)) := lim sup c(x(t), a(t))dt (4.5.18) T →∞ T 0 subject to x(t) ˙ = F (x(t), a(t)),
t ≥ 0,
x(0) = x.
(4.5.19)
Recall from Sect. 4.4 that the cost c(x, a) is assumed to be nonnegative. Given a function h ∈ C 1 (X), we will denote by AhSM the family of stationary Markov controls a ∈ ASM that satisfy (4.4.8), i.e., h(xa (t))/t → 0 as t → ∞,
(4.5.20)
where xa (·) stands for the state process in (4.5.19) when using the control function a(·). Let La h(x) be as in (4.4.5)–(4.4.6). The PIA in the AC case hinges on the Poisson equation (4.4.7) and an AC-analogue of Lemma 4.35, which is the following.
4.5
THE POLICY IMPROVEMENT ALGORITHM
171
Lemma 4.40. Given a stationary Markov control f , let us suppose that there exists a constant j(f ) and a function hf ∈ C 1 (X) that satisfy the Poisson equation j(f ) = c(x, f ) + hfx (x) · F (x, f ) ∀x ∈ X,
(4.5.21)
where we are using the notation introduced in Sect. 4.5.1, namely, c(x, f ) := c(x, f (x)) and F (x, f ) := F (x, f (x)). We assume that a(t) = f (x(t)) satisfies (4.5.20). By (4.5.21), j(f ) ≥ inf [c(x, a) + hfx (x) · F (x, a)] a∈A
(4.5.22)
for all x ∈ X, and we suppose the existence of a Markov control f g ∈ AhSM that attains the minimum in (4.5.22), so j(f ) ≥ c(x, g) + hfx (x) · F (x, g) ∀x.
(4.5.23)
Then g improves f in the sense that the AC j(g) ≤ j(f ), with j(g) := J(x, g) for all x ∈ X. Proof. By (4.4.6), we can express the rightmost term in (4.5.23) as d hfx (x) · F (x, g) = hf (x(t))|(x,g(x)) . dt Therefore, integration of (4.5.23) from t = 0 to t = T gives T c(x(t), g)dt + hf (x(T )) − hf (x). j(f )T ≥ 0
Finally, multiply both sides of the latter inequality by 1/T and then let T → ∞ to obtain the desired conclusion. We now introduce the PIA for the average cost OCP. Consider a sequence of Markov controls fn as follows. (PI1 ) Given fn , for some n = 0, 1, . . ., suppose that there exists a solution (j(fn ), hn (·)) ∈ R × C 1 (X) to the Poisson equation j(fn ) = c(x, fn ) + hnx (x) · F (x, fn ), n where fn ∈ AhSM , so (by Lemma 4.19) j(fn ) ≡ J(·, fn ). It follows that
4
172
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
j(fn ) ≥ inf [c(x, a) + hnx (x) · F (x, a)]. a∈A
(4.5.24) n
(PI2 )Policy improvement. Assume the existence of fn+1 ∈ AhSM such that, for every x ∈ X, fn+1 (x) attains the minimum in (4.5.23), i.e., c(x, fn+1 ) + hnx (x) · F (x, fn+1 ) = min[c(x, a) + hnx (x) · F (x, a)], a∈A
(4.5.25) and then find the solution (j(fn+1 ), hn+1 (·)) of the Poisson equation corresponding to fn+1 . Proposition 4.41. For n = 0, 1, . . ., let fn and hn be as in (PI1 )(PI2 ). (a) If for some n, j(fn ) = j(fn+1 ), then stop the PIA: fn and fn+1 n are both AC-optimal in the class AhSM , that is, n . j(fn ) = j(fn+1 ) ≤ J(·, f ) ∀f ∈ AhSM
(b) Otherwise, if j(fn ) > j(fn+1 ) for all n = 0, 1, . . ., then there exists a number j ∗ ≥ 0 such that j(fn ) ↓ j ∗ . Proof. (a) By Lemma 4.40 and (4.5.25), j(fn+1 ) = min[c(x, a) + hnx (x) · F (x, a)] a∈A
≤ j(fn ) = j(fn+1 ). Therefore, both fn and fn+1 satisfy the ACOE (4.4.12), and so the desired result follows from Theorem 4.20(c). Part (b) is a consequence of Lemma 4.40. Remark 4.42. (a) Open problems: The bad news about Proposition 4.41 is that (as in the Remark 4.37 concerning the discounted cost) the proposition does not guarantee the “convergence” of the PIA, that is, it does not state the convergence of the functions hn (·) nor that j ∗ is in fact the AC-value function. Moreover, in Remark 4.37 we mentioned three options to prove the convergence of the PIA for discounted problems, but two of them are
4.5
THE POLICY IMPROVEMENT ALGORITHM
173
not applicable in the AC case; namely, the option I1 (about imposing “suitable assumptions” on the OCP) and the option I2 (about using a numerical approach) are, to the best of our knowledge, completely unexplored in the AC case. This situation suggests some interesting research problems. On the other hand, the good news is that the option I3 (the direct approach) in the Remark 4.37 is—at least sometimes—applicable to AC problems. See the following Example 4.43. (b) The following comment is similar to Remark 4.39: Since there are some results on the PIA for AC problems related to stochastic differential equations (see Sect. 6.5) of the form dx(t) = F (x(t), a(t))dt + σ(x(t))dW (t),
(4.5.26)
an obvious question is if one can deduce results for the deterministic problem (4.4.1)–(4.4.3) simply by taking σ(·) ≡ 0 in (4.5.26). The answer, in general, is no. The reason is that (as can be seen in the references in Sect. 6.5) the AC results in the stochastic case require |σ(x)| > 0 for all x ∈ X, except perhaps at x = 0. (The latter fact is required even in the one-dimensional case. See for instance assumption (A1) in Anulova et al. (2020).) ♦ Example 4.43. Let us consider again the AC control problem in Example 4.23 with transition function and running cost given by F (x, a) := p − ax1/2
and c(x, a) := x + ka2 .
(4.5.27)
The constants p and k, and the initial state x(0) = x are all positive. To initialize the PIA, take the Markov control f0 (x) = C0 x1/2 with C0 > 0. We selected this value of f0 because then the righthand side of F is linear in x and similarly for the cost c. More explicitly, with a = f0 (x) the system equation becomes x(t) ˙ = p − C0 x(t),
t ≥ 0,
(4.5.28)
so x(t) = D0 e−C0 t + p/C0 with D0 := x(0) − p/C0 . Therefore, by definition of c(x, a) in (4.5.27), the corresponding average cost j0 = J(f0 ) is given by
4
174
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
T
j0 = lim sup T →∞
(1 + kC02 )x(t)dt,
0
where the integrand (1 + kC02 )x(t) = (1 + kC02 )D0 e−C0 t + (1 + kC02 )p/C0 . Observe that, in the right-hand side, the first term → 0 as t → ∞. Consequently, (4.5.29) j0 = (1 + kC02 )p/C0 . We now wish to find the Poisson equation associated to f0 , i.e.,
j0 = c(x, f0 ) + h0 (x) · F (x, f0 )
(4.5.30)
for some function h0 , where h0 (x) = dh0 (x)/dx. From the definition (4.5.27) of F and c we obtain
j0 = (kC02 − h0 (x)C0 + 1)x + h0 (x) · p.
This equation is satisfied if h0 (x) · p = j0 , or h0 (x) = j0 /p, and kC02 − j0 C0 /p + 1 = 0 or pkC02 − j0 C0 + p = 0. (Note that the latter equation is the same as (4.5.29).) This concludes the initialization step in the PIA. To proceed with the policy improvement procedure, instead of (4.5.30) we now consider the inequality
j0 ≥ inf [c(x, a) + h0 (x)F (x, a)]. a≥0
Thus, proceeding as usual, the minimum in the right-hand side is attained at f1 (x) = C1 x1/2 , with C1 = h0 (x)/2k = j0 /(2kp). Hence, the system equation (4.5.28) now becomes x(t) ˙ =p− C1 x(t) for t ≥ 0, so x(t) = D1 e−C1 t + p/C1
with D1 := x(0) − p/C1 .
Continuing this process we see that the average cost j1 = J(f1 ) is j1 = (1 + kC12 )p/C1
(4.5.31)
4.5
THE POLICY IMPROVEMENT ALGORITHM
175
and the Poisson equation
j1 = c(x, f1 ) + h1 (x) · F (x, f1 )
is satisfied with h1 (x) = j1 /p. Moreover, from (4.5.29) and (4.5.31) we see that j1 < j0 . In general, given the Markov control fn (x) := Cn x1/2 , with Cn > 0, we obtain the average cost jn := J(fn ) = (1 + kCn2 )p/Cn ,
n = 0, 1, . . .
and the associated Poisson equation holds with a function hn such hn (x) = jn /p, whereas Cn is the positive root of the quadratic equation pkCn2 − jn Cn + p = 0. One can also show, as in (4.5.29) and (4.5.31), that the jn form a monotone decreasing sequence and, in fact, jn ↓ j ∗ := (1 + kC 2 )p/C
(4.5.32)
where C is the positive solution of pkC 2 − j ∗ C + p = 0.
(4.5.33)
To conclude, one can verify that the Markov controls fn (x) converge to the optimal AC control f ∗ (x) = Cx1/2 and that j ∗ is the optimal average cost, with C > 0 as in (4.5.33). ♦
Exercises 4.1. Solve the following OCP: Minimize 1 c(x(t), a(t))dt J(a(·)) = 0
subject to x(t) ˙ = a(t)2 , with x(0) = x(1) = 0, and control set A = [−1, 1]. Hint. Note that x(·) = 0.
4
176
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
4.2. Consider the OCP: minimize T e−rt [x(t) + ka(t)2 ]dt + e−rT qx(T ) 0
over all control functions a(·) ≥ 0, subject to x(t) ˙ = b − a(t)x(t)1/2 , for 0 ≤ t ≤ T, x(0) = x0 . Propose a solution of the form v(t, x) = P (t)x + Q(t) for the corresponding HJB equation. Show that, in this case, the coefficients P (·) and Q(·) should satisfy that ˙ P˙ (t) = rP (t) + P (t)2 /4k − 1, Q(t) = rQ(t) − bP (t) with P (T ) = q, Q(T ) = 0, and the optimal control is a∗ (t, x) = (2k)−1 vx · x1/2 . 4.3. The goal of this exercise is to prove that, under Assumption 4.2, there is a unique solution to (4.1.2) for every control a ∈ A. (a) Fix a ∈ A, and let δ > 0 be such s + δ ≤ T . Let C ≡ C([s, s + δ], Rn ) be the complete linear space of continuous functions x : [s, s + δ] → Rn , with the supremum norm x := sup{|x(t)| : s ≤ t ≤ s + δ}. For each x ∈ C, define the mapping t → R[x](t), for every t ∈ [s, s + δ] and some y ∈ Rn , as t R[x](t) := y + F (r, x(r), a(r))dr. s
Prove that R[x1 ] − R[x2 ] ≤ Lδx1 − x2 ∀ x1 , x2 ∈ C, where L is the constant in Assumption 4.2. (b) Show that, for δ small enough, (4.1.2) has a unique solution in [s, s + δ], and explain how to extend such a solution to the whole interval [s, T ].
4.5
THE POLICY IMPROVEMENT ALGORITHM
177
4.4. (Dockner et al. 2000, p. 44) Let X = R and A = [−1, 1] be the state and action spaces, respectively. Suppose that we wish to maximize T x(s)a(s)ds for 0 ≤ t ≤ T J(t, x; a(·)) := t
over all a(·) ∈ A[t, T ], subject to x(s) ˙ = a(s) and initial condition x(t) = x ∈ R. (a) Let xa be the state path associated to the control a(·). Verify that −(T − t) ≤ xa (T ) − x ≤ T − t, and J(t, x; a(·)) = (xa (T )2 − x2 )/2 for all t ≤ T . (b) Show that the control policy a∗ defined, for all s ∈ [t, T ], as 1 if x ≥ 0, ∗ a (s) = −1 if x < 0 is optimal. (c) Show that the value function can be expressed as V (t, x) = (T −t)2 + (T − t)|x| for t ≤ T, x ∈ R. Note that V is not dif2 ferentiable at (t, 0) for t < T . 4.5. Let X = A = R. Consider the OCP T 1/2 min [1 + a(s)] ds + |y(T ) − b| a(·)
t
subject to y(s) ˙ = a(s) ∀ t ≤ s ≤ T,
with y(t) = y.
Show that V (t, y) = [(T − t)2 + (b − y)2 ]1/2 is a solution to the corresponding HJB equation. Give a geometric interpretation. 4.6. (Managing investment income) Consider an optimal investment problem at a rate of interest β > 0, with the state space (capital) X = [0, ∞), the set of feasible actions (consumptions) at the state x given by A(x) = [0, βx], and the performance index
4
178
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
T
e−rt
a(t)dt,
0
subject to x(t) ˙ = βx(t) − a(t) and x(T ) = 0. Find a non negative function v(t) such that the value function is given by V (t, x) := e−rt v(t)x. Hint: The HJB equation (4.2.2) has an optimal action a(t) = x(t)/v(t), and becomes in v(t) ˙ − (2r − β)v(t) + 1 = 0 with boundary condition v(T ) = 0. 4.7. (A heuristic derivation of the HJB equation.) Consider the time–invariant OCP: T c(x(t), a(t))dt + C(x(T )) min a(·)
0
subject to x(t) ˙ = F (x(t), a(t)) ∀ 0 ≤ t ≤ T,
with x(0) = x.
(a) For N = 1, 2, ... and t ∈ [0, T ), let δ := TN−t . Assume that the following discrete–time approximating problem N −1 δc(xk , ak ) + C(xN ) min k=0
subject to xk+1 = xk + δF (xk , ak ) ∀ k = 0, 1, ..., N − 1,
and x0 = x
has a solution. Show that there is a function J that satisfies J(t, x) = min [δc(x, a) + J(t + δ, x + δF (x, a))] . a
Hint. Fix t < T . Let J(t + N δ, x) := C(x), and then go backwards. (b) Assume that J is a well–defined C 1 function in a neighborhood of (t, x). Consider the Taylor’s expansion J(t + δ, x + δF (x, a)) = J(t, x) + Jt (t, x) · δ + Jx (t, x) · δF (x, a) + o(δ),
4.5
THE POLICY IMPROVEMENT ALGORITHM
179
where o(δ)/δ → 0 as δ → 0. Show that min [c(x, a) + Jt (t, x) + Jx (t, x) · F (x, a)] = 0. a
4.8. (Cruz-Su´arez and Montes-de Oca (2008)) Let X ⊂ Rn and, for each x ∈ X, let A(x) ⊂ Rm be a set with nonempty interior. Moreover, let K = {(x, a) | x ∈ X, a ∈ A(x)}, and c : K → R. The purpose of this exercise is to provide some sufficient conditions for the differentiability of the function w(x) := inf{c(x, a) | a ∈ A(x)}. Assume c is of class C 2 in the interior of K and, for each x, the Hessian matrix caa (x, a) is nonsingular for every a. Further, suppose there is a function f : X → Rm such that w(x) = c(x, f (x)) and f (x) is an interior point of A(x) for each x ∈ X. Justify the equality ca (x, f (x)) = 0. Use the Implicit Function Theorem (see for instance Theorem 9.28 in Rudin (1976)) to show that f is differentiable and find fx (x). Show also that wx (x) = cx (x, f (x))
x ∈ X.
(4.5.34)
4.9. Consider the dynamics (4.0.1) and the cost functional T c(t, x(t), a(t))dt + C(T, x(T )) (4.5.35) 0
which is said to be in the Bolza form. The so-called Lagrange form of the functional happens when C ≡ 0, whereas c ≡ 0 corresponds to the Mayer form. Suppose that Assumption 4.2 holds with C as a function of (t, x). (a) Show that any cost C in the Mayer form can be put in the Lagrange form.
4
180
CONTINUOUS–TIME DETERMINISTIC SYSTEMS
T Hint. C(T, x(T )) − C(0, x(0)) = 0 [ dtd C(t, x(t))]dt. (b) Consider an additional state xn+1 to the system (4.0.1) given by xn+1 (0) = 0. x˙ n+1 (t) = c(t, x(t), a(t)), Prove that the Bolza form (4.5.35) can be put in the Mayer form. Answer. xn+1 (T ) + C(T, x(T )) where x = (x1 , . . . , xn ). 4.10. Let a∗ , x∗ , and λ satisfy the Minimum Principle for the OCP with cost (4.2.1) and dynamics (4.0.1). Let μ(t) := ert λ(t) and H(t, x, a, μ) = c(t, x, a) + μ · F (t, x, a),
t ∈ [0, T ].
Show that the minimum condition and the adjoint equation can be respectively written as H(t, x∗ (t), a∗ (t), μ(t)) = min H(t, x∗ (t), a, μ(t)) a∈A
and −μ(t) ˙ + rμ(t) = H(t, x∗ (t), a∗ (t), μ(t)),
μ(T ) = Cx (x∗ (T ))
for 0 ≤ t < T . 4.11. Prove Proposition 4.30. 4.12. Consider the r-discount OCP (4.5.1)–(4.5.2) and let f ∈ A∗SM be a stationary Markov control that satisfies (4.5.3). (a) Assuming that v(·) := V (·, f (·)) is in C 1 (X), show that v is the unique solution of the equation rv(x) = c(x, f ) + Lf (x) v(x) ∀x ∈ X. (b) Show that the r-discount value function v ∗ (x):= inf a∈A V (x, a (·)) is the unique solution in C 1 (X) of the HJB equation rv ∗ (x) = inf [c(x, a) + La v ∗ (x)]. a∈A
4.5
THE POLICY IMPROVEMENT ALGORITHM
181
Hint. In both cases (a) and (b) consider a function of the form u(t, x) = e−rt v(x) as in the proof of Lemma 4.35. 4.13. (The Ramsey model). Consider the maximization problem of the discounted utility functional ∞ u(c(t))e−ρt dt V (k0 , c(·)) = 0
˙ over all consumption strategies c(·)≥0 subject to k(t)=F (k(t))− c(t), k(0) = k0 , where the utility and system functions are given, respectively, by u(c) =
c1−σ and F (k) = Ak α , α, A > 0, 0 < σ < 1. 1−σ
For the particular case α = σ, solve the HJB equation by conjecturing the value function as v(k) = B0 + B1 k 1−σ , where B0 and B1 are undetermined coefficients. Answer.
σ A σ ρ 1 1−σ , c∗ (t) = k ∗ (t), v(k) = + k ρ ρ 1−σ σ 1 1−σ Aσ Aσ −(1−σ) ρ t σ k ∗ (t) = . + k01−σ − e ρ ρ
4.14. Let (j, h(·)) be a solution to the Poisson equation (4.4.7)– (4.4.8). Prove that j is unique, and h is unique up to additive constants. Hint. Recall (4.4.6) and (4.4.10).
Chapter 5 Continuous–Time Markov Control Processes As noted in Remark 4.7(b), the solution x(·) of the (deterministic) ordinary differential equation (4.0.1) can be interpreted as a Markov control process (MCP), also known as a controlled Markov process. In this chapter we introduce some facts on general continuous–time MCPs, which allows us to make a unified presentation of related control problems. We will begin below with some comments on (noncontrolled) continuous–time Markov processes. (We only wish to motivate some concepts, so our presentation is not very precise. For further details, see the bibliographical notes at the end of this chapter.) For notational convenience, sometimes we write x(t) as xt .
5.1 Markov Processes Consider a continuous–time stochastic process X = {x(t) : t ≥ 0} with values in a set X ⊂ Rn for some positive integer n. (In most applications, X is an open set; perhaps Rn itself. There are, however, other important cases. For instance, if X is a so–called jump process, then X is usually a countable set.) Accordingly, there is a probability space (Ω, F, P ) such that, for each t ≥ 0, x(t) is a random variable (or measurable function) from Ω to Rn . Hence, strictly speaking, we have a function of two variables (t, ω) → x(t)(ω) ≡ x(t, ω). © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hern´andez-Lerma et al., An Introduction to Optimal Control Theory, Texts in Applied Mathematics 76, https://doi.org/10.1007/978-3-031-21139-3 5
183
184
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Summarizing, for each t ≥ 0, we have a random variable ω → x(t, ω) on Ω; and, for each ω ∈ Ω, we have a function t → x(t, ω), for t ≥ 0, which is called a trajectory or sample path of X . By a standard convention, the variable ω is omitted and we write x(t) rather than x(t, ω). The continuous–time process X is said to satisfy the Markov property if, informally, given the “present” state, the future behavior of the process is independent of its past. To be a little more precise, fix an arbitrary time s ≥ 0, and let x(s) be the “present” state. Then we can express the Markov property as follows: for any “future time” t > s and B ⊂ X, P [x(t) ∈ B|x(r) ∀ r ≤ s] = P [x(t) ∈ B|x(s)].
(5.1.1)
In words, (5.1.1) states that the distribution (or “behavior”) of the process at any “future” state x(t), for t > s, given the “past history” {x(r), r ≤ s}, depends only on the present state x(s). From the right–hand side of (5.1.1) we obtain the transition probabilities P (s, x, t, B) := P [x(t) ∈ B|x(s) = x]
(5.1.2)
for every 0 ≤ s ≤ t, x ∈ X, and B ⊂ X. If t = s, then (5.1.2) becomes the Dirac (or unit) measure concentrated at x(s) = x, i.e., P (s, x, s, B) = δx (B), which is defined as δx (B) := 1 if x ∈ B, and := 0 if x ∈ / B. Alternatively, δx (B) = IB (x) where IB denotes the indicator function / B. of the set B, IB (x) := 1 if x ∈ B, and := 0 if x ∈ Remark 5.1. The transition probabilities are called stationary or time–homogeneous if they depend only on the time difference t − s, that is, P (s, x, t, B) ≡ P (t − s, x, B). In this case, (5.1.2) becomes P (t, x, B) := P [x(t) ∈ B|x(0) = x] for t ≥ 0,
5.1
MARKOV PROCESSES
185
and the Markov process itself is said to be stationary or time– homogeneous. ♦ Example 5.2. A deterministic system. Consider the ordinary differential equation x(t) ˙ = F (t, x(t)) for t ≥ 0,
(5.1.3)
with a given initial condition x(0) = x0 . Let us suppose that F is continuous and has continuous first partial derivatives with respect to the components of x ∈ X, where X ⊂ Rn . In this case, (5.1.3) has a unique solution t x(t) = x0 + F (r, x(r))dr ∀ t ≥ 0, 0
which can be expressed as t x(t) = x(s) + F (r, x(r))dr
∀ 0 ≤ s ≤ t.
(5.1.4)
s
If we interpret x(s) as the “present” state, and x(t), for t ≥ s, as the “future”, then it follows that (5.1.4) is the “deterministic version” of the Markov property (5.1.1). Observe that the deterministic function t → x(t) in (5.1.3) or (5.1.4) can be seen as a “degenerate” stochastic process in the sense that, for each t ≥ 0, we can interpret x(t) as a constant random variable, that is, ω → x(t, ω) ≡ x(t). Consequently, the transition probability in (5.1.2) is a Dirac measure P (s, x, t, B) = δx(t;s,x) (B),
(5.1.5)
where x(t; s, x) is given by (5.1.4) when the “initial condition” is x(s) = x. For future reference note that integration with respect to this Dirac measure (5.1.5) yields P (s, x, t, dy)v(y) = v[x(t; s, x)] (5.1.6) X
for any bounded measurable function v on X.
♦
186
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Example 5.3 (Wiener process.). A Wiener process, also known as a Brownian motion, is a real–valued stochastic process {w(t), t ≥ 0} that plays an important role in pure and applied mathematics, and also in physics, astronomy, economics, mathematical finance, and many other fields. It satisfies that w(0) = 0 and, furthermore, by definition, (a) it has independent increments, which means that if t0 = 0 < t1 < · · · < tm , then the “increments” w(t1 ) − w(t0 ), w(t2 ) − w(t1 ), ..., w(tm ) − w(tm−1 ) are independent random variables; and (b) it has Gaussian stationary increments, that is, for any t ≥ 0 and h > 0, the distribution of the increment w(t + h) − w(t) is Gaussian (or normal) with 0 mean and variance h. (This increment is “stationary” in the sense that its distribution depends on the “time increment” h only, not on t.) Remark. A continuous–time stochastic process with independent increments is Markov. (See Ash and Gardner (1975), Theorem 4.6.5) ♦ Hence, by this remark, the Wiener process w(·) is Markov. Moreover, from (a) and (b) above, for any t > 0 and initial state w(0) = x, the stationary transition probability is nx,t (y)dy, (5.1.7) P (t, x, B) = B
where nx,t (·) denotes the Gaussian (or normal) density with mean x and variance t, i.e., nx,t (y) = (2πt)−1/2 exp(−|y − x|2 /2t) ∀ y ∈ R. Among the many properties of a Wiener process it is the fact that it has continuous sample paths t → w(t) that are nowhere differentiable! A process (w1 (t), . . . , wn (t)) ∈ Rn , for t ≥ 0, is called an n– dimensional Wiener process (or Brownian motion) if w1 , ..., wn are independent 1–dimensional Wiener processes. ♦
5.1
MARKOV PROCESSES
187
Example 5.4. Stochastic differential equations. In Chap. 6, below, we consider n–dimensional stochastic differential equations (SDEs) of the form dx(t) = b(t, x(t))dt + σ(t, x(t))dw(t), t ≥ 0,
(5.1.8)
with a given initial condition x(0) = x0 , where w(·) is a Wiener process. The functions b and σ in (5.1.8) are called the SDE’s drift coefficient and the diffusion coefficient, respectively. As noted in Example 5.3, the sample paths t → w(t) are not differentiable. Hence, strictly speaking we should express (5.1.8), for any 0 ≤ s ≤ t, in the integral form t t b(r, x(r))dr + σ(r, x(r))dw(r), (5.1.9) x(t) = x(s) + s
s
where the second integral in the right–hand side is well defined as a so–called Itˆ o integral. For our present purposes, it suffices to note that, under suitable conditions (see Assumption 6.1), the solution of (5.1.8) is a Markov process with transition probabilities P (s, x, t, B) = P [x(t) ∈ B|x(s) = x] = P [x(t; s, x) ∈ B] for all 0 ≤ s ≤ t, x ∈ Rn , and B ⊂ Rn , where x(t; s, x) is given by (5.1.9) when the initial condition is x(s) = s. Moreover, with some additional mild condition (Assumption 6.2) the solutions of SDEs form a class of so–called diffusion processes, and, therefore, by an abuse of terminology, sometimes one uses the latter term, diffusion processes, to refer to the SDEs (5.1.8)–(5.1.9). If the coefficients b(t, x) ≡ b(x) and σ(t, x) ≡ σ(x) in (5.1.8) do not depend on the time parameter, then the Markov process x(·) is time–homogeneous with transition probabilities P (t, x, B) = P [x(t) ∈ B|x(0) = x]. Finally, observe that an ordinary differential equation as in (5.1.3) can be seen, of course, as a special case of a SDE with diffusion coefficient σ(·) ≡ 0. ♦
188
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
We conclude this section with another example of a continuoustime Markov process, namely, the Poisson process, which is very useful in some applications, for instance, in the control of queues and other systems in which the state space is a denumerable (or countable) set. First, we recall the following definition from elementary probability. A random variable N with values in the set of nonnegative integers N = {0, 1, ...} is said to have a Poisson distribution with rate λ > 0 if P (N = k) :=
e−λ λk for k = 0, 1, . . . . k!
On the other hand, a continuous-time stochastic process {N (t), t≥ 0} with values in N is called a counting process if, for any t ≥ 0 and h > 0, the increment N (t + h) − N (t) equals the number of “events” that have occurred in the interval (t, t + h]. (Compare the following definition with Example 5.3 above.) Definition 5.5. A counting process {N (t), t ≥ 0} with N (0) = 0 is called a Poisson process with rate λ > 0 if it has (a) independent increments, and (b) Poisson stationary increments with rate λ, that is, for each t ≥ 0 and h > 0, the increment N (t + h) − N (t) is a Poisson random variable with parameter λh, i.e., e−λh (λh)k P (N (t + h) − N (t) = k) = k! for k = 0, 1, .... By property (a), a Poisson process is a Markov process. (See the Remark in Example 5.3)
5.2 The Infinitesimal Generator As in the previous section, we consider a continuous–time Markov process X = {x(t) : t ≥ 0} in X ⊂ Rn , with transition probabilities (5.1.2). In this section we introduce the infinitesimal generator
5.2
THE INFINITESIMAL GENERATOR
189
(or simply the generator) of the Markov process X , which is a key tool to study different aspects of X . An important property of the transition probabilities is expressed by the Chapman–Kolmogorov equation: P (s, x, r, B) = P (s, x, t, dy)P (t, y, r, B) (5.2.1) X
for 0 ≤ s ≤ t ≤ r. We will now introduce three families M ⊃ M0 ⊃ D of real– valued measurable functions on X∞ := [0, ∞) × X. Definition 5.6. Let M be the linear space of real–valued measurable functions v on X∞ such that P (s, x, t, dy)|v(t, y)| < ∞ X
for each 0 ≤ s ≤ t, x ∈ X. For each t ≥ 0, and v ∈ M , let Tt v be the function such that, for each (s, x) ∈ X∞ , the expected value P (s, x, s + t, dy)v(s + t, y) Tt v(s, x) : =Es,x [v(s + t, x(s + t))]= X
(5.2.2) is well defined (and finite), where Es,x [· · · ] denotes the conditional expectation given the initial condition x(s) = x. The operators Tt , t ≥ 0, form a semigroup of operators on M , that is, T0 =Identity, each Tt maps M into itself, and Tt+r = Tt Tr
∀ t, r ≥ 0,
where the latter equality follows from the right-hand side of (5.2.2) and the Chapman–Kolmogorov equation (5.2.1). (See Exercise 5.1.) Let M0 be the subfamily of M consisting of those functions v ∈ M such that: (a) limt↓0 Tt v(s, x) = v(s, x) for every (s, x) ∈ X∞ , and (b) there exists t0 > 0 and u ∈ M such that
5
190
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Tt |v|(s, x) ≤ u(s, x) ∀ (s, x) ∈ X∞ , 0 ≤ t ≤ t0 . The next definition introduces the infinitesimal generator of the semigroup Tt . Definition 5.7. Let D(L) be the subset of functions v ∈ M0 that satisfy the following conditions: (a) The limit Lv(s, x) := lim t−1 [Tt v(s, x) − v(s, x)] t↓0
(5.2.3)
exists for all (s, x) ∈ X∞ , and (b) Lv is in M0 . The operator L in (5.2.3) is called the infinitesimal generator of the semigroup Tt , and is also known as the infinitesimal generator of the Markov process X . The set D(L) is called the domain of L. Example 5.8. (a) Consider the deterministic system in (5.1.3), and suppose that (t, x) → v(t, x) is a real–valued mapping on X∞ such that (for instance) it is continuously differentiable in x with bounded derivatives, uniformly in t ≥ 0. Then, from (5.1.6) and (5.2.2), Tt v(s, x) = v(s + t, x(s + t; s, x)) and (5.2.3) becomes Lv(s, x) = vs (s, x) + vx (s, x)F (s, x),
(5.2.4)
where vs denotes the partial derivative of v with respect to s, and vx is the gradient of v (in the x–variables), that is, the row vector of partial derivatives vx1 , ..., vxn . Hence, more explicitly, we can express (5.2.4) as Lv(s, x) = vs (s, x) +
n
Fi (s, x)vxi (s, x).
i=1
Compare the expressions (5.2.4) and (4.1.12). Omitting the control variable a ∈ A in (4.1.12), these expressions are essentially the same. See also Remark 4.7(d) or (4.4.5)–(4.4.6) for the timehomogeneous deterministic case.
5.2
THE INFINITESIMAL GENERATOR
191
(b) Let w(·) be the 1–dimensional Wiener process in Example 5.3, and let x → v(x) be a twice continuously differentiable function. Since w(·) is a time–homogeneous Markov process with transition probability (5.1.8), suitable calculations yield that (5.2.3) becomes Lv(x) = (1/2)vxx (x), where vxx denotes the second derivative of v with respect to x. In the n–dimensional case w = (w1 , ..., wn ), 1 Lv(x) = vx x (x), 2 i=1 i i n
where vxi xi denotes the second partial derivative of v with respect to xi . For the SDE in Example 5.4, the corresponding generator Lv is given in (6.1.4), below. ♦ The infinitesimal generator L is in fact an extension of the “weak infinitesimal generator” of a semigroup, defined in Chap. 1 of Dynkin (1965), and it has essentially the same properties. For instance, straightforward modifications of the proofs in the latter reference give the following results. Lemma 5.9. If v ∈ D(L), then, for all (s, x) ∈ X∞ := [0, ∞) × X, d+ Tv dt t
:= lim h−1 [Tt+h v − Tt v] = Tt Lv; h↓0 t (b) Tt v(s, x) − v(s, x) = 0 Tr (Lv)(s, x)dr. (a)
Moreover, if ρ ≥ 0 and vρ (s, x) := e−ρs v(s, x), then vρ is also in D(L) and (c) Lvρ (s, x) = e−ρs [Lv(s, x) − ρv(s, x)]. (d) v is a constant if, and only if, Lv(s, x) = 0 for all (s, x). The proof of Lemma 5.9 is left to the reader. (See Exercise 5.2.) Remark 5.10. (a) The expression in Lemma 5.9(b) is called Dynkin’s formula and using (5.2.2) can be rewritten as
192
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Es,x v(s + t, x(s + t)) − v(s, x) = Es,x
s+t
Lv(r, x(r))dr s
(5.2.5) for v ∈ D(L). In the deterministic case, Dynkin’s formula can be obtained from Remark 4.7(c): T v(T, x(T )) − v(s, x) = Lv(r, x(r))dr. s
(b) Under suitable assumptions, (5.2.5) holds if t is replaced by a (random) stopping time τ . (See, for instance, the “corollary” in Dynkin (1965), p. 133.) ♦ The proof of the following proposition illustrates the use of Dynkin’s formula (5.2.5). Proposition 5.11. Let c and K be nonnegative functions on XT := [0, T ] × X, for some T > 0. Suppose that c is in M0 , and let ρ ≥ 0. (a) If v ∈ D(L) satisfies the equation ρv(s, x) = c(s, x) + Lv(s, x) ∀ (s, x) ∈ XT
(5.2.6)
with the “terminal” condition v(T, x) = K(T, x) ∀ x ∈ X,
(5.2.7)
then, for all (s, x) ∈ XT , T e−ρ(t−s) c(t, x(t))dt + e−ρ(T −s) K(T, x(T ))]. v(s, x) = Es,x [ s
(5.2.8) (a’) If instead of (5.2.6) we have the inequality ρv ≤ c + Lv, then in (5.2.8) we replace “=” with “≤”. Similarly, if in (5.2.6) we replace the equality with “≥”, then in (5.2.8) we replace “=” with “≥”. (b) Suppose that c is as above, but ρ > 0 and K ≡ 0. Suppose that v ∈ D(L) satisfies (5.2.6) for all (s, x) ∈ X∞ := [0, ∞) × X and the condition
5.2
THE INFINITESIMAL GENERATOR
193
e−ρt Es,x v(s + t, x(s + t)) = e−ρt Tt v(s, x) → 0 as t → ∞. (5.2.9) Then v ≡ v ρ is given by ∞ ρ e−ρ(t−s) c(t, x(t))dt v (s, x) = Es,x ∞ s = e−ρt Tt c(s, x)dt. (5.2.10) 0
(b’) If in (5.2.6) we replace the equality “=” with either “≤” or “≥”, then in (5.2.10) we replace the equality with “≤” or “≥”, respectively.
Proof. (a) As in Lemma 5.9(c), let vρ (s, x) := e−ρs v(s, x) and note that (5.2.6) can be expressed as Lv(s, x) − ρv(s, x) = −c(s, c). Hence, Lemma 5.9(c) yields Lvρ (s, x) = e−ρs [Lv(s, x) − ρv(s, x)] = −e−ρs c(s, x). (5.2.11) Therefore, applying Dynkin’s formula (5.2.5) to vρ , we obtain s+t −ρ(s+t) −ρs e v(s+t, x(s + t))−e v(s, x)= − e−ρr c(r, x(r))dr. s
Multiplying both sides of the latter expression by eρs and taking t = T − s it follows that T −ρ(T −s) e Es,x v(T, x(T )) − v(s, x) = − e−ρ(r−s) c(r, x(r))dr. s
Finally, rearranging terms and using (5.2.7) we obtain (5.2.8). (a’)If ρv ≤ c + Lv, then instead of (5.2.11) we obtain Lvρ (s, x) ≥ −e−ρs c(s, x). Thus, with the obvious changes, the proof of (a) gives also (a’). (b) If K ≡ 0, (5.2.9) and (5.2.8) give (5.2.10) as T → ∞. The proof of (b’) is left as an exercise.
194
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Proposition 5.11 will be very useful to obtain the dynamic programming (or Hamilton–Jacobi–Bellman) equation associated to some stochastic control problems. If the function c ∈ M0 in Proposition 5.11 is a “cost rate”, that is, a cost per unit time, then v(s, x) in (5.2.8) can be interpreted as a total expected cost during the time interval [s, T ], with initial state x(s) = x ∈ X and terminal cost K(T, x(T )). Similarly, (5.2.10) can be seen as an infinite–horizon expected discounted cost from time s onward, with discount factor ρ > 0 and initial condition x(s) = x. In the following Proposition 5.13 we present a result for the long–run expected average cost defined as follows. First, given a “cost rate” c ∈ M0 and t > 0, let s+t vt (s, x) := Es,x c(r, x(r))dr s t = Tr c(s, x)dr (5.2.12) 0
be the total expected cost in the interval [s, s + t], given the initial condition x(s) = x ∈ X at time s ≥ 0. Then Jt (s, x) :=
vt (s, x) t
(5.2.13)
denotes the expected average cost during the time interval [s, s + t], with initial condition x(s) = x. To define the “long–run expected average cost” we would like to take the limit as t → ∞ in (5.2.13). A priori, however, we do not know if such a limit exists; hence, we can take instead either the “lim sup”, i.e., J sup (s, x) := lim sup Jt (s, x),
(5.2.14)
Jinf (s, x) := lim inf Jt (s, x).
(5.2.15)
t→∞
or the “lim inf”, t→∞
For theoretical reasons, taking the lim sup is more convenient, so we will take (5.2.14) as the definition of the long–run expected average cost. (As an example, taking the average cost as J sup in
5.2
THE INFINITESIMAL GENERATOR
195
(5.2.14) simplifies the calculations to obtain results as in Exercise 5.8(b),(c).) Remark 5.12. (a) As a rule of thumb, if we wish to minimize a long–run expected average cost, we use the lim sup as in (5.2.14). (We thus take a “conservative” or “minimax” attitude—we wish to minimize a “maximum” or “lim sup”.) Nevertheless, if we wish to maximize a long–run average reward (or utility or income), then we use the lim inf in (5.2.15). (This is again a “conservative”, in fact, “maximin” attitude.) (b) Since the long-run expected average cost (5.2.14) concerns the convergence of time averages (5.2.13), it is also known as an ergodic cost. (See, for instance, Arapostathis et al. (2012) or Arisawa (1997).) ♦ The following Proposition 5.13 is a general version of Lemma 4.19 for deterministic continuous-time systems. Proposition 5.13. Let c ∈ M0 be a given function, and suppose that there exists a number j(c) ∈ R and a function hc ∈ D(L) such the pair (j(c), hc ) satisfies, for all (s, x) ∈ X∞ := [0, ∞) × X, the so–called Poisson equation j(c) = c(s, x) + Lhc (s, x).
(5.2.16)
Moreover, suppose that, for all (s, x) ∈ X∞ , hc is such that lim Tt hc (s, x)/t = 0.
t→∞
(5.2.17)
Then: (a) The constant j(c) = J sup (s, x) for all (s, x) ∈ X∞ . (b) If the equality in (5.2.16) is replaced with “≤”, then the equality in (a) is replaced with “≤”; that is, if j(c) ≤ c(s, x) + Lhc (s, x) ∀ (s, x), then j(c) ≤ J sup (s, x) for all (s, x). This result is also true if we have “≥” in lieu of “≤”. Proof. (a) By Dynkin’s formula in Lemma 5.9(b) (or (5.2.5)),
5
196
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Tt hc (s, x) − hc (s, x) =
t
Tr (Lhc )(s, x)dr t = Es,x Lhc (s + r, x(s + r))dr 0
0
[by (5.2.17)]
t
c(s + r, x(s + r))dr = tj(c) − Es,x 0 t Tr c(s, x)dr. = tj(c) − 0
Hence, rearranging terms and multiplying by t−1 , t −1 Tr c(s, x)dr + t−1 [Tt hc (s, x) − hc (s, x)]. j(c) = t 0
Finally, letting t → ∞, (5.2.17) yields (a). The proof of part (b) is similar.
Remark 5.14. (a) The pair (j(c), hc ) in Proposition 5.13 is called a canonical pair or a solution of the Poisson equation (5.2.16) corresponding to the function c ∈ M0 . There are several approaches to obtain such a pair. Some of these approaches are briefly introduced in part (d) below, in Sect. 5.5, and also in the exercise section. See also Sect. 4.4 for the deterministic case. (b) If (j(c), hc ) is a solution to (5.2.16), then j(c) is unique, but hc is unique up to additive constants only. More precisely, suppose that, for i = 1, 2, the pair (j i , hi ) ∈ R × D(L) is a solution to (5.2.16), i.e., j i = c(s, x) + Lhi (s, x) ∀ (s, x), and it satisfies (5.2.17). Then j 1 = j 2 , and h1 , h2 differ by a constant: h1 = h2 + constant. (See Exercise 5.6.) (c) For a stationary (or time–homogeneous) Markov process, the notation and results in this section simplify in the obvious manner. For instance, the linear space M in Definition 5.6
5.2
THE INFINITESIMAL GENERATOR
197
becomes the space of measurable functions v : X → R such that P (t, x, dy)|v(y)| < ∞ ∀ t ≥ 0, x ∈ X, X
and (5.2.2) becomes P (t, x, dy)v(y) ∀ t ≥ 0, x ∈ X.
Tt v(x) = Ex v(x(t)) = X
(d) As in part (c), above, consider a time–homogeneous Markov process X with transition probabilities P (t, x, ·), and a function c ∈ M0 . Let μ be a probability measure on X, which is an invariant probability measure1 for X ; that is, for each Borel set B ⊂ X, t ≥ 0, and x ∈ X, we have μ(B) = P (t, x, B)μ(dx). X
Let · ∗ be a norm on the linear space of finite signed measures on X. We assume that our Markov process is uniformly ergodic (or geometrically ergodic) with respect to the norm · ∗ , that is, there exist positive constants θ and γ such that, for all t ≥ 0 and x ∈ X, P (t, x, ·) − μ(·) ∗ ≤ θe−γt .
(5.2.18)
In this case, as t → ∞, P (t, x, ·) converges geometrically fast to μ(·) for any initial state x. (For conditions ensuring geometric ergodicity, see the references in Exercise 5.9.) Finally, suppose that c ∈ M0 is bounded2 by some constant c¯: |c(x)| ≤ c¯. Let ∞ c(x)μ(dx), and hc (x) := [Tt c(x) − j(c)]dt j(c) := X
1
0
(5.2.19)
The probability measure μ is said to be “invariant” or a “stationary measure” for X because if the initial state x(0) has distribution μ, then the state x(t) has distribution μ for all t ≥ 0. A Markov process is called ergodic if it has a unique invariant probability measure. 2 The requirement that c is bounded simplifies the presentation, but it is not necessary. (See the references in Exercise 5.9.)
198
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
for x ∈ X. Then hc is in D(L), and the pair (j(c), hc ) is a solution to the Poisson equation j(c) = c(x) + Lhc (x) for all x ∈ X.
(5.2.20) ♦
(See Exercise 5.9.)
5.3 Markov Control Processes For our present purposes, a continuous–time Markov control process (MCP) is specified by: (a) Two sets X ⊂ Rn and A ⊂ Rm , called the state space and the action set (or action space), respectively; (b) The law of motion: Corresponding to each action a ∈ A, there exists a linear operator La that is the infinitesimal generator of a X–valued Markov process with transition probabilities P a (s, x, t, B). (c) A cost rate function c(s, x, a), which is a real–valued measurable function defined on [0, ∞) × X × A. We assume that c is nonnegative. The quadruple (X, A, La , c) in (a), (b), (c) expresses in a compact form a continuous–time MCP. (In a more general context, X and A can be complete and separable metric spaces, also known as Polish spaces. Moreover, the condition that the cost rate c is nonnegative simplifies some theoretical and computational aspects, but strictly speaking it is not necessary.) Example 5.15. The (deterministic) differential system (4.0.1) defines a continuous–time MCP with state and action spaces X and A as in Chap. 4, and infinitesimal generator La in (4.1.12), that is, La v(s, x) := vs (s, x) + vx (s, x) · F (s, x, a).
(5.3.1)
5.3
MARKOV CONTROL PROCESSES
199
The situation is essentially the same as in Examples 5.2 and 5.8(a) except that now the system function F depends also on the control variable a ∈ A. Compare (5.3.1) and (5.2.4). On the other hand, the cost rate function c(s, x, a) is as in (4.1.1), and in some cases—as in the finite–horizon case (4.1.1)— we also need to specify a terminal cost function C(x). ♦ Given a MCP, we will only consider Markov control policies (also known as closed–loop or feedback controls), that is, measurable functions π : [0, ∞) × X → A such that aπ := π(s, x) denotes the control action prescribed by π when the state x ∈ X is observed at time s. A Markov policy is said to be stationary if it is independent of the time parameter s, that is, π(s, x) ≡ π(x) for all (s, x). We will denote by Π the set of all Markov policies, and by ΠS the subset of stationary policies. Moreover, for technical reasons, we restrict ourselves to the class Π of Markov policies for which the corresponding state (Markov) processes are nicely behaved, in the following sense. Assumption 5.16. For each π ∈ Π there exists a continuous– time Markov process xπ (·) = x(·) such that: (a) Almost all the sample paths of x(·) are right–continuous, with left–hand limits, and have only finitely many discontinuities in any finite interval of time. (b) x(·) is a Markov process with transition probability denoted by P π (s, x, t, B) and associated semigroup T π ; see (5.2.2). (c) The substitution property. The infinitesimal generator Lπ of x(·) satisfies that Lπ = L a
if π(s, a) = a.
(d) The process x(·) is conservative in the sense that if v(s, x) ≡ 1 for all (s, x), then Lπ v = 0. (e) There is a nonempty subfamily Π0 of Π such that D(Lπ ) is nonempty for all π ∈ Π0 , and, furthermore, the cost rate c(s, x, a) is such that the function is in M0 for all π ∈ Π0 , where cπ (s, x) := c(s, x, π(s, x)) for all π ∈ Π0 . To state our final assumption in this chapter we note that, for each π ∈ Π, the function sets M ⊃ M0 ⊃ D(L) in Sect. 5.2 depend
200
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
on the policy π being used, so they now will be written as M π , M0π , and D(Lπ ), respectively. With this notation and the family Π0 in Assumption 5.16(e), we have the following. Assumption 5.17. There exist nonempty sets M ⊃ M0 ⊃ D such that, for all π ∈ Π0 , M ⊂ M π , M0 ⊂ M0π , and D ⊂ D(Lπ ). Remark 5.18 (Notation). Given a policy π ∈ Π0 and the cost rate c(s, x, a) we will use the notation: cπ (s, x) = c(s, x, π) = c(s, x, π(s, x)) if x(s) = x.
(5.3.2)
In particular, if c(s, x, a) = c(x, a) is independent of the time parameter s, then (5.3.2) means: cπ (s, x) = c(x, π(s, x)) if x(s) = x.
(5.3.3)
If, in addition, π is stationary, so π(s, x) ≡ π(x), then (5.3.3) becomes cπ (x) = c(x, π) = c(x, π(x)). (5.3.4) Similarly, when using a policy π ∈ Π0 , expectations such as (5.2.2) will be written as π π P π (s, x, s+t, dy)v(s + t, y) Tt v(s, x)=Es,x v(s + t, x(s + t))= X
(5.3.5)
or, by the substitution property in Assumption 5.16(c), a a Tt v(s, x)=Es,x v(s + t, x(s + t))= P a (s, x, s + t, dy)v(s + t, y) X
(5.3.6) if π(s, x) = a. In particular, for a time–homogeneous MCP and a function v(s, x) ≡ v(x) in D, if π(s, x) = a, then Ttπ v(s, x) = Tta v(x) and Lπ v(s, x) = La v(x).
(5.3.7) ♦
5.4
THE DYNAMIC PROGRAMMING APPROACH
201
5.4 The Dynamic Programming Approach As noted in previous chapters, when using the dynamic programming (DP) approach to study a given optimal control problem (OCP) the idea is to obtain an equation—the so–called Bellman equation or dynamic programming equation (DPE)—from which we can obtain, under appropriate conditions, the OCP’s value function and also optimal control policies. Remark 5.19. For continuous–time MCPs, the dynamic programming (or Bellman) equation is also known as the Hamilton– Jacobi–Bellman (HJB) equation. ♦ In the remainder of this chapter we use the DP approach to analyze some OCPs associated to a general continuous–time MCP (X, A, La , c) that satisfies the conditions in Sect. 5.3, in particular, Assumptions 5.16 and 5.17. We consider, first, a finite–horizon OCP. Fix ρ ≥ 0 and T > 0. Consider the cost functional T π e−ρ(t−s) cπ (t, x(t))dt + e−ρ(T −s) K(T, x(T )) V (s, x, π) := Es,x [ s
(5.4.1) with 0 ≤ s ≤ T, x ∈ X, and π ∈ Π0 , where K ∈ M is a given nonnegative function representing a “terminal cost” at the terminal time T . (Recall that M is the set in Assumption 5.17.) For future reference, compare (5.4.1) and (5.2.8): they are essentially the same except that (5.4.1) depends on π. If ρ = 0 in (5.4.1), then V (s, x, π) is called the expected total cost during the interval [s, T ] when using the policy π. If, on the other hand, ρ > 0 then V is the discounted cost during [s, T ] when using π. In either case, the OCP is to find a policy π ∗ such V (s, x, π ∗ ) = inf V (s, x, π) =: V ∗ (s, x) ∀ (s, x) ∈ XT . π
(5.4.2)
If this is the case, then we say that π ∗ is an optimal policy, and the function V ∗ is called the OCP’s value function or optimal
5
202
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
cost function. The corresponding dynamic programming theorem is Theorem 5.21 below. Remark 5.20. Compare Theorem 4.6 and the following Theorem 5.21 with ρ = 0: the theorems are the same except that La v in (5.4.3) is written in the form (4.1.12), and the terminal cost K(T, x) in (5.4.4) takes the form C(x) in (4.1.9). In other words, Theorem 4.6 is a special case of Theorem 5.21 when the Markov control problem is given by the deterministic system (4.0.1)–(4.1.1). Similarly, in the following chapter we will specialize Theorem 5.21 (and also Theorem 5.23) to controlled diffusion processes, which is a class of controlled stochastic differential equations. ♦ Recall that D is the set in Assumption 5.17. Theorem 5.21. Suppose that v ∈ D satisfies the equation ρv(s, x) = inf [c(s, x, a) + La v(s, x)] ∀ (s, x) ∈ XT a∈A
(5.4.3)
with the boundary (or “terminal”) condition v(T, x) = K(T, x) ∀ x ∈ X.
(5.4.4)
Then: (a) v(s, x) ≤ V (s, x, π) for all (s, x) ∈ XT and π ∈ Π0 . (b) If π ∗ ∈ Π0 is an admissible Markov policy such that π ∗ (s, x) attains the minimum in the right-hand side of (5.4.3), that is (using the notation in Remark 5.18), ∗
∗
ρv(s, x) = cπ (s, x) + Lπ v(s, x) ∀ (s, x) ∈ XT ,
(5.4.5)
then v(s, x) = V (s, x, π ∗ ), and so (by part (a)) π ∗ is an optimal policy and v = V ∗ is the optimal cost function in (5.4.2). Proof. (a) Suppose that v satisfies (5.4.3). Then, for all (s, x) ∈ XT and a ∈ A, ρv(s, x) ≤ c(s, x, a) + La v(s, x).
5.4
THE DYNAMIC PROGRAMMING APPROACH
203
Therefore, for any π ∈ Π0 , ρv(s, x) ≤ cπ (s, x) + Lπ v(s, x). Note that this inequality can be written as in Proposition 5.11(a’), that is, ρv ≤ cπ + Lπ v. Hence, this inequality and (5.4.4) yield (by Proposition 5.11(a’) and comparing (5.4.1) with (5.2.8)) v(s, x) ≤ V (s, x, π) ∀ (s, x) ∈ XT . This completes the proof of part (a). (b) If (5.4.5) and (5.4.4) hold, then Proposition 5.11(a) and (5.4.1) give v(s, x) = V (s, x, π ∗ ) ∀ (s, x) ∈ XT . Thus, part (a) and (5.4.2) yield the desired conclusion.
To conclude this section, we consider the infinite–horizon version of (5.4.1), with ρ > 0, a given discount factor, and K ≡ 0. Hence, consider ∞ π V∞ (s, x, π) := Es,x e−ρ(t−s) cπ (t, x(t))dt (5.4.6) s T π e−ρ(t−s) cπ (t, x(t))dt = lim Es,x T →∞
s
for all (s, x) in X∞ := [0, ∞) × X, and π ∈ Π0 . (Compare (5.4.6) and (5.2.10).) The corresponding value (or optimal cost) function is V∞∗ (s, x) := inf V∞ (s, x, π). π
∗
As usual, a policy π ∈ Π0 is said to be optimal if V∞ (s, x, π ∗ ) = V∞∗ (s, x) for all (s, x) ∈ X∞ . On the other hand, to have a nontrivial OCP we assume the following. Assumption 5.22. There exists an admissible policy π ∈ Π0 such that V∞ (s, x, π) < ∞ for every (s, x) ∈ X∞ . Assumption 5.22 ensures that V∞∗ (s, x) < ∞ for every initial condition (s, x). As an example, Assumption 5.22 trivially holds
204
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
if c(s, x.a) is bounded, that is, for some positive constant K, 0 ≤ c(s, x, a) ≤ K for all (s, x, a) ∈ X∞ × A. In this case, (5.4.6) yields V∞ (s, x, π) ≤ K/ρ ∀ s, x, π. In the infinite–horizon case (5.4.6), the DP Theorem 5.21 becomes as follows. (Note that Theorem 4.15, in Sect. 4.3, is a deterministic version of Theorem 5.23. In particular, compare (5.4.8) and (4.3.3).) Theorem 5.23. Suppose that v ∈ D satisfies the equation ρv(s, x) = inf [c(s, x, a) + La v(s, x)] a∈A
(5.4.7)
for all (s, x) ∈ X∞ . Then: (a) v(s, x) ≤ V∞ (s, x, π) for every policy π ∈ Π0 such that e−ρt Ttπ v(s, x) → 0 as t → ∞.
(5.4.8)
(b) If π ∗ ∈ Π0 is such that π ∗ (s, x) ∈ A attains the minimum in (5.4.7), that is, ∗
∗
ρv(s, x) = cπ (s, x) + Lπ v(s, x)
(5.4.9)
for all (s, x) ∈ X∞ , then π ∗ is optimal within the class of policies π ∈ Π0 that satisfy (5.4.8) and, moreover, v(s, x) = V∞∗ (s, x) = V∞ (s, x, π ∗ ) for all (s, x).
Proof. As in the proof of Theorem 5.21(a), the relation (5.4.7) implies that ρv ≤ cπ + Lπ ∀ π ∈ Π0 . If, in addition, π ∈ Π0 satisfies the condition (5.4.8), then Proposition 5.11(b’) yields that v(·, ·) ≤ V (·, ·, π). This proves part (a). Similarly, if π ∗ satisfies (5.4.9) and (5.4.8), then Proposition 5.11(b) gives that
5.5
LONG–RUN AVERAGE COST PROBLEMS
205
v(·, ·) = V∞ (·, ·, π ∗ ). Therefore, the desired conclusion in (b) follows from (a).
5.5 Long–Run Average Cost Problems In this section we consider a time-homogeneous continuous–time MCP (X, A, La , c), with a nonnegative cost function c. We will use the notation (5.3.3), (5.3.4), and (5.3.7). For each t ≥ 0, x ∈ X, and π ∈ Π, let t π cπ (r, x(r))dr (5.5.1) Jt (x, π) := Ex 0
be the total expected cost in [0, t], when using the control policy π, given the initial state x(0) = x. (In (5.5.1) we are using the notation (5.3.3), according to which cπ (r, x) := c(x, π(r, x)) if x(r) = x.) As in (5.2.13)–(5.2.14), we now consider the long-run expected average cost, or simply the average cost (AC), when using π ∈ Π, defined as (5.5.2) J(x, π) := lim sup Jt (x, π)/t t→∞
for each initial state x. (As a particular case, see the deterministic problem (4.4.1)–(4.4.3).) Assumption 5.24. There exists a policy π ∈ Π such that J(x, π) < ∞ for every x ∈ X. For instance, if c is bounded (say, there is a constant c¯ such that 0 ≤ c(x, a) ≤ c¯ for all x ∈ X and a ∈ A), then Assumption 5.24 holds with J(x, π) ≤ c¯ for all x, π. Under our current assumptions, the AC–value function J ∗ (x) := inf J(x, π), x ∈ X, π∈Π
(5.5.3)
is finite–valued. As usual, a policy π ∗ ∈ Π is said to be optimal with respect to (5.5.2), or AC–optimal, if
206
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
J(x, π ∗ ) = J ∗ (x) ∀ x ∈ X. To analyze the AC–optimal control problem, we will first use Proposition 5.13 to obtain a characterization of J(x, π). Remark 5.25. In (5.5.4)–(5.5.5) below we use the notation (5.3.3) and (5.3.7). ♦ Proposition 5.26. (a) Let π ∈ Π be a policy for which the following holds: There exists a number j π and a function hπ ∈ D such that the pair (j π , hπ ) satisfies the Poisson equation j π = cπ (s, x) + Lπ hπ (s, x) ∀ s, x
(5.5.4)
and, furthermore, lim Ttπ hπ (s, x)/t = 0.
t→∞
(5.5.5)
Then J(·, π) is the constant j π , i.e., j π = J(x, π) ∀ x ∈ X.
(5.5.6)
(b) If in (5.5.4) we replace the equality by either ≤ or ≥, then in (5.5.6) the equality is replaced by ≤ or ≥, respectively. We will omit the proof of Proposition 5.26 because it is the same as that of Proposition 5.13. On the other hand, observe that (5.5.5) is obviously true if hπ is a bounded function. We will next consider the AC optimal control problem in which we wish to minimize the function π → J(x, π) for every x ∈ X. To this end we introduce the following definition. Definition 5.27. Consider a pair (j ∗ , h∗ ) that consists of a real number j ∗ and a function h∗ ∈ D. The pair (j ∗ , h∗ ) is called (a) a solution to the average cost optimality equation (ACOE) if j ∗ = inf [c(x, a) + La h∗ (x)] ∀ x ∈ X; a∈A
(5.5.7)
(b) a solution to the average cost optimality inequality (ACOI) if j ∗ ≥ inf [c(x, a) + La h∗ (x)] ∀ x ∈ X. a∈A
(5.5.8)
5.5
LONG–RUN AVERAGE COST PROBLEMS
207
A large part of the analysis of AC problems concerns either the ACOE (5.5.7) or the ACOI (5.5.8). This is mainly due to the following theorem. Theorem 5.28. Let (j ∗ , h∗ ) ∈ R × D be a solution to the ACOE (5.5.7). Let ΠAC ⊂ Π be family of policies π ∈ Π such that lim Ttπ h∗ (s, x)/t = 0 ∀ s, x.
t→∞
(5.5.9)
Then for every x ∈ X : (a) j ∗ ≤ inf π∈ΠAC J(x, π); hence (a’) j ∗ ≤ J ∗ (x) if ΠAC = Π, where J ∗ (·) is the AC–value function in (5.5.3). Moreover, let ΠAC S ⊂ ΠS be the family of stationary is such that, for every policies that satisfy (5.5.9). If π ∗ ∈ ΠAC S x ∈ X, π ∗ (x) ∈ A minimizes the right–hand side of (5.5.7), i.e., ∗
∗
j ∗ = cπ (x) + Lπ h∗ (x) ∀ x ∈ X,
(5.5.10)
then, for all x ∈ X, (b) j ∗ = J(x, π ∗ ) = inf π∈ΠAC J(x, π); hence S ∗ ∗ ∗ (b’) j = J(x, π ) = J (x) if ΠAC = Π. In words, if π ∗ ∈ ΠS is a stationary policy that satisfies (5.5.10) and, in addition, (5.5.9) holds for every π ∈ Π, then π ∗ is AC–optimal, and the optimal cost is the constant J(·, π ∗ ) ≡ j ∗ . minimizes the (c) Suppose that, instead of (5.5.10), π ∗ ∈ ΠAC S right–hand side of the ACOI (5.5.8), i.e., ∗
Then
∗
j ∗ ≥ cπ (x) + Lπ h∗ (x) ∀ x ∈ X.
(5.5.11)
j ∗ ≥ J(x, π ∗ ) ≥ J ∗ (x)
(5.5.12)
∀ x ∈ X.
(c’) Part (b’) holds, that is, j ∗ = J(·, π ∗ ) = J ∗ (·) if ΠAC = Π. Proof. (a) By (5.5.7), j ∗ ≤ c(x, a) + La h∗ (x) ∀ x ∈ X, a ∈ A, and so
j ∗ ≤ cπ (s, x) + Lπ h∗ (s, x) ∀ π ∈ Π.
208
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Therefore, by Proposition 5.26(b) and (5.5.9), j ∗ ≤ J(x, π) ∀ x ∈ X, π ∈ ΠAC . This implies (a), and also (a’) if ΠAC = Π. (b) From Proposition (5.2.14)(a), j ∗ = J(x, π ∗ ) for all x ∈ X. Hence, (b) follows from part (a). Clearly, (b) implies (b’). (c) The first inequality in (5.5.12) is a consequence of (5.5.11) and Proposition 5.26(b). The second inequality follows from the definition (5.5.3) of J ∗ . Finally, parts (c) and (a) give (c’). Remark 5.29. In results such as Theorem 5.28(a’) or (c), we require conditions ensuring the existence of measurable mappings π ∗ : X → A such that, for every x ∈ X, π ∗ (x) ∈ A attains the minimum in the right–hand side of (5.5.7) or (5.5.8); that is, if v(x, a) := c(x, a) + La h∗ (x), then
∗
inf v(x, a) = v π (x) := v(x, π ∗ (x))
a∈A
(5.5.13) (5.5.14)
for all x ∈ X. These conditions can be obtained from results as those in Appendix B. For example, suppose that A is a compact metric space, and in Theorem B.3 consider the “constant” multifunction Φ(·) ≡ A. Suppose, in addition, that v in (5.5.13) is such that a → v(x, a) is l.s.c. on A for each x ∈ X. Then Theorem B.3 gives the existence of π ∗ that satisfies (5.5.14). If A is not compact, we can try to use Theorems B.8 or B.9, for instance. ♦ Theorem 5.28 shows that the ACOE (5.5.7) and the ACOI (5.5.8) give a lower bound or an upper bound, respectively, for the optimal AC function J ∗ (·). They also give means to obtain an AC–optimal policy π ∗ ∈ ΠAC S . Then the obvious question is, of course, how to obtain a solution (j ∗ , h∗ ) to (5.5.7) or (5.5.8)? There are several ways to answer this question, depending on the underlying assumptions, such as the ergodicity approach, the vanishing discount approach, the infinite–dimensional linear programming
5.5
LONG–RUN AVERAGE COST PROBLEMS
209
approach,..., etc. In fact, to the best of our knowledge, none of these approaches has been developed for the general MCPs introduced in this chapter. The first two, however, can be naturally extended to our current context—see the following Sects. 5.5.1 and 5.5.2. The infinite-dimensional linear programming approach requires a more technical background, but the general ideas are as in the discrete-time case in Hern´andez-Lerma and Lasserre (1996) and (1999) Chaps. 6 and 12, respectively.
5.5.1 The Ergodicity Approach In Remark 5.14(d) and Proposition 5.26, let us suppose that for each stationary policy π ∈ ΠS , the corresponding Markov process x(·) is uniformly geometrically ergodic in the sense of (5.2.18), that is, for every π ∈ ΠS , t ≥ 0, and x ∈ X, P π (t, x, ·) − μπ (·) ∗ ≤ θe−γt ,
(5.5.15)
where θ and γ are positive constants. Let us suppose, in addition, that the cost function c is bounded. Then defining j π ∈ R and hπ (·) ∈ D as in (5.2.19), we obtain a solution (j π , hπ ) to the Poisson equation (5.5.4), i.e., for each π ∈ ΠS , j π = cπ (x) + Lπ hπ (x) ∀ x ∈ X.
(5.5.16)
Moreover, since c is bounded, then so is hπ and hence (5.5.5) holds. Therefore, from (5.5.6), for every π ∈ ΠS and x ∈ X, j π = J(x, π) ≥ inf J(x, π). π∈ΠS
(5.5.17)
For examples and further comments on the ergodicity approach, see Sect. 6.5. In the meantime, note that results such as (5.5.15) are well known in the literature on Markov processes; see, for instance, Down et al. (1995) or Lund et al. (1996).
5
210
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
5.5.2 The Vanishing Discount Approach The vanishing discount approach to the AC problem refers to the analysis of the ρ–discounted cost (5.4.6) for a time–homogeneous MCP, say, ∞ ρ π e−ρt cπ (x(t))dt (5.5.18) V (x, π) := Ex 0
as “the discount ρ vanishes”, that is, as ρ ↓ 0. (For deterministic continuous-time AC problems, the vanishing discount approach is studied in Sect. 4.4.3.) There are several way to see the connection between (5.5.18) and the AC (5.5.2). For instance, from the Abelian theorems in Exercises 5.7 and 5.8(c) it can be seen that if (5.5.2) holds with “lim sup” replaced by “limit”, i.e., J(x, π) = lim Jt (x, π)/t, t→∞
then lim ρV ρ (x, π) = J(x, π). ρ↓0
(5.5.19)
(See Exercise 5.8(b) or (c).) Alternatively, (5.5.19) can be obtained in the context of (5.5.15)–(5.5.17). Indeed, inside the integral in (5.5.18) replace cπ (·) with cπ (·) − j π + j π , with j π = J(x, π) as in (5.5.17). Then V ρ (x, π) can be expressed as ∞ ρ π V (x, π) = Ex e−ρt [cπ (x(t)) − j π ]dt + j π /ρ, 0
so, multiplying both sides by ρ, we obtain ∞ ρ π π ρV (x, π) = j + ρEx e−ρt [cπ (x(t)) − j π ]dt.
(5.5.20)
0
Therefore, this fact yields again (5.5.19) provided that the rightmost term in (5.5.20) tends to zero as ρ ↓ 0, i.e., ∞ π e−ρt [cπ (x(t)) − j π ] dt = 0. (5.5.21) lim ρEx ρ↓0
0
5.5
LONG–RUN AVERAGE COST PROBLEMS
211
As an example, this is true if (5.5.15) holds. (See Exercise 5.10.) The starting point of the so–called “vanishing discount approach” in stochastic control theory is the ρ–discount dynamic programming equation (5.4.7) for a time–homogeneous MCP, that is, with vρ (·) ≡ v(·), ρvρ (x) = inf [c(x, a) + La vρ (x)], x ∈ X. a∈A
(5.5.22)
Recall that, in the time-homogeneous case, vρ (x) := inf vρ (x, π), π ∞
π
with vρ (x, π) := Ex 0 e−ρt cπ (x(t))dt. For applications of the vanishing discount approach to controlled diffusion processes, see Sect. 6.5. Notes—Chapter 5 1. Most of the material in this chapter comes from Hern´andezLerma (1994) but, in fact, general continuous–time MCPs are a standard subject; see Doshi (1976a, b, 1979), Fleming (1984), Gihman and Skorohod (1979), Hern´andez-Lerma and Govindan (2001), Rishel (1990), and their references. For noncontrolled continuous-time Markov processes, as in Sects. 5.1 and 5.2, above, there are many excellent textbooks; see, for instance, Evans (2013) or Mikosch (1998) or the introductory chapters in Arnold (1974), Hanson (2007), Øksendal (2003), ... 2. In these notes, as examples of continuous–time MCPs, we only consider the deterministic systems in Chap. 4, and the controlled diffusion processes in Chap. 6. There are, however, many other important classes of continuous–time MCPs, such as controlled jump-Markov processes with a countable state space (see, for instance, Guo and Hern´andez-Lerma (2009) or Prieto-Rumeau and Hern´andez-Lerma (2012)) or an uncountable (Borel) state space (as in Piunovskiy and Zhang (2020)), and controlled jump– diffusion processes (as in Hanson (2007) or Øksendal and Sulem (2007)).
5
212
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
Exercises 5.1. Let Tt be as in (5.2.2). Prove that Tt , t ≥ 0, is a semigroup of operators on M , that is, T0 =Identity, each Tt maps M into itself, and Tt+r = Tt Tr for all t, r ≥ 0. 5.2. Prove Lemma 5.9. 5.3. Prove Proposition 5.11(b’). 5.4. Show directly that Propositions 5.11(a) and (a’) hold when ρ = 0. More explicitly, suppose that the functions c and K are as in Proposition 5.11. Then: (a) If v ∈ D(L) satisfies that c(s, x) + Lv(s, x) = 0 ∀ (s, x) ∈ XT , with the terminal condition (5.2.7), then T c(t, x(t))dt + K(T, x(T ))] ∀ (s, x) ∈ XT . v(s, x) = Es,x [ s
Similarly for (a’). Remark. Let bM0 be the class of functions c ∈ M0 that are bounded in the supremum norm, c := sups,x |c(s, x)|. For fixed ρ > 0, the operator Rρ on bM0 defined by (5.2.10), i.e., ∞ Rρ c(s, x) := e−ρt Tt c(s, x)dt (5.5.23) 0
is called the resolvent of the semigroup Tt . The following exercise shows that the resolvent is the unique solution of (5.2.6) in D(L) ♦ if c is in bM0 . 5.5. Show that if c is in bM0 , then the resolvent v := Rρ c in (5.5.23) is the unique function in D(L) that satisfies (5.2.6) for all (s, x) ∈ X∞ . 5.6. Prove the statement in Remark 5.14(b). The result in the following Exercise 5.7 is a so–called Abelian theorem, well known in Laplace transform theory. (See Widder
5.5
LONG–RUN AVERAGE COST PROBLEMS
213
(1941), pp. 181–182, for instance.) As shown in Exercise 5.8, Abelian theorems establish a connection between discounted costs (as in (5.2.10)) and long–run average costs (as in (5.2.14)). Exercises 5.7 and 5.8 extend to general MCPs the results in Lemma 4.31 and Proposition 4.32. 5.7. For t ≥ 0, let t → α(t) be a nondecreasing function with α(0) = 0. Define αinf := lim inf α(t)/t, αsup := lim sup α(t)/t t→∞
t→∞
that, for every ρ > 0, andsuppose αsup < ∞. ∞Prove ∞ −ρt −ρt (a) 0 e dα(t) = ρ 0 e α(t)dt; ∞ ∞ (b) αinf ≤ lim inf ρ↓0 ρ 0 e−ρt dα(t) ≤ lim supρ↓0 ρ 0 e−ρt dα(t) ≤ αsup ; (c) If the limit α(t)/t → α∗ exists, then ∞ ∗ e−ρt dα(t). α = lim ρ ρ↓0
0
5.8. Let c ∈ M0 be nonnegative. In Exercise 5.7, let t Tr c(s, x)dr, α(t) := 0
so the discounted cost v ρ in (5.2.10) and the long-run average cost J sup in (5.2.14) become ∞ ρ e−ρt Tt c(s, x)dt v (s, x) = 0 ∞ = e−ρt dα(t) 0
and J sup (s, x) = αsup , respectively, with αsup < ∞ as in Exercise 5.7. Then (a) for every ρ > 0, t ∞ ρ −ρt e Tr c(s, x)dr dt; v (s, x) = ρ 0
(b) Jinf (s, x) ≤ lim inf ρ↓0 ρv ρ (s, x)
0
214
5
CONTINUOUS–TIME MARKOV CONTROL PRORCESSES
≤ lim supρ↓0 ρv ρ (s, x) ≤ J sup (x, s), with J sup and Jinf as in (5.2.14) and (5.2.15), respectively. (c) If Jt (s, x) → j ∗ as t → ∞, then limρ↓0 ρv ρ (s, x) = j ∗ . 5.9. Prove the statement in Remark 5.14(d). More explicitly, let X = {x(t), t ≥ 0} be a time–homogeneous Markov process with an invariant probability measure μ and geometrically (or uniformly) ergodic in the sense of (5.2.18). Let c ∈ M0 be a bounded function, say, |c(x)| ≤ c¯ for all x ∈ X. Then the function hc in (5.2.19) is in D(L), and the pair (j(c), hc ) in (5.2.19) is a solution to the Poisson equation (5.2.20). (The uniform ergodicity condition (5.2.18) is well known for several norms · ∗ and different Markov processes. See for instance (5.3.1)) Solution. Observe that, for all r ≥ 0 and x ∈ X, |Tr c(x) − j(c)| = | [P (r, x, dy) − μ(dy)]|c(y)| X
≤ c¯ P (r, x, ·) − μ(·) ∗ ≤ c¯θe−γr .
(5.5.24)
In the latter inequality we used (5.2.18). Note that hc is bounded, because (from (5.2.19) and (5.2.20)) ∞ |Tr c(x) − j(c)|dr |hc (x)| ≤ 0
≤ c¯θ/γ
∀ x ∈ X.
Moreover (since (5.5.24) allows the interchange of integrals), ∞ Ts hc (x) = [Tr+s c(x) − j(c)]dr 0 ∞ = [Tr c(x) − j(c)]dr s s = hc (x) − [Tr c(x) − j(c)]dr; 0
that is,
s
Ts hc (x) − hc (x) = − 0
[Tr c(x) − j(c)]dr.
5.5
LONG–RUN AVERAGE COST PROBLEMS
215
Finally, multiplying by 1/s and then letting s ↓ 0, the Poisson equation (5.2.20) follows. 5.10. Prove: (5.5.15) implies (5.5.21). Solution. As in (5.5.24), for every t ≥ 0 and x ∈ X, |Exπ cπ (x(t)) − j π | ≤ c¯θe−γt , where c¯ is an upper bound for c(x, a). Consequently, for every x ∈ X, ∞ π [cπ (x(t)) − j π ]dt| ≤ ρ¯ cθ/γ. |ρEx 0
This fact yields (5.5.21).
Chapter 6 Controlled Diffusion Processes
6.1 Diffusion Processes In the remainder of these notes we consider a class of Rd –valued Markov processes {x(t), t ≥ 0} called (Markov) diffusion processes. These are processes that are characterized in a suitable sense by a function b : [0, ∞) × Rd → Rd called the drift vector, and a d × d matrix D on [0, ∞) × Rd called the diffusion matrix, which is assumed to be symmetric and nonnegative definite. In the extreme case in which D ≡ 0, the zero matrix, the process x(·) is the solution of an ordinary differential equation x(t) ˙ = b(t, x(t)), t ≥ 0. At the other extreme, if b ≡ 0 and D ≡ I the identity matrix, then x(·) is a Markov process called Wiener process or Brownian motion. (See Example 5.3, above, or any introductory book on stochastic analysis or stochastic differential equations, for instance, Arnold (1974), Evans (2013), Mikosch (1998), Øksendal (2003),...) More precisely, we will consider so–called Itˆ o diffusions or Itˆ o processes {x(t), t ≥ 0} that are solutions of stochastic differential equations (SDEs) of the form dx(t) = b(t, x(t))dt + σ(t, x(t))dw(t),
(6.1.1)
where b(t, x) and σ(t, x) are given functions from [0, ∞) × Rd to Rd and Rd×n , respectively, and {w(t), t ≥ 0} is a standard n– dimensional Wiener process. To begin, throughout the following © The Author(s), under exclusive license to Springer Nature Switzerland AG 2023 O. Hern´andez-Lerma et al., An Introduction to Optimal Control Theory, Texts in Applied Mathematics 76, https://doi.org/10.1007/978-3-031-21139-3 6
217
218
6
CONTROLLED DIFFUSION PROCESSES
we impose conditions on the coefficients b and σ ensuring that (6.1.1) has indeed a well-defined (and well–behaved) solution. Assumption 6.1. Itˆ o conditions. The functions b(t, x) and σ(t, x) are measurable and satisfy: (a) Linear growth: For every T > 0, there is a constant K = K(T ) such that, for all 0 ≤ t ≤ T and x ∈ Rd , |b(t, x)| ≤ K(1 + |x|),
|σ(t, x)| ≤ K(1 + |x|),
(where, fora matrix d = (dij ), we define its norm |d|2 := T r(dd∗ ) = ij d2ij , with d∗ := transpose of d, and T r(D) := Trace of a matrix D); and (b) Lipschitz conditions: For every T > 0 and r > 0, there is a constant K = K (T, r) such that |b(t, x) − b(t, y)| ≤ K |x − y|,
|σ(t, x) − σ(t, y)| ≤ K |x − y|
for all 0 ≤ t ≤ T and |x| ≤ r, |y| ≤ r. The conditions (a) and (b) in Assumption 6.1 are implied, for instance, by the following: (a’) The components of b(t, x) and σ(t, x) are continuously differentiable in x with bounded derivatives, uniformly in t ≥ 0; (b’) |b(t, 0)| + |σ(t, 0)| ≤ K for all t ≥ 0, for some constant K. Under Assumption 6.1 the SDE (6.1.1) has a unique continuous solution x(·), which is a Markov process with transition probabilities P (s, x, t, B) = P (x(t) ∈ B|x(s) = x) = P (x(t; s, x) ∈ B) (6.1.2) for all 0 ≤ s ≤ t, x ∈ Rd , B ∈ B(Rd ) where x(t) = x(t; s, x) denotes the solution of (6.1.1) for t ≥ s, with initial condition x(s) = x. Moreover, denoting by Es,x the conditional expectation given the initial condition x(s) = x, we also have Es,x |x(t)|k ≤ (1 + |x|k )eC(t−s)
(k = 1, 2, . . .)
(6.1.3)
for some constant C depending on the integer k and the constant K in Assumption 6.1(a).
6.1
DIFFUSION PROCESSES
219
The solution process x(·) has other nice properties. For instance, it is continuous and satisfies the “Feller property”, which implies that x(·) is in fact a strong Markov process. (We do not use these assertions here.) We will also suppose: Assumption 6.2. The functions b and σ are continuous in the time variable t ≥ 0. Assumptions 6.1 and 6.2 imply that the solution x(·) of (6.1.1) is a Markov diffusion process with drift coefficient b(t, x) and diffusion matrix D(t, x) := σ(t, x)σ(t, x)∗ , where σ(t, x) is the diffusion coefficient in (6.1.1). (Recall that σ ∗ denotes the transpose of σ.) We will next obtain the infinitesimal generator L (see (5.2.3)) of x(·). Definition 6.3. Let C 1,2 ≡ C 1,2 ([0, ∞) × Rd ) be the class of real– valued continuous functions v(s, x) on [0, ∞) × Rd such that v is of class C 1 in s and of class C 2 in x, that is, the partial derivates vs , vxi , vxi xj , for i, j = 1, . . . , d, are continuous. If v ∈ C 1,2 , let 1 Lv(s, x) := vs (s, x) + vx (s, x)b(s, x) + T r[vxx (s, x)D(s, x)], 2 (6.1.4) where D(x, s) is the diffusion matrix, the row vector vx :=(vx1 , . . . , vxd ) is the gradient of v (in the x–variables), and vxx = (vxi xj ) is the Hessian matrix. The last term in (6.1.4) can be expressed more explicitly as d 1 1 vx x (s, x)dij (s, x), T r[vxx (s, x)D(s, x)] = 2 2 i,j=1 i j
where dij are the components of D = σσ ∗ . In terms of L we may write the important Itˆ o’s differential rule as in (6.1.5), below. Theorem 6.4. Let x(·) be the solution of (6.1.1). If v ∈ C 1,2 , then the process v(t, x(t)) satisfies the SDE dv(t, x(t)) = Lv(t, x(t))dt + vx (t, x(t))σ(t, x(t))dw(t).
(6.1.5)
6
220
CONTROLLED DIFFUSION PROCESSES
In integral form, we may write (6.1.5) for t ≥ s ≥ 0, given x(s) = x, as v(t, x(t)) − v(s, x) =
t
Lv(r, x(r))dr +
s
s
t
vx (r, x(r))σ (r, x(r))dw(r).
If v and σ are such that, for each t > s, t |vx (r, x(r))σ(r, x(r))|2 dr < ∞, Es,x
(6.1.6)
(6.1.7)
s
then the expected value of the last integral in (6.1.6) is zero. Therefore, if in addition to (6.1.7) we have that t Es,x |Lv(r, x(r))|dr < ∞, (6.1.8) s
then taking expectations Es,x in (6.1.6) we obtain t Lv(r, x(r))dr. Es,x v(t, x(t)) − v(s, x) = Es,x
(6.1.9)
s
Multiplying both sides of (6.1.9) by (t − s)−1 and letting t ↓ s we obtain, from (5.2.3), the infinitesimal generator Lv = Lv. More explicitly, we have shown the following. Theorem 6.5. If v ∈ C 1,2 is such that (6.1.7) and (6.1.8) hold, then v is in the domain D(L) and Lv is given by 1 Lv(s, x) = vs (s, x) + vx (s, x)b(s, x) + T r[vxx (s, x)D(s, x)]. 2 (6.1.10) With Lv = Lv, (6.1.9) is a particular form of Dynkin’s formula in Remark 5.10(a). Remark 6.6. A function f (s, x) is said to satisfy a polynomial growth condition if there are constants K and j such that |f (s, x)| ≤ K(1 + |x|j ) for every (s, x). Now, using the inequality (6.1.3) and Assumption 6.1, one can see that if v ∈ C 1,2 and its partial derivates vs , vxi , vxi xj satisfy polynomial growth conditions, then (6.1.7) and (6.1.8) are satisfied. 3
6.2
CONTROLLED DIFFUSION PROCESSES
221
Remark 6.7. (a) Let v ∈ C 1,2 be as in Theorem 6.5, and let τ be a stopping time for x(·) such that Es,x (τ ) < ∞. Then (6.1.9) holds when t is replaced by τ . (See, for instance, Friedman (1975) p. 85.) This fact is analogous to Remark 5.10(b). (b) Strictly speaking, for every t ≥ s ≥ 0 and every initial condition x(s) = x, the SDE (6.1.1) is a compact form of expressing the integral equation t t b(r, x(r))dr + σ(r, x(r))dw(r), x(t) = x + s
s
where the second integral on the right–hand side, with respect to the Wiener process w(·), is an Itˆ o integral. Developing this approach, however, is out of the scope of these lecture notes. We are thus proceeding as in Chap. 5, in which instead of analyzing the properties of a Markov process x(·) we use directly the corresponding generator. Formally, this is all we need to develop the dynamic programming approach, as in Sect. 5.4.
6.2 Controlled Diffusion Processes Let A, the control (or action) set, be a closed subset of Rm and, instead of (6.1.1), consider the controlled SDE dx(t) = b(t, x(t), a(t))dt + σ(t, x(t), a(t))dw(t),
(6.2.1)
with coefficients b and σ, which are functions from [0, ∞) × Rd × A to Rd and Rd×n , respectively, and a(t) ∈ A, for t ≥ 0, being the control process. As in Chap. 5, we will only consider Markov control policies, so that a(·) is of the form a(t) = π(t, x(t)), with π : [0, ∞) × Rd → A a measurable function. We also need to restrict the set Π of admissible control policies. Thus, in view of the Assumptions 6.1, 6.2 and Remark 6.6, we define Π as follows. Definition 6.8. A Markov control policy π is said to be admissible (and we write π ∈ Π) if
222
6
CONTROLLED DIFFUSION PROCESSES
(a) The functions bπ (t, x) := b(t, x, π(t, x)) and σ π (t, x) := σ(t, x, π(t, x)) satisfy the Assumptions 6.1 and 6.2—the corresponding solution x(·) of (6.2.1) is written as xπ (·); (b) The generator Lπ of x(·) satisfies that Lπ = La if π(s, x) = a, where, from (6.1.10), 1 La v(s, x) = vs (s, x) + vx (s, x)b(s, x, a) + T r[vxx (s, x)D(s, x, a)] 2
(6.2.2) with D(s, x, a) = σ(s, x, a)σ(s, x, a) (recall that σ = transpose of σ, and T r(·) = Trace). ∗
∗
In some cases it is easy to give conditions for a policy to be admissible. For instance, suppose that the control set A contains the origin 0 ∈ Rm , and also (see (a’) and (b’) in the paragraph following Assumption 6.1): (a’) The functions b(t, x, a) and σ(t, x, a) are continuous, of class C 1 in x ∈ Rd and a ∈ A with bounded derivatives (i.e., |bx |, |ba |, |σx |, |σa | ≤ C for some constant C) uniformly in t ≥ 0; (b’) |b(t, 0, 0)| + σ|(t, 0, 0)| ≤ C
∀t ≥ 0 and some constant C.
Then a continuous function π : [0, ∞) × Rd → A is an admissible Markov control policy if, for instance, it satisfies: (c’) For every T > 0, there is a constant KT (which may also depend on π) such that |π(t, x)| ≤ KT (1 + |x|) for all 0 ≤ t ≤ T and x ∈ Rd ; (d’) For every T > 0 and r > 0, there is a constant KT,r (which may depend on π) such that |π(t, x) − π(t, y)| ≤ KT,r |x − y| for all 0 ≤ t ≤ T , and |x|, |y| ≤ r. The conditions (c’) and (d’) are of course suggested by (a’), (b’) and the Itˆo conditions in Assumption 6.1. Finally, we will suppose that Assumptions 5.16(e) and 5.17 hold. Notice in particular that, for instance (in view of (6.1.3)), a sufficient condition for the cost rate c(s, x, a) to satisfy Assumption 5.16(e) is that cπ (s, x) satisfies a polynomial growth condition, say
6.3
EXAMPLES: FINITE HORIZON
|cπ (s, x)| ≤ K(1 + |x|j )
223
∀π ∈ Π,
(6.2.3)
where K and j are positive constants, and cπ (s, x) := c(s, x, π(s, x)). We have thus completed the description of the controlled SDE (6.2.1) in the general MCP setting of Sect. 5.3.
6.3 Examples: Finite Horizon For a cost functional as in (5.4.1), with ρ = 0, and a controlled diffusion process determined by (6.2.1), the Dynamic Programming (DP) Theorem 5.21 is valid provided of course that v satisfies the conditions in Theorem 6.5 (that is, v ∈ C 1,2 and (6.1.7), (6.1.8) hold). In this case, the generator La in (5.4.3) is given by (6.2.2), so that (5.4.3), with ρ = 0, becomes 1 vs (s, x) + min vx (s, x)b(s, x, a) + T r(vxx (s, x)D(s, x, a)) a∈A 2 (6.3.1) +c(s, x, a) = 0 for (s, x) in XT := [0, T ] × Rd , with the boundary condition v(T, x) = K(T, x),
x ∈ Rd .
(6.3.2)
Moreover, using the Remark 6.7 we obtain in fact a slightly more general form of Theorem 5.21. To state it, let Q be a given open subset of XT , and let ζ be the exit time of (t, x(t)) from Q, given the initial condition (s, x) ∈ Q, that is, ζ := inf{t > s|(t, x(t)) ∈ / Q}.
(6.3.3)
(In particular, if Q := (0, T ) × Rd , then ζ = T .) Now let ∂ ∗ Q be a closed subset of the boundary ∂Q of Q such that (ζ, x(ζ)) ∈ ∂ ∗ Q with probability 1 for every initial condition (s, x) ∈ Q and every admissible policy π. Finally, in the expression (5.4.1) of the cost functional V (s, x, π), with ρ = 0, replace T with ζ to obtain
6
224
V (s, x, π) =
π Es,x
ζ
CONTROLLED DIFFUSION PROCESSES
c (t, x(t))dt + K(ζ, x(ζ)) . π
(6.3.4)
s
Then Theorem 5.21 is valid if (6.3.1) is restricted to hold for (s, x) ∈ Q, and instead of the boundary condition (5.4.4) we take v(s, x) = K(s, x) ∀(s, x) ∈ ∂ ∗ Q.
(6.3.5)
Remark. For the existence of solutions to (6.3.1) with either the boundary condition (6.3.2) or (6.3.5), see Bensoussan (1982), Fleming and Rishel (1975), Hanson (2007) or Krylov (1980), for instance. Example 6.9. (LQ systems). To simplify the exposition we consider first the scalar case (d = n = m = 1 in (6.2.1)). The state x(·) ∈ R of the system is supposed to satisfy the linear SDE dx(t) = [γ(t)x(t) + β(t)a(t)]dt + σ(t)dw(t),
(6.3.6)
with coefficients γ(·), β(·) and σ(·) of class C 1 [0, T ], and the cost functional T π 2 2 2 V (s, x, π) := Es,x (q(t)x (t) + r(t)a (t))dt + qT x (T ) , s
(6.3.7) where q(·) ≥ 0 and r(·) ≥ > 0 are continuous functions; and qT ≥ 0. Thus the cost rate and the terminal cost are given, respectively, by c(s, x, a) := q(s)x2 + r(s)a2 ,
and K(s, x) := qT x2 .
(6.3.8)
We assume that there are no control constraints, so A = R. Notice, on the other hand, that the coefficients of (6.3.6), b(t, x, a) = γ(t)x + β(t)a,
and σ(t, x, a) = σ(t)
(6.3.9)
satisfy the conditions (a’), (b’) in the paragraph after Definition 6.8. Now, from (6.3.8)–(6.3.9), the DP Eq. (6.3.1)–(6.3.2) becomes
6.3
EXAMPLES: FINITE HORIZON
225
1 vs + γ(s)xvx + σ 2 (s)vxx + q(s)x2 + min[β(s)vx a + r(s)a2 ] = 0 a∈R 2 (6.3.10) with v(T, x) = qT x2 , x ∈ R. (6.3.11) The minimum in (6.3.10) is reached at a∗ = π ∗ (s, x) given by π ∗ (s, x) = −β(s)vx /2r(s),
(6.3.12)
which inserted in (6.3.10) yields 1 vs + γ(s)xvx + σ 2 (s)vxx + q(s)x2 − (β(s)vx )2 /4r(s) = 0. 2 (6.3.13) The question now is how to obtain a solution of (6.3.13). However, by the form of this equation (or by analogy with the discrete–time case), we may try a solution of the form v(s, x) = k(s)x2 + g(s),
(6.3.14)
with k(·) and g(·) of class C 1 , and k(·) ≥ 0. Moreover, for (6.3.11) to hold we require k(T ) = qT ,
and g(T ) = 0.
(6.3.15)
With this value of v(s, x), the Eq. (6.3.13) becomes [k (s) + q(s) + 2γ (s)k(s) − β 2 (s)k 2 (s)/r(s)]x2 + g (s) + σ 2 (s)k(s) = 0
(where “prime” denotes derivative with respect to s). Hence, for v in (6.3.14) to be a solution of (6.3.13) it suffices that k(·) and g(·) satisfy k (s) = −q(s) − 2γ(s)k(s) + r(s)−1 β 2 (s)k 2 (s) and
(6.3.16)
g (s) = −σ 2 (s)k(s)
for s < T . Thus, combined with the boundary condition (6.3.15), g(·) is given by
6
226
g(s) =
T
CONTROLLED DIFFUSION PROCESSES
σ 2 (t)k(t)dt,
s ≤ T,
s
and k(·) is uniquely determined by the (Riccati equation) (6.3.16) with the terminal condition k(T ) = qT . Finally, from (6.3.12) and (6.3.14), the optimal policy π ∗ is π ∗ (s, x) = −r(s)−1 β(s)k(s)x.
(6.3.17)
Observe that π ∗ satisfies the conditions (c’) and (d’) in Sect. 6.2, and so it is admissible (in the sense of Definition 6.8). Moreover, from Theorem 5.21(b), with ρ = 0, the value function of the LQ problem (6.3.6)–(6.3.7) is T ∗ 2 σ 2 (t)k(t)dt. (6.3.18) V (s, x) = v(s, x) = k(s)x + s
The vector case. Let us suppose now that in (6.3.6), x ∈ Rd , a ∈ Rm , w ∈ Rn , with γ(·), β(·) and σ(·) matrices of appropriate dimensions, of class C 1 [0, T ] again. Furthermore, the costs in (6.3.7) (see (6.3.8)) are now the quadratic forms c(t, x, a) := x∗ q(t)x + a∗ r(t)a,
K(T, x) := x∗ qT x,
where q(·), qT ∈ Rd×d are symmetric and nonnegative definite, and r(·) ∈ Rm×m is symmetric and positive definite. We also assume that q(·) and r(·) are continuous. The analysis in the vector case is completely analogous to that presented in (6.3.10)–(6.3.18), and it yields that the optimal Markov policy and the optimal value function are [see (6.3.17), (6.3.18)] (6.3.19) π ∗ (s, x) = −r(s)−1 β(s)∗ k(s)x, and ∗
∗
J (s, x) = x k(s)x +
T
T r[D(t)k(t)]dt
(6.3.20)
s
where D(t) = σ(t)σ(t)∗ , and k(·) ∈ Rd×d is the solution to the matrix Riccati equation [see (6.3.16)]
6.3
EXAMPLES: FINITE HORIZON
227
k (s) = −k(s)γ(s) − γ(s)∗ k(s) + k(s)β(s)r(s)−1 β(s)∗ k(s) − q(s), (6.3.21) 3 for s ≤ T , with the boundary condition k(T ) = qT . Example 6.10. (Optimal portfolio selection.) Let us now consider an example formulated in the context of a financial market with two assets, or “securities”. One of them is a risk–free asset, called a bond, and the other one is a risky asset called a stock. In a consumption–investment problem, also known as an optimal portfolio selection problem, a “small investor”, that is, an economic agent whose actions cannot influence the market prices, may choose a portfolio (investment strategy) and a consumption strategy that determine the evolution of his wealth. The problem is to choose these strategies to maximize some “utility” criterion. In this example, the problem is to maximize the expected discounted total utility from consumption (6.3.25) below; in the following Example 6.11, the investor wishes to maximize the expected utility from terminal wealth (6.3.31). Let x(t) denote the investor’s wealth at time t, and suppose that the price p1 (t) of the risk–free asset (the bond) is given by dp1 (t) = rp1 (t)dt, whereas the price p2 (t) of the risky asset (the stock) changes according to the linear SDE dp2 (t) = p2 [αdt + σdw(t)], where w(·) is a 1–dimensional standard Wiener process. Here, r, α, and σ are constants with r < α, and σ > 0. A consumption– investment policy π is a pair (a1 (·), a2 (·)) consisting of a portfolio process a1 (·) and a consumption rate process a2 (·). That is, a1 (t) (respectively, 1 − a1 (t)) is the fraction of wealth invested in the stock (respectively, the bond) at time t, and a2 (t) is the consumption rate, satisfying the control constraints 0 ≤ a1 (t) ≤ 1,
a2 (t) ≥ 0.
(6.3.22)
Thus, when using a given consumption/investment policy π, the wealth x(·) ≡ xπ (·) changes according to the SDE
6
228
CONTROLLED DIFFUSION PROCESSES
dx(t) = (1 − a1 (t))x(t)rdt + a1 (t)x(t)[αdt + σdw(t)] − a2 (t)dt. (6.3.23) The three terms on the right–hand side of (6.3.23) correspond, respectively, to: (i) gains from money invested in the bond, (ii) gains from investment in the stock, and (iii) the decrease in wealth due to consumption. Rewritten in the standard form (6.2.1), Eq. (6.3.23) becomes dx(t) = [(r + (α − r)a1 (t))x(t) − a2 (t)]dt + σa1 (t)x(t)dw(t). (6.3.24) Now let U be a utility function, that is, U is a nonnegative function on [0, ∞), of class C 2 , strictly increasing, strictly concave, and such that U (0) = +∞. Then the consumption–investment problem we are concerned with is to maximize the expected discounted utility from consumption: T π e−ρt U (a2 (t))dt, (6.3.25) V (s, x, π) = Es,x s
with discount rate ρ > 0. In this case, the DP Eq. (6.3.1)–(6.3.2) becomes 1 vs + max e−ρs U (a2 ) + [(r + (α − r)a1 )x − a2 ]vx + (σ a1 x)2 vxx = 0, a 2
(6.3.26) with terminal condition v(T, x) = 0, and the maximization is over the set of pairs a = (a1 , a2 ) satisfying (6.3.22). Ignoring for the moment the constraints (6.3.22), an elementary calculation shows that the function within brackets in (6.3.26) is maximized by a∗ = (a∗1 , a∗2 ) such that a∗1 = −(α − r)vx /σ 2 xvxx ,
U (a∗2 ) = eρs vx
(6.3.27)
provided that vx > 0 and vxx < 0 for x > 0. The solution to our problem will depend of course on the particular utility function U used in (6.3.25).
6.3
EXAMPLES: FINITE HORIZON
229
Let us suppose that the utility function is of the form U (a2 ) = with 0 < γ < 1, and propose a solution to (6.3.26) of the form
aγ2 ,
v(s, x) = h(s)xγ ,
with h(T ) = 0.
(6.3.28)
In this case, (6.3.27) yields a∗1 = (α − r)/σ 2 (1 − γ),
a∗2 = x[eρs h(s)]1/(γ−1) .
(6.3.29)
Replacing these values in (6.3.26) we obtain [h (s) + Cγh(s) + (1 − γ)h(s)(eρs h(s))1/(γ−1) ]xγ = 0,
(6.3.30)
where C := r + (α − r)2 /2σ 2 (1 − γ). Since the latter equation holds for all x > 0, the function in brackets must be 0. This yields a differential equation for h, which can be solved (making the substitution g = (eρs h)1/(γ−1) ) to obtain h(s) = e−ρs [β − βe−(T −s)/β ]1−γ , with β := (1 − γ)/(ρ − Cγ). Thus with this function h, and if α − r ≤ σ 2 (1 − γ), the optimal consumption–investment policy is given by (6.3.29). Notice in particular that the optimal process a∗1 (·) is constant, and the optimal consumption rate a∗2 (·) is a linear function of x. 3 Example 6.11. In the same context of Example 6.10, let us now suppose that there is no consumption, say a2 (·) ≡ 0, so that the wealth Eqs. (6.3.23)–(6.3.24) becomes dx(t) = [r + (α − r)a(t)]x(t)dt + σa(t)x(t)dw(t), where a(·) = a1 (·). Moreover, we wish to maximize the expected utility from terminal wealth π V (s, x; π) := Es,x U [x(ζ)],
(6.3.31)
where U is a utility function such that U (0) = 0, and ζ is the first exit time from the open set Q = (0, T ) × (0, ∞). Observe that the utility criterion (6.3.31) is of the form (6.3.4) with c(s, x, a) ≡ 0 and K(s, x) = U (x).
6
230
CONTROLLED DIFFUSION PROCESSES
Then the DP equation is

    v_s + max_{0≤a≤1} [(r + (α − r)a)x v_x + (1/2)σ^2 a^2 x^2 v_xx] = 0,   (6.3.32)

with the boundary condition (see (6.3.5)) v(s, x) = U(x) for (s, x) ∈ ∂*Q, where ∂*Q is the union of [0, T] × {0} and {T} × [0, ∞). Again ignoring for the moment the constraint 0 ≤ a ≤ 1 in (6.3.22), we find that the maximum in (6.3.32) is attained at

    a* = π*(s, x) = −(α − r)v_x / (σ^2 x v_xx)   (6.3.33)
if v_x > 0 and v_xx < 0. Substitution of a* in (6.3.32) yields

    v_s + rx v_x − (α − r)^2 v_x^2 / (2σ^2 v_xx) = 0  on Q,   (6.3.34)

with

    v(s, x) = U(x)  for s = T or x = 0.   (6.3.35)
To solve (6.3.34)–(6.3.35) we suppose that the utility function is U(x) = x^γ, with 0 < γ < 1, and we try a solution of the form

    v(s, x) = h(s)x^γ,  where h(T) = 1.

With this choice of v, Eq. (6.3.34) becomes h′(s) + Cγh(s) = 0, with C as in (6.3.30). Hence h(s) = e^{Cγ(T−s)} for s ≤ T, so that

    v(s, x) = e^{Cγ(T−s)} x^γ  and  π*(s, x) = (α − r)/σ^2(1 − γ).   (6.3.36)

Thus if (α − r)/σ^2(1 − γ) ≤ 1, then we conclude that the functions in (6.3.36) correspond to the optimal value function J*(s, x) = v(s, x) and the optimal control policy (or portfolio process) π*, which is a constant. ♦
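For readers who wish to experiment numerically, the following is a minimal Python sketch (not part of the original text; the parameter values r, α, σ, γ, T are arbitrary illustrative choices, selected so that π* ≤ 1). It evaluates the constant portfolio fraction and the value function in (6.3.36), and checks the value against a crude Monte Carlo estimate of E[x(T)^γ] under the constant fraction π*.

```python
import numpy as np

# Illustrative parameters (assumed, not from the text); they give pi* <= 1.
r, alpha, sigma, gamma, T = 0.03, 0.08, 0.40, 0.5, 1.0
x0, s = 1.0, 0.0

# Constant optimal fraction and value function from (6.3.36).
pi_star = (alpha - r) / (sigma**2 * (1.0 - gamma))
C = r + (alpha - r)**2 / (2.0 * sigma**2 * (1.0 - gamma))
v = np.exp(C * gamma * (T - s)) * x0**gamma

# Monte Carlo check: with the constant fraction pi_star the wealth is a
# geometric Brownian motion, so x(T) can be sampled exactly.
rng = np.random.default_rng(0)
drift = r + (alpha - r) * pi_star - 0.5 * (sigma * pi_star)**2
xT = x0 * np.exp(drift * T + sigma * pi_star * np.sqrt(T) * rng.standard_normal(200_000))
print("pi* =", pi_star, "(needs pi* <= 1)")
print("v(0, x0)         =", v)
print("MC E[x(T)^gamma] =", (xT**gamma).mean())
```

With these values π* = 0.625, and the two printed numbers agree up to Monte Carlo error, as Example 6.11 predicts.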
6.4 Examples: Discounted Costs

We now specialize the infinite–horizon discounted cost problem in (5.4.6) and Theorem 5.23 to

    dx(t) = b(x(t), a(t))dt + σ(x(t))dw(t),  t ≥ 0.   (6.4.1)

In contrast to (6.2.1), the coefficients b and σ are time–invariant, and therefore (6.4.1) is an "autonomous" equation. (Observe that σ in (6.4.1) is independent of the controls a ∈ A.) The cost rate c(s, x, a) is also time–invariant, c(s, x, a) = c(x, a), and so the meaning of the notation b^π, σ^π and c^π in Definition 6.8(a) and (6.2.3) is as in (5.3.3):

    b^π(t, x) := b(x, π(t, x)),   σ^π(x) := σ(x),   c^π(t, x) := c(x, π(t, x)).

If π is stationary, we use the notation (5.3.4): b^π(x) = b(x, π), c^π(x) = c(x, π). For a function v(s, x) = v(x) of class C^2 in x ∈ R^d (assuming that it satisfies the assumptions of Theorem 6.5), the expression (6.2.2) for the generator L^a reduces to

    L^a v(x) = v_x(x)b(x, a) + (1/2)Tr[v_xx(x)D(x)],   (6.4.2)

with D(x) = σ(x)σ(x)*. Let V(x, π) be as in (5.4.6) but in the time–homogeneous case:

    V(x, π) := E^π_x ∫_0^∞ e^{−ρt} c^π(t, x(t)) dt,

where π is an admissible policy in the sense of Definition 6.8. Moreover, let V* be the value function

    V*(x) := inf_π V(x, π)  ∀ x ∈ R^d.

Then, under Assumption 5.22, V*(x) < ∞ for every x ∈ X and the DP equation is given by (5.4.7), with L^a as in (6.4.2), i.e.,
    − ρv(x) + (1/2)Tr[v_xx(x)D(x)] + min_{a∈A}[c(x, a) + v_x(x)b(x, a)] = 0.   (6.4.3)

Definition 6.12. Let v ∈ D be a solution of (6.4.3). We denote by Π_D the class of stationary policies π ∈ Π for which the following condition

    lim_{t→∞} e^{−ρt} E^π_x v(x(t)) = 0  ∀ x ∈ R^d   (6.4.4)

is satisfied. Note that (6.4.4) is a time–homogeneous version of (5.4.8).

Remark 6.13. By Theorem 5.23(b), if π* ∈ Π_D attains the minimum in (6.4.3), that is,

    c(x, π*) + v_x(x)b(x, π*) = min_{a∈A}[c(x, a) + v_x(x)b(x, a)]  ∀ x,

then π* is ρ–discount optimal in the class Π_D. ♦

Example 6.14. (LQ systems). Consider the time–invariant scalar linear system [see (6.3.6)]

    dx(t) = [γx(t) + βa(t)]dt + σ dw(t),  t ≥ 0,   (6.4.5)
with constant coefficients γ, β, σ (β ≠ 0), and the ρ–discounted cost functional

    V(x, π) := E^π_x ∫_0^∞ e^{−ρt}(qx^2(t) + ra^2(t)) dt,   (6.4.6)

where q ≥ 0, r > 0. Thus, using the notation v_x = v′ and v_xx = v″, the DP Eq. (6.4.3) becomes

    − ρv + (1/2)σ^2 v″ + min_a[qx^2 + ra^2 + (γx + βa)v′] = 0,   (6.4.7)

where the minimum is over A = R. As in (6.3.10), we will try to solve (6.4.7) with a function of the form

    v(x) = kx^2 + g  ∀ x ∈ R;  k and g constants.   (6.4.8)

For this function v, (6.4.7) becomes
    − ρ(kx^2 + g) + σ^2 k + min_a[qx^2 + ra^2 + 2(γx + βa)kx] = 0,   (6.4.9)

and the minimum is attained at a = π* given by

    π*(x) = −r^{−1}βkx,  x ∈ R.   (6.4.10)

Inserting this value of a = π* in (6.4.9), we obtain

    [q + (2γ − ρ)k − r^{−1}β^2 k^2]x^2 + (kσ^2 − ρg),

which is zero for all x ∈ R if g = kσ^2/ρ, and k satisfies the equation

    q + (2γ − ρ)k − r^{−1}β^2 k^2 = 0.   (6.4.11)

Assuming that q > 0, the latter equation has a unique positive solution. Thus the function v(x) in (6.4.8) is given by

    v(x) = kx^2 + kσ^2/ρ,  x ∈ R,   (6.4.12)
where k is the unique positive solution to (6.4.11). Thus to conclude that π* in (6.4.10) is optimal in the sense of Remark 6.13, it only remains to verify that π* satisfies (6.4.4). To do this, in (6.4.5) take a(t) = π*(x(t)), to obtain

    dx(t) = −αx(t)dt + σ dw(t),  x(0) = x,   (6.4.13)

with α = r^{−1}β^2 k − γ, which is the so–called Langevin equation. Thus, as is well known (see, for instance, Arnold (1974) Sect. 8.3, or Øksendal (2003) p. 75), the solution of (6.4.13) is

    x(t) = x e^{−αt} + σ ∫_0^t e^{−α(t−s)} dw(s)  for t ≥ 0.

Therefore, by the properties of stochastic integrals,

    E^{π*}_x[e^{−ρt} x^2(t)] = x^2 e^{−(ρ+2α)t} + σ^2 e^{−ρt} E^{π*}_x[ ( ∫_0^t e^{−α(t−s)} dw(s) )^2 ]
    = (x^2 − σ^2/2α)e^{−(ρ+2α)t} + σ^2 e^{−ρt}/2α.   (6.4.14)
Finally, from (6.4.11) and the definition of α, we have ρ + 2α = [(2γ − ρ)^2 + 4r^{−1}β^2 q]^{1/2} > 0. Therefore,

    lim_{t→∞} E^{π*}_x[e^{−ρt} x^2(t)] = 0,   (6.4.15)

which in turn, by (6.4.12), implies (6.4.4) for π = π*. Hence from Remark 6.13 we conclude that π* minimizes (6.4.6) within the class of admissible stationary policies π that satisfy

    lim_{t→∞} E^π_x[e^{−ρt} x^2(t)] = 0,   (6.4.16)
and the value (or minimum) cost function is v(·) ≡ v_ρ(·) in (6.4.12), where k is the unique positive solution of the quadratic equation (6.4.11). ♦

Remark. The same argument leading from (6.4.13) to (6.4.14) shows that (6.4.15) holds for every policy of the form π(x) = −Gx if G is a constant such that ρ + 2(βG − γ) > 0, that is, G > (γ − ρ/2)/β.

The vector case. The above results are easily extended to n–dimensional systems (6.4.5) with q and r in (6.4.6) being symmetric matrices, q nonnegative definite and r positive definite. In this case, (6.4.8) = (6.4.12) and (6.4.10) become

    v(x) = x*kx + g  and  π*(x) = −r^{−1}β*kx,

where g is a constant, and k is the symmetric and positive definite solution of a matrix equation corresponding to (6.4.11). ♦

Example 6.15. (Maximization of total discounted utility from consumption). The problem is to choose a consumption process a(t) = π(x(t)) to maximize the expected total discounted utility

    V(x, π) = E^π_x ∫_0^∞ e^{−ρt} U[a(t)] dt,  ρ > 0,   (6.4.17)
where U is a utility function, and the wealth x(·) = x^π(·) satisfies, for t ≥ 0,

    dx(t) = x(t)[α dt + σ dw(t)] − a(t)dt,  x(0) = x;   (6.4.18)

[see (6.3.23)–(6.3.24) with a_1(·) ≡ 1 and a_2(·) = a(·)]. In (6.4.18), which can also be written as

    dx(t) = [αx(t) − a(t)]dt + σx(t)dw(t),  x(0) = x > 0,   (6.4.19)

we assume that α > 1, σ^2 > 0, and the initial wealth x is positive, whereas a(t) = π(x(t)) is subject to the constraint

    0 ≤ π(x) ≤ x  ∀ x.   (6.4.20)
Then for a general utility function U, the DP equation (6.4.3) becomes (with v_x = v′ and v_xx = v″)

    − ρv(x) + (1/2)σ^2 x^2 v″(x) + αx v′(x) + max_a[U(a) − a v′(x)] = 0.   (6.4.21)

We will solve (6.4.21) for the particular utility function U(a) = a^γ/γ, with 0 < γ < 1, and try a solution of the form [see (6.3.28)]

    v(x) = hx^γ,  h > 0.   (6.4.22)

In this case, the function U(a) − a v′(x) is maximized when a = π*(x) is given by

    π*(x) = (γh)^{−1/(1−γ)} x  ∀ x ≥ 0.   (6.4.23)

Inserting this value of a = π*(x) in (6.4.21) gives the following equation for h:

    −[ρ + σ^2 γ(1 − γ)/2 − αγ] + (1 − γ)(γh)^{−1/(1−γ)} = 0.

Therefore

    h = γ^{−1}θ^{−(1−γ)},  with θ := [ρ + σ^2 γ(1 − γ)/2 − αγ]/(1 − γ),   (6.4.24)

and, from (6.4.23),

    π*(x) = θx.   (6.4.25)
This is the optimal consumption process provided that it satisfies (6.4.20), so that we must have

    0 < θ ≤ 1,   (6.4.26)

and provided also that the condition (6.4.4) holds. To verify (6.4.4), let us apply Itô's differential rule (Theorem 6.4) to the process y(t) = log x(t), with x(·) = x^{π*}(·) in (6.4.19), to obtain

    dy(t) = (α − θ − σ^2/2)dt + σ dw(t).

This equation, in integral form, yields

    log[x(t)/x] = (α − θ − σ^2/2)t + σw(t);

that is,

    x(t) = x exp{(α − θ − σ^2/2)t + σw(t)}.   (6.4.27)

Therefore, since w(t) is a Gaussian variable N(0, t) (see Example 5.3), from (6.4.22) we obtain

    E^{π*}_x[e^{−ρt} v(x(t))] = hx^γ exp{−[ρ + σ^2 γ(1 − γ)/2 − αγ + θγ]t} = hx^γ e^{−θt}  [from (6.4.24)],

which tends to 0 as t → ∞, provided that (6.4.26) holds. Thus from Remark 6.13 we conclude that, assuming (6.4.26), the consumption process in (6.4.23) is optimal within the class of admissible stationary policies π for which E^π_x[e^{−ρt} v(x(t))] = he^{−ρt} E^π_x[x^γ(t)] → 0 as t → ∞. Moreover, when using π*, the corresponding wealth is the log–normal process in (6.4.27), and the optimal expected discounted utility is given by (6.4.22) and (6.4.24) for any initial wealth x > 0. ♦
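As with the finite–horizon example, a small numerical sketch (not from the text; the parameters ρ, α, σ, γ are illustrative and chosen so that (6.4.26) holds) can be used to check the closed form v(x) = hx^γ in (6.4.22)–(6.4.24) against a crude Monte Carlo estimate of the discounted utility under the linear consumption rule π*(x) = θx.

```python
import numpy as np

# Illustrative parameters (assumed, not from the text).
rho, alpha, sigma, gamma, x0 = 0.10, 0.05, 0.30, 0.5, 1.0

theta = (rho + 0.5 * sigma**2 * gamma * (1 - gamma) - alpha * gamma) / (1 - gamma)
h = theta**(-(1 - gamma)) / gamma            # from (6.4.24)
assert 0 < theta <= 1, "constraint (6.4.26) fails for these parameters"

# Under pi*(x) = theta*x the wealth is the log-normal process (6.4.27); sample it
# on a time grid and integrate e^{-rho t} U(theta x(t)) with a crude Riemann sum.
rng = np.random.default_rng(1)
T, n, paths = 100.0, 2000, 2000
dt = T / n
t = np.arange(1, n + 1) * dt
w = np.cumsum(np.sqrt(dt) * rng.standard_normal((paths, n)), axis=1)
x = x0 * np.exp((alpha - theta - 0.5 * sigma**2) * t + sigma * w)
utility = (theta * x)**gamma / gamma
V_mc = (np.exp(-rho * t) * utility).sum(axis=1).mean() * dt
print("theta =", theta)
print("v(x0) = h*x0^gamma =", h * x0**gamma)
print("Monte Carlo value  =", V_mc)
```

The two numbers should agree up to discretization and sampling error, illustrating that π* attains the value v(x) = hx^γ.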
6.5 Examples: Average Costs

Let us consider again the autonomous SDE (6.4.1) with generator L^a in (6.4.2). As in Sect. 5.5, we wish to minimize the long–run expected average cost (AC)

    J(x, π) = lim sup_{t→∞} J_t(x, π)/t,   (6.5.1)

with J_t(x, π) as in (5.5.1). As in Assumption 5.24, we suppose that there exists a control policy π ∈ Π such that J(x, π) < ∞ for every x ∈ R^d, where Π is the set of admissible policies in Definition 6.8. Finally, observe that, from (6.4.2), the AC optimality equation (ACOE) in Definition 5.27 can be expressed as

    j* = inf_{a∈A}[c(x, a) + h_x(x)b(x, a)] + (1/2)Tr[h_xx(x)D(x)]   (6.5.2)

in terms of a solution pair (j*, h(·)).

Example 6.16. (LQ problems). This example is related to the deterministic LQ problem in Example 4.22. First, we consider a stochastic version and then we show how it reduces to the deterministic case.

(a) The stochastic case. Consider the LQ system in Example 6.14, with state equation (6.4.5), with β ≠ 0. The cost J_t in (6.5.1) is
    J_t(x, π) = E^π_x ∫_0^t c(x(s), a(s)) ds,

with quadratic instantaneous (or running) cost c(x, a) := qx^2 + ra^2, where both q and r are positive numbers. For notational ease, we will write the derivatives h_x and h_xx in the ACOE (6.5.2) as h′ and h″, respectively. Hence, (6.5.2) becomes

    j* = inf_{a∈R}[qx^2 + ra^2 + (γx + βa)h′(x)] + (1/2)σ^2 h″(x).   (6.5.3)

Note that this equation is similar to (6.4.7). Therefore, as a first "guess", we may try to solve (6.5.3) with a function of
the form (6.4.8), i.e.,

    h(x) = kx^2 + g,  x ∈ R,   (6.5.4)

for some constants k and g. Observe, however, that these constants are NOT the same as in (6.4.8) and (6.4.12), because the latter equations depend on the discount factor ρ. Thus, to avoid confusion, we will rewrite (6.4.12) as

    v_ρ(x) = k(ρ)x^2 + k(ρ)σ^2/ρ,   (6.5.5)

where k(ρ) is the unique positive solution of (6.4.11). Inserting h(·) in (6.5.3), the same calculations used in (6.4.9)–(6.4.11) now show that the minimum in (6.5.3) is attained at ā = π̄(x) given by

    π̄(x) = −r^{−1}βkx  ∀ x ∈ R.   (6.5.6)

With this value of a = ā, (6.5.3) becomes

    j* = σ^2 k + x^2(q + 2γk − r^{−1}β^2 k^2)   (6.5.7)

for all x ∈ R. Therefore (recalling that q > 0), taking k* as the unique positive solution of the quadratic equation

    q + 2γk − r^{−1}β^2 k^2 = 0,   (6.5.8)

we obtain

    π̄(x) = −r^{−1}βk*x.   (6.5.9)

Thus, from (6.5.7) and (6.5.4), we have that the pair (j*, h(·)) consisting of

    j* = σ^2 k*  and  h(x) = k*x^2 + g  ∀ x,   (6.5.10)

where g is an arbitrary constant, is a solution to the ACOE (6.5.3). We now wish to show, using Theorem 5.28(b), that π̄ is AC–optimal in the family Π_S^{AC} of stationary policies that satisfy (5.5.9), which in our present situation becomes
    lim_{t→∞} t^{−1} E^{π̄}_x h(x(t)) = 0  ∀ x.

This condition, by the definition of h in (6.5.10), is equivalent to

    lim_{t→∞} E^{π̄}_x[x^2(t)]/t = 0  ∀ x.   (6.5.11)

To prove this, note that inserting a(t) = π̄(x(t)) in the linear equation (6.4.5) we obtain again a Langevin equation (6.4.13) but now with coefficient α := r^{−1}β^2 k* − γ. Hence, from (6.4.14)–(6.4.15) with π* = π̄ and ρ = 0, we have

    E^{π̄}_x[x^2(t)] = (x^2 − σ^2/2α)e^{−2αt} + σ^2/2α.   (6.5.12)
This clearly yields (6.5.11) since, from (6.5.8), α = (γ^2 + β^2 q/r)^{1/2} > 0. Therefore, from Theorem 5.28(b) we conclude the desired result: π̄ is AC–optimal in Π_S^{AC}, and the minimum average cost is j* in (6.5.10).

(b) The deterministic LQ case. The AC results for the deterministic LQ system are obtained by taking the coefficient σ = 0 in (6.4.5) and "everywhere" in part (a) above; in particular, the state Eq. (6.4.5) is now

    ẋ(t) = γx(t) + βa(t),  t ≥ 0,   (6.5.13)

with constant coefficients γ, β, with β ≠ 0. Moreover, from (6.5.7)–(6.5.10) with σ = 0, we conclude that the optimal average cost is j* = 0, and the AC–optimal policy is again π̄ in (6.5.9). ♦

Remark 6.17. (The certainty–equivalence principle). Consider a stochastic optimal control problem (OCP) with state equation as in, say, (6.2.1):

    dx(t) = b(t, x(t), a(t))dt + σ(t, x(t), a(t))dw(t).

Consider also an associated deterministic OCP in which the corresponding state equation is obtained from (6.2.1) by taking σ(·) ≡ 0. If the optimal control policy in the stochastic case is the same as in the associated deterministic problem, it is then said that the stochastic system satisfies the certainty–equivalence
principle. From Example 6.16(a), (b) we can see that this principle is satisfied by the average cost problem for LQ systems. This fact is also true in the discounted cost case. Can you guess why? ♦

In the following example we wish to maximize the long-run average reward (AR) defined as

    J_AR(x, a(·)) := lim inf_{T→∞} j_T(x, a(·))/T,   (6.5.14)

where, given a control policy a(·) and the running (or instantaneous) reward function r(x, a),

    j_T(x, a(·)) := E_x[ ∫_0^T r(x(t), a(t)) dt ],

given the one-dimensional Eq. (6.4.1). In this case, the optimality Eq. (6.5.2) becomes the average reward optimality equation (AROE)

    j* = sup_{a∈A}[r(x, a) + h′(x)b(x, a) + (1/2)h″(x)σ^2(x)],   (6.5.15)

where j* = sup_{a(·)} J_AR(x, a(·)), and h′, h″ denote the first and second derivatives of h with respect to x.

Example 6.18. (Average welfare in a pollution accumulation model). This example is a particular case of the control problem in Kawaguchi and Morimoto (2007) or Sect. 10.1 in Morimoto (2010), and it is also an extension to controlled diffusion processes of the deterministic Example 4.29. Now, instead of (4.4.28)–(4.4.30), the stock of pollution x(·) evolves according to the one-dimensional stochastic differential equation

    dx(t) = [a(t) − d_0 x(t)]dt + σx(t)dw(t),  x(0) = x > 0,   (6.5.16)

where a(·) denotes the flow of consumption (or pollution), d_0 > 0 is the constant rate of pollution decay, and σ > 0 is a given constant. The long-run average reward (or average welfare) is as in (6.5.14) with instantaneous reward r(x, a) = U(a) − D(x), where
U(a) is the social utility function of the consumption a, and D(x) is the social disutility of the pollution stock x. Kawaguchi and Morimoto (2007) consider general utility and disutility functions. Here, however, as in (4.4.30), we will assume that U and D are of the particular form

    U(a) = 2a^{1/2}  and  D(x) = d_1 x,  d_1 > 0,

in which case the AROE (6.5.15) becomes

    j* = sup_{a≥0}[2a^{1/2} − d_1 x + h′(x)(a − d_0 x)] + (1/2)h″(x)σ^2 x^2.   (6.5.17)

Note that the right-hand side of (6.5.17) is maximized at a = a*(x) = 1/(h′(x))^2. The problem then is how to find a suitable function h. Observe that the drift coefficient b(x, a) = a − d_0 x in (6.5.16) is the same as the right-hand side of (4.4.29) with ϕ(x) as in (4.4.30). Therefore, in view of the results in Example 4.29, we conjecture that h is of the form h(x) = −h_1 x + h_2 for some constants h_1, h_2 that need to be determined. Replacing this function h in (6.5.17) and using that a*(x) = 1/(h′(x))^2 = 1/h_1^2, (6.5.17) reduces to

    j* = 1/h_1 + (h_1 d_0 − d_1)x  ∀ x ≥ 0.

Therefore, h_1 = d_1/d_0 and j* = 1/h_1 = d_0/d_1. This gives a pair (j*, h(·)) that satisfies (6.5.17). However, to conclude from Theorem 5.28 that j* has some optimality property we still need to verify (5.5.9) for some family of policies π ∈ Π, so that

    (1/t) E^π_x h(x(t)) → 0  as t → ∞.   (6.5.18)
To this end, in (6.5.16) take a(·) ≡ a* := (d_0/d_1)^2. Then for any initial state x(0) = x > 0 and any positive integer k such that

    2d_0 > (k − 1)σ^2,   (6.5.19)

the solution x(·) of (6.5.16) satisfies
    sup_{t≥0} E[x(t)^k] ≤ x^k + C
for some constant C > 0. (See Morimoto (2010), Lemma 10.2.1 or Proposition 10.2.2.) The latter inequality obviously yields (6.5.18) provided that the coefficients d_0 and σ satisfy (6.5.19). To conclude, note that the optimal control a*(x) = 1/(h′(x))^2 in this example is the same as the optimal control a* in the deterministic control problem of Example 4.29. Hence we have another example of the certainty-equivalence principle in Remark 6.17. Can you explain why? ♦

Notes—Chapter 6

1. The setting in this chapter—imposing Assumptions 6.1, 6.2, and the notion of admissibility in Definition 6.8—is very restrictive, but it suffices to analyze some stochastic control problems by means of concepts from undergraduate calculus, such as continuity, differentiability, and so forth. The setting can be considerably relaxed, but at the cost of introducing nontrivial mathematical complications. To illustrate this, consider the following innocent–looking one–dimensional control problem in which the system Eq. (6.3.6) and the cost functional (6.3.7) are of the form

    dx(t) = a(t)dt + dw(t),  0 ≤ t ≤ 1,   (6.5.20)

    J(0, x, π) = E^π_x[x(1)^2],

respectively, with control set A = [−1, 1]. Then the optimal control is π*(s, x) = −sign(x) [see Beneš (1974), or Christopeit and Helmes (1982)], which is not an admissible Markov policy in the sense of Definition 6.8. In fact, if we write a(t) = π*(t, x(t)) in (6.5.20), the resulting equation does not satisfy Assumption 6.1(b), so in our context we cannot claim that the equation has a "solution". This kind of situation can be dealt with in a number of ways (typically, by "weakening" the notion of solution of an SDE); see Fleming and Soner
(2006), Hanson (2007), Morimoto (2010), Pham (2009), Yong and Zhou (1999), ...

2. The examples in Sects. 6.3 and 6.4 are standard; see e.g. Fleming and Rishel (1975) Chap. 6, or Øksendal (2003) Chap. 11. Example 6.10 is originally due to Merton (1971). For related applications and further references on financial economics see, for instance, Chang (2004), Karatzas (1989), Karatzas and Shreve (1998), Merton (1990), Morimoto (2010), Pham (2009).

3. The control problems in which the coefficients b and σ of (6.2.1) are of the form b(t, x, a) = a, σ(t, x, a) = σ(t, x), and A = R^m (which is the case in (6.5.20)), are called stochastic calculus of variations problems: Fleming (1983), Loewen (1987).

4. For the existence of solutions to the DP Eq. (6.4.3) see e.g. Bensoussan (1982), Morimoto (2010), Pham (2009), Krylov (1980).

5. Example 6.15 is due to Merton (1971). For other discounted problems in economics and finance see the references in Note 2 above. For other applications see, for instance, Mangel (1985), and Whittle (1982).
Appendix A Terminology and Notation
For technical reasons, all the sets and functions considered in these lecture notes are assumed to be Borel measurable. If the reader is not familiar with this concept, don't worry: we only consider nice sets and functions, for instance, open sets and continuous (or even differentiable) functions. It is important, however, to know at least the following basic terminology.

Let X be a metric space. The Borel sigma–algebra of X, denoted by B(X), is the smallest sigma–algebra that contains all the open subsets of X. The sets in B(X) are called Borel sets. If X is a complete and separable metric space (also known as a Polish space), then a Borel subset of X is called a Borel space. The following are examples of Borel spaces:
• Any open or any closed subset of R^n.
• A discrete space X, that is, a finite or denumerable set with the discrete topology (the topology consisting of all the subsets of X).
• A compact metric space (which is complete and separable).
• If X_1, X_2, ... is a (finite or countable) sequence of Borel spaces, then the product space Y := X_1 × X_2 × · · · is also a Borel space with the (product) Borel sigma–algebra.
• If X is a Borel space, then the space P(X) of probability measures on X with the topology of weak convergence is also a Borel space.
For further details on measurability and related concepts, see any introductory book on Real Analysis, for instance, Ash (1972), Bartle (1995), Bass (2020), etc.
Lower Semicontinuous Functions

Let X be a metric space and v a function from X to R ∪ {+∞} such that v(x) < ∞ for at least one point x ∈ X. This function v is said to be lower semicontinuous (l.s.c.) at x ∈ X if

    lim inf_{n→∞} v(x_n) ≥ v(x)

for any sequence {x_n} in X that converges to x. The function v is called lower semicontinuous (l.s.c.) if it is l.s.c. at every point of X.

Proposition A.1. The following statements are equivalent:
(a) v is l.s.c.;
(b) the set epi(v) := {(x, λ) ∈ X × R | v(x) ≤ λ}, called the epigraph of v, is closed;
(c) all of the lower sections (or level sets) S_λ(v) are closed, where

    S_λ(v) := {x ∈ X | v(x) ≤ λ},  λ ∈ R.

Let L(X) be the family of l.s.c. functions on X, and L^+(X) the subfamily of nonnegative l.s.c. functions. (In many applications it suffices to assume that L^+(X) consists of l.s.c. functions that are bounded below. Clearly, if v is l.s.c. and v(·) ≥ −m for some m, then v(·) + m is in L^+(X).)

Proposition A.2. A function v is in L^+(X) if and only if there exists a sequence of continuous and bounded functions v_n on X such that v_n ↑ v.

Proposition A.3. If v, v_1, . . . , v_n belong to L^+(X), then
(a) the functions αv, with α ≥ 0, v_1 + . . . + v_n, and min_i v_i, belong to L^+(X);
(b) if X is compact, then v attains its minimum, that is, there exists a point x* ∈ X such that v(x*) = inf_x v(x).

For proofs of Propositions A.1–A.3 see Ash (1972), Appendix A6, or Bertsekas and Shreve (1978), Sect. 7.5. On the other hand, v is upper semicontinuous (u.s.c.) if and only if −v is l.s.c. Moreover, v is continuous if and only if v is both l.s.c. and u.s.c.
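To see Proposition A.2 at work, here is a small numerical sketch (not part of the text). It takes the discontinuous but l.s.c. function v(x) = 0 for x ≤ 0 and v(x) = 1 for x > 0, and approximates it from below by the continuous bounded functions v_n(x) = inf_y [v(y) + n|x − y|] (a Moreau–Yosida, or inf-convolution, approximation); the infimum is computed over a finite grid, so the printed values are only approximate.

```python
import numpy as np

def v(x):
    # l.s.c. but discontinuous: 0 on (-inf, 0], 1 on (0, inf)
    return np.where(x > 0, 1.0, 0.0)

def v_n(x, n, grid):
    # Moreau-Yosida approximation v_n(x) = inf_y [ v(y) + n*|x - y| ]
    return np.min(v(grid)[None, :] + n * np.abs(x[:, None] - grid[None, :]), axis=1)

grid = np.linspace(-2.0, 2.0, 4001)          # candidate minimizers y
x = np.array([-1.0, -0.1, 0.0, 0.01, 0.1, 1.0])
for n in (1, 4, 16, 64):
    print(f"n = {n:2d}:", np.round(v_n(x, n, grid), 3))
print("v(x): ", v(x))
```

Each v_n is continuous, bounded and below v, and v_n(x) increases to v(x) pointwise, in line with Proposition A.2.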
Appendix B Existence of Measurable Minimizers
In this appendix X and A denote Borel spaces. In the previous chapters, they denote the state space and the action space (or control set) of an OCP, respectively. Let 2^A denote the family of all nonempty subsets of A. A set–valued mapping Φ : X → 2^A is called a multifunction, also known as a correspondence. In this case, a (measurable) function f : X → A such that f(x) ∈ Φ(x) for all x ∈ X is said to be a selector (or selection) of Φ. We denote by F the family of selectors of Φ. Sometimes we write Φ(x) as A(x).

Definition B.1. (a) The lower inverse of B ⊂ A with respect to Φ is defined as

    Φ^{−1}[B] := {x ∈ X : A(x) ∩ B ≠ ∅}.

(b) The graph of Φ, which we will denote by K, is given by

    K := {(x, a) ∈ X × A : a ∈ A(x)}.

(c) Φ is said to be Borel measurable if the lower inverse Φ^{−1}[B] is a Borel subset of X for every closed set B ⊂ A.

Theorems B.2 and B.3, below, are obtained in Himmelberg et al. (1976) and in Schäl (1975); they are also reproduced in Appendix D of Hernández-Lerma and Lasserre (1996).
Theorem B.2. Suppose that Φ is compact–valued, that is, Φ(x) is compact for every x ∈ X. Then the following statements are equivalent:
(a) Φ is Borel measurable.
(b) Φ^{−1}[B] ⊂ X is a Borel set for every open set B ⊂ A.
(c) The graph of Φ is a Borel subset of X × A.

Theorem B.3. Suppose that Φ is compact–valued. Let v : K → R be a Borel function such that v(x, ·) is lower semicontinuous (l.s.c.) on Φ(x) for every x ∈ X. Then:
(a) there exists a (Borel measurable) selector f* ∈ F such that

    v(x, f*(x)) = v*(x) := min_{a∈A(x)} v(x, a)  ∀ x ∈ X,   (B.1)

and v* is measurable.
(b) If the set-valued mapping x → A(x) is u.s.c. and v is l.s.c. and bounded below, then there exists f* ∈ F that satisfies (B.1) and, furthermore, v* is l.s.c. and bounded below.

Theorems B.2 and B.3 require Φ to be compact–valued. In contrast, Theorem B.8 below does not require compactness, but we need the following concepts.

Definition B.4. (a) Consider a function v : K → R.
(a1) v is called inf–compact if, for every r ∈ R, the set {(x, a) ∈ K | v(x, a) ≤ r} is compact;
(a2) v is called K–inf–compact if, for every compact set 𝒳 ⊂ X and every r ∈ R, the set {(x, a) ∈ G(𝒳) | v(x, a) ≤ r} is compact, where G(𝒳) := {(x, a) ∈ 𝒳 × A | a ∈ A(x)};
(a3) v is called inf–compact on K if, for every x ∈ X, the function a → v(x, a) is inf–compact on A(x), that is, the set

    {a ∈ A(x) | v(x, a) ≤ r} ⊂ A
is compact for every r ∈ R.
(b) Consider a compact-valued multifunction Φ : X → 2^A as above, with A a separable metric space. Then Φ is said to be:
(b1) lower semicontinuous (l.s.c.) at x ∈ X if it satisfies that: if x_n → x and a is in A(x), then there exist a_n ∈ A(x_n) such that a_n → a. Φ is called l.s.c. on X if it is l.s.c. at every x ∈ X.
(b2) upper semicontinuous (u.s.c.) at x ∈ X if it satisfies that: if x_n → x and a_n ∈ A(x_n) is such that a_n → a, then a is in A(x). Φ is u.s.c. on X if it is u.s.c. at every x ∈ X.
(b3) continuous on X if it is both l.s.c. and u.s.c. on X.

Remark B.5. (a) For calculations, it is convenient to restate Definition B.4(a2) as follows (see Feinberg et al. 2021). A function v : K → R is K-inf-compact if and only if for every sequence {(x_t, a_t)} in K such that, for some x ∈ X, x_t → x and v(x_t, a_t) is bounded above, it holds that the sequence {a_t} has an accumulation point a ∈ A(x).
(b) As a simple example of Definition B.4(b), consider the spaces X = A = R. Then the multifunction

    Φ_1(x) := [0, 1] if x ≠ 0,  [0, 1/2] if x = 0,

is l.s.c. On the other hand, the multifunction

    Φ_2(x) := [0, 1] if x ≠ 0,  [0, 2] if x = 0,

is u.s.c., whereas the "constant" multifunction Φ(x) := [0, 1] for all x ∈ X is continuous. ♦

Lemma B.6. For v as in Definition B.4(a): (a1) ⇒ (a2) ⇒ (a3).

To verify that a multifunction is l.s.c., the following proposition from Michael (1970) is useful.
Proposition B.7. (Michael, 1970). Consider a multifunction Φ : X → 2^A. If for every x ∈ X and a ∈ Φ(x) there is a continuous selector f of Φ such that f(x) = a, then Φ is l.s.c.

Theorem B.8. Let us suppose that K is a Borel subset of X × A, v is l.s.c., bounded below, and inf-compact on K. Then
(a) There exists a Borel selector f* ∈ F for which (B.1) holds.
(b) If, in addition, the multifunction x → A*(x), where A*(x) := {a ∈ A(x) : v*(x) = v(x, a)}, is l.s.c., then v* is l.s.c. If, moreover, v is continuous, then so is v*.

Proof. For part (a) see Rieder (1978); for (b) see Hernández-Lerma and Runggaldier (1994).

The conclusions (a) and (b) in Theorem B.8 can be obtained in several ways. For instance, from our Definition B.4(a) above and Feinberg et al. (2013) we obtain the following.

Theorem B.9. Let v be a real–valued function on K (with K as in Definition B.1(b)). If v is K–inf–compact (Definition B.4(a2)), then there exists f* ∈ F that satisfies (B.1). Moreover, v is l.s.c. on K, and v* is l.s.c. on X.

We conclude this appendix with a useful result from Schäl (1975), Proposition 12.2, which is reproduced as Proposition D.7 in Hernández-Lerma and Lasserre (1996).

Theorem B.10. Let X be an arbitrary metric space, A a separable metric space, and Φ a compact-valued multifunction from X to 2^A. Let F be the family of measurable selectors of Φ. If f_n is a sequence in F, then there exists f ∈ F such that, for each x ∈ X, f(x) ∈ A(x) is an accumulation point of {f_n(x)}. In other words, for each x ∈ X, there is a sequence n_i = n_i(x) such that f_{n_i}(x) → f(x) as i → ∞.

If f_n and f satisfy the conclusion of Theorem B.10, then we say that the sequence {f_n} converges in the sense of Schäl to f.
Remark B.11. Theorem B.10 is useful, for instance, in optimization problems such as the following. Let X, A and Φ be as in Theorem B.10, and consider a sequence of real-valued functions v_n on X × A such that v_n → v*. Suppose that, for each n, there exists f_n ∈ F such that

    v_n(x, f_n(x)) = inf_{a∈A(x)} v_n(x, a)  ∀ x ∈ X.

Then, by Theorem B.10, the sequence of "minimizers" {f_n} converges in the sense of Schäl to some f* ∈ F. The obvious question now is, is f* a "minimizer" for v*? More explicitly, is f*(x) ∈ A(x) such that

    v*(x, f*(x)) = min_{a∈A(x)} v*(x, a)  ∀ x ∈ X?

The answer to this question depends on the underlying assumptions. For examples in which the answer is affirmative, see Sect. 2.4 above, or Sect. 4 in Escobedo-Trujillo et al. (2020). Another example is given in the following proposition, which requires A to be a locally compact space (that is, for each a ∈ A, there is an open set containing a and such that its closure is compact). For example, Euclidean spaces R^d are locally compact. ♦

Proposition B.12. (a) Let v and v_n (n = 1, 2, . . .) be l.s.c. functions, bounded below, and inf-compact on K. Let, for x ∈ X,

    v_n*(x) := min_{a∈A(x)} v_n(x, a),   v*(x) := min_{a∈A(x)} v(x, a).

For each n, let f_n ∈ F be a minimizer of v_n, i.e., v_n*(x) = v_n(x, f_n(x)). If A is locally compact and either v_n ↑ v or v_n ↓ v, then f_n converges in the sense of Schäl (1975) to some f ∈ F that is a minimizer of v, i.e., v*(x) = v(x, f(x)) for all x ∈ X.
(b) The conclusion in part (a) is also valid if Assumption 2.33 (in Sect. 2.3.3 above) holds, the functions v and v_n belong to L_w(X), and v_n → v as n → ∞.

For a proof of Proposition B.12(a), see Lemma 4.6.6 in Hernández-Lerma and Lasserre (1996). Part (b) follows from Theorem B.10.
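The selector in (B.1) is easy to visualize on a grid. The following sketch (not from the text; the cost v(x, a) = (a − 1)^2 + 0.1xa and the constraint sets A(x) = [0, x] are invented for illustration) computes v*(x) = min_{a∈A(x)} v(x, a) together with a minimizer f*(x), i.e., a discretized version of the selector in (B.1).

```python
import numpy as np

def v(x, a):
    # illustrative continuous cost, so v(x, .) is l.s.c. on the compact set A(x)
    return (a - 1.0)**2 + 0.1 * x * a

xs = np.linspace(0.0, 2.0, 201)              # state grid, X = [0, 2]
a_grid = np.linspace(0.0, 2.0, 401)          # action grid

v_star = np.empty_like(xs)
f_star = np.empty_like(xs)
for i, x in enumerate(xs):
    feasible = a_grid[a_grid <= x]           # discretized A(x) = [0, x]
    if feasible.size == 0:
        feasible = np.array([0.0])
    costs = v(x, feasible)
    j = costs.argmin()
    v_star[i], f_star[i] = costs[j], feasible[j]

i = 50                                        # xs[i] = 0.5
print(f"x = {xs[i]:.2f}: f*(x) ~ {f_star[i]:.3f}, v*(x) ~ {v_star[i]:.3f}")
print(f"x = {xs[-1]:.2f}: f*(x) ~ {f_star[-1]:.3f}, v*(x) ~ {v_star[-1]:.3f}")
```

Because each A(x) is compact and v is continuous, the computed f* is (up to grid error) the pointwise minimizer whose existence and measurability Theorem B.3 guarantees.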
Appendix C Markov Processes

Let {x_n} be a discrete–time stochastic process, that is, a sequence of random variables, with values in a space S. We will assume that S is a Borel space. The process {x_n} is called a Markov chain or a discrete–time Markov process with state space S if, for every n ≥ k ≥ 0 and B ∈ B(S),

    P(x_n ∈ B | x_0, . . . , x_k) = P(x_n ∈ B | x_k).   (C.1)

The interpretation of (C.1) is as follows. Let us refer to k as the "present (or current) time", n ≥ k as the "future", and n ≤ k − 1 as the "past". Then (C.1) states that the distribution of the sequence at any future time n, given the "history" of the process up to the current time k, depends only on the current state x_k. In fact, the so-called Markov property (C.1) holds iff (C.1) holds for n = k + 1 only; that is, (C.1) is equivalent to the following: For every k ≥ 0 and B ∈ B(S),

    P(x_{k+1} ∈ B | x_0, . . . , x_k) = P(x_{k+1} ∈ B | x_k).   (C.2)

In other words, to verify the Markov property it is not necessary to check (C.1) for every n ≥ k; it suffices to check (C.2) for n = k + 1. The right–hand side of (C.2) defines the one–step transition probabilities

    P_k(x, B) := P(x_{k+1} ∈ B | x_k = x)
for all x ∈ S, B ∈ B(S), and k = 0, 1, . . . . If the transition probabilities are independent of k, that is,

    P(x, B) ≡ P(x_{k+1} ∈ B | x_k = x)  ∀ k ≥ 0,

then {x_n} is said to be a stationary or time–homogeneous Markov chain. Here, unless noted otherwise, we will consider stationary Markov chains only.

Remark C.1. We will try to explain the origin of the name Markov chain. The Russian mathematician Andrei A. Markov (1856–1922) considered a sequence {x_k} of random variables with values in a finite set S, and such that

    P(x_0 = i_0, x_1 = i_1, . . . , x_n = i_n) = P(x_0 = i_0)p(i_0, i_1)p(i_1, i_2) · · · p(i_{n−1}, i_n)   (C.3)

for every n ≥ 1 and every sequence of states i_0, . . . , i_n in S, where

    p(i_k, i_{k+1}) := P(x_{k+1} = i_{k+1} | x_k = i_k)

denote the one–step transition probabilities. It can be shown (Exercise 2) that (C.2) and (C.3) are equivalent; that is, the finite–valued sequence {x_k} is a Markov chain iff (C.3) holds. Moreover, because of the right–hand side of (C.3), Markov referred to {x_k} as a sequence whose probabilities were "chained". These facts appeared in a paper by Markov in 1906. The name "Markov chain" was used for the first time by Bernstein (1927). For additional details on Markov's contributions see the paper by Basharin et al. (2004). ♦

The following definition generalizes the concept of transition probability.

Definition C.2. Let X and Y be Borel spaces. A stochastic kernel on X given Y (also known as a transition probability from Y to X) is a real–valued function Q(·|·) such that
(a) B → Q(B|y) is a probability measure on B(X) for each fixed y ∈ Y, and
(b) y → Q(B|y) is a measurable function on Y for each fixed Borel set B ⊂ X.

Note that a (Markov) transition probability P(x, B) as above, written in the form P(B|x), is a stochastic kernel with X = Y.

Proposition C.3. Let {x_n, n = 0, 1, . . .} and {ξ_n, n = 0, 1, . . .} be stochastic processes in Borel spaces X and S, respectively. Suppose that ξ_0, ξ_1, . . . are independent, and also independent of x_0. If there is a measurable function F : X × S → X such that

    x_{n+1} = F(x_n, ξ_n)  ∀ n ≥ 0,   (C.4)

then {x_n} is a Markov chain. The converse is also true.

Example C.4. Let {ξ_n} be a sequence of independent random variables, and x_0 a given random variable independent of {ξ_n}. By Proposition C.3, the following are examples of Markov chains.
(a) A Markov chain that evolves as

    x_{n+1} = F(x_n) + ξ_n  ∀ n = 0, 1, . . .   (C.5)

is called a first order autoregressive process, and {ξ_n} is said to be an additive noise. (C.5) includes linear systems

    x_{n+1} = Gx_n + ξ_n,   (C.6)

where G is a constant, which can be a matrix in the vector case. If the ξ_n and the initial state x_0 are Gaussian random variables, then (C.6) is called a Gaussian–Markov system.
(b) Consider an inventory–production system

    x_{n+1} = x_n + f(x_n) − ξ_n,  n = 0, 1, . . . ,

where x_n is the stock or inventory level (of a certain product) at time n, f(x) is the production strategy given the stock level x, and ξ_n is the demand of the product in period n.
(c) Consider a water reservoir with capacity Q. Let x_n be the amount of water in the reservoir at time n, and f(x_n) the amount of water discharged in period n (for instance, for irrigation or to produce electrical energy). Then we can express
x_{n+1} as

    x_{n+1} = min[x_n − f(x_n) + ξ_n, Q],

where ξ_n is the amount of rainwater deposited in the reservoir in period n.

Remark C.5. In Example C.4(a), the sequence {ξ_n} is, in general, a random noise, that is, a sequence of arbitrary random variables with no particular interpretation or meaning. In contrast, in (b) and (c) the sequence is a driving process, that is, the random variables ξ_n have a physical or economic interpretation. Usually, this is also the case in queueing systems, in which the "random perturbations" ξ_n represent the arrival process.
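A minimal simulation sketch of Example C.4 (not from the text; all parameter values and noise distributions are arbitrary illustrative choices): both chains below are generated by a recursion x_{n+1} = F(x_n, ξ_n) as in Proposition C.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps = 10

# (C.6): x_{n+1} = G*x_n + xi_n with Gaussian noise (a Gaussian-Markov system).
G, x = 0.8, 1.0
linear_path = [x]
for _ in range(n_steps):
    x = G * x + rng.normal(0.0, 0.5)
    linear_path.append(x)

# Example C.4(c): reservoir of capacity Q with discharge f(x) = 0.2*x and random inflow.
Q, y = 10.0, 5.0
reservoir_path = [y]
for _ in range(n_steps):
    xi = rng.exponential(1.0)                # nonnegative rainfall inflow
    y = min(y - 0.2 * y + xi, Q)
    reservoir_path.append(y)

print("linear system (C.6):", np.round(linear_path, 2))
print("reservoir, part (c):", np.round(reservoir_path, 2))
```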
Continuous–Time Markov Processes

Let {x_t, t ≥ 0} be a continuous–time stochastic process with values in a Borel space X. We say that {x_t} is a Markov process if, for every t ≥ s ≥ 0 and B ∈ B(X),

    P(x_t ∈ B | x_r ∀ r ≤ s) = P(x_t ∈ B | x_s).   (C.7)

The interpretation of (C.7) is analogous, of course, to that of (C.1), where s is the "present time", t > s is the "future", and r < s is the "past". From the right–hand side of (C.7) we obtain the transition probabilities

    P(s, x, t, B) := P(x_t ∈ B | x_s = x)   (C.8)

for every t ≥ s, x ∈ X, and B ∈ B(X). The transition probabilities are called stationary or time–homogeneous if they depend only on the time–difference t − s. In this case we rewrite (C.8) as

    P(t, x, B) := P(x_t ∈ B | x_0 = x).

Example C.6. The simplest example of a continuous–time Markov process is the solution {x_t} of an ordinary differential equation
    ẋ_t = F(x_t)  for t ≥ 0,

for some given initial condition x_0. Under suitable assumptions on F, there is a unique solution

    x_t = x_0 + ∫_0^t F(x_u) du,

which can be expressed, for every t ≥ s ≥ 0, as

    x_t = x_s + ∫_s^t F(x_u) du.

This is the deterministic analogue of the Markov property (C.7).

Example C.7. [Brownian motion.] The one–dimensional Brownian motion, also known as Wiener process, is a process {w(t), t ≥ 0} with values w(·) ∈ R and:
(a) continuous trajectories t → w(t), with w(0) = 0;
(b) independent increments, that is, for any positive integer n and times 0 = t_0 < t_1 < · · · < t_n, the increments w(t_1) − w(t_0), w(t_2) − w(t_1), ..., w(t_n) − w(t_{n−1}) are independent random variables; and
(c) stationary Gaussian increments, that is, for any t > s ≥ 0, the increment w(t) − w(s) has a normal (or Gaussian) distribution with zero mean, and variance E[(w(t) − w(s))^2] = t − s.

The Wiener process has many interesting properties. In particular, from the independence of increments (b), it is a continuous–time Markov process.

A process {w(t), t ≥ 0} with values w(·) = (w_1(·), . . . , w_n(·)) ∈ R^n is an n–dimensional Brownian motion (or Wiener process) if the components w_1(·), . . . , w_n(·) are independent one–dimensional Brownian motions.
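Properties (a)–(c) of the Wiener process are easy to check by simulation. The sketch below (not from the text; the horizon, grid and sample size are arbitrary) builds Brownian paths from independent N(0, dt) increments and verifies numerically that w(t) − w(s) has mean close to 0 and variance close to t − s.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n, paths = 1.0, 400, 10_000
dt = T / n
increments = np.sqrt(dt) * rng.standard_normal((paths, n))
w = np.cumsum(increments, axis=1)            # w(t_k) for t_k = k*dt, k = 1, ..., n

s_idx, t_idx = 120, 320                      # s = 0.3, t = 0.8
incr = w[:, t_idx - 1] - w[:, s_idx - 1]
print("sample mean of w(t) - w(s):     ", incr.mean())
print("sample variance of w(t) - w(s): ", incr.var(), "  (t - s =", (t_idx - s_idx) * dt, ")")
```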
Theorem of C. Ionescu–Tulcea

The following proposition is used in the analysis of Markov chains.

Proposition C.8. (Theorem of C. Ionescu Tulcea.) Let X_0, X_1, . . . be a sequence of Borel spaces and, for n = 0, 1, . . . , define Y_n := X_0 × · · · × X_n and Y := Π_{n=0}^∞ X_n. Let ν be an arbitrary probability measure on X_0 and, for every n = 0, 1, . . . , let P_n(dx_{n+1}|y_n) be a stochastic kernel on X_{n+1} given Y_n. Then there exists a unique probability measure P_ν on Y such that, for every measurable rectangle B_0 × · · · × B_n in Y_n,

    P_ν(B_0 × · · · × B_n) = ∫_{B_0} ν(dx_0) ∫_{B_1} P_0(dx_1|x_0) ∫_{B_2} P_1(dx_2|x_0, x_1) · · · ∫_{B_n} P_{n−1}(dx_n|x_0, . . . , x_{n−1}).   (C.9)

Moreover, for any nonnegative measurable function u on Y, the function x → ∫ u(y)P_x(dy) is measurable on X_0, where P_x stands for P_ν when ν is the probability concentrated at x ∈ X_0.

Proof. See Ash (1972, p. 109), Bertsekas and Shreve (1978, pp. 140–141), or Neveu (1965, p. 162).

Remark C.9. Let us (informally) write the measure P_ν in (C.9) as

    P_ν(dx_0, dx_1, dx_2, . . .) = ν(dx_0)P_0(dx_1|x_0)P_1(dx_2|x_0, x_1) · · · ,

and let π = {π_t} be an arbitrary control policy. Then the measure P^π_ν in Chap. 3 [see (3.1.10a)–(3.1.10c)] can be written in the form

    P^π_ν(dx_0, da_0, dx_1, da_1, . . .) = ν(dx_0)π_0(da_0|x_0)Q(dx_1|x_0, a_0)π_1(da_1|x_0, a_0, x_1)Q(dx_2|x_1, a_1) · · · .
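The product formula in Remark C.9 is also the recipe for simulating a controlled trajectory: draw x_0 from ν, then alternately draw a_n from π_n and x_{n+1} from Q. A toy sketch (not from the text; the finite model below, i.e. ν, the policy and the kernel Q, is invented purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite model: 2 states, 2 actions (all kernels invented for illustration).
nu = np.array([0.5, 0.5])                    # initial distribution nu
Q = np.array([[[0.9, 0.1], [0.4, 0.6]],      # Q[x, a] = distribution of the next state
              [[0.2, 0.8], [0.5, 0.5]]])

def policy(history):
    # pi_n(da | x0, a0, ..., xn); this toy policy looks only at the current state xn
    xn = history[-1]
    return np.array([0.8, 0.2]) if xn == 0 else np.array([0.3, 0.7])

# Sequential sampling of (x0, a0, x1, a1, ...) under P^pi_nu, as in Remark C.9.
x = int(rng.choice(2, p=nu))
trajectory = [x]
for _ in range(5):
    a = int(rng.choice(2, p=policy(trajectory)))
    x = int(rng.choice(2, p=Q[x, a]))
    trajectory += [a, x]
print("sampled (x0, a0, x1, a1, ...):", trajectory)
```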
Exercises—Appendix C

1. Prove that (C.1) and (C.2) are equivalent.
2. Prove that (C.2) and (C.3) are equivalent.
Bibliography

Adukov, V.M., Adukova, N.V., Kudryavtsev, K.N.: On a discrete model of optimal advertising. J. Comput. Eng. Math. 2(3), 13–24 (2015) Alla, A., Falcone, M., Kalise, D.: An efficient policy iteration algorithm for dynamic programming equations. SIAM J. Sci. Comput. 37(1), A181–A200 (2015) Anulova, S.V., Mai, H., Veretennikov, A.Y.: On iteration improvement for averaged expected cost control for one-dimensional ergodic diffusions. SIAM J. Control Optim. 58(4), 2312–2331 (2020) Arapostathis, A., Borkar, V.S., Ghosh, M.K.: Ergodic Control of Diffusion Processes. Cambridge University Press, Cambridge (2012) Arisawa, M.: Ergodic problem for the Hamilton-Jacobi-Bellman equation. I. Existence of the ergodic attractor. Annales de l'Institut Henri Poincaré C, Analyse Non Linéaire 14(4), 415–438 (1997) Arnold, L.: Stochastic Differential Equations: Theory and Applications. Wiley-Interscience, New York (1974) Ash, R.B.: Real Analysis and Probability. Academic Press, New York (1972) Ash, R.B., Gardner, M.F.: Topics in Stochastic Processes. Academic Press, New York (1975) Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-Bellman Equations. Systems & Control: Foundations & Applications with Appendices by Maurizio Falcone and Pierpaolo Soravia. Birkhäuser Boston, Inc., Boston, MA (1997) Bartle, R.G.: The Elements of Integration and Lebesgue Measure. Wiley, New York (1995) Basharin, G.P., Langville, A.N., Naumov, V.A.: The life and work of A.A. Markov. Linear Algebra Appl. 386, 3–26 (2004) Bass, R.F.: Real analysis for graduate students. Version 4.2, 2020. Available from: https://bass.math.uconn.edu/real.html Bäuerle, N., Rieder, U.: Markov Decision Processes with Applications to Finance. Springer, Berlin (2011) Bellman, R.: Dynamic Programming. Princeton University Press, New York (1957)
Bellman, R.: A Markovian decision process. J. Math. Mech. 6(5), 679–684 (1957) Bellman, R.: Adaptive Control Processes. Princeton University Press, Princeton, NJ (1961) Beneˇs, V.E.: Girsanov functionals and optimal bang-bang laws for final value stochastic control. Stoch. Process. Appl. 2(2), 127–140 (1974) Bensoussan, A.: Stochastic Control by Functional Analysis Methods. NorthHolland, Amsterdam (1982) Bernstein, S.: Sur l’extension du th´eor´eme limite du calcul des probabilit´es aux sommes de quantit´es d´ependantes. Math. Ann. 97(1), 1–59 (1927) Bertsekas, D.P.: Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, N.J. (1987) Bertsekas, D.P.: Dynamic Programming and Optimal Control, vol. 1, 3rd edn. Athena Scientific, Belmont,Massachuts (2005) Bertsekas, D.P., Shreve, S.: Stochastic Optimal Control: the Discrete-Time Case. Academic Press, New York (1978) Bishop, C.J., Feinberg, E.A., Zhang, J.: Examples concerning Abel and Ces`aro limits. J. Math. Anal. Appl. 420(2), 1654–1661 (2014) Blackwell, D.: Discounted dynamic programming. Ann. Math. Stat. 36(1), 226–235 (1965) Borkar, V.S., Gaitsgory, V., Shvartsman, I.: LP formulations of discrete time long-run average optimal control problems: The nonergodic case. SIAM J. Control Optim. 57(3), 1783–1817 (2019) Borwein, J.M., Lewis, A.S.: Convex Analysis and Nonlinear Optimization. Theory and Examples, vol. 3, 2nd edn, of CMS Books in Mathematics/Ouvrages de Math´ematiques de la SMC. Springer, New York (2006) Brock, W.A., Mirman, L.J.: Optimal economic growth and uncertainty: the discounted case. J. Econ. Theory 4, 479–513 (1972) Brock, W.A., Mirman, L.J.: Optimal economic growth and uncertainty: the no discounting case. Int. Econ. Rev. 560–573 (1973) Carlson, D.A., Haurie, A.B., Leizarowitz, A.: Infinite Horizon Optimal Control: Deterministic and Stochastic Systems. Springer, Berlin (1991) Chang, F.-R.: Stochastic Optimization in Continuous Time. Cambridge University Press, Cambridge, UK (2004) Chow, G.C.: Dynamic Economics: Optimization by the Lagrange Method. Oxford University Press, New York (1997)
Christopeit, N., Helmes, K.: On Beneš' bang-bang control problem. Appl. Math. Optim. 9(1), 163–176 (1982) Costa, O.L.V., Dufour, F.: Average control of Markov decision processes with Feller transition probabilities and general action spaces. J. Math. Anal. Appl. 396(1), 58–69 (2012) Cruz-Suárez, H., Montes-de Oca, R.: An envelope theorem and some applications to discounted Markov decision processes. Math. Methods Oper. Res. 67(2), 299–321 (2008) Dasgupta, P., Heal, G.: The optimal depletion of exhaustible resources. Rev. Econ. Stud. 41, 3–28 (1974) Davis, M.H.A.: Linear Estimation and Stochastic Control. Chapman and Hall, London (1977) Dockner, E.J., Jorgensen, S., Van Long, N., Sorger, G.: Differential Games in Economics and Management Science. Cambridge University Press, Cambridge, UK (2000) Domínguez-Corella, A., Hernández-Lerma, O.: The maximum principle for discrete-time control systems and applications to dynamic games. J. Math. Anal. Appl. 475(1), 253–277 (2019) Doshi, B.T.: Continuous time control of Markov processes on an arbitrary state space: average return criterion. Stoch. Process. Appl. 4(1), 55–77 (1976) Doshi, B.T.: Continuous time control of Markov processes on an arbitrary state space: discounted rewards. Ann. Statist. 4(6), 1219–1235 (1976) Doshi, B.T.: Generalized semi-Markov decision processes. J. Appl. Probab. 618–630 (1979) Down, D., Meyn, S.P., Tweedie, R.L.: Exponential and uniform ergodicity of Markov processes. Ann. Probab. 23(4), 1671–1691 (1995) Dynkin, E.B.: Markov Processes. Springer, Berlin (1965) Dynkin, E.B., Yushkevich, A.A.: Controlled Markov Processes. Springer, Berlin (1979)
Feinberg, E.A., Kasyanov, P.O., Zadoianchuk, N.V.: Average cost markov decision processes with weakly continuous transition probabilities. Math. Oper. Res. 37(4), 591–607 (2012) Feinberg, E.A., Kasyanov, P.O., Zadoianchuk, N.V.: Berge’s theorem for noncompact image sets. J. Math. Anal. Appl. 397(1), 255–259 (2013) Fleming, W.H.: Stochastic calculus of variations and mechanics. J. Optim. Theory Appl. 41(1), 55–74 (1983) Fleming, W.H.: Optimal control of Markov processes. In: Proceedings of the International Congress of Mathematicians, August 16–24 1983, vol. I, pp. 71–84. Polish Scientific Publisher and Elsevier, Warsaw (1984) Fleming, W.H., Rishel, R.W.: Deterministic and Stochastic Optimal Control. Springer, New York (1975) Fleming, W.H., Soner, H.M.: Controlled Markov Processes and Viscosity Solutions, 2nd edn. Springer, New York (2006) Friedman, A.: Stochastic Differential Equations and Applications, vol. I. Academic Press, New York (1975) Gamkrelidze, R.V.: Discovery of the maximum principle. J. Dyn. Control Syst. 5(4), 437–451 (1999) Gihman, I.I., Skorohod, A.V.: Controlled Stochastic Processes. SpringerVerlag, New York (1979) Guo, X., Hern´ andez-del Valle, A., Hern´ andez-Lerma, O.: Nonstationary discrete-time deterministic and stochastic control systems with infinite horizon. Int. J. Control 83(9), 1751–1757 (2010) Guo, X., Hern´ andez-Lerma, O.: Continuous-Time Markov Decision Processes: Theory and Applications. Springer, Berlin (2009) Halkin, H.: Necessary conditions for optimal control problems with infinite horizons. Econometrica 42, 267–272 (1974) Hanson, F.B.: Applied Stochastic Processes and Control for Jump-Diffusions: Modeling. Analysis and Computation. Society for Industrial and Applied Mathematics, Philadelphia (2007) Hern´ andez-Lerma, O.: Adaptive Markov Control Processes. Springer, New York (1989) Hern´ andez-Lerma, O.: Lectures on Continuous-Time Markov Control Processes. Sociedad Matem´atica Mexicana, Mexico City (1994) Hern´ andez-Lerma, O., Govindan, T.E.: Nonstationary continuous-time Markov control processes with discounted costs on infinite horizon. Acta Appl. Math. 67(3), 277–293 (2001)
Hern´ andez-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes: Basic Optimality Criteria. Springer, New York (1996) Hern´ andez-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Processes. Springer, New York (1999) Hern´ andez-Lerma, O., Laura-Guarachi Leonardo R., Mendoza-Palacios, S.: A survey of average cost problems in deterministic discrete-time control systems. J. Math. Anal. Appl. 522(1), (2023) Hern´ andez-Lerma, O., Runggaldier, W.J.: Monotone approximations for convex stochastic control problems. J. Math. Syst. Estimation Control 4(4), 99–140 (1994) Himmelberg, C.J., Parthasarathy, T., VanVleck, F.S.: Optimal plans for dynamic programming problems. Math. Oper. Res. 1(4), 390–394 (1976) Hinderer, K., Rieder, U., Stieglitz, M.: Dynamic Optimization: Deterministic and Stochastic models. Springer, Cham (2016) Howard, R.A.: Dynamic Programming and Markov Processes. John Wiley, New York (1960) Hung, N.M., Quyen, N.V.: Dynamic Timing Decisions Under Uncertainty. Lecture Notes in Economics and Mathematical Systems, vol. 406. SpringerVerlag, Berlin (1994) Jacka, S.D., Mijatovi´c, A.: On the policy improvement algorithm in continuous time. Stochastics 89(1), 348–359 (2017) Josa-Fombellida, R., Rinc´ on-Zapatero, J.P.: Certainty equivalence principle in stochastic differential games: An inverse problem approach. Optimal Control Appl. Methods 40(3), 545–557 (2019) Karatzas, I.: Optimization problems in the theory of continuous trading. SIAM J. Control Optim. 27(6), 1221–1259 (1989) Karatzas, I., Shreve, S.E.: Methods of Mathematical Finance. Springer, New York (1998) Kawaguchi, K.: Optimal control of pollution accumulation with long-run average welfare. Environ. Res. Econ. 26(3), 457–468 (2003) Kawaguchi, K., Morimoto, H.: Long-run average welfare in a pollution accumulation model. J. Econ. Dyn. Control 31(2), 703–720 (2007) Kendrick, D.A.: Stochastic Control for Economic Models, 2nd edn. McGraw Hill, New York (2002) Krylov, N.V.: Controlled Diffusion Processes. Springer, New York (1980) Li, X.J., Yong, J.M.: Optimal Control Theory for Infinite-dimensional Systems. Systems & Control: Foundations & Applications. Birkh¨ auser Boston, Inc., Boston, MA (1995)
Loewen, P.D.: Existence theory for a stochastic Bolza problem. IMA J. Math. Control Inf. 4(4), 301–320 (1987) Long, N.V., Kemp, M.C.: Essays in the Economics of Exhaustible Resources. Elsevier Science Publishers, Amsterdam (1984) Luenberger, D.G.: Optimization by Vector Space Methods. John Wiley & Sons, New York (1969) Lund, R.B., Meyn, S.P., Tweedie, R.L.: Computable exponential convergence rates for stochastically ordered Markov processes. Ann. Appl. Probab. 6(1), 218–237 (1996) Mangasarian, O.L.: Sufficient conditions for the optimal control of nonlinear systems. SIAM J. Control Optim. 4(1), 139–152 (1966) Mangel, M.: Decision and Control in Uncertain Resource Systems. Mathematics in Science and Engineering, 172. Academic Press, Orlando (1985) Merton, R.C.: Optimum consumption and portfolio rules in a continuous-time model. J. Econ. Theory 3(4) (1971) Merton, R.C.: Continuous-Time Finance. Basil Blackwell, London (1990) Michael, E.: A survey of continuous selections. In Set-Valued Mappings, Selections and Topological Properties of 2X (Proc. Conf., SUNY, Buffalo,N.Y., 1969), Lecture Notes in Mathematics, vol. 171, pp. 54–58. Springer, Berlin (1970) Midler, J.L.: Optimal control of a discrete time stochastic system linear in the state. J. Math. Anal. Appl. 25(1), 114–120 (1969) Mikosch, T.: Elementary Stochastic Calculus with Finance in View. World Scientific Publishing, Singapore (1998) Mitra, T., Wan, H.Y., Jr.: Some theoretical results on the economics of forestry. Rev. Econ. Stud. 52(2), 263–282 (1985) Morimoto, H.: Stochastic Control and Mathematical Modeling. Cambridge University Press, Cambridge (2010) Neveu, J.: Mathematical Foundations of The Calculus of Probability. Calif, Holden-Day, San Francisco (1965) Øksendal, B.: Stochastic Differential Equations: an Introduction with Applications. Springer-Verlag, Berlin (2003) Øksendal, B., Sulem, A.: Applied Stochastic Control of Jump Diffusions. Springer, Berlin (2007) Pham, H.: Continuous-Time Stochastic Control and Optimization with Financial Applications. Springer, Berlin (2009)
Pindyck, R.S.: Uncertainty and exhaustible resource markets. J. Polit. Econ. 88(6), 1203–1225 (1980) Piunovskiy, A., Zhang, Y.: Continuous-time Markov Decision Processes-Borel Space Models and General Control Strategies. Springer, Cham (2020) Prieto-Rumeau, T., Hern´ andez-Lerma, O.: Selected Topics on ContinuousTime Controlled Markov Chains and Markov Games. Imperial College Press, London, London (2012) Rakovi´c, S.V., Levine, W.S.: Handbook of Model Predictive Control. Springer, Cham (2019) Rieder, U.: Measurable selection theorems for optimization problems. Manuscripta Math. 24(1), 115–131 (1978) Rishel, R.: Controlled Continuous Time Markov Processes. In Heyman, D.P., S.J.M. (eds.) Stochastic Models, Handbooks Operational Resource Management Science, vol. 2, pp. 435–467. North-Holland, Amsterdam (1990) Rudin, W.: Principles of Mathematical Analysis. McGraw-Hill Book Co., New York-Auckland-D¨ usseldorf (1976) Sch¨ al, M.: Conditions for optimality in dynamic programming and for the limit of n-stage optimal policies to be optimal. Zeitschrift f¨ ur Wahrscheinlichkeitstheorie und Verwandte Gebiete 32(3), 179–196 (1975) Sennott, L.I.: A new condition for the existence of optimal stationary policies in average cost Markov decision processes. Oper. Res. Lett. 5(1), 17–23 (1986) Sethi, S.P.: Optimal control theory: Applications to Management Science and Economics, 4th edn. Springer, Cham (2021) Shapley, L.S.: Stochastic games. Proceed. Nat. Acad. Sci. 39(10), 1095–1100 (1953) Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction MIT Press, 2nd edn. MIT Press, Cambridge, MA (2018) Sznajder, R., Filar, J.A.: Some comments on a theorem of Hardy and Littlewood. J. Optim. Theory Appl. 75(1), 201–208 (1992) ´ On the vanishing discount factor approach for markov deciVega-Amaya, O.: sion processes with weakly continuous transition probabilities. J. Math. Anal. Appl. 426(2), 978–985 (2015) ´ Solutions of the average cost optimality equation for markov Vega-Amaya, O.: decision processes with weakly continuous kernel: The fixed-point approach revisited. J. Math. Anal. Appl. 464(1), 152–163 (2018) Wei, Q., Liao, Z., Yang, Z., Li, B., Liu, D.: Continuous-time time-varying policy iteration. IEEE Trans. Cybern. 50(12), 4958–4971 (2020)
Wessels, J.: Markov programming by successive approximations with respect to weighted supremum norms. J. Math. Anal. Appl. 58(2), 326–335 (1977) Whittle, P.: Optimization Over Time, Volume II: Dynamic Programming and Stochastic Control. John Wiley & Sons, Chichester (1982) Widder, D.V.: The Laplace Transform. Princeton Mathematical Series, vol. 6. Princeton University Press, Princeton, N.J. (1941) (Dover, Mineola, N.Y., 2010) Yong, J., Zhou, X.Y.: Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, New York (1999)
Index
A Abelian theorem continuous–time, 159 discrete–time, 73 Action, see Policy Average cost optimality equation continuous–time, 148, 206 discrete–time, 63, 116 Average cost optimality inequality continuous–time, 162, 206 discrete–time, 113 Average cost problems continuous–time, 147 discrete–time, 62, 111
B Banach’s fixed point theorem, 38 Bellman equation, see Dynamic programming equation Bellman operator, 31, 102 Bellman’s principle of optimality, see Principle of optimality (PO) Blackwell’s conditions, 46 Borel space, 11, 245 Bounding function, see Weight function Brock–Mirman model, 27, 43, 44, 65, 68, 75 Brownian motion, 186
C Canonical pair continuous–time, 151, 196 discrete–time, 63
Canonical triplet continuous–time, 151 discrete–time, 63 Ces` aro limits, 73, 159 Chapman–Kolmogorov equation, 189 Consumption–investment problem, 6, 98, 227 Control, see Policy Control model, see Optimal control problem Control policy, see Policy Control problem, see Optimal control problem
D Differential equation ordinary, 9, 127, 185 stochastic, 10, 187 Diffusion coefficient, 187 Diffusion process, 187, 217 Dirac measure, 85, 184, 185 Discounted problems continuous-time, 141, 143 discrete-time, 20, 29, 100 Drift coefficient, 187, 219 Driving process, 5, 84, 258 Dynamic programming algorithm, 90 Dynamic programming equation, 14, 16, 31, 88, 91, 102, 131 forward form, 23, 88, 92, 106 Dynkin’s formula, 191, 192
E Economic growth model, see Brock–Mirman model Envelope Theorems, 136 Ergodic Markov process, 197
F Feller property, 94, 219
G Gauge function, see Weight function Generator, see Infinitesimal generator Geometrically ergodic Markov process, 197, 209
H Hamilton–Jacobi–Bellman (HJB), 132, 144 Hamilton–Jacobi–Bellman (HJB) equation, 131 classical solution, 134 Hamiltonian function, 24, 131 Hardy-Littlewood Theorem, 73
I Indicator function, 85, 184 Infinitesimal generator, 132, 188, 190 Invariant probability measure, 197 Ionescu Tulcea Theorem, 260 Itô integral, 187 Itô processes, 217
L Langevin equation, 233 Law of motion, 198 Long–run expected average cost, 7, 194, 195, 205, 237 LQ control problem continuous–time, 140, 141, 146, 151, 155, 224, 232, 237 discrete–time, 18, 30, 57, 75, 96, 98, 122
M Majorant, see Weight function Markov control model (MCM), 83, 87, 91, 100 Markov control problem
continuous–time, 132 Markov control process continuous–time, 183, 198 discrete–time, 84, 87 Markov decision process, see Markov control process Markov process continuous–time, 183, 185, 187, 188, 190 Minimal cost, see Value function Minimum principle continuous-time, 135 discrete-time, 38 Minimum steady state continuous–time, 153 discrete–time, 67 Minimum principle discrete-time, 23 Multifunction, 32, 84, 101, 249 continuous, 251 lower semicontinuous (l.s.c.), 251 upper semicontinuous (u.s.c.), 251 O Optimal control problem finite–horizon, 128 Optimal control problem (OCP), 1, 13, 87 continuous–time, 9, 127, 183 discounted case, 141 discrete–time, 1, 13, 87, 100 finite-horizon, 87 infinite–horizon, 28, 100, 143 long–run average cost (AC), 62, 72 stationary discounted, 28 Optimality equation, see Dynamic programming equation P Partially observable systems, 8 Poisson equation, 149, 195, 196, 198, 209 Policy, 86 history–dependent, 5 action, see Control policy closed–loop, see Markov control, 5, 86 feedback, see Markov policy Markov, 5, 10, 199 open–loop, 5, 10, 127 optimal, 3, 88 randomized control, 86 stationary Markov, 53, 199
Policy iteration (PI) algorithm, 54, 108, 110, 163 Pollution accumulation, 156, 240 Pontryagin's Maximum Principle, see Minimum principle Portfolio selection problem, see Consumption–investment problem Principle of optimality (PO), 15, 129 Production–inventory system, 4
R Random noise, see Driving process
S Sample path, 184 Selector, 32, 249 Semigroup of operators, 189, 190 Stationary measure, see Invariant probability measure Stochastic differential equations (SDEs), see Differential equation Stochastic kernel, 84, 86, 93, 256, 260 Stochastic process continuous–time, 183, 186 Stopping time, 192 Strategy, see Policy
T Tracking problem, 7, 30 Transition probability, 84, 93, 184, 187, 198, 256, 260
U Uniformly ergodic Markov process, see Geometrically ergodic Markov process Uniformly geometrically ergodic Markov process, see Geometrically ergodic Markov process
V Value function, 3, 30, 62, 88, 100, 129, 201, 205, 236, 237 Value iteration (VI) algorithm, 51 functions, 32, 106 Vanishing discount continuous–time, 158, 210 discrete–time, 72 Verification theorem, 133, 134, 144
W Weight function, 45 Weighted-norm, 44 Wiener process, see Brownian motion