English Pages 583 [605] Year 2020
Probability Theory and Stochastic Modelling 97
Alexey Piunovskiy Yi Zhang
Continuous-Time Markov Decision Processes Borel Space Models and General Control Strategies
Editors-in-Chief Peter W. Glynn, Stanford, CA, USA Andreas E. Kyprianou, Bath, UK Yves Le Jan, Orsay, France Advisory Editors Søren Asmussen, Aarhus, Denmark Martin Hairer, Coventry, UK Peter Jagers, Gothenburg, Sweden Ioannis Karatzas, New York, NY, USA Frank P. Kelly, Cambridge, UK Bernt Øksendal, Oslo, Norway George Papanicolaou, Stanford, CA, USA Etienne Pardoux, Marseille, France Edwin Perkins, Vancouver, Canada Halil Mete Soner, Zürich, Switzerland
The Probability Theory and Stochastic Modelling series is a merger and continuation of Springer’s two well established series, Stochastic Modelling and Applied Probability and Probability and Its Applications. It publishes research monographs that make a significant contribution to probability theory or an applications domain in which advanced probability methods are fundamental. Books in this series are expected to follow rigorous mathematical standards, while also displaying the expository quality necessary to make them useful and accessible to advanced students, as well as researchers. The series covers all aspects of modern probability theory including
• Gaussian processes
• Markov processes
• Random Fields, point processes and random sets
• Random matrices
• Statistical mechanics and random media
• Stochastic analysis
as well as applications that include (but are not restricted to):
• Branching processes and other models of population growth
• Communications and processing networks
• Computational methods in probability and stochastic processes, including simulation
• Genetics and other stochastic models in biology and the life sciences
• Information theory, signal processing, and image synthesis
• Mathematical economics and finance
• Statistical methods (e.g. empirical processes, MCMC)
• Statistics for stochastic processes
• Stochastic control
• Stochastic models in operations research and stochastic optimization
• Stochastic models in the physical sciences
More information about this series at http://www.springer.com/series/13205
Foreword by Albert Nikolaevich Shiryaev
Alexey Piunovskiy Department of Mathematical Sciences University of Liverpool Liverpool, UK
Yi Zhang Department of Mathematical Sciences University of Liverpool Liverpool, UK
ISSN 2199-3130 ISSN 2199-3149 (electronic) Probability Theory and Stochastic Modelling ISBN 978-3-030-54986-2 ISBN 978-3-030-54987-9 (eBook) https://doi.org/10.1007/978-3-030-54987-9 Mathematics Subject Classification: 90C40, 60J76, 62L10, 90C05, 90C29, 90C39, 90C46, 93C27, 93E20 © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Nothing is as useful as good theory—investigation of challenging real-life problems produces profound theories. Guy Soulsby
Foreword
This monograph presents a systematic and modern treatment of continuous-time Markov decision processes. One can view the latter as a special class of (stochastic) optimal control problems. Thus, it is not surprising that the traditional method of investigation is dynamic programming. Another method, sometimes termed the convex analytic approach, is via a reduction to linear programming, and is similar to and comparable with the weak formulation of optimal control problems. On the other hand, this class of problems possesses its own features that can be employed to derive interesting and meaningful results, such as its connection to discrete-time problems. It is this connection that accounts for many modern developments or complete solutions to otherwise delicate problems in this topic. The authors of this book are well established researchers in the topic of this book, to which, in recent years, they have made important contributions. So they are well positioned to compose this updated and timely presentation of the current state-of-the-art of continuous-time Markov decision processes. Turning to its content, this book presents three major methods of investigating continuous-time Markov decision processes: the dynamic programming approach, the linear programming approach and the method based on reduction to discrete-time problems. The performance criterion under primary consideration is the expected total cost, in addition to one chapter devoted to the long run average cost. Both the unconstrained and constrained versions of the optimal control problem are studied. The issue at the core is the sufficient class of control strategies. In terms of the technical level, in most cases, this book intends to present the results as generally as possible and under conditions as weak as possible. That means, the authors consider Borel models under a wide class of control strategies. 
This is not only for the sake of generality; indeed, it simultaneously covers both the traditional class of relaxed controls for continuous-time models and randomized controls in semi-Markov decision processes. The more relevant reason is perhaps that it paves the way for a rigorous treatment of more involved issues concerning the realizability or the implementability of the control strategies. In particular, mixed strategies, which were otherwise often introduced verbally, are now introduced as a subclass of control strategies, making use of an external space, an idea that can be ascribed to
the works of I. V. Girsanov, N. V. Krylov, A. V. Skorokhod and I. I. Gikhman and to the book Statistics of Random Processes by R. S. Liptser and myself for models of various degrees of generality, which was further developed by E. A. Feinberg for discrete-time problems. The authors have made this book self-contained: all the statements in the main text are proved in detail, and the appendices contain all the necessary facts from mathematical analysis, applied probability and discrete-time Markov decision processes. Moreover, the authors present numerous solved real-life and academic examples, illustrating how the theory can be used in practice. The selection of the material seems to be balanced. It is natural that many statements presented and proved in this monograph come from the authors themselves, but the rest come from other researchers, to reflect the progress made both in the west and in the east. Moreover, it contains several statements unpublished elsewhere. Finally, the bibliographical remarks also contain useful information. No doubt, active researchers (from the level of graduate students onward) in the fields of applied probability, statistics and operational research, and in particular, stochastic optimal control, as well as statistical decision theory and sequential analysis, will find this monograph useful and valuable. I can recommend this book to any of them.
Albert Nikolaevich Shiryaev Steklov Mathematical Institute Russian Academy of Sciences, Moscow, Russia
Preface
The study of continuous-time Markov decision processes dates back at least to the 1950s, shortly after that of its discrete-time analogue. Since then, the theory has rapidly developed and has found a large spectrum of applications to, for example, queueing systems, epidemiology, telecommunication and so on. In this monograph, we present some recent developments on selected topics in the theory of continuous-time Markov decision processes. Prior to this book, there have been monographs [106, 150, 197] solely devoted to the theory of continuous-time Markov decision processes. They all focus on models with a finite or denumerable state space, with [150] also discussing semi-Markov decision processes and featuring applications to queueing systems. We emphasize the word “solely” in the previous claim because, leaving aside those on controlled diffusion processes, there have also been important books on the more general class of controlled processes, see [46, 49], as well as the thesis [236]. These works are on piecewise deterministic processes and deal with problems without constraints. Here, we consider, in that language, piecewise constant processes in a Borel state space, but we pay special attention to problems with constraints and develop techniques tailored for our processes. The authors of the books [106, 150, 197] followed a direct approach, in the sense that no reduction to discrete-time Markov decision processes is involved. Consequently, as far as the presentation is concerned, this approach has the desirable advantage of being self-contained. The main tool is the Dynkin formula, and so to ensure that the class of functions of interest is in the domain of the extended generator of the controlled process, a weight function needs to be imposed on the cost and transition rates. In some parts of this book, we also present this approach and apply it to the study of constrained problems.
Following the observation made in [230, 231, 264], we present necessary and sufficient conditions for the applicability of the Dynkin formula to the class of functions of interest. This hopefully leads to a clearer picture of what minimal conditions are needed for this approach to apply.
On the other hand, the main theme of this book is the reduction method of continuous-time Markov decision processes. When this method is applicable, it often allows one to deduce optimality results under more general conditions on the system primitives. Another advantage is that it allows one to make full use of results known for discrete-time Markov decision processes, and referring to recent results of this kind makes the present book a more up-to-date treatment of continuous-time Markov decision processes. In greater detail, a large part of this book is devoted to the justification of the reduction method and its application to problems with total (undiscounted) cost criteria. This performance criterion was rarely touched upon in [106, 150, 197]. Recently, a method for investigating the space of occupation measures for discrete-time Markov decision processes with total cost criteria has been described, see [61, 63]. The extension to continuous-time Markov decision processes with total cost criteria was carried out in [117, 185, 186]. Although the continuous-time Markov decision processes in [117, 185, 186] were all reduced to equivalent discrete-time Markov decision processes, leading to the same optimality results, different methods were pursued. In this book, we present in detail the method of [185, 186], because it is based on the introduction of a class of so-called Poisson-related strategies. This class of strategies is new to the context of continuous-time Markov decision processes. The advantage of this class of strategies is that they are implementable or realizable, in the sense that they induce action processes that are measurable. This realizability issue does not arise in discrete-time Markov decision processes, but is especially relevant to problems with constraints, where relaxed strategies often need to be considered for the sake of optimality.
Although it has long been known that relaxed strategies induce action processes with complicated trajectories, in the context of continuous-time Markov decision processes it was [76] that drew special attention to this issue, and also constructed realizable optimal strategies, termed switching strategies, for discounted problems. Incidentally, in [76], a reduction method was developed for discounted problems, which is also presented in this book. This method is different from the standard uniformization technique. Although it is not directly applicable to the problem when the discount factor is null, our works [117, 185, 186] were motivated by it. A different reduction method was followed in [45, 49], where the induced discrete-time Markov decision process has a more complicated action space (in the form of some space of measurable mappings) than the original continuous-time Markov decision process. The reduction method presented in this book is different, as it induces a discrete-time Markov decision process with the same action space as the original problem in continuous time. An outline of the material presented in this book follows. In Chap. 1, we describe the controlled processes and introduce the class of strategies of primary concern. We discuss their realizability and sufficiency for problems with total cost criteria. The latter was achieved by investigating the detailed occupation measures. A series of examples of continuous-time Markov decision processes can be found in this chapter which illustrate the practical applications, many of which are solved
either analytically or numerically in subsequent chapters. In Chap. 2, we provide conditions for the explosiveness or the non-explosiveness of the controlled process under a particular strategy or under all strategies simultaneously, and discuss the validity of Dynkin’s formula. Chapters 3–5 are devoted to problems involving the discounted cost criteria, total undiscounted cost criteria and the average cost criteria, respectively. The main tool used in Chap. 4, where Poisson-related strategies are introduced, is the reduction to discrete-time Markov decision processes. For the average cost criteria, extra conditions were imposed in a form based on how they are used in the reasoning of the proofs. Chapter 6 returns to the total cost model with a more general class of strategies. Chapter 7 is devoted to models with both gradual and impulsive control. Each chapter is supplemented with bibliographical remarks. Relevant facts about discrete-time Markov decision processes, as well as those from analysis and probability, together with selected technical proofs, are included in the appendices. We hope that this monograph will be of interest to the research community in Applied Probability, Statistics, Operational Research/Management Science and Electrical Engineering, including both experts and postgraduate students. In this connection, we have made efforts to present all the proofs in detail. This book may also be used by “outsiders” if they focus on the solved examples. Readers are expected to have taken courses in real analysis, applied probability and stochastic models/processes. Basic knowledge on discrete-time Markov decision processes is also useful, though not essential, as all the necessary facts from these topics are included in the appendices.
Acknowledgements
We would like to take this opportunity to thank Oswaldo Costa, François Dufour, Eugene Feinberg, Xian-Ping Guo and Flora Spieksma for discussions, communications and collaborations on the topic of this book. We also thank our student, Xin Guo, who helped us with some tricks in using LaTeX. Finally, we are very grateful to Professor Albert Nikolaevich Shiryaev for the foreword and his consistent support.
Notations
The following notations are frequently used throughout this book. For all constants $a, b \in [-\infty, \infty]$, $a \vee b = \max\{a, b\}$, $a \wedge b = \min\{a, b\}$, $b^+ := \max\{b, 0\}$ and $b^- := \max\{-b, 0\}$. The supremum/maximum (infimum/minimum) over the empty set is $-\infty$ ($+\infty$). $\mathbb{N} = \{1, 2, \ldots\}$ is the set of natural numbers; $\mathbb{N}_0 = \mathbb{N} \cup \{0\}$. $\mathbb{Z}$ is the set of all integers. We often write $\mathbb{R}_+ = (0, \infty)$, $\mathbb{R}_+^0 = [0, \infty)$, $\bar{\mathbb{R}} = [-\infty, +\infty]$, $\bar{\mathbb{R}}_+ = (0, \infty]$, $\bar{\mathbb{R}}_+^0 = [0, \infty]$.

Given two sets $A, B$, if $A$ is a subset of $B$, then we write $A \subseteq B$ or $A \subset B$ interchangeably. We denote by $A^c$ the complement of $A$. On a set $E$, if $\mathcal{E}_1$ and $\mathcal{E}_2$ are two $\sigma$-algebras (or $\sigma$-fields), then $\mathcal{E}_1 \vee \mathcal{E}_2$ denotes the smallest $\sigma$-algebra on $E$ containing the two $\sigma$-algebras $\mathcal{E}_1$ and $\mathcal{E}_2$.

Consider two measurable spaces $(E, \mathcal{E})$ and $(F, \mathcal{F})$. A mapping $X$ from $E$ to $F$ is said to be measurable (from $(E, \mathcal{E})$ to $(F, \mathcal{F})$) if $X^{-1}(\mathcal{F}) \subseteq \mathcal{E}$, i.e., for each $C \in \mathcal{F}$, its preimage $X^{-1}(C)$ with respect to $X$ belongs to $\mathcal{E}$.

By a measure $\mu$ on a measurable space $(E, \mathcal{E})$, we always mean an $\bar{\mathbb{R}}_+^0$-valued $\sigma$-additive function on the $\sigma$-algebra $\mathcal{E}$, taking value 0 at the empty set $\emptyset$. When there is no confusion regarding the underlying $\sigma$-algebra, we write “a measure on $E$”, instead of $(E, \mathcal{E})$, or $\mathcal{E}$. If the singleton $\{x\}$ is a measurable subset of a measurable space $(E, \mathcal{E})$, then $\delta_x(\cdot)$ is the Dirac measure concentrated at $x$, and we call such distributions degenerate; $I\{\cdot\}$ is the indicator function.

Defined on an arbitrary measure space $(E, \mathcal{E}, \mu)$, an $\mathbb{R}$-valued measurable function $f$ is called integrable if $\int_E |f(e)|\,\mu(de) < \infty$. (We shall use the notations for integrals such as $\int_E f(e)\,\mu(de)$ and $\int_E f(e)\,d\mu(e)$ interchangeably.) More generally, for each $1 \le p < \infty$, an $\mathbb{R}$-valued measurable function $f$ is said to be $p$th integrable if $|f|^p$ is integrable. The space of $p$th integrable (real-valued) functions on the measure space $(E, \mathcal{E}, \mu)$ is denoted by $L^p(E, \mathcal{E}, \mu)$, where two functions in it are not distinguished if they differ only at a null set with respect to $\mu$. The space $L^p(E, \mathcal{E}, \mu)$ is a Banach space when it is endowed with the norm defined by $\left(\int_E |f(e)|^p\,\mu(de)\right)^{1/p}$ for each $f \in L^p(E, \mathcal{E}, \mu)$.

Here is one more convention regarding integrals. Suppose $r$ is an $\mathbb{R}$-valued measurable function on the product measure space $(E \times \mathbb{R}, \mathcal{E} \times \mathcal{B}(\mathbb{R}), \mu(de) \times dt)$, where $\mu$ is a measure on $E$, and $dt$ stands for the Lebesgue measure on $\mathbb{R}$. (By the way, the Lebesgue measure is also often denoted by Leb.) Then we understand the integral of $r$ with respect to $\mu(de) \times dt$ as

$$\int_{E \times \mathbb{R}} r(e, t)\,dt \times \mu(de) := \int_{E \times \mathbb{R}} r^+(e, t)\,dt \times \mu(de) - \int_{E \times \mathbb{R}} r^-(e, t)\,dt \times \mu(de), \tag{1}$$

where $+\infty - \infty := +\infty$. When $\mu$ is a $\sigma$-finite measure on $E$, the Fubini–Tonelli theorem applies:

$$\int_{E \times \mathbb{R}} r(e, t)\,dt \times \mu(de) = \int_E \int_{\mathbb{R}} r^+(e, t)\,dt\,\mu(de) - \int_E \int_{\mathbb{R}} r^-(e, t)\,dt\,\mu(de).$$
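As a quick check of the conventions above, here is a small worked example. It is our own illustration and not taken from the book: the one-point space $E = \{e_0\}$, the measure $\mu = \delta_{e_0}$ and the function $r(e_0, t) = t/(1+t^2)$ are assumptions chosen for simplicity. It shows the positive and negative parts of a number, and a case in which convention (1) assigns the value $+\infty$.

```latex
% Positive and negative parts of a constant: b = b^+ - b^-.
\[
  b = -3:\qquad b^{+} = \max\{-3, 0\} = 0,\qquad
  b^{-} = \max\{3, 0\} = 3,\qquad b^{+} - b^{-} = -3 .
\]
% Convention (1) with the assumed data E = \{e_0\}, \mu = \delta_{e_0},
% r(e_0, t) = t / (1 + t^2). Both parts have infinite integral.
\[
  \int_{\mathbb{R}} r^{+}(e_0, t)\,dt = \int_{0}^{\infty} \frac{t}{1+t^{2}}\,dt = +\infty,
  \qquad
  \int_{\mathbb{R}} r^{-}(e_0, t)\,dt = \int_{-\infty}^{0} \frac{-t}{1+t^{2}}\,dt = +\infty,
\]
\[
  \text{so that}\quad
  \int_{E \times \mathbb{R}} r(e, t)\,dt \times \mu(de)
  = (+\infty) - (+\infty) := +\infty .
\]
```

Since $r(e_0, \cdot)$ is odd, the symmetric limit $\lim_{T \to \infty} \int_{-T}^{T} r(e_0, t)\,dt$ equals 0 here, so the value $+\infty$ reflects the stated convention rather than a property of ordinary integration.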
By a Borel space is meant a measurable space that is isomorphic to a Borel subset of a Polish space (complete separable metric space). A topological Borel space is a topological space that is homeomorphic to a Borel subset of a Polish space, endowed with the relative topology. Thus, when talking about a Borel space,
only the underlying $\sigma$-algebra is fixed, whereas when talking about a topological Borel space, the topology is fixed. For a topological space $E$, $\mathcal{B}(E)$ denotes its Borel $\sigma$-algebra, i.e., the $\sigma$-algebra on $E$ generated by all the open subsets of $E$. If $E$ is a topological Borel space, then the measurable space $(E, \mathcal{B}(E))$ is a Borel space. If $E$ is a Borel space without a fixed topology, we still typically denote its $\sigma$-algebra by $\mathcal{B}(E)$.

For a topological space $E$, we denote by $C(E)$ (respectively, $C^+(E)$) the space of bounded continuous (respectively, nonnegative bounded continuous) real-valued functions on $E$. If $(E, \mathcal{E})$ is a measurable space, then we denote by $\mathcal{M}^F(E)$ (respectively, $\mathcal{M}^F_+(E)$) the space of finite signed measures (respectively, finite nonnegative measures) on $(E, \mathcal{E})$. Also $\mathcal{M}^+(E)$ denotes the space of (possibly infinite) measures on $(E, \mathcal{E})$, and $\mathcal{P}(E)$ is the space of probability measures on $(E, \mathcal{E})$. If $E$ is a topological space, then $\mathcal{M}^F(E)$ (respectively, $\mathcal{M}^F_+(E)$) is understood as the space of finite signed measures (respectively, finite nonnegative measures) on $(E, \mathcal{B}(E))$. Similarly, $\mathcal{M}^+(E)$ denotes the space of (possibly infinite) measures on $(E, \mathcal{B}(E))$, and $\mathcal{P}(E)$ is the space of probability measures on $(E, \mathcal{B}(E))$.

The abbreviation a.s. (respectively, i.i.d.) stands for “almost surely” (respectively, “independent identically distributed”). Expressions like “for almost all $s \in \mathbb{R}$” refer to the Lebesgue measure, unless stated otherwise. For an $(E, \mathcal{E})$-valued random variable $X$ on $(\Omega, \mathcal{F}, P)$, the assertion “a statement holds for $P$-almost all $X$” and the assertion “a statement holds for $PX^{-1}$-almost all $x \in E$” mean the same, where $PX^{-1}$ denotes the distribution of $X$ under $P$.

Throughout the main text (excluding the appendices), capital letters such as $H$ usually denote random elements, lower case letters such as $h$ denote arguments of functions and realizations of random variables; and spaces are denoted using bold fonts, e.g., $\mathbf{H}$.
In the rest of the book, we use the following abbreviations: CTMDP (respectively, DTMDP, ESMDP, SMDP) stands for continuous-time Markov decision process (respectively, discrete-time Markov decision process, exponential semi-Markov decision process, semi-Markov decision process). They will be recalled when they appear for the first time in the main text below.

Liverpool, UK
Alexey Piunovskiy
Yi Zhang
Contents
1 Description of CTMDPs and Preliminaries
  1.1 Description of the CTMDP
    1.1.1 Initial Data and Conventional Notations
    1.1.2 Informal Description
    1.1.3 Strategies, Strategic Measures, and Optimal Control Problems
    1.1.4 Realizable Strategies
    1.1.5 Instant Costs at the Jump Epochs
  1.2 Examples
    1.2.1 Queueing Systems
    1.2.2 The Freelancer Dilemma
    1.2.3 Epidemic Models
    1.2.4 Inventory Control
    1.2.5 Selling an Asset
    1.2.6 Power-Managed Systems
    1.2.7 Fragmentation Models
    1.2.8 Infrastructure Surveillance Models
    1.2.9 Preventive Maintenance
  1.3 Detailed Occupation Measures and Further Sufficient Classes of Strategies for Total Cost Problems
    1.3.1 Definitions and Notations
    1.3.2 Sufficiency of Markov π-Strategies
    1.3.3 Sufficiency of Markov Standard ξ-Strategies
    1.3.4 Counterexamples
    1.3.5 The Discounted Cost Model as a Special Case of Undiscounted
  1.4 Bibliographical Remarks
2 Selected Properties of Controlled Processes
  2.1 Transition Functions and the Markov Property
    2.1.1 Basic Definitions and Notations
    2.1.2 Construction of the Transition Function
    2.1.3 The Minimal (Nonnegative) Solution to the Kolmogorov Forward Equation
    2.1.4 Markov Property of the Controlled Process Under a Natural Markov Strategy
  2.2 Conditions for Non-explosiveness Under a Fixed Natural Markov Strategy
    2.2.1 The Nonhomogeneous Case
    2.2.2 The Homogeneous Case
    2.2.3 Possible Generalizations
    2.2.4 A Condition for Non-explosiveness Under All Strategies
    2.2.5 Direct Proof for Non-explosiveness Under All Strategies
  2.3 Examples
    2.3.1 Birth-and-Death Processes
    2.3.2 The Gaussian Model
    2.3.3 The Fragmentation Model
    2.3.4 The Infrastructure Surveillance Model
  2.4 Dynkin’s Formula
    2.4.1 Preliminaries
    2.4.2 Non-explosiveness and Dynkin’s Formula
    2.4.3 Dynkin’s Formula Under All π-Strategies
    2.4.4 Example
  2.5 Bibliographical Remarks
3 The Discounted Cost Model
  3.1 The Unconstrained Problem
    3.1.1 The Optimality Equation
    3.1.2 Dynamic Programming and Dual Linear Programs
  3.2 The Constrained Problem
    3.2.1 Properties of the Total Occupation Measures
    3.2.2 The Primal Linear Program and Its Solvability
    3.2.3 Comparison of the Convex Analytic and Dynamic Programming Approaches
    3.2.4 Duality
    3.2.5 The Space of Performance Vectors
  3.3 Examples
    3.3.1 A Queuing System
    3.3.2 A Birth-and-Death Process
  3.4 Bibliographical Remarks
4 Reduction to DTMDP: The Total Cost Model
  4.1 Poisson-Related Strategies
  4.2 Reduction to DTMDP
    4.2.1 Description of the Concerned DTMDP
    4.2.2 Selected Results of the Reduction to DTMDP
    4.2.3 Examples
    4.2.4 Models with Strongly Positive Intensities
    4.2.5 Example: Preventive Maintenance
  4.3 Bibliographical Remarks
5 The Average Cost Model
  5.1 Unconstrained Problems
    5.1.1 Unconstrained Problem: Nonnegative Cost
    5.1.2 Unconstrained Problem: Weight Function
  5.2 Constrained Problems
    5.2.1 The Primal Linear Program and Its Solvability
    5.2.2 Duality
    5.2.3 The Space of Performance Vectors
    5.2.4 Denumerable and Finite Models
  5.3 Examples
    5.3.1 The Gaussian Model
    5.3.2 The Freelancer Dilemma
  5.4 Bibliographical Remarks
6 The Total Cost Model: General Case
  6.1 Description of the General Total Cost Model
    6.1.1 Generalized Control Strategies and Their Strategic Measures
    6.1.2 Subclasses of Strategies
  6.2 Detailed Occupation Measures and Sufficient Classes of Strategies
    6.2.1 Detailed Occupation Measures
    6.2.2 Sufficient Classes of Strategies
    6.2.3 Counterexamples
  6.3 Reduction to DTMDP
  6.4 Mixtures of Strategies and Convexity of Spaces of Strategic and Occupation Measures
    6.4.1 Properties of Strategic Measures
    6.4.2 Properties of Occupation Measures
  6.5 Example: Utilization of an Unreliable Device
  6.6 Realizable Strategies
  6.7 Bibliographical Remarks
7 Gradual-Impulsive Control Models
  7.1 The Total Cost Model and Reduction
    7.1.1 System Primitives
    7.1.2 Total Cost Gradual-Impulsive Control Problems
    7.1.3 Reduction to CTMDP Model with Gradual Control
  7.2 Example: An Epidemic with Carriers
    7.2.1 Problem Statement
    7.2.2 General Plan
    7.2.3 Optimal Solution to the Associated DTMDP Problem
    7.2.4 The Optimal Solution to the Original Gradual-Impulsive Control Problem
  7.3 The Discounted Cost Model
    7.3.1 α-Discounted Gradual-Impulsive Control Problems
    7.3.2 Reduction to DTMDP with Total Cost
    7.3.3 The Dynamic Programming Approach
    7.3.4 Example: The Inventory Model
  7.4 Bibliographical Remarks
Appendix A: Miscellaneous Results
Appendix B: Relevant Definitions and Facts
Appendix C: Definitions and Facts about Discrete-Time Markov Decision Processes
References
Index
Notation
A A qÞ a ¼ ð c; b; AG AI Að xÞ AðtÞ Bf ðXÞ b n (bn ) B fBi g1 i¼1 ; 1 fAi gi¼1 n ( C cn ) cj ðx; aÞ; cG j ðx; aÞ cIj ðx; aÞ C ðx; aÞ Co Co dj D DS DReM DRaM
Action space Action space in the DTMDP describing the gradual-impulsive model Actions in the DTMDP describing the gradual-impulsive model Space of gradual controls (actions) Space of impulsive controls (actions) Set of admissible actions Action process Space of all f -bounded measurable functions on X Impulsive action (control) Random (realized) impulsive action Controlling process in DTMDP
1, 549 405 406 404 404 5, 156, 307, 551 343, 398 119, 527 404 407 226, 362, 550
Random (realized) planned time until the next impulse Cost rates
407
Cost functions Lump sum cost (Positive) cone Dual cone Constraints constants
404 24 175, 540 175, 540 11, 146, 263, 409, 554 562 41
Total occupation measures in DTMDP Sequences of detailed occupation measures Sequences of detailed occupation measures generated by Markov …-strategies Sequences of detailed occupation measures generated by Markov standard n-strategies
2, 404
41
41
xx DnP Dn DdetM Ddet
DM Dt D av E Sc , E Sx ^S E c E rc ; E r;a c ; r Ex Eqx r E x0 Fn þ 1 ðhn ; c; bÞ fF t gt 0 ; fGt gt 0 g Gn , Gnn e G ha ð xÞ, hð xÞ ðh ; g; u Þ Hn (hn ) n (hn ) H Hn inf ðPÞ; inf ðPav Þ inf ðP c Þ; c inf P av J K, K
Notation Sequences of detailed occupation measures generated by Poisson-related strategies Sequences of detailed occupation measures generated by generalized standard n strategies Sequences of detailed occupation measures generated by mixtures of simple deterministic Markov strategies Sequences of detailed occupation measures generated by mixtures of deterministic generalized standard n-strategies Sequences of detailed occupation measures generated by mixtures of Markov standard n-strategies Collection of all total normalized occupation measures Collection of all stable probability measures Expectations with respect to PcS ðdxÞ, PxS ðdxÞ ^S Expectation with respect to P c
205
385
386
387
387
159 294 10, 341 57
r Expectation with respect to P rc ; P r;a c ;Px
226, 252, 551
Expectation with respect to the transition (probability) function generated by the Q-function q Expectation with respect to P r
117
x0
Relaxed control in the gradual-impulsive model Filtration
409 407 4, 339, 399, 544
Lagrange multipliers Conditional distribution of the sojourn time and the post-jump state Transition probability in the a-jump chain Relative value function Canonical triplet Random (realized) finite history Random (realized) finite history in the gradual-impulsive model Space of histories Value of the Primal Linear Program
181, 303, 542 9, 202, 203, 340, 376 82 268 284 4, 338 407
Value of the Primal Convex Program
179, 304, 542
Number of constraints
11, 146, 263, 409, 554 156, 307, 551
Space of admissible state-action pairs
4, 338 179, 303, 541
L, Lav lj ðx; aÞ
Lagrangian Cost functions in DTMDP
lj ððh; xÞ; a; ðt; yÞÞ; lj a ððh; xÞ; a; ðt; yÞÞ MGO ma m S;a c;n ðdx daÞ M rc ðdx daÞ M rc;n ðdx daÞ ~ Oav O, O,
AðÞ
pM , - M pðs; x; t; dyÞ, pq ðs; x; t; dyÞ pq ðt; dyÞ; pq ðx; t; dyÞ p~q ðs; x; t; dyÞ ~pn;k ðdajxÞ pðdyjx; aÞ pa ðdyjx; aÞ pðdt dyjðh; xÞ; aÞ ^pðdt dyjðh; xÞ; aÞ c Pr; Pr D P
PSc ðdxÞ, PSx ðdxÞ ^S P c
r P x0 ~ ðv;xÞ P P Sc ðt; dy daÞ; P Sc ðt; dyÞ r P rc , P r;a c , Px ParðEÞ 9 qðdyjx; aÞ; = qð jji; aÞ; ; qGO ðdyjx; aÞ qðdyjx; sÞ q f ðdyjx; sÞ qx ðaÞ; qx ð…Þ; qx ðsÞ; qx ðqt Þ qx ðn; …; sÞ
Cost functions in the DTMDP describing the gradual-impulsive model
178, 304, 542 226, 252, 362, 549 406, 441
CTMDP model with gradual control only Infimum of the discounted cost Detailed occupation measure Total occupation measure in DTMDP Detailed occupation measure in DTMDP Space of performance vectors
410 268 40, 205, 346 561 562 184, 186, 309
Markov standard n-strategy Transition function (Homogeneous) transition function
6, 345 64, 65 76, 482
Transition probability function generated by pq ðs; x; t; dyÞ Element of a Poisson-related strategy Transition probability Transition probability Transition probability in the DTMDP describing the gradual-impulsive model Predictable r-algebras Space of strategic measures in the associated DTMDP 1-1-correspondence between strategic measures Strategic measure Strategic measure in the “hat” model with killing Strategic measure in the DTMDP describing the gradual-impulsive model Probability on the trajectories of the a-jump chain Marginal distribution
79 202 226, 362, 376 252 406, 441
5, 339, 399 377 377 9, 341 56 409
83 10, 341
Strategic measures in DTMDP Pareto set Transition rate
226, 252, 551 537 1, 2, 404, 411
Q-function f -transformed Q-function Jump intensity
8 118 1, 8, 404, 405
Jump intensity under a generalized …-n-strategy
340
xxii qx 9 q~ðdyjx; aÞ; > > = q~ðdyjx; …Þ; q~ðdyjx; sÞ; > > ; q~GO ðdyjx; aÞ q~ðdyjx; qt Þ q~n ðdyjx; sÞ q~ðdyjx; n; …; sÞ Qðdyjx; bÞ R AG S ¼ fSn g1 n¼1 S ¼ fN; p0 ; ðpn ; …n Þg1 n¼1 S SP S S S… , S… S- , Sn SG n SDS ) M M S… ; S… M SM - ; Sn
Ssn SP , SPe Sstable supðDÞ; supðDav Þ supðDc Þ; sup Dcav Tn (tn ) ð0; T1 Þ 0 U , U wðt; xÞ; wðxÞ e ð xÞ; w e 0 ð xÞ; w wð xÞ; w0 ð xÞ Wja ðS; cÞ; Wja ðS; xÞ Wj ðS; cÞ; Wj ðS; xÞ j ðr; x0 Þ; W j a ðr; x0 Þ W W0DT ðr; xÞ W0 a ð xÞ W0DT ð xÞ; DT;b W0 ð xÞ ~ W
Notation Supremum of the jump intensity Post-jump measures
2, 404 1, 7, 8, 404, 411
Post-jump measure Post-jump measure under a Poisson-related strategy Post-jump measure under a generalized …-n-strategy Post-impulse state distribution Collection of relaxed controls (P AG valued mappings) Control strategy Generalized control strategy (Uniformly) Optimal strategy Poisson-related strategy Set of all strategies Set of all generalized …-n-strategies Set of all …-strategies Set of all n-strategies Set of all generalized standard n-strategies Set of all deterministic stationary strategies Set of all Markov p-strategies (Markov standard n-strategies)
405 203
Set of all stationary standard n-strategies Set of all Poisson-related strategies Set of all stable strategies Value of the Dual Linear Program
345 202 294 179, 304, 541
Value of the Dual Convex Program
179, 304, 542
Random (realized) jump moment Time horizon Adjoint mapping Lyapunov function Lyapunov functions
4, 338 5, 12 176, 303 80, 90, 94, 98 133, 136, 173
Expected total a-discounted costs
11, 41
Long run average cost
12, 263
Performance functionals in the gradual-impulsive model Performance functional in DTMDP Value (Bellman) function Value (Bellman) function in DTMDP
409, 441
Performance vector
184
340 404 404 5 339 11 202 7 339 7, 343 7, 343 345 7 7, 345
553 11, 146 554, 559
Notation X X X1 ~ X XD Xd Xn (xn ) n (xn ) X x1 X ðtÞ ðX ; Y Þ fYi g1 i¼0 ; fXi g1 i¼0 a cðdxÞ D gðdx daÞ g S;a c ðdx daÞ g S;0 c ðdx daÞ Hn (hn ) lðx; CR CX Þ; l~ðx; CR CX Þ lðx; CR CN CX Þ l^ðx; CR CX CN Þ l; uð0Þ ; uð1Þ m ~v mðdbÞ, mðk Þ …n ðdajhn1 ; sÞ … ðdajx; tÞ …M …s ðdajxÞ Pðx; tÞ -n ðdajhn1 Þ - M , pM -s ðdajxÞ; ps ðdajxÞ qt ðdaÞ r ¼ frn g1 n¼1 ¼ frðn0Þ ; Fn g1 n¼1
frðn0Þ ; rðn1Þ g1 n¼1 ~ n g1 ¼ frðn0Þ ; u n¼1 rC r rM , rs ðdajxÞ
xxiii State space State space in the DTMDP describing the gradual-impulsive model Extended state space State space in the a-jump chain State space excluding the cemetery State space on which the f -transformed Qfunction q f is defined Random (realized) post-jump state of the controlled process Random (realized) state in the DTMDP describing the gradual-impulsive model Artificial isolated point Controlled process Dual pair Controlled process in DTMDP
1, 403, 549 405
4, 338
Discount factor Initial distribution Cemetery Stable measure a-discounted total occupation measure Total occupation measure Random (realized) sojourn time Random measures
11 2, 549 5, 41 293 159, 346 241 4, 338 4, 99
Random measure Random measure l-deterministic stationary strategy in the gradual-impulsive model Compensator of l Compensator of l~ Weights distribution Relaxed control Natural Markov strategy Markov …-strategy Stationary …-strategy P ðAÞ-valued predictable process Randomized control Markov standard n-strategy Stationary standard n-strategy
339 399 408
Relaxed control Strategy in the gradual-impulsive model
404 407
Markov standard n-strategy in the gradual-impulsive model Hitting time in the a-jump chain Control strategy in DTMDP Markov, stationary strategy in DTMDP
409
3, 338 82 42 118
407
3, 338 5, 339 540 226, 362, 550
10, 341, 474 476 368, 371 5 6 6, 344 7, 344 6, 341 5 6, 345 7, 345
83 550 550
xxiv R RS RDM RDS RGI sC sweak , sðX ; Y Þ uð xÞ, us ð xÞ uð xÞ ^ u ~ ðdajxÞ u uð0Þ (uð1Þ ) N N1 n1 nn ¼ ðsn0 ; an0 ; sn1 ; an1 ; . . .Þ Nn (nn ) ðX; F Þ ðX; BðXÞÞ ^ X X h h; i
Notation Set of all strategies in DTMDP Set of all stationary strategies in DTMDP Set of all deterministic Markov strategies in DTMDP Set of all deterministic stationary strategies in DTMDP Set of all strategies in the gradual-impulsive CTMDP Return time in the a-jump chain Weak topology Deterministic stationary strategy Deterministic stationary strategy in DTMDP Simple deterministic Markov strategy Stationary randomized gradual control in the gradual-impulsive model Impulsive (gradual) component of a l-deterministic stationary strategy Artificial space for constructing strategic measures Extended artificial space Additional artificial point Artificial point in case of a Poisson-related strategy Random (realized) artificial point Sample space Sample space in DTMDP Sample space in the “hat” model with killing Sample space in the gradual-impulsive model Cemetery in a DTMDP Bilinear form, scalar product
226, 550 550 550 550 407 83 527, 540 7, 344 550 344 409 408 202, 337 338 338 351 338 4, 338, 344 550 56 409 252, 565 175, 186, 540
Chapter 1
Description of CTMDPs and Preliminaries
In this chapter, we provide a rigorous mathematical description of continuous-time Markov decision processes (CTMDPs), the total cost model, and present several practical examples of CTMDPs. We also establish several preliminary properties of the total cost model to serve our future investigations. In comparison with the previous literature, we introduce and consider wider classes of control strategies and put a special emphasis on those that are realizable, see more discussions on this in Sect. 1.4. A further discussion of possible control strategies is in Chap. 6, Sect. 6.1.1.
© Springer Nature Switzerland AG 2020. A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_1
1.1 Description of the CTMDP
1.1.1 Initial Data and Conventional Notations
The primitives of a CTMDP are the following elements.
(a) The state space $\mathbf{X}$; this is assumed to be a nonempty Borel space, with the fixed σ-algebra $\mathcal{B}(\mathbf{X})$.
(b) The action space $(\mathbf{A}, \mathcal{B}(\mathbf{A}))$, a nonempty Borel space. We assume that each action $a \in \mathbf{A}$ can be applied in each current state $x \in \mathbf{X}$ of the process. We use the words “action” and “control” interchangeably.
(c) The transition rate $q(dy|x,a)$ is a signed kernel on $\mathcal{B}(\mathbf{X})$ given $(x,a) \in \mathbf{X} \times \mathbf{A}$ taking nonnegative values on $\Gamma_X \setminus \{x\}$, where $\Gamma_X \in \mathcal{B}(\mathbf{X})$. We assume that $q$ is conservative in the sense that $q(\mathbf{X}|x,a) = 0$, i.e., $q(\{x\}|x,a) = -q(\mathbf{X} \setminus \{x\}|x,a)$. Below, $\tilde{q}(\Gamma_X|x,a) := q(\Gamma_X \setminus \{x\}|x,a)$ and $q_x(a) := -q(\{x\}|x,a)$; the quantity $q_x(a)$ is often referred to as the jump intensity. Furthermore, we also assume that the transition rate $q$ is stable:
\[ \bar{q}_x := \sup_{a \in \mathbf{A}} q_x(a) < \infty, \quad \forall\, x \in \mathbf{X}. \tag{1.1} \]
If the state space $\mathbf{X}$ is finite, then, under an arbitrarily fixed action $a \in \mathbf{A}$, the matrix of transition rates $Q(a) = [q(j|i,a)]$ is the infinitesimal generator of a continuous-time Markov chain with a standard transition probability function. Here and below, if the state space $\mathbf{X}$ is discrete, we use the simplified notation $q(j|i,a)$ instead of $q(\{j\}|i,a)$. The same is done for the initial distribution γ introduced below and other measures.
(d) The cost rates $c_j$ ($j = 0, 1, \ldots, J$) are measurable functions on $\mathbf{X} \times \mathbf{A}$ with values in the extended real line $[-\infty, \infty]$.
(e) The initial distribution γ(·), a probability measure on $(\mathbf{X}, \mathcal{B}(\mathbf{X}))$.
Below we refer to $(\mathbf{X}, \mathbf{A}, q, \{c_j\}_{j=0}^{J})$ as the CTMDP.
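For a finite state space, the requirements in item (c) can be checked mechanically. The following is a minimal sketch, assuming a hypothetical two-state, two-action model; the dictionary `Q`, the action names, and the helper names are illustrative and do not come from the text.

```python
# Hypothetical two-state, two-action model: Q[a][i][j] plays the role of q(j | i, a).
Q = {
    "slow": [[-1.0, 1.0], [0.5, -0.5]],
    "fast": [[-3.0, 3.0], [2.0, -2.0]],
}


def is_conservative(matrix, tol=1e-12):
    """Check that off-diagonal entries are nonnegative and each row sums to zero."""
    for i, row in enumerate(matrix):
        if any(value < 0 for j, value in enumerate(row) if j != i):
            return False
        if abs(sum(row)) > tol:
            return False
    return True


def jump_intensity(Q, x, a):
    """Return q_x(a) = -q({x} | x, a)."""
    return -Q[a][x][x]


def sup_intensity(Q, x):
    """Return the supremum of q_x(a) over the (finite) action set."""
    return max(jump_intensity(Q, x, a) for a in Q)
```

With finitely many actions the supremum in (1.1) is a maximum, so stability holds automatically in this toy model; for a general Borel action space it is a genuine assumption.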
1.1.2 Informal Description
The trajectory of the controlled process is piecewise constant. The (random) initial state $X_0 \in \mathbf{X}$ has the distribution γ. After the value $x_0$ of $X_0$ is realized, the decision-maker can choose an action $A_1$, possibly using a randomized control $\varpi_1(da|x_0)$, a stochastic kernel on $\mathcal{B}(\mathbf{A})$ given $x_0 \in \mathbf{X}$. Otherwise, he (she) chooses a so-called relaxed control, a stochastic kernel $\pi_1(da|x_0,s)$ on $\mathcal{B}(\mathbf{A})$ given $(x_0,s) \in \mathbf{X} \times \mathbb{R}_+$. As a result, the sojourn time $\Theta_1$ and the new state $X_1$ are realized with the following joint distribution. (Here we assume for simplicity that $0 < \delta \le q_x(a) \le K < \infty$ for some constants δ, K; this is not needed in the formal construction below.)
• In the first case, under the randomized control $\varpi_1$,
\[ P(\Theta_1 \le t,\ X_1 \in \Gamma_X \,|\, X_0 = x_0) = \int_{(0,t]} \int_{\mathbf{A}} \tilde{q}(\Gamma_X|x_0,a)\, e^{-q_{x_0}(a)s}\, \varpi_1(da|x_0)\, ds, \quad t \ge 0,\ \Gamma_X \in \mathcal{B}(\mathbf{X}). \]
• In the second case, under the relaxed control $\pi_1$, for $t \ge 0$, $\Gamma_X \in \mathcal{B}(\mathbf{X})$ we have
\[ P(\Theta_1 \le t,\ X_1 \in \Gamma_X \,|\, X_0 = x_0) = \int_{(0,t]} \int_{\mathbf{A}} \tilde{q}(\Gamma_X|x_0,a)\, \pi_1(da|x_0,s)\, e^{-\int_{(0,s]} \int_{\mathbf{A}} q_{x_0}(a)\, \pi_1(da|x_0,u)\, du}\, ds. \]
If $\varpi_1(da|x_0) = \delta_{a^*}(da)$ or $\pi_1(da|x_0,s) = \delta_{a^*}(da)$, then the sojourn time $\Theta_1$ is exponential with parameter $q_{X_0}(a^*)$ (assuming $q_{X_0}(a^*) > 0$).
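The difference between the two cases can be seen already in the law of the first sojourn time. The sketch below assumes a finite action set, strictly positive intensities, and mixing weights that do not depend on s; the numbers `rates` and `weights` are hypothetical. Taking the whole state space as the target set in the two formulae above gives a mixture of exponentials in the first case and a single exponential with the averaged intensity in the second.

```python
import math


def survival_randomized(t, rates, weights):
    """P(Theta_1 > t) when one action is drawn at time 0 and held until the jump."""
    return sum(w * math.exp(-q * t) for q, w in zip(rates, weights))


def survival_relaxed(t, rates, weights):
    """P(Theta_1 > t) when the same weights are applied at every time instant."""
    mixed_rate = sum(w * q for q, w in zip(rates, weights))
    return math.exp(-mixed_rate * t)


rates = [1.0, 4.0]    # hypothetical intensities q_{x0}(a) for two actions
weights = [0.5, 0.5]  # hypothetical mixing weights, constant in s
```

By Jensen's inequality the first survival function dominates the second, with equality only when the mixture is degenerate or all intensities coincide, so the two kinds of control generally induce different sojourn time distributions.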
The meaning of the randomized control is obvious: the decision-maker chooses the action $A_1$, which remains the same up to the next jump. Relaxed control $\pi_1$ means, roughly speaking, that the decision-maker “flips a coin” at any moment in time, so that the controlled transition rate at time $s > 0$ is
\[ \int_{\mathbf{A}} q(\cdot|x_0,a)\, \pi_1(da|x_0,s). \]
We underline that usually such relaxed controls cannot be implemented, unless $\pi_1$ is degenerate, i.e., concentrated at singletons $\varphi_1(x_0,s)$. (See the discussion at the end of Sect. 1.1.3.) Note also that such an equivalent degenerate $\pi_1$ exists in convex models, where, for any measure $\pi(da) \in \mathcal{P}(\mathbf{A})$, there is a point $a_\pi \in \mathbf{A}$ such that
\[ \int_{\mathbf{A}} q(\cdot|x,a)\, \pi(da) = q(\cdot|x,a_\pi) \quad \text{for all } x \in \mathbf{X} \]
and
\[ \int_{\mathbf{A}} c_j(x,a)\, \pi(da) = c_j(x,a_\pi) \quad \text{for all } x \in \mathbf{X},\ j = 0, 1, \ldots, J. \]
We omit the measurability issues in this informal discussion. After the piece of trajectory $h_1 = (x_0, \theta_1, x_1)$ is realized, the process develops in a similar manner. The kernels $\varpi_2$ or $\pi_2$ may depend on $h_1$. And so on. The rigorous mathematical description of the control strategy, stochastic process, and the stochastic basis is given in the next subsection.
1.1.3 Strategies, Strategic Measures, and Optimal Control Problems
1.1.3.1 Strategies and Strategic Measures
Before describing the strategies, we construct the sample space and the controlled process $X(\cdot)$ thereon. To be precise, the process $X(\cdot)$ is fixed, and the probability measure on the sample space is under control. Nevertheless, we follow the standard terminology and call $X(\cdot)$ the “controlled process.” Since absorbing states $x$ with $q_x(a) = 0$, where the sojourn time is infinite, are not excluded, we introduce the artificial isolated point $x_\infty \notin \mathbf{X}$ and put
\[ \mathbf{X}_\infty := \mathbf{X} \cup \{x_\infty\}; \quad q_{x_\infty}(a) = q(\mathbf{X}|x_\infty,a) \equiv 0, \quad c_j(x_\infty,a) \equiv 0, \quad \forall\, j = 0, 1, \ldots, J,\ a \in \mathbf{A}. \tag{1.2} \]
Given the primitives described in Sect. 1.1.1, let us construct the underlying (Borel) sample space $(\Omega, \mathcal{F})$. Having firstly defined the measurable space
$(\Omega^0, \mathcal{F}^0) := ((\mathbf{X} \times \mathbb{R}_+)^\infty, \mathcal{B}((\mathbf{X} \times \mathbb{R}_+)^\infty))$, let us adjoin all the sequences of the form
\[ (x_0, \theta_1, x_1, \ldots, \theta_{m-1}, x_{m-1}, \infty, x_\infty, \infty, x_\infty, \ldots) \tag{1.3} \]
to $\Omega^0$, where $m \ge 1$ is some integer, $\theta_l \in \mathbb{R}_+$ for $l = 1, 2, \ldots, m-1$ (provided that $m > 1$), and $x_l \in \mathbf{X}$ for all nonnegative integers $l \le m-1$. After the corresponding modification of the σ-algebra $\mathcal{F}^0$, we obtain the basic sample space $(\Omega, \mathcal{F})$. Below, we use the generic notation $\omega = (x_0, \theta_1, x_1, \theta_2, x_2, \ldots) \in \Omega$.
For each $n \in \mathbb{N}$, introduce the mapping $\Theta_n : \Omega \to \bar{\mathbb{R}}_+$ by $\Theta_n(\omega) = \theta_n$; for each $n \in \mathbb{N}_0$, the mapping $X_n : \Omega \to \mathbf{X}_\infty$ is defined by $X_n(\omega) = x_n$. As usual, the argument ω will often be omitted. The increasing sequence of random variables $T_n$, $n \in \mathbb{N}_0$, is defined by
\[ T_n = \sum_{i=1}^{n} \Theta_i; \qquad T_\infty = \lim_{n \to \infty} T_n. \]
Here, $\Theta_n$ (respectively $T_n$, $X_n$) can be understood as the sojourn times (respectively the jump moments and the states of the process on the intervals $[T_n, T_{n+1})$). We do not intend to consider the process after $T_\infty$. The isolated point $x_\infty$ will be regarded as absorbing; it appears when $\theta_m = \infty$. Finally, for $n \in \mathbb{N}$, $H_n = (X_0, \Theta_1, X_1, \ldots, \Theta_n, X_n)$ is the n-term (random) history, and $H_0 = X_0$ is the 0-term history. The random sequence $\{(T_n, X_n)\}_{n=0}^{\infty}$ is a marked point process.
As mentioned above, the capital letters $X, \Theta, T, H$ denote random elements; the corresponding lower case letters denote their realizations. The bold letters denote spaces; $\mathbf{H}_n = \{(x_0, \theta_1, x_1, \ldots, \theta_n, x_n) : (x_0, \theta_1, x_1, \ldots) \in \Omega\}$ is the space of all n-term histories ($n \ge 1$) and $\mathbf{H}_0 = \mathbf{X}$ is the space of 0-term histories.
The random measure μ on $\mathbb{R}_+ \times \mathbf{X}$ with values in $\mathbb{N}_0 \cup \{\infty\}$ is defined by
\[ \mu(\omega; \Gamma_R \times \Gamma_X) = \sum_{n \ge 1} I\{T_n(\omega) < \infty\}\, \delta_{(T_n(\omega), X_n(\omega))}(\Gamma_R \times \Gamma_X), \tag{1.4} \]
where $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$ and $\Gamma_X \in \mathcal{B}(\mathbf{X})$. The right-continuous filtration $\{\mathcal{F}_t\}_{t \ge 0}$ on $(\Omega, \mathcal{F})$ is given by
\[ \mathcal{F}_t = \sigma(H_0) \vee \sigma(\mu((0,s] \times B) : s \le t,\ B \in \mathcal{B}(\mathbf{X})). \tag{1.5} \]
Following from the definition, one can show that
\[ \mathcal{F}_{T_n} = \sigma(H_n). \tag{1.6} \]
The controlled process $X(\cdot)$ we are interested in is defined by
\[ X(t) := \sum_{n \ge 0} I\{T_n \le t < T_{n+1}\}\, X_n + I\{T_\infty \le t\}\, x_\infty, \quad t \in \mathbb{R}^0_+, \tag{1.7} \]
which takes values in $\mathbf{X}_\infty$ and is right-continuous and adapted. The process $X(\cdot)$ is observed and controlled on the interval $[0, T_\infty)$.
Definition 1.1.1 A state Δ in the state space $\mathbf{X}$ is called a cemetery point, or simply a cemetery, if $q_\Delta(a) = c_j(\Delta,a) \equiv 0$ for all $j = 0, 1, \ldots, J$.
The difference between a cemetery $\Delta \in \mathbf{X}$ and the artificial absorbing state $x_\infty \notin \mathbf{X}$ is that $x_\infty$ is never observed on $[0, T_\infty)$, whereas the cemetery might be observed at a finite time $T_n < T_\infty$.
The filtration $\{\mathcal{F}_t\}_{t \ge 0}$ gives rise to the predictable σ-algebra $\mathcal{P}r$ on $\Omega \times \mathbb{R}^0_+$ defined by $\sigma(\Gamma \times \{0\}\ (\Gamma \in \mathcal{F}_0),\ \Gamma \times (s,\infty)\ (\Gamma \in \mathcal{F}_{s-},\ s > 0))$, where $\mathcal{F}_{s-} := \bigvee_{t<s} \mathcal{F}_t$. […] $> 0$ is the time interval elapsed since the jump moment $t_{n-1}$.
After the corresponding interval $\theta_n$, the new state $x_n \in \mathbf{X}_\infty$ of the process $X(\cdot)$ is realized at the jump moment $t_n = t_{n-1} + \theta_n$. The joint distribution of $(\Theta_n, X_n)$ is given below. And so on. The absorbing state $x_\infty$ appears for the first time when $\theta_n = \infty$; after that the pair $(\infty, x_\infty)$ is repeated endlessly. We accept that $q(x_\infty|x_\infty,a) \equiv 0$.
Below, if $\pi \in \mathcal{P}(\mathbf{A})$, then we use the notations
\[ \tilde{q}(dy|x,\pi) := \int_{\mathbf{A}} \tilde{q}(dy|x,a)\, \pi(da); \qquad q_x(\pi) := \int_{\mathbf{A}} q_x(a)\, \pi(da) = \tilde{q}(\mathbf{X}|x,\pi). \tag{1.10} \]
If the probability measure π depends on the time parameter $s$, we write $q_x(\pi,s)$ and $\tilde{q}(dy|x,\pi,s)$ respectively. When it is clear which natural Markov strategy $\breve{\pi}$ is fixed, we use the following simplified notations, consistent with (1.10):
\[ \left. \begin{aligned} q(dy|x,s) &:= \int_{\mathbf{A}} q(dy|x,a)\, \breve{\pi}(da|x,s), \\ \tilde{q}(dy|x,s) &:= \int_{\mathbf{A}} \tilde{q}(dy|x,a)\, \breve{\pi}(da|x,s), \\ q_x(s) &:= \int_{\mathbf{A}} \tilde{q}(\mathbf{X}|x,a)\, \breve{\pi}(da|x,s) = \tilde{q}(\mathbf{X}|x,s). \end{aligned} \right\} \tag{1.11} \]
At the same time, let us underline that in (1.11) $s$ is the actual time parameter, but when expressions (1.9) are used in the calculations such as (1.13), the parameter $s$ is the time elapsed after the last jump moment of the controlled process $X(\cdot)$.
Definition 1.1.4 A (Borel-measurable) signed kernel $q(dy|x,s)$ on $\mathcal{B}(\mathbf{X})$ from $\mathbf{X} \times [0,\infty)$ is called a (conservative stable) Q-function on the Borel space $\mathbf{X}$ if the following conditions are satisfied.
(a) For each $s \ge 0$, $x \in \mathbf{X}$ and $\Gamma \in \mathcal{B}(\mathbf{X})$ with $x \notin \Gamma$, $\infty > q(\Gamma|x,s) \ge 0$.
(b) For each $(x,s) \in \mathbf{X} \times \mathbb{R}_+$, $q(\mathbf{X}|x,s) = 0$.
(c) For each $x \in \mathbf{X}$, $\sup_{s \in \mathbb{R}^0_+} \{q(\mathbf{X} \setminus \{x\}|x,s)\} < \infty$.
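Evidently, under a fixed natural Markov strategy, the kernel $q(dy|x,s)$ defined in (1.11) is a Q-function on the Borel space $\mathbf{X}$. In the case of standard ξ-strategies, such a controlled model is traditionally called an “exponential semi-Markov decision process.”
The strategic measure $P_\gamma^S(d\omega)$ on Ω is constructed recursively on the spaces of histories $\mathbf{H}_0, \mathbf{H}_1, \ldots$. The distribution of $H_0 = X_0$ is given by $\gamma(dx_0)$ and, for any $n = 1, 2, \ldots$, the stochastic kernel $G_n$ on $\mathcal{B}(\bar{\mathbb{R}}_+ \times \mathbf{X}_\infty)$ given $h_{n-1} \in \mathbf{H}_{n-1}$ is defined by the formulae
\[ \begin{aligned} G_n(\{\infty\} \times \{x_\infty\}|h_{n-1}) &= \delta_{x_{n-1}}(\{x_\infty\}) + \delta_{x_{n-1}}(\mathbf{X}) \int_{\mathbf{A}} I\{q_{x_{n-1}}(a) = 0\}\, \varpi_n(da|h_{n-1}); \\ G_n(\Gamma_R \times \Gamma_X|h_{n-1}) &= \delta_{x_{n-1}}(\mathbf{X}) \int_{\mathbf{A}} \int_{\Gamma_R} \tilde{q}(\Gamma_X|x_{n-1},a)\, e^{-q_{x_{n-1}}(a)\theta}\, d\theta\, \varpi_n(da|h_{n-1}), \\ &\qquad \forall\, \Gamma_R \in \mathcal{B}(\mathbb{R}_+),\ \Gamma_X \in \mathcal{B}(\mathbf{X}); \\ G_n(\{\infty\} \times \mathbf{X}|h_{n-1}) &= G_n(\mathbb{R}_+ \times \{x_\infty\}|h_{n-1}) = 0, \end{aligned} \tag{1.12} \]
if $S_n = \varpi_n$ is a randomized control, whereas if $S_n = \pi_n$ is a relaxed control, we define the stochastic kernel $G_n$ by
\[ \begin{aligned} G_n(\{\infty\} \times \{x_\infty\}|h_{n-1}) &= \delta_{x_{n-1}}(\{x_\infty\}) + \delta_{x_{n-1}}(\mathbf{X})\, e^{-\int_{(0,\infty)} q_{x_{n-1}}(\pi_n,s)\, ds}; \\ G_n(\Gamma_R \times \Gamma_X|h_{n-1}) &= \delta_{x_{n-1}}(\mathbf{X}) \int_{\Gamma_R} \tilde{q}(\Gamma_X|x_{n-1},\pi_n,\theta)\, e^{-\int_{(0,\theta]} q_{x_{n-1}}(\pi_n,s)\, ds}\, d\theta, \\ &\qquad \forall\, \Gamma_R \in \mathcal{B}(\mathbb{R}_+),\ \Gamma_X \in \mathcal{B}(\mathbf{X}); \\ G_n(\{\infty\} \times \mathbf{X}|h_{n-1}) &= G_n(\mathbb{R}_+ \times \{x_\infty\}|h_{n-1}) = 0. \end{aligned} \tag{1.13} \]
It remains to apply the induction and Ionescu-Tulcea’s Theorem (see Proposition B.1.37) to obtain the corresponding unique probability measure $P_\gamma^S$ on $(\Omega, \mathcal{F})$, called the strategic measure, which, for all $\Gamma_X \in \mathcal{B}(\mathbf{X}_\infty)$, $\Gamma_R \in \mathcal{B}(\bar{\mathbb{R}}_+)$, $\Gamma_H \in \mathcal{B}(\mathbf{H}_{n-1})$, $n = 1, 2, \ldots$, satisfies
\[ \begin{aligned} P_\gamma^S(H_0 \in \Gamma_X) &= P_\gamma^S(X_0 \in \Gamma_X) = \gamma(\Gamma_X \cap \mathbf{X}); \\ P_\gamma^S(\Theta_n \in \Gamma_R,\ X_n \in \Gamma_X,\ H_{n-1} \in \Gamma_H) &= \int_{\Gamma_H} G_n(\Gamma_R \times \Gamma_X|h_{n-1})\, P_\gamma^S(H_{n-1} \in dh_{n-1}). \end{aligned} \]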
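For a finite model, one transition under a randomized control can be sampled directly from the kernel (1.12). The following is a minimal sketch under that assumption; the function name `sample_step`, the dictionary layout of the generator, and the use of `None` as a stand-in for the artificial point are illustrative choices, not notation from the text.

```python
import math
import random


def sample_step(x, varpi, Q, rng):
    """Draw (theta, next_state) from the kernel (1.12) for a finite model.

    varpi: dict mapping action -> probability (randomized control for this history)
    Q:     dict mapping action -> generator matrix, Q[a][i][j] = q(j | i, a)
    Returns (math.inf, None) when the drawn action has zero jump intensity;
    None stands in for the artificial point x_infinity.
    """
    actions = list(varpi)
    a = rng.choices(actions, weights=[varpi[b] for b in actions])[0]
    rate = -Q[a][x][x]                           # jump intensity q_x(a)
    if rate == 0.0:
        return math.inf, None                    # infinite sojourn time
    theta = rng.expovariate(rate)                # exponential sojourn time
    targets = [j for j in range(len(Q[a][x])) if j != x]
    post_jump = [Q[a][x][j] for j in targets]    # post-jump measure on X \ {x}
    y = rng.choices(targets, weights=post_jump)[0]
    return theta, y
```

The action is drawn once and held until the jump, which is exactly why the sojourn time is a mixture of exponentials rather than a single exponential. A call might look like `sample_step(0, {"slow": 0.5, "fast": 0.5}, Q, random.Random(0))` for a generator dictionary `Q` of the shape described in the docstring.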
The following formula defines a version of the dual predictable projection (often called the compensator), with respect to $P_\gamma^S$, of the random measure μ given by (1.4):
\[ \nu(\omega; dt \times dx) = \sum_{n \ge 1} \frac{G_n((dt - T_{n-1}) \times dx|H_{n-1})}{G_n([t - T_{n-1}, \infty] \times \mathbf{X}_\infty|H_{n-1})}\, I\{T_{n-1} < t \le T_n\}. \tag{1.14} \]
The very brief proof is given in Appendix A.1. The kernels $\varpi_n(da|h_{n-1})$ and $\pi_n(da|h_{n-1},s)$ are of no importance if $x_{n-1} = x_\infty$. Below, when γ(·) is a Dirac measure concentrated at $x \in \mathbf{X}$, we use the “degenerated” notation $P_x^S$. Expectations with respect to $P_\gamma^S$ and $P_x^S$ are denoted by $E_\gamma^S$ and $E_x^S$, respectively.
1.1.3.2 Marginal Distributions and Formulation of Optimal Control Problems
To state the concerned optimal control problems, it is convenient to introduce the marginal distributions under a strategy $S$.
Definition 1.1.5 The marginal distribution of the CTMDP with the initial distribution γ under a strategy $S = \{S_n\}_{n=1}^{\infty}$ is a substochastic kernel on $\mathcal{B}(\mathbf{X} \times \mathbf{A})$ given $t \in \mathbb{R}_+$ defined by
\[ \begin{aligned} P_\gamma^S(t, dy \times da) := \sum_{n=0}^{\infty} \Big\{ & I\{S_{n+1} = \varpi_{n+1}\}\, E_\gamma^S\big[ I\{X(t) \in dy\}\, \varpi_{n+1}(da|H_n)\, I\{t \in (T_n, T_{n+1}] \cap \mathbb{R}_+\} \big] \\ & + I\{S_{n+1} = \pi_{n+1}\}\, E_\gamma^S\big[ I\{X(t) \in dy\}\, \pi_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}] \cap \mathbb{R}_+\} \big] \Big\}. \end{aligned} \]
We also put $P_\gamma^S(t, dy \times \mathbf{A}) =: P_\gamma^S(t, dy)$ and $P_\gamma^S(0, dy \times \mathbf{A}) := \gamma(dy)$ with a slight abuse of notation. Corresponding to the case of $\gamma(dy) = \delta_x(dy)$ for some $x \in \mathbf{X}$, we write $P_{\delta_x}^S(t, dy \times da)$ as $P_x^S(t, dy \times da)$ for brevity. The meaning of the notation $P_x^S(t, dy)$ is understood similarly. If $S = \{\pi_n\}_{n=1}^{\infty}$ is a π-strategy, then
\[ P_\gamma^S(t, dy \times da) = E_\gamma^S\big[ I\{X(t) \in dy\}\, \Pi(da|t) \big], \]
and $P_\gamma^S(t, dy) = P_\gamma^S(X(t) \in dy)$, see (1.8).
A large part of this book deals with the following expected total α-discounted costs defined as
\[ \begin{aligned} W_j^\alpha(S,\gamma) &:= \int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{X} \times \mathbf{A}} c_j(x,a)\, P_\gamma^S(t, dx \times da)\, dt \\ &:= \int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{X} \times \mathbf{A}} c_j^+(x,a)\, P_\gamma^S(t, dx \times da)\, dt - \int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{X} \times \mathbf{A}} c_j^-(x,a)\, P_\gamma^S(t, dx \times da)\, dt \end{aligned} \]
for each $\alpha \in \mathbb{R}^0_+$ and $j = 0, 1, \ldots, J$. If γ(·) = $\delta_x(\cdot)$, then the objective is denoted by $W_j^\alpha(S,x)$ for each $\alpha \in \mathbb{R}^0_+$ and $x \in \mathbf{X}$. Here α is the discount factor. If α = 0, we often refer to $W_j^0(S,\gamma)$ as the total undiscounted cost, instead of the 0-discounted cost. For each $\alpha \in \mathbb{R}^0_+$, we are interested in the unconstrained α-discounted (or total undiscounted when α = 0) CTMDP problem
\[ \text{Minimize } W_0^\alpha(S,\gamma) \text{ over } S \in \mathbf{S} \tag{1.15} \]
and the constrained α-discounted (or total undiscounted when α = 0) problem
\[ \text{Minimize } W_0^\alpha(S,\gamma) \text{ over } S \in \mathbf{S} \text{ subject to } W_j^\alpha(S,\gamma) \le d_j,\ j = 1, 2, \ldots, J. \tag{1.16} \]
The constants $d_1, d_2, \ldots, d_J \in \mathbb{R}$ are assumed to be given. The control strategies $S \in \mathbf{S}$ satisfying all inequalities in (1.16) are called feasible.
Definition 1.1.6 A strategy $S^* \in \mathbf{S}$ is called uniformly optimal for problem (1.15) if
\[ W_0^\alpha(S^*,x) = \inf_{S \in \mathbf{S}} W_0^\alpha(S,x) =: W_0^{*\alpha}(x), \quad \forall\, x \in \mathbf{X}. \]
The function $x \mapsto W_0^{*\alpha}(x)$ is called the value (Bellman) function for problem (1.15).
Definition 1.1.7 A strategy $S^* \in \mathbf{S}$ is called optimal for problem (1.15) under the initial distribution γ if $W_0^\alpha(S^*,\gamma) \le W_0^\alpha(S,\gamma)$ for each strategy $S \in \mathbf{S}$.
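To make the objective concrete, the discounted cost of one fixed strategy can be evaluated in closed form in the simplest setting. The sketch below assumes a finite state space, α > 0, bounded cost rates, and a deterministic stationary strategy, so that the controlled process is a finite Markov chain with generator `Q` and cost vector `c`. It relies on the standard resolvent identity (αI − Q)W = c for such chains, which is not derived in this section; the function names are hypothetical.

```python
def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    w = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        tail = sum(M[i][k] * w[k] for k in range(i + 1, n))
        w[i] = (M[i][n] - tail) / M[i][i]
    return w


def discounted_cost(Q, c, alpha):
    """Return the vector W(x) solving (alpha I - Q) W = c, for alpha > 0."""
    n = len(c)
    A = [[(alpha if i == j else 0.0) - Q[i][j] for j in range(n)] for i in range(n)]
    return solve_linear(A, c)
```

For example, with `Q = [[-1.0, 1.0], [0.0, 0.0]]`, `c = [2.0, 0.0]` and `alpha = 1.0`, state 1 behaves like a cemetery and the cost from state 0 is the cost rate divided by the sum of the discount factor and the jump intensity. Minimizing over strategies, as in (1.15), is a separate matter that this sketch does not address.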
It should be pointed out that a uniformly optimal strategy might fail to be optimal with respect to some initial distribution γ.
Definition 1.1.8 A strategy $S^* \in \mathbf{S}$ is called optimal for problem (1.16) under the initial distribution γ if it is feasible for problem (1.16), and satisfies $W_0^\alpha(S^*,\gamma) \le W_0^\alpha(S,\gamma)$ for each feasible strategy $S \in \mathbf{S}$.
Another widely used objective is the long-run average cost which, for a π-strategy, has the form
\[ W_j(S,\gamma) := \lim_{T \to \infty} \frac{1}{T} \int_{(0,T]} \int_{\mathbf{X} \times \mathbf{A}} c_j(x,a)\, P_\gamma^S(t, dx \times da)\, dt, \]
and the unconstrained and constrained average CTMDP problems considered in this book are formulated as
\[ \text{Minimize } W_0(S,\gamma) \text{ over } S \in \mathbf{S} \tag{1.17} \]
and
\[ \text{Minimize } W_0(S,\gamma) \text{ over } S \in \mathbf{S} \text{ subject to } W_j(S,\gamma) \le d_j,\ j = 1, 2, \ldots, J. \tag{1.18} \]
The constants $d_1, d_2, \ldots, d_J \in \mathbb{R}$ are assumed to be given. The control strategies $S \in \mathbf{S}$ satisfying all inequalities in (1.18) are called feasible. If γ(·) = $\delta_x(\cdot)$ then the objective is denoted by $W_j(S,x)$.
Definition 1.1.9 A strategy $S^*$ is called average optimal under the given initial distribution γ if
\[ W_0(S^*,\gamma) = \inf_{S \in \mathbf{S}} W_0(S,\gamma), \]
and is called average uniformly optimal if
\[ W_0(S^*,x) = \inf_{S \in \mathbf{S}} W_0(S,x) =: W_0^*(x), \quad \forall\, x \in \mathbf{X}. \tag{1.19} \]
The average cost model is investigated in Chap. 5. Let us emphasize that in this chapter we do not assume that the controlled process $X(\cdot)$ is non-explosive, i.e. we do not exclude the strategies for which $P_\gamma^S(T_\infty < \infty) > 0$. The expected total costs $W_j^\alpha$ are calculated on the time horizon $(0, T_\infty)$. The functionals $W_j^\alpha$ and $W_j$, $j = 0, 1, \ldots, J$, are called performance or objective functionals.
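1.1.3.3 Expression of Expected α-Discounted Costs
Let us say a few more words on the (expected) α-discounted and total undiscounted cost. After the history $h_{n-1} \in \mathbf{H}_{n-1}$ is realised for some $n \in \mathbb{N}$, if $S_n = \pi_n$, the total discounted cost, associated with the cost rate $c_j$ on the interval $(t_{n-1}, t_{n-1} + \theta]$, equals
\[ \int_{(0,\theta] \cap \mathbb{R}_+} e^{-\alpha(t_{n-1}+u)} \int_{\mathbf{A}} c_j(x_{n-1},a)\, \pi_n(da|h_{n-1},u)\, du. \]
Here and below, $\alpha \ge 0$ is the discount factor and $\theta \in \bar{\mathbb{R}}_+$. Remember, $c_j(x_\infty,a) \equiv 0$. If $S_n = \varpi_n$, after the action $a \in \mathbf{A}$ is realised, the similar cost equals
\[ \int_{(0,\theta] \cap \mathbb{R}_+} e^{-\alpha(t_{n-1}+u)}\, c_j(x_{n-1},a)\, du. \]
The presented formulae are familiar from the traditional CTMDP and from the semi-Markov decision processes, respectively. The conditional expectations of the described costs over the sojourn time $\Theta_n$, given $H_{n-1} = h_{n-1} \in \mathbf{H}_{n-1}$, equal
\[ \begin{aligned} C_j^\alpha(\pi_n, h_{n-1}) &= e^{-\alpha t_{n-1}} \Bigg\{ \int_{(0,\infty)} \left( \int_{(0,\theta]} e^{-\alpha u} \int_{\mathbf{A}} c_j(x_{n-1},a)\, \pi_n(da|h_{n-1},u)\, du \right) q_{x_{n-1}}(\pi_n,\theta)\, e^{-\int_{(0,\theta]} q_{x_{n-1}}(\pi_n,s)\, ds}\, d\theta \\ &\qquad\qquad + e^{-\int_{(0,\infty)} q_{x_{n-1}}(\pi_n,s)\, ds} \int_{(0,\infty)} e^{-\alpha u} \int_{\mathbf{A}} c_j(x_{n-1},a)\, \pi_n(da|h_{n-1},u)\, du \Bigg\} \\ &= e^{-\alpha t_{n-1}} \int_{(0,\infty)} \int_{\mathbf{A}} c_j(x_{n-1},a)\, \pi_n(da|h_{n-1},\theta)\, e^{-\alpha\theta - \int_{(0,\theta]} q_{x_{n-1}}(\pi_n,s)\, ds}\, d\theta \end{aligned} \tag{1.20} \]
and
\[ \begin{aligned} C_j^\alpha(\varpi_n, h_{n-1}) &= e^{-\alpha t_{n-1}} \int_{\mathbf{A}} \Bigg\{ \int_{(0,\infty)} q_{x_{n-1}}(a)\, e^{-q_{x_{n-1}}(a)\theta} \left( \int_{(0,\theta]} e^{-\alpha u}\, du \right) c_j(x_{n-1},a)\, d\theta \\ &\qquad\qquad + I\{q_{x_{n-1}}(a) = 0\} \int_{(0,\infty)} e^{-\alpha u}\, du\; c_j(x_{n-1},a) \Bigg\}\, \varpi_n(da|h_{n-1}) \\ &= e^{-\alpha t_{n-1}} \int_{\mathbf{A}} c_j(x_{n-1},a) \int_{(0,\infty)} e^{-\alpha\theta - q_{x_{n-1}}(a)\theta}\, d\theta\; \varpi_n(da|h_{n-1}) \\ &= e^{-\alpha t_{n-1}} \int_{\mathbf{A}} \frac{c_j(x_{n-1},a)}{\alpha + q_{x_{n-1}}(a)}\, \varpi_n(da|h_{n-1}). \end{aligned} \]
In both formulae the second equality follows by integration by parts. Recall that we always deal separately with positive and negative parts of the costs. (See (1).) Remember also the convention $\infty - \infty = +\infty$. Note that, for $S_n = \pi_n$,
\[ E_\gamma^S\big[ C_j^\alpha(\pi_n, H_{n-1}) \big] = E_\gamma^S\left[ \int_{(T_{n-1}, T_n] \cap \mathbb{R}_+} e^{-\alpha u} \int_{\mathbf{A}} c_j(X_{n-1},a)\, \pi_n(da|H_{n-1}, u - T_{n-1})\, du \right]. \]
Then
\[ W_j^\alpha(S,\gamma) = E_\gamma^S\left[ \sum_{n=1}^{\infty} C_j^\alpha(S_n, H_{n-1}) \right], \quad j = 0, 1, \ldots, J, \tag{1.21} \]
where $\alpha \in \mathbb{R}_+$. Using formula (1.8), for a π-strategy, the total α-discounted expected costs can be equivalently represented as
\[ W_j^\alpha(\pi,\gamma) = \lim_{t \to \infty} E_\gamma^\pi\left[ \int_{(0,t]} e^{-\alpha v} \int_{\mathbf{A}} c_j^+(X(v),a)\, \Pi(da|v)\, dv \right] - \lim_{t \to \infty} E_\gamma^\pi\left[ \int_{(0,t]} e^{-\alpha v} \int_{\mathbf{A}} c_j^-(X(v),a)\, \Pi(da|v)\, dv \right]. \]
If one of the limits above is finite, then
\[ W_j^\alpha(\pi,\gamma) = \lim_{t \to \infty} E_\gamma^\pi\left[ \int_{(0,t]} e^{-\alpha v} \int_{\mathbf{A}} c_j(X(v),a)\, \Pi(da|v)\, dv \right]. \tag{1.22} \]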
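The closed form obtained above for a randomized control, the cost rate divided by the sum of the discount factor and the jump intensity, can be checked numerically for a single action. The sketch below assumes t_{n-1} = 0, α > 0 and a strictly positive intensity, and approximates the outer integral by a midpoint rule on a truncated horizon; the function name and the default step count are hypothetical.

```python
import math


def expected_cost_randomized(c, q, alpha, steps=200_000, horizon=60.0):
    """Midpoint-rule value of the integral over theta of
    q * exp(-q * theta) * (integral of exp(-alpha * u) over (0, theta]) * c,
    for q > 0 and alpha > 0, truncated at `horizon`.
    """
    h = horizon / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h
        inner = (1.0 - math.exp(-alpha * theta)) / alpha   # inner integral in closed form
        total += q * math.exp(-q * theta) * inner * c * h
    return total
```

With `c = 3.0`, `q = 2.0` and `alpha = 0.5`, the numerical value can be compared with `c / (alpha + q)`. The truncation is harmless here because the integrand decays exponentially in theta.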
If α = 0 it can easily happen that $W_j^\alpha(S,\gamma)$ either identically equals +∞ or identically equals −∞ for any control strategy $S$. Such a CTMDP problem is regarded as degenerate. One example of a non-degenerate CTMDP problem is the “Stopping Problem.” In this CTMDP, it is assumed that there is a specific absorbing “cemetery” state $\Delta \in \mathbf{X}$ satisfying $q_\Delta(a) \equiv 0$ and $c_j(\Delta,a) \equiv 0$, and there is a specific action “stop” $\in \mathbf{A}$ such that $\tilde{q}(\{\Delta\}|x, \text{“stop”}) = q_x(\text{“stop”})$ for any $x \in \mathbf{X}$. An example of a stopping problem is given in Sect. 1.2.5.
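1.1.3.4 Sufficiency of π-Strategies
Theorem 1.1.1 Let $S = \{S_n\}_{n=1}^{\infty}$ be a strategy. There is a corresponding π-strategy $\tilde{S} = \{\tilde{\pi}_n\}_{n=1}^{\infty}$ such that on $\mathcal{B}(\mathbf{X} \times \mathbf{A})$,
\[ P_\gamma^{\tilde{S}}(t, dy \times da) = P_\gamma^S(t, dy \times da), \quad t \in \mathbb{R}^0_+ \]
for each initial distribution γ.
Proof We define the π-strategy $\tilde{S}$ as follows. For each $n \in \{1, 2, \ldots\}$, if $S_n = \pi_n$, then put $\tilde{S}_n = \tilde{\pi}_n := \pi_n$; if $S_n = \varpi_n$, then put $\tilde{S}_n = \tilde{\pi}_n$ with
\[ \tilde{\pi}_n(da|h_{n-1},t) := \frac{e^{-q_{x_{n-1}}(a)t}\, \varpi_n(da|h_{n-1})}{\int_{\mathbf{A}} e^{-q_{x_{n-1}}(a)t}\, \varpi_n(da|h_{n-1})} \]
on $\mathcal{B}(\mathbf{A})$ for each $t > 0$. Let $n \in \{0, 1, 2, \ldots\}$ be fixed. Suppose $S_{n+1} = \varpi_{n+1}$. Then on $\mathcal{B}(\mathbf{X} \times \mathbf{A})$,
\[ \begin{aligned} & E_\gamma^S\big[ I\{X(t) \in dy\}\, \varpi_{n+1}(da|H_n)\, I\{t \in (T_n, T_{n+1}]\} \big] \\ &= E_\gamma^S\big[ I\{X_n \in dy\}\, \varpi_{n+1}(da|H_n)\, P_\gamma^S(T_n < t \le T_{n+1}|H_n) \big] \\ &= E_\gamma^S\big[ I\{X_n \in dy,\ T_n < t\}\, \varpi_{n+1}(da|H_n)\, e^{-q_{X_n}(a)(t - T_n)} \big]. \end{aligned} \]
On the other hand, on $\mathcal{B}(\mathbf{X} \times \mathbf{A})$,
\[ \begin{aligned} & E_\gamma^{\tilde{S}}\big[ I\{X(t) \in dy\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big] \\ &= E_\gamma^{\tilde{S}}\big[ I\{X_n \in dy,\ T_n < t\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, P_\gamma^{\tilde{S}}\{t \in (T_n, T_{n+1}]\,|\,H_n\} \big] \\ &= E_\gamma^{\tilde{S}}\Big[ I\{X_n \in dy,\ T_n < t\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, e^{-\int_{(0, t - T_n]} \int_{\mathbf{A}} q_{X_n}(a)\, \tilde{\pi}_{n+1}(da|H_n,s)\, ds} \Big] \\ &= E_\gamma^{\tilde{S}}\Bigg[ I\{X_n \in dy,\ T_n < t\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, \exp\Bigg( -\int_{(0, t - T_n]} \frac{\int_{\mathbf{A}} q_{X_n}(a)\, e^{-q_{X_n}(a)s}\, \varpi_{n+1}(da|H_n)}{\int_{\mathbf{A}} e^{-q_{X_n}(a)s}\, \varpi_{n+1}(da|H_n)}\, ds \Bigg) \Bigg]. \end{aligned} \]
Note that
\[ \frac{d}{ds} \ln \int_{\mathbf{A}} e^{-q_{X_n}(a)s}\, \varpi_{n+1}(da|H_n) = -\frac{\int_{\mathbf{A}} q_{X_n}(a)\, e^{-q_{X_n}(a)s}\, \varpi_{n+1}(da|H_n)}{\int_{\mathbf{A}} e^{-q_{X_n}(a)s}\, \varpi_{n+1}(da|H_n)}. \]
Here the Dominated Convergence Theorem is in use. Therefore,
\[ \begin{aligned} & E_\gamma^{\tilde{S}}\big[ I\{X(t) \in dy\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big] \\ &= E_\gamma^{\tilde{S}}\Big[ I\{X_n \in dy,\ T_n < t\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n) \int_{\mathbf{A}} e^{-q_{X_n}(a)(t - T_n)}\, \varpi_{n+1}(da|H_n) \Big] \\ &= E_\gamma^{\tilde{S}}\big[ I\{X_n \in dy,\ T_n < t\}\, \varpi_{n+1}(da|H_n)\, e^{-q_{X_n}(a)(t - T_n)} \big]. \end{aligned} \]
Observe that when $S_{n+1} = \varpi_{n+1}$,
\[ \begin{aligned} P_\gamma^S(\Theta_{n+1} \in dt,\ X_{n+1} \in dx\,|\,h_n) &= \int_{\mathbf{A}} \tilde{q}(dx|x_n,a)\, e^{-q_{x_n}(a)t}\, dt\; \varpi_{n+1}(da|h_n) \\ &= \tilde{q}(dx|x_n, \tilde{\pi}_{n+1}, t)\, e^{-\int_{(0,t]} q_{x_n}(\tilde{\pi}_{n+1},s)\, ds}\, dt \\ &= P_\gamma^{\tilde{S}}(\Theta_{n+1} \in dt,\ X_{n+1} \in dx\,|\,h_n), \end{aligned} \]
and when $S_{n+1} = \pi_{n+1}$,
\[ P_\gamma^S(\Theta_{n+1} \in dt,\ X_{n+1} \in dx\,|\,h_n) = P_\gamma^{\tilde{S}}(\Theta_{n+1} \in dt,\ X_{n+1} \in dx\,|\,h_n). \]
It follows from this, from $P_\gamma^S(X_0 \in dx) = \gamma(dx) = P_\gamma^{\tilde{S}}(X_0 \in dx)$, and from an inductive argument that the strategic measures under $S$ and $\tilde{S}$ coincide: $P_\gamma^S = P_\gamma^{\tilde{S}}$. Consequently,
\[ E_\gamma^S\big[ I\{X(t) \in dy\}\, \varpi_{n+1}(da|H_n)\, I\{t \in (T_n, T_{n+1}]\} \big] = E_\gamma^{\tilde{S}}\big[ I\{X(t) \in dy\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big]. \]
Similarly, if $S_{n+1} = \pi_{n+1}$, then
\[ E_\gamma^S\big[ I\{X(t) \in dy\}\, \pi_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big] = E_\gamma^{\tilde{S}}\big[ I\{X(t) \in dy\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big]. \]
Therefore,
\[ \begin{aligned} & \sum_{n=0}^{\infty} \Big\{ I\{S_{n+1} = \varpi_{n+1}\}\, E_\gamma^S\big[ I\{X(t) \in dy\}\, \varpi_{n+1}(da|H_n)\, I\{t \in (T_n, T_{n+1}]\} \big] \\ &\qquad + I\{S_{n+1} = \pi_{n+1}\}\, E_\gamma^S\big[ I\{X(t) \in dy\}\, \pi_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big] \Big\} \\ &= \sum_{n=0}^{\infty} E_\gamma^{\tilde{S}}\big[ I\{X(t) \in dy\}\, \tilde{\pi}_{n+1}(da|H_n, t - T_n)\, I\{t \in (T_n, T_{n+1}]\} \big], \end{aligned} \]
as required. The following consequence is immediate.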
Corollary 1.1.1 The class of π-strategies is sufficient for CTMDP problems (1.15), (1.16), (1.17), (1.18). Consequently, in CTMDP problems (1.15), (1.16), (1.17), (1.18), one can replace S with Sπ without changing the values of the problems. The reason for considering the class S of strategies rather than only the class of π-strategies is that π-strategies are not realizable, except in the degenerate case. The rigorous definition of a realizable strategy is given in the next subsection.
1.1.4 Realizable Strategies If the optimal (or ε-optimal) strategy is a non-degenerate relaxed one, which is typical for constrained problems, then the decision-maker must be aware that it is impossible to realize directly such a strategy in practice. On the other hand, randomized strategies are realizable because the decision-maker only needs to mix the actions at jump moments T0 , T1 , . . ., not continuously in time as in the case of relaxed strategies. This subsection clarifies and formalizes the issue of the realizability of a strategy. Suppose h n−1 ∈ Hn−1 is fixed for a given n ∈ N and tn−1 = ∞ ⇐⇒ xn−1 ∈ X. It is intuitively sensible to accept the following definition. Definition 1.1.10 A control strategy S is called realizable for h n−1 ∈ Hn−1 with , , F P) and a measurable, with xn−1 ∈ X if there is a complete probability space ( × R+ with values in A such that the following respect to ( ω , s), process A on assertions hold. P(A(s) ∈ A ) (a) πn ( A |h n−1 , s) (correspondingly n ( A |h n−1 )) coincides with is ω∈ for each A ∈ B(A), for almost all s ∈ R+ . As usual, the argument often omitted. (b) In either case, for any conservative and stable transition rate q, ˆ for the random ¯ + × X∞ , depending on and defined by ω∈ probability measure G ω on R
ω ({∞} × {x∞ }) = e− (0,∞) qˆxn−1 (A(ω,s))ds ; G ( × X ) G ω R = ω , θ))e− (0,θ] qˆxn−1 (A(ω,s))ds dθ, q( ˆ X |xn−1 , A( R
∀ R ∈ B(R+ ), X ∈ B(X),
after taking expectation E with respect to P, we must obtain the measure ˆ G n (·|h n−1 ) given by (1.12) or (1.13), with q being replaced by q. A control strategy S is realizable if, for each n ∈ N, it is realizable for Hn−1 (ω) with X n−1 ∈ X almost surely with respect to PγS . Recall that the initial distribution γ is fixed. Theorem 1.1.2 Suppose for some n ∈ N, Sn = πn . Then the following statements are equivalent. (a) The control strategy S is realizable for h n−1 ∈ Hn−1 with xn−1 ∈ X. , F, (b) There is a complete probability space ( P) and a measurable (with respect to ( ω , t) ) process A on × R+ with values in A such that, for almost all s ∈ R+ , P(A(s) ∈ A ) = πn ( A |h n−1 , s) and, for each θ ∈ R+ , for each A ∈ B(A), ˆ is for each bounded measurable function qˆ on A, the integral (0,θ] q(A(s))ds degenerate (not random), that is, equals a constant P-a.s. (c) For almost all s ∈ R+ , πn (·|h n−1 , s) = δϕ(s) (·) is a Dirac measure, where ϕ is an A-valued measurable mapping on R+ . , F, (d) There is a complete probability space ( P) and a measurable (with respect × R+ with values in A such that to ( ω , t) ) process A on P(A(s) ∈ A ) = πn ( A |h n−1 , s) – for almost all s ∈ R+ , for each A ∈ B(A), and ˆ – for each bounded measurable function qˆ on A, the integrals I1 q(A(u))du and I2 q(A(u))du ˆ are independent for any bounded non-overlapping intervals I1 , I2 ⊂ R+ . Proof Below, we assume that h n−1 ∈ Hn−1 is fixed and xn−1 ∈ X. Suppose S is realizable for h n−1 on the nonempty interval (tn−1 , Tn ]. Let us show that statement (b) is valid. Below, since xn−1 and h n−1 ∈ Hn−1 are fixed, we omit these arguments and use ˆ s) = A q(a)π ˆ notations q(a) ˆ = q(X ˆ \ {xn−1 }|xn−1 , a) and q(π, n (da|h n−1 , s), if the transition rate qˆ is given. Suppose the total jump rate q(a) ˆ is an arbitrary measurable bounded function. Then, according to item (a) of Definition 1.1.10, for almost all s ∈ R+ , q(π, ˆ s) = E[q(A( ˆ ω , s))]. 
Therefore, according to item (b) of Definition 1.1.10, the cumulative distribution function of the sojourn time $\Theta_n$, given by
\[
G_n((0,\theta]\times\mathbf{X}) = 1 - e^{-\int_{(0,\theta]}\hat q(\pi,s)\,ds} = 1 - e^{-E\left[\int_{(0,\theta]}\hat q(A(\omega,s))\,ds\right]},
\]
must coincide with
\[
E\left[1 - e^{-\int_{(0,\theta]}\hat q(A(\omega,s))\,ds}\right],
\]
that is, for each $\theta<\infty$, we have
\[
e^{-E\left[\int_{(0,\theta]}\hat q(A(\omega,s))\,ds\right]} = E\left[e^{-\int_{(0,\theta]}\hat q(A(\omega,s))\,ds}\right].
\]
Since the function $e^{-z}$ is strictly convex, we conclude that, for each $\theta\in\mathbb{R}_+$, the integral $\int_{(0,\theta]}\hat q(A(\omega,s))\,ds$ is not random. Therefore, assertion (a) implies (b).
Let us prove that (b) implies (c). Suppose $\pi_n(\cdot|h_{n-1},s)$ is not a Dirac measure on a subset of positive Lebesgue measure, that is, on a subset of a finite interval $(0,\hat t]\subset\mathbb{R}_+$ having a positive Lebesgue measure. The goal is to show that assertion (b) is violated. We are going to apply Lemma B.1.6 to X = A, where A has been equipped with a compatible metric $\rho$. Below, $O(a,\varepsilon)=\{b\in\mathbf{A}:\ \rho(a,b)<\varepsilon\}$ is an open ball. If for any $e_m\in X_d$, with $X_d=\{e_1,e_2,\ldots\}$ being a dense subset of A, and for any $k\in\mathbb{N}$, the set
\[
\left\{t\in(0,\hat t]:\ \pi_n\!\left(O\!\left(e_m,\tfrac{1}{k}\right)\Big|h_{n-1},t\right)\in(0,1)\right\}
\]
is null, then the set
\[
\left\{t\in(0,\hat t]:\ \exists\, e_m\in X_d,\ \exists\, k\in\mathbb{N}:\ \pi_n\!\left(O\!\left(e_m,\tfrac{1}{k}\right)\Big|h_{n-1},t\right)\in(0,1)\right\}
\]
is also null as a countable union of null sets, and therefore, according to Lemma B.1.6, for almost all $t\in(0,\hat t]$, $\pi_n(\cdot|h_{n-1},t)$ is a Dirac measure concentrated on a singleton. From the obtained contradiction, we conclude that there are $\hat e_m\in X_d$ and $\hat k\in\mathbb{N}$ such that Leb(R) > 0, where
\[
R=\left\{t\in(0,\hat t]:\ \pi_n\!\left(O\!\left(\hat e_m,\tfrac{1}{\hat k}\right)\Big|h_{n-1},t\right)\in(0,1)\right\}. \tag{1.23}
\]
Here and below, Leb is the Lebesgue measure. Now, suppose assertion (b) holds. Consider the function
\[
\hat q(a) = I\left\{a\in O\!\left(\hat e_m,\tfrac{1}{\hat k}\right)\right\}
\]
and the integrals
\[
V(t)=\int_{(0,t]}\hat q(A(u))\,du
\]
for $t\in(0,\hat t]$, which are non-random if assertion (b) holds. To be more precise, for each rational $t\in(0,\hat t]$, there is a number $f(t)$ such that $P(V(t)=f(t))=1$. Hence $P(\text{for all rational } t\in(0,\hat t]\ \ V(t)=f(t))=1$. Since for each $\omega$ the function $V(\cdot)$ is absolutely continuous, we can extend the definition of the function $f$ to the whole interval $(0,\hat t]$ in such a way that it is also absolutely continuous: it is sufficient to take an arbitrary $\omega$ such that $V(t)=f(t)$
for all rational t ∈ (0, tˆ] and extend this equality for the whole interval t ∈ (0, tˆ]. As a result, P(∀ t ∈ (0, tˆ] V (t) = f (t)) = 1. Therefore, the function f (·) is differentiable everywhere apart from a null set N and ˆ = 1, P()
(1.24)
where df ˆ = ω ∈ : q(A(t)) ˆ = h(t) = for all t ∈ (0, tˆ] \ Nω dt : ∀ t ∈ (0, tˆ] V (t) = f (t)}. = { ω∈ . Below, if ˆ ω , t)) = h(t)} and Leb(Nω ) = 0 for all ω∈ Here Nω := N {t : q(A( necessary, we complement the definition of the function h(·) with values in {0, 1} ˆ ∈ on the set R ⊂ (0, tˆ], defined in (1.23), in an arbitrary way. (Remember, q(A(t)) {0, 1}.) For the set ˆ ω , t)) = h(t)}, = {( ω , t) : t ∈ R , q(A( we have ˆ ω , t)) = h(t)}) = Leb(R \ Nω ) Leb(ω ) = Leb({t : t ∈ R , q(A( = Leb(R ) . Therefore, (1.24) implies that for all ω∈ P × Leb() = Leb(R ) > 0.
(1.25)
On the other hand, according to (1.23), since for almost all t ∈ R , 1 1 h P(q(A(t)) ˆ = 1) = P A(t) ∈ O eˆm , = πn O eˆm , , t n−1 kˆ kˆ ∈ (0, 1), and similarly P(q(A(t)) ˆ = 0) ∈ (0, 1), we have the inequality 0 0 and Leb(A− ω )) > 0) = 0, P(Leb(A− 1 ( 2 (
(1.26)
where, for each A ∈ B(A), we define ω ) := {t ∈ R+ : A( ω , t) ∈ A } = A−1 ( ω , A ), − A (
(1.27)
ω )) is a measurable function where Leb is the Lebesgue measure. Note that Leb( − A ( because of the equality , F) on ( ω , t) ∈ A ) = lim Leb({t ∈ R+ : A(
T →∞ (0,T ]
I{A( ω , s) ∈ A }ds
and the measurability of the process A; see Proposition B.1.35. Suppose equality (1.26) is violated. Let us fix T ∈ R+ such that, for A−,T ω ) := 1,2 ( − A1,2 ( ω ) ∩ [0, T ], the inequality P(Leb(A−,T ω )) > 0 and Leb(A−,T ω )) > 0) > 0 1 ( 2 (
(1.28)
holds. Below in this proof, since n, xn−1 and h n−1 ∈ Hn−1 are fixed, we omit these arguments and use the notation q(a) ˆ = q(X ˆ \ {xn−1 }|xn−1 , a), if the transition rate qˆ is given. Suppose the total jump rate q(a) ˆ is in the following form: q(a) ˆ =
λ1 , if a ∈ A1 ; λ2 , if a ∈ A2 ,
where 0 < λ1 < λ2 are two arbitrarily fixed numbers. According to item (b) of Definition 1.1.10, the cumulative distribution function of the sojourn time , given by
G((0, θ] × X) = (A1 ) 1 − e−λ1 θ + (A2 ) 1 − e−λ2 θ , must coincide with ˆ ω ,s))ds E 1 − e− (0,θ] q(A( , that is, for θ = T we have ˆ ω ,s))ds . (A1 )e−λ1 T + (A2 )e−λ2 T = E e− (0,T ] q(A(
(1.29)
As before, the argument h n−1 and index n are omitted here for brevity. Let q(A( ˆ ω , s))ds ∈ [λ1 T, λ2 T ] Z ( ω) = (0,T ]
P with respect to the introduced and denote by P Z (·) the image of the measure → [λ1 T, λ2 T ]; E Z is the corresponding mathematical measurable mapping Z : expectation. Note that P(Leb(A−,T ω )) > 0 and Leb(A−,T ω )) > 0) > 0; P Z ((λ1 T, λ2 T )) = 1 ( 2 ( hence
(Z − λ1 T )(e−λ2 T − e−λ1 T ) 0 and Leb(A−,T ω )) > 0) = 0 for all T > 0, 1 ( 2 ( and property (1.26) is proved. (ii) Let λ(ds) be the probability measure on B(R+ ) defined by the cumulative distribution function 1 − e−t and introduce the image pωA (da) of the measure λ with . First of all, pωA (da) is a respect to the mapping s → A( ω , s) under a fixed ω∈ A stochastic kernel: pω (A) = 1 and, for a fixed A ∈ B(A), ω , s) ∈ A }) = pωA ( A ) = λ({s ∈ R+ : A(
R+
λ(ds)I{A( ω , s) ∈ A }
is a measurable function of ω , since I{A( ω , s) ∈ A } is a measurable function of (s, ω ): see Proposition B.1.35. ω )) > 0 and Secondly, for each A ∈ B(A), if pωA ( A ) ∈ (0, 1), then Leb( − A ( ω )) > 0, where we use the notation introduced in (1.27). Hence, Leb((A \ A )− ( P( pωA ( A ) ∈ {0, 1}) = 1 according to (1.26). It remains to refer to Proposition B.1.36 for the existence of a measurable map → A such that pωA (da) = δϕ(ω) (da) P-a.s. The last equality cannot ping ϕ : of positive measure with respect to hold if A( ω , s) = ϕ( ω ) on a subset of R+ × Leb × P. One can find several comments on the presented proof in Appendix A.3. When solving optimal control problems, it is desirable that the optimal (or nearly optimal) strategy is realizable. In special cases, including the α-discounted model with α > 0, Markov standard ξ-strategies, which are realizable, are sufficient for solving optimal control problems (see Sects. 1.3.3 and 1.3.5). The more general situations are investigated in Chaps. 4 and 6.
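The convexity step in the proof of Theorem 1.1.2 can be illustrated numerically. The sketch below is not part of the original argument: it assumes a relaxed control whose mixture does not depend on time, with two actions and illustrative jump rates, and compares it with a process that draws one action at the jump epoch and holds it. The held-action process has the right one-dimensional marginals, but its integrated jump rate is random, so by Jensen's inequality its sojourn-time distribution differs from the one produced by the relaxed control. The function names and numbers are hypothetical.

```python
import math


def survival_held_action(weights, rates, theta):
    """P(sojourn time > theta) if one action is drawn at the jump epoch and held.

    weights[i] is the probability of action i, rates[i] its total jump rate.
    """
    return sum(w * math.exp(-lam * theta) for w, lam in zip(weights, rates))


def survival_relaxed(weights, rates, theta):
    """P(sojourn time > theta) under the relaxed control with the same weights."""
    mean_rate = sum(w * lam for w, lam in zip(weights, rates))
    return math.exp(-mean_rate * theta)


if __name__ == "__main__":
    weights, rates = [0.5, 0.5], [1.0, 3.0]  # illustrative values only
    for theta in (0.5, 1.0, 2.0):
        held = survival_held_action(weights, rates, theta)
        relaxed = survival_relaxed(weights, rates, theta)
        print(f"theta={theta}: held={held:.4f}, relaxed={relaxed:.4f}")
```

The two survival functions agree only when all mixed actions have the same jump rate, which is the degenerate situation singled out by the theorem.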
1.1.5 Instant Costs at the Jump Epochs
Sometimes, it is natural to introduce (measurable) lump sum costs $C(x,y)$ incurred at the jump epochs $T_n$, if $X(T_n-)=x$ and $X(T_n)=y$. Such a case is treated by considering the cost rate
\[
c(x,a)=\int_{\mathbf{X}} C(x,y)\,\tilde q(dy|x,a).
\]
As usual, here and below the positive and negative parts of C are analysed separately (see (1)), so that in this subsection, without loss of generality, we assume that $C\ge 0$. Below, we provide the formal justification of this trick. To start with, consider a deterministic stationary strategy $\varphi:\mathbf{X}\to\mathbf{A}$ and the corresponding cost
\[
E^{\varphi}_{\gamma}\left[e^{-\alpha T_n}C(X(T_n-),X(T_n))\right]
= E^{\varphi}_{\gamma}\left[\int_{(0,\infty)}\int_{\mathbf{X}} e^{-\alpha t}\,Y(t,y)\,\mu(dt\times dy)\right],
\]
where the random function $Y(t,y)=I\{T_{n-1}<t\le T_n\}C(X_{n-1},y)$ is $\mathrm{Pr}\times\mathcal{B}(\mathbf{X})$-measurable, being left-continuous in t. In this simple case, the dual predictable projection of $\mu$ on the interval $(T_{n-1},T_n]$ is given by $\nu(dt\times dy)=\tilde q(dy|X_{n-1},\varphi(X_{n-1}))\,dt$; see (1.14), so that
\[
\begin{aligned}
E^{\varphi}_{\gamma}\left[e^{-\alpha T_n}C(X(T_n-),X(T_n))\right]
&= E^{\varphi}_{\gamma}\left[\int_{(T_{n-1},T_n]\cap\mathbb{R}_+}\int_{\mathbf{X}} e^{-\alpha t}\,C(X_{n-1},y)\,\tilde q(dy|X_{n-1},\varphi(X_{n-1}))\,dt\right]\\
&= E^{\varphi}_{\gamma}\left[\int_{(T_{n-1},T_n]\cap\mathbb{R}_+} e^{-\alpha t}\,c(X_{n-1},\varphi(X_{n-1}))\,dt\right].
\end{aligned}
\]
Therefore, the total expected lump sum cost can be calculated in the standard way as in (1.20), (1.21), using the cost rate c. Note, this reasoning also holds true if the cost C depends on the action a: one has to replace $C(X_{n-1},y)$ with $C(X_{n-1},\varphi(X_{n-1}),y)$ and take $c(x,a)=\int_{\mathbf{X}} C(x,a,y)\,\tilde q(dy|x,a)$. In the case of an arbitrary strategy S, this trick for the a-dependent lump sum cost $C(x,a,y)$ does not work.
The next statement asserts that the (nonnegative) lump sum cost $C(x,y)$ can be taken into account by the cost rate $c(x,a)=\int_{\mathbf{X}} C(x,y)\,\tilde q(dy|x,a)$ for any control strategy S.
Theorem 1.1.4 Suppose $C(x,y)\ge 0$ for each $x,y\in\mathbf{X}$. Then, for $\alpha\in[0,\infty)$ and each control strategy S,
\[
E^{S}_{\gamma}\left[e^{-\alpha T_n}C(X(T_n-),X(T_n))\right] = E^{S}_{\gamma}\left[C^{\alpha}(S_n,H_{n-1})\right],
\]
where $C^{\alpha}(S_n,h_{n-1})$ is defined in (1.20) as for the cost rate $c(x,a):=\int_{\mathbf{X}} C(x,y)\,\tilde q(dy|x,a)$.
Proof Following the ideas and notations described above, let us show that for each fixed n and control strategy S,
\[
E^{S}_{\gamma}\left[\int_{(0,\infty)}\int_{\mathbf{X}} e^{-\alpha t}\,Y(t,y)\,\nu(dt\times dy)\,\Big|\,H_{n-1}\right] = C^{\alpha}(S_n,H_{n-1}).
\]
(a) Suppose Sn = n . Then, for t ∈ (Tn−1 , Tn ],
ν(dt × dy) =
A
q (dy|X n−1 , a)e−q X n−1 (a)(t−Tn−1 ) n (da|Hn−1 ) dt −q X n−1 (a)(t−Tn−1 ) e n (da|Hn−1 ) A
and EγS
(0,∞)
⎡
X
e−αt Y (t, y)ν(dt × dy) Hn−1
⎢ = EγS ⎣ e−αTn−1
= e−αTn−1
(Tn−1 ,Tn ]∩R+ X
⎤ ⎥ e−α(t−Tn−1 ) C(X n−1 , y)ν(dt × dy) Hn−1 ⎦
C(X n−1 , y) q (dy|X n−1 , a)I{q X n−1 (a) = 0} A
⎡
X
⎢ ×EγS ⎢ ⎣
(Tn−1 ,Tn ]∩R+
×n (da|Hn−1 ) = e−αTn−1
A
⎤ −q X n−1 (a)(t−Tn−1 )−α(t−Tn−1 ) ⎟ ⎜ ⎥ ⎟ dt Hn−1 ⎥ ⎜ e ⎠ ⎝ ⎦ e−q X n−1 (a)(t−Tn−1 ) n (da|Hn−1 ) ⎞
⎛
A
⎡
⎢ c(X n−1 , a)EγS ⎢ ⎣
(0,n ]∩R+
I{q X n−1 (a) = 0}
⎤ ⎟ ⎥ ⎜ e−q X n−1 (a)u−αu ⎟ du Hn−1 ⎥ × n (da|Hn−1 ). ⎜ ⎠ ⎦ ⎝ e−q X n−1 (a)u n (da|Hn−1 ) ⎞
⎛
A
Now ⎡ ⎢ EγS ⎢ ⎣
(0,n ]∩R+
I{q X n−1 (a) = 0}
⎤ −q (a)u−αu ⎟ ⎥ ⎜ e X n−1 ⎟ du Hn−1 ⎥ ⎜ ⎠ ⎦ ⎝ e−q X n−1 (a)u n (da|Hn−1 ) ⎞
⎛
A
= I{q X n−1 (a) = 0}
⎡
⎛
⎢ ⎢ ⎣ (0,∞)
⎞
⎜ ⎜ ⎝ (0,θ]
e−q X n−1 (a)u−αu e−q X n−1 (a)u n (da|Hn−1 )
⎤
⎟ ⎥ ⎟ du ⎥ ⎠ ⎦
A
×
q X n−1 (a)e
−q X n−1 (a)θ
n (da|Hn−1 )dθ.
A
Integrating by parts with respect to θ and noting that q X n−1 (a) = 0, we obtain e Y (t, y)ν(dt × dy) Hn−1 (0,∞) X c(X n−1 , a) e−q X n−1 (a)θ−αθ dθ n (da|Hn−1 ) = e−αTn−1
−αt
EγS
(0,∞)
A
= C α (n , Hn−1 ).
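Remember that $c(x,a)=0$ if $q_x(a)=0$.
(b) Suppose $S_n=\pi_n$. Then, on $(T_{n-1},T_n]$, $\nu(dt\times dy)=\tilde q(dy|X_{n-1},\pi_n,t-T_{n-1})\,dt$ and, according to the definition of $\tilde q(dy|X_{n-1},\pi_n,s)$, we have
\[
\begin{aligned}
&E^{S}_{\gamma}\left[\int_{(0,\infty)}\int_{\mathbf{X}} e^{-\alpha t}\,Y(t,y)\,\nu(dt\times dy)\,\Big|\,H_{n-1}\right]\\
&\quad= E^{S}_{\gamma}\left[\int_{(T_{n-1},T_n]\cap\mathbb{R}_+}\int_{\mathbf{X}} e^{-\alpha t}\,C(X_{n-1},y)\,\nu(dt\times dy)\,\Big|\,H_{n-1}\right]\\
&\quad= E^{S}_{\gamma}\left[\int_{(T_{n-1},T_n]\cap\mathbb{R}_+}\int_{\mathbf{A}}\int_{\mathbf{X}} e^{-\alpha t}\,C(X_{n-1},y)\,\tilde q(dy|X_{n-1},a)\,\pi_n(da|H_{n-1},t-T_{n-1})\,dt\,\Big|\,H_{n-1}\right]\\
&\quad= E^{S}_{\gamma}\left[\int_{(0,\Theta_n]\cap\mathbb{R}_+}\int_{\mathbf{A}} e^{-\alpha(T_{n-1}+u)}\,c(X_{n-1},a)\,\pi_n(da|H_{n-1},u)\,du\,\Big|\,H_{n-1}\right]\\
&\quad= C^{\alpha}(\pi_n,H_{n-1}).
\end{aligned}
\]
The proof is complete.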
Clearly, from the theoretical viewpoint, when dealing with the total cost model there is no need to consider lump sum costs C(x, y). If needed, one can always adjust the cost rate c as explained above.
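As a sanity check of this reduction, the following sketch compares the two sides for a single sojourn in a finite model under a fixed action. It assumes finitely many target states, a discount factor α > 0, and illustrative rates and costs; the names `jump_rates` and `lump_cost` are hypothetical. The expected discounted lump sum paid at the first jump should equal the expected discounted running cost accumulated at rate c until that jump.

```python
import math


def cost_rate(jump_rates, lump_cost):
    """c(x, a) = sum over y of C(x, y) * q~(y | x, a), for one fixed pair (x, a)."""
    return sum(lump_cost[y] * rate for y, rate in jump_rates.items())


def discounted_lump_sum(jump_rates, lump_cost, alpha):
    """E[exp(-alpha * T) * C(x, Y)], with T ~ Exp(q) and Y the post-jump state."""
    q = sum(jump_rates.values())
    mean_lump = sum(lump_cost[y] * rate / q for y, rate in jump_rates.items())
    return q / (q + alpha) * mean_lump


def discounted_running_cost(jump_rates, lump_cost, alpha):
    """E[integral over (0, T] of exp(-alpha * t) * c dt], with T ~ Exp(q), alpha > 0."""
    q = sum(jump_rates.values())
    c = cost_rate(jump_rates, lump_cost)
    return c * (1.0 - q / (q + alpha)) / alpha


if __name__ == "__main__":
    jump_rates = {"y1": 2.0, "y2": 1.0}  # illustrative q~(y | x, a)
    lump_cost = {"y1": 5.0, "y2": 8.0}   # illustrative C(x, y)
    print(discounted_lump_sum(jump_rates, lump_cost, alpha=0.5))
    print(discounted_running_cost(jump_rates, lump_cost, alpha=0.5))
```

Both functions use only the facts that the sojourn time is exponential with parameter q and is independent of the post-jump state, which holds here because the action is held fixed.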
1.2 Examples
In this section, we present several meaningful examples to illustrate the possible applications of CTMDPs. In the case of costs (rewards) appearing instantly at the jump epochs, we use the results of Sect. 1.1.5 without special reference.
1.2.1 Queueing Systems
Suppose there are M > 1 types of jobs to be served by a single server. The interarrival and service times of the jobs of type m = 1, 2, ..., M are exponential random variables with parameters $\lambda_m$ and $\mu_m$ correspondingly, which are called the arrival and service rates. We assume there is an infinite space for waiting in the queue. The holding cost of one type m job equals $C^h_m\ge 0$ per time unit. At any moment, the server should choose a job for service from the queue. Clearly, the queue is characterized not only by the total number of jobs, but by the vector $(i_1,i_2,\ldots,i_M)$, $i_m\in\mathbb{N}_0$, where $i_m$ is the number of type m jobs in the system. The corresponding CTMDP is now described by the following primitives.
• $\mathbf{X}=\mathbb{N}_0^M$;
• $\mathbf{A}=\{1,2,\ldots,M\}$, where action m means the server has chosen a type m job for the service.
• The non-zero transition rates for $(j_1,j_2,\ldots,j_M)\ne(i_1,i_2,\ldots,i_M)$ are given by
\[
q((j_1,\ldots,j_M)|(i_1,\ldots,i_M),a)=
\begin{cases}
\lambda_m, & \text{if } j_m=i_m+1,\ j_l=i_l \text{ for all } l\ne m;\\
\mu_a, & \text{if } j_a=i_a-1,\ j_l=i_l \text{ for all } l\ne a.
\end{cases}
\]
• $c_0((i_1,i_2,\ldots,i_M),a)=\sum_{m=1}^{M} i_m C^h_m$ is the total holding cost rate. Of course one can also introduce a service cost rate like $c_1((i_1,i_2,\ldots,i_M),a)=C^S_a$, where $C^S_m\ge 0$ is the cost per time unit associated with the service of a type m job.
• Initial distribution may be taken as $\gamma((0,0,\ldots,0))=1$, that is, the system is empty at time 0.
According to the definition of the control strategy, the server may switch serving from one job to another before the service is completed. The remaining service time is again exponentially distributed due to the memoryless property. Such a service rule (called preemptive) can be optimal, e.g., if a higher priority job arrives during the service of a lower priority job.
It is natural to impose the condition $\sum_{m=1}^{M}\frac{\lambda_m}{\mu_m}<1$ ensuring that the system is stable, i.e., the queue does not grow to infinity.
In this connection, note that serving an empty queue is not forbidden: it is allowed to choose a = m if $i_m=0$ and $i_l>0$ for some l = 1, 2, ..., M. But such a strategy will be far from optimal in any practical problem. For the described model, one can consider the discounted expected cost (1.21) with α > 0. Another natural problem is to concentrate on the finite (random) interval up to the moment when the system becomes idle. In the latter case, we introduce the absorbing cemetery $\Delta$ and put
\[
q(\Delta|(i_1,i_2,\ldots,i_M),a)=
\begin{cases}
\mu_a, & \text{if } i_a=1 \text{ and } i_l=0 \text{ for all } l\ne a;\\
0, & \text{otherwise.}
\end{cases}
\]
In both modifications, the optimal control strategy for the unconstrained problem (1.15) is known. This is the so-called Cμ-rule. One has to order the jobs in such a way that $C^h_1\mu_1\ge C^h_2\mu_2\ge\ldots\ge C^h_M\mu_M$ and assign the higher priority to the jobs with smaller index. The deterministic stationary strategy
\[
\varphi^*((i_1,i_2,\ldots,i_M))=
\begin{cases}
m, & \text{if } i_1=\ldots=i_{m-1}=0,\ i_m>0;\\
\text{arbitrarily fixed}, & \text{if } i_1=i_2=\ldots=i_M=0
\end{cases}
\]
is optimal.
Another common problem is called “Admission Control.” Suppose the space for waiting is limited: there can be no more than V jobs in the system. Suppose also there are two types of jobs, and type two has higher priority. As before, $\lambda_{1,2}$ and $\mu_{1,2}$ are the arrival and service rates, and $C^h_{1,2}$ are the holding cost rates. We assume also that $R_{1,2}$ are the rewards from servicing the corresponding jobs. At any moment, the server should decide whether to allow the new first-type jobs to join the queue. Now
• $\mathbf{X}=\{(i_1,i_2)\in\mathbb{N}_0^2:\ i_1+i_2\le V\}$ and $x=(i_1,i_2)$ means there are $i_{1,2}$ jobs of the corresponding types in the system.
• $\mathbf{A}=\{0,1\}$, where action a = 0 (a = 1) means the first-type jobs are not accepted (accepted).
• The non-zero transition rates for $(j_1,j_2)\ne(i_1,i_2)$ and $j_1+j_2\le V$ are given by
\[
q((j_1,j_2)|(i_1,i_2),a)=
\begin{cases}
a\lambda_1, & \text{if } j_1=i_1+1 \text{ and } j_2=i_2;\\
\lambda_2, & \text{if } j_2=i_2+1 \text{ and } j_1=i_1;\\
\mu_1, & \text{if } j_1=i_1-1 \text{ and } j_2=i_2=0;\\
\mu_2, & \text{if } j_2=i_2-1 \text{ and } j_1=i_1.
\end{cases}
\]
• $c_0((i_1,i_2),a)=i_1C^h_1+i_2C^h_2-I\{i_2=0\}I\{i_1>0\}\mu_1R_1-I\{i_2>0\}\mu_2R_2$.
• The initial distribution may be taken as $\gamma((0,0))=1$, that is, the system is empty at time 0.
If, e.g., $R_2>R_1$ then it can be reasonable to accept only the second-type jobs if the available space $V-i_1-i_2$ is small. We again consider the preemptive service. Since V < ∞, no stability conditions are needed. One can consider a discounted expected cost (1.21) with α > 0, or one can split the cost into two parts:
\[
c_0((i_1,i_2),a)=-I\{i_2=0\}I\{i_1>0\}\mu_1R_1-I\{i_2>0\}\mu_2R_2;\qquad
c_1((i_1,i_2),a)=i_1C^h_1+i_2C^h_2,
\]
and consider the discounted constrained problem (1.16) with J = 1.
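For the multi-class queue at the start of this subsection, the Cμ-rule and the stability condition can be sketched in a few lines. This is only an illustration under assumed inputs. Instead of relabelling the job types so that the indices are already ordered, the hypothetical function `c_mu_rule` picks, among the types present in the system, the one with the largest product of holding cost and service rate, breaking ties in favour of the smaller type number.

```python
def c_mu_rule(queue, holding_costs, service_rates):
    """Return the (1-based) job type to serve under the C-mu rule.

    queue[m - 1] is the number of type-m jobs in the system;
    holding_costs[m - 1] and service_rates[m - 1] are C^h_m and mu_m.
    """
    waiting = [m for m, jobs in enumerate(queue) if jobs > 0]
    if not waiting:
        return 1  # empty system: the action is arbitrary, fixed here to type 1
    # Largest product C^h_m * mu_m first; ties go to the smaller type number.
    best = max(waiting, key=lambda m: (holding_costs[m] * service_rates[m], -m))
    return best + 1


def is_stable(arrival_rates, service_rates):
    """Check the condition sum_m lambda_m / mu_m < 1."""
    return sum(lam / mu for lam, mu in zip(arrival_rates, service_rates)) < 1


if __name__ == "__main__":
    holding = [1.0, 4.0, 2.0]  # illustrative C^h_m
    service = [3.0, 1.0, 1.0]  # illustrative mu_m
    print(c_mu_rule((2, 0, 5), holding, service))
    print(is_stable([0.2, 0.3, 0.1], service))
```

Because the rule depends only on which job types are present, the resulting strategy is deterministic and stationary, as in the formula for the optimal strategy above.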
1.2.2 The Freelancer Dilemma
Consider a freelancer who can accept or reject jobs arriving in accordance with independent Poisson processes with rates $\lambda_m>0$ for jobs of type $m\in\{1,2,\ldots,M\}$. If a job is rejected then it is lost; it is also lost if it arrives at a moment when the freelancer is busy. The busy period is exponential, with parameter $\mu_m>0$ for an accepted job of type m. At the moment when a job of type m is completed, the freelancer receives the payoff (reward) $R_m>0$. The dilemma is to decide which jobs to accept so as to maximise the total discounted expected reward. Another connected problem is to maximise the long-run average reward. To formulate the minimisation problem, we change the sign of rewards and consider the costs $-R_m$, m = 1, 2, ..., M. The corresponding CTMDP is now described by the following primitives.
• $\mathbf{X}=\{0,1,2,\ldots,M\}$, where 0 means the freelancer is free, and x = m > 0 means the freelancer is busy doing a type m job.
• $\mathbf{A}=\{0,1\}^M$, where action $a=(a_1,a_2,\ldots,a_M)$ means the new arriving job of type m will be accepted if and only if $a_m=1$ and the freelancer is not busy.
• The non-zero transition rates for $y\ne x$ are given by
\[
q(y|x,(a_1,a_2,\ldots,a_M))=
\begin{cases}
a_y\lambda_y, & \text{if } x=0,\ y\ne 0;\\
\mu_x, & \text{if } x\ne 0,\ y=0.
\end{cases}
\]
• $c_0(x,a)=-R_x\mu_xI\{x\ne 0\}$. (See Lemma A.1.2 for the justification.)
• The initial distribution may be taken as $\gamma(0)=1$, that is, the freelancer is free at time 0.
The described model is a special case of a queueing system when there is no space for waiting. In this framework, making decisions about accepting/rejecting jobs is called “Admission Control.” See also Sect. 1.2.1.
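The primitives of the freelancer model translate directly into code. The sketch below assumes that rates and rewards are stored in lists indexed by job type and that an action is a tuple of 0/1 flags; the function names are hypothetical and the numbers are illustrative.

```python
def freelancer_rates(x, action, arrival, service):
    """Non-zero transition rates q(y | x, a) for y != x, returned as {y: rate}.

    x = 0 means the freelancer is free; x = m > 0 means busy with a type-m job.
    action[m - 1] = 1 means type-m jobs are accepted; arrival[m - 1] and
    service[m - 1] are lambda_m and mu_m.
    """
    if x == 0:
        return {m: arrival[m - 1]
                for m in range(1, len(arrival) + 1) if action[m - 1] == 1}
    return {0: service[x - 1]}


def freelancer_cost_rate(x, rewards, service):
    """c_0(x, a) = -R_x * mu_x * I{x != 0}; it does not depend on the action."""
    return -rewards[x - 1] * service[x - 1] if x != 0 else 0.0


if __name__ == "__main__":
    arrival, service, rewards = [2.0, 3.0, 4.0], [1.0, 0.5, 2.0], [10.0, 30.0, 5.0]
    print(freelancer_rates(0, (1, 0, 1), arrival, service))
    print(freelancer_rates(2, (1, 0, 1), arrival, service))
    print(freelancer_cost_rate(2, rewards, service))
```

The total jump rate out of a state is the sum of the returned values, so rejecting every job type makes the free state absorbing.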
1.2.3 Epidemic Models
Infectious diseases are typically categorised as either acute or chronic. The term acute refers to “fast” infections, where a relatively rapid immune response removes pathogens after a short period of time (days or weeks). Examples of acute infections include influenza, distemper, rabies, chickenpox, and rubella. In a dairy herd, acute infections include bovine diarrhoea virus, Leptospira harjo and Coxiella burnetii. The spread of many acute infections can be described mathematically as a so-called SIR epidemic (Susceptible-Infected-Recovered), where every infective, after recovery, can never become infective again, being immunised.
Consider a homogeneous closed population of N individuals where each one can be either susceptible, infective, or recovered. Every infective recovers after a random time following the exponential distribution with mean 1/ζ (ζ ∈ R+). Every susceptible, after having contact with infectives, can become infective. Suppose $X_1(t)>0$ and $X_2(t)>0$ are the numbers of susceptibles and infectives correspondingly at time t. Then the transition rate to the state $(X_1(t)-1,X_2(t)+1)$ equals $\beta X_1(t)X_2(t)$, where the coefficient β > 0 takes into account the frequency of meetings between susceptibles and infectives, as well as the chance of catching the disease after any one contact. Very often β is represented as $\beta=\hat\beta\delta/N$, where $\hat\beta$ is the frequency of meeting other individuals for one particular susceptible and δ is the probability of catching the disease during one such contact. (The ratio $X_2(t)/N$ is the chance that the contacted individual is infective.) The transition rate to the state $(X_1(t),X_2(t)-1)$ equals $\zeta X_2(t)$. No other transitions are possible.
We have just described an uncontrolled epidemic. A typical trajectory of such a model is presented in Fig. 1.1. Here N = 500, β = 1/100, ζ = 1; the initial values are $X_1(0)=495$, $X_2(0)=5$. The number of infectives $X_2(t)$ firstly increases and then goes to zero; the number of susceptibles monotonically decreases and reaches zero. This is an example of a huge epidemic outbreak: eventually the whole population recovers, after catching the disease.
Fig. 1.1 Trajectories of an uncontrolled epidemic model
If the control (treatment) is applied at the rate a ≥ 0 then, additionally to the natural recoveries, a infectives recover on average per time unit due to the treatment. The corresponding CTMDP is now described by the following primitives.
• $\mathbf{X}=\{(x_1,x_2)\in\mathbb{N}_0^2,\ x_1+x_2\le N\}$;
• $\mathbf{A}=[0,\bar a]$, where $\bar a$ is the maximal possible treatment rate.
• The non-zero transition rates for $(y_1,y_2)\ne(x_1,x_2)$ are given by
\[
q((y_1,y_2)|(x_1,x_2),a)=
\begin{cases}
\beta x_1x_2, & \text{if } y_1=x_1-1,\ y_2=x_2+1;\\
\zeta x_2+a, & \text{if } y_1=x_1,\ y_2=x_2-1,\ x_2>0.
\end{cases}
\]
• The cost rates can correspond to the rate of the disease spread: $c_0((x_1,x_2),a)=\beta x_1x_2$, and to the treatment intensity: $c_1((x_1,x_2),a)=Ca$.
• The initial distribution of the numbers of susceptibles and infectives at time 0 is γ(·).
In Fig. 1.2, we present a trajectory of the intensively treated epidemic (a ≡ 50) starting from the initial distribution γ((485, 15)) = 1. All the other parameters are the same as before. Although the initial number of infectives is 15, the epidemic dies out: the infectives quickly become extinct at time ≈ 3.5. After that, the number of susceptibles remains at the level ≈ 70 forever. Certainly, it is sensible to apply the maximal possible action if the target is to stop the epidemic, that is, the cost rate is $c_0$. But if one also takes into account the cost of treatment $c_1$, then the optimal control of such an epidemic is far from trivial.
Fig. 1.2 Trajectories of a controlled epidemic model
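Trajectories of this kind can be generated by simulating the jump chain directly. The following sketch is not the authors' code: it assumes a constant treatment rate and a positive recovery rate ζ, and it uses Python's standard `random` module to draw exponential sojourn times and to choose between the two possible transitions.

```python
import random


def simulate_epidemic(x1, x2, beta, zeta, treatment, rng):
    """Simulate the controlled epidemic until no infectives remain.

    x1, x2: initial numbers of susceptibles and infectives; treatment is the
    constant action a >= 0. Returns the path as a list of (time, x1, x2).
    """
    t = 0.0
    path = [(t, x1, x2)]
    while x2 > 0:
        infection = beta * x1 * x2       # rate of (x1, x2) -> (x1 - 1, x2 + 1)
        recovery = zeta * x2 + treatment  # rate of (x1, x2) -> (x1, x2 - 1)
        total = infection + recovery
        t += rng.expovariate(total)       # exponential sojourn time
        if rng.random() * total < infection:
            x1, x2 = x1 - 1, x2 + 1
        else:
            x2 -= 1
        path.append((t, x1, x2))
    return path


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed, only for reproducibility
    path = simulate_epidemic(485, 15, beta=0.01, zeta=1.0, treatment=50.0, rng=rng)
    t_end, s_end, i_end = path[-1]
    print(f"extinct at t={t_end:.2f} with {s_end} susceptibles left")
```

Setting `treatment=0.0` and starting from (495, 5) gives the uncontrolled model described earlier. The loop always terminates because each infection uses up one susceptible and each recovery removes one infective.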
1.2.4 Inventory Control
Consider a store that sells a certain commodity. At any moment the supply is represented by a nonnegative real number. The customers arrive according to a Poisson process with rate λ > 0 and buy a random amount of commodity, depending on the current supply $x\in\mathbb{R}^0_+$. To be more precise, any one customer, independently of other customers, buys Z units having the cumulative distribution function $Q(z|x)$ with support (0, x], if x > 0. Customers who face the supply x = 0 are lost and are not taken into account. At any moment, the manager can order $a\in\mathbb{R}_+$ units, but the replenishment is not instantaneous: paperwork takes exponentially distributed random time with parameter μ > 0, and after that the product arrives immediately. During this lead time, the order size may be changed or the manager can even cancel the order. We accept that a = 0 means no order is placed at the current moment. The holding cost of one unit equals h > 0, and selling one unit results in the profit r > 0. Each demand which cannot be immediately met is lost. The goal is to minimize the total discounted expected cost under a fixed discount factor α > 0. The corresponding CTMDP is now described by the following primitives.
• $\mathbf{X}=\mathbf{A}=\mathbb{R}^0_+$.
• It is convenient to define the transition rate by its values on the intervals [0, y], $y\in\mathbb{R}^0_+$:
\[
\begin{aligned}
q([0,y]\setminus\{x\}|x,a) &= I\{x>0\}\lambda\Big(1-\lim_{z\uparrow(x-y)}Q(z|x)\Big)+I\{a>0\}\mu\,\delta_{x+a}([0,y]);\\
q(\{x\}|x,a) &= -q(\mathbf{X}\setminus\{x\}|x,a).
\end{aligned}
\]
• $c_0(x,a)=hx-I\{x>0\}\lambda r\int_{(0,x]}z\,dQ(z|x)$.
• The initial distribution of the inventory at time 0 is γ(·).
One can also consider the constrained problem (1.16) with, e.g., $c_0(x,a)=hx$ and $c_1(x,a)=-\lambda r\int_{(0,x]}z\,dQ(z|x)$: the target is to minimize the holding cost under the condition that the profit is not below a certain level. In Sect. 7.3.4, we solve a version of this problem assuming that the replenishment is instantaneous, the demand is independent of supply, and the maximal inventory level $\bar x$ is finite.
1.2.5 Selling an Asset
Imagine an “investor” who owns an expensive property or asset which he wants to sell. The offers appear according to a Poisson process with parameter λ > 0 and take values in R+. The next offer value Z is a random variable with the (cumulative) distribution function $Q(z|x)$, where x is the previous offer (or zero if we are talking about the first offer). We assume that the function $Q(\cdot|x)$ is continuous at x, that is, any new offer is different from the previous one almost surely. The investor has to decide immediately whether to accept or reject the offer. If he rejects, the offer is lost and he has the maintenance cost rate of C ≥ 0 per time unit. If he accepts the offer of value $r\in\mathbb{R}_+$, the reward equals r and the process terminates. The goal is to elaborate the optimal selling strategy under the condition that the expected total selling period does not exceed the given duration $d_1\in\mathbb{R}_+$. The corresponding CTMDP is now described by the following primitives.
• $\mathbf{X}=\mathbb{R}^0_+\cup\{(r,\Delta),\ r\in\mathbb{R}_+\}$. State x = 0 appears only at the beginning and means no offers have appeared so far. Other states $x\in\mathbb{R}_+$ represent the last rejected offer. All the states $(r,\Delta)$ are absorbing with no future costs, meaning that the asset is already sold for the price r (the reward equals r).
• $\mathbf{A}=\mathbb{R}^0_+$. Action $a\in\mathbf{A}$ means that the next offer will be accepted if and only if its value is not smaller than a.
• It is convenient to define the transition rate, for $x\in\mathbb{R}^0_+$, by its values on intervals (0, y] and $(0,y]\times\{\Delta\}$, $y\in\mathbb{R}_+$:
\[
\begin{aligned}
q((0,y]\setminus\{x\}|x,a) &= \lambda
\begin{cases}
Q(y|x), & \text{if } y<a;\\
\lim_{z\uparrow a}Q(z|x), & \text{if } y\ge a;
\end{cases}\\
q((0,y]\times\{\Delta\}|x,a) &= \lambda
\begin{cases}
0, & \text{if } y<a;\\
Q(y|x)-\lim_{z\uparrow a}Q(z|x), & \text{if } y\ge a;
\end{cases}\\
q(\{x\}|x,a) &= -q(\mathbf{X}\setminus\{x\}|x,a)=-\lambda.
\end{aligned}
\]
• The reward r is received at the jump moment from $x\in\mathbb{R}^0_+$ to $y=(r,\Delta)$ (see Sect. 1.1.5), so that
\[
C_0(x,y)=
\begin{cases}
-r, & \text{if } x\in\mathbb{R}^0_+ \text{ and } y=(r,\Delta);\\
0 & \text{otherwise,}
\end{cases}
\]
and, for $x\in\mathbb{R}^0_+$,
\[
c_0(x,a)=C+\int_{\mathbf{X}}C_0(x,y)\,\tilde q(dy|x,a)=C-\lambda\int_{[a,\infty)}r\,dQ(r|x);\qquad c_1(x,a)=1.
\]
The first expression here is in accordance with Theorem 1.1.4.
• $d_1\ge\frac{1}{\lambda}$. (Smaller values of the time constraint lead to an empty set of feasible strategies.)
• The initial distribution is γ({0}) = 1.
One can consider problem (1.16) with α = 0.
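To make the cost rate and the time constraint concrete, the sketch below works out a special case that is not in the text: offers are assumed to be exponentially distributed with mean m, independently of the previous offer, and the investor uses a fixed acceptance threshold a. Under these assumptions the integral of r over [a, ∞) with respect to Q equals $(a+m)e^{-a/m}$, accepted offers arrive at rate $\lambda e^{-a/m}$, and the expected selling period is $e^{a/m}/\lambda$. All function names are hypothetical.

```python
import math


def asset_cost_rate(a, lam, maintenance, mean_offer):
    """c_0(x, a) = C - lam * integral of r dQ(r) over [a, inf), for Q = Exp(mean m)."""
    return maintenance - lam * (a + mean_offer) * math.exp(-a / mean_offer)


def expected_selling_time(a, lam, mean_offer):
    """Mean time until the first offer of value >= a (thinned Poisson process)."""
    return math.exp(a / mean_offer) / lam


def max_feasible_threshold(lam, d1, mean_offer):
    """Largest fixed threshold whose expected selling time is at most d1.

    Returns None when d1 < 1 / lam, i.e. when even a = 0 is infeasible.
    """
    if lam * d1 < 1:
        return None
    return mean_offer * math.log(lam * d1)


if __name__ == "__main__":
    lam, maintenance, mean_offer = 2.0, 1.0, 10.0  # illustrative values only
    for d1 in (0.25, 0.5, 1.0):
        print(d1, max_feasible_threshold(lam, d1, mean_offer))
```

With a = 0 every offer is accepted and the expected selling period is 1/λ, which is why the bound on $d_1$ stated in the list above cannot be relaxed.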
Remark 1.2.1 (a) If the offer Z can take values only in a subset $S\subset\mathbb{R}_+$ then one can consider $\mathbf{X}=\{0\}\cup S\cup\{(r,\Delta),\ r\in S\}$ as the state space. Similarly, $\mathbf{A}=S$. (b) After we represent the instant costs as the cost rate $c_0(x,a)$, there is no need to keep all the absorbing states $(r,\Delta)$: we combine all such states into one absorbing state $\Delta$ and put
\[
q(\{\Delta\}|x,a)=\lambda\Big(1-\lim_{z\uparrow a}Q(z|x)\Big)
\]
for $x\in\mathbb{R}^0_+$ (or $x\in\{0\}\cup S$ in view of (a)).
1.2.6 Power-Managed Systems A power-managed system consists of a service requestor, a service provider and a power manager. The service requestor is the source of jobs, which are assumed to arrive in a Poisson counting process with intensity λ > 0. To fix ideas and for brevity, suppose that all the jobs are served in a first-in-first-out manner, and at each moment in time, no more than one job can receive service. The other jobs in the system are building up a queue. In total there can be no more than N jobs in the system including the one under service. New arrivals are rejected when the queue is full. The jobs are served by the service provider, who can be in one of the following three states: sleeping mode, waiting mode and active mode. The service times are assumed to be exponential random variables with the rate depending on the mode the system provider is in. Let μ S > 0, μW > 0 and μ A > 0 be the service rates corresponding to the three modes of the service provider. The power manager controls the mode the service provider is in. We assume that switching from one mode to another takes a random time following an exponential distribution. The parameter of this exponentially distributed random switching time from the sleeping mode to the waiting mode is denoted by ζ SW > 0, and ζ S A > 0, ζ AW > 0, etc. are defined similarly. We assume that the arrival of jobs, service of jobs, switching of modes, and the other random variables such as the service times of the consecutive jobs are all mutually independent. Then, for example, a simultaneous change in the number of jobs in the system and the mode of the service provider is a null event. The power cost c p is incurred depending on the mode the service provider is in. The energy cost ce is incurred depending on the action in use of switching from one mode to another. The holding cost ch is incurred depending on the number of jobs present in the power-managed system. 
The objective of power management is to minimize the power and energy consumption, while maintaining an acceptable service. The above system can be modelled as a CTMDP with the following system parameters.
1 Description of CTMDPs and Preliminaries
• $\mathbf{X} = \{0, 1, \ldots, N\} \times \{S, W, A\}$, where $S$, $W$ and $A$ stand for sleeping, waiting and active mode, respectively. A state is generically denoted as $x = (x_1, x_2)$ with the first coordinate counting the number of jobs in the system, and the second one denoting the mode of the service provider.
• $\mathbf{A} = \{S, W, A\}$, where the action $S$, $W$ or $A$ is interpreted as switching to the mode $S$, $W$ or $A$, respectively. Fictitious situations such as $x_2 = S$ and $a = S$ will not be regarded as a switching.
• The non-zero transition rates for $(y_1, y_2) \ne (x_1, x_2)$ are specified by

$$q((y_1, y_2)|(x_1, x_2), a) = \begin{cases} \lambda, & \text{if } x_1 < N,\ y_1 = x_1 + 1,\ y_2 = x_2; \\ \mu_{x_2}, & \text{if } x_1 > 0,\ y_1 = x_1 - 1,\ y_2 = x_2; \\ \zeta_{x_2 a}, & \text{if } y_1 = x_1,\ x_2 \ne a,\ y_2 = a. \end{cases}$$

The total transition intensities can then be calculated automatically.
• The power cost rate, energy cost rate and the holding cost rate are in the form

$$c_0 : ((x_1, x_2), a) \in \mathbf{X} \times \mathbf{A} \mapsto c_p(x_2) \ge 0,$$
$$c_1 : ((x_1, x_2), a) \in \mathbf{X} \times \mathbf{A} \mapsto c_e(x_2, a) \ge 0,$$
$$c_2 : ((x_1, x_2), a) \in \mathbf{X} \times \mathbf{A} \mapsto c_h(x_1) \ge 0,$$

respectively, where $c_p$, $c_e$ and $c_h$ are fixed functions.
• The initial distribution of the number of jobs in the system and the mode of the service provider at the initial time 0 is given by $\gamma$.

One can consider the constrained optimal control problem (1.16) under a positive discount factor $\alpha > 0$. The above model can be generalized or modified according to needs; the service provider might be in one of a different collection of modes, or the jobs might be served in a discipline other than first-in-first-out, for example. A real-life example of this type of power-managed system could be a computer. Power consumption is a major concern in the design of such devices.
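The transition rates of the power-managed system can be tabulated mechanically from the three cases above. The following is a minimal sketch, not taken from the book: the function names `transition_rates` and `total_intensity` and all numerical parameter values are hypothetical choices made only for illustration.

```python
# Sketch of the non-zero transition rates q(y | x, a), y != x, of the
# power-managed system. All numbers below are illustrative assumptions.

MODES = ("S", "W", "A")  # sleeping, waiting, active


def transition_rates(x, a, N, lam, mu, zeta):
    """Return a dict {y: q(y | x, a)} containing only the non-zero rates."""
    x1, x2 = x
    rates = {}
    if x1 < N:                      # arrival of a job (rejected if x1 == N)
        rates[(x1 + 1, x2)] = lam
    if x1 > 0:                      # service completion in the current mode
        rates[(x1 - 1, x2)] = mu[x2]
    if x2 != a:                     # switching from mode x2 to mode a
        rates[(x1, a)] = zeta[(x2, a)]
    return rates


def total_intensity(x, a, N, lam, mu, zeta):
    """Total transition intensity q_x(a): the sum of the non-zero rates."""
    return sum(transition_rates(x, a, N, lam, mu, zeta).values())


# Hypothetical parameters, chosen only to exercise the functions.
N = 2
lam = 1.0
mu = {"S": 0.1, "W": 0.5, "A": 2.0}
zeta = {(m, k): 3.0 for m in MODES for k in MODES if m != k}

print(transition_rates((1, "W"), "A", N, lam, mu, zeta))
print(total_intensity((1, "W"), "A", N, lam, mu, zeta))
```

Storing only the non-zero off-diagonal rates keeps the sketch close to the case distinction in the text; the diagonal entry of a conservative generator is minus the total intensity.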
1.2.7 Fragmentation Models

Consider a system of particles, which are specified with some characteristics. To be specific, suppose the mass of each of the particles is specified. All the particles behave independently. A particle of mass $s \in X \subset \mathbb{R}_+ = (0, \infty)$ waits for an exponentially distributed random time with the fragmentation rate $r(s, a) > 0$, where $a \in \mathbb{R}_+$ represents the temperature, and then it splits into fragments $z = (z_1, \ldots, z_k) \in \bigcup_{k=2}^{\infty} X^k$ according to the fragmentation kernel $F(dz|s)$, which is a stochastic kernel on $\mathcal{B}(\mathbf{X}) = \mathcal{B}\big(\bigcup_{k=1}^{\infty} X^k\big)$ given $s \in X$. Here, $X$ is a measurable subset of $\mathbb{R}_+$ and is endowed with the usual Euclidean topology, and so the space of clouds of fragments
1.2 Examples
$\mathbf{X}$ is a Borel space. Furthermore, the fragmentation kernel satisfies the conservation property, i.e., for each $s \in X$,

$$F\Big(\Big\{ z = (z_1, \ldots, z_k) \in \bigcup_{k=2}^{\infty} X^k : \sum_{i=1}^{k} z_i = s \Big\} \,\Big|\, s \Big) = 1.$$
The model described above but without control is called a pure fragmentation model. According to the conservation property, when the number of particles in the system reaches "infinity" in finite time, infinitely many particles are of mass below any given level, creating "dust." This phenomenon is called "shattering into dust."

Suppose the temperature of the environment can be controlled. Some cost $c_T(a)$ is incurred depending on the temperature $a$. We formulate a corresponding CTMDP with the following primitives.

• $\mathbf{X} = \bigcup_{k=1}^{\infty} X^k = \{(s_1, \ldots, s_N) : N \ge 1,\ s_i \in X,\ \forall\, i = 1, \ldots, N\}$.
• $\mathbf{A} = [0, \bar{a}]$, which specifies all the possible temperatures.
• The transition rate $q(dy|x, a)$ is defined by

$$q(\Gamma|x, a) = \sum_{i=1}^{N} \int_{\mathbf{X}} I\{w(x, s_i, z) \in \Gamma\}\, r(s_i, a)\, F(dz|s_i) - I\{x \in \Gamma\} \sum_{i=1}^{N} r(s_i, a),$$

for each $x = (s_1, \ldots, s_N)$ and $a \in \mathbf{A}$, where $w(x, s_i, z) = (s_1, \ldots, s_{i-1}, s_{i+1}, \ldots, s_N, z_1, \ldots, z_k)$ for each $z = (z_1, \ldots, z_k) \in \mathbf{X}$. Here $r$ is a $(0, \infty)$-valued measurable function on $X \times \mathbf{A}$.
• $c_0(x, a) \equiv 1$, i.e., one is interested in the total time before shattering into dust, and $c_1(x, a) = c_T(a)$, where $c_T$ is a nonnegative measurable function on $\mathbf{A}$.
• The initial distribution of the number of particles and their masses at time 0 is given by $\gamma$.

One can consider the $\alpha$-discounted problem (1.16). We emphasize that the moment of shattering into dust is given by the explosion moment $T_\infty$ of the CTMDP.
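To make the transition rate concrete, one jump of the controlled fragmentation process can be simulated as follows. This is only a sketch under assumptions that are not in the text: the rate function `example_rate`, the binary splitting rule `binary_split` (a particular kernel $F$ that conserves mass) and the helper name `fragmentation_step` are all hypothetical.

```python
import random

# One jump of the fragmentation model: particle i splits at rate r(s_i, a),
# and the new state is w(x, s_i, z). Rate and kernel are illustrative only.


def example_rate(s, a):
    """Hypothetical fragmentation rate r(s, a) > 0, increasing in temperature."""
    return s * (1.0 + a)


def binary_split(s, rng):
    """Hypothetical kernel F(dz | s): two fragments whose masses sum to s."""
    u = 0.25 + 0.5 * rng.random()   # keeps both fragments strictly positive
    return (s * u, s * (1.0 - u))


def fragmentation_step(x, a, rate, split, rng):
    """Return (holding time, next state) for state x = (s_1, ..., s_N)."""
    rates = [rate(s, a) for s in x]
    total = sum(rates)                      # total intensity q_x(a)
    holding = rng.expovariate(total)        # exponential sojourn time
    i = rng.choices(range(len(x)), weights=rates)[0]
    z = split(x[i], rng)
    # w(x, s_i, z): remove the i-th particle and append its fragments.
    return holding, x[:i] + x[i + 1:] + tuple(z)


rng = random.Random(1)
state = (1.0, 2.0, 0.5)
holding, new_state = fragmentation_step(state, 0.3, example_rate, binary_split, rng)
print(holding, new_state)
```

The competing exponential clocks of the individual particles are merged into a single clock with rate $\sum_i r(s_i, a)$, and the splitting particle is then chosen with probability proportional to its own rate, which matches the sum over $i$ in the formula for $q$.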
1.2.8 Infrastructure Surveillance Models

Consider an infrastructure consisting of $N$ sectors with $N > 0$. Each of the sectors is subject to two types of potential threats, namely, thief invasions (denoted as type 1) and terrorist attacks (denoted as type 2). At any moment in time, a threat level from $(0, \infty)$ is assigned to each sector. Given the current threat level $s \in (0, \infty)$ of a sector, a threat event of type $i = 1, 2$ takes place in the sector after an exponentially distributed random time with the rate $\lambda_i(s) \in (0, \infty)$, and at the moment of occurrence of this event, a cost $\rho_i \in (0, \infty)$ is incurred, and the threat level in that sector will be changed to $m_i s$ with $m_i \in (1, \infty)$, $i = 1, 2$.

The surveillance allows one to obtain the threat level in each of the sectors, and then the manager can dynamically decide to assign security staff to carry out missions in each of the sectors to lower the threat level, or do nothing. If security staff are assigned to a mission in a sector with threat level $s \in (0, \infty)$, then it takes an exponentially distributed time with rate $\mu(s) \in (0, \infty)$ to complete it, and then the new threat level will be $bs$ with $0 < b < 1$. If the manager chooses to do nothing for this sector, then the threat level in the sector will remain the same until the occurrence of the next threat event or completion of a mission. A cost is incurred at the rate $c_S \in (0, \infty)$ while security staff are completing a mission. Suppose the threat events, the sectors and the security staff are all independent. For simplicity, assume that there are always sufficiently many security staff available.

For this model, we formulate a corresponding CTMDP with the following system parameters.

• $\mathbf{X} = (0, \infty)^N$. So $x = (x_1, \ldots, x_N)$ represents the vector of the threat levels in each of the $N$ sectors.
• $\mathbf{A} = \{0, 1\}^N$. So $a = (a_1, \ldots, a_N)$ is the generic notation of an action, where for each $i = 1, 2, \ldots, N$, $a_i = 0$ if and only if the manager chooses to do nothing for the $i$th sector.
• The transition rates for $y \ne x$ are specified by

$$q(\{y\}|x, a) = \begin{cases} \lambda_1(x_i), & \text{if } y = (x_1, \ldots, m_1 x_i, \ldots, x_N); \\ \lambda_2(x_i), & \text{if } y = (x_1, \ldots, m_2 x_i, \ldots, x_N); \\ \mu(x_i), & \text{if } y = (x_1, \ldots, b x_i, \ldots, x_N),\ a_i = 1 \end{cases}$$

for each $i = 1, 2, \ldots, N$. The rates of the other transitions are zero.
• $c_0(x, a) = \sum_{i=1}^{N} a_i c_S$ and $c_1(x, a) = \sum_{i=1}^{N} (\lambda_1(x_i)\rho_1 + \lambda_2(x_i)\rho_2)$ for each $x = (x_1, \ldots, x_N) \in \mathbf{X}$ and $a = (a_1, \ldots, a_N) \in \mathbf{A}$.
• The initial distribution of the vector of the threat levels of each of the sectors at time 0 is given by $\gamma$.

One can consider an $\alpha$-discounted problem (1.16) for the above CTMDP. The previous model can be generalized easily, at the cost of additional notation only; for example, there can be more than two types of threat events, more than one type of mission, the occurrence of a threat event or a mission completion may lead to new threat levels following specified distributions, the threat levels of different sectors may not be independent, and so on.
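The transition and cost rates of the surveillance model translate directly into code. The sketch below is illustrative only: the function names and the specific rate functions and constants (`lam1`, `lam2`, `mission_rate`, `m1`, `m2`, `b`, `rho1`, `rho2`, `c_S`) are hypothetical stand-ins for the primitives $\lambda_1$, $\lambda_2$, $\mu$, $m_1$, $m_2$, $b$, $\rho_1$, $\rho_2$ and $c_S$.

```python
# Sketch of q({y} | x, a) and of the cost rates c_0, c_1 for the
# surveillance model. All functions and numbers are assumptions.


def surveillance_rates(x, a, lam1, lam2, mission_rate, m1, m2, b):
    """Return {y: rate} for all y != x reachable in one jump."""
    rates = {}

    def add(i, new_level, rate):
        y = x[:i] + (new_level,) + x[i + 1:]
        rates[y] = rates.get(y, 0.0) + rate   # accumulate if targets coincide

    for i, level in enumerate(x):
        add(i, m1 * level, lam1(level))       # threat event of type 1
        add(i, m2 * level, lam2(level))       # threat event of type 2
        if a[i] == 1:                         # a mission is in progress
            add(i, b * level, mission_rate(level))
    return rates


def cost_rates(x, a, lam1, lam2, rho1, rho2, c_S):
    """Return (c_0(x, a), c_1(x, a)) as defined in the bullet list above."""
    c0 = sum(a_i * c_S for a_i in a)
    c1 = sum(lam1(s) * rho1 + lam2(s) * rho2 for s in x)
    return c0, c1


# Hypothetical primitives for N = 2 sectors.
lam1 = lambda s: s
lam2 = lambda s: 0.5 * s
mission_rate = lambda s: 2.0
x, a = (1.0, 2.0), (1, 0)

rates = surveillance_rates(x, a, lam1, lam2, mission_rate, m1=2.0, m2=3.0, b=0.5)
print(rates)
print(cost_rates(x, a, lam1, lam2, rho1=10.0, rho2=100.0, c_S=3.0))
```

Note that the cost of a threat event, which is incurred at the moment of the event, enters $c_1$ as the rate $\lambda_i(x_i)\rho_i$, i.e., as the expected cost per unit time.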
1.2.9 Preventive Maintenance

Suppose the condition of a machine used in a manufacturing process deteriorates over time. Let the state $i \in \mathbf{X} = \{0, 1, 2, \ldots, M\}$ represent the condition of the machine: the higher the value of $i$, the worse the condition. The time spent in state $i \in \{0, 1, 2, \ldots, M - 1\}$ is exponential with parameter $\lambda_i$; after that the state becomes $i + 1$. We consider the state $M$ as "pre-broken": after an exponentially distributed random time with parameter $\lambda_M$ the machine completely fails and is immediately replaced by a new one, so that the new state will be 0. Participation in a manufacturing process means that the machine performs jobs of an exponentially distributed duration with parameter $\mu_i$ if the current state is $i \in \{0, 1, 2, \ldots, M\}$. We assume that the machine is never idle, i.e., there are always jobs in the queue. The replacement of a failed machine yields cost $C^f$, and the profit rate is $r_i$ if the machine is in state $i$.

The above description corresponds to the decision "do nothing." At the same time, the manager can decide to replace the machine by a new one if the current state is $i \in \{1, 2, \ldots, M\}$, without waiting for total failure. But that replacement is only possible after the current job is completed. We assume that there is a cost $C^r_i$ incurred whenever a replacement is made. The corresponding CTMDP is now described by the following primitives.

• $\mathbf{X} = \{0, 1, 2, \ldots, M\}$;
• $\mathbf{A} = \{0, 1\}$, where action 0 (1) means "do nothing" ("replace the machine").
• The non-zero transition rates for $j \ne i$ are given by

$$q(j|i, 0) = \begin{cases} \lambda_i, & \text{if } i < M,\ j = i + 1; \\ \lambda_M, & \text{if } i = M,\ j = 0, \end{cases} \qquad q(j|i, 1) = q(j|i, 0) + \mu_i I\{j = 0\}.$$

• $c_0(i, a) = k_0 \lambda_M C^f I\{i = M\} - k_1 r_i + k_2 \mu_i C^r_i I\{a = 1\} I\{i \ne 0\}$, where $k_0$, $k_1$ and $k_2$ are some positive constants.
• The initial distribution may be taken as $\gamma(0) = 1$, that is, the machine is new at the very beginning.
• $\alpha > 0$ is the discount factor.

One can consider the discounted expected cost (1.21), or one can split the cost as

$$c_0(i, a) = \lambda_M C^f I\{i = M\}; \qquad c_1(i, a) = -r_i; \qquad c_2(i, a) = \mu_i C^r_i I\{a = 1\} I\{i \ne 0\},$$

and consider the constrained problem (1.16) with $J = 2$.
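The primitives of the preventive maintenance model are finite and can be tabulated directly. The following sketch uses hypothetical function names (`maintenance_rates`, `cost_rate`) and made-up numerical values; it is meant only to illustrate how the formulas for $q$ and $c_0$ are read.

```python
# Sketch of q(j | i, a), j != i, and of the aggregated cost rate c_0(i, a)
# for the preventive maintenance model. All numbers are assumptions.


def maintenance_rates(i, a, lam, mu):
    """Return {j: q(j | i, a)} for j != i; M is the last index of lam."""
    M = len(lam) - 1
    rates = {}
    if i < M:
        rates[i + 1] = lam[i]        # deterioration to the next condition
    else:
        rates[0] = lam[M]            # failure and immediate replacement
    if a == 1 and i != 0:            # replacement once the current job ends
        rates[0] = rates.get(0, 0.0) + mu[i]
    return rates


def cost_rate(i, a, lam, mu, r, C_f, C_r, k=(1.0, 1.0, 1.0)):
    """c_0(i, a) = k0*lam_M*C_f*I{i=M} - k1*r_i + k2*mu_i*C_r[i]*I{a=1, i!=0}."""
    M = len(lam) - 1
    k0, k1, k2 = k
    failure = k0 * lam[M] * C_f if i == M else 0.0
    replacement = k2 * mu[i] * C_r[i] if (a == 1 and i != 0) else 0.0
    return failure - k1 * r[i] + replacement


# Hypothetical data for M = 2.
lam = [0.2, 0.3, 0.5]
mu = [1.0, 1.0, 1.0]
r = [5.0, 3.0, 1.0]
C_f, C_r = 20.0, [0.0, 4.0, 4.0]

for i in range(3):
    for a in (0, 1):
        print(i, a, maintenance_rates(i, a, lam, mu), cost_rate(i, a, lam, mu, r, C_f, C_r))
```

In state 0 the action "replace" changes nothing, because the term $\mu_i I\{j = 0\}$ only applies to $j \ne i$; the code mirrors this through the condition `i != 0`.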
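1.3 Detailed Occupation Measures and Further Sufficient Classes of Strategies for Total Cost Problems

In this section, unless stated otherwise, the discount factor $\alpha$ is possibly zero-valued.

1.3.1 Definitions and Notations

For a fixed strategy $S \in \mathbf{S}$, we introduce the detailed occupation measures as follows.

Definition 1.3.1 For each $n \ge 1$, the detailed occupation measure of a strategy $S$ is defined for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$, $\Gamma_A \in \mathcal{B}(\mathbf{A})$ by

$$\begin{aligned} m^{S,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A) &:= E^S_\gamma\Big[ \int_{(T_{n-1}, T_n] \cap \mathbb{R}_+} e^{-\alpha t} I\{X_{n-1} \in \Gamma_X\}\, \pi_n(\Gamma_A|H_{n-1}, t - T_{n-1})\, dt \Big] \\ &= E^S_\gamma\Big[ \int_{(0, \Theta_n] \cap \mathbb{R}_+} e^{-\alpha(T_{n-1} + s)} I\{X_{n-1} \in \Gamma_X\}\, \pi_n(\Gamma_A|H_{n-1}, s)\, ds \Big] \\ &= E^S_\gamma\Big[ I\{X_{n-1} \in \Gamma_X\}\, e^{-\alpha T_{n-1}} \int_{(0,\infty)} e^{-\alpha\theta - \int_{(0,\theta]} q_{X_{n-1}}(\pi_n, s)\, ds}\, \pi_n(\Gamma_A|H_{n-1}, \theta)\, d\theta \Big], \end{aligned} \tag{1.31}$$

if $S_n = \pi_n$; and

$$\begin{aligned} m^{S,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A) &:= E^S_\gamma\Big[ I\{X_{n-1} \in \Gamma_X\}\, e^{-\alpha T_{n-1}} \int_{\Gamma_A} \int_{(0,\infty)} e^{-\alpha\theta - q_{X_{n-1}}(a)\theta}\, d\theta\; \Xi_n(da|H_{n-1}) \Big] \\ &= E^S_\gamma\Big[ I\{X_{n-1} \in \Gamma_X\}\, e^{-\alpha T_{n-1}} \int_{\Gamma_A} \frac{1}{\alpha + q_{X_{n-1}}(a)}\, \Xi_n(da|H_{n-1}) \Big], \end{aligned}$$

if $S_n = \Xi_n$. Here, as usual, $0 \times \infty = 0$.

If the kernel $\pi_n(\cdot|h_{n-1}, t)$ does not depend on time $t$, formula (1.31) takes the form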
$$m^{S,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A) := E^S_\gamma\Big[ I\{X_{n-1} \in \Gamma_X\}\, e^{-\alpha T_{n-1}}\, \frac{1}{\alpha + q_{X_{n-1}}(\pi_n)}\, \pi_n(\Gamma_A|H_{n-1}) \Big]. \tag{1.32}$$
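In general, detailed occupation measures are $[0, \infty]$-valued, e.g., if $\alpha = q_{X_{n-1}}(\pi_n) \equiv 0$. We also introduce the following spaces of sequences of detailed occupation measures:

$$\mathcal{D}_{S} = \big\{ m^{S,\alpha}_\gamma = \{m^{S,\alpha}_{\gamma,n}\}_{n=1}^{\infty},\ S \in \mathbf{S} \big\},$$
$$\mathcal{D}_{ReM} = \big\{ m^{\pi^M,\alpha}_\gamma = \{m^{\pi^M,\alpha}_{\gamma,n}\}_{n=1}^{\infty},\ \pi^M \in \mathbf{S}^M_\pi \big\},$$
$$\mathcal{D}_{RaM} = \big\{ m^{\Xi^M,\alpha}_\gamma = \{m^{\Xi^M,\alpha}_{\gamma,n}\}_{n=1}^{\infty},\ \Xi^M \in \mathbf{S}^M_\Xi \big\}.$$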
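For any $S \in \mathbf{S}$, for all cost rates $c_j$, we have, by (1.20), (1.21),

$$W^\alpha_j(S, \gamma) = \sum_{n=1}^{\infty} \int_{\mathbf{X} \times \mathbf{A}} c_j(x, a)\, m^{S,\alpha}_{\gamma,n}(dx \times da), \tag{1.33}$$

so that problem (1.16) can be reformulated as

$$\begin{aligned} &\text{minimize } \sum_{n=1}^{\infty} \int_{\mathbf{X} \times \mathbf{A}} c_0(x, a)\, m_n(dx \times da) \ \text{ over } m = \{m_n\}_{n=1}^{\infty} \in \mathcal{D}_{S} \\ &\text{subject to } \sum_{n=1}^{\infty} \int_{\mathbf{X} \times \mathbf{A}} c_j(x, a)\, m_n(dx \times da) \le d_j, \quad j = 1, 2, \ldots, J. \end{aligned} \tag{1.34}$$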
In this connection, we call two strategies $S^1$ and $S^2$ equivalent if $m^{S^1,\alpha}_\gamma = m^{S^2,\alpha}_\gamma$. In view of (1.34), it is important to study the spaces of detailed occupation measures and to understand which strategies can be equivalent.

Recall from Definition 1.1.1 that by a cemetery we mean an absorbing state $\Delta$ in the state space with the zero cost rates: for all $a \in \mathbf{A}$, $q_\Delta(a) = c_j(\Delta, a) \equiv 0$, $j = 0, 1, \ldots, J$.

Condition 1.3.1 The collection of all cemeteries is a measurable subset of $\mathbf{X}$.

In this case, we regard that collection as a singleton $\{\Delta\}$, for brevity, i.e., we merge together all the cemeteries, with the obvious adjustment of the transition rate $q$.

One can also consider the following situation. Suppose there is a subset $\tilde{\mathbf{X}} \in \mathcal{B}(\mathbf{X})$ such that, for each $x \in \tilde{\mathbf{X}}$, $q_x(a) \equiv 0$ or $\tilde{q}(\tilde{\mathbf{X}}|x, a) \equiv 1$, and $c_j(x, a) \ge 0$ for all $a \in \mathbf{A}$, $j = 0, 1, 2, \ldots, J$, while for a measurable mapping $a^* : \tilde{\mathbf{X}} \to \mathbf{A}$, $c_j(x, a^*(x)) = 0$ for all $x \in \tilde{\mathbf{X}}$, $j = 0, 1, 2, \ldots, J$. Since we are concerned with minimization problems, actions $a^*(x)$ are optimal for $x \in \tilde{\mathbf{X}}$, and we merge together all the states from $\tilde{\mathbf{X}}$ in one cemetery $\Delta$. The transition rate $q$ can be adjusted in the trivial way.

In this framework, if there is a cemetery state $\Delta \in \mathbf{X}$, we introduce the space $\mathbf{X}_\Delta = \mathbf{X} \setminus \{\Delta\}$. Clearly, the values of the detailed occupation measures on $\{\Delta\} \times \mathbf{A}$ play no role when solving the optimal control problems (1.15) and (1.16). Thus, without more explanations, we sometimes consider the measures $m^{S,\alpha}_{\gamma,n}$ on $\mathcal{B}(\mathbf{X}_\Delta \times \mathbf{A})$.

Having in mind the definition of the detailed occupation measures, we can rewrite Definition 1.1.10 of the realizability of a strategy in the following way.

Definition 1.3.2 A control strategy $S$ is called realizable for $h_{n-1} \in \mathbf{H}_{n-1}$ with $x_{n-1} \in \mathbf{X}_\Delta$ if there is a complete probability space $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde{P})$ and a measurable with respect to $(\tilde\omega, s)$ process $A$ on $\tilde\Omega \times \mathbb{R}_+$ with values in $\mathbf{A}$ such that assertion (b) of Definition 1.1.10 holds and, for some (and all) $\alpha \in \mathbb{R}^0_+$, for any conservative and stable transition rate $\hat{q}$, for all $\Gamma_A \in \mathcal{B}(\mathbf{A})$,

$$\tilde{E}\Big[ \int_{(0,\infty)} e^{-\alpha\theta - \int_{(0,\theta]} \hat{q}_{x_{n-1}}(A(\tilde\omega, s))\, ds}\, I\{A(\tilde\omega, \theta) \in \Gamma_A\}\, d\theta \Big] = \begin{cases} \displaystyle\int_{(0,\infty)} e^{-\alpha\theta - \int_{(0,\theta]} \hat{q}_{x_{n-1}}(\pi_n, s)\, ds}\, \pi_n(\Gamma_A|h_{n-1}, \theta)\, d\theta, & \text{if } S_n = \pi_n; \\[2ex] \displaystyle\int_{\Gamma_A} \int_{(0,\infty)} e^{-\alpha\theta - \hat{q}_{x_{n-1}}(a)\theta}\, d\theta\; \Xi_n(da|h_{n-1}), & \text{if } S_n = \Xi_n. \end{cases}$$
As before, $\tilde{E}$ is the mathematical expectation with respect to the measure $\tilde{P}$.

The meaning of Definition 1.3.2 is that, for a fixed $h_{n-1} \in \mathbf{H}_{n-1}$ with $x_{n-1} \in \mathbf{X}_\Delta$, along with assertion (b) of Definition 1.1.10, the action process $A(s)$ results in the same expected discounted cost over the sojourn time $\Theta_n$ as that coming from $S_n$, for all cost rates $c$ and all conservative and stable transition rates $\hat{q}$.

Lemma 1.3.1 A control strategy $S$ is realizable for $h_{n-1} \in \mathbf{H}_{n-1}$ with $x_{n-1} \in \mathbf{X}_\Delta$ in the sense of Definition 1.3.2 if and only if it is realizable in the sense of Definition 1.1.10.

Proof (i) Suppose all the assertions of Definition 1.3.2 are valid, and let us verify Item (a) of Definition 1.1.10. Let $S_n = \pi_n$. Then, by taking all different values $\hat{q}_x(a) \equiv q > 0$, we see that, for each fixed $\Gamma_A \in \mathcal{B}(\mathbf{A})$, the equality

$$\int_{(0,\infty)} e^{-\beta s}\, \tilde{P}(A(s) \in \Gamma_A)\, ds = \int_{(0,\infty)} e^{-\beta s}\, \pi_n(\Gamma_A|h_{n-1}, s)\, ds$$

holds for all $\beta = q + \alpha > \alpha$, meaning that, for the function $d(t) := \tilde{P}(A(t) \in \Gamma_A) - \pi_n(\Gamma_A|h_{n-1}, t)$ on the domain $\mathbb{R}_+$,

$$\int_{(0,\infty)} e^{-\beta s} d(s)\, ds = \Big[ e^{-\beta s} \int_{(0,s]} d(u)\, du \Big]_0^\infty + \beta \int_{(0,\infty)} e^{-\beta s} D(s)\, ds = 0$$

for all $\beta > \alpha$, where the function $D(t) := \int_{(0,t]} d(s)\, ds$ is continuous on $\mathbb{R}_+$. Therefore,

$$\int_{(0,\infty)} e^{-\beta s} D(s)\, ds = \hat{D}(\beta) = 0 \quad \text{for all } \beta > \alpha.$$

For the Laplace transform $\hat{D}$ one can apply the Post–Widder inversion formula (see Proposition B.1.45) to obtain

$$D(t) = \lim_{n \to \infty} \frac{(-1)^n}{n!} \Big(\frac{n}{t}\Big)^{n+1} \frac{d^n \hat{D}}{d\beta^n}\Big(\frac{n}{t}\Big) = 0$$

for all $t > 0$, so that $d(s) = 0$ for almost all $s \in \mathbb{R}_+$, and assertion (a) of Definition 1.1.10 follows. The reasoning for $S_n = \Xi_n$ is identical.

(ii) Suppose all the assertions of Definition 1.1.10 hold, and let us show that Definition 1.3.2 holds as well. If $S_n = \pi_n$ then, according to Theorem 1.1.2(c), $\pi_n(\cdot|h_{n-1}, s) = \delta_{\varphi(s)}(\cdot)$ for almost all $s \in \mathbb{R}_+$, so that one can take the trivial probability space $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde{P})$ with $\tilde\Omega = \{\tilde\omega\}$ being a singleton and put $A(\tilde\omega, s) = \varphi(s)$. After that Definition 1.3.2 obviously holds. If $S_n = \Xi_n$ then, similarly to the discussion at the end of Sect. 1.1.4, one can take $(\tilde\Omega, \tilde{\mathcal{F}}, \tilde{P})$ as the completed probability space $(\mathbf{A}, \mathcal{B}(\mathbf{A}), \Xi_n(\cdot|h_{n-1}))$ and put $A(\tilde\omega, s) = \tilde\omega$. After that, Definition 1.3.2 obviously holds.

It follows from Lemma 1.3.1 that, if Definition 1.3.2 holds for some $\alpha \in \mathbb{R}^0_+$, then it also holds for any $\alpha \in \mathbb{R}^0_+$.
1.3.2 Sufficiency of Markov π-Strategies

Here we show that each control strategy is equivalent to a corresponding Markov π-strategy. This means that the value of the problem (1.16) is achieved within the class $\mathbf{S}^M_\pi$.

Theorem 1.3.1 For each fixed $\alpha \ge 0$, it holds that $\mathcal{D}_{S} = \mathcal{D}_{ReM}$.

Proof For an arbitrarily fixed π-ξ-strategy $S$, we introduce the relaxed Markov strategy $\pi^M$ in the following way. For each fixed $n = 1, 2, \ldots$, we define the kernel $\hat\pi_n$ on $\mathcal{B}(\mathbf{A})$ given $(s, t, x) \in \mathbb{R}_+ \times \bar{\mathbb{R}}_+ \times \mathbf{X}_\infty$: if $S_n = \Xi_n$, then, for each $\Gamma_A \in \mathcal{B}(\mathbf{A})$, $\hat\pi_n(\Gamma_A|s, t, x)$ is such that, for any $s \in \mathbb{R}_+$,

$$E^S_\gamma\Big[ \int_{\Gamma_A} e^{-q_{X_{n-1}}(a)s - \alpha s}\, \Xi_n(da|H_{n-1}) \,\Big|\, T_{n-1}, X_{n-1} \Big] =: \hat\pi_n(\Gamma_A|s, T_{n-1}, X_{n-1});$$

if $S_n = \pi_n$, then, for each $\Gamma_A \in \mathcal{B}(\mathbf{A})$, $\hat\pi_n(\Gamma_A|s, t, x)$ is such that, for any $s \in \mathbb{R}_+$,

$$E^S_\gamma\Big[ \pi_n(\Gamma_A|H_{n-1}, s)\, e^{-\int_{(0,s]} q_{X_{n-1}}(\pi_n, u)\, du - \alpha s} \,\Big|\, T_{n-1}, X_{n-1} \Big] =: \hat\pi_n(\Gamma_A|s, T_{n-1}, X_{n-1}).$$

(To be more rigorous, depending on whether $S_n = \Xi_n$ or $S_n = \pi_n$, $\hat\pi_n(\Gamma_A|s, t, x)$, for fixed $\Gamma_A$ and $s$, is the regular conditional expectation of

$$\int_{\Gamma_A} e^{-q_{X_{n-1}}(a)s - \alpha s}\, \Xi_n(da|H_{n-1}) \quad \text{or} \quad \pi_n(\Gamma_A|H_{n-1}, s)\, e^{-\int_{(0,s]} q_{X_{n-1}}(\pi_n, u)\, du - \alpha s}$$

with respect to $(T_{n-1}, X_{n-1})$; for each fixed $\Gamma_A$, the measurability of $\hat\pi_n(\Gamma_A|s, t, x)$ in $(s, t, x)$ is guaranteed by the existence of a regular measure, see Proposition B.1.41 and the text after it.)

Clearly, the measure $M_n$ on $\mathbf{A} \times \mathbf{X}_\infty \times \mathbb{R}_+$ defined for each $\Gamma_A \in \mathcal{B}(\mathbf{A})$, $\Gamma_X \in \mathcal{B}(\mathbf{X}_\infty)$ and $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$ by

$$M_n(\Gamma_A \times \Gamma_X \times \Gamma_R) := \int_{\Gamma_R} E^S_\gamma\Big[ \hat\pi_n(\Gamma_A|s, T_{n-1}, X_{n-1})\, e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, ds$$

is σ-finite. An application of Proposition B.1.33 gives the existence of a stochastic kernel $\pi^M_n(da|x, s)$ such that

$$M_n(da \times dx \times ds) = \pi^M_n(da|x, s)\, M_n(\mathbf{A} \times dx \times ds). \tag{1.35}$$
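Clearly, $\{\pi^M_n\}_{n=1}^{\infty}$ defines a Markov π-strategy $\pi^M$. Since the measure $M_n(\mathbf{A} \times dx \times ds)$ on $\mathbf{X}_\infty \times \mathbb{R}_+$ is absolutely continuous with respect to the σ-finite measure $m_n$ on $\mathbf{X}_\infty \times \mathbb{R}_+$ defined by

$$m_n(\Gamma_X \times \Gamma_R) := \int_{\Gamma_R} E^S_\gamma\big[ e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \big]\, ds, \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}_\infty),\ \Gamma_R \in \mathcal{B}(\mathbb{R}_+),$$

we may consider $F_n(\mathbf{A}, x, s)$ as a (specific version of the) Radon–Nikodym derivative of $M_n(\mathbf{A} \times dx \times ds)$ with respect to $m_n(dx \times ds)$. We can assume without loss of generality that

$$F_n(\mathbf{A}, x, s) \ge e^{-\bar{q}_x s - \alpha s} > 0 \tag{1.36}$$

because $\hat\pi_n(\mathbf{A}|s, T_{n-1}, X_{n-1}) \ge e^{-\bar{q}_{X_{n-1}} s - \alpha s} > 0$ $P^S_\gamma$-a.s. Now we may continue from (1.35):

$$M_n(da \times dx \times ds) = \pi^M_n(da|x, s)\, M_n(\mathbf{A} \times dx \times ds) = \pi^M_n(da|x, s)\, F_n(\mathbf{A}, x, s)\, m_n(dx \times ds),$$

from which we see that for each $\Gamma_A \in \mathcal{B}(\mathbf{A})$,

$$\pi^M_n(\Gamma_A|x, s)\, F_n(\mathbf{A}, x, s) =: F_n(\Gamma_A, x, s) \tag{1.37}$$

is a Radon–Nikodym derivative of $M_n(\Gamma_A \times dx \times ds)$ with respect to $m_n(dx \times ds)$.

Until the end of this proof, we apply the Fubini–Tonelli Theorem without explicit reference. We firstly verify that

$$F_n(\mathbf{A}, x, t) = e^{-\int_{(0,t]} q_x(\pi^M_n, u)\, du - \alpha t} \tag{1.38}$$

$m_n$-almost everywhere, as follows. Let us introduce the notation

$$\hat{m}_n(\Gamma_X) := E^S_\gamma\big[ e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \big], \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}_\infty),$$

so that $m_n(dx \times ds) = \hat{m}_n(dx)\, ds$. If $S_n = \Xi_n$, we have

$$\begin{aligned} \int_{\Gamma_X \times \Gamma_R} F_n(\mathbf{A}, x, t)\, m_n(dx \times dt) &= M_n(\mathbf{A} \times \Gamma_X \times \Gamma_R) \\ &= \int_{\Gamma_R} E^S_\gamma\Big[ \int_{\mathbf{A}} e^{-q_{X_{n-1}}(a)t - \alpha t}\, \Xi_n(da|H_{n-1})\, e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, dt \\ &= \int_{\Gamma_R} E^S_\gamma\Big[ \Big( 1 - \int_{(0,t]} \int_{\mathbf{A}} (q_{X_{n-1}}(a) + \alpha)\, e^{-q_{X_{n-1}}(a)s - \alpha s}\, \Xi_n(da|H_{n-1})\, ds \Big) e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, dt \\ &= m_n(\Gamma_X \times \Gamma_R) - \int_{\Gamma_R} \int_{(0,t]} E^S_\gamma\Big[ \Big( \int_{\mathbf{A}} q_{X_{n-1}}(a)\, \hat\pi_n(da|s, T_{n-1}, X_{n-1}) + \alpha \hat\pi_n(\mathbf{A}|s, T_{n-1}, X_{n-1}) \Big) e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, ds\, dt, \\ &\qquad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}_\infty),\ \Gamma_R \in \mathcal{B}(\mathbb{R}_+). \end{aligned}$$

Similarly, if $S_n = \pi_n$, then,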
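$$\begin{aligned} \int_{\Gamma_X \times \Gamma_R} F_n(\mathbf{A}, x, t)\, m_n(dx \times dt) &= M_n(\mathbf{A} \times \Gamma_X \times \Gamma_R) \\ &= \int_{\Gamma_R} E^S_\gamma\Big[ e^{-\int_{(0,t]} q_{X_{n-1}}(\pi_n, u)\, du - \alpha t}\, e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, dt \\ &= \int_{\Gamma_R} E^S_\gamma\Big[ \Big( 1 - \int_{(0,t]} (q_{X_{n-1}}(\pi_n, s) + \alpha)\, e^{-\int_{(0,s]} q_{X_{n-1}}(\pi_n, u)\, du - \alpha s}\, ds \Big) e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, dt \\ &= m_n(\Gamma_X \times \Gamma_R) - \int_{\Gamma_R} \int_{(0,t]} E^S_\gamma\Big[ \Big( \int_{\mathbf{A}} q_{X_{n-1}}(a)\, \hat\pi_n(da|s, T_{n-1}, X_{n-1}) + \alpha \hat\pi_n(\mathbf{A}|s, T_{n-1}, X_{n-1}) \Big) e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, ds\, dt, \\ &\qquad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}_\infty),\ \Gamma_R \in \mathcal{B}(\mathbb{R}_+). \end{aligned}$$

Now for both cases of $S_n = \pi_n$ and $S_n = \Xi_n$ and for all $\Gamma_X \in \mathcal{B}(\mathbf{X}_\infty)$, $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$, we have

$$\begin{aligned} &\int_{\Gamma_R} \int_{(0,t]} E^S_\gamma\Big[ \Big( \int_{\mathbf{A}} q_{X_{n-1}}(a)\, \hat\pi_n(da|s, T_{n-1}, X_{n-1}) + \alpha \hat\pi_n(\mathbf{A}|s, T_{n-1}, X_{n-1}) \Big) e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \Big]\, ds\, dt \\ &\quad = \int_{\Gamma_R} \int_{\mathbf{A} \times \Gamma_X \times (0,t]} (q_x(a) + \alpha)\, M_n(da \times dx \times ds)\, dt \\ &\quad = \int_{\Gamma_R} \int_{\Gamma_X} \int_{(0,t]} \Big( \int_{\mathbf{A}} q_x(a)\, F_n(da, x, s) + \alpha F_n(\mathbf{A}, x, s) \Big)\, ds\, \hat{m}_n(dx)\, dt \\ &\quad = \int_{\Gamma_X \times \Gamma_R} \int_{(0,t]} \Big( \int_{\mathbf{A}} q_x(a)\, F_n(da, x, s) + \alpha F_n(\mathbf{A}, x, s) \Big)\, ds\; m_n(dx \times dt). \end{aligned}$$

Thus, for both cases of $S_n = \pi_n$ and $S_n = \Xi_n$, and for all $\Gamma_X \in \mathcal{B}(\mathbf{X}_\infty)$, $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$,

$$\int_{\Gamma_X \times \Gamma_R} F_n(\mathbf{A}, x, t)\, m_n(dx \times dt) = \int_{\Gamma_X \times \Gamma_R} \Big( 1 - \int_{(0,t]} \Big( \int_{\mathbf{A}} q_x(a)\, F_n(da, x, s) + \alpha F_n(\mathbf{A}, x, s) \Big)\, ds \Big)\, m_n(dx \times dt).$$

Hence, $m_n$-almost everywhere,

$$F_n(\mathbf{A}, x, t) = 1 - \int_{(0,t]} \Big( \int_{\mathbf{A}} q_x(a)\, F_n(da, x, s) + \alpha F_n(\mathbf{A}, x, s) \Big)\, ds \tag{1.39}$$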
and, $m_n$-almost everywhere,

$$\frac{dF_n(\mathbf{A}, x, t)}{dt} = -q_x(\pi^M_n, t)\, F_n(\mathbf{A}, x, t) - \alpha F_n(\mathbf{A}, x, t).$$

We have thus proved that, $m_n$-almost everywhere, $\ln F_n(\mathbf{A}, x, t)$ is differentiable with respect to $t$, where its derivative equals $-q_x(\pi^M_n, t) - \alpha$. Note that (1.36) implies that $m_n$-almost everywhere, $\ln F_n(\mathbf{A}, x, t)$ is absolutely continuous in $t$. Therefore,

$$\ln F_n(\mathbf{A}, x, t) = \ln F_n(\mathbf{A}, x, 0) - \int_{(0,t]} q_x(\pi^M_n, s)\, ds - \alpha t,$$

and equality (1.38) is proved because $\lim_{t \to 0+} F_n(\mathbf{A}, x, t) = 1$ $m_n$-almost everywhere by (1.39).

Secondly, we prove by induction that

$$m_n(\Gamma_X \times \Gamma_R) = \int_{\Gamma_R} E^{\pi^M}_\gamma\big[ e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \big]\, dt =: m^M_n(\Gamma_X \times \Gamma_R), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}),\ \Gamma_R \in \mathcal{B}(\mathbb{R}_+), \tag{1.40}$$
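as follows. This equality is obviously valid for $n = 1$. (We always put $T_0 = 0$.) Suppose it holds for some $n \ge 1$. According to the construction of the strategic measure,

$$\begin{aligned} E^{\pi^M}_\gamma\big[ e^{-\alpha T_n} I\{X_n \in \Gamma_X\} \big] &= E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}}\, E^{\pi^M}_\gamma\big[ e^{-\alpha \Theta_n} I\{X_n \in \Gamma_X\} \,\big|\, H_{n-1} \big] \Big] \\ &= E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}} \int_{\mathbb{R}_+} e^{-\alpha t}\, \tilde{q}(\Gamma_X|X_{n-1}, \pi^M_n, t)\, e^{-\int_{(0,t]} q_{X_{n-1}}(\pi^M_n, u)\, du}\, dt \Big]. \end{aligned} \tag{1.41}$$

From (1.38) we have

$$\tilde{q}(\Gamma_X|x, \pi^M_n, t)\, e^{-\int_{(0,t]} q_x(\pi^M_n, u)\, du - \alpha t} = \int_{\mathbf{A}} \tilde{q}(\Gamma_X|x, a)\, F_n(da, x, t), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}),$$

$m_n$-almost everywhere. The induction hypothesis yields

$$\begin{aligned} &\int_{\mathbb{R}_+} E^{\pi^M}_\gamma\Big[ \tilde{q}(\Gamma_X|X_{n-1}, \pi^M_n, t)\, e^{-\int_{(0,t]} q_{X_{n-1}}(\pi^M_n, u)\, du - \alpha t}\, e^{-\alpha T_{n-1}} \Big]\, dt \\ &\quad = \int_{\mathbf{X} \times \mathbb{R}_+} \tilde{q}(\Gamma_X|x, \pi^M_n, t)\, e^{-\int_{(0,t]} q_x(\pi^M_n, u)\, du - \alpha t}\, m^M_n(dx \times dt) \\ &\quad = \int_{\mathbf{X} \times \mathbb{R}_+} \int_{\mathbf{A}} \tilde{q}(\Gamma_X|x, a)\, F_n(da, x, t)\, m_n(dx \times dt) \\ &\quad = \int_{\mathbf{A} \times \mathbf{X} \times \mathbb{R}_+} \tilde{q}(\Gamma_X|x, a)\, M_n(da \times dx \times dt). \end{aligned}$$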
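The last expression equals

$$\int_{\mathbb{R}_+} E^S_\gamma\Big[ \int_{\mathbf{A}} \tilde{q}(\Gamma_X|X_{n-1}, a)\, e^{-q_{X_{n-1}}(a)t - \alpha t}\, \Xi_n(da|H_{n-1})\, e^{-\alpha T_{n-1}} \Big]\, dt,$$

if $S_n = \Xi_n$, and

$$\int_{\mathbb{R}_+} E^S_\gamma\Big[ \int_{\mathbf{A}} \tilde{q}(\Gamma_X|X_{n-1}, a)\, \pi_n(da|H_{n-1}, t)\, e^{-\int_{(0,t]} q_{X_{n-1}}(\pi_n, u)\, du - \alpha t}\, e^{-\alpha T_{n-1}} \Big]\, dt,$$

if $S_n = \pi_n$. In either case, according to (1.41) and similar expressions for the strategy $S$, it follows that

$$E^{\pi^M}_\gamma\big[ e^{-\alpha T_n} I\{X_n \in \Gamma_X\} \big] = E^S_\gamma\big[ e^{-\alpha T_n} I\{X_n \in \Gamma_X\} \big], \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}),$$

and thus (1.40) holds for $n + 1$.

Finally, note that

$$m^{S,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A) = \int_{\Gamma_X \times \mathbb{R}_+} F_n(\Gamma_A, x, s)\, m_n(dx \times ds) = M_n(\Gamma_A \times \Gamma_X \times \mathbb{R}_+).$$

From this, (1.38) and (1.40), we obtain

$$\begin{aligned} m^{\pi^M,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A) &= E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \int_{\mathbb{R}_+} e^{-\int_{(0,s]} q_{X_{n-1}}(\pi^M_n, u)\, du - \alpha s}\, \pi^M_n(\Gamma_A|X_{n-1}, s)\, ds \Big] \\ &= \int_{\Gamma_X \times \mathbb{R}_+} F_n(\mathbf{A}, x, s)\, \pi^M_n(\Gamma_A|x, s)\, m^M_n(dx \times ds) \\ &= \int_{\Gamma_X \times \mathbb{R}_+} F_n(\Gamma_A, x, s)\, m_n(dx \times ds) \\ &= m^{S,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}),\ \Gamma_A \in \mathcal{B}(\mathbf{A}). \end{aligned}$$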
Remark 1.3.1 According to Theorem 1.3.1, Markov π-strategies are sufficient for the problems (1.15) and (1.16). In this connection, let us recall that π-strategies are usually not realizable.
1.3.3 Sufficiency of Markov Standard ξ-Strategies

Here we show that, under appropriate conditions, the value of problem (1.16) is achieved within the class $\mathbf{S}^M_\Xi$. If there are cemeteries, we assume that Condition 1.3.1 is satisfied.

Condition 1.3.2
(a) $q_x(a) + \alpha > 0$ for all $(x, a) \in \mathbf{X}_\Delta \times \mathbf{A}$.
(b) For each $x \in \mathbf{X}_\Delta$, there exists some $\varepsilon = \varepsilon(x) > 0$ such that $\inf_{a \in \mathbf{A}} q_x(a) + \alpha \ge \varepsilon$.

Note that the discounted cost model with $\alpha > 0$ satisfies Condition 1.3.2(b).

Theorem 1.3.2 Assume that Condition 1.3.1 is satisfied. Suppose Condition 1.3.2(a) holds true. Then, for any strategy $S$, there is a Markov standard ξ-strategy $\Xi^M$ such that

$$m^{\Xi^M,\alpha}_{\gamma,n} \ge m^{S,\alpha}_{\gamma,n} \tag{1.42}$$

on $\mathbf{X}_\Delta \times \mathbf{A}$ for all $n = 1, 2, \ldots$. Hence, Markov standard ξ-strategies are sufficient for solving problem (1.16) with nonpositive cost rates $c_j$. If Condition 1.3.2(b) is satisfied, then the inequality in (1.42) can be taken as an equality.

Proof Below in this proof, we use the Fubini–Tonelli Theorem without explicit reference. According to Theorem 1.3.1, without loss of generality, we accept that $S$ is a Markov π-strategy, denoted as $\pi^M$.

For $n \ge 1$ such that $P^{\pi^M}_\gamma(X_{n-1} \in \mathbf{X}_\Delta) > 0$, we put for all $\Gamma_A \in \mathcal{B}(\mathbf{A})$

$$\Xi^M_n(\Gamma_A|x_{n-1}) = \frac{\displaystyle\int_{(0,\infty)} \int_{\Gamma_A} (q_{x_{n-1}}(a) + \alpha)\, \pi^M_n(da|x_{n-1}, \theta)\, e^{-\int_{(0,\theta]} (q_{x_{n-1}}(\pi^M_n, u) + \alpha)\, du}\, d\theta}{1 - e^{-\int_{(0,\infty)} (q_{x_{n-1}}(\pi^M_n, u) + \alpha)\, du}}. \tag{1.43}$$
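Let us introduce the following measures on $\mathbf{X}_\Delta$: for each $\Gamma_X \in \mathcal{B}(\mathbf{X}_\Delta)$,

$$\hat\rho^{\pi}_n(\Gamma_X) = E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}}\, \delta_{X_{n-1}}(\Gamma_X) \Big( 1 - e^{-\int_{(0,\infty)} (q_{X_{n-1}}(\pi^M_n, u) + \alpha)\, du} \Big) \Big]; \qquad \hat\rho^{\Xi}_n(\Gamma_X) = E^{\Xi^M}_\gamma\big[ e^{-\alpha T_{n-1}}\, \delta_{X_{n-1}}(\Gamma_X) \big], \quad n \ge 1,$$

and the following measures on $\mathbf{X}_\Delta \times \mathbf{A}$: for each $\Gamma_X \in \mathcal{B}(\mathbf{X}_\Delta)$, $\Gamma_A \in \mathcal{B}(\mathbf{A})$,

$$\rho^{\pi}_n(\Gamma_X \times \Gamma_A) = \int_{\Gamma_X} \hat\rho^{\pi}_n(dx)\, \Xi^M_n(\Gamma_A|x); \qquad \rho^{\Xi}_n(\Gamma_X \times \Gamma_A) = \int_{\Gamma_X} \hat\rho^{\Xi}_n(dx)\, \Xi^M_n(\Gamma_A|x), \quad n \ge 1.$$

Let us prove by induction that

$$\forall\, n \ge 1, \quad \rho^{\Xi}_n \ge \rho^{\pi}_n. \tag{1.44}$$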
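Inequality $\hat\rho^{\Xi}_1 = \gamma \ge \hat\rho^{\pi}_1$ is obvious. (Recall that $T_0 \equiv 0$.) Assume $\hat\rho^{\Xi}_n \ge \hat\rho^{\pi}_n$ for some $n \ge 1$. Then $\rho^{\Xi}_n \ge \rho^{\pi}_n$ by the definitions, and it remains to show that $\hat\rho^{\Xi}_{n+1} \ge \hat\rho^{\pi}_{n+1}$. Below in this proof, the sets $\Gamma_X \in \mathcal{B}(\mathbf{X}_\Delta)$ and $\Gamma_A \in \mathcal{B}(\mathbf{A})$ are arbitrarily fixed.

According to the definition of the measure $\hat\rho^{\Xi}_{n+1}$,

$$\begin{aligned} \hat\rho^{\Xi}_{n+1}(\Gamma_X) &= E^{\Xi^M}_\gamma\Big[ e^{-\alpha T_{n-1}}\, E^{\Xi^M}_\gamma\big[ e^{-\alpha \Theta_n}\, \delta_{X_n}(\Gamma_X) \,\big|\, H_{n-1} \big] \Big] \\ &= \int_{\mathbf{X}_\Delta} \int_{\mathbf{A}} \int_{(0,\infty)} e^{-\alpha\theta}\, \tilde{q}(\Gamma_X|x_{n-1}, a)\, e^{-q_{x_{n-1}}(a)\theta}\, d\theta\; \Xi^M_n(da|x_{n-1})\, \hat\rho^{\Xi}_n(dx_{n-1}). \end{aligned}$$

Since

$$\int_{(0,\infty)} e^{-\alpha\theta}\, \tilde{q}(\Gamma_X|x_{n-1}, a)\, e^{-q_{x_{n-1}}(a)\theta}\, d\theta = \frac{\tilde{q}(\Gamma_X|x_{n-1}, a)}{q_{x_{n-1}}(a) + \alpha},$$

we obtain, using (1.43) and the inductive supposition:

$$\begin{aligned} \hat\rho^{\Xi}_{n+1}(\Gamma_X) &\ge \int_{\mathbf{X}_\Delta} \int_{\mathbf{A}} \int_{(0,\infty)} e^{-\alpha\theta}\, \tilde{q}(\Gamma_X|x_{n-1}, a)\, e^{-q_{x_{n-1}}(a)\theta}\, d\theta\; \Xi^M_n(da|x_{n-1})\, \hat\rho^{\pi}_n(dx_{n-1}) \\ &= \int_{\mathbf{X}_\Delta} \frac{\displaystyle\int_{(0,\infty)} \int_{\mathbf{A}} \tilde{q}(\Gamma_X|x_{n-1}, a)\, \pi^M_n(da|x_{n-1}, \theta)\, e^{-\int_{(0,\theta]} (q_{x_{n-1}}(\pi^M_n, u) + \alpha)\, du}\, d\theta}{1 - e^{-\int_{(0,\infty)} (q_{x_{n-1}}(\pi^M_n, u) + \alpha)\, du}}\; \hat\rho^{\pi}_n(dx_{n-1}) \\ &= E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}} \int_{(0,\infty)} \int_{\mathbf{A}} \tilde{q}(\Gamma_X|X_{n-1}, a)\, \pi^M_n(da|X_{n-1}, \theta)\, e^{-\int_{(0,\theta]} (q_{X_{n-1}}(\pi^M_n, u) + \alpha)\, du}\, d\theta \Big]. \end{aligned} \tag{1.45}$$

The first equality holds true because the terms $(q_{x_{n-1}}(a) + \alpha)$ cancel when integrating with respect to $\Xi^M_n(da|x_{n-1})$. On the other hand,

$$\begin{aligned} \hat\rho^{\pi}_{n+1}(\Gamma_X) &\le E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}}\, E^{\pi^M}_\gamma\big[ e^{-\alpha \Theta_n} I\{X_n \in \Gamma_X\} \,\big|\, H_{n-1} \big] \Big] \\ &= E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}} \int_{(0,\infty)} e^{-\alpha\theta}\, \tilde{q}(\Gamma_X|X_{n-1}, \pi^M_n, \theta)\, e^{-\int_{(0,\theta]} q_{X_{n-1}}(\pi^M_n, u)\, du}\, d\theta \Big], \end{aligned}$$

leading to the desired inequality $\hat\rho^{\Xi}_{n+1}(\Gamma_X) \ge \hat\rho^{\pi}_{n+1}(\Gamma_X)$.

Finally, the following calculations are straightforward: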
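$$\begin{aligned} m^{\pi^M,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A) &= E^{\pi^M}_\gamma\Big[ e^{-\alpha T_{n-1}} I\{X_{n-1} \in \Gamma_X\} \int_{(0,\infty)} \pi^M_n(\Gamma_A|X_{n-1}, \theta)\, e^{-\int_{(0,\theta]} (q_{X_{n-1}}(\pi^M_n, u) + \alpha)\, du}\, d\theta \Big] \\ &= \int_{\mathbf{X}_\Delta \times \Gamma_A} I\{x \in \Gamma_X\}\, \frac{1}{q_x(a) + \alpha}\, \rho^{\pi}_n(dx \times da) \\ &= \int_{\Gamma_X \times \Gamma_A} \frac{1}{q_x(a) + \alpha}\, \rho^{\pi}_n(dx \times da) \end{aligned} \tag{1.46}$$

(the second equality is similar to (1.45), where $\tilde{q}(\Gamma_X|x_{n-1}, a)$ is replaced by 1) and

$$\int_{\Gamma_X \times \Gamma_A} \frac{1}{q_x(a) + \alpha}\, \rho^{\Xi}_n(dx \times da) = m^{\Xi^M,\alpha}_{\gamma,n}(\Gamma_X \times \Gamma_A).$$

The required inequality $m^{\Xi^M,\alpha}_{\gamma,n} \ge m^{\pi^M,\alpha}_{\gamma,n}$ follows from (1.44).

If the stronger Condition 1.3.2(b) is satisfied, then all the inequalities in the previous calculations become equalities, because

$$e^{-\int_{(0,\infty)} (q_{x_{n-1}}(\pi^M_n, u) + \alpha)\, du} = 0$$

for all $x_{n-1} \in \mathbf{X}_\Delta$. The proof is complete.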
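Corollary 1.3.1 Suppose Conditions 1.3.1 and 1.3.2 are satisfied and $\alpha = 0$. Then, for the strategies $S = \pi^M$ and $\Xi^M$ as in Theorem 1.3.2, the following equalities hold for all $n = 1, 2, \ldots$, $\Gamma_X \in \mathcal{B}(\mathbf{X}_\Delta)$ and $\Gamma_A \in \mathcal{B}(\mathbf{A})$:

$$\int_{\Gamma_X \times \mathbf{A}} q_x(a)\, m^{S,0}_{\gamma,n}(dx \times da) = P^{\Xi^M}_\gamma(X_{n-1} \in \Gamma_X);$$

$$\int_{\Gamma_X} \Xi^M_n(\Gamma_A|x)\, P^{\Xi^M}_\gamma(X_{n-1} \in dx) = \int_{\Gamma_X \times \Gamma_A} q_x(a)\, m^{S,0}_{\gamma,n}(dx \times da).$$

If $S = \pi^s$ is a stationary π-strategy, then the Markov standard ξ-strategy $\Xi^M$ has the form

$$\Xi^M_n(\Gamma_A|x_{n-1}) = \frac{\int_{\Gamma_A} q_{x_{n-1}}(a)\, \pi^s(da|x_{n-1})}{\int_{\mathbf{A}} q_{x_{n-1}}(a)\, \pi^s(da|x_{n-1})}$$

and therefore, is stationary.

Proof Under the imposed conditions, $e^{-\int_{(0,\infty)} (q_{x_{n-1}}(\pi^M_n, u) + \alpha)\, du} = 0$ and all the inequalities in the proof of Theorem 1.3.2 become equalities. Note also that all the measures $\rho^{\pi}_n = \rho^{\Xi}_n$ and $m^{S,0}_{\gamma,n}$ are finite and

$$\rho^{\pi}_n(dx \times da) = q_x(a)\, m^{S,0}_{\gamma,n}(dx \times da) \tag{1.47}$$

by (1.46). Hence, for all $\Gamma_X \in \mathcal{B}(\mathbf{X}_\Delta)$,

$$P^{\Xi^M}_\gamma(X_{n-1} \in \Gamma_X) = \hat\rho^{\Xi}_n(\Gamma_X) = \hat\rho^{\pi}_n(\Gamma_X) = \rho^{\pi}_n(\Gamma_X \times \mathbf{A}) = \int_{\Gamma_X \times \mathbf{A}} q_x(a)\, m^{S,0}_{\gamma,n}(dx \times da).$$

The second equality also follows from (1.47) because

$$\rho^{\pi}_n(dx \times da) = \hat\rho^{\pi}_n(dx)\, \Xi^M_n(da|x) = \Xi^M_n(da|x)\, P^{\Xi^M}_\gamma(X_{n-1} \in dx).$$

The last statement follows from formula (1.43).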
According to Theorem 1.3.2, if α = 0, under appropriate conditions, Markov standard ξ-strategies are sufficient for the (constrained) problem (1.16). (Recall that this is desirable because standard ξ-strategies are realizable.) This does not hold in general without imposing those conditions.
1.3.4 Counterexamples

In this subsection we discuss the roles played by Condition 1.3.2 in the statement of Theorem 1.3.2 by means of examples.

Example 1.3.1 This simple example shows that in general Markov standard ξ-strategies are not sufficient for solving optimization problems if the performance functional is undiscounted.

Suppose $\mathbf{X} = \{0\}$, $\mathbf{A} = (0, 1]$, $\gamma(0) = 1$, $q(0|0, a) \equiv 0$: the single state 0 is absorbing, but we do not call it a "cemetery" because the cost rates may be nonzero. For an arbitrary standard ξ-strategy $\Xi$ we have $m^{\Xi,0}_{\gamma,1}(\{0\} \times \Gamma_A) = \Xi_1(\Gamma_A|0) \times \infty$, where $0 \times \infty = 0$. If $c_0(0, a) = a$, then

$$W^0_0(\Xi, \gamma) = \int_{(0,1]} a\, m^{\Xi,0}_{\gamma,1}(\{0\} \times da) = \infty.$$

At the same time, e.g., for the deterministic π-strategy $\varphi$ given by $\varphi_1(0, s) = e^{-\kappa s}$ with $\kappa > 0$, we have

$$m^{\varphi,0}_{\gamma,1}(\{0\} \times \Gamma_A) = \int_{(0,\infty)} I\{e^{-\kappa s} \in \Gamma_A\}\, ds,$$

so that, e.g., $m^{\varphi,0}_{\gamma,1}(\{0\} \times [\tfrac{1}{2}, 1]) = -\ln \tfrac{1}{2} / \kappa$ is different from 0 and $\infty$ and, for the same cost rate $c_0(0, a) = a$,

$$W^0_0(\varphi, \gamma) = \int_{(0,\infty)} e^{-\kappa s}\, ds = \frac{1}{\kappa} \implies \inf_{S \in \mathbf{S}} W^0_0(S, \gamma) = 0.$$

In this example, the first statement of Theorem 1.3.2 still holds true. For an arbitrary π-strategy, we introduce the following $\Xi^M$-strategy:

$$\Xi^M_1(\Gamma_A|0) = \sum_{i=1}^{\infty} \Big(\frac{1}{2}\Big)^i \int_{(i-1, i]} \pi_1(\Gamma_A|0, s)\, ds, \quad \Gamma_A \in \mathcal{B}(\mathbf{A}).$$

Now $m^{\pi,0}_{\gamma,1}(\{0\} \times \Gamma_A) > 0$ if and only if $\Xi^M_1(\Gamma_A|0) > 0$, i.e., $m^{\Xi^M,0}_{\gamma,1}(\{0\} \times \Gamma_A) = \infty$, so that $m^{\Xi^M,0}_{\gamma,1} \ge m^{\pi,0}_{\gamma,1}$.
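The two closed-form values in Example 1.3.1 can be checked numerically. The sketch below is only an illustration: the midpoint-rule helper `midpoint`, the value `kappa = 2.0` and the truncation point `horizon` of the infinite integration range are choices made here, not part of the example.

```python
import math

# Numerical check of Example 1.3.1 for the strategy phi_1(0, s) = exp(-kappa*s).
# The quadrature rule, kappa and the truncation horizon are assumptions.


def midpoint(f, lo, hi, steps=200_000):
    """Composite midpoint rule for the integral of f over (lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


kappa = 2.0
horizon = 40.0   # exp(-kappa * horizon) is negligible beyond this point

# W_0^0(phi, gamma): integral of c_0(0, phi_1(0, s)) = exp(-kappa * s).
cost = midpoint(lambda s: math.exp(-kappa * s), 0.0, horizon)

# m({0} x [1/2, 1]): time during which the action lies in [1/2, 1].
mass = midpoint(
    lambda s: 1.0 if 0.5 <= math.exp(-kappa * s) <= 1.0 else 0.0, 0.0, horizon
)

print(cost, 1.0 / kappa)
print(mass, math.log(2.0) / kappa)
```

Increasing `kappa` drives the cost $1/\kappa$ to zero, which is the mechanism behind $\inf_{S} W^0_0(S, \gamma) = 0$, whereas every standard ξ-strategy freezes a positive action forever and accumulates infinite cost.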
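Example 1.3.2 Let us show that the first statement of Theorem 1.3.2 can fail to hold if the transition rate takes zero values.

Suppose $\mathbf{X} = \{1, \Delta\}$, where $\Delta$ is the cemetery with $q_\Delta(a) \equiv 0$. We put $\mathbf{A} = \{1, 2\}$, $q_1(1) = 0$, $q_1(2) = q(\Delta|1, 2) = 1$, $\gamma(1) = 1$. Consider the deterministic π-strategy $\varphi$ given by $\varphi_1(1, s) = I\{s \in (0, 1]\} \cdot 1 + I\{s \in (1, \infty)\} \cdot 2$. Then

$$m^{\varphi,0}_{\gamma,1}(\{1\} \times \{1\}) = 1 \quad \text{and} \quad m^{\varphi,0}_{\gamma,1}(\{1\} \times \{2\}) = \int_{(1,\infty)} e^{-(s-1)}\, ds = 1.$$

On the other hand, for an arbitrary Markov standard ξ-strategy $\Xi^M$ defined by $\Xi^M_1(2|1) = 1 - \Xi^M_1(1|1) = \beta \in [0, 1]$, we have

$$m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{1\}) = \begin{cases} 0, & \text{if } \beta = 1; \\ \infty, & \text{if } \beta < 1, \end{cases} \quad \text{and} \quad m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{2\}) = \beta.$$

Hence there is no $\beta \in [0, 1]$ such that $m^{\Xi^M,0}_{\gamma,1} \ge m^{\varphi,0}_{\gamma,1}$.

It is interesting to see what happens if we slightly increase $q_1(1)$ and take $q_1(1) = q(\Delta|1, 1) = \varepsilon > 0$. Now, for the same deterministic π-strategy $\varphi$, we have

$$m^{\varphi,0}_{\gamma,1}(\{1\} \times \{1\}) = E^{\varphi}_\gamma\Big[ \int_{(0, \Theta_1 \wedge 1]} 1\, dt \Big] = e^{-\varepsilon} + \int_{(0,1]} \varepsilon e^{-\varepsilon u} u\, du = \frac{1}{\varepsilon}(1 - e^{-\varepsilon})$$

and

$$m^{\varphi,0}_{\gamma,1}(\{1\} \times \{2\}) = e^{-\varepsilon}.$$

The Markov standard ξ-strategy $\Xi^M$ is constructed using (1.43):

$$\Xi^M_1(1|1) = 1 - e^{-\varepsilon}; \qquad \Xi^M_1(2|1) = e^{-\varepsilon}.$$

Now

$$m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{1\}) = \frac{1}{\varepsilon}(1 - e^{-\varepsilon}) \quad \text{and} \quad m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{2\}) = e^{-\varepsilon}$$

and $m^{\Xi^M,0}_{\gamma,1} = m^{\varphi,0}_{\gamma,1}$ on $\mathbf{X}_\Delta \times \mathbf{A}$ in accordance with the second part of Theorem 1.3.2. When $\varepsilon$ approaches zero, we have

$$\lim_{\varepsilon \to 0} m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{1\}) = \lim_{\varepsilon \to 0} m^{\varphi,0}_{\gamma,1}(\{1\} \times \{1\}) = 1;$$
$$\lim_{\varepsilon \to 0} m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{2\}) = \lim_{\varepsilon \to 0} m^{\varphi,0}_{\gamma,1}(\{1\} \times \{2\}) = 1,$$

but $m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{1\})$ as a function of $\varepsilon$ is not continuous at zero: $m^{\Xi^M,0}_{\gamma,1}(\{1\} \times \{1\}) = 0$ when $\varepsilon = 0$ and $m^{\varphi,0}_{\gamma,1}(\{1\} \times \{1\}) = 1$ when $\varepsilon = 0$.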
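The construction (1.43) and the discontinuity at $\varepsilon = 0$ in Example 1.3.2 can be reproduced numerically. The following sketch is illustrative only: the helpers `midpoint`, `survival` and `measures`, and the truncation `horizon`, are hypothetical names and choices introduced here. It uses the fact that, for this deterministic $\varphi$ and $\alpha = 0$, the numerator of (1.43) for action $a$ equals $q_1(a)\, m^{\varphi,0}_{\gamma,1}(\{1\} \times \{a\})$ and the denominator equals 1.

```python
import math

# Numerical illustration of Example 1.3.2 with q_1(1) = eps, q_1(2) = 1.
# Helper names, step counts and the truncation horizon are assumptions.


def midpoint(f, lo, hi, steps=20_000):
    """Composite midpoint rule for the integral of f over (lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (k + 0.5) * h) for k in range(steps))


def survival(theta, eps):
    """exp(-int_0^theta q_1(phi_1(1, u)) du): action 1 on (0,1], 2 afterwards."""
    return math.exp(-eps * min(theta, 1.0) - max(theta - 1.0, 0.0))


def measures(eps, horizon=40.0):
    q = {1: eps, 2: 1.0}
    # Detailed occupation measure of phi on {1} x {a}, from (1.31).
    m_phi = {
        1: midpoint(lambda t: survival(t, eps), 0.0, 1.0),
        2: midpoint(lambda t: survival(t, eps), 1.0, horizon),
    }
    # Xi_1^M(a | 1) from (1.43) with alpha = 0 (denominator equals 1 here).
    xi = {a: q[a] * m_phi[a] for a in (1, 2)}
    # Measure of the standard xi-strategy: Xi(a | 1) / q_1(a), with 0 * inf = 0.
    m_xi = {}
    for a in (1, 2):
        if q[a] > 0.0:
            m_xi[a] = xi[a] / q[a]
        else:
            m_xi[a] = 0.0 if xi[a] == 0.0 else math.inf
    return m_phi, xi, m_xi


for eps in (0.5, 0.01, 0.0):
    print(eps, measures(eps))
```

For every $\varepsilon > 0$ the two measures coincide, while at $\varepsilon = 0$ the ξ-strategy obtained from (1.43) puts no weight on action 1, so its measure of $\{1\} \times \{1\}$ drops to 0 although that of $\varphi$ equals 1.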
Example 1.3.3 Now we show that even if qx (a) is strictly positive, but not separated from zero, the space D Ra M may be a proper subset of D ReM , and standard randomized strategies may not be sufficient for solving optimization problems with the undiscounted performance functional. Let X = {0, 00, 1}, A = (0, 1], γ(1) = 1, q1 (a) = q(0|1, a) = a, q(00|0, a) = q(0|00, a) = 1, and suppose all other transition rates are zero. Thus, after leaving the state 1, the process fluctuates between 0 and 00. We put c0 (0, a) = c0 (00, a) ≡ 0, c0 (1, a) = a and consider the problem W00 (S, γ) → inf S∈S . We don’t merge the states 0 and 00 to a single cemetery because other cost rates may be non-zero there. Clearly, for any standard ξ-strategy in this model, m ,0 γ,1 ({1} × A ) = Eγ
A
(0,∞)
e−as ds 1 (da|X 0 ) =
A
1 1 (da|1). a
Therefore, W00 (, γ)
= A
c0 (1, a)m ,0 γ,1 ({1} × da) = 1 (A|1) = 1.
Similarly, for an arbitrary stationary π-strategy $\pi^s$, we have
\[
m^{\pi^s,0}_{\gamma,1}(\{1\}\times\Gamma_A) = E^{\pi^s}_{\gamma}\left[\int_{(0,\infty)} e^{-q_1(\pi^s)u}\,\pi^s(\Gamma_A|X_0)\,du\right] = \pi^s(\Gamma_A|1)/q_1(\pi^s),
\]
where $q_1(\pi^s) = \int_{\mathbf{A}} q_1(a)\,\pi^s(da|1) = \int_{\mathbf{A}} a\,\pi^s(da|1)$. Again
\[
W_0^0(\pi^s,\gamma) = \int_{\mathbf{A}} c_0(1,a)\, m^{\pi^s,0}_{\gamma,1}(\{1\}\times da) = \int_{\mathbf{A}} a\,\pi^s(da|1)/q_1(\pi^s) = 1.
\]
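Now consider the deterministic π-strategy ϕ given by $\varphi_1(1,s) = e^{-\kappa s}$, where κ > 0 is an arbitrarily fixed constant. (The kernels $S_n$ for n > 1 can be arbitrary, as can the actions in states 0 and 00.) The sojourn time $\Theta_1$ has the distribution function given by $1 - e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}$ and, for an arbitrarily fixed u ∈ (0, 1], we have
\[
m^{\varphi,0}_{\gamma,1}(\{1\}\times(u,1]) = E^{\varphi}_{\gamma}\left[\int_{(0,\infty)} e^{-\int_{(0,s]} e^{-\kappa v}\,dv}\, I\{e^{-\kappa s}\in(u,1]\}\,ds\right]
= \int_{(0,\infty)} e^{\frac{-1+e^{-\kappa s}}{\kappa}}\, I\{e^{-\kappa s}\in(u,1]\}\,ds
\]
\[
= \int_{(0,1)} e^{\frac{-1+y}{\kappa}}\, I\{y\in(u,1]\}\,\frac{dy}{\kappa y} = \int_{(u,1)} \frac{e^{\frac{-1+y}{\kappa}}}{\kappa y}\,dy.
\]
This detailed occupation measure is different from the measures $m^{\Xi,0}_{\gamma,1}$ and $m^{\pi^s,0}_{\gamma,1}$ presented above because, for the same cost rate $c_0$, we have
\[
W_0^0(\varphi,\gamma) = \int_{\mathbf{A}} c_0(1,a)\, m^{\varphi,0}_{\gamma,1}(\{1\}\times da) = \int_{(0,1]} \frac{1}{\kappa}\, e^{\frac{-1+y}{\kappa}}\,dy = 1 - e^{-\frac{1}{\kappa}}. \tag{1.48}
\]
Therefore, $m^{\varphi,0}_{\gamma} = \{m^{\varphi,0}_{\gamma,n}\}_{n=1}^{\infty} \notin \mathcal{D}^{RaM}$. By the way, $\inf_{S\in\mathbf{S}} W_0^0(S,\gamma) = \inf_{\pi\in\mathbf{S}^M_\pi} W_0^0(\pi,\gamma) = 0$ (see (1.48) with κ → ∞), and obviously $\inf_{\Xi\in\mathbf{S}^M_\Xi} W_0^0(\Xi,\gamma) = 1$.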
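As a numerical sanity check of (1.48), the following sketch integrates the cost directly in the time domain instead of using the substitution $y = e^{-\kappa s}$: at time s the cost rate is $e^{-\kappa s}$, and the process is still in state 1 with probability $\exp(-(1-e^{-\kappa s})/\kappa)$. The function name, the truncation point and the step count are illustrative assumptions, not part of the text.

```python
import math

def cost_time_domain(kappa, steps=200_000):
    """Midpoint-rule approximation of
    W = int_0^inf e^{-kappa*s} * exp(-(1 - e^{-kappa*s})/kappa) ds,
    i.e. cost rate a = e^{-kappa*s} weighted by P(Theta_1 > s).
    The integral is truncated at s = 40/kappa (an assumption; the
    neglected tail is at most e^{-40}/kappa)."""
    horizon = 40.0 / kappa
    h = horizon / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        action = math.exp(-kappa * s)                  # action applied at time s
        survival = math.exp(-(1.0 - action) / kappa)   # P(Theta_1 > s)
        total += action * survival
    return total * h

for kappa in (0.5, 1.0, 10.0, 100.0):
    closed_form = 1.0 - math.exp(-1.0 / kappa)         # right-hand side of (1.48)
    print(kappa, cost_time_domain(kappa), closed_form)
```

The two printed columns should agree closely, and both decrease towards zero as κ grows, in line with the remark that the infimum over Markov π-strategies equals 0.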
1.3.5 The Discounted Cost Model as a Special Case of Undiscounted

Formulae (1.20) with α > 0 can be interpreted in the following two ways.
• The decision-maker takes the cost rates $c_j$ in the future less seriously than currently. More precisely, if the cost rates are expressed in monetary units, then there is continuously compounded interest at rate α > 0, and the decision-maker aims at minimizing the present value of the cost.
• The lifetime $T_{\mathrm{life}}$ of the controlled process is random, independent of anything else, exponentially distributed with parameter α. Then
\[
W_j^{\alpha}(S,\gamma) = E\left[E^S_\gamma\left[\sum_{n=1}^{\infty}\int_{(T_{n-1},T_n]\cap(0,T_{\mathrm{life}}]}\int_{\mathbf{A}} c_j^{+}(X_{n-1},a)\,S_n(da|H_{n-1},t-T_{n-1})\,dt\,\Big|\,T_{\mathrm{life}}\right]\right]
\]
\[
\qquad - E\left[E^S_\gamma\left[\sum_{n=1}^{\infty}\int_{(T_{n-1},T_n]\cap(0,T_{\mathrm{life}}]}\int_{\mathbf{A}} c_j^{-}(X_{n-1},a)\,S_n(da|H_{n-1},t-T_{n-1})\,dt\,\Big|\,T_{\mathrm{life}}\right]\right],
\]
where the mathematical expectation E corresponds to the lifetime $T_{\mathrm{life}}$.

In what follows, see Theorem 1.3.3, we formally justify the second interpretation. One can slightly modify the original model. We extend the state space $\mathbf{X}$ to $\mathbf{X}\cup\{\Delta\}$ with $\Delta\notin\mathbf{X}$. Given the current state $x\in\mathbf{X}$, we set, independently of the action $a\in\mathbf{A}$, the transition rate to the cemetery Δ equal to α. That is, we consider the new conservative transition rate on $\mathbf{X}\cup\{\Delta\}$ defined by
\[
\hat q(\Gamma_X\setminus\{x\}|x,a) = I\{x\ne\Delta\}\,q(\Gamma_X\setminus\{x,\Delta\}|x,a) + \alpha I\{\Delta\in\Gamma_X\setminus\{x\}\}. \tag{1.49}
\]
Note that, for $x\ne\Delta$, $\hat q_x(a) = q_x(a)+\alpha$. Here the cemetery point Δ is absorbing and costless: $\hat q_\Delta(a)\equiv 0$ and $c_j(\Delta,a)\equiv 0$ for all j = 0, 1, . . . , J. The modified model here is signified by the hat “∧” and is called the “hat” model with killing.

Formally speaking, the marked point process in the “hat” model with killing is denoted by $\{(\hat T_n,\hat X_n)\}_{n=0}^{\infty}$. Note, however, that the sample space $\hat\Omega$ and the spaces of n-term histories $\hat{\mathbf{H}}_n$ for the “hat” model remain almost the same as for the original model: the only difference is that, instead of sequences (1.3), we also consider $(x_0,\theta_1,x_1,\ldots,\theta_{m-1},x_{m-1},\theta_m,\Delta,\infty,x_\infty,\ldots)$, i.e., the process can be killed after a finite sojourn time $\theta_m<\infty$. Sequences (1.3) with finite m will have zero probability. Rigorously speaking, all the elements of the “hat” model must be equipped with “∧”: $\hat H_n$, $\hat X_n$, $\hat\Theta_n$ etc. But, in the current subsection, we adopt a simplified notation as follows. As soon as $x_{m-1}\in\mathbf{X}$, we denote by $h_{m-1} = (x_0,\theta_1,x_1,\ldots,\theta_{m-1},x_{m-1})$ the histories in both the original and “hat” models. State Δ is indicated explicitly. The actions in state Δ play no role. Thus, there is a one-to-one correspondence between the control strategies in the original and the modified models, so we keep the common generic notation S for them. For any $S\in\mathbf{S}$, one can construct the strategic measure $\hat P^S_\gamma$ in the modified model in the standard way; see Sect. 1.1.3. The kernels $\hat G_n$ in the modified model
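will be equipped with the hat “∧,” cf. (1.12) and (1.13). $\hat E^S_\gamma$ is the mathematical expectation with respect to $\hat P^S_\gamma$. If $S_n = \Xi_n$, then
\[
\hat G_n(\{\infty\}\times\{x_\infty\}|h_{n-1}) = \delta_{x_{n-1}}(\{x_\infty\}) + \delta_{x_{n-1}}(\{\Delta\});
\]
\[
\hat G_n(\Gamma_R\times\Gamma_X|h_{n-1}) = \delta_{x_{n-1}}(\mathbf{X})\int_{\mathbf{A}}\int_{\Gamma_R}\left[\tilde q(\Gamma_X\setminus\{\Delta\}|x_{n-1},a) + \alpha I\{\Delta\in\Gamma_X\}\right] e^{-q_{x_{n-1}}(a)\theta-\alpha\theta}\,d\theta\;\Xi_n(da|h_{n-1}),
\]
\[
\forall\ \Gamma_R\in\mathcal{B}(\mathbb{R}_+),\ \Gamma_X\in\mathcal{B}(\mathbf{X}\cup\{\Delta\});
\]
\[
\hat G_n(\{\infty\}\times(\mathbf{X}\cup\{\Delta\})|h_{n-1}) = \hat G_n(\mathbb{R}_+\times\{x_\infty\}|h_{n-1}) = 0.
\]
If $S_n = \pi_n$, then
\[
\hat G_n(\{\infty\}\times\{x_\infty\}|h_{n-1}) = \delta_{x_{n-1}}(\{x_\infty\}) + \delta_{x_{n-1}}(\{\Delta\});
\]
\[
\hat G_n(\Gamma_R\times\Gamma_X|h_{n-1}) = \delta_{x_{n-1}}(\mathbf{X})\int_{\Gamma_R}\left[\tilde q(\Gamma_X\setminus\{\Delta\}|x_{n-1},\pi_n,\theta) + \alpha I\{\Delta\in\Gamma_X\}\right] e^{-\int_{(0,\theta]} q_{x_{n-1}}(\pi_n,u)\,du-\alpha\theta}\,d\theta,
\]
\[
\forall\ \Gamma_R\in\mathcal{B}(\mathbb{R}_+),\ \Gamma_X\in\mathcal{B}(\mathbf{X}\cup\{\Delta\});
\]
\[
\hat G_n(\{\infty\}\times(\mathbf{X}\cup\{\Delta\})|h_{n-1}) = \hat G_n(\mathbb{R}_+\times\{x_\infty\}|h_{n-1}) = 0.
\]

Theorem 1.3.3 If $\{\hat m^{S,0}_{\gamma,n}\}_{n=1}^{\infty}$ is the sequence of the detailed (undiscounted) occupation measures in the modified model with killing for a control strategy S, then, for all n = 1, 2, . . ., $\Gamma_X\in\mathcal{B}(\mathbf{X})$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$,
\[
\hat m^{S,0}_{\gamma,n}(\Gamma_X\times\Gamma_A) = m^{S,\alpha}_{\gamma,n}(\Gamma_X\times\Gamma_A).
\]

Before presenting its proof, we mention that intuitively, the above statement is obvious. Indeed, under a fixed control strategy S, on the space $\mathbf{X}$ the dynamics of the original and modified process are the same, but the term $I\{X_{n-1}\in\Gamma_X\}$ is taken into account in the modified model only if the transition to the cemetery Δ due to killing has not happened before $T_{n-1}$ (with probability $e^{-\alpha T_{n-1}}$). Similarly, when integrating with respect to s, one has to add the factor $e^{-\alpha s}$, the probability that the process was not killed on the interval $(T_{n-1},T_{n-1}+s]$. The formal proof is as follows.

Proof of Theorem 1.3.3. For n = 1 the statement is obvious because $\hat q_{X_0}(a) = q_{X_0}(a)+\alpha$ on $\{X_0\in\mathbf{X}\}$. Let n = 2, 3, . . . be fixed and, for an arbitrary bounded measurable function F on $\mathbf{H}_{n-1}$, consider the function $I\{x_{n-1}\in\mathbf{X}\}F(h_{n-1})$ on $\mathbf{H}_{n-1}$. Note that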
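\[
\hat E^S_\gamma\left[I\{X_{n-1}\in\mathbf{X}\}F(H_{n-1})\right]
= \hat E^S_\gamma\left[I\{X_{n-2}\in\mathbf{X}\}\int_{\mathbb{R}_+\times\mathbf{X}} G_{n-1}(d\theta_{n-1}\times dx_{n-1}|H_{n-2})\,F(H_{n-2},\theta_{n-1},x_{n-1})\,e^{-\alpha\theta_{n-1}}\right]
\]
because $\hat G_{n-1}(d\theta\times dx|h_{n-2}) = e^{-\alpha\theta}G_{n-1}(d\theta\times dx|h_{n-2})$ on $\mathbb{R}_+\times\mathbf{X}$. Applying this argument recursively, we finish with
\[
\hat E^S_\gamma\left[I\{X_{n-1}\in\mathbf{X}\}F(H_{n-1})\right]
= \hat E^S_\gamma\Bigg[I\{X_0\in\mathbf{X}\}\int_{\mathbb{R}_+\times\mathbf{X}} G_1(d\theta_1\times dx_1|X_0)\,e^{-\alpha\theta_1}\cdots
\]
\[
\qquad\int_{\mathbb{R}_+\times\mathbf{X}} G_{n-1}(d\theta_{n-1}\times dx_{n-1}|X_0,\theta_1,x_1,\ldots,\theta_{n-2},x_{n-2})\,e^{-\alpha\theta_{n-1}}\,F(X_0,\theta_1,x_1,\ldots,\theta_{n-1},x_{n-1})\Bigg]
\]
\[
= E^S_\gamma\left[e^{-\alpha T_{n-1}}I\{X_{n-1}\in\mathbf{X}\}F(H_{n-1})\right]. \tag{1.50}
\]
Suppose that $S_n = \pi_n$. Then, for each $\Gamma_X\in\mathcal{B}(\mathbf{X})$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$,
\[
\hat m^{S,0}_{\gamma,n}(\Gamma_X\times\Gamma_A) = \hat E^S_\gamma\left[I\{X_{n-1}\in\mathbf{X}\}I\{X_{n-1}\in\Gamma_X\}\int_{(0,\infty)} e^{-\int_{(0,s]} q_{X_{n-1}}(\pi_n,u)\,du-\alpha s}\,\pi_n(\Gamma_A|H_{n-1},s)\,ds\right]
\]
and, using (1.50), we obtain
\[
\hat m^{S,0}_{\gamma,n}(\Gamma_X\times\Gamma_A) = E^S_\gamma\left[e^{-\alpha T_{n-1}}I\{X_{n-1}\in\mathbf{X}\}I\{X_{n-1}\in\Gamma_X\}\int_{(0,\infty)} e^{-\int_{(0,s]} q_{X_{n-1}}(\pi_n,u)\,du-\alpha s}\,\pi_n(\Gamma_A|H_{n-1},s)\,ds\right] = m^{S,\alpha}_{\gamma,n}(\Gamma_X\times\Gamma_A).
\]
Suppose that $S_n = \Xi_n$. Then, for each $\Gamma_X\in\mathcal{B}(\mathbf{X})$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$,
\[
\hat m^{S,0}_{\gamma,n}(\Gamma_X\times\Gamma_A) = \hat E^S_\gamma\left[I\{X_{n-1}\in\mathbf{X}\}I\{X_{n-1}\in\Gamma_X\}\int_{\Gamma_A}\int_{(0,\infty)} e^{-q_{X_{n-1}}(a)s-\alpha s}\,ds\;\Xi_n(da|H_{n-1})\right]
\]
and, using (1.50), we obtain
\[
\hat m^{S,0}_{\gamma,n}(\Gamma_X\times\Gamma_A) = E^S_\gamma\left[e^{-\alpha T_{n-1}}I\{X_{n-1}\in\mathbf{X}\}I\{X_{n-1}\in\Gamma_X\}\int_{\Gamma_A}\int_{(0,\infty)} e^{-q_{X_{n-1}}(a)s-\alpha s}\,ds\;\Xi_n(da|H_{n-1})\right] = m^{S,\alpha}_{\gamma,n}(\Gamma_X\times\Gamma_A).
\]
This completes the proof.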
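The killing interpretation behind Theorem 1.3.3 can be illustrated by a small Monte Carlo sketch for the simplest possible case: a single sojourn with a constant jump rate q (an illustrative assumption, as are the function name, sample size and seed). The undiscounted time spent before either a jump or killing, $\min(\Theta, T_{\mathrm{life}})$, should have the same mean as the discounted sojourn time $\int_0^{\Theta} e^{-\alpha s}\,ds = (1-e^{-\alpha\Theta})/\alpha$, namely $1/(q+\alpha)$.

```python
import math
import random

def killed_vs_discounted(q, alpha, n=200_000, seed=0):
    """Compare, for one sojourn with constant jump rate q,
    (i) the mean undiscounted time in the model with killing at rate alpha and
    (ii) the mean alpha-discounted sojourn time in the original model.
    Both should be close to 1 / (q + alpha)."""
    rng = random.Random(seed)
    killed = 0.0
    discounted = 0.0
    for _ in range(n):
        theta = rng.expovariate(q)        # sojourn time in the original model
        t_life = rng.expovariate(alpha)   # independent exponential lifetime
        killed += min(theta, t_life)      # time accumulated before jump or killing
        discounted += (1.0 - math.exp(-alpha * theta)) / alpha
    return killed / n, discounted / n, 1.0 / (q + alpha)

print(killed_vs_discounted(q=2.0, alpha=0.5))
```

This only covers n = 1 and a constant rate; the general statement is the content of the theorem and is not established by the simulation.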
Remark 1.3.2 (a) According to Theorem 1.3.3, problem (1.16) is equivalent to that for the modified model with killing. The latter satisfies Condition 1.3.2(b) with ε = α. Therefore, due to Theorem 1.3.2, Markov standard ξ-strategies are sufficient for the (constrained) problem (1.16) when α > 0 without any other requirements. Recall that such strategies are realizable.
(b) Recall that here α > 0. For a stationary π-strategy $\pi^s(da|x)$, the stationary standard ξ-strategy
\[
\Xi^s(da|x) := \frac{(q_x(a)+\alpha)\,\pi^s(da|x)}{\int_{\mathbf{A}} (q_x(a)+\alpha)\,\pi^s(da|x)}
\]
is such that, for each cost rate c, the total α-discounted expected costs $W^\alpha(\pi^s,\gamma)$ and $W^\alpha(\Xi^s,\gamma)$, given by (1.20), (1.21), coincide: see formula (1.43). For a (nonstationary) Markov π-strategy, the corresponding formula for the Markov standard ξ-strategy is similar. Conversely, for each stationary standard ξ-strategy $\Xi^s$, the stationary π-strategy defined by
\[
\pi^s(da|x) := \frac{\Xi^s(da|x)}{q_x(a)+\alpha}\Bigg/\int_{\mathbf{A}}\frac{\Xi^s(da|x)}{q_x(a)+\alpha}
\]
is such that, for each cost rate c, $W^\alpha(\pi^s,\gamma) = W^\alpha(\Xi^s,\gamma)$.
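The two conversion formulae in part (b) can be checked on a finite action set. The sketch below is a toy setting of our own: a single state, three actions, and hypothetical helper names. It verifies that the two maps are mutually inverse and that, during one sojourn, the α-discounted expected time spent on each action is the same under $\pi^s$ and under the corresponding $\Xi^s$.

```python
def pi_to_xi(pi, q, alpha):
    """Xi(a) proportional to (q(a) + alpha) * pi(a)."""
    weights = [(qa + alpha) * p for p, qa in zip(pi, q)]
    total = sum(weights)
    return [w / total for w in weights]

def xi_to_pi(xi, q, alpha):
    """pi(a) proportional to Xi(a) / (q(a) + alpha)."""
    weights = [x / (qa + alpha) for x, qa in zip(xi, q)]
    total = sum(weights)
    return [w / total for w in weights]

def occupation_pi(pi, q, alpha):
    """Discounted time on each action during one sojourn under the relaxed
    strategy: pi(a) * int_0^inf exp(-(q(pi) + alpha) s) ds."""
    rate = sum(p * qa for p, qa in zip(pi, q)) + alpha
    return [p / rate for p in pi]

def occupation_xi(xi, q, alpha):
    """Discounted time on each action when the action is drawn once from Xi
    and kept for the whole sojourn: Xi(a) / (q(a) + alpha)."""
    return [x / (qa + alpha) for x, qa in zip(xi, q)]

pi = [0.2, 0.3, 0.5]     # stationary pi-strategy in the given state
q = [1.0, 4.0, 0.5]      # jump rates q_x(a) for the three actions
alpha = 0.1
xi = pi_to_xi(pi, q, alpha)
print(xi, xi_to_pi(xi, q, alpha))
print(occupation_pi(pi, q, alpha), occupation_xi(xi, q, alpha))
```

Equality of these per-sojourn occupation vectors is the mechanism behind the equality of the discounted costs; the full statement of course involves all sojourns and general Borel action spaces.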
1.4 Bibliographical Remarks The formulation of CTMDPs under the various classes of strategies had already appeared in [15, 133]. CTMDP models with deterministic stationary π-strategies were introduced in spscitehoward. Natural Markov strategies were introduced in [169], where the construction of CTMDPs under such strategies was based on [90]: each natural Markov strategy induces a Q-function. In [14, 154, 170], the optimal control problem was formulated in terms of Q-functions. Past-dependent π-strategies were described somewhat informally in [213]. Their mathematically rigorous formulation appeared in [253] for the model with a discrete state space and in [28, 255] for the model with a Borel state space. An elegant but equivalent way of formulating CTMDPs under relaxed π-strategies was used in [149], employing the theory of marked point processes [139]. The class of randomized strategies, with particular examples being the standard ξ-strategies, was considered for CTMDPs in [185]. More often it was considered in controlled semi-Markov processes or, say, semi-Markov decision processes (SMDPs). In fact, restricted to standard ξ-strategies only, a CTMDP model becomes a special SMDP, as under such a strategy, the selection of actions at each decision epoch of this SMDP does not depend on the past actions. The SMDP induced by a CTMDP with standard ξ-strategies is called an “exponential semi-Markov decision process (ESMDP)” in [75, 76], where the relation between CTMDP and ESMDP was exploited. One reason for introducing standard ξ-strategies in CTMDPs is merely for readability: we would prefer to stay in the common CTMDP model and switch between different classes of strategies instead of switching between different models. The more important reason is that standard ξ-strategies are realizable in the sense that they induce measurable action processes. 
It was noticed in [131] and [147], see Example 1.2.5 therein, that relaxed π-strategies induce action processes with rather irregular trajectories, though they can be constructed using the Kolmogorov Consistency Theorem, see Lemma 2.1 of [100]. The author of [76, 77] called for strategies which are not only optimal but also realizable for CTMDPs, and obtained the first results on this matter. In particular, the class of standard ξ-strategies was shown in [76, 77] to be sufficient for discounted CTMDPs. For total undiscounted CTMDPs, this sufficiency result may fail to hold without extra conditions on the model. In [117], it was shown that the class of standard ξ-strategies is sufficient for total undiscounted CTMDP problems if the cost rates are nonnegative, and the model satisfies some compactness-continuity conditions. A more general class of realizable strategies, namely Poisson-related strategies, was introduced in [185]. This class of strategies is sufficient for both total discounted and undiscounted CTMDPs without extra requirements on the CTMDP model. It is not introduced in the present chapter, but will be studied in Chaps. 4 and 6. A mathematical definition of realizability of a strategy in a CTMDP adopted in this book was introduced in [187]. The space of strategic measures of CTMDPs was studied as a special case in [131, 236]. Marginal distributions of CTMDPs for π-strategies are useful in deducing the sufficiency of Markov strategies for CTMDPs with total cost and average cost criteria.
For this purpose, they were considered in [84, 87]. Detailed occupation measures are useful in establishing the sufficiency of Markov π-strategies for CTMDPs with total cost criteria, and were studied in [185]. For the discounted model, detailed occupation measures are equivalent to the occupation measures introduced in [76], which were called “occupancy measures” in [77]. This equivalence serves to show the sufficiency of standard ξ-strategies. The aggregation of detailed occupation measures gives rise to a total occupation measure, which is useful in establishing the sufficiency of stationary π-strategies for total cost CTMDPs. It was studied in [188] for the total discounted model, and in [117] for the more demanding total undiscounted model. There seems to be an inconsistency of terminology in the CTMDP literature. Different terms are used to mean the same object in different papers. Sometimes, this inconsistency is present in different papers by the same author. That is why the terminology used in this book cannot always be consistent with that of the existing literature. For example, relaxed and randomized strategies were both used to mean π-strategies in the older CTMDP literature. The terms ‘policy’ and ‘strategy’ are both present in the literature. In most cases, they are used as synonyms, but some authors use both in the same paper with different meanings to signify different models under consideration, see e.g., [76]. We choose to avoid the use of ‘policy’ throughout this book. For one good reason, ‘strategy’ is the term preferred in the early books [69, 149]. Finally, we do not intend to investigate SMDPs in this book. Textbook treatments of SMDPs can be found in [200, 210, 235], among other references. Nevertheless, we mention that in certain situations, both SMDPs and CTMDPs may be viewed as special models of controlled piecewise-deterministic Markov processes, as a result of how a continuous distribution is characterized by its hazard function. 
The idea is as follows. Suppose the sojourn time in an SMDP has a nondegenerate continuous distribution with density $f_{x,a}$ and cumulative distribution function $F_{x,a}$, given the current state and action (x, a). Then one can consider the so-called hazard function of this distribution defined by $q_{x,s}(a) = \frac{f_{x,a}(s)}{1-F_{x,a}(s)}$ (assuming that the denominator does not vanish) with s being understood as the time elapsed since the beginning of the current sojourn. If the sojourn time has an exponential distribution, as in CTMDPs, then the hazard function is constant in s, and is the familiar jump intensity. Otherwise, one should extend the state space to include the “local” time s. The process with the extended state space $\mathbf{X}\times[0,\infty)$ is a piecewise deterministic Markov process: after each jump, the s-component increases deterministically and linearly from zero as time goes on. The paper [238] seems to be the first to treat SMDPs using this technique. Despite the extra generality, optimal control of piecewise deterministic Markov processes is beyond the scope of this book. Monographs on that topic include [46, 49] and the thesis [236]. Certain techniques presented in this book are also applicable to the optimal control of piecewise deterministic processes, see e.g., [47], but some others are specifically tailored to CTMDPs.

The more specific sources of each section in this chapter are as follows.

Section 1.1. The description of CTMDPs is a special case of the one in [185]. In Sect. 1.1.3, the construction of the strategic measure based on the kernels (1.12) and (1.13) follows from [139], see also [158, Chapter 3,§4]. The random measure ν given by (1.14) was called the “dual predictable projection” in [139]; the shorter term “compensator” was used, e.g., in [158, Chapter 3,§2]. The relevant facts about marked point processes were well summarized in Chap. 4 of [150]. Further details can be found in [30]. When restricted to π-strategies, the marginal distributions in Definition 1.1.5 were considered in [84, 87]. Theorem 1.1.1 in the presented form does not seem to appear in the literature, though its proof is similar to that of Theorem 8.4 in [76]. Subsection 1.1.4 mainly comes from [187]. In the simplest case when $\pi_n(\{1\}|h_{n-1},s) = \pi_n(\{-1\}|h_{n-1},s) = \frac12$, the decision-maker has to apply the actions −1 and +1 with equal probabilities at each moment in time. Based on the Kolmogorov Consistency Theorem, it is possible to construct a complete probability space $(\tilde\Omega,\tilde{\mathcal{F}},\tilde P)$ and a stochastic process A(·), measurable in $\tilde\omega$ for any t ∈ (0, ∞), such that, for any $t\in\mathbb{R}_+$, $\tilde P(A(t)=-1) = \tilde P(A(t)=+1) = \frac12$ and the variables A(t) and A(s) are independent for s ≠ t. However, A(·) is not measurable jointly in $(\tilde\omega,t)$. The proof appeared in [147, Example 1.2.5], and its ideas are reproduced in Appendix A.2, for the reader’s convenience. In Sect. 1.1.5, the formula $c(x,a) = \int_{\mathbf{X}} C(x,y)\,\tilde q(dy|x,a)$, which allows one to take into account the instant costs C(x, y)
at the jump epochs, was known long ago, see e.g., [76, 133, 189]. The recent paper [87] contains an interesting example which demonstrates that this trick may fail to work if the lump sum cost C(x, a, y) also depends on the action. Section 1.2. Controlled queueing systems as described in Sect. 1.2.1 is a fruitful area for applications of CTMDPs. Let us mention only [13, p. 263], [129, 150, 155, 174] and [200, §3.7.1], where similar models were considered, sometimes in the discrete-time framework. The optimality of the Cμ-rule in rather general cases was proved in [174]. A similar problem to the one presented in Sect. 1.2.2 was formulated in [210, p. 166] under the title “The Streetwalker’s Dilemma”; see also [157]. Let us also mention that actually this is a special case of admission control with several job types and no space for waiting. Similar controlled epidemic models to the one in Sect. 1.2.3 were investigated in [1, 42]. Another possibility is to immunise the susceptibles [2]. Similar stochastic inventory models to the one in Sect. 1.2.4 were considered in [228, 248]. Problems regarding selling an asset in a random environment, similar to the one in Sect. 1.2.5, were studied, e.g., in [13, §10.3.1] and [200, §3,4,2] in a discrete-time framework. See also [61, 63], where the “asset” was a house, and the landlord had an opportunity to invite a tenant (if the current offer is far from being acceptable). The model of a computer as a power-managed system, described in Sect. 1.2.6, was investigated in [201], where the authors studied the corresponding optimal control problem as a CTMDP. The pure fragmentation model described in Sect. 1.2.7 was considered in [244], see also [92, 243]. A more general infrastructure surveillance model than the one in Sect. 1.2.8 was studied as a CTMDP in [175]. Several similar reliability models to the one in Sect. 1.2.9 were considered in [200, §4.7.5] and [210, p. 167]. Section 1.3. 
The materials of this section mainly come from [185]. Many results presented here were obtained for discounted models in [76, 77]. Examples similar to Example 1.3.3 appeared in [117, Example 3.1], [185, §7], and [186, §5].
Chapter 2
Selected Properties of Controlled Processes
The main purpose of this chapter is to present some properties of the controlled processes that are closely related to the optimal control problems considered in this book. In Sect. 2.1, we rigorously show that under a natural Markov strategy $\breve\pi$, the controlled process X(·) is a Markov pure jump process, with the explicit construction of its transition function being presented. In Sect. 2.2, we provide conditions for the non-explosiveness of X(·) under a fixed natural Markov strategy, and under all strategies simultaneously. These conditions are illustrated by examples in Sect. 2.3. Section 2.4 introduces Dynkin’s formula, and discusses when it is valid. The non-explosiveness is closely related to the validity of Dynkin’s formula for the class of functions of interest, which in turn is a useful tool to study CTMDPs, as will be seen in Sect. 2.4 and subsequent chapters.

2.1 Transition Functions and the Markov Property

In this section, we shall always consider a fixed natural Markov strategy $\breve\pi$ and the corresponding Q-function defined by (1.11). The main objective of this section is to show formally that the process X(·) defined by (1.7) on the stochastic basis $(\Omega,\mathcal{F},\{\mathcal{F}_t\},P^{\breve\pi}_\gamma)$ is a Markov pure jump process. To this end, we present the explicit construction of its transition probability function. Since the natural Markov strategy and the initial distributions are fixed, for brevity, throughout this section we shall often write P for $P^{\breve\pi}_\gamma$, unless we want to emphasize the underlying initial state, in which case notations like $P_x$ will be in use. The same concerns the mathematical expectations E and $E_x$.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_2
2.1.1 Basic Definitions and Notations

In agreement with the previous notations, before moving on, we recall the following definitions.

Definition 2.1.1 The process X(·) defined by (1.7) is a Markov pure jump process if the following conditions are satisfied.
(a) For each $s,t\in\mathbb{R}^0_+$, $P(X(t+s)\in\Gamma|\mathcal{F}_t) = P(X(t+s)\in\Gamma|X(t))$, $\forall\ \Gamma\in\mathcal{B}(\mathbf{X}_\infty)$.
(b) Each trajectory of X(·) is piecewise constant and right-continuous, such that for each $t\in[0,T_\infty)$, there are finitely many discontinuity points on the interval [0, t].

The equality in the above definition holds almost surely with respect to P. We shall not indicate “a.s.” when it is obvious in what context the equality holds. If $P(X(t+s)\in\Gamma|X(t)) = P(X(s)\in\Gamma|X(0))$ on $\{X(0)=X(t)\}$ for each $s,t\in\mathbb{R}^0_+$, then the Markov process is called homogeneous. Under a fixed natural Markov strategy $\breve\pi$, below we show that the process X(·) is in general a nonhomogeneous Markov pure jump process.

Definition 2.1.2 A sub-stochastic kernel p(s, x, t, dy) on the Borel σ-algebra $\mathcal{B}(\mathbf{X})$ given $x\in\mathbf{X}$ and $s,t\in\mathbb{R}^0_+$, s ≤ t is called a transition function on the Borel space $\mathbf{X}$ if $p(s,x,s,dy) = \delta_x(dy)$ and the Kolmogorov–Chapman equation
\[
\int_{\mathbf{X}} p(s,x,t,dy)\,p(t,y,u,\Gamma) = p(s,x,u,\Gamma),\quad \forall\ \Gamma\in\mathcal{B}(\mathbf{X}) \tag{2.1}
\]
holds for each 0 ≤ s ≤ t ≤ u < ∞. If $p(s,x,t,\mathbf{X}) = 1$ for each 0 ≤ s ≤ t < ∞, then it is called a transition probability function on $\mathbf{X}$.

Definition 2.1.3 Suppose the process X(·) defined by (1.7) is a Markov pure jump process.
(a) A transition function p(s, x, t, dy) on $\mathbf{X}$ is said to correspond to X(·) if, for each s ≤ t, on $\{s<T_\infty\}$, $P(X(t)\in\Gamma|\mathcal{F}_s) = p(s,X(s),t,\Gamma)$, $\forall\ \Gamma\in\mathcal{B}(\mathbf{X})$.
(b) A transition probability function $\tilde p(s,x,t,dy)$ on $\mathbf{X}_\infty$ is said to correspond to X(·) if, for each 0 ≤ s ≤ t < ∞, $P(X(t)\in\Gamma|\mathcal{F}_s) = \tilde p(s,X(s),t,\Gamma)$, $\forall\ \Gamma\in\mathcal{B}(\mathbf{X}_\infty)$.

In Sect. 1.3.5, when transforming the discounted model to the undiscounted one, we described the so-called “hat” model with kernels $\hat G_n$, jump epochs $\hat T_n$, strategic measure $\hat P^S_\gamma$, etc. These constructions will be used in Sect. 2.2, but, with a slight abuse of notation, under a natural Markov strategy $\breve\pi$, the stochastic kernels $\hat G_n$ will be written as $\hat G(dt\times dy|s,x)$, indicating only the arguments which are explicitly depended upon, without indicating $\breve\pi$, as agreed earlier.
2.1.2 Construction of the Transition Function

The transition function will be constructed in an iterative manner as follows. For each $\Gamma\in\mathcal{B}(\mathbf{X})$, $x\in\mathbf{X}$, $s,t\in\mathbb{R}^0_+$ and s ≤ t, we define
\[
p_q^{(0)}(s,x,t,\Gamma) := \delta_x(\Gamma)\,e^{-\int_{(s,t]} q_x(v)\,dv},
\]
\[
p_q^{(n+1)}(s,x,t,\Gamma) := \int_{(s,t]} e^{-\int_{(s,u]} q_x(v)\,dv}\int_{\mathbf{X}} p_q^{(n)}(u,z,t,\Gamma)\,\tilde q(dz|x,u)\,du,\quad \forall\ n=0,1,\ldots. \tag{2.2}
\]
It is clear that the following assertions hold:
(a) for each $s,t\in\mathbb{R}^0_+$, s ≤ t and $x\in\mathbf{X}$, $p_q^{(n)}(s,x,t,dy)$ is a [0, 1]-valued measure on $\mathcal{B}(\mathbf{X})$;
(b) for each $\Gamma\in\mathcal{B}(\mathbf{X})$, $(s,x,t)\to p_q^{(n)}(s,x,t,\Gamma)$ is jointly measurable in $x\in\mathbf{X}$ and $s,t\in\mathbb{R}^0_+$, s ≤ t.
Thus, one can legitimately define the kernel $p_q(s,x,t,dy)$ on $\mathcal{B}(\mathbf{X})$ by
\[
p_q(s,x,t,\Gamma) := \sum_{n=0}^{\infty} p_q^{(n)}(s,x,t,\Gamma) \tag{2.3}
\]
for each $x\in\mathbf{X}$, $s,t\in\mathbb{R}^0_+$, s ≤ t, and $\Gamma\in\mathcal{B}(\mathbf{X})$. The next statement asserts that $p_q$ is the promised transition function.

Theorem 2.1.1 Under a fixed natural Markov strategy $\breve\pi$, the kernel $p_q$ defined by (2.3) is a transition function on $\mathbf{X}$; cf. Definition 2.1.2. For fixed $x\in\mathbf{X}$ and $\Gamma\in\mathcal{B}(\mathbf{X})$, the function $p_q(s,x,t,\Gamma)$ is continuous in s and t.

Proof Firstly, we verify that $p_q(s,x,t,dy)$ is a sub-stochastic kernel. Let $x\in\mathbf{X}$ and 0 ≤ s ≤ t < ∞ be arbitrarily fixed. According to (2.3), it suffices to show by
induction that
\[
\sum_{n=0}^{m} p_q^{(n)}(s,x,t,\mathbf{X}) \le 1 \tag{2.4}
\]
for each m ≥ 0 as follows. Clearly, (2.4) holds when m = 0. Suppose (2.4) holds for some m ≥ 0. Then
\[
\sum_{n=0}^{m+1} p_q^{(n)}(s,x,t,\mathbf{X}) = p_q^{(0)}(s,x,t,\mathbf{X}) + \sum_{n=0}^{m} p_q^{(n+1)}(s,x,t,\mathbf{X})
\]
\[
= p_q^{(0)}(s,x,t,\mathbf{X}) + \sum_{n=0}^{m}\int_{(s,t]} e^{-\int_{(s,u]} q_x(v)\,dv}\int_{\mathbf{X}} p_q^{(n)}(u,y,t,\mathbf{X})\,\tilde q(dy|x,u)\,du
\]
\[
\le p_q^{(0)}(s,x,t,\mathbf{X}) + \int_{(s,t]} e^{-\int_{(s,u]} q_x(v)\,dv}\,q_x(u)\,du = 1,
\]
where the second equality is by (2.2) and the inequality follows from the Fubini Theorem and the inductive supposition. Hence, for each $x\in\mathbf{X}$ and 0 ≤ s ≤ t < ∞, $p_q(s,x,t,\mathbf{X})\le 1$, as required.

Secondly, let us verify the validity of the Kolmogorov–Chapman equation (2.1) for $p_q$. Firstly, we show that, for each n ≥ 0,
\[
p_q^{(n)}(s,x,u,\Gamma) = \sum_{m=0}^{n}\int_{\mathbf{X}} p_q^{(m)}(s,x,t,dy)\,p_q^{(n-m)}(t,y,u,\Gamma),\quad \forall\ 0\le s\le t\le u<\infty,\ x\in\mathbf{X},\ \Gamma\in\mathcal{B}(\mathbf{X}). \tag{2.5}
\]
It is clear that (2.5) holds when n = 0. Suppose (2.5) holds for some n ≥ 0. Then, for each 0 ≤ s ≤ t ≤ u < ∞, $x\in\mathbf{X}$ and $\Gamma\in\mathcal{B}(\mathbf{X})$,
\[
p_q^{(n+1)}(s,x,u,\Gamma) = \int_{(s,u]} e^{-\int_{(s,r]} q_x(v)\,dv}\int_{\mathbf{X}} \tilde q(dy|x,r)\,p_q^{(n)}(r,y,u,\Gamma)\,dr
\]
\[
= \int_{(s,t]} e^{-\int_{(s,r]} q_x(v)\,dv}\int_{\mathbf{X}} \tilde q(dy|x,r)\sum_{m=0}^{n}\int_{\mathbf{X}} p_q^{(m)}(r,y,t,dz)\,p_q^{(n-m)}(t,z,u,\Gamma)\,dr
+ \int_{(t,u]} e^{-\int_{(s,r]} q_x(v)\,dv}\int_{\mathbf{X}} \tilde q(dy|x,r)\,p_q^{(n)}(r,y,u,\Gamma)\,dr
\]
\[
= \sum_{m=0}^{n}\int_{\mathbf{X}} p_q^{(m+1)}(s,x,t,dz)\,p_q^{(n-m)}(t,z,u,\Gamma)
+ e^{-\int_{(s,t]} q_x(v)\,dv}\int_{(t,u]} e^{-\int_{(t,r]} q_x(v)\,dv}\int_{\mathbf{X}} \tilde q(dy|x,r)\,p_q^{(n)}(r,y,u,\Gamma)\,dr.
\]
Applying (2.2) to the last expression, we see
\[
p_q^{(n+1)}(s,x,u,\Gamma) = \sum_{m=0}^{n}\int_{\mathbf{X}} p_q^{(m+1)}(s,x,t,dz)\,p_q^{(n-m)}(t,z,u,\Gamma) + e^{-\int_{(s,t]} q_x(v)\,dv}\,p_q^{(n+1)}(t,x,u,\Gamma)
\]
\[
= \sum_{m=0}^{n}\int_{\mathbf{X}} p_q^{(m+1)}(s,x,t,dz)\,p_q^{(n-m)}(t,z,u,\Gamma) + \int_{\mathbf{X}} p_q^{(0)}(s,x,t,dz)\,p_q^{(n+1)}(t,z,u,\Gamma)
\]
\[
= \sum_{m=0}^{n+1}\int_{\mathbf{X}} p_q^{(m)}(s,x,t,dz)\,p_q^{(n+1-m)}(t,z,u,\Gamma),\quad \forall\ 0\le s\le t\le u<\infty,\ x\in\mathbf{X},\ \Gamma\in\mathcal{B}(\mathbf{X}).
\]
Hence, by induction, (2.5) holds. It remains to take the sum over n = 0, 1, . . . on both sides of (2.5):
\[
p_q(s,x,u,\Gamma) = \sum_{m=0}^{\infty}\sum_{n=m}^{\infty}\int_{\mathbf{X}} p_q^{(m)}(s,x,t,dy)\,p_q^{(n-m)}(t,y,u,\Gamma) = \int_{\mathbf{X}} p_q(s,x,t,dy)\,p_q(t,y,u,\Gamma)
\]
for each 0 ≤ s ≤ t ≤ u < ∞, $x\in\mathbf{X}$ and $\Gamma\in\mathcal{B}(\mathbf{X})$.

For the last statement, note that, for fixed $x\in\mathbf{X}$ and $\Gamma\in\mathcal{B}(\mathbf{X})$ and for arbitrarily fixed $0\le\underline{s}<\bar t<\infty$, $|p_q^{(n)}(s,x,t,\Gamma)|\le\gamma^n$ for all $s,t\in[\underline{s},\bar t]$, where
\[
\gamma := \int_{(\underline{s},\bar t]} q_x(u)\,e^{-\int_{(\underline{s},u]} q_x(v)\,dv}\,du = 1 - e^{-\int_{(\underline{s},\bar t]} q_x(v)\,dv} < 1.
\]
The rigorous proof is by induction w.r.t. n. The function $p_q^{(n)}(s,x,t,\Gamma)$ is obviously continuous in s and t for all n = 1, 2, . . .. Hence the function $p_q(s,x,t,\Gamma)$ is continuous in s and t by Proposition B.1.9. The proof is now complete.

The recursions (2.2) are known as Feller’s backward iteration. There is also Feller’s forward iteration, see (2.6), which can be used to construct the transition function $p_q$. The details are as follows.
Lemma 2.1.1 For each $\Gamma\in\mathcal{B}(\mathbf{X})$, $x\in\mathbf{X}$, $s,t\in\mathbb{R}^0_+$ and s ≤ t,
\[
p_q^{(0)}(s,x,t,\Gamma) = \delta_x(\Gamma)\,e^{-\int_{(s,t]} q_x(v)\,dv},
\]
\[
p_q^{(n+1)}(s,x,t,\Gamma) = \int_{(s,t]}\int_{\mathbf{X}} p_q^{(n)}(s,x,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_z(r)\,dr}\,\tilde q(dz|y,v)\,dv. \tag{2.6}
\]
Proof The first relation is automatic. For the second relation, consider the case of n = 0. Then, by Feller’s backward iteration, see (2.2),
\[
p_q^{(1)}(s,x,t,\Gamma) = \int_{(s,t]} e^{-\int_{(s,u]} q_x(v)\,dv}\int_{\mathbf{X}} \tilde q(dy|x,u)\,\delta_y(\Gamma)\,e^{-\int_{(u,t]} q_y(v)\,dv}\,du
= \int_{(s,t]} e^{-\int_{(s,u]} q_x(v)\,dv}\int_{\Gamma} \tilde q(dy|x,u)\,e^{-\int_{(u,t]} q_y(v)\,dv}\,du.
\]
On the other hand, the right-hand side of the claimed relation for n = 0 reads
\[
\int_{(s,t]}\int_{\mathbf{X}} p_q^{(0)}(s,x,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_z(r)\,dr}\,\tilde q(dz|y,v)\,dv
= \int_{(s,t]}\int_{\mathbf{X}} \delta_x(dy)\,e^{-\int_{(s,u]} q_x(v)\,dv}\int_{\Gamma} e^{-\int_{(u,t]} q_z(v)\,dv}\,\tilde q(dz|y,u)\,du
\]
\[
= \int_{(s,t]} e^{-\int_{(s,u]} q_x(v)\,dv}\int_{\Gamma} e^{-\int_{(u,t]} q_z(v)\,dv}\,\tilde q(dz|x,u)\,du.
\]
Thus, the claimed relation holds when n = 0. Assume that (2.6) holds with $n-1\in\{0,1,2,\ldots\}$ in place of n. Then
\[
p_q^{(n+1)}(s,x,t,\Gamma) = \int_{(s,t]} e^{-\int_{(s,u]} q_x(w)\,dw}\int_{\mathbf{X}} p_q^{(n)}(u,z,t,\Gamma)\,\tilde q(dz|x,u)\,du
\]
\[
= \int_{(s,t]} e^{-\int_{(s,u]} q_x(w)\,dw}\int_{\mathbf{X}}\int_{(u,t]}\int_{\mathbf{X}} p_q^{(n-1)}(u,z,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_r(w)\,dw}\,\tilde q(dr|y,v)\,dv\;\tilde q(dz|x,u)\,du
\]
\[
= \int_{(s,t]}\int_{(s,v]} e^{-\int_{(s,u]} q_x(w)\,dw}\int_{\mathbf{X}}\int_{\mathbf{X}} p_q^{(n-1)}(u,z,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_r(w)\,dw}\,\tilde q(dr|y,v)\;\tilde q(dz|x,u)\,du\,dv
\]
\[
= \int_{(s,t]}\int_{\mathbf{X}}\left(\int_{\Gamma} e^{-\int_{(v,t]} q_r(w)\,dw}\,\tilde q(dr|y,v)\right)\left(\int_{(s,v]} e^{-\int_{(s,u]} q_x(w)\,dw}\int_{\mathbf{X}} p_q^{(n-1)}(u,z,v,dy)\,\tilde q(dz|x,u)\,du\right)dv
\]
\[
= \int_{(s,t]}\int_{\mathbf{X}}\left(\int_{\Gamma} e^{-\int_{(v,t]} q_r(w)\,dw}\,\tilde q(dr|y,v)\right)p_q^{(n)}(s,x,v,dy)\,dv,
\]
where the first equality is by Feller’s backward iteration, see (2.2), the second equality is by the inductive supposition, the third and the fourth equalities are by the Fubini Theorem, and the last equality is by Feller’s backward iteration (2.2). The statement of the lemma follows from this.

Feller’s forward iteration is convenient when we show in the next subsection that $p_q$ is the minimal nonnegative solution (out of all transition functions) to the Kolmogorov forward equation.
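To make the two iterations concrete, here is a rough numerical sketch for a time-homogeneous chain on a finite state space, where $p_q^{(n)}(s,x,t,\cdot)$ depends only on t − s and the integrals in (2.2) and (2.6) become convolutions on a time grid. The generator matrix, the grid, the trapezoidal rule and the function name are all illustrative assumptions; the text itself treats general Borel spaces and time-dependent rates.

```python
import math

def feller_sums(Q, t_max, steps, n_iter):
    """Sum the first n_iter + 1 backward and forward Feller iterates at time
    t_max for a homogeneous generator matrix Q (rows sum to zero)."""
    d = len(Q)
    q = [-Q[x][x] for x in range(d)]                          # total jump rates q_x
    Qt = [[Q[x][y] if x != y else 0.0 for y in range(d)] for x in range(d)]
    h = t_max / steps
    # decay[i][x] = exp(-q_x * t_i): probability of no jump on an interval of length t_i
    decay = [[math.exp(-q[x] * i * h) for x in range(d)] for i in range(steps + 1)]
    p0 = [[[decay[i][x] if x == y else 0.0 for y in range(d)] for x in range(d)]
          for i in range(steps + 1)]

    def matmul(A, B):
        return [[sum(A[x][z] * B[z][y] for z in range(d)) for y in range(d)]
                for x in range(d)]

    def trapezoid(values):
        return h * (sum(values) - 0.5 * (values[0] + values[-1]))

    back, fwd = p0, p0
    back_sum = [[row[:] for row in m] for m in p0]
    fwd_sum = [[row[:] for row in m] for m in p0]
    for _ in range(n_iter):
        M = [matmul(Qt, back[i]) for i in range(steps + 1)]   # jump first, then n more jumps
        N = [matmul(fwd[i], Qt) for i in range(steps + 1)]    # n jumps, then the last jump
        nb = [[[0.0] * d for _ in range(d)] for _ in range(steps + 1)]
        nf = [[[0.0] * d for _ in range(d)] for _ in range(steps + 1)]
        for k in range(1, steps + 1):
            for x in range(d):
                for y in range(d):
                    # backward iteration (2.2): condition on the first jump time t_j
                    nb[k][x][y] = trapezoid(
                        [decay[j][x] * M[k - j][x][y] for j in range(k + 1)])
                    # forward iteration (2.6): condition on the last jump time t_j
                    nf[k][x][y] = trapezoid(
                        [N[j][x][y] * decay[k - j][y] for j in range(k + 1)])
        back, fwd = nb, nf
        for i in range(steps + 1):
            for x in range(d):
                for y in range(d):
                    back_sum[i][x][y] += nb[i][x][y]
                    fwd_sum[i][x][y] += nf[i][x][y]
    return back_sum[-1], fwd_sum[-1]

Q = [[-1.0, 1.0], [2.0, -2.0]]   # two-state example with rates 1 and 2
b, f = feller_sums(Q, t_max=1.0, steps=100, n_iter=15)
print(b)
print(f)
```

For this two-state generator the transition matrix is known in closed form, so both truncated sums can be compared with it; they should agree up to discretization error, and each row should sum to approximately one since a bounded-rate chain is non-explosive.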
2.1.3 The Minimal (Nonnegative) Solution to the Kolmogorov Forward Equation

Definition 2.1.4 A transition function p(s, x, t, dy) is said to be a solution of the Kolmogorov forward equation corresponding to the Q-function q(dy|x, a) if, for each $x\in\mathbf{X}$, $s,t\in\mathbb{R}^0_+$, s ≤ t and $\Gamma\in\mathcal{B}(\mathbf{X})$ satisfying
\[
\sup_{x\in\Gamma,\ s\in\mathbb{R}^0_+} q_x(s) < \infty, \tag{2.7}
\]
the following holds:
\[
p(s,x,t,\Gamma) = \delta_x(\Gamma) + \int_{(s,t]}\int_{\mathbf{X}} \tilde q(\Gamma|y,u)\,p(s,x,u,dy)\,du - \int_{(s,t]}\int_{\Gamma} q_y(u)\,p(s,x,u,dy)\,du
\]
\[
= \delta_x(\Gamma) + \int_{(s,t]}\int_{\mathbf{X}} q(\Gamma|y,u)\,p(s,x,u,dy)\,du. \tag{2.8}
\]
(Note that, provided that the first equality in the above holds, all the expressions on its right-hand side are finite.) It is convenient to introduce the following as a definition.

Definition 2.1.5 Any set $\Gamma\in\mathcal{B}(\mathbf{X})$ satisfying (2.7) is called a q-bounded set.

The whole state space can be represented as the union of a sequence of q-bounded sets, as shown in the next lemma.
Lemma 2.1.2 There exists a monotone nondecreasing sequence of q-bounded sets $\{\hat V_m\}_{m=1}^{\infty}$ such that
\[
\hat V_m \uparrow \mathbf{X}. \tag{2.9}
\]
Proof For each m = 1, 2, . . . , we define the analytic subset
\[
C_m = \left\{x\in\mathbf{X}:\ \sup_{s\in\mathbb{R}^0_+} q_x(s) > m\right\}
\]
of $\mathbf{X}$. Note that $\bigcap_{m=1}^{\infty} C_m = \emptyset$. Therefore, according to the Novikov Separation Theorem, see Proposition B.1.3, there exists a sequence of Borel measurable subsets $\{\hat C_m\}_{m=1}^{\infty}$ of $\mathbf{X}$ such that $\bigcap_{m=1}^{\infty}\hat C_m = \emptyset$, and for each m = 1, 2, . . . , $C_m\subseteq\hat C_m$. Consequently, $\bigcup_{m=1}^{\infty}(\mathbf{X}\setminus\hat C_m) = \mathbf{X}$. Define for each m = 1, 2, . . . ,
\[
\hat V_m = \bigcup_{n=1}^{m}(\mathbf{X}\setminus\hat C_n).
\]
Then it is easy to see that the sets $\{\hat V_m\}_{m=1}^{\infty}$ satisfy the requirements in the statement. In fact, for each $x\in\hat V_m$, $\sup_{s\in\mathbb{R}^0_+} q_x(s)\le m$.

Lemma 2.1.3 The kernel $p_q(s,x,t,dy)$ defined in (2.3) is a solution to the Kolmogorov forward equation corresponding to the Q-function q(dy|x, a).

Proof Let a q-bounded set $\Gamma\in\mathcal{B}(\mathbf{X})$ be fixed. Let us verify that equality (2.8) holds for $p_q(s,x,t,dy)$. For each $x\in\mathbf{X}$, $s,t\in[0,\infty)$, s ≤ t,
(0)
(s,t] X
=
pq (s, x, v, dy)q y (v)δ y ()dv
(s,t] X
δx (dy)e
= δx ()
e
(s,t] (0)
− (s,v] qx (u)du q y (v)δ y ()dv
− (s,v] qx (u)du − q (u)du qx (v)dv = δx () 1 − e (s,t] x
= δx () − pq (s, x, t, ),
(2.10)
where the first equality is by (2.2). By (2.2) and the Fubini Theorem, for each x ∈ X, s, t ∈ R0+ , s ≤ t, =
(s,t] X
(1)
pq (s, x, v, dy)δ y ()q y (v)dv
(s,t] X (s,v]
e
− (s,u] qx (r )dr
− q (r )dr × q (dz|x, u)δz (dy)e (u,v] z δ y ()q y (v)du dv X − q (r )dr − q (r )dr = e (s,u] x q (dz|x, u)δz ()e (u,v] z qz (v)du dv. (s,t] (s,v]
X
2.1 Transition Functions and the Markov Property
71
Upon interchanging the order of integration, the above expression equals
e−
(s,t]
e−
(s,u]
qx (r )dr
qx (r )dr
q (dz|x, u)δz ()
X
e− (u,t]
(u,v]
qz (r )dr
qz (v)dv du
q (dz|x, u)δz ()du − e− (s,u] qx (r )dr q (dz|x, u) pq(0) (u, z, t, )du (s,t] X e− (s,u] qx (r )dr q (dz|x, u)δz ()du − p (1) (s, x, t, ) = (s,t] X pq(0) (s, x, u, dz) q (|z, u)du − p (1) (s, x, t, ). = =
(s,u]
(s,t]
X
(s,t]
X
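Suppose, for some $n \ge 0$,
\[
\int_{(s,t]}\int_{\mathbf{X}} p_q^{(n+1)}(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv
= \int_{(s,t]}\int_{\mathbf{X}} p_q^{(n)}(s,x,u,dz)\,\tilde q(\Gamma|z,u)\,du - p_q^{(n+1)}(s,x,t,\Gamma),
\quad \forall\, x \in \mathbf{X},\ s,t \in \mathbb{R}_+^0,\ s \le t.
\tag{2.11}
\]
Then for each $x \in \mathbf{X}$, $s,t \in \mathbb{R}_+^0$, $s \le t$, using the Fubini Theorem, we obtain:
\[
\begin{aligned}
&\int_{(s,t]}\int_{\mathbf{X}} p_q^{(n+2)}(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv\\
&\quad= \int_{(s,t]} e^{-\int_{(s,u]} q_x(r)dr}\int_{\mathbf{X}} \tilde q(dz|x,u)
 \Bigl\{\int_{(u,t]}\int_{\mathbf{X}} p_q^{(n+1)}(u,z,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv\Bigr\}du\\
&\quad= \int_{(s,t]} e^{-\int_{(s,u]} q_x(r)dr}\int_{\mathbf{X}} \tilde q(dz|x,u)
 \Bigl\{\int_{(u,t]}\int_{\mathbf{X}} p_q^{(n)}(u,z,v,dy)\,\tilde q(\Gamma|y,v)\,dv - p_q^{(n+1)}(u,z,t,\Gamma)\Bigr\}du\\
&\quad= \int_{(s,t]} e^{-\int_{(s,u]} q_x(r)dr}\int_{\mathbf{X}} \tilde q(dz|x,u)
 \Bigl\{\int_{(u,t]}\int_{\mathbf{X}} p_q^{(n)}(u,z,v,dy)\,\tilde q(\Gamma|y,v)\,dv\Bigr\}du - p_q^{(n+2)}(s,x,t,\Gamma)\\
&\quad= \int_{(s,t]}\int_{\mathbf{X}} p_q^{(n+1)}(s,x,v,dy)\,\tilde q(\Gamma|y,v)\,dv - p_q^{(n+2)}(s,x,t,\Gamma),
\end{aligned}
\]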
where the first, the third and the last equalities are by (2.2) and the second equality is by (2.11). Thus, by induction, (2.11) holds for each n = 1, 2, . . . . Finally, for each $x \in \mathbf{X}$, $s,t \in \mathbb{R}_+^0$, $s \le t$,
\[
\sum_{n=0}^\infty \int_{(s,t]}\int_{\mathbf{X}} p_q^{(n+1)}(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv < \infty
\]
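by (2.7); recall that $p_q(s,x,t,dy)$ defined in (2.3) is a sub-stochastic kernel by Theorem 2.1.1. Therefore,
\[
\begin{aligned}
\sum_{n=0}^\infty p_q^{(n+1)}(s,x,t,\Gamma)
&= \sum_{n=0}^\infty \int_{(s,t]}\int_{\mathbf{X}} p_q^{(n)}(s,x,v,dy)\,\tilde q(\Gamma|y,v)\,dv
 - \sum_{n=0}^\infty \int_{(s,t]}\int_{\mathbf{X}} p_q^{(n+1)}(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv\\
&= \int_{(s,t]}\int_{\mathbf{X}} p_q(s,x,v,dy)\,\tilde q(\Gamma|y,v)\,dv
 - \int_{(s,t]}\int_{\mathbf{X}} p_q(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv\\
&\quad + \int_{(s,t]}\int_{\mathbf{X}} p_q^{(0)}(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv\\
&= \int_{(s,t]}\int_{\mathbf{X}} p_q(s,x,v,dy)\,\tilde q(\Gamma|y,v)\,dv
 - \int_{(s,t]}\int_{\mathbf{X}} p_q(s,x,v,dy)\,\delta_y(\Gamma)\,q_y(v)\,dv\\
&\quad + \delta_x(\Gamma) - p_q^{(0)}(s,x,t,\Gamma),
\end{aligned}
\]
where the first equality is by (2.11), the second equality is by (2.3), and the last equality is by (2.10). The statement to be proved now follows from the above relation.

Lemma 2.1.4 Any solution $p(s,x,t,dy)$ to the Kolmogorov forward equation satisfies
\[
p(s,x,t,\Gamma) = I\{x\in\Gamma\}\,e^{-\int_{(s,t]} q_x(v)dv}
+ \int_{(s,t]}\int_{\mathbf{X}}\int_{\Gamma} e^{-\int_{(u,t]} q_y(v)dv}\,\tilde q(dy|z,u)\,p(s,x,u,dz)\,du
\]
for each q-bounded set $\Gamma \in \mathcal{B}(\mathbf{X})$, and $s,t \in \mathbb{R}_+^0$, $s \le t$, $x \in \mathbf{X}$.

Proof Let $p(s,x,t,dy)$ be a solution to the Kolmogorov forward equation. Let $x \in \mathbf{X}$, $s,t \in \mathbb{R}_+^0$, $s \le t$, and some q-bounded set $\Gamma \in \mathcal{B}(\mathbf{X})$ be fixed. Note that
\[
I\{x\in\Gamma\} = I\{x\in\Gamma\}\Bigl\{e^{-\int_{(s,t]} q_x(v)dv} + \int_{(s,t]} q_x(u)\,e^{-\int_{(u,t]} q_x(v)dv}\,du\Bigr\}.
\]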
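Since $p(s,x,t,dy)$ is a solution to the Kolmogorov forward equation, and $\Gamma$ is a q-bounded set, we have
\[
\begin{aligned}
p(s,x,t,\Gamma) &= I\{x\in\Gamma\} - \int_{(s,t]}\int_{\Gamma} q_y(w)\,p(s,x,w,dy)\,dw
 + \int_{(s,t]}\int_{\mathbf{X}} \tilde q(\Gamma|y,w)\,p(s,x,w,dy)\,dw\\
&= I\{x\in\Gamma\}\,e^{-\int_{(s,t]} q_x(v)dv}
 + \int_{(s,t]}\int_{\Gamma} q_y(v)\,e^{-\int_{(v,t]} q_y(r)dr}\,\delta_x(dy)\,dv\\
&\quad - \int_{(s,t]}\int_{\Gamma} q_y(w)\,p(s,x,w,dy)\,dw
 + \int_{(s,t]}\int_{\mathbf{X}} \tilde q(\Gamma|y,w)\,p(s,x,w,dy)\,dw,
\end{aligned}
\]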
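where the second equality is due to the relation observed above. Now the claimed relation in the statement is equivalent to the following:
\[
\begin{aligned}
&\int_{(s,t]}\int_{\Gamma} q_y(v)\,e^{-\int_{(v,t]} q_y(r)dr}\,\delta_x(dy)\,dv
 - \int_{(s,t]}\int_{\Gamma} q_y(w)\,p(s,x,w,dy)\,dw\\
&\quad= \int_{(s,t]}\int_{\mathbf{X}}\int_{\Gamma} e^{-\int_{(w,t]} q_y(r)dr}\,\tilde q(dy|z,w)\,p(s,x,w,dz)\,dw
 - \int_{(s,t]}\int_{\mathbf{X}} \tilde q(\Gamma|y,w)\,p(s,x,w,dy)\,dw.
\end{aligned}
\tag{2.12}
\]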
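Using the Fubini Theorem, the right-hand side of the above can be written as
\[
\begin{aligned}
&\int_{(s,t]}\int_{\mathbf{X}}\int_{\Gamma} \Bigl(e^{-\int_{(w,t]} q_z(r)dr} - 1\Bigr)\tilde q(dz|y,w)\,p(s,x,w,dy)\,dw\\
&\quad= -\int_{(s,t]}\int_{\mathbf{X}}\int_{\Gamma}\int_{(w,t]} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\,dv\;\tilde q(dz|y,w)\,p(s,x,w,dy)\,dw\\
&\quad= -\int_{(s,t]}\int_{(s,v]}\int_{\mathbf{X}}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\,\tilde q(dz|y,w)\,p(s,x,w,dy)\,dw\,dv.
\end{aligned}
\]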
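Now (2.12) is equivalent to
\[
\begin{aligned}
&\int_{(s,t]}\int_{\Gamma} q_y(v)\,e^{-\int_{(v,t]} q_y(r)dr}\,\delta_x(dy)\,dv\\
&\quad + \int_{(s,t]}\int_{(s,v]}\int_{\mathbf{X}}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\,\tilde q(dz|y,w)\,p(s,x,w,dy)\,dw\,dv\\
&= \int_{(s,t]}\int_{\Gamma} q_y(w)\,p(s,x,w,dy)\,dw.
\end{aligned}
\tag{2.13}
\]
The left-hand side of the above can be written as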
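\[
\begin{aligned}
&\int_{(s,t]}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}
 \Bigl\{\delta_x(dz) + \int_{(s,v]}\int_{\mathbf{X}} \tilde q(dz|y,w)\,p(s,x,w,dy)\,dw\Bigr\}dv\\
&\quad= \int_{(s,t]}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}
 \Bigl\{\delta_x(dz) + \int_{(s,v]}\int_{\mathbf{X}} q(dz|y,w)\,p(s,x,w,dy)\,dw\Bigr\}dv\\
&\qquad + \int_{(s,t]}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\int_{(s,v]}\int_{\mathbf{X}} q_y(w)\,\delta_y(dz)\,p(s,x,w,dy)\,dw\,dv\\
&\quad= \int_{(s,t]}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\,p(s,x,v,dz)\,dv\\
&\qquad + \int_{(s,t]}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\int_{(s,v]}\int_{\mathbf{X}} q_y(w)\,\delta_y(dz)\,p(s,x,w,dy)\,dw\,dv,
\end{aligned}
\]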
where the second equality holds because $p(s,x,t,dy)$ is a solution to the Kolmogorov forward equation, and $\Gamma$ is a q-bounded set. Now relation (2.13) to be verified is equivalent to
\[
\begin{aligned}
&\int_{(s,t]}\int_{\Gamma} q_y(w)\,p(s,x,w,dy)\,dw
 - \int_{(s,t]}\int_{\Gamma} q_y(v)\,e^{-\int_{(v,t]} q_y(r)dr}\,p(s,x,v,dy)\,dv\\
&\quad= \int_{(s,t]}\int_{\Gamma} q_z(v)\,e^{-\int_{(v,t]} q_z(r)dr}\int_{(s,v]}\int_{\mathbf{X}} q_y(w)\,\delta_y(dz)\,p(s,x,w,dy)\,dw\,dv,
\end{aligned}
\]
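which is true because the left-hand side can be written as
\[
\begin{aligned}
&\int_{(s,t]}\int_{\Gamma} q_y(w)\Bigl(1 - e^{-\int_{(w,t]} q_y(r)dr}\Bigr)p(s,x,w,dy)\,dw\\
&\quad= \int_{(s,t]}\int_{\Gamma}\int_{(w,t]} q_y(v)\,e^{-\int_{(v,t]} q_y(r)dr}\,dv\;q_y(w)\,p(s,x,w,dy)\,dw\\
&\quad= \int_{(s,t]}\int_{(s,v]}\int_{\Gamma} q_y(v)\,e^{-\int_{(v,t]} q_y(r)dr}\,q_y(w)\,p(s,x,w,dy)\,dw\,dv,
\end{aligned}
\]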
which coincides with the right-hand side. The statement hence follows.
As a direct consequence of the previous lemma and Lemma 2.1.2, we can strengthen the previous statement from q-bounded sets to any measurable subset of X. Corollary 2.1.1 Any solution p(s, x, t, dy) to the Kolmogorov forward equation satisfies
\[
p(s,x,t,\Gamma) = I\{x\in\Gamma\}\,e^{-\int_{(s,t]} q_x(v)dv}
+ \int_{(s,t]}\int_{\mathbf{X}}\int_{\Gamma} e^{-\int_{(u,t]} q_y(v)dv}\,\tilde q(dy|z,u)\,p(s,x,u,dz)\,du,
\quad \forall\,\Gamma\in\mathcal{B}(\mathbf{X}),\ s,t\in\mathbb{R}_+^0,\ s\le t,\ x\in\mathbf{X}.
\tag{2.14}
\]
We gather the facts established earlier to conclude the following statement characterizing $p_q$ as the minimal solution (out of all transition functions) to the Kolmogorov forward equation.

Theorem 2.1.2 $p_q(s,x,t,dy)$ is the minimal solution (out of all transition functions) to the Kolmogorov forward equation. That is, $p_q(s,x,t,\Gamma) \le p(s,x,t,\Gamma)$ for all $s,t\in\mathbb{R}_+^0$, $s\le t$, $x\in\mathbf{X}$ and $\Gamma\in\mathcal{B}(\mathbf{X})$ for any solution $p$ to the Kolmogorov forward equation. (Below we understand $p_q \le p$ in this sense.)

Proof Let $p$ be a solution to the Kolmogorov forward equation. By Corollary 2.1.1, $p$ satisfies (2.14). For the statement of this theorem to hold, it remains to show that $p_q$ is the minimal transition function (see Theorem 2.1.1) satisfying relation (2.14), as follows. Recall the Feller forward iteration (2.6). Clearly, $p_q^{(0)}(s,x,t,\Gamma) = I\{x\in\Gamma\}e^{-\int_{(s,t]} q_x(v)dv} \le p(s,x,t,\Gamma)$. Suppose $\sum_{m=0}^{n} p_q^{(m)} \le p$. Let $s,t\in\mathbb{R}_+^0$, $s\le t$, $x\in\mathbf{X}$ and $\Gamma\in\mathcal{B}(\mathbf{X})$ be fixed. Then
\[
\begin{aligned}
\sum_{m=0}^{n+1} p_q^{(m)}(s,x,t,\Gamma)
&= p_q^{(0)}(s,x,t,\Gamma) + \sum_{m=1}^{n+1} p_q^{(m)}(s,x,t,\Gamma)\\
&= I\{x\in\Gamma\}e^{-\int_{(s,t]} q_x(v)dv}
 + \sum_{m=0}^{n}\int_{(s,t]}\int_{\mathbf{X}} p_q^{(m)}(s,x,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_z(w)dw}\,\tilde q(dz|y,v)\,dv\\
&= I\{x\in\Gamma\}e^{-\int_{(s,t]} q_x(v)dv}
 + \int_{(s,t]}\int_{\mathbf{X}} \sum_{m=0}^{n} p_q^{(m)}(s,x,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_z(w)dw}\,\tilde q(dz|y,v)\,dv\\
&\le I\{x\in\Gamma\}e^{-\int_{(s,t]} q_x(v)dv}
 + \int_{(s,t]}\int_{\mathbf{X}} p(s,x,v,dy)\int_{\Gamma} e^{-\int_{(v,t]} q_z(w)dw}\,\tilde q(dz|y,v)\,dv\\
&= p(s,x,t,\Gamma),
\end{aligned}
\]
where the second equality is by the Feller forward iteration (2.6), the inequality is by the inductive supposition, and the last equality is by Corollary 2.1.1. Thus, $\sum_{m=0}^{n} p_q^{(m)}(s,x,t,\Gamma) \le p(s,x,t,\Gamma)$ for all n = 0, 1, 2, . . .. The statement of the theorem follows from this and (2.3).

We now specialise the previous statement to a form which will be convenient when we show the sufficiency of the class of natural Markov strategies for CTMDP problems (1.15)–(1.18). By fixing $s = 0$ and $x \in \mathbf{X}$ in the Kolmogorov forward
equation, we have the following statement immediately following from the above theorem and its proof.

Corollary 2.1.2 $p_q(0,x,t,dy) = p_q(t,dy)$ is the minimal substochastic kernel on $\mathcal{B}(\mathbf{X})$ given $t\in\mathbb{R}_+^0$ satisfying the following Kolmogorov forward equation for fixed initial time and state $(0,x)$:
\[
\begin{aligned}
p(t,\Gamma) &= \delta_x(\Gamma) + \int_{(0,t]}\int_{\mathbf{X}} \tilde q(\Gamma|y,u)\,p(u,dy)\,du - \int_{(0,t]}\int_{\Gamma} q_y(u)\,p(u,dy)\,du\\
&= \delta_x(\Gamma) + \int_{(0,t]}\int_{\mathbf{X}} q(\Gamma|y,u)\,p(u,dy)\,du, \qquad t\in\mathbb{R}_+^0,
\end{aligned}
\]
for each q-bounded set $\Gamma$. Moreover, if $p_q(t,\mathbf{X}) = 1$ for all $t\in\mathbb{R}_+^0$, then $p_q(t,dy)$ is the unique solution (out of all substochastic kernels) to the above Kolmogorov forward equation for the fixed initial time and state $(0,x)$.

Proof Only the last assertion needs justification. Let $p(t,dy)$ be another substochastic kernel satisfying the equation in the statement of this corollary. Suppose for contradiction that for some $t\in\mathbb{R}_+^0$ and $\Gamma\in\mathcal{B}(\mathbf{X})$, $p_q(t,\Gamma) < p(t,\Gamma)$. Then
\[
1 = p_q(t,\mathbf{X}) = p_q(t,\Gamma) + p_q(t,\mathbf{X}\setminus\Gamma) < p(t,\Gamma) + p(t,\mathbf{X}\setminus\Gamma) \le 1,
\]
which is the desired contradiction. The statement of this corollary thus follows, and the proof is complete.
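As a numerical illustration of the iteration (2.2) and the sum (2.3), the following Python sketch treats a hypothetical two-state, time-homogeneous chain. The example, the rates a = 1 and b = 2, the grid size, the truncation level and the helper name `feller_sum` are all assumptions made for illustration and do not come from the text. In the homogeneous case (2.2) reduces to a convolution in time, which is approximated here by the trapezoidal rule. Because this chain is non-explosive, the truncated sum should be close to the classical closed-form transition probability, and each row should have total mass close to one.

```python
import math

def feller_sum(q_tilde, t, steps=100, n_max=20):
    """Approximate p_q(0, x, t, {y}) = sum_n p_q^{(n)}(0, x, t, {y}) for a
    time-homogeneous chain with off-diagonal rates q_tilde[x][z]
    (hypothetical helper, not from the text)."""
    states = list(q_tilde)
    h = t / steps
    # Conservative Q-function: the total jump rate q_x is the mass of q_tilde(.|x).
    q = {x: sum(q_tilde[x].values()) for x in states}
    decay = {x: [math.exp(-q[x] * j * h) for j in range(steps + 1)] for x in states}
    # n = 0: no jump on (0, t_k], so p^{(0)} = delta_x({y}) * exp(-q_x * t_k).
    cur = {x: {y: (decay[x][:] if x == y else [0.0] * (steps + 1)) for y in states}
           for x in states}
    total = {x: {y: cur[x][y][steps] for y in states} for x in states}
    for _ in range(n_max):
        nxt = {x: {} for x in states}
        for x in states:
            for y in states:
                # g[k] = sum_z q_tilde(z|x) * p^{(n)}(0, z, t_k, {y})
                g = [sum(r * cur[z][y][k] for z, r in q_tilde[x].items())
                     for k in range(steps + 1)]
                vals = [0.0]
                for k in range(1, steps + 1):
                    # Trapezoidal rule for int_0^{t_k} exp(-q_x u) g(t_k - u) du.
                    acc = 0.5 * (decay[x][0] * g[k] + decay[x][k] * g[0])
                    acc += sum(decay[x][j] * g[k - j] for j in range(1, k))
                    vals.append(acc * h)
                nxt[x][y] = vals
        cur = nxt
        for x in states:
            for y in states:
                total[x][y] += cur[x][y][steps]
    return total

a, b, t = 1.0, 2.0, 1.0
p = feller_sum({0: {1: a}, 1: {0: b}}, t)
closed_form_00 = b / (a + b) + a / (a + b) * math.exp(-(a + b) * t)
print(p[0][0], closed_form_00, p[0][0] + p[0][1])
```

For an explosive Q-function the same truncated sums would still increase to the minimal solution, but the row masses would stay strictly below one.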
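2.1.4 Markov Property of the Controlled Process Under a Natural Markov Strategy

In this subsection, we show that the controlled process $X(\cdot)$ under a fixed natural Markov strategy is a Markov pure jump process with the transition function being the one constructed in the previous subsection. To this end, we need the following technical result.

Lemma 2.1.5 For each $u\in\mathbb{R}_+$, on $\{\mu((0,u]\times\mathbf{X}) < \infty\} = \{u < T_\infty\}$,
\[
P\bigl(T_{\mu((0,u]\times\mathbf{X})+1}\in\Gamma,\ X_{\mu((0,u]\times\mathbf{X})+1}\in\Gamma_X \,\big|\, \mathcal{F}_u\bigr)
= \int_{\Gamma} e^{-\int_{(u,t]} q_{X(u)}(s)ds}\,\tilde q(\Gamma_X|X(u),t)\,dt,
\quad \forall\,\Gamma\in\mathcal{B}((u,\infty)),\ \Gamma_X\in\mathcal{B}(\mathbf{X});
\tag{2.15}
\]
and
\[
P\bigl(T_{\mu((0,u]\times\mathbf{X})+1} = \infty,\ X_{\mu((0,u]\times\mathbf{X})+1} = x_\infty \,\big|\, \mathcal{F}_u\bigr)
= e^{-\int_{(u,\infty)} q_{X(u)}(s)ds}.
\tag{2.16}
\]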
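Proof Let $u\in\mathbb{R}_+$, $\Gamma\in\mathcal{B}((u,\infty))$ and $\Gamma_X\in\mathcal{B}(\mathbf{X})$ be arbitrarily fixed. Write $\mu_u := \mu((0,u]\times\mathbf{X})$ for brevity. On $\{\mu_u < \infty\}$,
\[
\begin{aligned}
&P(T_{\mu_u+1}\in\Gamma,\ X_{\mu_u+1}\in\Gamma_X \,|\, \mathcal{F}_u)\\
&\quad= \sum_{n\ge 0} P(T_{\mu_u+1}\in\Gamma,\ X_{\mu_u+1}\in\Gamma_X \,|\, \mathcal{F}_u,\ \mu_u = n)\,I\{\mu_u = n\}\\
&\quad= \sum_{n\ge 0} P(T_{n+1}\in\Gamma,\ X_{n+1}\in\Gamma_X \,|\, \mathcal{F}_u,\ \mu_u = n)\,I\{\mu_u = n\}\\
&\quad= \sum_{n\ge 0} P(T_{n+1}\in\Gamma,\ X_{n+1}\in\Gamma_X \,|\, \mathcal{F}_{T_n},\ T_n\le u< T_{n+1})\,I\{\mu_u = n\}\\
&\quad= \sum_{n\ge 0} P(T_{n+1}\in\Gamma,\ X_{n+1}\in\Gamma_X \,|\, \mathcal{F}_{T_n},\ u< T_{n+1})\,I\{\mu_u = n\}\\
&\quad= \sum_{n\ge 0} \frac{P(T_{n+1}\in\Gamma,\ X_{n+1}\in\Gamma_X \,|\, H_n)}{P(T_{n+1}>u \,|\, H_n)}\,I\{\mu_u = n\},
\end{aligned}
\]
where the last equality is by (1.6). It follows from the last equality and (1.13) that, on $\{\mu_u < \infty\}$,
\[
\begin{aligned}
P(T_{\mu_u+1}\in\Gamma,\ X_{\mu_u+1}\in\Gamma_X \,|\, \mathcal{F}_u)
&= \sum_{n\ge 0} \frac{\int_{\Gamma} e^{-\int_{(T_n,t]} q_{X_n}(s)ds}\,\tilde q(\Gamma_X|X_n,t)\,dt}{e^{-\int_{(T_n,u]} q_{X_n}(s)ds}}\,I\{\mu_u = n\}\\
&= \sum_{n\ge 0} \int_{\Gamma} e^{-\int_{(u,t]} q_{X_n}(s)ds}\,\tilde q(\Gamma_X|X_n,t)\,dt\;I\{\mu_u = n\}\\
&= \int_{\Gamma} e^{-\int_{(u,t]} q_{X(u)}(s)ds}\,\tilde q(\Gamma_X|X(u),t)\,dt.
\end{aligned}
\]
Thus, (2.15) is proved. The proof of (2.16) is absolutely the same.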
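For a conservative Q-function, (2.15) and (2.16) together say that, given $\mathcal{F}_u$, the next jump moment $T$ has conditional survival function $P(T > t) = e^{-\int_{(u,t]} q_{X(u)}(s)ds}$. One standard way to sample such a moment is to invert the cumulative hazard at a standard exponential variable. The Python sketch below does this for a hypothetical rate $q(s) = 1 + s$, for which the inversion is available in closed form; the rate and the function names are assumptions chosen for illustration, not part of the text.

```python
import math
import random

def cumulative_hazard(u, t):
    """Integral of the assumed rate q(s) = 1 + s over (u, t]."""
    return (t - u) + 0.5 * (t * t - u * u)

def next_jump_time(u, e):
    """Solve cumulative_hazard(u, t) = e for t >= u, where e >= 0.

    The equation is equivalent to (1 + t)^2 = (1 + u)^2 + 2 e.
    """
    return -1.0 + math.sqrt((1.0 + u) ** 2 + 2.0 * e)

def sample_next_jump(u, rng):
    # If E is a standard exponential variable, then T = next_jump_time(u, E)
    # satisfies P(T > t) = exp(-cumulative_hazard(u, t)) for t >= u.
    return next_jump_time(u, rng.expovariate(1.0))

rng = random.Random(0)
u = 0.5
samples = [sample_next_jump(u, rng) for _ in range(5)]
print(samples)
```

When the cumulative hazard has a finite limit as $t \to \infty$, the exponential variable may exceed that limit; this corresponds to the event $\{T = \infty\}$ in (2.16) and would need separate handling in a general sampler.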
Now we are in position to present the main statement of this subsection.

Theorem 2.1.3 Under a fixed natural Markov strategy $\breve\pi$, the stochastic process $X(\cdot)$ defined by (1.7) is a Markov pure jump process on $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}, P_\gamma^{\breve\pi})$ with the corresponding transition function $p_q(s,x,t,dy)$ given by (2.3).

Proof Keeping in mind (1.7), we firstly show that, for each $u,t\in\mathbb{R}_+^0$, $u\le t$, and $\Gamma\in\mathcal{B}(\mathbf{X})$,
\[
P(X(t)\in\Gamma \,|\, \mathcal{F}_u) = P(X(t)\in\Gamma \,|\, X(u)) = p_q(u,X(u),t,\Gamma) \quad\text{on } \{u<T_\infty\}.
\]
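To this end, as the first step, we show that, for each $u\in\mathbb{R}_+^0$, on $\{u<T_\infty\}$,
\[
P(X(t)\in\Gamma,\ \mu((u,t]\times\mathbf{X}) = n \,|\, \mathcal{F}_u) = p_q^{(n)}(u,X(u),t,\Gamma),
\quad \forall\, t\in\mathbb{R}_+^0,\ t\ge u,\ \Gamma\in\mathcal{B}(\mathbf{X})
\tag{2.17}
\]
for each $n\ge 0$ as follows. Let $u,t\in\mathbb{R}_+^0$, $u\le t$, and $\Gamma\in\mathcal{B}(\mathbf{X})$ be arbitrarily fixed, and write $\mu_u := \mu((0,u]\times\mathbf{X})$. Note that, by Lemma 2.1.5,
\[
P(T_{\mu_u+1} > t \,|\, \mathcal{F}_u) = e^{-\int_{[u,t)} q_{X(u)}(s)ds}
\tag{2.18}
\]
on $\{u<T_\infty\}$. Then, on $\{u<T_\infty\}$,
\[
\begin{aligned}
P(X(t)\in\Gamma,\ \mu((u,t]\times\mathbf{X}) = 0 \,|\, \mathcal{F}_u)
&= P(X(t)\in\Gamma,\ T_{\mu_u+1} > t \,|\, \mathcal{F}_u)
 = I\{X(u)\in\Gamma\}\,P(T_{\mu_u+1} > t \,|\, \mathcal{F}_u)\\
&= I\{X(u)\in\Gamma\}\,e^{-\int_{[u,t)} q_{X(u)}(s)ds} = p_q^{(0)}(u,X(u),t,\Gamma),
\end{aligned}
\]
where the third equality is by (2.18), and the last equality is by (2.2). Thus, (2.17) holds when $n = 0$. Suppose (2.17) holds for some $n\ge 0$. Then, on $\{u<T_\infty\}$,
\[
\begin{aligned}
&P(X(t)\in\Gamma,\ \mu((u,t]\times\mathbf{X}) = n+1 \,|\, \mathcal{F}_u)\\
&\quad= E\bigl[P(X(t)\in\Gamma,\ \mu((u,t]\times\mathbf{X}) = n+1 \,|\, \mathcal{F}_u,\ T_{\mu_u+1},\ X_{\mu_u+1}) \,\big|\, \mathcal{F}_u\bigr]\\
&\quad= E\bigl[P(X(t)\in\Gamma,\ \mu((u,t]\times\mathbf{X}) = n+1 \,|\, \mathcal{F}_{T_{\mu_u+1}}) \,\big|\, \mathcal{F}_u\bigr]\\
&\quad= E\bigl[P(X(t)\in\Gamma,\ \mu((T_{\mu_u+1},t]\times\mathbf{X}) = n \,|\, \mathcal{F}_{T_{\mu_u+1}}) \,\big|\, \mathcal{F}_u\bigr]\\
&\quad= \int_{(u,t]}\int_{\mathbf{X}} p_q^{(n)}(s,y,t,\Gamma)\,\tilde q(dy|X(u),s)\,e^{-\int_{(u,s]} q_{X(u)}(v)dv}\,ds\\
&\quad= p_q^{(n+1)}(u,X(u),t,\Gamma),
\end{aligned}
\]
where the last but one equality is by Lemma 2.1.5 and the inductive supposition, and the last equality is by (2.2). Thus, by induction, (2.17) holds for each $n\ge 0$. Note that for each $u,t\in\mathbb{R}_+^0$, $u\le t$, and $\Gamma\in\mathcal{B}(\mathbf{X})$,
\[
\begin{aligned}
P(X(t)\in\Gamma \,|\, \mathcal{F}_u)
&= P(X(t)\in\Gamma \,|\, \mathcal{F}_u)\,I\{u<T_\infty\} + P(X(t)\in\Gamma \,|\, \mathcal{F}_u)\,I\{u\ge T_\infty\}\\
&= \sum_{n\ge 0} P(X(t)\in\Gamma,\ \mu((u,t]\times\mathbf{X}) = n \,|\, \mathcal{F}_u)\,I\{u<T_\infty\}\\
&= \sum_{n\ge 0} p_q^{(n)}(u,X(u),t,\Gamma)\,I\{u<T_\infty\}\\
&= p_q(u,X(u),t,\Gamma)\,I\{u<T_\infty\} = p_q(u,X(u),t,\Gamma)\,I\{X(u)\in\mathbf{X}\},
\end{aligned}
\]
where the third equality is by (2.17), and the fourth equality is by (2.3). Now we observe that on $\{u<T_\infty\}$,
\[
P(X(t)\in\Gamma \,|\, \mathcal{F}_u) = E[P(X(t)\in\Gamma \,|\, \mathcal{F}_u) \,|\, X(u)] = P(X(t)\in\Gamma \,|\, X(u)) = p_q(u,X(u),t,\Gamma).
\]
Finally, for any $0\le u\le t<\infty$,
\[
\begin{aligned}
P(X(t) = x_\infty \,|\, \mathcal{F}_u)
&= P(X(t) = x_\infty \,|\, \mathcal{F}_u)\,I\{u<T_\infty\} + P(X(t) = x_\infty \,|\, \mathcal{F}_u)\,I\{u\ge T_\infty\}\\
&= (1 - p_q(u,X(u),t,\mathbf{X}))\,I\{X(u)\in\mathbf{X}\} + I\{X(u) = x_\infty\}\\
&= P(X(t) = x_\infty \,|\, X(u)).
\end{aligned}
\]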
The proof is complete.
In general, the transition function $p_q$ may be substochastic. The next statement presents a simple and natural way of complementing it to build the transition probability function of the Markov pure jump process $X(\cdot)$ under the fixed natural Markov strategy.

Corollary 2.1.3 Under a fixed natural Markov strategy $\breve\pi$, the transition probability function corresponding to the Markov pure jump process $X(\cdot)$ is given by
\[
\begin{aligned}
\bar p_q(s,x,t,\Gamma) &= p_q(s,x,t,\Gamma), && \forall\, x\in\mathbf{X},\ \Gamma\in\mathcal{B}(\mathbf{X});\\
\bar p_q(s,x,t,\{x_\infty\}) &= 1 - p_q(s,x,t,\mathbf{X}), && \forall\, x\in\mathbf{X};\\
\bar p_q(s,x_\infty,t,\Gamma) &= I\{x_\infty\in\Gamma\}, && \forall\,\Gamma\in\mathcal{B}(\mathbf{X}_\infty).
\end{aligned}
\]
Proof According to Theorem 2.1.1, $\bar p_q$ is a transition probability function on $\mathbf{X}_\infty$. The statement is now an immediate consequence of Theorem 2.1.3.
2.2 Conditions for Non-explosiveness Under a Fixed Natural Markov Strategy Here we provide sharp and verifiable conditions for the non-explosiveness of a controlled process under a fixed natural Markov strategy. We first deal with the general case, where the controlled process is a possibly nonhomogeneous Markov pure jump process. After that we specialize to the case when the controlled process is a homogeneous Markov pure jump process.
2.2.1 The Nonhomogeneous Case

The exact meaning of non-explosiveness and explosiveness is defined as follows.

Definition 2.2.1 A controlled process $X(\cdot)$ is called non-explosive under a given strategy $S$ if $P_x^S(T_\infty = \infty) = 1$, $\forall\, x\in\mathbf{X}$, or equivalently, $\forall\, x\in\mathbf{X}$, $t\in\mathbb{R}_+^0$, $P_x^S(X(t)\in\mathbf{X}) = 1$. If, for some $x\in\mathbf{X}$, $P_x^S(T_\infty = \infty) < 1$, then the process $X(\cdot)$ is called explosive (under the initial state $x$ and the strategy $S$).

Note that, if the controlled process $X(\cdot)$ is non-explosive under strategy $S$, then, for any initial distribution $\gamma$, for all $t\in\mathbb{R}_+^0$, $P_\gamma^S(T_\infty = \infty) = P_\gamma^S(X(t)\in\mathbf{X}) = 1$. If $P_\gamma^S(T_\infty = \infty) < 1$, we say that the process $X(\cdot)$ is explosive under the initial distribution $\gamma$ and the strategy $S$.

Below, we provide verifiable conditions for the non-explosiveness of the process $X(\cdot)$ under a specific natural Markov strategy. The notations from the previous section are adopted, without special emphasis. The main condition, which will be shown below to be sufficient and necessary (in a fairly general setup) for the nonexplosiveness, is the following one.

Condition 2.2.1 There exist a monotone nondecreasing sequence $\{\tilde V_n\}_{n=1}^\infty \subseteq \mathcal{B}(\mathbb{R}_+^0\times\mathbf{X})$ and an $\mathbb{R}_+^0$-valued measurable function $w$ on $\mathbb{R}_+^0\times\mathbf{X}$ such that the following hold.
(a) As $n\uparrow\infty$, $\tilde V_n \uparrow \mathbb{R}_+^0\times\mathbf{X}$.
(b) For each n = 1, 2, . . . ,
\[
\sup_{x\in V_n,\ t\in\mathbb{R}_+^0} q_x(t) < \infty,
\tag{2.19}
\]
where $V_n$ denotes the projection of $\tilde V_n$ on $\mathbf{X}$.
(c) As $n\uparrow\infty$,
\[
\inf_{(t,x)\in(\mathbb{R}_+^0\times\mathbf{X})\setminus\tilde V_n} w(t,x) \uparrow \infty.
\tag{2.20}
\]
(d) For some $\alpha\in\mathbb{R}_+$, for any $x\in\mathbf{X}$ and $v\in\mathbb{R}_+^0$,
\[
\int_{(0,\infty)}\int_{\mathbf{X}} w(t+v,y)\,e^{-\alpha t-\int_{(0,t]} q_x(s+v)ds}\,\tilde q(dy|x,t+v)\,dt \le w(v,x).
\tag{2.21}
\]
Remember, the above condition is for a fixed natural Markov strategy $\breve\pi$.

We firstly present the next simple but useful observation concerning the equivalent definitions of non-explosiveness (under a fixed natural Markov strategy).

Lemma 2.2.1 Under a fixed natural Markov strategy $\breve\pi$, the following assertions are equivalent.
(a) The (Markov pure jump) process $X(\cdot)$ is non-explosive.
(b) For each $x\in\mathbf{X}$ and $t\in\mathbb{R}_+^0$, $p_q(0,x,t,\mathbf{X}) = 1$.
(c) For each $x\in\mathbf{X}$, and almost all $t\in\mathbb{R}_+^0$, $p_q(0,x,t,\mathbf{X}) = 1$.
(d) For each $x\in\mathbf{X}$ and $0\le s\le t$, $p_q(s,x,t,\mathbf{X}) = 1$.
Proof Assertions (a) and (b) are equivalent due to Theorem 2.1.3. Let us prove that (b), (c) and (d) are equivalent. Clearly, (d) implies (b), which in turn implies (c). That (c) implies (b) is true because clearly $t \mapsto p_q(0,x,t,\mathbf{X})$ is monotone nonincreasing in $t\in\mathbb{R}_+^0$. The latter fact follows from the Kolmogorov–Chapman equation, see (2.1), and Theorem 2.1.3. Now suppose (b) is true. By the Kolmogorov–Chapman equation, we see that for each $x\in\mathbf{X}$ and $0\le s\le t$,
\[
p_q(s,y,t,\mathbf{X}) = 1
\tag{2.22}
\]
for almost all $y$ with respect to $p_q(0,x,s,dy)$. Now suppose for contradiction that (d) does not hold, so that there exist some $0\le s<t$ and $x\in\mathbf{X}$ such that
\[
p_q(s,x,t,\mathbf{X}) < 1.
\tag{2.23}
\]
Then
\[
\begin{aligned}
1 = p_q(0,x,t,\mathbf{X})
&= p_q(0,x,s,\{x\})\,p_q(s,x,t,\mathbf{X}) + \int_{\mathbf{X}\setminus\{x\}} p_q(0,x,s,dy)\,p_q(s,y,t,\mathbf{X})\\
&= p_q(0,x,s,\{x\})\,p_q(s,x,t,\mathbf{X}) + p_q(0,x,s,\mathbf{X}\setminus\{x\})\\
&< p_q(0,x,s,\{x\}) + p_q(0,x,s,\mathbf{X}\setminus\{x\}) = 1,
\end{aligned}
\]
where the first equality and the last equality are by assumption that (b) holds, the second equality is by the Kolmogorov–Chapman equation, the third equality is by (2.22), and the inequality is by (2.23). This is the desired contradiction, and thus assertions (b), (c) and (d) are equivalent. The proof is complete.

In a nutshell, we shall link the non-explosiveness of the process $X(\cdot)$ under the fixed natural Markov strategy with the absorption of an induced discrete-time Markov chain at a certain state. In fact, Lemma 2.2.1 shows the equivalence between the two properties; see Corollary 2.2.1 below. To describe this induced discrete-time Markov chain rigorously, we recall the “hat” model considered in the previous chapter and introduce some necessary notations as follows.

Let us fix an arbitrary $\alpha\in\mathbb{R}_+$ and consider the “hat” model described in Sect. 1.3.5 under the fixed natural Markov strategy $\breve\pi$. As before, we often omit the upper index $\breve\pi$ when considering probabilities and mathematical expectations. For any $\hat\omega = (\hat x_0,\hat\theta_1,\hat x_1,\hat\theta_2,\hat x_2,\ldots)\in\hat\Omega$, we put
\[
\tilde\omega = \bigl((\hat t_0 = 0,\hat x_0),\ (\hat t_1 = \hat\theta_1,\hat x_1),\ (\hat t_2 = \hat\theta_1+\hat\theta_2,\hat x_2),\ldots\bigr)\in\tilde\Omega = \tilde{\mathbf{X}}^\infty,
\]
where $\tilde{\mathbf{X}} := (\mathbb{R}_+^0\times\mathbf{X})\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}$. By the way, when it matters concerning the topology, here the two points $\Delta$ and $x_\infty$ are regarded as two isolated points added to the original state space $\mathbf{X}$; “$\infty$” is an isolated point added to $\mathbb{R}_+^0$, and the product topology in $\tilde{\mathbf{X}}$ is fixed.

The image of the strategic measure $\hat P_x^{\breve\pi}$ with respect to the described mapping $\hat\omega\to\tilde\omega$ is denoted by $\tilde P_{(0,x)}$. As before, $\hat T_n = \hat t_n$, $\hat X_n = \hat x_n$ and $\tilde H_n = ((\hat t_0,\hat x_0),(\hat t_1,\hat x_1),\ldots,(\hat t_n,\hat x_n))$ as functions of $\tilde\omega$ are considered as random elements on $\tilde\Omega$. Note that, for all $n\in\mathbb{N}$, $\Gamma_R\in\mathcal{B}(\bar{\mathbb{R}}_+^0)$, $\Gamma_X\in\mathcal{B}(\mathbf{X}\cup\{\Delta\}\cup\{x_\infty\})$,
\[
\begin{aligned}
&\tilde P_{(0,x)}(\hat T_n\in\Gamma_R,\ \hat x_n\in\Gamma_X \,|\, \tilde H_{n-1})\\
&\quad= \delta_\infty(\Gamma_R)\,\delta_{x_\infty}(\Gamma_X)\bigl\{\delta_{\hat x_{n-1}}(\{x_\infty\}) + \delta_{\hat x_{n-1}}(\{\Delta\})\bigr\}\\
&\qquad + \delta_{\hat x_{n-1}}(\mathbf{X})\int_{\Gamma_R\cap\mathbb{R}_+^0} I\{s>\hat T_{n-1}\}\bigl\{\tilde q(\Gamma_X\setminus\{\Delta,x_\infty\}|\hat x_{n-1},s) + \alpha I\{\Delta\in\Gamma_X\}\bigr\}\\
&\qquad\qquad \times e^{-\int_{(\hat T_{n-1},s]} q_{\hat x_{n-1}}(u)du-\alpha(s-\hat T_{n-1})}\,ds,
\end{aligned}
\]
meaning that the marked point process $\{(\hat T_n,\hat x_n)\}_{n=0}^\infty$ is a discrete-time Markov process with state space $\tilde{\mathbf{X}}$ and transition probability $\tilde G$ defined as follows. On $\mathcal{B}(\mathbb{R}_+^0\times\mathbf{X})$, for each $(t,z)\in\mathbb{R}_+^0\times\mathbf{X}$,
\[
\tilde G(ds\times dy|(t,z)) = I\{s>t\}\,\tilde q(dy|z,s)\,e^{-\int_{(t,s]} q_z(u)du-\alpha(s-t)}\,ds;
\]
on $\mathcal{B}(\mathbb{R}_+^0\times\{\Delta\})$, for each $(t,z)\in\mathbb{R}_+^0\times\mathbf{X}$,
\[
\tilde G(ds\times\{\Delta\}|(t,z)) = I\{s>t\}\,\alpha\,e^{-\int_{(t,s]} q_z(u)du-\alpha(s-t)}\,ds.
\]
Besides, $\tilde G(\{(\infty,x_\infty)\}|(t,\Delta)) \equiv 1$; and finally, the state $(\infty,x_\infty)$ is absorbing: $\tilde G(\{(\infty,x_\infty)\}|(\infty,x_\infty)) = 1$. Below, we omit the brackets for brevity and write $\tilde G(ds\times dy|t,z)$. We will also consider the discrete-time Markov chain $\{(\hat T_n,\hat x_n)\}_{n=0}^\infty$ starting from an arbitrary initial state $(v,x)\in\tilde{\mathbf{X}}$; the corresponding probability measure and mathematical expectation are denoted by $\tilde P_{(v,x)}$ and $\tilde E_{(v,x)}$. The discrete-time Markov chain $\{(\hat T_n,\hat X_n)\}_{n=0}^\infty$ with the transition probability $\tilde G$ will be referred to as the “α-jump chain” of the nonhomogeneous Markov pure jump process $X(\cdot)$.

For the discrete-time Markov chain $\{(\hat T_n,\hat x_n)\}_{n=0}^\infty$, we denote by $\sigma_\Gamma$ and $\tau_\Gamma$ its hitting time and return time to the set $\Gamma\in\mathcal{B}(\tilde{\mathbf{X}})$, respectively, i.e.,
\[
\sigma_\Gamma = \inf\{n\ge 0 : (\hat T_n,\hat X_n)\in\Gamma\};\qquad \tau_\Gamma = \inf\{n\ge 1 : (\hat T_n,\hat X_n)\in\Gamma\}.
\]
Furthermore, we define iteratively $\tau_\Gamma(k)$ for each $\Gamma\in\mathcal{B}(\tilde{\mathbf{X}})$ by
\[
\tau_\Gamma(0) = 0,\qquad \tau_\Gamma(k) := \inf\{n>\tau_\Gamma(k-1) : (\hat T_n,\hat X_n)\in\Gamma\},\quad k\ge 1.
\]
As usual, the infimum taken over the empty set is defined as $+\infty$. Finally, we put
\[
\eta_\Gamma := \sum_{n\ge 1} I\{(\hat T_n,\hat X_n)\in\Gamma\}
\]
for each $\Gamma\in\mathcal{B}(\tilde{\mathbf{X}})$.

Intuitively speaking, if the process $X(\cdot)$ is non-explosive in the original model, then in the “hat” model it will be almost surely killed at a finite time, meaning that the set of sequences of the type $(\hat x_0 = x,\hat\theta_1,\hat x_1,\ldots)\in(\mathbf{X}\times\mathbb{R}_+^0)^\infty$ is null with respect to $\hat P_x^{\breve\pi}$ for each $x\in\mathbf{X}$ if and only if the process $X(\cdot)$ is nonexplosive in the original model. Equivalently, the process $X(\cdot)$ is non-explosive in the original model if and only if the discrete-time Markov chain $\{(\hat T_n,\hat x_n)\}_{n=0}^\infty$ is absorbed at $(\infty,x_\infty)$ almost surely with respect to $\tilde P_{(0,x)}$ for each $x\in\mathbf{X}$, i.e., the hitting time to the set $\mathbb{R}_+^0\times\{\Delta\}$ is finite almost surely with respect to $\tilde P_{(0,x)}$ for each $x\in\mathbf{X}$. More generally, we have the next statement.
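Lemma 2.2.2 For each $x\in\mathbf{X}$ and $v\in\mathbb{R}_+^0$,
\[
\alpha\int_{(0,\infty)} e^{-\alpha t}\,p_q(v,x,v+t,\mathbf{X})\,dt = \tilde P_{(v,x)}(\sigma_{\mathbb{R}_+^0\times\{\Delta\}} < \infty).
\]
Proof For brevity, let us verify this statement for the case of $v = 0$ as follows. For each $x\in\mathbf{X}$,
\[
\begin{aligned}
\alpha\int_{(0,\infty)} e^{-\alpha t}\,p_q(0,x,t,\mathbf{X})\,dt
&= \alpha E_x\Bigl[\int_{(0,\infty)} e^{-\alpha t}\,I\{X(t)\in\mathbf{X}\}\,dt\Bigr]\\
&= \alpha\sum_{n=0}^\infty \hat E_x\Bigl[\int_{(\hat T_n,\hat T_{n+1}]} I\{\hat X_n\in\mathbf{X}\}\,dt\Bigr]\\
&= \alpha\sum_{n=0}^\infty \hat E_x\Bigl[\hat E_x\Bigl[\int_{(\hat T_n,\hat T_{n+1}]} I\{\hat X_n\in\mathbf{X}\}\,dt \,\Big|\, \mathcal{F}_{\hat T_n}\Bigr]\Bigr]\\
&= \sum_{n=0}^\infty \hat E_x\Bigl[I\{\hat X_n\in\mathbf{X}\}\int_{(0,\infty)} \alpha\,e^{-\alpha t-\int_{(0,t]} q_{\hat X_n}(s+\hat T_n)ds}\,dt\Bigr]\\
&= \sum_{n=0}^\infty \hat E_x\Bigl[I\{\hat X_n\in\mathbf{X}\}\Bigl\{1 - \int_{(0,\infty)} e^{-\alpha t-\int_{(0,t]} q_{\hat X_n}(\hat T_n+s)ds}\,q_{\hat X_n}(t+\hat T_n)\,dt\Bigr\}\Bigr]\\
&= \sum_{n=0}^\infty \tilde P_{(0,x)}\bigl(\hat X_m\in\mathbf{X},\ \hat T_m\in[0,\infty),\ 0\le m\le n,\ (\hat T_{n+1},\hat X_{n+1})\in\mathbb{R}_+^0\times\{\Delta\}\bigr)\\
&= \tilde P_{(0,x)}(\sigma_{\mathbb{R}_+^0\times\{\Delta\}} < \infty),
\end{aligned}
\tag{2.24}
\]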
where the first equality is by Theorem 2.1.3, and the second equality is by Theorem 1.3.3. The general case is a matter of shifting time.

As an immediate consequence of Lemmas 2.2.1 and 2.2.2, as well as the definition of non-explosiveness, we have the next observation.

Corollary 2.2.1 Under a fixed natural Markov strategy $\breve\pi$, the Markov pure jump process $X(\cdot)$ is non-explosive if and only if
\[
\tilde P_{(0,x)}(\sigma_{\{(\infty,x_\infty)\}} < \infty) = \tilde P_{(t,x)}(\sigma_{\{(\infty,x_\infty)\}} < \infty) = 1
\tag{2.25}
\]
for each $(t,x)\in\tilde{\mathbf{X}}$, for any $\alpha\in\mathbb{R}_+$.

Proof The assertion follows from Lemmas 2.2.1 and 2.2.2.
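To see Corollary 2.2.1 at work in the simplest setting, consider a hypothetical time-homogeneous pure birth process on {1, 2, . . .} with birth rate $\lambda_n$ in state $n$; this example is an assumption for illustration and is not taken from the text. Integrating the transition probability of the α-jump chain over time shows that each step from state $n$ leads to the cemetery with probability $\alpha/(\alpha+\lambda_n)$ and to $n+1$ otherwise, so the probability of ever being killed equals $1-\prod_n \lambda_n/(\alpha+\lambda_n)$. For $\lambda_n = n^2$ the infinite product has the classical closed form $\pi\sqrt\alpha/\sinh(\pi\sqrt\alpha)$, which the sketch uses as a check. The helper name and the truncation level `n_terms` are arbitrary choices.

```python
import math

def killing_probability(birth_rate, alpha, start=1, n_terms=100_000):
    """Probability that the alpha-jump chain of a pure birth process started at
    `start` is killed within n_terms steps (hypothetical helper)."""
    log_survival = 0.0
    for n in range(start, start + n_terms):
        lam = birth_rate(n)
        # Probability of jumping to n + 1 rather than being killed in state n.
        log_survival += math.log(lam / (alpha + lam))
    return 1.0 - math.exp(log_survival)

alpha = 1.0
explosive = killing_probability(lambda n: float(n * n), alpha)
non_explosive = killing_probability(lambda n: float(n), alpha)
closed_form = 1.0 - math.pi * math.sqrt(alpha) / math.sinh(math.pi * math.sqrt(alpha))
print(explosive, closed_form, non_explosive)
```

With quadratic rates the killing probability stays strictly below one, which matches the explosive case. With linear rates the truncated value tends to one as `n_terms` grows, which matches the non-explosive case.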
According to Corollary 2.2.1, to show the sufficiency of Condition 2.2.1 for the non-explosiveness of a controlled process X (·) under a fixed natural Markov strategy, one simply needs to show the validity of the second equality in (2.25) about the discrete-time Markov chain {(Tˆn , Xˆ n )}∞ n=0 . This is done in the next statement.
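Theorem 2.2.1 Under a natural Markov strategy $\breve\pi$, the Markov pure jump process $X(\cdot)$ with the corresponding transition function $p_q$ is non-explosive if Condition 2.2.1 is satisfied.

Proof In view of Corollary 2.2.1, suppose for contradiction that there exists some $(t^*,x^*)\in\mathbb{R}_+^0\times\mathbf{X}$ such that
\[
\tilde P_{(t^*,x^*)}(\sigma_{\{(\infty,x_\infty)\}} < \infty) < 1.
\]
Let $M, M_1\in\mathbb{N}$ be large enough, so that
\[
w(t^*,x^*) < M\bigl(1 - \tilde P_{(t^*,x^*)}(\sigma_{\{(\infty,x_\infty)\}} < \infty)\bigr),\qquad M_1 > M,\qquad (t^*,x^*)\in\tilde V_{M_1},
\tag{2.26}
\]
and
\[
\inf_{(t,x)\in(\mathbb{R}_+^0\times\mathbf{X})\setminus\tilde V_{M_1}} w(t,x) > M.
\tag{2.27}
\]
The existence of such constants $M$ and $M_1$ is guaranteed under Condition 2.2.1. Let us extend the definition of the function $w$ by putting $w(t,\Delta) = w(\infty,x_\infty) = 0$, $\forall\, t\in\mathbb{R}_+^0$. According to (2.21) and simple iterations, we see that for each $n\ge 0$,
\[
\begin{aligned}
w(t,x) &\ge \int_{\mathbb{R}_+^0\times\mathbf{X}} \tilde G^n(ds\times dy|t,x)\,w(s,y)
 \ge \int_{(\mathbb{R}_+^0\times\mathbf{X})\setminus\tilde V_{M_1}} \tilde G^n(ds\times dy|t,x)\,w(s,y)\\
&\ge M\,\tilde G^n\bigl((\mathbb{R}_+^0\times\mathbf{X})\setminus\tilde V_{M_1}\,\big|\,t,x\bigr)\\
&= M\bigl(1 - \tilde G^n(\tilde V_{M_1}|t,x) - \tilde G^n(\mathbb{R}_+^0\times\{\Delta\}|t,x)\bigr),
\quad \forall\,(t,x)\in\mathbb{R}_+^0\times\mathbf{X},
\end{aligned}
\tag{2.28}
\]
where the last inequality is by (2.27); $\tilde G^n$ is the n-th power of the transition probability $\tilde G$. Remember, the transition from $\mathbb{R}_+^0\times\mathbf{X}$ to $(\infty,x_\infty)$ is only possible through the transition first to $\mathbb{R}_+^0\times\{\Delta\}$. In the last expression of the above equality,
\[
\tilde G^n(\mathbb{R}_+^0\times\{\Delta\}|t,x) \le \tilde P_{(t,x)}(\sigma_{\{(\infty,x_\infty)\}} \le n+1),\quad \forall\, n\in\mathbb{N},\ (t,x)\in\mathbb{R}_+^0\times\mathbf{X},
\]
so that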
\[
\lim_{n\to\infty} \tilde G^n(\mathbb{R}_+^0\times\{\Delta\}|t,x) \le \tilde P_{(t,x)}\bigl(\sigma_{\{(\infty,x_\infty)\}} < \infty\bigr),
\quad \forall\,(t,x)\in\mathbb{R}_+^0\times\mathbf{X}.
\tag{2.29}
\]
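On the other hand, note that, for each $n\in\mathbb{N}$ and $(v,x)\in\tilde V_n$,
\[
\begin{aligned}
\tilde G(\Gamma|v,x)
&\ge I\{\Gamma\supseteq\mathbb{R}_+^0\times\{\Delta\}\}\Bigl\{1 - \int_{(0,\infty)} e^{-\alpha t-\int_{(0,t]} q_x(s+v)ds}\,q_x(t+v)\,dt\Bigr\}\\
&= I\{\Gamma\supseteq\mathbb{R}_+^0\times\{\Delta\}\}\int_{(0,\infty)} \alpha\,e^{-\alpha t}\,e^{-\int_{(0,t]} q_x(s+v)ds}\,dt\\
&\ge I\{\Gamma\supseteq\mathbb{R}_+^0\times\{\Delta\}\}\,\frac{\alpha}{\alpha+\sup_{x\in V_n,\ t\in\mathbb{R}_+^0}\{q_x(t)\}},
\quad \forall\,\Gamma\in\mathcal{B}(\tilde{\mathbf{X}}),
\end{aligned}
\tag{2.30}
\]
cf. (2.19). Remember the notation used here that $V_n$ is the projection of $\tilde V_n$ on $\mathbf{X}$. Let us justify
\[
\lim_{n\to\infty} \tilde G^n(\tilde V_{M_1}|t,x) = 0,\quad \forall\,(t,x)\in\tilde V_{M_1}
\tag{2.31}
\]
as follows. Indeed, by (2.30) and the fact that $(\infty,x_\infty)$ is absorbing,
\[
\tilde P_{(t,x)}(\eta_{\tilde V_{M_1}} < 1) \ge \tilde G(\mathbb{R}_+^0\times\{\Delta\}|t,x)
\ge \frac{\alpha}{\alpha+\sup_{x\in V_{M_1},\ t\in\mathbb{R}_+^0}\{q_x(t)\}} \in (0,1),\quad \forall\,(t,x)\in\tilde V_{M_1}.
\]
Thus,
\[
\tilde P_{(t,x)}(\eta_{\tilde V_{M_1}} \ge 1) \le 1 - \frac{\alpha}{\alpha+\sup_{x\in V_{M_1},\ t\in\mathbb{R}_+^0}\{q_x(t)\}} \in (0,1),\quad \forall\,(t,x)\in\tilde V_{M_1}.
\]
Suppose
\[
\tilde P_{(t,x)}(\eta_{\tilde V_{M_1}} \ge k) \le \Bigl(1 - \frac{\alpha}{\alpha+\sup_{x\in V_{M_1},\ t\in\mathbb{R}_+^0}\{q_x(t)\}}\Bigr)^{k},\quad \forall\,(t,x)\in\tilde V_{M_1}
\tag{2.32}
\]
for some $k\ge 1$. Then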
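\[
\begin{aligned}
\tilde P_{(t,x)}(\eta_{\tilde V_{M_1}} \ge k+1)
&= \int_{\tilde V_{M_1}} \tilde P_{(t,x)}\bigl((\hat T_{\tau_{\tilde V_{M_1}}(k)},\hat X_{\tau_{\tilde V_{M_1}}(k)})\in ds\times dy,\ \tau_{\tilde V_{M_1}}(k) < \infty\bigr)\,\tilde P_{(s,y)}(\eta_{\tilde V_{M_1}} \ge 1)\\
&\le \Bigl(1 - \frac{\alpha}{\alpha+\sup_{x\in V_{M_1},\ t\in\mathbb{R}_+^0}\{q_x(t)\}}\Bigr)^{k+1},\quad \forall\,(t,x)\in\tilde V_{M_1}.
\end{aligned}
\]
Thus, (2.32) holds for all $k\ge 1$. It follows that
\[
\tilde E_{(t,x)}\bigl[\eta_{\tilde V_{M_1}}\bigr] = \sum_{k\ge 1} \tilde P_{(t,x)}(\eta_{\tilde V_{M_1}} \ge k) < \infty,\quad \forall\,(t,x)\in\tilde V_{M_1}.
\]
Thus (2.31) holds. Now apply (2.28), (2.29) and (2.31) with $(t,x)$ being replaced by $(t^*,x^*)$. We see that
\[
w(t^*,x^*) \ge M\bigl(1 - \tilde P_{(t^*,x^*)}(\sigma_{\{(\infty,x_\infty)\}} < \infty)\bigr).
\]
This contradicts (2.26). The statement thus follows.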
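For a time-homogeneous Q-function on a countable state space and a function $w$ that does not depend on time, the time integral in (2.21) can be evaluated explicitly, since $\int_{(0,\infty)} e^{-(\alpha+q_x)t}dt = 1/(\alpha+q_x)$. Inequality (2.21) then reduces to $\sum_y w(y)\,\tilde q(\{y\}|x) \le (\alpha+q_x)\,w(x)$. The Python sketch below checks this reduced inequality on a finite range of states for two hypothetical birth processes with $w(x) = x+1$. The processes, the choice of $w$, the interface and the function name are assumptions for illustration only. A failed check for one particular $w$ does not by itself establish explosiveness; a successful check on all states, together with parts (a), (b) and (c) of Condition 2.2.1, is what Theorem 2.2.1 requires.

```python
def satisfies_reduced_drift(q_tilde, w, alpha, states, tol=1e-12):
    """Check sum_y w(y) q_tilde(y|x) <= (alpha + q_x) w(x) for every x in `states`.

    q_tilde(x) returns a dict {y: rate} of off-diagonal jump rates (hypothetical
    interface); the Q-function is assumed conservative, so q_x is their sum.
    """
    for x in states:
        rates = q_tilde(x)
        q_x = sum(rates.values())
        lhs = sum(rate * w(y) for y, rate in rates.items())
        if lhs > (alpha + q_x) * w(x) + tol:
            return False
    return True

def w(x):
    return x + 1.0

def linear_birth(x):
    return {x + 1: x + 1.0}            # rate x + 1 from x to x + 1

def quadratic_birth(x):
    return {x + 1: (x + 1.0) ** 2}     # rate (x + 1)^2 from x to x + 1

linear_ok = satisfies_reduced_drift(linear_birth, w, alpha=1.0, states=range(1000))
quadratic_ok = satisfies_reduced_drift(quadratic_birth, w, alpha=1.0, states=range(1000))
print(linear_ok, quadratic_ok)
```

For the linear rates the inequality holds with equality when $\alpha = 1$, so any $\alpha \ge 1$ works with this $w$. For the quadratic rates it already fails at $x = 1$.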
The next statement shows that, under fairly general assumptions, under a fixed natural Markov strategy $\breve\pi$, Condition 2.2.1 is also necessary for the non-explosiveness of the Markov pure jump process $X(\cdot)$.

Theorem 2.2.2 Suppose the following assumptions hold:
(a) the topological Borel space $\mathbf{X}$ is locally compact; and, under a fixed natural Markov strategy $\breve\pi$,
(b) $\sup_{x\in\Gamma,\ t\ge 0} q_x(t) < \infty$ for each compact subset $\Gamma$ of $\mathbf{X}$; and
(c) the function
\[
(v,x)\in[0,\infty)\times\mathbf{X} \mapsto \int_{\mathbf{X}}\int_{(0,\infty)} f(t,y)\,\tilde G(dt\times dy|v,x)
\tag{2.33}
\]
is continuous for each bounded continuous function $f$ on $[0,\infty)\times\mathbf{X}$.
If the corresponding Markov pure jump process $X(\cdot)$ is non-explosive under the natural Markov strategy $\breve\pi$, then Condition 2.2.1 is satisfied.

Proof By the local compactness assumption, there exist monotone nondecreasing sequences of open precompact sets $\{U_n^\circ\}_{n=1}^\infty\subseteq\mathcal{B}(\mathbb{R}_+^0)$ and $\{V_n^\circ\}_{n=1}^\infty\subseteq\mathcal{B}(\mathbf{X})$ such that $U_n^\circ\uparrow\mathbb{R}_+^0$ and $V_n^\circ\uparrow\mathbf{X}$ as $n\to\infty$. Thus, $\{U_n^\circ\times V_n^\circ\}_{n=1}^\infty$ is a monotone nondecreasing sequence of open precompact subsets of $\mathbb{R}_+^0\times\mathbf{X}$ such that
\[
(U_n^\circ\times V_n^\circ)\uparrow(\mathbb{R}_+^0\times\mathbf{X}).
\tag{2.34}
\]
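For each $n\in\mathbb{N}$, define
\[
w_n(t,x) = \tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < \sigma_{\mathbb{R}_+^0\times\{\Delta\}}\bigr)
\]
for each $(t,x)\in\tilde{\mathbf{X}}$. For each $n\in\mathbb{N}$, it is clear that
\[
\int_{\mathbb{R}_+^0\times\mathbf{X}} w_n(s,y)\,\tilde G(ds\times dy|t,x) = w_n(t,x),\quad \forall\,(t,x)\in U_n^\circ\times V_n^\circ,
\]
and
\[
w_n(t,x) = 1,\quad \forall\,(t,x)\in(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ),
\tag{2.35}
\]
and thus
\[
\int_{\mathbb{R}_+^0\times\mathbf{X}} w_n(s,y)\,\tilde G(ds\times dy|t,x) \le w_n(t,x),\quad \forall\,(t,x)\in\mathbb{R}_+^0\times\mathbf{X}.
\tag{2.36}
\]
On the other hand, for each $n,m\in\mathbb{N}$, for all $(t,x)\in\mathbb{R}_+^0\times\mathbf{X}$,
\[
\begin{aligned}
w_n(t,x)
&= \tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < \sigma_{\mathbb{R}_+^0\times\{\Delta\}},\ \sigma_{\mathbb{R}_+^0\times\{\Delta\}} < m\bigr)\\
&\quad + \tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < \sigma_{\mathbb{R}_+^0\times\{\Delta\}},\ \sigma_{\mathbb{R}_+^0\times\{\Delta\}} \ge m\bigr)\\
&\le \tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < m\bigr) + \tilde P_{(t,x)}\bigl(\sigma_{\mathbb{R}_+^0\times\{\Delta\}} \ge m\bigr).
\end{aligned}
\tag{2.37}
\]
We will construct the desired function $w$ based on the functions $w_n$, but beforehand we need to establish several properties of the functions which appeared above in (2.37).

According to (2.33), for each $n,m\in\mathbb{N}$, the function of $(t,x)\in\mathbb{R}_+^0\times\mathbf{X}$ defined by
\[
\tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} \ge m\bigr)
= \Bigl(\tilde G|_{(U_n^\circ\times V_n^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}}\Bigr)^{m-1}
\bigl((U_n^\circ\times V_n^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}\,\big|\,t,x\bigr)
\]
is lower semi-continuous. For each fixed $n$, the proof follows from a standard induction argument with respect to $m$. Here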
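\[
\begin{aligned}
&\tilde G|_{(U_n^\circ\times V_n^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}}(ds\times dy|t,x)\\
&\quad= I\{(s,y)\in(U_n^\circ\times V_n^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}\}\\
&\qquad\times I\{(t,x)\in(U_n^\circ\times V_n^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}\}\,\tilde G(ds\times dy|t,x)
\end{aligned}
\]
is the restriction of the kernel $\tilde G$ on the open set $(U_n^\circ\times V_n^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}$. Therefore, function $\tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < m\bigr)$ is upper semi-continuous on $\mathbb{R}_+^0\times\mathbf{X}$.

By the same argument, the function $\tilde P_{(t,x)}\bigl(\sigma_{\mathbb{R}_+^0\times\{\Delta\}} \ge m\bigr)$ is upper semi-continuous in $(t,x)\in\mathbb{R}_+^0\times\mathbf{X}$ because the set $\tilde{\mathbf{X}}\setminus(\mathbb{R}_+^0\times\{\Delta\})$ is closed. Clearly, $\tilde P_{(t,x)}\bigl(\sigma_{\mathbb{R}_+^0\times\{\Delta\}} \ge m\bigr) = \tilde P_{(t,x)}\bigl(\sigma_{\{(\infty,x_\infty)\}} > m\bigr)$, and this expression decreases with $m$. According to Corollary 2.2.1, for all $(t,x)\in\mathbb{R}_+^0\times\mathbf{X}$,
\[
\lim_{m\to\infty} \tilde P_{(t,x)}\bigl(\sigma_{\mathbb{R}_+^0\times\{\Delta\}} \ge m\bigr) = \lim_{m\to\infty} \tilde P_{(t,x)}\bigl(\sigma_{\{(\infty,x_\infty)\}} > m\bigr) = 0.
\tag{2.38}
\]
Remember, the sets $U_i^\circ$ and $V_i^\circ$ are precompact, so that the closures $\mathrm{cl}(U_i^\circ\times V_i^\circ)$ are compact for all $i\in\mathbb{N}$. Therefore, the convergence in (2.38) is uniform with respect to $(t,x)$ on each compact $\mathrm{cl}(U_i^\circ\times V_i^\circ)$: see Corollary B.1.1.

Next, for a fixed value of $m$, expression $\tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < m\bigr)$ decreases with $n$ because of (2.34). Moreover, the limit equals zero. To show this, notice that, for each $\tilde\omega\in\tilde\Omega$, there exists some $N$ such that $(\hat T_j(\tilde\omega),\hat X_j(\tilde\omega))\in(U_N^\circ\times V_N^\circ)\cup(\mathbb{R}_+^0\times\{\Delta\})\cup\{(\infty,x_\infty)\}$ for each j = 0, 1, . . . , m, so that $I\bigl\{\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_N^\circ\times V_N^\circ)} < m\bigr\} = 0$, and hence
\[
\lim_{n\to\infty} I\bigl\{\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < m\bigr\} = 0
\]
for each $\tilde\omega\in\tilde\Omega$. Now the equality
\[
\lim_{n\to\infty} \tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_n^\circ\times V_n^\circ)} < m\bigr) = 0
\tag{2.39}
\]
holds after one applies Lebesgue’s Dominated Convergence Theorem. Again, the convergence in (2.39) is uniform with respect to $(t,x)$ on each compact $\mathrm{cl}(U_i^\circ\times V_i^\circ)$.

Now, for each fixed $i\in\mathbb{N}$, we take $m_i$ such that
\[
\tilde P_{(t,x)}\bigl(\sigma_{\{(\infty,x_\infty)\}} > m_i\bigr) \le \frac{1}{2^{i+1}}
\]
for all $(t,x)\in\mathrm{cl}(U_i^\circ\times V_i^\circ)$. And, for that value of $m_i$, we choose $n_i\in\mathbb{N}$ such that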
\[
\tilde P_{(t,x)}\bigl(\sigma_{(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_{n_i}^\circ\times V_{n_i}^\circ)} < m_i\bigr) \le \frac{1}{2^{i+1}}
\]
for all $(t,x)\in\mathrm{cl}(U_i^\circ\times V_i^\circ)$. As a result, according to (2.37), we have that
\[
w_{n_i}(t,x) \le \frac{1}{2^{i}},\quad \forall\,(t,x)\in\mathrm{cl}(U_i^\circ\times V_i^\circ).
\tag{2.40}
\]
It is possible to take $(n_i)$ as a strictly increasing sequence, as is assumed to have been done. For each fixed $k\in\mathbb{N}$, for all $(t,x)\in\mathrm{cl}(U_k^\circ\times V_k^\circ)$, the inequality $w_{n_i}(t,x) \le \frac{1}{2^i}$ holds for i = k, k + 1, . . . because the sequence of subsets $\{\mathrm{cl}(U_k^\circ\times V_k^\circ)\}_{k=1}^\infty$ is nondecreasing. Now we define
\[
w(t,x) = \sum_{i\in\mathbb{N}} w_{n_i}(t,x),\quad \forall\,(t,x)\in\mathbb{R}_+^0\times\mathbf{X},
\]
and put $U_n = \mathrm{cl}(U_n^\circ)$, $V_n = \mathrm{cl}(V_n^\circ)$, and $\tilde V_n = U_n\times V_n = \mathrm{cl}(U_n^\circ\times V_n^\circ)$. For each $k\in\mathbb{N}$ and $(t,x)\in\tilde V_k$, since $w_{n_i}(t,x)\le 1$,
\[
w(t,x) \le k-1+\sum_{i\ge k} w_{n_i}(t,x) \le k.
\]
Now, the function $w$ is $\mathbb{R}_+^0$-valued, and satisfies the condition
\[
\sup_{(t,x)\in U_n\times V_n} w(t,x) < \infty.
\]
It follows from (2.36) that the function $w$ satisfies (2.21).

The sequence of sets $\{(\mathbb{R}_+^0\times\mathbf{X})\setminus\tilde V_i\}_{i=1}^\infty$ is nonincreasing, and all of them are nonempty because $U_i$ are compact and $\mathbb{R}_+^0$ is not. Let $k\in\mathbb{N}$ and $(t,x)\in(\mathbb{R}_+^0\times\mathbf{X})\setminus\tilde V_k$ be fixed. If $n_i\le k$ then $(t,x)\in(\mathbb{R}_+^0\times\mathbf{X})\setminus(U_{n_i}^\circ\times V_{n_i}^\circ)$ and, according to (2.35), $w_{n_i}(t,x) = 1$. As a result, if $n_1,n_2,\ldots,n_m\le k$ then $w(t,x)\ge m$. Clearly, $m\to\infty$ as $k\to\infty$. Hence (2.20) holds true. Finally, (2.19) holds because of assumption (b) of the current theorem.

In order to connect to a known sufficient condition in the literature for the nonexplosiveness of a nonhomogeneous continuous-time Markov chain, see Sect. A.4, as well as an application of Theorem 2.2.1, we show that the next condition is sufficient for the non-explosiveness of the process $X(\cdot)$ under a fixed natural Markov strategy $\breve\pi$.

Condition 2.2.2 There exist a monotone nondecreasing sequence $(V_n)\subseteq\mathcal{B}(\mathbf{X})$ and a $\mathbb{R}_+$-valued measurable function $w$ on $\mathbb{R}_+^0\times\mathbf{X}$ such that the following assertions hold.
(a) As $n\uparrow\infty$, $V_n\uparrow\mathbf{X}$.
(b) For each $T\in\mathbb{R}_+$, as $n\uparrow\infty$,
\[
\inf_{v\in[0,T],\ x\in\mathbf{X}\setminus V_n} w(v,x) \uparrow \infty.
\]
(c) For each $T\in\mathbb{R}_+$, and $n\in\mathbb{N}$,
\[
\sup_{v\in[0,T],\ x\in V_n} q_x(v) < \infty.
\]
(d) For each $T\in\mathbb{R}_+$, there is some constant $\alpha_T\in\mathbb{R}_+$ such that
\[
\int_{(0,T-v]}\int_{\mathbf{X}} w(t+v,y)\,e^{-\alpha_T t-\int_{(0,t]} q_x(s+v)ds}\,\tilde q(dy|x,t+v)\,dt \le w(v,x),\quad \forall\, v\in[0,T],\ x\in\mathbf{X}.
\]
To serve the verification of the sufficiency of the above condition for the nonexplosiveness of the controlled process under a natural Markov strategy, let us point out the following. All the previous arguments given in this chapter are valid when $q(dy|x,s)$ is an arbitrarily fixed (conservative and stable) Q-function in the uncontrolled case, and thus the statements presented earlier in this chapter all hold after one systematically omits therein the term “Under a fixed natural Markov strategy $\breve\pi$”. This can be seen by inspecting the proofs of all the involved statements.

Now we are in position to present the sufficiency of Condition 2.2.2 as a consequence of Theorem 2.2.1.

Corollary 2.2.2 Under a fixed natural Markov strategy $\breve\pi$, the Markov pure jump process $X(\cdot)$ with the corresponding transition function $p_q(s,x,t,dy)$ is nonexplosive if Condition 2.2.2 is satisfied.

Proof For each fixed $T\in\mathbb{R}_+^0$, let us consider the following Q-function on $\mathbf{X}$:
\[
{}_Tq(\Gamma|x,t) := q(\Gamma|x,t)\,I\{t\in[0,T]\},\quad \forall\,\Gamma\in\mathcal{B}(\mathbf{X}).
\tag{2.41}
\]
According to Sect. 2.1.2, cf. (2.2) and (2.3), one can see that in particular,
\[
\begin{aligned}
p_q(0,x,t,\Gamma) &= p_{{}_Tq}(0,x,t,\Gamma), && \forall\, x\in\mathbf{X},\ t\in[0,T],\ \Gamma\in\mathcal{B}(\mathbf{X});\\
p_{{}_Tq}(0,x,t,\Gamma) &= p_{{}_Tq}(0,x,T,\Gamma), && t\ge T,\ \Gamma\in\mathcal{B}(\mathbf{X}).
\end{aligned}
\tag{2.42}
\]
Note that under Condition 2.2.2, after defining $\tilde V_n = [0,\infty)\times V_n$, the corresponding version of Condition 2.2.1 for the Q-function ${}_Tq$ is satisfied for each $T\in[0,\infty)$. Here, keeping in mind the definition (2.41), only the validity of the corresponding version of (2.20) needs a special explanation, as follows. For each fixed $T\in\mathbb{R}_+^0$, we can modify the definition of $w$ by putting
\[
w_T(v,x) = I\{v>T\}\,w(0,x) + w(v,x)\,I\{v\in[0,T]\}
\]
92
2 Selected Properties of Controlled Processes
for each x ∈ X. In this way, we obtain for each T ∈ R0+ , inf
x∈X\Vn , v∈[0,∞)
wT (v, x) =
inf
x∈X\Vn , v∈[0,T ]
w(v, x) ↑ ∞, ∀ x ∈ X,
as $n\to\infty$ under Condition 2.2.2. Consequently, for each $T\in\mathbb{R}_+$, the corresponding version of Condition 2.2.1 for the Q-function ${}_Tq$ is satisfied by the function $w_T$ and the sequence of sets $\tilde V_n$. By Theorem 2.2.1, for each $T\ge 0$ and $x\in X$, $p_{{}_Tq}(0,x,t,X)=1$ for all $t\in\mathbb{R}_+^0$, and thus
\[
p_q(0,x,t,X) = 1, \quad \forall\, x\in X,\ t\in\mathbb{R}_+^0,
\]
by (2.42). Now the statement follows from the above equality and Lemma 2.2.1.

The statement of Corollary 2.2.2 will be used when our conditions are compared with a known sufficient condition in the literature for the non-explosiveness of a nonhomogeneous continuous-time Markov chain; see Sect. A.4. Next, we complement the previous material by providing a criterion for the explosiveness of the process $X(\cdot)$ under a fixed natural Markov strategy $\breve\pi$.

Theorem 2.2.3 Under a fixed natural Markov strategy $\breve\pi$, the process $X(\cdot)$ is explosive if and only if for some $\alpha>0$, there is a nontrivial bounded nonnegative measurable function $u$ on $\mathbb{R}_+^0\times X$ satisfying the following inequality:
\[
u(v,x) \le \int_{(0,\infty)}\int_X u(v+t,y)\,\tilde q(dy|x,v+t)\, e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,dt, \quad \forall\, x\in X,\ v\in\mathbb{R}_+^0. \tag{2.43}
\]
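Proof It follows from Propositions C.2.4 and C.2.8 that the function
\[
\begin{aligned}
(v,x)\in[0,\infty)\times X \ \mapsto\ & E^{(v,x)}\Bigl[\sum_{n=0}^{\infty} I\{\hat X_n\in X\}\int_{(0,\infty)} \alpha e^{-\alpha t}\, e^{-\int_{(0,t]} q_{\hat X_n}(s+\hat T_n)\,ds}\,dt\Bigr]\\
&= P^{(v,x)}\bigl(\sigma_{\mathbb{R}_+^0\times\{\Delta\}}<\infty\bigr) = \alpha\int_{(0,\infty)} e^{-\alpha t}\,p_q(v,x,v+t,X)\,dt
\end{aligned}
\]
is the minimal nonnegative solution to the following inequality
\[
\begin{aligned}
w(v,x) \ge{}& \int_{(0,\infty)} \alpha e^{-\alpha t}\, e^{-\int_{(0,t]} q_x(s+v)\,ds}\,dt\\
&+ \int_{(0,\infty)}\int_X w(v+t,y)\,\tilde q(dy|x,v+t)\, e^{-\alpha t-\int_{(0,t]} q_x(v+s)\,ds}\,dt, \quad \forall\, x\in X,\ v\in\mathbb{R}_+^0
\end{aligned} \tag{2.44}
\]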
out of the class of $[0,1]$-valued measurable functions on $[0,\infty)\times X$. Equivalently, the function
\[
u(v,x) = 1 - w(v,x) = 1 - \alpha\int_{(0,\infty)} e^{-\alpha t}\,p_q(v,x,v+t,X)\,dt
\]
is the maximal solution to inequality (2.43) out of the class of $[0,1]$-valued measurable functions on $\mathbb{R}_+^0\times X$. The statement immediately follows from this and Lemma 2.2.2; see also Corollary 2.2.1. (Note that there is a bounded nonnegative (measurable) solution to inequality (2.43) if and only if there is a $[0,1]$-valued solution to it, because if $u$ satisfies inequality (2.43) then, for any $C\in\mathbb{R}_+^0$, $Cu$ also satisfies that inequality.)

The next statement is immediate by inspecting the proof of Theorem 2.2.3.

Corollary 2.2.3 Under a fixed natural Markov strategy $\breve\pi$, the process $X(\cdot)$ is non-explosive if and only if, for any $\alpha>0$, the equation
\[
u(v,x) = \int_{(0,\infty)}\int_X u(v+t,y)\,\tilde q(dy|x,v+t)\, e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,dt, \quad \forall\, x\in X,\ v\in\mathbb{R}_+^0, \tag{2.45}
\]
has no other nonnegative bounded solutions except for $u(v,x)\equiv 0$.

Proof The function $w$ defined in (2.44) is in fact the minimal nonnegative solution to the equation
\[
w(v,x) = \int_{(0,\infty)} \alpha e^{-\alpha t}\, e^{-\int_{(0,t]} q_x(s+v)\,ds}\,dt + \int_{(0,\infty)}\int_X w(v+t,y)\,\tilde q(dy|x,v+t)\, e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,dt.
\]
Arguing exactly as in the proof of Theorem 2.2.3, we see that the process $X(\cdot)$ is explosive if and only if $w(\cdot)\in[0,1]$ is not identically equal to 1, that is, the function $u=1-w$, satisfying Eq. (2.45), is not identically zero.
2.2.2 The Homogeneous Case

Let us say a few words about the homogeneous case, i.e., when the Q-function $q(dy|x,s)$ on $X$ does not depend on time $s\ge 0$, which is the case under a stationary π-strategy. In this case, we do not include $s$ in expressions such as $q(dy|x,s)$, $q_x(s)$, etc. For a homogeneous Q-function, we formulate the next condition.

Condition 2.2.3 There exist a monotone nondecreasing sequence $\{V_n\}_{n=1}^\infty$ of measurable subsets of $X$ and an $\mathbb{R}_+^0$-valued measurable function $w$ on $X$ such that the following hold.
(a) As $n\uparrow\infty$, $V_n\uparrow X$.
(b) For each $n=1,2,\dots$, $\sup_{x\in V_n} q_x<\infty$.
(c) As $n\uparrow\infty$, $\inf_{x\in X\setminus V_n} w(x)\uparrow\infty$.
(d) For some constant $\alpha\in\mathbb{R}_+$,
\[
\int_X w(y)\,q(dy|x) \le \alpha w(x), \quad \forall\, x\in X.
\]
Non-homogeneous versions of Condition 2.2.3 are given in Sect. 2.2.1 and Appendix A.4. In this subsection, we only consider the homogeneous case. Then $X(\cdot)$ is a homogeneous Markov pure jump process. Instead of the discrete-time Markov chain $\{(\hat T_n,\hat X_n)\}_{n=0}^\infty$, one can simply focus on the discrete-time Markov chain $\{\hat X_n\}_{n=0}^\infty$, whose transition probability can be obtained automatically from the expressions for the transition probability presented in Sect. 2.2.1. Keeping the same notation $\tilde G$ for the underlying transition probability, we have
\[
\tilde G(\Gamma|z) = \frac{\tilde q(\Gamma\cap X|z)}{q_z+\alpha} + I\{\Delta\in\Gamma\}\,\frac{\alpha}{q_z+\alpha}, \quad \forall\, \Gamma\in\mathcal{B}(X\cup\{\Delta\}\cup\{x_\infty\}),
\]
if $z\in X$, and $\tilde G(\{x_\infty\}|\Delta)=\tilde G(\{x_\infty\}|x_\infty)=1$. By the way, the Markov chain $\{\hat X_n\}_{n=0}^\infty$ with the transition probability $\tilde G$ is also referred to as the “α-jump chain” of the homogeneous Markov pure jump process $X(\cdot)$. The reasoning for the nonhomogeneous case in the previous discussions all applies to this case. In particular, we have the following statement for the non-explosiveness and explosiveness of the homogeneous Markov pure jump process $X(\cdot)$.

Theorem 2.2.4 Consider the homogeneous Q-function $q$ on $X$. Then the following assertions hold.
(a) The homogeneous Markov pure jump process $X(\cdot)$ is non-explosive if Condition 2.2.3 is satisfied.
(b) Suppose the topological Borel space $X$ is a locally compact separable metric space, and the functions $x\in X\mapsto q_x$ and $x\in X\mapsto\int_X f(y)\,q(dy|x)$ are continuous for each bounded continuous function $f$ on $X$. If the homogeneous Markov pure jump process $X(\cdot)$ is non-explosive, then Condition 2.2.3 holds, where the sets $\{V_n\}_{n=1}^\infty$ can be taken as compact sets in $X$ such that
\[
\sup_{x\in V_n} w(x)<\infty, \quad \forall\, n=1,2,\dots.
\]
(c) The process $X(\cdot)$ is non-explosive if and only if for some and all $\alpha\in\mathbb{R}_+$, there is no nontrivial $[0,1]$-valued measurable function $u$ on $X$ satisfying
\[
\alpha u(x) = \int_X u(y)\,q(dy|x), \quad \forall\, x\in X. \tag{2.46}
\]
(d) The process $X(\cdot)$ is explosive if and only if there exist a nontrivial bounded nonnegative measurable function $u$ on $X$ and some $\alpha\in\mathbb{R}_+$ such that
\[
\alpha u(x) \le \int_X u(y)\,q(dy|x), \quad \forall\, x\in X.
\]
We emphasize that, if the state space $X$ is denumerable, with the discrete topology, then all the assumptions in Theorem 2.2.4(b) are obviously satisfied. Remember, the transition rate is stable: $\sup_{a\in A} q_x(a)<\infty$. In particular, if the state space $X$ is denumerable, then Condition 2.2.3 is necessary and sufficient for the non-explosiveness of the process $X(\cdot)$ in the homogeneous case. As an immediate consequence of Theorem 2.2.4(d), we have the next statement.

Corollary 2.2.4 Consider the homogeneous Q-function $q$ on $X$. Suppose there is a nontrivial bounded nonnegative measurable function $u$ on $X$ such that, for each $x\in X$,
\[
\int_X u(y)\,q(dy|x) \ge \varepsilon > 0.
\]
Then the process $X(\cdot)$ is explosive.

Proof Let $U\in\mathbb{R}_+$ be such that for each $x\in X$, $u(x)\le U$. Then, for $\alpha=\frac{\varepsilon}{U}\in\mathbb{R}_+$, we have
\[
\int_X u(y)\,q(dy|x) \ge \varepsilon \ge \alpha u(x), \quad \forall\, x\in X.
\]
Now the statement follows from assertion (d) of Theorem 2.2.4.
2.2.3 Possible Generalizations

Under a natural Markov strategy, the non-explosiveness of the controlled process $X(\cdot)$ is equivalent to (2.25); see Corollary 2.2.1. Then by Theorem 2.2.1, Condition 2.2.1 is sufficient for (2.25); and by Theorem 2.2.2, under additional regularity conditions, it is also necessary. Now consider a Markov π-strategy, say $\{\pi_n\}_{n=1}^\infty$. In this case, instead of $\{(\hat T_n,\hat X_n)\}_{n=0}^\infty$, it is natural to consider the discrete-time Markov chain $\{(\hat T_n,\hat X_n,n)\}_{n=0}^\infty$. The one-step transition probability for this Markov chain from $(t,x,n)\in\mathbb{R}_+^0\times X\times\mathbb{N}_0$ to $\mathcal{B}(\mathbb{R}_+^0\times X\times\mathbb{N}_0)$ is given by
\[
\tilde G(ds\times dy\times\{m\}|(t,x,n)) = I\{m=n+1\}\,I\{s>t\}\int_A \tilde q(dy|x,a)\,\pi_{n+1}(da|x,s-t)\; e^{-\int_{(t,s]}\int_A q_x(a)\,\pi_{n+1}(da|x,u-t)\,du-\alpha(s-t)}\,ds;
\]
and from $(t,x,n)\in\mathbb{R}_+^0\times X\times\mathbb{N}_0$ to $\mathcal{B}(\mathbb{R}_+^0\times\{\Delta\}\times\mathbb{N}_0)$ is given by
\[
\tilde G(ds\times\{\Delta\}\times\{m\}|(t,x,n)) = I\{m=n+1\}\,I\{s>t\}\,\alpha\, e^{-\int_{(t,s]}\int_A q_x(a)\,\pi_{n+1}(da|x,u-t)\,du-\alpha(s-t)}\,ds.
\]
Here we keep the same notation $\tilde G$ for the transition probability of the underlying discrete-time Markov chain as in Sect. 2.2.1. The transition from $(t,\Delta,n)\in\mathbb{R}_+^0\times\{\Delta\}\times\mathbb{N}_0$ to $(\infty,x_\infty,\infty)$ is deterministic, whereas the state $(\infty,x_\infty,\infty)$ is absorbing. Following the reasoning for Lemma 2.2.2 and Corollary 2.2.1, we see that the non-explosiveness of the process $X(\cdot)$ under this Markov π-strategy is equivalent to the absorption of the discrete-time Markov chain $\{(\hat T_n,\hat X_n,n)\}_{n=0}^\infty$ at $(\infty,x_\infty,\infty)$ from the initial state $(0,x,0)$ for each $x\in X$. Then one can formulate a version of Condition 2.2.1; now the function $w$ will have three arguments $(t,x,m)$, and the sets $\{\tilde V_n\}_{n=1}^\infty$ will be subsets of $\mathbb{R}_+^0\times X\times\mathbb{N}_0$, etc. More precisely, we have the following condition.

Condition 2.2.4 There exist a monotone nondecreasing sequence $\{\tilde V_n\}_{n=1}^\infty$ of measurable subsets of $\mathbb{R}_+^0\times X\times\mathbb{N}_0$ and an $\mathbb{R}_+^0$-valued measurable function $w$ on $\mathbb{R}_+^0\times X\times\mathbb{N}_0$ such that the following hold.
(a) As $n\uparrow\infty$, $\tilde V_n\uparrow\mathbb{R}_+^0\times X\times\mathbb{N}_0$.
(b) For each $n=1,2,\dots$,
\[
\sup_{(x,m)\in V_n,\ t\in\mathbb{R}_+^0}\int_A q_x(a)\,\pi_{m+1}(da|x,t)<\infty,
\]
where $V_n$ denotes the projection of $\tilde V_n$ on $X\times\mathbb{N}_0$.
(c) As $n\uparrow\infty$,
\[
\inf_{(t,x,m)\in(\mathbb{R}_+^0\times X\times\mathbb{N}_0)\setminus\tilde V_n} w(t,x,m)\uparrow\infty.
\]
(d) For some $\alpha\in\mathbb{R}_+$, for any $x\in X$, $m\in\mathbb{N}_0$ and $v\in\mathbb{R}_+^0$,
\[
\int_{(0,\infty)}\int_X w(t+v,y,m+1)\int_A \tilde q(dy|x,a)\,\pi_{m+1}(da|x,t)\; e^{-\alpha t-\int_{(0,t]}\int_A q_x(a)\,\pi_{m+1}(da|x,s)\,ds}\,dt \le w(v,x,m).
\]
Arguing as in the proof of Theorem 2.2.1, the above condition implies the absorption of the discrete-time Markov chain $\{(\hat T_n,\hat X_n,n)\}_{n=0}^\infty$ at $(\infty,x_\infty,\infty)$ from each initial state $(t,x,n)\in\mathbb{R}_+^0\times X\times\mathbb{N}_0$, and is thus sufficient for the non-explosiveness of $X(\cdot)$ under the Markov π-strategy $\{\pi_n\}_{n=1}^\infty$. On the other hand, this condition is not necessary for the absorption of the discrete-time Markov chain $\{(\hat T_n,\hat X_n,n)\}_{n=0}^\infty$ at $(\infty,x_\infty,\infty)$ from the initial state $(0,x,0)$ for each $x\in X$, and is thus not necessary for the non-explosiveness of $X(\cdot)$ under the Markov π-strategy $\{\pi_n\}_{n=1}^\infty$. The following example demonstrates this.

Example 2.2.1 Consider the pure birth process with the state space $X=\{0,1,2,\dots\}$ and the birth rate $\lambda_i=i^2$, $i\in X$. The action space is $A=\{0,1\}$, where 0 means “do nothing” and 1 means “stop the birth process and, after the exponentially distributed random time (with rate $\mu>0$), kill the whole population”:
\[
q(j|i,0) = I\{j=i+1\}\,i^2 - I\{j=i\}\,i^2; \qquad q(j|i,1) = \mu I\{j=0\} - \mu I\{j=i\}
\]
for each $i,j\in X$. For the Markov π-strategy $\{\pi_n^M\}_{n=1}^\infty$ with $\pi_1^M(1|x,s)\equiv 1$, $\pi_n^M(0|x,s)\equiv 1$ for all $n\ge 2$, it is evident that $X(\cdot)$ is non-explosive (after the first jump it visits the state 0 and remains there forever). For each $x\in X$, starting from $(0,x,0)$, the discrete-time Markov chain $\{(\hat T_n,\hat X_n,n)\}_{n=0}^\infty$ hits the absorbing state $(\infty,x_\infty,\infty)$ within finite time with probability one. At the same time, starting from $(0,1,1)$, the probability that the chain hits the absorbing state $(\infty,x_\infty,\infty)$ is strictly less than one, because under the shifted strategy $\{\pi_2^M,\pi_3^M,\dots\}$, the process $X(\cdot)$ is explosive. (See Example 2.3.1.) Thus Condition 2.2.4 is not satisfied and hence not necessary for the non-explosiveness of the process $X(\cdot)$.
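The explosiveness of the pure birth process with rates $\lambda_i=i^2$ started from state 1, used in Example 2.2.1, can be illustrated numerically. The sojourn time in state $i$ is exponential with mean $1/i^2$, so the expected time to pass through all states is $\sum_{i\ge 1} 1/i^2=\pi^2/6<\infty$. The following Python sketch computes partial sums of this series; the function name `expected_passage_time` is an illustrative choice and does not come from the text.

```python
import math


def expected_passage_time(start: int, stop: int, power: float = 2.0) -> float:
    """Expected time for a pure birth process with rates lambda_i = i**power
    to move from state `start` to state `stop`, i.e. the sum of the mean
    sojourn times 1/lambda_i over the states start, ..., stop - 1."""
    if start < 1:
        raise ValueError("state 0 has birth rate 0 and is never left")
    return sum(1.0 / i**power for i in range(start, stop))


if __name__ == "__main__":
    # The partial sums stay bounded, which reflects a finite mean explosion time.
    for stop in (10, 1_000, 100_000):
        print(stop, expected_passage_time(1, stop))
    print("pi^2 / 6 =", math.pi**2 / 6)
```

A finite expected explosion time implies that the explosion time is finite with probability one, in agreement with the claim that the process is explosive under the shifted strategy.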
Therefore, the argument for deriving a necessary condition as in Theorem 2.2.2 and Corollary 2.2.3 for the non-explosiveness of X (·) under a natural Markov strategy cannot be directly extended to the case of a general strategy. In Sect. 2.2.4, we present a sufficient condition for the non-explosiveness simultaneously for all strategies.
2.2.4 A Condition for Non-explosiveness Under All Strategies

In this subsection, the main objective is to show that the following condition is sufficient for the non-explosiveness of the process $X(\cdot)$ simultaneously under all strategies.

Condition 2.2.5 There exist a monotone nondecreasing sequence of measurable subsets $\{V_n\}_{n=1}^\infty$ of $X$, a measurable function $w: X\to\mathbb{R}_+^0$ and constants $\rho\in\mathbb{R}$, $b\in\mathbb{R}_+^0$ such that the following assertions hold.
(a) $X=\bigcup_{l=1}^\infty V_l$, $\sup_{x\in V_l} w(x)<\infty$.
(b) For every $l\in\mathbb{N}$, $\sup_{x\in V_l,\ a\in A} q_x(a)<\infty$.
(c) $\lim_{l\to\infty}\inf_{x\in X\setminus V_l} w(x)=\infty$.
(d) For each $x\in X$ and $a\in A$,
\[
\int_X w(y)\,q(dy|x,a) \le \rho w(x)+b.
\]
Evidently, if $\sup_{x\in X,\ a\in A} q_x(a)<\infty$, then Condition 2.2.5 is satisfied. It is easy to check that Condition 2.2.5 implies Condition 2.2.1 is satisfied under all natural Markov strategies $\check\pi$. Indeed, suppose Condition 2.2.5 is satisfied by a sequence of subsets $\{V_n\}_{n=1}^\infty$ of $X$, some function $w: X\to\mathbb{R}_+^0$, and constants $\rho\in\mathbb{R}$, $b\in\mathbb{R}_+^0$. Without loss of generality we can assume that $w$ is $[1,\infty)$-valued, $\rho>0$, and $b=0$. Let some natural Markov strategy $\check\pi$ be arbitrarily fixed. Then for this natural Markov strategy, Condition 2.2.1 is satisfied with $\alpha=\rho$, $\tilde V_n:=\mathbb{R}_+^0\times V_n$, and $w(t,x):=w(x)$ (with slight abuse of notation). For the verification of Condition 2.2.1(d), note that
\[
\begin{aligned}
&\int_{(0,\infty)}\int_X w(t+v,y)\,e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,\tilde q(dy|x,t+v)\,dt\\
&\quad= \int_{(0,\infty)}\int_X w(y)\,e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,\tilde q(dy|x,t+v)\,dt\\
&\quad\le \int_{(0,\infty)} (\alpha+q_x(t+v))\,w(x)\,e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,dt = w(x) = w(v,x)
\end{aligned}
\]
for each $v\in\mathbb{R}_+^0$, where the inequality holds because of Condition 2.2.5(d). The verification of the other parts of Condition 2.2.1 is immediate. We shall show that the non-explosiveness of the process $X(\cdot)$ under all strategies follows from the non-explosiveness under all natural Markov strategies. This will lead to the sufficiency of Condition 2.2.5 for the non-explosiveness of $X(\cdot)$ under all strategies.

Definition 2.2.2 The natural Markov strategy $\check\pi$ is said to be induced by a π-strategy $S=\{\pi_n\}_{n=1}^\infty$ and an initial state $x\in X$ if
\[
P_x^S(t,dy\times da) = P_x^S(t,dy)\,\check\pi(da|y,t), \quad \forall\, t\in\mathbb{R}_+,
\]
on $\mathcal{B}(X\times A)$. Recall the notations $P_x^S(t,dy\times da)$ and $P_x^S(t,dy)$ for the marginal distributions: see Definition 1.1.5.

Lemma 2.2.3 Let a π-strategy $S=\{\pi_n\}_{n=1}^\infty$ and some initial state $x\in X$ be fixed. Then the following assertions hold:
(a) There exists a natural Markov strategy $\check\pi$ induced by the π-strategy $S=\{\pi_n\}_{n=1}^\infty$ and the initial state $x\in X$.
(b) The marginal distribution $P_x^S(t,dy)$ satisfies the Kolmogorov forward equation for the fixed initial time and state $(0,x)$ corresponding to the Q-function $q$ induced by the natural Markov strategy $\check\pi$.

Proof (a) The existence of the induced natural Markov strategy $\check\pi$ follows from Proposition B.1.33. Below, we justify the second claim in this statement.

(b) If we put $R=(0,t]$ with $t\in\mathbb{R}_+$, then the random measure $\mu$ given by (1.4) counts the entrances of the process $X(\cdot)$ in the set $\Gamma\in\mathcal{B}(X)$ on the interval $(0,t]$. Let us consider also the following random measure $\tilde\mu$ defined by
\[
\tilde\mu(\omega;R\times\Gamma) = \sum_{n\ge 1} I\{T_n(\omega)<\infty\}\,\delta_{(T_n(\omega),X_{n-1}(\omega))}(R\times\Gamma), \quad \forall\, R\in\mathcal{B}(\mathbb{R}_+),\ \Gamma\in\mathcal{B}(X),
\]
see Lemma A.1.2. For $R=(0,t]$, it counts the exits of the process from the set $\Gamma\in\mathcal{B}(X)$ on the interval $(0,t]$. Let some $q$-bounded set $\Gamma\in\mathcal{B}(X)$ be fixed. For each $m\in\{1,2,\dots\}$, it holds that, having omitted $\omega$ for brevity,
\[
I\{X(t\wedge T_m)\in\Gamma\} = I\{x\in\Gamma\} + \mu((0,t\wedge T_m]\times\Gamma) - \tilde\mu((0,t\wedge T_m]\times\Gamma),
\]
where all the expressions are finite and in fact bounded with respect to $\omega\in\Omega$. Upon taking expectations on both sides of the previous equality, we have
\[
P_x^S(X(t\wedge T_m)\in\Gamma) = I\{x\in\Gamma\} + E_x^S[\mu((0,t\wedge T_m]\times\Gamma)] - E_x^S[\tilde\mu((0,t\wedge T_m]\times\Gamma)], \tag{2.47}
\]
where all the expressions are finite. Note that
\[
\begin{aligned}
E_x^S[\tilde\mu((0,t\wedge T_\infty]\times\Gamma)] &= E_x^S[\tilde\nu((0,t\wedge T_\infty]\times\Gamma)]\\
&= E_x^S\Bigl[\int_{(0,t\wedge T_\infty]}\int_A q_{X(s)}(a)\,\Pi(da|s)\,I\{X(s)\in\Gamma\}\,ds\Bigr]\\
&= \int_{(0,t]} E_x^S\Bigl[\int_A q_{X(s)}(a)\,\Pi(da|s)\,I\{X(s)\in\Gamma\}\Bigr]ds < \infty.
\end{aligned} \tag{2.48}
\]
Here the process $\Pi$ is as defined in (1.8) and the compensator $\tilde\nu$ is as in Lemma A.1.2. The second equality holds by Lemma A.1.2, and the last inequality holds because $\Gamma$ is a $q$-bounded set. Recall that $X(s)=x_\infty\notin\Gamma$ for $s\ge T_\infty$. Let us verify that
\[
\lim_{m\to\infty} P_x^S(X(t\wedge T_m)\in\Gamma) = P_x^S(X(t)\in\Gamma),
\]
as follows. From the representation
\[
P_x^S(X(t\wedge T_m)\in\Gamma) = P_x^S(X(t)\in\Gamma,\ t<T_m) + P_x^S(X(T_m)\in\Gamma,\ t\ge T_m),
\]
we see that the required equality would hold if $\lim_{m\to\infty} P_x^S(X(T_m)\in\Gamma,\ t\ge T_m)=0$, because
\[
\lim_{m\to\infty} P_x^S(X(t)\in\Gamma,\ t<T_m) = P_x^S(X(t)\in\Gamma,\ t<T_\infty) = P_x^S(X(t)\in\Gamma),
\]
where the last equality holds because $\Gamma\subseteq X$ and $X(t)\notin X$ if $t\ge T_\infty$. Suppose for contradiction that $\limsup_{m\to\infty} P_x^S(X(T_m)\in\Gamma,\ t\ge T_m)>0$. Then there exist some $\varepsilon>0$ and a subsequence $\{m_k\}_{k=1}^\infty$ of $\{m\}_{m=1}^\infty$ such that $P_x^S(X(T_{m_k})\in\Gamma,\ t\ge T_{m_k})>\varepsilon$, $k=1,2,\dots$. Then
\[
E_x^S[\mu((0,t\wedge T_\infty]\times\Gamma)] \ge \sum_{k=1}^\infty P_x^S(X(T_{m_k})\in\Gamma,\ t\ge T_{m_k}) = \infty,
\]
which is the desired contradiction, because
\[
\begin{aligned}
E_x^S[\mu((0,t\wedge T_\infty]\times\Gamma)] &= \lim_{m\to\infty} E_x^S[\mu((0,t\wedge T_m]\times\Gamma)]\\
&\le 1+\lim_{m\to\infty} E_x^S[\tilde\mu((0,t\wedge T_m]\times\Gamma)]\\
&= 1+E_x^S[\tilde\mu((0,t\wedge T_\infty]\times\Gamma)] < \infty,
\end{aligned}
\]
where the first inequality is by (2.47) and the last inequality is by (2.48). Now by legitimately taking the limit as $m\to\infty$ on both sides of (2.47), we see
\[
\begin{aligned}
P_x^S(X(t)\in\Gamma) &= I\{x\in\Gamma\} + E_x^S[\mu((0,t\wedge T_\infty]\times\Gamma)] - E_x^S[\tilde\mu((0,t\wedge T_\infty]\times\Gamma)]\\
&= I\{x\in\Gamma\} + \int_{(0,t]} E_x^S\Bigl[\int_A \tilde q(\Gamma|X(s),a)\,\Pi(da|s)\Bigr]ds\\
&\quad - \int_{(0,t]} E_x^S\Bigl[\int_A q_{X(s)}(a)\,\Pi(da|s)\,I\{X(s)\in\Gamma\}\Bigr]ds,
\end{aligned}
\]
where the last equality holds by Lemmas A.1.1 and A.1.2. Note that all the expressions in the above equalities are finite. To complete the proof of this lemma, it remains to note that
\[
\begin{aligned}
E_x^S\Bigl[\int_A \tilde q(\Gamma|X(t),a)\,\Pi(da|t)\Bigr] &= \int_{X\times A} \tilde q(\Gamma|y,a)\,P_x^S(t,dy\times da)\\
&= \int_{X\times A} \tilde q(\Gamma|y,a)\,\check\pi(da|y,t)\,P_x^S(t,dy) = \int_X \tilde q(\Gamma|y,t)\,P_x^S(t,dy)
\end{aligned}
\]
and similarly
\[
E_x^S\Bigl[\int_A q_{X(t)}(a)\,\Pi(da|t)\,I\{X(t)\in\Gamma\}\Bigr] = \int_\Gamma q_y(t)\,P_x^S(t,dy).
\]
The proof is complete.
As a consequence of the definition of the induced natural Markov strategy, the above lemma and Corollary 2.1.2, we see that the next statement holds.

Corollary 2.2.5 Consider the natural Markov strategy $\check\pi$ induced by a given π-strategy $S=\{\pi_n\}_{n=1}^\infty$, which always exists, and a fixed initial state $x\in X$. Then
\[
P_x^{\check\pi}(t,dy\times da) \le P_x^S(t,dy\times da).
\]
Moreover, if the process $X(\cdot)$ is non-explosive under the natural Markov strategy $\check\pi$, then $P_x^{\check\pi}(t,dy\times da) = P_x^S(t,dy\times da)$.

Now we are in a position to present the main statement of this subsection.

Theorem 2.2.5 Suppose Condition 2.2.5 is satisfied. Then the process $X(\cdot)$ is non-explosive under each control strategy $S$.

Proof Let $x\in X$ and a strategy $\tilde S$ be fixed. The objective is to show that under Condition 2.2.5, $P_x^{\tilde S}(X(t)\in X)=1$ for each $t\in\mathbb{R}_+^0$. According to Theorem 1.1.1, there is a π-strategy, say $S$, such that
\[
P_x^{\tilde S}(X(t)\in X) = P_x^S(X(t)\in X)
\]
for all $t\in\mathbb{R}_+^0$. According to Corollary 2.2.5, there is a natural Markov strategy $\check\pi$ induced by $S$ and $x$ such that $P_x^{\check\pi}(X(t)\in X)\le P_x^S(X(t)\in X)$ for each $t\in\mathbb{R}_+^0$. On the other hand, as was shown, Condition 2.2.5 implies Condition 2.2.1 is satisfied under all natural Markov strategies $\check\pi$. Therefore, one can refer to Theorem 2.2.1 for the statement to be proved:
\[
P_x^{\tilde S}(X(t)\in X) = P_x^S(X(t)\in X) \ge P_x^{\check\pi}(X(t)\in X) = 1.
\]

We may deduce the following statement from Theorem 2.2.5 and Corollary 2.2.5.

Corollary 2.2.6 Suppose Condition 2.2.5 is satisfied. Consider the natural Markov strategy $\check\pi$ induced by a given π-strategy $S=\{\pi_n\}_{n=1}^\infty$, which always exists, and a fixed initial state $x\in X$. Then $P_x^{\check\pi}(t,dy\times da) = P_x^S(t,dy\times da)$.

Regarding the sufficiency of the class of natural Markov strategies for the CTMDP problems (1.15)–(1.18), let us also note the following consequence of Corollaries 2.2.5 and 2.2.6.
Remark 2.2.1 When the cost rates $c_j$ are $\mathbb{R}_+^0$-valued, or when Condition 2.2.5 is satisfied, the class of natural Markov strategies is sufficient for all the CTMDP problems for the fixed initial state x ∈ X. In fact, this assertion holds when the fixed initial state is replaced by a fixed initial distribution, say γ. In that case, one should add an extra state, say x, to the state space, with the uncontrolled transition rate equal to 1, whose post-jump distribution is given by γ, and then consider the CTMDP model with x as the initial state.
2.2.5 Direct Proof for Non-explosiveness Under All Strategies

It is possible to provide a more direct proof of Theorem 2.2.5. We provide it in this subsection, which also contains some auxiliary results to be referred to in the future.

Lemma 2.2.4 Let Condition 2.2.5(d) be satisfied and $\rho\neq 0$. Suppose that $\pi(da|x,u)$ is a stochastic kernel on $\mathcal{B}(A)$ given $(x,u)\in X\times\mathbb{R}_+$ and the kernel $\tilde q$ is given by (1.10). Then, for any $x\in X$, $0\le s\le t<\infty$,
\[
\int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv}\int_X h(u,y,t)\,\tilde q(dy|x,\pi,u)\,du + w(x)\,e^{-\int_{(s,t]}\tilde q(X|x,\pi,v)\,dv} \le h(s,x,t) = h(0,x,t-s),
\]
where $h$ is the $\mathbb{R}_+^0$-valued function defined for $x\in X$, $0\le s\le t<\infty$ by
\[
h(s,x,t) = w(x)\,e^{\rho(t-s)} + \frac{b}{\rho}\bigl(e^{\rho(t-s)}-1\bigr). \tag{2.49}
\]

Proof Straightforward calculations result in the following formulae:
\[
\begin{aligned}
&\int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv}\int_X h(u,y,t)\,\tilde q(dy|x,\pi,u)\,du + w(x)\,e^{-\int_{(s,t]}\tilde q(X|x,\pi,v)\,dv}\\
&\quad= \int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv+\rho(t-u)}\int_X w(y)\,\tilde q(dy|x,\pi,u)\,du\\
&\qquad+ \frac{b}{\rho}\int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv}\bigl(e^{\rho(t-u)}-1\bigr)\tilde q(X|x,\pi,u)\,du + w(x)\,e^{-\int_{(s,t]}\tilde q(X|x,\pi,v)\,dv}\\
&\quad\le \int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv+\rho(t-u)}\bigl[\rho w(x)+b+\tilde q(X|x,\pi,u)\,w(x)\bigr]du\\
&\qquad+ \frac{b}{\rho}\int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv}\bigl(e^{\rho(t-u)}-1\bigr)\tilde q(X|x,\pi,u)\,du + w(x)\,e^{-\int_{(s,t]}\tilde q(X|x,\pi,v)\,dv}\\
&\quad= w(x)\,e^{\rho(t-s)} + b\int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv+\rho(t-u)}\,du\\
&\qquad+ \frac{b}{\rho}\int_{(s,t]} e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv}\bigl(e^{\rho(t-u)}-1\bigr)\tilde q(X|x,\pi,u)\,du\\
&\quad= w(x)\,e^{\rho(t-s)} + \frac{b}{\rho}\int_{(s,t]}\bigl(\rho+\tilde q(X|x,\pi,u)\bigr)e^{-\int_{(s,u]}\tilde q(X|x,\pi,v)\,dv+\rho(t-u)}\,du\\
&\qquad+ \frac{b}{\rho}\bigl(e^{-\int_{(s,t]}\tilde q(X|x,\pi,v)\,dv}-1\bigr) = h(s,x,t),
\end{aligned}
\]
as desired.
We stress that the statement of Lemma 2.2.4 is still valid if the kernel π therein further depends on other parameters, e.g., if it has the form $\pi(da|x_0,\theta_1,x_1,\dots,\theta_n,x,u)$. In the forthcoming arguments, this remark will be automatically kept in mind when referring to Lemma 2.2.4.

Lemma 2.2.5 Suppose Condition 2.2.5(d) is satisfied and $\rho\neq 0$. Then, under each control strategy $S$, for each $x\in X$, $0\le n\le m$ and $t\in\mathbb{R}_+^0$,
\[
E_x^S\bigl[w(X(t))\,I\{t<T_{m+1}\}\,\big|\,H_{m-n}\bigr] \le I\{T_{m-n}\le t\}\,h(T_{m-n},X_{m-n},t) + \sum_{k=1}^{m-n} I\{T_{k-1}\le t<T_k\}\,w(X_{k-1}), \tag{2.50}
\]
where the function $h$ is given by (2.49).

Proof Let $m\in\mathbb{N}_0$ and $x\in X$ be fixed. For $n=0$ we have
\[
\begin{aligned}
E_x^S\bigl[w(X(t))\,I\{t<T_{m+1}\}\,\big|\,H_m\bigr] &= E_x^S\bigl[(I\{T_m\le t\}+I\{T_m>t\})\,w(X(t))\,I\{t<T_{m+1}\}\,\big|\,H_m\bigr]\\
&= I\{T_m\le t\}\,w(X_m)\,P_x^S(\Theta_{m+1}>t-T_m\,|\,H_m) + \sum_{k=1}^{m} I\{T_{k-1}\le t<T_k\}\,w(X_{k-1}).
\end{aligned}
\]
Depending on Sm+1 , for t ≥ Tm , the conditional probability PxS (m+1 > t − Tm |Hm ) equals either e−
(0,t−Tm ]
q X m (πm+1 ,u)du
,
if Sm+1 = πm+1 , or
e−q X m (a)(t−Tm ) m+1 (da|Hm ),
A
if Sm+1 = m+1 . In either case, according to Lemma 2.2.4, w(X m )PxS (m+1 > t − Tm |Hm ) ≤ h(Tm , X m , t) q (X|x, π, u). Inequality (2.50) is proved for the on {t ≥ Tm }. Recall that qx (π, u) = case of n = 0. Suppose now that inequality (2.50) holds for some 0 ≤ n < m, and consider the case of n + 1. Then ExS w(X (t))I{t < Tm+1 }|Hm−n−1 = ExS ExS w(X (t))I{t < Tm+1 }|Hm−n |Hm−n−1 ≤ ExS I{Tm−n ≤ t}h(Tm−n , X m−n , t) m−n + I{Tk−1 ≤ t < Tk }w(X k−1 ) Hm−n−1 k=1
= I{Tm−n−1 ≤ t} ExS I{Tm−n ≤ t}h(Tm−n , X m−n , t) m−n I{Tk−1 ≤ t < Tk }w(X k−1 ) Hm−n−1 + k=1
+I{Tm−n−1 > t} $m−n ×ExS I{Tk−1 ≤ t < Tk }w(X k−1 ) Hm−n−1 .
(2.51)
k=1
For t ≥ Tm−n−1 , it holds that ExS I{Tm−n ≤ t}h(Tm−n , X m−n , t) m−n + I{Tk−1 ≤ t < Tk }w(X k−1 ) Hm−n−1 k=1 S = Ex I{Tm−n ≤ t}h(Tm−n−1 + m−n , X m−n , t) +I{m−n > t − Tm−n−1 }w(X m−n−1 )| Hm−n−1 = ExS I{m−n ≤ t − Tm−n−1 }h(m−n , X m−n , t − Tm−n−1 ) +I{m−n > t − Tm−n−1 }w(X m−n−1 )|Hm−n−1 . As before, on {t ≥ Tm−n−1 }, the last expression equals either
(2.52)
e−
(0,u]
q X m−n−1 (πm−n ,u)du
(0,t−Tm−n−1 ]
×
h(u, y, t − Tm−n−1 ) q (dy|X m−n−1 , πm−n , u)du X
+w(X m−n−1 )e
−
(0,t−Tm−n−1 ]
q X m−n−1 (πm−n ,u)du
,
if Sm−n = πm−n , or A
e−q X m−n−1 (a)u (0,t−Tm−n−1 ]
×
h(u, y, t − Tm−n−1 ) q (dy|X m−n−1 , a)du + w(X m−n−1 )e−q X m−n−1 (a)(t−Tm−n−1 ) m−n (da|Hm−n−1 ), X
if Sm−n = m−n . In either case, according to Lemma 2.2.4 and (2.52), we see that ExS I{Tm−n ≤ t}h(Tm−n , X m−n , t) m−n + I{Tk−1 ≤ t < Tk }w(X k−1 ) Hm−n−1 k=1
≤ h(0, X m−n−1 , t − Tm−n−1 ) = h(Tm−n−1 , X m−n−1 , t) on {t ≥ Tm−n−1 }. Finally, from this and (2.51) we obtain that ExS w(X (t))I{t < Tm+1 }|Hm−n−1 ≤ I{Tm−n−1 ≤ t}h(Tm−n−1 , X m−n−1 , t) $ m−n−1 +ExS I{Tk−1 ≤ t < Tk }w(X k−1 ) Hm−n−1 . k=1
The proof is now complete.
The next statement is an immediate consequence of Lemma 2.2.5.

Corollary 2.2.7 Suppose Condition 2.2.5(d) is satisfied. Then under any control strategy $S$, for each $x\in X$, $m\in\mathbb{N}_0$ and $t\in\mathbb{R}_+^0$,
\[
E_x^S[w(X(t))\,I\{t<T_{m+1}\}] \le \Bigl(e^{\rho t}w(x)+\frac{b}{\rho}\bigl(e^{\rho t}-1\bigr)\Bigr)I\{\rho\neq 0\} + (w(x)+bt)\,I\{\rho=0\}. \tag{2.53}
\]

Proof If $\rho\neq 0$, it is sufficient to substitute $n=m$ into (2.50). If $\rho=0$, then note that Condition 2.2.5(d) holds also for all $\hat\rho>0$. Since inequality (2.53) is satisfied for any $\hat\rho>0$, one can pass to the limit as $\hat\rho\to 0$ in the expression
\[
E_x^S[w(X(t))\,I\{t<T_{m+1}\}] \le e^{\hat\rho t}w(x)+\frac{b}{\hat\rho}\bigl(e^{\hat\rho t}-1\bigr).
\]
The proof is complete.

Now we are in a position to prove the main statement in this subsection.
Proof of Theorem 2.2.5 Throughout this proof, let $x\in X$ be arbitrarily fixed. For a fixed $l\in\mathbb{N}$, consider the modified transition rate $q^l$ defined by
\[
q^l(dy|x,a) = \begin{cases} q(dy|x,a), & \text{if } x\in V_l;\\ 0, & \text{if } x\in X\setminus V_l.\end{cases}
\]
The sample space $(\Omega,\mathcal{F})$ remains the same, and any control strategy $S$ can be viewed as the control strategy in the modified model. The strategic measure and the corresponding mathematical expectation are denoted by $P_x^{S,l}$ and $E_x^{S,l}$. It is clear that, for each $t\in\mathbb{R}_+^0$ and $n\in\mathbb{N}$,
\[
\begin{aligned}
&P_x^{S,l}(\Theta_{n+1}>t-T_n,\ T_n\le t,\ X_1\in V_l,\dots,X_n\in V_l)\\
&\quad= P_x^{S}(\Theta_{n+1}>t-T_n,\ T_n\le t,\ X_1\in V_l,\dots,X_n\in V_l);\\
&P_x^{S,l}(\Theta_{n+1}\le t-T_n,\ X_{n+1}\in V_l,\ T_n\le t,\ X_1\in V_l,\dots,X_n\in V_l)\\
&\quad= P_x^{S}(\Theta_{n+1}\le t-T_n,\ X_{n+1}\in V_l,\ T_n\le t,\ X_1\in V_l,\dots,X_n\in V_l).
\end{aligned}
\]
The proof is obvious by an induction argument with respect to $n=1,2,\dots$. Therefore, for any $t\in\mathbb{R}_+^0$,
\[
\begin{aligned}
P_x^{S,l}(X(t)\in V_l) &= P_x^{S,l}(t<T_1,\ X(0)\in V_l) + \sum_{n\ge 1} P_x^{S,l}(T_n\le t<T_{n+1},\ X_n\in V_l)\\
&= P_x^{S,l}(t<T_1,\ X(0)\in V_l) + \sum_{n\ge 1} P_x^{S,l}(T_n\le t<T_{n+1},\ X_1\in V_l,\dots,X_n\in V_l)\\
&= P_x^{S}(t<T_1,\ X(0)\in V_l) + \sum_{n\ge 1} P_x^{S}(T_n\le t<T_{n+1},\ X_1\in V_l,\dots,X_n\in V_l)\\
&= P_x^{S}(\forall\,\tilde t\in[0,t],\ X(\tilde t)\in V_l).
\end{aligned} \tag{2.54}
\]
Since $\sup_{x\in V_l,\ a\in A} q_x(a)<\infty$, it clearly holds that $P_x^{S,l}(T_\infty=\infty)=1$,
and hence, for each $t\in\mathbb{R}_+^0$,
\[
P_x^{S,l}(X(t)\in X\setminus V_l) = P_x^{S}\bigl(X(t)=x_\infty \text{ or } (X(t)\neq x_\infty \text{ and } \exists\,\tilde t\in[0,t]: X(\tilde t)\in X\setminus V_l)\bigr).
\]
Let us prove that, for any fixed $t\in\mathbb{R}_+^0$,
\[
\lim_{l\to\infty} P_x^{S,l}(X(t)\in X\setminus V_l) = 0 \tag{2.55}
\]
as follows. Without loss of generality, we assume that $\rho>0$. According to Condition 2.2.5, for any $\varepsilon>0$, there is a $J(\varepsilon)\in\mathbb{N}$ such that
\[
\forall\, l\ge J(\varepsilon), \quad \inf_{y\in X\setminus V_l} w(y) > \frac{e^{\rho t}w(x)+\frac{b}{\rho}\bigl(e^{\rho t}-1\bigr)}{\varepsilon}. \tag{2.56}
\]
According to Corollary 2.2.7 and equality (2.54), for any $l\in\mathbb{N}$,
\[
\begin{aligned}
E_x^{S,l}[w(X(t))] &= E_x^{S,l}\Bigl[w(X(t))\sum_{n=0}^\infty I\{T_n\le t<T_{n+1}\}\Bigr] = E_x^{S,l}[w(X(t))\,I\{t<T_\infty\}]\\
&= \lim_{m\to\infty} E_x^{S,l}[w(X(t))\,I\{t<T_{m+1}\}] \le e^{\rho t}w(x)+\frac{b}{\rho}\bigl(e^{\rho t}-1\bigr).
\end{aligned} \tag{2.57}
\]
Recall that Condition 2.2.5 obviously holds for the modified model with the transition rate $q^l(dy|x,a)$. If equality (2.55) is false, then there are some $\varepsilon>0$ and $l\ge J(\varepsilon)$ such that
\[
P_x^{S,l}(X(t)\in X\setminus V_l) > \varepsilon. \tag{2.58}
\]
For $l$ satisfying inequalities (2.56) and (2.58), we have
\[
\begin{aligned}
E_x^{S,l}[w(X(t))] &\ge E_x^{S,l}[w(X(t))\,|\,X(t)\in X\setminus V_l]\;P_x^{S,l}(X(t)\in X\setminus V_l)\\
&> \varepsilon\inf_{y\in X\setminus V_l} w(y) > e^{\rho t}w(x)+\frac{b}{\rho}\bigl(e^{\rho t}-1\bigr).
\end{aligned}
\]
The obtained formula contradicts (2.57). Therefore, equality (2.55) is established. Now formula (2.55) implies that, for any $t\in\mathbb{R}_+^0$,
\[
\lim_{l\to\infty} P_x^{S}\bigl(X(t)=x_\infty \text{ or } (X(t)\neq x_\infty \text{ and } \exists\,\tilde t\in[0,t]: X(\tilde t)\in X\setminus V_l)\bigr) = 0,
\]
or, equivalently,
\[
\lim_{l\to\infty} P_x^{S}(\forall\,\tilde t\in[0,t],\ X(\tilde t)\in V_l) = 1.
\]
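After we introduce the monotone nondecreasing sequence of measurable sets $\Omega_l=\{\omega\in\Omega: \forall\,\tilde t\in[0,t],\ X(\tilde t)\in V_l\}$, $l\in\mathbb{N}$, we see that
\[
\lim_{l\to\infty} P_x^S(\Omega_l) = P_x^S\Bigl(\bigcup_{l\in\mathbb{N}_0}\Omega_l\Bigr) = P_x^S(\Omega_0) + \sum_{l=1}^\infty P_x^S\Bigl(\Omega_l\setminus\bigcup_{i=0}^{l-1}\Omega_i\Bigr) = 1.
\]
Clearly, for any $l\in\mathbb{N}$, $P_x^S\bigl(T_\infty>t\,\big|\,\Omega_l\setminus\bigcup_{i=0}^{l-1}\Omega_i\bigr)=1$ because
\[
\sup_{x\in V_l,\ a\in A} q_x(a)<\infty.
\]
Therefore,
\[
P_x^S(T_\infty>t) = \sum_{l=0}^\infty P_x^S\Bigl(T_\infty>t\,\Big|\,\Omega_l\setminus\bigcup_{i=0}^{l-1}\Omega_i\Bigr)\,P_x^S\Bigl(\Omega_l\setminus\bigcup_{i=0}^{l-1}\Omega_i\Bigr) = 1.
\]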
Since $t\in\mathbb{R}_+^0$ was fixed arbitrarily, we conclude that $P_x^S(T_\infty=\infty)=1$, as desired.

Corollary 2.2.8 Suppose Condition 2.2.5 is satisfied. Then under any control strategy $S$, for each $x\in X$ and $t\in\mathbb{R}_+^0$,
\[
E_x^S[w(X(t))] \le \Bigl(e^{\rho t}w(x)+\frac{b}{\rho}\bigl(e^{\rho t}-1\bigr)\Bigr)I\{\rho\neq 0\} + (w(x)+bt)\,I\{\rho=0\}.
\]

Proof It is sufficient to pass to the limit as $m\to\infty$ in formula (2.53): remember, $T_\infty=\infty$ $P_x^S$-a.s.

It is not difficult to show that, under Condition 2.2.5, for an arbitrary natural Markov strategy $\breve\pi$, Condition 2.2.1 is satisfied, too. Indeed, one can take the sequence $\tilde V_n=\mathbb{R}_+^0\times V_n$, $n\in\mathbb{N}$, and $w(t,x)=w(x)+1$. Conditions 2.2.1(a)–(c) are obviously satisfied. For (d), note that
\[
\int_X (w(y)+1)\,q(dy|x,a) \le \rho(w(x)+1)+b
\]
and take $\alpha\ge\rho+b$. Now
\[
\begin{aligned}
&\int_{(0,\infty)}\int_X [w(y)+1]\,e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,\tilde q(dy|x,t+v)\,dt\\
&\quad\le \int_{(0,\infty)} e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,[\rho(w(x)+1)+b+(w(x)+1)\,q_x(t+v)]\,dt\\
&\quad\le \int_{(0,\infty)} e^{-\alpha t-\int_{(0,t]} q_x(s+v)\,ds}\,(w(x)+1)(\alpha+q_x(t+v))\,dt \le w(x)+1,
\end{aligned}
\]
and Condition 2.2.1(d) is satisfied as well.
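The right-hand side of the bound in Corollaries 2.2.7 and 2.2.8 is easy to evaluate numerically, and doing so makes the continuity of the bound at $\rho=0$ visible. The following Python sketch is only an illustration: the function name `moment_bound` and its argument names are hypothetical, with `w0` standing for $w(x)$.

```python
import math


def moment_bound(w0: float, rho: float, b: float, t: float) -> float:
    """Upper bound on E[w(X(t))] from the corollaries above:
    exp(rho*t)*w0 + (b/rho)*(exp(rho*t) - 1) if rho != 0,
    and w0 + b*t if rho == 0."""
    if rho == 0.0:
        return w0 + b * t
    # expm1 keeps (exp(rho*t) - 1) / rho accurate when |rho * t| is small.
    return math.exp(rho * t) * w0 + b * math.expm1(rho * t) / rho


if __name__ == "__main__":
    # As rho decreases to 0, the bound approaches w0 + b*t.
    for rho in (1.0, 1e-3, 1e-9, 0.0):
        print(rho, moment_bound(2.0, rho, 3.0, 4.0))
```

The use of `math.expm1` is a numerical design choice, made here because the naive expression loses precision for small values of $\rho t$.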
2.3 Examples

In this section, we provide several examples illustrating the verification of the conditions obtained in the previous section for the explosiveness and non-explosiveness of the process X (·) under a particular or an arbitrary control strategy.
2.3.1 Birth-and-Death Processes

Example 2.3.1 This simple example illustrates Theorem 2.2.4 and the corresponding discrete-time Markov chain $\{\hat X_n\}_{n=0}^\infty$ introduced in Sect. 2.2.2. Suppose (under a fixed stationary π-strategy) $X(\cdot)$ is a pure birth process with the state space $X=\{0,1,2,\dots\}$ and the birth rate $\lambda_i=i^m$, $i\in X$, with $m>0$, that is, $q(j|i)=I\{j=i+1\}\lambda_i-I\{j=i\}\lambda_i$. Then Eq. (2.46) takes the form
\[
\alpha u(i) = i^m u(i+1) - i^m u(i), \quad i=0,1,2,\dots
\]
Obviously, for an arbitrarily fixed $\alpha\in\mathbb{R}_+$, the function
\[
u(i) = \begin{cases} 0, & \text{if } i=0,\\[2pt] k\prod_{j=1}^{i-1}\dfrac{\alpha+j^m}{j^m}, & \text{if } i\ge 1\end{cases}
\]
is a solution, where $u(1)=k\ge 0$ is arbitrarily fixed, and any nonnegative solution has this form. Note that
\[
\prod_{j=1}^\infty \frac{\alpha+j^m}{j^m} = \prod_{j=1}^\infty\Bigl(1+\frac{\alpha}{j^m}\Bigr) \begin{cases} = \infty, & \text{if } m\le 1,\\ = K\in(0,\infty), & \text{if } m>1,\end{cases}
\]
because the series $\sum_{j=1}^\infty\frac{\alpha}{j^m}$ diverges or converges for the corresponding values of $m$.

Therefore, Eq. (2.46) has no bounded nontrivial solutions if and only if $m\le 1$. If $m>1$, one can always normalise the non-zero function $u$ and take
\[
u(i) = \begin{cases} 0, & \text{if } i=0,\\[2pt] \dfrac{1}{K}\prod_{j=1}^{i-1}\dfrac{\alpha+j^m}{j^m}, & \text{if } i\ge 1.\end{cases}
\]
According to Theorem 2.2.4(c), the process $X(\cdot)$ is non-explosive if and only if $m\le 1$.

The homogeneous discrete-time Markov process $\{\hat X_n\}_{n=0}^\infty$, introduced in Sect. 2.2.2, has transition probability
\[
\tilde G(j|i) = I\{j=i+1\}\,\frac{i^m}{i^m+\alpha} + I\{j=\Delta\}\,\frac{\alpha}{i^m+\alpha},
\]
if $i\in X$; $\tilde G(x_\infty|\Delta)=\tilde G(x_\infty|x_\infty)=1$. The probability of absorption in the state $x_\infty$, starting from any state $i\ge 1$, equals
\[
P_i(\sigma_{x_\infty}<\infty) = 1-P_i(\sigma_{x_\infty}=\infty) = 1-\prod_{j=i}^\infty\frac{j^m}{j^m+\alpha} = 1-\prod_{j=i}^\infty\Bigl(1-\frac{\alpha}{j^m+\alpha}\Bigr),
\]
and $\prod_{j=i}^\infty\bigl(1-\frac{\alpha}{j^m+\alpha}\bigr)>0$ if and only if the series $\sum_{j=i}^\infty\frac{\alpha}{j^m+\alpha}$ converges, i.e., if and only if $m>1$. For $i=0$, $P_0(\sigma_{x_\infty}<\infty)=1$.

Let us recall that the standard deterministic analogue of the considered birth process is given by the ordinary differential equation $\frac{dx}{dt}=x^m$, and, if $x(0)=x_0>0$ then, for $m>0$, $m\neq 1$,
\[
x(t) = \bigl[(1-m)t+x_0^{1-m}\bigr]^{\frac{1}{1-m}}.
\]
For any $m<1$, $x(t)$ is well defined for all $t\ge 0$, but if $m>1$, $\lim_{t\uparrow x_0^{1-m}/(m-1)} x(t)=\infty$, which means explosion. The special case of $m=1$ is trivial: $x(t)=x_0e^t$.
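The dichotomy between $m\le 1$ and $m>1$ in Example 2.3.1 can be observed numerically by truncating the infinite products. The Python sketch below is an illustration only: the function names `partial_product` and `absorption_probability` and the truncation level `n_terms` are assumptions introduced here, and the truncated values merely approximate the infinite products.

```python
import math


def partial_product(alpha: float, m: float, n_terms: int) -> float:
    """prod_{j=1}^{n_terms} (1 + alpha / j**m), accumulated in log scale."""
    log_sum = sum(math.log1p(alpha / j**m) for j in range(1, n_terms + 1))
    return math.exp(log_sum)


def absorption_probability(i: int, alpha: float, m: float,
                           n_terms: int = 100_000) -> float:
    """Truncated approximation of 1 - prod_{j>=i} j**m / (j**m + alpha)
    for a starting state i >= 1, using n_terms factors of the product."""
    log_sum = sum(-math.log1p(alpha / j**m) for j in range(i, i + n_terms))
    return 1.0 - math.exp(log_sum)


if __name__ == "__main__":
    # m = 1: the partial products grow without bound (here they equal N + 1).
    print([partial_product(1.0, 1.0, n) for n in (10, 100, 1000)])
    # m = 2: the partial products settle near a finite constant K.
    print([partial_product(1.0, 2.0, n) for n in (10, 100, 1000)])
    print(absorption_probability(1, 1.0, 2.0))
```

For $\alpha=1$ and $m=1$ the partial product over $j=1,\dots,N$ telescopes to $N+1$, while for $\alpha=1$ and $m=2$ the limit is the classical value $\sinh(\pi)/\pi$, so the absorption probability from state 1 is strictly less than one, as expected for an explosive process.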
Example 2.3.2 We describe a general enough model with discrete state space $\mathbf{X} = \{0, 1, 2, \ldots\}$ and an arbitrary action space $\mathbf{A}$, such that Condition 2.2.5 is satisfied, and hence the controlled process $X(\cdot)$ is non-explosive under each control strategy. Let $q(j|i,a) = q_i(a) H(j-i|i,a)$ for $j \ne i$, where, for each fixed $i \in \mathbf{X}$ and $a \in \mathbf{A}$, $H(\cdot|i,a)$ is the probability distribution on the set $\{-i, -i+1, \ldots, -1, +1, +2, \ldots\}$ of all the possible increments of the process $X(\cdot)$ after any one jump from the current state $i \in \mathbf{X}$.

Proposition 2.3.1 Suppose, for a fixed constant $m \in \mathbb{N}$, the series
$$\sum_{k \ne 0} |k|^m H(k|i,a) = \tilde{H}(i,a)$$
converges, the function $\tilde{H}$ is uniformly bounded: $\tilde{H}(i,a) \le \bar{H}$ for all $(i,a) \in \mathbf{X} \times \mathbf{A}$, and $\sup_{i \in \mathbf{X}, a \in \mathbf{A}} \frac{q_i(a)}{i+1} = \bar{h} < \infty$. Then Condition 2.2.5 is satisfied for $V_n = \{0, 1, 2, \ldots, n\}$, $w(i) = (i+1)^m$, $\rho = (2^m - 1)\bar{h}\bar{H}$, $b = 0$.

Proof Conditions 2.2.5(a)–(c) are trivially satisfied. For case (d), elementary calculations imply
$$\begin{aligned}
\sum_{j \in \mathbf{X}} q(j|i,a)(j+1)^m &= q_i(a) \left\{ \sum_{j \ne i} H(j-i|i,a)(j+1)^m - (i+1)^m \right\} \\
&= q_i(a) \left\{ \sum_{k \ne 0} H(k|i,a)(k+i+1)^m - (i+1)^m \right\} \\
&= q_i(a) \left\{ \sum_{k \ne 0} H(k|i,a) \sum_{l=0}^{m-1} \frac{m!}{l!(m-l)!} (i+1)^l k^{m-l} \right\} \\
&\le (i+1)\bar{h} \sum_{l=0}^{m-1} \frac{m!}{l!(m-l)!} (i+1)^{m-1} \sum_{k \ne 0} |k|^m H(k|i,a) \\
&\le \bar{h}(i+1)^m [2^m - 1] \bar{H} = \rho (i+1)^m.
\end{aligned}$$
The proof is complete.

Clearly, this model covers all birth-and-death processes with (for $j \ne i$)
$$q(j|i,a) = \lambda_i(a) I\{j = i+1\} + \mu_i(a) I\{j = i-1\}$$
when $\sup_{i \in \mathbf{X}, a \in \mathbf{A}} \frac{\lambda_i(a) + \mu_i(a)}{i+1} < \infty$. In particular, it covers the corresponding pure birth processes (cf. Example 2.3.1).
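The drift bound of Proposition 2.3.1 can be checked by exact arithmetic for a concrete birth-and-death process. The sketch below is only an illustration under assumptions made here: the rates $\lambda_i = 2(i+1)$ and $\mu_i = 3i$ (a single action), so that $\bar{h} = 5$ and $\bar{H} = 1$, and the helper names are hypothetical.

```python
def drift(i, m, lam, mu):
    """Exact value of sum_j q(j|i) (j+1)^m for birth rate lam(i) and death rate mu(i)."""
    up, down = lam(i), mu(i)
    w = lambda j: (j + 1) ** m  # Lyapunov function w(j) = (j+1)^m
    # The diagonal term is q(i|i) = -(up + down).
    return up * w(i + 1) + down * w(i - 1) - (up + down) * w(i)

def check_condition_d(m, h_bar, H_bar, lam, mu, i_max):
    """Check sum_j q(j|i) w(j) <= rho * w(i) with rho = (2^m - 1) * h_bar * H_bar, b = 0."""
    rho = (2**m - 1) * h_bar * H_bar
    return all(drift(i, m, lam, mu) <= rho * (i + 1) ** m for i in range(i_max + 1))

if __name__ == "__main__":
    lam = lambda i: 2 * (i + 1)  # assumed birth rates
    mu = lambda i: 3 * i         # assumed death rates, mu(0) = 0
    for m in range(1, 5):
        print(m, check_condition_d(m, 5, 1, lam, mu, 200))
```

Since all quantities are integers, the comparison involves no rounding; a finite range of states is of course only a sanity check, not a proof.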
Example 2.3.3 In this example, we present the birth-and-death model, where the transition rate can increase arbitrarily. We show that the process is still non-explosive under each control strategy if there is a negative trend in the dynamics. Suppose $\mathbf{X} = \{0, 1, 2, \ldots\}$ and the action space $\mathbf{A}$ is arbitrary. Let
$$q(j|i,a) = \lambda_i(a) I\{j = i+1\} + \mu_i(a) I\{j = i-1\} \quad \text{for } j \ne i.$$
Here $\mu_0(a) = 0$. We assume that $\sup_{a \in \mathbf{A}} \{\lambda_i(a) + \mu_i(a)\} < \infty$ for each $i \in \mathbf{X}$.

Proposition 2.3.2 Suppose there is an $I \in \mathbb{N}_0$ such that
$$\inf_{i \ge I, a \in \mathbf{A}} \frac{\mu_i(a) + 1}{\lambda_i(a) + 1} = h > 1.$$
Then Condition 2.2.5 is satisfied for $V_n = \{0, 1, 2, \ldots, n\}$, $w(i) = h^{i+1}$, $\rho = 0$, $b = 0 \vee \max\{b_0, b_1, \ldots, b_{I-1}\}$, where
$$b_i = \sup_{a \in \mathbf{A}} h^i \{\lambda_i(a) h^2 + \mu_i(a) - [\lambda_i(a) + \mu_i(a)] h\}.$$
Proof Conditions 2.2.5(a)–(c) are trivially satisfied. For part (d), elementary calculations imply
$$\begin{aligned}
\sum_{j \in \mathbf{X}} q(j|i,a) h^{j+1} &= h^i [\lambda_i(a) h^2 + \mu_i(a) - \{\lambda_i(a) + \mu_i(a)\} h] \\
&\le h^i [(\lambda_i(a) + 1) h^2 - \{\lambda_i(a) + \mu_i(a) + 2\} h + (\mu_i(a) + 1)] \\
&= h^i (\lambda_i(a) + 1) \left[ h^2 - \left( \frac{\mu_i(a)+1}{\lambda_i(a)+1} + 1 \right) h + \frac{\mu_i(a)+1}{\lambda_i(a)+1} \right].
\end{aligned}$$
Since $h > 1$, the last square bracket decreases as $\frac{\mu_i(a)+1}{\lambda_i(a)+1}$ increases. Therefore, for $i \ge I$,
$$\sum_{j \in \mathbf{X}} q(j|i,a) h^{j+1} \le h^i (\lambda_i(a) + 1) [h^2 - (h+1)h + h] = 0,$$
and the proof follows.
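A concrete instance of Proposition 2.3.2 may help to see how fast-growing rates are compatible with a nonpositive drift. The sketch below assumes, purely for illustration, a single action with $\lambda_i = i^2 + 1$, $\mu_0 = 0$ and $\mu_i = 3i^2 + 5$ for $i \ge 1$, so that $(\mu_i + 1)/(\lambda_i + 1) = 3 = h$ for all $i \ge I = 1$; the function names are hypothetical.

```python
def drift_exp(i, h, lam, mu):
    """Exact value of sum_j q(j|i) h^(j+1) for a birth-and-death Q-function."""
    up, down = lam(i), mu(i)
    return up * h ** (i + 2) + down * h ** i - (up + down) * h ** (i + 1)

def lam(i):
    """Assumed birth rates, growing quadratically."""
    return i * i + 1

def mu(i):
    """Assumed death rates; mu(0) = 0 as required in the example."""
    return 0 if i == 0 else 3 * i * i + 5

if __name__ == "__main__":
    h, I = 3, 1
    # The constant b = 0 v max{b_0, ..., b_{I-1}}; with one action, b_i is the drift at i.
    b = max([0] + [drift_exp(i, h, lam, mu) for i in range(I)])
    print("b =", b)
    for i in range(I, 6):
        print(i, drift_exp(i, h, lam, mu))  # nonpositive for i >= I
```

With these rates the drift equals $h^i (h-1)(\lambda_i h - \mu_i) = -4 \cdot 3^i$ for $i \ge 1$, while the only positive value occurs at $i = 0$ and determines $b$.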
2.3.2 The Gaussian Model

The example presented below, with an uncountable state space $\mathbf{X} = \mathbb{R}$, might concern regulations of a cash flow; the control space $\mathbf{A}$ can be an arbitrary Borel space. We will show that, under appropriate conditions, the process $X(\cdot)$ is non-explosive (or explosive). Suppose the total jump intensity $k(\cdot)$ is an arbitrary measurable positive-valued function on $\mathbf{X} \times \mathbf{A}$, and the distribution of the state after a jump from $x \in \mathbf{X}$ is normal with the unit variance and expectation $dx$, where $d \in \mathbb{R}$ is a fixed constant. In other words,
$$q(\Gamma|x,a) = k(x,a) \left[ \int_\Gamma \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-dx)^2}{2}} dy - \delta_x(\Gamma) \right].$$
Proposition 2.3.3 (a) If $\sup_{x \in \mathbf{X}, a \in \mathbf{A}} \frac{k(x,a)}{x^2+1} \le M < \infty$ and $|d| \le 1$, then Condition 2.2.5 is satisfied for $V_n = (-n, n)$, $w(x) = x^2$, $\rho = b = M$, so that the process $X(\cdot)$ is non-explosive under each control strategy.

(b) If $|d| > 1$ and the stationary π-strategy $\pi^s$ is such that
$$\int_{\mathbf{A}} k(x,a) \pi^s(da|x) \ge l e^{\gamma x^2/2}$$
for some $l, \gamma > 0$, then, for the function $u(x) = 1 - e^{-cx^2/2}$, with an arbitrarily fixed positive $c < (d^2 - 1) \wedge \gamma$, inequality (2.47) holds for $\varepsilon = l \left(1 - \frac{1}{\sqrt{c+1}}\right)$, and hence the process $X(\cdot)$ is explosive.
Proof (a) Conditions 2.2.5(a)–(c) are trivially satisfied. For part (d), elementary calculations imply
$$\begin{aligned}
\int_{\mathbf{X}} w(y) q(dy|x,a) &= k(x,a) \left[ \int_{(-\infty,\infty)} \frac{1}{\sqrt{2\pi}} y^2 e^{-\frac{(y-dx)^2}{2}} dy - x^2 \right] \\
&= k(x,a) [1 + (dx)^2 - x^2] \le k(x,a) \le M(x^2 + 1) = \rho w(x) + b.
\end{aligned}$$
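(b) In order to compute
$$\int_{\mathbf{X}} u(y) q(dy|x) = \int_{\mathbf{A}} k(x,a) \pi^s(da|x) \left[ -\int_{(-\infty,\infty)} \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-dx)^2 + cy^2}{2}} dy + e^{-\frac{cx^2}{2}} \right],$$
note that
$$\int_{(-\infty,\infty)} e^{-\left(\frac{y^2}{2} + \frac{cy^2}{2} - dxy\right)} dy = \frac{\sqrt{2\pi}}{\sqrt{c+1}} e^{\frac{(dx)^2}{2(c+1)}}.$$
Therefore,
$$\int_{\mathbf{X}} u(y) q(dy|x) \ge l e^{\frac{(\gamma - c)x^2}{2}} \left[ 1 - \frac{1}{\sqrt{c+1}} e^{\frac{(dx)^2}{2(c+1)} - \frac{d^2 x^2}{2} + \frac{cx^2}{2}} \right].$$
Clearly,
$$\frac{d^2}{2(c+1)} - \frac{d^2}{2} + \frac{c}{2} = \frac{c}{2(c+1)} \cdot (c + 1 - d^2) < 0,$$
so that
$$\int_{\mathbf{X}} u(y) q(dy|x) \ge l e^{\frac{(\gamma - c)x^2}{2}} \left( 1 - \frac{1}{\sqrt{c+1}} \right) \ge l \left( 1 - \frac{1}{\sqrt{c+1}} \right) > 0.$$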
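The Gaussian integral used in part (b), and the resulting lower bound, can be sanity-checked numerically. The following sketch is not from the text: it assumes a single action with intensity exactly $k(x) = l e^{\gamma x^2/2}$, uses Simpson's rule with an arbitrarily chosen grid, and all function names are hypothetical.

```python
import math

def gaussian_integral_numeric(x, d, c, n=20001, width=12.0):
    """Simpson approximation of the integral over R of exp(-(y^2/2 + c*y^2/2 - d*x*y))."""
    s = 1.0 / math.sqrt(c + 1.0)       # standard deviation of the Gaussian factor
    centre = d * x / (c + 1.0)         # location of the maximum of the integrand
    a, b = centre - width * s, centre + width * s
    h = (b - a) / (n - 1)              # n is odd, so the number of intervals is even
    f = lambda y: math.exp(-(0.5 * (c + 1.0) * y * y - d * x * y))
    total = f(a) + f(b)
    for k in range(1, n - 1):
        total += (4 if k % 2 else 2) * f(a + k * h)
    return total * h / 3.0

def gaussian_integral_closed(x, d, c):
    """Closed form sqrt(2*pi)/sqrt(c+1) * exp((d*x)^2 / (2*(c+1)))."""
    return math.sqrt(2.0 * math.pi / (c + 1.0)) * math.exp((d * x) ** 2 / (2.0 * (c + 1.0)))

def drift_u(x, d, c, k):
    """Integral of u(y) q(dy|x) for u(y) = 1 - exp(-c*y^2/2) and total intensity k."""
    smoothed = math.exp((d * x) ** 2 / (2.0 * (c + 1.0)) - (d * x) ** 2 / 2.0) / math.sqrt(c + 1.0)
    return k * (math.exp(-c * x * x / 2.0) - smoothed)

if __name__ == "__main__":
    d, c, gamma, l = 2.0, 1.0, 1.5, 1.0   # c < min(d^2 - 1, gamma)
    eps = l * (1.0 - 1.0 / math.sqrt(c + 1.0))
    print(gaussian_integral_numeric(1.5, d, c), gaussian_integral_closed(1.5, d, c))
    for x in (-3.0, -1.0, 0.0, 0.5, 2.0):
        k = l * math.exp(gamma * x * x / 2.0)
        print(x, drift_u(x, d, c, k) >= eps - 1e-12)
```

At $x = 0$ the drift coincides with $\varepsilon$, which shows that the constant in the proposition cannot be improved for this choice of $k$.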
2.3.3 The Fragmentation Model

Consider the fragmentation model described in Sect. 1.2.7 under a fixed stationary strategy. Suppose $X = \mathbb{R}_+$. We adopt the notations in Sects. 1.2.7 and 2.2.2. Below, we show that the process $X(\cdot)$ is explosive under appropriate conditions. Due to the mass conservation law imposed on the fragmentation kernel $F$, it is legitimate to consider the fragmentation model with the restricted state space
$$\mathbf{X}_C = \left\{ x = (s_1, \ldots, s_N) : N \ge 1,\ s_i \in \mathbb{R}_+,\ \forall\, i = 1, \ldots, N,\ \sum_{i=1}^{N} s_i \le C \right\} \subseteq \mathbf{X}$$
for the fixed constant $C \in \mathbb{R}_+$.

Proposition 2.3.4 Suppose the stationary strategy $\pi^s(da|x)$ does not depend on $x$, and the fragmentation rate $r(s) = \int_{\mathbf{A}} r(s,a) \pi^s(da|x)$ satisfies the inequality
$$r(s) \ge \frac{C_F}{s^\alpha}, \quad \forall\, s \in \mathbb{R}_+ \tag{2.59}$$
for some constants $\alpha \in \mathbb{R}_+$ and $C_F \in \mathbb{R}_+$. Then there is a nontrivial bounded nonnegative measurable function $u$ on $\mathbf{X}_C$ and a constant $\varepsilon > 0$ such that $\int_{\mathbf{X}_C} u(y) q(dy|x) \ge \varepsilon$ for any $x \in \mathbf{X}_C$.

Proof Let $\beta = \min\{\alpha, 1\}$ be fixed. Consider the function $g$ on $\mathbb{R}_+$ defined by
$$g(y) = \frac{y^\beta}{1 + y^\beta}, \quad \forall\, y \in \mathbb{R}_+.$$
After a standard analysis of derivatives, one can see that $g$ is a monotone nondecreasing bounded function, and
$$g(y_2) - g(y_1) \ge \frac{\beta y_2^{\beta - 1}}{(1 + y_2^\beta)^2} (y_2 - y_1) \tag{2.60}$$
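for each $\infty > y_2 \ge y_1 \ge 0$. Let the bounded nontrivial nonnegative function $u$ on $\mathbf{X}_C$ be defined by $u(x) = g(N)$ if $x = (s_1, \ldots, s_N) \in \mathbf{X}_C$. Then it holds that for each $x = (s_1, \ldots, s_N) \in \mathbf{X}_C$,
$$\begin{aligned}
\int_{\mathbf{X}_C} u(y) q(dy|x) &\ge \sum_{i=1}^{N} (g(N+1) - g(N)) r(s_i) \\
&\ge C_F \frac{\beta (N+1)^{\beta-1}}{(1 + (N+1)^\beta)^2} \sum_{i=1}^{N} \frac{1}{s_i^\alpha} \ge C_F \frac{\beta (N+1)^{\beta-1}}{(1 + (N+1)^\beta)^2} N^{1+\alpha} \left( \sum_{i=1}^{N} s_i \right)^{-\alpha} \\
&\ge C_F \frac{\beta (N+1)^{\beta-1}}{(1 + (N+1)^\beta)^2} N^{1+\alpha} C^{-\alpha},
\end{aligned}$$
where the first inequality is by the monotonicity of the function $g$ in $y \in \mathbb{R}_+$, the second inequality is by (2.59) and (2.60), and the third inequality is a consequence of the Jensen inequality
$$\frac{1}{N} \sum_{i=1}^{N} s_i^{-\alpha} \ge \left( \frac{1}{N} \sum_{i=1}^{N} s_i \right)^{-\alpha}.$$
Finally, it is easy to check that $\varepsilon = \inf_{N \ge 1} C_F \frac{\beta (N+1)^{\beta-1}}{(1 + (N+1)^\beta)^2} N^{1+\alpha} C^{-\alpha} > 0$.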
According to Corollary 2.2.4, the process X (·) with values in XC is explosive. It follows that, given (2.59), the original process with the state space X is explosive, too.
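The positivity of $\varepsilon$ in the proof of Proposition 2.3.4 can be explored numerically. This is only a sketch under assumptions made here: the infimum is taken over a finite range of $N$ (so it is an approximation of the infimum over all $N \ge 1$), and the function names and parameter values are chosen for illustration.

```python
def lower_bound_term(N, alpha, C_F, C):
    """The bound C_F * beta*(N+1)^(beta-1) / (1+(N+1)^beta)^2 * N^(1+alpha) * C^(-alpha)."""
    beta = min(alpha, 1.0)
    return (C_F * beta * (N + 1) ** (beta - 1.0) / (1.0 + (N + 1) ** beta) ** 2
            * N ** (1.0 + alpha) * C ** (-alpha))

def epsilon_truncated(alpha, C_F, C, N_max=1000):
    """Minimum of the bound over N = 1, ..., N_max (a finite stand-in for the infimum)."""
    return min(lower_bound_term(N, alpha, C_F, C) for N in range(1, N_max + 1))

if __name__ == "__main__":
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, epsilon_truncated(alpha, C_F=1.0, C=1.0))
```

For $\alpha = 1$ and $C_F = C = 1$ the term reduces to $N^2/(N+2)^2$, which is increasing in $N$, so the minimum is attained at $N = 1$.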
2.3.4 The Infrastructure Surveillance Model

Below, we consider a simplified version of the surveillance model described in Sect. 1.2.8. Namely, put $N = 1$ and consider only one threat event: $\mathbf{X} = (0, \infty)$, $\mathbf{A} = \{0, 1\}$,
$$\tilde{q}(\Gamma|x,a) = \lambda(x) \delta_{mx}(\Gamma) + \mu(x) \delta_{bx}(\Gamma) I\{a = 1\},$$
where $m > 1$ and $b \in (0, 1)$ are fixed constants.

Proposition 2.3.5 Suppose for each $n \in \mathbb{N}$, $\sup_{x \in (0,n)} [\lambda(x) + \mu(x)] < \infty$, and there is a $K \in \mathbb{R}_+$ such that
$$\inf_{x \ge K} \frac{\mu(x) + 1}{\lambda(x) + 1} > \inf_{\alpha > 0} \frac{m^\alpha - 1}{1 - b^\alpha}.$$
Therefore, there is some $\alpha > 0$ satisfying the inequality $\inf_{x \ge K} \frac{\mu(x)+1}{\lambda(x)+1} > \frac{m^\alpha - 1}{1 - b^\alpha}$. We fix this $\alpha$. Then, under the stationary strategy given by $\pi^s(1|x) = 1$ for each $x \in \mathbf{X}$, Condition 2.2.3 is satisfied for $V_n = (0, n)$, $w(x) = x^\alpha$, $\rho = 2$,
$$b = \sup_{x \in (0,K)} \{\lambda(x)(mx)^\alpha + \mu(x)(bx)^\alpha\}.$$
Hence, the process $X(\cdot)$ is non-explosive.

Proof Conditions 2.2.3(a)–(c) are trivially satisfied. For part (d), elementary calculations imply
$$\begin{aligned}
\int_{\mathbf{X}} w(y) q(dy|x) - \rho w(x) - b &\le \lambda(x)(mx)^\alpha + \mu(x)(bx)^\alpha - [\lambda(x) + \mu(x)] x^\alpha - 2x^\alpha \\
&\le [\lambda(x) + 1](mx)^\alpha + [\mu(x) + 1](bx)^\alpha - [\lambda(x) + 1 + \mu(x) + 1] x^\alpha \\
&= \left[ (m^\alpha - 1) + \frac{\mu(x)+1}{\lambda(x)+1} \cdot (b^\alpha - 1) \right] x^\alpha \cdot [\lambda(x) + 1] \\
&\le \left[ (m^\alpha - 1) - \frac{m^\alpha - 1}{1 - b^\alpha} \cdot (1 - b^\alpha) \right] x^\alpha \cdot [\lambda(x) + 1] = 0,
\end{aligned}$$
if $x \ge K$. The third inequality follows from the condition $b \in (0, 1)$. If $x < K$, inequality $\int_{\mathbf{X}} w(y) q(dy|x) \le \rho w(x) + b$ follows from the definition of the constant $b$.

Consider now the stationary control strategy given by $\pi^s(0|x) = 1$ for each $x \in \mathbf{X}$. Starting from each $x_0 \in \mathbf{X}$, after the transformation
$$x \to i = \frac{\ln x - \ln x_0}{\ln m},$$
we deal with the pure birth process in the state space $\{0, 1, 2, \ldots\} \ni i$, so that all reasoning from Example 2.3.1 applies. In particular, if $\lambda(x) > 0$ and, for some $K \in \mathbf{X}$, $\sup_{x \ge K} \frac{\lambda(x)}{\ln x} < \infty$, then the process $X(\cdot)$ is non-explosive. Indeed, for an arbitrary value of $u(x_0) > 0$ one can compute $u(x_0 m^i) > 0$ for all $i$ such that $x_0 m^i < K$, and after that arguments similar to Example 2.3.1 imply that the function $u$ is unbounded, meaning that Eq. (2.46) has no nontrivial bounded solution. Hence the process $X(\cdot)$ is non-explosive due to Theorem 2.2.4(c).

Similarly, if, for some $x_0 \in \mathbf{X}$, $I \in \mathbb{N}_0$, $\inf_{i \ge I} \frac{\lambda(x_0 m^i)}{[\ln(x_0 m^i)]^l} > 0$ with $l > 1$, then the process $X(\cdot)$, starting from $X(0) = x_0 m^I$, is explosive. In the special case $\lambda(x) = x^\alpha$, it is easy to show that, for each $x_0 \in \mathbf{X}$, inequality (2.47) is satisfied for some $\varepsilon > 0$ and function $u(x) = 1 - \frac{1}{x^\alpha + 1}$, for all achievable values $x_0, mx_0, m^2 x_0, \ldots$:
$$\begin{aligned}
\int_{\mathbf{X}} u(y) q(dy|x) &= \lambda(x) \left[ \frac{1}{x^\alpha + 1} - \frac{1}{(mx)^\alpha + 1} \right] \\
&= x^\alpha \cdot \frac{(mx)^\alpha - x^\alpha}{(x^\alpha + 1)((mx)^\alpha + 1)} = \frac{x^\alpha}{x^\alpha + 1} \cdot \frac{x^\alpha}{(mx)^\alpha + 1} \cdot (m^\alpha - 1) \\
&\ge \left( 1 - \frac{1}{x_0^\alpha + 1} \right) \frac{m^\alpha - 1}{m^\alpha} \left( 1 - \frac{1}{m^\alpha x_0^\alpha + 1} \right) = \varepsilon > 0.
\end{aligned}$$
According to Corollary 2.2.4, the process X (·) with values in {x0 , mx0 , m 2 x0 , . . .} is explosive. It follows that X (·) is explosive under each initial state, too.
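The bound for the special case $\lambda(x) = x^\alpha$ under the strategy $\pi^s(0|x) = 1$ can be verified along the achievable states $x_0 m^i$. The sketch below assumes the illustrative parameter values $x_0 = 0.5$, $m = 2$, $\alpha = 1.5$; the function names are hypothetical.

```python
def drift_surveillance(x, m, alpha):
    """lambda(x) * (u(m*x) - u(x)) for lambda(x) = x^alpha and u(x) = 1 - 1/(x^alpha + 1)."""
    return x ** alpha * (1.0 / (x ** alpha + 1.0) - 1.0 / ((m * x) ** alpha + 1.0))

def epsilon_surveillance(x0, m, alpha):
    """The constant epsilon from the displayed lower bound."""
    return ((1.0 - 1.0 / (x0 ** alpha + 1.0))
            * (m ** alpha - 1.0) / m ** alpha
            * (1.0 - 1.0 / (m ** alpha * x0 ** alpha + 1.0)))

if __name__ == "__main__":
    x0, m, alpha = 0.5, 2.0, 1.5   # assumed values, for illustration only
    eps = epsilon_surveillance(x0, m, alpha)
    for i in range(6):
        x = x0 * m ** i
        print(i, drift_surveillance(x, m, alpha), ">=", eps)
```

The bound is attained at the initial state $x_0$, and the drift increases along the orbit towards $1 - m^{-\alpha}$.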
2.4 Dynkin’s Formula

Dynkin’s formula is an important and convenient tool for establishing the optimality equation, and proving the existence of a deterministic stationary strategy for the unconstrained discounted CTMDP problem. In this section, we discuss its validity for a class of functions, first under a natural Markov strategy, and then under a general π-strategy.
2.4.1 Preliminaries

We shall first present several definitions and establish several preliminary results. Recall that each natural Markov strategy generates a Q-function, say $q$, see (1.11), which in turn defines the transition function $p_q$ and the corresponding transition probability function $p_q$ of a Markov pure jump process $X(\cdot)$, see Corollary 2.1.3. Below in this section we use the notation
$$E^q_x[f(X(t))] := \int_{\mathbf{X}_\infty} f(y) p_q(0, x, t, dy), \tag{2.61}$$
for each $x \in \mathbf{X}$ and $[0, \infty]$-valued measurable function $f$ on $\mathbf{X}_\infty$. When we wish to signify that the Q-function $q$ is induced by a natural Markov strategy $\check\pi$, we also write $E^q_x$ as $E^{\check\pi}_x$ for each $x \in \mathbf{X}$.

Definition 2.4.1 Let $c \in \mathbb{R}^0_+$ be fixed. A function $f$ on $\mathbf{X}$ is called a c-drift function with respect to the Q-function $q$ on $\mathbf{X}$ if (a) it is $(0, \infty)$-valued and measurable on $\mathbf{X}$, and (b) for each $x \in \mathbf{X}$,
$$\int_{\mathbf{X}} f(y) q(dy|x, s) \le c f(x) \tag{2.62}$$
for all $s \in \mathbb{R}^0_+$.

In (2.62) of the previous definition, it is implicit that
$$\int_{\mathbf{X}} f(y) \tilde{q}(dy|x, s) < \infty,$$
so that the integral
$$\int_{\mathbf{X}} f(y) q(dy|x, s) := \int_{\mathbf{X}} f(y) \tilde{q}(dy|x, s) - q_x(s) f(x), \tag{2.63}$$
i.e., the left-hand side of (2.62), is well defined and actually finite. Here and below in this section, for a Q-function $q$ on $\mathbf{X}$, we use the notations $\tilde{q}(\Gamma|x, s) = q(\Gamma \setminus \{x\}|x, s)$ and $q_x(s) = \tilde{q}(\mathbf{X}|x, s)$. This is consistent with (1.11). Note that, if $\int_{\mathbf{X}} w(y) q(dy|x, a) \le \rho w(x) + b$ for $w \ge 1$ and $\rho, b \in \mathbb{R}^0_+$, then $w$ is a $(\rho + b + 1)$-drift function.

For a Q-function $q$ on $\mathbf{X}$, a constant $c \in \mathbb{R}^0_+$ and a c-drift function $f$ with respect to the Q-function $q$, let us introduce the f-transformed Q-function as follows.

Definition 2.4.2 Let $\mathbf{X}_\delta := \mathbf{X} \cup \{\delta\}$ with $\delta \notin \mathbf{X}$ being an isolated point that satisfies $\delta \ne x_\infty$. (Here we choose a different isolated point because we want to reserve $x_\infty$ as the isolated point in the definition of $X(t)$, see (1.7), when the state space is $\mathbf{X}_\delta$.) The f-transformed Q-function $q^f$ on $\mathbf{X}_\delta$ is defined by
$$q^f(\Gamma|x, s) = \begin{cases} \dfrac{\int_\Gamma f(y) q(dy|x, s)}{f(x)}, & \text{if } x \in \mathbf{X},\ \Gamma \in \mathcal{B}(\mathbf{X}),\ x \notin \Gamma; \\[2ex] c - \dfrac{\int_{\mathbf{X}} f(y) q(dy|x, s)}{f(x)}, & \text{if } x \in \mathbf{X},\ \Gamma = \{\delta\}; \\[2ex] 0, & \text{if } x = \delta,\ \Gamma = \mathbf{X}, \end{cases} \tag{2.64}$$
for each $s \in \mathbb{R}^0_+$. It follows from the above definition that the state $\delta$ is absorbing, and the transition rate at $x \in \mathbf{X}$ is given by
$$q^f_x(s) = c + q_x(s), \quad \forall\, s \in [0, \infty).$$
The transition function on $\mathbf{X}_\delta$ associated with the Q-function $q^f$ on $\mathbf{X}_\delta$ is denoted by $p_{q^f}(s, x, t, dy)$. It is constructed according to (2.2) and (2.3). The nonhomogeneous Markov pure jump process with the transition function $p_{q^f}$ is still denoted by $X(\cdot)$ despite its state space now being different, but this should not lead to confusion because its corresponding probability and expectation will be signified by a superscript $q^f$.

Definition 2.4.3 Suppose $f$ is a fixed $(0, \infty)$-valued measurable function on $\mathbf{X}$. A measurable function $u$ on $\mathbf{X}$ satisfying $\|u\|_f := \sup_{x \in \mathbf{X}}$
|u(x)| 0,
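$$e^{ct} E^{q^f}_x[I\{X(t) \in \mathbf{X}\}] = I\{x \in \mathbf{X}\} + \int_{(0,t]} e^{cu} \left\{ E^{q^f}_x\left[ \int_{\mathbf{X}} I\{y \in \mathbf{X}\} q^f(dy|X(u), u) \right] + c E^{q^f}_x[I\{X(u) \in \mathbf{X}\}] \right\} du, \quad \forall\, x \in \mathbf{X},\ t \in \mathbb{R}^0_+.$$
Applying Lemma 2.4.5 to the left-hand side, the above equality reads
$$\begin{aligned}
\frac{E^q_x[f(X(t))]}{f(x)} &= 1 + \int_{(0,t]} e^{cu} \left\{ E^{q^f}_x\left[ q^f(\mathbf{X}|X(u), u) \right] + c E^{q^f}_x[I\{X(u) \in \mathbf{X}\}] \right\} du \\
&= 1 + \int_{(0,t]} e^{cu} \left\{ \frac{e^{-cu}}{f(x)} \left( E^q_x\left[ \int_{\mathbf{X}} f(y) q(dy|X(u), u) \right] - c \int_{\mathbf{X}} f(y) p_q(0, x, u, dy) \right) + c E^{q^f}_x[I\{X(u) \in \mathbf{X}\}] \right\} du \\
&= 1 + \int_{(0,t]} e^{cu} \left\{ \frac{e^{-cu}}{f(x)} E^q_x\left[ \int_{\mathbf{X}} f(y) q(dy|X(u), u) \right] + \frac{e^{-cu}}{f(x)} \left( -c \int_{\mathbf{X}} f(y) p_q(0, x, u, dy) + c \int_{\mathbf{X}} f(y) p_q(0, x, u, dy) \right) \right\} du \\
&= 1 + \int_{(0,t]} \frac{E^q_x\left[ \int_{\mathbf{X}} f(y) q(dy|X(u), u) \right]}{f(x)} du, \quad \forall\, x \in \mathbf{X},\ t \in \mathbb{R}^0_+,
\end{aligned}$$
where the second equality is by (2.85), and the third equality is by Lemma 2.4.5. Now (2.73) follows.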
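2.4.3 Dynkin’s Formula Under All π-Strategies

Using the results presented earlier in this chapter, it is possible to give sharp conditions for (2.84) to hold under all natural Markov strategies, as follows.

Condition 2.4.1 There exist a $[0, \infty)$-valued measurable function $\check{w}$ on $\mathbf{X}$, a $[1, \infty)$-valued measurable function $\tilde{w}$ on $\mathbf{X}$ and a monotone nondecreasing sequence of measurable subsets $\{\check{V}_m\}_{m=1}^{\infty} \subseteq \mathcal{B}(\mathbf{X})$, such that
(a) $\check{V}_m \uparrow \mathbf{X}$ as $m \to \infty$;
(b) $\sup_{x \in \check{V}_m} \bar{q}_x < \infty$ for each $m = 1, 2, \ldots$;
(c) $\int_{\mathbf{X}} \check{w}(y) q(dy|x, a) \le \check\rho \check{w}(x)$ for some constant $\check\rho \in \mathbb{R}_+$, for all $(x, a) \in \mathbf{X} \times \mathbf{A}$;
(d) $\inf_{x \in \mathbf{X} \setminus \check{V}_m} \frac{\check{w}(x)}{\tilde{w}(x)} \uparrow \infty$ as $m \to \infty$;
(e) $\int_{\mathbf{X}} \tilde{w}(y) q(dy|x, a) \le \tilde\rho \tilde{w}(x) + \tilde{b}$ for some constants $\tilde\rho, \tilde{b} \in \mathbb{R}^0_+$, for all $(x, a) \in \mathbf{X} \times \mathbf{A}$.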
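Theorem 2.4.3 Suppose Condition 2.4.1 is satisfied. Let a natural Markov strategy $\check\pi$ be arbitrarily fixed. Then for each $\tilde{w}$-bounded measurable function $g$ on $\mathbf{X}$ such that
$$E^{\check\pi}_x\left[ \int_{(0,t]} \left| \int_{\mathbf{X}} g(y) \int_{\mathbf{A}} q(dy|X(s), a) \check\pi(da|X(s), s) \right| ds \right] < \infty, \quad \forall\, t \in \mathbb{R}^0_+,\ x \in \mathbf{X}, \tag{2.87}$$
Dynkin’s formula (2.76) under the strategy $\check\pi$ holds.

Proof Let some natural Markov strategy $\check\pi$ be arbitrarily fixed, and consider its induced Q-function $q$. Note that the function $\tilde{w}$ from Condition 2.4.1 is a c-drift function with $c = \tilde\rho + \tilde{b}$. According to Lemma 2.4.8 and Theorems 2.4.1 and 2.4.2, where $f$ is replaced by $\tilde{w}$, the statement of the theorem holds if relation (2.84) holds, where $f$ is replaced by $\tilde{w}$. The rest verifies relation (2.84) with $f$ being replaced by $\tilde{w}$. Note that
$$\begin{aligned}
\int_{\mathbf{X}} \frac{\check{w}(y)}{\tilde{w}(y)} q^{\tilde{w}}(dy|x, s) &= \int_{\mathbf{X}} \frac{\check{w}(y)}{\tilde{w}(y)} \frac{\tilde{w}(y)}{\tilde{w}(x)} \tilde{q}(dy|x, s) - (c + q_x(s)) \frac{\check{w}(x)}{\tilde{w}(x)} \\
&= \int_{\mathbf{X}} \frac{\check{w}(y)}{\tilde{w}(x)} \tilde{q}(dy|x, s) - (c + q_x(s)) \frac{\check{w}(x)}{\tilde{w}(x)} \le (\check\rho - c) \frac{\check{w}(x)}{\tilde{w}(x)}, \quad \forall\, x \in \mathbf{X},\ s \in \mathbb{R}^0_+.
\end{aligned} \tag{2.88}$$
Consider the $\mathbb{R}^0_+$-valued measurable function $w$ on $\mathbb{R}^0_+ \times \mathbf{X}_\delta$ defined for each $v \in \mathbb{R}^0_+$ by $w(v, x) = \frac{\check{w}(x)}{\tilde{w}(x)}$ if $x \in \mathbf{X}$ and $w(v, \delta) = 0$. Then Condition 2.2.1, with $\mathbf{X}$ and $q$ being replaced by $\mathbf{X}_\delta$ and $q^{\tilde{w}}$, is satisfied by the constant $\alpha = \check\rho$, the monotone nondecreasing sequence of measurable subsets $\{\tilde{V}_n\}_{n=1}^{\infty}$ of $\mathbb{R}^0_+ \times \mathbf{X}_\delta$ defined by $\tilde{V}_n = \mathbb{R}^0_+ \times (\check{V}_n \cup \{\delta\})$, $\forall\, n \in \mathbb{N}$, and the function $w$ on $\mathbb{R}^0_+ \times \mathbf{X}_\delta$ defined above. In greater detail, part (d) of the corresponding version of Condition 2.2.1 is satisfied because, by (2.88),
$$\begin{aligned}
&\int_{\mathbb{R}_+} \int_{\mathbf{X}_\delta} w(t + v, y) e^{-\alpha t - \int_{(0,t]} q^{\tilde{w}}_x(s+v) ds}\, \widetilde{q^{\tilde{w}}}(dy|x, t + v)\, dt \\
&\quad \le \int_{\mathbb{R}_+} e^{-\check\rho t - \int_{(0,t]} q^{\tilde{w}}_x(s+v) ds} \left( q^{\tilde{w}}_x(t + v) + \check\rho \right) w(v, x)\, dt = w(v, x), \quad \forall\, x \in \mathbf{X},
\end{aligned}$$
and the last inequality holds trivially when $x = \delta$. Thus, by Theorem 2.2.1, we see that relation (2.84) is satisfied, and the statement follows.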
Let us comment on the sharpness of Condition 2.4.1 in relation to the statement of the above theorem. Suppose there are a measurable function $\tilde{w}$ and constants $\tilde\rho, \tilde{b} \in \mathbb{R}^0_+$ such that Condition 2.4.1(e) is satisfied. Then $\tilde{w}$ is a c-drift function with respect to the Q-function $q$ with $c = (\tilde\rho + \tilde{b})$. If $\mathbf{A}$ is a singleton and if the state space $\mathbf{X}$ is denumerable, then the other parts of Condition 2.4.1 are actually necessary and sufficient for the statement of Theorem 2.4.3 to hold.

In greater detail, as in the proof of Theorem 2.4.3, according to Theorems 2.4.1 and 2.4.2 with $f$ being replaced by $\tilde{w}$, the statement of Theorem 2.4.3 is equivalent to relation (2.84) with $f$ being replaced by $\tilde{w}$. In this homogeneous denumerable case, relation (2.84), with $f$ being replaced by $\tilde{w}$, is equivalent to Condition 2.2.3 with $\mathbf{X}$ and $q$ being replaced by $\mathbf{X}_\delta$ and $q^{\tilde{w}}$, according to Theorem 2.2.4(b). Part (d) of the corresponding version of Condition 2.2.3 now reads
$$\sum_{y \in \mathbf{X}_\delta} w(y) q^{\tilde{w}}(y|x) = \sum_{y \in \mathbf{X}} w(y) \frac{\tilde{w}(y)}{\tilde{w}(x)} \tilde{q}(y|x) - (c + q_x) w(x) + w(\delta) \left( c - \frac{\sum_{y \in \mathbf{X}} \tilde{w}(y) q(y|x)}{\tilde{w}(x)} \right) \le \alpha w(x), \quad \forall\, x \in \mathbf{X}.$$
This is equivalent to
$$\sum_{y \in \mathbf{X}} w(y) \tilde{w}(y) \tilde{q}(y|x) - (c + q_x) w(x) \tilde{w}(x) + w(\delta) \tilde{w}(x) \left( c - \frac{\sum_{y \in \mathbf{X}} \tilde{w}(y) q(y|x)}{\tilde{w}(x)} \right) \le \alpha w(x) \tilde{w}(x), \quad \forall\, x \in \mathbf{X},$$
or equivalently,
$$\sum_{y \in \mathbf{X}} w(y) \tilde{w}(y) q(y|x) + w(\delta) \tilde{w}(x) \left( c - \frac{\sum_{y \in \mathbf{X}} \tilde{w}(y) q(y|x)}{\tilde{w}(x)} \right) \le (\alpha + c) w(x) \tilde{w}(x), \quad \forall\, x \in \mathbf{X}.$$
Since $\tilde{w}$ is a c-drift function with respect to the Q-function $q$, the above inequality implies
$$\sum_{y \in \mathbf{X}} w(y) \tilde{w}(y) q(y|x) \le (\alpha + c) w(x) \tilde{w}(x), \quad \forall\, x \in \mathbf{X}.$$
Therefore, Condition 2.4.1(a)–(d) is satisfied with $\check{V}_m = V_m$, $\check{w}$ being defined by $\check{w}(x) = w(x) \tilde{w}(x)$ for each $x \in \mathbf{X}$, and $\check\rho = \alpha + c$. To summarise, if $\mathbf{A}$ is a singleton and $\mathbf{X}$ is denumerable, then, under Condition 2.4.1(e), Condition 2.4.1(a)–(d) is necessary and sufficient for the statement in Theorem 2.4.3 to hold.

Let us impose the following conditions.
Condition 2.4.2 There exist a measurable function $w : \mathbf{X} \to \mathbb{R}^0_+$ and constants $\rho \in \mathbb{R}$, $L, b \in \mathbb{R}^0_+$ such that the following assertions hold.
(a) $\bar{q}_x \le L w(x)$ for each $x \in \mathbf{X}$ and $a \in \mathbf{A}$, where $\bar{q}_x$ is defined by (1.1).
(b) For each $x \in \mathbf{X}$ and $a \in \mathbf{A}$, $\int_{\mathbf{X}} w(y) q(dy|x, a) \le \rho w(x) + b$.

The above condition is stronger than Condition 2.2.5. Indeed, Condition 2.2.5 holds after we put $V_l := \{x \in \mathbf{X} : w(x) \le l\}$. Note also that if Condition 2.4.2 holds, then it holds also after we add any positive constant to the function $w$. Therefore, one can assume that the function $w$ in Condition 2.4.2 is $[1, \infty)$-valued without loss of generality. Similarly, without loss of generality, one can assume that $\rho \ge 0$. Recall that, under Condition 2.4.2, there is no explosion under each control strategy according to Theorem 2.2.5. It is also evident that Condition 2.4.2 is satisfied if $\sup_{x \in \mathbf{X}} \bar{q}_x < \infty$, by putting e.g., $w(x) = 1$ for each $x \in \mathbf{X}$.

Condition 2.4.3 (a) Condition 2.4.2 is satisfied, and there exist a $[1, \infty)$-valued measurable function $w'$ on $\mathbf{X}$ and constants $L', b' \ge 0$ and $\rho' \in \mathbb{R}$ such that the following assertions hold:
(b) $(\bar{q}_x + 1) w'(x) \le L' w(x)$ for each $x \in \mathbf{X}$.
(c) $\int_{\mathbf{X}} q(dy|x, a) w'(y) \le \rho' w'(x) + b'$ for each $x \in \mathbf{X}$ and $a \in \mathbf{A}$.
(d) $\alpha > \rho'$.
(e) There exists a constant $M \ge 0$ satisfying $|\inf_{a \in \mathbf{A}} c_0(x, a)| \le M w'(x)$ for each $x \in \mathbf{X}$.

Clearly, without loss of generality, one can assume in the previous condition that $\rho' \ge 0$. Parts (d), (e) of Condition 2.4.3, which concern the discount factor and the cost rates, are of no use in this chapter.

Lemma 2.4.9 Suppose Condition 2.4.3(a)–(c) is satisfied for functions $w$ and $w'$ and nonnegative constants $\rho, b, L, L', \rho', b'$. Then Condition 2.4.1 is satisfied by the constants $\check\rho = \max\{\rho, b\}$, $\tilde\rho = \rho'$, $\tilde{b} = b'$, the functions $\tilde{w} = w'$ and $\check{w} = w + 1$, and the subsets $\{\check{V}_m\}_{m=1}^{\infty}$ with $\check{V}_m = \left\{ x \in \mathbf{X} : \frac{\check{w}(x)}{\tilde{w}(x)} \le m \right\}$ for each $m \in \mathbb{N}$. Moreover, (2.87) holds for each $w'$-bounded measurable function $g$ on $\mathbf{X}$ under each natural Markov strategy $\check\pi$. (In fact, the left-hand side of (2.87) is $w$-bounded.) Consequently, Dynkin’s formula (2.76) holds for each $w'$-bounded measurable function $g$ on $\mathbf{X}$ under each natural Markov strategy.

Proof For the first assertion, Conditions 2.4.1(a), (d), (e) are obviously satisfied.
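For Condition 2.4.1(b), note that Condition 2.4.3(b) implies
$$\bar{q}_x \le L' \frac{w(x)}{w'(x)} = L' \frac{\check{w}(x) - 1}{\tilde{w}(x)} \le L' m \quad \text{for all } x \in \check{V}_m,\ m = 1, 2, \ldots$$
For Condition 2.4.1(c), note that Condition 2.4.3(a) implies
$$\int_{\mathbf{X}} \check{w}(y) q(dy|x, a) \le \rho w(x) + b \le \max\{\rho, b\}(w(x) + 1) = \check\rho \check{w}(x).$$
The first assertion is thus verified.

For the second assertion, let some natural Markov strategy $\check\pi$, $t \in \mathbb{R}^0_+$, $x \in \mathbf{X}$ be arbitrarily fixed. Recall that $\tilde{w}$ is a c-drift function for $c = \tilde\rho + \tilde{b}$, see the proof of Theorem 2.4.3. By Lemma 2.4.1, it suffices to note that
$$\begin{aligned}
E^{\check\pi}_x\left[ \int_{(0,t]} \tilde{w}(X(s))(1 + q_{X(s)}(s)) ds \right] &= E^{\check\pi}_x\left[ \int_{(0,t]} w'(X(s))(1 + q_{X(s)}(s)) ds \right] \le L' E^{\check\pi}_x\left[ \int_{(0,t]} w(X(s)) ds \right] \\
&\le L' E^{\check\pi}_x\left[ \int_{(0,t]} \check{w}(X(s)) ds \right] \le L' (w(x) + 1) \int_0^t e^{\check\rho s} ds < \infty,
\end{aligned} \tag{2.89}$$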
where the first equality, and the first and the second inequalities hold automatically, see Condition 2.4.3(b), and the last inequality immediately follows from Lemma 2.4.2 applied to the $\check\rho$-drift function $\check{w}$.

We can straightforwardly extend the previous results to the class of π-strategies. In the next statement, recall that the $\mathcal{P}(\mathbf{A})$-valued predictable process $\Pi$ was introduced in (1.8).

Theorem 2.4.4 Suppose Condition 2.4.3(a), (b), (c) is satisfied. Then for each $g \in \mathbf{B}_{w'}(\mathbf{X})$ and each π-strategy $S$, the following two versions of the Dynkin formula hold:
$$E^S_x[g(X(t))] - g(x) = E^S_x\left[ \int_{(0,t]} \int_{\mathbf{X}} \int_{\mathbf{A}} \Pi(da|v) q(dy|X(v), a) g(y)\, dv \right], \quad \forall\, x \in \mathbf{X},\ t \in \mathbb{R}^0_+; \tag{2.90}$$
$$E^S_x[g(X(t))] e^{-\alpha t} - g(x) = E^S_x\left[ \int_{(0,t]} e^{-\alpha v} \left( -\alpha g(X(v)) + \int_{\mathbf{X}} \int_{\mathbf{A}} \Pi(da|v) q(dy|X(v), a) g(y) \right) dv \right], \quad \forall\, x \in \mathbf{X},\ t \in \mathbb{R}^0_+, \tag{2.91}$$
where $\alpha > 0$ is a constant. All of the expressions in (2.90) and (2.91) are finite. If the initial distribution $\gamma$ is such that $\int_{\mathbf{X}} w(x) \gamma(dx) < \infty$, then formulae (2.90) and (2.91) are valid for the expectation $E^S_\gamma$; the term $g(x)$ on the left should be replaced with $\int_{\mathbf{X}} g(x) \gamma(dx)$.

Proof Let $S$ be a fixed π-strategy, and $\check\pi$ its induced natural Markov strategy. Condition 2.4.3(a), which is stronger than Condition 2.2.5, as noted below Condition 2.4.2, implies that the process $X(\cdot)$ is non-explosive under each strategy, see Theorem 2.2.5. It follows from this and Corollary 2.1.2 that
$$P^{\check\pi}_x(t, dy \times da) = P^S_x(t, dy \times da). \tag{2.92}$$
It remains to apply Lemma 2.4.9 for the verification of relation (2.90). All the expressions in (2.90) are finite by (2.89) and (2.92).

To prove (2.91), note that the right-hand side can be represented, using (2.90), as
$$\begin{aligned}
&-E^S_x\left[ \int_{(0,t]} \alpha e^{-\alpha v} g(X(v))\, dv \right] + \int_{(0,t]} e^{-\alpha v} E^S_x\left[ \int_{\mathbf{X}} \int_{\mathbf{A}} \Pi(da|v) q(dy|X(v), a) g(y) \right] dv \\
&\quad = -\int_{(0,t]} \alpha e^{-\alpha v} E^S_x[g(X(v))]\, dv + e^{-\alpha t} \left\{ E^S_x[g(X(t))] - g(x) \right\} + \int_{(0,t]} \alpha e^{-\alpha v} \left\{ E^S_x[g(X(v))] - g(x) \right\} dv.
\end{aligned}$$
The second integral was calculated by parts. It was proved that all particular terms in the presented expression are finite, the integrals are well defined, and the interchange of the order of the integrals and expectations is legal. Therefore, the right-hand side of (2.91) equals
$$e^{-\alpha t} E^S_x[g(X(t))] - g(x) \left\{ e^{-\alpha t} + \int_{(0,t]} \alpha e^{-\alpha v} dv \right\} = e^{-\alpha t} E^S_x[g(X(t))] - g(x),$$
i.e., coincides with the left-hand side of (2.91). Finally, the relations (2.90) and (2.91) remain valid if the initial state $x$ is replaced by an initial distribution $\gamma$ satisfying the requirement in this theorem, virtually by (2.89) and (2.92).

It is also possible to obtain Theorem 2.4.4 more directly by using the following Kolmogorov forward equation under a π-strategy. Although we will not reproduce that proof, this result is included for future reference.

Theorem 2.4.5 Suppose Condition 2.4.2 is satisfied. Then under each fixed π-strategy $S$, the Kolmogorov forward equation holds:
$$P^S_x(X(t) \in \Gamma) = I\{x \in \Gamma\} + E^S_x\left[ \int_{(0,t]} \int_{\mathbf{A}} \Pi(da|u) \tilde{q}(\Gamma|X(u), a)\, du \right] - E^S_x\left[ \int_{(0,t]} \int_{\mathbf{A}} \Pi(da|u) q_{X(u)}(a) I\{X(u) \in \Gamma\}\, du \right], \quad \forall\, x \in \mathbf{X},\ t \in \mathbb{R}^0_+,\ \Gamma \in \mathcal{B}(\mathbf{X}). \tag{2.93}$$
If the initial distribution $\gamma$ is such that $\int_{\mathbf{X}} w(x) \gamma(dx) < \infty$, then formula (2.93) holds for the probability and expectation $P^S_\gamma$ and $E^S_\gamma$; the indicator function $I\{x \in \Gamma\}$ should be replaced with $\gamma(\Gamma)$.
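The Kolmogorov forward equation can be illustrated on the simplest possible case. The sketch below is not taken from the text: it assumes an uncontrolled two-state chain on $\{0, 1\}$ (a single action) with rates $a$ for the jump $0 \to 1$ and $b$ for the jump $1 \to 0$, for which $P_0(X(t) = 1) = \frac{a}{a+b}(1 - e^{-(a+b)t})$ is the standard closed form, and it evaluates the right-hand side of the forward equation for $\Gamma = \{1\}$ by the midpoint rule.

```python
import math

def p1(t, a, b):
    """P(X(t) = 1 | X(0) = 0) for the two-state chain with rates a (0 -> 1) and b (1 -> 0)."""
    return a / (a + b) * (1.0 - math.exp(-(a + b) * t))

def forward_rhs(t, a, b, n=20000):
    """Right-hand side of the forward equation for Gamma = {1}, started at x = 0.

    Entrance term: a * P(X(u) = 0); exit term: b * P(X(u) = 1).
    The time integral over (0, t] is approximated by the midpoint rule with n cells.
    """
    h = t / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += a * (1.0 - p1(u, a, b)) - b * p1(u, a, b)
    return total * h  # the indicator I{x in Gamma} vanishes because x = 0

if __name__ == "__main__":
    a, b, t = 1.5, 0.7, 2.0   # assumed rates and time horizon
    print(p1(t, a, b), forward_rhs(t, a, b))
```

Both printed numbers should agree up to the quadrature error, reflecting that the probability of being in $\Gamma$ equals the expected number of entrances minus the expected number of exits.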
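Proof We assume that a π-strategy $S$ is fixed. If we put $\Gamma_R = (0, t]$, then the random measure $\mu$ given by (1.4) counts the entrances of the process $X(\cdot)$ in the set $\Gamma_X \in \mathcal{B}(\mathbf{X})$ on the interval $(0, t]$. Let us consider also the following random measure $\tilde\mu$ defined by
$$\tilde\mu(\omega; \Gamma_R \times \Gamma_X) = \sum_{n \ge 1} I\{T_n(\omega) < \infty\} \delta_{(T_n(\omega), X_{n-1}(\omega))}(\Gamma_R \times \Gamma_X), \quad \forall\, \Gamma_R \in \mathcal{B}(\mathbb{R}_+),\ \Gamma_X \in \mathcal{B}(\mathbf{X}).$$
For $\Gamma_R = (0, t]$, it counts the exits of the process from the set $\Gamma_X \in \mathcal{B}(\mathbf{X})$ on the interval $(0, t]$. Thus
$$I\{X(t) \in \Gamma_X\} = I\{X(0) \in \Gamma_X\} + \mu((0, t] \times \Gamma_X) - \tilde\mu((0, t] \times \Gamma_X). \tag{2.94}$$
The random measure $\nu$ given by (1.14) is the dual predictable projection of $\mu$. (See Lemma A.1.1.) For the π-strategy $S$, in the case of no explosion, it has the form: for all $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$, $\Gamma_X \in \mathcal{B}(\mathbf{X})$,
$$\nu(\omega; \Gamma_R \times \Gamma_X) = \int_{\Gamma_R} \int_{\mathbf{A}} \Pi(da|\omega, u) \tilde{q}(\Gamma_X|X(u), a)\, du.$$
According to Lemma A.1.2, in the case of no explosion, the dual predictable projection of $\tilde\mu$ is given for all $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$, $\Gamma_X \in \mathcal{B}(\mathbf{X})$ by
$$\tilde\nu(\omega; \Gamma_R \times \Gamma_X) = \int_{\Gamma_R} \int_{\mathbf{A}} \Pi(da|\omega, u) q_{X(u-)}(a) \delta_{X(u-)}(\Gamma_X)\, du.$$
Under the imposed conditions, for all $x \in \mathbf{X}$, $t \in \mathbb{R}^0_+$, $\Gamma \in \mathcal{B}(\mathbf{X})$,
$$E^S_x[\tilde\nu((0, t] \times \Gamma)] \le E^S_x\left[ \int_{(0,t]} L w(X(u-)) I\{X(u-) \in \Gamma\}\, du \right] \le L \int_{(0,t]} E^S_x[w(X(u))]\, du < \infty,$$
where the last inequality follows from Corollary 2.2.8. Therefore,
$$E^S_x[\tilde\mu((0, t] \times \Gamma)] = E^S_x[\tilde\nu((0, t] \times \Gamma)] < \infty.$$
Since $|\mu((0, t] \times \Gamma) - \tilde\mu((0, t] \times \Gamma)| \le 1$,
$$E^S_x[\mu((0, t] \times \Gamma)] = E^S_x[\nu((0, t] \times \Gamma)] < \infty.$$
After taking expectations $E^S_x$ on both sides of (2.94), we obtain the equality (2.93). If $\int_{\mathbf{X}} w(x) \gamma(dx) < \infty$, the same reasoning applies for the expectations $E^S_\gamma$.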
We remark that the expectations that appear in (2.93) are finite.
2.4.4 Example

We end this section with the following example, which shows that in general, for a c-drift function $f$, the conditions
$$E^q_x[f(X(t))] < \infty, \quad \forall\, t \in \mathbb{R}^0_+,\ x \in \mathbf{X}; \qquad E^q_x\left[ \int_{(0,t]} \sum_{y \in \mathbf{X}} f(y) q(y|X(s))\, ds \right] < \infty, \quad \forall\, t \in \mathbb{R}^0_+,\ x \in \mathbf{X},$$
are not sufficient for (2.73).

Consider on the state space $\mathbf{X} = \{0, 1, 2, \ldots\}$ the homogeneous Q-function $q$ defined by
$$q(y|x) = \begin{cases} \frac{5}{12} 2^x, & \text{if } x \ne 0,\ y = x + 1; \\ \frac{7}{12} 2^x, & \text{if } x \ne 0,\ y = x - 1; \\ 0, & \text{if } x = 0. \end{cases}$$
Here, by homogeneous Q-function $q$ we mean that $q(y|x, s)$ does not depend on $s \in \mathbb{R}^0_+$ for each $x, y \in \mathbf{X}$. For brevity, we have written in the above definition $q(y|x)$ instead of $q(y|x, s)$. Consider the function $f$ on $\mathbf{X}$ defined by
$$f(x) = \left( \frac{7}{5} \right)^x, \quad \forall\, x \in \mathbf{X}.$$
Then one can verify that
$$\sum_{y \in \mathbf{X}} f(y) q(y|x) = 0 \le \frac{1}{5} f(x), \quad \forall\, x \in \mathbf{X},$$
i.e., $f$ is a 0-drift function with respect to the Q-function $q$. Thus, it is clear that Condition 2.2.3 is satisfied with $w$ being replaced by $f$, and from Theorem 2.2.4(a), we see that the Markov pure jump process corresponding to this Q-function is non-explosive.
Now the f-transformed Q-function $q^f$ on $\mathbf{X}_\delta$ is defined by
$$q^f(y|x) = \begin{cases} \frac{7}{12} 2^x, & \text{if } x \ne \delta,\ x \ne 0,\ y = x + 1; \\ \frac{5}{12} 2^x, & \text{if } x \ne \delta,\ x \ne 0,\ y = x - 1; \\ 0, & \text{if } x \ne \delta,\ y = \delta; \\ 0, & \text{if } x = \delta \text{ or } x = 0, \end{cases}$$
and $q^f_x = 2^x$ for each $x \ne \delta, 0$, and $q^f_x = 0$ if $x = 0, \delta$. Next we shall verify that relation (2.84) does not hold. The isolated point $\delta$ is absorbing and is not accessible to any other states, and thus, we can ignore it, and consider, without loss of generality, $q^f$ as a Q-function on $\mathbf{X}$, still denoted as $q^f$. This Q-function $q^f$ is upwardly skip-free in the sense that $q^f(x + 1|x) > 0$ for each $x \in \mathbb{N}$, and $q^f(y|x) = 0$ for each $x \in \mathbb{N}_0$, and $y \in \mathbb{N}$ such that $y > x + 1$. Here we quote without proof a useful criterion for (2.84) for such an upwardly skip-free Q-function $q^f$ on $\mathbf{X} = \mathbb{N}_0$. Let $R_0 = 1$, and for each $n \in \mathbb{N}$,
$$R_n = \frac{1}{q^f(n + 1|n)} \left( 1 + \sum_{m=1}^{n} \sum_{k=0}^{m-1} q^f(k|n) R_{m-1} \right).$$
Then (2.84) holds if and only if
$$\sum_{n=1}^{\infty} R_n = \infty.$$
See more information on this in Sect. 3.4 below. Note that in our case, we have
$$R_n = \frac{1}{q^f(n + 1|n)} \left( 1 + q^f(n - 1|n) R_{n-1} \right) = \frac{12}{7} \frac{1}{2^n} + \frac{5}{7} R_{n-1}, \quad \forall\, n = 1, 2, \ldots.$$
Simple iterations result in
$$R_n = \left( \frac{5}{7} \right)^n + \sum_{i=0}^{n-1} \frac{12}{7} \frac{1}{2^{n-i}} \left( \frac{5}{7} \right)^i \le \sum_{i=0}^{n} \frac{12}{7} \left( \frac{5}{7} \right)^i \frac{1}{2^{n-i}}, \quad \forall\, n = 1, 2, \ldots.$$
Thus,
$$\sum_{n=1}^{\infty} R_n \le \sum_{n=0}^{\infty} \sum_{i=0}^{n} \frac{12}{7} \left( \frac{5}{7} \right)^i \frac{1}{2^{n-i}} = \frac{12}{7} \sum_{i=0}^{\infty} \left( \frac{5}{7} \right)^i \sum_{n=i}^{\infty} \frac{1}{2^{n-i}} = 12 < \infty.$$
Thus, by the criterion stated in the above, we see that relation (2.84) does not hold. Now by Theorem 2.4.2, relation (2.73) does not hold, even though
$$E^q_x[f(X(t))] < \infty, \quad \forall\, t \in \mathbb{R}^0_+,\ x \in \mathbf{X}; \qquad E^q_x\left[ \int_{(0,t]} \sum_{y \in \mathbf{X}} f(y) q(y|X(s))\, ds \right] = 0 < \infty, \quad \forall\, t \in \mathbb{R}^0_+,\ x \in \mathbf{X},$$
where the first inequality is by Lemma 2.4.2. In the framework of Theorem 2.4.3, Condition 2.4.1(e) is satisfied for $\tilde{w} = f$ and $\tilde\rho = \tilde{b} = 0$, but the Dynkin formula (2.76) fails to hold, so that, from the discussions after Theorem 2.4.3, the other parts of Condition 2.4.1 cannot be satisfied for this example.
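The computations in this example are easy to reproduce with exact rational arithmetic. The sketch below is an illustration only: it checks that $f(x) = (7/5)^x$ has zero drift, that the f-transformed rates swap the coefficients $5/12$ and $7/12$, and that the partial sums of the sequence $R_n$ from the quoted criterion stay below 12; the function names and the truncation at 200 terms are assumptions made here.

```python
from fractions import Fraction

def q(y, x):
    """Off-diagonal rates of the homogeneous Q-function of the example."""
    if x == 0:
        return Fraction(0)
    if y == x + 1:
        return Fraction(5, 12) * 2**x
    if y == x - 1:
        return Fraction(7, 12) * 2**x
    return Fraction(0)

def f(x):
    """The 0-drift function f(x) = (7/5)^x."""
    return Fraction(7, 5) ** x

def drift_f(x):
    """sum_y f(y) q(y|x), including the diagonal term -q_x f(x)."""
    up, down = q(x + 1, x), q(x - 1, x) if x > 0 else Fraction(0)
    below = f(x - 1) if x > 0 else Fraction(0)
    return up * f(x + 1) + down * below - (up + down) * f(x)

def q_f(y, x):
    """f-transformed off-diagonal rate f(y) q(y|x) / f(x) on the original states."""
    return f(y) * q(y, x) / f(x)

def chen_R(n_max):
    """R_0 = 1 and R_n = (12/7) / 2^n + (5/7) R_{n-1} for n = 1, ..., n_max."""
    R = [Fraction(1)]
    for n in range(1, n_max + 1):
        R.append(Fraction(12, 7) / 2**n + Fraction(5, 7) * R[-1])
    return R

if __name__ == "__main__":
    print([drift_f(x) for x in range(5)])      # all zeros
    print(q_f(4, 3), q_f(2, 3))                # (7/12)*2^3 and (5/12)*2^3
    R = chen_R(200)
    print(float(sum(R[1:])))                   # stays below the bound 12
```

Solving the linear recursion shows that the partial sums increase to $17/2$, which is consistent with, and slightly sharper than, the bound 12 obtained above.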
2.5 Bibliographical Remarks

Section 2.1. The definitions presented in Sect. 2.1.1 mainly come from [70, 96, 153], though compared to Definition 2.1.2, the requirement in the definition of a transition function in [153] is more relaxed. The construction of the transition function given a Q-function presented in Sect. 2.1.2 is known as Feller’s construction. It was originally introduced in [90], assuming the additional continuity properties of $q(\cdot|x, t)$ in $t$. For this reason, some earlier works on CTMDPs were restricted to the class of natural Markov strategies, under which the induced signed kernel $q(\cdot|x, t)$ satisfies the continuity condition; see e.g., [99, 101, 108, 114]. The continuity condition on the Q-function is relaxed to a measurability condition as in the present text in [249] for the case of a denumerable state space, and more recently in [85, 251] for the case of a Borel state space. The materials in this section mainly come from [85, 86, 90, 251].

Section 2.2. The main statements presented in Sect. 2.2.1 are taken from [264]. The key observation is Corollary 2.2.1, which can be reformulated as follows: the Markov pure jump process $X(\cdot)$ with the transition function $p_q$ is non-explosive if and only if the discrete-time Markov chain $\{(\hat{T}_n, \hat{X}_n)\}_{n=0}^{\infty}$ is Harris recurrent with the maximal irreducibility measure concentrated on the atom $\{(\infty, x_\infty)\}$. The definition, and sufficient and necessary (in a fairly general setup) conditions for the Harris recurrence of this type of discrete-time Markov chain can be found in e.g., [167]. In this connection, we mention that the proofs of Theorems 2.2.1 and 2.2.2 presented here are an adaptation of the reasoning in the proof of Theorem 8.4.3 and Theorem 9.4.2 in [167] concerning the Harris recurrence of $\{(\hat{T}_n, \hat{X}_n)\}_{n=0}^{\infty}$. Theorem 2.2.3 and Corollary 2.2.3 were presented in [135]. In the homogeneous case, the sufficiency of Condition 2.2.3 for the non-explosiveness was proved in [36], see also [39, 40].
If the state space is denumerable, making use of a result in [164], the author of [230] noted that Condition 2.2.3 is also necessary for the non-explosiveness. An extension of Condition 2.2.3 to the nonhomogeneous case was presented in [41, 265]. That condition seems to be stronger than our condition presented in this chapter; see Sect. A.4 for more details. The necessary and sufficient condition for non-explosiveness in part (c) of Theorem 2.2.4 is known as Reuter’s criterion, see [39, 91, 202]. Although Reuter’s
criterion is in general not easy to verify, it is quite powerful when dealing with birth-and-death processes; see [5, 31]. It seems that explosion is also related to the understanding of some nonstandard phenomena in other fields such as ecology. As an example, according to [126], “a striking feature of populations of certain species of small mammals is the occurrence of periodic outbreaks. While for most of the time the populations persist at low density, it happens from time to time that birth-rates and numbers show a rapid and accelerating increase, resulting, if the animal is a pest, in a ‘plague’ when density is extremely high.” In [126] a birth-and-death process is used to model the population of such a species, and the explosion corresponds to the occurrence of the outbreak. It is reported that the model considered therein could reproduce some of the notable characteristics of the phenomenon observed by ecologists. Subsection 2.2.4. Condition 2.2.5 and Theorem 2.2.5 appeared in [190]. The rest of the material in this subsection is largely based on [84, 86, 87]. Subsection 2.2.5. These materials are extensions of the corresponding statements in [112, 113, 190], which all deal with π-strategies. Section 2.3. More discussions on birth-and-death processes can be found in Sect. 3.2 and Chap. 8 of [5] and the monograph [245]. Example 2.3.2 appeared in Example 8.1 of [112], and a simpler version was presented in Example 6.1 of [101]. Example 2.3.3 again appeared as Example 8.3 in [112], and its special version was presented in Example 4.3 of [115]. Example 2.3.2 is similar to the one considered in [132] in discrete-time. Example 2.3.3 is borrowed from [243], where stronger results were obtained for the underlying fragmentation model under the assumption of (2.59); see Theorem 4.3 therein. Section 2.4. 
The materials presented in this section are mainly taken from [264], which is an extension of [231] from the denumerable homogeneous case to the general nonhomogeneous case. The notion of the f -transformed Q-function described in Definition 2.4.2 can be found in [5, 231]. If f is a positive constant function, the f -transformed Q-function corresponds to the transformation of an α-discounted cost model to a total undiscounted cost model described in Sect. 1.3.5. Lemma 2.1.2 comes from [86]. Lemma 2.4.3 is an extension of the corrected version of Proposition 2.14 in [5] for the denumerable homogeneous case, which was demonstrated to be inaccurate in [37]. The statement of Lemma 2.1.3 is also stated as Corollary 4.2 in [85]. The proof presented here is the same as the proof of Lemma 1(ii) of [251], where the important condition on is missing; consequently part (ii) of Theorem 2 therein is not quite correct. Condition 2.4.1 is similar to the main condition assumed in [26]. Example 2.4.4 is taken from [231]. The criterion quoted in Example 2.4.4 is taken from Theorem 2 of [35], and it is sometimes called M.F. Chen’s criterion; see [38] for one of its early versions. Instead of using this criterion, some results in [231] can also be used. As pointed out in [230, 231], Example 2.4.4 demonstrates that the statement in Appendix C.3 of [106] and Theorem 2.2 in Chap. 9 of [31] are inaccurate. For the case of bounded transition rates, Theorems 2.4.5 and 2.4.4 under deterministic natural Markov strategies appeared in [253, 255]. In the current form, Theorem 2.4.5 for the case of the bounded transition rate was first established in [149], where
the proof that ν˜ is the dual predictable projection of μ˜ is subject to a correctable inaccuracy, see also Chap. 4 of [150]. The reasoning in [149, 150] is later applied to the unbounded transition rates in [112, 113, 190]. The material presented in this section is mainly based on [190].
Chapter 3
The Discounted Cost Model
In this chapter, let α > 0 be a fixed discount factor. We shall consider the α-discounted CTMDP problems (1.15) and (1.16). Without loss of generality we can restrict to (relaxed) π-strategies because (Markov) π-strategies are sufficient for α-discounted problems; see Remark 1.3.1. (In fact, in this chapter, the class of natural Markov strategies is also sufficient, according to (2.92). However, we choose to stay with the more general class of π-strategies. The same is true for some of the subsequent chapters.) Although most π-strategies are not realizable, under appropriate conditions, the solution to the unconstrained problem is given by a realizable deterministic stationary strategy (see Theorem 3.1.2). Moreover, as was explained in Sect. 1.3.5, for every π-strategy there exists an equivalent standard ξ-strategy which is realizable.

For ease of reference, we formulate the two α-discounted CTMDP problems restricted to π-strategies as follows. The unconstrained α-discounted CTMDP problem (restricted to the class of π-strategies) reads

$$\text{Minimize over } S \in \mathbf{S}_\pi:\quad W_0^\alpha(S,\gamma), \tag{3.1}$$

where we recall

$$W_0^\alpha(S,\gamma) = E_\gamma^S\left[\int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{A}} c_0(X(t-),a)\,\Pi(da|t)\,dt\right].$$

The $\mathcal{P}(\mathbf{A})$-valued process $\Pi$ was introduced in (1.8). Note that one can replace here $X(t-)$ with $X(t)$, and the convention $c(x_\infty,a) := 0$ for each $a \in \mathbf{A}$ is in use. A π-strategy $S^* \in \mathbf{S}_\pi$ is called optimal for problem (3.1) under the given initial distribution γ if $W_0^\alpha(S^*,\gamma) = \inf_{S\in\mathbf{S}_\pi} W_0^\alpha(S,\gamma)$.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_3
In the case of unconstrained optimization, it is also traditional not to fix the initial distribution and to look for a strategy which is optimal for all initial states $x \in \mathbf{X}$.

Definition 3.0.1 A strategy $S^* \in \mathbf{S}_\pi$ is called uniformly optimal for the α-discounted problem (3.1) if

$$W_0^\alpha(S^*,x) = \inf_{S\in\mathbf{S}_\pi} W_0^\alpha(S,x) =: W_0^{*\alpha}(x), \quad \forall\, x \in \mathbf{X}. \tag{3.2}$$

Here the last equality holds automatically because of the sufficiency of π-strategies for α-discounted CTMDP problems.

The constrained α-discounted CTMDP problem (restricted to the class of π-strategies) reads

$$\text{Minimize over } S \in \mathbf{S}_\pi:\quad W_0^\alpha(S,\gamma) \quad \text{subject to } W_j^\alpha(S,\gamma) \le d_j, \ \forall\, j = 1,2,\dots,J, \tag{3.3}$$

where $d_j \in \mathbb{R}$ is a given number for each $j = 1,2,\dots,J$.

Definition 3.0.2 A π-strategy $S$ is called feasible for problem (3.3) if $W_j^\alpha(S,\gamma) \le d_j$, $\forall\, j = 1,2,\dots,J$. A strategy $S^* \in \mathbf{S}_\pi$ is called optimal under the initial distribution γ for problem (3.3) if it is feasible for problem (3.3), and satisfies $W_0^\alpha(S^*,\gamma) \le W_0^\alpha(S,\gamma)$ for all feasible π-strategies $S$ for problem (3.3).

After the optimal or uniformly optimal strategy is obtained, one can construct the equivalent realizable Markov standard ξ-strategy using Theorem 1.3.2. Note that Condition 1.3.2(b) is satisfied according to the second interpretation of discounting in Sect. 1.3.5. Below we introduce the Dynamic Programming and Convex Analytic Approaches to investigating problems (3.1) and (3.3), respectively. Note that, in this chapter, we study the continuous-time model itself. Another method of attack is based on the reduction to the discrete-time model: see Chap. 4.
3.1 The Unconstrained Problem
From this section until the end of this chapter, the state and action spaces are both topological Borel spaces. In the current section, we investigate the α-discounted optimal control problem (3.1).
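3.1.1 The Optimality Equation
Condition 3.1.1 (a) Condition 2.2.5 is satisfied.
(b) $\alpha > \rho$.
(c) There exists a constant $M \ge 0$ such that, for all $x \in \mathbf{X}$, $|\inf_{a\in\mathbf{A}} c_0(x,a)| \le M w(x)$.

Lemma 3.1.1 Suppose Condition 3.1.1 is satisfied. Then, for each π-strategy $S$,

$$W_0^\alpha(S,x) \ge -\frac{M(\alpha w(x) + b)}{\alpha(\alpha-\rho)} > -\infty.$$

If the initial distribution γ is such that $\int_{\mathbf{X}} w(y)\gamma(dy) < \infty$, then

$$W_0^\alpha(S,\gamma) \ge -\frac{M\left(\alpha \int_{\mathbf{X}} w(y)\gamma(dy) + b\right)}{\alpha(\alpha-\rho)} > -\infty.$$

Proof According to Condition 3.1.1(c),

$$W_0^\alpha(S,x) \ge -E_x^S\left[\int_{\mathbb{R}_+} e^{-\alpha t} M w(X(t))\,dt\right] = -\int_{\mathbb{R}_+} e^{-\alpha t} M E_x^S[w(X(t))]\,dt.$$

Now, from Corollary 2.2.8 and using Condition 3.1.1, we obtain

$$W_0^\alpha(S,x) \ge -\int_{\mathbb{R}_+} e^{-\alpha t} M \left[e^{\rho t} w(x) + \frac{b}{\rho}\left(e^{\rho t} - 1\right)\right] dt = -\frac{M(\alpha w(x)+b)}{\alpha(\alpha-\rho)}$$

if $\rho \ne 0$, and

$$W_0^\alpha(S,x) \ge -\int_{\mathbb{R}_+} e^{-\alpha t} M (w(x) + bt)\,dt = -\frac{M(\alpha w(x)+b)}{\alpha^2}$$

if $\rho = 0$. The very last assertion is now obvious.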
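The two closed-form integrals in the proof of Lemma 3.1.1 can be checked numerically. The following Python sketch is illustrative only: the function names, the truncation horizon and the sample constants are assumptions, not part of the text. It compares the closed-form bound with a trapezoidal approximation of the integral, using the bound on $E_x^S[w(X(t))]$ quoted from Corollary 2.2.8.

```python
import math


def lemma_bound(w_x, alpha, rho, b, m):
    """Closed-form lower bound -M(alpha*w(x) + b) / (alpha*(alpha - rho)).

    For rho = 0 this reduces to -M(alpha*w(x) + b) / alpha**2.
    """
    if alpha <= rho:
        raise ValueError("Condition 3.1.1(b) requires alpha > rho")
    return -m * (alpha * w_x + b) / (alpha * (alpha - rho))


def lemma_bound_numeric(w_x, alpha, rho, b, m, horizon=60.0, steps=60_000):
    """Trapezoidal approximation of -int_0^horizon e^{-alpha t} M g(t) dt.

    g(t) is the bound on E[w(X(t))] used in the proof. The horizon is an
    illustrative truncation; it must be large compared with 1/(alpha - rho).
    """

    def integrand(t):
        if rho != 0:
            g = math.exp(rho * t) * w_x + (b / rho) * (math.exp(rho * t) - 1.0)
        else:
            g = w_x + b * t
        return math.exp(-alpha * t) * m * g

    h = horizon / steps
    total = 0.5 * (integrand(0.0) + integrand(horizon))
    for k in range(1, steps):
        total += integrand(k * h)
    return -h * total
```

For instance, `lemma_bound(2.0, 1.0, 0.3, 0.5, 3.0)` and `lemma_bound_numeric(2.0, 1.0, 0.3, 0.5, 3.0)` should agree up to the quadrature error.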
Condition 3.1.2 (a) There is a $[1,\infty)$-valued continuous function $w'$ on $\mathbf{X}$ such that Condition 2.4.3(c,d,e) is satisfied.
(b) For each bounded continuous function $u$ on $\mathbf{X}$, the function $(x,a) \in \mathbf{X}\times\mathbf{A} \to \int_{\mathbf{X}} u(y) w'(y)\,\tilde q(dy|x,a)$ is continuous.
(c) The function $(x,a) \in \mathbf{X}\times\mathbf{A} \to c_0(x,a)$ is lower semicontinuous.
(d) The (Borel) action space $\mathbf{A}$ is compact.
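Theorem 3.1.1 Suppose that Condition 3.1.2 is satisfied. Then the optimality equation

$$\alpha u(x) = \inf_{a\in\mathbf{A}} \left\{ c_0(x,a) + \int_{\mathbf{X}} u(y)\,q(dy|x,a) \right\}, \quad \forall\, x \in \mathbf{X} \tag{3.4}$$

admits a lower semicontinuous solution $u^* \in \mathbb{B}_{w'}(\mathbf{X})$, which has the form $u^*(x) = u_\infty(x) w'(x)$. Here $u_\infty$ is the point-wise limit of the following non-decreasing uniformly bounded sequence of lower semicontinuous functions $\{u_n\}_{n=0}^\infty$:

$$u_0(x) := -\frac{M'}{\alpha-\rho'} - \frac{M' b'}{\alpha(\alpha-\rho')w'(x)}, \quad \forall\, x \in \mathbf{X}, \qquad u_{n+1}(x) := \inf_{a\in\mathbf{A}} \{T_\alpha \circ u_n(x,a)\}, \quad \forall\, x \in \mathbf{X}, \tag{3.5}$$

where

$$T_\alpha \circ u(x,a) := \frac{c_0(x,a)}{w'(x)(\alpha+1+\bar q_x)} + \frac{1+\bar q_x}{\alpha+1+\bar q_x} \int_{\mathbf{X}} u(y) w'(y) \left[ \frac{q(dy|x,a)}{w'(x)(1+\bar q_x)} + \frac{\delta_x(dy)}{w'(x)} \right]. \tag{3.6}$$

For each $n = 0,1,2,\dots$

$$|u_n(x)| \le \frac{M'}{\alpha-\rho'} + \frac{M' b'}{\alpha(\alpha-\rho')w'(x)} \le \frac{M'}{\alpha-\rho'} + \frac{M' b'}{\alpha(\alpha-\rho')} \tag{3.7}$$

and the function $u_\infty$ satisfies the equalities

$$u_\infty(x) = \inf_{a\in\mathbf{A}} \{T_\alpha \circ u_\infty(x,a)\} = T_\alpha \circ u_\infty(x,\varphi^*(x)) \tag{3.8}$$

for some measurable mapping $\varphi^*: \mathbf{X} \to \mathbf{A}$. This mapping also satisfies the equalities

$$\alpha u^*(x) = \inf_{a\in\mathbf{A}} \left\{ c_0(x,a) + \int_{\mathbf{X}} u^*(y)\,q(dy|x,a) \right\} = c_0(x,\varphi^*(x)) + \int_{\mathbf{X}} u^*(y)\,q(dy|x,\varphi^*(x)), \quad \forall\, x \in \mathbf{X}. \tag{3.9}$$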
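As an illustration of the successive approximations (3.5)–(3.6), the following Python sketch treats a hypothetical finite model with bounded rates, where one may take $w' \equiv 1$, $\rho' = 0$ and $b' = 0$, so that $u^* = u_\infty$ and $u_0 \equiv -M'/\alpha$. The data layout (`q[x][a][y]`, `c[x][a]`), the function names and the two-state example are assumptions made for this sketch; it is not the general Borel-space construction of the theorem.

```python
def value_iteration(q, c, alpha, n_iter=500):
    """Successive approximations (3.5)-(3.6) for a finite model with w' = 1.

    q[x][a][y]: transition rates (each row q[x][a] sums to zero),
    c[x][a]:    cost rates, alpha > 0: discount factor.
    Returns (u, policy); policy[x] attains the minimum in the last sweep.
    """
    n = len(q)
    # q_bar[x] = sup_a q_x(a), where q_x(a) = -q[x][a][x] is the jump intensity.
    q_bar = [max(-q[x][a][x] for a in range(len(q[x]))) for x in range(n)]
    # With w' = 1, rho' = 0, b' = 0, formula (3.5) gives u_0 = -M'/alpha.
    m_bound = max(abs(c[x][a]) for x in range(n) for a in range(len(c[x])))
    u = [-m_bound / alpha] * n
    policy = [0] * n
    for _ in range(n_iter):
        new_u = []
        for x in range(n):
            denom = alpha + 1.0 + q_bar[x]
            values = []
            for a in range(len(q[x])):
                drift = sum(q[x][a][y] * u[y] for y in range(n))
                # T_alpha o u (x, a) from (3.6), specialised to w' = 1.
                values.append((c[x][a] + drift + (1.0 + q_bar[x]) * u[x]) / denom)
            best = min(range(len(values)), key=values.__getitem__)
            policy[x] = best
            new_u.append(values[best])
        u = new_u
    return u, policy


def bellman_residual(q, c, alpha, u):
    """Largest violation of the optimality equation (3.4) over all states."""
    n = len(q)
    return max(
        abs(alpha * u[x] - min(
            c[x][a] + sum(q[x][a][y] * u[y] for y in range(n))
            for a in range(len(q[x]))))
        for x in range(n)
    )


# Hypothetical two-state, two-action example.
Q = [[[-1.0, 1.0], [-3.0, 3.0]], [[2.0, -2.0], [0.5, -0.5]]]
C = [[1.0, 2.5], [4.0, 3.0]]
u_star, phi_star = value_iteration(Q, C, alpha=1.0)
```

In this finite setting each sweep is a sup-norm contraction with factor at most $\max_x (1+\bar q_x)/(\alpha+1+\bar q_x)$, which is why a fixed number of sweeps suffices; `bellman_residual(Q, C, 1.0, u_star)` measures how well (3.4) is satisfied.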
Proof First of all, note that Condition 3.1.2(b), applied to the bounded continuous function $x \in \mathbf{X} \to u(x) = \frac{1}{w'(x)}$, results in the following assertions. The function $(x,a) \in \mathbf{X}\times\mathbf{A} \to q_x(a)$ is continuous and hence the function $x \in \mathbf{X} \to \bar q_x$ is continuous by Proposition B.1.40: $\bar q_x = -\inf_{a\in\mathbf{A}}(-q_x(a))$. Moreover, the stochastic kernel on $\mathcal{B}(\mathbf{X})$ given $(x,a) \in \mathbf{X}\times\mathbf{A}$ defined by

$$\frac{w'(y)\left[\dfrac{q(dy|x,a)}{w'(x)(1+\bar q_x)} + \dfrac{\delta_x(dy)}{w'(x)}\right]}{\dfrac{\int_{\mathbf{X}} w'(y)\,q(dy|x,a)}{w'(x)(1+\bar q_x)} + 1}, \quad \forall\, (x,a) \in \mathbf{X}\times\mathbf{A}$$

is continuous. Note that $(x,a) \in \mathbf{X}\times\mathbf{A} \to \frac{\int_{\mathbf{X}} w'(y)\,q(dy|x,a)}{w'(x)(1+\bar q_x)} + 1$ is a positive bounded continuous function. Indeed, for the boundedness, we see, according to Condition 2.4.3(c), that

$$0 < \frac{\int_{\mathbf{X}} w'(y)\,\tilde q(dy|x,a)}{w'(x)(1+\bar q_x)} - \frac{q_x(a)}{1+\bar q_x} + 1 \le \frac{w'(x) q_x(a) + \rho' w'(x) + b'}{w'(x)(1+\bar q_x)} - \frac{q_x(a)}{1+\bar q_x} + 1 \le \rho' + b' + 1, \quad \forall\, (x,a) \in \mathbf{X}\times\mathbf{A}.$$

Therefore, if $u_n$ is bounded and lower semicontinuous, then the function

$$\frac{1+\bar q_x}{\alpha+1+\bar q_x} \int_{\mathbf{X}} u_n(y) w'(y) \left[ \frac{q(dy|x,a)}{w'(x)(1+\bar q_x)} + \frac{\delta_x(dy)}{w'(x)} \right]$$

is bounded and lower semicontinuous by Proposition B.1.32. Hence, the function $u_{n+1}$ is lower semicontinuous by Proposition B.1.40. Clearly, the function $u_0$ is lower semicontinuous and satisfies inequality (3.7). If this statement holds for function $u_n$, then, as was shown above, the function $u_{n+1}$ is lower semicontinuous. Moreover, for each $x \in \mathbf{X}$,

$$\begin{aligned}
u_{n+1}(x) &\le \inf_{a\in\mathbf{A}} \left\{ \frac{c_0(x,a)}{w'(x)(\alpha+1+\bar q_x)} \right\} + \frac{1+\bar q_x}{\alpha+1+\bar q_x} \times \sup_{a\in\mathbf{A}} \left\{ \frac{M'}{\alpha-\rho'} \left[ \frac{\int_{\mathbf{X}} w'(y)\,q(dy|x,a)}{w'(x)(1+\bar q_x)} + 1 \right] \right\} + \frac{1+\bar q_x}{\alpha+1+\bar q_x} \cdot \frac{M' b'}{\alpha(\alpha-\rho')w'(x)} \\
&\le \frac{M' w'(x)}{w'(x)(\alpha+1+\bar q_x)} + \frac{1+\bar q_x}{\alpha+1+\bar q_x} \cdot \frac{M'}{\alpha-\rho'} \left[ \frac{\rho' w'(x) + b'}{w'(x)(1+\bar q_x)} + 1 \right] + \frac{(1+\bar q_x) M' b'}{w'(x)(\alpha+1+\bar q_x)\alpha(\alpha-\rho')} \\
&= \frac{M'}{\alpha-\rho'} + \frac{M' b'}{\alpha(\alpha-\rho')w'(x)},
\end{aligned}$$

where the second inequality holds by Condition 2.4.3(c,e). Similarly, for each $x \in \mathbf{X}$,
$$\begin{aligned}
u_{n+1}(x) &\ge \inf_{a\in\mathbf{A}} \left\{ \frac{c_0(x,a)}{w'(x)(\alpha+1+\bar q_x)} \right\} - \frac{1+\bar q_x}{\alpha+1+\bar q_x} \times \sup_{a\in\mathbf{A}} \left\{ \frac{M'}{\alpha-\rho'} \left[ \frac{\int_{\mathbf{X}} w'(y)\,q(dy|x,a)}{w'(x)(1+\bar q_x)} + 1 \right] \right\} - \frac{(1+\bar q_x) M' b'}{(\alpha+1+\bar q_x) w'(x)\alpha(\alpha-\rho')} \\
&\ge \frac{-M' w'(x)}{w'(x)(\alpha+1+\bar q_x)} - \frac{1+\bar q_x}{\alpha+1+\bar q_x} \cdot \frac{M'}{\alpha-\rho'} \left[ \frac{\rho' w'(x) + b'}{w'(x)(1+\bar q_x)} + 1 \right] - \frac{(1+\bar q_x) M' b'}{w'(x)(\alpha+1+\bar q_x)\alpha(\alpha-\rho')} \\
&= -\left[ \frac{M'}{\alpha-\rho'} + \frac{M' b'}{\alpha(\alpha-\rho')w'(x)} \right] = u_0(x).
\end{aligned}$$

By induction, all of the functions $u_n$ are lower semicontinuous, and satisfy inequality (3.7). By the way, for all $x \in \mathbf{X}$, $u_1(x) \ge u_0(x)$ and, since the inequality $u_n(x) \ge u_{n-1}(x)$ for each $x \in \mathbf{X}$ implies that $u_{n+1}(x) \ge u_n(x)$ for each $x \in \mathbf{X}$, we conclude that the sequence $\{u_n\}_{n=0}^\infty$ is non-decreasing. Let $u_\infty(x) := \lim_{n\to\infty} u_n(x)$ for each $x \in \mathbf{X}$. The function $u_\infty$ also satisfies inequality (3.7), and is lower semicontinuous due to Proposition B.1.7.

Let us show that the function $u_\infty$ satisfies the first equality in (3.8). Since the operator defined by $\inf_{a\in\mathbf{A}}\{T_\alpha \circ u(x,a)\}$ is monotone and $u_n \le u_\infty$ for all $n \in \mathbb{N}_0$, $u_{n+1}(x) = \inf_{a\in\mathbf{A}}\{T_\alpha \circ u_n(x,a)\} \le \inf_{a\in\mathbf{A}}\{T_\alpha \circ u_\infty(x,a)\}$; hence $u_\infty(x) \le \inf_{a\in\mathbf{A}}\{T_\alpha \circ u_\infty(x,a)\}$ for all $x \in \mathbf{X}$. On the other hand, for an arbitrarily fixed $x \in \mathbf{X}$, let $a_i^* \in \mathbf{A}$ be such that $u_{i+1}(x) = T_\alpha \circ u_i(x,a_i^*)$, $i \in \mathbb{N}_0$. Such a minimizer exists because the function $a \in \mathbf{A} \to T_\alpha \circ u_i(x,a)$, under the fixed $x \in \mathbf{X}$, is lower semicontinuous, and $\mathbf{A}$ is compact, see Proposition B.1.40. Let $\{a_{i_j}^*\}_{j=1}^\infty$ be a subsequence converging to $\bar a \in \mathbf{A}$ as $j \to \infty$. Clearly, for each $n \in \mathbb{N}_0$ such that $n \le i_j$,

$$T_\alpha \circ u_n(x,a_{i_j}^*) \le T_\alpha \circ u_{i_j}(x,a_{i_j}^*) = u_{i_j+1}(x), \quad \forall\, x \in \mathbf{X}, \tag{3.10}$$

where the first inequality follows from the fact that $u_{i_j} \ge u_n$ pointwise. Recall that the function $a \in \mathbf{A} \to T_\alpha \circ u_n(x,a)$ is lower semicontinuous (under the fixed $x \in \mathbf{X}$). Therefore, passing to the limit in (3.10), as $j \to \infty$, yields

$$T_\alpha \circ u_n(x,\bar a) \le \liminf_{j\to\infty} T_\alpha \circ u_n(x,a_{i_j}^*) \le \lim_{j\to\infty} u_{i_j+1}(x) = u_\infty(x), \quad \forall\, n \in \mathbb{N}_0,\ x \in \mathbf{X}.$$

Finally, by passing here to the limit as $n \to \infty$ and using the Lebesgue Dominated Convergence Theorem, we see that $T_\alpha \circ u_\infty(x,\bar a) \le u_\infty(x)$ for each $x \in \mathbf{X}$, and hence $\inf_{a\in\mathbf{A}} T_\alpha \circ u_\infty(x,a) \le u_\infty(x)$ for each $x \in \mathbf{X}$. The first equality in (3.8) is
thus proved. The existence of a measurable mapping $\varphi^*$ from $\mathbf{X}$ to $\mathbf{A}$ satisfying the second equality in (3.8) immediately follows from Proposition B.1.39. Now it is clear that the function $u^* = u_\infty w'$ is lower semicontinuous and belongs to $\mathbb{B}_{w'}(\mathbf{X})$. It satisfies Eqs. (3.4) and (3.9) because the equation

$$u^*(x) = \frac{\inf_{a\in\mathbf{A}} \left\{ c_0(x,a) + (1+\bar q_x) \int_{\mathbf{X}} u^*(y) \left[ \dfrac{q(dy|x,a)}{1+\bar q_x} + \delta_x(dy) \right] \right\}}{\alpha+1+\bar q_x}, \quad \forall\, x \in \mathbf{X}$$

directly follows from the justified equality $u_\infty(x) = \inf_{a\in\mathbf{A}}\{T_\alpha \circ u_\infty(x,a)\}$ for each $x \in \mathbf{X}$.

Remark 3.1.1 If $\mathbf{A}$ is a singleton, i.e., the process is uncontrolled, Conditions 3.1.2(b),(c) are not needed, and the function $w'$ does not need to be continuous. The proof of Theorem 3.1.1 remains the same, all the functions $u_n$, $u_\infty$ being just measurable and uniformly bounded.

Remark 3.1.2 The proof of Theorem 3.1.1 survives if we weaken Condition 2.4.3(e):

$$\inf_{a\in\mathbf{A}} c_0(x,a) \ge -M' w'(x), \quad \forall\, x \in \mathbf{X}.$$

The successive approximations (3.5) converge monotonically to the bounded from below, lower semicontinuous function $u_\infty$, and the function $u^*(x) = u_\infty(x) w'(x)$ solves Eq. (3.4). But in Theorem 3.1.2 below the $u^*$ function is to be from $\mathbb{B}_{w'}(\mathbf{X})$.
3.1.2 Dynamic Programming and Dual Linear Programs
Theorem 3.1.2 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (for the same function $w'$). Then the following assertions hold.
(a) Equation (3.4) has a unique lower semicontinuous solution $u^*$ in the class $\mathbb{B}_{w'}(\mathbf{X})$. It can be constructed using the value iterations (3.5).
(b) There is a measurable mapping $\varphi^*: \mathbf{X} \to \mathbf{A}$ which provides the infimum in (3.4):

$$\alpha u^*(x) = \inf_{a\in\mathbf{A}} \left\{ c_0(x,a) + \int_{\mathbf{X}} u^*(y)\,q(dy|x,a) \right\} = c_0(x,\varphi^*(x)) + \int_{\mathbf{X}} u^*(y)\,q(dy|x,\varphi^*(x)), \quad \forall\, x \in \mathbf{X}. \tag{3.11}$$

Each measurable mapping $\varphi^*: \mathbf{X} \to \mathbf{A}$ satisfying (3.9) defines a deterministic stationary strategy, also denoted by $\varphi^*$, which is uniformly optimal for the α-discounted problem (3.1):

$$W_0^\alpha(\varphi^*,x) = \inf_{S\in\mathbf{S}_\pi} \{W_0^\alpha(S,x)\} = W_0^{*\alpha}(x) = u^*(x), \quad \forall\, x \in \mathbf{X}. \tag{3.12}$$

(We emphasize that deterministic stationary strategies are realizable.)
(c) If the initial distribution γ is such that $\int_{\mathbf{X}} w(x)\gamma(dx) < \infty$, then the strategy $\varphi^*$ from part (b) is optimal for the α-discounted problem (3.1) under the initial distribution γ. Moreover, a strategy $\pi \in \mathbf{S}_\pi$ is optimal if and only if

$$-\alpha u^*(X(v)) + c_0(X(v),a) + \int_{\mathbf{X}} q(dy|X(v),a)\,u^*(y) = 0$$

for $\Pi(da|\omega,v) \times dv \times P_\gamma^\pi(d\omega)$-almost all $(a,v,\omega) \in \mathbf{A}\times\mathbb{R}_+\times\Omega$. The process $\Pi$ was defined in (1.8).

Proof Statement (a), apart from the uniqueness, and equalities (3.11) follow from Theorem 3.1.1. It will become clear that Eq. (3.4) cannot have other solutions in the class $\mathbb{B}_{w'}(\mathbf{X})$, after we prove the equality $\inf_{S\in\mathbf{S}_\pi}\{W_0^\alpha(S,x)\} = u^*(x)$ in assertion (b).

Let us prove the last part of assertion (b) as follows. Under the imposed conditions, the first assertion of Lemma 3.1.1 holds for the function $w'$ and the constants $\rho'$, $b'$, $M'$ instead of the function $w$ and the constants $\rho$, $b$, $M$. For an arbitrarily fixed π-strategy π and initial state $x \in \mathbf{X}$, the expression

$$E_x^\pi\left[\int_{(0,t]} e^{-\alpha v} \int_{\mathbf{A}} c_0(X(v),a)\,\Pi(da|v)\,dv\right]$$

is well defined because

$$c_0(x,a) \ge \inf_{a\in\mathbf{A}} c_0(x,a) \ge -M' w'(x).$$
We add that expression to both sides of the Dynkin formula (2.91), which holds for $u^* \in \mathbb{B}_{w'}(\mathbf{X})$, resulting in

$$\begin{aligned}
&E_x^\pi\left[\int_{(0,t]} e^{-\alpha v} \int_{\mathbf{A}} c_0(X(v),a)\,\Pi(da|v)\,dv\right] + E_x^\pi\left[u^*(X(t))\right] e^{-\alpha t} \\
&\quad = u^*(x) + E_x^\pi\left[\int_{(0,t]} e^{-\alpha v} \left\{ -\alpha u^*(X(v)) + \int_{\mathbf{A}} \Pi(da|v) \left[ \int_{\mathbf{X}} q(dy|X(v),a)\,u^*(y) + c_0(X(v),a) \right] \right\} dv\right].
\end{aligned}$$

After that, one can pass to the limit as $t \to \infty$ (see (1.22)):

$$W_0^\alpha(\pi,x) = u^*(x) + E_x^\pi\left[\int_{(0,\infty)} e^{-\alpha v} \int_{\mathbf{A}} \left\{ -\alpha u^*(X(v)) + c_0(X(v),a) + \int_{\mathbf{X}} q(dy|X(v),a)\,u^*(y) \right\} \Pi(da|v)\,dv\right]. \tag{3.13}$$

Note that

$$\lim_{t\to\infty} e^{-\alpha t} E_x^\pi[u^*(X(t))] = 0$$

because of Corollary 2.2.8: $u^* \in \mathbb{B}_{w'}(\mathbf{X})$. Expression (3.13) yields that $W_0^\alpha(\pi,x) \ge u^*(x)$, $\forall\, \pi \in \mathbf{S}_\pi$, $x \in \mathbf{X}$ and $W_0^\alpha(\varphi^*,x) = u^*(x)$ for each $x \in \mathbf{X}$. Note that formula (3.13) holds for each function $u \in \mathbb{B}_{w'}(\mathbf{X})$, not only for the function $u^*$. Now assertion (b) and the uniqueness of the solution $u^*$ to Eq. (3.4) in the class $\mathbb{B}_{w'}(\mathbf{X})$ are proved.

The proof of the first part of statement (c) is exactly the same as that of statement (b), with the obvious replacement of $E_x^\pi$ with $E_\gamma^\pi$. The very last assertion follows from formula (3.13), where again $E_x^\pi$ is replaced with $E_\gamma^\pi$.

Corollary 3.1.1 Suppose Condition 2.4.3 is satisfied with the strengthened Item (e): $|c_0(x,a)| \le M' w'(x)$. Let $\pi^s$ be a fixed stationary π-strategy. Then the equation
$$\alpha u(x) = \int_{\mathbf{A}} c_0(x,a)\,\pi^s(da|x) + \int_{\mathbf{A}} \int_{\mathbf{X}} u(y)\,q(dy|x,a)\,\pi^s(da|x), \quad \forall\, x \in \mathbf{X} \tag{3.14}$$

has a unique measurable solution $u^*$ in the class $\mathbb{B}_{w'}(\mathbf{X})$, which can be built with the same successive approximations as in Theorem 3.1.1, and coincides with $W_0^\alpha(\pi^s,x)$ for all $x \in \mathbf{X}$.

Proof Consider the uncontrolled process with the cost and transition rates $\hat c(x) := \int_{\mathbf{A}} c_0(x,a)\,\pi^s(da|x)$ and $\hat q(dy|x) := \int_{\mathbf{A}} q(dy|x,a)\,\pi^s(da|x)$. For it, Condition 2.4.3 is satisfied. According to Remark 3.1.1, Eq. (3.14) has a measurable solution $u^* \in \mathbb{B}_{w'}(\mathbf{X})$ which can be built by the successive approximations (3.5). Similarly to the proof of Theorem 3.1.2, one can use the Dynkin formula (2.91) to obtain the equalities

$$W_0^\alpha(\pi^s,x) = u^*(x) + E_x^{\pi^s}\left[\int_{(0,\infty)} e^{-\alpha v} \left\{ -\alpha u^*(X(v)) + \hat c(X(v)) + \int_{\mathbf{X}} u^*(y)\,\hat q(dy|X(v)) \right\} dv\right] = u^*(x).$$
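For a finite model, Eq. (3.14) is a linear system and the construction in the proof of Corollary 3.1.1 can be sketched directly: average the cost and transition rates under $\pi^s$, then run the successive approximations for the resulting uncontrolled process. In the Python sketch below, the data layout, the function names and the two-state example are illustrative assumptions, and $w' \equiv 1$ is taken.

```python
def average_under_strategy(q, c, pi):
    """Return (c_hat, q_hat): cost and transition rates averaged under pi.

    q[x][a][y] are transition rates, c[x][a] cost rates, and pi[x][a] is
    the probability that the stationary strategy selects action a in state x.
    """
    n = len(q)
    c_hat = [sum(pi[x][a] * c[x][a] for a in range(len(pi[x]))) for x in range(n)]
    q_hat = [
        [sum(pi[x][a] * q[x][a][y] for a in range(len(pi[x]))) for y in range(n)]
        for x in range(n)
    ]
    return c_hat, q_hat


def evaluate_stationary(q, c, pi, alpha, n_iter=500):
    """Successive approximations for Eq. (3.14) in a finite model (w' = 1)."""
    c_hat, q_hat = average_under_strategy(q, c, pi)
    n = len(c_hat)
    # For the uncontrolled averaged process, q_bar[x] is its jump intensity.
    q_bar = [-q_hat[x][x] for x in range(n)]
    m_bound = max(abs(v) for v in c_hat)
    u = [-m_bound / alpha] * n  # starting point u_0 as in (3.5)
    for _ in range(n_iter):
        u = [
            (c_hat[x] + sum(q_hat[x][y] * u[y] for y in range(n))
             + (1.0 + q_bar[x]) * u[x]) / (alpha + 1.0 + q_bar[x])
            for x in range(n)
        ]
    return u


# Hypothetical example: randomise equally in state 0, use action 0 in state 1.
Q = [[[-1.0, 1.0], [-3.0, 3.0]], [[2.0, -2.0], [0.5, -0.5]]]
C = [[1.0, 2.5], [4.0, 3.0]]
PI = [[0.5, 0.5], [1.0, 0.0]]
u_pi = evaluate_stationary(Q, C, PI, alpha=1.0)
```

In this example the averaged rates are $\hat q = \begin{pmatrix} -2 & 2 \\ 2 & -2 \end{pmatrix}$ and $\hat c = (1.75, 4)$, so (3.14) is a 2×2 linear system that can also be solved by hand as a cross-check.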
Corollary 3.1.2 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (for the same function $w'$).
(a) If $u \in \mathbb{B}_{w'}(\mathbf{X})$ and

$$\alpha u(x) \ge c_0(x,a) + \int_{\mathbf{X}} q(dy|x,a)\,u(y), \quad \forall\, (x,a) \in \mathbf{X}\times\mathbf{A},$$

then $u(x) \ge W_0^\alpha(S,x)$ for each π-strategy $S$ and for each $x \in \mathbf{X}$.
(b) If $u \in \mathbb{B}_{w'}(\mathbf{X})$ and

$$\alpha u(x) \le c_0(x,a) + \int_{\mathbf{X}} q(dy|x,a)\,u(y), \quad \forall\, (x,a) \in \mathbf{X}\times\mathbf{A},$$

then $u(x) \le W_0^\alpha(S,x)$ for each π-strategy $S$ and for each $x \in \mathbf{X}$.

Proof This follows directly from expression (3.13), which holds for an arbitrary function $u \in \mathbb{B}_{w'}(\mathbf{X})$.

The next theorem shows that under the conditions of Theorem 3.1.2, the value (Bellman) function $W_0^{*\alpha}$ is a solution to the linear program, usually called dual, which is described in the next statement.

Theorem 3.1.3 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (for the same function $w'$), and $\int_{\mathbf{X}} w(y)\gamma(dy) < \infty$. Then the function $W_0^{*\alpha}$ solves the following Dual Linear Program in the space $\mathbb{B}_{w'}(\mathbf{X})$:

$$\text{Maximize over } v \in \mathbb{B}_{w'}(\mathbf{X}):\quad \int_{\mathbf{X}} v(y)\gamma(dy) \quad \text{subject to } c_0(x,a) + \int_{\mathbf{X}} v(y)\,q(dy|x,a) - \alpha v(x) \ge 0, \ \forall\, (x,a) \in \mathbf{X}\times\mathbf{A}. \tag{3.15}$$

In fact, a feasible function $v$ for problem (3.15) solves problem (3.15) if and only if $v(x) = W_0^{*\alpha}(x)$ for almost all $x \in \mathbf{X}$ with respect to γ.

Proof Below we use Theorem 3.1.2. Recall that $W_0^{*\alpha} = u^*$ according to (3.12). Since $W_0^{*\alpha}$ is in $\mathbb{B}_{w'}(\mathbf{X})$, and satisfies Eq. (3.9), it is feasible for problem (3.15). Consider an arbitrary function $v \in \mathbb{B}_{w'}(\mathbf{X})$, which is also feasible. According to Corollary 3.1.2(b), for any π-strategy $S$, $v(x) \le W_0^\alpha(S,x)$ for all $x \in \mathbf{X}$. Suppose

$$\int_{\mathbf{X}} v(y)\gamma(dy) > \int_{\mathbf{X}} W_0^{*\alpha}(y)\gamma(dy).$$

(The expression on the left-hand side is finite because $v \in \mathbb{B}_{w'}(\mathbf{X})$.) Then there is an $\hat x \in \mathbf{X}$ and a constant $\delta > 0$ such that $W_0^{*\alpha}(\hat x) < v(\hat x) - \delta$. Therefore, $W_0^{*\alpha}(\hat x) < W_0^\alpha(S,\hat x) - \delta$ for each $S \in \mathbf{S}_\pi$. Hence $W_0^{*\alpha}(\hat x) \le \inf_{S\in\mathbf{S}_\pi} W_0^\alpha(S,\hat x) - \delta$, which contradicts (3.12). Consequently,

$$\int_{\mathbf{X}} v(y)\gamma(dy) \le \int_{\mathbf{X}} W_0^{*\alpha}(y)\gamma(dy)$$
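for each feasible function $v$ for problem (3.15), and the first assertion is proved.

From the first assertion just proved, if a feasible function $v$ equals $W_0^{*\alpha}$ almost surely with respect to γ, then

$$\int_{\mathbf{X}} v(y)\gamma(dy) = \int_{\mathbf{X}} W_0^{*\alpha}(y)\gamma(dy)$$

is the value of problem (3.15), and hence $v$ solves problem (3.15).

Let $v$ be an optimal solution to problem (3.15) and suppose the relation that $v = W_0^{*\alpha}$ almost surely with respect to γ is false. Then there exist measurable subsets $\Gamma_1, \Gamma_2 \subseteq \mathbf{X}$, such that the following conditions are satisfied: $\Gamma_1 \cap \Gamma_2 = \emptyset$, $v(x) > W_0^{*\alpha}(x)$ on $\Gamma_1$, $v(x) < W_0^{*\alpha}(x)$ on $\Gamma_2$, $v(x) = W_0^{*\alpha}(x)$ on $(\mathbf{X}\setminus\Gamma_1)\setminus\Gamma_2$, and the case $\gamma(\Gamma_1) = \gamma(\Gamma_2) = 0$ is excluded. Now let us define a function $\hat v$ by

$$\hat v(x) = I\{x \in \mathbf{X}\setminus\Gamma_2\}\,v(x) + I\{x \in \Gamma_2\}\,W_0^{*\alpha}(x), \quad \forall\, x \in \mathbf{X},$$

which is feasible for problem (3.15). Indeed, firstly, it is evident that $\hat v \in \mathbb{B}_{w'}(\mathbf{X})$. Secondly, we have that for each $x \in \mathbf{X}\setminus\Gamma_2$ and $a \in \mathbf{A}$,

$$\begin{aligned}
c_0(x,a) - \alpha \hat v(x) + \int_{\mathbf{X}} \hat v(y)\,q(dy|x,a) &= c_0(x,a) - \alpha v(x) + \int_{\mathbf{X}\setminus\Gamma_2} v(y)\,q(dy|x,a) + \int_{\Gamma_2} W_0^{*\alpha}(y)\,q(dy|x,a) \\
&\ge c_0(x,a) - \alpha v(x) + \int_{\mathbf{X}\setminus\Gamma_2} v(y)\,q(dy|x,a) + \int_{\Gamma_2} v(y)\,q(dy|x,a) \ge 0,
\end{aligned}$$

and for each $x \in \Gamma_2$ and $a \in \mathbf{A}$,

$$\begin{aligned}
c_0(x,a) - \alpha \hat v(x) + \int_{\mathbf{X}} \hat v(y)\,q(dy|x,a) &= c_0(x,a) - \alpha W_0^{*\alpha}(x) + \int_{\mathbf{X}\setminus\Gamma_2} v(y)\,q(dy|x,a) + \int_{\Gamma_2} W_0^{*\alpha}(y)\,q(dy|x,a) \\
&\ge c_0(x,a) - \alpha W_0^{*\alpha}(x) + \int_{\mathbf{X}\setminus\Gamma_2} W_0^{*\alpha}(y)\,q(dy|x,a) + \int_{\Gamma_2} W_0^{*\alpha}(y)\,q(dy|x,a) \ge 0.
\end{aligned}$$

However, if $\gamma(\Gamma_2) > 0$, then

$$\int_{\mathbf{X}} \hat v(y)\gamma(dy) = \int_{\mathbf{X}\setminus\Gamma_2} v(x)\gamma(dx) + \int_{\Gamma_2} W_0^{*\alpha}(x)\gamma(dx) > \int_{\mathbf{X}} v(x)\gamma(dx),$$

which contradicts the fact that $v$ is optimal for problem (3.15). If $\gamma(\Gamma_2) = 0$ then $\gamma(\Gamma_1) > 0$ and

$$\int_{\mathbf{X}} v(y)\gamma(dy) = \int_{\Gamma_1} v(x)\gamma(dx) + \int_{\mathbf{X}\setminus\Gamma_1} W_0^{*\alpha}(x)\gamma(dx) > \int_{\mathbf{X}} W_0^{*\alpha}(x)\gamma(dx),$$

which contradicts the fact that $W_0^{*\alpha}$ is optimal for problem (3.15).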
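The statement of Theorem 3.1.3 can be probed on a finite model, where the Dual Linear Program (3.15) has finitely many constraints. The Python sketch below is a rough illustration under assumed data (the layout `q[x][a][y]`, `c[x][a]`, the function names and the two-state example are hypothetical, and $w' \equiv 1$): it computes $u^*$ by value iteration, then checks that $u^*$ is feasible with zero minimal slack, that shifting $u^*$ down keeps feasibility but lowers the objective, and that raising $u^*$ at one state violates a constraint.

```python
def solve_optimality_equation(q, c, alpha, n_iter=500):
    """Value iteration for alpha*u(x) = min_a {c(x,a) + sum_y q(y|x,a) u(y)}.

    Started from u = 0 for brevity; in this finite setting the iteration is
    a contraction, so the starting point does not affect the limit.
    """
    n = len(q)
    q_bar = [max(-q[x][a][x] for a in range(len(q[x]))) for x in range(n)]
    u = [0.0] * n
    for _ in range(n_iter):
        u = [
            min(
                (c[x][a] + sum(q[x][a][y] * u[y] for y in range(n))
                 + (1.0 + q_bar[x]) * u[x]) / (alpha + 1.0 + q_bar[x])
                for a in range(len(q[x]))
            )
            for x in range(n)
        ]
    return u


def lp_slack(q, c, alpha, v):
    """Smallest value of c(x,a) + sum_y v(y) q(y|x,a) - alpha*v(x) over (x,a).

    v is feasible for (3.15) exactly when this number is non-negative.
    """
    n = len(q)
    return min(
        c[x][a] + sum(q[x][a][y] * v[y] for y in range(n)) - alpha * v[x]
        for x in range(n) for a in range(len(q[x]))
    )


def lp_objective(v, gamma):
    """Objective of (3.15): the integral of v with respect to gamma."""
    return sum(g * val for g, val in zip(gamma, v))


# Hypothetical two-state, two-action example.
Q = [[[-1.0, 1.0], [-3.0, 3.0]], [[2.0, -2.0], [0.5, -0.5]]]
C = [[1.0, 2.5], [4.0, 3.0]]
GAMMA = [0.3, 0.7]
ALPHA = 1.0

u_star = solve_optimality_equation(Q, C, ALPHA)
shifted = [val - 1.0 for val in u_star]       # feasible, smaller objective
raised = [u_star[0] + 0.1] + u_star[1:]       # violates a constraint at x = 0
```

Because each row of rates sums to zero, subtracting a constant $\kappa$ from $u^*$ increases every slack by $\alpha\kappa$, which is why `shifted` stays feasible.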
The functions $w$ and $w'$, and the corresponding conditions were introduced in order to work with unbounded transition and cost rates. If $\sup_{x\in\mathbf{X}} \bar q_x < \infty$ and $\sup_{(x,a)\in\mathbf{X}\times\mathbf{A}} |c_j(x,a)| < \infty$ for all $j = 0,1,2,\dots,J$, then obviously one can take $w(x) \equiv w'(x) \equiv 1$, $\rho = \rho' = 0$ and so on.

Sometimes, one needs to solve the unconstrained problem (3.1) in a specific class of strategies $\hat{\mathbf{S}}_\pi \subseteq \mathbf{S}_\pi$. One such case is studied below. Suppose $\mathbb{K} \subseteq \mathbf{X}\times\mathbf{A}$ is a fixed measurable subset such that $\mathbf{A}(x) := \{a : (x,a) \in \mathbb{K}\} \ne \emptyset$ for all $x \in \mathbf{X}$. Then $\mathbf{A}(x)$ is a measurable subset of $\mathbf{A}$ for each $x \in \mathbf{X}$. Let $\hat{\mathbf{S}}_\pi$ be the class of relaxed strategies π satisfying the constraint $\Pi(\mathbf{A}(X(v-))|\omega,v) = 1$ for $P_\gamma^\pi(d\omega) \times dv$-almost all $(\omega,v) \in \Omega\times\mathbb{R}_+$, and consider the problem for the $\mathbf{A}(\cdot)$-model:

$$\text{Minimize over } \pi \in \hat{\mathbf{S}}_\pi:\quad W_0^\alpha(\pi,\gamma).$$

It can also be solved using the Dynamic Programming Approach in combination with the penalty method. Namely, we will assign such a big cost rate for the points $(x,a) \notin \mathbb{K}$ that the standard method provides a solution from $\hat{\mathbf{S}}_\pi$.

Theorem 3.1.4 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (for the same function $w'$), $\int_{\mathbf{X}} w(y)\gamma(dy) < \infty$, and $\mathbb{K} \subseteq \mathbf{X}\times\mathbf{A}$ is a closed subset such that $\mathbf{A}(x) := \{a : (x,a) \in \mathbb{K}\} \ne \emptyset$ $\forall\, x \in \mathbf{X}$. Let

$$\hat M' := \sup_{x\in\mathbf{X}} \frac{|\inf_{a\in\mathbf{A}(x)} c_0(x,a)|}{w'(x)}$$

and put

$$\hat c_0(x,a) := \begin{cases} c_0(x,a), & \text{if } (x,a) \in \mathbb{K}; \\ c_0(x,a) \vee K(x), & \text{if } (x,a) \notin \mathbb{K}, \end{cases}$$

where $K$ is a big enough continuous function of $x \in \mathbf{X}$, e.g.,

$$K(x) = \frac{2\hat M'(\alpha+1+\bar q_x)(\alpha w'(x) + b')}{\alpha(\alpha-\rho')}, \quad \forall\, x \in \mathbf{X}.$$
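Then the following assertions hold.
(a) All the statements of Theorems 3.1.1–3.1.3 hold true for the model with the cost rate $\hat c_0$ and constant $\hat M'$ replacing $M'$.
(b) The mapping $\varphi^*$ from Theorem 3.1.2(b) has its values in $\mathbf{A}(x)$ for all $x \in \mathbf{X}$.
(c) The strategy $\varphi^*$ satisfies $\inf_{\pi\in\hat{\mathbf{S}}_\pi} W_0^\alpha(\pi,\gamma) = W_0^\alpha(\varphi^*,\gamma)$, where $\hat{\mathbf{S}}_\pi$ is the class of π-strategies satisfying $\Pi(\mathbf{A}(X(v))|\omega,v) = 1$ for $P_\gamma^\pi(d\omega) \times dv$-almost all $(\omega,v) \in \Omega\times\mathbb{R}_+$.

Proof (a) The presented function $K$ is continuous because the functions $w'$ and $\bar q_x$ are continuous: see the proof of Theorem 3.1.1. Therefore, the function $\hat c_0$ is lower semicontinuous according to Lemma B.1.1. Indeed, if $c_0$ is nonnegative, then we write $\hat c_0(x,a) = c_0(x,a) \vee (K(x) I\{(x,a) \in \mathbb{K}^c\})$, which clearly defines a lower semicontinuous function on $\mathbf{X}\times\mathbf{A}$. In general, we write

$$\hat c_0(x,a) = \hat c_0(x,a) + M' w'(x) - M' w'(x) = (c_0(x,a) + M' w'(x)) \vee \left((K(x) + M' w'(x)) I\{(x,a) \in \mathbb{K}^c\}\right) - M' w'(x)$$

and the claim follows.

The proof of Theorem 3.1.1 remains the same: if

$$|u_n(x)| \le \frac{\hat M'}{\alpha-\rho'} + \frac{\hat M' b'}{\alpha(\alpha-\rho')w'(x)},$$

then

$$\inf_{a\in\mathbf{A}(x)} \left\{ \frac{\hat c_0(x,a)}{w'(x)(\alpha+1+\bar q_x)} + \frac{1+\bar q_x}{\alpha+1+\bar q_x} \int_{\mathbf{X}} u_n(y) w'(y) \left[ \frac{q(dy|x,a)}{w'(x)(1+\bar q_x)} + \frac{\delta_x(dy)}{w'(x)} \right] \right\} \le |u_0(x)| = \frac{\hat M'}{\alpha-\rho'} + \frac{\hat M' b'}{\alpha(\alpha-\rho')w'(x)}.$$

If $a \in \mathbf{A}\setminus\mathbf{A}(x) \ne \emptyset$ then the expression in the parentheses is greater than or equal to

$$\begin{aligned}
&\frac{2\hat M'(\alpha w'(x) + b')}{\alpha(\alpha-\rho')w'(x)} - \frac{\hat M'}{\alpha-\rho'} \left[ \frac{\rho' w'(x) + b'}{w'(x)(\alpha+1+\bar q_x)} + \frac{1+\bar q_x}{\alpha+1+\bar q_x} + \frac{(1+\bar q_x) b'}{w'(x)(\alpha+1+\bar q_x)\alpha} \right] \\
&\quad > \frac{\hat M'}{\alpha-\rho'} \left[ \frac{2(\alpha w'(x) + b')}{\alpha w'(x)} - \frac{\alpha w'(x) + b'}{w'(x)(\alpha+1+\bar q_x)} - \frac{1+\bar q_x}{\alpha+1+\bar q_x} - \frac{(1+\bar q_x) b'}{\alpha w'(x)(\alpha+1+\bar q_x)} \right]
\end{aligned}$$

because $\rho' < \alpha$. Now the last expression equals

$$\frac{\hat M'}{\alpha-\rho'} \left[ \frac{2(\alpha w'(x) + b')}{\alpha w'(x)} - \frac{b' + \alpha w'(x)}{\alpha w'(x)} \right] = |u_0(x)|,$$

so that the infimum in (3.5) is attained at $a \in \mathbf{A}(x)$ and the statement of Theorem 3.1.1 is valid for $\hat c_0$ and $\hat M'$, as well as the statements of Theorems 3.1.2 and 3.1.3.

(b) Since $|u_\infty(x)| \le \frac{\hat M'}{\alpha-\rho'} + \frac{\hat M' b'}{\alpha(\alpha-\rho')w'(x)}$, the mapping $\varphi^*$ from Theorem 3.1.2(b) has its values in $\mathbf{A}(x)$ for all $x \in \mathbf{X}$ according to the above calculations.

(c) For every strategy π from $\hat{\mathbf{S}}_\pi$,

$$\begin{aligned}
W_0^\alpha(\pi,\gamma) &= \int_{\mathbf{X}} u^*(x)\gamma(dx) + E_\gamma^\pi\left[\int_{(0,\infty)} e^{-\alpha v} \int_{\mathbf{A}} \Pi(da|v) \left\{ -\alpha u^*(X(v)) + c_0(X(v),a) + \int_{\mathbf{X}} q(dy|X(v),a)\,u^*(y) \right\} dv\right] \\
&\ge \int_{\mathbf{X}} u^*(x)\gamma(dx).
\end{aligned}$$

For the details, see the proof of Theorem 3.1.2. Obviously, $\varphi^* \in \hat{\mathbf{S}}_\pi$ and $W_0^\alpha(\varphi^*,\gamma) = \int_{\mathbf{X}} u^*(x)\gamma(dx)$. This completes the proof.

3.1.2.1 Comparable Results Under the Strong Continuity Condition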
Instead of Condition 3.1.2, sometimes the following condition is imposed.

Condition 3.1.3 (a) There is a $[1,\infty)$-valued measurable function $w'$ on $\mathbf{X}$ such that Condition 2.4.3(c,d,e) is satisfied.
(b) For each bounded measurable function $u$ on $\mathbf{X}$ and each $x \in \mathbf{X}$, the function $a \in \mathbf{A} \to \int_{\mathbf{X}} u(y) w'(y)\,\tilde q(dy|x,a)$ is continuous.
(c) For each $x \in \mathbf{X}$, the function $a \in \mathbf{A} \to c_0(x,a)$ is lower semicontinuous.
(d) The (Borel) action space $\mathbf{A}$ is compact.

The assertions in Theorems 3.1.1 and 3.1.2 hold if we replace therein Condition 3.1.2 with Condition 3.1.3, and replace the lower semicontinuous functions $u^*$ and $\{u_n\}_{n=0}^\infty$ with measurable functions $u^*$ and $\{u_n\}_{n=0}^\infty$. One only needs to observe that the proofs survive if we apply Proposition B.1.39 instead of Proposition B.1.40. Similarly, the assertions in Theorem 3.1.4 hold if we replace therein Condition 3.1.2 with Condition 3.1.3, and replace the closedness of $\mathbb{K}$ with the closedness of $\mathbf{A}(x)$ for each $x \in \mathbf{X}$.
3.2 The Constrained Problem
According to (1.34) and Theorem 1.3.1, the constrained problem (1.16) or (3.3) can be rewritten as

$$\text{Minimize over } \eta \in \mathcal{D}^t:\quad \frac{1}{\alpha} \int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta(dx\times da) \quad \text{subject to } \frac{1}{\alpha} \int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da) \le d_j, \ \forall\, j = 1,2,\dots,J, \tag{3.16}$$

where

$$\mathcal{D}^t = \left\{ \eta = \alpha \sum_{n=1}^\infty m_n,\ \{m_n\}_{n=1}^\infty \in \mathcal{D}_S = \mathcal{D}_{ReM} \right\} \tag{3.17}$$

is the collection of all total (normalized) occupation measures η. A total occupation measure η is called feasible if it satisfies all the inequalities in (3.16).

As was explained, under Condition 2.4.2, the controlled process $X(\cdot)$ is nonexplosive for every strategy $S$: $P_x^S(T_\infty = \infty) = 1$ for all $x \in \mathbf{X}$. In that case, the (α-discounted) total (normalized) occupation measure is often equivalently introduced in the following form.

Definition 3.2.1 Under Condition 2.4.2, the (α-discounted) total occupation measure $\eta_\gamma^{S,\alpha} \in \mathcal{D}^t$ of a π-strategy $S$ for the CTMDP $\{\mathbf{X}, \mathbf{A}, q, \{c_j\}_{j=0}^J\}$ with the initial distribution γ is a measure on $\mathcal{B}(\mathbf{X}\times\mathbf{A})$ defined, for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$ and $\Gamma_A \in \mathcal{B}(\mathbf{A})$, by

$$\eta_\gamma^{S,\alpha}(\Gamma_X\times\Gamma_A) = \alpha E_\gamma^S\left[\int_{(0,\infty)} e^{-\alpha t} I\{X(t) \in \Gamma_X\}\,\Pi(\Gamma_A|t)\,dt\right], \tag{3.18}$$

where the $\mathcal{P}(\mathbf{A})$-valued predictable process $\Pi$ was introduced in (1.8).
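3.2.1 Properties of the Total Occupation Measures
Theorem 3.2.1 If Condition 2.4.2 is satisfied, then, for every $S \in \mathbf{S}_\pi$, $\eta_\gamma^{S,\alpha}(\mathbf{X}\times\mathbf{A}) = 1$. If, additionally, $\alpha > \rho$ and $\int_{\mathbf{X}} w(x)\gamma(dx) < \infty$, then the following assertions hold:
(a) For each fixed $S \in \mathbf{S}_\pi$, $\eta_\gamma^{S,\alpha}$ satisfies the following two relations:

$$\eta(\Gamma_X\times\mathbf{A}) = \gamma(\Gamma_X) + \frac{1}{\alpha} \int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,\eta(dy\times da), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}), \tag{3.19}$$

and

$$\int_{\mathbf{X}} w(x)\,\eta(dx\times\mathbf{A}) < \infty, \tag{3.20}$$

where the function $w$ comes from Condition 2.4.2.
(b) If a (finite) measure η on $\mathbf{X}\times\mathbf{A}$ satisfies the two relations in part (a), then there exists a stationary π-strategy $\pi^s$ such that $\eta = \eta_\gamma^{\pi^s,\alpha}$. One can take $\pi^s$ as in the following formula:

$$\eta(\Gamma_X\times\Gamma_A) = \int_{\Gamma_X} \pi^s(\Gamma_A|y)\,\eta(dy\times\mathbf{A}), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}),\ \Gamma_A \in \mathcal{B}(\mathbf{A}). \tag{3.21}$$

(c) Consider the (probability) measure η as in part (b). If a stationary π-strategy $\hat\pi^s$ is such that $\eta_\gamma^{\hat\pi^s,\alpha} = \eta$, then $\hat\pi^s$ is a version of the $\pi^s$ in (3.21), and in fact,

$$\eta_\gamma^{\hat\pi^s,\alpha}(\Gamma_X\times\Gamma_A) = \int_{\Gamma_X} \hat\pi^s(\Gamma_A|y)\,\eta_\gamma^{\hat\pi^s,\alpha}(dy\times\mathbf{A}), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}),\ \Gamma_A \in \mathcal{B}(\mathbf{A}).$$

(d) Suppose a stationary π-strategy $\pi^s$ is fixed. Then the equation

$$\tilde\eta(\Gamma_X) = \gamma(\Gamma_X) + \frac{1}{\alpha} \int_{\mathbf{X}} \int_{\mathbf{A}} q(\Gamma_X|y,a)\,\pi^s(da|y)\,\tilde\eta(dy), \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}) \tag{3.22}$$

has a unique solution in the class of probability measures on $\mathbf{X}$ satisfying

$$\int_{\mathbf{X}} w(x)\,\tilde\eta(dx) < \infty. \tag{3.23}$$

Moreover, the unique solution is provided by $\tilde\eta(dx) := \eta_\gamma^{\pi^s,\alpha}(dx\times\mathbf{A})$.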
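For a finite model, relation (3.22) is a linear fixed-point equation for the state marginal $\tilde\eta$, and (3.21) then recovers the full occupation measure. The Python sketch below is an illustrative assumption-laden example (data layout, function name, the uniformization constant `lam` and the two-state data are all hypothetical): it rewrites (3.22) as $\tilde\eta = [\alpha\gamma + \tilde\eta(\hat q + \Lambda I)]/(\alpha+\Lambda)$ with $\Lambda \ge \max_x \hat q_x$, which is a contraction in total variation and can be iterated.

```python
def occupation_measure(q, pi, gamma, alpha, n_iter=500):
    """Discounted occupation measure of a stationary strategy (finite model).

    q[x][a][y]: transition rates, pi[x][a]: action probabilities,
    gamma[x]: initial distribution, alpha > 0: discount factor.
    Returns (eta_x, eta): the state marginal solving (3.22) and the joint
    measure eta[x][a] = eta_x[x] * pi[x][a] as in (3.21).
    """
    n = len(gamma)
    # Transition rates averaged under the stationary strategy.
    q_hat = [
        [sum(pi[x][a] * q[x][a][y] for a in range(len(pi[x]))) for y in range(n)]
        for x in range(n)
    ]
    # Uniformization constant: q_hat + lam*I has non-negative entries.
    lam = max(-q_hat[x][x] for x in range(n))
    eta_x = list(gamma)
    for _ in range(n_iter):
        eta_x = [
            (alpha * gamma[y]
             + sum(eta_x[x] * q_hat[x][y] for x in range(n))
             + lam * eta_x[y]) / (alpha + lam)
            for y in range(n)
        ]
    eta = [[eta_x[x] * pi[x][a] for a in range(len(pi[x]))] for x in range(n)]
    return eta_x, eta


# Hypothetical two-state, two-action example.
Q = [[[-1.0, 1.0], [-3.0, 3.0]], [[2.0, -2.0], [0.5, -0.5]]]
C = [[1.0, 2.5], [4.0, 3.0]]
PI = [[0.5, 0.5], [1.0, 0.0]]
GAMMA = [0.3, 0.7]
ALPHA = 1.0

eta_x, eta = occupation_measure(Q, PI, GAMMA, ALPHA)
# Discounted cost expressed through the occupation measure, as in (3.16).
cost = sum(C[x][a] * eta[x][a] for x in range(2) for a in range(2)) / ALPHA
```

The total mass of `eta` should equal one, in line with the first assertion of Theorem 3.2.1, and `cost` is the objective of (3.16) evaluated at this occupation measure.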
Proof The very first equality ($\eta_\gamma^{S,\alpha}(\mathbf{X}\times\mathbf{A}) = 1$, $\forall\, S \in \mathbf{S}_\pi$) follows directly from the definition (3.18). Recall that Condition 2.4.2 implies Condition 2.2.5.
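(a) According to Corollary 2.2.8 and using inequalities $\bar q_x \le L w(x)$, $\alpha > \rho$, we deduce that, for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$,

$$0 \le \lim_{t\to\infty} e^{-\alpha t} \int_{(0,t]} E_\gamma^S\left[\int_{\mathbf{A}} \tilde q(\Gamma_X|X(u),a)\,\Pi(da|u)\right] du \le \lim_{t\to\infty} e^{-\alpha t} \int_{(0,t]} L\,E_\gamma^S[w(X(u))]\,du = 0.$$

Hence, integration by parts leads to the equality

$$\alpha \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{(0,t]} \int_{\mathbf{A}} \tilde q(\Gamma_X|X(u),a)\,\Pi(da|u)\,du\right] dt = \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{\mathbf{A}} \tilde q(\Gamma_X|X(t),a)\,\Pi(da|t)\right] dt, \tag{3.24}$$

and the last expression is finite again due to Corollary 2.2.8. Similarly, for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$,

$$\alpha \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{(0,t]} \int_{\mathbf{A}} q_{X(u)}(a)\,I\{X(u) \in \Gamma_X\}\,\Pi(da|u)\,du\right] dt = \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{\mathbf{A}} q_{X(t)}(a)\,I\{X(t) \in \Gamma_X\}\,\Pi(da|t)\right] dt < \infty. \tag{3.25}$$

According to Theorem 2.4.5,

$$\begin{aligned}
\eta_\gamma^{S,\alpha}(\Gamma_X\times\mathbf{A}) &= \alpha \int_{(0,\infty)} e^{-\alpha t} P_\gamma^S(X(t) \in \Gamma_X)\,dt \\
&= \alpha \int_{(0,\infty)} e^{-\alpha t} \gamma(\Gamma_X)\,dt + \alpha \int_{(0,\infty)} e^{-\alpha t} \left\{ E_\gamma^S\left[\int_{(0,t]} \int_{\mathbf{A}} \tilde q(\Gamma_X|X(u),a)\,\Pi(da|u)\,du\right] \right. \\
&\qquad \left. - E_\gamma^S\left[\int_{(0,t]} \int_{\mathbf{A}} q_{X(u)}(a)\,I\{X(u) \in \Gamma_X\}\,\Pi(da|u)\,du\right] \right\} dt.
\end{aligned}$$

All the terms in the above formula are finite, and we use Fubini's Theorem without special reference. Applying expressions (3.24) and (3.25) leads to the equalities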
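$$\begin{aligned}
\eta_\gamma^{S,\alpha}(\Gamma_X\times\mathbf{A}) &= \gamma(\Gamma_X) + \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{\mathbf{A}} \tilde q(\Gamma_X|X(t),a)\,\Pi(da|t)\right] dt - \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{\mathbf{A}} q_{X(t)}(a)\,I\{X(t) \in \Gamma_X\}\,\Pi(da|t)\right] dt \\
&= \gamma(\Gamma_X) + \int_{(0,\infty)} e^{-\alpha t} E_\gamma^S\left[\int_{\mathbf{A}} q(\Gamma_X|X(t),a)\,\Pi(da|t)\right] dt \\
&= \gamma(\Gamma_X) + \int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{X}} \int_{\mathbf{A}} q(\Gamma_X|y,a)\,E_\gamma^S[\Pi(da|t)\,|\,X(t) = y]\,P_\gamma^S(X(t) \in dy)\,dt.
\end{aligned}$$

For every $\Gamma_A \in \mathcal{B}(\mathbf{A})$, $\Gamma \in \mathcal{B}(\mathbf{X})$,

$$\int_\Gamma E_\gamma^S[\Pi(\Gamma_A|t)\,|\,X(t) = y]\,P_\gamma^S(X(t) \in dy) = \int_\Gamma E_\gamma^S[\Pi(\Gamma_A|t)\,I\{X(t) \in dy\}],$$

i.e., for each fixed $t$, $E_\gamma^S[\Pi(\Gamma_A|t)\,|\,X(t) = y]$ is the Radon–Nikodym derivative of the measure on the right, with respect to $P_\gamma^S(X(t) \in \Gamma)$. Therefore,

$$\begin{aligned}
\eta_\gamma^{S,\alpha}(\Gamma_X\times\mathbf{A}) &= \gamma(\Gamma_X) + \int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,E_\gamma^S[\Pi(da|t)\,I\{X(t) \in dy\}]\,dt \\
&= \gamma(\Gamma_X) + \int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,E_\gamma^S\left[\int_{(0,\infty)} e^{-\alpha t}\,\Pi(da|t)\,I\{X(t) \in dy\}\,dt\right] \\
&= \gamma(\Gamma_X) + \frac{1}{\alpha} \int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,\eta_\gamma^{S,\alpha}(dy\times da).
\end{aligned}$$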
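Relation (3.19) is proved. For (3.20), it is sufficient to use Corollary 2.2.8:

$$\begin{aligned}
\int_{\mathbf{X}} w(x)\,\eta_\gamma^{S,\alpha}(dx\times\mathbf{A}) &= \alpha E_\gamma^S\left[\int_{(0,\infty)} e^{-\alpha t} w(X(t))\,dt\right] = \alpha \int_{\mathbf{X}} E_x^S\left[\int_{(0,\infty)} e^{-\alpha t} w(X(t))\,dt\right] \gamma(dx) \\
&\le \alpha \left\{ \int_{(0,\infty)} e^{-(\alpha-\rho)t} \int_{\mathbf{X}} w(x)\gamma(dx)\,dt + \int_{(0,\infty)} e^{-\alpha t}\,\frac{b}{\rho}\left(e^{\rho t} - 1\right) dt \right\} I\{\rho \ne 0\} \\
&\quad + \alpha \left\{ \int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{X}} w(x)\gamma(dx)\,dt + b \int_{(0,\infty)} e^{-\alpha t}\,t\,dt \right\} I\{\rho = 0\} \\
&= \frac{\alpha \int_{\mathbf{X}} w(x)\gamma(dx)}{\alpha-\rho} + \frac{b}{\alpha-\rho}.
\end{aligned}$$

(b) Since $\int_{\mathbf{X}\times\mathbf{A}} q(\mathbf{X}|x,a)\,\eta(dx\times da) = 0$, η is a probability measure. According to Proposition B.1.33, a stationary policy $\pi^s$ satisfying (3.21) exists. Denote by $\eta_\gamma^{\pi^s,\alpha}$ its occupation measure. We need only prove that for every bounded measurable function $v(x,a)$ on $\mathbf{X}\times\mathbf{A}$,

$$\int_{\mathbf{X}\times\mathbf{A}} v(x,a)\,\eta(dx\times da) = \int_{\mathbf{X}\times\mathbf{A}} v(x,a)\,\eta_\gamma^{\pi^s,\alpha}(dx\times da)$$

because after that the statement immediately follows if we take $v(x,a)$ to be an appropriate indicator function. For convenience, let us introduce the function $u(x)$, defined by

$$u(x) := E_x^{\pi^s}\left[\int_{(0,\infty)} e^{-\alpha t} \int_{\mathbf{A}} \pi^s(da|X(t))\,v(X(t),a)\,dt\right].$$

We have that $u$ is the unique bounded solution to the following equation:

$$\alpha u(x) = \int_{\mathbf{A}} v(x,a)\,\pi^s(da|x) + \int_{\mathbf{X}} \int_{\mathbf{A}} \pi^s(da|x)\,q(dy|x,a)\,u(y). \tag{3.26}$$

Indeed, Condition 2.4.3 holds true for $w'(x) \equiv 1$, the cost rate $c_0$ being replaced with the function $v$. The $\rho'$ constant can be chosen at level zero. Now Eq. (3.26) follows from Corollary 3.1.1. Since the function $u$ is bounded, the imposed conditions imply the inequality

$$\int_{\mathbf{X}} \eta(dx\times\mathbf{A}) \left\{ \int_{\mathbf{X}} \int_{\mathbf{A}} q(dy\setminus\{x\}|x,a)\,\pi^s(da|x)\,|u(y)| + \int_{\mathbf{A}} q_x(a)\,\pi^s(da|x)\,|u(x)| \right\} < \infty,$$

which legalizes the interchange of orders of integrations and expectations in the forthcoming calculations. Now on the one hand, by using (3.19), (3.21) and (3.26), we have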
164
3 The Discounted Cost Model
v(y, a)η(dy × da) = η(dy × A) v(y, a)π s (da|y) X×A X A η(dy × A) αu(y) − q(dz|y, a)π s (da|y)u(z) = X A X η(dy × A)αu(y) − η(dy × A) q(dz|y, a)π s (da|y)u(z) = X X X A 1 αu(y) γ(dy) + q(dy|z, a)η(dz × da) (3.27) = α X X×A − η(dy × A) q(dz|y, a)π s (da|y)u(z) = α γ(dy)u(y).
X
X
A
X
On the other hand, by using the properties of regular conditional expectations, direct calculations result in s v(y, a)ηγπ ,α (d x × da) X×A s =α e−αt γ(d x) Pπx (X (t) ∈ dy) (0,∞) X X s × v(y, a)Eπx π s (da|X (t))|X (t) = y dt A −αt πs s e γ(d x)Ex v(X (t), a)π (da|X (t)) dt =α (0,∞) X A = α γ(d x)u(x). X
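This and (3.27) lead to $\int_{\mathbf{X}\times\mathbf{A}} v(x,a)\,\eta(dx\times da) = \int_{\mathbf{X}\times\mathbf{A}} v(x,a)\,\eta_\gamma^{\pi^s,\alpha}(dx\times da)$. Thus, as remarked earlier, the statement follows.

(c) Suppose $\hat\pi^s$ is not a version of $\pi^s$ (with respect to $\eta(dx\times\mathbf{A})$). Then there exist some $\Gamma_X\in\mathcal{B}(\mathbf{X})$ and $\Gamma_A\in\mathcal{B}(\mathbf{A})$ such that $\eta(\Gamma_X\times\mathbf{A})>0$ and $\hat\pi^s(\Gamma_A|x)\ne\pi^s(\Gamma_A|x)$ for each $x\in\Gamma_X$. Clearly, we can write $\Gamma_X=\Gamma_X^1\cup\Gamma_X^2$, where $\Gamma_X^1,\Gamma_X^2\in\mathcal{B}(\mathbf{X})$, $\Gamma_X^1\cap\Gamma_X^2=\emptyset$, $\hat\pi^s(\Gamma_A|x)>\pi^s(\Gamma_A|x)$ on $\Gamma_X^1$, $\hat\pi^s(\Gamma_A|x)<\pi^s(\Gamma_A|x)$ on $\Gamma_X^2$, and at least one of these two sets has a positive $\eta(dx\times\mathbf{A})$ measure. Below, we only consider the case of $\eta(\Gamma_X^1\times\mathbf{A})>0$, as the case of $\eta(\Gamma_X^2\times\mathbf{A})>0$ can be treated similarly with the obvious modifications.

Since $\eta=\eta_\gamma^{\hat\pi^s,\alpha}$, we have $\eta_\gamma^{\hat\pi^s,\alpha}(dx\times\mathbf{A})=\eta(dx\times\mathbf{A})$. By using the definition of the total occupation measures, formula (3.21) and the fact that $\eta=\eta_\gamma^{\hat\pi^s,\alpha}$, direct calculations lead to

\[
\begin{aligned}
\eta_\gamma^{\hat\pi^s,\alpha}(\Gamma_X^1\times\Gamma_A)
&= \alpha E_\gamma^{\hat\pi^s}\left[\int_{(0,\infty)} e^{-\alpha t} I\{X(t)\in\Gamma_X^1\}\,\hat\pi^s(\Gamma_A|X(t))\,dt\right]\\
&= \int_{\Gamma_X^1} \eta_\gamma^{\hat\pi^s,\alpha}(dx\times\mathbf{A})\,\hat\pi^s(\Gamma_A|x)
 = \int_{\Gamma_X^1} \eta(dx\times\mathbf{A})\,\hat\pi^s(\Gamma_A|x)\\
&> \int_{\Gamma_X^1} \eta(dx\times\mathbf{A})\,\pi^s(\Gamma_A|x) = \eta(\Gamma_X^1\times\Gamma_A),
\end{aligned}
\]

which contradicts $\eta_\gamma^{\hat\pi^s,\alpha}=\eta$. As a result, $\hat\pi^s$ must be a version of $\pi^s$.

(d) The existence of a probability measure satisfying (3.22) and (3.23) is clear: by part (b) of this theorem, $\tilde\eta(dx):=\eta_\gamma^{\pi^s,\alpha}(dx\times\mathbf{A})$ provides such a solution. Hence, to prove the statement, we need only show the uniqueness. Suppose $\hat\eta(dx)\ne\tilde\eta(dx)$ is another probability measure satisfying (3.22) and (3.23). Consider the measure $\eta'$ defined by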
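\[
\eta'(\Gamma_X\times\Gamma_A) := \int_{\Gamma_X} \hat\eta(dx)\,\pi^s(\Gamma_A|x), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}),\ \Gamma_A\in\mathcal{B}(\mathbf{A}).
\]

Then relation (3.20) holds true for the measure $\eta'$, which also satisfies (3.19). Consequently, by parts (b) and (c) of this theorem, applied to $\eta'$, there exists a stationary strategy $\pi'$ such that

\[
\eta' = \eta_\gamma^{\pi',\alpha} \tag{3.28}
\]

and

\[
\int_{\Gamma_X} \eta_\gamma^{\pi',\alpha}(dx\times\mathbf{A})\,\pi'(\Gamma_A|x) = \eta'(\Gamma_X\times\Gamma_A) = \int_{\Gamma_X} \hat\eta(dx)\,\pi^s(\Gamma_A|x), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}),\ \Gamma_A\in\mathcal{B}(\mathbf{A}).
\]

This, by part (b) of this theorem, implies that $\eta'=\eta_\gamma^{\pi^s,\alpha}$, and so $\eta_\gamma^{\pi',\alpha}=\eta_\gamma^{\pi^s,\alpha}$, see (3.28). In particular,

\[
\eta_\gamma^{\pi',\alpha}(dx\times\mathbf{A}) = \eta_\gamma^{\pi^s,\alpha}(dx\times\mathbf{A}) = \tilde\eta(dx),
\]

again by the definition of the total occupation measures. On the other hand,

\[
\eta_\gamma^{\pi',\alpha}(dx\times\mathbf{A}) = \hat\eta(dx)
\]

by (3.28) and the definition of $\eta'$. Therefore, we have $\hat\eta(dx)=\tilde\eta(dx)$. However, this contradicts the supposition $\hat\eta(dx)\ne\tilde\eta(dx)$. Thus, the uniqueness is proved, and the statement follows.

Remark 3.2.1 (a) Theorem 3.2.1 implies that the elements of $\mathcal{D}^t$ are fully characterized by the (probability) measures on $\mathbf{X}\times\mathbf{A}$ satisfying relations (3.19) and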
(3.20). Therefore, under the conditions of Theorem 3.2.1, the space $\mathcal{D}^t$ is convex. See also Remark 3.2.4.

(b) According to the proof, inequality (3.20) can be strengthened:

\[
\int_{\mathbf{X}} w(x)\,\eta(dx\times\mathbf{A}) \le \frac{\alpha\int_{\mathbf{X}} w(x)\gamma(dx) + b}{\alpha-\rho}.
\]
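If the transition rate is uniformly bounded, $\sup_{x\in\mathbf{X}} \bar q_x = L < \infty$, then all the conditions of Theorem 3.2.1 are satisfied: one can put $w(x)\equiv 1$ and $\rho=0$. Now Eq. (3.19) can be rewritten as

\[
\left(1+\frac{L}{\alpha}\right)\eta(\Gamma_X\times\mathbf{A}) = \gamma(\Gamma_X) + \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} \big[q(\Gamma_X|x,a) + L\,I\{x\in\Gamma_X\}\big]\,\eta(dx\times da), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}),
\]

and Eq. (3.22) takes the form

\[
\left(1+\frac{L}{\alpha}\right)\tilde\eta(\Gamma_X) = \gamma(\Gamma_X) + \frac{1}{\alpha}\int_{\mathbf{X}}\int_{\mathbf{A}} \big[q(\Gamma_X|y,a) + L\,I\{y\in\Gamma_X\}\big]\,\pi^s(da|y)\,\tilde\eta(dy), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}).
\]

This form is convenient because the probability measure $\tilde\eta$ can be constructed by monotonically increasing successive approximations:

\[
\tilde\eta_0(\Gamma_X) = 0; \qquad
\tilde\eta_{n+1}(\Gamma_X) = \frac{\alpha}{L+\alpha}\left\{\gamma(\Gamma_X) + \frac{1}{\alpha}\int_{\mathbf{X}}\int_{\mathbf{A}} \big[q(\Gamma_X|y,a) + L\,I\{y\in\Gamma_X\}\big]\,\pi^s(da|y)\,\tilde\eta_n(dy)\right\}.
\]

The function in the square brackets is nonnegative, so that $\tilde\eta_0(\Gamma_X)\le\tilde\eta_1(\Gamma_X)\le\tilde\eta_2(\Gamma_X)\le\ldots$ for all $\Gamma_X\in\mathcal{B}(\mathbf{X})$. If $\tilde\eta_n(\mathbf{X})\le 1$, then $\tilde\eta_{n+1}(\mathbf{X}) \le \frac{\alpha\left(1+\frac{L}{\alpha}\right)}{L+\alpha} = 1$. The limiting function $\tilde\eta=\lim_{n\to\infty}\tilde\eta_n$ on $\mathcal{B}(\mathbf{X})$ is clearly a probability measure satisfying Item (d) of Theorem 3.2.1. Therefore, $\tilde\eta(dx)=\eta_\gamma^{\pi^s,\alpha}(dx\times\mathbf{A})$ is the marginal of the total occupation measure associated with the stationary π-strategy $\pi^s$. One can recognise the uniformization technique here. Another way to study the total occupation measures is based on the reduction to a Discrete-Time Markov Decision Process (DTMDP): see the second subsubsection of Sect. 4.2.4.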
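The successive approximations above can be tried out on a finite state space. The following is only a minimal sketch under stated assumptions: the state space is finite, the stationary strategy has already been substituted into the transition rate (so that `Q` is an ordinary generator matrix), and the function name `occupation_marginal` and the three-state rates are hypothetical illustrations, not taken from the text.

```python
def occupation_marginal(Q, gamma, alpha, tol=1e-13, max_iter=100_000):
    """Monotone successive approximations for the marginal of the occupation measure.

    Q     : generator under a fixed stationary strategy, Q[y][x] = rate from y to x
    gamma : initial distribution (list of probabilities)
    alpha : discount factor, alpha > 0
    """
    n = len(gamma)
    L = max(-Q[x][x] for x in range(n))  # uniform bound on the jump rates
    eta = [0.0] * n                      # eta_0 = 0
    for _ in range(max_iter):
        # eta_{n+1}(x) = alpha/(L+alpha) * {gamma(x) + (1/alpha) sum_y [q(x|y) + L I{y=x}] eta_n(y)}
        new = [
            alpha / (L + alpha)
            * (gamma[x]
               + sum((Q[y][x] + (L if y == x else 0.0)) * eta[y] for y in range(n)) / alpha)
            for x in range(n)
        ]
        if max(abs(s - t) for s, t in zip(new, eta)) < tol:
            return new
        eta = new
    return eta


# Hypothetical three-state generator (each row sums to zero).
Q = [[-2.0, 2.0, 0.0],
     [1.0, -3.0, 2.0],
     [0.0, 4.0, -4.0]]
gamma = [1.0, 0.0, 0.0]
alpha = 0.5
eta = occupation_marginal(Q, gamma, alpha)
```

In this finite setting the limit should be a probability vector satisfying the characteristic equation $\tilde\eta(x) = \gamma(x) + \frac{1}{\alpha}\sum_y q(x|y)\tilde\eta(y)$; the iteration is a contraction with modulus $L/(L+\alpha)$, which is why a plain fixed-point loop suffices.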
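3.2.2 The Primal Linear Program and Its Solvability

Condition 3.2.1 (a) Condition 2.4.2 holds for a $[1,\infty)$-valued function $w$; $\alpha>\rho$, and $\int_{\mathbf{X}} w(x)\gamma(dx)<\infty$, with $\gamma$ being the fixed initial distribution.
(b) For each function $u\in C(\mathbf{X})$, the function $\int_{\mathbf{X}} u(y)\,q(dy|x,a)$ is continuous on $\mathbf{X}\times\mathbf{A}$.

According to Theorem 3.2.1, under Condition 3.2.1(a), $\mathcal{D}^t\subset\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$. The space $\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$ is described in Definition B.1.8. Recall that the $w$-weak convergence $\eta_n\xrightarrow{w}\eta$ means that, for every function $g\in C_w(\mathbf{X}\times\mathbf{A})$,

\[
\lim_{n\to\infty}\int_{\mathbf{X}\times\mathbf{A}} g(x,a)\,\eta_n(dx\times da) = \int_{\mathbf{X}\times\mathbf{A}} g(x,a)\,\eta(dx\times da).
\]

Condition 3.2.2 Condition 2.4.2, with constants $\rho'\in\mathbb{R}$ and $L',b'\in\mathbb{R}^0_+$, holds for a $[1,\infty)$-valued function $w'$; $\alpha>\rho'$, and $\int_{\mathbf{X}} w'(x)\gamma(dx)<\infty$.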
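Under Condition 3.2.2, Theorem 3.2.1, applied to the function $w'$, implies that $\mathcal{D}^t\subset\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})$, and the constrained problem (3.16) can be rewritten as the following so-called Primal Linear Program:

\[
\text{Minimize over } \eta\in\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}):\quad \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta(dx\times da) \tag{3.29}
\]

subject to

\[
\eta(\Gamma_X\times\mathbf{A}) = \gamma(\Gamma_X) + \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,\eta(dy\times da), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}); \tag{3.30}
\]

\[
\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da) \le d_j, \qquad \forall\, j=1,2,\ldots,J. \tag{3.31}
\]

Remark 3.2.2 Under Condition 3.2.1(a), one could consider this program in the space $\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$. But the presented version is more convenient because the following theorem states that, under appropriate conditions, the space $\mathcal{D}^t$, that is, the space of measures in $\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})$ satisfying the characteristic Eq. (3.30), is compact in $(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})))$, where $\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}))$ stands for the $w'$-weak topology on $\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})$. For more details, see Appendix B.1.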
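To see what the program (3.29)–(3.31) asks for in the simplest situation, one can read it on a finite model, where a measure on $\mathbf{X}\times\mathbf{A}$ is just a table of masses. The sketch below is an illustration under assumptions that are not part of the text: two states, two actions, hypothetical rates and cost rates, and hypothetical helper names (`marginal`, `characteristic_residual`, `discounted_cost`). It checks the characteristic equation (3.30) on singletons and evaluates the objective (3.29) and the constraint (3.31) for one candidate measure.

```python
def marginal(eta, x):
    """eta({x} x A) for a measure stored as {(state, action): mass}."""
    return sum(mass for (y, _a), mass in eta.items() if y == x)


def characteristic_residual(eta, q, gamma, alpha):
    """Largest violation of (3.30) over the singletons {x}; q[y][a][x] is the rate y -> x under a."""
    worst = 0.0
    for x in range(len(gamma)):
        inflow = sum(q[y][a][x] * mass for (y, a), mass in eta.items())
        worst = max(worst, abs(marginal(eta, x) - gamma[x] - inflow / alpha))
    return worst


def discounted_cost(eta, c, alpha):
    """(1/alpha) * integral of c with respect to eta, as in (3.29) and (3.31)."""
    return sum(c[x][a] * mass for (x, a), mass in eta.items()) / alpha


# Hypothetical toy model: state 0 = working, state 1 = broken;
# action 0 = slow repair (rate 1), action 1 = fast repair (rate 5).
q = [[[-1.0, 1.0], [-1.0, 1.0]],
     [[1.0, -1.0], [5.0, -5.0]]]
c0 = [[0.0, 0.0], [1.0, 1.0]]   # loss rate while broken
c1 = [[0.0, 0.0], [0.0, 3.0]]   # expense rate of the fast repair
gamma, alpha, d1 = [1.0, 0.0], 1.0, 0.2

# Candidate measure: occupation measure of "always repair slowly" (computed by hand).
eta = {(0, 0): 2.0 / 3.0, (1, 0): 1.0 / 3.0}

residual = characteristic_residual(eta, q, gamma, alpha)
objective = discounted_cost(eta, c0, alpha)
constraint = discounted_cost(eta, c1, alpha)
```

Here the candidate satisfies (3.30) exactly and is feasible for (3.31), with objective value 1/3; it need not be optimal, since the program minimizes over all measures satisfying the same relations.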
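Theorem 3.2.2 (a) If Condition 3.2.1 is satisfied and the function $w$ is continuous, then the space $\mathcal{D}^t$ is $w$-weakly closed in $(\mathcal{P}_w(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_w(\mathbf{X}\times\mathbf{A})))$.
(b) Suppose both Condition 3.2.1 and Condition 3.2.2 are satisfied, the function $w'$ is continuous, the space $\mathbf{A}$ is compact, and the ratio $\frac{w}{w'}$ is a strictly unbounded function on $\mathbf{X}$. Then the space $\mathcal{D}^t$ is compact in $(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})))$.

Proof (a) $\mathcal{D}^t\subseteq\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$ due to Theorem 3.2.1 (see (3.20)). According to Corollary B.1.3, $(\mathcal{P}_w(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_w(\mathbf{X}\times\mathbf{A})))$ is a topological Borel space, so that it is sufficient to consider only convergence of sequences. Suppose $\eta_n\xrightarrow{w}\eta$, where $\eta\in\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$ and $\eta_n\in\mathcal{D}^t$, $n=1,2,\ldots$. We need to show that $\eta\in\mathcal{D}^t$, i.e., the measure $\eta$ satisfies relations (3.19) and (3.20): see Remark 3.2.1(a).

(i) Inequality (3.20) holds because, according to Remark 3.2.1(b),

\[
\int_{\mathbf{X}\times\mathbf{A}} w(x)\,\eta(dx\times da) = \lim_{n\to\infty}\int_{\mathbf{X}\times\mathbf{A}} w(x)\,\eta_n(dx\times da) \le \frac{\alpha\int_{\mathbf{X}} w(x)\gamma(dx) + b}{\alpha-\rho} < \infty.
\]

Recall that the function $w$ is continuous.

(ii) To prove equality (3.19), let us introduce the finite signed measure

\[
\tilde\eta(\Gamma_X) := \gamma(\Gamma_X) + \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,\eta(dy\times da), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}),
\]

and show that $\tilde\eta(dx)=\eta(dx\times\mathbf{A})$. The measure $\tilde\eta$ is well defined because $\bar q_x\le Lw(x)$ and $\eta\in\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$. Let $u\in C(\mathbf{X})$. Conditions 2.4.2 and 3.2.1(b) imply that the function $\int_{\mathbf{X}} u(x)\,q(dx|y,a)$ belongs to $C_w(\mathbf{X}\times\mathbf{A})$. Therefore,

\[
\begin{aligned}
\lim_{n\to\infty}\int_{\mathbf{X}} u(x)\,\eta_n(dx\times\mathbf{A})
&= \lim_{n\to\infty}\left[\int_{\mathbf{X}} u(x)\gamma(dx) + \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}}\int_{\mathbf{X}} u(x)\,q(dx|y,a)\,\eta_n(dy\times da)\right]\\
&= \int_{\mathbf{X}} u(x)\gamma(dx) + \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}}\int_{\mathbf{X}} u(x)\,q(dx|y,a)\,\eta(dy\times da)\\
&= \int_{\mathbf{X}} u(x)\left[\gamma(dx) + \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} q(dx|y,a)\,\eta(dy\times da)\right]
 = \int_{\mathbf{X}} u(x)\,\tilde\eta(dx).
\end{aligned}
\]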
Fubini’s Theorem is applicable because all the integrals are finite. On the other hand, by Lemma B.1.7,

\[
\eta_n\xrightarrow{w}\eta \ \Longrightarrow\ \eta_n(dx\times\mathbf{A})\xrightarrow{w}\eta(dx\times\mathbf{A}) \ \Longrightarrow\ \eta_n(dx\times\mathbf{A})\to\eta(dx\times\mathbf{A}),
\]

the last convergence being the usual weak one. Therefore, $\int_{\mathbf{X}} u(y)\big(\tilde\eta(dy)-\eta(dy\times\mathbf{A})\big)=0$. Since $u\in C(\mathbf{X})$ was arbitrarily fixed, by Proposition B.1.23, $\eta(dx\times\mathbf{A})=\tilde\eta(dx)$.

(b) Since Condition 2.4.2 holds for the continuous function $w'$, the space $\mathcal{D}^t$ is $w'$-weakly closed in $(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})))$ according to Item (a). Lemma B.1.7 implies that the mapping $P\to\tilde P=Q_{w'}(P)$, defined by
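\[
\tilde P(\Gamma_X\times\Gamma_A) := \frac{\int_{\Gamma_X\times\Gamma_A} w'(x)\,P(dx\times da)}{\int_{\mathbf{X}\times\mathbf{A}} w'(x)\,P(dx\times da)},
\]

provides a homeomorphism between $(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})))$ and $(\mathcal{P}(\mathbf{X}\times\mathbf{A}),\tau_{weak})$, where $\tau_{weak}$ is the standard weak topology, which is known to be metrizable. (See Proposition B.1.27.) Thus, it remains to show that the space $\tilde{\mathcal{D}}^t:=Q_{w'}(\mathcal{D}^t)$ is relatively compact in $(\mathcal{P}(\mathbf{X}\times\mathbf{A}),\tau_{weak})$. For an arbitrary measure $\tilde\eta\in\tilde{\mathcal{D}}^t$, we introduce the probability measure $\eta:=Q_{w'}^{-1}(\tilde\eta)$ by the formula

\[
\eta(dx\times da) := \frac{\dfrac{\tilde\eta(dx\times da)}{w'(x)}}{\displaystyle\int_{\mathbf{X}\times\mathbf{A}} \frac{\tilde\eta(dx\times da)}{w'(x)}}.
\]

It is straightforward to check that $\tilde\eta=Q_{w'}(\eta)$, that is, $Q_{w'}^{-1}$ is the inverse mapping to $Q_{w'}$. Therefore, $\eta\in\mathcal{D}^t$ and

\[
\int_{\mathbf{X}\times\mathbf{A}} \frac{w(x)}{w'(x)}\,\tilde\eta(dx\times da)
= \int_{\mathbf{X}\times\mathbf{A}} \frac{w(x)}{w'(x)}\,Q_{w'}(\eta)(dx\times da)
= \frac{\int_{\mathbf{X}\times\mathbf{A}} w(x)\,\eta(dx\times da)}{\int_{\mathbf{X}\times\mathbf{A}} w'(x)\,\eta(dx\times da)}
\le \frac{\alpha\int_{\mathbf{X}} w(x)\gamma(dx) + b}{\alpha-\rho} < \infty.
\]

The last inequality holds due to Remark 3.2.1(b); remember also that $w'(x)\ge 1$ and $\eta\in\mathcal{D}^t$ is a probability measure. We have proved that

\[
\sup_{\tilde\eta\in\tilde{\mathcal{D}}^t}\int_{\mathbf{X}\times\mathbf{A}} \frac{w(x)}{w'(x)}\,\tilde\eta(dx\times da) < \infty.
\]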
The ratio $\frac{w}{w'}$, as a function on $\mathbf{X}\times\mathbf{A}$, is strictly unbounded because $\mathbf{A}$ is compact. Hence, by Proposition B.1.30, the space $\tilde{\mathcal{D}}^t$ is tight and relatively compact.

Theorem 3.2.3 Suppose both Condition 3.2.1 and Condition 3.2.2 are satisfied, the function $w'$ is continuous, the space $\mathbf{A}$ is compact, and the ratio $\frac{w}{w'}$ is a strictly unbounded function on $\mathbf{X}$. Assume also that, for each $j=0,1,\ldots,J$, the function $c_j$ is lower semicontinuous on $\mathbf{X}\times\mathbf{A}$ and satisfies the inequality $c_j(x,a)\ge -M'w'(x)$ for all $(x,a)\in\mathbf{X}\times\mathbf{A}$, where $M'\in\mathbb{R}^0_+$ is a constant. Finally, assume that there is at least one feasible strategy for problem (1.16). Then there exists a stationary optimal π-strategy $\pi^*$ solving the constrained problem (3.3) (as well as problem (1.16)).

Proof The target is to prove the solvability of the program (3.29)–(3.31), which is equivalent to the problems (3.16), (3.3) and (1.16). According to Theorem 3.2.2(b), the space of total occupation measures $\mathcal{D}^t$, that is, the space of measures in $\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})$ satisfying equality (3.30), is compact in $(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})))$. We are going to show that, for each $j=0,1,\ldots,J$, the mapping $\eta\to\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$ from $\mathcal{D}^t$ to $\bar{\mathbb{R}}$ is lower semicontinuous in the $w'$-weak topology, that is, the set

\[
\mathcal{D}^t_r := \left\{\eta\in\mathcal{D}^t:\ \int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)\le r\right\}
\]

is $w'$-weakly closed for every $r\in\mathbb{R}$. Note that the integral $\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$ is well defined because $c_j(x,a)\ge -M'w'(x)$ for all $(x,a)\in\mathbf{X}\times\mathbf{A}$ and

\[
\int_{\mathbf{X}\times\mathbf{A}} w'(x)\,\eta(dx\times da) \le \frac{\alpha\int_{\mathbf{X}} w'(x)\gamma(dx) + b'}{\alpha-\rho'} < \infty \qquad \forall\,\eta\in\mathcal{D}^t
\]

due to Theorem 3.2.1 applied to the function $w'$: see Remark 3.2.1(b). Moreover, the mapping $\eta\to\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$ from $\mathcal{D}^t$ to $\bar{\mathbb{R}}$ is bounded from below for all $j=0,1,\ldots,J$. Below, the index $j\in\{0,1,\ldots,J\}$ and $r\in\mathbb{R}$ are fixed.

First, we show that there is a sequence $\{c_j^m\}_{m=1}^\infty$ of $w'$-bounded continuous functions on $\mathbf{X}\times\mathbf{A}$ such that $c_j^m\uparrow c_j$ as $m\to\infty$. Since the function $c_j$ is lower semicontinuous and $c_j(x,a)\ge -M'w'(x)$, and the function $w'$ is continuous, we see that the function $\hat c_j(x,a):=\frac{c_j(x,a)}{w'(x)}$ is bounded from below and lower semicontinuous on $\mathbf{X}\times\mathbf{A}$. Hence, by Proposition B.1.7, there is a sequence of bounded continuous functions $\{\hat c_j^m\}_{m=1}^\infty$ on $\mathbf{X}\times\mathbf{A}$ such that $\hat c_j^m\uparrow\hat c_j$ as $m\to\infty$, and the sequence $\{c_j^m:=\hat c_j^m w'\}_{m=1}^\infty$ is the desired one.
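Suppose $\eta_n\xrightarrow{w'}\eta$ as $n\to\infty$ and $\eta_n\in\mathcal{D}^t_r$. Then $\eta\in\mathcal{D}^t$ by Theorem 3.2.2(a) applied to the function $w'$ and, by the Monotone Convergence Theorem,

\[
\begin{aligned}
\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)
&= \int_{\mathbf{X}\times\mathbf{A}} \lim_{m\to\infty} c_j^m(x,a)\,\eta(dx\times da)
 = \lim_{m\to\infty}\int_{\mathbf{X}\times\mathbf{A}} c_j^m(x,a)\,\eta(dx\times da)\\
&= \lim_{m\to\infty}\lim_{n\to\infty}\int_{\mathbf{X}\times\mathbf{A}} c_j^m(x,a)\,\eta_n(dx\times da)
 \le \lim_{m\to\infty}\lim_{n\to\infty} r = r,
\end{aligned}
\]

where the inequality is valid because $c_j^m\le c_j$ and $\eta_n\in\mathcal{D}^t_r$. Therefore, $\eta\in\mathcal{D}^t_r$ and the set $\mathcal{D}^t_r$ is $w'$-weakly closed. Now the set

\[
\mathcal{D}^t_{feasible} := \left\{\eta\in\mathcal{D}^t:\ \int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)\le\alpha d_j,\ j=1,2,\ldots,J\right\}
\]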
is non-empty and compact in $(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A}),\tau(\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})))$ as the finite intersection of closed subsets of the compact $\mathcal{D}^t$. Since the mapping $\eta\to\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta(dx\times da)$ from $\mathcal{D}^t_{feasible}$ is again lower semicontinuous in the $w'$-weak topology, there is a measure $\eta^*\in\mathcal{D}^t_{feasible}$ satisfying

\[
\inf_{\eta\in\mathcal{D}^t_{feasible}}\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta(dx\times da) = \int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta^*(dx\times da)
\]

(see Proposition B.1.39 or Proposition B.1.40). Clearly, the total occupation measure $\eta^*$ solves the problem (3.29)–(3.31). Finally, the existence of the desired strategy $\pi^*$ follows from Theorem 3.2.1(b).

Remark 3.2.3 According to the previous proof, under the conditions of Theorem 3.2.3, the space $\mathcal{D}^t$ is compact in the $w'$-weak topology and all the mappings $\eta\to\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$, $j=0,1,2,\ldots,J$, are lower semicontinuous and bounded from below by the constant $-M'\,\frac{\alpha\int_{\mathbf{X}} w'(x)\gamma(dx)+b'}{\alpha-\rho'}$: see Remark 3.2.1(b) applied to the function $w'$.
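The functions $w$ and $w'$, and the corresponding conditions, were introduced in order to work with unbounded transition and cost rates (from above and from below). Suppose $\sup_{x\in\mathbf{X}}\bar q_x<\infty$ and $\inf_{(x,a)\in\mathbf{X}\times\mathbf{A}} c_j(x,a)>-\infty$ for all $j=0,1,2,\ldots,J$. Then obviously one can take $w(x)\equiv w'(x)\equiv 1$, $\rho=\rho'=0$ and so on. All the statements in the current section hold if the state space $\mathbf{X}$ is compact: in this case the function $\frac{w(x)}{w'(x)}\equiv 1$ is strictly unbounded. Moreover, the requirement about the compactness of the space $\mathbf{X}$ is not needed, as the following statement shows.

Theorem 3.2.4 Suppose $\sup_{x\in\mathbf{X}}\bar q_x<\infty$, the space $\mathbf{A}$ is compact and, for each function $u\in C(\mathbf{X})$, the function $\int_{\mathbf{X}} u(y)\,q(dy|x,a)$ is continuous on $\mathbf{X}\times\mathbf{A}$. Then the space $\mathcal{D}^t$ is weakly compact. Assume additionally that all the functions $c_j$ are bounded from below and lower semicontinuous, and there is at least one feasible strategy for problem (1.16). Then there exists a stationary optimal π-strategy $\pi^*$ solving the constrained problem (3.3), as well as problem (1.16).

Proof As was mentioned, one can put $w(x)\equiv w'(x)\equiv 1$; $\rho=\rho'=b=b'=0$; $L=L'=\sup_{x\in\mathbf{X}}\bar q_x$; $M'=0\vee\big(-\min_{j=0,1,\ldots,J}\inf_{(x,a)\in\mathbf{X}\times\mathbf{A}} c_j(x,a)\big)$, so that Conditions 2.4.2, 3.2.1 and 3.2.2 are satisfied. Equality (3.19) can be rewritten as

\[
\begin{aligned}
\frac{\alpha}{L+\alpha}\,\eta(\Gamma_X\times\mathbf{A})
&= \frac{\alpha}{L+\alpha}\,\gamma(\Gamma_X) + \frac{L}{L+\alpha}\int_{\mathbf{X}\times\mathbf{A}}\left[\frac{q(\Gamma_X\setminus\{y\}|y,a)}{L} + \frac{I\{y\in\Gamma_X\}[L-q_y(a)]}{L}\right]\eta(dy\times da)
 - \frac{L}{L+\alpha}\,\eta(\Gamma_X\times\mathbf{A})\\
\Longleftrightarrow\quad \eta(\Gamma_X\times\mathbf{A})
&= \frac{\alpha}{L+\alpha}\,\gamma(\Gamma_X) + \frac{L}{L+\alpha}\int_{\mathbf{X}\times\mathbf{A}} p(\Gamma_X|y,a)\,\eta(dy\times da), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}),
\end{aligned}
\]

where

\[
p(\Gamma_X|y,a) := \frac{1}{L}\Big[q(\Gamma_X\setminus\{y\}|y,a) + I\{y\in\Gamma_X\}[L-q_y(a)]\Big]
\]

is a continuous stochastic kernel on $\mathcal{B}(\mathbf{X})$ given $(y,a)\in\mathbf{X}\times\mathbf{A}$. According to Theorem 3.2.1,

\[
\mathcal{D}^t = \left\{\eta\in\mathcal{P}(\mathbf{X}\times\mathbf{A}):\ \eta(\Gamma_X\times\mathbf{A}) = (1-\beta)\gamma(\Gamma_X) + \beta\int_{\mathbf{X}\times\mathbf{A}} p(\Gamma_X|y,a)\,\eta(dy\times da),\ \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X})\right\},
\]

where $\beta:=\frac{L}{L+\alpha}\in(0,1)$. Now, by Proposition C.2.19, $\mathcal{D}^t$ is weakly compact.

The proof of the last assertion is just a simplified version of the proof of Theorem 3.2.3: the Primal Linear Program (3.29)–(3.31) is solvable and its solution $\eta^*\in\mathcal{P}(\mathbf{X}\times\mathbf{A})$ gives rise to the stationary optimal π-strategy $\pi^*$ coming from the decomposition

\[
\eta^*(\Gamma_X\times\Gamma_A) = \int_{\Gamma_X} \pi^*(\Gamma_A|y)\,\eta^*(dy\times\mathbf{A}), \qquad \forall\,\Gamma_X\in\mathcal{B}(\mathbf{X}),\ \Gamma_A\in\mathcal{B}(\mathbf{A}).
\]

One can recognise the uniformization technique here.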
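The uniformized kernel $p$ from the proof is easy to write down when the state and action spaces are finite. The following sketch is only an illustration: the two-state rates, the candidate measure `eta` and the helper name `uniformized_kernel` are hypothetical and are not taken from the text. It builds $p$, confirms that each row is a probability distribution, and checks the discrete-time form of the characteristic equation with $\beta = L/(L+\alpha)$.

```python
def uniformized_kernel(q, L):
    """p(y|x,a) = (1/L) * [q(y|x,a) for y != x, and L - q_x(a) for y == x]."""
    n, m = len(q), len(q[0])
    return [[[(q[x][a][y] + (L if y == x else 0.0)) / L for y in range(n)]
             for a in range(m)]
            for x in range(n)]


# Hypothetical rates: q[x][a][y] is the rate from x to y under action a.
q = [[[-1.0, 1.0], [-1.0, 1.0]],
     [[1.0, -1.0], [5.0, -5.0]]]
gamma, alpha = [1.0, 0.0], 1.0

L = max(-q[x][a][x] for x in range(2) for a in range(2))  # uniform bound on the jump rates
beta = L / (L + alpha)
p = uniformized_kernel(q, L)

# Occupation measure of the strategy that always uses action 0 (computed by hand).
eta = {(0, 0): 2.0 / 3.0, (1, 0): 1.0 / 3.0}

# Right-hand side of eta({x} x A) = (1 - beta) gamma(x) + beta * sum p(x|y,a) eta(y,a).
rhs = [(1.0 - beta) * gamma[x] + beta * sum(p[y][a][x] * mass for (y, a), mass in eta.items())
       for x in range(2)]
lhs = [sum(mass for (y, _a), mass in eta.items() if y == x) for x in range(2)]
```

The design choice here is to take $L$ as the largest jump rate over all state-action pairs; any larger constant would also work, at the price of a discount factor $\beta$ closer to one in the equivalent discrete-time model.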
3.2.3 Comparison of the Convex Analytic and Dynamic Programming Approaches

If $J=0$, the unconstrained problem (3.1) can also be studied using the Primal Linear Program (3.29)–(3.31). Let us compare the conditions under which it was proved to be solvable (see Theorem 3.2.3) and the conditions under which the Dynamic Programming method was justified, including the solvability of the Dual Linear Program (3.15) (see Theorems 3.1.2 and 3.1.3). All the imposed conditions can be divided into the following three lists.

1. Common conditions which appear in Theorems 3.1.2, 3.1.3 and 3.2.3.
(i) There exist a measurable function $w:\mathbf{X}\to[1,\infty)$ and constants $\rho\in\mathbb{R}$, $L,b\in\mathbb{R}^0_+$ such that $\bar q_x\le Lw(x)$ and $\int_{\mathbf{X}} w(y)\,q(dy|x,a)\le\rho w(x)+b$ for all $x\in\mathbf{X}$ and $a\in\mathbf{A}$.
(ii) $\int_{\mathbf{X}} w(x)\gamma(dx)<\infty$.
(iii) There exist a continuous function $w':\mathbf{X}\to[1,\infty)$ and constants $\rho'<\alpha$, $b'\in\mathbb{R}^0_+$ such that $\int_{\mathbf{X}} w'(y)\,q(dy|x,a)\le\rho'w'(x)+b'$ for all $x\in\mathbf{X}$ and $a\in\mathbf{A}$.
(iv) $\forall u\in C(\mathbf{X})$ the function $\int_{\mathbf{X}} u(y)\,q(dy|x,a)$ is continuous on $\mathbf{X}\times\mathbf{A}$.
(v) The action space $\mathbf{A}$ is compact, the function $c_0$ is lower semicontinuous and satisfies the inequality $\inf_{a\in\mathbf{A}} c_0(x,a)\ge -M'w'(x)$, for some constant $M'\in\mathbb{R}^0_+$, for all $x\in\mathbf{X}$.

2. Additional conditions which appear in Theorems 3.1.2 and 3.1.3.
(i) There is a constant $L'\in\mathbb{R}^0_+$ such that $(\bar q_x+1)w'(x)\le L'w(x)$ for all $x\in\mathbf{X}$.
(ii) $\inf_{a\in\mathbf{A}} c_0(x,a)\le M'w'(x)$ for all $x\in\mathbf{X}$.
(iii) $\forall u\in C(\mathbf{X})$, the function $\int_{\mathbf{X}} u(y)w'(y)\,q(dy|x,a)$ is continuous on $\mathbf{X}\times\mathbf{A}$.

3. Additional conditions which appear in Theorem 3.2.3.
(i) There is a constant $L'\in\mathbb{R}^0_+$ such that $\bar q_x\le L'w'(x)$ for all $x\in\mathbf{X}$.
(ii) The ratio $\frac{w}{w'}$ is a strictly unbounded function on $\mathbf{X}$.
(iii) $\alpha>\rho$.
(iv) $\int_{\mathbf{X}} w'(x)\gamma(dx)<\infty$.
In the first list, we see the requirement that two Lyapunov-type functions $w$ and $w'$ exist. Although only one function $w$ is needed for the non-explosiveness of the controlled process, without the function $w'$, the Dynkin formula is more delicate, as well as the compactness of the space $\mathcal{D}^t$ in the corresponding topology. Items 1(iv,v) are widely known as the compactness-continuity conditions. Requirement 2(iii) is the slightly stronger version. Items 3(i) along with 3(ii) mean that, roughly speaking, the function $w$ is bigger than $w'$ (which is consistent with Condition 2(i)), and the function $\bar q_x$ is smaller than $w'$. The conditions $\inf_{a\in\mathbf{A}} c_0(x,a)\ge -M'w'(x)$ and $\inf_{a\in\mathbf{A}} c_0(x,a)\le M'w'(x)$ are weak versions of the requirement that the function $c_0$ on $\mathbf{X}\times\mathbf{A}$ is $w'$-bounded.

Suppose $|\inf_{a\in\mathbf{A}} c_0(x,a)|\le M'$. Then in Theorems 3.1.2 and 3.1.3 one can put $w'(x)\equiv 1$ and still work with the unbounded transition rate, as soon as $\sup_{x\in\mathbf{X}}\frac{\bar q_x}{w(x)}<\infty$. In contrast, $w'(x)\equiv 1$ means that the Convex Analytic Approach developed in this section applies only if the transition rate $\bar q_x$ is bounded. (See Item 3(i).) Thus, in some sense, the conditions appearing in Theorems 3.1.2 and 3.1.3 are less restrictive. Moreover, the Dynamic Programming Approach, namely Theorem 3.1.2, allows us to build the uniformly optimal strategy, while the Convex Analytic Approach deals with the fixed initial distribution $\gamma$. But on the other hand, Theorem 3.2.3 allows the positive part of the cost rate(s) to be arbitrary. As a result, the value of the Primal Linear Program can equal infinity, but the function $u^*$ in Theorem 3.1.2 is finite-valued. Moreover, the Convex Analytic Approach can be used for tackling constrained problems with $J\ge 1$, where application of Dynamic Programming and of the Dual Linear Program is less convenient. One can say that the Dynamic Programming and Convex Analytic approaches complement each other in the case $J=0$, but Dynamic Programming is much more popular in unconstrained optimization.
3.2.4 Duality

We called the Linear Program (3.15) dual, and (3.29)–(3.31) primal. In this subsection, we demonstrate how the general theory of linear and convex programming (see Sect. B.2) applies to the optimal control problems under investigation.

First, we show that the program (3.29)–(3.31) is indeed an example of the Primal Linear Program (B.14). To do this, we assume that Conditions 3.2.1(a) and 3.2.2 are satisfied and, additionally,

\[
|c_j(x,a)|\le Mw(x), \qquad (\bar q_x+1)w'(x)\le L'w(x), \qquad \forall\,x\in\mathbf{X},\ a\in\mathbf{A},\ j=0,1,\ldots,J, \tag{3.32}
\]

for some constants $M$, $L'$. Recall that $\mathcal{M}^F_w(\mathbf{X}\times\mathbf{A})$ is the space of all finite signed measures on $(\mathbf{X}\times\mathbf{A},\mathcal{B}(\mathbf{X}\times\mathbf{A}))$ with a finite $w$-norm. (See Definition B.1.8.)
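Let

\[
\begin{aligned}
\mathcal{X} &:= \mathcal{M}^F_w(\mathbf{X}\times\mathbf{A})\times\mathbb{R}^J = \{X=(\eta,\beta_1,\beta_2,\ldots,\beta_J):\ \eta\in\mathcal{M}^F_w(\mathbf{X}\times\mathbf{A}),\ \beta_j\in\mathbb{R},\ j=1,2,\ldots,J\},\\
\mathcal{Y} &:= B_w(\mathbf{X}\times\mathbf{A})\times\mathbb{R}^J = \{Y=(f,e_1,e_2,\ldots,e_J):\ f\in B_w(\mathbf{X}\times\mathbf{A}),\ e_j\in\mathbb{R},\ j=1,2,\ldots,J\}.
\end{aligned} \tag{3.33}
\]

The bilinear form on $\mathcal{X}\times\mathcal{Y}$ is defined as

\[
\langle X,Y\rangle := \int_{\mathbf{X}\times\mathbf{A}} f(x,a)\,\eta(dx\times da) + \sum_{j=1}^{J}\beta_j e_j. \tag{3.34}
\]

We fix the following positive cone in $\mathcal{X}$:

\[
Co := \{(\eta,\beta_1,\beta_2,\ldots,\beta_J):\ \eta\ge 0,\ \beta_j\ge 0,\ j=1,2,\ldots,J\} \tag{3.35}
\]

and its dual cone in $\mathcal{Y}$:

\[
Co^* := \{(f,e_1,e_2,\ldots,e_J):\ f\ge 0,\ e_j\ge 0,\ j=1,2,\ldots,J\}.
\]

Let

\[
\begin{aligned}
\mathcal{Z} &:= \mathcal{M}^F_{w'}(\mathbf{X})\times\mathbb{R}^J = \{Z=(z_0,h_1,h_2,\ldots,h_J):\ z_0\in\mathcal{M}^F_{w'}(\mathbf{X}),\ h_j\in\mathbb{R},\ j=1,2,\ldots,J\},\\
\mathcal{V} &:= B_{w'}(\mathbf{X})\times\mathbb{R}^J = \{V=(v',g'_1,g'_2,\ldots,g'_J):\ v'\in B_{w'}(\mathbf{X}),\ g'_j\in\mathbb{R},\ j=1,2,\ldots,J\}.
\end{aligned}
\]

The bilinear form on $\mathcal{Z}\times\mathcal{V}$ is defined as

\[
\langle Z,V\rangle := \int_{\mathbf{X}} v'(x)\,z_0(dx) + \sum_{j=1}^{J} h_j g'_j.
\]

The mapping $U$ from $\mathcal{X}$ to $\mathcal{Z}$ is defined as $U\circ X = Z = (z_0,h_1,h_2,\ldots,h_J)$, $\forall\,X=(\eta,\beta_1,\beta_2,\ldots,\beta_J)\in\mathcal{X}$, with

\[
\begin{aligned}
z_0(\Gamma_X) &:= \eta(\Gamma_X\times\mathbf{A}) - \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} q(\Gamma_X|y,a)\,\eta(dy\times da),\\
h_j &:= \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da) + \beta_j, \qquad j=1,2,\ldots,J.
\end{aligned} \tag{3.36}
\]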
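If a net $\{\eta_\zeta\}$ in $\mathcal{M}^F_w(\mathbf{X}\times\mathbf{A})$ converges to $\eta$ in the $\tau(\mathcal{M}^F_w(\mathbf{X}\times\mathbf{A}),B_w(\mathbf{X}\times\mathbf{A}))$ topology, then, according to the imposed conditions,

\[
w'\in B_w(\mathbf{X}\times\mathbf{A}) \quad\text{and}\quad \int_{\mathbf{X}} w'(x)\,q(dx|y,a)\in B_w(\mathbf{X}\times\mathbf{A}),
\]

so that the net

\[
\eta_\zeta(dx\times\mathbf{A}) - \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} q(dx|y,a)\,\eta_\zeta(dy\times da)
\]

converges to

\[
\eta(dx\times\mathbf{A}) - \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} q(dx|y,a)\,\eta(dy\times da)
\]

in the $\tau(\mathcal{M}^F_{w'}(\mathbf{X}),B_{w'}(\mathbf{X}))$ topology. The convergence of $\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta_\zeta(dx\times da)$ to $\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$ is also clear for all $j=1,2,\ldots,J$. Therefore, the mapping $U$ is weakly continuous and linear. Its adjoint mapping $U^*:\mathcal{V}\to\mathcal{Y}$ is defined by $U^*\circ V = Y = (f,e_1,e_2,\ldots,e_J)$, $\forall\,V=(v',g'_1,g'_2,\ldots,g'_J)\in\mathcal{V}$, with

\[
\begin{aligned}
f(x,a) &:= v'(x) - \frac{1}{\alpha}\int_{\mathbf{X}} v'(y)\,q(dy|x,a) + \frac{1}{\alpha}\sum_{j=1}^{J} g'_j c_j(x,a),\\
e_j &:= g'_j, \qquad j=1,2,\ldots,J.
\end{aligned}
\]

Indeed, by Fubini’s Theorem,

\[
\begin{aligned}
\langle U\circ X,V\rangle = \langle X,U^*\circ V\rangle
&= \int_{\mathbf{X}} v'(x)\,\eta(dx\times\mathbf{A}) - \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}}\int_{\mathbf{X}} v'(x)\,q(dx|y,a)\,\eta(dy\times da)\\
&\quad + \frac{1}{\alpha}\sum_{j=1}^{J} g'_j\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da) + \sum_{j=1}^{J} g'_j\beta_j.
\end{aligned}
\]

Under the imposed conditions, if $X=(\eta,\beta_1,\beta_2,\ldots,\beta_J)\in Co$ satisfies $U\circ X=B$, then the finite signed measure $\eta$ is positive, and equality (3.30) implies that $\eta\in\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$. Therefore, the Primal Linear Program (3.29)–(3.31), considered in the space $\mathcal{P}_w(\mathbf{X}\times\mathbf{A})$ (see Remark 3.2.2), or, equivalently, the constrained problem (3.16), can be represented in the form of (B.14):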
\[
\text{Minimize over } X\in Co:\quad \langle X,C\rangle \qquad \text{subject to } U\circ X = B. \tag{3.37}
\]

Here and below in this subsection,

\[
C := \left(\tfrac{1}{\alpha}c_0,0,0,\ldots,0\right)\in\mathcal{Y}, \qquad B := (\gamma,d_1,d_2,\ldots,d_J)\in\mathcal{Z}.
\]

The Dual Linear Program (B.15)

\[
\text{Maximize over } V\in\mathcal{V}:\quad \langle B,V\rangle \qquad \text{subject to } C-U^*\circ V\in Co^* \tag{3.38}
\]

can be explicitly written as follows:

\[
\text{Maximize over } (v',g'_1,g'_2,\ldots,g'_J)\in\mathcal{V}:\quad \int_{\mathbf{X}} v'(x)\gamma(dx) + \sum_{j=1}^{J} g'_j d_j
\]

subject to

\[
\frac{1}{\alpha}c_0(x,a) - v'(x) + \frac{1}{\alpha}\int_{\mathbf{X}} v'(y)\,q(dy|x,a) - \frac{1}{\alpha}\sum_{j=1}^{J} g'_j c_j(x,a) \ge 0, \quad \forall\,(x,a)\in\mathbf{X}\times\mathbf{A}; \qquad -g'_j\ge 0,\ j=1,2,\ldots,J.
\]

If we introduce the change of variables through $g_j:=-g'_j$, $j=1,2,\ldots,J$, and $v(x):=v'(x)-\sum_{j=1}^{J} g_j d_j$, then the above dual program takes the following more familiar form:

\[
\text{Maximize over } (v,g_1,\ldots,g_J):\quad \int_{\mathbf{X}} v(x)\gamma(dx) \tag{3.39}
\]

subject to

\[
\alpha v(x) \le c_0(x,a) + \sum_{j=1}^{J} g_j\big[c_j(x,a)-\alpha d_j\big] + \int_{\mathbf{X}} v(y)\,q(dy|x,a), \quad \forall\,(x,a)\in\mathbf{X}\times\mathbf{A}; \qquad g_j\ge 0,\ j=1,\ldots,J; \quad v\in B_{w'}(\mathbf{X}).
\]

In the unconstrained case, when $J=0$, we obtain exactly the Dual Linear Program (3.15).

Now we will show that the constrained problem (3.16) is an example of the Primal Convex Program (B.17). Apart from the requirement that there is at least one feasible total occupation measure $\hat\eta$, we need all the integrals $\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$, $j=0,1,\ldots,J$, to be finite. Thus, we assume that Condition 3.2.2 is satisfied
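and $\inf_{a\in\mathbf{A}} c_j(x,a)\ge -M'w'(x)$ for some constant $M'$, for all $j=0,1,\ldots,J$ and for all $x\in\mathbf{X}$. Suppose also that, for the feasible measure $\hat\eta$, $\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\hat\eta(dx\times da)<\infty$ and introduce the subset

\[
\tilde{\mathcal{D}}^t := \left\{\eta\in\mathcal{D}^t:\ \int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)\in\mathbb{R},\ j=0,1,\ldots,J\right\}.
\]

Now problem (3.16) remains the same if we replace $\mathcal{D}^t$ with $\tilde{\mathcal{D}}^t$. (Remember that Theorem 3.2.1(a), applied to the function $w'$, gives $\int_{\mathbf{X}} w'(x)\,\eta(dx\times\mathbf{A})<\infty$, so that all the integrals $\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$, $j=0,1,\ldots,J$, are bounded from below for each $\eta\in\mathcal{D}^t$.) Now one can rewrite problem (3.16) as the following Primal Convex Program

\[
\begin{aligned}
&\text{Minimize over } \eta\in\tilde{\mathcal{D}}^t:\quad \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta(dx\times da)\\
&\text{subject to}\quad \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)\le d_j, \qquad \forall\,j=1,2,\ldots,J,
\end{aligned} \tag{3.40}
\]

or, equivalently,

\[
\text{Minimize over } \eta\in\tilde{\mathcal{D}}^t:\quad \sup_{\bar g\in(\mathbb{R}^0_+)^J} L(\eta,\bar g)
\]

(cf. (B.17)). Here the Lagrangian $L$ has the form

\[
L(\eta,\bar g) = \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta(dx\times da) + \sum_{j=1}^{J} g_j\left[\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da) - d_j\right], \qquad \forall\,\eta\in\tilde{\mathcal{D}}^t,\ \bar g\in(\mathbb{R}^0_+)^J.
\]

The Dual Convex Program (B.18) reads

\[
\text{Maximize over } \bar g\in(\mathbb{R}^0_+)^J:\quad \inf_{\eta\in\tilde{\mathcal{D}}^t} L(\eta,\bar g). \tag{3.41}
\]

We formulate the following Slater’s condition for problem (1.16).

Condition 3.2.3 All the inequalities in (1.16) are strict for some $S\in\mathcal{S}$ and $W_0^\alpha(S,\gamma)<\infty$.

Theorem 3.2.5 Suppose all the conditions of Theorem 3.2.3 are satisfied, together with Condition 3.2.3. Then the following assertions hold.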
(a) Convex Programs (3.40) and (3.41) are solvable, with the common finite value: $-\infty<\inf(P^c)=\sup(D^c)<\infty$.
(b) Let $\bar g^*\in(\mathbb{R}^0_+)^J$ be an optimal solution to the Dual Convex Program (3.41). Then a point $\eta^*\in\tilde{\mathcal{D}}^t$ is optimal for the Primal Convex Program (3.40) if and only if one of the following two equivalent statements holds:
(i) The pair $(\eta^*,\bar g^*)$ is a saddle point of the Lagrangian:

\[
L(\eta^*,\bar g)\le L(\eta^*,\bar g^*)\le L(\eta,\bar g^*), \qquad \forall\,\eta\in\tilde{\mathcal{D}}^t,\ \bar g\in(\mathbb{R}^0_+)^J.
\]

(ii) $\eta^*$ is feasible for (3.40) and the following equalities are valid:

\[
\begin{aligned}
&\inf_{\eta\in\tilde{\mathcal{D}}^t} L(\eta,\bar g^*) = L(\eta^*,\bar g^*) = \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_0(x,a)\,\eta^*(dx\times da);\\
&\sum_{j=1}^{J} g_j^*\left[\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta^*(dx\times da) - d_j\right] = 0.
\end{aligned} \tag{3.42}
\]

(The last equality is known as the Complementary Slackness Condition.)
(c) If additionally inequalities (3.32) hold true, then the Primal Linear Program (3.37) is well posed and is solvable with the value $\inf(P)=\inf(P^c)$.
(d) If, additionally to (3.32), all the conditions of Theorems 3.1.2 and 3.1.3 are satisfied then the Dual Linear Program (3.38) is also solvable with the value $\sup(D)=\sup(D^c)$.

Proof (a) As was explained, the Primal Convex Program (3.40) equivalently represents the constrained optimal control problem (3.16), which, in turn, is equivalent to the constrained problems (1.16), (3.3), and (3.29)–(3.31). Therefore, it is solvable by Theorem 3.2.3 and $\inf(P^c)>-\infty$. The last inequality follows from the condition $c_0(x,a)\ge -M'w'(x)$ for each $x\in\mathbf{X}$ and $a\in\mathbf{A}$, and the inclusion $\mathcal{D}^t\subseteq\mathcal{P}_{w'}(\mathbf{X}\times\mathbf{A})$, as proved in Theorem 3.2.2(b). According to Proposition B.2.6, $\inf(P^c)=\sup(D^c)<\infty$ and there is a vector $\bar g^*\in(\mathbb{R}^0_+)^J$ solving the Dual Convex Program (3.41):

\[
\sup(D^c) = \inf_{\eta\in\tilde{\mathcal{D}}^t} L(\eta,\bar g^*). \tag{3.43}
\]

(b) This part of the theorem follows directly from Proposition B.2.6.
(c) The Primal Linear Program (3.37) was introduced above. Since it also equivalently represents the constrained optimal control problem (3.16), $\inf(P)=\inf(P^c)>-\infty$. Let $\eta^*$ be an optimal solution to the problem (3.16), or, equivalently, to (3.29)–(3.31) and to (3.40). Clearly, the point $X^*:=(\eta^*,\beta_1^*,\beta_2^*,\ldots,\beta_J^*)$, where

\[
\beta_j^* := d_j - \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta^*(dx\times da), \qquad j=1,2,\ldots,J,
\]

solves the Primal Linear Program (3.37).

(d) According to (3.32), all the integrals $\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$, $j=0,1,2,\ldots,J$, are finite for each $\eta\in\mathcal{D}^t$ due to Theorem 3.2.1(a). Therefore, the spaces $\tilde{\mathcal{D}}^t=\mathcal{D}^t$ coincide in the Convex Programs (3.40) and (3.41). For each $\bar g=(g_1,g_2,\ldots,g_J)\in(\mathbb{R}^0_+)^J$, the problem $L(\eta,\bar g)\to\inf_{\eta\in\mathcal{D}^t}$ is just the unconstrained problem (3.1) with the cost rate

\[
\bar c(x,a) := c_0(x,a) + \sum_{j=1}^{J} g_j\big(c_j(x,a)-\alpha d_j\big).
\]

Theorems 3.1.2 and 3.1.3 are applicable to the cost rate $\bar c$. Hence, there exists an optimal solution $v^*$ to the following program:

\[
\begin{aligned}
&\text{Maximize over } v\in B_{w'}(\mathbf{X}):\quad \int_{\mathbf{X}} v(y)\gamma(dy)\\
&\text{subject to}\quad \bar c(x,a) + \int_{\mathbf{X}} v(y)\,q(dy|x,a) - \alpha v(x)\ge 0, \qquad \forall\,(x,a)\in\mathbf{X}\times\mathbf{A}.
\end{aligned} \tag{3.44}
\]

The value of the latter program equals $\inf_{\eta\in\mathcal{D}^t} L(\eta,\bar g)$ and is finite for each $\bar g\in(\mathbb{R}^0_+)^J$ because

\[
\int_{\mathbf{X}} w(x)\gamma(dx)<\infty, \qquad w'(x)\le L'w(x)\ \ \forall\,x\in\mathbf{X}, \qquad v\in B_{w'}(\mathbf{X}).
\]

Therefore, the Dual Convex Program (3.41) is equivalent to (3.39), so that the latter has an optimal solution $(v^*,g_1^*,g_2^*,\ldots,g_J^*)$, where $v^*$ solves problem (3.44) under the above introduced vector $\bar g^*\in(\mathbb{R}^0_+)^J$: see (3.43). The value of program (3.39) coincides with $\sup(D^c)$. It remains to recall that the value of program (3.39) equals $\sup(D)$, and the point
\[
\left(v^* + \sum_{j=1}^{J} g_j^* d_j,\ -g_1^*,\ -g_2^*,\ \ldots,\ -g_J^*\right)
\]

solves the Dual Linear Program (3.38) as soon as $(v^*,g_1^*,g_2^*,\ldots,g_J^*)$ solves problem (3.39).

If all the conditions of Theorem 3.2.3 are satisfied and $|c_j(x,a)|\le Mw(x)$ for all $(x,a)\in\mathbf{X}\times\mathbf{A}$, $j=0,1,\ldots,J$, for some constant $M$, then all the integrals $\int_{\mathbf{X}\times\mathbf{A}} c_j(x,a)\,\eta(dx\times da)$, $j=0,1,\ldots,J$, are finite for all $\eta\in\mathcal{D}^t$ because $\int_{\mathbf{X}\times\mathbf{A}} w(x)\,\eta(dx\times da)<\infty$ by Theorem 3.2.1(a). Therefore, $\mathcal{D}^t=\tilde{\mathcal{D}}^t$. Now one can solve the Dual Convex Program (3.41) using the following observations. The calculation of $\inf_{\eta\in\mathcal{D}^t} L(\eta,\bar g)$ for $\bar g\in(\mathbb{R}^0_+)^J$ can be performed using Theorems 3.1.1 and 3.1.2. After that, the maximization over $\bar g\in(\mathbb{R}^0_+)^J$ can be done by any standard finite-dimensional method. Recall that the function $\bar g\in(\mathbb{R}^0_+)^J\to\inf_{\eta\in\mathcal{D}^t} L(\eta,\bar g)$ is concave: see Appendix B.2.3.

We underline that sufficient conditions for the solvability of problems (1.16) and (3.3) are given in Theorem 3.2.3. Additional conditions, which appear in Theorem 3.2.5, are needed for the accurate formulations of the Primal/Dual Linear/Convex Programs and for the investigation of the duality concepts. Until the end of the current subsection, we assume that all the conditions of Theorem 3.2.5 are satisfied.
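The two-stage procedure for the Dual Convex Program (an inner unconstrained problem for fixed multipliers, then an outer concave maximization) can be sketched on a finite model with one constraint. Everything specific below is an assumption made for illustration: the two-state, two-action rates and cost rates, the helper names, the use of value iteration on the uniformized model for the inner problem (instead of invoking Theorems 3.1.1 and 3.1.2 directly), the ternary search for the outer problem, and the search interval $[0,10]$ for the multiplier.

```python
def optimal_value(q, cost, alpha, tol=1e-12, max_iter=100_000):
    """Solve alpha*u(x) = min_a [cost(x,a) + sum_y q(y|x,a) u(y)] by value iteration
    on the uniformized model (a contraction with modulus L/(alpha+L))."""
    n, m = len(q), len(q[0])
    L = max(-q[x][a][x] for x in range(n) for a in range(m))
    u = [0.0] * n
    for _ in range(max_iter):
        new = [min((cost[x][a] + sum(q[x][a][y] * u[y] for y in range(n)) + L * u[x])
                   / (alpha + L) for a in range(m))
               for x in range(n)]
        if max(abs(s - t) for s, t in zip(new, u)) < tol:
            return new
        u = new
    return u


def dual_function(g, q, c0, c1, d1, gamma, alpha):
    """inf over occupation measures of L(eta, g), via the cost rate c0 + g*(c1 - alpha*d1)."""
    n, m = len(q), len(q[0])
    cbar = [[c0[x][a] + g * (c1[x][a] - alpha * d1) for a in range(m)] for x in range(n)]
    u = optimal_value(q, cbar, alpha)
    return sum(gamma[x] * u[x] for x in range(n))


def maximize_concave(phi, lo, hi, iters=60):
    """Ternary search for a maximizer of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) < phi(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2.0


# Hypothetical toy model: state 1 is "broken"; action 1 is a fast but costly repair.
q = [[[-1.0, 1.0], [-1.0, 1.0]],
     [[1.0, -1.0], [5.0, -5.0]]]
c0 = [[0.0, 0.0], [1.0, 1.0]]
c1 = [[0.0, 0.0], [0.0, 3.0]]
gamma, alpha, d1 = [1.0, 0.0], 1.0, 0.2

phi = lambda g: dual_function(g, q, c0, c1, d1, gamma, alpha)
g_star = maximize_concave(phi, 0.0, 10.0)
dual_value = phi(g_star)
```

For this toy model a hand calculation gives the multiplier $g_1^*=4/9$ and the common primal and dual value $11/45$, attained in the primal problem by mixing the "always slow" and "always fast" occupation measures; the numerical output should agree with these values up to the tolerances used.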
3.2.4.1 Discussion of the Lagrange Multipliers Method
The optimal values of the Lagrange multipliers $\bar g^*$ help to understand what happens if one slightly changes the constraints $d_j$, $j=1,2,\ldots,J$. Namely, suppose $\hat d_j=d_j+\varepsilon$ for some fixed $j\in\{1,2,\ldots,J\}$. Then, for small enough $|\varepsilon|$, all the linear and convex programs are still solvable. We denote by $\inf(P)=\inf(P^c)$ and $\inf(\hat P)=\inf(\hat P^c)$ the original and new values. Proposition B.2.7 says that $\inf(P^c)-\inf(\hat P^c)\le\varepsilon g_j^*$. If $\varepsilon>0$, one can expect that $\inf(\hat P^c)<\inf(P^c)$, but the decrease in the optimal value does not exceed $\varepsilon g_j^*$. If $\varepsilon<0$, one can expect that $\inf(\hat P^c)>\inf(P^c)$, and the increase is not smaller than $|\varepsilon|g_j^*$.

The optimal values of the Lagrange multipliers $\bar g^*$ can also help to find the solution to the Primal Convex (and also Linear) Program, especially if $J=1$ and the cost rate $c_1$ is continuous and satisfies the requirement $|c_1(x,a)|\le M'w'(x)$ for all $x\in\mathbf{X}$. Using Theorems 3.1.1 and 3.1.2, one can obtain a function $u^*\in B_{w'}(\mathbf{X})$ such that $L(\eta_\gamma^{\pi^*,\alpha},g_1^*)=\inf_{\eta\in\mathcal{D}^t} L(\eta,g_1^*)$ for a π-strategy $\pi^*$ if and only if
\[
-\alpha u^*(X(v)) + \bar c(X(v),a) + \int_{\mathbf{X}} q(dy|X(v),a)\,u^*(y) = 0
\]

for $\Pi^*(da|\omega,v)\times dv\times P_\gamma^{\pi^*}(d\omega)$-almost all $(a,v,\omega)\in\mathbf{A}\times\mathbb{R}_+\times\Omega$, where

\[
\bar c(x,a) := c_0(x,a) + g_1^* c_1(x,a), \qquad \forall\,x\in\mathbf{X},\ a\in\mathbf{A},
\]

is a lower semicontinuous function satisfying all the required conditions. Namely, $u^*$ is the unique lower semicontinuous solution of Eq. (3.4) in the class $B_{w'}(\mathbf{X})$. We introduce the measurable set

\[
\mathbf{K} := \left\{(x,a)\in\mathbf{X}\times\mathbf{A}:\ -\alpha u^*(x) + \bar c(x,a) + \int_{\mathbf{X}} q(dy|x,a)\,u^*(y) = 0\right\},
\]
which is further assumed to be a closed subset of $\mathbf{X}\times\mathbf{A}$, and the subsets $\mathbf{A}(x):=\{a:\ (x,a)\in\mathbf{K}\}$. For each $x\in\mathbf{X}$, $\mathbf{A}(x)\ne\emptyset$ because, according to Theorem 3.1.2(b), $\varphi^*(x)\in\mathbf{A}(x)$.

Now $L(\eta_\gamma^{\pi^*,\alpha},g_1^*)=\inf_{\eta\in\mathcal{D}^t} L(\eta,g_1^*)$ for a π-strategy $\pi^*$ if and only if $\pi^*\in\hat{\mathcal{S}}_\pi$, where $\hat{\mathcal{S}}_\pi$ is the class of relaxed π-strategies satisfying the constraint

\[
\Pi^*(\mathbf{A}(X(v))|\omega,v) = 1
\]

for $P_\gamma^{\pi^*}(d\omega)\times dv$-almost all $(\omega,v)\in\Omega\times\mathbb{R}_+$. We are in the framework of Theorem 3.1.4, where the cost rate $c_1$ plays the role of $c_0$. That theorem provides a method for solving unconstrained problems in the class $\hat{\mathcal{S}}_\pi$. According to Theorem 3.2.5, the Primal Convex Program (3.40) is solvable and each of its solutions is a total occupation measure of a strategy from $\hat{\mathcal{S}}_\pi$. It remains to find a feasible occupation measure $\eta^*$ generated by a strategy from $\hat{\mathcal{S}}_\pi$ satisfying the condition

\[
g_1^*\left[\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_1(x,a)\,\eta^*(dx\times da) - d_1\right] = 0.
\]

If $g_1^*=0$, then, based on Theorem 3.1.4, one can solve the problem

\[
\text{Minimize over } \pi\in\hat{\mathcal{S}}_\pi:\quad W_1^\alpha(\pi,\gamma).
\]

The resulting π-strategy $\pi^*$ must be feasible and will give a solution to the constrained α-discounted CTMDP problem (3.3). Its total occupation measure $\eta_\gamma^{\pi^*,\alpha}$ is a solution to the Primal Convex Program (3.40), and the pair $(\eta_\gamma^{\pi^*,\alpha},\beta_1^*)$ with

\[
\beta_1^* := d_1 - \frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}} c_1(x,a)\,\eta_\gamma^{\pi^*,\alpha}(dx\times da)
\]

solves the Primal Linear Program (3.37). If $g_1^*>0$, then one has to solve two unconstrained problems

\[
\text{Minimize over } \pi\in\hat{\mathcal{S}}_\pi:\quad W_1^\alpha(\pi,\gamma)
\]
3.2 The Constrained Problem
183
Fig. 3.1 Solving the optimal control problem with one constraint. The shaded area corresponds to the set of performance vectors
and Maximize over π ∈ Sˆ π : W1α (π, γ). Both problems meet all the requirements of Theorem 3.1.4 because of the imposed conditions on the cost rate c1 . If the solutions are given by the π-strategies π and π, then one can be sure that 1 c1 (x, a)ηγπ,α (d x × da) ≤ d1 (3.45) W1α (π, γ) = α X×A 1 ≤ c1 (x, a)ηγπ,α (d x × da) = W1α (π, γ). α X×A π,α
After that, the appropriate convex combination η ∗ of the measures ηγ and ηγπ,α satisfies all the requirements of Item (b)-(ii) of Theorem 3.2.5 and hence is a solution to the Primal Convex Program (3.40). The solution to the Primal Linear Program (3.37) can be presented similarly to the above. The stationary relaxed strategy π ∗ can be obtained by disintegrating the measure η ∗ , according to Theorem 3.2.1(b). Figure 3.1 illustrates the presented reasoning.
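The choice of the "appropriate convex combination" can be illustrated numerically. The following is only a minimal sketch under simplifying assumptions that are not part of the text: the two occupation measures are finitely supported and stored as dictionaries, and the helper names `cost`, `mixture_weight` and `mix`, as well as the data, are hypothetical. The weight is chosen so that the mixed measure meets the constraint with equality, as required when $g_1^*>0$.

```python
def cost(eta, c, alpha):
    """W(eta) = (1/alpha) * sum of c(x, a) * eta(x, a) over the support, cf. (3.46)."""
    return sum(c[key] * mass for key, mass in eta.items()) / alpha


def mixture_weight(w_low, w_high, d1):
    """Weight beta in [0, 1] such that beta * w_low + (1 - beta) * w_high equals d1."""
    if not (w_low <= d1 <= w_high):
        raise ValueError("d1 must lie between the two unconstrained values, cf. (3.45)")
    if w_high == w_low:
        return 1.0  # both measures give the same value, so any weight works
    return (w_high - d1) / (w_high - w_low)


def mix(eta_low, eta_high, beta):
    """Convex combination beta * eta_low + (1 - beta) * eta_high of two finite measures."""
    keys = set(eta_low) | set(eta_high)
    return {k: beta * eta_low.get(k, 0.0) + (1.0 - beta) * eta_high.get(k, 0.0)
            for k in keys}


# Hypothetical data: a single state "s", two actions, discount factor alpha = 1.
alpha = 1.0
c1 = {("s", "lo"): 1.0, ("s", "hi"): 4.0}
eta_low = {("s", "lo"): 1.0}    # stands in for the minimiser of W_1
eta_high = {("s", "hi"): 1.0}   # stands in for the maximiser of W_1
d1 = 2.0

beta = mixture_weight(cost(eta_low, c1, alpha), cost(eta_high, c1, alpha), d1)
eta_star = mix(eta_low, eta_high, beta)
```

In this toy setting the mixed measure stays a probability measure and its $c_1$-cost equals $d_1$; disintegrating `eta_star` over the state would then give the randomized action distribution in state "s".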
3.2.5 The Space of Performance Vectors

In this subsection, we base our analysis on the study of the space of performance vectors (objectives)
$$\mathcal O:=\{\vec W(\eta)=(W_0(\eta),W_1(\eta),\ldots,W_J(\eta)):\ \eta\in\mathcal D^t\},$$
where
$$W_j(\eta):=\frac1\alpha\int_{X\times A}c_j(x,a)\eta(dx\times da),\quad j=0,1,2,\ldots,J.\qquad(3.46)$$
The constrained problem (1.16) (or (3.3), or problem (3.16), or (3.29)–(3.31), etc.) can be reformulated in terms of the space $\mathcal O$:
$$\text{Minimize } W_0 \text{ subject to } \vec W=(W_0,W_1,\ldots,W_J)\in\mathcal O \text{ and } W_j\le d_j,\ \forall j=1,2,\ldots,J.\qquad(3.47)$$
We underline that the components of the vectors $\vec W\in\mathcal O$ can take infinite values $W_j=\pm\infty$.

Definition 3.2.2 The constrained problem (1.16) is called non-degenerate if there exists at least one feasible strategy and $-\infty<\inf\{W_0^\alpha(S,\gamma):\ S\text{ is feasible}\}<\infty$. Similarly, problems (3.3), (3.16), and (3.29)–(3.31), etc. are called non-degenerate if they admit a feasible solution with a finite value.

Under the conditions of Theorem 3.2.3, $\inf\{W_0^\alpha(S,\gamma):\ S\text{ is feasible}\}>-\infty$ because $c_0\ge-Mw'$. (See Remark 3.2.3.) To guarantee that the constrained problem is non-degenerate, it is sufficient to require additionally that there is a feasible strategy $S$ such that $W_0^\alpha(S,\gamma)<\infty$, or to require that $c_0\le Mw'$ (again, see Remark 3.2.3). For example, the constrained problem is non-degenerate if the conditions of Theorem 3.2.5 are fulfilled.

Condition 3.2.4 (a) The constrained problem is non-degenerate. (b) There exists a metrizable topology in the space $\mathcal D^t$ of total occupation measures, with respect to which $\mathcal D^t$ is a convex compact set, and all the functionals $W_j(\eta)$ ($j=0,1,\ldots,J$) on $\mathcal D^t$, defined by (3.46), are bounded from below and lower semicontinuous.

Remark 3.2.4 In Chap. 6, we will consider more general strategies, including the mixtures of the strategies introduced in Definition 1.1.2. It will become clear there that the space of all total occupation measures $\mathcal D^t$, defined in (3.17), is convex without any conditions (see Lemma 6.4.1(c)), and hence the set $\mathcal O\cap\mathbb R^{J+1}$ is convex. Under Condition 3.2.4(b), the space of performance vectors $\mathcal O$ is also convex.
If Condition 3.2.4 is satisfied, one can make the following observations. The case $W_j(\eta)=-\infty$ is excluded. The solution $\eta^*$ to problem (3.16) exists and all the costs $W_j(\eta^*)$, $j=0,1,2,\ldots,J$, are finite. Indeed, in this case the subset
$$\mathcal D^t_{feasible}:=\{\eta\in\mathcal D^t:\ W_j(\eta)\le d_j,\ j=1,2,\ldots,J\}$$
is non-empty and compact, so that the minimizer exists due to Proposition B.1.39 or to Proposition B.1.40. In fact, this was the idea of the proof of Theorem 3.2.3. All the costs $W_j(\eta^*)$ are finite because the constrained problem is non-degenerate and the functionals $W_j(\eta)$, $j=1,2,\ldots,J$, are bounded from below. Thus, one can replace the space $\mathcal O$ in (3.47) with $\mathcal O\cap\mathbb R^{J+1}$.

Condition 3.2.4(b) is satisfied, e.g., if all the conditions of Theorem 3.2.3 are fulfilled. The required topology is the $w'$-weak topology. Condition 3.2.4 holds under the assumptions of Theorem 3.2.5.

Lemma 3.2.1 Suppose Condition 3.2.4 is satisfied. Then there exists an optimal solution, say $\vec W^*$, to problem (3.47) such that
$$\vec W^*=(W_0^*,W_1^*,\ldots,W_J^*)\in\mathrm{Par}(\mathcal O\cap\mathbb R^{J+1}).$$

Proof Since the constrained problem is non-degenerate and is solvable,
$$-\infty<d_0:=\min\{W_0:\ \vec W\in\mathcal O,\ W_j\le d_j,\ j=1,2,\ldots,J\}<\infty.$$
Now, the new constrained problem in the form of (3.16)
$$\text{Minimize over }\eta\in\mathcal D^t:\ \sum_{j=0}^J W_j(\eta)\quad\text{subject to } W_j(\eta)\le d_j,\ j=0,1,2,\ldots,J$$
satisfies Condition 3.2.4 and hence has a solution $\eta^*$ providing the solution $\vec W^*=\vec W(\eta^*)\in\mathbb R^{J+1}$ to the problem
$$\text{Minimize }\sum_{j=0}^J W_j\quad\text{subject to }\vec W=(W_0,W_1,\ldots,W_J)\in\mathcal O\cap\mathbb R^{J+1},\ W_j\le d_j,\ \forall j=0,1,2,\ldots,J.\qquad(3.48)$$
The performance vector $\vec W^*$ provides a solution to the original problem (3.47) because it satisfies all the required constraints and $W_0^*=d_0$: this component cannot be smaller than $d_0$. Finally, if $\vec W\in\mathcal O\cap\mathbb R^{J+1}$ is such that $\vec W\le\vec W^*$ (component-wise), then $\vec W=\vec W^*$ because otherwise $\vec W^*$ would not have been a solution to problem (3.48). Therefore, $\vec W^*\in\mathrm{Par}(\mathcal O\cap\mathbb R^{J+1})$.
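The two-stage argument of this proof (first fix the optimal value $d_0$, then minimize the sum of all components under the additional constraint $W_0\le d_0$) can be mimicked on a finite set of vectors. The sketch below is only an illustration under the assumption that a finite list of hypothetical performance vectors stands in for $\mathcal O\cap\mathbb R^{J+1}$; the function names are not from the text.

```python
def pareto_optimal_solution(vectors, d):
    """Two-stage selection following the proof of Lemma 3.2.1.

    vectors: list of tuples (W_0, W_1, ..., W_J); d: tuple (d_1, ..., d_J).
    Stage 1 finds d_0 = min W_0 over the feasible vectors.
    Stage 2 minimises the sum of components subject to W_j <= d_j, j = 0..J.
    """
    feasible = [w for w in vectors if all(wj <= dj for wj, dj in zip(w[1:], d))]
    if not feasible:
        raise ValueError("no feasible performance vector")
    d0 = min(w[0] for w in feasible)
    candidates = [w for w in feasible if w[0] <= d0]
    return min(candidates, key=sum)


def is_pareto(w, vectors):
    """True if no other vector is component-wise <= w."""
    return not any(v != w and all(vj <= wj for vj, wj in zip(v, w)) for v in vectors)


# Hypothetical performance vectors (W_0, W_1) and constraint W_1 <= 3.
vectors = [(1.0, 3.0), (1.0, 2.0), (0.5, 5.0), (2.0, 1.0)]
w_star = pareto_optimal_solution(vectors, d=(3.0,))
```

Here both `(1.0, 3.0)` and `(1.0, 2.0)` solve the original constrained problem, but only the second one is a Pareto point, and it is the one selected by the second stage.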
Below, $\langle\vec W,b\rangle$ is the standard scalar product in $\mathbb R^{J+1}$, $0\times\infty:=0$ and $a+\infty:=\infty$ for all $a\in\mathbb R\cup\{\infty\}$. All the inequalities for vectors are component-wise. The notion of a supporting hyperplane is in accordance with Definition B.2.3. Remember, the vectors $\vec W$, which appear below, cannot have components $W_j=-\infty$ under the conditions imposed therein.

Lemma 3.2.2 Suppose Condition 3.2.4 is satisfied, and $\tilde{\mathcal D}\subseteq\mathcal D^t$ is a convex compact subset. Let
$$\tilde{\mathcal O}:=\{\vec W(\eta):\ \eta\in\tilde{\mathcal D}\}\subseteq(\mathbb R\cup\{\infty\})^{J+1}$$
be a convex subset of the space of performance vectors $\mathcal O$. Suppose $\beta\in\mathbb R$ and a vector $b>0$ in $\mathbb R^{J+1}$ are such that, for some $\vec u\in\tilde{\mathcal O}$, $\langle\vec u,b\rangle=\beta$ and $\langle\vec W,b\rangle\ge\beta$ for all $\vec W\in\tilde{\mathcal O}\cap\mathbb R^{J+1}$. Then the exposed face
$$\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle=\beta\}=\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle\le\beta\}$$
is closed and bounded in $\mathbb R^{J+1}$, hence, it is a convex compact set.

Proof The scalar product $\langle\vec W,b\rangle\in\mathbb R\cup\{\infty\}$ is well defined for all $\vec W\in\tilde{\mathcal O}$ because all the functions $W_j$, $j=0,1,2,\ldots,J$, defined in (3.46) are bounded from below and $b>0$. Firstly, we emphasize that the set $\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle<\beta\}$ is empty. Secondly, if $\langle\vec W,b\rangle=\beta$ for $\vec W\in\tilde{\mathcal O}$ then $\vec W\in\mathbb R^{J+1}$: the case $W_j=\infty$ is excluded for each component of the vector $\vec W$. If $\tilde{\mathcal O}\subseteq\mathbb R^{J+1}$ then $H:=\{\vec W\in\mathbb R^{J+1}:\ \langle\vec W,b\rangle=\beta\}$ is just a supporting hyperplane to the set $\tilde{\mathcal O}$ at $\vec u\in\tilde{\mathcal O}$.

Let us show that the face $\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle=\beta\}$ is bounded. All the functionals $W_j(\eta)$, $j=0,1,2,\ldots,J$, are bounded from below by Condition 3.2.4(b): $W_j(\eta)\ge\underline W_j>-\infty$. Since $b_0,b_1,\ldots,b_J>0$, we can conclude the following: if $\vec W(\eta)\in\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle=\beta\}$ then, for each $j=0,1,2,\ldots,J$,
$$W_j(\eta)=\frac1{b_j}\Big(\beta-\sum_{i\ne j}b_iW_i(\eta)\Big)\le\frac1{b_j}\Big(\beta-\sum_{i\ne j}b_i\underline W_i\Big)<\infty.$$
Therefore the face $\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle=\beta\}$ is bounded.

Let us show that it is closed. Suppose $\vec W^n\to\vec W^\infty$ as $n\to\infty$ with $\vec W^n\in\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b\rangle=\beta\}$. Clearly, $\langle\vec W^\infty,b\rangle=\lim_{n\to\infty}\langle\vec W^n,b\rangle=\beta$, and $\vec W^\infty\in\mathbb R^{J+1}$. We will show that $\vec W^\infty\in\tilde{\mathcal O}$. Consider the convergent subsequence $\{\eta^{n_i}\}\subseteq\tilde{\mathcal D}$ of the sequence $\{\eta^n\}_{n=1}^\infty$ of preimages of $\vec W^n$: $\vec W^n=\vec W(\eta^n)$ ($n=1,2,\ldots$). Suppose $\eta^{n_i}\to\eta^\infty$. According to Condition 3.2.4(b),
$$\vec W(\eta^\infty)\le\liminf_{i\to\infty}\vec W(\eta^{n_i})=\lim_{i\to\infty}\vec W^{n_i}=\vec W^\infty;$$
hence $\vec W(\eta^\infty)\in\mathbb R^{J+1}$. If at least one component $W_j(\eta^\infty)$ is strictly smaller than $W_j^\infty$, then
$$\langle\vec W(\eta^\infty),b\rangle<\langle\vec W^\infty,b\rangle=\beta,$$
which is impossible because $\vec W(\eta^\infty)\in\tilde{\mathcal O}\cap\mathbb R^{J+1}$. Thus $\vec W^\infty=\vec W(\eta^\infty)\in\tilde{\mathcal O}$ and therefore the exposed face under study is closed. Hence it is compact; the convexity is obvious.

Theorem 3.2.6 Suppose Condition 3.2.4 is satisfied. Then there exists a solution $\vec W^*$ to the constrained problem (3.47) such that $\vec W^*$ is a convex combination of no more than $J+1$ points, which are simultaneously extreme and Pareto in $\mathcal O\cap\mathbb R^{J+1}$.

Proof According to Lemma 3.2.1, we can assume that $\vec W^*\in\mathrm{Par}(\mathcal O\cap\mathbb R^{J+1})$, without loss of generality. Let $G(\vec W^*)$ be the minimal face of $\mathcal O\cap\mathbb R^{J+1}$ containing $\vec W^*$. Then, by Lemma B.2.1, $G(\vec W^*)\subseteq\mathrm{Par}(\mathcal O\cap\mathbb R^{J+1})$ and, for some $1\le k\le J+1$, there exist hyperplanes
$$H^i=\{\vec W\in\mathbb R^{J+1}:\ \langle\vec W,b^i\rangle=\beta^i\},\quad i=1,2,\ldots,k$$
such that
• $b^i\ge0$ for $i=1,2,\ldots,k-1$ and $b^k>0$;
• $H^1$ is supporting to $\mathcal O^0:=\mathcal O\cap\mathbb R^{J+1}$ at $\vec W^*$ and, for every $1\le i\le k-1$, $H^{i+1}$ is supporting to $\mathcal O^i:=\mathcal O^{i-1}\cap H^i$ at $\vec W^*$;
• $G(\vec W^*)=\mathcal O^k$.

Clearly, all the sets $\mathcal O^0,\mathcal O^1,\ldots,\mathcal O^k$ are convex. Observe that each extreme point of $\mathcal O^i$ ($i=0,1,2,\ldots,k$) is also an extreme point of $\mathcal O\cap\mathbb R^{J+1}$, because each set $\mathcal O^i$ is a face of $\mathcal O\cap\mathbb R^{J+1}$: see Appendix B.2.1. We intend to show that $\mathcal O^k$ is a convex compact set.

Along with the introduced subsets $\mathcal O^i\subset\mathbb R^{J+1}$, $i=0,1,2,\ldots,k$, we define the following subsets $\bar{\mathcal O}^i\subseteq(\mathbb R\cup\{\infty\})^{J+1}$:
• $\bar{\mathcal O}^0:=\mathcal O$;
• $\bar{\mathcal O}^i:=\bar{\mathcal O}^{i-1}\cap\{\vec W:\ \langle\vec W,b^i\rangle\le\beta^i\}$, $i=1,2,\ldots,k$.

Let us prove that, for all $i=0,1,2,\ldots,k$,
(i) $\bar{\mathcal O}^i\cap\mathbb R^{J+1}=\mathcal O^i$;
(ii) the set $\{\eta\in\mathcal D^t:\ \vec W(\eta)\in\bar{\mathcal O}^i\}$ is convex and compact.
These statements trivially hold at $i=0$. Suppose they hold for some $0\le i<k$. Then
$$\bar{\mathcal O}^{i+1}\cap\mathbb R^{J+1}=(\bar{\mathcal O}^i\cap\mathbb R^{J+1})\cap\{\vec W\in\mathbb R^{J+1}:\ \langle\vec W,b^{i+1}\rangle\le\beta^{i+1}\}=\mathcal O^i\cap\Big[H^{i+1}\cup\{\vec W\in\mathbb R^{J+1}:\ \langle\vec W,b^{i+1}\rangle<\beta^{i+1}\}\Big]=\mathcal O^{i+1}.$$
Here, we used the induction hypothesis and that
$$\mathcal O^i\cap\{\vec W\in\mathbb R^{J+1}:\ \langle\vec W,b^{i+1}\rangle<\beta^{i+1}\}=\emptyset$$
holds because $H^{i+1}$ is a supporting hyperplane to $\mathcal O^i$.

Finally, note that the set $\{\eta\in\mathcal D^t:\ \langle\vec W(\eta),b^{i+1}\rangle\le\beta^{i+1}\}$ is convex and compact according to Condition 3.2.4(b). Therefore, by the induction hypothesis, the set
$$\{\eta\in\mathcal D^t:\ \vec W(\eta)\in\bar{\mathcal O}^{i+1}\}=\{\eta\in\mathcal D^t:\ \vec W(\eta)\in\bar{\mathcal O}^i\}\cap\{\eta\in\mathcal D^t:\ \langle\vec W(\eta),b^{i+1}\rangle\le\beta^{i+1}\}$$
is convex and compact, too. Properties (i) and (ii) are proved.

Now the set
$$\tilde{\mathcal D}:=\{\eta\in\mathcal D^t:\ \vec W(\eta)\in\bar{\mathcal O}^{k-1}\}$$
is convex and compact, and we put
$$\tilde{\mathcal O}:=\{\vec W(\eta):\ \eta\in\tilde{\mathcal D}\}=\bar{\mathcal O}^{k-1}.$$
Further, $\vec W^*\in\mathcal O^{k-1}\subseteq\tilde{\mathcal O}$ is such that $\langle\vec W^*,b^k\rangle=\beta^k$ and $\langle\vec W,b^k\rangle\ge\beta^k$ for all $\vec W\in\tilde{\mathcal O}\cap\mathbb R^{J+1}=\bar{\mathcal O}^{k-1}\cap\mathbb R^{J+1}=\mathcal O^{k-1}$. We are in the framework of Lemma 3.2.2. Thus
$$\bar{\mathcal O}^k=\{\vec W\in\tilde{\mathcal O}:\ \langle\vec W,b^k\rangle\le\beta^k\}$$
is a convex compact set in $\mathbb R^{J+1}$, and so is the set
$$\mathcal O^k=\bar{\mathcal O}^k\cap\mathbb R^{J+1}=\bar{\mathcal O}^k.$$
Since $k\ge1$, the dimensionality of $\mathcal O^k=G(\vec W^*)$ is at most $J$. Therefore, according to Corollary B.2.1, $\vec W^*\in G(\vec W^*)$ can be expressed as a convex combination of $J+1$ extreme points of $G(\vec W^*)=\mathcal O^k$ (not necessarily distinct). As was said, all such points are also extreme in $\mathcal O\cap\mathbb R^{J+1}$; they are Pareto points, since $G(\vec W^*)\subseteq\mathrm{Par}(\mathcal O\cap\mathbb R^{J+1})$. The proof is complete.

Lemma 3.2.3 Suppose Condition 3.2.4 is satisfied, and $\vec W$ is an extreme and Pareto point of $\mathcal O\cap\mathbb R^{J+1}$. Then $\vec W=\vec W(\hat\eta)$ for some extreme point $\hat\eta$ in $\mathcal D^t$.

Proof Let
$$\mathcal D':=\{\eta\in\mathcal D^t:\ \vec W(\eta)=\vec W\}$$
be the full preimage of $\vec W$. If $\{\eta^n\}_{n=1}^\infty\subseteq\mathcal D'$ and $\eta^n\to\eta$ in $\mathcal D^t$, as $n\to\infty$, then $\vec W(\eta)\le\liminf_{n\to\infty}\vec W(\eta^n)=\vec W$ by Condition 3.2.4(b), and $\vec W(\eta)\in\mathcal O\cap\mathbb R^{J+1}$. Thus $\vec W(\eta)=\vec W$ because $\vec W\in\mathrm{Par}(\mathcal O\cap\mathbb R^{J+1})$. We proved that the set $\mathcal D'$ is closed, hence compact and obviously convex.

Let $\hat\eta$ be an extreme point in $\mathcal D'$, existing due to Lemma B.2.3. Then $\hat\eta$ is also extreme in $\mathcal D^t$ (cf. the reasoning after Definition B.2.3). Indeed, if $\hat\eta=\alpha\eta_1+(1-\alpha)\eta_2$ with $\eta_1\ne\eta_2\in\mathcal D^t$ and $\alpha\in(0,1)$, then
$$\vec W=\vec W(\hat\eta)=\alpha\vec W(\eta_1)+(1-\alpha)\vec W(\eta_2)$$
and $\vec W(\eta_1)\ne\vec W(\eta_2)$: otherwise, the point $\hat\eta$ is not extreme in $\mathcal D'$. Since $\vec W\in\mathbb R^{J+1}$, both vectors $\vec W(\eta_1)$ and $\vec W(\eta_2)$ must also have all finite components $W_j(\eta_1),W_j(\eta_2)<\infty$ for all $j=0,1,2,\ldots,J$, i.e., $\vec W(\eta_1),\vec W(\eta_2)\in\mathbb R^{J+1}$. Hence $\vec W(\eta_1)\ne\vec W(\eta_2)\in\mathcal O\cap\mathbb R^{J+1}$, which contradicts the assumption that $\vec W$ is extreme in $\mathcal O\cap\mathbb R^{J+1}$.

Definition 3.2.3 For the α-discounted CTMDP with the initial distribution $\gamma$, a π-strategy $S$ is called a TOM-mixture of π-strategies $S_1,S_2,\ldots,S_n$ if $\eta_\gamma^{S,\alpha}$ is a convex combination of the total occupation measures $\{\eta_\gamma^{S_i,\alpha}\}_{i=1}^n$.

A convex combination of any finite collection of measures in $\mathcal D^t$ still belongs to $\mathcal D^t$, because $\mathcal D^t$ is convex, which is evident under the conditions of Theorem 3.2.1, see also Remark 3.2.4.

Theorem 3.2.7 Suppose all the conditions of Theorem 3.2.3 are satisfied, along with the Slater condition, see Condition 3.2.3, i.e., both Condition 3.2.1 and Condition 3.2.2 are satisfied; the function $w$ is continuous; the space $A$ is compact; the ratio $\frac{w}{w'}$ is a strictly unbounded function on $X$; for each $j=0,1,\ldots,J$, the function $c_j$ is lower semicontinuous on $X\times A$, and satisfies inequality $c_j(x,a)\ge-Mw'(x)$ for all $(x,a)\in X\times A$, where $M\in\mathbb R_+^0$ is a constant. Then there exists a solution to the constrained problem (1.16) (or (3.3)) given by a TOM-mixture of no more than $J+1$ deterministic stationary strategies.

Proof According to the proof of Theorem 3.2.3, all the mappings $\eta\to\int_{X\times A}c_j(x,a)\eta(dx\times da)$ from $\mathcal D^t$ to $\bar{\mathbb R}$ are bounded from below ($j=0,1,\ldots,J$). By the Slater condition, the constrained problem (1.16) is non-degenerate. Further, the space $\mathcal D^t$ is compact in $(\mathcal P_{w'}(X\times A),\tau(\mathcal P_{w'}(X\times A)))$ and the functionals $W_j(\eta)$ on $\mathcal D^t$ ($j=0,1,\ldots,J$) are lower semicontinuous in the $w'$-weak topology. (Again, see the proof of Theorem 3.2.3.) Therefore, Condition 3.2.4 is satisfied.

According to Theorem 3.2.6 and Lemma 3.2.3, there exists a solution to problem (1.16) given by a convex combination $\sum_{k=1}^{J+1}\beta_k\eta_k$ of no more than $J+1$ extreme points $\eta_k$ in $\mathcal D^t$. We emphasise that the vectors $\vec W(\eta_k)$, $k=1,2,\ldots,J+1$, all have finite components, so that
$$\sum_{k=1}^{J+1}\beta_k\vec W(\eta_k)=\vec W\Big(\sum_{k=1}^{J+1}\beta_k\eta_k\Big),\quad\text{where }\beta_k\ge0\text{ and }\sum_{k=1}^{J+1}\beta_k=1.$$
It remains to show that, for each extreme point η in Dt , there is a deterministic stationary strategy ϕ such that η = ηγϕ,α .
Below, we show that the sets {η ∈ Dt : η is extreme in Dt } and
{ηγϕ,α : ϕ is a deterministic stationary strategy}
coincide. The proof is based on the reduction of the CTMDP to a DTMDP. According to Remark 1.3.2(a), D = η=α t
∞
mn ,
{m n }∞ n=1
∈ D Ra M = D S ,
n=1 M . so that one can restrict to Markov standard ξ-strategies M ∈ S Let us introduce the following DTMDP (X , A, pα ):
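– $X_\Delta=X\cup\{\Delta\}$, where $\Delta$ is the cemetery state;
– $A$ is the same action space;
– the transition probability is given by
$$p_\alpha(\Gamma|x,a)=\begin{cases}\dfrac{\tilde q(\Gamma\cap X|x,a)+\alpha I\{\Delta\in\Gamma\}}{q_x(a)+\alpha},&\text{if }x\in X,\ a\in A;\\[2mm] I\{\Delta\in\Gamma\},&\text{if }x=\Delta.\end{cases}$$

For a strategy $\sigma$ in the described DTMDP with the same initial distribution $\gamma$ as in the CTMDP model, the corresponding strategic measure is denoted by $P_\gamma^{\sigma,\alpha}$. The expectation taken with respect to $P_\gamma^{\sigma,\alpha}$ is written as $E_\gamma^{\sigma,\alpha}$. The necessary knowledge on DTMDP is presented in Appendix C. Note that the actions in state $\Delta$ do not affect the detailed occupation measures
$$\mathbf M_{\gamma,n}^{\sigma,\alpha}(\Gamma_X\times\Gamma_A):=P_\gamma^{\sigma,\alpha}(X_{n-1}\in\Gamma_X,\ A_n\in\Gamma_A),\quad\forall n\in\mathbb N,\ \Gamma_X\in\mathcal B(X),\ \Gamma_A\in\mathcal B(A).$$
Therefore, below, with some abuse of notation, we say that $\sigma^M$ is a Markov strategy in the DTMDP if the stochastic kernels $\sigma_n^M(da|x)$ are defined for all $x\in X$, $n\in\mathbb N$.

Consider a Markov standard ξ-strategy $\Pi^M=\{\Pi_n^M(da|x)\}_{n=1}^\infty$ in the CTMDP and the Markov strategy $\sigma^M=\{\sigma_n^M(da|x)=\Pi_n^M(da|x)\}_{n=1}^\infty$ in the DTMDP. The goal is to show that, for each $n\in\mathbb N$,
$$m_{\gamma,n}^{\Pi^M,\alpha}(dx\times da)=\frac1{q_x(a)+\alpha}\mathbf M_{\gamma,n}^{\sigma^M,\alpha}(dx\times da)\qquad(3.49)$$
on $\mathcal B(X\times A)$. The detailed occupation measure $m_{\gamma,n}^{\Pi^M,\alpha}$ was introduced in Definition 1.3.1. Consider the “hat” model with killing, described in Sect. 1.3.5. According to Theorem 1.3.3, the measures $m_{\gamma,n}^{\Pi^M,\alpha}$ and $\hat m_{\gamma,n}^{\Pi^M,0}$ coincide on $X\times A$ for all $n\in\mathbb N$. Note, the “hat” model is undiscounted, with $\hat q_x(a)=q_x(a)+\alpha>0$ for each $x\in X$ and $a\in A$. Therefore, according to Definition 1.3.1 for $S_n=\Pi_n^M$, it is sufficient to show by induction that, for each $n\ge1$,
$$\hat P_\gamma^{\Pi^M}(X_{n-1}\in\Gamma_X)=\hat E_\gamma^{\Pi^M}[I\{X_{n-1}\in\Gamma_X\}]=P_\gamma^{\sigma^M,\alpha}(X_{n-1}\in\Gamma_X),\quad\forall\Gamma_X\in\mathcal B(X).$$
This equality obviously holds for $n=1$ because the initial distribution $\gamma$ is the same in both the models. If it is valid for some $n\ge1$, then
$$\begin{aligned}\hat E_\gamma^{\Pi^M}[I\{X_n\in\Gamma_X\}]&=\hat E_\gamma^{\Pi^M}[\hat G_n(\mathbb R_+\times\Gamma_X|X_{n-1})]\\&=\hat E_\gamma^{\Pi^M}\left[\delta_{X_{n-1}}(X)\int_A\frac{\tilde q(\Gamma_X|X_{n-1},a)}{q_{X_{n-1}}(a)+\alpha}\Pi_n^M(da|X_{n-1})\right]\\&=E_\gamma^{\sigma^M,\alpha}\left[\delta_{X_{n-1}}(X)\int_A\frac{\tilde q(\Gamma_X|X_{n-1},a)}{q_{X_{n-1}}(a)+\alpha}\sigma_n^M(da|X_{n-1})\right]\\&=P_\gamma^{\sigma^M,\alpha}(X_n\in\Gamma_X).\end{aligned}$$
Formula (3.49) is proved. Now the relations
$$\begin{aligned}\eta_\gamma^{\Pi^M,\alpha}(\Gamma_X\times\Gamma_A)&=\alpha\sum_{n=1}^\infty m_{\gamma,n}^{\Pi^M,\alpha}(\Gamma_X\times\Gamma_A)=\int_{\Gamma_X\times\Gamma_A}\frac{\alpha}{q_x(a)+\alpha}\mathbf M_\gamma^{\sigma^M}(dx\times da);\\ \mathbf M_\gamma^{\sigma^M}(\Gamma_X\times\Gamma_A)&=\int_{\Gamma_X\times\Gamma_A}\frac{q_x(a)+\alpha}{\alpha}\eta_\gamma^{\Pi^M,\alpha}(dx\times da),\quad\Gamma_X\in\mathcal B(X),\ \Gamma_A\in\mathcal B(A),\end{aligned}\qquad(3.50)$$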
define the one-to-one correspondence between $\mathcal D^t$ and $\mathcal D=\{\mathbf M_\gamma^\sigma,\ \sigma\in\Sigma\}$: in DTMDP, for a general strategy $\sigma$, there exists a Markov strategy $\sigma^M$ with the same total occupation measure (see Proposition C.1.4). Different measures $\eta_\gamma^{\Pi^M,\alpha}$ correspond to different measures $\mathbf M_\gamma^{\sigma^M}$ and vice versa; if a measure $\eta_\gamma^{\Pi^M,\alpha}$ is a convex combination of measures from $\mathcal D^t$, then the corresponding measure $\mathbf M_\gamma^{\sigma^M}$ is the convex combination of the corresponding measures from $\mathcal D$ and vice versa. Therefore, expressions (3.50) give the one-to-one correspondence between the extreme points in $\mathcal D^t$ and $\mathcal D$.

According to the imposed conditions, for each $\eta\in\mathcal D^t$,
$$\int_{X\times A}q_x(a)\eta(dx\times da)\le\int_{X\times A}w'(x)\eta(dx\times da)<\infty$$
by (3.20), so $\mathbf M_\gamma^\sigma(X\times A)<\infty$ for all $\sigma\in\Sigma$, and the introduced DTMDP is absorbing (see Definition C.2.7). Now every extreme point in $\mathcal D$ is generated by a deterministic stationary strategy $\varphi$ in the DTMDP according to Proposition C.2.7. The same strategy $\varphi$ generates the corresponding extreme point in $\mathcal D^t$. The proof is complete.
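The correspondence (3.50) can be checked numerically in a toy setting. The sketch below is not part of the original argument: it assumes a finite state space and a fixed deterministic stationary strategy, so that the CTMDP reduces to a generator matrix `Q` (a hypothetical example), and it only compares the state marginals. The discounted occupation measure is computed as $\alpha\gamma(\alpha I-Q)^{-1}$, and the DTMDP occupation measure as $\sum_{n\ge0}\gamma P^n$ with the killed transition matrix $P(x,y)=\tilde q(y|x)/(q_x+\alpha)$.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def ctmdp_occupation(Q, gamma, alpha):
    """eta = alpha * gamma * (alpha I - Q)^{-1}, the normalised discounted occupation."""
    n = len(gamma)
    # Transposed system: (alpha I - Q)^T y = gamma.
    At = [[(alpha if i == j else 0.0) - Q[j][i] for j in range(n)] for i in range(n)]
    return [alpha * v for v in solve(At, gamma)]


def dtmdp_occupation(Q, gamma, alpha, n_steps=2000):
    """M = sum over n >= 0 of gamma P^n, where P(x, y) = q(y|x) / (q_x + alpha), y != x.
    The missing mass alpha / (q_x + alpha) of each row goes to the cemetery state."""
    n = len(gamma)
    P = [[Q[i][j] / (alpha - Q[i][i]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    dist, total = list(gamma), [0.0] * n
    for _ in range(n_steps):
        total = [total[j] + dist[j] for j in range(n)]
        dist = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
    return total


# Hypothetical three-state generator under a fixed deterministic stationary strategy.
Q = [[-2.0, 1.5, 0.5],
     [1.0, -1.0, 0.0],
     [0.0, 3.0, -3.0]]
gamma = [1.0, 0.0, 0.0]
alpha = 0.5

eta = ctmdp_occupation(Q, gamma, alpha)
M_dt = dtmdp_occupation(Q, gamma, alpha)
# First relation in (3.50): eta(x) = alpha / (q_x + alpha) * M(x), with q_x = -Q[x][x].
eta_from_dtmdp = [alpha / (alpha - Q[i][i]) * M_dt[i] for i in range(len(gamma))]
```

The two vectors `eta` and `eta_from_dtmdp` agree up to the truncation error of the series, which is what makes it possible to transfer statements about extreme points from the DTMDP to the CTMDP.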
3.3 Examples

3.3.1 A Queuing System

Consider a one-channel queuing system without any space for waiting: any job that finds the server busy is rejected. We characterize every job by its volume $x\in(0,1]$, so that the state space is $X=[0,1]$: $X(t)=0$ means the system is idle at time $t$; $X(t)=x\in(0,1]$ means the job of the corresponding volume is under service at time $t$. We put $A=[0,\bar A]$, where $\bar A\ge0$. The action $a\in A$ represents the service intensity. The jobs arrive according to a Poisson process with a fixed rate $\lambda>0$, and the volume is distributed according to the density $5x^4$, $x\in(0,1]$, with respect to the Lebesgue measure, independently of anything else. Therefore,
$$q(\Gamma|0,a)=5\lambda\int_{\Gamma\setminus\{0\}}y^4\,dy-\lambda I\{\Gamma\ni0\},\quad\forall\Gamma\in\mathcal B([0,1]).$$
For any fixed $x\in(0,1]$, $a\in A$, the service time of a job of volume $x$ is exponentially distributed with parameter $\frac ax$, so that
$$q(\Gamma|x,a)=\frac ax\delta_0(\Gamma)-\frac axI\{x\in\Gamma\},\quad\forall\Gamma\in\mathcal B([0,1]),\ x\in(0,1],\ a\in A.$$
We assume that when a served job leaves the system, it gives an income of one unit; the holding cost of a job of volume $x\in(0,1]$ equals $C_1x$ per time unit; and the service intensity $a\in A$ is associated with the cost rate $C_2a^2$. Here $C_1\ge0$ and $C_2>0$ are two constants. According to Sect. 1.1.5, we accept that
$$c_0(x,a)=C_1x+C_2a^2-\frac ax,\quad\forall x\in(0,1],\ a\in A,$$
and $c_0(0,a)=C_2a^2$, $\forall a\in A$. We emphasize that, as can be easily verified, $x\in X\to\bar q_x$ and $(x,a)\in X\times A\to c_0(x,a)$ are unbounded functions. Finally, let $\alpha$, the discount factor, be big enough:
$$\alpha>\frac23\lambda.$$
Proposition 3.3.1 (a) For the model described in this example, Conditions 2.4.3 and 3.1.2 are satisfied (for the same function $w'$).

(b) If $C_1\le2\alpha$ then the following recursion relations
$$\begin{aligned}z_0&=0;\\ u_n(x)&=-2\alpha C_2x^2-z_n+2\sqrt{\alpha^2C_2^2x^4+C_1C_2x^3+\alpha C_2x^2z_n},\quad\forall x\in(0,1];\\ z_{n+1}&=1-\frac{5\lambda}{\alpha+\lambda}\int_{(0,1]}u_n(y)y^4\,dy,\quad\forall n=0,1,2,\ldots\end{aligned}$$
converge: the sequence $\{z_n\}_{n=0}^\infty$ is increasing and has a finite limit $z^*=\lim_{n\to\infty}z_n$, and
$$\lim_{n\to\infty}u_n(x)=:u^*(x)=-2\alpha C_2x^2-z^*+2\sqrt{\alpha^2C_2^2x^4+C_1C_2x^3+\alpha C_2x^2z^*},\quad\forall x\in(0,1].$$

(c) Let us extend the function $u^*$ in part (b) to $[0,1]$ by putting $u^*(0):=1-z^*$. Suppose $C_1\le2\alpha$ and the constant $\bar A$ is big enough so that the function $u^*$ satisfies the following inequality:
$$\varphi^*(x):=\frac{u^*(x)+z^*}{2C_2x}=\frac{-\alpha C_2x+\sqrt{\alpha^2C_2^2x^2+C_1C_2x+\alpha C_2z^*}}{C_2}\le\bar A,\quad\forall x\in(0,1].\qquad(3.51)$$
Then the function $u^*=W_0^{*\alpha}$ solves the optimality Eq. (3.4), and the deterministic stationary strategy $\varphi^*$ with $\varphi^*(0):=0$ is uniformly optimal. It is optimal for the initial distribution $\gamma$ if $\int_{(0,1]}\frac{\gamma(dy)}{y^4}<\infty$.

Proof In the state space $X$, we fix the standard Euclidean topology in $(0,1]$ and treat $0$ as an isolated point. In the action space $A$, the topology is standard Euclidean.

(a) We define the functions $w$ and $w'$ by
$$w(x)=\begin{cases}1,&\text{if }x=0;\\ \frac1{x^4},&\text{if }x\in(0,1];\end{cases}\qquad w'(x)=\begin{cases}1,&\text{if }x=0;\\ \frac1{x^2},&\text{if }x\in(0,1].\end{cases}$$
Condition 2.4.2 is satisfied with
$$L=\max\{\lambda,\bar A\},\quad\rho=4\lambda,\quad b=0.$$
Indeed, for part (a), note that $\bar q_0=\lambda$ and $\bar q_x=\frac{\bar A}x$ for $x\in(0,1]$. Part (b) can be verified as follows:
– if $x=0$ then
$$\int_Xq(dy|x,a)w(y)=5\lambda\int_{(0,1]}\frac1{y^4}y^4\,dy-\lambda=4\lambda=\rho w(0);$$
– if $x\in(0,1]$ then
$$\int_Xq(dy|x,a)w(y)=\frac axw(0)-\frac axw(x)=\frac ax\left(1-\frac1{x^4}\right)\le0<\rho w(x).$$
Conditions 2.4.3(b,c) for
$$L'=\max\{\lambda,\bar A\}+1,\quad\rho'=\frac23\lambda,\quad b'=0$$
can be checked similarly to the reasoning presented above. Now $\alpha>\rho'$ and, for Condition 2.4.3(e), note that, for each $x\in(0,1]$,
$$\inf_{a\in A}c_0(x,a)=\begin{cases}C_1x-\dfrac1{4C_2x^2},&\text{if }\dfrac1{2C_2x}<\bar A;\\[2mm] C_1x+C_2\bar A^2-\dfrac{\bar A}x,&\text{otherwise}.\end{cases}$$
Hence
$$\Big|\inf_{a\in A}c_0(x,a)\Big|\le C_1+C_2\bar A^2+\frac1{x^2}\left(\frac1{4C_2}+\bar A\right)I\{x>0\},$$
so that one can take
$$M=C_1+C_2\bar A^2+\frac1{4C_2}+\bar A.$$
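The closed-form infimum and the bound by $Mw'(x)$ can be sanity-checked numerically. The following is only an illustrative sketch: the parameter values are hypothetical, and the brute-force minimisation over a grid of actions is used merely as a comparison, not as the method of the text.

```python
def inf_c0(x, C1, C2, A_bar):
    """Closed-form infimum over a in [0, A_bar] of C1*x + C2*a^2 - a/x, for x in (0, 1]."""
    a_free = 1.0 / (2.0 * C2 * x)          # unconstrained minimiser of C2*a^2 - a/x
    if a_free < A_bar:
        return C1 * x - 1.0 / (4.0 * C2 * x * x)
    return C1 * x + C2 * A_bar ** 2 - A_bar / x


def inf_c0_grid(x, C1, C2, A_bar, m=20000):
    """Brute-force minimum of the same expression over a uniform grid on [0, A_bar]."""
    return min(C1 * x + C2 * a * a - a / x
               for a in (A_bar * k / m for k in range(m + 1)))


# Hypothetical parameters.
C1, C2, A_bar = 1.0, 0.5, 2.0
M = C1 + C2 * A_bar ** 2 + 1.0 / (4.0 * C2) + A_bar   # constant M from the text
points = (0.2, 0.5, 0.8, 1.0)
closed = [inf_c0(x, C1, C2, A_bar) for x in points]
brute = [inf_c0_grid(x, C1, C2, A_bar) for x in points]
```

For $x=0.2$ the unconstrained minimiser exceeds $\bar A$, so the second branch applies, while for $x=0.8$ and $x=1$ the first branch is used; in all cases $|\inf_a c_0(x,a)|\le M/x^2=Mw'(x)$.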
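Condition 3.1.2(b) holds because, for a bounded measurable function $u$ on $X$,
– for the isolated point $x=0$,
$$\int_Xu(y)w'(y)\tilde q(dy|x,a)=5\lambda\int_{(0,1]}u(y)y^2\,dy$$
is a finite constant;
– for $x\in(0,1]$,
$$\int_Xu(y)w'(y)\tilde q(dy|x,a)=\frac axu(0)w'(0)$$
defines a continuous function on $(0,1]\times A$.
Finally, in the introduced topology, the cost rate $c_0$ is continuous, so that Condition 3.1.2 is satisfied.

(b) Let us define the function $f$ on $\mathbb R_+^0$ by
$$f(z)=1-\frac{5\lambda}{\alpha+\lambda}\int_{(0,1]}\left[-2\alpha C_2y^2-z+2\sqrt{\alpha^2C_2^2y^4+C_1C_2y^3+\alpha C_2y^2z}\right]y^4\,dy.$$
Then $z_{n+1}=f(z_n)$. Note that the function $f$ is differentiable on $\mathbb R_+^0$:
$$\begin{aligned}\frac{df}{dz}&=\frac{-5\lambda}{\alpha+\lambda}\int_{(0,1]}\left\{-1+\frac{\alpha C_2y^2}{\sqrt{\alpha^2C_2^2y^4+C_1C_2y^3+\alpha C_2y^2z}}\right\}y^4\,dy\\&=\frac{-5\lambda}{\alpha+\lambda}\int_{(0,1]}\left\{\frac{\alpha C_2y^2-\sqrt{\alpha^2C_2^2y^4+C_1C_2y^3+\alpha C_2y^2z}}{\sqrt{\alpha^2C_2^2y^4+C_1C_2y^3+\alpha C_2y^2z}}\right\}y^4\,dy,\end{aligned}$$
where
$$\frac{\alpha C_2x^2-\sqrt{\alpha^2C_2^2x^4+C_1C_2x^3+\alpha C_2x^2z}}{\sqrt{\alpha^2C_2^2x^4+C_1C_2x^3+\alpha C_2x^2z}}\in(-1,0),\quad\forall x\in(0,1].$$
Therefore,
$$0<\frac{df}{dz}<\frac{5\lambda}{\alpha+\lambda}\int_{(0,1]}y^4\,dy=\frac\lambda{\alpha+\lambda}<1.$$
Moreover, since $u_0(y)=-2\alpha C_2y^2+2\sqrt{\alpha^2C_2^2y^4+C_1C_2y^3}\le\frac{C_1y}\alpha$,
$$z_1=f(0)\ge1-\frac{5\lambda C_1}{\alpha(\alpha+\lambda)}\int_{(0,1]}y^5\,dy=1-\frac{5\lambda C_1}{6\alpha(\alpha+\lambda)}\ge0$$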
because $\alpha>\frac23\lambda$ and $C_1\le2\alpha$. Thus the sequence $\{z_n\}_{n=0}^\infty$ increases from zero, and the mapping $z\to f(z)$ is contracting on $[0,\infty)$. The proof of statement (b) is completed.

(c) Clearly, the function $u^*$ on $X$ is bounded; hence $u^*\in B_{w'}(X)$. By the way, it is lower semicontinuous in the introduced topology. Therefore, according to Theorem 3.1.2, it is sufficient to check that $u^*$ satisfies Eq. (3.9) and $\varphi^*$ provides the infimum. The expression inside the parenthesis of (3.9) equals
$$C_2a^2+\lambda\int_{(0,1]}u^*(y)5y^4\,dy-\lambda u^*(0)\quad\text{if }x=0,$$
and
$$C_1x+C_2a^2-\frac ax+\frac axu^*(0)-\frac axu^*(x)\quad\text{if }x\in(0,1].$$
Therefore, $\varphi^*(0)=0$;
$$u^*(0)=\frac{5\lambda}{\alpha+\lambda}\int_{(0,1]}u^*(y)y^4\,dy$$
and $\varphi^*(x)$ given by (3.51) provides the infimum in (3.9) for $x\in(0,1]$. (Note that $u^*(x)+z^*\ge-2\alpha C_2x^2+2\sqrt{\alpha^2C_2^2x^4}=0$, so that $\varphi^*(x)\ge0$.) Finally, at $x>0$, the right-hand side of (3.9) equals $C_1x-\frac{(u^*(x)+z^*)^2}{4x^2C_2}$, and the equation
$$4\alpha C_2x^2u^*(x)=4C_1C_2x^3-(u^*(x))^2-2u^*(x)z^*-(z^*)^2$$
holds because
$$u^*(x)=-2\alpha C_2x^2-z^*+2\sqrt{\alpha^2C_2^2x^4+C_1C_2x^3+\alpha C_2x^2z^*},\quad\forall x\in(0,1].$$
The very last assertion follows from Theorem 3.1.2(c).

It is interesting to look at the case of $C_1=0$. Under the imposed conditions,
$$\frac{d\varphi^*(x)}{dx}=\alpha\left(\frac{\alpha C_2x-\sqrt{\alpha^2C_2^2x^2+\alpha C_2z^*}}{\sqrt{\alpha^2C_2^2x^2+\alpha C_2z^*}}\right)<0,\quad\forall x\in(0,1]:$$
the service intensity decreases when the job volume increases. Of course, the income from completing big jobs becomes smaller, but that is compensated by the reduced service cost. In this case, one can estimate the sufficient level of the constant $\bar A$:
$$\bar A\ge\lim_{x\to0}\varphi^*(x)=\sqrt{\frac{\alpha z^*}{C_2}}.$$
Note that
$$u_n(x)>-2\alpha C_2x^2-z_n+2\sqrt{\alpha^2C_2^2x^4}=-z_n,\quad\forall n=1,2,\ldots,\ x\in(0,1].$$
(Recall, the sequence $\{z_n\}_{n=0}^\infty$ is increasing and $z_1>0$.) Therefore,
$$z_{n+1}<1+\frac\lambda{\alpha+\lambda}z_n$$
and hence
$$z^*\le1+\frac\lambda{\alpha+\lambda}z^*.$$
Consequently,
$$z^*\le1+\frac\lambda\alpha<\frac52$$
because $\alpha>\frac{2\lambda}3$. As a result, we see that the value of
$$\bar A=\sqrt{\frac{5\alpha}{2C_2}}$$
is large enough to satisfy all the requirements.

If $C_1$ is very big, then it can happen that the action $a^*=0$ becomes optimal for small values of $x$. Indeed, if $a>0$ then there can be transitions $x\to0\to y\to\ldots$ with a good chance to have a big value of $y$ leading to a big holding cost in the future. Thus, in this situation it can be reasonable to select $a^*=0$ and finish with the cost $\frac{C_1x}\alpha$, which is small if $x$ is small.
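The recursion of Proposition 3.3.1(b) is easy to run numerically. The sketch below is a hedged illustration rather than the authors' procedure: the parameter values are hypothetical, the integral over $(0,1]$ is approximated by a midpoint rule, and the function names are chosen only for this example.

```python
import math


def u_n(x, z, alpha, C1, C2):
    """u_n(x) from the recursion in Proposition 3.3.1(b), for x in (0, 1] and z = z_n."""
    s = alpha ** 2 * C2 ** 2 * x ** 4 + C1 * C2 * x ** 3 + alpha * C2 * x ** 2 * z
    return -2.0 * alpha * C2 * x ** 2 - z + 2.0 * math.sqrt(s)


def f(z, lam, alpha, C1, C2, m=2000):
    """f(z) = 1 - 5*lam/(alpha+lam) * integral of u(y) y^4 over (0, 1] (midpoint rule)."""
    h = 1.0 / m
    integral = h * math.fsum(
        u_n((k + 0.5) * h, z, alpha, C1, C2) * ((k + 0.5) * h) ** 4 for k in range(m)
    )
    return 1.0 - 5.0 * lam / (alpha + lam) * integral


def solve_z(lam, alpha, C1, C2, tol=1e-12, max_iter=10000):
    """Iterate z_{n+1} = f(z_n) from z_0 = 0; returns the last iterate and the history."""
    if not (alpha > 2.0 * lam / 3.0 and C1 <= 2.0 * alpha):
        raise ValueError("requires alpha > 2*lam/3 and C1 <= 2*alpha")
    z, history = 0.0, [0.0]
    for _ in range(max_iter):
        z_next = f(z, lam, alpha, C1, C2)
        history.append(z_next)
        if abs(z_next - z) < tol:
            break
        z = z_next
    return history[-1], history


# Hypothetical parameters satisfying alpha > 2*lam/3 and C1 <= 2*alpha.
lam, alpha, C1, C2 = 1.0, 1.0, 0.5, 1.0
z_star, history = solve_z(lam, alpha, C1, C2)


def phi_star(x):
    """Candidate optimal service intensity from (3.51), for x in (0, 1]."""
    return (u_n(x, z_star, alpha, C1, C2) + z_star) / (2.0 * C2 * x)
```

Since the mapping $f$ is a contraction with coefficient at most $\lambda/(\alpha+\lambda)$, the iterations converge geometrically; the resulting `z_star` can then be plugged into `phi_star` to check inequality (3.51) for a given $\bar A$.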
3.3.2 A Birth-and-Death Process

The state of the process $X(t)\in X:=\{0,1,2,\ldots\}$ represents the population size. The action space $A$ is a metric compact space. The birth and death rates $\lambda(i,a)$ and $\mu(i,a)$ are assumed to be continuous in $a$ and satisfy the conditions
• $\mu(0,a)\equiv0$;
• $\sup_{i\in X}\sup_{a\in A}\frac{\lambda(i,a)+\mu(i,a)}{i+1}=:L<\infty$.

All the cost rates $c_j$, $j=0,1,\ldots,J$, are assumed to be continuous in $a$ and satisfy the condition $|c_j(i,a)|\le M(i+1)$ for all $i\in X$, $a\in A$, for some constant $M\in\mathbb R_+$. Finally, suppose the initial distribution satisfies $\sum_{i\in X}\gamma(i)i^2<\infty$. The non-zero transition rates $q(j|i,a)$ for $j\ne i$ are given by
$$q(j|i,a):=\begin{cases}\lambda(i,a),&\text{if }j=i+1;\\ \mu(i,a),&\text{if }i\ne0\text{ and }j=i-1.\end{cases}$$
Let us show that all the statements of Chap. 3 hold true if the Slater condition 3.2.3 is satisfied and $\alpha>3L$. To do this, it is sufficient to check that all the conditions imposed therein are satisfied.

Below, we fix the discrete topology in the space $X$ and consider functions $w(i):=(i+1)^2$ and $w'(i):=i+1$, $i\in X$. The model under study is a special case of Example 2.3.2, where the probability distribution $H(\cdot|i,a)$ is concentrated on the set $\{-1,1\}$. According to Proposition 2.3.1, Condition 2.2.5 is satisfied for the function $w$ with $\rho=3L$ (here $m=2$) and for the function $w'$ with $\rho'=L$ (here $m=1$); $b=b'=0$. Therefore, Condition 2.4.2(b) and Condition 2.4.3(c) are satisfied. One can also check, e.g., Condition 2.4.2(b) directly as follows: for all $i\in X$, $a\in A$,
$$\begin{aligned}\sum_{j\in X}w(j)q(j|i,a)&=\lambda(i,a)(i+2)^2+\mu(i,a)i^2-[\lambda(i,a)+\mu(i,a)](i+1)^2\\&=I\{[\lambda(i,a)+\mu(i,a)]>0\}[\lambda(i,a)+\mu(i,a)]\\&\quad\times\left\{\frac{\lambda(i,a)}{\lambda(i,a)+\mu(i,a)}[2(i+1)+1]-\frac{\mu(i,a)}{\lambda(i,a)+\mu(i,a)}[2i+1]\right\}\\&\le L(i+1)[3(i+1)]=3Lw(i).\end{aligned}$$
For Condition 2.4.3(b), note that
$$(\bar q_i+1)w'(i)\le L(i+1)(i+1)+(i+1)\le(L+1)(i+1)^2,$$
so $L':=L+1$.

Now all the conditions 2.4.2, 2.4.3, and 3.1.1 are satisfied. Obviously, Conditions 3.1.2, 3.1.3, 3.2.1, and 3.2.2 are also satisfied along with the inequalities (3.32). The ratio $\frac w{w'}$ is strictly unbounded; Condition 3.2.4 is satisfied according to the proof of Theorem 3.2.7. Thus, all the statements of Chap. 3 hold true.

The described model can be understood as a queueing system: $X(t)$ is the number of jobs in the system at time $t$ and $\lambda$ and $\mu$ are the arrival and service rates. Under the imposed conditions on $\lambda$ and $\mu$, the process $X(\cdot)$ is non-explosive, but we did not require that $\lambda<\mu$, meaning that the equilibrium may not exist.
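The drift inequalities for $w(i)=(i+1)^2$ and $w'(i)=i+1$ can be verified numerically for concrete rates. The sketch below uses hypothetical birth and death rates chosen only so that the two listed conditions hold with $L=1.5$; it checks the inequalities on a finite range of states and a grid of actions, which is of course only a sanity check and not a proof.

```python
def drift(i, a, birth, death, w):
    """Sum over j of w(j) q(j|i, a) for a birth-and-death process on {0, 1, 2, ...}."""
    up = birth(i, a) * (w(i + 1) - w(i))
    down = death(i, a) * (w(i - 1) - w(i)) if i > 0 else 0.0
    return up + down


# Hypothetical rates: action a in [0, 1] scales the death (service) rate.
def birth(i, a):
    return 0.5 * (i + 1)


def death(i, a):
    return a * i          # death(0, a) = 0 as required


def w(i):
    return (i + 1) ** 2


def w_prime(i):
    return i + 1


L = 1.5                   # sup of (birth + death) / (i + 1) for these rates
actions = [k / 10.0 for k in range(11)]
ok_w = all(drift(i, a, birth, death, w) <= 3.0 * L * w(i)
           for i in range(300) for a in actions)
ok_w_prime = all(drift(i, a, birth, death, w_prime) <= L * w_prime(i)
                 for i in range(300) for a in actions)
rate_bound = all((birth(i, a) + death(i, a)) <= L * (i + 1)
                 for i in range(300) for a in actions)
```

With these rates the discount factor would have to satisfy $\alpha>3L=4.5$ for the statements of Chap. 3 to apply, according to the requirement stated above.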
3.4 Bibliographical Remarks

Section 3.1. Here we restrict our interest to unconstrained problems. The α-discounted problem has been intensively studied in book treatments of CTMDPs, see e.g., [106, 138, 150, 197]. The first few works on the α-discounted CTMDP problem are [161, 168, 213], all assuming a finite state space with [213] seeming to be the first to consider such problems. Then [143] is one of the first to consider α-discounted CTMDPs in a countable state space, and established the dynamic programming equation in the form presented in this section. Basically, the class of natural Markov strategies is considered in [143], which also assumed the bounded transition and cost rates. A lot of later developments aim at relaxing these restrictions: the denumerable state space was studied in [26, 102, 103, 106, 108, 118, 197, 227], and the Polish state space was considered in [99]. In [99, 102, 103, 118, 227], only the subclass of natural Markov strategies was considered, which induces continuous Q-functions. The reason for this is that the authors refer to [90] for the construction of the corresponding transition function. The continuity condition in [90] was relaxed in [85, 251] for the case of a Borel state space, and in [142, 143, 249] for the case of a denumerable state space. One can find more relevant references on the denumerable models in the survey [108]. The class of general π-strategies is considered for α-discounted CTMDPs with unbounded transition and cost rates in [112] for the denumerable state space, and in [190] for the Borel state space. The materials presented in this section mainly come from [190], which assumes a different version of the compactness-continuity condition. The arguments in the aforementioned works in this paragraph do not reduce the α-discounted CTMDP to an equivalent DTMDP problem.

Section 3.2. One of the first papers on constrained discounted CTMDPs is [180], which is based on the earlier work [178] on the finite time horizon. There, the transition and cost rates could be time-dependent. The linear program was stated for the extended strategic measures $P_\gamma^S(d\omega)e^{-\alpha t}dt\,\Pi(da|\omega,t)$ on $\Omega\times\mathbb R_+^0\times A$; only π-strategies were considered. The next milestone is [76], where constrained and unconstrained problems were reduced to DTMDPs. As a result, one can use the Convex Analytic Approach developed for discrete-time models. The transition rate was assumed to be bounded in [76, 180]; it can be unbounded in the article [77] extending the ideas of [76]. Many results about constrained problems in [76] were formulated for a denumerable state space only. Note also the early article [104], where only one constraint was considered and the state space was denumerable; the analysis was based on the Lagrange multipliers: see the corresponding subsubsection of Sect. 3.2.4. The results of [104] were included in the book [106] as Chap. 11 and also in the book [197, Sect. 8.2]. Note that in [104, 106, 115, 197] only natural Markov strategies were considered. It was claimed in [84] that natural Markov strategies are sufficient out of the class of π-strategies for problems with total and average cost criteria. It seems that occupation measures, directly for the discounted CTMDPs, were first introduced and investigated in [183], where the transition rate was again bounded, and the state space was denumerable. The characteristic Eq. (3.19) for occupation measures in Borel models appeared in [115]. The material presented in Sect. 3.2 mainly comes from [112, 113, 188]. Conditions similar to those formulated in Sects. 3.1 and 3.2 appeared in many of the aforementioned articles and books, and their roles were clarified in [26, 98].

Section 3.3. The example in Sect. 3.3.1 comes from [190]. Examples similar to the birth-and-death process (Sect. 3.3.2) appeared in [104, 112], [106, Sect. 6.5], and [197, Sect. 9.3].
Chapter 4
Reduction to DTMDP: The Total Cost Model
In this chapter, we show that solving an (unconstrained or constrained) optimal control problem for the model with total expected cost is equivalent to solving the control problem for the corresponding Discrete-Time Markov Decision Process (DTMDP). After that, we use the known facts for DTMDP models to obtain the corresponding results for CTMDP models. The necessary information about DTMDPs is in Appendix C. We do not assume that the controlled process X (·) is non-explosive. In the first section, we introduce a new class of strategies called Poisson-related, for which the reduction to DTMDP is most straightforward. We also show that each Markov π-strategy is equivalent to a corresponding Poisson-related strategy and vice versa. According to Theorem 1.3.1, solving an (unconstrained or constrained) optimal control problem (1.15) or (1.16) within the class of strategies introduced in Chap. 1 (see Definition 1.1.2) is equivalent to solving that problem within the class of Markov π-strategies and also within the class of Poisson-related strategies. We emphasize that Poisson-related strategies are realizable. Several properties of stationary Poisson-related strategies, presented in Theorem 4.1.2, are of independent interest. Finally, for the discounted model with α > 0 the reduction to DTMDP can be implemented in the framework of Markov standard ξ-strategies.
4.1 Poisson-Related Strategies
Below in this chapter, we are concerned with the total cost model only, and will not always explicitly indicate this. As was shown in Sect. 1.3, in general, Markov standard ξ-strategies are not sufficient for solving optimal control problems. The class of Markov π-strategies is sufficient, but such strategies are usually not realizable. Below, we introduce a new class of control strategies that are sufficient and realizable.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_4
One of the important features of a Markov π-strategy $\{\pi_n^M\}_{n=1}^\infty$ is that the kernels $\pi_n^M(da|x,s)$ can depend on the time $s$ elapsed since the previous jump epoch $T_{n-1}<\infty$. In the case of a Markov standard ξ-strategy, for each $n\in\mathbb{N}$, the action on the non-empty interval $(T_{n-1},T_n]$ is realised at the jump epoch $T_{n-1}<\infty$ and remains unchanged up to $T_n$. Poisson-related strategies are similar to Markov standard ξ-strategies, but the actions are allowed to change on the non-empty interval $(T_{n-1},T_n]$ at random times coming from the Poisson process with rate $\varepsilon>0$, which is independent of $H_{n-1}$.
Definition 4.1.1 A (Markov) Poisson-related strategy $S^P$ is defined by a constant $\varepsilon>0$ and a sequence $\{S_n^P\}_{n=1}^\infty$. For each $n\in\mathbb{N}$, $S_n^P=\{\tilde p_{n,k}\}_{k=0}^\infty$, where for each $k=0,1,2,\ldots$, $\tilde p_{n,k}$ is a stochastic kernel on $\mathcal{B}(\mathbf{A})$ given $x\in\mathbf{X}$. $\mathcal{S}^P$ is the set of all Poisson-related strategies. $\mathcal{S}_\varepsilon^P$ is the set of all Poisson-related strategies under a fixed $\varepsilon>0$.
As before, the strategic measure $P_\gamma^{S^P}(d\omega)$ on $(\Omega,\mathcal{F})$ is constructed recursively on the spaces of histories $\mathbf{H}_0,\mathbf{H}_1,\ldots$. The distribution of $H_0=X_0$ is given by $\gamma(dx)$ and, for a fixed $n=1,2,\ldots$, the stochastic kernel $G_n$ on $\mathcal{B}(\bar{\mathbb{R}}_+\times\mathbf{X}_\infty)$ given $h_{n-1}\in\mathbf{H}_{n-1}$ is defined by the following formulae. Below, $\Xi:=(\mathbb{R}_+\times\mathbf{A})^\infty$. Let us introduce the stochastic kernel $p_n$ on $\mathcal{B}(\Xi)$ given $x\in\mathbf{X}$ such that all the coordinate random variables of $\xi=(\tau_0,\alpha_0,\tau_1,\alpha_1,\ldots)\in\Xi$ are mutually independent under $p_n(\cdot|x)$ (for a fixed $x\in\mathbf{X}$), and have distributions
\[
p_n(\tau_0=0|x)=1;\qquad p_n(\tau_k\le t|x)=1-e^{-\varepsilon t}\ \text{ for } k\ge 1;\qquad p_n(\alpha_k\in\Gamma_A|x)=\tilde p_{n,k}(\Gamma_A|x)\ \text{ for } k\ge 0.
\]
Now
\[
\begin{aligned}
G_n(\{\infty\}\times\{x_\infty\}|h_{n-1}) &= \delta_{x_{n-1}}(\{x_\infty\})+\delta_{x_{n-1}}(\mathbf{X})\int_\Xi e^{-\int_{(0,\infty)}q^\xi_{x_{n-1}}(s)\,ds}\,p_n(d\xi|x_{n-1});\\
G_n(\Gamma_R\times\Gamma_X|h_{n-1}) &= \delta_{x_{n-1}}(\mathbf{X})\int_\Xi\int_{\Gamma_R}\tilde q^{\,\xi}(\Gamma_X|x_{n-1},\theta)\,e^{-\int_{(0,\theta]}q^\xi_{x_{n-1}}(s)\,ds}\,d\theta\,p_n(d\xi|x_{n-1}),\\
&\qquad \forall\ \Gamma_R\in\mathcal{B}(\mathbb{R}_+),\ \Gamma_X\in\mathcal{B}(\mathbf{X});\\
G_n(\{\infty\}\times\mathbf{X}|h_{n-1}) &= G_n(\mathbb{R}_+\times\{x_\infty\}|h_{n-1})=0.
\end{aligned}
\tag{4.1}
\]
Here, for $\xi=(\tau_0,\alpha_0,\tau_1,\alpha_1,\ldots)\in\Xi$,
\[
\begin{aligned}
\tilde q^{\,\xi}(\Gamma_X|x,s) &:= \sum_{k=0}^\infty \tilde q(\Gamma_X|x,\alpha_k)\,\mathbb{I}\{\tau_0+\tau_1+\dots+\tau_k<s\le\tau_0+\tau_1+\dots+\tau_{k+1}\},\quad \forall\ \Gamma_X\in\mathcal{B}(\mathbf{X});\\
q^\xi_x(s) &:= \sum_{k=0}^\infty q_x(\alpha_k)\,\mathbb{I}\{\tau_0+\tau_1+\dots+\tau_k<s\le\tau_0+\tau_1+\dots+\tau_{k+1}\}=\tilde q^{\,\xi}(\mathbf{X}|x,s).
\end{aligned}
\]
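After this, as in Sect. 1.1.3, we define the strategic measure $P_\gamma^{S^P}$ on $(\Omega,\mathcal{F})$ using the Ionescu-Tulcea Theorem.
Remark 4.1.1 If $\tilde p_{n,k}(da|x)=\delta_{\varphi_n(x)}(da)$ for all $k\ge 0$, where the measurable mappings $\varphi_n:\mathbf{X}\to\mathbf{A}$ define the deterministic Markov standard ξ-strategy $S$, the strategic measures $P_\gamma^{S}$ and $P_\gamma^{S^P}$ obviously coincide because the kernels $G_n$ in both cases are identical. We call such strategies indistinguishable.
It will also be useful to introduce the stochastic kernels $G_n^\xi$ on $\mathcal{B}(\bar{\mathbb{R}}_+\times\mathbf{X}_\infty)$ given $(h_{n-1},\xi)\in\mathbf{H}_{n-1}\times\Xi$ defined by
\[
\begin{aligned}
G_n^\xi(\{\infty\}\times\{x_\infty\}|h_{n-1}) &= \delta_{x_{n-1}}(\{x_\infty\})+\delta_{x_{n-1}}(\mathbf{X})\,e^{-\int_{(0,\infty)}q^\xi_{x_{n-1}}(s)\,ds};\\
G_n^\xi(\Gamma_R\times\Gamma_X|h_{n-1}) &= \delta_{x_{n-1}}(\mathbf{X})\int_{\Gamma_R}\tilde q^{\,\xi}(\Gamma_X|x_{n-1},\theta)\,e^{-\int_{(0,\theta]}q^\xi_{x_{n-1}}(s)\,ds}\,d\theta,\\
&\qquad \forall\ \Gamma_R\in\mathcal{B}(\mathbb{R}_+),\ \Gamma_X\in\mathcal{B}(\mathbf{X});\\
G_n^\xi(\{\infty\}\times\mathbf{X}|h_{n-1}) &= G_n^\xi(\mathbb{R}_+\times\{x_\infty\}|h_{n-1})=0,
\end{aligned}
\]
so that
\[
G_n(\cdot)=\int_\Xi G_n^\xi(\cdot)\,p_n(d\xi|x_{n-1}). \tag{4.2}
\]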
As before, the kernels pn (dξ|xn−1 ) are of no importance if xn−1 = x∞ . One can certainly also consider mixed strategies, where at each jump epoch tn−1 < ∞ one can choose either a relaxed control πn , or a randomised control n , or a Poisson-related control. Such strategies and more general ones are considered in Chap. 6, Sect. 6.1. Here, we deal only with the strategies described in Definition 1.1.2 and in Definition 4.1.1. Standard ξ-strategies and Poisson-related strategies S P have much in common. Roughly speaking, relaxations are absent in the following sense. There is a fixed Borel space , the source of randomness for actions: in the case of , := A, and, in the case of S P , the space is as described above. At each jump epoch tn−1 < ∞, the decision-maker simulates an element from using the distribution n (·|h n−1 ) in the first case, and the distribution pn (·|xn−1 ) in the second case.
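The simulation step described above can be sketched in Python. This is only an illustration under stated assumptions: `sample_alpha(k, rng)` is a hypothetical user-supplied sampler standing in for the kernel p̃_{n,k}(·|x_{n−1}), the names `sample_xi` and `action_at` are not from the text, and only finitely many coordinates of ξ are generated (enough to cover a given time horizon after the jump epoch).

```python
import random


def sample_xi(sample_alpha, eps, horizon, rng):
    """Draw the coordinates (tau_k, alpha_k) of xi until the switching times pass `horizon`.

    tau_0 = 0, tau_k ~ Exp(eps) for k >= 1, all independent;
    alpha_k is drawn by the hypothetical sampler sample_alpha(k, rng).
    """
    taus = [0.0]
    alphas = [sample_alpha(0, rng)]
    total = 0.0
    k = 1
    while total <= horizon:
        tau = rng.expovariate(eps)
        total += tau
        taus.append(tau)
        alphas.append(sample_alpha(k, rng))
        k += 1
    return taus, alphas


def action_at(taus, alphas, u):
    """Return alpha_k for the k with tau_0+...+tau_k < u <= tau_0+...+tau_{k+1}."""
    left = taus[0]
    for k in range(len(taus) - 1):
        right = left + taus[k + 1]
        if left < u <= right:
            return alphas[k]
        left = right
    raise ValueError("u lies outside the sampled switching times")


if __name__ == "__main__":
    rng = random.Random(0)
    taus, alphas = sample_xi(lambda k, r: r.choice(["a", "b"]), 2.0, 5.0, rng)
    print([action_at(taus, alphas, u) for u in (0.5, 1.0, 2.5, 5.0)])
```

Between two jump epochs the action is piecewise constant and never relaxed: it changes only at the arrival times of the rate-ε Poisson process.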
After that, up to the next jump epoch, the actions at times tn−1 + u are not relaxed, being equal to ϕ(ξ, u), where ϕ(ξ, u) = ξ in the case of a standard ξ-strategy, and ϕ(ξ, u) = ∞ k=0 αk I{τ0 + . . . + τk < u ≤ τ0 + . . . + τk+1 } in the case of a Poissonrelated strategy. This interpretation of standard ξ-strategies is closely related to their = A = . The realizability: see the discussion at the end of Sect. 1.1.4, where most general ξ-strategies will be introduced in Chap. 6: see Definition 6.1.4. Below, SεP is the space of all Poisson-related strategies under a fixed value ε > 0. Suppose a Poisson-related strategy is fixed. After the history Hn−1 = h n−1 ∈ Hn−1 is realised for some n ∈ N, for a fixed value of ξ ∈ , the total discounted cost, associated with the cost rate c j on the interval (tn−1 , tn−1 + θ], equals
\[
\int_{(0,\theta]\cap\mathbb{R}_+} e^{-\alpha(t_{n-1}+u)}\sum_{k=0}^\infty c_j(x_{n-1},\alpha_k)\,\mathbb{I}\{\tau_0+\dots+\tau_k<u\le\tau_0+\dots+\tau_{k+1}\}\,du.
\]
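Here and below, as usual, $\xi=(\tau_0,\alpha_0,\tau_1,\alpha_1,\ldots)\in\Xi$, $\alpha\ge 0$. Remember, $c_j(x_\infty,a)=0$. The regular conditional expectation of the described cost over the sojourn time $\Theta_n$ given $H_{n-1}=h_{n-1}\in\mathbf{H}_{n-1}$ equals
\[
\begin{aligned}
C_j^\alpha(S_n^P,h_{n-1})
&= e^{-\alpha t_{n-1}}\int_\Xi\Bigg\{\int_{(0,\infty)}\Bigg[\int_{(0,\theta]}e^{-\alpha u}\sum_{k=0}^\infty c_j(x_{n-1},\alpha_k)\,\mathbb{I}\{\tau_0+\dots+\tau_k<u\le\tau_0+\dots+\tau_{k+1}\}\,du\Bigg]G_n^\xi(d\theta\times\mathbf{X}|h_{n-1})\\
&\qquad + G_n^\xi(\{\infty\}\times\{x_\infty\}|h_{n-1})\int_{(0,\infty)}e^{-\alpha u}\sum_{k=0}^\infty c_j(x_{n-1},\alpha_k)\,\mathbb{I}\{\tau_0+\dots+\tau_k<u\le\tau_0+\dots+\tau_{k+1}\}\,du\Bigg\}\,p_n(d\xi|x_{n-1})\\
&= e^{-\alpha t_{n-1}}\int_\Xi\int_{(0,\infty)}\sum_{k=0}^\infty c_j(x_{n-1},\alpha_k)\,\mathbb{I}\{\tau_0+\dots+\tau_k<\theta\le\tau_0+\dots+\tau_{k+1}\}\,e^{-\alpha\theta-\int_{(0,\theta]}q^\xi_{x_{n-1}}(s)\,ds}\,d\theta\,p_n(d\xi|x_{n-1}).
\end{aligned}
\tag{4.3}
\]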
The second equality here is by integrating by parts. Recall that we always deal separately with positive and negative parts of the costs. (See (1).) Remember also the convention ∞ − ∞ = +∞. We introduce the detailed occupation measures of a Poisson-related strategy S P as follows.
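Definition 4.1.2 The detailed occupation measure $m_{\gamma,n}^{S^P,\alpha}$ of a Poisson-related strategy $S^P$ (under the fixed initial distribution $\gamma$ and discount factor $\alpha\ge 0$) is defined for each $n\in\mathbb{N}$, $\Gamma_X\in\mathcal{B}(\mathbf{X})$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$ by
\[
\begin{aligned}
m_{\gamma,n}^{S^P,\alpha}(\Gamma_X\times\Gamma_A) := E_\gamma^{S^P}\Bigg[&\mathbb{I}\{X_{n-1}\in\Gamma_X\}\,e^{-\alpha T_{n-1}}\int_\Xi\int_{(0,\infty)}e^{-\alpha\theta-\int_{(0,\theta]}q^\xi_{X_{n-1}}(s)\,ds}\\
&\times\sum_{k=0}^\infty\delta_{\alpha_k}(\Gamma_A)\,\mathbb{I}\{\tau_0+\dots+\tau_k<\theta\le\tau_0+\dots+\tau_{k+1}\}\,d\theta\,p_n(d\xi|X_{n-1})\Bigg].
\end{aligned}
\tag{4.4}
\]
Let $\mathcal{D}_\varepsilon^P=\{m_\gamma^{S^P,\alpha}=\{m_{\gamma,n}^{S^P,\alpha}\}_{n=1}^\infty,\ S^P\in\mathcal{S}_\varepsilon^P\}$ be the space of sequences of detailed occupation measures for all Poisson-related strategies under a fixed value $\varepsilon>0$.
For $\alpha\ge 0$, the total α-discounted costs, associated with the originally given cost rates $\{c_j\}_{j=0}^J$, equal
\[
W_j^\alpha(S^P,\gamma) := E_\gamma^{S^P}\Bigg[\sum_{n=1}^\infty C_j^\alpha(S_n^P,H_{n-1})\Bigg]=\sum_{n=1}^\infty\int_{\mathbf{X}\times\mathbf{A}}c_j(x,a)\,m_{\gamma,n}^{S^P,\alpha}(dx\times da).
\]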
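Definition 1.3.2 of the realizability can be applied to Poisson-related strategies in the following sense. For a fixed $h_{n-1}\in\mathbf{H}_{n-1}$ with $t_{n-1}<\infty$, there must exist a complete probability space $(\tilde\Omega,\tilde{\mathcal{F}},\tilde P)$ and a process $\mathbb{A}$ on $\mathbb{R}_+\times\tilde\Omega$ with values in $\mathbf{A}$, measurable with respect to $(s,\tilde\omega)$, such that, for any conservative and stable transition rate $\hat q$, for all $\Gamma_A\in\mathcal{B}(\mathbf{A})$,
\[
\begin{aligned}
&\tilde E\Bigg[\int_{(0,\infty)}e^{-\alpha\theta-\int_{(0,\theta]}\hat q_{x_{n-1}}(\mathbb{A}(s,\tilde\omega))\,ds}\,\mathbb{I}\{\mathbb{A}(\theta,\tilde\omega)\in\Gamma_A\}\,d\theta\Bigg]\\
&\quad=\int_\Xi\int_{(0,\infty)}e^{-\alpha\theta-\int_{(0,\theta]}\hat q^{\,\xi}_{x_{n-1}}(s)\,ds}\sum_{k=0}^\infty\delta_{\alpha_k}(\Gamma_A)\,\mathbb{I}\{\tau_0+\dots+\tau_k<\theta\le\tau_0+\dots+\tau_{k+1}\}\,d\theta\,p_n(d\xi|x_{n-1}).
\end{aligned}
\]
As before, $\tilde E$ is the mathematical expectation with respect to the measure $\tilde P$. Additionally, the assertion (b) of Definition 1.1.10 is valid, where, naturally, formulae (4.1) play the role of (1.12) and (1.13).
One can take $(\tilde\Omega,\tilde{\mathcal{F}},\tilde P)$ as the probability space $(\Xi,\mathcal{B}(\Xi),p_n(\cdot|x_{n-1}))$, assumed to have been completed with respect to $p_n(\cdot|x_{n-1})$, and put
\[
\mathbb{A}(s,\xi):=\sum_{k=0}^\infty\alpha_k\,\mathbb{I}\{\tau_0+\tau_1+\dots+\tau_k<s\le\tau_0+\tau_1+\dots+\tau_{k+1}\}.
\]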
After that, all the requirements formulated above are satisfied, so that all Poisson-related strategies are realizable. This also follows from the more general reasoning presented in Sect. 6.6.
Definition 4.1.3 A Poisson-related strategy is called stationary if all the stochastic kernels $\tilde p_{n,k}=\tilde p$ are the same for all $n\in\mathbb{N}$, $k=0,1,2,\ldots$.
Lemma 4.1.1 For a fixed Poisson-related strategy $S^P\in\mathcal{S}_\varepsilon^P$ defined by the stochastic kernels $\tilde p_{n,k}$, for each nonnegative measurable function $g$ on $\mathbf{A}\times\mathbf{X}$, for each $n\in\mathbb{N}$, $\alpha\ge 0$, $x\in\mathbf{X}$,
\[
\begin{aligned}
&\int_\Xi\int_{(0,\infty)}e^{-\alpha\theta-\int_{(0,\theta]}q^\xi_x(s)\,ds}\sum_{k=0}^\infty g(\alpha_k,x)\,\mathbb{I}\{\tau_0+\dots+\tau_k<\theta\le\tau_0+\dots+\tau_{k+1}\}\,d\theta\,p_n(d\xi|x)\\
&\quad=\sum_{k=0}^\infty\Bigg(\prod_{i=1}^k\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x)}{\alpha+q_x(a)+\varepsilon}\Bigg)\int_{\mathbf{A}}\frac{g(a,x)\,\tilde p_{n,k}(da|x)}{\alpha+q_x(a)+\varepsilon}.
\end{aligned}
\]
As usual, here $\xi=(\tau_0,\alpha_0,\tau_1,\alpha_1,\ldots)\in\Xi$.
Proof For fixed $n\in\mathbb{N}$, $k\ge 0$, let us evaluate the following expression
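\[
\int_\Xi\int_{(\tau_0+\dots+\tau_k,\ \tau_0+\dots+\tau_{k+1}]}e^{-\alpha\theta-\int_{(0,\theta]}q^\xi_x(s)\,ds}\,g(\alpha_k,x)\,d\theta\,p_n(d\xi|x). \tag{4.5}
\]
It equals
\[
\int_\Xi\Bigg(\prod_{i=1}^k e^{-\alpha\tau_i-q_x(\alpha_{i-1})\tau_i}\Bigg)\int_{(0,\tau_{k+1}]}g(\alpha_k,x)\,e^{-\alpha u-q_x(\alpha_k)u}\,du\;p_n(d\xi|x).
\]
Since all the components of $\xi=(\tau_0,\alpha_0,\tau_1,\ldots)$ are mutually independent and each component $\tau_i$, $i\ge 1$, is exponential with parameter $\varepsilon>0$, we can calculate separately the following integral:
\[
\begin{aligned}
\int_\Xi\int_{(0,\tau_{k+1}]}g(\alpha_k,x)\,e^{-\alpha u-q_x(\alpha_k)u}\,du\,p_n(d\xi|x)
&=\int_{\mathbf{A}}g(\alpha_k,x)\int_{(0,\infty)}\varepsilon e^{-\varepsilon s}\int_{(0,s]}e^{-(\alpha+q_x(\alpha_k))u}\,du\,ds\;\tilde p_{n,k}(d\alpha_k|x)\\
&=\int_{\mathbf{A}}\frac{g(a,x)\,\tilde p_{n,k}(da|x)}{\alpha+q_x(a)+\varepsilon}.
\end{aligned}
\]
Note, the last formula is valid in any case, whether $\alpha+q_x(a)$ is zero or not. Indeed, if $\alpha+q_x(\alpha_k)=0$, then we have $\int_{(0,\infty)}\varepsilon e^{-\varepsilon s}s\,ds=\frac{1}{\varepsilon}$. Otherwise,
\[
\int_{(0,\infty)}\varepsilon e^{-\varepsilon s}\int_{(0,s]}e^{-(\alpha+q_x(\alpha_k))u}\,du\,ds
=\int_{(0,\infty)}\varepsilon e^{-\varepsilon s}\,\frac{1-e^{-(\alpha+q_x(\alpha_k))s}}{\alpha+q_x(\alpha_k)}\,ds
=\frac{1}{\alpha+q_x(\alpha_k)+\varepsilon}.
\]
Similarly,
\[
\int_\Xi\prod_{i=1}^k e^{-\alpha\tau_i-q_x(\alpha_{i-1})\tau_i}\,p_n(d\xi|x)=\prod_{i=1}^k\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x)}{\alpha+q_x(a)+\varepsilon}.
\]
Therefore, expression (4.5) equals
\[
\Bigg(\prod_{i=1}^k\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x)}{\alpha+q_x(a)+\varepsilon}\Bigg)\int_{\mathbf{A}}\frac{g(a,x)\,\tilde p_{n,k}(da|x)}{\alpha+q_x(a)+\varepsilon},
\]
and the required expression follows after summation over $k=0,1,2,\ldots$.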
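The elementary integral at the heart of this proof can be checked numerically. The sketch below is only an illustration: it approximates the integral over s of ε e^{−εs} times the inner integral of e^{−cu} over (0, s], where c stands for α + q_x(α_k), by a midpoint rule on a truncated range, and compares it with 1/(c + ε), including the degenerate case c = 0. The function names, the truncation point `upper` and the grid size `steps` are assumptions made for this example.

```python
import math


def inner_integral(c, s):
    """Integral of exp(-c*u) over (0, s]; it equals s when c == 0."""
    return s if c == 0.0 else (1.0 - math.exp(-c * s)) / c


def expected_inner(c, eps, upper=200.0, steps=200_000):
    """Midpoint-rule approximation of int_0^inf eps*exp(-eps*s)*inner_integral(c, s) ds.

    The range is truncated at `upper`; eps*upper should be large so that
    the neglected tail is negligible.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        total += eps * math.exp(-eps * s) * inner_integral(c, s) * h
    return total


if __name__ == "__main__":
    eps = 0.5
    for c in (0.0, 0.7, 3.0):
        print(c, expected_inner(c, eps), 1.0 / (c + eps))
```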
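Corollary 4.1.1 Let $g$ be a nonnegative measurable function on $\mathbf{A}\times\mathbf{X}$ and let $S^P$ be a stationary Poisson-related strategy with the stochastic kernel $\tilde p$. Then, for each $n\in\mathbb{N}$, $\alpha\ge 0$, $x\in\mathbf{X}$,
\[
\begin{aligned}
&\int_\Xi\int_{(0,\infty)}e^{-\alpha\theta-\int_{(0,\theta]}q^\xi_x(s)\,ds}\sum_{k=0}^\infty g(\alpha_k,x)\,\mathbb{I}\{\tau_0+\dots+\tau_k<\theta\le\tau_0+\dots+\tau_{k+1}\}\,d\theta\,p_n(d\xi|x)\\
&\quad=\frac{\displaystyle\int_{\mathbf{A}}\frac{g(a,x)\,\tilde p(da|x)}{\alpha+q_x(a)+\varepsilon}}{\displaystyle\int_{\mathbf{A}}\frac{(\alpha+q_x(a))\,\tilde p(da|x)}{\alpha+q_x(a)+\varepsilon}}.
\end{aligned}
\]
As usual, here $\xi=(\tau_0,\alpha_0,\tau_1,\alpha_1,\ldots)\in\Xi$, $\frac{0}{0}:=0$, and $\frac{d}{0}:=\infty$ for $d>0$.
Proof It directly follows from Lemma 4.1.1 and Definition 4.1.3.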
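The step from Lemma 4.1.1 to Corollary 4.1.1 is a geometric series: for a stationary kernel every factor of the product is the same number r, and 1 − r is exactly the denominator of the ratio. The following Python sketch checks this numerically for a finite action set; the dictionaries `g`, `q`, `p` (values of g(a, x), q_x(a) and p̃({a}|x) at one fixed state x) and all function names are illustrative assumptions, not objects from the text.

```python
def series_side(g, q, p, alpha, eps, terms=10_000):
    """Truncated series of Lemma 4.1.1 for a stationary kernel on a finite action set."""
    # r: the common value of every factor in the product over i.
    r = sum(eps * p[a] / (alpha + q[a] + eps) for a in p)
    last = sum(g[a] * p[a] / (alpha + q[a] + eps) for a in p)
    total, power = 0.0, 1.0
    for _ in range(terms):
        total += power * last
        power *= r
    return total


def ratio_side(g, q, p, alpha, eps):
    """Closed form of Corollary 4.1.1 with the conventions 0/0 := 0 and d/0 := infinity."""
    num = sum(g[a] * p[a] / (alpha + q[a] + eps) for a in p)
    den = sum((alpha + q[a]) * p[a] / (alpha + q[a] + eps) for a in p)
    if den == 0.0:
        return 0.0 if num == 0.0 else float("inf")
    return num / den


if __name__ == "__main__":
    q = {"a": 1.0, "b": 3.0}
    p = {"a": 0.4, "b": 0.6}
    g = {"a": 2.0, "b": 5.0}
    print(series_side(g, q, p, 0.1, 2.0), ratio_side(g, q, p, 0.1, 2.0))
```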
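Corollary 4.1.2 Suppose $S^P\in\mathcal{S}_\varepsilon^P$. Then the detailed occupation measures (4.4), for all $n\in\mathbb{N}$, $\Gamma_X\in\mathcal{B}(\mathbf{X})$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$, have the form
\[
m_{\gamma,n}^{S^P,\alpha}(\Gamma_X\times\Gamma_A)
=E_\gamma^{S^P}\Bigg[\mathbb{I}\{X_{n-1}\in\Gamma_X\}\,e^{-\alpha T_{n-1}}\sum_{k=0}^\infty\Bigg(\prod_{i=1}^k\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|X_{n-1})}{\alpha+q_{X_{n-1}}(a)+\varepsilon}\Bigg)\int_{\Gamma_A}\frac{\tilde p_{n,k}(da|X_{n-1})}{\alpha+q_{X_{n-1}}(a)+\varepsilon}\Bigg].
\]
Proof The claimed equality follows from (4.4) and Lemma 4.1.1 with $g(a,x):=\mathbb{I}\{a\in\Gamma_A\}$.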
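Corollary 4.1.3 Suppose $S^P\in\mathcal{S}_\varepsilon^P$. Then, for each $\Gamma_X\in\mathcal{B}(\mathbf{X})$,
\[
G_n(\mathbb{R}_+\times\Gamma_X|h_{n-1})=\delta_{x_{n-1}}(\mathbf{X})\sum_{k=0}^\infty\Bigg(\prod_{i=1}^k\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon}\Bigg)\int_{\mathbf{A}}\frac{\tilde q(\Gamma_X|x_{n-1},a)\,\tilde p_{n,k}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon}.
\]
Proof The claimed equality follows from (4.1) and Lemma 4.1.1 under $g(a,x):=\tilde q(\Gamma_X|x,a)$, $\alpha=0$.
Note that for each $S^P\in\mathcal{S}_\varepsilon^P$, by the previous statement,
\[
G_n(\mathbb{R}_+\times\mathbf{X}|h_{n-1})=\delta_{x_{n-1}}(\mathbf{X})\Bigg[1-\prod_{i=1}^\infty\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon}\Bigg]
\]
and
\[
G_n(\{\infty\}\times\{x_\infty\}|h_{n-1})=\delta_{x_{n-1}}(\{x_\infty\})+\delta_{x_{n-1}}(\mathbf{X})\prod_{i=1}^\infty\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon}.
\]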
Theorem 4.1.1 For each $\alpha\ge 0$ and $\varepsilon>0$, it holds that $\mathcal{D}^{ReM}=\mathcal{D}_\varepsilon^P$.
Proof According to Sect. 1.3.5, namely, the second interpretation of discounting, there is no loss of generality if we take $\alpha=0$. To be more precise, if $\alpha>0$, then we prove the current theorem for the “hat” model and after that refer to Theorem 1.3.3. The proof goes in two parts as follows.
(a) For a fixed Markov π-strategy $\pi^M=\{\pi_n^M(da|x_{n-1},s)\}_{n=1}^\infty$, we define the required Poisson-related strategy $S^P\in\mathcal{S}_\varepsilon^P$ in the following way. Put
\[
\begin{aligned}
\Lambda_n(t,x) &:= \int_{(0,t]}q_x(\pi_n^M,s)\,ds,\quad \forall\ t\ge 0,\ x\in\mathbf{X},\ n\ge 1;\\
Q_{n,k}(w,x) &:= \frac{\varepsilon(\varepsilon w)^{k-1}}{(k-1)!}\,e^{-\varepsilon w-\Lambda_n(w,x)},\quad \forall\ w\ge 0,\ x\in\mathbf{X},\ k=1,2,\ldots.
\end{aligned}
\]
After that, the strategy $S^P$ is given by the following formulae:
\[
\begin{aligned}
\tilde p_{n,0}(\Gamma_A|x) &:= \int_{(0,\infty)}e^{-\varepsilon t-\Lambda_n(t,x)}\int_{\Gamma_A}[q_x(a)+\varepsilon]\,\pi_n^M(da|x,t)\,dt;\\
\tilde p_{n,k}(\Gamma_A|x) &:= \frac{\displaystyle\int_{(0,\infty)}Q_{n,k}(w,x)\,e^{\varepsilon w+\Lambda_n(w,x)}\int_{(w,\infty)}e^{-\varepsilon t-\Lambda_n(t,x)}\int_{\Gamma_A}[q_x(a)+\varepsilon]\,\pi_n^M(da|x,t)\,dt\,dw}{\displaystyle\int_{(0,\infty)}Q_{n,k}(w,x)\,dw},\quad k=1,2,\ldots
\end{aligned}
\]
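As a worked special case (an illustration, not part of the original proof), suppose the kernel π_n^M(da|x, t) = π(da|x) does not depend on t, the action set is finite, and the mean rate q̄ = Σ_a q_x(a) π(a|x) is positive. Then the cumulative rate up to time t is q̄·t, and a direct computation from the formulae above gives the same kernel p̃_{n,k}(a|x) = (q_x(a) + ε) π(a|x) / (q̄ + ε) for every k. The Python sketch below, with hypothetical names and dictionaries, builds this kernel and checks with the ratio from Corollary 4.1.1 (at α = 0) that it reproduces the occupation time π(a|x)/q̄ of action a under π.

```python
def poisson_kernel_from_stationary_pi(pi, q, eps):
    """p~(a|x) = (q(a) + eps) * pi(a) / (q_bar + eps) for a time-independent pi (finite actions)."""
    q_bar = sum(q[a] * pi[a] for a in pi)
    return {a: (q[a] + eps) * pi[a] / (q_bar + eps) for a in pi}


def occupation_poisson(p_tilde, q, eps, action):
    """Corollary 4.1.1 at alpha = 0 with g equal to the indicator of `action`."""
    num = p_tilde[action] / (q[action] + eps)
    den = sum(q[a] * p_tilde[a] / (q[a] + eps) for a in p_tilde)
    return num / den


def occupation_pi(pi, q, action):
    """Integral over t of exp(-q_bar * t) * pi(action) dt = pi(action) / q_bar."""
    q_bar = sum(q[a] * pi[a] for a in pi)
    return pi[action] / q_bar


if __name__ == "__main__":
    pi = {"a": 0.3, "b": 0.7}
    q = {"a": 1.0, "b": 4.0}
    p_tilde = poisson_kernel_from_stationary_pi(pi, q, eps=2.5)
    print(p_tilde, sum(p_tilde.values()))
    for a in pi:
        print(a, occupation_poisson(p_tilde, q, 2.5, a), occupation_pi(pi, q, a))
```

Note that p̃ puts more weight than π on actions with large jump rates, compensating for the fact that a fixed action with a high rate is held for a shorter time.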
It is straightforward to check that $\tilde p_{n,k}(\mathbf{A}|x)=1$ for all $n=1,2,\ldots$, $k=0,1,\ldots$. Both control strategies $\pi^M$ and $S^P$ are Markov in the sense that the corresponding stochastic kernels $G_n^{\pi^M}$ defined by (1.13) and $G_n^{S^P}$ defined by (4.1) depend only on the state $x_{n-1}\in\mathbf{X}_\infty$. Below, we consider the case $x_{n-1}\in\mathbf{X}$.
Firstly, we concentrate on the sojourn time $\Theta_n$ under the strategies under consideration. According to (4.2),
\[
G_n^{S^P}(\Gamma_R\times\mathbf{X}_\infty|x_{n-1})=\int_\Xi G_n^\xi(\Gamma_R\times\mathbf{X}_\infty|x_{n-1})\,p_n(d\xi|x_{n-1}).
\]
For a fixed $\xi=(\tau_0,\alpha_0,\tau_1,\alpha_1,\ldots)\in\Xi$, for every $k=1,2,\ldots$,
\[
G_n^\xi\Bigg(\Bigg(\sum_{i=1}^k\tau_i,\infty\Bigg]\times\mathbf{X}_\infty\,\Bigg|\,x_{n-1}\Bigg)=e^{-\int_{(0,\sum_{i=1}^k\tau_i]}q^\xi_{x_{n-1}}(s)\,ds},
\]
so that, under the Poisson-related strategy $S^P$, the (conditional) probability that the sojourn time $\Theta_n$ is bigger than $\sum_{i=1}^k\tau_i$ equals
\[
\int_\Xi G_n^\xi\Bigg(\Bigg(\sum_{i=1}^k\tau_i,\infty\Bigg]\times\mathbf{X}_\infty\,\Bigg|\,x_{n-1}\Bigg)p_n(d\xi|x_{n-1})=\prod_{i=1}^k p_{n,i},\quad k=1,2,\ldots,
\]
where
\[
p_{n,i}=\int_{\mathbf{A}}\int_{(0,\infty)}\varepsilon e^{-\varepsilon w}e^{-q_{x_{n-1}}(a)w}\,dw\;\tilde p_{n,i-1}(da|x_{n-1})=\int_{\mathbf{A}}\frac{\varepsilon\,\tilde p_{n,i-1}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon}, \tag{4.6}
\]
i.e., $p_{n,i}$ is the expectation of $e^{-\tau_i q_{x_{n-1}}(\alpha_{i-1})}$ with respect to $p_n(\cdot|x_{n-1})$. Remember, all the components of $\xi$ are mutually independent for a fixed $x_{n-1}\in\mathbf{X}$ under $p_n(\cdot|x_{n-1})$.
Under the strategy $\pi^M$, the probability that the sojourn time $\Theta_n$ is bigger than $\sum_{i=1}^k\tau_i$ equals
\[
\int_\Xi G_n^{\pi^M}\Bigg(\Bigg(\sum_{i=1}^k\tau_i,\infty\Bigg]\times\mathbf{X}_\infty\,\Bigg|\,x_{n-1}\Bigg)p_n(d\xi|x_{n-1})=\int_{(0,\infty)}Q_{n,k}(w,x_{n-1})\,dw, \tag{4.7}
\]
because $\sum_{i=1}^k\tau_i$ has the Erlang($\varepsilon,k$) distribution. We will prove by induction that
\[
\int_{(0,\infty)}Q_{n,k}(w,x_{n-1})\,dw=\prod_{i=1}^k p_{n,i},\quad k=1,2,\ldots. \tag{4.8}
\]
If k = 1 then pn,1 = A
ε qxn−1 (a) + ε
=
i=1
εe
e−εt−n (t,xn−1 ) [qxn−1 (a) + ε]πnM (da|xn−1 , t)dt
(0,∞)
−εt−n (t,xn−1 )
(0,∞)
dt =
(0,∞)
Q n,1 (w, xn−1 )dw.
Fubini’s Theorem is always used without special reference. Suppose (0,∞)
Q n,k (w, xn−1 )dw =
k
pn,i
i=1
for some k = 1, 2, . . ., and let us prove that the same relation holds for k + 1 as follows. k+1
pn,i =
i=1
(0,∞)
Q n,k (w, xn−1 )dw
=
A
ε p˜ n,k (da|xn−1 ) qxn−1 (a) + ε
ε Q n,k (w, xn−1 )eεw+n (w,xn−1 ) q (a) + ε x A (0,∞) n−1 −εt−n (t,xn−1 ) × e [qxn−1 (a) + ε]πnM (da|xn−1 , t)dt dw (w,∞)
=ε
(0,∞)
ε(εw)k−1 (k − 1)!
e−εt−n (t,xn−1 ) dt dw. (w,∞)
Integrating by parts gives k+1
pn,i =
i=1
Note that
(0,∞)
ε(εw)k −εw−n (w,xn−1 ) e dw = k!
(0,∞)
Q n,k+1 (w, xn−1 )dw.
ε(εw)k lim e−εt−n (t,xn−1 ) dt w→∞ k! (w,∞) −εt−n (t,xn−1 ) k+1 dt ε (w,∞) e lim = =0 −k k! w→∞ w by the L’Hopital rule. Formula (4.8) is proved. The next step is to prove that, for all X ∈ B(X), n = 0, 1, 2, . . ., PγS (X n ∈ X ) = Pγπ (X n ∈ X ). P
M
(4.9)
This equality is obviously valid at n = 0 because the initial distribution γ is fixed. Suppose it holds for some n − 1 ≥ 0 and prove that, for all xn−1 ∈ X, X ∈ B(X), P M (4.10) G nS (R+ × X |xn−1 ) = G πn (R+ × X |xn−1 ). Since, for all k = 0, 1, 2, . . ., X ∈ B(X), G ξn
k
τi ,
i=0
k+1
τi
i=0
= q ( X |xn−1 , αk ) = q ( X |xn−1 , αk )e
× X xn−1
k
(
−
k+1 i=0 τi , i=0 τi ]
k (0, i=0 τi ]
e−
ξ
qxn−1 (s)ds
ξ
(0,θ]
qxn−1 (s)ds
dθ
e−qxn−1 (αk )θ dθ,
(0,τk+1 ]
and having in mind that all the components of ξ = (τ0 , α0 , τ1 , . . .) ∈ are mutually independent under pn (·|xn−1 ), we conclude that
⎛⎛
⎤
⎞
k k+1 ξ Gn ⎝ ⎝ τi , τi ⎦ × X xn−1 ⎠ pn (dξ|xn−1 ) i=0 i=0
⎛ =⎝
k
⎞
pn,i ⎠
i=1
A
q ( X |xn−1 , a)
× p˜ n,k (da|xn−1 ) = q ( X |xn−1 , a) A
×
k
(0,∞)
(0,∞) (θ,∞)
εe
εe−εt
(0,t]
−εt−qxn−1 (a)θ
e
(4.11)
−qxn−1 (a)θ
dθ dt
dt dθ p˜ n,k (da|xn−1 )
pn,i
i=1
=
A
=
q ( X |xn−1 , a)
(0,∞)
e
−qxn−1 (a)θ−εθ
dθ p˜ n,k (da|xn−1 )
k i=1
q ( X |xn−1 , a) p˜ n,k (da|xn−1 ) pn,i . q (a) + ε A xn−1 i=1 k
pn,i
Before proceeding further, we will prove the following technical result: for each nonnegative measurable function h on A, for all k = 0, 1, 2, . . .,
k h(a) p˜ n,k (da|xn−1 ) pn,i (4.12) A q xn−1 (a) + ε i=1 = h(a)πnM (da|xn−1 , t)e−n (t,xn−1 ) dt pn (dξ|xn−1 )
k i=1 τi ,
(
k+1 i=0 τi ]
A
as follows. Clearly, it is sufficient to assume that the function h is bounded. If k = 0, then we have on the left-hand side of (4.12) the expression
e−εt−n (t,xn−1 ) (0,∞)
A
h(a)πnM (da|xn−1 , t)dt,
and the right-hand side, by integrating by parts, equals =
(0,∞)
h(a)πnM (da|xn−1 , θ)e−n (θ,xn−1 ) dθ dt −εt−n (t,xn−1 ) e h(a)πnM (da|xn−1 , t)dt. εe
(0,∞)
−εt
(0,t]
A
A
For k > 0, using (4.8), we obtain the following expression on the left-hand side of (4.12):
k h(a) p˜ n,k (da|xn−1 ) pn,i A q xn−1 (a) + ε i=1 ε(εw)k−1 −εt−n (t,xn−1 ) = e h(a)πnM (da|xn−1 , t)dt dw. (0,∞) (k − 1)! (w,∞) A
On the right we have
h(a)πnM (da|xn−1 , t)e−n (t,xn−1 ) dt pn (dξ|xn−1 )
k
k+1 ( i=1 τi , i=0 τi ] A ε(εw)k−1 −εw εe−εs h(a)πnM (da|xn−1 , t) = e (k − 1)! (0,∞) (0,∞) (w,w+s] A ×e−n (t,xn−1 ) dt ds dw =
ε(εw)k−1 −εw e−εs h(a)πnM (da|xn−1 , w + s) e (0,∞) (k − 1)! (0,∞) A
×e−n (w+s,xn−1 ) ds dw =
ε(εw)k−1 (0,∞) (k − 1)!
(w,∞)
e−εt−n (t,xn−1 )
A
h(a)πnM (da|xn−1 , t)dt dw,
k where the first equality holds because i=1 τi has the Erlang(ε, k) distribution, and τk+1 is the independent exponential random variable under pn (·|xn−1 ), the second equality holds after integrating by parts with respect to s, and the last equality appears after we introduce the notation t := w + s. Formula (4.12) is proved. Now, according to (4.11) and (4.12),
G ξn
=
τi ,
k+1
i=0
=
k
k+1 i=0 τi , i=0 τi ]
M G πn
τi
i=0
k
(
k
× X xn−1 pn (dξ|xn−1 )
q ( X |xn−1 , πnM , t)e−n (t,xn−1 ) dt pn (dξ|xn−1 ),
τi ,
i=0
k+1
τi
i=0
× X xn−1 pn (dξ|xn−1 ).
(See also (1.13).)
∞ τi = ∞|xn−1 ) = 1, Since pn ( i=1 P
G nS (R+ × X |xn−1 ) k ∞ k+1 ξ = Gn τi , τi × X xn−1 pn (dξ|xn−1 ) k=0 i=0 i=0 ∞ k k+1 πM = Gn τi , τi × X xn−1 pn (dξ|xn−1 ) k=0 i=0 i=0 M = G πn (R+ × X |xn−1 ) pn (dξ|xn−1 ) =
M G πn (R+
× X |xn−1 ).
Formula (4.10) is proved by induction, and hence (4.9) is proved. According to Corollary 4.1.2, S P ,0 m γ,n ( X
× A) =
X
P PγS (X n−1
×
A
k ∞ ε p˜ n,i−1 (da|x) ∈ d x) qx (a) + ε k=0 i=1 A
p˜ n,k (da|x) . qx (a) + ε
Since we have already established equality (4.9), it remains to show that, for all xn−1 ∈ X, A ∈ B(A)
k ∞ ε p˜ n,i−1 (da|xn−1 ) p˜ n,k (da|xn−1 ) q (a) + ε x A q xn−1 (a) + ε n−1 k=0 i=1 A e−n (t,xn−1 ) πnM ( A |xn−1 , t)dt : =
(4.13)
(0,∞)
π ,0 see (1.31) for the definition of the detailed occupation measure m γ,n . According to (4.12) with h(a) := I{a ∈ A }, M
k ε p˜ n,i−1 (da|xn−1 ) I{a ∈ A } p˜ n,k (da|xn−1 ) q (a) + ε q xn−1 A xn−1 (a) + ε i=1 A k I{a ∈ A } p˜ n,k (da|xn−1 ) = pn,i A q xn−1 (a) + ε i=1 = e−n (t,xn−1 ) πnM ( A |xn−1 , t)dt pn (dξ|xn−1 )
(
k i=0 τi ,
k+1 i=0 τi ]
∀ k = 0, 1, . . . ,
∞ recall (4.6). Since pn ( i=1 τi = ∞|xn−1 ) = 1, it follows from the above that the left-hand side of (4.13) equals k ∞ ε p˜ n,i−1 (da|xn−1 ) p˜ n,k (da|xn−1 ) qxn−1 (a) + ε A q xn−1 (a) + ε k=0 i=1 A e−n (t,xn−1 ) πnM ( A |xn−1 , t)dt pn (dξ|xn−1 ) = (0,∞) e−n (t,xn−1 ) πnM ( A |xn−1 , t)dt, = (0,∞)
and S ,0 π ,0 m γ,n = m γ,n . P
M
(b) For a fixed Poisson-related strategy S P ∈ SεP with the associated stochastic kernel pn , we define the Markov π-strategy in the following way: πnM ( A |xn−1 , s) ∞ ξ e− (0,s] qxn−1 (u)du δαk ( A )I{τ0 + . . . + τk < s :=
k=0
≤ τ0 + . . . + τk+1 } pn (dξ|xn−1 )
e−
1
ξ (0,s] q xn−1 (u)du
pn (dξ|xn−1 )
for each xn−1 ∈ X, A ∈ B(A). Note that πnM (A|xn−1 , s) ≡ 1 because, for each s ∈ R+ , I{τ0 + . . . + τk < s ≤ τ0 + . . . + τk+1 } xn−1 pn k=0 ∞ = pn s < τk xn−1 = 1, ∞
k=0
since pn ( ∞ k=0 τk = ∞|x n−1 ) = 1. Firstly, let us prove that, for all n = 0, 1, 2, . . ., the following joint distributions coincide PγS (n ∈ R , X n ∈ X ) = Pγπ (n ∈ R , X n ∈ X ), ¯ + ), X ∈ B(X∞ ). ∀ R ∈ B(R P
M
(4.14)
Note that P
P
PγS (n = ∞, X n = x∞ ) = 1 − PγS (n < ∞, X n ∈ X), Pγπ (n = ∞, X n = x∞ ) = 1 − Pγπ (n < ∞, X n ∈ X). M
M
Therefore, it is sufficient to consider only R ∈ B(R+ ) and X ∈ B(X). Formula (4.14) is valid at n = 0 because the initial distribution γ is fixed, and we always put 0 := 0. We suppose it holds for some n − 1 ≥ 0 and prove that, for all xn−1 ∈ X, R ∈ B(R+ ) and X ∈ B(X), G πn (R × X |xn−1 ) = G nS (R × X |xn−1 ). M
P
(4.15)
As before, the stochastic kernels G πn defined by (1.13) and G nS defined by (4.1) depend only on the state xn−1 ∈ X. Since ξ e− (0,s] qxn−1 (u)du pn (dξ|xn−1 ) ξ = 1− qxξn−1 (t)e− (0,t] qxn−1 (u)du pn (dξ|xn−1 )dt, M
(0,s]
the equality
P
ξ d ln e− (0,s] qxn−1 (u)du pn (dξ|xn−1 ) ds ξ ξ − qxn−1 (s)e− (0,s] qxn−1 (u)du pn (dξ|xn−1 ) = −qxn−1 (πnM , s) = − qxξ (u)du (0,s] n−1 pn (dξ|xn−1 ) e holds for almost all s ∈ R+ . Thus, for all θ ∈ R+ ,
e−
ln
ξ
(0,θ]
qxn−1 (u)du
pn (dξ|xn−1 ) = −
qxn−1 (πnM , s)ds
(0,θ]
= −n (θ, xn−1 ) and e−n (θ,xn−1 ) =
e−
ξ
(0,θ]
qxn−1 (u)du
pn (dξ|xn−1 ).
(4.16)
Therefore, according to (1.13) and (4.1), G πn (R × X |xn−1 ) = M
R
q ξ ( X |xn−1 , θ)e−
ξ
(0,θ]
qxn−1 (u)du
P
× pn (daξ|xn−1 )dθ = G nS (R × X |xn−1 ). Formula (4.15) is proved, and hence so is (4.14) by induction. Now, according to (1.31), (4.14) and (4.16), π ,0 ( X × A ) = m γ,n M
Pγπ (X n−1 ∈ d x) M
X
×πnM ( A |xn−1 , θ)dθ P = PγS (X n−1 ∈ d x) X
×
∞
e−n (θ,xn−1 ) (0,∞)
(0,∞)
e−
ξ
(0,θ]
qxn−1 (u)du
δαk ( A )I{τ0 + . . . + τk < θ ≤ τ0 + . . . + τk+1 }
k=0 S ,0 ( X × A ), × pn (dξ|xn−1 )dθ = m γ,n P
where the last equality is by (4.4).
In Sect. 6.2.2, we provide the proof of Theorem 4.1.1 explicitly for an arbitrary $\alpha\ge 0$ in a more general framework.
In view of Theorems 1.3.1 and 4.1.1 (see also (1.34)), problems (1.15) and (1.16) are equivalent to the problems
\[
\text{minimize}\quad\sum_{n=1}^\infty\int_{\mathbf{X}\times\mathbf{A}}c_0(x,a)\,m_n(dx\times da)\quad\text{over } m=\{m_n\}_{n=1}^\infty\in\mathcal{D}_\varepsilon^P \tag{4.17}
\]
and
\[
\begin{aligned}
&\text{minimize}\quad\sum_{n=1}^\infty\int_{\mathbf{X}\times\mathbf{A}}c_0(x,a)\,m_n(dx\times da)\quad\text{over } m=\{m_n\}_{n=1}^\infty\in\mathcal{D}_\varepsilon^P\\
&\text{subject to}\quad\sum_{n=1}^\infty\int_{\mathbf{X}\times\mathbf{A}}c_j(x,a)\,m_n(dx\times da)\le d_j,\quad j=1,2,\ldots,J
\end{aligned}
\tag{4.18}
\]
correspondingly. The value of $\varepsilon>0$ can be chosen arbitrarily. For the optimal Poisson-related strategy for problem (4.17) or (4.18), the corresponding (in the sense of Theorem 4.1.1) Markov π-strategy will be optimal for problem (1.15) or (1.16), respectively.
Theorem 4.1.2 Suppose $S^P$ is a stationary Poisson-related strategy with the stochastic kernel $\tilde p$. Then the following assertions hold true.
(a) The detailed occupation measures (4.4) for $S^P$, for all $n\in\mathbb{N}$, $\Gamma_X\in\mathcal{B}(\mathbf{X})$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$, have the form
\[
m_{\gamma,n}^{S^P,\alpha}(\Gamma_X\times\Gamma_A)=E_\gamma^{S^P}\left[\mathbb{I}\{X_{n-1}\in\Gamma_X\}\,e^{-\alpha T_{n-1}}\,\frac{\displaystyle\int_{\Gamma_A}\frac{\tilde p(da|X_{n-1})}{\alpha+q_{X_{n-1}}(a)+\varepsilon}}{\displaystyle\int_{\mathbf{A}}\frac{(\alpha+q_{X_{n-1}}(a))\,\tilde p(da|X_{n-1})}{\alpha+q_{X_{n-1}}(a)+\varepsilon}}\right].
\]
As usual, here $\frac{0}{0}:=0$, $\frac{d}{0}=\infty$ for $d>0$, and $0\times\infty=0$.
(b) For all $n\in\mathbb{N}$, the following statements hold for the stationary π-strategy $\pi^s$ defined by $\pi^s(da|x)=\tilde p(da|x)$.
(i) For each $\Gamma_X\in\mathcal{B}(\mathbf{X})$,
\[
\lim_{\varepsilon\to\infty}P_\gamma^{S^P}(X_{n-1}\in\Gamma_X)=P_\gamma^{\pi^s}(X_{n-1}\in\Gamma_X)
\]
and, for all measurable bounded functions $f_R$ and $f_X$ on $\mathbb{R}_+$ and on $\mathbf{X}$, respectively,
\[
\lim_{\varepsilon\to\infty}E_\gamma^{S^P}\big[\mathbb{I}\{X_n\in\mathbf{X}\}f_R(\Theta_n)f_X(X_n)\big]=E_\gamma^{\pi^s}\big[\mathbb{I}\{X_n\in\mathbf{X}\}f_R(\Theta_n)f_X(X_n)\big].
\]
(ii) Suppose Condition 1.3.1 is satisfied, and for each x ∈ X , α + inf qx (a) ≥ δ > 0. a∈A
Then, for each nonnegative bounded measurable function g on X × A, lim
ε→∞ X ×A
S P ,α g(x, a)m γ,n (d x
× da) =
π ,α g(x, a)m γ,n (d x × da). s
X ×A
(c) For all n ∈ N, the following statements hold for the stationary standard ξstrategy s defined by ˜ s (da|x) = p(da|x). (i) For each X ∈ B(X), lim PγS (X n−1 ∈ X ) = Pγ (X n−1 ∈ X ) P
s
ε→0
and, for all bounded measurable functions f R and f X on R+ and on X, respectively,
=
P lim EγS [I{X n ∈ X} f R (n ) f X (X n )] ε→0 s Eγ [I{X n ∈ X} f R (n ) f X (X n )].
(ii) Suppose Condition 1.3.1 is satisfied, and for each x ∈ X , α + inf qx (a) ≥ δ > 0. a∈A
Then, for each nonnegative bounded measurable function g on X × A, lim
ε→0 X ×A
S ,α g(x, a)m γ,n (d x × da) = P
,α g(x, a)m γ,n (d x × da). s
X ×A
Proof (a) The formula to be proved follows from (4.4) and Corollary 4.1.1 under g(a, x) = I{a ∈ A }. (b) We will prove by induction with respect to n that for each α ≥ 0,
P lim EγS I{X n−1 ∈ X}e−αTn−1 Fε (X n−1 ) ε→∞ s = Eγπ I{X n−1 ∈ X}e−αTn−1 F∞ (X n−1 ) , n = 1, 2, . . .
(4.19)
for each nonnegative bounded measurable function Fε (x) on X × R+ (with argument (x, ε)), such that the limit limε→∞ Fε (x) = F∞ (x) exists for all x ∈ X. If n = 1, then T0 = 0 and the statement follows from the Dominated Convergence Theorem. Suppose formula (4.19) holds for some n ≥ 1 and consider the case of n + 1. P According to (4.1), the conditional expectation EγS [I{X n ∈ X}e−αn Fε (X n )|Hn−1 ] given Hn−1 = h n−1 ∈ Hn−1 with xn−1 ∈ X equals
e−αθ−
ξ
(0,θ]
qxn−1 (s)ds
(0,∞)
∞
Fε (y) q (dy|xn−1 , αk )
X
k=0
×I{τ0 + . . . + τk < θ ≤ τ0 + . . . + τk+1 }dθ pn (dξ|xn−1 ) q (dy|xn−1 , a) X Fε (y) p(da|x ˜ n−1 ) α + q xn−1 (a) + ε A =: Fε (xn−1 ) = (α + qxn−1 (a)) p(da|x ˜ n−1 ) A α + q xn−1 (a) + ε due to Corollary 4.1.1, where g(a, xn−1 ) := The limit, for each xn−1 ∈ X, lim Fε (xn−1 ) = lim
ε→∞
ε A
ε→∞
=
A
X
X
(4.20)
Fε (y) q (dy|xn−1 , a).
Fε (y) q (dy|xn−1 , a) p(da|x ˜ n−1 ) α + qxn−1 (a) + ε ε(α + qxn−1 (a)) p(da|x ˜ n−1 ) α + qxn−1 (a) + ε A X
F∞ (y) q (dy|xn−1 , a) p(da|x ˜ n−1 ) (α + qxn−1 (a)) p(da|x ˜ n−1 ) A
(xn−1 ) =: F∞
was calculated using the dominated convergence theorem. If we have
(α + qxn−1 (a)) p(da|x ˜ n−1 ) = 0, then Fε (x n−1 ) = 0 for all ε > 0 and F∞ (x n−1 ) A 0 πs = 0 = 0, too. That limit coincides with the conditional expectation Eγ [I{X n ∈ X}e−αn F∞ (X n )|Hn−1 ] given Hn−1 = h n−1 ∈ Hn−1 with xn−1 ∈ X, according
(x) are nonnegative, meato (1.13). Note also that the functions Fε (x) and F∞ surable and bounded on X × R+ and on X respectively under the imposed conditions. Now
I{X n ∈ X}e−αTn Fε (X n ) ε→∞ P = lim EγS I{X n−1 ∈ X}e−αTn−1 ε→∞ P ×EγS I{X n ∈ X}e−αn Fε (X n ) Hn−1 P = lim EγS I{X n−1 ∈ X}e−αTn−1 Fε (X n−1 ) ε→∞ s
(X n−1 ) , = Eγπ I{X n−1 ∈ X}e−αTn−1 F∞ lim EγS
P
according to the induction hypothesis, and s Eγπ I{X n ∈ X}e−αTn F∞ (X n ) s s = Eγπ I{X n−1 ∈ X}e−αTn−1 Eγπ I{X n ∈ X}e−αn F∞ (X n ) Hn−1 s
(X n−1 ) . = Eγπ I{X n−1 ∈ X}e−αTn−1 F∞ Formula (4.19) is proved. The first formula of Item (i) follows from (4.19) under α = 0 and Fε (x) := I{x ∈ X }. The next target is to show that, for the fixed nonnegative bounded measurable functions f R on R and f X on X, and for a fixed h n−1 ∈ Hn−1 with xn−1 ∈ X,
f R (θ) f X (y)G εn (dθ × dy|h n−1 ) =
lim
ε→∞ R+ ×X
f R (θ) f X (y)G n (dθ × dy|h n−1 ). R+ ×X
(4.21) Here, the stochastic kernel G εn is defined in (4.1) and G n is given by (1.13) at πn = π s . Note that ε f R (θ) f X (y)G n (dθ × dy|h n−1 ) = f R (θ)κε (θ)dθ, R+ ×X
R+
where ε
κ (θ) =
∞ k=0
f X (y) q (dy|xn−1 , αk )I{τ0 + . . . + τk < θ
X
≤ τ0 + . . . + τk+1 }e−
ξ
(0,θ]
qxn−1 (s)ds
pn (dξ|xn−1 ).
Similarly,
R+ ×X
where
f R (θ) f X (y)G n (dθ × dy|h n−1 ) =
R+
f R (θ)κ∞ (θ)dθ,
κ∞ (θ) :=
f X (y) q (dy|xn−1 , π s )e−qxn−1 (π
s
)θ
X
and
∞
R+
κ (θ)dθ =
X
f X (y) q (dy|xn−1 , π s ) < ∞. qxn−1 (π s )
The last integral equals zero if qxn−1 (π s ) = 0 and does not exceed supx∈X f X (x). According to Corollary 4.1.1,
f X (y) q (dy|xn−1 , a) p(da|x ˜ n−1 ) α + q xn−1 (a) + ε A e−αθ κε (θ)dθ = (α + qxn−1 (a)) R+ p(da|x ˜ n−1 ) α + qxn−1 (a) + ε A X
and
e−αθ κε (θ)dθ ⎧ ⎫ ε X f X (y) q (dy|xn−1 , a) ⎪ ⎪ ⎪ ⎪ ⎪ p(da|x ˜ n−1 ) ⎪ ⎨ ⎬ α + qxn−1 (a) + ε A (4.22) = lim ε(α + qxn−1 (a)) ε→∞ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ p(da|x ˜ n−1 ) ⎩ ⎭ A α + q xn−1 (a) + ε = f X (y) q (dy|xn−1 , π s ) (α + qxn−1 (π s )) = e−αθ κ∞ (θ)dθ. lim
ε→∞ R +
R+
X
equalities are also valid if α + qxn−1 (π s ) = 0: in that case All these −αθ ε κ (θ)dθ ≡ 0, X f X (y) q (dy|xn−1 , π s ) = 0 and 00 = 0 as usual. R+ e Since α ≥ 0 can be arbitrary, we conclude that lim
ε→∞ R +
ε
f R (θ)κ (θ)dθ =
R+
f R (θ)κ∞ (θ)dθ
(4.23)
for every bounded continuous function f R on R+ . (See Proposition B.1.18.) Clearly, κε (θ), κ∞ (θ) ≤ sup y∈X f X (y) supa∈A qxn−1 (a), so that all the conditions of Corollary B.1.2 are fulfilled. Hence (4.23) holds also for all bounded measurable functions f R on (R+ , B(R+ )), and equality (4.21) is proved. Finally, note that for all bounded nonnegative measurable functions f R on R and f X on X, ∀ε > 0,
R+ ×X
f R (θ) f X (y)G εn (dθ × dy|h n−1 ) ≤ sup f R (θ) sup f X (y) θ∈R+
y∈X
and the stochastic kernels G εn and G n depend only on xn−1 , so that below we write them as G εn (·|xn−1 ) and G n (·|xn−1 ). Using the first formula Item (i) as established earlier, by Proposition B.1.14 we obtain that, for all measurable nonnegative bounded functions f R and f X , P
lim EγS [I{X n ∈ X} f R (n ) f X (X n )] SP = lim P (X n−1 ∈ d x) f R (θ) f X (y)G εn (dθ × dy|X n−1 ) ε→∞ X γ R+ ×X πs = Pγ (X n−1 ∈ d x) f R (θ) f X (y)G n (dθ × dy|X n−1 ) ε→∞
R+ ×X
X
= Eγπ [I{X n ∈ X} f R (n ) f X (X n )]. s
The generalization for arbitrary measurable bounded functions f R and f X is straightforward. (ii) Let us extend the function g to X × A by putting g(, a) ≡ 0. Then, according to Item (a), X ×A
S P ,α g(x, a)m γ,n (d x
⎡
P ⎢ = EγS ⎢ ⎣I{X n−1
S ,α × da) = g(x, a)m γ,n (d x × da) X×A ⎤ g(X n−1 , a) ) p(da|X ˜ n−1 ⎥ A α + q X n−1 (a) + ε ⎥. ∈ X}e−αTn−1 ⎦ (α + q X n−1 (a)) p(da|X ˜ n−1 ) A α + q X n−1 (a) + ε P
Formula (4.19) under the nonnegative bounded measurable function Fε (x) := A
g(x, a) p(da|x) ˜ α + qx (a) + ε
A
(α + qx (a)) p(da|x) ˜ α + qx (a) + ε
takes the form s S P ,α lim g(x, a)m γ,n (d x × da) = Eγπ I{X n−1 ∈ X}e−αTn−1 F∞ (X n−1 ) , ε→∞ X×A
where '
( εg(x, a) ε(α + qx (a)) F∞ (x) = lim p(da|x) ˜ p(da|x) ˜ ε→∞ A α + q x (a) + ε A α + q x (a) + ε g(x, a) p(da|x) ˜ (α + qx (a)) p(da|x). ˜ = A
A
The last expression is well defined for x = and equals 0 when x = , recall the convention 00 := 0.
According to (1.32), π s ,α π s ,α g(x, a)m γ,n (d x × da) = g(x, a)m γ,n (d x × da) X ×A X×A ⎤ ⎡ s g(X , a)π (da|X ) n−1 n−1 ⎥ s ⎢ −αTn−1 A ⎥ = Eγπ ⎢ ⎦ ⎣I{X n−1 ∈ X}e s α + q X n−1 (a)π (da|X n−1 ) A s = Eγπ I{X n−1 ∈ X}e−αTn−1 F∞ (X n−1 ) .
Statement (ii) is proved. (c) The proof follows the same steps as that of Item (b). Formula (4.19) looks like the following P lim EγS I{X n−1 ∈ X}e−αTn−1 Fε (X n−1 ) ε→0 s = Eγ I{X n−1 ∈ X}e−αTn−1 F0 (X n−1 ) , n = 1, 2, . . . ,
(4.24)
where Fε (x) is a nonnegative measurable bounded function on X × R+ such that the limit limε→0 Fε (x) = F0 (x) exists for all x ∈ X. For the proof (as before, by induction) note that the function Fε (xn−1 ) is exactly the same, given by (4.20), and F0 (y) q (dy|xn−1 , a) p(da|x ˜ I{α + qxn−1 (a) = 0} X lim Fε (xn−1 ) = n−1 ) ε→0 α + qxn−1 (a) A =: F0 (xn−1 ). ˜ If α + qxn−1 (a) = 0 for p(·|x n−1 )-almost all a ∈ A, Fε (xn−1 ) ≡ 0 = F0 (xn−1 ). The conditional expectations EγS [I{X n ∈ X}e−αn Fε (X n )|Hn−1 ], Eγ [I{X n ∈ X}e−αn F0 (X n )|Hn−1 ], P
s
given Hn−1 = h n−1 ∈ Hn−1 with xn−1 ∈ X, equal Fε (xn−1 ) and F0 (xn−1 ) correspondingly. (See (1.12).) The functions Fε (xn−1 ) and F0 (xn−1 ) are nonnegative, measurable and bounded on X × R+ and on X by sup(x,ε)∈X×R+ Fε (x) and supx∈X F0 (x) respectively. Finally, Eγ [I{X n ∈ X}e−αTn F0 (X n )] = Eγ [I{X n−1 ∈ X}e−αTn−1 F0 (X n−1 )] s
and
s
lim EγS [I{X n ∈ X}e−αTn Fε (X n )] P
ε→0
= lim EγS [I{X n−1 ∈ X}e−αTn−1 Fε (X n−1 ) P
=
ε→0 s Eγ [I{X n−1
∈ X}e−αTn−1 F0 (X n−1 )]
by induction. Formula (4.24) is proved. The first formula in Item (i) follows from (4.24) under α = 0 and Fε (x) = I{x ∈ X }. The analogue of the equality (4.21) is lim
ε→0 R ×X +
=
R+ ×X
f R (θ) f X (y)G εn (dθ × dy|h n−1 )
f R (θ) f X (y)G n (dθ × dy|h n−1 ),
˜ The proof is absolutely the same; the where G n is given by (1.12) at s = p. key formula (4.22) looks as follows
e−αθ κε (θ)dθ ⎧ ⎫ q (dy|xn−1 , a) ⎪ ⎪ X f X (y) ⎪ ⎪ ⎪ p(da|x ˜ n−1 ) ⎪ ⎨ ⎬ α + qxn−1 (a) + ε A = lim (α + qxn−1 (a)) ε→0 ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ p(da|x ˜ n−1 ) ⎩ ⎭ A α + q xn−1 (a) + ε f X (y) q (dy|xn−1 , a) = p(da|x ˜ I{α + qxn−1 (a) = 0} X n−1 ) α + qxn−1 (a) A = e−αθ κ0 (θ)dθ, lim
ε→0 R +
R+
where
f X (y) q (dy|xn−1 , a)e−qxn−1 (a)θ s (da|xn−1 ).
κ (θ) := 0
A
X
−αθ ε ˜ κ (θ)dθ ≡ 0 = R+ If α + qxn−1 (a) = 0 for p(·|x n−1 )-almost all a ∈ A, R+ e e−αθ κ0 (θ)dθ. The ending of the proof coincides with that presented above. (ii) As before, we put g(, a) ≡ 0 and consider the same function Fε (x). Then, by (4.24), lim
ε→0 X×A
S ,α g(x, a)m γ,n (d x × da) = Eγ [I{X n−1 ∈ X}e−αTn−1 F0 (X n−1 )], P
s
where F0 (x) = A
g(x, a) p(da|x) ˜ . qx (a) + α
The last expression is well defined for x = and equals Finally, X×A
0 0
= 0 when x = .
,α −αTn−1 g(x, a)m F0 (X n−1 )] γ,n (d x × da) = Eγ [I{X n−1 ∈ X}e s
s
,α by the definition of the detailed occupation measure m γ,n . s
4.2 Reduction to DTMDP

In this section, we show that solving an (unconstrained or constrained) optimal control problem for an undiscounted model, in the class $\mathbf{S}^P_\varepsilon$ of Poisson-related strategies under a fixed value $\varepsilon > 0$, is equivalent to solving the control problem for the corresponding Discrete-Time Markov Decision Process (DTMDP) in a specific class of control strategies which includes all stationary strategies. Actually, by Theorem 4.1.1, solving the original CTMDP problem is equivalent to solving the problem over Poisson-related strategies. Moreover, for an arbitrary strategy in that DTMDP, there is a strategy in $\mathbf{S}^P_\varepsilon$ (and also a Markov π-strategy) leading to the same objective values; this statement will be proved in Sect. 6.3 in the most general framework. Restriction to the class $\mathbf{S}^P_\varepsilon$ does not lead to any loss of generality due to Theorem 4.1.1. After that, we use the known facts for DTMDP models to obtain the corresponding results for CTMDP models. Under the compactness-continuity conditions, if all the cost rates are nonnegative in the constrained problem, then there exists an optimal stationary Poisson-related strategy, as well as an optimal stationary standard ξ-strategy and an optimal stationary π-strategy. Everywhere in Sects. 4.2.1, 4.2.2, and 4.2.3, $\alpha = 0$, and $\varepsilon > 0$ is arbitrarily fixed. In Sect. 4.2.4, we point out that, in the simpler case of strongly positive jump intensities, one can reduce the CTMDP model to DTMDP more straightforwardly without reference to the Poisson-related strategies. The discounted model can be treated as a special case in Sect. 4.2.4.
4.2.1 Description of the Concerned DTMDP

For the given CTMDP $(X, A, q, \{c_j\}_{j=0}^J)$, we introduce the following DTMDP $(X, A, p, \{l_j\}_{j=0}^J)$:
• $X$ and $A$ are the state and action spaces, the same as those in the CTMDP;
• the transition probability is given by $p(\Gamma|x,a) = \dfrac{\tilde q(\Gamma|x,a) + \varepsilon I\{x\in\Gamma\}}{q_x(a)+\varepsilon}$;
• the cost functions are $l_j(x,a) = \dfrac{c_j(x,a)}{q_x(a)+\varepsilon}$, $j = 0, 1, \ldots, J$.
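The construction above can be sketched for a finite model. The following is only an illustration under stated assumptions: states and actions are indexed by integers, the hypothetical array `q_tilde[x][a][y]` holds the jump intensity from `x` to `y` (so the total intensity is the row sum over `y != x`), and the function name `ctmdp_to_dtmdp` and the sample numbers are not from the text.

```python
def ctmdp_to_dtmdp(q_tilde, cost_rates, eps):
    """Build the DTMDP kernel p and one-step costs l from finite CTMDP data.

    q_tilde[x][a][y]: jump intensity from x to y (entries with y == x are ignored),
    cost_rates[x][a]: cost rate c(x, a),
    eps: the fixed parameter epsilon > 0.
    """
    if eps <= 0:
        raise ValueError("eps must be strictly positive")
    p, l = [], []
    for x in range(len(q_tilde)):
        p_x, l_x = [], []
        for a, rates in enumerate(q_tilde[x]):
            # total jump intensity q_x(a) out of state x under action a
            q_xa = sum(r for y, r in enumerate(rates) if y != x)
            denom = q_xa + eps
            row = [(r if y != x else 0.0) / denom for y, r in enumerate(rates)]
            row[x] += eps / denom  # the epsilon-part keeps the state unchanged
            p_x.append(row)
            l_x.append(cost_rates[x][a] / denom)
        p.append(p_x)
        l.append(l_x)
    return p, l


# Hypothetical 3-state, 2-action model; state 2 is absorbing and costless.
q_tilde = [
    [[0.0, 2.0, 1.0], [0.0, 0.5, 0.5]],
    [[0.5, 0.0, 1.5], [1.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
cost_rates = [[3.0, 1.0], [1.0, 2.0], [0.0, 0.0]]
p, l = ctmdp_to_dtmdp(q_tilde, cost_rates, eps=1.0)
```

Each row of `p` sums to one because the numerator $\tilde q(X|x,a) + \varepsilon$ equals the denominator; a state with zero jump intensity becomes absorbing in the DTMDP.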
Remember, $\varepsilon > 0$ is arbitrarily fixed in the current section. For a strategy $\sigma$ in the described DTMDP with the same initial distribution $\gamma$ as in the CTMDP model, the corresponding strategic measure is denoted by $P^\sigma_\gamma$. The expectation taken with respect to $P^\sigma_\gamma$ is written as $E^\sigma_\gamma$. The collection of all control strategies in the introduced DTMDP is denoted by $\Sigma$. Below, the controlled and controlling processes in the DTMDP are denoted as $\{Y_i\}_{i=0}^\infty$ and $\{B_i\}_{i=1}^\infty$ to distinguish them from the corresponding elements of the CTMDP. The necessary knowledge on DTMDP is presented in Appendix C.

Lemma 4.2.1 For each Poisson-related strategy $S^P$ in the CTMDP, there exists a strategy $\sigma$ in the DTMDP such that for all nonnegative bounded measurable functions $f$ on $X \times A$,

$$\sum_{n=1}^\infty \int_{X\times A} f(x,a)\, m^{S^P,0}_{\gamma,n}(dx\times da) = E^\sigma_\gamma\left[\sum_{i=1}^\infty \frac{f(Y_{i-1},B_i)}{q_{Y_{i-1}}(B_i)+\varepsilon}\right] \tag{4.25}$$
for each initial distribution $\gamma$.

Proof Assume that a Poisson-related strategy $S^P$ is fixed. Let us construct the required control strategy $\sigma$ in the DTMDP. To this end, we must decide, for each sequence $h^d_{m-1} = (y_0, b_1, \ldots, y_{m-1}) \in (X \times A)^{m-1} \times X$, which stochastic kernel to apply at the discrete time index $m-1$, out of $\tilde p_{n,k}$, $n = 1, 2, \ldots$; $k = 0, 1, 2, \ldots$. We equip the histories in the DTMDP with the upper index $d$ and denote the realized states and actions in the DTMDP as $y_i$ and $b_j$ to distinguish them from those in the CTMDP, denoted as $x_n$ and $\alpha^n_k$. Here $\xi_n = (\tau^n_0, \alpha^n_0, \tau^n_1, \alpha^n_1, \ldots) \in \Xi$ is the realization of the random element having the distribution $p_n(d\xi|x_{n-1})$. The history $h^d_{m-1}$ in the DTMDP will correspond to a sequence $(x_0, x_1, \ldots, x_{n-1})$ in the CTMDP in the following way. Based on the history $h^d_{m-1}$ with $m > 1$, one can calculate $n(h^d_{m-1}) - 1$, the number of changes in states in $(y_0, y_1, \ldots, y_{m-1})$. Then $t_{n(h^d_{m-1})-1}$ is the jump moment in the CTMDP that results in the post-jump state $y_{m-1}$. In greater detail,

$$l_1(h^d_{m-1}) := \min\{l \ge 1 :\ l \le m-1;\ y_l \ne y_{l-1}\} \wedge m$$

is the first discrete time index when the value of $y_l$ changes (if $l_1(h^d_{m-1}) < m$). It corresponds to $\theta_1 = t_1$ in the CTMDP.
If $l_1(h^d_{m-1}) = m$, then the jump epoch $t_1$ is not yet reached. In this case, we put $n(h^d_{m-1}) := 1$, $k(h^d_{m-1}) := l_1(h^d_{m-1}) - 1 = l_{n(h^d_{m-1})}(h^d_{m-1}) - 1$, and apply the stochastic kernel $\tilde p_{n(h^d_{m-1}),k(h^d_{m-1})}$ at the discrete time index $m-1$. If $l_1(h^d_{m-1}) < m$, we continue the calculations:

$$l_2(h^d_{m-1}) := \min\{l \ge 1 :\ l \le m - l_1(h^d_{m-1}) - 1;\ y_{l_1(h^d_{m-1})+l} \ne y_{l_1(h^d_{m-1})+l-1}\} \wedge \big(m - l_1(h^d_{m-1})\big)$$

is such that $l_1(h^d_{m-1}) + l_2(h^d_{m-1})$ is the second discrete time index when the value of $y_l$ changes (if $l_2(h^d_{m-1}) < m - l_1(h^d_{m-1})$). It corresponds to $\theta_1 + \theta_2 = t_2$ in the CTMDP. If $l_2(h^d_{m-1}) = m - l_1(h^d_{m-1})$, then the jump epoch $t_2$ is not yet reached, and we put $n(h^d_{m-1}) := 2$, $k(h^d_{m-1}) := l_2(h^d_{m-1}) - 1 = l_{n(h^d_{m-1})}(h^d_{m-1}) - 1$, and the stochastic kernel $\tilde p_{n(h^d_{m-1}),k(h^d_{m-1})}$ is applied at the discrete time index $m-1$.

And so on. In general, for $m - 1 \ge k \ge 1$, if $l_k(h^d_{m-1}) = m - \sum_{i=1}^{k-1} l_i(h^d_{m-1})$, then the jump epoch $t_k$ is not yet reached, we put $n(h^d_{m-1}) := k$, $k(h^d_{m-1}) := l_k(h^d_{m-1}) - 1 = l_{n(h^d_{m-1})}(h^d_{m-1}) - 1$, and the stochastic kernel $\tilde p_{n(h^d_{m-1}),k(h^d_{m-1})}$ is applied at the discrete time index $m-1$. Otherwise,

$$l_{k+1}(h^d_{m-1}) := \min\Big\{l \ge 1 :\ l \le m - \sum_{i=1}^{k} l_i(h^d_{m-1}) - 1;\ y_{\sum_{i=1}^{k} l_i(h^d_{m-1})+l} \ne y_{\sum_{i=1}^{k} l_i(h^d_{m-1})+l-1}\Big\} \wedge \Big(m - \sum_{i=1}^{k} l_i(h^d_{m-1})\Big).$$

For the finite sequence $h^d_{m-1}$, this procedure will terminate within finitely many steps, so that $n(h^d_{m-1})$ is well defined and

$$\sum_{i=1}^{n(h^d_{m-1})} l_i(h^d_{m-1}) = m.$$

After the values $n(h^d_{m-1})$ and $k(h^d_{m-1})$ are calculated, in the DTMDP, at the discrete time index $m-1$, we apply the randomized control

$$\sigma_m(da|h^d_{m-1}) := \tilde p_{n(h^d_{m-1}),k(h^d_{m-1})}(da|y_{m-1}). \tag{4.26}$$
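The bookkeeping behind (4.26) amounts to splitting the state sequence $(y_0, \ldots, y_{m-1})$ into maximal runs of equal values. A minimal sketch, assuming the states are given as a Python list (the actions $b_i$ play no role); the function name `decode_history` is hypothetical:

```python
def decode_history(states):
    """Return (n, k, run_lengths) for a DTMDP state history (y_0, ..., y_{m-1}).

    run_lengths plays the role of (l_1, ..., l_n); n is the index of the current
    sojourn and k the number of fictitious transitions already made within it,
    so that the kernel with indices (n, k) is applied at time index m - 1.
    """
    m = len(states)
    if m == 0:
        raise ValueError("the history must contain at least y_0")
    run_lengths = []
    start = 0
    for idx in range(1, m):
        if states[idx] != states[idx - 1]:  # a real jump of the CTMDP
            run_lengths.append(idx - start)
            start = idx
    run_lengths.append(m - start)  # the current, unfinished sojourn
    n = len(run_lengths)
    k = run_lengths[-1] - 1
    return n, k, run_lengths


# State patterns consistent with the two scenarios of Fig. 4.1 (labels are arbitrary).
scenario_a = decode_history(["s0", "s0", "s0", "s1", "s1", "s2", "s3"])
scenario_b = decode_history(["s0", "s1", "s1", "s2", "s2"])
```

For these inputs the run lengths are (3, 2, 1, 1) and (1, 2, 2), matching the values of $l_i$, $n$ and $k$ quoted in the caption of Fig. 4.1.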
Figure 4.1 illustrates this construction and the connection between the histories $h^d_{m-1}$ and the trajectories of the CTMDP.

Fig. 4.1 Two scenarios illustrating the construction and the connection between the histories $h^d_{m-1}$ and the trajectories of the CTMDP: (a) $m = 7$; $h^d_6 = (y_0, b_1, \ldots, y_6)$; $l_1(h^d_6) = 3$, $l_2(h^d_6) = 2$, $l_3(h^d_6) = 1$, $l_4(h^d_6) = 1$, $n(h^d_6) = 4$, $k(h^d_6) = 0$; (b) $m = 5$; $h^d_4 = (y_0, b_1, \ldots, y_4)$; $l_1(h^d_4) = 1$, $l_2(h^d_4) = 2$, $l_3(h^d_4) = 2$, $n(h^d_4) = 3$, $k(h^d_4) = 1$

Schematically, the history $h^d_{m-1}$ in the DTMDP corresponds to the sequence $(x_0, x_1, \ldots, x_{n-1})$ in the CTMDP in the following way, where elements $b_1, b_2, \ldots, b_{m-1}$ are of no importance and are not indicated, and the numbers $l_i(h^d_{m-1})$ are written just as $l_i$ for brevity:

$$\underbrace{y_0 = y_1 = \ldots}_{x_0;\ \text{first sojourn time}} \ne \underbrace{y_{l_1} = \ldots}_{x_1;\ \text{second sojourn time}} \ne \underbrace{y_{l_1+l_2} = \ldots}_{x_2} \ne \ldots \ne \underbrace{y_{\sum_{i=1}^{n(h^d_{m-1})-1} l_i} = \ldots = y_{m-1}}_{x_{n(h^d_{m-1})-1};\ n(h^d_{m-1})\text{th sojourn time}}$$

All the values

$$y_{\sum_{i=1}^{n(h^d_{m-1})-1} l_i},\ y_{\sum_{i=1}^{n(h^d_{m-1})-1} l_i + 1},\ \ldots,\ y_{m-1}$$

coincide; they are in correspondence with

$$\tau_0^{n(h^d_{m-1})},\ \tau_1^{n(h^d_{m-1})},\ \ldots,\ \tau_{k(h^d_{m-1})}^{n(h^d_{m-1})},$$

where

$$k(h^d_{m-1}) = m - \sum_{i=1}^{n(h^d_{m-1})-1} l_i - 1,$$

and for such a history $h^d_{m-1}$, the stochastic kernel $\tilde p_{n(h^d_{m-1}),k(h^d_{m-1})}$ is applied.

On the space $H^d = (X \times A)^\infty$ of trajectories $h^d = (y_0, b_1, y_1, \ldots)$ in the DTMDP, we introduce the following sequence of functions

$$m_n(h^d) := \min\{m \in \{1, 2, \ldots\} :\ n(h^d_{m-1}) = n\}, \quad n = 1, 2, \ldots,$$

with $m_n(h^d) - 1$ denoting the (first and only) discrete time index when the stochastic kernel $\tilde p_{n,0}$ is applied. Note that if $m_n(h^d) < \infty$, then $m_{n+1}(h^d) > m_n(h^d)$. Also, $m_1(h^d) \equiv 1$ and the functions $m_n$ can take infinite values. If $m_n(h^d) = \infty$ for some $n \in \mathbb{N}$, then the minimal value $\hat n(h^d) = \min\{n \in \{1, 2, \ldots\} : m_n(h^d) = \infty\}$ is such that all the values $y_{m_{\hat n(h^d)-1}(h^d)-1+i}$, $i = 0, 1, 2, \ldots$, are the same. Clearly,

$$\{h^d : m_{n+1}(h^d) < \infty\} \subseteq \{h^d : m_n(h^d) < \infty\}, \quad n = 1, 2, \ldots.$$

We will prove by induction, with respect to $n$, the following equality:
$$E^\sigma_\gamma\left[\sum_{k=1}^\infty I\{m_n(H^d) = k\}\, I\{Y_{k-1} \in \Gamma_X\}\right] = P^{S^P}_\gamma(X_{n-1} \in \Gamma_X), \quad \forall\, \Gamma_X \in \mathcal{B}(X). \tag{4.27}$$

Here and below, capital letters $H^d$, $Y_m$ and $B_{m+1}$ denote the random elements in the DTMDP (the trajectory, the corresponding state and the action). $E^\sigma_\gamma$ is the mathematical expectation with respect to the strategic measure $P^\sigma_\gamma$ on the canonical sample space $H^d$. Equality (4.27) obviously holds for $n = 1$: both sides are equal to $\gamma(\Gamma_X)$. Suppose it holds for some $n \ge 1$ and fix an arbitrary $m \in \mathbb{N}$. Then, according to the definition of the strategy $\sigma$, for each $n \in \mathbb{N}$,

$$\begin{aligned}
&E^\sigma_\gamma\left[ I\{m_{n+1}(H^d) = m+1+k\}\, I\{Y_{m+k} \in \Gamma_X\}\, I\{m_n(H^d) = m\} \,\middle|\, H^d_{m-1}\right]\\
&\quad= \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon}\right) \int_A \frac{\tilde q(\Gamma_X|Y_{m-1},a)\, \tilde p_{n,k}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon} \times I\{m_n(H^d) = m\}, \quad \forall\, k = 0, 1, \ldots.
\end{aligned}$$
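Note, $\{m_n(H^d) = m\}$ is $\mathcal{B}(H^d_{m-1})$-measurable: for $n > 1$ this means that $y_{m-1} \ne y_{m-2}$, and this change of the state $y_i$ occurs for the $(n-1)$-th time. If $n = 1$, $m_1(h^d) \equiv 1$ for all $h^d \in H^d$ and thus $m_1(H^d)$ is $\mathcal{B}(H^d_0)$-measurable. For $j = 1, 2, \ldots, m$, $I\{m_{n+1}(h^d) = j\}\, I\{m_n(h^d) = m\} = 0$ for all $h^d \in H^d$. Therefore,

$$\begin{aligned}
&E^\sigma_\gamma\left[\sum_{j=1}^\infty I\{m_{n+1}(H^d) = j\}\, I\{Y_{j-1} \in \Gamma_X\}\, I\{m_n(H^d) = m\} \,\middle|\, H^d_{m-1}\right]\\
&\quad= \sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon}\right) \int_A \frac{\tilde q(\Gamma_X|Y_{m-1},a)\, \tilde p_{n,k}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon} \times I\{m_n(H^d) = m\}.
\end{aligned}$$

Hence,

$$\begin{aligned}
&E^\sigma_\gamma\left[\sum_{j=1}^\infty I\{m_{n+1}(H^d) = j\}\, I\{Y_{j-1} \in \Gamma_X\}\, I\{m_n(H^d) = m\}\right]\\
&\quad= \int_X \left\{\sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|y)}{q_y(a)+\varepsilon}\right) \int_A \frac{\tilde q(\Gamma_X|y,a)\, \tilde p_{n,k}(da|y)}{q_y(a)+\varepsilon}\right\} \mu_m(dy),
\end{aligned}$$

where

$$\mu_m(\Gamma_X) := E^\sigma_\gamma\left[I\{m_n(H^d) = m\}\, I\{Y_{m-1} \in \Gamma_X\}\right]. \tag{4.28}$$

According to the inductive supposition, for each $\Gamma_X \in \mathcal{B}(X)$,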
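$$\sum_{m=1}^\infty \mu_m(\Gamma_X) = P^{S^P}_\gamma(X_{n-1} \in \Gamma_X),$$

and, due to Proposition B.1.14,

$$\begin{aligned}
&E^\sigma_\gamma\left[I\{m_n(H^d) < \infty\} \sum_{j=1}^\infty I\{m_{n+1}(H^d) = j\}\, I\{Y_{j-1} \in \Gamma_X\}\right]\\
&\quad= \int_X \left\{\sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|y)}{q_y(a)+\varepsilon}\right) \int_A \frac{\tilde q(\Gamma_X|y,a)\, \tilde p_{n,k}(da|y)}{q_y(a)+\varepsilon}\right\} \sum_{m=1}^\infty \mu_m(dy)\\
&\quad= E^{S^P}_\gamma\left[I\{X_{n-1} \in X\} \sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|X_{n-1})}{q_{X_{n-1}}(a)+\varepsilon}\right) \int_A \frac{\tilde q(\Gamma_X|X_{n-1},a)\, \tilde p_{n,k}(da|X_{n-1})}{q_{X_{n-1}}(a)+\varepsilon}\right]\\
&\quad= P^{S^P}_\gamma(X_n \in \Gamma_X).
\end{aligned}$$

The last equality is by Corollary 4.1.3 and the construction of the measure $P^{S^P}_\gamma$. Since, for every $h^d \in H^d$, $I\{m_n(h^d) = \infty\}\, I\{m_{n+1}(h^d) < \infty\} = 0$, we conclude that

$$E^\sigma_\gamma\left[\sum_{j=1}^\infty I\{m_{n+1}(H^d) = j\}\, I\{Y_{j-1} \in \Gamma_X\}\right] = P^{S^P}_\gamma(X_n \in \Gamma_X),$$

and equality (4.27) is proved. By the way, equality (4.27) implies that $P^\sigma_\gamma(m_n(H^d) = +\infty) = P^{S^P}_\gamma(X_{n-1} = x_\infty)$.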
The next step is to show that, for every nonnegative bounded measurable function $f$ on $X \times A$, for all $n \in \mathbb{N}$

$$E^\sigma_\gamma\left[I\{m_n(H^d) < \infty\} \sum_{j=m_n(H^d)}^{m_{n+1}(H^d)-1} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right] = \int_{X\times A} f(x,a)\, m^{S^P,0}_{\gamma,n}(dx\times da). \tag{4.29}$$

According to the definition of the strategy $\sigma$, for each $m \in \mathbb{N}$
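$$\begin{aligned}
&E^\sigma_\gamma\left[I\{m_n(H^d) = m\}\, I\{m \le j < m_{n+1}(H^d)\}\, \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon} \,\middle|\, H^d_{m-1}\right]\\
&\quad= \left(\prod_{i=1}^{j-m} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon}\right) \int_A \frac{f(Y_{m-1},a)\, \tilde p_{n,j-m}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon} \times I\{m_n(H^d) = m\}, \quad \forall\, j = 1, 2, \ldots.
\end{aligned}$$

Therefore,

$$\begin{aligned}
&E^\sigma_\gamma\left[I\{m_n(H^d) = m\} \sum_{j=m_n(H^d)}^{m_{n+1}(H^d)-1} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon} \,\middle|\, H^d_{m-1}\right]\\
&\quad= \sum_{j=m}^\infty E^\sigma_\gamma\left[I\{m_n(H^d) = m\}\, I\{m \le j < m_{n+1}(H^d)\}\, \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon} \,\middle|\, H^d_{m-1}\right]\\
&\quad= \sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon}\right) \int_A \frac{f(Y_{m-1},a)\, \tilde p_{n,k}(da|Y_{m-1})}{q_{Y_{m-1}}(a)+\varepsilon} \times I\{m_n(H^d) = m\}.
\end{aligned}$$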
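Similarly to the previous calculations, using the measure $\mu_m$ (4.28), we have

$$\begin{aligned}
&E^\sigma_\gamma\left[I\{m_n(H^d) = m\} \sum_{j=m_n(H^d)}^{m_{n+1}(H^d)-1} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right]\\
&\quad= \int_X \left\{\sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|y)}{q_y(a)+\varepsilon}\right) \int_A \frac{f(y,a)\, \tilde p_{n,k}(da|y)}{q_y(a)+\varepsilon}\right\} \mu_m(dy).
\end{aligned}$$

According to (4.27), $\sum_{m=1}^\infty \mu_m(\Gamma_X) = P^{S^P}_\gamma(X_{n-1} \in \Gamma_X)$, so that

$$\begin{aligned}
&E^\sigma_\gamma\left[I\{m_n(H^d) < \infty\} \sum_{j=m_n(H^d)}^{m_{n+1}(H^d)-1} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right]\\
&\quad= E^{S^P}_\gamma\left[\sum_{k=0}^\infty \left(\prod_{i=1}^{k} \int_A \frac{\varepsilon\, \tilde p_{n,i-1}(da|X_{n-1})}{q_{X_{n-1}}(a)+\varepsilon}\right) \int_A \frac{f(X_{n-1},a)\, \tilde p_{n,k}(da|X_{n-1})}{q_{X_{n-1}}(a)+\varepsilon}\right]\\
&\quad= \int_{X\times A} f(x,a)\, m^{S^P,0}_{\gamma,n}(dx\times da).
\end{aligned}$$

The last equality is by Corollary 4.1.2. Having in hand (4.29), the end of the proof is simple: for each nonnegative bounded measurable function $f$ on $X \times A$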
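$$\begin{aligned}
E^\sigma_\gamma\left[\sum_{j=1}^\infty \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right]
&= \sum_{n=1}^\infty E^\sigma_\gamma\left[I\{m_n(H^d) < \infty\} \sum_{j=m_n(H^d)}^{m_{n+1}(H^d)-1} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right]\\
&= \sum_{n=1}^\infty \int_{X\times A} f(x,a)\, m^{S^P,0}_{\gamma,n}(dx\times da).
\end{aligned}$$

(Remember, $m_1(H^d) \equiv 1$.) The proof is complete.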
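Equality (4.25) can be checked numerically in the simplest situation: a finite state space and a fixed deterministic stationary rule, so that the left-hand side is the expected total cost of the continuous-time chain and the right-hand side is the expected total cost of the DTMDP with one-step cost $f/(q_x + \varepsilon)$. The sketch below is an illustration only; the three-state intensities, the cost rates, and the function names are hypothetical, and both quantities are computed by plain fixed-point iteration rather than by any method from the text.

```python
def total_cost_ctmdp(q_tilde, f, sweeps=5000):
    """Expected total cost of the continuous-time chain under a fixed rule.

    q_tilde[x][y]: jump intensity x -> y (diagonal entries must be 0),
    f[x]: cost rate; states with zero total intensity are assumed costless.
    """
    n = len(f)
    v = [0.0] * n
    for _ in range(sweeps):
        new_v = []
        for x in range(n):
            q_x = sum(q_tilde[x])
            if q_x == 0.0:
                new_v.append(0.0)  # absorbing and costless by assumption
            else:
                # mean sojourn cost f/q_x plus the value at the post-jump state
                jump = sum(q_tilde[x][y] * v[y] for y in range(n))
                new_v.append((f[x] + jump) / q_x)
        v = new_v
    return v


def total_cost_dtmdp(q_tilde, f, eps, sweeps=5000):
    """Expected total cost in the DTMDP with kernel (q_tilde + eps*I)/(q + eps)."""
    n = len(f)
    w = [0.0] * n
    for _ in range(sweeps):
        new_w = []
        for x in range(n):
            denom = sum(q_tilde[x]) + eps
            jump = sum(q_tilde[x][y] * w[y] for y in range(n))
            new_w.append((f[x] + jump + eps * w[x]) / denom)
        w = new_w
    return w


# Hypothetical data: state 2 is absorbing with zero cost rate.
q_tilde = [[0.0, 2.0, 1.0], [0.5, 0.0, 1.5], [0.0, 0.0, 0.0]]
f = [3.0, 1.0, 0.0]
v = total_cost_ctmdp(q_tilde, f)
w_small = total_cost_dtmdp(q_tilde, f, eps=0.5)
w_large = total_cost_dtmdp(q_tilde, f, eps=7.0)
```

The two DTMDP vectors agree with the continuous-time one for both values of `eps`, which is consistent with the fact that $\varepsilon > 0$ may be fixed arbitrarily.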
Corollary 4.2.1 For each Poisson-related strategy $S^P$ in the CTMDP, there exists a strategy $\sigma$ in the DTMDP such that for all nonnegative measurable functions $f$ on $X \times A$, equality (4.25) holds for each initial distribution $\gamma$.

Proof It is sufficient to pass to the limit, as $K \to \infty$, in the equality

$$\sum_{n=1}^\infty \int_{X\times A} \big(f(x,a) \wedge K\big)\, m^{S^P,0}_{\gamma,n}(dx\times da) = E^\sigma_\gamma\left[\sum_{j=1}^\infty \frac{f(Y_{j-1},B_j) \wedge K}{q_{Y_{j-1}}(B_j)+\varepsilon}\right],$$

using the Monotone Convergence Theorem.

Remark 4.2.1 It will be shown in Sect. 6.3 (see Lemma 6.3.1) that the reverse statement of Lemma 4.2.1 and Corollary 4.2.1 is also valid: for each control strategy $\sigma$ in the DTMDP, there is a Poisson-related strategy $S^P \in \mathbf{S}^P_\varepsilon$ in the CTMDP, such that (4.25) holds true for all nonnegative measurable functions $f$ on $X \times A$.
4.2.2 Selected Results of the Reduction to DTMDP

According to Lemma 4.2.1, see also Corollary 4.2.1, if we solve the problems

$$\text{Minimize } E^\sigma_\gamma\left[\sum_{n=0}^\infty \frac{c_0(X_n,A_{n+1})}{q_{X_n}(A_{n+1})+\varepsilon}\right] \text{ over all strategies } \sigma \tag{4.30}$$

or

$$\begin{aligned}
&\text{Minimize } E^\sigma_\gamma\left[\sum_{n=0}^\infty \frac{c_0(X_n,A_{n+1})}{q_{X_n}(A_{n+1})+\varepsilon}\right] \text{ over all strategies } \sigma\\
&\text{subject to } E^\sigma_\gamma\left[\sum_{n=0}^\infty \frac{c_j(X_n,A_{n+1})}{q_{X_n}(A_{n+1})+\varepsilon}\right] \le d_j, \quad j = 1, 2, \ldots, J,
\end{aligned} \tag{4.31}$$

and obtain a strategy $\sigma^*$ for which there is a Poisson-related strategy $S^{P*}$ satisfying (4.25) for all nonnegative measurable functions $f$ on $X \times A$, then $S^{P*}$ will be a solution to the problem (1.15) or (1.16) respectively. Remember that Poisson-related strategies form a sufficient class in the CTMDP due to Theorems 4.1.1 and 1.3.1; the value of $\varepsilon > 0$ is arbitrarily fixed. According to Remark 4.2.1, such a strategy $S^{P*}$ exists for every $\sigma \in \Sigma$.

Remark 4.2.2 If the strategy $\sigma^*(da|x) = \delta_{\varphi^*(x)}(da)$ in the DTMDP is deterministic stationary, the equality (4.25) holds for the stationary Poisson-related strategy with an arbitrarily fixed $\varepsilon > 0$ and identical stochastic kernels $\tilde p^*_{n,k} \equiv \sigma^*$: see the proof of Lemma 4.2.1 and the key formula (4.26). On each interval $(T_{n-1}, T_n]$, $n = 1, 2, \ldots$, after the state $X_{n-1}$ is realised, the value of action is constant, equal to $\varphi^*(X_{n-1})$, meaning that, in fact, the Poisson-related strategy is indistinguishable from the deterministic stationary strategy $\varphi^*$ in CTMDP. (See also Remark 4.1.1 for a more rigorous justification.) Therefore, in this situation, the strategy $\varphi^*$ solves the problem (1.15) or (1.16) respectively, if it solves problems (4.30) or (4.31) respectively.

In this subsection, we mainly concentrate on the positive models, when $c_j \ge 0$ for all $j = 0, 1, \ldots, J$. As is known (see Proposition C.2.4), in the unconstrained case, the value (Bellman) function $W_0^{DT*}$ for the DTMDP is the minimal nonnegative lower semianalytic solution to the optimality equation

$$W(x) = \inf_{a\in A}\left\{\frac{c_0(x,a)}{q_x(a)+\varepsilon} + \frac{\int_X W(y)\,\tilde q(dy|x,a) + \varepsilon W(x)}{q_x(a)+\varepsilon}\right\}, \quad \forall\, x \in X. \tag{4.32}$$

In view of Remark 4.2.1, that solution coincides with $W_0^{*0}$, the value (Bellman) function for the problem (1.15). However, the proofs of the next two theorems (Theorems 4.2.1 and 4.2.2) are not based on Remark 4.2.1.

Condition 4.2.1
(a) The (Borel) action space $A$ is compact.
(b) For each bounded continuous function $f$ on $X$, the mapping $(x,a) \in X \times A \to \int_X f(y)\,\tilde q(dy|x,a)$ is continuous.
(c) For each $j = 0, 1, \ldots, J$, the mapping $(x,a) \in X \times A \to c_j(x,a)$ is lower semicontinuous.

Recall that part (b) of Condition 4.2.1 is equivalent to the following: for each $[0,\infty]$-valued lower semicontinuous function $f$ on $X$, the mapping $(x,a) \in X \times A \to \int_X f(y)\,\tilde q(dy|x,a)$ is lower semicontinuous.

Condition 4.2.2
(a) The (Borel) action space $A$ is compact.
(b) For each bounded measurable function $f$ on $X$ and $x \in X$, the integral $\int_X f(y)\,\tilde q(dy|x,a)$ is continuous in $a \in A$.
(c) For each $j = 0, 1, \ldots, J$, and $x \in X$, the mapping $a \in A \to c_j(x,a)$ is lower semicontinuous.

Theorem 4.2.1 Suppose $c_0 \ge 0$ and $W$ is the minimal nonnegative lower semianalytic solution to Eq. (4.32).
(a) If a deterministic stationary strategy $\varphi^*$ satisfies the equality

$$\inf_{a\in A}\left\{c_0(x,a) + \int_X W(y)\,q(dy|x,a)\right\} = c_0(x,\varphi^*(x)) + \int_X W(y)\,q(dy|x,\varphi^*(x)) = 0 \tag{4.33}$$

for all $x \in X$ such that $W(x) < \infty$, then $W(x) = W_0^{*0}(x)$ for all $x \in X$ and $\varphi^*$ is uniformly optimal for the unconstrained CTMDP problem (1.15).
(b) If Condition 4.2.1 (respectively Condition 4.2.2) is satisfied, then $W_0^{*0}$ is the minimal nonnegative lower semicontinuous (respectively, measurable) solution to (4.32), and there exists a deterministic stationary strategy $\varphi^*$ which satisfies (4.33) and hence is uniformly optimal for problem (1.15).

Proof (a) Equality (4.33) holds if and only if, for all $x \in X$,

$$\begin{aligned}
W(x) &= \inf_{a\in A}\left\{\frac{c_0(x,a)}{q_x(a)+\varepsilon} + \frac{\int_X W(y)\,\tilde q(dy|x,a) + \varepsilon W(x)}{q_x(a)+\varepsilon}\right\}\\
&= \frac{c_0(x,\varphi^*(x))}{q_x(\varphi^*(x))+\varepsilon} + \frac{\int_X W(y)\,\tilde q(dy|x,\varphi^*(x)) + \varepsilon W(x)}{q_x(\varphi^*(x))+\varepsilon}.
\end{aligned}$$

Proposition C.2.4 implies that $\varphi^*$ is uniformly optimal in the DTMDP problem (4.30) and $W_0^{DT*} = W$. According to the proof of Lemma 4.2.1, see also Corollary 4.2.1, and Remark 4.2.2, the Poisson-related strategy $S^{P*}$ defined by $\tilde p_{n,k}(da|x) = \delta_{\varphi^*(x)}(da)$ for all $n \in \mathbb{N}$, $k = 0, 1, \ldots$, is in fact just the deterministic stationary strategy $\varphi^*$. Equality (4.25) for $S^{P*}$ is valid for all nonnegative measurable functions $f$ on $X \times A$, where $\sigma$ is the strategy $\varphi^*$, and the initial distribution $\gamma$ can be arbitrary, e.g., concentrated at an arbitrary point $x \in X$. Thus $\varphi^*$ is uniformly optimal for the problem (1.15), and the value (Bellman) functions $W_0^{DT*}$ for the DTMDP and $W_0^{*0}$ for the CTMDP coincide, being equal to

$$\sum_{n=1}^\infty \int_{X\times A} c_0(y,a)\, m^{S^{P*},0}_{x,n}(dy\times da) = E^{\varphi^*}_x\left[\sum_{j=1}^\infty \frac{c_0(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right].$$

(b) The proof follows from Proposition C.2.8 and Item (a).
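The equivalence used at the start of the proof of part (a) can be spelled out as follows. This is a sketch under the assumptions that $W(x) < \infty$, that the integrals are finite, and that the usual relations $\tilde q(\Gamma|x,a) = q(\Gamma \setminus \{x\}|x,a)$ and $q_x(a) = -q(\{x\}|x,a)$ hold; the auxiliary function $g$ is introduced here only for the illustration.

```latex
% Auxiliary notation (hypothetical, for this sketch only):
%   g(a) := c_0(x,a) + \int_X W(y)\, q(dy|x,a).
\begin{align*}
\int_X W(y)\, q(dy|x,a)
  &= \int_X W(y)\, \tilde q(dy|x,a) - q_x(a)\, W(x), \\
\frac{c_0(x,a) + \int_X W(y)\, \tilde q(dy|x,a) + \varepsilon W(x)}{q_x(a)+\varepsilon}
  &= \frac{g(a) + \big(q_x(a)+\varepsilon\big) W(x)}{q_x(a)+\varepsilon}
   = W(x) + \frac{g(a)}{q_x(a)+\varepsilon}.
\end{align*}
% Since q_x(a) + \varepsilon > 0, the conditions
%   g(a) \ge 0 for all a \in A  and  g(\varphi^*(x)) = 0
% hold if and only if
%   g(a)/(q_x(a)+\varepsilon) \ge 0 for all a \in A, with equality at a = \varphi^*(x),
% i.e. if and only if the infimum in (4.32) equals W(x) and is attained at \varphi^*(x).
```

In words: dividing the nonnegative quantity $g(a)$ by the strictly positive $q_x(a) + \varepsilon$ changes neither its sign nor the set where it vanishes, which is why the form (4.33) does not involve $\varepsilon$.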
Theorem 4.2.2 Suppose all functions c j , j = 0, 1, . . . , J , are nonnegative, Condition 4.2.1 is satisfied, and there exists a feasible strategy for the constrained undiscounted problem (1.16). Then the following assertions hold true.
(a) There is a stationary Poisson-related strategy S P ∈ SεP , optimal for problem (1.16). (b) There is a stationary π-strategy π s , optimal for problem (1.16). (c) There is a stationary standard ξ-strategy s , optimal for problem (1.16). Proof (a) Proposition C.2.18 implies that there is an optimal stationary strategy σ ∗ in the DTMDP solving problem (4.31). According to the proof of Lemma 4.2.1 ∗ (see also Corollary 4.2.1), for the stationary Poisson-related strategy S P , defined ∗ by p˜ n,k (da|x) = σ (da|x) for all n ∈ N, k = 0, 1, 2, . . ., equality (4.25) holds for all nonnegative measurable functions f on X × A. Therefore, the strategy ∗ S P solves problem (1.16). ∗ (b) Let S P ∈ SεP and σ ∗ be the strategies from the proof of part (a). According to Sect. C.2.2, one should put π s (da|x) = δϕ∗ (x) (da) on the set ⎧ ⎨
⎫ ⎡ ⎤ J ⎬ c j (X n , An+1 ) ⎦ (ζ DT )c := x ∈ X : inf Eσx ⎣ =0 σ∈ ⎩ ⎭ q (An+1 ) + ε j=0 X n where ϕ∗ satisfies
q ((ζ DT )c |x, ϕ∗ (x)) + ε =1 qx (ϕ∗ (x)) + ε
and
J c j (x, ϕ∗ (x)) =0 q (ϕ∗ (x)) + ε j=0 x
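for all $x \in (\zeta^{DT})^c$: see Remark C.2.3. The controlled continuous-time process under the strategy $S^{P*}$ will not leave the costless subset $(\zeta^{DT})^c$ once it is reached. After that, we concentrate on the set $\zeta^{DT}$. Along with the measure $M^{\sigma^*}_\gamma$ (Definition C.2.5), we also need the measure

$$M^{S^{P*}}_\gamma(\Gamma_X \times \Gamma_A) := \int_{\Gamma_X \times \Gamma_A} \frac{M^{\sigma^*}_\gamma(dx\times da)}{q_x(a)+\varepsilon} = E^{\sigma^*}_\gamma\left[\sum_{i=1}^\infty \frac{I\{Y_{i-1} \in \Gamma_X\}\, I\{B_i \in \Gamma_A\}}{q_{Y_{i-1}}(B_i)+\varepsilon}\right]. \tag{4.34}$$

According to Remark C.2.4, the measure $M^{\sigma^*}_\gamma(dx \times A)$ is σ-finite on $(\zeta^{DT}, \mathcal{B}(\zeta^{DT}))$: there exists a sequence $\{X_i\}$ of measurable subsets of $\zeta^{DT}$ such that $\bigcup_{i=1}^\infty X_i = \zeta^{DT}$, and $M^{\sigma^*}_\gamma(X_i \times A) < \infty$ for each $i \ge 1$. Consequently,

$$M^{S^{P*}}_\gamma(X_i \times A) \le \frac{1}{\varepsilon}\, M^{\sigma^*}_\gamma(X_i \times A) < \infty,$$

and the measure $M^{S^{P*}}_\gamma$ is also σ-finite on $(\zeta^{DT}, \mathcal{B}(\zeta^{DT}))$. Therefore, there exists a stochastic kernel $\pi^s$ on $\mathcal{B}(A)$ given $x \in \zeta^{DT}$ such that

$$M^{S^{P*}}_\gamma(\Gamma_X \times \Gamma_A) = \int_{\Gamma_X} \pi^s(\Gamma_A|x)\, M^{S^{P*}}_\gamma(dx \times A)$$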
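for all $\Gamma_X \in \mathcal{B}(\zeta^{DT})$, $\Gamma_A \in \mathcal{B}(A)$. We will show that the stochastic kernel $\pi^s$ (along with the mapping $\varphi^*$ described above) defines the desired stationary π-strategy.

On every set $X_i \times A$, the measures $M^{S^{P*}}_\gamma$ and $M^{\sigma^*}_\gamma$ are absolutely continuous with respect to each other, with the Radon–Nikodym derivatives $\frac{dM^{\sigma^*}_\gamma}{dM^{S^{P*}}_\gamma} = q_x(a) + \varepsilon$ and $\frac{dM^{S^{P*}}_\gamma}{dM^{\sigma^*}_\gamma} = \frac{1}{q_x(a)+\varepsilon}$. Therefore, by Remark C.2.4, for all $\Gamma \in \mathcal{B}(\zeta^{DT})$

$$\int_{\Gamma\times A} \big(q_x(a)+\varepsilon\big)\, M^{S^{P*}}_\gamma(dx\times da) = \gamma(\Gamma) + \int_{\zeta^{DT}\times A} \big(\tilde q(\Gamma|x,a) + \varepsilon I\{x \in \Gamma\}\big)\, M^{S^{P*}}_\gamma(dx\times da).$$

Since the measure $M^{S^{P*}}_\gamma$ is σ-finite on $(\zeta^{DT} \times A, \mathcal{B}(\zeta^{DT} \times A))$, it is legal to cancel $\varepsilon$ on both sides and obtain

$$\begin{aligned}
\nu(\Gamma) := \int_{\Gamma\times A} q_x(a)\, M^{S^{P*}}_\gamma(dx\times da)
&= \int_{\Gamma} q_x(\pi^s)\, M^{S^{P*}}_\gamma(dx\times A)\\
&= \gamma(\Gamma) + \int_{\zeta^{DT}} \tilde q(\Gamma|x,\pi^s)\, M^{S^{P*}}_\gamma(dx\times A)\\
&= \gamma(\Gamma) + \int_{\zeta^{DT}} \tilde q(\Gamma|x,\pi^s)\, I\{q_x(\pi^s) \ne 0\}\, \frac{\nu(dx)}{q_x(\pi^s)}, \quad \Gamma \in \mathcal{B}(\zeta^{DT}).
\end{aligned} \tag{4.35}$$

On the set $\{q_x(\pi^s) \ne 0\}$, the marginal measure $M^{S^{P*}}_\gamma(dx \times A)$ is absolutely continuous with respect to the measure $\nu$ introduced above, and the Radon–Nikodym derivative is $\frac{1}{q_x(\pi^s)}$. It is clear that

$$\int_{\zeta^{DT}} I\{q_x(\pi^s) = 0\}\, \tilde q(\Gamma|x,\pi^s)\, M^{S^{P*}}_\gamma(dx\times A) = 0.$$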
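Let us compute the detailed occupation measures for the strategy $\pi^s$ on $(\zeta^{DT} \times A, \mathcal{B}(\zeta^{DT} \times A))$. According to (1.32),

$$m^{\pi^s,0}_{\gamma,n}(\Gamma_X \times \Gamma_A) = \int_{\Gamma_X} \frac{\pi^s(\Gamma_A|x)}{q_x(\pi^s)}\, p_{n-1}(dx)$$

for each $\Gamma_X \in \mathcal{B}(X)$ and $\Gamma_A \in \mathcal{B}(A)$, where

$$p_n(\Gamma) := P^{\pi^s}_\gamma(X_n \in \Gamma)$$

for $\Gamma \in \mathcal{B}(\zeta^{DT})$. Since, under the strategy $\pi^s$, the set $(\zeta^{DT})^c$ is absorbing, we have, according to (1.13):

$$\begin{aligned}
p_0(\Gamma) &= \gamma(\Gamma);\\
p_n(\Gamma) &= \int_{\zeta^{DT}} p_{n-1}(dx)\, G_n(\mathbb{R}_+ \times \Gamma) = \int_{\zeta^{DT}} \frac{\tilde q(\Gamma|x,\pi^s)}{q_x(\pi^s)}\, I\{q_x(\pi^s) \ne 0\}\, p_{n-1}(dx), \quad \Gamma \in \mathcal{B}(\zeta^{DT}),\ n = 1, 2, \ldots.
\end{aligned}$$

Therefore, for each nonnegative measurable function $f$ on $\zeta^{DT} \times A$, we have

$$\sum_{n=1}^{N} \int_{\zeta^{DT}\times A} f(x,a)\, m^{\pi^s,0}_{\gamma,n}(dx\times da) = \int_{\zeta^{DT}} \frac{\int_A f(x,a)\,\pi^s(da|x)}{q_x(\pi^s)}\, \tilde\nu_N(dx), \quad N = 1, 2, \ldots,$$

where the measures $\tilde\nu_N := \sum_{n=0}^{N-1} p_n$ satisfy the equalities

$$\tilde\nu_1(\Gamma) = \gamma(\Gamma); \qquad \tilde\nu_N(\Gamma) = \gamma(\Gamma) + \int_{\zeta^{DT}} \frac{\tilde q(\Gamma|x,\pi^s)}{q_x(\pi^s)}\, I\{q_x(\pi^s) \ne 0\}\, \tilde\nu_{N-1}(dx), \quad \Gamma \in \mathcal{B}(\zeta^{DT}),\ N = 2, 3, \ldots.$$

If we compare the last expression with (4.35), we see that, for each $N = 1, 2, \ldots$, $\Gamma \in \mathcal{B}(\zeta^{DT})$, it holds that $\tilde\nu_N(\Gamma) \le \nu(\Gamma)$: the proof is by trivial induction. Therefore, for each nonnegative measurable function $f$ on $\zeta^{DT} \times A$, we have

$$\sum_{n=1}^{N} \int_{\zeta^{DT}\times A} f(x,a)\, m^{\pi^s,0}_{\gamma,n}(dx\times da) \le \int_{\zeta^{DT}} \frac{\int_A f(x,a)\,\pi^s(da|x)}{q_x(\pi^s)}\, \nu(dx), \quad N = 1, 2, \ldots.$$

Remember, $\nu(\{x \in \zeta^{DT} : q_x(\pi^s) = 0\}) = 0$; hence, by (4.35),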
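$$\begin{aligned}
\sum_{n=1}^\infty \int_{\zeta^{DT}\times A} f(x,a)\, m^{\pi^s,0}_{\gamma,n}(dx\times da)
&\le \int_{\zeta^{DT}} \frac{\int_A f(x,a)\,\pi^s(da|x)}{q_x(\pi^s)}\, I\{q_x(\pi^s) \ne 0\}\, q_x(\pi^s)\, M^{S^{P*}}_\gamma(dx\times A)\\
&\le \int_{\zeta^{DT}\times A} f(x,a)\, M^{S^{P*}}_\gamma(dx\times da) = \int_{\zeta^{DT}\times A} \frac{f(x,a)}{q_x(a)+\varepsilon}\, M^{\sigma^*}_\gamma(dx\times da)\\
&= E^{\sigma^*}_\gamma\left[\sum_{i=1}^\infty \frac{f(Y_{i-1},B_i)\, I\{Y_{i-1} \in \zeta^{DT}\}}{q_{Y_{i-1}}(B_i)+\varepsilon}\right].
\end{aligned}$$

The last two equalities are by (4.34). For all $n = 1, 2, \ldots$, $j = 0, 1, \ldots, J$,

$$\begin{aligned}
\int_{(\zeta^{DT})^c\times A} c_j(x,a)\, m^{\pi^s,0}_{\gamma,n}(dx\times da)
&= \sum_{n=1}^\infty \int_{(\zeta^{DT})^c\times A} c_j(x,a)\, m^{S^{P*},0}_{\gamma,n}(dx\times da)\\
&= E^{\sigma^*}_\gamma\left[\sum_{i=1}^\infty \frac{c_j(Y_{i-1},B_i)\, I\{Y_{i-1} \in (\zeta^{DT})^c\}}{q_{Y_{i-1}}(B_i)+\varepsilon}\right] = 0
\end{aligned}$$

because $c_j(x,\varphi^*(x)) = 0$ for all $x \in (\zeta^{DT})^c$. According to Corollary 4.2.1, since the strategy $\sigma^*$ solves problem (4.31), and the strategy $S^{P*}$ solves problem (1.16), the inequalities

$$\begin{aligned}
\sum_{n=1}^\infty \int_{X\times A} c_j(x,a)\, m^{\pi^s,0}_{\gamma,n}(dx\times da)
&\le E^{\sigma^*}_\gamma\left[\sum_{i=1}^\infty \frac{c_j(Y_{i-1},B_i)}{q_{Y_{i-1}}(B_i)+\varepsilon}\right]\\
&= \sum_{n=1}^\infty \int_{X\times A} c_j(x,a)\, m^{S^{P*},0}_{\gamma,n}(dx\times da) \le d_j
\end{aligned}$$

hold for all $j = 1, 2, \ldots, J$, and, moreover,

$$\begin{aligned}
\sum_{n=1}^\infty \int_{X\times A} c_0(x,a)\, m^{\pi^s,0}_{\gamma,n}(dx\times da)
&\le E^{\sigma^*}_\gamma\left[\sum_{i=1}^\infty \frac{c_0(Y_{i-1},B_i)}{q_{Y_{i-1}}(B_i)+\varepsilon}\right]\\
&= \sum_{n=1}^\infty \int_{X\times A} c_0(x,a)\, m^{S^{P*},0}_{\gamma,n}(dx\times da).
\end{aligned} \tag{4.36}$$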
Since the strategy S P is optimal for problem (1.16), the strategy π s is also optimal for problem (1.16), and in (4.36) equality holds. (c) The proof follows a similar pattern as in (b). On the set (ζ DT )c , we put s (da|x) = δϕ∗ (x) (da).
On B(ζ DT × A), we introduce the measure M( X × A ) qx (a) P∗ ∗ := Mσγ (d x × da) qx (a)MγS (d x × da) = X × A X × A q x (a) + ε ∞ qYi−1 (Bi )I{Yi−1 ∈ X }I{Bi ∈ A } σ∗ = Eγ , qYi−1 (Bi ) + ε i=1 ∀ X ∈ B(ζ DT ), A ∈ B(A). Clearly, the measure M(d x × A) is σ-finite on (ζ DT , B(ζ DT )), because so is the ∗ measure Mσγ (d x × A). We introduce the stochastic kernel s on B(A) given x ∈ ζ DT , such that s ( A |x)M(d x × A), ∀ X ∈ B(ζ DT ), A ∈ B(A), M( X × A ) = X
and prove that it defines the desired optimal stationary standard ξ-strategy. The measure ν remains the same, but we rewrite equality (4.35) in the following form P∗ ν() = M( × A) = γ() + q (|x, a)MγS (d x × da) ζ DT ×A q (|x, a) = γ() + I{qx (a) = 0}M(d x × da) qx (a) ζ DT ×A
q (|x, a) I{qx (a) = 0} s (da|x)ν(d x), = γ() + qx (a) ζ DT A ∈ B(ζ DT ). P∗ On the set 0 := {(x, a) : qx (a) = 0}, 0 q (|x, a)MγS (d x × da) = 0 for all ∈ B(ζ DT ). Calculation of the detailed occupation measures for the strategy s leads to the following expression for each nonnegative measurable function f on ζ DT × A: N n=1
ζ DT ×A
s ,0 f (x, a)m γ,n (d x
× da) =
ζ DT
N = 1, 2, . . . , where the measures ν˜ N () := B(ζ DT )) satisfy the equalities
A
f (x, a) s (da|x)ν˜ N (d x), qx (a)
N −1 n=0
Pγ (X n ∈ ) on (ζ DT , s
ν˜1 () = γ();
ν˜ N () = γ() +
ζ DT
q (|x, a) I{qx (a) = 0} s (da|x)ν˜ N −1 (d x), qx (a) A
∈ B(ζ DT ), N = 2, 3, . . . . As before, ν˜ N () ≤ ν() for all ∈ B(ζ DT ) and ∞ n=1
,0 f (x, a)m γ,n (d x × da) ≤ s
ζ DT ×A
ζ DT
A
= ≤
f (x, a) s (da|x)ν(d x) qx (a) f (x, a) M(d x × da) qx (a)
ζ DT ×A ∞ σ∗ Eγ i=1
f (Yi−1 , Bi )I{Yi−1 ∈ ζ DT } . qYi−1 (Bi ) + ε
Finally, ∞ X×A
n=1
s ,0 c j (x, a)m γ,n (d x
∞ c j (Yi−1 , Bi ) × da) ≤ q (Bi ) + ε i=1 Yi−1 ∞ ∗ S P ,0 = c j (x, a)m γ,n (d x × da) ∗ Eσγ
X×A
n=1
≤ dj,
j = 1, 2, . . . , J,
and ∞ n=1
X×A
,0 c0 (x, a)m γ,n (d x × da) ≤ s
∞ n=1
X×A
P∗
S ,0 c0 (x, a)m γ,n (d x × da),
so that the stationary standard ξ-strategy is optimal for the problem (1.16).

Let us introduce the (undiscounted) total occupation measure for a strategy $S$ as follows.

Definition 4.2.1 The (undiscounted) total occupation measure $\eta^{S,0}_\gamma$ for a strategy $S$ (with the given initial distribution $\gamma$) is a measure on $(X \times A)$ defined by

$$\eta^{S,0}_\gamma(dx\times da) = \sum_{n=1}^\infty m^{S,0}_{\gamma,n}(dx\times da).$$

Here $S$ is allowed to be a Poisson-related strategy.
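Then problem (1.16) can be represented as

$$\int_{X\times A} c_0(x,a)\,\eta(dx\times da) \to \min, \quad \text{subject to} \quad \int_{X\times A} c_j(x,a)\,\eta(dx\times da) \le d_j, \quad j = 1, 2, \ldots, J.$$

Here $\eta \in \{\eta^{S,0}_\gamma,\ S \in \mathbf{S}\} = \{\eta^{S^P,0}_\gamma,\ S^P \in \mathbf{S}^P_\varepsilon\}$: see Theorem 4.1.1. ($\varepsilon > 0$ is arbitrarily fixed.) For a nonnegative measurable function $f$ on $X \times A$, the equality

$$\sum_{n=1}^\infty \int_{X\times A} f(x,a)\, m^{S,0}_{\gamma,n}(dx\times da) = \int_{X\times A} f(x,a)\,\eta^{S,0}_\gamma(dx\times da)$$

is valid due to Proposition B.1.17. Lemma 4.2.1 can be reformulated in the following way:

$$\forall\, S^P \in \mathbf{S}^P_\varepsilon\ \ \exists\, \sigma \in \Sigma:\quad \eta^{S^P,0}_\gamma(dx\times da) = \frac{1}{q_x(a)+\varepsilon}\, M^\sigma_\gamma(dx\times da).$$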
(See Definition C.2.5.) P∗ P∗ In the proof of Theorem 4.2.2, the measure MγS actually coincides with ηγS ,0 , P∗
and the established inequalities ηγπ ,0 , ηγ ,0 ≤ ηγS ,0 on (ζ DT × A, B(ζ DT × A)) ∗ result in the optimality of the strategies π s and s , given that S P is optimal. In the more general case, when the bounded cost functions c j may take positive and negative values, the convex programming approach, as described in Appendix C.2.2, can be useful for solving the constrained problem (4.31), at least for an absorbing DTMDP. s
s
4.2.3 Examples

4.2.3.1 SIR Epidemic

Consider the model described in Sect. 1.2.3. We combine both cost rates, put $c_0((x_1,x_2),a) = \beta x_1 x_2 + Ca$, and study problem (1.15) with $\alpha = 0$. Equation (4.32) takes the form

$$\begin{aligned}
&\inf_{a\in[0,\bar a]}\Big\{\frac{\beta x_1 x_2 + Ca}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon} + \frac{\beta x_1 x_2}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}\, W((x_1-1, x_2+1))\\
&\qquad\quad + \frac{\zeta x_2 + a}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}\, W((x_1, x_2-1)) + \frac{\varepsilon}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}\, W((x_1,x_2)) - W((x_1,x_2))\Big\}\\
&= \inf_{a\in[0,\bar a]}\Big\{\frac{\beta x_1 x_2 + Ca}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon} + \frac{\beta x_1 x_2}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}\, W((x_1-1, x_2+1))\\
&\qquad\quad + \frac{\zeta x_2 + a}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}\, W((x_1, x_2-1)) - \frac{\beta x_1 x_2 + \zeta x_2 + a}{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}\, W((x_1,x_2))\Big\} = 0
\end{aligned} \tag{4.37}$$

when $x_2 > 0$. When $x_2 = 0$, we have

$$\inf_{a\in[0,\bar a]}\left\{\frac{Ca}{a+\varepsilon} - \frac{a}{a+\varepsilon}\, W((x_1,0))\right\} = 0,$$

and the minimal nonnegative solution is $W((x_1,0)) \equiv 0$, provided by the action $a = 0$. The last equation is equivalent to Eq. (4.33) in this example. For $x_2 > 0$, we can multiply Eq. (4.37) by

$$\frac{\beta x_1 x_2 + \zeta x_2 + a + \varepsilon}{\beta x_1 x_2 + \zeta x_2 + a}$$

and obtain

$$W((x_1,x_2)) = \inf_{a\in[0,\bar a]}\left\{\frac{\beta x_1 x_2 + Ca}{\beta x_1 x_2 + \zeta x_2 + a} + \frac{\beta x_1 x_2}{\beta x_1 x_2 + \zeta x_2 + a}\, W((x_1-1, x_2+1)) + \frac{\zeta x_2 + a}{\beta x_1 x_2 + \zeta x_2 + a}\, W((x_1, x_2-1))\right\}.$$

The minimal nonnegative solution can be built by successive approximations

$$W_{i+1}((x_1,x_2)) = \inf_{a\in[0,\bar a]}\left\{\frac{\beta x_1 x_2 + Ca}{\beta x_1 x_2 + \zeta x_2 + a} + \frac{\beta x_1 x_2}{\beta x_1 x_2 + \zeta x_2 + a}\, W_i((x_1-1, x_2+1)) + \frac{\zeta x_2 + a}{\beta x_1 x_2 + \zeta x_2 + a}\, W_i((x_1, x_2-1))\right\} \tag{4.38}$$
with the starting function $W_0((x_1,x_2)) \equiv 0$. Remember also that $W((x_1,0)) \equiv 0$. According to Proposition C.2.8, the sequence of functions $\{W_i\}_{i=0}^\infty$ increases, and pointwise converges to the limiting finite-valued function $W$ of interest. We emphasize that the infimum in (4.38), on each step of the approximations, is provided either by $a = 0$ or $a = \bar a$. This is the widely known typical bang-bang control. The mapping from $X$ to $A$ providing the infimum in (4.38), or, equivalently, in equation

$$\inf_{a\in[0,\bar a]}\big\{\beta x_1 x_2 + Ca + \beta x_1 x_2\, W((x_1-1, x_2+1)) + (\zeta x_2 + a)\, W((x_1, x_2-1)) - (\beta x_1 x_2 + \zeta x_2 + a)\, W((x_1,x_2))\big\} = 0,$$

is denoted as $\varphi^*$. The latter equation is just Eq. (4.33) in this example. For the specific values $\beta = 0.5$; $\zeta = 1$; $\bar a = 50$; $C = 0.2$; $N = 10$, the solution, $W((x_1,x_2))$, is given in the following table (rows: $x_2$, infectives; columns: $x_1$, susceptibles; X marks pairs with $x_1 + x_2 > N$).

x2 \ x1      0      1      2      3      4      5      6      7      8      9     10
 0           0      0      0      0      0      0      0      0      0      0      0
 1           0  0.204  0.219  0.232  0.245  0.259  0.273  0.287  0.303  0.319      X
 2           0  0.408  0.456  0.494  0.533  0.575  0.618  0.664  0.713      X      X
 3           0  0.606  0.703  0.782  0.861  0.944  1.032  1.126      X      X      X
 4           0  0.737  0.953  1.090  1.223  1.361  1.508      X      X      X      X
 5           0  0.825  1.197  1.411  1.609  1.814      X      X      X      X      X
 6           0  0.883  1.429  1.735  2.010      X      X      X      X      X      X
 7           0  0.922  1.642  2.050      X      X      X      X      X      X      X
 8           0  0.948  1.803      X      X      X      X      X      X      X      X
 9           0  0.965      X      X      X      X      X      X      X      X      X
10           0      X      X      X      X      X      X      X      X      X      X
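The successive approximations (4.38) are straightforward to run for a small population. The sketch below is an illustration rather than the authors' computation: it assumes the state space $\{(x_1,x_2) : x_1 + x_2 \le N\}$, compares only the two endpoints $a = 0$ and $a = \bar a$ (the bang-bang property stated above), and stops after $2N + 1$ sweeps, which suffices because every transition lowers $2x_1 + x_2$ by one. The function name `sir_value_iteration` is hypothetical.

```python
def sir_value_iteration(beta, zeta, a_bar, C, N):
    """Successive approximations (4.38) on states (x1, x2) with x1 + x2 <= N."""
    states = [(x1, x2) for x1 in range(N + 1) for x2 in range(N + 1 - x1)]
    W = {s: 0.0 for s in states}       # starting function W_0 = 0
    policy = {s: 0.0 for s in states}  # minimizing action found in the last sweep
    for _ in range(2 * N + 1):
        new_W = {}
        for (x1, x2) in states:
            if x2 == 0:
                new_W[(x1, x2)] = 0.0  # no infectives: W((x1, 0)) = 0 with a = 0
                continue
            best_val, best_a = None, None
            for a in (0.0, a_bar):  # only the endpoints of [0, a_bar] are compared
                infect = beta * x1 * x2
                recover = zeta * x2 + a
                total = infect + recover
                val = (infect + C * a) / total + recover / total * W[(x1, x2 - 1)]
                if x1 > 0:
                    val += infect / total * W[(x1 - 1, x2 + 1)]
                if best_val is None or val < best_val:
                    best_val, best_a = val, a
            new_W[(x1, x2)] = best_val
            policy[(x1, x2)] = best_a
        W = new_W
    return W, policy


W, policy = sir_value_iteration(beta=0.5, zeta=1.0, a_bar=50.0, C=0.2, N=10)
```

With the parameters of the example, the first entries computed this way round to 0.204 for state (1, 1), 0.408 for (1, 2) and 0.219 for (2, 1), in line with the corresponding table entries; the remaining entries can be compared in the same way.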
The uniformly optimal control strategy $\varphi^*$, which was constructed using Theorem 4.2.1, prescribes always applying the maximal treatment rate $\bar a = 50$ if the number of susceptibles $x_1$ is 3 or more. In other cases, $\varphi^*((0,x_2)) \equiv 0$, $\varphi^*((1,x_2)) = \bar a\, I\{x_2 < 3\}$, and $\varphi^*((2,x_2)) = \bar a\, I\{x_2 < 8\}$.

If the value of $x_1$ is small, the treatment is not justified for big values of $x_2$, because, with high probability, all the existing susceptibles will nevertheless become infective, and the cost of treatment will be simply lost. Since the treatment results in the fixed recovery rate $a$, it is clearly more effective if the number of infectives, $x_2$, is small. Therefore, it can happen that, even for big values of $x_1$, $\varphi^*((x_1,x_2)) = 0$ for big values of $x_2$. We repeated the calculations for the bigger population, $N = 500$. All the parameters were the same, except for $\beta = 0.01$. This change is natural because, as explained in Sect. 1.2.3, the product of $\beta$ and $N$ is normally constant. The optimal control strategy, $\varphi^*((x_1,x_2))$, is graphically presented in Fig. 4.2. Treatment (no treatment) means $\varphi^* = \bar a$ ($\varphi^* = 0$).

In particular cases, the decrease of the number of infectives (natural, or resulting from action $a$) may be associated with the removals/isolation of those individuals, rather than recoveries. If so, in the mathematical model, one has to change the infection rate $\beta x_1 x_2$ to $\frac{\beta x_1 x_2}{x_1 + x_2}$. As explained in Sect. 1.2.3, the infection transition rate in a closed population of size $N$ equals $\beta = \frac{\hat\beta\delta}{N}$. In case of removals, that will be $\frac{\hat\beta\delta}{X_1(t)+X_2(t)}$.

[Fig. 4.2 Controlled epidemic: regions "Treatment" and "No treatment" in the plane of susceptibles (horizontal axis) and infectives (vertical axis), below the line $x_1 + x_2 = N$.]

4.2.3.2 Selling An Asset
Consider the undiscounted constrained problem described in Sect. 1.2.5, where $Z$ is the uniform random variable on the interval $[b_1,b_2]$ with $0 \le b_1 < b_2 < \infty$, whose distribution is independent of the current offer value $x$. Taking into account Remark 1.2.1, the primitives of the CTMDP are as follows: $\mathbf{X} = \{0\}\cup[b_1,b_2]\cup\{\Delta\} =: \mathbf{X}_\Delta\cup\{\Delta\}$; $\mathbf{A} = [b_1,b_2]$. For $x \ne \Delta$,

$$q([b_1,y]\setminus\{x\}\mid x,a) = \lambda\begin{cases}\dfrac{y-b_1}{b_2-b_1}, & \text{if } y<a;\\[2mm] \dfrac{a-b_1}{b_2-b_1}, & \text{if } y\ge a;\end{cases}$$

$$q(\{\Delta\}\mid x,a) = \lambda\left(1-\frac{a-b_1}{b_2-b_1}\right);\qquad q(\{x\}\mid x,a) = -\lambda;$$

and $q(\{\Delta\}\mid\Delta,a)\equiv 0$. The cost rates are given for each $x\in\mathbf{X}$ and $a\in\mathbf{A}$ by

$$c_0(x,a) = I\{x\ne\Delta\}\left(C-\lambda\int_{[a,b_2]}\frac{r}{b_2-b_1}\,dr\right) = I\{x\ne\Delta\}\left(C-\lambda\,\frac{b_2^2-a^2}{2(b_2-b_1)}\right);\qquad c_1(x,a) = I\{x\ne\Delta\}.$$
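According to Lemma 4.2.1 (see also (4.26) in its proof), Corollary 4.2.1 and Remark 4.2.1 (see the beginning of Sect. 4.2.2), we will solve the DTMDP problem (4.31) with $J = 1$. The state and action spaces remain the same, the state $\Delta$ is absorbing with no future costs and, for $x\ne\Delta$, the transition probability is as follows:

$$p([b_1,y]\mid x,a) = \begin{cases}\dfrac{\lambda\frac{y-b_1}{b_2-b_1}+\varepsilon I\{x\in[b_1,y]\}}{\lambda+\varepsilon}, & \text{if } y<a;\\[3mm] \dfrac{\lambda\frac{a-b_1}{b_2-b_1}+\varepsilon I\{x\in[b_1,y]\}}{\lambda+\varepsilon}, & \text{if } y\ge a;\end{cases}\qquad p(\{\Delta\}\mid x,a) = \frac{\lambda\frac{b_2-a}{b_2-b_1}}{\lambda+\varepsilon}.$$

The cost functions are

$$l_0(x,a) = \frac{c_0(x,a)}{\lambda+\varepsilon} = I\{x\ne\Delta\}\,\frac{C-\lambda\frac{b_2^2-a^2}{2(b_2-b_1)}}{\lambda+\varepsilon};\qquad l_1(x,a) = \frac{c_1(x,a)}{\lambda+\varepsilon} = I\{x\ne\Delta\}\,\frac{1}{\lambda+\varepsilon}.$$

The initial distribution is $\gamma(dx) = \delta_0(dx)$. Below, $M^\sigma_0$ is the total occupation measure associated with a control strategy $\sigma\in\Sigma$, as described in Definition C.2.5; $\mathcal{D} = \{M^\sigma_0,\ \sigma\in\Sigma\}$, as usual. If $M^\sigma_0(\mathbf{X}_\Delta\times\mathbf{A}) = \infty$ then $\int_{\mathbf{X}\times\mathbf{A}} l_1(x,a)M^\sigma_0(dx\times da) = \infty$, so that Condition C.2.4 is satisfied. We also assume that

$$d_1 > \frac{1}{\lambda}$$

in order for the Slater condition to be fulfilled: for the stationary deterministic strategy $\varphi(x)\equiv b_1$ we have

$$E^{\varphi}_0\left[\sum_{n=0}^\infty l_1(X_n,A_{n+1})\right] = \frac{1}{\lambda+\varepsilon}E^{\varphi}_0\left[\sum_{n=0}^\infty I\{X_n\ne\Delta\}\right] = \frac{1}{\lambda+\varepsilon}\left(1+\frac{\varepsilon}{\lambda+\varepsilon}+\frac{\varepsilon^2}{(\lambda+\varepsilon)^2}+\ldots\right) = \frac{1}{\lambda}.\tag{4.39}$$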
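Finally, by Remark 4.2.1, the minimal value in problem (4.31) is bigger than $-\infty$: it is bigger than

$$-\text{Expectation of the Reward} \ \ge\ \inf_{x\in[b_1,b_2]\cup\{0\}}\ \inf_{y\in([b_1,b_2]\setminus\{0\})\times\{\Delta\}} C_0(x,y) = -b_2 > -\infty;$$

$C_0(x,y)$ is as introduced in Sect. 1.2.5. Now all the conditions of Proposition C.2.16 are satisfied, and we are going to solve the Primal Convex Program (C.18), which is equivalent to the problem (4.31). To solve the Dual Convex Program

$$G(g_1) := \inf_{M\in\mathcal{D}:\ M(\mathbf{X}_\Delta\times\mathbf{A})<\infty}\left\{\int_{\mathbf{X}\times\mathbf{A}} l_0(x,a)M(dx\times da) + g_1\left(\int_{\mathbf{X}\times\mathbf{A}} l_1(x,a)M(dx\times da) - d_1\right)\right\}\to\max_{g_1\ge 0},\tag{4.40}$$

[...]

$$\ldots > -\infty,\tag{4.42}$$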
hence the DTMDP with the cost function $l_0 + g_1 l_1$ is summable. Note also that the infimum of expression (4.42) over $\sigma\in\Sigma$ is smaller than $+\infty$ because, e.g., for the stationary deterministic strategy $\varphi(x)\equiv b_1$, it equals $(C+g_1)/\lambda - (b_1+b_2)/2$: the proof is similar to (4.39). So, to compute $G(g_1)$, we need to find the Bellman function $W_0^{DT*}(x)$ which, by Proposition C.2.2, satisfies the Bellman equation (C.7) and is finite-valued as explained above. It is intuitively clear that $W_0^{DT*}(x)\equiv W_0^{DT*}$ does not vary with $x\ne\Delta$, but this can also be seen from the Bellman equation. Indeed, the Bellman equation (C.7) reads as follows:

$$W_0^{DT*}(x) = \inf_{a\in[b_1,b_2]}\left\{\frac{C-\lambda\frac{b_2^2-a^2}{2(b_2-b_1)}}{\lambda+\varepsilon}+\frac{g_1}{\lambda+\varepsilon}+\frac{1}{\lambda+\varepsilon}\left[\lambda\int_{[b_1,a)}\frac{1}{b_2-b_1}W_0^{DT*}(y)\,dy+\varepsilon W_0^{DT*}(x)\right]\right\},\qquad x\in[b_1,b_2]\cup\{0\}.$$

We omitted the obvious terms containing $W_0^{DT*}(\Delta) = 0$. After we multiply this equation by $\frac{\lambda+\varepsilon}{\lambda}$, we see that $W_0^{DT*}(x)\equiv W_0^{DT*}$ is a constant on $[b_1,b_2]\cup\{0\}$ satisfying equation

$$W_0^{DT*} = \inf_{a\in[b_1,b_2]}\left\{\frac{C}{\lambda}-\frac{b_2^2-a^2}{2(b_2-b_1)}+\frac{g_1}{\lambda}+W_0^{DT*}\,\frac{a-b_1}{b_2-b_1}\right\}.\tag{4.43}$$
It has the unique finite solution

$$W_0^{DT*} = \begin{cases}\sqrt{\dfrac{2(C+g_1)(b_2-b_1)}{\lambda}}-b_2, & \text{if } 2(C+g_1)\le\lambda(b_2-b_1);\\[3mm] \dfrac{C+g_1}{\lambda}-\dfrac{b_1+b_2}{2}, & \text{if } 2(C+g_1)>\lambda(b_2-b_1).\end{cases}\tag{4.44}$$
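As a numerical sanity check of (4.44), the closed form can be substituted into the right-hand side of (4.43) and the infimum over $a$ approximated on a grid. This is a minimal sketch only: the function names and the parameter values in the usage lines are hypothetical and chosen purely for illustration.

```python
import math


def w_closed_form(C, g1, lam, b1, b2):
    """Closed-form solution (4.44) of the scalar equation (4.43)."""
    k = (C + g1) / lam
    if 2.0 * (C + g1) <= lam * (b2 - b1):
        return math.sqrt(2.0 * k * (b2 - b1)) - b2
    return k - (b1 + b2) / 2.0


def rhs_443(W, C, g1, lam, b1, b2, grid=20001):
    """Right-hand side of (4.43) at a given constant W; infimum taken on a grid of a."""
    best = float("inf")
    for i in range(grid):
        a = b1 + (b2 - b1) * i / (grid - 1)
        value = (C / lam - (b2 ** 2 - a ** 2) / (2.0 * (b2 - b1))
                 + g1 / lam + W * (a - b1) / (b2 - b1))
        best = min(best, value)
    return best


# Hypothetical parameters: one from each branch of (4.44).
for C, g1 in ((0.5, 0.3), (3.0, 2.0)):
    W = w_closed_form(C, g1, lam=2.0, b1=1.0, b2=3.0)
    print(C, g1, W, rhs_443(W, C, g1, lam=2.0, b1=1.0, b2=3.0))
```

In both printed cases the two numbers should coincide up to the grid resolution, which is what being a fixed point of (4.43) means.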
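The (constant for a fixed $g_1$) mapping

$$\varphi^*(x)\equiv a^* = \begin{cases}-W_0^{DT*}, & \text{if } 2(C+g_1)\le\lambda(b_2-b_1);\\ b_1, & \text{if } 2(C+g_1)>\lambda(b_2-b_1)\end{cases}$$

from $[b_1,b_2]\cup\{0\}$ to $[b_1,b_2]$ provides the infimum in (4.43). The value of $\varphi^*(\Delta)$ plays no role.

Suppose $C+g_1 > 0$. Then $a^* < b_2$ and, for $x\in\mathbf{X}_\Delta$,

$$E^{\varphi^*}_x\left[W_0^{DT*}(X_n)\right] = W_0^{DT*}\,E^{\varphi^*}_x\left[I\{X_n\ne\Delta\}\right],$$

where

$$E^{\varphi^*}_x\left[I\{X_n\ne\Delta\}\right] = \left(\frac{\lambda+\varepsilon-\lambda\frac{b_2-a^*}{b_2-b_1}}{\lambda+\varepsilon}\right)^n =: \delta^n\to 0\ \text{ as } n\to\infty.$$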
According to Proposition C.2.3, the constructed stationary deterministic strategy $\varphi^*(x)\equiv a^*$ is uniformly optimal for problem (C.4); in particular it solves problem (4.41). Moreover, it is such that

$$M_0^{\varphi^*}(\mathbf{X}_\Delta\times\mathbf{A}) = E_0^{\varphi^*}\left[\sum_{n=0}^\infty I\{X_n\ne\Delta\}\right] = \sum_{n=0}^\infty\delta^n = \frac{(\lambda+\varepsilon)(b_2-b_1)}{\lambda(b_2-a^*)} < \infty.$$

Therefore, $G(g_1) = W_0^{DT*} - g_1 d_1$. Note, $W_0^{DT*}$ is a function of $g_1$, sometimes denoted as $W_0^{DT*}(g_1)$ below.

If $C+g_1 = 0$, the control strategy $\varphi^*(x)\equiv b_2$ is not optimal for problem (4.41):

$$E_0^{\varphi^*}\left[\sum_{n=0}^\infty l_0(X_n,A_{n+1})\right] = 0 > W_0^{DT*} = -b_2.$$
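Nevertheless, for the sequence $\varphi^k(x)\equiv b_2-\frac{1}{k}$ (such that $k > \frac{1}{b_2-b_1}$) we have

$$E_0^{\varphi^k}\left[\sum_{n=0}^\infty l_0(X_n,A_{n+1})\right] = \frac{-\lambda\left(b_2^2-(b_2-\frac1k)^2\right)}{2(\lambda+\varepsilon)(b_2-b_1)}\,E_0^{\varphi^k}\left[\sum_{n=0}^\infty I\{X_n\ne\Delta\}\right] = \frac{-\lambda\left(\frac{2b_2}{k}-\frac{1}{k^2}\right)}{2(\lambda+\varepsilon)(b_2-b_1)}\cdot\frac{(\lambda+\varepsilon)(b_2-b_1)}{\lambda\left(b_2-(b_2-\frac1k)\right)} = -b_2+\frac{1}{2k}.$$

Moreover, for each $k$,

$$M_0^{\varphi^k}(\mathbf{X}_\Delta\times\mathbf{A}) = E_0^{\varphi^k}\left[\sum_{n=0}^\infty I\{X_n\ne\Delta\}\right] = \frac{(\lambda+\varepsilon)(b_2-b_1)}{\lambda\cdot\frac1k} < \infty.$$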
Therefore, again $G(0) = W_0^{DT*} = -b_2$. The constructed function $G(g_1)$ is continuous and concave. The solution to the Dual Convex Program (4.40) is provided by

$$g_1^* := \begin{cases}\dfrac{b_2-b_1}{2\lambda d_1^2}-C, & \text{if } d_1^2\le\dfrac{b_2-b_1}{2C\lambda};\\[3mm] 0, & \text{if } d_1^2>\dfrac{b_2-b_1}{2C\lambda}.\end{cases}$$

The case $C = 0$ is not excluded here: $\frac{b_2-b_1}{0} := +\infty$. Observe that, if $C = 0$, then $g_1^* > 0$. The last step is to show that the deterministic stationary strategy

$$\varphi^*(x)\equiv a^* = \begin{cases}-W_0^{DT*}(g_1^*), & \text{if } 2(C+g_1^*)\le\lambda(b_2-b_1);\\ b_1, & \text{if } 2(C+g_1^*)>\lambda(b_2-b_1),\end{cases}$$

where $W_0^{DT*}(g_1)$ is given in (4.44), solves the initial constrained problem (4.31).
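The expression for $g_1^*$ can be recovered by a short calculation, sketched here on the assumption that $G(g_1) = W_0^{DT*}(g_1) - g_1 d_1$ with $W_0^{DT*}(g_1)$ taken from (4.44), and that $d_1 > 1/\lambda$ as required above. On the region $2(C+g_1)\le\lambda(b_2-b_1)$,

$$G(g_1) = \sqrt{\frac{2(C+g_1)(b_2-b_1)}{\lambda}} - b_2 - g_1 d_1,\qquad G'(g_1) = \sqrt{\frac{b_2-b_1}{2\lambda(C+g_1)}} - d_1,$$

so $G'(g_1) = 0$ if and only if $C + g_1 = \frac{b_2-b_1}{2\lambda d_1^2}$. This stationary point satisfies $g_1\ge 0$ exactly when $d_1^2\le\frac{b_2-b_1}{2C\lambda}$, and it lies in the region considered because $2(C+g_1) = \frac{b_2-b_1}{\lambda d_1^2} < \lambda(b_2-b_1)$. On the complementary region $2(C+g_1)>\lambda(b_2-b_1)$,

$$G(g_1) = \frac{C+g_1}{\lambda}-\frac{b_1+b_2}{2}-g_1 d_1,\qquad G'(g_1) = \frac{1}{\lambda}-d_1 < 0,$$

so the concave function $G$ has no maximiser there, and when $d_1^2>\frac{b_2-b_1}{2C\lambda}$ it is decreasing on $[0,\infty)$, which gives $g_1^* = 0$.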
Note that $C + g_1^* > 0$, and, as shown above, $\varphi^*$ provides the optimal solution to the unconstrained problem (4.41) and $M_0^{\varphi^*}(\mathbf{X}_\Delta\times\mathbf{A}) < \infty$. Hence, the total occupation measure $M_0^{\varphi^*}$ provides the infimum in (4.40) at $g_1^*$. To compute the value $\sup(D^c)$ of the Dual Convex Program (4.40) and to show that $\varphi^*$ solves the constrained problem (4.31), we consider the following cases.

(a)

$$d_1^2\le\frac{b_2-b_1}{2C\lambda}\ \Longrightarrow\ g_1^* = \frac{b_2-b_1}{2\lambda d_1^2}-C,$$

and $2(C+g_1^*) = \frac{b_2-b_1}{\lambda d_1^2} < \lambda(b_2-b_1)$ because $d_1 > \frac{1}{\lambda}$. Therefore,

$$a^* = -W_0^{DT*}(g_1^*) = b_2-\frac{b_2-b_1}{\lambda d_1}$$

and

$$\sup(D^c) = W_0^{DT*}(g_1^*) - g_1^* d_1 = \frac{b_2-b_1}{2\lambda d_1}+Cd_1-b_2.$$

Similarly to (4.39),

$$E_0^{\varphi^*}\left[\sum_{n=0}^\infty l_1(X_n,A_{n+1})\right] = \frac{1}{\lambda+\varepsilon}E_0^{\varphi^*}\left[\sum_{n=0}^\infty I\{X_n\ne\Delta\}\right] = \frac{1}{\lambda+\varepsilon}\cdot\frac{\lambda+\varepsilon}{\lambda\frac{b_2-a^*}{b_2-b_1}} = d_1.$$
(b)

$$d_1^2>\frac{b_2-b_1}{2C\lambda}\ \Longrightarrow\ g_1^* = 0$$

and

$$\sup(D^c) = W_0^{DT*}(0) = \begin{cases}\sqrt{\dfrac{2C(b_2-b_1)}{\lambda}}-b_2, & \text{if } 2C\le\lambda(b_2-b_1);\\[3mm] \dfrac{C}{\lambda}-\dfrac{b_1+b_2}{2}, & \text{if } 2C>\lambda(b_2-b_1).\end{cases}$$

If $2C\le\lambda(b_2-b_1)$, then

$$a^* = -W_0^{DT*}(0) = b_2-\sqrt{\frac{2C(b_2-b_1)}{\lambda}}$$

and, similarly to (4.39),

$$E_0^{\varphi^*}\left[\sum_{n=0}^\infty l_1(X_n,A_{n+1})\right] = \frac{1}{\lambda+\varepsilon}E_0^{\varphi^*}\left[\sum_{n=0}^\infty I\{X_n\ne\Delta\}\right] = \frac{1}{\lambda+\varepsilon}\left(\frac{(b_2-b_1)(\lambda+\varepsilon)}{\lambda\sqrt{\frac{2C(b_2-b_1)}{\lambda}}}\right) = \frac{\sqrt{b_2-b_1}}{\sqrt{2C\lambda}} < \sqrt{d_1^2} = d_1,$$

so that the total occupation measure $M_0^{\varphi^*}$ is feasible. If $2C>\lambda(b_2-b_1)$, then $a^* = b_1$ and

$$E_0^{\varphi^*}\left[\sum_{n=0}^\infty l_1(X_n,A_{n+1})\right] = \frac{1}{\lambda} < d_1$$

according to (4.39).

To summarise, the total occupation measure $M_0^{\varphi^*}$ solves the Primal Convex Program (C.18) by Proposition C.2.16 and hence the deterministic stationary strategy

$$\varphi^*(x)\equiv a^* = \begin{cases}b_2-\dfrac{b_2-b_1}{\lambda d_1}, & \text{if } d_1^2\le\dfrac{b_2-b_1}{2C\lambda};\\[3mm] b_2-\sqrt{\dfrac{2C(b_2-b_1)}{\lambda}}, & \text{if } d_1^2>\dfrac{b_2-b_1}{2C\lambda}\ \text{and}\ 2C\le\lambda(b_2-b_1);\\[3mm] b_1, & \text{if } d_1^2>\dfrac{b_2-b_1}{2C\lambda}\ \text{and}\ 2C>\lambda(b_2-b_1)\end{cases}$$

provides an optimal solution to the constrained problem (4.31). The minimal value of that constrained problem equals the value $\inf(P^c)$ of the Primal Convex Program (C.18) which, in turn, coincides with the value $\sup(D^c)$ of the Dual Convex Program (4.40) calculated above, again by Proposition C.2.16. If $d_1^2>\frac{b_2-b_1}{2C\lambda}$, then the constraint is not active: the solutions to the constrained and unconstrained problems coincide. According to Remark 4.2.2 and the discussions at the beginning of Sect. 4.2.2, we see that $\varphi^*$ is an optimal deterministic stationary strategy for the original constrained CTMDP problem (1.16).
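The final formula can be cross-checked against a brute-force search over constant acceptance thresholds. The sketch below is an illustration, not the method of the text: it uses only the closed-form expressions derived above for a constant threshold $a$, namely the expected total of $l_1$ equal to $\frac{b_2-b_1}{\lambda(b_2-a)}$ and the expected total of $l_0$ equal to $C$ times that quantity minus $\frac{a+b_2}{2}$. Function names and the numerical parameters are hypothetical.

```python
import math


def optimal_threshold(C, lam, b1, b2, d1):
    """Closed-form threshold a* from the three-case summary (assumes d1 > 1/lam)."""
    limit = math.inf if C == 0 else (b2 - b1) / (2.0 * C * lam)
    if d1 ** 2 <= limit:                       # constraint active
        return b2 - (b2 - b1) / (lam * d1)
    if 2.0 * C <= lam * (b2 - b1):             # unconstrained interior optimum
        return b2 - math.sqrt(2.0 * C * (b2 - b1) / lam)
    return b1                                  # accept the first offer


def expected_time(a, lam, b1, b2):
    """Total expected value of l1 under the constant threshold a < b2."""
    return (b2 - b1) / (lam * (b2 - a))


def expected_cost(a, C, lam, b1, b2):
    """Total expected value of l0: holding cost minus mean accepted offer."""
    return C * expected_time(a, lam, b1, b2) - (a + b2) / 2.0


def brute_force_threshold(C, lam, b1, b2, d1, steps=20000):
    """Best feasible constant threshold on a grid of [b1, b2)."""
    best_a, best_cost = None, math.inf
    for i in range(steps):                     # a = b2 itself is excluded
        a = b1 + (b2 - b1) * i / steps
        if expected_time(a, lam, b1, b2) <= d1 + 1e-12:
            cost = expected_cost(a, C, lam, b1, b2)
            if cost < best_cost:
                best_a, best_cost = a, cost
    return best_a


for C, d1 in ((0.5, 0.8), (0.5, 1.5), (3.0, 0.8)):
    print(C, d1, optimal_threshold(C, 2.0, 1.0, 3.0, d1),
          brute_force_threshold(C, 2.0, 1.0, 3.0, d1))
```

The three parameter pairs are chosen to hit the three branches of the summary formula. The search is restricted to constant thresholds, so agreement only confirms the formula within that class.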
4.2.4 Models with Strongly Positive Intensities

The reduction of the CTMDP problems (1.15) and (1.16) to equivalent problems in a DTMDP becomes much simpler, and was actually justified in the proof of Theorem 3.2.7, in the case when there is a positive discount factor $\alpha > 0$. That reasoning survives when $\alpha = 0$ provided that the CTMDP model $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$ has strongly positive intensities in the following sense.

Definition 4.2.2 Suppose Condition 1.3.1 is satisfied. The CTMDP model $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$ has strongly positive intensities if for each $x\in\mathbf{X}_\Delta$, there is some $\varepsilon(x) > 0$ such that $q_x(a)\ge\varepsilon(x) > 0$ for all $a\in\mathbf{A}$.

In a model with strongly positive intensities, as far as the present section is concerned, the relevant proofs for the discounted model all survive after one puts $\alpha = 0$ as the only modification. Thus, we choose to single out the model with strongly positive intensities here mainly for the sake of readability. Note also that, as shown in Sect. 1.3.5, the investigation of a discounted model with $\alpha > 0$ is equivalent to the investigation of the "hat" model with killing, which is undiscounted, but obviously with strongly positive intensities.
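4.2.4.1 Description of the Induced DTMDP

For the given CTMDP $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$, we consider the following DTMDP $(\mathbf{X}\cup\{\Delta\},\mathbf{A},p_\alpha,\{l_j\}_{j=0}^J)$, where
• $\mathbf{X}\cup\{\Delta\}$ is the state space, where $\Delta$ is the cemetery state;
• $\mathbf{A}$ is the action space;
• the transition probability is given for each $\Gamma\in\mathcal{B}(\mathbf{X}\cup\{\Delta\})$ by
$$p_\alpha(\Gamma\mid x,a) = \begin{cases}\dfrac{\tilde q(\Gamma\cap\mathbf{X}\mid x,a)+\alpha I\{\Delta\in\Gamma\}}{\alpha+q_x(a)}, & \text{if } x\in\mathbf{X},\ a\in\mathbf{A};\\[3mm] I\{\Delta\in\Gamma\}, & \text{if } x=\Delta;\end{cases}$$
• the cost functions are $l_j(x,a) = \frac{c_j(x,a)}{q_x(a)+\alpha}$ for each $(x,a)\in\mathbf{X}\times\mathbf{A}$, and $l_j(\Delta,a)\equiv 0$, $j = 0,1,\ldots,J$.

Here it is allowed that $\alpha = 0$, in which case $\frac{0}{0} = 0$ is accepted in the above description of the DTMDP. When $\alpha > 0$, this DTMDP already appeared in the proof of Theorem 3.2.7. For a strategy $\sigma$ in the described DTMDP with the same initial distribution $\gamma$ as in the CTMDP model, the corresponding strategic measure is denoted by $P_\gamma^{\sigma,\alpha}$. The expectation taken with respect to $P_\gamma^{\sigma,\alpha}$ is written as $E_\gamma^{\sigma,\alpha}$. The necessary knowledge on DTMDP is presented in Appendix C. Note that the actions in state $\Delta$ do not affect the measures

$$M_{\gamma,n}^{\sigma,\alpha}(\Gamma_X\times\Gamma_A) := P_\gamma^{\sigma,\alpha}(X_{n-1}\in\Gamma_X,\ A_n\in\Gamma_A),\qquad\forall\ n\in\mathbb{N},\ \Gamma_X\in\mathcal{B}(\mathbf{X}),\ \Gamma_A\in\mathcal{B}(\mathbf{A}).$$

Therefore, below with some abuse of notation, we say that $\sigma^M$ is a Markov strategy in the DTMDP if the stochastic kernels $\sigma_n^M(da|x)$ are defined for all $x\in\mathbf{X}$, $n\in\mathbb{N}$.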
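For a finite model, the construction above amounts to a few divisions. The following sketch assumes a hypothetical dictionary representation (`rates[(x, a)]` maps each target state $y \ne x$ to $\tilde q(\{y\}|x,a)$) and a hypothetical function name; the usage example takes state 1 of the preventive maintenance example of Sect. 4.2.5, with the rate structure inferred from the probabilities tabulated there.

```python
CEMETERY = "cemetery"  # hypothetical label for the added absorbing state


def induced_dtmdp(rates, costs, alpha):
    """Build p_alpha and l from jump rates and cost rates of a finite CTMDP.

    rates[(x, a)] = {y: rate of jumping from x to y != x under action a}
    costs[(x, a)] = cost rate c(x, a)
    """
    p, l = {}, {}
    for (x, a), jumps in rates.items():
        qx = sum(jumps.values())               # total jump intensity q_x(a)
        denom = alpha + qx
        if denom <= 0.0:
            # alpha = 0 with q_x(a) = 0: excluded by strongly positive intensities
            raise ValueError("alpha + q_x(a) must be positive")
        row = {y: r / denom for y, r in jumps.items()}
        row[CEMETERY] = alpha / denom          # killing with probability alpha/(alpha+q_x(a))
        p[(x, a)] = row
        l[(x, a)] = costs[(x, a)] / denom
    return p, l


# State 1 of the maintenance example: degradation rate 2, replacement rate 20, reward rate 150.
rates = {(1, 0): {2: 2.0}, (1, 1): {0: 20.0, 2: 2.0}}
costs = {(1, 0): -150.0, (1, 1): -150.0}
p, l = induced_dtmdp(rates, costs, alpha=0.5)
print(p[(1, 1)], l[(1, 1)])
```

Each row of `p` sums to one once the cemetery is included; restricted to the original states it is substochastic whenever $\alpha > 0$.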
Theorem 4.2.3 Let α ∈ [0, ∞) be fixed, and γ be the common initial distribution in the CTMDP (X, A, q, {c j } Jj=0 ) and DTMDP (X , A, pα , {l j } Jj=0 ). Suppose the CTMDP (X, A, q, {c j } Jj=0 ) has strongly positive intensities when α = 0. Consider a Markov standard ξ-strategy M = { nM (da|x)}∞ n=1 in the CTMDP ∞ M (X, A, q, {c j } Jj=0 ) and the Markov strategy σ M = {σnM (da|x)}∞ n=1 = { n (da|x)}n=1 J in the DTMDP (X , A, pα , {l j } j=0 ). For each n ∈ N, ,α m γ,n (d x × da) = M
1 M Mσγ,n,α (d x × da) qx (a) + α
on B(X × A). Proof In the case of α > 0, the desired equality was established in the proof of Theorem 3.2.7: see (3.49). That reasoning can be repeated word by word with α = 0. Suppose that the CTMDP (X, A, q, {c j } Jj=0 ) has strongly positive intensities when α = 0. Markov standard ξ-strategies in the CTMDP model are sufficient for solving the CTMDP problems (1.15) and (1.16), see Theorem 1.3.2 and Sect. 1.3.5; and Markov strategies in the DTMDP model are sufficient for solving the total cost problems (C.4) and (C.5), see Proposition C.1.4. According to Sect. 1.3.1 and Theorem 4.2.3, W jα ( M , γ)
=
∞ n=1 X×A ∞
,α c j (x, a)m γ,n (d x × da) M
c j (x, a) σ M ,α Mγ,n (d x × da) q (a) + α x n=1 X×A ∞ c j (X n−1 , An ) σ M ,α = Eγ , q (An ) + α n=1 X n−1
=
if M = σ M . Therefore, provided that the CTMDP (X, A, q, {c j } Jj=0 ) has strongly positive intensities when α = 0, the CTMDP problems (1.15) and (1.16) are reduced to the corresponding optimal control problems in DTMDP (X , A, pα , {l j } Jj=0 ). 4.2.4.2
Selected Results of the Reduction to DTMDP
Suppose the CTMDP $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$ has strongly positive intensities when $\alpha = 0$. According to the first subsubsection, if we solve the problems

$$\text{Minimize } E_\gamma^{\sigma^M,\alpha}\left[\sum_{n=0}^\infty\frac{c_0(X_n,A_{n+1})}{\alpha+q_{X_n}(A_{n+1})}\right]\ \text{over all Markov strategies } \sigma^M$$

or

$$\text{Minimize } E_\gamma^{\sigma^M,\alpha}\left[\sum_{n=0}^\infty\frac{c_0(X_n,A_{n+1})}{\alpha+q_{X_n}(A_{n+1})}\right]\ \text{over all Markov strategies } \sigma^M$$
$$\text{subject to } E_\gamma^{\sigma^M,\alpha}\left[\sum_{n=0}^\infty\frac{c_j(X_n,A_{n+1})}{\alpha+q_{X_n}(A_{n+1})}\right]\le d_j,\quad j = 1,2,\ldots,J,\tag{4.45}$$

in the (sufficient) class of Markov strategies for the DTMDP

$$\left(\mathbf{X}\cup\{\Delta\},\ \mathbf{A},\ p_\alpha,\ \left\{l_j = I\{x\in\mathbf{X}\}\frac{c_j(x,a)}{q_x(a)+\alpha}\right\}_{j=0}^J\right),$$
then the corresponding Markov standard ξ-strategy M = σ M will be a solution to problem (1.15) or (1.16) respectively. We can readily deduce optimality results for the CTMDP problems (1.15) and (1.16) from the known results for the equivalent DTMDP problems. An example of solving problem (4.45) is presented in Sect. 4.2.5. Theorem 4.2.4 Consider the CTMDP (X, A, q, {c j } Jj=0 ) and the induced DTMDP (X , A, pα , {l j } Jj=0 ) described in the first subsubsection. Suppose the CTMDP (X, A, q, {c j } Jj=0 ) has strongly positive intensities when α = 0, and c j (x, a) ∈ [0, ∞) for all (x, a) ∈ X × A and j = 0, 1, . . . , J. Then the following assertions hold. (a) The value (Bellman) function defined by W0∗α (x) = inf W0α (S, x) = inf W0α (S, x), ∀ x ∈ X, S∈SπM
M S∈S
of the unconstrained α-discounted CTMDP problem (1.15) is the minimal nonnegative lower semianalytic solution to the equation: ' W (x) = inf
a∈A
c0 (x, a) + α + qx (a)
W (y) X
( q (dy|x, a) , x ∈ X. α + qx (a)
(4.46)
(b) A deterministic stationary strategy ϕ∗ is uniformly optimal for the unconstrained α-discounted CTMDP problem (1.15) if and only if '
( c0 (x, a) q (dy|x, a) + W0∗α (y) a∈A α + q x (a) α + qx (a) X q (dy|x, ϕ∗ (x)) c0 (x, ϕ∗ (x)) ∗α + W (y) = 0 α + qx (ϕ∗ (x)) α + qx (ϕ∗ (x)) X inf
for each $x\in\mathbf{X}$.
(c) If there is a uniformly optimal π-strategy (or ξ-strategy) for the unconstrained α-discounted CTMDP problem (1.15), then there is a uniformly optimal deterministic stationary strategy for problem (1.15).
(d) Suppose Condition 4.2.1 (or Condition 4.2.2) is satisfied. Then there exists a uniformly optimal deterministic stationary strategy $\varphi^*$ for the unconstrained CTMDP problem (1.15). The value (Bellman) function $W_0^{*\alpha}$ of problem (1.15) is the minimal nonnegative lower semicontinuous (respectively, measurable) solution to (4.46).
(e) Suppose Condition 4.2.1 is satisfied, and there exists a feasible strategy for problem (1.16). Then there is an optimal stationary ξ-strategy for the constrained α-discounted CTMDP problem (1.16).

Proof All the statements follow from the relevant results in Appendix C for the equivalent DTMDP problems; see Propositions C.2.4, C.2.6, C.2.8 and C.2.18. For part (e), if for each feasible strategy $W_0^\alpha(S,\gamma) = +\infty$, it is sufficient to find an arbitrary feasible stationary ξ-strategy, which comes from Proposition C.2.18 after we put, e.g., $c_0(x,a) = 0$.

Remark 4.2.3 Theorem 4.2.4(a) still holds if we replace "=" with "≥" in (4.46).

Several statements presented so far in this section hold under the following slightly weaker version of Condition 4.2.1.

Condition 4.2.3 (a) The (Borel) action space $\mathbf{A}$ is compact.
(b) For each bounded continuous function $f$ on $\mathbf{X}$, $\int_{\mathbf{X}} f(y)q(dy|x,a)$ is continuous in $(x,a)\in\mathbf{X}\times\mathbf{A}$.
(c) For each $j = 0,1,\ldots,J$, the mapping $(x,a)\in\mathbf{X}\times\mathbf{A}\to c_j(x,a)$ is lower semicontinuous.
(d) There exists a continuous function $F$ on $\mathbf{X}$ taking values in $(1,\infty)$ such that $\bar q_x\le F(x)$, $\forall\ x\in\mathbf{X}$.

For example, one can formulate the following version of Theorem 4.2.4(d).

Proposition 4.2.1 Suppose Condition 4.2.3(a,b,c) holds. Let $\alpha > 0$ be fixed. Let $c_0$ be nonnegative. Then $W_0^{*\alpha}$ is lower semicontinuous on $\mathbf{X}$, and there exists a deterministic stationary α-discounted uniformly optimal strategy.
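If the minimal nonnegative lower semianalytic solution to Eq. (4.46) is finite-valued, that equation can be rewritten in different ways:

$$\inf_{a\in\mathbf{A}}\left\{\frac{1}{\alpha+q_x(a)}\left[c_0(x,a)+\int_{\mathbf{X}}W(y)\,\tilde q(dy|x,a)-(\alpha+q_x(a))W(x)\right]\right\} = 0$$
$$\Longleftrightarrow\ \inf_{a\in\mathbf{A}}\left\{c_0(x,a)+\int_{\mathbf{X}}W(y)\,q(dy|x,a)-\alpha W(x)\right\} = 0$$
$$\Longleftrightarrow\ \inf_{a\in\mathbf{A}}\left\{\frac{1}{\alpha+F(x)}\left[c_0(x,a)+\int_{\mathbf{X}}W(y)\,q(dy|x,a)-(\alpha+F(x))W(x)+F(x)W(x)\right]\right\} = 0$$
$$\Longleftrightarrow\ W(x) = \inf_{a\in\mathbf{A}}\left\{\frac{c_0(x,a)}{\alpha+F(x)}+\frac{F(x)}{\alpha+F(x)}\int_{\mathbf{X}}W(y)\left[\frac{q(dy|x,a)}{F(x)}+\delta_x(dy)\right]\right\}$$
$$\Longleftrightarrow\ \frac{W(x)}{w(x)} = \inf_{a\in\mathbf{A}}\left\{\frac{c_0(x,a)}{w(x)(\alpha+F(x))}+\frac{F(x)}{\alpha+F(x)}\int_{\mathbf{X}}\frac{W(y)}{w(y)}\,w(y)\left[\frac{q(dy|x,a)}{w(x)F(x)}+\frac{\delta_x(dy)}{w(x)}\right]\right\},\tag{4.47}$$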
where $F(x) > 0$ and $w(x) > 0$ are arbitrarily fixed functions. For example, under the conditions of Theorem 3.1.1, iterations (3.5) converge to the solution $u_\infty(x) := \frac{W(x)}{w(x)}$ of the Eq. (4.47), where $F(x) := 1+\bar q_x$ and $w(x) := w'(x)$. Another example corresponds to the case when the transition rate is uniformly bounded: $\bar q_x\le K$ (and $K > 0$). Here one can put $F(x)\equiv K$. If the cost rate $c_0(\cdot)$ is also bounded, then, after we take $w(x)\equiv 1$, Eq. (4.47) transforms to

$$W(x) = \inf_{a\in\mathbf{A}}\left\{\frac{c_0(x,a)}{\alpha+K}+\frac{K}{\alpha+K}\int_{\mathbf{X}}W(y)\left[\frac{q(dy|x,a)}{K}+\delta_x(dy)\right]\right\}.\tag{4.48}$$
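The equivalence of (4.46) and the uniformized form (4.48) can be observed numerically on a toy model. The two-state CTMDP below is entirely hypothetical (states, actions, rates, costs, $\alpha$ and $K$ are made up for illustration); the sketch iterates both operators from zero and compares the limits.

```python
# Hypothetical two-state, two-action CTMDP with bounded rates and nonnegative costs.
ALPHA = 0.3
STATES, ACTIONS = (0, 1), (0, 1)
RATES = {(0, 0): {1: 1.0}, (0, 1): {1: 3.0},      # jump rates to the other state
         (1, 0): {0: 2.0}, (1, 1): {0: 0.5}}
COSTS = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 4.0, (1, 1): 2.0}


def total_rate(x, a):
    return sum(RATES[(x, a)].values())


def step_446(W):
    """One application of the operator on the right-hand side of (4.46)."""
    return {x: min((COSTS[(x, a)] + sum(r * W[y] for y, r in RATES[(x, a)].items()))
                   / (ALPHA + total_rate(x, a)) for a in ACTIONS)
            for x in STATES}


def step_448(W, K):
    """One application of the uniformized operator (4.48); K must bound all rates."""
    new = {}
    for x in STATES:
        values = []
        for a in ACTIONS:
            # integral of W against the signed kernel q(dy|x,a)
            drift = sum(r * W[y] for y, r in RATES[(x, a)].items()) - total_rate(x, a) * W[x]
            values.append((COSTS[(x, a)] + drift + K * W[x]) / (ALPHA + K))
        new[x] = min(values)
    return new


def iterate(step, sweeps=5000):
    W = {x: 0.0 for x in STATES}
    for _ in range(sweeps):
        W = step(W)
    return W


W446 = iterate(step_446)
W448 = iterate(lambda W: step_448(W, K=3.0))
print(W446, W448)
```

Both operators are contractions here with modulus at most $K/(\alpha+K)$, so the iterations converge geometrically and the two printed value functions should agree up to rounding.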
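Under mild conditions, the operator on the right-hand side of (4.48) is contractive in the space of bounded lower semicontinuous functions on $\mathbf{X}$ with the norm being $\sup_{x\in\mathbf{X}}|W(x)|$. This trivial situation is well known: see Sect. 4.3.

Using the reduction to the DTMDP, if $\alpha > 0$, one can deduce equation

$$\int_{\Gamma_X\times\mathbf{A}}\frac{q_x(a)+\alpha}{\alpha}\,\eta_\gamma^{S,\alpha}(dx\times da) = \gamma(\Gamma_X)+\frac{1}{\alpha}\int_{\mathbf{X}\times\mathbf{A}}\tilde q(\Gamma_X|x,a)\,\eta_\gamma^{S,\alpha}(dx\times da),\qquad\forall\ \Gamma_X\in\mathcal{B}(\mathbf{X})\tag{4.49}$$

for the (normalized) α-discounted total occupation measure, given by

$$\eta_\gamma^{S,\alpha}(\Gamma_X\times\Gamma_A) = \alpha\sum_{n=1}^\infty m_{\gamma,n}^{S,\alpha}(\Gamma_X\times\Gamma_A),\tag{4.50}$$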
for a π-strategy S. As explained in Sect. 1.3.5, we can consider S as the strategy in the “hat” model. Due to Theorem 1.3.3, ηγS,α ( X × A ) = α
∞
S,α m γ,n ( X × A ) = α
n=1
∞
S,0 mˆ γ,n ( X × A ),
n=1
∀ X ∈ B(X), A ∈ B(A). According to Theorem 1.3.2, there is a Markov standard ξ-strategy M such that M S,0 ,0 and mˆ on X × A coincide for all n ∈ N. When considering the measures mˆ γ,n γ,n M as the strategy in the original discounted model and using Theorem 4.2.3, we see that ηγS,α ( X × A ) = α =
,0 mˆ γ,n ( X × A ) = α M
n=1
X × A
=
∞
∞
,α m γ,n ( X × A ) M
n=1 ∞
α M Mσ ,α (d x × da) qx (a) + α n=1 γ,n
α M Mσγ ,α (d x × da), X × A q x (a) + α ∀ X ∈ B(X), A ∈ B(A),
(4.51)
strategy in the corresponding DTMDP, coincident with M , where σ M is the
Markov ∞ σ M ,α σ M ,α = n=1 Mγ,n is the total occupation measure of the strategy σ M in the and Mγ DTMDP: see Definition C.2.5. Now Proposition C.2.12 implies the equation X ×A
= Mγσ
M
qx (a) + α S,α ηγ (d x × da) α
,α
( X × A) ∞ q ( X |x, a) σ M ,α = γ( X ) + Mγ,n (d x × da) X×A α + q x (a) n=1 for all X ∈ B(X), which is equivalent to (4.49). Under Condition 2.4.2, the controlled process X (t) is non-explosive under every control strategy S; PγS (T∞ = ∞) = 1, so that one can accept Definition 3.2.1 of the total occupation measure. Moreover, if all the conditions of Theorem 3.2.1 are satisfied, then Eq. (4.49) is equivalent to Eq. (3.19). We emphasize that Eq. (4.49) is valid without any conditions. Clearly, equality (4.51) holds if and only if Mγσ
M
,α
( X × A ) =
X × A
qx (a) + α S,α ηγ (d x × da). α
4.2.4.3 Reduction in the Presence of Cemeteries

According to the discussions at the beginning of the present section, in the presence of a cemetery in the CTMDP $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$, the reduction of the CTMDP problems (1.15) and (1.16) with $\alpha = 0$ to equivalent problems in a DTMDP can be immediately seen, under the following condition.

Condition 4.2.4 If there are cemeteries in the CTMDP $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$, Condition 1.3.1 is satisfied. Moreover, Condition 1.3.2(b) is satisfied with $\alpha = 0$, i.e., for each $x\in\mathbf{X}_\Delta = \mathbf{X}\setminus\{\Delta\}$, $\inf_{a\in\mathbf{A}}q_x(a)\ge\varepsilon(x)$ for some $\varepsilon(x) > 0$.

Suppose Condition 4.2.4 is satisfied. Consider the DTMDP $(\mathbf{X},\mathbf{A},p,\{l_j\}_{j=0}^J)$ defined by the following objects:
• $\mathbf{X}$ and $\mathbf{A}$ are the state and action spaces;
• the transition probability $p$ is given for each $\Gamma\in\mathcal{B}(\mathbf{X})$ by
$$p(\Gamma|x,a) = \begin{cases}\dfrac{\tilde q(\Gamma|x,a)}{q_x(a)}, & \text{if } x\ne\Delta;\\[2mm] \delta_\Delta(\Gamma), & \text{if } x=\Delta;\end{cases}$$
• the cost functions are given by
$$l_j(x,a) = \begin{cases}\dfrac{c_j(x,a)}{q_x(a)}, & \text{if } x\ne\Delta;\\[2mm] 0, & \text{if } x=\Delta,\end{cases}\qquad j = 0,1,\ldots,J.$$

Then under Condition 4.2.4, the total cost problems (C.4) and (C.5) in this DTMDP $(\mathbf{X},\mathbf{A},p,\{l_j\}_{j=0}^J)$ are equivalent to problems (1.15) and (1.16) with $\alpha = 0$ in the CTMDP $(\mathbf{X},\mathbf{A},q,\{c_j\}_{j=0}^J)$. One may notice the minor but immaterial difference between the DTMDP $(\mathbf{X},\mathbf{A},p,\{l_j\}_{j=0}^J)$ and the DTMDP $(\mathbf{X}\cup\{\Delta\},\mathbf{A},p_0,\{l_j\}_{j=0}^J)$ formulated in the first subsubsection.
4.2.5 Example: Preventive Maintenance

In this subsection, we show how one can find an acceptable solution to the multiple-objective problem described in Sect. 1.2.9. Namely, we are interested in making all three objectives $W_0^\alpha(S,\gamma)$, $W_1^\alpha(S,\gamma)$ and $W_2^\alpha(S,\gamma)$ as small as possible. To do this, we will solve the constrained problem (1.16) under different values of the constants $d_1$, $d_2$ and find a satisfactory combination $(d_1,d_2)$. The solution to each such problem comes from the solution of the corresponding constrained DTMDP (4.45). Since $\alpha > 0$, the spaces $\mathbf{X}$ and $\mathbf{A}$ are finite and thus the transition rate $q_x(a)$ is bounded, we see that the DTMDP is absorbing: at each step, the probability of absorption in $\Delta$ is

$$\frac{\alpha}{\alpha+q_x(a)}\ \ge\ \frac{\alpha}{\alpha+\max_{(x,a)\in\mathbf{X}\times\mathbf{A}}q_x(a)} > 0.$$

Therefore, Propositions C.2.12 and C.2.14 imply that the constrained DTMDP (4.45) is equivalent to the linear program (C.17) in the space of finite measures $\mathcal{M}_+^F(\mathbf{X}\times\mathbf{A})$. The latter is the standard finite-dimensional linear program, which can be easily solved numerically.

Below, we fix the following values of the parameters: $M = 3$; $\lambda_0 = 3$; $\lambda_1 = 2$; $\lambda_2 = 2$; $\lambda_3 = 5$; $\alpha = 0.5$; $\mu_0 = 0$; $\mu_1 = 20$; $\mu_2 = 10$; $\mu_3 = 8$; $C^f = 100$; $r_0 = 0$; $r_1 = 150$; $r_2 = 100$; $r_3 = 100$; $\gamma(0) = 1$; $C_0^r = 1$; $C_1^r = 5$; $C_2^r = 15$; $C_3^r = 20$. Note that $r_0 = 0$ and $\mu_0 = 0$ mean the new machine is not operating: it takes exponentially distributed time with parameter $\lambda_0 = 3$ to unpack and fix the new machine. The value of $C_0^r$ does not play any role because $c_2(0,1) = 0$. According to the first subsubsection of Sect. 4.2.4, the transition probability $p_\alpha(j|i,a)$ is as follows.

    i   a      j = 0     j = 1     j = 2     j = 3
    0   0      0         0.8571    0         0
    0   1      0         0.8571    0         0
    1   0      0         0         0.8000    0
    1   1      0.8889    0         0.0889    0
    2   0      0         0         0         0.8000
    2   1      0.8000    0         0         0.1600
    3   0      0.9091    0         0         0
    3   1      0.9630    0         0         0
This matrix (under $a = 0$ and $a = 1$) is substochastic because there are positive probabilities of the absorption in the cemetery $\Delta$. The cost functions are as follows.

$l_0(x,a)$ (associated with the failures, see $c_0(\cdot)$):

    a \ x      0         1          2          3
      0        0         0          0          90.9091
      1        0         0          0          37.0370

$l_1(x,a)$ (associated with the rewards, see $c_1(\cdot)$):

    a \ x      0         1          2          3
      0        0        −60.0000   −40.0000   −18.1818
      1        0        −6.6667    −8.0000    −7.4074

$l_2(x,a)$ (associated with the replacements, see $c_2(\cdot)$):

    a \ x      0         1          2          3
      0        0         0          0          0
      1        0         4.4444     12.0000    11.8519
We use the presented data for solving the linear program (C.17) numerically for different constants $d_1$ and $d_2$. Firstly, we obtain the minimal and maximal possible values of the main objective $W_0^\alpha(S,\gamma)$:
• $\min_{S\in\mathcal{S}}W_0^\alpha(S,\gamma) = 2.7297$. Here $d_1 = d_2 = 1000$, and the solution is given by the deterministic stationary strategy $\varphi^1(x) = I\{x > 0\}$: replace the machine as soon as possible. As expected, the other objectives are far from good; $W_1^\alpha(\varphi^1,\gamma) = -38.7785$; $W_2^\alpha(\varphi^1,\gamma) = 29.4327$. The constraints are not active, i.e., satisfied with strict inequalities.
• $\max_{S\in\mathcal{S}}W_0^\alpha(S,\gamma) = 99.4819$. Here $d_1 = 1000$, $d_2 = 0$: we do not allow any preventive replacements, and
$$\max_{S\in\mathcal{S}}W_0^\alpha(S,\gamma) = \min_{S\in\mathcal{S}}W_0^\alpha(S,\gamma)\ \text{ subject to } W_1^\alpha(S,\gamma)\le d_1,\ W_2^\alpha(S,\gamma)\le d_2 = 0.$$
The solution is given by the deterministic stationary strategy $\varphi^0(x)\equiv 0$ (no preventive replacement at all), and $W_1^\alpha(\varphi^0,\gamma) = -177.2021$; $W_2^\alpha(\varphi^0,\gamma) = 0$.

Suppose there is a budget restriction for the preventive replacements: $d_2 = 20$. We put $d_1 = 1000$, i.e., ignore the rewards, and obtain the solution in the form of a randomized stationary ξ-strategy with the following values of the objectives: $W_0^\alpha = 26.5560$; $W_1^\alpha = -108.3817$; $W_2^\alpha = 20$.

Now, looking at the rewards, we see that 108.3817 is well below 177.2021. Thus, we put the constraints $d_1 = -150$, $d_2 = 20$ and obtain another randomized stationary ξ-strategy with the following values of the objectives: $W_0^\alpha = 28.1162$; $W_1^\alpha = -150$; $W_2^\alpha = 20$.

Both constraints are active, i.e., satisfied with equalities, so improving any one objective results in the increase of other objectives. The obtained solution looks acceptable. The explicit form of the latter ξ-strategy, written as the probability of choosing action $a = 1$ in the states $x = 0, 1, 2, 3$, is as follows: 0; 0.3469; 0.4141; 1. The randomization is needed in two states 1 and 2. This is natural in the case when two constraints are active.
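The objective values quoted in this example can be checked by plain policy evaluation in the induced DTMDP, without solving the linear program. The sketch below rests on assumptions that are not stated in this passage but reproduce the tabulated $p_\alpha$ and $l_j$: from state $i < M$ the machine degrades to $i+1$ at rate $\lambda_i$, from state $M$ it fails and is renewed at rate $\lambda_M$ with cost $C^f$, and action 1 adds a replacement rate $\mu_i$ with cost $C_i^r$ per replacement. All function names are hypothetical.

```python
ALPHA, M, C_FAIL = 0.5, 3, 100.0
LAM = [3.0, 2.0, 2.0, 5.0]          # lambda_0 .. lambda_3
MU = [0.0, 20.0, 10.0, 8.0]         # mu_0 .. mu_3
REWARD = [0.0, 150.0, 100.0, 100.0]
C_REPL = [1.0, 5.0, 15.0, 20.0]


def rates(x, a):
    """ASSUMED jump rates: degradation (or failure at M) plus optional replacement."""
    jumps = {(x + 1 if x < M else 0): LAM[x]}
    if a == 1 and MU[x] > 0.0:
        jumps[0] = jumps.get(0, 0.0) + MU[x]
    return jumps


def cost_rates(x, a):
    """ASSUMED cost rates (c0, c1, c2): failures, negative rewards, replacements."""
    return (C_FAIL * LAM[x] if x == M else 0.0,
            -REWARD[x],
            C_REPL[x] * MU[x] if a == 1 else 0.0)


def evaluate(prob_replace, sweeps=5000):
    """Total costs (W0, W1, W2) from state 0 when action 1 is used in state x
    with probability prob_replace[x]; successive approximations in the DTMDP."""
    V = [[0.0] * (M + 1) for _ in range(3)]
    for _ in range(sweeps):
        new = [[0.0] * (M + 1) for _ in range(3)]
        for x in range(M + 1):
            for a, weight in ((0, 1.0 - prob_replace[x]), (1, prob_replace[x])):
                if weight == 0.0:
                    continue
                jumps = rates(x, a)
                denom = ALPHA + sum(jumps.values())
                for j, c in enumerate(cost_rates(x, a)):
                    new[j][x] += weight * (c + sum(r * V[j][y] for y, r in jumps.items())) / denom
        V = new
    return tuple(V[j][0] for j in range(3))


print(evaluate([0, 1, 1, 1]))                 # replace as soon as possible
print(evaluate([0, 0, 0, 0]))                 # never replace
print(evaluate([0, 0.3469, 0.4141, 1]))       # randomized strategy quoted above
```

Because the replacement probabilities are quoted to four decimals only, the values for the randomized strategy can be expected to match the reported ones approximately rather than exactly.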
4.3 Bibliographical Remarks

A popular method of investigating a CTMDP is by reducing it to an equivalent DTMDP. This method is also preferred and presented in many textbooks on MDPs, see e.g., [19, 133, 200]. Nevertheless, due to the nature of the exposition therein, attention is often restricted to a simpler class of strategies, under which the CTMDP becomes a special SMDP (Semi-Markov Decision Process). More precisely, the so-called ESMDP (Exponential SMDP) arises from a CTMDP when one is restricted to the class of standard ξ-strategies. As far as total cost criteria are concerned, the equivalence between SMDPs and DTMDPs was understood long ago: see, e.g., [123, p. 202]. When the transition rate of the CTMDP is uniformly bounded, and being restricted to stationary strategies, the reduction of an ESMDP to a DTMDP is a consequence of the uniformization technique, see [156], where there was an inaccuracy corrected in [222], see also [146]. In the case of unbounded transition rates, and being restricted to stationary π-strategies, the relation between discounted CTMDPs and DTMDPs was considered in [136], based on the optimality equation and assuming some extra conditions. Further reduction results based on the comparison of optimality equations can be found in [138, 189]. For discounted CTMDPs under general π-strategies, a CTMDP can be reduced to an ESMDP, which is in turn equivalent to a DTMDP, see [76, 77]. The justification is based on the comparison of the so-called occupation and occupancy measures. If the transition rate is bounded, this method leads to an equivalent discounted DTMDP problem. Otherwise, the equivalent DTMDP problem has total undiscounted cost criteria.
In [98] the aforementioned reduction method was applied to gaining further understanding of the role of the weight functions: they are there to guarantee that the reduced DTMDP problem has positive cost functions, despite the fact that the cost rates in the CTMDP are not necessarily bounded from below. The mentioned reduction method is applicable to the total undiscounted CTMDPs automatically, provided that the transition rates in the CTMDP are separated from zero. Without this requirement, this reduction does not hold in general, because the reduction of a CTMDP to an ESMDP may fail when α = 0, and consequently, the standard ξ-strategies are in general not sufficient for solving total undiscounted CTMDP problems, as demonstrated in Sects. 1.3.3 and 1.3.4. This reduction method was justified in [117] for total undiscounted CTMDP problems, where the requirement of strongly positive transition rates was replaced with the nonnegativity of the cost rates, as well as the compactness-continuity conditions. A more powerful reduction method was proposed and justified in [186] by making use of Poisson-related strategies, which were introduced in [185], and are natural generalizations of the standard ξ-strategies:
one can change the actions not only at the jump moments, but also between them. Compared with the reduction method suggested in [76, 77], this reduction method is applicable to both total discounted and undiscounted problems, and survives without requiring any extra conditions. The current chapter has presented it in detail. The induced DTMDP in the aforementioned literature has the same action space as the CTMDP. Another reduction method, introduced in [254], leads to a DTMDP with a more complicated action space than in the original CTMDP, which is the space of relaxed control functions. Now we provide the main source of the materials presented in this chapter. Section 4.1. Poisson-related strategies were first introduced in [185]. Their realizability was discussed in [187]. This section mainly comes from [185]. Section 4.2. The presented reduction method in Sect. 4.2.1 was very briefly described in [186]. After the reduction to DTMDP is justified, one can invoke the corresponding statements established for the discrete-time models, see e.g., the monographs [7, 21, 69, 120, 121, 179]. In Sect. 4.2.2, we mainly referred to [21] for the unconstrained problem, and to [61] for the constrained problem. In the latter case, one can also use the more recent results in [63], where the cost functions were not necessarily nonnegative. The investigated epidemic model in Sect. 4.2.3 is similar to the controlled SIR epidemic considered in [1], where the possible control was understood as the immediate isolation of one infective. In our setting, that means an extremely intensive treatment. Formally speaking, that is the case of impulsive control, which is treated in Chap. 7, and corresponds to the limiting case in the gradual control when a¯ → ∞. In [1], it was proved that the optimal isolation strategy is in the form of what was graphically presented in Fig. 4.2: there is a function f (x1 ) such that one must isolate immediately all infectives if and only if x2 ≤ f (x1 ). 
In the case of uncontrolled models, the corrected infection rate $\frac{\beta x_1x_2}{x_1+x_2}$, compared with the standard case of $\beta x_1x_2$ in [1], was introduced in [11], motivated by modelling of AIDS (acquired immune deficiency syndrome). The unconstrained discrete-time selling problem was investigated in Sect. 10.3.1 of [13]. If one puts $\lambda = 1$ and $d_1^2 > \frac{b_2-b_1}{2C\lambda}$, making the constraint non-active, then the obtained optimal strategy $\varphi^*$ coincides with the solution coming from Theorem 10.3.1 of [13]. The reduction method for discounted CTMDPs presented in Sect. 4.2.4 comes from [76, 77], which is also applicable to total undiscounted CTMDPs, when the transition rates are strongly positive. Preventive maintenance problems similar to the one investigated in Sect. 4.2.5 were considered in [200, Sect. 4.7.5] and [210, p. 167].
Chapter 5
The Average Cost Model
In this chapter, restricted to the class of π-strategies, we consider the following long-run average cost as the performance measure of the system:

\[
W_j(S,\gamma) := \limsup_{T\to\infty} \frac{1}{T}\, E_\gamma^S\left[\int_{(0,T]}\int_A c_j(X(t),a)\,\Pi(da|t)\,dt\right]
= \limsup_{T\to\infty} \frac{1}{T}\, E_\gamma^S\left[\int_{(0,\min\{T,T_\infty\}]}\int_A c_j(X(t),a)\,\Pi(da|t)\,dt\right]
\]

for each initial distribution γ, j = 0, 1, . . . , J and π-strategy S ∈ S_π. Recall that the convention c_j(x_∞, a) := 0 for each a ∈ A is in use. A π-strategy S* ∈ S_π is called average optimal under the given initial distribution γ if

\[
W_0(S^*,\gamma) = \inf_{S\in \mathbf{S}_\pi} W_0(S,\gamma),
\]

and is called average uniformly optimal if

\[
W_0(S^*,x) = \inf_{S\in \mathbf{S}_\pi} W_0(S,x) =: W_0^*(x), \quad \forall\, x\in X.
\]

The constrained average CTMDP problem (restricted to the class of π-strategies) reads

\[
\text{Minimize over } S\in \mathbf{S}_\pi:\quad W_0(S,\gamma) \quad \text{subject to } W_j(S,\gamma)\le d_j,\ \forall\, j=1,2,\dots,J, \tag{5.1}
\]
where d j ∈ R is a given number for each j = 1, 2, . . . , J. A π-strategy S is called feasible for problem (5.1) if W j (S, γ) ≤ d j , ∀ j = 1, 2, . . . , J. © Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_5
A strategy S ∗ ∈ Sπ is called optimal under the initial distribution γ for problem (5.1) if it is feasible for problem (5.1), and satisfies W0 (S ∗ , γ) ≤ W0 (S, γ) for all feasible π-strategies S for problem (5.1).
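To make the definition of the long-run average cost concrete, the following minimal Python sketch evaluates the time-averaged expected cost of a two-state continuous-time Markov chain under a fixed stationary strategy, and compares it with its limit as T → ∞. The model (jump rate `a` from state 0 to state 1, jump rate `b` back, cost rates `c0` and `c1`) and the function names are illustrative assumptions rather than anything taken from the text; the closed form relies on the standard transition function of a two-state chain.

```python
import math


def long_run_average(a, b, c0, c1):
    """Limit of the time-averaged expected cost: stationary distribution times cost."""
    # The stationary distribution of the two-state chain is (b, a) / (a + b).
    return (b * c0 + a * c1) / (a + b)


def time_average_cost(a, b, c0, c1, T):
    """(1/T) * E[ integral over (0, T] of c(X(t)) dt ], chain started in state 0."""
    s = a + b
    # P(X(t) = 0 | X(0) = 0) = b/s + (a/s) * exp(-s*t); integrate over (0, T].
    transient = (c0 - c1) * (a / s) * (1.0 - math.exp(-s * T)) / (s * T)
    return long_run_average(a, b, c0, c1) + transient


if __name__ == "__main__":
    # Hypothetical data: rates a = 2, b = 1 and cost rates c0 = 0, c1 = 3.
    for T in (1.0, 10.0, 100.0, 1000.0):
        print(T, time_average_cost(2.0, 1.0, 0.0, 3.0, T))
    print("limit", long_run_average(2.0, 1.0, 0.0, 3.0))
```

In this toy case the limit exists, so the upper limit in the definition of W_j is an ordinary limit and does not depend on the initial state.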
5.1 Unconstrained Problems

5.1.1 Unconstrained Problem: Nonnegative Cost

Throughout this subsection, we assume the following to hold without special reference.

Assumption 5.1.1 All the cost rates c_j (j = 0, 1, . . . , J) are [0, ∞]-valued.

5.1.1.1 Verification Theorem
Consider an arbitrarily fixed natural Markov strategy π̌, and recall the notations (1.11) and (2.3). We put for each x ∈ X, u ≤ t, u, t ∈ R_+^0,

\[
W(u,x,t) := \int_{(u,t]}\int_X\int_A c_0(y,a)\,\check\pi(da|y,s)\,p_q(u,x,s,dy)\,ds.
\]

Note that $W_0(\check\pi,x) = \limsup_{t\to\infty}\frac{1}{t}W(0,x,t)$ for each x ∈ X.

Lemma 5.1.1 (a) Let a natural Markov policy π̌ be fixed. Suppose a nonnegative measurable function v(u, x, t) defined for x ∈ X, u ≤ t, u, t ∈ R_+^0, satisfies the following inequality

\[
v(u,x,t) \ge e^{-\int_{(u,t]} q_x(\theta)d\theta}\int_{(u,t]}\int_A c_0(x,a)\,\check\pi(da|x,\theta)\,d\theta
+ \int_{(u,t]} e^{-\int_{(u,s]} q_x(\theta)d\theta}\left\{ q_x(s)\int_{(u,s]}\int_A c_0(x,a)\,\check\pi(da|x,\theta)\,d\theta + \int_{X\setminus\{x\}} q(dy|x,s)\,v(s,y,t)\right\} ds. \tag{5.2}
\]

Then v ≥ W (pointwise), and the function W satisfies (5.2) with equality.

(b) Let a stationary π-strategy π̌ = π^s be fixed, and suppose there exist a constant g ∈ [0, ∞] and a nonnegative measurable function h on X satisfying the following inequality

\[
g + h(x)\int_A q_x(a)\,\pi^s(da|x) \ge \int_A c_0(x,a)\,\pi^s(da|x) + \int_{X\setminus\{x\}} h(y)\int_A q(dy|x,a)\,\pi^s(da|x)
\]

for each x ∈ X. Then g ≥ W_0(π^s, x) for each x ∈ X satisfying h(x) < ∞.

Proof (a) For simplicity, throughout the proof of this lemma, we put

\[
c(x,s) := \int_A c_0(x,a)\,\check\pi(da|x,s).
\]
Furthermore, if c(x, s), q_x(s) and q(dy|x, s) in the above are s-independent, as in the case of a stationary strategy, we omit s from the arguments, as in Sect. 2.2.2. It follows from Feller's construction of p_q, see (2.2) and (2.3), and Proposition B.1.16 that

\[
m_n(u,x,t) := \int_{(u,t]}\int_X c(y,s)\sum_{k=0}^n p_q^{(k)}(u,x,s,dy)\,ds \uparrow W(u,x,t)
\]

as n ↑ ∞. We verify firstly that W(u, x, t) satisfies (5.2) with equality as follows. By the iterative definitions of the transition functions p_q^{(n)},

\begin{align*}
m_n(u,x,t) &= \int_{(u,t]}\int_X c(y,s)\sum_{k=0}^n p_q^{(k)}(u,x,s,dy)\,ds \\
&= m_0(u,x,t) + \int_{(u,t]}\int_X c(y,s)\sum_{k=1}^n p_q^{(k)}(u,x,s,dy)\,ds \\
&= m_0(u,x,t) + \int_{(u,t]}\int_X c(y,s)\sum_{k=1}^n \int_{(u,s]} e^{-\int_{(u,r]} q_x(\theta)d\theta}\int_{X\setminus\{x\}} q(dz|x,r)\,p_q^{(k-1)}(r,z,s,dy)\,dr\,ds \\
&= m_0(u,x,t) + \int_{(u,t]}\int_{(u,s]} e^{-\int_{(u,r]} q_x(\theta)d\theta}\int_{X\setminus\{x\}} q(dz|x,r)\int_X c(y,s)\sum_{k=0}^{n-1} p_q^{(k)}(r,z,s,dy)\,dr\,ds \\
&= m_0(u,x,t) + \int_{(u,t]} e^{-\int_{(u,r]} q_x(\theta)d\theta}\int_{X\setminus\{x\}} q(dz|x,r)\,m_{n-1}(r,z,t)\,dr,
\end{align*}
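where the last two equalities follow from the legal interchange of the order of integrations. One can see that

\[
m_0(u,x,t) = e^{-\int_{(u,t]} q_x(\theta)d\theta}\int_{(u,t]} c(x,\theta)\,d\theta + \int_{(u,t]}\left(\int_{(u,s]} c(x,\theta)\,d\theta\right) e^{-\int_{(u,s]} q_x(\theta)d\theta}\, q_x(s)\,ds.
\]

It thus follows that

\[
m_n(u,x,t) = e^{-\int_{(u,t]} q_x(\theta)d\theta}\int_{(u,t]} c(x,\theta)\,d\theta + \int_{(u,t]} e^{-\int_{(u,s]} q_x(\theta)d\theta}\left\{ q_x(s)\int_{(u,s]} c(x,\theta)\,d\theta + \int_{X\setminus\{x\}} q(dy|x,s)\,m_{n-1}(s,y,t)\right\} ds. \tag{5.3}
\]

By the Monotone Convergence Theorem, passing to the limit as n ↑ ∞ on both sides of the above equality gives

\[
W(u,x,t) = e^{-\int_{(u,t]} q_x(\theta)d\theta}\int_{(u,t]} c(x,\theta)\,d\theta + \int_{(u,t]} e^{-\int_{(u,s]} q_x(\theta)d\theta}\left\{ q_x(s)\int_{(u,s]} c(x,\theta)\,d\theta + \int_{X\setminus\{x\}} q(dy|x,s)\,W(s,y,t)\right\} ds,
\]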
as required. For the minimality of W(u, x, t) as a nonnegative measurable solution to inequality (5.2), suppose that there is another nonnegative measurable solution v(u, x, t) to inequality (5.2). Thus, v(u, x, t) ≥ m_0(u, x, t). Now an inductive argument based on (5.3) and the fact that v satisfies (5.2) implies v(u, x, t) ≥ m_n(u, x, t) for each n = 0, 1, . . . , which, together with the fact that m_n ↑ W pointwise as n ↑ ∞, leads to v(u, x, t) ≥ W(u, x, t), as desired.

(b) Without loss of generality, we assume that q_x > 0 and g < ∞, for otherwise the statement holds automatically. Since π̌ = π^s is stationary, $W(u,x,t) = W(0,x,t-u) =: \widetilde W(x,t-u)$. It follows from this and part (a), specialized to a stationary policy and u = 0, that $\widetilde W$ is the minimal nonnegative measurable solution to the inequality

\[
\widetilde W(x,t) \ge c(x)\,t\,e^{-q_x t} + \int_{(0,t]} e^{-q_x s}\left\{ q_x c(x) s + \int_{X\setminus\{x\}} q(dy|x)\,\widetilde W(y,t-s)\right\} ds.
\]

Now let us verify, based on the properties of the constant g and the function h, that the above inequality is satisfied with h(x) + gt in lieu of $\widetilde W(x,t)$ as follows.
Let x ∈ X be fixed. Clearly, we only need consider the case of h(x) < ∞. In this case, c(x) < ∞ necessarily. Then

\begin{align*}
& c(x)\,t\,e^{-q_x t} + \int_{(0,t]} e^{-q_x s}\left\{ q_x c(x) s + \int_{X\setminus\{x\}} q(dy|x)\,(h(y) + g(t-s))\right\} ds \\
&\quad = \frac{c(x)}{q_x}\left(1 - e^{-q_x t}\right) + \int_{(0,t]} e^{-q_x s}\int_{X\setminus\{x\}} q(dy|x)\,(h(y) + g(t-s))\,ds \\
&\quad = \frac{c(x)}{q_x}\left(1 - e^{-q_x t}\right) + \int_{(0,t]} e^{-q_x s}\left\{\int_{X\setminus\{x\}} q(dy|x)\,h(y) + g q_x (t-s)\right\} ds \\
&\quad \le \frac{c(x)}{q_x}\left(1 - e^{-q_x t}\right) + \int_{(0,t]} e^{-q_x s}\left\{ g + h(x) q_x - c(x) + g q_x (t-s)\right\} ds \\
&\quad = gt + h(x) - h(x)e^{-t q_x} \le gt + h(x),
\end{align*}

as claimed. Consequently, $h(x) + gt \ge \widetilde W(x,t)$ by part (a) of this lemma. At x ∈ X such that h(x) < ∞, dividing both sides of the previous inequality by t and then passing to the upper limit as t → ∞ yields the statement.

The next statement provides a verification theorem.

Corollary 5.1.1 Let F be a [1, ∞)-valued measurable function, and g ∈ [0, ∞) be a finite constant such that

\[
g \le \limsup_{\alpha\downarrow 0} \alpha W_0^{*\alpha}(x)
\]

for each x ∈ X. If there exist a [0, ∞)-valued measurable function h and a deterministic stationary strategy ϕ* such that

\[
g + F(x)h(x) \ge c(x,\varphi^*(x)) + F(x)\int_X h(y)\left(\frac{q(dy|x,\varphi^*(x))}{F(x)} + \delta_x(dy)\right), \quad \forall\, x\in X,
\]

then the deterministic stationary strategy ϕ* is average uniformly optimal, and g = W_0^*(x), ∀ x ∈ X.

Proof. By Lemma 5.1.1(b),

\[
g \ge W_0(\varphi^*,x) \ge W_0^*(x), \quad \forall\, x\in X. \tag{5.4}
\]
For the opposite direction, let x ∈ X be arbitrarily fixed. For each π-strategy S, by the Tauberian Theorem, see Proposition B.1.20,

\[
g \le \limsup_{\alpha\downarrow 0} \alpha W_0^{\alpha}(S,x) \le W_0(S,x).
\]

Thus, g ≤ W_0^*(x). Since x ∈ X was arbitrarily fixed, we see from (5.4) that g = W_0^*(x) for each x ∈ X. The statement follows.

5.1.1.2 Optimality Inequality
Let m α := inf W0∗α (x), x∈X
h α (x) := W0∗α (x) − m α ≥ 0, ∀ α ∈ R+ , where the convention ∞ − ∞ := ∞ is in use. Let g := lim αm α ∈ [0, ∞]
(5.5)
α↓0
and h(x) :=
lim
α↓0, y→x
h α (y) :=
sup
δ>0, >0
inf
0 r for some r ∈ R. It suffices to show that this inequality remains true on a neighborhood of x ∈ X as follows. There exist > 0 and δ > 0 such that inf 0 r. Let z ∈ X be such that ρ(x, z) < 2 . Then
inf
0 r,
and thus h(z) > r , as required. (b) Let > 0 and δ > 0 be arbitrarily fixed. Suppose xn → x and αn ↓ 0. For all large enough n, ρ(x, xn ) < and αn < δ, so that inf
0 0. Then HαE decreases in α. Clearly, h(x) = sup HαE (x), x ∈ X. α∈(0,∞)
Therefore, $h(x) \le \liminf_{\alpha\downarrow 0} h_\alpha(x)$ for each x ∈ X.

Lemma 5.1.3 Suppose F is a [1, ∞)-valued function such that $\bar q_x \le F(x)$ for each x ∈ X. Then for each α > 0, the equation

\[
v(x) = \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X v(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\}, \quad \forall\, x\in X, \tag{5.7}
\]

and (4.46) have the same minimal nonnegative lower semianalytic solution given by $W_0^{*\alpha}$.

Proof For ease of reference we recall Eq. (4.46):

\[
u(x) = \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + q_x(a)} + \int_X u(y)\,\frac{\tilde q(dy|x,a)}{\alpha + q_x(a)}\right\}, \quad x\in X.
\]

Firstly, consider the minimal nonnegative lower semianalytic solution u to (4.46). By Theorem 4.2.4, $u = W_0^{*\alpha}$. Let x ∈ X be arbitrarily fixed. If u(x) = ∞, then

\[
u(x) = \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X u(y)\left(\frac{q(dy|x,a)}{F(x)} + I\{x\in dy\}\right)\right\}. \tag{5.8}
\]

Now suppose u(x) < ∞. Then it follows that

\[
u(x) \le \frac{c(x,a)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X u(y)\left(\frac{q(dy|x,a)}{F(x)} + I\{x\in dy\}\right), \quad \forall\, a\in A. \tag{5.9}
\]

Let δ > 0 be arbitrarily fixed, and take any 0 < ε < δ. Then there exists some a_δ ∈ A such that

\[
u(x) + \varepsilon \ge \frac{c(x,a_\delta)}{\alpha + q_x(a_\delta)} + \int_{X\setminus\{x\}} u(y)\,\frac{q(dy|x,a_\delta)}{\alpha + q_x(a_\delta)},
\]

so that

\[
F(x)u(x) + u(x)\alpha + q_x(a_\delta)u(x) + \varepsilon(\alpha + q_x(a_\delta)) \ge c(x,a_\delta) + \int_X u(y)\,q(dy|x,a_\delta) + u(x)q_x(a_\delta) + u(x)F(x)
\]

and thus

\[
u(x) + \delta > u(x) + \frac{\varepsilon(\alpha + q_x(a_\delta))}{\alpha + F(x)} \ge \frac{c(x,a_\delta)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X u(y)\left(\frac{q(dy|x,a_\delta)}{F(x)} + \delta_x(dy)\right).
\]

Since δ > 0 is arbitrarily fixed, we see that

\[
u(x) \ge \inf_{a\in A}\left\{\frac{c(x,a)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X u(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\},
\]

and by (5.9), we see that u satisfies (5.8). Thus,

\[
u \ge v \tag{5.10}
\]

with v being the minimal nonnegative solution to (5.7). For the opposite direction, note that if v(x) = ∞, then

\[
v(x) \ge \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + q_x(a)} + \int_X v(y)\,\frac{\tilde q(dy|x,a)}{\alpha + q_x(a)}\right\}.
\]

Suppose now v(x) < ∞. Let δ > 0 be arbitrarily fixed, and choose ε > 0 such that $\varepsilon\,\frac{\alpha + F(x)}{\alpha} < \delta$. Since v satisfies (5.7), there exists some a_δ ∈ A such that

\begin{align*}
v(x) &\ge \frac{c(x,a_\delta)}{\alpha + F(x)} + \frac{1}{\alpha + F(x)}\int_X v(y)\,q(dy|x,a_\delta) + \frac{F(x)v(x)}{\alpha + F(x)} - \varepsilon \\
&\ge \frac{c(x,a_\delta)}{\alpha + F(x)} + \frac{1}{\alpha + F(x)}\int_X v(y)\,q(dy|x,a_\delta) + \frac{F(x)v(x)}{\alpha + F(x)} - \frac{\alpha + q_x(a_\delta)}{\alpha + F(x)}\,\delta.
\end{align*}

Simple rearrangements of this inequality further lead to

\[
v(x) \ge \frac{c(x,a_\delta)}{\alpha + q_x(a_\delta)} + \frac{1}{\alpha + q_x(a_\delta)}\int_{X\setminus\{x\}} v(y)\,q(dy|x,a_\delta) - \delta.
\]

Since δ > 0 was arbitrarily fixed, we see that

\[
v(x) \ge \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + q_x(a)} + \int_X v(y)\,\frac{\tilde q(dy|x,a)}{\alpha + q_x(a)}\right\}, \quad \forall\, x\in X.
\]
It follows from this and Remark 4.2.3 that u ≤ v, and thus u = v by (5.10). The proof is complete.

The next example demonstrates that the two equations (5.7) and (4.46) are not equivalent.

Example 5.1.1 Consider X = {1, 2}, A is a singleton and thus "a" is omitted everywhere. Let q_x = 0, c_0(x) = 1, F(x) = 2 for each x ∈ X. Let α = 1. Consider v(1) = 1, v(2) = ∞. Then

\[
1 = v(1) = \frac{1}{1+2} + \frac{2}{3}\left(\frac{q(1|1)}{2}v(1) + \frac{q(2|1)}{2}v(2) + v(1)\right);
\]
\[
\infty = v(2) = \frac{1}{1+2} + \frac{2}{3}\left(\frac{q(1|2)}{2}v(1) + \frac{q(2|2)}{2}v(2) + v(2)\right).
\]

Recall that 0 · ∞ := 0. Thus, the function v satisfies (5.7). On the other hand,

\[
1 = v(1) = 1 + v(2)\,\frac{\tilde q(2|1)}{1}, \qquad \infty = v(2) \ne 1 = 1 + v(1)\,\frac{\tilde q(1|2)}{1}.
\]

Therefore, v does not satisfy (4.46).

Theorem 5.1.1 Suppose Condition 4.2.1 is satisfied. Consider the constant g and function h defined by (5.5) and (5.6), respectively. Furthermore, let F be a [1, ∞)-valued continuous function on X such that $F(x) \ge \bar q_x$ for each x ∈ X. Then there exists a deterministic stationary strategy ϕ* such that

\begin{align*}
g + F(x)h(x) &\ge \inf_{a\in A}\left\{ c(x,a) + F(x)\int_X h(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\} \\
&= c(x,\varphi^*(x)) + F(x)\int_X h(y)\left(\frac{q(dy|x,\varphi^*(x))}{F(x)} + \delta_x(dy)\right), \quad \forall\, x\in X. \tag{5.11}
\end{align*}

Proof By Lemma 5.1.2, h is lower semicontinuous on X. Then according to Proposition B.1.40, there is a deterministic stationary strategy ϕ* satisfying the second equality in (5.11) automatically. Below we verify the inequality in (5.11). Let x ∈ X be arbitrarily fixed. If g = ∞ or h(x) = ∞, then there is nothing to prove. Therefore, we assume below in this proof that g < ∞ and h(x) < ∞. By Lemma 5.1.3, for each α > 0,
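\begin{align*}
W_0^{*\alpha}(x) &= h_\alpha(x) + m_\alpha \\
&= \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X (h_\alpha(y) + m_\alpha)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\} \\
&= \inf_{a\in A}\left\{\frac{c_0(x,a)}{\alpha + F(x)} + \frac{F(x)}{F(x)+\alpha}\int_X h_\alpha(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\} + \frac{F(x)m_\alpha}{\alpha + F(x)}, \quad \forall\, x\in X,
\end{align*}

and so

\[
(F(x) + \alpha)h_\alpha(x) + \alpha m_\alpha = \inf_{a\in A}\left\{ c_0(x,a) + F(x)\int_X h_\alpha(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\}, \tag{5.12}
\]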
for all x ∈ X. Remember, for a fixed α > 0, if m α = ∞, then h α (x) = ∞, and the convention ∞ − ∞ := ∞ is in use. For each fixed α > 0, (F(x) + β)h β (x) + βm β
q(dy|x, a) + δx (dy) = inf c0 (x, a) + F(x) h β (y) a∈A F(x) X
q(dy|x, a) + δx (dy) ≥ inf c0 (x, a) + F(x) Hα (y) a∈A F(x) X
q(dy|x, a) + δx (dy) , ∀ x ∈ X, ≥ inf c0 (x, a) + F(x) HαE (y) a∈A F(x) X for each β ∈ (0, α]. Let > 0 be arbitrarily fixed. Recall that g < ∞. Therefore, there exists some α0 > 0 such that g + ≥ βm β for each β ∈ (0, α0 ]. Now let α < α0 be fixed. For each β ∈ (0, α], (F(x) + α)h β (x) + g + ≥ (F(x) + β)h β (x) + g +
q(dy|x, a) E + δx (dy) , ∀ x ∈ X. ≥ inf c0 (x, a) + F(x) Hα (y) a∈A F(x) X Thus,
(F(x) + α) inf h β (x) + g + = (F(x) + α)Hα (x) + g + β∈(0,α]
q(dy|x, a) ≥ inf c0 (x, a) + F(x) HαE (y) + δx (dy) , ∀ x ∈ X. a∈A F(x) X According to Proposition B.1.40, the right-hand side of the last inequality defines a lower semicontinuous function on X. Therefore, (F(x) + α)HαE (x) + g +
q(dy|x, a) E + δx (dy) , ∀ x ∈ X. ≥ inf c0 (x, a) + F(x) Hα (y) a∈A F(x) X Let 0 < αn ≤ α0 be such that αn ↓ 0. Then h(x) = limn→∞ HαEn (x) for each x ∈ X. Remember that h(x) < ∞. Taking the limit on both sides along the sequence αn ↓ 0, we see F(x)h(x) + g +
q(dy|x, a) + δx (dy) ≥ lim inf c0 (x, a) + F(x) HαEn (y) n→∞ a∈A F(x) X
q(dy|x, a) + δx (dy) , = inf c0 (x, a) + F(x) h(y) a∈A F(x) X where the last equality is by Lemma B.1.5 and the Monotone Convergence Theorem. Since > 0 and x ∈ X were arbitrarily fixed, the statement is thus proved. Theorem 5.1.2 Suppose Conditions 4.2.1 and 5.1.2 are satisfied. Then the following assertions hold. (a) There exists a deterministic stationary average uniformly optimal strategy. (b) Suppose Condition 5.1.1 is also satisfied, and let F be a [1, ∞)-valued continuous function on X such that F(x) ≥ q x for each x ∈ X. Then any deterministic stationary strategy ϕ∗ satisfying (5.11) is average uniformly optimal, where the constant g and function h are defined by (5.5) and (5.6), respectively. In this case, g = inf lim αW0∗α (x) = W0∗ (x) ∈ [0, ∞), ∀ x ∈ X. x∈X α↓0
Proof If Condition 5.1.1 is not satisfied, then part (a) holds trivially. Suppose Condition 5.1.1 is satisfied below, and let us verify part (b). Then there exists some z ∈ X and π-strategy S such that W0 (S, z) < ∞. By the Tauberian Theorem, see Proposition B.1.20, lim αW0α (S, z) ≤ W0 (S, z) < ∞, α↓0
and thus g < ∞. Now the statements (a) and (b) follow from Corollary 5.1.1 and g = W0∗ (x) for all x ∈ X. It remains to show that g = inf lim αW0∗α (x). x∈X α↓0
For this, note that, g := inf lim αW0∗α (x) ≤ lim αW0∗α (z) < ∞. x∈X α↓0
α↓0
Moreover, since limα↓0 αW0∗α (x) ≥ limα↓0 αm α = g for each x ∈ X, we see that g ≤ g , so that the finite constant g ∈ [0, ∞) together with the function h and the deterministic stationary strategy ϕ∗ also satisfies the condition in the verification theorem, see Corollary 5.1.1, and thus g = W0∗ (x) ∈ [0, ∞) = g for each x ∈ X. Theorem 5.1.3 Suppose Conditions 4.2.1 and 5.1.1 are satisfied. Let F be a [1, ∞)valued continuous function on X such that F(x) ≥ q x for each x ∈ X. Let {αn }∞ n=1 be a sequence such that αn ↓ 0; lim αn m αn = lim αm α . n→∞
α↓0
In addition, suppose for each x ∈ X, the sequence {h αn (x)}∞ n=1 is bounded in n ≥ 1. Then Condition 5.1.2 is satisfied, and the optimality inequality (5.11) is satisfied by g˜ := lim αm α , α↓0
˜ h(x) :=
lim
αn ↓0, y→x
h αn (y), ∀ x ∈ X.
(5.13)
In this case, g = lim αm α = W0∗ (x) = W0 (ϕ∗ , x) α↓0 1 ∗ = lim Eϕx c0 (X (t), ϕ∗ (X (t)))dt = lim W0∗α (x), α↓0 T →∞ T (0,T ] ∀ x ∈ X, with g being defined by (5.5), and ϕ∗ coming from (5.11) with g˜ and h˜ in lieu of g and h. ˜ Proof By Lemma 5.1.2, h(x) ≤ h(x) < ∞ for each x ∈ X. Hence, Condition 5.1.2 is satisfied. The same reasoning as in the proof of Theorem 5.1.1 implies that the ˜ Note that g˜ ≤ g < ∞. By the optimality inequality (5.11) is satisfied by g˜ and h. verification theorem, see Corollary 5.1.1, the deterministic stationary strategy ϕ∗ in (5.11) with g˜ and h˜ is average uniformly optimal, and g˜ = W0∗ (x) = g for each x ∈ X, where the last equality is by Theorem 5.1.2.
For the remaining part of the statement, note that g ≥ lim αW0α (ϕ∗ , x) ≥ lim αW0α (ϕ∗ , x) ≥ g˜ = g ∈ [0, ∞), ∀ x ∈ X, α↓0
α↓0
where the first inequality is by Lemma 5.1.1(b) and the Tauberian theorem, see Proposition B.1.20. Applying the Tauberian theorem again, we see that the statement is proved.
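The vanishing discount argument behind Theorems 5.1.1 to 5.1.3 can be illustrated numerically. The following minimal Python sketch runs value iteration on the uniformized discounted equation (5.7) for a small finite model and reports α m_α and h_α. The two-state, two-action model (`RATES`, `COST`), the constant `F` and the function names are hypothetical choices made for illustration only; they do not come from the text. For this toy model the discounted values of every deterministic stationary strategy can be computed by hand, which gives α m_α = 5/(α + 4) and hence the limit 1.25 as α ↓ 0.

```python
# Hypothetical two-state, two-action model (illustration only, not from the text).
# RATES[x][a] maps a target state y != x to the jump rate q(y | x, a).
RATES = {
    0: {0: {1: 1.0}, 1: {1: 2.0}},
    1: {0: {0: 1.0}, 1: {0: 3.0}},
}
COST = {0: {0: 0.0, 1: 0.5}, 1: {0: 4.0, 1: 5.0}}
F = 3.0  # uniformization constant, chosen so that F >= q_x(a) for all x and a


def discounted_values(alpha, iterations=20000):
    """Value iteration for the uniformized alpha-discounted equation (5.7)."""
    v = {x: 0.0 for x in RATES}
    for _ in range(iterations):
        new_v = {}
        for x in RATES:
            candidates = []
            for a, jumps in RATES[x].items():
                total_rate = sum(jumps.values())
                # Integral of v against q(dy|x,a)/F + delta_x(dy).
                mean_v = (1.0 - total_rate / F) * v[x]
                mean_v += sum(rate / F * v[y] for y, rate in jumps.items())
                candidates.append((COST[x][a] + F * mean_v) / (alpha + F))
            new_v[x] = min(candidates)
        v = new_v
    return v


def vanishing_discount(alpha):
    """Return alpha * m_alpha and the relative values h_alpha = W - m_alpha."""
    v = discounted_values(alpha)
    m_alpha = min(v.values())
    return alpha * m_alpha, {x: v[x] - m_alpha for x in v}


if __name__ == "__main__":
    # The contraction factor is F / (F + alpha), so smaller alpha would need
    # more iterations than the default used here.
    for alpha in (1.0, 0.1, 0.01):
        print(alpha, vanishing_discount(alpha))
```

The fixed iteration count is a deliberate simplification: it is ample for α ≥ 0.01 in this model, whereas a production implementation would use a stopping rule tied to the contraction factor.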
5.1.1.3
Sufficient Conditions
Condition 5.1.3 There exists a sequence αn ↓ 0 such that (a) αn m αn is bounded in n; (b) for some [0, ∞)-valued function m, h αn (x) ≤ m(x) for each x and n. Condition 5.1.4 There exist a state z ∈ X and a sequence αn ↓ 0 such that (a) αn W0∗αn (z) is bounded in n; (b) for some [0, ∞)-valued function m on X, W0∗αn (x) − W0∗αn (z) ≤ m (x) for each x and n; (c) for some finite constant L ∈ [0, ∞), W0∗αn (x) − W0∗αn (z) ≥ −L for each x and n. Proposition 5.1.1 Conditions 5.1.3 and 5.1.4 are equivalent. Furthermore, Under Condition 4.2.1, Condition 5.1.3 implies Conditions 5.1.1 and 5.1.2. Proof Suppose Condition 5.1.4 is satisfied. Then Condition 5.1.3(a) is satisfied automatically. From Condition 5.1.4(c), we see −W0∗αn (x) ≤ L − W0∗αn (z) for each x ∈ X and n, so that −m αn ≤ L − W0∗αn (z), ∀ n. By Condition 5.1.4(b), −W0∗αn (z) ≤ m (x) − W0∗αn (x) for each x ∈ X and n. Therefore, for each n, W0∗αn (x) − m αn ≤ L + m (x), ∀ x ∈ X. Thus Condition 5.1.3(b) is satisfied. Now suppose Condition 5.1.3 is satisfied. Let z ∈ X be arbitrarily fixed. Assume without loss of generality that the sequence αn in Condition 5.1.3 is such that αn ≤ 1 for each n. Then αn W0∗αn (x) ≤ αn m(x) + αn m αn for each x ∈ X, by Condition 5.1.3(b). In particular, αn W0∗αn (z) ≤ m(z) + αn m αn . Since αn m αn is bounded by Condition 5.1.3(a), we see that Condition 5.1.4(a) is satisfied. Also for each n, W0∗αn (x) − W0∗αn (z) ≥ m αn − W0∗αn (z) ≥ −m(z), ∀ x ∈ X
by Condition 5.1.3(b). Hence 5.1.4(c) is satisfied. Finally, 5.1.4(b) is trivially valid, and thus Condition 5.1.4 is satisfied. Now we verify the last assertion of this statement. Suppose Conditions 4.2.1 and 5.1.3 are satisfied. Then the constant g˜ := limαn ↓0 αn m αn is finite. The reasoning in the proof of Theorem 5.1.1 can be applied along the sequence {αn }∞ n=1 to show that for some deterministic stationary strategy ϕ∗ , formula (5.11) holds with g and h ˜ := limαn ↓0, y→x h αn (y) for each x ∈ X. being replaced by g˜ and h˜ defined by h(x) ˜ Note that h is [0, ∞)-valued. Now applying Corollary 5.1.1 to g˜ , h˜ and ϕ∗ , we see that g˜ = W0∗ (x) for each x ∈ X. Since g˜ is finite, we see that Condition 5.1.1 is satisfied. By Remark 5.1.1 specialized to the case of yn ≡ x, we see that Condition 5.1.2 is also satisfied.
5.1.1.4
Counterexample and Optimality Equation
The next example demonstrates that in Theorem 5.1.2, one cannot replace "≥" in (5.11) with "=".

Example 5.1.2 Consider the state and action spaces X = {0, 1, . . . } and A = {1, 2}. The transition and cost rates are defined by

\begin{align*}
& q_x(a) = 1, \quad \forall\, x\in X,\ a\in A, \\
& q(x-1|x,a) = 1, \quad q(x|0,a) = p_x^{(a)}, \quad \forall\, x\in\{1,2,\dots\},\ a\in A, \\
& c_0(x,a) = 1, \quad \forall\, x\in\{1,2,\dots\},\ a\in A, \\
& c_0(0,a) = I\{a=2\}, \quad \forall\, a\in A,
\end{align*}

where for each a ∈ A, $p_y^{(a)} \ge 0$ for each y = 1, 2, . . . , and $\sum_{y=1}^\infty p_y^{(a)} = 1$.

Proposition 5.1.2 Consider Example 5.1.2. Then the following assertions hold.

(a) $g = \dfrac{\sum_{y=1}^\infty y p_y^{(1)}}{1 + \sum_{y=1}^\infty y p_y^{(1)}} \equiv W_0^*(x)$ and $h(x) = \dfrac{x}{1 + \sum_{y=1}^\infty p_y^{(1)} y}$ for each x ∈ X, with g and h being defined by (5.5) and (5.6); Conditions 5.1.2 and 5.1.1 are satisfied; the strategy ϕ*(x) ≡ 1 is average uniformly optimal.

(b) Suppose $\sum_{y=1}^\infty y p_y^{(1)} = \infty$. Then there do not exist a finite constant g′ and an R-valued measurable function h′ on X satisfying

\[
g' + h'(x) = \inf_{a\in A}\left\{ c_0(x,a) + \int_X h'(y)\,\big(q(dy|x,a) + \delta_x(dy)\big)\right\}, \quad \forall\, x\in X. \tag{5.14}
\]

The strategy ϕ(x) ≡ 2 is average uniformly optimal but does not achieve the infimum in the optimality inequality (5.11) with g and h defined as in part (a).

Proof (a) Let α > 0 be fixed for now. It is clear that a deterministic stationary α-discounted uniformly optimal strategy is given by ϕ*(x) ≡ 1.
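By Theorem 4.2.4, for x > 0,

\[
W_0^{*\alpha}(x) = \frac{1}{1+\alpha} + \frac{W_0^{*\alpha}(x-1)}{1+\alpha},
\]

and thus

\[
W_0^{*\alpha}(x) = \sum_{k=1}^x \frac{1}{(1+\alpha)^k} + \frac{1}{(1+\alpha)^x}\,W_0^{*\alpha}(0).
\]

Put $\beta := \frac{1}{1+\alpha}$. Then

\[
W_0^{*\alpha}(0) = \frac{1}{1+\alpha}\sum_{x=1}^\infty p_x^{(1)} W_0^{*\alpha}(x) = \frac{1}{1+\alpha}\sum_{x=1}^\infty p_x^{(1)}\left(\sum_{k=1}^x \frac{1}{(1+\alpha)^k} + \frac{1}{(1+\alpha)^x}\,W_0^{*\alpha}(0)\right),
\]

so that

\[
W_0^{*\alpha}(0) = \frac{\beta^2}{1-\beta}\cdot\frac{1 - \sum_{y=1}^\infty p_y^{(1)}\beta^y}{1 - \beta\sum_{y=1}^\infty p_y^{(1)}\beta^y},
\]
\[
W_0^{*\alpha}(x) = \beta\left(\frac{1-\beta^x}{1-\beta} + \frac{\beta^{x+1}}{1-\beta}\cdot\frac{1 - \sum_{y=1}^\infty p_y^{(1)}\beta^y}{1 - \beta\sum_{y=1}^\infty p_y^{(1)}\beta^y}\right), \quad \forall\, x = 1, 2, \dots.
\]

One can verify that $m_\alpha = W_0^{*\alpha}(0)$ and hence

\[
\alpha m_\alpha = \beta\,\frac{1 - \sum_{y=1}^\infty p_y^{(1)}\beta^y}{1 - \beta\sum_{y=1}^\infty p_y^{(1)}\beta^y}, \qquad
h_\alpha(x) = \frac{\beta(1-\beta^x)}{1 - \beta\sum_{y=1}^\infty p_y^{(1)}\beta^y}, \quad x = 1, 2, \dots, \qquad h_\alpha(0) = 0.
\]

Now

\[
\lim_{\alpha\downarrow 0}\alpha m_\alpha = \frac{\sum_{y=1}^\infty y p_y^{(1)}}{1 + \sum_{y=1}^\infty y p_y^{(1)}}, \quad \text{where } \frac{\infty}{\infty} := 1; \qquad
\lim_{\alpha\downarrow 0} h_\alpha(x) = \frac{x}{1 + \sum_{y=1}^\infty y p_y^{(1)}}, \quad \forall\, x\in X.
\]

Assertion (a) follows from this and Theorem 5.1.2.

(b) Suppose g′ and h′ satisfy equality (5.14). Then for each x ≥ 1, g′ + h′(x) = 1 + h′(x − 1), so that h′(x) = x(1 − g′) + h′(0), ∀ x ≥ 1. Also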
\begin{align*}
g' + h'(0) &= \min\left\{\sum_{y=1}^\infty h'(y)\,p_y^{(1)},\ 1 + \sum_{y=1}^\infty h'(y)\,p_y^{(2)}\right\} \\
&= \min\left\{\sum_{y=1}^\infty \big(y(1-g') + h'(0)\big)\,p_y^{(1)},\ 1 + \sum_{y=1}^\infty \big(y(1-g') + h'(0)\big)\,p_y^{(2)}\right\} \\
&= \min\left\{\sum_{y=1}^\infty y(1-g')\,p_y^{(1)},\ 1 + \sum_{y=1}^\infty y(1-g')\,p_y^{(2)}\right\} + h'(0).
\end{align*}

Clearly g′ > 1 or g′ < 1 cannot satisfy the above relations, and neither can g′ = 1. The other assertion can be directly verified. Remember, at ϕ(x) ≡ 2, the cost rate identically equals 1, and the optimal strategy ϕ*(x) ≡ 1 (see item (a)) provides W_0(ϕ*, x) = W_0^*(x) ≡ g = 1.

Condition 5.1.5 Let $\{\alpha_n\}_{n=1}^\infty$ be a sequence such that α_n ↓ 0 and

\[
\lim_{n\to\infty} \alpha_n m_{\alpha_n} = \limsup_{\alpha\downarrow 0} \alpha m_\alpha.
\]
(a) {h αn (x)} is bounded in n for each x ∈ X. (b) {h αn } is an equicontinuous family of functions on X, see Definition B.1.6. (c) There exists a [0, ∞)-valued measurable function h on X such that h ≥ h with q (dy|x, a)h(y) < ∞ for each x ∈ X and a ∈ A. h being defined by (5.6) and X Theorem 5.1.4 Suppose Conditions 4.2.1, 5.1.1 and 5.1.5 are satisfied. Let F be a [1, ∞)-valued continuous function on X such that F(x) ≥ q x for each x ∈ X. Then the following assertions hold. (a) h˜ = h with h˜ and h being defined by (5.13) and (5.6), and h is continuous. Moreover, there is a subsequence of {h αn } converging to h uniformly on compact subsets of X. (b) There exists a deterministic stationary strategy ϕ∗ such that g + F(x)h(x)
q(dy|x, a) + δx (dy) = inf c(x, a) + F(x) h(y) a∈A F(x) X
q(dy|x, ϕ∗ (x)) + δx (dy) , ∀ x ∈ X. = c(x, ϕ∗ (x)) + F(x) h(y) F(x) X Proof (a) Let x ∈ X be arbitrarily fixed. By Lemma 5.1.2(b), there exist a subse∞ quence {αn k }∞ k=1 ⊆ {αn }n=1 and {x n k } ⊆ X converging to x ∈ X such that h(x) = lim h αnk (xn k ). k→∞
Without loss of generality, we assume that lim h αnk (xn k ) = lim h αnk (xn k ),
k→∞
k→∞
for, otherwise, one can take the corresponding subsequences of {αn k }∞ k=1 and {xn k }∞ k=1 . Applying the Arzela–Ascoli Theorem, see Proposition B.1.11, to the equi∞ ∞ continuous family {h n k }∞ k=1 , we obtain a subsequence of {h αn km }m=1 ⊆ {h αn k }k=1 and a continuous [0, ∞)-valued function hˆ on X such that the sequence {h αnk }∞ converges to hˆ uniformly on compact subsets of X. We show m m=1 ˆ that h(x) = h(x) as follows. Let > 0 be arbitrarily fixed. There exists a δ > 0 such that |h αn (x) − h αn (y)| < for each y ∈ X satisfying ρ(y, x) < δ, km km where ρ is the fixed metric on X. Then for all large enough m, ˆ |h(x) − h αnk (xn km )| < , ρ(xn km , x) < δ, |h αnk (x) − h(x)| < , m
m
ˆ so that |h(x) − h(x)| < 3. Since x ∈ X and > 0 are arbitrarily fixed, we see ˆ that h(x) = h(x) for each x ∈ X. By Lemma 5.1.2(b), for all x ∈ X
˜ ˆ h(x) ≤ h(x) ≤ lim h αn (x) ≤ lim h αnk (x) = h(x) = h(x). n→∞
m→∞
m
So part (a) is proved. (b) As the other direction of the inequality was justified in Theorem 5.1.1, we only need to verify the inequality
\[
g + F(x)h(x) \le c(x,a) + F(x)\int_X h(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)
\]

for each a ∈ A, as follows. Let x ∈ X and a ∈ A be arbitrarily fixed. According to (5.12),

\[
(F(x) + \alpha_{n_{k_m}})\,h_{\alpha_{n_{k_m}}}(x) + \alpha_{n_{k_m}} m_{\alpha_{n_{k_m}}} \le c_0(x,a) + F(x)\int_X h_{\alpha_{n_{k_m}}}(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right).
\]

By taking the limit as m → ∞ on both sides of the above relation, legitimately using Lebesgue's Dominated Convergence Theorem, the desired inequality immediately follows from part (a) and Theorem 5.1.3: recall that $\lim_{m\to\infty} \alpha_{n_{k_m}} m_{\alpha_{n_{k_m}}} = g$.
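As a numerical sanity check of the closed-form expressions in the proof of Proposition 5.1.2(a), the following minimal Python sketch evaluates α m_α and h_α(x) for a small α and compares them with the stated limits. The jump distribution `p` (a hypothetical choice of p^(1) with finite support) and the function names are assumptions made for illustration; only the formulas themselves come from the text.

```python
def discounted_quantities(p, alpha, x):
    """alpha * m_alpha and h_alpha(x) from the closed forms in Proposition 5.1.2(a).

    p maps y = 1, 2, ... to the probability p_y^(1); it is assumed to sum to 1.
    """
    beta = 1.0 / (1.0 + alpha)
    gen = sum(py * beta ** y for y, py in p.items())  # sum_y p_y^(1) beta^y
    alpha_m = beta * (1.0 - gen) / (1.0 - beta * gen)
    h = beta * (1.0 - beta ** x) / (1.0 - beta * gen)
    return alpha_m, h


def limits(p, x):
    """Limits as alpha -> 0: g = mu / (1 + mu) and h(x) = x / (1 + mu)."""
    mu = sum(y * py for y, py in p.items())  # mean jump size, assumed finite here
    return mu / (1.0 + mu), x / (1.0 + mu)


if __name__ == "__main__":
    p = {1: 0.5, 2: 0.5}  # hypothetical p^(1)
    for alpha in (1.0, 1e-2, 1e-4, 1e-6):
        print(alpha, discounted_quantities(p, alpha, 3))
    print("limits", limits(p, 3))
```

The sketch covers only the case of a finite mean; the case of an infinite mean in part (b), where g = 1, is not reproduced here.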
5.1.2 Unconstrained Problem: Weight Function

Lemma 5.1.4 Suppose Condition 3.1.1 is satisfied with ρ < 0. Then, for each π-strategy S,

\[
W_j(S,x) \ge \frac{bM}{\rho}, \quad \forall\, x\in X,\ j = 0, 1, \dots, J.
\]

Proof Similarly to Lemma 3.1.1, the statement follows from the following relation:

\[
E_x^S[w(X(t))] \le e^{\rho t}w(x) - \frac{b}{\rho}\left(1 - e^{\rho t}\right), \quad \forall\, x\in X, \tag{5.15}
\]

cf. Corollary 2.2.8 and its proof.
Theorem 5.1.5 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (by the same function w and ρ, ρ < 0). Furthermore, assume that, for some fixed z ∈ X, and sequence αn ↓ 0, it holds that |W0∗αn (x) − W0∗αn (z)| ≤ κw (x), ∀ x ∈ X, where κ ≥ 0 is a constant. Let F be a [1, ∞)-valued continuous function on X such that F(x) ≥ q x for each x ∈ X. Then the following assertions hold. (a) There exist a constant g, a lower semicontinuous function h ∈ Bw (X) and a deterministic stationary strategy ϕ∗ satisfying (5.11). (b) For the constant g in part (a), there exists a measurable function h ∈ Bw (X) such that g + F(x)h(x)
q(dy|x, a) ≤ inf c0 (x, a) + F(x) h(y) + δx (dy) a∈A F(x) X for each deterministic stationary strategy ϕ. (c) The deterministic stationary strategy ϕ∗ from part (a) is average uniformly optimal with W0 (x) = W0 (ϕ∗ , x) = g for each x ∈ X. In fact, each deterministic stationary strategy satisfying the requirement in part (a) is average uniformly optimal. Proof (a) The proof is similar to that of Theorem 5.1.1. By Theorem 3.1.2, for each α > 0, W0∗α is the unique lower semicontinuous solution in Bw (X) to (3.4). Instead of (5.5), the constant g is defined as follows. Note that there exists a
constant g ∈ R and a subsequence of {αn }∞ n=1 , with a slight abuse of notation , such that still denoted as {αn }∞ n=1 lim αn W0∗αn (x) = g, ∀ x ∈ X.
n→∞
The justification is as follows. According to (3.7), the product α|W0∗α (z)| is uniformly bounded in α. Therefore, there exist a constant g ∈ R and a subsequence ∞ of {αn }∞ n=1 , still denoted as {αn }n=1 , such that lim αn W0∗αn (z) = g.
n→∞
For each fixed x ∈ X, |αn W0∗αn (x) − g| ≤ |αn W0∗αn (x) − αn W0∗αn (z)| + |αn W0∗αn (z) − g| ≤ αn κw (x) + |αn W0∗αn (z) − g| → 0 as n → ∞, as desired. Now, instead of (5.6), we put h(x) =
lim
αn ↓0, y→x
h αn (y), ∀ x ∈ X,
where h α (x) = W0∗α (x) − W0∗α (z), ∀ α > 0, x ∈ X. Note that h ∈ Bw (X). One can rewrite (3.4) as c0 (x, a) W0∗αn (x) = h αn (x) + W0∗αn (z) = inf a∈A αn + F(x)
F(x) q(dy|x, a) + δx (dy) + (h αn (y) + W0∗αn (z)) αn + F(x) X F(x)
c0 (x, a) F(x) q(dy|x, a) + + δx (dy) h (y) = inf a∈A α + F(x) αn + F(x) X αn F(x) F(x)W0∗αn (z) , ∀ x ∈ X, + αn + F(x) and thus
(F(x) + αn )h αn (x) + αn W0∗αn (z)
q(dy|x, a) = inf c0 (x, a) + F(x) h αn (y) + δx (dy) , a∈A F(x) X ∀ x ∈ X, (5.16) cf. (5.12). From here, it is evident that the line of reasoning in the proof of Theorem 5.1.1 can be followed with m α , h, h α and Hα being replaced by W0∗αn (z), h, h αn , H αn (x) = inf k≥n h αk (x) for each x ∈ X. The functions H αn and H αEn belong to Bw (X), so that X H αEn (y) q(dy|x,a) + δ (dy) is lower semicontinuous on x F(x) X × A. Finally, the existence of the deterministic stationary strategy ϕ∗ is evident. (b) Consider the subsequence {αn } and notations as in the proof of part (a). Let h(x) = lim h αn (x), ∀ x ∈ X. n→∞
Then h belongs to Bw (X). By (5.16), (F(x) + αn )h αn (x) + αn W0∗αn (z)
q(dy|x, a) ≤ c0 (x, a) + F(x) h αn (y) + δx (dy) F(x) X for each x ∈ X and a ∈ A. The statement follows from this and Fatou’s Lemma. (c) Applying the Dynkin formula (2.90) to the function h ∈ Bw (X), we obtain ExS h(X (T )) S = h(x) + Ex
(0,T ]
(da|v)q(dy|X (v), a)h(y) dv , ∀ S ∈ Sπ . A
X
Note that Condition 3.1.1(b, c) is satisfied. Moreover, Condition 2.4.3 implies Condition 2.4.2, hence 2.2.5 is satisfied, too. Condition Therefore, by Lemma S 5.1.4, one can add Ex (0,T ] A c0 (X (v), a)(da|v)dv to the presented equality: ExS
(0,T ]
+ExS
c0 (X (v), a)(da|v)dv + ExS h(X (T )) = h(x) A
(da|v) q(dy|X (v), a)h(y) + c0 (X (v), a) dv ,
(0,T ]
A
X
∀ S ∈ Sπ . After dividing the above equality by T and passing to the upper limit as T → ∞, we see that
W0 (ϕ∗ , x) ≤ g, ∀ x ∈ X, where the inequality is by Item (a). Similarly, applying the Dynkin formula (2.90) to the function h ∈ Bw (X) from part (b), we see that W0 (S, x) ≥ g, ∀ x ∈ X, S ∈ Sπ , so that W0 (x) = g = W0 (ϕ∗ , x) ∀ x ∈ X, as desired. Remark 5.1.2 Suppose a real-valued measurable function h ∗ on X, a finite constant g and a deterministic stationary strategy ϕ∗ satisfy g + F(x)h ∗ (x)
q(dy|x, a) ∗ = inf c0 (x, a) + F(x) h (y) + δx (dy) a∈A F(x) X
q(dy|x, ϕ∗ (x)) ∗ ∗ + δx (dy) , ∀ x ∈ X, = c0 (x, ϕ (x)) + F(x) h (y) F(x) X for some [1, ∞)-valued measurable function F on X, or equivalently, g = inf c0 (x, a) + F(x) h ∗ (y)q(dy|x, a) a∈A X = c0 (x, ϕ∗ (x)) + h ∗ (y)q(dy|x, ϕ∗ (x)), ∀ x ∈ X. X
If for each x ∈ X and π-strategy S, ExS h ∗ (X (T )) = h ∗ (x) S ∗ +Ex (da|v)q(dy|X (v), a)h (y) dv , ∀ T ∈ R+ , (0,T ]
A
X
where the expectations on both sides are finite, ExS
(0,T ]
c0 (X (v), a)(da|v)dv ∈ R, ∀ T ∈ R+ ,
A
and lim
T →∞
1 S ∗ 1 ∗ Ex h (X (T )) ≥ 0, lim Eϕx h ∗ (X (T )) ≤ 0, T →∞ T T
then from the reasoning in the proof of part (c) of Theorem 5.1.5, we can see that g = W0∗ (x) = W0 (ϕ∗ , x) for each x ∈ X, so that ϕ∗ is average uniformly optimal. In particular, if the functions defined by qx (a), h ∗ (x) and c0 (x, a) are all bounded, then the conditions in this remark are satisfied. Theorem 5.1.6 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (by the same function w and ρ, ρ < 0). Furthermore, assume that for some fixed z ∈ X, and sequence αn ↓ 0, it holds that |W0∗αn (x) − W0∗αn (z)| ≤ κw (x), ∀ x ∈ X, where κ ≥ 0 is a constant. Let F be a [1, ∞)-valued continuous function on X such that F(x) ≥ q x for each x ∈ X. Then the following assertions hold. (a) Consider the constant g and the functions h and h from part (a) of Theorem 5.1.5. Suppose, in addition, inf {h(x) − h(x)} > −∞.
x∈X
(5.17)
Then there exists a measurable function h ∗ ∈ Bw (X) such that, for all x ∈ X, g + F(x)h ∗ (x)
q(dy|x, a) ∗ = inf c0 (x, a) + F(x) h (y) + δx (dy) . a∈A F(x) X (b) For each deterministic stationary strategy ϕ, let us introduce the notation
q(dy|x, ϕ(x)) + δx (dy) , p˜ (0) (dy|x, ϕ(x)) = p(dy|x, ˜ ϕ(x)) := F(x) p˜ (n+1) (dy|x, ϕ(x)) := p(dy|z, ˜ ϕ(z)) p˜ (n) (dz|x, ϕ(x)) X
for each n ≥ 0. If sup lim
x∈X n→∞ X
w (y) p˜ (n) (dy|x, ϕ∗ (x)) < ∞
for the deterministic stationary strategy ϕ∗ from part (a) of Theorem 5.1.5, then (5.17) holds. Proof (a) Note that h(x) = h(x) + h(x) − h(x) ≥ h(x) + inf {h(x) − h(x)}, x∈X
where inf x∈X {h(x) − h(x)} is a constant in (−∞, 0]. Now let h 0 (x) = h(x), x ∈ X,
c0 (x, a) q(dy|x, a) + h n (y) + δx (dy) h n+1 (x) = inf a∈A F(x) F(x) X g , n ≥ 0, x ∈ X. (5.18) − F(x) In particular,
c0 (x, a) q(dy|x, a) g + h(y) + δx (dy) − a∈A F(x) F(x) F(x) X ≤ h(x) = h 0 (x), x ∈ X,
h 1 (x) = inf
where the last inequality is by Theorem 5.1.5(a). Hence, {h n } is a monotone nonincreasing sequence of functions majorized by h. It also holds that c0 (x, a) + h(y) + inf {h(x) − h(x)} h 1 (x) ≥ inf a∈A x∈X F(x) X
g q(dy|x, a) + δx (dy) ≥ h(x) + inf {h(x) − h(x)}, − x∈X F(x) F(x) where the last inequality is by Theorem 5.1.5(b). An inductive argument shows that {h n } is minorized by h + inf x∈X {h(x) − h(x)}. Recall that h, h ∈ Bw (X) by Theorem 5.1.5. Therefore, {h n } is a sequence of lower semicontinuous functions in Bw (X). Let h ∗ (x) = lim h n (x), ∀ x ∈ X. n→∞
Then h ∗ is measurable on X, and satisfies h(x) + inf {h(x) − h(x)} ≤ h ∗ (x) ≤ h(x), ∀ x ∈ X. x∈X
Applying Proposition B.1.13 and the Monotone Convergence Theorem, after passing to the limit as n → ∞ on both sides of (5.18), we see that h ∗ satisfies (5.18), as desired. (b) Consider the deterministic stationary strategy ϕ∗ from part (a) of Theorem 5.1.5, so that g + F(x)h(x) ≥ c0 (x, ϕ∗ (x)) + F(x) h(y) p(dy|x, ˜ ϕ∗ (x)), ∀ x ∈ X. S
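According to Theorem 5.1.5(b),
\[ g + F(x)\overline{h}(x) \le c_0(x,\varphi^*(x)) + F(x)\int_X \overline{h}(y)\,\tilde p(dy|x,\varphi^*(x)), \quad \forall\, x \in X. \]
Then
\[ F(x)\left(\underline{h}(x) - \overline{h}(x)\right) \ge F(x)\int_X \left(\underline{h}(y) - \overline{h}(y)\right)\tilde p(dy|x,\varphi^*(x)), \quad \forall\, x \in X, \]
and so, by induction with respect to $n$,
\[ \underline{h}(x) - \overline{h}(x) \ge \int_X \left(\underline{h}(y) - \overline{h}(y)\right)\tilde p^{(n)}(dy|x,\varphi^*(x)), \quad \forall\, x \in X, \text{ for all } n \ge 0. \]
Remember, $\underline{h}, \overline{h} \in B_{w'}(X)$ and $w'$ is integrable with respect to the measure $\tilde p^{(n)}(dy|x,\varphi^*(x))$ for each $x \in X$ and $n \ge 0$. Then
\[ \underline{h}(x) - \overline{h}(x) \ge \lim_{n\to\infty}\left\{-\sup_{x\in X}\left(\frac{\overline{h}(x) - \underline{h}(x)}{w'(x)}\right)\int_X w'(y)\,\tilde p^{(n)}(dy|x,\varphi^*(x))\right\} = -\sup_{x\in X}\left(\frac{\overline{h}(x) - \underline{h}(x)}{w'(x)}\right)\lim_{n\to\infty}\int_X w'(y)\,\tilde p^{(n)}(dy|x,\varphi^*(x)), \quad \forall\, x \in X, \]
so that
\[ \inf_{x\in X}\left(\underline{h}(x) - \overline{h}(x)\right) \ge -\sup_{x\in X}\left(\frac{\overline{h}(x) - \underline{h}(x)}{w'(x)}\right)\sup_{x\in X}\lim_{n\to\infty}\int_X w'(y)\,\tilde p^{(n)}(dy|x,\varphi^*(x)) > -\infty, \]
as desired.
Remark 5.1.3 Suppose the conditions in Theorem 5.1.6 are satisfied, which are the same as those in Theorem 5.1.5. If $X$ is denumerable, then by taking a subsequence of $\{\alpha_n\}_{n=1}^\infty$ in the proof of Theorem 5.1.5, if necessary, we can assume that $h_{\alpha_n}(x)$ converges for each $x \in X$, as $n \to \infty$. Therefore,
\[ \lim_{n\to\infty} h_{\alpha_n}(x) = \underline{h}(x) = \overline{h}(x) = h^*(x) \]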
for each x ∈ X. According to Theorems 5.1.5 and 5.1.6, the triplet (h ∗ , g, ϕ∗ ) in Remark 5.1.2 exists (see the proof of Theorem 5.1.6). The existence of the average uniformly optimal deterministic stationary strategy ϕ∗ further follows from Condition 3.1.2; note also that h ∗ ∈ Bw (X).
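For a finite model, the optimality equation of Theorem 5.1.6 can be solved numerically. The following Python sketch is an illustration only: the two-state, two-action model, its rates and its costs are assumptions made for the example, $F$ is taken to be a constant, and relative value iteration on the uniformized kernel $\tilde p = q/F + \delta_x$ is used in place of the monotone scheme (5.18), which presupposes that $g$ is already known.

```python
# Hypothetical two-state, two-action CTMDP (all numbers are assumptions).
# RATES[x][a]: jump rate from state x to the other state under action a.
RATES = [[1.0, 2.0], [1.0, 3.0]]
# COSTS[x][a]: cost rate c0(x, a).
COSTS = [[0.0, 1.0], [4.0, 6.0]]
F = 4.0  # constant uniformization rate, strictly larger than every jump rate


def bellman(h):
    """Apply h -> min_a { c0(x,a)/F + sum_y h(y) p~(y|x,a) } at both states."""
    out = []
    for x in (0, 1):
        other = 1 - x
        candidates = []
        for a in (0, 1):
            jump = RATES[x][a] / F   # p~(other | x, a)
            stay = 1.0 - jump        # p~(x | x, a), positive since F > rate
            candidates.append(COSTS[x][a] / F + jump * h[other] + stay * h[x])
        out.append(min(candidates))
    return out


def relative_value_iteration(iterations=2000, ref=0):
    """Return (g, h) with h(ref) = 0 approximately solving
    g + F h(x) = min_a { c0(x,a) + F sum_y h(y) p~(y|x,a) }."""
    h = [0.0, 0.0]
    g = 0.0
    for _ in range(iterations):
        th = bellman(h)
        g = F * th[ref]                # current estimate of the average cost
        h = [v - th[ref] for v in th]  # renormalize so that h(ref) = 0
    return g, h


def residual(g, h):
    """Largest violation of the optimality equation over the two states."""
    th = bellman(h)
    return max(abs(g + F * h[x] - F * th[x]) for x in (0, 1))


g, h = relative_value_iteration()
print(g, h, residual(g, h))
```

In this toy model the four deterministic stationary strategies can also be compared by hand through their invariant distributions, which gives an independent check of the computed value of $g$.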
Lemma 5.1.5 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (by the same function w and ρ, ρ < 0). If for some state z ∈ X,
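\[ \int_X w'(y)\,q(dy|x,a) \le \rho\, w'(x) + b\, I\{x = z\}, \quad \forall\, x \in X, \]
and, for some constant $M \ge 0$, $|c_0(x,a)| \le M w'(x)$ for each $x \in X$ and $a \in A$, then, for some constant $\kappa \in [0,\infty)$,
\[ |W_0^{*\alpha}(x) - W_0^{*\alpha}(z)| \le \kappa\, w'(x), \quad \forall\, \alpha \in (0,\infty),\ x \in X. \]
Proof Consider the stopping time $\tau_z := \inf\{t \ge 0 : X(t) = z\}$. We verify that for each deterministic stationary strategy $\varphi$,
\[ E_x^{\varphi}\left[e^{-\alpha\tau_z}\right] \le w'(x) + (\rho - \alpha)\,E_x^{\varphi}\left[\int_{(0,\tau_z]} e^{-\alpha v} w'(X(v))\,dv\right], \quad \forall\, x \in X,\ \alpha \in [0,\infty), \tag{5.19} \]
as follows. Indeed, one can check by using the Markov property and Theorem 2.4.4 together with its proof that for each $\alpha \in [0,\infty)$, $x \in X$ and deterministic stationary strategy $\varphi$,
\[ w'(X(t))e^{-\alpha t} - w'(x) - \int_{(0,t]} e^{-\alpha v}\left(-\alpha w'(X(v)) + \int_X w'(y)\,q(dy|X(v),\varphi(X(v)))\right)dv \]
is a ($P_x^{\varphi}$-almost surely) right-continuous $\{\mathcal{F}_t\}$-martingale under $P_x^{\varphi}$; the natural filtration $\{\mathcal{F}_t\}_{t\in\mathbb{R}_+^0}$ was defined in (1.5). By the Doob Optional Sampling Theorem, see Proposition B.3.3, for each integer $n > 0$ and the finite stopping time $\tau^n := \min\{\tau_z, n\}$,
\[ E_x^{\varphi}\left[e^{-\alpha\tau^n}\right] \le E_x^{\varphi}\left[w'(X(\tau^n))e^{-\alpha\tau^n}\right] = w'(x) + E_x^{\varphi}\left[\int_{(0,\tau^n]} e^{-\alpha v}\left(-\alpha w'(X(v)) + \int_X w'(y)\,q(dy|X(v),\varphi(X(v)))\right)dv\right] \]
\[ \le w'(x) + E_x^{\varphi}\left[\int_{(0,\tau^n]} e^{-\alpha v}\left(-\alpha w'(X(v)) + \rho\, w'(X(v))\right)dv\right], \quad \forall\, x \in X,\ \alpha \in [0,\infty). \]
Passing to the limit as $n \to \infty$, we see that formula (5.19) holds for each deterministic stationary strategy $\varphi$.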
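Note that, since $\rho < 0$,
\[ E_x^{\varphi}\left[\int_{(0,\tau_z]} e^{-\alpha v} w'(X(v))\,dv\right] \le \frac{E_x^{\varphi}\left[e^{-\alpha\tau_z}\right] - w'(x)}{\rho - \alpha} \le \frac{w'(x)}{-\rho} \tag{5.20} \]
and, at $\alpha = 0$,
\[ E_x^{\varphi}[\tau_z] \le E_x^{\varphi}\left[\int_{(0,\tau_z]} w'(X(v))\,dv\right] \le \frac{w'(x)}{-\rho}, \tag{5.21} \]
because $w'(y) \ge 1$ for all $y \in X$. For each $\alpha > 0$, let $\varphi_\alpha$ be an $\alpha$-discounted uniformly optimal deterministic stationary strategy which exists due to Theorem 3.1.2. Then by using the strong Markov property, see Proposition A.5.2,
\begin{align*}
W_0^{*\alpha}(x) - W_0^{*\alpha}(z) &= E_x^{\varphi_\alpha}\left[\int_{(0,\tau_z]} e^{-\alpha v} c_0(X(v),\varphi_\alpha(X(v)))\,dv\right] + E_x^{\varphi_\alpha}\left[\int_{(\tau_z,\infty)} e^{-\alpha v} c_0(X(v),\varphi_\alpha(X(v)))\,dv\right] - W_0^{*\alpha}(z) \\
&= E_x^{\varphi_\alpha}\left[\int_{(0,\tau_z]} e^{-\alpha v} c_0(X(v),\varphi_\alpha(X(v)))\,dv\right] + E_x^{\varphi_\alpha}\left[e^{-\alpha\tau_z}\right]W_0^{*\alpha}(z) - W_0^{*\alpha}(z) \\
&= E_x^{\varphi_\alpha}\left[\int_{(0,\tau_z]} e^{-\alpha v} c_0(X(v),\varphi_\alpha(X(v)))\,dv\right] + E_x^{\varphi_\alpha}\left[e^{-\alpha\tau_z} - 1\right]W_0^{*\alpha}(z),
\end{align*}
so that
\begin{align*}
\left|W_0^{*\alpha}(x) - W_0^{*\alpha}(z)\right| &\le E_x^{\varphi_\alpha}\left[\int_{(0,\tau_z]} e^{-\alpha v}\left|c_0(X(v),\varphi_\alpha(X(v)))\right|dv\right] + E_x^{\varphi_\alpha}\left[1 - e^{-\alpha\tau_z}\right]\left|W_0^{*\alpha}(z)\right| \\
&\le M\,E_x^{\varphi_\alpha}\left[\int_{(0,\tau_z]} e^{-\alpha v} w'(X(v))\,dv\right] + E_x^{\varphi_\alpha}[\alpha\tau_z]\left(\frac{M}{\alpha - \rho} + \frac{Mb}{\alpha(\alpha - \rho)}\right)w'(z).
\end{align*}
Here the obvious inequality $1 - e^{-\alpha\tau_z} \le \alpha\tau_z$ and Theorem 3.1.1 are in use. Finally, using (5.20) and (5.21), we obtain, for all $x \in X$, $\alpha > 0$,
\[ \left|W_0^{*\alpha}(x) - W_0^{*\alpha}(z)\right| \le \frac{M}{-\rho}\,w'(x) + \left(M + \frac{Mb}{-\rho}\right)w'(z)\,\frac{w'(x)}{-\rho}, \]
because $\frac{\alpha}{\alpha - \rho} < 1$ and $\frac{1}{\alpha - \rho} \le \frac{1}{-\rho}$.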
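Inequality (5.21) can be checked on a simple example. The Python sketch below is an illustration under assumed data: an uncontrolled birth-and-death process on $\{0,\dots,N\}$ with birth rate `LAM`, death rate `MU`, reference state $z = 0$ and $w'(x) = R^x$; none of these choices come from the text. It verifies the drift condition of Lemma 5.1.5 and compares the expected hitting time of $z$ with $w'(x)/(-\rho)$.

```python
# Assumed illustrative data: birth-death process on {0, ..., N}, target z = 0.
N = 10
LAM, MU = 1.0, 3.0   # birth and death rates (assumptions)
R = 2.0              # w'(x) = R**x, chosen with 1 < R < MU / LAM


def w(x):
    return R ** x


def drift(x):
    """Compute sum_y w'(y) q({y}|x) for the birth-death generator."""
    up = LAM if x < N else 0.0
    down = MU if x > 0 else 0.0
    total = -(up + down) * w(x)
    if x < N:
        total += up * w(x + 1)
    if x > 0:
        total += down * w(x - 1)
    return total


# Constants for which drift(x) <= RHO * w'(x) + B * I{x = 0}.
RHO = (R - 1.0) * (LAM - MU / R)   # negative because R < MU / LAM
B = drift(0) - RHO * w(0)


def hitting_times():
    """Expected hitting times of state 0, built from
    d(x) = expected time to move from x to x - 1."""
    d = [0.0] * (N + 1)
    d[N] = 1.0 / MU
    for x in range(N - 1, 0, -1):
        d[x] = 1.0 / MU + (LAM / MU) * d[x + 1]
    t = [0.0] * (N + 1)
    for x in range(1, N + 1):
        t[x] = t[x - 1] + d[x]
    return t


drift_ok = all(
    drift(x) <= RHO * w(x) + (B if x == 0 else 0.0) + 1e-12
    for x in range(N + 1)
)
times = hitting_times()
bound_ok = all(times[x] <= w(x) / (-RHO) + 1e-12 for x in range(N + 1))
print(drift_ok, bound_ok)
```

The bound is far from tight here, which is expected: (5.21) is only used to obtain a $w'$-bounded estimate that is uniform in $\alpha$.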
The following version of Lemma 5.1.5 can be more convenient for verifications.
Lemma 5.1.6 Suppose Conditions 2.4.3 and 3.1.2 are satisfied (by the same function w and ρ, ρ < 0). If for some state $z \in X$, there exist $[0,\infty)$-valued measurable functions $V_1, V_2 \in B_{w'}(X)$ such that
\[ |c_0(x,a)| + \int_{X\setminus\{z\}} V_1(y)\,q(dy|x,a) \le 0, \quad \forall\, a \in A,\ x \in X\setminus\{z\}; \]
\[ 1 + \int_{X\setminus\{z\}} V_2(y)\,q(dy|x,a) \le 0, \quad \forall\, a \in A,\ x \in X\setminus\{z\}, \]
and there exists some constant $M \ge 0$ such that $|\inf_{a\in A} c_0(x,a)| \le M w'(x)$ for each $x \in X$, then, for some constant $\kappa \in [0,\infty)$,
\[ |W_0^{*\alpha}(x) - W_0^{*\alpha}(z)| \le \kappa\, w'(x), \quad \forall\, \alpha \in (0,\infty),\ x \in X. \]
In particular, the previous inequality holds if, for some state $z \in X$, the following conditions are satisfied:
• for some constant $M \ge 0$, $|c_0(x,a)| \le M w'(x)$ for each $x \in X$ and $a \in A$, and
• there exists a $[0,\infty)$-valued measurable function $V \in B_{w'}(X)$ such that
\[ w'(x) + \int_{X\setminus\{z\}} V(y)\,q(dy|x,a) \le 0, \quad \forall\, a \in A,\ x \in X\setminus\{z\}. \]
Proof Inspecting the proof of Lemma 5.1.5 reveals that it suffices that the functions defined by
\[ \sup_{S\in S_\pi} E_x^S\left[\int_{(0,\tau_z]}\int_A |c_0(X(t),a)|\,\Pi(da|t)\,dt\right], \qquad \sup_{S\in S_\pi} E_x^S[\tau_z] \]
are both $w'$-bounded. Let $\alpha > 0$ be fixed. According to Theorem 4.2.3 and the discussion after it, by Proposition C.2.5, we see that
\[ x \in X \to \sup_{S\in S_\pi} E_x^S\left[\int_{(0,\tau_z]} e^{-\alpha t}\int_A |c_0(X(t),a)|\,\Pi(da|t)\,dt\right] \]
is given by the minimal nonnegative upper semianalytic solution to
\[ \frac{|c_0(x,a)|}{\alpha + q_x(a)} + \int_{X\setminus\{z\}} \frac{\tilde q(dy|x,a)}{\alpha + q_x(a)}\,W(y) \le W(x), \quad \forall\, a \in A,\ x \in X\setminus\{z\}; \qquad W(z) \ge 0. \]
Consider a $[0,\infty)$-valued upper semianalytic function $V$ on $X$. Then it is a solution to the above inequality if and only if
\[ |c_0(x,a)| + \int_{X\setminus\{z\}} V(y)\,q(dy|x,a) \le \alpha V(x), \quad \forall\, a \in A,\ x \in X\setminus\{z\}. \]
Obviously, the function $V_1 \ge 0$ satisfies the above inequality for each $\alpha > 0$, so that
\[ E_x^S\left[\int_{(0,\tau_z]} e^{-\alpha t}\int_A |c_0(X(t),a)|\,\Pi(da|t)\,dt\right] \le V_1(x), \quad \forall\, x \in X,\ S \in S_\pi,\ \alpha > 0, \]
and by the Monotone Convergence Theorem, after passing to the limit as $\alpha \downarrow 0$, we have
\[ \sup_{S\in S_\pi} E_x^S\left[\int_{(0,\tau_z]}\int_A |c_0(X(t),a)|\,\Pi(da|t)\,dt\right] \le V_1(x), \quad \forall\, x \in X. \]
Similarly, $\sup_{S\in S_\pi} E_x^S[\tau_z] \le V_2(x)$ for each $x \in X$. Since $V_1, V_2 \in B_{w'}(X)$, the statement is proved.
Theorem 5.1.6 and Lemma 5.1.6 provide verifiable conditions for the solvability of the unconstrained average CTMDP problem $W_0(S,x) \to \inf_{S\in S_\pi}$, see also Remark 5.1.2.
5.1.2.1 Comparable Results for Submodels
The statements here admit corresponding versions when Condition 3.1.2 is replaced by Condition 3.1.3, as well as when one is restricted to $\hat S_\pi$ instead of $S_\pi$. Recall that $\hat S_\pi$ is the class of relaxed strategies $\pi$ such that $\Pi(A(X(t-))|\omega,t) = 1$ for $P_\gamma^\pi(d\omega)\times dt$-almost all $(\omega,t) \in \Omega\times\mathbb{R}_+$, where $A(x) := \{a : (x,a) \in K\} \ne \emptyset$ for all $x \in X$ and $K \subseteq X\times A$ is a fixed subset. Here we present the relevant statement, which is to be used in the next section.
Theorem 5.1.7 Suppose Conditions 2.4.3 and 3.1.3 are satisfied (by the same function w and ρ, ρ < 0). Consider the submodel with $K$ being measurable and $A(x) \subseteq A$ being nonempty compact for each $x \in X$. Furthermore, assume that, for some fixed $z \in X$ and sequence $\alpha_n \downarrow 0$, it holds that
\[ |h^*_{\alpha_n}(x)| \le \kappa\, w'(x), \quad \forall\, x \in X, \tag{5.22} \]
where $\kappa \ge 0$ is a constant and
\[ h^*_\alpha(x) := \inf_{S\in\hat S_\pi} W_0^{\alpha}(S,x) - \inf_{S\in\hat S_\pi} W_0^{*\alpha}(S,z). \]
Let $F$ be a $[1,\infty)$-valued measurable function on $X$ such that $F(x) \ge \bar q_x$ for each $x \in X$. Then the following assertions hold.
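(a) There exist a constant $g$, a measurable function $\underline{h} \in B_{w'}(X)$ and a deterministic stationary strategy $\varphi^* \in \hat S_\pi$ satisfying
\[ g + F(x)\underline{h}(x) \ge \inf_{a\in A(x)}\left\{c_0(x,a) + F(x)\int_X \underline{h}(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\} = c_0(x,\varphi^*(x)) + F(x)\int_X \underline{h}(y)\left(\frac{q(dy|x,\varphi^*(x))}{F(x)} + \delta_x(dy)\right), \quad \forall\, x \in X. \]
(b) For the constant $g$ in part (a), there exists a measurable function $\overline{h} \in B_{w'}(X)$ such that
\[ g + F(x)\overline{h}(x) \le \inf_{a\in A(x)}\left\{c_0(x,a) + F(x)\int_X \overline{h}(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\}. \]
(c) The deterministic stationary strategy $\varphi^*$ from part (a) is average uniformly optimal for the submodel with value $g$, i.e.,
\[ \inf_{S\in\hat S_\pi} W_0(S,x) = W_0(\varphi^*,x) = g, \quad \forall\, x \in X. \]
In fact, each deterministic stationary strategy satisfying the requirement in part (a) satisfies the above equalities.
Proof (a) Consider $g$ and $\{\alpha_n\}_{n=1}^\infty$ as in the proof of Theorem 5.1.5(a). Now we define the function $\underline{h}$ as
\[ \underline{h}(x) = \liminf_{\alpha_n\downarrow 0} h^*_{\alpha_n}(x), \quad \forall\, x \in X. \]
According to (5.22), the measurable functions $h^*_\alpha$ and $\underline{h}$ are in $B_{w'}(X)$. Moreover,
\[ (F(x) + \alpha_n)h^*_{\alpha_n}(x) + \alpha_n\inf_{S\in\hat S_\pi} W_0^{*\alpha_n}(S,z) = \inf_{a\in A(x)}\left\{c_0(x,a) + F(x)\int_X h^*_{\alpha_n}(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\}, \quad \forall\, x \in X, \]
cf. (5.16). Let $x \in X$ be arbitrarily fixed. According to Condition 3.1.3, for each $n$, there exists $a_n \in A(x)$ attaining the infimum above, so that, since $h^*_{\alpha_n} \ge \inf_{k\ge n} h^*_{\alpha_k}$,
\[ (F(x) + \alpha_n)h^*_{\alpha_n}(x) + \alpha_n\inf_{S\in\hat S_\pi} W_0^{*\alpha_n}(S,z) \ge c_0(x,a_n) + F(x)\int_X \inf_{k\ge n} h^*_{\alpha_k}(y)\left(\frac{q(dy|x,a_n)}{F(x)} + \delta_x(dy)\right). \tag{5.23} \]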
By taking a subsequence if necessary and reasoning similarly to the proofs of Theorems 5.1.1 and 5.1.5 we assume without loss of generality that for some $a^* \in A(x)$,
\[ (F(x) + \alpha_n)h^*_{\alpha_n}(x) + \alpha_n\inf_{S\in\hat S_\pi} W_0^{*\alpha_n}(S,z) \to F(x)\underline{h}(x) + g; \qquad a_n \to a^* \in A(x) \]
as $n \to \infty$, where $g := \lim_{n\to\infty}\alpha_n\inf_{S\in\hat S_\pi} W_0^{*\alpha_n}(S,z)$. Thus, under Condition 3.1.3, for each $\Gamma \in \mathcal{B}(X)$, $\frac{q(\Gamma|x,a_n)}{F(x)} + \delta_x(\Gamma) \to \frac{q(\Gamma|x,a^*)}{F(x)} + \delta_x(\Gamma)$ as $n \to \infty$. Let us apply $\liminf_{n\to\infty}$ on both sides of (5.23):
\begin{align*}
F(x)\underline{h}(x) + g &\ge \liminf_{n\to\infty} c_0(x,a_n) + F(x)\liminf_{n\to\infty}\int_X \inf_{k\ge n} h^*_{\alpha_k}(y)\left(\frac{q(dy|x,a_n)}{F(x)} + \delta_x(dy)\right) \\
&\ge c_0(x,a^*) + F(x)\int_X \underline{h}(y)\left(\frac{q(dy|x,a^*)}{F(x)} + \delta_x(dy)\right) \\
&\ge \inf_{a\in A(x)}\left\{c_0(x,a) + F(x)\int_X \underline{h}(y)\left(\frac{q(dy|x,a)}{F(x)} + \delta_x(dy)\right)\right\},
\end{align*}
where the second to last inequality is by Proposition B.1.15. The existence of the claimed deterministic stationary strategy $\varphi^*$ is according to Proposition B.1.39. The other parts can be proved just as in the proof of Theorem 5.1.5, with $A$ being replaced by $A(x)$.
5.2 Constrained Problems
5.2.1 The Primal Linear Program and Its Solvability
Recall from Proposition B.1.33 that given any probability measure $\eta$ on $\mathcal{B}(X\times A)$, one can disintegrate it with respect to its marginal $\eta(dx\times A)$ to get a stochastic kernel $\pi_\eta(da|x)$, defining a stationary π-strategy denoted as $\pi_\eta$, so that
\[ \eta(dx\times da) = \eta(dx\times A)\,\pi_\eta(da|x). \tag{5.24} \]
Definition 5.2.1 Suppose Condition 2.4.2 holds for a $[1,\infty)$-valued function $w$ and $\rho < 0$. A probability measure $\eta \in P_w(X\times A)$ is called a stable measure if
\[ \int_X\int_A q(\Gamma|x,a)\,\pi_\eta(da|x)\,\eta(dx\times A) = 0 \tag{5.25} \]
for all $\Gamma \in \mathcal{B}(X)$, or equivalently, for each $\Gamma \in \mathcal{B}(X)$ such that
\[ \sup_{x\in\Gamma}\int_A q_x(a)\,\pi_\eta(da|x) < \infty. \]
(We postpone the justification of this equivalence to the remark after this definition.) On this occasion, the underlying stationary π-strategy $\pi_\eta$ is said to be stable, too. We denote by $D^{av}$ the collection of all stable probability measures on $X\times A$, and the set of stable strategies is denoted by $S_{stable}$.
Remark 5.2.1 Suppose Condition 2.4.2 holds for a $[1,\infty)$-valued function $w$ and $\rho < 0$. Consider some $\eta \in P_w(X\times A)$. Suppose (5.25) holds for each $\Gamma \in \mathcal{B}(X)$ such that $\sup_{x\in\Gamma}\int_A q_x(a)\,\pi_\eta(da|x) < \infty$. Let us put $q(dy|x) = \int_A q(dy|x,a)\,\pi_\eta(da|x)$ for brevity in this remark, where $\pi_\eta$ satisfies (5.24). Then by Proposition A.5.3, $\eta(dx\times A)$ satisfies
\[ \int_\Gamma \tilde q(X|x)\,\eta(dx\times A) = \int_X \tilde q(\Gamma|x)\,\eta(dx\times A) \quad \text{for all } \Gamma \in \mathcal{B}(X). \]
Since $\tilde q(X|x)$ is $w$-bounded under the imposed conditions, both sides of the above equality define finite measures, and so (5.25) holds for each $\Gamma \in \mathcal{B}(X)$.
Lemma 5.2.1 Suppose Condition 2.4.2 holds for a $[1,\infty)$-valued function $w$ and $\rho < 0$. Each $\eta \in D^{av}$ satisfies
\[ \int_X w(x)\,\eta(dx\times A) \le -\frac{b}{\rho}. \]
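On a finite state space, the stability equation (5.25) and the bound of Lemma 5.2.1 can be checked directly. The Python sketch below is illustrative only: the queue-like model, the rates `LAM` and `MU`, the function $w(x) = R^x$ and the randomized stationary strategy `policy` are all assumptions chosen for the example. It builds $\eta(\{x\}\times\{a\}) = p(\{x\})\pi(\{a\}|x)$ from the invariant probability $p$ and then evaluates both sides of the lemma.

```python
# Assumed illustrative model: queue on {0, ..., N}; action a selects a service rate.
N = 8
LAM = 1.0            # arrival rate (assumption)
MU = [3.0, 4.0]      # service rate under action 0 or 1 (assumption)
R = 2.0              # w(x) = R**x


def w(x):
    return R ** x


def rate(x, a, y):
    """Off-diagonal transition rate q({y}|x, a) for y != x."""
    if y == x + 1 and x < N:
        return LAM
    if y == x - 1 and x > 0:
        return MU[a]
    return 0.0


def generator(x, a, y):
    """Signed rate q({y}|x, a), including the negative diagonal term."""
    if y != x:
        return rate(x, a, y)
    return -sum(rate(x, a, v) for v in range(N + 1))


def drift(x, a):
    """sum_y w(y) q({y}|x, a)."""
    return sum(generator(x, a, y) * w(y) for y in range(N + 1))


# Drift constants: drift(x, a) <= RHO * w(x) + B for all (x, a).
RHO = max((R - 1.0) * (LAM - m / R) for m in MU)
B = max(drift(x, a) - RHO * w(x) for x in range(N + 1) for a in (0, 1))


def policy(x):
    """Randomized stationary strategy pi(da|x): fast service is likelier for large x."""
    p_fast = x / N
    return [1.0 - p_fast, p_fast]


def stable_measure():
    """Return eta[x][a] = p(x) * pi(a|x), where p is invariant under pi."""
    p = [1.0]
    for x in range(1, N + 1):
        death = sum(policy(x)[a] * MU[a] for a in (0, 1))
        p.append(p[-1] * LAM / death)  # detailed balance of a birth-death chain
    total = sum(p)
    return [[p[x] / total * policy(x)[a] for a in (0, 1)] for x in range(N + 1)]


eta = stable_measure()
# Left-hand side of (5.25) evaluated on the singletons {y}.
balance = [
    sum(eta[x][a] * generator(x, a, y) for x in range(N + 1) for a in (0, 1))
    for y in range(N + 1)
]
w_moment = sum(eta[x][a] * w(x) for x in range(N + 1) for a in (0, 1))
print(max(abs(v) for v in balance), w_moment, -B / RHO)
```

The detailed-balance shortcut works only because the assumed chain is of birth-and-death type; for a general finite generator one would solve the linear system $pQ^\pi = 0$ instead.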
Proof This statement follows from Proposition A.5.3. However, we present its complete proof for the sake of readability. Let $\eta \in D^{av}$ be fixed. Consider the controlled process $X(\cdot)$ under the stable strategy $\pi_\eta$. For brevity, we write $\int_A q(dy|x,a)\,\pi_\eta(da|x)$ as $q(dy|x)$, $\int_A \tilde q(dy|x,a)\,\pi_\eta(da|x)$ as $\tilde q(dy|x)$ and $\int_A q(X\setminus\{x\}|x,a)\,\pi_\eta(da|x)$ as $q_x$. Then $X(\cdot)$ is a homogeneous Markov pure jump process, whose transition function $p_q(s,x,t,dy)$ depends on $s \le t$ only through $t - s$. Thus, we write $p_q(x,t,dy)$ instead of $p_q(s,x,s+t,dy)$. According to (2.2) and (2.3), we see that
\[ p_q(x,t,\Gamma) = \int_{(0,t]} e^{-q_x(t-s)}\int_X \tilde q(dy|x)\,p_q(y,s,\Gamma)\,ds + \delta_x(\Gamma)e^{-q_x t}, \quad \forall\, t \ge 0,\ x \in X,\ \Gamma \in \mathcal{B}(X). \]
Using Definition 3.2.1, upon taking the Laplace transform on both sides of the above equality, we obtain
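\begin{align*}
\frac{1}{\alpha}\eta_x^{\pi_\eta,\alpha}(\Gamma\times A) &= \int_{(0,\infty)} e^{-\alpha t}\int_{(0,t]} e^{-q_x(t-s)}\int_X \tilde q(dy|x)\,p_q(y,s,\Gamma)\,ds\,dt + \int_{(0,\infty)} e^{-\alpha t}\delta_x(\Gamma)e^{-q_x t}\,dt \\
&= \int_{(0,\infty)} e^{q_x s}\int_{(s,\infty)} e^{-t(q_x+\alpha)}\int_X \tilde q(dy|x)\,p_q(y,s,\Gamma)\,dt\,ds + \frac{1}{\alpha + q_x}\delta_x(\Gamma) \\
&= \frac{1}{\alpha + q_x}\left\{\frac{1}{\alpha}\int_X \tilde q(dy|x)\,\eta_y^{\pi_\eta,\alpha}(\Gamma\times A) + \delta_x(\Gamma)\right\},
\end{align*}
and thus
\[ \frac{1}{\alpha}\,q_x\,\eta_x^{\pi_\eta,\alpha}(\Gamma\times A) + \eta_x^{\pi_\eta,\alpha}(\Gamma\times A) = \frac{1}{\alpha}\int_X \tilde q(dy|x)\,\eta_y^{\pi_\eta,\alpha}(\Gamma\times A) + \delta_x(\Gamma), \]
so that
\[ \eta_x^{\pi_\eta,\alpha}(\Gamma\times A) = \frac{1}{\alpha}\int_X q(dy|x)\,\eta_y^{\pi_\eta,\alpha}(\Gamma\times A) + \delta_x(\Gamma) \]
for each $\alpha > 0$ and $\Gamma \in \mathcal{B}(X)$. Integrating both sides of the above equality with respect to $\eta(dx\times A)$ and applying the Fubini Theorem and (5.25), we see that
\[ \int_X \eta(dx\times A)\,\eta_x^{\pi_\eta,\alpha}(\Gamma\times A) = \eta(\Gamma\times A), \]
and thus
\[ \int_{(0,\infty)} \alpha e^{-\alpha t}\left\{\int_X \eta(dy\times A)\,p_q(y,t,\Gamma) - \eta(\Gamma\times A)\right\}dt = 0 \]
for each $\alpha > 0$ and $\Gamma \in \mathcal{B}(X)$. Since the function $\int_X \eta(dy\times A)\,p_q(y,t,\Gamma) - \eta(\Gamma\times A)$ is continuous in $t \in \mathbb{R}_+^0$ (see Sect. 2.1.2), we conclude that
\[ \int_X \eta(dy\times A)\,p_q(y,t,\Gamma) = \eta(\Gamma\times A) \]
for each $\Gamma \in \mathcal{B}(X)$, by Proposition B.1.24. In other words, $\eta(dx\times A)$ is an invariant distribution of the homogeneous Markov pure jump process $X(\cdot)$ (under the strategy $\pi_\eta$). Let us denote $\eta(dx\times A)$ by $\bar\eta(dx)$. Then according to (5.15), for each $t \in \mathbb{R}_+^0$,
\[ \int_{X\times A} \eta(dx\times da)\,w(x) = E_{\bar\eta}^{\pi_\eta}[w(X(t))] \le \int_X \bar\eta(dx)\left\{e^{\rho t}w(x) - \frac{b}{\rho}\left(1 - e^{\rho t}\right)\right\} \le \int_X \bar\eta(dx)\left\{e^{\rho t}w(x) - \frac{b}{\rho}\right\}. \]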
Therefore, after passing to the limit as $t \to \infty$, we obtain the desired inequality.
Theorem 5.2.1 (a) If Condition 2.4.2 holds for a $[1,\infty)$-valued function $w$ and $\rho < 0$, and the function $w$ is continuous, then the space $D^{av}$ is $w$-weakly closed in $(P_w(X\times A), \tau(P_w(X\times A)))$.
(b) Suppose Condition 2.4.2 holds for a $[1,\infty)$-valued continuous function $w$ and $\rho < 0$, Condition 3.2.2 is satisfied with $\rho' < 0$, and the function $w'$ is continuous, the space $A$ is compact, and the ratio $\frac{w}{w'}$ is a strictly unbounded function on $X$. Then the space $D^{av}$ is convex and compact in $(P_{w'}(X\times A), \tau(P_{w'}(X\times A)))$.
Proof The proof is similar to that of Theorem 3.2.2. Assume without loss of generality that $D^{av}$ is nonempty.
(a) Let $\{\eta_n\}$ be a sequence in $D^{av}$ such that $\eta_n \xrightarrow{w} \eta$, where $\eta \in P_w(X\times A)$. Consider the finite signed measure defined by $\int_{X\times A} q(dy|x,a)\,\eta(dx\times da)$. For each bounded continuous function $g$ on $X$, it holds that
\begin{align*}
\int_X g(y)\int_{X\times A} q(dy|x,a)\,\eta(dx\times da) &= \int_{X\times A}\left\{\int_X g(y)\,q(dy|x,a)\right\}\eta(dx\times da) \\
&= \lim_{n\to\infty}\int_{X\times A}\left\{\int_X g(y)\,q(dy|x,a)\right\}\eta_n(dx\times da) \\
&= \lim_{n\to\infty}\int_X g(y)\left(\int_{X\times A} q(dy|x,a)\,\eta_n(dx\times da)\right) = 0,
\end{align*}
where the second equality follows from the fact that $\int_X g(y)\,q(dy|x,a)$ is continuous and $w$-bounded on $X\times A$, and the last equality is by (5.25). This, by Proposition B.1.23, implies $\int_{X\times A} q(dy|x,a)\,\eta(dx\times da) = 0$, and thus (5.25) is satisfied by the measure $\eta$. Consequently, $\eta \in D^{av}$, and $D^{av}$ is closed in $(P_w(X\times A), \tau(P_w(X\times A)))$.
(b) Since the ratio $\frac{w}{w'}$ is strictly unbounded and both functions $w$ and $w'$ are continuous, there exists a compact set $\Gamma \subseteq X$ such that
\[ \sup_{x\in X\setminus\Gamma}\frac{w'(x)}{w(x)} = \frac{1}{\inf_{x\in X\setminus\Gamma}\frac{w(x)}{w'(x)}} < \infty; \qquad \sup_{x\in\Gamma}\frac{w'(x)}{w(x)} < \infty, \]
so that the function $w'$ belongs to $B_w(X)$. Hence, the space of stable measures $D^{av}$ is a subset of $P_{w'}(X\times A)$. The reasoning in part (a) applies, and one can see that $D^{av}$ is closed in $(P_{w'}(X\times A), \tau(P_{w'}(X\times A)))$. Next we prove the relative compactness of $D^{av}$ in $(P_{w'}(X\times A), \tau(P_{w'}(X\times A)))$. This is equivalent to showing that the space $\tilde D^{av} := Q_{w'}(D^{av})$ is relatively compact in $(P(X\times A), \tau_{weak})$. Here, as in the proof of Theorem 3.2.2, the mapping $P \to Q_{w'}(P)$ is defined by
\[ \tilde P(\Gamma_X\times\Gamma_A) := \frac{\int_{\Gamma_X\times\Gamma_A} w'(x)\,P(dx\times da)}{\int_{X\times A} w'(x)\,P(dx\times da)}. \]
For each $\tilde\eta = Q_{w'}(\eta) \in \tilde D^{av}$, where $\eta \in D^{av}$, it holds that
\[ \int_{X\times A}\frac{w(x)}{w'(x)}\,\tilde\eta(dx\times da) = \frac{\int_{X\times A} w(x)\,\eta(dx\times da)}{\int_{X\times A} w'(x)\,\eta(dx\times da)} \le \int_{X\times A} w(x)\,\eta(dx\times da) \le -\frac{b}{\rho}, \]
where the first equality is by the definition of the mapping $Q_{w'}$, the first inequality follows from $w' \ge 1$, and the second inequality follows from $\eta \in D^{av}$ and Lemma 5.2.1. The ratio $\frac{w}{w'}$, as a function on $X\times A$, is strictly unbounded because $A$ is compact. Hence, by Proposition B.1.30, the space $\tilde D^{av}$ is tight and relatively compact in the weak topology. Consequently, $D^{av}$ is compact in $(P_{w'}(X\times A), \tau(P_{w'}(X\times A)))$. Finally, the convexity of $D^{av}$ is evident, following from the definition of stable measures.
Condition 5.2.1 Condition 3.2.1 is satisfied with $\rho < 0$. For each stable strategy $\pi_\eta$ corresponding to a stable measure $\eta \in D^{av}$, it holds that
\[ W_j(\pi_\eta,\gamma) = \lim_{t\to\infty}\frac{1}{t}E_\gamma^{\pi_\eta}\left[\int_{(0,t]}\int_A c_j(X(s),a)\,\pi_\eta(da|X(s))\,ds\right] = \int_{X\times A} c_j(y,a)\,\eta(dy\times da) \tag{5.26} \]
for each $j = 0, 1, \dots, J$.
Theorem 5.2.2 Suppose that Conditions 3.2.2 and 5.2.1 are satisfied with $\rho, \rho' < 0$, the functions $w, w'$ are continuous, the space $A$ is compact, and the ratio $\frac{w}{w'}$ is a strictly unbounded function on $X$. Furthermore, assume that for each $j = 0, 1, \dots, J$, the function $c_j$ is lower semicontinuous on $X\times A$ and satisfies the inequality $c_j(x,a) \ge -M w'(x)$ for all $(x,a) \in X\times A$, where $M \in \mathbb{R}_+^0$ is a constant. Then for each π-strategy $S$, there exists a stable measure $\eta \in D^{av}$ with an associated stable strategy $\pi_\eta$ such that
\[ W_j(\pi_\eta,\gamma) \le W_j(S,\gamma), \quad j = 0, 1, \dots, J. \]
Proof Let $S \in S_\pi$ be arbitrarily fixed. Consider the family $\{\eta_k\}_{k=1}^\infty$ of its empirical measures defined by
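\[ \eta_k(dx\times da) := \frac{1}{k}\int_{(0,k]} E_\gamma^S\left[I\{X(s) \in dx\}\,\Pi(da|s)\right]ds \]
for each $k = 1, 2, \dots$. Clearly, $\{\eta_k\}_{k=1}^\infty \subseteq P_w(X\times A)$. Now
\[ \int_{X\times A}\frac{w(x)}{w'(x)}\,Q_{w'}(\eta_k)(dx\times da) = \frac{\int_{X\times A} w(x)\,\eta_k(dx\times da)}{\int_{X\times A} w'(x)\,\eta_k(dx\times da)} \le \int_{X\times A} w(x)\,\eta_k(dx\times da) = \frac{1}{k}\int_{(0,k]} E_\gamma^S[w(X(s))]\,ds \]
for each $k = 1, 2, \dots$ because $w' \ge 1$. Hence,
\[ \lim_{k\to\infty}\int_{X\times A}\frac{w(x)}{w'(x)}\,Q_{w'}(\eta_k)(dx\times da) \le \lim_{k\to\infty}\frac{1}{k}\int_{(0,k]} E_\gamma^S[w(X(s))]\,ds \le -\frac{b}{\rho} \]
(see (5.15)). Then there exists some $\bar k$ such that
\[ \int_{X\times A}\frac{w(x)}{w'(x)}\,Q_{w'}(\eta_k)(dx\times da) \le \int_{X\times A} w(x)\,\eta_k(dx\times da) \le 1 - \frac{b}{\rho}, \quad \forall\, k \ge \bar k. \tag{5.27} \]
Since $\frac{w}{w'}$ is strictly unbounded on $X\times A$, this implies that $\{Q_{w'}(\eta_k)\}_{k=\bar k}^\infty$ is tight and thus, by Prokhorov's Theorem, relatively compact in $(P(X\times A), \tau_{weak})$, see Proposition B.1.30. There exists a subsequence of $\{Q_{w'}(\eta_k)\}_{k=1}^\infty$, still denoted by $\{Q_{w'}(\eta_k)\}_{k=1}^\infty$, and some $\tilde\eta \in P(X\times A)$ such that $Q_{w'}(\eta_k) \to \tilde\eta$ in the standard weak topology, and so $\eta_k \xrightarrow{w'} \eta := Q_{w'}^{-1}(\tilde\eta) \in P_{w'}(X\times A)$.
We verify that $\eta$ is the desired stable measure as follows. By (5.27), for each $m \ge 1$ and $k \ge \bar k$,
\[ \int_{X\times A}\min\{w(x), m\}\,\eta_k(dx\times da) \le 1 - \frac{b}{\rho}, \]
and hence
\[ \int_{X\times A}\min\{w(x), m\}\,\eta(dx\times da) \le 1 - \frac{b}{\rho}. \]
Since the above inequality holds for all $m \ge 1$, by the Monotone Convergence Theorem,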
\[ \int_{X\times A} w(x)\,\eta(dx\times da) \le 1 - \frac{b}{\rho}, \]
that is, $\eta$ belongs to $P_w(X\times A)$. By the Kolmogorov Forward Equation, see (2.93), for each bounded continuous function $g$ on $X$, it holds that
\begin{align*}
\int_X g(y)\int_{X\times A} q(dy|x,a)\,\eta(dx\times da) &= \int_{X\times A}\left\{\int_X g(y)\,q(dy|x,a)\right\}\eta(dx\times da) \\
&= \lim_{k\to\infty}\int_{X\times A}\left\{\int_X g(y)\,q(dy|x,a)\right\}\eta_k(dx\times da) \\
&= \lim_{k\to\infty}\frac{1}{k}\left\{E_\gamma^S[g(X(k))] - \int_X g(x)\,\gamma(dx)\right\} = 0,
\end{align*}
and therefore, by Proposition B.1.23, $\eta$ satisfies (5.25). Now we see that $\eta \in D^{av}$. Finally, as in the proof of Theorem 3.2.3, for each $j = 0, 1, \dots, J$, the mapping $\eta \to \int_{X\times A} c_j(x,a)\,\eta(dx\times da)$ from $D^{av}$ to $\mathbb{R}$ is lower semicontinuous in the $w'$-weak topology. It follows that for each $j = 0, 1, \dots, J$,
\[ W_j(S,\gamma) \ge \lim_{k\to\infty}\int_{X\times A} c_j(x,a)\,\eta_k(dx\times da) \ge \int_{X\times A} c_j(x,a)\,\eta(dx\times da) = W_j(\pi_\eta,\gamma), \]
where the last equality is by (5.26). The proof of this statement is completed.
Theorem 5.2.2 implies that under the conditions therein, it suffices to be restricted to the class of stable strategies, and problem (5.1) can be reformulated as
\begin{align*}
&\text{Minimize over } \eta \in P(X\times A):\ \int_{X\times A} c_0(x,a)\,\eta(dx,da) \\
&\text{subject to } \int_X w(x)\,\eta(dx\times A) < \infty, \\
&\quad \int_X\int_A q(\Gamma|x,a)\,\pi_\eta(da|x)\,\eta(dx\times A) = 0 \text{ for all } \Gamma \in \mathcal{B}(X) \text{ satisfying } \sup_{x\in\Gamma}\int_A q_x(a)\,\pi_\eta(da|x) < \infty, \\
&\quad \int_{X\times A} c_j(x,a)\,\eta(dx,da) \le d_j, \quad j = 1, 2, \dots, J. \tag{5.28}
\end{align*}
Corollary 5.2.1 Suppose Conditions 3.2.2 and 5.2.1 are satisfied with $\rho, \rho' < 0$, functions $w$ and $w'$ are continuous, the space $A$ is compact, and the ratio $\frac{w}{w'}$ is a strictly unbounded function on $X$. Assume also that, for each $j = 0, 1, \dots, J$, the
function $c_j$ is lower semicontinuous on $X\times A$ and satisfies the inequality $c_j(x,a) \ge -M w'(x)$ for all $(x,a) \in X\times A$, where $M \in \mathbb{R}_+^0$ is a constant. Finally, assume that there is at least one feasible strategy for problem (5.1). Then there exists a stable optimal strategy solving problem (5.1).
Proof As mentioned in the proof of Theorem 5.2.2, $\int_{X\times A} c_j(x,a)\,\eta(dx,da)$ is lower semicontinuous in $\eta \in D^{av}$. By Theorem 5.2.1(b), we see that there exists an optimal solution $\eta^*$ to problem (5.28). By (5.26), we see that the underlying stable strategy $\pi_{\eta^*}$ is optimal for problem (5.1).
The next example demonstrates that problem (5.28) without the requirement $\int_X w(x)\,\eta(dx\times A) < \infty$ and Condition 3.2.1 in general does not have a connection to problem (5.1).
Example 5.2.1 Let $X = \{0, \pm 1, \pm 2, \dots\}$, $A = \{a_1, a_2\}$. Let $0 < \mu < \lambda < 2\mu$ be fixed constants such that $\lambda + \mu = 1$. Consider the transition rate given by
\[ q_0(a_1) = q(\{1\}|0,a_1) = \lambda = q_0(a_2) = q(\{-1\}|0,a_2); \]
\[ q_x(a) = q(\{x-1\}|x,a) = 1, \quad \forall\, x \in \{-1,-2,\dots\},\ a \in A; \]
\[ q_x(a) = 2^x, \quad q(\{x+1\}|x,a) = \lambda 2^x, \quad q(\{x-1\}|x,a) = \mu 2^x, \quad \forall\, x \in \{1,2,\dots\},\ a \in A. \]
We introduce the notation $\pi^s(\{a_1\}|0) = \varepsilon \in [0,1]$, $\pi^s(\{a_2\}|0) = 1 - \varepsilon$. Note that the process is essentially controlled only at the state 0, and so one can consider the equivalent classes of π-strategies $\pi^s(da|x)$, each of which is fully specified by the constant $\varepsilon \in [0,1]$. Let us fix a single cost rate given by
\[ c_0(x,a) = 0, \quad \forall\, x \in \{0,1,\dots\},\ a \in A; \qquad c_0(x,a) = -1, \quad \forall\, x \in \{-1,-2,\dots\},\ a \in A, \]
and consider problem (5.1) with $J = 0$. Suppose $\gamma(\{0\}) = 1$.
Proposition 5.2.1 Consider the CTMDP described in Example 5.2.1. Then the following assertions hold.
(a) There exists a unique probability measure $\eta^*$ on $\mathcal{B}(X\times A)$ and stationary π-strategy $\pi_{\eta^*}$ satisfying (5.24), and (5.25) for each $\Gamma \in \mathcal{B}(X)$ such that
\[ \sup_{x\in\Gamma}\int_A q_x(a)\,\pi_\eta(da|x) < \infty. \tag{5.29} \]
However, the stationary π-strategy $\pi_{\eta^*}$ is not optimal for problem (5.1) with $J = 0$, for which the deterministic stationary strategy $\varphi^*(x) \equiv a_2$ is optimal.
(b) Condition 3.2.1 is not satisfied, and (5.26) holds for $\eta^*$ and $\pi_{\eta^*}$ in part (a).
Proof (a) When $\varepsilon = 1$, the stationary π-strategy $\pi^s$ becomes deterministic, and there is a unique invariant probability $p$ satisfying
\[ \int_X\int_A q(\{y\}|x,a)\,\pi^s(da|x)\,p(dx) = 0, \quad \forall\, y \in X, \tag{5.30} \]
being
\[ p(\{x\}) = 0, \quad \forall\, x = -1,-2,\dots; \qquad p(\{x\}) = \left(1 - \frac{\lambda}{2\mu}\right)\left(\frac{\lambda}{2\mu}\right)^x, \quad \forall\, x = 0,1,2,\dots. \]
Let $\eta^*(\{x\}\times\{a\}) = p(\{x\})\,\pi^s(\{a\}|x)$, and $\pi_{\eta^*}(\{a\}|x) = \pi^s(\{a\}|x)$ for each $x \in X$. Then we see that (5.25) is satisfied by $\eta^*$ and $\pi_{\eta^*}$ for each $\Gamma \in \mathcal{B}(X)$ such that (5.29) holds. When $\varepsilon \in [0,1)$, one can see that there is no probability $p$ satisfying (5.30), and hence there is no probability $\eta \in P(X\times A)$ and stationary π-strategy $\pi_\eta$ satisfying (5.25) for each $\Gamma \in \mathcal{B}(X)$ such that (5.29) holds. It is obvious that any deterministic stationary strategy $\varphi^*$ satisfying $\varphi^*(0) = a_2$ is optimal, and strictly outperforms $\pi_{\eta^*}$: $W(\varphi^*,0) = -1 < W(\pi_{\eta^*},0) = 0$. Thus, part (a) is proved.
(b) Note that $x \in X \to E_x^{\pi_{\eta^*}}[T_\infty]$ is the minimal nonnegative measurable solution to
\[ v(x) = \infty, \quad \forall\, x = -1,-2,\dots, \]
\[ v(0) = \frac{1}{\lambda} + v(1), \]
\[ v(x) = \frac{1}{2^x} + \lambda v(x+1) + \mu v(x-1), \quad \forall\, x = 1,2,\dots, \]
which is given by
\[ v(x) = E_x^{\pi_{\eta^*}}[T_\infty] = \infty, \quad \forall\, x = -1,-2,\dots, \]
\[ v(x) = E_x^{\pi_{\eta^*}}[T_\infty] = \sum_{i=x}^\infty \frac{\mu^i}{\lambda^{i+1}}\sum_{k=0}^i\left(\frac{\lambda}{2\mu}\right)^k, \quad \forall\, x = 0,1,\dots. \tag{5.31} \]
(See Chaps. 1 and 3.) Thus, the process under the strategy $\pi_{\eta^*}$ is explosive. Therefore, Condition 3.2.1 is not satisfied. Indeed, Condition 3.2.1 implies Condition 2.4.2 which, in its turn, implies Condition 2.2.5. Now see Theorem 2.2.5.
Furthermore, equality (5.26) holds for $\eta^*$ and $\pi_{\eta^*}$ in part (a) because
\[ W_0(\pi_{\eta^*},\gamma) = 0 = \int_{X\times A} c_0(y,a)\,\eta^*(dy\times da). \]
The proof is complete.
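The invariant probability used in the proof of part (a) can be verified numerically. The Python sketch below is only an illustration: the values of `LAM` and `MU` are assumptions satisfying $0 < \mu < \lambda < 2\mu$ and $\lambda + \mu = 1$, and the check is restricted to the states $0, 1, \dots, K$. On the negative states the balance equation (5.30) holds trivially when $\varepsilon = 1$, because $p$ vanishes there and the state 0 never jumps to $-1$.

```python
# Rates of Example 5.2.1 under the strategy with epsilon = 1 (action a1 at state 0).
LAM, MU = 0.6, 0.4   # assumed values with 0 < MU < LAM < 2*MU and LAM + MU = 1
RATIO = LAM / (2.0 * MU)


def p(x):
    """Candidate invariant probability on {0, 1, 2, ...}; zero on negative states."""
    return (1.0 - RATIO) * RATIO ** x if x >= 0 else 0.0


def up(x):
    """Rate of the jump x -> x + 1 for x >= 0."""
    return LAM if x == 0 else LAM * 2.0 ** x


def down(x):
    """Rate of the jump x -> x - 1 for x >= 1 (state 0 never jumps down here)."""
    return MU * 2.0 ** x if x >= 1 else 0.0


def balance(y):
    """Net probability flow into state y >= 0, i.e. the left-hand side of (5.30)."""
    inflow = p(y + 1) * down(y + 1)
    if y >= 1:
        inflow += p(y - 1) * up(y - 1)
    outflow = p(y) * (up(y) + down(y))
    return inflow - outflow


K = 30
# Relative residuals are used because the flows p(y) * 2**y grow geometrically in y.
residuals = [abs(balance(y)) / (p(y) * (up(y) + down(y))) for y in range(K + 1)]
mass = sum(p(x) for x in range(K + 1))
print(max(residuals), mass)
```

The geometric growth of the flows $p(\{y\})q_y$ seen here reflects the unbounded rates $2^x$, which is what makes the process explosive in part (b).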
5.2.2 Duality
Throughout this subsection, we assume that Conditions 3.2.1 and 3.2.2 are satisfied by the continuous functions $w$ and $w'$ and constants $\rho, \rho' < 0$. Moreover, suppose (3.32) holds for some constants M and L. Recall that $M_w^F(X\times A)$ is the space of all finite signed measures on $(X\times A, \mathcal{B}(X\times A))$ with a finite $w$-norm. (See Definition B.1.8.) Consider the dual pair $(\mathcal{X}, \mathcal{Y})$ as given by (3.33) with the bilinear form given by (3.34). We fix the positive cone $C_o$ in $\mathcal{X}$ and its dual cone $C_o^*$ in $\mathcal{Y}$ as defined by (3.35) and (3.36). Let
\[ \mathcal{Z} := M_w^F(X)\times\mathbb{R}^{J+1} = \{Z = (z_0, h_0, h_1, \dots, h_J) : z_0 \in M_w^F(X),\ h_j \in \mathbb{R},\ j = 0, 1, \dots, J\}, \]
\[ \mathcal{V} := B_{w'}(X)\times\mathbb{R}^{J+1} = \{V = (v, g_0, g_1, \dots, g_J) : v \in B_{w'}(X),\ g_j \in \mathbb{R},\ j = 0, 1, \dots, J\}. \]
The bilinear form on $\mathcal{Z}\times\mathcal{V}$ is defined as
\[ \langle Z, V\rangle := \int_X v(x)\,z_0(dx) + \sum_{j=0}^J h_j g_j, \quad \forall\, Z \in \mathcal{Z},\ V \in \mathcal{V}. \]
Consider the mapping $U$ from $\mathcal{X}$ to $\mathcal{Z}$ defined by
\[ U\circ X = Z = (z_0, h_0, h_1, \dots, h_J), \quad \forall\, X = (\eta, \beta_1, \beta_2, \dots, \beta_J) \in \mathcal{X} \]
with
\[ z_0(\Gamma_X) := \int_{X\times A} q(\Gamma_X|y,a)\,\eta(dy\times da), \qquad h_0 := \eta(X\times A), \qquad h_j := \int_{X\times A} c_j(x,a)\,\eta(dx\times da) + \beta_j, \quad j = 1, 2, \dots, J. \]
As in Sect. 3.2.4, one can see that the mapping $U$ is weakly continuous and linear. Its adjoint mapping $U^* : \mathcal{V} \to \mathcal{Y}$ is defined by
\[ U^*\circ V = Y = (f, e_1, e_2, \dots, e_J), \quad \forall\, V = (v, g_0, g_1, \dots, g_J) \in \mathcal{V} \]
with
\[ f(x,a) := \int_X v(y)\,q(dy|x,a) + \sum_{j=1}^J g_j c_j(x,a) + g_0, \qquad e_j := g_j, \quad j = 1, 2, \dots, J. \]
Indeed, by Fubini's Theorem,
\[ \langle U\circ X, V\rangle = \langle X, U^*\circ V\rangle = \int_{X\times A}\int_X v(x)\,q(dx|y,a)\,\eta(dy\times da) + \sum_{j=1}^J g_j\int_{X\times A} c_j(x,a)\,\eta(dx\times da) + \sum_{j=1}^J g_j\beta_j + g_0\,\eta(X\times A), \quad \forall\, X \in \mathcal{X},\ V \in \mathcal{V}. \]
Below in this subsection, $C := (c_0, 0, 0, \dots, 0) \in \mathcal{Y}$, $B := (0, 1, d_1, d_2, \dots, d_J) \in \mathcal{Z}$. Until the end of this subsection, we put
\[ c^{\bar g}(x,a) := \sum_{j=0}^J g_j c_j(x,a), \quad \forall\, \bar g \in \mathbb{R}^{J+1},\ \forall\, x \in X,\ a \in A. \tag{5.32} \]
Now problem (5.28) can be written as the following Primal Linear Program:
\[ \text{Minimize over } X \in C_o:\ \langle X, C\rangle \quad \text{subject to } U\circ X = B \tag{5.33} \]
with its value being denoted by $\inf(P^{av})$. Its Dual Linear Program reads
\[ \text{Maximize over } V \in \mathcal{V}:\ \langle B, V\rangle \quad \text{subject to } C - U^*\circ V \in C_o^* \]
and can be explicitly written as follows:
\begin{align*}
&\text{Maximize over } (v, g_0, g_1, \dots, g_J) \in \mathcal{V}:\ g_0 + \sum_{j=1}^J g_j d_j \\
&\text{subject to } c_0(x,a) - \int_X v(y)\,q(dy|x,a) - \sum_{j=1}^J g_j c_j(x,a) - g_0 \ge 0, \quad \forall\, (x,a) \in X\times A; \\
&\quad -g_j \ge 0, \quad j = 1, 2, \dots, J. \tag{5.34}
\end{align*}
Its value is denoted by $\sup(D^{av})$. Below in this subsection, we assume that problem (5.28), or, equivalently, the Primal Linear Program (5.33) has at least one feasible solution. Now, the Primal Linear Program (5.33) can be written as the following Primal Convex Program (cf. (B.17)):
\[ \text{Minimize over } \eta \in D^{av}:\ \sup_{\bar g\in\mathbb{R}_+^J} L^{av}(\eta,\bar g), \tag{5.35} \]
where the Lagrangian has the form
\[ L^{av}(\eta,\bar g) = \int_{X\times A} c_0(x,a)\,\eta(dx\times da) + \sum_{j=1}^J g_j\left(\int_{X\times A} c_j(x,a)\,\eta(dx\times da) - d_j\right), \quad \forall\, \eta \in D^{av},\ \bar g \in \mathbb{R}_+^J. \]
The value of problem (5.35) is denoted by $\inf(P_c^{av})$. Its Dual Convex Program reads
\[ \text{Maximize over } \bar g \in \mathbb{R}_+^J:\ \inf_{\eta\in D^{av}} L^{av}(\eta,\bar g). \tag{5.36} \]
Its value is denoted by $\sup(D_c^{av})$. The following condition is the Slater condition for problem (5.1).
Condition 5.2.2 All the inequalities in (5.1) are strict for some $S \in S_\pi$.
Theorem 5.2.3 Suppose Conditions 2.4.3(b), 3.2.2, 5.2.1 and 5.2.2 are satisfied with $\rho, \rho' < 0$ and continuous functions $w$ and $w'$ on $X$, whereas the ratio $\frac{w}{w'}$ is a strictly unbounded function on $X$. Assume also that $A$ is compact, and for each $j = 0, 1, \dots, J$, the function $c_j$ is lower semicontinuous on $X\times A$, and (3.32) holds for some constants. Then the following assertions are valid.
(a) Problems (5.33), (5.35) and (5.36) are all solvable with the common finite value.
(b) Let $\bar g^* = (g_1^*, g_2^*, \dots, g_J^*) \in \mathbb{R}_+^J$ be an optimal solution to the Dual Convex Program (5.36). Then a point $\eta^* \in D^{av}$ is optimal for problem (5.28) if and only if one of the following two equivalent statements holds:
(i) The pair $(\eta^*, \bar g^*)$ is a saddle point of the Lagrangian:
\[ L^{av}(\eta^*,\bar g) \le L^{av}(\eta^*,\bar g^*) \le L^{av}(\eta,\bar g^*), \quad \forall\, \eta \in D^{av},\ \bar g \in \mathbb{R}_+^J. \]
(ii) $\eta^*$ is feasible for problem (5.28), and the following equalities hold:
\[ \inf_{\eta\in D^{av}} L^{av}(\eta,\bar g^*) = L^{av}(\eta^*,\bar g^*) = \int_{X\times A} c_0(x,a)\,\eta^*(dx\times da); \]
\[ \sum_{j=1}^J g_j^*\left(\int_{X\times A} c_j(x,a)\,\eta^*(dx\times da) - d_j\right) = 0. \]
(The last equality is known as the Complementary Slackness Condition.)
(c) Suppose in addition that Condition 3.1.2 is satisfied. Let $\bar g^* = (g_1^*, g_2^*, \dots, g_J^*) \in \mathbb{R}_+^J$ be an optimal solution to the Dual Convex Program (5.36). Suppose that for some fixed $z \in X$, and sequence $\alpha_n \downarrow 0$, it holds that
\[ \left|\inf_{S\in S_\pi} E_x^S\left[\int_{(0,\infty)} e^{-\alpha_n t}\int_A c^{(1,g_1^*,g_2^*,\dots,g_J^*)}(X(t),a)\,\Pi(da|t)\,dt\right] - \inf_{S\in S_\pi} E_z^S\left[\int_{(0,\infty)} e^{-\alpha_n t}\int_A c^{(1,g_1^*,g_2^*,\dots,g_J^*)}(X(t),a)\,\Pi(da|t)\,dt\right]\right| \le \kappa\, w'(x), \quad \forall\, x \in X, \]
where $\kappa \ge 0$ is a constant. Then problem (5.34) is solvable with the finite value equal to the common value of problems (5.33), (5.35) and (5.36).
Proof (a) This immediately follows from Proposition B.2.6(a) and Corollary 5.2.1.
(b) This part follows from Proposition B.2.6.
(c) There exist a constant $g_0^* \in \mathbb{R}$, a deterministic stationary strategy $\varphi$, and two functions $\underline{h}^*, \overline{h}^* \in B_{w'}(X)$ such that
\begin{align*}
g_0^* &\ge c_0(x,\varphi(x)) + \sum_{j=1}^J g_j^*\left(c_j(x,\varphi(x)) - d_j\right) + \int_X \underline{h}^*(y)\,q(dy|x,\varphi(x)), \quad \forall\, x \in X, \\
g_0^* &\le c_0(x,a) + \sum_{j=1}^J g_j^*\left(c_j(x,a) - d_j\right) + \int_X \overline{h}^*(y)\,q(dy|x,a), \quad \forall\, x \in X,\ a \in A, \tag{5.37}
\end{align*}
see Theorem 5.1.5. If we put $v'^* = -\overline{h}^*$, $g_j'^* = -g_j^* \le 0$ for each $j = 1, 2, \dots, J$, and $g_0'^* = g_0^* - \sum_{j=1}^J g_j'^* d_j$, then
\begin{align*}
0 &\le c_0(x,a) + \sum_{j=1}^J g_j^*\left(c_j(x,a) - d_j\right) + \int_X \overline{h}^*(y)\,q(dy|x,a) - g_0^* \\
&= c_0(x,a) - \int_X v'^*(y)\,q(dy|x,a) - \sum_{j=1}^J g_j'^* c_j(x,a) - g_0'^*, \quad \forall\, x \in X,\ a \in A,
\end{align*}
and thus $(v'^*, g_0'^*, \dots, g_J'^*) \in \mathcal{V}$ is a feasible solution to the Dual Linear Program (5.34), and the objective at this point is equal to $g_0^*$. Hence,
\[ g_0^* \le \sup(D^{av}). \tag{5.38} \]
On the other hand, by inspecting the proof of Theorem 5.1.5(c), we see from (5.37) that
\[ g_0^* \ge \lim_{t\to\infty}\frac{1}{t}E_x^{\varphi}\left[\int_{(0,t]}\int_A c^{(1,g_1^*,g_2^*,\dots,g_J^*)}(X(s),a)\,\Pi(da|s)\,ds\right] - \sum_{j=1}^J g_j^* d_j, \quad \forall\, x \in X. \]
By taking the integral on both sides with respect to $\gamma$, we see that
\begin{align*}
g_0^* &\ge \int_X \gamma(dx)\lim_{t\to\infty}\frac{1}{t}E_x^{\varphi}\left[\int_{(0,t]}\int_A c^{(1,g_1^*,g_2^*,\dots,g_J^*)}(X(s),a)\,\Pi(da|s)\,ds\right] - \sum_{j=1}^J g_j^* d_j \\
&\ge \lim_{t\to\infty}\frac{1}{t}E_\gamma^{\varphi}\left[\int_{(0,t]}\int_A c^{(1,g_1^*,g_2^*,\dots,g_J^*)}(X(s),a)\,\Pi(da|s)\,ds\right] - \sum_{j=1}^J g_j^* d_j,
\end{align*}
where the last inequality is by Fatou's Lemma. Now
\begin{align*}
\inf(P^{av}) = \sup(D_c^{av}) &= \inf_{\eta\in D^{av}} L^{av}(\eta,\bar g^*) \\
&\le \lim_{t\to\infty}\frac{1}{t}E_\gamma^{\varphi}\left[\int_{(0,t]}\int_A c^{(1,g_1^*,g_2^*,\dots,g_J^*)}(X(s),a)\,\Pi(da|s)\,ds\right] - \sum_{j=1}^J g_j^* d_j \le g_0^*,
\end{align*}
where the inequality is by Theorem 5.2.2 applicable to the cost rate
\[ c^{(1,g_1^*,g_2^*,\dots,g_J^*)} - \sum_{j=1}^J g_j^* d_j. \]
It follows from this and (5.38) that $\inf(P^{av}) \le g_0^* \le \sup(D^{av})$. By Proposition B.2.5, $\inf(P^{av}) = \sup(D^{av}) = g_0^*$. Now it is clear that $(v'^*, g_0'^*, \dots, g_J'^*) \in \mathcal{V}$ is an optimal solution to the Dual Linear Program (5.34). The statement now follows from part (a).
5.2.3 The Space of Performance Vectors
Submodels associated with subsets $K \subseteq X\times A$ were introduced at the end of Sect. 3.1.2 and at the end of Sect. 5.1.2.
Condition 5.2.3 For each compact-valued multifunction $A(\cdot)$ from $X$ to $\mathcal{B}(A)$ with a measurable graph $K$, and each $\vec\lambda \in \mathbb{R}_+^{J+1}$, the following holds.
(a) There exist a decreasing sequence $\alpha_n(\vec\lambda) \downarrow 0$, a constant $\kappa_{\vec\lambda} \ge 0$ and a state $z_{\vec\lambda}$, all possibly $K$-dependent, such that
\[ \left|\inf_{S\in\hat S_\pi} W_{c^{\vec\lambda}}^{\alpha_n(\vec\lambda)}(S,x) - \inf_{S\in\hat S_\pi} W_{c^{\vec\lambda}}^{\alpha_n(\vec\lambda)}(S,z_{\vec\lambda})\right| \le \kappa_{\vec\lambda}\, w'(x), \quad x \in X,\ n \ge 0, \]
where $\hat S_\pi$ depends on $K$, and
\[ W_{c^{\vec\lambda}}^{\alpha}(S,x) := \alpha\int_K c^{\vec\lambda}(y,a)\,\eta_x^{S,\alpha}(dy\times da), \quad \forall\, x \in X,\ \alpha > 0,\ S \in S_\pi. \]
The function $c^{\vec\lambda}$ was defined in (5.32).
(b) There is some constant $M_{\vec\lambda} \ge 0$ such that $|\inf_{a\in A} c^{\vec\lambda}(x,a)| \le M_{\vec\lambda}\, w'(x)$ for each $x \in X$.
Condition 5.2.4 Each deterministic stationary strategy is stable.
Lemma 5.2.2 Suppose Conditions 2.4.3(a–d), 3.1.3, 3.2.1 and 5.2.4 are satisfied (by the same function w and ρ, ρ < 0). Assume also that for each j = 0, 1, . . . , J , the
∈ R+J +1 be fixed, and consider function c j is lower semicontinuous on X × A. Let λ the submodel with K being measurable and A(x) ⊆ A being nonempty compact for each x ∈ X. Assume that the assertions in Conditions 5.2.3 are valid for the fixed ∈ R+J +1 . submodel and λ Then there exists a deterministic stationary stable strategy ϕ∗ such that infav
η∈D
cλ (x, a)η(d x × da) =
X×A
∗
cλ (x, a)η ϕ (d x × da).
X×A
Furthermore, if a stable strategy π ∈ Sˆ π satisfies inf η∈Dav X×A cλ (x, a)η(d x × da) = X×A cλ (x, a)η π (d x × da), where η π is a stable measure corresponding to
π, then there exists a measurable subset Xλπ ⊆ X such that
η π (Xλπ × A) = 1, π(B(x)|x) = 1 ∀ x ∈ Xλπ , with λ B(x) := a ∈ A(x) : c (x, a) + h(y)q(dy|x, a) = g , X
where the constant g ∈ R and the function h ∈ Bw (X) come from Theorem 5.1.7 with c0 being replaced by cλ . The multifunction B(·) on Xλπ is compact-valued with a measurable graph. Proof By Theorem 5.1.7, there exist a constant g ∈ R, w -bounded measurable functions h, h on X and deterministic stationary strategy ϕ∗ ∈ Sˆ π , such that for all x ∈ X, cλ (x, ϕ∗ (x)) + h(y)q(dy|x, ϕ∗ (x)) X λ = inf c (x, a) + h(y)q(dy|x, a) ≤ g; a∈A(x) X g ≤ inf cλ (x, a) + h(y)q(dy|x, a) . a∈A(x)
X
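For any stable strategy $\pi \in \hat{S}_\pi$, let $\eta^\pi$ denote a stable measure corresponding to $\pi$. Then
$$\begin{aligned}
g &\le \int_K c^{\vec\lambda}(x,a)\,\eta^\pi(dx\times da) + \int_X \eta^\pi(dx\times A) \int_X h(y) \int_{A(x)} q(dy|x,a)\,\pi(da|x) \\
&= \int_K c^{\vec\lambda}(x,a)\,\eta^\pi(dx\times da) + \int_X h(y) \int_K q(dy|x,a)\,\eta^\pi(dx\times da) \\
&= \int_K c^{\vec\lambda}(x,a)\,\eta^\pi(dx\times da).
\end{aligned} \tag{5.39}$$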
Similarly, $g \ge \int_K c^{\vec\lambda}(x,a)\,\eta^{\varphi^*}(dx\times da)$, which, together with (5.39), implies
$$\inf_{\eta\in\mathcal{D}^{av}} \int_{X\times A} c^{\vec\lambda}(x,a)\,\eta(dx\times da) = \int_{X\times A} c^{\vec\lambda}(x,a)\,\eta^{\varphi^*}(dx\times da) = g.$$
Here, $\eta^{\varphi^*}$ is any stable measure corresponding to $\varphi^*$. The statement to be proved is now obvious. For the last assertion, note that
$$B(x) = \Big\{ a \in A(x) :\ c^{\vec\lambda}(x,a) + \int_X h(y)\,q(dy|x,a) \le g \Big\}$$
and, for fixed $x \in X$, the function $c^{\vec\lambda}(x,a) + \int_X h(y)\,q(dy|x,a)$ is lower semicontinuous in $a \in A(x)$.
It is sometimes convenient to signify explicitly the underlying submodel specified by a multifunction $A(\cdot)$. For example, we write $\hat{S}_\pi(A(\cdot))$ instead of simply $\hat{S}_\pi$. Also we define
$$\mathcal{O}^{av}_{A(\cdot)} := \Big\{ \Big( \int_{X\times A} c_0(x,a)\,\eta^S(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta^S(dx\times da) \Big) :\ S \in \hat{S}_\pi(A(\cdot)) \cap S^{\mathrm{stable}} \Big\},$$
where $\eta^S$ denotes a stable measure generated by the stable strategy $S$.

Condition 5.2.5 For any two stable measures $\eta_1, \eta_2 \in \mathcal{D}^{av}$, $\eta_1(dx \times A)$ and $\eta_2(dx \times A)$ are equivalent, i.e., they are absolutely continuous with respect to each other.

The next statement provides a sufficient condition for verifying Condition 5.2.5.

Lemma 5.2.3 Suppose Condition 2.4.2 holds for a $[1,\infty)$-valued function $w$ and $\rho < 0$. Then, Condition 5.2.5 is satisfied if there exist a nontrivial $\sigma$-finite measure $\nu$ on $X$ and a $(0,\infty)$-valued function $g$ on $X \times A \times X$ such that
$$q(\Gamma|x,a) = \int_\Gamma g(x,a,y)\,\nu(dy), \quad \forall\, \Gamma \in \mathcal{B}(X),\ x \notin \Gamma,\ a \in A.$$
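Proof Let $\eta \in \mathcal{D}^{av}$ be such that $\eta(\Gamma \times A) = 0$ for some $\Gamma \in \mathcal{B}(X)$. Let us fix another stable measure $\eta' \in \mathcal{D}^{av}$. Define $\bar\eta(dx) := \eta(dx \times A)$ and, similarly, $\bar\eta'(dx) := \eta'(dx \times A)$. Let $\pi_\eta$ be a stable strategy corresponding to the stable measure $\eta$, and, for brevity, we write $p_{\pi_\eta}(x,t,dy)$ for the transition probability function of $X(\cdot)$ under the stable strategy $\pi_\eta$, and $q(dy|x,\pi_\eta) = \int_A q(dy|x,a)\,\pi_\eta(da|x)$. Recall that the π-strategy $\pi_\eta$ is stationary. Then, it holds that
$$0 = \bar\eta(\Gamma) = \int_X p_{\pi_\eta}(x,t,\Gamma)\,\bar\eta(dx), \quad \forall\, t \ge 0.$$
Consider an arbitrary sequence $t_n \downarrow 0$. For each $n = 1, 2, \ldots$, there is an $E_n \in \mathcal{B}(X)$ such that $\bar\eta(E_n) = 1$ and $p_{\pi_\eta}(x,t_n,\Gamma) = 0$ for all $x \in E_n$. Take $E := \bigcap_{n=1}^\infty E_n$. Then
$$\bar\eta(E) = 1, \qquad p_{\pi_\eta}(x,t_n,\Gamma) = 0, \quad \forall\, x \in E,\ n \ge 1.$$
Note that $p_{\pi_\eta}(x,t_n,\Gamma) \ge e^{-\bar q_x t_n} > 0$ for all $x \in \Gamma$, $n \ge 1$. It follows that there exists some state $x' \in E \setminus \Gamma$ such that $p_{\pi_\eta}(x',t_n,\Gamma) = 0$ for all $n \ge 1$, and so by Proposition A.5.1, for all $x' \in E \setminus \Gamma \ne \emptyset$,
$$0 = \lim_{n\to\infty} \frac{p_{\pi_\eta}(x',t_n,\Gamma)}{t_n} = q(\Gamma|x',\pi_\eta) = \int_\Gamma \int_A g(x',a,y)\,\pi_\eta(da|x')\,\nu(dy),$$
which implies that $\nu(\Gamma) = 0$ because $\int_A g(x',a,y)\,\pi_\eta(da|x') > 0$ for all $x', y \in X$. So, for a fixed $\eta' \in \mathcal{D}^{av}$,
$$q(\Gamma|x,\pi_{\eta'}) = \int_\Gamma \int_A g(x,a,y)\,\pi_{\eta'}(da|x)\,\nu(dy) = 0, \quad \forall\, x \in X \setminus \Gamma.$$
Since $q(X|x,\pi_{\eta'}) \equiv 0$, we see that
$$0 = \int_X q(\Gamma|x,\pi_{\eta'})\,\bar\eta'(dx) = \int_\Gamma q(\Gamma|x,\pi_{\eta'})\,\bar\eta'(dx) = -\int_\Gamma q(X\setminus\Gamma|x,\pi_{\eta'})\,\bar\eta'(dx).$$
Since $\int_A g(x,a,y)\,\pi_{\eta'}(da|x) > 0$ for all $x, y \in X$ and the measure $\nu$ is nontrivial,
$$q(X\setminus\Gamma|x,\pi_{\eta'}) = \int_{X\setminus\Gamma} \int_A g(x,a,y)\,\pi_{\eta'}(da|x)\,\nu(dy) > 0$$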
for all $x \in \Gamma$, because, as was shown, $\nu(\Gamma) = 0$. It follows that $\bar\eta'(\Gamma) = 0$, and, since $\eta' \in \mathcal{D}^{av}$ was arbitrarily fixed, Condition 5.2.5 is satisfied.

Lemma 5.2.4 Consider the submodel specified by a compact-valued multifunction $A(\cdot)$ with a measurable graph $K$. Fix some $\vec\lambda \ge 0$, whose components are not all zero. Suppose the conditions in Lemma 5.2.2 and Condition 5.2.5 are satisfied. Assume that $u \in \mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1}$ is a fixed point, and consider the hyperplane
$$H = \Big\{ (v_0,\ldots,v_J) \in \mathbb{R}^{J+1} :\ \sum_{i=0}^J \lambda_i v_i = g \Big\},$$
where $g \in \mathbb{R}$ is a constant. If $H$ supports $\mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1}$ at $u$ in the sense that $\sum_{i=0}^J \lambda_i v_i \ge g$ for each $v \in \mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1}$ and $\sum_{i=0}^J \lambda_i u_i = g$, then there is a compact-valued multifunction $A_1(\cdot)$ with a measurable graph such that $A(x) \cap A_1(x) \ne \emptyset$ for each $x \in X$, and
$$\mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1} \cap H = \mathcal{O}^{av}_{A(\cdot)\cap A_1(\cdot)} \cap \mathbb{R}^{J+1}.$$
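Proof Let $u$ be generated by a stable strategy $\pi \in \hat{S}_\pi(A(\cdot))$. Define $A_1(x) = B(x)$ if $x \in X_\pi^{\vec\lambda}$, and $A_1(x) = A$ if $x \in X \setminus X_\pi^{\vec\lambda}$, where the sets $X_\pi^{\vec\lambda}$ and $B(x)$ come from Lemma 5.2.2, which implies the statement of the current lemma. The full details are as follows. It is clear that
$$\mathcal{O}^{av}_{A(\cdot)\cap A_1(\cdot)} \cap \mathbb{R}^{J+1} \subseteq \mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1} \cap H.$$
Indeed, for each stable strategy $\pi'$ in the $A(\cdot)\cap A_1(\cdot)$-model,
$$\int_A c^{\vec\lambda}(x,a)\,\pi'(da|x) + \int_X h(y) \int_A q(dy|x,a)\,\pi'(da|x) = g$$
for each $x \in X_\pi^{\vec\lambda}$. Lemma 5.2.2 and Condition 5.2.5 imply that $\eta^{\pi'}(X_\pi^{\vec\lambda} \times A) = 1$, so that by integrating with respect to $\eta^{\pi'}(dx \times A)$ on both sides of the previous equality, we see that $\int_{X\times A} c^{\vec\lambda}(x,a)\,\eta^{\pi'}(dx\times da) = g$, as required. Next we verify that
$$\mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1} \cap H \subseteq \mathcal{O}^{av}_{A(\cdot)\cap A_1(\cdot)} \cap \mathbb{R}^{J+1}.$$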
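Let
$$\Big( \int_{X\times A} c_0(x,a)\,\eta^{\hat\pi}(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta^{\hat\pi}(dx\times da) \Big) \in \mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1} \cap H$$
be fixed, where $\hat\pi$ is a stable strategy in the $A(\cdot)$-model. Note that the stable measure $\eta^{\hat\pi}$ provides the infimum $\inf_{\eta\in\mathcal{D}^{av}} \int_{X\times A} c^{\vec\lambda}(x,a)\,\eta(dx\times da)$. Consider the measurable subset $X_{\hat\pi}^{\vec\lambda} \subseteq X$ coming from Lemma 5.2.2 with $\pi$ being replaced by $\hat\pi$. Then by Condition 5.2.5, for the stable strategy $\pi$, it holds that
$$\eta^{\pi}\big( (X_{\hat\pi}^{\vec\lambda} \cap X_{\pi}^{\vec\lambda}) \times A \big) = \eta^{\hat\pi}\big( (X_{\hat\pi}^{\vec\lambda} \cap X_{\pi}^{\vec\lambda}) \times A \big) = 1.$$
Define $\pi'(da|x) = \hat\pi(da|x)$ for each $x \in X_{\hat\pi}^{\vec\lambda} \cap X_{\pi}^{\vec\lambda}$ and $\pi'(da|x) = \delta_{\psi(x)}(da)$ for each $x \notin X_{\hat\pi}^{\vec\lambda} \cap X_{\pi}^{\vec\lambda}$, where $\psi$ is a deterministic stationary strategy in the $A(\cdot)\cap A_1(\cdot)$-model. Then
$$\eta^{\hat\pi}(dx\times da) = \eta^{\hat\pi}(dx\times A)\,\hat\pi(da|x) = \eta^{\hat\pi}(dx\times A)\,\pi'(da|x),$$
so that $\eta^{\pi'} = \eta^{\hat\pi}$, and the strategy $\pi'$ is stable. Next,
$$\Big( \int_{X\times A} c_0(x,a)\,\eta^{\hat\pi}(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta^{\hat\pi}(dx\times da) \Big) = \Big( \int_{X\times A} c_0(x,a)\,\eta^{\pi'}(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta^{\pi'}(dx\times da) \Big).$$
Since $\pi'$ is a strategy in the $A(\cdot)\cap A_1(\cdot)$-model, we see that
$$\mathcal{O}^{av}_{A(\cdot)} \cap \mathbb{R}^{J+1} \cap H \subseteq \mathcal{O}^{av}_{A(\cdot)\cap A_1(\cdot)} \cap \mathbb{R}^{J+1},$$
as required.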
Theorem 5.2.2 reduces problem (5.1) to problem (5.28) in the space of stable measures. Similar to Definition 3.2.3, we make the following definition.

Definition 5.2.2 A stable strategy generating the optimal stable measure $\eta^*$ for problem (5.28) is called an SM-mixture of the stable strategies $S_0, \ldots, S_n$ if $\eta^*$ can be expressed as a convex combination of the stable measures generated by them. The abbreviation “SM” here stands for “stable measure”.

Theorem 5.2.4 Suppose the conditions of Theorem 5.2.2 and Conditions 2.4.3(a–d), 3.1.3, 3.2.1, 5.2.3, 5.2.4 and 5.2.5 are satisfied. Furthermore, assume that problem (5.28) has a feasible solution with a finite value. Then there exists a stable optimal strategy for problem (5.1), which is an SM-mixture of $J + 1$ deterministic stationary (stable) strategies.

Proof As mentioned earlier, Theorem 5.2.2 reduces problem (5.1) to problem (5.28) in the space of stable measures, which can be rewritten as

Minimize $W_0$ subject to $\vec W = (W_0, W_1, \ldots, W_J) \in \mathcal{O}^{av}$ and $W_j \le d_j$, $\forall\, j = 1, 2, \ldots, J$,

where
$$\mathcal{O}^{av} := \Big\{ \Big( \int_{X\times A} c_0(x,a)\,\eta(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta(dx\times da) \Big) :\ \eta \in \mathcal{D}^{av} \Big\},$$
cf. problem (3.47). This problem is non-degenerate because so is problem (5.28). Note that $\mathcal{D}^{av}$ is convex and compact in the metrizable space $(\mathcal{P}_w(X\times A), \tau(\mathcal{P}_w(X\times A)))$, and $\eta \in \mathcal{D}^{av} \to \int_{X\times A} c_j(x,a)\,\eta(dx\times da)$ is lower semicontinuous for each $j = 0, 1, \ldots, J$: see Theorem 5.2.1(b), which is applicable under the conditions of the current theorem. Therefore, we are formally in the general framework of Sect. 3.2.5. In particular, the corresponding version of Theorem 3.2.6 holds. Now the statement to be proved follows from Lemmas 5.2.4 and B.2.1.

In greater detail, applying the version of Theorem 3.2.6, we see that for some $\eta = \eta^\pi$ with $\pi$ being a stable optimal strategy for (5.1),
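$$\Big( \int_{X\times A} c_0(x,a)\,\eta(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta(dx\times da) \Big) = \sum_{j=1}^{J+1} \alpha_j \Big( \int_{X\times A} c_0(x,a)\,\eta_j(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta_j(dx\times da) \Big).$$
Here $\sum_{j=1}^{J+1} \alpha_j = 1$, and, for each $j = 1, 2, \ldots, J+1$, $\alpha_j \in [0,1]$, and the vector
$$U_j := \Big( \int_{X\times A} c_0(x,a)\,\eta_j(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta_j(dx\times da) \Big)$$
is a Pareto and extreme point of $\mathcal{O}^{av} \cap \mathbb{R}^{J+1}$. Consider some fixed $j = 1, 2, \ldots, J+1$, and $U_j$. By Lemma B.2.1,
$$\{U_j\} = \mathcal{O}^{av} \cap \mathbb{R}^{J+1} \cap \Big( \bigcap_{i=1}^{k} H_i \Big),$$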
where $k \le J+1$, and for each $j = 1, 2, \ldots, k$, $H_j$ is a supporting hyperplane of $\mathcal{O}^{av} \cap \mathbb{R}^{J+1} \cap \big( \bigcap_{i=1}^{j-1} H_i \big)$ at the point $U_j$. On the other hand, by Lemma 5.2.4,
$$\mathcal{O}^{av} \cap \mathbb{R}^{J+1} \cap \Big( \bigcap_{i=1}^{k} H_i \Big) = \mathcal{O}^{av}_{A(\cdot)\cap\left( \bigcap_{i=1}^{k} A_i(\cdot) \right)},$$
where $A(x) \equiv A$, and $A(\cdot) \cap \big( \bigcap_{i=1}^{k} A_i(\cdot) \big)$ is a compact-valued multifunction with a measurable graph. Therefore, $U_j$ is generated by a deterministic stationary strategy in the $A(\cdot) \cap \big( \bigcap_{i=1}^{k} A_i(\cdot) \big)$-model, and that strategy is stable by Condition 5.2.4. Finally, we obtain deterministic stationary strategies $\varphi_j$, $j = 1, 2, \ldots, J+1$, such that
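$$\begin{aligned}
\Big( \int_{X\times A} c_0(x,a)\,\eta(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta(dx\times da) \Big)
&= \sum_{j=1}^{J+1} \alpha_j \Big( \int_{X\times A} c_0(x,a)\,\eta^{\varphi_j}(dx\times da), \ldots, \int_{X\times A} c_J(x,a)\,\eta^{\varphi_j}(dx\times da) \Big) \\
&= \Big( \int_{X\times A} c_0(x,a) \sum_{j=1}^{J+1} \alpha_j\,\eta^{\varphi_j}(dx\times da), \ldots, \int_{X\times A} c_J(x,a) \sum_{j=1}^{J+1} \alpha_j\,\eta^{\varphi_j}(dx\times da) \Big).
\end{aligned}$$
The stable strategy generating $\sum_{j=1}^{J+1} \alpha_j\,\eta^{\varphi_j}(dx\times da)$ is the required one.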
5.2.4 Denumerable and Finite Models

5.2.4.1 Denumerable Models

If $X$ is denumerable, and the model is unichain in the sense that under each stationary π-strategy, the state space has only one closed communicating class, then the conditions imposed in Theorem 5.2.4 can be further tidied up, provided that all the cost rates $c_j$ are $w'$-bounded, and $w'$ is a strictly unbounded function on $X$.

Corollary 5.2.2 Consider a unichain CTMDP model with a denumerable state space $X$, endowed with the discrete topology. Suppose the following hold.
(a) Condition 2.4.2 is satisfied with a $[1,\infty)$-valued function $w$ on $X$ and a constant $\rho < 0$.
(b) Condition 2.4.3(b, c) is satisfied with a $[1,\infty)$-valued strictly unbounded function $w'$ on $X$ and a constant $\rho' < 0$, such that $w/w'$ is a strictly unbounded function on $X$.
(c) The action space $A$ is compact.
(d) The function $a \in A \to q(y|x,a)$ is continuous for each $x, y \in X$.
(e) All the cost rates $c_j$ are $w'$-bounded and lower semicontinuous in $a \in A$.
(f) The initial distribution $\gamma$ is $w$-summable, i.e.,
$$\sum_{y\in X} w(y)\,\gamma(\{y\}) < \infty.$$
(g) Problem (5.28) has a feasible solution with a finite value.

Then there exists a stable optimal strategy for problem (5.1), which is an SM-mixture of $J + 1$ deterministic stationary (stable) strategies. Moreover, the statement of Theorem 5.2.3 holds, provided that Condition 5.2.2 is also satisfied.

Proof Note that for the statement of Theorem 5.2.4 to hold, if $X$ is denumerable, Condition 5.2.5 is not needed, because then one can take $X_\pi^{\vec\lambda} = X$ for each stable strategy $\pi$ and $\vec\lambda \in \mathbb{R}_+^{J+1}$ in the proof of Lemma 5.2.4 as well as in the statement of Lemma 5.2.2. Therefore, in this proof, we only need to explain why the other conditions in Theorem 5.2.4 are all satisfied here.

Firstly, we verify that Condition 3.1.3 is satisfied. Only the verification of its part (b) is nontrivial, and for this, it suffices to show that
$$a \in A \to \sum_{y\in X} u(y)\,w'(y)\,q(y|x,a)$$
is continuous for each $x \in X$ and for each bounded function $u$ on $X$. Let $x \in X$. Since $w/w'$ is strictly unbounded, there exists a sequence of finite sets $\{X_n\}_{n=0}^\infty$ such that $X_n \uparrow X$ and $\inf_{x \in X\setminus X_n} \frac{w(x)}{w'(x)} \to \infty$, as $n \to \infty$. Let $n$ be such that $x \in X_n$. Then
$$\begin{aligned}
\Big| \sum_{y\in X\setminus X_n} u(y)\,w'(y)\,q(y|x,a) \Big|
&\le \sup_{y\in X} |u(y)| \sup_{z\in X\setminus X_n} \frac{w'(z)}{w(z)} \sum_{y\in X\setminus X_n} w(y)\,q(y|x,a) \\
&\le \sup_{y\in X} |u(y)| \sup_{z\in X\setminus X_n} \frac{w'(z)}{w(z)} \Big( \sum_{y\in X} w(y)\,q(y|x,a) + w(x)\,\bar q_x \Big) \\
&\le \sup_{y\in X} |u(y)| \sup_{z\in X\setminus X_n} \frac{w'(z)}{w(z)} \big[ \rho w(x) + b + \bar q_x w(x) \big] \to 0 \quad \text{as } n \to \infty,
\end{aligned}$$
so that $\sum_{y\in X_n} u(y)\,w'(y)\,q(y|x,a) \to \sum_{y\in X} u(y)\,w'(y)\,q(y|x,a)$ uniformly in $a \in A$, as $n \to \infty$. By Proposition B.1.9(a), we see that the mapping
$$a \in A \to \sum_{y\in X} u(y)\,w'(y)\,q(y|x,a)$$
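is continuous for each fixed $x \in X$. Condition 5.2.4 is satisfied automatically because of Propositions A.6.1 and A.5.3. (In fact, even without referring to these propositions, this condition can still be seen to be satisfied, according to the reasoning in the proof of Theorem 5.2.2.)

Next, observe that for each $x \in X$, $w'$-bounded function $f$ and stationary π-strategy $\varphi$,
$$\lim_{t\to\infty} \frac{1}{t} E_x^\varphi \Big[ \int_{(0,t]} f(X(s))\,ds \Big] = \sum_{y\in X} \mu_\varphi(\{y\})\,f(y),$$
where $\mu_\varphi$ is the invariant distribution existing by Proposition A.6.1. Indeed,
$$\begin{aligned}
\lim_{t\to\infty} \frac{1}{t} E_x^\varphi \Big[ \int_{(0,t]} f(X(s))\,ds \Big] - \sum_{y\in X} \mu_\varphi(\{y\})\,f(y)
&= \lim_{t\to\infty} \frac{1}{t} \int_{(0,t]} \Big( E_x^\varphi[f(X(s))] - \sum_{y\in X} \mu_\varphi(\{y\})\,f(y) \Big) ds \\
&\le \lim_{t\to\infty} \frac{1}{t} \int_{(0,t]} \|f\|_{w'} \sum_{y\in X} \big| p_{q^\varphi}(x,s,\{y\}) - \mu_\varphi(\{y\}) \big|\,w'(y)\,ds = 0,
\end{aligned}$$
and similarly,
$$\begin{aligned}
\sum_{y\in X} \mu_\varphi(\{y\})\,f(y) - \lim_{t\to\infty} \frac{1}{t} E_x^\varphi \Big[ \int_{(0,t]} f(X(s))\,ds \Big]
&= \lim_{t\to\infty} \frac{1}{t} \int_{(0,t]} \Big( \sum_{y\in X} \mu_\varphi(\{y\})\,f(y) - E_x^\varphi[f(X(s))] \Big) ds \\
&\le \lim_{t\to\infty} \frac{1}{t} \int_{(0,t]} \|f\|_{w'} \sum_{y\in X} \big| p_{q^\varphi}(x,s,\{y\}) - \mu_\varphi(\{y\}) \big|\,w'(y)\,ds = 0.
\end{aligned}$$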
Hence, (5.26) holds with $\gamma(dy) = \delta_x(dy)$. The case with a general initial distribution follows from this, if one adds an artificial state $x'$ with $q(y|x',a) = \gamma(\{y\})$ for $y \in X$. Therefore, the conditions of Theorem 5.2.2 are satisfied.

Finally, we note that according to Proposition A.6.1, there exist constants $R > 0$ and $\delta > 0$ such that, for each stationary π-strategy $\varphi$, with the corresponding invariant distribution $\mu_\varphi$, and for each $w'$-bounded function $f$ on $X$,
$$\begin{aligned}
\Big| E_x^\varphi \Big[ \int_{(0,\infty)} e^{-\alpha t} f(X(t))\,dt \Big] - \frac{\sum_{y\in X} \mu_\varphi(\{y\})\,f(y)}{\alpha} \Big|
&\le \|f\|_{w'} \int_{(0,\infty)} e^{-\alpha t} \Big( \sum_{y\in X} w'(y)\,\big| p_{q^\varphi}(x,t,\{y\}) - \mu_\varphi(\{y\}) \big| \Big) dt \\
&\le \frac{\|f\|_{w'} R}{\delta}\,w'(x),
\end{aligned}$$
so that for each fixed $x, y \in X$ and stationary π-strategy $\varphi$,
$$\Big| E_x^\varphi \Big[ \int_{(0,\infty)} e^{-\alpha t} f(X(t))\,dt \Big] - E_y^\varphi \Big[ \int_{(0,\infty)} e^{-\alpha t} f(X(t))\,dt \Big] \Big| \le \frac{\|f\|_{w'} R}{\delta}\,\big( w'(x) + w'(y) \big). \tag{5.40}$$
It follows that Condition 5.2.3 is satisfied. In greater detail: Let $\vec\lambda \in \mathbb{R}_+^{J+1}$ and $y \in X$ be fixed. Let $z^{\vec\lambda} = y$. Under the imposed conditions in this statement, for each fixed $\alpha > 0$, there is a deterministic stationary strategy $\varphi_\alpha$ such that
$$\inf_{S\in\hat{S}_\pi} W^{\alpha}_{c^{\vec\lambda}}(S,x) = W^{\alpha}_{c^{\vec\lambda}}(\varphi_\alpha,x), \quad x \in X.$$
Now applying (5.40) to $\varphi = \varphi_\alpha$ and $f(x) = f_\alpha(x) = c^{\vec\lambda}(x,\varphi_\alpha(x))$, and keeping in mind $\|f_\alpha\|_{w'} \le \|c^{\vec\lambda}\|_{w'}$, we see that the claim holds. The other conditions in Theorem 5.2.4 can be easily verified.
5.2.4.2 Finite Models

If $X$ and $A$ are both finite, then the CTMDP model is called finite. If in a finite CTMDP model, under each deterministic stationary strategy, for the underlying process $X(\cdot)$, the state space $X$ has a single positive recurrent class plus a possibly empty set of transient states, then the model is called unichain. It can be seen that in the definition of a unichain model, it is equivalent if we require that, under each stationary π-strategy, there is a single positive recurrent class plus a possibly empty set of transient states.

In a finite unichain model, all the conditions in Corollary 5.2.2 are satisfied, except part (g) therein. Thus all the conditions in Theorem 5.2.4 hold, except for the feasibility of the CTMDP problem. Below, we present a direct verification of this fact, without referring to Corollary 5.2.2. Indeed, in Conditions 2.4.2, 3.2.2 etc., one can take $w(x) \equiv w'(x) \equiv 1$ and choose arbitrary $\rho, \rho' < 0$ and $b, b'$ satisfying the inequalities $\rho + b \ge 0$, $\rho' + b' \ge 0$. Therefore, the only parts that we need to verify are covered by the following statement.

Proposition 5.2.2 If the CTMDP model is unichain, then the second part of Condition 5.2.1 and the first part of Condition 5.2.3 are satisfied. Consequently, Theorems 5.2.3 and 5.2.4 are applicable, provided that Condition 5.2.2 is also satisfied.

Proof Let a stable strategy $\pi$ be fixed, and consider the underlying continuous-time Markov chain $X(\cdot)$. Then $X$ has a single positive recurrent class, say $C$, and for each $z \in C$ and $x \in X$, it holds that $E_x^\pi[\tau_z^+] < \infty$, where $\tau_z^+ := \inf\{t \ge T_1 : X(t) = z\}$ is the return time to the state $z$, and $T_1$ is the first jump moment. According to Proposition A.5.4, we see that the second part of Condition 5.2.1 is satisfied.

Regarding the first part of Condition 5.2.3, for notational convenience, we confine ourselves to verifying the following fact: there exist some state $z \in X$ and a (finite) constant $L$ such that
$$|W_0^{*\alpha}(x) - W_0^{*\alpha}(z)| \le L, \quad \forall\, x \in X,\ \alpha \in (0,\infty), \tag{5.41}$$
see also Remark C.2.1. This is illuminating enough, as to pass to the submodel, one only needs to replace $A$ with $A(x)$ everywhere below in this proof.

Consider the DTMDP model with the state space $X$, action space $A$, and transition probability given by $p(\{y\}|x,a) := \frac{q(y|x,a)}{\lambda} + I\{y = x\}$ for each $x, y \in X$, where $\lambda > 0$ is a fixed constant such that $\lambda > \max_{x\in X} \bar q_x$. Then this DTMDP is unichain, see Definition C.2.3. For each $\alpha > 0$, let $\beta(\alpha) := \frac{\lambda}{\lambda+\alpha} \in (0,1)$. Consider the $\beta(\alpha)$-discounted DTMDP model with the cost function $l_0(x,a) := c_0(x,a)$ for each $(x,a) \in X \times A$. By Proposition C.2.10, there exist a state $z \in X$ and a constant $L' \in [0,\infty)$ such that
$$|W_0^{DT,\beta(\alpha)*}(x) - W_0^{DT,\beta(\alpha)*}(z)| \le L', \quad \forall\, \alpha \in (0,\infty),\ x \in X.$$
On the other hand, restricted to the class of $\mathbb{R}$-valued solutions, we see that
$$V(x) = \inf_{a\in A} \Big\{ l_0(x,a) + \beta(\alpha) \sum_{y\in X} p(\{y\}|x,a)\,V(y) \Big\}, \quad \forall\, x \in X,$$
$$\Longleftrightarrow \quad \alpha V(x) = \inf_{a\in A} \Big\{ c_0(x,a)(\lambda+\alpha) + \sum_{y\in X} q(y|x,a)\,V(y) \Big\}.$$
According to the standard theory of finite discounted DTMDPs, the first equation is satisfied by $W_0^{DT,\beta(\alpha)*}$, whereas the unique bounded solution to the second equation is given by
$$\inf_{S\in S_\pi} E_x^S \Big[ \int_{(0,\infty)} e^{-\alpha t} \int_A (\alpha+\lambda)\,c_0(X(t),a)\,\Pi(da|t)\,dt \Big]$$
according to Theorem 3.1.2. Therefore, $W_0^{DT,\beta(\alpha)*} = (\alpha+\lambda)\,W_0^{*\alpha}$ for each $\alpha \in (0,\infty)$. Hence,
$$|W_0^{*\alpha}(x) - W_0^{*\alpha}(z)| \le \frac{L'}{\lambda+\alpha} \le \frac{L'}{\lambda} =: L, \quad \forall\, \alpha \in (0,\infty),\ x \in X,$$
as required.
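The uniformization step in the proof above can be checked numerically. The following is a minimal sketch, not part of the original argument: the three-state, two-action model (`q`, `c0`) and the constants `lam`, `alpha` are made-up illustrative data, and the helper names are hypothetical. It builds $p(\{y\}|x,a) = q(y|x,a)/\lambda + I\{y=x\}$, solves the $\beta(\alpha)$-discounted DTMDP by value iteration, divides by $\alpha+\lambda$, and measures how well the result satisfies the continuous-time equation $\alpha W(x) = \inf_a \{c_0(x,a) + \sum_y q(y|x,a) W(y)\}$.

```python
def uniformize(q, lam):
    """Return p({y}|x,a) = q(y|x,a)/lam + I{y = x} as nested lists p[x][a][y]."""
    n = len(q)
    return [[[q[x][a][y] / lam + (1.0 if y == x else 0.0) for y in range(n)]
             for a in range(len(q[x]))]
            for x in range(n)]


def dt_value_iteration(l0, p, beta, tol=1e-12, max_iter=100_000):
    """Value iteration for V(x) = min_a { l0(x,a) + beta * sum_y p(y|x,a) V(y) }."""
    n = len(l0)
    V = [0.0] * n
    for _ in range(max_iter):
        V_new = [
            min(l0[x][a] + beta * sum(p[x][a][y] * V[y] for y in range(n))
                for a in range(len(l0[x])))
            for x in range(n)
        ]
        if max(abs(u - v) for u, v in zip(V, V_new)) < tol:
            return V_new
        V = V_new
    return V


def ct_residual(W, c0, q, alpha):
    """Largest violation of alpha*W(x) = min_a { c0(x,a) + sum_y q(y|x,a) W(y) }."""
    n = len(c0)
    return max(
        abs(alpha * W[x]
            - min(c0[x][a] + sum(q[x][a][y] * W[y] for y in range(n))
                  for a in range(len(c0[x]))))
        for x in range(n)
    )


# Illustrative data (an assumption, not taken from the text): q[x][a][y], rows sum to 0.
q = [
    [[-1.0, 1.0, 0.0], [-2.0, 0.5, 1.5]],
    [[2.0, -3.0, 1.0], [0.5, -0.5, 0.0]],
    [[1.0, 1.0, -2.0], [3.0, 0.0, -3.0]],
]
c0 = [[1.0, 0.5], [2.0, 3.0], [0.0, 1.0]]
lam, alpha = 4.0, 0.3            # lam exceeds the largest jump intensity (3.0)
beta = lam / (lam + alpha)       # discount factor beta(alpha)

p = uniformize(q, lam)
V_dt = dt_value_iteration(c0, p, beta)          # plays the role of W_0^{DT,beta(alpha)*}
W_ct = [v / (alpha + lam) for v in V_dt]        # candidate for W_0^{*alpha}
residual = ct_residual(W_ct, c0, q, alpha)
```

A residual close to zero is consistent with the identity $W_0^{DT,\beta(\alpha)*} = (\alpha+\lambda)W_0^{*\alpha}$; the choice of plain value iteration is only for brevity.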
All the optimality results in this chapter hold for the finite unichain model. In particular, recalling Remark 5.1.2, in a finite unichain model, if one obtains a real-valued measurable function $h^*$ on $X$, a finite constant $g$ and a deterministic stationary strategy $\varphi^*$ satisfying
$$g = \inf_{a\in A} \Big\{ c_0(x,a) + \sum_{y\in X} h^*(y)\,q(y|x,a) \Big\} = c_0(x,\varphi^*(x)) + \sum_{y\in X} h^*(y)\,q(y|x,\varphi^*(x)) \tag{5.42}$$
for all $x \in X$, then the deterministic stationary strategy $\varphi^*$ is average uniformly optimal, with the value function being the constant $g$; and such a triplet $(h^*, g, \varphi^*)$, known as the canonical triplet, exists according to Remark 5.1.3 with the conditions therein being satisfied. (Recall inequality (5.41).) Note also that, if a triplet $(h^*, g, \varphi^*)$ satisfies the above conditions, then so does the triplet $(h^* + k, g, \varphi^*)$, where $k \in \mathbb{R}$ is an arbitrary constant.

Recall that the braces in expressions like $q(\{y\}|x,a)$ will often be omitted in the case of a denumerable or finite state space, where we simply write $q(y|x,a)$.
5.3 Examples

5.3.1 The Gaussian Model

We now present an example where all the conditions imposed in Theorem 5.2.4 are satisfied, and all the optimality results obtained in this chapter hold.

Example 5.3.1 A hunter is hunting outside his house. Suppose the house is at state 0. A positive state represents the distance from the house to the right, and a negative state represents the distance from the house to the left. Let $X = \mathbb{R}$. If the current position is $x \in X$, then after an exponentially distributed travel time with rate $\lambda > 0$, the hunter reaches a new position, which is normally distributed with mean $x$ and variance 1. If the action $a \in A$ is chosen, where $A = [\underline{A}, \bar{A}]$ for some constants $0 \le \underline{A} \le \bar{A}$, then the hunter receives a call to go home from his manager after an exponentially distributed time with parameter $\beta(x,a) > 0$. In general, under a non-stationary π-strategy, the calling rate can be time-dependent, resulting in a (conditional) non-stationary exponentially distributed time. Assume $2\lambda \le \underline\beta \le \beta(x,a) \le \bar\beta(x^2+1)$ for each $(x,a) \in X \times A$, for some constants $\underline\beta, \bar\beta > 0$. Upon being called, the hunter returns home instantaneously. Under each π-strategy, given the current state $x \in X$, suppose the calls and the hunter's travel times are independent. Therefore, we take the transition rate defined, for each $\Gamma \in \mathcal{B}(X)$, $x \in X$, and $a \in A$, by
$$q(\Gamma|x,a) := \beta(x,a)\,\delta_0(\Gamma) + \lambda \int_\Gamma \frac{1}{\sqrt{2\pi}}\,e^{-\frac{(y-x)^2}{2}}\,dy - (\lambda + \beta(x,a))\,\delta_x(\Gamma). \tag{5.43}$$
Let $G \subseteq X$ be an open set, where game is available, so that the hunter receives a unit of reward for each unit of time he spends there. So we put $c_0(x,a) := -I\{x \in G\}$ for each $(x,a) \in X \times A$. Suppose there are also constraint cost rates $c_j$, $j = 1, 2, \ldots, J$, and a constant $M$ satisfying $|c_j(x,a)| \le M(1 + x^2)$ for each $(x,a) \in X \times A$. The constraint constants are denoted as $d_j$, $j = 1, 2, \ldots, J$. For example, the constraints can be associated with the danger of being in state $x \in X$. Finally, assume that the function $\beta$ is continuous on $X \times A$, the cost rates $c_j$, $j = 1, 2, \ldots, J$, are all lower semicontinuous on $X \times A$, and the initial distribution $\gamma$ satisfies $\int_X x^4\,\gamma(dx) < \infty$. The objective is to maximize the long-run time fraction when the hunter is in the set $G$.

Proposition 5.3.1 Consider the CTMDP described in the above example, which is assumed to be feasible. It satisfies all the conditions in Theorem 5.2.4. Moreover, Theorem 5.2.3 holds, provided that Condition 5.2.2 is also satisfied.

Proof Let $w(x) := 1 + x^4$, $w'(x) := 1 + x^2$.
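Since
$$\int_X \frac{1}{\sqrt{2\pi}}\,y^4\,e^{-\frac{(y-x)^2}{2}}\,dy = 3 + 6x^2 + x^4 \quad \text{for all } x \in X,$$
and the Gaussian density integrates to 1, we see that
$$\begin{aligned}
\int_X w(y)\,q(dy|x,a)
&= \beta(x,a) + \lambda(4 + 6x^2 + x^4) - (\beta(x,a) + \lambda)(1 + x^4) \\
&= -\beta(x,a)\,x^4 + \lambda(3 + 6x^2) \\
&= -\tfrac12 \beta(x,a)(1 + x^4) + \tfrac12 \beta(x,a) - \tfrac12 \beta(x,a)\,x^4 + \lambda(3 + 6x^2), \quad \forall\, x \in X.
\end{aligned}$$
Since $\beta(x,a) \in [\underline\beta, \bar\beta(x^2+1)]$, it is clear that there exists a constant $b > 0$ such that $\tfrac12 \beta(x,a) - \tfrac12 \beta(x,a)\,x^4 + \lambda(3 + 6x^2) \le b$, and thus
$$\int_X w(y)\,q(dy|x,a) \le -\tfrac12 \underline\beta\,w(x) + b, \quad \forall\, x \in X.$$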
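Similarly, for some constant $b' \ge 0$,
$$\begin{aligned}
\int_X w'(y)\,q(dy|x,a)
&= \beta(x,a) + \lambda(2 + x^2) - (\beta(x,a) + \lambda)(1 + x^2) \\
&= -\beta(x,a)\,x^2 + \lambda \\
&= -\tfrac12 \beta(x,a)(1 + x^2) + \tfrac12 \beta(x,a) - \tfrac12 \beta(x,a)\,x^2 + \lambda \\
&\le -\tfrac12 \underline\beta\,w'(x) + b', \quad \forall\, x \in X.
\end{aligned}$$
Note that $\bar q_x \le (\bar\beta + \lambda)\,w'(x)$ for each $x \in X$. It is straightforward to check that the function $\int_X f(y)\,q(dy|x,a)$ is continuous in $(x,a) \in X \times A$ for each $f \in B_w(X)$. Thus, Condition 3.2.2 with $\rho = -\tfrac12 \underline\beta < 0$ and the first part of Condition 5.2.1 are verified.

Let us check Condition 5.2.4. Note that
$$\int_{X\setminus\{0\}} w'(y)\,q(dy|x,a) + \beta(x,a) = \int_X w'(y)\,q(dy|x,a) = -\beta(x,a)\,x^2 + \lambda, \quad \forall\, x \in X \setminus \{0\},\ a \in A,$$
so that
$$\int_{X\setminus\{0\}} w'(y)\,q(dy|x,a) + \tfrac12 \beta(x,a) + \tfrac12 \beta(x,a)\,x^2 = -\tfrac12 \beta(x,a)\,x^2 + \lambda - \tfrac12 \beta(x,a) \le \lambda - \tfrac12 \underline\beta \le 0, \quad \forall\, x \in X \setminus \{0\},\ a \in A,$$
and consequently,
$$\int_{X\setminus\{0\}} w'(y)\,q(dy|x,a) + \tfrac12 \underline\beta\,w'(x) \le 0, \quad \forall\, x \in X \setminus \{0\},\ a \in A. \tag{5.44}$$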
Note that by the strong Markov property, see Proposition A.5.2, under each stationary π-strategy $\pi$, for a fixed initial state $x \in X$, the process $X(\cdot)$ is a delayed regenerative process with successive returns to the state 0 being the regeneration times, see Definition B.3.5. From the proof of Lemma 5.1.6, the previous inequality shows that
$$E_x^\pi \Big[ \int_{(0,\tau_0^+]} w'(X(t))\,dt \Big] < \infty$$
for each $x \in X$. Here, recall that $\tau_0^+ := \inf\{t \ge T_1 : X(t) = 0\}$ is the return time to the state 0, and $E_0^\pi[\tau_0^+] \in (0,\infty)$. In particular, for each $x \in X$ and $w'$-bounded measurable function $f$ on $X$,
$$E_x^\pi \Big[ \int_{(0,\tau_0^+]} f(X(t))\,dt \Big] < \infty,$$
and thus, by Lemma B.3.1,
$$\lim_{T\to\infty} \frac{1}{T} E_x^\pi \Big[ \int_{(0,T]} f(X(s))\,ds \Big] = \frac{E_0^\pi \big[ \int_{(0,\tau_0^+]} f(X(t))\,dt \big]}{E_0^\pi[\tau_0^+]} < \infty.$$
According to Proposition A.5.4, this implies that there exists a unique stable measure, say $\eta$, associated with $\pi$, given by $\eta(dx\times da) = \eta(dx\times A)\,\pi(da|x)$, with
$$\eta(dx \times A) = \frac{E_0^\pi \big[ \int_{(0,\tau_0^+]} I\{X(t) \in dx\}\,dt \big]}{E_0^\pi[\tau_0^+]}.$$
Therefore, each stationary strategy is stable, and hence Condition 5.2.4 is satisfied.

For the second part of Condition 5.2.1, let a stable strategy $\pi_\eta$ associated with a stable measure $\eta$ be fixed. By Proposition A.5.4 and the relation established earlier, we see that the following limit exists:
$$\lim_{t\to\infty} \frac{1}{t} E_x^{\pi_\eta} \Big[ \int_{(0,t]} \int_A c_j(X(s),a)\,\pi_\eta(da|X(s))\,ds \Big] = \frac{E_0^{\pi_\eta} \big[ \int_{(0,\tau_0^+]} \int_A c_j(X(t),a)\,\pi_\eta(da|X(t))\,dt \big]}{E_0^{\pi_\eta}[\tau_0^+]} = \int_{X\times A} c_j(y,a)\,\eta(dy\times da),$$
and thus (5.26) holds. Therefore, the second part of Condition 5.2.1 is satisfied for $\gamma(dy) = \delta_x(dy)$. The treatment for the nondegenerate initial distribution $\gamma$ can be reduced to the case of a fixed initial state $\hat x \notin X$, with the extension $\tilde q(dy|\hat x,a) := \gamma(dy)$. The new process is still regenerative, just with possibly a different initial cycle.

Now it is clear that all the conditions in Theorem 5.2.2 are satisfied, and so are Conditions 2.4.3, 3.1.3, and 3.2.1. Condition 5.2.3 is satisfied because of (5.44) and Lemma 5.1.6. (In fact, Lemma 5.1.6 concerns the original model, but the validity of its corresponding version for the submodels is obvious.) Finally, Condition 5.2.5 is satisfied because of Lemma 5.2.3, where we put
$$\nu(dy) := \delta_0(dy) + dy, \qquad g(x,a,y) := \beta(x,a)\,I\{y = 0\} + \frac{\lambda}{\sqrt{2\pi}}\,e^{-\frac{(y-x)^2}{2}}\,I\{y \ne 0\}.$$
The proof is complete.
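The drift bounds in the proof above rest on two Gaussian moment identities: for $Y \sim N(x,1)$, $E[Y^4] = 3 + 6x^2 + x^4$ and $E[Y^2] = 1 + x^2$. The following is a small sketch, with hypothetical helper names, that recovers both polynomials from the binomial expansion of $(x+Z)^n$ with $Z \sim N(0,1)$, and then evaluates the drift of $w'(x) = 1 + x^2$ under the transition rate (5.43) at a sample point (the numeric values of `beta` and `lam` are illustrative assumptions).

```python
from math import comb


def std_normal_moment(n):
    """E[Z^n] for Z ~ N(0,1): zero for odd n, (n-1)!! for even n."""
    if n % 2 == 1:
        return 0
    result = 1
    for k in range(n - 1, 0, -2):
        result *= k
    return result


def shifted_moment_coeffs(n):
    """Coefficients [a_0, ..., a_n] of the polynomial x -> E[(x + Z)^n]."""
    return [comb(n, k) * std_normal_moment(n - k) for k in range(n + 1)]


quartic = shifted_moment_coeffs(4)     # coefficients of 3 + 6x^2 + x^4
quadratic = shifted_moment_coeffs(2)   # coefficients of 1 + x^2


def drift_w_prime(x, beta, lam):
    """Integral of w'(y) = 1 + y^2 against q(dy|x,a) from (5.43), with beta = beta(x,a)."""
    gauss_mean_of_w = 1 + sum(c * x ** k for k, c in enumerate(quadratic))
    w_at_0 = 1.0
    w_at_x = 1.0 + x ** 2
    return beta * w_at_0 + lam * gauss_mean_of_w - (lam + beta) * w_at_x


# Illustrative evaluation point (assumed values).
x0, beta0, lam0 = 1.5, 2.0, 0.5
drift_value = drift_w_prime(x0, beta0, lam0)
closed_form = -beta0 * x0 ** 2 + lam0
```

The two values `drift_value` and `closed_form` agree, matching the identity $\int_X w'(y)\,q(dy|x,a) = -\beta(x,a)x^2 + \lambda$ used in the proof.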
5.3.2 The Freelancer Dilemma

5.3.2.1 Unconstrained Version

Let us consider the long-run average version of the problem described in Sect. 1.2.2. The correctness of the model formulation is justified by Lemma A.1.2, see (A.4) in its proof. Recall that the decisions should be made only in the state $0 \in X$, when the freelancer is free. The model is finite and unichain. Therefore, as explained at the end of Sect. 5.2.4, it is sufficient to solve equation (5.42), which takes the form
$$\begin{aligned}
g &= R_x \mu_x + \mu_x h^*(0) - \mu_x h^*(x), && \text{if } x \ne 0; \\
g &= \sup_{a\in A} \Big\{ \sum_{y\in X\setminus\{0\}} a_y \lambda_y h^*(y) - \sum_{y\in X\setminus\{0\}} a_y \lambda_y h^*(0) \Big\} \\
&= \sum_{y\in X\setminus\{0\}} a_y^* \lambda_y h^*(y) - \sum_{y\in X\setminus\{0\}} a_y^* \lambda_y h^*(0), && \text{if } x = 0.
\end{aligned} \tag{5.45}$$
Since we are considering rewards $R_x$, we change inf to sup in Eq. (5.42) and in similar expressions, and the supremum is achieved and coincides with the maximum. Hopefully, this will not lead to confusion. Without loss of generality, we put $h^*(0) = 0$. Then, for all $x \ne 0$,
$$h^*(x) = \frac{R_x \mu_x - g}{\mu_x}.$$
In the formula (5.45), any one component of the vector $a^* = (a_1^*, a_2^*, \ldots, a_M^*) \in A = \{0,1\}^M$ providing the maximum can be considered separately, i.e., for each $m \in \{1, 2, \ldots, M\}$, $a_m^* = 1$ if and only if $\lambda_m h^*(m) > 0$. Here (temporarily), for the sake of uniqueness, we accept that $a_m^* = 0$ if $h^*(m) = 0$. To put it differently, $a_m^* = 1 \iff R_m \mu_m > g$. This inequality makes clear the shape of the optimal stationary strategy: one has to accept all the jobs with $R_m \mu_m > g$. It is convenient to order all the job types so that
$$R_1 \mu_1 \ge R_2 \mu_2 \ge \ldots \ge R_M \mu_M. \tag{5.46}$$
After that, there is no need to consider actions (at the state $0 \in X$) of the form $(1, 1, 0, 1, \ldots)$: only actions like $(1, 0, 0, \ldots)$, $(1, 1, 0, \ldots)$, $(1, 1, 1, \ldots)$ can be optimal. Action $(0, 0, \ldots, 0) \in A$ is certainly not optimal. When trying the action $(\underbrace{1, 1, \ldots, 1}_{k}, 0, 0, \ldots, 0)$ for $k \in \{1, 2, \ldots, M\}$, one obtains from (5.45) the expressions
$$r_k = \sum_{y=1}^{k} \lambda_y h_k^*(y) = \sum_{y=1}^{k} \lambda_y \Big( R_y - \frac{r_k}{\mu_y} \Big) \iff r_k = \frac{\sum_{y=1}^{k} \lambda_y R_y}{1 + \sum_{y=1}^{k} \frac{\lambda_y}{\mu_y}}. \tag{5.47}$$
Below, we put $r_0 := 0$.

Theorem 5.3.1 (a) For all $k = 0, 1, 2, \ldots, M-1$, $r_{k+1} \gtrless r_k$ if and only if $R_{k+1}\mu_{k+1} \gtrless r_k$.
(b) For all $k = 1, 2, \ldots, M$, $R_k \mu_k \gtrless r_{k-1}$ if and only if $R_k \mu_k \gtrless r_k$.
(c) Consider $k^* \in \{2, 3, \ldots, M-1\}$ such that $r_{k^*-1} \le r_{k^*} \ge r_{k^*+1}$, or $k^* = 1$ if $r_1 \ge r_2$, or $k^* = M$ if $r_{M-1} \le r_M$. Then $g := r_{k^*}$, $h^*(0) := 0$ and $h^*(x) = \frac{R_x \mu_x - g}{\mu_x}$ for all $x = 1, 2, \ldots, M$ satisfy Eq. (5.45), where the maximum is attained at $a^* = (\underbrace{1, 1, \ldots, 1}_{k^*}, 0, 0, \ldots, 0)$. Hence the deterministic stationary strategy $\varphi^*(x) = a^*$ is average uniformly optimal, and the maximal average reward equals $g$.
(d) Let $k^*$ be the smallest integer satisfying the requirements formulated in Item (c). Then the following statements hold. The sequence $\{r_k\}_{k=1}^M$ strictly increases up to $k^*$ (if $k^* > 1$), i.e.,
$$r_1 < r_2 < \ldots < r_{k^*}; \tag{5.48}$$
may be constant for a while after $k^*$ (if $k^* < M$), i.e., $r_{k^*} = r_{k^*+1} = \ldots = r_{k^{**}}$ and (in case $k^{**} < M$) $r_{k^{**}+1} \ne r_{k^{**}}$; and strictly decreases after $k^{**}$ (if $k^{**} < M$), i.e.,
$$r_{k^{**}} > r_{k^{**}+1} > \ldots > r_M. \tag{5.49}$$
(e) If $r_{k-1} = r_k$ for some $k \in \{2, 3, \ldots, M\}$ then one can accept $k^* = k - 1$ or $k^* = k$. After that, all the statements of Item (c) hold and $r_{k-1} = r_k = g$. In particular, all the strategies as in Item (c), associated with $k^*, k^*+1, \ldots, k^{**}$, are equally uniformly average optimal.

Remark 5.3.1 (a) According to part (a) of Theorem 5.3.1 and its proof presented below, $k^*$ can be equivalently described as a number from $\{2, 3, \ldots, M-1\}$ such that $R_{k^*}\mu_{k^*} \ge r_{k^*-1}$ and $R_{k^*+1}\mu_{k^*+1} \le r_{k^*}$ $\iff$ $R_{k^*+1}\mu_{k^*+1} \le r_{k^*} \le R_{k^*}\mu_{k^*}$ (see Item (b)), or $k^* = 1$ if $R_2 \mu_2 \le r_1$, and in this case $R_1 \mu_1 \ge r_0 = 0 \iff R_1 \mu_1 \ge r_1$ (see Item (b)), or $k^* = M$ if $R_M \mu_M \ge r_{M-1}$.
(b) If $r_{k^*} = r_{k^*+1} = \ldots = r_{k^{**}}$ for $k^{**} > k^*$, then, according to part (a) of Theorem 5.3.1, $R_{k^*+1}\mu_{k^*+1} = R_{k^*+2}\mu_{k^*+2} = \ldots = R_{k^{**}}\mu_{k^{**}} = r_{k^*} = g$, and $h^*(k^*+1) = h^*(k^*+2) = \ldots = h^*(k^{**}) = 0$: it is equally optimal to accept or reject the jobs of types $k^*+1, k^*+2, \ldots, k^{**}$.

Proof of Theorem 5.3.1 (a) The following formulae, valid for all $0 \le k \le M-1$, justify assertion (a):
$$\begin{aligned}
r_{k+1} = \frac{\sum_{y=1}^{k+1} \lambda_y R_y}{1 + \sum_{y=1}^{k+1} \frac{\lambda_y}{\mu_y}} \gtrless r_k = \frac{\sum_{y=1}^{k} \lambda_y R_y}{1 + \sum_{y=1}^{k} \frac{\lambda_y}{\mu_y}}
&\iff \lambda_{k+1} R_{k+1} \Big[ 1 + \sum_{y=1}^{k} \frac{\lambda_y}{\mu_y} \Big] \gtrless \frac{\lambda_{k+1}}{\mu_{k+1}} \sum_{y=1}^{k} \lambda_y R_y \\
&\iff R_{k+1}\mu_{k+1} \gtrless \frac{\sum_{y=1}^{k} \lambda_y R_y}{1 + \sum_{y=1}^{k} \frac{\lambda_y}{\mu_y}} = r_k.
\end{aligned}$$
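(b) By the definition of $r_{k-1}$, for all $k = 1, 2, \ldots, M$,
$$R_k \mu_k \gtrless r_{k-1} \iff R_k \mu_k \Big[ 1 + \sum_{y=1}^{k-1} \frac{\lambda_y}{\mu_y} \Big] \gtrless \sum_{y=1}^{k-1} \lambda_y R_y. \tag{5.50}$$
Furthermore,
$$R_k \mu_k - r_k = \frac{R_k \mu_k \Big[ 1 + \sum_{y=1}^{k} \frac{\lambda_y}{\mu_y} \Big] - \sum_{y=1}^{k} \lambda_y R_y}{1 + \sum_{y=1}^{k} \frac{\lambda_y}{\mu_y}}$$
and
$$\begin{aligned}
R_k \mu_k - r_k \gtrless 0
&\iff R_k \mu_k \Big[ 1 + \sum_{y=1}^{k-1} \frac{\lambda_y}{\mu_y} \Big] + R_k \lambda_k - \sum_{y=1}^{k} \lambda_y R_y \\
&\qquad = R_k \mu_k \Big[ 1 + \sum_{y=1}^{k-1} \frac{\lambda_y}{\mu_y} \Big] - \sum_{y=1}^{k-1} \lambda_y R_y \gtrless 0 \iff R_k \mu_k \gtrless r_{k-1}
\end{aligned}$$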
by (5.50).

(c) First of all, one can always find at least one $k^*$ satisfying the listed requirements. According to (a) (see also Remark 5.3.1(a)), $R_{k^*}\mu_{k^*} \ge r_{k^*-1}$. According to (b), $R_{k^*}\mu_{k^*} \ge r_{k^*} =: g$ and hence, for all $k = 1, 2, \ldots, k^*$, $R_k \mu_k \ge g$. According to (a) (see also Remark 5.3.1(a)), $R_{k^*+1}\mu_{k^*+1} \le r_{k^*} = g$ and hence, for all $k = k^*+1, k^*+2, \ldots, M$, $R_k \mu_k \le g$. Therefore, Eq. (5.45) is satisfied and the maximum is attained at $a^*$ such that $a_k^* = I\{R_k \mu_k \ge g\}$ for $k = 1, 2, \ldots, M$. All the reasoning applies also in the cases $k^* = 1$ and $k^* = M$. The proof is complete.

(d) (i) Assume that $k^* > 1$ and formula (5.48) is false. Then either there is a constant $\hat k < k^*$ such that $r_{\hat k - 1} \le r_{\hat k} \ge r_{\hat k + 1}$, or $r_1 \ge r_2$, and, in the latter case, we put $\hat k := 1$. Anyway, $\hat k < k^*$ satisfies the requirements formulated in Item (c), which contradicts the assumption that $k^*$ is the minimal index among those. Formula (5.48) is proved.

Assume that $k^{**} < M$, $r_{k^{**}+1} \ne r_{k^{**}}$, and formula (5.49) is false. Then either there is a $\hat k > k^{**}$ such that $r_{\hat k - 1} \le r_{\hat k} \ge r_{\hat k + 1}$, or, otherwise, $r_{M-1} \le r_M$ and, in the latter case, we put $\hat k := M$. Arguing similarly to the proof of Item (c), one can conclude that the maximal average reward equals $r_{\hat k}$, so that $r_{\hat k} = r_{k^*} = g$ and, for all $k = 1, 2, \ldots, \hat k$, $R_k \mu_k \ge g$. But, for all $k = k^*+1, k^*+2, \ldots, M$, $R_k \mu_k \le g$ by the proof of Item (c). Therefore, for all $k = k^{**}+1, k^{**}+2, \ldots, \hat k$, $R_k \mu_k = g = r_{k^*} = r_{k^{**}}$. In particular, $R_{k^{**}+1}\mu_{k^{**}+1} = r_{k^{**}}$; hence $r_{k^{**}+1} = r_{k^{**}}$ by Item (a). But we assumed that $r_{k^{**}+1} \ne r_{k^{**}}$. The obtained contradiction confirms that the formula (5.49) is valid.

(e) According to Item (d), both $k - 1$ and $k$ satisfy all the requirements formulated in Item (c).

As a numerical example, consider the following situation. There are $M = 5$ types of jobs. Their parameters are given in the table:

m    1    2    3    4    5
λ    1    2    2    3    4
μ    5    2    3    1    2
R    2   10    3   12    6
Here, e.g., the jobs of type 1 arrive once a week on average, bring two hundred pounds each, and the freelancer can complete each of those jobs during 1/5 of a week on average. First, we reorder the jobs so that the product $R_m \mu_m$ decreases:

k    1    2    3    4    5
m    2    4    5    1    3
λ    2    3    4    1    2
μ    2    1    2    5    3
R   10   12    6    2    3
Direct calculations give the following table:

k    R_k μ_k    r_k
1    20         10
2    12         11.2
3    12         11 3/7 ≈ 11.43
4    10         ≈ 11.39
5    9          ≈ 11.19
The optimal strategy prescribes to accept only the jobs corresponding to $k = 1, 2, 3 = k^*$, i.e., the jobs of the (original) types $m = 2, 4, 5$: $\varphi^*(0) = (0, 1, 0, 1, 1)$. The maximal average reward per week is $g = r_3 = 11\frac{3}{7}$ hundred pounds. Let $h^*(x) = \frac{R_x\mu_x - g}{\mu_x}$ for all $x = 1, 2, \dots, 5$:

x        0    1       2       3         4      5
h*(x)    0    −2/7    30/7    −17/21    4/7    2/7
Now the triplet (h ∗ , g, ϕ∗ ) solves equation (5.45) according to Theorem 5.3.1.
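The numbers in the tables above can be re-derived mechanically. The following is a minimal sketch, not the authors' procedure: it assumes that $r_k = \sum_{y\le k}\lambda_y R_y / (1 + \sum_{y\le k}\lambda_y/\mu_y)$ (the expression for $r_{k^*}$ used later in this subsection) and that the maximal average reward is the largest $r_k$, which is what the example reports. The names `jobs`, `average_rewards`, `k_star`, `accept` and `h` are illustrative only.

```python
from fractions import Fraction as F

# Job parameters of the numerical example, keyed by the original type m = 1..5:
# (arrival rate lam, service rate mu, reward R in hundreds of pounds).
jobs = {1: (1, 5, 2), 2: (2, 2, 10), 3: (2, 3, 3), 4: (3, 1, 12), 5: (4, 2, 6)}

def average_rewards(jobs):
    """Order the types by decreasing R*mu and return (order, [r_1, ..., r_M])."""
    # sorted() is stable, so types with equal R*mu keep their original order.
    order = sorted(jobs, key=lambda m: -jobs[m][2] * jobs[m][1])
    num, den, r = F(0), F(1), []
    for m in order:
        lam, mu, R = jobs[m]
        num += lam * R        # sum of lam_y * R_y over the accepted types
        den += F(lam, mu)     # 1 + sum of lam_y / mu_y over the accepted types
        r.append(num / den)
    return order, r

order, r = average_rewards(jobs)
g = max(r)                    # assumed: maximal average reward = largest r_k
k_star = r.index(g) + 1       # smallest index attaining the maximum
accept = tuple(int(m in order[:k_star]) for m in sorted(jobs))
h = {m: jobs[m][2] - g / jobs[m][1] for m in jobs}   # (R_x mu_x - g) / mu_x
```

With exact rational arithmetic, `g` comes out as 80/7, i.e. 11 3/7, and `accept` lists the decisions in the original labelling of the job types.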
5.3.2.2 Constrained Version
Below, we assume that the jobs have been ordered according to (5.46). Suppose, in the considered example, it is desirable to work on a particular type of job, say, $\hat m \in \{1, 2, \dots, M\}$, no less than a specified fraction of time $d \in (0, 1)$.
5.3 Examples
This means $c_1(x, a) = I\{x = \hat m\}$, and we deal with the problem (5.1) ($J = 1$), where again we consider maximization and the inequality $\le$ is replaced with $\ge$. If only the jobs of type $\hat m$ are accepted, that fraction equals $\frac{\lambda_{\hat m}}{\lambda_{\hat m}+\mu_{\hat m}}$, meaning that there are no feasible strategies if $d > \frac{\lambda_{\hat m}}{\lambda_{\hat m}+\mu_{\hat m}}$. Thus, we assume that $d < \frac{\lambda_{\hat m}}{\lambda_{\hat m}+\mu_{\hat m}}$, so that the Slater Condition 5.2.2 is satisfied. Note also that, for each deterministic stationary strategy $\varphi(x) \equiv a$ with $a_{\hat m} = 1$,
\[ W_1(\varphi, x) = \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{M} a_y \lambda_y/\mu_y}. \]
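The expression for $W_1$ can be checked against a stationary distribution. The sketch below rests on an assumption about the model that is not restated in this passage: from the idle state 0 an accepted job of type $y$ moves the process to state $y$ at rate $\lambda_y$, and state $y$ returns to 0 at rate $\mu_y$. Under that assumption the detailed balance relations $\pi_0 a_y\lambda_y = \pi_y\mu_y$ give the stationary probabilities, and $W_1$ is the stationary probability of state $\hat m$. The function names are hypothetical.

```python
from fractions import Fraction as F

def stationary(lam, mu, a):
    """Stationary distribution [pi_0, pi_1, ..., pi_M] of the assumed chain.

    Balance between state 0 and state y: pi_0 * a_y * lam_y = pi_y * mu_y.
    """
    weights = [F(1)] + [F(a_y * l, m) for l, m, a_y in zip(lam, mu, a)]
    total = sum(weights)
    return [w / total for w in weights]

def time_fraction(lam, mu, a, m_hat):
    """Long-run fraction of time spent on jobs of type m_hat (1-based)."""
    return stationary(lam, mu, a)[m_hat]

# Parameters of the numerical example, in the original labelling m = 1..5.
lam = [1, 2, 2, 3, 4]
mu = [5, 2, 3, 1, 2]

only_type_1 = time_fraction(lam, mu, (1, 0, 0, 0, 0), 1)   # lam/(lam + mu)
with_others = time_fraction(lam, mu, (1, 1, 0, 1, 1), 1)
```

Accepting further job types only increases the denominator, which is why the largest achievable fraction is obtained when type $\hat m$ alone is accepted.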
Remark 5.3.2 If the product $R_{\hat m}\mu_{\hat m}$ coincides with other similar product(s), then $R_{\hat m}\mu_{\hat m}$ comes first in the formula (5.46), so that, if it is equally optimal to accept the jobs of type $\hat m$ and of other types, then the type $\hat m$ is ultimately accepted.
If a solution to the unconstrained problem is feasible, i.e., for
\[ g = W_0(\varphi^*, x) = r_{k^*} = \frac{\sum_{y=1}^{k^*} \lambda_y R_y}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y} \]
we have that
• either $\hat m \le k^*$ and
\[ W_1(\varphi^*, x) = \lim_{T\to\infty} \frac{1}{T} E_x^{\varphi^*}\left[\int_{(0,T]} c_1(X(t), a^*)\,dt\right] = \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y} \ge d, \]
• or, by Remark 5.3.2, $\hat m = k^*+1 \le k^{**}$ and
\[ W_1(\tilde\varphi^*, x) = \lim_{T\to\infty} \frac{1}{T} E_x^{\tilde\varphi^*}\left[\int_{(0,T]} c_1(X(t), \tilde a^*)\,dt\right] = \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*+1} \lambda_y/\mu_y} \ge d, \]
then the constraint is not essential and can be ignored. Here $k^*$ and $k^{**}$ are as in Theorem 5.3.1(c, d), $\varphi^*$ is given in Theorem 5.3.1(c), and $\tilde\varphi^*(x) = \tilde a^*$ with $\tilde a^* = (\underbrace{1, 1, \dots, 1}_{k^*}, 1, 0, 0, \dots, 0)$.
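Clearly, no (optimal) solution to the unconstrained problem is feasible if and only if
• If $k^{**} = k^*$: either $\hat m > k^*$ or (if $\hat m \le k^*$)
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y} < d. \]
• If $k^{**} > k^*$: $\hat m > k^*+1$, or (if $\hat m = k^*+1$)
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*+1} \lambda_y/\mu_y} < d, \]
or (if $\hat m \le k^*$)
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y} < d. \]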
Under the Slater condition, if no (optimal) solution to the unconstrained problem is feasible, then the optimal value of the Lagrange multiplier $g$ in the Dual Convex Program (5.36) is positive. After that, the problem $L^{av}(\eta, g) \to \inf_{\eta\in\mathcal{D}^{av}}$ is just the unconstrained Freelancer Dilemma with the modified value of only $R_{\hat m}$: $R_{\hat m} \to R_{\hat m} + \frac{g}{\mu_{\hat m}} =: \tilde R_{\hat m}$, and one should find such a value of $g > 0$ (equivalently, a value of $\tilde R_{\hat m} > R_{\hat m}$) that two strategies are (unconstrained) optimal for the modified Freelancer Dilemma, and the long-run analogue of relations (3.45) is valid. Below, we implement this plan.
Lemma 5.3.1 Suppose $\hat m > k^*$ and
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y + \lambda_{\hat m}/\mu_{\hat m}} \ge d. \]
Furthermore, assume that either $k^{**} = k^*$, or $k^{**} > k^*$ and $\hat m \ne k^*+1$. Then one can increase the value of $R_{\hat m}$ to $\tilde R_{\hat m} > R_{\hat m}$ in such a way that the relations
\[ r_{k^*} = \frac{\sum_{y=1}^{k^*} \lambda_y R_y + \lambda_{\hat m}\tilde R_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y + \lambda_{\hat m}/\mu_{\hat m}} \quad\text{and}\quad R_{k^*+1}\mu_{k^*+1} \le \tilde R_{\hat m}\mu_{\hat m} \le R_{k^*}\mu_{k^*} \]
hold.
The proofs of the following and other lemmas are presented in Appendix A.7.
Lemma 5.3.2 Suppose
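\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1,\, y\ne\hat m}^{k^*} \lambda_y/\mu_y + \lambda_{\hat m}/\mu_{\hat m}} \]
0 and two deterministic stationary strategies $\varphi^*_1$ and $\varphi^*_2$ such that (independently of $x \in \mathbf{X}$) the expression $W_0(S, x) + g^* W_1(S, x)$ attains the maximum both at $\varphi^*_1$ and $\varphi^*_2$, and $B_1 := W_1(\varphi^*_1, x) \le d \le W_1(\varphi^*_2, x) =: B_2$ with $B_2 > B_1$.
(b) Both strategies $\varphi^*_1$ and $\varphi^*_2$ are stable and their SM-mixture with the associated stable measure
\[ \eta^* = \alpha_1 \eta^{\varphi^*_1} + (1-\alpha_1)\eta^{\varphi^*_2}, \quad\text{where } \alpha_1 = \frac{B_2 - d}{B_2 - B_1}, \]
is optimal in the constrained problem.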
Proof (a) We consider all the cases when no (optimal) solution to the unconstrained problem is feasible.
(i) Suppose
• $\hat m > k^*$ and
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y + \lambda_{\hat m}/\mu_{\hat m}} \ge d \]
and
• either $k^{**} = k^*$, or ($k^{**} > k^*$ & $\hat m \ne k^*+1$).
Then
\[ \varphi^*_1(x) := a^{*1} = (\underbrace{1, 1, \dots, 1}_{k^*}, 0, 0, \dots, 0) \quad\text{and}\quad \varphi^*_2(x) := a^{*2} = (\underbrace{1, 1, \dots, 1}_{k^*}, 0, 0, \dots, 0, a^{*2}_{\hat m} = 1, 0, 0, \dots, 0); \]
$g^* := (\tilde R_{\hat m} - R_{\hat m})\mu_{\hat m}$, where $\tilde R_{\hat m}$ comes from Lemma 5.3.1.
(ii) Suppose
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1,\, y\ne\hat m}^{k^*} \lambda_y/\mu_y + \lambda_{\hat m}/\mu_{\hat m}} < d. \]
Compute $\hat k$ as in Lemma 5.3.2 and consider the following two cases.
(α) If $\hat m \ne \hat k + 1$ then
\[ \varphi^*_1(x) := a^{*1} = (\underbrace{1, 1, \dots, 1}_{\hat k+1}, 0, 0, \dots, 0, a^{*1}_{\hat m} = 1, 0, 0, \dots, 0) \quad\text{and}\quad \varphi^*_2(x) := a^{*2} = (\underbrace{1, 1, \dots, 1}_{\hat k}, 0, 0, \dots, 0, a^{*2}_{\hat m} = 1, 0, 0, \dots, 0); \]
$g^* := (\tilde R_{\hat m} - R_{\hat m})\mu_{\hat m}$, where $\tilde R_{\hat m}$ comes from Lemma 5.3.2. In the above formulae for $a^{*1}$ and $a^{*2}$, $a^{*1,2}_{\hat m} = 1$ on the right does not appear if $\hat m \le \hat k$.
(β) If $\hat m = \hat k + 1$, then the inequality $\hat m > k^*$ is excluded by the definition of $\hat k$, and $\hat m < k^*$ because otherwise (if $\hat k = k^* - 1$ and $\hat m = k^*$)
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y} \ge d, \]
and the solution to the unconstrained problem would be feasible. Again using Lemma 5.3.2, we put
\[ \varphi^*_1(x) := a^{*1} = (\underbrace{1, 1, \dots, 1}_{\hat k+2}, 0, 0, \dots, 0), \quad \varphi^*_2(x) := a^{*2} = (\underbrace{1, 1, \dots, 1}_{\hat k+1}, 0, 0, \dots, 0), \quad g^* := (\tilde R_{\hat m} - R_{\hat m})\mu_{\hat m}. \]
(If $k^{**} > k^*$, the case of $\hat m = k^*+1$ and
\[ \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*+1} \lambda_y/\mu_y} < d \]
is covered in Item (ii) above.)
In all the considered cases $B_2 > B_1$ and $\tilde R_{\hat m} > R_{\hat m}$. Consider the optimization problem
\[ W_0(S, x) + g W_1(S, x) \to \sup_{S \in \mathbf{S}_\pi}. \]
It is the standard unconstrained Freelancer Dilemma with the modified reward coming from the type $\hat m$ jobs: $R_{\hat m} \to R_{\hat m} + \frac{g}{\mu_{\hat m}} = \tilde R_{\hat m}$. After we re-order the jobs in the standard way (5.46), we see that in case (i) $\hat m = k^*+1$ in the new order and $r_{k^*} = \tilde r_{k^*} = \tilde r_{k^*+1}$. (Here and below, $\tilde r_k$ are given by the standard formula (5.47) for the modified rewards.) In case (ii-α), $\tilde r_{\hat k} = \tilde r_{\hat k+1}$ if $\hat m \le \hat k$ in the original order, and $\tilde r_{\hat k+1} = \tilde r_{\hat k+2}$ if $\hat m > \hat k + 1$ (in the original order), because, in the modified order, $\hat m < \hat k + 1$, since $R_{\hat k+1}\mu_{\hat k+1} < \tilde R_{\hat m}\mu_{\hat m}$. (See Lemma 5.3.2, case $\hat m \ne \hat k + 1$.) In case (ii-β), $\tilde r_{\hat k+1} = \tilde r_{\hat k+2}$. Therefore, in all the cases, both strategies $\varphi^*_1$ and $\varphi^*_2$ are optimal by Theorem 5.3.1(e).
(b) Since the model is unichain and finite, all deterministic stationary strategies are stable and $\eta^*$ is a stable measure (see Theorem 5.2.1(b)) solving the problem
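\[ L^{av}(\eta, g^*) := \sum_{(x,a)\in\mathbf{X}\times\mathbf{A}} c_0(x, a)\eta(x, a) + g^*\left(\sum_{(x,a)\in\mathbf{X}\times\mathbf{A}} c_1(x, a)\eta(x, a) - d\right) \to \sup_{\eta\in\mathcal{D}^{av}} \]
because both $\eta^{\varphi^*_1}$ and $\eta^{\varphi^*_2}$ solve this problem. Let us show that $g^*$ solves problem (5.36). (Remember, we need to replace maximization with minimization here.) Indeed, if we take $0 \le g < g^*$ then
\[ \sup_{\eta\in\mathcal{D}^{av}} L^{av}(\eta, g^*) = L^{av}(\eta^{\varphi^*_1}, g^*) \le L^{av}(\eta^{\varphi^*_1}, g) \]
because $B_1 = \sum_{(x,a)\in\mathbf{X}\times\mathbf{A}} c_1(x, a)\eta^{\varphi^*_1}(x, a) \le d$. Hence
\[ \sup_{\eta\in\mathcal{D}^{av}} L^{av}(\eta, g^*) \le \sup_{\eta\in\mathcal{D}^{av}} L^{av}(\eta, g). \]
Similarly, for $g > g^*$,
\[ \sup_{\eta\in\mathcal{D}^{av}} L^{av}(\eta, g^*) = L^{av}(\eta^{\varphi^*_2}, g^*) \le L^{av}(\eta^{\varphi^*_2}, g) \le \sup_{\eta\in\mathcal{D}^{av}} L^{av}(\eta, g). \]
Now the stable measure $\eta^* \in \mathcal{D}^{av}$ satisfies Item (b–i) of Theorem 5.2.3:
\[ L^{av}(\eta^*, g) = L^{av}(\eta^*, g^*) \ge L^{av}(\eta, g^*) \quad \forall g \in \mathbb{R}^0_+. \]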
Thus η ∗ solves problem (5.28) and therefore is optimal in the constrained problem. Clearly, Theorem 5.3.2 is consistent with Theorem 5.2.4. Note that the strategies ϕ∗1 and ϕ∗2 can be explicitly calculated, as well as the values B1 and B2 . In all the cases, a ∗1 and a ∗2 differ by one element only. Along with the presented optimal SM-mixture η ∗ of the stable strategies, one can provide the solution to the constrained problem in the form of a relaxed π-strategy: the additional job type in a ∗2 should be accepted with probability α2 ∈ [0, 1], and the value of α2 is given by the following expressions: in the case (a–i) of the proof of Theorem 5.3.2,
α2 = in the case (a–ii-α), α2 =
∗ d 1 + ky=1 λmˆ (1 μmˆ
− d)
ˆ d 1 + ky=1 −
λy μy
λy μy
;
−
λmˆ μmˆ
−
λmˆ μmˆ
dλk+1 ˆ μk+1 ˆ
;
and, in the case (a–ii-β),
α2 =
ˆ d 1 + k+1 y=1 −
λy μy
dλk+2 ˆ μk+2 ˆ
.
All these expressions come from the equation
\[ d = \sum_{y\in\mathbf{X}\setminus\{0\}} \left[(1-\alpha_2)a^{*1}_y + \alpha_2 a^{*2}_y\right]\lambda_y h(y) = I\{x = \hat m\} - \mu_x h(x), \quad x \in \mathbf{X}\setminus\{0\}. \]
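One can also compute the values of $\alpha_2$ using the convexity of the space $\mathcal{D}^{av}$. Consider the case (a–i) in the proof of Theorem 5.3.2 as an example. Then $\eta^*(y \times A)$, the unique probability measure satisfying Eq. (5.25), takes the following non-zero values:
\[ \eta^*(y \times a) = \frac{\alpha_1 \frac{\lambda_y}{\mu_y} I\{y \ne \hat m\}}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}} I\{a = a^{*1}\} + \frac{(1-\alpha_1)\frac{\lambda_y}{\mu_y}}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat m}}{\mu_{\hat m}}} I\{a = a^{*2}\}, \quad \forall y \in \{1, 2, \dots, k^*, \hat m\}; \]
\[ \eta^*(0 \times a) = \frac{\alpha_1}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}} I\{a = a^{*1}\} + \frac{1-\alpha_1}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat m}}{\mu_{\hat m}}} I\{a = a^{*2}\}. \]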
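Only decisions in state $0 \in \mathbf{X}$ are relevant because transition rates $q(y|x, a)$ do not depend on $a$ if $x \ne 0$. Thus, we disintegrate $\eta^*(0 \times a)$: $\pi^*(a|0) = 0$ if $a \ne a^{*1}, a^{*2}$ and
\[ \pi^*(a^{*2}|0) = \frac{1-\alpha_1}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat m}}{\mu_{\hat m}}} \Bigg/ \eta^*(0 \times A). \]
After we substitute $\alpha_1 = \frac{B_2 - d}{B_2 - B_1}$ with $B_1 = 0$ and $B_2 = \frac{\lambda_{\hat m}/\mu_{\hat m}}{1 + \sum_{y=1}^{k^*} \lambda_y/\mu_y + \lambda_{\hat m}/\mu_{\hat m}}$, we obtain
\[ \pi^*(a^{*2}|0) = \frac{d\left(1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}\right)}{\frac{\lambda_{\hat m}}{\mu_{\hat m}} - d\left(1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}\right) - d\frac{\lambda_{\hat m}}{\mu_{\hat m}} + d\left(1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}\right)} = \frac{d\left(1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}\right)}{\frac{\lambda_{\hat m}}{\mu_{\hat m}}(1 - d)} = \alpha_2, \]
and $\pi^*(a^{*1}|0) = 1 - \alpha_2$. Recall that
\[ a^{*1} = (\underbrace{1, 1, \dots, 1}_{k^*}, 0, 0, \dots, 0) \quad\text{and}\quad a^{*2} = (\underbrace{1, 1, \dots, 1}_{k^*}, 0, 0, \dots, 0, a^{*2}_{\hat m} = 1, 0, 0, \dots, 0), \]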
meaning that all jobs of the types $1, 2, \dots, k^*$ are accepted, and the jobs of type $\hat m$ are accepted with probability $\alpha_2$.
Suppose, in the numerical example presented above, $\hat m = 4$ and $d = \frac{1}{40}$, i.e. the fraction of time for working on jobs of (original) type 1 should be no smaller than $\frac{1}{40}$. Then the presented optimal solution is not feasible as those jobs are rejected. When solving this new constrained problem, we are within the case (a–i) in the proof of Theorem 5.3.2: using the indices $k$, after re-ordering the jobs, we have
\[ \frac{\lambda_4/\mu_4}{1 + \sum_{y=1}^{3} \lambda_y/\mu_y + \lambda_4/\mu_4} = \frac{1}{36} \ge d = \frac{1}{40}. \]
Therefore, $\alpha_2 = \frac{35}{39}$: one has to accept all the preliminarily selected jobs of the (original) types 2, 4, 5 and accept the (original) type 1 jobs with probability $\alpha_2 = \frac{35}{39}$. Equivalently, one can mix the strategies $\varphi^*_1(0) = (0, 1, 0, 1, 1)$ and $\varphi^*_2(0) = (1, 1, 0, 1, 1)$, where we indicate the jobs in the original label, before re-ordering, with weights $\alpha_1 = \frac{1}{10}$ and $1 - \alpha_1 = \frac{9}{10}$.
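The constrained example can be checked with exact arithmetic. The following sketch is an illustration under stated assumptions, not the authors' computation: `W` uses the expressions for $W_0$ and $W_1$ of a deterministic stationary strategy quoted in this subsection, and `g_star` is obtained by solving the displayed relation of Lemma 5.3.1 for $\tilde R_{\hat m}$, which gives $\tilde R_{\hat m}\mu_{\hat m} = r_{k^*}$ and hence $g^* = r_{k^*} - R_{\hat m}\mu_{\hat m}$. All variable names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

lam = [1, 2, 2, 3, 4]      # arrival rates, original types m = 1..5
mu = [5, 2, 3, 1, 2]       # service rates
R = [2, 10, 3, 12, 6]      # rewards in hundreds of pounds
m_hat, d = 0, F(1, 40)     # constrained type (original type 1, index 0), level d

def W(a):
    """(W0, W1): average reward and time fraction on type m_hat under a."""
    den = F(1) + sum(F(ay * l, m) for ay, l, m in zip(a, lam, mu))
    W0 = sum(ay * l * r for ay, l, r in zip(a, lam, R)) / den
    W1 = F(a[m_hat] * lam[m_hat], mu[m_hat]) / den
    return W0, W1

a1, a2 = (0, 1, 0, 1, 1), (1, 1, 0, 1, 1)
(g, B1), (_, B2) = W(a1), W(a2)

# Weight of the SM-mixture and acceptance probability of the relaxed strategy.
alpha1 = (B2 - d) / (B2 - B1)
load = F(1) + sum(F(l, m) for ay, l, m in zip(a1, lam, mu) if ay)
ratio = F(lam[m_hat], mu[m_hat])
alpha2 = d * load / (ratio * (1 - d))

# Multiplier derived from the relation of Lemma 5.3.1 (assumption, see lead-in).
g_star = g - R[m_hat] * mu[m_hat]

def lagrangian(a):
    W0, W1 = W(a)
    return W0 + g_star * W1

strategies = list(product((0, 1), repeat=5))
best = max(lagrangian(a) for a in strategies)
maximisers = [a for a in strategies if lagrangian(a) == best]
```

In this sketch both $\varphi^*_1(0)$ and $\varphi^*_2(0)$ maximise $W_0 + g^* W_1$ over all 32 accept/reject vectors, and both the mixture with weight `alpha1` and the randomised acceptance with probability `alpha2` meet the constraint with equality.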
5.4 Bibliographical Remarks The study of average CTMDPs dates back to at least the 1960s. Two early works are [168, 213], both dealing with finite models. They actually both obtained a deterministic stationary average uniformly optimal strategy as one of the Blackwell optimal strategies, cf. [25]. An example was constructed in [56] demonstrating that if the state space is denumerable, then a deterministic stationary average uniformly optimal strategy may fail to exist, even if the action space is finite, and the cost rate is bounded. The books [184, 221] collected quite a few pathological examples arising from average DTMDPs. One of the first works on average CTMDPs with a denumerable state space is [144], where the action space is finite and the cost rate is bounded. To show the existence of a deterministic stationary average uniformly optimal strategy, the author of [144] applied the vanishing discount factor approach. For this, the family of relative value functions was assumed to be uniformly bounded. One of the sufficient conditions for this requirement is the minorant condition, as presented in [144]. Under the minorant condition, a special technique reducing the average CTMDP to an equivalent α-discounted CTMDP was developed in [145]. This technique for DTMDPs is ascribed to [208, 209]. The author of [144] constructed a so-called canonical triplet. This term first appeared in the context of CTMDPs in [260], where a characterization was given for a CTMDP in a denumerable state space with a bounded cost rate. This characterization was extended to CTMDPs in a Borel state space with a possibly unbounded cost rate in [177]. The term ‘canonical triplet’ for DTMDPs seems to have been originally suggested in [252]. All the aforementioned works assumed bounded transition rates. 
The condition in [144] was subsequently generalized to allow the family of relative functions to be possibly unbounded, see e.g., [105, 111, 114, 262, 268], which all allow unbounded transition rates. Roughly speaking, these extensions follow two styles or a combination of both. One is to directly restrict the growth of the family of relative value functions, which may be unbounded but must be dominated in a suitable way, and does not require all the deterministic stationary strategies to be stable. A well-known condition in this style was formulated in [218] for denumerable DTMDPs, and the more recent version is in [220]. As demonstrated by [33], the conditions of [218] do not guarantee the optimality equation to be satisfied, but only an optimality inequality. The other style, as suggested in [127] for DTMDPs and further developed by the Dutch school, requires the existence of suitable Lyapunov functions, which guarantees certain stability properties of the controlled process under each deterministic stationary strategy, and the optimality equation to be satisfied. For denumerable DTMDPs, a thorough discussion on Lyapunov functions is in [7]. If the state space is uncountable, then it is more demanding to establish the optimal-
ity equation. For this, either the family of relative value functions is assumed to be equicontinuous, as in [59, 172], or extra stability conditions must be imposed on the controlled process, as in [45, 141, 240] for DTMDPs, and in [107, 140, 152] for stochastic games. A useful observation made in [156, 222] is that, under stationary strategies, the average CTMDP in a denumerable state space with bounded transition rates can be reduced to an equivalent average DTMDP problem, justified by making use of the uniformization argument. Alternatively, the justification can be done by comparing the optimality equations between the CTMDP and DTMDP problems, see [137]. The reduction type argument was further applied in the study of average PDMDPs in [46], where the controlled process needs to be non-explosive. A different method from the vanishing discount factor approach was taken in [149], which was based on investigating the limit points of the family of empirical measures. For average DTMDPs in a denumerable state space, such limit points were called occupation measures in [7], and the expected finite-horizon empirical measures were called state-action frequencies in [8]. Derman is one of the first to investigate the space of such occupation measures and its relation with the stable measures, for finite DTMDPs, which allows one to formulate the finite average DTMDP problem with multiple constraints as a linear program, see [57]. A further development for finite DTMDPs appears in [128]. The extension to denumerable constrained DTMDPs is in [7, 8, 29], and the extension to constrained DTMDP models in a Borel space is in [119]. This method was developed in [75] for constrained CTMDPs with finite state and action spaces, see also [73], and in [100, 109, 116] for constrained CTMDPs with a Borel state space. An average CTMDP can also be formulated as a convex program in the space of so-called strategical measures, different from occupation measures, see [182]. 
After that, the constrained CTMDP was studied via its dual problem. The space of strategical measures was introduced and studied in [180]. Finally, the Lagrange multiplier method was also applied to denumerable CTMDPs with a single constraint in [195, 261]. The vanishing discount factor method was applied to constrained CTMDPs in a denumerable state space in [196]. One can find more relevant references in the survey [108]. Subsection 5.1.1. The material is mainly from [262]. An early version of Lemma 5.1.1, together with its corollary, is given in [14] for a denumerable model with stationary strategies only. A verification theorem was given in [130, 236] for the average optimal control of more general processes. The proof of Lemma 5.1.2 is taken from [34]. The discrete-time version of Conditions 5.1.1 and 5.1.2 is in [80]. The proof of Theorem 5.1.1 is adapted from the reasoning in [241] for discrete-time problems. The origin of Condition 5.1.4 is [217] for discrete-time problems in a denumerable state space with bounded cost, and [218, 219], where the cost is allowed to be unbounded, though conditions of this type can be traced back to [55]. The extension of [218] to discrete-time problems in a Borel state space and a finite action space was done in [203], and the extension to Borel state and compact action spaces was done in [216], where the discrete-time version of Condition 5.1.3 was formulated. Further extensions of [216] to not necessarily compact action spaces were done in [80, 82].
A version of the first part of Proposition 5.1.1 was formulated in [221, p. 165], see also [173]. The original discrete-time version of Example 5.1.2 is a famous one constructed by Cavazos-Cadena in [33], which was also studied in detail in [184, §4.2.6]. Similar calculations were presented in the proof of Proposition 5.1.2. Example 5.1.2 was also presented in [106, 111]. The origin of Condition 5.1.5 in discrete-time is [83, 172], see also [59]. The proof of Theorem 5.1.4 comes from [83] for discrete-time problems. Subsection 5.1.2. The first two theorems in this subsection were not explicitly presented elsewhere. The proof of Theorem 5.1.5 is similar to the one in [114], which is for a $w$-bounded cost rate and under a strong continuity condition, or to the one in [141] dealing with discrete-time problems. The proof of Theorem 5.1.6 is similar to the one in [45], dealing with discrete-time problems, though the conditions here are slightly more general. Lemma 5.1.5 is taken from [269], which is a consequence of Theorem 6.1 of [60]. Lemma 5.1.6 is similar to the one presented in [250]. A slightly weaker version of Theorem 5.1.7 can be found in [114]. For more conditions regarding the growth of the relative value functions, see [114, 247]. Section 5.2. Most of the material is from [109, 116], where similar statements were proved under stronger conditions. The proof of Lemma 5.2.1 is largely from [39]. An uncontrolled version of Example 5.2.1 comes from [171]. In the proof of Proposition 5.2.1, formula (5.31) appears in Proposition 2 of [193]. A version of Theorem 5.2.3 was in [109]. Lemma 5.2.3 was stated without proof in [116]. A version of Theorem 5.2.4 can be found in [116]. The proof of Corollary 5.2.2 made use of the arguments in [198, 199]. Proposition 5.2.2 is well known, and the proof here is similar to the one in [137]. Section 5.3. Example 5.3.1 is similar to the examples in [109, 116]. A discrete-time version of this example is in [132]. 
However, the interpretation here is new. As was mentioned in Sect. 1.4, the Freelancer Dilemma was described in [210, p. 166], where the shape of the strategy solving the unconstrained problem was very briefly introduced. A more general version was investigated in [157]. In the cited works, the model was non-Markov, and was studied using the theory of semi-Markov decision processes. It seems that the detailed investigation of both the unconstrained and constrained Markov versions appears here for the first time.
Chapter 6
The Total Cost Model: General Case
In this chapter, we continue to study the CTMDP model introduced in Sect. 1.1. One of the aims is to show that, after the essential extension of the notion of a strategy, the classes of Markov π-strategies and Markov Poisson-related strategies are still sufficient for solving constrained and unconstrained optimal control problems with total expected costs. In Sect. 6.4, we investigate the important strategies called mixtures. The realizability of different classes of strategies is discussed in Sect. 6.6.
6.1 Description of the General Total Cost Model The primitives of CTMDPs are the same as in Sect. 1.1.1. However, the definition of a control strategy will be essentially extended. Below we describe the generalized control strategies, as well as a number of specific subclasses of strategies, and explain that the strategies introduced in Sect. 1.1.3 are special cases of the generalized control strategies.
6.1.1 Generalized Control Strategies and Their Strategic Measures
When describing the Markov Poisson-related strategies in Sect. 4.1, we introduced the space $\boldsymbol{\Xi} := (\mathbb{R}_+ \times \mathbf{A})^\infty$, which served as the source of randomness of actions on the intervals $(T_{n-1}, T_n]$, $n = 1, 2, \dots$, between the jumps of the controlled process $X(\cdot)$. Below, we generalize this idea by assuming that $(\boldsymbol{\Xi}, \mathcal{B}(\boldsymbol{\Xi}))$ is an arbitrary Borel space chosen by the decision-maker at the very beginning. The sample space $(\Omega, \mathcal{F})$ now depends on $\boldsymbol{\Xi}$ and is constructed similarly to Sect. 1.1.3 as follows.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_6
Having firstly defined the (measurable) Borel space $(\Omega^0, \mathcal{F}^0) := (\boldsymbol{\Xi} \times (\mathbf{X} \times \boldsymbol{\Xi} \times \mathbb{R}_+)^\infty, \mathcal{B}(\boldsymbol{\Xi} \times (\mathbf{X} \times \boldsymbol{\Xi} \times \mathbb{R}_+)^\infty))$, we adjoin all the sequences of the form
\[ (\xi_0, x_0, \xi_1, \theta_1, x_1, \xi_2, \dots, \theta_{m-1}, x_{m-1}, \xi_m, \infty, x_\infty, \xi_\infty, \infty, x_\infty, \xi_\infty, \dots) \]
to $\Omega^0$, where $m \ge 1$ is some integer, $\xi_0, \xi_m \in \boldsymbol{\Xi}$, $x_0 \in \mathbf{X}$, $\xi_i \in \boldsymbol{\Xi}$, $\theta_i \in \mathbb{R}_+$, $x_i \in \mathbf{X}$ for all $1 \le i \le m-1$, and $\xi_\infty \notin \boldsymbol{\Xi}$ is the artificial isolated point. As usual, $\mathbf{X}_\infty := \mathbf{X} \cup \{x_\infty\}$ and $\boldsymbol{\Xi}_\infty := \boldsymbol{\Xi} \cup \{\xi_\infty\}$. After the corresponding modification of the σ-algebra $\mathcal{F}^0$, we obtain the basic sample space $(\Omega, \mathcal{F})$, which is clearly Borel as the countable union of Borel spaces. Below, we use the generic notation
\[ \omega = (\xi_0, x_0, \xi_1, \theta_1, x_1, \xi_2, \theta_2, x_2, \dots) \in \Omega. \]
For each $n \in \{1, 2, \dots\}$, introduce the mapping $\Theta_n : \Omega \to \bar{\mathbb{R}}_+$ by $\Theta_n(\omega) = \theta_n$; for each $n \in \{0, 1, 2, \dots\}$, the mappings $X_n : \Omega \to \mathbf{X}_\infty$ and $\Xi_n : \Omega \to \boldsymbol{\Xi}_\infty$ are defined by $X_n(\omega) = x_n$ and $\Xi_n(\omega) = \xi_n$. As usual, the argument $\omega$ will often be omitted. The increasing sequence of random variables $T_n$, $n \in \mathbb{N}_0$, is defined by
\[ T_n = \sum_{i=1}^{n} \Theta_i; \qquad T_\infty = \lim_{n\to\infty} T_n. \]
As before, $\Theta_n$ (respectively $T_n$, $X_n$) are understood as the sojourn times (respectively the jump moments and the states of the process on the intervals $[T_n, T_{n+1})$). We do not intend to consider the process after $T_\infty$. The isolated point $x_\infty$ will be regarded as absorbing; it appears when and only when $\theta_m = \infty$ for some $m \in \{1, 2, \dots\}$. The meaning of the $\xi_n$-components will be described a bit later. Currently, note only that, for each $n \in \{1, 2, \dots\}$, $\xi_n$ will remain constant on the interval $(T_{n-1}, T_n]$, and affect the distribution of the next pair $(\Theta_n, X_n)$. Finally, for $n \in \{0, 1, \dots\}$,
\[ H_n = (\Xi_0, X_0, \Xi_1, \Theta_1, X_1, \dots, \Xi_n, \Theta_n, X_n) \]
is the $n$-term (random) history. As mentioned above, the capital letters $\Xi, X, \Theta, T, H$ denote random elements; the corresponding lower case letters are for their realizations. The bold letters denote spaces; e.g.,
\[ \mathbf{H}_n = \{(\xi_0, x_0, \xi_1, \theta_1, x_1, \dots, \xi_n, \theta_n, x_n) : (\xi_0, x_0, \xi_1, \theta_1, x_1, \dots) \in \Omega\} \]
is the space of all $n$-term histories ($n \ge 0$). The random measure $\mu$ on $\mathbb{R}_+ \times \boldsymbol{\Xi} \times \mathbf{X}$ with values in $\{0, 1, \dots\} \cup \{\infty\}$ is defined by
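\[ \mu(\omega; \Gamma_R \times \Gamma_\Xi \times \Gamma_X) = \sum_{n\ge 1} I\{T_n(\omega) < \infty\}\,\delta_{(T_n(\omega), \Xi_n(\omega), X_n(\omega))}(\Gamma_R \times \Gamma_\Xi \times \Gamma_X), \qquad (6.1) \]
where $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$, $\Gamma_\Xi \in \mathcal{B}(\boldsymbol{\Xi})$, and $\Gamma_X \in \mathcal{B}(\mathbf{X})$. The right-continuous filtration $\{\mathcal{F}_t\}_{t\ge 0}$ on $(\Omega, \mathcal{F})$ is given by
\[ \mathcal{F}_t = \sigma(H_0) \vee \sigma(\mu((0, s] \times B) : s \le t,\ B \in \mathcal{B}(\boldsymbol{\Xi} \times \mathbf{X})). \qquad (6.2) \]
Following from the definition, one can show that $\mathcal{F}_{T_n} = \sigma(H_n)$. The controlled process $X(\cdot)$ we are interested in is defined by
\[ X(t) := \sum_{n\ge 0} I\{T_n \le t < T_{n+1}\} X_n + I\{T_\infty \le t\} x_\infty, \quad t \in \mathbb{R}^0_+, \qquad (6.3) \]
which takes values in $\mathbf{X}_\infty$ and is right-continuous and $\{\mathcal{F}_t\}_{t\ge 0}$-adapted. The filtration $\{\mathcal{F}_t\}_{t\ge 0}$ gives rise to the predictable σ-algebra $\mathcal{P}r$ on $\Omega \times \mathbb{R}^0_+$ defined by
\[ \sigma(\Gamma \times \{0\}\ (\Gamma \in \mathcal{F}_0),\ \Gamma \times (s, \infty)\ (\Gamma \in \mathcal{F}_{s-},\ s > 0)), \qquad (6.4) \]
where $\mathcal{F}_{s-} :=$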
t 0;
6.2 Detailed Occupation Measures and Sufficient Classes of Strategies
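$m_n(dx \times ds) := E_\gamma^S\left[e^{-\alpha T_{n-1}} \delta_{X_{n-1}}(dx)\right] ds$;
$M_n(da \times dx \times ds) := E_\gamma^S\left[\hat\pi_n(da|s, T_{n-1}, X_{n-1}) e^{-\alpha T_{n-1}} \delta_{X_{n-1}}(dx)\right] ds$;
$M_n(\mathbf{A} \times dx \times ds) = F_n(\mathbf{A}, x, s)\, m_n(dx \times ds)$ (definition of $F_n(\mathbf{A}, x, s)$);
$M_n(da \times dx \times ds) = \pi_n^M(da|x, s)\, M_n(\mathbf{A} \times dx \times ds)$ (definition of $\pi_n^M$);
$F_n(\mathbf{A}, x, t) = e^{-\int_{(0,t]} q_x(\pi_n^M, u)\,du - \alpha t}$, $x \in \mathbf{X}_\infty$, $t > 0$;
$m_n(dx \times ds) = E_\gamma^{\pi^M}\left[e^{-\alpha T_{n-1}} \delta_{X_{n-1}}(dx)\right] ds$;
$m_{\gamma,n}^{\pi^M,\alpha}(\Gamma_X \times \Gamma_A) = m_{\gamma,n}^{S,\alpha}(\Gamma_X \times \Gamma_A)$, $\forall \Gamma_X \in \mathcal{B}(\mathbf{X})$, $\Gamma_A \in \mathcal{B}(\mathbf{A})$.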
Remark 6.2.1 According to Theorem 6.2.1, Markov π-strategies are sufficient for problems (6.9) and (6.10). This and Remark 2.2.1 in turn imply that natural Markov strategies are sufficient for the problems (6.9) and (6.10) either when the cost rates $c_j$ are $\mathbb{R}^0_+$-valued, or when Condition 2.2.5 is satisfied. In this connection, let us recall that π-strategies are usually not realizable.
Corollary 6.2.1 Assume that Condition 1.3.1 is satisfied.
(a) Suppose Condition 1.3.2(a) holds true. Then, for any $S \in \mathbf{S}$, there is a Markov standard ξ-strategy $p^M \in \mathbf{S}^M_\xi$ such that
\[ m_{\gamma,n}^{p^M,\alpha} \ge m_{\gamma,n}^{S,\alpha} \qquad (6.18) \]
on $\mathbf{X} \times \mathbf{A}$ for all $n = 1, 2, \dots$. Hence, Markov standard ξ-strategies are sufficient for solving problem (6.10) with nonpositive cost rates $c_j$. (Such models are called positive.) If Condition 1.3.2(b) is satisfied, then the inequality in (6.18) can be taken as equality and $\mathcal{D}^S = \mathcal{D}^{RaM}$. If $S = \pi^M$ is a Markov π-strategy, the strategy $p^M$ is defined by the following formula:
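\[ p_n^M(\Gamma_A|x_{n-1}) = \frac{1}{1 - e^{-\int_{(0,\infty)} (q_{x_{n-1}}(\pi_n^M, u) + \alpha)\,du}} \int_{(0,\infty)} \int_{\Gamma_A} (q_{x_{n-1}}(a) + \alpha)\,\pi_n^M(da|x_{n-1}, \theta)\, e^{-\int_{(0,\theta]} (q_{x_{n-1}}(\pi_n^M, u) + \alpha)\,du}\, d\theta, \qquad (6.19) \]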
xn−1 = . (b) Suppose Condition 1.3.2(a) (respectively, Condition 1.3.2(b)) is satisfied. Then, for any quasi-stationary π-ξ-strategy S, there is a stationary standard ξps ,α ps ,α S,α S,α (respectively, m γ,n = m γ,n ) for all strategy p s ∈ Ssξ such that m γ,n ≥ m γ,n n = 1, 2, . . ..
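Proof (a) It is sufficient to refer to Theorem 1.3.2, where one can take $S = \pi^M$ without loss of generality by Theorem 6.2.1.
(b) For the quasi-stationary π-ξ-strategy $S$ we introduce the π-strategy $\pi^M$ as in the proof of Theorem 6.2.1:
\[ \hat\pi_n(\Gamma_A|s, T_{n-1}, X_{n-1}) := E_\gamma^S\left[\int_{\boldsymbol{\Xi}} \pi(\Gamma_A|\Xi_0, X_{n-1}, \xi, s)\, e^{-\int_{(0,s]} q_{X_{n-1}}(\Xi_0, \xi, \pi, u)\,du - \alpha s}\, p(d\xi|\Xi_0, X_{n-1}) \,\Big|\, T_{n-1}, X_{n-1}\right], \quad \forall \Gamma_A \in \mathcal{B}(\mathbf{A}),\ s > 0, \]
where
\[ q_x(\xi_0, \xi, \pi, u) := -\int_{\mathbf{A}} q(\{x\}|x, a)\,\pi(da|\xi_0, x, \xi, u); \]
\[ M_n(da \times dx \times ds) := E_\gamma^S\left[\hat\pi_n(da|s, T_{n-1}, X_{n-1}) e^{-\alpha T_{n-1}} \delta_{X_{n-1}}(dx)\right] ds; \]
\[ m_n(dx \times ds) := E_\gamma^S\left[e^{-\alpha T_{n-1}} \delta_{X_{n-1}}(dx)\right] ds. \]
Clearly, one can take the following version of the regular conditional expectation $\hat\pi_n(\Gamma_A|s, t_{n-1}, x_{n-1})$, which is independent of $t_{n-1}$ and of $n$:
\[ \hat\pi(\Gamma_A|s, x) = \int_{\boldsymbol{\Xi}} \int_{\boldsymbol{\Xi}} \pi(\Gamma_A|\xi_0, x, \xi, s)\, e^{-\int_{(0,s]} q_x(\xi_0, \xi, \pi, u)\,du - \alpha s}\, p(d\xi|\xi_0, x)\, p_0(d\xi_0). \]
Now, for each $\Gamma_A \in \mathcal{B}(\mathbf{A})$,
\[ M_n(\Gamma_A \times dx \times ds) = \hat\pi(\Gamma_A|s, x)\, m_n(dx \times ds), \]
i.e., $\hat\pi(\Gamma_A|s, x)$ is the Radon–Nikodym derivative of $M_n(\Gamma_A \times dx \times ds)$ with respect to $m_n(dx \times ds)$. The stochastic kernel
\[ \pi^s(\Gamma_A|x, s) := \frac{\hat\pi(\Gamma_A|s, x)}{\hat\pi(\mathbf{A}|s, x)} \]
satisfies the equality
\[ M_n(\Gamma_A \times dx \times ds) = \pi^s(\Gamma_A|x, s)\left[\hat\pi(\mathbf{A}|s, x)\, m_n(dx \times ds)\right] = \pi^s(\Gamma_A|x, s)\, M_n(\mathbf{A} \times dx \times ds). \]
We proved that the stochastic kernel $\pi_n^M$, as in the proof of Theorem 6.2.1, can be taken independent of $n$, in the form of $\pi^s(\cdot|x, s)$, and hence we obtained the stationary π-strategy $\pi^s$ for which $m_\gamma^{\pi^s,\alpha} = m_\gamma^{S,\alpha}$.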
In order to complete the proof, it is sufficient to notice that, for the stationary π-strategy $\pi^s$, the formula (6.19) provides the $n$-independent stochastic kernel $p^s(da|x)$ defining the desired stationary standard ξ-strategy $p^s \in \mathbf{S}^s_\xi$.
Recall that Markov Poisson-related strategies $S^P$ were introduced in Definition 4.1.1 and $\mathcal{D}^P_\varepsilon$ is the space of sequences of detailed occupation measures for all Markov Poisson-related strategies under a fixed value $\varepsilon > 0$. Each Markov Poisson-related strategy is a ξ-strategy $S = \{\boldsymbol{\Xi} := (\mathbb{R}_+ \times \mathbf{A})^\infty, p_0, \{(p_n, \varphi_n)\}_{n=1}^\infty\}$. The generic notation for $\xi_n$ is $\xi_n = (\tau_0^n, \alpha_0^n, \tau_1^n, \alpha_1^n, \dots)$. The measure $p_0$ is of no importance as $\{(p_n, \varphi_n)\}_{n=1}^\infty$ do not depend on $\xi_0$; so, the component $\xi_0$ is omitted everywhere in the rest of this subsection. The kernels $p_n$ ($n = 1, 2, \dots$) on $\mathcal{B}(\boldsymbol{\Xi})$ given $h_{n-1} \in \mathbf{H}_{n-1}$ are defined as follows:
• $p_n(\tau_0^n = 0|h_{n-1}) = 1$; for $i \ge 1$, $p_n(\tau_i^n \le t|h_{n-1}) = 1 - e^{-\varepsilon t}$;
• for all $k \ge 0$, $p_n(\alpha_k^n \in \Gamma_A|h_{n-1}) = \tilde p_{n,k}(\Gamma_A|x_{n-1})$;
• all the coordinate random variables of $\Xi_n$ are mutually independent under $p_n(\cdot|h_{n-1})$, the components $\tau_i^n$ ($i \ge 1$) are independent of $h_{n-1}$, and the components $\alpha_k^n$ ($k \ge 0$) depend only on $x_{n-1}$.
Finally,
\[ \varphi_n(h_{n-1}, \xi_n, s) = \sum_{k=0}^{\infty} I\{\tau_0^n + \dots + \tau_k^n < s \le \tau_0^n + \dots + \tau_{k+1}^n\}\,\alpha_k^n. \]
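The selector $\varphi_n$ simply reads off the action attached to the current inter-event interval of the Poisson-type clock. The following is a minimal sketch of that rule on a truncated $\xi_n$; the function names, the truncation to finitely many terms, and the callback `draw_action` (a stand-in for sampling $\alpha_k^n$ from $\tilde p_{n,k}(\cdot|x_{n-1})$) are assumptions made only for illustration.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def sample_xi(eps, draw_action, n_terms, rng):
    """Truncated sample of xi_n = (tau_0, alpha_0, tau_1, alpha_1, ...).

    tau_0 = 0 and tau_i ~ Exp(eps) for i >= 1, all independent;
    draw_action(k, rng) is a hypothetical sampler for alpha_k.
    """
    taus = [0.0] + [rng.expovariate(eps) for _ in range(n_terms)]
    alphas = [draw_action(k, rng) for k in range(n_terms)]
    return taus, alphas

def phi(taus, alphas, s):
    """Return alpha_k for the k with tau_0+...+tau_k < s <= tau_0+...+tau_{k+1}."""
    bounds = list(accumulate(taus))   # bounds[k] = tau_0 + ... + tau_k
    i = bisect_left(bounds, s)        # first index with bounds[i] >= s
    if i == 0 or i == len(bounds):
        raise ValueError("s must satisfy 0 < s <= tau_0 + ... + tau_K")
    return alphas[i - 1]

# Deterministic illustration: switching epochs at 0.5, 1.5 and 3.5.
example_taus = [0.0, 0.5, 1.0, 2.0]
example_alphas = ["a", "b", "c"]
chosen = [phi(example_taus, example_alphas, s) for s in (0.3, 0.5, 0.6, 3.5)]
```

The intervals are left-open and right-closed, so at a switching epoch the previous action is still in force, in line with the indicator in the displayed formula.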
This ξ-strategy is Markov in the sense that the stochastic kernels $p_n$ depend only on $x_{n-1}$ and the functions $\varphi_n$ depend only on $\xi_n$ and $s$. Therefore, below we write $p_n(d\xi_n|x_{n-1})$ and $\varphi_n(\xi_n, s)$. Definition 4.1.2 of the detailed occupation measure of a Markov Poisson-related strategy is consistent with Definition 6.2.1: see calculation (4.3). All the statements from Sect. 4.1 remain valid. Nevertheless, below we provide the more direct proof of Theorem 4.1.1, without assuming that $\alpha = 0$. Recall that Theorem 4.1.1 states that, for all $\alpha \ge 0$ and $\varepsilon > 0$, $\mathcal{D}^{ReM} = \mathcal{D}^P_\varepsilon$.
Proof of Theorem 4.1.1 with $\alpha \ge 0$ in explicit presence. For a fixed Markov π-strategy $\pi^M = \{\pi_n^M(da|x_{n-1}, s)\}_{n=1}^\infty$, we define the required Poisson-related strategy $S^P \in \mathbf{S}^P_\varepsilon$ in the following way. Put
n (t, x) := Q n,k (w, x) :=
(0,t]
[qx (πnM , s) + α]ds; ∀ t ≥ 0, x ∈ X, n ≥ 1;
ε(εw)k−1 −εw−n (w,x) e , ∀ w ≥ 0, x ∈ X, k = 1, 2, . . . . (k − 1)!
After that, the strategy S P is given by the following formulae: for n = 1, 2, . . ., p˜ n,0 ( A |x) :=
e−εt−n (t,x)
(0,∞)
A
[qx (a) + ε + α]πnM (da|x, t)dt;
1 (0,∞) Q n,k (w, x)dw εw+n (w,x) × Q n,k (w, x)e
p˜ n,k ( A |x) :=
e−εt−n (t,x) M × [qx (a) + ε + α]πn (da|x, t)dt dw (0,∞)
(w,∞)
A
k = 1, 2, . . .
It is straightforward to check that $\tilde p_{n,k}(\mathbf{A}|x) = 1$ for all $n = 1, 2, \dots$, $k = 0, 1, \dots$. Both control strategies $\pi^M$ and $S^P$ are Markov in the sense that the corresponding stochastic kernels $G_n$ defined by (6.6) depend only on the state $x_{n-1} \in \mathbf{X}_\infty$. For an arbitrarily fixed $x_{n-1} \in \mathbf{X}_\infty$, we define
\[ p_{n,i} := \int_{\mathbf{A}} \int_{(0,\infty)} \varepsilon e^{-\varepsilon w} e^{-q_{x_{n-1}}(a) w - \alpha w}\,dw\; \tilde p_{n,i-1}(da|x_{n-1}) = \int_{\mathbf{A}} \frac{\varepsilon}{q_{x_{n-1}}(a) + \varepsilon + \alpha}\, \tilde p_{n,i-1}(da|x_{n-1}) \qquad (6.20) \]
(cf. (4.6)), i.e., $p_{n,i}$ is the expectation of $e^{-\tau_i^n q_{x_{n-1}}(\alpha_{i-1}^n) - \alpha\tau_i^n}$ with respect to $p_n(d\xi_n|x_{n-1})$. Remember, all the components of $\xi_n$ are mutually independent for a fixed $x_{n-1} \in \mathbf{X}$ under $p_n(\cdot|x_{n-1})$. After that, the equalities
(0,∞)
Q n,k (w, xn−1 )dw =
k
n
pn,i , k = 1, 2, . . . ;
n
(6.21)
i=1
k h(a) p˜ n,k (da|xn−1 ) pn,i (6.22) A q xn−1 (a) + ε + α i=1 = h(a)πnM (da|xn−1 , t)e−n (t,xn−1 ) dt pn (dξ|xn−1 ),
(
k i=1 τi ,
k+1 i=0 τi ]
k = 0, 1, 2, . . . ,
A
where n = 1, 2, . . . and h is an arbitrary nonnegative measurable function on A, can be proved exactly in the same way as the equalities (4.8) and (4.12). Let us prove by induction that EγS
P
−αTn M e I{X n ∈ X } = Eγπ e−αTn I{X n ∈ X } , ∀ X ∈ B(X).
(6.23)
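This equality is obviously valid at $n = 0$ because the initial distribution $\gamma$ is fixed and $T_0 = 0$. Suppose it holds for some $n - 1 \ge 0$. According to the construction of the strategic measure, for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$,
$$E_\gamma^{\pi^M}\left[e^{-\alpha T_n} I\{X_n \in \Gamma_X\}\right] = E_\gamma^{\pi^M}\left[e^{-\alpha T_{n-1}} E_\gamma^{\pi^M}\left[e^{-\alpha\Theta_n} I\{X_n \in \Gamma_X\}\,\middle|\,H_{n-1}\right]\right] = E_\gamma^{\pi^M}\left[e^{-\alpha T_{n-1}} \int_{\mathbb{R}_+} e^{-\alpha t}\,\tilde q(\Gamma_X|X_{n-1},\pi_n^M,t)\,e^{-\int_{(0,t]} q_{X_{n-1}}(\pi_n^M,u)\,du}\,dt\right] \tag{6.24}$$
and
$$e^{-\alpha t_{n-1}} \int_{\mathbb{R}_+} e^{-\alpha t}\,\tilde q(\Gamma_X|x_{n-1},\pi_n^M,t)\,e^{-\int_{(0,t]} q_{x_{n-1}}(\pi_n^M,u)\,du}\,dt \tag{6.25}$$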
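is the regular conditional expectation of $e^{-\alpha T_{n-1}} e^{-\alpha\Theta_n} I\{X_n \in \Gamma_X\}$ with respect to $(T_{n-1}, X_{n-1})$ under the measure $P_\gamma^{\pi^M}$: see the text after Proposition B.1.41. On the other hand,
$$E_\gamma^{S^P}\left[e^{-\alpha T_n} I\{X_n \in \Gamma_X\}\right] = E_\gamma^{S^P}\left[e^{-\alpha T_{n-1}} E_\gamma^{S^P}\left[e^{-\alpha\Theta_n} I\{X_n \in \Gamma_X\}\,\middle|\,H_{n-1}\right]\right] = E_\gamma^{S^P}\left[e^{-\alpha T_{n-1}} \int_{\Xi}\int_{\mathbb{R}_+} e^{-\alpha t}\,\tilde q(\Gamma_X|X_{n-1},\varphi_n(\xi_n,t))\,e^{-\int_{(0,t]} q_{X_{n-1}}(\varphi_n(\xi_n,u))\,du}\,dt\,p_n(d\xi_n|X_{n-1})\right].$$
The regular conditional expectation of $e^{-\alpha T_{n-1}} e^{-\alpha\Theta_n} I\{X_n \in \Gamma_X\}$ with respect to $(T_{n-1}, X_{n-1})$ under the measure $P_\gamma^{S^P}$ equals
$$F(t_{n-1},x_{n-1}) = e^{-\alpha t_{n-1}} \int_{\Xi} \sum_{k=0}^{\infty} \int_{\left(\sum_{i=0}^{k}\tau_i^n,\ \sum_{i=0}^{k+1}\tau_i^n\right]\cap\mathbb{R}_+} e^{-\alpha t}\,\tilde q(\Gamma_X|x_{n-1},\alpha_k^n)\,e^{-\sum_{i=1}^{k}\tau_i^n q_{x_{n-1}}(\alpha_{i-1}^n) - \left(t-\sum_{i=1}^{k}\tau_i^n\right) q_{x_{n-1}}(\alpha_k^n)}\,dt\,p_n(d\xi_n|x_{n-1})$$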
6 The Total Cost Model: General Case
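because $p_n\left(\sum_{i=0}^{\infty}\tau_i^n = \infty\,\middle|\,x_{n-1}\right) = 1$. As before, Fubini's Theorem is used without special reference. Continuing our calculations, we obtain
$$F(t_{n-1},x_{n-1}) = e^{-\alpha t_{n-1}} \int_{\Xi} \sum_{k=0}^{\infty} e^{-\sum_{i=1}^{k}\tau_i^n\left(q_{x_{n-1}}(\alpha_{i-1}^n)+\alpha\right)} \int_{(0,\tau_{k+1}^n]\cap\mathbb{R}_+} e^{-\alpha u}\,\tilde q(\Gamma_X|x_{n-1},\alpha_k^n)\,e^{-q_{x_{n-1}}(\alpha_k^n)u}\,du\,p_n(d\xi_n|x_{n-1})$$
$$= e^{-\alpha t_{n-1}} \sum_{k=0}^{\infty} \prod_{i=1}^{k} p_{n,i} \int_{\mathbf{A}} \int_{(0,\infty)} \frac{1-e^{-(q_{x_{n-1}}(\alpha_k^n)+\alpha)t}}{q_{x_{n-1}}(\alpha_k^n)+\alpha}\,\tilde q(\Gamma_X|x_{n-1},\alpha_k^n)\,\varepsilon e^{-\varepsilon t}\,dt\,\tilde p_{n,k}(d\alpha_k^n|x_{n-1})$$
$$= e^{-\alpha t_{n-1}} \sum_{k=0}^{\infty} \prod_{i=1}^{k} p_{n,i} \int_{\mathbf{A}} \frac{\tilde q(\Gamma_X|x_{n-1},a)}{q_{x_{n-1}}(a)+\varepsilon+\alpha}\,\tilde p_{n,k}(da|x_{n-1}).$$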
The first equality is by the change of variable $u = t - \sum_{i=0}^{k}\tau_i^n$; the second equality holds because all the components of $\xi_n = (\tau_0^n = 0, \alpha_0^n, \tau_1^n, \alpha_1^n, \ldots)$ are mutually independent. The presented calculations are for the case $q_{x_{n-1}}(\alpha_k^n) + \alpha > 0$; the simpler case $q_{x_{n-1}}(\alpha_k^n) + \alpha = 0$ leads to the same final expression since $\tilde q(\Gamma_X|x_{n-1},\alpha_k^n) = 0$ if $q_{x_{n-1}}(\alpha_k^n) = 0$. According to (6.22) with $h(a) := \tilde q(\Gamma_X|x_{n-1},a)$, we obtain
$$F(t_{n-1},x_{n-1}) = e^{-\alpha t_{n-1}} \int_{\Xi}\int_{(0,\infty)}\int_{\mathbf{A}} \tilde q(\Gamma_X|x_{n-1},a)\,\pi_n^M(da|x_{n-1},t)\,e^{-\Lambda_n(t,x_{n-1})}\,dt\,p_n(d\xi|x_{n-1}) = e^{-\alpha t_{n-1}} \int_{\mathbb{R}_+} e^{-\alpha t}\,\tilde q(\Gamma_X|x_{n-1},\pi_n^M,t)\,e^{-\int_{(0,t]} q_{x_{n-1}}(\pi_n^M,s)\,ds}\,dt.$$
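Since the obtained expression for the regular conditional expectation coincides with (6.25), we conclude by induction that equality (6.23) is valid. According to Corollary 4.1.2,
$$m_{\gamma,n}^{S^P,\alpha}(\Gamma_X\times\Gamma_A) = E_\gamma^{S^P}\left[e^{-\alpha T_{n-1}} \int_{\Gamma_X} \delta_{X_{n-1}}(dx) \sum_{k=0}^{\infty} \prod_{i=1}^{k} \int_{\mathbf{A}} \frac{\varepsilon\,\tilde p_{n,i-1}(da|x)}{q_x(a)+\varepsilon+\alpha} \times \int_{\Gamma_A} \frac{\tilde p_{n,k}(da|x)}{q_x(a)+\varepsilon+\alpha}\right].$$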
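Since we have already established equality (6.23), it remains to show that, for all $x_{n-1} \in \mathbf{X}$, $\Gamma_A \in \mathcal{B}(\mathbf{A})$,
$$\sum_{k=0}^{\infty} \prod_{i=1}^{k} \int_{\mathbf{A}} \frac{\varepsilon\,\tilde p_{n,i-1}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon+\alpha} \int_{\Gamma_A} \frac{\tilde p_{n,k}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon+\alpha} = \int_{(0,\infty)} e^{-\Lambda_n(t,x_{n-1})}\,\pi_n^M(\Gamma_A|x_{n-1},t)\,dt: \tag{6.26}$$
see (1.31) for the definition of the detailed occupation measure $m_{\gamma,n}^{\pi^M,\alpha}$. According to (6.20) and (6.22) with $h(a) := I\{a \in \Gamma_A\}$,
$$\prod_{i=1}^{k} \int_{\mathbf{A}} \frac{\varepsilon\,\tilde p_{n,i-1}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon+\alpha} \int_{\mathbf{A}} \frac{I\{a\in\Gamma_A\}\,\tilde p_{n,k}(da|x_{n-1})}{q_{x_{n-1}}(a)+\varepsilon+\alpha} = \prod_{i=1}^{k} p_{n,i} \int_{\mathbf{A}} \frac{I\{a\in\Gamma_A\}}{q_{x_{n-1}}(a)+\varepsilon+\alpha}\,\tilde p_{n,k}(da|x_{n-1}) = \int_{\Xi} \int_{\left(\sum_{i=0}^{k}\tau_i^n,\ \sum_{i=0}^{k+1}\tau_i^n\right]} e^{-\Lambda_n(t,x_{n-1})}\,\pi_n^M(\Gamma_A|x_{n-1},t)\,dt\,p_n(d\xi|x_{n-1})$$
for all $k = 0, 1, \ldots$. Since $p_n\left(\sum_{i=1}^{\infty}\tau_i^n = \infty\,\middle|\,x_{n-1}\right) = 1$, it follows from the above that the left-hand side of (6.26) equals
$$\int_{\Xi}\int_{(0,\infty)} e^{-\Lambda_n(t,x_{n-1})}\,\pi_n^M(\Gamma_A|x_{n-1},t)\,dt\,p_n(d\xi|x_{n-1}) = \int_{(0,\infty)} e^{-\Lambda_n(t,x_{n-1})}\,\pi_n^M(\Gamma_A|x_{n-1},t)\,dt,$$
and $m_{\gamma,n}^{S^P,\alpha} = m_{\gamma,n}^{\pi^M,\alpha}$.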
We have proved that $D^{ReM} \subseteq D_\varepsilon^P$. The reverse inclusion $D_\varepsilon^P \subseteq D^{ReM}$ follows from Theorem 6.2.1. The proof is complete.

In view of Theorem 4.1.1, the following corollary from Theorem 6.2.1 is obvious.

Corollary 6.2.2 For each fixed $\alpha \ge 0$, $\gamma \in \mathcal{P}(\mathbf{X})$, and $\varepsilon > 0$, it holds that $D^S = D^{ReM} = D_\varepsilon^P$. Therefore, Markov Poisson-related strategies are sufficient for the constrained problem (6.10).

As was shown in Sect. 4.1, Markov Poisson-related strategies are realizable; this also follows from the more general statement presented in Sect. 6.6.
The conclusion in Remark 6.2.1 regarding the sufficiency of natural Markov strategies can be strengthened according to the next observation regarding the sufficiency of π-strategies.

Theorem 6.2.2 For each (generalized) π-ξ-strategy $S = \{\Xi, p_0, \{p_n, \pi_n\}_{n=1}^{\infty}\}$, consider the π-strategy $\tilde S = \{\tilde\pi_n\}_{n=1}^{\infty}$, see Table 6.1, defined as follows: for each $n \ge 0$,
$$\tilde\pi_{n+1}(da|x_0,\theta_1,x_1,\ldots,\theta_n,x_n,t) := \frac{E_\gamma^S\left[\pi_{n+1}(da|H_n,\Xi_{n+1},t)\,e^{-\int_{(0,t]} q_{x_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,x_0,\theta_1,\ldots,\theta_n,x_n\right]}{E_\gamma^S\left[e^{-\int_{(0,t]} q_{x_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,x_0,\theta_1,\ldots,\theta_n,x_n\right]}.$$
Here, with slight abuse of notation, $q_{x_n}(\Xi_{n+1},\pi_{n+1},s)$ is understood with $H_n$ replacing $h_n$. Then the marginal distributions under $S$ and under $\tilde S$ coincide, i.e.,
$$P_\gamma^S(t, dy\times da) = P_\gamma^{\tilde S}(t, dy\times da).$$

Proof One can proceed as in the proof of Theorem 1.1.1. (Note that $h_n$ in that proof is $(x_0,\theta_1,x_1,\ldots,\theta_n,x_n)$.) The details are as follows: we have
$$\frac{d}{dt}\ln E_\gamma^S\left[e^{-\int_{(0,t]} q_{x_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,x_0,\theta_1,\ldots,\theta_n,x_n\right] = -\frac{E_\gamma^S\left[q_{x_n}(\Xi_{n+1},\pi_{n+1},t)\,e^{-\int_{(0,t]} q_{x_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,x_0,\theta_1,\ldots,\theta_n,x_n\right]}{E_\gamma^S\left[e^{-\int_{(0,t]} q_{x_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,x_0,\theta_1,\ldots,\theta_n,x_n\right]};$$
and
$$E_\gamma^{\tilde S}\left[I\{X(t)\in dy\}\,\tilde\pi_{n+1}(da|X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n,t-T_n)\,I\{t\in(T_n,T_{n+1}]\}\right]$$
$$= E_\gamma^{\tilde S}\left[I\{X_n\in dy,\,T_n<t\}\,\tilde\pi_{n+1}(da|X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n,t-T_n) \times E_\gamma^S\left[e^{-\int_{(0,t-T_n]} q_{X_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n\right]\right]$$
$$= E_\gamma^{S}\left[I\{X_n\in dy,\,T_n<t\}\,\tilde\pi_{n+1}(da|X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n,t-T_n) \times E_\gamma^S\left[e^{-\int_{(0,t-T_n]} q_{X_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n\right]\right]$$
$$= E_\gamma^{S}\left[I\{X_n\in dy,\,T_n<t\}\,\pi_{n+1}(da|H_n,\Xi_{n+1},t-T_n)\,e^{-\int_{(0,t-T_n]} q_{X_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\right]$$
$$= E_\gamma^{S}\left[I\{X(t)\in dy\}\,\pi_{n+1}(da|H_n,\Xi_{n+1},t-T_n)\,I\{t\in(T_n,T_{n+1}]\}\right],$$
where the second equality holds because of
$$P_\gamma^S(\Theta_{n+1}\in dt,\,X_{n+1}\in dx\,|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n)$$
$$= E_\gamma^S\left[P_\gamma^S(\Theta_{n+1}\in dt,\,X_{n+1}\in dx\,|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n)\,\middle|\,H_n,\Xi_{n+1}\right]$$
$$= E_\gamma^S\left[P_\gamma^S(\Theta_{n+1}\in dt,\,X_{n+1}\in dx\,|\,H_n,\Xi_{n+1})\,\middle|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n\right]$$
$$= E_\gamma^S\left[\tilde q(dx|X_n,\Xi_{n+1},\pi_{n+1},t)\,e^{-\int_{(0,t]} q_{X_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,dt\,\middle|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n\right]$$
$$= E_\gamma^S\left[\tilde q(dx|X_n,\Xi_{n+1},\pi_{n+1},t)\,e^{-\int_{(0,t]} q_{X_n}(\Xi_{n+1},\pi_{n+1},s)\,ds}\,\middle|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n\right]dt$$
$$= P_\gamma^{\tilde S}(\Theta_{n+1}\in dt,\,X_{n+1}\in dx\,|\,X_0,\Theta_1,X_1,\ldots,\Theta_n,X_n),$$
$$P_\gamma^{S}(X_0\in dx) = \gamma(dx) = P_\gamma^{\tilde S}(X_0\in dx)$$
and an inductive argument. The proof is complete.

Note that Theorem 6.2.2 implies that π-strategies are sufficient not only for the total cost problems (6.9) and (6.10), but also for the average cost problems (1.17) and (1.18) with the concerned class of strategies being $S \in \mathbf{S}$. As in Remark 6.2.1, natural Markov strategies are also sufficient for the average cost problems (1.17) and (1.18) out of the class $\mathbf{S}$ either when the cost rates $c_j$ are $\mathbb{R}_+^0$-valued, or when Condition 2.2.5 is satisfied.
6.2.3 Counterexamples

In this subsection, we show that not all subclasses of strategies are sufficient for solving optimal control problems. Examples presented in Sect. 1.3.4 are also relevant in this connection.

Example 6.2.1 This example illustrates that Markov standard ξ-strategies (as well as stationary standard ξ-strategies and stationary π-strategies) are not sufficient in the concerned optimal control problems. Consider the following CTMDP: $\mathbf{X} = \{1, \Delta\}$, $\mathbf{A} = (0, 1]$, $\gamma(1) = 1$, $q(\Delta|1,a) = a$, $c_0(1,a) = a$, $\Delta$ is the cemetery: $q_\Delta(a) \equiv 0$, $c_0(\Delta,a) \equiv 0$; $J = 0$; $\alpha = 0$. In this model, we have a single sojourn time $\Theta = T$, so that the index $n$ is omitted. It is obvious that, for any Markov standard ξ-strategy $p^M$ (which is also stationary),
$$m_\gamma^{p^M,0}(\{1\}\times\Gamma_A) = E_\gamma^{p^M}\left[\int_{(0,T]\cap\mathbb{R}_+} I\{A(t)\in\Gamma_A\}\,dt\right] = \int_{\Gamma_A} p^M(da|1)\cdot\frac{1}{a}$$
and
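$$W_0^0(p^M,\gamma) = E_\gamma^{p^M}\left[\int_{(0,T]\cap\mathbb{R}_+} A(t)\,dt\right] = \int_{\mathbf{A}} a\,m_\gamma^{p^M,0}(\{1\}\times da) = \int_{\mathbf{A}} a\,\frac{1}{a}\,p^M(da|1) = 1.$$
For an arbitrary stationary π-strategy $\pi^s$, we similarly obtain
$$m_\gamma^{\pi^s,0}(\{1\}\times\Gamma_A) = \frac{\pi^s(\Gamma_A|1)}{\int_{\mathbf{A}} a\,\pi^s(da|1)},$$
because here $\Theta$ is the exponential random variable with parameter $\int_{\mathbf{A}} a\,\pi^s(da|1)$ (see (6.5)), and
$$W_0^0(\pi^s,\gamma) = \int_{\mathbf{A}} a\,m_\gamma^{\pi^s,0}(\{1\}\times da) = 1.$$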
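On the other hand, under an arbitrarily fixed $\kappa > 0$, for the deterministic Markov strategy $\varphi(1,s) = e^{-\kappa s}$, the (first) sojourn time $\Theta = T$ has the cumulative distribution function (CDF) $1 - e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}$, so that $P_\gamma^\varphi(\Theta = \infty) = e^{-\frac{1}{\kappa}}$. Under an arbitrarily fixed $U \in (0, 1]$ we have
$$m_\gamma^{\varphi,0}(\{1\}\times(U,1]) = E_\gamma^\varphi\left[\int_{(0,\Theta]\cap\mathbb{R}_+} I\{e^{-\kappa u}\in(U,1]\}\,du\right] = E_\gamma^\varphi\left[\int_{[e^{-\kappa\Theta},1)\cap\mathbb{R}_+} I\{y\in(U,1]\}\,\frac{dy}{\kappa y}\right]$$
$$= \frac{1}{\kappa}\int_{\left(-\frac{\ln U}{\kappa},\infty\right)} [-\ln U]\left(e^{-\kappa\theta}\cdot e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}\right)d\theta + \frac{1}{\kappa}\int_{\left(0,-\frac{\ln U}{\kappa}\right]} \kappa\theta\left(e^{-\kappa\theta}\cdot e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}\right)d\theta + \frac{1}{\kappa}[-\ln U]\cdot e^{-\frac{1}{\kappa}}$$
$$= \frac{1}{\kappa}[-\ln U]\left(-e^{-\frac{1}{\kappa}} + e^{\frac{U-1}{\kappa}}\right) + \left.\theta\left(1-e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}\right)\right|_0^{-\frac{\ln U}{\kappa}} - \int_{\left(0,-\frac{\ln U}{\kappa}\right]} \left(1-e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}\right)d\theta + \frac{1}{\kappa}[-\ln U]\cdot e^{-\frac{1}{\kappa}}$$
$$= \int_{\left(0,-\frac{\ln U}{\kappa}\right]} e^{\frac{-1+e^{-\kappa\theta}}{\kappa}}\,d\theta = \int_{(U,1]} \frac{e^{\frac{-1+a}{\kappa}}}{\kappa a}\,da,$$
so that the measure $m_\gamma^{\varphi,0}(\{1\}\times da)$ is absolutely continuous with respect to the Lebesgue measure, the density being $\frac{e^{\frac{-1+a}{\kappa}}}{\kappa a}$, and
$$W_0^0(\varphi,\gamma) = \int_{\mathbf{A}} a\,m_\gamma^{\varphi,0}(\{1\}\times da) = 1 - e^{-\frac{1}{\kappa}}. \tag{6.27}$$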
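The closed form (6.27) can be checked numerically. The following is a minimal sketch, assuming only the density $e^{(-1+a)/\kappa}/(\kappa a)$ derived above; the helper names `simpson` and `total_cost_phi` are hypothetical and introduced purely for illustration.

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def total_cost_phi(kappa):
    """Integral over (0, 1] of a * density(a), with density(a) = exp((a-1)/kappa)/(kappa*a).

    The factor a cancels the 1/a in the density, so the integrand is smooth.
    """
    return simpson(lambda a: math.exp((a - 1.0) / kappa) / kappa, 0.0, 1.0)

for kappa in (0.5, 1.0, 10.0, 100.0):
    # numeric value next to the closed form 1 - exp(-1/kappa) from (6.27)
    print(kappa, total_cost_phi(kappa), 1.0 - math.exp(-1.0 / kappa))
```

The printed pairs should agree to many digits, and the values shrink as $\kappa$ grows, in line with the total cost under $\varphi$ being strictly below the value 1 obtained for every Markov standard ξ-strategy and every stationary π-strategy.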
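According to Corollary 6.2.1(a), there is a Markov standard ξ-strategy $p^M$ such that $m_\gamma^{p^M,0} \ge m_\gamma^{\varphi,0}$. It is given by formula (6.19):
$$p^M(\Gamma_A|1) = \frac{\int_{(0,\infty)}\int_{\Gamma_A} a\cdot\delta_{e^{-\kappa\theta}}(da)\,e^{-\int_{(0,\theta]}\left(\int_{\mathbf{A}} a\cdot\delta_{e^{-\kappa u}}(da)\right)du}\,d\theta}{1 - e^{-\int_{(0,\infty)}\left(\int_{\mathbf{A}} a\cdot\delta_{e^{-\kappa u}}(da)\right)du}}.$$
For $\Gamma_A = (0, U]$ we obtain the cumulative distribution function
$$p^M((0,U]|1) = \frac{e^{-\frac{1}{\kappa}}\left(e^{\frac{U}{\kappa}} - 1\right)}{1 - e^{-\frac{1}{\kappa}}},$$
the density being $\frac{1}{\kappa}\,\frac{e^{-\frac{1-a}{\kappa}}}{1-e^{-\frac{1}{\kappa}}}$, so that
$$m_\gamma^{p^M,0}(\{1\}\times(U,1]) = \int_{(U,1]} \frac{e^{-\frac{1-a}{\kappa}}}{a\kappa\left(1-e^{-\frac{1}{\kappa}}\right)}\,da > m_\gamma^{\varphi,0}(\{1\}\times(U,1]).$$
Moreover, the density of $m_\gamma^{p^M,0}(\{1\}\times da)$ equals $\frac{e^{\frac{-1+a}{\kappa}}}{\kappa a\left(1-e^{-\frac{1}{\kappa}}\right)}$ and is bigger than $\frac{e^{\frac{-1+a}{\kappa}}}{\kappa a}$, the density of $m_\gamma^{\varphi,0}(\{1\}\times da)$.

Let us construct the Markov Poisson-related ξ-strategy $S^P$ such that $m_\gamma^{S^P,0} = m_\gamma^{\varphi,0}$ by applying Corollary 6.2.2 and the proof of Theorem 4.1.1 presented in Sect. 6.2.2. As before, the index $n$ is omitted. Since, for the strategy $\varphi$, $\pi^M(da|1,s) = \delta_{e^{-\kappa s}}(da)$,
$$\Lambda(t,1) = \int_{(0,t]} e^{-\kappa s}\,ds = \frac{1-e^{-\kappa t}}{\kappa}$$
and
$$Q_k(w,1) = \frac{\varepsilon^k w^{k-1}}{(k-1)!}\,e^{-\varepsilon w-\frac{1-e^{-\kappa w}}{\kappa}}, \quad k = 1, 2, \ldots.$$
Furthermore, for each $U \in (0, 1]$,
$$\tilde p_0((0,U]|1) = \int_{(0,\infty)} e^{-\varepsilon t-\frac{1-e^{-\kappa t}}{\kappa}} \int_{(0,1]} I\{a\in(0,U]\}(a+\varepsilon)\,\delta_{e^{-\kappa t}}(da)\,dt = \int_{(0,1)} e^{\frac{\varepsilon\ln a}{\kappa}-\frac{1-a}{\kappa}}\,I\{a\in(0,U]\}(a+\varepsilon)\,\frac{1}{\kappa a}\,da,$$
where the last equality is by the change of variable $a = e^{-\kappa t}$. Thus, on the first interval $(0, \tau_1]$, one should choose $\alpha_0$ using the density function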
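$$\frac{1}{\kappa}\left(1+\frac{\varepsilon}{a}\right) a^{\frac{\varepsilon}{\kappa}}\,e^{\frac{a-1}{\kappa}}, \quad a \in (0, 1].$$
On the interval $\left(\sum_{i=1}^{k}\tau_i,\ \sum_{i=1}^{k+1}\tau_i\right]$ ($k = 1, 2, \ldots$), the cumulative distribution function of $\alpha_k$ is given by
$$\tilde p_k((0,U]|1) = \frac{1}{\int_{(0,\infty)} \frac{\varepsilon^k w^{k-1}}{(k-1)!}\,e^{-\varepsilon w-\frac{1-e^{-\kappa w}}{\kappa}}\,dw} \times \int_{(0,\infty)} \frac{\varepsilon^k w^{k-1}}{(k-1)!} \int_{(w,\infty)} e^{-\varepsilon t-\frac{1-e^{-\kappa t}}{\kappa}} \int_{(0,1]} I\{a\in(0,U]\}(a+\varepsilon)\,\delta_{e^{-\kappa t}}(da)\,dt\,dw, \quad U \in (0, 1].$$
The numerator can be rearranged as follows:
$$\int_{(0,\infty)} \frac{\varepsilon^k w^{k-1}}{(k-1)!} \int_{(w,\infty)} e^{-\varepsilon t-\frac{1-e^{-\kappa t}}{\kappa}}\,I\{e^{-\kappa t}\in(0,U]\}(e^{-\kappa t}+\varepsilon)\,dt\,dw$$
$$= \int_{(0,\infty)} \frac{\varepsilon^k w^{k-1}}{(k-1)!} \int_{(0,\,e^{-\kappa w}\wedge U]} e^{\frac{\varepsilon\ln a}{\kappa}-\frac{1-a}{\kappa}}\,(a+\varepsilon)\,\frac{1}{\kappa a}\,da\,dw$$
$$= \int_{(0,U]} \left[\int_{\left(0,-\frac{\ln a}{\kappa}\right]} \frac{\varepsilon^k w^{k-1}}{(k-1)!}\,dw\right] a^{\frac{\varepsilon}{\kappa}}\,e^{\frac{a-1}{\kappa}}\,(a+\varepsilon)\,\frac{1}{\kappa a}\,da$$
$$= \int_{(0,U]} \frac{\varepsilon^k}{k!}\left(-\frac{\ln a}{\kappa}\right)^k a^{\frac{\varepsilon}{\kappa}}\,e^{\frac{a-1}{\kappa}}\,\frac{a+\varepsilon}{\kappa a}\,da.$$
The first equality is by the change of variable $a = e^{-\kappa t}$, and the second equality is by Fubini's Theorem. In the denominator, after integrating by parts, we obtain
$$\int_{(0,\infty)} \frac{\varepsilon^k}{k!}\,w^k\,(\varepsilon+e^{-\kappa w})\,e^{-\varepsilon w-\frac{1-e^{-\kappa w}}{\kappa}}\,dw = \int_{(0,1)} \frac{\varepsilon^k}{k!}\left(-\frac{\ln a}{\kappa}\right)^k a^{\frac{\varepsilon}{\kappa}}\,e^{\frac{a-1}{\kappa}}\,\frac{a+\varepsilon}{\kappa a}\,da,$$
where the last equality is by the change of variable $a = e^{-\kappa w}$. Thus, on the subsequent intervals $(\tau_k, \tau_{k+1}]$, $k = 1, 2, \ldots$, one should choose $\alpha_k$ using the density function
$$\frac{\frac{\varepsilon^k}{k!}\left(-\frac{\ln a}{\kappa}\right)^k a^{\frac{\varepsilon}{\kappa}}\,e^{\frac{a-1}{\kappa}}\,\frac{a+\varepsilon}{\kappa a}}{\int_{(0,1]} \frac{\varepsilon^k}{k!}\left(-\frac{\ln a}{\kappa}\right)^k a^{\frac{\varepsilon}{\kappa}}\,e^{\frac{a-1}{\kappa}}\,\frac{a+\varepsilon}{\kappa a}\,da}, \quad a \in (0, 1].$$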
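The densities obtained for $\alpha_0$ and $\alpha_k$ can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the original argument: integrals over $a \in (0,1]$ are evaluated after the substitution $a = e^{-\kappa t}$ and truncated at $t = 50/\varepsilon$, and all function names are hypothetical. It checks that the density of $\alpha_0$ has total mass 1 and that the integration-by-parts form of the denominator agrees with $\int_{(0,\infty)} Q_k(w,1)\,dw$.

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def alpha0_density(a, eps, kappa):
    """Density of alpha_0 on (0, 1]."""
    return (1.0 + eps / a) * a ** (eps / kappa) * math.exp((a - 1.0) / kappa) / kappa

def alphak_weight(a, k, eps, kappa):
    """Unnormalised density of alpha_k (k >= 1) on (0, 1]."""
    return (eps ** k / math.factorial(k)) * (-math.log(a) / kappa) ** k \
        * a ** (eps / kappa) * math.exp((a - 1.0) / kappa) * (a + eps) / (kappa * a)

def integrate_over_actions(g, eps, kappa):
    """Integral of g over (0, 1] via a = exp(-kappa*t), truncated at t = 50/eps (an assumption)."""
    t_max = 50.0 / eps
    return simpson(lambda t: g(math.exp(-kappa * t)) * kappa * math.exp(-kappa * t), 0.0, t_max)

def q_k_mass(k, eps, kappa):
    """Direct (truncated) integral of Q_k(w, 1) over w > 0."""
    lam = lambda w: (1.0 - math.exp(-kappa * w)) / kappa
    return simpson(
        lambda w: eps ** k * w ** (k - 1) / math.factorial(k - 1) * math.exp(-eps * w - lam(w)),
        0.0, 50.0 / eps)

eps, kappa = 1.0, 2.0
mass0 = integrate_over_actions(lambda a: alpha0_density(a, eps, kappa), eps, kappa)
denom_by_parts = integrate_over_actions(lambda a: alphak_weight(a, 2, eps, kappa), eps, kappa)
print(mass0, denom_by_parts, q_k_mass(2, eps, kappa))
```

Working in the variable $t$ avoids the integrable singularity of the densities at $a = 0$ that appears when $\varepsilon < \kappa$.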
Finally, it is clear that $\inf_{S\in\mathbf{S}} W_0^0(S,\gamma) = 0$: see (6.27) with $\kappa \to \infty$, but the optimal strategy does not exist because $\Theta > 0$ and $c_0(1,a) > 0$. Note also that, if we extend the action space to $[0, 1]$ and keep the functions $q$ and $c_0$ continuous, i.e., $q(\Delta|1,0) = c_0(1,0) = 0$, then the stationary deterministic strategy $\varphi^*(x) = 0$ is optimal with $W_0^0(\varphi^*,\gamma) = 0$.

Example 6.2.2 This example shows that, if a π-strategy $\pi^s$ is stationary, then the occupation measures $\{m_{\gamma,n}^{\pi^s,0}\}_{n=1}^{\infty}$ may not be generated by a stationary standard ξ-strategy. The reverse statement is also true: not every sequence $\{m_{\gamma,n}^{p^s,0}\}_{n=1}^{\infty}$ coming from a stationary standard ξ-strategy $p^s$ can be generated by a stationary π-strategy. Let $\mathbf{X} = \{1, \Delta\}$, $\mathbf{A} = \{a_1, a_2\}$, $\gamma(1) = 1$, $q(\Delta|1,a_1) = \lambda > 0$, $q(\Delta|1,a_2) = 0$; $\Delta$ is the cemetery: $q_\Delta(a) \equiv 0$, and $\alpha = 0$. Since we have a single sojourn time $\Theta = T$, the index $n$ is omitted. For an arbitrary stationary π-strategy $\pi^s$ we have,

either $m_\gamma^{\pi^s,0}(1,a_1) \in (0,\infty)$ and $m_\gamma^{\pi^s,0}(1,a_2) \in (0,\infty)$ (if $\pi^s(a_1|1) \in (0,1)$),
or $m_\gamma^{\pi^s,0}(1,a_1) = 0$ and $m_\gamma^{\pi^s,0}(1,a_2) = \infty$ (if $\pi^s(a_1|1) = 0$),
or $m_\gamma^{\pi^s,0}(1,a_1) = \frac{1}{\lambda}$ and $m_\gamma^{\pi^s,0}(1,a_2) = 0$ (if $\pi^s(a_1|1) = 1$).

Here, $\Theta$ is the exponential random variable with parameter $\lambda\pi^s(a_1|1)$ (see (6.5)). For an arbitrary stationary standard ξ-strategy $p^s$ we have

either $m_\gamma^{p^s,0}(1,a_1) = \frac{p^s(a_1|1)}{\lambda} \in (0,\infty)$ and $m_\gamma^{p^s,0}(1,a_2) = \infty$ (if $p^s(a_1|1) \in (0,1)$),
or $m_\gamma^{p^s,0}(1,a_1) = 0$ and $m_\gamma^{p^s,0}(1,a_2) = \infty$ (if $p^s(a_1|1) = 0$),
or $m_\gamma^{p^s,0}(1,a_1) = \frac{1}{\lambda}$ and $m_\gamma^{p^s,0}(1,a_2) = 0$ (if $p^s(a_1|1) = 1$).

If, for a stationary standard ξ-strategy $p^s$, $p^s(a_1|1) \in (0,1)$, then $m_\gamma^{p^s,0}$ cannot be generated by a stationary π-strategy. If $\pi^s(a_1|1) \in (0,1)$, then $m_\gamma^{\pi^s,0}$ cannot be generated by a stationary standard ξ-strategy.
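The case analysis of Example 6.2.2 can be written out in closed form. The sketch below is illustrative only; the function names are hypothetical, and it assumes the interpretation used in the example: a stationary π-strategy randomizes continuously (so the sojourn time is exponential with parameter $\lambda\pi^s(a_1|1)$), while a stationary standard ξ-strategy draws one action and keeps it for the whole sojourn time.

```python
import math

def occupation_pi(p, lam):
    """(m(1,a1), m(1,a2)) for a stationary pi-strategy with pi(a1|1) = p."""
    if p == 0.0:
        return 0.0, math.inf           # state 1 is never left
    mean_sojourn = 1.0 / (lam * p)     # Theta ~ Exp(lam * p)
    return p * mean_sojourn, (1.0 - p) * mean_sojourn

def occupation_xi(p, lam):
    """(m(1,a1), m(1,a2)) for a stationary standard xi-strategy with p(a1|1) = p."""
    m1 = p / lam                       # a1 is drawn w.p. p and then held for an Exp(lam) time
    m2 = math.inf if p < 1.0 else 0.0  # a2 is held forever once drawn
    return m1, m2

print(occupation_pi(0.5, 2.0))   # both components finite
print(occupation_xi(0.5, 2.0))   # second component infinite
```

For $p \in (0,1)$ the two pairs can never coincide: the π-strategy yields two finite values, whereas the ξ-strategy yields an infinite second component.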
6.3 Reduction to DTMDP

In Chap. 4, it was explained that, often enough, solutions to problems (1.15) or (1.16) can be obtained from the solutions to the corresponding problems in the associated DTMDP. In the present subsection, we develop the same ideas for the problems (6.9) and (6.10).
If the model is discounted ($\alpha > 0$), then, by Corollary 6.2.1(a), $D^S = D^{RaM}$ and solving the problems (6.9) or (6.10) in the class $\mathbf{S}_\xi^M$ (equivalently, in the whole class $\mathbf{S}$), i.e., solving the problems (6.16) or (6.17) in the space $D^{RaM}$, is equivalent to solving the corresponding problems for the associated DTMDP: see Theorem 4.2.3.

However, if $\alpha = 0$, the situation is more delicate. As was said at the beginning of Sect. 4.2.2, one must be sure that there is a Markov Poisson-related strategy $S^{P*}$ satisfying equation (4.25) for the given strategy $\sigma^*$, a solution to the corresponding problem for the associated DTMDP. That was the case in all the statements and examples in Sects. 4.2.2 and 4.2.3. Below, we show that such a strategy $S^{P*}$ always exists. This result was announced in Remark 4.2.1.

Recall that, for the given CTMDP $(\mathbf{X}, \mathbf{A}, q, \{c_j\}_{j=0}^{J})$, we consider the following DTMDP $(\mathbf{X}, \mathbf{A}, p, \{l_j\}_{j=0}^{J})$:
• $\mathbf{X}$ and $\mathbf{A}$ are the state and action spaces, the same as those in the CTMDP;
• the transition probability is given by $p(\Gamma|x,a) = \dfrac{\tilde q(\Gamma|x,a) + \varepsilon I\{x\in\Gamma\}}{q_x(a)+\varepsilon}$;
• the cost functions are $l_j(x,a) = \dfrac{c_j(x,a)}{q_x(a)+\varepsilon}$.
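For a model with finitely many states and actions, the construction of the associated DTMDP listed above can be sketched as follows. This is a minimal illustration under assumptions not made in the text: the nested-dictionary layout and the name `build_dtmdp` are hypothetical, and `q_tilde[x][a][y]` is assumed to hold the transition rate from $x$ to $y \ne x$ under action $a$.

```python
def build_dtmdp(q_tilde, cost, eps):
    """Return (p, l) with p[x][a][y] = (q_tilde + eps*I{y == x}) / (q_x(a) + eps)
    and l[x][a] = cost[x][a] / (q_x(a) + eps)."""
    p, l = {}, {}
    for x, by_action in q_tilde.items():
        p[x], l[x] = {}, {}
        for a, rates in by_action.items():
            denom = sum(rates.values()) + eps          # q_x(a) + eps
            row = {y: r / denom for y, r in rates.items()}
            row[x] = row.get(x, 0.0) + eps / denom     # fictitious jump back to x
            p[x][a] = row
            l[x][a] = cost[x][a] / denom
    return p, l

# Two-state illustration with a cemetery "Delta" (rates chosen arbitrarily).
q_tilde = {1: {"a1": {"Delta": 2.0}, "a2": {}}, "Delta": {"a1": {}, "a2": {}}}
cost = {1: {"a1": 1.0, "a2": 0.0}, "Delta": {"a1": 0.0, "a2": 0.0}}
p, l = build_dtmdp(q_tilde, cost, eps=1.0)
print(p[1]["a1"], l[1]["a1"])
```

Each row of `p` sums to one for every `eps > 0`, including state-action pairs with zero total jump rate, which is why the positive constant is added in the denominator.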
Remember, $\varepsilon > 0$ is arbitrarily fixed. Corollary 4.2.1 states that, for each Markov Poisson-related strategy $S^P$ in the CTMDP, there exists a strategy $\sigma$ in the DTMDP such that, for all nonnegative measurable functions $f$ on $\mathbf{X}\times\mathbf{A}$, the equality
$$\sum_{n=1}^{\infty} \int_{\mathbf{X}\times\mathbf{A}} f(x,a)\,m_{\gamma,n}^{S^P,0}(dx\times da) = E_\gamma^{\sigma}\left[\sum_{j=1}^{\infty} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right] \tag{6.28}$$
holds for each initial distribution $\gamma$. Like in Sect. 4.2, the controlled and controlling processes in the DTMDP are denoted as $\{Y_j\}_{j=0}^{\infty}$ and $\{B_j\}_{j=1}^{\infty}$ to distinguish them from the corresponding elements of the CTMDP. Below, we establish the reverse statement to Corollary 4.2.1.

Lemma 6.3.1 For an arbitrarily fixed $\varepsilon > 0$, for each strategy $\sigma$ in the DTMDP, there exists a Markov Poisson-related strategy $S^P$ in the CTMDP such that, for all nonnegative measurable functions $f$ on $\mathbf{X}\times\mathbf{A}$, equality (6.28) holds for each initial distribution $\gamma$.

Proof According to Proposition C.1.4, without loss of generality, we assume that $\sigma = \sigma^M = \{\sigma_m^M(db|y)\}_{m=1}^{\infty}$ is a Markov strategy. Firstly, we introduce a past-dependent Poisson-related strategy $\tilde S^P$ satisfying equality (6.28). Below, the value of $\varepsilon > 0$ is arbitrarily fixed. The component $\xi_0$ and the measure $p_0$ are omitted as they do not play any role; $\Xi := (\mathbb{R}\times\mathbf{A})^{\infty}$ as usual.
For $n = 1$, we put
$$\tilde p_{1,k}(da|h_0) = \tilde p_{1,k}(da|x_0) := \sigma_{1+k}^M(da|x_0), \quad k \ge 0.$$
For $n > 1$, for each history $h_{n-1} = (x_0, \xi_1, \theta_1, x_1, \ldots, \xi_{n-1}, \theta_{n-1}, x_{n-1}) \in \mathbf{H}_{n-1}$, we calculate
$$l_i(h_{n-1}) := 1 + \max\left\{l \ge 0:\ \sum_{j=0}^{l} \tau_j^i < \theta_i\right\}, \quad i = 1, 2, \ldots, n-1,$$
and put
$$\tilde p_{n,k}(da|h_{n-1}) := \sigma^M_{\sum_{i=1}^{n-1} l_i(h_{n-1})+1+k}(da|x_{n-1}), \quad k \ge 0.$$
The functions $l_i$ have the same meaning as those in the proof of Lemma 4.2.1. See also Fig. 6.1. Since, for each $i = 1, 2, \ldots$, $\sum_{j=0}^{\infty}\tau_j^i = \infty$ $P_\gamma^{\tilde S^P}$-a.s., the case $l_i(h_{n-1}) = \infty$ for some $n$, $i$ is of no interest because it implies that $\theta_i = \infty$, i.e., the control process is over. If one constructs the control strategy in the DTMDP, based on $\tilde S^P$, as in the proof of Lemma 4.2.1, then the result will be the original strategy $\sigma^M$. One can adjust the proof of that lemma and obtain equality (6.28). Below, we provide the direct proof.

For each (random) trajectory in the CTMDP under the control strategy $\tilde S^P$,
$$(X_0, \Xi_1, \Theta_1, X_1, \Xi_2, \ldots) \ \text{ with } \ T_n := \sum_{i=1}^{n}\Theta_i \ \text{ and } \ \Xi_n = (\tau_0^n = 0, \alpha_0^n, \tau_1^n, \alpha_1^n, \ldots), \quad n = 1, 2, \ldots,$$
we introduce the (random) sequence $\{(\mathcal{T}_m, \mathcal{X}_m, \mathcal{A}_{m+1})\}_{m=0}^{\infty}$ in the following way. $\mathcal{T}_0 := 0$; $\mathcal{X}_0 := X_0 = X(\mathcal{T}_0)$; $\mathcal{A}_1 = \alpha_0^1$. If $\mathcal{T}_{m-1} = T_{n-1} + \sum_{j=1}^{l}\tau_j^n$ ($l \ge 0$ is also a random variable), then
Fig. 6.1 Two scenarios illustrating the construction of the past-dependent Poisson-related strategy. The realized values of Tm (Tn ) are denoted as tm (tn ). a n = 4; l1 (h 3 ) = 3, l2 (h 3 ) = 2, l3 (h 3 ) = 1; b n = 3; l1 (h 2 ) = 1, l2 (h 2 ) = 2. The dashed arrows explain which kernels p˜ n,k (da|h n−1 ) are in use on the corresponding intervals
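$$\mathcal{T}_m := \begin{cases} T_n = T_{n-1} + \Theta_n = \mathcal{T}_{m-1} + \Theta_n - \sum_{j=1}^{l}\tau_j^n, & \text{if } \sum_{j=1}^{l+1}\tau_j^n > \Theta_n; \\ \mathcal{T}_{m-1} + \tau_{l+1}^n & \text{otherwise}; \end{cases} \qquad \mathcal{X}_m := X(\mathcal{T}_m); \qquad \mathcal{A}_{m+1} := \begin{cases} \alpha_0^{n+1}, & \text{if } \sum_{j=1}^{l+1}\tau_j^n > \Theta_n; \\ \alpha_{l+1}^n & \text{otherwise}. \end{cases} \tag{6.29}$$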
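Note that $P_\gamma^{\tilde S^P}$-almost surely $\mathcal{T}_m < \mathcal{T}_{m+1} < \infty$ and $\mathcal{X}_m$, $\mathcal{A}_{m+1}$ are well defined for all $m = 0, 1, 2, \ldots$. The strategic measure $P_\gamma^{\tilde S^P}$ gives rise to the probability measure on $(\mathbb{R}_+^0\times\mathbf{X}\times\mathbf{A})^{\infty}$, the space of trajectories of the sequence $\{(\mathcal{T}_m, \mathcal{X}_m, \mathcal{A}_{m+1})\}_{m=0}^{\infty}$, and to the probability measure on $(\mathbf{X}\times\mathbf{A})^{\infty}$, the space of trajectories of the discrete-time process $\{(\mathcal{X}_m, \mathcal{A}_{m+1})\}_{m=0}^{\infty}$. We will show that the latter coincides with $P_\gamma^{\sigma^M}$, the strategic measure in the DTMDP under consideration. Besides, given $(\mathcal{X}_{m-1}, \mathcal{A}_m)$, the random variable $\mathcal{T}_m - \mathcal{T}_{m-1}$ is independent of $\mathcal{X}_m$ ($m \ge 1$). At $m = 0$, the initial distributions of $\mathcal{X}_0$, $X_0$, and $Y_0$ coincide and equal $\gamma$. Recall that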
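$\{(Y_m, B_{m+1})\}_{m=0}^{\infty}$ is the (random) trajectory in the DTMDP. Then
$$P_\gamma^{\tilde S^P}(\mathcal{A}_1\in\Gamma_A\,|\,\mathcal{X}_0 = x_0) = \tilde p_{1,0}(\Gamma_A|x_0) = \sigma_1^M(\Gamma_A|x_0) = P_\gamma^{\sigma^M}(B_1\in\Gamma_A\,|\,Y_0 = x_0), \quad \forall\, x_0\in\mathbf{X},\ \Gamma_A\in\mathcal{B}(\mathbf{A}).$$
If, for some $m \ge 1$, for all $x_{m-1}\in\mathbf{X}$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$,
$$P_\gamma^{\tilde S^P}(\mathcal{A}_m\in\Gamma_A\,|\,\mathcal{X}_{m-1} = x_{m-1}) = \sigma_m^M(\Gamma_A|x_{m-1}) = P_\gamma^{\sigma^M}(B_m\in\Gamma_A\,|\,Y_{m-1} = x_{m-1}),$$
then, for all $x_m\in\mathbf{X}$, $\Gamma_A\in\mathcal{B}(\mathbf{A})$,
$$P_\gamma^{\tilde S^P}(\mathcal{A}_{m+1}\in\Gamma_A\,|\,\mathcal{X}_m = x_m) = \sigma_{m+1}^M(\Gamma_A|x_m) = P_\gamma^{\sigma^M}(B_{m+1}\in\Gamma_A\,|\,Y_m = x_m)$$
according to (6.29) and to the construction of the strategy $\tilde S^P$. Let $\boldsymbol{\Theta}_m := \mathcal{T}_m - \mathcal{T}_{m-1}$, $m = 1, 2, \ldots$. We are going to prove that, for all $m \ge 1$, $x\in\mathbf{X}$, $a\in\mathbf{A}$, $t \ge 0$,
$$E_\gamma^{\tilde S^P}\left[I\{\boldsymbol{\Theta}_m > t\}\,I\{\mathcal{X}_m\in\Gamma\}\,\middle|\,\mathcal{X}_{m-1} = x,\ \mathcal{A}_m = a\right] = \frac{\tilde q(\Gamma|x,a) + \varepsilon I\{x\in\Gamma\}}{q_x(a)+\varepsilon}\,e^{-(q_x(a)+\varepsilon)t}. \tag{6.30}$$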
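Consider the (regular) conditional probability
$$P_\gamma^{\tilde S^P}\left(\boldsymbol{\Theta}_m > t,\ \mathcal{X}_m\in\Gamma\,\middle|\,\mathcal{X}_{m-1} = x,\ \mathcal{A}_m = a,\ T_{n-1} = t_{n-1},\ \mathcal{T}_{m-1} = t_{n-1} + \sum_{j=1}^{l}\tau_j^n\right),$$
where (and below in this proof) we have slightly abused the notation for brevity: $l$, $\tau_j^n$ also denote the given value of the corresponding random variables, and the context will hopefully exclude any potential confusion. Then one may compute
$$P_\gamma^{\tilde S^P}\left(\boldsymbol{\Theta}_m > t,\ \mathcal{X}_m\in\Gamma\,\middle|\,\mathcal{X}_{m-1} = x,\ \mathcal{A}_m = a,\ T_{n-1} = t_{n-1},\ \mathcal{T}_{m-1} = t_{n-1} + \sum_{j=1}^{l}\tau_j^n\right)$$
$$= P_\gamma^{\tilde S^P}\left(\Theta_n - \sum_{j=1}^{l}\tau_j^n > \tau_{l+1}^n > t,\ \mathcal{X}_m\in\Gamma\,\middle|\,\mathcal{X}_{m-1} = x,\ \mathcal{A}_m = a,\ T_{n-1} = t_{n-1},\ \mathcal{T}_{m-1} = t_{n-1} + \sum_{j=1}^{l}\tau_j^n\right)$$
$$+\ P_\gamma^{\tilde S^P}\left(\tau_{l+1}^n > \Theta_n - \sum_{j=1}^{l}\tau_j^n > t,\ \mathcal{X}_m\in\Gamma\,\middle|\,\mathcal{X}_{m-1} = x,\ \mathcal{A}_m = a,\ T_{n-1} = t_{n-1},\ \mathcal{T}_{m-1} = t_{n-1} + \sum_{j=1}^{l}\tau_j^n\right)$$
$$= \delta_x(\Gamma)\,e^{-(\varepsilon+q_x(a))t}\,\frac{\varepsilon}{\varepsilon+q_x(a)} + \frac{\tilde q(\Gamma|x,a)}{\varepsilon+q_x(a)}\,e^{-(\varepsilon+q_x(a))t} = \frac{\tilde q(\Gamma|x,a)+\varepsilon I\{x\in\Gamma\}}{q_x(a)+\varepsilon}\,e^{-(q_x(a)+\varepsilon)t},$$
making use of the fact that $\Theta_n - \sum_{j=1}^{l}\tau_j^n$ and $\tau_{l+1}^n$ are (conditionally) independent exponential random variables, and that on $\{\mathcal{T}_{m-1} = T_{n-1} + \sum_{j=1}^{l}\tau_j^n\}$,
$$\boldsymbol{\Theta}_m = \min\left\{\Theta_n - \sum_{j=1}^{l}\tau_j^n,\ \tau_{l+1}^n\right\}.$$
Since the last expression on the right-hand side of the above chain of equalities does not depend on the shape of $\mathcal{T}_{m-1} = T_{n-1} + \sum_{j=1}^{l}\tau_j^n$ (that is, independent of $m$, $n$, $t_{n-1}$, $l$, $\{\tau_j^n\}_{j=1}^{l}$), we conclude that equality (6.30) is valid.

Now it is clear that the probability measure on the space of trajectories of the process $\{(\mathcal{X}_m, \mathcal{A}_{m+1})\}_{m=0}^{\infty}$, coming from $P_\gamma^{\tilde S^P}$, coincides with $P_\gamma^{\sigma^M}$ and, given $(\mathcal{X}_{m-1}, \mathcal{A}_m)$, the random variable $\boldsymbol{\Theta}_m = \mathcal{T}_m - \mathcal{T}_{m-1}$ is independent of $\mathcal{X}_m$, with the conditional expectation being $\frac{1}{q_{\mathcal{X}_{m-1}}(\mathcal{A}_m)+\varepsilon}$ ($m \ge 1$).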
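The mechanism behind (6.30) is the competition between two independent exponential clocks: the residual natural sojourn time, with rate $q_x(a)$, and the next Poisson epoch, with rate $\varepsilon$. The following Monte Carlo sketch (an illustration with hypothetical names and arbitrarily chosen rates, not part of the proof) checks that the minimum has mean $1/(q_x(a)+\varepsilon)$ and that a real jump occurs first with probability $q_x(a)/(q_x(a)+\varepsilon)$.

```python
import random

def simulate_step(q_rate, eps, n_samples=200_000, seed=0):
    """Estimate the mean of min(natural, poisson) and the fraction of real jumps.

    q_rate plays the role of q_x(a) > 0; eps is the Poisson rate.
    """
    rng = random.Random(seed)
    total_time = 0.0
    real_jumps = 0
    for _ in range(n_samples):
        natural = rng.expovariate(q_rate)   # residual natural sojourn time
        poisson = rng.expovariate(eps)      # next Poisson epoch
        total_time += min(natural, poisson)
        real_jumps += natural < poisson     # True counts as 1
    return total_time / n_samples, real_jumps / n_samples

mean_step, real_fraction = simulate_step(q_rate=2.0, eps=1.0)
print(mean_step, real_fraction)   # compare with 1/3 and 2/3
```

When the Poisson clock rings first, the discrete-time chain stays in the same state, which corresponds to the term $\varepsilon I\{x\in\Gamma\}$ in the transition probability of the associated DTMDP.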
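According to (6.12) and Definition 6.2.1, if $f: \mathbf{X}\times\mathbf{A} \to \mathbb{R}_+^0$ is a bounded function, then
$$\sum_{n=1}^{\infty} \int_{\mathbf{X}\times\mathbf{A}} f(x,a)\,m_{\gamma,n}^{\tilde S^P,0}(dx\times da) = E_\gamma^{\tilde S^P}\left[\int_{(0,\infty)}\int_{\mathbf{X}}\int_{\mathbf{A}} f(x,a)\,\Pi(da|t)\,\delta_{X(t)}(dx)\,dt\right]$$
$$= E_\gamma^{\tilde S^P}\left[\sum_{m=1}^{\infty} f(\mathcal{X}_{m-1},\mathcal{A}_m)(\mathcal{T}_m - \mathcal{T}_{m-1})\right] = E_\gamma^{\tilde S^P}\left[\sum_{m=1}^{\infty} \frac{f(\mathcal{X}_{m-1},\mathcal{A}_m)}{q_{\mathcal{X}_{m-1}}(\mathcal{A}_m)+\varepsilon}\right] = E_\gamma^{\sigma^M}\left[\sum_{j=1}^{\infty} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right],$$
where the second equality holds according to the constructions of the sequence $\{(\mathcal{T}_m, \mathcal{X}_m, \mathcal{A}_{m+1})\}_{m=0}^{\infty}$ and of the strategy $\tilde S^P$, and the third equality is valid because
$$E_\gamma^{\tilde S^P}\left[\mathcal{T}_m - \mathcal{T}_{m-1}\,\middle|\,\mathcal{X}_{m-1}, \mathcal{A}_m\right] = \frac{1}{q_{\mathcal{X}_{m-1}}(\mathcal{A}_m)+\varepsilon}$$
due to (6.30).

According to Theorem 4.1.1, there exists a Markov Poisson-related strategy $S^P$ such that, for all $n = 1, 2, \ldots$, $m_{\gamma,n}^{S^P,0} = m_{\gamma,n}^{\tilde S^P,0}$. As a consequence, we have
$$\sum_{n=1}^{\infty} \int_{\mathbf{X}\times\mathbf{A}} f(x,a)\,m_{\gamma,n}^{S^P,0}(dx\times da) = E_\gamma^{\sigma^M}\left[\sum_{j=1}^{\infty} \frac{f(Y_{j-1},B_j)}{q_{Y_{j-1}}(B_j)+\varepsilon}\right].$$
In order to generalise the obtained equality to an arbitrary nonnegative measurable function $f$, it is sufficient to consider $f \wedge K$ and pass to the limit as $K \to \infty$, like in the proof of Corollary 4.2.1.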
6.4 Mixtures of Strategies and Convexity of Spaces of Strategic and Occupation Measures

As was mentioned in Sect. 6.1.1, the $p_0$-component of a π-ξ-strategy $S = \{\Xi, p_0, \{(p_n, \pi_n)\}_{n=1}^{\infty}\}$ is responsible for the mixtures of control strategies. In the current section, we discuss this in depth. Below, we assume that on the interval $[0, 1]$, the Borel σ-algebra is fixed; the initial distribution $\gamma$ and the discount factor $\alpha \ge 0$ are also assumed to be fixed.
6.4.1 Properties of Strategic Measures

6.4.1.1 Generalized π-ξ-Strategies

Definition 6.4.1 Suppose, to each $\beta \in [0, 1]$, there corresponds a (generalized) π-ξ-strategy
$$S^\beta = \{\Xi, p_0^\beta, \{(p_n^\beta, \pi_n^\beta)\}_{n=1}^{\infty}\}$$
with the common spaces $\Xi$, $(\Omega, \mathcal{F})$, and, for each $n = 1, 2, \ldots$, the kernels $p_n^\beta(d\xi_n|h_{n-1})$ and $\pi_n^\beta(da|h_{n-1},\xi_n,s)$ are (jointly) measurable mappings with respect to all their arguments including $\beta \in [0, 1]$. Similarly, $p_0^\beta(d\xi_0)$ is a stochastic kernel on $\mathcal{B}(\Xi)$ given $\beta \in [0, 1]$. Then a (generalized) π-ξ-strategy $S \in \mathbf{S}$ is called the mixture of $\{S^\beta\}_{\beta\in[0,1]}$ with the weights distribution $\nu \in \mathcal{P}([0, 1])$ if it satisfies the following:
• $S = \{\hat\Xi := [0, 1]\times\Xi, p_0, \{(p_n, \pi_n)\}_{n=1}^{\infty}\}$ with the generic notation $\hat\xi = (\beta, \xi)$ for $\hat\xi \in \hat\Xi$ being in use. Let $\hat\Omega$ and $\hat{\mathbf{H}}_n$ be the sample space and the space of histories for this strategy $S$ (when $\hat\Xi$ is in use).
• $p_0(d\hat\xi_0) := \nu(d\beta)\,p_0^\beta(d\xi)$.
• For all $n = 1, 2, \ldots$, $p_n(d\hat\xi_n|\hat h_{n-1}) := p_n^\beta(d\xi_n|h_{n-1}(\hat h_{n-1}))\,\delta_0(d\beta_n)$; the components $\beta_n = 0$, $n = 1, 2, \ldots$, are immaterial and can be omitted from consideration; $\pi_n(da|\hat h_{n-1},\hat\xi_n,s) := \pi_n^\beta(da|h_{n-1}(\hat h_{n-1}),\xi_n,s)$. Here, for $\hat h_{n-1} = (\beta, \xi_0, x_0, \xi_1, \theta_1, x_1, \ldots, \xi_{n-1}, \theta_{n-1}, x_{n-1})$, $h_{n-1}(\hat h_{n-1}) := (\xi_0, x_0, \xi_1, \theta_1, x_1, \ldots, \xi_{n-1}, \theta_{n-1}, x_{n-1})$.

Therefore, with slight abuse of notation, we may simply take the following sample space under the strategy $S$:
$$\hat\Omega := \{\hat\omega = (\beta, \xi_0, x_0, \xi_1, \theta_1, x_1, \ldots)\}, \tag{6.31}$$
where $(\beta, \xi_0) = \hat\xi_0$. The requirement about the identical spaces $\Xi$ in Definition 6.4.1 is not very restrictive in view of Remark 6.1.2.

Remark 6.4.1 In the present section $P_\gamma^S(\cdot|\beta)$ denotes the (regular) conditional distribution. With some abuse of notation, $P_\gamma^S(\Gamma|\beta)$ means the value of this conditional measure on the set
$$\{\hat\omega = (\beta, \xi_0, x_0, \xi_1, \theta_1, x_1, \ldots):\ (\xi_0, x_0, \xi_1, \theta_1, x_1, \ldots) \in \Gamma\},$$
where $\Gamma \in \mathcal{F}$. To be more rigorous, one must say that $P_\gamma^S(\cdot|\beta)$ is the image of the conditional distribution with respect to the mapping $\omega(\hat\omega) := (\xi_0, x_0, \xi_1, \theta_1, x_1, \ldots)$, where $\hat\omega = (\beta, \xi_0, x_0, \xi_1, \theta_1, x_1, \ldots)$.

Theorem 6.4.1 For a fixed initial distribution $\gamma$ and the strategies $S^\beta$ and $S$ as in Definition 6.4.1,
$$P_\gamma^{S^\beta}(\Gamma) = P_\gamma^S(\Gamma|\beta) \quad \nu\text{-a.s.} \quad \forall\,\Gamma\in\mathcal{F} \tag{6.32}$$
and hence the marginal $\hat P$ of the strategic measure $P_\gamma^S$ on $\Omega = \{\omega = (\xi_0, x_0, \xi_1, \theta_1, x_1, \ldots)\}$ satisfies the equality
$$\hat P(\Gamma) = \int_{[0,1]} P_\gamma^S(d\beta\times\Gamma) = \int_{[0,1]} P_\gamma^{S^\beta}(\Gamma)\,\nu(d\beta) \quad \forall\,\Gamma\in\mathcal{F}. \tag{6.33}$$

Remark 6.4.2 In view of equality (6.33), one can say that the mixture $S$ replicates the measure $\int_{[0,1]} P_\gamma^{S^\beta}(\cdot)\,\nu(d\beta)$ on $(\Omega, \mathcal{F})$. See also Definition 6.4.3.

Proof of Theorem 6.4.1 First of all, note that the mapping $\beta \to P_\gamma^{S^\beta}$ is measurable according to the construction of the strategic measure, in the sense that this mapping defines a stochastic kernel on $\mathcal{F}$ given $\beta \in [0, 1]$. Indeed, in the space $\Omega$, associated with the strategies $S^\beta$, we have the product σ-algebra as usual, and the kernels $G_n^\beta$ (6.6), built for the strategies $S^\beta$, are measurable with respect to $(\beta, h_{n-1})$. Thus, for each $n \ge 0$, the marginals on $\mathbf{H}_n$ of the measures $P_\gamma^{S^\beta}$ are measurable in $\beta$, i.e., those marginals form the stochastic kernel on $\mathcal{B}(\mathbf{H}_n)$ given $\beta \in [0, 1]$ (more rigorously, one should use induction, Proposition B.1.34 and the remarks after it). Hence, the mapping $\beta \to P_\gamma^{S^\beta}$ is measurable according to the Monotone Class Theorem (see Proposition B.1.42): the class of subsets $\Gamma \subset \Omega$, for which the mapping $\beta \to P_\gamma^{S^\beta}(\Gamma)$ is measurable, is a monotone class and includes $\Gamma_H\times(\Xi_\infty\times\bar{\mathbb{R}}_+\times\mathbf{X}_\infty)^{\infty}$ for all $\Gamma_H \in \mathcal{B}(\mathbf{H}_n)$, $n = 0, 1, \ldots$.

Next, we verify (6.32) and (6.33) as follows. For $n = 0$, for all $\Gamma_\Xi \in \mathcal{B}(\Xi_\infty)$ and $\Gamma_X \in \mathcal{B}(\mathbf{X}_\infty)$, it holds that
$$P_\gamma^S(\Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X\,|\,\beta)\,\nu(d\beta) = p_0^\beta(\Gamma_\Xi\cap\Xi)\,\gamma(\Gamma_X\cap\mathbf{X})\,\nu(d\beta) = P_\gamma^{S^\beta}(\Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X)\,\nu(d\beta)$$
and consequently, for each $\Gamma_\beta \in \mathcal{B}([0, 1])$, one has
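$$\hat P(\beta_0\in\Gamma_\beta,\ \Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X) = \int_{\Gamma_\beta} P_\gamma^S(\Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X\,|\,\beta)\,\nu(d\beta) = \int_{\Gamma_\beta} p_0^\beta(\Gamma_\Xi\cap\Xi)\,\gamma(\Gamma_X\cap\mathbf{X})\,\nu(d\beta) = \int_{\Gamma_\beta} P_\gamma^{S^\beta}(\Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X)\,\nu(d\beta).$$
Therefore,
$$P_\gamma^{S^\beta}(\Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X) = P_\gamma^S(\Xi_0\in\Gamma_\Xi,\ X_0\in\Gamma_X\,|\,\beta) \quad \nu\text{-a.s.};$$
and consequently,
$$\hat P(H_0\in\Gamma_H) = \int_{[0,1]} P_\gamma^{S^\beta}(H_0\in\Gamma_H)\,\nu(d\beta) \quad \forall\,\Gamma_H\in\mathcal{B}(\mathbf{H}_0).$$
Suppose, for all $\Gamma_H \in \mathcal{B}(\mathbf{H}_{n-1})$,
$$\hat P(H_{n-1}\in\Gamma_H) = \int_{[0,1]} P_\gamma^{S^\beta}(H_{n-1}\in\Gamma_H)\,\nu(d\beta)$$
and
$$P_\gamma^{S^\beta}(H_{n-1}\in\Gamma_H) = P_\gamma^S(H_{n-1}\in\Gamma_H\,|\,\beta)$$
$\nu$-almost surely. Then, for all $\Gamma_\Xi \in \mathcal{B}(\Xi_\infty)$, $\Gamma_R \in \mathcal{B}(\bar{\mathbb{R}}_+)$, $\Gamma_X \in \mathcal{B}(\mathbf{X}_\infty)$, $\Gamma_H \in \mathcal{B}(\mathbf{H}_{n-1})$, according to the construction of the strategic measures $P_\gamma^{S^\beta}$ and $P_\gamma^S$ and by the inductive supposition,
$$P_\gamma^S(\Xi_n\in\Gamma_\Xi,\ \Theta_n\in\Gamma_R,\ X_n\in\Gamma_X,\ H_{n-1}\in\Gamma_H\,|\,\beta)\,\nu(d\beta) = \int_{\Gamma_H} G_n(\Gamma_\Xi\times\Gamma_R\times\Gamma_X\,|\,\beta,h_{n-1})\,P_\gamma^S(d\beta\times dh_{n-1})$$
$$= \int_{\Gamma_H} G_n(\Gamma_\Xi\times\Gamma_R\times\Gamma_X\,|\,\beta,h_{n-1})\,P_\gamma^S(dh_{n-1}|\beta)\,\nu(d\beta) = \int_{\Gamma_H} G_n(\Gamma_\Xi\times\Gamma_R\times\Gamma_X\,|\,\beta,h_{n-1})\,P_\gamma^{S^\beta}(dh_{n-1})\,\nu(d\beta)$$
$$= P_\gamma^{S^\beta}(\Xi_n\in\Gamma_\Xi,\ \Theta_n\in\Gamma_R,\ X_n\in\Gamma_X,\ H_{n-1}\in\Gamma_H)\,\nu(d\beta).$$
Therefore, for all $\Gamma_H \in \mathcal{B}(\mathbf{H}_n)$, $\Gamma_\beta \in \mathcal{B}([0, 1])$,
$$\hat P(\beta_0\in\Gamma_\beta,\ H_n\in\Gamma_H) = \int_{\Gamma_\beta} P_\gamma^S(H_n\in\Gamma_H\,|\,\beta)\,\nu(d\beta) = \int_{\Gamma_\beta} P_\gamma^{S^\beta}(H_n\in\Gamma_H)\,\nu(d\beta)$$
and
$$\hat P(H_n\in\Gamma_H) = \int_{[0,1]} P_\gamma^{S^\beta}(H_n\in\Gamma_H)\,\nu(d\beta); \qquad P_\gamma^{S^\beta}(H_n\in\Gamma_H) = P_\gamma^S(H_n\in\Gamma_H\,|\,\beta) \quad \nu\text{-a.s.} \tag{6.34}$$
By induction, equalities (6.34) are valid for all $n = 0, 1, 2, \ldots$, for all $\Gamma_H \in \mathcal{B}(\mathbf{H}_n)$. Hence equalities (6.32) and (6.33) are valid on the σ-algebra $\mathcal{F}$ according to the Monotone Class Theorem (see Proposition B.1.42): the class of subsets $\Gamma \subseteq \Omega$ which satisfy (6.32) is a monotone class and includes all $\Gamma_H\times(\Xi_\infty\times\bar{\mathbb{R}}_+\times\mathbf{X}_\infty)^{\infty}$ for all $\Gamma_H \in \mathcal{B}(\mathbf{H}_n)$, $n = 0, 1, \ldots$. Recall also the uniqueness of the measure on $(\Omega, \mathcal{F})$ having fixed marginals on all $\mathbf{H}_n$: see Proposition B.1.37.

Let us specialize Definition 6.4.1 to the case when the weights distribution is discrete.

Definition 6.4.2 Suppose $S^k = \{\Xi, p_0^k, \{(p_n^k, \pi_n^k)\}_{n=1}^{\infty}\}$, $k = 1, 2, \ldots$, are (generalized) π-ξ-strategies with the same spaces $\Xi$ and $(\Omega, \mathcal{F})$. Then the countable mixture $S \in \mathbf{S}$ of $\{S^k\}_{k=1}^{\infty}$ with the weights distribution $\nu(k) \ge 0$, $\sum_{k=1}^{\infty}\nu(k) = 1$, is defined as follows.
• $S = \{\hat\Xi := \{1, 2, \ldots\}\times\Xi, p_0, \{(p_n, \pi_n)\}_{n=1}^{\infty}\}$ with the generic notation for $\hat\xi \in \hat\Xi$ being $\hat\xi = (k, \xi)$. Let $\hat\Omega$ and $\hat{\mathbf{H}}_n$ be the sample space and the space of histories under the strategy $S$ (when $\hat\Xi$ is in use).
• $p_0(\{k\}\times d\xi) := \nu(k)\,p_0^k(d\xi)$.
• For all $n = 1, 2, \ldots$, $p_n(\{k_n\}\times d\xi_n|\hat h_{n-1}) := p_n^k(d\xi_n|h_{n-1}(\hat h_{n-1}))\,I\{k_n = 1\}$; the components $k_n = 1$, $n = 1, 2, \ldots$, are immaterial and can be omitted from consideration; $\pi_n(da|\hat h_{n-1},\hat\xi_n,s) := \pi_n^k(da|h_{n-1}(\hat h_{n-1}),\xi_n,s)$. Here, for $\hat h_{n-1} = (k, \xi_0, x_0, \xi_1, \theta_1, x_1, \ldots, \xi_{n-1}, \theta_{n-1}, x_{n-1})$, $h_{n-1}(\hat h_{n-1}) := (\xi_0, x_0, \xi_1, \theta_1, x_1, \ldots, \xi_{n-1}, \theta_{n-1}, x_{n-1})$.

Therefore, with a slight abuse of notation, we may take $\hat\Omega := \{\hat\omega = (k, \xi_0, x_0, \xi_1, \theta_1, x_1, \ldots)\}$ as the sample space under the strategy $S$, where $(k, \xi_0) = \hat\xi_0$.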
It was explained below Definition 6.4.1 why the requirement for a common space $\Xi$ was not restrictive. For countable mixtures, there is a more specific justification. Namely, if the spaces $\Xi^k$ are not all identical, one can assume without loss of generality that they are disjoint, for otherwise one may extend $\Xi^m$ to $\{m\}\times\Xi^m$, $m = 1, 2, \ldots$, where the artificial component $m$ is immaterial. After that, one can introduce the common Borel space (see Proposition B.1.2)
$$\Xi = \Xi^1\cup\Xi^2\cup\ldots.$$
For each $k = 1, 2, \ldots$ the measure $p_0^k(\cdot)$ and the kernels $p_n^k(\cdot|h_{n-1}^k)$ are concentrated on $\Xi^k \subseteq \Xi$ for all $n = 1, 2, \ldots$. For the histories $h_{n-1} \notin \Xi^k\times\mathbf{X}\times(\Xi^k\times\mathbb{R}_+\times\mathbf{X})^{n-1}$, $\xi_n \notin \Xi^k$, the kernels $p_n^k(\cdot|h_{n-1})$ and $\pi(\cdot|h_{n-1},\xi_n)$ can be defined in an arbitrary but measurable way. We thus obtain the stratified sample space $\Omega$: each strategic measure $P_\gamma^{S^k}$ with the initial distribution $\gamma$ is concentrated on the subset of $\Omega$ containing only the ξ-components from $\Xi^k \subseteq \Xi$.

For the countable set of indices $k = 1, 2, \ldots$ with the discrete topology, the kernels $p_n^k(d\xi_n|h_{n-1})$ and $\pi_n^k(da|h_{n-1},\xi_n,s)$ are always measurable with respect to all their arguments. We state the next corollary from Theorem 6.4.1.

Corollary 6.4.1 For a fixed initial distribution $\gamma$ and the strategies $S^k$ and $S$ as in Definition 6.4.2,
$$P_\gamma^{S^k}(\Gamma) = P_\gamma^S(\Gamma|k) \ \text{ for all } k = 1, 2, \ldots \text{ with } \nu(k) > 0, \quad \forall\,\Gamma\in\mathcal{F} \tag{6.35}$$
and hence the marginal $\hat P$ of the strategic measure $P_\gamma^S$ on $\Omega = \{\omega = (\xi_0, x_0, \xi_1, \theta_1, x_1, \ldots)\}$ satisfies the equality
$$\hat P(\Gamma) = \sum_{k=1}^{\infty} P_\gamma^S(\{k\}\times\Gamma) = \sum_{k=1}^{\infty} P_\gamma^{S^k}(\Gamma)\,\nu(k), \quad \forall\,\Gamma\in\mathcal{F}. \tag{6.36}$$
The proof is identical to the proof of Theorem 6.4.1.

Definition 6.4.3 Suppose the initial distribution $\gamma$ is fixed and $S^k = \{\Xi^k, p_0^k, \{(p_n^k, \pi_n^k)\}_{n=1}^{\infty}\} \in \mathbf{S}$, $k = 1, 2, \ldots$. If all the spaces $\Xi^k \equiv \Xi$ are identical, we fix the usual sample space $(\Omega, \mathcal{F})$. Otherwise, we identify $\Xi^m$ with $\{m\}\times\Xi^m$ and introduce the Borel space $\Xi = \Xi^1\cup\Xi^2\cup\cdots$ with the associated sample space $(\Omega, \mathcal{F})$ as described below Definition 6.4.2. The strategic measures $P_\gamma^{S^k}$ are considered to be defined on $(\Omega, \mathcal{F})$.
(a) For each weights distribution $\nu(k) \ge 0$, $\sum_{k=1}^\infty \nu(k) = 1$, the measure $\sum_{k=1}^\infty P_\gamma^{S^k}(\cdot)\nu(k)$ on $(\Omega, \mathcal{F})$ is called a convex combination of the strategic measures $P_\gamma^{S^k}$.
(b) The convex combination in (a) is said to be replicated by a strategy $S = \{\hat\Xi, p_0, \{(p_n, \pi_n)\}_{n=1}^\infty\} \in \mathbf{S}$ if $\hat\Xi = \Xi_0 \times \Xi$ and the marginal $\hat P$ of $P_\gamma^S$ on $\Xi \times (\mathbf{X}_\infty \times \Xi_\infty \times \bar{\mathbb{R}}_+)^\infty$ coincides with $\sum_{k=1}^\infty P_\gamma^{S^k}(\cdot)\nu(k)$.

Remark 6.4.3 Note that if a strategy $S$ replicates a convex combination of the strategies $\{S^k\}_{k=1}^\infty$ in the sense of Definition 6.4.3, then it is not necessarily true that the performance functionals $W_j^\alpha(S, \gamma)$ coincide with the convex combinations of the objective functionals $W_j^\alpha(S^k, \gamma)$, because in Definition 6.4.3 the kernels $\{\pi_n^k\}_{n=1}^\infty$ are not explicitly involved. The following example confirms this remark.

Example 6.4.1 Let $\mathbf{X} = \{1, \Delta\}$, where $\Delta$ is the (costless) cemetery, $\gamma(1) = 1$, $\mathbf{A} = \{1, 2\}$ and $q(\Delta|1, 1) = q(\Delta|1, 2) = \lambda > 0$. Consider the stationary π-strategies $S^1 = \{\pi(1|x) = 1\}$ and $S^2 = \{\pi(2|x) = 1\}$. The spaces $\Xi^1$ and $\Xi^2$ are immaterial and one can put $\Xi^1 = \Xi^2 = \{\xi\}$ as a singleton. Clearly, $P_\gamma^{S^1} = P_\gamma^{S^2}$ and hence the strategy $S = S^1$ replicates the convex combination $\frac{1}{2} P_\gamma^{S^1} + \frac{1}{2} P_\gamma^{S^2}$. (The space $\Xi_0$ can be taken arbitrarily.) If $c_0(1, 1) = 0$ and $c_0(1, 2) = 1$, then
$$W_0^0(S^1, \gamma) = 0; \quad W_0^0(S^2, \gamma) = \frac{1}{\lambda}$$
and
$$\frac{1}{2} W_0^0(S^1, \gamma) + \frac{1}{2} W_0^0(S^2, \gamma) = \frac{1}{2\lambda} \ne W_0^0(S, \gamma) = 0.$$
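The arithmetic of Example 6.4.1 can be checked with a minimal sketch. The value of the rate `lam` and the helper `total_cost` are illustrative assumptions, not part of the original text; the only fact used is that the sojourn time in state 1 is exponential with parameter λ under either action.

```python
from fractions import Fraction

def total_cost(cost_rate, exit_rate):
    """Expected undiscounted cost accumulated in state 1 before absorption.

    The sojourn time in state 1 is exponential with parameter exit_rate,
    so the expected cost equals cost_rate * E[sojourn] = cost_rate / exit_rate.
    """
    return Fraction(cost_rate) / Fraction(exit_rate)

lam = Fraction(3)        # hypothetical value of the rate lambda > 0
c0 = {1: 0, 2: 1}        # cost rates c_0(1, a) for the actions a = 1, 2

w_s1 = total_cost(c0[1], lam)    # S^1 always applies action 1
w_s2 = total_cost(c0[2], lam)    # S^2 always applies action 2

# Convex combination of the objective values with weights 1/2, 1/2.
mixture_value = Fraction(1, 2) * w_s1 + Fraction(1, 2) * w_s2
# S = S^1 replicates the convex combination of the strategic measures.
replicating_value = w_s1

print(mixture_value, replicating_value)
```

The two printed values differ, which is the point of the example: replication in the sense of Definition 6.4.3 does not by itself preserve the performance functionals.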
Remark 6.4.4 Suppose $\sum_{k=1}^\infty P_\gamma^{S^k}(\cdot)\nu(k)$ is a convex combination of the strategic measures $P_\gamma^{S^k}$. Without loss of generality, assume that the spaces $\Xi^k \equiv \Xi$ are identical. Then it can easily happen that no strategy $S = \{\Xi, p_0, \{(p_n, \pi_n)\}_{n=1}^\infty\}$ with the same space $\Xi$ has the strategic measure $P_\gamma^S$ coincident with $\sum_{k=1}^\infty P_\gamma^{S^k}(\cdot)\nu(k)$. In this sense, the space of strategic measures with the same space $\Xi$ is not convex (cf. Proposition C.1.1(a)). The following example confirms this remark.

Example 6.4.2 Let $\mathbf{X} = \{0, 1, 2\}$, where 1 and 2 are absorbing states, $\gamma(0) = 1$, $\mathbf{A} = \{1, 2\}$, $q(1|0, 1) = \lambda_1$, $q(2|0, 2) = \lambda_2$, $0 < \lambda_1 < \lambda_2$, $\lambda_1 + \lambda_2 = 2$, and other transition rates are zero. Consider the deterministic stationary strategies $\varphi^1(x) \equiv 1$
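and $\varphi^2(x) \equiv 2$. Here the ξ-components are immaterial and we put $\Xi := \{\xi\}$ as a singleton. Then
$$P_\gamma^{\varphi^1}(\Theta_1 \in dt, X_1 = 1) = \lambda_1 e^{-\lambda_1 t}\,dt, \quad P_\gamma^{\varphi^1}(\Theta_1 \in dt, X_1 = 2) = 0;$$
$$P_\gamma^{\varphi^2}(\Theta_1 \in dt, X_1 = 2) = \lambda_2 e^{-\lambda_2 t}\,dt, \quad P_\gamma^{\varphi^2}(\Theta_1 \in dt, X_1 = 1) = 0.$$
For the convex combination $P := \frac{1}{2} P_\gamma^{\varphi^1} + \frac{1}{2} P_\gamma^{\varphi^2}$ we have
$$P(\Theta_1 \le s) = 1 - \frac{1}{2}\left(e^{-\lambda_1 s} + e^{-\lambda_2 s}\right); \quad P(X_1 = 1) = P(X_1 = 2) = \frac{1}{2}. \tag{6.37}$$
An arbitrary strategy with $\Xi = \{\xi\}$ is just a π-strategy $S = \{\pi_n\}_{n=1}^\infty$ because the single ξ-component is immaterial. In order to have
$$F(s) := P_\gamma^S(\Theta_1 \le s) = 1 - \frac{1}{2}\left(e^{-\lambda_1 s} + e^{-\lambda_2 s}\right),$$
it is necessary and sufficient that the hazard rate of the distribution characterized by $F$ equals $\lambda_1 \pi_1(1|0, s) + \lambda_2 \pi_1(2|0, s)$, that is,
$$\lambda_1 \pi_1(1|0, s) + \lambda_2 [1 - \pi_1(1|0, s)] = \frac{\lambda_1 e^{-\lambda_1 s} + \lambda_2 e^{-\lambda_2 s}}{e^{-\lambda_1 s} + e^{-\lambda_2 s}},$$
which leads to
$$\pi_1(1|0, s) = \frac{e^{-\lambda_1 s}}{e^{-\lambda_1 s} + e^{-\lambda_2 s}}; \quad \pi_1(2|0, s) = \frac{e^{-\lambda_2 s}}{e^{-\lambda_1 s} + e^{-\lambda_2 s}}.$$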
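The hazard-rate identity above can be verified numerically. The following is a minimal sketch: the concrete rates and the helper names (`pi1`, `hazard`, `survival`, `target_survival`) are assumptions made for illustration, and the integral is approximated by the trapezoidal rule.

```python
import math

lam1, lam2 = 0.5, 1.5   # hypothetical rates with 0 < lam1 < lam2 and lam1 + lam2 = 2

def pi1(action, s):
    """Relaxed control pi_1(action | 0, s) obtained from the hazard-rate equation."""
    w1, w2 = math.exp(-lam1 * s), math.exp(-lam2 * s)
    return (w1 if action == 1 else w2) / (w1 + w2)

def hazard(s):
    """Total jump intensity out of state 0 at elapsed time s under pi_1."""
    return lam1 * pi1(1, s) + lam2 * pi1(2, s)

def survival(s, steps=20000):
    """exp(-integral_0^s hazard(u) du), with the integral by the trapezoidal rule."""
    h = s / steps
    integral = 0.5 * (hazard(0.0) + hazard(s)) * h
    integral += h * sum(hazard(i * h) for i in range(1, steps))
    return math.exp(-integral)

def target_survival(s):
    """1 - F(s) for the distribution function F of the convex combination."""
    return 0.5 * (math.exp(-lam1 * s) + math.exp(-lam2 * s))

for s in (0.5, 1.0, 3.0):
    print(s, survival(s), target_survival(s))
```

For each tested `s` the two columns agree up to the quadrature error, confirming that this choice of $\pi_1$ reproduces the distribution function $F$ of the first sojourn time.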
Now PγS (X 1 = 1) λ e−λ1 s +λ2 e−λ2 s e−λ1 t − (0,t] 1 −λ s −λ ds e 1 +e 2 s = e dt −λ1 t + e−λ2 t (0,∞) e ∞ λ e−λ1 s +λ2 e−λ2 s −e−λ1 t − (0,t) 1 −λ s −λ ds s 1 2 e +e = e λ1 e−λ1 t + λ2 e−λ2 t 0 −λ −λ1 t−λ2 t λ1 e 1 s +λ2 e−λ2 s (λ2 − λ1 )λ2 e − ds + e (0,t] e−λ1 s +e−λ2 s dt −λ1 t + λ e−λ2 t ]2 2 (0,∞) [λ1 e 1 s +λ2 e−λ2 s (λ2 − λ1 )λ2 e−λ1 t−λ2 t − (0,t] λ1 e−λ 1 1 ds e−λ1 s +e−λ2 s e dt > , = + −λ t −λ t 2 1 + λ e 2 ] 2 [λ e 2 1 2 (0,∞)
where the second equality is by integrating by parts. Thus, we see that in this example it is impossible to construct a strategy $S$ with the same space $\Xi$ such that the strategic measure $P_\gamma^S$ satisfies equalities (6.37).

Theorem 6.4.2 Every convex combination of strategic measures in the sense of Definition 6.4.3 is replicated by the countable mixture $S$ as in Definition 6.4.2. Moreover, for the performance functionals $W_j^\alpha$, $j = 0, 1, \ldots, J$, the equality
$$W_j^\alpha(S, \gamma) = \sum_{k=1}^\infty W_j^\alpha(S^k, \gamma)\,\nu(k)$$
holds true.

Proof The countable mixture $S$ replicates the convex combination $\sum_{k=1}^\infty P_\gamma^{S^k}(\cdot)\nu(k)$ according to Corollary 6.4.1. To complete the proof, it is sufficient to prove the following equality for the detailed occupation measures
$$m_{\gamma,n}^{S,\alpha}(\Gamma_X \times \Gamma_A) = \sum_{k=1}^\infty \nu(k)\, m_{\gamma,n}^{S^k,\alpha}(\Gamma_X \times \Gamma_A) \tag{6.38}$$
for all n ≥ 1, X ∈ B(X), A ∈ B(A). According to Definitions 6.2.1 and 6.4.2, S,α ( X × A ) m γ,n ∞ S ν(k)Eγ =
(0,n ]∩R+
k=1
× =
e−α(Tn−1 +s) I{X n−1 ∈ X }
πnk ( A |h n−1 ( Hˆ n−1 ), n , t
∞
ν(k)EγS
k
(0,n ]∩R+
k=1
− Tn−1 )dt k
e−α(Tn−1 +s) I{X n−1 ∈ X }
× πnk ( A |Hn−1 , n , t − Tn−1 )dt =
∞
S ,α ν(k)m γ,n ( X × A ), k
k=1
and equality (6.38) is proved. The second equality above is by (6.35); see also Remark 6.4.1. The proof is complete.
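As an illustration of Theorem 6.4.2, the sketch below simulates a mixture of the two stationary strategies from Example 6.4.1 and compares the estimated total cost with the convex combination of the individual costs. The function name `simulate_mixture_cost`, the seed, and the numerical value of the rate are assumptions chosen for the illustration; the model facts used are that the exit rate from state 1 equals λ under both actions and that each strategy applies a fixed cost rate.

```python
import random

def simulate_mixture_cost(lam, weights, cost_rates, n_runs, seed=0):
    """Monte Carlo estimate of the total undiscounted cost under a mixture.

    At time 0 an index k is drawn with probability weights[k]; afterwards the
    k-th stationary strategy is followed until absorption in the cemetery.
    """
    rng = random.Random(seed)
    indices = range(len(weights))
    total = 0.0
    for _ in range(n_runs):
        k = rng.choices(indices, weights=weights)[0]   # selects the strategy S^k
        sojourn = rng.expovariate(lam)                 # time spent in state 1
        total += cost_rates[k] * sojourn               # constant cost rate under S^k
    return total / n_runs

lam = 2.0                 # hypothetical value of the rate lambda
weights = [0.5, 0.5]      # weights nu(1), nu(2)
cost_rates = [0.0, 1.0]   # c_0(1, 1) and c_0(1, 2)

estimate = simulate_mixture_cost(lam, weights, cost_rates, n_runs=200_000)
exact = sum(w * c / lam for w, c in zip(weights, cost_rates))
print(estimate, exact)
```

Unlike the replicating strategy $S = S^1$ of Example 6.4.1, the mixture remembers which $S^k$ was selected, so its cost agrees with the weighted sum of the individual costs up to Monte Carlo error.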
6.4.1.2 Generalized Standard ξ-Strategies
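Now consider the class of generalized standard ξ-strategies $\mathbf{S}_\xi^G$. Recall that the sample space is $\Omega = \{\omega = (x_0, a_1, \theta_1, x_1, a_2, \theta_2, \ldots)\}$. (See Table 6.1.) The strategic measure $P_\gamma^S$ of a strategy $S \in \mathbf{S}_\xi^G$ is defined by the following stochastic kernels (see (6.6)):
$$G_n(\{\xi_\infty\} \times \{\infty\} \times \{x_\infty\} | h_{n-1}) = \delta_{x_{n-1}}(\{x_\infty\});$$
$$G_n(\Gamma_A \times \{\infty\} \times \{x_\infty\} | h_{n-1}) = \delta_{x_{n-1}}(\mathbf{X}) \int_{\Gamma_A} e^{-\int_{(0,\infty)} q_{x_{n-1}}(a_n)\,ds}\, p_n(da_n | h_{n-1});$$
$$G_n(\Gamma_A \times \Gamma_R \times \Gamma_X | h_{n-1}) = \delta_{x_{n-1}}(\mathbf{X}) \int_{\Gamma_A} \int_{\Gamma_R} \tilde q(\Gamma_X | x_{n-1}, a_n)\, e^{-q_{x_{n-1}}(a_n)\theta}\, d\theta\, p_n(da_n | h_{n-1});$$
$$G_n((\mathbf{A} \cup \{\xi_\infty\}) \times \{\infty\} \times \mathbf{X} | h_{n-1}) = G_n(\{\xi_\infty\} \times \mathbb{R}_+ \times \mathbf{X}_\infty | h_{n-1}) = 0$$
for all $\Gamma_A \in \mathcal{B}(\mathbf{A})$, $\Gamma_R \in \mathcal{B}(\mathbb{R}_+)$, $\Gamma_X \in \mathcal{B}(\mathbf{X})$. Recall that here $\Xi = \mathbf{A}$ and $\omega = (x_0, a_1, \theta_1, x_1, a_2, \theta_2, x_2, \ldots)$. After we substitute $\Gamma_R = \mathbb{R}_+$, we see that, given $(h_{n-1}, a_n) \in \mathbf{H}_{n-1} \times \mathbf{A}$, the component $X_n$ has the distribution
$$P_\gamma^S(X_n \in \Gamma_X | H_{n-1} = h_{n-1}, A_n = a_n) = I\{x_{n-1} \ne x_\infty \text{ and } q_{x_{n-1}}(a_n) \ne 0\}\, \frac{\tilde q(\Gamma_X \setminus \{x_\infty\} | x_{n-1}, a_n)}{q_{x_{n-1}}(a_n)} + I\{x_{n-1} = x_\infty \text{ or } q_{x_{n-1}}(a_n) = 0\}\, I\{x_\infty \in \Gamma_X\}, \quad \forall\, \Gamma_X \in \mathcal{B}(\mathbf{X}_\infty).$$
According to Proposition C.1.2, the marginal of $P_\gamma^S$ on the space $\{(x_0, a_1, x_1, a_2, x_2, \ldots)\} = (\mathbf{X}_\infty \times (\mathbf{A} \cup \{\xi_\infty\}))^\infty$ is a strategic measure $\mathsf{P}_\gamma^\sigma$ for some strategy $\sigma$ in the DTMDP, called here the “associated” DTMDP, with the following elements
• $\mathbf{X}_\infty$ is the state space;
• $\mathbf{A}_\infty := \mathbf{A} \cup \{\xi_\infty\}$ is the action space;
• $p(\Gamma | x, a) := \tilde q(\Gamma \setminus \{x_\infty\} | x, a) / q_x(a)$ if $x \ne x_\infty$, $q_x(a) \ne 0$, $a \ne \xi_\infty$, and $p(\Gamma | x, a) := \delta_{x_\infty}(\Gamma)$ otherwise, is the transition probability;
• $\gamma$ is the initial distribution;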
• the only admissible action in the state x∞ is ξ∞ ∈ / A, and the action ξ∞ is not admissible at x ∈ X. Conversely, if P = Pσγ is a strategic measure in this DTMDP, then {σn (da|x0 , a1 , . . . , xn−1 )}∞ n=1 with x n−1 ∈ X can be considered as a specific generalized standard ξ-strategy S in the CTMDP, leading to the same marginal on (X∞ × A∞ )∞ . The latter, in its turn, can be extended to because, for each n = 1, 2, . . ., the pair (xn−1 , an ) ∈ X × A gives rise to the (unique) distribution of ¯ + ), n : for all ∈ B(R PγS (n ∈ |X n−1 = xn−1 , An = an ) −qxn−1 (an )θ dθ, if qxn−1 (an ) = 0; \{∞} q xn−1 (an )e = δ∞ (), otherwise, and
(6.39)
PγS (n ∈ |X n−1 = x∞ , An = ξ∞ ) = δ∞ (). Therefore, the space of strategic measures in the associated DTMDP, denoted below as P D , is in 1-1-correspondence with the space of strategic measures of generalized standard ξ-strategies in the CTMDP, denoted below by P C . This 11-correspondence is denoted by P P. More formally, we accept the following definition. Definition 6.4.4 Let (, F) be the sample space for a generalized standard ξstrategy. If P is a probability measure on (, F) and P is a probability measure on (X∞ × A∞ )∞ , then P P means that the marginal of P on (X∞ × A∞ )∞ coincides with P and the distribution of the θ-components of ω ∈ is given by (6.39). Note that (deterministic) Markov standard ξ-strategies are in the correspondence with (deterministic) Markov strategies in the associated DTMDP. Theorem 6.4.3 Suppose, for each β ∈ [0, 1], P(β) ∈ P D , P(β) ∈ P C and P(β) P(β). Then the following assertions hold true. (a) The mapping β → P(β) is a stochastic kernel on B((X∞ × A∞ )∞ ) given β ∈ [0, 1] if and only if β → P(β) is a stochastic kernel on F given β ∈ [0, 1]. (b) If the mappings mentioned in (a) are stochastic kernels, then for each probability measure ν ∈ P([0, 1]), D P(β)ν(dβ) ∈ P and P P := P(β)ν(dβ) ∈ P C . P := [0,1]
[0,1]
Proof (a) For each β ∈ [0, 1], the measure P(β) coincides with the marginal of P(β) on (X∞ × A∞ )∞ , and the “if” part follows. For the “only if” part, note that, for each S ∈ SξG , PγS (n ∈ |X n−1 ¯ + ) given (xn−1 , an ) = xn−1 , An = an ) is a stochastic kernel on B(R ∈ (X × A) ∪ ({x∞ } × {ξ∞ }). Thus, for each n = 0, 1, 2, . . ., the mapping β →
P(β), with some abuse of notation, is a stochastic kernel on B(Hn ) given β ∈ [0, 1] with the images being the marginals of P(β) on Hn . Finally, β → P(β) is a stochastic kernel on F given β ∈ [0, 1] by the Monotone Class Theorem (see Proposition B.1.42): the class of subsets ⊂ , for which the mapping β → P(β)() is measurable, is a monotone class and includes H × (A∞ × ¯ + × X∞ )∞ for all H ∈ B(Hn ). R (b) P ∈ P D by Corollary C.1.1; P P because the correspondence is affine: see Definition 6.4.4. Hence P ∈ P C . One can slightly modify Definition 6.4.3 just for the strategies S ∈ SξG . Note that here = A. β
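Definition 6.4.5 Suppose $\{S^\beta\}_{\beta \in [0,1]} \subseteq \mathbf{S}_\xi^G$ are such that the mapping $\beta \to P_\gamma^{S^\beta}$ is a stochastic kernel on $\mathcal{F}$ given $\beta \in [0, 1]$. For each weights distribution $\nu(d\beta) \in \mathcal{P}([0, 1])$, the measure $P = \int_{[0,1]} P_\gamma^{S^\beta}(\cdot)\,\nu(d\beta)$ on $(\Omega, \mathcal{F})$ is called a generalized convex combination of the strategic measures $P_\gamma^{S^\beta}$.

Definition 6.4.6 We say that a measure $P \in \mathcal{P}(\Omega)$ is replicated by a mixture $S$ of strategies $\{S^\beta\}_{\beta \in [0,1]} \subseteq \mathbf{S}_\xi^G$ with the weights distribution $\nu$ if the marginal $\hat P$ on $\Omega = \{\omega = (x_0, a_1, \theta_1, x_1, a_2, \theta_2, \ldots)\}$ of $P_\gamma^S$ coincides with $P$:
$$\hat P = \int_{[0,1]} P_\gamma^{S^\beta}\,\nu(d\beta) = P.$$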
See Remark 6.4.1 and Theorem 6.4.1, and recall that PγS is a measure on [0, 1] × . The following theorem states that every generalized convex combination of the strategic measures as in Definition 6.4.5 is again a strategic measure for some strategy from SξG ; it can also be represented by a mixture of the initial strategies from SξG , after a slight modification. Part (b) states that every strategic measure of a strategy from SξG is a generalized convex combination of the strategic measures coming from deterministic strategies in SξG ; it can also be represented by a mixture of those deterministic strategies. Theorem 6.4.4 (a) Suppose P = β
[0,1]
β
PγS ν(dβ) is a generalized convex combination of the strategic
measures PγS with S β ∈ SξG . Then the following assertions hold true. – There exists a strategy S ∈ SξG such that PγS = P. Hence, the space {PγS , S ∈ SξG } is convex. – For each n = 1, 2, . . ., there exists a stochastic kernel pnβ (dan |h n−1 ) on B(A) given (β, h n−1 ) ∈ [0, 1] × Hn−1 such that, for all β ∈ [0, 1], the β ˜β equality PγS = PγS is valid for the generalized standard ξ-strategy
$$\tilde S^\beta := \{\mathbf{A}, \{p_n^\beta\}_{n=1}^\infty\} \in \mathbf{S}_\xi^G.$$
– The measure P is replicated by the mixture S˜ of { S˜ β }β∈[0,1] with the weights distribution ν. (b) There exists a probability measure ν on ([0, 1], B([0, 1])) such that, for each generalized standard ξ-strategy S ∈ SξG , the following statements hold. – There exist measurable mappings rn (β, x0 , a1 , . . . , xn−1 ) : [0, 1] × (X∞ × A∞ )n−1 × X∞ → A∞ , n = 1, 2, . . . , defining, for each β ∈ [0, 1], the deterministic generalized standard ξ-strategy β ϕβ := {A, {ϕn (h n−1 )}∞ n=1 } with ϕβn (h n−1 ) := rn (β, x0 , a1 , . . . , xn−1 ) for all h n−1 = (x0 , a1 , θ1 , x1 , a2 , . . . , xn−1 ), such that PγS =
β
[0,1]
Pγϕ ν(dβ)
β
with Pγϕ being measurable in β ∈ [0, 1]. – The measure PγS is replicated by the mixture S˜ of {ϕβ }β∈[0,1] with the weights distribution ν. Without loss of generality, one can assume that ν is the uniform distribution (i.e. the Lebesgue measure) on [0, 1]. Proof (a) The first assertion holds true by Theorem 6.4.3(b). The convex combina1 2 tion δPγS + (1 − δ)PγS , δ ∈ (0, 1), of two strategic measures with S 1 , S 2 ∈ SξG is a special case of the generalized convex combination: one can take ν as the Lebesgue measure and put Sβ =
S 1 , if β ∈ [0, δ]; S 2 , if β ∈ (δ, 1].
The convexity of the space {PγS , S ∈ SξG } follows.
β
To prove the second assertion, consider the strategic measures P(β) PγS in the associated DTMDP. The mapping β → P(β) is the stochastic kernel on B((X∞ × A∞ )∞ ) given β ∈ [0, 1] by Theorem 6.4.3(a). For each n = 1, 2, . . ., the marginal Pn (·|β) of P(β) on (X × A)n is also the stochastic kernel on B((X × A)n ) given β ∈ [0, 1]. According to Proposition B.1.33, there exist stochastic kernels ϕn (dh n−1 |β) on B(X × (A × X)n−1 ) given β ∈ [0, 1] and
rn (dan |β, h n−1 ) on B(A) given (β, h n−1 ) ∈ [0, 1] × (X × (A × X)n−1 ) such that Pn (n−1 × A |β) = rn ( A |β, h n−1 )ϕn (dh n−1 |β), n−1
∀ n−1 ∈ B(X × (A × X)n−1 ), A ∈ B(A). We put rn ({ξ∞ }|β, h n−1 ) = 1 in the case when xn−1 = x∞ and extend Pn (·|β) to (X∞ × A∞ )n . For each β ∈ [0, 1], let σ β := {σnβ (da|x0 , a1 , . . . , xn−1 ) := rn (da|β, x0 , a1 , . . . , xn−1 )}∞ n=1 be the strategy in the associated DTMDP. Using induction, it is easy to show that, for each n = 1, 2, . . ., β
D ∈ n−1 ); Pn (n−1 × A∞ |β) = ϕn (n−1 |β) = Pσγ (Hn−1 β
D ∈ n−1 , An ∈ A ), Pn (n−1 × A |β) = Pσγ (Hn−1 D ∀ n−1 ∈ B(Hn−1 ), A ∈ B(A∞ ). D = X∞ × (A∞ × X∞ )n−1 is the space of histories in the associated Here Hn−1 β DTMDP. Therefore, P(β) = Pσγ by the Monotone Class Theorem (see Proposition B.1.42): the class of subsets ⊆ (X∞ × A∞ )∞ , for which P(β)() = β Pσγ (), is a monotone class and includes H × (A∞ × X∞ )∞ for all H ∈ D B(Hn−1 ), n = 1, 2, . . .. β β It remains to put pn (dan |h n−1 ) := σn (dan |x0 , a1 , . . . , xn−1 ) for all h n−1 = (x0 , a1 , θ1 , x1 , a2 , . . . , xn−1 ): β β ˜β for all β ∈ [0, 1], PγS Pσγ = P(β) PγS . ˜β
β
Recall that is a 1-1-correspondence, hence PγS = PγS . ˜ Now the mixture S˜ is well defined, and for the marginal Pˆ of PγS on = {ω = (x0 , a1 , θ1 , x1 , a2 , θ2 , . . .)} we have the required equality Pˆ = P according to Theorem 6.4.1: see (6.33). The third assertion is proved. (b) We take ν as in Proposition C.1.3 (see also the remarks after it), introduce Pσγ PγS , and consider the corresponding measurable mappings rn , as in part (a) of Proposition C.1.3, which give rise, for each β ∈ [0, 1], to the deterministic strategy ϕβ,D = {rn (β, x0 , a1 , . . . , xn−1 )}∞ n=1 in the associated DTMDP and to the corresponding deterministic strategy ϕβ ∈ SξG in the CTMDP. Now Pϕγ
β,D
β
Pγϕ , and the mapping β → Pϕγ
β,D
is a stochastic kernel by Propo-
β
sition C.1.3(c). Hence the mapping β → Pγϕ is also a stochastic kernel by Theorem 6.4.3(a), and β,D β PγS Pσγ = Pϕγ ν(dβ) Pγϕ ν(dβ) [0,1]
[0,1]
according to Proposition C.1.3(a) and Theorem 6.4.3(b). To complete the proof of the first statement, it remains to recall that is a 1-1-correspondence. The proof of the second statement is identical to the proof of the last assertion in part (a).
6.4.1.3 Markov Standard ξ-Strategies
The space $\{P_\gamma^S, S \in \mathbf{S}_\xi^M\}$ of strategic measures of Markov standard ξ-strategies is usually not convex, as the following example shows. However, all the other statements of Theorem 6.4.4 hold if we consider only Markov standard ξ-strategies and simple deterministic Markov strategies.

Example 6.4.3 Consider the following DTMDP associated with a CTMDP: $\mathbf{A}_\infty = \{1, 2, \xi_\infty\}$; $\mathbf{X}_\infty = \{1, 2, x_\infty\}$; $p(2|1, a) = 1$ for $a = 1, 2$; $p(x_\infty|2, a) = 1$ for $a = 1, 2$; $p(x_\infty|x_\infty, \xi_\infty) = 1$; $\gamma(1) = 1$. Let
$$\sigma_{1,2}^{M,1}(1|x) = 1 \text{ for } x = 1, 2; \qquad \sigma_{1,2}^{M,2}(2|x) = 1 \text{ for } x = 1, 2.$$
For the Markov strategies $\sigma^{M,1}$ and $\sigma^{M,2}$ we have
$$\mathsf{P}_\gamma^{\sigma^{M,1}}(A_1 = 1, A_2 = 1) = 1; \qquad \mathsf{P}_\gamma^{\sigma^{M,2}}(A_1 = 2, A_2 = 2) = 1,$$
and, for the convex combination
$$\mathsf{P} := \frac{1}{2} \mathsf{P}_\gamma^{\sigma^{M,1}} + \frac{1}{2} \mathsf{P}_\gamma^{\sigma^{M,2}},$$
we have the equality
$$\mathsf{P}(A_1 = 1, A_2 = 1) = \frac{1}{2}; \qquad \mathsf{P}(A_1 = 2, A_2 = 2) = \frac{1}{2}.$$
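To make the two-step action distribution of this convex combination concrete, the following sketch tabulates the joint law of $(A_1, A_2)$ under the mixture and compares it with the product of its one-step marginals. The dictionary names are illustrative assumptions. The comparison is relevant because in this DTMDP the state path $1 \to 2 \to x_\infty$ is deterministic, so a Markov strategy selects $A_1$ and $A_2$ independently.

```python
from fractions import Fraction
from itertools import product

half = Fraction(1, 2)
actions = (1, 2)

# Joint laws of (A1, A2) under the two deterministic Markov strategies.
law_1 = {(1, 1): Fraction(1)}
law_2 = {(2, 2): Fraction(1)}

# Joint law under the convex combination with weights 1/2, 1/2.
mixture = {pair: half * law_1.get(pair, 0) + half * law_2.get(pair, 0)
           for pair in product(actions, repeat=2)}

# One-step marginals of A1 and A2 under the mixture.
marg_a1 = {a: sum(p for (a1, _), p in mixture.items() if a1 == a) for a in actions}
marg_a2 = {a: sum(p for (_, a2), p in mixture.items() if a2 == a) for a in actions}

# A Markov strategy with these marginals makes A1 and A2 independent here.
markov_joint = {(a1, a2): marg_a1[a1] * marg_a2[a2]
                for a1, a2 in product(actions, repeat=2)}

print(mixture[(1, 1)], markov_joint[(1, 1)])
```

The mixture puts mass 1/2 on the pair (1, 1), whereas the independent selection with the same marginals puts only 1/4 there.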
This measure P, while being strategic by Proposition C.1.1(a), is not generated by any Markov strategy. Indeed, on one hand, one has to define σ1 (1|1) = σ1 (2|1) = 21 and σ2 (1|2) = σ2 (2|2) = 21 , but as the result, we have Pσγ (A1 = 1, A2 = 1) = 14 . Since is an affine 1-1-correspondence and Markov standard ξ-strategies are in correspondence with Markov strategies in the associated DTMDP, the conM,1 M,1 M,2 M,2 sidered convex combination of the strategies Pγp : Pσγ and Pγp : Pσγ is not generated by a Markov stationary ξ-strategy. Here p M,1 , p M,2 ∈ SξM are the strategies coincident with σ M,1 , σ M,2 . Part (a) of the following theorem states that every generalized convex combination of the strategic measures (as in Definition 6.4.5) coming from Markov standard ξstrategies can be replicated by a mixture of the initial strategies from SξM , after a slight modification. Part (b) states that every strategic measure of a strategy from SξM is a generalized convex combination of the strategic measures coming from simple deterministic Markov strategies; it can also be replicated by a mixture of those deterministic strategies. Theorem 6.4.5 (a) Suppose P = β
β
[0,1]
PγS ν(dβ) is a generalized convex combination of the strategic β
measures PγS with S β ∈ SξM , where PγS is measurable in β ∈ [0, 1]. Then the following assertions hold. – For each n = 1, 2, . . ., there exists a stochastic kernel M,β pn (dan |xn−1 ) on B(A) given (β, xn−1 ) ∈ [0, 1] × X such that, for all β ˜β β ∈ [0, 1], the equality PγS = PγS holds for the Markov standard ξ-strategy M,β M S˜ β := {A, { pn }∞ n=1 } ∈ Sξ . – The measure P is replicated by the mixture S˜ of { S˜ β }β∈[0,1] with the weights distribution ν. (b) There exists a probability measure ν on ([0, 1], B([0, 1])) such that the following statements hold for each Markov standard ξ-strategy S ∈ SξM . – There exist measurable mappings rn (β, xn−1 ) : [0, 1] × X∞ → A∞ , n = 1, 2, . . . , defining for each β ∈ [0, 1] the simple deterministic Markov strategy ϕˆ β := β {ϕˆ n (xn−1 )}∞ n=1 with ϕˆ βn (xn−1 ) := rn (β, xn−1 ), ∀ xn−1 ∈ X,
such that PγS β
=
β
[0,1]
Pγϕˆ ν(dβ)
with Pγϕˆ being measurable in β ∈ [0, 1].
– The measure PγS is replicated by the mixture S˜ of {ϕˆ β }β∈[0,1] with the weights distribution ν. Without loss of generality, one can assume that ν is the uniform distribution (i.e. the Lebesgue measure) on [0, 1]. β
Proof (a) Consider the strategic measures P(β) PγS in the associated DTMDP. β σβ Below, σ β = {σn (da|xn−1 )}∞ n=1 is such that Pγ = P(β): recall that Markov standard ξ-strategies are in correspondence with Markov strategies in the associated DTMDP. The mapping β → P(β) is the stochastic kernel on B((X∞ × A∞ )∞ ) given β ∈ [0, 1] by Theorem 6.4.3(a). For each n = 1, 2, . . ., the marginal Pn (·|β) of P(β) on the space X × A, corresponding to (xn−1 , an ), is a finite kernel on B(X × A) given β ∈ [0, 1]. According to Proposition B.1.33, there exist a finite kernel ϕn (d xn−1 |β) on B(X) given β ∈ [0, 1] and a stochastic kernel rn (dan |β, xn−1 ) on B(A) given (β, xn−1 ) ∈ [0, 1] × X such that Pn ( X × A |β) =
X
rn ( A |β, xn−1 )ϕn (d xn−1 |β),
(6.40)
∀ X ∈ B(X), A ∈ B(A).
We put rn ({ξ∞ }|β, x∞ ) = 1 and introduce the Markov strategy in the associated DTMDP p M,β := { pnM,β (da|xn−1 ) := rn (da|β, xn−1 )}∞ n=1 for each β ∈ [0, 1]. Note that ϕn (d xn−1 |β) = Pn (d xn−1 × A|β) and, by the construction of the β strategic measure Pσγ = P(β), Pn ( X × A |β) =
X
σnβ ( A |xn−1 )ϕn (d xn−1 |β), ∀ X ∈ B(X), A ∈ B(A).
Taking into account (6.40) and Proposition B.1.33, we see that there is a set B ∈ B(X) (possibly dependent on β) with ϕn (B|β) = 0 such that rn ( A |β, xn−1 ) = σnβ ( A |xn−1 ) = pnM,β ( A |xn−1 ) for all A ∈ B(A) and xn−1 ∈ X \ B. If xn−1 = x∞ then pnM,β (ξ∞ |x∞ ) = σnβ (ξ∞ |x∞ ) = 1. β
M,β
The described properties of the stochastic kernels σn and pn mean that, under β a fixed β ∈ [0, 1], they coincide for Pσγ -almost all X n−1 (i.e., for almost all xn−1 β with respect to the distribution of X n−1 under Pσγ ). β M,β Next, we intend to show that Pσγ = Pγp for an arbitrarily fixed β ∈ [0, 1]. To this end, we shall prove for each n = 1, 2, . . . that
Pγp
M,β
β
(n−1 × (A∞ × X∞ )∞ ) = Pσγ (n−1 × (A∞ × X∞ )∞ ), ∀ n−1 ∈
(6.41)
D B(Hn−1 ),
D = X∞ × (A∞ × X∞ )n−1 is the space of histories in the associated where Hn−1 DTMDP. Equality (6.41) holds when n = 1: on both sides we have γ(0 ). Suppose it holds β β M,β for some n ≥ 1. Then the stochastic kernels σn and pn coincide for Pσγ -almost M,β all X n−1 , and for Pγp -almost all X n−1 , too. Therefore, by the definition of the M,β β strategic measures Pγp and Pσγ , equality (6.41) holds also for n + 1, and hence M,β β Pγp = Pσγ by the Ionescu-Tulcea Theorem (see Proposition B.1.37): recall the uniqueness of the extension for the fixed marginals as in (6.41). M,β M Now we have for all β ∈ [0, 1] and S˜ β := {A, { pn }∞ n=1 } ∈ Sξ that ˜β
PγS Pγp ˜β
β
β
= Pσγ = P(β) PγS ,
M,β
β
and PγS = PγS because is a 1-1-correspondence. ˜ The mixture S˜ is well defined and for the marginal Pˆ of PγS on = {ω = (x0 , a1 , θ1 , x1 , a2 , θ2 , . . .)}, we have the required equality Pˆ = P according to Theorem 6.4.1: see (6.33). The second assertion is proved. (b) The proof is identical to the proof of Theorem 6.4.4(b) with reference to part (b) of Proposition C.1.3.
6.4.2 Properties of Occupation Measures

Lemma 6.4.1 (a) For the strategies $S^\beta$ and $S$ as in Definition 6.4.1, for any $\alpha \ge 0$, the detailed occupation measures satisfy the equalities
$$m_{\gamma,n}^{S,\alpha} = \int_{[0,1]} m_{\gamma,n}^{S^\beta,\alpha}\,\nu(d\beta), \quad n = 1, 2, \ldots.$$
(b) For the strategies $S^k$ and $S$ as in Definition 6.4.2, for any $\alpha \ge 0$, the detailed occupation measures satisfy the equality
$$m_{\gamma,n}^{S,\alpha} = \sum_{k=1}^\infty m_{\gamma,n}^{S^k,\alpha}\,\nu(k), \quad n = 1, 2, \ldots.$$
(c) The space D S of all sequences of detailed occupation measures defined by (6.15), the space Dt of all the total normalized occupation measures defined by (3.17)
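and the space of all the total nonnormalized occupation measures defined by (6.12) are all convex.

Proof (a) According to Definition 6.2.1, we have for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$ and $\Gamma_A \in \mathcal{B}(\mathbf{A})$ that
$$\begin{aligned}
m_{\gamma,n}^{S,\alpha}(\Gamma_X \times \Gamma_A)
&= E_\gamma^S\left[\int_{(T_{n-1},T_n] \cap \mathbb{R}_+} e^{-\alpha t} I\{X_{n-1} \in \Gamma_X\}\, \pi_n(\Gamma_A | H_{n-1}, \Xi_n, t - T_{n-1})\,dt\right] \\
&= \int_{\hat\Omega} \left(\int_{(T_{n-1},T_n] \cap \mathbb{R}_+} e^{-\alpha t} I\{X_{n-1} \in \Gamma_X\}\, \pi_n^\beta(\Gamma_A | H_{n-1}, \Xi_n, t - T_{n-1})\,dt\right) P_\gamma^S(d\hat\omega) \\
&= \int_{[0,1]} \int_\Omega \left(\int_{(T_{n-1},T_n] \cap \mathbb{R}_+} e^{-\alpha t} I\{X_{n-1} \in \Gamma_X\}\, \pi_n^\beta(\Gamma_A | H_{n-1}, \Xi_n, t - T_{n-1})\,dt\right) P_\gamma^S(d\omega | \beta)\,\nu(d\beta) \\
&= \int_{[0,1]} \int_\Omega \left(\int_{(T_{n-1},T_n] \cap \mathbb{R}_+} e^{-\alpha t} I\{X_{n-1} \in \Gamma_X\}\, \pi_n^\beta(\Gamma_A | H_{n-1}, \Xi_n, t - T_{n-1})\,dt\right) P_\gamma^{S^\beta}(d\omega)\,\nu(d\beta) \\
&= \int_{[0,1]} m_{\gamma,n}^{S^\beta,\alpha}(\Gamma_X \times \Gamma_A)\,\nu(d\beta),
\end{aligned}$$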
where ˆ = [0, 1] × is given in (6.31), the third equality is by Proposition B.1.37, and the fourth equality is by (6.32). (b) The proof is identical to part (a) with reference to Corollary 6.4.1. (See (6.35).) (c) As was explained below Definition 6.4.2, one can always define mixtures of two arbitrary π-ξ-strategies. Now the convexity of all the spaces mentioned in the statement follows from part (b). For the rigorous justification of the convexity of Dt , one should also refer to Theorem 6.2.1: D S = D ReM . (Note that the space Dt was introduced only for the strategies described in Chap. 1, which include all π-strategies.) Now we investigate the detailed occupation measures coming from generalized standard ξ-strategies: S,α ∞ Dξ := {m γS,α = {m γ,n }n=1 , S ∈ SξG }.
Recall that
S,α ∞ }n=1 , S ∈ SξM } D Ra M := {m γS,α = {m γ,n
and introduce the following class of detailed occupation measures: S,α ∞ }n=1 , S is a mixture of simple Ddet M := {m γS,α = {m γ,n
deterministic Markov strategies {ϕˆ β }β∈[0,1] }. Theorem 6.4.6 (a) The space Dξ is convex. (b) For an arbitrarily fixed initial distribution γ and discount factor α ≥ 0, Dξ = D Ra M = Ddet M . Proof (a) The space of strategic measures {PγS , S ∈ SξG } is convex by Theorem 6.4.4(a). Therefore the space Dξ is convex by Lemma 6.2.1. (b) Suppose S ∈ SξG and PγS Pσγ , where σ is a strategy in the associated DTMDP: see Definition 6.4.4 and the explanations preceding it. According to Proposition C.1.4, there is a Markov strategy σ M in the associated DTMDP such that Pσγ (X n−1 ∈ d x, An ∈ da) = Pσγ (X n−1 ∈ d x, An ∈ da), ∀ n = 1, 2, . . . . M
This Markov strategy corresponds to the Markov standard ξ-strategy in CTMDP M ∞ M p M = { pnM (da|xn−1 )}∞ n=1 = {σn (da|x n−1 )}n=1 ∈ Sξ ,
for which Pγp Pσγ . Therefore, M
M
Pγp (X n−1 ∈ d x, An ∈ da) = Pσγ (X n−1 ∈ d x, An ∈ da) M
M
= Pσγ (X n−1 ∈ d x, An ∈ da) = PγS (X n−1 ∈ d x, An ∈ da) for all n = 1, 2, . . . and hence m γp
M
,α
(d x × da) = m γS,α (d x × da)
by Lemma 6.2.1. Thus Dξ = D Ra M because the inclusion D Ra M ⊆ Dξ is trivial. Next, let us show that D Ra M = Ddet M . Suppose S˜ is a mixture of strategies {S β }β∈[0,1] from SξG with the weights distribution ν ∈ P([0, 1]). Note, we do not assume that the strategies S β are simple deterministic Markov. Then the marginal of β ˜ PγS on is the generalized convex combination [0,1] PγS ν(dβ) by Theorem 6.4.1,
and Theorem 6.4.4(a) implies that there is a strategy S ∈ SξG such that PγS = Sβ [0,1] Pγ ν(dβ). Therefore, ˜
m γS,α = m γS,α ˜ we obviously have the equality by Lemma 6.2.1: for the mixture S, ˜ S,α m γ,n ( X
× A) =
[0,1]
EγS
β
I{X n−1
I{A ∈ } n A ∈ X }e−αTn−1 ν(dβ), α + q X n−1 (An )
for all n = 1, 2, . . . and for all X ∈ B(X), A ∈ B(A) (see (6.32) and (6.33)). We have thus proved that Ddet M ⊆ Dξ = D Ra M . Suppose S ∈ SξM is arbitrarily fixed. According to Theorem 6.4.5(b), the measure S Pγ is represented by a mixture S˜ of simple deterministic Markov strategies {ϕˆ β }β∈[0,1] β with the weights distribution ν (recall Definition 6.4.6). Hence, PγS = [0,1] Pγϕˆ ν(dβ) and ˜ m γS,α = m γS,α by Lemma 6.2.1. Therefore, D Ra M ⊆ Ddet M . The proof is complete.
Corollary 6.4.2 Suppose S,α ∞ }n=1 , S is a mixture of strategies from a class D := {m γS,α = {m γ,n
S˜ ⊂ SξG which includes all simple deterministic Markov strategies}. Then D = Dξ = D Ra M . Proof During the proof of Theorem 6.4.6, we proved that D ⊆ Dξ = D Ra M = Ddet M . Clearly, Ddet M ⊆ D. Hence D = Ddet M = Dξ = D Ra M . In particular, the spaces S,α ∞ }n=1 , S is a mixture of deterministic strategies Ddet := {m γS,α = {m γ,n
{ϕβ }β∈[0,1] from SξG } and S,α ∞ }n=1 , S is a mixture of Markov standard D M := {m γS,α = {m γ,n
ξ-strategies { p M,β }β∈[0,1] }
satisfy the relation Dξ = D Ra M = Ddet = D M .
6.5 Example: Utilization of an Unreliable Device

Let us illustrate several theoretical issues presented in the current chapter with the following example arising from the context of reliability. Suppose a device can be in one of three states: $\mathbf{X} := \{\Delta, 1, 2\}$, where 1 means that the device is new, 2 corresponds to the partially broken state, and the absorbing costless cemetery $\Delta$ means that the device is broken. In the states 1 and 2, the device can be used (utilized) with the controlled intensity $a \in \mathbf{A} := [0, 1]$, leading to the transition rates $q(2|1, a) = \lambda_0^1 + \lambda_1^1 a$ and $q(\Delta|2, a) = \lambda_0^2 + \lambda_1^2 a$. We accept that $q(\Delta|1, a) \equiv q(1|2, a) \equiv 0$; $\gamma(1) = 1$. When in the state 1 (respectively, 2), the action $a \in \mathbf{A}$ leads to the reward rate $r^1 a$ (respectively, $r^2 a$). Here $r^x$ does not mean the $x$th power of $r$. Since we are working on minimization problems, we put $c_0(x, a) := -r^x a$. Another objective is the total lifetime of the device, so we put $c_1(x, a) := -1$. The constants $\lambda_{0,1}^x > 0$ and $r^x > 0$ are fixed, $x \in \mathbf{X}_\Delta := \mathbf{X} \setminus \{\Delta\} = \{1, 2\}$. We are going to investigate the total undiscounted constrained problem (6.10) with $\alpha = 0$ under a properly chosen constant $d$. Since we have a single constraint, the index 1 has been omitted.

Since Condition 1.3.2(b) is satisfied, it is sufficient to restrict to Markov standard ξ-strategies $\mathbf{S}^M$ by Corollary 6.2.1(a); see also Theorem 1.3.2. Similarly to Example 6.4.3, one can easily see that the space $\{P_\gamma^S, S \in \mathbf{S}^M\}$ is not convex.

Let us show how one can construct the mappings $r_n$ as in Theorem 6.4.5(b). For an arbitrary Markov standard ξ-strategy $S = p^M \in \mathbf{S}_\xi^M$ and the strategic measure $P_\gamma^S$, we consider the measure $\mathsf{P}_\gamma^\sigma$ in the associated DTMDP that corresponds to $P_\gamma^S$ in the sense of Definition 6.4.4, where $\sigma$ is the corresponding Markov strategy. Only the marginal
Pσγ (X 0 ∈ X 0 , A1 ∈ A1 , X 1 ∈ X 1 , A2 ∈ A2 ), X i ∈ B(X ), Ai ∈ B([0, 1]) is important: the cemetery can be ignored. To be more specific, we can put pn (da|) = σn (da|) := δ0 (da). The event {X 0 = 1, X 2 = 2} is with probability 1 under Pσγ and PγS , and the random variables A1 and A2 are independent under Pσγ and PγS because the strategies S = p M and σ are Markov. The Borel space [0, 1] × [0, 1] of the values of the actions (A1 , A2 ) is isomorphic to [0, 1] by Proposition B.1.1. Let us denote this isomorphism by g : [0, 1] → [0, 1] × [0, 1]. The image on ([0, 1], B([0, 1]) of the measure defined by Pσγ (X 0 = 1, A1 ∈ A1 , X 1 = 2, A2 ∈ A2 ) = Pσγ (A1 ∈ A1 )Pσγ (A2 ∈ A2 ), Ai ∈ B([0, 1]),
(6.42)
with respect to g −1 is denoted by μ(), ∈ B([0, 1]). Finally, let f : [0, 1] → [0, 1] be a measurable mapping such that the corresponding image of the Lebesgue measure ν = Leb on [0, 1] coincides with μ: see Proposition B.1.21 and the remark after it. We put r1 (β, x) := g1 ( f (β)), r2 (β, x) := g2 ( f (β)) for β ∈ [0, 1], where g1 (·) and g2 (·) are the coordinates of the mapping g. To complete the definition of the mappings {rn }∞ n=1 , we put r 3 (β, x) := 0; for n > 3, x n−1 = x ∞ , and rn (β, x∞ ) := ξ∞ . The measure (6.42) coincides with the image of the Lebesgue measure ν = Leb on [0, 1] with respect to the mapping β → (r1 (β, x), r2 (β, x)), x ∈ {1, 2}, meaning that Pσγ (X 0 = 1, A1 ∈ A1 , X 1 = 2, A2 ∈ A2 ) = δr1 (β,1) ( A1 )δr2 (β,2) ( A2 )ν(dβ) [0,1] β,D = Pϕγ (X 0 = 1, A1 ∈ A1 , X 1 = 2, A2 ∈ A2 )ν(dβ), [0,1]
Ai ∈ B([0, 1]), where for each β ∈ [0, 1], ϕβ,D = {rn (β, xn−1 )}∞ n=1 is a deterministic Markov strategy in the associated DTMDP. In the condensed form, we have the equality
Pσγ =
β,D
[0,1]
Pϕγ ν(dβ).
The same mappings {rn }∞ n=1 define the deterministic Markov strategy ϕˆ βn (xn−1 ) := rn (β, xn−1 ), n = 1, 2, . . . β
in the CTMDP, for which Pγϕˆ Pϕγ [0,1]
β Pγϕˆ ν(dβ)
β,D
and hence, by Theorem 6.4.3(b),
β,D
[0,1]
Pϕγ ν(dβ) = Pσγ PγS .
(According to the proof of Theorem 6.4.4, the mappings β → Pγϕ are measurable.) Since is a 1-1-correspondence, PγS
=
β
,D
and β → Pγϕˆ
β
β
[0,1]
Pγϕˆ ν(dβ)
and the measure $P_\gamma^S$ is represented by the mixture $\tilde S$ of $\{\hat\varphi^\beta\}_{\beta \in [0,1]}$ with the weights distribution $\nu = \mathrm{Leb}$ on $[0, 1]$ by Theorem 6.4.5(b).

Below, we provide the explicit solution to the formulated constrained problem. It is more natural to change the signs of the functions $c_0$, $c_1$ and of the constant $d$ and to investigate the maximization problem under the “bigger than” condition:
$$W_0^0(S, \gamma) \to \max_{S \in \mathbf{S}^M} \quad \text{subject to } W_1^0(S, \gamma) \ge d.$$
Since Condition 4.2.4 is satisfied, we have to solve the following constrained maximization problem for the associated DTMDP (see the third subsubsection of Sect. 4.2.4):
• $\mathbf{X}$ and $\mathbf{A}$ are the same state and action spaces;
• $p(2|1, a) = p(\Delta|2, a) = p(\Delta|\Delta, a) = 1$ is the transition probability;
• $\gamma(1) = 1$ is the initial distribution;
• the cost functions are given by
$$l_0(x, a) = \frac{r^x a}{\lambda_0^x + \lambda_1^x a}, \quad x \in \{1, 2\}; \qquad l_1(x, a) = \frac{1}{\lambda_0^x + \lambda_1^x a}, \quad x \in \{1, 2\}; \qquad l_{0,1}(\Delta, a) \equiv 0;$$
• d > 0 is the arbitrarily fixed constraint constant. If σ ∗ is an optimal Markov strategy in this DTMDP, then the Markov standard ξ-strategy p M = σ ∗ is optimal in the original CTMDP according to the third sub-
subsection of Sect. 4.2.4. One can construct the optimal Markov strategy σ ∗ using Proposition C.2.16. Note that Conditions C.2.1, C.2.2, C.2.3 and C.2.4 are all satisfied for the DTMDP under consideration. In what follows, we assume that λ10
1 1 1 1 + 2 λ1 0 , then both maxima in (6.45) and (6.46) are provided by a2 = a1 = 0. 1 The optimal value of problem (6.44) equals F(g) =
g g + 2 − gd. λ10 λ0
Step 2. Minimize the piecewise linear function F(g) over g ∈ R0+ .
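According to the inequalities (6.43), one only has to compare the values $F\left(\frac{r^2 \lambda_0^2}{\lambda_1^2}\right)$ and $F\left(\frac{r^1 \lambda_0^1}{\lambda_1^1}\right)$.
(i) Suppose
$$\frac{1}{\lambda_0^1 + \lambda_1^1} + \frac{1}{\lambda_0^2} > d. \tag{6.47}$$
Then $\min_{g \in \mathbb{R}_+^0} F(g)$ is attained at $g^* = \frac{r^2 \lambda_0^2}{\lambda_1^2}$. Expression (6.45) takes the form
$$W_{g^*}^{DT*}(2) = \max_{a \in [0,1]} \frac{r^2 a + g^*}{\lambda_0^2 + \lambda_1^2 a} = \max_{a \in [0,1]} \frac{r^2}{\lambda_1^2} = \frac{r^2}{\lambda_1^2},$$
where any value $a \in [0, 1]$ provides the maximum. One has to choose a value $a_2^* \in [0, 1]$ so that the constraint-inequality becomes an equality, that is,
$$\frac{1}{\lambda_0^1 + \lambda_1^1} + \frac{1}{\lambda_0^2 + \lambda_1^2 a_2^*} = d.$$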
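Solving the constraint equality for $a_2^*$ is elementary; the sketch below does it in closed form. The helper `solve_a2` and all numerical parameter values are hypothetical and serve only to illustrate the computation. They are chosen so that $d$ lies strictly between the total lifetimes obtained with $a_2 = 1$ and with $a_2 = 0$ (given $a_1 = 1$).

```python
def solve_a2(lam0, lam1, d):
    """Solve 1/(lam0[1] + lam1[1]) + 1/(lam0[2] + lam1[2] * a) = d for a.

    lam0[x] and lam1[x] stand for the constants lambda_0^x and lambda_1^x.
    """
    remaining = d - 1.0 / (lam0[1] + lam1[1])   # lifetime left for state 2
    return (1.0 / remaining - lam0[2]) / lam1[2]

# Hypothetical parameter values (not taken from the text).
lam0 = {1: 1.0, 2: 1.0}
lam1 = {1: 1.0, 2: 1.0}
d = 1.2

a2_star = solve_a2(lam0, lam1, d)
lifetime = 1.0 / (lam0[1] + lam1[1]) + 1.0 / (lam0[2] + lam1[2] * a2_star)
print(a2_star, lifetime)
```

With these values the solution lies inside the interval $[0, 1]$ and the resulting total lifetime reproduces $d$.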
Remember, the action a1∗ = 1 providing the maximum in (6.46) for g = g ∗ is fixed. The obtained equation has a unique solution a2∗ in the interval [0, 1] because, according to (6.43) and (6.47), 1 1 1 1 + 2 > d and 1 + 2 < d. λ10 + λ11 λ0 λ0 + λ11 λ0 + λ21 Now the Markov (in fact, deterministic stationary) strategy σn∗ (1|1) = σn∗ (a2∗ |2) = 1,
n = 1, 2, . . .
solves the constrained DTMDP problem by Proposition C.2.16. By the way, the optimal total occupation measure is ∗
M∗ ({x} × da) = Mσγ ({x} × da) =
δ1 (da), if x = 1; δa2∗ (da), if x = 2.
The actions in the cemetery state $\Delta$ are immaterial, and $M^*(\{\Delta\}\times\mathbf{A}) = \infty$. The Markov standard ξ-strategy (in fact, deterministic stationary)
$$\varphi^s(x) = \begin{cases} 1, & \text{if } x = 1;\\ a_2^*, & \text{if } x = 2\end{cases} \tag{6.48}$$
solves the original constrained CTMDP problem. Another optimal solution is given by the following Markov (in fact, stationary) standard ξ-strategy:
$$p_n^M(1|1) = 1;\quad p_n^M(1|2) = \hat a_2^*;\quad p_n^M(0|2) = 1 - \hat a_2^*,\quad n = 1, 2, \ldots,$$
where $\hat a_2^*$ is the solution to the equation
$$\frac{a}{\lambda_0^2+\lambda_1^2} + \frac{1-a}{\lambda_0^2} = d - \frac{1}{\lambda_0^1+\lambda_1^1},$$
because the Markov (in fact, stationary) strategy
$$\hat\sigma_n^*(1|1) = 1;\quad \hat\sigma_n^*(1|2) = \hat a_2^*;\quad \hat\sigma_n^*(0|2) = 1 - \hat a_2^*,\quad n = 1, 2, \ldots$$
again solves the constrained DTMDP problem by Proposition C.2.16. The solution to the original constrained CTMDP problem can also be represented as the mixture $\tilde S$ of the following simple deterministic Markov (in fact, stationary) strategies
$$\hat\varphi_n^\beta(x) = r_n(\beta, x) = \begin{cases} 1, & \text{if } x = 1, \text{ or if } x = 2 \text{ and } \beta \le \hat a_2^*;\\ 0, & \text{if } x = 2 \text{ and } \beta > \hat a_2^*\end{cases}$$
with $\nu = \mathrm{Leb}$ being the uniform distribution (i.e., Lebesgue measure) on $[0,1]$. For the justification, see Theorem 6.4.5: the strategic measure $P_\gamma^{p^M}$ is represented by the mixture $\tilde S$ of the strategies $\{\hat\varphi^\beta\}_{\beta\in[0,1]}$ with the weights distribution $\nu$. One can construct the Markov π-strategy $\pi^M$ having the same detailed occupation measures as the above strategy $p^M$, and hence also solving the original constrained CTMDP problem. Using the proof of Theorem 6.2.1, we obtain
$$\pi_1^M(1|1,s) = 1;\qquad \pi_2^M(1|2,s) = 1 - \pi_2^M(0|2,s) = \frac{e^{-(\lambda_0^2+\lambda_1^2)s}\,\hat a_2^*}{e^{-(\lambda_0^2+\lambda_1^2)s}\,\hat a_2^* + e^{-\lambda_0^2 s}(1-\hat a_2^*)}.$$
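The quantities of case (i) can be computed in closed form. The sketch below is only an illustration under hypothetical parameter values: it solves the two equations for $a_2^*$ and $\hat a_2^*$ and evaluates the time-dependent probability $\pi_2^M(1|2,s)$. The function and variable names are not from the text.

```python
import math


def solve_case_i(lam0, lam1, d):
    """lam0, lam1: dicts keyed by the states 1 and 2 (hypothetical layout)."""
    # Mean lifetime that state 2 has to deliver once a_1 = 1 is fixed.
    rest = d - 1.0 / (lam0[1] + lam1[1])
    # Deterministic solution: 1 / (lam0^2 + lam1^2 * a) = rest.
    a2 = (1.0 / rest - lam0[2]) / lam1[2]
    # Randomized solution: a / (lam0^2 + lam1^2) + (1 - a) / lam0^2 = rest.
    full, idle = 1.0 / (lam0[2] + lam1[2]), 1.0 / lam0[2]
    a2_hat = (idle - rest) / (idle - full)
    return a2, a2_hat


def pi2_full_use(s, lam0, lam1, a2_hat):
    """Probability of action 1 in state 2 after s time units spent there."""
    w1 = a2_hat * math.exp(-(lam0[2] + lam1[2]) * s)
    w0 = (1.0 - a2_hat) * math.exp(-lam0[2] * s)
    return w1 / (w1 + w0)


# Hypothetical data satisfying the inequalities of case (i).
lam0, lam1, d = {1: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0}, 1.25
a2, a2_hat = solve_case_i(lam0, lam1, d)
p_start = pi2_full_use(0.0, lam0, lam1, a2_hat)
p_later = pi2_full_use(1.0, lam0, lam1, a2_hat)
```

At $s = 0$ the probability equals $\hat a_2^*$, and it decreases in $s$: the longer the device survives in state 2, the more likely it is that the idle action was the one drawn.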
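Let us check that the total lifetime up to the absorption in the cemetery $\Delta$ equals $d$.
$$E_\gamma^{\pi^M}[\Theta_1] = \frac{1}{\lambda_0^1+\lambda_1^1};\qquad E_\gamma^{\pi^M}[\Theta_2] = \int_{\mathbb{R}_+} \theta\, q(\theta)\, e^{-\int_0^\theta q(u)\,du}\, d\theta = \int_{\mathbb{R}_+} e^{-\int_0^\theta q(u)\,du}\, d\theta,$$
where
$$q(s) = \frac{(\lambda_0^2+\lambda_1^2)\, e^{-(\lambda_0^2+\lambda_1^2)s}\,\hat a_2^* + \lambda_0^2\, e^{-\lambda_0^2 s}(1-\hat a_2^*)}{e^{-(\lambda_0^2+\lambda_1^2)s}\,\hat a_2^* + e^{-\lambda_0^2 s}(1-\hat a_2^*)}$$
is the jump intensity from 2 to $\Delta$. It is clear that
$$\frac{d}{d\theta}\ln\left(e^{-(\lambda_0^2+\lambda_1^2)\theta}\,\hat a_2^* + e^{-\lambda_0^2\theta}(1-\hat a_2^*)\right) = -q(\theta);$$
hence
$$e^{-\int_{(0,\theta]} q(u)\,du} = e^{-(\lambda_0^2+\lambda_1^2)\theta}\,\hat a_2^* + e^{-\lambda_0^2\theta}(1-\hat a_2^*)$$
and
$$E_\gamma^{\pi^M}[\Theta_2] = \frac{\hat a_2^*}{\lambda_0^2+\lambda_1^2} + \frac{1-\hat a_2^*}{\lambda_0^2},$$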
so that $E_\gamma^{\pi^M}[\Theta_1 + \Theta_2] = d$ by the definition of $\hat a_2^*$.
(ii) Suppose
$$\frac{1}{\lambda_0^1+\lambda_1^1} + \frac{1}{\lambda_0^2} \le d.$$
Then $\min_{g\in\mathbb{R}_+^0} F(g)$ is attained at $g^* = \frac{r^1\lambda_0^1}{\lambda_1^1}$. All the reasoning remains the same. The action $a_2^* = 0$ providing the maximum in (6.45) is fixed, and one has to choose a value $a_1^* \in [0,1]$ such that
$$\frac{1}{\lambda_0^1+\lambda_1^1 a_1^*} + \frac{1}{\lambda_0^2} = d.$$
The optimal Markov standard ξ-strategy $p^M$ is defined by
$$p_n^M(1|1) = 1 - p_n^M(0|1) = \hat a_1^*;\quad p_n^M(0|2) = 1,\quad n = 1, 2, \ldots,$$
where $\hat a_1^*$ is such that
$$\frac{\hat a_1^*}{\lambda_0^1+\lambda_1^1} + \frac{1-\hat a_1^*}{\lambda_0^1} = d - \frac{1}{\lambda_0^2}.$$
Solutions in the form of a mixture of simple deterministic Markov strategies and in the form of a Markov π-strategy can be obtained similarly to the case (i).
The total reward $W_0^0$ and the total lifetime $W_1^0$ have two components associated with the states 1 and 2. Namely,
$$w_0^x := \frac{r^x a_x}{\lambda_0^x + \lambda_1^x a_x},\quad x\in\{1,2\}\qquad\text{and}\qquad w_1^x := \frac{1}{\lambda_0^x + \lambda_1^x a_x},\quad x\in\{1,2\}:$$
see the expressions (6.45) and (6.46); $a_x \in [0,1]$ is the utilization intensity in the state $x \in \{1,2\}$. The derivative $-\frac{dw_0^x}{dw_1^x} = \frac{r^x\lambda_0^x}{\lambda_1^x}$ equals the reduction of the reward per unit increase of the lifetime. We investigated the case $\frac{r^2\lambda_0^2}{\lambda_1^2} < \frac{r^1\lambda_0^1}{\lambda_1^1}$, which means that, if we need to increase the lifetime, then we firstly try to decrease the utilization intensity in the state 2. In accordance with this observation, the solution to the constrained problem, as the constraint constant $d$ increases, is as follows.
• When $d \le \frac{1}{\lambda_0^1+\lambda_1^1} + \frac{1}{\lambda_0^2+\lambda_1^2}$, the constraint is not essential, and one should simply maximize the reward by putting $a_1 = a_2 = 1$.
• When $\frac{1}{\lambda_0^1+\lambda_1^1} + \frac{1}{\lambda_0^2+\lambda_1^2} < d < \frac{1}{\lambda_0^1+\lambda_1^1} + \frac{1}{\lambda_0^2}$, in order to meet the constraint on the lifetime, one should reduce the utilization intensity in the state 2.
• When $\frac{1}{\lambda_0^1+\lambda_1^1} + \frac{1}{\lambda_0^2} \le d < \frac{1}{\lambda_0^1} + \frac{1}{\lambda_0^2}$, the utilization intensity in the state 2 is minimal possible, and, in order to meet the increasing constraint on the lifetime, one has to reduce the utilization intensity in the state 1, too.
• When $d = \frac{1}{\lambda_0^1} + \frac{1}{\lambda_0^2}$, the only feasible solution is $a_1 = a_2 = 0$, and the Slater condition is violated. If $d > \frac{1}{\lambda_0^1} + \frac{1}{\lambda_0^2}$, there are no feasible solutions.
In the case of $\frac{r^2\lambda_0^2}{\lambda_1^2} > \frac{r^1\lambda_0^1}{\lambda_1^1}$, the solution is similar: one simply has to swap the indices 1 and 2. All the expressions remain correct. The case $\frac{r^2\lambda_0^2}{\lambda_1^2} = \frac{r^1\lambda_0^1}{\lambda_1^1}$ is neutral: one can follow either of the optimal strategies.
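The regimes listed above can be collected into one small routine. The following is a minimal sketch, assuming the ordering $\frac{r^2\lambda_0^2}{\lambda_1^2} < \frac{r^1\lambda_0^1}{\lambda_1^1}$ (so that state 2 is throttled first) and returning the deterministic stationary utilization intensities $(a_1, a_2)$; the function name and the numerical data are hypothetical.

```python
def optimal_intensities(lam0, lam1, d):
    """Return (a1, a2) for the lifetime constraint level d, or None if infeasible.

    lam0, lam1: pairs (value for state 1, value for state 2).
    Assumes r2*lam0_2/lam1_2 < r1*lam0_1/lam1_1, i.e. state 2 is throttled first.
    """
    full1, full2 = 1.0 / (lam0[0] + lam1[0]), 1.0 / (lam0[1] + lam1[1])
    idle1, idle2 = 1.0 / lam0[0], 1.0 / lam0[1]
    if d <= full1 + full2:
        # The constraint is not essential: use the device at full intensity.
        return 1.0, 1.0
    if d < full1 + idle2:
        # Reduce the intensity in state 2 until the lifetime equals d.
        return 1.0, (1.0 / (d - full1) - lam0[1]) / lam1[1]
    if d <= idle1 + idle2:
        # State 2 is idle; reduce the intensity in state 1 as well.
        return (1.0 / (d - idle2) - lam0[0]) / lam1[0], 0.0
    return None  # no feasible solution


lam0, lam1 = (1.0, 1.0), (1.0, 1.0)  # hypothetical rates
regimes = [optimal_intensities(lam0, lam1, d) for d in (1.0, 1.25, 1.75, 2.0, 2.5)]
```

For the hypothetical rates above the thresholds are 1, 1.5 and 2, so the five sample values of $d$ visit every regime, including the boundary case $a_1 = a_2 = 0$ and the infeasible one.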
6.6 Realizable Strategies
In this section, we generalize the ideas presented in Sect. 1.1.4 to π-ξ-strategies.
Definition 6.6.1 A π-ξ-strategy $S = \{\Xi, p_0, \{(p_n,\pi_n)\}_{n=1}^\infty\}$ is called realizable for $(h_{n-1},\xi_n) \in \mathbf{H}_{n-1}\times\Xi$ with $x_{n-1}\in\mathbf{X}$ if there is a complete probability space $(\tilde\Omega_n, \tilde{\mathcal F}_n, \tilde P_n)$ and a measurable, with respect to $(\tilde\omega, s)$, process $A_n$ on $\tilde\Omega_n\times\mathbb{R}_+$ with values in $\mathbf{A}$ such that the following assertions hold.
(a) For almost all $s\in\mathbb{R}_+$, $\pi_n(\Gamma_A|h_{n-1},\xi_n,s)$ coincides with $\tilde P_n(A_n(s)\in\Gamma_A)$ for each $\Gamma_A\in\mathcal{B}(\mathbf{A})$. As usual, the argument $\tilde\omega\in\tilde\Omega_n$ is often omitted.
(b) For any conservative and stable transition rate $\hat q$, for the random probability measure $G_{\tilde\omega}$ on $\bar{\mathbb{R}}_+\times\mathbf{X}_\infty$ depending on $\tilde\omega\in\tilde\Omega_n$ and defined by
$$G_{\tilde\omega}(\{\infty\}\times\{x_\infty\}) := e^{-\int_{(0,\infty)} \hat q_{x_{n-1}}(A_n(\tilde\omega,s))\,ds};$$
$$G_{\tilde\omega}(\Gamma_R\times\Gamma_X) := \int_{\Gamma_R} \hat{\tilde q}(\Gamma_X|x_{n-1}, A_n(\tilde\omega,\theta))\, e^{-\int_{(0,\theta]} \hat q_{x_{n-1}}(A_n(\tilde\omega,s))\,ds}\, d\theta,\quad \forall\ \Gamma_R\in\mathcal{B}(\mathbb{R}_+),\ \Gamma_X\in\mathcal{B}(\mathbf{X}),$$
we must obtain the measure
$$\int_{\Gamma_R\cap\mathbb{R}_+} \hat{\tilde q}(\Gamma_X|x_{n-1},\xi_n,\pi_n,\theta)\, e^{-\int_{(0,\theta]} \hat q_{x_{n-1}}(\xi_n,\pi_n,s)\,ds}\, d\theta + I\{\infty\in\Gamma_R,\ x_\infty\in\Gamma_X\}\, e^{-\int_{(0,\infty)} \hat q_{x_{n-1}}(\xi_n,\pi_n,s)\,ds},\quad \forall\ \Gamma_R\in\mathcal{B}(\bar{\mathbb{R}}_+),\ \Gamma_X\in\mathcal{B}(\mathbf{X}_\infty),$$
after taking expectation $\tilde E_n$ with respect to $\tilde P_n$.
A generalized π-ξ-strategy $S$ is realizable if for each $n\in\mathbb{N}$, it is realizable for $(H_{n-1}(\omega), \Xi_n(\omega))$ with $X_{n-1}\in\mathbf{X}$ almost surely with respect to $P_\gamma^S$. Recall that the initial distribution $\gamma$ is fixed.
This definition is slightly different from Definition 1.1.10. Consider a generalized standard ξ-strategy with $p_n$ being independent of the ξ-components so that the ξ-components will be omitted from $\omega\in\Omega$ here. Essentially, this strategy is just a standard ξ-strategy in the sense of Definition 1.1.2. (See Remark 6.1.4.) However, Definition 6.6.1 requires the existence of the process $A_n(\cdot)$ for (almost all) $(H_{n-1}, \Xi_n = A_n)$, $n = 1, 2, \ldots$, and Definition 1.1.10 requires the existence of the process $A_n(\cdot)$ for (almost all) $H_{n-1}$, $n = 1, 2, \ldots$. Nevertheless, if a strategy from Chap. 1 is a specific π-ξ-strategy, then it is realizable in the sense of Definition 1.1.10 if and only if it is realizable in the sense of Definition 6.6.1, as Theorems 6.6.1 and 6.6.2 show. In the current section, we accept Definition 6.6.1.
Similarly to Definition 1.3.2, one can define a realizable generalized π-ξ-strategy in a slightly different but equivalent way; the proof of the equivalence is similar to the proof of Lemma 1.3.1 for the case of $S_n = \pi_n$. In Corollary 6.6.2, we provide another equivalent definition of realizability.
Theorem 6.6.1 The following statements are equivalent.
(a) A generalized π-ξ-strategy $S$ is realizable for $(h_{n-1},\xi_n)\in\mathbf{H}_{n-1}\times\Xi$ with $x_{n-1}\in\mathbf{X}$.
(b) There is a complete probability space $(\tilde\Omega_n, \tilde{\mathcal F}_n, \tilde P_n)$ and a measurable (with respect to $(\tilde\omega, t)$) process $A_n$ on $\tilde\Omega_n\times\mathbb{R}_+$ with values in $\mathbf{A}$ such that for almost all $s\in\mathbb{R}_+$, $\tilde P_n(A_n(s)\in\Gamma_A) = \pi_n(\Gamma_A|h_{n-1},\xi_n,s)$ for each $\Gamma_A\in\mathcal{B}(\mathbf{A})$, and for each $\theta\in\mathbb{R}_+$ and each bounded measurable function $\hat q$ on $\mathbf{A}$, the random variable $\int_{(0,\theta]} \hat q(A_n(s))\,ds$ is degenerate (not random), that is, equals a constant $\tilde P_n$-a.s.
(c) For almost all $s\in\mathbb{R}_+$, $\pi_n(\cdot|h_{n-1},\xi_n,s) = \delta_{\varphi(s)}(\cdot)$ is a Dirac measure, where $\varphi$ is an $\mathbf{A}$-valued measurable mapping on $\mathbb{R}_+$.
(d) There is a complete probability space $(\tilde\Omega_n, \tilde{\mathcal F}_n, \tilde P_n)$ and a measurable (with respect to $(\tilde\omega, t)$) process $A_n$ on $\tilde\Omega_n\times\mathbb{R}_+$ with values in $\mathbf{A}$ such that
– for almost all $s\in\mathbb{R}_+$, $\tilde P_n(A_n(s)\in\Gamma_A) = \pi_n(\Gamma_A|h_{n-1},\xi_n,s)$ for each $\Gamma_A\in\mathcal{B}(\mathbf{A})$, and
– for each bounded measurable function $\hat q$ on $\mathbf{A}$ and bounded nonoverlapping intervals $I_1, I_2\subseteq\mathbb{R}_+$, the random variables $\int_{I_1} \hat q(A_n(u))\,du$ and $\int_{I_2} \hat q(A_n(u))\,du$ are independent.
The proof is identical to the proof of Theorem 1.1.2.
Theorem 6.6.2 (a) Every ξ-strategy is realizable.
(b) For every realizable generalized π-ξ-strategy, there is a ξ-strategy indistinguishable from it.
(c) If $S$ is a realizable generalized π-ξ-strategy and $S'$ is a strategy indistinguishable from $S$, then $S'$ is realizable, too.
Proof (a) Let $S = \{\Xi, p_0, \{(p_n,\varphi_n)\}_{n=1}^\infty\}\in\mathcal{S}_\xi$. Then, for each $n = 1, 2, \ldots$, $(h_{n-1},\xi_n)\in\mathbf{H}_{n-1}\times\Xi$ with $x_{n-1}\in\mathbf{X}$, one can take $\tilde\Omega_n := \{\tilde\omega\}$ as a singleton and put $A_n(\tilde\omega, s) := \varphi_n(h_{n-1},\xi_n,s)$.
(b) Suppose a generalized π-ξ-strategy $S = \{\Xi, p_0, \{(p_n,\pi_n)\}_{n=1}^\infty\}$ is realizable. Let $D := \{\text{Dirac measures on } \mathbf{A}\}\subseteq\mathcal{P}(\mathbf{A})$. After we fix an arbitrary compatible metrizable and separable topology on the Borel space $\mathbf{A}$ and the corresponding weak topology on the space $\mathcal{P}(\mathbf{A})$, we see that the set $D$ is measurable by Proposition B.1.29. For each $n = 1, 2, \ldots$ the set
$$N^n := \{(h_{n-1},\xi_n,s):\ x_{n-1}\in\mathbf{X},\ \pi_n(\cdot|h_{n-1},\xi_n,s)\notin D\}$$
is measurable by Proposition B.1.31. Let $M_n(dh\times d\xi)$ be the image of the strategic measure $P_\gamma^S$ with respect to the mapping $\Omega\to\mathbf{H}_{n-1}\times\Xi$. Since the strategy $S$ is realizable, by Fubini's Theorem, $(M_n\times\mathrm{Leb})(N^n) = 0$ because, according to Theorem 6.6.1(c), for $M_n$-almost all $(h_{n-1},\xi_n)$, the Lebesgue measure of the section
$$N^n_{(h_{n-1},\xi_n)} := \{s\in\mathbb{R}_+:\ (h_{n-1},\xi_n,s)\in N^n\}$$
is zero. On the set
$$K_n := (\{h_{n-1}\in\mathbf{H}_{n-1}:\ x_{n-1}\in\mathbf{X}\}\times\Xi\times\mathbb{R}_+)\setminus N^n$$
we put $\varphi_n(h_{n-1},\xi_n,s)$ equal to the point at which the Dirac measure $\pi_n(\cdot|h_{n-1},\xi_n,s)$ is concentrated. The mapping $\varphi_n$ is measurable on $K_n$ because the set
$$\{(h_{n-1},\xi_n,s)\in K_n:\ \varphi_n(h_{n-1},\xi_n,s)\in\Gamma\} = \{(h_{n-1},\xi_n,s)\in K_n:\ \pi_n(\Gamma|h_{n-1},\xi_n,s) = 1\}$$
is measurable for each $\Gamma\in\mathcal{B}(\mathbf{A})$. After we extend $\varphi_n$ in an arbitrary but measurable way to
$$\varphi_n:\ (\{h_{n-1}\in\mathbf{H}_{n-1}:\ x_{n-1}\in\mathbf{X}\}\times\Xi\times\mathbb{R}_+)\to\mathbf{A},$$
we obtain the desired mapping $\varphi_n$. Now for the ξ-strategy $\tilde S = \{\Xi, p_0, \{(p_n,\varphi_n)\}_{n=1}^\infty\}$ with the same measure $p_0$ and stochastic kernels $p_n$, $n = 1, 2, \ldots$, we have $P_\gamma^{\tilde S} = P_\gamma^S$: equalities $P_\gamma^{\tilde S}(H_n\in\Gamma_H) = P_\gamma^S(H_n\in\Gamma_H)$ are valid for all $n = 0, 1, \ldots$, $\Gamma_H\in\mathcal{B}(\mathbf{H}_n)$ by the induction argument. The strategies $S$ and $\tilde S$ are indistinguishable because for all $n = 1, 2, \ldots$
$$\pi_n(da|h_{n-1},\xi_n,s) = \delta_{\varphi_n(h_{n-1},\xi_n,s)}(da)$$
for $M_n\times\mathrm{Leb}$-almost all $(h_{n-1},\xi_n,s)$ with $x_{n-1}\in\mathbf{X}$.
(c) It is sufficient to notice that, for the strategy $S'$, one can take the same spaces $(\tilde\Omega_n, \tilde{\mathcal F}_n, \tilde P_n)$ and the same processes $A_n$ ($n = 1, 2, \ldots$) as for the strategy $S$.
Corollary 6.6.1 (a) A generalized π-ξ-strategy $S$ is realizable if and only if it is indistinguishable from a ξ-strategy in $\mathcal{S}_\xi$.
(b) For all indistinguishable realizable strategies, one can fix the common measurable mappings
$$\varphi_n(h_{n-1},\xi_n,s):\ \{h_{n-1}\in\mathbf{H}_{n-1}:\ x_{n-1}\in\mathbf{X}\}\times\Xi\times\mathbb{R}_+\to\mathbf{A},\quad n = 1, 2, \ldots$$
such that, in Definition 6.6.1, for all $n = 1, 2, \ldots$ and $(h_{n-1},\xi_n)\in\mathbf{H}_{n-1}\times\Xi$ with $x_{n-1}\in\mathbf{X}$, $\tilde\Omega_n\equiv\{\tilde\omega\}$ is a singleton and $A_n(\tilde\omega,s) = \varphi_n(h_{n-1},\xi_n,s)$.
Proof (a) This assertion is a direct consequence of Theorem 6.6.2.
(b) The proof follows from part (a) and the proof of parts (c) and (a) of Theorem 6.6.2.
In what follows, if $\tilde\Omega_n = \{\tilde\omega\}$ is a singleton, the argument $\tilde\omega$ in $A_n(\tilde\omega,s)$ is omitted, and we underline that, under fixed $(h_{n-1},\xi_n)$ with $x_{n-1}\in\mathbf{X}$, the function $A_n(\cdot)$ is nonrandom.
According to Corollary 6.6.1, for each realizable π-ξ-strategy $S$, there exists an action process
$$A(\omega,t) = \sum_{n\ge 1} I\{T_{n-1} < t\le T_n\}\, A_n(H_{n-1}, \Xi_n, t - T_{n-1}),\quad t > 0 \tag{6.49}$$
(in the sense of Definition 6.1.4(b), where $\varphi_n = A_n:\ \mathbf{H}_{n-1}\times\Xi\times\mathbb{R}_+\to\mathbf{A}$ is a measurable mapping for all $n = 1, 2, \ldots$), which represents $S$ in the following sense.
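Definition 6.6.2 The action process (6.49) represents the generalized π-ξ-strategy $S = \{\Xi, p_0, \{(p_n,\pi_n)\}_{n=1}^\infty\}$ if the following assertions are $P_\gamma^S$-a.s. valid for each $n = 1, 2, \ldots$.
• For every nonnegative measurable function $c$ on $\mathbf{X}\times\mathbf{A}$,
$$I\{X_{n-1}\in\mathbf{X}\}\int_{\mathbf{A}} c(X_{n-1},a)\,\pi_n(da|H_{n-1},\Xi_n,s) = I\{X_{n-1}\in\mathbf{X}\}\, c(X_{n-1}, A(\omega, T_{n-1}+s))$$
for almost all $s\in(T_{n-1},T_n]$.
• For any conservative and stable transition rate $\hat q$, the measure
$$I\{X_{n-1}\in\mathbf{X}\}\int_{\Gamma_R\cap\mathbb{R}_+} \hat{\tilde q}(\Gamma_X|X_{n-1},\Xi_n,\pi_n,\theta)\, e^{-\int_{(0,\theta]} \hat q_{X_{n-1}}(\Xi_n,\pi_n,s)\,ds}\, d\theta + I\{X_{n-1}\in\mathbf{X}\}\, I\{\infty\in\Gamma_R,\ x_\infty\in\Gamma_X\}\, e^{-\int_{(0,\infty)} \hat q_{X_{n-1}}(\Xi_n,\pi_n,s)\,ds}$$
coincides with
$$I\{X_{n-1}\in\mathbf{X}\}\int_{\Gamma_R\cap\mathbb{R}_+} \hat{\tilde q}(\Gamma_X|X_{n-1}, A(\omega,T_{n-1}+\theta))\, e^{-\int_{(0,\theta]} \hat q_{X_{n-1}}(A(\omega,T_{n-1}+s))\,ds}\, d\theta + I\{X_{n-1}\in\mathbf{X}\}\, I\{\infty\in\Gamma_R,\ x_\infty\in\Gamma_X\}\, e^{-\int_{(0,\infty)} \hat q_{X_{n-1}}(A(\omega,T_{n-1}+s))\,ds},$$
$\forall\ \Gamma_R\in\mathcal{B}(\bar{\mathbb{R}}_+)$, $\Gamma_X\in\mathcal{B}(\mathbf{X}_\infty)$.
Along with the filtration $\{\mathcal{F}_t\}_{t\ge 0}$ given by (6.2), we introduce the right-continuous filtration
$$\mathcal{G}_t = \sigma(H_0,\Xi_1)\vee\sigma(\hat\mu((0,s]\times B):\ s\le t,\ B\in\mathcal{B}(\mathbf{X}\times\Xi)),$$
where the random measure $\hat\mu$ on $\mathbb{R}_+\times\mathbf{X}\times\Xi$ is defined by
$$\hat\mu(\omega;\Gamma_R\times\Gamma_X\times\Gamma_\Xi) = \sum_{n\ge 1} I\{T_n(\omega)<\infty\}\,\delta_{(T_n(\omega),X_n(\omega),\Xi_{n+1}(\omega))}(\Gamma_R\times\Gamma_X\times\Gamma_\Xi).$$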
The associated predictable σ-algebra Pr on $\Omega\times\mathbb{R}_+^0$ is given by
$$\sigma\big(\Gamma\times\{0\}\ (\Gamma\in\mathcal{G}_0),\ \Gamma\times(s,\infty)\ (\Gamma\in\mathcal{G}_{s-},\ s>0)\big).$$
Now an $\mathbf{A}$-valued random process $A(\cdot)$ has the form (6.49) if and only if it is Pr-measurable, i.e., $\mathcal{G}_t$-predictable, by Proposition A.1.1. Therefore, we have the following corollary from Theorem 6.6.2.
Corollary 6.6.2 A generalized π-ξ-strategy $S = \{\Xi, p_0, \{(p_n,\pi_n)\}_{n=1}^\infty\}$ is realizable if and only if there exists a $\mathcal{G}_t$-predictable process $A:\ \Omega\times\mathbb{R}_+\to\mathbf{A}$, which represents $S$.
Proof The 'only if' part follows from the discussion after Corollary 6.6.1. If the $\mathcal{G}_t$-predictable process $A(\cdot)$ exists, it has the form (6.49), and the strategy $S$ is realizable by definition. Indeed, for each $(h_{n-1},\xi_n)\in\mathbf{H}_{n-1}\times\Xi$ with $x_{n-1}\in\mathbf{X}$ ($n = 1, 2, \ldots$), one should take $\tilde\Omega_n := \{\tilde\omega\}$ as a singleton and put $A_n(\tilde\omega,s) := A_n(h_{n-1},\xi_n,s)$. With some abuse of notation, here the mapping $A_n(h_{n-1},\xi_n,s):\ \mathbf{H}_{n-1}\times\Xi\times\mathbb{R}_+\to\mathbf{A}$ is as in the expression (6.49). The assertion (a) of Definition 6.6.1 holds after we take $c(x,a) := I\{a\in\Gamma_A\}$ in Definition 6.6.2, and the assertion (b) is valid just by Definition 6.6.2.
To summarize the material presented in the current chapter, let us underline the following.
• π-strategies are sufficient for solving the (constrained) total cost problems, see Theorem 6.2.1. Nevertheless, most of them are not realizable, see Theorem 6.6.1.
• All ξ-strategies are realizable, see Theorem 6.6.2.
• Markov standard ξ-strategies are in general not sufficient, see Sect. 1.3.4 and Corollary 6.2.1.
• Poisson-related strategies are sufficient (see Theorem 4.1.1) and realizable because they belong to the class $\mathcal{S}_\xi$ of ξ-strategies.
These observations are illustrated in Fig. 6.2. Note also that the spaces of occupation measures, coming from the generalized standard ξ-strategies, Markov standard ξ-strategies, and mixtures of simple deterministic Markov strategies, coincide by Theorem 6.4.6(b).
6.7 Bibliographical Remarks
Section 6.1. Generalized π-ξ-strategies were introduced in [185]. Note that usually in the literature only π-strategies are considered, see [106, 108, 112, 114, 115, 149, 150, 180, 188, 190, 197]. Let us call such models "classical CTMDPs". Although π-strategies are sufficient for solving many optimal control problems (see Theorem 6.2.1), one must keep in mind that they are usually not realizable (see Theorem 6.6.1). Generalized π-ξ-strategies can be useful also in the theory of Markov games, as in Chap. 10 of [197]. The solution, i.e., the Nash equilibrium, in the form of a pair of π-strategies, can be equivalently represented as a pair of (realizable) ξ-strategies.
Fig. 6.2 Overview of sufficient and realizable strategies
Section 6.2. For the discounted model, detailed occupation measures are equivalent to the occupation (occupancy) measures introduced in [76] (in [77]). The justification of formula (6.19) was presented in [76, 77] for the case α > 0. Poisson-related strategies were introduced and investigated in [185, 187]. Example 6.2.1 is very similar to the examples described in [117, Example 3.1] and in [185, Example 2]. Most of the material in this section was published in [185]. Section 6.3. A brief literature survey on reducing a CTMDP to the corresponding DTMDP was presented in Sect. 4.3. The most relevant references here are [76, 77], where the reduction for the discounted model was done through the transformations CTMDP→ESMDP→DTMDP. In this connection, the advantage of considering generalized π-ξ-strategies is in that ESMDP simply means the restriction to a specific class of strategies in the CTMDP. Thus, instead of studying different models, i.e., classical CTMDPs and ESMDPs, we can restrict ourselves to different classes of strategies, i.e., π-strategies and generalized standard ξ-strategies SξG . Since in general the set SξG is not sufficient for solving optimization problems, we use the sufficient set of Poisson-related strategies, first introduced in [185], to reduce the CTMDP to the DTMDP. The preliminary draft of this method of reduction appeared in [186, 187]. After the problem is reduced to the DTMDP, one can invoke many powerful theoretical and numerical methods developed for discrete-time problems: see [4, 7, 13, 21, 61, 63, 66, 69, 74, 89, 120, 121, 179, 200, 259] to mention the most important and recent articles and monographs. Section 6.4. The convexity of the spaces of total and detailed occupation measures for different CTMDPs was established, e.g., in [110, 112, 113, 116, 188]. Then it is natural to call the strategy, giving a convex combination of occupation measures, a mixture [110, Theorem 5.2], [112, Theorem 7.2], [116], [188, Corollary 3.14]. 
Intuitively, a mixture means that the decision-maker flips a coin at the very beginning
and afterwards applies this or that control strategy. Such a way of controlling the process is easy to implement, but, formally speaking, it does not fit the definition of a strategy introduced in the cited works. One of the advantages of generalized π-ξ-strategies introduced in [185] is that such mixtures are really specific strategies. The material presented in Sect. 6.4 is an extension of the article [185]. Section 6.6. This material was first published in [187].
Chapter 7
Gradual-Impulsive Control Models
The CTMDPs discussed in the previous chapters only allow the decision-maker to control the local characteristics of the process. This chapter considers a more general situation, where the decision-maker can control both the local characteristics and the trajectories of the process. The resulting CTMDP is said to be with gradual-impulsive control. In Sect. 7.1, we consider the total cost model, for which we show that CTMDPs with gradual-impulsive control can be reduced to CTMDPs with gradual control only. In Sect. 7.2 an example of an epidemic with carriers is solved by reducing the gradual-impulsive problem to a CTMDP problem with gradual control only, which is then further solved in the style of a main theme of this book: reduction of CTMDP to DTMDP. In this sense, the present chapter both complements and illustrates the application of the materials presented earlier in this book. Section 7.3 is devoted to the discounted cost model, and we develop the dynamic programming method for it. Since we allow multiple impulses to be applied at a single moment in time, leading to multiple values associated with the process at that moment, it would be notationally heavy to introduce the trajectories of the process rigorously. Instead, we describe the CTMDP with gradual-impulsive control as a DTMDP. In order to do this rigorously, we make use of the relevant facts regarding the space of relaxed controls. This DTMDP formulation allows us to immediately write down the optimality equation in the integral form, and the efforts are to be spent on establishing from that equation the local optimality equation. Finally, the results obtained in Sect. 7.3 are illustrated with an example of inventory control.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9_7
7.1 The Total Cost Model and Reduction
7.1.1 System Primitives
We describe the primitives of the gradual-impulsive control model of CTMDPs as follows. The state space is $\mathbf{X}$, the space of gradual actions (controls) is $\mathbf{A}^G$, and the space of impulsive actions (controls) is $\mathbf{A}^I$. As usual, we use the words "action" and "control" interchangeably. It is assumed that $\mathbf{X}$, $\mathbf{A}^G$ and $\mathbf{A}^I$ are all (topological) Borel spaces, endowed with their Borel σ-algebras $\mathcal{B}(\mathbf{X})$, $\mathcal{B}(\mathbf{A}^G)$ and $\mathcal{B}(\mathbf{A}^I)$, respectively. The transition rate, on which the gradual control acts, is given by $q(dy|x,a)$, which is a signed kernel from $\mathbf{X}\times\mathbf{A}^G$, endowed with its Borel σ-algebra, to $\mathcal{B}(\mathbf{X})$, satisfying the following conditions:
$$q(\Gamma|x,a)\in[0,\infty)\ \text{for each}\ \Gamma\in\mathcal{B}(\mathbf{X}),\ x\notin\Gamma;\qquad q(\mathbf{X}|x,a) = 0,\ x\in\mathbf{X},\ a\in\mathbf{A}^G;\qquad \bar q_x := \sup_{a\in\mathbf{A}^G} q_x(a) < \infty,\ x\in\mathbf{X},$$
where $q_x(a) := -q(\{x\}|x,a)$ for each $(x,a)\in\mathbf{X}\times\mathbf{A}^G$. For notational convenience, we introduce
$$\tilde q(dy|x,a) := q(dy\setminus\{x\}|x,a),\quad \forall\ x\in\mathbf{X},\ a\in\mathbf{A}^G.$$
If the current state is $x\in\mathbf{X}$, and an impulsive control (or simply "impulse") $b\in\mathbf{A}^I$ is applied, then the state immediately following this impulse obeys the distribution given by $Q(dy|x,b)$, which is a stochastic kernel from $\mathbf{X}\times\mathbf{A}^I$ to $\mathcal{B}(\mathbf{X})$. Finally, there is a family of cost rates and functions $\{c_j^G, c_j^I\}_{j=0}^J$, with $J$ being a fixed positive integer, representing the number of constraints in the concerned optimal control problem to be described below, see (7.5). For each $j\in\{0,1,\ldots,J\}$, $c_j^G$ and $c_j^I$ are $[-\infty,\infty]$-valued measurable functions on $\mathbf{X}\times\mathbf{A}^G$ and $\mathbf{X}\times\mathbf{A}^I$, respectively. Throughout this chapter, unless stated otherwise, by measurable we mean Borel measurable. The cost rates $c_j^G$ have the same meaning as in Chap. 1; the costs $c_j^I(x,b)$ are paid as the lump sums at the moment when the impulse $b$ is applied at the state $x$.
Remark 7.1.1 Without loss of generality we can assume $\mathbf{A}^G$ and $\mathbf{A}^I$ are two disjoint measurable subsets of a Borel space $\mathbf{A}$ such that $\mathbf{A} = \mathbf{A}^G\cup\mathbf{A}^I$, for otherwise, one can consider $\mathbf{A}^G\times\{G\}$ instead of $\mathbf{A}^G$ and $\mathbf{A}^I\times\{I\}$ instead of $\mathbf{A}^I$ and $\mathbf{A} = \mathbf{A}^G\times\{G\}\cup\mathbf{A}^I\times\{I\}$.
Interpretation The description of the system dynamics in the gradual-impulsive control problem is as follows. Assume $q_x(a) > 0$ for each $x\in\mathbf{X}$ and $a\in\mathbf{A}^G$ for simplicity. At the initial time 0 with the initial state $x_0$, the decision-maker selects the triple $(\breve c_1, \breve b_1, \rho^1)$ with $\breve c_1\in[0,\infty]$, $\breve b_1\in\mathbf{A}^I$, and $\rho^1 = \{\rho_t^1(da)\}_{t\in(0,\infty)}\in\mathcal{R}(\mathbf{A}^G)$, where $\mathcal{R}(\mathbf{A}^G)$ denotes the collection of $\mathcal{P}(\mathbf{A}^G)$-valued measurable mappings on $(0,\infty)$ with any two elements therein being identified if they differ only on a null set with respect to the Lebesgue measure. Compared with Definition 1.1.2, under a fixed $x_0\in\mathbf{X}$, one can say that $\rho_t^1(da)$ plays exactly the same role as the stochastic kernel $\pi_1(da|x_0,t)$. Then, the time until the next natural jump follows the nonstationary exponential distribution with the rate function
$$\int_{\mathbf{A}^G} q_{x_0}(a)\,\rho_t^1(da) =: q_{x_0}(\rho_t^1).$$
Here and below, if $\rho\in\mathcal{R}(\mathbf{A}^G)$, then
$$q_x(\rho_t) := \int_{\mathbf{A}^G} q_x(a)\,\rho_t(da),\qquad \tilde q(dy|x,\rho_t) := \int_{\mathbf{A}^G} \tilde q(dy|x,a)\,\rho_t(da).$$
If by time $\breve c_1$, there is no occurrence of a natural jump, then the first sojourn time is $\breve c_1$, at which the impulsive action $\breve b_1\in\mathbf{A}^I$ is applied, and the next state $X_1$ follows the distribution $Q(dy|x_0,\breve b_1)$. If the first natural jump happens before $\breve c_1$, say at $t_1$, then the first sojourn time is $t_1$, and the next state $X_1$ follows the distribution
$$\frac{\tilde q(dy|x_0,\rho_{t_1}^1)}{q_{x_0}(\rho_{t_1}^1)}.$$
Except for the initial one, a decision epoch occurs immediately after a sojourn time. At the next decision epoch, the decision-maker selects $(\breve c_2, \breve b_2, \rho^2)$, and so on. This leads to a natural description of the gradual-impulsive control problem as a DTMDP, with a fairly complicated action space, involving the space of relaxed controls. We present the necessary preliminaries regarding the space of relaxed controls in the next section and Appendix B.
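The mechanism of one decision epoch can be mimicked by a short simulation. The sketch below is only an illustration under a simplifying assumption that is not made in the text: the relaxed control is constant in time and concentrated at one gradual action, so that the natural jump rate is a constant `q_rate`. All names and numerical values are hypothetical.

```python
import math
import random


def simulate_sojourn(q_rate, planned_impulse_time, rng):
    """One epoch: returns (sojourn time, kind of the event that ends it).

    Assumes a constant natural jump rate q_rate > 0 (constant gradual action).
    planned_impulse_time may be math.inf, meaning that no impulse is planned.
    """
    natural_jump_time = rng.expovariate(q_rate)
    if natural_jump_time < planned_impulse_time:
        # The next state would be drawn from q~(dy|x, a) / q_x(a).
        return natural_jump_time, "natural jump"
    # No natural jump by the planned time: the next state would be drawn from Q(dy|x, b).
    return planned_impulse_time, "impulse"


rng = random.Random(0)
q_rate, planned = 2.0, 0.5
samples = [simulate_sojourn(q_rate, planned, rng) for _ in range(20000)]
impulse_share = sum(kind == "impulse" for _, kind in samples) / len(samples)
never_impulse = simulate_sojourn(q_rate, math.inf, rng)
```

Under the constant-rate assumption the impulse is applied with probability $e^{-q c}$, where $q$ is the rate and $c$ the planned impulse time, so `impulse_share` should be close to $e^{-1}$ for the values chosen here.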
7.1.2 Total Cost Gradual-Impulsive Control Problems
Now we are in a position to describe the gradual-impulsive control model as a DTMDP. The state space of the DTMDP model is
$$\breve{\mathbf{X}} := \{(\infty,x_\infty)\}\cup([0,\infty)\times\mathbf{X}),$$
where $(\infty,x_\infty)$ is an isolated point in $\breve{\mathbf{X}}$. The first coordinate represents the previous sojourn time in the gradual-impulsive control problem, and the state of the controlled process in the gradual-impulsive control problem is given in the second coordinate. The inclusion of the first coordinate in the state allows us to consider control strategies that select actions depending on the past sojourn times. The action space of the DTMDP is
$$\breve{\mathbf{A}} := [0,\infty]\times\mathbf{A}^I\times\mathcal{R}(\mathbf{A}^G).$$
Recall that $\mathcal{R}(\mathbf{A}^G)$ is the collection of $\mathcal{P}(\mathbf{A}^G)$-valued measurable mappings on $(0,\infty)$ with any two elements therein being identified if they differ only on a null set, where $\mathcal{P}(\mathbf{A}^G)$ stands for the space of probability measures on $(\mathbf{A}^G,\mathcal{B}(\mathbf{A}^G))$. We endow $\mathcal{P}(\mathbf{A}^G)$ with its weak topology (generated by bounded continuous functions on $\mathbf{A}^G$) and the Borel σ-algebra, so that $\mathcal{P}(\mathbf{A}^G)$ is a Borel space. According to Lemma B.1.3, each element in $\mathcal{R}(\mathbf{A}^G)$ can be regarded as a stochastic kernel from $(0,\infty)$ to $\mathcal{B}(\mathbf{A}^G)$. According to Proposition B.1.12, the space $\mathcal{R}(\mathbf{A}^G)$, endowed with the smallest σ-algebra with respect to which the mapping
$$\rho = (\rho_t(da))\in\mathcal{R}(\mathbf{A}^G)\ \to\ \int_0^\infty e^{-t} g(t,\rho_t)\,dt$$
is measurable for each bounded measurable function $g$ on $(0,\infty)\times\mathcal{P}(\mathbf{A}^G)$, is a Borel space.
The transition probability $\breve p$ in this DTMDP is defined as follows. For each bounded measurable function $g$ on $\breve{\mathbf{X}}$ and action $\breve a = (\breve c,\breve b,\rho)\in\breve{\mathbf{A}}$,
$$\begin{aligned}
\int_{\breve{\mathbf{X}}} g(t,y)\,\breve p(dt\times dy|(\theta,x),\breve a)
:={}& I\{\breve c = \infty\}\left\{ g(\infty,x_\infty)\, e^{-\int_{\mathbb{R}_+} q_x(\rho_s)\,ds} + \int_{\mathbb{R}_+}\int_{\mathbf{X}} g(t,y)\,\tilde q(dy|x,\rho_t)\, e^{-\int_{(0,t]} q_x(\rho_s)\,ds}\,dt\right\}\\
&+ I\{\breve c < \infty\}\left\{ \int_{(0,\breve c]}\int_{\mathbf{X}} g(t,y)\,\tilde q(dy|x,\rho_t)\, e^{-\int_{(0,t]} q_x(\rho_s)\,ds}\,dt + e^{-\int_{(0,\breve c]} q_x(\rho_s)\,ds}\int_{\mathbf{X}} g(\breve c,y)\,Q(dy|x,\breve b)\right\}\\
={}& \int_{(0,\breve c]\cap\mathbb{R}_+}\int_{\mathbf{X}} g(t,y)\,\tilde q(dy|x,\rho_t)\, e^{-\int_{(0,t]} q_x(\rho_s)\,ds}\,dt + I\{\breve c = \infty\}\, g(\infty,x_\infty)\, e^{-\int_{\mathbb{R}_+} q_x(\rho_s)\,ds}\\
&+ I\{\breve c < \infty\}\, e^{-\int_{(0,\breve c]} q_x(\rho_s)\,ds}\int_{\mathbf{X}} g(\breve c,y)\,Q(dy|x,\breve b)
\end{aligned} \tag{7.1}$$
for each state $(\theta,x)\in[0,\infty)\times\mathbf{X}$; and
$$\int_{\breve{\mathbf{X}}} g(t,y)\,\breve p(dt\times dy|(\infty,x_\infty),\breve a) := g(\infty,x_\infty).$$
The object $\breve p$ defined above is indeed a stochastic kernel from $\breve{\mathbf{X}}\times\breve{\mathbf{A}}$ to $\mathcal{B}(\breve{\mathbf{X}})$, as a consequence of Lemma B.1.4. Similarly, the cost functions $\{\breve l_j\}_{j=0}^J$ defined below are measurable on $\breve{\mathbf{X}}\times\breve{\mathbf{A}}\times\breve{\mathbf{X}}$:
$$\begin{aligned}
\breve l_j((\theta,x),\breve a,(t,y)) :={}& I\{(\theta,x)\in[0,\infty)\times\mathbf{X}\}\left\{\int_{(0,t]\cap\mathbb{R}_+} c_j^G(x,\rho_s)\,ds + I\{t = \breve c < \infty\}\, c_j^I(x,\breve b)\right\}\\
={}& I\{x\in\mathbf{X}\}\left\{\int_{(0,t]\cap\mathbb{R}_+} c_j^G(x,\rho_s)\,ds + I\{t = \breve c < \infty\}\, c_j^I(x,\breve b)\right\},
\end{aligned} \tag{7.2}$$
for each $j = 0, 1, \ldots, J$ and $((\theta,x),\breve a,(t,y))\in\breve{\mathbf{X}}\times\breve{\mathbf{A}}\times\breve{\mathbf{X}}$. Here and below the generic notation $\breve a = (\breve c,\breve b,\rho)\in\breve{\mathbf{A}}$ of an action in this DTMDP model is in use, and $(\theta,x)$ is the generic notation of $\breve x\in\breve{\mathbf{X}}$. The interpretation is that the pair $(\breve c,\breve b)$ is the planned time until the next impulse and the next planned impulse, and $\rho$ is (the rule of) the relaxed control to be used during the next sojourn time.
Without loss of generality, the initial state is $(0,x_0)$, with some $x_0\in\mathbf{X}$. Below,
$$\breve l_j^{\pm}((\theta,x),\breve a,(t,y)) := I\{x\in\mathbf{X}\}\left\{\int_{(0,t]\cap\mathbb{R}_+} (c_j^G)^{\pm}(x,\rho_s)\,ds + I\{t = \breve c < \infty\}\,(c_j^I)^{\pm}(x,\breve b)\right\}$$
for $0\le j\le J$. Let $\{\breve X_n\}_{n=0}^\infty = \{(\Theta_n,X_n)\}_{n=0}^\infty$ be the controlled process in this DTMDP model, and $\{(\breve C_n,\breve B_n)\}_{n=1}^\infty$ be the coordinate process corresponding to $\{(\breve c_n,\breve b_n)\}_{n=1}^\infty$ in $\{\breve a_n\}_{n=1}^\infty$.
Next, we define the concerned class of strategies in the gradual-impulsive control model.
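Definition 7.1.1 Consider a sequence of stochastic kernels $\sigma = \{\sigma_n\}_{n=1}^\infty$, where for each $n\ge 1$, $\sigma_n$ is a stochastic kernel on $\mathcal{B}([0,\infty]\times\mathbf{A}^I\times\mathcal{R}(\mathbf{A}^G))$ given $\breve h_n := (\breve x_0,(\breve c_1,\breve b_1),\breve x_1,(\breve c_2,\breve b_2),\ldots,\breve x_n)$. According to Proposition B.1.33, for each $n\ge 0$,
$$\sigma_{n+1}(d\breve a|\breve h_n) = \sigma_{n+1}(d\breve c\times d\breve b\times d\rho|\breve h_n) = \sigma_{n+1}^{(0)}(d\breve c\times d\breve b|\breve h_n)\,\sigma_{n+1}^{(1)}(d\rho|\breve h_n,\breve c,\breve b),$$
where $\sigma_{n+1}^{(0)}$ and $\sigma_{n+1}^{(1)}$ are some corresponding stochastic kernels and $\sigma = \{\sigma_n\}_{n=1}^\infty$ can be written as $\sigma = \{\sigma_n^{(0)},\sigma_n^{(1)}\}_{n=1}^\infty$. If for each $n\ge 0$ there is a measurable mapping $\breve F_{n+1}:\ (\breve h_n,\breve c,\breve b)\to\mathcal{R}(\mathbf{A}^G)$ such that
$$\sigma_{n+1}^{(1)}(d\rho|\breve h_n,\breve c,\breve b) = \delta_{\breve F_{n+1}(\breve h_n,\breve c,\breve b)}(d\rho),$$
then we call the sequence $\sigma = \{\sigma_n\}_{n=1}^\infty$, which is also identified with $\sigma = \{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty$, a strategy in the gradual-impulsive control model. The collection of all strategies in the gradual-impulsive CTMDP model is denoted by $\Sigma^{GI}$.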
The selection of $\Sigma^{GI}$ as the notation for the class of strategies in the gradual-impulsive control model is consistent with the notations in Sect. C.1 for DTMDPs. Under a strategy $\sigma\in\Sigma^{GI}$, having in hand the history $\breve h_n$, the decision-maker selects $(\breve c_{n+1},\breve b_{n+1})$ (possibly randomly), and after that, chooses $\rho^{n+1} = \breve F_{n+1}(\breve h_n,\breve c_{n+1},\breve b_{n+1})$. If $\breve x_n = (\infty,x_\infty)$ the actions play no role.
Of special interest are the following subclasses of strategies in the gradual-impulsive control model.
Definition 7.1.2 (a) A strategy $\sigma = \{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ is called μ-Markov in the gradual-impulsive control model if it is in the following form: for each $n\ge 0$,
$$\left.\begin{aligned}
\sigma_{n+1}^{(0)}(\{\infty\}\times d\breve b|\breve h_n) &= \sigma_{n+1}^{(0)}(\{\infty\}\times d\breve b|x_n) = \mu_{n+1}(x_n)\,\bar\varphi_{n+1}(d\breve b|x_n),\\
\sigma_{n+1}^{(0)}(\{0\}\times d\breve b|x_n) &= (1-\mu_{n+1}(x_n))\,\bar\varphi_{n+1}(d\breve b|x_n),\\
\breve F_{n+1}(\breve h_n)_t(da) &\equiv \breve F_{n+1}(x_n)(da),
\end{aligned}\right\} \tag{7.3}$$
where $\mu_{n+1}$ and $\breve F_{n+1}$ are $[0,1]$- and $\mathcal{R}(\mathbf{A}^G)$-valued measurable mappings on $\mathbf{X}$, and $\bar\varphi_{n+1}$ is a stochastic kernel. The first and last equalities indicate that the dependence on $\breve h_n$ is only through $x_n$, and the right-hand side of the last equality does not depend on $t$, which has thus been omitted from the subscript. (Note that $\sigma_{n+1}^{(0)}(\{0,\infty\}\times\mathbf{A}^I|x_n)\equiv 1$.)
(b) A strategy $\sigma = \{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ is called μ-stationary in the gradual-impulsive control model if it is in the following form: for each $n\ge 0$,
$$\left.\begin{aligned}
\sigma_{n+1}^{(0)}(\{\infty\}\times d\breve b|\breve h_n) &= \sigma^{(0)}(\{\infty\}\times d\breve b|x_n) = \mu(x_n)\,\bar\varphi(d\breve b|x_n),\\
\sigma_{n+1}^{(0)}(\{0\}\times d\breve b|x_n) &= (1-\mu(x_n))\,\bar\varphi(d\breve b|x_n),\\
\breve F_{n+1}(\breve h_n)_t(da) &\equiv \breve F(x_n)(da),
\end{aligned}\right\} \tag{7.4}$$
where $x_n$ is the second component of $\breve x_n = (\theta_n,x_n)$, $\mu$ and $\breve F$ are $[0,1]$- and $\mathcal{R}(\mathbf{A}^G)$-valued measurable mappings on $\mathbf{X}$, and $\bar\varphi$ is a stochastic kernel. In the first expression in (7.4), one can replace $\bar\varphi$ with an arbitrary measurable stochastic kernel: the corresponding impulsive actions are never applied.
(c) A μ-stationary strategy $\sigma$ in the gradual-impulsive control model in the form of (7.4) is called μ-deterministic stationary if $\mu$ is $\{0,1\}$-valued, and there exist measurable mappings $\varphi^{(0)}$ and $\varphi^{(1)}$ from $\mathbf{X}$ to $\mathbf{A}^I$ and $\mathbf{A}^G$, respectively, such that
$$\bar\varphi(d\breve b|x_n) = \delta_{\varphi^{(0)}(x_n)}(d\breve b),\qquad \breve F(x_n)(da) = \delta_{\varphi^{(1)}(x_n)}(da).$$
A μ-deterministic stationary strategy in the gradual-impulsive control model is specified by the aforementioned triplet $\{\mu,\varphi^{(0)},\varphi^{(1)}\}$. In fact, since $\mu$ is $\{0,1\}$-valued, one can merge it together with $\varphi^{(0)}$, and the resulting mapping together with $\varphi^{(1)}$ also specifies this μ-deterministic stationary strategy. However, we prefer to keep the specification with the triplet as a special case of a μ-stationary strategy in (b).
One can also interpret the meaning of universally measurable strategies in the gradual-impulsive control model as in Remark C.1.2.
We could also consider the gradual-impulsive control problem over the class of Markov standard ξ-strategies, defined below. Let
$$\hat{\mathbf{A}}^G = \bigcup_{\breve a\in\mathbf{A}^G}\{\rho = \{\rho_t\}_{t\in(0,\infty)}\in\mathcal{R}(\mathbf{A}^G):\ \rho_t(da)\equiv\delta_{\breve a}(da)\}.$$
We may identify it with $\mathbf{A}^G$.
Definition 7.1.3 Employing the notations in Definition 7.1.1, a Markov standard ξ-strategy in the gradual-impulsive control model is a sequence $\sigma = \{\sigma_n^{(0)},\sigma_n^{(1)}\}_{n=1}^\infty$, where
$$\sigma_{n+1}^{(1)}(d\rho|\breve h_n,\breve c,\breve b) = \sigma_{n+1}^{(1)}(d\rho|x_n)$$
is concentrated on $\hat{\mathbf{A}}^G$ and
$$\sigma_{n+1}^{(0)}(d\breve c\times d\breve b|\breve h_n) = \sigma_{n+1}^{(0)}(d\breve c\times d\breve b|x_n)$$
depends only on $x_n\in\mathbf{X}$ and $n\ge 0$. Since $\hat{\mathbf{A}}^G$ is identified with $\mathbf{A}^G$, we may identify $\sigma_{n+1}^{(1)}(d\rho|x_n)$ with a stochastic kernel $\tilde\varphi_{n+1}(da|x_n)$ on $\mathbf{A}^G$ given $x_n\in\mathbf{X}$ and use the notation $\sigma = \{\sigma_n^{(0)},\tilde\varphi_n\}_{n=1}^\infty$. A Markov standard ξ-strategy for the gradual-impulsive control model is called a μ-stationary standard ξ-strategy if $\sigma_n^{(0)}$ is in the form of (7.4), and $\tilde\varphi_{n+1}(da|x_n) = \tilde\varphi(da|x_n)$ for all $n\ge 0$. (Note that a Markov standard ξ-strategy belongs to the class $\Sigma^{GI}$ only if all the kernels $\tilde\varphi_n$ are degenerate, i.e., concentrated at singletons.)
In Definition 7.1.1, on each step $n\ge 0$, the component $\rho^{n+1}$, playing the role of the relaxed control $\pi_{n+1}$, is fixed. In Definition 7.1.3, the component $\rho^{n+1}$ is the Dirac measure concentrated at $\breve a\in\mathbf{A}^G$, but the value of $\breve a$ is random, defined by the stochastic kernel $\tilde\varphi_{n+1}$. One can say that Definition 7.1.3 describes a randomized control.
Problem Statement Given $\breve x_0 = (0,x_0)\in\breve{\mathbf{X}}$ and a strategy $\sigma\in\Sigma^{GI}$, let $\breve P_{x_0}^\sigma$ be the strategic measure in the DTMDP, see Sect. C.1, and $\breve E_{x_0}^\sigma$ be the corresponding expectation. Here the sample space is $\breve\Omega := \{(\breve x_0,(\breve c_1,\breve b_1),\breve x_1,(\breve c_2,\breve b_2),\breve x_2,\ldots)\}$; the components $\rho$ of $\breve a$ are omitted because $\rho^n = \breve F_n(\breve h_{n-1},\breve c_n,\breve b_n)$ is uniquely defined under the strategy $\sigma\in\Sigma^{GI}$. Then the concerned gradual-impulsive control problem with constraints reads
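\[
\begin{aligned}
&\text{Minimize } \breve E^\sigma_{x_0}\Big[\sum_{n=0}^\infty \breve l_0(\breve X_n,\breve A_{n+1},\breve X_{n+1})\Big]=:\breve W_0(\sigma,x_0)\ \text{ over } \sigma\in\Sigma^{GI}\\
&\text{such that } \breve W_j(\sigma,x_0):=\breve E^\sigma_{x_0}\Big[\sum_{n=0}^\infty \breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\Big]\le d_j,\quad 1\le j\le J,
\end{aligned}
\tag{7.5}
\]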
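where $\{d_j\}_{j=1}^J\subset\mathbb{R}^J$ is a fixed vector of constants, $x_0$ is a fixed element of $X$, and
\[
\breve E^\sigma_{x_0}\Big[\sum_{n=0}^\infty \breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\Big]
:=\breve E^\sigma_{x_0}\Big[\sum_{n=0}^\infty \breve l_j^{(+)}(\breve X_n,\breve A_{n+1},\breve X_{n+1})\Big]
-\breve E^\sigma_{x_0}\Big[\sum_{n=0}^\infty \breve l_j^{(-)}(\breve X_n,\breve A_{n+1},\breve X_{n+1})\Big]
\]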
with $\infty-\infty:=\infty$ being adopted here. For an arbitrarily fixed strategy $\sigma=\{\sigma_n^{(0)},\sigma_n^{(1)}\}_{n=1}^\infty\in\Sigma^{GI}$ as in Definition 7.1.1, $\breve A_n=(\breve c_n,\breve b_n,\breve F_n(\breve H_{n-1},\breve C_n,\breve B_n))$. The performance functionals $\breve W_j(\sigma,x_0)$, $j=1,\ldots,J$, are well defined also for Markov standard ξ-strategies $\sigma$. We say that a strategy $\sigma'$ replicates a strategy $\sigma$ (in terms of performance measures) if $\breve W_j(\sigma,x_0)=\breve W_j(\sigma',x_0)$ for all $j=0,1,\ldots,J$. Under certain conditions, the optimal strategy in problem (7.5) exists, can be chosen μ-stationary and is replicated by a μ-stationary standard ξ-strategy: see Theorem 7.1.3.

We shall obtain the optimality results for this problem in Sect. 7.1.3, by reducing the constrained gradual-impulsive control problem to a standard constrained CTMDP problem with gradual control only as introduced in Chap. 1. To distinguish from the gradual-impulsive control model, we denote the CTMDP model with gradual control only by $\mathcal{M}^{GO}:=\{X,A,q^{GO},\{c_j\}_{j=0}^J\}$, where the state and action spaces $X$ and $A$ are Borel spaces, $q^{GO}$ is the transition rate from $X\times A$ to $\mathcal{B}(X)$, $\{c_j\}_{j=0}^J$ is the collection of measurable functions on $X\times A$ representing the cost rates, and $J\ge 0$ is a fixed integer. The superscript "GO" abbreviates "gradual only", as the model only allows gradual controls.
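For readers who want something concrete to experiment with, the tuple $\{X,A,q^{GO},\{c_j\}_{j=0}^J\}$ can be mirrored by a small container when the state and action spaces are finite. This is only an illustrative sketch: the class name `FiniteCTMDP`, the nested-list layout and the tolerance are assumptions made here, not notation from the book, and the check covers only the finite, conservative case (non-negative off-diagonal rates and zero row sums).

```python
from dataclasses import dataclass


@dataclass
class FiniteCTMDP:
    """Hypothetical finite stand-in for the model {X, A, q^GO, {c_j}}."""
    q: list      # q[a][x][y]: transition rate from state x to state y under action a
    costs: list  # costs[j][x][a]: cost rate c_j(x, a)

    def is_transition_rate(self, tol=1e-9):
        # Finite-case check: off-diagonal rates are non-negative and every row sums to zero.
        for rates_a in self.q:
            for x, row in enumerate(rates_a):
                if any(r < -tol for y, r in enumerate(row) if y != x):
                    return False
                if abs(sum(row)) > tol:
                    return False
        return True

    def total_rate(self, x, a):
        # q_x(a): total jump intensity out of x under action a.
        return -self.q[a][x][x]
```

A two-state, one-action instance would be built as `FiniteCTMDP(q=[[[-1.0, 1.0], [2.0, -2.0]]], costs=[[[0.0], [0.0]]])`.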
7.1.3 Reduction to CTMDP Model with Gradual Control

Condition 7.1.1 $Q(\{x\}\,|\,x,a)=0$ for each $(x,a)\in X\times A^I$.

This condition is not restrictive because one can always extend the state space $X$ to $X\times\{0,1\}$, say, where the second component does not affect any primitives, but switches from 0 to 1 and back from 1 to 0 at every transition moment associated with the impulsive control.

Given the system primitives of the gradual-impulsive control model described in Sect. 7.1.1, we define the following CTMDP model $\mathcal{M}^{GO}=\{X,A,q^{GO},\{c_j\}_{j=0}^J\}$ with gradual control only, where
\[
\begin{aligned}
&A:=A^I\cup A^G;\\
&q^{GO}(dy\,|\,x,a):=q(dy\,|\,x,a),\quad \forall\,(x,a)\in X\times A^G;\\
&q^{GO}(dy\,|\,x,a):=Q(dy\,|\,x,a),\quad q_x^{GO}(a):=1,\quad \forall\,(x,a)\in X\times A^I;\\
&c_j(x,a):=c_j^G(x,a),\quad \forall\,(x,a)\in X\times A^G;\\
&c_j(x,a):=c_j^I(x,a),\quad \forall\,(x,a)\in X\times A^I.
\end{aligned}
\]
Condition 7.1.1 guarantees that $q^{GO}$ defined in the above is indeed a transition rate in the sense of Sect. 1.1.1 in Chap. 1. Throughout this section, by $\mathcal{M}^{GO}$ we always mean the model just described. We underline that the action space $A$ is in general a much simpler object than the space $\breve A$ in the DTMDP described in Sect. 7.1.2.

Theorem 7.1.1 Suppose Condition 7.1.1 is satisfied, and, for each $x\in X$, there is some $\varepsilon(x)>0$ such that $q_x(a)\ge\varepsilon(x)>0$ for all $a\in A^G$. Let a π-strategy $S=\{\pi_n\}_{n=1}^\infty$ in the CTMDP model $\mathcal{M}^{GO}$ be arbitrarily fixed. Then there are a μ-Markov strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ and a Markov standard ξ-strategy $\sigma'$ in the gradual-impulsive CTMDP model such that $\breve W_j(\sigma,x_0)=\breve W_j(\sigma',x_0)=W_j^0(S,x_0)$ for each $j=0,1,\ldots,J$. If the π-strategy $S$ is stationary, then both strategies $\sigma$ and $\sigma'$ can be chosen μ-stationary.

In the proof of Theorem 7.1.1, we will make use of the next statement, which was established in Theorem 1.3.2 and Corollary 1.3.1, where M is replaced now by $p^M$. We formulate it here for ease of reference.

Proposition 7.1.1 Assume for each $x\in X$ that there is some $\varepsilon(x)>0$ satisfying $q_x^{GO}(a)\ge\varepsilon(x)>0$ for each $a\in A$. Then for each π-strategy $S=\{\pi_n\}_{n=1}^\infty$ in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only, there is a Markov standard ξ-strategy $p^M=\{A,\{p_n^M\}_{n=1}^\infty\}$, see Table 6.1, such that
\[
\begin{aligned}
m^{S,0}_{x_0,n+1}(dx\times da)&=m^{p^M,0}_{x_0,n+1}(dx\times da),\\
\int_A q_x^{GO}(a)\,m^{S,0}_{x_0,n+1}(dx\times da)&=P^{p^M}_{x_0}(X_n\in dx)
\end{aligned}
\tag{7.6}
\]
for each $n\ge 0$. (Recall Definition 6.2.1.) Moreover, for each $n\ge 0$, one can take $p^M_{n+1}$ as the stochastic kernel satisfying
\[
p^M_{n+1}(da\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx)=q_x^{GO}(a)\,m^{S,0}_{x_0,n+1}(dx\times da)=q_x^{GO}(a)\,m^{p^M,0}_{x_0,n+1}(dx\times da).
\tag{7.7}
\]
Finally, if the π-strategy $S=\pi^s$ is stationary, see Definition 1.1.3, then one can take
\[
p^M_{n+1}(da\,|\,x)\equiv p^s(da\,|\,x)=\frac{q_x^{GO}(a)\,\pi^s(da\,|\,x)}{\int_A q_x^{GO}(a)\,\pi^s(da\,|\,x)}
\]
on $\mathcal{B}(A)$ for each $x\in X$, i.e., one can take the Markov standard ξ-strategy $p^M=\{A,\{p_n^M\}_{n=1}^\infty\}$ as a stationary standard ξ-strategy $p^s$, see Table 6.1 for a recapitulation of the various classes of strategies in CTMDP models with gradual control only and their notations.

Proof Let $S=\{\pi_n\}_{n=1}^\infty$ be a fixed π-strategy in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only.

(i) We will show that for some μ-Markov strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ for the gradual-impulsive CTMDP model,
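\[
\breve E^\sigma_{x_0}\big[\breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\big]
=E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n,t)\,dt\Big]
\tag{7.8}
\]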
for each $n\ge 0$ and $j=0,1,\ldots,J$. Without loss of generality we can assume $c_j$ is $[0,\infty]$-valued, for otherwise, one would apply the reasoning below to $c_j^+$ and $c_j^-$, separately. After that, equalities (7.8) imply that $\breve W_j(\sigma,x_0)=W_j^0(S,x_0)$ by (7.5), (1.31) and (1.33). Consider the Markov standard ξ-strategy $p^M$ in Proposition 7.1.1. Then
\[
\begin{aligned}
&E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n,t)\,dt\Big]\\
&\quad=\int_{X\times A}c_j(x,a)\,m^{S,0}_{x_0,n+1}(dx\times da)=\int_{X\times A}c_j(x,a)\,m^{p^M,0}_{x_0,n+1}(dx\times da)\\
&\quad=\int_{X\times A}\frac{c_j(x,a)}{q_x^{GO}(a)}\,q_x^{GO}(a)\,m^{p^M,0}_{x_0,n+1}(dx\times da)\\
&\quad=\int_{X\times A}\frac{c_j(x,a)}{q_x^{GO}(a)}\,p^M_{n+1}(da\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx)\\
&\quad=\int_{X\times A^G}\frac{c_j(x,a)}{q_x^{GO}(a)}\,p^M_{n+1}(da\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx)
+\int_{X\times A^I}\frac{c_j(x,a)}{q_x^{GO}(a)}\,p^M_{n+1}(da\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx),
\end{aligned}
\tag{7.9}
\]
where the second equality is by (7.6) and the fourth equality is by (7.7). Let us define for each $n\ge 0$ a stochastic kernel $\tilde\varphi_{n+1}$ on $\mathcal{B}(A^G)$ given $x\in X$ by
\[
\tilde\varphi_{n+1}(da\,|\,x):=\frac{p^M_{n+1}(da\cap A^G\,|\,x)}{p^M_{n+1}(A^G\,|\,x)}
\tag{7.10}
\]
for each $x\in X$ where $p^M_{n+1}(A^G\,|\,x)>0$; for all $x\in X$ where $p^M_{n+1}(A^G\,|\,x)=0$, we put $\tilde\varphi_{n+1}(da\,|\,x)$ as a fixed probability measure on $A^G$. Similarly, we define for each $n\ge 0$ a stochastic kernel $\bar\varphi_{n+1}$ on $\mathcal{B}(A^I)$ given $x\in X$ by
\[
\bar\varphi_{n+1}(da\,|\,x):=\frac{p^M_{n+1}(da\cap A^I\,|\,x)}{p^M_{n+1}(A^I\,|\,x)},
\tag{7.11}
\]
for each $x\in X$ where $p^M_{n+1}(A^I\,|\,x)>0$; for all $x\in X$ where $p^M_{n+1}(A^I\,|\,x)=0$, we let $\bar\varphi_{n+1}(da\,|\,x)$ be a fixed probability measure on $A^I$. Now we continue from (7.9):
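\[
\begin{aligned}
&E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n,t)\,dt\Big]\\
&\quad=\int_X\int_{A^G}\frac{c_j^G(x,a)}{q_x(a)}\,\tilde\varphi_{n+1}(da\,|\,x)\,p^M_{n+1}(A^G\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx)\\
&\qquad+\int_X\int_{A^I}c_j^I(x,a)\,\bar\varphi_{n+1}(da\,|\,x)\,p^M_{n+1}(A^I\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx).
\end{aligned}
\tag{7.12}
\]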
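Let us further define for each $n\ge 0$ a stochastic kernel $\breve F_{n+1}(x)(da)$ on $\mathcal{B}(A^G)$ given $x\in X$ by
\[
\breve F_{n+1}(x)(da):=\frac{\frac{1}{q_x(a)}\,\tilde\varphi_{n+1}(da\,|\,x)}{\int_{A^G}\frac{1}{q_x(a)}\,\tilde\varphi_{n+1}(da\,|\,x)},\quad\forall\,x\in X.
\tag{7.13}
\]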
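Then
\[
\frac{\int_{A^G}c_j^G(x,a)\,\breve F_{n+1}(x)(da)}{\int_{A^G}q_x^{GO}(a)\,\breve F_{n+1}(x)(da)}
=\frac{\int_{A^G}c_j^G(x,a)\,\breve F_{n+1}(x)(da)}{\int_{A^G}q_x(a)\,\breve F_{n+1}(x)(da)}
=\int_{A^G}\frac{c_j^G(x,a)}{q_x(a)}\,\tilde\varphi_{n+1}(da\,|\,x),
\]
and so from (7.12)
\[
\begin{aligned}
&E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n,t)\,dt\Big]\\
&\quad=\int_X\frac{\int_{A^G}c_j^G(x,a)\,\breve F_{n+1}(x)(da)}{\int_{A^G}q_x^{GO}(a)\,\breve F_{n+1}(x)(da)}\,p^M_{n+1}(A^G\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx)\\
&\qquad+\int_X\int_{A^I}c_j^I(x,a)\,\bar\varphi_{n+1}(da\,|\,x)\,p^M_{n+1}(A^I\,|\,x)\,P^{p^M}_{x_0}(X_n\in dx).
\end{aligned}
\tag{7.14}
\]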
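Now consider the strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ in the gradual-impulsive control model defined for each $n\ge 0$ by $\breve F_{n+1}(x_n)_t(da)\equiv\breve F_{n+1}(x_n)(da)$ introduced above and
\[
\begin{aligned}
\sigma^{(0)}_{n+1}(\{\infty\}\times d\breve b\,|\,x_n)&=p^M_{n+1}(A^G\,|\,x_n)\,\bar\varphi_{n+1}(d\breve b\,|\,x_n),\\
\sigma^{(0)}_{n+1}(\{0\}\times d\breve b\,|\,x_n)&=(1-p^M_{n+1}(A^G\,|\,x_n))\,\bar\varphi_{n+1}(d\breve b\,|\,x_n).
\end{aligned}
\tag{7.15}
\]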
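(Note that for brevity we did not indicate explicitly the immaterial arguments.) In particular, $\sigma^{(0)}_{n+1}((\{\infty\}\cup\{0\})\times A^I\,|\,x_n)=1$; the strategy $\sigma$ has the form of (7.3), where $\mu_{n+1}(x_n)=p^M_{n+1}(A^G\,|\,x_n)$, i.e., $\sigma$ is a μ-Markov strategy. Note that on $\{\Theta_n<\infty\}$,
\[
\begin{aligned}
&\breve E^\sigma_{x_0}\big[\breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\,\big|\,\breve H_n\big]\\
&\quad=\breve E^\sigma_{x_0}\Big[I\{\Theta_{n+1}<\breve C_{n+1}\}\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_{A^G}c_j^G(X_n,a)\,\breve F_{n+1}(X_n)(da)\,dt\,\Big|\,\breve H_n\Big]\\
&\qquad+\breve E^\sigma_{x_0}\big[I\{\Theta_{n+1}=\breve C_{n+1}\}\,c_j^I(X_n,\breve B_{n+1})\,\big|\,\breve H_n\big]\\
&\quad=p^M_{n+1}(A^G\,|\,X_n)\,\frac{\int_{A^G}c_j^G(X_n,a)\,\breve F_{n+1}(X_n)(da)}{\int_{A^G}q^{GO}_{X_n}(a)\,\breve F_{n+1}(X_n)(da)}
+p^M_{n+1}(A^I\,|\,X_n)\int_{A^I}c_j^I(X_n,a)\,\bar\varphi_{n+1}(da\,|\,X_n)\\
&\quad=\breve E^\sigma_{x_0}\big[\breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\,\big|\,X_n\big],
\end{aligned}
\tag{7.16}
\]
where the second equality is by (7.15). Recall that
\[
\breve E^\sigma_{x_0}\big[\Theta_{n+1}\,\big|\,\breve H_n,\breve C_{n+1}=\infty\big]=\frac{1}{\int_{A^G}q^{GO}_{X_n}(a)\,\breve F_{n+1}(X_n)(da)}.
\]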
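Comparing (7.16) (in particular, the last but one line) with (7.14), we see that, for (7.8) and thus for the statement of this theorem, it remains to show that $P^{p^M}_{x_0}(X_n\in dx)=\breve P^\sigma_{x_0}(X_n\in dx)$ as follows. This relation automatically holds when $n=0$, with both sides of the equality being $\delta_{x_0}(dx)$. Assume for induction that it also holds for the case of $n$. Then
\[
\begin{aligned}
&\breve P^\sigma_{x_0}(X_{n+1}\in dx)
=\breve E^\sigma_{x_0}\Big[\breve P^\sigma_{x_0}(X_{n+1}\in dx\,|\,\breve H_n,\breve C_{n+1},\breve B_{n+1})\big(I\{\breve C_{n+1}=\infty\}+I\{\breve C_{n+1}=0\}\big)\Big]\\
&\quad=\breve E^\sigma_{x_0}\Big[\int_0^\infty\int_{A^G}q(dx\,|\,X_n,a)\,\breve F_{n+1}(X_n)(da)\,e^{-\int_{A^G}q_{X_n}(a)\breve F_{n+1}(X_n)(da)\,t}\,dt\;p^M_{n+1}(A^G\,|\,X_n)\\
&\qquad\qquad+\int_{A^I}Q(dx\,|\,X_n,a)\,\bar\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^I\,|\,X_n)\Big]\\
&\quad=\breve E^\sigma_{x_0}\Big[\frac{\int_{A^G}q(dx\,|\,X_n,a)\,\breve F_{n+1}(X_n)(da)}{\int_{A^G}q_{X_n}(a)\,\breve F_{n+1}(X_n)(da)}\,p^M_{n+1}(A^G\,|\,X_n)
+\int_{A^I}Q(dx\,|\,X_n,a)\,\bar\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^I\,|\,X_n)\Big]\\
&\quad=\breve E^\sigma_{x_0}\Big[\int_{A^G}\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,\tilde\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^G\,|\,X_n)
+\int_{A^I}\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,\bar\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^I\,|\,X_n)\Big]\\
&\quad=\breve E^\sigma_{x_0}\Big[\int_A\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)\Big]\\
&\quad=E^{p^M}_{x_0}\Big[\int_A\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)\Big]=P^{p^M}_{x_0}(X_{n+1}\in dx),
\end{aligned}
\tag{7.17}
\]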
where the second equality is by (7.15), the fourth equality is by (7.13), the third to last equality is by the definitions of $\tilde\varphi_{n+1}$, $\bar\varphi_{n+1}$, the last but one equality is by the inductive supposition, and the last equality holds because $p^M$ is a Markov standard ξ-strategy. Therefore, $P^S_{x_0}(X_n\in dx)=\breve P^\sigma_{x_0}(X_n\in dx)$ for all $n\ge 0$, as required. The first assertion of this statement is thus proved.

(ii) Now consider the Markov standard ξ-strategy $\sigma':=\{\sigma_n^{(0)},\tilde\varphi_n\}_{n=1}^\infty$. The reasoning is similar to part (i) with the following modification of the key expressions (7.16) and (7.17):
\[
\begin{aligned}
&\breve E^{\sigma'}_{x_0}\big[\breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\,\big|\,\breve H_n\big]\\
&\quad=\breve E^{\sigma'}_{x_0}\Big[I\{\Theta_{n+1}<\breve C_{n+1}\}\int_{A^G}\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}c_j^G(X_n,a)\,dt\,\tilde\varphi_{n+1}(da\,|\,X_n)\,\Big|\,\breve H_n\Big]\\
&\qquad+\breve E^{\sigma'}_{x_0}\big[I\{\Theta_{n+1}=\breve C_{n+1}\}\,c_j^I(X_n,\breve B_{n+1})\,\big|\,\breve H_n\big]\\
&\quad=p^M_{n+1}(A^G\,|\,X_n)\int_{A^G}\frac{c_j^G(X_n,a)}{q^{GO}_{X_n}(a)}\,\tilde\varphi_{n+1}(da\,|\,X_n)
+p^M_{n+1}(A^I\,|\,X_n)\int_{A^I}c_j^I(X_n,a)\,\bar\varphi_{n+1}(da\,|\,X_n)\\
&\quad=\breve E^{\sigma'}_{x_0}\big[\breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\,\big|\,X_n\big]
\end{aligned}
\]
and
\[
\begin{aligned}
&\breve P^{\sigma'}_{x_0}(X_{n+1}\in dx)
=\breve E^{\sigma'}_{x_0}\Big[\breve P^{\sigma'}_{x_0}(X_{n+1}\in dx\,|\,\breve H_n,\breve C_{n+1},\breve B_{n+1})\big(I\{\breve C_{n+1}=\infty\}+I\{\breve C_{n+1}=0\}\big)\Big]\\
&\quad=\breve E^{\sigma'}_{x_0}\Big[\int_{A^G}\frac{q(dx\,|\,X_n,a)}{q_{X_n}(a)}\,\tilde\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^G\,|\,X_n)
+\int_{A^I}Q(dx\,|\,X_n,a)\,\bar\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^I\,|\,X_n)\Big]\\
&\quad=\breve E^{\sigma'}_{x_0}\Big[\int_{A^G}\frac{q(dx\,|\,X_n,a)}{q_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)
+\int_{A^I}Q(dx\,|\,X_n,a)\,\bar\varphi_{n+1}(da\,|\,X_n)\,p^M_{n+1}(A^I\,|\,X_n)\Big]\\
&\quad=\breve E^{\sigma'}_{x_0}\Big[\int_{A^G}\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)
+\int_{A^I}\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)\Big]\\
&\quad=\breve E^{\sigma'}_{x_0}\Big[\int_A\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)\Big]\\
&\quad=E^{p^M}_{x_0}\Big[\int_A\frac{q^{GO}(dx\,|\,X_n,a)}{q^{GO}_{X_n}(a)}\,p^M_{n+1}(da\,|\,X_n)\Big]=P^{p^M}_{x_0}(X_{n+1}\in dx)
\end{aligned}
\]
(cf. (7.12)). If the π-strategy $S$ is stationary, then the presented expressions for $\sigma$ and $\sigma'$ define the μ-stationary strategy and the μ-stationary standard ξ-strategy, respectively. The proof is thus complete.

Remark 7.1.2 If the state space is $X_\Delta=X\cup\{\Delta\}$ with $\Delta\notin X$ being a costless cemetery (isolated from $X$), then Theorem 7.1.1 remains true, and so does Remark 7.1.3 below.

The following observation, which follows from the proof of Theorem 7.1.1, is used in the next section.

Corollary 7.1.1 Consider the setup in Theorem 7.1.1. Under the conditions therein, if the given π-strategy $S=\varphi^s$ is a deterministic stationary strategy in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only, then one may take the replicating strategy in the gradual-impulsive control model as a μ-deterministic stationary strategy specified by $\{\mu,\varphi^{(0)},\varphi^{(1)}\}$ in the sense of Definition 7.1.2(c), where
\[
\begin{aligned}
\mu(x)&=\delta_{\varphi^s(x)}(A^G),\\
\varphi^{(0)}(x)&=\varphi^s(x)\,I\{\varphi^s(x)\in A^I\}+b^*\,I\{\varphi^s(x)\notin A^I\},\\
\varphi^{(1)}(x)&=\varphi^s(x)\,I\{\varphi^s(x)\in A^G\}+a^*\,I\{\varphi^s(x)\notin A^G\}
\end{aligned}
\]
with $a^*\in A^G$ and $b^*\in A^I$ being arbitrarily fixed.

The opposite direction of Theorem 7.1.1 also holds.

Theorem 7.1.2 Suppose Condition 7.1.1 is satisfied. For each strategy $\sigma\in\Sigma^{GI}$ in the gradual-impulsive CTMDP model, there is some π-strategy $S=\{\pi_n\}_{n=1}^\infty$ in the CTMDP model $\mathcal{M}^{GO}$ with gradual controls only such that $\breve W_j(\sigma,x_0)=W_j^0(S,x_0)$ for each $j=0,1,\ldots,J$.

Proof Let a strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ be fixed. For the statement of this theorem, we will construct a π-strategy $S=\{\pi_n\}_{n=1}^\infty$ in the CTMDP model $\mathcal{M}^{GO}$ with gradual controls only such that (7.8) holds for all $n\ge 0$. As in the proof of Theorem 7.1.1, we may assume that $c_j$ is $[0,\infty]$-valued. Moreover, we will further assume in this proof that $c_j$ is bounded. This is without loss of generality, because one can deduce the general case by applying the claimed relation to $\min\{c_j,N\}$ and passing to the limit as $N\to\infty$ with the help of the Monotone Convergence Theorem.

First, let us define a generalized π-ξ strategy $S=\{\Xi,p_0,\{(p_n,\pi_n)\}_{n=1}^\infty\}$ in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only as follows:
$\Xi:=[0,\infty]\times A^I$ and, for $n\ge 0$, $\xi_0=(\breve c_0,\breve b_0)$, $\xi_1=(\breve c_1,\breve b_1)$, ..., $\xi_n=(\breve c_n,\breve b_n)$,
\[
\begin{aligned}
&p_{n+1}(dc\times db\,|\,\xi_0,x_0,\xi_1,\theta_1,x_1,\xi_2,\theta_2,x_2,\ldots,\xi_n,\theta_n,x_n)\\
&\quad:=\sigma^{(0)}_{n+1}(dc\times db\,|\,(\tilde\theta_0,x_0),(\breve c_1,\breve b_1),(\tilde\theta_1,x_1),(\breve c_2,\breve b_2),\ldots,(\tilde\theta_n,x_n)),
\end{aligned}
\tag{7.18}
\]
where the notation $\breve x_i=(\tilde\theta_i,x_i)$ is in use, and for $\breve c_i$ and $\theta_i$, we define
\[
\tilde\theta_i:=\theta_i\,I\{\breve c_i\ge\theta_i\}+\breve c_i\,I\{\breve c_i<\theta_i\},\quad\forall\,i\ge 1;\qquad\tilde\theta_0:=0.
\]
These expressions transform the sojourn times in the CTMDP model $\mathcal{M}^{GO}$ to the sojourn times in the gradual-impulsive CTMDP model. If $\theta_i>\breve c_i$ then the jump in $\mathcal{M}^{GO}$ occurs after the planned impulse, under the action from $A^I$. The whole interval $(\breve c_i,\theta_{i+1}]$ corresponds to the instant impulse at the moment $\sum_{k=0}^{i-1}\tilde\theta_k+\breve c_i$ in the gradual-impulsive CTMDP model. Note that there is actually no dependence on $\xi_0$ in the definition of $p_{n+1}(dc\times db\,|\,\xi_0,x_0,\xi_1,\theta_1,x_1,\xi_2,\theta_2,\ldots,\xi_n,\theta_n,x_n)$.

For $n\ge 0$, $\xi_0=(\breve c_0,\breve b_0)$, $\xi_1=(\breve c_1,\breve b_1)$, ..., $\xi_{n+1}=(\breve c_{n+1},\breve b_{n+1})$,
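\[
\begin{aligned}
&\pi_{n+1}(da\,|\,\xi_0,x_0,\xi_1,\theta_1,x_1,\xi_2,\theta_2,x_2,\ldots,x_n,\xi_{n+1},t)\\
&\quad:=\begin{cases}
\breve F_{n+1}(\breve x_0,(\breve c_1,\breve b_1),(\tilde\theta_1,x_1),(\breve c_2,\breve b_2),\ldots,(\tilde\theta_n,x_n),\breve c_{n+1},\breve b_{n+1})_t(da) & \text{if } t\le\breve c_{n+1};\\
\delta_{\breve b_{n+1}}(da) & \text{if } t>\breve c_{n+1}.
\end{cases}
\end{aligned}
\tag{7.19}
\]
Since there is no dependence on $\xi_0$, $p_0$ can be chosen arbitrarily, and we will omit the argument $\xi_0$ in $p_{n+1}$ and in $\pi_{n+1}$.

Let us justify the following equality:
\[
\breve E^\sigma_{x_0}\big[\breve l_j(\breve X_n,\breve A_{n+1},\breve X_{n+1})\big]
=E^S_{x_0}\Big[I\{X_n\in X\}\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n,\Xi_{n+1},t)\,dt\Big]
\tag{7.20}
\]
for each $n\ge 0$ and $j=0,1,\ldots,J$. It holds for each $n\ge 0$ that
\[
\begin{aligned}
&E^S_{x_0}\Big[I\{X_n\in X\}\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,X_0,(\breve C_1,\breve B_1),\Theta_1,X_1,\ldots,X_n,(\breve C_{n+1},\breve B_{n+1}),t)\,dt\Big]\\
&\quad=E^S_{x_0}\Big[I\{X_n\in X\}\,E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n^+,t)\,dt\,\Big|\,H_n^+\Big]\Big],
\end{aligned}
\]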
where $H_n^+:=(X_0,(\breve C_1,\breve B_1)=\Xi_1,\Theta_1,X_1,\ldots,(\breve C_n,\breve B_n)=\Xi_n,\Theta_n,X_n,(\breve C_{n+1},\breve B_{n+1})=\Xi_{n+1})$. For future reference, let us also introduce for $H_n^+$ the following:
\[
\tilde H_n^+:=(X_0,(\breve C_1,\breve B_1)=\Xi_1,(\tilde\Theta_1,X_1),\ldots,(\breve C_n,\breve B_n)=\Xi_n,(\tilde\Theta_n,X_n),\breve C_{n+1},\breve B_{n+1}),
\]
and write $0,\tilde H_n^+$ for $((0,X_0),(\breve C_1,\breve B_1),(\tilde\Theta_1,X_1),\ldots,(\tilde\Theta_n,X_n),\breve C_{n+1},\breve B_{n+1})$ in expressions like $\breve F_{n+1}(0,\tilde H_n^+)$. On $\{X_n\in X\}$, the conditional expectation in the previous equality can be written as
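\[
\begin{aligned}
&E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n^+,t)\,dt\,\Big|\,H_n^+\Big]\\
&\quad=\int_{\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n^+,t)\,e^{-\int_{(0,t]}\int_A q^{GO}_{X_n}(a)\,\pi_{n+1}(da\,|\,H_n^+,s)\,ds}\,dt\\
&\quad=\int_{(0,\breve C_{n+1}]\cap\mathbb{R}_+}\int_{A^G}c_j^G(X_n,a)\,\breve F_{n+1}(0,\tilde H_n^+)_t(da)\,e^{-\int_{(0,t]}\int_{A^G}q_{X_n}(a)\,\breve F_{n+1}(0,\tilde H_n^+)_s(da)\,ds}\,dt\\
&\qquad+I\{\breve C_{n+1}<\infty\}\int_{(\breve C_{n+1},\infty)}c_j^I(X_n,\breve B_{n+1})\,e^{-(t-\breve C_{n+1})}\,e^{-\int_{(0,\breve C_{n+1}]}\int_{A^G}q_{X_n}(a)\,\breve F_{n+1}(0,\tilde H_n^+)_s(da)\,ds}\,dt\\
&\quad=\int_{(0,\breve C_{n+1}]\cap\mathbb{R}_+}\int_{A^G}c_j^G(X_n,a)\,\breve F_{n+1}(0,\tilde H_n^+)_t(da)\,e^{-\int_{(0,t]}\int_{A^G}q_{X_n}(a)\,\breve F_{n+1}(0,\tilde H_n^+)_s(da)\,ds}\,dt
+I\{\breve C_{n+1}\ \cdots
\end{aligned}
\]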
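\[
\begin{aligned}
&\cdots+E^S_{x_0}\big[I\{\cdots\ \breve C_{n+1},\ \breve C_{n+1}<\infty,\ X_{n+1}\in dx,\ \breve C_{n+2}\in d\breve c,\ \breve B_{n+2}\in d\breve b\}\,\big|\,\tilde H_n^+\big]\\
&\quad=E^S_{x_0}\big[I\{\Theta_{n+1}\in dt,\ \Theta_{n+1}\le\breve C_{n+1},\ X_{n+1}\in dx,\ \breve C_{n+2}\in d\breve c,\ \breve B_{n+2}\in d\breve b\}\,\big|\,\tilde H_n^+\big]\\
&\qquad+\delta_{\breve C_{n+1}}(dt)\,E^S_{x_0}\big[I\{\Theta_{n+1}>\breve C_{n+1},\ \breve C_{n+1}<\infty,\ X_{n+1}\in dx,\ \breve C_{n+2}\in d\breve c,\ \breve B_{n+2}\in d\breve b\}\,\big|\,\tilde H_n^+\big]\\
&\quad=\sigma^{(0)}_{n+2}(d\breve c\times d\breve b\,|\,0,\tilde H_n^+,t,x)\,I\{t\in[0,\breve C_{n+1}]\}\,q(dx\,|\,X_n,\breve F_{n+1}(0,\tilde H_n^+)_t)\,e^{-\int_{(0,t]}q_{X_n}(\breve F_{n+1}(0,\tilde H_n^+)_s)\,ds}\,dt\\
&\qquad+\sigma^{(0)}_{n+2}(d\breve c\times d\breve b\,|\,0,\tilde H_n^+,t,x)\,\delta_{\breve C_{n+1}}(dt)\,I\{\breve C_{n+1}<\infty\}\,Q(dx\,|\,X_n,\breve B_{n+1})\,e^{-\int_{(0,\breve C_{n+1}]}q_{X_n}(\breve F_{n+1}(0,\tilde H_n^+)_s)\,ds},
\end{aligned}
\]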
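where the third equality is by (7.18) and (7.19). Now it follows that
\[
\begin{aligned}
&P^S_{x_0}(\tilde H_n^+\in dh,\ \tilde\Theta_{n+1}\in dt,\ X_{n+1}\in dx,\ \breve C_{n+2}\in d\breve c,\ \breve B_{n+2}\in d\breve b)\\
&\quad=E^S_{x_0}\Big[I\{\tilde H_n^+\in dh\}\,\sigma^{(0)}_{n+2}(d\breve c\times d\breve b\,|\,0,\tilde H_n^+,t,x)\Big(I\{t\in[0,\breve C_{n+1}]\}\,q(dx\,|\,X_n,\breve F_{n+1}(0,\tilde H_n^+)_t)\,e^{-\int_{(0,t]}q_{X_n}(\breve F_{n+1}(0,\tilde H_n^+)_s)\,ds}\,dt\\
&\qquad\qquad+I\{\breve C_{n+1}<\infty\}\,\delta_{\breve C_{n+1}}(dt)\,Q(dx\,|\,X_n,\breve B_{n+1})\,e^{-\int_{(0,\breve C_{n+1}]}q_{X_n}(\breve F_{n+1}(0,\tilde H_n^+)_s)\,ds}\Big)\Big]\\
&\quad=\breve E^\sigma_{x_0}\Big[I\{\breve H_n^+\in dh\}\,\sigma^{(0)}_{n+2}(d\breve c\times d\breve b\,|\,0,\breve H_n^+,t,x)\Big(I\{t\in[0,\breve C_{n+1}]\}\,q(dx\,|\,X_n,\breve F_{n+1}(0,\breve H_n^+)_t)\,e^{-\int_{(0,t]}q_{X_n}(\breve F_{n+1}(0,\breve H_n^+)_s)\,ds}\,dt\\
&\qquad\qquad+I\{\breve C_{n+1}<\infty\}\,\delta_{\breve C_{n+1}}(dt)\,Q(dx\,|\,X_n,\breve B_{n+1})\,e^{-\int_{(0,\breve C_{n+1}]}q_{X_n}(\breve F_{n+1}(0,\breve H_n^+)_s)\,ds}\Big)\Big]\\
&\quad=\breve P^\sigma_{x_0}(\breve H_n^+\in dh,\ \Theta_{n+1}\in dt,\ X_{n+1}\in dx,\ \breve C_{n+2}\in d\breve c,\ \breve B_{n+2}\in d\breve b),
\end{aligned}
\]
where the second equality is by the inductive supposition, and the last equality is by (7.1). It follows that (7.20) holds for each $n\ge 0$.

To complete the proof of this statement, it remains to take the Markov π-strategy $\bar S=\{\bar\pi_n\}_{n=1}^\infty$ in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only such that
\[
m^{S,0}_{x_0,n+1}(dx\times da)=m^{\bar S,0}_{x_0,n+1}(dx\times da)
\]
for each $n\ge 0$. The existence of such a π-strategy $\bar S=\{\bar\pi_n\}_{n=1}^\infty$ is guaranteed by Theorem 6.2.1. Recall that
\[
\begin{aligned}
E^S_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\pi_{n+1}(da\,|\,H_n,t)\,dt\Big]
&=\int_{X\times A}c_j(x,a)\,m^{S,0}_{x_0,n+1}(dx\times da)
=\int_{X\times A}c_j(x,a)\,m^{\bar S,0}_{x_0,n+1}(dx\times da)\\
&=E^{\bar S}_{x_0}\Big[\int_{(0,\Theta_{n+1}]\cap\mathbb{R}_+}\int_A c_j(X_n,a)\,\bar\pi_{n+1}(da\,|\,H_n,t)\,dt\Big].
\end{aligned}
\]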
The proof is complete.
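As a small numerical illustration of the stationary kernel $p^s$ in Proposition 7.1.1, the formula can be evaluated directly when the action set is finite, so that the integral over $A$ becomes a sum. The function name `stationary_xi_kernel` and the list-based representation for a single state $x$ are assumptions made for this sketch only.

```python
def stationary_xi_kernel(pi_s, q_rate):
    """Finite-action version of p^s(a|x) = q_x(a) pi^s(a|x) / sum_b q_x(b) pi^s(b|x).

    pi_s:   probabilities pi^s(a|x) over a finite action list, for one fixed state x
    q_rate: jump intensities q_x(a) in the same order, assumed strictly positive
    """
    total = sum(q * p for q, p in zip(q_rate, pi_s))
    if total <= 0:
        raise ValueError("the weighted total rate must be positive")
    return [q * p / total for q, p in zip(q_rate, pi_s)]


# Two actions used half of the time each, with intensities 1 and 3.
example = stationary_xi_kernel([0.5, 0.5], [1.0, 3.0])
```

In this example the weights are 0.25 and 0.75: under the relaxed control, the action with the larger intensity accounts for a larger share of the jumps, and the ξ-kernel has to select it correspondingly more often.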
Remark 7.1.3 Under the conditions imposed therein, Theorems 7.1.1 and 7.1.2 reduce the gradual-impulsive control problem (7.5) to a CTMDP problem with gradual control only for the model $\mathcal{M}^{GO}$:
\[
\begin{aligned}
&\text{Minimize } W_0^0(S,x_0)\ \text{ over } S\in\mathcal{S}_\pi\\
&\text{subject to } W_j^0(S,x_0)\le d_j,\quad j=1,2,\ldots,J,
\end{aligned}
\tag{7.24}
\]
cf. problem (1.16). If $S$ is an optimal solution to problem (7.24), then there exist a Markov π-strategy $\bar S$ and a Markov standard ξ-strategy $p^M$ also solving this problem according to Theorem 6.2.1 and Proposition 7.1.1. Now the μ-Markov strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty$ in the gradual-impulsive control model solves problem (7.5), and is replicated by the Markov standard ξ-strategy $\sigma'=\{\sigma_n^{(0)},\tilde\varphi_n\}_{n=1}^\infty$, where the stochastic kernels $\sigma_n^{(0)}$ are defined in (7.15), the stochastic kernels $\tilde\varphi_n$ and $\bar\varphi_n$ are given by (7.10) and (7.11), and the stochastic kernels $\breve F_n$ are given by (7.13). This gives rise to a method of studying the gradual-impulsive control problem (7.5), which we demonstrate in the proof of Theorem 7.1.3 below, where it is also pointed out how to produce an optimal strategy for the gradual-impulsive control problem (7.5) from an optimal π-strategy for the above CTMDP problem with gradual control only.

Another straightforward consequence of Theorems 7.1.1 and 7.1.2 is the following one concerning the sufficient class of strategies for solving the gradual-impulsive control problem (7.5). Its proof is obvious and thus omitted.

Corollary 7.1.2 Suppose Condition 7.1.1 is satisfied, and for each $x\in X$, there is some $\varepsilon(x)>0$ such that $q_x(a)\ge\varepsilon(x)>0$ for all $a\in A^G$. Then for each given strategy $\tilde\sigma\in\Sigma^{GI}$, there exists a μ-Markov strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$, i.e., in the form of (7.3), in the gradual-impulsive control problem (7.5) such that $\breve W_j(\tilde\sigma,x_0)=\breve W_j(\sigma,x_0)$ for each $j=0,1,\ldots,J$. In particular, if there is an optimal strategy for the gradual-impulsive control problem (7.5), then there exists an optimal μ-Markov one.
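The passage from a kernel $p^M_{n+1}(\cdot\,|\,x)$ on $A=A^I\cup A^G$ to the ingredients of the μ-Markov strategy, as recalled in Remark 7.1.3, can be traced numerically for one state with finitely many actions. The sketch below follows (7.10), (7.11), (7.13) and the weight $p^M_{n+1}(A^G\,|\,x)$ appearing in (7.15). The function name `split_kernel`, the dictionary representation, and the use of a uniform distribution as the "fixed probability measure" in the degenerate cases are assumptions of this illustration.

```python
def split_kernel(p, q_rate, gradual):
    """Split p(.|x) on A = A^I u A^G into (mu, phi_tilde, phi_bar, F) for one state x.

    p:       dict action -> probability p^M_{n+1}(a|x)
    q_rate:  dict gradual action -> intensity q_x(a) > 0
    gradual: set of gradual actions A^G (both A^G and A^I are assumed non-empty)
    """
    g = [a for a in p if a in gradual]
    i = [a for a in p if a not in gradual]
    mu = sum(p[a] for a in g)        # p(A^G | x): weight of "no impulse" in (7.15)
    mass_i = sum(p[a] for a in i)    # p(A^I | x)
    # (7.10): normalise p on A^G; uniform fallback is an assumed choice of fixed measure.
    phi_tilde = {a: (p[a] / mu if mu > 0 else 1.0 / len(g)) for a in g}
    # (7.11): normalise p on A^I, with the same kind of fallback.
    phi_bar = {a: (p[a] / mass_i if mass_i > 0 else 1.0 / len(i)) for a in i}
    # (7.13): relaxed control F proportional to phi_tilde / q.
    w = {a: phi_tilde[a] / q_rate[a] for a in g}
    norm = sum(w.values())
    F = {a: w[a] / norm for a in g}
    return mu, phi_tilde, phi_bar, F


p = {"b": 0.2, "a1": 0.3, "a2": 0.5}
q_rate = {"a1": 2.0, "a2": 4.0}
mu, phi_tilde, phi_bar, F = split_kernel(p, q_rate, {"a1", "a2"})

# Identity stated just before (7.14): ratio of F-averages equals the phi_tilde-average of c/q.
c = {"a1": 1.0, "a2": 3.0}
lhs = sum(c[a] * F[a] for a in F) / sum(q_rate[a] * F[a] for a in F)
rhs = sum(c[a] / q_rate[a] * phi_tilde[a] for a in phi_tilde)
```

Here `lhs` and `rhs` agree, which is the finite-action form of the identity that allows the relaxed control $\breve F_{n+1}$ to be replaced by the randomized choice $\tilde\varphi_{n+1}$.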
The sufficiency result obtained in the above corollary does not require any compactness-continuity conditions. It will be further strengthened below if we impose such conditions.

Condition 7.1.2
(a) The spaces $A^G$ and $A^I$ are compact.
(b) The functions $\{c_j^G\}_{j=0}^J$ and $\{c_j^I\}_{j=0}^J$ are $[0,\infty]$-valued and lower semicontinuous on $X\times A^G$ and $X\times A^I$, respectively.
(c) For each bounded continuous function $f$ on $X$, the functions $(x,a)\in X\times A^G\to\int_X f(y)\,q(dy\,|\,x,a)$ and $(x,b)\in X\times A^I\to\int_X f(y)\,Q(dy\,|\,x,b)$ are continuous.

The next statement is the main solvability result concerning the gradual-impulsive control problem (7.5), obtained by the application of the proposed method (see Remark 7.1.3) for studying problem (7.5).

Theorem 7.1.3 Suppose Conditions 7.1.1 and 7.1.2 are satisfied, for each $x\in X$ there is some $\varepsilon(x)>0$ such that $q_x(a)\ge\varepsilon(x)>0$ for all $a\in A^G$, and there exists a feasible strategy $\tilde\sigma\in\Sigma^{GI}$ with a finite value, i.e., it satisfies the constraints in the gradual-impulsive control problem (7.5) and $\breve W_0(\tilde\sigma,x_0)<\infty$. Then there exists an optimal μ-stationary strategy $\sigma=\{\sigma_n^{(0)},\breve F_n\}_{n=1}^\infty\in\Sigma^{GI}$ in the form of (7.4), which is also replicated by a μ-stationary standard ξ-strategy $\sigma'$.

Proof As mentioned in Remark 7.1.3, Theorems 7.1.1 and 7.1.2 imply that an optimal strategy in the gradual-impulsive control problem (7.5) can be produced from an optimal π-strategy for the standard CTMDP problem (7.24). By Theorem 4.2.2, under the imposed conditions, the CTMDP problem (7.24) with gradual control only has an optimal stationary π-strategy $S=\pi^s$. Now the required statement follows from Theorem 7.1.1 and Remark 7.1.3.

One can introduce the notion of a realizable strategy in the gradual-impulsive control problems, similarly to the material in Sect. 6.6. In this connection, ξ-strategies are realizable.

Finally, it would be useful to consider a mixture over a collection of strategies in the gradual-impulsive control model. Recall that the strategies in the gradual-impulsive control model are defined as strategies in a DTMDP, and in that context, the mixture of strategies is well understood: it means the convex combination (in the strong sense as in Corollary C.1.1) of the strategic measures induced by the given collection of strategies, see Proposition C.1.1 and Corollary C.1.1. For this reason, here we are confined to the following informal description of a special case: under a mixture over two strategies $\sigma^1$ and $\sigma^2\in\Sigma^{GI}$ with the weight $p\in[0,1]$ on $\sigma^1$, the decision-maker chooses $\sigma^1$ with probability $p$ and $\sigma^2$ with probability $1-p$ upfront at the initial time, and follows the chosen strategy to control the process afterwards. The rigorous notion of a mixture can be introduced similarly to Definition 6.4.1: add an extra component $\beta\in[0,1]$ to $\breve h_n$, fix a probability distribution $\nu$ on $[0,1]$ and assume that $\sigma^{(0)}_{n+1}$ and $\breve F_{n+1}$, $n\ge 0$, depend on
\[
\breve h_n:=(\beta,\breve x_0,(\breve c_1,\breve b_1),\breve x_1,(\breve c_2,\breve b_2),\ldots,\breve x_n).
\]
According to Remark 7.1.3, under the conditions of Theorem 7.1.3, the gradual-impulsive control problem (7.5) can be reduced to a gradual control problem (7.24) for the CTMDP model $\mathcal{M}^{GO}$, in the sense that each strategy in the gradual-impulsive control model is replicated by a corresponding one in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only, and vice versa. Since the gradual control problem (7.24) for the CTMDP model $\mathcal{M}^{GO}$ can be reduced to a DTMDP problem with total cost criteria as in, e.g., the third subsubsection of Sect. 4.2.4, one may refer to Corollary C.1.1 for the fact that mixtures do not improve the performance in the gradual control problem (7.24) for the CTMDP model $\mathcal{M}^{GO}$. See also Chap. 6: mixtures do not extend the space $\mathcal{D}_S$ of all sequences of detailed occupation measures according to Lemma 6.4.1, and $\mathcal{D}_S=\mathcal{D}_{ReM}$ by Theorem 6.2.1.

The reason for considering a mixture of strategies is twofold: firstly, it is realizable; and secondly, constrained CTMDP problems can often be solved by a finite mixture of strategies in a simple form, as illustrated by the example solved in Sect. 7.2, see Corollary 7.2.1 therein. The connected notions of the TOM-mixture and the SM-mixture appeared in Definitions 3.2.3 and 5.2.2; they were also useful for solving constrained problems, as shown in Theorems 3.2.7 and 5.2.4.
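To make the informal description of a two-strategy mixture concrete: since the mixture corresponds to a convex combination of strategic measures, each performance functional of the mixture is, when the values involved are finite, the same convex combination of the individual values. The sketch below uses this to pick the weight that meets a single constraint with equality. The function names and the numerical values are hypothetical, and the finiteness of all performance values is an assumption of the sketch.

```python
def mixture_performance(p, perf_1, perf_2):
    """Performance vector (W_0, ..., W_J) of the mixture with weight p on strategy 1.

    Assumes all entries are finite, so that expectations combine linearly.
    """
    return [p * a + (1.0 - p) * b for a, b in zip(perf_1, perf_2)]


def weight_meeting_constraint(w1_first, w1_second, d1):
    """Weight p on the first strategy with p * w1_first + (1 - p) * w1_second == d1."""
    if w1_first == w1_second:
        raise ValueError("the two constraint values coincide; no weight is determined")
    p = (d1 - w1_second) / (w1_first - w1_second)
    if not 0.0 <= p <= 1.0:
        raise ValueError("d1 does not lie between the two constraint values")
    return p


# Hypothetical values (W_0, W_1) of two strategies and a constraint level d_1 = 2.5.
perf_1, perf_2 = [10.0, 1.0], [4.0, 3.0]
p = weight_meeting_constraint(perf_1[1], perf_2[1], 2.5)
mixed = mixture_performance(p, perf_1, perf_2)
```

With these numbers the weight is 0.25 and the mixture attains the constraint level exactly, which is the typical shape of a solution built from two simple strategies, one feasible and one infeasible on its own.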
7.2 Example: An Epidemic with Carriers

7.2.1 Problem Statement

At time $t\ge 0$, let $X_1(t)$ be the number of susceptibles, and $X_2(t)$ be the number of carriers in the population. In the absence of impulses, the population dynamics evolve as follows. The process $\{X_2(t)\}_{t\ge 0}$ evolves as a continuous-time Markov chain (more specifically, a birth-and-death process) in the state space $\{0,1,\ldots\}$ with the transition rate given by
\[
q^{(2)}(y_2\,|\,x_2)=\begin{cases}
\rho_b x_2, & \text{if } y_2=x_2+1,\\
\rho_d x_2, & \text{if } y_2=x_2-1,\ x_2>0,\\
0 & \text{otherwise},
\end{cases}
\tag{7.25}
\]
and $q^{(2)}_{x_2}=(\rho_b+\rho_d)x_2$, $\forall\,x_2\in\{0,1,\ldots\}$, where $\rho_b,\rho_d>0$ are two fixed positive constants, representing the birth and death rates of each individual of the carriers. If the number of carriers is $x_2\ge 0$, then the process $\{X_1(t)\}_{t\ge 0}$ evolves as an independent continuous-time Markov chain (more exactly, a pure death process) in the state space $\{0,1,\ldots\}$ with the transition rate
given by
\[
q^{(1)}(y_1\,|\,x_1)=\begin{cases}
x_1x_2, & \text{if } y_1=x_1-1,\ x_1>0,\\
0, & \text{otherwise},
\end{cases}
\tag{7.26}
\]
and $q^{(1)}_{x_1}=x_1x_2$, $\forall\,x_1\in\{0,1,\ldots\}$. Note that the (natural) jump from $x_1$ to $x_1-1$, provided $x_1>0$, corresponds to one susceptible becoming infected. If $q^{(1)}_{x_1}=kx_1x_2$, then the coefficient $k>0$ can be made equal to 1 by the change of time $t\to\tau:=kt$, and all the other transition rates should be divided by $k$.

Now suppose that at any moment in time, the decision-maker can choose to immunize one or several susceptibles. Here an impulse represents immunizing one susceptible, leading to an instantaneous decrease in the number of susceptibles by one. It is possible for multiple susceptibles to be immunized at the same time. This corresponds to applying multiple impulses in a sequence at the given time moment. The decision-maker would like to design an immunization policy in such a way that
• the total expected number of diseased susceptibles is minimal, whereas
• the total expected number of immunizations is not greater than a predetermined level.
Here the time horizon is the duration until the first moment when either susceptibles or carriers go extinct in the population.

There is no gradual control in this problem. Nevertheless, we will formulate and treat it as a particular gradual-impulsive control problem as described in Sect. 7.1.2, which also illustrates the results obtained in Sect. 7.1.3. To this end, let us firstly fix the system primitives of the concerned gradual-impulsive control problem.

A natural choice of the state space could be $\{0,1,\ldots\}\times\{0,1,\ldots\}$, where an element $(x_1,x_2)$ stands for the number of susceptibles and the number of carriers, respectively. Then the cemeteries are the collection $\{(0,x_2):x_2=0,1,2,\ldots\}\cup\{(x_1,0):x_1=0,1,\ldots\}$, which can be merged into a single absorbing and costless point $\Delta$. However, here we would like to replicate $\Delta$ with two points, say $\{\Delta_1,\Delta_2\}$, which form a deterministic loop from one to the other.
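The uncontrolled dynamics (7.25) and (7.26) are easy to simulate, which gives a feel for the quantity being minimized, namely the number of infections before either population dies out. The sketch below is only an illustration: it tracks the embedded jump chain (the holding times do not affect the count of infections), and the function name, parameter values and the use of Python's `random` module are choices made here, not part of the text.

```python
import random


def simulate_infections(x1, x2, rho_b, rho_d, rng):
    """Count infections until susceptibles (x1) or carriers (x2) hit zero, with no impulses.

    Each step picks the next jump with probability proportional to its rate:
    infection x1*x2, carrier birth rho_b*x2, carrier death rho_d*x2.
    """
    infections = 0
    while x1 > 0 and x2 > 0:
        rate_infection = x1 * x2
        rate_birth = rho_b * x2
        rate_death = rho_d * x2
        u = rng.random() * (rate_infection + rate_birth + rate_death)
        if u < rate_infection:
            x1 -= 1           # one susceptible becomes infected
            infections += 1
        elif u < rate_infection + rate_birth:
            x2 += 1           # a carrier is born
        else:
            x2 -= 1           # a carrier dies
    return infections


rng = random.Random(0)
samples = [simulate_infections(5, 3, 1.0, 1.0, rng) for _ in range(1000)]
estimate = sum(samples) / len(samples)  # Monte Carlo estimate of the expected number of infections
```

Averaging many runs, as in `estimate`, approximates the objective value of the strategy that never immunizes, for the assumed initial state (5, 3) and rates $\rho_b=\rho_d=1$.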
The reason for this is to guarantee that the resulting model satisfies all the requirements in Corollary 7.1.2. Consequently, we will take the state space
\[
X:=X'\cup\{\Delta_1,\Delta_2\}\quad\text{with}\quad X'=\{1,2,\ldots\}\times\{1,2,\ldots\},
\tag{7.27}
\]
despite, with the replicator $\{\Delta_1,\Delta_2\}$ being introduced, there being no absorbing cemetery in the model, at least from the formal point of view. (By the way, this also demonstrates that the requirements in Corollary 7.1.2 do not exclude one from considering models with absorbing cemeteries.)

Let $A^I=\{1\}$, $A^G=\{a_\infty\}$, both being singletons, because an impulse means immunization of one susceptible, whereas there is no gradual control. The transition rate is given for $(x_1,x_2)\in X'$ by
\[
\begin{aligned}
&q((y_1,y_2)\,|\,(x_1,x_2),a_\infty)=I\{y_1=x_1\}\,q^{(2)}(y_2\,|\,x_2)+I\{y_2=x_2\}\,q^{(1)}(y_1\,|\,x_1),\quad\forall\,(y_1,y_2)\in X',\\
&q(\Delta_1\,|\,(x_1,x_2),a_\infty)=I\{x_1=1\}\,x_2+I\{x_2=1\}\,\rho_d,\qquad q(\Delta_2\,|\,(x_1,x_2),a_\infty)=0,\\
&q_{(x_1,x_2)}(a_\infty)=(\rho_d+\rho_b)x_2+x_1x_2=(\rho_d+\rho_b+x_1)x_2;
\end{aligned}
\]
and at $\Delta_1$ or $\Delta_2$ by
\[
q(\Delta_2\,|\,\Delta_1,a_\infty)=q(\Delta_1\,|\,\Delta_2,a_\infty)=q_{\Delta_1}(a_\infty)=q_{\Delta_2}(a_\infty)=1.
\]
The post-impulse kernel is given by
\[
Q((x_1-1,x_2)\,|\,(x_1,x_2),1)=1,\quad\forall\,(x_1,x_2)\in X';\qquad Q(\Delta_1\,|\,\Delta_2,1)=Q(\Delta_2\,|\,\Delta_1,1)=1.
\]
As mentioned earlier, the conditions in Corollary 7.1.2 are all satisfied by this CTMDP model, and consequently, Remark 7.1.3 is applicable to the gradual-impulsive control problem (7.5) for this model.

Finally, note that a lump sum cost of one is incurred whenever there is a natural jump from $(x_1,x_2)$ to $(x_1-1,x_2)$ provided $x_1>0$, as this corresponds to the occurrence of an infection. Since the jump intensity for that transition is $x_1x_2$, this lump sum cost can be expressed as the gradual cost rate $x_1x_2$. The verification of this fact, which is based on the comparison of two conditional expectations (with one involving the lump sum and the other one involving the gradual cost rate), computed similarly to (7.23), is straightforward and omitted. Similar calculations are provided in Sect. 1.1.5. That is why we use the gradual cost rates
\[
\begin{aligned}
&c_0^G((x_1,x_2),a_\infty)=x_1x_2,\quad\forall\,(x_1,x_2)\in X';\qquad c_0^G(\Delta_1,a_\infty)=c_0^G(\Delta_2,a_\infty)=0,\\
&c_1^G((x_1,x_2),a_\infty)=0,\quad\forall\,(x_1,x_2)\in X';\qquad c_1^G(\Delta_1,a_\infty)=c_1^G(\Delta_2,a_\infty)=0,
\end{aligned}
\]
and the impulse cost functions
\[
\begin{aligned}
&c_0^I((x_1,x_2),1)=0,\quad\forall\,(x_1,x_2)\in X';\qquad c_0^I(\Delta_1,1)=c_0^I(\Delta_2,1)=0;\\
&c_1^I((x_1,x_2),1)=1,\quad\forall\,(x_1,x_2)\in X';\qquad c_1^I(\Delta_1,1)=c_1^I(\Delta_2,1)=0.
\end{aligned}
\]
We shall solve the gradual-impulsive control problem (7.5) for this CTMDP model with $J=1$, some given initial state, say, $(i_0,j_0)\in X'$, and some constraint constant $d_1$. Clearly, if $d_1\ge i_0$, then it is optimal to immediately apply $i_0$ impulses in turn at the initial time moment. If $d_1=0$, then the only feasible strategy is to never apply impulses. If $d_1<0$, then there are no feasible strategies. Thus, the nontrivial case is when
\[
0<d_1<i_0,
\tag{7.28}
\]
which will be assumed for the rest of this example.
7.2.2 General Plan

Before jumping directly to the details, let us outline the general plan. Firstly, we apply Remark 7.1.3 to reduce the gradual-impulsive control problem (7.5) to the CTMDP problem (7.24) with gradual control only for the model $\mathcal{M}^{GO} = \{X, A, q^{GO}, \{c_j\}_{j=0}^1\}$: namely, $X$ is the same as in (7.27), $A = \{1, a_\infty\}$;

$$\tilde q^{GO}(dy|x, a_\infty) = \tilde q(dy|x, a_\infty); \qquad \tilde q^{GO}(dy|x, 1) = Q(dy|x, 1); \qquad q_x^{GO}(1) = 1, \quad \forall\, x \in X;$$

and $c_j(x, a_\infty) = c_j^G(x, a_\infty)$, $c_j(x, 1) = c_j^I(x, 1)$, $\forall\, x \in X$. After this reduction, of course, we now may regard $\{\Delta_1, \Delta_2\}$ as a costless and absorbing cemetery $\Delta$, i.e., $q_\Delta(a) = 0$ for each $a \in A$. This is solely to put ourselves in the framework of the third subsubsection of Sect. 4.2.4. Applying the results therein, we further reduce this CTMDP problem with gradual control only to the
total undiscounted problem (C.5) for the following DTMDP model: its state space is $X \cup \{\Delta\}$, its action space is $A = \{1, a_\infty\}$, its transition probability is given for each $(x_1, x_2) \in X$ by

$$p((y_1, y_2)|(x_1, x_2), a_\infty) = \frac{\tilde q^{GO}((y_1, y_2)|(x_1, x_2), a_\infty)}{q^{GO}_{(x_1,x_2)}(a_\infty)} = \frac{I\{y_1 = x_1\}\{I\{y_2 = x_2 + 1\}\rho_b + I\{y_2 = x_2 - 1\}\rho_d\}}{\rho_b + \rho_d + x_1} + I\{y_2 = x_2\}I\{y_1 = x_1 - 1\}\frac{x_1}{\rho_b + \rho_d + x_1}, \quad \forall\, (y_1, y_2) \in X;$$

$$p(\Delta|(x_1, x_2), a_\infty) = \frac{I\{x_1 = 1\} + I\{x_2 = 1\}\rho_d}{\rho_b + \rho_d + x_1};$$

and

$$p((y_1, y_2)|(x_1, x_2), 1) = I\{y_1 = x_1 - 1\}I\{y_2 = x_2\}, \quad \forall\, (y_1, y_2) \in X; \qquad p(\Delta|(x_1, x_2), 1) = I\{x_1 = 1\}.$$

The point $\Delta$ is a costless cemetery in the DTMDP model, so that, in particular, $p(\Delta|\Delta, 1) = p(\Delta|\Delta, a_\infty) = 1$. The cost functions are given for each $(x_1, x_2) \in X$ by

$$l_0((x_1, x_2), a_\infty) = \frac{c_0((x_1, x_2), a_\infty)}{q^{GO}_{(x_1,x_2)}(a_\infty)} = \frac{x_1}{\rho_b + \rho_d + x_1}, \qquad l_0((x_1, x_2), 1) = 0;$$

$$l_1((x_1, x_2), a_\infty) = 0, \qquad l_1((x_1, x_2), 1) = 1,$$

and $l_0(\Delta, a_\infty) = l_0(\Delta, 1) = l_1(\Delta, a_\infty) = l_1(\Delta, 1) = 0$.

A useful observation about the above DTMDP model is the following one, whose proof we include for completeness, even though it involves only straightforward calculations.

Lemma 7.2.1 Let some $\xi \in \left(1 - \frac{1}{\rho_b + \rho_d + 1},\, 1\right)$ be fixed, and put

$$\alpha = \frac{1}{1 - (1 - \xi)(\rho_b + \rho_d + 1)} \in (1, \infty).$$

Then the above DTMDP model $\{X \cup \{\Delta\}, \{a_\infty, 1\}, p, \{l_j\}_{j=0}^1\}$ is contracting on $X$ in the sense of Definition C.2.4, where
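$$\zeta((x_1, x_2)) = \alpha^{x_1}, \quad \forall\, (x_1, x_2) \in X; \qquad \zeta(\Delta) = 1.$$

Proof Let $(x_1, x_2) \in X$ be arbitrarily fixed. Then

$$\sum_{(y_1,y_2) \in X} \zeta((y_1, y_2))\, p((y_1, y_2)|(x_1, x_2), 1) = \alpha^{x_1 - 1},$$

and $\alpha^{x_1 - 1} \le \xi\alpha^{x_1} \Leftrightarrow \alpha\xi \ge 1$, which holds because

$$\xi \ge 1 - (1 - \xi)(\rho_b + \rho_d + 1) = 1 - (\rho_b + \rho_d + 1) + \xi(\rho_b + \rho_d + 1) \Leftrightarrow \rho_b + \rho_d \ge \xi(\rho_b + \rho_d) \Leftrightarrow \xi \le 1,$$

and the last inequality holds. Similarly,

$$\sum_{(y_1,y_2) \in X} \zeta((y_1, y_2))\, p((y_1, y_2)|(x_1, x_2), a_\infty) = \frac{\alpha^{x_1}(\rho_b + \rho_d)}{\rho_b + \rho_d + x_1} + \frac{\alpha^{x_1 - 1}x_1}{\rho_b + \rho_d + x_1} \le \xi\alpha^{x_1} \quad \forall\, (x_1, x_2) \in X$$

$$\Longleftrightarrow\ \alpha^{x_1}(\rho_b + \rho_d) + \alpha^{x_1 - 1}x_1 \le \xi\alpha^{x_1}(\rho_b + \rho_d + x_1) \quad \forall\, (x_1, x_2) \in X$$

$$\Longleftrightarrow\ \xi\alpha(\rho_b + \rho_d + 1) - \alpha(\rho_b + \rho_d) \ge x_1 - \xi\alpha(x_1 - 1) \quad \forall\, (x_1, x_2) \in X$$

$$\Longleftrightarrow\ 1 \ge x_1 - \xi\alpha(x_1 - 1) \quad \forall\, (x_1, x_2) \in X \ \Leftrightarrow\ 1 \le \xi\alpha,$$

where the last but one inequality is according to the definition of $\xi$, and the last inequality has been verified earlier in this proof. The proof will be complete after we trivially note that

$$\sum_{(y_1,y_2) \in X} \zeta((y_1, y_2))\, p((y_1, y_2)|\Delta, 1) = \sum_{(y_1,y_2) \in X} \zeta((y_1, y_2))\, p((y_1, y_2)|\Delta, a_\infty) = 0 \le \zeta(\Delta),$$

as required.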
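The two inequalities in the proof of Lemma 7.2.1 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names and the parameter values used in the check are illustrative assumptions, and the ratios are computed directly (after dividing by $\alpha^{x_1}$) to avoid large powers.

```python
def contraction_alpha(rho_b, rho_d, xi):
    """Return alpha from Lemma 7.2.1 for an admissible xi."""
    total = rho_b + rho_d + 1.0
    if not (1.0 - 1.0 / total < xi < 1.0):
        raise ValueError("xi must lie in (1 - 1/(rho_b + rho_d + 1), 1)")
    return 1.0 / (1.0 - (1.0 - xi) * total)


def drift_ratios(rho_b, rho_d, xi, x1):
    """Ratios (sum of zeta * p) / zeta(x) for the two actions.

    Both expressions are the ones appearing in the proof, divided by alpha**x1:
    alpha**(x1 - 1) for the impulse, and the weighted average for a_inf.
    """
    alpha = contraction_alpha(rho_b, rho_d, xi)
    rho = rho_b + rho_d
    impulse = 1.0 / alpha
    natural = (rho + x1 / alpha) / (rho + x1)
    return impulse, natural
```

For $x_1 = 1$ the bound for $a_\infty$ holds with equality, which is why a small floating-point tolerance is appropriate when comparing against $\xi$.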
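What remains to be done now is to solve problem (C.5) for the DTMDP model $\{X \cup \{\Delta\}, \{a_\infty, 1\}, p\}$, namely,

$$\text{Minimize } E^{\sigma}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} l_0(X_n, A_{n+1})\right] \text{ over } \sigma \in \Sigma \quad \text{subject to } E^{\sigma}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} l_1(X_n, A_{n+1})\right] \le d_1. \tag{7.29}$$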
In view of Lemma 7.2.1 and Remark C.2.2, we see that this DTMDP model is absorbing, and we may solve the constrained DTMDP problem (C.5) via the convex
optimization problem (C.18). We may apply Theorem C.2.16 to the latter problem, as the conditions therein are satisfied by the underlying absorbing DTMDP model. Note that under the assumption of (7.28), the Slater condition for problem (C.18) is satisfied, e.g., by the occupation measure of such a strategy in the DTMDP that corresponds to not applying any impulse in the original gradual-impulsive control problem.

In greater detail, according to Theorem C.2.16, we shall solve firstly, for each fixed $g \in [0, \infty)$, the following problem for the above DTMDP model:

$$\text{Minimize } E^{\sigma}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} \big(l_0(X_n, A_{n+1}) + g\,l_1(X_n, A_{n+1})\big)\right] - g d_1 \text{ over } \sigma \in \Sigma, \tag{7.30}$$

which is equivalent to

$$\text{Minimize } E^{\sigma}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} \big(l_0(X_n, A_{n+1}) + g\,l_1(X_n, A_{n+1})\big)\right] \text{ over } \sigma \in \Sigma \tag{7.31}$$

as far as optimal control strategies are concerned. The optimal strategy depends on $g \in [0, \infty)$. If we find a pair $(\sigma^*, g^*)$ such that

• $\sigma^*$ solves the above problem for $g = g^*$, with $g^*$ being a solution to the following problem

$$\text{Maximize over } g \in [0, \infty):\quad \inf_{\sigma \in \Sigma}\left\{E^{\sigma}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} \big(l_0(X_n, A_{n+1}) + g\,l_1(X_n, A_{n+1})\big)\right] - g d_1\right\} \tag{7.32}$$

• and

$$E^{\sigma^*}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} l_1(X_n, A_{n+1})\right] = d_1, \tag{7.33}$$

then $\sigma^*$ is a desired optimal strategy for the constrained DTMDP problem (C.5).
7.2.3 Optimal Solution to the Associated DTMDP Problem

Let us firstly solve the DTMDP problem (7.31) for a fixed $g \in [0, \infty)$. We thus will not indicate the dependence on $g$ in this subsection. We do this by investigating its Bellman equation: for each $(x_1, x_2) \in X$,

$$v((x_1, x_2)) = \min\left\{\frac{x_1 + \rho_d v((x_1, x_2 - 1)) + \rho_b v((x_1, x_2 + 1)) + x_1 v((x_1 - 1, x_2))}{\rho_d + \rho_b + x_1},\ g + v((x_1 - 1, x_2))\right\}, \tag{7.34}$$

with the boundary condition $v(\Delta) = 0$. The first (second) term in (7.34) corresponds to $a = a_\infty$ ($a = 1$). Here and below in this subsection, we accept for brevity the notations $v((0, x_2)) \equiv v((x_1, 0)) \equiv v(\Delta)$. Proposition C.2.11 asserts that there exists a unique solution to the Bellman equation (with the boundary condition) out of the class of $\zeta$-bounded functions on $X \cup \{\Delta\}$, where the function $\zeta$ comes from Lemma 7.2.1. The cost functions $l_0$ and $l_1$ are obviously $\zeta$-bounded. The same proposition asserts that any selector of actions attaining the minimum in the Bellman equation (7.34) will be a uniformly optimal strategy for the DTMDP problem (7.31).

Some intuitive reasoning comes before it gets verified below. In view of (7.25) and (7.26), one can assume all susceptibles react to the carriers independently: if there are $x_2 > 0$ carriers, then an individual susceptible dies (or, say, gets infected) after an exponentially distributed time with rate $x_2$. It is natural to expect that if one can obtain the optimal strategy for an individual susceptible, then this strategy should be used for each of the susceptibles. Thus, one expects that the Bellman function satisfies

$$v((x_1, x_2)) = x_1 v((1, x_2)), \quad \forall\, (x_1, x_2) \in X.$$

Therefore, let us investigate the Bellman equation (7.34) when $x_1 = 1$, namely, with the boundary condition $v(\Delta) = 0$ in mind,

$$v((1, x_2)) = \min\left\{\frac{1 + \rho_d v((1, x_2 - 1)) + \rho_b v((1, x_2 + 1))}{\rho_d + \rho_b + 1},\ g\right\}, \quad \forall\, x_2 = 1, 2, \ldots. \tag{7.35}$$
(We will not repeat anymore that in the above Bellman equation, the first term inside the braces corresponds to $a = a_\infty$, and the second corresponds to $a = 1$.) Starting with the boundary condition $v((1, 0)) = v(\Delta) = 0$, the solution $v((1, x_2))$ to the Bellman equation (7.35) coincides with a solution to the difference equation

$$v((1, x_2)) = \frac{1 + \rho_d v((1, x_2 - 1)) + \rho_b v((1, x_2 + 1))}{\rho_d + \rho_b + 1},$$

i.e.,

$$(\rho_d + \rho_b + 1)v((1, x_2)) = 1 + \rho_d v((1, x_2 - 1)) + \rho_b v((1, x_2 + 1)) \tag{7.36}$$
for $x_2 \ge 0$ until the smallest integer $x_2 = c$ such that $v((1, c + 1)) = g$. A particular solution to the difference equation (7.36) is given by $v_p((x_1, x_2)) \equiv 1$, and the characteristic equation of the difference equation (7.36) is

$$\rho_b z^2 - z(\rho_b + \rho_d + 1) + \rho_d = 0. \tag{7.37}$$

Lemma 7.2.2 Equation (7.37) has two real solutions $z_1, z_2$ such that $0 < z_2 < 1 < z_1 < \infty$.

The proof follows from the observation that the left-hand side of (7.37) equals $\rho_d > 0$ when $z = 0$ and $-1$ when $z = 1$.

The general solution to the difference equation (7.36) is of the form

$$w(x_2) = A z_1^{x_2} + B z_2^{x_2} + 1, \quad x_2 = 0, 1, \ldots,$$

with some constants $A$ and $B$. After substituting it into the boundary condition $w(0) = 0$, we have

$$w(x_2) = A(z_1^{x_2} - z_2^{x_2}) - z_2^{x_2} + 1, \quad x_2 = 0, 1, \ldots. \tag{7.38}$$
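As a quick sanity check of Lemma 7.2.2 and of the general solution (7.38), the roots of (7.37) can be computed with the quadratic formula and substituted back into the difference equation (7.36). This is an illustrative sketch only; the function names and the sample rates are assumptions, and $\rho_b > 0$ is taken for granted.

```python
import math


def characteristic_roots(rho_b, rho_d):
    """Roots (z1, z2) of rho_b z^2 - (rho_b + rho_d + 1) z + rho_d = 0, z1 > z2."""
    s = rho_b + rho_d + 1.0
    # Discriminant equals (rho_b - rho_d)^2 + 2 (rho_b + rho_d) + 1 > 0.
    r = math.sqrt(s * s - 4.0 * rho_b * rho_d)
    return (s + r) / (2.0 * rho_b), (s - r) / (2.0 * rho_b)


def w(x2, A, z1, z2):
    """General solution (7.38) satisfying w(0) = 0."""
    return A * (z1 ** x2 - z2 ** x2) - z2 ** x2 + 1.0


def residual(x2, A, rho_b, rho_d):
    """Left-hand side minus right-hand side of (7.36) evaluated at w."""
    z1, z2 = characteristic_roots(rho_b, rho_d)
    lhs = (rho_d + rho_b + 1.0) * w(x2, A, z1, z2)
    rhs = 1.0 + rho_d * w(x2 - 1, A, z1, z2) + rho_b * w(x2 + 1, A, z1, z2)
    return lhs - rhs
```

The residual vanishes for every constant $A$, which reflects that (7.38) is a one-parameter family; the constant is pinned down later by the condition $w(x_2 + 1) = g$.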
We are interested in the first nonnegative integer $x_2$, if it exists at all, such that $w(x_2 + 1) = g$, i.e.,

$$A(z_1^{x_2+1} - z_2^{x_2+1}) - z_2^{x_2+1} + 1 = g.$$

Trivial rearrangement of the above equality yields

$$A = \frac{g - 1 + z_2^{x_2+1}}{z_1^{x_2+1} - z_2^{x_2+1}}.$$

We take $A$ as small as possible. Thus, let us put

$$A^* := \inf_{x_2 = 0, 1, \ldots} \frac{g - 1 + z_2^{x_2+1}}{z_1^{x_2+1} - z_2^{x_2+1}}. \tag{7.39}$$

We expect the (minimal) solution to the Bellman equation (7.35) to satisfy

$$v((1, x_2)) = A^*(z_1^{x_2} - z_2^{x_2}) - z_2^{x_2} + 1, \quad 0 \le x_2 < c^* + 1; \tag{7.40}$$

and the minimum in the Bellman equation (7.35) to be attained at $a_\infty$ for $0 \le x_2 < c^* + 1$, where $c^*$ is the first nonnegative integer such that
$$A^* = \frac{g - 1 + z_2^{c^*+1}}{z_1^{c^*+1} - z_2^{c^*+1}}. \tag{7.41}$$
We put $c^* = \infty$ if such an integer does not exist. The finiteness of $c^*$ as well as the value of $A^*$ depend on the value of the constant $g$. In this connection, we present in the next lemma some relevant properties of the function

$$y \in (0, \infty) \mapsto f(y) = \frac{g - 1 + z_2^{y}}{z_1^{y} - z_2^{y}}. \tag{7.42}$$
Lemma 7.2.3 The following assertions hold for the function $f$ defined in (7.42).
(a) If $g = 0$, the function $f$ is strictly increasing in $y \in (0, \infty)$.
(b) If $g \in (0, 1)$, the function $f$ strictly decreases at the beginning and then strictly increases, and has a unique minimizer over $(0, \infty)$. Moreover, the unique minimizer strictly increases to $\infty$ as $0 < g \uparrow 1$.
(c) If $g \in [1, \infty)$, the function $f$ strictly decreases.
(d) In any case $\lim_{y \to \infty} f(y) = 0$.

The proofs of this and some other technical lemmas are presented in Appendix A.7. Next, we formulate a relevant consequence of Lemma 7.2.3, the definition of $c^*$ and (7.41).

Lemma 7.2.4
(a) If $g \in [0, 1)$, then $c^*$ is finite and equals the first nonnegative integer such that

$$\frac{g - 1 + z_2^{c^*+1}}{z_1^{c^*+1} - z_2^{c^*+1}} \le \frac{g - 1 + z_2^{c^*+2}}{z_1^{c^*+2} - z_2^{c^*+2}}.$$

In particular, if $g = 0$, then $c^* = 0$.
(b) If $g \in [1, \infty)$, then $c^* = \infty$.
(c) In any case, the value of $A^*$ is given by (7.41).

Now we are in a position to verify that the above intuitive arguments indeed lead to the optimal solution to the DTMDP problem (7.31).

Theorem 7.2.1 The following assertions hold regarding the solution to the DTMDP problem (7.31).
(a) Suppose $g \in [0, 1)$, and $c^*$ and $A^*$ are as in Lemma 7.2.4. Then the Bellman function, i.e., the value function of the DTMDP problem (7.31), is given by

$$v((x_1, x_2)) = \begin{cases} x_1\left[1 - z_2^{x_2} + A^*(z_1^{x_2} - z_2^{x_2})\right], & \text{if } 0 \le x_2 \le c^* + 1, \\ x_1 g, & \text{if } x_2 \ge c^* + 1, \end{cases} \tag{7.43}$$
where one can readily verify that the two expressions on the right-hand side indeed coincide when $x_2 = c^* + 1$. (Recall that we put $v(\Delta) = v((0, x_2)) = v((x_1, 0)) \equiv 0$ for each $x_1, x_2 \in \{0, 1, \ldots\}$.) A uniformly optimal deterministic stationary strategy $\varphi^*$ is given by:

$$\varphi^*((x_1, x_2)) = I\{x_2 > c^*\} + a_\infty \cdot I\{x_2 \le c^*\}.$$

If

$$\frac{g + z_2^{c^*+1} - 1}{z_1^{c^*+1} - z_2^{c^*+1}} = \frac{g + z_2^{c^*+2} - 1}{z_1^{c^*+2} - z_2^{c^*+2}}, \tag{7.44}$$

then $\varphi^*((x_1, c^* + 1))$ can be chosen arbitrarily.
(b) If $g \in [1, \infty)$, then the Bellman function is given for each $x_1, x_2 \in \{0, 1, \ldots\}$ by $v((x_1, x_2)) = x_1(1 - z_2^{x_2})$ and a uniformly optimal deterministic stationary strategy is given by $\varphi^*((x_1, x_2)) \equiv a_\infty$.
Here $z_1, z_2$ come from Lemma 7.2.2.

Proof (a) According to Lemma 7.2.1 and the $\zeta$-boundedness of the cost function of the DTMDP problem (7.31) with $\zeta$ coming from Lemma 7.2.1, we may apply Proposition C.2.11. Therefore, this proof is done by verifying that the function defined by (7.43) satisfies the Bellman equation (7.34), and the stated deterministic stationary strategy $\varphi^*$ provides the minimizers in the Bellman equation. It proceeds in two steps.

Step 1. Consider the case of $x_1 = 1$. Suppose $1 \le x_2 \le c^*$. Then $v((1, x_2 - 1))$, $v((1, x_2))$, $v((1, x_2 + 1))$ are all defined by the expression on the top of (7.43) with $x_1 = 1$, that is, by (7.40). Since the latter is obtained from the difference equation (7.36), we see that

$$v((1, x_2)) = \frac{1 + \rho_d v((1, x_2 - 1)) + \rho_b v((1, x_2 + 1))}{\rho_d + \rho_b + 1}.$$

To show that $v((1, x_2))$ satisfies (7.35), it remains to show that $v((1, x_2)) < g$, i.e., $1 - z_2^{x_2} + A^*(z_1^{x_2} - z_2^{x_2}) < g$. Indeed, this inequality is equivalent to $A^*$
$c^* + 1$ is satisfied, and $\varphi^*((1, x_2)) = 1$ is the unique minimizer in the right-hand side of (7.35). This completes the first step.

Step 2. It was already justified in the first step that (7.35) is satisfied by the function $v((1, x_2))$ given by (7.43) at $(1, x_2)$. An application of Lemma B.1.8 with

$$\beta(x_2) \equiv \frac{(x_1 - 1)(\rho_d + \rho_b)}{x_1(\rho_d + \rho_b + 1)}$$

yields that $v((1, x_2))$ satisfies the following equation:

$$v((1, x_2)) = \min\left\{\frac{1 + \rho_d v((1, x_2 - 1)) + \rho_b v((1, x_2 + 1))}{\rho_d + \rho_b + 1};\ g + \frac{(x_1 - 1)(\rho_d + \rho_b)}{x_1(\rho_d + \rho_b + 1)}\big(v((1, x_2)) - g\big)\right\}$$

for all $x_1 = 1, 2, \ldots$. After multiplying by $x_1(\rho_d + \rho_b + 1)$ and adding $x_1(x_1 - 1)v((1, x_2))$ on both sides of the above equation, we obtain

$$x_1(\rho_d + \rho_b + x_1)v((1, x_2)) = \min\big\{x_1 + \rho_d x_1 v((1, x_2 - 1)) + \rho_b x_1 v((1, x_2 + 1)) + x_1(x_1 - 1)v((1, x_2));\ (\rho_d + \rho_b + x_1)g + (\rho_d + \rho_b + x_1)(x_1 - 1)v((1, x_2))\big\},$$

and thus

$$x_1 v((1, x_2)) = \min\left\{\frac{x_1 + \rho_d x_1 v((1, x_2 - 1)) + \rho_b x_1 v((1, x_2 + 1))}{\rho_d + \rho_b + x_1} + \frac{x_1(x_1 - 1)v((1, x_2))}{\rho_d + \rho_b + x_1},\ g + (x_1 - 1)v((1, x_2))\right\}.$$

We see that the function $v((x_1, x_2)) = x_1 v((1, x_2))$ as given by (7.43) satisfies the Bellman equation (7.34). Since it is clearly $\zeta$-bounded with $\zeta$ coming from Lemma 7.2.1, this function is the Bellman function, according to Proposition C.2.11. Moreover, $\varphi^*((x_1, x_2)) \equiv \varphi^*((1, x_2))$ provides the minimum in the Bellman equation, according to Lemma B.1.8. Consequently, it is uniformly optimal by Proposition C.2.11 again.

The case (b) can be proved in the same manner as part (a). One can say that $v$ and $\varphi^*$ are given by the same expressions as in part (a) under $c^* = \infty$ and $A^* = 0$.
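The statement of Theorem 7.2.1(a) for a single susceptible can be explored numerically. The sketch below, with hypothetical function names and an assumed iteration cap, finds $c^*$ by the comparison in Lemma 7.2.4(a), sets $A^* = f(c^* + 1)$ as in (7.41), builds $v((1, x_2))$ from (7.43), and evaluates the right-hand side of the Bellman equation (7.35) so that the two can be compared.

```python
import math


def roots(rho_b, rho_d):
    """Roots z1 > 1 > z2 > 0 of the characteristic equation (7.37)."""
    s = rho_b + rho_d + 1.0
    r = math.sqrt(s * s - 4.0 * rho_b * rho_d)
    return (s + r) / (2.0 * rho_b), (s - r) / (2.0 * rho_b)


def f(y, g, z1, z2):
    """The function f from (7.42)."""
    return (g - 1.0 + z2 ** y) / (z1 ** y - z2 ** y)


def threshold(g, rho_b, rho_d, max_c=200):
    """Return (c_star, A_star) for g in [0, 1), following Lemma 7.2.4(a)."""
    if not 0.0 <= g < 1.0:
        raise ValueError("for g >= 1 the lemma gives c* = infinity")
    z1, z2 = roots(rho_b, rho_d)
    c = 0
    while f(c + 1, g, z1, z2) > f(c + 2, g, z1, z2):
        c += 1
        if c > max_c:  # safety cap assumed for this sketch only
            raise RuntimeError("threshold search exceeded max_c")
    return c, f(c + 1, g, z1, z2)


def v1(x2, g, rho_b, rho_d):
    """Candidate value v((1, x2)) from (7.43); v((1, 0)) = 0."""
    z1, z2 = roots(rho_b, rho_d)
    c, a_star = threshold(g, rho_b, rho_d)
    if x2 <= c + 1:
        return 1.0 - z2 ** x2 + a_star * (z1 ** x2 - z2 ** x2)
    return g


def bellman_rhs(x2, g, rho_b, rho_d):
    """Right-hand side of (7.35): min over a_inf (wait) and 1 (immunize)."""
    wait = (1.0 + rho_d * v1(x2 - 1, g, rho_b, rho_d)
            + rho_b * v1(x2 + 1, g, rho_b, rho_d)) / (rho_d + rho_b + 1.0)
    return min(wait, g)
```

With the assumed rates $\rho_b = 1$, $\rho_d = 2$ and $g = 0.5$, the search stops at $c^* = 1$, so the candidate strategy waits while there is at most one carrier and immunizes once there are two or more.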
7.2.4 The Optimal Solution to the Original Gradual-Impulsive Control Problem

Let us now carry out the plan outlined at the end of Sect. 7.2.2, and solve the constrained DTMDP problem (7.29), from which we deduce an optimal strategy for the original gradual-impulsive control problem. Theorem 7.2.1 provides a solution to the DTMDP problem (7.31), whose optimal value is $v((i_0, j_0))$ with $v$ being defined in Theorem 7.2.1, where an optimal deterministic stationary strategy $\varphi^*$ was also provided. Of course, both the optimal values and optimal strategies presented in Theorem 7.2.1 depend on the value of $g$, which was fixed therein. In the present subsection, we shall indicate the dependence on $g$, and thus will write $v((i_0, j_0), g)$ and $\varphi^*_g$ for the optimal value $v((i_0, j_0))$ and strategy $\varphi^*$ in the DTMDP problem (7.31). Furthermore, we will write $c^*(g)$ for the constant $c^*$ coming from Lemma 7.2.4. The optimal value of the DTMDP problem (7.30) is clearly $v((i_0, j_0), g) - g d_1$, whereas $\varphi^*_g$ remains optimal for this problem.

Let us firstly solve problem (7.32), namely,

$$\text{Maximize over } g \in [0, \infty):\quad v((i_0, j_0), g) - g d_1.$$

Recall that we always assume $0 < d_1 < i_0$, and the two constants $0 < z_2 < 1 < z_1 < \infty$ come from Lemma 7.2.2.
Lemma 7.2.5 Suppose inequalities (7.28) hold. Let $c^*$ be the minimal nonnegative integer such that

$$\frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+2} - z_2^{c^*+2}} \le \frac{d_1}{i_0}, \tag{7.47}$$

which clearly exists. Then there is a constant $g^* \in (0, 1)$ satisfying the equation

$$\frac{g^* + z_2^{c^*+1} - 1}{z_1^{c^*+1} - z_2^{c^*+1}} = \frac{g^* + z_2^{c^*+2} - 1}{z_1^{c^*+2} - z_2^{c^*+2}}, \tag{7.48}$$

which solves problem (7.32), and verifies $c^* = c^*(g^*)$.

The proof is presented in Appendix A.7.

Theorem 7.2.2 Suppose inequalities (7.28) hold. Consider $g^* \in (0, 1)$, $c^*(g^*) = c^*$ from Lemma 7.2.5, and the deterministic stationary strategies $\varphi^*$ and $\psi^*$ given for each $(x_1, x_2) \in X$ by

$$\varphi^*((x_1, x_2)) = I\{x_2 > c^*\} + a_\infty I\{x_2 \le c^*\}, \qquad \psi^*((x_1, x_2)) = I\{x_2 > c^* + 1\} + a_\infty I\{x_2 \le c^* + 1\},$$

whereas $\varphi^*(\Delta)$, $\psi^*(\Delta)$ can be an arbitrarily fixed $a \in \{1, a_\infty\}$. Then the mixture of $\varphi^*$ and $\psi^*$ with the weight $p^* \in [0, 1]$ on $\varphi^*$ is optimal for the constrained DTMDP problem (7.29), where $p^* \in [0, 1]$ comes from the following relation:

$$p^*\, i_0\, \frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+1} - z_2^{c^*+1}} + (1 - p^*)\, i_0\, \frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+2} - z_2^{c^*+2}} = d_1.$$
Proof Having compared (7.44) and (7.48), according to the last assertion in Theorem 7.2.1(a), both $\varphi^*$ and $\psi^*$ are optimal for the DTMDP problem (7.31) with $g = g^*$. Let us define

$$E^{\varphi^*}_{(x_1,x_2)}\left[\sum_{n=0}^{\infty} l_1(X_n, A_{n+1})\right] =: v^{\varphi^*}((x_1, x_2)), \qquad E^{\psi^*}_{(x_1,x_2)}\left[\sum_{n=0}^{\infty} l_1(X_n, A_{n+1})\right] =: v^{\psi^*}((x_1, x_2))$$

for each $(x_1, x_2) \in X$, and adopt $v^{\varphi^*}(\Delta) = v^{\varphi^*}((0, x_2)) = v^{\varphi^*}((x_1, 0)) = 0$ and $v^{\psi^*}(\Delta) = v^{\psi^*}((0, x_2)) = v^{\psi^*}((x_1, 0)) = 0$. Then the function $v^{\varphi^*}$ is the unique $\zeta$-bounded solution $W$ to

$$\left.\begin{aligned} &W((x_1, x_2)) = \frac{\rho_d W((x_1, x_2 - 1)) + \rho_b W((x_1, x_2 + 1))}{\rho_d + \rho_b + x_1} + \frac{x_1 W((x_1 - 1, x_2))}{\rho_d + \rho_b + x_1}, \quad \forall\, x_1 \ge 1,\ x_2 = 1, 2, \ldots, c^*; \\ &W((x_1, x_2)) = 1 + W((x_1 - 1, x_2)), \quad \forall\, x_1 \ge 1,\ x_2 \ge c^* + 1; \\ &W((0, x_2)) = W((x_1, 0)) = 0, \quad \forall\, x_1, x_2 \in \{0, 1, \ldots\}, \end{aligned}\right\} \tag{7.49}$$

where $\zeta$ is as in Lemma 7.2.1. It follows that $W((x_1, x_2)) = x_1$ for $x_2 \ge c^* + 1$. It is easy to solve the above equation when $x_1 = 1$. The solution is given by

$$W((1, x_2)) = \frac{z_1^{x_2} - z_2^{x_2}}{z_1^{c^*+1} - z_2^{c^*+1}}, \quad \forall\, 0 \le x_2 \le c^* + 1,$$

where (and below in this proof) $z_1, z_2$ come from Lemma 7.2.2. It is straightforward to verify that $W((x_1, x_2)) := x_1 W((1, x_2))$, $x_1, x_2 \in \{0, 1, \ldots\}$, satisfies (7.49).
Note that $j_0 \le c^* + 1$, for otherwise, we would have had

$$\frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+2} - z_2^{c^*+2}} \ge 1 > \frac{d_1}{i_0},$$

where the last inequality is by the assumption of (7.28), which is a desired contradiction against Lemma 7.2.5, see (7.47) therein. Hence,

$$v^{\varphi^*}((i_0, j_0)) = i_0\, \frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+1} - z_2^{c^*+1}} > d_1,$$

where the last strict inequality again follows from Lemma 7.2.5.

Similarly, the function $v^{\psi^*}$ is the unique $\zeta$-bounded solution $W$ to

$$W((x_1, x_2)) = \frac{\rho_d W((x_1, x_2 - 1)) + \rho_b W((x_1, x_2 + 1))}{\rho_d + \rho_b + x_1} + \frac{x_1 W((x_1 - 1, x_2))}{\rho_d + \rho_b + x_1}, \quad \forall\, x_1 = 1, 2, \ldots,\ \forall\, x_2 = 1, 2, \ldots, c^* + 1;$$

$$W((x_1, x_2)) = 1 + W((x_1 - 1, x_2)), \quad \forall\, x_1 = 1, 2, \ldots,\ \forall\, x_2 = c^* + 2, c^* + 3, \ldots;$$

$$W((0, x_2)) = W((x_1, 0)) = 0, \quad x_1, x_2 \in \{0, 1, \ldots\},$$

and one can show, as for $v^{\varphi^*}$ in the above, that

$$v^{\psi^*}((i_0, j_0)) = i_0\, \frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+2} - z_2^{c^*+2}} \le d_1,$$
where the last inequality follows from Lemma 7.2.5. Now the mixture of $\varphi^*$ and $\psi^*$ with the weight $p^* \in [0, 1]$ on $\varphi^*$ satisfying

$$p^*\, i_0\, \frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+1} - z_2^{c^*+1}} + (1 - p^*)\, i_0\, \frac{z_1^{j_0} - z_2^{j_0}}{z_1^{c^*+2} - z_2^{c^*+2}} = d_1$$

exists, and is the desired one. According to Proposition C.1.1, there is a strategy, say, $\sigma^*$ replicating this mixture, which is optimal for the DTMDP problem (7.31) with $g = g^*$, and satisfies

$$E^{\sigma^*}_{(i_0,j_0)}\left[\sum_{n=0}^{\infty} l_1(X_n, A_{n+1})\right] = p^* v^{\varphi^*}((i_0, j_0)) + (1 - p^*) v^{\psi^*}((i_0, j_0)) = d_1,$$

i.e., the relation (7.33). Thus, $(\sigma^*, g^*)$ satisfy the requirements presented at the end of Sect. 7.2.2. From the discussions therein, $\sigma^*$ is an optimal strategy for the constrained DTMDP problem (7.29).

Finally, as an immediate consequence of Theorem 7.2.2, we obtain a mixture of two deterministic stationary strategies, which is optimal for the corresponding CTMDP problem with the gradual control only. Hence, by Corollary 7.1.1, the obtained mixture is optimal for the original gradual-impulsive control problem. In other words, the following corollary is valid.

Corollary 7.2.1 Suppose inequalities (7.28) hold. Using the same notations as in Theorem 7.2.2, consider two $\mu$-deterministic stationary strategies in the gradual-impulsive control model, specified by $\{\mu_{\varphi^*}, \varphi^{*(0)}, \varphi^{*(1)}\}$ and $\{\mu_{\psi^*}, \varphi^{*(0)}, \varphi^{*(1)}\}$, see Definition 7.1.2(c), where, for each $(x_1, x_2) \in X$,

$$\mu_{\varphi^*}((x_1, x_2)) = I\{x_2 \le c^*\}, \quad \mu_{\psi^*}((x_1, x_2)) = I\{x_2 \le c^* + 1\}, \quad \varphi^{*(0)}((x_1, x_2)) = 1, \quad \varphi^{*(1)}((x_1, x_2)) = a_\infty.$$

Then the mixture of $\{\mu_{\varphi^*}, \varphi^{*(0)}, \varphi^{*(1)}\}$ and $\{\mu_{\psi^*}, \varphi^{*(0)}, \varphi^{*(1)}\}$ with the weight $p^*$ coming from Theorem 7.2.2 is optimal for the gradual-impulsive control problem (7.5) in the example described in Sect. 7.2.1.

In words, the above corollary asserts the following. Before implementing the control of the epidemic, one has to “flip a coin” and choose with probability $p^*$ (or with probability $1 - p^*$) the following strategy: immunise all the existing susceptibles at once as soon as the number of carriers, $X_2(t)$, is strictly bigger than $c^*$ (or strictly bigger than $c^* + 1$, correspondingly).
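The quantities in Lemma 7.2.5 and Theorem 7.2.2 are explicit enough to compute. The following sketch uses hypothetical names and an assumed iteration cap; it finds $c^*$ from (7.47), the expected numbers of impulses under $\varphi^*$ and $\psi^*$, the mixing weight $p^*$, and the multiplier $g^*$ obtained by solving the equation (7.48), which is linear in $g^*$. It relies on $j_0 \le c^* + 1$, which the proof above establishes, and assumes $j_0 \ge 1$.

```python
import math


def roots(rho_b, rho_d):
    """Roots z1 > 1 > z2 > 0 of the characteristic equation (7.37)."""
    s = rho_b + rho_d + 1.0
    r = math.sqrt(s * s - 4.0 * rho_b * rho_d)
    return (s + r) / (2.0 * rho_b), (s - r) / (2.0 * rho_b)


def optimal_mixture(i0, j0, d1, rho_b, rho_d, max_c=200):
    """Return (c_star, p_star, g_star) under the assumption 0 < d1 < i0."""
    if not 0 < d1 < i0:
        raise ValueError("the nontrivial case (7.28) requires 0 < d1 < i0")
    z1, z2 = roots(rho_b, rho_d)

    def diff(k):
        return z1 ** k - z2 ** k

    # Minimal nonnegative c satisfying (7.47).
    c = 0
    while diff(j0) / diff(c + 2) > d1 / i0:
        c += 1
        if c > max_c:  # safety cap assumed for this sketch only
            raise RuntimeError("search for c* exceeded max_c")

    v_phi = i0 * diff(j0) / diff(c + 1)   # expected impulses under phi*
    v_psi = i0 * diff(j0) / diff(c + 2)   # expected impulses under psi*
    p_star = (d1 - v_psi) / (v_phi - v_psi)

    # (7.48) rewritten as (g - a1) / D1 = (g - a2) / D2 and solved for g.
    a1, a2 = 1.0 - z2 ** (c + 1), 1.0 - z2 ** (c + 2)
    big_d1, big_d2 = diff(c + 1), diff(c + 2)
    g_star = (big_d2 * a1 - big_d1 * a2) / (big_d2 - big_d1)
    return c, p_star, g_star
```

For the assumed data $\rho_b = 1$, $\rho_d = 2$, $i_0 = 10$, $j_0 = 1$, $d_1 = 3$, this gives $c^* = 0$, $p^* = 1/15$ and $g^* = 1/3$: with probability $1/15$ everyone is immunised immediately, and otherwise immunisation is postponed until a second carrier appears.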
7.3 The Discounted Cost Model

7.3.1 α-Discounted Gradual-Impulsive Control Problems

Here we formulate the $\alpha$-discounted gradual-impulsive control problem, where $\alpha \in (0, \infty)$ is the discount factor, as a total undiscounted DTMDP problem with the modified past-dependent one-step cost functions. After that, we will show that the model is equivalent to the undiscounted modified model with killing, using the same ideas as in Sect. 1.3.5.

The state space $\breve{X}$ and action space $\breve{A}$ in this DTMDP are the same as in Sect. 7.1.2, as well as the sample space $\breve{\Omega}$ and the transition probability $\breve{p}$ (see (7.1)). We consider only the strategies introduced in Definition 7.1.1. However, for all $j = 0, 1, \ldots, J$, the cost functions are defined by

$$\breve{l}^{\alpha}_j((\theta, x), \breve{a}, (t, y)) = I\{x \in X\}\left\{\int_{(0,t] \cap \mathbb{R}_+} e^{-\alpha s} c^G_j(x, \rho_s)\,ds + e^{-\alpha \breve{c}} I\{t = \breve{c} < \infty\}\, c^I_j(x, \breve{b})\right\}, \tag{7.50}$$

and the performance functionals are defined by

$$\breve{W}^{\alpha}_j(\sigma, x_0) := \breve{E}^{\sigma}_{x_0}\left[\sum_{n=0}^{\infty} e^{-\alpha T_n}\, \breve{l}^{\alpha}_j(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1})\right], \tag{7.51}$$

where $T_n := \sum_{i=0}^{n} \Theta_i$. As usual, the case of $(c^G)^+$ and $(c^I)^+$ and the case of $(c^G)^-$ and $(c^I)^-$ are considered separately. Note that the one-step cost function $e^{-\alpha T_n}\, \breve{l}^{\alpha}_j(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1})$ depends on the past states $\breve{X}_1 = (\Theta_1, X_1), \ldots, \breve{X}_{n-1} = (\Theta_{n-1}, X_{n-1})$.

Similarly to Sect. 1.3.5, we introduce the following “hat” model with killing. The state space $X$ is extended to $X \cup \{\Delta\}$ with $\Delta \notin X$, and the transition probability is modified as follows. For each bounded measurable function $g$ on

$$\hat{X} := \{(\infty, x_\infty)\} \cup ([0, \infty) \times X) \cup ([0, \infty) \times \{\Delta\})$$

and action $\breve{a} = (\breve{c}, \breve{b}, \rho) \in \breve{A}$,

$$\int_{\hat{X}} g(t, y)\, \hat{p}(dt \times dy|(\theta, x), \breve{a}) := \int_{(0,\breve{c}] \cap \mathbb{R}_+}\left\{\int_X g(t, y)\, \tilde{q}(dy|x, \rho_t) + \alpha g(t, \Delta)\right\} e^{-\int_{(0,t]} q_x(\rho_s)\,ds - \alpha t}\,dt + I\{\breve{c} < \infty\}\, e^{-\int_{(0,\breve{c}]} q_x(\rho_s)\,ds - \alpha \breve{c}} \int_X g(\breve{c}, y)\, Q(dy|x, \breve{b}) \tag{7.52}$$
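for each state $(\theta, x) \in [0, \infty) \times X$; and

$$\int_{\hat{X}} g(t, y)\, \hat{p}(dt \times dy|(\theta, \Delta), \breve{a}) = \int_{\hat{X}} g(t, y)\, \hat{p}(dt \times dy|(\infty, x_\infty), \breve{a}) := g(\infty, x_\infty).$$

The cost functions in the “hat” model with killing are given by the same expression as (7.2), namely,

$$\hat{l}_j((\theta, x), \breve{a}, (t, y)) := I\{x \in X\}\left\{\int_{(0,t] \cap \mathbb{R}_+} c^G_j(x, \rho_s)\,ds + I\{t = \breve{c} < \infty\}\, c^I_j(x, \breve{b})\right\}, \quad 0 \le j \le J,$$

for each $((\theta, x), \breve{a}, (t, y)) \in \hat{X} \times \breve{A} \times \hat{X}$. The expressions for the performance functionals are standard:

$$\hat{W}_j(\sigma, x_0) := \hat{E}^{\sigma}_{x_0}\left[\sum_{n=0}^{\infty} \hat{l}_j(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1})\right], \quad j = 0, 1, \ldots, J.$$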
Both in the original model and in the “hat” model with killing, the actions in the states $(\infty, x_\infty)$ and $(\theta, \Delta)$ do not affect the performance functionals. Note also that $\hat{p}([0, \infty) \times X|(\theta, \Delta), \breve{a}) = 0$. Therefore, one can ignore the histories $\breve{h}_n$ with $\breve{x}_n \notin [0, \infty) \times X$ in both models when considering the strategies. Now the sequences of stochastic kernels $\{\sigma_n\}_{n=1}^{\infty}$ with

$$\sigma_n : \breve{h}_n \in ([0, \infty) \times X) \times \big(([0, \infty] \times A^I) \times ([0, \infty) \times X)\big)^n \to \mathcal{P}(\breve{A}), \quad n \ge 1,$$

as in Definition 7.1.1 define the strategies under consideration in both models. For $\breve{h}_n$ with $\breve{x}_n \notin [0, \infty) \times X$, the kernels $\sigma_n$ in both models can be defined arbitrarily in a measurable manner.

Theorem 7.3.1 For each fixed control strategy $\sigma \in \Sigma^{GI}$, $\breve{W}^{\alpha}_j(\sigma, x_0) = \hat{W}_j(\sigma, x_0)$ $\forall\, j = 0, 1, \ldots, J$.

Proof For brevity, we omit the index $j$ and consider only $[0, \infty]$-valued functions $c^G$ and $c^I$: otherwise, one would apply the reasoning separately to the case of $(c^G)^+$ and $(c^I)^+$ and to the case of $(c^G)^-$ and $(c^I)^-$. Moreover, we assume that the cost functions $c^G$ and $c^I$ are bounded, for, otherwise, one can apply the reasoning to $c^G \wedge N$ and $c^I \wedge N$ and pass to the limit as $N \to \infty$ using the Monotone Convergence Theorem.

Let us prove that, for an arbitrarily fixed $n = 0, 1, \ldots$, the regular conditional expectations satisfy the equality
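$$\breve{E}^{\sigma}_{x_0}\left[e^{-\alpha T_n}\, \breve{l}^{\alpha}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_k = \breve{h}_k = (\breve{x}_0, (\breve{c}_1, \breve{b}_1), \ldots, \breve{x}_k)\right] = I\{x_k \in X\}\, e^{-\alpha t_k} f_k(\breve{h}_k) = e^{-\alpha t_k}\, \hat{E}^{\sigma}_{x_0}\left[\hat{l}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_k = \breve{h}_k = (\breve{x}_0, (\breve{c}_1, \breve{b}_1), \ldots, \breve{x}_k)\right] \tag{7.53}$$

for all $k = 0, 1, \ldots, n$ and some measurable functions $f_k$ of $\breve{h}_k$.

If $x_k \notin X$, we have zeros in the above expressions because inevitably $X_n \notin X$. Therefore, the function $f_k$ can be defined arbitrarily for $\breve{h}_k$ with $x_k \notin X$.

Let $k = n$. Then, for $\breve{h}_n$ such that $x_n \in X$, we have

$$\breve{E}^{\sigma}_{x_0}\left[e^{-\alpha T_n}\, \breve{l}^{\alpha}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_n = \breve{h}_n\right] = e^{-\alpha t_n} \int_{\breve{A}} \int_{\breve{X}} \breve{g}_{x_n,\breve{a}}(t)\, \breve{p}(dt \times dy|(\theta_n, x_n), \breve{a})\, \sigma_{n+1}(d\breve{a}|\breve{h}_n),$$

where $\breve{a} := (\breve{c}, \breve{b}, \rho)$, and

$$\breve{g}_{x_n,\breve{a}}(t) := \breve{l}^{\alpha}((\theta_n, x_n), \breve{a}, (t, y)) = \int_{(0,t] \cap \mathbb{R}_+} e^{-\alpha s} c^G(x_n, \rho_s)\,ds + e^{-\alpha \breve{c}} I\{t = \breve{c} < \infty\}\, c^I(x_n, \breve{b}).$$

According to the definition (7.1) of the transition probability $\breve{p}$, we obtain

$$\begin{aligned} \int_{\breve{X}} \breve{g}_{x_n,\breve{a}}(t)\, \breve{p}(dt \times dy|(\theta_n, x_n), \breve{a}) &= \int_{(0,\breve{c}] \cap \mathbb{R}_+} \breve{g}_{x_n,\breve{a}}(t)\, q_{x_n}(\rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds}\,dt + I\{\breve{c} = \infty\}\, \breve{g}_{x_n,\breve{a}}(\infty)\, e^{-\int_{(0,\infty)} q_{x_n}(\rho_s)\,ds} + I\{\breve{c} < \infty\}\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds}\, \breve{g}_{x_n,\breve{a}}(\breve{c}) \\ &= \int_{(0,\breve{c}] \cap \mathbb{R}_+} \left(\int_{(0,t] \cap \mathbb{R}_+} e^{-\alpha s} c^G(x_n, \rho_s)\,ds\right) q_{x_n}(\rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds}\,dt + I\{\breve{c} = \infty\} \left(\int_{(0,\infty)} e^{-\alpha s} c^G(x_n, \rho_s)\,ds\right) e^{-\int_{(0,\infty)} q_{x_n}(\rho_s)\,ds} \\ &\quad + I\{\breve{c} < \infty\}\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds} \left\{\int_{(0,\breve{c}]} e^{-\alpha s} c^G(x_n, \rho_s)\,ds + e^{-\alpha \breve{c}}\, c^I(x_n, \breve{b})\right\} \\ &= -e^{-\int_{(0,\breve{c}] \cap \mathbb{R}_+} q_{x_n}(\rho_s)\,ds} \int_{(0,\breve{c}] \cap \mathbb{R}_+} e^{-\alpha s} c^G(x_n, \rho_s)\,ds + \int_{(0,\breve{c}] \cap \mathbb{R}_+} e^{-\alpha t} c^G(x_n, \rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds}\,dt \\ &\quad + e^{-\int_{(0,\breve{c}] \cap \mathbb{R}_+} q_{x_n}(\rho_s)\,ds} \int_{(0,\breve{c}] \cap \mathbb{R}_+} e^{-\alpha s} c^G(x_n, \rho_s)\,ds + I\{\breve{c} < \infty\}\, c^I(x_n, \breve{b})\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}} \\ &= \int_{(0,\breve{c}] \cap \mathbb{R}_+} c^G(x_n, \rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds - \alpha t}\,dt + I\{\breve{c} < \infty\}\, c^I(x_n, \breve{b})\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}}, \end{aligned}$$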
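where the third equality is by integrating by parts. Furthermore, again for $\breve{h}_n$ such that $x_n \in X$,

$$\hat{E}^{\sigma}_{x_0}\left[\hat{l}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_n = \breve{h}_n\right] = \int_{\breve{A}} \int_{\hat{X}} \hat{g}_{x_n,\breve{a}}(t)\, \hat{p}(dt \times dy|(\theta_n, x_n), \breve{a})\, \sigma_{n+1}(d\breve{a}|\breve{h}_n),$$

where $\breve{a} := (\breve{c}, \breve{b}, \rho)$, and

$$\hat{g}_{x_n,\breve{a}}(t) := \hat{l}((\theta_n, x_n), \breve{a}, (t, y)) = \int_{(0,t] \cap \mathbb{R}_+} c^G(x_n, \rho_s)\,ds + I\{t = \breve{c} < \infty\}\, c^I(x_n, \breve{b}).$$

According to the definition (7.52) of the transition probability $\hat{p}$, we obtain

$$\begin{aligned} \int_{\hat{X}} \hat{g}_{x_n,\breve{a}}(t)\, \hat{p}(dt \times dy|(\theta_n, x_n), \breve{a}) &= \int_{(0,\breve{c}] \cap \mathbb{R}_+} \left(\int_{(0,t]} c^G(x_n, \rho_s)\,ds\right) \big(q_{x_n}(\rho_t) + \alpha\big)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds - \alpha t}\,dt \\ &\quad + I\{\breve{c} < \infty\}\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}} \left\{\int_{(0,\breve{c}]} c^G(x_n, \rho_s)\,ds + c^I(x_n, \breve{b})\right\} \\ &= -e^{-\int_{(0,\breve{c}] \cap \mathbb{R}_+} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}} \int_{(0,\breve{c}] \cap \mathbb{R}_+} c^G(x_n, \rho_s)\,ds + \int_{(0,\breve{c}] \cap \mathbb{R}_+} c^G(x_n, \rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds - \alpha t}\,dt \\ &\quad + e^{-\int_{(0,\breve{c}] \cap \mathbb{R}_+} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}} \int_{(0,\breve{c}] \cap \mathbb{R}_+} c^G(x_n, \rho_s)\,ds + I\{\breve{c} < \infty\}\, c^I(x_n, \breve{b})\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}} \\ &= \int_{(0,\breve{c}] \cap \mathbb{R}_+} c^G(x_n, \rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds - \alpha t}\,dt + I\{\breve{c} < \infty\}\, c^I(x_n, \breve{b})\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}}, \end{aligned}$$

where the second equality is by integrating by parts. Thus,

$$\int_{\breve{X}} \breve{g}_{x_n,\breve{a}}(t)\, \breve{p}(dt \times dy|(\theta_n, x_n), \breve{a}) = \int_{\hat{X}} \hat{g}_{x_n,\breve{a}}(t)\, \hat{p}(dt \times dy|(\theta_n, x_n), \breve{a}),$$

and equality (7.53) is proved for $k = n$: for all $\breve{h}_n$ with $x_n \in X$,

$$f_n(\breve{h}_n) = \int_{\breve{A}} \left\{\int_{(0,\breve{c}] \cap \mathbb{R}_+} c^G(x_n, \rho_t)\, e^{-\int_{(0,t]} q_{x_n}(\rho_s)\,ds - \alpha t}\,dt + I\{\breve{c} < \infty\}\, c^I(x_n, \breve{b})\, e^{-\int_{(0,\breve{c}]} q_{x_n}(\rho_s)\,ds - \alpha \breve{c}}\right\} \sigma_{n+1}(d\breve{a}|\breve{h}_n).$$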
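Suppose equality (7.53) holds for some $1 \le k \le n$ and consider the case of $k - 1$. For $\breve{h}_{k-1}$ such that $x_{k-1} \in X$, we have

$$\begin{aligned} \breve{E}^{\sigma}_{x_0}\left[e^{-\alpha T_n}\, \breve{l}^{\alpha}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_{k-1} = \breve{h}_{k-1}\right] &= \breve{E}^{\sigma}_{x_0}\left[\breve{E}^{\sigma}_{x_0}\left[e^{-\alpha T_n}\, \breve{l}^{\alpha}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_k\right] \,\middle|\, \breve{H}_{k-1} = \breve{h}_{k-1}\right] \\ &= \breve{E}^{\sigma}_{x_0}\left[I\{X_k \in X\}\, e^{-\alpha T_k} f_k(\breve{H}_k) \,\middle|\, \breve{H}_{k-1} = \breve{h}_{k-1}\right] \\ &= e^{-\alpha t_{k-1}}\, \breve{E}^{\sigma}_{x_0}\left[I\{X_k \in X\}\, e^{-\alpha \Theta_k} f_k(\breve{H}_k) \,\middle|\, \breve{H}_{k-1} = \breve{h}_{k-1}\right] \\ &= e^{-\alpha t_{k-1}} \int_{\breve{A}} \left\{\int_{(0,\breve{c}] \cap \mathbb{R}_+} \int_X e^{-\alpha t} f_k(\breve{h}_{k-1}, (\breve{c}_k, \breve{b}_k), (t, y))\, \tilde{q}(dy|x_{k-1}, \rho_t)\, e^{-\int_{(0,t]} q_{x_{k-1}}(\rho_s)\,ds}\,dt \right. \\ &\quad \left. + I\{\breve{c} < \infty\}\, e^{-\int_{(0,\breve{c}]} q_{x_{k-1}}(\rho_s)\,ds} \int_X e^{-\alpha \breve{c}} f_k(\breve{h}_{k-1}, (\breve{c}_k, \breve{b}_k), (\breve{c}, y))\, Q(dy|x_{k-1}, \breve{b})\right\} \sigma_k(d\breve{c} \times d\breve{b} \times d\rho|\breve{h}_{k-1}). \end{aligned}$$

The last equality is by (7.1). Furthermore, again for $\breve{h}_{k-1}$ such that $x_{k-1} \in X$,

$$\begin{aligned} \hat{E}^{\sigma}_{x_0}\left[\hat{l}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) \,\middle|\, \breve{H}_{k-1} = \breve{h}_{k-1}\right] &= \hat{E}^{\sigma}_{x_0}\left[I\{X_k \in X\}\, f_k(\breve{H}_k) \,\middle|\, \breve{H}_{k-1} = \breve{h}_{k-1}\right] \\ &= \int_{\breve{A}} \left\{\int_{(0,\breve{c}] \cap \mathbb{R}_+} \int_X f_k(\breve{h}_{k-1}, (\breve{c}_k, \breve{b}_k), (t, y))\, \tilde{q}(dy|x_{k-1}, \rho_t)\, e^{-\int_{(0,t]} q_{x_{k-1}}(\rho_s)\,ds - \alpha t}\,dt \right. \\ &\quad \left. + I\{\breve{c} < \infty\}\, e^{-\int_{(0,\breve{c}]} q_{x_{k-1}}(\rho_s)\,ds - \alpha \breve{c}} \int_X f_k(\breve{h}_{k-1}, (\breve{c}_k, \breve{b}_k), (\breve{c}, y))\, Q(dy|x_{k-1}, \breve{b})\right\} \sigma_k(d\breve{c} \times d\breve{b} \times d\rho|\breve{h}_{k-1}) \\ &=: f_{k-1}(\breve{h}_{k-1}). \end{aligned}$$

The last but one equality is by (7.52). Equality (7.53) is now proved for $k - 1$. By induction, equality (7.53) is valid for all $k = 0, 1, \ldots, n$. When $k = 0$, we obtain the desired equality

$$\breve{E}^{\sigma}_{x_0}\left[e^{-\alpha T_n}\, \breve{l}^{\alpha}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1})\right] = \hat{E}^{\sigma}_{x_0}\left[\hat{l}(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1})\right],$$

and the statement of the theorem follows.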
7.3.2 Reduction to DTMDP with Total Cost

According to Theorem 7.3.1, solving the constrained $\alpha$-discounted gradual-impulsive control problem

$$\text{Minimize } \breve{W}^{\alpha}_0(\sigma, x_0) \text{ over } \sigma \in \Sigma^{GI} \text{ such that } \breve{W}^{\alpha}_j(\sigma, x_0) \le d_j,\ j = 1, \ldots, J, \tag{7.54}$$

described in Sect. 7.3.1, is equivalent to solving problem (7.5) for the “hat” model with killing, which, in its turn, under Condition 7.1.1, is equivalent to solving the standard CTMDP problem with gradual control only: see Remarks 7.1.2 and 7.1.3, and observe that, in the “hat” model, $q_x(a) \ge \alpha > 0$ for all $x \in X$, $a \in A^G$. The latter CTMDP model $\mathcal{M}^{GO}$ with gradual control only is defined by the following primitives:

$$X_\Delta := X \cup \{\Delta\}; \qquad A := A^I \cup A^G;$$

$$\tilde{q}^{GO}(\Gamma|x, a) := \begin{cases} \tilde{q}(\Gamma \cap X|x, a) + \alpha I\{\Delta \in \Gamma\} & \forall\, (x, a) \in X \times A^G; \\ Q(\Gamma \cap X|x, a) & \forall\, (x, a) \in X \times A^I; \\ 0, & \text{if } x = \Delta, \end{cases} \qquad \forall\, \Gamma \in \mathcal{B}(X_\Delta);$$

$$c_j(x, a) := \begin{cases} c^G_j(x, a) & \text{for all } (x, a) \in X \times A^G; \\ c^I_j(x, a) & \text{for all } (x, a) \in X \times A^I; \\ 0, & \text{if } x = \Delta; \end{cases} \qquad 0 \le j \le J.$$

Now Condition 4.2.4 is satisfied: $\Delta$ is the costless cemetery, and

$$\inf_{a \in A} q^{GO}_x(a) \ge \min\{\alpha, 1\} > 0$$

for all $x \in X_\Delta \setminus \{\Delta\} = X$. According to the third subsubsection of Sect. 4.2.4, the constrained problem under investigation is equivalent to the DTMDP problem (C.5) looking like

$$\text{Minimize } W^{DT}_0(\sigma, x_0) \text{ over } \sigma \in \Sigma \text{ such that } W^{DT}_j(\sigma, x_0) \le d_j,\ j = 1, \ldots, J, \tag{7.55}$$
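where

• $X_\Delta$ and $A = A^I \cup A^G$ are the state and action spaces;

• $$p(\Gamma|x, a) := \begin{cases} \dfrac{\tilde{q}^{GO}(\Gamma|x, a)}{q^{GO}_x(a)}, & \text{if } x \ne \Delta; \\ \delta_\Delta(\Gamma), & \text{if } x = \Delta, \end{cases} \qquad \forall\, \Gamma \in \mathcal{B}(X_\Delta)$$

is the transition probability;

• $$l_j(x, a) := \begin{cases} \dfrac{c_j(x, a)}{q^{GO}_x(a)}, & \text{if } x \ne \Delta; \\ 0, & \text{if } x = \Delta, \end{cases} \qquad j = 0, 1, \ldots, J$$

are the cost functions, and the performance functionals equal

$$W^{DT}_j(\sigma, x_0) = E^{\sigma}_{x_0}\left[\sum_{n=0}^{\infty} l_j(X_n, A_{n+1})\right], \quad j = 0, 1, \ldots, J.$$

For the obtained constrained DTMDP problem (7.55), all the results from Appendix C are valid. Note that this DTMDP is not induced by a standard $\alpha$-discounted CTMDP (with gradual control only) in the sense of Chap. 4 because for $x \in X$,

$$p(\Delta|x, a) = \begin{cases} \dfrac{\alpha}{q_x(a) + \alpha}, & \text{if } a \in A^G; \\ 0, & \text{if } a \in A^I. \end{cases}$$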
When studying problem (7.55), it is sufficient to restrict to Markov strategies according to Proposition C.1.4. Suppose σ M is an optimal Markov strategy in problem (7.55). Then the Markov standard ξ-strategy M = p M := σ M is optimal in the constrained CTMDP model MG O with gradual controls only. According to Theorem 7.1.1 and Remark 7.1.3, the μ-Markov strategy σ = {σn(0) , F˘n }∞ n=1 defined by (7.10), (7.11), (7.13) and (7.15) solves the constrained problem Minimize Wˆ 0 (σ, x0 ) over σ ∈ G I such that Wˆ j (σ, x0 ) ≤ d j , j = 1, . . . , J
in the “hat” model with killing and is replicated by the Markov standard $\xi$-strategy $\sigma = \{\sigma_n^{(0)}, \tilde{\varphi}_n\}_{n=1}^{\infty}$. Finally, the obtained strategies are optimal in the original constrained problem (7.54) by Theorem 7.3.1.

If the solution to problem (7.55) is obtained in the form of a mixture of deterministic stationary strategies $\varphi_i^*$, then the solution to the original problem (7.54) is provided by the corresponding mixture of $\mu$-deterministic stationary strategies $\{\mu_i, \varphi_i^{(0)}, \varphi_i^{(1)}\}$. Here $p_n^{M_i}(da|x) \equiv \delta_{\varphi_i^*(x)}(da)$. Hence, for the measurable mappings

$$\varphi_i^{(0)}(x) := \begin{cases} \varphi_i^*(x), & \text{if } \varphi_i^*(x) \in A^I; \\ \varphi_i(x), & \text{otherwise}, \end{cases} \qquad \varphi_i^{(1)}(x) := \begin{cases} \varphi_i^*(x), & \text{if } \varphi_i^*(x) \in A^G; \\ \varphi_i(x), & \text{otherwise}, \end{cases}$$

where $\varphi_i : X \to A$ are arbitrary measurable mappings such that

$$\varphi_i(x) \in \begin{cases} A^I, & \text{if } \varphi_i^*(x) \notin A^I; \\ A^G, & \text{if } \varphi_i^*(x) \notin A^G, \end{cases}$$

we have

$$\bar{\varphi}_i(da|x) \equiv \delta_{\varphi_i^{(0)}(x)}(da); \qquad \mu^i(x) = p_{n+1}^{M_i}(A^G|x) \equiv I\{\varphi_i^*(x) \in A^G\} \in \{0, 1\}$$

and $\breve{F}^i(x)(da) \equiv \delta_{\varphi_i^{(1)}(x)}(da)$.
7.3.3 The Dynamic Programming Approach

Below we assume that Condition 7.1.1 is satisfied, which is not restrictive, as was explained immediately below it. In accordance with Sect. 7.3.2, the unconstrained version of problem (7.54)

\[
\text{Minimize } \breve W_0^\alpha(\sigma, x) = \breve E^\sigma_x\left[ \sum_{n=0}^{\infty} \breve l_0^\alpha(\breve X_n, \breve A_{n+1}, \breve X_{n+1}) \right] \text{ over } \sigma \in \Sigma^{GI} \tag{7.56}
\]

is equivalent to problem (C.4), which reads

\[
\text{Minimize } W_0^{DT}(\sigma, x) = E^\sigma_x\left[ \sum_{n=0}^{\infty} l_0(X_n, A_{n+1}) \right] \text{ over } \sigma \in \Sigma \tag{7.57}
\]

for the DTMDP described in Sect. 7.3.2. Below, we omit the index zero for brevity. The common value (Bellman) function is

\[
W^*(x) := \inf_{\sigma \in \Sigma^{GI}} \breve W^\alpha(\sigma, x) = \inf_{\sigma \in \Sigma} W^{DT}(\sigma, x), \quad x \in X.
\]
The initial value x is not fixed, and we are looking for the uniformly optimal strategy as in the following definition.

Definition 7.3.1 A strategy $\sigma^* \in \Sigma^{GI}$ is called uniformly optimal for the α-discounted gradual-impulsive control problem (7.56) if $\breve W^\alpha(\sigma^*, x) = W^*(x)$ for all $x \in X$.

Having in hand a uniformly optimal control strategy for the associated DTMDP, one can construct a uniformly optimal strategy for the α-discounted gradual-impulsive control problem (7.56) as described at the end of Sect. 7.3.2. All these ideas will be illustrated by the example in Sect. 7.3.4.

According to Propositions C.2.1 and C.2.2, if the associated DTMDP is summable (see Definition C.2.1), then the value function $W^*$ is lower semianalytic and satisfies the optimality (Bellman) equation (C.7), which has the form

\[
W(x) = \min\left\{ \inf_{a \in A^G} \left[ \frac{c^G(x,a)}{q_x(a)+\alpha} + \int_X W(y)\, \frac{\tilde q(dy|x,a)}{q_x(a)+\alpha} \right];\ \inf_{b \in A^I} \left[ c^I(x,b) + \int_X W(y)\, Q(dy|x,b) \right] \right\}, \quad x \in X; \tag{7.58}
\]
\[
W(\Delta) = 0.
\]

Lemma 7.3.1 A function $W : X_\Delta \to \mathbb{R}$ satisfies Eq. (7.58) if and only if it satisfies the equation

\[
0 = \min\left\{ \inf_{a \in A^G} \left[ c^G(x,a) + \int_X W(y)\, q(dy|x,a) - \alpha W(x) \right];\ \inf_{b \in A^I} \left[ c^I(x,b) + \int_X W(y)\, Q(dy|x,b) - W(x) \right] \right\}, \quad x \in X; \tag{7.59}
\]
\[
W(\Delta) = 0.
\]

Proof For $x \in X$, Eq. (7.58) can be rewritten in the following way:
\[
0 = \min\left\{ \inf_{a \in A^G} \left[ \frac{c^G(x,a)}{q_x(a)+\alpha} + \int_X W(y)\, \frac{\tilde q(dy|x,a)}{q_x(a)+\alpha} - \frac{q_x(a)+\alpha}{q_x(a)+\alpha}\, W(x) \right];\ \inf_{b \in A^I} \left[ c^I(x,b) + \int_X W(y)\, Q(dy|x,b) - W(x) \right] \right\},
\]
which means that
• either
\[
0 = \inf_{a \in A^G} \frac{1}{q_x(a)+\alpha} \left\{ c^G(x,a) + \int_X W(y)\, \tilde q(dy|x,a) - q_x(a) W(x) - \alpha W(x) \right\}
\iff
0 = \inf_{a \in A^G} \left\{ c^G(x,a) + \int_X W(y)\, \tilde q(dy|x,a) - q_x(a) W(x) - \alpha W(x) \right\}
\]
and
\[
0 \le \inf_{b \in A^I} \left\{ c^I(x,b) + \int_X W(y)\, Q(dy|x,b) - W(x) \right\};
\]
• or
\[
0 \le \inf_{a \in A^G} \frac{1}{q_x(a)+\alpha} \left\{ c^G(x,a) + \int_X W(y)\, \tilde q(dy|x,a) - q_x(a) W(x) - \alpha W(x) \right\}
\iff
0 \le \inf_{a \in A^G} \left\{ c^G(x,a) + \int_X W(y)\, \tilde q(dy|x,a) - q_x(a) W(x) - \alpha W(x) \right\}
\]
and
\[
0 = \inf_{b \in A^I} \left\{ c^I(x,b) + \int_X W(y)\, Q(dy|x,b) - W(x) \right\}.
\]
Since $\int_X W(y)\, \tilde q(dy|x,a) - q_x(a) W(x) = \int_X W(y)\, q(dy|x,a)$, the latter formulae are equivalent to the first equation in (7.59). The proof is complete.

One can recognise in (7.59) the expressions from the usual optimality equations for the CTMDP problem with gradual control only (cf. (3.4)) and for the DTMDP coming to the stage at the impulse moment (cf. (C.7)). Note also that, if $\sup_{(x,a) \in X \times A^G} |c^G(x,a)| < \infty$ and $c^I \ge 0$, then
\[
\sup_{x \in X} |W^*(x)| \le \frac{1}{\alpha} \sup_{(x,a) \in X \times A^G} |c^G(x,a)| < \infty.
\]
This inequality holds for $\breve W^\alpha(\sigma, x)$ for all strategies without impulses, and impulses cannot reduce the performance below $-\frac{1}{\alpha} \sup_{(x,a) \in X \times A^G} |c^G(x,a)|$ because $c^I \ge 0$.

As usual, we call a gradual-impulsive control model positive (negative) if the cost functions $c^G$ and $c^I$, and hence $l$, are $[0, \infty]$-valued ($[-\infty, 0]$-valued, respectively).

Theorem 7.3.2 (a) In a positive model (or negative model, respectively), $W^*$ is the minimal nonnegative (or maximal nonpositive, respectively) lower semianalytic solution to the optimality Eq. (7.58).
(b) Suppose Condition 7.1.2 is satisfied and the model is positive. Then $W^*$ is lower semicontinuous and there exists a μ-deterministic stationary uniformly optimal strategy for problem (7.56). If $\varphi^{(0)}$ and $\varphi^{(1)}$ are measurable mappings from $X$ to $A^I$ and to $A^G$, respectively, satisfying
\[
W^*(x) = \min\left\{ \frac{c^G(x, \varphi^{(1)}(x))}{q_x(\varphi^{(1)}(x)) + \alpha} + \int_X W^*(y)\, \frac{\tilde q(dy|x, \varphi^{(1)}(x))}{q_x(\varphi^{(1)}(x)) + \alpha};\ c^I(x, \varphi^{(0)}(x)) + \int_X W^*(y)\, Q(dy|x, \varphi^{(0)}(x)) \right\}, \tag{7.60}
\]
then $\{\mu, \varphi^{(0)}, \varphi^{(1)}\}$ is a μ-deterministic stationary uniformly optimal strategy for problem (7.56). Here
\[
\mu(x) := \begin{cases} 0, & \text{if } W^*(x) = c^I(x, \varphi^{(0)}(x)) + \int_X W^*(y)\, Q(dy|x, \varphi^{(0)}(x)); \\ 1 & \text{otherwise.} \end{cases}
\]
Proof (a) This assertion follows from Propositions C.2.4(b) and C.2.5(a).
(b) The semicontinuity of the function $W^*$ follows from part (b) of Proposition C.2.8. Moreover, according to Proposition C.2.8(c), in the associated DTMDP, there exists a deterministic stationary uniformly optimal strategy. As explained at the end of Sect. 7.3.2, any deterministic stationary uniformly optimal strategy in the associated DTMDP gives rise to a μ-deterministic stationary uniformly optimal strategy in the original gradual-impulsive control problem. Consequently, for the optimality of the strategy $\{\mu, \varphi^{(0)}, \varphi^{(1)}\}$ described in the statement of this theorem, it is sufficient to note that the deterministic stationary strategy
\[
\varphi^*(x) := \begin{cases} \varphi^{(0)}(x), & \text{if } \mu(x) = 0; \\ \varphi^{(1)}(x), & \text{if } \mu(x) = 1 \end{cases}
\]
is uniformly optimal in the associated DTMDP by Proposition C.2.4(c).
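The optimality equation (7.58) and the rule for $\mu$ in Theorem 7.3.2(b) can be illustrated on a finite toy model. The following is only a sketch under assumptions that are not part of the text: the state and action spaces are finite, the model is positive, and the successive approximations started from $W \equiv 0$ are assumed to converge to $W^*$. All names (`bellman_branches`, `value_iteration`) and all numbers are hypothetical.

```python
import numpy as np

def bellman_branches(W, c_G, q_tilde, c_I, Q, alpha):
    """Gradual and impulsive branches of the right-hand side of (7.58).

    c_G[a, x]        : gradual cost rate c^G(x, a)
    q_tilde[a, x, y] : jump intensity from x to y (y != x) under gradual action a
    c_I[b, x]        : lump-sum cost c^I(x, b) of impulse b
    Q[b, x, y]       : post-impulse distribution Q(y | x, b)
    """
    q_x = q_tilde.sum(axis=2)  # total jump rates q_x(a)
    gradual = ((c_G + q_tilde @ W) / (q_x + alpha)).min(axis=0)
    impulsive = (c_I + Q @ W).min(axis=0)
    return gradual, impulsive

def value_iteration(c_G, q_tilde, c_I, Q, alpha, tol=1e-12, max_iter=10_000):
    """Successive approximations W_{k+1} = min{gradual, impulsive}, W_0 = 0."""
    W = np.zeros(c_G.shape[1])
    for _ in range(max_iter):
        gradual, impulsive = bellman_branches(W, c_G, q_tilde, c_I, Q, alpha)
        W_new = np.minimum(gradual, impulsive)
        converged = np.max(np.abs(W_new - W)) < tol
        W = W_new
        if converged:
            break
    gradual, impulsive = bellman_branches(W, c_G, q_tilde, c_I, Q, alpha)
    # mu(x) = 0 when the impulsive branch attains the minimum, 1 otherwise
    mu = np.where(impulsive <= gradual, 0, 1)
    return W, mu

# Hypothetical two-state model: state 0 is costly and jumps to state 1 at rate 1;
# state 1 is free and absorbing; a single impulse moves to state 1 at cost K.
alpha, K = 1.0, 0.3
c_G = np.array([[1.0, 0.0]])
q_tilde = np.array([[[0.0, 1.0], [0.0, 0.0]]])
c_I = np.array([[K, K]])
Q = np.array([[[0.0, 1.0], [0.0, 1.0]]])
W, mu = value_iteration(c_G, q_tilde, c_I, Q, alpha)
print(W, mu)
```

For these numbers the iteration returns $W = (0.3, 0)$ and $\mu = (0, 1)$: in state 0 the impulse (cost 0.3) beats waiting (cost 0.5), and in state 1 the gradual action is used.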
Under the conditions of Theorem 7.3.2(b), either the optimal impulse $\varphi^{(0)}(x)$ must be applied immediately after a natural jump to the state $x \in X$ (and perhaps followed by an instantaneous sequence of impulses), or one has to apply the optimal gradual control $\varphi^{(1)}(x)$ until the next natural jump. If the Bellman function $W^*$ is finite-valued, the equation in part (b) of Theorem 7.3.2 can be rewritten as
\[
0 = \min\left\{ c^G(x, \varphi^{(1)}(x)) + \int_X W^*(y)\, q(dy|x, \varphi^{(1)}(x)) - \alpha W^*(x);\ c^I(x, \varphi^{(0)}(x)) + \int_X W^*(y)\, Q(dy|x, \varphi^{(0)}(x)) - W^*(x) \right\}:
\]
see Lemma 7.3.1 and its proof.

Theorem 7.3.3 Consider the negative model and let $\sigma = \{\mu, \varphi^{(0)}, \varphi^{(1)}\}$ be a given μ-deterministic stationary strategy in the gradual-impulsive control model with the performance $\breve W^\alpha(\sigma, x)$. Then σ is uniformly optimal for problem (7.56) if
\[
\breve W^\alpha(\sigma, x) = \min\left\{ \inf_{a \in A^G} \left[ \frac{c^G(x,a)}{q_x(a)+\alpha} + \int_X \breve W^\alpha(\sigma, y)\, \frac{\tilde q(dy|x,a)}{q_x(a)+\alpha} \right];\ \inf_{a \in A^I} \left[ c^I(x,a) + \int_X \breve W^\alpha(\sigma, y)\, Q(dy|x,a) \right] \right\}
\]
for all $x \in X$.

Proof According to Theorem 7.3.1, $\breve W^\alpha(\sigma, x) = \hat W(\sigma, x)$ for all $x \in X$. Consider $W_0(\varphi^*, x)$, the performance functional in the CTMDP model $\mathcal{M}^{GO}$ with gradual control only, corresponding to the “hat” model with killing. Here the deterministic stationary strategy $\varphi^*$ is given by
\[
\varphi^*(x) := \begin{cases} \varphi^{(0)}(x), & \text{if } \mu(x) = 0; \\ \varphi^{(1)}(x), & \text{if } \mu(x) = 1. \end{cases}
\]
According to the proof of Theorem 7.1.1 (or to the proof of Theorem 7.1.2), $W_0(\varphi^*, x) = \hat W(\sigma, x)$ and, finally, $W_0(\varphi^*, x) = W^{DT}(\varphi^*, x)$ for all $x \in X$ according to the third subsubsection of Sect. 4.2.4. Therefore, the function $W^{DT}(\varphi^*, \cdot)$, supplemented by the obvious condition $W^{DT}(\varphi^*, \Delta) = 0$, satisfies the Bellman equation (7.58), and hence the strategy $\varphi^*$ is uniformly optimal in the associated DTMDP problem (7.57) by Proposition C.2.5(d). Thus, the μ-deterministic stationary strategy $\sigma = \{\mu, \varphi^{(0)}, \varphi^{(1)}\}$ is uniformly optimal for problem (7.56): see the reasoning at the end of Sect. 7.3.2.

Remark 7.3.1 If the performance functional $\breve W^\alpha$ is finite-valued, the equation in Theorem 7.3.3 can be rewritten as
\[
0 = \min\left\{ \inf_{a \in A^G} \left[ c^G(x,a) + \int_X \breve W^\alpha(\sigma, y)\, q(dy|x,a) - \alpha \breve W^\alpha(\sigma, x) \right];\ \inf_{a \in A^I} \left[ c^I(x,a) + \int_X \breve W^\alpha(\sigma, y)\, Q(dy|x,a) - \breve W^\alpha(\sigma, x) \right] \right\}, \quad x \in X,
\]
according to Lemma 7.3.1.
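The equivalence stated in Lemma 7.3.1 can be checked numerically on a small example. The sketch below uses a hypothetical two-state model with one gradual action and one impulse (all numbers are assumptions chosen for illustration): it verifies that a given $W$ is a fixed point of (7.58) and that the residual of (7.59), written with the signed kernel $q$ whose diagonal entries are $-q_x$, vanishes.

```python
import numpy as np

alpha, K = 1.0, 0.3                        # hypothetical discount rate and impulse cost
c_G = np.array([1.0, 0.0])                 # gradual cost rates in states 0 and 1
q_tilde = np.array([[0.0, 1.0],            # off-diagonal jump intensities
                    [0.0, 0.0]])
q_x = q_tilde.sum(axis=1)                  # total jump rates q_x
q = q_tilde - np.diag(q_x)                 # signed kernel: q({x}|x) = -q_x
c_I = np.array([K, K])                     # impulse cost
Q = np.array([[0.0, 1.0],                  # the impulse moves every state to state 1
              [0.0, 1.0]])

def rhs_7_58(W):
    """Right-hand side of (7.58) for a single gradual action and a single impulse."""
    gradual = (c_G + q_tilde @ W) / (q_x + alpha)
    impulsive = c_I + Q @ W
    return np.minimum(gradual, impulsive)

def residual_7_59(W):
    """Right-hand side of (7.59); it vanishes exactly when W solves (7.59)."""
    gradual = c_G + q @ W - alpha * W
    impulsive = c_I + Q @ W - W
    return np.minimum(gradual, impulsive)

W = np.array([0.3, 0.0])                   # candidate solution
print(rhs_7_58(W), residual_7_59(W))
```

Here `rhs_7_58(W)` reproduces `W`, and `residual_7_59(W)` is the zero vector, in line with the lemma.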
7.3.4 Example: The Inventory Model

7.3.4.1 Case of Zero Set-Up Cost
Example 7.3.1 Consider a store that sells a certain commodity. At any moment the supply is represented by a nonnegative real number $x$. The customers arrive according to a Poisson process with rate $\lambda > 0$ and buy a random amount of the commodity. To be more precise, any customer, independently of the other ones, plans to buy $Z \in (0, \infty)$ units, where $Z$ has the common cumulative distribution function (CDF) $F$. Thus, the demand process is a compound Poisson process. We have ignored the customers who intend to buy 0 units. If $F(0) > 0$, one has simply to adjust the values of $\lambda$ and $F$:
\[
\lambda \to \lambda(1 - F(0)) \quad \text{and} \quad F(z) \to \frac{F(z) - F(0)}{1 - F(0)}.
\]
Assume that
\[
\int_{(0,\infty)} z \, dF(z) < \infty.
\]
There is no backlogging, so that customers cannot buy more than the current supply. Using the notations from Sect. 1.2.4, for $x > 0$,
\[
Q(z|x) = \begin{cases} F(z), & \text{if } z < x; \\ 1, & \text{if } z = x. \end{cases}
\]
Assume the storage space is limited, and the maximum inventory level is $\bar x \in \mathbb{R}_+$. At any moment, if the current inventory level is $x < \bar x$, the manager can order $b \in (0, \bar x - x]$ units, and the replenishment is instantaneous. There is no set-up cost for orders. The holding cost rate of one unit equals $c_h \ge 0$, and selling one unit results in the profit $r \ge 0$. The goal is to minimize the expected total discounted cost under a fixed discount factor $\alpha > 0$.

In this model, impulses mean orders of the commodity. The primitives of the gradual-impulsive control model described in Sect. 7.1.1 are as follows.
• The state space is $X = [0, \bar x]$, representing all possible inventory levels.
• The gradual action space $A^G = \{a^g\}$ is a singleton because the demand is not under control.
• The impulsive action space is $A^I = (0, \bar x]$. If the current state is $x \le \bar x$, then any order of size $b > \bar x - x$ is not feasible. We will avoid introducing the admissible action spaces by suitably defining the consequence of applying an impulse below.
• The running cost rate is given by
\[
c^G(x, a^g) = c_h x - r \lambda \left[ \int_{(0,x]} z \, dF(z) + x(1 - F(x)) \right]. \tag{7.61}
\]
Here, we replaced the lump sum cost at the jump moment (arrival of a customer) by the corresponding cost rate, in accordance with Sect. 1.1.5.
• The transition rate is given by
\[
\int_{[0,\bar x]} f(y)\, q(dy|x, a^g) = \lambda \left[ \int_{(0,x]} f(x - z) \, dF(z) + f(0)(1 - F(x)) - f(x) \right]
\]
for each bounded measurable function $f$ on $[0, \bar x]$. Remember, $F(0) = 0$.
• The consequence of applying an impulse $b \in A^I$ at the state $x < \bar x$ is given by $l(x, b) = (x + b) \wedge \bar x$, i.e. $Q(dy|x, b) = \delta_{l(x,b)}(dy)$.
• The cost function is $c^I(x, b) \equiv 0$ for all $x \in [0, \bar x)$, $b \in A^I$.
• If $x = \bar x$, then impulses are not allowed. One can assign a big enough positive cost $c^I(\bar x, b) \equiv M$. After that, since impulses will never be optimal at the state $\bar x$, the stochastic kernel $Q(\cdot|\bar x, b)$ can be fixed arbitrarily. Below, in order to deal with the negative model, we follow another approach: if an impulse from $A^I = (0, \bar x]$ is applied in the state $\bar x$, the system fails and goes to a costless artificial cemetery, say, $\Delta$ (the same state as in the “hat” model). The cost function $c^I(\bar x, b)$ is still identically zero. Under mild conditions, such actions are not optimal and thus will not be in use. To satisfy Condition 7.1.1, one can split $\Delta$ into two points with the deterministic loop from one to another, as was done in Sect. 7.2.1.

We are in the framework of the gradual-impulsive control problem, and the target is to obtain an optimal control strategy to the unconstrained DTMDP problem (7.57).

Remark 7.3.2 Below, we are not restricted to $c^G(\cdot, a^g)$ in the form of (7.61). Rather, we assume that $c^G(\cdot, a^g)$ is an arbitrary function satisfying the next condition.

Condition 7.3.1 The running cost rate $c^G(\cdot, a^g)$ is measurable, nonpositive, attains its global minimum at $S^* \in [0, \bar x]$ and, if $S^* < \bar x$, is nondecreasing on $[S^*, \bar x]$.

Lemma 7.3.2 The function (7.61) is convex. It satisfies Condition 7.3.1 if it attains its global minimum on $[0, \bar x]$ at $S^*$ and $c^G(\bar x, a^g) \le 0$.

The proof of this lemma is presented in Appendix A.7.

For example, Condition 7.3.1 is satisfied for function (7.61) if the CDF $F$ has continuous density $p(z)$, $c_h < r\lambda$, and $c^G(\bar x, a^g) \le 0$. Indeed, in this case the function $c^G$ is differentiable, hence attains its global minimum at $S^*$ on the compact $[0, \bar x]$;
\[
\frac{d c^G(x, a^g)}{dx} = c_h - r\lambda \int_x^\infty p(z) \, dz,
\]
and $c^G(\cdot, a^g)$ initially decreases from zero. Moreover,
\[
\frac{d^2 c^G(x, a^g)}{dx^2} = r \lambda\, p(x) \ge 0;
\]
hence $c^G(\cdot, a^g)$ is convex and
\[
\forall\, x \in [0, \bar x] \quad c^G(x, a^g) \le \max\{c^G(0, a^g), c^G(\bar x, a^g)\} \le 0.
\]
Finally, if $S^* < \bar x$, then $c^G(\cdot, a^g)$ is nondecreasing on $[S^*, \bar x]$ by Lemma B.2.4.

If Condition 7.3.1 is satisfied, one can expect that the following simple control strategy $\varphi^*$ is optimal in problem (7.57):
\[
\varphi^*(x) = \begin{cases} S^* - x, & \text{if } x < S^*; \\ a^g, & \text{if } x \ge S^*. \end{cases} \tag{7.62}
\]
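For a concrete feel of (7.61) and (7.62), consider the special case of exponentially distributed demand, $F(z) = 1 - e^{-\theta z}$; this choice of $F$, the parameter $\theta$ and the function names below are assumptions made only for this sketch. In that case the bracket in (7.61) equals $(1 - e^{-\theta x})/\theta$, the derivative of $c^G$ is $c_h - r\lambda e^{-\theta x}$, and for $0 < c_h < r\lambda$ the unconstrained minimizer is $\ln(r\lambda/c_h)/\theta$.

```python
import math

def running_cost(x, c_h, r, lam, theta):
    """c^G(x, a^g) from (7.61) when F(z) = 1 - exp(-theta * z)."""
    expected_sale = (1.0 - math.exp(-theta * x)) / theta  # E[min(Z, x)]
    return c_h * x - r * lam * expected_sale

def base_stock_level(c_h, r, lam, theta, x_bar):
    """Minimizer S* of the convex function c^G(., a^g) on [0, x_bar]."""
    if c_h >= r * lam:       # derivative is nonnegative on [0, x_bar]
        return 0.0
    if c_h == 0.0:           # derivative is negative everywhere
        return x_bar
    return min(math.log(r * lam / c_h) / theta, x_bar)

def phi_star(x, s_star):
    """Strategy (7.62): order up to S* below S*, otherwise use the gradual action."""
    return s_star - x if x < s_star else "a_g"

# Hypothetical parameters
c_h, r, lam, theta, x_bar = 1.0, 2.0, 3.0, 1.0, 3.0
s_star = base_stock_level(c_h, r, lam, theta, x_bar)
print(s_star, running_cost(s_star, c_h, r, lam, theta), phi_star(0.5, s_star))
```

With these numbers $S^* = \ln 6$, and $c^G(\bar x, a^g) \le 0$, so Condition 7.3.1 holds for this particular cost rate.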
Indeed, if $x < S^*$, one can always keep the inventory at the best possible level $S^*$ leading to the minimal running cost rate. If $x \ge S^*$, the additional orders only increase the running cost. In Theorem 7.3.4 below we rigorously prove the optimality of the strategy $\varphi^*$ assuming that the value of $S^*$ is fixed.

If for some $S^* < x_1 < x_2 \le \bar x$, $c^G(x_1, a^g) > c^G(x_2, a^g)$, then it can happen that the action $a^g$ is not optimal in the state $x_1$, and it makes sense to apply the impulse $\varphi^*(x_1) = x_2 - x_1$: see Example 7.3.2.

Theorem 7.3.4 Under Condition 7.3.1, the deterministic stationary control strategy (7.62) is uniformly optimal in problem (7.57).

Proof The DTMDP under study has the following primitives (see Sect. 7.3.2):
\[
X_\Delta := [0, \bar x] \cup \{\Delta\}; \qquad A := (0, \bar x] \cup \{a^g\};
\]
\[
p(\Gamma|x,a) := \begin{cases}
\dfrac{\lambda}{\lambda+\alpha} \left[ \displaystyle\int_{(0,x]} I\{x - z \in \Gamma\} \, dF(z) + I\{0 \in \Gamma\}(1 - F(x)) \right] + \dfrac{\alpha}{\lambda+\alpha}\, I\{\Delta \in \Gamma\}, & \text{if } x \in (0, \bar x],\ a = a^g; \\[3mm]
I\{\Delta \in \Gamma\}, & \text{if } x = 0,\ a = a^g, \text{ or if } x = \Delta; \\[1mm]
I\{(x + a) \wedge \bar x \in \Gamma\}, & \text{if } x \in [0, \bar x),\ a \in (0, \bar x]; \\[1mm]
Q(\Gamma|\bar x, a) = I\{\Delta \in \Gamma\}, & \text{if } x = \bar x,\ a \in (0, \bar x],
\end{cases}
\]
and
\[
l(x,a) := \begin{cases}
\dfrac{c^G(x, a^g)}{\lambda+\alpha}, & \text{if } x \in (0, \bar x],\ a = a^g; \\[3mm]
\dfrac{c^G(0, a^g)}{\alpha}, & \text{if } x = 0,\ a = a^g; \\[3mm]
0, & \text{if } x = \Delta \text{ or } a \in (0, \bar x].
\end{cases}
\]
According to Corollary C.2.1, $W^{DT}(\varphi^*, x) = \lim_{n \to \infty} W^n(x)$, where the functions $W^n$ on $X \cup \{\Delta\}$ are defined below. Put $W^n(\Delta) \equiv 0$ for all $n = 0, 1, \dots$.

Suppose $S^* > 0$. Then, for $x \in [0, \bar x]$,
\[
W^0(x) := 0; \qquad
W^{n+1}(x) := \begin{cases}
W^n(S^*), & \text{if } x < S^*; \\[2mm]
\dfrac{c^G(x, a^g)}{\lambda+\alpha} + \dfrac{\lambda}{\lambda+\alpha} \left[ \displaystyle\int_{(0,x]} W^n(x - z) \, dF(z) + W^n(0)(1 - F(x)) \right], & \text{if } x \ge S^*.
\end{cases}
\]
By the induction argument, one can easily show that, for each $n = 0, 1, 2, \dots$, the function $W^n$ is measurable and $0 \ge W^n(x) \ge \frac{c^G(S^*, a^g)}{\alpha}$ for all $x \in [0, \bar x]$. Hence
\[
0 \ge W^{DT}(\varphi^*, x) \ge \frac{c^G(S^*, a^g)}{\alpha}.
\]
Moreover, the function $W^{DT}(\varphi^*, \cdot)$ is measurable as the limit of the decreasing sequence of measurable functions $W^n$. For $x < S^*$,
\[
W^{DT}(\varphi^*, x) = W^{DT}(\varphi^*, S^*) = \frac{1}{\lambda+\alpha} \left[ c^G(S^*, a^g) + \lambda W^{DT}(\varphi^*, S^*) \right].
\]
The second equality is by the Dominated Convergence Theorem, which will be sometimes used without reference. Therefore,
\[
W^{DT}(\varphi^*, S^*) = \frac{c^G(S^*, a^g)}{\alpha},
\]
and $W^{DT}(\varphi^*, x) = \frac{c^G(S^*, a^g)}{\alpha}$ as well.
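Suppose $S^* = 0$. Then
\[
W^{n+1}(x) := \begin{cases}
\dfrac{c^G(x, a^g)}{\lambda+\alpha} + \dfrac{\lambda}{\lambda+\alpha} \left[ \displaystyle\int_{(0,x]} W^n(x - z) \, dF(z) + W^n(0)(1 - F(x)) \right], & \text{if } x > 0; \\[3mm]
\dfrac{c^G(S^*, a^g)}{\alpha}, & \text{if } x = 0.
\end{cases}
\]
Similarly to the above reasoning, the function $W^{DT}(\varphi^*, \cdot)$ is measurable and
\[
0 \ge W^{DT}(\varphi^*, x) \ge \frac{c^G(S^*, a^g)}{\alpha}; \qquad W^{DT}(\varphi^*, S^*) = W^{DT}(\varphi^*, 0) = \frac{c^G(S^*, a^g)}{\alpha}.
\]
By the Dominated Convergence Theorem, the equality
\[
\frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W^{DT}(\varphi^*, x - z) \, dF(z) + W^{DT}(\varphi^*, 0)(1 - F(x)) \right] = W^{DT}(\varphi^*, x)
\]
holds for all $x \in [S^*, \bar x]$ in all cases (including $S^* = 0$; recall that $F(0) = 0$).

We have established the following properties of the function $W^{DT}(\varphi^*, \cdot)$.
(i) $W^{DT}(\varphi^*, \cdot)$ is measurable.
(ii) For all $x \in [0, S^*]$,
\[
W^{DT}(\varphi^*, x) = \frac{c^G(S^*, a^g)}{\alpha} = \min_{y \in [0, \bar x]} W^{DT}(\varphi^*, y).
\]
(iii) For all $x \in [S^*, \bar x]$,
\[
\begin{aligned}
& \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W^{DT}(\varphi^*, x - z) \, dF(z) + W^{DT}(\varphi^*, 0)(1 - F(x)) \right] \\
&\quad = \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x-S^*]} W^{DT}(\varphi^*, x - z) \, dF(z) + \frac{c^G(S^*, a^g)}{\alpha} (1 - F(x - S^*)) \right] \\
&\quad = W^{DT}(\varphi^*, x).
\end{aligned}
\]

Next, we will show that the function $W^{DT}(\varphi^*, \cdot)$ is nondecreasing on the interval $[S^*, \bar x]$ (if $S^* < \bar x$).

Consider the following sequence of functions $\tilde W^n$ on $[0, \bar x]$. If $S^* = 0$, then $\tilde W^n := W^n$, and, otherwise,
\[
\tilde W^0(x) := \begin{cases} \dfrac{c^G(S^*, a^g)}{\alpha}, & \text{if } x < S^*; \\[2mm] 0, & \text{if } x \ge S^*; \end{cases}
\]
\[
\tilde W^{n+1}(x) := \begin{cases}
\dfrac{c^G(S^*, a^g)}{\alpha}, & \text{if } x < S^*; \\[3mm]
\dfrac{c^G(x, a^g)}{\lambda+\alpha} + \dfrac{\lambda}{\lambda+\alpha} \left[ \displaystyle\int_{(0,x]} \tilde W^n(x - z) \, dF(z) + \tilde W^n(0)(1 - F(x)) \right], & \text{if } x \ge S^*.
\end{cases}
\]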
Like previously, by the induction argument, one can easily show that, for each $n = 0, 1, 2, \dots$, the function $\tilde W^n$ is measurable and $\tilde W^n(x) \ge \frac{c^G(S^*, a^g)}{\alpha}$ for all $x \in [0, \bar x]$. Moreover, the sequence $\tilde W^n$ monotonically decreases. Therefore,
\[
\exists\ \tilde W^\infty(x) = \lim_{n \to \infty} \tilde W^n(x) \ge \frac{c^G(S^*, a^g)}{\alpha},
\]
and the function $\tilde W^\infty$ is measurable. Our target is to prove that
\[
W^{DT}(\varphi^*, x) = \tilde W^\infty(x), \quad \forall\, x \in [0, \bar x]. \tag{7.63}
\]
This equality is obviously valid if $S^* = 0$. Consider the case $S^* > 0$. If $x < S^*$, then equality (7.63) is obvious by Property (ii).

Let $x \ge S^*$. Then, by the Dominated Convergence Theorem (also recall that $\tilde W^\infty(u) = \frac{c^G(S^*, a^g)}{\alpha}$ for $u < S^*$),
\[
\begin{aligned}
\tilde W^\infty(x) &= \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} \tilde W^\infty(x - z) \, dF(z) + \tilde W^\infty(0)(1 - F(x)) \right] \\
&= \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x-S^*]} \tilde W^\infty(x - z) \, dF(z) + \frac{c^G(S^*, a^g)}{\alpha} (F(x) - F(x - S^*)) + \frac{c^G(S^*, a^g)}{\alpha} (1 - F(x)) \right] \\
&= \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x-S^*]} \tilde W^\infty(x - z) \, dF(z) + \frac{c^G(S^*, a^g)}{\alpha} (1 - F(x - S^*)) \right],
\end{aligned}
\]
and the function $W^{DT}(\varphi^*, \cdot)$ satisfies the same equation for $x \ge S^*$ by Property (iii). The operator $B$ defined by
\[
B \circ u(x) := \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha}\, \frac{c^G(S^*, a^g)}{\alpha}\, (1 - F(x - S^*)) + \frac{\lambda}{\lambda+\alpha} \int_{(0,x-S^*]} u(x - z) \, dF(z)
\]
is contracting in the space of bounded measurable functions on $[S^*, \bar x]$ with the uniform norm $\|u\| := \sup_{x \in [S^*, \bar x]} |u(x)|$:
\[
\|B \circ u_1 - B \circ u_2\| \le \sup_{x \in [S^*, \bar x]} \frac{\lambda}{\lambda+\alpha} \int_{(0,x-S^*]} |u_1(x - z) - u_2(x - z)| \, dF(z) \le \frac{\lambda}{\lambda+\alpha} \|u_1 - u_2\|.
\]
Therefore, the bounded measurable functions $W^{DT}(\varphi^*, \cdot)$ (see Property (i)) and $\tilde W^\infty(\cdot)$ coincide on $[S^*, \bar x]$, and equality (7.63) is proved.

Using equality (7.63), let us prove that the function $W^{DT}(\varphi^*, \cdot)$ is nondecreasing on $[S^*, \bar x]$, if $S^* < \bar x$. To do this, we will prove by induction that for all $n = 0, 1, 2, \dots$, the function $\tilde W^n$ is nondecreasing on $[S^*, \bar x]$. Clearly, the function $\tilde W^0 \equiv 0$ on $[S^*, \bar x]$ exhibits this property. Suppose $\tilde W^n$ is nondecreasing on $[S^*, \bar x]$ for some $n = 0, 1, \dots$ and consider the case of $n + 1$. For $S^* \le x < x + y \le \bar x$, we have
\[
\begin{aligned}
\tilde W^{n+1}(x + y) - \tilde W^{n+1}(x) ={}& \frac{c^G(x + y, a^g)}{\lambda+\alpha} - \frac{c^G(x, a^g)}{\lambda+\alpha} \\
&+ \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x+y]} \tilde W^n(x + y - z) \, dF(z) - \int_{(0,x]} \tilde W^n(x - z) \, dF(z) \right] \\
&+ \frac{\lambda}{\lambda+\alpha}\, \tilde W^n(0) (F(x) - F(x + y)).
\end{aligned}
\]
The first term here is nonnegative by Condition 7.3.1. Note that, in any case ($S^* > 0$ and $S^* = 0$), $\tilde W^n(0) = \frac{c^G(S^*, a^g)}{\alpha}$. The second and the third terms, up to the multiplier $\frac{\lambda}{\lambda+\alpha}$, equal
\[
\begin{aligned}
& \int_{(0,x+y-S^*]} \tilde W^n(x + y - z) \, dF(z) - \int_{(0,x-S^*]} \tilde W^n(x - z) \, dF(z) \\
&\quad + \int_{(x+y-S^*,x+y]} \tilde W^n(x + y - z) \, dF(z) - \int_{(x-S^*,x]} \tilde W^n(x - z) \, dF(z) + \frac{c^G(S^*, a^g)}{\alpha} (F(x) - F(x + y)) \\
&= \int_{(0,x-S^*]} \left( \tilde W^n(x + y - z) - \tilde W^n(x - z) \right) dF(z) + \int_{(x-S^*,x+y-S^*]} \tilde W^n(x + y - z) \, dF(z) \\
&\quad + \frac{c^G(S^*, a^g)}{\alpha} \left[ (F(x + y) - F(x + y - S^*)) - (F(x) - F(x - S^*)) + F(x) - F(x + y) \right] \\
&\ge \frac{c^G(S^*, a^g)}{\alpha} \left[ F(x + y - S^*) - F(x - S^*) + F(x - S^*) - F(x + y - S^*) \right] = 0.
\end{aligned}
\]
Here, the first equality holds because, for $u < S^*$, we have $\tilde W^n(u) \equiv \frac{c^G(S^*, a^g)}{\alpha}$, and the inequality holds by the inductive supposition and the established inequality $\tilde W^n(u) \ge \frac{c^G(S^*, a^g)}{\alpha}$ for all $u \in [0, \bar x]$.

The inequality $\tilde W^{n+1}(x + y) - \tilde W^{n+1}(x) \ge 0$ is proved. Thus, the function $W^{DT}(\varphi^*, \cdot) = \lim_{n \to \infty} \tilde W^n(\cdot)$ exhibits the following property.
(iv) The function $W^{DT}(\varphi^*, \cdot)$ is nondecreasing on $[S^*, \bar x]$ (if $S^* < \bar x$), and hence nondecreasing on the whole interval $[0, \bar x]$ by Property (ii).

If $c^G(S^*, a^g) = 0$, then $c^G(x, a^g) \equiv 0$ and thus $l(x, a) \equiv 0$, and all strategies are equally optimal because, for each strategy $\sigma \in \Sigma$, $W^{DT}(\sigma, x) \equiv 0$. Below, we assume that $c^G(S^*, a^g) < 0$.

Let us show by contradiction that $W^{DT}(\varphi^*, \bar x) < 0$. If $W^{DT}(\varphi^*, \bar x) = 0$, then $S^* < \bar x$ because $W^{DT}(\varphi^*, S^*) = \frac{c^G(S^*, a^g)}{\alpha} < 0$. We define
\[
y := \inf\{x \in [0, \bar x] : W^{DT}(\varphi^*, x) = 0\}.
\]
Recall that the nonpositive function $W^{DT}(\varphi^*, \cdot)$ is nondecreasing on $[S^*, \bar x]$, so that $W^{DT}(\varphi^*, u) = 0$ for all $u > y$. Clearly, $y \ge S^*$ because $W^{DT}(\varphi^*, x) \equiv \frac{c^G(S^*, a^g)}{\alpha} < 0$ for all $x \in [0, S^*]$.

Suppose $y < \bar x$. We fix an arbitrary point $u \in (y, \bar x]$ satisfying the condition $F(u - y) < \frac{1}{2}$. Such a point exists because $\lim_{z \to 0+} F(z) = 0$. Now, on the one hand, $W^{DT}(\varphi^*, u) = 0$. On the other hand, since $u > y \ge S^*$,
\[
\begin{aligned}
W^{DT}(\varphi^*, u) &= \frac{c^G(u, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,u]} W^{DT}(\varphi^*, u - z) \, dF(z) + W^{DT}(\varphi^*, 0)(1 - F(u)) \right] \\
&\le \frac{\lambda}{\lambda+\alpha} \left[ \int_{(u-y,u]} W^{DT}(\varphi^*, u - z) \, dF(z) + W^{DT}(\varphi^*, 0) \int_{(u,\infty)} dF(z) \right].
\end{aligned}
\]
The equality holds by Property (iii). Recall that $W^{DT}(\varphi^*, \cdot) \le 0$. Having in mind that $W^{DT}(\varphi^*, x) < 0$ for all $x < y$, $W^{DT}(\varphi^*, 0) = \frac{c^G(S^*, a^g)}{\alpha} < 0$, and $1 - F(u - y) > \frac{1}{2}$, we conclude that $W^{DT}(\varphi^*, u) < 0$. The obtained contradiction proves that $W^{DT}(\varphi^*, \bar x) < 0$.

If $y = \bar x$, then a similar reasoning is valid for $u = y$. The following property is also proved.

(v) $W^{DT}(\varphi^*, \bar x) < 0$.

Let us show that the function $W^{DT}(\varphi^*, \cdot)$ satisfies the Bellman equation (7.58), which we rewrite in the following equivalent form. For each $x \in (0, \bar x]$,
\[
\left.
\begin{aligned}
&\text{either } W(x) = \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W(x - z) \, dF(z) + W(0)(1 - F(x)) \right] \\
&\quad \text{and } W((x + b) \wedge \bar x) \ge W(x) \text{ for all } b \in (0, \bar x] \\
&\quad (W(\Delta) = 0 \ge W(\bar x) \text{ in case } x = \bar x), \\
&\text{or } W(x) = \inf_{b \in (0, \bar x]} \{W((x + b) \wedge \bar x)\} \\
&\quad (W(\bar x) = W(\Delta) = 0 \text{ in case } x = \bar x) \\
&\quad \text{and } \frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W(x - z) \, dF(z) + W(0)(1 - F(x)) \right] \ge W(x).
\end{aligned}
\right\} \tag{7.64}
\]
For $x = 0$,
\[
\text{either } W(0) = \frac{c^G(0, a^g)}{\alpha} \text{ and } W(b) \ge W(0) \text{ for all } b \in (0, \bar x], \quad \text{or } W(0) = \inf_{b \in (0, \bar x]} \{W(b)\} \text{ and } \frac{c^G(0, a^g)}{\alpha} \ge W(0).
\]
When $x > 0$, the function $W^{DT}(\varphi^*, \cdot)$ satisfies these relations because of the following. If $x < S^*$, then, according to Property (ii),
\[
W^{DT}(\varphi^*, x) = \frac{c^G(S^*, a^g)}{\alpha} = W^{DT}(\varphi^*, x + (S^* - x)) = \min_{u \in [0, \bar x]} W^{DT}(\varphi^*, u) = \inf_{b \in (0, \bar x]} W^{DT}(\varphi^*, (x + b) \wedge \bar x).
\]
Moreover, again by Property (ii),
\[
\frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W^{DT}(\varphi^*, x - z) \, dF(z) + W^{DT}(\varphi^*, 0)(1 - F(x)) \right] \ge \frac{c^G(S^*, a^g)}{\alpha} = W^{DT}(\varphi^*, x).
\]
If $x \ge S^*$, then, according to Property (iii),
\[
\frac{c^G(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W^{DT}(\varphi^*, x - z) \, dF(z) + W^{DT}(\varphi^*, 0)(1 - F(x)) \right] = W^{DT}(\varphi^*, x).
\]
Moreover, if $x < \bar x$, then
\[
W^{DT}(\varphi^*, (x + b) \wedge \bar x) \ge W^{DT}(\varphi^*, x)
\]
for all $b \in (0, \bar x]$ by Property (iv). If $x = \bar x$, according to Property (v),
\[
W^{DT}(\varphi^*, \bar x) < 0 = W^{DT}(\varphi^*, \Delta).
\]
(By the way, it follows from the above inequality that in the nontrivial case of $c^G(S^*, a^g) < 0$, the Property (v) is valid and the non-admissible impulses from $A^I = (0, \bar x]$ are never optimal in the state $\bar x$.)

When $x = 0$, according to Property (ii),
\[
W^{DT}(\varphi^*, 0) = \frac{c^G(S^*, a^g)}{\alpha} \le W^{DT}(\varphi^*, b) \quad \forall\, b \in (0, \bar x].
\]
Since the function $W^{DT}(\varphi^*, \cdot)$ satisfies the Bellman equation (7.58), the strategy $\varphi^*$ is uniformly optimal in problem (7.57) by Proposition C.2.5(d).
Suppose Condition 7.3.1 is satisfied. According to Sect. 7.3.3, the deterministic stationary strategy $\varphi^*$ gives rise to the following μ-deterministic stationary uniformly optimal strategy $\sigma = \{\mu, \varphi^{(0)}, \varphi^{(1)}\}$ in the considered example (see also the end of Sect. 7.3.2):
\[
\mu(x) = I\{x \ge S^*\}; \qquad
\varphi^{(0)}(x) = \begin{cases} S^* - x, & \text{if } x < S^*; \\ \varphi'(x) & \text{otherwise;} \end{cases} \qquad
\varphi^{(1)}(x) = \begin{cases} a^g, & \text{if } x \ge S^*; \\ \varphi'(x) & \text{otherwise,} \end{cases}
\]
where $\varphi' : X \to A$ is an arbitrary measurable mapping such that
\[
\varphi'(x) \in \begin{cases} A^I, & \text{if } x \ge S^*; \\ A^G, & \text{if } x < S^*. \end{cases}
\]
In words, if the supply $x < S^*$, then the impulse of the size $S^* - x$ should be applied immediately (with probability $1 - \mu(x) = 1$), to fill the supply up to $S^*$. The artificial gradual action $\varphi'(x)$ is applied with probability $\mu(x) = 0$. If $x \ge S^*$, then wait until the demand reduces the inventory level to below $S^*$, that is, apply the gradual action $a^g$ with probability $\mu(x) = 1$. The artificial impulse $\varphi'(x)$ is applied with probability $1 - \mu(x) = 0$.

The next example shows that, if the running cost rate $c^G(\cdot, a^g)$ can decrease on $[S^*, \bar x]$, then it can happen that the prescribed base-stock control strategy $\varphi^*$ is not uniformly optimal. Recall that we are not restricted to $c^G(\cdot, a^g)$ in the form of (7.61), see Remark 7.3.2.

Example 7.3.2 Consider the following running cost rate:
\[
c^G(x, a^g) = \begin{cases}
0, & \text{if } x \in [0, x_0); \\
-3, & \text{if } x \in [x_0, S^*]; \\
-1, & \text{if } x \in (S^*, x_1); \\
-2, & \text{if } x \in [x_1, \bar x],
\end{cases}
\]
where $0 < x_0 < S^* < x_1 < x_2 < \bar x$. Assume that $F(x) = I\{x \ge \bar x\}$, meaning that each customer buys all the available commodity. Calculations presented in the proof of Theorem 7.3.4 lead to the expressions
\[
\begin{aligned}
W^{DT}(\varphi^*, x) &= \frac{-3}{\alpha} && \forall\, x \in [0, S^*]; \\
W^{DT}(\varphi^*, x) &= \frac{-1}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha}\, W^{DT}(\varphi^*, 0) = \frac{-\alpha - 3\lambda}{\alpha(\lambda+\alpha)} && \forall\, x \in (S^*, x_1); \\
W^{DT}(\varphi^*, x) &= \frac{-2}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha}\, W^{DT}(\varphi^*, 0) = \frac{-2\alpha - 3\lambda}{\alpha(\lambda+\alpha)} && \forall\, x \in [x_1, \bar x].
\end{aligned}
\]
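The three values in Example 7.3.2 can be reproduced with exact rational arithmetic. The sketch below uses hypothetical parameters $\lambda = 1$, $\alpha = 1/2$ (any positive values would do) and a hypothetical helper name; it also compares the value on $(S^*, x_1)$ with the value on $[x_1, \bar x]$, which is what a costless impulse to $x_1$ would yield.

```python
from fractions import Fraction

def base_stock_values(lam, alpha):
    """Values of W^DT(phi*, x) on the three regions of Example 7.3.2."""
    w_low = Fraction(-3) / alpha                            # x in [0, S*]
    w_mid = (Fraction(-1) + lam * w_low) / (lam + alpha)    # x in (S*, x1)
    w_high = (Fraction(-2) + lam * w_low) / (lam + alpha)   # x in [x1, x_bar]
    return w_low, w_mid, w_high

lam, alpha = Fraction(1), Fraction(1, 2)    # hypothetical parameters
w_low, w_mid, w_high = base_stock_values(lam, alpha)
print(w_low, w_mid, w_high)

# A costless impulse from x in (S*, x1) to x1 yields w_high, which is strictly
# smaller than w_mid, so the base-stock strategy is not optimal there.
print(w_high < w_mid)
```

For these parameters the values are $-6$, $-14/3$ and $-16/3$, and the last comparison is true for every $\lambda, \alpha > 0$ because the two expressions differ by $-1/(\lambda+\alpha)$.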
This function $W^{DT}(\varphi^*, \cdot)$ does not satisfy the Bellman equation (7.58) at $x \in (S^*, x_1)$ because
\[
\inf_{b \in (0, \bar x]} \{W^{DT}(\varphi^*, (x + b) \wedge \bar x)\} = W^{DT}(\varphi^*, x + (x_1 - x)) < W^{DT}(\varphi^*, x).
\]

0. When ordering $b \in A^I = (0, \bar x]$ $x \in X$, in fact, the effective order equals $b \wedge (\bar x - x)$, and the price for it equals $c^O \cdot [b \wedge (\bar x - x)]$, where $c^O \ge 0$ is a constant. Finally, the constant $r \ge 0$ equals the reward (not profit) for selling one unit of the commodity.

The primitives of the gradual-impulsive control model remain the same apart from
\[
c^I(x, b) = K + c^O \cdot [b \wedge (\bar x - x)] \quad \forall\, x \in X,\ b \in A^I, \qquad p(\Gamma|\bar x, a) = I\{\bar x \in \Gamma\} \quad \forall\, a \in A^I.
\]
Formally speaking, we allow impulses in the state $\bar x$, but they will never be optimal because in the associated DTMDP they result in the positive cost $K$, without changing the state.

Like in the previous subsubsection, below we assume that $c^G(\cdot, a^g)$ is a fairly arbitrary function, for instance, given by Eq. (7.61). We will investigate problem (7.57) for the DTMDP model described at the beginning of the proof of Theorem 7.3.4, but under different nonnegative cost functions.

Condition 7.3.2 The function $c^G(\cdot, a^g)$ is bounded, nonnegative and lower semicontinuous.
The first version of the DTMDP under study, denoted by $\mathcal{M}^+$, has the following cost function:
\[
l(x, a) = \begin{cases}
\dfrac{c^G(x, a^g)}{\lambda+\alpha}, & \text{if } x \in X,\ a = a^g; \\[3mm]
K + c^O \cdot [a \wedge (\bar x - x)], & \text{if } x \in X,\ a \in A^I; \\[1mm]
0, & \text{if } x = \Delta.
\end{cases}
\]
The corresponding Bellman function is denoted by $W^{DT*}$.

Lemma 7.3.3 Suppose Condition 7.3.2 is satisfied. Then the following assertions hold true for the model $\mathcal{M}^+$.
(a) There exists a deterministic stationary control strategy $\varphi$, uniformly optimal in problem (7.57). Moreover, the Bellman function $W^{DT*}$ is bounded and lower semicontinuous.
(b) If, for a uniformly optimal deterministic stationary strategy $\varphi$ in $\mathcal{M}^+$, we have $\varphi(0) \in A^I$, then, in the DTMDP model $\mathcal{M}^+$ with the corrected cost function
\[
l(0, a^g) = \frac{c^G(0, a^g)}{\alpha},
\]
the Bellman function coincides with $W^{DT*}$ and the strategy $\varphi$ is also uniformly optimal.

The proofs of this and the other lemmas are presented in Appendix A.7.

Let us introduce the following function:
\[
\begin{aligned}
h(x, a^g) &:= c^G(x, a^g) + c^O x(\lambda + \alpha) - c^O \lambda x F(x) + \lambda c^O \int_{(0,x]} z \, dF(z) \\
&= c^G(x, a^g) + c^O x \alpha + \lambda c^O \int_{(0,x]} z \, dF(z) + c^O \lambda x [1 - F(x)] \ge 0.
\end{aligned}
\]
We consider the version of the DTMDP model as at the beginning of the proof of Theorem 7.3.4 with the cost function
\[
l(x, a) = \begin{cases}
\dfrac{h(x, a^g)}{\lambda+\alpha}, & \text{if } x \in X,\ a = a^g; \\[3mm]
K, & \text{if } x \in X,\ a \in A^I; \\[1mm]
0, & \text{if } x = \Delta.
\end{cases}
\]
Denote this model by $\mathcal{M}_h$. The corresponding optimality equation is as follows:
\[
W(x) = \min\left\{ \frac{h(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} W(x - z) \, dF(z) + W(0)(1 - F(x)) \right];\ \inf_{b \in (0, \bar x]} \{K + W((x + b) \wedge \bar x)\} \right\} =: T_1 \circ W(x). \tag{7.65}
\]
The Bellman function in the model $\mathcal{M}_h$ is denoted by $W^{DTh}$. It coincides with the minimal nonnegative lower semianalytic solution to Eq. (7.65) by Proposition C.2.4(b).

Lemma 7.3.4 Suppose Condition 7.3.2 is satisfied and the CDF $F$ is continuous. Then the following assertions hold true.
(a) The Bellman function $W^{DT*}$ in the model $\mathcal{M}^+$ and the Bellman function $W^{DTh}$ in the model $\mathcal{M}_h$ satisfy the equality $W^{DT*}(x) + c^O \cdot x = W^{DTh}(x)$ for all $x \in X$.
(b) A deterministic stationary strategy $\varphi$ is uniformly optimal (in problem (7.57)) for the model $\mathcal{M}^+$ if and only if it is uniformly optimal for the model $\mathcal{M}_h$.

Corollary 7.3.1 If all the conditions of Lemma 7.3.4 are satisfied, then the Bellman function $W^{DTh}$ in the model $\mathcal{M}_h$ is lower semicontinuous and bounded.

Proof The statement follows directly from Lemmas 7.3.3(a) and 7.3.4(a). A direct proof appears at the beginning of the proof of Lemma 7.3.4.

Let us introduce the following operator in the space of measurable bounded functions on $X$ with the uniform norm $\|V\| := \sup_{x \in X} |V(x)|$:
\[
T_2 \circ V(x) := \min\left\{ G \circ V(x);\ K + \inf_{b \in (0, \bar x]} G \circ V((x + b) \wedge \bar x) \right\},
\]
where
\[
G \circ V(x) := \frac{h(x, a^g)}{\lambda+\alpha} + \frac{\lambda}{\lambda+\alpha} \left[ \int_{(0,x]} V(x - z) \, dF(z) + V(0)(1 - F(x)) \right].
\]
Lemma 7.3.5 The operator $T_2$ is a contraction.

Below, $V$ is the unique bounded measurable solution to the equation $V = T_2 \circ V$.

Lemma 7.3.6 Suppose Condition 7.3.2 is satisfied and the CDF $F$ is continuous. Then the following assertions hold true.
(a) On the space $X$, the Bellman function $W^{DTh}$ coincides with $V$.
(b) A measurable mapping $\varphi : X \to A$ defines a uniformly optimal deterministic stationary control strategy in the model $\mathcal{M}_h$ if and only if it satisfies the following requirements for each $x \in X$:
\[
\begin{aligned}
&\text{if } \varphi(x) = a^g, \text{ then } V(x) = G \circ V(x); \\
&\text{if } \varphi(x) \in (0, \bar x] = A^I, \text{ then } V(x) = K + G \circ V((x + \varphi(x)) \wedge \bar x).
\end{aligned} \tag{7.66}
\]

Condition 7.3.3
(a) The function $h(\cdot, a^g)$ is nonnegative, continuous and hence bounded on $X = [0, \bar x]$.
(b) The global minimum on $[0, \bar x]$ of the function $h(\cdot, a^g)$ is attained at some point $S > 0$; $h(\cdot, a^g)$ is nonincreasing on $[0, S]$, and nondecreasing on $[S, \bar x]$. Additionally, $h(x, a^g) > h(S, a^g)$ for all $0 \le x < S$.
(c) $s := \max\{x < S : h(x, a^g) = h(S, a^g) + (\lambda + \alpha)K\} > 0$;
\[
S^+ := \begin{cases} \min\{x > S : h(x, a^g) = h(S, a^g) + \lambda K\}, & \text{if } h(\bar x, a^g) > h(S, a^g) + \lambda K; \\ \bar x, & \text{otherwise.} \end{cases}
\]
(d) $F(z) = 0$ for all $z < S^+ - s$.

Requirements (a)–(c) of the function $h(\cdot, a^g)$ are not very restrictive and are fulfilled for a big class of convex functions. Condition (d) means that the demand size is big enough. The case $S = \bar x$ is not excluded: the proof of the next lemma is only simplified.

Lemma 7.3.7 Suppose Condition 7.3.3 is satisfied. Then the mapping $\varphi$ from $X$ to $A$ defined as
\[
\varphi(x) := \begin{cases} S - x, & \text{if } x < s; \\ a^g, & \text{if } x \ge s \end{cases}
\]
satisfies requirements (7.66).

According to Sect. 7.3.3, the deterministic stationary strategy $\varphi$ in Lemma 7.3.7 gives rise to the following μ-deterministic stationary strategy $\sigma^* = \{\mu, \varphi^{(0)}, \varphi^{(1)}\}$ (see also the end of Sect. 7.3.2):
\[
\mu(x) = I\{x \ge s\}; \qquad
\varphi^{(0)}(x) = \begin{cases} S - x, & \text{if } x < s; \\ \varphi'(x) & \text{otherwise;} \end{cases} \qquad
\varphi^{(1)}(x) = \begin{cases} a^g, & \text{if } x \ge s; \\ \varphi'(x) & \text{otherwise,} \end{cases}
\]
where $\varphi' : X \to A$ is an arbitrary measurable mapping such that
\[
\varphi'(x) \in \begin{cases} A^I, & \text{if } x \ge s; \\ A^G, & \text{if } x < s. \end{cases}
\]
In words, if the inventory level x < s, then the impulse of the size S − x should be applied immediately (with probability 1 − μ(x) = 1), to fill the inventory level up to S. The artificial gradual action ϕ (x) is applied with probability μ(x) = 0. If x ≥ s, then wait until the demand reduces the inventory level to below s, that is, apply the gradual action a g with probability μ(x) = 1. The artificial impulse ϕ (x) is applied with probability 1 − μ(x) = 0. We call σ ∗ an “(s, S)-strategy”. Condition 7.3.4 ¯ (a) The function c G (·, a g ) is continuous (and therefore, bounded) on X = [0, x]. (b) The function h and the CDF F satisfy Condition 7.3.3. (c) The CDF F is continuous. Theorem 7.3.5 If Condition 7.3.4 is satisfied, then the (s, S)-strategy σ ∗ described below Lemma 7.3.7 is uniformly optimal in the inventory model described in Example 7.3.3. Proof If the function c G (·, a g ) is nonnegative, we consequently apply Lemmas 7.3.7, 7.3.6, 7.3.4 and 7.3.3. Since ϕ(0) ∈ A I , the strategy ϕ is uniformly optimal in the DTMDP model M+ , which is exactly the DTMDP model to be investigated, with the proper cost rate c G . According to Sect. 7.3.3, the (s, S)-strategy σ ∗ is uniformly optimal in the inventory model. Suppose the function c G (·, a g ) can take negative values. Then we enlarge it to ¯ c G+ (x, a g ) := c G (x, a g ) + c¯ ≥ 0 ∀ x ∈ [0, x] by adding a big enough constant c¯ > 0. Now σ ∗ is the uniformly optimal strategy in the gradual-impulsive model under study, but with the cost rate c G+ . Let us show that it is also uniformly optimal in the model with the cost rate c G . If the cost rate is c G+ , the performance functional (7.51) is denoted as W˘ α+ (σ, x0 ). For each strategy σ ∈ G I , W˘ α (σ, x0 ) ≤ W˘ α+ (σ, x0 ), and W˘ α+ (σ ∗ , x0 ) < ∞ by Lemma 7.3.3 and Sect. 7.3.3. 
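Therefore, below, when looking for the uniformly optimal strategy in the gradual-impulsive control problem, we consider only the strategies $\sigma \in \Sigma^{GI}$ with the finite positive part of the performance functional $\breve{W}_\alpha(\sigma, x_0)$ (for all $x_0 \in X$) coming from the cost rate $c^G \vee 0$. For each such strategy $\sigma$ (and certainly for $\sigma^*$, too), for all $x_0 \in X$, according to (7.50) and (7.51), we have
\[
\breve{W}_\alpha^+(\sigma, x_0)
= \breve{E}^\sigma_{x_0}\left[ \sum_{n=0}^\infty e^{-\alpha T_n} \left( \breve{l}^\alpha(\breve{X}_n, \breve{A}_{n+1}, \breve{X}_{n+1}) + \frac{\bar{c}}{\alpha}\left(1 - e^{-\alpha \Theta_{n+1}}\right) \right) \right]
= \breve{W}_\alpha(\sigma, x_0) + \frac{\bar{c}}{\alpha} \lim_{n \to \infty} \left( 1 - \breve{E}^\sigma_{x_0}\left[ e^{-\alpha T_n} \right] \right). \tag{7.67}
\]
We conclude that
\[
\breve{W}_\alpha(\sigma, x_0) \ge \breve{W}_\alpha^+(\sigma, x_0) - \frac{\bar{c}}{\alpha} \ge \breve{W}_\alpha^+(\sigma^*, x_0) - \frac{\bar{c}}{\alpha}. \tag{7.68}
\]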
The second inequality holds because $\sigma^*$ is the uniformly optimal strategy in the model with the cost rate $c^{G+}$. For the strategy $\sigma^*$, the gradual action $a^g$ is applied infinitely many times because it necessarily follows each impulsive action. To be more precise, for every fixed $n$, on the (discrete) time horizon $i = 1, 2, \ldots, n$, the gradual action $a^g$ is applied
\[
K(n) := \sum_{i=1}^n I\{\mu(X_i) = 1\}
\]
times, and the random variable $K(n)$ is greater than or equal to $\lfloor n/2 \rfloor$, the integer part of $n/2$. The sojourn time $\Theta_i$, following every gradual action $a^g$ (applied if and only if $\mu(X_i) = 1$), is exponentially distributed with the parameter $\lambda$, and all these sojourn times are independent. Therefore,
\[
T_n = \sum_{i=1}^n \Theta_i\, I\{\mu(X_i) = 1\} = \sum_{i=1}^{K(n)} \tilde{\Theta}_i,
\]
where $\tilde{\Theta}_i$ are independent and have the same exponential distribution with the parameter $\lambda$. The first equality holds because, when an impulse is applied, we have $\Theta_i\, I\{\mu(X_i) = 0\} = 0$. Finally, given $K(n)$, $T_n$ is the Erlang random variable with the parameters $K(n)$ and $\lambda$. Thus,
\[
\breve{E}^{\sigma^*}_{x_0}\left[ e^{-\alpha T_n} \right]
= \breve{E}^{\sigma^*}_{x_0}\left[ \breve{E}^{\sigma^*}_{x_0}\left[ e^{-\alpha T_n} \,\middle|\, K(n) \right] \right]
= \breve{E}^{\sigma^*}_{x_0}\left[ \left( \frac{\lambda}{\lambda + \alpha} \right)^{K(n)} \right]
\le \left( \frac{\lambda}{\lambda + \alpha} \right)^{\lfloor n/2 \rfloor} \to 0 \quad \text{as } n \to \infty.
\]
Therefore, for all $x_0 \in X$,
\[
\breve{W}_\alpha(\sigma^*, x_0) = \breve{W}_\alpha^+(\sigma^*, x_0) - \frac{\bar{c}}{\alpha}
\]
in accordance with (7.67). Using (7.68), we conclude that
\[
\breve{W}_\alpha(\sigma, x_0) \ge \breve{W}_\alpha^+(\sigma^*, x_0) - \frac{\bar{c}}{\alpha} = \breve{W}_\alpha(\sigma^*, x_0)
\]
for all $x_0 \in X$ and for all strategies $\sigma \in \Sigma^{GI}$ with the finite positive part of the performance functional $\breve{W}_\alpha(\sigma, x_0)$. In other words, the strategy $\sigma^*$ is uniformly
optimal in the original gradual-impulsive control problem, i.e., in the inventory model under study.

As was mentioned, Condition 7.3.3 is satisfied often enough. Suppose the running cost is given by (7.61) and the CDF $F$ has continuous density $p(z)$. Then
\[
\frac{dh(x, a^g)}{dx} = c^h - r\lambda \int_x^\infty p(z)\,dz + c^O(\lambda + \alpha) - \lambda c^O \int_0^x p(z)\,dz
\]
and
\[
\frac{d^2 h(x, a^g)}{dx^2} = r\lambda p(x) - c^O \lambda p(x).
\]
Therefore, if
\[
r > c^O + \frac{c^h + c^O \alpha}{\lambda},
\]
i.e., the reward $r$ is big enough, then the function $h(\cdot, a^g)$ is strictly convex and decreasing at $x = 0$, hence, satisfies Condition 7.3.3(a), (b). The requirement $s > 0$ in Condition 7.3.3(c) is fulfilled if the set-up cost $K$ is small enough, and Condition 7.3.3(d) is satisfied if the demand is big. In such a situation, Condition 7.3.4 is satisfied.
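The role of the threshold on the reward $r$ can be checked numerically. The following is only a sketch under stated assumptions: the demand density is taken to be exponential with a hypothetical rate `theta`, all parameter values are made up for illustration, and the two derivative formulas are copied from the text above rather than derived from (7.61).

```python
import math

def dh(x, c_h, c_o, r, lam, alpha, theta):
    """First derivative of h(., a^g) as stated in the text.

    Assumption: demand sizes are exponential with rate theta, so that
    F(x) = 1 - exp(-theta * x) and the tail integral equals 1 - F(x).
    """
    F = 1.0 - math.exp(-theta * x)
    return c_h - r * lam * (1.0 - F) + c_o * (lam + alpha) - lam * c_o * F

def d2h(x, c_o, r, lam, theta):
    """Second derivative of h(., a^g): (r - c_o) * lam * p(x)."""
    p = theta * math.exp(-theta * x)  # exponential density (assumption)
    return (r - c_o) * lam * p

def reward_threshold(c_h, c_o, lam, alpha):
    """Right-hand side of the condition r > c^O + (c^h + c^O * alpha) / lam."""
    return c_o + (c_h + c_o * alpha) / lam

# Hypothetical parameter values, chosen only for illustration.
PARAMS = dict(c_h=1.0, c_o=2.0, lam=3.0, alpha=0.5)
THETA = 1.5
R_MIN = reward_threshold(**PARAMS)

# At the threshold the slope at x = 0 vanishes; above it the slope is negative.
SLOPE_AT_THRESHOLD = dh(0.0, r=R_MIN, theta=THETA, **PARAMS)
SLOPE_ABOVE = dh(0.0, r=R_MIN + 1.0, theta=THETA, **PARAMS)
CURVATURES = [d2h(x, PARAMS["c_o"], R_MIN + 1.0, PARAMS["lam"], THETA)
              for x in (0.0, 0.5, 1.0, 2.0)]
```

With these values the slope at $x = 0$ equals $-\lambda (r - r_{\min})$, so raising the reward one unit above the threshold gives a slope of $-\lambda$, and the curvature stays positive wherever the density is positive.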
7.4 Bibliographical Remarks Purely impulsive control problems of piecewise deterministic processes (with deterministic jumps upon hitting boundaries of the state space) and general Markov processes (usually assuming the Feller property) were studied in e.g., [44, 50, 94, 204], respectively. Often the investigations of associated optimal stopping problems are useful for those of impulse control problems. A book treatment of optimal stopping problems is [226]. Sometimes, constraints are imposed on when impulses can be applied, see e.g., [162, 163]. The gradual-impulsive control problem for diffusion processes was perhaps firstly considered in [16]. It was introduced to CTMDPs with drift but without deterministic jumps in [128, 236], where the attention was restricted to strategies that do not apply multiple impulses simultaneously in a sequence, under which, to each time moment, there corresponds a unique value of the system state. In [128, 236], the space of the trajectories of controlled processes was described and investigated, and the authors studied the relations between the original problem and a sequence of problems obtained from time-discretization, see also [192]. The gradual-impulsive optimal control problem for piecewise deterministic processes was investigated in [48, 54]. For the concerned class of strategies therein, the authors of [54] exploited the possibility of reducing the gradual-impulsive control problem to a gradual-and-
boundary control problem of another piecewise deterministic process with a fairly complicated state space. The reduced problem and its analogues were considered in [47, 53, 239]. It was shown in [191] that the gradual-impulsive control problem of a CTMDP may be reduced to a gradual-control problem of an equally simple CTMDP. If multiple impulses were applied at a single moment in time, there would be multiple values of system states corresponding to a single time, which is less convenient to deal with rigorously, and consequently, such a possibility is often excluded from consideration. A rigorous treatment of strategies that may apply multiple impulses (in a sequence) at a single time was given in [257], where the author extended the time index to count the number of impulses that have been applied so far at a single time. Alternatively, one may keep the time index unchanged but enlarge the state space to the space of trajectories of some DTMDP. This was done in [64, 65] for the gradual-impulsive control problem of CTMDPs with discounted cost criteria, where the dynamic programming method and convex analytic approach were developed under some extra conditions on the growth of the transition and cost rates. The extension of [64] to piecewise deterministic processes was carried out in [62]. Section 7.1. The materials come from [191]. The gradual-impulsive control problem of CTMDPs presented here and the one in Sect. 7.3 are similar to those in [256, 258]. Under some extra conditions, sufficiency of the strategies in the form of (7.3) was established for discounted gradual-impulsive control problems in [64, 65]: see Theorems 4.1(b) and 5.6 correspondingly. The methods of investigations in [64, 65] are not based on reduction to CTMDPs with gradual control only, and are thus different from the method presented in this chapter. Section 7.2. The materials come from [181]. Section 7.3. Most of the optimality results presented in Sect. 
7.3.3 could be deduced following the reasoning of [254, 256, 258]. The lower semicontinuity or Borel-measurability of the value function of the gradual-impulsive control problem could be obtained directly if one could verify that the DTMDP problem used to describe the original problem satisfies suitable continuity-compactness conditions. This was done in [49, 93] for problems with gradual control only, for which, the action space of the underlying DTMDP is the space of relaxed controls endowed with the Young topology. In this connection, we mention that when α = 0, it can happen that the transition probability of the underlying DTMDP fails to be continuous in ρ ∈ R(A) when A is compact and q (dy|x, a) maps bounded continuous functions on X to continuous functions on X × A. In this connection, we mention that Lemma 5.7 of [263] contains inaccurate assertions, although all the optimality results in [263] remain correct since they can be established without referring to that lemma. The examples considered in Sect. 7.3.4 are similar to those in [194, 242], where there is no upper bound on the inventory level. In [242] the model is in discrete-time, whereas in [194] the model is investigated using the theory of impulsive control of piecewise deterministic processes, and the optimality of (s, S)-strategies was proved in the presence of the set-up cost. The basestock strategy, which is optimal in Example 7.3.1 can be viewed as a special (s, S)-strategy. There are several possible extensions of an (s, S)-strategy. For example, one may consider that the demand arrival process
is modulated by another continuous-time Markov chain, representing the dynamics of the environment. In this case, the natural extension of an (s, S)-strategy is to let s and S both depend on the environment state. The optimality of such an (s, S)-strategy was considered in [228]. See [267] for another extension. Overall, the literature of inventory control is too vast to be listed here. The interested reader will benefit from consulting the literature review and the reference list of [22, 194], as well as the recent survey [78] in close relation to DTMDPs.
Appendix A
Miscellaneous Results
In this appendix, the main purpose is to provide complements and further clarifications of several points sketched in the main text, which, in our opinion, are of a smaller order of importance compared to the other statements.
A.1 Compensators and Predictable Processes

In this subsection, we present the known results about the predictable projections of random measures. More details can be found in, e.g., [30, Chap. VIII, T7], [139], [149, Lemma 4], [150, Chap. 4], and [158, Chap. 3].

Definition A.1.1 (Predictable random measure) Let $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in \mathbb{R}_+^0}, P)$ be a filtered probability space, Pr be the associated predictable σ-algebra on $\Omega \times \mathbb{R}_+^0$, and $X$ be a Borel space. A random measure $\nu(\omega; dt \times dx)$ on $\mathbb{R}_+^0 \times X$ is called predictable if the process
\[
\int_{[0,t] \times X} Y(\omega, s, x)\, \nu(\omega; ds \times dx)
\]
is predictable (i.e., Pr-measurable) for each nonnegative measurable (with respect to Pr $\times\, \mathcal{B}(X)$) function $Y$.

Definition A.1.2 (Compensator) In the framework of Definition A.1.1, if $\mu(\omega; dt \times dx)$ is a random measure on $\mathbb{R}_+^0 \times X$, then a predictable random measure $\nu(\omega; dt \times dx)$ is called the dual predictable projection (compensator) of $\mu$ with respect to $P$ if
\[
\int_\Omega \left[ \int_{\mathbb{R}_+^0 \times X} Y(\omega, s, x)\, \mu(\omega; ds \times dx) \right] P(d\omega)
= \int_\Omega \left[ \int_{\mathbb{R}_+^0 \times X} Y(\omega, s, x)\, \nu(\omega; ds \times dx) \right] P(d\omega)
\]
for each nonnegative Pr $\times\, \mathcal{B}(X)$-measurable function $Y$.

© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9

Lemma A.1.1 The random measure $\nu$ given by (1.14) is the dual predictable projection (compensator), with respect to $P_\gamma^S$, of the random measure $\mu$ given by (1.4). In the case of a π-strategy $S$, it has the form
\[
\nu(\omega; \Gamma_R \times \Gamma_X) = \int_{\Gamma_R} \int_A I\{u < T_\infty\}\, \Pi(da|\omega, u)\, \tilde{q}(\Gamma_X | X(u), a)\, du, \quad \forall\, \Gamma_R \in \mathcal{B}(\mathbb{R}_+),\ \Gamma_X \in \mathcal{B}(X).
\]
(The $\mathcal{P}(A)$-valued predictable process $\Pi$ was introduced in (1.8).)

Proof For any $n \in \mathbb{N}_0$ on the set $\{\Theta_{n+1} \ge t\}$, we have, for an arbitrarily fixed set $\Gamma \in \mathcal{B}(X)$,
\[
\nu(\omega; (0, T_n + t] \times \Gamma) = \sum_{m=1}^n \int_{(0, \Theta_m]} \frac{G_m(d\theta \times \Gamma | H_{m-1})}{G_m([\theta, \infty] \times X_\infty | H_{m-1})}
+ \int_{(0, t]} \frac{G_{n+1}(d\theta \times \Gamma | H_n)}{G_{n+1}([\theta, \infty] \times X_\infty | H_n)}.
\]
The last term is $\mathcal{F}_{T_n}$-measurable; thus the process $\nu(\omega; (0, T_n + t] \times \Gamma)$ and the random measure $\nu$ on $\mathbb{R}_+ \times X$ are predictable.

Now fix an arbitrary nonnegative predictable process $\hat{Y}(\omega, t)$ satisfying equations $\hat{Y}(\omega, t) = Z_n(\omega, t - T_n)$ on the sets $(T_n, T_{n+1}]$, $n \in \mathbb{N}_0$, where $Z_n(\omega, \theta)$ is $\mathcal{F}_{T_n} \times \mathcal{B}(\mathbb{R}_+)$-measurable. Let $Y(\omega, t, x) = \hat{Y}(\omega, t) I\{x \in \Gamma\}$ for an arbitrarily fixed $\Gamma \in \mathcal{B}(X)$ and prove that
\[
E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \mu(dt \times dx) \right] = E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \nu(dt \times dx) \right]. \tag{A.1}
\]
As a result, formula (A.1) will be valid for an arbitrary nonnegative Pr $\times\, \mathcal{B}(X)$-measurable function $Y$, and hence $\nu$ is the dual predictable projection of $\mu$ by definition. Recall that Pr denotes the predictable σ-algebra.
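First of all, for any $n \in \mathbb{N}_0$, on the set $\{T_n < \infty\}$,
\[
E_\gamma^S\left[ \int_{(0, \Theta_{n+1}]} Z_n(\theta) \frac{G_{n+1}(d\theta \times \Gamma | H_n)}{G_{n+1}([\theta, \infty] \times X_\infty | H_n)} \,\middle|\, \mathcal{F}_{T_n} \right]
= E_\gamma^S\left[ \int_{(0, \infty)} I\{\Theta_{n+1} \ge \theta\} Z_n(\theta) \frac{G_{n+1}(d\theta \times \Gamma | H_n)}{G_{n+1}([\theta, \infty] \times X_\infty | H_n)} \,\middle|\, \mathcal{F}_{T_n} \right]
= \int_{(0, \infty)} Z_n(\theta)\, G_{n+1}(d\theta \times \Gamma | H_n). \tag{A.2}
\]
Now
\[
E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \nu(dt \times dx) \right]
= E_\gamma^S\left[ \sum_{n \ge 0} \int_{(0,\infty)} Z_n(t - T_n) \frac{G_{n+1}((dt - T_n) \times \Gamma | H_n)}{G_{n+1}([t - T_n, \infty] \times X_\infty | H_n)}\, I\{T_n < t \le T_{n+1}\} \right]
\]
\[
= \sum_{n \ge 0} E_\gamma^S\left[ I\{T_n < \infty\} \int_{(0, \Theta_{n+1}]} Z_n(\theta) \frac{G_{n+1}(d\theta \times \Gamma | H_n)}{G_{n+1}([\theta, \infty] \times X_\infty | H_n)} \right]
= \sum_{n \ge 0} E_\gamma^S\left[ I\{T_n < \infty\} \int_{(0,\infty)} Z_n(\theta)\, G_{n+1}(d\theta \times \Gamma | H_n) \right],
\]
where the last equality comes after taking the conditional expectation given $\mathcal{F}_{T_n}$ and using the formula (A.2). On the other hand,
\[
E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \mu(dt \times dx) \right]
= E_\gamma^S\left[ \sum_{n \ge 1} \hat{Y}(T_n)\, I\{T_n < \infty,\ X_n \in \Gamma\} \right]
\]
\[
= \sum_{n \ge 0} E_\gamma^S\left[ I\{T_n < \infty\}\, Z_n(\Theta_{n+1})\, I\{\Theta_{n+1} < \infty,\ X_{n+1} \in \Gamma\} \right]
= \sum_{n \ge 0} E_\gamma^S\left[ I\{T_n < \infty\} \int_{(0,\infty)} Z_n(\theta)\, G_{n+1}(d\theta \times \Gamma | H_n) \right],
\]
where the last equality comes after taking the conditional expectation given $\mathcal{F}_{T_n}$. Formula (A.1) is proved, and the proof of the first statement is complete. The last statement follows from the expressions (1.13).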
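Lemma A.1.2 For the random measure $\tilde{\mu}$ defined by
\[
\tilde{\mu}(\omega; \Gamma_R \times \Gamma_X) := \sum_{n \ge 1} I\{T_n(\omega) < \infty\}\, \delta_{(T_n(\omega), X_{n-1}(\omega))}(\Gamma_R \times \Gamma_X), \quad \forall\, \Gamma_R \in \mathcal{B}(\mathbb{R}_+),\ \Gamma_X \in \mathcal{B}(X),
\]
the dual predictable projection (compensator), with respect to $P_\gamma^S$, is given by
\[
\tilde{\nu}(\omega; \Gamma_R \times \Gamma_X) := \int_{\Gamma_R} \delta_{X(u-)}(\Gamma_X)\, \nu(du \times X), \quad \forall\, \Gamma_R \in \mathcal{B}(\mathbb{R}_+),\ \Gamma_X \in \mathcal{B}(X),
\]
where $\nu$ is the dual predictable projection, with respect to $P_\gamma^S$, of the random measure $\mu$ given by (1.4). If $S$ is a π-strategy, then
\[
\tilde{\nu}(\omega; \Gamma_R \times \Gamma_X) = \int_{\Gamma_R} \int_A I\{u < T_\infty\}\, \Pi(da|\omega, u)\, q_{X(u-)}(a)\, \delta_{X(u-)}(\Gamma_X)\, du,
\]
or, equivalently,
\[
\tilde{\nu}(\omega; \Gamma_R \times \Gamma_X) = \int_{\Gamma_R} \int_A I\{u < T_\infty\}\, \Pi(da|\omega, u)\, q_{X(u)}(a)\, \delta_{X(u)}(\Gamma_X)\, du,
\]
where $\Pi$ was defined in (1.8).

Proof The measure $\tilde{\nu}$ is predictable because, for any fixed $\Gamma_X \in \mathcal{B}(X)$, the process $\delta_{X(u-)}(\Gamma_X)$ is predictable. Now it is sufficient to prove the analogue of equality (A.1) for the same Pr $\times\, \mathcal{B}(X)$-measurable function $Y(\omega, t, x) = \hat{Y}(\omega, t) I\{x \in \Gamma\}$ as in the proof of Lemma A.1.1, where $\Gamma \in \mathcal{B}(X)$ is arbitrarily fixed and $\hat{Y}(\omega, t) = Z_n(\omega, t - T_n)$ on the sets $(T_n, T_{n+1}]$, $n \in \mathbb{N}_0$, $Z_n(\omega, \theta)$ being $\mathcal{F}_{T_n} \times \mathcal{B}(\mathbb{R}_+)$-measurable. Consider also the function $\tilde{Y}(\omega, t) = \hat{Y}(\omega, t) I\{X(t-) \in \Gamma\}$. It is predictable as the product of two predictable functions. Therefore,
\[
E_\gamma^S\left[ \int_{(0,\infty)} \tilde{Y}(t)\, \mu(dt \times X) \right]
= E_\gamma^S\left[ \int_{(0,\infty)} \tilde{Y}(t)\, \nu(dt \times X) \right]
= E_\gamma^S\left[ \int_{(0,\infty)} \hat{Y}(t)\, I\{X(t-) \in \Gamma\}\, \nu(dt \times X) \right]
\]
\[
= E_\gamma^S\left[ \int_{(0,\infty)} \int_X \hat{Y}(t)\, I\{x \in \Gamma\}\, \delta_{X(t-)}(dx)\, \nu(dt \times X) \right]
= E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \tilde{\nu}(dt \times dx) \right]. \tag{A.3}
\]
On the other hand, the expressions
\[
\int_{(0,\infty)} \tilde{Y}(t)\, \mu(dt \times X) = \sum_{n \ge 1} I\{T_n < \infty\}\, \hat{Y}(T_n)\, I\{X(T_n-) \in \Gamma\}
\]
and
\[
\int_{(0,\infty) \times X} Y(t, x)\, \tilde{\mu}(dt \times dx) = \sum_{n \ge 1} I\{T_n < \infty\}\, \hat{Y}(T_n)\, I\{X(T_{n-1}) \in \Gamma\}
\]
coincide. After taking the expectations and using (A.3), we obtain
\[
E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \tilde{\mu}(dt \times dx) \right] = E_\gamma^S\left[ \int_{(0,\infty) \times X} Y(t, x)\, \tilde{\nu}(dt \times dx) \right] \tag{A.4}
\]
as required. The formulae for the case of a π-strategy follow from Lemma A.1.1. In the latter case, for a fixed $\Gamma_X \in \mathcal{B}(X)$, the marginals $\nu(dt \times \Gamma_X)$ and $\tilde{\nu}(dt \times \Gamma_X)$ on $\mathbb{R}_+$ are absolutely continuous with respect to the Lebesgue measure, so that one can modify the function
\[
\int_A I\{u < T_\infty\}\, \Pi(da|\omega, u)\, q_{X(u)}(a)\, \delta_{X(u)}(\Gamma_X)
\]
on the set $\{0 < u < T_\infty : X(u-) \ne X(u)\} = \{T_1, T_2, \ldots\}$, which is null with respect to the Lebesgue measure.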
Proposition A.1.1 If a filtration and the associated predictable σ-algebra are given by expressions (6.2) and (6.4), then a random process $A(\cdot)$ with values in a Borel space $A$ is predictable if and only if $A(\omega, 0)$ is $\mathcal{F}_0$-measurable and $A(\cdot)$ has the form
\[
A(\omega, t) = \sum_{n \ge 1} I\{T_{n-1} < t \le T_n\}\, A_n(H_{n-1}, t - T_{n-1}), \quad t > 0, \tag{A.5}
\]
where, for all $n = 1, 2, \ldots$, $A_n(h_{n-1}, s)$ is a (nonrandom) measurable mapping $A_n : H_{n-1} \times \mathbb{R}_+ \to A$.

Proof For the case $A = \mathbb{R}$, this statement was proved in [139]: see Lemma 3.3. An arbitrary Borel space $A$ is isomorphic to $[0, 1]$ or a countable or finite subset of $[0, 1]$ with an isomorphism $\psi : A \to [0, 1]$ by Proposition B.1.1. Now $A(\omega, 0)$ is $\mathcal{F}_0$-measurable and the process $A(\cdot)$ has the form (A.5) if and only if $\psi(A(\omega, 0))$ is $\mathcal{F}_0$-measurable and the process $\psi(A(\cdot))$ has the form
\[
\psi(A(\omega, t)) = \sum_{n \ge 1} I\{T_{n-1} < t \le T_n\}\, \psi(A_n(H_{n-1}, t - T_{n-1})), \quad t > 0,
\]
i.e., if and only if the process $\psi(A(\cdot))$ is predictable. Finally, the process $\psi(A(\cdot))$ is predictable if and only if the process $A(\cdot)$ is predictable.
A.2 On Non-realizability of Relaxed Strategies

Suppose the random process $A(\cdot)$ in continuous time takes independent values $\pm 1$ with equal probabilities at each time moment. The Kolmogorov Consistency Theorem implies that there is a complete probability space $(\tilde{\Omega}, \tilde{\mathcal{F}}, \tilde{P})$ and a stochastic process $A(\cdot)$ measurable in $\tilde{\omega}$ for any $t \in (0, \infty)$ such that, for any $t \in \mathbb{R}_+$, $\tilde{P}(A(t) = -1) = \tilde{P}(A(t) = +1) = \frac{1}{2}$ and the variables $A(t)$ and $A(s)$ are independent for $s \ne t$. However, the process $A(\cdot)$ is not measurable jointly in $(t, \tilde{\omega})$, as was demonstrated in [147, Example 1.2.5]. Below we reproduce the very brief proof.

Suppose the process $A(\cdot)$ is measurable in $(t, \tilde{\omega})$. Then, for any interval $I \subseteq \mathbb{R}_+$,
\[
\tilde{E}\left[ \left( \int_I A(t)\,dt \right)^2 \right] = \tilde{E}\left[ \int_I \int_I A(t) A(s)\, ds\, dt \right] = \int_I \int_I \tilde{E}[A(t) A(s)]\, ds\, dt = 0
\]
because $\tilde{E}[A(t) A(s)] = 0$ for $t \ne s$. Therefore, for each interval $I$,
\[
\tilde{P}\left( \int_I A(t)\,dt = 0 \right) = 1,
\]
and we need to show that $\tilde{P}\left( \forall \text{ interval } I,\ \int_I A(t)\,dt = 0 \right) = 1$. Indeed,
\[
\tilde{P}\left( \text{for any rational } r_1 \le r_2,\ \int_{[r_1, r_2]} A(t)\,dt = 0 \right) = 1,
\]
and as a result, $\tilde{P}\left( \forall \text{ interval } I \subseteq \mathbb{R}_+,\ \int_I A(t)\,dt = 0 \right) = 1$, because the function $\int_{[a,b]} A(t)\,dt$ is continuous with respect to $a$ and $b$. Hence,
\[
\tilde{P}(A(t) = 0 \text{ for almost all } t > 0 \text{ with respect to the Lebesgue measure}) = 1, \tag{A.6}
\]
yielding $\tilde{E}\left[ \int_I A^2(t)\,dt \right] = 0$ for any interval $I \subseteq \mathbb{R}_+$, which contradicts $\tilde{E}[A^2(t)] \equiv 1$. For more detailed explanations, see the proof of Theorem 1.1.2.
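The key step of this argument, that independence forces the second moment of the time integral to vanish, has a simple discrete analogue. The sketch below is an illustration only and not part of the proof: it replaces the integral over a unit interval by a Riemann sum over a hypothetical grid of $n$ points and computes the second moment exactly by enumerating all sign patterns. The function name and the grid sizes are assumptions made for this example.

```python
from fractions import Fraction
from itertools import product

def second_moment_of_riemann_sum(n):
    """Exact E[((1/n) * sum_k A_k)^2] for n independent fair signs A_k in {-1, +1}.

    All 2**n equally likely sign patterns are enumerated, so the result
    is an exact rational number rather than a Monte Carlo estimate.
    """
    total = Fraction(0)
    for signs in product((-1, 1), repeat=n):
        total += Fraction(sum(signs), n) ** 2
    return total / 2 ** n

# Second moments for a few grid sizes; each equals 1/n.
MOMENTS = {n: second_moment_of_riemann_sum(n) for n in (1, 2, 4, 8)}
```

The cross terms cancel, only the $n$ diagonal terms survive, and the second moment equals $1/n$, which tends to zero as the grid is refined. In the continuous-time setting the diagonal has Lebesgue measure zero, which is why the double integral above vanishes outright.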
A.3 Comments on the Proof of Theorem 1.1.3 Part (i). For the introduced measure P Z , the probability measure PZ (d x) = (A1 )δλ1 T (d x) + (A2 )δλ2 T (d x) on [λ1 T, λ2 T ] is a proper balayage, different from P Z , see [165, Chap. IX, Sect. 2]. Therefore, for the strictly convex function e−z , the strict inequality ˆ ω ))ds E e− (0,T ] q(A(s, EZ [e−Z ] = (A1 )e−λ1 T + (A2 )e−λ2 T > E Z [e−Z ] = is obvious. Part (ii). We will show here that the set { pωA (da) :
} A is a jointly measurable A-valued process on R+ ×
. coincides with the set of all measurable stochastic kernels on B(A) given ω∈ Let us suppose that the Borel space A is uncountable. The case of a discrete space A is much easier and can be studied in a similar way. Hence, there is an isomorphism ψ from A to [0, 1], see Proposition B.1.1. That is, ψ is a 1-1 mapping of (A, B(A)) on ([0, 1], B([0, 1])), which is measurable in both directions. We will show that, for , the process A ω∈ a given measurable stochastic kernel pω (da) on B(A) given defined by A(s, ω ) = ψ −1 (inf{x ∈ R : pω (ψ −1 ((−∞, x])) ≥ 1 − e−s })
(A.7)
is the desired one: pωA (da) = pω (da). For a fixed ω , consider the image pˆω (da) of the measure pω (da) with respect to the mapping ψ: pˆω () = pω (ψ −1 ()), or equivalently pω ( A ) = pˆω (ψ( A )), for each ∈ B([0, 1]) and A ∈ B(A). Let Fω (x) = pˆω ((−∞, x]) be the cumulative distribution function associated with the probability measure pˆω (d x) on [0, 1] and introduce Fω−1 (u) = inf{x ∈ R : Fω (x) ≥ u}, u ∈ (0, 1). Note that the image of the Lebesgue measure on B((0, 1)) with respect to the mapping Fω−1 : (0, 1) → [0, 1] coincides with the measure pˆω (d x), see Proposition B.1.21. In turn, the image of the measure λ, introduced in the proof of Theorem 1.1.3(ii), with respect to the mapping t → 1 − e−t from R+ to (0, 1) is just the Lebesgue measure on B((0, 1)). Therefore, if B(s, ω ) = Fω−1 (1 − e−s ),
then the image of the measure λ with respect to the mapping s → B(s, ω ) coincides ω ) = ψ −1 (B(s, ω )). with pˆω (d x). Finally, let A(s, ω ) satisfies the following The image pωA of λ with respect to the mapping s → A(s, equalities, for an arbitrary set A ∈ B(A): ω )) = λ(B −1 (ψ( A ), ω ) = pˆω (ψ( A ), ω ) = pω ( A ). pωA ( A ) = λ(A−1 ( A , (See (1.27).) Therefore, the constructed function A(s, ω ) is the desired one, and it remains to prove that the process A is measurable. , For each ∈ B([0, 1]), the function pˆω () = pω (ψ −1 ()) is measurable on so that, for each fixed x ∈ [0, 1], the function Fω (x) is measurable and hence the equipped with the set {(u, ω ) : u ≤ Fω (x)} is measurable in the space [0, 1] × Note that product σ-algebra B([0, 1]) × F. ω ) : Fω−1 (u) ≤ x} {(u, ω ) : u ≤ Fω (x)} = {(u, because the function Fω (·) is increasing. Therefore, the real-valued functions Fω−1 (u) ω ) and (s, ω ) respecand B(s, ω ) = Fω−1 (1 − e−s ) are measurable with respect to (u, ω )) is also measurable, as ψ −1 : [0, 1] → tively, and the mapping A(s, ω ) = ψ −1 (B(s, A is measurable. Obviously, the constructed process A has the form (A.7). We have proved that, when considering all different measurable functions A(s, ω ), we can obtain all possible stochastic kernels pωA (da) as the images of the measure λ(ds) with respect to the mappings s → A(s, ω ). Any such kernel defines the prob × B(A). Let us show that, if the strategy S ability measure P(d ω ) × pωA (da) on F is realizable, then, for the process A, such a measure P(d ω ) × pωA (da) is extreme in the space D = { P(d ω ) × pω (da) : pω (da) are all possible measurable stochastic }. kernels on B(A) given ω∈ and A ∈ B(A) If for all sets ∈F
P(d ω ) pωA ( A )
=α
(A.8)
P(d ω ) pω1 ( A ) + (1 − α)
P(d ω ) pω2 ( A ) , α ∈ (0, 1),
and ˆ A ∈ B(A) such that and there are sets ˆ ∈F ˆ
then
P(d ω ) pω1 (ˆ A ) =
ˆ
P(d ω ) pω2 (ˆ A ),
A ˆ P({ ω ∈ ˆ : p ω ( A ) ∈ (0, 1)}) > 0
(A.9)
ˆ 0 and ˆ 1 where pωA (ˆ A ) = 0 because otherwise ˆ can be split into two subsets 0 1 ˆ ˆ and , paired with and 1 correspondingly. For each of ˆ A , equality (A.8) can P(d ω ) pωA (ˆ A ), leading to the hold only if both square brackets coincide with ˆ equality P(d ω ) pω1 (ˆ A ) = P(d ω ) pω2 (ˆ A ). ˆ
ˆ
ω )) > 0 and Leb((A \ ˆ A )− ( ω )) > 0, But pωA (ˆ A ) ∈ (0, 1) means that Leb(ˆ − A ( where we use the notation introduced in (1.27). According to (1.26), if the strategy S is realizable, then, for each A ∈ B(A), P( pωA ( A ) ∈ (0, 1)) = 0, which contradicts (A.9). Therefore, for the process A, P(d ω ) pωA (da) must be extreme in the space D. →A According to Theorem 10 of [179], there is a measurable mapping ϕ : such that A(s, ω ) = ϕ( ω ) for almost all s ∈ R+ , P-a.s. See also Proposition B.1.36.
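The quantile construction (A.7) used in Part (ii) can be illustrated for a fixed sample point and a purely atomic target law. The sketch below makes two assumptions that should be read as such: the measure $\lambda$ is taken to be $\lambda(ds) = e^{-s}\,ds$ on $\mathbb{R}_+$ (which is what the statement about its image under $t \mapsto 1 - e^{-t}$ suggests), and the three-atom distribution on $[0, 1]$ is hypothetical. All function names are made up for this example.

```python
import math

def generalized_inverse(cdf_points, u):
    """F^{-1}(u) = inf{x : F(x) >= u} for a discrete CDF given as sorted (x, F(x)) pairs."""
    for x, cum in cdf_points:
        if cum >= u:
            return x
    raise ValueError("u must not exceed the total mass")

def B(cdf_points, s):
    """The mapping s -> F^{-1}(1 - exp(-s)) from the construction in the text."""
    return generalized_inverse(cdf_points, 1.0 - math.exp(-s))

def preimage_masses(cdf_points):
    """lambda-mass of {s : B(s) = x} for each atom x, with lambda(ds) = exp(-s) ds (assumption).

    B(s) = x exactly when F(x-) < 1 - exp(-s) <= F(x), i.e. when s lies in
    (-log(1 - F(x-)), -log(1 - F(x))], and the lambda-mass of (a, b] is exp(-a) - exp(-b).
    """
    masses = {}
    prev = 0.0
    for x, cum in cdf_points:
        lower = -math.log(1.0 - prev)
        upper = math.inf if cum >= 1.0 else -math.log(1.0 - cum)
        masses[x] = math.exp(-lower) - math.exp(-upper)
        prev = cum
    return masses

# Hypothetical target law on [0, 1]: atoms 0.1, 0.5, 0.9 with masses 1/4, 1/2, 1/4.
CDF = [(0.1, 0.25), (0.5, 0.75), (0.9, 1.0)]
MASSES = preimage_masses(CDF)
```

The recovered masses agree with the target law, which is the atomic version of the statement that the image of $\lambda$ under $s \mapsto B(s, \omega)$ coincides with the prescribed measure.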
A.4 A Condition for Non-explosiveness

The following condition is, to some extent, an extension of Condition 2.2.3 to the nonhomogeneous case, and was obtained independently in [41, 265]. Below, $q$ is a Q-function: see Definition 1.1.4.

Condition A.4.1
(a) $X = \mathbb{Z}$, and the function $t \in [0, \infty) \to q(j|i, t)$ is continuous for each $i, j \in X = \mathbb{Z}$.
(b) There exists an $\mathbb{R}_+^0$-valued function $w$ on $\mathbb{R}_+^0 \times \mathbb{Z}$ such that the following assertions hold.
(i) $v \in \mathbb{R}_+^0 \to w(v, i)$ is continuously differentiable for each $i \in \mathbb{Z}$.
(ii) For each $T \ge 0$ and $i \in \mathbb{Z}$,
\[
\sup_{0 \le v \le T} \sum_{j \in \mathbb{Z}} \tilde{q}(j|i, v)\, |w(v, j)| < \infty.
\]
(iii) For each $T \ge 0$, as $n \uparrow \infty$,
\[
\inf_{v \in [0, T],\ |i| > n} w(v, i) \uparrow \infty.
\]
(iv) For each $T \ge 0$, there is some constant $\alpha_T \in \mathbb{R}_+$ such that
\[
\frac{\partial w(v, i)}{\partial v} + \sum_{j \in \mathbb{Z}} q(j|i, v)\, w(v, j) \le \alpha_T\, w(v, i), \quad \forall\, v \in [0, T],\ i \in \mathbb{Z}. \tag{A.10}
\]
By Theorem 1 of [41], Condition A.4.1 is sufficient for the non-explosiveness of the nonhomogeneous Markov pure jump process $X(\cdot)$ with the transition function $p_q$. We explain that Condition A.4.1 implies Conditions 2.2.1 and 2.2.2 as follows. Suppose that Condition A.4.1 is satisfied. Since the concerned process is in a denumerable state space, and the Q-function is continuous, Theorem 2.2.2 applies. In particular, assumption (b) in Theorem 2.2.2 holds because of Definition 1.1.4, and that $X$ is denumerable with the discrete topology, so that each compact subset of $X$ is finite. Now the validity of Condition 2.2.1 follows from the sufficiency of Condition A.4.1 for non-explosiveness. To see that Condition 2.2.2 is satisfied, merely notice that for each $T \ge 0$,
\[
\int_{(0, T-v]} \sum_{j \in \mathbb{Z}} w(t+v, j)\, e^{-\alpha_T t - \int_{(0,t]} q_i(s+v)\,ds}\, \tilde{q}(j|i, t+v)\, dt
\]
\[
= \int_{(0, T-v]} \sum_{j \in \mathbb{Z}} w(t+v, j)\, e^{-\alpha_T t - \int_{(0,t]} q_i(s+v)\,ds}\, q(j|i, t+v)\, dt
+ \int_{(0, T-v]} w(t+v, i)\, e^{-\alpha_T t - \int_{(0,t]} q_i(s+v)\,ds}\, q_i(t+v)\, dt
\]
\[
\le \int_{(0, T-v]} e^{-\alpha_T t - \int_{(0,t]} q_i(s+v)\,ds}\, (\alpha_T + q_i(t+v))\, w(t+v, i)\, dt
- \int_{(0, T-v]} e^{-\alpha_T t - \int_{(0,t]} q_i(s+v)\,ds}\, \left. \frac{\partial w(v, i)}{\partial v} \right|_{t+v} dt
\]
\[
= w(v, i) - e^{-\alpha_T (T-v) - \int_{[0, T-v)} q_i(s+v)\,ds}\, w(T, i) \le w(v, i), \quad \forall\, v \in [0, T],\ i \in \mathbb{Z},
\]
where the first inequality is by (A.10), and the last equality is by integration by parts.
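To make the drift inequality (A.10) concrete, the sketch below checks it on a grid for a hypothetical nonhomogeneous pure birth process on $\mathbb{Z}$. Everything specific here is an assumption chosen for illustration: the jump rate `jump_rate`, the test function `w` (which does not depend on time, so its time derivative is zero), and the constant `ALPHA_T`.

```python
def jump_rate(v, i):
    """Hypothetical rate of the jump i -> i + 1 at time v; continuous in v, linear in |i|."""
    return (1.0 + v / (1.0 + v)) * (1 + abs(i))

def w(v, i):
    """Hypothetical test function; it tends to infinity as |i| grows and is constant in v."""
    return 1.0 + abs(i)

def drift(v, i):
    """Left-hand side of (A.10): dw/dv + sum_j q(j|i,v) w(v,j).

    For a pure birth kernel, q(i+1|i,v) = jump_rate and q(i|i,v) = -jump_rate,
    so the sum reduces to jump_rate * (w(v, i+1) - w(v, i)).
    """
    dw_dv = 0.0  # w does not depend on v in this example
    return dw_dv + jump_rate(v, i) * (w(v, i + 1) - w(v, i))

ALPHA_T = 2.0  # the rate factor 1 + v/(1+v) never exceeds 2

# Grid check of (A.10) for v in [0, 10] and i in [-50, 50].
VIOLATIONS = [
    (v, i)
    for v in [k * 0.25 for k in range(41)]
    for i in range(-50, 51)
    if drift(v, i) > ALPHA_T * w(v, i) + 1e-12
]
```

The rate grows linearly in $|i|$, and the linear test function absorbs this growth with a constant $\alpha_T$ that happens not to depend on $T$ in this example. A grid check is of course not a proof; here the inequality also follows directly from $w(v, i+1) - w(v, i) \le 1$.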
A.5 Markov Pure Jump Processes

Until the end of this subsection, we consider the homogeneous Markov pure jump process $X(\cdot)$ with the transition function $p_q(x, t, dy) := p_q(s, x, t + s, dy)$ as in Sect. 2.1.2, and adopt the notations defined therein.

Proposition A.5.1 Consider the homogeneous Markov pure jump process $X(\cdot)$. Then
\[
\lim_{t \downarrow 0} \frac{p_q(x, t, \Gamma)}{t} = q(\Gamma|x)
\]
for each $\Gamma \in \mathcal{B}(X)$ and $x \notin \Gamma$.

Proof See Theorem 1.13 of [39].

Proposition A.5.2 Consider the non-explosive homogeneous Markov pure jump process $X(\cdot)$. Fix the initial state $x \in X$. Then $X(\cdot)$ has the strong Markov property, i.e., for each $\{\mathcal{F}_t\}$-stopping time $\tau$, and bounded measurable function $f$ on $X$,
\[
E_x[f(X(t + \tau)) | \mathcal{F}_\tau] = \int_X f(y)\, p_q(X(\tau), t, dy), \quad \forall\, t \ge 0
\]
on $\{\tau < \infty\}$.

Proof See p. 197 of [96].

A σ-finite measure $\mu$ on $(X, \mathcal{B}(X))$ is said to be an invariant measure of the process $X(\cdot)$ if $\mu(\Gamma) = \int_X \mu(dx)\, p_q(x, t, \Gamma)$ for each $\Gamma \in \mathcal{B}(X)$ and $t \ge 0$. An invariant measure is called an invariant distribution if it is a probability measure.

Proposition A.5.3 Consider the non-explosive homogeneous Markov pure jump process $X(\cdot)$. Then a σ-finite measure $\mu$ on $(X, \mathcal{B}(X))$ is an invariant measure if and only if either of the following equivalent conditions is satisfied.
(a) For each $\Gamma \in \mathcal{B}(X)$ satisfying $\mu(\Gamma) < \infty$ and $\sup_{x \in \Gamma} \tilde{q}(X|x) < \infty$, it holds that
\[
\int_X q(\Gamma|x)\, \mu(dx) = 0.
\]
(b) For each nonnegative real-valued measurable function $f$ on $X$,
\[
\int_X f(x)\, \tilde{q}(X|x)\, \mu(dx) = \int_X \int_X f(y)\, \tilde{q}(dy|x)\, \mu(dx).
\]

Proof See Theorem 4.17 of [39].
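Conditions (a) and (b) of Proposition A.5.3 are easy to verify by hand on a finite state space. The sketch below does this with exact rational arithmetic for a hypothetical three-state birth-death chain; the generator `Q` and the test function `f` are assumptions chosen for illustration, and the invariant distribution is obtained from detailed balance, which is valid for birth-death chains but not for general generators.

```python
from fractions import Fraction as Fr

# Hypothetical conservative generator of a birth-death chain on {0, 1, 2}:
# Q[x][y] is the jump intensity from x to y, and each row sums to zero.
Q = [[Fr(-2), Fr(2), Fr(0)],
     [Fr(1), Fr(-4), Fr(3)],
     [Fr(0), Fr(5), Fr(-5)]]

def invariant_distribution(Q):
    """Invariant distribution of a birth-death generator via detailed balance."""
    weights = [Fr(1)]
    for k in range(len(Q) - 1):
        # mu(k) * q(k+1|k) = mu(k+1) * q(k|k+1)
        weights.append(weights[-1] * Q[k][k + 1] / Q[k + 1][k])
    total = sum(weights)
    return [wk / total for wk in weights]

def balance_residual(Q, mu):
    """Condition (a) with Gamma = {y}: sum_x q({y}|x) mu(x) for every state y."""
    n = len(Q)
    return [sum(mu[x] * Q[x][y] for x in range(n)) for y in range(n)]

def condition_b_gap(Q, mu, f):
    """Difference of the two sides of condition (b) for a test function f."""
    n = len(Q)
    lhs = sum(f[x] * (-Q[x][x]) * mu[x] for x in range(n))
    rhs = sum(f[y] * Q[x][y] * mu[x]
              for x in range(n) for y in range(n) if y != x)
    return lhs - rhs

MU = invariant_distribution(Q)
RESIDUAL = balance_residual(Q, MU)
GAP = condition_b_gap(Q, MU, [1, 4, 9])
```

Here $-Q[x][x]$ plays the role of the total jump intensity out of $x$, and the off-diagonal entries play the role of the jump kernel, so condition (b) states that the flow of $f$ out of each state, weighted by $\mu$, equals the flow in.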
Proposition A.5.4 Consider the non-explosive homogeneous Markov pure jump process $X(\cdot)$. Then the following two conditions are equivalent.
(a) For each bounded measurable function $f$ on $X$, there exists a constant $\alpha(f) \in \mathbb{R}$ such that $\lim_{T \to \infty} \frac{1}{T} E_x\left[ \int_{(0, T]} f(X(s))\,ds \right] = \alpha(f)$ for each $x \in X$.
(b) There exists a nontrivial σ-finite measure, say $\nu$, on $X$ such that for each $\Gamma \in \mathcal{B}(X)$ with $\nu(\Gamma) > 0$, $P_x(\tau_\Gamma < \infty) = 1$ for each $x \in X$. Here $\tau_\Gamma = \inf\{t \ge 0 : X(t) \in \Gamma\}$ for each $\Gamma \in \mathcal{B}(X)$.
Under either of these two conditions, there exists a unique invariant probability measure, say $\mu$, for $X(\cdot)$, which is given by
\[
\mu(\Gamma) = \lim_{T \to \infty} \frac{1}{T} E_x\left[ \int_{(0, T]} I\{X(s) \in \Gamma\}\,ds \right], \quad \Gamma \in \mathcal{B}(X).
\]

Proof See Theorem 2.4 of [166] and Theorem 1 of [97].
A.6 On the Uniformly Exponential Ergodicity of CTMDPs in Denumerable State Space

Proposition A.6.1 Consider a CTMDP model with denumerable state space $X$, endowed with the discrete topology. Suppose the following conditions are satisfied:
• There exist a $[1, \infty)$-valued strictly unbounded function (or say moment) $w'$ on $X$ and constants $\rho' < 0$ and $b' > 0$ such that
\[
\sum_{y \in X} q(\{y\}|x, a)\, w'(y) \le \rho' w'(x) + b', \quad x \in X,\ a \in A.
\]
• For each $x, y \in X$, the function $a \in A \to q(\{y\}|x, a)$ is continuous.
• There exist a $[1, \infty)$-valued function $w$ on $X$ and constants $\rho \in \mathbb{R}$ and $b \ge 0$ such that
\[
\sum_{y \in X} q(\{y\}|x, a)\, w(y) \le \rho w(x) + b, \quad x \in X,\ a \in A,
\]
and $\frac{w}{w'}$ is a strictly unbounded function on $X$.
• $A$ is compact.
• Under each stationary π-strategy $\varphi$, there exists one closed communicating class.
Then under each stationary π-strategy $\varphi$, there is a unique invariant distribution $\mu^\varphi$ such that $\sum_{y \in X} \mu^\varphi(\{y\})\, w'(y) < \infty$. Moreover, there exist some constants $\delta > 0$ and $R > 0$ such that for each $x \in X$, $t \ge 0$ and stationary π-strategy $\varphi$,
\[
\sum_{y \in X} w'(y)\, \left| p_{q^\varphi}(x, t, \{y\}) - \mu^\varphi(\{y\}) \right| \le R e^{-\delta t} w'(x),
\]
where $q^\varphi(\{y\}|x) := \int_A q(\{y\}|x, a)\, \varphi(da|x)$ for each $x, y \in X$. In other words, the CTMDP model is $w'$-exponentially ergodic uniformly with respect to the class of stationary π-strategies.

Proof See the proof of Theorem 3.13 of [198], based on [51], see also [229].
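A.7 Proofs of Technical Lemmas

Proof of Lemma 5.3.1 From the presented expression for $r_{k^*}$ one can deduce that $r_{k^*} = \tilde{R}_{\hat{m}} \mu_{\hat{m}}$. Indeed,
\[
r_{k^*} = \frac{\sum_{y=1}^{k^*} \lambda_y R_y}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}}
= \frac{\sum_{y=1}^{k^*} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}}}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}}}
\]
\[
\Longleftrightarrow \left( \sum_{y=1}^{k^*} \lambda_y R_y \right) \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} = \lambda_{\hat{m}} \tilde{R}_{\hat{m}} \left( 1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y} \right)
\Longleftrightarrow r_{k^*} = \frac{\sum_{y=1}^{k^*} \lambda_y R_y}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y}} = \tilde{R}_{\hat{m}} \mu_{\hat{m}}.
\]
According to Remark 5.3.1(a),
\[
R_{k^*+1} \mu_{k^*+1} \le r_{k^*} = \tilde{R}_{\hat{m}} \mu_{\hat{m}} \le R_{k^*} \mu_{k^*}.
\]
Note that $R_{\hat{m}} \mu_{\hat{m}} < R_{k^*} \mu_{k^*}$ because $\hat{m} > k^*$. (Remember the convention about ordering the products $R_y \mu_y$.) If $\tilde{R}_{\hat{m}} \le R_{\hat{m}}$ then, according to the above inequalities,
\[
R_{k^*+1} \mu_{k^*+1} \le R_{\hat{m}} \mu_{\hat{m}} < R_{k^*} \mu_{k^*}.
\]
Thus $\hat{m} = k^* + 1$ and hence $k^{**} = k^*$. Finally,
\[
\frac{\sum_{y=1}^{k^*} \lambda_y R_y + \lambda_{\hat{m}} R_{\hat{m}}}{1 + \sum_{y=1}^{k^*} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}}} = r_{k^*+1} \ge r_{k^*},
\]
which is impossible due to (5.49). Therefore, $\tilde{R}_{\hat{m}} > R_{\hat{m}}$.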
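The equivalence at the start of this proof is a mediant-type identity: adding one more class to the ratio leaves it unchanged exactly when the product $\tilde{R}_{\hat{m}} \mu_{\hat{m}}$ equals the current ratio. The sketch below checks this with exact rational arithmetic; the arrival rates, service rates and rewards are hypothetical, and the helper names are made up for this example.

```python
from fractions import Fraction as Fr

def ratio(lams, mus, rewards):
    """r = (sum_y lam_y R_y) / (1 + sum_y lam_y / mu_y), as in the proof above."""
    numerator = sum(l * R for l, R in zip(lams, rewards))
    denominator = 1 + sum(l / m for l, m in zip(lams, mus))
    return numerator / denominator

def ratio_with_extra_class(lams, mus, rewards, lam_m, mu_m, reward_m):
    """The same ratio after adding one more class (lam_m, mu_m, reward_m)."""
    return ratio(lams + [lam_m], mus + [mu_m], rewards + [reward_m])

# Hypothetical data for the classes y = 1, ..., k* and for the extra class m.
LAMS = [Fr(2), Fr(3)]
MUS = [Fr(4), Fr(6)]
REWARDS = [Fr(5), Fr(1)]
LAM_M, MU_M = Fr(7), Fr(2)

R_BASE = ratio(LAMS, MUS, REWARDS)
R_TILDE = R_BASE / MU_M  # chosen so that R_TILDE * MU_M equals R_BASE

UNCHANGED = ratio_with_extra_class(LAMS, MUS, REWARDS, LAM_M, MU_M, R_TILDE)
LARGER = ratio_with_extra_class(LAMS, MUS, REWARDS, LAM_M, MU_M, R_TILDE + 1)
SMALLER = ratio_with_extra_class(LAMS, MUS, REWARDS, LAM_M, MU_M, R_TILDE - 1)
```

A reward above the balancing value raises the ratio and a reward below it lowers the ratio, which is the monotonicity used in the last step of the proof.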
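Proof of Lemma 5.3.2 Firstly, note that the required number $\hat{k}$ exists because $d < \frac{\lambda_{\hat{m}}}{\lambda_{\hat{m}} + \mu_{\hat{m}}}$.

Suppose $\hat{m} \ne \hat{k} + 1$. The following inequalities are obvious:
\[
\frac{\sum_{y=1, y \ne \hat{m}}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}}}{1 + \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}}}
\gtrless
\frac{\sum_{y=1, y \ne \hat{m}}^{\hat{k}+1} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}}}{1 + \sum_{y=1, y \ne \hat{m}}^{\hat{k}+1} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}}}
\]
\[
\Longleftrightarrow \frac{\lambda_{\hat{k}+1}}{\mu_{\hat{k}+1}} \left[ \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}} \right]
\gtrless \lambda_{\hat{k}+1} R_{\hat{k}+1} \left[ 1 + \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \right]
\]
\[
\Longleftrightarrow \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}}
\gtrless R_{\hat{k}+1} \mu_{\hat{k}+1} \left[ 1 + \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \right]. \tag{A.11}
\]
The value of $\tilde{R}_{\hat{m}}$ comes from the equality in (A.11). Let us show that $\tilde{R}_{\hat{m}} > R_{\hat{m}}$ as follows. If we substitute $R_{\hat{m}}$ for $\tilde{R}_{\hat{m}}$ in (A.11), then we obtain the following difference between the left and right sides:
\[
\Delta = \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} R_{\hat{m}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \left[ 1 + \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \right].
\]
If $\hat{m} \le \hat{k}$, by (5.47),
\[
\Delta = \left[ 1 + \sum_{y=1}^{\hat{k}} \frac{\lambda_y}{\mu_y} \right] \left( r_{\hat{k}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \right) < 0
\]
because
\[
\hat{k} < k^* \Longrightarrow r_{\hat{k}+1} > r_{\hat{k}} \Longleftrightarrow R_{\hat{k}+1} \mu_{\hat{k}+1} > r_{\hat{k}}
\]
by Theorem 5.3.1(a, d). If $\hat{m} > \hat{k} + 1$,
\[
\Delta = \left[ 1 + \sum_{y=1}^{\hat{k}} \frac{\lambda_y}{\mu_y} \right] \left( r_{\hat{k}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \right) + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \left( R_{\hat{m}} \mu_{\hat{m}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \right) < 0,
\]
since $\hat{m} > \hat{k} + 1$. Therefore, in either case, $\tilde{R}_{\hat{m}} > R_{\hat{m}}$.

If we substitute $\frac{R_{\hat{k}+1} \mu_{\hat{k}+1}}{\mu_{\hat{m}}}$ for $\tilde{R}_{\hat{m}}$ in (A.11), then we obtain the following difference between the left and right sides:
\[
\Delta = \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} \frac{R_{\hat{k}+1} \mu_{\hat{k}+1}}{\mu_{\hat{m}}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \left[ 1 + \sum_{y=1, y \ne \hat{m}}^{\hat{k}} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \right].
\]
If $\hat{m} \le \hat{k}$,
\[
\Delta = \sum_{y=1}^{\hat{k}} \lambda_y R_y - \lambda_{\hat{m}} R_{\hat{m}} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} R_{\hat{k}+1} \mu_{\hat{k}+1} - R_{\hat{k}+1} \mu_{\hat{k}+1} \left[ 1 + \sum_{y=1}^{\hat{k}} \frac{\lambda_y}{\mu_y} \right]
\]
\[
= \left[ 1 + \sum_{y=1}^{\hat{k}} \frac{\lambda_y}{\mu_y} \right] \left( r_{\hat{k}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \right) + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \left( R_{\hat{k}+1} \mu_{\hat{k}+1} - R_{\hat{m}} \mu_{\hat{m}} \right) < 0,
\]
since $\hat{m} < \hat{k} + 1$ and, like previously, $R_{\hat{k}+1} \mu_{\hat{k}+1} > r_{\hat{k}}$ because $\hat{k} < k^*$. If $\hat{m} > \hat{k} + 1$,
\[
\Delta = \sum_{y=1}^{\hat{k}} \lambda_y R_y + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} R_{\hat{k}+1} \mu_{\hat{k}+1} - R_{\hat{k}+1} \mu_{\hat{k}+1} \left[ 1 + \sum_{y=1}^{\hat{k}} \frac{\lambda_y}{\mu_y} + \frac{\lambda_{\hat{m}}}{\mu_{\hat{m}}} \right]
= \left[ 1 + \sum_{y=1}^{\hat{k}} \frac{\lambda_y}{\mu_y} \right] \left( r_{\hat{k}} - R_{\hat{k}+1} \mu_{\hat{k}+1} \right) < 0,
\]
since, like previously, $R_{\hat{k}+1} \mu_{\hat{k}+1} > r_{\hat{k}}$ because $\hat{k} < k^*$. Therefore, in either case, $R_{\hat{k}+1} \mu_{\hat{k}+1} < \tilde{R}_{\hat{m}} \mu_{\hat{m}}$.

Suppose now that $\hat{m} = \hat{k} + 1 < k^*$. Similarly to the above calculations,
\[
\frac{\sum_{y=1}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}}}{1 + \sum_{y=1}^{\hat{k}+1} \frac{\lambda_y}{\mu_y}}
\gtrless
\frac{\sum_{y=1, y \ne \hat{m}}^{\hat{k}+2} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}}}{1 + \sum_{y=1}^{\hat{k}+2} \frac{\lambda_y}{\mu_y}}
\Longleftrightarrow
\sum_{y=1}^{\hat{k}} \lambda_y R_y + \lambda_{\hat{m}} \tilde{R}_{\hat{m}} \gtrless R_{\hat{k}+2} \mu_{\hat{k}+2} \left[ 1 + \sum_{y=1}^{\hat{k}+1} \frac{\lambda_y}{\mu_y} \right] \tag{A.12}
\]
and the value of $\tilde{R}_{\hat{m}}$ comes from the equality in (A.12). If we substitute $R_{\hat{m}}$ for $\tilde{R}_{\hat{m}}$ in (A.12), then we obtain the following difference between the left and right sides:
\[
\Delta = \sum_{y=1}^{\hat{k}+1} \lambda_y R_y - R_{\hat{k}+2} \mu_{\hat{k}+2} \left[ 1 + \sum_{y=1}^{\hat{k}+1} \frac{\lambda_y}{\mu_y} \right]
= \left[ 1 + \sum_{y=1}^{\hat{k}+1} \frac{\lambda_y}{\mu_y} \right] \left( r_{\hat{k}+1} - R_{\hat{k}+2} \mu_{\hat{k}+2} \right) < 0
\]
because
\[
\hat{k} + 1 < k^* \Longrightarrow r_{\hat{k}+2} > r_{\hat{k}+1} \Longleftrightarrow R_{\hat{k}+2} \mu_{\hat{k}+2} > r_{\hat{k}+1}
\]
by Theorem 5.3.1(a, d). Therefore $\tilde{R}_{\hat{m}} > R_{\hat{m}}$.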
Proof of Lemma 7.2.3 In all the cases, f (y) =
y
y
y
y
y
y
z 2 (ln z 2 )(z 1 − z 2 ) − (g + z 2 − 1)(z 1 ln z 1 − z 2 ln z 2 ) y y (z 1 − z 2 )2 y
=:
z1 f 1 (y) y y (z 1 − z 2 )2
with y y z2 z2 y − (g + z 2 − 1) ln z 1 − f 1 (y) = 1− ln z 2 z1 z1 y y z2 z2 y y = z 2 (ln z 2 ) − z 2 (ln z 2 ) − (g − 1) ln z 1 − (ln z 2 ) z1 z1 y z2 y y (ln z 2 ) −z 2 (ln z 1 ) + z 2 z1 y z2 z2 y = z 2 ln − (g − 1) ln z 1 − ln z 2 , z1 z1 y z 2 (ln z 2 )
and y z2 z2 z2 + (g − 1)(ln z 2 ) ln z1 z1 z1 z 1 2 y = z 2 ln (ln z 2 ) 1 + (g − 1) y . z1 z1
f 1 (y) = z 2 (ln z 2 ) ln y
In case of (a), $\lim_{y\to0}f_1(y)=0$ and $f_1'(y)=z_2^y\ln\frac{z_2}{z_1}(\ln z_2)\left(1-\frac{1}{z_1^y}\right)>0$ for each $y\in(0,\infty)$, and thus $f'(y)>0$ for each $y\in(0,\infty)$, as desired.
In case of (b), $\lim_{y\to0}f_1(y)=\ln\frac{z_2}{z_1}-(g-1)(\ln z_1-\ln z_2)=-g\ln\frac{z_1}{z_2}<0$, $\lim_{y\to\infty}f_1(y)=(1-g)\ln z_1>0$, and $f_1'(y)>0$. Thus, $f$ decreases initially and then increases, and has a unique minimum over $(0,\infty)$. For the second assertion, note that the unique minimizer is given by the unique solution to the equation $f_1(y)=0$, i.e.,
\[
z_2^y\ln\frac{z_2}{z_1}=(g-1)\left(\ln z_1-\left(\frac{z_2}{z_1}\right)^y\ln z_2\right).
\]
As $0<g\uparrow1$, the right-hand side of the above equality goes to 0, and so must the left-hand side, and thus $y\to\infty$. Moreover, the derivative of the above implicit function of $y$ in terms of $g$ can be computed as
\[
\frac{dy}{dg}=-\frac{-\ln z_1+\left(\frac{z_2}{z_1}\right)^y(\ln z_2)}{z_2^y(\ln z_2)\ln\frac{z_2}{z_1}+(g-1)\left(\frac{z_2}{z_1}\right)^y(\ln z_2)\ln\frac{z_2}{z_1}}
=\frac{\ln z_1-\left(\frac{z_2}{z_1}\right)^y(\ln z_2)}{z_2^y(\ln z_2)\ln\frac{z_2}{z_1}\left(1+(g-1)\frac{1}{z_1^y}\right)}>0.
\]
In case of (c), $\lim_{y\to0}f_1(y)=-g\ln\frac{z_1}{z_2}<0$, $\lim_{y\to\infty}f_1(y)=(1-g)\ln z_1\le0$, and $f_1'(y)>0$, meaning that the function $f$ decreases over $(0,\infty)$. The last equality is trivial.

Proof of Lemma 7.2.5 For each $g\in[0,1)$, let us recall from Lemma 7.2.4 that the nonnegative integer $c^*(g)$ is specified by the following relation: $c^*(g)=c$ if and only if
\[
\frac{g-1+z_2}{z_1-z_2}>\frac{g-1+z_2^2}{z_1^2-z_2^2}>\dots>\frac{g-1+z_2^c}{z_1^c-z_2^c}>\frac{g-1+z_2^{c+1}}{z_1^{c+1}-z_2^{c+1}}\le\frac{g-1+z_2^{c+2}}{z_1^{c+2}-z_2^{c+2}}<\dots.
\tag{A.13}
\]
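The relation (A.13) can be explored numerically. The following Python sketch is only an illustration: the values $z_1=2$ and $z_2=0.5$ are hypothetical (any $z_1>1>z_2>0$ is assumed to be admissible), and the helper names `f` and `c_star` are not part of the text. It searches for the first index at which the chain in (A.13) stops decreasing.

```python
def f(n, g, z1, z2):
    """Term number n of the chain (A.13): (g - 1 + z2^n) / (z1^n - z2^n)."""
    return (g - 1.0 + z2 ** n) / (z1 ** n - z2 ** n)


def c_star(g, z1, z2, n_max=500):
    """Return c with f(1) > ... > f(c+1) <= f(c+2), for g in [0, 1).

    n_max is an arbitrary cap on the search (large powers of z1 may overflow).
    """
    for n in range(1, n_max):
        if f(n, g, z1, z2) <= f(n + 1, g, z1, z2):
            return n - 1
    raise ValueError("no turning point found below n_max")


if __name__ == "__main__":
    z1, z2 = 2.0, 0.5  # hypothetical values with z1 > 1 > z2 > 0
    for g in (0.1, 0.5, 0.9, 0.99):
        print(g, c_star(g, z1, z2))
```

For these hypothetical values the computed $c^*(g)$ does not decrease as $g$ grows, in line with the step-function behavior discussed in this proof.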
(If $c=0$, the above chain of inequalities starts with $\frac{g-1+z_2}{z_1-z_2}\le\frac{g-1+z_2^2}{z_1^2-z_2^2}$. With this in mind, we do not treat the case of $c=0$ separately.) For each $\varepsilon>0$, since
\[
\frac{\varepsilon}{z_1-z_2}>\frac{\varepsilon}{z_1^2-z_2^2}>\dots>\frac{\varepsilon}{z_1^c-z_2^c}>\frac{\varepsilon}{z_1^{c+1}-z_2^{c+1}}>\frac{\varepsilon}{z_1^{c+2}-z_2^{c+2}},
\]
all the strict inequalities “>” in (A.13) are preserved with $g$ being replaced by $g+\varepsilon$, i.e.,
\[
\frac{g+\varepsilon-1+z_2}{z_1-z_2}>\frac{g+\varepsilon-1+z_2^2}{z_1^2-z_2^2}>\dots>\frac{g+\varepsilon-1+z_2^{c^*(g)}}{z_1^{c^*(g)}-z_2^{c^*(g)}}>\frac{g+\varepsilon-1+z_2^{c^*(g)+1}}{z_1^{c^*(g)+1}-z_2^{c^*(g)+1}},
\]
having written $c=c^*(g)$, whereas “≤” can be violated if $\varepsilon>0$ is large enough. This, together with Lemma 7.2.3, implies that $c^*(g)$ is an increasing step function in $g\in[0,1)$ with unit step size, and $\lim_{g\to1}c^*(g)=\infty$. For the last relation, recall also the original definition of $c^*(g)$ in terms of (7.41) and (7.39).
With the expression of $v((i_0,j_0),g)$ given in Theorem 7.2.1, we see from the above observation that $v((i_0,j_0),g)-gd_1$ is piecewise linear in $g\in[0,\infty)$. Let us investigate the behavior of this function in greater detail.
• For $g\in[0,1)$ such that $c^*(g)+1<j_0$, $v((i_0,j_0),g)-gd_1=(i_0-d_1)g$ and $i_0-d_1>0$ in view of the assumption of (7.28). As $g\uparrow1$, $c^*(g)\uparrow\infty$ as explained earlier.
• For $g\in[0,1)$ such that $c^*(g)+1\ge j_0$, the coefficient in front of $g$ in $v((i_0,j_0),g)-gd_1$ is
\[
\frac{i_0(z_1^{j_0}-z_2^{j_0})}{z_1^{c^*(g)+1}-z_2^{c^*(g)+1}}-d_1,
\]
which is decreasing in $g\in[0,1)$.
• For $g\in[1,\infty)$, $v((i_0,j_0),g)-gd_1=i_0(1-z_2^{j_0})-gd_1$, which is linear in $g\in[1,\infty)$ with the slope $-d_1<0$.
To sum up, we see that $v((i_0,j_0),g)-gd_1$ is strictly increasing at the rate $i_0-d_1>0$ initially (for $g\in[0,1)$ small enough); then, for $g\in[0,1)$ such that $c^*(g)+1\ge j_0$, changes piecewise linearly over disjoint intervals with the rate in each interval
\[
\frac{i_0(z_1^{j_0}-z_2^{j_0})}{z_1^{c^*(g)+1}-z_2^{c^*(g)+1}}-d_1,
\]
which decreases over subsequent intervals; and eventually, for $g\ge1$, decreases with rate $-d_1$. Therefore, the maximum of the function $v((i_0,j_0),g)-gd_1$ over $g\in[0,\infty)$ is achieved in $(0,1]$. In fact, it is sufficient to take the maximum of $v((i_0,j_0),g)-gd_1$ over $g\in(0,1)$, since $c^*(g)\to\infty$ as $g\to1$, and hence, for all $g$ smaller than but close enough to 1,
\[
\frac{i_0(z_1^{j_0}-z_2^{j_0})}{z_1^{c^*(g)+1}-z_2^{c^*(g)+1}}-d_1<0.
\]
Consequently, for a maximizer of $v((i_0,j_0),g)-gd_1$ over $g\in[0,\infty)$, one can take $g^*\in(0,1)$ as the largest value of $g\in(0,1)$ such that
\[
\frac{i_0(z_1^{j_0}-z_2^{j_0})}{z_1^{c^*(g)+1}-z_2^{c^*(g)+1}}-d_1>0.
\tag{A.14}
\]
Since $c^*(g)$ is an increasing step function of $g\in(0,1)$ with unit step size, blowing up to $\infty$ as $g\to1$, we see that $c^*(g^*)$ is the minimal nonnegative integer $c^*$ satisfying the inequality
\[
\frac{z_1^{j_0}-z_2^{j_0}}{z_1^{c^*+2}-z_2^{c^*+2}}\le\frac{d_1}{i_0},
\]
which coincides with (7.47). Consider such an integer $c^*$. For $g\in(0,1)$ to be such that $c^*(g)=c^*$, as recalled in the beginning of this proof, it is necessary and sufficient that
\[
\frac{g-1+z_2}{z_1-z_2}>\frac{g-1+z_2^2}{z_1^2-z_2^2}>\dots>\frac{g-1+z_2^{c^*}}{z_1^{c^*}-z_2^{c^*}}>\frac{g-1+z_2^{c^*+1}}{z_1^{c^*+1}-z_2^{c^*+1}}\le\frac{g-1+z_2^{c^*+2}}{z_1^{c^*+2}-z_2^{c^*+2}}<\dots.
\]
Therefore, the largest value of $g\in(0,1)$ such that $c^*(g)=c^*$ comes from the relation
\[
\frac{g-1+z_2^{c^*+1}}{z_1^{c^*+1}-z_2^{c^*+1}}=\frac{g-1+z_2^{c^*+2}}{z_1^{c^*+2}-z_2^{c^*+2}}.
\]
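The last relation is linear in $g$ and can be solved explicitly. The following Python sketch is an illustration under assumed values $z_1=2$, $z_2=0.5$ (hypothetical, chosen only so that $z_1>1>z_2>0$); the function names `largest_g` and `c_star` are introduced here and do not appear in the text. It solves the equality for $g$ and checks, by a small perturbation, that this $g$ is the point at which the integer defined by the chain of inequalities jumps by one.

```python
def f(n, g, z1, z2):
    """Term number n of the chain: (g - 1 + z2^n) / (z1^n - z2^n)."""
    return (g - 1.0 + z2 ** n) / (z1 ** n - z2 ** n)


def c_star(g, z1, z2, n_max=500):
    """Return c with f(1) > ... > f(c+1) <= f(c+2)."""
    for n in range(1, n_max):
        if f(n, g, z1, z2) <= f(n + 1, g, z1, z2):
            return n - 1
    raise ValueError("no turning point found below n_max")


def largest_g(c, z1, z2):
    """Solve (g-1+a)/A = (g-1+b)/B for g, with exponents c+1 and c+2."""
    a, A = z2 ** (c + 1), z1 ** (c + 1) - z2 ** (c + 1)
    b, B = z2 ** (c + 2), z1 ** (c + 2) - z2 ** (c + 2)
    # Cross-multiplying gives (g - 1) * (B - A) = b * A - a * B.
    return 1.0 + (b * A - a * B) / (B - A)


if __name__ == "__main__":
    z1, z2 = 2.0, 0.5  # hypothetical values
    for c in range(4):
        print(c, largest_g(c, z1, z2))
```

With these values, `largest_g(0, 2.0, 0.5)` equals 1/3 up to rounding, which agrees with solving $(g-0.5)/1.5=(g-0.75)/3.75$ by hand.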
(Again, one may benefit from a reminder of the reasoning in the beginning of this proof.) This value of $g$ is automatically the largest one satisfying inequality (A.14). That is, $g^*$ comes from (7.48). The proof is now complete.

Proof of Lemma 7.3.2 Let us prove that the function $c^G(\cdot,a^g)$ is convex on $[0,\bar x]$. Equality (7.61) can be rewritten as follows:
\[
c^G(x,a^g)=hx+r\lambda\int_{(0,\infty)}u(x,z)\,dF(z),
\]
where $u(x,z):=-z\,I\{z\le x\}-x\,I\{z>x\}$. For each fixed $z\in(0,\infty)$, the piecewise linear function $u(\cdot,z)$ satisfies the inequality
\[
(1-\mu)u(x,z)+\mu u(y,z)-u((1-\mu)x+\mu y,z)\ge0
\]
for all $\mu\in[0,1]$ and $x,y\in[0,\bar x]$. After we integrate this inequality, we obtain that
\[
(1-\mu)\int_{(0,\infty)}u(x,z)\,dF(z)+\mu\int_{(0,\infty)}u(y,z)\,dF(z)-\int_{(0,\infty)}u((1-\mu)x+\mu y,z)\,dF(z)\ge0
\]
for all $\mu\in[0,1]$ and $x,y\in[0,\bar x]$. All the integrals here are finite. As a result,
\[
(1-\mu)c^G(x,a^g)+\mu c^G(y,a^g)-c^G((1-\mu)x+\mu y,a^g)\ge0\quad\forall\,\mu\in[0,1],\ x,y\in[0,\bar x].
\]
Therefore, the finite-valued function $c^G(\cdot,a^g)$ is convex by Proposition B.2.2. Clearly, $c^G(0,a^g)=0$; hence
\[
c^G(x,a^g)\le0\quad\forall\,x\in[0,\bar x].
\]
If $S^*<\bar x$, then $c^G(\cdot,a^g)$ is nondecreasing on $[S^*,\bar x]$ by Lemma B.2.4. The proof is now complete.
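As a numerical illustration of the convexity argument in the proof of Lemma 7.3.2, the following Python sketch evaluates $c^G(\cdot,a^g)$ for an exponential distribution $F(z)=1-e^{-\theta z}$, for which $\int_{(0,\infty)}u(x,z)\,dF(z)=-(1-e^{-\theta x})/\theta$. The exponential form of $F$, the parameter values, and the helper names are assumptions made only for this example.

```python
import math


def c_G(x, h, r, lam, theta):
    """c^G(x, a^g) = h*x + r*lam * integral of u(x, z) dF(z), exponential F."""
    expected_min = (1.0 - math.exp(-theta * x)) / theta  # E[min(x, Z)]
    return h * x - r * lam * expected_min


def is_convex_on_grid(func, xs, tol=1e-12):
    """Check (1-mu) f(x) + mu f(y) - f((1-mu) x + mu y) >= 0 on a grid."""
    for x in xs:
        for y in xs:
            for mu in (0.25, 0.5, 0.75):
                gap = ((1.0 - mu) * func(x) + mu * func(y)
                       - func((1.0 - mu) * x + mu * y))
                if gap < -tol:
                    return False
    return True


if __name__ == "__main__":
    h, r, lam, theta, x_bar = 1.0, 2.0, 3.0, 1.0, 5.0  # hypothetical values
    grid = [x_bar * i / 50 for i in range(51)]
    print(is_convex_on_grid(lambda x: c_G(x, h, r, lam, theta), grid))
```

The check is a finite-grid test of the defining inequality, not a proof; it only mirrors the inequality integrated in the argument above.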
Proof of Lemma 7.3.3 (a) After we extend the space A I to A I := [0, x], ¯ we obtain the DTMDP M satisfying Condition C.2.1 and hence all the requirements of Proposition C.2.8. Fix the corresponding uniformly optimal strategy ϕ as in Proposition C.2.8(c). The value ϕ() at the cemetery can be arbitrary. According to Proposition C.2.4, for all x ∈ X, G λ c (x, a g ) + W DT ∗ (x) = min λ+α λ+α × W DT ∗ (x − z)d F(z) + W DT ∗ (0)(1 − F(x)) , (0,x] inf K + c O [b ∧ (x¯ − x)] + W DT ∗ ((x + b) ∧ x) ¯
(A.15)
b∈A I ⎧ G
c (x, a g ) λ ⎪ ⎪ + ⎪ ⎪ ⎪ λ + α λ + α ⎪ ⎪ ⎪ DT ∗ (x − z)d F(z) + W DT ∗ (0)(1 − F(x)) , ⎪ × W ⎨ (0,x] = if ϕ(x) = a g ; ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ K + c O [b ∧ (x¯ − x)] + W DT ∗ ((x + b) ∧ x), ¯ ⎪ ⎪ ⎩ if ϕ(x) = b ∈ A I , where W DT ∗ is the Bellman function in the model M. Note also that the function W DT ∗ is the minimal nonnegative lower semianalytic solution to the optimality equation (A.15) and in fact W DT ∗ is lower semicontinuous and bounded, since the function c G (·, a g ) is bounded. Obviously, for all x ∈ X, ϕ(x) = 0 ∈ A I because K > 0. Therefore, the above equations also hold if we replace A I with A I , that is, the function W DT ∗ satisfies the optimality equation for the model M+ . Since A I ⊂ A I , the Bellman function W DT ∗ in the model M+ cannot be smaller than W DT ∗ . Therefore, the minimal lower semianalytic solution to the optimality equation in the model M+ (i.e., the Bellman function W DT ∗ ) coincides with W DT ∗ and, as explained above, is bounded and lower semicon tinuous. The strategy ϕ is uniformly optimal in the DTMDP M+ by Proposition C.2.4(c). G g G ) (0,a g ) ≥ c λ+α . If the cost function is increasing, the (b) Note that l(0, a g ) = c (0,a α Bellman function cannot decrease. Applying the same reasoning as above, we conclude that the function W DT ∗ satisfies the optimality equation in the model M+ . Proposition C.2.4(c) implies that, in this model, W DT ∗ is the Bellman function and the mapping ϕ identifies a uniformly optimal strategy. The proof is now complete.
Proof of Lemma 7.3.4 Firstly note that Condition 7.3.2 is satisfied for the function h. Therefore, the Bellman function W DT h is bounded, nonnegative and lower semicontinuous by Lemma 7.3.3. Secondly, let us show that W DT h (x) ≥ c O · x ∀ x ∈ X. Clearly, since c G (·, a g ) ≥ 0, the Bellman function W DT h is greater than or equal to the Bellman function in the model Mh with the cost function l being replaced by
\[
l(x,a)=\begin{cases}
\dfrac{g(x)}{\lambda+\alpha}, & \text{if } x\in X,\ a=a^g;\\[1ex]
K, & \text{if } x\in X,\ a\in A_I;\\[1ex]
0, & \text{if } x=\Delta,
\end{cases}
\]
where
\[
g(x):=c^Ox(\lambda+\alpha)-\lambda c^OxF(x)+\lambda c^O\int_{(0,x]}z\,dF(z)
=c^Ox\alpha+\lambda c^O\int_{(0,x]}z\,dF(z)+c^O\lambda x[1-F(x)]\ge0.
\]
The latter Bellman function is nonnegative, lower semicontinuous by Lemma 7.3.3, and coincides with the minimal nonnegative lower semicontinuous solution $\tilde W$ to the optimality equation
\[
\tilde W(x)=\min\left\{\frac{g(x)}{\lambda+\alpha}+\frac{\lambda}{\lambda+\alpha}\left[\int_{(0,x]}\tilde W(x-z)\,dF(z)+\tilde W(0)(1-F(x))\right];\ \inf_{b\in(0,\bar x]}\left\{K+\tilde W((x+b)\wedge\bar x)\right\}\right\},
\tag{A.16}
\]
and we will show that $\tilde W(x)=c^O\cdot x$. Note that
\[
\int_{(0,x]}z\,dF(z)+x[1-F(x)]=\int_{(0,\infty)}u_x(z)\,dF(z),
\]
where $u_x(z):=z\,I\{z\le x\}+x\,I\{z>x\}$ is a function, nondecreasing in $x$. Therefore, the function $g$ is increasing.
The Bellman function $\tilde W$ can be built by the usual successive approximations starting from $\tilde W_0(x)\equiv0$. Clearly, $\tilde W_n(0)=0$ for all $n=0,1,\dots$. If $\tilde W_n(\cdot)$ is nondecreasing, then, for all $0\le x<y\le\bar x$,
\[
g(y)+\lambda\int_{(0,y]}\tilde W_n(y-z)\,dF(z)-g(x)-\lambda\int_{(0,x]}\tilde W_n(x-z)\,dF(z)
>\lambda\left[\int_{(0,x]}\left(\tilde W_n(y-z)-\tilde W_n(x-z)\right)dF(z)+\int_{(x,y]}\tilde W_n(y-z)\,dF(z)\right]\ge0
\]
by the inductive supposition and because $\tilde W_n(\cdot)\ge0$. We conclude that the minimal nonnegative lower semicontinuous solution to Eq. (A.16) is nondecreasing, meaning that the case $\tilde W(x)=K+\inf_{b\in(0,\bar x]}\tilde W((x+b)\wedge\bar x)$ is excluded for all $x\in X$, and, in fact, for the minimal nonnegative solution $\tilde W$, now one can replace the optimality equation (A.16) by
\[
\tilde W(x)=\frac{g(x)}{\lambda+\alpha}+\frac{\lambda}{\lambda+\alpha}\int_{(0,x]}\tilde W(x-z)\,dF(z).
\]
The operator on the right-hand side is a contraction in the space of measurable bounded functions on $X$ with the uniform norm. (The proof is similar to the proof of Lemma 7.3.5.) Direct substitution shows that the unique bounded measurable solution is given by the formula $\tilde W(x)=c^O\cdot x$.
(a) Let us check that a nonnegative lower semicontinuous function $V$ on $X$ satisfies the optimality equation in the model $M^+$, i.e.,
\[
V(x)=\min\left\{\frac{c^G(x,a^g)}{\lambda+\alpha}+\frac{\lambda}{\lambda+\alpha}\left[\int_{(0,x]}V(x-z)\,dF(z)+V(0)(1-F(x))\right];\ \inf_{b\in(0,\bar x]}\left\{K+c^O[b\wedge(\bar x-x)]+V((x+b)\wedge\bar x)\right\}\right\}
\tag{A.17}
\]
if and only if the nonnegative lower semicontinuous function $W(x):=V(x)+c^O\cdot x$ satisfies the optimality equation (7.65). To do that, suppose that such a function $V$ satisfies Eq. (A.17) and substitute $W(x):=V(x)+c^O\cdot x$ in the right-hand side of Eq. (7.65):
\[
\begin{aligned}
T_1\circ W(x)
&=\min\Biggl\{\frac{c^G(x,a^g)}{\lambda+\alpha}+\frac{\lambda}{\lambda+\alpha}\left[\int_{(0,x]}V(x-z)\,dF(z)+V(0)(1-F(x))\right]\\
&\qquad\quad+c^Ox-\frac{\lambda c^OxF(x)}{\lambda+\alpha}+\frac{\lambda c^O}{\lambda+\alpha}\int_{(0,x]}z\,dF(z)+\frac{\lambda c^O}{\lambda+\alpha}\int_{(0,x]}(x-z)\,dF(z);\\
&\qquad\quad\inf_{b\in(0,\bar x]}\left\{K+c^O[(x+b)\wedge\bar x]+V((x+b)\wedge\bar x)\right\}\Biggr\}\\
&=\min\Biggl\{\frac{c^G(x,a^g)}{\lambda+\alpha}+\frac{\lambda}{\lambda+\alpha}\left[\int_{(0,x]}V(x-z)\,dF(z)+V(0)(1-F(x))\right]+c^O\cdot x;\\
&\qquad\quad\inf_{b\in(0,\bar x]}\left\{K+c^O\cdot x+c^O\cdot[b\wedge(\bar x-x)]+V((x+b)\wedge\bar x)\right\}\Biggr\}\\
&=V(x)+c^O\cdot x=W(x),
\end{aligned}
\]
where the first equality holds because
\[
-xF(x)+\int_{(0,x]}z\,dF(z)+\int_{(0,x]}(x-z)\,dF(z)=0
\]
and $[(x+b)\wedge\bar x]=x+[b\wedge(\bar x-x)]$. If a nonnegative lower semicontinuous function $W$ satisfies Eq. (7.65), then, for all $x\in X$, $W(x)\ge W^{DT\,h}(x)\ge c^O\cdot x$, and similar calculations show that the nonnegative lower semicontinuous function $V(x):=W(x)-c^O\cdot x$ satisfies Eq. (A.17). Therefore, the minimal nonnegative lower semicontinuous functions $V$ and $W$ solving the corresponding Eqs. (A.17) and (7.65) are connected as $W(x)=V(x)+c^O\cdot x$, and the statement (a) follows.
(b) This assertion follows from Proposition C.2.4(c) and calculations similar to those presented above. The proof is now complete.
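The direct substitution used in the proof of Lemma 7.3.4 (that $\tilde W(x)=c^O\cdot x$ solves the reduced equation with the function $g$) can also be checked numerically. The Python sketch below is an illustration only: it assumes an exponential distribution $F(z)=1-e^{-\theta z}$, hypothetical parameter values, and a simple midpoint quadrature; none of these choices come from the text.

```python
import math


def integrate(func, a, b, n=20000):
    """Composite midpoint rule on [a, b]."""
    if b <= a:
        return 0.0
    step = (b - a) / n
    return step * sum(func(a + (i + 0.5) * step) for i in range(n))


def rhs(W, x, c_O, lam, alpha, theta):
    """g(x)/(lam+alpha) + lam/(lam+alpha) * integral of W(x-z) dF(z) over (0, x]."""
    pdf = lambda z: theta * math.exp(-theta * z)  # assumed density of F
    F_x = 1.0 - math.exp(-theta * x)
    mean_part = integrate(lambda z: z * pdf(z), 0.0, x)
    g = c_O * x * alpha + lam * c_O * mean_part + c_O * lam * x * (1.0 - F_x)
    conv = integrate(lambda z: W(x - z) * pdf(z), 0.0, x)
    return (g + lam * conv) / (lam + alpha)


if __name__ == "__main__":
    c_O, lam, alpha, theta = 2.0, 3.0, 0.5, 1.0  # hypothetical values
    W = lambda x: c_O * x
    for x in (0.0, 0.5, 2.0, 5.0):
        print(x, W(x), rhs(W, x, c_O, lam, alpha, theta))
```

Up to quadrature error, the two printed columns agree, which is the content of the substitution.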
Proof of Lemma 7.3.5 For two arbitrary bounded measurable functions $V_1$ and $V_2$, we have
\[
\begin{aligned}
G\circ V_1(x)&=G\circ V_2(x)+\frac{\lambda}{\lambda+\alpha}\left[\int_{(0,x]}(V_1-V_2)(x-z)\,dF(z)+(V_1-V_2)(0)(1-F(x))\right]\\
&\le G\circ V_2(x)+\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|.
\end{aligned}
\tag{A.18}
\]
Similarly, for each $b\in(0,\bar x]$,
\[
K+G\circ V_1((x+b)\wedge\bar x)\le K+G\circ V_2((x+b)\wedge\bar x)+\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|,
\]
so that, under an arbitrarily fixed $\tilde b\in(0,\bar x]$,
\[
K+\inf_{b\in(0,\bar x]}G\circ V_1((x+b)\wedge\bar x)\le K+G\circ V_2((x+\tilde b)\wedge\bar x)+\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|,
\]
leading to the inequality
\[
K+\inf_{b\in(0,\bar x]}G\circ V_1((x+b)\wedge\bar x)\le K+\inf_{b\in(0,\bar x]}G\circ V_2((x+b)\wedge\bar x)+\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|.
\tag{A.19}
\]
From (A.18) and (A.19) we deduce that
\[
T_2\circ V_1(x)\le T_2\circ V_2(x)+\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|
\]
and, by the symmetry,
\[
T_2\circ V_2(x)\le T_2\circ V_1(x)+\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|.
\]
The desired inequality
\[
\forall\,x\in X\quad|T_2\circ V_1(x)-T_2\circ V_2(x)|\le\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|
\implies\|T_2\circ V_1-T_2\circ V_2\|\le\frac{\lambda}{\lambda+\alpha}\|V_1-V_2\|
\]
is now obvious. The proof is complete.
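The contraction property can be observed on a discretized analogue of the operators $G$ and $T_2$. The Python sketch below is not the model of the text: it assumes integer states $0,1,\dots,N$ (so $\bar x=N$), integer-valued demand with a hypothetical probability mass function `pmf`, integer order sizes $b=1,\dots,N$, and arbitrary values of $h$, $\lambda$, $\alpha$, $K$. Under these assumptions it checks the bound with factor $\lambda/(\lambda+\alpha)$ and iterates $T_2$ to a fixed point.

```python
import random


def G(V, h, pmf, lam, alpha):
    """Discrete analogue of G: pmf[z-1] = P(Z = z), z = 1..len(pmf)."""
    out = []
    for x in range(len(V)):
        m = min(x, len(pmf))
        served = sum(V[x - z] * pmf[z - 1] for z in range(1, m + 1))
        F_x = sum(pmf[:m])  # P(Z <= x)
        out.append((h[x] + lam * (served + V[0] * (1.0 - F_x))) / (lam + alpha))
    return out


def T2(V, h, pmf, lam, alpha, K):
    """min{ G V(x) ; K + min over b of G V((x + b) ^ N) }."""
    N = len(V) - 1
    GV = G(V, h, pmf, lam, alpha)
    return [min(GV[x], K + min(GV[min(x + b, N)] for b in range(1, N + 1)))
            for x in range(N + 1)]


def sup_dist(U, V):
    return max(abs(u - v) for u, v in zip(U, V))


def fixed_point(h, pmf, lam, alpha, K, tol=1e-10, max_iter=10000):
    """Successive approximations V <- T2 V starting from zero."""
    V = [0.0] * len(h)
    for _ in range(max_iter):
        V_next = T2(V, h, pmf, lam, alpha, K)
        if sup_dist(V, V_next) < tol:
            return V_next
        V = V_next
    return V


if __name__ == "__main__":
    N, lam, alpha, K = 10, 2.0, 0.5, 3.0  # hypothetical values
    h = [abs(x - 4.0) for x in range(N + 1)]
    pmf = [0.25, 0.5, 0.25]
    random.seed(1)
    V1 = [random.uniform(0, 10) for _ in range(N + 1)]
    V2 = [random.uniform(0, 10) for _ in range(N + 1)]
    lhs = sup_dist(T2(V1, h, pmf, lam, alpha, K), T2(V2, h, pmf, lam, alpha, K))
    print(lhs, lam / (lam + alpha) * sup_dist(V1, V2))
```

The first printed number should not exceed the second, which is the discrete counterpart of the inequality established in Lemma 7.3.5.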
Proof of Lemma 7.3.6 (a) It is sufficient to show that, in the space X, W DT h = T2 ◦ W DT h . (Note that the function W DT h is lower semicontinuous and bounded by Corollary 7.3.1.) Obviously, for all x ∈ X, ¯ W DT h (x) ≤ G ◦ W DT h (x) and W DT h (x) ≤ K + inf W DT h ((x + b) ∧ x). b∈(0,x] ¯
(A.20) Consider a deterministic stationary strategy ϕ, uniformly optimal in the model Mh , which exists by Lemmas 7.3.3(a) and 7.3.4(b). According to Proposition C.2.4(c), the mapping ϕ provides the infimum in the equation W DT h = T1 ◦ W DT h ,
(A.21)
so that, if ϕ(x) ∈ A I , then ¯ = inf W DT h ((x + b) ∧ x). W DT h ((x + ϕ(x)) ∧ x) b∈(0,x] ¯
(A.22)
Let us show that, if ϕ(x) ∈ A I , ¯ = G ◦ W DT h ((x + ϕ(x)) ∧ x). ¯ W DT h ((x + ϕ(x)) ∧ x)
(A.23)
If this equality does not hold for some x ∈ X, then, for y := (x + ϕ(x)) ∧ x, ¯ we necessarily have ¯ W DT h (y) = K + W DT h ((y + ϕ(y)) ∧ x). ¯ =K+ Hence, y = x¯ because, otherwise, we would have had W DT h (x) ¯ which is impossible. Therefore, y = x + ϕ(x) < x¯ and W DT h (x), ¯ < K + W DT h ((y + ϕ(y)) ∧ x) ¯ = W DT h (y). W DT h ((x + ϕ(x) + ϕ(y)) ∧ x) The obtained inequality contradicts (A.22): there exists b := (ϕ(x) + ϕ(y)) ∧ x¯ such that W DT h (x + ϕ(x)) ∧ x) ¯ = W DT h (x + ϕ(x)) = W DT h (y) > W DT h ((x + b) ∧ x). ¯
Equality (A.23) is proved. Also, if ϕ(x) ∈ A I , then, by (A.23), ¯ = K + G ◦ W DT h ((x + ϕ(x)) ∧ x), ¯ W DT h (x) = K + W DT h ((x + ϕ(x)) ∧ x) (A.24) and we intend to show that in this case W DT h (x) = K + inf
b∈(0,x] ¯
&
' G ◦ W DT h ((x + b) ∧ x) ¯ .
(A.25)
Indeed, if there is a b ∈ (0, x] ¯ such that ¯ < G ◦ W DT h ((x + ϕ(x)) ∧ x), ¯ G ◦ W DT h ((x + b) ∧ x) then ¯ ≤ K + G ◦ W DT h ((x + b) ∧ x) ¯ W DT h (x) ≤ K + W DT h ((x + b) ∧ x) DT h < K +G◦W ((x + ϕ(x)) ∧ x) ¯ = W DT h (x). Here, the first two inequalities are by (A.20), and the last equality is by (A.24). The obtained contradiction proves (A.25).
Suppose W DT h (x) < G ◦ W DT h (x). Then ϕ(x) ∈ A I because the mapping ϕ provides the infimum in the Eq. (A.21), and hence equality (A.25) holds. Suppose W DT h (x) = G ◦ W DT h (x). Then W DT h (x) ≤ K + inf W DT h ((x + b) ∧ x) ¯ ≤ K + inf G ◦ W DT h ((x + b) ∧ x). ¯ b∈(0,x] ¯
b∈(0,x] ¯
Both inequalities follow from (A.20). Therefore, in any case W
DT h
(x) = min G ◦ W
DT h
(x); K + inf G ◦ W b∈(0,x] ¯
DT h
((x + b)x) ¯
= T2 ◦ W DT h (x) for all x ∈ X. (b) (i) Suppose a measurable mapping ϕ : X → A defines a uniformly optimal strategy in the model Mh . Then, according to Proposition C.2.4(c), it provides the infimum in the optimality equation W DT h = T1 ◦ W DT h . Therefore, for each x ∈ X, • if ϕ(x) = a g , then W DT h (x) = G ◦ W DT h (x); ¯ • if ϕ(x) ∈ A I , then W DT h (x) = K + G ◦ W DT h ((x + ϕ(x)) ∧ x) by (A.24). The requirements, formulated in the lemma, are fulfilled according to part (a). (ii) Suppose now that a measurable mapping ϕ : X → A satisfies the requirements formulated in the lemma. If ϕ(x) ∈ A I , we have ¯ V (x) = K + G ◦ V ((x + ϕ(x)) ∧ x) ¯ = K + inf G ◦ V ((x + b) ∧ x). b∈(0,x] ¯
(A.26) G ◦ V ((x + b) ∧ The second equality holds because V (x) ≤ K + inf b∈(0,x] ¯ x). ¯ According to part (a), since the function V = W DT h also satisfies the equation V = T1 ◦ V , G ◦ V ((x + ϕ(x)) ∧ x) ¯ ≥ V ((x + ϕ(x)) ∧ x). ¯ Actually, we have the equality in this expression because, otherwise, according to (A.26), ¯ V ((x + ϕ(x)) ∧ x) ¯ < inf G ◦ V ((x + b) ∧ x) b∈(0,x] ¯
¯ < inf G ◦ V ((x + b) ∧ x) ¯ =⇒ inf V ((x + b) ∧ x) b∈(0,x] ¯
and, since V = T1 ◦ V ,
b∈(0,x] ¯
V (x) ≤ K + inf V ((x + b) ∧ x) ¯ < K + inf G ◦ V ((x + b) ∧ x), ¯ b∈(0,x] ¯
b∈(0,x] ¯
which contradicts (A.26). Therefore, G ◦ V ((x + ϕ(x)) ∧ x) ¯ = V ((x + ϕ(x)) ∧ x) ¯ and, according to (A.26), V (x) = K + V ((x + ϕ(x)) ∧ x), ¯ i.e., the mapping ϕ provides the infimum in the optimality equation (7.65) W DT h (x) = V (x) = T1 ◦ V (x) = T1 ◦ W DT h (x). If ϕ(x) = a g , V (x) = G ◦ V (x), i.e., again the mapping ϕ provides the infimum in the optimality equation W DT h (x) = T1 ◦ W DT h (x). Therefore, the mapping ϕ defines a uniformly optimal strategy in the model Mh by Proposition C.2.4(c). The proof is complete. Proof of Lemma 7.3.7 Consider the following function V1 on [0, S):
V1 (x) :=
⎧ g g ⎪ ˆ := K (λ + α) + h(S, a ) = h(s, a ) , if x < s; ⎪ V ⎪ ⎨ α α ⎪ ⎪ h(x, a g ) λ ˆ ⎪ ⎩ + V, λ+α λ+α
if s ≤ x < S.
This function is continuous according to the definitions of Vˆ and of s given in Condition 7.3.3(c). It is nonincreasing due to Condition 7.3.3(b) and, for all x ∈ [0, S), V1 (x) >
λ ˆ λ ˆ h(S, a g ) α Vˆ − K (λ + α) + + V = V = Vˆ − K . λ+α λ+α λ+α λ+α
The first equality is by the definition of Vˆ . Therefore, Vˆ − K < V1 (x) ≤ Vˆ ∀ x ∈ [0, S).
(A.27)
Note also that G ◦ V1 (x) λ h(x, a g ) + V1 (x − z) d F(z) + V1 (0)(1 − F(x)) = λ+α λ + α [S + −s,x] λ ˆ h(x, a g ) + = V ∀ x ∈ [0, S) (A.28) λ+α λ+α
because V1 (y) ≡ Vˆ for all y ∈ [0, (S − S + + s) ∨ 0] ⊆ [0, (S − S + s)] = [0, s]; see also Condition 7.3.3(d). ¯ equal to the unique bounded measurable We introduce the function V2 on [S, x], solution to the equation λ h(x, a g ) + V2 (x) = V2 (x − z) d F(z) λ+α λ + α (0,x−S] + V1 (x − z) d F(z) + V1 (0)(1 − F(x)) . (x−S,x]
The operator in the right-hand side is a contraction with respect to V2 in the space of bounded measurable functions on [S, x] ¯ with the uniform norm: the proof is similar to the proof of Lemma 7.3.5. If x = S, then, according to Condition 7.3.3(d), V2 (S) λ h(S, a g ) + V1 (S − z) d F(z) + V1 (0)(1 − F(S)) = λ+α λ + α [S + −s,S] h(S, a g ) λ ˆ (A.29) = + V = lim V1 (y). y→S− λ+α λ+α The second equality holds because, for y ≤ S − S + + s < s, V1 (y) = Vˆ . The last equality holds because the function h(·, a g ) is continuous. Note that V2 = V2 + V2 , where V2 is the unique bounded measurable solution to the equation V2 (x) =
λ h(x, a g ) + λ+α λ+α
(0,x−S]
V2 (x − z) d F(z)
and V2 is the unique bounded measurable solution to the equation V2 (x) = +
λ λ+α
(x−S,x]
(0,x−S]
V2 (x − z) d F(z)
V1 (x − z) d F(z) + V1 (0)(1 − F(x)) .
In both equations, the right-hand side operators are contractions in the space of bounded measurable functions on [S, x] ¯ with the uniform norm. (The proof is similar to the proof of Lemma 7.3.5.) Let us show that the function V2 is nondecreasing. It can be built using the successive approximations:
V2,0 (x) ≡ 0; λ h(x, a g ) + V (x − z) d F(z), n = 0, 1, . . . . V2,n+1 (x) = λ+α λ + α (0,x−S] 2,n Clearly, V2,n ↑ V2 because h(·, a g ) ≥ 0. The function V2,0 is obviously nonde creasing. Suppose the function V2,n is nondecreasing for some n ≥ 0. Then, for all S ≤ x < x + y ≤ x, ¯ since the function h(·, a g ) is nondecreasing by Condition 7.3.3(b), (x + y) − V2,n+1 (x) V2,n+1
λ V (x + y − z) − V2,n (x − z) d F(z) ≥ λ + α (0,x−S] 2,n + V2,n (x + y − z) d F(z) ≥ 0 (x−S,x+y−S]
by the inductive supposition and because V2,n ≥ 0. We conclude that the function V2 is nondecreasing. Let us show that
λ λ ˆ (Vˆ − K ) < V2 (x) ≤ V ∀ x ∈ [S, x]. ¯ λ+α λ+α
(A.30)
¯ (and for x, ¯ too, if S + (S + − These inequalities hold for x ∈ [S, (S + (S + − s)) ∧ x) s) > x) ¯ because, for such values of x, V2 (x)(x − z) d F(z) = 0 (0,x−S]
by Condition 7.3.3(d), and V2 (x) =
λ λ+α
(x−S,x]
V1 (x − z) d F(z) + V1 (0)(1 − F(x))
leading to inequalities (A.30) according to (A.27). Suppose inequalities (A.30) hold for all x ∈ [S, (S + n(S + − s))) for some ¯ Then, for x ∈ [S + n(S + − s), (S + (n + n ≥ 1, assuming that S + n(S + − s) ≤ x. + ¯ (and for x, ¯ too, if S + (n + 1)(S + − s) > x) ¯ we have 1)(S − s)) ∧ x), V2 (x) +
λ = λ+α
(x−S,x]
[S + −s,x−S]
V2 (x − z) d F(z)
V1 (x − z) d F(z) + V1 (0)(1 − F(x)) .
λ The functions V2 and V1 in these integrals take values from the interval λ+α (Vˆ − K ), λ ˆ V and from (Vˆ − K , Vˆ ] correspondingly by the inductive supposition and by λ+α
(A.27). Therefore, inequalities (A.30) are valid for x ∈ [S + n(S + − s), (S + (n + ¯ and for x, ¯ too, if S + (n + 1)(S + − s) > x. ¯ Since S + − s > 0, for 1)(S + − s)) ∧ x) + ¯ and at this step inequalities (A.30) hold some n ≥ 1, we have S + n(S − s) > x, ¯ = [S, x), ¯ as well as for x. ¯ for all s ∈ [S, (S + n(S + − s)) ∧ x) We have proved that the function V2 = V2 + V2 is the sum of an increasing function and a function with bounded fluctuations. The important consequence from here is that λ K ∀ S ≤ x < x + b ≤ x, ¯ λ+α
(A.31)
V2 (x) < K + inf V2 ((x + b) ∧ x) ¯ ∀ x ∈ [S, x]. ¯
(A.32)
V2 (x + b) − V2 (x) ≥ − and
b∈(0,x] ¯
Let us show that V2 (x) ≥ V2 (S) =
λ ˆ h(S, a g ) + ¯ V = lim V1 (y) ∀ x ∈ [S, x]. y→S− λ+α λ+α
(A.33)
According to Condition 7.3.3(d), for all x ∈ [S, S + ], h(x, a g ) λ V1 (x − z) d F(z) + V1 (0)(1 − F(x)) + λ+α λ + α (x−S,x] g λ h(x, a ) + = V1 (x − z) d F(z) + V1 (0)(1 − F(x)) λ+α λ + α [S + −s,x] g λ ˆ h(x, a ) + = V. (A.34) λ+α λ+α
V2 (x) =
The third equality holds because, for all x − z ≤ S + − (S + − s) = s, V1 (x − z) = Vˆ . Now the inequality in (A.33) follows from Condition 7.3.3(b). See also (A.29). If x > S + , V2 (x) = V2 (S + ) + (V2 (x) − V2 (S + )) ≥ V2 (S + ) −
λ K λ+α
λ ˆ h(S + , a g ) λ + K V− λ+α λ+α λ+α λ ˆ h(S, a g ) + λK λ + K = V2 (S). = V− λ+α λ+α λ+α
=
Here, the inequality is by (A.31), the second equality is by (A.34) at x = S + , and the third equality is according to Condition 7.3.3(c): here h(x, ¯ a g ) > h(S, a g ) + λK , + ¯ The last equality is by (A.34) at x = S. since S < x ≤ x. Let us prove that the function V˜ on [0, x], ¯ defined by V˜ (x) := I{x < S}V1 (x) + I{x ≥ S}V2 (x), satisfies the equation V˜ = T2 ◦ V˜ and, therefore, coincides with V . Simultaneously, we will prove that the mapping ϕ satisfies requirements (7.66). Obviously, for all x ∈ [S, x], ¯ in accordance with the definition of the function V2 , V˜ (x) = V2 (x) = G ◦ V˜ (x).
(A.35)
(i) Consider the case S ≤ x ≤ x. ¯ According to (A.32) and (A.35), ¯ V˜ (x) = G ◦ V˜ (x) < K + inf V˜ ((x + b) ∧ x). b∈(0,x] ¯
Therefore, V˜ (x) = T2 ◦ V˜ (x), and the requirement (7.66), the case of ϕ(x) = a g , is fulfilled for the function V˜ . (ii) Consider the case 0 ≤ x < s. According to (A.28), λ ˆ h(x, a g ) + G ◦ V˜ (x) = G ◦ V1 (x) = V; λ+α λ+α hence λ ˆ λ ˆ h(S, a g ) h(s, a g ) + +K+ V = V G ◦ V˜ (x) ≥ λ+α λ+α λ+α λ+α = K + V2 (S) = K + V˜ (S) = K + G ◦ V˜ (S). The inequality is by Condition 7.3.3(b), the first equality is by the definition of s (see Condition 7.3.3(c)), the second equality is by (A.29), and the last equality is by (A.35). Note also that, for y ∈ (x, S), we have G ◦ V˜ (y) = G ◦ V1 (y) = = V2 (S) = G ◦ V˜ (S).
h(y, a g ) λ ˆ h(S, a g ) λ ˆ + V > + V λ+α λ+α λ+α λ+α
Here, we used formula (A.28), Condition 7.3.3(b) and equalities (A.29), (A.35). Additionally, for y ∈ [S, x], ¯ we have, according to (A.35) and (A.33):
G ◦ V˜ (y) = V2 (y) ≥ V2 (S) = G ◦ V˜ (S). Therefore, for the values x < s under consideration (and also for all x ∈ [0, S)), we have: inf G ◦ V˜ ((x + b) ∧ x) ¯ = G ◦ V˜ (S) = G ◦ V˜ (x + ϕ(x))
b∈(0,x] ¯
(A.36)
and h(s, a g ) λ ˆ h(s, a g ) = + V = K + G ◦ V˜ (S) V˜ (x) = α λ+α λ+α ¯ = K + G ◦ V˜ (x + ϕ(x)) = K + inf G ◦ V˜ ((x + b) ∧ x). b∈(0,x] ¯
Here, the second equality can be checked straightforwardly for Vˆ := and the third equality is by (A.36). Furthermore,
h(s,a g ) , α
¯ G ◦ V˜ (x) ≥ K + G ◦ V˜ (S) = K + inf G ◦ V˜ ((x + b) ∧ x). b∈(0,x] ¯
Here, the inequality is by (A.36). The equality V˜ (x) = T2 ◦ V˜ (x) is proved, and V˜ (x) = K + G ◦ V˜ ((x + ϕ(x)) ∧ x). ¯ Recall that here ϕ(x) = S − x ∈ A I , so that requirement (7.66) is fulfilled for the function V˜ . (iii) Consider the case s ≤ x < S. According to (A.28), λ ˆ h(x, a g ) + G ◦ V˜ (x) = G ◦ V1 (x) = V λ+α λ+α g λ ˆ h(s, a ) + V = K + G ◦ V˜ (S) ≤ λ+α λ+α ¯ = K + inf G ◦ V˜ ((x + b) ∧ x), b∈(0,x] ¯
where the inequality is by Condition 7.3.3(b), the last but one equality is by (A.36), and the last equality is by (A.36). Therefore, again using (A.28) for x ∈ [s, S), we obtain V˜ (x) = V1 (x) = G ◦ V1 (x) = G ◦ V˜ (x) = T2 ◦ V˜ (x), and the requirement (7.66), the case of ϕ(x) = a g , is fulfilled for the function V˜ . We have proved that V˜ = T2 ◦ V˜ , hence V = V˜ , and the requirements (7.66) are fulfilled for the function V and for the mapping ϕ. The proof is complete.
Appendix B
Relevant Definitions and Facts
To be self-contained, in this appendix we collect the relevant definitions and results from real analysis, measure theory, and applied probability, most of which are presented without proofs.
B.1 Real Analysis

B.1.1 Borel Spaces and Semianalytic Functions

A measurable space $X$ is isomorphic to another measurable space $Y$ if there is a measurable bijection mapping $X$ onto $Y$ whose inverse is also measurable. As a standard practice adopted here and below, we do not indicate explicitly the σ-algebras on $X$ and $Y$ when there is no danger of confusion.

Definition B.1.1 (Borel space) A Borel space is a measurable space that is isomorphic to a Borel subset of a Polish space. A topological Borel space is a topological space that is homeomorphic to a Borel subset of a Polish space, endowed with the relative topology.

For a Borel space $(X,F)$, we call subsets in $F$ Borel subsets of $X$, and often write $F$ as $B(X)$, because there is always a metric on $X$ whose Borel σ-algebra on $X$ coincides with the given $F$. For a Borel space $(X,F)$, $P(X)$ denotes the space of probability measures on it. For each $p\in P(X)$, let $F(p)$ be the completion of $F$ with respect to $p$. The σ-algebra $\bigcap_{p\in P(X)}F(p)$ is called the universal σ-algebra on $X$. A mapping $f$ from $X$ to another Borel space $(Y,G)$ is called universally measurable if it is measurable with respect to the universal σ-algebra on $X$, i.e., $f^{-1}(G)\subseteq\bigcap_{p\in P(X)}F(p)$.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9
Proposition B.1.1 Every Borel space is isomorphic to the interval $[0,1]$ endowed with the Borel σ-algebra or to its finite or countable subset endowed with the σ-algebra of all its subsets.

Proof See Proposition 7.16 and Corollary 7.16.1 of [21] or Appendix 1 of [69].

Proposition B.1.2 If $X_1,X_2,\dots$ are disjoint Borel spaces with σ-algebras $F_1,F_2,\dots$, then their union $X=X_1\cup X_2\cup\dots$, equipped with the minimal σ-algebra $F$ containing $F_1,F_2,\dots$, is a Borel space.

Proof Every space $X_n$, $n=1,2,\dots$, is isomorphic to the interval $[2n,2n+1]$ or to its finite or countable subset by Proposition B.1.1. Hence $(X,F)$ is isomorphic to the Borel subset of the Polish space $\mathbb R$. See also Proposition 3.2.2 of [17].

Definition B.1.2 (Analytic set) Let $X$ be a nonempty Borel space. A subset of $X$ is called analytic if it is the image of a Borel subset of $Y$ under a Borel measurable mapping from $Y$ into $X$, where $Y$ is some uncountable Borel space.

There are several equivalent definitions of an analytic subset, see e.g., Proposition 7.41 of [21]. Each Borel subset of a Borel space is analytic.

Definition B.1.3 (Lower semianalytic function) Let $X$ be a Borel space and $D$ be its analytic subset. A function $f:D\to[-\infty,+\infty]$ is called lower semianalytic if the set $\{x\in D:\ f(x)<c\}$ is analytic for every $c\in\mathbb R$.

Properties of analytic sets and semianalytic functions are presented in [21, Sects. 7.6, 7.7].

Proposition B.1.3 (Novikov Separation Theorem) Let $X$ be a Borel space, and $\{C_n\}_{n=1}^\infty$ be a sequence of analytic subsets of $X$ such that $\bigcap_{n=1}^\infty C_n=\emptyset$. Then there exists a sequence $\{B_n\}_{n=1}^\infty$ of Borel subsets of $X$ such that $\bigcap_{n=1}^\infty B_n=\emptyset$, and for each $n=1,2,\dots$, $C_n\subseteq B_n$.

Proof See Theorem 4.6.1 of [232] or 28.B of [148].

Proposition B.1.4 Let $X$ and $Y$ be Borel spaces, $D$ be an analytic subset of $X\times Y$ with its projection on $X$ denoted by $h(D)$, and $f:D\to[-\infty,\infty]$ be a lower semianalytic function. Then the function $f^*$ on $h(D)$ defined by $f^*(x):=\inf_{y\in D_x}f(x,y)$ is a lower semianalytic function. Here $D_x$ is the section of $D$ at $x\in X$.

Proof See Proposition 7.47 of [21].

Proposition B.1.5 Let $X$ be a Borel space, $D$ be an analytic subset of $X$, and $f_n:D\to[-\infty,\infty]$ be a lower semianalytic function for each $n=1,2,\dots$. If $\{f_n\}_{n=1}^\infty$ converges pointwise to a function $f$, then the function $f$ is a lower semianalytic function.

Proof See Lemma 7.30 of [21].
B.1.2 Semicontinuous Functions

Definition B.1.4 (Upper and lower semicontinuous functions) Let $X$ be a topological space, and $c$ be a $[-\infty,\infty]$-valued function on $X$. Then $c$ is called upper semicontinuous on $X$ if $\{x\in X:\ c(x)\ge\epsilon\}$ is closed in $X$ for each constant $\epsilon\in\mathbb R$. If $-c$ is upper semicontinuous on $X$, then $c$ is called lower semicontinuous (on $X$).

Lemma B.1.1 If $\{c_\alpha\}_{\alpha\in A}$ is a family of $[-\infty,\infty]$-valued lower semicontinuous functions on a topological space $X$, where $A$ is an arbitrary set, then the $[-\infty,\infty]$-valued function $c:=\sup_{\alpha\in A}c_\alpha$ is also lower semicontinuous.

Proof For each $\epsilon\in\mathbb R$, the set
\[
\{x\in X:\ c(x)\le\epsilon\}=\bigcap_{\alpha\in A}\{x\in X:\ c_\alpha(x)\le\epsilon\}
\]
is obviously closed.

In particular, we formulate the next remark.
Remark B.1.1 Suppose $X$ is a metric space, and $c$ is a $[-\infty,\infty]$-valued lower semicontinuous function on $X$. Then $c^+$ is lower semicontinuous, and $c^-$ is upper semicontinuous.

The following statement provides an equivalent definition of a lower or upper semicontinuous function.

Proposition B.1.6 Let $X$ be a metric space, and $c$ be a $[-\infty,\infty]$-valued function on $X$. Then the following assertions hold.
(a) The function $c$ is lower semicontinuous on $X$ if and only if for each $x\in X$,
\[
\liminf_{n\to\infty}c(x_n)\ge c(x)
\]
for each convergent sequence $\{x_n\}_{n=1}^\infty$ in $X$ to $x$ as $n\to\infty$.
(b) The function $c$ is upper semicontinuous on $X$ if and only if for each $x\in X$,
\[
\limsup_{n\to\infty}c(x_n)\le c(x)
\]
for each convergent sequence $\{x_n\}_{n=1}^\infty$ in $X$ to $x$ as $n\to\infty$.

Proof See Lemma 7.13 of [21].
Lemma B.1.2 Suppose $Y$ is a compact topological Borel space, and $X$ is a Borel space. Let a $[-\infty,\infty]$-valued lower semicontinuous function $g$ on $X\times Y$ be fixed. Then the following assertions hold.
(a) For each $\epsilon\in\mathbb R$, it holds that the set $\{x\in X:\ \forall\,a\in Y,\ g(x,a)>\epsilon\}$ is open in $X$.
(b) The following equality holds:
\[
\{x\in X:\ \forall\,a\in Y,\ g(x,a)>0\}=\bigcup_{l=1}^\infty\left\{x\in X:\ \forall\,a\in Y,\ g(x,a)>\frac1l\right\}.
\]

Proof See Lemmas 3.1 and 3.2 in [61].
Proposition B.1.7 Suppose X and Y are topological Borel spaces, with Y being compact. Let c be a [−∞, ∞]-valued measurable function on X × Y. Then the following assertions hold.
(a) The function $y \in Y \to c(x,y)$ is bounded from below and lower semicontinuous on Y for each x ∈ X if and only if there exists a sequence of bounded measurable functions $\{c_n\}_{n=1}^\infty$ on X × Y such that $c_n \uparrow c$ (pointwise) as n ↑ ∞, and, for each n, the function $y \in Y \to c_n(x,y)$ is bounded continuous on Y for each x ∈ X.
(b) The function $y \in Y \to c(x,y)$ is bounded from above and upper semicontinuous on Y for each x ∈ X if and only if there exists a sequence of bounded measurable functions $\{c_n\}_{n=1}^\infty$ on X × Y such that $c_n \downarrow c$ (pointwise) as n ↑ ∞, and, for each n, the function $y \in Y \to c_n(x,y)$ is bounded continuous on Y for each x ∈ X.

Proof The proof comes from the reasoning of Lemma 7.14 of [21]. The details are as follows. We only prove part (a). Consider the “only if” part. Let us first introduce some notation used in this proof. Let $\rho_Y$ denote the fixed compatible metric on Y. Let c(x, ·) have a lower bound b(x) ∈ (−∞, ∞) for each x ∈ X. Let us define
$$\Gamma_\infty := \left\{x \in X : \inf_{y \in Y} c(x,y) = \infty\right\}.$$
Therefore, for each $x \in X \setminus \Gamma_\infty$, there is some $y_x \in Y$ such that $c(x, y_x) < \infty$. For each n = 1, 2, …, consider the function $g_n$ on X × Y defined by
$$g_n(x,y) = \begin{cases} \inf_{z \in Y}\{c(x,z) + n\rho_Y(y,z)\}, & \forall\, x \in X \setminus \Gamma_\infty; \\ n, & \forall\, x \in \Gamma_\infty. \end{cases}$$
Then for each n = 1, 2, …,
$$b(x) \le g_n(x,y) \le c(x,y), \quad \forall\, x \in X,\ y \in Y;$$
$$b(x) \le g_n(x,y) \le c(x,y_x) + n\rho_Y(y,y_x) < \infty, \quad \forall\, x \in X \setminus \Gamma_\infty,\ y \in Y.$$
It is now clear that, for each x ∈ X, the sequence of (−∞, ∞]-valued functions $\{g_n(x,\cdot)\}_{n=1}^\infty$ is monotone nondecreasing, bounded from below (in y ∈ Y) by b(x), and
$$\lim_{n\to\infty} g_n(x,y) \le c(x,y), \quad \forall\, x \in X. \tag{B.1}$$
Step 1: We verify that, for each x ∈ X, the function $y \in Y \to g_n(x,y)$ is continuous for each n = 1, 2, …. This is clear for each $x \in \Gamma_\infty$. So let us consider $x \in X \setminus \Gamma_\infty$. Then
$$c(x,z) + n\rho_Y(y,z) \le c(x,z) + n\rho_Y(a,z) + n\rho_Y(y,a),$$
$$c(x,z) + n\rho_Y(a,z) \le c(x,z) + n\rho_Y(y,z) + n\rho_Y(a,y)$$
for each y, z, a ∈ Y. After taking the infimum over z ∈ Y on both sides, the above two inequalities lead to
$$|g_n(x,a) - g_n(x,y)| \le n\rho_Y(a,y), \quad \forall\, a, y \in Y,$$
meaning that the function $y \in Y \to g_n(x,y)$ is (uniformly) continuous.

Step 2: We verify that
$$\lim_{n\to\infty} g_n(x,y) = c(x,y), \quad \forall\, x \in X,\ y \in Y. \tag{B.2}$$
Consider only $x \in X \setminus \Gamma_\infty$, as the other case is trivial. Let ε > 0 be fixed. Using the definition of the infimum in the definition of $g_n(x,y)$, we see that for each n = 1, 2, … and y ∈ Y, there is some $y_n(x,y) \in Y$ satisfying
$$c(x, y_n(x,y)) + n\rho_Y(y, y_n(x,y)) \le g_n(x,y) + \epsilon. \tag{B.3}$$
If $\lim_{n\to\infty} g_n(x,y) = \infty$, then (B.2) follows from this and (B.1). Therefore, suppose $\lim_{n\to\infty} g_n(x,y) < \infty$. Then (B.1) and (B.3) imply that
$$\lim_{n\to\infty} \rho_Y(y, y_n(x,y)) = 0.$$
Now take the lower limit as n → ∞ on both sides of (B.3). After dropping the nonnegative term on the left, we see that
$$\liminf_{n\to\infty} c(x, y_n(x,y)) \le \lim_{n\to\infty} g_n(x,y) + \epsilon,$$
which, together with Proposition B.1.6, leads to
$$c(x,y) \le \lim_{n\to\infty} g_n(x,y) + \epsilon,$$
because $\lim_{n\to\infty} y_n(x,y) = y$. Since ε > 0 was arbitrarily fixed, this and (B.1) imply (B.2).

Step 3: We verify that, for each n = 1, 2, …, $g_n$ is measurable on X × Y. To this end, let us first fix y ∈ Y and note that the function $x \in X \to g_n(x,y)$
is measurable. Indeed, this follows from Proposition B.1.39. Secondly, it has been shown in Step 1 that for each x ∈ X, the function $y \to g_n(x,y)$ is continuous on Y. Therefore, for each n = 1, 2, …, $g_n$ is a Carathéodory function, and thus measurable on X × Y according to Proposition B.1.38.

Step 4: Let us define a function $c_n$ on X × Y by
$$c_n(x,y) = \min\{g_n(x,y),\, n\}, \quad \forall\, x \in X,\ y \in Y,$$
for each n = 1, 2, …. From the statements proved in the previous steps, $\{c_n\}_{n=1}^\infty$ is the desired sequence of functions. The “only if” part is proved.

The “if” part follows immediately from Definition B.1.4 and Lemma B.1.1.

Definition B.1.5 (Lower semicontinuous regulation) Let X be a metric space with the metric ρ. For each [−∞, ∞]-valued function f on X, its lower semicontinuous regulation is defined by
$$f^E(x) := \sup_{\epsilon > 0}\ \inf_{y:\, \rho(x,y) \le \epsilon} f(y), \quad \forall\, x \in X.$$
Clearly, $f^E(x) \le f(x)$ for each x ∈ X.

Proposition B.1.8 Let X be a metric space with the metric ρ. For each R-valued function f on X, $f^E$ is lower semicontinuous. Furthermore, if g is another lower semicontinuous function on X such that g(x) ≤ f(x) for each x ∈ X, then $g(x) \le f^E(x)$ for each x ∈ X. If h ≤ f, then $h^E \le f^E$ pointwise, too.
Proof See Lemma 5.13.4 and Remark 5.13.3 of [18].
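The approximating functions used in the proof of Proposition B.1.7 can be tried out numerically. The sketch below is only an illustration under simplifying assumptions: Y is replaced by a finite grid in [0, 1] with the metric |y − z|, the variable x is suppressed, and the names `inf_convolution`, `grid` and `c` are hypothetical. It computes $g_n(y) = \min_z \{c(z) + n|y - z|\}$ for a step function and lets one check the properties used in Steps 1 and 2 (lower and upper bounds, monotonicity in n, Lipschitz continuity with constant n).

```python
def inf_convolution(c, grid, n):
    """Return g_n(y) = min over grid points z of c(z) + n * |y - z|, for every y in the grid."""
    return [min(cz + n * abs(y - z) for z, cz in zip(grid, c)) for y in grid]


# Hypothetical example: a lower semicontinuous step function sampled on a grid of [0, 1].
grid = [i / 100 for i in range(101)]
c = [0.0 if y <= 0.5 else 1.0 for y in grid]  # the lower value is taken at the jump

# Approximations g_n for a few values of n (illustrative choice).
approximations = {n: inf_convolution(c, grid, n) for n in (1, 2, 4, 8, 16)}

for n, g in approximations.items():
    largest_gap = max(cy - gy for cy, gy in zip(c, g))
    print(f"n = {n:2d}: max of c - g_n over the grid = {largest_gap:.3f}")
```

The printed gaps shrink as n grows, in line with the pointwise convergence $g_n \uparrow c$ established in Step 2; on a finite grid the gap eventually vanishes altogether.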
Proposition B.1.9 (a) Suppose that the R-valued functions $f_1, f_2, \ldots$ are defined and continuous on a metric space X, and that $f_n(x) \to f(x)$ uniformly on X as n → ∞. Then f is continuous on X.
(b) (Weierstrass M-test.) Let the R-valued functions $u_1, u_2, \ldots$ be defined on a (measurable) set X ⊆ R, and suppose that $|u_n(x)| \le M_n$ for all x ∈ X, for n = 1, 2, …. Suppose further that the numerical series $\sum_{n=1}^\infty M_n$ is convergent. Then the series $\sum_{n=1}^\infty u_n(x)$ converges uniformly on X.
Proof See Sect. 1.8 of [68] and Theorem 2.65 of [3].
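As a small numerical illustration of part (b), the following sketch takes the hypothetical choice $u_n(x) = \sin(nx)/n^2$ with $M_n = 1/n^2$ (neither appears in the text) and compares the largest difference between two partial sums over a sample of points with the corresponding tail of the majorant series. The function names and the sample points are assumptions made for this example only.

```python
import math


def partial_sum(x, n_terms):
    """Partial sum of u_n(x) = sin(n x) / n^2 for n = 1, ..., n_terms."""
    return sum(math.sin(n * x) / n**2 for n in range(1, n_terms + 1))


def majorant_tail(start, stop):
    """Sum of M_n = 1 / n^2 for n = start, ..., stop."""
    return sum(1.0 / n**2 for n in range(start, stop + 1))


xs = [-10 + 0.05 * k for k in range(401)]  # sample points in [-10, 10]
N, M = 50, 400

# Largest gap between the N-th and M-th partial sums over the sample points.
gap = max(abs(partial_sum(x, M) - partial_sum(x, N)) for x in xs)
# The same bound works for every x, which is the content of the M-test.
bound = majorant_tail(N + 1, M)

print(f"sup gap over sample = {gap:.6f}, majorant tail = {bound:.6f}")
```

The bound does not depend on x, so the Cauchy criterion holds uniformly; the sample of points only serves to display this.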
B.1.3 Spaces of Functions

Proposition B.1.10 Let $(\Omega, F, \mu)$ be a measure space, 1 ≤ p < ∞. Then the space $L^p(\Omega, F, \mu)$ of pth-power integrable functions (with two functions differing on a μ-null set
not distinguished), endowed with the $L^p$-norm, is complete. If μ is σ-finite, and F is countably generated, then $L^p(\Omega, F, \mu)$ is separable.
Proof See Theorem 19.2 of [24].
Definition B.1.6 (Equicontinuity) A family H of R-valued functions on a metric space X is called equicontinuous at a point x ∈ X if for each ε > 0, there exists an open neighborhood G of x ∈ X such that
$$|f(x) - f(y)| < \epsilon, \quad \forall\, y \in G,\ f \in H.$$
The family H is called equicontinuous if it is equicontinuous at every x ∈ X.

The next statement is a version of the Arzela–Ascoli Theorem.

Proposition B.1.11 (Arzela–Ascoli Theorem) If a family H of R-valued functions on a separable metric space X is equicontinuous, and a sequence $\{f_n\} \subseteq H$ is such that for each x ∈ X, $\{f_n(x) : n \ge 0\}$ is relatively compact in R, then there exists a subsequence $\{f_{n_k}\} \subseteq \{f_n\}$ and an R-valued continuous function f on X such that $\{f_{n_k}\}$ converges to f uniformly on compact subsets of X.
Proof See Theorem 40 in p. 169 of [212].
Until the end of this subsection, let Y and X be nonempty Borel spaces. Let $F_Y$ be the class of Borel measurable mappings from $\mathbb{R}_+$ to Y. Two elements of $F_Y$ are not distinguished if they coincide almost everywhere.

Proposition B.1.12 Equip $F_Y$ with the σ-algebra $\mathcal{F}_Y$, the minimal one with respect to which the mapping
$$f \in F_Y \to \int_{\mathbb{R}_+} e^{-t} g(t, f(t))\,dt$$
is measurable for each bounded measurable function g on $\mathbb{R}_+ \times Y$. Then $F_Y$ is a Borel space.

Proof This proof comes from [254]. Throughout this proof, there is no loss of generality in considering Y as a nonempty closed subset of [0, 1] with its Borel σ-algebra B(Y), because each Borel space is isomorphic to such a subset. Then we consider the metric ρ on $F_Y$ defined by
$$\rho(f_1, f_2) := \left( \int_{\mathbb{R}_+} e^{-t} (f_1(t) - f_2(t))^2\,dt \right)^{1/2} \tag{B.4}$$
for each $f_1, f_2 \in F_Y$. By Proposition B.1.10, one can see that this metric space $F_Y$ is a Polish space, whose Borel σ-algebra is denoted by $\mathcal{G}$, and is thus generated by the collection of open balls in $F_Y$.
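To get a feel for the metric (B.4), the following sketch approximates it by a midpoint rule on a truncated time horizon. Everything here is an assumption made for illustration: the function name `rho`, the truncation at `horizon = 40`, the step count, and the two [0, 1]-valued test functions. For the indicator of [0, 1) and the zero function the exact value is $(1 - e^{-1})^{1/2}$, which the approximation should reproduce closely.

```python
import math


def rho(f1, f2, horizon=40.0, steps=40_000):
    """Midpoint-rule approximation of the metric (B.4), truncated to [0, horizon]."""
    h = horizon / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.exp(-t) * (f1(t) - f2(t)) ** 2 * h
    return math.sqrt(total)


def indicator(t):
    """Hypothetical [0, 1]-valued function: equals 1 on [0, 1) and 0 afterwards."""
    return 1.0 if t < 1.0 else 0.0


def zero(t):
    return 0.0


def one(t):
    return 1.0


distance = rho(indicator, zero)
exact = math.sqrt(1.0 - math.exp(-1.0))
print(f"approximate rho = {distance:.6f}, exact value = {exact:.6f}")
print(f"rho between the constants 1 and 0 = {rho(one, zero):.6f}")
```

Since $\int_{\mathbb{R}_+} e^{-t}\,dt = 1$ and the functions take values in [0, 1], no two elements are further apart than 1 in this metric.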
For each ε > 0, the ε-open ball centered at some $f \in F_Y$ in $F_Y$ with respect to the metric defined above is clearly an element of $\mathcal{F}_Y$. It follows that $\mathcal{G} \subseteq \mathcal{F}_Y$. For the opposite direction, let H be the vector space of bounded measurable functions g on $\mathbb{R}_+ \times Y$ such that
$$f \in F_Y \to \int_{\mathbb{R}_+} e^{-t} g(t, f(t))\,dt$$
is measurable with respect to $\mathcal{G}$. Then the multiplicative system K, whose members are of the form $e^{-mt} y^n$, $t \in \mathbb{R}_+$, y ∈ Y, m, n = 0, 1, 2, …, is contained in H, which itself contains constant functions, and is closed under bounded pointwise passage to the limit. Now by the Monotone Class Theorem, cf. Proposition B.1.43, we see that H contains all the bounded functions measurable with respect to the σ-algebra generated by the multiplicative system K, which is $B(\mathbb{R}_+ \times Y)$. Since $\mathcal{F}_Y$ is the minimal one with respect to which the functions
$$f \in F_Y \to \int_{\mathbb{R}_+} e^{-t} g(t, f(t))\,dt$$
are measurable for all bounded measurable functions g on $\mathbb{R}_+ \times Y$, we see that $\mathcal{F}_Y \subseteq \mathcal{G}$. Hence, $\mathcal{F}_Y = \mathcal{G}$. The statement follows.

Lemma B.1.3 Suppose X and Y are two nonempty Borel spaces. Let $F_Y$ be endowed with the σ-algebra defined in Proposition B.1.12.
(a) Let $(t,x) \in \mathbb{R}_+ \times X \to f(t,x) \in Y$ be measurable. Then $x \in X \to f_x \in F_Y$ with $f_x(t) := f(t,x)$ is measurable.
(b) Let $x \in X \to f_x \in F_Y$ be measurable. Then there exists some measurable mapping f from $\mathbb{R}_+ \times X$ to Y such that for each x ∈ X, $f_x(t) = f(t,x)$ almost everywhere with respect to $t \in \mathbb{R}_+$.

Proof This proof comes from [254]. The case of Y being a singleton is trivial. Below we assume that Y is not a singleton. As in the proof of Proposition B.1.12, we assume without loss of generality that Y is a closed subset of [0, 1], and consider the metric on $F_Y$ defined by (B.4). Then the Borel σ-algebra on $F_Y$ is generated by the collection of open balls in the metric space $F_Y$. The preimage with respect to $x \in X \to f_x \in F_Y$ of the ε-open ball centered at some $g \in F_Y$ for some ε > 0 is
$$\left\{ x \in X : \int_{\mathbb{R}_+} e^{-t} (f_x(t) - g(t))^2\,dt = \int_{\mathbb{R}_+} e^{-t} (f(t,x) - g(t))^2\,dt < \epsilon^2 \right\},$$
which is a measurable subset of X by Proposition B.1.34. Part (a) of the statement follows from this.
Now we prove part (b) as follows. Since the metric space $F_Y$ is separable by Proposition B.1.12 (see its proof), for each m ∈ N there exists a sequence $\{F_{m,n}\}_{n=1}^\infty$ of disjoint nonempty measurable subsets of $F_Y$ such that $\bigcup_{n=1}^\infty F_{m,n} = F_Y$ and
$$\sup_{g,h \in F_{m,n}} \rho(g,h) \le \frac{1}{m^2}, \quad \forall\, n = 1, 2, \ldots.$$
For each m, n ∈ N, let $f_{m,n}$ be a fixed element of $F_{m,n}$. For each x ∈ X, define n(m, x) by n(m, x) = k if and only if $f_x \in F_{m,k}$. Remember that the sequence $\{F_{m,n}\}_{n=1}^\infty$ is disjoint and exhaustive. Thus, $x \in X \to n(m,x)$ is measurable for each m ∈ N. Now, for each m ∈ N and x ∈ X, we define $f_x^{(m)} \in F_Y$ by
$$f_x^{(m)}(t) := f_{m,n(m,x)}(t), \quad \forall\, t \in \mathbb{R}_+.$$
It follows that for each measurable subset Γ of the Borel space Y,
$$\left\{ (x,t) \in X \times \mathbb{R}_+ : f_x^{(m)}(t) \in \Gamma \right\} = \bigcup_{k=1}^\infty \left\{ (x,t) \in X \times \mathbb{R}_+ : f_{m,k}(t) \in \Gamma,\ n(m,x) = k \right\}$$
is a measurable subset of $X \times \mathbb{R}_+$. Thus, $(x,t) \in X \times \mathbb{R}_+ \to f_x^{(m)}(t)$ is measurable. Define the function f on $X \times \mathbb{R}_+$ by
$$f(t,x) := \lim_{m\to\infty} f_x^{(m)}(t), \quad \forall\, x \in X,\ t \in \mathbb{R}_+.$$
It follows that f is measurable on $X \times \mathbb{R}_+$. Let x ∈ X be fixed. Note that, for each m ∈ N,
$$\rho(f_x, f_x^{(m)}) \le \frac{1}{m^2}.$$
Let λ be the exponential distribution with unit mean on $\mathbb{R}_+$. Then
$$\lambda\left(\left\{ t \in \mathbb{R}_+ : |f_x(t) - f_x^{(m)}(t)| \ge \frac{1}{m} \right\}\right) \le m^2 \rho^2(f_x, f_x^{(m)}) \le \frac{1}{m^2}, \quad \forall\, m \in \mathbb{N}, \tag{B.5}$$
by the Chebyshev inequality. It follows that the measurable functions $f_x^{(m)}$ converge to $f_x$ for almost all $t \in \mathbb{R}_+$ with respect to λ. Indeed,
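$$\bigcup_{r \in \mathbb{Q} \cap (0,\infty)}\ \bigcap_{m \in \mathbb{N}}\ \bigcup_{n \ge m} \left\{ t \in \mathbb{R}_+ : |f_x(t) - f_x^{(n)}(t)| \ge r \right\} \subseteq \bigcap_{m \in \mathbb{N}}\ \bigcup_{n \ge m} \left\{ t \in \mathbb{R}_+ : |f_x(t) - f_x^{(n)}(t)| \ge \frac{1}{n} \right\},$$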
where the set on the right-hand side is null with respect to λ by (B.5) and the Borel–Cantelli Lemma, see Proposition B.1.44(a). Hence, the measurable functions $f_x^{(m)}$ converge to $f_x$ for almost all $t \in \mathbb{R}_+$. Therefore, for each x ∈ X, the equality $f_x(t) = f(t,x)$ holds for almost all $t \in \mathbb{R}_+$. Recall the definition of the measurable mapping f on $X \times \mathbb{R}_+$. The statement thus follows.

Recall that, for a topological Borel space Y, P(Y) is the space of probability measures on (Y, B(Y)), and is a (topological) Borel space when it is endowed with the standard weak topology. Unless stated otherwise, we endow $F_{P(Y)}$ with the σ-algebra $\mathcal{F}_{P(Y)}$ defined in Proposition B.1.12.

Remark B.1.2 When Y is interpreted as an action space, we also use $\mathcal{R}(Y)$ for $F_{P(Y)}$ to signify its meaning as the space of relaxed controls. This applies to the forthcoming sections without further reference.

The next lemma and its proof are taken from [46].

Lemma B.1.4 Let Y be a topological Borel space. For each $\bar{\mathbb{R}}^0_+$-valued measurable function h on $X \times \mathbb{R}_+ \times Y \times F_{P(Y)}$, the function
$$(x,\rho) \in X \times F_{P(Y)} \to \int_0^\infty \int_Y h(x,s,a,\rho)\,\rho_s(da)\,ds$$
is measurable.

Proof Let K be the class of bounded measurable functions g on $X \times \mathbb{R}_+ \times Y \times F_{P(Y)}$ such that
$$(x,\rho) \in X \times F_{P(Y)} \to \int_{\mathbb{R}_+} e^{-s} \int_Y g(x,s,y,\rho)\,\rho_s(dy)\,ds$$
is measurable with respect to $B(X) \times \mathcal{F}_Y$. The set K is a vector space, contains constants, and is closed under bounded passage to a limit. Consider the multiplicative system $K_0$ whose elements are of the form c(s, y)b(x, ρ) with c and b being bounded measurable on $\mathbb{R}_+ \times Y$ and on $X \times F_{P(Y)}$, respectively. Now consider an arbitrary element c(s, y)b(x, ρ) of $K_0$. It follows that
$$(x,\rho) \in X \times F_{P(Y)} \to \int_{\mathbb{R}_+} e^{-s} \int_Y c(s,y)\,b(x,\rho)\,\rho_s(dy)\,ds = b(x,\rho) \int_{\mathbb{R}_+} e^{-s} \int_Y c(s,y)\,\rho_s(dy)\,ds$$
is measurable by the definition of $\mathcal{F}_{P(Y)}$. Thus, $K_0 \subseteq K$. By the Monotone Class Theorem, see Proposition B.1.43, the class of bounded functions measurable with respect to $B(X \times \mathbb{R}_+ \times Y) \times \mathcal{F}_Y = \sigma(K_0)$ is a subset of K. That is, $\int_{\mathbb{R}_+} e^{-s} \int_Y g(x,s,y,\rho)\,\rho_s(dy)\,ds$ is measurable with respect to $B(X) \times \mathcal{F}_Y$ for all bounded measurable functions g on $X \times \mathbb{R}_+ \times Y \times F_{P(Y)}$.

Now let some $\bar{\mathbb{R}}^0_+$-valued measurable function h on $X \times \mathbb{R}_+ \times Y \times F_{P(Y)}$ be arbitrarily fixed. By the Monotone Convergence Theorem, see Proposition B.1.43,
$$\int_{\mathbb{R}_+} \int_Y h(x,s,y,\rho)\,\rho_s(dy)\,ds = \lim_{k\to\infty} \int_{\mathbb{R}_+} e^{-s} \int_Y \{(e^{s} h(x,s,y,\rho)) \wedge k\}\,\rho_s(dy)\,ds,$$
so that
$$(x,\rho) \in X \times F_{P(Y)} \to \int_{\mathbb{R}_+} \int_Y h(x,s,y,\rho)\,\rho_s(dy)\,ds$$
is measurable with respect to $B(X) \times \mathcal{F}_Y$. The statement is thus proved.
B.1.4 Passage to the Limit

Proposition B.1.13 If $\{c_n\}_{n=1}^\infty$ is a monotone nonincreasing sequence of [−∞, ∞]-valued functions on some space X, then
$$\inf_{x \in X} \lim_{n\to\infty} c_n(x) = \lim_{n\to\infty} \inf_{x \in X} c_n(x).$$
Proof See Lemma 3.4 of [125].
Lemma B.1.5 Suppose X is an arbitrary set and Y is a compact topological Borel space. Let $\{g_n\}_{n=1}^\infty$ be a sequence of [−∞, ∞]-valued functions on X × Y such that, for each x ∈ X, the sequence $\{g_n(x,\cdot)\}_{n=1}^\infty$ is monotone nondecreasing and, for each n = 1, 2, …, the function $g_n(x,\cdot)$ is lower semicontinuous on Y. Then
$$\lim_{n\to\infty} \inf_{y \in Y} g_n(x,y) = \inf_{y \in Y} \lim_{n\to\infty} g_n(x,y), \quad \forall\, x \in X. \tag{B.6}$$
Proof The reasoning of the presented proof and the more general version of the statement of this lemma come from Proposition 10.1 of [214]. Let some x ∈ X be arbitrarily fixed. Observe that
$$\lim_{n\to\infty} \inf_{y \in Y} g_n(x,y) \le \inf_{y \in Y} \lim_{n\to\infty} g_n(x,y).$$
We argue for the opposite direction of the above inequality as follows. For each n ∈ {1, 2, …}, let $y_n \in Y$ be such that
$$g_n(x, y_n) = \inf_{y \in Y} g_n(x,y).$$
Since Y is compact, the sequence $\{y_n\}_{n=1}^\infty$ has some limit point $\hat{y} \in Y$. Let $\{y_{n_k}\}_{k=1}^\infty \subseteq \{y_n\}_{n=1}^\infty$ be the convergent subsequence such that $\lim_{n_k\to\infty} y_{n_k} = \hat{y}$. Now let m ∈ {1, 2, …} be arbitrarily fixed. Then for all $n_k \ge m$,
$$g_{n_k}(x, y_{n_k}) \ge g_m(x, y_{n_k}), \quad \forall\, x \in X.$$
It follows that
$$\lim_{n\to\infty} \inf_{y \in Y} g_n(x,y) = \lim_{n\to\infty} g_n(x, y_n) = \lim_{n_k\to\infty} g_{n_k}(x, y_{n_k}) \ge \liminf_{n_k\to\infty} g_m(x, y_{n_k}) \ge g_m(x, \hat{y}),$$
where the last inequality is by the lower semicontinuity condition. Thus, by passing to the limit as m → ∞ first and then taking the infimum over y ∈ Y, we see that
$$\lim_{n\to\infty} \inf_{y \in Y} g_n(x,y) \ge \inf_{y \in Y} \lim_{n\to\infty} g_n(x,y).$$
The proof is complete.
Corollary B.1.1 Suppose that Y is a compact topological Borel space, and $\{g_n\}_{n=1}^\infty$ is a monotone nonincreasing sequence of R-valued functions on Y such that, for each n = 1, 2, …, $g_n$ is an upper semicontinuous function on Y, and with the pointwise limit g being a continuous real-valued function on Y. Then $g_n$ converges to g uniformly on Y as n → ∞, i.e.,
$$\lim_{n\to\infty} \sup_{y \in Y} |g_n(y) - g(y)| = 0.$$

Proof Since $\{g_n\}_{n=1}^\infty$ is a monotone nonincreasing sequence of upper semicontinuous functions, $\{g - g_n\}_{n=1}^\infty$ is a monotone nondecreasing sequence of lower semicontinuous functions on Y. We view $g - g_n$ as a function on {1} × Y for each n. Then Lemma B.1.5 applies with X = {1}. So
$$\lim_{n\to\infty} \inf_{y \in Y} (g(y) - g_n(y)) = \inf_{y \in Y} \lim_{n\to\infty} (g(y) - g_n(y)) = 0,$$
i.e.,
$$-\lim_{n\to\infty} \sup_{y \in Y} (g_n(y) - g(y)) = -\sup_{y \in Y} \lim_{n\to\infty} (g_n(y) - g(y)) = 0,$$
and thus
$$\lim_{n\to\infty} \sup_{y \in Y} |g_n(y) - g(y)| = \sup_{y \in Y} \lim_{n\to\infty} |g_n(y) - g(y)| = 0$$
because $g_n(y) \ge g(y)$ for each y ∈ Y. The proof is complete.
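Corollary B.1.1 is a Dini-type statement, and a quick numerical check can make it concrete. The sketch below uses the hypothetical choice $g_n(y) = y^n$ on the compact set Y = [0, 0.9], which is nonincreasing in n, continuous, and has the continuous pointwise limit g ≡ 0; the grid and the function name `sup_gap` are assumptions made for this example.

```python
def sup_gap(n, grid):
    """Largest value over the grid of |g_n(y) - g(y)| for g_n(y) = y**n and the limit g = 0."""
    return max(abs(y**n - 0.0) for y in grid)


# Grid on the compact set Y = [0, 0.9] (illustrative choice).
grid = [0.9 * k / 1000 for k in range(1001)]

exponents = (1, 10, 50, 100)
gaps = [sup_gap(n, grid) for n in exponents]

for n, gap in zip(exponents, gaps):
    print(f"n = {n:3d}: sup |g_n - g| = {gap:.3e}")
```

Here the supremum equals $0.9^n$ and tends to zero. On Y = [0, 1] the same sequence has a discontinuous limit and the supremum over [0, 1) stays equal to 1, which shows that the continuity assumption on g cannot simply be dropped.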
Proposition B.1.14 Let $\mu_n$ and μ be [0, ∞]-valued measures on the measurable space (X, F), and $f_n$ (resp., f) be R-valued $\mu_n$-integrable (resp., measurable) functions thereon, and $g_n$ be nonnegative measurable functions on X. Suppose $\liminf_{n\to\infty} \mu_n(\Gamma) \ge \mu(\Gamma)$ for each Γ ∈ F, and $f_n(x) \to f(x)$ for each x ∈ X. If $|f_n(x)| \le g_n(x)$ for each x ∈ X and n ≥ 1, and
$$\limsup_{n\to\infty} \int_X g_n(y)\,\mu_n(dy) \le \int_X \liminf_{n\to\infty} g_n(y)\,\mu(dy) < \infty,$$
then f is μ-integrable, and
$$\lim_{n\to\infty} \int_X f_n(y)\,\mu_n(dy) = \int_X f(y)\,\mu(dy) \in (-\infty, \infty).$$
Proof See Theorem 2.4 of [223].
Proposition B.1.15 Let $\mu_n$ and μ be [0, ∞]-valued measures on the measurable space (X, F). Then the following statements are equivalent.
(a) $\int_X \liminf_{n\to\infty} f_n(x)\,\mu(dx) \le \liminf_{n\to\infty} \int_X f_n(x)\,\mu_n(dx)$ for all [0, ∞)-valued measurable functions $f_n$ on X.
(b) $\int_X f(x)\,\mu(dx) \le \liminf_{n\to\infty} \int_X f(x)\,\mu_n(dx)$ for each [0, ∞)-valued measurable function f on X.
(c) $\mu(\Gamma) \le \liminf_{n\to\infty} \mu_n(\Gamma)$ for each Γ ∈ F.
Proof See Lemma 2.2 of [223].
Proposition B.1.16 Let f be a [0, ∞)-valued measurable function on the measurable space (X, F), and $\{\mu_n\}_{n=1}^\infty$ be a sequence of measures on (X, F) such that $\{\mu_n(\Gamma)\}_{n=1}^\infty$ is monotone nondecreasing for each Γ ∈ F. Then there is a measure μ on (X, F) such that $\mu_n(\Gamma) \uparrow \mu(\Gamma)$ for each Γ ∈ F and
$$\lim_{n\to\infty} \int_X f(x)\,\mu_n(dx) = \int_X f(x)\,\mu(dx).$$
Proof See Theorem 2.1 of [122].
Proposition B.1.17 Let $\mu_n$ be [0, ∞]-valued measures on the measurable space (X, F), and suppose $\{\mu_n(\Gamma)\}_{n=1}^\infty$ is an increasing sequence for each Γ ∈ F. Then $\mu(\Gamma) := \lim_{n\to\infty} \mu_n(\Gamma)$ defines a measure on (X, F) and, for every increasing sequence of nonnegative measurable functions $f_l : X \to [0, \infty]$,
$$\lim_{n\to\infty} \lim_{l\to\infty} \int_X f_l(x)\,\mu_n(dx) = \lim_{n\to\infty} \int_X f(x)\,\mu_n(dx)$$
$$= \lim_{l\to\infty} \lim_{n\to\infty} \int_X f_l(x)\,\mu_n(dx) = \lim_{l\to\infty} \int_X f_l(x)\,\mu(dx)$$
$$= \lim_{n\to\infty} \int_X f_n(x)\,\mu_n(dx) = \int_X f(x)\,\mu(dx),$$
where $f := \lim_{l\to\infty} f_l$ pointwise.
Proof This is a consequence of Propositions B.1.13 and B.1.16.
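Proposition B.1.18 Suppose $\{\kappa_n\}_{n=1}^\infty$ and κ are nonnegative measurable functions on $(\mathbb{R}_+, B(\mathbb{R}_+))$ such that $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta < \infty$ and, for each α ≥ 0,
$$\lim_{n\to\infty} \int_{\mathbb{R}_+} e^{-\alpha\theta} \kappa_n(\theta)\,d\theta = \int_{\mathbb{R}_+} e^{-\alpha\theta} \kappa(\theta)\,d\theta.$$
Then, for each bounded continuous function g on $\mathbb{R}_+$,
$$\lim_{n\to\infty} \int_{\mathbb{R}_+} g(\theta) \kappa_n(\theta)\,d\theta = \int_{\mathbb{R}_+} g(\theta) \kappa(\theta)\,d\theta.$$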
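Proof Note that (at α = 0) $\lim_{n\to\infty} \int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta = \int_{\mathbb{R}_+} \kappa(\theta)\,d\theta$, and without loss of generality we can assume that all the integrals $\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta$ are finite. If $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta = 0$ then, for every bounded nonnegative continuous function g on $\mathbb{R}_+$,
$$\limsup_{n\to\infty} \int_{\mathbb{R}_+} g(\theta) \kappa_n(\theta)\,d\theta \le \sup_{\theta \in \mathbb{R}_+} g(\theta) \lim_{n\to\infty} \int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta = \sup_{\theta \in \mathbb{R}_+} g(\theta) \int_{\mathbb{R}_+} \kappa(\theta)\,d\theta = 0$$
and the statement of the proposition follows.

Assume that $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta > 0$. Without loss of generality, we assume also that $\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta > 0$ for all n = 1, 2, …. For the probability densities $\frac{\kappa_n(\theta)}{\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta}$ (n = 1, 2, …) and $\frac{\kappa(\theta)}{\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta}$ we know that the Laplace transforms $\int_{\mathbb{R}_+} e^{-\alpha\theta} \frac{\kappa_n(\theta)}{\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta}\,d\theta$ converge to $\int_{\mathbb{R}_+} e^{-\alpha\theta} \frac{\kappa(\theta)}{\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta}\,d\theta$ for all α ≥ 0. Therefore, for every bounded continuous function g on $\mathbb{R}_+$,
$$\lim_{n\to\infty} \int_{\mathbb{R}_+} g(\theta) \frac{\kappa_n(\theta)}{\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta}\,d\theta = \int_{\mathbb{R}_+} g(\theta) \frac{\kappa(\theta)}{\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta}\,d\theta$$
(see [91, XIII, Sect. 1, Theorem 2] and [91, VIII, Sect. 1, Theorem 1]), and the statement to be proved follows.

Proposition B.1.19 Suppose $K \subseteq L^1(X, F, \mu)$ is a bounded (in norm) collection of functions f, such that the integrals $\int_E f(x)\,\mu(dx)$ are countably additive uniformly with respect to f ∈ K, i.e., for each E ∈ F and its measurable partition $E = \bigcup_{n=1}^\infty E_n$, the convergence $\sum_{n=1}^\infty \int_{E_n} f(x)\,\mu(dx) = \int_E f(x)\,\mu(dx)$ takes place uniformly with respect to f ∈ K. Then every sequence $\{f_n\}_{n=1}^\infty \subseteq K$ contains a subsequence $\{f_{n_i}\}_{i=1}^\infty$ such that, for some $\bar{f} \in L^1(X, F, \mu)$, for each bounded measurable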
function g on X,
$$\int_X g(x) f_{n_i}(x)\,\mu(dx) \longrightarrow \int_X g(x) \bar{f}(x)\,\mu(dx)$$
as i → ∞. Here, the measure μ may be [0, ∞]-valued.
Proof This is a simplified statement of Theorem IV.8.9 in [67, p. 292].
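Corollary B.1.2 Suppose $\{\kappa_n\}_{n=1}^\infty$ and κ are nonnegative measurable functions on $(\mathbb{R}_+, B(\mathbb{R}_+))$ such that
(i) for some constant D < ∞, $\kappa_n(\theta), \kappa(\theta) \le D$ for all $\theta \in \mathbb{R}_+$;
(ii) $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta < \infty$;
(iii) for each bounded continuous function g on $\mathbb{R}_+$,
$$\lim_{n\to\infty} \int_{\mathbb{R}_+} g(\theta) \kappa_n(\theta)\,d\theta = \int_{\mathbb{R}_+} g(\theta) \kappa(\theta)\,d\theta. \tag{B.7}$$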
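Then equality (B.7) holds also for every bounded measurable function g on $(\mathbb{R}_+, B(\mathbb{R}_+))$.

Proof According to (B.7) under g(θ) ≡ 1, we see that $\lim_{n\to\infty} \int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta < \infty$, so that, without loss of generality, we can assume that all the integrals $\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta$, $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta$ are bounded by a common constant.

Let us show that the integrals $\int_E \kappa_n(\theta)\,d\theta$ are countably additive uniformly with respect to n.

Let us fix an arbitrary δ > 0, and consider $E = \bigcup_{j=1}^\infty E_j$, where, for each j ≥ 1, $E_j$ is Lebesgue measurable.

If E ⊆ (a, b) for b < ∞, then one can find M < ∞ such that $\mathrm{Leb}\big(E \setminus \bigcup_{j=1}^M E_j\big) < \frac{\delta}{D}$ and hence, for all n = 1, 2, …,
$$\int_E \kappa_n(\theta)\,d\theta - \int_{\bigcup_{j=1}^M E_j} \kappa_n(\theta)\,d\theta \le \int_{E \setminus \bigcup_{j=1}^M E_j} \sup_{\theta \in \mathbb{R}_+} \kappa_n(\theta)\,d\theta \le \delta.$$
If the set E is unbounded, we fix t > 0 such that, for the continuous function
$$g(\theta) = \begin{cases} 1, & \text{if } \theta \ge 2t; \\ [\theta - t]\,\frac{1}{t}, & \text{if } t \le \theta \le 2t; \\ 0, & \text{if } \theta \le t, \end{cases}$$
the inequality
$$\int_{\mathbb{R}_+} g(\theta) \kappa(\theta)\,d\theta = \int_{(t,\infty)} g(\theta) \kappa(\theta)\,d\theta \le \frac{\delta}{4}$$
holds, and fix N such that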
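$$\int_{\mathbb{R}_+} g(\theta) \kappa_n(\theta)\,d\theta \le \int_{\mathbb{R}_+} g(\theta) \kappa(\theta)\,d\theta + \frac{\delta}{4} \le \frac{\delta}{2}$$
for all n ≥ N. After that (if needed) we increase t to make sure that the inequality
$$\int_{\mathbb{R}_+} g(\theta) \kappa_n(\theta)\,d\theta \le \frac{\delta}{2}$$
holds also for n = 1, 2, …, N. Now for all n = 1, 2, …,
$$\int_{(t,\infty)} \kappa_n(\theta)\,d\theta \le \frac{\delta}{2}.$$
For $\tilde{E} := E \cap (0,t]$, we, as before, fix M < ∞ such that, for the set $\tilde{E}' := \bigcup_{j=1}^M (E_j \cap (0,t]) = \big(\bigcup_{j=1}^M E_j\big) \cap (0,t]$, the inequality $\mathrm{Leb}(\tilde{E} \setminus \tilde{E}') < \frac{\delta}{2D}$ holds, and hence, for all n = 1, 2, …,
$$\int_{\tilde{E}} \kappa_n(\theta)\,d\theta \le \frac{\delta}{2} + \int_{\tilde{E}'} \kappa_n(\theta)\,d\theta.$$
Now
$$\int_E \kappa_n(\theta)\,d\theta \le \frac{\delta}{2} + \int_{\tilde{E}} \kappa_n(\theta)\,d\theta \le \delta + \int_{\tilde{E}'} \kappa_n(\theta)\,d\theta \le \delta + \int_{\bigcup_{j=1}^M E_j} \kappa_n(\theta)\,d\theta$$
for all n = 1, 2, ….

According to Proposition B.1.19 under $L^1(\mathbb{R}_+, B(\mathbb{R}_+), \mathrm{Leb})$, for some subsequence $\{\kappa_{n_i}\}_{i=1}^\infty$ of $\{\kappa_n\}_{n=1}^\infty$, there is a $\bar{\kappa} \in L^1(\mathbb{R}_+, B(\mathbb{R}_+), \mathrm{Leb})$ such that, for all bounded measurable functions g on $\mathbb{R}_+$,
$$\lim_{i\to\infty} \int_{\mathbb{R}_+} g(\theta) \kappa_{n_i}(\theta)\,d\theta = \int_{\mathbb{R}_+} g(\theta) \bar{\kappa}(\theta)\,d\theta.$$
Note that $\int_{\mathbb{R}_+} \bar{\kappa}(\theta)\,d\theta = \int_{\mathbb{R}_+} \kappa(\theta)\,d\theta$ because of (B.7).

If $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta = 0$ then, for every bounded nonnegative measurable function g on $(\mathbb{R}_+, B(\mathbb{R}_+))$,
$$\limsup_{n\to\infty} \int_{\mathbb{R}_+} g(\theta) \kappa_n(\theta)\,d\theta \le \sup_{\theta \in \mathbb{R}_+} g(\theta) \lim_{n\to\infty} \int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta = \sup_{\theta \in \mathbb{R}_+} g(\theta) \int_{\mathbb{R}_+} \kappa(\theta)\,d\theta = 0$$
and the statement of the corollary follows.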
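Assume that $\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta > 0$. Without loss of generality, we assume also that $\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta > 0$ for all n = 1, 2, …. Formula (B.7) means that the probability measure $E \to \frac{\int_E \kappa(\theta)\,d\theta}{\int_{\mathbb{R}_+} \kappa(\theta)\,d\theta}$ on $B(\mathbb{R}_+)$ is the weak limit of the probability measures $E \to \frac{\int_E \kappa_n(\theta)\,d\theta}{\int_{\mathbb{R}_+} \kappa_n(\theta)\,d\theta}$ as n → ∞. This weak limit is unique. Thus the measure $E \to \int_E \bar{\kappa}(\theta)\,d\theta$ on $B(\mathbb{R}_+)$ coincides with the measure $E \to \int_E \kappa(\theta)\,d\theta$, and $\bar{\kappa}(\theta) = \kappa(\theta)$ for almost all $\theta \in \mathbb{R}_+$.

If the statement of this corollary is false, then there is a bounded measurable function $\hat{g}$ on $\mathbb{R}_+$ such that, for some ε > 0, there is a subsequence $\{\hat{\kappa}_{n_i}\}_{i=1}^\infty$ for which
$$\left| \int_{\mathbb{R}_+} \hat{g}(\theta) \hat{\kappa}_{n_i}(\theta)\,d\theta - \int_{\mathbb{R}_+} \hat{g}(\theta) \kappa(\theta)\,d\theta \right| > \varepsilon, \quad i = 1, 2, \ldots. \tag{B.8}$$
But this subsequence $\{\hat{\kappa}_{n_i}\}_{i=1}^\infty$ satisfies all the conditions of the current corollary and hence, for some of its subsequences $\{\hat{\kappa}_{n_{i_j}}\}_{j=1}^\infty$, we must have
$$\lim_{j\to\infty} \int_{\mathbb{R}_+} \hat{g}(\theta) \hat{\kappa}_{n_{i_j}}(\theta)\,d\theta = \int_{\mathbb{R}_+} \hat{g}(\theta) \bar{\hat{\kappa}}(\theta)\,d\theta = \int_{\mathbb{R}_+} \hat{g}(\theta) \kappa(\theta)\,d\theta$$
for some $\bar{\hat{\kappa}} \in L^1(\mathbb{R}_+, B(\mathbb{R}_+), \mathrm{Leb})$ such that $\bar{\hat{\kappa}} = \kappa$ almost everywhere, which contradicts (B.8).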
B.1.5 The Tauberian Theorem

Proposition B.1.20 (Tauberian Theorem) Suppose c is a [0, ∞]-valued measurable function on (0, ∞). Then the following assertions hold.
(a) The following Tauberian relation holds:
$$\liminf_{t\to\infty} \frac{1}{t} \int_{(0,t]} c(s)\,ds \le \liminf_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha s} c(s)\,ds \le \limsup_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha s} c(s)\,ds \le \limsup_{t\to\infty} \frac{1}{t} \int_{(0,t]} c(s)\,ds. \tag{B.9}$$
(b) The following statements are equivalent:
1. All the terms in (B.9) are equal and finite.
2. $\lim_{t\to\infty} \frac{1}{t} \int_{(0,t]} c(s)\,ds$ exists and is finite.
3. $\lim_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha s} c(s)\,ds$ exists and is finite.
(This is the continuous version of the relation presented in Appendix A of [221].)
Proof (a) This part was actually proved in [130], see Lemma 4.5 therein, but we present a more detailed proof here. Let c(s) ≥ 0 be almost everywhere finite, for otherwise the statement holds automatically. The proof of the last inequality is given in the proof of Proposition 4.6 of [46]. The second inequality holds automatically. For the first inequality, we argue similarly as follows. If there exists some α > 0 such that $\int_{(0,\infty)} e^{-\alpha s} c(s)\,ds = \infty$, then the inequality holds automatically. So assume that, for all α > 0, $\int_{(0,\infty)} e^{-\alpha s} c(s)\,ds < \infty$. Therefore, by partial integration (p. 336 of [30]), for each K > 0,
$$\alpha \int_{(0,K]} e^{-\alpha s} c(s)\,ds = \alpha \int_{(0,K]} e^{-\alpha s}\,d\left( \int_{(0,s]} c(t)\,dt \right)$$
$$= \alpha e^{-\alpha K} \int_{(0,K]} c(t)\,dt + \alpha^2 \int_{(0,K]} e^{-\alpha s} \int_{(0,s]} c(t)\,dt\,ds$$
$$\ge \alpha^2 \int_{(0,K]} e^{-\alpha s} \int_{(0,s]} c(t)\,dt\,ds.$$
So
$$\alpha \int_{(0,\infty)} e^{-\alpha s} c(s)\,ds \ge \alpha^2 \int_{(0,\infty)} e^{-\alpha s} \int_{(0,s]} c(t)\,dt\,ds$$
$$\ge \inf_{s > K} \left\{ \frac{\int_{(0,s]} c(t)\,dt}{s} \right\} \alpha^2 \int_{(K,\infty)} e^{-\alpha s} s\,ds$$
$$= \inf_{s > K} \left\{ \frac{\int_{(0,s]} c(t)\,dt}{s} \right\} \left( 1 - \alpha^2 \int_{(0,K]} e^{-\alpha s} s\,ds \right).$$
Passing to the lower limit as α ↓ 0 first and then as K → ∞ gives the first inequality.

(b) We only need to show that statement 3 implies statement 2. Let
$$\lim_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha s} c(s)\,ds = L.$$
We first show that
$$\lim_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha t} c(t) f(e^{-\alpha t})\,dt = L \int_{(0,1]} f(z)\,dz \tag{B.10}$$
for all continuous functions f on [0, 1] as follows. Consider $f(z) = z^k$ for some k = 0, 1, …. Then
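$$\alpha \int_{(0,\infty)} e^{-\alpha t} c(t) f(e^{-\alpha t})\,dt = \alpha \int_{(0,\infty)} e^{-\alpha t(k+1)} c(t)\,dt = \frac{\alpha}{\alpha(k+1)}\, \alpha(k+1) \int_{(0,\infty)} e^{-\alpha(k+1)t} c(t)\,dt,$$
so that
$$\lim_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha t} c(t) f(e^{-\alpha t})\,dt = \frac{1}{k+1} L = L \int_{(0,1]} f(z)\,dz.$$
Thus, (B.10) holds for all polynomials. Now let s be a continuous real-valued function on [0, 1]. Then by the Stone–Weierstrass Theorem, for each ε > 0, there exists a polynomial f on [0, 1] such that |f(z) − s(z)| ≤ ε for all z ∈ [0, 1]. Thus
$$\int_{(0,\infty)} f(e^{-\alpha t}) \alpha e^{-\alpha t}\,dt - \epsilon \le \int_{(0,\infty)} s(e^{-\alpha t}) \alpha e^{-\alpha t}\,dt \le \int_{(0,\infty)} f(e^{-\alpha t}) \alpha e^{-\alpha t}\,dt + \epsilon.$$
Also
$$\alpha \int_{(0,\infty)} e^{-\alpha t} c(t) s(e^{-\alpha t})\,dt \le \alpha \left( \int_{(0,\infty)} e^{-\alpha t} c(t) f(e^{-\alpha t})\,dt + \epsilon \int_{(0,\infty)} e^{-\alpha t} c(t)\,dt \right).$$
Passing to the upper limit as α ↓ 0, we see
$$\limsup_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha t} c(t) s(e^{-\alpha t})\,dt \le L \int_{(0,1]} f(z)\,dz + \epsilon L \le L \int_{(0,1]} s(z)\,dz + 2\epsilon L.$$
Similarly, one can see
$$\liminf_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha t} c(t) s(e^{-\alpha t})\,dt \ge L \int_{(0,1]} s(z)\,dz - 2\epsilon L.$$
Hence
$$\lim_{\alpha\downarrow 0} \alpha \int_{(0,\infty)} e^{-\alpha t} c(t) s(e^{-\alpha t})\,dt = L \int_{(0,1]} s(z)\,dz,$$
i.e., (B.10) holds for any continuous function f on [0, 1].

Consider the function $r(z) = \frac{1}{z} I\{z > e^{-1}\}$ defined on [0, 1]. By Lemma A.4.1 of [221], for each ε > 0, there exist two continuous functions $f_1, f_2$ on [0, 1] such that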
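$$f_1 \le r \le f_2$$
and
$$1 - \epsilon \le \int_{(0,1]} f_1(z)\,dz \le \int_{(0,1]} f_2(z)\,dz \le 1 + \epsilon.$$
Based on this fact, one can show that as 0 < α → 0,
$$\alpha \int_{(0,\infty)} e^{-\alpha t} c(t) r(e^{-\alpha t})\,dt \to L \int_{(0,1]} r(z)\,dz = L.$$
On the other hand,
$$\alpha \int_{(0,\infty)} e^{-\alpha t} c(t) r(e^{-\alpha t})\,dt = \alpha \int_{(0,\frac{1}{\alpha}]} c(t)\,dt = \frac{1}{T} \int_{(0,T]} c(t)\,dt$$
if $\alpha = \frac{1}{T}$. We conclude that statement 2 holds.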
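The relation between the two kinds of averages in Proposition B.1.20 can be observed numerically. The sketch below is an illustration only: it uses the hypothetical cost rate c(s) = 1 + sin(s), a simple midpoint rule, and a truncation of the infinite horizon at 40/α; the function names are assumptions. For this c both the time average and the discounted (Abel) average tend to 1, and closed-form values are $1 + (1 - \cos t)/t$ and $1 + \alpha/(1 + \alpha^2)$ respectively.

```python
import math


def c(s):
    """Hypothetical bounded cost rate used only for this illustration."""
    return 1.0 + math.sin(s)


def integrate(fn, a, b, steps):
    """Midpoint-rule approximation of the integral of fn over [a, b]."""
    h = (b - a) / steps
    return sum(fn(a + (k + 0.5) * h) for k in range(steps)) * h


def cesaro_mean(t, steps=100_000):
    """(1 / t) times the integral of c over (0, t]."""
    return integrate(c, 0.0, t, steps) / t


def abel_mean(alpha, steps=200_000):
    """alpha times the integral of exp(-alpha s) c(s) over (0, infinity), truncated at 40 / alpha."""
    horizon = 40.0 / alpha  # the neglected tail is of order exp(-40) for bounded c
    return alpha * integrate(lambda s: math.exp(-alpha * s) * c(s), 0.0, horizon, steps)


cesaro = cesaro_mean(1000.0)
abel = abel_mean(0.01)
print(f"time average up to t = 1000:     {cesaro:.5f}")
print(f"discounted average, alpha = 0.01: {abel:.5f}")
```

Both numbers are close to 1, as statements 2 and 3 of part (b) suggest for a function whose averages converge.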
B.1.6 Measures and Spaces of Measures

Proposition B.1.21 (Inverse transformation method) Let F be the cumulative distribution function (CDF) of a random variable with values in [0, 1], which is thus right-continuous. Introduce the generalized inverse
$$F^{-1}(y) := \min\{x : F(x) \ge y\}, \quad y \in [0, 1].$$
Then the random variable $X := F^{-1}(U)$ has the CDF F, where U has the standard uniform distribution over the interval [0, 1].

Proof In elementary textbooks, this statement is proved for continuous invertible functions F: see, e.g., Proposition 11.1 of [211]. For the general case (see Theorem 2.1 of [58]), it is sufficient to notice that
$$y < F(x) \implies F^{-1}(y) \le x \implies y \le F(x)$$
and remember that, for all x ∈ [0, 1], P(U < F(x)) = P(U ≤ F(x)) = F(x). Thus, P(X ≤ x) = F(x) ∀ x ∈ [0, 1].
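The inverse transformation method is easy to try out for a CDF with jumps, where the ordinary inverse does not exist. The following is a minimal sketch under assumptions made for this example: a three-point distribution on [0, 1] with dyadic weights, hypothetical names `cdf` and `generalized_inverse`, and a fixed random seed. It implements $F^{-1}(y) = \min\{x : F(x) \ge y\}$ with a binary search over the jump heights and samples $X = F^{-1}(U)$.

```python
import random
from bisect import bisect_left

# Hypothetical discrete distribution on [0, 1]; the weights are dyadic, hence exact as floats.
support = [0.0, 0.5, 1.0]
probs = [0.25, 0.5, 0.25]
cdf_values = [0.25, 0.75, 1.0]  # F evaluated at the atoms


def cdf(x):
    """Right-continuous CDF F(x) = P(X <= x)."""
    return sum(p for a, p in zip(support, probs) if a <= x)


def generalized_inverse(y):
    """F^{-1}(y) = min{x : F(x) >= y} for y in [0, 1]."""
    return support[bisect_left(cdf_values, y)]


rng = random.Random(0)  # fixed seed, chosen arbitrarily
samples = [generalized_inverse(rng.random()) for _ in range(20_000)]
frequencies = {a: samples.count(a) / len(samples) for a in support}
print(frequencies)
```

The empirical frequencies should be close to the weights 0.25, 0.5 and 0.25. The chain of implications used in the proof can also be checked directly on a grid of values of x and y.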
Note that in Proposition B.1.21, $F^{-1} : [0,1] \to [0,1]$ is a mapping with respect to which the image of the Lebesgue measure is the probability distribution whose CDF coincides with F.

Proposition B.1.22 Let X be a nonempty topological Borel space, and μ be a finite signed measure on (X, B(X)). Then μ is a finite measure on (X, B(X)) if and only if
$$\int_X f(x)\,\mu(dx) \ge 0$$
for each [0, 1]-valued continuous function f on X.
Proof See Theorem I.5.5 of [246].
Proposition B.1.23 Let X be a separable metric space, and μ be a finite signed measure on B(X). Then μ = 0 if and only if $\int_X g(y)\,\mu(dy) = 0$ for each bounded uniformly continuous function g on X.
Proof See Lemma 2.3 of [237].
Proposition B.1.24 Let μ be a σ-finite signed measure on ([0, ∞), B([0, ∞))), i.e., the total variation |μ| is σ-finite on ([0, ∞), B([0, ∞))). If
$$\int_{[0,\infty)} e^{-\alpha t}\,\mu(dt) = 0, \quad \forall\, \alpha > 0,$$
then μ(dt) = 0. If, furthermore, μ(dt) = ϕ(t)dt, where ϕ is either a right-continuous R-valued function on [0, ∞) or a left-continuous R-valued function on [0, ∞) with ϕ(0) = 0, then ϕ(t) = 0 for each t ≥ 0.
Proof See Theorem 1.38 of [39].
Lemma B.1.6 Let X be a nonempty topological Borel space equipped with the compatible metric ρ, with the countable everywhere dense subset $X_d = \{e_1, e_2, \ldots\}$ and with the Borel σ-algebra B(X). Suppose P is a probability measure on (X, B(X)). Then P is a Dirac measure if and only if one of the following equivalent statements holds true.
(a) There is some a ∈ X such that for each Γ ∈ B(X),
$$P(\Gamma) = \begin{cases} 1, & \text{if } a \in \Gamma; \\ 0 & \text{otherwise.} \end{cases}$$
(b) For any Γ ∈ B(X), P(Γ) ∈ {0, 1}.
(c) For any $e_m \in X_d$, for any k ∈ N, $P(O(e_m, \frac{1}{k})) \in \{0, 1\}$, where
$$O\left(e_m, \frac{1}{k}\right) = \left\{ e \in X : \rho(e, e_m) < \frac{1}{k} \right\}$$
is an open ball.
Proof Since (a) is just the definition of a Dirac measure and the implications (a)⇒(b)⇒(c) are obvious, it remains to show that (c)⇒(a).

Among the denumerable number of balls $O(e_m, 1)$, take an arbitrary one with $P(O(e_m, 1)) = 1$ and denote it by $\Gamma_1$. Having $\Gamma_i$ in hand with $P(\Gamma_i) = 1$, take an arbitrary ball with $P(O(e_m, \frac{1}{i+1})) = 1$, denote it by $\hat{O}_{i+1}$ and put $\Gamma_{i+1} = \Gamma_i \cap \hat{O}_{i+1}$. Then $P(\Gamma_{i+1}) = 1$ because otherwise $P(\hat{O}_{i+1} \setminus \Gamma_i) > 0$ and $P(\Gamma_i) < 1$. The set $\Gamma_\infty = \bigcap_{i=1}^\infty \Gamma_i$ is non-empty because $P(\Gamma_\infty) = 1$; it cannot contain two different points because otherwise, for $x_1, x_2 \in \Gamma_\infty$, $\rho(x_1, x_2) = \varepsilon > 0$, which contradicts $x_1, x_2 \in \Gamma_i$ with $i > \frac{1}{\varepsilon}$. (Remember, $\Gamma_i \subset \hat{O}_i$, where $\hat{O}_i$ is a ball with radius $\frac{1}{i}$.) Therefore, $\Gamma_\infty = \{a\}$ is a singleton and statement (a) holds.

An extension of Lemma B.1.6 is given in Proposition B.1.36 below.

Proposition B.1.25 Let X be a separable metrizable space, and τ be a subbase of its topology. Then its Borel σ-algebra B(X) coincides with σ(τ), the σ-algebra generated by τ.
Proof This follows from Proposition 7.1 of [21]. See also p. 117 of [21].
Proposition B.1.26 Let X be a metrizable space. Then B(X ) coincides with the σ-algebra generated by the collection of all bounded continuous functions on X .
Proof See Proposition 7.10 of [21].
Definition B.1.7 (topology on P(X) generated by C(X)) Let X be a metrizable topological space and f be an R-valued bounded continuous function on X. For a fixed measure p ∈ P(X), with P(X) being the space of probability measures on (X, B(X)),
$$V_\epsilon(p, f) := \left\{ q \in P(X) : \left| \int_X f(x)\,q(dx) - \int_X f(x)\,p(dx) \right| < \epsilon \right\}$$
is a neighborhood of p. The weak topology on P(X) is the weakest topology which contains the neighborhoods $V_\epsilon(p, f)$ for all ε > 0, p ∈ P(X), f ∈ C(X), where C(X) is the space of bounded continuous functions on X.

Proposition B.1.27 Let X be a metrizable space and let P(X) be equipped with the weak topology. Then the following assertions hold.
(a) A sequence $\{p_n\}_{n=1}^\infty$ in P(X) converges to p if and only if
$$\lim_{n\to\infty} \int_X f(x)\,p_n(dx) = \int_X f(x)\,p(dx) \quad \text{for each } f \in C(X).$$
(b) If X is a topological Borel space, then P(X) is a topological Borel space.
Proof See Proposition 7.21 and Corollary 7.25.1 of [21].
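A standard way to see what the criterion in Proposition B.1.27(a) does and does not say is to look at Dirac measures concentrated at 1/n on X = [0, 1]. The sketch below is only an illustration with hypothetical helper names and one arbitrarily chosen test function (the cosine); it does not verify the criterion for all f ∈ C(X). The integrals of the test function converge to its integral under the Dirac measure at 0, while the masses of the closed set {0} do not converge to 1.

```python
import math


def integral_against_dirac(f, point):
    """Integral of f with respect to the Dirac measure concentrated at `point`."""
    return f(point)


def dirac_mass_of_singleton(point, atom):
    """Mass that the Dirac measure at `point` assigns to the closed set {atom}."""
    return 1.0 if point == atom else 0.0


test_function = math.cos  # one bounded continuous function; an arbitrary choice
points = [1.0 / n for n in (1, 10, 100, 1000)]

integrals = [integral_against_dirac(test_function, x) for x in points]
limit_integral = integral_against_dirac(test_function, 0.0)
masses = [dirac_mass_of_singleton(x, 0.0) for x in points]

print("integrals:", integrals, "limit:", limit_integral)
print("masses of {0}:", masses)
```

This is why weak convergence is formulated through integrals of continuous functions rather than through the values of the measures on individual Borel sets.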
Proposition B.1.28 Let X be a nonempty set. Consider two topologies τ1 and τ2 on X . Then τ1 ⊆ τ2 if and only if every convergent net in X with respect to τ2 also converges with respect to τ1 .
Proof See p. 127 of [95].
Definition B.1.8 (w-norm) Let X be a Borel space and w be a measurable [1, ∞)-valued function on X.
(a) A finite signed measure M on (X, B(X)) is said to have a finite w-norm if $\int_X w(x)\,|M|(dx) < \infty$. Here |M| stands for the total variation of M. Denote by $M_w^F(X)$ the linear space of all finite signed measures on (X, B(X)) with a finite w-norm.
(b) If X is a topological Borel space, we introduce the space $C_w(X) := C(X) \cap B_w(X)$, where $B_w(X)$ is the space of measurable functions f on X satisfying $\sup_{x \in X} \frac{|f(x)|}{w(x)} < \infty$.

Let X be a topological Borel space. The w-weak topology on $M_w^F(X)$, denoted by $\tau(M_w^F(X))$, is the weakest topology in which the mapping $M \to \int_X u(x)\,M(dx)$ from $M_w^F(X)$ to R is continuous for each $u \in C_w(X)$. Convergence in this topology is denoted by $M_\zeta \xrightarrow{w} M$, where $\{M_\zeta\}$ is a net in $M_w^F(X)$ converging to M.

Let $P_w(X) := M_w^F(X) \cap P(X)$, and $\tau(P_w(X))$ be the relative topology induced by $\tau(M_w^F(X))$. It coincides with the w-weak topology on $P_w(X)$ generated by $C_w(X)$, see [3, Lemma 2.53]. There is a one-to-one correspondence between $P_w(X)$ and P(X) given by the mappings $\tilde{P} = Q_w(P)$ and $P = Q_w^{-1}(\tilde{P})$ for $P \in P_w(X)$ and $\tilde{P} \in P(X)$:
$$\tilde{P}(\Gamma) := \frac{\int_\Gamma w(x)\,P(dx)}{\int_X w(x)\,P(dx)} \quad \text{and} \quad P(\Gamma) := \frac{\int_\Gamma \frac{\tilde{P}(dx)}{w(x)}}{\int_X \frac{\tilde{P}(dx)}{w(x)}}, \quad \forall\, \Gamma \in B(X).$$
The equality $Q_w^{-1}(Q_w(P)) = P$ is obvious.

Remark B.1.3 If X is a topological Borel space, then on the space P(X) we fix the standard weak topology $\tau_{weak}$ generated by C(X), and the corresponding Borel σ-algebra generated by $\tau_{weak}$, unless stated otherwise.

Proposition B.1.29 The set of all Dirac measures on a topological Borel space X is closed in P(X).
Proof See Chap. II, Lemma 6.2 of [176].
Lemma B.1.7 Suppose X is a topological Borel space and w is a continuous function on X. Let $P \in P_w(X)$, $\{P_\zeta\} \subseteq P_w(X)$, where $\{P_\zeta\}$ is a net, and let $\tilde P = Q_w(P) \in P(X)$, $\{\tilde P_\zeta\} = \{Q_w(P_\zeta)\} \subseteq P(X)$. With some abuse of notation, the latter is the image of the net $\{P_\zeta\}$. Then $P_\zeta \xrightarrow{w} P$ if and only if $\tilde P_\zeta \to \tilde P$.
Proof (a) If $P_\zeta \xrightarrow{w} P$, then, for every function g ∈ C(X),
\[ \int_X g(x)\,\tilde P_\zeta(dx) = \frac{\int_X g(x)w(x)\,P_\zeta(dx)}{\int_X w(x)\,P_\zeta(dx)} \to \frac{\int_X g(x)w(x)\,P(dx)}{\int_X w(x)\,P(dx)} = \int_X g(x)\,\tilde P(dx) \]
because both gw and w are functions in $C_w(X)$. Hence $\tilde P_\zeta \to \tilde P$.
(b) If $\tilde P_\zeta \to \tilde P$, then, for every function $g \in C_w(X)$,
\[ \int_X g(x)\,P_\zeta(dx) = \frac{\int_X \frac{g(x)}{w(x)}\,\tilde P_\zeta(dx)}{\int_X \frac{\tilde P_\zeta(dx)}{w(x)}} \to \frac{\int_X \frac{g(x)}{w(x)}\,\tilde P(dx)}{\int_X \frac{\tilde P(dx)}{w(x)}} \]
because $\frac{g}{w}, \frac{1}{w} \in C(X)$. In other words,
\[ \int_X g(x)\,P_\zeta(dx) \to \int_X g(x)\,P(dx). \]
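The correspondence $Q_w$ and its inverse can be illustrated on a finite space, where the integrals become sums. The sketch below is only an illustration: the three-point space, the weight `w`, and the function names `q_w` and `q_w_inverse` are assumptions made for this example and do not come from the text.

```python
from fractions import Fraction


def q_w(p, w):
    """Map P to P~ = Q_w(P): reweight each point by w and normalize."""
    total = sum(w[x] * p[x] for x in p)
    return {x: w[x] * p[x] / total for x in p}


def q_w_inverse(p_tilde, w):
    """Map P~ back to P: reweight each point by 1/w and normalize."""
    total = sum(p_tilde[x] / w[x] for x in p_tilde)
    return {x: p_tilde[x] / w[x] / total for x in p_tilde}


# Hypothetical three-point space with a weight function w >= 1.
w = {"a": Fraction(1), "b": Fraction(2), "c": Fraction(5)}
p = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}

p_tilde = q_w(p, w)              # probability measure obtained from P
p_back = q_w_inverse(p_tilde, w)  # recovers the original P exactly
```

Exact rational arithmetic is used so that the identity $Q_w^{-1}(Q_w(P)) = P$ can be checked with equality rather than a tolerance.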
Corollary B.1.3 $(P_w(X), \tau(P_w(X)))$ is a topological Borel space.
Proof According to Lemma B.1.7, $Q_w$ is a homeomorphism between the topological spaces $(P_w(X), \tau(P_w(X)))$ and $(P(X), \tau_{weak})$, and the latter space is topological Borel by Proposition B.1.27.
Definition B.1.9 (Moment) A [0, ∞)-valued measurable function g on a topological Borel space X is called strictly unbounded (or moment, or norm-like) if there exists an increasing sequence of compact sets $X_n \uparrow X$ such that $\lim_{n\to\infty} \inf_{x\in X\setminus X_n} g(x) = +\infty$. If X is compact then any function on X is strictly unbounded.
Definition B.1.10 (Tight and relatively compact family) Consider a family $\mathcal{P} \subseteq P(X)$ of probability measures on a metric space X endowed with its Borel σ-algebra. The family $\mathcal{P}$ is said to be
(a) tight if for every ε > 0 there exists a compact subset K ⊂ X such that for each $\mu \in \mathcal{P}$, μ(K) > 1 − ε;
(b) relatively compact (in $(P(X), \tau_{weak})$) if every sequence in $\mathcal{P}$ contains a weakly convergent subsequence in P(X).
Proposition B.1.30 (Prohorov Theorem) Let $\mathcal{P} \subseteq P(X)$ be a family of probability measures on a metric space X.
(a) If there exists a strictly unbounded function g on X such that
\[ \sup_{\mu\in\mathcal{P}} \int_X g(x)\,\mu(dx) < \infty, \]
then $\mathcal{P}$ is tight.
(b) If $\mathcal{P}$ is tight, then it is relatively compact (in $(P(X), \tau_{weak})$).
(c) Suppose X is separable and complete. If $\mathcal{P}$ is relatively compact (in $(P(X), \tau_{weak})$), then it is tight.
Proof (a) For this item, one can refer to [12, Sect. 2]. But the proof is short, and we present it below. If $\mathcal{P}$ is not tight, then there exists ε > 0 such that, for every compact K ⊆ X, there is $\mu \in \mathcal{P}$ with μ(X \ K) ≥ ε. For an arbitrary $M \in \mathbb{R}_+$, take $X_n$ such that $\inf_{x\notin X_n} g(x) > M$ and consider the corresponding measure $\hat\mu \in \mathcal{P}$ satisfying $\hat\mu(X \setminus X_n) \ge \varepsilon$. Then
\[ \int_X g(x)\,\hat\mu(dx) \ge \int_{X\setminus X_n} g(x)\,\hat\mu(dx) \ge M\varepsilon, \]
so that $\sup_{\mu\in\mathcal{P}} \int_X g(x)\,\mu(dx)$ cannot be finite, as M was arbitrary.
(b) and (c): see [23, pp. 37–40] or [176, Chap. II, Theorem 6.7]. This is the so-called Prohorov's Theorem.
B.1.7 Stochastic Kernels
Definition B.1.11 (Stochastic and finite kernels) Consider measurable spaces $(X, \mathcal{F}_X)$ and $(Y, \mathcal{F}_Y)$. The function ϕ(dy|x) defined on $\mathcal{F}_Y \times X$ is called a stochastic (finite) kernel on $\mathcal{F}_Y$ given x ∈ X if the following two conditions are satisfied.
(a) For each x ∈ X, ϕ(·|x) is a probability (finite) measure on $(Y, \mathcal{F}_Y)$.
(b) For each $\Gamma_Y \in \mathcal{F}_Y$, the function $x \in X \to \varphi(\Gamma_Y|x)$ is measurable on $(X, \mathcal{F}_X)$.
Proposition B.1.31 Let X and Y be nonempty topological Borel spaces, endowed with their corresponding Borel σ-algebras. Then ϕ(dy|x) is a stochastic kernel on B(Y) given x ∈ X if and only if the mapping x ∈ X → ϕ(·|x) ∈ P(Y) is Borel-measurable.
Proof See Proposition 7.26 of [21].
Definition B.1.12 (Continuous stochastic kernel) Suppose that X and Y are topological Borel spaces. A stochastic kernel ϕ(dy|x) on B(Y) given x ∈ X is called continuous if for each f ∈ C(Y), the function $x \to \int_Y f(y)\,\varphi(dy|x)$ is continuous on X.
Proposition B.1.32 Let X and Y be topological Borel spaces, and p(dz|x, y) be a continuous stochastic kernel from X × Y to B(X). If f is a lower semicontinuous (−∞, ∞]-valued function on X that is bounded from below, then the function $(x, y) \in X \times Y \to \int_X f(z)\,p(dz|x, y)$ is lower semicontinuous and bounded from below. If f is an upper semicontinuous [−∞, ∞)-valued function on X that is bounded from above, then the function $(x, y) \in X \times Y \to \int_X f(z)\,p(dz|x, y)$ is upper semicontinuous and bounded from above.
Proof See Proposition 7.31 of [21].
Proposition B.1.33 Let X, Y and Z be (nonempty) Borel spaces. Let q(dy × dz|x) be a stochastic kernel on B(Y × Z) given x ∈ X. Then there are stochastic kernels ϕ(dy|x) on B(Y) given x ∈ X and r(dz|x, y) on B(Z) given (x, y) ∈ X × Y such that q(dy × dz|x) = ϕ(dy|x) r(dz|x, y). For every fixed x ∈ X, clearly, ϕ(dy|x) = q(dy × Z|x), and the stochastic kernel r(dz|x, y) is unique in the sense that, if two stochastic kernels $r_1$ and $r_2$ on B(Z) given y ∈ Y satisfy the equality $q(dy \times dz|x) = \varphi(dy|x)\, r_{1,2}(dz|x, y)$, then there is a set B ∈ B(Y), perhaps dependent on x, with q(B × Z|x) = ϕ(B|x) = 0 such that $r_1(A|x, y) = r_2(A|x, y)$ for all A ∈ B(Z) and y ∈ Y \ B.
Proof See e.g., Theorem F on p. 88 of [69] or Corollary 7.27.1 of [21]. The uniqueness follows from Lemma 10.4.3 of [27, Vol. II.].
Proposition B.1.33 is also valid if q is a finite (not necessarily stochastic) kernel: 0 < q(Y × Z|x) =: f(x) < ∞. It is sufficient to pass to the stochastic kernel $\tilde q(\cdot|x) := \frac{q(\cdot|x)}{f(x)}$ and introduce the stochastic kernels $\tilde\varphi(dy|x)$ and $\tilde r(dz|x, y)$ as in Proposition B.1.33. Then q(dy × dz|x) = ϕ(dy|x) r(dz|x, y), where $\varphi(dy|x) = f(x)\,\tilde\varphi(dy|x) = q(dy \times Z|x)$ is a finite kernel and $r = \tilde r$ is a stochastic kernel.
Proposition B.1.34 Let X and Y be (nonempty) Borel spaces and ϕ(dy|x) be a stochastic kernel on B(Y) given x ∈ X. If f is a Borel-measurable (resp., lower semianalytic) function on X × Y then the function on X
\[ \lambda(x) = \int_Y f(x, y)\,\varphi(dy|x) := \int_Y f^+(x, y)\,\varphi(dy|x) - \int_Y f^-(x, y)\,\varphi(dy|x) \]
is measurable (resp., lower semianalytic). Here and below the convention of ∞ − ∞ := ∞ is used concerning the definition of the above integral. Proof See Propositions 7.29 and 7.48 of [21].
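For finite Y and Z and a fixed x, the factorization in Proposition B.1.33 reduces to computing a marginal and a conditional distribution. The following sketch makes that concrete; the joint masses in `q` and the helper name `disintegrate` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction as Fr


def disintegrate(q):
    """Split a joint pmf q[(y, z)] into the marginal phi[y] and the
    conditional r[y][z]; r is left undefined on phi-null points y."""
    phi = {}
    for (y, _z), mass in q.items():
        phi[y] = phi.get(y, Fr(0)) + mass
    r = {}
    for (y, z), mass in q.items():
        if phi[y] > 0:
            r.setdefault(y, {})[z] = mass / phi[y]
    return phi, r


# Hypothetical joint distribution q(dy x dz | x) for one fixed x.
q = {
    ("y1", "z1"): Fr(1, 8),
    ("y1", "z2"): Fr(3, 8),
    ("y2", "z1"): Fr(1, 2),
    ("y3", "z1"): Fr(0),
}

phi, r = disintegrate(q)
# Recompose q = phi * r wherever r is defined.
recomposed = {(y, z): phi[y] * r[y][z] for (y, z) in q if phi[y] > 0}
```

The point `y3` carries zero mass, so any choice of `r["y3"]` would be compatible with `q`; this is the finite analogue of uniqueness up to a ϕ-null set.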
It follows from Proposition B.1.34 that, if X, Y and Z are (nonempty) Borel spaces and ϕ(dy|x), q(dz|x, y) are stochastic kernels, then, for arbitrarily fixed $\Gamma_Y \in B(Y)$, $\Gamma_Z \in B(Z)$, the function on X
\[ \lambda(\Gamma_Y \times \Gamma_Z|x) = \int_Y I\{y \in \Gamma_Y\}\, q(\Gamma_Z|x, y)\,\varphi(dy|x) \]
is measurable and hence defines a stochastic kernel on B(Y × Z) given x ∈ X. Indeed, the extension of $\lambda(\Gamma_Y \times \Gamma_Z|x)$ to $\lambda(\Gamma|x)$ with $\Gamma \in B(Y \times Z)$ is measurable in x ∈ X by the Monotone Class Theorem, see Proposition B.1.42. The next statement is a version of Proposition B.1.34.
Proposition B.1.35 Let $(X, \mathcal{F}_X, \mu)$ and $(Y, \mathcal{F}_Y, \nu)$ be two measure spaces and f : X × Y → ℝ be an $\mathcal{F}_X \times \mathcal{F}_Y$-measurable function such that, for μ-almost all x ∈ X, the function y ∈ Y → f(x, y) is integrable with respect to ν. Then the function
\[ \psi : x \to \int_Y f(x, y)\,\nu(dy) \]
is $\mathcal{F}_X$-measurable.
Proof See Corollary 3.4.6 of [27, Vol. I.].
Proposition B.1.36 Let $(X, \mathcal{F}_X, \mu)$ be a probability space, and Y be a Borel space with the underlying σ-algebra B(Y). Suppose P is a probability measure on the product measurable space $(X \times Y, \mathcal{F}_X \times B(Y))$ such that P(dx × dy) = ϕ(dy|x) μ(dx) for some stochastic kernel ϕ(dy|x) on B(Y) given x ∈ X. If for each $\Gamma \in B(Y)$, it holds that $\mu(\{x \in X : \varphi(\Gamma|x) \in \{0, 1\}\}) = 1$, then there exists a measurable mapping f from $(X, \mathcal{F}_X)$ to (Y, B(Y)) such that $P(dx \times dy) = \delta_{f(x)}(dy)\,\mu(dx)$.
Proof See the proof of Theorem 10 of [179].
Stochastic kernels and some initial distribution specify a probability measure on a finite or countably infinite product of measurable spaces in the sense of the next statement.
Proposition B.1.37 (Ionescu-Tulcea Theorem) Let $\{(X_i, \mathcal{F}_i)\}_{i=1}^\infty$ be a sequence of measurable spaces, $Y_n := X_1 \times X_2 \times \dots \times X_n$ and $Y := X_1 \times X_2 \times \dots$. Let p be a given probability measure on $(X_1, \mathcal{F}_1)$, and, for n = 1, 2, ..., let $q_n(dx_{n+1}|y_n)$ be a stochastic kernel on $X_{n+1}$ given $y_n \in Y_n$. Then for each n = 2, 3, ..., there exists a unique probability measure $\mu_n$ on $(Y_n, \bigotimes_{i=1}^n \mathcal{F}_i)$, with $\bigotimes_{i=1}^n \mathcal{F}_i$ being the product σ-algebra of $\{\mathcal{F}_i\}_{i=1}^n$, such that, for all $\Gamma_1 \in \mathcal{F}_1, \dots, \Gamma_n \in \mathcal{F}_n$,
\[ \mu_n(\Gamma_1 \times \dots \times \Gamma_n) = \int_{\Gamma_1} \int_{\Gamma_2} \dots \int_{\Gamma_{n-1}} q_{n-1}(\Gamma_n|x_1, \dots, x_{n-1})\, q_{n-2}(dx_{n-1}|x_1, \dots, x_{n-2}) \dots q_1(dx_2|x_1)\, p(dx_1). \]
If $f : Y_n \to [0, \infty]$ is a measurable function, then
\[ \int_{Y_n} f(y_n)\, d\mu_n(y_n) = \int_{X_1} \int_{X_2} \dots \int_{X_n} f(x_1, x_2, \dots, x_n)\, q_{n-1}(dx_n|x_1, \dots, x_{n-1}) \dots q_1(dx_2|x_1)\, p(dx_1). \]
Furthermore, there exists a unique probability measure μ on $(Y, \bigotimes_{i=1}^\infty \mathcal{F}_i)$, with $\bigotimes_{i=1}^\infty \mathcal{F}_i$ being the product σ-algebra of $\{\mathcal{F}_i\}_{i=1}^\infty$, such that, for each n, the marginal of μ on $Y_n$ is $\mu_n$.
Proof See Theorem 2.7.2 on p. 114 of [9] and Proposition 7.28 of [21].
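On finite spaces, the measures $\mu_n$ of the Ionescu-Tulcea Theorem are obtained by multiplying the initial distribution by the successive kernels along each path. The sketch below assumes a hypothetical two-point state space, an initial distribution `p`, and a kernel `q` that happens to depend only on the last coordinate of the history; none of these come from the text.

```python
from fractions import Fraction as Fr
from itertools import product

STATES = (0, 1)
p = {0: Fr(1, 3), 1: Fr(2, 3)}  # hypothetical initial distribution on X_1


def q(history):
    """Hypothetical kernel q_n(. | x_1, ..., x_n): stay at the last state
    with probability 3/4, switch with probability 1/4."""
    last = history[-1]
    return {s: (Fr(3, 4) if s == last else Fr(1, 4)) for s in STATES}


def mu(n):
    """Joint law mu_n on STATES^n, built by iterating the kernels."""
    law = {}
    for path in product(STATES, repeat=n):
        mass = p[path[0]]
        for k in range(1, n):
            mass *= q(path[:k])[path[k]]
        law[path] = mass
    return law


mu2, mu3 = mu(2), mu(3)

# Marginal of mu_3 on the first two coordinates; it should equal mu_2.
marginal = {}
for path, mass in mu3.items():
    marginal[path[:2]] = marginal.get(path[:2], Fr(0)) + mass
```

The consistency of `marginal` with `mu2` is the finite counterpart of the statement that the marginal of μ on $Y_n$ is $\mu_n$.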
B.1.8 Carathéodory Functions and Measurable Selection Definition B.1.13 (Carathéodory function) If X , Y and Z are metrizable spaces, all endowed with their Borel σ-algebras, and X is separable, then a function f from Z × X to Y is called a Carathéodory function if the following two conditions are satisfied. (a) For each x ∈ X, function z ∈ Z → f (z, x) is measurable. (b) For each z ∈ Z , function x ∈ X → f (z, x) is continuous. Proposition B.1.38 Let X be a separable metrizable space, and Y and Z be metrizable spaces, all endowed with their Borel σ-algebras. Then it holds that every Carathéodory function from Z × X to Y is (jointly) measurable on Z × X .
Proof See Lemma 4.51 of [3]. The next proposition presents a measurable selection theorem.
Proposition B.1.39 Suppose X and Y are topological Borel spaces, A(x) ⊆ Y is compact for each x ∈ X, K = {(x, y) ∈ X × Y : y ∈ A(x)} ⊆ X × Y is a measurable subset, and f is a [−∞, ∞]-valued measurable function on K such that the function y → f(x, y) is lower semicontinuous in y ∈ A(x) for each x ∈ X. Then the function $x \to \inf_{y\in A(x)} f(x, y)$ is measurable on X, and there exists a measurable mapping ϕ from X to Y such that
\[ f(x, \varphi(x)) = \inf_{y\in A(x)} f(x, y), \quad \forall\, x \in X. \]
Moreover, for each x ∈ X, the set of minimizers {y ∈ A(x) : f (x, y) = inf y∈A(x) f (x, y)} is compact in Y . Proof See Theorem 2 of [124]; see also Corollary 1 and Remark 1 of [32]. The last assertion follows from the definition of lower semicontinuous functions; see p. 146 of [21].
Definition B.1.14 (Inf-compact function) A [−∞, ∞]-valued function f defined on a topological space X is called inf-compact if, for each $c \in (-\infty, \infty)$, $\{x \in X : f(x) \le c\}$ is a compact subset of X.
Proposition B.1.40 (Berge's Theorem) Let X and Y be topological Borel spaces and f be a [−∞, ∞]-valued lower semicontinuous function on X × Y. Suppose that Y is compact (respectively, f is inf-compact). Then the following assertions hold.
(a) The function $x \in X \to \inf_{y\in Y} f(x, y)$ is lower semicontinuous (respectively, inf-compact).
(b) There exists a measurable mapping ϕ from X to Y such that
\[ f(x, \varphi(x)) = \inf_{y\in Y} f(x, y), \quad \forall\, x \in X. \]
(c) The set {(x, y) ∈ X × Y : f (x, y) = inf y∈Y f (x, y)} is a measurable subset in B(X × Y ). (d) If the function f is continuous, then the function x ∈ X → inf y∈Y f (x, y) is continuous. (e) If f is inf-compact, for each x ∈ X , if inf y∈Y f (x, y) < ∞, then the set {y ∈ Y : f (x, y) = inf y∈Y f (x, y)} is nonempty compact. Proof See Propositions 7.32 and 7.33 of [21], and Theorems 3.1, 3.2, 4.1 and Corollary 3.1 of [79].
B.1.9 Conditional Expectations and Regular Conditional Measures
Definition B.1.15 (Conditional expectation) Let $(\Omega, \mathcal{F}, P)$ be a probability space, Z be a (P-)integrable (or [0, ∞]-valued, respectively) random variable on it, and $\mathcal{G} \subseteq \mathcal{F}$ be a sub-σ-algebra. A random variable Y, denoted by $E(Z|\mathcal{G})$, is called the conditional expectation of Z with respect to $\mathcal{G}$ if Y is $\mathcal{G}$-measurable, integrable (or [0, ∞]-valued, respectively), and
\[ \int_G Y\, dP = \int_G Z\, dP \quad \text{for all } G \in \mathcal{G}. \]
If $\mathcal{G} = \sigma(X)$ is the σ-algebra generated by a measurable mapping X from $(\Omega, \mathcal{F})$ to another measurable space $(\mathbf{X}, \mathcal{H})$, then we use the notation E(Z|X). For the existence of conditional expectations, see [234, p. 234] or Chap. 2 of [225].
Proposition B.1.41 (Existence of regular conditional measures) Let $(\Omega, \mathcal{F})$ be a Borel space and P be a probability measure on $\mathcal{F}$. Let $X : \Omega \to \mathbf{X}$ be a measurable mapping to another Borel space $(\mathbf{X}, \mathcal{H})$. Then there exists a stochastic kernel ϕ(dy|x) on $\mathcal{F}$ given $x \in \mathbf{X}$ such that
\[ \varphi(X^{-1}(x)|x) = 1 \quad \text{for } P \circ X^{-1}\text{-almost all } x \in \mathbf{X}, \]
and, for all $\Gamma \in \mathcal{F}$ and $E \in \mathcal{H}$,
\[ P(\Gamma \cap X^{-1}(E)) = \int_E \varphi(\Gamma|x)\, P \circ X^{-1}(dx). \]
Here $P \circ X^{-1}(\Gamma_X) := P(X^{-1}(\Gamma_X))$, $\Gamma_X \in \mathcal{H}$, is the image of the probability P with respect to the mapping X. This stochastic kernel ϕ is called the regular conditional measure with respect to X.
Proof See Example 10.4.11 of [27, V. II].
If $Z(\omega) = I_\Gamma(\omega)$ is the indicator function for some $\Gamma \in \mathcal{F}$, then, for the sub-σ-algebra $\sigma(X) := X^{-1}(\mathcal{H})$,
\[ E(Z|\sigma(X)) = \int_\Omega Z(\omega)\,\varphi(d\omega|X(\omega)) \tag{B.11} \]
because the random variable on the right-hand side is σ(X)-measurable, and, for every $G = X^{-1}(E) \in \sigma(X)$ with some $E \in \mathcal{H}$, we have
\[ \int_G \left[ \int_\Omega Z(\omega)\,\varphi(d\omega|X(\omega)) \right] dP(\omega) = \int_E \varphi(\Gamma|x)\, P \circ X^{-1}(dx) = P(\Gamma \cap G) = \int_G Z(\omega)\, dP(\omega). \]
Since every positive bounded measurable function on Ω can be approximated from below by simple functions (taking a finite number of values), we conclude, after trivial additional arguments, that equality (B.11) holds for any integrable or [0, ∞]-valued random variable Z. The expression $\int_\Omega Z(\omega)\,\varphi(d\omega|x)$ is called the regular conditional expectation of Z with respect to (or given) X, sometimes denoted as E(Z|x) or E[Z|X = x], so that, after we substitute X(ω) here, we obtain E(Z|X(ω)) = E(Z|X)(ω) = E(Z|σ(X))(ω): all the notations are consistent.
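On a finite sample space the regular conditional measure is the elementary conditional probability, and the defining identity of Proposition B.1.41 can be checked directly. The sketch below assumes a fair six-sided die as the probability space and parity as the mapping X; both are hypothetical choices made only for this illustration.

```python
from fractions import Fraction as Fr

# Hypothetical probability space: a fair six-sided die.
P = {omega: Fr(1, 6) for omega in range(1, 7)}


def X(omega):
    """Hypothetical observation mapping: parity of the outcome."""
    return omega % 2


# Image measure P o X^{-1} on the values of X.
image = {}
for omega, mass in P.items():
    image[X(omega)] = image.get(X(omega), Fr(0)) + mass


def phi(gamma, x):
    """Regular conditional measure phi(gamma | x) = P(gamma, X = x) / P(X = x)."""
    return sum(P[o] for o in gamma if X(o) == x) / image[x]


def cond_exp(Z, x):
    """Regular conditional expectation E[Z | X = x] as an integral against phi."""
    return sum(Z(o) * phi({o}, x) for o in P)


gamma, E = {1, 2, 3}, {1}
lhs = sum(P[o] for o in gamma if X(o) in E)     # P(gamma ∩ X^{-1}(E))
rhs = sum(phi(gamma, x) * image[x] for x in E)  # integral of phi(gamma | x) over E
```

Averaging `cond_exp` over `image` recovers the unconditional expectation, which is the finite form of the defining property of E(Z|X).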
B.1.10 Other Useful Facts Proposition B.1.42 (Monotone Class Theorem) Let A be an algebra of sets. Then the σ-algebra generated by A coincides with the smallest monotone class containing A.
Proof See Theorem 1.9.3 of [27, Vol. I.].
Proposition B.1.43 (Functional Monotone Class Theorem) Let H be a vector space of bounded real-valued functions on a set X. Suppose that H contains constant functions, and is closed under bounded pointwise passage to the limit, i.e., if a uniformly bounded sequence $\{f_n\}_{n=1}^\infty$ of nonnegative functions in H satisfies $f_n \uparrow f$ pointwise, then it holds that f ∈ H. If K ⊆ H is closed under multiplication, then H contains every bounded function on X that is measurable with respect to the σ-algebra generated by K.
Proof See p. 24 of [134], or Theorem 3.2 in Chap. 2 of [207] and the paragraph below it, or Theorem 20 of [165, Chap. I].
Proposition B.1.44 (Borel–Cantelli Lemma) Let $(\Omega, \mathcal{F}, P)$ be a probability space, and $\{A_n\}_{n=1}^\infty \subseteq \mathcal{F}$ be a sequence of measurable subsets (events). Then the following assertions hold.
(a) If $\sum_{n=1}^\infty P(A_n) < \infty$, then $P(\limsup_{n\to\infty} A_n) = 0$, where $\limsup_{n\to\infty} A_n := \bigcap_{n\ge 1} \bigcup_{m\ge n} A_m$.
(b) If $\{A_n\}_{n=1}^\infty$ is a sequence of mutually independent events, and
\[ \sum_{n=1}^\infty P(A_n) = \infty, \]
then $P(\limsup_{n\to\infty} A_n) = 1$.
Proof See Theorems 4.3 and 4.4 of [24].
Lemma B.1.8 Suppose X is an arbitrary set and F and G are arbitrary operators transforming real-valued functions w on X to real-valued functions [F ∘ w] and [G ∘ w] on X. Then the function w satisfies the equation
\[ w(x) = \min\{[F \circ w](x),\ [G \circ w](x)\}, \quad x \in X, \tag{B.12} \]
if and only if the following equation is satisfied for all real-valued functions β on X satisfying β(x) ≤ 1 for all x ∈ X:
\[ w(x) = \min\{[F \circ w](x),\ [G \circ w](x) + \beta(x)[w(x) - [G \circ w](x)]\}. \tag{B.13} \]
Moreover, if the minimum in (B.12) is provided by the first (second) term inside the braces, then so is the minimum in (B.13).
Proof The proof of this lemma comes from that of Lemma 6 in [181]. Clearly, if w satisfies Eq. (B.13), then, after putting β(x) ≡ 0, we obtain Eq. (B.12). Now, let w satisfy Eq. (B.12). We distinguish two cases.
(a) Suppose x ∈ X is such that [F ∘ w](x) ≤ [G ∘ w](x). Then
\[ [F \circ w](x) = w(x) = [G \circ w](x) + \{w(x) - [G \circ w](x)\} \le [G \circ w](x) + \beta(x)\{w(x) - [G \circ w](x)\} \]
for each real-valued function β on X satisfying β(x) ≤ 1 for all x ∈ X, since {w(x) − [G ∘ w](x)} ≤ 0. Hence, in this case, w(x) = [F ∘ w](x) satisfies Eq. (B.13).
(b) Suppose x ∈ X is such that [F ∘ w](x) > [G ∘ w](x). Then w(x) = [G ∘ w](x) and
\[ [G \circ w](x) + \beta(x)\{w(x) - [G \circ w](x)\} < [F \circ w](x) \]
for each β(x). Hence, in this case w(x) = [G ∘ w](x) + β(x){w(x) − [G ∘ w](x)} also satisfies Eq. (B.13).
This proves the first assertion in the statement of this lemma. The last assertion follows from an inspection of the above proof.
Proposition B.1.45 (Post–Widder Theorem) If the integral
\[ \mathcal{D}(\beta) := \int_{(0,\infty)} e^{-\beta s} D(s)\, ds \]
converges for every β > α for some α > 0, then
\[ D(t) = \lim_{n\to\infty} \frac{(-1)^n}{n!} \left(\frac{n}{t}\right)^{n+1} \frac{d^n \mathcal{D}}{d\beta^n}\!\left(\frac{n}{t}\right) \]
for every point of continuity of D(t).
Proof See Theorem 2.4 of [43].
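The Post–Widder inversion formula can be checked by hand for D(s) = e^{−s}. Its Laplace transform is 1/(β + 1), whose n-th derivative is (−1)^n n!/(β + 1)^{n+1}, so the n-th approximant simplifies to (1 + t/n)^{−(n+1)}, which tends to e^{−t}. The sketch below evaluates this simplified form; the function name `post_widder_exp` is an assumption of this example, and the closed-form derivative is specific to this particular D.

```python
import math


def post_widder_exp(t, n):
    """n-th Post-Widder approximant for D(s) = exp(-s).

    With beta = n / t, the factors (-1)^n and n! cancel against the n-th
    derivative of 1 / (beta + 1), leaving (beta / (beta + 1)) ** (n + 1).
    """
    beta = n / t
    return (beta / (beta + 1.0)) ** (n + 1)


t = 1.0
errors = [abs(post_widder_exp(t, n) - math.exp(-t)) for n in (10, 1000, 10_000)]
```

The errors shrink roughly like 1/n, which illustrates why the formula is of theoretical rather than numerical interest: convergence is slow, and in general high-order derivatives of the transform are required.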
B.2 Convex Analysis and Optimization
B.2.1 Convex Sets and Convex Functions
The material of this subsection is mainly borrowed from [3], [20, Sects. 1.3, 3.3], [89] and [206, Part IV]. Most statements and definitions are presented in the finite-dimensional setup, although they admit extensions to the more general setup. Below, $\langle x, y \rangle$ is the standard inner product on $\mathbb{R}^n$.
Definition B.2.1 (Pareto optimal point) Let E be a fixed nonempty convex subset of $\mathbb{R}^n$. A point u ∈ E is called Pareto optimal if, for each v ∈ E, the componentwise inequality v ≤ u implies that v = u. The collection of all Pareto optimal points is denoted by Par(E).
Definition B.2.2 (Face and extreme point) Let E be a nonempty convex subset of a vector space. A nonempty convex subset D ⊆ E is called extreme (extremal) or a face of E if, for each point u ∈ D, the representation $u = \alpha u_1 + (1 - \alpha) u_2$ with α ∈ (0, 1), $u_1, u_2 \in E$ implies that $u_1, u_2 \in D$. If a face D = {u} of E contains only one point, then u is called an extreme point of E.
Let E be a nonempty convex subset of a vector space. It is obvious that a face of a face of E is also a face of E, and the intersection of an arbitrary collection of faces of E (provided that the intersection is nonempty) is also a face of E. The minimal face of E that contains a point u ∈ E, i.e., the intersection of all the faces of E containing u, is denoted by G(u).
Definition B.2.3 (Supporting hyperplane) Let E be a fixed nonempty convex subset of $\mathbb{R}^n$ and $\langle x, y \rangle$ be the standard inner product on $\mathbb{R}^n$. A supporting hyperplane to E at u ∈ E is any hyperplane of the form $H = \{x \in \mathbb{R}^n : \langle x, b \rangle = \beta\}$, $b \ne 0$, where $\langle x, b \rangle \ge \beta$ for every point x ∈ E, and $\langle u, b \rangle = \beta$. The intersection E ∩ H is known to be a face, a so-called “exposed face”. Note that not every face is exposed.
Suppose E is a fixed nonempty convex subset of a vector space. Clearly, every extreme point u of a face D of E is also extreme in E. Indeed, if $u = \alpha u_1 + (1 - \alpha) u_2$ with $u_1, u_2 \in E$, α ∈ (0, 1), then $u_1, u_2 \in D$ because D is a face; hence $u_1 = u_2 = u$ because u is extreme in D. In particular, in the case of $\mathbb{R}^n$, every extreme point of an exposed face of $E \subseteq \mathbb{R}^n$ is extreme in E. (See also Proposition 3.3.1(a) of [20].)
Lemma B.2.1 Suppose E is a fixed nonempty convex subset of $\mathbb{R}^n$, and u ∈ Par(E). Then the following assertions hold.
(a) G(u) ⊆ Par(E).
(b) For some 1 ≤ k ≤ n, there exist hyperplanes $H^i = \{x \in \mathbb{R}^n : \langle x, b^i \rangle = \beta^i\}$, i = 1, 2, ..., k, with the following properties:
(i) $G(u) = E \cap \left( \bigcap_{i=1}^k H^i \right)$;
(ii) $H^1$ is supporting to E at u and, for every i = 2, 3, ..., k, $H^i$ is supporting to $E \cap \left( \bigcap_{l=1}^{i-1} H^l \right)$ at u;
(iii) $b^i \ge 0$ for i = 1, 2, ..., k − 1 and $b^k > 0$.
Here all the inequalities are componentwise.
Proof See Lemmas 3.1 and 3.2 of [89].
Definition B.2.4 (Convex hull) The convex hull of a set B ⊆ Rn , denoted as conv B, is the intersection of all convex sets in Rn containing B. Lemma B.2.2 (Krein–Milman Theorem) If E ⊆ Rn is a nonempty convex compact set, then E coincides with the convex hull of its extreme points.
Proof See Proposition 3.3.1(c) of [20].
Proposition B.2.1 (Carathéodory Theorem) Let B ⊆ Rn be an arbitrary set. Then u ∈ conv B if and only if u can be expressed as a convex combination of n + 1 points from B (not necessarily distinct).
Proof See Theorem 17.1 of [205].
Corollary B.2.1 If $E \subseteq \mathbb{R}^n$ is a nonempty convex compact set, then u ∈ E if and only if u can be expressed as a convex combination of n + 1 extreme points of E.
Proof The statement follows directly from Lemma B.2.2 and the aforementioned Carathéodory Theorem, see Proposition B.2.1.
The notion of a face can also be introduced for convex sets in abstract locally convex topological vector spaces: see [165, Chap. XI].
Lemma B.2.3 If E is a convex compact set in a locally convex Hausdorff space, then every closed face B of E contains at least one extreme point of E.
Proof See Lemma 7.65 of [3] or Theorem 11 of [165, Chap. XI]. (According to Corollary 2.71 of [160], provided that the set B is closed, Definition B.2.2 for B to be a face of a compact convex set E in a locally convex Hausdorff space is equivalent to the definition of a face presented in [165, Chap. XI].)
Note that the space $\mathcal{X}$ of finite signed measures on a topological Borel space X, equipped with the weak topology generated by the space $\mathcal{Y}$ of bounded (real-valued) functions on X and the bilinear form
\[ \langle v, \mu \rangle := \int_X v(x)\,\mu(dx) \]
defined for each $\mu \in \mathcal{X}$ and $v \in \mathcal{Y}$, is locally convex and Hausdorff: see Sect. B.2.2.
Definition B.2.5 (Convex function) A [−∞, ∞]-valued function f defined on $\Omega \subseteq \mathbb{R}^n$ is called convex if its epigraph $\{(x, \mu) : x \in \Omega,\ \mu \in \mathbb{R},\ \mu \ge f(x)\}$ is a convex set in $\mathbb{R}^{n+1}$. It is called proper convex if it is convex, its effective domain $\mathrm{dom} f := \{x \in \Omega : f(x) < \infty\}$ is nonempty, and f(x) > −∞ for all x ∈ dom f.
Proposition B.2.2 Let f be a (−∞, ∞]-valued function on a convex set $\Omega \subseteq \mathbb{R}^n$. Then f is convex if and only if f((1 − λ)x + λy) ≤ (1 − λ) f(x) + λ f(y) for all λ ∈ [0, 1] and x, y ∈ Ω.
Proof See Theorem 4.1 of [205].
Lemma B.2.4 Let [a, b] ⊆ ℝ be a fixed interval and f : [a, b] → ℝ be a convex function, which attains its global minimum at $S^* \in [a, b)$. Then f is nondecreasing on $[S^*, b]$.
Proof Clearly, for all $S^* \le x_1 < x_2 \le b$, $x_1 = (1 - \lambda)S^* + \lambda x_2$, where $\lambda := \frac{x_1 - S^*}{x_2 - S^*}$. Thus, by Proposition B.2.2,
\[ f(x_1) \le (1 - \lambda) f(S^*) + \lambda f(x_2) \le f(x_2), \]
as desired.
Proposition B.2.3 For a lower semicontinuous (−∞, ∞]-valued proper convex function f on ℝ, the following assertions hold.
(a) For every x ∈ dom f and every y ∈ ℝ,
\[ f(y) = \lim_{\lambda \uparrow 1} f((1 - \lambda)x + \lambda y). \]
(b) On the effective domain dom f, there exist right and left derivative functions $f_+$ and $f_-$. (Only $f_-$ at the point $\bar x := \max\{\mathrm{dom} f\}$ if $\bar x \in \mathrm{dom} f$, and only $f_+$ at the point $\underline{x} := \min\{\mathrm{dom} f\}$ if $\underline{x} \in \mathrm{dom} f$.)
(c) Extend the right and left derivative functions $f_+$ and $f_-$ beyond the interval dom f by setting them both equal to +∞ for the points lying to the right of dom f, and both equal to −∞ for the points lying to the left. Then $f_+$ and $f_-$ are monotone nondecreasing functions on ℝ, and take values in ℝ on the interior of dom f.
Proof See Corollary 7.5.1 and Theorems 23.1 and 24.1 of [205].
Proposition B.2.4 Let f be an ℝ-valued convex function on a nonempty open interval I ⊆ ℝ, i.e., the extension of f with f(x) = ∞ for x ∈ ℝ \ I is convex on ℝ. Then f is continuous, and
\[ f(y) - f(x) = \int_{(x,y]} f_+(t)\, dt = \int_{(x,y]} f_-(t)\, dt \]
for each x, y ∈ I ⊆ ℝ.
Proof See Theorem 10.1 and Corollary 24.2.1 of [205].
B.2.2 Linear Programs
This material is borrowed from [120, Sect. 6.2]; more details can be found in [6, Chap. 3] and [3]. Let $\mathcal{X}$ and $\mathcal{Y}$ be two arbitrary real linear spaces and let $\langle X, Y \rangle$ be a bilinear form on $\mathcal{X} \times \mathcal{Y}$, that is, a real function, linear in each component for each arbitrarily fixed other component. We assume that for all $X \ne 0$ in $\mathcal{X}$ there is a $Y \in \mathcal{Y}$ such that $\langle X, Y \rangle \ne 0$, and for all $Y \ne 0$ in $\mathcal{Y}$ there is an $X \in \mathcal{X}$ such that $\langle X, Y \rangle \ne 0$. We call $(\mathcal{X}, \mathcal{Y})$ a “dual pair”. We equip $\mathcal{X}$ with the weak topology $\tau(\mathcal{X}, \mathcal{Y})$, the coarsest (i.e., weakest) topology such that the mapping $X \in \mathcal{X} \to \langle X, Y \rangle \in \mathbb{R}$ is continuous for each $Y \in \mathcal{Y}$. The weak topology $\tau(\mathcal{Y}, \mathcal{X})$ in $\mathcal{Y}$ is defined similarly. It is known that $(\mathcal{X}, \tau(\mathcal{X}, \mathcal{Y}))$ is a locally convex Hausdorff topological vector space (see [3, p. 211]).
Definition B.2.6 (Positive cone and its dual cone) A positive cone in $\mathcal{X}$ is a subset $Co \subseteq \mathcal{X}$ such that for each $X_1, X_2 \in Co$, $\lambda \in \mathbb{R}_+^0$ and β ∈ [0, 1], it holds that $\beta X_1 + (1 - \beta) X_2 \in Co$ and $\lambda X_1 \in Co$. The dual cone to Co is defined as $Co^* := \{Y \in \mathcal{Y} : \langle X, Y \rangle \ge 0,\ \forall\, X \in Co\}$. Clearly, $Co^*$ is a positive cone in $\mathcal{Y}$.
Suppose $(\mathcal{X}, \mathcal{Y})$ and $(\mathcal{Z}, \mathcal{V})$ are two dual pairs and all the spaces are equipped with the corresponding weak topologies. Let U be a linear continuous mapping from $\mathcal{X}$ to $\mathcal{Z}$. Let us introduce the generic notation: $U \circ X = Z \in \mathcal{Z}$ for each $X \in \mathcal{X}$. The adjoint mapping $U^* : \mathcal{V} \to \mathcal{Y}$ of U is defined by the following equality:
\[ \langle U \circ X, V \rangle = \langle X, U^* \circ V \rangle, \quad \forall\, X \in \mathcal{X},\ V \in \mathcal{V}. \]
One can easily see that $U^* \circ V$ is uniquely defined for each $V \in \mathcal{V}$. It is known that $\mathcal{Y}$ is in a one-to-one correspondence with continuous linear functionals on $\mathcal{X}$, see [3, Sect. 5.14] or [120, Sect. 6.2]. Therefore, the weak topology $\tau(\mathcal{X}, \mathcal{Y})$ is consistent (or compatible) with the dual pair $(\mathcal{X}, \mathcal{Y})$, by definition.
We are now ready to formulate the pair of primal and dual linear programs and present some of their properties. Let $(\mathcal{X}, \mathcal{Y})$ and $(\mathcal{Z}, \mathcal{V})$, Co, $Co^*$, U, and $U^*$ be as introduced above, and let $B \in \mathcal{Z}$, $C \in \mathcal{Y}$ be fixed.
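By the Primal Linear Program we mean:
\[ \text{Minimize over } X \in Co:\ \langle X, C \rangle \quad \text{subject to } U \circ X = B. \tag{B.14} \]
By the Dual Linear Program we mean:
\[ \text{Maximize over } V \in \mathcal{V}:\ \langle B, V \rangle \quad \text{subject to } C - U^* \circ V \in Co^*. \tag{B.15} \]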
The notions of feasible and optimal points (solutions) for the primal and dual programs are conventional. A linear program is called consistent if there is at least one feasible point. The values of the programs (B.14) and (B.15) are denoted by inf(P) and sup(D), correspondingly.
Proposition B.2.5 (a) If (B.14) and (B.15) are both consistent, then −∞ < sup(D) ≤ inf(P) < ∞.
(b) If $X \in \mathcal{X}$ is feasible for (B.14), $V \in \mathcal{V}$ is feasible for (B.15), and $\langle X, C - U^* \circ V \rangle = 0$, then X is optimal for (B.14) and V is optimal for (B.15).
(c) Suppose sup(D) = inf(P), $X^* \in \mathcal{X}$ is optimal for (B.14) and $V^* \in \mathcal{V}$ is optimal for (B.15). Then $\langle X^*, C - U^* \circ V^* \rangle = 0$.
Proof This theorem was formulated in [120, Theorem 6.2.4]. The elementary proof is included as follows. If $X \in \mathcal{X}$ is feasible for (B.14) and $V \in \mathcal{V}$ is feasible for (B.15), then, according to the definitions of $Co^*$ and $U^*$,
\[ 0 \le \langle X, C \rangle - \langle X, U^* \circ V \rangle = \langle X, C \rangle - \langle U \circ X, V \rangle = \langle X, C \rangle - \langle B, V \rangle. \]
Items (a) and (b) follow directly from this observation. For (c), note that $\langle X^*, C \rangle = \langle B, V^* \rangle$.
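The weak duality argument in the proof can be traced on a finite-dimensional instance. The sketch below assumes that both spaces are Euclidean with the standard inner product, that Co is the nonnegative orthant, and that U is given by a matrix `A`; the data `A`, `b`, `c` and all function names are hypothetical.

```python
from fractions import Fraction as Fr


def dot(a, b):
    """Standard inner product, playing the role of the bilinear form."""
    return sum(ai * bi for ai, bi in zip(a, b))


def apply(A, x):
    """U o X for a matrix A."""
    return [dot(row, x) for row in A]


def apply_adjoint(A, v):
    """U* o V, i.e. multiplication by the transpose of A."""
    return [sum(A[i][j] * v[i] for i in range(len(A))) for j in range(len(A[0]))]


def duality_gap(A, b, c, x, v):
    """Return <X, C - U* o V> for a primal-feasible x and a dual-feasible v."""
    if apply(A, x) != b or any(xi < 0 for xi in x):
        raise ValueError("x is not primal feasible")
    slack = [cj - aj for cj, aj in zip(c, apply_adjoint(A, v))]
    if any(s < 0 for s in slack):
        raise ValueError("v is not dual feasible")
    return dot(x, slack)


# Hypothetical data: minimize x1 + 2*x2 subject to x1 + x2 = 1, x >= 0.
A, b, c = [[1, 1]], [1], [1, 2]

gap_optimal = duality_gap(A, b, c, [1, 0], [1])
x_other, v_other = [Fr(1, 2), Fr(1, 2)], [0]
gap_other = duality_gap(A, b, c, x_other, v_other)
```

A zero gap certifies optimality of both points, as in item (b); for any other feasible pair the gap equals the difference between the primal and dual objective values.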
B.2.3 Convex Programs
This material is based on [159, 206]. Suppose D is a nonempty convex set in a linear space $\mathcal{X}$ and $f_0, f_1, \dots, f_J$ are ℝ-valued convex functions on D; J ≥ 0. By the Primal Convex Program we mean:
\[ \text{Minimize over } X \in D:\ f_0(X) \quad \text{subject to } f_j(X) \le 0,\ j = 1, 2, \dots, J. \tag{B.16} \]
Assuming that there is at least one feasible point X ∈ D satisfying the inequalities $f_j(X) \le 0$ for all j = 1, 2, ..., J, this program can be rewritten as the following problem:
\[ \text{Minimize over } X \in D:\ \sup_{\bar g \in (\mathbb{R}_+^0)^J} L(X, \bar g), \tag{B.17} \]
where
\[ L(X, \bar g) := f_0(X) + \sum_{j=1}^J g_j f_j(X) \]
is the Lagrangian. The coefficients $g_j$ are called Lagrange multipliers. By the Dual Convex Program we mean:
\[ \text{Maximize over } \bar g \in (\mathbb{R}_+^0)^J:\ \inf_{X \in D} L(X, \bar g). \tag{B.18} \]
The dual functional $\bar g \in (\mathbb{R}_+^0)^J \to \inf_{X \in D} L(X, \bar g)$ is concave, see [159, Sect. 8.6, Proposition 1]. The values of the programs (B.17) and (B.18) are denoted by $\inf(P^c)$ and $\sup(D^c)$, respectively. The next statement is similar to Theorem A2.1 in [179].
Proposition B.2.6 Suppose the so-called Slater condition is satisfied:
\[ \exists\, X \in D:\ f_j(X) < 0,\ j = 1, 2, \dots, J, \]
and $\inf(P^c) > -\infty$. Then the following assertions hold.
(a) $\inf(P^c) = \sup(D^c) < \infty$, and there exists at least one $\bar g^* \in (\mathbb{R}_+^0)^J$ solving the Dual Convex Program (B.18).
(b) A point $X^* \in D$ is an optimal solution to the Primal Convex Program (B.17) if and only if one of the following two equivalent statements holds for some $\bar g^* \in (\mathbb{R}_+^0)^J$ solving the Dual Convex Program (B.18):
(i) The pair $(X^*, \bar g^*)$ is a saddle point of the Lagrangian:
\[ L(X^*, \bar g) \le L(X^*, \bar g^*) \le L(X, \bar g^*), \quad \forall\, X \in D,\ \bar g \in (\mathbb{R}_+^0)^J. \]
(ii) The following relations hold:
\[ f_j(X^*) \le 0,\ j = 1, 2, \dots, J; \qquad L(X^*, \bar g^*) = \inf_{X \in D} L(X, \bar g^*) = f_0(X^*); \qquad \sum_{j=1}^J g_j^* f_j(X^*) = 0. \]
(The last equality is known as the Complementary Slackness Condition.)
(c) Suppose a point $X^* \in D$ is an optimal solution to the Primal Convex Program (B.17). If a point $\bar g^* \in (\mathbb{R}_+^0)^J$ satisfies (b)(ii), then $\bar g^*$ solves the Dual Convex Program (B.18), and the pair $(X^*, \bar g^*)$ is a saddle point of the Lagrangian.
Proof (a) See [159, Sect. 8.6, Theorem 1].
(b) Suppose $X^* \in D$ is an optimal solution to (B.17). Then all the formulae in Item (ii) follow from [159, Sect. 8.6, Theorem 1], too.
Suppose all the formulae in Item (ii) hold true. Then, obviously, $L(X^*, \bar g^*) \le L(X, \bar g^*)$ for all X ∈ D, and, for every $\bar g \in (\mathbb{R}_+^0)^J$,
\[ L(X^*, \bar g) = f_0(X^*) + \sum_{j=1}^J g_j f_j(X^*) \le f_0(X^*) = L(X^*, \bar g^*) \]
because $f_j(X^*) \le 0$ for all j = 1, 2, ..., J. Therefore, the pair $(X^*, \bar g^*)$ is a saddle point of the Lagrangian. Finally, if $(X^*, \bar g^*)$ is a saddle point of the Lagrangian, then $X^*$ is an optimal solution to the Primal Convex Program (B.17) by [159, Sect. 8.4, Theorem 2].
(c) If $X^*$ is an optimal solution to (B.17), then from Item (a) and (b)-(ii) we have
\[ L(X^*, \bar g^*) = f_0(X^*) = \inf(P^c) = \sup(D^c) = \sup_{\bar g \in (\mathbb{R}_+^0)^J} \inf_{X \in D} L(X, \bar g) \ge \inf_{X \in D} L(X, \bar g^*) = L(X^*, \bar g^*). \]
Thus, $\bar g^*$ provides a supremum to the function $\bar g \to \inf_{X \in D} L(X, \bar g)$, i.e., solves the Dual Convex Program (B.18).
Proposition B.2.7 Suppose $X_0^*$ and $X_1^*$ are the solutions to the Primal Convex Programs
\[ \text{Minimize over } X \in D:\ f_0(X) \quad \text{subject to } f_j(X) \le 0,\ j = 1, 2, \dots, J, \]
and
\[ \text{Minimize over } X \in D:\ f_0(X) \quad \text{subject to } f_j(X) - u_j \le 0,\ j = 1, 2, \dots, J, \]
respectively, where $\bar u = (u_1, u_2, \dots, u_J) \in \mathbb{R}^J$. Suppose also that $\bar g^{*0}$ and $\bar g^{*1}$ are the solutions to the corresponding Dual Convex Programs. Then
\[ \sum_{j=1}^J u_j g_j^{*1} \le f_0(X_0^*) - f_0(X_1^*) \le \sum_{j=1}^J u_j g_j^{*0}. \]
Proof See [159, Sect. 8.5, Theorem 1].
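Propositions B.2.6 and B.2.7 can be traced on a one-dimensional program: minimize $x^2$ subject to $1 - x - u \le 0$ over D = ℝ with J = 1. For u < 1 the solutions $x^* = 1 - u$ and $g^* = 2(1 - u)$ follow by hand from stationarity of the Lagrangian and complementary slackness. The sketch below hard-codes these closed forms; the example problem and all names are assumptions of this illustration, not taken from the text.

```python
from fractions import Fraction as Fr


def lagrangian(x, g, u=Fr(0)):
    """L(x, g) = f0(x) + g * f1(x) with f0(x) = x^2 and f1(x) = 1 - x - u."""
    return x * x + g * (1 - u - x)


def solve(u):
    """Closed-form primal and dual solutions, valid for u < 1 (derived by hand)."""
    return 1 - u, 2 * (1 - u)


x0, g0 = solve(Fr(0))

# Saddle-point inequalities of Proposition B.2.6(b)(i), checked on a grid.
grid_x = [Fr(k, 4) for k in range(-12, 13)]
grid_g = [Fr(k, 4) for k in range(0, 21)]
is_saddle = all(
    lagrangian(x0, g) <= lagrangian(x0, g0) <= lagrangian(x, g0)
    for x in grid_x
    for g in grid_g
)


def sensitivity_bounds(u):
    """Return (u * g^{*1}, f0(X0*) - f0(X1*), u * g^{*0}) from Proposition B.2.7."""
    x1, g1 = solve(u)
    return u * g1, x0 * x0 - x1 * x1, u * g0
```

For u = 1/2 the three quantities are 1/2, 3/4 and 1, and for u = −1 they are −4, −3 and −2, so the two-sided bound holds whether the constraint is relaxed or tightened.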
B.3 Stochastic Processes and Applied Probability
B.3.1 Some General Definitions and Facts
We collect some well-known definitions below, see e.g., [30, 70, 96, 165] for more details.
Definition B.3.1 (Stopping time) Let a measurable space $(\Omega, \mathcal{F})$ be fixed. Let T be the time index set. For the moment, T is either {0, 1, ...}, which is the discrete-time case, or [0, ∞), which is the continuous-time case. An increasing family $\{\mathcal{F}_t\}_{t\in T}$ of σ-algebras on Ω with $\mathcal{F}_t \subseteq \mathcal{F}$ for each t ∈ T is called a filtration. Let $\mathcal{F}_\infty$ be the minimal σ-algebra containing the filtration $\{\mathcal{F}_t\}_{t\in T}$. A [0, ∞]-valued random variable τ is called a $\{\mathcal{F}_t\}$-stopping time if $\{\tau \le t\} \in \mathcal{F}_t$ for each t ∈ T. For a $\{\mathcal{F}_t\}$-stopping time τ, $\mathcal{F}_\tau := \{\Gamma \in \mathcal{F}_\infty : \Gamma \cap \{\tau \le t\} \in \mathcal{F}_t,\ t \in T\}$.
Definition B.3.2 (Measurable, adapted and progressive processes) A continuous-time stochastic process X(·) defined on a measurable space $(\Omega, \mathcal{F})$ (respectively, filtered measurable space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t\in\mathbb{R}_+^0})$) taking values in a measurable space $\mathbf{X}$ is called measurable (respectively, $\{\mathcal{F}_t\}$-adapted) if X(t, ω) is jointly measurable in (t, ω) ∈ T × Ω endowed with the σ-algebra $B(T) \times \mathcal{F}$ (respectively, for each t ≥ 0, the mapping ω ∈ Ω → X(t, ω) is measurable on $(\Omega, \mathcal{F}_t)$). The process X(·) is called progressively measurable or simply progressive if for each $t \in \mathbb{R}_+$, the restriction of X(·) on $([0, t] \times \Omega, B([0, t]) \times \mathcal{F}_t)$ is measurable, i.e., the mapping (s, ω) → X(s, ω) from [0, t] × Ω to $\mathbf{X}$ is $B([0, t]) \times \mathcal{F}_t$-measurable.
The following remark concerns a convention we often use.
Remark B.3.1 In the above definition, the measurable space $\mathbf{X}$ is called the state space of the process X(·). Its σ-algebra is usually not explicitly indicated. Whenever we talk about a process X(·) taking values in a topological space $\mathbf{X}$, unless stated otherwise, we keep in mind that the σ-algebra on $\mathbf{X}$ is its Borel σ-algebra.
A relevant example of progressive processes is in the next statement.
Proposition B.3.1 Suppose $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in \mathbb{R}_+^0})$ is a filtered measurable space, and $X(\cdot)$ is an adapted stochastic process with right-continuous trajectories, taking values in a separable metric space $\mathbf{X}$ endowed with its Borel σ-algebra. Then $X(\cdot)$ is a progressively measurable (or simply progressive) process.

Proof See the proof of Theorem T11, Chap. III of [52].
Let $\mathbf{X}$ be a separable metric space. It is convenient to introduce the space $D_{\mathbf{X}}[0, \infty)$ of right-continuous $\mathbf{X}$-valued functions $X(t)$ of time $t \in \mathbb{R}_+^0$ with left limits. Endowed with the Skorohod metric, the space $D_{\mathbf{X}}[0, \infty)$ is referred to as the Skorohod space, which is separable (and also complete, if $\mathbf{X}$ is complete).
The precise definition of the Skorohod metric is immaterial to this book, and can be found in, e.g., Chap. 3, Sect. 5 of [70]. The Borel σ-algebra on the Skorohod space is characterized in the next statement.

Proposition B.3.2 Let $\mathbf{X}$ be a separable metric space endowed with its Borel σ-algebra. For each $t \in \mathbb{R}_+^0$, define the mapping $x_t : D_{\mathbf{X}}[0, \infty) \to \mathbf{X}$ by $x_t(X) = X(t)$. Then the Borel σ-algebra on $D_{\mathbf{X}}[0, \infty)$ coincides with $\sigma(x_t : t \in \mathbb{R}_+^0)$, the σ-algebra generated by the mappings $x_t$.

Proof See the proof of Proposition 7.1, Chap. 3 of [70].
Definition B.3.3 (Martingale) An $\mathbb{R}$-valued stochastic process $X(\cdot)$ with time index set $T$ being either $\{0, 1, \dots\}$ or $[0, \infty)$, defined on a filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \in T}, P)$, is called a martingale (with respect to the filtration $\{\mathcal{F}_t\}$) if it is adapted to $\{\mathcal{F}_t\}$; $E[|X(t)|] < \infty$ for each $t \in T$; and $E[X(t + s) \mid \mathcal{F}_t] = X(t)$, $P$-a.s., $\forall\, s, t \in T$, $s, t \ge 0$.

The following statement is a version of the Doob Optional Sampling Theorem.

Proposition B.3.3 Consider an $\{\mathcal{F}_t\}$-martingale $X(\cdot)$. Suppose $X(\cdot)$ has right-continuous trajectories. Let $\tau_1, \tau_2$ be two stopping times with respect to $\{\mathcal{F}_t\}$. Then for each $t \in T$, $t > 0$,
\[ E[X(\min\{\tau_2, t\}) \mid \mathcal{F}_{\tau_1}] = X(\min\{\tau_1, \tau_2, t\}), \quad P\text{-a.s.} \]

Proof See Theorem 2.13 and Remark 2.14 of [70].
B.3.2 Renewal and Regenerative Processes

Let some probability space $(\Omega, \mathcal{F}, P)$ be fixed, and all the random elements mentioned in this subsection be defined thereon.

Definition B.3.4 (Renewal process) A point process $\{T_n\}_{n=1}^\infty$ is an increasing sequence of $(0, \infty]$-valued random variables. A point process is called a (delayed) renewal process if $\xi_n := T_n - T_{n-1}$, $n = 1, 2, \dots$, with $T_0 := 0$ are all $(0, \infty)$-valued, independent random variables, and $\xi_2, \xi_3, \dots$ have a common distribution.

Proposition B.3.4 Consider a (delayed) renewal process $\{T_n\}_{n=1}^\infty$ and its associated counting process $N(\cdot)$. Then
\[ \frac{N(T)}{T} \to \frac{1}{E[\xi_2]} \quad P\text{-a.s.}, \qquad \text{and} \qquad \frac{E[N(T)]}{T} \to \frac{1}{E[\xi_2]}, \quad \text{as } T \to \infty. \]

Proof See Proposition 1.4 in p. 107 of [10].
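The almost sure limit in Proposition B.3.4 can be checked numerically. The following is a minimal simulation sketch, not part of the original text: the delay distribution (uniform on (0, 5]), the exponential inter-renewal law with rate 2 (so that E[ξ2] = 0.5), the horizon and the seed are all hypothetical choices made only for illustration.

```python
import random


def renewal_count(horizon, first_gap, later_gap, rng):
    """Return N(horizon), the number of renewal epochs T_n <= horizon."""
    t = first_gap(rng)          # T_1 = xi_1 (the delay)
    n = 0
    while t <= horizon:
        n += 1
        t += later_gap(rng)     # T_{n+1} = T_n + xi_{n+1}
    return n


rng = random.Random(0)
horizon = 50_000.0

# Hypothetical distributions: xi_1 uniform on (0, 5], xi_n ~ Exp(rate 2) for n >= 2.
n_T = renewal_count(
    horizon,
    first_gap=lambda r: 5.0 * (1.0 - r.random()),
    later_gap=lambda r: r.expovariate(2.0),
    rng=rng,
)

empirical_rate = n_T / horizon      # should be close to 1 / E[xi_2] = 2
print(empirical_rate)
```

The delay ξ1 has a different law from the later gaps, yet it does not affect the limit, in line with the statement that only E[ξ2] appears on the right-hand side.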
The next result is the well-known Kolmogorov Strong Law of Large Numbers.
Proposition B.3.5 Let $X_1, X_2, \dots$ be i.i.d. random variables such that $E[X_1^+] < \infty$ or $E[X_1^-] < \infty$. Then
\[ \frac{\sum_{i=1}^n X_i}{n} \to E[X_1] \quad P\text{-a.s., as } n \to \infty. \]
Proof See Sect. 22 of [24]. The following definition is similar to the one in Sect. 2.8 of [224].
Definition B.3.5 (Regenerative process) Suppose that $X(\cdot)$ is a measurable continuous-time stochastic process taking values in a metric space $E$, endowed with its Borel σ-algebra $\mathcal{B}(E)$. The process $X(\cdot)$ is called a (delayed) regenerative process if there exists an increasing sequence of $(0, \infty)$-valued random variables $T_1, T_2, \dots$ such that the random vectors
\[ \zeta_n := \left( \xi_n, \int_{[0, T_n - T_{n-1})} f(t, X(T_{n-1} + t))\,dt \right), \quad n = 1, 2, \dots, \]
are all independent and $\zeta_n$, $n = 2, 3, \dots$, all have a common distribution. (Here $T_0 := 0$ and $\xi_n := T_n - T_{n-1}$.)

Lemma B.3.1 Let $X(\cdot)$ be a delayed regenerative process, and $f$ be an $\mathbb{R}$-valued measurable function on $(E, \mathcal{B}(E))$. If $E[\xi_1] < \infty$, $E[\xi_2] < \infty$, and $E\left[\int_{(0,T_1]} f(X(t))\,dt\right]$ and $E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]$ are well-defined and finite, then, as $T \to \infty$,
\[ \frac{1}{T} \int_{(0,T]} f(X(t))\,dt \to \frac{E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]}{E[\xi_2]} \quad P\text{-a.s.}, \]
and
\[ \frac{1}{T} E\left[\int_{(0,T]} f(X(t))\,dt\right] \to \frac{E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]}{E[\xi_2]}. \]
Proof This statement is well known, see e.g., [10, 151, 224] for some related versions, but we present its proof here for completeness. Without loss of generality, we assume that $f$ takes nonnegative values. Let $N(\cdot)$ be the counting process of the renewal process $\{T_n\}_{n=1,2,\dots}$. Note that
\[ \frac{1}{T} \sum_{j=1}^{N(T)} \int_{(T_{j-1},T_j]} f(X(t))\,dt \le \frac{1}{T} \int_{(0,T]} f(X(t))\,dt \le \frac{1}{T} \sum_{j=1}^{N(T)+1} \int_{(T_{j-1},T_j]} f(X(t))\,dt, \quad \forall\, T > 0. \tag{B.19} \]
Here, by using Propositions B.3.4 and B.3.5, we see that, as T → ∞ (and N (T ) → ∞ P-a.s.),
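\[ \frac{1}{T} \sum_{j=1}^{N(T)} \int_{(T_{j-1},T_j]} f(X(t))\,dt = \frac{N(T) - 1}{T} \cdot \frac{\sum_{j=2}^{N(T)} \int_{(T_{j-1},T_j]} f(X(t))\,dt}{N(T) - 1} + \frac{\int_{(0,T_1]} f(X(t))\,dt}{T} \to \frac{E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]}{E[\xi_2]} \quad P\text{-a.s.}, \]
and similarly
\[ \frac{1}{T} \sum_{j=1}^{N(T)+1} \int_{(T_{j-1},T_j]} f(X(t))\,dt \to \frac{E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]}{E[\xi_2]} \quad P\text{-a.s.} \]
Thus,
\[ \frac{1}{T} \int_{(0,T]} f(X(t))\,dt \to \frac{E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]}{E[\xi_2]} \quad P\text{-a.s.}, \]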
as required.

For the last assertion, let $T > 0$ be fixed, and note that $N(T) + 1$ is a stopping time with respect to the filtration, say $\{\mathcal{F}_n\}$, generated by $\left\{\xi_n, \int_{(T_{n-1},T_n]} f(X(t))\,dt\right\}_{n=1}^\infty$. Then for each fixed integer $m > 0$, by applying the Doob Optional Sampling Theorem (see Proposition B.3.3) to the discrete-time $\{\mathcal{F}_n\}$-martingale defined as
\[ \sum_{j=1}^{n} \int_{(T_{j-1},T_j]} f(X(t))\,dt - (n - 1) E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right], \]
we see that
\[ E\left[ E\left[ \sum_{j=1}^{\min\{m, N(T)+1\}} \int_{(T_{j-1},T_j]} f(X(t))\,dt - (\min\{m, N(T)\} - 1) E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right] \,\Big|\, \mathcal{F}_1 \right] \right] = E\left[\int_{(0,T_1]} f(X(t))\,dt\right], \]
so that
\[ E\left[ \sum_{j=1}^{\min\{m, N(T)+1\}} \int_{(T_{j-1},T_j]} f(X(t))\,dt \right] = E\left[\int_{(0,T_1]} f(X(t))\,dt\right] + E[\min\{m, N(T)\} - 1]\, E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]. \]
Passing to the limit as $m \to \infty$, we see that
\[ E\left[ \sum_{j=1}^{N(T)+1} \int_{(T_{j-1},T_j]} f(X(t))\,dt \right] = E\left[\int_{(0,T_1]} f(X(t))\,dt\right] + E[N(T) - 1]\, E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]. \]
Since the above equality holds for the arbitrarily fixed $T > 0$, by Proposition B.3.4, we see that
\[ \frac{1}{T} E\left[ \sum_{j=1}^{N(T)+1} \int_{(T_{j-1},T_j]} f(X(t))\,dt \right] = \frac{1}{T} E\left[\int_{(0,T_1]} f(X(t))\,dt\right] + \frac{E[N(T) - 1]}{T} E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right] \to \frac{E\left[\int_{(T_1,T_2]} f(X(t))\,dt\right]}{E[\xi_2]} \]
as $T \to \infty$. The last assertion follows from this and (B.19).
Appendix C
Definitions and Facts about Discrete-Time Markov Decision Processes
C.1 Description of the DTMDP

The primitives of a DTMDP are the following.
(i) The state space $\mathbf{X}$ is a nonempty Borel space, endowed with the σ-algebra $\mathcal{B}(\mathbf{X})$.
(ii) The action space $\mathbf{A}$ is a nonempty Borel space, endowed with the σ-algebra $\mathcal{B}(\mathbf{A})$.
(iii) The transition probability $p(dy|x, a)$ is a stochastic kernel from $\mathbf{X} \times \mathbf{A}$ to $\mathcal{B}(\mathbf{X})$.
(iv) The initial distribution $\gamma(dx)$ is a probability measure on $(\mathbf{X}, \mathcal{B}(\mathbf{X}))$ (or more simply, we would say on $\mathbf{X}$).
(v) The $[-\infty, \infty]$-valued one-step cost functions $l_j(x, a)$, $j = 0, 1, \dots, J$.

We often refer to $(\mathbf{X}, \mathbf{A}, p, \{l_j\}_{j=0}^J)$ as a DTMDP model or simply a DTMDP. Roughly speaking, a DTMDP is a discrete-time stochastic process, where at each time step, the decision-maker selects some action. Depending on the current state and action, some costs are incurred and the transition probability is determined, specifying the conditional distribution of the state of the process in the next time step. The rule specifying how to select an action at each time step is called a strategy, defined as follows.
© Springer Nature Switzerland AG 2020 A. Piunovskiy and Y. Zhang, Continuous-Time Markov Decision Processes, Probability Theory and Stochastic Modelling 97, https://doi.org/10.1007/978-3-030-54987-9
Definition C.1.1 (Strategy in DTMDPs)
(a) A strategy $\sigma = \{\sigma_n\}_{n=1}^\infty$ is a sequence of stochastic kernels such that for each $n = 1, 2, \dots$, $\sigma_n(da|x_0, a_1, \dots, x_{n-1})$ is a stochastic kernel from $(\mathbf{X} \times \mathbf{A})^{n-1} \times \mathbf{X}$ to $\mathcal{B}(\mathbf{A})$, where $(\mathbf{X} \times \mathbf{A})^0 \times \mathbf{X} := \mathbf{X}$.
(b) A strategy $\sigma = \{\sigma_n\}_{n=1}^\infty$ is Markov if for each $n = 1, 2, \dots$, there is a stochastic kernel $\sigma_n^M(da|x_{n-1})$ from $\mathbf{X}$ to $\mathcal{B}(\mathbf{A})$ such that $\sigma_n^M(da|x_{n-1}) = \sigma_n(da|x_0, a_1, \dots, x_{n-1})$ for each $(x_0, a_1, \dots, x_{n-1}) \in (\mathbf{X} \times \mathbf{A})^{n-1} \times \mathbf{X}$.
(c) A strategy $\sigma = \{\sigma_n\}_{n=1}^\infty$ is called stationary if there is a stochastic kernel $\sigma^s(da|x)$ from $\mathbf{X}$ to $\mathcal{B}(\mathbf{A})$ such that $\sigma^s(da|x_{n-1}) = \sigma_n(da|x_0, a_1, \dots, x_{n-1})$ for each $n = 1, 2, \dots$, and $(x_0, a_1, \dots, x_{n-1}) \in (\mathbf{X} \times \mathbf{A})^{n-1} \times \mathbf{X}$.
(d) If $\sigma_n(da|x_0, a_1, \dots, x_{n-1})$ is concentrated on $\{\varphi_n(x_0, a_1, \dots, x_{n-1})\}$ (respectively, $\{\psi_n(x_0, x_{n-1})\}$, $\{\varphi_n^M(x_{n-1})\}$ and $\{\varphi(x_{n-1})\}$) for each $n = 1, 2, \dots$, where $\varphi_n$, $\psi_n$, $\varphi_n^M$ and $\varphi$ are $\mathbf{A}$-valued measurable mappings, then the strategy is called deterministic (respectively, deterministic semi-Markov, deterministic Markov and deterministic stationary). With a conventional abuse of notation, we often signify a deterministic stationary strategy by $\varphi$.

Under a strategy $\sigma$, at time index $n \in \{0, 1, 2, \dots\}$, based on the $n$-history $(x_0, a_1, x_1, a_2, \dots, x_n)$, the next action $a_{n+1}$ is realized according to the distribution $\sigma_{n+1}(da|x_0, a_1, x_1, a_2, \dots, x_n)$.

For the concerned DTMDP model, let $\Sigma$ be the class of all the strategies, $\Sigma_S$ be the class of all the stationary strategies, $\Sigma_{DM}$ be the class of all the deterministic Markov strategies, and $\Sigma_{DS}$ be the class of all the deterministic stationary strategies.

We now proceed with a formal description of the DTMDP under an arbitrarily fixed strategy $\sigma = \{\sigma_n\}_{n=1}^\infty$. The sample space $\Omega = (\mathbf{X} \times \mathbf{A})^\infty$ is the collection of trajectories $\omega = (x_0, a_1, x_1, a_2, \dots)$. We endow $\Omega$ with the product σ-algebra, which coincides with $\mathcal{B}(\Omega) = \mathcal{B}((\mathbf{X} \times \mathbf{A})^\infty)$.
When $\mathbf{X}$ and $\mathbf{A}$ are topological spaces, we endow $\Omega$ with the product topology. For each $n = 1, 2, \dots$, we define
\[ X_{n-1}(\omega) = x_{n-1}, \quad A_n(\omega) = a_n. \]
Below we often omit the argument $\omega \in \Omega$. The processes $\{X_n\}_{n=0}^\infty$ and $\{A_n\}_{n=1}^\infty$ are the controlled and controlling processes. The interpretation is consistent with the unique probability measure $P_\gamma^\sigma(d\omega)$ on $(\Omega, \mathcal{B}(\Omega))$ satisfying the following conditions:
\[ P_\gamma^\sigma(X_0 \in dx) = \gamma(dx); \tag{C.1} \]
and for each $n = 1, 2, \dots$, $\Gamma_i^X \in \mathcal{B}(\mathbf{X})$ ($i = 0, 1, \dots, n$) and $\Gamma_i^A \in \mathcal{B}(\mathbf{A})$ ($i = 1, 2, \dots, n$),
\[ P_\gamma^\sigma(X_0 \in \Gamma_0^X, A_1 \in \Gamma_1^A, \dots, X_{n-1} \in \Gamma_{n-1}^X, A_n \in \Gamma_n^A) = \int_{\Gamma_0^X \times \Gamma_1^A \times \cdots \times \Gamma_{n-1}^X} \sigma_n(\Gamma_n^A|x_0, a_1, \dots, x_{n-1})\, P_\gamma^\sigma(X_0 \in dx_0, A_1 \in da_1, \dots, X_{n-1} \in dx_{n-1}); \tag{C.2} \]
and
\[ P_\gamma^\sigma(X_0 \in \Gamma_0^X, A_1 \in \Gamma_1^A, \dots, X_n \in \Gamma_n^X) = \int_{\Gamma_0^X \times \Gamma_1^A \times \cdots \times \Gamma_{n-1}^X \times \Gamma_n^A} p(\Gamma_n^X|x_{n-1}, a_n)\, P_\gamma^\sigma(X_0 \in dx_0, A_1 \in da_1, \dots, X_{n-1} \in dx_{n-1}, A_n \in da_n). \tag{C.3} \]
The existence and uniqueness of this measure follow from Ionescu-Tulcea's Theorem (see Proposition B.1.37).

Definition C.1.2 (Strategic measure in DTMDP) The measure $P_\gamma^\sigma$ is called a strategic measure (of the strategy $\sigma \in \Sigma$) for the DTMDP. The expectation taken with respect to $P_\gamma^\sigma$ is denoted by $E_\gamma^\sigma$. If $\gamma(dy) = \delta_x(dy)$ is a Dirac measure concentrated on the point $x \in \mathbf{X}$, then we write $P_\gamma^\sigma$ and $E_\gamma^\sigma$ as $P_x^\sigma$ and $E_x^\sigma$.

Remark C.1.1 If only actions from the set $\mathbf{A}(x) \in \mathcal{B}(\mathbf{A})$ are admissible in the state $x \in \mathbf{X}$, the construction and properties of strategic measures are mainly the same, assuming that the set $\mathbb{K} := \{(x, a) : a \in \mathbf{A}(x),\ x \in \mathbf{X}\}$ is measurable in $\mathbf{X} \times \mathbf{A}$ and there exists a measurable mapping $\varphi : \mathbf{X} \to \mathbf{A}$ with $(x, \varphi(x)) \in \mathbb{K}$. More details can be found in [69].

Proposition C.1.1 Assume that $\mathbf{X}$ and $\mathbf{A}$ are topological Borel spaces.
(a) The set of all strategic measures, for a fixed initial distribution $\gamma$, is a measurable and convex subset of $\mathcal{P}(\Omega)$.
(b) For a fixed strategy $\sigma$, the mapping $x \to P_x^\sigma$ from $\mathbf{X}$ to $\mathcal{P}(\Omega)$ is measurable.

Proof See Theorem 8 of [179], Chap. 5, Sect. 5 of [69] and Lemma 3.1 of [72].
Proposition C.1.2 A probability measure $P$ on $(\Omega, \mathcal{B}(\Omega))$ is a strategic measure if and only if, for each $n = 1, 2, \dots$, for each bounded measurable function $f$ on $\{(x_0, a_1, x_1, \dots, a_n, x_n)\} = (\mathbf{X} \times \mathbf{A})^n \times \mathbf{X}$,
\[ \int_\Omega f(X_0, A_1, X_1, \dots, A_n, X_n)\, P(d\omega) = \int_\Omega \int_{\mathbf{X}} f(X_0, A_1, X_1, \dots, A_n, y)\, p(dy|X_{n-1}, A_n)\, P(d\omega). \]
Proof See Chap. 5, Sect. 5 of [69]. Corollary C.1.1
(a) Suppose $P(d\omega|\beta)$ is a stochastic kernel on $\mathcal{B}(\Omega)$ given $\beta \in [0, 1]$, such that $P(d\omega|\beta)$ is a strategic measure for $\nu$-almost all $\beta \in [0, 1]$, where $\nu$ is a fixed probability measure on $([0, 1], \mathcal{B}([0, 1]))$. Then
\[ P(d\omega) := \int_{[0,1]} P(d\omega|\beta)\, \nu(d\beta) \]
is a strategic measure.
(b) Assume that $\mathbf{X}$ and $\mathbf{A}$ are topological Borel spaces. Then, for each probability measure $\nu$ on the space of strategic measures, $\int P\, \nu(dP)$ is again a strategic measure.

Proof The characteristic property from Proposition C.1.2 still holds for the measure $P$ in (a) and for $\int P\, \nu(dP)$ in (b).

Proposition C.1.3 There exist a Borel space $\mathbf{Z}$ and a probability measure $\nu$ on $(\mathbf{Z}, \mathcal{B}(\mathbf{Z}))$ such that the following assertions hold.
(a) For each strategy $\sigma$, there exists a sequence of measurable mappings $r_n : \mathbf{Z} \times (\mathbf{X} \times \mathbf{A})^{n-1} \times \mathbf{X} \to \mathbf{A}$, $n = 1, 2, \dots$, such that, for the deterministic strategies $\varphi^z := \{\varphi_n^z\}_{n=1}^\infty$ with $\varphi_n^z(x_0, a_1, \dots, x_{n-1}) := r_n(z, x_0, a_1, \dots, x_{n-1})$, the following equality holds:
\[ P_\gamma^\sigma(\Gamma) = \int_{\mathbf{Z}} P_\gamma^{\varphi^z}(\Gamma)\, \nu(dz), \quad \forall\, \Gamma \in \mathcal{B}(\Omega). \]
(b) If $\sigma = \sigma^M$ is a Markov strategy, then there exist measurable mappings $r_n : \mathbf{Z} \times \mathbf{X} \to \mathbf{A}$ defining the Markov strategies $\varphi^z := \{\varphi_n^z\}_{n=1}^\infty$ with $\varphi_n^z(x) := r_n(z, x)$, for which a similar equality to the one in part (a) holds:
\[ P_\gamma^{\sigma^M}(\Gamma) = \int_{\mathbf{Z}} P_\gamma^{\varphi^z}(\Gamma)\, \nu(dz), \quad \forall\, \Gamma \in \mathcal{B}(\Omega). \]
(c) Both mappings $z \to P_\gamma^{\varphi^z}$ in parts (a) and (b) are stochastic kernels on $\mathcal{B}(\Omega)$ given $z \in \mathbf{Z}$.
Proof For (a) and (b), see Theorem 5.1 of [74] and Theorem 1 of [71]. For assertion (c), note that the marginals $P_n^z$ of $P_\gamma^{\varphi^z}$ on the histories $\mathbf{X} \times (\mathbf{A} \times \mathbf{X})^{n-1}$, as the mappings $z \to P_n^z$, are stochastic kernels on $\mathcal{B}(\mathbf{X} \times (\mathbf{A} \times \mathbf{X})^{n-1})$ given $z \in \mathbf{Z}$ for all $n = 1, 2, \dots$: one should use induction and Proposition B.1.34. (See also the remark after it.) Finally, $z \to P_\gamma^{\varphi^z}$ is the stochastic kernel on $\mathcal{B}(\Omega)$ given $z \in \mathbf{Z}$ by the Monotone Class Theorem, see Proposition B.1.42: the class of subsets $\Gamma \subseteq \Omega$ for which the mapping $z \to P_\gamma^{\varphi^z}(\Gamma)$ is measurable is a monotone class and includes $H \times (\mathbf{A} \times \mathbf{X})^\infty$ for all $H \in \mathcal{B}(\mathbf{X} \times (\mathbf{A} \times \mathbf{X})^{n-1})$; thus it contains $\mathcal{B}(\Omega)$.

Proposition C.1.3 remains valid if only actions from the set $\mathbf{A}(x)$ are admissible for each $x \in \mathbf{X}$, under the standard requirements as in Remark C.1.1.

According to Proposition B.1.1, one can, without loss of generality, accept that $\mathbf{Z} = [0, 1]$. Moreover, $\nu$ can be taken as the uniform distribution on $([0, 1], \mathcal{B}([0, 1]))$ because an arbitrary distribution on $[0, 1]$ can be obtained as the image of the uniform distribution under a properly selected mapping, say $F^{-1} : [0, 1] \to [0, 1]$. See Proposition B.1.21.

Let $J \in \mathbb{N}$ be fixed, and $l_j(x, a)$ be a measurable $[-\infty, \infty]$-valued function for each $j = 0, 1, \dots, J$, representing the cost function given the current state and action $(x, a) \in \mathbf{X} \times \mathbf{A}$. Let $d_j$ be a real constant for each $j = 1, 2, \dots, J$, representing the constraint constant. The following optimal control problems for the DTMDP model are of particular relevance to the material presented in this book. Namely, the unconstrained optimal control problem for the DTMDP model $(\mathbf{X}, \mathbf{A}, p, \{l_j\}_{j=0}^J)$ is
\[ \text{Minimize over all strategies } \sigma: \quad E_x^\sigma\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] =: W_0^{DT}(\sigma, x), \tag{C.4} \]
with $x \in \mathbf{X}$, whereas the constrained optimal control problem for the DTMDP model $(\mathbf{X}, \mathbf{A}, p, \{l_j\}_{j=0}^J)$ is
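\[ \begin{aligned} \text{Minimize over all strategies } \sigma: \quad & E_\gamma^\sigma\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \\ \text{subject to} \quad & E_\gamma^\sigma\left[\sum_{n=0}^\infty l_j(X_n, A_{n+1})\right] \le d_j, \quad 1 \le j \le J. \end{aligned} \tag{C.5} \]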
Here the initial distribution $\gamma$ is fixed.

Remark C.1.2 (Universally measurable strategy)
(a) One may also consider universally measurable strategies $\sigma = \{\sigma_n\}_{n=1}^\infty$, where for each $n \ge 1$, $\sigma_n(da|x_0, a_1, \dots, x_{n-1})$ is a universally measurable stochastic kernel, i.e., for each $\Gamma \in \mathcal{B}(\mathbf{A})$, $\sigma_n(\Gamma|x_0, a_1, \dots, x_{n-1})$ is universally measurable in $(x_0, a_1, \dots, x_{n-1})$. Similarly, one can understand universally measurable deterministic Markov or semi-Markov strategies. The notation $W_0^{DT}(\sigma, x)$ extends to universally measurable strategies $\sigma$, too.
(b) Let some $x \in \mathbf{X}$ be fixed. By taking suitable Borel-measurable modifications, for each given universally measurable strategy $\sigma$, there is a (Borel-measurable) strategy $\sigma'$ such that $W_0^{DT}(\sigma, x) = W_0^{DT}(\sigma', x)$.

Definition C.1.3 (Uniformly optimal strategy) A strategy $\sigma^*$ is called uniformly optimal for the unconstrained problem (C.4) for the DTMDP model $(\mathbf{X}, \mathbf{A}, p, l_0)$ if
\[ E_x^{\sigma^*}\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \le E_x^\sigma\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \]
for each strategy $\sigma$ and for each initial state $x \in \mathbf{X}$.

Let
\[ W_0^{DT*} : x \in \mathbf{X} \to W_0^{DT*}(x) := \inf_{\sigma \in \Sigma} E_x^\sigma\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \tag{C.6} \]
denote the value (Bellman) function of problem (C.4). Here the superscript "DT" signifies "Discrete-Time".

Definition C.1.4 (Feasible strategy) A strategy $\sigma^*$ is called feasible for the constrained problem (C.5) for the DTMDP model $(\mathbf{X}, \mathbf{A}, p, \{l_j\}_{j=0}^J)$ if
\[ E_\gamma^{\sigma^*}\left[\sum_{n=0}^\infty l_j(X_n, A_{n+1})\right] \le d_j, \quad j = 1, 2, \dots, J. \]
A feasible strategy $\sigma^*$ is called optimal for problem (C.5) if
\[ E_\gamma^{\sigma^*}\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \le E_\gamma^\sigma\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \]
for each feasible strategy $\sigma$.

The following result is often referred to as the Derman–Strauch Lemma.

Proposition C.1.4 For each strategy $\sigma = \{\sigma_n\}_{n=1}^\infty \in \Sigma$, there is a Markov strategy $\sigma^M = \{\sigma_n^M\}_{n=1}^\infty$ such that
\[ P_\gamma^{\sigma^M}(X_n \in dx, A_{n+1} \in da) = P_\gamma^\sigma(X_n \in dx, A_{n+1} \in da) \]
for each $n = 0, 1, 2, \dots$. Here for each $n = 1, 2, \dots$, one can take $\sigma_n^M$ as the stochastic kernel from $\mathbf{X}$ to $\mathbf{A}$ such that
\[ P_\gamma^\sigma(X_{n-1} \in dx, A_n \in da) = P_\gamma^\sigma(X_{n-1} \in dx)\, \sigma_n^M(da|x). \]

Proof The statement can be easily established by induction. More details can be found in Lemma 2 of [179]. See also [56].
C.2 Selected Facts and Optimality Results

C.2.1 Results for the Unconstrained Problem

In this subsection, we concentrate on the unconstrained problem (C.4). For this unconstrained problem, the optimality (or say dynamic programming or Bellman) equation is an important tool, which reads
\[ u(x) = \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \int_{\mathbf{X}} u(y)\, p(dy|x, a) \right\}, \quad \forall\, x \in \mathbf{X}, \tag{C.7} \]
assuming that the integral is well defined.

Proposition C.2.1 The value (Bellman) function $W_0^{DT*}$ defined by (C.6) for problem (C.4) is lower semianalytic and thus universally measurable on $\mathbf{X}$.
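Proof See Theorem 4.2 of [74].

Definition C.2.1 (Summable DTMDP) A DTMDP is called summable if either
\[ E_x^\sigma\left[\sum_{n=0}^\infty l_0^+(X_n, A_{n+1})\right] < \infty, \quad \forall\, x \in \mathbf{X},\ \sigma \in \Sigma, \]
or
\[ E_x^\sigma\left[\sum_{n=0}^\infty l_0^-(X_n, A_{n+1})\right] < \infty, \quad \forall\, x \in \mathbf{X},\ \sigma \in \Sigma. \]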
Proposition C.2.2 If the DTMDP is summable, then the value (Bellman) function $W_0^{DT*}$ defined by (C.6) satisfies the Bellman equation (C.7).

Proof See Chap. 6, Sect. 2 of [69].

Proposition C.2.3 Suppose for each $x \in \mathbf{X}$
\[ \sup_{\sigma \in \Sigma} E_x^\sigma\left[\sum_{n=0}^\infty l_0^-(X_n, A_{n+1})\right] < \infty \]
and a deterministic stationary strategy $\varphi$ is such that, for all $x \in \mathbf{X}$, the following equalities hold true:
\[ W_0^{DT*}(x) = l_0(x, \varphi(x)) + \int_{\mathbf{X}} W_0^{DT*}(y)\, p(dy|x, \varphi(x)); \qquad \lim_{n \to \infty} E_x^\varphi\left[W_0^{DT*}(X_n)\right] = 0. \]
Then $\varphi$ is uniformly optimal (for the unconstrained problem (C.4)).
Proof See Proposition 9.5.11 of [121] or Theorem 2.2 of [259].

Keeping in mind Proposition C.2.1, we can define the following for all $x \in \mathbf{X}$:
\[ V_0^{(0)}(x) := 0, \qquad V_0^{(n+1)}(x) := \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \int_{\mathbf{X}} V_0^{(n)}(y)\, p(dy|x, a) \right\}, \quad \forall\, n \ge 0. \tag{C.8} \]
Here the functions $V_0^{(n)}$ are well defined because they are lower semianalytic on $\mathbf{X}$ by Propositions B.1.4 and B.1.34. For future reference, let us introduce
\[ V_0^{(\infty)}(x) := \lim_{n \to \infty} V_0^{(n)}(x) \tag{C.9} \]
wherever the limit on the right-hand side exists. If the limit exists for all $x \in \mathbf{X}$, then by Proposition B.1.5, $V_0^{(\infty)}$ is lower semianalytic on $\mathbf{X}$. The sequence $\{V_0^{(n)}\}_{n=0}^\infty$ is often referred to as the value iteration or dynamic programming algorithm.

Proposition C.2.4 Consider problem (C.4). Suppose $l_0(x, a) \in [0, \infty]$ for each $(x, a) \in \mathbf{X} \times \mathbf{A}$. Then the following assertions hold.
(a) For each $x \in \mathbf{X}$, $V_0^{(\infty)}(x)$ exists, and satisfies
\[ V_0^{(\infty)}(x) \le W_0^{DT*}(x). \tag{C.10} \]
Furthermore, $V_0^{(\infty)}$ is lower semianalytic on $\mathbf{X}$, and $V_0^{(\infty)} = W_0^{DT*}$ if and only if the function $V_0^{(\infty)}$ satisfies the Bellman equation (C.7).
(b) If $V$ is some $[0, \infty]$-valued lower semianalytic function on $\mathbf{X}$ satisfying
\[ V(x) \ge \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \int_{\mathbf{X}} V(y)\, p(dy|x, a) \right\}, \quad \forall\, x \in \mathbf{X}, \tag{C.11} \]
then $W_0^{DT*}(x) \le V(x)$ for each $x \in \mathbf{X}$. In particular, the value (Bellman) function $W_0^{DT*}$ is the minimal nonnegative lower semianalytic solution to the optimality equation (C.7).
(c) A deterministic stationary strategy $\varphi^*$ is uniformly optimal if and only if
\[ W_0^{DT*}(x) = \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \int_{\mathbf{X}} W_0^{DT*}(y)\, p(dy|x, a) \right\} = l_0(x, \varphi^*(x)) + \int_{\mathbf{X}} W_0^{DT*}(y)\, p(dy|x, \varphi^*(x)), \quad \forall\, x \in \mathbf{X}. \]
(d) For each $\varepsilon > 0$, there exists an $\varepsilon$-optimal universally measurable deterministic Markov strategy $\sigma$, i.e., $W_0^{DT}(\sigma, x) \le W_0^{DT*}(x) + \varepsilon$ for each $x \in \mathbf{X}$.

Proof See Propositions 9.8, 9.10, 9.12, 9.16, 9.19 and Corollary 9.4.1 of [21].
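For a finite model, the recursion (C.8) and the optimality criterion in Proposition C.2.4(c) can be sketched in a few lines. The example below is an illustration only and is not taken from the book: the three-state model, its costs and its transition probabilities are hypothetical, with state 2 playing the role of a costless absorbing state and all costs nonnegative.

```python
# Hypothetical positive model: states 0, 1 are ordinary, state 2 is absorbing and costless.
cost = [
    [1.0, 0.4],   # l0(0, a) for actions a = 0, 1
    [1.0, 0.5],   # l0(1, a)
    [0.0, 0.0],   # l0(2, a)
]
trans = [
    [{2: 1.0}, {1: 1.0}],            # p(.|0, a)
    [{1: 0.5, 2: 0.5}, {2: 1.0}],    # p(.|1, a)
    [{2: 1.0}, {2: 1.0}],            # p(.|2, a)
]


def q_value(V, x, a):
    """l0(x, a) + sum_y V(y) p(y|x, a), the expression inside the infimum in (C.8)."""
    return cost[x][a] + sum(p * V[y] for y, p in trans[x][a].items())


def bellman_step(V):
    """One step of value iteration: V^(n) -> V^(n+1)."""
    return [min(q_value(V, x, a) for a in range(len(cost[x])))
            for x in range(len(cost))]


def value_iteration(n_iter=100):
    V = [0.0] * len(cost)            # V^(0) := 0
    for _ in range(n_iter):
        V = bellman_step(V)
    return V


def greedy(V):
    """A deterministic stationary strategy attaining the minimum in the Bellman equation."""
    return [min(range(len(cost[x])), key=lambda a: q_value(V, x, a))
            for x in range(len(cost))]


V_inf = value_iteration()
phi_star = greedy(V_inf)
print(V_inf, phi_star)
```

In this toy model the iterates stabilize after two steps, the limit satisfies the Bellman equation (C.7), and the greedy strategy attains the infimum, which is the condition appearing in part (c).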
Proposition C.2.5 Consider problem (C.4), but suppose $l_0(x, a) \in [-\infty, 0]$ for each $(x, a) \in \mathbf{X} \times \mathbf{A}$. Then the following assertions hold.
(a) The value (Bellman) function $W_0^{DT*}$ is the maximal nonpositive lower semianalytic solution to the optimality equation (C.7). Moreover, $W_0^{DT*} = V_0^{(\infty)}$.
(b) If $V$ is some $[-\infty, 0]$-valued lower semianalytic function on $\mathbf{X}$ satisfying
\[ V(x) \le \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \int_{\mathbf{X}} V(y)\, p(dy|x, a) \right\}, \quad \forall\, x \in \mathbf{X}, \]
then $W_0^{DT*}(x) \ge V(x)$ for each $x \in \mathbf{X}$.
(c) For each $\varepsilon > 0$, there exists an $\varepsilon$-optimal universally measurable deterministic semi-Markov strategy $\sigma$, i.e., for each $x \in \mathbf{X}$, $W_0^{DT}(\sigma, x) \le W_0^{DT*}(x) + \varepsilon$ if $W_0^{DT*}(x) > -\infty$, and $W_0^{DT}(\sigma, x) \le -\frac{1}{\varepsilon}$ if $W_0^{DT*}(x) = -\infty$.
(d) A stationary strategy $\sigma^s$ is uniformly optimal if and only if the function
\[ W_0^{DT}(\sigma^s, x) := E_x^{\sigma^s}\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right] \]
satisfies the optimality equation (C.7).

Proof See Corollary 9.4.1 and Propositions 9.10, 9.14, 9.20, 9.13 of [21].
Propositions C.2.4 and C.2.5 remain valid if only actions from the set $\mathbf{A}(x)$ are admissible for each $x \in \mathbf{X}$, under the standard requirements as in Remark C.1.1.

Corollary C.2.1 Consider problem (C.4) and suppose that either $l_0(x, a) \in [0, \infty]$ for all $(x, a) \in \mathbf{X} \times \mathbf{A}$, or $l_0(x, a) \in [-\infty, 0]$ for all $(x, a) \in \mathbf{X} \times \mathbf{A}$. Then, for each deterministic stationary strategy $\varphi$, $W_0^{DT}(\varphi, x)$ coincides with $W(x) := \lim_{n \to \infty} W^n(x)$ for all $x \in \mathbf{X}$, where
\[ W^0(x) := 0; \qquad W^{n+1}(x) := l_0(x, \varphi(x)) + \int_{\mathbf{X}} W^n(y)\, p(dy|x, \varphi(x)). \]

Proof According to the remark above, one can say that $\mathbf{A}(x) = \{\varphi(x)\}$. Now, in the positive case, we have the equality
\[ W(x) = l_0(x, \varphi(x)) + \int_{\mathbf{X}} W(y)\, p(dy|x, \varphi(x)) \]
due to the Monotone Convergence Theorem: the sequence of nonnegative functions $W^n$ is obviously increasing. Thus $W_0^{DT}(\varphi, x) = W(x)$ by Proposition C.2.4(a). In the negative case, $W_0^{DT}(\varphi, x) = W(x)$ by Proposition C.2.5(a).

Proposition C.2.6 Suppose $l_0(x, a) \in [0, \infty]$ for each $(x, a) \in \mathbf{X} \times \mathbf{A}$. If there is a uniformly optimal strategy for problem (C.4), then there is a deterministic stationary uniformly optimal strategy.
Proof See Proposition 9.19 of [21].
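The successive approximations of Corollary C.2.1 for a fixed deterministic stationary strategy can be sketched as follows. This is an illustration under assumed data: the three-state model, the costs l0(x, ϕ(x)) and the kernel p(·|x, ϕ(x)) below are hypothetical, with state 2 absorbing and costless (the positive case).

```python
# Hypothetical data under a fixed deterministic stationary strategy phi:
# cost_phi[x] = l0(x, phi(x)), trans_phi[x] = p(.|x, phi(x)).
cost_phi = [1.0, 1.0, 0.0]
trans_phi = [{1: 1.0}, {1: 0.5, 2: 0.5}, {2: 1.0}]


def evaluate(n_iter):
    """Return the iterates W^0, W^1, ..., W^n_iter of Corollary C.2.1."""
    W = [0.0] * len(cost_phi)                      # W^0 := 0
    history = [W]
    for _ in range(n_iter):
        W = [cost_phi[x] + sum(p * W[y] for y, p in trans_phi[x].items())
             for x in range(len(cost_phi))]
        history.append(W)
    return history


history = evaluate(200)
W_limit = history[-1]
print(W_limit)
```

Here state 1 is left after a geometric number of steps with mean 2, each costing 1, so the total cost from state 1 is 2 and from state 0 it is 3; the iterates increase to these values, as the Monotone Convergence argument in the proof suggests.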
Proposition C.2.7 Consider problem (C.4). Suppose $l_0(x, a) \in [0, \infty]$ for each $(x, a) \in \mathbf{X} \times \mathbf{A}$. Then
\[ W_0^{DT*}(x) = \inf_{\sigma \in \Sigma_{DM}} E_x^\sigma\left[\sum_{n=0}^\infty l_0(X_n, A_{n+1})\right], \quad \forall\, x \in \mathbf{X}. \]

Proof This follows from the standard dynamic programming argument, see Chap. 9 of [21].

Below in this section, $\mathbf{X}$ and $\mathbf{A}$ are topological Borel spaces. The following two conditions are called the compactness-weak continuity condition, and the compactness-strong continuity condition, respectively.

Condition C.2.1
(a) The action space $\mathbf{A}$ is compact.
(b) For each bounded continuous function $f$ on $\mathbf{X}$, $\int_{\mathbf{X}} f(y)\, p(dy|x, a)$ is continuous in $(x, a) \in \mathbf{X} \times \mathbf{A}$.
(c) For each $j = 0, 1, \dots, J$, the function $l_j$ is lower semicontinuous on $\mathbf{X} \times \mathbf{A}$.
Condition C.2.2
(a) The action space $\mathbf{A}$ is compact.
(b) For each bounded measurable function $f$ on $\mathbf{X}$ and $x \in \mathbf{X}$, the function $\int_{\mathbf{X}} f(y)\, p(dy|x, a)$ is continuous in $a \in \mathbf{A}$.
(c) For each $j = 0, 1, \dots, J$, the function $l_j(x, \cdot)$ is lower semicontinuous on $\mathbf{A}$ for each $x \in \mathbf{X}$.

If the problem is unconstrained, then part (c) of Conditions C.2.1 and C.2.2 only concerns $l_0$.

Proposition C.2.8 Consider problem (C.4). Suppose $l_0(x, a) \in [0, \infty]$ for each $(x, a) \in \mathbf{X} \times \mathbf{A}$, and Condition C.2.1 (respectively, Condition C.2.2) is satisfied. Then the following assertions hold.
(a) For each $x \in \mathbf{X}$, $V_0^{(\infty)}(x) = W_0^{DT*}(x)$. Furthermore, $W_0^{DT*}$ is lower semicontinuous (respectively, measurable) on $\mathbf{X}$.
(b) The value (Bellman) function $W_0^{DT*}$ is the minimal nonnegative lower semicontinuous (respectively, measurable) solution to (C.7).
(c) There exists a deterministic stationary uniformly optimal strategy.
Proof See Theorems 15.2 and 16.2 of [214].

The following $\beta$-discounted DTMDP problem with $\beta \in [0, 1)$
\[ \text{Minimize over all strategies } \sigma: \quad E_x^\sigma\left[\sum_{n=0}^\infty \beta^n l_0(X_n, A_{n+1})\right], \quad \forall\, x \in \mathbf{X}, \tag{C.12} \]
can be considered as a special case of the total undiscounted DTMDP problem (C.4): at each time step, $(1 - \beta)$ is the probability of the absorption at the artificial cemetery $\Delta \notin \mathbf{X}$ with no future cost. Let
\[ W_0^{DT,\beta*}(x) := \inf_{\sigma \in \Sigma} E_x^\sigma\left[\sum_{n=0}^\infty \beta^n l_0(X_n, A_{n+1})\right]. \]

Definition C.2.2 (Uniformly optimal strategy) A strategy $\sigma^*$ is called uniformly optimal for the $\beta$-discounted DTMDP problem (C.12) if
\[ E_x^{\sigma^*}\left[\sum_{n=0}^\infty \beta^n l_0(X_n, A_{n+1})\right] \le E_x^\sigma\left[\sum_{n=0}^\infty \beta^n l_0(X_n, A_{n+1})\right] \]
for each strategy $\sigma$ and for each initial state $x \in \mathbf{X}$.
Similarly to (C.8), we may define the dynamic programming algorithm for the $\beta$-discounted problem (C.12): for all $x \in \mathbf{X}$ and $n \ge 0$,
\[ V_0^{(0),\beta}(x) := 0, \qquad V_0^{(n+1),\beta}(x) := \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \beta \int_{\mathbf{X}} V_0^{(n),\beta}(y)\, p(dy|x, a) \right\}, \tag{C.13} \]
\[ V_0^{(\infty),\beta}(x) := \lim_{n \to \infty} V_0^{(n),\beta}(x) \tag{C.14} \]
wherever the limit on the right-hand side exists. We now present a version of Proposition C.2.8 for the $\beta$-discounted DTMDP problem (C.12).

Proposition C.2.9 Consider the $\beta$-discounted DTMDP problem (C.12) with $\beta \in [0, 1)$. Suppose $l_0$ is inf-compact on $\mathbf{X} \times \mathbf{A}$ such that $l_0(x, a) \in [0, \infty]$ for each $(x, a) \in \mathbf{X} \times \mathbf{A}$, and Condition C.2.1(b) is satisfied. Then the following assertions hold.
(a) For each $x \in \mathbf{X}$, $\{V_0^{(n),\beta}(x)\}_{n=0}^\infty$ increases to $V_0^{(\infty),\beta}(x)$, and
\[ V_0^{(\infty),\beta}(x) = W_0^{DT,\beta*}(x). \]
Furthermore, $\{V_0^{(n),\beta}\}_{n=0}^\infty$ and $W_0^{DT,\beta*}$ are inf-compact on $\mathbf{X}$.
(b) The value (Bellman) function $W_0^{DT,\beta*}$ is the minimal nonnegative lower semicontinuous solution to the $\beta$-discounted optimality equation
\[ W(x) = \inf_{a \in \mathbf{A}} \left\{ l_0(x, a) + \beta \int_{\mathbf{X}} W(y)\, p(dy|x, a) \right\}, \quad x \in \mathbf{X}. \]
(c) There is a deterministic stationary uniformly optimal strategy for the $\beta$-discounted DTMDP problem (C.12). Moreover, a deterministic stationary strategy $\varphi$ is uniformly optimal for the $\beta$-discounted DTMDP problem (C.12) if and only if
\[ W_0^{DT,\beta*}(x) = l_0(x, \varphi(x)) + \beta \int_{\mathbf{X}} W_0^{DT,\beta*}(y)\, p(dy|x, \varphi(x)) \]
for each $x \in \mathbf{X}$.

Proof This statement can be established in the same way as for Proposition C.2.8, based on Proposition B.1.40. See Theorem 2.1 of [81].

The following definition is taken from [221], see also [200].

Definition C.2.3 (Finite and unichain DTMDPs) Consider a DTMDP with a finite state space $\mathbf{X}$ and action space $\mathbf{A}$. Such a model is called finite. It is called unichain if, under each deterministic stationary strategy, the process $\{X_n\}_{n=0}^\infty$ has a single positive recurrent class plus a possibly empty set of transient states.

Proposition C.2.10 For a finite DTMDP model with a bounded cost function $l_0$, if it is unichain, then for each $z \in \mathbf{X}$, there exists a constant $L \in [0, \infty)$ such that
\[ |W_0^{DT,\beta*}(x) - W_0^{DT,\beta*}(z)| \le L, \quad \forall\, x \in \mathbf{X},\ \beta \in (0, 1). \]

Proof See Proposition 6.4.1 of [221].
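The remark that the β-discounted problem (C.12) is a special case of the undiscounted problem (C.4) can be checked numerically on a finite model. The sketch below is illustrative only: the two-state model, the discount factor β = 0.9 and the iteration count are hypothetical. It runs the recursion (C.13) and, separately, the undiscounted recursion (C.8) on a model where every transition is first killed with probability 1 − β and sent to an added costless cemetery state.

```python
beta = 0.9

# Hypothetical two-state, two-action model with nonnegative costs.
cost = [[1.0, 2.0], [0.0, 0.5]]
trans = [
    [{0: 1.0}, {1: 1.0}],
    [{0: 0.5, 1: 0.5}, {1: 1.0}],
]


def iterate(cost, trans, discount, n_iter=2000):
    """Run V <- min_a { l0 + discount * sum_y V(y) p(y|x,a) } starting from V = 0."""
    V = [0.0] * len(cost)
    for _ in range(n_iter):
        V = [min(cost[x][a] + discount * sum(p * V[y] for y, p in trans[x][a].items())
                 for a in range(len(cost[x])))
             for x in range(len(cost))]
    return V


# Discounted recursion (C.13) on the original model.
V_disc = iterate(cost, trans, beta)

# Undiscounted recursion (C.8) on the killed model with a cemetery state.
CEMETERY = len(cost)


def kill(kernel):
    """Scale a kernel by beta and send the remaining mass 1 - beta to the cemetery."""
    killed = {y: beta * p for y, p in kernel.items()}
    killed[CEMETERY] = 1.0 - beta
    return killed


cost_killed = cost + [[0.0, 0.0]]
trans_killed = [[kill(k) for k in row] for row in trans]
trans_killed.append([{CEMETERY: 1.0}, {CEMETERY: 1.0}])
V_killed = iterate(cost_killed, trans_killed, 1.0)

print(V_disc, V_killed)
```

Both recursions return the same values on the original states, and the cemetery state carries value zero, which is the sense in which discounting can be read as absorption with no future cost.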
Remark C.2.1 If the decision at the current state $x$ can only be selected from $\mathbf{A}(x) \subseteq \mathbf{A}$, then we denote by $\hat{\Sigma}$ the set of all underlying admissible strategies, which is nonempty if the set
\[ \mathbb{K} := \{(x, a) : x \in \mathbf{X},\ a \in \mathbf{A}(x)\} \]
is measurable and contains the graph of a measurable mapping $\psi : \mathbf{X} \to \mathbf{A}$. In this situation, Proposition C.2.10 still holds if for each $y \in \mathbf{X}$, $W_0^{DT,\beta*}(y)$ is replaced therein by
\[ \hat{W}_0^{DT,\beta*}(y) := \inf_{\sigma \in \hat{\Sigma}} E_y^\sigma\left[\sum_{n=0}^\infty \beta^n l_0(X_n, A_{n+1})\right]. \]

The statement in Proposition C.2.4 can be strengthened for the contracting DTMDP models, defined as follows.

Definition C.2.4 (Contracting DTMDP) Let the Borel state space be $\mathbf{X}_\Delta := \mathbf{X} \cup \{\Delta\}$, where $\Delta$ is a point that does not belong to $\mathbf{X}$. The DTMDP model $(\mathbf{X} \cup \{\Delta\}, \mathbf{A}, p, \{l_j\}_{j=0}^J)$ is called contracting on $\mathbf{X}$ if there is some measurable function $\zeta : \mathbf{X} \cup \{\Delta\} \to [1, \infty)$ and a constant $\xi \in [0, 1)$ such that
\[ \int_{\mathbf{X}} \zeta(y)\, p(dy|x, a) \le \xi \zeta(x), \quad \forall\, x \in \mathbf{X} \cup \{\Delta\},\ a \in \mathbf{A}. \]

The above definition comes from Definition 7.9 of [7], which deals with models with a countable state space.

Proposition C.2.11 Consider a DTMDP model $(\mathbf{X} \cup \{\Delta\}, \mathbf{A}, p, \{l_j\}_{j=0}^J)$, which is contracting on the countable set $\mathbf{X}$. Assume that the cost function $l_0$ is $\zeta$-bounded, i.e., $\sup_{x \in \mathbf{X}_\Delta,\, a \in \mathbf{A}} \frac{|l_0(x, a)|}{\zeta(x)} < \infty$. Furthermore, suppose $l_0(\Delta, a) = 0$ for all $a \in \mathbf{A}$, and $\Delta$ is an absorbing state. Then $W_0^{DT*}$ is the unique solution to (C.11) out of all $\zeta$-bounded functions on $\mathbf{X}_\Delta$, and it actually satisfies (C.11) with equality. Moreover, the assertion in part (c) of Proposition C.2.4 holds, too.

Proof See Theorem 9.2 of [7].
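C.2.2 Results for the Constrained Problem

Definition C.2.5 (Occupation measures in DTMDPs) The total occupation measure $M_\gamma^\sigma$ of a strategy $\sigma \in \Sigma$ for the DTMDP is defined by
\[ M_\gamma^\sigma(\Gamma_X \times \Gamma_A) := E_\gamma^\sigma\left[\sum_{n=1}^\infty I\{X_{n-1} \in \Gamma_X, A_n \in \Gamma_A\}\right] = \sum_{n=1}^\infty M_{\gamma,n}^\sigma(\Gamma_X \times \Gamma_A) \]
for each $\Gamma_X \in \mathcal{B}(\mathbf{X})$ and $\Gamma_A \in \mathcal{B}(\mathbf{A})$, where
\[ M_{\gamma,n}^\sigma(\Gamma_X \times \Gamma_A) := E_\gamma^\sigma\left[I\{X_{n-1} \in \Gamma_X, A_n \in \Gamma_A\}\right] \]
is the detailed occupation measure. If $\gamma(dy) = \delta_x(dy)$ is a Dirac measure, then we write $M_\gamma^\sigma$ and $M_{\gamma,n}^\sigma$ as $M_x^\sigma$ and $M_{x,n}^\sigma$ correspondingly.

Clearly, for all $\Gamma_X \in \mathcal{B}(\mathbf{X})$, $\Gamma_A \in \mathcal{B}(\mathbf{A})$,
\[ M_\gamma^\sigma(\Gamma_X \times \Gamma_A) = \int_{\mathbf{X}} M_x^\sigma(\Gamma_X \times \Gamma_A)\, \gamma(dx) \quad \text{and} \quad M_{\gamma,n}^\sigma(\Gamma_X \times \Gamma_A) = \int_{\mathbf{X}} M_{x,n}^\sigma(\Gamma_X \times \Gamma_A)\, \gamma(dx). \]
Below, $\mathcal{D}$ is the space of all total occupation measures under the fixed initial distribution $\gamma$.

Proposition C.2.12 The total occupation measure $M_\gamma^\sigma$ of a strategy $\sigma$ for the DTMDP satisfies the following relation:
\[ M_\gamma^\sigma(\Gamma \times \mathbf{A}) = \gamma(\Gamma) + \int_{\mathbf{X} \times \mathbf{A}} p(\Gamma|y, a)\, M_\gamma^\sigma(dy \times da), \quad \forall\, \Gamma \in \mathcal{B}(\mathbf{X}). \tag{C.15} \]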
Proof The statement directly follows from Definition C.2.5 and the construction of the DTMDP. See Lemma 9.4.3 of [121] for more details.

The space of measures $M(dy \times da)$ satisfying Eq. (C.15) is obviously convex, but there can exist (phantom) solutions which are not total occupation measures: see Sect. 2.2.21 of [184]. Nevertheless, the following proposition holds true.

Proposition C.2.13 The space $\mathcal{D}$ of all total occupation measures is convex.

Proof The space $\{P_\gamma^\sigma,\ \sigma \in \Sigma\}$ of all strategic measures under a fixed initial distribution $\gamma$ is convex: see Proposition C.1.1(a). Therefore, for arbitrarily fixed $M_\gamma^{\sigma_1}, M_\gamma^{\sigma_2} \in \mathcal{D}$ and $\alpha \in (0, 1)$, there exists a strategic measure $P_\gamma^{\sigma_3} = \alpha P_\gamma^{\sigma_1} + (1 - \alpha) P_\gamma^{\sigma_2}$. Now, for all $\Gamma_X \in \mathcal{B}(\mathbf{X})$, $\Gamma_A \in \mathcal{B}(\mathbf{A})$,
\[ E_\gamma^{\sigma_3}\left[\sum_{n=1}^\infty I\{X_{n-1} \in \Gamma_X, A_n \in \Gamma_A\}\right] = \alpha E_\gamma^{\sigma_1}\left[\sum_{n=1}^\infty I\{X_{n-1} \in \Gamma_X, A_n \in \Gamma_A\}\right] + (1 - \alpha) E_\gamma^{\sigma_2}\left[\sum_{n=1}^\infty I\{X_{n-1} \in \Gamma_X, A_n \in \Gamma_A\}\right]. \]
Thus, we obtain the required equality $\alpha M_\gamma^{\sigma_1} + (1 - \alpha) M_\gamma^{\sigma_2} = M_\gamma^{\sigma_3} \in \mathcal{D}$.
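The constrained undiscounted DTMDP problem (C.5) can be rewritten as follows:
\[ \begin{aligned} \text{Minimize:} \quad & \int_{\mathbf{X} \times \mathbf{A}} l_0(x, a)\, M(dx \times da) \quad \text{over all measures } M \in \mathcal{D} \\ \text{subject to} \quad & \int_{\mathbf{X} \times \mathbf{A}} l_j(x, a)\, M(dx \times da) \le d_j, \quad j = 1, 2, \dots, J. \end{aligned} \tag{C.16} \]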
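Having in mind Proposition C.2.12, we introduce the following primal linear program in $\mathcal{M}^+(\mathbf{X} \times \mathbf{A})$:
\[ \begin{aligned} \text{Minimize:} \quad & \int_{\mathbf{X} \times \mathbf{A}} l_0(x, a)\, M(dx \times da) \quad \text{over all measures } M \in \mathcal{M}^+(\mathbf{X} \times \mathbf{A}) \\ \text{subject to} \quad & \int_{\mathbf{X} \times \mathbf{A}} l_j(x, a)\, M(dx \times da) \le d_j, \quad j = 1, 2, \dots, J; \\ & M(\Gamma \times \mathbf{A}) = \gamma(\Gamma) + \int_{\mathbf{X} \times \mathbf{A}} p(\Gamma|y, a)\, M(dy \times da), \quad \forall\, \Gamma \in \mathcal{B}(\mathbf{X}). \end{aligned} \tag{C.17} \]
Recall that $\mathcal{M}^+(\mathbf{X} \times \mathbf{A})$ denotes the set of all the $[0, \infty]$-valued measures on $(\mathbf{X} \times \mathbf{A}, \mathcal{B}(\mathbf{X} \times \mathbf{A}))$.

Definition C.2.6 (Feasible measure) A measure $M \in \mathcal{D}$ (a measure $M \in \mathcal{M}^+(\mathbf{X} \times \mathbf{A})$) is said to be feasible for problem (C.16) (correspondingly, for program (C.17)) if it satisfies all of the constraints. A feasible measure $M^*$ for the corresponding problem is called optimal if
\[ \int_{\mathbf{X} \times \mathbf{A}} l_0(x, a)\, M^*(dx \times da) \le \int_{\mathbf{X} \times \mathbf{A}} l_0(x, a)\, M(dx \times da) \]
for each feasible measure $M$ (in problem (C.16) or in program (C.17)).

Condition C.2.3 There is a feasible measure $M$ for program (C.17) such that
\[ \int_{\mathbf{X} \times \mathbf{A}} l_j(x, a)\, M(dx \times da) \in (-\infty, \infty), \quad \forall\, j = 0, 1, \dots, J. \]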
According to Proposition C.2.12, Condition C.2.3 is satisfied if there exists a feasible strategy $\sigma^{*}$ in the constrained problem (C.5) with
\[
E^{\sigma^{*}}_{\gamma}\left[\sum_{n=0}^{\infty} l_j(X_n, A_{n+1})\right] \in (-\infty, \infty), \quad j = 0, 1, \ldots, J.
\]
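As an illustration only, assume that $\mathbf{X}$ and $\mathbf{A}$ are finite (an assumption not made in the text). Writing $M(x,a)$ for $M(\{x\}\times\{a\})$ and $p(x|y,a)$ for $p(\{x\}|y,a)$, program (C.17) then reads as a linear program in the unknowns $M(x,a) \ge 0$:
\[
\begin{aligned}
\text{Minimize:}\quad & \sum_{x \in \mathbf{X}} \sum_{a \in \mathbf{A}} l_0(x,a)\, M(x,a) \\
\text{subject to}\quad & \sum_{x \in \mathbf{X}} \sum_{a \in \mathbf{A}} l_j(x,a)\, M(x,a) \le d_j, \quad j = 1, 2, \ldots, J; \\
& \sum_{a \in \mathbf{A}} M(x,a) = \gamma(x) + \sum_{y \in \mathbf{X}} \sum_{a \in \mathbf{A}} p(x|y,a)\, M(y,a), \quad x \in \mathbf{X}.
\end{aligned}
\]
The last line is the balance equation (C.15) written state by state: the total mass at $x$ equals the initial mass plus the mass flowing into $x$ from all state-action pairs.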
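If $\sigma^{s}(da|y)$ is a stationary strategy, then $M^{\sigma^{s}}_{\gamma}(dy \times da) = \sigma^{s}(da|y)\, M^{\sigma^{s}}_{\gamma}(dy \times \mathbf{A})$; the marginal $M^{\sigma^{s}}_{\gamma}(dy \times \mathbf{A})$ is the minimal nonnegative solution to the equation
\[
M^{\sigma^{s}}_{\gamma}(\Gamma \times \mathbf{A}) = \gamma(\Gamma) + \int_{\mathbf{X}} \int_{\mathbf{A}} p(\Gamma|y,a)\, \sigma^{s}(da|y)\, M^{\sigma^{s}}_{\gamma}(dy \times \mathbf{A}), \quad \Gamma \in \mathcal{B}(\mathbf{X}),
\]
and can be constructed by successive approximations
\[
\begin{aligned}
M^{\sigma^{s}}_{\gamma,0}(\Gamma \times \mathbf{A}) &= 0; \\
M^{\sigma^{s}}_{\gamma,i+1}(\Gamma \times \mathbf{A}) &= \gamma(\Gamma) + \int_{\mathbf{X}} \int_{\mathbf{A}} p(\Gamma|y,a)\, \sigma^{s}(da|y)\, M^{\sigma^{s}}_{\gamma,i}(dy \times \mathbf{A}), \quad i \ge 0.
\end{aligned}
\]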
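For finite state and action spaces the successive approximations reduce to a plain vector iteration. The following is a minimal sketch under that assumption; the function name `occupation_marginal`, the nested-list layout `p[y][a][x]` for $p(\{x\}|y,a)$, and the two-state example are illustrative and not taken from the text. Rows of `p` may sum to less than one, the missing mass being treated as leaving the state space for good, which is what makes the iteration converge here.

```python
def occupation_marginal(gamma, p, sigma, iterations=200):
    """Successive approximations m_0 = 0, m_{i+1} = gamma + m_i * P_sigma.

    gamma[x]    : initial distribution
    p[y][a][x]  : transition probability from y to x under action a
    sigma[y][a] : stationary strategy, probability of action a in state y
    """
    n = len(gamma)
    m = [0.0] * n  # m_0 = 0
    for _ in range(iterations):
        # The comprehension reads the old m and builds the new one.
        m = [
            gamma[x] + sum(
                p[y][a][x] * sigma[y][a] * m[y]
                for y in range(n)
                for a in range(len(sigma[y]))
            )
            for x in range(n)
        ]
    return m


# Hypothetical two-state example (not from the text).
gamma = [1.0, 0.0]
p = [
    [[0.0, 0.5], [0.0, 0.0]],  # state 0: action 0 reaches state 1 w.p. 0.5
    [[0.0, 0.5], [0.0, 0.0]],  # state 1: action 0 stays in state 1 w.p. 0.5
]
sigma = [[0.5, 0.5], [1.0, 0.0]]

m = occupation_marginal(gamma, p, sigma)
# Full occupation measure M(y, a) = sigma(a | y) * m(y).
M = [[sigma[y][a] * m[y] for a in range(2)] for y in range(2)]
```

In this example the iterates increase to the marginal (1, 0.5), so the total mass of `M` is 1.5. A fixed iteration count is used for simplicity; a tolerance-based stopping rule would be a natural alternative.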
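Definition C.2.7 (Absorbing DTMDP) A DTMDP with the initial distribution $\gamma$ is called absorbing if the state space is $\mathbf{X}_{\Delta} = \mathbf{X} \cup \{\Delta\}$, where $\Delta$ is the absorbing cemetery state (an isolated point with no further costs), and $M^{\sigma}_{\gamma}(\mathbf{X}\times\mathbf{A}) < \infty$ for all $\sigma \in \Sigma$. (Note that $M^{\sigma}_{\gamma}(\mathbf{X}\times\mathbf{A}) = E^{\sigma}_{\gamma}[T_{\Delta}]$, where $T_{\Delta} := \inf\{n \ge 0 : X_n = \Delta\}$.)
Remark C.2.2 If a DTMDP model $(\mathbf{X}_{\Delta}, \mathbf{A}, p, \{l_j\}_{j=0}^{J})$ is contracting on a countable set $\mathbf{X}$, where $\Delta$ is a costless cemetery, and the initial distribution is concentrated on a singleton, then it is absorbing in the sense of Definition C.2.7. See Theorem 7.5 of [7].
In an absorbing DTMDP, we always consider its occupation measure restricted to $\mathbf{X}\times\mathbf{A}$. Also, the definition of a strategy at the state $\Delta$ is not important, and will not be specified.
If a DTMDP with the initial distribution $\gamma$ is absorbing, then there is a $K < \infty$ such that $M^{\sigma}_{\gamma}(\mathbf{X}\times\mathbf{A}) = E^{\sigma}_{\gamma}[T_{\Delta}] < K$ for all $\sigma \in \Sigma$. Indeed, if there are strategies $\sigma_1, \sigma_2, \ldots$ such that $M^{\sigma_i}_{\gamma}(\mathbf{X}\times\mathbf{A}) > 2^{i}$ then, since $P^{\hat\sigma}_{\gamma} = \sum_{i=1}^{\infty} \frac{1}{2^{i}} P^{\sigma_i}_{\gamma}$ is a strategic measure according to [69, Sect. 5, Chap. 3] (see also Corollary C.1.1), one has $M^{\hat\sigma}_{\gamma}(\mathbf{X}\times\mathbf{A}) = \sum_{i=1}^{\infty} \frac{1}{2^{i}} M^{\sigma_i}_{\gamma}(\mathbf{X}\times\mathbf{A}) = \infty$, which contradicts Definition C.2.7. Clearly, equality (C.15) holds for an absorbing model, too.
Proposition C.2.14 Suppose, in an absorbing DTMDP with the initial distribution $\gamma$, a finite measure $M$ on $\mathbf{X}\times\mathbf{A}$ satisfies equality (C.15). Then $M = M^{\sigma^{s}}_{\gamma}$, where the stationary strategy $\sigma^{s}$ comes from the decomposition $M(\Gamma_X \times \Gamma_A) = \int_{\Gamma_X} \sigma^{s}(\Gamma_A|x)\, M(dx \times \mathbf{A})$.
Proof See Lemma 4.2 of [88].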
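The decomposition in Proposition C.2.14 can be made concrete when the state and action spaces are finite. The sketch below rests on that assumption; the helper names `induced_strategy` and `balance_residual`, the layout `p[y][a][x]` for $p(\{x\}|y,a)$, and the numbers are hypothetical. It checks that a given finite measure satisfies the balance equation (C.15) on $\mathbf{X}$ and then reads off the stationary strategy as $\sigma^{s}(a|x) = M(x,a)/M(x,\mathbf{A})$.

```python
def induced_strategy(M):
    """sigma(a | x) = M(x, a) / M(x, A) on states with positive marginal mass."""
    sigma = []
    for row in M:
        total = sum(row)
        if total > 0:
            sigma.append([v / total for v in row])
        else:
            # Arbitrary choice on states that are never visited.
            sigma.append([1.0] + [0.0] * (len(row) - 1))
    return sigma


def balance_residual(M, gamma, p):
    """Largest violation of M(x, A) = gamma(x) + sum_{y,a} p[y][a][x] * M(y, a)."""
    n = len(gamma)
    return max(
        abs(
            sum(M[x]) - gamma[x]
            - sum(p[y][a][x] * M[y][a] for y in range(n) for a in range(len(M[y])))
        )
        for x in range(n)
    )


# Hypothetical two-state absorbing example; mass missing from a row of p
# goes to the cemetery state.
gamma = [1.0, 0.0]
p = [
    [[0.0, 0.5], [0.0, 0.0]],
    [[0.0, 0.5], [0.0, 0.0]],
]
M = [[0.5, 0.5], [0.5, 0.0]]  # candidate finite measure on X x A

residual = balance_residual(M, gamma, p)
sigma_s = induced_strategy(M)
```

Here the residual is zero, so `M` satisfies (C.15), and the proposition then identifies `M` as the total occupation measure of the stationary strategy `sigma_s`.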
The next statement is obvious. (See also Proposition C.2.13.)
Corollary C.2.2 In an absorbing DTMDP with the initial distribution $\gamma$, the space $\mathcal{D}$ of total occupation measures is convex, generated by stationary strategies.
Proposition C.2.15 In an absorbing DTMDP with the initial distribution $\gamma$, a total occupation measure $M$ is extreme in $\mathcal{D}$ if and only if $M = M^{\varphi}_{\gamma}$ for some deterministic stationary strategy $\varphi$.
Proof See Lemma 4.6 of [88].
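One way to read Proposition C.2.15 computationally: in a finite absorbing model, a linear cost without constraints attains its minimum over $\mathcal{D}$ at an extreme point, so it is enough to search the deterministic stationary strategies. The sketch below assumes finite spaces and no constraints; the function name `occupation_measure`, the data layout `p[y][a][x]`, and the cost table `l0` are hypothetical. With constraints as in (C.16), the optimum may require randomization, and this enumeration would not suffice.

```python
from itertools import product


def occupation_measure(gamma, p, phi, iterations=200):
    """Occupation measure of the deterministic stationary strategy phi (state -> action)."""
    n = len(gamma)
    m = [0.0] * n
    for _ in range(iterations):
        m = [
            gamma[x] + sum(p[y][phi[y]][x] * m[y] for y in range(n))
            for x in range(n)
        ]
    # All mass at state y sits on the chosen action phi[y].
    return {(y, phi[y]): m[y] for y in range(n)}


def total_cost(l0, measure):
    """Integral of l0 with respect to the occupation measure."""
    return sum(l0[y][a] * mass for (y, a), mass in measure.items())


# Hypothetical two-state, two-action absorbing example.
gamma = [1.0, 0.0]
p = [
    [[0.0, 0.5], [0.0, 0.0]],
    [[0.0, 0.5], [0.0, 0.0]],
]
l0 = [[1.0, 3.0], [1.0, 0.0]]  # l0[x][a]

# Enumerate all maps phi: state -> action and keep the cheapest one.
best = min(
    product(range(2), repeat=2),
    key=lambda phi: total_cost(l0, occupation_measure(gamma, p, phi)),
)
best_cost = total_cost(l0, occupation_measure(gamma, p, best))
```

In this example the cheapest deterministic stationary strategy takes action 0 in state 0 and action 1 in state 1.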
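Condition C.2.4 (a) The state space is $\mathbf{X}_{\Delta} = \mathbf{X} \cup \{\Delta\}$, where $\Delta$ is an absorbing cemetery state (an isolated point with no further costs). For each $M \in \mathcal{D}$, if $M(\mathbf{X}\times\mathbf{A}) = \infty$, then either at least one constraint in (C.16) is violated, or there exists a feasible $M' \in \mathcal{D}$ such that $\int_{\mathbf{X}\times\mathbf{A}} l_0(x,a)\, M'(dx \times da) < \int_{\mathbf{X}\times\mathbf{A}} l_0(x,a)\, M(dx \times da)$.
(b) All the cost functions $l_j$, $j = 0, 1, \ldots, J$, are bounded.
Under Condition C.2.4(a), which is certainly valid in an absorbing DTMDP, one can ignore the strategies for which $M^{\sigma}_{\gamma}(\mathbf{X}\times\mathbf{A}) = \infty$ and replace problem (C.16) with the following:
\[
\begin{aligned}
\text{Minimize:}\quad & \int_{\mathbf{X}\times\mathbf{A}} l_0(x,a)\, M(dx \times da) \quad \text{over all measures } M \in \mathcal{D} \text{ with } M(\mathbf{X}\times\mathbf{A}) < \infty \\
\text{subject to}\quad & \int_{\mathbf{X}\times\mathbf{A}} l_j(x,a)\, M(dx \times da) \le d_j, \quad j = 1, 2, \ldots, J.
\end{aligned}
\tag{C.18}
\]
If the latter problem is solved and $M^{\sigma^{*}}_{\gamma}$ is the optimal point, then this strategy $\sigma^{*}$ is an optimal solution to the original problem (C.5).
If Condition C.2.4 is satisfied, then problem (C.18) is a special case of the Primal Convex Program (B.16):
• $\mathcal{X}$ is the linear space of finite signed measures on $(\mathbf{X}\times\mathbf{A}, \mathcal{B}(\mathbf{X}\times\mathbf{A}))$;
• $f_0(M) = \int_{\mathbf{X}\times\mathbf{A}} l_0(x,a)\, M(dx \times da)$;
• $f_j(M) = \int_{\mathbf{X}\times\mathbf{A}} l_j(x,a)\, M(dx \times da) - d_j$, $j = 1, 2, \ldots, J$.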
Proposition C.2.16 Suppose Condition C.2.4 is satisfied as well as the Slater condition for problem (C.18) (that is, all the inequalities in (C.18) are strict for some $M \in \mathcal{D}$ with $M(\mathbf{X}\times\mathbf{A}) < \infty$). Assume additionally that the minimal value in (C.18) is bigger than $-\infty$. Then the following holds true:
(a) There exists at least one vector of Lagrange multipliers $\bar{g}^{*} \in (\mathbb{R}^{0}_{+})^{J}$ solving the Dual Convex Program: Maximize over $\bar{g} \in (\mathbb{R}^{0}_{+})^{J}$:
inf
M∈D: M(X×A)