Adaptive dynamic programming: single and multiple controllers 9789811317118, 9789811317125


English Pages 278 Year 2019

Table of contents:
Preface
Contents
About the Authors
Symbols
1.1.1 Continuous-Time LQR
1.1.2 Discrete-Time LQR
1.2 Adaptive Dynamic Programming
1.3 Review of Matrix Algebra
References
2.1 Introduction
2.3 The Data-Based Identifier
2.4 Derivation of the Iterative ADP Algorithm with Convergence Analysis
2.5 Neural Network Implementation of the Iterative Control Algorithm
2.6 Simulation Study
2.7 Conclusions
References
3.1 Introduction
3.2 Problem Statement
3.3.1 The Novel ADP Iteration Algorithm
3.3.2 Convergence Analysis of the Improved Iteration Algorithm
3.3.3 Neural Network Implementation of the Iteration ADP Algorithm
3.4 Simulation Study
3.5 Conclusion
References
4.1 Introduction
4.2 Problem Formulation
4.3 Derivation of the ADP Algorithm for Time-Delay Systems
4.4 Neural Network Implementation for the Multi-objective Optimal Control Problem of Time-Delay Systems
4.5 Simulation Results
4.6 Conclusions
References
5.1 Introduction
5.2 Problem Statement
5.3 SIANN Architecture-Based Classification
5.4 Optimal Control Based on ADP
5.4.1 Model Neural Network
5.4.2 Critic Network and Action Network
5.5 Simulation Study
References
6.1 Introduction
6.2 Motivations and Preliminaries
6.3.1 Critic Network
6.3.2 Action Network
6.3.3 Design of the Compensation Controller
6.3.4 Stability Analysis
6.4 Simulation Study
References
7.1 Introduction
7.2 Problem Statement
7.3 Off-Policy Optimal Control Method
7.3.1 Convergence Analysis of Off-Policy PI Algorithm
7.3.2 Implementation Method of Off-Policy Iteration Algorithm
7.4 Simulation Study
References
8.1 Introduction
8.2 Problem Formulation and Preliminaries
8.3.1 Description of Approximation-Error ADP Algorithm
8.3.2 Convergence Analysis of The Iterative ADP Algorithm
8.4 Simulation Study
References
9.1 Introduction
9.2 Problem Statement
9.3.1 On-Policy IRL for Nonzero Disturbance
9.3.2 Off-Policy IRL for Nonzero Disturbance
9.3.3 NN Approximation for Actor-Critic Structure
9.4.1 Disturbance Compensation Off-Policy Controller Design
9.4.2 Stability Analysis
9.5 Simulation Study
References
10.1 Introduction
10.2 Preliminaries and Assumptions
10.3.1 Derivation of The Iterative ADP Method
10.3.2 The Procedure of the Method
10.3.3 The Properties of the Iterative ADP Method
10.4 Neural Network Implementation
10.4.1 The Model Network
10.4.2 The Critic Network
10.4.3 The Action Network
10.5 Simulation Study
References
11.1 Introduction
11.2 Motivations and Preliminaries
11.3.1 Derivation of Off-Policy Algorithm
11.3.2 Implementation Method for Off-Policy Algorithm
11.3.3 Stability Analysis
11.4 Simulation Study
References
12.1 Introduction
12.2 Problem Statement
12.3 Multi-player Learning PI Solution for NZS Games
12.4.1 Derivation of Off-Policy Algorithm
12.4.2 Implementation Method for Off-Policy Algorithm
12.4.3 Stability Analysis
12.5 Simulation Study
References
13.1 Introduction
13.2.1 Graph Theory
13.2.2 Synchronization and Tracking Error Dynamic Systems
13.3 Optimal Distributed Cooperative Control for Multi-agent …
13.3.1 Cooperative Performance Index Function
13.3.2 Nash Equilibrium
13.4.1 Derivation of the Heterogeneous Multi-agent Differential Graphical Games
13.4.2 Properties of the Developed Policy Iteration Algorithm
13.4.3 Heterogeneous Multi-agent Policy Iteration Algorithm
13.5 Simulation Study
References
Index

Studies in Systems, Decision and Control 166

Ruizhuo Song Qinglai Wei Qing Li

Adaptive Dynamic Programming: Single and Multiple Controllers

Series editor: Janusz Kacprzyk, Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

The series “Studies in Systems, Decision and Control” (SSDC) covers both new developments and advances, as well as the state of the art, in the various areas of broadly perceived systems, decision making and control, quickly, up to date and with high quality. The intent is to cover the theory, applications, and perspectives on the state of the art and future developments relevant to systems, decision making, control, complex processes and related areas, as embedded in the fields of engineering, computer science, physics, economics, social and life sciences, as well as the paradigms and methodologies behind them. The series contains monographs, textbooks, lecture notes and edited volumes in systems, decision making and control spanning the areas of Cyber-Physical Systems, Autonomous Systems, Sensor Networks, Control Systems, Energy Systems, Automotive Systems, Biological Systems, Vehicular Networking and Connected Vehicles, Aerospace Systems, Automation, Manufacturing, Smart Grids, Nonlinear Systems, Power Systems, Robotics, Social Systems, Economic Systems and others. Of particular value to both the contributors and the readership are the short publication timeframe and the worldwide distribution and exposure, which enable both a wide and rapid dissemination of research output.

More information about this series at http://www.springer.com/series/13304

Ruizhuo Song, University of Science and Technology Beijing, Beijing, China

Qinglai Wei, Institute of Automation, Chinese Academy of Sciences, Beijing, China

Qing Li, University of Science and Technology Beijing, Beijing, China

ISSN 2198-4182
ISSN 2198-4190 (electronic)
Studies in Systems, Decision and Control
ISBN 978-981-13-1711-8
ISBN 978-981-13-1712-5 (eBook)
https://doi.org/10.1007/978-981-13-1712-5

Jointly published with Science Press, Beijing, China.
The print edition is not for sale in China Mainland. Customers from China Mainland please order the print book from: Science Press, Beijing.

Library of Congress Control Number: 2018949333

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019

This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publishers, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

In the past few decades, neurobiological discoveries about the learning mechanisms of the human brain have indicated how to design more effective and responsive decision and control systems for complex, interconnected human-engineered systems. Advances in computational-intelligence reinforcement learning (RL) methods, particularly those with an actor-critic structure in the family of approximate/adaptive dynamic programming (ADP) techniques, have made inroads in comprehending and mimicking brain functionality at the level of the brainstem, cerebellum, basal ganglia, and cerebral cortex. The works of Doya and others have shown that the basal ganglia act as a critic in selecting action commands sent to the muscle motor control systems. The concentration of dopamine neurotransmitters in the basal ganglia is modified based on rewards received from previous actions, so that successful actions are more likely to be repeated in the future. The cerebellum learns a dynamics model of the environment for control purposes. This motivates a three-level actor-critic structure with learning networks for the critic, the actor, and the model dynamics.

Based on this actor-critic structure, ADP algorithms, which are state feedback control methods, have been developed as powerful tools for solving optimal control problems online. ADP techniques are used to design new structures of adaptive feedback control systems that learn the solutions to optimal control problems by measuring data along the system trajectories.

This book studies ADP-based optimal control for two categories of dynamic feedback control systems: systems with a single control input and systems with multiple control inputs. The book is organized into 13 chapters. Chapter 1 presents preliminaries, and the remaining chapters are divided into two parts.

In Part I (Chaps. 2 to 8), we develop novel ADP methods for optimal control of systems with one control input. Chapter 2 introduces a finite-time optimal control method for a class of unknown nonlinear systems. Chapter 3 proposes finite-horizon optimal control for a class of nonaffine time-delay nonlinear systems. Chapter 4 presents multi-objective optimal control for a class of nonlinear time-delay systems. Chapter 5 develops multiple actor-critic structures for continuous-time optimal control using input–output data. Chapter 6 studies an ADP-based optimal control method for complex-valued nonlinear systems. Chapter 7 develops off-policy neuro-optimal control for unknown complex-valued nonlinear systems based on policy iteration. Chapter 8 proposes optimal tracking control together with a convergence proof.

Part II (Chaps. 9 to 13) concerns multi-player systems. Chapter 9 presents an off-policy actor-critic structure for optimal control of unknown systems with disturbances. Chapter 10 introduces an iterative ADP method for solving a class of nonlinear zero-sum differential games. Chapter 11 develops a neural-network-based synchronous iteration learning method for multi-player zero-sum games. Chapter 12 employs an off-policy integral reinforcement learning (IRL) method to solve nonlinear continuous-time multi-player non-zero-sum games. Chapter 13 establishes optimal distributed synchronization control for continuous-time heterogeneous multi-agent differential graphical games.

Dr. Ruizhuo Song completed Chaps. 1–12 and wrote 250,000 words. Dr. Qinglai Wei completed Chap. 13 and wrote 50,000 words.

The authors would like to acknowledge the help and encouragement they received from colleagues at the University of Science and Technology Beijing and the Institute of Automation, Chinese Academy of Sciences, during the course of writing this book. The authors are very grateful to the National Natural Science Foundation of China (NSFC) for providing the necessary financial support for our research. This book was supported by NSFC Grants 61374105, 61673054, and 61722312, and in part by the Fundamental Research Funds for the Central Universities under Grant FRF-GF-17-B45.

This book is dedicated to Tiantian, who makes every day exciting.

Beijing, China
October 2017

Ruizhuo Song Qinglai Wei Qing Li

Contents

1 Introduction
  1.1 Optimal Control
    1.1.1 Continuous-Time LQR
    1.1.2 Discrete-Time LQR
  1.2 Adaptive Dynamic Programming
  1.3 Review of Matrix Algebra
  References

2 Neural-Network-Based Approach for Finite-Time Optimal Control
  2.1 Introduction
  2.2 Problem Formulation and Motivation
  2.3 The Data-Based Identifier
  2.4 Derivation of the Iterative ADP Algorithm with Convergence Analysis
  2.5 Neural Network Implementation of the Iterative Control Algorithm
  2.6 Simulation Study
  2.7 Conclusions
  References

3 Nearly Finite-Horizon Optimal Control for Nonaffine Time-Delay Nonlinear Systems
  3.1 Introduction
  3.2 Problem Statement
  3.3 The Iteration ADP Algorithm and Its Convergence
    3.3.1 The Novel ADP Iteration Algorithm
    3.3.2 Convergence Analysis of the Improved Iteration Algorithm
    3.3.3 Neural Network Implementation of the Iteration ADP Algorithm
  3.4 Simulation Study
  3.5 Conclusion
  References

4 Multi-objective Optimal Control for Time-Delay Systems
  4.1 Introduction
  4.2 Problem Formulation
  4.3 Derivation of the ADP Algorithm for Time-Delay Systems
  4.4 Neural Network Implementation for the Multi-objective Optimal Control Problem of Time-Delay Systems
  4.5 Simulation Results
  4.6 Conclusions
  References

5 Multiple Actor-Critic Optimal Control via ADP
  5.1 Introduction
  5.2 Problem Statement
  5.3 SIANN Architecture-Based Classification
  5.4 Optimal Control Based on ADP
    5.4.1 Model Neural Network
    5.4.2 Critic Network and Action Network
  5.5 Simulation Study
  5.6 Conclusions
  References

6 Optimal Control for a Class of Complex-Valued Nonlinear Systems
  6.1 Introduction
  6.2 Motivations and Preliminaries
  6.3 ADP-Based Optimal Control Design
    6.3.1 Critic Network
    6.3.2 Action Network
    6.3.3 Design of the Compensation Controller
    6.3.4 Stability Analysis
  6.4 Simulation Study
  6.5 Conclusion
  References

7 Off-Policy Neuro-Optimal Control for Unknown Complex-Valued Nonlinear Systems
  7.1 Introduction
  7.2 Problem Statement
  7.3 Off-Policy Optimal Control Method
    7.3.1 Convergence Analysis of Off-Policy PI Algorithm
    7.3.2 Implementation Method of Off-Policy Iteration Algorithm
    7.3.3 Implementation Process
  7.4 Simulation Study
  7.5 Conclusion
  References

8 Approximation-Error-ADP-Based Optimal Tracking Control for Chaotic Systems
  8.1 Introduction
  8.2 Problem Formulation and Preliminaries
  8.3 Optimal Tracking Control Scheme Based on Approximation-Error ADP Algorithm
    8.3.1 Description of Approximation-Error ADP Algorithm
    8.3.2 Convergence Analysis of The Iterative ADP Algorithm
  8.4 Simulation Study
  8.5 Conclusion
  References

9 Off-Policy Actor-Critic Structure for Optimal Control of Unknown Systems with Disturbances
  9.1 Introduction
  9.2 Problem Statement
  9.3 Off-Policy Actor-Critic Integral Reinforcement Learning
    9.3.1 On-Policy IRL for Nonzero Disturbance
    9.3.2 Off-Policy IRL for Nonzero Disturbance
    9.3.3 NN Approximation for Actor-Critic Structure
  9.4 Disturbance Compensation Redesign and Stability Analysis
    9.4.1 Disturbance Compensation Off-Policy Controller Design
    9.4.2 Stability Analysis
  9.5 Simulation Study
  9.6 Conclusion
  References

10 An Iterative ADP Method to Solve for a Class of Nonlinear Zero-Sum Differential Games
  10.1 Introduction
  10.2 Preliminaries and Assumptions
  10.3 Iterative Approximate Dynamic Programming Method for ZS Differential Games
    10.3.1 Derivation of The Iterative ADP Method
    10.3.2 The Procedure of the Method
    10.3.3 The Properties of the Iterative ADP Method
  10.4 Neural Network Implementation
    10.4.1 The Model Network
    10.4.2 The Critic Network
    10.4.3 The Action Network
  10.5 Simulation Study
  10.6 Conclusions
  References

11 Neural-Network-Based Synchronous Iteration Learning Method for Multi-player Zero-Sum Games
  11.1 Introduction
  11.2 Motivations and Preliminaries
  11.3 Synchronous Solution of Multi-player ZS Games
    11.3.1 Derivation of Off-Policy Algorithm
    11.3.2 Implementation Method for Off-Policy Algorithm
    11.3.3 Stability Analysis
  11.4 Simulation Study
  11.5 Conclusions
  References

12 Off-Policy Integral Reinforcement Learning Method for Multi-player Non-zero-Sum Games
  12.1 Introduction
  12.2 Problem Statement
  12.3 Multi-player Learning PI Solution for NZS Games
  12.4 Off-Policy Integral Reinforcement Learning Method
    12.4.1 Derivation of Off-Policy Algorithm
    12.4.2 Implementation Method for Off-Policy Algorithm
    12.4.3 Stability Analysis
  12.5 Simulation Study
  12.6 Conclusion
  References

13 Optimal Distributed Synchronization Control for Heterogeneous Multi-agent Graphical Games
  13.1 Introduction
  13.2 Graphs and Synchronization of Multi-agent Systems
    13.2.1 Graph Theory
    13.2.2 Synchronization and Tracking Error Dynamic Systems
  13.3 Optimal Distributed Cooperative Control for Multi-agent Differential Graphical Games
    13.3.1 Cooperative Performance Index Function
    13.3.2 Nash Equilibrium
  13.4 Heterogeneous Multi-agent Differential Graphical Games by Iterative ADP Algorithm
    13.4.1 Derivation of the Heterogeneous Multi-agent Differential Graphical Games
    13.4.2 Properties of the Developed Policy Iteration Algorithm
    13.4.3 Heterogeneous Multi-agent Policy Iteration Algorithm
  13.5 Simulation Study
  13.6 Conclusion
  References

Index

About the Authors

Ruizhuo Song Ph.D., Associate Professor, School of Automation and Electrical Engineering, University of Science and Technology Beijing. She received her Ph.D. in control theory and control engineering from Northeastern University, Shenyang, China, in 2012. She was a Postdoctoral Fellow at the University of Science and Technology Beijing, Beijing, China. She is currently an Associate Professor at the School of Automation and Electrical Engineering, University of Science and Technology Beijing. She was a Visiting Scholar in the Department of Electrical Engineering at the University of Texas at Arlington, Arlington, TX, USA, from 2013 to 2014. She was a Visiting Scholar in the Department of Electrical, Computer, and Biomedical Engineering at the University of Rhode Island, Kingston, RI, USA, from January 2018 to February 2018. Her current research interests include optimal control, multi-player games, neural network-based control, nonlinear control, wireless sensor networks, and adaptive dynamic programming and their industrial applications. She has published over 40 journal and conference papers and co-authored two monographs.

Qinglai Wei Ph.D., Professor, The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences. He received the B.S. degree in automation and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, China, in 2002 and 2009, respectively. From 2009 to 2011, he was a postdoctoral fellow with The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China. He is currently a professor at the institute. He has authored three books and published over 80 international journal papers. His research interests include adaptive dynamic programming, neural network-based control, computational intelligence, optimal control, nonlinear systems and their industrial applications.

Dr. Wei has been an Associate Editor of IEEE Transactions on Automation Science and Engineering since 2017, IEEE Transactions on Consumer Electronics since 2017, Control Engineering (in Chinese) since 2017, IEEE Transactions on Cognitive and Developmental Systems since 2017, IEEE Transactions on Systems, Man, and Cybernetics: Systems since 2016, Information Sciences since 2016, Neurocomputing since 2016, Optimal Control Applications and Methods since 2016, and Acta Automatica Sinica since 2015, and he held the same position for IEEE Transactions on Neural Networks and Learning Systems from 2014 to 2015. He has been the Secretary of the IEEE Computational Intelligence Society (CIS) Beijing Chapter since 2015. He was the Program Chair of the 14th International Symposium on Neural Networks (ISNN 2017), Program Co-Chair of the 24th International Conference on Neural Information Processing (ICONIP 2017), and Registration Chair of the 12th World Congress on Intelligent Control and Automation (WCICA 2016), the 2014 IEEE World Congress on Computational Intelligence (WCCI 2014), the 2013 International Conference on Brain Inspired Cognitive Systems (BICS 2013), and the Eighth International Symposium on Neural Networks (ISNN 2011). He was the Publication Chair of the 5th International Conference on Information Science and Technology (ICIST 2015) and the Ninth International Symposium on Neural Networks (ISNN 2012). He was the Finance Chair of the 4th International Conference on Intelligent Control and Information Processing (ICICIP 2013) and the Publicity Chair of the 2012 International Conference on Brain Inspired Cognitive Systems (BICS 2012). He was a guest editor for Neural Computing and Applications and Neurocomputing in 2013 and 2014, respectively.

He was a recipient of Shuang-Chuang Talents in Jiangsu Province, China, in 2014. He received the Outstanding Paper Awards of IEEE Transactions on Neural Networks and Learning Systems in 2018 and Acta Automatica Sinica in 2011, the Zhang Siying Outstanding Paper Award of the Chinese Control and Decision Conference (CCDC) in 2015, and the Best Paper Award of the IEEE 6th Data Driven Control and Learning Systems Conference (DDCLS) in 2017. He received the Young Researcher Award of the Asia Pacific Neural Network Society (APNNS) in 2016. He received the Young Scientist Award, the Yang Jiachi Science and Technology Awards (Second Class Prize), and the Natural Science Award (First Prize) from the Chinese Association of Automation (CAA) in 2017. He was the PI for 13 national and local government projects.

Qing Li Ph.D., received his B.E. from North China University of Science and Technology, Tangshan, China, in 1993, and his Ph.D. in control theory and its applications from the University of Science and Technology Beijing, Beijing, China, in 2000. He is currently a Professor at the School of Automation and Electrical Engineering, University of Science and Technology Beijing, Beijing, China. He was a Visiting Scholar at Ryerson University, Toronto, Canada, from February 2006 to February 2007. His research interests include intelligent control and intelligent optimization.

Symbols

x         State vector
u         Control vector
F         System function
i         Index
ℝ^n       State space
X         State set
J, V      Performance index functions
U         Utility function
J*        Optimal performance index function
u*        Law of optimal control
N         Terminal time
W         Weight matrix between the hidden layer and output layer
Q, R      Positive definite matrices
a, b      Learning rate
H         Hamiltonian function
e_c       Estimation error
E_c       Squared residual error

Chapter 1

Introduction

1.1 Optimal Control

Optimal control is one particular branch of modern control. It deals with the problem of finding a control law for a given system such that a certain optimality criterion is achieved. A control problem includes a cost functional that is a function of state and control variables. An optimal control is a set of differential equations describing the paths of the control variables that minimize the cost function. The optimal control can be derived using Pontryagin’s maximum principle (a necessary condition also known as Pontryagin’s minimum principle or simply Pontryagin’s Principle), or by solving the Hamilton–Jacobi–Bellman (HJB) equation (a sufficient condition). For linear systems with a quadratic performance function, the HJB equation reduces to the algebraic Riccati equation (ARE) [1]. Dynamic programming is based on Bellman’s principle of optimality: an optimal (control) policy has the property that no matter what the previous decisions have been, the remaining decisions must constitute an optimal policy with regard to the state resulting from those previous decisions. Dynamic programming is a very useful tool for solving optimization and optimal control problems. Next, we introduce optimal control problems for continuous-time and discrete-time linear systems.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_1

1.1.1 Continuous-Time LQR

A special case of the general optimal control problem is the linear quadratic regulator (LQR) optimal control problem [2]. The LQR considers the linear time-invariant dynamical system described by

$$\dot{x}(t) = Ax(t) + Bu(t) \tag{1.1}$$

with state $x(t) \in \mathbb{R}^n$ and control input $u(t) \in \mathbb{R}^m$. To this system is associated the infinite-horizon quadratic cost function or performance index

$$V(x(t_0), t_0) = \frac{1}{2}\int_{t_0}^{\infty} \{x^T(\tau)Qx(\tau) + u^T(\tau)Ru(\tau)\}\,d\tau \tag{1.2}$$

with weighting matrices $Q > 0$ and $R > 0$. It is assumed that $(A, B)$ is stabilizable and $(A, \sqrt{Q})$ is detectable. The LQR optimal control problem requires finding the control policy that minimizes the cost (1.2):

$$u^*(t) = \arg\min_{u(t)} V(t_0, x(t_0), u(t)). \tag{1.3}$$

The solution of this optimal control problem is given by the state feedback

$$u(t) = -Kx(t) \tag{1.4}$$

where $K = R^{-1}B^T P$, and the matrix $P$ is a positive definite solution of the algebraic Riccati equation (ARE)

$$A^T P + PA + Q - PBR^{-1}B^T P = 0. \tag{1.5}$$

Under the stabilizability and detectability conditions, there is a unique positive semidefinite solution of the ARE that yields a stabilizing closed-loop controller given by (1.4). That is, the closed-loop system is asymptotically stable.
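As a minimal numerical sketch of (1.4) and (1.5), the ARE can be solved from the stable invariant subspace of the associated Hamiltonian matrix. This is one standard route and not necessarily the one used by the authors. The double-integrator system, the weights, and the helper name `solve_care` below are assumptions chosen for illustration only.

```python
import numpy as np

def solve_care(A, B, Q, R):
    """Stabilizing solution P of A^T P + P A + Q - P B R^{-1} B^T P = 0."""
    n = A.shape[0]
    S = B @ np.linalg.inv(R) @ B.T
    # Hamiltonian matrix: its n stable eigenvectors span [X1; X2], and P = X2 X1^{-1}.
    ham = np.block([[A, -S], [-Q, -A.T]])
    eigvals, eigvecs = np.linalg.eig(ham)
    stable = eigvecs[:, eigvals.real < 0]
    X1, X2 = stable[:n, :], stable[n:, :]
    P = np.real(X2 @ np.linalg.inv(X1))
    return 0.5 * (P + P.T)  # symmetrize against round-off

# Hypothetical example: double integrator with Q = I, R = 1.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_care(A, B, Q, R)
K = np.linalg.inv(R) @ B.T @ P                      # gain in (1.4)
residual = A.T @ P + P @ A + Q - P @ B @ np.linalg.inv(R) @ B.T @ P
closed_loop_eigs = np.linalg.eigvals(A - B @ K)     # should lie in the left half plane
```

For this particular system the ARE can also be solved by hand, giving $P = \begin{bmatrix}\sqrt{3} & 1\\ 1 & \sqrt{3}\end{bmatrix}$ and $K = [1,\ \sqrt{3}]$, which is a convenient check on the numerical result.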

1.1.2 Discrete-Time LQR

Consider the discrete-time LQR problem where the dynamical system is described by

$$x(k+1) = Ax(k) + Bu(k) \tag{1.6}$$

with $k$ the discrete time index. The associated infinite-horizon performance index has deterministic stage costs and is

$$V(k) = \frac{1}{2}\sum_{i=k}^{\infty} \left( x^T(i)Qx(i) + u^T(i)Ru(i) \right). \tag{1.7}$$

The value function for a fixed policy depends only on the initial state $x(k)$. A difference equation equivalent to this infinite sum is given by

$$V(x(k)) = \frac{1}{2}\left( x^T(k)Qx(k) + u^T(k)Ru(k) \right) + V(x(k+1)). \tag{1.8}$$

Assuming the value is quadratic in the state, so that

$$V(x(k)) = \frac{1}{2} x^T(k)Px(k) \tag{1.9}$$

for some kernel matrix $P$, yields the Bellman equation form

$$x^T(k)Px(k) = x^T(k)Qx(k) + u^T(k)Ru(k) + (Ax(k) + Bu(k))^T P (Ax(k) + Bu(k)). \tag{1.10}$$

Assuming a constant state feedback policy $u(k) = -Kx(k)$ for some stabilizing gain $K$, we write

$$(A - BK)^T P (A - BK) - P + Q + K^T RK = 0. \tag{1.11}$$

This is a Lyapunov equation. The above gives the solution methods for optimal control problems of linear systems. However, it is often computationally untenable to run true dynamic programming because of the backward numerical process required for its solution, i.e., as a result of the well-known “curse of dimensionality”.
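To make the link between the infinite sum (1.7), the quadratic form (1.9) and the Lyapunov equation (1.11) concrete, the sketch below evaluates a fixed gain in two ways: by iterating (1.11) to its fixed point, and by summing the stage costs along a simulated trajectory. The system matrices, the gain `K` and the initial state are hypothetical values picked only so that $A - BK$ is stable.

```python
import numpy as np

def evaluate_policy(A, B, Q, R, K, iters=2000):
    """Fixed-point iteration P <- Q + K^T R K + (A-BK)^T P (A-BK), cf. (1.11)."""
    Ac = A - B @ K
    P = np.zeros_like(Q)
    for _ in range(iters):
        P = Q + K.T @ R @ K + Ac.T @ P @ Ac
    return P

def simulated_cost(A, B, Q, R, K, x0, steps=2000):
    """Truncated version of the sum (1.7) along the closed-loop trajectory."""
    x, total = x0.copy(), 0.0
    for _ in range(steps):
        u = -K @ x
        total += 0.5 * float(x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return total

# Hypothetical data: a sampled double integrator and a stabilizing gain.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])
K = np.array([[1.0, 2.0]])      # A - BK has a double eigenvalue at 0.9
x0 = np.array([1.0, -0.5])

P = evaluate_policy(A, B, Q, R, K)
value_from_P = 0.5 * float(x0 @ P @ x0)              # quadratic form (1.9)
value_from_sum = simulated_cost(A, B, Q, R, K, x0)   # sum (1.7)
```

The two numbers agree up to truncation and round-off, which is exactly the statement that (1.8) and (1.9) reduce the infinite sum to a matrix equation for a fixed stabilizing policy.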

1.2 Adaptive Dynamic Programming

Reinforcement learning (RL) is a type of machine learning developed in the computational intelligence community in computer science and engineering. It has been used extensively to solve optimal control problems. RL is a computational approach to learning from interactions with the surrounding environment, and is concerned with how an agent or actor ought to take actions so as to optimize a cost of its long-term interactions with the environment. In the context of control, the environment is the dynamic system, the agent corresponds to the controller, and actions correspond to control signals. The RL objective is to find a strategy that minimizes an expected long-term cost. One type of reinforcement learning algorithm employs the actor-critic structure shown in Fig. 1.1, which is also used in [3]. This structure produces forward-in-time algorithms that are implemented in real time, wherein an actor component applies an action, or control policy, to the environment, and a critic component assesses the value of that action. The learning mechanism supported by the actor-critic structure has two steps: policy evaluation, executed by the critic, followed by policy improvement. The policy evaluation step is performed by observing from the environment the results of applying current actions. Performance or value can be defined in terms of optimality objectives such as minimum fuel, minimum energy, minimum risk, or maximum reward. Based on the assessment of the performance, one of several schemes can then be used to modify or improve the control policy in the sense that the new policy yields a value that is improved relative to the previous value. In this scheme, reinforcement learning is a means of learning optimal behaviors by observing the real-time responses from the environment to nonoptimal control policies.

Fig. 1.1 Reinforcement learning with an actor/critic structure

Werbos developed actor-critic techniques for feedback control of discrete-time dynamical systems that learn optimal policies online in real time using data measured along the system trajectories [4–7]. These methods, known as approximate dynamic programming or adaptive dynamic programming (ADP), comprise a family of basic learning methods: heuristic dynamic programming (HDP), action-dependent HDP (ADHDP), dual heuristic dynamic programming (DHP), ADDHP, globalized DHP (GDHP), and ADGDHP. ADP builds a critic to approximate the cost function and an actor to approximate the optimal control in dynamic programming, using a function approximation structure such as neural networks (NNs) [8]. Based on the Bellman equation, which allows optimal decision problems to be solved in real time and forward in time from data measured along the system trajectories, the ADP algorithm has played an important role as an effective intelligent control method in seeking solutions to the optimal control problem [9]. ADP has two main iteration forms, namely policy iteration (PI) and value iteration (VI) [10]. PI algorithms consist of policy evaluation and policy improvement. An initial stabilizing control law is required, which is often difficult to obtain. Compared with VI algorithms, PI requires fewer iterations in most applications, since it behaves like Newton's method, but every iteration is more computationally demanding. VI algorithms solve the optimal control problem without requiring an initial stabilizing control law, which makes them easy to implement.
For system (1.6) and the cost function (1.7), the detailed procedures of PI and VI are given as follows.


1. PI Algorithm

(1) Initialize. Select any admissible (i.e. stabilizing) control policy $h^{[0]}(x(k))$.

(2) Policy Evaluation Step. Determine the value of the current policy using the Bellman equation

$$V^{[i+1]}(x(k)) = r(x(k), h^{[i]}(x(k))) + V^{[i+1]}(x(k+1)). \tag{1.12}$$

(3) Policy Improvement Step. Determine an improved policy using

$$h^{[i+1]}(x(k)) = \arg\min_{h(\cdot)} \left\{ r(x(k), h(x(k))) + V^{[i+1]}(x(k+1)) \right\}. \tag{1.13}$$

2. VI Algorithm

(1) Initialize. Select any control policy $h^{[0]}(x(k))$, not necessarily admissible or stabilizing.

(2) Value Update Step. Update the value using

$$V^{[i+1]}(x(k)) = r(x(k), h^{[i]}(x(k))) + V^{[i]}(x(k+1)). \tag{1.14}$$

(3) Policy Improvement Step. Determine an improved policy using

$$h^{[i+1]}(x(k)) = \arg\min_{h(\cdot)} \left\{ r(x(k), h(x(k))) + V^{[i+1]}(x(k+1)) \right\}. \tag{1.15}$$
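The two procedures can be compared on a scalar instance of system (1.6) with cost (1.7), where every step has a closed form. The sketch below assumes a quadratic value $V(x) = \frac{1}{2}px^2$ and a linear policy $h(x) = -kx$; the numbers `a`, `b`, `q`, `r` and the initial gains are hypothetical. In this setting the evaluation step (1.12) becomes a scalar Lyapunov equation, the update (1.14) becomes a one-step backup, and the improvement steps (1.13) and (1.15) share the same minimizer.

```python
import math

a, b, q, r = 1.2, 1.0, 1.0, 1.0   # hypothetical open-loop unstable scalar system

def improve(p):
    """Greedy gain for V(x) = 0.5*p*x^2: the minimizer in (1.13) and (1.15)."""
    return a * b * p / (r + b * b * p)

def policy_iteration(k0, iters=20):
    k = k0                                        # must satisfy |a - b*k0| < 1
    for _ in range(iters):
        ac = a - b * k
        p = (q + r * k * k) / (1.0 - ac * ac)     # solves (1.12) exactly
        k = improve(p)
    return p, k

def value_iteration(k0=0.0, p0=0.0, iters=200):
    k, p = k0, p0                                 # k0 need not be stabilizing
    for _ in range(iters):
        ac = a - b * k
        p = q + r * k * k + ac * ac * p           # one-step backup (1.14)
        k = improve(p)
    return p, k

p_pi, k_pi = policy_iteration(k0=1.0)
p_vi, k_vi = value_iteration()

# Positive root of the scalar Riccati equation, used only as a reference value.
c = r - b * b * q - a * a * r
p_star = (-c + math.sqrt(c * c + 4.0 * b * b * q * r)) / (2.0 * b * b)
```

Both iterations reach the same kernel and gain. PI needs the stabilizing initial gain `k0=1.0` because the division by `1 - ac*ac` is only meaningful when `|ac| < 1`, whereas VI starts from the non-stabilizing gain `k0=0.0` and simply takes more iterations.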

1.3 Review of Matrix Algebra

In this book, some matrix manipulations are the basic mathematical vehicle and, for those whose memory needs refreshing, we provide a short review.

1. For any $n \times n$ matrices $A$ and $B$, $(AB)^T = B^T A^T$.

2. For any $n \times n$ matrices $A$ and $B$, if $A$ and $B$ are nonsingular, then $(AB)^{-1} = B^{-1}A^{-1}$.

3. The Kronecker product of two matrices $A = [a_{ij}] \in \mathbb{C}^{m \times n}$ and $B = [b_{ij}] \in \mathbb{C}^{p \times q}$ is $A \otimes B = [a_{ij}B] \in \mathbb{C}^{mp \times nq}$.

4. If $A = [a_1, a_2, \ldots, a_n] \in \mathbb{C}^{m \times n}$, where $a_i$ are the columns of $A$, the stacking operator is defined by

$$s(A) = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}.$$

It converts $A \in \mathbb{C}^{m \times n}$ into a vector $s(A) \in \mathbb{C}^{mn}$. Then for matrices $A$, $B$ and $D$ we have

$$s(ABD) = (D^T \otimes A)\,s(B). \tag{1.16}$$

5. If $x \in \mathbb{R}^n$ is a vector, then the square of the Euclidean norm is $\|x\|^2 = x^T x$.

6. If $Q$ is symmetric, then $\dfrac{\partial}{\partial x}(x^T Qx) = 2Qx$.
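Identity (1.16) is easy to check numerically. The sketch below uses random real matrices of compatible (assumed) sizes; the helper name `stack` is introduced here for the stacking operator $s(\cdot)$ and corresponds to flattening a matrix in column-major order.

```python
import numpy as np

def stack(M):
    """Stacking operator s(M): the columns of M placed one under another."""
    return M.flatten(order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
D = rng.standard_normal((4, 5))

lhs = stack(A @ B @ D)                 # s(ABD), a vector of length 2*5
rhs = np.kron(D.T, A) @ stack(B)       # (D^T kron A) s(B), cf. (1.16)
```

The identity is what allows a matrix equation that is linear in an unknown matrix, such as the Lyapunov equation (1.11), to be rewritten as an ordinary linear system in the stacked unknown.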


References

1. Zhang, H., Liu, D., Luo, Y., Wang, D.: Adaptive Dynamic Programming for Control-Algorithms and Stability. Springer, London (2013)
2. Vrabie, D., Vamvoudakis, K., Lewis, F.: Optimal Adaptive Control and Differential Games by Reinforcement Learning Principles. The Institution of Engineering and Technology, London (2013)
3. Barto, A., Sutton, R., Anderson, C.: Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Trans. Syst. Man Cybern. Part B, Cybern. SMC–13(5), 834–846 (1983)
4. Werbos, P.: A menu of designs for reinforcement learning over time. In: Miller, W.T., Sutton, R.S., Werbos, P.J. (eds.) Neural Networks for Control, pp. 67–95. MIT Press, Cambridge (1991)
5. Werbos, P.: Approximate dynamic programming for real-time control and neural modeling. In: White, D.A., Sofge, D.A. (eds.) Handbook of Intelligent Control. Van Nostrand Reinhold, New York (1992)
6. Werbos, P.: Neural networks for control and system identification. In: Proceedings of IEEE Conference Decision Control, Tampa, FL, pp. 260–265 (1989)
7. Werbos, P.: Advanced forecasting methods for global crisis warning and models of intelligence. General Syst. Yearbook 22, 25–38 (1977)
8. Liu, D., Wei, Q., Yang, X., Li, H., Wang, D.: Adaptive Dynamic Programming with Applications in Optimal Control. Springer International Publishing, Berlin (2017)
9. Werbos, P.: ADP: the key direction for future research in intelligent control and understanding brain intelligence. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 38(4), 898–900 (2008)
10. Lewis, F., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)

Chapter 2

Neural-Network-Based Approach for Finite-Time Optimal Control

This chapter proposes a novel finite-time optimal control method based on input-output data for unknown nonlinear systems using the ADP algorithm. In this method, a single-hidden-layer feed-forward network (SLFN) with extreme learning machine (ELM) is used to construct the data-based identifier of the unknown system dynamics. Based on the data-based identifier, the finite-time optimal control method is established by the ADP algorithm. Two other SLFNs with ELM are used in the ADP method to facilitate the implementation of the iterative algorithm; they approximate the performance index function and the optimal control law at each iteration, respectively. A simulation example is provided to demonstrate the effectiveness of the proposed control scheme.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_2

2.1 Introduction

The linear optimal control problem with a quadratic cost function is probably the most well-known control problem [1, 2], and it can be translated into a Riccati equation. The optimal control of nonlinear systems, in contrast, is usually a challenging and difficult problem [3, 4]. Furthermore, compared with the case of known system dynamics, the optimal control problem for unknown system dynamics is more intractable. Generally speaking, most actual systems are far too complex to admit perfect mathematical models. Whenever no model is available for designing the system controller, nor is one easy to produce, a standard way is to resort to data-based techniques [5]: (1) on the basis of input-output data, the model of the unknown system dynamics is identified; (2) on the basis of the estimated model of the system dynamics, the controller is designed by model-based design techniques. It is well known that the neural network is an effective tool for intelligent identification based on input-output data, due to its nonlinearity, adaptivity, self-learning and fault tolerance [6–10]. Among neural networks, the SLFN is one of the most useful types [11]. In [12], Hornik proved that if the activation function is continuous, bounded, and non-constant, then continuous mappings can be approximated by an SLFN with additive hidden nodes over compact input sets. In [13], Leshno improved the results of [12] and proved that an SLFN with additive hidden nodes and a nonpolynomial activation function can approximate any continuous target function. In [11], it is proven in theory that an SLFN with randomly generated additive hidden nodes and a broad class of activation functions can universally approximate any continuous target function on any compact subset of the Euclidean space.

For SLFN training, there are three main approaches: (1) gradient-descent based, for example the back-propagation (BP) method [14]; (2) least-squares based, for example the ELM method in this chapter; (3) standard optimization method based, for example the support vector machine (SVM). However, the learning speed of feed-forward neural networks is in general far slower than required, and this has been a major bottleneck in their applications for past decades [15]. Two key reasons are: (1) slow gradient-based learning algorithms are extensively used to train neural networks; (2) all the parameters of the networks are tuned iteratively by such learning algorithms. Unlike conventional neural network theories, in this chapter the ELM method is used to train the SLFN. Such an SLFN can serve as a universal approximator: one may simply choose the hidden nodes at random and then only adjust the output weights linking the hidden layer and the output layer. For given network architectures, ELM does not require human-intervened parameters, so ELM converges fast and is easy to use.

Based on the SLFN identifier, the finite-time optimal control method is presented in this chapter. For finite-time control problems, the system must be stabilized to zero within finite time. The controller design for finite-time problems still presents a challenge to control engineers, because of the lack of methodology and because the control step is difficult to determine. Few results relate to finite-time optimal control based on the ADP algorithm. As far as we know, [16] solved the finite-horizon optimal control problem for a class of discrete-time nonlinear systems using the ADP algorithm. However, the method in [16] adopts BP networks to obtain the optimal control, which has a slow convergence speed. In this chapter, we design the finite-time optimal controller based on SLFN with ELM for unknown nonlinear systems. First, the identifier is established from the input-output data. It is proven that the identification error converges to zero. Upon the data-based identifier, the optimal control method is proposed. We prove that the iterative performance index function converges to the optimum, and the optimal control is also obtained. Compared with other popular implementation methods such as BP, the SLFN with ELM has a fast response speed and is fully automatic. This means that, except for target errors and the allowed maximum number of hidden nodes, no control parameters need to be manually tuned by users.

The rest of this chapter is organized as follows. In Sect. 2.2, the problem formulation is presented. In Sect. 2.3, the identifier is developed based on the input-output data. In Sect. 2.4, the iterative ADP algorithm and the convergence proof are given. In Sect. 2.6, an example is given to demonstrate the effectiveness of the proposed control scheme. In Sect. 2.7, the conclusion is drawn.


2.2 Problem Formulation and Motivation

Consider the following unknown discrete-time nonlinear system

$$x(k+1) = F(x(k), u(k)), \tag{2.1}$$

where the state $x(k) \in \mathbb{R}^n$ and the control $u(k) \in \mathbb{R}^m$. $F(x(k), u(k))$ is an unknown continuous function. Assume that the state is completely controllable and bounded on $\Omega$, and $F(0, 0) = 0$. The finite-time performance index function is defined as follows:

$$J(x(k), U(k, K)) = \sum_{i=k}^{K} \{x^T(i)Qx(i) + u^T(i)Ru(i)\} \tag{2.2}$$

where $Q$ and $R$ are positive definite matrices, $K$ is a finite positive integer, and the control sequence $U(k, K) = (u(k), u(k+1), \ldots, u(K))$ is finite-time admissible [16]. The length of $U(k, K)$ is defined as $(K - k + 1)$.

This chapter aims to find the optimal control for system (2.1) based on the performance index function (2.2). Since the system dynamics is completely unknown, the optimal control problem cannot be solved directly. Therefore, it is desirable to propose a novel method that does not need the exact system dynamics but only the input-output data, which can be obtained during the operation of the system. In this chapter, we propose a data-based optimal control scheme using SLFN with ELM and the ADP method for general unknown nonlinear systems. The design of the proposed controller is divided into two steps: (1) The unknown nonlinear system dynamics is identified by an SLFN identification scheme with convergence proof. (2) The optimal controller is designed based on the data-based identifier. In the following sections, we discuss the establishment of the data-based identifier and the controller design in detail.

2.3 The Data-Based Identifier

In this section, the ELM method is introduced and the data-based identifier is established with convergence proof. The structure of the SLFN is shown in Fig. 2.1. Consider $N_1$ arbitrary distinct samples $(\bar{x}(i), \bar{y}(i))$, where $\bar{x}(i) \in \mathbb{R}^{n_1}$, $\bar{y}(i) \in \mathbb{R}^{m_1}$, $i = 1, 2, \ldots, N_1$. The weight vectors between the input neurons and the $j$th hidden neuron are $w_j \in \mathbb{R}^{n_2}$. The weight vectors between the output neurons and the $j$th hidden neuron are $\bar{\beta}_j \in \mathbb{R}^{n_3}$, which will be designed by the ELM method [17]. The number of hidden neurons is $L$. The threshold of the $j$th hidden neuron is $b_j$. If the hidden layer activation function $g_L(\bar{x})$ is infinitely differentiable, then the mathematical model of the SLFN is [15]

$$f_L(\bar{x}(i)) = \sum_{j=1}^{L} \bar{\beta}_j\, g_L(w_j^T \bar{x}(i) + b_j), \quad i = 1, 2, \ldots, N_1. \tag{2.3}$$

Fig. 2.1 The basic SLFN architecture

Unlike the traditional popular implementations of SLFN, in this chapter ELM is used to adjust the output weights. In theory, Refs. [18, 19] show that the input weights and hidden neuron biases of the SLFN do not need to be adjusted during training, and one may simply assign random values to them. For convenience of explanation, let $\beta_L = [\bar{\beta}_1, \bar{\beta}_2, \ldots, \bar{\beta}_L]^T_{L \times m_1}$, $\bar{Y} = [\bar{y}(1), \bar{y}(2), \ldots, \bar{y}(N_1)]^T_{N_1 \times m_1}$, and

$$H = [h(\bar{x}(1)), h(\bar{x}(2)), \ldots, h(\bar{x}(N_1))]^T = \begin{bmatrix} G(w_1, b_1, \bar{x}(1)) & \cdots & G(w_L, b_L, \bar{x}(1)) \\ \vdots & \ddots & \vdots \\ G(w_1, b_1, \bar{x}(N_1)) & \cdots & G(w_L, b_L, \bar{x}(N_1)) \end{bmatrix}_{N_1 \times L},$$

where $G(w_j, b_j, \bar{x}(i)) = g_L(w_j^T \bar{x}(i) + b_j)$. So we have

$$H\beta_L = \bar{Y}. \tag{2.4}$$

Based on the least-squares method, it can be obtained that

$$\beta_L = H^+ \bar{Y}, \tag{2.5}$$

where $H^+ = (H^T H)^{-1} H^T$.


For the SLFN in (2.3), the output weight $\beta_L$ is the only value we want to obtain. In the following, a theorem is given to prove that $\beta_L$ exists, which means that $H$ is invertible.

Theorem 2.1 If the SLFN is defined as in (2.3), let the number of hidden neurons be $L$. For $N_1$ arbitrary distinct input samples $\bar{x}(i)$ and any given $w_j$ and $b_j$, $H$ in (2.4) is invertible.

Proof As the $\bar{x}(i)$ are distinct, for any vector $w_j$ chosen according to any continuous probability distribution, with probability one, $w_j^T \bar{x}(1), w_j^T \bar{x}(2), \ldots, w_j^T \bar{x}(N_1)$ are different from each other. Define the $j$th column of $H$ as $c(j) = [g_L(w_j^T \bar{x}(1) + b_j), g_L(w_j^T \bar{x}(2) + b_j), \ldots, g_L(w_j^T \bar{x}(N_1) + b_j)]^T$; then $c(j)$ does not belong to any subspace whose dimension is less than $N_1$ [19]. It means that for any given $w_j$ and $b_j$, chosen according to any continuous probability distribution, $H$ in (2.4) can be made full-rank, i.e., $H$ is invertible.

Therefore, the SLFN with the ELM method is summarized as follows [20]:

Step 1. Given a training set $(\bar{x}(i), \bar{y}(i))$, $i = 1, 2, \ldots, N_1$, hidden node output function $G(w_j, b_j, \bar{x}(i))$ and hidden node number $L$.
Step 2. Given arbitrary hidden node parameters $(w_j, b_j)$, $j = 1, 2, \ldots, L$.
Step 3. Calculate the hidden layer output matrix $H$.
Step 4. Calculate $\beta_L$ according to (2.5).

Remark 2.1 The ELM algorithm can work with a wide range of activation functions, such as sigmoidal, radial basis, sine, cosine and exponential functions. Feed-forward networks with arbitrarily assigned input weights and hidden layer biases can universally approximate any continuous function on any compact input set [21].

Remark 2.2 It is important to point out that $\beta_L$ in (2.5) has the smallest norm among all the least-squares solutions of $H\beta_L = \bar{Y}$. As the input weights and hidden neuron biases of the SLFN are simply assigned random values, training an SLFN is equivalent to finding a least-squares solution of the linear system $H\beta_L = \bar{Y}$. Although almost all learning algorithms aim to reach the minimum training error, most of them cannot reach it, either because of local minima or because infinite training iterations are not allowed in applications [21]. Fortunately, the special unique solution $\beta_L$ in (2.5) has the smallest norm among all the least-squares solutions.
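Steps 1–4 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the sigmoid activation, the uniform sampling ranges, the sample count and the function names `elm_train` and `elm_predict` are all choices made here, and the training target simply reuses the scalar system (2.43) from the simulation section with $(x(k), u(k))$ as the network input. The output weights are computed with a least-squares solver, which returns the minimum-norm least-squares solution and coincides with (2.5) when $H^T H$ is invertible.

```python
import numpy as np

def elm_train(X, Y, L, rng):
    """Steps 1-4: random hidden layer, then least-squares output weights."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))   # input weights w_j (random, fixed)
    b = rng.uniform(-1.0, 1.0, size=L)                 # thresholds b_j (random, fixed)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))             # sigmoid g_L; H is N1 x L
    # Minimum-norm least-squares solution of H beta = Y, cf. (2.4)-(2.5).
    beta, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Evaluate the SLFN output (2.3) for each row of X."""
    return (1.0 / (1.0 + np.exp(-(X @ W + b)))) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 2))              # columns: x(k), u(k)
Y = (X[:, 0] + np.sin(0.1 * X[:, 0] ** 2 + X[:, 1]))[:, None]   # x(k+1)

W, b, beta = elm_train(X, Y, L=20, rng=rng)
train_mse = float(np.mean((elm_predict(X, W, b, beta) - Y) ** 2))
baseline_mse = float(np.mean(Y ** 2))                  # error of the all-zero predictor
```

Only `beta` is fitted; `W` and `b` are never updated, which is why training reduces to a single linear least-squares problem. How small `train_mse` becomes depends on the random draw, the number of hidden neurons and the sampling range, so no specific accuracy is claimed here.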

2.4 Derivation of the Iterative ADP Algorithm with Convergence Analysis

For the unknown nonlinear system (2.1), the data-based identifier is established. Then we can design the iterative ADP algorithm to get the solution of the finite-time optimal control problem.

First, the derivations of the optimal control $u^*(k)$ and $J^*(x(k))$ are given in detail. It is known that, for the case of finite-horizon optimization, the optimal performance index function $J^*(x(k))$ satisfies [16]

$$J^*(x(k)) = \inf_{U(k)} \{J(x(k), U(k, K))\}, \tag{2.6}$$

where $U(k, K)$ stands for a finite-time control sequence. The length of the control sequence is not assigned. According to Bellman’s optimality principle, the following Hamilton–Jacobi–Bellman (HJB) equation

$$J^*(x(k)) = \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1))\} \tag{2.7}$$

holds. Define the law of the optimal control sequence starting at $k$ by

$$U^*(k, K) = \arg\inf_{U(k,K)} \{J(x(k), U(k, K))\}, \tag{2.8}$$

and the law of the optimal control vector by

$$u^*(k) = \arg\inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1))\}. \tag{2.9}$$

Therefore, we have

$$J^*(x(k)) = x^T(k)Qx(k) + u^{*T}(k)Ru^*(k) + J^*(x(k+1)). \tag{2.10}$$

Based on the above preparation, the finite-time ADP method for unknown systems is proposed. The iterative procedure is as follows. For the iterative step $i = 1$, the performance index function is computed as

$$V^{[1]}(x(k)) = \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1))\} = x^T(k)Qx(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x(k+1)) \tag{2.11}$$

where

$$u^{[1]}(x(k)) = \arg\inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[0]}(x(k+1))\}, \tag{2.12}$$

and $V^{[0]}(x(k+1))$ has two expression forms according to two different cases. If for $x(k)$ there exists $U(k, K) = (u(k))$ such that $F(x(k), u(k)) = 0$, then $V^{[0]}(x(k+1))$ is

$$V^{[0]}(x(k+1)) = J(x(k+1), U^*(k+1, k+1)) = 0, \quad \forall x(k+1) \tag{2.13}$$

where $U^*(k+1, k+1) = (0)$. In this situation, the constraint $F(x(k), u^{[1]}(k)) = 0$ must be imposed on (2.11). If for $x(k)$ there exists $U(k, \bar{K}) = (u(k), u(k+1), \ldots, u(\bar{K}))$ such that $F(x(k), U(k, \bar{K})) = 0$, then $V^{[0]}(x(k+1))$ is

$$V^{[0]}(x(k+1)) = J(x(k+1), U^*(k+1, K)), \tag{2.14}$$

where $U^*(k+1, K) = (u^*(k+1), u^*(k+2), \ldots, u^*(K))$. For the iterative step $i > 1$, the performance index function is updated as

$$V^{[i+1]}(x(k)) = \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1))\} = x^T(k)Qx(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x(k+1)), \tag{2.15}$$

where

$$u^{[i+1]}(x(k)) = \arg\inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^{[i]}(x(k+1))\}. \tag{2.16}$$

In the above recurrent iterative procedure, the index $i$ is the iterative step and $k$ is the time step. The optimal control and optimal performance index function can be obtained by the iterative ADP algorithm (2.11)–(2.16). In the following part, we present the convergence analysis of the iterative ADP algorithm (2.11)–(2.16).

Theorem 2.2 For an arbitrary state vector $x(k)$, let the performance index function $V^{[i+1]}(x(k))$ be obtained by the iterative ADP algorithm (2.11)–(2.16). Then $\{V^{[i+1]}(x(k))\}$ is a monotonically nonincreasing sequence for $i \geq 1$, i.e., $V^{[i+1]}(x(k)) \leq V^{[i]}(x(k))$, $\forall i \geq 1$.

Proof Mathematical induction is used to prove the theorem. First, for $i = 1$, we have $V^{[1]}(x(k))$ in (2.11), $V^{[0]}(x(k+1))$ in (2.13), and the finite-time admissible control sequence $U^*(k, k+1) = (u^{[1]}(k), U^*(k+1, k+1)) = (u^{[1]}(k), 0)$. For $i = 2$, we have

$$V^{[2]}(x(k)) = x^T(k)Qx(k) + u^{[2]T}(k)Ru^{[2]}(k) + V^{[1]}(x(k+1)). \tag{2.17}$$

From (2.11), we have

$$V^{[1]}(x(k+1)) = \inf_{u(k+1)} \{x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1) + V^{[0]}(x(k+2))\}. \tag{2.18}$$

So (2.17) can be expressed as

$$V^{[2]}(x(k)) = \inf_{u(k)} \Big\{x^T(k)Qx(k) + u^T(k)Ru(k) + \inf_{u(k+1)} \{x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1)\} + V^{[0]}(x(k+2))\Big\}. \tag{2.19}$$

So (2.19) can be written as

$$V^{[2]}(x(k)) = \inf_{U(k,k+1)} \sum_{l=k}^{k+1} \{x^T(l)Qx(l) + u^T(l)Ru(l)\}. \tag{2.20}$$

If $U(k, k+1)$ in (2.20) is defined as $U(k, k+1) = (u(k), u(k+1)) = (u^{[1]}(k), 0)$, then we have

$$\sum_{l=k}^{k+1} \{x^T(l)Qx(l) + u^T(l)Ru(l)\} = x^T(k)Qx(k) + u^{[1]T}(k)Ru^{[1]}(k) = V^{[1]}(x(k)). \tag{2.21}$$

So according to (2.20) and (2.21), we have $V^{[2]}(x(k)) \leq V^{[1]}(x(k))$, for $i = 1$. Second, we assume that for $i = j - 1$, the following expression

$$V^{[j]}(x(k)) \leq V^{[j-1]}(x(k)) \tag{2.22}$$

holds. Then according to (2.15), for $i = j$, we have

$$V^{[j+1]}(x(k)) = \inf_{u(k)} \Big\{x^T(k)Qx(k) + u^T(k)Ru(k) + \inf_{u(k+1)} \Big\{x^T(k+1)Qx(k+1) + u^T(k+1)Ru(k+1) + \cdots + \inf_{u(k+j)} \{x^T(k+j)Qx(k+j) + u^T(k+j)Ru(k+j)\} \cdots \Big\}\Big\}. \tag{2.23}$$

So we can obtain

$$V^{[j+1]}(x(k)) = \inf_{U(k)} \sum_{l=k}^{k+j} \{x^T(l)Qx(l) + u^T(l)Ru(l)\}. \tag{2.24}$$

If we let $U(k, k+j) = (u^{[j]}(k), u^{[j-1]}(k+1), \ldots, u^{[1]}(k+j-1), 0)$ in (2.24), then we get

$$\sum_{l=k}^{k+j} \{x^T(l)Qx(l) + u^T(l)Ru(l)\} = x^T(k)Qx(k) + u^{[j]T}(k)Ru^{[j]}(k) + x^T(k+1)Qx(k+1) + u^{[j-1]T}(k+1)Ru^{[j-1]}(k+1) + \cdots + x^T(k+j-1)Qx(k+j-1) + u^{[1]T}(k+j-1)Ru^{[1]}(k+j-1) + x^T(k+j)Qx(k+j). \tag{2.25}$$

As mentioned in the iterative algorithm, the constraint $F(x(k), u^{[1]}(k)) = 0$, $\forall x(k)$, is necessary for (2.11). So we get

$$x(k+j) = F(x(k+j-1), u^{[1]}(k+j-1)) = 0. \tag{2.26}$$

Thus, we have

$$\sum_{l=k}^{k+j} \{x^T(l)Qx(l) + u^T(l)Ru(l)\} = V^{[j]}(x(k)). \tag{2.27}$$

Therefore, we obtain

$$V^{[j+1]}(x(k)) \leq V^{[j]}(x(k)). \tag{2.28}$$

For the situation (2.14), the result can easily be proven by the above method. Therefore, we can conclude that $V^{[i+1]}(x(k)) \leq V^{[i]}(x(k))$, $\forall i$.

From Theorem 2.2, the iterative performance index function is nonincreasing and, being a sum of nonnegative terms, bounded below by zero, so it is convergent. We can therefore define the limit of the sequence $\{V^{[i+1]}(x(k))\}$ as $V^o(x(k))$. In the next theorem, we prove that $V^o(x(k))$ satisfies the HJB equation.

Theorem 2.3 Let $V^o(x(k)) = \lim_{i \to \infty} V^{[i+1]}(x(k))$. Then $V^o(x(k))$ satisfies

$$V^o(x(k)) = \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \tag{2.29}$$

Proof According to (2.15) and (2.16), for any admissible control vector $\eta(k)$, we have

$$V^{[i+1]}(x(k)) \leq x^T(k)Qx(k) + \eta^T(k)R\eta(k) + V^{[i]}(x(k+1)). \tag{2.30}$$

From Theorem 2.2, we can obtain

$$V^o(x(k)) \leq V^{[i+1]}(x(k)). \tag{2.31}$$

So it can be obtained that

$$V^o(x(k)) \leq x^T(k)Qx(k) + \eta^T(k)R\eta(k) + V^{[i]}(x(k+1)). \tag{2.32}$$

Letting $i \to \infty$, (2.32) can be written as

$$V^o(x(k)) \leq x^T(k)Qx(k) + \eta^T(k)R\eta(k) + V^o(x(k+1)). \tag{2.33}$$

As $\eta(k)$ is any admissible control, we can obtain

$$V^o(x(k)) \leq \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \tag{2.34}$$

On the other hand, according to the definition $V^o(x(k)) = \lim_{i \to \infty} V^{[i+1]}(x(k))$, for an arbitrary positive number $\varepsilon$ there exists a positive integer $p$ such that

$$V^{[p]}(x(k)) \geq V^o(x(k)) \geq V^{[p]}(x(k)) - \varepsilon. \tag{2.35}$$

From (2.15), we have

$$V^{[p]}(x(k)) = x^T(k)Qx(k) + u^{[p]T}(k)Ru^{[p]}(k) + V^{[p-1]}(x(k+1)). \tag{2.36}$$

So according to (2.35) and (2.36), we have

$$V^o(x(k)) \geq x^T(k)Qx(k) + u^{[p]T}(k)Ru^{[p]}(k) + V^{[p-1]}(x(k+1)) - \varepsilon. \tag{2.37}$$

As $V^{[p-1]}(x(k+1)) \geq V^o(x(k+1))$, $\forall p$, (2.37) gives

$$V^o(x(k)) \geq x^T(k)Qx(k) + u^{[p]T}(k)Ru^{[p]}(k) + V^o(x(k+1)) - \varepsilon. \tag{2.38}$$

Hence we have

$$V^o(x(k)) \geq \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1)) - \varepsilon\}. \tag{2.39}$$

Since $\varepsilon$ is arbitrary, we have

$$V^o(x(k)) \geq \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \tag{2.40}$$

Thus from (2.34) and (2.40), we have

$$V^o(x(k)) = \inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + V^o(x(k+1))\}. \tag{2.41}$$

From Theorems 2.2 and 2.3, it can be concluded that $V^o(x(k))$ is the optimal performance index function and $V^o(x(k)) = J^*(x(k))$. So we have the following corollary.

Corollary 2.1 Let the iterative algorithm be expressed as (2.11)–(2.16). Then the iterative control $u^{[i]}(k)$ converges to the optimal control $u^*(k)$ as $i \to \infty$, i.e.,

$$u^*(k) = \arg\inf_{u(k)} \{x^T(k)Qx(k) + u^T(k)Ru(k) + J^*(x(k+1))\}. \tag{2.42}$$

In this section, the iterative control algorithm is proposed for data-based unknown systems with convergence analysis. In the next section, the neural network implementation of the iterative control algorithm is presented.
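The monotone behavior stated in Theorem 2.2 and the fixed-point property of Theorem 2.3 can be observed on a case where every step has a closed form. The sketch below replaces the unknown nonlinear $F$ by a hypothetical scalar linear system $x(k+1) = ax(k) + bu(k)$ with scalar weights $q$ and $r$; this stand-in, and all numerical values, are assumptions made for illustration and are not the chapter's example. With $V^{[i]}(x) = p_i x^2$, the initialization (2.11) under the constraint $F(x(k), u^{[1]}(k)) = 0$ gives $p_1 = q + r(a/b)^2$, and the update (2.15) becomes $p_{i+1} = q + a^2 r p_i / (r + b^2 p_i)$.

```python
import math

a, b, q, r = 1.2, 1.0, 1.0, 1.0      # hypothetical linear stand-in for F and weights

# i = 1: the constraint F(x, u) = 0 forces u = -(a/b) x, so V[1](x) = p[0] * x^2.
p = [q + r * (a / b) ** 2]

# Update (2.15): V[i+1](x) = min_u { q x^2 + r u^2 + p_i (a x + b u)^2 }.
# The minimizer is u = -(a b p_i / (r + b^2 p_i)) x, which yields the recursion below.
for _ in range(60):
    p_i = p[-1]
    p.append(q + a * a * r * p_i / (r + b * b * p_i))

# Fixed point of the recursion, i.e. the scalar form of (2.29), used as a check.
c = r - b * b * q - a * a * r
p_star = (-c + math.sqrt(c * c + 4.0 * b * b * q * r)) / (2.0 * b * b)

is_nonincreasing = all(p[i + 1] <= p[i] + 1e-12 for i in range(len(p) - 1))
```

Starting from the one-step "drive the state to zero" cost, each iterate is no larger than the previous one and the sequence settles at the fixed point `p_star`. This is only a consistency check in a linear setting; it does not replace the proofs above, which cover the general nonlinear case.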

2.5 Neural Network Implementation of the Iterative Control Algorithm

The input-output data are used to identify the unknown nonlinear system until the identification error is within the required precision range. Then the data-based identifier is used for the controller design. The diagram of the whole structure is shown in Fig. 2.2. In Fig. 2.2, the SLFNs module is the identifier, the action network module is used to approximate the iterative control $u^{[i]}(k)$, and the critic network module is used to approximate the iterative performance index function. SLFNs with ELM are used in the ADP algorithm for both the action network and the critic network.

Fig. 2.2 The basic structure of the proposed control method

The detailed implementation steps are as follows.

Step 1. Train the identifier by input-output data.
Step 2. Choose an error bound $\varepsilon$, and randomly choose an initial state $x(0)$.
Step 3. Calculate the initial finite-time admissible control sequence for $x(0)$, which is $U(0, K) = (u(0), u(1), \ldots, u(K))$. The corresponding state sequence is $(x(0), x(1), \ldots, x(K+1))$, where $x(K+1) = 0$.
Step 4. For the state $x(K)$, run the iterative ADP algorithm (2.11)–(2.13) for $i = 1$, and (2.15)–(2.16) for $i > 1$. If $|V^{[i+1]}(x(K)) - V^{[i]}(x(K))| < \varepsilon$, then stop.
Step 5. For the state $x(k)$, $k = K-1, K-2, \ldots, 0$, run the iterative ADP algorithm (2.11)–(2.12) and (2.14) for $i = 1$, and (2.15)–(2.16) for $i > 1$, until $|V^{[i+1]}(x(k)) - V^{[i]}(x(k))| < \varepsilon$.
Step 6. Stop.

2.6 Simulation Study

To evaluate the performance of our iterative ADP algorithm for the data-based identifier, an example is provided in this section. Consider the following nonlinear system [16]

$$x(k+1) = x(k) + \sin(0.1x^{2}(k) + u(k)), \tag{2.43}$$

where the state $x(k) \in \mathbb{R}$ and the control $u(k) \in \mathbb{R}$. In this chapter, 5000 sampling points are used to train the SLFN identifier. The number of hidden neurons is 20. The weight vectors between the input neurons and the $j$th hidden neuron are selected in (0, 1). The threshold of each hidden neuron is also selected in (0, 1). For 50 test points, we get the identification results in Fig. 2.3. The red dashed line is the test points, the blue solid line is the output of the data-based identifier, and the purple line with sign "×" is the identification error. From Fig. 2.3, we can see that the identifier reconstructs the unknown nonlinear system accurately.

Based on the identification results, the optimal ADP controller is designed. The initial state for system (2.43) is $x(0) = 1.5$. To implement the iteration algorithm proposed in this chapter, two neural networks, which are SLFNs with ELM, are used as the action network and the critic network, respectively. To demonstrate the effectiveness of the proposed scheme, we implement the iterative algorithm by neural networks with the ELM and BP methods, respectively. The maximal iteration step is 50 for both kinds of neural networks. The convergence precision of ELM is $10^{-6}$, and the convergence precision of BP is $10^{-4}$. We get the simulation results in Figs. 2.4, 2.5, 2.6, 2.7, 2.8 and 2.9. Figures 2.4 and 2.5 are trajectories of the iterative performance index function obtained by ELM and BP, respectively. In Fig. 2.4, the iterative performance index function converges after 5 iterative steps, while in Fig. 2.5 it takes 15 iterative steps. Figures 2.6 and 2.7 are the state trajectories obtained by the ELM method and the BP method, respectively. Figures 2.8 and 2.9 are the control trajectories obtained by the ELM method and the BP method, respectively. By the ELM method, the trajectories of state and control converge after 4 time steps, while by the BP method it takes 15 time steps. From the figures we can see that the results of the ELM method are faster and smoother than the results of the BP method. It can be concluded that the learning speed of the ELM method is faster than that of the BP method while obtaining better generalization performance.

Fig. 2.3 The train results of the data-based identifier (legend: test state, train state, accuracy; horizontal axis: time steps; vertical axis: state, accuracy)

Fig. 2.4 The performance index function obtained by ELM method (horizontal axis: iteration steps; vertical axis: performance index function)
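The identifier training described in this section can be illustrated with a minimal sketch. Only the plant (2.43), the 5000 training samples, the 50 test points, the 20 hidden neurons and the (0, 1) range for hidden weights and thresholds come from the text. The sigmoid activation, the sampling range [-2, 2] for both $x(k)$ and $u(k)$, the least-squares solve for the output weights, and the names `train_elm` and `predict_elm` are assumptions made here for illustration, so this is not necessarily the authors' implementation.

```python
import numpy as np

def plant(x, u):
    """System (2.43): x(k+1) = x(k) + sin(0.1 x(k)^2 + u(k))."""
    return x + np.sin(0.1 * x**2 + u)

def hidden_layer(inputs, w_in, bias):
    # Sigmoid activation (assumed; the text does not name the activation).
    return 1.0 / (1.0 + np.exp(-(inputs @ w_in + bias)))

def train_elm(inputs, targets, n_hidden=20, seed=0):
    rng = np.random.default_rng(seed)
    # Hidden weights and thresholds are drawn once from (0, 1) and never trained.
    w_in = rng.uniform(0.0, 1.0, size=(inputs.shape[1], n_hidden))
    bias = rng.uniform(0.0, 1.0, size=n_hidden)
    # Only the output weights are fitted, by linear least squares.
    w_out, *_ = np.linalg.lstsq(hidden_layer(inputs, w_in, bias), targets, rcond=None)
    return w_in, bias, w_out

def predict_elm(model, inputs):
    w_in, bias, w_out = model
    return hidden_layer(inputs, w_in, bias) @ w_out

rng = np.random.default_rng(1)
train_xu = rng.uniform(-2.0, 2.0, size=(5000, 2))   # columns: x(k), u(k); range assumed
train_y = plant(train_xu[:, 0], train_xu[:, 1])
model = train_elm(train_xu, train_y)

test_xu = rng.uniform(-2.0, 2.0, size=(50, 2))
test_y = plant(test_xu[:, 0], test_xu[:, 1])
test_rmse = float(np.sqrt(np.mean((predict_elm(model, test_xu) - test_y) ** 2)))
print(f"test RMSE on 50 points: {test_rmse:.4f}")
```

Because the hidden layer is fixed, training reduces to a single linear solve, which is the property the comparison with BP relies on.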

Fig. 2.5 The performance index function obtained by BP method (horizontal axis: iteration steps; vertical axis: performance index function)

Fig. 2.6 The state obtained by ELM method (horizontal axis: time steps; vertical axis: state)

Fig. 2.7 The state obtained by BP method (horizontal axis: time steps; vertical axis: state)

Fig. 2.8 The control obtained by ELM method (horizontal axis: time steps; vertical axis: control)

Fig. 2.9 The control obtained by BP method (horizontal axis: time steps; vertical axis: control)

2.7 Conclusions

This chapter studied the ELM method for optimal control of unknown nonlinear systems. Using the input-output data, a data-based identifier was established. The finite-time optimal control scheme was proposed based on the iterative ADP algorithm. The theorems showed that the proposed iterative algorithm is convergent. The simulation study has demonstrated the effectiveness of the proposed control algorithm.

References

1. Duncan, T., Guo, L., Pasik-Duncan, B.: Adaptive continuous-time linear quadratic Gaussian control. IEEE Trans. Autom. Control 44(9), 1653–1662 (1999)
2. Gabasov, R., Kirillova, F., Balashevich, N.: Open-loop and closed-loop optimization of linear control systems. Asian J. Control 2(3), 155–168 (2000)
3. Jin, X., Yang, G., Peng, L.: Robust adaptive tracking control of distributed delay systems with actuator and communication failures. Asian J. Control 14(5), 1282–1298 (2012)
4. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans. Neural Netw. 22(12), 1851–1862 (2011)
5. Guardabassi, G., Savaresi, S.: Virtual reference direct design method: an off-line approach to data-based control system design. IEEE Trans. Autom. Control 45(5), 954–959 (2000)
6. Jagannathan, S.: Neural Network Control of Nonlinear Discrete-Time Systems. CRC Press, Boca Raton (2006)
7. Yu, W.: Recent Advances in Intelligent Control Systems. Springer, London (2009)
8. Fernández-Navarro, F., Hervás-Martínez, C., Gutierrez, P.: Generalised Gaussian radial basis function neural networks. Soft Comput. 17, 519–533 (2013)
9. Richert, D., Masaud, K., Macnab, C.: Discrete-time weight updates in neural-adaptive control. Soft Comput. 17, 431–444 (2013)
10. Kuntal, M., Pratihar, D., Nath, A.: Analysis and synthesis of laser forming process using neural networks and neuro-fuzzy inference system. Soft Comput. 17, 849–865 (2013)
11. Huang, G., Chen, L., Siew, C.: Universal approximation using incremental constructive feedforward networks with random hidden nodes. IEEE Trans. Neural Netw. 17(4), 879–892 (2006)
12. Hornik, K.: Approximation capabilities of multilayer feedforward networks. Neural Netw. 4, 251–257 (1991)
13. Leshno, M., Lin, V., Pinkus, A., Schocken, S.: Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Netw. 6, 861–867 (1993)
14. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 38(4), 937–942 (2008)
15. Huang, G., Siew, C.: Extreme learning machine: RBF network case. In: Proceedings of the Eighth International Conference on Control, Automation, Robotics and Vision (ICARCV 2004), Dec 6–9, Kunming, China, vol. 2, pp. 1029–1036 (2004)
16. Wang, F., Jin, N., Liu, D., Wei, Q.: Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 22, 24–36 (2011)
17. Zhang, R., Huang, G., Sundararajan, N., Saratchandran, P.: Multi-category classification using extreme learning machine for microarray gene expression cancer diagnosis. IEEE/ACM Trans. Comput. Biol. Bioinform. 4(3), 485–495 (2007)
18. Tamura, S., Tateishi, M.: Capabilities of a four-layered feedforward neural network: four layers versus three. IEEE Trans. Neural Netw. 8(2), 251–255 (1997)
19. Huang, G.: Learning capability and storage capacity of two-hidden-layer feedforward networks. IEEE Trans. Neural Netw. 14(2), 274–281 (2003)
20. Huang, G., Wang, D., Lan, Y.: Extreme learning machines: a survey. Int. J. Mach. Learn. Cybern. 2(2), 107–122 (2011)
21. Huang, G., Zhu, Q., Siew, C.: Extreme learning machine: theory and applications. Neurocomputing 70, 489–501 (2006)

Chapter 3

Nearly Finite-Horizon Optimal Control for Nonaffine Time-Delay Nonlinear Systems

In this chapter, a novel ADP algorithm is developed to solve the nearly optimal finite-horizon control problem for a class of deterministic nonaffine nonlinear time-delay systems. The idea is to use the ADP technique to obtain a nearly optimal control which makes the performance index function close to the greatest lower bound of all performance index functions within finite time. The proposed algorithm contains two cases with different initial iterations. In the first case, there exists a control policy which drives an arbitrary state of the system to zero in one time step. In the second case, there exists a control sequence which drives the system to zero in multiple time steps. State updating is used to determine the optimal state. Convergence analysis of the performance index function is given. Furthermore, the relationship between the iteration steps and the length of the control sequence is presented. Two neural networks are used to approximate the performance index function and to compute the optimal control policy, which facilitates the implementation of the ADP iteration algorithm. Finally, two examples are used to demonstrate the effectiveness of the proposed ADP iteration algorithm.

3.1 Introduction

Time-delay phenomena are often encountered in physical and biological systems, and require special attention in engineering applications [1]. Transportation systems, communication systems, chemical processing systems, metallurgical processing systems and power systems are examples of time-delay systems. Delays may result in degradation of the control efficiency or even instability of the control systems [2]. So there have been many works about systems with time delays in various research areas such as electrical engineering, chemical engineering and networked control [3]. In the past few decades, the stabilization and control of time-delay systems have always been a key focus in the control field [3, 4]. Furthermore, many researchers have studied the controllability of linear time-delay systems [5, 6]. They proposed some related theorems to judge the controllability of linear time-delay systems. In addition, the optimal control problem is often encountered in industrial production. So the investigation of optimal control for time-delay systems is significant.

In [7], D. H. Chyung pointed out the disadvantages of rewriting a discrete time-delay system as an extended system by the dimension-increasing method in order to deal with the optimal control problem. So some direct methods for linear time-delay systems were presented in [7, 8]. For nonlinear time-delay systems, however, the optimal control problem is rarely researched due to the complexity of the systems. To our knowledge, [9] solved the finite-horizon optimal control problem for a class of discrete-time nonlinear systems using an ADP algorithm. But the method in [9] cannot be used for nonlinear time-delay systems, because the delay states in time-delay systems are coupled with each other: the state at the current time $k$ is decided by the states before $k$ and the control law, while the control law is not known before it is obtained. So, based on the research results in [9], we propose a new ADP algorithm to solve the nearly finite-horizon optimal control problem for discrete time-delay systems through the framework of the Hamilton–Jacobi–Bellman (HJB) equation.

In this chapter, the optimal controller is designed directly on the original time-delay system. The state updating method is proposed to determine the optimal state of the time-delay system. In finite-horizon optimal control, the system is required to reach zero when the final running step $N$ is finite, but this is impossible in practice. So the results in this chapter are stated in the sense of an error bound. The main contributions of this chapter can be summarized as follows.

(1) The finite-horizon optimal control for deterministic discrete time-delay systems is first studied based on the ADP algorithm.
(2) The state updating is used to determine the optimal states of the HJB equation.
(3) The relationship between the iteration steps and the length of the control sequence is given.

This chapter is organized as follows. In Sect. 3.2, the problem formulation is presented. In Sect. 3.3, the nearly finite-horizon optimal control scheme is developed based on the iteration ADP algorithm and the convergence proof is given. In Sect. 3.4, two examples are given to demonstrate the effectiveness of the proposed control scheme. In Sect. 3.5, the conclusion is drawn.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019. R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_3

3.2 Problem Statement

Consider a class of deterministic time-delay nonaffine nonlinear systems

$$\begin{aligned} x(t+1) &= F(x(t), x(t-h_1), x(t-h_2), \ldots, x(t-h_l), u(t)), \\ x(t) &= \chi(t), \quad -h_l \le t \le 0, \end{aligned} \tag{3.1}$$

where $x(t) \in \mathbb{R}^n$ is the state and $x(t-h_1), x(t-h_2), \ldots, x(t-h_l) \in \mathbb{R}^n$ are the time-delay states. $u(t) \in \mathbb{R}^m$ is the system input, $\chi(t)$ is the initial state, and $h_i$, $i = 1, 2, \ldots, l$, are the time delays, which are integers satisfying $0 < h_1 < h_2 < \cdots < h_l$. $F(x(t), x(t-h_1), x(t-h_2), \ldots, x(t-h_l), u(t))$ is a known function with $F(0, 0, \ldots, 0) = 0$. For any time step $k$, the performance index function for state $x(k)$ under the control sequence $U(k, N+k-1) = (u(k), u(k+1), \ldots, u(N+k-1))$ is defined as

$$J(x(k), U(k, N+k-1)) = \sum_{j=k}^{N+k-1} \{x^{T}(j)Qx(j) + u^{T}(j)Ru(j)\}, \tag{3.2}$$

where $Q$ and $R$ are positive definite constant matrices. In this chapter, we focus on solving the nearly finite-horizon optimal control problem for system (3.1). The feedback control $u(k)$ must not only stabilize the system within a finite number of time steps but also guarantee that the performance index function (3.2) is finite. So the control sequence $U(k, N+k-1) = (u(k), u(k+1), \ldots, u(N+k-1))$ must be admissible.

Definition 3.1 $N$ time steps control sequence: For any time step $k$, we define the $N$ time steps control sequence $U(k, N+k-1) = (u(k), u(k+1), \ldots, u(N+k-1))$. The length of $U(k, N+k-1)$ is $N$.

Definition 3.2 Final state: We define the final state $x_f = x_f(x(k), U(k, N+k-1))$, i.e., $x_f = x(N+k)$.

Definition 3.3 Admissible control sequence: An $N$ time steps control sequence is said to be admissible for $x(k)$ if the final state $x_f(x(k), U(k, N+k-1)) = 0$ and $J(x(k), U(k, N+k-1))$ is finite.

Remark 3.1 Definitions 3.1 and 3.2 are used to state the admissible control sequence conveniently, i.e., Definition 3.3, which is necessary for the theorems of this chapter.

Remark 3.2 It is important to point out that the length $N$ of the control sequence cannot be designated in advance. It is calculated by the proposed algorithm. If we calculate that the length of the optimal control sequence is $L$ at time step $k$, then we consider that the optimal control sequence length at time step $k$ is $N = L$.

According to the theory of dynamic programming [10], the optimal performance index function is defined as

$$\begin{align} J^{*}(x(k)) &= \inf_{U(k,N+k-1)} J(x(k), U(k, N+k-1)) \tag{3.3} \\ &= \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + J^{*}(x(k+1))\}, \tag{3.4} \end{align}$$

and the optimal control policy is

$$u^{*}(k) = \arg\inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + J^{*}(x(k+1))\}, \tag{3.5}$$

so the state under the optimal control policy is

$$x^{*}(t+1) = F(x^{*}(t), x^{*}(t-h_1), \ldots, x^{*}(t-h_l), u^{*}(t)), \quad t = 0, 1, \ldots, k, \ldots, \tag{3.6}$$

and then the HJB equation is written as

$$J^{*}(x^{*}(k)) = J(x^{*}(k), U^{*}(k, N+k-1)) = x^{*T}(k)Qx^{*}(k) + u^{*T}(k)Ru^{*}(k) + J^{*}(x^{*}(k+1)). \tag{3.7}$$
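To make Definitions 3.1–3.3 and the performance index (3.2) concrete, the following minimal sketch rolls out a scalar system with a single delay and checks whether a given control sequence is admissible. The dynamics `F`, the weights `Q = R = 1`, the initial history and all function names are hypothetical choices for illustration; they are not taken from the chapter.

```python
import math

def rollout(F, chi, delays, controls):
    """Roll out x(t+1) = F(x(t), [x(t-h_1), ..., x(t-h_l)], u(t)) from t = 0.

    chi holds the initial history [x(-h_l), ..., x(-1), x(0)].
    Returns [x(0), x(1), ..., x(N)] for N = len(controls).
    """
    h_max = max(delays)
    assert len(chi) == h_max + 1
    traj = list(chi)                          # traj[t + h_max] stores x(t)
    for t, u in enumerate(controls):
        delayed = [traj[t + h_max - h] for h in delays]
        traj.append(F(traj[t + h_max], delayed, u))
    return traj[h_max:]

def performance_index(states, controls, Q, R):
    """J in (3.2) for scalar state and control, summed over the N control steps."""
    return sum(Q * x * x + R * u * u for x, u in zip(states, controls))

def is_admissible(states, cost, tol=1e-12):
    """Definition 3.3: the final state is zero and the cost is finite."""
    return abs(states[-1]) <= tol and math.isfinite(cost)

# Hypothetical linear example with one delay h_1 = 1.
def F(x, delayed, u):
    return 0.5 * x + 0.2 * delayed[0] + u

controls = [-0.5, -0.2]                       # N = 2
states = rollout(F, chi=[0.0, 1.0], delays=[1], controls=controls)
J = performance_index(states, controls, Q=1.0, R=1.0)
admissible = is_admissible(states, J)
print(states, J, admissible)
```

Note that the delayed state enters the rollout explicitly, so the same current state $x(k)$ can lead to different costs under different histories.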

Remark 3.3 From Remark 3.2, we can see that the length $N$ of the optimal control sequence is an unknown finite number and cannot be designated in advance. So we can say that if at time step $k$ the length of the optimal control sequence is $N$, then at time step $k+1$ the length of the optimal control sequence is $N-1$. Therefore, the HJB equation (3.7) is established.

In the following, we will give an explanation about the validity of Eq. (3.4). First, we define $U^{*}(k, N+k-1) = (u^{*}(k), u^{*}(k+1), \ldots, u^{*}(N+k-1))$, i.e.,

$$U^{*}(k, N+k-1) = \arg\inf_{U(k,N+k-1)} J(x(k), U(k, N+k-1)). \tag{3.8}$$

Then we have

$$J^{*}(x(k)) = \inf_{U(k,N+k-1)} J(x(k), U(k, N+k-1)) = J(x(k), U^{*}(k, N+k-1)). \tag{3.9}$$

Then according to (3.2), we can get

$$\begin{aligned} J^{*}(x(k)) &= \sum_{j=k}^{N+k-1} \{x^{T}(j)Qx(j) + u^{*T}(j)Ru^{*}(j)\} \\ &= x^{T}(k)Qx(k) + u^{*T}(k)Ru^{*}(k) + \cdots \\ &\quad + x^{T}(N+k-1)Qx(N+k-1) + u^{*T}(N+k-1)Ru^{*}(N+k-1). \end{aligned} \tag{3.10}$$

Equation (3.10) can be written as

$$\begin{aligned} J^{*}(x(k)) &= x^{T}(k)Qx(k) + u^{*T}(k)Ru^{*}(k) + \cdots \\ &\quad + x^{T}(N+k-2)Qx(N+k-2) + u^{*T}(N+k-2)Ru^{*}(N+k-2) \\ &\quad + \inf_{u(N+k-1)} \{x^{T}(N+k-1)Qx(N+k-1) + u^{T}(N+k-1)Ru(N+k-1)\}. \end{aligned} \tag{3.11}$$

We also obtain

$$\begin{aligned} J^{*}(x(k)) &= x^{T}(k)Qx(k) + u^{*T}(k)Ru^{*}(k) + \cdots \\ &\quad + \inf_{u(N+k-2)} \Big\{ x^{T}(N+k-2)Qx(N+k-2) + u^{T}(N+k-2)Ru(N+k-2) \\ &\quad + \inf_{u(N+k-1)} \{x^{T}(N+k-1)Qx(N+k-1) + u^{T}(N+k-1)Ru(N+k-1)\} \Big\}. \end{aligned} \tag{3.12}$$

So we have

$$\begin{aligned} J^{*}(x(k)) &= \inf_{u(k)} \Big\{ x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + \cdots \\ &\quad + \inf_{u(N+k-2)} \Big\{ x^{T}(N+k-2)Qx(N+k-2) + u^{T}(N+k-2)Ru(N+k-2) \\ &\quad + \inf_{u(N+k-1)} \{x^{T}(N+k-1)Qx(N+k-1) + u^{T}(N+k-1)Ru(N+k-1)\} \Big\} \cdots \Big\}. \end{aligned} \tag{3.13}$$

Thus according to (3.9), Eq. (3.10) is expressed as

$$\begin{aligned} J^{*}(x(k)) &= \inf_{u(k)} \Big\{ x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + \inf_{U(k+1,N+k-1)} J(x(k+1), U(k+1, N+k-1)) \Big\} \\ &= \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + J^{*}(x(k+1))\}. \end{aligned} \tag{3.14}$$

Therefore, Eqs. (3.3) and (3.4) are established. In the following part, we will give a novel iteration ADP algorithm to get the nearly optimal solution.
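The equivalence between the sequence-wise infimum (3.3) and the one-step recursion (3.4) can be checked numerically on a small example. The sketch below is only an illustration under simplifying assumptions that are not in the chapter: a hypothetical scalar system with one delay, a finite control grid, a fixed horizon of four steps, and no terminal constraint $x_f = 0$. Because of the delay, the cost-to-go is indexed by the pair $(x(k), x(k-1))$.

```python
import itertools
import math
from functools import lru_cache

Q, R = 1.0, 1.0
CONTROLS = (-1.0, -0.5, 0.0, 0.5, 1.0)       # assumed finite control grid

def F(x, x_delayed, u):
    """Hypothetical scalar dynamics with a single delay h_1 = 1."""
    return 0.8 * x + 0.3 * math.sin(x_delayed) + u

def cost_of_sequence(x, x_prev, seq):
    """J in (3.2): accumulated cost of one control sequence."""
    total = 0.0
    for u in seq:
        total += Q * x * x + R * u * u
        x, x_prev = F(x, x_prev, u), x
    return total

def best_over_sequences(x, x_prev, n):
    """Form (3.3): minimise over every control sequence of length n."""
    return min(cost_of_sequence(x, x_prev, seq)
               for seq in itertools.product(CONTROLS, repeat=n))

@lru_cache(maxsize=None)
def best_recursive(x, x_prev, n):
    """Form (3.4): minimise the one-step cost plus the optimal cost-to-go."""
    if n == 0:
        return 0.0
    return min(Q * x * x + R * u * u + best_recursive(F(x, x_prev, u), x, n - 1)
               for u in CONTROLS)

j_sequence = best_over_sequences(1.5, 0.0, 4)
j_recursive = best_recursive(1.5, 0.0, 4)
print(j_sequence, j_recursive)
```

The two values agree up to floating-point rounding, while the recursive form evaluates far fewer combinations as the horizon grows.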


3.3 The Iteration ADP Algorithm and Its Convergence

3.3.1 The Novel ADP Iteration Algorithm

In this subsection we give the novel iteration ADP algorithm in detail. For the state $x(k)$ of system (3.1), there exist two cases. Case 1: $\exists U(k, k)$ which makes $x(k+1) = 0$. Case 2: $\exists U(k, k+m)$, $m > 0$, which makes $x(k+m+1) = 0$. In the following part, we discuss the two cases, respectively.

Case 1: There exists $U(k, k) = (\beta(k))$ which makes $x(k+1) = 0$ for system (3.1). We set the optimal control sequence $U^{*}(k+1, k+1) = (0)$. The states of the system are driven by a given initial state $\chi(t)$, $-h_l \le t \le 0$, and the initial control policy $\beta(t)$. We set $V^{[0]}(x(k+1)) = J(x(k+1), U^{*}(k+1, k+1)) = 0$, $\forall x(k+1)$. Then for time step $k$, we have the following iterations:

$$\begin{aligned} u^{[1]}(k) = \arg\inf_{u(k)} \{ & x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[0]}(x(k+1)) \}, \\ \text{s.t. } & F(x(k), x(k-h_1), x(k-h_2), \ldots, x(k-h_l), u(k)) = 0, \end{aligned} \tag{3.15}$$

and

$$V^{[1]}(x^{[1]}(k)) = x^{[1]T}(k)Qx^{[1]}(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x^{[0]}(k+1)), \tag{3.16}$$

where the states in (3.16) are obtained as

$$x^{[1]}(t+1) = F(x^{[1]}(t), x^{[1]}(t-h_1), x^{[1]}(t-h_2), \ldots, x^{[1]}(t-h_l), u^{[1]}(t)), \quad t = 0, 1, \ldots, k-1, \tag{3.17}$$

and

$$x^{[0]}(t+1) = F(x^{[1]}(t), x^{[1]}(t-h_1), x^{[1]}(t-h_2), \ldots, x^{[1]}(t-h_l), u^{[1]}(t)), \quad t = k, k+1, \ldots. \tag{3.18}$$

For the iteration step $i = 1, 2, \ldots$, we have the iterations

$$u^{[i+1]}(k) = \arg\inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[i]}(x(k+1))\} \tag{3.19}$$

and

$$V^{[i+1]}(x^{[i+1]}(k)) = x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x^{[i]}(k+1)), \tag{3.20}$$

where $V^{[i]}(x(k+1))$ in (3.19) is obtained as

$$V^{[i]}(x(k+1)) = \inf_{u(k+1)} \{x^{T}(k+1)Qx(k+1) + u^{T}(k+1)Ru(k+1) + V^{[i-1]}(x(k+2))\}, \tag{3.21}$$

and the states in (3.20) are updated as

$$x^{[i+1]}(t+1) = F(x^{[i+1]}(t), x^{[i+1]}(t-h_1), x^{[i+1]}(t-h_2), \ldots, x^{[i+1]}(t-h_l), u^{[i+1]}(t)), \quad t = 0, 1, \ldots, k-1, \tag{3.22}$$

and

$$x^{[i]}(t+1) = F(x^{[i+1]}(t), x^{[i+1]}(t-h_1), x^{[i+1]}(t-h_2), \ldots, x^{[i+1]}(t-h_l), u^{[i+1]}(t)), \quad t = k, k+1, \ldots. \tag{3.23}$$

Case 2: There exists a finite-horizon admissible control sequence $U(k, k+m) = (\beta(k), \beta(k+1), \ldots, \beta(k+m))$ which makes $x_f(x(k), U(k, k+m)) = 0$. We suppose that for $x(k+1)$ there exists an optimal control sequence $U^{*}(k+1, k+j-1) = (u^{*}(k+1), u^{*}(k+2), \ldots, u^{*}(k+j-1))$. For time step $k$, the iteration ADP algorithm iterates between

$$u^{[1]}(k) = \arg\inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[0]}(x(k+1))\} \tag{3.24}$$

and

$$V^{[1]}(x^{[1]}(k)) = x^{[1]T}(k)Qx^{[1]}(k) + u^{[1]T}(k)Ru^{[1]}(k) + V^{[0]}(x^{[0]}(k+1)), \tag{3.25}$$

where, $\forall x(k+1)$, $V^{[0]}(x(k+1))$ in (3.24) is obtained as

$$V^{[0]}(x(k+1)) = J(x(k+1), U^{*}(k+1, k+j-1)) = J^{*}(x(k+1)). \tag{3.26}$$

In (3.25), $V^{[0]}(x^{[0]}(k+1))$ is obtained in the same way as in (3.26). The states in (3.25) are obtained as

$$x^{[1]}(t+1) = F(x^{[1]}(t), x^{[1]}(t-h_1), x^{[1]}(t-h_2), \ldots, x^{[1]}(t-h_l), u^{[1]}(t)), \quad t = 0, 1, \ldots, k-1, \tag{3.27}$$

and

$$x^{[0]}(t+1) = F(x^{[1]}(t), x^{[1]}(t-h_1), x^{[1]}(t-h_2), \ldots, x^{[1]}(t-h_l), u^{[1]}(t)), \quad t = k, k+1, \ldots. \tag{3.28}$$

For the iteration step $i = 1, 2, \ldots$, the iteration algorithm is implemented as follows:

$$u^{[i+1]}(k) = \arg\inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[i]}(x(k+1))\} \tag{3.29}$$

and

$$V^{[i+1]}(x^{[i+1]}(k)) = x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) + V^{[i]}(x^{[i]}(k+1)), \tag{3.30}$$

where $V^{[i]}(x(k+1))$ in (3.29) is updated as

$$V^{[i]}(x(k+1)) = \inf_{u(k+1)} \{x^{T}(k+1)Qx(k+1) + u^{T}(k+1)Ru(k+1) + V^{[i-1]}(x(k+2))\}, \tag{3.31}$$

and the states in (3.30) are obtained as

$$x^{[i+1]}(t+1) = F(x^{[i+1]}(t), x^{[i+1]}(t-h_1), x^{[i+1]}(t-h_2), \ldots, x^{[i+1]}(t-h_l), u^{[i+1]}(t)), \quad t = 0, 1, \ldots, k-1, \tag{3.32}$$

and

$$x^{[i]}(t+1) = F(x^{[i+1]}(t), x^{[i+1]}(t-h_1), x^{[i+1]}(t-h_2), \ldots, x^{[i+1]}(t-h_l), u^{[i+1]}(t)), \quad t = k, k+1, \ldots. \tag{3.33}$$

This completes the iteration algorithm. From the two cases we can see that if $V^{[0]} = 0$ in (3.25), then Case 1 is a special case of Case 2. The algorithm is summarized as follows.

Algorithm 1 ADP Algorithm
Initialization: Compute $u^{[1]}(k)$ and $V^{[1]}(x^{[1]}(k))$ by (3.15) and (3.16) in Case 1, or by (3.24) and (3.25) in Case 2;
Update: Update $u^{[i+1]}(k)$ and $V^{[i+1]}(x^{[i+1]}(k))$ by (3.19) and (3.20) in Case 1, or by (3.29) and (3.30) in Case 2.

Remark 3.4 Consider the state $x(k)$ of system (3.1), which is driven by the fixed initial states $\chi(t)$, $-h_l \le t \le 0$. If there exists a control sequence $U(k, k) = (\beta(k))$ which makes $x(k+1) = 0$ hold, then we use Case 1 of the algorithm to obtain the optimal control. Otherwise, i.e., if there does not exist $U(k, k)$ which makes $x(k+1) = 0$ hold but there is a control sequence $U(k, k+m) = (\beta(k), \beta(k+1), \ldots, \beta(k+m))$ which makes $x_f(x(k), U(k, k+m)) = 0$, then we adopt Case 2 of the algorithm. The detailed implementation process of Case 2 is as follows. For system (3.1), take an arbitrary finite-horizon admissible control sequence $U(k, k+m) = (\beta(k), \beta(k+1), \ldots, \beta(k+m))$ and the corresponding state sequence $(x(k+1), x(k+2), \ldots, x(k+m+1))$, in which $x(k+m+1) = 0$. Clearly, $U(k, k+m)$ may not be the optimal one, which means two things: (1) the length $m+1$ of the control sequence $U(k, k+m)$ may not be optimal; (2) the control law of the sequence $U(k, k+m)$ may not be optimal. So it is necessary to use the proposed algorithm to obtain the optimal one. We start the discussion from the state $x(k+m)$. Obviously, $x(k+m)$ belongs to Case 1, so the optimal control for $x(k+m)$ can be obtained by Case 1 of the proposed algorithm. Although the state $x(k+m)$ can reach zero in one step, the optimal number of control steps may be more than one; this property can be seen in Corollary 3.1. Next, we can obtain the optimal control for $x(k+m-1)$ according to Case 2 of the proposed algorithm. Continue this process until the optimal control of state $x(k)$ is obtained. From [9] we know that if the optimal control length of state $x(k+m_1+1)$ is the same as that of $x(k+m_1)$, then the two states $x(k+m_1)$ and $x(k+m_1+1)$ are said to be in the same circular region. The finite-horizon optimal controls for the two states are the same. The detailed analysis can be seen in [9].

3.3.2 Convergence Analysis of the Improved Iteration Algorithm

In the above subsection, the novel algorithm for finite-horizon time-delay nonlinear systems has been proposed in detail. In the following part, we will prove that the algorithm is convergent and that the limit of the sequence of performance index functions $V^{[i+1]}(x^{[i+1]}(k))$ satisfies the HJB equation (3.7).

Theorem 3.1 For system (3.1), the states of the system are driven by a given initial state $\chi(t)$, $-h_l \le t \le 0$, and the initial finite-horizon admissible control policy $\beta(t)$. The iteration algorithm is as in (3.15)–(3.33). For time step $k$, $\forall x(k)$ and $U(k, k+i)$, we define

$$\begin{aligned} \Lambda^{[i+1]}(x(k), U(k, k+i)) &= x^{T}(k)Qx(k) + u^{T}(k)Ru(k) \\ &\quad + x^{T}(k+1)Qx(k+1) + u^{T}(k+1)Ru(k+1) + \cdots \\ &\quad + x^{T}(k+i)Qx(k+i) + u^{T}(k+i)Ru(k+i) + V^{[0]}(x(k+i+1)), \end{aligned} \tag{3.34}$$

where $V^{[0]}(x(k+i+1))$ is as in (3.26) and $V^{[i+1]}(x(k))$ is updated as in (3.31). Then we have

$$V^{[i+1]}(x(k)) = \inf_{U(k,k+i)} \Lambda^{[i+1]}(x(k), U(k, k+i)). \tag{3.35}$$

Proof From (3.31) we have

$$\begin{aligned} V^{[i+1]}(x(k)) = \inf_{u(k)} \Big\{ & x^{T}(k)Qx(k) + u^{T}(k)Ru(k) \\ & + \inf_{u(k+1)} \Big\{ x^{T}(k+1)Qx(k+1) + u^{T}(k+1)Ru(k+1) + \cdots \\ & + \inf_{u(k+i)} \{ x^{T}(k+i)Qx(k+i) + u^{T}(k+i)Ru(k+i) + V^{[0]}(x(k+i+1)) \} \cdots \Big\} \Big\}. \end{aligned} \tag{3.36}$$

So we can further obtain

$$\begin{aligned} V^{[i+1]}(x(k)) = \inf_{U(k,k+i)} \{ & x^{T}(k)Qx(k) + u^{T}(k)Ru(k) \\ & + x^{T}(k+1)Qx(k+1) + u^{T}(k+1)Ru(k+1) + \cdots \\ & + x^{T}(k+i)Qx(k+i) + u^{T}(k+i)Ru(k+i) + V^{[0]}(x(k+i+1)) \}. \end{aligned} \tag{3.37}$$

Thus we have

$$V^{[i+1]}(x(k)) = \inf_{U(k,k+i)} \Lambda^{[i+1]}(x(k), U(k, k+i)). \tag{3.38}$$

Based on Theorem 3.1, we give the monotonicity theorem for the sequence of performance index functions $V^{[i+1]}(x^{[i+1]}(k))$, $\forall x^{[i+1]}(k)$.

Theorem 3.2 For system (3.1), let the iteration algorithm be as in (3.15)–(3.33). Then we have
$V^{[i+1]}(x^{[i]}(k)) \le V^{[i]}(x^{[i]}(k))$, $\forall i > 0$, for Case 1;
$V^{[i+1]}(x^{[i]}(k)) \le V^{[i]}(x^{[i]}(k))$, $\forall i \ge 0$, for Case 2.

Proof We first give the proof for Case 2. Define $\hat{U}(k, k+i) = (u^{[i]}(k), u^{[i-1]}(k+1), \ldots, u^{[1]}(k+i-1), u^{*}(k+i))$. Then according to the definition of $\Lambda^{[i+1]}(x(k), \hat{U}(k, k+i))$ in (3.34), we have

$$\begin{aligned} \Lambda^{[i+1]}(x(k), \hat{U}(k, k+i)) &= x^{T}(k)Qx(k) + u^{[i]T}(k)Ru^{[i]}(k) + \cdots \\ &\quad + x^{T}(k+i-1)Qx(k+i-1) + u^{[1]T}(k+i-1)Ru^{[1]}(k+i-1) \\ &\quad + x^{T}(k+i)Qx(k+i) + u^{*T}(k+i)Ru^{*}(k+i) + V^{[0]}(x(k+i+1)). \end{aligned} \tag{3.39}$$

From (3.26) and (3.4), we get

$$\begin{aligned} V^{[0]}(x(k+i)) &= J^{*}(x(k+i)) \\ &= x^{T}(k+i)Qx(k+i) + u^{*T}(k+i)Ru^{*}(k+i) + J^{*}(x(k+i+1)) \\ &= x^{T}(k+i)Qx(k+i) + u^{*T}(k+i)Ru^{*}(k+i) + V^{[0]}(x(k+i+1)). \end{aligned} \tag{3.40}$$

On the other hand, from (3.31), we have

$$\begin{aligned} V^{[i]}(x(k)) &= x^{T}(k)Qx(k) + u^{[i]T}(k)Ru^{[i]}(k) + \cdots \\ &\quad + x^{T}(k+i-1)Qx(k+i-1) + u^{[1]T}(k+i-1)Ru^{[1]}(k+i-1) + V^{[0]}(x(k+i)). \end{aligned} \tag{3.41}$$

So according to (3.40), we obtain

$$\Lambda^{[i+1]}(x(k), \hat{U}(k, k+i)) = V^{[i]}(x(k)). \tag{3.42}$$

From Theorem 3.1, we can get

$$V^{[i+1]}(x(k)) \le \Lambda^{[i+1]}(x(k), \hat{U}(k, k+i)). \tag{3.43}$$

So we have, $\forall x(k)$,

$$V^{[i+1]}(x(k)) \le V^{[i]}(x(k)), \tag{3.44}$$

i.e., for $x^{[i]}(k)$,

$$V^{[i+1]}(x^{[i]}(k)) \le V^{[i]}(x^{[i]}(k)). \tag{3.45}$$

For Case 1, we set $V^{[0]} = 0$, and the proof is similar to that of Case 2.

From Theorem 3.2, we can conclude that the sequence of performance index functions $\{V^{[i+1]}(x(k))\}$ is monotonically nonincreasing. As the performance index function is positive definite, the sequence is bounded below by zero, so it is convergent. Thus we define $V^{\infty}(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$ and $u^{\infty}(k) = \lim_{i\to\infty} u^{[i+1]}(k)$, and let $x^{\infty}(k)$ be the state under $u^{\infty}(k)$. In the following, we give a theorem to indicate that $V^{\infty}(x^{\infty}(k))$ satisfies the HJB equation.

Theorem 3.3 For system (3.1), the iteration algorithm is as in (3.15)–(3.33). Then we have

$$V^{\infty}(x^{\infty}(k)) = x^{\infty T}(k)Qx^{\infty}(k) + u^{\infty T}(k)Ru^{\infty}(k) + V^{\infty}(x^{\infty}(k+1)). \tag{3.46}$$

Proof Let $\varepsilon$ be an arbitrary positive number. Since $V^{[i+1]}(x(k))$ is nonincreasing and $V^{\infty}(x(k)) = \lim_{i\to\infty} V^{[i+1]}(x(k))$, there exists a positive integer $p$ such that

$$V^{[p]}(x(k)) - \varepsilon \le V^{\infty}(x(k)) \le V^{[p]}(x(k)). \tag{3.47}$$

So we have

$$V^{\infty}(x(k)) \ge \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[p-1]}(x(k+1))\} - \varepsilon. \tag{3.48}$$

According to Theorem 3.2, we have that

$$V^{\infty}(x(k)) \ge \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{\infty}(x(k+1))\} - \varepsilon \tag{3.49}$$

holds. Since $\varepsilon$ is arbitrary, we have

$$V^{\infty}(x(k)) \ge \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{\infty}(x(k+1))\}. \tag{3.50}$$

On the other hand, according to Theorem 3.2, we have

$$V^{\infty}(x(k)) \le V^{[i+1]}(x(k)) = \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[i]}(x(k+1))\}. \tag{3.51}$$

Letting $i \to \infty$, we have

$$V^{\infty}(x(k)) \le \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{\infty}(x(k+1))\}. \tag{3.52}$$

So from (3.50) and (3.52), we can get

$$V^{\infty}(x(k)) = \inf_{u(k)} \{x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{\infty}(x(k+1))\}, \quad \forall x(k). \tag{3.53}$$

According to (3.29), we obtain $u^{\infty}(k)$. From (3.32) and (3.33), we have the corresponding state $x^{\infty}(k)$, thus the following expression

$$V^{\infty}(x^{\infty}(k)) = x^{\infty T}(k)Qx^{\infty}(k) + u^{\infty T}(k)Ru^{\infty}(k) + V^{\infty}(x^{\infty}(k+1)) \tag{3.54}$$

holds, which completes the proof.

So we can say that $V^{\infty}(x^{\infty}(k)) = J^{*}(x^{*}(k))$. Until now, we have proven that for all $k$, the iteration algorithm converges to the optimal performance index function as the iteration index $i \to \infty$. For the finite-horizon optimal control problem of time-delay systems, another aspect is the length $N$ of the optimal control sequence. In this chapter, the specific value of $N$ is not known, but we can analyse the relationship between the iteration index $i$ and the terminal time $N$.

Theorem 3.4 Let the iteration algorithm be as in (3.24)–(3.33). If $V^{[0]}(x(k+i+1)) = J(x(k+i+1), U^{*}(k+i+1, k+i+j-1))$, $\forall x(k+i+1)$, then the state at time step $k$ of system (3.1) can reach zero in $N = i + j$ steps for Case 2.

Proof For Case 2 of the iteration algorithm, we have

$$\begin{aligned} V^{[i+1]}(x^{[i+1]}(k)) &= x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) \\ &\quad + x^{[i]T}(k+1)Qx^{[i]}(k+1) + u^{[i]T}(k+1)Ru^{[i]}(k+1) + \cdots \\ &\quad + x^{[1]T}(k+i)Qx^{[1]}(k+i) + u^{[1]T}(k+i)Ru^{[1]}(k+i) \\ &\quad + V^{[0]}(x^{[0]}(k+i+1)). \end{aligned} \tag{3.55}$$

According to [9], we can see that the optimal control sequence for $x^{[i+1]}(k)$ is $U^{*}(k, k+i) = (u^{[i+1]}(k), u^{[i]}(k+1), \ldots, u^{[1]}(k+i))$. As we have $V^{[0]}(x^{[0]}(k+i+1)) = J(x^{[0]}(k+i+1), U^{*}(k+i+1, k+i+j-1))$, we can obtain $N = i + j$.

For Case 1 of the proposed iteration algorithm, we have the following corollary.

Corollary 3.1 Let the iteration algorithm be as in (3.15)–(3.23). Then for system (3.1), the state at time step $k$ can reach zero in $N = i + 1$ steps for Case 1.

Proof For Case 1, we have

$$\begin{aligned} V^{[i+1]}(x^{[i+1]}(k)) &= x^{[i+1]T}(k)Qx^{[i+1]}(k) + u^{[i+1]T}(k)Ru^{[i+1]}(k) \\ &\quad + x^{[i]T}(k+1)Qx^{[i]}(k+1) + u^{[i]T}(k+1)Ru^{[i]}(k+1) + \cdots \\ &\quad + x^{[1]T}(k+i)Qx^{[1]}(k+i) + u^{[1]T}(k+i)Ru^{[1]}(k+i) \\ &\quad + V^{[0]}(x^{[0]}(k+i+1)) \\ &= J(x^{[i+1]}(k), U^{*}(k, k+i)), \end{aligned} \tag{3.56}$$

where $U^{*}(k, k+i) = (u^{[i+1]}(k), u^{[i]}(k+1), \ldots, u^{[1]}(k+i))$, and each element of $U^{*}(k, k+i)$ is obtained from (3.29). According to Case 1, $x^{[0]}(k+i+1) = 0$. So the state at time step $k$ can reach zero in $N = i + 1$ steps.

We can see that for time step $k$ the optimal controller is obtained when $i \to \infty$, which induces the number of time steps $N \to \infty$ according to Theorem 3.4 and Corollary 3.1. In this chapter, we want to get the nearly optimal performance index function within a finite number $N$ of time steps. The following corollary is used to prove the existence of the nearly optimal performance index function and the nearly optimal control.


Corollary 3.2 For system (3.1), let the iteration algorithm be as in (3.15)–(3.33). Then $\forall \varepsilon > 0$, $\exists I \in \mathbb{N}$ such that $\forall i > I$ we have

$$\left| V^{[i+1]}(x^{[i+1]}(k)) - J^{*}(x^{*}(k)) \right| \le \varepsilon. \tag{3.57}$$

Proof From Theorems 3.2 and 3.3, we can see that $\lim_{i \to \infty} V^{[i]}(x^{[i]}(k)) = J^{*}(x^{*}(k))$. The conclusion then follows directly from the definition of the limit.

So we can say that $V^{[i]}(x^{[i]}(k))$ is the nearly optimal performance index function in the sense of $\varepsilon$, and the corresponding nearly optimal control is defined as follows:

$$u_{\varepsilon}(k) = \arg\inf_{u(k)} \left\{ x^{T}(k)Qx(k) + u^{T}(k)Ru(k) + V^{[i]}(x(k+1)) \right\}. \tag{3.58}$$

Remark 3.5 From Theorem 3.4 and Corollary 3.1, we can see that the length of the control sequence $N$ depends on the iteration step. In addition, from Corollary 3.2, we know that the iteration step depends on $\varepsilon$. So it is concluded that the length of the control sequence $N$ depends on $\varepsilon$. Inequality (3.57) is hard to verify directly, since $J^{*}(x^{*}(k))$ is not known in advance. So in practice, we adopt the following criterion in place of (3.57):

$$\left| V^{[i+1]}(x^{[i+1]}(k)) - V^{[i]}(x^{[i]}(k)) \right| \le \varepsilon. \tag{3.59}$$
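The practical stopping rule (3.59) can be sketched in code. The example below is only an illustration under assumptions that are not part of this chapter: it uses a scalar, delay-free linear system $x(k+1) = ax(k) + bu(k)$ with utility $qx^2 + ru^2$, for which the iterative performance index has the closed form $V^{[i]}(x) = p_i x^2$, and it starts from $p_0 = 0$, so the sequence increases instead of decreasing. The function name and the parameter values are hypothetical; only the test $|p_{i+1} - p_i| \le \varepsilon$ mirrors (3.59).

```python
def iterate_until_converged(a, b, q, r, eps, max_iter=10_000):
    """Value iteration on p, where V^[i](x) = p_i * x**2 (toy, delay-free case).

    Stops when |p_{i+1} - p_i| <= eps, the analogue of criterion (3.59).
    Returns the final p and the number of iterations used.
    """
    p = 0.0  # assumed initial value V^[0] = 0
    for i in range(1, max_iter + 1):
        # Minimizing q*x^2 + r*u^2 + p*(a*x + b*u)^2 over u gives this update.
        p_next = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
        if abs(p_next - p) <= eps:
            return p_next, i
        p = p_next
    raise RuntimeError("stopping criterion not met within max_iter iterations")


p_final, iterations = iterate_until_converged(a=1.0, b=1.0, q=1.0, r=1.0, eps=1e-8)
```

In this toy case a smaller `eps` requires more iterations before the loop exits, which is the same qualitative dependence of the iteration step on $\varepsilon$ that Remark 3.5 describes.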

3.3.3 Neural Network Implementation of the Iteration ADP Algorithm

The nonlinear optimal control solution relies on solving the HJB equation, whose exact solution is generally impossible to obtain for nonlinear time-delay systems. So we employ neural networks to approximate $u^{[i]}(k)$ and $J^{[i+1]}(x(k))$ in this section. Assume that the number of hidden-layer neurons is denoted by $l$, the weight matrix between the input layer and the hidden layer is denoted by $\hat{W}$, and the weight matrix between the hidden layer and the output layer is denoted by $W$. Then the output of the three-layer neural network is represented by

$$\hat{F}(X, W, \hat{W}) = W^{T}\sigma(\hat{W}^{T}X), \tag{3.60}$$

where $\sigma(\hat{W}^{T}X) \in \mathbb{R}^{l}$ and

$$[\sigma(z)]_i = \frac{e^{z_i} - e^{-z_i}}{e^{z_i} + e^{-z_i}}, \quad i = 1, 2, \ldots, l,$$

are the activation functions. The gradient descent rule is adopted for the weight update rules of each neural network. There are two networks, the critic network and the action network. Both are chosen as three-layer back-propagation (BP) neural networks. The whole structure diagram is shown in Fig. 3.1.
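The forward pass in (3.60) can be written out directly. The following is a minimal sketch, assuming weights stored as nested Python lists; the function name `forward` and the numerical values are illustrative only. The activation in (3.60) is the hyperbolic tangent, so `math.tanh` is used.

```python
import math


def forward(X, W_hat, W):
    """Output W^T sigma(W_hat^T X) of a three-layer network with tanh hidden units.

    X     : list of n inputs
    W_hat : n x l input-to-hidden weights (list of rows)
    W     : l x m hidden-to-output weights (list of rows)
    """
    l = len(W_hat[0])
    # Hidden layer: sigma applied to each component of W_hat^T X.
    hidden = [
        math.tanh(sum(W_hat[i][j] * X[i] for i in range(len(X))))
        for j in range(l)
    ]
    m = len(W[0])
    # Output layer: linear combination W^T hidden.
    return [sum(W[j][k] * hidden[j] for j in range(l)) for k in range(m)]


y = forward([1.0, 2.0], [[0.5, 0.0], [0.0, 0.25]], [[1.0], [2.0]])
```

Here both hidden units receive the pre-activation 0.5, so the single output equals $3\tanh(0.5)$.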

Fig. 3.1 The structure diagram of the algorithm (blocks: plant, action network, critic network)

3.3.3.1 The Critic Network

The critic network is used to approximate the performance index function $V^{[i+1]}(x(k))$. The output of the critic network is denoted as

$$\hat{V}^{[i+1]}(x(k)) = w_c^{[i+1]T}\sigma(v_c^{[i+1]T}x(k)). \tag{3.61}$$

The target function can be written as

$$V^{[i+1]}(x(k)) = x^{T}(k)Qx(k) + \hat{u}^{[i+1]T}(k)R\hat{u}^{[i+1]}(k) + \hat{V}^{[i]}(x(k+1)). \tag{3.62}$$

Then we define the error function for the critic network as follows:

$$e_c^{[i+1]}(k) = \hat{V}^{[i+1]}(x(k)) - V^{[i+1]}(x(k)). \tag{3.63}$$

The objective function to be minimized in the critic network is

$$E_c^{[i+1]}(k) = \frac{1}{2}\left(e_c^{[i+1]}(k)\right)^2. \tag{3.64}$$

So the gradient-based weight update rule for the critic network is given by

$$w_c^{[i+2]}(k) = w_c^{[i+1]}(k) + \Delta w_c^{[i+1]}(k), \tag{3.65}$$

$$v_c^{[i+2]}(k) = v_c^{[i+1]}(k) + \Delta v_c^{[i+1]}(k), \tag{3.66}$$

where

$$\Delta w_c^{[i+1]}(k) = -\alpha_c \frac{\partial E_c^{[i+1]}(k)}{\partial w_c^{[i+1]}(k)}, \tag{3.67}$$

$$\Delta v_c^{[i+1]}(k) = -\alpha_c \frac{\partial E_c^{[i+1]}(k)}{\partial v_c^{[i+1]}(k)}, \tag{3.68}$$

and the learning rate $\alpha_c$ of the critic network is a positive number.
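The update rule (3.65)–(3.68) can be sketched for a single training sample. This is an illustrative sketch only: the helper names, the sample `x`, the target value and the random seed are assumptions, while the 8 hidden neurons, the initial weight range (−0.1, 0.1), the learning rate 0.05 and the 200 weight updates follow the settings reported for Example 3.1. The gradients are obtained by applying the chain rule to $E_c = \frac{1}{2}e_c^2$ with a tanh hidden layer.

```python
import math
import random


def critic_output(wc, vc, x):
    """Return (V_hat, hidden) for V_hat = wc^T tanh(vc^T x)."""
    hidden = [
        math.tanh(sum(vc[i][j] * x[i] for i in range(len(x))))
        for j in range(len(wc))
    ]
    return sum(w * h for w, h in zip(wc, hidden)), hidden


def critic_update(wc, vc, x, target, alpha):
    """One gradient-descent step on E = 0.5 * (V_hat - target)**2.

    Updates wc and vc in place and returns E evaluated before the step.
    """
    v_hat, hidden = critic_output(wc, vc, x)
    e = v_hat - target
    for j in range(len(wc)):
        grad_w = e * hidden[j]  # dE/dwc[j]
        # dE/dvc[i][j] = e * wc[j] * (1 - tanh^2) * x[i], using wc[j] before its update.
        grad_pre = e * wc[j] * (1.0 - hidden[j] ** 2)
        for i in range(len(x)):
            vc[i][j] -= alpha * grad_pre * x[i]
        wc[j] -= alpha * grad_w
    return 0.5 * e * e


random.seed(0)  # assumed seed, for reproducibility only
n_in, n_hidden = 2, 8
wc = [random.uniform(-0.1, 0.1) for _ in range(n_hidden)]
vc = [[random.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_in)]
x, target = [0.5, 0.3], 1.1  # hypothetical input sample and target value
losses = [critic_update(wc, vc, x, target, alpha=0.05) for _ in range(200)]
```

The list `losses` records the objective (3.64) before each update, so it can be inspected to confirm that the error shrinks during training.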

3.3.3.2 The Action Network

In the action network, the states $x(k), x(k-h_1), \ldots, x(k-h_l)$ are used as inputs and the control $\hat{u}^{[i]}(k)$ is the output of the network. The output can be formulated as

$$\hat{u}^{[i]}(k) = w_a^{[i]T}\sigma(v_a^{[i]T}Y(k)), \tag{3.69}$$

where $Y(k) = [x^{T}(k), x^{T}(k-h_1), \ldots, x^{T}(k-h_l)]^{T}$. We define the output error of the action network as follows:

$$e_a^{[i]}(k) = \hat{u}^{[i]}(k) - u^{[i]}(k). \tag{3.70}$$

The weights in the action network are updated to minimize the following performance error measure:

$$E_a^{[i]}(k) = \frac{1}{2}e_a^{[i]T}(k)e_a^{[i]}(k). \tag{3.71}$$

The weight updating algorithm is similar to the one for the critic network. By the gradient descent rule, we obtain

$$w_a^{[i+1]}(k) = w_a^{[i]}(k) + \Delta w_a^{[i]}(k), \tag{3.72}$$

$$v_a^{[i+1]}(k) = v_a^{[i]}(k) + \Delta v_a^{[i]}(k), \tag{3.73}$$

where

$$\Delta w_a^{[i]}(k) = -\alpha_a \frac{\partial E_a^{[i]}(k)}{\partial w_a^{[i]}(k)}, \tag{3.74}$$

$$\Delta v_a^{[i]}(k) = -\alpha_a \frac{\partial E_a^{[i]}(k)}{\partial v_a^{[i]}(k)}, \tag{3.75}$$

and the learning rate $\alpha_a$ of the action network is a positive number. In the next section, we give a simulation study to explain the proposed iteration algorithm in detail.

3.4 Simulation Study

Example 3.1 We take the example in [9] with modification:

$$x(t+1) = x(t-2) + \sin(0.1x^{2}(t) + u(t)). \tag{3.76}$$

Fig. 3.2 The performance trajectory for x(3) = 0.5 (iteration performance index versus iteration steps)

We give the initial states as $\chi_1(-2) = \chi_1(-1) = \chi_1(0) = 1.5$, and the initial control policy as $\beta(t) = \sin^{-1}(x(t+1) - x(t-2)) - 0.1x^{2}(t)$. We implement the proposed algorithm at the time instant $k = 3$. First, according to the initial control policy of system (3.76), we give the first group of state data: $x(1) = 0.8$, $x(2) = 0.7$, $x(3) = 0.5$, $x(4) = 0$. We can also get the second group of state data: $x(1) = 1.4$, $x(2) = 1.2$, $x(3) = 1.1$, $x(4) = 0.8$, $x(5) = 0.7$, $x(6) = 0.5$, $x(7) = 0$. Obviously, for the first sequence of states we can get the optimal controller by Case 1 of the proposed algorithm. For the second one, the optimal controller can be obtained by Case 2 of the proposed algorithm, and the optimal control sequence $U^{o}(k+1, k+j+1)$ can be obtained from the first group of state data. We select $Q = R = 1$. Three-layer BP neural networks are used as the critic network and the action network, with the structures 2–8–1 and 6–8–1, respectively. The number of weight-update iterations for each of the two neural networks is 200. The initial weights are chosen randomly from $(-0.1, 0.1)$, and the learning rates are $\alpha_a = \alpha_c = 0.05$. The performance index trajectories for the first and second state data groups are shown in Figs. 3.2 and 3.3, respectively. According to Theorem 3.2, for the first state group the performance index is decreasing for $i > 0$. For the second state group, the performance index is decreasing for $i \ge 0$. The state trajectory and the control trajectory of the second state data group are shown in Figs. 3.4 and 3.5. From the figures, we can see that the system is asymptotically stable. The simulation study shows that the new iteration ADP algorithm is feasible.
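The role of the initial control policy can be checked numerically. The sketch below, with hypothetical helper names, inverts system (3.76) with $\beta(t)$ to compute the control that reaches each prescribed next state, then feeds that control back into (3.76). Only the first group of state data and the initial states 1.5 are taken from the text.

```python
import math


def step(x_t, x_t_minus_2, u_t):
    """System (3.76): x(t+1) = x(t-2) + sin(0.1 * x(t)^2 + u(t))."""
    return x_t_minus_2 + math.sin(0.1 * x_t ** 2 + u_t)


def control_for_target(x_next, x_t, x_t_minus_2):
    """Initial policy beta(t) = asin(x(t+1) - x(t-2)) - 0.1 * x(t)^2."""
    return math.asin(x_next - x_t_minus_2) - 0.1 * x_t ** 2


# States indexed by time; the history is x(-2) = x(-1) = x(0) = 1.5.
x = {-2: 1.5, -1: 1.5, 0: 1.5}
targets = {1: 0.8, 2: 0.7, 3: 0.5, 4: 0.0}  # first group of state data
controls = {}
for t in range(0, 4):
    controls[t] = control_for_target(targets[t + 1], x[t], x[t - 2])
    x[t + 1] = step(x[t], x[t - 2], controls[t])
```

After the loop, `x[1]` to `x[4]` reproduce the first group of state data up to rounding, and `controls` holds the corresponding initial control sequence.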

Fig. 3.3 The performance trajectory for x(3) = 1.1 (performance index function versus iteration steps)

Fig. 3.4 The state trajectory using the second state data group (state versus time steps)

Example 3.2 For demonstrating the effectiveness of the proposed iteration algorithm in this chapter, we give a more substantial application. Consider the ball and beam experiment. A ball is placed on a beam as shown in Fig. 3.6.

Fig. 3.5 The control trajectory using the second state data group (control versus time steps)

Fig. 3.6 Ball and beam experiment

The beam angle $\alpha$ can be expressed in terms of the servo gear angle $\theta$ as $\alpha \approx \frac{2d}{L}\theta$. The equation of motion for the ball is given as

$$\left(\frac{M}{R^{2}} + m\right)\ddot{r} + mg\sin\alpha - mr(\dot{\alpha})^{2} = 0, \tag{3.77}$$

where $r$ is the ball position coordinate. The mass of the ball is $m = 0.1$ kg, the radius of the ball is $R = 0.015$ m, the radius of the lever gear is $d = 0.03$ m, the length of the beam is $L = 1.0$ m and the ball's moment of inertia is $M = 10^{-5}$ kg·m². Given the time step $h$, let $r(t) = r(th)$, $\alpha(t) = \alpha(th)$ and $\theta(t) = \theta(th)$; then Eq. (3.77) is discretized as

$$\begin{cases} x(t+1) = x(t) + y(t) - A\sin\left(\dfrac{2d}{L}\theta(t)\right) + Bx(t)(\theta(t) - z(t))^{2}, \\ y(t+1) = y(t) - A\sin\left(\dfrac{2d}{L}\theta(t)\right) + Bx(t)(\theta(t) - z(t))^{2}, \\ z(t+1) = \theta(t), \end{cases} \tag{3.78}$$

where $A = \dfrac{mgh^{2}R^{2}}{M + mR^{2}}$ and $B = \dfrac{4d^{2}mR^{2}}{L^{2}(M + mR^{2})}$. The state is $X(t) = (x(t), y(t), z(t))^{T}$, in which $x(t) = r(t)$, $y(t) = r(t) - r(t-1)$ and $z(t) = \theta(t-1)$. The control input is $u(t) = \theta(t)$. For the convenience of analysis, system (3.78) is rewritten as follows:

$$\begin{cases} x(t+1) = x(t-2) + y(t) - A\sin\left(\dfrac{2d}{L}\theta(t)\right) + Bx(t)(\theta(t) - z(t))^{2}, \\ y(t+1) = y(t) - A\sin\left(\dfrac{2d}{L}\theta(t)\right) + Bx(t)(\theta(t) - z(t-2))^{2}, \\ z(t+1) = \theta(t). \end{cases} \tag{3.79}$$

In this chapter, $h$ is selected as 0.1, and the states of time-delay system (3.79) are $X(1) = [1.0027, 0.0098, 1]^{T}$, $X(2) = [0.0000, 0.0057, 1.0012]^{T}$, $X(3) = [1.0000, 0.0016, 1.0000]^{T}$, $X(4) = [1.0002, -0.0025, 0.9994]^{T}$ and $X(5) = [0, 0, 0]^{T}$. The initial states are $\chi(-2) = [0.9929, 0.0221, 1.0000]^{T}$, $\chi(-1) = [-0.0057, 0.0180, 1.0000]^{T}$ and $\chi(0) = [0.9984, 0.0139, 1.0000]^{T}$. The initial control sequence is $(1.0000, 1.0012, 1.0000, 0.9994, 0.0000)$. Obviously, the initial control sequence and states are not the optimal ones, so the algorithm proposed in this chapter is adopted to obtain the optimal solution. We select $Q = R = 1$. The number of weight-update iterations for each of the two neural networks is 200. The initial weights of the critic network are chosen randomly from $(-0.1, 0.1)$, the initial weights of the action network are chosen randomly from $[-2, 2]$, and the learning rates are $\alpha_a = \alpha_c = 0.001$. We consider the states $X(4) = [1.0002, -0.0025, 0.9994]^{T}$ and $X(1) = [1.0027, 0.0098, 1]^{T}$. Obviously, for the state $X(4)$ we can get the optimal controller by Case 1 of the proposed algorithm. For the state $X(1)$, the optimal controller can be obtained by Case 2 of the proposed algorithm. Then we obtain the performance index function trajectories of the two states as shown in Figs. 3.7 and 3.8, which satisfy Theorem 3.2, i.e., for the state $X(4)$ the performance index is decreasing for $i > 0$, and for the state $X(1)$ the performance index is decreasing for $i \ge 0$. The state trajectories and the control trajectory of state $X(1)$ are shown in Figs. 3.9, 3.10, 3.11 and 3.12. From the figures, we can see that the states of the system are asymptotically stable. Based on the above analysis, we can conclude that the proposed iteration ADP algorithm is satisfactory.
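The discretized ball and beam model (3.78), before the delays of (3.79) are introduced, can be sketched as follows. The coefficients follow from replacing $\ddot{r}$ by $(r(t+1) - 2r(t) + r(t-1))/h^{2}$ and $\dot{\alpha}$ by $\frac{2d}{L}(\theta(t) - \theta(t-1))/h$ in (3.77). The function names are hypothetical, and the gravitational acceleration $g = 9.81$ m/s² is an assumption, since its value is not stated in the text.

```python
import math


def ball_beam_coefficients(m, R, d, L, M, h, g=9.81):
    """Coefficients A and B of the discretized model (3.78).

    g is the gravitational acceleration; 9.81 m/s^2 is an assumed value.
    """
    denom = M + m * R ** 2
    A = m * g * h ** 2 * R ** 2 / denom
    B = 4 * d ** 2 * m * R ** 2 / (L ** 2 * denom)
    return A, B


def ball_beam_step(state, theta, A, B, d, L):
    """One step of (3.78) for state (x, y, z) and servo gear angle theta."""
    x, y, z = state
    increment = -A * math.sin(2 * d / L * theta) + B * x * (theta - z) ** 2
    return (x + y + increment, y + increment, theta)


A, B = ball_beam_coefficients(m=0.1, R=0.015, d=0.03, L=1.0, M=1e-5, h=0.1)
next_state = ball_beam_step((1.0, 0.01, 0.0), theta=0.0, A=A, B=B, d=0.03, L=1.0)
```

With a zero servo angle and $z = 0$, both nonlinear terms vanish, so the ball position simply advances by the previous increment $y$.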

Fig. 3.7 The performance trajectory for X(4) (performance index function versus iteration steps)

Fig. 3.8 The performance trajectory for X(1) (performance index function versus iteration steps)

Fig. 3.9 The state trajectory of x(t) (state versus time steps)

Fig. 3.10 The state trajectory of y(t) (y versus time steps)

Fig. 3.11 The state trajectory of z(t) (z versus time steps)

Fig. 3.12 The control trajectory (control versus time steps)

3.5 Conclusion

This chapter proposed a novel ADP algorithm to deal with the nearly finite-horizon optimal control for a class of deterministic nonaffine time-delay nonlinear systems. To determine the optimal state, state updating was included in the novel ADP algorithm. The theorems showed that the proposed iteration algorithm is convergent. Moreover, the relationship between the iteration steps and the time steps was given. The simulation study has demonstrated the effectiveness of the proposed control algorithm.

References

1. Niculescu, S.: Delay Effects on Stability: A Robust Control Approach. Springer, Berlin (2001)
2. Gu, K., Kharitonov, V., Chen, J.: Stability of Time-Delay Systems. Birkhäuser, Boston (2003)
3. Song, R., Zhang, H., Luo, Y., Wei, Q.: Optimal control laws for time-delay systems with saturating actuators based on heuristic dynamic programming. Neurocomputing 73(16–18), 3020–3027 (2010)
4. Huang, J., Lewis, F.: Neural-network predictive control for nonlinear dynamic systems with time-delay. IEEE Trans. Neural Netw. 14(2), 377–389 (2003)
5. Chyung, D.: On the controllability of linear systems with delay in control. IEEE Trans. Autom. Control 15(2), 255–257 (1970)
6. Phat, V.: Controllability of discrete-time systems with multiple delays on controls and states. Int. J. Control 49(5), 1645–1654 (1989)
7. Chyung, D.: Discrete optimal systems with time delay. IEEE Trans. Autom. Control 13(1), 117 (1968)
8. Chyung, D., Lee, E.: Linear optimal systems with time delays. SIAM J. Control 4(3), 548–575 (1966)
9. Wang, F., Jin, N., Liu, D., Wei, Q.: Adaptive dynamic programming for finite-horizon optimal control of discrete-time nonlinear systems with ε-error bound. IEEE Trans. Neural Netw. 22, 24–36 (2011)
10. Manu, M., Mohammad, J.: Time-Delay Systems Analysis, Optimization and Applications. North-Holland, New York (1987)

Chapter 4

Multi-objective Optimal Control for Time-Delay Systems

A novel multi-objective ADP method is constructed to obtain the optimal controller of a class of nonlinear time-delay systems in this chapter. Using the weighted sum technology, the original multi-objective optimal control problem is transformed to the single one. An ADP method is established for nonlinear time-delay systems to solve the optimal control problem. To demonstrate the presented iterative performance index function sequence is convergent and the closed-loop system is asymptotically stable, the convergence analysis is also given. The neural networks are used to get the approximative control policy and the approximative performance index function, respectively. Two simulation examples are presented to illustrate the performance of the presented optimal control method.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_4

4.1 Introduction

For a class of unknown discrete-time nonlinear systems, the multi-objective optimal control problem was discussed in [1]. In [2], an optimal control scheme with a discount factor was developed for unknown nonaffine nonlinear discrete-time systems. However, as far as we know, how to obtain the multi-objective optimal control solution of nonlinear time-delay systems based on the ADP algorithm is still an open problem. This chapter addresses this problem. First, a single-objective optimal control problem is obtained by the weighted sum technology. Then, the iterative ADP optimal control method for time-delay systems is established, and the convergence analysis is presented. The neural network implementation is also given. Finally, two simulation examples illustrate the control effect of the proposed multi-objective optimal control method.

The rest of the chapter is organized as follows. Section 4.2 presents the problem formulation. Section 4.3 develops the multi-objective optimal control scheme and gives the corresponding convergence proof. Section 4.4 presents the implementation by neural networks. Section 4.5 gives examples to demonstrate the validity of the proposed control scheme. Section 4.6 draws the conclusion.

4.2 Problem Formulation

We consider the unknown nonlinear time-delay systems

$$x(t+1) = f(x(t), x(t-h)) + g(x(t), x(t-h))u(t), \tag{4.1}$$

where $u(t) \in \mathbb{R}^{m}$ is the control and $x(t), x(t-h) \in \mathbb{R}^{n}$ are the states, with $x(t) = 0$ for $t \le 0$. The function $F(x(t), x(t-h), u(t)) = f(x(t), x(t-h)) + g(x(t), x(t-h))u(t)$ is an unknown continuous function. For system (4.1), we consider the following multi-objective optimal control problem:

$$\inf_{u(t)} P_i(x(t), u(t)), \quad i = 1, 2, \ldots, h, \tag{4.2}$$

where $P_i(x(t), u(t))$ is the performance index function, defined as

$$P_i(x(t), u(t)) = \sum_{j=t}^{\infty} \left\{ L_i(x(j)) + u^{T}(j)R_iu(j) \right\}, \tag{4.3}$$

in which the utility function $I_i(j) = L_i(x(j)) + u^{T}(j)R_iu(j)$ is positive definite and $R_i$ is a positive-definite matrix. In this chapter, the weighted sum technology is used to combine the different objectives, which gives the following single-objective performance index function:

$$P(x(t)) = \sum_{i=1}^{h} w_iP_i(x(t)), \tag{4.4}$$

where $W = [w_1, w_2, \ldots, w_h]^{T}$, $w_i \ge 0$ and $\sum_{i=1}^{h} w_i = 1$. So we have the following expression:

$$\begin{aligned} P(x(t)) &= \sum_{i=1}^{h} w_iI_i(t) + \sum_{i=1}^{h} w_iP_i(x(t+1)) \\ &= W^{T}I(t) + P(x(t+1)), \end{aligned} \tag{4.5}$$
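The recursion (4.5) can be checked on a finite trajectory. The sketch below is an illustration with hypothetical function names and made-up utility values: it truncates the infinite sum in (4.3) to a finite horizon, computes each $P_i$ by the backward recursion, and compares the weighted sum (4.4) with the cost-to-go of the combined utility $W^{T}I(t)$.

```python
def costs_to_go(utilities):
    """Backward recursion P(t) = I(t) + P(t+1), with P = 0 after the last step."""
    P = [0.0] * (len(utilities) + 1)
    for t in range(len(utilities) - 1, -1, -1):
        P[t] = utilities[t] + P[t + 1]
    return P[:-1]


def weighted_costs_to_go(weights, per_objective_utilities):
    """Cost-to-go of the combined utility W^T I(t); per_objective_utilities[i][t] = I_i(t)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-12:
        raise ValueError("weights must be nonnegative and sum to 1")
    horizon = len(per_objective_utilities[0])
    combined = [
        sum(w * I[t] for w, I in zip(weights, per_objective_utilities))
        for t in range(horizon)
    ]
    return costs_to_go(combined)


W = [0.1, 0.3, 0.6]
I = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5], [2.0, 0.0, 1.0]]  # made-up utilities I_i(t)
P_combined = weighted_costs_to_go(W, I)
P_weighted = [
    sum(w * P_i[t] for w, P_i in zip(W, map(costs_to_go, I))) for t in range(3)
]
```

Both lists agree up to rounding, which is the linearity that allows the multi-objective problem to be handled as a single-objective one.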


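where $I(t) = [I_1(t), I_2(t), \ldots, I_h(t)]^{T}$. Therefore, we can define the optimal performance index function as follows:

$$P^{*}(x(t)) = \inf_{u(t)} \{P(x(t), u(t))\}, \tag{4.6}$$

and the optimal control policy is defined as

$$u^{*}(t) = \arg\inf_{u(t)} \{P(x(t), u(t))\}. \tag{4.7}$$

In fact, according to Bellman's principle of optimality, the optimal control policy $u^{*}(t)$ should satisfy the first-order necessary condition, i.e.,

$$u^{*}(t) = -\frac{1}{2}\left(\sum_{i=1}^{h} w_iR_i\right)^{-1}\left(\frac{\partial F}{\partial u}\right)^{T}\sum_{i=1}^{h} w_i\frac{\partial P_i}{\partial F}. \tag{4.8}$$

Thus we obtain the optimal performance index function as follows:

$$\begin{aligned} P^{*}(x(t)) ={}& \sum_{i=1}^{h} w_iL_i(x(t)) + \sum_{i=1}^{h} w_iP_i^{*}(x(t+1)) \\ &+ \frac{1}{4}\left(\left(\sum_{i=1}^{h} w_iR_i\right)^{-1}\left(\frac{\partial F}{\partial u}\right)^{T}\sum_{i=1}^{h} w_i\frac{\partial P_i}{\partial F}\right)^{T} \\ &\times \sum_{i=1}^{h} w_iR_i\left(\left(\sum_{i=1}^{h} w_iR_i\right)^{-1}\left(\frac{\partial F}{\partial u}\right)^{T}\sum_{i=1}^{h} w_i\frac{\partial P_i}{\partial F}\right). \end{aligned} \tag{4.9}$$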

4.3 Derivation of the ADP Algorithm for Time-Delay Systems

The aim of this chapter is to obtain the solution of the optimal control problem (4.14), so we propose the following ADP algorithm. Let us start from $i = 0$ and define the initial iterative performance index function $P^{[0]}(\cdot) = 0$. Then we can calculate the initial iterative control vector $u^{[0]}(t)$:

$$u^{[0]}(t) = \arg\inf_{u(t)} \left\{ W^{T}I(x(t), u(t)) \right\}. \tag{4.10}$$

So we can get the next performance index function as

$$P^{[1]}(x(t)) = \inf_{u(t)} \left\{ W^{T}I(x(t), u(t)) \right\}, \tag{4.11}$$

and accordingly we have the following state:

$$x(t+1) = F(x(t), x(t-h), u^{[0]}(t)). \tag{4.12}$$

Subsequently, the iterative control is updated by

$$u^{[i]}(t) = \arg\inf_{u(t)} \left\{ W^{T}I(x(t), u(t)) + P^{[i]}(x(t+1)) \right\}, \tag{4.13}$$

and the iterative performance index function is updated as

$$P^{[i+1]}(x(t)) = \inf_{u(t)} \left\{ W^{T}I(x(t), u(t)) + P^{[i]}(x(t+1)) \right\}. \tag{4.14}$$

So we have the updated state

$$x(t+1) = F(x(t), x(t-h), u^{[i]}(t)). \tag{4.15}$$
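The iteration (4.10)–(4.15) can be illustrated numerically. The sketch below is a toy under assumptions that are not taken from this chapter: a scalar delayed system $x(t+1) = 0.5x(t) + 0.2x(t-1) + u(t)$, a single utility $x^{2}(t) + u^{2}(t)$ standing in for $W^{T}I$, and state and control grids on $[-1, 1]$ with nearest-neighbour projection in place of neural networks. All names are hypothetical. It starts from $P^{[0]} = 0$ and stores every iterate so that the growth of the sequence can be inspected.

```python
def nearest_index(value, n):
    """Index of the grid point in [-1, 1] (n points) closest to value, clamped to the grid."""
    idx = int(round((value + 1.0) * (n - 1) / 2.0))
    return max(0, min(n - 1, idx))


def value_iteration(num_iters=10):
    """Grid version of (4.13)-(4.14) for the assumed toy system with one delay."""
    xs = [(i - 10) / 10 for i in range(21)]  # state grid, step 0.1
    us = [(j - 20) / 20 for j in range(41)]  # control grid, step 0.05
    n = len(xs)
    # P[a][b] is the value at x(t) = xs[a], x(t-1) = xs[b]; P^[0] = 0.
    P = [[0.0] * n for _ in range(n)]
    history = [P]
    for _ in range(num_iters):
        P_next = [[0.0] * n for _ in range(n)]
        for a, x in enumerate(xs):
            for b, x_prev in enumerate(xs):
                best = float("inf")
                for u in us:
                    x_new = 0.5 * x + 0.2 * x_prev + u
                    a_new = nearest_index(x_new, n)
                    # The next augmented state is (x(t+1), x(t)) = (xs[a_new], xs[a]).
                    cost = x * x + u * u + P[a_new][a]
                    if cost < best:
                        best = cost
                P_next[a][b] = best
        P = P_next
        history.append(P)
    return xs, history


xs, history = value_iteration()
```

In this toy setting the stored iterates are nondecreasing in the iteration index and remain bounded, which is the behaviour that Lemma 4.2 and Theorem 4.1 establish for the iteration (4.14).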

In the next part, we prove that the proposed algorithm is convergent and that the closed-loop system is asymptotically stable.

Lemma 4.1 Given a control sequence $\{\mu^{[i]}(t)\}$, let $u^{[i]}(t)$ be obtained from (4.13). The iterative performance index function $P^{[i]}$ is obtained from (4.14) and $\Upsilon^{[i]}$ is defined as

$$\Upsilon^{[i+1]}(x(t)) = W^{T}I(x(t), \mu^{[i]}(t)) + \Upsilon^{[i]}(x(t+1)), \tag{4.16}$$

where $x(t+1)$ is obtained from $\mu^{[i]}(t)$. If the initial values of $P^{[i]}$ and $\Upsilon^{[i]}$ are the same for $i = 0$, then $P^{[i+1]}(x(t)) \le \Upsilon^{[i+1]}(x(t))$, $\forall i$.

Lemma 4.2 If the iterative performance index function $P^{[i]}(x(t))$ is obtained from (4.14), then $P^{[i]}(x(t))$ is bounded, i.e., $0 \le P^{[i+1]}(x(t)) \le B$, $\forall i$, where $B$ is positive.

Theorem 4.1 If the iterative performance index function $P^{[i]}(x(t))$ is obtained from (4.14), then we have $P^{[i+1]}(x(t)) \ge P^{[i]}(x(t))$, $\forall i$.

Proof Let $\Theta^{[i]}(x(t))$ be defined as follows:

$$\Theta^{[i]}(x(t)) = W^{T}I(x(t), u^{[i]}(t)) + \Theta^{[i-1]}(x(t+1)), \tag{4.17}$$

with $\Theta^{[0]}(\cdot) = P^{[0]}(\cdot) = 0$. Here, mathematical induction is used to show that $\Theta^{[i]}(x(t)) \le P^{[i+1]}(x(t))$. It is clear that for $i = 0$ we have

$$P^{[1]}(x(t)) - \Theta^{[0]}(x(t)) = W^{T}I(x(t), u^{[0]}(t)) \ge 0. \tag{4.18}$$

So for $i = 0$,

$$P^{[1]}(x(t)) \ge \Theta^{[0]}(x(t)). \tag{4.19}$$

Suppose that for $i - 1$, $P^{[i]}(x(t)) \ge \Theta^{[i-1]}(x(t))$, $\forall x(t)$. Then we have

$$P^{[i+1]}(x(t)) = W^{T}I(x(t), u^{[i]}(t)) + P^{[i]}(x(t+1)), \tag{4.20}$$

and

$$\Theta^{[i]}(x(t)) = W^{T}I(x(t), u^{[i]}(t)) + \Theta^{[i-1]}(x(t+1)). \tag{4.21}$$

Thus we can conclude that

$$P^{[i+1]}(x(t)) - \Theta^{[i]}(x(t)) = P^{[i]}(x(t+1)) - \Theta^{[i-1]}(x(t+1)) \ge 0. \tag{4.22}$$

So, we have

$$P^{[i+1]}(x(t)) \ge \Theta^{[i]}(x(t)), \quad \forall i. \tag{4.23}$$

Furthermore, it is known from Lemma 4.1 that $P^{[i]}(x(t)) \le \Theta^{[i]}(x(t))$. So we can conclude that

$$P^{[i]}(x(t)) \le \Theta^{[i]}(x(t)) \le P^{[i+1]}(x(t)), \tag{4.24}$$

which, together with the bound in Lemma 4.2, means that $\{P^{[i]}(x(t))\}$ is convergent as $i \to \infty$. Hence, as $i \to \infty$, $P^{[i]}(x(t)) \to P^{*}(x(t))$ and accordingly $u^{[i]}(t) \to u^{*}(t)$.

Theorem 4.2 Let the iterative performance index function $P^{[i]}(x(t))$ be obtained from (4.14). If $u^{*}(t)$ is expressed as (4.13) and $P^{*}(x(t))$ is expressed as (4.14), then $u^{*}(t)$ stabilizes system (4.1) asymptotically.

Proof It has been proven that $P^{[i+1]}(x(t)) \ge 0$, so $P^{*}(x(t)) \ge 0$. From (4.5), we have

$$P^{*}(x(t+1)) - P^{*}(x(t)) = -W^{T}I(x(t), u^{*}(t)) \le 0. \tag{4.25}$$

So, the closed-loop system is asymptotically stable according to the Lyapunov theorem.

Remark 4.1 The proposed method is a further development of [3]. In [3], although the multi-objective control problem is presented, the time-delay state of the nonlinear system is not considered. In [1] the multi-objective optimal control problem is not discussed. In this chapter, the multi-objective optimal control problem of time-delay systems is solved. So this chapter extends the traditional ADP literature.


4.4 Neural Network Implementation for the Multi-objective Optimal Control Problem of Time-Delay Systems

There are two neural networks in the ADP algorithm, i.e., the critic network and the action network. The approximate performance index function $P^{[i+1]}(x(t))$ is obtained from the critic network, which is denoted as follows:

$$\hat{P}^{[i+1]}(x(t)) = v_c^{[i+1]T}\sigma(w_c^{[i+1]T}x(t)). \tag{4.26}$$

The approximation error is defined as

$$e_c^{[i+1]}(t) = \hat{P}^{[i+1]}(x(t)) - P^{[i+1]}(x(t)), \tag{4.27}$$

where we define $P^{[i+1]}(x(t))$ as

$$P^{[i+1]}(x(t)) = W^{T}I(x(t), \hat{u}^{[i]}(t)) + \hat{P}^{[i]}(x(t+1)). \tag{4.28}$$

To obtain the update rule, we define

$$Y_c^{[i+1]}(t) = \frac{1}{2}e_c^{[i+1]T}(t)e_c^{[i+1]}(t). \tag{4.29}$$

Thus the critic network weights are updated by

$$v_c^{[i+2]}(t) = v_c^{[i+1]}(t) + \Delta v_c^{[i+1]}(t), \tag{4.30}$$

where

$$\Delta v_c^{[i+1]}(t) = -\beta_c \frac{\partial Y_c^{[i+1]}(t)}{\partial v_c^{[i+1]}(t)}, \tag{4.31}$$

and

$$w_c^{[i+2]}(t) = w_c^{[i+1]}(t) + \Delta w_c^{[i+1]}(t), \tag{4.32}$$

where

$$\Delta w_c^{[i+1]}(t) = -\beta_c \frac{\partial Y_c^{[i+1]}(t)}{\partial w_c^{[i+1]}(t)}. \tag{4.33}$$

The regulation parameter satisfies $\beta_c > 0$. The approximate iterative control is obtained by the action network. The states $x(t-1), x(t-2), \ldots, x(t-h)$ are used as inputs, and the action network output is $\hat{u}^{[i]}(t)$, which can be formulated as


$$\hat{u}^{[i]}(t) = v_a^{[i]T}\sigma(w_a^{[i]T}X(t)), \tag{4.34}$$

where $X(t) = [x^{T}(t-1), x^{T}(t-2), \ldots, x^{T}(t-h)]^{T}$. The approximation error of the network is defined as follows:

$$e_a^{[i]}(t) = \hat{u}^{[i]}(t) - u^{[i]}(t). \tag{4.35}$$

To obtain the update rule, we define the following performance error measure:

$$Y_a^{[i]}(t) = \frac{1}{2}e_a^{[i]T}(t)e_a^{[i]}(t). \tag{4.36}$$

The action network weights are updated by

$$v_a^{[i+1]}(t) = v_a^{[i]}(t) + \Delta v_a^{[i]}(t), \tag{4.37}$$

where

$$\Delta v_a^{[i]}(t) = -\beta_a \frac{\partial Y_a^{[i]}(t)}{\partial v_a^{[i]}(t)}, \tag{4.38}$$

and

$$w_a^{[i+1]}(t) = w_a^{[i]}(t) + \Delta w_a^{[i]}(t), \tag{4.39}$$

where

$$\Delta w_a^{[i]}(t) = -\beta_a \frac{\partial Y_a^{[i]}(t)}{\partial w_a^{[i]}(t)}. \tag{4.40}$$

The regulation parameter satisfies $\beta_a > 0$.

4.5 Simulation Results

Example 4.1 To illustrate the detailed implementation procedure of the presented method, we discuss the following nonlinear time-delay system [4]:

$$\begin{aligned} x(t+1) &= f(x(t), x(t-2)) + g(x(t), x(t-2))u(t), \\ x(s) &= x_0, \quad -2 \le s \le 0, \end{aligned} \tag{4.41}$$

where

$$f(x(t), x(t-2)) = \begin{bmatrix} 0.2x_1(t)\exp\left((x_2(t))^{2}\right)x_2(t-2) \\ 0.3(x_2(t))^{2}x_1(t-2) \end{bmatrix}$$

and

$$g(x(t), x(t-2)) = \begin{bmatrix} x_2(t-2) & 0.2 \\ 0.1 & 1 \end{bmatrix}.$$

The performance index functions $P_i(x(t))$ are similar to those in [1] and are defined as follows:

$$P_1(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_1^{2}(k) + 1) + u^{2}(k) \right\}, \tag{4.42}$$

$$P_2(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_2^{2}(k) + 1) + u^{2}(k) \right\}, \tag{4.43}$$

and

$$P_3(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_1^{2}(k) + x_2^{2}(k) + 1) + u^{2}(k) \right\}. \tag{4.44}$$
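The system and utilities of Example 4.1 can be coded directly. The following is a sketch with hypothetical function names. It assumes that $\exp$ acts on $(x_2(t))^{2}$ in the first component of $f$, and that $u^{2}$ denotes $u^{T}u$ for the two-dimensional control; the weight vector and initial state used below are the ones selected for this example.

```python
import math


def f(x, x_d):
    """Drift term of (4.41); x = x(t) and x_d = x(t-2), each a pair (x1, x2)."""
    return [
        0.2 * x[0] * math.exp(x[1] ** 2) * x_d[1],
        0.3 * x[1] ** 2 * x_d[0],
    ]


def g(x, x_d):
    """Input matrix of (4.41)."""
    return [[x_d[1], 0.2], [0.1, 1.0]]


def step(x, x_d, u):
    """x(t+1) = f(x(t), x(t-2)) + g(x(t), x(t-2)) u(t)."""
    fx, gx = f(x, x_d), g(x, x_d)
    return [fx[r] + gx[r][0] * u[0] + gx[r][1] * u[1] for r in range(2)]


def weighted_utility(x, u, W):
    """W^T I for the utilities in (4.42)-(4.44), with u^2 read as u^T u (assumption)."""
    uu = u[0] ** 2 + u[1] ** 2
    L = [
        math.log(x[0] ** 2 + 1),
        math.log(x[1] ** 2 + 1),
        math.log(x[0] ** 2 + x[1] ** 2 + 1),
    ]
    return sum(w * (L_i + uu) for w, L_i in zip(W, L))


x0 = [0.5, -0.5]
W = [0.1, 0.3, 0.6]
x_next = step(x0, x0, [0.0, 0.0])  # x(t) = x(t-2) = x0 during the initial interval
utility0 = weighted_utility(x0, [0.0, 0.0], W)
```

With zero control, `x_next` is the uncontrolled one-step response from the constant initial history and `utility0` is the combined utility at the initial state.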

According to the requirements of the system, the weight vector $W$ is selected as $W = [0.1, 0.3, 0.6]^{T}$. To implement the established algorithm, we let the initial state of time-delay system (4.41) be $x_0 = [0.5, -0.5]^{T}$. We use neural networks to implement the iterative ADP algorithm. The critic network is a three-layer BP neural network, which is used to approximate the iterative performance index function. Another three-layer BP neural network is used as the action network, which approximates the iterative control. To increase the neural network accuracy, the two networks are each trained for 1000 steps, and the regulation parameters are both 0.001. Then we get the iterative performance index functions shown in Fig. 4.1, which converge to a constant. The system runs for 200 steps to obtain the simulation results. The state trajectories are shown in Fig. 4.2, and the control input trajectories are shown in Fig. 4.3. It is clear that the time-delay system converges after 60 time steps. So the iterative multi-objective optimal control method constructed in this chapter has good performance.

Fig. 4.1 The iterative performance index functions (performance index function versus iteration steps)

Fig. 4.2 The state trajectories x1 and x2 (state trajectories versus time steps)

Fig. 4.3 The control trajectories u1 and u2 (control versus time steps)

Example 4.2 This example is derived from [3, 5] with modifications:

$$\begin{cases} x_1(t+1) = x_1(t) + 10^{-3}x_2(t-1), \\ x_2(t+1) = 0.03151\sin(x_1(t-2)) + 0.9991x_2(t) + 0.0892x_3(t), \\ x_3(t+1) = -0.01x_2(t) + 0.99x_3(t-1) + u(t), \\ x(s) = x_0, \quad -2 \le s \le 0. \end{cases} \tag{4.45}$$

For system (4.45), define the performance index functions

$$P_1(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_1^{2}(k) + x_2^{2}(k)) + u^{2}(k) \right\}, \tag{4.46}$$

$$P_2(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_2^{2}(k) + x_3^{2}(k)) + u^{2}(k) \right\}, \tag{4.47}$$

$$P_3(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_1^{2}(k) + x_3^{2}(k)) + u^{2}(k) \right\}, \tag{4.48}$$

and

$$P_4(x(t)) = \sum_{k=0}^{\infty} \left\{ \ln(x_1^{2}(k) + 1) + u^{2}(k) \right\}. \tag{4.49}$$

The weight vector is $W = [0.1, 0.3, 0.1, 0.5]^{T}$. Then the performance index function $P(x(t))$ can be obtained based on the weighted sum method. The initial state is selected as $x_0 = [0.5, -0.5, 1]^{T}$. The structures, the learning rates and the initial weights of the critic network and the action network are the same as in Example 4.1. After 10 iterative steps, the performance index function trajectory is obtained as in Fig. 4.4. The state trajectories are shown in Fig. 4.5, and the control trajectory is given in Fig. 4.6.

Fig. 4.4 The iterative performance index functions (performance index function versus iteration steps)

Fig. 4.5 The state trajectories x1, x2 and x3 (state trajectories versus time steps)

Fig. 4.6 The control trajectories u (control versus time steps)

Fig. 4.7 The state trajectories x1, x2 and x3 obtained by [1] (state trajectories versus time steps)

Fig. 4.8 The control trajectories u obtained by [1] (control versus time steps)

For illustrating the good performance, the method in [1] has been used to obtain the optimal control of system (4.45). After 1500 time steps, the system states and control converge to zero, which are shown in Figs. 4.7 and 4.8. From the comparison, we can see that the convergence speed of the method presented in this chapter is faster and the performance is better than the method in [1].

4.6 Conclusions

This chapter addressed nonlinear time-delay systems and solved the multi-objective optimal control problem using the presented ADP method. By the weighted sum technology, we obtained a single-objective optimal control problem from the original multi-objective one. We then established an ADP method for the considered nonlinear time-delay systems, and the convergence analysis proved that the iterative performance index functions converge to the optimal one. The critic and action networks were used to obtain the iterative performance index function and the iterative control policy, respectively. In the simulation, multiple performance index functions were given, and the results obtained with the proposed method illustrate the validity and effectiveness of the proposed multi-objective optimal control method.


References

1. Wei, Q., Zhang, H., Dai, J.: Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions. Neurocomputing 72(7–9), 1839–1848 (2009)
2. Wang, D., Liu, D., Wei, Q., Zhao, D., Jin, N.: Optimal control of unknown nonaffine nonlinear discrete-time systems based on adaptive dynamic programming. Automatica 48(8), 1825–1832 (2012)
3. Song, R., Xiao, W., Zhang, H.: Multi-objective optimal control for a class of unknown nonlinear systems based on finite-approximation-error ADP algorithm. Neurocomputing 119(7), 212–221 (2013)
4. Zhang, H., Song, R., Wei, Q., Zhang, T.: Optimal tracking control for a class of nonlinear discrete-time systems with time delays based on heuristic dynamic programming. IEEE Trans. Neural Netw. 22(12), 1851–1862 (2011)
5. Gyurkovics, É., Takács, T.: Quadratic stabilisation with H∞-norm bound of non-linear discrete-time uncertain systems with bounded control. Syst. Control Lett. 50, 277–289 (2003)

Chapter 5

Multiple Actor-Critic Optimal Control via ADP

In industrial process control, there may be multiple performance objectives, depending on salient features of the input-output data. Aiming at this situation, this chapter proposes multiple actor-critic structures to obtain the optimal control via input-output data for unknown nonlinear systems. The shunting inhibitory artificial neural network (SIANN) is used to classify the input-output data into one of several categories. Different performance measure functions may be defined for disparate categories. The ADP algorithm, which contains model module, critic network and action network, is used to establish the optimal control in each category. A recurrent neural network (RNN) model is used to reconstruct the unknown system dynamics using input-output data. Neural networks are used to approximate the critic and action networks, respectively. It is proven that the model error and the closed unknown system are uniformly ultimately bounded (UUB). Simulation results demonstrate the performance of the proposed optimal control scheme for the unknown nonlinear system.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019
R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_5

5.1 Introduction

Recent work in experimental neurocognitive psychology has revealed the existence of new relations between structures in the human brain and their functions in decision and control [1, 2]. Mechanisms in the brain involving the emotional gates in the amygdala and deliberative decisions between risky choices in the anterior cingulate cortex (ACC) indicate the presence of multiple interacting actor-critic structures. That research reveals the role of the amygdala in fast intuitive response to the environment based on stored patterns, and the role of the ACC in deliberative response when risk or mismatch with stored patterns is detected. It is shown that the amygdala uses features and cues received from interactions with the environment to classify


possible responses into different categories based on context and match with previously encountered situations. These categories can be viewed as representing stored behavior responses or patterns. Werbos [3, 4] discusses novel ADP structures with multiple critic levels based on functional regions in the human cerebral cortex. The interplay between using previously stored experiences to quickly decide possible actions, and real-time exploration and learning for precise control is emphasized. This agrees with the work in [1, 2] which details how stored behavior patterns can be used to enact fast decisions by drawing on previous experiences when there is match between observed attributes and stored patterns. In the event of risk, mismatch, or anxiety, higher-level control mechanisms are recruited by the ACC that involve more focus on real-time observations and exploration. There are existing approaches to adaptive control and ADP that take into account some features of these new studies. Included is the multiple-model adaptive control method of [5]. Multiple actor-critic structures were developed by [5–8]. These works used multiple modules, each of which contains a stored dynamics model of the environment and a controller. Incoming data were classified into modules based on prediction errors between observed data and the prediction of the dynamics model in each module. In industrial process control, the system dynamics are difficult and expensive to estimate and cannot be accurately obtained [9]. Therefore, it is difficult to design optimal controllers for these unknown systems. On the other hand, new imperatives in minimizing resource use and pollution, while maximizing throughput and yield, make it important to control industrial processes in an optimal fashion. Using proper control techniques, the input-output data generated in the process of system operation can be accessed and used to design optimal controllers for unknown systems [10]. 
Recently, many studies have been done on data-based control schemes for unknown systems [11–17]. For large-scale industrial processes, there are usually different production performance measures according to desired process properties. Therefore, more extensive architectures are needed to design optimal controllers for large-scale industrial processes, which respond quickly based on previously learned behavior responses, allow adaptive online learning to guarantee real-time performance, and accommodate different performance measures for different situations. Such industrial imperatives show the need for more comprehensive approaches to data-driven process control. Multiple performance objectives may be needed depending on different important features hidden in observed data. This requires different control categories that may not only depend on closeness of match with predictions based on stored dynamics models as used in [2, 18]. The work of Levine [19] and Werbos [20] indicates that more complex structures are responsible for learning in the human brain than the standard three-level actor-critic based on networks for actor, model, and critic. It has been shown [1, 2] that shunting inhibition is a powerful computational mechanism and plays an important role in sensory neural information processing systems [21]. Bouzerdoum [22] proposed the shunting inhibitory artificial neural network (SIANN) which can be used for highly effective classification and function approximation. In SIANN the synaptic interactions among neurons are mediated via shunting inhibition. Shunting inhibition is more powerful than multi-layer


perceptrons in that each shunting neuron has a higher discrimination capacity than a perceptron neuron and allows the network to construct complex decision surfaces much more readily. Therefore, SIANN is a powerful and effective classifier. In [23], efficient training algorithms for a class of shunting inhibitory convolutional neural networks are presented.

In this chapter we bring together the recent studies in neurocognitive psychology [19, 20] and recent mathematical machinery for implementing shunting inhibition [21, 22] to develop a new class of process controllers that have a novel multiple actor-critic structure. This multiple network structure learns on several timescales. Stored experiences are first used to train a SIANN to classify environmental cues into different behavior performance categories according to different technical requirements. This classification is general, and does not depend only on the match of observed data with stored dynamics models. Then, data observed in real time are used to learn deliberative responses for more precise online optimal control. This results in faster immediate control using stored experiences, along with real-time deliberative learning.

The contributions of this chapter are as follows.
(1) Based on neurocognitive psychology, a novel controller based on multiple actor-critic structures is developed for unknown systems. This controller trades off fast actions based on stored behavior patterns with real-time exploration using current input-output data.
(2) A SIANN is used to classify input-output data into categories based on salient features of previously recorded data. Shunting inhibition allows for higher discriminatory capabilities and has been shown to be important in neural information processing.
(3) In each category, an RNN is used to identify the system dynamics, novel parameter update algorithms are given, and it is proven that the parameter errors are UUB.
(4) Action-critic networks are developed in each category to obtain the optimal performance measure function and the optimal controller based on current observed input-output data. It is proven, based on Lyapunov techniques, that the closed-loop systems and the weight errors are UUB.

The rest of the chapter is organized as follows. In Sect. 5.2, the problem motivations and preliminaries are presented. In Sect. 5.3, the SIANN architecture-based classification technique is developed. In Sect. 5.4, the model network, critic network and action network are introduced. In Sect. 5.5, two examples are given to demonstrate the effectiveness of the proposed optimal control scheme. In Sect. 5.6, the conclusion is drawn.

5.2 Problem Statement

For most complex industrial systems, it is difficult, time consuming, and expensive to identify an accurate mathematical model. Therefore, optimal controllers are difficult to design. Motivated by this problem, a multiple actor-critic structure is proposed


to obtain optimal controllers based on measured input-output data. This allows the construction of adaptive optimal control systems for industrial processes that are responsive to changing plant conditions and performance requirements. This structure uses fast classification based on stored memories, such as occurs in the amygdala and orbitofrontal cortex (OFC) in [1, 2]. It also allows for real-time exploration and learning, such as occurs in the ACC in [1, 2]. This structure conforms to the ideas of [3], which stress the importance of an integrated utilization of stored memories and real-time exploration.

The overall multiple actor-critic structure is shown in Fig. 5.1. A SIANN is trained to classify previously recorded data into different memory categories [24]. In Fig. 5.1, $x_i$ stands for the measured system data, which are also the input of the SIANN. The trained SIANN is used to classify data recorded in real time into categories, and its output $y \in \{1, 2, \ldots, L\}$ is the category label. Within each category, ADP is used to establish the optimal control. If the output of the SIANN is $y = i$, then the $i$th ADP structure is activated. For each ADP, an RNN is trained to model the dynamics. Then, the critic network and the action network of that ADP are used to determine the performance measure function and the optimal control. The actor-critic structure in the active category is tuned online based on real-time recorded data. The details are given in the next sections.
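The dispatch step described above, in which the classifier output selects which ADP structure is active, can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_category` and `MultiActorCritic`, and the rule of rounding the scalar SIANN output to the nearest label, are hypothetical, and the classifier and per-category controllers are passed in as plain callables.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np


def select_category(y_raw: float, num_categories: int) -> int:
    """Map the scalar SIANN output to a label in {1, ..., L}.

    Assumption: the nearest integer is taken and clipped to the valid range.
    """
    return int(np.clip(np.rint(y_raw), 1, num_categories))


class MultiActorCritic:
    """Hypothetical wrapper: one classifier, one ADP controller per category."""

    def __init__(
        self,
        classifier: Callable[[np.ndarray], float],
        controllers: Sequence[Callable[[np.ndarray], np.ndarray]],
    ) -> None:
        self.classifier = classifier
        self.controllers: List[Callable[[np.ndarray], np.ndarray]] = list(controllers)

    def control(self, x: np.ndarray) -> Tuple[int, np.ndarray]:
        """Classify x, activate the matching ADP controller, return (label, u)."""
        label = select_category(self.classifier(x), len(self.controllers))
        u = self.controllers[label - 1](x)  # labels are 1-based, as in the text
        return label, u


# Usage with stand-in components (placeholders, not trained networks).
toy_classifier = lambda x: 1.0 if x[0] < 0.0 else 2.0
toy_controllers = [lambda x: -1.0 * x, lambda x: -2.0 * x]
mac = MultiActorCritic(toy_classifier, toy_controllers)
label, u = mac.control(np.array([0.5, -0.5]))
```

In a full scheme each entry of `controllers` would wrap the model, critic and action networks of one category, and only the active entry would be tuned online.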

5.3 SIANN Architecture-Based Classification

In industrial process control, there are usually different production performance measures according to different desired process properties. Therefore, the input-output data should be classified into different categories based on various features inherent in the data. It has been shown in [1] that shunting inhibition is important to explain

Fig. 5.1 The structure diagram of the algorithm


the manner in which the amygdala and OFC classify data based on environmental cues. In this section, the structure and updating method are introduced for a SIANN [21] that is used to classify the measured data into different control response categories depending on match between certain attributes of the data and previously stored experiences. In this chapter, SIANN consists of one input layer, one hidden layer and one output layer. In the hidden units, the shunting inhibition is the basic synaptic interaction. Generalized shunting neurons are used in the hidden layer of the SIANN. The structure of each neuron is shown in Fig. 5.2 and given as follows [21]. The output of each neuron is expressed as

$$s_j = \frac{g_h\left(w_j^T x + w_{j0}\right) + b_j}{a_j + f_h\left(c_j^T x + c_{j0}\right)}, \tag{5.1}$$

where $x = [x_1, x_2, \ldots, x_m]^T$, $w_j = [w_{j1}, w_{j2}, \ldots, w_{jm}]^T$, and $c_j = [c_{j1}, c_{j2}, \ldots, c_{jm}]^T$. Here $s_j$ is the output of the $j$th hidden-layer neuron and $x_i$ is the $i$th input to that neuron. The SIANN parameters include the passive decay rate $a_j > 0$ of the $j$th neuron, the bias $b_j$, and the weights $w_{ji}$ and $c_{ji}$ from the $i$th input to the $j$th neuron; $w_{j0}$ and $c_{j0}$ are constants. The nonlinear activation functions are $f_h(\cdot)$ and $g_h(\cdot)$, and $f_h(\cdot)$ is selected to be positive. Note that the denominator of (5.1) can never be zero, because $a_j > 0$ and $f_h(\cdot) > 0$. The output layer of the SIANN is given by [22]
$$y = \sum_{j=1}^{n} v_j s_j + d = v^T s + d, \tag{5.2}$$

where $s = [s_1, s_2, \ldots, s_n]^T$ and $v = [v_1, v_2, \ldots, v_n]^T$, in which $v_j$ is the weight from the $j$th hidden neuron to the output neuron, and $d$ is the bias.

Fig. 5.2 The structure diagram of the algorithm

Remark 5.1 In (5.1), the denominator is termed a shunting inhibition. This is because $f_h(\cdot)$ is nonnegative, so that the combined effects of neighboring neuron inputs $x_i$ in the denominator can only decrease the neuron output $s_j$. The numerator of (5.1) is a standard feedforward neural network structure. It is shown in [21] that the neuron structure (5.1) gives more flexibility in classification using a smaller number of hidden-layer neurons than standard feedforward neural networks.

It is necessary to train the SIANN to properly classify data in real time into prescribed categories. This is accomplished using previously recorded data. The SIANN weights for proper classification are determined by supervised learning as follows. The gradient descent method is used to train the parameters. Given the desired output category $y_d$, define the error $e = y - y_d$ and
$$E = \frac{1}{2} e^2. \tag{5.3}$$

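According to the gradient descent algorithm, the updates for the parameters are given by backpropagation as follows [25]:
$$\dot{v} = -\gamma_v \frac{\partial E}{\partial e} \frac{\partial y}{\partial v} = -\gamma_v e s, \tag{5.4}$$
$$\dot{d} = -\gamma_d e, \tag{5.5}$$
$$\dot{w}_j = -\gamma_w \frac{\partial E}{\partial e} \frac{\partial y}{\partial s_j} \frac{\partial s_j}{\partial w_j} = -\gamma_w e v_j \frac{g_h'\left(w_j^T x + w_{j0}\right) x^T}{a_j + f_h\left(c_j^T x + c_{j0}\right)}, \tag{5.6}$$
$$\dot{b}_j = -\gamma_b \frac{\partial E}{\partial e} \frac{\partial y}{\partial s_j} \frac{\partial s_j}{\partial b_j} = -\gamma_b e v_j \frac{1}{a_j + f_h\left(c_j^T x + c_{j0}\right)}, \tag{5.7}$$
$$\dot{a}_j = -\gamma_a \frac{\partial E}{\partial e} \frac{\partial y}{\partial s_j} \frac{\partial s_j}{\partial a_j} = -\gamma_a e v_j \frac{-\left(g_h\left(w_j^T x + w_{j0}\right) + b_j\right)}{\left(a_j + f_h\left(c_j^T x + c_{j0}\right)\right)^2}, \tag{5.8}$$
$$\dot{c}_j = -\gamma_c \frac{\partial E}{\partial e} \frac{\partial y}{\partial s_j} \frac{\partial s_j}{\partial c_j} = -\gamma_c e v_j \frac{-\left(g_h\left(w_j^T x + w_{j0}\right) + b_j\right)}{\left(a_j + f_h\left(c_j^T x + c_{j0}\right)\right)^2} f_h'\left(c_j^T x + c_{j0}\right) x^T, \tag{5.9}$$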

where $\gamma_v$, $\gamma_d$, $\gamma_w$, $\gamma_b$, $\gamma_a$ and $\gamma_c$ are positive learning rates. Using these parameter update equations, the SIANN is trained, using previously recorded data, to properly classify the input-output data into the appropriate categories. Then, based on the trained SIANN classifier, one can decide the category label for the input-output data that are measured in real time. The optimal control for each category is then found in real time using the category ADP structures described in the subsequent sections.
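The forward pass (5.1)–(5.2) and the update laws (5.4)–(5.9) can be sketched in Python as below. This is an illustrative sketch rather than the authors' code: the choices $g_h = \tanh$ and $f_h = \exp$ (which is positive), the explicit Euler discretization with a single step size `lr` standing in for all six learning rates, the clamp that keeps $a_j > 0$, and all function names are assumptions made here.

```python
import numpy as np


def init_params(m, n, rng):
    """Random initial SIANN parameters for m inputs and n hidden neurons."""
    return {
        "W": 0.5 * rng.standard_normal((n, m)),   # rows are w_j^T
        "C": 0.5 * rng.standard_normal((n, m)),   # rows are c_j^T
        "w0": np.zeros(n), "c0": np.zeros(n),     # constants w_j0, c_j0 (not trained)
        "a": np.ones(n),                          # passive decay rates a_j > 0
        "b": np.zeros(n),                         # biases b_j
        "v": 0.5 * rng.standard_normal(n),        # output weights
        "d": 0.0,                                 # output bias
    }


def forward(x, p):
    """Evaluate (5.1) and (5.2) with g_h = tanh and f_h = exp (assumed)."""
    z = p["W"] @ x + p["w0"]
    q = p["C"] @ x + p["c0"]
    num = np.tanh(z) + p["b"]
    den = p["a"] + np.exp(q)          # strictly positive because a_j > 0
    s = num / den
    y = p["v"] @ s + p["d"]
    return y, s, z, q, num, den


def gradients(x, y_d, p):
    """Gradients of E = e^2 / 2, matching the right-hand sides of (5.4)-(5.9)."""
    y, s, z, q, num, den = forward(x, p)
    e = y - y_d
    ev = e * p["v"]                   # e * v_j for every hidden neuron
    return {
        "v": e * s,
        "d": e,
        "W": np.outer(ev * (1.0 - np.tanh(z) ** 2) / den, x),
        "b": ev / den,
        "a": -ev * num / den ** 2,
        "C": np.outer(-ev * num * np.exp(q) / den ** 2, x),
    }


def train_step(x, y_d, p, lr=0.01):
    """One explicit Euler step of the update laws; returns E before the step."""
    y = forward(x, p)[0]
    g = gradients(x, y_d, p)
    for key, grad in g.items():
        p[key] = p[key] - lr * grad
    p["a"] = np.maximum(p["a"], 1e-3)   # assumption: clamp to keep a_j > 0
    return 0.5 * (y - y_d) ** 2


rng = np.random.default_rng(0)
params = init_params(m=3, n=4, rng=rng)
x_sample, label = np.array([0.2, -0.4, 0.7]), 2.0
losses = [train_step(x_sample, label, params) for _ in range(200)]
```

Training on a single sample is only meant to show the mechanics; in practice the steps would be applied over the whole set of previously recorded, labeled data.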


5.4 Optimal Control Based on ADP

In the previous section a SIANN is trained using previously recorded data to classify the input-output data in real time into different categories based on data attributes and features. Based on this SIANN classification, an ADP structure in each category is used to determine, based on data measured in real time, the optimal control for the data attributes and performance measure in that category. In this section, a novel ADP structure is designed that captures the required system identification and optimality factors in process control. This structure uses fast classification based on stored memories, such as occurs in the amygdala and OFC in [7, 8], and also allows for real-time exploration and learning, such as occurs in the ACC in [7, 8]. This structure provides an integrated utilization of stored memories and real-time exploration [9]. The ADP structure is used in each category as shown in Fig. 5.1. The ADP structure has the standard three networks for model, critic, and actor.

In this section, first, the optimization problem is introduced. Then a recurrent neural network (RNN) is trained to identify the system dynamics. Novel neural network (NN) parameter tuning algorithms for the model RNN are given and the state estimation error is proven to be uniformly ultimately bounded. After this model identification phase, the ADP critic network and action network are designed to use data recorded in real time to compute online the performance measure value function and the optimal control action, respectively. A theorem is given to prove that the closed-loop system is UUB.

Based on the previous section, suppose the input-output data are classified into category $y = i$. Then we can train the $i$th ADP using those input-output data. Throughout this section, it is understood that the design refers to the ADP structure in each category $i$. To conserve notation, we do not use subscripts $i$ throughout.
The performance measure function for the $i$th ADP is given as
$$J(x) = \int_t^{\infty} r(x(\tau), u(\tau))\, d\tau, \tag{5.10}$$

where $x$ and $u$ are the state and control inputs of the $i$th ADP, and $r(x, u) = Q(x) + u^T R u$, where $R$ is a positive definite matrix. It is assumed that there exists a scalar $P > 0$ such that $Q(x) > P x^T x$. The infinitesimal version of (5.10) is the Bellman equation
$$0 = J_x^T \dot{x} + r, \tag{5.11}$$
where $J_x = \partial J / \partial x$. The Hamiltonian function is given as
$$H(x, u, J_x) = J_x^T \dot{x} + r. \tag{5.12}$$
The optimal performance measure function is defined as
$$J^*(x) = \min_u \int_t^{\infty} r(x(\tau), u(\tau))\, d\tau, \tag{5.13}$$
and the optimal control can be obtained by
$$u = \arg\min_u \int_t^{\infty} r(x(\tau), u(\tau))\, d\tau. \tag{5.14}$$

In the following subsections, the detailed derivations for the model neural network, the critic network and the action network are given.

5.4.1 Model Neural Network

The previous section showed how to use the SIANN to classify the data into one of $L$ categories (see Fig. 5.1). In each category $i$, an RNN is used to identify the system dynamics for the $i$th ADP structure. The dynamics in category $i$ are taken to be modeled by the RNN given as [26]
$$\dot{x} = A_1^T x + A_2^T f(x) + A_3^T u + A_4^T + \varepsilon + \tau_d. \tag{5.15}$$

This is a very general model of a nonlinear system that can closely fit many dynamical models. Here, $A_1$, $A_2$, $A_3$ and $A_4$ are the unknown ideal weight matrices for the $i$th ADP, which are assumed to satisfy $\|A_i\|_F \le A_i^B$, where $A_i^B > 0$ is constant, $i = 1, 2, 3, 4$. $\varepsilon$ is the bounded approximation error, and is assumed to satisfy $\varepsilon^T \varepsilon \le \beta_A e_x^T e_x$, where $\beta_A$ is a bounded constant and the state estimation error is $e_x = x - \hat{x}$. The activation function $f(x)$ is a monotonically increasing function, and it is taken to satisfy $0 \le \|f(x_1) - f(x_2)\| \le k \|x_1 - x_2\|$, where $x_1 > x_2$ and $k > 0$. $\tau_d$ is the disturbance and it is bounded, i.e., $\|\tau_d\| \le d_B$, where $d_B$ is a constant. The approximate dynamics model is constructed as
$$\dot{\hat{x}} = \hat{A}_1^T \hat{x} + \hat{A}_2^T f(\hat{x}) + \hat{A}_3^T u + \hat{A}_4^T + A_5 e_x, \tag{5.16}$$
where $\hat{A}_1$, $\hat{A}_2$, $\hat{A}_3$ and $\hat{A}_4$ are the estimated values of the ideal unknown weights, and $A_5$ is a square matrix that satisfies
$$A_5 - A_1^T - \frac{1}{2} A_2^T A_2 - \left(\frac{1}{2} + \frac{\beta_A}{2} + \frac{k^2}{2}\right) I > 0.$$
Define the parameter identification errors as $\tilde{A}_1 = A_1 - \hat{A}_1$, $\tilde{A}_2 = A_2 - \hat{A}_2$, $\tilde{A}_3 = A_3 - \hat{A}_3$, $\tilde{A}_4 = A_4 - \hat{A}_4$, and define $\tilde{f}(e_x) = f(x) - f(\hat{x})$. Due to disturbances and modeling errors, the model (5.16) cannot exactly reconstruct the unknown dynamics (5.15). It is desired to tune the parameters in (5.16) such that the parameter estimation errors and the state estimation error are uniformly ultimately bounded.


Definition 5.1 Uniformly Ultimately Bounded (UUB) [27–29]: The equilibrium point $x_e$ is said to be UUB if there exists a compact set $U \subset \mathbb{R}^n$ such that, for all $x(t_0) = x_0 \in U$, there exist a $\delta > 0$ and a number $T(\delta, x_0)$ such that $\|x(t) - x_e\| < \delta$ for all $t \ge t_0 + T$.

The next theorem extends the result in [26] by providing robust tuning methods that make the Lyapunov derivative negative outside a bounded region. These robustified tuning methods are required due to the disturbances in (5.15).

Theorem 5.1 Consider the approximate system model (5.16), and take the update methods for the tunable parameters as
$$\dot{\hat{A}}_1 = \beta_1 \hat{x} e_x^T - \beta_1 \|e_x\| \hat{A}_1,$$
$$\dot{\hat{A}}_2 = \beta_2 f(\hat{x}) e_x^T - \beta_2 \|e_x\| \hat{A}_2,$$
$$\dot{\hat{A}}_3 = \beta_3 u e_x^T - \beta_3 \|e_x\| \hat{A}_3,$$
$$\dot{\hat{A}}_4 = \beta_4 e_x^T - \beta_4 \|e_x\| \hat{A}_4,$$
where $\beta_i > 0$, $i = 1, 2, 3, 4$. If the initial values of the state estimation error $e_x(0)$ and the parameter identification errors $\tilde{A}_i(0)$ are bounded, then the state estimation error $e_x$ and the parameter identification errors $\tilde{A}_i$ are UUB.

Proof Let the initial values of the state estimation error $e_x(0)$ and the parameter identification errors $\tilde{A}_i(0)$ be bounded; then the NN approximation property (5.15) holds for the state $x$ [29]. Define the Lyapunov function candidate
$$V = V_1 + V_2, \tag{5.17}$$
where $V_1 = \frac{1}{2} e_x^T e_x$ and $V_2 = V_{A1} + V_{A2} + V_{A3} + V_{A4}$, in which $V_{A1} = \frac{1}{2\beta_1} \mathrm{tr}\{\tilde{A}_1^T \tilde{A}_1\}$, $V_{A2} = \frac{1}{2\beta_2} \mathrm{tr}\{\tilde{A}_2^T \tilde{A}_2\}$, $V_{A3} = \frac{1}{2\beta_3} \mathrm{tr}\{\tilde{A}_3^T \tilde{A}_3\}$, and $V_{A4} = \frac{1}{2\beta_4} \tilde{A}_4 \tilde{A}_4^T$. Since the equations
$$A_1^T x - \hat{A}_1^T \hat{x} = (A_1^T x - A_1^T \hat{x}) + (A_1^T \hat{x} - \hat{A}_1^T \hat{x}) = A_1^T e_x + \tilde{A}_1^T \hat{x} \tag{5.18}$$
and
$$A_2^T f(x) - \hat{A}_2^T f(\hat{x}) = (A_2^T f(x) - A_2^T f(\hat{x})) + (A_2^T f(\hat{x}) - \hat{A}_2^T f(\hat{x})) = A_2^T \tilde{f}(e_x) + \tilde{A}_2^T f(\hat{x}) \tag{5.19}$$
hold, one has
$$\dot{e}_x = \dot{x} - \dot{\hat{x}} = A_1^T e_x + \tilde{A}_1^T \hat{x} + A_2^T \tilde{f}(e_x) + \tilde{A}_2^T f(\hat{x}) + \tilde{A}_3^T u + \tilde{A}_4^T + \varepsilon - A_5 e_x + \tau_d. \tag{5.20}$$


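Therefore,
$$\dot{V}_1 = e_x^T \left(A_1^T e_x + \tilde{A}_1^T \hat{x} + A_2^T \tilde{f}(e_x) + \tilde{A}_2^T f(\hat{x}) + \tilde{A}_3^T u + \tilde{A}_4^T + \varepsilon - A_5 e_x + \tau_d\right). \tag{5.21}$$
As
$$e_x^T A_2^T \tilde{f}(e_x) \le \frac{1}{2} e_x^T A_2^T A_2 e_x + \frac{1}{2} k^2 e_x^T e_x \tag{5.22}$$
and
$$e_x^T \varepsilon \le \frac{1}{2} e_x^T e_x + \frac{1}{2} \varepsilon^T \varepsilon \le \left(\frac{1}{2} + \frac{\beta_A}{2}\right) e_x^T e_x, \tag{5.23}$$
one can get
$$\dot{V}_1 \le e_x^T A_1^T e_x + e_x^T \tilde{A}_1^T \hat{x} + e_x^T \tilde{A}_2^T f(\hat{x}) + \frac{1}{2} e_x^T A_2^T A_2 e_x + \left(\frac{1}{2} k^2 + \frac{1}{2} + \frac{\beta_A}{2}\right) e_x^T e_x + e_x^T \tilde{A}_3^T u + e_x^T \tilde{A}_4^T - e_x^T A_5 e_x + e_x^T \tau_d. \tag{5.24}$$
On the other hand, according to [29], one has
$$\dot{V}_{A1} = -\mathrm{tr}\{\tilde{A}_1^T \hat{x} e_x^T\} + \|e_x\| \mathrm{tr}\{\tilde{A}_1^T \hat{A}_1\} \le -e_x^T \tilde{A}_1^T \hat{x} + \|e_x\| \|\tilde{A}_1\|_F A_1^B - \|e_x\| \|\tilde{A}_1\|_F^2, \tag{5.25}$$
$$\dot{V}_{A2} = -\mathrm{tr}\{\tilde{A}_2^T f(\hat{x}) e_x^T\} + \|e_x\| \mathrm{tr}\{\tilde{A}_2^T \hat{A}_2\} \le -e_x^T \tilde{A}_2^T f(\hat{x}) + \|e_x\| \|\tilde{A}_2\|_F A_2^B - \|e_x\| \|\tilde{A}_2\|_F^2, \tag{5.26}$$
$$\dot{V}_{A3} = -\mathrm{tr}\{\tilde{A}_3^T u e_x^T\} + \|e_x\| \mathrm{tr}\{\tilde{A}_3^T \hat{A}_3\} \le -e_x^T \tilde{A}_3^T u + \|e_x\| \|\tilde{A}_3\|_F A_3^B - \|e_x\| \|\tilde{A}_3\|_F^2, \tag{5.27}$$
$$\dot{V}_{A4} = -\mathrm{tr}\{e_x^T \tilde{A}_4^T\} + \|e_x\| \mathrm{tr}\{\tilde{A}_4^T \hat{A}_4\} \le -e_x^T \tilde{A}_4^T + \|e_x\| \|\tilde{A}_4\|_F A_4^B - \|e_x\| \|\tilde{A}_4\|_F^2. \tag{5.28}$$
Therefore,
$$\dot{V}_2 \le -e_x^T \tilde{A}_1^T \hat{x} - e_x^T \tilde{A}_2^T f(\hat{x}) - e_x^T \tilde{A}_3^T u - e_x^T \tilde{A}_4^T + \sum_{i=1}^{4} \left(\|e_x\| \|\tilde{A}_i\|_F A_i^B - \|e_x\| \|\tilde{A}_i\|_F^2\right). \tag{5.29}$$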


From (5.24) and (5.29), one can obtain that
$$\dot{V} \le e_x^T \left(A_1^T + \frac{1}{2} A_2^T A_2 + \left(\frac{k^2}{2} + \frac{1}{2} + \frac{\beta_A}{2}\right) I - A_5\right) e_x + \sum_{i=1}^{4} \left(\|e_x\| \|\tilde{A}_i\|_F A_i^B - \|e_x\| \|\tilde{A}_i\|_F^2\right) + d_B \|e_x\|. \tag{5.30}$$
Let $K_V = A_5 - A_1^T - \frac{1}{2} A_2^T A_2 - \left(\frac{1}{2} + \frac{\beta_A}{2} + \frac{k^2}{2}\right) I$, and then
$$\dot{V} \le -\|e_x\| \left[\lambda_{\min}(K_V) \|e_x\| + \sum_{i=1}^{4} \|\tilde{A}_i\|_F \left(\|\tilde{A}_i\|_F - A_i^B\right) - d_B\right] = -\|e_x\| \left[\lambda_{\min}(K_V) \|e_x\| + \sum_{i=1}^{4} \left(\left(\|\tilde{A}_i\|_F - \frac{A_i^B}{2}\right)^2 - \left(\frac{A_i^B}{2}\right)^2\right) - d_B\right], \tag{5.31}$$
which is guaranteed negative as long as
$$\|e_x\| > \frac{\left(\frac{A_1^B}{2}\right)^2 + \left(\frac{A_2^B}{2}\right)^2 + \left(\frac{A_3^B}{2}\right)^2 + \left(\frac{A_4^B}{2}\right)^2 + d_B}{\lambda_{\min}(K_V)} \equiv B_e, \tag{5.32}$$
or
$$\|\tilde{A}_1\|_F > \frac{A_1^B}{2} + \sqrt{\left(\frac{A_1^B}{2}\right)^2 + d_B}, \tag{5.33}$$
$$\|\tilde{A}_2\|_F > \frac{A_2^B}{2} + \sqrt{\left(\frac{A_2^B}{2}\right)^2 + d_B}, \tag{5.34}$$
$$\|\tilde{A}_3\|_F > \frac{A_3^B}{2} + \sqrt{\left(\frac{A_3^B}{2}\right)^2 + d_B}, \tag{5.35}$$
$$\|\tilde{A}_4\|_F > \frac{A_4^B}{2} + \sqrt{\left(\frac{A_4^B}{2}\right)^2 + d_B}. \tag{5.36}$$
Thus, $\dot{V}(t)$ is negative outside a compact set. This demonstrates the UUB of $\|e_x\|$ and $\|\tilde{A}_1\|_F$, $\|\tilde{A}_2\|_F$, $\|\tilde{A}_3\|_F$, $\|\tilde{A}_4\|_F$.
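The identifier (5.16) with the tuning laws of Theorem 5.1 can be exercised numerically with a short simulation. The sketch below is illustrative only: the plant matrices, the choice $f = \tanh$ (so $k = 1$), the sinusoidal input, the gains $\beta_i = 5$ and $A_5 = 10I$, the absence of disturbance and approximation error, and the explicit Euler integration are all assumptions made for this example and do not come from the chapter.

```python
import numpy as np

# Hypothetical "true" plant of the form (5.15) with eps = 0 and tau_d = 0.
A1 = np.array([[-1.0, 0.5], [-0.5, -1.0]])
A2 = 0.3 * np.eye(2)
A3 = np.array([[0.5, 1.0]])      # m x n with m = 1 input
A4 = np.array([[0.1, -0.1]])     # 1 x n
f = np.tanh                      # Lipschitz constant k = 1
A5 = 10.0 * np.eye(2)            # design matrix of the identifier (5.16)

# Check the condition on A5 (beta_A = 0 here because eps = 0).
k, beta_A = 1.0, 0.0
K_V = A5 - A1.T - 0.5 * A2.T @ A2 - (0.5 + 0.5 * beta_A + 0.5 * k**2) * np.eye(2)
lam_min = np.linalg.eigvalsh(0.5 * (K_V + K_V.T)).min()   # symmetric part


def simulate(T=5.0, dt=1e-3, beta=5.0):
    """Euler simulation of plant, identifier (5.16) and Theorem 5.1 updates."""
    x = np.array([1.0, -1.0])
    x_hat = np.zeros(2)
    Ah1, Ah2 = np.zeros((2, 2)), np.zeros((2, 2))
    Ah3, Ah4 = np.zeros((1, 2)), np.zeros((1, 2))
    err = []
    for step in range(int(T / dt)):
        u = np.array([np.sin(step * dt)])          # assumed excitation input
        e = x - x_hat
        ne = np.linalg.norm(e)
        err.append(ne)
        x_dot = A1.T @ x + A2.T @ f(x) + A3.T @ u + A4[0]
        x_hat_dot = Ah1.T @ x_hat + Ah2.T @ f(x_hat) + Ah3.T @ u + Ah4[0] + A5 @ e
        # Robust tuning laws: gradient-like term plus ||e_x||-weighted leakage.
        dAh1 = beta * np.outer(x_hat, e) - beta * ne * Ah1
        dAh2 = beta * np.outer(f(x_hat), e) - beta * ne * Ah2
        dAh3 = beta * np.outer(u, e) - beta * ne * Ah3
        dAh4 = beta * e[None, :] - beta * ne * Ah4
        x, x_hat = x + dt * x_dot, x_hat + dt * x_hat_dot
        Ah1, Ah2 = Ah1 + dt * dAh1, Ah2 + dt * dAh2
        Ah3, Ah4 = Ah3 + dt * dAh3, Ah4 + dt * dAh4
    return np.array(err)


err = simulate()
```

The array `err` records $\|e_x\|$ over time, so plotting it is a quick way to see the estimation error settle into a bounded region, which is what UUB predicts; the theorem does not claim convergence to zero.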


Remark 5.2 In fact, many NN activation functions are bounded and have bounded derivatives. In Theorem 5.1, the unknown system is bounded, and the NN parameters and output are bounded. Therefore, the initial state estimation error $e_x$ and the initial parameter identification errors $\tilde{A}_i$ are bounded. The approximation error boundedness property is established in [25].

From Theorem 5.1, it can be seen that as $t \to \infty$, the parameter estimates $\hat{A}_i$ converge to bounded regions, such that the state estimation error $e_x$ is bounded. Let the steady-state value of $\hat{A}_i$ be denoted as $B_i$. Then, after convergence of the model parameters, (5.16) can be rewritten as
$$\dot{x} = B_1^T x + B_2^T f(x) + B_3^T u + B_4^T. \tag{5.37}$$
In the following, the optimal control for the $i$th ADP based on the well-trained RNN (5.37) will be designed.

5.4.2 Critic Network and Action Network

Based on the trained RNN model (5.37), the critic network expression in each category $i$ is
$$J = W_c^T \phi_c + \varepsilon_c, \tag{5.38}$$
where $W_c$ is the ideal critic network weight matrix, $\phi_c$ is the activation function and $\varepsilon_c$ is the approximation error. We assume that $\|\nabla \phi_c\| \le \phi_{cdM}$. For the actual neural networks, let the estimate of $W_c$ be $\hat{W}_c$; then the actual output of the critic network in each category $i$ is
$$\hat{J} = \hat{W}_c^T \phi_c. \tag{5.39}$$
Define the weight estimation error of the critic network as
$$\tilde{W}_c = W_c - \hat{W}_c. \tag{5.40}$$
Substitute this into the Bellman equation (5.11) to obtain the equation error
$$e_c = \hat{W}_c^T \nabla \phi_c \dot{x} + r = -\tilde{W}_c^T \nabla \phi_c \dot{x} + \varepsilon_H, \tag{5.41}$$
where
$$\varepsilon_H = W_c^T \nabla \phi_c \dot{x} + r. \tag{5.42}$$
Let
$$E_c = \frac{1}{2} e_c^T e_c. \tag{5.43}$$
Then, define the weight update law for $\hat{W}_c$ as
$$\dot{\hat{W}}_c = -\alpha_c \frac{\partial E_c}{\partial \hat{W}_c} = -\alpha_c \frac{\xi_1 (\xi_1^T \hat{W}_c + r)}{(\xi_1^T \xi_1 + 1)^2}, \tag{5.44}$$
where $\alpha_c > 0$ is the learning rate of the critic network and $\xi_1 = \nabla \phi_c \dot{x}$. Define $\xi_2 = \xi_1 / \xi_3$ and $\xi_3 = \xi_1^T \xi_1 + 1$, with $\|\xi_2\| \ge \xi_{2m}$. Then, since $\dot{\tilde{W}}_c = -\dot{\hat{W}}_c$,
$$\dot{\tilde{W}}_c = \alpha_c \frac{\xi_1 (\xi_1^T \hat{W}_c + r)}{\xi_3^2} = -\alpha_c \xi_2 \xi_2^T \tilde{W}_c + \alpha_c \xi_2 \frac{\varepsilon_H}{\xi_3}. \tag{5.45}$$
The action network is used to obtain the control policy $u$ in each category $i$ and is given by
$$u = W_a^T \phi_a + \varepsilon_a, \tag{5.46}$$
where $W_a$ is the ideal weight matrix of the action network, $\phi_a$ is the activation function, and $\varepsilon_a$ is the action network approximation error. Under persistent excitation conditions, $\phi_{aM} \ge \|\phi_a\| \ge \phi_{am}$. The actual output of the action network is
$$\hat{u} = \hat{W}_a^T \phi_a, \tag{5.47}$$
where $\hat{W}_a$ is the actual weight of the action network. From (5.37) and $\partial \hat{J} / \partial u = 0$, the desired feedback control input is
$$u = -\frac{1}{2} R^{-1} B_3 \nabla \phi_c^T \hat{W}_c. \tag{5.48}$$
The error between the actual output of the action network and the desired feedback control input is expressed as follows:
$$e_a = \hat{W}_a^T \phi_a + \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T \hat{W}_c. \tag{5.49}$$
Define the objective function as follows:
$$E_a = \frac{1}{2} e_a^T e_a. \tag{5.50}$$


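Then the weight update law for the action network weight is a gradient descent algorithm, which is given by
$$\dot{\hat{W}}_a = -\alpha_a \phi_a \left(\hat{W}_a^T \phi_a + \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T \hat{W}_c\right)^T, \tag{5.51}$$
where $\alpha_a$ is the learning rate of the action network. Define the weight estimation error of the action network as
$$\tilde{W}_a = W_a - \hat{W}_a. \tag{5.52}$$
Then the update law of $\tilde{W}_a$ is
$$\dot{\tilde{W}}_a = \alpha_a \phi_a \left((W_a - \tilde{W}_a)^T \phi_a + \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T (W_c - \tilde{W}_c)\right)^T = \alpha_a \phi_a \left(-\tilde{W}_a^T \phi_a - \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T \tilde{W}_c + W_a^T \phi_a + \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T W_c\right)^T. \tag{5.53}$$
The RNN model (5.37) with controller (5.47) is
$$\dot{x} = B_1^T x + B_2^T f(x) + B_3^T \hat{u} + B_4^T = B_1^T x + B_2^T f(x) + B_3^T W_a^T \phi_a - B_3^T \tilde{W}_a^T \phi_a + B_4^T = B_1^T x + B_2^T f(x) + B_3^T u - B_3^T \varepsilon_a - B_3^T \tilde{W}_a^T \phi_a + B_4^T. \tag{5.54}$$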
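The critic law (5.44) and the actor law (5.51) can be written as single integration steps. The following Python sketch is a hedged illustration, not the authors' implementation: the function names, the explicit Euler step, the quadratic critic basis, and every numerical value are assumptions, and the data point is frozen so that the effect of each law on its own residual is easy to see.

```python
import numpy as np


def critic_step(W_c, grad_phi_c, x_dot, r, alpha_c, dt):
    """One Euler step of the normalized critic law (5.44)."""
    xi1 = grad_phi_c @ x_dot                      # xi_1 = grad(phi_c) x_dot
    e_c = xi1 @ W_c + r                           # Bellman residual (5.41)
    return W_c - dt * alpha_c * xi1 * e_c / (xi1 @ xi1 + 1.0) ** 2


def desired_control(W_c, grad_phi_c, B3, R):
    """Feedback control (5.48): u = -1/2 R^{-1} B3 grad(phi_c)^T W_c."""
    return -0.5 * np.linalg.solve(R, B3 @ (grad_phi_c.T @ W_c))


def actor_step(W_a, phi_a, W_c, grad_phi_c, B3, R, alpha_a, dt):
    """One Euler step of the actor law (5.51)."""
    e_a = W_a.T @ phi_a - desired_control(W_c, grad_phi_c, B3, R)   # (5.49)
    return W_a - dt * alpha_a * np.outer(phi_a, e_a)


# Toy data (all values are illustrative assumptions).
x = np.array([1.0, -1.0])
x_dot = np.array([-1.0, 0.5])
# Jacobian of the assumed basis phi_c = [x1^2, x1*x2, x2^2].
grad_phi_c = np.array([[2 * x[0], 0.0], [x[1], x[0]], [0.0, 2 * x[1]]])
phi_a = np.array([1.0, 0.5])
B3 = np.array([[0.5, 1.0]])
R = np.array([[1.0]])
r = x @ x   # Q(x) = x^T x; the control cost is omitted in this toy residual

W_c, W_a = np.zeros(3), np.zeros((2, 1))
for _ in range(300):
    W_c = critic_step(W_c, grad_phi_c, x_dot, r, alpha_c=10.0, dt=0.1)
for _ in range(300):
    W_a = actor_step(W_a, phi_a, W_c, grad_phi_c, B3, R, alpha_a=1.0, dt=0.1)
```

With a single frozen sample each residual decays geometrically. In the online scheme the samples change along the trajectory, which is why the analysis relies on persistent excitation bounds such as $\xi_{2m}$ and $\phi_{am}$.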

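The next theorem proves the convergence of the optimal control scheme. The result extends the results in [30].

Theorem 5.2 Let the optimal control input for (5.37) be provided by (5.47), and the weight updating laws of the critic network and the action network be given as in (5.44) and (5.51), respectively. Suppose there exist positive scalars $l_1$, $l_2$ and $l_3$ satisfying conditions (5.55)–(5.57) [the conditions themselves are missing from the source text; the proof uses them to make the coefficients of $\|x\|^2$, $\|u\|^2$, $\|\tilde{W}_c\|^2$ and $\|\tilde{W}_a\|^2$ in (5.67) positive]. Then $x$, $u$, $\tilde{W}_c$ and $\tilde{W}_a$ are UUB.

Proof Choose the Lyapunov function candidate
$$\Gamma = \Gamma_1 + \Gamma_2 + \Gamma_3, \tag{5.58}$$
where $\Gamma_1 = \frac{1}{2\alpha_c} \tilde{W}_c^T \tilde{W}_c$, $\Gamma_2 = \frac{l_2}{2\alpha_a} \mathrm{tr}\{\tilde{W}_a^T \tilde{W}_a\}$, $l_2 > 0$, and $\Gamma_3 = l_1 x^T x + l_3 J$, $l_1 > 0$, $l_3 > 0$. Then the time derivative of the Lyapunov function candidate (5.58) along the trajectories of the closed-loop system (5.54) is computed as $\dot{\Gamma} = \dot{\Gamma}_1 + \dot{\Gamma}_2 + \dot{\Gamma}_3$. According to (5.45), it can be obtained that
$$\dot{\Gamma}_1 = \tilde{W}_c^T \left(-\xi_2 \xi_2^T \tilde{W}_c + \xi_2 \frac{\varepsilon_H}{\xi_3}\right) \le -\|\xi_2\|^2 \|\tilde{W}_c\|^2 + \frac{1}{2} \|\xi_2\|^2 \|\tilde{W}_c\|^2 + \frac{1}{2} \left\|\frac{\varepsilon_H}{\xi_3}\right\|^2 \le -\frac{1}{2} \xi_{2m}^2 \|\tilde{W}_c\|^2 + \frac{1}{2} \left\|\frac{\varepsilon_H}{\xi_3}\right\|^2. \tag{5.59}$$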

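Define $\varepsilon_{12} = W_a^T \phi_a + \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T W_c$ and assume $\|\varepsilon_{12}\| \le \varepsilon_{12M}$; then, based on (5.53), one has
$$\dot{\Gamma}_2 = -l_2 \mathrm{tr}\left\{\tilde{W}_a^T \phi_a \left(\tilde{W}_a^T \phi_a + \frac{1}{2} R^{-1} B_3 \nabla \phi_c^T \tilde{W}_c - \varepsilon_{12}\right)^T\right\} \le -l_2 \|\phi_a\|^2 \|\tilde{W}_a\|^2 + \frac{l_2}{2} \|\phi_a\|^2 \|\tilde{W}_a\|^2 + \frac{l_2}{2} \varepsilon_{12M}^2 + \frac{l_2}{4} \|\phi_a\|^2 \|\tilde{W}_a\|^2 + \frac{l_2}{4} \|R^{-1} B_3 \nabla \phi_c^T\|^2 \|\tilde{W}_c\|^2 \le -\frac{l_2}{4} \phi_{am}^2 \|\tilde{W}_a\|^2 + \frac{l_2}{4} \|R^{-1} B_3\|^2 \phi_{cdM}^2 \|\tilde{W}_c\|^2 + \frac{l_2}{2} \varepsilon_{12M}^2. \tag{5.60}$$
The time derivative of $\Gamma_3$ is calculated as follows:
$$\dot{\Gamma}_3 = 2 l_1 x^T \left(B_1^T x + B_2^T f(x) + B_3^T u - B_3^T \varepsilon_a - B_3^T \tilde{W}_a^T \phi_a + B_4^T\right) + l_3 (-r(x, u)). \tag{5.61}$$
As
$$2 x^T B_2^T f \le (\|B_2\|^2 + k^2) \|x\|^2, \tag{5.62}$$
$$-2 x^T B_3^T \tilde{W}_a^T \phi_a \le \|x\|^2 + \|B_3\|^2 \phi_{aM}^2 \|\tilde{W}_a\|^2, \tag{5.63}$$
$$2 x^T B_3^T u \le \|x\|^2 + \|B_3\|^2 \|u\|^2, \tag{5.64}$$
$$-2 x^T B_3^T \varepsilon_a \le \|x\|^2 + \|B_3\|^2 \varepsilon_{aM}^2, \tag{5.65}$$
(5.61) can be rewritten as
$$\dot{\Gamma}_3 \le (2 l_1 \|B_1\| + l_1 \|B_2\|^2 + l_1 k^2 + 4 l_1 - l_3 P) \|x\|^2 + (\|B_3\|^2 l_1 - l_3 \lambda_{\min}(R)) \|u\|^2 + l_1 \|B_3\|^2 \phi_{aM}^2 \|\tilde{W}_a\|^2 + l_1 \|B_3\|^2 \varepsilon_{aM}^2 + l_1 \|B_4\|^2. \tag{5.66}$$
Let $\varepsilon_\Gamma = \frac{l_2}{2} \varepsilon_{12M}^2 + \frac{1}{2} \left\|\frac{\varepsilon_H}{\xi_3}\right\|^2 + l_1 \|B_3\|^2 \varepsilon_{aM}^2 + l_1 \|B_4\|^2$. Then one has
$$\dot{\Gamma} \le -(l_3 P - 2 l_1 \|B_1\| - l_1 \|B_2\|^2 - l_1 k^2 - 4 l_1) \|x\|^2 - (l_3 \lambda_{\min}(R) - \|B_3\|^2 l_1) \|u\|^2 - \left(\frac{1}{2} \xi_{2m}^2 - \frac{l_2}{4} \|R^{-1} B_3\|^2 \phi_{cdM}^2\right) \|\tilde{W}_c\|^2 - \left(\frac{l_2}{4} \phi_{am}^2 - l_1 \|B_3\|^2 \phi_{aM}^2\right) \|\tilde{W}_a\|^2 + \varepsilon_\Gamma. \tag{5.67}$$
If $l_1$, $l_2$ and $l_3$ are selected to satisfy (5.55)–(5.57), and
$$\|x\| > \sqrt{\frac{\varepsilon_\Gamma}{l_3 P - 2 l_1 \|B_1\| - l_1 \|B_2\|^2 - l_1 k^2 - 4 l_1}} \tag{5.68}$$
or
$$\|u\| > \sqrt{\frac{\varepsilon_\Gamma}{l_3 \lambda_{\min}(R) - \|B_3\|^2 l_1}} \tag{5.69}$$
or
$$\|\tilde{W}_c\| > \sqrt{\frac{\varepsilon_\Gamma}{\frac{1}{2} \xi_{2m}^2 - \frac{l_2}{4} \|R^{-1} B_3\|^2 \phi_{cdM}^2}} \tag{5.70}$$
or
$$\|\tilde{W}_a\| > \sqrt{\frac{\varepsilon_\Gamma}{\frac{l_2}{4} \phi_{am}^2 - l_1 \|B_3\|^2 \phi_{aM}^2}} \tag{5.71}$$
holds, then $\dot{\Gamma} < 0$. Therefore, according to the standard Lyapunov extension, $x$, $u$, $\tilde{W}_c$ and $\tilde{W}_a$ are UUB.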


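Theorem 5.3 Suppose the hypotheses of Theorem 5.2 hold. If $\|\phi_a\| \le \phi_{aM}$, $\|\nabla \phi_c\| \le \phi_{cdM}$, $\|W_c\| \le W_{cM}$, $\|W_a\| \le W_{aM}$ and $\|\varepsilon_a\| \le \varepsilon_{aM}$, then
(1) $H(x, \hat{W}_a, \hat{W}_c)$ is UUB, where $H(x, \hat{W}_a, \hat{W}_c) = \hat{J}_x^T \left(B_1^T x + B_2^T f(x) + B_3^T \hat{u} + B_4^T\right) + Q(x) + \hat{u}^T R \hat{u}$.
(2) $\hat{u}$ is close to $u$ within a small bound.

Proof (1) According to (5.39) and (5.47), one has
$$H(x, \hat{W}_a, \hat{W}_c) = \nabla \phi_c^T (W_c - \tilde{W}_c) \left(B_1^T x + B_2^T f(x) + B_4^T\right) + \nabla \phi_c^T (W_c - \tilde{W}_c) \left(B_3^T W_a^T \phi_a - B_3^T \tilde{W}_a^T \phi_a\right) + Q(x) + \phi_a^T (W_a - \tilde{W}_a) R (W_a^T - \tilde{W}_a^T) \phi_a. \tag{5.72}$$
Equation (5.72) can be further written as
$$H(x, \hat{W}_a, \hat{W}_c) = \nabla \phi_c^T W_c \left(B_1^T x + B_2^T f(x) + B_4^T\right) - \nabla \phi_c^T \tilde{W}_c \left(B_1^T x + B_2^T f(x) + B_4^T\right) + \nabla \phi_c^T W_c B_3^T W_a^T \phi_a - \nabla \phi_c^T W_c B_3^T \tilde{W}_a^T \phi_a - \nabla \phi_c^T \tilde{W}_c B_3^T W_a^T \phi_a + \nabla \phi_c^T \tilde{W}_c B_3^T \tilde{W}_a^T \phi_a + Q(x) + \phi_a^T W_a R W_a^T \phi_a - \phi_a^T W_a R \tilde{W}_a^T \phi_a - \phi_a^T \tilde{W}_a R W_a^T \phi_a + \phi_a^T \tilde{W}_a R \tilde{W}_a^T \phi_a. \tag{5.73}$$
For a fixed admissible control policy, $H(x, W_a, W_c)$ is bounded. Then one can let $\|H(x, W_a, W_c)\| \le H_B$. Therefore, (5.73) can be written as
$$H(x, \hat{W}_a, \hat{W}_c) = -\nabla \phi_c^T \tilde{W}_c \left(B_1^T x + B_2^T f(x) + B_4^T\right) - \nabla \phi_c^T W_c B_3^T \tilde{W}_a^T \phi_a - \nabla \phi_c^T \tilde{W}_c B_3^T W_a^T \phi_a + \nabla \phi_c^T \tilde{W}_c B_3^T \tilde{W}_a^T \phi_a - \phi_a^T W_a R \tilde{W}_a^T \phi_a - \phi_a^T \tilde{W}_a R W_a^T \phi_a + \phi_a^T \tilde{W}_a R \tilde{W}_a^T \phi_a + H_B. \tag{5.74}$$
Then one has
$$\lim_{t \to \infty} \left\|H(x, \hat{W}_a, \hat{W}_c)\right\| \le \phi_{cdM} (\|B_1\| + \|B_2\| k) \|\tilde{W}_c\| \|x\| + H_B + \phi_{cdM} \left(\|B_4\| + \|B_3\| W_{aM} \phi_{aM}\right) \|\tilde{W}_c\| + \left(\phi_{cdM} W_{cM} \|B_3\| \phi_{aM} + 2 \phi_{aM}^2 W_{aM} \|R\|\right) \|\tilde{W}_a\| + \phi_{cdM} \|B_3\| \phi_{aM} \|\tilde{W}_c\| \|\tilde{W}_a\| + \phi_{aM}^2 \|R\| \|\tilde{W}_a\|^2. \tag{5.75}$$
According to Theorem 5.2, the signals on the right-hand side of (5.75) are UUB; therefore $H(x, \hat{u}, \hat{J}_x)$ is UUB.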


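(2) As
$$\hat{u} - u = \hat{W}_a^T \phi_a - W_a^T \phi_a - \varepsilon_a = -\tilde{W}_a^T \phi_a - \varepsilon_a, \tag{5.76}$$
one can get
$$\lim_{t \to \infty} \|\hat{u} - u\| \le \|\tilde{W}_a\| \phi_{aM} + \varepsilon_{aM}. \tag{5.77}$$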

Equation (5.77) means that uˆ is close to the control input u within a small bound. Remark 5.3 According to Theorem 5.1 and the properties of NN, we can see that the initial weight estimation errors W˜ c and W˜ a are bounded. Therefore, the initial value H is bounded in Theorem 5.3. Remark 5.4 It should be mentioned that Theorems 5.1–5.3 are based on the following fact: the model network (RNN) is trained first, and then the critic and action networks weights are turned according to the steady system model (5.37). In the next theorem, the simultaneous turning method for the model, critic and action networks will be discussed further. Theorem 5.4 Let the optimal control input for (5.16) be provided by (5.47). The update methods for the tunable parameters of Aˆ i in (5.16), i = 1, 2, 3, 4 are given in Theorem 5.1. The weight updating laws of the critic and the action networks are given as ξ1 (ξ1T Wˆ c + r ) , W˙ˆ c = − αc 2 (ξ1T ξ1 + 1) 1 W˙ˆ a = − αa ϕa (Wˆ aT ϕa + R −1 Aˆ 3 ∇ϕcT Wˆ c )T . 2

(5.78) (5.79)

If 1 A5 > Aˆ T1 + Aˆ T2 Aˆ 2 + 2



3 βA k2 + + 2 2 2

 I,

(5.80)

and there exist positive scalars l5 , l6 and l7 satisfying l5
max , P λmin (R)

l6
0, l6 > 0. From Theorem 5.1, one has

l5 tr {W˜ aT W˜ a }, l5 2αa

(5.84) > 0 and Γ6 =

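\[
\begin{aligned}
\dot V \le{}& e_x^T\left(A_1^T + \frac{1}{2}A_2^T A_2 + \left(\frac{k^2}{2} + \frac{1}{2} + \frac{\beta_A}{2}\right) I - A_5\right) e_x \\
& + \|e_x\|\|\tilde A_1\|_F A_{1B} - \|e_x\|\|\tilde A_1\|_F^2 + \|e_x\|\|\tilde A_2\|_F A_{2B} - \|e_x\|\|\tilde A_2\|_F^2 \\
& + \|e_x\|\|\tilde A_3\|_F A_{3B} - \|e_x\|\|\tilde A_3\|_F^2 + \|e_x\|\|\tilde A_4\|_F A_{4B} - \|e_x\|\|\tilde A_4\|_F^2 + d_B\|e_x\|.
\end{aligned}
\tag{5.85}
\]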

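From Theorem 5.2,

\[
\dot\Gamma_4 \le -\frac{1}{2}\xi_{2m}^2\|\tilde W_c\|^2 + \frac{1}{2}\left\|\frac{\varepsilon_H}{\xi_3}\right\|^2, \tag{5.86}
\]

and

\[
\dot\Gamma_5 \le -\frac{l_5}{4}\varphi_{am}^2\|\tilde W_a\|^2 + \frac{l_5}{4}\|R^{-1}B_3\|^2\varphi_{cdM}^2\|\tilde W_c\|^2 + \frac{l_5}{2}\varepsilon_{12M}^2. \tag{5.87}
\]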

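The approximate dynamics model (5.16) with controller (5.47) is

\[
\dot{\hat x} = \hat A_1^T \hat x + \hat A_2^T f(\hat x) + \hat A_3^T u - \hat A_3^T \varepsilon_a - \hat A_3^T \tilde W_a^T \varphi_a + \hat A_4^T + A_5 e_x. \tag{5.88}
\]

Then

\[
\begin{aligned}
\dot\Gamma_6 \le{}& \left(2 l_6\|\hat A_1\| + l_6\|\hat A_2\|^2 + l_6\|\hat A_5\|^2 + l_6 k^2 + 4 l_6 - l_7 P\right)\|x\|^2 + \left(l_6\|\hat A_3\|^2 - l_7\lambda_{\min}(R)\right)\|u\|^2 \\
& + e_x^T e_x + l_6\|\hat A_3\|^2\varphi_{aM}^2\|\tilde W_a\|^2 + l_6\|\hat A_3\|^2\varepsilon_{aM}^2 + l_6\|\hat A_4\|^2.
\end{aligned}
\tag{5.89}
\]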

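Let

\[
d_{V\Gamma} = \frac{1}{2}\left\|\frac{\varepsilon_H}{\xi_3}\right\|^2 + \frac{l_5}{2}\varepsilon_{12M}^2 + l_6\|\hat A_3\|^2\varepsilon_{aM}^2 + l_6\|\hat A_4\|^2,
\]

and

\[
\Theta_A = A_5 - A_1^T - \frac{1}{2}A_2^T A_2 - \left(\frac{3}{2} + \frac{\beta_A}{2} + \frac{k^2}{2}\right) I.
\]

Then

\[
\begin{aligned}
\dot\Theta <{}& -\left(l_7\lambda_{\min}(R) - l_6\|\hat A_3\|^2\right)\|u\|^2 - \left(l_7 P - \Theta_7\right)\|x\|^2 - \Theta_A\|e_x\|^2 \\
& - \left(\frac{1}{2}\xi_{2m}^2 - \frac{l_5}{4}\|R^{-1}\hat A_3\|^2\varphi_{cdM}^2\right)\|\tilde W_c\|^2 - \left(\frac{l_5}{4}\varphi_{am}^2 - l_6\|\hat A_3\|^2\varphi_{aM}^2\right)\|\tilde W_a\|^2 \\
& - \|e_x\|\left(\sum_{i=1}^{4}\left(\|\tilde A_i\|_F^2 - \|\tilde A_i\|_F A_{iB}\right) + d_B\right) + d_{V\Gamma},
\end{aligned}
\tag{5.90}
\]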

which is guaranteed negative as long as

\[
\|e_x\| > \frac{d_{V\Gamma}}{\sum_{i=1}^{4}\left(\|\tilde A_i\|_F^2 - \|\tilde A_i\|_F A_{iB}\right) + d_B}. \tag{5.91}
\]

Thus, $\dot\Theta$ is negative outside a compact set.

Remark 5.5 From Theorem 5.4, it can be seen that once the system model error enters a bounded region, the weights of the critic and action networks will be UUB.

Remark 5.6 In this chapter, a novel multiple actor-critic structure is established to obtain the optimal control for unknown systems. The algorithm is different from the methods in [31, 32]. In [31], optimized adaptive control and trajectory generation were investigated for a class of wheeled inverted pendulum (WIP) models of vehicle systems. In [32], automatic motion control was investigated for WIP models, which have been widely applied to model a large range of modern two-wheeled vehicles. The underactuated WIP model was decomposed into two subsystems: one consists of the planar forward motion and yaw angular motion of the vehicle, and the other describes the pendulum tilt motion. In this chapter, a SIANN is trained to classify previously recorded data into different memory categories, and the optimal control for the data attributes and performance measure in each category is calculated by the multiple actor-critic structure.

5.5 Simulation Study

In this section, two simulations are provided to demonstrate the versatility of the proposed SIANN/multiple ADP structure.

Example 5.1 We consider a simplified model of a continuously stirred tank reactor with an exothermic reaction [33]. The model is given by

\[
\begin{bmatrix} \dot x_1 \\ \dot x_2 \end{bmatrix}
=
\begin{bmatrix} \dfrac{13}{6} & \dfrac{5}{12} \\[2mm] -\dfrac{50}{3} & -\dfrac{8}{3} \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
+
\begin{bmatrix} -2 \\ 0 \end{bmatrix} u, \tag{5.92}
\]

where $x_1$ represents the temperature and $x_2$ represents the concentration of the initial product of the chemical reaction. Define two categories as follows:

\[
y_d =
\begin{cases}
1, & \|x\| \ge 0.1, \\
-1, & \|x\| < 0.1.
\end{cases}
\tag{5.93}
\]

For the two categories, define the respective utility functions as

\[
r(x, u) =
\begin{cases}
x^T Q x + u^T R_1 u, & y_d = 1, \\
10\log_{10}(x_1 + x_2)^2 + u^T R_2 u, & y_d = -1,
\end{cases}
\tag{5.94}
\]

where $Q = I_2$, $R_1 = 1$ and $R_2 = 100$. For system (5.92), an admissible control is used to obtain process response data, which are used as historical input-output data to train the SIANN and the model RNN. The norm of the state as a function of time is shown in Fig. 5.3.

[Fig. 5.3 The state norm: $\|x\|$ versus time steps]

First, the SIANN classifier is trained. The SIANN neuron output is (5.1) with $w_{j0} = 0$ and $c_{j0} = 0$, and $g_h(\cdot)$ and $f_h(\cdot)$ are taken as sigmoid functions. Let the initial values of $a_j$ and $b_j$ be selected randomly in $(-0.1, 0.1)$. Let the initial weights $w_j$ and $c_j$ be selected randomly in $(-1, 1)$. Let the initial parameter in the output of the SIANN be $d = 1$, and let the initial weight $v$ be selected randomly in $(-1, 1)$. The number of shunting neurons is 10. The SIANN is trained using the historical data generated above and the parameter tuning algorithms of the previous section. When the historical data are fed into the trained SIANN, the classification results are as shown in Fig. 5.4. Note that, in Fig. 5.3, the norm of the state falls below 0.1 at 300 time steps. This corresponds to the change of category observed in Fig. 5.4.

[Fig. 5.4 The SIANN training result: category labels $y_d$ and $y$ versus time steps]

Second, the RNN estimation model, which is used to estimate the system dynamics, is trained. Let the initial elements of the matrices $\hat A_i$, $i = 1, 2, 3, 4$, be selected in $(-0.5, 0.5)$. Let $A_5 = \begin{bmatrix} 2 & 2 \\ 2 & 2 \end{bmatrix}$ and $\beta_i = 0.05$, $i = 1, 2, 3, 4$. Let the activation function be $f(\cdot) = \mathrm{tansig}(\cdot)$. The stored data generated as in Fig. 5.3 are used to tune the model RNN. The RNN training error is shown in Fig. 5.5.

Finally, the optimal controller is simulated based on the trained SIANN classifier and RNN estimation model. The structures of the critic network and the action network are both 2-8-1 in each category. Let the initial weights of the critic and action networks for the two categories be selected in $(-0.7, 0.7)$. Let the activation function be the hyperbolic function. Let the learning rate of the critic and action networks be 0.02. The trained SIANN is used to classify the system state obtained by the optimal controller. To increase the accuracy, the weights of the SIANN are modified online. The classification result is shown in Fig. 5.6. Once the state belongs to one category, the RNN is used to get the system state. The test RNN estimation error is shown in Fig. 5.7.
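The model (5.92), the category rule (5.93) and the category-dependent utility (5.94) of Example 5.1 can be written down in a few lines. The following is only a minimal sketch, not the authors' simulation code: the function names, the forward-Euler step and its step size `dt` are illustrative assumptions, and the logarithmic term is read as $10\log_{10}\big((x_1+x_2)^2\big)$, which is undefined when $x_1 + x_2 = 0$.

```python
import numpy as np

# Linear CSTR model (5.92): x_dot = A x + B u
A = np.array([[13.0 / 6.0, 5.0 / 12.0],
              [-50.0 / 3.0, -8.0 / 3.0]])
B = np.array([-2.0, 0.0])

Q = np.eye(2)
R1, R2 = 1.0, 100.0


def category(x):
    """Category label y_d from (5.93)."""
    return 1 if np.linalg.norm(x) >= 0.1 else -1


def utility(x, u):
    """Category-dependent utility (5.94); u is a scalar control."""
    if category(x) == 1:
        return float(x @ Q @ x + R1 * u * u)
    # Assumption: the log term is read as 10*log10((x1 + x2)^2);
    # it is undefined when x1 + x2 = 0.
    return float(10.0 * np.log10((x[0] + x[1]) ** 2) + R2 * u * u)


def euler_step(x, u, dt=0.01):
    """One forward-Euler step of (5.92); dt is an illustrative choice."""
    return x + dt * (A @ x + B * u)
```

For instance, `category(np.array([0.3, 0.4]))` falls in the category $y_d = 1$ because the state norm is 0.5, so the quadratic utility is used there.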

[Fig. 5.5 The training error of RNN: $e_x(1)$ and $e_x(2)$ versus time steps]

[Fig. 5.6 The test classification result: $y$ and $y_d$ versus time steps]

After 600 time steps, the convergent trajectory of $W_c$ is obtained, which is shown in Fig. 5.8. The control and state trajectories are shown in Figs. 5.9 and 5.10. The simulation results show that the proposed optimal control method for unknown systems operates properly.

[Fig. 5.7 The test RNN error: $e_x(1)$ and $e_x(2)$ versus time steps]

[Fig. 5.8 The convergent trajectories of the critic weights: $W_c(1)$ to $W_c(8)$ versus time steps]

Example 5.2 Consider the nonlinear oscillator [34]:

\[
\begin{aligned}
\dot x_1 &= x_1 + x_2 - x_1\left(x_1^2 + x_2^2\right), \\
\dot x_2 &= -x_1 + x_2 - x_2\left(x_1^2 + x_2^2\right) + u.
\end{aligned}
\tag{5.95}
\]

[Fig. 5.9 The control trajectory: control versus time steps]

[Fig. 5.10 The state trajectories: $x_1$ and $x_2$ versus time steps]

Define the performance measure functions as

\[
r(x, u) =
\begin{cases}
x^T Q_1 x + u^T R_1 u, & y_d = 1, \\
\|x\| + x_1^2 + u^T R_1 u, & y_d = -1.
\end{cases}
\tag{5.96}
\]

[Fig. 5.11 The training result for SIANN: category labels $y_d$, $y$ and error $e$ versus time steps]

Define the category label as

\[
y_d =
\begin{cases}
1, & x_1 > 0.3, \text{ or } x_2 > 0.3, \text{ or } x_1 \le -0.3, \text{ or } x_2 \le -0.3, \\
-1, & -0.3 \le x_1 < 0.3 \text{ and } -0.3 \le x_2 < 0.3.
\end{cases}
\tag{5.97}
\]
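The oscillator (5.95), the utility (5.96) and the label (5.97) can be sketched in the same way. This is an illustrative sketch only: the values $Q_1 = I$ and $R_1 = 1$ are assumptions (the example does not restate them), and the boundary points of (5.97) are resolved here by treating the half-open box $[-0.3, 0.3) \times [-0.3, 0.3)$ as the category $y_d = -1$ and everything else as $y_d = 1$.

```python
import numpy as np


def oscillator(x, u):
    """Right-hand side of the nonlinear oscillator (5.95)."""
    x1, x2 = x
    s = x1 ** 2 + x2 ** 2
    return np.array([x1 + x2 - x1 * s,
                     -x1 + x2 - x2 * s + u])


def category(x):
    """Label (5.97): -1 inside the box [-0.3, 0.3) x [-0.3, 0.3), else 1.

    Boundary handling is an assumption of this sketch.
    """
    x1, x2 = x
    inside = (-0.3 <= x1 < 0.3) and (-0.3 <= x2 < 0.3)
    return -1 if inside else 1


def utility(x, u, Q1=np.eye(2), R1=1.0):
    """Utility (5.96); Q1 = I and R1 = 1 are assumed values."""
    if category(x) == 1:
        return float(x @ Q1 @ x + R1 * u * u)
    return float(np.linalg.norm(x) + x[0] ** 2 + R1 * u * u)
```

Switching the utility on `category(x)` mirrors the idea of the chapter that each memory category carries its own performance measure.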

For system (5.95), an admissible control is used to obtain the historical data of the system state. Based on these data, the SIANN classifier and the RNN are trained.

First, the SIANN is trained. The parameters are similar to those in Example 5.1. The SIANN training result is shown in Fig. 5.11.

Second, the RNN estimation model is trained, based on the classified states. Let the initial parameters $\hat A_i$, $i = 1, 2, 3, 4$, of the RNN be selected in $(-0.2, 0.2)$ and $A_5 = [2, 2; 2, 2]$. Let $\beta_i = 0.05$, $i = 1, 2, 3, 4$. The RNN training error is shown in Fig. 5.12.

Finally, the optimal controller is established based on the trained SIANN classifier and RNN estimation model. The structures of the action network and the critic network are both 2-8-1. Let the initial weights $W_a$ and $W_c$ be selected in $(-1, 1)$ and the activation function be the sigmoid function. The trained SIANN is used to classify the system state online, and the classification result is shown in Fig. 5.13. Once the state belongs to one category, the RNN is used to get the system state. The RNN estimation error is shown in Fig. 5.14. The control and state trajectories are shown in Figs. 5.15 and 5.16. From the simulation results, it can be seen that the proposed control method for unknown systems is effective.

[Fig. 5.12 The RNN training error: $e_x(1)$ and $e_x(2)$ versus time steps]

[Fig. 5.13 The test classification result: test $y$ and test $y_d$ versus time steps]

[Fig. 5.14 The RNN estimate error: $e_x(1)$ and $e_x(2)$ versus time steps]

[Fig. 5.15 The control trajectory: control versus time steps]

[Fig. 5.16 The state trajectories: $x_1$ and $x_2$ versus time steps]

5.6 Conclusions

This chapter proposed multiple actor-critic structures to obtain the optimal control of unknown systems from input-output data. First, the input-output data were classified into several categories by the SIANN, and a performance measure function was defined for each category. Then the optimal controller was obtained by the ADP algorithm. The RNN was used to reconstruct the unknown system dynamics using input-output data. Neural networks were used to implement the critic network and the action network, respectively. It was proven that the model error and the closed-loop system are UUB. Simulation results demonstrated the effectiveness of the proposed optimal control scheme for unknown nonlinear systems.

References

1. Levine, D., Ramirez Jr., P.: An attentional theory of emotional influences on risky decisions. Prog. Brain Res. 202(2), 369–388 (2013)
2. Levine, D., Mills, B., Estrada, S.: Modeling emotional influences on human decision making under risk. In: Proceedings of International Joint Conference on Neural Networks, pp. 1657–1662 (2005)
3. Werbos, P.: Intelligence in the brain: a theory of how it works and how to build it. Neural Netw. 22, 200–212 (2009)
4. Werbos, P.: Stable adaptive control using new critic designs. In: Proceedings of Adaptation, Noise, and Self-Organizing Systems (1998)
5. Narendra, K., Balakrishnan, J.: Adaptive control using multiple models. IEEE Trans. Autom. Control 42(2), 171–187 (1997)
6. Sugimoto, N., Morimoto, J., Hyon, S., Kawato, M.: The eMOSAIC model for humanoid robot control. Neural Netw. 29–30, 8–19 (2012)
7. Doya, K.: What are the computations of the cerebellum, the basal ganglia and the cerebral cortex? Neural Netw. 12(7–8), 961–974 (1999)
8. Hikosaka, O., Nakahara, H., Rand, M., Sakai, K., Lu, X., Nakamura, K., Miyachi, S., Doya, K.: Parallel neural networks for learning sequential procedures. Trends Neurosci. 22(10), 464–471 (1999)
9. Lee, J., Lee, J.: Approximate dynamic programming-based approaches for input-output data-driven control of nonlinear processes. Automatica 41(7), 1281–1288 (2005)
10. Song, R., Xiao, W., Zhang, H.: Multi-objective optimal control for a class of unknown nonlinear systems based on finite-approximation-error ADP algorithm. Neurocomputing 119(7), 212–221 (2013)
11. Li, H., Liu, D., Wang, D.: Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3), 706–714 (2014)
12. Yang, X., Liu, D., Huang, Y.: Neural-network-based online optimal control for uncertain nonlinear continuous-time systems with control constraints. IET Control Theory Appl. 7(17), 2037–2047 (2013)
13. Lewis, F., Vamvoudakis, K.: Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 41(1), 14–25 (2011)
14. Li, Z., Duan, Z., Lewis, F.: Distributed robust consensus control of multi-agent systems with heterogeneous matching uncertainties. Automatica 50(3), 883–889 (2014)
15. Modares, H., Lewis, F., Naghibi-Sistani, M.: Integral reinforcement learning and experience replay for adaptive optimal control of partially unknown constrained-input continuous-time systems. Automatica 50(1), 193–202 (2014)
16. Zhang, H., Lewis, F.: Adaptive cooperative tracking control of higher-order nonlinear systems with unknown dynamics. Automatica 48(7), 1432–1439 (2012)
17. Wei, Q., Liu, D.: Adaptive dynamic programming for optimal tracking control of unknown nonlinear systems with application to coal gasification. IEEE Trans. Autom. Sci. Eng. 11(4), 1020–1036 (2014)
18. Doya, K., Samejima, K., Katagiri, K., Kawato, M.: Multiple model-based reinforcement learning. Neural Comput. 14, 1347–1369 (2002)
19. Levine, D.: Neural dynamics of affect, gist, probability, and choice. Cogn. Syst. Res. 15–16, 57–72 (2012)
20. Werbos, P.: Using ADP to understand and replicate brain intelligence: the next level design. IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning, pp. 209–216 (2007)
21. Arulampalam, G., Bouzerdoum, A.: A generalized feedforward neural network architecture for classification and regression. Neural Netw. 16, 561–568 (2003)
22. Bouzerdoum, A.: Classification and function approximation using feedforward shunting inhibitory artificial neural networks. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, vol. 6, pp. 613–618 (2000)
23. Tivive, F., Bouzerdoum, A.: Efficient training algorithms for a class of shunting inhibitory convolutional neural networks. IEEE Trans. Neural Netw. 16(3), 541–556 (2005)
24. Song, R., Lewis, F., Wei, Q., Zhang, H.: Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans. Cybern. 46(5), 1041–1050 (2016)
25. Hornik, K., Stinchcombe, M., White, H., Auer, P.: Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Comput. 6(6), 1262–1275 (1994)
26. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking control for unknown general nonlinear systems using adaptive dynamic programming method. IEEE Trans. Neural Netw. 22(12), 2226–2236 (2011)
27. Kim, Y., Lewis, F.: Neural network output feedback control of robot manipulators. IEEE Trans. Robot. Autom. 15(2), 301–309 (1999)
28. Khalil, H.: Nonlinear Systems. Prentice-Hall, NJ (2002)
29. Lewis, F., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and Nonlinear Systems. Taylor and Francis, London (1999)
30. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
31. Yang, C., Li, Z., Li, J.: Trajectory planning and optimized adaptive control for a class of wheeled inverted pendulum vehicle models. IEEE Trans. Cybern. 43(1), 24–36 (2013)
32. Yang, C., Li, Z., Cui, R., Xu, B.: Neural network-based motion control of an underactuated wheeled inverted pendulum model. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 2004–2016 (2014)
33. Beard, R.: Improving the Closed-Loop Performance of Nonlinear Systems. Ph.D. thesis, Rensselaer Polytechnic Institute, Troy, NY (1995)
34. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41, 779–791 (2005)

Chapter 6

Optimal Control for a Class of Complex-Valued Nonlinear Systems

In this chapter, an optimal control scheme based on ADP is developed to solve infinite-horizon optimal control problems of continuous-time complex-valued nonlinear systems. A new performance index function is established based on the complex-valued state and control. Using system transformations, the complex-valued system is transformed into a real-valued one, which effectively avoids the Cauchy–Riemann conditions. Based on the transformed system and the performance index function, a new ADP method is developed to obtain the optimal control law using neural networks. A compensation controller is developed to compensate for the approximation errors of the neural networks. Stability properties of the nonlinear system are analyzed, and convergence properties of the neural network weights are presented. Finally, simulation results demonstrate the performance of the developed optimal control scheme for complex-valued nonlinear systems.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019. R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_6

6.1 Introduction

In many science problems and engineering applications, the parameters and signals are complex-valued [1, 2], such as in quantum systems [3] and complex-valued neural networks [4]. In [5], complex-valued filters were proposed for complex signals and systems. In [6], a complex-valued B-spline neural network was proposed to model the complex-valued Wiener system. In [7], a complex-valued pipelined recurrent neural network was proposed for nonlinear adaptive prediction of complex nonlinear and nonstationary signals. In [8], the output feedback stabilization of complex-valued reaction-advection-diffusion systems was studied. In [4], the global asymptotic stability of delayed complex-valued recurrent neural networks was studied. In [9], a reinforcement learning algorithm with complex-valued functions was proposed. In the investigation of complex-valued systems, many system designers want to find the optimal value of the complex-valued parameters or controlled variables by optimizing a chosen performance index function [10].

From the previous discussions, it can be seen that the optimal control schemes based on ADP are restricted to real-valued systems. In many real-world systems, however, the system states and controls are complex-valued [11]. As there exist inherent differences between real-valued systems and complex-valued ones, the ADP methods for real-valued systems cannot directly solve the optimal control problems of complex-valued systems. To the best of our knowledge, there are no discussions on ADP for complex-valued systems. Therefore, a novel ADP method for complex-valued systems is eagerly anticipated. This motivates our research.

In this chapter, an ADP-based optimal control scheme for complex-valued systems is developed for the first time. First, a new performance index function is defined based on the complex-valued state and control. Second, using system transformations, the complex-valued system is transformed into a real-valued one, so that the Cauchy–Riemann conditions can effectively be avoided. Then, a new ADP method is developed to obtain the optimal control law of the nonlinear systems. Neural networks, including critic and action networks, are employed to implement the developed ADP method. A compensation control method is established to overcome the approximation errors of the neural networks. It is proven that the developed control scheme makes the closed-loop system uniformly ultimately bounded (UUB). It is also shown that the weights of the neural networks converge to a finite neighborhood of the optimal weights. Finally, simulation studies are given to show the effectiveness of the developed control scheme.

6.2 Motivations and Preliminaries

Consider the following complex-valued nonlinear system

\[
\dot z = f(z) + g(z)u, \tag{6.1}
\]

where $z \in \mathbb{C}^n$ is the system state, $f(z) \in \mathbb{C}^n$ and $f(0) = 0$. Let $g(z) \in \mathbb{C}^{n \times n}$ be a bounded input gain, i.e., $\|g(\cdot)\| \le \bar g$, where $\bar g$ is a positive constant and $\|\cdot\|$ stands for the 2-norm unless otherwise stated. Let $u \in \mathbb{C}^n$ be the control vector. Let $z_0$ be the initial state. For system (6.1), the infinite-horizon performance index function is defined as

\[
J(z) = \int_t^{\infty} \bar r(z(\tau), u(\tau))\,d\tau, \tag{6.2}
\]

where the utility function is $\bar r(z, u) = z^H Q_1 z + u^H R_1 u$. Let $Q_1$ and $R_1$ be diagonal positive definite matrices. Let $z^H$ and $u^H$ denote the complex conjugate transposes of $z$ and $u$, respectively.


The aim of this chapter is to obtain the optimal control of the complex-valued nonlinear system (6.1). To this end, the following assumptions are necessary.

Assumption 6.1 [4] Let $i^2 = -1$ and $z = x + iy$, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$. If $C(z) \in \mathbb{C}^n$ is a complex-valued function, then it can be expressed as $C(z) = C^R(x, y) + iC^I(x, y)$, where $C^R(x, y) \in \mathbb{R}^n$ and $C^I(x, y) \in \mathbb{R}^n$.

Assumption 6.2 Let $f(z) = (f_1(z), f_2(z), \ldots, f_n(z))^T$ and $f_j(z) = f_j^R(x, y) + i f_j^I(x, y)$, $j = 1, 2, \ldots, n$. The partial derivatives of $f_j(z)$ with respect to $x$ and $y$ satisfy

\[
\left\|\frac{\partial f_j^R}{\partial x}\right\|_1 \le \lambda_j^{RR}, \quad
\left\|\frac{\partial f_j^R}{\partial y}\right\|_1 \le \lambda_j^{RI}, \quad
\left\|\frac{\partial f_j^I}{\partial x}\right\|_1 \le \lambda_j^{IR}, \quad
\left\|\frac{\partial f_j^I}{\partial y}\right\|_1 \le \lambda_j^{II},
\]

where $\lambda_j^{RR}$, $\lambda_j^{RI}$, $\lambda_j^{IR}$ and $\lambda_j^{II}$ are positive constants. Let $\|\cdot\|_1$ stand for the 1-norm.

Based on the above preparations, the system transformation for system (6.1) is now given. Let $f(z) = f^R(x, y) + i f^I(x, y)$, $g(z) = g^R(x, y) + i g^I(x, y)$ and $u = u^R + iu^I$. Then, system (6.1) can be written as

\[
\dot x + i\dot y = f^R(x, y) + i f^I(x, y) + \left(g^R(x, y) + i g^I(x, y)\right)\left(u^R + iu^I\right). \tag{6.3}
\]

Let

\[
\eta = \begin{bmatrix} x \\ y \end{bmatrix}, \quad
\nu = \begin{bmatrix} u^R \\ u^I \end{bmatrix}, \quad
F(\eta) = \begin{bmatrix} f^R(x, y) \\ f^I(x, y) \end{bmatrix}, \quad
G(\eta) = \begin{bmatrix} g^R(x, y) & -g^I(x, y) \\ g^I(x, y) & g^R(x, y) \end{bmatrix}.
\]

Then, we can obtain

\[
\dot\eta = F(\eta) + G(\eta)\nu, \tag{6.4}
\]

where $\eta \in \mathbb{R}^{2n}$, $F(\eta) \in \mathbb{R}^{2n}$, $G(\eta) \in \mathbb{R}^{2n \times 2n}$ and $\nu \in \mathbb{R}^{2n}$. From (6.4) we can see that $F(0) = 0$.

Remark 6.1 In fact, the system transformations between system (6.1) and system (6.4) are equivalent and reversible, which can be seen from the following chain of equivalences:

\[
\begin{aligned}
& \dot\eta = F(\eta) + G(\eta)\nu \\
\Leftrightarrow{}& \begin{bmatrix} \dot x \\ \dot y \end{bmatrix} = \begin{bmatrix} f^R(x, y) + g^R(x, y)u^R - g^I(x, y)u^I \\ f^I(x, y) + g^I(x, y)u^R + g^R(x, y)u^I \end{bmatrix} \\
\Leftrightarrow{}& \dot x + i\dot y = f^R(x, y) + i f^I(x, y) + \left(g^R(x, y) + i g^I(x, y)\right)u^R + \left(i g^I(x, y) + g^R(x, y)\right) i u^I \\
\Leftrightarrow{}& \dot x + i\dot y = f^R(x, y) + i f^I(x, y) + \left(g^R(x, y) + i g^I(x, y)\right)\left(u^R + iu^I\right) \\
\Leftrightarrow{}& \dot z = f(z) + g(z)u.
\end{aligned}
\]

Therefore, if the optimal control for (6.4) is acquired, then the optimal control problem of system (6.1) is also solved. In the following section, the optimal control scheme of system (6.4) will be developed based on the continuous-time ADP method.
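The equivalence in Remark 6.1 can be checked numerically at a single state. The sketch below is illustrative only: the helper names are hypothetical, and `f`, `g` and `u` stand for the values of $f(z)$, $g(z)$ and $u$ at one point rather than for the functions themselves. It stacks real and imaginary parts as in (6.4) and compares the result with the complex right-hand side of (6.1).

```python
import numpy as np


def to_real(z):
    """Stack real and imaginary parts: eta = [x; y] for z = x + i y."""
    return np.concatenate([z.real, z.imag])


def to_complex(eta):
    """Inverse map: recover z = x + i y from eta = [x; y]."""
    n = eta.size // 2
    return eta[:n] + 1j * eta[n:]


def real_gain(g):
    """G = [[g^R, -g^I], [g^I, g^R]] for a complex matrix g."""
    return np.block([[g.real, -g.imag],
                     [g.imag, g.real]])


def complex_rhs(f, g, u):
    """z_dot = f + g u for complex values f, g, u at one state, as in (6.1)."""
    return f + g @ u


def real_rhs(f, g, u):
    """eta_dot = F + G nu for the same data, as in (6.4)."""
    return to_real(f) + real_gain(g) @ to_real(u)
```

For any complex `f`, `g` and `u` of compatible sizes, `to_complex(real_rhs(f, g, u))` should agree with `complex_rhs(f, g, u)` up to rounding, which is the reversibility stated in the remark.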


Let $Q = \mathrm{diag}(Q_1, Q_1)$ and $R = \mathrm{diag}(R_1, R_1)$. According to the definition of $\bar r(z, u)$, the utility function can be expressed as

\[
r(\eta, \nu) = \eta^T Q \eta + \nu^T R \nu. \tag{6.5}
\]

According to (6.5), the performance index function (6.2) can be expressed as

\[
J(\eta) = \int_t^{\infty} r(\eta(\tau), \nu(\tau))\,d\tau. \tag{6.6}
\]

For an arbitrary admissible control law $\nu$, if the associated performance index function $J(\eta)$ is given in (6.6), then an infinitesimal version of (6.6) is the so-called nonlinear Lyapunov equation [12]

\[
0 = J_\eta^T\left(F(\eta) + G(\eta)\nu\right) + r(\eta, \nu), \tag{6.7}
\]

where $J_\eta = \dfrac{\partial J}{\partial \eta}$ is the partial derivative of the performance index function $J$. Define the optimal performance index function as

\[
J^*(\eta) = \min_{\nu \in U} \int_t^{\infty} r(\eta(\tau), \nu(\eta(\tau)))\,d\tau, \tag{6.8}
\]

where $U$ is a set of admissible control laws. Defining the Hamiltonian function as

\[
H(\eta, \nu, J_\eta) = J_\eta^T\left(F(\eta) + G(\eta)\nu\right) + r(\eta, \nu), \tag{6.9}
\]

the optimal performance index function $J^*(\eta)$ satisfies

\[
0 = \min_{\nu \in U} H(\eta, \nu, J_\eta^*). \tag{6.10}
\]

According to (6.9) and (6.10), the optimal control law can be expressed as

\[
\nu^*(\eta) = -\frac{1}{2} R^{-1} G^T(\eta) J_\eta^*(\eta). \tag{6.11}
\]
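The step from (6.9) and (6.10) to (6.11) can be spelled out as a stationarity condition. The short derivation below is a sketch under the assumptions that $R$ is symmetric positive definite (it is diagonal positive definite here) and that the minimizer lies in the interior of $U$.

```latex
\begin{align*}
H(\eta, \nu, J_\eta^*) &= J_\eta^{*T}\left(F(\eta) + G(\eta)\nu\right)
  + \eta^T Q \eta + \nu^T R \nu, \\
\frac{\partial H}{\partial \nu} &= G^T(\eta) J_\eta^*(\eta) + 2R\nu = 0
  \quad\Longrightarrow\quad
  \nu^*(\eta) = -\tfrac{1}{2} R^{-1} G^T(\eta) J_\eta^*(\eta), \\
\frac{\partial^2 H}{\partial \nu^2} &= 2R > 0 .
\end{align*}
```

The positive definite second derivative indicates that the stationary point is the minimizer of the Hamiltonian with respect to $\nu$.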

Remark 6.2 In this chapter, the system transformations between (6.1) and (6.4) are necessary, because the optimal control cannot be obtained from (6.1) and (6.2) directly. For example, if the optimal control is calculated from (6.1) and (6.2), then according to the Bellman optimality principle we have $u = -\dfrac{1}{2} R_1^{-1} g^H(z) J_z(z)$. Let $z = x + iy$ and $J = J^R + iJ^I$. The partial derivative $J_z = \dfrac{\partial J}{\partial z}$ exists only if the Cauchy–Riemann conditions are satisfied, i.e.,

\[
\frac{\partial J^R}{\partial x} = \frac{\partial J^I}{\partial y} \quad \text{and} \quad \frac{\partial J^R}{\partial y} = -\frac{\partial J^I}{\partial x}.
\]

As the performance index function $J$ is real-valued, $J^I = 0$, which means

\[
\frac{\partial J^R}{\partial x} = \frac{\partial J^I}{\partial y} = 0 \quad \text{and} \quad \frac{\partial J^R}{\partial y} = -\frac{\partial J^I}{\partial x} = 0.
\]

Therefore

\[
\frac{\partial J}{\partial z} = \frac{\partial J^R}{\partial x} + i\frac{\partial J^I}{\partial x} = \frac{\partial J^I}{\partial y} - i\frac{\partial J^R}{\partial y} = 0.
\]

Thus the optimal control of the complex-valued system (6.1) would be $u = 0$, which is obviously invalid. If the complex-valued system (6.1) is transformed into (6.4), then the Cauchy–Riemann conditions are avoided. Therefore, the optimal control of system (6.1) can be obtained from the transformed system (6.4) and the performance index function (6.6). In the next section, the ADP-based optimal control method is given in detail.

6.3 ADP-Based Optimal Control Design

In this section, neural networks are introduced to implement the optimal control method. Let the number of hidden-layer neurons of the neural network be $L$. Let the weight matrix between the input layer and the hidden layer be $Y$, let the weight matrix between the hidden layer and the output layer be $W$, and let the input vector be $X$. Then, the output of the neural network is represented as $F_N(X, Y, W) = W^T \sigma(YX)$, where $\sigma(YX)$ is the activation function. For convenience of analysis, only the output weight $W$ is updated, while the hidden weight is kept fixed [13]. Hence, the neural network function can be simplified to $F_N(X, W) = W^T \bar\sigma(X)$, where $\bar\sigma(X) = \sigma(YX)$. There are two neural networks in the ADP method, namely the critic network and the action network. In the following subsections, the detailed design methods for the critic and action networks are given.

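6.3.1 Critic Network

The critic network is utilized to approximate the performance index function $J(z)$, and the ideal critic network is expressed as $J(z) = \bar W_c^H \bar\varphi_c(z) + \varepsilon_c$, where $\bar W_c \in \mathbb{C}^{n_{c1}}$ is the ideal critic network weight matrix. Let $\bar\varphi_c(z) \in \mathbb{C}^{n_{c1}}$ be the activation function and let $\varepsilon_c \in \mathbb{R}$ be the approximation error of the critic network. Let $\bar W_c = \bar W_c^R + i\bar W_c^I$ and $\bar\varphi_c = \bar\varphi_c^R + i\bar\varphi_c^I$. Then, we have

\[
\begin{aligned}
J(\eta) &= \left(\bar W_c^R - i\bar W_c^I\right)^T\left(\bar\varphi_c^R + i\bar\varphi_c^I\right) + \varepsilon_c \\
&= \bar W_c^{RT}\bar\varphi_c^R + \bar W_c^{IT}\bar\varphi_c^I + i\left(\bar W_c^{RT}\bar\varphi_c^I - \bar W_c^{IT}\bar\varphi_c^R\right) + \varepsilon_c.
\end{aligned}
\tag{6.12}
\]

Let $W_c = \begin{bmatrix} \bar W_c^R \\ \bar W_c^I \end{bmatrix}$ and $\varphi_c = \begin{bmatrix} \bar\varphi_c^R \\ \bar\varphi_c^I \end{bmatrix}$. As $J(\eta)$ is a real-valued function, the imaginary part in (6.12) vanishes and we can get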


\[
J(\eta) = W_c^T \varphi_c(\eta) + \varepsilon_c. \tag{6.13}
\]

Thus the derivative of $J(\eta)$ is written as

\[
J_\eta = \nabla\varphi_c^T(\eta) W_c + \nabla\varepsilon_c, \tag{6.14}
\]

where $\nabla\varphi_c(\eta) = \dfrac{\partial\varphi_c(\eta)}{\partial\eta}$ and $\nabla\varepsilon_c = \dfrac{\partial\varepsilon_c}{\partial\eta}$. According to (6.13), the Hamiltonian function (6.9) can be expressed as

\[
H(\eta, \nu, W_c) = W_c^T \nabla\varphi_c (F + G\nu) + r(\eta, \nu) - \varepsilon_H, \tag{6.15}
\]

where $\varepsilon_H$ is the residual error, which is expressed as $\varepsilon_H = -\nabla\varepsilon_c^T (F + G\nu)$. Let $\hat W_c$ be the estimate of $W_c$; then the output of the critic network is

\[
\hat J(\eta) = \hat W_c^T \varphi_c(\eta). \tag{6.16}
\]

Hence, the Hamiltonian function (6.9) can be approximated by

\[
H(\eta, \nu, \hat W_c) = \hat W_c^T \nabla\varphi_c (F + G\nu) + r(\eta, \nu). \tag{6.17}
\]

Then, we can define the weight estimation error of the critic network as

\[
\tilde W_c = W_c - \hat W_c. \tag{6.18}
\]

Note that, for a fixed admissible control law $\nu$, the Hamiltonian function (6.9) becomes $H(\eta, \nu, J_\eta) = 0$, which means $H(\eta, \nu, W_c) = 0$. Therefore, from (6.15), we have

\[
\varepsilon_H = W_c^T \nabla\varphi_c (F + G\nu) + r(\eta, \nu). \tag{6.19}
\]

Let $e_c = H(\eta, \nu, \hat W_c) - H(\eta, \nu, W_c)$. We can obtain

\[
e_c = -\tilde W_c^T \nabla\varphi_c(\eta)\left(F(\eta) + G(\eta)\nu\right) + \varepsilon_H. \tag{6.20}
\]

It is desired to select $\hat W_c$ to minimize the squared residual error $E_c = \dfrac{1}{2} e_c^T e_c$. A normalized gradient descent algorithm is used to train the critic network [12]. Then, the weight update rule of the critic network is derived as

\[
\dot{\hat W}_c = -\alpha_c \frac{\partial E_c}{\partial \hat W_c} = -\alpha_c \frac{\xi_1\left(\xi_1^T \hat W_c + r(\eta, \nu)\right)}{\left(\xi_1^T \xi_1 + 1\right)^2}, \tag{6.21}
\]

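where $\alpha_c > 0$ is the learning rate of the critic network and $\xi_1 = \nabla\varphi_c (F + G\nu)$. This is a modified Levenberg–Marquardt algorithm, i.e., $(\xi_1^T \xi_1 + 1)$ is replaced by $(\xi_1^T \xi_1 + 1)^2$, which is used for normalization and will be required in the proofs [12]. Let $\xi_2 = \dfrac{\xi_1}{\xi_3}$ and $\xi_3 = \xi_1^T \xi_1 + 1$. We have

\[
\begin{aligned}
\dot{\tilde W}_c &= \alpha_c \frac{\xi_1\left(\xi_1^T \hat W_c + r\right)}{\xi_3^2} \\
&= -\alpha_c \xi_2 \xi_2^T \tilde W_c + \alpha_c \frac{\xi_1\left(\xi_1^T W_c + r\right)}{\xi_3^2} \\
&= -\alpha_c \xi_2 \xi_2^T \tilde W_c + \alpha_c \xi_2 \frac{\varepsilon_H}{\xi_3}.
\end{aligned}
\tag{6.22}
\]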
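To make the critic law (6.21) concrete, one forward-Euler step of it can be sketched as follows. The discretization, the step size `dt` and the function name are assumptions made for illustration; the law in the text is stated in continuous time.

```python
import numpy as np


def critic_step(W_hat, xi1, r, alpha_c, dt):
    """One forward-Euler step of the critic law (6.21).

    W_hat : current critic weight estimate, shape (n_c,)
    xi1   : grad(phi_c) @ (F + G nu), shape (n_c,)
    r     : utility r(eta, nu), scalar
    """
    e_c = xi1 @ W_hat + r                 # approximate Hamiltonian (6.17)
    xi3 = xi1 @ xi1 + 1.0                 # normalization term xi_3
    W_dot = -alpha_c * xi1 * e_c / xi3 ** 2
    return W_hat + dt * W_dot
```

Because the update moves $\hat W_c$ against the direction $\xi_1$ scaled by the Hamiltonian residual, a sufficiently small step reduces the magnitude of that residual for fixed $\xi_1$ and $r$.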

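6.3.2 Action Network

The action network is used to obtain the control law $u$. The ideal expression of the action network is $u = \bar W_a^T \bar\varphi_a(z) + \bar\varepsilon_a$, where $\bar W_a \in \mathbb{C}^{n_{a1} \times n}$ is the ideal weight matrix of the action network. Let $\bar\varphi_a(z) \in \mathbb{C}^{n_{a1}}$ be the activation function and let $\bar\varepsilon_a \in \mathbb{C}^n$ be the approximation error of the action network. Let $\bar W_a = \bar W_a^R + i\bar W_a^I$, $\bar\varphi_a = \bar\varphi_a^R + i\bar\varphi_a^I$ and $\bar\varepsilon_a = \bar\varepsilon_a^R + i\bar\varepsilon_a^I$. Let

\[
W_a^T = \begin{bmatrix} \bar W_a^{RT} & -\bar W_a^{IT} \\ \bar W_a^{IT} & \bar W_a^{RT} \end{bmatrix}, \quad
\varphi_a = \begin{bmatrix} \bar\varphi_a^R \\ \bar\varphi_a^I \end{bmatrix}, \quad
\varepsilon_a = \begin{bmatrix} \bar\varepsilon_a^R \\ \bar\varepsilon_a^I \end{bmatrix}.
\]

We have

\[
\nu = W_a^T \varphi_a(\eta) + \varepsilon_a. \tag{6.23}
\]

The output of the action network is

\[
\hat\nu(\eta) = \hat W_a^T \varphi_a(\eta), \tag{6.24}
\]

where $\hat W_a$ is the estimate of $W_a$. We can define the output error of the action network as

\[
e_a = \hat W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \hat W_c. \tag{6.25}
\]

The objective function to be minimized by the action network is defined as

\[
E_a = \frac{1}{2} e_a^T e_a. \tag{6.26}
\]

The weight update law for the action network is a gradient descent algorithm, which is given by

\[
\dot{\hat W}_a = -\alpha_a \varphi_a \left(\hat W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \hat W_c\right)^T, \tag{6.27}
\]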


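where $\alpha_a$ is the learning rate of the action network. Define the weight estimation error of the action network as

\[
\tilde W_a = W_a - \hat W_a. \tag{6.28}
\]

Then, we have

\[
\begin{aligned}
\dot{\tilde W}_a &= \alpha_a \varphi_a \left(\left(W_a - \tilde W_a\right)^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \left(W_c - \tilde W_c\right)\right)^T \\
&= \alpha_a \varphi_a \left(-\tilde W_a^T \varphi_a - \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \tilde W_c + W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T W_c\right)^T.
\end{aligned}
\tag{6.29}
\]

As $\nu = -\dfrac{1}{2} R^{-1} G^T J_\eta$, according to (6.14) and (6.23), we have

\[
W_a^T \varphi_a + \varepsilon_a = -\frac{1}{2} R^{-1} G^T \nabla\varphi_c^T W_c - \frac{1}{2} R^{-1} G^T \nabla\varepsilon_c. \tag{6.30}
\]

Thus, (6.29) can be rewritten as

\[
\dot{\tilde W}_a = -\alpha_a \varphi_a \left(\tilde W_a^T \varphi_a + \frac{1}{2} R^{-1} G^T \nabla\varphi_c^T \tilde W_c - \varepsilon_{12}\right)^T, \tag{6.31}
\]

where $\varepsilon_{12} = -\varepsilon_a - \dfrac{1}{2} R^{-1} G^T \nabla\varepsilon_c$.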
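The action law (6.27) with the output error (6.25) can be sketched in the same forward-Euler style. Again, the discretization, the step size `dt`, the function name and the array shapes in the docstring are illustrative assumptions rather than part of the original design.

```python
import numpy as np


def action_step(Wa_hat, phi_a, R, G, grad_phi_c, Wc_hat, alpha_a, dt):
    """One forward-Euler step of the action law (6.27).

    Wa_hat     : action weight estimate, shape (n_a, m)
    phi_a      : action activations, shape (n_a,)
    R          : control weighting matrix, shape (m, m)
    G          : input gain G(eta), shape (p, m)
    grad_phi_c : d(phi_c)/d(eta), shape (n_c, p)
    Wc_hat     : critic weight estimate, shape (n_c,)
    Returns the updated weights and the output error e_a of (6.25).
    """
    # e_a = Wa_hat^T phi_a + 0.5 R^{-1} G^T grad_phi_c^T Wc_hat
    e_a = Wa_hat.T @ phi_a + 0.5 * np.linalg.solve(
        R, G.T @ grad_phi_c.T @ Wc_hat)
    Wa_dot = -alpha_a * np.outer(phi_a, e_a)   # phi_a e_a^T
    return Wa_hat + dt * Wa_dot, e_a
```

Using `np.linalg.solve` instead of forming $R^{-1}$ explicitly is a numerical convenience of this sketch; for the diagonal $R$ of this chapter either choice gives the same result.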

6.3.3 Design of the Compensation Controller In this subsection, a compensation controller is designed to overcome the approximation errors of the critic and action networks. Before the detailed design method, the following assumption is necessary. Assumption 6.3 The approximation errors of the critic and action networks, i.e., εc and εa satisfy ||εc || ≤ εcM and ||εa || ≤ εa M . The residual error is upper bounded, i.e. ||ε H || ≤ ε H M . εcM , εa M and ε H M are positive numbers. The vectors of the activation functions of the action network satisfy ||ϕa || ≤ ϕa M , where ϕa M is a positive number. Define the compensation controller as νc = −

Kc GTη , ηT η + b

(6.32)

6.3 ADP-Based Optimal Control Design

103

ηT η + b , and b > 0 is a constant. Then, the optimal control 2ηT GG T η law can be expressed as

where K c ≥ ||G||2 εa2 M

νall = νˆ + νc ,

(6.33)

where $\nu_c$ is the compensation controller, and $\hat\nu$ is the output of the action network. Substituting (6.33) into (6.4), we can get

$$\dot\eta = F + G \hat W_a^T \phi_a + G \nu_c. \tag{6.34}$$

As $\hat W_a^T \phi_a = W_a^T \phi_a - \tilde W_a^T \phi_a = \nu - \varepsilon_a - \tilde W_a^T \phi_a$, we can obtain

$$\dot\eta = F + G\nu - G\varepsilon_a - G \tilde W_a^T \phi_a + G \nu_c. \tag{6.35}$$

In the next subsection, the stability analysis will be given.
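The compensation controller (6.32) and its gain condition can be checked numerically. The following is a minimal sketch, not the authors' implementation: the matrix `G`, the state `eta`, the bound `eps_aM` and the helper names are hypothetical, and $\|G\|$ is assumed to be the spectral norm. At the smallest admissible gain, the cross term $2\eta^T G \nu_c$ equals $-\|G\|^2 \varepsilon_{aM}^2$, which is the bound used later in the stability analysis.

```python
import numpy as np

def compensation_control(eta, G, K_c, b):
    """nu_c = -K_c * G^T eta / (eta^T eta + b), following (6.32)."""
    return -K_c * (G.T @ eta) / (eta @ eta + b)

def min_gain(eta, G, eps_aM, b):
    """Smallest K_c allowed by the stated gain condition (needs G^T eta != 0)."""
    Gt_eta = G.T @ eta
    g_norm_sq = np.linalg.norm(G, 2) ** 2   # assumption: spectral norm of G
    return g_norm_sq * eps_aM ** 2 * (eta @ eta + b) / (2.0 * (Gt_eta @ Gt_eta))

# Hypothetical data, chosen only for illustration.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 2))
eta = rng.standard_normal(4)
eps_aM, b = 0.1, 1.0

K_c = min_gain(eta, G, eps_aM, b)
nu_c = compensation_control(eta, G, K_c, b)
lhs = 2.0 * eta @ G @ nu_c                            # cross term 2 eta^T G nu_c
rhs = -np.linalg.norm(G, 2) ** 2 * eps_aM ** 2        # bound -||G||^2 eps_aM^2
```

Any larger gain makes `lhs` more negative, so the inequality is preserved; note that the admissible gain depends on the current state through $\eta^T G G^T \eta$.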

6.3.4 Stability Analysis

For continuous-time ADP methods, the signals need to be persistently exciting in the learning process [12], i.e., the persistence of excitation assumption is required.

Assumption 6.4 Let the signal $\xi_2$ be persistently exciting over the interval $[t, t+T]$, i.e., there exist constants $\beta_1 > 0$, $\beta_2 > 0$ and $T > 0$ such that, for all $t$,

$$\beta_1 I \le \int_t^{t+T} \xi_2(\tau) \xi_2^T(\tau)\, d\tau \le \beta_2 I \tag{6.36}$$

holds.

Remark 6.3 This assumption ensures that system (6.4) is sufficiently persistently excited for tuning the critic and action networks. In fact, the persistent excitation assumption ensures $\xi_{2m} \le \|\xi_2\|$, where $\xi_{2m}$ is a positive number [12].

Before giving the main result, the following preparatory results are presented.

Lemma 6.1 For all $x \in \mathbb{R}^n$, we have

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2. \tag{6.37}$$

Proof Let $x = (x_1, x_2, \ldots, x_n)^T$. As $\|x\|_2^2 = \sum_{i=1}^n |x_i|^2 \le \left(\sum_{i=1}^n |x_i|\right)^2 = \|x\|_1^2$, we can get $\|x\|_2 \le \|x\|_1$. As $\|x\|_1^2 = \left(\sum_{i=1}^n |x_i|\right)^2 \le n \sum_{i=1}^n |x_i|^2 = n\|x\|_2^2$, we can obtain $\|x\|_1 \le \sqrt{n}\,\|x\|_2$.
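The two-sided norm bound of Lemma 6.1 is easy to sanity-check numerically. The sketch below (an illustration only, with a hypothetical helper name and random test vectors) evaluates both inequalities on random samples and shows that the all-ones vector attains the upper bound with equality.

```python
import numpy as np

def norm_bounds_hold(x, tol=1e-12):
    """Check ||x||_2 <= ||x||_1 <= sqrt(n) * ||x||_2 up to rounding."""
    n = x.size
    l1 = np.linalg.norm(x, 1)
    l2 = np.linalg.norm(x, 2)
    return bool(l2 <= l1 + tol and l1 <= np.sqrt(n) * l2 + tol)

rng = np.random.default_rng(1)
samples = [rng.standard_normal(n) for n in (1, 2, 5, 50) for _ in range(100)]
all_ok = all(norm_bounds_hold(x) for x in samples)

# Equality in the upper bound: for x = (1, 1, 1, 1), ||x||_1 = 4 and ||x||_2 = 2.
ones = np.ones(4)
gap = np.linalg.norm(ones, 1) - np.sqrt(4) * np.linalg.norm(ones, 2)
```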


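Theorem 6.1 For system (6.1), if $f(z)$ satisfies Assumptions 6.1 and 6.2, then we have

$$\|F(\eta) - F(\eta')\|_2 \le k \|\eta - \eta'\|_2, \tag{6.38}$$

where $\eta = \begin{bmatrix} x \\ y \end{bmatrix}$ and $\eta' = \begin{bmatrix} x' \\ y' \end{bmatrix}$. Let $k_j^R = \max\{\lambda_j^{RR}, \lambda_j^{RI}\}$ and $k_j^I = \max\{\lambda_j^{IR}, \lambda_j^{II}\}$, $j = 1, 2, \ldots, n$. Let $k' = \sum_{j=1}^n k_j^R + \sum_{j=1}^n k_j^I$ and $k = k'\sqrt{2n}$.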

Proof According to Assumption 6.2 and the mean value theorem for multi-variable functions, we have

$$\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le \lambda_j^{RR} \|x - x'\|_1 + \lambda_j^{RI} \|y - y'\|_1. \tag{6.39}$$

According to the definition of the 1-norm, we have $\|\eta - \eta'\|_1 = \|x - x'\|_1 + \|y - y'\|_1$, and

$$\|f_j^R(x, y) - f_j^R(x', y')\|_1 \le k_j^R \|\eta - \eta'\|_1. \tag{6.40}$$

According to (6.40), we can get

$$\|f^R(x, y) - f^R(x', y')\|_1 \le \sum_{j=1}^n k_j^R \|\eta - \eta'\|_1. \tag{6.41}$$

Following the same idea as in (6.40) and (6.41), we can also obtain

$$\|f^I(x, y) - f^I(x', y')\|_1 \le \sum_{j=1}^n k_j^I \|\eta - \eta'\|_1. \tag{6.42}$$

Therefore, we can get

$$\|F(\eta) - F(\eta')\|_1 = \|f^R(x, y) - f^R(x', y')\|_1 + \|f^I(x, y) - f^I(x', y')\|_1 \le k' \|\eta - \eta'\|_1. \tag{6.43}$$

According to Lemma 6.1, we have

$$\|F(\eta) - F(\eta')\|_2 \le \|F(\eta) - F(\eta')\|_1, \tag{6.44}$$

and

$$\|\eta - \eta'\|_1 \le \sqrt{2n}\, \|\eta - \eta'\|_2. \tag{6.45}$$

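From (6.43)–(6.45), we can obtain $\|F(\eta) - F(\eta')\|_2 \le k \|\eta - \eta'\|_2$. The proof is completed.

The next theorems show that the estimation errors of the critic and action networks are UUB.

Theorem 6.2 Let the weights of the critic network $\hat W_c$ be updated by (6.21). If the inequality $\left\| \left(\frac{\xi_1}{\xi_3}\right)^T \tilde W_c \right\| > \varepsilon_{HM}$ holds, then the estimation error $\tilde W_c$ converges exponentially to the set $\|\tilde W_c\| \le \beta_1^{-1} \sqrt{\beta_2 T}\, (1 + 2\rho\beta_2\alpha_c)\, \varepsilon_{HM}$, where $\rho > 0$.

Proof Let $\tilde W_c$ be defined as in (6.18). Define the following Lyapunov function candidate

$$L = \frac{1}{2\alpha_c} \tilde W_c^T \tilde W_c. \tag{6.46}$$

The derivative of (6.46) is given by

$$\dot L = \tilde W_c^T \left( -\frac{\xi_1 \xi_1^T}{\xi_3^2} \tilde W_c + \frac{\xi_1}{\xi_3^2} \varepsilon_H \right) \le -\left\| \left(\frac{\xi_1}{\xi_3}\right)^T \tilde W_c \right\| \left( \left\| \left(\frac{\xi_1}{\xi_3}\right)^T \tilde W_c \right\| - \left\| \frac{\varepsilon_H}{\xi_3} \right\| \right). \tag{6.47}$$

As $\xi_3 \ge 1$, we have $\left\| \frac{\varepsilon_H}{\xi_3} \right\| < \varepsilon_{HM}$. If $\left\| \left(\frac{\xi_1}{\xi_3}\right)^T \tilde W_c \right\| > \varepsilon_{HM}$, then we can get $\dot L \le 0$. That means $L$ decreases and $\|\xi_2^T \tilde W_c\|$ is bounded. In light of [14] and Technical Lemma 2 in [12], $\|\tilde W_c\| \le \beta_1^{-1} \sqrt{\beta_2 T}\, (1 + 2\rho\beta_2\alpha_c)\, \varepsilon_{HM}$.

Theorem 6.3 Let the optimal control law be expressed as in (6.33). The weight update laws of the critic and action networks are given as in (6.21) and (6.27), respectively. If there exist parameters $l_2$ and $l_3$ that satisfy

$$l_2 > \|G\|, \tag{6.48}$$

and

$$l_3 > \max\left\{ \frac{\|G\|^2}{\lambda_{\min}(R)}, \frac{2k + 3}{\lambda_{\min}(Q)} \right\}, \tag{6.49}$$

respectively, then the system state $\eta$ in (6.4) is UUB and the weights of the critic and action networks, i.e., $\hat W_c$ and $\hat W_a$, converge to finite neighborhoods of the optimal ones.

Proof Choose the Lyapunov function candidate as

$$V = V_1 + V_2 + V_3, \tag{6.50}$$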

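where $V_1 = \frac{1}{2\alpha_c} \tilde W_c^T \tilde W_c$, $V_2 = \frac{l_2}{2\alpha_a} \mathrm{tr}\{\tilde W_a^T \tilde W_a\}$, $V_3 = \eta^T \eta + l_3 J(\eta)$, $l_2 > 0$, $l_3 > 0$. Taking the derivative of the Lyapunov function candidate (6.50), we can get $\dot V = \dot V_1 + \dot V_2 + \dot V_3$. According to (6.22), we have

$$\dot V_1 = -(\tilde W_c^T \xi_2)^T \tilde W_c^T \xi_2 + (\tilde W_c^T \xi_2)^T \frac{\varepsilon_H}{\xi_3}. \tag{6.51}$$

According to (6.31), we can get

$$\begin{aligned}
\dot V_2 &= -l_2\, \mathrm{tr}\left\{ \tilde W_a^T \phi_a \left( \tilde W_a^T \phi_a + \frac{1}{2} R^{-1} G^T \nabla\phi_c^T \tilde W_c - \varepsilon_{12} \right)^T \right\} \\
&= -l_2 (\tilde W_a^T \phi_a)^T \tilde W_a^T \phi_a + l_2 (\tilde W_a^T \phi_a)^T \varepsilon_{12} - \frac{1}{2} l_2 (\tilde W_a^T \phi_a)^T R^{-1} G^T \nabla\phi_c^T \tilde W_c.
\end{aligned} \tag{6.52}$$

The derivative of $V_3$ can be expressed as

$$\dot V_3 = (2\eta^T F - 2\eta^T G \tilde W_a^T \phi_a + 2\eta^T G \nu - 2\eta^T G \varepsilon_a + 2\eta^T G \nu_c) + l_3 (-r(\eta, \nu)). \tag{6.53}$$

From Theorem 6.1, we have $2\eta^T F \le 2k\|\eta\|^2$. In addition, we can obtain

$$\begin{aligned}
-2\eta^T G \tilde W_a^T \phi_a &\le \|\eta\|^2 + \|G\|^2 (\tilde W_a^T \phi_a)^T \tilde W_a^T \phi_a, \\
2\eta^T G \nu &\le \|\eta\|^2 + \|G\|^2 \|\nu\|^2, \\
-2\eta^T G \varepsilon_a &\le \|\eta\|^2 + \|G\|^2 \varepsilon_{aM}^2.
\end{aligned} \tag{6.54}$$

From (6.32), we can get

$$2\eta^T G \nu_c = 2\eta^T G \left( -\frac{K_c G^T \eta}{\eta^T \eta + b} \right) \le -\|G\|^2 \varepsilon_{aM}^2. \tag{6.55}$$

Then, (6.53) can be rewritten as

$$\dot V_3 \le (2k + 3 - l_3 \lambda_{\min}(Q)) \|\eta\|^2 + (\|G\|^2 - l_3 \lambda_{\min}(R)) \|\nu\|^2 + \|G\|^2 (\tilde W_a^T \phi_a)^T \tilde W_a^T \phi_a. \tag{6.56}$$

Let $Z = [\eta^T, \nu^T, (\tilde W_c^T \xi_2)^T, (\tilde W_a^T \phi_a)^T]^T$ and $N_V = \left[0, 0, \frac{\varepsilon_H}{\xi_3}, M_{V4}^T\right]^T$, where $M_{V4} = -\frac{1}{2} R^{-1} G^T \nabla\phi_c^T \tilde W_c + l_2 \varepsilon_{12}$, and $M_V = \mathrm{diag}\left\{ (l_3 \lambda_{\min}(Q) - 2k - 3),\ (l_3 \lambda_{\min}(R) - \|G\|^2),\ 1,\ (l_2 - \|G\|^2) \right\}$. Thus, we have

$$\dot V \le -Z^T M_V Z + Z^T N_V \le -\|Z\|^2 \lambda_{\min}(M_V) + \|Z\| \|N_V\|. \tag{6.57}$$

According to (6.48)–(6.49), we can see that if $\|Z\| \ge \dfrac{\|N_V\|}{\lambda_{\min}(M_V)} \equiv Z_B$, then the derivative of the Lyapunov candidate satisfies $\dot V \le 0$. As $M_{V4}$ and $\frac{\varepsilon_H}{\xi_3}$ are both upper bounded, $\|N_V\|$ is upper bounded. Therefore, the state $\eta$ and the weight errors $\tilde W_c$ and $\tilde W_a$ are UUB [15]. The proof is completed.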

Theorem 6.4 Let the weight updating laws of the critic and the action networks be given by (6.21) and (6.27), respectively. If Theorem 6.3 holds, then the control law $\nu_{all}$ converges to a finite neighborhood of the optimal control law $\nu^*$.

Proof From Theorem 6.3, there exist $\bar\nu_c > 0$ and $\bar{\tilde W}_a > 0$ such that $\lim_{t\to\infty} \|\nu_c\| \le \bar\nu_c$ and $\lim_{t\to\infty} \|\tilde W_a\| \le \bar{\tilde W}_a$. From (6.33) and (6.28), we have

$$\nu_{all} - \nu^* = \hat\nu + \nu_c - \nu^* = \hat W_a^T \phi_a + \nu_c - W_a^T \phi_a - \varepsilon_a = -\tilde W_a^T \phi_a + \nu_c - \varepsilon_a. \tag{6.58}$$

Therefore, we have

$$\lim_{t\to\infty} \|\nu_{all} - \nu^*\| \le \bar{\tilde W}_a \phi_{aM} + \bar\nu_c + \varepsilon_{aM}. \tag{6.59}$$

As $\bar{\tilde W}_a \phi_{aM} + \bar\nu_c + \varepsilon_{aM}$ is finite, we can obtain the conclusion. The proof is completed.

6.4 Simulation Study

Example 6.1 Our first example is chosen as Example 3 in [3] with modifications. Consider the following nonlinear complex-valued harmonic oscillator system

$$\dot z = i^{-1}\, \frac{-2z\left(z^2 - \frac{5}{2}\right)}{2z^2 - 1} + 10(1 + i)u, \tag{6.60}$$

where $z \in \mathbb{C}^1$, $z = z^R + i z^I$ and $u = u^R + i u^I$. The utility function is defined as $\bar r(z, u) = z^H Q_1 z + u^H R_1 u$. Let $Q_1 = E$ and $R_1 = E$, where $E$ is the identity matrix with a suitable dimension. Let $\eta = [z^R, z^I]^T$ and $\nu = [u^R, u^I]^T$. Let the critic and action networks be expressed as $\hat J(\eta) = \hat W_c^T \phi_c(Y_c \eta)$ and $\hat\nu(\eta) = \hat W_a^T \phi_a(Y_a \eta)$, where $Y_c$ and $Y_a$ are constant matrices with suitable dimensions. The activation functions of the critic and action networks are hyperbolic tangent functions. The structures of the critic and action networks are 2-8-1 and 2-8-2, respectively. The initial weights of $\hat W_c$ and $\hat W_a$ are selected arbitrarily from $(-0.1, 0.1)$. The learning rates of the critic and action networks are selected as $\alpha_c = \alpha_a = 0.01$. Let $z_0 = -1 + i$. Let $K_c = 3$ and $b = 1.02$. Implementing the developed ADP method for 40 time steps, the trajectories of the control and state are shown in Figs. 6.1 and 6.2, respectively. The weights of the critic and action networks converge to $W_c = [-0.0904; 0.0989; -0.0586; 0.0214; -0.0304; 0.0435; -0.0943; -0.0866]$ and $W_a = [-0.0269, 0.0117; 0.0197, -0.0548; 0.0336, -0.0790; 0.0789, -0.0980; -0.0825, -0.0881; 0.0078, -0.0354; -0.0143, 0.0558; 0.0234, -0.0329]$, respectively. From Fig. 6.2, we can see that the system state is UUB, which verifies the effectiveness of the developed ADP method.

[Fig. 6.1 Control trajectories: real(u) and imag(u) versus time steps]

[Fig. 6.2 State trajectories: real(z) and imag(z) versus time steps]

Example 6.2 In the second example, the effectiveness of the developed ADP method is verified on a complex-valued Chen system [16]. The system can be expressed as

$$\begin{cases}
\dot z_1 = -\mu z_1 + z_2 (z_3 + \alpha) + i z_1 u_1, \\
\dot z_2 = -\mu z_2 + z_1 (z_3 - \alpha) + 10 u_2, \\
\dot z_3 = 1 - 0.5(\bar z_1 z_2 + z_1 \bar z_2) + u_3,
\end{cases} \tag{6.61}$$

where $\mu = 0.8$ and $\alpha = 1.8$. Let $z = [z_1, z_2, z_3]^T \in \mathbb{C}^3$ and $g(z) = \mathrm{diag}(i z_1, 10, 1)$. Let $\bar z_1$ and $\bar z_2$ denote the complex conjugates of $z_1$ and $z_2$, respectively. Define $\eta = [z_1^R, z_2^R, z_3^R, z_1^I, z_2^I, z_3^I]^T$ and $\nu = [u_1^R, u_2^R, u_3^R, u_1^I, u_2^I, u_3^I]^T$. The structures of the action and critic networks are 6-8-6 and 6-6-1, respectively. Let the training rules of the neural networks be the same as in Example 6.1. Let $z_0 = [1 + i, 1 - i, 0.5]^T$. Implementing the developed ADP method for 500 time steps, the trajectories of the control and state are displayed in Figs. 6.3 and 6.4, respectively. From the simulation results, we can say that the ADP method developed in this chapter is effective and feasible.

[Fig. 6.3 Control trajectories: real and imaginary parts of $u_1$, $u_2$, $u_3$ versus time steps]

[Fig. 6.4 State trajectories: real and imaginary parts of $z_1$, $z_2$, $z_3$ versus time steps]

6.5 Conclusion

In this chapter, an optimal control scheme based on the ADP method has been developed for complex-valued systems for the first time. First, the performance index function is defined based on the complex-valued state and control. Then, system transformations are used to overcome the Cauchy–Riemann conditions. Based on the transformed system and the corresponding performance index function, a new ADP-based optimal control method is established. A compensation controller is presented to compensate for the approximation errors of the neural networks. Finally, simulation examples are given to show the effectiveness of the developed optimal control scheme.

References

1. Adali, T., Schreier, P., Scharf, L.: Complex-valued signal processing: the proper way to deal with impropriety. IEEE Trans. Signal Process. 59(11), 5101–5125 (2011)
2. Fang, T., Sun, J.: Stability analysis of complex-valued impulsive system. IET Control Theory Appl. 7(8), 1152–1159 (2013)
3. Yang, C.: Stability and quantization of complex-valued nonlinear quantum systems. Chaos, Solitons Fractals 42, 711–723 (2009)
4. Hu, J., Wang, J.: Global stability of complex-valued recurrent neural networks with time-delays. IEEE Trans. Neural Netw. Learn. Syst. 23(6), 853–865 (2012)
5. Huang, S., Li, C., Liu, Y.: Complex-valued filtering based on the minimization of complex-error entropy. IEEE Trans. Neural Netw. Learn. Syst. 24(5), 695–708 (2013)
6. Hong, X., Chen, S.: Modeling of complex-valued Wiener systems using B-spline neural network. IEEE Trans. Neural Netw. 22(5), 818–825 (2011)
7. Goh, S., Mandic, D.: Nonlinear adaptive prediction of complex-valued signals by complex-valued PRNN. IEEE Trans. Signal Process. 53(5), 1827–1836 (2005)
8. Bolognani, S., Smyshlyaev, A., Krstic, M.: Adaptive output feedback control for complex-valued reaction-advection-diffusion systems. In: Proceedings of American Control Conference, Seattle, Washington, USA, pp. 961–966 (2008)
9. Hamagami, T., Shibuya, T., Shimada, S.: Complex-valued reinforcement learning. In: Proceedings of IEEE International Conference on Systems, Man, and Cybernetics, Taipei, Taiwan, pp. 4175–4179 (2006)
10. Paulraj, A., Nabar, R., Gore, D.: Introduction to Space-Time Wireless Communications. Cambridge University Press, Cambridge (2003)
11. Mandic, D.P., Goh, V.S.L.: Complex Valued Nonlinear Adaptive Filters: Noncircularity, Widely Linear and Neural Models. Wiley, New York (2009)
12. Vamvoudakis, K.G., Lewis, F.L.: Online actor-critic algorithm to solve the continuous-time infinite horizon optimal control problem. Automatica 46(5), 878–888 (2010)
13. Dierks, T., Jagannathan, S.: Online optimal control of affine nonlinear discrete-time systems with unknown internal dynamics by using time-based policy update. IEEE Trans. Neural Netw. Learn. Syst. 23(7), 1118–1129 (2012)
14. Khalil, H.K.: Nonlinear Systems. Prentice-Hall, Upper Saddle River (2002)
15. Lewis, F.L., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and Nonlinear Systems. Taylor & Francis, New York (1999)
16. Mahmoud, G.M., Aly, S.A., Farghaly, A.A.: On chaos synchronization of a complex two coupled dynamos system. Chaos, Solitons Fractals 33, 178–187 (2007)

Chapter 7

Off-Policy Neuro-Optimal Control for Unknown Complex-Valued Nonlinear Systems

This chapter establishes an optimal control method for unknown complex-valued systems. Policy iteration (PI) is used to obtain the solution of the Hamilton–Jacobi–Bellman (HJB) equation. Off-policy learning allows the iterative performance index and iterative control to be obtained with completely unknown dynamics. Critic and action networks are used to obtain the iterative performance index and the iterative control, which execute policy evaluation and policy improvement, respectively. Asymptotic stability of the closed-loop system and convergence of the iterative performance index function are proven. Using the Lyapunov technique, the uniform ultimate boundedness (UUB) of the weight error is proven. A simulation study demonstrates the effectiveness of the proposed optimal control method.

7.1 Introduction

Policy iteration (PI) and value iteration (VI) are the two main algorithms in ADP [1, 2]. In [3], a VI algorithm for nonlinear systems was proposed, which finds the optimal control law through an iterative performance index function and an iterative control, starting from a zero initial performance index function. As the iteration index increases to infinity, the iterative performance index function is proven to be a nondecreasing and bounded sequence. In [4], a PI algorithm was proposed, which obtains the optimal control by constructing a sequence of stabilizing iterative controls. Conventional PI requires the system dynamics, whereas in many industrial systems the dynamics are difficult to obtain. The off-policy method, which is based on PI, can solve the HJB equation with completely unknown dynamics. In [5], H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning was discussed. Note that Ref. [6] studied the optimal control problem for complex-valued nonlinear systems in the framework of an infinite-horizon ADP algorithm with known dynamics. This chapter further studies the optimal control problem of unknown complex-valued nonlinear systems based on off-policy PI.

In this chapter, we consider the optimal control problem of complex-valued unknown systems. The PI algorithm is used to obtain the solution of the HJB equation. Off-policy learning allows the iterative performance index and the iterative control to be obtained with completely unknown dynamics. Critic and action networks are used to approximate the iterative performance index and the iterative control, respectively. Therefore, the established method has two steps, policy evaluation and policy improvement, which are executed by the critic and the actor. It is proven that the closed-loop system is asymptotically stable, the iterative performance index function is convergent, and the weight error is uniformly ultimately bounded (UUB).

The rest of the chapter is organized as follows. In Sect. 7.2, the problem motivations and preliminaries are presented. In Sect. 7.3, the optimal control design scheme is developed based on PI, and the asymptotic stability is proved. In Sect. 7.4, two examples are given to demonstrate the effectiveness of the proposed optimal control scheme. In Sect. 7.5, the conclusion is drawn.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_7

7.2 Problem Statement

Consider a class of complex-valued continuous-time systems as follows:

$$\dot z = f(z) + g(z)u, \tag{7.1}$$

where $z \in \mathbb{C}^n$ is the complex-valued system state and $u \in \mathbb{C}^m$ is the complex-valued control input. $f(z) \in \mathbb{C}^n$ and $g(z) \in \mathbb{C}^{n \times m}$ are complex-valued functions. We want to find an optimal control policy $u(t) = u(z(t))$, which minimizes the following performance index function

$$\Omega = \int_0^\infty (z^H X z + u^H Y u)\, dt, \tag{7.2}$$

where $X$ and $Y$ are positive definite symmetric matrices. According to [6], the dimension-enlargement method is used to obtain the transformed system. Assume that $z = z^R + i z^I$, $u = u^R + i u^I$, $f = f^R + i f^I$ and $g = g^R + i g^I$. Then, system (7.1) can be expressed as

$$\dot s = \Gamma(s) + \Upsilon(s) q, \tag{7.3}$$

where $s = \begin{bmatrix} z^R \\ z^I \end{bmatrix}$, $\Gamma(s) = \begin{bmatrix} f^R(z^R, z^I) \\ f^I(z^R, z^I) \end{bmatrix}$, $\Upsilon(s) = \begin{bmatrix} g^R(z^R, z^I) & -g^I(z^R, z^I) \\ g^I(z^R, z^I) & g^R(z^R, z^I) \end{bmatrix}$ and $q = \begin{bmatrix} u^R \\ u^I \end{bmatrix}$. Therefore, the complex-valued system (7.1) is transformed into the real-valued system (7.3). Although the dimension of system (7.3) increases two-fold, the complex-valued optimization problem is mapped into the real domain. According to the transformation, the performance index function (7.2) can be expressed as

$$\Omega(s) = \int_t^\infty (s^T M s + q^T N q)\, d\tau, \tag{7.4}$$

where $r(s, q) = s^T M s + q^T N q$, $M = \mathrm{diag}(X, X)$, and $N = \mathrm{diag}(Y, Y)$. If the associated cost function $\Omega$ is $C^1$, then an infinitesimal equivalent to (7.4) is the Bellman equation

$$0 = r(s, q) + \Omega_s^T (\Gamma + \Upsilon q), \tag{7.5}$$

where $\Omega_s = \dfrac{\partial \Omega}{\partial s}$ is treated here as a column vector. Given that $q$ is an admissible control policy, if $\Omega$ satisfies (7.5) with $r(s, q) \ge 0$, then it can be shown that $\Omega$ is a Lyapunov function for the system (7.3) with control policy $q$. We define the Hamiltonian function $H$:

$$H(s, q, \Omega_s) = r(s, q) + \Omega_s^T (\Gamma + \Upsilon q). \tag{7.6}$$

Then the optimal performance index function $\Omega^*$ satisfies the HJB equation

$$0 = \min_q H(s, q, \Omega_s^*). \tag{7.7}$$

Assuming that the minimum on the right-hand side of (7.7) exists and is unique, the optimal control function is

$$q^* = -\frac{1}{2} N^{-1} \Upsilon^T \Omega_s^*. \tag{7.8}$$

The optimal control problem can now be formulated: find an admissible control policy $q(t) = q(s(t))$ such that the performance index function (7.4) associated with the system (7.3) is minimized.
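The dimension-enlargement transformation from (7.1) to (7.3) can be illustrated with a short numerical sketch. It is an illustration under stated assumptions rather than the authors' code: the helper names are hypothetical, random complex arrays stand in for $f(z)$ and $g(z)$ at one state, and $X$ is assumed real and symmetric so that $z^H X z = s^T M s$ with $M = \mathrm{diag}(X, X)$.

```python
import numpy as np

def to_real_vector(v):
    """Stack real and imaginary parts: v -> [Re v; Im v]."""
    return np.concatenate([v.real, v.imag])

def to_real_input_matrix(g):
    """Upsilon = [[g^R, -g^I], [g^I, g^R]] as in (7.3)."""
    return np.block([[g.real, -g.imag], [g.imag, g.real]])

rng = np.random.default_rng(2)
n, m = 3, 2
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)
u = rng.standard_normal(m) + 1j * rng.standard_normal(m)
f = rng.standard_normal(n) + 1j * rng.standard_normal(n)            # stand-in for f(z)
g = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))  # stand-in for g(z)

# Complex dynamics (7.1) versus real dynamics (7.3) at the same point.
z_dot = f + g @ u
s, q = to_real_vector(z), to_real_vector(u)
s_dot = to_real_vector(f) + to_real_input_matrix(g) @ q

# Quadratic state cost in both coordinates (X assumed real symmetric).
X = np.diag([1.0, 2.0, 3.0])
M = np.block([[X, np.zeros((n, n))], [np.zeros((n, n)), X]])
cost_complex = (z.conj() @ X @ z).real
cost_real = s @ M @ s
```

The real vector field reproduces the real and imaginary parts of the complex one, so optimizing over $q$ in the real domain is equivalent to optimizing over $u$.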

7.3 Off-Policy Optimal Control Method

The PI algorithm given in [7], which can obtain the optimal control, requires the full system dynamics, because both $\Gamma$ and $\Upsilon$ appear in the Bellman equation (7.5). In [8–10], data are used to construct the optimal control. In this section, the off-policy PI algorithm is presented to solve (7.5) without the dynamics of system (7.3).

First, the PI algorithm is given as follows. The PI algorithm begins from an admissible control $q^{[0]}$; then the iterative performance index function $\Omega^{[i]}$ is updated as

$$0 = r(s, q^{[i]}) + \Omega_s^{[i]T} (\Gamma + \Upsilon q^{[i]}) \tag{7.9}$$

and the iterative control is derived by

$$q^{[i+1]} = -\frac{1}{2} N^{-1} \Upsilon^T \Omega_s^{[i]}. \tag{7.10}$$
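For intuition, the PI recursion (7.9) and (7.10) can be written out in the linear-quadratic special case. The sketch below is an assumption-laden illustration, not the method of this chapter: it takes $\Gamma(s) = As$, $\Upsilon = B$ and $\Omega^{[i]}(s) = s^T P_i s$ with hypothetical matrices $A$, $B$, so that policy evaluation becomes a Lyapunov equation and policy improvement becomes $K_{i+1} = N^{-1} B^T P_i$ with $q^{[i+1]} = -K_{i+1} s$.

```python
import numpy as np

def lyap(Ac, Qbar):
    """Solve Ac^T P + P Ac + Qbar = 0 via the vec/Kronecker identity."""
    n = Ac.shape[0]
    I = np.eye(n)
    L = np.kron(I, Ac.T) + np.kron(Ac.T, I)      # acts on column-major vec(P)
    p = np.linalg.solve(L, -Qbar.flatten(order="F"))
    P = p.reshape((n, n), order="F")
    return 0.5 * (P + P.T)                       # symmetrize against rounding

def policy_iteration(A, B, M, N, K0, iters=15):
    K, history = K0, []
    for _ in range(iters):
        Ac = A - B @ K
        P = lyap(Ac, M + K.T @ N @ K)            # policy evaluation, cf. (7.9)
        K = np.linalg.solve(N, B.T @ P)          # policy improvement, cf. (7.10)
        history.append(P)
    return K, history

# Hypothetical stable plant, so that K0 = 0 is an admissible initial control.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
M, N = np.eye(2), np.eye(1)

K, history = policy_iteration(A, B, M, N, K0=np.zeros((1, 2)))
P = history[-1]
riccati_residual = A.T @ P + P @ A + M - P @ B @ np.linalg.solve(N, B.T @ P)
traces = [float(np.trace(Pi)) for Pi in history]
```

In this special case the iterates satisfy the algebraic Riccati equation in the limit, and the sequence of value matrices is non-increasing, which mirrors the monotonicity property established for the general algorithm.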

Based on the PI algorithm (7.9) and (7.10), the off-policy PI algorithm will be derived. Note that, for any time $t$ and time interval $T > 0$, the iterative performance index function satisfies

$$\Omega^{[i]}(s(t - T)) = \int_{t-T}^{t} r(s, q^{[i]})\, d\tau + \Omega^{[i]}(s(t)). \tag{7.11}$$

To derive the off-policy PI method, we rewrite the system (7.3) as

$$\dot s = \Gamma(s) + \Upsilon(s) q^{[i]} + \Upsilon(s)(q - q^{[i]}), \tag{7.12}$$

where $q^{[i]}$ is the iterative control. Then from (7.11) and (7.12), we can get

$$\begin{aligned}
\Omega^{[i]}(s(t)) - \Omega^{[i]}(s(t - T)) &= \int_{t-T}^{t} \Omega_s^{[i]T} \dot s\, d\tau \\
&= \int_{t-T}^{t} \Omega_s^{[i]T} (\Gamma + \Upsilon q^{[i]} + \Upsilon (q - q^{[i]}))\, d\tau.
\end{aligned} \tag{7.13}$$

According to (7.9), we have

$$\Omega_s^{[i]T} (\Gamma + \Upsilon q^{[i]}) = -s^T M s - q^{[i]T} N q^{[i]}. \tag{7.14}$$

Then (7.13) can be expressed as

$$\Omega^{[i]}(s(t)) - \Omega^{[i]}(s(t - T)) = \int_{t-T}^{t} (-s^T M s - q^{[i]T} N q^{[i]})\, d\tau + \int_{t-T}^{t} \Omega_s^{[i]T} \Upsilon (q - q^{[i]})\, d\tau. \tag{7.15}$$

From (7.10), we have

$$\Omega_s^{[i]T} \Upsilon = -2 q^{[i+1]T} N. \tag{7.16}$$

Therefore, we can get the off-policy Bellman equation as

$$\Omega^{[i]}(s(t)) - \Omega^{[i]}(s(t - T)) = \int_{t-T}^{t} (-s^T M s - q^{[i]T} N q^{[i]})\, d\tau + \int_{t-T}^{t} -2 q^{[i+1]T} N (q - q^{[i]})\, d\tau. \tag{7.17}$$
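The off-policy Bellman equation (7.17) can be verified on a scalar linear example in which everything is available in closed form. This is a sketch under assumptions: the plant $\dot s = as + bq$, the weights $m$, $n$, the target policy gain `k` and the exploratory behavior policy are all hypothetical, and the integrals are computed by appending them to the state and integrating with a classical Runge-Kutta step. The data are generated by a behavior policy $q$ that differs from $q^{[i]}$, yet (7.17) still holds for $\Omega^{[i]}(s) = p s^2$.

```python
import math

a, b, m, n = -1.0, 1.0, 1.0, 1.0            # hypothetical scalar plant and weights
k = 0.5                                     # target policy q_i(s) = -k s
p = (m + n * k * k) / (2.0 * (b * k - a))   # solves (7.9) with Omega_i = p s^2
k_next = b * p / n                          # (7.10): q_{i+1}(s) = -k_next s

def behavior(s, t):
    """Behavior policy: target policy plus an exploration signal."""
    return -k * s + 0.8 * math.sin(3.0 * t)

def rhs(t, y):
    """Augmented ODE: state s and the two running integrals of (7.17)."""
    s = y[0]
    q = behavior(s, t)
    q_i, q_next = -k * s, -k_next * s
    return [a * s + b * q,
            -(m * s * s + n * q_i * q_i),
            -2.0 * q_next * n * (q - q_i)]

def rk4_step(y, t, h):
    k1 = rhs(t, y)
    k2 = rhs(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k1)])
    k3 = rhs(t + h / 2, [yi + h / 2 * ki for yi, ki in zip(y, k2)])
    k4 = rhs(t + h, [yi + h * ki for yi, ki in zip(y, k3)])
    return [yi + h / 6 * (c1 + 2 * c2 + 2 * c3 + c4)
            for yi, c1, c2, c3, c4 in zip(y, k1, k2, k3, k4)]

s0, T, steps = 1.0, 1.0, 1000
y, h = [s0, 0.0, 0.0], T / steps
for j in range(steps):
    y = rk4_step(y, j * h, h)

lhs_val = p * y[0] ** 2 - p * s0 ** 2       # Omega_i(s(t)) - Omega_i(s(t - T))
rhs_val = y[1] + y[2]                       # sum of the two integrals in (7.17)
```

Only the measured trajectory, the applied input and the two policies enter the right-hand side, which is why the identity can be exploited without knowing $a$ or $b$ once $\Omega^{[i]}$ and $q^{[i+1]}$ are parameterized.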

7.3.1 Convergence Analysis of Off-Policy PI Algorithm

In this subsection, we analyze the stability of the closed-loop system and the convergence of the established off-policy PI algorithm.

Theorem 7.1 Let the iterative performance index function $\Omega^{[i]}$ satisfy

$$\Omega^{[i]}(s) = \int_t^\infty (s^T M s + q^{[i]T} N q^{[i]})\, d\tau. \tag{7.18}$$

Let the iterative control input $q^{[i+1]}$ be obtained from (7.10). Then the closed-loop system (7.3) is asymptotically stable.

Proof Take $\Omega^{[i]}$ as the Lyapunov function candidate. Taking the derivative of $\Omega^{[i]}$ along the system $\dot s = \Gamma + \Upsilon q^{[i+1]}$, we have

$$\dot\Omega^{[i]} = \Omega_s^{[i]T} \dot s = \Omega_s^{[i]T} \Gamma + \Omega_s^{[i]T} \Upsilon q^{[i+1]}. \tag{7.19}$$

According to (7.9), we can obtain

$$\Omega_s^{[i]T} \Gamma = -\Omega_s^{[i]T} \Upsilon q^{[i]} - s^T M s - q^{[i]T} N q^{[i]}. \tag{7.20}$$

Then we have

$$\dot\Omega^{[i]} = \Omega_s^{[i]T} \Upsilon q^{[i+1]} - \Omega_s^{[i]T} \Upsilon q^{[i]} - s^T M s - q^{[i]T} N q^{[i]}. \tag{7.21}$$

From (7.10), we have

$$\Omega_s^{[i]T} \Upsilon = -2 q^{[i+1]T} N. \tag{7.22}$$

Then (7.21) can be expressed as

$$\dot\Omega^{[i]} = 2 q^{[i+1]T} N (q^{[i]} - q^{[i+1]}) - s^T M s - q^{[i]T} N q^{[i]}. \tag{7.23}$$

Since $N$ is symmetric positive definite, there exist a diagonal matrix $\Lambda$ and an orthogonal matrix $B$ such that $N = B^T \Lambda B$. The diagonal elements of $\Lambda$ are the eigenvalues of $N$, which coincide with its singular values. Define $y^{[i]} = B q^{[i]}$; then $q^{[i]} = B^{-1} y^{[i]}$. Therefore, (7.23) is expressed as

$$\begin{aligned}
\dot\Omega^{[i]} &= 2 y^{[i+1]T} \Lambda (y^{[i]} - y^{[i+1]}) - s^T M s - y^{[i]T} \Lambda y^{[i]} \\
&= 2 y^{[i+1]T} \Lambda y^{[i]} - 2 y^{[i+1]T} \Lambda y^{[i+1]} - s^T M s - y^{[i]T} \Lambda y^{[i]}.
\end{aligned} \tag{7.24}$$

Since $\Lambda_{kk} > 0$, writing $y_k^{[i]}$ for the $k$th component of $y^{[i]}$, we get

$$\dot\Omega^{[i]} = \sum_{k=1}^{m} \Lambda_{kk} \left( 2 y_k^{[i+1]} y_k^{[i]} - 2 \bigl(y_k^{[i+1]}\bigr)^2 - \bigl(y_k^{[i]}\bigr)^2 \right) - s^T M s < 0. \tag{7.25}$$
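The sign of each summand in (7.25) follows from completing the square. As a short worked step, written componentwise with $y_k^{[i]}$ denoting the $k$th entry of $y^{[i]}$ (a notational assumption made here for clarity):

```latex
\begin{aligned}
2 y_k^{[i+1]} y_k^{[i]} - 2\bigl(y_k^{[i+1]}\bigr)^2 - \bigl(y_k^{[i]}\bigr)^2
  &= -\Bigl(\bigl(y_k^{[i]}\bigr)^2 - 2 y_k^{[i+1]} y_k^{[i]} + \bigl(y_k^{[i+1]}\bigr)^2\Bigr)
     - \bigl(y_k^{[i+1]}\bigr)^2 \\
  &= -\bigl(y_k^{[i]} - y_k^{[i+1]}\bigr)^2 - \bigl(y_k^{[i+1]}\bigr)^2 \;\le\; 0 .
\end{aligned}
```

Each term is therefore non-positive, and the strict inequality for $s \neq 0$ comes from $-s^T M s < 0$, since $M$ is positive definite.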

Therefore, it is clear that the closed-loop system is asymptotically stable under each iterative control input. This completes the proof.

Theorem 7.2 Let the iterative performance index $\Omega^{[i]}$ satisfy (7.9), and let $q^{[i+1]}$ be obtained from (7.10). Then $\Omega^{[i]}$ is non-increasing in $i$, i.e., $\Omega^*(s) \le \Omega^{[i+1]}(s) \le \Omega^{[i]}(s)$.

Proof According to (7.9), we have

$$\Omega_s^{[i]T} (\Gamma + \Upsilon q^{[i]}) = -s^T M s - q^{[i]T} N q^{[i]}. \tag{7.26}$$

From (7.26), we can get

$$\Omega_s^{[i]T} \Gamma + \Omega_s^{[i]T} \Upsilon q^{[i+1]} = -s^T M s - q^{[i]T} N q^{[i]} - \Omega_s^{[i]T} \Upsilon q^{[i]} + \Omega_s^{[i]T} \Upsilon q^{[i+1]}. \tag{7.27}$$

Using (7.22), (7.27) is written as

$$\begin{aligned}
\Omega_s^{[i]T} \Gamma + \Omega_s^{[i]T} \Upsilon q^{[i+1]} &= -q^{[i]T} N q^{[i]} + 2 q^{[i+1]T} N (q^{[i]} - q^{[i+1]}) - s^T M s \\
&= -q^{[i]T} N q^{[i]} + 2 q^{[i+1]T} N q^{[i]} - 2 q^{[i+1]T} N q^{[i+1]} - s^T M s.
\end{aligned} \tag{7.28}$$

Furthermore, consider the derivatives of $\Omega^{[i+1]}$ and $\Omega^{[i]}$ along the system $\dot s = \Gamma + \Upsilon q^{[i+1]}$. Then we can obtain

$$\begin{aligned}
\Omega^{[i+1]}(s) - \Omega^{[i]}(s) &= -\int_0^\infty \left( \frac{d(\Omega^{[i+1]} - \Omega^{[i]})}{ds} \right)^T (\Gamma + \Upsilon q^{[i+1]})\, d\tau \\
&= -\int_0^\infty \Omega_s^{[i+1]T} (\Gamma + \Upsilon q^{[i+1]})\, d\tau + \int_0^\infty \Omega_s^{[i]T} (\Gamma + \Upsilon q^{[i+1]})\, d\tau \\
&= -\int_0^\infty \Omega_s^{[i+1]T} (\Gamma + \Upsilon q^{[i+1]})\, d\tau - \int_0^\infty \left( s^T M s + q^{[i]T} N q^{[i]} - 2 q^{[i+1]T} N q^{[i]} + 2 q^{[i+1]T} N q^{[i+1]} \right) d\tau.
\end{aligned} \tag{7.29}$$

From (7.26), we have

$$\Omega_s^{[i+1]T} (\Gamma + \Upsilon q^{[i+1]}) = -s^T M s - q^{[i+1]T} N q^{[i+1]}. \tag{7.30}$$

Then (7.29) is written as

$$\Omega^{[i+1]}(s) - \Omega^{[i]}(s) = \int_0^\infty \left( -q^{[i]T} N q^{[i]} + 2 q^{[i+1]T} N q^{[i]} - q^{[i+1]T} N q^{[i+1]} \right) d\tau. \tag{7.31}$$

According to the proof of Theorem 7.1 and (7.31), we have

$$\Omega^{[i+1]}(s) - \Omega^{[i]}(s) = -\int_0^\infty \sum_{k=1}^{m} \Lambda_{kk} \left( \bigl(y_k^{[i]}\bigr)^2 - 2 y_k^{[i+1]} y_k^{[i]} + \bigl(y_k^{[i+1]}\bigr)^2 \right) d\tau \le 0. \tag{7.32}$$

Moreover, according to the definition of $\Omega^*$, we have $\Omega^* \le \Omega^{[i+1]}$. Therefore, it is concluded that $\Omega^* \le \Omega^{[i+1]} \le \Omega^{[i]}$. This completes the proof.

7.3.2 Implementation Method of Off-Policy Iteration Algorithm

From the off-policy Bellman equation (7.17), we can see that the system dynamics are not required. Therefore, the off-policy PI algorithm depends only on the system state $s$ and the iterative control $q^{[i]}$. In (7.17), the function structures of $\Omega^{[i]}$ and $q^{[i]}$ are unknown, so in this subsection the critic and action networks are presented to approximate the iterative performance index function and the iterative control.

The critic network is expressed as

$$\Omega^{[i]}(s) = W_\Omega^{[i]T} \phi_\Omega(s) + \varepsilon_\Omega^{[i]}, \tag{7.33}$$

where $W_\Omega^{[i]}$ is the ideal weight of the critic network, $\phi_\Omega(s)$ is the activation function, and $\varepsilon_\Omega^{[i]}$ is the residual error. The estimation of $\Omega^{[i]}(s)$ is given as follows:

$$\hat\Omega^{[i]}(s) = \hat W_\Omega^{[i]T} \phi_\Omega(s), \tag{7.34}$$

where $\hat W_\Omega^{[i]}$ is the estimation of $W_\Omega^{[i]}$.

The action network is given as follows:

$$q^{[i]}(s) = W_q^{[i]T} \phi_q(s) + \varepsilon_q^{[i]}, \tag{7.35}$$

where $W_q^{[i]}$ is the ideal weight of the action network, $\phi_q(s)$ is the activation function, and $\varepsilon_q^{[i]}$ is the residual error. Accordingly, the estimation of $q^{[i]}(s)$ is given as follows:

$$\hat q^{[i]}(s) = \hat W_q^{[i]T} \phi_q(s). \tag{7.36}$$

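Then according to (7.17), we can define

$$e_H^{[i]} = \hat\Omega^{[i]}(s(t)) - \hat\Omega^{[i]}(s(t - T)) + \int_{t-T}^{t} (s^T M s + \hat q^{[i]T} N \hat q^{[i]})\, d\tau + \int_{t-T}^{t} 2 \hat q^{[i+1]T} N (q - \hat q^{[i]})\, d\tau. \tag{7.37}$$

From (7.34) we have

$$\hat\Omega^{[i]}(s(t)) - \hat\Omega^{[i]}(s(t - T)) = \hat W_\Omega^{[i]T} \Delta\phi_\Omega = (\Delta\phi_\Omega^T \otimes I)\, \mathrm{vec}(\hat W_\Omega^{[i]T}) = (\Delta\phi_\Omega^T \otimes I)\, \hat W_\Omega^{[i]}, \tag{7.38}$$

where $\Delta\phi_\Omega = \phi_\Omega(s(t)) - \phi_\Omega(s(t - T))$. According to (7.36), we have

$$\begin{aligned}
2 \hat q^{[i+1]T} N (q - \hat q^{[i]}) &= 2 (\hat W_q^{[i+1]T} \phi_q)^T N (q - \hat q^{[i]}) \\
&= 2 \phi_q^T \hat W_q^{[i+1]} N (q - \hat q^{[i]}) \\
&= 2 \left( ((q - \hat q^{[i]})^T N) \otimes \phi_q^T \right) \mathrm{vec}(\hat W_q^{[i+1]}).
\end{aligned} \tag{7.39}$$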
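The vectorization step in (7.39) rests on the identity $a^T W b = (b^T \otimes a^T)\,\mathrm{vec}(W)$ with column-stacking $\mathrm{vec}$. The following sketch checks it numerically; the dimensions and random data are hypothetical, `d` stands for $q - \hat q^{[i]}$, and $N$ is assumed symmetric so that $(N d)^T = d^T N$.

```python
import numpy as np

rng = np.random.default_rng(3)
p_dim, m_dim = 5, 2                       # hidden-layer size and control dimension
W = rng.standard_normal((p_dim, m_dim))   # stands in for the action weights
phi = rng.standard_normal(p_dim)          # stands in for phi_q(s)
d = rng.standard_normal(m_dim)            # stands in for q - q_hat
A_raw = rng.standard_normal((m_dim, m_dim))
N = A_raw @ A_raw.T + m_dim * np.eye(m_dim)   # symmetric positive definite

# Left-hand side of (7.39): 2 (W^T phi)^T N d.
direct = 2.0 * (W.T @ phi) @ N @ d

# Right-hand side: 2 ((d^T N) kron phi^T) vec(W), with column-major vec.
row = 2.0 * np.kron(d @ N, phi)
via_kron = row @ W.flatten(order="F")
```

The column-major flattening matters: NumPy flattens row by row by default, which would pair the Kronecker row with the wrong weight entries.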

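Then, (7.37) can be expressed as

$$e_H^{[i]} = (\Delta\phi_\Omega^T \otimes I)\, \hat W_\Omega^{[i]} + \int_{t-T}^{t} (s^T M s + \hat q^{[i]T} N \hat q^{[i]})\, d\tau + \int_{t-T}^{t} 2 \left( ((q - \hat q^{[i]})^T N) \otimes \phi_q^T \right) d\tau\; \mathrm{vec}(\hat W_q^{[i+1]}), \tag{7.40}$$

i.e.,

$$e_H^{[i]} = \left[ (\Delta\phi_\Omega^T \otimes I),\ \int_{t-T}^{t} 2 \left( ((q - \hat q^{[i]})^T N) \otimes \phi_q^T \right) d\tau \right] \begin{bmatrix} \hat W_\Omega^{[i]} \\ \mathrm{vec}(\hat W_q^{[i+1]}) \end{bmatrix} + \int_{t-T}^{t} (s^T M s + \hat q^{[i]T} N \hat q^{[i]})\, d\tau. \tag{7.41}$$

Define $\bar C^{[i]} = \left[ (\Delta\phi_\Omega^T \otimes I),\ \int_{t-T}^{t} 2 \left( ((q - \hat q^{[i]})^T N) \otimes \phi_q^T \right) d\tau \right]$, $\bar D^{[i]} = \int_{t-T}^{t} (s^T M s + \hat q^{[i]T} N \hat q^{[i]})\, d\tau$ and $\hat W^{[i]} = \begin{bmatrix} \hat W_\Omega^{[i]} \\ \mathrm{vec}(\hat W_q^{[i+1]}) \end{bmatrix}$. Then

$$e_H^{[i]} = \bar C^{[i]} \hat W^{[i]} + \bar D^{[i]}. \tag{7.42}$$

Let $E^{[i]} = 0.5\, e_H^{[i]T} e_H^{[i]}$. Then the update method for the weights of the critic and action networks is

$$\dot{\hat W}^{[i]} = -\eta_W \bar C^{[i]T} (\bar C^{[i]} \hat W^{[i]} + \bar D^{[i]}), \tag{7.43}$$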

where $\eta_W > 0$. In the following theorem, we prove that the proposed implementation method is convergent.

Theorem 7.3 Let the control input $q^{[i]}$ be given by (7.36), and let the updating method for the critic and action networks be as in (7.43). Define the weight error as $\tilde W^{[i]} = W^{[i]} - \hat W^{[i]}$. Then, for every iterative step, $\tilde W^{[i]}$ is UUB.

Proof Let the Lyapunov function candidate be as follows:

$$\Sigma = \frac{l_1}{2\eta_W} \tilde W^{[i]T} \tilde W^{[i]}, \tag{7.44}$$

where $l_1 > 0$. According to (7.43), we can obtain

$$\begin{aligned}
\dot{\tilde W}^{[i]} &= \eta_W \bar C^{[i]T} (\bar C^{[i]} (W^{[i]} - \tilde W^{[i]}) + \bar D^{[i]}) \\
&= -\eta_W \bar C^{[i]T} \bar C^{[i]} \tilde W^{[i]} + \eta_W \bar C^{[i]T} (\bar C^{[i]} W^{[i]} + \bar D^{[i]}).
\end{aligned} \tag{7.45}$$

Then we have

$$\begin{aligned}
\dot\Sigma &= -l_1 \tilde W^{[i]T} \bar C^{[i]T} \bar C^{[i]} \tilde W^{[i]} + l_1 \tilde W^{[i]T} \bar C^{[i]T} (\bar C^{[i]} W^{[i]} + \bar D^{[i]}) \\
&\le -l_1 \|\bar C^{[i]}\|^2 \|\tilde W^{[i]}\|^2 + \|\tilde W^{[i]}\|^2 + \|S^{[i]}\|^2 \\
&= (-l_1 \|\bar C^{[i]}\|^2 + 1) \|\tilde W^{[i]}\|^2 + \|S^{[i]}\|^2,
\end{aligned} \tag{7.46}$$

where $S^{[i]} = \frac{1}{2} \|l_1 \bar C^{[i]T} \bar C^{[i]} W^{[i]}\|^2 + \frac{1}{2} \|l_1 \bar C^{[i]T} \bar D^{[i]}\|^2$. Therefore, if there exists $l_1$ satisfying $l_1 \|\bar C^{[i]}\|^2 > 1$, and

$$\|\tilde W^{[i]}\| > \frac{\|S^{[i]}\|}{\sqrt{l_1 \|\bar C^{[i]}\|^2 - 1}} \tag{7.47}$$

holds, then $\dot\Sigma < 0$, which indicates that $\tilde W^{[i]}$ is UUB. The proof is completed.

7.3.3 Implementation Process

In this part, the implementation process is presented.

Step 1: Select the initial state $s$ and the initial admissible control. Define the matrices $M$ and $N$ in the performance index function.
Step 2: Let $i = 1, 2, \ldots$, and select the initial weights $\hat W_\Omega$ and $\hat W_q$ of the critic and action networks.
Step 3: Update the critic and action weights according to (7.43).
Step 4: If the output of the critic network satisfies $\|\hat\Omega^{[i+1]} - \hat\Omega^{[i]}\| < \theta$, then go to Step 5. Else, go to Step 3.
Step 5: Stop.
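The update loop in Steps 2 to 4 can be sketched as a discretized gradient flow on $E = 0.5\, e_H^T e_H$. The code below is a minimal sketch under several assumptions, not the authors' implementation: `C` stacks hypothetical samples of $\bar C^{[i]}$ row by row, `D` stacks the matching $\bar D^{[i]}$ values, the continuous-time law (7.43) is replaced by explicit Euler steps of size `dt`, and the stopping test uses the change in $E$ as a surrogate for the change in the critic output.

```python
import numpy as np

def train_weights(C, D, eta_W=0.5, dt=0.01, theta=1e-12, max_steps=200_000):
    """Euler-discretized gradient flow W' = -eta_W * C^T (C W + D), cf. (7.43)."""
    W = np.zeros(C.shape[1])            # Step 2: initial weights (zeros here)
    prev_E = np.inf
    for step in range(max_steps):
        e = C @ W + D                   # residual, cf. (7.42)
        E = 0.5 * float(e @ e)
        if abs(prev_E - E) < theta:     # Step 4: surrogate stopping test
            break
        W = W - dt * eta_W * (C.T @ e)  # Step 3: weight update
        prev_E = E
    return W, E, step

# Synthetic data: D is built so that the residual vanishes at W_true.
rng = np.random.default_rng(4)
C = rng.standard_normal((20, 4))
W_true = np.array([1.0, -2.0, 0.5, 3.0])
D = -C @ W_true

W_hat, E_final, n_steps = train_weights(C, D)
```

With noiseless synthetic data the estimate returns to `W_true`; with real trajectory data the residual would settle at a nonzero level determined by the network approximation errors, in line with the UUB statement of Theorem 7.3.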

7.4 Simulation Study

In this section, two examples are presented to demonstrate the effectiveness of the proposed optimal control method.

Example 7.1 We first consider the following nonlinear complex-valued system [11]

$$\dot z = f(z) + u, \tag{7.48}$$

where

$$f(z) = -\begin{bmatrix} 8 & 0 \\ 0 & 6 \end{bmatrix} z + \begin{bmatrix} 2 + 3i & 3 - i \\ 4 - 2i & 1 + 2i \end{bmatrix} \bar f(z), \tag{7.49}$$

and

$$\bar f_j(z) = \frac{1 - \exp(-s_j)}{1 + \exp(-s_j)} + i\, \frac{1}{1 + \exp(-y_j)}, \quad j = 1, 2. \tag{7.50}$$

According to the transformation in (7.3), it can be obtained that $s = [z_1^R, z_2^R, z_1^I, z_2^I]^T$, $\Gamma = [f_1^R, f_2^R, f_1^I, f_2^I]^T$ and $q = [u_1^R, u_2^R, u_1^I, u_2^I]^T$. For the infinite-horizon optimal control problem, the utility function is defined as $\bar r(z, u(z)) = z^H X z + u^H Y u$, where $X = Y = I$. The real part and the imaginary part of the activation functions in the critic network and the action network are sigmoid functions. The input of the networks is $s$. In the implementation procedure, the initial weights of the critic and action networks are selected in [0.5, 0.5]. After 50 time steps, the critic network weight $\hat W_\Omega$ and the action network weight $\hat W_q$ are convergent, and the optimal control is obtained. The state and control trajectories are shown in Figs. 7.1 and 7.2. We can see that the closed-loop system is asymptotically stable.

[Fig. 7.1 Control trajectories: plotted quantities $z_1^R$, $z_2^R$, $z_1^I$, $z_2^I$ versus time steps]

Example 7.2 Consider the following complex-valued oscillator system with modification [6]

$$\begin{aligned}
\dot z_1 &= z_1 + z_2 - z_1 (z_1^2 + z_2^2) + u_1, \\
\dot z_2 &= -z_1 + z_2 - z_2 (z_1^2 + z_2^2) + u_2,
\end{aligned} \tag{7.51}$$

where $z = [z_1, z_2]^T \in \mathbb{C}^2$, $u = [u_1, u_2]^T \in \mathbb{C}^2$, $z_j = z_j^R + i z_j^I$ and $u_j = u_j^R + i u_j^I$, $j = 1, 2$. Let $s = [z_1^R, z_2^R, z_1^I, z_2^I]^T$ and $q = [u_1^R, u_2^R, u_1^I, u_2^I]^T$. According to the proposed method, the critic and action networks are designed to approximate the performance index function and the control. The initial weights are selected in [0.5, 0.5]. The activation functions of the critic and action networks are sigmoid functions. After 100 time steps, the critic network weight $\hat W_\Omega$ and the action network weight $\hat W_q$ are convergent, and the optimal control is achieved. The state and control trajectories are shown in Figs. 7.3 and 7.4. They are all convergent.

[Fig. 7.2 Control trajectories: plotted quantities $u_1^R$, $u_2^R$, $u_1^I$, $u_2^I$ versus time steps]

[Fig. 7.3 Control trajectories: plotted quantities $z_1^R$, $z_2^R$, $z_1^I$, $z_2^I$ versus time steps]

[Fig. 7.4 Control trajectories: plotted quantities $u_1^R$, $u_2^R$, $u_1^I$, $u_2^I$ versus time steps]

7.5 Conclusion

An optimal control method for unknown complex-valued systems is proposed based on the PI algorithm. Off-policy learning is used to solve the HJB equation with completely unknown dynamics. Critic and action networks are used to obtain the iterative performance index and the iterative control. The asymptotic stability of the closed-loop system, the convergence of the iterative performance index function and the UUB of the weight error are proven. A simulation study demonstrates the effectiveness of the proposed optimal control method.

References

1. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction, A Bradford Book. The MIT Press, Cambridge (2005)
2. Lewis, F., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
3. Al-Tamimi, A., Lewis, F., Abu-Khalaf, M.: Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Trans. Syst. Man Cybern. B Cybern. 38(4), 943–949 (2008)
4. Murray, J., Cox, C., Lendaris, G., Saeks, R.: Adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. Syst. 32(2), 140–153 (2002)
5. Modares, H., Lewis, F., Jiang, Z.: H∞ tracking control of completely unknown continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. 26(10), 2550–2562 (2015)
6. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
7. Wang, J., Xu, X., Liu, D., Sun, Z., Chen, Q.: Self-learning cruise control using kernel-based least squares policy iteration. IEEE Trans. Control Syst. Technol. 22(3), 1078–1087 (2014)
8. Luo, B., Wu, H., Huang, T., Liu, D.: Data-based approximate policy iteration for affine nonlinear continuous-time optimal control design. Automatica 50(12), 3281–3290 (2014)
9. Modares, H., Lewis, F.: Linear quadratic tracking control of partially-unknown continuous-time systems using reinforcement learning. IEEE Trans. Autom. Control 59, 3051–3056 (2014)
10. Kiumarsi, B., Lewis, F., Modares, H., Karimpur, A., Naghibi-Sistani, M.: Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 50(4), 1167–1175 (2014)
11. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41, 779–791 (2005)

Chapter 8

Approximation-Error-ADP-Based Optimal Tracking Control for Chaotic Systems

In this chapter, an optimal tracking control scheme is proposed for a class of discrete-time chaotic systems using the approximation-error-based ADP algorithm. Via a system transformation, the optimal tracking problem is transformed into an optimal regulation problem, and then the novel optimal tracking control method is proposed. It is shown that, for the iterative ADP algorithm with finite approximation error, the iterative performance index functions converge to a finite neighborhood of the greatest lower bound of all performance index functions under some convergence conditions. Two examples are given to demonstrate the validity of the proposed optimal tracking control scheme for chaotic systems.

8.1 Introduction

In recent years, the control problem of chaotic systems has received considerable attention [1, 2]. Many different methods have been applied theoretically and experimentally to control chaotic systems, such as the adaptive synchronization control method [3] and the impulsive control method [4]. However, the methods mentioned above focus only on designing a controller for chaotic systems. Few of them consider the optimal tracking control problem, which is an important index of chaotic systems. Although the iterative ADP algorithm has advanced greatly in the control field, how to solve the optimal tracking control problem for chaotic systems based on the ADP algorithm is still an open problem. The reason is that in most implementations of ADP algorithms, the accurate optimal control and performance index function are not obtained, because of the approximation error between the approximate function and the expected one. This approximation error influences the control performance of the chaotic systems, so it needs to be considered in the ADP algorithm. This motivates our research.

In this chapter, we propose an approximation-error ADP algorithm to deal with the optimal tracking control problem for chaotic systems. First, the optimal tracking problem of the chaotic system is transformed into an optimal regulation problem, and the corresponding performance index function is defined. Then, the approximation-error ADP algorithm is established. It is proved that the ADP algorithm with approximation error makes the iterative performance index functions converge to a finite neighborhood of the optimal performance index function. Finally, two simulation examples are given to show the effectiveness of the proposed optimal tracking control algorithm for chaotic systems.

This chapter is organized as follows. In Sect. 8.2, we present the problem formulation. In Sect. 8.3, the optimal tracking control scheme is developed and the convergence proof is given. In Sect. 8.4, two examples are given to demonstrate the effectiveness of the proposed tracking control scheme. In Sect. 8.5, the conclusion is given.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019. R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_8

8.2 Problem Formulation and Preliminaries

Consider the MIMO chaotic dynamic system which can be represented in the following form:

$$
\begin{aligned}
x_1(k+1) &= f_1(x(k)) + \sum_{j=1}^{m} g_{1j} u_j(k), \\
&\;\;\vdots \\
x_n(k+1) &= f_n(x(k)) + \sum_{j=1}^{m} g_{nj} u_j(k),
\end{aligned}
\tag{8.1}
$$

where $x(k) = [x_1(k), x_2(k), \ldots, x_n(k)]^T$ is the system state vector, which is assumed to be available for measurement, $u(k) = [u_1(k), u_2(k), \ldots, u_m(k)]^T$ is the control input, and $g_{ij}$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$, are the constant control gains. If we denote $f(x(k)) = [f_1(x(k)), f_2(x(k)), \ldots, f_n(x(k))]^T$ and

$$
g = \begin{bmatrix}
g_{11} & g_{12} & \cdots & g_{1m} \\
\vdots & \vdots & \ddots & \vdots \\
g_{n1} & g_{n2} & \cdots & g_{nm}
\end{bmatrix},
\tag{8.2}
$$

then the chaotic system (8.1) can be rewritten as

$$x(k+1) = F(x(k), u(k)) = f(x(k)) + g u(k). \tag{8.3}$$

In fact, system (8.3) covers a large family of chaotic systems, such as the Hénon mapping [5], the new discrete chaotic system proposed in [6], and many others.
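As a minimal sketch of how a familiar chaotic map fits the form (8.3), the snippet below writes the Hénon map with its classical parameters (a = 1.4, b = 0.3) as the drift term f and adds a control through a constant gain matrix g. The choice g = I, the initial state, and the function names `henon_f` and `step` are illustrative assumptions, not values taken from the chapter.

```python
import numpy as np

def henon_f(x, a=1.4, b=0.3):
    """Drift term f(x(k)) of the Henon map (classical parameter values)."""
    return np.array([1.0 - a * x[0] ** 2 + x[1], b * x[0]])

def step(x, u, g):
    """One step of x(k+1) = f(x(k)) + g u(k), the form of Eq. (8.3)."""
    return henon_f(x) + g @ u

g = np.eye(2)                      # assumed constant control gain matrix
x0 = np.array([0.1, 0.0])          # assumed initial state
x_free = step(x0, np.zeros(2), g)  # uncontrolled step
x_ctrl = step(x0, np.array([0.5, -0.1]), g)  # step with a sample control
```

With u = 0 the update reduces to the uncontrolled Hénon map, so the control enters purely additively, as (8.3) requires.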

The objective of this chapter is to construct an optimal tracking controller such that the system state $x$ tracks the reference signal $x_d$ and all signals in the closed-loop system remain bounded. To meet this objective, we make the following assumption.

Assumption 8.1 The desired trajectory signal $x_d$ is continuous, bounded, and available for measurement. Furthermore, the desired trajectory $x_d$ has the following form [7]:

$$x_d(k+1) = f(x_d(k)) + g u_d(k). \tag{8.4}$$

Thus we define the tracking error as

$$e(k+1) = x(k+1) - x_d(k+1). \tag{8.5}$$

According to Refs. [8, 9], we define the control error

$$w(k) = u(k) - u_d(k), \tag{8.6}$$

and $w(k) = 0$ for $k < 0$, where $u_d(k)$ denotes the expected control, which can be given as

$$u_d(k) = g^T (g g^T + \varepsilon_0)^{-1} \big(x_d(k+1) - f(x_d(k))\big), \tag{8.7}$$

where $\varepsilon_0$ is a constant matrix used to guarantee the existence of the matrix inverse. Then the tracking error is obtained as

$$
\begin{aligned}
e(k+1) &= x(k+1) - x_d(k+1) \\
&= f(x(k)) - f(x_d(k)) + g w(k) \\
&= f(e(k) + x_d(k)) - f(x_d(k)) + g w(k).
\end{aligned}
\tag{8.8}
$$
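The expected control (8.7) can be sketched numerically as follows. This is an illustration under stated assumptions: the regularizer is taken as $\varepsilon_0 = \texttt{eps0} \cdot I$, the drift `f`, the non-square gain `g`, and the reference points are hypothetical, and `desired_control` is a name introduced here.

```python
import numpy as np

def desired_control(f, g, xd_k, xd_next, eps0=1e-8):
    """u_d(k) = g^T (g g^T + eps0*I)^{-1} (x_d(k+1) - f(x_d(k))), cf. Eq. (8.7).

    eps0*I is an assumed form of the constant matrix epsilon_0.
    """
    n = g.shape[0]
    rhs = xd_next - f(xd_k)
    return g.T @ np.linalg.solve(g @ g.T + eps0 * np.eye(n), rhs)

# Hypothetical drift and a non-square gain (n = 2 states, m = 3 inputs).
f = lambda x: np.array([1.0 - 1.4 * x[0] ** 2 + x[1], 0.3 * x[0]])
g = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

xd_k = np.array([0.2, 0.1])        # assumed reference at step k
xd_next = np.array([0.5, -0.3])    # assumed reference at step k+1
u_d = desired_control(f, g, xd_k, xd_next)
xd_reconstructed = f(xd_k) + g @ u_d   # should reproduce xd_next via Eq. (8.4)
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a numerical design choice; when $g g^T$ is well conditioned, the small regularizer changes the result only marginally.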

Remark 8.1 The term $g^T (g g^T + \varepsilon_0)^{-1}$ in Eq. (8.7) is the generalized inverse of $g$. Since $g \in \mathbb{R}^{n \times m}$ is in general not square, the generalized inverse technique is used in place of the inverse of $g$.

To solve the optimal tracking control problem, we present the following performance index function:

$$J(e(k), w(k)) = \sum_{l=k}^{\infty} U(e(l), w(l)), \tag{8.9}$$

where $U(e(k), w(k)) > 0$, $\forall e(k), w(k)$, is the utility function. In this chapter, we define

$$U(e(k), w(k)) = e^T(k) Q e(k) + (w(k) - w(k-1))^T R (w(k) - w(k-1)), \tag{8.10}$$

where $Q$ and $R$ are both diagonal positive definite matrices. In (8.10), the first term penalizes the tracking error, and the second term penalizes the difference of the control error. According to Bellman's principle of optimality, the optimal performance index function satisfies the HJB equation

$$J^*(e(k)) = \inf_{w(k)} \{U(e(k), w(k)) + J^*(e(k+1))\}, \tag{8.11}$$

and the optimal control error $w^*(k)$ is

$$w^*(k) = \arg\inf_{w(k)} J(e(k), w(k)). \tag{8.12}$$

Thus, the HJB equation (8.11) can be written as

$$J^*(e(k)) = U(e(k), w^*(k)) + J^*(e(k+1)), \tag{8.13}$$

and the optimal tracking control is

$$u^*(k) = w^*(k) + u_d(k). \tag{8.14}$$

We can see that to obtain the optimal tracking control $u^*(k)$, we must obtain $w^*(k)$ and the optimal performance index function $J^*(e(k))$. Generally, $J^*(e(k))$ is unknown before the whole sequence of control errors $w^*(k)$ is considered. If we adopt the traditional dynamic programming method to obtain the optimal performance index function, then we have to face the "curse of dimensionality". This makes it nearly impossible to obtain the optimal control directly from the HJB equation. So, in the next part, we present an iterative ADP algorithm based on Bellman's principle of optimality.

8.3 Optimal Tracking Control Scheme Based on Approximation-Error ADP Algorithm

8.3.1 Description of Approximation-Error ADP Algorithm

First, for $\forall e(k)$, let the initial function $\Upsilon(e(k))$ be an arbitrary function that satisfies $\Upsilon(e(k)) \in \bar{\Upsilon}_{e(k)}$, where $\bar{\Upsilon}_{e(k)}$ is defined as follows.

Definition 8.1 Let

$$\bar{\Upsilon}_{e(k)} = \{\Upsilon(e(k)) : \Upsilon(e(k)) > 0, \text{ and } \Upsilon(e(k+1)) < \Upsilon(e(k))\} \tag{8.15}$$

be the initial positive definite function set, where $e(k+1) = F(x(k), u(k)) - x_d(k+1)$.

For $\forall e(k)$, let the initial performance index function be $J^{[0]}(e(k)) = \hat{\theta}\Upsilon(e(k))$, where $\hat{\theta} > 0$ is a sufficiently large finite positive constant. The initial iterative control policy $w^{[0]}(k)$ can be computed as

$$w^{[0]}(k) = \arg\inf_{w(k)} \{U(e(k), w(k)) + J^{[0]}(e(k+1))\}, \tag{8.16}$$

where $J^{[0]}(e(k+1)) = \hat{\theta}\Upsilon(e(k+1))$. The performance index function can be updated as

$$J^{[1]}(e(k)) = U(e(k), w^{[0]}(k)) + J^{[0]}(e(k+1)). \tag{8.17}$$

For $i = 1, 2, \ldots$ and $\forall e(k)$, the iterative ADP algorithm iterates between the control policy

$$w^{[i]}(e(k)) = \arg\inf_{w(k)} \{U(e(k), w(k)) + J^{[i]}(e(k+1))\} \tag{8.18}$$

and the iterative performance index function

$$J^{[i+1]}(e(k)) = \inf_{w(k)} \{U(e(k), w(k)) + J^{[i]}(e(k+1))\}. \tag{8.19}$$
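The iteration (8.18)–(8.19) can be sketched on a grid for a scalar problem. Everything specific below is an assumption made for illustration: the error dynamics are taken as linear, $e(k+1) = a\,e(k) + b\,w(k)$; the utility is simplified to $q e^2 + r w^2$ (the $w(k-1)$ dependence of (8.10) is dropped); the initial function is $\Upsilon(e) = e^2$ with $\hat{\theta} = 10$; and the infimum is approximated by a minimum over a finite grid of $w$ values.

```python
import numpy as np

a, b, q, r = 0.9, 1.0, 1.0, 1.0       # hypothetical dynamics and weights
E = np.linspace(-1.0, 1.0, 201)        # grid over the tracking error e(k)
Wg = np.linspace(-1.0, 1.0, 201)       # grid over the control error w(k)

def backup(J):
    """One sweep of (8.18)-(8.19): minimize U + J(e(k+1)) over the w grid."""
    nxt = a * E[:, None] + b * Wg[None, :]            # e(k+1) for every (e, w)
    J_next = np.interp(nxt.ravel(), E, J).reshape(nxt.shape)
    cost = q * E[:, None] ** 2 + r * Wg[None, :] ** 2 + J_next
    return cost.min(axis=1), Wg[cost.argmin(axis=1)]

theta_hat = 10.0
J = theta_hat * E ** 2                 # J^[0](e) = theta_hat * Upsilon(e)
for i in range(30):
    J, w_policy = backup(J)            # J^[i+1] and the minimizing w^[i]
```

For this linear-quadratic stand-in, the limit can be compared with the quadratic solution $p e^2$ of the scalar Riccati equation, which gives a quick sanity check on the grid resolution.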

In practice, the accurate iterative control policy $w^{[i]}(k)$ and the accurate iterative performance index function $J^{[i]}(e(k))$ generally cannot be obtained. For example, if neural networks are used to implement the iterative ADP algorithm, then no matter what kind of neural network is chosen, an approximation error between the network output and the expected output must exist. Because of this approximation error, the accurate iterative control policy of the chaotic system cannot generally be obtained. So the iterative ADP algorithm with finite approximation error is expressed as follows:

$$\hat{w}^{[0]}(e(k)) = \arg\inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[0]}(e(k+1))\} + \alpha^{[0]}(e(k)), \tag{8.20}$$

where $\hat{J}^{[0]}(e(k+1)) = \hat{\theta}\Upsilon(e(k+1))$. The performance index function can be updated as

$$\hat{J}^{[1]}(e(k)) = U(e(k), \hat{w}^{[0]}(k)) + \hat{J}^{[0]}(e(k+1)) + \beta^{[0]}(e(k)), \tag{8.21}$$

where $\alpha^{[0]}(e(k))$ and $\beta^{[0]}(e(k))$ are the approximation error functions of the iterative control and the iterative performance index function, respectively. For $i = 1, 2, \ldots$, the iterative ADP algorithm iterates between

$$\hat{w}^{[i]}(e(k)) = \arg\inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[i]}(e(k+1))\} + \alpha^{[i]}(e(k)) \tag{8.22}$$

and the performance index function

$$\hat{J}^{[i+1]}(e(k)) = \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[i]}(e(k+1))\} + \beta^{[i]}(e(k)), \tag{8.23}$$

where $\alpha^{[i]}(e(k))$ and $\beta^{[i]}(e(k))$ are the approximation error functions of the iterative control and the iterative performance index function, respectively.

Remark 8.2 The proposed approximation-error-ADP-based method for chaotic systems is a development of the traditional ADP method. In the traditional ADP method, an approximation error exists in the implementation, but the analysis of the method disregards it. In this chapter, the approximation error is taken into account in the analysis of the algorithm, and the corresponding convergence proof is given. In the following theorems, we will prove that the iterative performance index function with approximation error is convergent if some conditions are satisfied.

8.3.2 Convergence Analysis of the Iterative ADP Algorithm

As the iteration index $i \to \infty$, the bound on the accumulated approximation error also increases to infinity, although the approximation error in each single iteration is finite [10]. The following theorem shows this property.

Theorem 8.1 Let $e(k)$ be an arbitrary controllable state. For $i = 1, 2, \ldots$, define a new iterative performance index function as

$$\Delta^{[i]}(e(k)) = \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[i-1]}(e(k+1))\}, \tag{8.24}$$

where $\hat{J}^{[i-1]}(e(k+1))$ is defined in (8.23), and $w(k)$ can be obtained accurately. If the initial iterative performance index function satisfies $\hat{J}^{[0]}(e(k)) = \Delta^{[0]}(e(k))$, and there exists a finite constant $\epsilon_1$ that makes

$$|\hat{J}^{[i]}(e(k)) - \Delta^{[i]}(e(k))| \le \epsilon_1 \tag{8.25}$$

hold uniformly, then we have

$$|\hat{J}^{[i]}(e(k)) - J^{[i]}(e(k))| \le i\epsilon_1, \tag{8.26}$$

where $\epsilon_1$ is called the uniform finite approximation error.

Proof The theorem can be proved by mathematical induction. First, let $i = 1$. Then we have

$$
\begin{aligned}
\Delta^{[1]}(e(k)) &= \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[0]}(e(k+1))\} \\
&= J^{[1]}(e(k)).
\end{aligned}
\tag{8.27}
$$

Then, according to (8.25), we can get

$$|\hat{J}^{[1]}(e(k)) - J^{[1]}(e(k))| \le \epsilon_1. \tag{8.28}$$

Assume that (8.26) holds for $i - 1$. Then, for $i$, we have

$$
\begin{aligned}
\Delta^{[i]}(e(k)) &= \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[i-1]}(e(k+1))\} \\
&\le \inf_{w(k)} \{U(e(k), w(k)) + J^{[i-1]}(e(k+1)) + (i-1)\epsilon_1\} \\
&= J^{[i]}(e(k)) + (i-1)\epsilon_1.
\end{aligned}
\tag{8.29}
$$

On the other hand, we have

$$
\begin{aligned}
\Delta^{[i]}(e(k)) &= \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[i-1]}(e(k+1))\} \\
&\ge \inf_{w(k)} \{U(e(k), w(k)) + J^{[i-1]}(e(k+1)) - (i-1)\epsilon_1\} \\
&= J^{[i]}(e(k)) - (i-1)\epsilon_1.
\end{aligned}
\tag{8.30}
$$

So, we get

$$-(i-1)\epsilon_1 \le \Delta^{[i]}(e(k)) - J^{[i]}(e(k)) \le (i-1)\epsilon_1. \tag{8.31}$$

Then, according to (8.25), we can get

$$|\hat{J}^{[i]}(e(k)) - J^{[i]}(e(k))| \le i\epsilon_1. \tag{8.32}$$
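The linear growth of the bound in Theorem 8.1 can be observed on a toy problem. The sketch below is not the chaotic tracking problem: it assumes a small finite-state deterministic system with random transitions and utilities (all hypothetical), runs the exact backup alongside a backup perturbed by noise bounded by `eps1` (playing the role of $\beta^{[i]}$), and records the gap after each iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, eps1 = 6, 3, 0.05

# Hypothetical deterministic transitions and positive utilities.
nxt = rng.integers(0, n_states, size=(n_states, n_actions))
U = rng.uniform(0.1, 1.0, size=(n_states, n_actions))

def backup(J):
    """Exact update: min over actions of U + J(next state), cf. Eq. (8.19)."""
    return np.min(U + J[nxt], axis=1)

J = J_hat = rng.uniform(0.0, 5.0, n_states)   # common initial function
gaps = []
for i in range(1, 11):
    J = backup(J)                              # exact iterate J^[i]
    delta = backup(J_hat)                      # Delta^[i] built from J_hat^[i-1]
    J_hat = delta + rng.uniform(-eps1, eps1, n_states)   # bounded error, cf. (8.25)
    gaps.append(np.max(np.abs(J_hat - J)))     # compare with i * eps1
```

Because the minimization is nonexpansive in the sup norm, each recorded gap stays within `i * eps1`, which is exactly the bound (8.26); the observed gaps are usually much smaller than this worst case.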

The properties of the iterative performance index function $\hat{J}^{[i]}(e(k))$ and the iterative control policy $\hat{w}^{[i]}(k)$ are clearly very difficult to analyze directly. In the next part, a novel convergence analysis is established.

Theorem 8.2 Let $e(k)$ be an arbitrary controllable state. For $\forall i = 0, 1, \ldots$, let $\Delta^{[i]}(e(k))$ be expressed as (8.24) and $\hat{J}^{[i]}(e(k))$ be expressed as (8.23). Let $\zeta < \infty$ and $1 \le \eta < \infty$ both be constants that make

$$J^*(e(k+1)) \le \zeta U(e(k), w(k)) \tag{8.33}$$

and

$$J^{[0]}(e(k)) \le \eta J^*(e(k)) \tag{8.34}$$

hold uniformly. If there exists $1 \le \iota < \infty$ that makes

$$\hat{J}^{[i]}(e(k)) \le \iota \Delta^{[i]}(e(k)) \tag{8.35}$$

hold uniformly, then we have

$$\hat{J}^{[i]}(e(k)) \le \iota \left(1 + \sum_{j=1}^{i} \frac{\zeta^j \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^i \iota^i (\eta - 1)}{(\zeta + 1)^i}\right) J^*(e(k)), \tag{8.36}$$

where we define $\sum_{j}^{i} (\cdot) = 0$ for $\forall j > i$ and $i, j = 0, 1, \ldots$.
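To make the bound (8.36) concrete, the following sketch evaluates the factor multiplying $J^*(e(k))$ for given $\zeta$, $\iota$, $\eta$ and iteration index $i$. The function name `bound_factor` and the sample parameter values are illustrative assumptions.

```python
def bound_factor(i, zeta, iota, eta):
    """Factor multiplying J*(e(k)) on the right-hand side of Eq. (8.36)."""
    partial_sum = sum(
        zeta ** j * iota ** (j - 1) * (iota - 1) / (zeta + 1) ** j
        for j in range(1, i + 1)
    )
    tail = zeta ** i * iota ** i * (eta - 1) / (zeta + 1) ** i
    return iota * (1 + partial_sum + tail)

# Hypothetical constants: zeta = 1, iota = 1.2, eta = 3.
factor_0 = bound_factor(0, 1.0, 1.2, 3.0)      # reduces to iota * eta
factor_1 = bound_factor(1, 1.0, 1.2, 3.0)
factor_large = bound_factor(200, 1.0, 1.2, 3.0)
```

The summands form a geometric sequence with ratio $\zeta\iota/(\zeta+1)$, so for these sample values (ratio 0.6) the factor settles as $i$ grows, whereas a ratio of one or more would make it grow without bound.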

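Proof The theorem can be proved by mathematical induction. First, let $i = 0$. Then, (8.36) becomes

$$\hat{J}^{[0]}(e(k)) \le \iota\eta J^*(e(k)). \tag{8.37}$$

As $\hat{J}^{[0]}(e(k)) \le \eta J^*(e(k))$, we can obtain $\hat{J}^{[0]}(e(k)) \le \eta J^*(e(k)) \le \iota\eta J^*(e(k))$, which gives (8.37). So, the conclusion holds for $i = 0$. Next, let $i = 1$. According to (8.27), we have

$$
\begin{aligned}
\Delta^{[1]}(e(k)) &= \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[0]}(e(k+1))\} \\
&\le \inf_{w(k)} \{U(e(k), w(k)) + \iota\eta J^*(e(k+1))\}.
\end{aligned}
\tag{8.38}
$$

As $1 \le \iota < \infty$ and $1 \le \eta < \infty$, we have $\iota\eta - 1 \ge 0$. Then, according to (8.33), we have

$$
\begin{aligned}
\Delta^{[1]}(e(k)) &\le \inf_{w(k)} \left\{\left(1 + \zeta\frac{\iota\eta - 1}{\zeta + 1}\right) U(e(k), w(k)) + \left(\iota\eta - \frac{\iota\eta - 1}{\zeta + 1}\right) J^*(e(k+1))\right\} \\
&= \left(1 + \zeta\frac{\iota\eta - 1}{\zeta + 1}\right) \inf_{w(k)} \{U(e(k), w(k)) + J^*(e(k+1))\} \\
&= \left(1 + \frac{\zeta(\iota - 1)}{\zeta + 1} + \frac{\zeta\iota(\eta - 1)}{\zeta + 1}\right) J^*(e(k)).
\end{aligned}
\tag{8.39}
$$

According to (8.35), we can obtain

$$\hat{J}^{[1]}(e(k)) \le \iota \left(1 + \frac{\zeta(\iota - 1)}{\zeta + 1} + \frac{\zeta\iota(\eta - 1)}{\zeta + 1}\right) J^*(e(k)), \tag{8.40}$$

which shows that (8.36) holds for $i = 1$.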

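Assume that (8.36) holds for $i - 1$. Then, for $i$, we have

$$
\begin{aligned}
\Delta^{[i]}(e(k)) &= \inf_{w(k)} \{U(e(k), w(k)) + \hat{J}^{[i-1]}(e(k+1))\} \\
&\le \inf_{w(k)} \left\{U(e(k), w(k)) + \iota \left(1 + \sum_{j=1}^{i-1} \frac{\zeta^j \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^{i-1} \iota^{i-1} (\eta - 1)}{(\zeta + 1)^{i-1}}\right) J^*(e(k+1))\right\}.
\end{aligned}
\tag{8.41}
$$

So, using (8.33), (8.41) can be written as

$$
\begin{aligned}
\Delta^{[i]}(e(k)) \le{} & \inf_{w(k)} \Bigg\{\left(1 + \zeta\left(\sum_{j=1}^{i-1} \frac{\zeta^{j-1} \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^{i-1} \iota^{i-1} (\iota\eta - 1)}{(\zeta + 1)^{i}}\right)\right) U(e(k), w(k)) \\
& + \Bigg(\iota \left(1 + \sum_{j=1}^{i-1} \frac{\zeta^j \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^{i-1} \iota^{i-1} (\eta - 1)}{(\zeta + 1)^{i-1}}\right) \\
& \quad - \left(\sum_{j=1}^{i-1} \frac{\zeta^{j-1} \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^{i-1} \iota^{i-1} (\iota\eta - 1)}{(\zeta + 1)^{i}}\right)\Bigg) J^*(e(k+1))\Bigg\} \\
={} & \left(1 + \sum_{j=1}^{i} \frac{\zeta^j \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^i \iota^i (\eta - 1)}{(\zeta + 1)^i}\right) \inf_{w(k)} \{U(e(k), w(k)) + J^*(e(k+1))\} \\
={} & \left(1 + \sum_{j=1}^{i} \frac{\zeta^j \iota^{j-1} (\iota - 1)}{(\zeta + 1)^j} + \frac{\zeta^i \iota^i (\eta - 1)}{(\zeta + 1)^i}\right) J^*(e(k)).
\end{aligned}
\tag{8.42}
$$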

Then, according to (8.35), we can obtain (8.36) which proves the conclusion for ∀ i = 0, 1, . . .. Theorem 8.3 Let e(k) be an arbitrary controllable state. Suppose Theorem 8.2 holds for ∀e(k). If for ζ < ∞ and ι ≥ 1, the inequality ι
0. Assume

9.2 Problem Statement


$$\|d\| \le B_d, \tag{9.2}$$

where $B_d$ is a positive number. For system (9.1) with $d = 0$, the performance index function is defined as

$$J(x) = \int_t^{\infty} r(x(\tau), u(\tau))\,d\tau, \tag{9.3}$$

where $r(x, u) = Q(x) + u^T R u$, in which $Q(x) > 0$ and $R$ is a symmetric positive definite matrix. It is assumed that there exists $q > 0$ satisfying $Q(x) \ge q\|x\|^2$. Before presenting the algorithms, let us introduce the concept of admissible control [18, 20].

Definition 9.1 For system (9.1) with $d = 0$, a control policy $u(x)$ is defined as admissible if $u(x)$ is continuous on a set $\Omega \subseteq \mathbb{R}^n$, $u(0) = 0$, $u(x)$ stabilizes the system, and $J(x)$ defined in (9.3) is finite for all $x$.

For an arbitrary admissible control $u$ and $d = 0$, the infinitesimal version of (9.3) is the Bellman equation

$$0 = J_x^T \dot{x} + r, \tag{9.4}$$

which means that along the solution of (9.1), one has

$$J_x^T \dot{x} + Q(x) + u^T R u = J_x^T (f + gu) + Q(x) + u^T R u = 0. \tag{9.5}$$

Therefore, one can get

$$J_x^T (f + gu) = -Q(x) - u^T R u. \tag{9.6}$$

Define the Hamiltonian function as

$$H(x, u, J_x) = J_x^T \dot{x} + r, \tag{9.7}$$

and let $J^*$ be the optimal performance index function. Then

$$0 = \min_{u \in U} H(x, u, J_x^*). \tag{9.8}$$

If the solution $J^*$ exists and is continuously differentiable, then the optimal control can be expressed as

$$u^* = \arg\min_{u \in U} H(x, u, J_x^*). \tag{9.9}$$

In general, if $d = 0$, then the analytical solution of (9.8) can be approximated by the policy iteration (PI) technique, which is given as follows.

Algorithm 2 PI
1: Let $i = 0$, select an admissible control $u^{[0]}$.
2: Let the iteration index $i \ge 0$, and solve $J^{[i]}$ from

$$0 = J_x^{[i]T} (f + g u^{[i]}) + Q(x) + u^{[i]T} R u^{[i]}, \tag{9.10}$$

where $J^{[0]} = 0$.
3: Update the control by

$$u^{[i+1]} = -\frac{1}{2} R^{-1} g^T J_x^{[i]}. \tag{9.11}$$

On-Policy Integral Reinforcement Learning (IRL)

From (9.10) and (9.11), it is clear that exact knowledge of the system dynamics is necessary for the iterations. However, for many complex industrial control processes, it is difficult to estimate or obtain the system model. Therefore, the on-policy integral reinforcement learning (IRL) algorithm with infinite-horizon performance index function is presented [11], which learns the optimal control solution without using complete knowledge of the system dynamics (9.1). The on-policy IRL algorithm is derived as follows.
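As a concrete illustration of PI Algorithm 2, consider the linear-quadratic special case, which is an assumption made here and not the general nonlinear setting of the chapter: $f = Ax$, $g = B$, $Q(x) = x^T Q x$ and $J^{[i]} = x^T P x$. Then (9.10) becomes a Lyapunov equation in $P$ and (9.11) becomes $u^{[i+1]} = -R^{-1} B^T P x$. The matrices and the helper name `solve_lyapunov` are hypothetical.

```python
import numpy as np

def solve_lyapunov(Ac, Qc):
    """Solve Ac^T P + P Ac + Qc = 0 by row-major vectorization."""
    n = Ac.shape[0]
    I = np.eye(n)
    M = np.kron(Ac.T, I) + np.kron(I, Ac.T)
    P = np.linalg.solve(M, -Qc.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)          # symmetrize against round-off

# Hypothetical stable plant, so that u^[0] = 0 is admissible.
A = np.array([[0.0, 1.0], [-1.0, -2.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = np.zeros((1, 2))                # u^[0] = -K x with K = 0
for i in range(15):
    Ac = A - B @ K                  # closed loop under u^[i]
    P = solve_lyapunov(Ac, Q + K.T @ R @ K)   # policy evaluation, cf. (9.10)
    K = np.linalg.solve(R, B.T @ P)           # policy improvement, cf. (9.11)

# Residual of the algebraic Riccati equation at the final P.
residual = A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T @ P) + Q
```

The evaluation step needs $A$ explicitly, which mirrors the remark in the text that (9.10) requires knowledge of the system dynamics.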

Let $d = 0$. For any time $T > 0$ and $t > T$, the performance index function (9.3) can be written in IRL form as

$$J(x(t - T)) = \int_{t-T}^{t} r(x(\tau), u(\tau))\,d\tau + J(x(t)). \tag{9.12}$$

Define $r_I = \int_{t-T}^{t} r(x(\tau), u(\tau))\,d\tau$. Then (9.12) can be expressed as

$$J(x(t - T)) = r_I + J(x(t)). \tag{9.13}$$

Let $u^{[i]}$ be obtained by (9.11). Then the original system (9.1) with $d = 0$ can be rewritten as

$$\dot{x} = f + g u^{[i]}. \tag{9.14}$$

From (9.10), one has

$$J_x^{[i]T} (f + g u^{[i]}) = -Q(x) - u^{[i]T} R u^{[i]}. \tag{9.15}$$

Therefore, for $i = 1, 2, \ldots$, one can obtain

$$
\begin{aligned}
J^{[i]}(x(t)) - J^{[i]}(x(t - T)) &= \int_{t-T}^{t} J_x^{[i]T} \dot{x}\,d\tau \\
&= -\int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau.
\end{aligned}
\tag{9.16}
$$

This is an IRL Bellman equation that can be solved instead of (9.10) at each step of the PI algorithm. This means that f (x) is not needed in this IRL PI algorithm. Details are given in [7].
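A quick numerical check of the IRL Bellman equation (9.16) is sketched below for a scalar linear system with a fixed linear policy. All numbers are hypothetical, and the closed-form trajectory is used only to generate data; the check itself touches only the sampled state, the applied control, and the value function.

```python
import numpy as np

# Hypothetical scalar system x' = a x + b u with policy u = -k x.
a, b, q, r, k = 1.0, 1.0, 1.0, 1.0, 3.0
ac = a - b * k                          # closed-loop pole (negative here)
p = (q + r * k ** 2) / (-2.0 * ac)      # J(x) = p x^2 solves (9.10) for this policy

T, x_start = 0.5, 1.2
tau = np.linspace(0.0, T, 2001)
x = x_start * np.exp(ac * tau)          # trajectory over one interval of length T
u = -k * x

# Integral reinforcement over the interval (trapezoidal rule).
integrand = q * x ** 2 + r * u ** 2
r_I = np.sum((integrand[1:] + integrand[:-1]) * np.diff(tau)) / 2.0

lhs = p * x[-1] ** 2 - p * x[0] ** 2    # J(x(t)) - J(x(t - T))
rhs = -r_I                              # right-hand side of (9.16)
```

In a learning setting the roles are reversed: `r_I` and the endpoint states are measured, and the relation is solved for the unknown value-function parameters.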

9.3 Off-Policy Actor-Critic Integral Reinforcement Learning

Building on Sect. 9.2, we now analyze the case in which $d$ is not equal to zero. In this section, on-policy IRL for nonzero disturbance is presented first, and then the off-policy IRL is given.

9.3.1 On-Policy IRL for Nonzero Disturbance

If the disturbance $d$ is not equal to zero, the IRL PI algorithm gives incorrect results. Let $u = u^{[i]}$. Then the original system (9.1) can be rewritten as

$$\dot{x} = f + g u^{[i]} + d. \tag{9.17}$$

According to (9.15), one has the Bellman equation

$$
\begin{aligned}
J^{[i]}(x(t)) - J^{[i]}(x(t - T)) &= \int_{t-T}^{t} J_x^{[i]T} \dot{x}\,d\tau \\
&= -\int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau + \int_{t-T}^{t} (J_x^{[i]T} d)\,d\tau.
\end{aligned}
\tag{9.18}
$$

Therefore, one can define

$$e_1 = J^{[i]}(x(t - T)) - J^{[i]}(x(t)) - \int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau + \int_{t-T}^{t} (J_x^{[i]T} d)\,d\tau. \tag{9.19}$$

This shows that the equation error $e_1$ is biased by a term depending on the unknown disturbance $d$. Therefore:

(1) The least squares method in [17] will not give the correct solution for $J^{[i]}$.
(2) If the unknown disturbance $d$ is not equal to zero, the iterative control $u^{[i]}$ cannot guarantee the stability of the closed-loop system when dynamics uncertainty occurs.

Therefore, the original method for $d = 0$ is not suitable for the nonlinear system with unknown external disturbance. In the following subsection, an off-policy IRL algorithm is used to decrease $e_1$ in (9.19) and to make the closed-loop system with external disturbance stable.

9.3.2 Off-Policy IRL for Nonzero Disturbance

Here we detail off-policy IRL for the case of nonzero disturbance. Off-policy IRL allows learning with completely unknown dynamics. However, as shown here, when there is an unknown disturbance, off-policy IRL may not perform properly. Let $u^{[i]}$ be obtained by (9.11). Then the original system (9.1) can be rewritten as

$$\dot{x} = f + g u^{[i]} + g(u - u^{[i]}) + d. \tag{9.20}$$

According to (9.15), one has the off-policy Bellman equation

$$
\begin{aligned}
J^{[i]}(x(t)) - J^{[i]}(x(t - T)) &= -\int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau + \int_{t-T}^{t} J_x^{[i]T} v^{[i]}\,d\tau \\
&= -\int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau \\
&\quad + \int_{t-T}^{t} -2u^{[i+1]T} R (u - u^{[i]})\,d\tau + \int_{t-T}^{t} (J_x^{[i]T} d)\,d\tau,
\end{aligned}
\tag{9.21}
$$

where $v^{[i]} = gu - g u^{[i]} + d$. The off-policy Bellman equation is the main equation used in off-policy learning. Algorithm 3 is implemented simply by iterating on (9.21), as detailed in the next results.

Algorithm 3 PI
1: Let $i = 0$, select an admissible control $u^{[0]}$.
2: Let the iteration index $i \ge 0$, and solve $J^{[i]}$ and $u^{[i]}$ simultaneously from (9.21).

Lemma 9.1 In essence, off-policy Algorithm 3 is equivalent to PI Algorithm 2 and converges to the optimal control solution.

Proof (1) From (9.10) in Algorithm 2, one has

$$J_x^{[i]T} (f + g u^{[i]}) = -Q(x) - u^{[i]T} R u^{[i]}. \tag{9.22}$$

Note that

$$J^{[i]}(x(t)) - J^{[i]}(x(t - T)) = \int_{t-T}^{t} J_x^{[i]T} \dot{x}\,d\tau. \tag{9.23}$$

Therefore, according to (9.20), one has

$$
\begin{aligned}
J^{[i]}(x(t)) - J^{[i]}(x(t - T)) &= \int_{t-T}^{t} J_x^{[i]T} \dot{x}\,d\tau \\
&= \int_{t-T}^{t} J_x^{[i]T} \big(f + g u^{[i]} + g(u - u^{[i]}) + d\big)\,d\tau.
\end{aligned}
\tag{9.24}
$$

From (9.22), one can obtain

$$J^{[i]}(x(t)) - J^{[i]}(x(t - T)) = -\int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau + \int_{t-T}^{t} J_x^{[i]T} v^{[i]}\,d\tau. \tag{9.25}$$

Here, (9.25) is the off-policy Bellman equation, which is the main equation in Algorithm 3. From (9.22)–(9.25), it can be seen that Algorithm 3 can be derived from Algorithm 2. In (9.20), if one lets $d = 0$ and $u = u^{[i]}$, then (9.24) can be written as

$$
\begin{aligned}
J^{[i]}(x(t)) - J^{[i]}(x(t - T)) &= \int_{t-T}^{t} J_x^{[i]T} \dot{x}\,d\tau \\
&= \int_{t-T}^{t} J_x^{[i]T} (f + g u^{[i]})\,d\tau.
\end{aligned}
\tag{9.26}
$$

Therefore, according to (9.4), one has

$$\int_{t-T}^{t} J_x^{[i]T} (f + g u^{[i]})\,d\tau = \int_{t-T}^{t} \big(-Q(x) - u^{[i]T} R u^{[i]}\big)\,d\tau, \tag{9.27}$$

which means

$$J_x^{[i]T} (f + g u^{[i]}) = -Q(x) - u^{[i]T} R u^{[i]}. \tag{9.28}$$

From (9.26)–(9.28), it can be seen that Algorithm 2 can be derived from Algorithm 3.

(2) In [21] and [22], it was shown that as the iteration of Algorithm 2 proceeds, $J^{[i]}$ and $u^{[i]}$ converge to the optimal solutions $J^*$ and $u^*$, respectively. Therefore, one can get the optimal solutions $J^*$ and $u^*$ by Algorithm 3.

The iteration (9.10) in Algorithm 2 needs knowledge of the system dynamics, whereas in (9.21) of Algorithm 3 the system dynamics do not appear explicitly. Therefore, Algorithms 2 and 3 are aimed at different situations, although the two algorithms are essentially the same. The online solution of (9.21) is detailed in Sect. 9.3.3. According to (9.21), one can define

$$
\begin{aligned}
e_2 &= J^{[i]}(x(t - T)) - J^{[i]}(x(t)) - \int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau + \int_{t-T}^{t} J_x^{[i]T} v^{[i]}\,d\tau \\
&= J^{[i]}(x(t - T)) - J^{[i]}(x(t)) - \int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} u^{[i]T} R u^{[i]}\,d\tau \\
&\quad + \int_{t-T}^{t} J_x^{[i]T} (gu - g u^{[i]})\,d\tau + \int_{t-T}^{t} J_x^{[i]T} d\,d\tau.
\end{aligned}
\tag{9.29}
$$

This equation was developed for the case $d = 0$ in [19]. For nonzero $d$, the equation error $e_2$ is biased by a term depending on the unknown disturbance $d$. This may cause nonconvergence or biased results. Note that in (9.29), $d$ is the unknown external disturbance and $J_x^{[i]}$ may be nonanalytic. To solve for $J_x^{[i]}$ and $u^{[i]}$ from (9.29), critic and action networks are introduced to obtain $J_x^{[i]}$ and $u^{[i]}$ approximately, as shown next.

9.3.3 NN Approximation for Actor-Critic Structure

Here we introduce neural network approximation structures for $J_x^{[i]}$ and $u^{[i]}$. These are termed the critic NN and the actor NN, respectively. For off-policy learning, these two structures are updated simultaneously using the off-policy Bellman equation (9.21), as shown here. Let the ideal critic network expression be

$$J^{[i]}(x) = W_c^{[i]T} \varphi_c(x) + \varepsilon_c^{[i]}(x), \tag{9.30}$$

where $W_c^{[i]} \in \mathbb{R}^{n_1 \times 1}$ is the ideal weight of the critic network, $\varphi_c \in \mathbb{R}^{n_1 \times 1}$ is the activation function, and $\varepsilon_c^{[i]}$ is the residual error. Then one has

$$J_x^{[i]} = \nabla\varphi_c^T W_c^{[i]} + \nabla\varepsilon_c^{[i]}. \tag{9.31}$$

Let the estimate of $W_c^{[i]}$ be $\hat{W}_c^{[i]}$. Then the estimate of $J^{[i]}$ can be expressed as

$$\hat{J}^{[i]} = \hat{W}_c^{[i]T} \varphi_c. \tag{9.32}$$

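Accordingly, one has

$$\hat{J}_x^{[i]} = \nabla\varphi_c^T \hat{W}_c^{[i]}. \tag{9.33}$$

Let $\Delta\varphi_c = \Delta\varphi_c(x(t - T)) = \varphi_c(x(t)) - \varphi_c(x(t - T))$. Then the first term of (9.21) is expressed as

$$
\begin{aligned}
\hat{J}^{[i]}(x(t)) - \hat{J}^{[i]}(x(t - T)) &= \hat{W}_c^{[i]T} \Delta\varphi_c \\
&= (\Delta\varphi_c^T \otimes I)\,\mathrm{vec}(\hat{W}_c^{[i]T}) = (\Delta\varphi_c^T \otimes I)\,\hat{W}_c^{[i]}.
\end{aligned}
\tag{9.34}
$$

The last term of (9.29) is

$$\hat{J}_x^{[i]T} d = \hat{W}_c^{[i]T} \nabla\varphi_c d = \big((\nabla\varphi_c d)^T \otimes I\big)\,\hat{W}_c^{[i]}. \tag{9.35}$$

Let the ideal action network expression be

$$u^{[i]}(x) = W_a^{[i]T} \varphi_a(x) + \varepsilon_a^{[i]}(x), \tag{9.36}$$

where $W_a^{[i]} \in \mathbb{R}^{n_2 \times m}$ is the ideal weight of the action network, $\varphi_a \in \mathbb{R}^{n_2 \times 1}$ is the activation function, and $\varepsilon_a^{[i]}$ is the residual error. Let the estimate of $W_a^{[i]}$ be $\hat{W}_a^{[i]}$. Then the estimate $\hat{u}^{[i]}$ can be expressed as

$$\hat{u}^{[i]} = \hat{W}_a^{[i]T} \varphi_a. \tag{9.37}$$

Then one has

$$
\begin{aligned}
J_x^{[i]T} g(u - \hat{u}^{[i]}) &= -2\hat{u}^{[i+1]T} R (u - \hat{u}^{[i]}) \\
&= -2(\hat{W}_a^{[i+1]T} \varphi_a)^T R (u - \hat{u}^{[i]}) \\
&= -2\varphi_a^T \hat{W}_a^{[i+1]} R (u - \hat{u}^{[i]}) \\
&= -2\big(((u - \hat{u}^{[i]})^T R) \otimes \varphi_a^T\big)\,\mathrm{vec}(\hat{W}_a^{[i+1]}).
\end{aligned}
\tag{9.38}
$$

Therefore, one can define the residual error as

$$
\begin{aligned}
e_3 &= \hat{J}^{[i]}(x(t - T)) - \hat{J}^{[i]}(x(t)) - \int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} \hat{u}^{[i]T} R \hat{u}^{[i]}\,d\tau \\
&\quad + \int_{t-T}^{t} \hat{J}_x^{[i]T} g(u - \hat{u}^{[i]})\,d\tau + \int_{t-T}^{t} (\hat{J}_x^{[i]T} d)\,d\tau.
\end{aligned}
\tag{9.39}
$$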
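The Kronecker-product rewriting used in the last line of (9.38) rests on the identity $a^T W b = (b^T \otimes a^T)\,\mathrm{vec}(W)$, with $\mathrm{vec}$ stacking columns. A small numerical check, with randomly generated stand-ins for $\varphi_a$, $\hat{W}_a^{[i+1]}$, $R$ and $u - \hat{u}^{[i]}$ (all hypothetical), is sketched below.

```python
import numpy as np

rng = np.random.default_rng(1)
n2, m = 4, 2

phi_a = rng.normal(size=(n2, 1))          # stand-in for the activation vector
W_a = rng.normal(size=(n2, m))            # stand-in for the actor weight estimate
A = rng.normal(size=(m, m))
R = A @ A.T + m * np.eye(m)               # symmetric positive definite weight
du = rng.normal(size=(m, 1))              # stand-in for u - u_hat

# Third line of (9.38): -2 phi_a^T W_a R (u - u_hat).
lhs = (-2.0 * phi_a.T @ W_a @ R @ du).item()

# Last line of (9.38): -2 (((u - u_hat)^T R) kron phi_a^T) vec(W_a),
# with vec() stacking the columns of W_a (Fortran order).
vec_W = W_a.reshape(-1, 1, order="F")
rhs = (-2.0 * np.kron(du.T @ R, phi_a.T) @ vec_W).item()
```

The rewriting matters because it makes the expression linear in the stacked weight vector, which is what allows the critic and actor weights to be collected into a single unknown later on.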

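According to (9.34)–(9.38), (9.39) is written as

$$
\begin{aligned}
e_3 &= -(\Delta\varphi_c^T \otimes I)\,\hat{W}_c^{[i]} - \int_{t-T}^{t} Q(x)\,d\tau - \int_{t-T}^{t} \hat{u}^{[i]T} R \hat{u}^{[i]}\,d\tau \\
&\quad - 2\int_{t-T}^{t} \big((u - \hat{u}^{[i]})^T R\big) \otimes \varphi_a^T\,d\tau\;\mathrm{vec}(\hat{W}_a^{[i+1]}) + \int_{t-T}^{t} (\nabla\varphi_c d)^T \otimes I\,d\tau\;\hat{W}_c^{[i]}.
\end{aligned}
\tag{9.40}
$$

Define $D_{cc} = -(\Delta\varphi_c^T \otimes I)$, $D_{xx} = \int_{t-T}^{t} Q(x)\,d\tau + \int_{t-T}^{t} \hat{u}^{[i]T} R \hat{u}^{[i]}\,d\tau$, $D_{aa} = -2\int_{t-T}^{t} \big((u - \hat{u}^{[i]})^T R\big) \otimes \varphi_a^T\,d\tau$ and $D_{dd} = \int_{t-T}^{t} (\nabla\varphi_c d)^T \otimes I\,d\tau$. Then (9.40) is expressed as

$$e_3 = D_{cc}\hat{W}_c^{[i]} - D_{xx} + D_{aa}\,\mathrm{vec}(\hat{W}_a^{[i+1]}) + D_{dd}\hat{W}_c^{[i]}. \tag{9.41}$$

Let $\Psi = [D_{cc} + D_{dd},\; D_{aa}]$ and $\hat{W}^{[i]} = \begin{bmatrix} \hat{W}_c^{[i]} \\ \mathrm{vec}(\hat{W}_a^{[i+1]}) \end{bmatrix}$. Then one has

$$e_3 = \Psi\hat{W}^{[i]} - D_{xx}. \tag{9.42}$$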

This error allows the weights of the critic NN and the actor NN to be updated simultaneously. Data is collected repeatedly at the end of each interval of length $T$. When sufficient samples have been collected, this equation can be solved for the weight vector using the gradient descent method. Unfortunately, $d$ is unknown, so $D_{dd}$ and $\Psi$ are unknown. Therefore we define $\bar{D}_{dd} = \int_{t-T}^{t} (\nabla\varphi_c B_d)^T \otimes I\,d\tau$ and $\bar{\Psi} = [D_{cc} + \bar{D}_{dd},\; D_{aa}]$, where $B_d$ is the bound of $d$ in (9.2). Then the estimated residual error is expressed as

$$e_4 = \bar{\Psi}\hat{W}^{[i]} - D_{xx}. \tag{9.43}$$

Let $E = \frac{1}{2} e_4^T e_4$. According to the gradient descent method, a solution can be found by updating $\hat{W}^{[i]}$ using

$$\dot{\hat{W}}^{[i]} = -\alpha_w \bar{\Psi}^T (\bar{\Psi}\hat{W}^{[i]} - D_{xx}), \tag{9.44}$$

where $\alpha_w$ is a positive number. This yields an online method for updating the weights of the critic NN and the actor NN simultaneously. Define the weight error $\tilde{W}^{[i]} = W^{[i]} - \hat{W}^{[i]}$. Then, according to (9.44), one has

$$
\begin{aligned}
\dot{\tilde{W}}^{[i]} &= \alpha_w \bar{\Psi}^T \bar{\Psi}\hat{W}^{[i]} - \alpha_w \bar{\Psi}^T D_{xx} \\
&= -\alpha_w \bar{\Psi}^T \bar{\Psi}\tilde{W}^{[i]} + \alpha_w \bar{\Psi}^T \bar{\Psi} W^{[i]} - \alpha_w \bar{\Psi}^T D_{xx}.
\end{aligned}
\tag{9.45}
$$
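A forward-Euler discretization of the update law (9.44) is sketched below on synthetic data. The stacked regressor `Psi_bar`, the target `D_xx`, the step size, and the iteration count are all assumptions chosen for illustration; in the chapter these quantities come from the integrals collected over each interval of length $T$.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_weights = 20, 4

# Synthetic stand-ins: one row of Psi_bar and one entry of D_xx per data interval.
Psi_bar = rng.normal(size=(n_samples, n_weights))
W_ref = rng.normal(size=n_weights)             # weights that make e4 vanish
D_xx = Psi_bar @ W_ref

alpha_w = 1.0                                  # learning gain in (9.44)
h = 1.0 / np.linalg.norm(Psi_bar, 2) ** 2      # Euler step small enough for stability

W_hat = np.zeros(n_weights)
for _ in range(2000):
    e4 = Psi_bar @ W_hat - D_xx                # residual, cf. (9.43)
    W_hat = W_hat - h * alpha_w * Psi_bar.T @ e4   # Euler step of (9.44)

final_residual = np.linalg.norm(Psi_bar @ W_hat - D_xx)
```

With a full-column-rank regressor the iteration converges to the least-squares solution; too few or poorly excited samples would leave some weight directions unidentified, which is why sufficient samples must be collected before the update is meaningful.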

9.4 Disturbance Compensation Redesign and Stability Analysis

It has been seen that if there is an unknown disturbance, off-policy IRL may not perform properly and may yield a biased solution. In this section we show how to redesign the off-policy IRL method by adding a disturbance compensator. Using Lyapunov techniques, it is shown that this method yields proper performance.

9.4.1 Disturbance Compensation Off-Policy Controller Design

Here we propose the structure of the disturbance compensated off-policy IRL method. Stability analysis is given in terms of Lyapunov theory. The following assumption is first given.

Assumption 9.1 The activation function of the action network satisfies $\|\varphi_a\| \le \varphi_{aM}$. The partial derivative of the activation function satisfies $\|\nabla\varphi_c\| \le \varphi_{cdM}$. The ideal weights of the critic and action networks satisfy $\|W_c^{[i]}\| \le W_{cM}^{[i]}$ and $\|W_a^{[i]}\| \le W_{aM}^{[i]}$ on a compact set. The approximation error of the action network satisfies $\|\varepsilon_a^{[i]}\| \le \varepsilon_{aM}^{[i]}$ on a compact set.

Remark 9.1 Assumption 9.1 is a standard assumption in NN control theory. Many NN activation functions are bounded and have bounded derivatives. Examples include the sigmoid, symmetric sigmoid, hyperbolic tangent, and radial basis function. Continuous functions are bounded on a compact set. Hence the NN weights are bounded. The approximation error boundedness property is established in [23].

From Assumption 9.1, it can be seen that $\bar{\Psi}$ and $D_{xx}$ are bounded. Without loss of generality, write $\|\bar{\Psi}\| \le B_\Psi$ and $\|D_{xx}\| \le B_{xx}$ for positive numbers $B_\Psi$ and $B_{xx}$.

Disturbance Compensated Off-Policy IRL

For unknown disturbance $d \ne 0$, the methods in [17, 19] can be modified to compensate for the unknown disturbance, as now shown. The disturbance compensation controller is designed as

$$u_c^{[i]} = -K_c g_M^T x / (x^T x + b), \tag{9.46}$$

where $K_c \ge B_d^2 (x^T x + b)/2$ and $b > 0$. Let

$$u_s^{[i]} = \hat{u}^{[i]} + u_c^{[i]} \tag{9.47}$$

be the control input of system (9.1). Then one can write

$$\dot{x} = f + g\hat{u}^{[i]} + g u_c^{[i]} + d. \tag{9.48}$$

Fig. 9.1 Whole structure of system (9.48)

The whole structure of system (9.48) is shown in the figure (Fig. 9.1). Based on (9.48), the following theorems can be obtained.

9.4.2 Stability Analysis

Our main result follows. It verifies the performance of the disturbance compensation redesigned off-policy IRL algorithm.

Theorem 9.1 Let the control input be equal to (9.47), and let the updating methods for the critic and action networks be as in (9.44). Suppose Assumption 9.1 holds and let the initial states be in the set such that the NN approximation error is bounded as in the assumption. Then for every iterative step $i$, the weight errors $\tilde{W}^{[i]}$ are UUB.

Proof Choose the Lyapunov function candidate as follows:

$$\Sigma = \Sigma_1 + \Sigma_2 + \Sigma_3, \tag{9.49}$$

where $\Sigma_1 = x^T x$, $\Sigma_2 = l_1 J^{[i]}(x)$ and $\Sigma_3 = \dfrac{l_2}{2\alpha_w} \tilde{W}^{[i]T} \tilde{W}^{[i]}$, with $l_1 > 0$, $l_2 > 0$.

As $f$ is locally Lipschitz, there exists $B_f > 0$ such that $\|f(x)\| \le B_f \|x\|$. Therefore, from (9.48), one has

$$
\begin{aligned}
\dot{\Sigma}_1 &= 2x^T (f + g\hat{u}^{[i]} + g u_c^{[i]} + d) \\
&\le 2B_f \|x\|^2 + \|x\|^2 + B_g^2 \|\hat{u}^{[i]}\|^2 + 2x^T g u_c^{[i]} + \|x\|^2 B_d^2 + 1 \\
&\le (2B_f + 1)\|x\|^2 + B_g^2 \|\hat{u}^{[i]}\|^2 - B_d^2 \|g g_M^T\| \|x\|^2 + \|x\|^2 B_d^2 + 1 \\
&\le (2B_f + 1)\|x\|^2 + B_g^2 \|\hat{u}^{[i]}\|^2 + 1.
\end{aligned}
\tag{9.50}
$$

From (9.10), one can get

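$$
\begin{aligned}
\dot{\Sigma}_2 &= l_1 \dot{J}^{[i]}(x) \\
&\le -l_1 Q(x) - l_1 \lambda_{\min}(R) \|\hat{u}^{[i]}\|^2 \\
&\le -l_1 q \|x\|^2 - l_1 \lambda_{\min}(R) \|\hat{u}^{[i]}\|^2.
\end{aligned}
\tag{9.51}
$$

Furthermore, one obtains

$$
\begin{aligned}
\dot{\Sigma}_3 &= -l_2 \tilde{W}^{[i]T} \bar{\Psi}^T \bar{\Psi} \tilde{W}^{[i]} + l_2 \tilde{W}^{[i]T} \bar{\Psi}^T \bar{\Psi} W^{[i]} - l_2 \tilde{W}^{[i]T} \bar{\Psi}^T D_{xx} \\
&\le -l_2 \bar{B}_\Psi^2 \|\tilde{W}^{[i]}\|^2 + l_2 \|\tilde{W}^{[i]}\| \big(\bar{B}_\Psi^2 \|W^{[i]}\| + \bar{B}_\Psi B_{xx}\big).
\end{aligned}
\tag{9.52}
$$

Thus,

$$
\begin{aligned}
\dot{\Sigma} &\le (2B_f + 1 - l_1 q)\|x\|^2 + \big(B_g^2 - l_1 \lambda_{\min}(R)\big)\|\hat{u}^{[i]}\|^2 + 1 \\
&\quad - l_2 \bar{B}_\Psi^2 \|\tilde{W}^{[i]}\|^2 + l_2 \|\tilde{W}^{[i]}\| \big(\bar{B}_\Psi^2 \|W^{[i]}\| + \bar{B}_\Psi B_{xx}\big).
\end{aligned}
\tag{9.53}
$$

Let $Z = [x^T, \hat{u}^{[i]T}, \|\tilde{W}^{[i]}\|]^T$. Then (9.53) is written as

$$\dot{\Sigma} \le -Z^T M Z + Z^T N + 1, \tag{9.54}$$

where $M = \mathrm{diag}[l_1 q - 2B_f - 1,\; l_1 \lambda_{\min}(R) - B_g^2,\; l_2]$ and $N = [0, 0, l_2(\bar{B}_\Psi^2 \|W^{[i]}\| + \bar{B}_\Psi B_{xx})]^T$. Select

$$l_1 > \max\{(2B_f + 1)/q,\; B_g^2/\lambda_{\min}(R)\} \tag{9.55}$$

and

$$l_2 > 0. \tag{9.56}$$

If

$$\|Z\| \ge \frac{\|N\|}{2\lambda_{\min}(M)} + \sqrt{\frac{\|N\|^2}{4\lambda_{\min}^2(M)} + \frac{1}{\lambda_{\min}(M)}}, \tag{9.57}$$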

then $\dot\Sigma \le 0$. Therefore, the weight error $\tilde W$ is UUB.

Remark 9.2 It is assumed that the compact set in Assumption 9.1 is larger than the set to which the state is bounded. This can be guaranteed under a mild condition on the initial states as detailed in Theorem 4.2.1 of [24].

This result shows that the closed-loop system (9.48) is stable regardless of the unknown disturbance d. In the next theorem, the convergence analyses of $H(x, \hat u^{[i]}, \hat J_x^{[i]})$ and $u_s^{[i]}$ are given.

Theorem 9.2 Suppose the hypotheses of Theorem 9.1 hold. Define $H(x, u_s^{[i]}, \hat J_x^{[i]}) = \hat J_x^{[i]T}(f + g\hat u^{[i]} + g u_c^{[i]} + d) + Q(x) + \hat u^{[i]T} R\,\hat u^{[i]}$. Then
(1) $u_s^{[i]}$ is close to $u^{[i]}$ within a finite bound, as $t \to \infty$.
(2) $H(x, u_s^{[i]}, \hat J_x^{[i]}) = H(x, \hat W_a^{[i]}, \hat W_c^{[i]})$ is UUB.
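Before turning to the proof, the ultimate bound (9.57) obtained in the proof of Theorem 9.1 can be illustrated numerically. The sketch below is only an illustration: the function names and the sample values of $\lambda_{\min}(M)$ and $\|N\|$ are assumptions, and the second function uses the scalar bound $-\lambda_{\min}(M)\|Z\|^2 + \|N\|\|Z\| + 1$ that follows from (9.54).

```python
import math


def uub_radius(lam_min, n_norm):
    """Smallest ||Z|| satisfying (9.57), given lambda_min(M) > 0 and ||N||."""
    if lam_min <= 0:
        raise ValueError("lambda_min(M) must be positive; see (9.55)-(9.56)")
    return n_norm / (2 * lam_min) + math.sqrt(
        n_norm ** 2 / (4 * lam_min ** 2) + 1 / lam_min
    )


def sigma_dot_bound(z_norm, lam_min, n_norm):
    """Scalar upper bound on the Lyapunov derivative implied by (9.54)."""
    return -lam_min * z_norm ** 2 + n_norm * z_norm + 1
```

The radius is the positive root of the quadratic bound, so the bound is nonpositive outside the ball of that radius and positive just inside it.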

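Proof (1) From Theorem 9.1, x converges to a bounded neighborhood of zero as $t \to \infty$. Therefore, there exist $B_{uc} > 0$ and $B_{wa} > 0$ satisfying $\|u_c^{[i]}\| \le B_{uc}$ and $\|\tilde W_a^{[i]}\| \le B_{wa}$. Since $u_s^{[i]} - u^{[i]} = \hat W_a^{[i]T}\phi_a - W_a^{[i]T}\phi_a - \varepsilon_a^{[i]} + u_c^{[i]} = -\tilde W_a^{[i]T}\phi_a - \varepsilon_a^{[i]} + u_c^{[i]}$, one can get

$$\|u_s^{[i]} - u^{[i]}\| \le \phi_{aM}B_{wa} + \varepsilon_{aM}^{[i]} + B_{uc}. \tag{9.58}$$

Equation (9.58) means that $u_s^{[i]}$ is close to $u^{[i]}$ within a finite bound.

(2) According to (9.32) and (9.37), one has

$$\begin{aligned}
H(x, \hat W_a^{[i]}, \hat W_c^{[i]}) ={}& (W_c^{[i]} - \tilde W_c^{[i]})^T\nabla\phi_c f + (W_c^{[i]} - \tilde W_c^{[i]})^T\nabla\phi_c g u_c^{[i]}\\
&+ (W_c^{[i]} - \tilde W_c^{[i]})^T\nabla\phi_c\left(g W_a^{[i]T}\phi_a - g\tilde W_a^{[i]T}\phi_a + d\right)\\
&+ Q(x) + \phi_a^T(W_a^{[i]} - \tilde W_a^{[i]})R(W_a^{[i]T} - \tilde W_a^{[i]T})\phi_a,
\end{aligned} \tag{9.59}$$

which means that

$$\begin{aligned}
H(x, \hat W_a^{[i]}, \hat W_c^{[i]}) ={}& W_c^{[i]T}\nabla\phi_c f - \tilde W_c^{[i]T}\nabla\phi_c f + W_c^{[i]T}\nabla\phi_c g u_c^{[i]} - \tilde W_c^{[i]T}\nabla\phi_c g u_c^{[i]}\\
&+ W_c^{[i]T}\nabla\phi_c g W_a^{[i]T}\phi_a - \tilde W_c^{[i]T}\nabla\phi_c g W_a^{[i]T}\phi_a\\
&- W_c^{[i]T}\nabla\phi_c g\tilde W_a^{[i]T}\phi_a + \tilde W_c^{[i]T}\nabla\phi_c g\tilde W_a^{[i]T}\phi_a\\
&+ W_c^{[i]T}\nabla\phi_c d - \tilde W_c^{[i]T}\nabla\phi_c d + Q(x)\\
&+ \phi_a^T W_a^{[i]} R W_a^{[i]T}\phi_a - \phi_a^T\tilde W_a^{[i]} R W_a^{[i]T}\phi_a\\
&- \phi_a^T W_a^{[i]} R\tilde W_a^{[i]T}\phi_a + \phi_a^T\tilde W_a^{[i]} R\tilde W_a^{[i]T}\phi_a.
\end{aligned} \tag{9.60}$$

As $\varepsilon_a \to 0$ and $\varepsilon_c \to 0$, for a fixed admissible control policy, $H(x, W_a^{[i]}, W_c^{[i]})$ is bounded, i.e., there exists $H_B$ such that $\|H(x, W_a^{[i]}, W_c^{[i]})\| \le H_B$. Therefore, (9.60) can be written as

$$\begin{aligned}
H(x, \hat W_a^{[i]}, \hat W_c^{[i]}) ={}& -\tilde W_c^{[i]T}\nabla\phi_c f + W_c^{[i]T}\nabla\phi_c g u_c^{[i]} - \tilde W_c^{[i]T}\nabla\phi_c g u_c^{[i]}\\
&- \tilde W_c^{[i]T}\nabla\phi_c g W_a^{[i]T}\phi_a - W_c^{[i]T}\nabla\phi_c g\tilde W_a^{[i]T}\phi_a + \tilde W_c^{[i]T}\nabla\phi_c g\tilde W_a^{[i]T}\phi_a\\
&- \phi_a^T\tilde W_a^{[i]} R W_a^{[i]T}\phi_a - \phi_a^T W_a^{[i]} R\tilde W_a^{[i]T}\phi_a + \phi_a^T\tilde W_a^{[i]} R\tilde W_a^{[i]T}\phi_a\\
&+ W_c^{[i]T}\nabla\phi_c d - \tilde W_c^{[i]T}\nabla\phi_c d + H_B.
\end{aligned} \tag{9.61}$$

According to Assumption 9.1, one has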


$$\begin{aligned}
\|H(x, \hat W_a^{[i]}, \hat W_c^{[i]})\| \le{}& \phi_{cdM}B_f\|\tilde W_c^{[i]}\|\|x\| + \phi_{cdM}B_g\|W_c^{[i]}\|B_{uc} + \phi_{cdM}B_g\|\tilde W_c^{[i]}\|B_{uc}\\
&+ \phi_{cdM}B_g\phi_{aM}\|W_a^{[i]}\|\|\tilde W_c^{[i]}\| + \phi_{cdM}B_g\phi_{aM}\|W_c^{[i]}\|\|\tilde W_a^{[i]}\|\\
&+ \phi_{cdM}B_g\phi_{aM}\|\tilde W_a^{[i]}\|\|\tilde W_c^{[i]}\| + 2\phi_{aM}^2\|W_a^{[i]}\|\|R\|\|\tilde W_a^{[i]}\|\\
&+ \phi_{aM}^2\|R\|\|\tilde W_a^{[i]}\|^2 + \phi_{cdM}\|W_c^{[i]}\|B_d + \phi_{cdM}\|\tilde W_c^{[i]}\|B_d + H_B.
\end{aligned} \tag{9.62}$$

According to Theorems 9.1 and 9.2, the signals on the right-hand side of (9.62) are bounded; therefore $H(x, u_s^{[i]}, \hat J_x^{[i]})$ is UUB.

The previous theorem indicates that $\hat J^{[i]}$ and $u_s^{[i]}$ are close to $J^{[i]}$ and $u^{[i]}$ within a small bound. The final theorem shows that $J^{[i]}$ and $u^{[i]}$ converge to the optimal value and the optimal control policy.

Theorem 9.3 Let $J^{[i]}$ and $u^{[i]}$ be defined in (9.10) and (9.11). Then, for all $i = 0, 1, 2, \ldots$, $u^{[i+1]}$ is admissible and $0 \le J^{[i+1]} \le J^{[i]}$. Furthermore, $\lim_{i\to\infty} J^{[i]} = J^*$ and $\lim_{i\to\infty} u^{[i]} = u^*$.

Proof The proof can be seen in [19, 21].

9.5 Simulation Study

In this section we present simulation results that verify the performance of the disturbance-compensated off-policy IRL algorithm. Consider the following torsional pendulum system [25]

$$\begin{cases}
\dfrac{d\theta}{dt} = \omega + d_1,\\[4pt]
J\dfrac{d\omega}{dt} = u - Mgl\sin\theta - f_d\dfrac{d\theta}{dt} + d_2,
\end{cases} \tag{9.63}$$

where M = 1/3 kg and l = 2/3 m are the mass and length of the pendulum bar, respectively. The system states are the current angle θ and the angular velocity ω. Let $J = \frac{4}{3}Ml^2$ and $f_d = 0.2$ be the rotary inertia and frictional factor, respectively. Let g = 9.8 m/s² be the gravitational acceleration. The disturbance $d = [d_1; d_2]$ is white noise. The activation functions $\phi_c$ and $\phi_a$ are hyperbolic tangent functions. The structures of the critic and action networks are 2-8-1 and 2-8-1, respectively. Let $Q = I_2$, $R = 1$, $\alpha_w = 0.01$; then the control input and system state trajectories are displayed in Figs. 9.2 and 9.3, respectively. These results demonstrate the effectiveness of the disturbance-compensated off-policy IRL algorithm proposed in this chapter.
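As a rough sketch of the plant (9.63) only, the snippet below integrates the pendulum with a forward-Euler scheme. It does not reproduce the actor-critic controller or the results in Figs. 9.2 and 9.3; the step size, the zero default disturbance, and the `controller` callback are assumptions made for illustration.

```python
import math

# Parameters taken from the text of this section.
M, L_BAR, G, F_D = 1.0 / 3.0, 2.0 / 3.0, 9.8, 0.2
J = 4.0 / 3.0 * M * L_BAR ** 2  # rotary inertia, J = (4/3) M l^2


def pendulum_rhs(theta, omega, u, d1=0.0, d2=0.0):
    """Right-hand side of (9.63): returns (dtheta/dt, domega/dt)."""
    dtheta = omega + d1
    domega = (u - M * G * L_BAR * math.sin(theta) - F_D * dtheta + d2) / J
    return dtheta, domega


def simulate(theta0, omega0, controller, dt=1e-3, steps=20000):
    """Forward-Euler rollout; `controller` is a hypothetical u(theta, omega)."""
    theta, omega = theta0, omega0
    for _ in range(steps):
        dtheta, domega = pendulum_rhs(theta, omega, controller(theta, omega))
        theta, omega = theta + dt * dtheta, omega + dt * domega
    return theta, omega
```

For example, `simulate(1.0, 0.0, lambda th, om: 0.0)` rolls out the unforced, undisturbed pendulum, whose oscillation is damped by the friction term; a learned policy would be passed in place of the zero controller.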

Fig. 9.2 The proposed control input u (vertical axis: control; horizontal axis: time steps)

Fig. 9.3 The state x under the proposed control input (curves x(1) and x(2); vertical axis: state; horizontal axis: time steps)


9.6 Conclusion

This chapter proposes an optimal controller for unknown continuous-time systems with unknown disturbances. Based on policy iteration, an off-policy IRL algorithm is established to obtain the iterative control. Critic and action networks are used to approximate the iterative performance index function and the iterative control. The weight updating method is given based on off-policy IRL. A compensation controller is constructed to reduce the influence of unknown disturbances. Based on Lyapunov techniques, the weight errors are proven to be UUB. The convergence of the Hamiltonian function is also proven. The simulation study demonstrates the effectiveness of the proposed optimal control method for unknown systems with disturbances. From this chapter, we can see that the weight error bound depends on the disturbance bound. In the future, we will further study methods that focus on the unknown disturbance itself instead of the disturbance bound.

References

1. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming for large-scale systems with an application to multimachine power systems. IEEE Trans. Circuits Syst. II: Express Br. 59(10), 693–697 (2012)
2. Chen, B., Liu, K., Liu, X., Shi, P., Lin, C., Zhang, H.: Approximation-based adaptive neural control design for a class of nonlinear systems. IEEE Trans. Cybern. 44(5), 610–619 (2014)
3. Lewis, F., Vamvoudakis, K.: Reinforcement learning for partially observable dynamic processes: adaptive dynamic programming using measured output data. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 41(1), 14–25 (2011)
4. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
5. Lee, J., Park, J., Choi, Y.: Integral Q-learning and explorized policy iteration for adaptive optimal control of continuous-time linear systems. Automatica 48(11), 2850–2859 (2012)
6. Kiumarsi, B., Lewis, F., Modares, H., Karimpour, A., Naghibi-Sistani, M.: Reinforcement Q-learning for optimal tracking control of linear discrete-time systems with unknown dynamics. Automatica 50(4), 1167–1175 (2014)
7. Vrabie, D., Pastravanu, O., Lewis, F., Abu-Khalaf, M.: Adaptive optimal control for continuous-time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
8. Lewis, F., Vrabie, D., Vamvoudakis, K.: Reinforcement learning and feedback control: using natural decision methods to design optimal adaptive controllers. IEEE Control Syst. Mag. 32(6), 76–105 (2012)
9. Bhasin, S., Kamalapurkar, R., Johnson, M., Vamvoudakis, K., Lewis, F., Dixon, W.: A novel actor-critic-identifier architecture for approximate optimal control of uncertain nonlinear systems. Automatica 49(1), 82–92 (2013)
10. Vrabie, D., Lewis, F.: Adaptive dynamic programming for online solution of a zero-sum differential game. J. Control Theory Appl. 9(3), 353–360 (2011)
11. Vrabie, D., Lewis, F.: Integral reinforcement learning for online computation of feedback Nash strategies of nonzero-sum differential games. In: Proceedings of Decision and Control, Atlanta, GA, USA, pp. 3066–3071 (2010)
12. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. A Bradford Book. The MIT Press, Cambridge (2005)
13. Wang, J., Xu, X., Liu, D., Sun, Z., Chen, Q.: Self-learning cruise control using kernel-based least squares policy iteration. IEEE Trans. Control Syst. Technol. 22(3), 1078–1087 (2014)
14. Vamvoudakis, K., Vrabie, D., Lewis, F.: Online adaptive algorithm for optimal control with integral reinforcement learning. Int. J. Robust Nonlinear Control 24(17), 2686–2710 (2015)
15. Li, H., Liu, D., Wang, D.: Integral reinforcement learning for linear continuous-time zero-sum games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3), 706–714 (2014)
16. Luo, B., Wu, H., Huang, T.: Off-policy reinforcement learning for H∞ control design. IEEE Trans. Cybern. 45(1), 65–76 (2015)
17. Jiang, Y., Jiang, Z.: Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
18. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming for nonlinear control design. In: Proceedings of IEEE Conference on Decision and Control, Maui, Hawaii, USA, pp. 1896–1901 (2012)
19. Jiang, Y., Jiang, Z.: Robust adaptive dynamic programming and feedback stabilization of nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(5), 882–893 (2014)
20. Beard, R., Saridis, G., Wen, J.: Galerkin approximations of the generalized Hamilton-Jacobi-Bellman equation. Automatica 33(12), 2159–2177 (1997)
21. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
22. Saridis, G., Lee, C.: An approximation theory of optimal control for trainable manipulators. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 9(3), 152–159 (1979)
23. Hornik, K., Stinchcombe, M., White, H., Auer, P.: Degree of approximation results for feedforward networks approximating unknown mappings and their derivatives. Neural Comput. 6(6), 1262–1275 (1994)
24. Lewis, F., Jagannathan, S., Yesildirek, A.: Neural Network Control of Robot Manipulators and Nonlinear Systems. Taylor and Francis, London (1999)
25. Liu, D., Wei, Q.: Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(3), 621–634 (2014)

Chapter 10

An Iterative ADP Method to Solve for a Class of Nonlinear Zero-Sum Differential Games

In this chapter, an iterative ADP method is presented to solve a class of continuous-time nonlinear two-person zero-sum differential games. The idea is to use the ADP technique to iteratively obtain the optimal control pair that makes the performance index function reach the saddle point of the zero-sum differential game. When the saddle point does not exist, a mixed optimal control pair is obtained to make the performance index function reach the mixed optimum. Rigorous proofs are given to guarantee that the control pair stabilizes the nonlinear system, and the convergence of the performance index function is also proved. To facilitate the implementation of the iterative ADP method, neural networks are used to approximate the performance index function, compute the optimal control policy, and model the nonlinear system, respectively. Two examples are given to demonstrate the validity of the proposed method.

10.1 Introduction

A large class of real systems are controlled by more than one controller or decision maker, each using an individual strategy. These controllers often operate in a group with a general quadratic performance index function as a game [1]. Zero-sum (ZS) differential game theory has been widely applied to decision making problems [2–7], stimulated by a vast number of applications, including those in economics, management, communication networks, power networks, and the design of complex engineering systems. In these situations, many control schemes are presented in order to reach some form of optimality [8, 9]. Traditional approaches to ZS differential games seek the optimal solution, or saddle point, of the game. Much attention has therefore been devoted to the existence conditions of the saddle point of differential ZS games [10, 11].

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019. R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_10


In the real world, however, the existence conditions of the saddle point for ZS differential games are so difficult to satisfy that many applications of ZS differential games are limited to linear systems [12–14]. On the other hand, for many ZS differential games, especially in the nonlinear case, the optimal solution of the game (the saddle point) does not exist. Therefore, it is necessary to study optimal control approaches for ZS differential games in which the saddle point does not exist. An earlier optimal control scheme is the mixed trajectory method [15, 16]: one player selects an optimal probability distribution over his control set, the other player selects an optimal probability distribution over his own control set, and the expected solution of the game is then obtained in the probabilistic sense. The expected solution of the game is called the mixed optimal solution, and the corresponding performance index function is the mixed optimal performance index function. The main difficulty of the mixed trajectory method for ZS differential games is that the optimal probability distribution is very hard, if not impossible, to obtain over the whole real space. Furthermore, the mixed optimal solution is hardly reached once the control schemes are determined. In most cases (e.g., in engineering), however, the optimal solution or mixed optimal solution of the ZS differential game has to be achieved by a determined optimal or mixed optimal control scheme. In order to overcome these difficulties, a new iterative approach is proposed in this chapter to solve ZS differential games for nonlinear systems.

In this chapter, continuous-time two-person ZS differential games for nonlinear systems are solved by the iterative ADP method for the first time. When the saddle point exists, the proposed iterative ADP method yields the optimal control pair that makes the performance index function reach the saddle point. When the saddle point does not exist, a determined mixed optimal control scheme is proposed, based on the mixed trajectory method, to obtain the mixed optimal performance index function. In brief, the main contributions of this chapter include:
(1) Construct a new iterative method to solve two-person ZS differential games for a class of nonlinear systems using ADP.
(2) Obtain the optimal control pair that makes the performance index function reach the saddle point, with rigorous stability and convergence proofs.
(3) Design a determined mixed optimal control scheme to obtain the mixed optimal performance index function when there is no saddle point, and give the analysis of stability and convergence.

This chapter is organized as follows. Section 10.2 presents the preliminaries and assumptions. In Sect. 10.3, the iterative ADP method for ZS differential games is proposed. In Sect. 10.4, the neural network implementation of the optimal control scheme is presented. In Sect. 10.5, simulation studies are given to demonstrate the effectiveness of the proposed method. The conclusion is drawn in Sect. 10.6.

10.2 Preliminaries and Assumptions

Consider the following two-person ZS differential game. The state trajectory of the game at time t, denoted by x = x(t), is described by the continuous-time affine nonlinear function

$$\dot x = f(x, u, w) = a(x) + b(x)u + c(x)w, \tag{10.1}$$

where $x \in \mathbb{R}^n$, $u \in \mathbb{R}^k$, $w \in \mathbb{R}^m$ and the initial condition $x(0) = x_0$ is given. The two control variables u and w are functions on $[0, \infty)$ chosen, respectively, by player I and player II from some control sets $U[0, \infty)$ and $W[0, \infty)$, subject to the constraints $u \in U(t)$ and $w \in W(t)$ for $t \in [0, \infty)$, for given convex and compact sets $U(t) \subset \mathbb{R}^k$, $W(t) \subset \mathbb{R}^m$. The performance index function is of the generalized quadratic form (see [17]) given by

$$V(x(0), u, w) = \int_0^\infty \left(x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w\right) dt, \tag{10.2}$$

where the matrices A, B, C, D, E, F have suitable dimensions and $A \ge 0$, $B > 0$, $C < 0$. Thus, for all $t \in [0, \infty)$, the performance index function $V(x(t), u, w)$ (denoted by $V(x)$ for brevity in the sequel) is convex in u and concave in w. The function $l(x, u, w) = x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w$ is the general quadratic utility function. For the above ZS differential game, there are two controllers or players, where player I tries to minimize the performance index function $V(x)$, while player II attempts to maximize it. According to the situation of the two players, we have the following definitions. Let

$$\overline V(x) := \inf_{u \in U[t,\infty)}\ \sup_{w \in W[t,\infty)} V(x, u, w) \tag{10.3}$$

be the upper performance index function, and

$$\underline V(x) := \sup_{w \in W[t,\infty)}\ \inf_{u \in U[t,\infty)} V(x, u, w) \tag{10.4}$$

be the lower performance index function, with the obvious inequality $\overline V(x) \ge \underline V(x)$. Define the optimal control pairs to be $(\overline u, \overline w)$ and $(\underline u, \underline w)$ for the upper and lower performance index functions, respectively. Then we have

$$\overline V(x) = V(x, \overline u, \overline w), \tag{10.5}$$

and

$$\underline V(x) = V(x, \underline u, \underline w). \tag{10.6}$$

If both $\overline V(x)$ and $\underline V(x)$ exist and

$$\overline V(x) = \underline V(x) = V^*(x), \tag{10.7}$$


we say that the optimal performance index function of the ZS differential game, or the saddle point, exists, and the corresponding optimal control pair is denoted by $(u^*, w^*)$. The following assumptions and lemmas are in effect in the remaining sections.

Assumption 10.1 The nonlinear system (10.1) is controllable.

Assumption 10.2 The upper performance index function and the lower performance index function both exist.

Assumption 10.3 The system function $f(x, u, w)$ is Lipschitz continuous on $\mathbb{R}^n$ containing the origin. The constraint sets $U[0, \infty)$ and $W[0, \infty)$ are subsets of the space of measurable and integrable functions on $[0, \infty)$. The control sets $U(t)$ and $W(t)$, $t \in [0, \infty)$, are nonempty, closed, convex, and measurably dependent on t.

Based on the above assumptions, the following two lemmas are important for applying the dynamic programming method.

Lemma 10.1 (principle of optimality) If the upper and lower performance index functions are defined as in (10.3) and (10.4), respectively, then for $0 \le t \le \hat t < \infty$, $x \in \mathbb{R}^n$, $u \in \mathbb{R}^k$, $w \in \mathbb{R}^m$, we have

$$\overline V(x) = \inf_{u \in U[t,\hat t)}\ \sup_{w \in W[t,\hat t)} \left\{ \int_t^{\hat t} \left(x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w\right) dt + \overline V(x(\hat t)) \right\}, \tag{10.8}$$

and

$$\underline V(x) = \sup_{w \in W[t,\hat t)}\ \inf_{u \in U[t,\hat t)} \left\{ \int_t^{\hat t} \left(x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w\right) dt + \underline V(x(\hat t)) \right\}. \tag{10.9}$$

Proof See [18].

According to the above lemma, we can prove the following results.

Lemma 10.2 (HJI equation) If the upper and lower performance index functions are defined as in (10.3) and (10.4), respectively, we can obtain the following Hamilton–Jacobi–Isaacs (HJI) equations:

$$HJI(\overline V(x), \overline u, \overline w) = \overline V_t(x) + H(\overline V_x(x), \overline u, \overline w) = 0, \tag{10.10}$$

where $\overline V_t(x) = \dfrac{d\overline V(x)}{dt}$, $\overline V_x(x) = \dfrac{d\overline V(x)}{dx}$, and

$$H(\overline V_x(x), \overline u, \overline w) = \inf_{u \in U}\ \sup_{w \in W} \left\{ \overline V_x^T (a(x) + b(x)u + c(x)w) + \left(x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w\right) \right\}$$

is called the upper Hamilton function, and

$$HJI(\underline V(x), \underline u, \underline w) = \underline V_t(x) + H(\underline V_x(x), \underline u, \underline w) = 0, \tag{10.11}$$

where $\underline V_t(x) = \dfrac{d\underline V(x)}{dt}$, $\underline V_x(x) = \dfrac{d\underline V(x)}{dx}$, and

$$H(\underline V_x(x), \underline u, \underline w) = \sup_{w \in W}\ \inf_{u \in U} \left\{ \underline V_x^T (a(x) + b(x)u + c(x)w) + \left(x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w\right) \right\}$$

is called the lower Hamilton function.

Proof See [18].

10.3 Iterative Approximate Dynamic Programming Method for ZS Differential Games

The optimal control pair is obtained by solving the HJI equations (10.10) and (10.11), but these equations cannot generally be solved. There is currently no method for rigorously solving this type of equation to find the optimal performance index function of the system. This is the reason why we introduce the iterative ADP method. In this section, the iterative ADP method for ZS differential games is proposed, and we show that the iterative ADP method can be extended to ZS differential games.

10.3.1 Derivation of the Iterative ADP Method

The goal of the proposed iterative approximate dynamic programming method is to use the adaptive critic design technique to adaptively construct an optimal control pair $(u^*, w^*)$ that drives an arbitrary initial state $x(0)$ to the equilibrium point 0 and simultaneously makes the performance index function reach the saddle point $V^*(x)$, with rigorous convergence and stability criteria. As the values of the upper and lower performance index functions are not necessarily equal, which means that there is no saddle point, the optimal control pair $(u^*, w^*)$ may not exist. This motivates us to change the control schemes to obtain a new performance index function satisfying $\underline V(x) \le V^o(x) \le \overline V(x)$, where $V^o(x)$ is the mixed optimal performance index function of the ZS differential game [15, 16]. As $V^o(x)$ does not satisfy the HJI equations (10.10) and (10.11), it cannot be solved for directly in theory. Therefore, an iterative approximation approach is proposed in this chapter to approximate the mixed optimal performance index function.

Theorem 10.1 If Assumptions 10.1–10.3 hold, $(\overline u, \overline w)$ is the optimal control pair for the upper performance index function and $(\underline u, \underline w)$ is the optimal control pair for the lower performance index function, then there exist control pairs $(\overline u, w)$ and $(u, \underline w)$ which make $V^o(x) = V(x, \overline u, w) = V(x, u, \underline w)$. Furthermore, if the saddle point exists, then $V^o(x) = V^*(x)$.

Proof According to (10.3) and (10.5), we have $V^o(x) \le V(x, \overline u, \overline w)$. Simultaneously, we also have $V(x, \overline u, w) \le V(x, \overline u, \overline w)$. As the system (10.1) is controllable and w is continuous on $\mathbb{R}^m$, there exists a control pair $(\overline u, w)$ which makes $V^o(x) = V(x, \overline u, w)$. On the other hand, according to (10.4) and (10.6), we have $V^o(x) \ge V(x, \underline u, \underline w)$, and we can also find that $V(x, u, \underline w) \ge V(x, \underline u, \underline w)$. As u is continuous on $\mathbb{R}^k$, there exists a control pair $(u, \underline w)$ which makes $V^o(x) = V(x, u, \underline w)$. Then we have $V^o(x) = V(x, \overline u, w) = V(x, u, \underline w)$. If the saddle point exists, we have $V^*(x) = \overline V(x) = \underline V(x)$. On the other hand, $\underline V(x) \le V^o(x) \le \overline V(x)$. Then obviously $V^o(x) = V^*(x)$.

The above theorem builds up the relationship between the optimal or mixed optimal performance index function and the upper and lower performance index functions. It also implies that the mixed optimal performance index function can be obtained by regulating the optimal control pairs for the upper and lower performance index functions. So, first, it is necessary to find the optimal control pairs for both the upper and the lower performance index functions. Differentiating the HJI equation (10.10) with respect to the control w for the upper performance index function yields

$$\frac{\partial H}{\partial w} = \overline V_x^T c(x) + 2w^T C + 2u^T D + 2x^T F = 0. \tag{10.12}$$

Then we can get

$$\overline w = -\frac{1}{2} C^{-1}\left(2D^T \overline u + 2F^T x + c^T(x)\overline V_x\right). \tag{10.13}$$

Substituting (10.13) into (10.10) and taking the derivative with respect to u, we obtain

$$\overline u = -\frac{1}{2}\left(B - DC^{-1}D^T\right)^{-1}\left(2\left(E^T - DC^{-1}F^T\right)x + \left(b^T(x) - DC^{-1}c^T(x)\right)\overline V_x\right). \tag{10.14}$$
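As a quick sanity check of (10.13) and (10.14), the following sketch evaluates the control pair in the scalar case $n = k = m = 1$ and computes both partial derivatives of the Hamiltonian integrand, which should vanish at the stationary point. The function names and the numerical values used when calling them are hypothetical, and the restriction to scalars is an assumption made to keep the example dependency-free.

```python
def upper_control_pair(x, Vx, b, c, B, C, D, E, F):
    """Scalar (n = k = m = 1) version of (10.14) followed by (10.13)."""
    u = -0.5 * (2 * (E - D * F / C) * x + (b - D * c / C) * Vx) / (B - D * D / C)
    w = -0.5 * (2 * D * u + 2 * F * x + c * Vx) / C
    return u, w


def hamiltonian_gradients(x, Vx, u, w, b, c, B, C, D, E, F):
    """Partial derivatives of Vx*(a + b*u + c*w) + l(x, u, w) w.r.t. u and w."""
    dH_du = Vx * b + 2 * B * u + 2 * D * w + 2 * E * x
    dH_dw = Vx * c + 2 * C * w + 2 * D * u + 2 * F * x
    return dH_du, dH_dw
```

Calling `upper_control_pair` with, say, $B = 2$ and $C = -3$ (so that $B > 0$ and $C < 0$ as required after (10.2)) and passing the result to `hamiltonian_gradients` returns two values that are zero up to rounding.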

Thus, the detailed expression of the optimal control pair $(\overline u, \overline w)$ for the upper performance index function is obtained. For the lower performance index function, according to (10.11), taking the derivative with respect to the control u gives

$$\frac{\partial H}{\partial u} = \underline V_x^T b(x) + 2u^T B + 2w^T D^T + 2x^T E = 0. \tag{10.15}$$

Then we can get

$$\underline u = -\frac{1}{2} B^{-1}\left(2D\underline w + 2E^T x + b^T(x)\underline V_x\right). \tag{10.16}$$

Substituting (10.16) into (10.11) and taking the derivative with respect to w, we obtain

$$\underline w = -\frac{1}{2}\left(C - D^T B^{-1} D\right)^{-1}\left(2\left(F^T - D^T B^{-1} E^T\right)x + \left(c^T(x) - D^T B^{-1} b^T(x)\right)\underline V_x\right). \tag{10.17}$$

So the optimal control pair $(\underline u, \underline w)$ for the lower performance index function is also obtained. If the equality (10.7) holds under the optimal control pairs $(\overline u, \overline w)$ and $(\underline u, \underline w)$, we have a saddle point; if not, we have a game without a saddle point. For a differential game in which the saddle point does not exist, we adopt a mixed trajectory method to achieve the mathematical expectation of the performance index function. To apply the mixed trajectory method, the game matrix must be obtained under the trajectory sets of the control pair (u, w). Sufficiently small Gaussian noises $\gamma_u \in \mathbb{R}^k$ and $\gamma_w \in \mathbb{R}^m$ are added to the optimal controls $\underline u$ and $\overline w$, respectively, where $\gamma_u^i \sim (0, \sigma_i^2)$, $i = 1, 2, \ldots, k$, and $\gamma_w^j \sim (0, \sigma_j^2)$, $j = 1, 2, \ldots, m$, are zero-mean exploration noises with variances $\sigma_i^2$ and $\sigma_j^2$, respectively. Therefore, the upper and lower performance index functions (10.3) and (10.4) become $V(x, \overline u, (\overline w + \gamma_w))$ and $V(x, (\underline u + \gamma_u), \underline w)$, respectively, where $V(x, \overline u, (\overline w + \gamma_w)) > V(x, (\underline u + \gamma_u), \underline w)$ holds. Define the following game matrix

$$L = \begin{array}{c|cc}
 & II_1 & II_2\\ \hline
I_1 & L_{11} & L_{12}\\
I_2 & L_{21} & L_{22}
\end{array}, \tag{10.18}$$

where $L_{11} = V(x, \overline u, \overline w)$, $L_{12} = V(x, (\underline u + \gamma_u), \underline w)$, $L_{21} = V(x, \underline u, \underline w)$ and $L_{22} = V(x, \overline u, (\overline w + \gamma_w))$. Here $I_1$, $I_2$ are trajectories of player I, and $II_1$, $II_2$ are trajectories of player II. According to the principle of mixed trajectory [16], we can get the expected performance index function formulated by

$$E(V(x)) = \min_{P_{Ii}}\ \max_{P_{IIj}} \sum_{i=1}^{2}\sum_{j=1}^{2} P_{Ii} L_{ij} P_{IIj}, \tag{10.19}$$

where $P_{Ii} > 0$, $i = 1, 2$, is the probability of the trajectories of player I satisfying $\sum_{i=1}^{2} P_{Ii} = 1$, and $P_{IIj} > 0$, $j = 1, 2$, is the probability of the trajectories of player II satisfying $\sum_{j=1}^{2} P_{IIj} = 1$. Then we have


$$E(V(x)) = \alpha\overline V(x) + (1 - \alpha)\underline V(x), \tag{10.20}$$

where $\alpha = \dfrac{V^o - \underline V}{\overline V - \underline V}$.

For example, let the game matrix be

$$L = \begin{array}{c|cc}
 & II_1 & II_2\\ \hline
I_1 & L_{11} = 11 & L_{12} = 7\\
I_2 & L_{21} = 5 & L_{22} = 9
\end{array}.$$

According to (10.19), for the trajectories of player I, choose trajectory $I_1$ with probability $P_I$ and trajectory $I_2$ with probability $(1 - P_I)$; for the trajectories of player II, choose trajectory $II_1$ with probability $P_{II}$ and trajectory $II_2$ with probability $(1 - P_{II})$. Then the mathematical expectation can be expressed as

$$\begin{aligned}
E(L) &= L_{11}P_I P_{II} + L_{12}P_I(1 - P_{II}) + L_{21}(1 - P_I)P_{II} + L_{22}(1 - P_I)(1 - P_{II})\\
&= 11P_I P_{II} + 7P_I(1 - P_{II}) + 5(1 - P_I)P_{II} + 9(1 - P_I)(1 - P_{II})\\
&= 8\left(P_I - \frac{1}{2}\right)\left(P_{II} - \frac{1}{4}\right) + 8.
\end{aligned} \tag{10.21}$$

So the value of the expectation is 8, and it can be written as

$$E(L) = \frac{1}{2}L_{11} + \frac{1}{2}L_{21}.$$
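The worked example above can be reproduced with a few lines of code. The sketch below uses the standard closed-form solution of a 2×2 zero-sum matrix game in which the row player (player I) minimizes and the column player (player II) maximizes; the function name is hypothetical, and the closed form is a textbook result brought in for illustration rather than something stated in this chapter.

```python
from fractions import Fraction


def mixed_value_2x2(L):
    """Value and mixed strategies of a 2x2 game; rows minimize, columns maximize.

    Returns (value, P_I, P_II), where P_I is the probability of row I_1 and
    P_II the probability of column II_1. If a pure saddle point exists, the
    probabilities are returned as None.
    """
    (l11, l12), (l21, l22) = L
    upper = min(max(l11, l12), max(l21, l22))  # min over rows of the row maximum
    lower = max(min(l11, l21), min(l12, l22))  # max over columns of the column minimum
    if upper == lower:
        return Fraction(upper), None, None
    denom = Fraction(l11 + l22 - l12 - l21)
    p_I = Fraction(l22 - l21) / denom
    p_II = Fraction(l22 - l12) / denom
    value = Fraction(l11 * l22 - l12 * l21) / denom
    return value, p_I, p_II
```

For the matrix in the example, the pure upper and lower values are 9 and 7, so no pure saddle point exists, and the function returns the value 8 with $P_I = 1/2$ and $P_{II} = 1/4$, in agreement with (10.21).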

Remark 10.1 From the above example we can see that once a trajectory in the trajectory set is determined, the expected value 8 cannot be obtained in reality. In most practical optimal control environments, however, the expected optimal performance (or mixed optimal performance) has to be achieved. So in the following part, the method for achieving the mixed optimal performance index function is described in detail.

Calculating the expected performance index function N times under the exploration noises $\gamma_w \in \mathbb{R}^m$ and $\gamma_u \in \mathbb{R}^k$ in the controls w and u, we obtain $E_1(V(x)), E_2(V(x)), \ldots, E_N(V(x))$. Then the mixed optimal performance index function can be written as

$$V^o(x) = E(E_i(V(x))) = \frac{1}{N}\sum_{i=1}^{N} E_i(V(x)) = \alpha\overline V(x) + (1 - \alpha)\underline V(x), \tag{10.22}$$

where $\alpha = \dfrac{V^o - \underline V}{\overline V - \underline V}$.
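The averaging in (10.22) and the resulting weight α can be sketched as follows. The function name and the sample values are hypothetical; the inputs stand for N already-computed expectations $E_i(V(x))$ and for known upper and lower values.

```python
def mixed_optimal_value(expected_values, v_upper, v_lower):
    """Average the E_i(V(x)) as in (10.22) and recover the mixing weight alpha."""
    if v_upper == v_lower:
        raise ValueError("upper and lower values coincide; alpha is undefined")
    v_o = sum(expected_values) / len(expected_values)
    alpha = (v_o - v_lower) / (v_upper - v_lower)
    return v_o, alpha
```

With the numbers of the matrix example (upper value 11, lower value 5) and samples averaging to 8, the weight comes out as α = 1/2, so that $\alpha\overline V + (1-\alpha)\underline V$ recovers the average.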


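Let $l^o(x, \overline u, \overline w, \underline u, \underline w) = \alpha\, l(x, \overline u, \overline w) + (1 - \alpha)\, l(x, \underline u, \underline w)$; then $V^o(x)$ can be expressed as

$$V^o(x(0)) = \int_0^\infty l^o(x, \overline u, \overline w, \underline u, \underline w)\, dt. \tag{10.23}$$

Then, according to Theorem 10.1, the mixed optimal control pair can be obtained by regulating the control w in the control pair $(\overline u, w)$ so as to minimize the error between $V(x)$ and $V^o(x)$, where the performance index function $V(x)$ is defined by

$$V(x(0)) = V(x(0), \overline u, w) = \int_0^\infty \left(x^T A x + \overline u^T B \overline u + w^T C w + 2\overline u^T D w + 2x^T E \overline u + 2x^T F w\right) dt, \tag{10.24}$$

and $\underline V(x(0)) \le V(x(0)) \le \overline V(x(0))$. Define

$$\tilde V(x(0)) = V(x(0)) - V^o(x(0)) = \int_0^\infty \tilde l(x, \overline u, \overline w, \underline u, \underline w, w)\, dt, \tag{10.25}$$

where $\tilde l(x, \overline u, \overline w, \underline u, \underline w, w) = l(x, \overline u, w) - l^o(x, \overline u, \overline w, \underline u, \underline w)$, denoted by $\tilde l(x, w)$. Then the problem can be described by

$$\min_w\ (\tilde V(x))^2. \tag{10.26}$$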

According to the principle of optimality, when $\tilde V(x) \ge 0$ we have the following Hamilton–Jacobi–Bellman (HJB) equation

$$HJB(\tilde V(x), w) = \tilde V_t(x) + H(\tilde V_x(x), x, w) = 0, \tag{10.27}$$

where $\tilde V_t(x) = \dfrac{d\tilde V(x)}{dt}$, $\tilde V_x(x) = \dfrac{d\tilde V(x)}{dx}$, and the Hamilton function is $H(\tilde V_x(x), x, w) = \min_{w \in W}\{\tilde V_x^T(a(x) + b(x)\overline u + c(x)w) + \tilde l(x, w)\}$.

When $\tilde V(x) < 0$, we have $-\tilde V(x) = -(V(x) - V^o(x)) > 0$, and then we also have the HJB equation described by

$$\begin{aligned}
HJB((-\tilde V(x)), w) &= (-\tilde V_t(x)) + H((-\tilde V_x(x)), x, w)\\
&= \tilde V_t(x) + H(\tilde V_x(x), x, w)\\
&= 0,
\end{aligned} \tag{10.28}$$

which is the same as (10.27). Then the optimal control w can be obtained by differentiating the HJB equation (10.27) with respect to the control w:

$$w = -\frac{1}{2} C^{-1}\left(2D^T \overline u + 2F^T x + c^T(x)\tilde V_x\right). \tag{10.29}$$

Remark 10.2 We can also obtain the mixed optimal control pair by regulating the control u in the control pair $(u, \underline w)$ so as to minimize the error between $V(x)$ and $V^o(x)$, where the performance index function $V(x)$ is defined by

$$V(x(0)) = V(x(0), u, \underline w) = \int_0^\infty \left(x^T A x + u^T B u + \underline w^T C \underline w + 2u^T D \underline w + 2x^T E u + 2x^T F \underline w\right) dt. \tag{10.30}$$

Remark 10.3 From (10.24) to (10.30) we can see that effective mixed optimal control schemes are proposed to obtain the mixed optimal performance index function when the saddle point does not exist, while the uniqueness of the mixed optimal control pair is rarely satisfied.

10.3.2 The Procedure of the Method

Given the above preparation, we now formulate the iterative approximate dynamic programming method for ZS differential games as follows.

Step 1 Initialize the algorithm with a stabilizing performance index function $V^{[0]}$ and control pair $(u^{[0]}, w^{[0]})$ where Assumptions 10.1–10.3 hold. Give the computation precision $\zeta > 0$.

Step 2 For $i = 0, 1, \ldots$, from the same initial state $x(0)$, run the system with the control pair $(\overline u^{[i]}, \overline w^{[i]})$ for the upper performance index function and run the system with the control pair $(\underline u^{[i]}, \underline w^{[i]})$ for the lower performance index function.

Step 3 For $i = 0, 1, \ldots$, for the upper performance index function let

$$\overline V^{[i]}(x(0)) = \int_0^\infty \left(x^T A x + \overline u^{[i+1]T} B \overline u^{[i+1]} + \overline w^{[i+1]T} C \overline w^{[i+1]} + 2\overline u^{[i+1]T} D \overline w^{[i+1]} + 2x^T E \overline u^{[i+1]} + 2x^T F \overline w^{[i+1]}\right) dt, \tag{10.31}$$

and the iterative optimal control pair is formulated by

$$\overline u^{[i+1]} = -\frac{1}{2}\left(B - DC^{-1}D^T\right)^{-1}\left(2\left(E^T - DC^{-1}F^T\right)x + \left(b^T(x) - DC^{-1}c^T(x)\right)\overline V_x^{[i]}\right), \tag{10.32}$$

and

$$\overline w^{[i+1]} = -\frac{1}{2} C^{-1}\left(2D^T \overline u^{[i+1]} + 2F^T x + c^T(x)\overline V_x^{[i]}\right), \tag{10.33}$$

where $(\overline u^{[i]}, \overline w^{[i]})$ satisfies $HJI(\overline V^{[i]}(x), \overline u^{[i]}, \overline w^{[i]}) = 0$.

Step 4 If $\left|\overline V^{[i+1]}(x(0)) - \overline V^{[i]}(x(0))\right| < \zeta$, let $\overline u = \overline u^{[i]}$, $\overline w = \overline w^{[i]}$ and $\overline V(x) = \overline V^{[i+1]}(x)$, and go to Step 5; else set $i = i + 1$ and go to Step 3.

Step 5 For i = 0, 1, . . . for lower performance index function let 

[i]



V (x(0)) =

(x T Ax + u [i+1]T Bu [i+1]

0

+ w[i+1]T Cw[i+1] + 2u [i+1]T Dw[i+1] + 2x T Eu [i+1] + 2x T Fw[i+1] )dt,

(10.34)

and the iterative optimal control pair is formulated by

$$w^{[i+1]} = -\frac{1}{2}(C - D^T B^{-1} D)^{-1}\big(2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x^{[i]}\big), \tag{10.35}$$

and

$$u^{[i+1]} = -\frac{1}{2}B^{-1}\big(2D w^{[i+1]} + 2E^T x + b^T(x)\underline{V}_x^{[i]}\big), \tag{10.36}$$

where $(u^{[i]}, w^{[i]})$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$.



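Step 6 If $\underline{V}^{[i+1]}(x(0)) - \underline{V}^{[i]}(x(0)) < \zeta$, let $\underline{u} = u^{[i]}$, $\underline{w} = w^{[i]}$ and $\underline{V}(x) = \underline{V}^{[i+1]}(x)$ and go to the next step; otherwise set $i = i + 1$ and go to Step 5.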



Step 7 If $\overline{V}(x(0)) - \underline{V}(x(0)) < \zeta$, stop: the saddle point is achieved. Otherwise go to the next step.

Step 8 For $i = 0, 1, \ldots$, regulate the control $w$ for the upper performance index function and let

$$\tilde{V}^{[i+1]}(x(0)) = V^{[i+1]}(x(0)) - V^o(x(0)) = \int_0^{\infty} \big( x^T A x + \overline{u}^T B \overline{u} + w^{[i]T} C w^{[i]} + 2\overline{u}^T D w^{[i]} + 2x^T E \overline{u} + 2x^T F w^{[i]} - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) \big)\,dt, \tag{10.37}$$

with the iterative optimal control formulated by

$$w^{[i]} = -\frac{1}{2}C^{-1}\big(2D^T \overline{u} + 2F^T x + c^T(x)\tilde{V}_x^{[i+1]}\big). \tag{10.38}$$

Step 9 If $|V^{[i+1]}(x(0)) - V^o(x(0))| < \zeta$, stop; otherwise set $i = i + 1$ and go to Step 7.

Remark 10.4 For Step 8 of the above process, we can also regulate the control $u$ for the lower performance index function, in which case the new performance index function is expressed as

$$V^{[i+1]}(x(0)) = \int_0^{\infty} \big( x^T A x + u^{[i]T} B u^{[i]} + \underline{w}^T C \underline{w} + 2u^{[i]T} D \underline{w} + 2x^T E u^{[i]} + 2x^T F \underline{w} - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w}) \big)\,dt, \tag{10.39}$$

and we can obtain a similar result.
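The upper iteration in Steps 3 and 4 can be illustrated on a scalar game. The sketch below is not the authors' implementation: it assumes the system $\dot{x} = x + u + w$ with utility $x^2 + u^2 - \gamma^2 w^2$ (so $A = B = 1$, $C = -\gamma^2$, $D = E = F = 0$, $b = c = 1$) and assumes a quadratic form $V^{[i]}(x) = p_i x^2$, under which (10.32) and (10.33) reduce to $u = -p_i x$ and $w = (p_i/\gamma^2)x$. The function name `iterate_scalar_game` and its parameters are hypothetical.

```python
import math


def iterate_scalar_game(p0, gamma2=2.0, zeta=1e-6, max_iter=50):
    """Policy evaluation / policy update loop for dx/dt = x + u + w.

    Assumption (not from the text): V^[i](x) = p_i * x**2, so V_x = 2 * p_i * x.
    Returns the list of coefficients p_0, p_1, ...
    """
    history = [p0]
    p = p0
    for _ in range(max_iter):
        k = p             # u = -k * x, scalar form of (10.32)
        g = p / gamma2    # w =  g * x, scalar form of (10.33)
        a_cl = 1.0 - k + g  # closed-loop dynamics dx/dt = a_cl * x
        if a_cl >= 0.0:
            raise ValueError("control pair is not stabilizing")
        # Closed-form value of the integral in (10.31) for x(0) = 1:
        # (1 + k^2 - gamma2 * g^2) * integral of exp(2 * a_cl * t) dt.
        p_next = (1.0 + k * k - gamma2 * g * g) / (-2.0 * a_cl)
        history.append(p_next)
        if abs(p - p_next) < zeta:  # stopping test of Step 4
            break
        p = p_next
    return history


if __name__ == "__main__":
    for i, p in enumerate(iterate_scalar_game(3.0)):
        print(i, p)
```

Under these assumptions the loop coincides with Newton's method on the scalar Riccati equation, and after the first update the coefficients decrease toward $2 + \sqrt{6}$, in line with the monotonicity stated later in Proposition 10.1.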

10.3.3 The Properties of the Iterative ADP Method

In this subsection, we present proofs showing that the proposed iterative ADP method for ZS differential games can be used to improve the properties of the system. The following definition is needed for the remaining proofs.

Definition 10.1 (K function) A continuous function $\alpha : [0, a) \to [0, \infty)$ is said to belong to class K if it is strictly increasing and $\alpha(0) = 0$.

Theorem 10.2 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$ and $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, u^{[i]}, w^{[i]}) = x^T A x + u^{[i]T} B u^{[i]} + w^{[i]T} C w^{[i]} + 2u^{[i]T} D w^{[i]} + 2x^T E u^{[i]} + 2x^T F w^{[i]} \ge 0$, then the new control pair derived by

$$\begin{aligned} w^{[i+1]} &= -\frac{1}{2}C^{-1}\big(2D^T u^{[i+1]} + 2F^T x + c^T(x)\overline{V}_x^{[i]}\big), \\ u^{[i+1]} &= -\frac{1}{2}(B - DC^{-1}D^T)^{-1}\big(2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x^{[i]}\big), \end{aligned} \tag{10.40}$$

which satisfies (10.31), makes the system (10.1) asymptotically stable.

Proof Since the system is continuous and $\overline{V}^{[i]}(x) \in C^1$, we have

$$\frac{d\overline{V}^{[i]}(x)}{dt} = \frac{d\overline{V}^{[i]}(x, u^{[i+1]}, w^{[i+1]})}{dt} = \overline{V}_x^{[i]T} a(x) + \overline{V}_x^{[i]T} b(x) u^{[i+1]} + \overline{V}_x^{[i]T} c(x) w^{[i+1]}. \tag{10.41}$$

According to (10.40) we can get

$$\frac{d\overline{V}^{[i]}(x)}{dt} = \overline{V}_x^{[i]T} a(x) + \overline{V}_x^{[i]T}(b(x) - c(x)C^{-1}D^T)u^{[i+1]} - \overline{V}_x^{[i]T} c(x)C^{-1}F^T x - \frac{1}{2}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]}. \tag{10.42}$$

From the HJI equation we have

$$\begin{aligned} 0 &= \overline{V}_x^{[i]T} f(x, u^{[i]}, w^{[i]}) + l(x, u^{[i]}, w^{[i]}) \\ &= \overline{V}_x^{[i]T} a(x) + \overline{V}_x^{[i]T}(b(x) - c(x)C^{-1}D^T)u^{[i]} + 2x^T(E - FC^{-1}D^T)u^{[i]} - \overline{V}_x^{[i]T} c(x)C^{-1}F^T x \\ &\quad - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} + x^T A x + u^{[i]T}(B - DC^{-1}D^T)u^{[i]} - x^T FC^{-1}F^T x. \end{aligned} \tag{10.43}$$

Substituting (10.43) into (10.42) gives

$$\begin{aligned} \frac{d\overline{V}^{[i]}(x)}{dt} &= \overline{V}_x^{[i]T}(b(x) - c(x)C^{-1}D^T)(u^{[i+1]} - u^{[i]}) - x^T A x - u^{[i]T}(B - DC^{-1}D^T)u^{[i]} \\ &\quad - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} - 2x^T(E - FC^{-1}D^T)u^{[i]} + x^T FC^{-1}F^T x. \end{aligned} \tag{10.44}$$

According to (10.40) we have

$$\begin{aligned} \frac{d\overline{V}^{[i]}(x)}{dt} &= -(u^{[i+1]} - u^{[i]})^T(B - DC^{-1}D^T)(u^{[i+1]} - u^{[i]}) - u^{[i+1]T}(B - DC^{-1}D^T)u^{[i+1]} - x^T A x \\ &\quad - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} - 2x^T(E - FC^{-1}D^T)u^{[i+1]} + x^T FC^{-1}F^T x. \end{aligned} \tag{10.45}$$

If we substitute (10.40) into the utility function, we can obtain

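$$\begin{aligned} l(x, u^{[i+1]}, w^{[i+1]}) &= x^T A x + u^{[i+1]T} B u^{[i+1]} + w^{[i+1]T} C w^{[i+1]} + 2u^{[i+1]T} D w^{[i+1]} + 2x^T E u^{[i+1]} + 2x^T F w^{[i+1]} \\ &= x^T A x + u^{[i+1]T}(B - DC^{-1}D^T)u^{[i+1]} + \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} \\ &\quad + 2x^T(E - FC^{-1}D^T)u^{[i+1]} - x^T FC^{-1}F^T x \\ &\ge 0. \end{aligned} \tag{10.46}$$

Thus we can derive $\dfrac{d\overline{V}^{[i]}(x)}{dt} \le 0$.

As $\overline{V}^{[i]}(x) \ge 0$, there exist two functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ belonging to class K that satisfy $\alpha(\|x\|) \le \overline{V}^{[i]}(x) \le \beta(\|x\|)$. For every $\varepsilon > 0$, there exists $\delta(\varepsilon) > 0$ such that $\beta(\delta) \le \alpha(\varepsilon)$. Let $t_0$ be any initial time. According to $\dfrac{d\overline{V}^{[i]}(x)}{dt} \le 0$, for $t \in [t_0, \infty)$ we have

$$\overline{V}^{[i]}(x(t)) - \overline{V}^{[i]}(x(t_0)) = \int_{t_0}^{t} \frac{d\overline{V}^{[i]}(x)}{d\tau}\,d\tau \le 0. \tag{10.47}$$

So for all $t_0$ and $\|x(t_0)\| < \delta(\varepsilon)$, for all $t \in [t_0, \infty)$ we have

$$\alpha(\varepsilon) \ge \beta(\delta) \ge \overline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.48}$$

As $\alpha(\|x\|)$ belongs to class K, we can obtain

$$\|x\| \le \varepsilon. \tag{10.49}$$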

Then we can conclude that the system (10.1) is asymptotically stable.

Theorem 10.3 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$ and $w^{[i]} \in \mathbb{R}^m$, $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, u^{[i]}, w^{[i]}) = x^T A x + u^{[i]T} B u^{[i]} + w^{[i]T} C w^{[i]} + 2u^{[i]T} D w^{[i]} + 2x^T E u^{[i]} + 2x^T F w^{[i]} < 0$, then the control pair $(u^{[i]}, w^{[i]})$ formulated by

$$\begin{aligned} u^{[i+1]} &= -\frac{1}{2}B^{-1}\big(2D w^{[i+1]} + 2E^T x + b^T(x)\underline{V}_x^{[i]}\big), \\ w^{[i+1]} &= -\frac{1}{2}(C - D^T B^{-1} D)^{-1}\big(2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x^{[i]}\big), \end{aligned} \tag{10.50}$$

which satisfies the performance index function (10.34), makes the system (10.1) asymptotically stable.

Proof Since the system is continuous and $\underline{V}^{[i]}(x) \in C^1$, we have

$$\frac{d\underline{V}^{[i]}(x)}{dt} = \frac{d\underline{V}^{[i]}(x, u^{[i+1]}, w^{[i+1]})}{dt} = \underline{V}_x^{[i]T} a(x) + \underline{V}_x^{[i]T} b(x) u^{[i+1]} + \underline{V}_x^{[i]T} c(x) w^{[i+1]}. \tag{10.51}$$

According to (10.50) we can get

$$\frac{d\underline{V}^{[i]}(x)}{dt} = \underline{V}_x^{[i]T} a(x) + \underline{V}_x^{[i]T}(c(x) - b(x)B^{-1}D)w^{[i+1]} - \underline{V}_x^{[i]T} b(x)B^{-1}E^T x - \frac{1}{2}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}. \tag{10.52}$$

From the HJI equation we have

$$\begin{aligned} 0 &= \underline{V}_x^{[i]T} f(x, u^{[i]}, w^{[i]}) + l(x, u^{[i]}, w^{[i]}) \\ &= \underline{V}_x^{[i]T} a(x) + \underline{V}_x^{[i]T}(c(x) - b(x)B^{-1}D)w^{[i]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]} + x^T A x + w^{[i]T}(C - D^T B^{-1} D)w^{[i]} \\ &\quad - \underline{V}_x^{[i]T} b(x)B^{-1}E^T x - x^T F B^{-1} E^T x + 2x^T(F - EB^{-1}D)w^{[i]}. \end{aligned} \tag{10.53}$$

Substituting (10.53) into (10.52) gives

$$\begin{aligned} \frac{d\underline{V}^{[i]}(x)}{dt} &= \underline{V}_x^{[i]T}(c(x) - b(x)B^{-1}D)(w^{[i+1]} - w^{[i]}) - x^T A x - w^{[i]T}(C - D^T B^{-1} D)w^{[i]} \\ &\quad + x^T F B^{-1} E^T x - 2x^T(F - EB^{-1}D)w^{[i]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}. \end{aligned} \tag{10.54}$$

According to (10.50) we have

$$\begin{aligned} \frac{d\underline{V}^{[i]}(x)}{dt} &= -(w^{[i+1]} - w^{[i]})^T(C - D^T B^{-1} D)(w^{[i+1]} - w^{[i]}) - w^{[i+1]T}(C - D^T B^{-1} D)w^{[i+1]} - x^T A x \\ &\quad + x^T F B^{-1} E^T x - 2x^T(F - EB^{-1}D)w^{[i+1]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}. \end{aligned} \tag{10.55}$$

If we substitute (10.50) into the utility function, we obtain

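$$\begin{aligned} l(x, u^{[i+1]}, w^{[i+1]}) &= x^T A x + u^{[i+1]T} B u^{[i+1]} + w^{[i+1]T} C w^{[i+1]} + 2u^{[i+1]T} D w^{[i+1]} + 2x^T E u^{[i+1]} + 2x^T F w^{[i+1]} \\ &= w^{[i+1]T}(C - D^T B^{-1} D)w^{[i+1]} + x^T A x - x^T F B^{-1} E^T x - 2x^T(F - EB^{-1}D)w^{[i+1]} \\ &\quad + \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]} \\ &< 0. \end{aligned} \tag{10.56}$$

So we have $\dfrac{d\underline{V}^{[i]}(x)}{dt} > 0$.

As $\underline{V}^{[i]}(x) < 0$, there exist two functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ belonging to class K that satisfy $\alpha(\|x\|) \le -\underline{V}^{[i]}(x) \le \beta(\|x\|)$. For every $\varepsilon > 0$, there exists $\delta(\varepsilon) > 0$ such that $\beta(\delta) \le \alpha(\varepsilon)$. According to $-\dfrac{d\underline{V}^{[i]}(x)}{dt} < 0$, for $t \in [t_0, \infty)$ we have

$$-\underline{V}^{[i]}(x(t)) - \big(-\underline{V}^{[i]}(x(t_0))\big) = -\int_{t_0}^{t} \frac{d\underline{V}^{[i]}(x)}{d\tau}\,d\tau < 0. \tag{10.57}$$

So for all $t_0$ and $\|x(t_0)\| < \delta(\varepsilon)$, for all $t \in [t_0, \infty)$ we have

$$\alpha(\varepsilon) \ge \beta(\delta) \ge -\underline{V}^{[i]}(x(t_0)) > -\underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.58}$$

As $\alpha(\|x\|)$ belongs to class K, we can obtain

$$\|x\| < \varepsilon. \tag{10.59}$$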

Then we can conclude that the system (10.1) is asymptotically stable.

Corollary 10.1 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$ and $w^{[i]} \in \mathbb{R}^m$, $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, u^{[i]}, w^{[i]}) = x^T A x + u^{[i]T} B u^{[i]} + w^{[i]T} C w^{[i]} + 2u^{[i]T} D w^{[i]} + 2x^T E u^{[i]} + 2x^T F w^{[i]} \ge 0$, then the control pair $(u^{[i]}, w^{[i]})$ which satisfies the performance index function (10.34) makes the system (10.1) asymptotically stable.

Proof According to (10.31) and (10.34), we have $\underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x)$. As the utility function $l(x, u^{[i]}, w^{[i]}) \ge 0$, we have $\underline{V}^{[i]}(x) \ge 0$. So we get $0 \le \underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x)$. From Proposition 10.1, we know that for all $t_0$ there exist two functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ belonging to class K that satisfy

$$\alpha(\varepsilon) \ge \beta(\delta) \ge \overline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.60}$$

According to $\overline{V}^{[i]}(x) \in C^1$, $\underline{V}^{[i]}(x) \in C^1$ and $\overline{V}^{[i]}(x) \to 0$, there exist time instants $t_1$ and $t_2$ (without loss of generality, let $t_0 < t_1 < t_2$) that satisfy

$$\overline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t_1)) \ge \underline{V}^{[i]}(x(t_0)) \ge \overline{V}^{[i]}(x(t_2)). \tag{10.61}$$

Choose $\varepsilon_1 > 0$ that satisfies $\underline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \overline{V}^{[i]}(x(t_2))$. Then there exists $\delta_1(\varepsilon_1) > 0$ such that $\alpha(\varepsilon_1) \ge \beta(\delta_1) \ge \overline{V}^{[i]}(x(t_2))$. Then we can obtain

$$\underline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge \overline{V}^{[i]}(x(t_2)) \ge \overline{V}^{[i]}(x(t)) \ge \underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.62}$$

According to (10.60), we have

$$\alpha(\varepsilon) \ge \beta(\delta) \ge \underline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge \underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.63}$$

As $\alpha(\|x\|)$ belongs to class K, we can obtain

$$\|x\| \le \varepsilon. \tag{10.64}$$

Then we can conclude that the system (10.1) is asymptotically stable.

Corollary 10.2 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$ and $w^{[i]} \in \mathbb{R}^m$, $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, $i = 0, 1, \ldots$, and for all $t$, $l(x, u^{[i]}, w^{[i]}) = x^T A x + u^{[i]T} B u^{[i]} + w^{[i]T} C w^{[i]} + 2u^{[i]T} D w^{[i]} + 2x^T E u^{[i]} + 2x^T F w^{[i]} < 0$, then the control pair $(u^{[i]}, w^{[i]})$ which satisfies the performance index function (10.31) makes the system (10.1) asymptotically stable.

Proof According to (10.31) and (10.34), we have $\underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x)$. From the utility function $l(x, u^{[i]}, w^{[i]}) < 0$, we know $\overline{V}^{[i]}(x) < 0$. So we get $\underline{V}^{[i]}(x) \le \overline{V}^{[i]}(x) < 0$. From Proposition 10.2, we know that there exist two functions $\alpha(\|x\|)$ and $\beta(\|x\|)$ belonging to class K that satisfy

$$\alpha(\varepsilon) \ge \beta(\delta) \ge -\underline{V}^{[i]}(x(t_0)) \ge -\underline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.65}$$

According to $\overline{V}^{[i]}(x) \in C^1$, $\underline{V}^{[i]}(x) \in C^1$ and $\overline{V}^{[i]}(x) \to 0$, there exist time instants $t_1$ and $t_2$ that satisfy

$$-\underline{V}^{[i]}(x(t_0)) \ge -\underline{V}^{[i]}(x(t_1)) \ge -\overline{V}^{[i]}(x(t_0)) \ge -\underline{V}^{[i]}(x(t_2)). \tag{10.66}$$

Choose $\varepsilon_1 > 0$ that satisfies $-\overline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge -\underline{V}^{[i]}(x(t_2))$. Then there exists $\delta_1(\varepsilon_1) > 0$ such that $\alpha(\varepsilon_1) \ge \beta(\delta_1) \ge -\underline{V}^{[i]}(x(t_2))$. Then we can obtain

$$-\overline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge -\underline{V}^{[i]}(x(t_2)) \ge -\underline{V}^{[i]}(x(t)) \ge -\overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.67}$$

According to (10.65), we have

$$\alpha(\varepsilon) \ge \beta(\delta) \ge -\overline{V}^{[i]}(x(t_0)) \ge \alpha(\varepsilon_1) \ge \beta(\delta_1) \ge -\overline{V}^{[i]}(x(t)) \ge \alpha(\|x\|). \tag{10.68}$$

As $\alpha(\|x\|)$ belongs to class K, we can obtain

$$\|x\| < \varepsilon. \tag{10.69}$$

Then we can conclude that the system (10.1) is asymptotically stable.

Theorem 10.4 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$ and $w^{[i]} \in \mathbb{R}^m$, $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, $i = 0, 1, \ldots$, and $l(x, u^{[i]}, w^{[i]}) = x^T A x + u^{[i]T} B u^{[i]} + w^{[i]T} C w^{[i]} + 2u^{[i]T} D w^{[i]} + 2x^T E u^{[i]} + 2x^T F w^{[i]}$ is the utility function, then the control pair $(u^{[i]}, w^{[i]})$ which satisfies the upper performance index function (10.31) is a pair of asymptotically stable controls for system (10.1).

Proof For the time sequence $t_0 < t_1 < t_2 < \cdots < t_m < t_{m+1} < \cdots$, without loss of generality, we suppose $l(x, u^{[i]}, w^{[i]}) \ge 0$ in $[t_{2n}, t_{2n+1})$ and $l(x, u^{[i]}, w^{[i]}) < 0$ in $[t_{2n+1}, t_{2(n+1)})$, where $n = 0, 1, \ldots$.

For $t \in [t_0, t_1)$ we have $l(x, u^{[i]}, w^{[i]}) \ge 0$ and $\int_{t_0}^{t_1} l(x, u^{[i]}, w^{[i]})\,dt \ge 0$. According to Theorem 10.2, we have

$$\|x(t_0)\| \ge \|x(t_1')\| \ge \|x(t_1)\|, \tag{10.70}$$

where $t_1' \in [t_0, t_1)$.

For $t \in [t_1, t_2)$ we have $l(x, u^{[i]}, w^{[i]}) < 0$ and $\int_{t_1}^{t_2} l(x, u^{[i]}, w^{[i]})\,dt < 0$. According to Corollary 10.2 we have

$$\|x(t_1)\| > \|x(t_2')\| > \|x(t_2)\|, \tag{10.71}$$

where $t_2' \in [t_1, t_2)$. So we can obtain

$$\|x(t_0)\| \ge \|x(t_0')\| > \|x(t_2)\|, \tag{10.72}$$

where $t_0' \in [t_0, t_2)$.

Then using mathematical induction, for all $t$, we have $\|x(t')\| \le \|x(t)\|$ where $t' \in [t, \infty)$. So we can conclude that the system (10.1) is asymptotically stable and the proof is completed.

Theorem 10.5 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$ and $w^{[i]} \in \mathbb{R}^m$, $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, $i = 0, 1, \ldots$, and $l(x, u^{[i]}, w^{[i]}) = x^T A x + u^{[i]T} B u^{[i]} + w^{[i]T} C w^{[i]} + 2u^{[i]T} D w^{[i]} + 2x^T E u^{[i]} + 2x^T F w^{[i]}$ is the utility function, then the control pair $(u^{[i]}, w^{[i]})$ which satisfies the lower performance index function (10.34) is a pair of asymptotically stable controls for system (10.1).

Proof Similar to the proof of Theorem 10.4, the conclusion can be obtained according to Theorem 10.3 and Corollary 10.1, and the proof is omitted.

In the following part, the convergence property for the ZS differential games is analyzed to guarantee that the iterative control pair reaches the optimum.

Proposition 10.1 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$ and $\overline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\overline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, then the iterative control pair $(u^{[i]}, w^{[i]})$ formulated by

$$\begin{aligned} w^{[i+1]} &= -\frac{1}{2}C^{-1}\big(2D^T u^{[i+1]} + 2F^T x + c^T(x)\overline{V}_x^{[i]}\big), \\ u^{[i+1]} &= -\frac{1}{2}(B - DC^{-1}D^T)^{-1}\big(2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\overline{V}_x^{[i]}\big), \end{aligned} \tag{10.73}$$

makes the upper performance index function $\overline{V}^{[i]}(x) \to \overline{V}(x)$ as $i \to \infty$.

Proof To show the convergence of the upper performance index function, we first consider the property of $d(\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x))/dt$.

According to the HJI equation $HJI(\overline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, we can obtain $d\overline{V}^{[i+1]}(x)/dt$ by replacing the index "$i$" by the index "$i + 1$":

$$\frac{d\overline{V}^{[i+1]}(x)}{dt} = -\Big(x^T A x + u^{[i+1]T}(B - DC^{-1}D^T)u^{[i+1]} + \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} + 2x^T(E - FC^{-1}D^T)u^{[i+1]} - x^T FC^{-1}F^T x\Big). \tag{10.74}$$

According to (10.45), we can obtain

$$\begin{aligned} \frac{d(\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x))}{dt} &= \frac{d\overline{V}^{[i+1]}(x)}{dt} - \frac{d\overline{V}^{[i]}(x)}{dt} \\ &= -\Big(x^T A x + u^{[i+1]T}(B - DC^{-1}D^T)u^{[i+1]} + \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} + 2x^T(E - FC^{-1}D^T)u^{[i+1]} - x^T FC^{-1}F^T x\Big) \\ &\quad - \Big(-(u^{[i+1]} - u^{[i]})^T(B - DC^{-1}D^T)(u^{[i+1]} - u^{[i]}) - u^{[i+1]T}(B - DC^{-1}D^T)u^{[i+1]} - x^T A x \\ &\qquad - \frac{1}{4}\overline{V}_x^{[i]T} c(x)C^{-1}c^T(x)\overline{V}_x^{[i]} - 2x^T(E - FC^{-1}D^T)u^{[i+1]} + x^T FC^{-1}F^T x\Big) \\ &= (u^{[i+1]} - u^{[i]})^T(B - DC^{-1}D^T)(u^{[i+1]} - u^{[i]}) \ge 0. \end{aligned} \tag{10.75}$$

Since the system is asymptotically stable, its state trajectories $x$ converge to zero, and so does $\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x)$. Since $d(\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x))/dt \ge 0$ on these trajectories, this implies that $\overline{V}^{[i+1]}(x) - \overline{V}^{[i]}(x) \le 0$, that is, $\overline{V}^{[i+1]}(x) \le \overline{V}^{[i]}(x)$. As such, $\overline{V}^{[i]}(x) \to \overline{V}(x)$ as $i \to \infty$.

Remark 10.5 The convergence of the upper performance index function cannot guarantee the convergence of the lower performance index function. In fact, the lower performance index function may be bounded but not convergent. So it is necessary to analyze the convergence of the lower performance index function.

Proposition 10.2 If Assumptions 10.1–10.3 hold, $u^{[i]} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$ and $\underline{V}^{[i]}(x) \in C^1$ satisfies the HJI equation $HJI(\underline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, then the iterative control pair $(u^{[i]}, w^{[i]})$ formulated by

$$\begin{aligned} u^{[i+1]} &= -\frac{1}{2}B^{-1}\big(2D w^{[i+1]} + 2E^T x + b^T(x)\underline{V}_x^{[i]}\big), \\ w^{[i+1]} &= -\frac{1}{2}(C - D^T B^{-1} D)^{-1}\big(2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\underline{V}_x^{[i]}\big), \end{aligned} \tag{10.76}$$

makes the lower performance index function $\underline{V}^{[i]}(x) \to \underline{V}(x)$ as $i \to \infty$.

Proof To show the convergence of the lower performance index function, we also consider the property of $d(\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x))/dt$. From the HJI equation $HJI(\underline{V}^{[i]}(x), u^{[i]}, w^{[i]}) = 0$, we can obtain $d\underline{V}^{[i+1]}(x)/dt$ by replacing the index "$i$" by the index "$i + 1$":

$$\frac{d\underline{V}^{[i+1]}(x)}{dt} = -w^{[i+1]T}(C - D^T B^{-1} D)w^{[i+1]} - x^T A x + x^T F B^{-1} E^T x - 2x^T(F - EB^{-1}D)w^{[i+1]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}. \tag{10.77}$$

According to (10.50), we have

$$\begin{aligned} \frac{d(\underline{V}^{[i+1]}(x) - \underline{V}^{[i]}(x))}{dt} &= \frac{d\underline{V}^{[i+1]}(x)}{dt} - \frac{d\underline{V}^{[i]}(x)}{dt} \\ &= -w^{[i+1]T}(C - D^T B^{-1} D)w^{[i+1]} - x^T A x + x^T F B^{-1} E^T x - 2x^T(F - EB^{-1}D)w^{[i+1]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]} \\ &\quad - \Big(-(w^{[i+1]} - w^{[i]})^T(C - D^T B^{-1} D)(w^{[i+1]} - w^{[i]}) - w^{[i+1]T}(C - D^T B^{-1} D)w^{[i+1]} - x^T A x + x^T F B^{-1} E^T x \\ &\qquad - 2x^T(F - EB^{-1}D)w^{[i+1]} - \frac{1}{4}\underline{V}_x^{[i]T} b(x)B^{-1}b^T(x)\underline{V}_x^{[i]}\Big) \\ &= (w^{[i+1]} - w^{[i]})^T(C - D^T B^{-1} D)(w^{[i+1]} - w^{[i]}) \; \ldots \end{aligned}$$

… $\tilde{V}^{[i]}(x)$ on these trajectories. As such, this implies that $\tilde{V}^{[i]}(x)$ is convergent.

Theorem 10.7 If Assumptions 10.1–10.3 hold, $\overline{u} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$ and $\tilde{l}(x, w^{[i]}) = l(x, \overline{u}, w^{[i]}) - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w})$ is the utility function, then the control pair $(\overline{u}, w^{[i]})$ that satisfies the performance index function (10.37) makes $\tilde{V}^{[i]}(x)$ convergent as $i \to \infty$.

Proof For the time sequence $t_0 < t_1 < t_2 < \cdots < t_m < t_{m+1} < \cdots$, without loss of generality, we suppose $\tilde{l}(x, w^{[i]}) \ge 0$ in $[t_{2n}, t_{2n+1})$ and $\tilde{l}(x, w^{[i]}) < 0$ in $[t_{2n+1}, t_{2(n+1)})$, where $n = 0, 1, 2, \ldots$.

For $t \in [t_{2n}, t_{2n+1})$ we have $\tilde{l}(x, w^{[i]}) \ge 0$ and $\int_{t_0}^{t_1} \tilde{l}(x, w^{[i]})\,dt \ge 0$. According to Proposition 10.5, we have $\tilde{V}^{[i+1]}(x) \le \tilde{V}^{[i]}(x)$.

For $t \in [t_{2n+1}, t_{2(n+1)})$ we have $\tilde{l}(x, w^{[i]}) < 0$ and $\int_{t_1}^{t_2} \tilde{l}(x, w^{[i]})\,dt < 0$. According to Proposition 10.5 we have $\tilde{V}^{[i+1]}(x) > \tilde{V}^{[i]}(x)$.

Then for all $t_0$, we have

$$\tilde{V}^{[i+1]}(x(t_0)) = \int_{t_0}^{t_1} \tilde{l}(x, w^{[i]})\,dt + \int_{t_1}^{t_2} \tilde{l}(x, w^{[i]})\,dt + \cdots + \int_{t_m}^{t_{m+1}} \tilde{l}(x, w^{[i]})\,dt + \cdots < \tilde{V}^{[i]}(x(t_0)). \tag{10.92}$$

So $\tilde{V}^{[i]}(x)$ is convergent as $i \to \infty$.

Theorem 10.8 If Assumptions 10.1–10.3 hold, $\overline{u} \in \mathbb{R}^k$, $w^{[i]} \in \mathbb{R}^m$ and $\tilde{l}(x, w^{[i]}) = l(x, \overline{u}, w^{[i]}) - l^o(x, \overline{u}, \overline{w}, \underline{u}, \underline{w})$ is the utility function, then the control pair $(\overline{u}, w^{[i]})$ that satisfies the performance index function (10.37) makes $V^{[i]}(x) \to V^o(x)$ as $i \to \infty$.

Proof The proof is by contradiction. Suppose that the control pair $(\overline{u}, w^{[i]})$ makes the performance index function $V^{[i]}(x)$ converge to $V'(x)$ with $V'(x) \ne V^o(x)$. According to Theorem 10.7, as $i \to \infty$ we have the following HJB equation based on the principle of optimality:

$$HJB(\tilde{V}(x), w) = \tilde{V}_t(x) + H(\tilde{V}_x(x), x, w) = 0. \tag{10.93}$$

From the above assumption we know that $|V^{[i]}(x) - V^o(x)| \ne 0$ as $i \to \infty$. From Theorem 10.7, we know that there exists a control pair $(\overline{u}, w')$ that makes $V(x, \overline{u}, w') = V^o(x)$, which minimizes the performance index function $\tilde{V}(x)$. According to the principle of optimality we also have the following HJB equation:

$$HJB(\tilde{V}(x), w') = \tilde{V}_t(x) + H(\tilde{V}_x(x), x, w') = 0. \tag{10.94}$$

This is a contradiction, so the assumption does not hold. Thus we have $V^{[i]}(x) \to V^o(x)$ as $i \to \infty$.

Remark 10.6 If we regulate the control $u$ for the lower performance index function which satisfies (10.39), we can also prove that the iterative control pair $(u^{[i]}, \underline{w})$ stabilizes the nonlinear system (10.1) and that the performance index function $V^{[i]}(x) \to V^o(x)$ as $i \to \infty$. The proof is similar to the proofs of Propositions 10.1–10.5 and Theorems 10.7 and 10.8, and it is omitted.

10.4 Neural Network Implementation

As a computer can only deal with digital, discrete signals, it is necessary to transform the continuous-time system and performance index function into the corresponding discrete-time form. Discretization of the system function and performance index function using Euler and trapezoidal methods [19, 20] leads to

$$x(t + 1) = x(t) + \big(a(x(t)) + b(x(t))u(t) + c(x(t))w(t)\big)\Delta t, \tag{10.95}$$

and

$$V(x(0)) = \sum_{t=0}^{\infty} \big(x^T(t)Ax(t) + u^T(t)Bu(t) + w^T(t)Cw(t) + 2u^T(t)Dw(t) + 2x^T(t)Eu(t) + 2x^T(t)Fw(t)\big)\Delta t, \tag{10.96}$$

where $\Delta t$ is the time interval.

In the case of linear systems the performance index function is quadratic and the control policy is linear. In the nonlinear case, this is not necessarily true and therefore we use neural networks to approximate $u^{[i]}$, $w^{[i]}$ and $V^{[i]}(x)$. With a neural network approximation, a function $F(X)$ can be expressed as

$$F(X) = F(X, Y^*, W^*) + \xi(X), \tag{10.97}$$

where $Y^*$, $W^*$ are the ideal weight parameters and $\xi(X)$ is the estimation error.

Ten neural networks are used to implement the iterative ADP method: three model networks, three critic networks and four action networks. All the neural networks are chosen as three-layer feed-forward neural networks. The whole structure diagram is shown in Fig. 10.1.
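As a small illustration of the discretization (10.95) and (10.96), the sketch below rolls out a scalar system under fixed linear feedback and accumulates the discretized cost. It is an assumption-laden example, not the authors' code: the system $\dot{x} = x + u + w$ with utility $x^2 + u^2 - \gamma^2 w^2$, the feedback gains, and the function name `discretized_cost` are all chosen here for illustration, and the infinite sum is truncated after a finite number of steps.

```python
import math


def discretized_cost(x0, k_u, k_w, gamma2=2.0, dt=0.01, steps=2000):
    """Roll out dx/dt = x + u + w with u = -k_u * x and w = k_w * x.

    The state update follows the Euler form (10.95); the cost is the
    sum (10.96) with A = B = 1, C = -gamma2, D = E = F = 0, truncated
    after `steps` samples. Returns (cost, final_state).
    """
    x = x0
    cost = 0.0
    for _ in range(steps):
        u = -k_u * x
        w = k_w * x
        cost += (x * x + u * u - gamma2 * w * w) * dt  # summand of (10.96)
        x = x + (x + u + w) * dt                        # update (10.95)
    return cost, x


if __name__ == "__main__":
    p = 2.0 + math.sqrt(6.0)  # continuous-time value coefficient for gamma2 = 2
    print(discretized_cost(1.0, p, p / 2.0))
```

With $\Delta t = 0.01$ the discretized cost differs from the continuous-time value $p\,x(0)^2$ by well under one percent in this scalar case, which gives a feel for the discretization error introduced before any network is trained.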


Fig. 10.1 The structure diagram of the iterative ADP method for ZS differential games

10.4.1 The Model Network

For the nonlinear system, before carrying out the iterative ADP method, we should first train the model network. The output of the model network is given as

$$\hat{x}(t + 1) = W_m^T \sigma(V_m^T X(t)), \tag{10.98}$$

where $X(t) = [x(t)\ u(t)\ w(t)]^T$. We define the error function of the model network as

$$e_m(t) = \hat{x}(t + 1) - x(t + 1). \tag{10.99}$$

Define the performance error index as

$$E_m(t) = \frac{1}{2}e_m^2(t). \tag{10.100}$$

Then the gradient-based weight update rule for the model network can be described by

$$w_m(t + 1) = w_m(t) + \Delta w_m(t), \tag{10.101}$$

$$\Delta w_m(t) = \eta_m\left[-\frac{\partial E_m(t)}{\partial w_m(t)}\right], \tag{10.102}$$

where $\eta_m > 0$ is the learning rate of the model network. After the model network is trained, its weights are kept unchanged.
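The update (10.98) to (10.102) can be sketched with a small three-layer network written in plain Python. This is only an illustration under stated assumptions: the training data come from the Euler-discretized scalar system $x(t+1) = x(t) + (x(t) + u(t) + w(t))\Delta t$, the activation is `tanh`, the layer sizes are 3-8-1, and the function name `train_model_network` and its arguments are hypothetical.

```python
import math
import random


def train_model_network(steps=5000, eta=0.01, hidden=8, dt=0.01, seed=0):
    """Train x_hat(t+1) = W^T tanh(V^T X), X = [x, u, w], by gradient descent.

    Returns the mean error index E_m on a fixed evaluation set
    before and after training.
    """
    rng = random.Random(seed)
    # Initial weights drawn uniformly from [-1, 1].
    V = [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(3)]
    W = [rng.uniform(-1, 1) for _ in range(hidden)]

    def forward(X):
        h = [math.tanh(sum(V[i][j] * X[i] for i in range(3))) for j in range(hidden)]
        return sum(W[j] * h[j] for j in range(hidden)), h

    def sample():
        x, u, w = (rng.uniform(-1, 1) for _ in range(3))
        return [x, u, w], x + (x + u + w) * dt  # assumed plant, Euler step

    eval_set = [sample() for _ in range(200)]

    def mean_error():
        return sum(0.5 * (forward(X)[0] - y) ** 2 for X, y in eval_set) / len(eval_set)

    before = mean_error()
    for _ in range(steps):
        X, y = sample()
        y_hat, h = forward(X)
        e = y_hat - y                                   # error (10.99)
        for j in range(hidden):
            grad_W = e * h[j]                           # dE_m/dW_j
            grad_pre = e * W[j] * (1.0 - h[j] ** 2)     # dE_m/d(pre-activation j)
            W[j] -= eta * grad_W                        # update (10.101)-(10.102)
            for i in range(3):
                V[i][j] -= eta * grad_pre * X[i]
    return before, mean_error()


if __name__ == "__main__":
    print(train_model_network())
```

Both weight layers are updated here; freezing the trained weights afterwards corresponds to the statement that the model network is kept unchanged during the ADP iterations.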

10.4.2 The Critic Network

The critic network is used to approximate the performance index functions, i.e., $\overline{V}^{[i]}(x)$, $\underline{V}^{[i]}(x)$ and $\tilde{V}^{[i]}(x)$. The output of the critic network is denoted as

$$\hat{V}^{[i]}(X(t)) = W_c^{[i]T} \sigma(Y_c^{[i]T} X(t)), \tag{10.103}$$

where $X(t) = [x(t)\ u(t)\ w(t)]^T$ is the input of the critic networks. The three critic networks have three target functions. For the upper performance index function, the target function can be written as

$$\overline{V}^{[i]}(x(t)) = \big(x^T(t)Ax(t) + u^{[i+1]T}(t)Bu^{[i+1]}(t) + w^{[i+1]T}Cw^{[i+1]} + 2u^{[i+1]T}Dw^{[i+1]} + 2x^T Eu^{[i+1]} + 2x^T Fw^{[i+1]}\big)\Delta t + \hat{\overline{V}}^{[i]}(x(t + 1)), \tag{10.104}$$

where $\hat{\overline{V}}^{[i]}(x(t + 1))$ is the output of the upper critic network. For the lower performance index function, the target function can be written as

$$\underline{V}^{[i]}(x(t)) = \big(x^T(t)Ax(t) + u^{[i+1]T}(t)Bu^{[i+1]}(t) + w^{[i+1]T}Cw^{[i+1]} + 2u^{[i+1]T}Dw^{[i+1]} + 2x^T Eu^{[i+1]} + 2x^T Fw^{[i+1]}\big)\Delta t + \hat{\underline{V}}^{[i]}(x(t + 1)), \tag{10.105}$$

where $\hat{\underline{V}}^{[i]}(x(t + 1))$ is the output of the lower critic network.

Then for the upper performance index function, we define the error function for the critic network by

$$e_c^{[i]}(t) = \hat{\overline{V}}^{[i]}(x(t)) - \overline{V}^{[i]}(x(t)), \tag{10.106}$$

where $\hat{\overline{V}}^{[i]}(x(t))$ is the output of the upper critic neural network. The objective function to be minimized in the critic network is

$$E_c^{[i]}(t) = \frac{1}{2}\big(e_c^{[i]}(t)\big)^2. \tag{10.107}$$

So the gradient-based weight update rule for the critic network [21, 22] is given by

$$w_c^{[i]}(t + 1) = w_c^{[i]}(t) + \Delta w_c^{[i]}(t), \tag{10.108}$$

$$\Delta w_c^{[i]}(t) = \eta_c\left[-\frac{\partial E_c^{[i]}(t)}{\partial w_c^{[i]}(t)}\right], \tag{10.109}$$

$$\frac{\partial E_c^{[i]}(t)}{\partial w_c^{[i]}(t)} = \frac{\partial E_c^{[i]}(t)}{\partial \hat{V}^{[i]}(x(t))}\,\frac{\partial \hat{V}^{[i]}(x(t))}{\partial w_c^{[i]}(t)}, \tag{10.110}$$

where $\eta_c > 0$ is the learning rate of the critic network and $w_c(t)$ is the weight vector in the critic network.

For the lower performance index function, the error function for the critic network is defined by

$$e_c^{[i]}(t) = \hat{\underline{V}}^{[i]}(x(t)) - \underline{V}^{[i]}(x(t)). \tag{10.111}$$

For the mixed optimal performance index function, the error function can be expressed as

$$e_c^{o(i)}(t) = \hat{V}^{[i]}(x(t)) - V^o(x(t)), \tag{10.112}$$

where $\hat{V}^{[i]}(x(t))$ is the output of the critic network. For the lower and mixed optimal performance index functions, the weight update rule of the critic neural network is the same as the one for the upper performance index function. The details are omitted here.
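To make the critic target (10.104) and the update (10.108) to (10.110) concrete, the following sketch replaces the neural network with a one-parameter approximator $\hat{V}(x) = w_c x^2$. That simplification, the scalar system $\dot{x} = x + u + w$ with utility $x^2 + u^2 - \gamma^2 w^2$, the fixed linear policy, and the name `train_critic` are assumptions made only for this example.

```python
import math


def train_critic(p_policy, gamma2=2.0, dt=0.01, eta=1.0, sweeps=4000):
    """Fit V_hat(x) = wc * x**2 to the bootstrapped target of (10.104).

    The policy is held fixed at u = -p_policy * x, w = (p_policy / gamma2) * x.
    Returns the learned critic weight wc.
    """
    k, g = p_policy, p_policy / gamma2
    states = [0.2, 0.4, 0.6, 0.8, 1.0]  # fixed training states
    wc = 0.0
    for _ in range(sweeps):
        for x in states:
            u, w = -k * x, g * x
            x_next = x + (x + u + w) * dt                 # model step (10.95)
            utility = (x * x + u * u - gamma2 * w * w) * dt
            target = utility + wc * x_next ** 2           # target (10.104)
            e = wc * x * x - target                       # error (10.106)
            # Gradient step on 0.5 * e**2 with the target treated as a constant.
            wc -= eta * e * x * x
    return wc


if __name__ == "__main__":
    p = 2.0 + math.sqrt(6.0)
    print(train_critic(p))
```

Treating the target as a constant during the gradient step is a common design choice for bootstrapped critics; the text does not say whether the authors differentiate through the target.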

10.4.3 The Action Network

The action network is used to approximate the optimal and mixed optimal controls. There are four action networks: two are used to approximate the optimal control pair for the upper performance index function and two are used to approximate the optimal control pair for the lower performance index function.

For the two action networks for the upper performance index function, $x(t)$ is used as the input for the first action network to create the control $u(t)$, and $[x(t)\ u(t)]^T$ is used as the input for the other action network to create the control $w(t)$. The target function of the first action network (u network) is the discretized form of Eq. (10.32):

$$u^{[i+1]} = -\frac{1}{2}(B - DC^{-1}D^T)^{-1}\left(2(E^T - DC^{-1}F^T)x + (b^T(x) - DC^{-1}c^T(x))\frac{\partial \overline{V}^{[i]}(x(t + 1))}{\partial x(t + 1)}\right). \tag{10.113}$$

The target function of the second action network (w network) is the discretized form of Eq. (10.33):

$$w^{[i+1]} = -\frac{1}{2}C^{-1}\left(2D^T u^{[i+1]} + 2F^T x + c^T(x)\frac{\partial \overline{V}^{[i]}(x(t + 1))}{\partial x(t + 1)}\right). \tag{10.114}$$

For the two action networks for the lower performance index function, $x(t)$ is used as the input for the first action network to create the control $w(t)$, and $[x(t)\ w(t)]^T$ is used as the input for the other action network to create the control $u(t)$. The target function of the first action network (w network) is the discretized form of Eq. (10.35):

$$w^{[i+1]} = -\frac{1}{2}(C - D^T B^{-1} D)^{-1}\left(2(F^T - D^T B^{-1} E^T)x + (c^T(x) - D^T B^{-1} b^T(x))\frac{\partial \underline{V}^{[i]}(x(t + 1))}{\partial x(t + 1)}\right). \tag{10.115}$$

The target function of the second action network (u network) is the discretized form of Eq. (10.36):

$$u^{[i+1]} = -\frac{1}{2}B^{-1}\left(2D w^{[i+1]} + 2E^T x + b^T(x)\frac{\partial \underline{V}^{[i]}(x(t + 1))}{\partial x(t + 1)}\right). \tag{10.116}$$

The output of the action network, i.e., the first network for the upper performance index function, can be formulated as

$$\hat{u}^{[i]}(t) = W_a^{[i]T} \sigma(Y_a^{[i]T} x(t)). \tag{10.117}$$

The target of the output of the action network is given by (10.113). So we can define the output error of the action network as

$$e_a^{[i]}(t) = \hat{u}^{[i]}(t) - u^{[i]}(t). \tag{10.118}$$

The weights in the action network are updated to minimize the following performance error measure:

$$E_a^{[i]}(t) = \frac{1}{2}\big(e_a^{[i]}(t)\big)^2. \tag{10.119}$$

The weight update algorithm is similar to the one for the critic network. By the gradient descent rule, we can obtain

$$w_a^{[i]}(t + 1) = w_a^{[i]}(t) + \Delta w_a^{[i]}(t), \tag{10.120}$$

$$\Delta w_a^{[i]}(t) = \eta_a\left[-\frac{\partial E_a^{[i]}(t)}{\partial w_a^{[i]}(t)}\right], \tag{10.121}$$

$$\frac{\partial E_a^{[i]}(t)}{\partial w_a^{[i]}(t)} = \frac{\partial E_a^{[i]}(t)}{\partial e_a^{[i]}(t)}\,\frac{\partial e_a^{[i]}(t)}{\partial \hat{u}^{[i]}(t)}\,\frac{\partial \hat{u}^{[i]}(t)}{\partial w_a^{[i]}(t)}, \tag{10.122}$$

where $\eta_a > 0$ is the learning rate of the action network. The weight update rule for the other action networks is similar and is omitted.

After obtaining the optimal control pairs for the upper and lower performance index functions respectively, we regulate the w action network for the upper performance index function to compute the mixed optimal control pair. The output of the w action network is

$$\hat{w}^{[i]}(t) = W_a^{[i]T} \sigma(Y_a^{[i]T} z(t)), \tag{10.123}$$

where $z(t) = [x(t)\ u(t)]^T$. As the error function of the mixed optimal performance index function is expressed as (10.112), the update rule can be written as

$$\Delta w_a^{[i]}(t) = -\eta_a\left[\frac{\partial E_c^{o(i)}(t)}{\partial w_a^{[i]}(t)}\right] = -\eta_a\left[\frac{\partial E_c^{o(i)}(t)}{\partial e_c^{o(i)}(t)}\,\frac{\partial e_c^{o(i)}(t)}{\partial w^{[i]}(t)}\,\frac{\partial w^{[i]}(t)}{\partial w_a^{[i]}(t)}\right], \tag{10.124}$$

where $E_c^{o(i)}(t) = \frac{1}{2}\big(e_c^{o(i)}(t)\big)^2$.
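The action-network targets are the stationarity conditions of the Hamiltonian solved in two different orders: (10.32) and (10.33) eliminate $w$ first, while (10.35) and (10.36) eliminate $u$ first. A scalar sketch makes this easy to check numerically. The helper names `upper_pair` and `lower_pair` are hypothetical, all quantities are scalars (so transposes disappear), and `Vx` stands for whichever critic gradient is supplied.

```python
def upper_pair(x, Vx, B, C, D, E, F, b, c):
    """Scalar form of (10.32)-(10.33): solve for u first, then w."""
    u = -0.5 * (2.0 * (E - D * F / C) * x + (b - D * c / C) * Vx) / (B - D * D / C)
    w = -0.5 * (2.0 * D * u + 2.0 * F * x + c * Vx) / C
    return u, w


def lower_pair(x, Vx, B, C, D, E, F, b, c):
    """Scalar form of (10.35)-(10.36): solve for w first, then u."""
    w = -0.5 * (2.0 * (F - D * E / B) * x + (c - D * b / B) * Vx) / (C - D * D / B)
    u = -0.5 * (2.0 * D * w + 2.0 * E * x + b * Vx) / B
    return u, w


def stationarity_residuals(x, Vx, u, w, B, C, D, E, F, b, c):
    """Partial derivatives of l(x, u, w) + Vx * (a + b*u + c*w) in u and w."""
    r_u = 2.0 * B * u + 2.0 * D * w + 2.0 * E * x + b * Vx
    r_w = 2.0 * C * w + 2.0 * D * u + 2.0 * F * x + c * Vx
    return r_u, r_w


if __name__ == "__main__":
    params = dict(B=1.0, C=-2.0, D=0.3, E=0.2, F=-0.1, b=1.5, c=0.7)
    print(upper_pair(0.8, 1.1, **params))
    print(lower_pair(0.8, 1.1, **params))
```

When the same gradient is passed to both functions, the two orderings return the same pair; in the method itself they differ because the upper and lower critics supply different gradients.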

10.5 Simulation Study

In this section, two examples are used to illustrate the effectiveness of the proposed approach for continuous-time nonlinear quadratic ZS games.

Example 10.1 Consider the following linear system [17]

$$\dot{x} = x + u + w \tag{10.125}$$

with the performance index function

$$J = \int_0^{\infty} (x^2 + u^2 - \gamma^2 w^2)\,dt, \tag{10.126}$$

where $\gamma^2 = 2$. We choose three-layer neural networks as the critic network and the model network, with the structures 1-8-1 and 2-8-1 respectively. The action networks also have three layers. For the upper performance index function, the structure of the u action network is 1-8-1 and the structure of the w action network is 2-8-1; for the lower performance index function, the structure of the u action network is 2-8-1 and the structure of the w action network is 1-8-1. The initial weights of the action networks, critic network and model network are all set randomly in $[-1, 1]$. The model network should be trained first. For the given initial state $x(0) = 1$, we train the model network for 20 000 steps with the learning rate $\eta_m = 0.01$. After the training of the model network is completed, its weights are kept unchanged. Then the critic network and the action networks are trained for 100 time steps so that the given accuracy $\zeta = 10^{-6}$ is reached. In the training process, the learning rate is $\eta_a = \eta_c = 0.01$. The system and the performance index function are transformed according to (10.95) and (10.96), where the sample time interval is $\Delta t = 0.01$. Take the iteration number $i = 4$. The convergence trajectory of the performance index function is shown in Fig. 10.2. From Theorem 4.1 in [17] we know that the saddle point exists for the system (10.125) with the performance index function (10.126). Then from Fig. 10.2 we can see that the performance index functions reach the saddle point after 4 iterations, which demonstrates the effectiveness of the proposed method in this chapter. Figure 10.3 shows the iterative trajectories of the control variable $u$. The optimal control trajectories are displayed in Fig. 10.4 and the corresponding state trajectory is shown in Fig. 10.5.
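For a linear-quadratic game of this form, the infinite-horizon continuous-time saddle point can be cross-checked analytically. The sketch below assumes a quadratic value function $V(x) = p x^2$, for which the saddle-point controls are $u = -p x$ and $w = (p/\gamma^2)x$ and the HJI equation reduces to the scalar equation $1 + 2p - p^2(1 - 1/\gamma^2) = 0$. This is a standard textbook calculation, not a result quoted from the chapter, and it is not intended to reproduce the plotted trajectories; the function names are hypothetical.

```python
import math


def saddle_value_coefficient(gamma2=2.0):
    """Positive root p of 1 + 2p - p^2 * (1 - 1/gamma2) = 0 (requires gamma2 > 1)."""
    a = 1.0 - 1.0 / gamma2
    return (2.0 + math.sqrt(4.0 + 4.0 * a)) / (2.0 * a)


def hji_residual(p, x=1.0, gamma2=2.0):
    """Evaluate x^2 + u^2 - gamma2*w^2 + V_x*(x + u + w) at u = -p x, w = p x / gamma2."""
    u = -p * x
    w = p * x / gamma2
    return x * x + u * u - gamma2 * w * w + 2.0 * p * x * (x + u + w)


if __name__ == "__main__":
    p = saddle_value_coefficient()
    print(p, hji_residual(p))
```

For $\gamma^2 = 2$ the coefficient is $2 + \sqrt{6}$, and the resulting closed loop $\dot{x} = (1 - p/2)x$ is stable.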
Example 10.2 Consider the following continuous-time affine nonlinear system    x1 0.1 + x1 x2 0.1x12 + 0.05x2 + w 0.2x12 − 0.15x2 x1 0.2 + x1 x2   x2 0.5 + x1 0.1 + x1 u, + x22 0.1 + x12 0.3 + x1 x2

 x˙ =

(10.127)

where $x = [x_1\ x_2]^T$, $u = [u_1\ u_2\ u_3]^T$ and $w = [w_1\ w_2]^T$. The performance index function is formulated by

$$V(x(0), u, w) = \int_0^{\infty} (x^T A x + u^T B u + w^T C w + 2u^T D w + 2x^T E u + 2x^T F w)\,dt, \tag{10.128}$$

⎡ ⎤       T 100 101 10 −3 0 100 ,E = where A = , B = ⎣ 0 1 0 ⎦, C = ,D= 011 01 0 −3 011 001   −1 0 and F = . 0 −1 The system and the performance index function are transformed according to (10.95) and (10.96) where the sample time interval is t = 0.01. The critic network and the model network are also chosen as three-layer neural networks with the structure 2-8-1 and 7-8-2 respectively. The action network is also chosen as three-layer neural networks. For the upper performance index function, the structure of the u action network is 2-8-3 and the structure of the w action network is 5-8-2; for the lower performance index function, the structure of the u action network is 2-8-2 and the structure of the w action network is 4-8-3. The initial weight is also randomly chosen in [−1, 1]. For the given initial state x(0) = [−1 0.5 ]T , we train the model network for 20000 steps. After the training of the model network is completed, the weights keep unchanged. Then the critic network and the action network are trained 

for 1000 time steps so that the given accuracy $\zeta = 10^{-4}$ is reached. The convergence curves of the performance index functions are shown in Fig. 10.6. Figure 10.7 shows the convergent trajectories of the control variable $u_1$ for the upper performance index function, and the optimal control trajectories for the upper performance index function are displayed in Fig. 10.8. Figure 10.9 shows the convergent trajectories of the control variable $w_2$ for the lower performance index function, and the optimal control trajectories for the lower performance index function are displayed in Fig. 10.10.

Fig. 10.2 The trajectories of upper and lower performance index functions (performance index function versus time steps; curves: 1st iteration for upper performance index function, optimal performance index function, 1st iteration for lower performance index function)

Fig. 10.3 The control trajectories (control versus time steps; curves: 1st iteration for upper performance index, the optimal control, 1st iteration for lower performance index)

Fig. 10.4 The trajectories of optimal control pair (controls u and w versus time steps)

Fig. 10.5 The state trajectory (state versus time steps)

Fig. 10.6 Performance index functions (performance index function versus time steps; curves: 1st and limiting iterations for the upper and lower performance index functions)

Fig. 10.7 The control u1 for upper performance index function (1st and limiting iterations versus time steps)

Fig. 10.8 The optimal control for upper performance index function (u1, u2, u3, w1, w2 versus time steps)

Fig. 10.9 The control w2 for lower performance index function (1st and limiting iterations versus time steps)

Fig. 10.10 The optimal control for lower performance index function (u1, u2, u3, w1, w2 versus time steps)

Fig. 10.11 The performance index function trajectories (performance index function versus time steps; curves: upper performance index function, mixed optimal performance index function, lower performance index function)

After 7 iterations, we obtain the value of the upper performance index function $\overline{V}(x(0)) = 70.93180829072188$ and the value of the lower performance index function $\underline{V}(x(0)) = 22.15196172808129$. Since the two values differ, the saddle point does not exist, which can also be seen from Fig. 10.6. Using the mixed trajectory method proposed in this chapter, let the Gaussian noises be $\gamma_w \in \mathbb{R}^2$ and $\gamma_u \in \mathbb{R}^3$, where $\gamma_w^i(0, \sigma_i^2)$, $i = 1, 2$, and $\gamma_u^j(0, \sigma_j^2)$, $j = 1, 2, 3$, with $\sigma_i = 10^{-3}$, $i = 1, 2$, and $\sigma_j = 10^{-3}$, $j = 1, 2, 3$. Add the Gaussian noises $\gamma_w$ and $\gamma_u$ to $w$ and $u$ and take $N = 100$. According to (10.18)–(10.23), we get the value of the mixed optimal performance index function, 45.15594227340300. The mixed optimal performance index function can be expressed as $V^o(x) = 0.4716\,\overline{V}(x) + 0.5284\,\underline{V}(x)$. Figure 10.11 displays the trajectories of the mixed optimal performance index function; the corresponding mixed optimal control trajectories and state trajectories are shown in Figs. 10.12 and 10.13, respectively. From the above simulation results we can see that, using the proposed iterative ADP method, we obtain the mixed optimal performance index function successfully when the saddle point of the differential game does not exist.
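The mixing weights quoted above can be cross-checked from the three reported values. The following is only a minimal sketch: the helper name `mixing_weight` is hypothetical, and it assumes the mixed value is a convex combination of the upper and lower values, as the expression for the mixed optimal performance index function suggests.

```python
# Hypothetical helper (not from the chapter): recover the weight alpha in
# V_mixed = alpha * V_upper + (1 - alpha) * V_lower from the reported values.

def mixing_weight(v_upper: float, v_lower: float, v_mixed: float) -> float:
    """Return alpha such that v_mixed = alpha*v_upper + (1 - alpha)*v_lower."""
    if v_upper == v_lower:
        raise ValueError("upper and lower values coincide, so no mixing is needed")
    return (v_mixed - v_lower) / (v_upper - v_lower)

# Values reported in the text for the initial state x(0)
V_UPPER = 70.93180829072188
V_LOWER = 22.15196172808129
V_MIXED = 45.15594227340300

alpha = mixing_weight(V_UPPER, V_LOWER, V_MIXED)
print(f"alpha = {alpha:.4f}, 1 - alpha = {1 - alpha:.4f}")
```

Rounded to four decimals, the recovered weights agree with the coefficients 0.4716 and 0.5284 given in the text.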

Fig. 10.12 The mixed optimal control trajectories (mixed optimal control versus time steps; curves: u1, u2, u3, w1, w2)

Fig. 10.13 The state trajectories (state trajectories versus time steps; curves: x1, x2)

10.6 Conclusions

In this chapter, we proposed an effective iterative ADP method to solve a class of continuous-time nonlinear two-person ZS differential games. When the saddle point exists, the performance index function reaches the saddle point under the iterative ADP method, and rigorous stability and convergence analyses are given. When there is no saddle point, a determined control scheme is proposed to guarantee that the performance index function reaches the mixed optimal solution, and the stability and convergence properties are also analyzed. Finally, the simulation studies have demonstrated the effectiveness of the proposed method.


Chapter 11

Neural-Network-Based Synchronous Iteration Learning Method for Multi-player Zero-Sum Games

In this chapter, a synchronous solution method for multi-player zero-sum (ZS) games without system dynamics is established based on neural networks. The policy iteration (PI) algorithm is presented to solve the Hamilton–Jacobi–Bellman (HJB) equation. It is proven that the obtained iterative cost function converges to the optimal game value. To avoid the need for system dynamics, an off-policy learning method based on PI is given to obtain the iterative cost function, controls and disturbances. A critic neural network (CNN), action neural networks (ANNs) and disturbance neural networks (DNNs) are used to approximate the cost function, controls and disturbances. The weights of the neural networks compose the synchronous weight matrix, and the synchronous weight matrix is proven to be uniformly ultimately bounded (UUB). Two examples are given to show the effectiveness of the proposed synchronous solution method for multi-player ZS games.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019. R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_11

11.1 Introduction

The importance of strategic behavior in the human and social world is increasingly recognized in theory and practice. As a result, game theory has emerged as a fundamental instrument in pure and applied research [1]. Modern-day society relies on the operation of complex systems, including aircraft, automobiles, electric power systems, economic entities, business organizations, banking and finance systems, computer networks, manufacturing systems, and industrial processes. Networked dynamical agents have cooperative team-based goals as well as individual selfish goals, and their interplay can be complex and yield unexpected results in terms of emergent teams. Cooperation and conflict of multiple decision-makers for such systems can be studied within the field of cooperative and noncooperative game theory [2]. It is known that many real-world systems are controlled by more than one controller or decision maker, each using an individual strategy. These controllers often operate in a group, as a game with a general quadratic performance index function.

Therefore, some scholars have studied multi-player games. In [3], an off-policy integral reinforcement learning method was developed to solve nonlinear continuous-time multi-player non-zero-sum (NZS) games. In [4], multi-player zero-sum (ZS) differential games for a class of continuous-time uncertain nonlinear systems were solved using upper and lower iterations. ZS game theory relies on solving the Hamilton–Jacobi–Isaacs (HJI) equations, a generalized version of the Hamilton–Jacobi–Bellman (HJB) equations appearing in optimal control problems. In the nonlinear case the HJI equations are difficult or impossible to solve, and may not have global analytic solutions even in simple cases. Therefore, many approximate methods have been proposed to obtain the solution of HJI equations [5–8].

In this chapter, multi-player ZS games for completely unknown continuous-time nonlinear systems are solved explicitly. The main contributions of this chapter are summarized as follows.
(1) A synchronous solution method based on the PI algorithm and neural networks is established.
(2) For the traditional PI algorithm, which uses the system dynamics, it is proven that the iterative cost function converges to the optimal game value.
(3) A synchronous solution method is given to solve the off-policy HJB equation with convergence analysis, using a critic neural network (CNN), action neural networks (ANNs) and disturbance neural networks (DNNs).
(4) The synchronous weight matrix is proven to be uniformly ultimately bounded (UUB).

The rest of this chapter is organized as follows. In Sect. 11.2, we present the motivations and preliminaries of the discussed problem. In Sect. 11.3, the synchronous solution of multi-player ZS games is developed and the convergence proof is given. In Sect. 11.4, two examples are given to demonstrate the effectiveness of the proposed scheme. In Sect. 11.5, the conclusion is drawn.

11.2 Motivations and Preliminaries

In this chapter, we consider the continuous-time nonlinear system described by

$$\dot{x} = f(x) + g(x)\sum_{i=1}^{p} u_i + h(x)\sum_{j=1}^{q} d_j, \tag{11.1}$$

where $x \in \Omega \subset \mathbb{R}^n$ is the system state, and $u_i \in \mathbb{R}^{m_1}$ and $d_j \in \mathbb{R}^{m_2}$ are the control inputs and the disturbance inputs, respectively. $f(x) \in \mathbb{R}^n$, $g(x)$ and $h(x)$ are unknown functions, $f(0) = 0$, and $x = 0$ is an equilibrium point of the system. Assume that $f(x)$, $g(x)$ and $h(x)$ are locally Lipschitz functions on the compact set $\Omega$ that contains the origin, and that the dynamical system is stabilizable on $\Omega$. The performance index function is a generalized quadratic form given by

$$J(x(0), U_p, D_q) = \int_0^{\infty} \Big( x^T Q x + \sum_{i=1}^{p} u_i^T R_i u_i - \sum_{j=1}^{q} d_j^T S_j d_j \Big)\, dt, \tag{11.2}$$

where $Q$, $R_i$ and $S_j$ are positive definite matrices, $U_p = \{u_1, u_2, \ldots, u_p\}$ and $D_q = \{d_1, d_2, \ldots, d_q\}$. Then, we define the multi-player ZS differential game subject to (11.1) as

$$V^*(x(0)) = \inf_{u_1}\inf_{u_2}\cdots\inf_{u_p}\,\sup_{d_1}\sup_{d_2}\cdots\sup_{d_q} J(x(0), U_p, D_q). \tag{11.3}$$

The multi-player ZS differential game selects the minimizing player set $U_p$ and the maximizing player set $D_q$ such that the saddle point $(U_p^*, D_q^*)$ satisfies the following inequalities

$$J(x, U_p^*, D_q) \le J(x, U_p^*, D_q^*) \le J(x, U_p, D_q^*), \tag{11.4}$$

where $U_p^* = \{u_1^*, u_2^*, \ldots, u_p^*\}$ and $D_q^* = \{d_1^*, d_2^*, \ldots, d_q^*\}$. In this chapter, we assume that the multi-player optimal control problem has a unique solution if and only if the Nash condition holds [9]

$$V^*(x) = \inf_{U_p}\sup_{D_q} J(x, U_p, D_q) = \sup_{D_q}\inf_{U_p} J(x, U_p, D_q). \tag{11.5}$$

If we give the feedback policy $(U_p(x), D_q(x))$, then the value or cost of the policy is

$$V(x(t)) = \int_t^{\infty} \Big( x^T Q x + \sum_{i=1}^{p} u_i^T R_i u_i - \sum_{j=1}^{q} d_j^T S_j d_j \Big)\, dt. \tag{11.6}$$

By using Leibniz's formula and differentiating, (11.6) has a differential equivalent. Then we can obtain the nonlinear ZS game Bellman equation, which is given in terms of the Hamiltonian function

$$H(x, \nabla V, U_p, D_q) = x^T Q x + \sum_{i=1}^{p} u_i^T R_i u_i - \sum_{j=1}^{q} d_j^T S_j d_j + \nabla V^T \Big( f + g\sum_{i=1}^{p} u_i + h\sum_{j=1}^{q} d_j \Big) = 0, \tag{11.7}$$

where $\nabla V = \partial V / \partial x$. The stationary conditions are

$$\frac{\partial H}{\partial u_i} = 0, \quad i = 1, 2, \ldots, p, \tag{11.8}$$

and

$$\frac{\partial H}{\partial d_j} = 0, \quad j = 1, 2, \ldots, q. \tag{11.9}$$

Applying (11.8) and (11.9) to (11.7), the optimal controls and disturbances are

$$u_i^* = -\frac{1}{2} R_i^{-1} g^T \nabla V^*, \quad i = 1, 2, \ldots, p, \tag{11.10}$$

and

$$d_j^* = \frac{1}{2} S_j^{-1} h^T \nabla V^*, \quad j = 1, 2, \ldots, q. \tag{11.11}$$
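As a worked step, the stationary conditions can be written out explicitly. The following short derivation is a sketch that assumes $R_i$ and $S_j$ are symmetric (as is usual for positive definite weighting matrices); it uses only the Hamiltonian (11.7).

```latex
\begin{align*}
\frac{\partial H}{\partial u_i} &= 2 R_i u_i + g^T \nabla V = 0
  &&\Longrightarrow\quad u_i^* = -\tfrac{1}{2} R_i^{-1} g^T \nabla V^*, \\
\frac{\partial H}{\partial d_j} &= -2 S_j d_j + h^T \nabla V = 0
  &&\Longrightarrow\quad d_j^* = \tfrac{1}{2} S_j^{-1} h^T \nabla V^*, \\
u_i^{*T} R_i u_i^* + \nabla V^{*T} g\, u_i^*
  &= -\tfrac{1}{4} \nabla V^{*T} g R_i^{-1} g^T \nabla V^*, \\
-\,d_j^{*T} S_j d_j^* + \nabla V^{*T} h\, d_j^*
  &= \tfrac{1}{4} \nabla V^{*T} h S_j^{-1} h^T \nabla V^*.
\end{align*}
```

The last two lines show where the factors $-\tfrac14$ and $+\tfrac14$ come from when the optimal policies are substituted back into the Hamiltonian.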

From the Bellman equation (11.7), we can derive $V^*$ from the solution of the HJI equation

$$0 = x^T Q x + \nabla V^T f - \frac{1}{4}\sum_{i=1}^{p} \nabla V^T g R_i^{-1} g^T \nabla V + \frac{1}{4}\sum_{j=1}^{q} \nabla V^T h S_j^{-1} h^T \nabla V. \tag{11.12}$$

Note that if (11.12) is solved, then the optimal controls are obtained. In the general case, the PI algorithm can be applied to get $V^*$. The algorithm implementation process is given in Algorithm 4. The convergence of Algorithm 4 will be analyzed in the next theorem.

Algorithm 4 PI for nonlinear multi-player ZS differential games

1: Start with stabilizing initial policies $u_1^{[0]}, u_2^{[0]}, \ldots, u_p^{[0]}$ and $d_1^{[0]}, d_2^{[0]}, \ldots, d_q^{[0]}$.

2: Let $k = 1, 2, 3, \ldots$, solve $V^{[k]}$ from

$$0 = x^T Q x + \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} - \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]} + \nabla V^{[k]T}\Big( f + g\sum_{i=1}^{p} u_i^{[k]} + h\sum_{j=1}^{q} d_j^{[k]} \Big). \tag{11.13}$$

3: Update the control and disturbance using

$$u_i^{[k+1]} = -\frac{1}{2} R_i^{-1} g^T \nabla V^{[k]} \tag{11.14}$$

and

$$d_j^{[k+1]} = \frac{1}{2} S_j^{-1} h^T \nabla V^{[k]}. \tag{11.15}$$

4: Let $k = k + 1$, return to step 2 and continue.
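To make the evaluation and improvement steps of Algorithm 4 concrete, the sketch below runs them on a scalar linear-quadratic game with known dynamics. This special case is not worked in the chapter: the system $\dot{x} = ax + bu + cd$, the quadratic value $V(x) = Px^2$, the linear policies $u = -Kx$, $d = Lx$, the parameter values and the function name `policy_iteration_scalar` are all assumptions chosen for illustration.

```python
import math

def policy_iteration_scalar(a, b, c, q, r, s, K0=0.0, L0=0.0, iters=20):
    """PI for the scalar game dx/dt = a*x + b*u + c*d with V(x) = P*x**2.

    Policies are linear: u = -K*x (minimizer) and d = L*x (maximizer).
    """
    K, L = K0, L0
    history = []
    for _ in range(iters):
        a_cl = a - b * K + c * L          # closed-loop coefficient under (K, L)
        if a_cl >= 0:
            raise ValueError("current policy pair is not stabilizing")
        # Policy evaluation, scalar form of (11.13):
        #   0 = q + r*K^2 - s*L^2 + 2*P*a_cl
        P = -(q + r * K**2 - s * L**2) / (2.0 * a_cl)
        # Policy improvement, scalar forms of (11.14) and (11.15), grad V = 2*P*x
        K = b * P / r
        L = c * P / s
        history.append(P)
    return P, K, L, history

# Illustrative parameters (assumed, not taken from the chapter)
a, b, c, q, r, s = -1.0, 1.0, 0.5, 1.0, 1.0, 1.0
P, K, L, history = policy_iteration_scalar(a, b, c, q, r, s)

# Positive root of the scalar HJI equation 0 = q + 2*a*P - (b^2/r - c^2/s)*P^2
m = b**2 / r - c**2 / s
P_star = (a + math.sqrt(a**2 + m * q)) / m
print(history[:3], P_star)
```

With these parameters the first evaluation gives P = 0.5 (zero initial gains are already stabilizing because a < 0), and the iterates settle on the positive root of the scalar HJI equation, which is the scalar counterpart of (11.12).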

Theorem 11.1 Define $V^{[k]}$ as in (11.13). Let the control policy $u_i^{[k]}$ and the disturbance policy $d_j^{[k]}$ be as in (11.14) and (11.15), respectively. Then the iterative values $V^{[k]}$ converge to the optimal game value $V^*$ as $k \to \infty$.

Proof According to (11.13), we have

$$\dot{V}^{[k+1]} = -x^T Q x - \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} + \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]}. \tag{11.16}$$

Then

$$\begin{aligned}
\dot{V}^{[k]} ={}& -x^T Q x - \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} + \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]} \\
& - \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} + \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]} + \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} - \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]} \\
={}& \dot{V}^{[k+1]} + \sum_{i=1}^{p} u_i^{[k+1]T} R_i u_i^{[k+1]} - \sum_{j=1}^{q} d_j^{[k+1]T} S_j d_j^{[k+1]} - \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} + \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]}.
\end{aligned} \tag{11.17}$$

By transformation, we have

$$\begin{aligned}
\dot{V}^{[k]} ={}& \dot{V}^{[k+1]} - \sum_{i=1}^{p} (u_i^{[k+1]} - u_i^{[k]})^T R_i (u_i^{[k+1]} - u_i^{[k]}) + 2\sum_{i=1}^{p} u_i^{[k+1]T} R_i (u_i^{[k+1]} - u_i^{[k]}) \\
& + \sum_{j=1}^{q} (d_j^{[k+1]} - d_j^{[k]})^T S_j (d_j^{[k+1]} - d_j^{[k]}) - 2\sum_{j=1}^{q} d_j^{[k+1]T} S_j (d_j^{[k+1]} - d_j^{[k]}).
\end{aligned} \tag{11.18}$$

Let $\Delta u_i^{[k]} = u_i^{[k+1]} - u_i^{[k]}$ and $\Delta d_j^{[k]} = d_j^{[k+1]} - d_j^{[k]}$, then

$$\dot{V}^{[k]} = \dot{V}^{[k+1]} - \sum_{i=1}^{p} \Delta u_i^{[k]T} R_i \Delta u_i^{[k]} + 2\sum_{i=1}^{p} u_i^{[k+1]T} R_i \Delta u_i^{[k]} + \sum_{j=1}^{q} \Delta d_j^{[k]T} S_j \Delta d_j^{[k]} - 2\sum_{j=1}^{q} d_j^{[k+1]T} S_j \Delta d_j^{[k]}. \tag{11.19}$$

From (11.14) and (11.15), we have

$$\nabla V^{[k]T} g = -2 u_i^{[k+1]T} R_i, \tag{11.20}$$

and

$$\nabla V^{[k]T} h = 2 d_j^{[k+1]T} S_j. \tag{11.21}$$

Then (11.19) is expressed as

$$\dot{V}^{[k]} = \dot{V}^{[k+1]} - \sum_{i=1}^{p} \Delta u_i^{[k]T} R_i \Delta u_i^{[k]} - \sum_{i=1}^{p} \nabla V^{[k]T} g\, \Delta u_i^{[k]} + \sum_{j=1}^{q} \Delta d_j^{[k]T} S_j \Delta d_j^{[k]} - \sum_{j=1}^{q} \nabla V^{[k]T} h\, \Delta d_j^{[k]}. \tag{11.22}$$

Thus sufficient conditions for $\dot{V}^{[k]} \le \dot{V}^{[k+1]}$ are

$$\Delta u_i^{[k]T} R_i \Delta u_i^{[k]} + \nabla V^{[k]T} g\, \Delta u_i^{[k]} > 0, \tag{11.23}$$

and

$$\Delta d_j^{[k]T} S_j \Delta d_j^{[k]} - \nabla V^{[k]T} h\, \Delta d_j^{[k]} < 0. \tag{11.24}$$

Hence, if $\delta_H(S_j)\|\Delta d_j^{[k]}\| \le \|\nabla V^{[k]T} h\|$ and $\nabla V^{[k]T} g\, \Delta u_i^{[k]} > 0$, or $\delta_H(S_j)\|\Delta d_j^{[k]}\| \le \|\nabla V^{[k]T} h\|$ and $\delta_L(R_i)\|\Delta u_i^{[k]}\| > \|\nabla V^{[k]T} g\|$, where $\delta_L$ is the operator which takes the minimum singular value and $\delta_H$ is the operator which takes the maximum singular value, then $\dot{V}^{[k]} \le \dot{V}^{[k+1]}$. The proof is complete.

From Algorithm 4, we can see that the PI algorithm depends on the system dynamics, which are unknown in this chapter. Therefore, in the next section, an off-policy PI algorithm is presented that solves for the control and disturbance policies synchronously.

11.3 Synchronous Solution of Multi-player ZS Games

In this section, an off-policy algorithm is proposed based on Algorithm 4. The neural network implementation process is also given. Based on that, the stability of the synchronous solution method is proven.

11.3.1 Derivation of Off-Policy Algorithm

Let $u_i^{[k]}$ and $d_j^{[k]}$ be obtained by (11.14) and (11.15). Then the original system (11.1) is rewritten as

$$\dot{x} = f + g\sum_{i=1}^{p} u_i^{[k]} + h\sum_{j=1}^{q} d_j^{[k]} + g\sum_{i=1}^{p} (u_i - u_i^{[k]}) + h\sum_{j=1}^{q} (d_j - d_j^{[k]}). \tag{11.25}$$

Substituting (11.25) into (11.6), we have

$$\begin{aligned}
V^{[k]}(x(t+T)) - V^{[k]}(x(t)) ={}& \int_t^{t+T} \nabla V^{[k]T} \dot{x}\, d\tau \\
={}& \int_t^{t+T} \nabla V^{[k]T}\Big( f + g\sum_{i=1}^{p} u_i^{[k]} + h\sum_{j=1}^{q} d_j^{[k]} \Big) d\tau \\
& + \int_t^{t+T} \nabla V^{[k]T}\Big( g\sum_{i=1}^{p} (u_i - u_i^{[k]}) + h\sum_{j=1}^{q} (d_j - d_j^{[k]}) \Big) d\tau.
\end{aligned} \tag{11.26}$$

According to (11.13), (11.26) becomes

$$\begin{aligned}
V^{[k]}(x(t+T)) - V^{[k]}(x(t)) ={}& -\int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} - \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]} \Big) d\tau \\
& + \int_t^{t+T} \nabla V^{[k]T}\Big( g\sum_{i=1}^{p} (u_i - u_i^{[k]}) + h\sum_{j=1}^{q} (d_j - d_j^{[k]}) \Big) d\tau.
\end{aligned} \tag{11.27}$$

Then (11.27) is the off-policy Bellman equation for multi-player ZS games, which, using (11.20) and (11.21), is expressed as

$$\begin{aligned}
V^{[k]}(x(t+T)) - V^{[k]}(x(t)) ={}& -\int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} u_i^{[k]T} R_i u_i^{[k]} - \sum_{j=1}^{q} d_j^{[k]T} S_j d_j^{[k]} \Big) d\tau \\
& + \int_t^{t+T} \Big( -2 u_i^{[k+1]T} R_i \sum_{i=1}^{p} (u_i - u_i^{[k]}) + 2 d_j^{[k+1]T} S_j \sum_{j=1}^{q} (d_j - d_j^{[k]}) \Big) d\tau.
\end{aligned} \tag{11.28}$$

Equation (11.28) shows two points. First, the system dynamics are not necessary for obtaining $V^{[k]}$. Second, $u_i^{[k]}$, $d_j^{[k]}$ and $V^{[k]}$ can be obtained synchronously. In the next part, the implementation method for solving (11.28) is presented.

11.3.2 Implementation Method for Off-Policy Algorithm

In this part, the method for solving the off-policy Bellman equation (11.28) is given. Critic, action and disturbance networks are applied to approximate $V^{[k]}$, $u_i^{[k]}$ and $d_j^{[k]}$. The implementation block diagram is shown in Fig. 11.1. Here the CNN, ANNs and DNNs are used to approximate the cost, control policies and disturbances.

Fig. 11.1 Implementation block diagram (blocks: ANN, ANN, Plant, CNN, DNN, DNN)

In the neural network, if the number of hidden layer neurons is $L$, the weight matrix between the input layer and hidden layer is $Y$, the weight matrix between the hidden layer and output layer is $W$ and the input vector of the neural network is $X$, then the output of the three-layer neural network is represented by

$$F_N(X, Y, W) = W^T \hat{\sigma}(Y X), \tag{11.29}$$

where $\hat{\sigma}(Y X)$ is the activation function. For convenience of analysis, only the output weight $W$ is updated during training, while the hidden weight is kept unchanged. Hence, in the following part, the neural network function (11.29) can be simplified to the expression

$$F_N(X, W) = W^T \sigma(X). \tag{11.30}$$
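The two network expressions above can be illustrated numerically. The snippet below is a minimal sketch, assuming NumPy, a hyperbolic tangent activation and a 2-8-1 structure; the function name `three_layer_output`, the seed and the input vector are illustrative choices, not taken from the chapter.

```python
import numpy as np

def three_layer_output(X, Y, W, activation=np.tanh):
    """Evaluate F_N(X, Y, W) = W^T sigma_hat(Y X), cf. (11.29)."""
    return W.T @ activation(Y @ X)

rng = np.random.default_rng(0)           # fixed seed, for reproducibility only
n_in, L_hidden, n_out = 2, 8, 1          # a 2-8-1 structure
Y = rng.uniform(-1.0, 1.0, size=(L_hidden, n_in))    # hidden weights, kept fixed
W = rng.uniform(-1.0, 1.0, size=(L_hidden, n_out))   # output weights, trained
X = np.array([1.0, -1.0])                # example input vector

# With Y frozen, sigma(X) = tanh(Y X) acts as a fixed basis, cf. (11.30)
sigma = np.tanh(Y @ X)
out_full = three_layer_output(X, Y, W)
out_simplified = W.T @ sigma
print(out_full.shape, np.allclose(out_full, out_simplified))
```

Freezing $Y$ makes the output linear in $W$, which is what allows the weights to be found later by a linear-in-parameters update.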

The neural network expression of the CNN is given as

$$V^{[k]}(x) = A^{[k]T} \phi_V(x) + \delta_V(x), \tag{11.31}$$

where $A^{[k]}$ is the ideal weight of the critic network, $\phi_V(x)$ is the activation function, and $\delta_V(x)$ is the residual error. Let the estimate of $A^{[k]}$ be $\hat{A}^{[k]}$. Then the estimate of $V^{[k]}(x)$ is

$$\hat{V}^{[k]}(x) = \hat{A}^{[k]T} \phi_V(x), \tag{11.32}$$

and

$$\nabla \hat{V}^{[k]}(x) = \nabla \phi_V^T(x)\, \hat{A}^{[k]}. \tag{11.33}$$

The neural network expression of the ANN is

$$u_i^{[k]} = B_i^{[k]T} \phi_u(x) + \delta_u(x), \tag{11.34}$$

where $B_i^{[k]}$ is the ideal weight of the action network, $\phi_u(x)$ is the activation function, and $\delta_u(x)$ is the residual error. Let $\hat{B}_i^{[k]}$ be the estimate of $B_i^{[k]}$; then the estimate of $u_i^{[k]}$ is

$$\hat{u}_i^{[k]} = \hat{B}_i^{[k]T} \phi_u(x). \tag{11.35}$$

The neural network expression of the DNN is

$$d_j^{[k]} = C_j^{[k]T} \phi_d(x) + \delta_d(x), \tag{11.36}$$

where $C_j^{[k]}$ is the ideal weight of the disturbance network, $\phi_d(x)$ is the activation function, and $\delta_d(x)$ is the residual error. Let $\hat{C}_j^{[k]}$ be the estimate of $C_j^{[k]}$; then the estimate of $d_j^{[k]}$ is

$$\hat{d}_j^{[k]} = \hat{C}_j^{[k]T} \phi_d(x). \tag{11.37}$$

According to (11.28), we define the equation error as

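$$\begin{aligned}
e^{[k]} ={}& \hat{V}^{[k]}(x(t)) - \hat{V}^{[k]}(x(t+T)) - \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau \\
& + \int_t^{t+T} \Big( -2 \hat{u}_i^{[k+1]T} R_i \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]}) + 2 \hat{d}_j^{[k+1]T} S_j \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]}) \Big) d\tau.
\end{aligned} \tag{11.38}$$

Therefore, substituting (11.32), (11.35) and (11.37) into (11.38), we have

$$\begin{aligned}
e^{[k]} ={}& \hat{V}^{[k]}(x(t)) - \hat{V}^{[k]}(x(t+T)) - \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau \\
& + \int_t^{t+T} \Big( -2 \phi_u^T \hat{B}_i^{[k+1]} R_i \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]}) + 2 \phi_d^T \hat{C}_j^{[k+1]} S_j \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]}) \Big) d\tau.
\end{aligned} \tag{11.39}$$

Since

$$\phi_u^T \hat{B}_i^{[k+1]} R_i \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]}) = \Big( \Big( \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]}) \Big)^T R_i \otimes \phi_u^T \Big) \mathrm{vec}(\hat{B}_i^{[k+1]}), \tag{11.40}$$

where $\otimes$ denotes the Kronecker product, and

$$\phi_d^T \hat{C}_j^{[k+1]} S_j \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]}) = \Big( \Big( \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]}) \Big)^T S_j \otimes \phi_d^T \Big) \mathrm{vec}(\hat{C}_j^{[k+1]}), \tag{11.41}$$

substituting (11.40) and (11.41) into (11.39) gives

$$\begin{aligned}
e^{[k]} ={}& \big( (\phi_V(x(t)) - \phi_V(x(t+T)))^T \otimes I \big) \hat{A}^{[k]} - \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau \\
& + \int_t^{t+T} \Big( -2 \Big( \Big( \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]}) \Big)^T R_i \otimes \phi_u^T \Big) \mathrm{vec}(\hat{B}_i^{[k+1]}) + 2 \Big( \Big( \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]}) \Big)^T S_j \otimes \phi_d^T \Big) \mathrm{vec}(\hat{C}_j^{[k+1]}) \Big) d\tau.
\end{aligned} \tag{11.42}$$

Define

$$\Pi_V = (\phi_V(x(t)) - \phi_V(x(t+T)))^T \otimes I, \tag{11.43}$$

$$\Pi = \int_t^{t+T} \Big( x^T Q x + \sum_{i=1}^{p} \hat{u}_i^{[k]T} R_i \hat{u}_i^{[k]} - \sum_{j=1}^{q} \hat{d}_j^{[k]T} S_j \hat{d}_j^{[k]} \Big) d\tau, \tag{11.44}$$

$$\Pi_u = \int_t^{t+T} -2 \Big( \sum_{i=1}^{p} (u_i - \hat{u}_i^{[k]}) \Big)^T R_i \otimes \phi_u^T \, d\tau, \tag{11.45}$$

$$\Pi_d = \int_t^{t+T} 2 \Big( \sum_{j=1}^{q} (d_j - \hat{d}_j^{[k]}) \Big)^T S_j \otimes \phi_d^T \, d\tau. \tag{11.46}$$

Then we have

$$e^{[k]} = \Pi_V \hat{A}^{[k]} - \Pi + \Pi_u \mathrm{vec}(\hat{B}_i^{[k+1]}) + \Pi_d \mathrm{vec}(\hat{C}_j^{[k+1]}) = [\Pi_V \;\; \Pi_u \;\; \Pi_d] \begin{bmatrix} \hat{A}^{[k]} \\ \mathrm{vec}(\hat{B}_i^{[k+1]}) \\ \mathrm{vec}(\hat{C}_j^{[k+1]}) \end{bmatrix} - \Pi. \tag{11.47}$$

Define the activation function matrix

$$\Pi_\Pi = [\Pi_V \;\; \Pi_u \;\; \Pi_d], \tag{11.48}$$

and the synchronous weight matrix

$$\hat{W}_{i,j}^{[k]} = \begin{bmatrix} \hat{A}^{[k]} \\ \mathrm{vec}(\hat{B}_i^{[k+1]}) \\ \mathrm{vec}(\hat{C}_j^{[k+1]}) \end{bmatrix}. \tag{11.49}$$

Then (11.47) is

$$e^{[k]} = \Pi_\Pi \hat{W}_{i,j}^{[k]} - \Pi. \tag{11.50}$$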
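The vectorization step used in (11.40) and (11.41) rests on the identity $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$ with column-stacking $\mathrm{vec}$. The sketch below checks the action-network form numerically with NumPy; the dimensions, the seed and the stand-in vector `v` for $\sum_i (u_i - \hat{u}_i^{[k]})$ are assumptions for illustration, and $R$ is taken symmetric.

```python
import numpy as np

rng = np.random.default_rng(1)
L_hidden, m = 8, 3                           # hidden neurons, control dimension (assumed)
phi = rng.standard_normal(L_hidden)          # activation vector phi_u(x)
B_hat = rng.standard_normal((L_hidden, m))   # action weight estimate
M = rng.standard_normal((m, m))
R = M @ M.T + m * np.eye(m)                  # symmetric positive definite R
v = rng.standard_normal(m)                   # stands in for sum_i (u_i - u_hat_i)

lhs = phi @ B_hat @ R @ v                    # phi^T B_hat R v (a scalar)
vec_B = B_hat.flatten(order="F")             # column-stacking vec(B_hat)
rhs = np.kron(v @ R, phi) @ vec_B            # (v^T R kron phi^T) vec(B_hat)
print(np.isclose(lhs, rhs))
```

Note that `flatten(order="F")` is needed so that the stacking order matches the Kronecker ordering; the default row-major flattening would not satisfy the identity in general.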

Define $E^{[k]} = \frac{1}{2} e^{[k]T} e^{[k]}$. Then, according to the gradient descent algorithm, the update method of the weight $\hat{W}_{i,j}^{[k]}$ is

$$\dot{\hat{W}}_{i,j}^{[k]} = -\eta_{i,j}^{[k]} \Pi_\Pi^T \big( \Pi_\Pi \hat{W}_{i,j}^{[k]} - \Pi \big), \tag{11.51}$$

where $\eta_{i,j}^{[k]}$ is a positive number. According to the gradient descent algorithm, the optimal weight $\hat{W}_{i,j}^{[k]}$ minimizes $E^{[k]}$ and can be obtained adaptively by (11.51). Therefore, the weights of the critic, action and disturbance networks are solved simultaneously. In the proposed method, only one equation is necessary, instead of (11.13)–(11.15) in Algorithm 4, to obtain the optimal solution for the multi-player ZS games.
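A discretized version of the update law (11.51) can be tried on synthetic data. The sketch below is only an illustration under stated assumptions: the rows of `Pi_Pi` and the targets `Pi` are random stand-ins for regressors and integrals collected over several intervals (the chapter defines them for a single interval), the forward-Euler step size is chosen from the largest singular value for stability, and the comparison with `numpy.linalg.lstsq` is added here as a check.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_weights = 50, 5                 # assumed sizes, for illustration
Pi_Pi = rng.standard_normal((n_samples, n_weights))   # stacked regressor rows
W_true = rng.standard_normal(n_weights)
Pi = Pi_Pi @ W_true + 1e-3 * rng.standard_normal(n_samples)  # noisy targets

# Forward-Euler step of dW/dt = -eta * Pi_Pi^T (Pi_Pi W - Pi).
# step = eta*dt is kept below 2 / lambda_max(Pi_Pi^T Pi_Pi) for stability.
step = 1.0 / np.linalg.norm(Pi_Pi, 2) ** 2
W_hat = rng.uniform(-1.0, 1.0, n_weights)    # random initial weights in (-1, 1)
for _ in range(5000):
    e = Pi_Pi @ W_hat - Pi                   # equation error, cf. (11.50)
    W_hat = W_hat - step * Pi_Pi.T @ e       # discretized form of (11.51)

# The minimizer of E = e^T e / 2 is the least-squares solution
W_ls, *_ = np.linalg.lstsq(Pi_Pi, Pi, rcond=None)
print(np.max(np.abs(W_hat - W_ls)))
```

Because the error is linear in the weights, the gradient flow has a single minimizer when the stacked regressor has full column rank, which is why enough informative data must be collected before the update can identify all weights at once.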

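11.3.3 Stability Analysis

Theorem 11.2 Let the update method for the critic, action and disturbance networks be as in (11.51). Define the weight estimation error as $\tilde{W}_{i,j}^{[k]} = W_{i,j}^{[k]} - \hat{W}_{i,j}^{[k]}$. Then $\tilde{W}_{i,j}^{[k]}$ is UUB.

Proof Let the Lyapunov function candidate be

$$\Lambda_{i,j}^{[k]} = \frac{\alpha}{2\eta_{i,j}^{[k]}} \tilde{W}_{i,j}^{[k]T} \tilde{W}_{i,j}^{[k]}, \quad \forall i, j, k, \tag{11.52}$$

where $\alpha > 0$. According to (11.51), we have

$$\begin{aligned}
\dot{\tilde{W}}_{i,j}^{[k]} &= \eta_{i,j}^{[k]} \Pi_\Pi^T \big( \Pi_\Pi (W_{i,j}^{[k]} - \tilde{W}_{i,j}^{[k]}) - \Pi \big) \\
&= -\eta_{i,j}^{[k]} \Pi_\Pi^T \Pi_\Pi \tilde{W}_{i,j}^{[k]} + \eta_{i,j}^{[k]} \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \eta_{i,j}^{[k]} \Pi_\Pi^T \Pi.
\end{aligned} \tag{11.53}$$

Therefore, the time derivative of (11.52) is

$$\begin{aligned}
\dot{\Lambda}_{i,j}^{[k]} &= \frac{\alpha}{\eta_{i,j}^{[k]}} \tilde{W}_{i,j}^{[k]T} \dot{\tilde{W}}_{i,j}^{[k]} \\
&= \alpha \tilde{W}_{i,j}^{[k]T} \big( -\Pi_\Pi^T \Pi_\Pi \tilde{W}_{i,j}^{[k]} + \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \Pi_\Pi^T \Pi \big) \\
&= -\alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi_\Pi \tilde{W}_{i,j}^{[k]} + \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi \\
&\le -\alpha \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi_\Pi W_{i,j}^{[k]} - \alpha \tilde{W}_{i,j}^{[k]T} \Pi_\Pi^T \Pi \\
&\le -\alpha \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{1}{2} \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|W_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{1}{2} \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|\Pi\|^2.
\end{aligned} \tag{11.54}$$

By transformation, (11.54) is

$$\dot{\Lambda}_{i,j}^{[k]} \le (-\alpha + 1) \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|W_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|\Pi\|^2. \tag{11.55}$$

Define

$$\Sigma_{i,j}^{[k]} = \frac{\alpha^2}{2} \|W_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \frac{\alpha^2}{2} \|\Pi\|^2. \tag{11.56}$$

Then (11.55) is

$$\dot{\Lambda}_{i,j}^{[k]} \le (-\alpha + 1) \|\tilde{W}_{i,j}^{[k]}\|^2 \|\Pi_\Pi\|^2 + \Sigma_{i,j}^{[k]}. \tag{11.57}$$

Thus, if

$$\alpha > 1, \tag{11.58}$$

and

$$\|\tilde{W}_{i,j}^{[k]}\|^2 > \frac{\Sigma_{i,j}^{[k]}}{(\alpha - 1) \|\Pi_\Pi\|^2}, \tag{11.59}$$

then $\dot{\Lambda}_{i,j}^{[k]} < 0$ and $\tilde{W}_{i,j}^{[k]}$ is UUB. The proof is complete.

According to Theorem 11.2, if the convergence condition is satisfied, then $\hat{V}^{[k]} \to V^{[k]}$, $\hat{u}_i^{[k]} \to u_i^{[k]}$ and $\hat{d}_j^{[k]} \to d_j^{[k]}$.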

11.4 Simulation Study

In this section, two examples are provided to demonstrate the effectiveness of the optimal control scheme proposed in this chapter.

Example 11.1 Consider the following linear system [10] with modifications

$$\dot{x} = x + u + d. \tag{11.60}$$

In this example, the initial state is $x(0) = 1$. We select hyperbolic tangent functions as the activation functions of the critic, action and disturbance networks. The structures of the critic, action and disturbance networks are 1-8-1. The initial weight $W$ is selected arbitrarily from $(-1, 1)$, and the dimension of $W$ is 24 × 1. For the cost function, $Q$, $R$ and $S$ in the utility function are identity matrices of appropriate dimensions. After 500 time steps, the simulation results are obtained. Figure 11.2 shows the cost function, which converges to zero as time increases. The control and disturbance trajectories are given in Figs. 11.3 and 11.4. Under the action of the obtained control and disturbance inputs, the state trajectory is displayed in Fig. 11.5. These results indicate that the method presented in this chapter is effective and feasible.

Fig. 11.2 Cost function (cost function versus time steps)

Fig. 11.3 Control (control versus time steps)

Fig. 11.4 Disturbance (disturbance versus time steps)

Fig. 11.5 State (state versus time steps)

Fig. 11.6 Cost function (cost function versus time steps)

Example 11.2 Consider the following nonlinear system, affine in the control input [11],

$$\dot{x} = f(x) + g(x)\sum_{i=1}^{p} u_i + h(x)\sum_{j=1}^{q} d_j, \tag{11.61}$$

where

$$f(x) = \begin{bmatrix} x_2 \\ -x_2 - \frac{1}{2} x_1 + \frac{1}{4} x_2 (\cos(2x_1) + 2)^2 + \frac{1}{4} x_2 (\sin(4x_1^2) + 2)^2 \end{bmatrix}, \quad g(x) = \begin{bmatrix} 0 \\ \cos(2x_1) + 2 \end{bmatrix}, \quad h(x) = \begin{bmatrix} 0 \\ \sin(4x_1^2) + 2 \end{bmatrix},$$

and $p = q = 1$. In this simulation, the initial state is $x(0) = [1, -1]^T$. Hyperbolic tangent functions are used as the activation functions of the critic, action and disturbance networks. The structures of the networks are 2-8-1. The initial weight $W$ is selected arbitrarily from $(-1, 1)$, and the dimension of $W$ is 24 × 1. For the cost function of (11.61), $Q$, $R$ and $S$ in the utility function are identity matrices of appropriate dimensions. The simulation results are obtained over 2500 time steps. The cost function is shown in Fig. 11.6; it is ZS. The control and disturbance trajectories are given in Figs. 11.7 and 11.8. The state trajectories are displayed in Fig. 11.9. We can see that the closed-loop system state, control and disturbance inputs converge to zero as the time step increases. The simulation therefore supports the effectiveness of the proposed synchronous method for multi-player ZS games.
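For readers who want to generate trajectory data for this example, the dynamics of (11.61) can be coded directly. The sketch below is an assumption-laden illustration: it uses NumPy, the function names `f`, `g`, `h` and `x_dot` mirror the symbols in the text, and the model serves only as a simulator, since the learning method itself treats these functions as unknown.

```python
import numpy as np

def f(x):
    """Drift term f(x) of Example 11.2."""
    x1, x2 = x
    return np.array([
        x2,
        -x2 - 0.5 * x1
        + 0.25 * x2 * (np.cos(2 * x1) + 2) ** 2
        + 0.25 * x2 * (np.sin(4 * x1 ** 2) + 2) ** 2,
    ])

def g(x):
    """Control input gain g(x)."""
    return np.array([0.0, np.cos(2 * x[0]) + 2])

def h(x):
    """Disturbance input gain h(x)."""
    return np.array([0.0, np.sin(4 * x[0] ** 2) + 2])

def x_dot(x, u, d):
    """Right-hand side of (11.61) with p = q = 1 (scalar u and d)."""
    return f(x) + g(x) * u + h(x) * d

x0 = np.array([1.0, -1.0])   # initial state used in the example
print(x_dot(x0, u=0.0, d=0.0))
```

Both input gains act only on the second state and stay within [1, 3], since the cosine and sine terms are each shifted by 2.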

Fig. 11.7 Control (control versus time steps)

Fig. 11.8 Disturbance (disturbance versus time steps)

Fig. 11.9 State (state versus time steps; curves: x(1), x(2))

11.5 Conclusions

This chapter proposed a synchronous solution method, based on neural networks, for multi-player ZS games without system dynamics. The PI algorithm is presented to solve the HJB equation with system dynamics. It is proven that the iterative cost function obtained by PI converges to the optimal game value. Based on PI, an off-policy learning method is given to obtain the iterative cost function, controls and disturbances. The weights of the CNN, ANNs and DNNs compose the synchronous weight matrix, which is proven to be UUB by the Lyapunov technique. The simulation study indicates the effectiveness of the proposed synchronous solution method for multi-player ZS games. A future research problem is to apply the proposed approach to a class of systems with interconnection terms.

References

1. Yeung, D., Petrosyan, L.: Cooperative Stochastic Differential Games. Springer, Berlin (2006)
2. Lewis, F., Vrabie, D., Syrmos, V.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
3. Song, R., Lewis, F., Wei, Q.: Off-policy integral reinforcement learning method to solve nonlinear continuous-time multi-player non-zero-sum games. IEEE Trans. Neural Networks Learn. Syst. 28(3), 704–713 (2016)
4. Liu, D., Wei, Q.: Multiperson zero-sum differential games for a class of uncertain nonlinear systems. Int. J. Adap. Control Signal Process. 28(3–5), 205–231 (2014)
5. Mu, C., Sun, C., Song, A., Yu, H.: Iterative GDHP-based approximate optimal tracking control for a class of discrete-time nonlinear systems. Neurocomputing 214(19), 775–784 (2016)
6. Fang, X., Zheng, D., He, H., Ni, Z.: Data-driven heuristic dynamic programming with virtual reality. Neurocomputing 166(20), 244–255 (2015)
7. Feng, T., Zhang, H., Luo, Y., Zhang, J.: Stability analysis of heuristic dynamic programming algorithm for nonlinear systems. Neurocomputing 149(Part C, 3), 1461–1468 (2015)
8. Feng, T., Zhang, H., Luo, Y., Liang, H.: Globally optimal distributed cooperative control for general linear multi-agent systems. Neurocomputing 203(26), 12–21 (2016)
9. Lewis, F., Vrabie, D., Syrmos, V.: Optimal Control. Wiley, New York (2012)
10. Basar, T., Olsder, G.: Dynamic Noncooperative Game Theory. Academic Press, New York (1982)
11. Vamvoudakis, K., Lewis, F.: Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton-Jacobi equations. Automatica 47(8), 1556–1569 (2011)

Chapter 12

Off-Policy Integral Reinforcement Learning Method for Multi-player Non-zero-Sum Games

This chapter establishes an off-policy integral reinforcement learning (IRL) method to solve nonlinear continuous-time non-zero-sum (NZS) games with unknown system dynamics. The IRL algorithm is presented to obtain the iterative control, and off-policy learning is used to allow the dynamics to be completely unknown. Off-policy IRL is designed to perform policy evaluation and policy improvement in the policy iteration (PI) algorithm. Critic and action networks are used to obtain the performance index and control for each player. A gradient descent algorithm updates the critic and action weights simultaneously. The convergence analysis of the weights is given. The asymptotic stability of the closed-loop system and the existence of the Nash equilibrium are proven. A simulation study demonstrates the effectiveness of the developed method for nonlinear continuous-time NZS games with unknown system dynamics.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019
R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_12

12.1 Introduction

Non-zero-sum (NZS) games with N players rely on solving the coupled Hamilton–Jacobi (HJ) equations, which means that each player determines its Nash equilibrium policy from HJ equations coupled through their quadratic terms. In linear NZS games, these reduce to the coupled algebraic Riccati equations [1]. In nonlinear NZS games, the coupled HJ equations are difficult to solve and may not even have analytic solutions. Therefore, many intelligent methods have been proposed to obtain approximate solutions [2–6]. IRL allows the development of a Bellman equation that does not contain the system dynamics [7–10]. It is worth noting that most IRL algorithms are on-policy, i.e., the performance index function is evaluated by using system data generated with the policies being evaluated. This means that on-policy learning methods use "inaccurate" data to learn the performance index function, which increases the accumulated error. In contrast, off-policy IRL can learn the solution of the HJB equation from system data generated by an arbitrary control. Moreover, off-policy IRL can be regarded as a direct learning method for NZS games, which avoids the identification of system dynamics.

In [11], a novel PI approach for finding online adaptive optimal controllers for continuous-time (CT) linear systems with completely unknown system dynamics was presented. That paper gives the idea of off-policy control for nonlinear systems. In [12], an off-policy RL method was introduced to learn the solution of the HJI equation of CT systems with completely unknown dynamics from real system data instead of a mathematical system model, and its convergence was proved. In [13], an off-policy optimal control method was developed for unknown CT systems with unknown disturbances. These studies are the foundation of our work.

In many multi-player NZS games, the system dynamics is not known in advance. This makes it difficult to obtain the Nash equilibrium from the HJ equations for each player. Therefore, effective methods need to be developed to deal with multi-player NZS games with unknown dynamics. This motivates our research interest.

In this chapter, an off-policy IRL method is studied for CT multi-player NZS games with unknown dynamics. First, the PI algorithm is introduced with convergence analysis. Then off-policy IRL is designed to perform policy evaluation and policy improvement without the system dynamics. Neural networks are used to implement the critic and action networks. The update methods for the neural network weights are given. It is proven that the weight errors of the neural networks are uniformly ultimately bounded (UUB). It is also proven that the closed-loop system is asymptotically stable and that the Nash equilibrium can be obtained. The main contributions of this chapter are summarized as follows.
(1) For nonlinear CT NZS games, an off-policy Bellman equation is developed for policy updating without system dynamics.
(2) It is proven that the weight errors of the neural networks are UUB.
(3) The asymptotic stability of the closed-loop system is proven.

The rest of the chapter is organized as follows. In Sect. 12.2, the problem motivations and preliminaries are presented. In Sect. 12.3, the multi-player learning PI solution for NZS games is introduced. In Sect. 12.4, the off-policy IRL optimal control method is considered. In Sect. 12.5, examples are given to demonstrate the effectiveness of the proposed optimal control scheme. In Sect. 12.6, the conclusion is drawn.

12.2 Problem Statement

Consider the following nonlinear system
$$\dot{x}(t) = f(x(t)) + \sum_{j=1}^{N} g(x(t))\,u_j(t), \tag{12.1}$$
where the state $x(t) \in \mathbb{R}^n$ and the controls $u_j(t) \in \mathbb{R}^m$. This system has N inputs or players, and they influence each other through their joint effects on the overall system state dynamics. The system functions $f(x)$ and $g(x)$ are unknown. Let $\Omega$, containing the origin, be a closed subset of $\mathbb{R}^n$ to which all motions of (12.1) are restricted. Let $f + \sum_{j=1}^{N} g u_j$ be Lipschitz continuous on $\Omega$, and assume that system (12.1) is stabilizable in the sense that there exists an admissible control on $\Omega$ that asymptotically stabilizes the system. Define $u_{-i}$ as the supplementary set of player i: $u_{-i} = \{u_j,\ j \in \{1, 2, \ldots, i-1, i+1, \ldots, N\}\}$. Define the performance index function of player i as
$$J_i(x(0), u_i, u_{-i}) = \int_0^{\infty} r_i(x(t), u_i, u_{-i})\,dt, \tag{12.2}$$
where the utility function is $r_i(x, u_i, u_{-i}) = Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j$, in which the function $Q_i(x) > 0$ and $R_{ij} > 0$ are symmetric matrices. Define the multiplayer differential game
$$V_i^*(x(t), u_i, u_{-i}) = \min_{u_i} \int_t^{\infty} \Bigl( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Bigr) d\tau. \tag{12.3}$$
This game implies that all the players have the same competitive hierarchical level and seek to attain a Nash equilibrium as given by the following definition.

Definition 12.1 Nash equilibrium [14, 15]: Policies $\{u_i^*, u_{-i}^*\} = \{u_1^*, u_2^*, \ldots, u_i^*, \ldots, u_N^*\}$ are said to constitute a Nash equilibrium solution for the N-player game if
$$J_i^* = J_i(u_i^*, u_{-i}^*) \le J_i(u_i, u_{-i}^*), \quad i \in N; \tag{12.4}$$
hence the N-tuple $\{J_1^*, J_2^*, \ldots, J_N^*\}$ is known as a Nash equilibrium value set or outcome of the N-player game.

12.3 Multi-player Learning PI Solution for NZS Games

In this section, the PI solution for NZS games is introduced with convergence analysis. From Definition 12.1, it can be seen that if any player unilaterally changes his control policy while the policies of all other players remain the same, then that player will obtain worse performance. For fixed stabilizing feedback control policies $u_i$ and $u_{-i}$, define the value function
$$V_i(x(t)) = \int_t^{\infty} r_i(x, u_i, u_{-i})\,d\tau = \int_t^{\infty} \Bigl( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Bigr) d\tau. \tag{12.5}$$


Using the Leibniz formula, the Hamiltonian function is given by the Bellman equation
$$H_i(x, \nabla V_i, u_i, u_{-i}) = Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j + \nabla V_i^T \Bigl( f(x) + \sum_{j=1}^{N} g(x) u_j \Bigr) = 0. \tag{12.6}$$
Then the following policies can be obtained from $\partial H_i / \partial u_i = 0$:
$$u_i(x) = -\frac{1}{2} R_{ii}^{-1} g^T(x) \nabla V_i. \tag{12.7}$$
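For completeness, the stationarity step behind (12.7) can be written out. The following short derivation uses only the Hamiltonian (12.6) and the symmetry of $R_{ii}$ stated in Sect. 12.2; only the terms of (12.6) that depend on $u_i$ contribute.

```latex
\begin{aligned}
\frac{\partial H_i}{\partial u_i}
  &= \frac{\partial}{\partial u_i}\Bigl(u_i^{T} R_{ii} u_i\Bigr)
   + \frac{\partial}{\partial u_i}\Bigl(\nabla V_i^{T} g(x)\, u_i\Bigr) \\
  &= 2 R_{ii} u_i + g^{T}(x)\nabla V_i = 0
  \quad\Longrightarrow\quad
  u_i = -\tfrac{1}{2} R_{ii}^{-1} g^{T}(x)\nabla V_i .
\end{aligned}
```

Since $\partial^2 H_i / \partial u_i^2 = 2R_{ii} > 0$, this stationary point is a minimizer of $H_i$ with respect to $u_i$.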

Therefore, one obtains the N-coupled Hamilton–Jacobi (HJ) equations
$$0 = \nabla V_i^T \Bigl( f(x) - \frac{1}{2} \sum_{j=1}^{N} g(x) R_{jj}^{-1} g^T(x) \nabla V_j \Bigr) + Q_i(x) + \frac{1}{4} \sum_{j=1}^{N} \nabla V_j^T g(x) R_{jj}^{-1} R_{ij} R_{jj}^{-1} g^T(x) \nabla V_j. \tag{12.8}$$
Define the best response HJ equation as the Bellman equation (12.6) with control $u_i^*$ given by (12.7) and arbitrary policies $u_{-i}$ as
$$H_i(x, \nabla V_i, u_i^*, u_{-i}) = \nabla V_i^T f(x) + Q_i + \nabla V_i^T \sum_{j=1,\, j \ne i}^{N} g(x) u_j - \frac{1}{4} \nabla V_i^T g(x) R_{ii}^{-1} g^T(x) \nabla V_i + \sum_{j=1,\, j \ne i}^{N} u_j^T R_{ij} u_j. \tag{12.9}$$

In [1], the following policy iteration for N-player games has been proposed to solve (12.8). Here [0] and [k] in the superscript denote the iteration index.

Algorithm 5 Policy Iteration
1: Start with stabilizing initial policies $u_1^{[0]}, u_2^{[0]}, \ldots, u_N^{[0]}$.
2: Given the N-tuple of policies $u_1^{[k]}, u_2^{[k]}, \ldots, u_N^{[k]}$, solve for the N-tuple of costs $V_1^{[k]}, V_2^{[k]}, \ldots, V_N^{[k]}$ using
$$H_i(x, \nabla V_i^{[k]}, u_i^{[k]}, u_{-i}^{[k]}) = \nabla V_i^{[k]T} \Bigl( f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} \Bigr) + r_i(x, u_1^{[k]}, u_2^{[k]}, \ldots, u_N^{[k]}) = 0 \tag{12.10}$$
with $V_i^{[k]}(0) = 0$.
3: Update the N-tuple of control policies using
$$u_i^{[k+1]} = \arg\min_{u_i} \bigl[ H_i(x, \nabla V_i, u_i, u_{-i}) \bigr], \tag{12.11}$$
which explicitly is
$$u_i^{[k+1]} = -\frac{1}{2} R_{ii}^{-1} g^T(x) \nabla V_i^{[k]}. \tag{12.12}$$

The following two theorems establish the convergence of Algorithm 5.

Theorem 12.1 If $\Omega$ is bounded, the system functions $f(x)$ and $g(x)$ are known, the iterative control $u_i^{[k]}$ of player i is obtained by the PI algorithm in (12.10)–(12.12), and the other players $u_{-i}$ do not update their control policies, then the iterative cost function is convergent, and the values converge to the solution of the best response HJ equation (12.9).

Proof Let
$$H_i^o(x, \nabla V_i^{[k]}, u_{-i}) = \min_{u_i^{[k]}} H_i(x, \nabla V_i^{[k]}, u_i^{[k]}, u_{-i}) = H_i(x, \nabla V_i^{[k]}, u_i^{[k+1]}, u_{-i}). \tag{12.13}$$

Since only player i updates its control, let
$$H_i(x, \nabla V_i^{[k]}, u_i^{[k]}, u_{-i}) = 0. \tag{12.14}$$
Therefore, from (12.13) and (12.14), one has
$$H_i^o(x, \nabla V_i^{[k]}, u_{-i}) \le 0. \tag{12.15}$$
Under $u_i^{[k+1]}$ and the current policies $u_{-i}$, the orbital derivative becomes
$$\dot{V}_i^{[k]} = H_i(x, \nabla V_i^{[k]}, u_i^{[k+1]}, u_{-i}) - r_i(x, u_i^{[k+1]}, u_{-i}). \tag{12.16}$$
According to (12.13), Eq. (12.16) is
$$\dot{V}_i^{[k]} = H_i^o(x, \nabla V_i^{[k]}, u_{-i}) - r_i(x, u_i^{[k+1]}, u_{-i}). \tag{12.17}$$
From (12.15), one has
$$\dot{V}_i^{[k]} \le -r_i(x, u_i^{[k+1]}, u_{-i}). \tag{12.18}$$
On the other hand, as only player i updates its control,
$$H_i(x, \nabla V_i^{[k+1]}, u_i^{[k+1]}, u_{-i}) = 0 \tag{12.19}$$
and from (12.10)
$$\dot{V}_i^{[k+1]} = H_i(x, \nabla V_i^{[k+1]}, u_i^{[k+1]}, u_{-i}) - r_i(x, u_i^{[k+1]}, u_{-i}) = -r_i(x, u_i^{[k+1]}, u_{-i}). \tag{12.20}$$
According to (12.18) and (12.20),
$$\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}. \tag{12.21}$$
It follows that [15]
$$V_i^{[k]} \ge V_i^{[k+1]}, \tag{12.22}$$
which means that $V_i^{[k]}$ is a monotonically nonincreasing sequence. Define $\lim_{k \to \infty} V_i^{[k]} = V_i^{\infty}$; according to [16, 17], $V_i^{\infty} = V_i^*$. That is, the algorithm converges to the solution $V_i^*$ of the best response HJ equation.

Theorem 12.2 If the system functions $f(x)$ and $g(x)$ are known, all the iterative controls $u_i^{[k]}$ of players i are obtained by the PI algorithm in (12.10)–(12.12), and $\Omega$ is bounded, then the iterative values $V_i^{[k]}$ converge to the optimal game values $V_i^*$ as $k \to \infty$.

Proof As
$$\dot{V}_i^{[k+1]} = -Q_i(x) - \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]}, \tag{12.23}$$
and
$$\begin{aligned}
\dot{V}_i^{[k]} &= -Q_i(x) - \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} + \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} \\
&= \dot{V}_i^{[k+1]} - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} + \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} \\
&= \dot{V}_i^{[k+1]} - \sum_{j=1}^{N} (u_j^{[k+1]} - u_j^{[k]})^T R_{ij} (u_j^{[k+1]} - u_j^{[k]}) + 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k+1]} - 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} u_j^{[k]} \\
&= \dot{V}_i^{[k+1]} - \sum_{j=1}^{N} (u_j^{[k+1]} - u_j^{[k]})^T R_{ij} (u_j^{[k+1]} - u_j^{[k]}) + 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} (u_j^{[k+1]} - u_j^{[k]}).
\end{aligned} \tag{12.24}$$
Let
$$\Delta u_j^{[k]} = u_j^{[k+1]} - u_j^{[k]}. \tag{12.25}$$
Then a sufficient condition for $\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}$ is
$$\sum_{j=1}^{N} \Delta u_j^{[k]T} R_{ij} \Delta u_j^{[k]} - 2 \sum_{j=1}^{N} u_j^{[k+1]T} R_{ij} \Delta u_j^{[k]} \ge 0. \tag{12.26}$$
According to (12.12), (12.26) means
$$\Delta u_j^{[k]T} R_{ij} \Delta u_j^{[k]} + \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \Delta u_j^{[k]} \ge 0. \tag{12.27}$$
Hence, if $\nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \Delta u_j^{[k]} \ge 0$, then $\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}$ is true. If $\nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \Delta u_j^{[k]} < 0$, then a sufficient condition for $\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}$ becomes
$$\| \Delta u_j^{[k]T} R_{ij} \Delta u_j^{[k]} \| \ge \| \nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \Delta u_j^{[k]} \|, \tag{12.28}$$
i.e.,
$$\delta_L(R_{ij}) \| \Delta u_j^{[k]} \| \ge \| \nabla V_j^{[k]} \| \, \| g(x) \| \, \delta_H(R_{jj}^{-T} R_{ij}), \tag{12.29}$$
where $\delta_L(\cdot)$ is the operator that takes the minimum singular value and $\delta_H(\cdot)$ is the operator that takes the maximum singular value. Specifically, (12.29) holds if $\delta_H(R_{jj}^{-T} R_{ij}) = 0$. By integration of $\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}$, it follows that
$$V_i^{[k]} \ge V_i^{[k+1]}, \tag{12.30}$$
which shows that $V_i^{[k]}$ is a nonincreasing sequence. Define $\lim_{k \to \infty} V_i^{[k]} = V_i^{\infty}$; one has
$$V_i^{\infty} \le V_i^{[k+1]} \le V_i^{[k]}. \tag{12.31}$$
According to [17], $V_i^{\infty} = V_i^*$. Thus the algorithm converges to $V_i^*$.

Remark 12.1 In the above proof, if $\nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \Delta u_j^{[k]} \ge 0$, then $\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}$. If $\nabla V_j^{[k]T} g(x) R_{jj}^{-T} R_{ij} \Delta u_j^{[k]} < 0$ and (12.29) holds, then $\dot{V}_i^{[k]} \le \dot{V}_i^{[k+1]}$.


Theorems 12.1–12.2 prove that the PI algorithm (Algorithm 5) is convergent. Note that, in system (12.1), $f(x)$ and $g(x)$ are unknown; therefore Algorithm 5 in (12.10)–(12.12) cannot be used directly to obtain the Nash equilibrium for the unknown system (12.1). In this chapter, an off-policy IRL method based on Algorithm 5 is established for multi-player NZS games.
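To make the two steps of Algorithm 5 concrete, the following is a minimal sketch for a scalar two-player linear game with known dynamics, where $V_i = p_i x^2$ and $u_j = -k_j x$. The numerical values (a = 2, b = 3, the weights, and the initial gains) are illustrative assumptions, and the function names are hypothetical; this is not the chapter's implementation. In this scalar case (12.10) reduces to $2 p_i (a - b \sum_j k_j) + Q_i + \sum_j R_{ij} k_j^2 = 0$ and (12.12) reduces to $k_i = b p_i / R_{ii}$.

```python
import numpy as np

# Scalar two-player game x' = a*x + b*(u1 + u2); values are illustrative.
a, b = 2.0, 3.0
q = np.array([9.0, 3.0])            # state weights Q_i
R = np.array([[0.64, 1.0],          # R[i, j] weights u_j in player i's cost
              [1.0, 1.0]])


def policy_iteration(k0, iters=60):
    """Algorithm 5 for V_i = p_i x^2 and u_j = -k_j x (model-based)."""
    k = np.array(k0, dtype=float)
    p = np.zeros(2)
    for _ in range(iters):
        a_cl = a - b * k.sum()       # closed-loop pole under current gains
        if a_cl >= 0:
            raise ValueError("policies are not stabilizing")
        # Policy evaluation (12.10): 2 p_i a_cl + q_i + sum_j R_ij k_j^2 = 0
        p = -(q + R @ k**2) / (2.0 * a_cl)
        # Policy improvement (12.12): u_i = -(1/2) R_ii^{-1} g^T dV_i/dx
        k = b * p / np.diag(R)
    return p, k


def coupled_hj_residual(p):
    """Residual of the scalar form of (12.8) for value coefficients p."""
    k = b * p / np.diag(R)
    return 2.0 * p * (a - b * k.sum()) + q + R @ k**2


p, k = policy_iteration([1.0, 1.0])
print("p =", p, "k =", k, "residual =", coupled_hj_residual(p))
```

The sketch needs a and b in both steps, which is exactly the dependence on the model that the off-policy method removes.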

12.4 Off-Policy Integral Reinforcement Learning Method

From Algorithm 5, it can be seen that the PI algorithm depends on the system dynamics, which are not known in this chapter. Therefore, an off-policy IRL algorithm is established to solve the NZS games. In this section, the off-policy IRL is first presented. Then the method for solving the off-policy Bellman equation is developed. Finally, the theoretical analyses and the implementation method are given.

12.4.1 Derivation of Off-Policy Algorithm

Let $u_i^{[k]}$ be obtained by (12.12); then the original system (12.1) can be rewritten as
$$\dot{x} = f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} + \sum_{j=1}^{N} g(x) (u_j - u_j^{[k]}). \tag{12.32}$$
Then
$$\begin{aligned}
V_i^{[k]}(x(t+T)) - V_i^{[k]}(x(t)) &= \int_t^{t+T} \nabla V_i^{[k]T} \dot{x}\, d\tau \\
&= \int_t^{t+T} \nabla V_i^{[k]T} \Bigl( f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} \Bigr) d\tau + \int_t^{t+T} \nabla V_i^{[k]T} \sum_{j=1}^{N} g(x) (u_j - u_j^{[k]})\, d\tau.
\end{aligned} \tag{12.33}$$
From (12.10), one has
$$\nabla V_i^{[k]T} \Bigl( f(x) + \sum_{j=1}^{N} g(x) u_j^{[k]} \Bigr) = -Q_i(x) - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]}. \tag{12.34}$$


Then (12.33) is given by
$$V_i^{[k]}(x(t+T)) - V_i^{[k]}(x(t)) = -\int_t^{t+T} Q_i(x)\, d\tau - \int_t^{t+T} \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]}\, d\tau + \int_t^{t+T} \nabla V_i^{[k]T} g(x) \sum_{j=1}^{N} (u_j - u_j^{[k]})\, d\tau. \tag{12.35}$$
From (12.12), one has
$$g^T(x) \nabla V_i^{[k]} = -2 R_{ii} u_i^{[k+1]}. \tag{12.36}$$
Then the off-policy Bellman equation is obtained from (12.35):
$$V_i^{[k]}(x(t+T)) - V_i^{[k]}(x(t)) = -\int_t^{t+T} Q_i(x)\, d\tau - \int_t^{t+T} \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]}\, d\tau - 2 \int_t^{t+T} u_i^{[k+1]T} R_{ii} \sum_{j=1}^{N} (u_j - u_j^{[k]})\, d\tau. \tag{12.37}$$

Remark 12.2 Notice that in (12.37), the term $\nabla V_i^{[k]T} g(x)$, which depends on the unknown function $g(x)$, is replaced by $u_i^{[k+1]T} R_{ii}$, which can be obtained by measuring the state online. Therefore, (12.37) plays an important role in separating the system dynamics from the iterative process. It is referred to as the off-policy Bellman equation. By the off-policy Bellman equation (12.37), one can obtain the optimal control of the N-player nonlinear differential game without requiring the system dynamics.

The off-policy IRL method is summarized as follows.

Algorithm 6 Off-Policy IRL
1: Let the iterative index k = 0, select admissible controls $u_1^{[0]}, u_2^{[0]}, \ldots, u_N^{[0]}$.
2: Let the iterative index k > 0, and solve for $V_i^{[k]}$ and $u_i^{[k+1]}$ from the off-policy Bellman equation (12.37).
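As an illustration of step 2 of Algorithm 6, the sketch below solves (12.37) by least squares for a scalar two-player linear game with $V_i = p_i x^2$ and $u_i^{[k+1]} = -k_i x$. It is a simplified stand-in, assuming a quadratic value function instead of neural networks; the plant constants, the probing signal `probe`, the window length, and all function names are hypothetical. The plant parameters a and b are used only inside the data generator; the learning routine sees only the recorded integrals, which are reused at every iteration.

```python
import numpy as np

# "True" plant x' = a*x + b*(u1 + u2): used ONLY to generate data.
a, b = 2.0, 3.0
q = np.array([9.0, 3.0])
R = np.array([[0.64, 1.0], [1.0, 1.0]])
k_beh = np.array([1.0, 1.0])        # stabilizing behavior gains (assumed)


def probe(t):
    """Exploration signal added to u1 + u2 (hypothetical choice)."""
    return 0.5 * (np.sin(7.0 * t) + np.sin(3.3 * t) + np.sin(1.1 * t))


def rhs(t, z):
    """Plant state x plus running integrals of x^2 and x*(u1 + u2)."""
    x = z[0]
    u_sum = -k_beh.sum() * x + probe(t)
    return np.array([a * x + b * u_sum, x * x, x * u_sum])


def collect_data(x0=0.5, dt=1e-3, T=0.05, n_windows=200):
    """RK4 rollout; returns per-window d(x^2), int x^2, int x*(u1+u2)."""
    steps = int(round(T / dt))
    z = np.array([x0, 0.0, 0.0])
    marks = [z.copy()]
    t = 0.0
    for n in range(n_windows * steps):
        k1 = rhs(t, z)
        k2 = rhs(t + dt / 2, z + dt / 2 * k1)
        k3 = rhs(t + dt / 2, z + dt / 2 * k2)
        k4 = rhs(t + dt, z + dt * k3)
        z = z + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = (n + 1) * dt
        if (n + 1) % steps == 0:
            marks.append(z.copy())
    m = np.array(marks)
    return np.diff(m[:, 0] ** 2), np.diff(m[:, 1]), np.diff(m[:, 2])


def off_policy_irl(dxx, ixx, ixu, k0, iters=40):
    """Solve (12.37) for (p_i, k_i^{[k+1]}) from data only, then iterate."""
    k = np.array(k0, dtype=float)
    p = np.zeros(2)
    for _ in range(iters):
        # int x * sum_j (u_j - u_j^[k]) dtau with u_j^[k] = -k_j x
        ixs = ixu + k.sum() * ixx
        cost = q + R @ k**2
        k_next = np.empty(2)
        for i in range(2):
            # p_i * d(x^2) - 2 R_ii k_i_next * ixs = -cost_i * int x^2
            A = np.column_stack([dxx, -2.0 * R[i, i] * ixs])
            target = -cost[i] * ixx
            p[i], k_next[i] = np.linalg.lstsq(A, target, rcond=None)[0]
        k = k_next
    return p, k


dxx, ixx, ixu = collect_data()
p, k = off_policy_irl(dxx, ixx, ixu, k_beh)
print("p =", p, "k =", k)
```

Because the recorded data come from the behavior policy rather than from the policy being evaluated, the same three arrays serve every iteration; only the term `ixs` is recomputed with the current gains.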

The next part will give the implementation method for Algorithm 6.


12.4.2 Implementation Method for Off-Policy Algorithm

In this part, the method for solving the off-policy Bellman equation (12.37) is given. Critic and action networks are used to approximate $V_i^{[k]}(x(t))$ and $u_i^{[k]}$ for each player, respectively. The neural network expression of the critic network is given as
$$V_i^{[k]}(x) = M_i^{[k]T} \phi_i(x) + \varepsilon_i^{[k]}(x), \tag{12.38}$$
where $M_i^{[k]}$ is the ideal weight of the critic network, $\phi_i(x)$ is the activation function, and $\varepsilon_i^{[k]}(x)$ is the residual error. Let the estimate of $M_i^{[k]}$ be $\hat{M}_i^{[k]}$; then the estimate of $V_i^{[k]}(x)$ is
$$\hat{V}_i^{[k]}(x) = \hat{M}_i^{[k]T} \phi_i(x). \tag{12.39}$$
Accordingly,
$$\nabla \hat{V}_i^{[k]}(x) = \nabla \phi_i^T(x) \hat{M}_i^{[k]}. \tag{12.40}$$
Define the neural network expression of the action network as
$$u_i^{[k]}(x) = N_i^{[k]T} \varphi_i(x) + \delta_i^{[k]}(x), \tag{12.41}$$
where $N_i^{[k]}$ is the ideal weight of the action network, $\varphi_i(x)$ is the activation function, and $\delta_i^{[k]}(x)$ is the residual error. Let $\hat{N}_i^{[k]}$ be the estimate of $N_i^{[k]}$; then the estimate of $u_i^{[k]}(x)$ is
$$\hat{u}_i^{[k]}(x) = \hat{N}_i^{[k]T} \varphi_i(x). \tag{12.42}$$
Therefore, from (12.37) the equation error is defined as
$$e_i^{[k]} = \hat{V}_i^{[k]}(x(t)) - \hat{V}_i^{[k]}(x(t+T)) - \int_t^{t+T} Q_i(x)\, d\tau - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]}\, d\tau - 2 \int_t^{t+T} \hat{u}_i^{[k+1]T} R_{ii} \sum_{j=1}^{N} (u_j - \hat{u}_j^{[k]})\, d\tau. \tag{12.43}$$
Substituting (12.39) and (12.42) into (12.43), one gets
$$e_i^{[k]} = \hat{M}_i^{[k]T} \phi_i(x(t)) - \hat{M}_i^{[k]T} \phi_i(x(t+T)) - \int_t^{t+T} Q_i(x)\, d\tau - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]}\, d\tau - 2 \int_t^{t+T} \varphi_i^T(x(\tau)) \hat{N}_i^{[k+1]} R_{ii} \sum_{j=1}^{N} (u_j - \hat{u}_j^{[k]})\, d\tau. \tag{12.44}$$


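As
$$\varphi_i^T(x) \hat{N}_i^{[k+1]} R_{ii} \sum_{j=1}^{N} (u_j - \hat{u}_j^{[k]}) = \Biggl( \Biggl( \Bigl( \sum_{j=1}^{N} (u_j - \hat{u}_j^{[k]}) \Bigr)^T R_{ii} \Biggr) \otimes \varphi_i^T \Biggr) \mathrm{vec}(\hat{N}_i^{[k+1]}). \tag{12.45}$$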
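Identity (12.45) is an instance of the standard relation $a^T N b = (b^T \otimes a^T)\,\mathrm{vec}(N)$, where $\otimes$ is the Kronecker product and $\mathrm{vec}(\cdot)$ stacks columns. A quick numerical check can be sketched as follows; the dimensions and random values are arbitrary assumptions, and `s` stands in for $\sum_j (u_j - \hat{u}_j^{[k]})$.

```python
import numpy as np

rng = np.random.default_rng(0)
p_dim, m_dim = 4, 2                        # hypothetical sizes
phi = rng.standard_normal((p_dim, 1))      # action activation vector
N_hat = rng.standard_normal((p_dim, m_dim))
R_ii = np.array([[2.0, 0.3], [0.3, 1.0]])  # symmetric weight
s = rng.standard_normal((m_dim, 1))        # stands for sum_j (u_j - u_hat_j)

# Left-hand side of (12.45): a scalar.
lhs = (phi.T @ N_hat @ R_ii @ s).item()

# Right-hand side: ((s^T R_ii) kron phi^T) vec(N_hat), column-stacking vec.
vec_N = N_hat.reshape(-1, 1, order="F")
rhs = (np.kron(s.T @ R_ii, phi.T) @ vec_N).item()

print(lhs, rhs)
```

The column-major (`order="F"`) reshape matters: with a row-major flattening the Kronecker factors would have to be swapped.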

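Here the symbol $\otimes$ stands for the Kronecker product. Then (12.44) becomes
$$e_i^{[k]} = -\bigl( (\phi_i(x(t+T)) - \phi_i(x(t)))^T \otimes I \bigr) \hat{M}_i^{[k]} - \int_t^{t+T} Q_i(x)\, d\tau - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]}\, d\tau - 2 \int_t^{t+T} \Biggl( \Biggl( \Bigl( \sum_{j=1}^{N} (u_j - \hat{u}_j^{[k]}) \Bigr)^T R_{ii} \Biggr) \otimes \varphi_i^T \Biggr) d\tau\; \mathrm{vec}(\hat{N}_i^{[k+1]}). \tag{12.46}$$
Define
$$I_{xx,i} = (\phi_i(x(t+T)) - \phi_i(x(t)))^T \otimes I, \qquad I_{xu,i}^{[k]} = -\int_t^{t+T} Q_i(x)\, d\tau - \int_t^{t+T} \sum_{j=1}^{N} \hat{u}_j^{[k]T} R_{ij} \hat{u}_j^{[k]}\, d\tau,$$
and
$$I_{uu,i}^{[k]} = 2 \int_t^{t+T} \Biggl( \Biggl( \Bigl( \sum_{j=1}^{N} (u_j - \hat{u}_j^{[k]}) \Bigr)^T R_{ii} \Biggr) \otimes \varphi_i^T \Biggr) d\tau.$$
Therefore (12.46) is written as
$$e_i^{[k]} = -I_{xx,i} \hat{M}_i^{[k]} + I_{xu,i}^{[k]} - I_{uu,i}^{[k]} \mathrm{vec}(\hat{N}_i^{[k+1]}). \tag{12.47}$$
Let $I_{w,i}^{[k]} = \bigl[ -I_{xx,i} \;\; -I_{uu,i}^{[k]} \bigr]$ and
$$\hat{W}_i^{[k]} = \begin{bmatrix} \hat{M}_i^{[k]} \\ \mathrm{vec}(\hat{N}_i^{[k+1]}) \end{bmatrix};$$
then (12.47) is given by
$$e_i^{[k]} = I_{w,i}^{[k]} \hat{W}_i^{[k]} + I_{xu,i}^{[k]}. \tag{12.48}$$
To obtain the update rule for the weights of the critic and action networks, the optimization objective is defined as
$$E_i^{[k]} = \frac{1}{2} e_i^{[k]T} e_i^{[k]}. \tag{12.49}$$
Thus, using the gradient descent algorithm, one gets
$$\dot{\hat{W}}_i^{[k]} = -\eta_i^{[k]} I_{w,i}^{[k]T} \bigl( I_{w,i}^{[k]} \hat{W}_i^{[k]} + I_{xu,i}^{[k]} \bigr), \tag{12.50}$$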


where $\eta_i^{[k]}$ is a positive number.

Remark 12.3 According to the gradient descent algorithm, the optimal weight $W_i^{[k]}$, which minimizes (12.49), can be obtained by (12.50). Therefore, the weights of the critic and action networks are solved simultaneously. This means that only one equation is necessary, instead of (12.10) and (12.11) in Algorithm 5, to obtain the optimal solution for the multi-player NZS games.
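The behavior described in Remark 12.3 can be sketched numerically. The example below integrates the gradient flow (12.50) with forward-Euler steps and compares the result with the direct least-squares minimizer of (12.49). The regressor matrix and target vector are random placeholders, not data from the chapter; stacking several data windows as rows of `I_w` and the step-size rule are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
I_w = rng.standard_normal((50, 5))     # hypothetical stacked regressors
I_xu = rng.standard_normal(50)         # hypothetical stacked targets


def train(I_w, I_xu, steps=3000):
    """Forward-Euler integration of the weight update (12.50)."""
    W = np.zeros(I_w.shape[1])
    # eta * dt, chosen below 2 / lambda_max(I_w^T I_w) for stability
    step = 1.0 / np.linalg.norm(I_w, 2) ** 2
    for _ in range(steps):
        e = I_w @ W + I_xu             # equation error, cf. (12.48)
        W = W - step * (I_w.T @ e)     # descent direction of (12.49)
    return W


W_hat = train(I_w, I_xu)
W_ls = np.linalg.lstsq(I_w, -I_xu, rcond=None)[0]
print(np.max(np.abs(W_hat - W_ls)))
```

In this setting the iteration and the least-squares solve reach the same weights; the gradient form is the one suited to online updating as data windows arrive.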

12.4.3 Stability Analysis

In Theorems 12.1–12.2, the convergence of the iterative cost function is proven. In the following theorems, one first proves that, based on the off-policy IRL, the weight estimates of the critic and action networks converge to the ideal ones within a small bound. Then one proves the asymptotic stability of the closed-loop system and the existence of the Nash equilibrium.

Theorem 12.3 Let the update method for the critic and action networks be as in (12.50). Define the weight estimation error as $\tilde{W}_i^{[k]} = W_i^{[k]} - \hat{W}_i^{[k]}$, $\forall k, i$. Then $\tilde{W}_i^{[k]}$ is UUB.

Proof Select the Lyapunov function candidate as follows:
$$L_i^{[k]} = \frac{l}{2 \eta_i^{[k]}} \tilde{W}_i^{[k]T} \tilde{W}_i^{[k]}, \tag{12.51}$$
where $l > 0$. According to (12.50), one has
$$\begin{aligned}
\dot{\tilde{W}}_i^{[k]} &= \eta_i^{[k]} I_{w,i}^{[k]T} \bigl( I_{w,i}^{[k]} (W_i^{[k]} - \tilde{W}_i^{[k]}) + I_{xu,i}^{[k]} \bigr) \\
&= -\eta_i^{[k]} I_{w,i}^{[k]T} I_{w,i}^{[k]} \tilde{W}_i^{[k]} + \eta_i^{[k]} I_{w,i}^{[k]T} \bigl( I_{w,i}^{[k]} W_i^{[k]} + I_{xu,i}^{[k]} \bigr).
\end{aligned} \tag{12.52}$$
Thus, the time derivative of (12.51) is
$$\begin{aligned}
\dot{L}_i^{[k]} &= -l \tilde{W}_i^{[k]T} I_{w,i}^{[k]T} I_{w,i}^{[k]} \tilde{W}_i^{[k]} + l \tilde{W}_i^{[k]T} I_{w,i}^{[k]T} \bigl( I_{w,i}^{[k]} W_i^{[k]} + I_{xu,i}^{[k]} \bigr) \\
&\le -l \|\tilde{W}_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + l \tilde{W}_i^{[k]T} I_{w,i}^{[k]T} I_{w,i}^{[k]} W_i^{[k]} + l \tilde{W}_i^{[k]T} I_{w,i}^{[k]T} I_{xu,i}^{[k]} \\
&\le -l \|\tilde{W}_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{1}{2} \|\tilde{W}_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{l^2}{2} \|W_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{1}{2} \|\tilde{W}_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{l^2}{2} \|I_{xu,i}^{[k]}\|^2.
\end{aligned} \tag{12.53}$$
That is,
$$\dot{L}_i^{[k]} \le -(l-1) \|\tilde{W}_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{l^2}{2} \|W_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{l^2}{2} \|I_{xu,i}^{[k]}\|^2. \tag{12.54}$$
Let $C_i^{[k]} = \frac{l^2}{2} \|W_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + \frac{l^2}{2} \|I_{xu,i}^{[k]}\|^2$; then (12.54) is
$$\dot{L}_i^{[k]} \le -(l-1) \|\tilde{W}_i^{[k]}\|^2 \|I_{w,i}^{[k]}\|^2 + C_i^{[k]}. \tag{12.55}$$
Therefore, if $l$ satisfies
$$l > 1, \tag{12.56}$$
and
$$\|\tilde{W}_i^{[k]}\|^2 > \frac{C_i^{[k]}}{(l-1) \|I_{w,i}^{[k]}\|^2}, \tag{12.57}$$
then $\dot{L}_i^{[k]} < 0$, and hence $\tilde{W}_i^{[k]}$ is UUB.

Based on Theorem 12.3, one can assume that $\hat{V}_i^{[k]}(x) \to V_i^{[k]}(x)$ and $\hat{u}_i^{[k]}(x) \to u_i^{[k]}(x)$. The following theorem then proves the asymptotic stability of the closed-loop system.

Theorem 12.4 Let the control be given as in (12.12); then the closed-loop system is asymptotically stable.

Proof Define the Lyapunov function candidate as
$$V_i^{[k]}(x(t)) = \int_t^{\infty} \Bigl( Q_i^{[k]}(x) + \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} \Bigr) d\tau, \quad \forall k. \tag{12.58}$$
Take the time derivative to obtain
$$\dot{V}_i^{[k]}(x(t)) = -Q_i^{[k]}(x) - \sum_{j=1}^{N} u_j^{[k]T} R_{ij} u_j^{[k]} < 0. \tag{12.59}$$
Therefore, $V_i^{[k]}(x(t))$, $\forall k$, is a Lyapunov function. The closed-loop system is asymptotically stable.

From Theorem 12.4, it is clear that, $\forall k$, the iterative control $\{u_i^{[k]},\ i = 1, 2, \ldots, N\}$ makes the closed-loop system asymptotically stable. Now, we are ready to give the following theorem, which demonstrates that $\{u_i^*,\ i = 1, 2, \ldots, N\}$ is in Nash equilibrium.

Theorem 12.5 For a fixed stabilizing control $\{u_i,\ i = 1, 2, \ldots, N\}$, define the Hamiltonian function as in (12.6). Let $\{u_i^*,\ i = 1, 2, \ldots, N\}$ be defined as in (12.7); then $\{u_i^*,\ i = 1, 2, \ldots, N\}$ is in Nash equilibrium.


Proof As $V_i(x(\infty)) = 0$, from (12.2) one has
$$J_i(x(0), u_i, u_{-i}) = \int_0^{\infty} \Bigl( Q_i(x) + \sum_{j=1}^{N} u_j^T R_{ij} u_j \Bigr) dt + \int_0^{\infty} \dot{V}_i\, dt + V_i(x(0)). \tag{12.60}$$
According to (12.6), (12.60) can be written as
$$J_i(x(0), u_i, u_{-i}) = \int_0^{\infty} H_i(x, \nabla V_i, u_i, u_{-i})\, dt + V_i(x(0)). \tag{12.61}$$
As $u_i^* = u_i(V_i(x))$, which is given by (12.7), then
$$H_i(x, \nabla V_i, u_i^*, u_{-i}^*) = Q_i(x) + \sum_{j=1}^{N} u_j^{*T} R_{ij} u_j^* + \nabla V_i^T \Bigl( f(x) + \sum_{j=1}^{N} g(x) u_j^* \Bigr) = 0. \tag{12.62}$$
According to (12.6), one has
$$H_i(x, \nabla V_i, u_i, u_{-i}) = H_i(x, \nabla V_i, u_i^*, u_{-i}^*) + \sum_{j=1}^{N} (u_j - u_j^*)^T R_{ij} (u_j - u_j^*) + \nabla V_i^T \sum_{j=1}^{N} g(x) (u_j - u_j^*) + 2 \sum_{j=1}^{N} u_j^{*T} R_{ij} (u_j - u_j^*). \tag{12.63}$$
Therefore, (12.61) is expressed as
$$J_i(x(0), u_i, u_{-i}) = V_i(x(0)) + \int_0^{\infty} \sum_{j=1}^{N} (u_j - u_j^*)^T R_{ij} (u_j - u_j^*)\, dt + \int_0^{\infty} \nabla V_i^T \sum_{j=1}^{N} g(x) (u_j - u_j^*)\, dt + \int_0^{\infty} 2 \sum_{j=1}^{N} u_j^{*T} R_{ij} (u_j - u_j^*)\, dt. \tag{12.64}$$
Furthermore, (12.64) can be written as


$$\begin{aligned}
J_i(x(0), u_i, u_{-i}) ={}& V_i(x(0)) + \int_0^{\infty} \nabla V_i^T \sum_{j=1,\, j \ne i}^{N} g(x) (u_j - u_j^*)\, dt + \int_0^{\infty} \nabla V_i^T g(x) (u_i - u_i^*)\, dt \\
&+ \int_0^{\infty} 2 \sum_{j=1,\, j \ne i}^{N} u_j^{*T} R_{ij} (u_j - u_j^*)\, dt + \int_0^{\infty} \sum_{j=1,\, j \ne i}^{N} (u_j - u_j^*)^T R_{ij} (u_j - u_j^*)\, dt \\
&+ \int_0^{\infty} 2 u_i^{*T} R_{ii} (u_i - u_i^*)\, dt + \int_0^{\infty} (u_i - u_i^*)^T R_{ii} (u_i - u_i^*)\, dt.
\end{aligned} \tag{12.65}$$
Note that, at the equilibrium point, one has $u_i = u_i^*$ and $u_{-i} = u_{-i}^*$. Thus
$$J_i(x(0), u_i^*, u_{-i}^*) = V_i(x(0)). \tag{12.66}$$
From (12.65), one gets
$$J_i(x(0), u_i, u_{-i}^*) = V_i(x(0)) + \int_0^{\infty} \nabla V_i^T g(x) (u_i - u_i^*)\, dt + \int_0^{\infty} 2 u_i^{*T} R_{ii} (u_i - u_i^*)\, dt + \int_0^{\infty} (u_i - u_i^*)^T R_{ii} (u_i - u_i^*)\, dt. \tag{12.67}$$
Note that $u_i^* = u_i(V_i(x))$, so $\nabla V_i^T g(x) = -2 u_i^{*T} R_{ii}$. Therefore, Eq. (12.67) is
$$J_i(x(0), u_i, u_{-i}^*) = V_i(x(0)) + \int_0^{\infty} (u_i - u_i^*)^T R_{ii} (u_i - u_i^*)\, dt. \tag{12.68}$$
Then clearly $J_i(x(0), u_i^*, u_{-i}^*)$ in (12.66) and $J_i(x(0), u_i, u_{-i}^*)$ in (12.68) satisfy (12.4). This means that $\{u_i^*,\ i = 1, 2, \ldots, N\}$ is in Nash equilibrium.
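Inequality (12.4), in the form (12.68), can be checked numerically on a small case. The sketch below is an illustration under assumed values, not part of the chapter's simulations: a scalar two-player linear game with $u_j = -k_j x$, for which $J_i = (Q_i + \sum_j R_{ij} k_j^2)\, x_0^2 / (-2 (a - b \sum_j k_j))$ whenever the closed loop is stable. The equilibrium gains are computed with the model-based iteration of Algorithm 5, and each player then deviates unilaterally; the function names are hypothetical.

```python
import numpy as np

a, b, x0 = 2.0, 3.0, 0.5
q = np.array([9.0, 3.0])
R = np.array([[0.64, 1.0], [1.0, 1.0]])


def cost(i, k):
    """J_i for u_j = -k_j x: closed form of (12.2) when a - b*sum(k) < 0."""
    a_cl = a - b * np.sum(k)
    return -(q[i] + R[i] @ np.square(k)) * x0**2 / (2.0 * a_cl)


# Equilibrium gains from the model-based policy iteration (Algorithm 5).
k_star = np.array([1.0, 1.0])
for _ in range(60):
    p = -(q + R @ k_star**2) / (2.0 * (a - b * k_star.sum()))
    k_star = b * p / np.diag(R)


def best_unilateral_gap(i, offsets=np.linspace(-1.0, 3.0, 81)):
    """Smallest J_i(u_i, u*_-i) - J_i(u*_i, u*_-i) over deviations of k_i."""
    gaps = []
    for d in offsets:
        k = k_star.copy()
        k[i] += d
        if a - b * k.sum() < 0:          # keep only stabilizing deviations
            gaps.append(cost(i, k) - cost(i, k_star))
    return min(gaps)


print(best_unilateral_gap(0), best_unilateral_gap(1))
```

Only linear feedback deviations are scanned here, so this is a consistency check of (12.4) on a restricted policy class rather than a proof.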

Based on the above analyses, it is clear that off-policy IRL obtains $\hat{V}_i^{[k]}(x)$ and $\hat{u}_i^{[k]}(x)$ simultaneously. Based on the weight update method (12.50), $\hat{V}_i^{[k]}(x) \to V_i^{[k]}(x)$ and $\hat{u}_i^{[k]}(x) \to u_i^{[k]}(x)$. It is proven that $u_i^{[k]}(x)$ makes the closed-loop system asymptotically stable, and $u_i^{[k]}(x) \to u_i^*(x)$ as $k \to \infty$. The final theorem demonstrates that $\{u_i^*,\ i = 1, 2, \ldots, N\}$ is in Nash equilibrium. Therefore, the following computational algorithm is given for practical online implementation.


Algorithm 7 Computational Algorithm for Off-Policy IRL
1: Let the iterative index k = 0, start with stabilizing initial policies $u_1^{[0]}, u_2^{[0]}, \ldots, u_N^{[0]}$.
2: Let the iterative index k > 0, and solve for $\hat{W}_i^{[k]}$ by
$$\dot{\hat{W}}_i^{[k]} = -\eta_i^{[k]} I_{w,i}^{[k]T} \bigl( I_{w,i}^{[k]} \hat{W}_i^{[k]} + I_{xu,i}^{[k]} \bigr). \tag{12.69}$$
3: Update the N-tuple of control policies $\hat{u}_i^{[k+1]}$ and $\hat{V}_i^{[k]}(x)$.
4: Let $k \leftarrow k + 1$, repeat Step 2 until
$$\| \hat{V}_i^{[k+1]} - \hat{V}_i^{[k]} \| \le \varepsilon, \tag{12.70}$$
where $\varepsilon > 0$ is a predefined constant threshold.
5: Stop.

12.5 Simulation Study

Here we present simulations of linear and nonlinear systems to show that the games can be solved by the off-policy IRL method of this chapter.

Example 12.1 Consider the following linear system, taken from [18, 19] with modification:
$$\dot{x} = 2x + 3u_1 + 3u_2. \tag{12.71}$$
Define
$$J_1 = \int_0^{\infty} (9x^2 + 0.64 u_1^2 + u_2^2)\, dt, \tag{12.72}$$
and
$$J_2 = \int_0^{\infty} (3x^2 + u_1^2 + u_2^2)\, dt. \tag{12.73}$$
Let the initial state be $x(0) = 0.5$. For each player the structures of the critic and action networks are 1-8-1 and 1-8-1, respectively. The initial weights are selected in $(-0.5, 0.5)$. Let $Q_1 = 9$, $Q_2 = 3$, $R_{11} = 0.64$ and $R_{12} = R_{21} = R_{22} = 1$. The activation functions $\phi_i$ and $\varphi_i$ are hyperbolic functions, and $\eta_i = 0.01$. After 100 time steps, the state and control trajectories are shown in Figs. 12.1 and 12.2. Consistent with the above analysis, the iterative value functions are monotonically decreasing, as shown in Figs. 12.3 and 12.4.

Example 12.2 Consider the following nonlinear system, taken from [20] with modification:
$$\dot{x} = f(x) + g u_1 + g u_2 + g u_3, \tag{12.74}$$

Fig. 12.1 The trajectory of state (x versus time steps)
Fig. 12.2 The trajectories of control (u1, u2 versus time steps)
Fig. 12.3 The trajectory of V1 (V1 versus iteration steps)
Fig. 12.4 The trajectory of V2 (V2 versus iteration steps)
Fig. 12.5 The trajectory of state (x1, x2 versus time steps)

where $f(x) = \begin{bmatrix} f_1(x) \\ f_2(x) \end{bmatrix}$, $f_1(x) = -2x_1 + x_2$, $f_2(x) = -0.5x_1 - x_2 + x_1^2 x_2 + 0.25 x_2 (\cos(2x_1) + 2)^2 + 0.25 x_2 (\sin(4x_1^2) + 2)^2$, and $g(x) = \begin{bmatrix} 0 \\ 2x_1 \end{bmatrix}$. Define
$$J_1(x) = \int_0^{\infty} \Bigl( \frac{1}{8} x_1^2 + \frac{1}{4} x_2^2 + u_1^2 + 0.9 u_2^2 + u_3^2 \Bigr) dt, \tag{12.75}$$
$$J_2(x) = \int_0^{\infty} \Bigl( \frac{1}{2} x_1^2 + x_2^3 + u_1^2 + u_2^2 + 5 u_3^2 \Bigr) dt, \tag{12.76}$$
$$J_3(x) = \int_0^{\infty} \Bigl( \frac{1}{4} x_1^2 + \frac{1}{2} x_2^2 + 3 u_1^2 + 2 u_2^2 + u_3^2 \Bigr) dt. \tag{12.77}$$
Let the initial state be $x(0) = [1; -1]$. For each player the structures of the critic and action networks are 2-8-1 and 2-8-1, respectively. The activation functions $\phi_i$ and $\varphi_i$ are hyperbolic functions. Let $\eta_i = 0.05$. After 1000 time steps, the state and control trajectories are shown in Figs. 12.5 and 12.6. Figures 12.7, 12.8 and 12.9 show the value function for each player, which is monotonically decreasing.

Fig. 12.6 The trajectories of control (u1, u2, u3 versus time steps)
Fig. 12.7 The trajectory of V1 (V1 versus iteration steps)
Fig. 12.8 The trajectory of V2 (V2 versus iteration steps)
Fig. 12.9 The trajectory of V3 (V3 versus iteration steps)

12.6 Conclusion

This chapter establishes an off-policy IRL method for CT multi-player NZS games with unknown dynamics. Since the system dynamics are unknown, off-policy IRL is used to perform policy evaluation and policy improvement in the PI algorithm. The critic and action networks are used to obtain the performance index and control for each player. The convergence of the weights is proven. The asymptotic stability of the closed-loop system and the existence of Nash equilibrium are proven. The simulation study demonstrates the effectiveness of the proposed method. Furthermore, it is noted that the condition required in the proof of Theorem 12.2 is restrictive. In future work, we will concentrate on relaxing this condition. We also notice that the unknown system functions f and g are the same for each player in this chapter. In future work, we will discuss the off-policy IRL method for different unknown functions $f_i$ or $g_i$ for each player, which will broaden and deepen the research.

References

1. Vamvoudakis, K., Lewis, F.: Multi-player non-zero-sum games: online adaptive learning solution of coupled Hamilton-Jacobi equations. Automatica 47(8), 1556–1569 (2011)
2. Zhang, H., Wei, Q., Luo, Y.: A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Trans. Syst. Man Cybern. Part B-Cybern. 38(4), 937–942 (2008)
3. Wei, Q., Wang, F., Liu, D., Yang, X.: Finite-approximation-error based discrete-time iterative adaptive dynamic programming. IEEE Trans. Cybern. 44(12), 2820–2833 (2014)
4. Wei, Q., Liu, D.: A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Trans. Automat. Sci. Eng. 11(4), 1176–1190 (2014)
5. Song, R., Xiao, W., Zhang, H., Sun, C.: Adaptive dynamic programming for a class of complex-valued nonlinear systems. IEEE Trans. Neural Netw. Learn. Syst. 25(9), 1733–1739 (2014)
6. Song, R., Lewis, F., Wei, Q., Zhang, H., Jiang, Z., Levine, D.: Multiple actor-critic structures for continuous-time optimal control using input-output data. IEEE Trans. Neural Netw. Learn. Syst. 26(4), 851–865 (2015)
7. Modares, H., Lewis, F., Naghibi-Sistani, M.B.: Adaptive optimal control of unknown constrained-input systems using policy iteration and neural networks. IEEE Trans. Neural Netw. Learn. Syst. 24(10), 1513–1525 (2013)
8. Modares, H., Lewis, F.: Optimal tracking control of nonlinear partially-unknown constrained-input systems using integral reinforcement learning. Automatica 50(7), 1780–1792 (2014)
9. Modares, H., Lewis, F., Naghibi-Sistani, M.: Integral reinforcement learning and experience replay for adaptive optimal control of partially-unknown constrained-input continuous-time systems. Automatica 50, 193–202 (2014)
10. Kiumarsi, B., Lewis, F., Naghibi-Sistani, M., Karimpour, A.: Approximate dynamic programming for optimal tracking control of unknown linear systems using measured data. IEEE Trans. Cybern. 45(12), 2770–2779 (2015)
11. Jiang, Y., Jiang, Z.: Computational adaptive optimal control for continuous-time linear systems with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
12. Luo, B., Wu, H., Huang, T.: Off-policy reinforcement learning for H∞ control design. IEEE Trans. Cybern. 45(1), 65–76 (2015)
13. Song, R., Lewis, F., Wei, Q., Zhang, H.: Off-policy actor-critic structure for optimal control of unknown systems with disturbances. IEEE Trans. Cybern. 46(5), 1041–1050 (2016)
14. Lewis, F., Vrabie, D., Syrmos, V.L.: Optimal Control, 3rd edn. Wiley, Hoboken (2012)
15. Vamvoudakis, K., Lewis, F., Hudas, G.: Multi-agent differential graphical games: online adaptive learning solution for synchronization with optimality. Automatica 48(8), 1598–1611 (2012)
16. Abu-Khalaf, M., Lewis, F.: Nearly optimal control laws for nonlinear systems with saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
17. Leake, R., Liu, R.: Construction of suboptimal control sequences. SIAM J. Control 5(1), 54–63 (1967)
18. Jungers, M., De Pieri, E., Abou-Kandil, H.: Solving coupled algebraic Riccati equations from closed-loop Nash strategy, by lack of trust approach. Int. J. Tomogr. Stat. 7(F07), 49–54 (2007)
19. Limebeer, D., Anderson, B., Hendel, H.: A Nash game approach to mixed H2/H∞ control. IEEE Trans. Autom. Control 39(1), 69–82 (1994)
20. Liu, D., Li, H., Wang, D.: Online synchronous approximate optimal learning algorithm for multiplayer nonzero-sum games with unknown dynamics. IEEE Trans. Syst. Man Cybern.: Syst. 44(8), 1015–1027 (2014)

Chapter 13

Optimal Distributed Synchronization Control for Heterogeneous Multi-agent Graphical Games

In this chapter, a new optimal coordination control scheme for the consensus problem of heterogeneous multi-agent differential graphical games is developed by iterative ADP. The main idea is to use the iterative ADP technique to obtain an iterative control law that makes all the agents track a given dynamics and simultaneously drives the iterative performance index function to the Nash equilibrium. In the developed heterogeneous multi-agent differential graphical games, the agent of each node may differ from the agents of the other nodes. The dynamics and performance index function of each node depend only on local neighbor information. A cooperative policy iteration algorithm for graphical differential games is developed to obtain the optimal control law for the agent of each node, so that the coupled Hamilton–Jacobi (HJ) equations for optimal coordination control of heterogeneous multi-agent differential games need not be solved directly. Convergence analysis shows that the performance index functions of heterogeneous multi-agent differential graphical games converge to the Nash equilibrium. Simulation results show the effectiveness of the developed optimal control scheme.

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5_13

13.1 Introduction

A large class of real systems are controlled by more than one controller or decision maker, each using an individual strategy. These controllers often operate in a group with coupled performance index functions as a game [1]. Stimulated by a vast number of applications, including those in economics, management, communication networks, power networks, and the design of complex engineering systems, game theory [2] has been very successful in modeling strategic behavior, where the outcome for each player depends on its own actions and on those of all the other players. Previous policy iteration ADP algorithms for multi-player games generally require the system states of each agent to converge to the equilibrium of the system. In many real-world games, however, the states of each agent are required to track a desired dynamics, i.e., synchronization control. Synchronization of multi-agent systems by ADP-based optimal control, pioneered by Vamvoudakis et al. [3], was developed for graphical differential games, where the Nash equilibrium of the game was reached by a policy iteration algorithm.

In this chapter, inspired by [3], the optimal cooperative control for heterogeneous multi-agent graphical differential games is investigated. We emphasize that in the developed heterogeneous graphical differential game, the autonomous agent of each node and the desired dynamics can be different from each other. By a system transformation, the optimal synchronization control problem is transformed into an optimal multi-agent cooperative regulation problem. The corresponding performance index function of the graphical differential game is transformed accordingly, so that the optimal control law for each agent can be expressed explicitly. The main contribution of this chapter is to develop an effective policy iteration ADP algorithm that obtains the optimal cooperative control law for heterogeneous multi-agent graphical differential games. Convergence properties of the iterative ADP algorithm are also developed, which guarantee that the performance index function of the agent of each node converges to the Nash equilibrium of the game. A simulation example is presented to show the effectiveness of the developed algorithm.

This chapter is organized as follows. In Sect. 13.2, graphs and synchronization of heterogeneous multi-agent dynamic systems are presented. In Sect. 13.3, optimal distributed cooperative control for heterogeneous multi-agent differential graphical games is presented, and the expressions of the optimal control law are developed. In Sect. 13.4, the iterative ADP algorithm for heterogeneous multi-agent differential graphical games is developed, and the properties of the iterative performance index functions are analyzed. In Sect. 13.5, simulation results are given to demonstrate the effectiveness of the developed algorithm. Finally, in Sect. 13.6, the conclusions are drawn.

13.2 Graphs and Synchronization of Multi-agent Systems

In this section, a background review of communication graphs is given and the problem of synchronization of heterogeneous multi-agent systems subject to external disturbances is formulated.

13.2.1 Graph Theory

Let $\mathcal{G} = (\mathcal{V}, \mathcal{E}, \mathcal{A})$ be a weighted graph of $N$ nodes with the nonempty finite set of nodes $\mathcal{V} = \{v_1, v_2, \ldots, v_N\}$, where the set of edges $\mathcal{E}$ belongs to the product space of $\mathcal{V}$ (i.e., $\mathcal{E} \subseteq \mathcal{V} \times \mathcal{V}$). An edge of $\mathcal{G}$ is denoted by $\rho_{ij} = (v_j, v_i)$, which is a directed path from node $j$ to node $i$, and $\mathcal{A} = [\rho_{ij}]$ is a weighted adjacency matrix with nonnegative adjacency elements, i.e., $\rho_{ij} \ge 0$, with $\rho_{ij} \in \mathcal{E} \Leftrightarrow \rho_{ij} > 0$ and $\rho_{ij} = 0$ otherwise. The node index $i$ belongs to a finite index set $\mathcal{N} = \{1, 2, \ldots, N\}$.

Definition 13.1 (Laplacian Matrix) The graph Laplacian matrix $L = [l_{ij}]$ is defined as $L = D - \mathcal{A}$, with $D = \mathrm{diag}\{d_i\} \in \mathbb{R}^{N \times N}$ being the in-degree matrix of the graph, where $d_i = \sum_{j=1}^{N} \rho_{ij}$ is the in-degree of node $v_i$.

In this chapter, we assume that the graph is simple, i.e., it has no repeated edges and no self loops. The set of neighbors of node $v_i$ is denoted by $N_i = \{v_j \in \mathcal{V} : (v_j, v_i) \in \mathcal{E}\}$. A graph is said to have a spanning tree if there is a node $v_i$ (called the root) such that there is a directed path from the root to every other node in the graph. A graph is said to be strongly connected if there is a directed path from node $i$ to node $j$ for all distinct nodes $v_i, v_j \in \mathcal{V}$. A digraph has a spanning tree if it is strongly connected, but not vice versa.
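As an illustration of Definition 13.1, the following minimal sketch builds the in-degree matrix and the Laplacian from a weighted adjacency matrix. The function name `graph_laplacian` and the three-node graph are hypothetical and chosen only for this example; the convention that row $i$ collects the in-neighbors of node $i$ follows the definition of $\rho_{ij}$ above.

```python
import numpy as np

def graph_laplacian(adjacency):
    """Return (D, L) for a weighted adjacency matrix [rho_ij].

    rho_ij > 0 means there is an edge from node j to node i,
    so row i of the adjacency matrix lists the in-neighbours of node i.
    """
    A = np.asarray(adjacency, dtype=float)
    in_degree = A.sum(axis=1)          # d_i = sum_j rho_ij
    D = np.diag(in_degree)
    return D, D - A

# Hypothetical 3-node digraph with edges 3 -> 1, 1 -> 2 and 2 -> 3.
A = np.array([[0.0, 0.0, 0.7],
              [0.5, 0.0, 0.0],
              [0.0, 0.3, 0.0]])
D, L = graph_laplacian(A)
```

By construction every row of `L` sums to zero, which is the property used later when the target state is subtracted from all node states.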

13.2.2 Synchronization and Tracking Error Dynamic Systems

Consider a heterogeneous multi-agent system with $N$ agents in the form of communication network $\mathcal{G}$. For an arbitrary node $i$, $i = 1, 2, \ldots, N$, the heterogeneous agent is expressed as

$$\dot{x}_i = A_i x_i + B_i u_i, \tag{13.1}$$

where $x_i(t) \in \mathbb{R}^n$ is the state of node $v_i$, and $u_i(t) \in \mathbb{R}^m$ is the input coordination control. Let $A_i$ and $B_i$, $\forall i \in \mathcal{N}$, be constant matrices. In this chapter, we assume that the control gain matrix $B_i$ satisfies $\mathrm{rank}\{B_i\} \ge n$, i.e., $B_i$ has full row rank, for the convenience of our analysis. For $\forall i \in \mathcal{N}$, assume that the agent of each node is controllable. The local neighborhood tracking error for $\forall i \in \mathcal{N}$ is defined as

$$\delta_i = \sum_{j \in N_i} \rho_{ij}(x_i - x_j) + \sigma_i (x_i - x_0), \tag{13.2}$$

where $\sigma_i \ge 0$ is the pinning gain. Note that $\sigma_i > 0$ for at least one $i$. We let $\sigma_i = 0$ if and only if there is no direct path from the control node to the $i$th node in $\mathcal{G}$; otherwise $\sigma_i > 0$. The target node or state is $x_0(t) \in \mathbb{R}^n$, which satisfies the dynamics

$$\dot{x}_0 = A_0 x_0. \tag{13.3}$$

The synchronization control design problem is to design local control protocols for all the nodes in $\mathcal{G}$ such that they synchronize to the state of the control node, i.e., for $\forall i \in \mathcal{N}$, we require $x_i(t) \to x_0(t)$.
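The local neighborhood tracking error (13.2) can be evaluated node by node from the adjacency weights and pinning gains. The sketch below is only illustrative: the function name, the three-node graph and the numerical states are assumptions made for this example. It also evaluates the stacked Kronecker form that appears in (13.5), so the two computations can be compared.

```python
import numpy as np

def local_tracking_errors(A_graph, pinning, x, x0):
    """delta_i = sum_j rho_ij (x_i - x_j) + sigma_i (x_i - x_0), cf. (13.2).

    A_graph : (N, N) weighted adjacency matrix [rho_ij]
    pinning : (N,) pinning gains sigma_i
    x       : (N, n) array whose i-th row is x_i
    x0      : (n,) target state
    """
    N = x.shape[0]
    delta = np.zeros_like(x, dtype=float)
    for i in range(N):
        for j in range(N):
            delta[i] += A_graph[i, j] * (x[i] - x[j])
        delta[i] += pinning[i] * (x[i] - x0)
    return delta

# Hypothetical data: 3 nodes with 2 states each, only node 1 is pinned.
A_graph = np.array([[0.0, 0.0, 0.7],
                    [0.5, 0.0, 0.0],
                    [0.0, 0.3, 0.0]])
pinning = np.array([0.2, 0.0, 0.0])
x = np.array([[1.0, -1.0], [0.0, 2.0], [3.0, 1.0]])
x0 = np.array([0.5, -0.5])
delta = local_tracking_errors(A_graph, pinning, x, x0)

# Stacked form ((L + G) kron I_n)(x - 1 kron x0), cf. (13.5).
L = np.diag(A_graph.sum(axis=1)) - A_graph
G = np.diag(pinning)
delta_stacked = np.kron(L + G, np.eye(2)) @ (x - x0).reshape(-1)
```

The loop version mirrors the definition directly; the Kronecker version is more convenient when all nodes are processed at once.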

According to (13.1), the global network dynamics can be expressed as

$$\dot{x} = A x + B u, \tag{13.4}$$

where the global state vector of the multi-agent system (13.4) is $x = [x_1^T, x_2^T, \ldots, x_N^T]^T \in \mathbb{R}^{Nn}$ and the global control input is $u = [u_1^T, u_2^T, \ldots, u_N^T]^T \in \mathbb{R}^{Nm}$. Let $A = \mathrm{diag}\{A_i\}$ and $B = \mathrm{diag}\{B_i\}$, $i = 1, 2, \ldots, N$. According to (13.2) and (13.4), the global error vector for the network $\mathcal{G}$ is

$$\delta = (L \otimes I_n)x + (G \otimes I_n)(x - \bar{x}_0) = ((L + G) \otimes I_n)(x - \bar{x}_0) = \mathcal{L}(x - \bar{x}_0), \tag{13.5}$$

where $\mathcal{L} = (L + G) \otimes I_n$, $I_n$ is the $n$-dimensional identity matrix, $L$ is the Laplacian matrix, $\otimes$ denotes the Kronecker product operator, and $\bar{x}_0 = \mathbf{1}_N \otimes x_0$ stacks $N$ copies of the target state. Let $G = [\sigma_{ij}] \in \mathbb{R}^{N \times N}$ be a diagonal matrix with diagonal entries $\sigma_i$.

In [3], the system matrices satisfy $A_i = A_0$ for $\forall i \in \mathcal{N}$. This makes the control $u_i \equiv 0$ once $x_i$ reaches the target node $x_0$. In this chapter, according to (13.1) and (13.3), the control $u_i$ cannot in general be zero when $x_i = x_0$. Thus, we should first solve for the desired control at the target state. Let $u_{di}$ be the desired control that satisfies the following equation

$$\dot{x}_0 = A_i x_0 + B_i u_{di}. \tag{13.6}$$

According to (13.3), we have

$$u_{di} = B_i^{+}(A_0 - A_i)x_0, \tag{13.7}$$

where $B_i^{+}$ is the Moore–Penrose pseudo-inverse matrix of $B_i$. Let $x_{ei} = x_i - x_0$ and $u_{ei} = u_i - u_{di}$. Then, according to (13.7), agent (13.1) can be rewritten as

$$\dot{x}_{ei} = A_i x_{ei} + B_i u_{ei} + (A_i - A_0)x_0 + B_i u_{di} = A_i x_{ei} + B_i u_{ei}. \tag{13.8}$$
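The desired control (13.7) is a single pseudo-inverse computation. The sketch below is a minimal illustration, with the function name `desired_control` being hypothetical; the numerical values are borrowed from Node 1 and the desired dynamics of the simulation in Sect. 13.5, and `numpy.linalg.pinv` is used for the Moore–Penrose pseudo-inverse. The residual of (13.6) is evaluated to confirm that the extra terms in (13.8) cancel when $B_i$ has full row rank.

```python
import numpy as np

def desired_control(A_i, B_i, A_0, x0):
    """u_di = B_i^+ (A_0 - A_i) x_0, cf. (13.7)."""
    return np.linalg.pinv(B_i) @ (A_0 - A_i) @ x0

# Illustrative values (Node 1 and the desired dynamics of Sect. 13.5).
A_i = np.array([[-1.0, 1.0], [-3.0, -1.0]])
B_i = np.array([[2.0, 0.0], [0.0, 1.0]])
A_0 = np.array([[-2.0, 1.0], [-4.0, -1.0]])
x0 = np.array([0.5, -0.5])

u_di = desired_control(A_i, B_i, A_0, x0)

# Residual of (13.6): A_i x_0 + B_i u_di - A_0 x_0, zero when B_i B_i^+ = I.
residual = A_i @ x0 + B_i @ u_di - A_0 @ x0
```

If $B_i$ did not have full row rank, the residual would in general be nonzero and the cancellation in (13.8) would fail, which is why the rank assumption on $B_i$ is made.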

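According to (13.2), the dynamics of the local neighborhood tracking error for the network $\mathcal{G}$ can be expressed as

$$
\begin{aligned}
\dot{\delta}_i &= \sum_{j \in N_i} \rho_{ij}(\dot{x}_i - \dot{x}_j) + \sigma_i(\dot{x}_i - \dot{x}_0) \\
&= \sum_{j \in N_i} \rho_{ij}(\dot{x}_{ei} - \dot{x}_{ej}) + \sigma_i \dot{x}_{ei} \\
&= \sum_{j \in N_i} \rho_{ij}(A_i x_{ei} + B_i u_{ei} - A_j x_{ej} - B_j u_{ej}) + \sigma_i(A_i x_{ei} + B_i u_{ei}) \\
&= \sum_{j \in N_i} \rho_{ij} A_i (x_{ei} - x_{ej}) + \sum_{j \in N_i} \rho_{ij}(A_i - A_j)x_{ej} + \sum_{j \in N_i} \rho_{ij}(B_i u_{ei} - B_j u_{ej}) + \sigma_i A_i x_{ei} + \sigma_i B_i u_{ei} \\
&= A_i \delta_i + \sum_{j \in N_i} \rho_{ij}(A_i - A_j)x_{ej} + (d_i + \sigma_i)B_i u_{ei} - \sum_{j \in N_i} \rho_{ij} B_j u_{ej}.
\end{aligned} \tag{13.9}
$$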

From (13.9), we can see that the tracking error dynamics depend on both $\delta_i$ and $x_{ej}$. This means that the distributed control $u_i$ should be designed using the information of $\delta_i$ and $x_{ej}$, so a system transformation is necessary. Let $j_1^i, j_2^i, \ldots, j_{N_i}^i$ be the nodes in $N_i$. Define new state and control vectors as $\tilde{x}_{ei} = [x_{ej_1^i}^T, x_{ej_2^i}^T, \ldots, x_{ej_{N_i}^i}^T]^T$ and $\tilde{u}_{ei} = [u_{ej_1^i}^T, u_{ej_2^i}^T, \ldots, u_{ej_{N_i}^i}^T]^T$. If we let $z_i = [\delta_i^T, \tilde{x}_{ei}^T]^T$ and $\bar{u}_{ei} = [u_{ei}^T, \tilde{u}_{ei}^T]^T$, then we can obtain the following expression

$$\dot{z}_i = \bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}, \tag{13.10}$$

where $\bar{A}_i$ and $\bar{B}_i$ are expressed as

$$
\bar{A}_i = \begin{bmatrix}
A_i & \rho_{ij_1^i}(A_i - A_{j_1^i}) & \cdots & \rho_{ij_{N_i}^i}(A_i - A_{j_{N_i}^i}) \\
0 & A_{j_1^i} & & 0 \\
\vdots & & \ddots & \\
0 & 0 & & A_{j_{N_i}^i}
\end{bmatrix}, \tag{13.11}
$$

and

$$
\bar{B}_i = \begin{bmatrix}
(d_i + \sigma_i)B_i & -\rho_{ij_1^i}B_{j_1^i} & \cdots & -\rho_{ij_{N_i}^i}B_{j_{N_i}^i} \\
0 & B_{j_1^i} & & 0 \\
\vdots & & \ddots & \\
0 & 0 & & B_{j_{N_i}^i}
\end{bmatrix}, \tag{13.12}
$$

respectively. We can see that system (13.10) is a multi-input system. Its controls are $u_{ei}$ and all the $u_{ej}$ from the neighbors of node $i$, so the controls for agent $i$ are coupled with those of its neighbors. This makes the controller design very difficult. In the next section, an optimal distributed control by iterative ADP algorithm will be developed that makes all agents reach the target state.
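The block structure of (13.11) and (13.12) can be assembled mechanically from the matrices of node $i$ and its neighbors. The following sketch is one possible way to do so; the function name `augmented_matrices`, the argument layout and the assumption that all $B_j$ share the same dimensions are choices made for this illustration only. The coupling entries use the edge weights $\rho_{ij}$ that appear in (13.9).

```python
import numpy as np

def augmented_matrices(A_i, B_i, neighbors, d_i, sigma_i):
    """Build A_bar_i and B_bar_i of (13.10)-(13.12).

    neighbors : list of (rho_ij, A_j, B_j), one entry per j in N_i
    Assumes every B_j has the same shape (n, m) as B_i.
    """
    n, m = B_i.shape
    p = len(neighbors)
    A_bar = np.zeros(((p + 1) * n, (p + 1) * n))
    B_bar = np.zeros(((p + 1) * n, (p + 1) * m))
    A_bar[:n, :n] = A_i
    B_bar[:n, :m] = (d_i + sigma_i) * B_i
    for k, (rho, A_j, B_j) in enumerate(neighbors, start=1):
        r = slice(k * n, (k + 1) * n)
        c = slice(k * m, (k + 1) * m)
        A_bar[:n, r] = rho * (A_i - A_j)   # coupling through x_ej
        A_bar[r, r] = A_j                  # neighbour error dynamics (13.8)
        B_bar[:n, c] = -rho * B_j          # coupling through u_ej
        B_bar[r, c] = B_j
    return A_bar, B_bar

# Hypothetical example: node i with a single neighbour j.
A_i = np.array([[-1.0, 1.0], [-3.0, -1.0]])
B_i = np.array([[2.0, 0.0], [0.0, 1.0]])
A_j = np.array([[-2.0, 1.0], [-1.0, -1.0]])
B_j = np.array([[2.0, 0.0], [0.0, 2.0]])
rho, sigma_i = 0.1, 0.5
A_bar, B_bar = augmented_matrices(A_i, B_i, [(rho, A_j, B_j)],
                                  d_i=rho, sigma_i=sigma_i)
```

The first block row of `A_bar @ z + B_bar @ u_bar` reproduces the last line of (13.9), while the remaining block rows are simply the neighbor error dynamics.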

13.3 Optimal Distributed Cooperative Control for Multi-agent Differential Graphical Games

In this section, our goal is to design an optimal distributed control that reaches consensus while simultaneously minimizing the local performance index function for system (13.10). Define $u_{-ei}$ as the vector of the controls of the neighbors of node $i$, i.e., $u_{-ei} \triangleq \{u_{ej} : j \in N_i\}$.

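13.3.1 Cooperative Performance Index Function

For $\forall i \in \mathcal{N}$, let $Q_{ii}$, $R_{ii}$, and $R_{ij}$, $j \in N_i$, be constant symmetric matrices which satisfy $Q_{ii} \ge 0$, $R_{ii} > 0$, and $R_{ij} \ge 0$. For $\forall i \in \mathcal{N}$, define the utility function as

$$
\begin{aligned}
U_i(z_i, u_{ei}, u_{-ei}) &= z_i^T \bar{Q}_{ii} z_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej} \\
&= \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej},
\end{aligned} \tag{13.13}
$$

where the weighting matrices $Q_{ii}$, $R_{ii}$, and $R_{ij}$, $j \in N_i$, are taken to be positive definite in what follows, and

$$
z_i^T \bar{Q}_{ii} z_i =
\begin{bmatrix} \delta_i \\ x_{ej_1^i} \\ \vdots \\ x_{ej_{N_i}^i} \end{bmatrix}^T
\begin{bmatrix}
Q_{ii} & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 0
\end{bmatrix}
\begin{bmatrix} \delta_i \\ x_{ej_1^i} \\ \vdots \\ x_{ej_{N_i}^i} \end{bmatrix}
= \delta_i^T Q_{ii} \delta_i. \tag{13.14}
$$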
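To make the structure of (13.13) and (13.14) concrete, the sketch below evaluates the utility function for one node with a single neighbor and compares $z_i^T \bar{Q}_{ii} z_i$ with $\delta_i^T Q_{ii} \delta_i$. The function name `utility` and all numerical values are assumptions made for this example.

```python
import numpy as np

def utility(delta_i, u_ei, neighbor_controls, Q_ii, R_ii, R_ij_list):
    """U_i = delta^T Q_ii delta + u_ei^T R_ii u_ei + sum_j u_ej^T R_ij u_ej."""
    value = delta_i @ Q_ii @ delta_i + u_ei @ R_ii @ u_ei
    for u_ej, R_ij in zip(neighbor_controls, R_ij_list):
        value += u_ej @ R_ij @ u_ej
    return float(value)

# Hypothetical data: n = m = 2, one neighbour.
delta_i = np.array([1.0, 2.0])
x_ej = np.array([3.0, -1.0])
u_ei = np.array([1.0, 0.0])
u_ej = np.array([0.0, 1.0])
Q_ii, R_ii, R_ij = np.eye(2), 4.0 * np.eye(2), np.eye(2)

U_i = utility(delta_i, u_ei, [u_ej], Q_ii, R_ii, [R_ij])

# Q_bar_ii = diag{Q_ii, 0}: only the delta_i block of z_i is penalised.
z_i = np.concatenate([delta_i, x_ej])
Q_bar = np.zeros((4, 4))
Q_bar[:2, :2] = Q_ii
state_cost = float(z_i @ Q_bar @ z_i)
```

Because only the first diagonal block of $\bar{Q}_{ii}$ is nonzero, the neighbor errors $x_{ej}$ enter the cost only through the dynamics and not through the state penalty.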

Then, we can define the local performance index functions as

$$
\begin{aligned}
J_i(z_i(0), u_{ei}, u_{-ei}) &= \int_0^\infty \Big( \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej} \Big)\, dt \\
&= \int_0^\infty U_i(z_i, u_{ei}, u_{-ei})\, dt.
\end{aligned} \tag{13.15}
$$

From (13.15), we can see that the performance index function includes only the information about the inputs of node $i$ and its neighbors. The goal of this chapter is to design the local optimal distributed control that minimizes the local performance index functions (13.15) subject to (13.10) and makes all nodes (agents) reach consensus at the target state $x_0$.

Definition 13.2 (Admissible Coordination Control Law [3]) A control law $u_{ei}$, $i \in \mathcal{N}$, is defined as an admissible coordination control law if $u_{ei}$ is continuous, $u_{ei}(0) = 0$, $u_{ei}$ stabilizes system (13.10) locally, and the performance index function (13.15) is finite.

If $u_{ei}$ and $u_{ej}$, $j \in N_i$, are admissible control laws, then we can define the local performance index function $V_i(z_i)$ as

$$V_i(z_i(0)) = \int_0^\infty U_i(z_i, u_{ei}, u_{-ei})\, dt. \tag{13.16}$$

For admissible control laws $u_{ei}$ and $u_{ej}$, $j \in N_i$, the Hamiltonian $H_i\big(z_i, \frac{\partial V_i}{\partial z_i}, u_{ei}, u_{-ei}\big)$ satisfies the following cooperative Hamilton–Jacobi (HJ) equation

$$
H_i\Big(z_i, \frac{\partial V_i}{\partial z_i}, u_{ei}, u_{-ei}\Big) \equiv \frac{\partial V_i^T}{\partial z_i}\big(\bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}\big) + \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^T R_{ij} u_{ej} = 0 \tag{13.17}
$$

with boundary condition $V_i(0) = 0$. If $V_i^*(z_i)$ is the local optimal performance index function, then $V_i^*(z_i)$ satisfies the coupled HJ equations

$$\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^*}{\partial z_i}, u_{ei}, u_{-ei}\Big) = 0 \tag{13.18}$$

and the local optimal control law $u_{ei}^*$ satisfies

$$u_{ei}^* = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^*}{\partial z_i}, u_{ei}, u_{-ei}\Big). \tag{13.19}$$

13.3.2 Nash Equilibrium

In this subsection, the optimality of the multi-agent systems will be developed. The corresponding property will also be presented. According to [3], we introduce the Nash equilibrium definition for multi-agent differential games.

Definition 13.3 (Global Nash equilibrium) A sequence of control laws $\{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\}$ is said to be a Nash equilibrium solution for an $N$ multi-agent game if, for all $i \in \mathcal{N}$,

$$J_i^* = J_i(u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*) \le J_i(u_{e1}^*, \ldots, u_{ei}, \ldots, u_{eN}^*), \tag{13.20}$$

for $u_{ei} \ne u_{ei}^*$. The $N$-tuple of the local performance index functions $\{J_1^*, J_2^*, \ldots, J_N^*\}$ is known as the Nash equilibrium of the $N$ multi-agent game in $\mathcal{G}$.

Theorem 13.1 Let $J_i^*$ be the Nash equilibrium solution that satisfies (13.20). If $\{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\}$ are optimal control laws, then for $\forall i$, we have

$$u_{ei}^* = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial J_i^*}{\partial z_i}, u_{ei}, u_{-ei}^*\Big). \tag{13.21}$$

Proof The conclusion can be proven by contradiction. Assume that the conclusion is false. Let $\{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\}$ be optimal control laws. Then, there must exist a $u_{el}^*$ such that

$$u_{el}^* \ne \arg\min_{u_{el}} H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}, u_{-el}^*\Big). \tag{13.22}$$

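As $\{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\}$ is the optimal control law and $J_l^*$ is the Nash equilibrium solution, for $l = 1, 2, \ldots, N$, we can get

$$
H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}^*, u_{-el}^*\Big) = \frac{\partial J_l^{*T}}{\partial z_l}\big(\bar{A}_l z_l + \bar{B}_l \bar{u}_{el}^*\big) + \delta_l^T Q_{ll} \delta_l + u_{el}^{*T} R_{ll} u_{el}^* + \sum_{j \in N_l} u_{ej}^{*T} R_{lj} u_{ej}^* = 0. \tag{13.23}
$$

If (13.23) holds, then we can let $u_{el}^o$ satisfy

$$u_{el}^o = \arg\min_{u_{el}} H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}, u_{-el}^*\Big). \tag{13.24}$$

According to (13.24), we have

$$H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}^o, u_{-el}^*\Big) = \min_{u_{el}} H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}, u_{-el}^*\Big). \tag{13.25}$$

Hence, we can obtain

$$H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}^o, u_{-el}^*\Big) \le H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}^*, u_{-el}^*\Big) = 0. \tag{13.26}$$

Find a performance index function $J_l^o(z_l(0), u_{el}^o, u_{-el}^*)$ that satisfies

$$
H_l\Big(z_l, \frac{\partial J_l^o}{\partial z_l}, u_{el}^o, u_{-el}^*\Big) = \frac{\partial J_l^{oT}}{\partial z_l}\big(\bar{A}_l z_l + \bar{B}_l \bar{u}_{el}^o\big) + \delta_l^T Q_{ll} \delta_l + u_{el}^{oT} R_{ll} u_{el}^o + \sum_{j \in N_l} u_{ej}^{*T} R_{lj} u_{ej}^* = 0. \tag{13.27}
$$

Then, we can obtain

$$\dot{J}_l^o = -\Big(\delta_l^T Q_{ll} \delta_l + u_{el}^{oT} R_{ll} u_{el}^o + \sum_{j \in N_l} u_{ej}^{*T} R_{lj} u_{ej}^*\Big). \tag{13.28}$$

According to (13.26)–(13.28), we have

$$\dot{J}_l^* = H_l\Big(z_l, \frac{\partial J_l^*}{\partial z_l}, u_{el}^o, u_{-el}^*\Big) - \Big(\delta_l^T Q_{ll} \delta_l + u_{el}^{oT} R_{ll} u_{el}^o + \sum_{j \in N_l} u_{ej}^{*T} R_{lj} u_{ej}^*\Big) \le \dot{J}_l^o. \tag{13.29}$$

Hence, $J_l^* \ge J_l^o$ for $l = 1, 2, \ldots, N$, which is a contradiction. So the conclusion holds.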

From Theorem 13.1, we can see that if, for $\forall i \in \mathcal{N}$, all the controls of its neighbors are optimal, i.e., $u_{ej} = u_{ej}^*$, $j \in N_i$, then the optimal distributed control $u_{ei}^*$ can be obtained by (13.21). The following results can be derived according to the properties of $u_{ei}^*$, $i \in \mathcal{N}$.

Lemma 1 Let $V_i^*(z_i) > 0$, $i \in \mathcal{N}$, be a solution to the HJ equation (13.18), and let the optimal distributed control policies $u_{ei}^*$, $i \in \mathcal{N}$, be given by (13.19) in terms of $V_i^*(z_i(0))$. Then,
(1) The local neighborhood consensus error system (13.10) converges to zero.
(2) The local performance index functions $J_i^*(z_i(0), u_{ei}^*, u_{-ei}^*)$ are equal to $V_i^*(z_i(0))$, $i \in \mathcal{N}$, and $u_{ei}^*$ and $u_{ej}^*$, $j \in N_i$, are in Nash equilibrium.

Proof The proof can be seen in [3] and the details are omitted here.

13.4 Heterogeneous Multi-agent Differential Graphical Games by Iterative ADP Algorithm

From the previous section, we know that once the optimal performance index function has been obtained, the optimal control law follows from the HJ equation (13.18). However, Eq. (13.18) is difficult or impossible to solve directly by dynamic programming. In this section, a policy iteration (PI) algorithm of adaptive dynamic programming is developed which solves the HJ equation (13.18) iteratively. Convergence properties of the developed algorithm will also be presented in this section.

13.4.1 Derivation of the Heterogeneous Multi-agent Differential Graphical Games

In the developed policy iteration algorithm, the performance index function and control law are updated by iterations, with the iteration index $k$ increasing from 0 to infinity. For $\forall i \in \mathcal{N}$, let $u_{ei}^{[0]}$ be an arbitrary admissible control law. Let $V_i^{[0]}$ be the iterative performance index function constructed by $u_{ei}^{[0]}$, which satisfies

$$H_i\Big(z_i, \frac{\partial V_i^{[0]}}{\partial z_i}, u_{ei}^{[0]}, u_{-ei}^{[0]}\Big) = 0. \tag{13.30}$$

According to (13.30), we can update the iterative control law $u_{ei}^{[1]}$ by

$$u_{ei}^{[1]} = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^{[0]}}{\partial z_i}, u_{ei}, u_{-ei}^{[0]}\Big). \tag{13.31}$$

For $k = 1, 2, \ldots$, solve for $V_i^{[k]}$ that satisfies the following Hamiltonian

$$
H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k]}, u_{-ei}^{[k]}\Big) = \frac{\partial V_i^{[k]T}}{\partial z_i}\big(\bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}^{[k]}\big) + \delta_i^T Q_{ii} \delta_i + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} = 0, \tag{13.32}
$$

and update the iterative control law by

$$u_{ei}^{[k+1]} = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}, u_{-ei}^{[k]}\Big). \tag{13.33}$$

From the multi-agent policy iteration (13.30)–(13.33), we can see that $V_i^{[k]}$ is used to approximate $J_i^*$ and $u_{ei}^{[k]}$ is used to approximate $u_{ei}^*$. Therefore, as $k \to \infty$, the algorithm is expected to make $V_i^{[k]}$ and $u_{ei}^{[k]}$ converge to the optimal ones. In the next subsection, we will show such properties of the developed policy iteration algorithm.
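Because the Hamiltonian is quadratic in $u_{ei}$, the minimization in (13.33) can be carried out in closed form: setting its gradient with respect to $u_{ei}$ to zero gives $u_{ei} = -\frac{1}{2}(d_i + \sigma_i) R_{ii}^{-1} B_i^T\, \partial V_i^{[k]} / \partial \delta_i$. The sketch below checks this numerically for a quadratic candidate $V_i = z_i^T P z_i$; the quadratic form, the function names and all numerical values are assumptions made for this illustration and are not taken from the chapter.

```python
import numpy as np

def improved_control(grad_V_delta, B_i, R_ii, d_i, sigma_i):
    """Closed-form minimiser of the Hamiltonian with respect to u_ei."""
    return -0.5 * (d_i + sigma_i) * np.linalg.solve(R_ii, B_i.T @ grad_V_delta)

def hamiltonian(u_ei, u_nb, grad_V, z, A_bar, B_bar, Q_ii, R_ii, R_ij):
    """H_i for one neighbour; z = [delta_i, x_ej], u_bar = [u_ei, u_ej]."""
    n = Q_ii.shape[0]
    delta = z[:n]
    u_bar = np.concatenate([u_ei, u_nb])
    return (grad_V @ (A_bar @ z + B_bar @ u_bar)
            + delta @ Q_ii @ delta
            + u_ei @ R_ii @ u_ei
            + u_nb @ R_ij @ u_nb)

# Hypothetical data: n = m = 2 and a single neighbour j.
rng = np.random.default_rng(0)
n = 2
d_i, sigma_i, rho = 0.1, 0.5, 0.1
B_i = np.array([[2.0, 0.0], [0.0, 1.0]])
B_j = np.array([[2.0, 0.0], [0.0, 2.0]])
A_bar = rng.normal(size=(2 * n, 2 * n))   # drift term; it does not affect the argmin
B_bar = np.block([[(d_i + sigma_i) * B_i, -rho * B_j],
                  [np.zeros((n, n)), B_j]])
M = rng.normal(size=(2 * n, 2 * n))
P = M @ M.T + np.eye(2 * n)               # assumed quadratic value function V = z^T P z
z = rng.normal(size=2 * n)
grad_V = 2.0 * P @ z                      # dV/dz for the quadratic candidate
Q_ii, R_ii, R_ij = np.eye(n), 4.0 * np.eye(n), np.eye(n)
u_nb = rng.normal(size=n)                 # fixed neighbour control u_ej

u_star = improved_control(grad_V[:n], B_i, R_ii, d_i, sigma_i)
H_star = hamiltonian(u_star, u_nb, grad_V, z, A_bar, B_bar, Q_ii, R_ii, R_ij)
```

Any perturbation $e$ of `u_star` raises the Hamiltonian by exactly $e^T R_{ii} e$, which is why $R_{ii} > 0$ guarantees a unique minimizer. Only the $\delta_i$ block of the gradient enters the update, since $u_{ei}$ acts on $z_i$ only through the first block row of $\bar{B}_i$.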

13.4.2 Properties of the Developed Policy Iteration Algorithm

In this subsection, we first present the convergence property of the multi-agent policy iteration algorithm. Initialized by an admissible control law, the monotonicity of the iterative performance index functions is discussed. In [3], the properties of the iterative performance index function are analyzed for the linear agent expressed by $\dot{x}_i = A x_i + B_i u_i$. In this chapter, inspired by [3], the properties of the iterative performance index function are analyzed for $\dot{x}_i = A_i x_i + B_i u_i$.

Lemma 2 (Solution for best iterative control law) Given fixed neighbor policies $u_{-ei} \triangleq \{u_{ej} : j \in N_i\}$, the best iterative control law can be expressed by

$$u_{ei}^{[k+1]} = -\frac{1}{2}(d_i + \sigma_i) R_{ii}^{-1} B_i^T \frac{\partial V_i^{[k]}}{\partial \delta_i}. \tag{13.34}$$

Proof According to (13.32) and (13.33), taking the derivative with respect to $u_{ei}$, we have

$$
\begin{aligned}
u_{ei}^{[k+1]} &= \arg\min_{u_{ei}} \Big\{ \frac{\partial V_i^{[k]T}}{\partial z_i}\big(\bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}\big) + \delta_i^T Q_{ii} \delta_i + u_{ei}^T R_{ii} u_{ei} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} \Big\} \\
&= -\frac{1}{2} R_{ii}^{-1} \Big(\frac{\partial \bar{u}_{ei}}{\partial u_{ei}}\Big)^T
\begin{bmatrix}
(d_i + \sigma_i) B_i^T & 0 & \cdots & 0 \\
-\rho_{ij_1^i} B_{j_1^i}^T & B_{j_1^i}^T & & 0 \\
\vdots & & \ddots & \\
-\rho_{ij_{N_i}^i} B_{j_{N_i}^i}^T & 0 & & B_{j_{N_i}^i}^T
\end{bmatrix}
\begin{bmatrix}
\dfrac{\partial V_i^{[k]}}{\partial \delta_i} \\ \dfrac{\partial V_i^{[k]}}{\partial x_{ej_1^i}} \\ \vdots \\ \dfrac{\partial V_i^{[k]}}{\partial x_{ej_{N_i}^i}}
\end{bmatrix} \\
&= -\frac{1}{2}(d_i + \sigma_i) R_{ii}^{-1} B_i^T \frac{\partial V_i^{[k]}}{\partial \delta_i},
\end{aligned} \tag{13.35}
$$

where $\bar{u}_{ei} = [u_{ei}^T, \tilde{u}_{ei}^{[k]T}]^T$ with the neighbor controls held at their $k$th iterates.

The proof is completed.

Remark 1 It is shown in [3] that the iterative control law and the iterative performance index function are both functions of $\delta_i$. From Lemma 2, we can see that to obtain the iterative control law $u_{ei}^{[k+1]}$, only partial information of the tracking error $z_i$, i.e., $\delta_i$, is necessary. On the other hand, according to (13.9), $\dot{\delta}_i$ is a function of $z_i$. If the iterative control $u_{-ei}$ is given, then we can find an iterative performance index function $V_i^{[k]} = V_i^{[k]}(z_i)$ that satisfies (13.32). Hence, $V_i^{[k]}$ is a function of $z_i$, which is an obvious difference from the one in [3].

Next, inspired by [3], the convergence properties of the iterative performance index functions will be developed by the following theorems.

Theorem 13.2 (Convergence of policy iteration algorithm when only one agent updates its policy and all players $u_{-ei}$ in its neighborhood do not change) Given fixed neighbor policies $u_{-ei}^o$, let the iterative performance index function $V_i^{[k]}$ and iterative control law $u_{ei}^{[k+1]}$ be updated by the policy iteration (13.30)–(13.33). Then, the iterative performance index function $V_i^{[k]}$ is convergent as $k \to \infty$, i.e., $\lim_{k \to \infty} V_i^{[k]} = V_i^\infty$, where $V_i^\infty$ satisfies the following HJ equation

$$\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^{\infty}}{\partial z_i}, u_{ei}, u_{-ei}^o\Big) = 0. \tag{13.36}$$

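Proof Let $\tilde{u}_{ei}^o = [u_{ej_1^i}^{oT}, u_{ej_2^i}^{oT}, \ldots, u_{ej_{N_i}^i}^{oT}]^T$ and $\hat{u}_{ei}^{[k]} = [u_{ei}^{[k]T}, \tilde{u}_{ei}^{oT}]^T$. Taking the derivative of $V_i^{[k]}$ along the trajectory $\bar{A}_i z_i + \bar{B}_i \hat{u}_{ei}^{[k+1]}$, we have

$$
\begin{aligned}
\dot{V}_i^{[k]} &= \frac{\partial V_i^{[k]T}}{\partial z_i}\big(\bar{A}_i z_i + \bar{B}_i \hat{u}_{ei}^{[k+1]}\big) \\
&= H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^o\Big) - \Big(\delta_i^T Q_{ii} \delta_i + u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + \sum_{j \in N_i} u_{ej}^{oT} R_{ij} u_{ej}^o\Big).
\end{aligned} \tag{13.37}
$$

According to (13.32) and (13.33), we can get

$$
\begin{aligned}
H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^o\Big) &= \min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}, u_{-ei}^o\Big) \\
&\le H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k]}, u_{-ei}^o\Big) \\
&= 0.
\end{aligned} \tag{13.38}
$$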

Substituting the iteration index $k + 1$ into the HJ equation (13.32), we have

$$
\begin{aligned}
H_i\Big(z_i, \frac{\partial V_i^{[k+1]}}{\partial z_i}, u_{ei}^{[k+1]}, u_{-ei}^o\Big) &= \frac{\partial V_i^{[k+1]T}}{\partial z_i}\big(\bar{A}_i z_i + \bar{B}_i \hat{u}_{ei}^{[k+1]}\big) + \delta_i^T Q_{ii} \delta_i + u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + \sum_{j \in N_i} u_{ej}^{oT} R_{ij} u_{ej}^o \\
&= 0,
\end{aligned} \tag{13.39}
$$

which means

$$\dot{V}_i^{[k+1]} = -\Big(\delta_i^T Q_{ii} \delta_i + u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + \sum_{j \in N_i} u_{ej}^{oT} R_{ij} u_{ej}^o\Big). \tag{13.40}$$

According to (13.37), (13.38) and (13.40), we can obtain $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$. By integrating both sides of $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$, we can obtain $V_i^{[k+1]} \le V_i^{[k]}$. As $V_i^{[k]}$ is lower bounded, $V_i^{[k]}$ is convergent as $k \to \infty$, i.e., $\lim_{k \to \infty} V_i^{[k]} = V_i^\infty$. According to (13.32) and (13.33), letting $k \to \infty$, we can obtain (13.36). The proof is completed.

From Theorem 13.2, we can see that for $i = 1, 2, \ldots, N$, if the neighbor control laws $u_{-ei}^o$ are fixed, then the iterative performance index functions and iterative control laws converge to the optimum. Next, the convergence property of the iterative performance index functions and iterative control laws when all nodes update their control laws will be developed.

Theorem 13.3 (Convergence of policy iteration algorithm when all agents update their policies) Assume all nodes $i$ update their policies at each iteration of the policy iteration algorithm (13.30)–(13.33). Define $\varsigma(R_{ij} R_{jj}^{-1})$ as the maximum singular value of $R_{ij} R_{jj}^{-1}$. For small $\varsigma(R_{ij} R_{jj}^{-1})$, the iterative performance index function $V_i^{[k]}$ converges to the optimal $J_i^*$, i.e., $\lim_{k \to \infty} V_i^{[k]} = J_i^*$.

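Proof According to (13.33) and (13.35), letting $\breve{u}_{ei}^{[k+1]} = [u_{ei}^{[k+1]T}, \tilde{u}_{ei}^{[k]T}]^T$, we have

$$\frac{dV_i^{[k]}}{dt} = \frac{\partial V_i^{[k]T}}{\partial z_i}\big(\bar{A}_i z_i + \bar{B}_i \breve{u}_{ei}^{[k+1]}\big). \tag{13.41}$$

According to the Hamiltonian (13.32), we have

$$\frac{\partial V_i^{[k]T}}{\partial z_i} \bar{A}_i z_i = -\frac{\partial V_i^{[k]T}}{\partial z_i} \bar{B}_i \bar{u}_{ei}^{[k]} - \delta_i^T Q_{ii} \delta_i - u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} - \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]}. \tag{13.42}$$

According to the Hamiltonian (13.32) for $k + 1$, we can obtain

$$
\begin{aligned}
\dot{V}_i^{[k+1]} - \dot{V}_i^{[k]} &= \frac{\partial V_i^{[k]T}}{\partial z_i} \bar{B}_i \big(\bar{u}_{ei}^{[k]} - \breve{u}_{ei}^{[k+1]}\big) - u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} - \sum_{j \in N_i} u_{ej}^{[k+1]T} R_{ij} u_{ej}^{[k+1]} \\
&\quad + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} \\
&= (d_i + \sigma_i) \frac{\partial V_i^{[k]T}}{\partial \delta_i} B_i \big(u_{ei}^{[k]} - u_{ei}^{[k+1]}\big) - u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} - \sum_{j \in N_i} u_{ej}^{[k+1]T} R_{ij} u_{ej}^{[k+1]} \\
&\quad + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} + \sum_{j \in N_i} u_{ej}^{[k]T} R_{ij} u_{ej}^{[k]} \\
&= 2 u_{ei}^{[k+1]T} R_{ii} \big(u_{ei}^{[k+1]} - u_{ei}^{[k]}\big) - u_{ei}^{[k+1]T} R_{ii} u_{ei}^{[k+1]} + u_{ei}^{[k]T} R_{ii} u_{ei}^{[k]} \\
&\quad + \sum_{j \in N_i} \big(u_{ej}^{[k+1]} - u_{ej}^{[k]}\big)^T R_{ij} \big(u_{ej}^{[k+1]} - u_{ej}^{[k]}\big) - 2 \sum_{j \in N_i} \big(u_{ej}^{[k+1]} - u_{ej}^{[k]}\big)^T R_{ij} u_{ej}^{[k+1]} \\
&= \big(u_{ei}^{[k+1]} - u_{ei}^{[k]}\big)^T R_{ii} \big(u_{ei}^{[k+1]} - u_{ei}^{[k]}\big) + \sum_{j \in N_i} \big(u_{ej}^{[k+1]} - u_{ej}^{[k]}\big)^T R_{ij} \big(u_{ej}^{[k+1]} - u_{ej}^{[k]}\big) \\
&\quad - 2 \sum_{j \in N_i} \big(u_{ej}^{[k+1]} - u_{ej}^{[k]}\big)^T R_{ij} u_{ej}^{[k+1]}.
\end{aligned} \tag{13.43}
$$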

A sufficient condition for $\dot{V}_i^{[k+1]} - \dot{V}_i^{[k]} \ge 0$ is

$$
\begin{aligned}
\Delta u_{ei}^{[k]T} R_{ii}\, \Delta u_{ei}^{[k]} + \sum_{j \in N_i} \Delta u_{ej}^{[k]T} R_{ij}\, \Delta u_{ej}^{[k]} &\ge 2 \sum_{j \in N_i} \Delta u_{ej}^{[k]T} R_{ij} u_{ej}^{[k+1]} \\
&= -\sum_{j \in N_i} (d_j + \sigma_j)\, \Delta u_{ej}^{[k]T} R_{ij} R_{jj}^{-1} B_j^T \frac{\partial V_j^{[k]}}{\partial \delta_j},
\end{aligned} \tag{13.44}
$$

where $\Delta u_{ei}^{[k]} = u_{ei}^{[k+1]} - u_{ei}^{[k]}$ for node $i$ and, analogously, for each of its neighbors $j \in N_i$. Let $\varsigma(R_{ij})$ be the maximum singular value of $R_{ij}$. We can see that for $\forall i$, if $\varsigma(R_{ij} R_{jj}^{-1})$, $\rho_{ij}$, $j \in N_i$, and $\sigma_i$ are small, then inequality (13.44) holds, which means $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$. By integration of $\dot{V}_i^{[k+1]} \ge \dot{V}_i^{[k]}$, we can obtain $V_i^{[k+1]} \le V_i^{[k]}$. Hence, the iterative performance index function $V_i^{[k]}$ is monotonically nonincreasing and lower bounded. As such, $V_i^{[k]}$ is convergent as $k \to \infty$, i.e., $\lim_{k \to \infty} V_i^{[k]} = V_i^\infty$.

It is obvious that $V_i^\infty \ge J_i^*$. On the other hand, let $\{u_{e1}^{[\ell]}, u_{e2}^{[\ell]}, \ldots, u_{eN}^{[\ell]}\}$ be arbitrary admissible control laws. For $\forall i$, there must exist a performance index function $V_i^{[\ell]}$ that satisfies

$$H_i\Big(z_i, \frac{\partial V_i^{[\ell]}}{\partial z_i}, u_{ei}^{[\ell]}, u_{-ei}^{[\ell]}\Big) = 0. \tag{13.45}$$

Let $u_{ei}^{[\ell+1]} = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^{[\ell]}}{\partial z_i}, u_{ei}, u_{-ei}^{[\ell]}\Big)$, and then we have $V_i^\infty \le V_i^{[\ell+1]} \le V_i^{[\ell]}$. As $\{u_{e1}^{[\ell]}, u_{e2}^{[\ell]}, \ldots, u_{eN}^{[\ell]}\}$ are arbitrary, let

$$\{u_{e1}^{[\ell]}, u_{e2}^{[\ell]}, \ldots, u_{eN}^{[\ell]}\} = \{u_{e1}^*, u_{e2}^*, \ldots, u_{eN}^*\}, \tag{13.46}$$

and then we can obtain $V_i^\infty \le J_i^*$. Therefore, we can obtain that $\lim_{k \to \infty} V_i^{[k]} = J_i^*$. The proof is completed.

Remark 2 It is shown in [3] that for the linear multi-agent system $\dot{x}_i = A x_i + B_i u_i$, if the edge weights $\rho_{ij}$ and $\varsigma(R_{ij} R_{jj}^{-1})$ are small, then the iterative performance index function converges to the optimum. From Theorem 13.3, we have that for the multi-agent system (13.1), if $\varsigma(R_{ij} R_{jj}^{-1})$ is small, then the iterative performance index function $V_i^{[k]}$ will also converge to the optimum as $k \to \infty$, while the constraint on $\rho_{ij}$ is omitted.

13.4.3 Heterogeneous Multi-agent Policy Iteration Algorithm

Based on the above preparations, we summarize the heterogeneous multi-agent policy iteration algorithm in Algorithm 8.

Algorithm 8 Heterogeneous multi-agent policy iteration algorithm

Initialization:
Choose randomly an admissible control law $u_{ei}^{[0]}$, $\forall i \in \mathcal{N}$;
Choose a computation precision $\varepsilon$;
Choose a constant $0 < \zeta < 1$;
Choose positive definite matrices $Q_{ii}$, $R_{ii}$, and $R_{ij}$.
Obtain the desired control $u_{di}$, for $\forall i \in \mathcal{N}$, by $u_{di} = B_i^{+}(A_0 - A_i)x_0$.
Transform the agent into (13.10), i.e., $\dot{z}_i = \bar{A}_i z_i + \bar{B}_i \bar{u}_{ei}$.

Iteration:
1: Construct the utility function $U_i(z_i, u_{ei}, u_{-ei})$ by (13.13).
2: Let the iteration index $k = 0$. Construct a performance index function $V_i^{[0]}$ to satisfy (13.30);
3: For $i \in \mathcal{N}$, let $k = k + 1$. Do Policy Improvement
   $u_{ei}^{[k+1]} = \arg\min_{u_{ei}} H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}, u_{-ei}^{[k]}\Big)$;
4: Do Policy Evaluation
   $H_i\Big(z_i, \frac{\partial V_i^{[k]}}{\partial z_i}, u_{ei}^{[k]}, u_{-ei}^{[k]}\Big) = 0$;
5: If $V_i^{[k+1]} \le V_i^{[k]}$, goto next step. Else, let $R_{ij} = \zeta R_{ij}$, $j \in N_i$, and goto Step 2.
6: If $V_i^{[k]} - V_i^{[k+1]} \le \varepsilon$, then the optimal performance index function and optimal control law are obtained. Goto Step 7. Else goto Step 3;
7: return $u_{ei}^{[k]}$ and $V_i^{[k]}$. Let $u_i^* = u_{ei}^{[k]} + u_{di}$.
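The evaluation and improvement steps of Algorithm 8 can be sketched for the simplest setting covered by Theorem 13.2: one agent updates its policy while its neighbor policies are held fixed. The code below is only an illustration under strong assumptions that are not part of the chapter: a single neighbor whose fixed policy is $u_{ej}^o = 0$, a quadratic value function $V_i^{[k]} = z_i^T P_k z_i$ with a linear feedback $u_{ei} = -K z_i$ (so that policy evaluation reduces to a Lyapunov equation), and a dense Kronecker-based Lyapunov solver. The function names are hypothetical, and the step that rescales $R_{ij}$ in Algorithm 8 is not reproduced.

```python
import numpy as np

def lyapunov(A_cl, W):
    """Solve A_cl^T P + P A_cl + W = 0 for P (dense Kronecker solve)."""
    n = A_cl.shape[0]
    I = np.eye(n)
    lhs = np.kron(A_cl.T, I) + np.kron(I, A_cl.T)
    P = np.linalg.solve(lhs, -W.reshape(-1)).reshape(n, n)
    return 0.5 * (P + P.T)

def policy_iteration(A_bar, B1_bar, Q_bar, R_ii, K0, eps=1e-10, max_iter=50):
    """Single-agent policy iteration with u_ei = -K z_i and neighbours at rest.

    B1_bar is the first block column of B_bar_i, i.e. [(d_i + sigma_i) B_i; 0].
    """
    K = K0
    history = []
    for _ in range(max_iter):
        A_cl = A_bar - B1_bar @ K
        P = lyapunov(A_cl, Q_bar + K.T @ R_ii @ K)   # policy evaluation
        K = np.linalg.solve(R_ii, B1_bar.T @ P)      # policy improvement, cf. (13.34)
        history.append(P)
        if len(history) > 1 and np.max(np.abs(history[-2] - P)) <= eps:
            break
    return P, K, history

# Hypothetical example: node i with a single neighbour j.
A_i = np.array([[-1.0, 1.0], [-3.0, -1.0]])
A_j = np.array([[-2.0, 1.0], [-1.0, -1.0]])
B_i = np.array([[2.0, 0.0], [0.0, 1.0]])
rho, sigma_i = 0.1, 0.5
d_i = rho
n = 2
A_bar = np.block([[A_i, rho * (A_i - A_j)], [np.zeros((n, n)), A_j]])
B1_bar = np.vstack([(d_i + sigma_i) * B_i, np.zeros((n, n))])
Q_bar = np.zeros((2 * n, 2 * n))
Q_bar[:n, :n] = np.eye(n)
R_ii = 4.0 * np.eye(n)
K0 = np.zeros((n, 2 * n))    # admissible in this example because A_bar is Hurwitz

P, K, history = policy_iteration(A_bar, B1_bar, Q_bar, R_ii, K0)
```

With $\partial V_i / \partial \delta_i = 2 [P z_i]_{\delta}$, the improvement step is exactly (13.34) written as a gain matrix. In this restricted setting the sequence `history` is monotonically nonincreasing, in line with the statement of Theorem 13.2. When all agents update simultaneously, the neighbor controls depend on states outside $z_i$, so this Lyapunov-based evaluation no longer applies directly.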

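13.5 Simulation Study

In this section, the performance of our iterative ADP algorithm is evaluated by numerical experiments. Consider the five-node strongly connected digraph structure shown in Fig. 13.1 with a leader node connected to node 3. The edge weights are taken as $\rho_{13} = 0.1$, $\rho_{14} = 0.1$, $\rho_{21} = 0.5$, $\rho_{31} = 0.4$, $\rho_{32} = 0.3$, $\rho_{41} = 0.2$, $\rho_{45} = 0.1$, $\rho_{52} = 0.8$, $\rho_{54} = 0.7$, respectively. The pinning gains are taken as $\sigma_1 = 0.5$, $\sigma_2 = 0.3$, $\sigma_3 = 0.2$, $\sigma_4 = 0.1$, and $\sigma_5 = 0.1$, respectively.

Fig. 13.1 The structure of five-node digraph (leader node and nodes 1–5)

For the structure in Fig. 13.1, the dynamics of each node are considered as follows.

Node 1:

$$\begin{bmatrix} \dot{x}_{11} \\ \dot{x}_{12} \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -3 & -1 \end{bmatrix} \begin{bmatrix} x_{11} \\ x_{12} \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_{11} \\ u_{12} \end{bmatrix}, \tag{13.47}$$

Node 2:

$$\begin{bmatrix} \dot{x}_{21} \\ \dot{x}_{22} \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -4 & -1 \end{bmatrix} \begin{bmatrix} x_{21} \\ x_{22} \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} u_{21} \\ u_{22} \end{bmatrix}, \tag{13.48}$$

Node 3:

$$\begin{bmatrix} \dot{x}_{31} \\ \dot{x}_{32} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -1 & -1 \end{bmatrix} \begin{bmatrix} x_{31} \\ x_{32} \end{bmatrix} + \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} u_{31} \\ u_{32} \end{bmatrix}, \tag{13.49}$$

Node 4:

$$\begin{bmatrix} \dot{x}_{41} \\ \dot{x}_{42} \end{bmatrix} = \begin{bmatrix} -1 & 1 \\ -2 & -1 \end{bmatrix} \begin{bmatrix} x_{41} \\ x_{42} \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} u_{41} \\ u_{42} \end{bmatrix}, \tag{13.50}$$

Node 5:

$$\begin{bmatrix} \dot{x}_{51} \\ \dot{x}_{52} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -3 & -1 \end{bmatrix} \begin{bmatrix} x_{51} \\ x_{52} \end{bmatrix} + \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix} \begin{bmatrix} u_{51} \\ u_{52} \end{bmatrix}. \tag{13.51}$$

Fig. 13.2 Tracking errors $x_{ei}$, $i = 1, 2, 3, 4$. a Tracking error $x_{e1}$. b Tracking error $x_{e2}$. c Tracking error $x_{e3}$. d Tracking error $x_{e4}$ (horizontal axes: time steps; vertical axes: tracking error; curves labelled In and Lm for each state component)

Let the desired dynamics be expressed as

$$\begin{bmatrix} \dot{x}_{01} \\ \dot{x}_{02} \end{bmatrix} = \begin{bmatrix} -2 & 1 \\ -4 & -1 \end{bmatrix} \begin{bmatrix} x_{01} \\ x_{02} \end{bmatrix}. \tag{13.52}$$

Define the utility function as in (13.13), where the weight matrices are expressed as

$$
\begin{aligned}
&Q_{11} = Q_{22} = Q_{33} = Q_{44} = Q_{55} = I, \\
&R_{11} = 4I,\ R_{12} = I,\ R_{13} = I,\ R_{14} = I, \\
&R_{21} = I,\ R_{22} = 4I, \\
&R_{31} = I,\ R_{32} = I,\ R_{33} = 5I, \\
&R_{41} = I,\ R_{44} = 9I,\ R_{45} = I, \\
&R_{52} = I,\ R_{54} = I,\ R_{55} = 9I,
\end{aligned} \tag{13.53}
$$

where $I$ denotes the identity matrix with suitable dimensions.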
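The graph data of this example can be assembled directly from the listed edge weights and pinning gains. The sketch below is a setup illustration only (it does not reproduce the simulation results); the variable and function names are hypothetical, and the strong-connectivity test through powers of the reachability matrix is a standard check that is not described in the chapter.

```python
import numpy as np

N = 5
# Edge weights rho_ij (edge from node j to node i) and pinning gains sigma_i.
rho = {(1, 3): 0.1, (1, 4): 0.1, (2, 1): 0.5, (3, 1): 0.4, (3, 2): 0.3,
       (4, 1): 0.2, (4, 5): 0.1, (5, 2): 0.8, (5, 4): 0.7}
sigma = np.array([0.5, 0.3, 0.2, 0.1, 0.1])

adjacency = np.zeros((N, N))
for (i, j), w in rho.items():
    adjacency[i - 1, j - 1] = w          # convert to 0-based indexing

in_degree = adjacency.sum(axis=1)        # d_i = sum_j rho_ij
laplacian = np.diag(in_degree) - adjacency
L_plus_G = laplacian + np.diag(sigma)    # matrix appearing in (13.5)

def is_strongly_connected(adj):
    """True if every node can reach every other node along directed edges."""
    n = adj.shape[0]
    reach = np.linalg.matrix_power(np.eye(n) + (adj > 0), n - 1)
    return bool(np.all(reach > 0))

strongly_connected = is_strongly_connected(adjacency)
control_scaling = in_degree + sigma      # (d_i + sigma_i) multiplying B_i in (13.9)
```

The vector `control_scaling` collects the factors $(d_i + \sigma_i)$ that scale each agent's own input matrix in the tracking error dynamics.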

Fig. 13.3 Tracking error $x_{e5}$ and the control errors $u_{ei}$, $i = 1, 2, 3$. a Tracking errors $x_{e5}$. b Control error $u_{e1}$. c Control error $u_{e2}$. d Control error $u_{e3}$ (horizontal axes: time steps; vertical axes: tracking errors or control errors; curves labelled In and Lm for each component)

Fig. 13.4 Control errors $u_{ei}$, $i = 4, 5$. a Control error $u_{e4}$. b Control error $u_{e5}$ (horizontal axes: time steps; vertical axes: control error; curves labelled In and Lm for each component)

Fig. 13.5 The evolution of the agent states (axes: time steps, $x_{i1}$, $x_{i2}$; curves $x_0$, $x_1$, $x_2$, $x_3$, $x_4$, $x_5$)

Fig. 13.6 The evolution of the agent control (axes: time steps, $u_{i1}$, $u_{i2}$; curves $u_1$, $u_2$, $u_3$, $u_4$, $u_5$)

Let the initial states of each agent in the games be

$$
\begin{aligned}
&x_1(0) = \begin{bmatrix} 1 \\ -1 \end{bmatrix},\quad x_2(0) = \begin{bmatrix} -1 \\ 1 \end{bmatrix},\quad x_3(0) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \\
&x_4(0) = \begin{bmatrix} 2 \\ -2 \end{bmatrix},\quad x_5(0) = \begin{bmatrix} -2 \\ 2 \end{bmatrix},\quad x_0(0) = \begin{bmatrix} 0.5 \\ -0.5 \end{bmatrix}.
\end{aligned} \tag{13.54}
$$

Define $x_{ei} = x_i - x_0$, $i = 1, 2, 3, 4, 5$. The graphical game is implemented as in Algorithm 8 for $k = 20$ iterations. The tracking errors of the agents for nodes 1–5 are shown in Figs. 13.2a–d and 13.3a, respectively. The control errors $u_{ei}$, $i = 1, 2, \ldots, 5$, are shown in Figs. 13.3b–d and 13.4, respectively. From Figs. 13.2, 13.3 and 13.4, we can see that after 20 iterations, the iterative states and iterative control laws have converged to the optimum. Applying the achieved optimal control laws to the agents, the states are shown in Fig. 13.5 and the corresponding optimal control trajectories for each node are shown in Fig. 13.6. From Figs. 13.5 and 13.6, we can see that the states of each node converge to the desired trajectory $x_0$, which verifies the effectiveness of the developed iterative ADP algorithm.

13.6 Conclusion

In this chapter, an effective policy-iteration-based ADP algorithm is developed to solve the optimal coordination control problem for heterogeneous multi-agent differential graphical games. The heterogeneous differential graphical games considered here permit the agent dynamics of each node to differ from those of the other nodes. An optimal cooperative policy iteration algorithm for graphical differential games is developed to obtain the optimal control law for the agent of each node, which guarantees that the dynamics of each node track the desired trajectory. A convergence analysis shows that the performance index functions of the heterogeneous multi-agent differential graphical games converge to the Nash equilibrium. Finally, simulation results show the effectiveness of the developed optimal control scheme.


Index

A
Action neural networks, 207
Adaptive dynamic programming, 4
Algebraic Riccati equation, 1
Anterior cingulate cortex, 63
Approximate/adaptive dynamic programming, v
Approximate dynamic programming, 4

C
Critic neural network, 207

D
Disturbance neural networks, 207

E
Extreme learning machine, 7

H
Hamilton–Jacobi–Bellman, 1
Hamilton–Jacobi–Isaacs, 208

I
Integral reinforcement learning, vi

L
Linear quadratic regulator, 1

N
Neural network, 4, 69
Non-zero-sum, 208, 227

O
Orbitofrontal cortex, 66

P
Policy iteration, 4

R
Recurrent neural network, 63
Reinforcement learning, 3

S
Shunting inhibitory artificial neural network, 63, 64
Single-hidden layer feed-forward network, 7
Support vector machine, 8

U
Uniformly ultimately bounded, 63

V
Value iteration, 4

W
Wheeled inverted pendulum, 83

Z
Zero-sum, 165

© Science Press, Beijing and Springer Nature Singapore Pte Ltd. 2019 R. Song et al., Adaptive Dynamic Programming: Single and Multiple Controllers, Studies in Systems, Decision and Control 166, https://doi.org/10.1007/978-981-13-1712-5
